7 Commits

Author SHA1 Message Date
Guillaume Klein
a5d03e55fa Prevent out of range error in method split_tokens_on_unicode (#111) 2023-04-04 10:51:14 +02:00
Guillaume Klein
9fa1989073 Revert "Prevent out of range error in method split_tokens_on_unicode"
This reverts commit 36160c1e7e.
2023-04-04 10:25:41 +02:00
Guillaume Klein
36160c1e7e Prevent out of range error in method split_tokens_on_unicode 2023-04-04 10:17:56 +02:00
Guillaume Klein
39fddba886 Suppress some special tokens when the default set is not used 2023-03-30 12:42:29 +02:00
Guillaume Klein
d82be59d5f Fix unset attribute when using English-only models 2023-03-17 18:33:16 +01:00
Guillaume Klein
8bd013ea99 Add word-level timestamps (#43)
* Add word-level timestamps
* Fix alignment between the segments and the lists of words
* Fix truncated words list when the replacement character is decoded
* Check for empty text_tokens
* Add usage example in the readme
* Update ctranslate2 to 3.9
* Skip empty segment
* Set typing for the new methods
2023-03-15 15:02:28 +01:00
Guillaume Klein
c52adaca90 Create a helper class Tokenizer 2023-03-09 12:53:49 +01:00