10 Commits

Author SHA1 Message Date
Jordi Mas
1195359984 Filter out non_speech_tokens in suppressed tokens (#898)
* Filter out non_speech_tokens in suppressed tokens
2024-07-05 14:43:11 +07:00
Oscaarjs
3084409633 Add V3 Support (#578)
* Add V3 Support

* update conversion example

---------

Co-authored-by: oscaarjs <oscar.johansson@conversy.se>
2023-11-24 23:16:12 +01:00
Guillaume Klein
727ab81f31 Improve error message for invalid task and language parameters (#466) 2023-09-12 10:02:23 +02:00
Guillaume Klein
a5d03e55fa Prevent out of range error in method split_tokens_on_unicode (#111) 2023-04-04 10:51:14 +02:00
Guillaume Klein
9fa1989073 Revert "Prevent out of range error in method split_tokens_on_unicode"
This reverts commit 36160c1e7e.
2023-04-04 10:25:41 +02:00
Guillaume Klein
36160c1e7e Prevent out of range error in method split_tokens_on_unicode 2023-04-04 10:17:56 +02:00
Guillaume Klein
39fddba886 Suppress some special tokens when the default set is not used 2023-03-30 12:42:29 +02:00
Guillaume Klein
d82be59d5f Fix unset attribute when using English-only models 2023-03-17 18:33:16 +01:00
Guillaume Klein
8bd013ea99 Add word-level timestamps (#43)
* Add word-level timestamps

* Fix alignment between the segments and the lists of words

* Fix truncated words list when the replacement character is decoded

* Check for empty text_tokens

* Add usage example in the readme

* Update ctranslate2 to 3.9

* Skip empty segment

* Set typing for the new methods
2023-03-15 15:02:28 +01:00
Guillaume Klein
c52adaca90 Create a helper class Tokenizer 2023-03-09 12:53:49 +01:00