Compare commits


17 Commits

Commit  Message  [Author, Date]

4b64ef1f70  Merge branch 'master' into prompt  [2024-02-23 10:52:53 +08:00]
06d32bf0c1  Bump version to 1.0.0 (#696)  [trungkienbkhn, 2024-02-22 09:49:01 +01:00]
30d6043e90  Prevent infinite loop for out-of-bound timestamps in clip_timestamps (#697)  [Purfview, 2024-02-22 09:48:35 +01:00]
            Same as https://github.com/openai/whisper/pull/2005
22c75d0cc3  Update README.md (#672)  [BBC-Esq, 2024-02-21 10:18:11 +01:00]
            Add Faster-Whisper-Transcriber to community integrations.
092067208b  Add clip_timestamps and hallucination_silence_threshold options (#646)  [trungkienbkhn, 2024-02-20 17:34:54 +01:00]
6ffcbdfbc2  Fix typos in README.md (#668)  [Jordi Mas, 2024-02-20 17:33:17 +01:00]
52695567c9  Bumps up PyAV version to support Python 3.12.x (#679)  [Purfview, 2024-02-20 17:31:07 +01:00]
c6b28ed3a0  Update README.md (#685)  [IlianP, 2024-02-20 17:28:00 +01:00]
            I'm surprised that WhisperX hasn't made it into this list yet, as it has more stars than faster-whisper itself 🚀
4ab646035f  Upgrade ctranslate2 version to support CUDA 12 (#694)  [trungkienbkhn, 2024-02-20 17:26:55 +01:00]
d04e685ca2  Merge branch 'master' into prompt  [2024-02-19 17:31:58 +08:00]
f144e4c83d  Expands the note for distil-whisper (#659)  [Purfview, 2024-01-28 21:48:40 +01:00]
3aec421849  Add: More clarity of what "max_new_tokens" does (#658)  [Purfview, 2024-01-28 21:40:33 +01:00]
            * Add: More clarity of what "max_new_tokens" does
64b9f244bd  Whisper-Streaming mention (#656)  [Dominik Macháček, 2024-01-25 18:27:27 +01:00]
            under community integrations
00efce1696  Bugfix: Illogical "Avoid computing higher temperatures on no_speech" (#652)  [Purfview, 2024-01-24 11:54:43 +01:00]
ad3c83045b  support distil-whisper (#557)  [metame, 2024-01-24 10:17:12 +01:00]
72ff979a2e  Add GUI faster-whisper project README.md (#554)  [Jürgen Fleiß, 2024-01-18 13:01:02 +01:00]
            Added aTrain GUI faster-whisper transcription and diarization tool as community project.
            Co-authored-by: JuergenFleiss <118339672+Juergen-J-F@users.noreply.github.com>
615de0d2d9  add WhisperLive to community integration (#647)  [makaveli, 2024-01-18 12:54:14 +01:00]
6 changed files with 247 additions and 26 deletions

README.md

@@ -8,6 +8,8 @@ This implementation is up to 4 times faster than [openai/whisper](https://github
## Benchmark
### Whisper
For reference, here's the time and memory usage that are required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:
* [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
@@ -36,6 +38,33 @@ For reference, here's the time and memory usage that are required to transcribe
*Executed with 8 threads on an Intel(R) Xeon(R) Gold 6226R.*
### Distil-whisper
| Implementation | Precision | Beam size | Time | Gigaspeech WER |
| --- | --- | --- | --- | --- |
| distil-whisper/distil-large-v2 | fp16 | 4 | - | 10.36 |
| [faster-distil-large-v2](https://huggingface.co/Systran/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
| distil-whisper/distil-medium.en | fp16 | 4 | - | 11.21 |
| [faster-distil-medium.en](https://huggingface.co/Systran/faster-distil-whisper-medium.en) | fp16 | 5 | - | 11.21 |
*Executed with CUDA 11.4 on an NVIDIA 3090.*
<details>
<summary>testing details (click to expand)</summary>
For `distil-whisper/distil-large-v2`, the WER is tested with the code sample from [link](https://huggingface.co/distil-whisper/distil-large-v2#evaluation). For `faster-distil-whisper`, the WER is tested with the following setting:
```python
from faster_whisper import WhisperModel
model_size = "distil-large-v2"
# model_size = "distil-medium.en"
# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
```
</details>
## Requirements
* Python 3.8 or greater
@@ -101,6 +130,8 @@ pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/
## Usage
### Faster-whisper
```python
from faster_whisper import WhisperModel
@@ -128,6 +159,18 @@ for segment in segments:
segments, _ = model.transcribe("audio.mp3")
segments = list(segments) # The transcription will actually run here.
```
### Faster-distil-whisper
For usage of `faster-distil-whisper`, please refer to: https://github.com/guillaumekln/faster-whisper/issues/533
```python
model_size = "distil-large-v2"
# model_size = "distil-medium.en"
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5,
language="en", max_new_tokens=128, condition_on_previous_text=False)
```
NOTE: Empirically, `condition_on_previous_text=True` will degrade the performance of `faster-distil-whisper` for long audio. Degradation on the first chunk was observed with `initial_prompt` too.
### Word-level timestamps
@@ -176,16 +219,22 @@ See more model and transcription options in the [`WhisperModel`](https://github.
Here is a non-exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!
* [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment
* [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command line client based on faster-whisper and compatible with the original client from openai/whisper.
* [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.
* [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) Standalone CLI executables of faster-whisper for Windows, Linux & macOS.
* [asr-sd-pipeline](https://github.com/hedrergudene/asr-sd-pipeline) provides a scalable, modular, end to end multi-speaker speech to text solution implemented using AzureML pipelines.
* [Open-Lyrics](https://github.com/zh-plus/Open-Lyrics) is a Python library that transcribes voice files using faster-whisper, and translates/polishes the resulting text into `.lrc` files in the desired language using OpenAI-GPT.
* [wscribe](https://github.com/geekodour/wscribe) is a flexible transcript generation tool supporting faster-whisper; it can export word-level transcripts, which can then be edited with [wscribe-editor](https://github.com/geekodour/wscribe-editor)
* [aTrain](https://github.com/BANDAS-Center/aTrain) is a graphical user interface implementation of faster-whisper developed at the BANDAS-Center at the University of Graz for transcription and diarization in Windows ([Windows Store App](https://apps.microsoft.com/detail/atrain/9N15Q44SZNS2)) and Linux.
* [Whisper-Streaming](https://github.com/ufal/whisper_streaming) implements real-time mode for offline Whisper-like speech-to-text models with faster-whisper as the most recommended back-end. It implements a streaming policy with self-adaptive latency based on the actual source complexity, and demonstrates the state of the art.
* [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.
* [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.
## Model conversion
When loading a model from its size such as `WhisperModel("large-v3")`, the correspondig CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).
When loading a model from its size such as `WhisperModel("large-v3")`, the corresponding CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).
We also provide a script to convert any Whisper models compatible with the Transformers library. They could be the original OpenAI models or user fine-tuned models.
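The conversion itself goes through CTranslate2's Transformers converter (the `ct2-transformers-converter` CLI is the command-line equivalent). A minimal sketch of the Python API, with the model id, output directory, and quantization chosen as placeholders and `transformers[torch]` assumed installed:

```python
from ctranslate2.converters import TransformersConverter

# Convert a Transformers-format Whisper checkpoint (original or fine-tuned) into a
# CTranslate2 model directory that WhisperModel can load by path.
converter = TransformersConverter(
    "openai/whisper-large-v3",  # any Transformers-compatible Whisper checkpoint
    copy_files=["tokenizer.json", "preprocessor_config.json"],  # keep tokenizer/feature config next to the weights
)
converter.convert("whisper-large-v3-ct2", quantization="float16")
```

The resulting directory can then be loaded directly, e.g. `WhisperModel("whisper-large-v3-ct2", device="cuda", compute_type="float16")`.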

faster_whisper/feature_extractor.py

@@ -142,11 +142,15 @@ class FeatureExtractor:
data[f] = np.fft.fft(fft_signal, axis=0)[:num_fft_bins]
return data.T
def __call__(self, waveform, padding=True):
def __call__(self, waveform, padding=True, chunk_length=None):
"""
Compute the log-Mel spectrogram of the provided audio, giving results similar to
whisper's original torch implementation with 1e-5 tolerance.
"""
if chunk_length is not None:
self.n_samples = chunk_length * self.sampling_rate
self.nb_max_frames = self.n_samples // self.hop_length
if padding:
waveform = np.pad(waveform, [(0, self.n_samples)])
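For context, the override just rescales the analysis window. A short sketch of the arithmetic, assuming the extractor's default 16 kHz sampling rate and hop length of 160 (values not shown in this hunk):

```python
sampling_rate = 16000  # assumed FeatureExtractor default
hop_length = 160       # assumed FeatureExtractor default

chunk_length = 15                          # seconds, e.g. transcribe(..., chunk_length=15)
n_samples = chunk_length * sampling_rate   # 240000 samples per window
nb_max_frames = n_samples // hop_length    # 1500 mel frames (the default 30 s window gives 3000)
```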

faster_whisper/transcribe.py

@@ -14,7 +14,7 @@ import tokenizers
from faster_whisper.audio import decode_audio
from faster_whisper.feature_extractor import FeatureExtractor
from faster_whisper.tokenizer import _LANGUAGE_CODES, Tokenizer
from faster_whisper.utils import download_model, format_timestamp, get_logger
from faster_whisper.utils import download_model, format_timestamp, get_end, get_logger
from faster_whisper.vad import (
SpeechTimestampsMap,
VadOptions,
@@ -66,6 +66,9 @@ class TranscriptionOptions(NamedTuple):
word_timestamps: bool
prepend_punctuations: str
append_punctuations: str
max_new_tokens: Optional[int]
clip_timestamps: Union[str, List[float]]
hallucination_silence_threshold: Optional[float]
class TranscriptionInfo(NamedTuple):
@@ -213,6 +216,10 @@ class WhisperModel:
append_punctuations: str = "\"'.。,!?::”)]}、",
vad_filter: bool = False,
vad_parameters: Optional[Union[dict, VadOptions]] = None,
max_new_tokens: Optional[int] = None,
chunk_length: Optional[int] = None,
clip_timestamps: Union[str, List[float]] = "0",
hallucination_silence_threshold: Optional[float] = None,
) -> Tuple[Iterable[Segment], TranscriptionInfo]:
"""Transcribes an input file.
@@ -264,6 +271,16 @@ class WhisperModel:
https://github.com/snakers4/silero-vad.
vad_parameters: Dictionary of Silero VAD parameters or VadOptions class (see available
parameters and default values in the class `VadOptions`).
max_new_tokens: Maximum number of new tokens to generate per-chunk. If not set,
the maximum will be set by the default max_length.
chunk_length: The length of audio segments. If it is not None, it will overwrite the
default chunk_length of the FeatureExtractor.
clip_timestamps: Union[str, List[float]]
Comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
process. The last end timestamp defaults to the end of the file.
hallucination_silence_threshold: Optional[float]
When word_timestamps is True, skip silent periods longer than this threshold
(in seconds) when a possible hallucination is detected
Returns:
A tuple with:
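Taken together, the options added in this hunk can be exercised from the public API roughly as follows (a sketch; the file name and values are placeholders):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe only the 0-30 s and 60-90 s clips, cap decoding at 128 new tokens per
# window, shrink the window to 15 s, and skip long silences around suspected
# hallucinations (word_timestamps is required for that last option to take effect).
segments, info = model.transcribe(
    "audio.mp3",
    clip_timestamps="0,30,60,90",
    max_new_tokens=128,
    chunk_length=15,
    hallucination_silence_threshold=2.0,
    word_timestamps=True,
)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```

`clip_timestamps` also accepts a plain list of floats, e.g. `[0.0, 30.0, 60.0, 90.0]`, per its `Union[str, List[float]]` type.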
@@ -313,7 +330,7 @@ class WhisperModel:
else:
speech_chunks = None
features = self.feature_extractor(audio)
features = self.feature_extractor(audio, chunk_length=chunk_length)
encoder_output = None
all_language_probs = None
@@ -379,6 +396,9 @@ class WhisperModel:
word_timestamps=word_timestamps,
prepend_punctuations=prepend_punctuations,
append_punctuations=append_punctuations,
max_new_tokens=max_new_tokens,
clip_timestamps=clip_timestamps,
hallucination_silence_threshold=hallucination_silence_threshold,
)
segments = self.generate_segments(features, tokenizer, options, encoder_output)
@@ -406,8 +426,33 @@ class WhisperModel:
encoder_output: Optional[ctranslate2.StorageView] = None,
) -> Iterable[Segment]:
content_frames = features.shape[-1] - self.feature_extractor.nb_max_frames
content_duration = float(content_frames * self.feature_extractor.time_per_frame)
if isinstance(options.clip_timestamps, str):
TranscriptionOptions.clip_timestamps = [
float(ts)
for ts in (
options.clip_timestamps.split(",")
if options.clip_timestamps
else []
)
]
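# NOTE: assigning the parsed list to the class attribute shadows the NamedTuple field,
# so subsequent reads of options.clip_timestamps below see the float list, not the string.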
seek_points: List[int] = [
round(ts * self.frames_per_second) for ts in options.clip_timestamps
]
if len(seek_points) == 0:
seek_points.append(0)
if len(seek_points) % 2 == 1:
seek_points.append(content_frames)
seek_clips: List[Tuple[int, int]] = list(
zip(seek_points[::2], seek_points[1::2])
)
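# Illustrative example, assuming frames_per_second == 100: clip_timestamps="0,30,60"
# gives seek_points [0, 3000, 6000]; the odd count is padded with content_frames,
# so seek_clips becomes [(0, 3000), (6000, content_frames)].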
punctuation = "\"'“¿([{-\"'.。,!?::”)]}、"
idx = 0
seek = 0
clip_idx = 0
seek = seek_clips[clip_idx][0]
all_tokens = []
all_prompt_text = []
prompt_reset_since = 0
@@ -421,12 +466,32 @@ class WhisperModel:
all_tokens.extend(options.initial_prompt)
last_speech_timestamp = 0.0
while seek < content_frames:
# NOTE: This loop is obscurely flattened to make the diff readable.
# A later commit should turn this into a simpler nested loop.
# for seek_clip_start, seek_clip_end in seek_clips:
# while seek < seek_clip_end
while clip_idx < len(seek_clips):
seek_clip_start, seek_clip_end = seek_clips[clip_idx]
if seek_clip_end > content_frames:
seek_clip_end = content_frames
if seek < seek_clip_start:
seek = seek_clip_start
if seek >= seek_clip_end:
clip_idx += 1
if clip_idx < len(seek_clips):
seek = seek_clips[clip_idx][0]
continue
time_offset = seek * self.feature_extractor.time_per_frame
segment = features[:, seek : seek + self.feature_extractor.nb_max_frames]
segment_size = min(
self.feature_extractor.nb_max_frames, content_frames - seek
window_end_time = float(
(seek + self.feature_extractor.nb_max_frames)
* self.feature_extractor.time_per_frame
)
segment_size = min(
self.feature_extractor.nb_max_frames,
content_frames - seek,
seek_clip_end - seek,
)
segment = features[:, seek : seek + segment_size]
segment_duration = segment_size * self.feature_extractor.time_per_frame
if self.logger.isEnabledFor(logging.DEBUG):
@@ -479,10 +544,33 @@ class WhisperModel:
previous_seek = seek
current_segments = []
# anomalous words are very long/short/improbable
def word_anomaly_score(word: dict) -> float:
probability = word.get("probability", 0.0)
duration = word["end"] - word["start"]
score = 0.0
if probability < 0.15:
score += 1.0
if duration < 0.133:
score += (0.133 - duration) * 15
if duration > 2.0:
score += duration - 2.0
return score
def is_segment_anomaly(segment: Optional[dict]) -> bool:
if segment is None or not segment["words"]:
return False
words = [w for w in segment["words"] if w["word"] not in punctuation]
words = words[:8]
score = sum(word_anomaly_score(w) for w in words)
return score >= 3 or score + 0.01 >= len(words)
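# Example: a word with probability 0.10 and duration 0.05 s scores
# 1.0 + (0.133 - 0.05) * 15 = 2.245, so two such words among the segment's first
# eight already push the total past the threshold of 3.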
def next_words_segment(segments: List[dict]) -> Optional[dict]:
return next((s for s in segments if s["words"]), None)
single_timestamp_ending = (
len(tokens) >= 2
and tokens[-2] < tokenizer.timestamp_begin
and tokens[-1] >= tokenizer.timestamp_begin
and tokens[-2] < tokenizer.timestamp_begin <= tokens[-1]
)
consecutive_timestamps = [
@@ -565,18 +653,70 @@ class WhisperModel:
last_speech_timestamp=last_speech_timestamp,
)
word_end_timestamps = [
w["end"] for s in current_segments for w in s["words"]
]
if len(word_end_timestamps) > 0:
last_speech_timestamp = word_end_timestamps[-1]
if not single_timestamp_ending and len(word_end_timestamps) > 0:
seek_shift = round(
(word_end_timestamps[-1] - time_offset) * self.frames_per_second
)
if not single_timestamp_ending:
last_word_end = get_end(current_segments)
if last_word_end is not None and last_word_end > time_offset:
seek = round(last_word_end * self.frames_per_second)
if seek_shift > 0:
seek = previous_seek + seek_shift
# skip silence before possible hallucinations
if options.hallucination_silence_threshold is not None:
threshold = options.hallucination_silence_threshold
if not single_timestamp_ending:
last_word_end = get_end(current_segments)
if last_word_end is not None and last_word_end > time_offset:
remaining_duration = window_end_time - last_word_end
if remaining_duration > threshold:
seek = round(last_word_end * self.frames_per_second)
else:
seek = previous_seek + segment_size
# if first segment might be a hallucination, skip leading silence
first_segment = next_words_segment(current_segments)
if first_segment is not None and is_segment_anomaly(first_segment):
gap = first_segment["start"] - time_offset
if gap > threshold:
seek = previous_seek + round(gap * self.frames_per_second)
continue
# skip silence before any possible hallucination that is surrounded
# by silence or more hallucinations
hal_last_end = last_speech_timestamp
for si in range(len(current_segments)):
segment = current_segments[si]
if not segment["words"]:
continue
if is_segment_anomaly(segment):
next_segment = next_words_segment(
current_segments[si + 1 :]
)
if next_segment is not None:
hal_next_start = next_segment["words"][0]["start"]
else:
hal_next_start = time_offset + segment_duration
silence_before = (
segment["start"] - hal_last_end > threshold
or segment["start"] < threshold
or segment["start"] - time_offset < 2.0
)
silence_after = (
hal_next_start - segment["end"] > threshold
or is_segment_anomaly(next_segment)
or window_end_time - segment["end"] < 2.0
)
if silence_before and silence_after:
seek = round(
max(time_offset + 1, segment["start"])
* self.frames_per_second
)
if content_duration - segment["end"] < threshold:
seek = content_frames
current_segments[si:] = []
break
hal_last_end = segment["end"]
last_word_end = get_end(current_segments)
if last_word_end is not None:
last_speech_timestamp = last_word_end
for segment in current_segments:
tokens = segment["tokens"]
@@ -651,6 +791,21 @@ class WhisperModel:
max_initial_timestamp_index = int(
round(options.max_initial_timestamp / self.time_precision)
)
if options.max_new_tokens is not None:
max_length = len(prompt) + options.max_new_tokens
else:
max_length = self.max_length
if max_length > self.max_length:
raise ValueError(
f"The length of the prompt is {len(prompt)}, and the `max_new_tokens` "
f"{max_length - len(prompt)}. Thus, the combined length of the prompt "
f"and `max_new_tokens` is: {max_length}. This exceeds the "
f"`max_length` of the Whisper model: {self.max_length}. "
"You should either reduce the length of your prompt, or "
"reduce the value of `max_new_tokens`, "
f"so that their combined length is less that {self.max_length}."
)
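# Example: with Whisper's decoder limit of typically 448 tokens (self.max_length), a
# 300-token prompt plus max_new_tokens=200 requests 500 tokens and triggers this error;
# trimming the prompt or lowering max_new_tokens to 148 keeps it within the limit.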
for temperature in options.temperatures:
if temperature > 0:
@@ -672,7 +827,7 @@ class WhisperModel:
length_penalty=options.length_penalty,
repetition_penalty=options.repetition_penalty,
no_repeat_ngram_size=options.no_repeat_ngram_size,
max_length=self.max_length,
max_length=max_length,
return_scores=True,
return_no_speech_prob=True,
suppress_blank=options.suppress_blank,
@@ -730,6 +885,8 @@ class WhisperModel:
if (
options.no_speech_threshold is not None
and result.no_speech_prob > options.no_speech_threshold
and options.log_prob_threshold is not None
and avg_logprob < options.log_prob_threshold
):
needs_fallback = False # silence
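# With this change a window counts as silence (skipping higher-temperature retries)
# only when the decoder both predicts no speech AND produces a poor average
# log-probability, rather than on a high no_speech_prob alone.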
@@ -803,6 +960,7 @@ class WhisperModel:
word_durations = np.array([word["end"] - word["start"] for word in alignment])
word_durations = word_durations[word_durations.nonzero()]
median_duration = np.median(word_durations) if len(word_durations) > 0 else 0.0
median_duration = min(0.7, float(median_duration))
max_duration = median_duration * 2
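# Capping the median word duration at 0.7 s bounds max_duration at 1.4 s, so the
# long-word truncation below stays effective even when the window contains only a
# few unusually long words.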
# hack: truncate long words at sentence boundaries.

faster_whisper/utils.py

@@ -22,6 +22,9 @@ _MODELS = {
"large-v2": "Systran/faster-whisper-large-v2",
"large-v3": "Systran/faster-whisper-large-v3",
"large": "Systran/faster-whisper-large-v3",
"distil-large-v2": "Systran/faster-distil-whisper-large-v2",
"distil-medium.en": "Systran/faster-distil-whisper-medium.en",
"distil-small.en": "Systran/faster-distil-whisper-small.en",
}
@@ -143,3 +146,10 @@ class disabled_tqdm(tqdm):
def __init__(self, *args, **kwargs):
kwargs["disable"] = True
super().__init__(*args, **kwargs)
def get_end(segments: List[dict]) -> Optional[float]:
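# Returns the end time of the last word across the given segments, falling back to the
# last segment's end time when no words are present (or None if segments is empty).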
return next(
(w["end"] for s in reversed(segments) for w in reversed(s["words"])),
segments[-1]["end"] if segments else None,
)

faster_whisper/version.py

@@ -1,3 +1,3 @@
"""Version information."""
__version__ = "0.10.0"
__version__ = "1.0.0"

requirements.txt

@@ -1,5 +1,5 @@
av==10.*
ctranslate2>=3.22,<4
av==11.*
ctranslate2>=4.0,<5
huggingface_hub>=0.13
tokenizers>=0.13,<0.16
onnxruntime>=1.14,<2