Compare commits: master...b835bdaaf1 (4 commits)

Commits in this comparison:

- b835bdaaf1
- 9f24e2c735
- 9a646b69e6
- 49af9564ab
CONTRIBUTING.md

````diff
@@ -7,7 +7,7 @@ Contributions are welcome! Here are some pointers to help you install the librar
 We recommend installing the module in editable mode with the `dev` extra requirements:
 
 ```bash
-git clone https://github.com/SYSTRAN/faster-whisper.git
+git clone https://github.com/guillaumekln/faster-whisper.git
 cd faster-whisper/
 pip install -e .[dev]
 ```
````
LICENSE (2 changed lines)

````diff
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2023 SYSTRAN
+Copyright (c) 2023 Guillaume Klein
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
````
README.md (70 changed lines)

````diff
@@ -1,4 +1,4 @@
-[![CI](https://github.com/SYSTRAN/faster-whisper/workflows/CI/badge.svg)](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)
+[![CI](https://github.com/guillaumekln/faster-whisper/workflows/CI/badge.svg)](https://github.com/guillaumekln/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)
 
 # Faster Whisper transcription with CTranslate2
 
````
````diff
@@ -8,13 +8,11 @@ This implementation is up to 4 times faster than [openai/whisper](https://github
 
 ## Benchmark
 
-### Whisper
-
 For reference, here's the time and memory usage that are required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:
 
 * [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
 * [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[3b010f9](https://github.com/ggerganov/whisper.cpp/commit/3b010f9bed9a6068609e9faf52383aea792b0362)
-* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[cce6b53e](https://github.com/SYSTRAN/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
+* [faster-whisper](https://github.com/guillaumekln/faster-whisper)@[cce6b53e](https://github.com/guillaumekln/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
 
 ### Large-v2 model on GPU
 
````
````diff
@@ -38,33 +36,6 @@ For reference, here's the time and memory usage that are required to transcribe
 
 *Executed with 8 threads on a Intel(R) Xeon(R) Gold 6226R.*
 
-
-### Distil-whisper
-
-| Implementation | Precision | Beam size | Time | Gigaspeech WER |
-| --- | --- | --- | --- | --- |
-| distil-whisper/distil-large-v2 | fp16 | 4 |- | 10.36 |
-| [faster-distil-large-v2](https://huggingface.co/Systran/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
-| distil-whisper/distil-medium.en | fp16 | 4 | - | 11.21 |
-| [faster-distil-medium.en](https://huggingface.co/Systran/faster-distil-whisper-medium.en) | fp16 | 5 | - | 11.21 |
-
-*Executed with CUDA 11.4 on a NVIDIA 3090.*
-
-<details>
-<summary>testing details (click to expand)</summary>
-
-For `distil-whisper/distil-large-v2`, the WER is tested with code sample from [link](https://huggingface.co/distil-whisper/distil-large-v2#evaluation). for `faster-distil-whisper`, the WER is tested with setting:
-```python
-from faster_whisper import WhisperModel
-
-model_size = "distil-large-v2"
-# model_size = "distil-medium.en"
-# Run on GPU with FP16
-model = WhisperModel(model_size, device="cuda", compute_type="float16")
-segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
-```
-</details>
-
 ## Requirements
 
 * Python 3.8 or greater
````
````diff
@@ -117,21 +88,19 @@ pip install faster-whisper
 ### Install the master branch
 
 ```bash
-pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"
+pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
 ```
 
 ### Install a specific commit
 
 ```bash
-pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
+pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
 ```
 
 </details>
 
 ## Usage
 
-### Faster-whisper
-
 ```python
 from faster_whisper import WhisperModel
 
````
````diff
@@ -159,25 +128,6 @@ for segment in segments:
 segments, _ = model.transcribe("audio.mp3")
 segments = list(segments)  # The transcription will actually run here.
 ```
-### Faster Distil-Whisper
-
-The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
-checkpoint is intrinsically designed to work with the Faster-Whisper transcription algorithm. The following code snippet
-demonstrates how to run inference with distil-large-v3 on a specified audio file:
-
-```python
-from faster_whisper import WhisperModel
-
-model_size = "distil-large-v3"
-
-model = WhisperModel(model_size, device="cuda", compute_type="float16")
-segments, info = model.transcribe("audio.mp3", beam_size=5, language="en", condition_on_previous_text=False)
-
-for segment in segments:
-    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
-```
-
-For more information about the distil-large-v3 model, refer to the original [model card](https://huggingface.co/distil-whisper/distil-large-v3).
 
 ### Word-level timestamps
 
````
````diff
@@ -197,7 +147,7 @@ The library integrates the [Silero VAD](https://github.com/snakers4/silero-vad)
 segments, _ = model.transcribe("audio.mp3", vad_filter=True)
 ```
 
-The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:
+The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:
 
 ```python
 segments, _ = model.transcribe(
````
````diff
@@ -220,28 +170,22 @@ logging.getLogger("faster_whisper").setLevel(logging.DEBUG)
 
 ### Going further
 
-See more model and transcription options in the [`WhisperModel`](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
+See more model and transcription options in the [`WhisperModel`](https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
 
 ## Community integrations
 
 Here is a non exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!
 
 * [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment
 * [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command line client based on faster-whisper and compatible with the original client from openai/whisper.
 * [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.
 * [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) Standalone CLI executables of faster-whisper for Windows, Linux & macOS.
 * [asr-sd-pipeline](https://github.com/hedrergudene/asr-sd-pipeline) provides a scalable, modular, end to end multi-speaker speech to text solution implemented using AzureML pipelines.
 * [Open-Lyrics](https://github.com/zh-plus/Open-Lyrics) is a Python library that transcribes voice files using faster-whisper, and translates/polishes the resulting text into `.lrc` files in the desired language using OpenAI-GPT.
 * [wscribe](https://github.com/geekodour/wscribe) is a flexible transcript generation tool supporting faster-whisper, it can export word level transcript and the exported transcript then can be edited with [wscribe-editor](https://github.com/geekodour/wscribe-editor)
 * [aTrain](https://github.com/BANDAS-Center/aTrain) is a graphical user interface implementation of faster-whisper developed at the BANDAS-Center at the University of Graz for transcription and diarization in Windows ([Windows Store App](https://apps.microsoft.com/detail/atrain/9N15Q44SZNS2)) and Linux.
 * [Whisper-Streaming](https://github.com/ufal/whisper_streaming) implements real-time mode for offline Whisper-like speech-to-text models with faster-whisper as the most recommended back-end. It implements a streaming policy with self-adaptive latency based on the actual source complexity, and demonstrates the state of the art.
 * [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.
 * [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.
 
 ## Model conversion
 
-When loading a model from its size such as `WhisperModel("large-v3")`, the corresponding CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).
+When loading a model from its size such as `WhisperModel("large-v3")`, the correspondig CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).
 
 We also provide a script to convert any Whisper models compatible with the Transformers library. They could be the original OpenAI models or user fine-tuned models.
````
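The README documents this conversion through the `ct2-transformers-converter` command line; the same step can also be driven from Python. A rough sketch, assuming CTranslate2's `TransformersConverter` API (the model name, copied files, and quantization below are illustrative, not taken from this diff):

```python
# A sketch of the conversion step described above, assuming CTranslate2's
# Python converter API (ctranslate2.converters.TransformersConverter).
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "openai/whisper-large-v2",      # any Transformers-compatible Whisper model
    copy_files=["tokenizer.json"],  # keep tokenizer files next to the weights
)
converter.convert("whisper-large-v2-ct2", quantization="float16")
```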
faster_whisper/audio.py

````diff
@@ -102,18 +102,3 @@ def _resample_frames(frames, resampler):
     # Add None to flush the resampler.
     for frame in itertools.chain(frames, [None]):
         yield from resampler.resample(frame)
-
-
-def pad_or_trim(array, length: int, *, axis: int = -1):
-    """
-    Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
-    """
-    if array.shape[axis] > length:
-        array = array.take(indices=range(length), axis=axis)
-
-    if array.shape[axis] < length:
-        pad_widths = [(0, 0)] * array.ndim
-        pad_widths[axis] = (0, length - array.shape[axis])
-        array = np.pad(array, pad_widths)
-
-    return array
````
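The helper deleted here (added on the 1.0.x side to support variable `chunk_length`) pads or trims a feature array along one axis. Restated as a standalone sketch, with illustrative shapes:

```python
import numpy as np

def pad_or_trim(array, length, *, axis=-1):
    # Same logic as the helper removed above.
    if array.shape[axis] > length:
        array = array.take(indices=range(length), axis=axis)
    if array.shape[axis] < length:
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - array.shape[axis])
        array = np.pad(array, pad_widths)
    return array

mel = np.ones((80, 1200), dtype=np.float32)
print(pad_or_trim(mel, 3000).shape)  # (80, 3000): short input is zero-padded
print(pad_or_trim(mel, 500).shape)   # (80, 500): long input is trimmed
```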
faster_whisper/feature_extractor.py

````diff
@@ -142,15 +142,11 @@ class FeatureExtractor:
             data[f] = np.fft.fft(fft_signal, axis=0)[:num_fft_bins]
         return data.T
 
-    def __call__(self, waveform, padding=True, chunk_length=None):
+    def __call__(self, waveform, padding=True):
         """
         Compute the log-Mel spectrogram of the provided audio, gives similar results
         whisper's original torch implementation with 1e-5 tolerance.
         """
-        if chunk_length is not None:
-            self.n_samples = chunk_length * self.sampling_rate
-            self.nb_max_frames = self.n_samples // self.hop_length
-
         if padding:
             waveform = np.pad(waveform, [(0, self.n_samples)])
 
````
faster_whisper/transcribe.py

````diff
@@ -11,10 +11,10 @@ import ctranslate2
 import numpy as np
 import tokenizers
 
-from faster_whisper.audio import decode_audio, pad_or_trim
+from faster_whisper.audio import decode_audio
 from faster_whisper.feature_extractor import FeatureExtractor
 from faster_whisper.tokenizer import _LANGUAGE_CODES, Tokenizer
-from faster_whisper.utils import download_model, format_timestamp, get_end, get_logger
+from faster_whisper.utils import download_model, format_timestamp, get_logger
 from faster_whisper.vad import (
     SpeechTimestampsMap,
     VadOptions,
````
````diff
@@ -66,9 +66,6 @@ class TranscriptionOptions(NamedTuple):
     word_timestamps: bool
     prepend_punctuations: str
     append_punctuations: str
-    max_new_tokens: Optional[int]
-    clip_timestamps: Union[str, List[float]]
-    hallucination_silence_threshold: Optional[float]
 
 
 class TranscriptionInfo(NamedTuple):
````
````diff
@@ -216,12 +213,6 @@ class WhisperModel:
         append_punctuations: str = "\"'.。,,!!??::”)]}、",
         vad_filter: bool = False,
         vad_parameters: Optional[Union[dict, VadOptions]] = None,
-        max_new_tokens: Optional[int] = None,
-        chunk_length: Optional[int] = None,
-        clip_timestamps: Union[str, List[float]] = "0",
-        hallucination_silence_threshold: Optional[float] = None,
-        language_detection_threshold: Optional[float] = None,
-        language_detection_segments: int = 1,
     ) -> Tuple[Iterable[Segment], TranscriptionInfo]:
         """Transcribes an input file.
 
````
````diff
@@ -273,20 +264,6 @@
            https://github.com/snakers4/silero-vad.
          vad_parameters: Dictionary of Silero VAD parameters or VadOptions class (see available
            parameters and default values in the class `VadOptions`).
-          max_new_tokens: Maximum number of new tokens to generate per-chunk. If not set,
-            the maximum will be set by the default max_length.
-          chunk_length: The length of audio segments. If it is not None, it will overwrite the
-            default chunk_length of the FeatureExtractor.
-          clip_timestamps: Union[str, List[float]]
-            Comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
-            process. The last end timestamp defaults to the end of the file.
-            vad_filter will be ignored if clip_timestamps is used.
-          hallucination_silence_threshold: Optional[float]
-            When word_timestamps is True, skip silent periods longer than this threshold
-            (in seconds) when a possible hallucination is detected
-          language_detection_threshold: If the maximum probability of the language tokens is higher
-            than this value, the language is detected.
-          language_detection_segments: Number of segments to consider for the language detection.
 
         Returns:
           A tuple with:
````
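For orientation, the options deleted from this docstring are used together like this on the 1.0.x side being reverted (a sketch based only on the parameter descriptions above; the audio path and values are illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Decode only 0s-30s and 60s-90s, cap generation at 128 new tokens per chunk,
# use 25-second feature windows instead of the default, and skip long silences
# around suspected hallucinations. vad_filter would be ignored here because
# clip_timestamps is set (see the removed docstring above).
segments, info = model.transcribe(
    "audio.mp3",
    clip_timestamps="0,30,60,90",
    max_new_tokens=128,
    chunk_length=25,
    word_timestamps=True,
    hallucination_silence_threshold=2.0,
)
```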
````diff
@@ -306,7 +283,7 @@
             "Processing audio with duration %s", format_timestamp(duration)
         )
 
-        if vad_filter and clip_timestamps == "0":
+        if vad_filter:
             if vad_parameters is None:
                 vad_parameters = VadOptions()
             elif isinstance(vad_parameters, dict):
````
````diff
@@ -336,7 +313,7 @@
         else:
             speech_chunks = None
 
-        features = self.feature_extractor(audio, chunk_length=chunk_length)
+        features = self.feature_extractor(audio)
 
         encoder_output = None
         all_language_probs = None
````
````diff
@@ -346,51 +323,15 @@
                 language = "en"
                 language_probability = 1
             else:
-                if (
-                    language_detection_segments is None
-                    or language_detection_segments < 1
-                ):
-                    language_detection_segments = 1
-                seek = 0
-                detected_language_info = {}
-                content_frames = (
-                    features.shape[-1] - self.feature_extractor.nb_max_frames
-                )
-                while (
-                    seek <= content_frames
-                    and seek
-                    < self.feature_extractor.nb_max_frames * language_detection_segments
-                ):
-                    segment = features[
-                        :, seek : seek + self.feature_extractor.nb_max_frames
-                    ]
-                    encoder_output = self.encode(segment)
-                    # results is a list of tuple[str, float] with language names and
-                    # probabilities.
-                    results = self.model.detect_language(encoder_output)[0]
-                    # Parse language names to strip out markers
-                    all_language_probs = [
-                        (token[2:-2], prob) for (token, prob) in results
-                    ]
-                    # Get top language token and probability
-                    language, language_probability = all_language_probs[0]
-                    if (
-                        language_detection_threshold is None
-                        or language_probability > language_detection_threshold
-                    ):
-                        break
-                    detected_language_info.setdefault(language, []).append(
-                        language_probability
-                    )
-                    seek += segment.shape[-1]
-                else:
-                    # If no language detected for all segments, the majority vote of the highest
-                    # projected languages for all segments is used to determine the language.
-                    language = max(
-                        detected_language_info,
-                        key=lambda lang: len(detected_language_info[lang]),
-                    )
-                    language_probability = max(detected_language_info[language])
+                segment = features[:, : self.feature_extractor.nb_max_frames]
+                encoder_output = self.encode(segment)
+                # results is a list of tuple[str, float] with language names and
+                # probabilities.
+                results = self.model.detect_language(encoder_output)[0]
+                # Parse language names to strip out markers
+                all_language_probs = [(token[2:-2], prob) for (token, prob) in results]
+                # Get top language token and probability
+                language, language_probability = all_language_probs[0]
 
                 self.logger.info(
                     "Detected language '%s' with probability %.2f",
````
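The block removed here is driven by the two `language_detection_*` options deleted earlier from `transcribe()`. A sketch of how the 1.0.x side exposes it (model size and thresholds are illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3")

# Probe up to three 30-second windows, stopping early once the top language
# probability exceeds 0.8; otherwise the majority vote across the probed
# windows wins, as in the removed loop above.
segments, info = model.transcribe(
    "audio.mp3",
    language_detection_segments=3,
    language_detection_threshold=0.8,
)
print(info.language, info.language_probability)
```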
````diff
@@ -438,9 +379,6 @@
             word_timestamps=word_timestamps,
             prepend_punctuations=prepend_punctuations,
             append_punctuations=append_punctuations,
-            max_new_tokens=max_new_tokens,
-            clip_timestamps=clip_timestamps,
-            hallucination_silence_threshold=hallucination_silence_threshold,
         )
 
         segments = self.generate_segments(features, tokenizer, options, encoder_output)
````
````diff
@@ -468,34 +406,10 @@
         encoder_output: Optional[ctranslate2.StorageView] = None,
     ) -> Iterable[Segment]:
         content_frames = features.shape[-1] - self.feature_extractor.nb_max_frames
-        content_duration = float(content_frames * self.feature_extractor.time_per_frame)
-
-        if isinstance(options.clip_timestamps, str):
-            TranscriptionOptions.clip_timestamps = [
-                float(ts)
-                for ts in (
-                    options.clip_timestamps.split(",")
-                    if options.clip_timestamps
-                    else []
-                )
-            ]
-        seek_points: List[int] = [
-            round(ts * self.frames_per_second) for ts in options.clip_timestamps
-        ]
-        if len(seek_points) == 0:
-            seek_points.append(0)
-        if len(seek_points) % 2 == 1:
-            seek_points.append(content_frames)
-        seek_clips: List[Tuple[int, int]] = list(
-            zip(seek_points[::2], seek_points[1::2])
-        )
-
-        punctuation = "\"'“¿([{-\"'.。,,!!??::”)]}、"
-
         idx = 0
-        clip_idx = 0
-        seek = seek_clips[clip_idx][0]
+        seek = 0
         all_tokens = []
+        all_prompt_text = []
         prompt_reset_since = 0
 
         if options.initial_prompt is not None:
````
````diff
@@ -507,34 +421,13 @@
             all_tokens.extend(options.initial_prompt)
 
         last_speech_timestamp = 0.0
-        # NOTE: This loop is obscurely flattened to make the diff readable.
-        # A later commit should turn this into a simpler nested loop.
-        # for seek_clip_start, seek_clip_end in seek_clips:
-        #    while seek < seek_clip_end
-        while clip_idx < len(seek_clips):
-            seek_clip_start, seek_clip_end = seek_clips[clip_idx]
-            if seek_clip_end > content_frames:
-                seek_clip_end = content_frames
-            if seek < seek_clip_start:
-                seek = seek_clip_start
-            if seek >= seek_clip_end:
-                clip_idx += 1
-                if clip_idx < len(seek_clips):
-                    seek = seek_clips[clip_idx][0]
-                continue
+        while seek < content_frames:
             time_offset = seek * self.feature_extractor.time_per_frame
-            window_end_time = float(
-                (seek + self.feature_extractor.nb_max_frames)
-                * self.feature_extractor.time_per_frame
-            )
+            segment = features[:, seek : seek + self.feature_extractor.nb_max_frames]
             segment_size = min(
-                self.feature_extractor.nb_max_frames,
-                content_frames - seek,
-                seek_clip_end - seek,
+                self.feature_extractor.nb_max_frames, content_frames - seek
             )
-            segment = features[:, seek : seek + segment_size]
             segment_duration = segment_size * self.feature_extractor.time_per_frame
-            segment = pad_or_trim(segment, self.feature_extractor.nb_max_frames)
 
             if self.logger.isEnabledFor(logging.DEBUG):
                 self.logger.debug(
````
````diff
@@ -586,33 +479,10 @@
             previous_seek = seek
             current_segments = []
 
-            # anomalous words are very long/short/improbable
-            def word_anomaly_score(word: dict) -> float:
-                probability = word.get("probability", 0.0)
-                duration = word["end"] - word["start"]
-                score = 0.0
-                if probability < 0.15:
-                    score += 1.0
-                if duration < 0.133:
-                    score += (0.133 - duration) * 15
-                if duration > 2.0:
-                    score += duration - 2.0
-                return score
-
-            def is_segment_anomaly(segment: Optional[dict]) -> bool:
-                if segment is None or not segment["words"]:
-                    return False
-                words = [w for w in segment["words"] if w["word"] not in punctuation]
-                words = words[:8]
-                score = sum(word_anomaly_score(w) for w in words)
-                return score >= 3 or score + 0.01 >= len(words)
-
-            def next_words_segment(segments: List[dict]) -> Optional[dict]:
-                return next((s for s in segments if s["words"]), None)
-
             single_timestamp_ending = (
                 len(tokens) >= 2
-                and tokens[-2] < tokenizer.timestamp_begin <= tokens[-1]
+                and tokens[-2] < tokenizer.timestamp_begin
+                and tokens[-1] >= tokenizer.timestamp_begin
             )
 
             consecutive_timestamps = [
````
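As a plain-Python illustration of the scoring rule deleted here, restated outside the class (the sample word dict is made up):

```python
def word_anomaly_score(word: dict) -> float:
    # Same rule as the removed nested helper: penalize improbable,
    # very short, or very long words.
    probability = word.get("probability", 0.0)
    duration = word["end"] - word["start"]
    score = 0.0
    if probability < 0.15:
        score += 1.0
    if duration < 0.133:
        score += (0.133 - duration) * 15
    if duration > 2.0:
        score += duration - 2.0
    return score

print(word_anomaly_score({"word": " um", "probability": 0.05, "start": 10.0, "end": 10.05}))
# 1.0 + (0.133 - 0.05) * 15 = 2.245 -> counts toward the >= 3 segment threshold
```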
````diff
@@ -695,62 +565,18 @@
                         last_speech_timestamp=last_speech_timestamp,
                     )
 
-                if not single_timestamp_ending:
-                    last_word_end = get_end(current_segments)
-                    if last_word_end is not None and last_word_end > time_offset:
-                        seek = round(last_word_end * self.frames_per_second)
+                word_end_timestamps = [
+                    w["end"] for s in current_segments for w in s["words"]
+                ]
+                if len(word_end_timestamps) > 0:
+                    last_speech_timestamp = word_end_timestamps[-1]
+                if not single_timestamp_ending and len(word_end_timestamps) > 0:
+                    seek_shift = round(
+                        (word_end_timestamps[-1] - time_offset) * self.frames_per_second
+                    )
 
-                # skip silence before possible hallucinations
-                if options.hallucination_silence_threshold is not None:
-                    threshold = options.hallucination_silence_threshold
-
-                    # if first segment might be a hallucination, skip leading silence
-                    first_segment = next_words_segment(current_segments)
-                    if first_segment is not None and is_segment_anomaly(first_segment):
-                        gap = first_segment["start"] - time_offset
-                        if gap > threshold:
-                            seek = previous_seek + round(gap * self.frames_per_second)
-                            continue
-
-                    # skip silence before any possible hallucination that is surrounded
-                    # by silence or more hallucinations
-                    hal_last_end = last_speech_timestamp
-                    for si in range(len(current_segments)):
-                        segment = current_segments[si]
-                        if not segment["words"]:
-                            continue
-                        if is_segment_anomaly(segment):
-                            next_segment = next_words_segment(
-                                current_segments[si + 1 :]
-                            )
-                            if next_segment is not None:
-                                hal_next_start = next_segment["words"][0]["start"]
-                            else:
-                                hal_next_start = time_offset + segment_duration
-                            silence_before = (
-                                segment["start"] - hal_last_end > threshold
-                                or segment["start"] < threshold
-                                or segment["start"] - time_offset < 2.0
-                            )
-                            silence_after = (
-                                hal_next_start - segment["end"] > threshold
-                                or is_segment_anomaly(next_segment)
-                                or window_end_time - segment["end"] < 2.0
-                            )
-                            if silence_before and silence_after:
-                                seek = round(
-                                    max(time_offset + 1, segment["start"])
-                                    * self.frames_per_second
-                                )
-                                if content_duration - segment["end"] < threshold:
-                                    seek = content_frames
-                                current_segments[si:] = []
-                                break
-                            hal_last_end = segment["end"]
-
-                last_word_end = get_end(current_segments)
-                if last_word_end is not None:
-                    last_speech_timestamp = last_word_end
+                    if seek_shift > 0:
+                        seek = previous_seek + seek_shift
 
             for segment in current_segments:
                 tokens = segment["tokens"]
````
````diff
@@ -759,8 +585,16 @@
                 if segment["start"] == segment["end"] or not text.strip():
                     continue
 
-                all_tokens.extend(tokens)
-                idx += 1
+                check_prompt_num = 1
+                if all(
+                    [
+                        text.strip() != i.strip()
+                        for i in all_prompt_text[-check_prompt_num:]
+                    ]
+                ):
+                    all_tokens.extend(tokens)
+                    all_prompt_text.append(text)
+                    idx += 1
 
                 yield Segment(
                     id=idx,
````
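The addition above is this branch's anti-repetition patch: a segment's text only joins the running prompt history (`all_tokens`/`all_prompt_text`) when it differs from the last `check_prompt_num` texts, so an immediately repeated segment no longer conditions the next window. The effect, reduced to plain lists (values illustrative):

```python
all_prompt_text = []
check_prompt_num = 1

for text in [" Thanks for watching.", " Thanks for watching.", " Goodbye."]:
    # Same condition as the added code: skip exact repeats of the last entry.
    if all(text.strip() != i.strip() for i in all_prompt_text[-check_prompt_num:]):
        all_prompt_text.append(text)

print(all_prompt_text)  # [' Thanks for watching.', ' Goodbye.'] - the repeat is dropped
```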
````diff
@@ -817,21 +651,6 @@
         max_initial_timestamp_index = int(
             round(options.max_initial_timestamp / self.time_precision)
         )
-        if options.max_new_tokens is not None:
-            max_length = len(prompt) + options.max_new_tokens
-        else:
-            max_length = self.max_length
-
-        if max_length > self.max_length:
-            raise ValueError(
-                f"The length of the prompt is {len(prompt)}, and the `max_new_tokens` "
-                f"{max_length - len(prompt)}. Thus, the combined length of the prompt "
-                f"and `max_new_tokens` is: {max_length}. This exceeds the "
-                f"`max_length` of the Whisper model: {self.max_length}. "
-                "You should either reduce the length of your prompt, or "
-                "reduce the value of `max_new_tokens`, "
-                f"so that their combined length is less that {self.max_length}."
-            )
 
         for temperature in options.temperatures:
             if temperature > 0:
````
````diff
@@ -853,7 +672,7 @@
                 length_penalty=options.length_penalty,
                 repetition_penalty=options.repetition_penalty,
                 no_repeat_ngram_size=options.no_repeat_ngram_size,
-                max_length=max_length,
+                max_length=self.max_length,
                 return_scores=True,
                 return_no_speech_prob=True,
                 suppress_blank=options.suppress_blank,
````
````diff
@@ -911,8 +730,6 @@
             if (
                 options.no_speech_threshold is not None
                 and result.no_speech_prob > options.no_speech_threshold
-                and options.log_prob_threshold is not None
-                and avg_logprob < options.log_prob_threshold
             ):
                 needs_fallback = False  # silence
 
````
````diff
@@ -986,7 +803,6 @@
         word_durations = np.array([word["end"] - word["start"] for word in alignment])
         word_durations = word_durations[word_durations.nonzero()]
         median_duration = np.median(word_durations) if len(word_durations) > 0 else 0.0
-        median_duration = min(0.7, float(median_duration))
         max_duration = median_duration * 2
 
         # hack: truncate long words at sentence boundaries.
````
faster_whisper/utils.py

````diff
@@ -22,10 +22,6 @@ _MODELS = {
     "large-v2": "Systran/faster-whisper-large-v2",
     "large-v3": "Systran/faster-whisper-large-v3",
     "large": "Systran/faster-whisper-large-v3",
-    "distil-large-v2": "Systran/faster-distil-whisper-large-v2",
-    "distil-medium.en": "Systran/faster-distil-whisper-medium.en",
-    "distil-small.en": "Systran/faster-distil-whisper-small.en",
-    "distil-large-v3": "Systran/faster-distil-whisper-large-v3",
 }
 
 
````
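With the `distil-*` aliases dropped from `_MODELS`, short names like `WhisperModel("distil-large-v3")` stop resolving on this branch. Per the `download_model` docstring in the next hunk, a converted model can still be fetched by its full Hub ID; a sketch:

```python
from faster_whisper import WhisperModel

# The alias table no longer knows "distil-large-v2", but the CTranslate2
# conversion can still be pulled by its Hugging Face repo ID.
model = WhisperModel("Systran/faster-distil-whisper-large-v2")
```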
````diff
@@ -53,7 +49,7 @@ def download_model(
     """Downloads a CTranslate2 Whisper model from the Hugging Face Hub.
 
     Args:
-      size_or_id: Size of the model to download from https://huggingface.co/Systran
+      size_or_id: Size of the model to download from https://huggingface.co/guillaumekln
        (tiny, tiny.en, base, base.en, small, small.en medium, medium.en, large-v1, large-v2,
        large-v3, large), or a CTranslate2-converted model ID from the Hugging Face Hub
        (e.g. Systran/faster-whisper-large-v3).
````
````diff
@@ -147,10 +143,3 @@ class disabled_tqdm(tqdm):
     def __init__(self, *args, **kwargs):
         kwargs["disable"] = True
         super().__init__(*args, **kwargs)
-
-
-def get_end(segments: List[dict]) -> Optional[float]:
-    return next(
-        (w["end"] for s in reversed(segments) for w in reversed(s["words"])),
-        segments[-1]["end"] if segments else None,
-    )
````
faster_whisper/version.py

````diff
@@ -1,3 +1,3 @@
 """Version information."""
 
-__version__ = "1.0.1"
+__version__ = "0.10.0"
````
requirements.txt

````diff
@@ -1,5 +1,5 @@
-av==11.*
-ctranslate2>=4.0,<5
+av==10.*
+ctranslate2>=3.22,<4
 huggingface_hub>=0.13
 tokenizers>=0.13,<0.16
 onnxruntime>=1.14,<2
````
setup.py (2 changed lines)

````diff
@@ -37,7 +37,7 @@ setup(
     long_description=get_long_description(),
     long_description_content_type="text/markdown",
     author="Guillaume Klein",
-    url="https://github.com/SYSTRAN/faster-whisper",
+    url="https://github.com/guillaumekln/faster-whisper",
     classifiers=[
         "Development Status :: 4 - Beta",
         "Intended Audience :: Developers",
````