Compare commits
4 Commits
master ... b835bdaaf1

| SHA1 |
|---|
| b835bdaaf1 |
| 9f24e2c735 |
| 9a646b69e6 |
| 49af9564ab |
@@ -7,7 +7,7 @@ Contributions are welcome! Here are some pointers to help you install the librar
 We recommend installing the module in editable mode with the `dev` extra requirements:

 ```bash
-git clone https://github.com/SYSTRAN/faster-whisper.git
+git clone https://github.com/guillaumekln/faster-whisper.git
 cd faster-whisper/
 pip install -e .[dev]
 ```
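For a quick check that the editable install above worked, a minimal smoke test can be run from Python. This is a sketch, not part of the diff; it only relies on the package-level `__version__` attribute (the version module is changed later in this comparison):

```python
# Minimal smoke test after `pip install -e .[dev]`; purely illustrative, not part of the diff.
import faster_whisper
from faster_whisper import WhisperModel

print(faster_whisper.__version__)  # version string of the branch you installed
print(WhisperModel)                # confirms the main entry point imports cleanly
```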
LICENSE (2 changed lines)

@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2023 SYSTRAN
+Copyright (c) 2023 Guillaume Klein

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
README.md (70 changed lines)

@@ -1,4 +1,4 @@
-[](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [](https://badge.fury.io/py/faster-whisper)
+[](https://github.com/guillaumekln/faster-whisper/actions?query=workflow%3ACI) [](https://badge.fury.io/py/faster-whisper)

 # Faster Whisper transcription with CTranslate2

@@ -8,13 +8,11 @@ This implementation is up to 4 times faster than [openai/whisper](https://github

 ## Benchmark

-### Whisper
-
 For reference, here's the time and memory usage that are required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:

 * [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
 * [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[3b010f9](https://github.com/ggerganov/whisper.cpp/commit/3b010f9bed9a6068609e9faf52383aea792b0362)
-* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[cce6b53e](https://github.com/SYSTRAN/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
+* [faster-whisper](https://github.com/guillaumekln/faster-whisper)@[cce6b53e](https://github.com/guillaumekln/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)

 ### Large-v2 model on GPU

@@ -38,33 +36,6 @@ For reference, here's the time and memory usage that are required to transcribe

 *Executed with 8 threads on a Intel(R) Xeon(R) Gold 6226R.*

-
-### Distil-whisper
-
-| Implementation | Precision | Beam size | Time | Gigaspeech WER |
-| --- | --- | --- | --- | --- |
-| distil-whisper/distil-large-v2 | fp16 | 4 |- | 10.36 |
-| [faster-distil-large-v2](https://huggingface.co/Systran/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
-| distil-whisper/distil-medium.en | fp16 | 4 | - | 11.21 |
-| [faster-distil-medium.en](https://huggingface.co/Systran/faster-distil-whisper-medium.en) | fp16 | 5 | - | 11.21 |
-
-*Executed with CUDA 11.4 on a NVIDIA 3090.*
-
-<details>
-<summary>testing details (click to expand)</summary>
-
-For `distil-whisper/distil-large-v2`, the WER is tested with code sample from [link](https://huggingface.co/distil-whisper/distil-large-v2#evaluation). for `faster-distil-whisper`, the WER is tested with setting:
-```python
-from faster_whisper import WhisperModel
-
-model_size = "distil-large-v2"
-# model_size = "distil-medium.en"
-# Run on GPU with FP16
-model = WhisperModel(model_size, device="cuda", compute_type="float16")
-segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
-```
-</details>
-
 ## Requirements

 * Python 3.8 or greater
@@ -117,21 +88,19 @@ pip install faster-whisper
 ### Install the master branch

 ```bash
-pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"
+pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
 ```

 ### Install a specific commit

 ```bash
-pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
+pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
 ```

 </details>

 ## Usage

-### Faster-whisper
-
 ```python
 from faster_whisper import WhisperModel

@@ -159,25 +128,6 @@ for segment in segments:
 segments, _ = model.transcribe("audio.mp3")
 segments = list(segments)  # The transcription will actually run here.
 ```
-### Faster Distil-Whisper
-
-The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
-checkpoint is intrinsically designed to work with the Faster-Whisper transcription algorithm. The following code snippet
-demonstrates how to run inference with distil-large-v3 on a specified audio file:
-
-```python
-from faster_whisper import WhisperModel
-
-model_size = "distil-large-v3"
-
-model = WhisperModel(model_size, device="cuda", compute_type="float16")
-segments, info = model.transcribe("audio.mp3", beam_size=5, language="en", condition_on_previous_text=False)
-
-for segment in segments:
-    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
-```
-
-For more information about the distil-large-v3 model, refer to the original [model card](https://huggingface.co/distil-whisper/distil-large-v3).

 ### Word-level timestamps

@@ -197,7 +147,7 @@ The library integrates the [Silero VAD](https://github.com/snakers4/silero-vad)
 segments, _ = model.transcribe("audio.mp3", vad_filter=True)
 ```

-The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:
+The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:

 ```python
 segments, _ = model.transcribe(
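To make the `vad_parameters` option mentioned in this hunk concrete, a minimal call could look like the sketch below. The threshold value is illustrative; the available keys are the fields of `VadOptions` in `faster_whisper/vad.py`:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Only min_silence_duration_ms is shown here; any field of VadOptions can be passed.
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),  # 500 ms is an illustrative value
)
```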
@@ -220,28 +170,22 @@ logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

 ### Going further

-See more model and transcription options in the [`WhisperModel`](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
+See more model and transcription options in the [`WhisperModel`](https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.

 ## Community integrations

 Here is a non exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!

-
-* [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment
 * [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command line client based on faster-whisper and compatible with the original client from openai/whisper.
 * [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.
 * [whisper-standalone-win](https://github.com/Purfview/whisper-standalone-win) Standalone CLI executables of faster-whisper for Windows, Linux & macOS.
 * [asr-sd-pipeline](https://github.com/hedrergudene/asr-sd-pipeline) provides a scalable, modular, end to end multi-speaker speech to text solution implemented using AzureML pipelines.
 * [Open-Lyrics](https://github.com/zh-plus/Open-Lyrics) is a Python library that transcribes voice files using faster-whisper, and translates/polishes the resulting text into `.lrc` files in the desired language using OpenAI-GPT.
 * [wscribe](https://github.com/geekodour/wscribe) is a flexible transcript generation tool supporting faster-whisper, it can export word level transcript and the exported transcript then can be edited with [wscribe-editor](https://github.com/geekodour/wscribe-editor)
-* [aTrain](https://github.com/BANDAS-Center/aTrain) is a graphical user interface implementation of faster-whisper developed at the BANDAS-Center at the University of Graz for transcription and diarization in Windows ([Windows Store App](https://apps.microsoft.com/detail/atrain/9N15Q44SZNS2)) and Linux.
-* [Whisper-Streaming](https://github.com/ufal/whisper_streaming) implements real-time mode for offline Whisper-like speech-to-text models with faster-whisper as the most recommended back-end. It implements a streaming policy with self-adaptive latency based on the actual source complexity, and demonstrates the state of the art.
-* [WhisperLive](https://github.com/collabora/WhisperLive) is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.
-* [Faster-Whisper-Transcriber](https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber) is a simple but reliable voice transcriber that provides a user-friendly interface.

 ## Model conversion

-When loading a model from its size such as `WhisperModel("large-v3")`, the corresponding CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).
+When loading a model from its size such as `WhisperModel("large-v3")`, the correspondig CTranslate2 model is automatically downloaded from the [Hugging Face Hub](https://huggingface.co/Systran).

 We also provide a script to convert any Whisper models compatible with the Transformers library. They could be the original OpenAI models or user fine-tuned models.

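The "Model conversion" paragraph above refers to the conversion script shipped with the project. The sketch below reaches the same result through CTranslate2's `TransformersConverter` Python API; the model name, output directory, and quantization choice are illustrative, and the exact converter options are an assumption about the CTranslate2 API rather than something stated in this diff:

```python
# Illustrative conversion of a Transformers Whisper checkpoint to CTranslate2 format.
# Assumes ctranslate2 and transformers are installed; names and options are examples only.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "openai/whisper-tiny",          # any Whisper model compatible with Transformers
    copy_files=["tokenizer.json"],  # keep the tokenizer next to the converted weights
)
converter.convert("whisper-tiny-ct2", quantization="float16")
```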
@@ -102,18 +102,3 @@ def _resample_frames(frames, resampler):
     # Add None to flush the resampler.
     for frame in itertools.chain(frames, [None]):
         yield from resampler.resample(frame)
-
-
-def pad_or_trim(array, length: int, *, axis: int = -1):
-    """
-    Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
-    """
-    if array.shape[axis] > length:
-        array = array.take(indices=range(length), axis=axis)
-
-    if array.shape[axis] < length:
-        pad_widths = [(0, 0)] * array.ndim
-        pad_widths[axis] = (0, length - array.shape[axis])
-        array = np.pad(array, pad_widths)
-
-    return array
@@ -142,15 +142,11 @@ class FeatureExtractor:
             data[f] = np.fft.fft(fft_signal, axis=0)[:num_fft_bins]
         return data.T

-    def __call__(self, waveform, padding=True, chunk_length=None):
+    def __call__(self, waveform, padding=True):
         """
         Compute the log-Mel spectrogram of the provided audio, gives similar results
         whisper's original torch implementation with 1e-5 tolerance.
         """
-        if chunk_length is not None:
-            self.n_samples = chunk_length * self.sampling_rate
-            self.nb_max_frames = self.n_samples // self.hop_length
-
         if padding:
             waveform = np.pad(waveform, [(0, self.n_samples)])

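As a quick illustration of what the `chunk_length` branch on the master side of this hunk computes, the sketch below plugs in Whisper's usual feature-extractor constants (16 kHz sampling rate, hop length 160). These constants are assumptions taken from the standard Whisper setup, not from this diff:

```python
# Illustrative arithmetic for the chunk_length branch (constants assumed: Whisper defaults).
sampling_rate = 16000  # Hz
hop_length = 160       # samples between successive feature frames

chunk_length = 30      # seconds per window, the usual Whisper chunk size
n_samples = chunk_length * sampling_rate  # 480_000 samples
nb_max_frames = n_samples // hop_length   # 3_000 feature frames per 30 s window

print(n_samples, nb_max_frames)  # 480000 3000
```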
@@ -11,10 +11,10 @@ import ctranslate2
 import numpy as np
 import tokenizers

-from faster_whisper.audio import decode_audio, pad_or_trim
+from faster_whisper.audio import decode_audio
 from faster_whisper.feature_extractor import FeatureExtractor
 from faster_whisper.tokenizer import _LANGUAGE_CODES, Tokenizer
-from faster_whisper.utils import download_model, format_timestamp, get_end, get_logger
+from faster_whisper.utils import download_model, format_timestamp, get_logger
 from faster_whisper.vad import (
     SpeechTimestampsMap,
     VadOptions,

@@ -66,9 +66,6 @@ class TranscriptionOptions(NamedTuple):
     word_timestamps: bool
     prepend_punctuations: str
     append_punctuations: str
-    max_new_tokens: Optional[int]
-    clip_timestamps: Union[str, List[float]]
-    hallucination_silence_threshold: Optional[float]


 class TranscriptionInfo(NamedTuple):

@@ -216,12 +213,6 @@ class WhisperModel:
         append_punctuations: str = "\"'.。,,!!??::”)]}、",
         vad_filter: bool = False,
         vad_parameters: Optional[Union[dict, VadOptions]] = None,
-        max_new_tokens: Optional[int] = None,
-        chunk_length: Optional[int] = None,
-        clip_timestamps: Union[str, List[float]] = "0",
-        hallucination_silence_threshold: Optional[float] = None,
-        language_detection_threshold: Optional[float] = None,
-        language_detection_segments: int = 1,
     ) -> Tuple[Iterable[Segment], TranscriptionInfo]:
         """Transcribes an input file.

@@ -273,20 +264,6 @@ class WhisperModel:
            https://github.com/snakers4/silero-vad.
          vad_parameters: Dictionary of Silero VAD parameters or VadOptions class (see available
            parameters and default values in the class `VadOptions`).
-          max_new_tokens: Maximum number of new tokens to generate per-chunk. If not set,
-            the maximum will be set by the default max_length.
-          chunk_length: The length of audio segments. If it is not None, it will overwrite the
-            default chunk_length of the FeatureExtractor.
-          clip_timestamps: Union[str, List[float]]
-            Comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
-            process. The last end timestamp defaults to the end of the file.
-            vad_filter will be ignored if clip_timestamps is used.
-          hallucination_silence_threshold: Optional[float]
-            When word_timestamps is True, skip silent periods longer than this threshold
-            (in seconds) when a possible hallucination is detected
-          language_detection_threshold: If the maximum probability of the language tokens is higher
-            than this value, the language is detected.
-          language_detection_segments: Number of segments to consider for the language detection.

         Returns:
           A tuple with:
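The docstring entries removed in this hunk describe arguments that exist on the master side of the comparison and are dropped on the compared branch. A hedged usage sketch against the master-side signature could look like this; the file name and concrete values are only examples:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Master-side options removed by this comparison; values below are illustrative.
segments, info = model.transcribe(
    "audio.mp3",
    clip_timestamps="0,30,60,90",         # only transcribe 0-30 s and 60-90 s
    max_new_tokens=128,                   # cap tokens generated per chunk
    chunk_length=20,                      # override the feature extractor's 30 s window
    hallucination_silence_threshold=2.0,  # only takes effect with word timestamps enabled
    word_timestamps=True,
)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```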
@@ -306,7 +283,7 @@ class WhisperModel:
                 "Processing audio with duration %s", format_timestamp(duration)
             )

-        if vad_filter and clip_timestamps == "0":
+        if vad_filter:
             if vad_parameters is None:
                 vad_parameters = VadOptions()
             elif isinstance(vad_parameters, dict):

@@ -336,7 +313,7 @@ class WhisperModel:
         else:
             speech_chunks = None

-        features = self.feature_extractor(audio, chunk_length=chunk_length)
+        features = self.feature_extractor(audio)

         encoder_output = None
         all_language_probs = None
@@ -346,51 +323,15 @@ class WhisperModel:
                 language = "en"
                 language_probability = 1
             else:
-                if (
-                    language_detection_segments is None
-                    or language_detection_segments < 1
-                ):
-                    language_detection_segments = 1
-                seek = 0
-                detected_language_info = {}
-                content_frames = (
-                    features.shape[-1] - self.feature_extractor.nb_max_frames
-                )
-                while (
-                    seek <= content_frames
-                    and seek
-                    < self.feature_extractor.nb_max_frames * language_detection_segments
-                ):
-                    segment = features[
-                        :, seek : seek + self.feature_extractor.nb_max_frames
-                    ]
-                    encoder_output = self.encode(segment)
-                    # results is a list of tuple[str, float] with language names and
-                    # probabilities.
-                    results = self.model.detect_language(encoder_output)[0]
-                    # Parse language names to strip out markers
-                    all_language_probs = [
-                        (token[2:-2], prob) for (token, prob) in results
-                    ]
-                    # Get top language token and probability
-                    language, language_probability = all_language_probs[0]
-                    if (
-                        language_detection_threshold is None
-                        or language_probability > language_detection_threshold
-                    ):
-                        break
-                    detected_language_info.setdefault(language, []).append(
-                        language_probability
-                    )
-                    seek += segment.shape[-1]
-                else:
-                    # If no language detected for all segments, the majority vote of the highest
-                    # projected languages for all segments is used to determine the language.
-                    language = max(
-                        detected_language_info,
-                        key=lambda lang: len(detected_language_info[lang]),
-                    )
-                    language_probability = max(detected_language_info[language])
+                segment = features[:, : self.feature_extractor.nb_max_frames]
+                encoder_output = self.encode(segment)
+                # results is a list of tuple[str, float] with language names and
+                # probabilities.
+                results = self.model.detect_language(encoder_output)[0]
+                # Parse language names to strip out markers
+                all_language_probs = [(token[2:-2], prob) for (token, prob) in results]
+                # Get top language token and probability
+                language, language_probability = all_language_probs[0]

                 self.logger.info(
                     "Detected language '%s' with probability %.2f",
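On the master side of this comparison, the `language_detection_threshold` and `language_detection_segments` arguments control the multi-window detection loop shown above as removed lines. A hedged example of how a caller might exercise them (values are illustrative; `info.language` and `info.language_probability` come from the `TranscriptionInfo` result type shown earlier in this diff):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Probe up to 4 windows of audio; accept a language once its probability exceeds 0.5.
segments, info = model.transcribe(
    "audio.mp3",
    language_detection_segments=4,      # illustrative value
    language_detection_threshold=0.5,   # illustrative value
)

print(info.language, info.language_probability)
```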
@@ -438,9 +379,6 @@ class WhisperModel:
             word_timestamps=word_timestamps,
             prepend_punctuations=prepend_punctuations,
             append_punctuations=append_punctuations,
-            max_new_tokens=max_new_tokens,
-            clip_timestamps=clip_timestamps,
-            hallucination_silence_threshold=hallucination_silence_threshold,
         )

         segments = self.generate_segments(features, tokenizer, options, encoder_output)

@@ -468,34 +406,10 @@ class WhisperModel:
         encoder_output: Optional[ctranslate2.StorageView] = None,
     ) -> Iterable[Segment]:
         content_frames = features.shape[-1] - self.feature_extractor.nb_max_frames
-        content_duration = float(content_frames * self.feature_extractor.time_per_frame)
-
-        if isinstance(options.clip_timestamps, str):
-            TranscriptionOptions.clip_timestamps = [
-                float(ts)
-                for ts in (
-                    options.clip_timestamps.split(",")
-                    if options.clip_timestamps
-                    else []
-                )
-            ]
-        seek_points: List[int] = [
-            round(ts * self.frames_per_second) for ts in options.clip_timestamps
-        ]
-        if len(seek_points) == 0:
-            seek_points.append(0)
-        if len(seek_points) % 2 == 1:
-            seek_points.append(content_frames)
-        seek_clips: List[Tuple[int, int]] = list(
-            zip(seek_points[::2], seek_points[1::2])
-        )
-
-        punctuation = "\"'“¿([{-\"'.。,,!!??::”)]}、"
-
         idx = 0
-        clip_idx = 0
-        seek = seek_clips[clip_idx][0]
+        seek = 0
         all_tokens = []
+        all_prompt_text = []
         prompt_reset_since = 0

         if options.initial_prompt is not None:
@@ -507,34 +421,13 @@ class WhisperModel:
             all_tokens.extend(options.initial_prompt)

         last_speech_timestamp = 0.0
-        # NOTE: This loop is obscurely flattened to make the diff readable.
-        # A later commit should turn this into a simpler nested loop.
-        # for seek_clip_start, seek_clip_end in seek_clips:
-        #     while seek < seek_clip_end
-        while clip_idx < len(seek_clips):
-            seek_clip_start, seek_clip_end = seek_clips[clip_idx]
-            if seek_clip_end > content_frames:
-                seek_clip_end = content_frames
-            if seek < seek_clip_start:
-                seek = seek_clip_start
-            if seek >= seek_clip_end:
-                clip_idx += 1
-                if clip_idx < len(seek_clips):
-                    seek = seek_clips[clip_idx][0]
-                continue
+        while seek < content_frames:
             time_offset = seek * self.feature_extractor.time_per_frame
-            window_end_time = float(
-                (seek + self.feature_extractor.nb_max_frames)
-                * self.feature_extractor.time_per_frame
-            )
+            segment = features[:, seek : seek + self.feature_extractor.nb_max_frames]
             segment_size = min(
-                self.feature_extractor.nb_max_frames,
-                content_frames - seek,
-                seek_clip_end - seek,
+                self.feature_extractor.nb_max_frames, content_frames - seek
             )
-            segment = features[:, seek : seek + segment_size]
             segment_duration = segment_size * self.feature_extractor.time_per_frame
-            segment = pad_or_trim(segment, self.feature_extractor.nb_max_frames)

             if self.logger.isEnabledFor(logging.DEBUG):
                 self.logger.debug(

@@ -586,33 +479,10 @@ class WhisperModel:
             previous_seek = seek
             current_segments = []

-            # anomalous words are very long/short/improbable
-            def word_anomaly_score(word: dict) -> float:
-                probability = word.get("probability", 0.0)
-                duration = word["end"] - word["start"]
-                score = 0.0
-                if probability < 0.15:
-                    score += 1.0
-                if duration < 0.133:
-                    score += (0.133 - duration) * 15
-                if duration > 2.0:
-                    score += duration - 2.0
-                return score
-
-            def is_segment_anomaly(segment: Optional[dict]) -> bool:
-                if segment is None or not segment["words"]:
-                    return False
-                words = [w for w in segment["words"] if w["word"] not in punctuation]
-                words = words[:8]
-                score = sum(word_anomaly_score(w) for w in words)
-                return score >= 3 or score + 0.01 >= len(words)
-
-            def next_words_segment(segments: List[dict]) -> Optional[dict]:
-                return next((s for s in segments if s["words"]), None)
-
             single_timestamp_ending = (
                 len(tokens) >= 2
-                and tokens[-2] < tokenizer.timestamp_begin <= tokens[-1]
+                and tokens[-2] < tokenizer.timestamp_begin
+                and tokens[-1] >= tokenizer.timestamp_begin
             )

             consecutive_timestamps = [
@@ -695,62 +565,18 @@ class WhisperModel:
                     last_speech_timestamp=last_speech_timestamp,
                 )

-                if not single_timestamp_ending:
-                    last_word_end = get_end(current_segments)
-                    if last_word_end is not None and last_word_end > time_offset:
-                        seek = round(last_word_end * self.frames_per_second)
+                word_end_timestamps = [
+                    w["end"] for s in current_segments for w in s["words"]
+                ]
+                if len(word_end_timestamps) > 0:
+                    last_speech_timestamp = word_end_timestamps[-1]
+                if not single_timestamp_ending and len(word_end_timestamps) > 0:
+                    seek_shift = round(
+                        (word_end_timestamps[-1] - time_offset) * self.frames_per_second
+                    )

-                # skip silence before possible hallucinations
-                if options.hallucination_silence_threshold is not None:
-                    threshold = options.hallucination_silence_threshold
-
-                    # if first segment might be a hallucination, skip leading silence
-                    first_segment = next_words_segment(current_segments)
-                    if first_segment is not None and is_segment_anomaly(first_segment):
-                        gap = first_segment["start"] - time_offset
-                        if gap > threshold:
-                            seek = previous_seek + round(gap * self.frames_per_second)
-                            continue
-
-                    # skip silence before any possible hallucination that is surrounded
-                    # by silence or more hallucinations
-                    hal_last_end = last_speech_timestamp
-                    for si in range(len(current_segments)):
-                        segment = current_segments[si]
-                        if not segment["words"]:
-                            continue
-                        if is_segment_anomaly(segment):
-                            next_segment = next_words_segment(
-                                current_segments[si + 1 :]
-                            )
-                            if next_segment is not None:
-                                hal_next_start = next_segment["words"][0]["start"]
-                            else:
-                                hal_next_start = time_offset + segment_duration
-                            silence_before = (
-                                segment["start"] - hal_last_end > threshold
-                                or segment["start"] < threshold
-                                or segment["start"] - time_offset < 2.0
-                            )
-                            silence_after = (
-                                hal_next_start - segment["end"] > threshold
-                                or is_segment_anomaly(next_segment)
-                                or window_end_time - segment["end"] < 2.0
-                            )
-                            if silence_before and silence_after:
-                                seek = round(
-                                    max(time_offset + 1, segment["start"])
-                                    * self.frames_per_second
-                                )
-                                if content_duration - segment["end"] < threshold:
-                                    seek = content_frames
-                                current_segments[si:] = []
-                                break
-                        hal_last_end = segment["end"]
-
-                last_word_end = get_end(current_segments)
-                if last_word_end is not None:
-                    last_speech_timestamp = last_word_end
+                    if seek_shift > 0:
+                        seek = previous_seek + seek_shift

             for segment in current_segments:
                 tokens = segment["tokens"]
@@ -759,8 +585,16 @@ class WhisperModel:
                 if segment["start"] == segment["end"] or not text.strip():
                     continue

-                all_tokens.extend(tokens)
-                idx += 1
+                check_prompt_num = 1
+                if all(
+                    [
+                        text.strip() != i.strip()
+                        for i in all_prompt_text[-check_prompt_num:]
+                    ]
+                ):
+                    all_tokens.extend(tokens)
+                    all_prompt_text.append(text)
+                    idx += 1

                 yield Segment(
                     id=idx,
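The block added on the head side of this hunk changes how repeated segment text is fed back as prompt context: before extending `all_tokens`, the new text is compared with the most recent entry of `all_prompt_text`, so an exactly repeated segment is not re-used as conditioning. A standalone sketch of that check (the function wrapper and sample strings are illustrative, not part of the diff):

```python
from typing import List


def should_extend_prompt(text: str, all_prompt_text: List[str], check_prompt_num: int = 1) -> bool:
    """Mirror of the check added in the diff: accept the segment text only if it
    differs from the last `check_prompt_num` previously accepted texts."""
    return all(
        text.strip() != previous.strip()
        for previous in all_prompt_text[-check_prompt_num:]
    )


history = ["Thanks for watching."]
print(should_extend_prompt("Thanks for watching.", history))  # False: exact repeat is dropped
print(should_extend_prompt("See you next time.", history))    # True: new text is kept
```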
@@ -817,21 +651,6 @@ class WhisperModel:
         max_initial_timestamp_index = int(
             round(options.max_initial_timestamp / self.time_precision)
         )
-        if options.max_new_tokens is not None:
-            max_length = len(prompt) + options.max_new_tokens
-        else:
-            max_length = self.max_length
-
-        if max_length > self.max_length:
-            raise ValueError(
-                f"The length of the prompt is {len(prompt)}, and the `max_new_tokens` "
-                f"{max_length - len(prompt)}. Thus, the combined length of the prompt "
-                f"and `max_new_tokens` is: {max_length}. This exceeds the "
-                f"`max_length` of the Whisper model: {self.max_length}. "
-                "You should either reduce the length of your prompt, or "
-                "reduce the value of `max_new_tokens`, "
-                f"so that their combined length is less that {self.max_length}."
-            )

         for temperature in options.temperatures:
             if temperature > 0:

@@ -853,7 +672,7 @@ class WhisperModel:
                 length_penalty=options.length_penalty,
                 repetition_penalty=options.repetition_penalty,
                 no_repeat_ngram_size=options.no_repeat_ngram_size,
-                max_length=max_length,
+                max_length=self.max_length,
                 return_scores=True,
                 return_no_speech_prob=True,
                 suppress_blank=options.suppress_blank,

@@ -911,8 +730,6 @@ class WhisperModel:
             if (
                 options.no_speech_threshold is not None
                 and result.no_speech_prob > options.no_speech_threshold
-                and options.log_prob_threshold is not None
-                and avg_logprob < options.log_prob_threshold
             ):
                 needs_fallback = False  # silence

@@ -986,7 +803,6 @@ class WhisperModel:
         word_durations = np.array([word["end"] - word["start"] for word in alignment])
         word_durations = word_durations[word_durations.nonzero()]
         median_duration = np.median(word_durations) if len(word_durations) > 0 else 0.0
-        median_duration = min(0.7, float(median_duration))
         max_duration = median_duration * 2

         # hack: truncate long words at sentence boundaries.
@@ -22,10 +22,6 @@ _MODELS = {
     "large-v2": "Systran/faster-whisper-large-v2",
     "large-v3": "Systran/faster-whisper-large-v3",
     "large": "Systran/faster-whisper-large-v3",
-    "distil-large-v2": "Systran/faster-distil-whisper-large-v2",
-    "distil-medium.en": "Systran/faster-distil-whisper-medium.en",
-    "distil-small.en": "Systran/faster-distil-whisper-small.en",
-    "distil-large-v3": "Systran/faster-distil-whisper-large-v3",
 }


@@ -53,7 +49,7 @@ def download_model(
     """Downloads a CTranslate2 Whisper model from the Hugging Face Hub.

     Args:
-      size_or_id: Size of the model to download from https://huggingface.co/Systran
+      size_or_id: Size of the model to download from https://huggingface.co/guillaumekln
        (tiny, tiny.en, base, base.en, small, small.en medium, medium.en, large-v1, large-v2,
        large-v3, large), or a CTranslate2-converted model ID from the Hugging Face Hub
        (e.g. Systran/faster-whisper-large-v3).

@@ -147,10 +143,3 @@ class disabled_tqdm(tqdm):
     def __init__(self, *args, **kwargs):
         kwargs["disable"] = True
         super().__init__(*args, **kwargs)
-
-
-def get_end(segments: List[dict]) -> Optional[float]:
-    return next(
-        (w["end"] for s in reversed(segments) for w in reversed(s["words"])),
-        segments[-1]["end"] if segments else None,
-    )
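For reference, the `get_end` helper (present on the master side, dropped on the compared branch) returns the end time of the last word across the segment list, falling back to the last segment's own end time when no words are present. A small illustrative driver with made-up sample data:

```python
from typing import List, Optional


def get_end(segments: List[dict]) -> Optional[float]:
    # Same body as the helper removed in the hunk above.
    return next(
        (w["end"] for s in reversed(segments) for w in reversed(s["words"])),
        segments[-1]["end"] if segments else None,
    )


segments = [
    {"end": 4.0, "words": [{"end": 1.2}, {"end": 3.7}]},
    {"end": 6.5, "words": []},
]
print(get_end(segments))  # 3.7: end of the last word found, scanning from the back
print(get_end([]))        # None: no segments at all
```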
@@ -1,3 +1,3 @@
 """Version information."""

-__version__ = "1.0.1"
+__version__ = "0.10.0"

@@ -1,5 +1,5 @@
-av==11.*
-ctranslate2>=4.0,<5
+av==10.*
+ctranslate2>=3.22,<4
 huggingface_hub>=0.13
 tokenizers>=0.13,<0.16
 onnxruntime>=1.14,<2
setup.py (2 changed lines)

@@ -37,7 +37,7 @@ setup(
     long_description=get_long_description(),
     long_description_content_type="text/markdown",
     author="Guillaume Klein",
-    url="https://github.com/SYSTRAN/faster-whisper",
+    url="https://github.com/guillaumekln/faster-whisper",
     classifiers=[
         "Development Status :: 4 - Beta",
         "Intended Audience :: Developers",