support distil-whisper (#557)

This commit is contained in:
metame
2024-01-24 17:17:12 +08:00
committed by GitHub
parent 72ff979a2e
commit ad3c83045b
4 changed files with 76 additions and 3 deletions

View File

@@ -8,6 +8,8 @@ This implementation is up to 4 times faster than [openai/whisper](https://github
## Benchmark
### Whisper
For reference, here's the time and memory usage that are required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:
* [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
@@ -36,6 +38,33 @@ For reference, here's the time and memory usage that are required to transcribe
*Executed with 8 threads on a Intel(R) Xeon(R) Gold 6226R.*
### Distil-whisper
| Implementation | Precision | Beam size | Time | Gigaspeech WER |
| --- | --- | --- | --- | --- |
| distil-whisper/distil-large-v2 | fp16 | 4 |- | 10.36 |
| [faster-distil-large-v2](https://huggingface.co/Systran/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
| distil-whisper/distil-medium.en | fp16 | 4 | - | 11.21 |
| [faster-distil-medium.en](https://huggingface.co/Systran/faster-distil-whisper-medium.en) | fp16 | 5 | - | 11.21 |
*Executed with CUDA 11.4 on a NVIDIA 3090.*
<details>
<summary>testing details (click to expand)</summary>
For `distil-whisper/distil-large-v2`, the WER is tested with code sample from [link](https://huggingface.co/distil-whisper/distil-large-v2#evaluation). for `faster-distil-whisper`, the WER is tested with setting:
```python
from faster_whisper import WhisperModel
model_size = "distil-large-v2"
# model_size = "distil-medium.en"
# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
```
</details>
## Requirements
* Python 3.8 or greater
@@ -101,6 +130,8 @@ pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/
## Usage
### Faster-whisper
```python
from faster_whisper import WhisperModel
@@ -128,6 +159,18 @@ for segment in segments:
segments, _ = model.transcribe("audio.mp3")
segments = list(segments) # The transcription will actually run here.
```
### Faster-distil-whisper
For usage of `faster-ditil-whisper`, please refer to: https://github.com/guillaumekln/faster-whisper/issues/533
```python
model_size = "distil-large-v2"
# model_size = "distil-medium.en"
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5,
language="en", max_new_tokens=128, condition_on_previous_text=False)
```
NOTE: emprically, `condition_on_previous_text=True` will degrade the performance of `faster-distil-whisper` for long audio.
### Word-level timestamps