Compare commits
15 Commits
- 460ad77a2f
- 890da4f4ac
- 4784bd53a2
- bd2c6b95cf
- 0e46bd91d4
- 99272b230f
- 3c01a76405
- 3401c59c4b
- 76b32bc9c4
- 4a5ba38f5e
- 8ae81a124d
- 0faaf0f301
- fab1ec9d03
- 71bde08b17
- a53a2fb80e
.gitignore (vendored, 1 change)
@@ -1 +1,2 @@
 /venv
+/.git
Dockerfile
@@ -1,4 +1,4 @@
-FROM docker.io/nvidia/cuda:12.0.0-cudnn8-runtime-ubuntu22.04
+FROM nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
 
 RUN apt-get update && \
     apt-get install -y ffmpeg python3 python3-pip git && \
README.md (23 changes)
@@ -5,21 +5,36 @@ Whisper-FastAPI is a very simple Python FastAPI interface for konele and OpenAI
 ## Features
 
 - **Translation and Transcription**: The application provides an API for the konele service, where translations and transcriptions can be obtained by connecting over websockets or POST requests.
 - **Language Support**: If the target language is English, the application will translate any source language to English.
 - **Language Support**: If no language is specified, the language will be automatically detected from the first 30 seconds of audio.
 - **Websocket and POST Method Support**: The project supports a websocket (`/konele/ws`) and a POST method to `/konele/post`.
 - **Audio Transcriptions**: The `/v1/audio/transcriptions` endpoint allows users to upload an audio file and receive the transcription in response, with an optional `response_format` parameter. The `response_format` can be 'json', 'text', 'tsv', 'srt', or 'vtt'.
 - **Simplified Chinese**: Traditional Chinese is automatically converted to Simplified Chinese for konele using the `opencc` library.
 
+## GPT Refine Result
+
+You can choose to use an OpenAI GPT model for post-processing transcription results. You can also provide context to GPT so it can correct the text based on that context.
+
+Set the environment variables `OPENAI_BASE_URL=https://api.openai.com/v1` and `OPENAI_API_KEY=your-sk` to enable this feature.
+
+When the client sends a request with `gpt_refine=True`, this feature is activated. Specifically:
+
+- For `/v1/audio/transcriptions`, submit using `curl <api_url> -F file=audio.mp4 -F gpt_refine=True`.
+- For `/v1/konele/ws` and `/v1/konele/post`, append `/gpt_refine` to the path, e.g. `/v1/konele/ws/gpt_refine`.
+
+The default model is `gpt-4o-mini`; override it with the `OPENAI_LLM_MODEL` environment variable.
+
+You can easily edit the LLM prompt in the code to better fit your workflow. It's just a few lines of code. Give it a try, it's very simple!
+
 ## Usage
 
 ### Konele Voice Typing
 
 For konele voice typing, you can use either the websocket endpoint or the POST method endpoint.
 
-- **Websocket**: Connect to the websocket at `/konele/ws` and send audio data. The server will respond with the transcription or translation.
-- **POST Method**: Send a POST request to `/konele/post` with the audio data in the body. The server will respond with the transcription or translation.
+- **Websocket**: Connect to the websocket at `/konele/ws` (or `/v1/konele/ws`) and send audio data. The server will respond with the transcription or translation.
+- **POST Method**: Send a POST request to `/konele/post` (or `/v1/konele/post`) with the audio data in the body. The server will respond with the transcription or translation.
 
-You can also use the demo I have created to quickly test the effect at <https://yongyuancv.cn/konele/ws> and <https://yongyuancv.cn/konele/post>
+You can also use the demo I have created to quickly test the effect at <https://yongyuancv.cn/v1/konele/post>
 
 ### OpenAI Whisper Service
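Aside (not part of the diff): to make the konele POST usage above concrete, here is a minimal client sketch. The host/port (the app's default is port 5000) and the idea that `task` and `lang` travel as query parameters are assumptions drawn from the handler further down; adjust to your deployment.

```python
# Minimal konele POST client sketch. Assumptions: local server on the default
# port 5000; raw audio bytes go in the request body; task/lang are query
# parameters, matching the translateapi handler in this diff.
import requests

with open("audio.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/v1/konele/post",
        params={"task": "transcribe", "lang": "und"},
        data=f.read(),
    )
resp.raise_for_status()
# The endpoint returns {"status": 0, "hypotheses": [{"utterance": ...}], "id": ...}
print(resp.json()["hypotheses"][0]["utterance"])
```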
requirements.txt
@@ -4,5 +4,6 @@ uvicorn[standard]
 whisper_ctranslate2
 opencc
 prometheus-fastapi-instrumentator
-git+https://github.com/heimoshuiyu/faster-whisper@prompt
+git+https://github.com/heimoshuiyu/faster-whisper@a759f5f48f5ef5b79461a6461966eafe9df088a9
 pydub
+aiohttp
@@ -1,49 +1,58 @@
+aiohappyeyeballs==2.4.4
+aiohttp==3.11.10
+aiosignal==1.3.1
 annotated-types==0.7.0
-anyio==4.4.0
-av==12.3.0
+anyio==4.7.0
+async-timeout==5.0.1
+attrs==24.2.0
+av==14.0.0
 certifi==2024.8.30
 cffi==1.17.1
-charset-normalizer==3.3.2
+charset-normalizer==3.4.0
 click==8.1.7
 coloredlogs==15.0.1
-ctranslate2==4.4.0
+ctranslate2==4.5.0
 exceptiongroup==1.2.2
-fastapi==0.114.1
-faster-whisper @ git+https://github.com/heimoshuiyu/faster-whisper@28a4d11a736d8cdeb4655ee5d7e4b4e7ae5ec8e0
-filelock==3.16.0
+fastapi==0.115.6
+faster-whisper @ git+https://github.com/heimoshuiyu/faster-whisper@a759f5f48f5ef5b79461a6461966eafe9df088a9
+filelock==3.16.1
 flatbuffers==24.3.25
-fsspec==2024.9.0
+frozenlist==1.5.0
+fsspec==2024.10.0
 h11==0.14.0
-httptools==0.6.1
-huggingface-hub==0.24.6
+httptools==0.6.4
+huggingface-hub==0.26.3
 humanfriendly==10.0
-idna==3.8
+idna==3.10
 mpmath==1.3.0
-numpy==2.1.1
-onnxruntime==1.19.2
+multidict==6.1.0
+numpy==2.1.3
+onnxruntime==1.20.1
 OpenCC==1.1.9
-packaging==24.1
+packaging==24.2
 prometheus-fastapi-instrumentator==7.0.0
-prometheus_client==0.20.0
-protobuf==5.28.0
+prometheus_client==0.21.1
+propcache==0.2.1
+protobuf==5.29.1
 pycparser==2.22
-pydantic==2.9.1
-pydantic_core==2.23.3
+pydantic==2.10.3
+pydantic_core==2.27.1
 pydub==0.25.1
 python-dotenv==1.0.1
-python-multipart==0.0.9
+python-multipart==0.0.19
 PyYAML==6.0.2
 requests==2.32.3
 sniffio==1.3.1
-sounddevice==0.5.0
-starlette==0.38.5
-sympy==1.13.2
-tokenizers==0.20.0
-tqdm==4.66.5
+sounddevice==0.5.1
+starlette==0.41.3
+sympy==1.13.3
+tokenizers==0.21.0
+tqdm==4.67.1
 typing_extensions==4.12.2
-urllib3==2.2.2
-uvicorn==0.30.6
-uvloop==0.20.0
-watchfiles==0.24.0
-websockets==13.0.1
-whisper-ctranslate2==0.4.5
+urllib3==2.2.3
+uvicorn==0.32.1
+uvloop==0.21.0
+watchfiles==1.0.0
+websockets==14.1
+whisper-ctranslate2==0.5.0
+yarl==1.18.3
@@ -1,13 +1,17 @@
 import tqdm
+import aiohttp
 import os
 import sys
+import dataclasses
+import faster_whisper
 import json
-from fastapi.responses import StreamingResponse
+from fastapi.responses import PlainTextResponse, StreamingResponse
 import wave
 import pydub
 import io
 import hashlib
 import argparse
 import uvicorn
-from typing import Annotated, Any, BinaryIO, Literal, Generator, Tuple, Iterable
+from typing import Annotated, Any, BinaryIO, Literal, Generator, Tuple, Iterable, Union
 from fastapi import (
     File,
     HTTPException,
@@ -19,12 +23,19 @@ from fastapi import (
     WebSocket,
 )
 from fastapi.middleware.cors import CORSMiddleware
-from src.whisper_ctranslate2.whisper_ctranslate2 import Transcribe
 from src.whisper_ctranslate2.writers import format_timestamp
+from faster_whisper.transcribe import Segment, TranscriptionInfo
 import opencc
 from prometheus_fastapi_instrumentator import Instrumentator
 
+# redirect print to stderr
+_print = print
+
+
+def print(*args, **kwargs):
+    _print(*args, file=sys.stderr, **kwargs)
+
 
 parser = argparse.ArgumentParser()
 parser.add_argument("--host", default="0.0.0.0", type=str)
 parser.add_argument("--port", default=5000, type=int)
@@ -40,16 +51,13 @@ Instrumentator().instrument(app).expose(app, endpoint="/konele/metrics")
 ccc = opencc.OpenCC("t2s.json")
 
 print(f"Loading model to device {args.device}...")
-transcriber = Transcribe(
-    model_path=args.model,
+model = faster_whisper.WhisperModel(
+    model_size_or_path=args.model,
     device=args.device,
-    device_index=0,
-    compute_type="default",
-    threads=args.threads,
-    cache_directory=args.cache_dir,
+    cpu_threads=args.threads,
     local_files_only=args.local_files_only,
 )
-print(f"Model loaded to device {transcriber.model.model.device}")
+print(f"Model loaded to device {model.model.device}")
 
 
 # allow all cors
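Aside (not part of the diff): the hunk above replaces the `whisper-ctranslate2` `Transcribe` wrapper with a directly constructed `faster_whisper.WhisperModel`. A minimal sketch of the faster-whisper API this refactor targets, with a placeholder model name and audio path:

```python
# Sketch of direct faster-whisper usage; model size and audio path are
# placeholders, not values from the diff.
import faster_whisper

model = faster_whisper.WhisperModel(
    model_size_or_path="large-v3",
    device="cpu",
    cpu_threads=4,
    local_files_only=False,
)
# transcribe() returns a lazy generator of Segment plus a TranscriptionInfo
segments, info = model.transcribe("audio.wav", task="transcribe", word_timestamps=True)
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```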
@@ -62,56 +70,108 @@ app.add_middleware(
 )
 
 
-def stream_writer(generator: Generator[dict[str, Any], Any, None]):
+async def gpt_refine_text(
+    ge: Generator[Segment, None, None], info: TranscriptionInfo, context: str
+) -> str:
+    text = build_json_result(ge, info).text.strip()
+    model = os.environ.get("OPENAI_LLM_MODEL", "gpt-4o-mini")
+    if not text:
+        return ""
+    body: dict = {
+        "model": model,
+        "temperature": 0.1,
+        "stream": False,
+        "messages": [
+            {
+                "role": "system",
+                "content": f"""
+You are an audio transcription text refiner. You may refer to the context to correct the transcription text.
+Your task is to correct the transcribed text by removing redundant and repetitive words, resolving any contradictions, and fixing punctuation errors.
+Keep my spoken language as it is, and do not change my speaking style. Only fix the text.
+Respond directly with the text.
+""".strip(),
+            },
+            {
+                "role": "user",
+                "content": f"""
+context: {context}
+---
+transcription: {text}
+""".strip(),
+            },
+        ],
+    }
+    print(f"Refining text length: {len(text)} with {model}")
+    print(body)
+    async with aiohttp.ClientSession() as session:
+        async with session.post(
+            os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
+            + "/chat/completions",
+            json=body,
+            headers={
+                "Authorization": f'Bearer {os.environ["OPENAI_API_KEY"]}',
+            },
+        ) as response:
+            return (await response.json())["choices"][0]["message"]["content"]
+
+
+def stream_writer(generator: Generator[Segment, Any, None]):
     for segment in generator:
         yield "data: " + json.dumps(segment, ensure_ascii=False) + "\n\n"
     yield "data: [DONE]\n\n"
 
 
-def text_writer(generator: Generator[dict[str, Any], Any, None]):
+def text_writer(generator: Generator[Segment, Any, None]):
     for segment in generator:
-        yield segment["text"].strip() + "\n"
+        yield segment.text.strip() + "\n"
 
 
-def tsv_writer(generator: Generator[dict[str, Any], Any, None]):
+def tsv_writer(generator: Generator[Segment, Any, None]):
     yield "start\tend\ttext\n"
     for i, segment in enumerate(generator):
-        start_time = str(round(1000 * segment["start"]))
-        end_time = str(round(1000 * segment["end"]))
-        text = segment["text"].strip()
+        start_time = str(round(1000 * segment.start))
+        end_time = str(round(1000 * segment.end))
+        text = segment.text.strip()
         yield f"{start_time}\t{end_time}\t{text}\n"
 
 
-def srt_writer(generator: Generator[dict[str, Any], Any, None]):
+def srt_writer(generator: Generator[Segment, Any, None]):
     for i, segment in enumerate(generator):
         start_time = format_timestamp(
-            segment["start"], decimal_marker=",", always_include_hours=True
+            segment.start, decimal_marker=",", always_include_hours=True
         )
         end_time = format_timestamp(
-            segment["end"], decimal_marker=",", always_include_hours=True
+            segment.end, decimal_marker=",", always_include_hours=True
         )
-        text = segment["text"].strip()
+        text = segment.text.strip()
         yield f"{i}\n{start_time} --> {end_time}\n{text}\n\n"
 
 
-def vtt_writer(generator: Generator[dict[str, Any], Any, None]):
+def vtt_writer(generator: Generator[Segment, Any, None]):
     yield "WEBVTT\n\n"
-    for i, segment in enumerate(generator):
-        start_time = format_timestamp(segment["start"])
-        end_time = format_timestamp(segment["end"])
-        text = segment["text"].strip()
+    for _, segment in enumerate(generator):
+        start_time = format_timestamp(segment.start)
+        end_time = format_timestamp(segment.end)
+        text = segment.text.strip()
         yield f"{start_time} --> {end_time}\n{text}\n\n"
 
 
+@dataclasses.dataclass
+class JsonResult(TranscriptionInfo):
+    segments: list[Segment]
+    text: str
+
+
 def build_json_result(
     generator: Iterable[Segment],
-    info: dict,
-) -> dict[str, Any]:
+    info: TranscriptionInfo,
+) -> JsonResult:
     segments = [i for i in generator]
-    return info | {
-        "text": "\n".join(i["text"] for i in segments),
-        "segments": segments,
-    }
+    return JsonResult(
+        text="\n".join(i.text for i in segments),
+        segments=segments,
+        **dataclasses.asdict(info),
+    )
 
 
 def stream_builder(
@@ -121,12 +181,13 @@ def stream_builder(
     language: str | None,
     initial_prompt: str = "",
     repetition_penalty: float = 1.0,
-) -> Tuple[Iterable[dict], dict]:
-    segments, info = transcriber.model.transcribe(
+) -> Tuple[Generator[Segment, None, None], TranscriptionInfo]:
+    segments, info = model.transcribe(
         audio=audio,
         language=language,
         task=task,
-        initial_prompt=initial_prompt,
         vad_filter=vad_filter,
+        initial_prompt=initial_prompt if initial_prompt else None,
         word_timestamps=True,
         repetition_penalty=repetition_penalty,
     )
@@ -134,27 +195,14 @@
         "Detected language '%s' with probability %f"
         % (info.language, info.language_probability)
     )
 
     def wrap():
-        last_pos = 0
-        with tqdm.tqdm(total=info.duration, unit="seconds", disable=True) as pbar:
-            for segment in segments:
-                start, end, text = segment.start, segment.end, segment.text
-                pbar.update(end - last_pos)
-                last_pos = end
-                data = segment._asdict()
-                if data.get('words') is not None:
-                    data["words"] = [i._asdict() for i in data["words"]]
-                if info.language == "zh":
-                    data["text"] = ccc.convert(data["text"])
-                yield data
+        for segment in segments:
+            if info.language == "zh":
+                segment.text = ccc.convert(segment.text)
+            yield segment
 
-    info_dict = info._asdict()
-    if info_dict['transcription_options'] is not None:
-        info_dict['transcription_options'] = info_dict['transcription_options']._asdict()
-    if info_dict['vad_options'] is not None:
-        info_dict['vad_options'] = info_dict['vad_options']._asdict()
-
-    return wrap(), info_dict
+    return wrap(), info
 
 
 @app.websocket("/k6nele/status")
@@ -171,8 +219,12 @@ async def konele_status(
 
 @app.websocket("/k6nele/ws")
 @app.websocket("/konele/ws")
+@app.websocket("/konele/ws/gpt_refine")
+@app.websocket("/k6nele/ws/gpt_refine")
 @app.websocket("/v1/k6nele/ws")
 @app.websocket("/v1/konele/ws")
+@app.websocket("/v1/konele/ws/gpt_refine")
+@app.websocket("/v1/k6nele/ws/gpt_refine")
 async def konele_ws(
     websocket: WebSocket,
     task: Literal["transcribe", "translate"] = "transcribe",
@@ -220,15 +272,17 @@ async def konele_ws(
         language=None if lang == "und" else lang,
         initial_prompt=initial_prompt,
     )
-    result = build_json_result(generator, info)
 
-    text = result.get("text", "")
+    if websocket.url.path.endswith("gpt_refine"):
+        result = await gpt_refine_text(generator, info, initial_prompt)
+    else:
+        result = build_json_result(generator, info).text
 
     await websocket.send_json(
         {
             "status": 0,
             "segment": 0,
-            "result": {"hypotheses": [{"transcript": text}], "final": True},
+            "result": {"hypotheses": [{"transcript": result}], "final": True},
             "id": md5,
         }
     )
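Aside (not part of the diff): a hedged client sketch for the new `gpt_refine` websocket routes. The `websockets` package and the `"EOS"` end-of-stream text frame are assumptions (the konele/kaldi-gstreamer-server convention, not confirmed by this diff); the reply shape matches the `send_json` call above.

```python
# Hypothetical websocket client for /v1/konele/ws/gpt_refine. Assumptions:
# local server on port 5000, `websockets` installed, and that the server
# finalizes after an "EOS" text frame.
import asyncio
import json
import websockets

async def main() -> None:
    uri = "ws://localhost:5000/v1/konele/ws/gpt_refine?lang=und"
    async with websockets.connect(uri) as ws:
        with open("audio.wav", "rb") as f:
            await ws.send(f.read())  # raw audio bytes
        await ws.send("EOS")         # assumed end-of-stream marker
        reply = json.loads(await ws.recv())
        print(reply["result"]["hypotheses"][0]["transcript"])

asyncio.run(main())
```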
@@ -237,8 +291,12 @@ async def konele_ws(
 
 @app.post("/k6nele/post")
 @app.post("/konele/post")
+@app.post("/k6nele/post/gpt_refine")
+@app.post("/konele/post/gpt_refine")
 @app.post("/v1/k6nele/post")
 @app.post("/v1/konele/post")
+@app.post("/v1/k6nele/post/gpt_refine")
+@app.post("/v1/konele/post/gpt_refine")
 async def translateapi(
     request: Request,
     task: Literal["transcribe", "translate"] = "transcribe",
@@ -283,37 +341,51 @@ async def translateapi(
         language=None if lang == "und" else lang,
         initial_prompt=initial_prompt,
     )
-    result = build_json_result(generator, info)
 
-    text = result.get("text", "")
+    if request.url.path.endswith("gpt_refine"):
+        result = await gpt_refine_text(generator, info, initial_prompt)
+    else:
+        result = build_json_result(generator, info).text
 
     return {
         "status": 0,
-        "hypotheses": [{"utterance": text}],
+        "hypotheses": [{"utterance": result}],
         "id": md5,
     }
 
 
-@app.post("/v1/audio/transcriptions")
+@app.post("/v1/audio/transcriptions", response_model=Union[JsonResult, str])
+@app.post("/v1/audio/translations", response_model=Union[JsonResult, str])
 async def transcription(
     request: Request,
     file: UploadFile = File(...),
     prompt: str = Form(""),
     response_format: str = Form("json"),
-    task: str = Form("transcribe"),
+    task: str = Form(""),
     language: str = Form("und"),
     vad_filter: bool = Form(False),
     repetition_penalty: float = Form(1.0),
+    gpt_refine: bool = Form(False),
 ):
     """Transcription endpoint
 
     User uploads an audio file in multipart/form-data format and receives the transcription in response
     """
 
+    if not task:
+        if request.url.path == "/v1/audio/transcriptions":
+            task = "transcribe"
+        elif request.url.path == "/v1/audio/translations":
+            task = "translate"
+        else:
+            raise HTTPException(400, "task parameter is required")
+
     # timestamp as filename, keep original extension
     generator, info = stream_builder(
         audio=io.BytesIO(file.file.read()),
         task=task,
         vad_filter=vad_filter,
         initial_prompt=prompt,
         language=None if language == "und" else language,
         repetition_penalty=repetition_penalty,
     )
@@ -327,6 +399,8 @@ async def transcription(
     elif response_format == "json":
         return build_json_result(generator, info)
     elif response_format == "text":
+        if gpt_refine:
+            return PlainTextResponse(await gpt_refine_text(generator, info, prompt))
         return StreamingResponse(text_writer(generator), media_type="text/plain")
     elif response_format == "tsv":
         return StreamingResponse(tsv_writer(generator), media_type="text/plain")
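Aside (not part of the diff): a sketch of a request that exercises the new `gpt_refine` branch above. In this hunk `gpt_refine` only takes effect when `response_format` is `text`, and the server needs `OPENAI_API_KEY` set; the host/port below are placeholders.

```python
# Request a GPT-refined plain-text transcription; `prompt` doubles as the
# context passed to gpt_refine_text in the handler above.
import requests

with open("audio.mp4", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/v1/audio/transcriptions",
        files={"file": ("audio.mp4", f)},
        data={
            "response_format": "text",
            "gpt_refine": "true",
            "prompt": "weekly planning meeting",
        },
    )
print(resp.text)
```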