Local Audio Transcription with Speaker Diarization

· David Steeman · AI, Self-Hosted

I wanted a transcription tool that runs entirely on my own hardware — no audio leaves the machine, no cloud APIs, no subscriptions. Something that handles any language (including tricky ones like Flemish dialect), produces speaker-labeled output, and can be tuned with domain-specific vocabulary for whatever context I’m transcribing.

What I ended up with is a Docker container powered by an NVIDIA RTX 3090 that transcribes audio with Whisper, aligns every word to a precise timestamp, identifies who said what, runs LLM post-correction to fix brand names and phonetic errors, and can even generate a structured summary with decisions and action items — all in about two minutes for a 42-minute recording, or three to five minutes with every quality enhancement enabled.

This post covers what I built, why, and the bumps I hit along the way. It’s been updated to reflect the v2 pipeline overhaul.

Why build this yourself?
#

Cloud transcription services are good. They’re also a privacy trade-off: your audio leaves your machine, passes through someone else’s infrastructure, and is stored on their terms. For meeting notes or voice memos, that may be fine. For anything sensitive — medical consultations, legal conversations, internal business meetings, personal recordings — it’s not.

Beyond privacy, there are practical limitations:

Language and dialect. Cloud services are optimized for clean, standard language. Regional dialects, accented speech, and informal conversation patterns consistently degrade their accuracy. I transcribe Flemish audio — spoken Dutch with heavy regional pronunciation, local place names, and informal speech patterns that standard Dutch models don’t handle well out of the box.

Domain terminology. Every field has its own vocabulary. Medical transcription involves drug names and anatomical terms. Legal recordings contain jargon that sounds like common words but means something specific. Generic models guess at these terms and often guess wrong.

Control. A local pipeline lets you choose the model size and quality level (from quick drafts to near-perfect transcription), add vocabulary hints, generate summaries, and process as many files as you want without per-minute costs or rate limits.

The solution architecture
#

The v2 pipeline has six stages, all running inside a single Docker container with GPU passthrough to my NVIDIA RTX 3090:

Audio file (.m4a)
     |
     v
+---------------------------------------------+
|  Stage 1: Audio Preprocessing               |
|  - Loudness normalisation (ffmpeg loudnorm)  |
|  - High-pass 80 Hz filter                    |
|  - Stereo channel splitting for phone calls  |
+---------------------------------------------+
     |
     v
+---------------------------------------------+
|  Stage 2: Multi-engine ASR (WhisperX)       |
|  - Engine A: large-v3 with max quality       |
|  - Engine B: Dutch fine-tune (nl audio only) |
+---------------------------------------------+
     |
     v
+---------------------------------------------+
|  Stage 3: ROVER Reconciliation (optional)   |
|  - Merges Engine A + Engine B results        |
|  - Glossary-weighted majority vote           |
+---------------------------------------------+
     |
     v
+---------------------------------------------+
|  Stage 4: Speaker Diarization               |
|  - pyannote/speaker-diarization-3.1          |
|  - Per-channel mode for stereo phone calls   |
+---------------------------------------------+
     |
     v
+---------------------------------------------+
|  Stage 5: LLM Post-Correction (optional)    |
|  - Fixes brand names and phonetic errors     |
|  - Local (Ollama) or cloud (GLM via Z.ai)    |
+---------------------------------------------+
     |
     v
+---------------------------------------------+
|  Stage 6: Render                            |
|  - audio.txt (verbatim, speaker-labeled)     |
|  - audio.cleaned.txt (post-corrected)        |
+---------------------------------------------+
     |
     v
RTX 3090 (CUDA, ~8 GB peak VRAM)

Audio preprocessing normalises loudness, applies a high-pass filter, and can split stereo phone call recordings into separate channels for better diarization.

Multi-engine ASR runs WhisperX with the large-v3 model as the primary engine, with an optional second pass using a Dutch fine-tuned model. The results are reconciled via ROVER (Recognizer Output Voting Error Reduction) — a confidence-weighted majority vote that picks the best transcription from both engines, with glossary terms used as tie-breakers.
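
The real ROVER step first has to align the two engines' word sequences; the vote itself boils down to something like the sketch below. The glossary set, the 0.1 tie margin, and the function name are illustrative, not the pipeline's actual values.

# Simplified confidence-weighted vote with glossary tie-breaking.
# Assumes the two engine outputs are already aligned word-for-word.
GLOSSARY = {"fluke", "commscope", "anixter"}

def vote(word_a: str, conf_a: float, word_b: str, conf_b: float) -> str:
    """Pick one word from two aligned hypotheses."""
    # Near-tie: prefer whichever candidate is a known glossary term.
    if abs(conf_a - conf_b) < 0.1:
        if word_a.lower() in GLOSSARY:
            return word_a
        if word_b.lower() in GLOSSARY:
            return word_b
    # Otherwise take the more confident engine.
    return word_a if conf_a >= conf_b else word_b

# Example: engine A heard "vloek" (0.62), engine B heard "Fluke" (0.58).
print(vote("vloek", 0.62, "Fluke", 0.58))  # -> "Fluke"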

Speaker diarization uses pyannote to determine who is speaking when. It returns speaker segments with timestamps, which get matched against the word-level timestamps from alignment to label each transcribed word with a speaker.
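
The matching itself is a simple interval-overlap assignment: each word gets the speaker whose turn overlaps it the most. A rough sketch of the idea (not the pipeline's actual code):

def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it most.

    words: [{"word": str, "start": float, "end": float}, ...]
    turns: [{"speaker": str, "start": float, "end": float}, ...]
    """
    for w in words:
        best, best_overlap = "UNKNOWN", 0.0
        for t in turns:
            # Overlap between the word interval and the speaker turn.
            overlap = min(w["end"], t["end"]) - max(w["start"], t["start"])
            if overlap > best_overlap:
                best, best_overlap = t["speaker"], overlap
        w["speaker"] = best
    return words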

LLM post-correction takes the verbatim transcript and fixes phonetic mishearings — things like vloek -> Fluke, Annexter -> Anixter, comscope -> CommScope. It runs locally via Ollama or in the cloud via the latest GLM model through Z.ai’s Anthropic-compatible API. The result is two files: a verbatim transcript and a cleaned version with corrections applied.

The result is two text files with speaker labels:

# audio.txt (verbatim)
[SPEAKER_00]
  Goedemorgen, ik heb hier net mijn man met de vloek.
  Ik weet niet of we al een gigantisch veel werk...

[SPEAKER_01]
  Ja, kom maar binnen. Waar kan ik u mee helpen?

# audio.cleaned.txt (post-corrected)
[SPEAKER_00]
  Goedemorgen, ik heb hier net mijn man met de Fluke.
  Ik weet niet of we al een gigantisch veel werk...

What’s new in v2
#

The initial version had three stages: WhisperX transcription, word alignment, and speaker diarization. While it worked well for basic transcription, real-world use on Flemish business calls exposed quality gaps that no decoding parameter could fix.

The poisoned prompt problem. The auto-prompt feature piped raw Ollama output directly into Whisper’s initial_prompt parameter. When using qwen3 models with thinking mode enabled, the output included thinking blocks and ANSI terminal escape codes that were fed straight into the transcription prompt — actively degrading quality. The v2 prompt builder now strips ANSI codes and thinking blocks, and enforces strict length limits.

Multi-stage Python pipeline. The v1 pipeline was a monolithic bash script with embedded Python here-docs — a maintenance headache and the structural cause of the prompt poisoning. The v2 pipeline moves all heavy logic into Python modules under pipeline/, with the bash script acting only as an orchestrator.

Audio preprocessing. Phone call recordings often have inconsistent loudness levels and low-frequency rumble. The v2 pipeline adds loudness normalisation (ffmpeg loudnorm to -16 LUFS), high-pass filtering at 80 Hz, and stereo channel splitting for phone calls where the two channels represent different speakers.
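
In ffmpeg terms this is a single filter chain per file. A sketch of what the preprocessing stage runs (the -16 LUFS target and 80 Hz cutoff are from above; the other loudnorm parameters, the output format, and the function name are my illustration, and channel splitting is omitted):

import subprocess

def preprocess(src: str, dst: str) -> None:
    """Normalise loudness to -16 LUFS and cut rumble below 80 Hz."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            # highpass: remove low-frequency rumble below 80 Hz
            # loudnorm: EBU R128 loudness normalisation to -16 LUFS
            "-af", "highpass=f=80,loudnorm=I=-16:TP=-1.5:LRA=11",
            dst,
        ],
        check=True,
    )

preprocess("call.m4a", "call.norm.wav")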

Glossary support. A plain-text glossary file maps common misrecognitions to their correct forms (vloek -> Fluke, comscope -> CommScope). The glossary is used in three places: the prompt builder appends canonical terms, ROVER uses them as tie-breakers, and the post-correction LLM receives the full glossary in its system prompt.

Two outputs. Every run produces both a verbatim transcript and an LLM-cleaned version, so you can see exactly what was corrected.

Key design decisions
#

Pre-loading models into the Docker image
#

The Whisper large-v3 model is 2.9 GB. Downloading it at runtime is slow and unreliable — and if you’re running the container on a machine with spotty internet, it might fail entirely. I bake the model directly into the Docker image. The v2 image also pre-downloads alignment models, the diarization model, and the Dutch fine-tuned ASR model, bringing the total image size to about 20 GB. It builds once and is cached after that. Every transcription run starts with all models ready to go.
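
The pre-download is just a small script the Dockerfile runs at build time. Roughly like this, assuming WhisperX and huggingface_hub are installed in the image and an HF token is passed in at build time for the gated pyannote model (the real build step may differ):

# download_models.py — run once at image build time; a sketch, not the actual script.
import os

import whisperx
from huggingface_hub import snapshot_download

# Whisper large-v3 weights (~2.9 GB) land in the image's cache directory.
whisperx.load_model("large-v3", device="cpu", compute_type="int8")

# Dutch word-level alignment model used for word timestamps.
whisperx.load_align_model(language_code="nl", device="cpu")

# Gated diarization model; needs a HuggingFace token with the terms accepted.
snapshot_download("pyannote/speaker-diarization-3.1", token=os.environ["HF_TOKEN"])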

Vocabulary prompt files
#

This was the most impactful quality improvement in v1, and it still is. Whisper supports an initial_prompt parameter — a string of text that primes the model before transcription begins. By feeding it a paragraph of domain-specific terminology, the model becomes much better at recognizing those terms correctly.

As an example, I transcribe medical consultations that involve a lot of clinical terminology. A prompt file for that context might contain relevant vocabulary like this:

Dit is een medisch consult. De gesprekken gaan over hypertensie,
diabetes mellitus, cholesterol, bloeddruk, receptuur, huisarts,
specialist, verwijzing, bloedonderzoek, recept, dosering,
bijwerkingen, chronische aandoening, preventief onderzoek, ...

Without the prompt, the model consistently misrecognizes terms like hypertensie as similar-sounding common words. With it, those terms are recognized correctly.

The same approach works for any domain. A medical prompt would list drug names and clinical terminology. A legal prompt would include case-specific terms. A podcast prompt might include recurring guest names and topic-specific jargon. You create a plain text file for each context and pass it via --prompt.
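
Under the hood this is nothing more than Whisper's initial_prompt. A minimal sketch of feeding a prompt file to WhisperX (illustrative, not the pipeline's actual code):

import whisperx

prompt_text = open("prompt-medical.txt", encoding="utf-8").read().strip()

model = whisperx.load_model(
    "large-v3",
    device="cuda",
    compute_type="float16",
    # The prompt primes the decoder with domain vocabulary before transcription.
    asr_options={"initial_prompt": prompt_text},
)

audio = whisperx.load_audio("recording.m4a")
result = model.transcribe(audio, language="nl", batch_size=8)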

The glossary
#

The glossary is a plain text file at ~/.config/whisper/glossary.txt that maps common misrecognitions to their correct forms. Sections like [brand], [person], [place], and [term] control the weighting — brand names get the highest priority in ROVER tie-breaking.

[brand]
vloek -> Fluke
annexter -> Anixter
comscope -> CommScope
connect-wise -> ConnectWise

[person]
brent -> Brent

[place]
berendrechtstraat -> Berendrechtstraat

The glossary is used in three places: the prompt builder appends canonical terms to vocabulary hints, ROVER prefers glossary terms when voting on ambiguous words, and the post-correction LLM receives the full glossary in its system prompt.
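
Parsing the file takes a few lines. Something along these lines (the function name and section weights are illustrative, not the pipeline's real values) gives the other stages a mapping plus a priority:

import re
from pathlib import Path

# Illustrative weights: brands matter most in ROVER tie-breaking.
SECTION_WEIGHTS = {"brand": 3.0, "person": 2.0, "place": 2.0, "term": 1.0}

def load_glossary(path="~/.config/whisper/glossary.txt"):
    """Return {misrecognition: (correct form, weight)}."""
    glossary, section = {}, "term"
    for line in Path(path).expanduser().read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.fullmatch(r"\[(\w+)\]", line)
        if m:
            section = m.group(1)
            continue
        if "->" in line:
            wrong, right = (part.strip() for part in line.split("->", 1))
            glossary[wrong.lower()] = (right, SECTION_WEIGHTS.get(section, 1.0))
    return glossary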

LLM post-correction
#

Even with good prompts and glossary support, some misrecognitions slip through — especially Flemish phonetic errors that sound plausible in Dutch. The --correct flag runs the verbatim transcript through an LLM that fixes brand names and phonetic errors while preserving the original meaning.

For even better results, --cloud-correct uses the latest GLM model via Z.ai’s Anthropic-compatible API. The model is auto-detected from the Z.ai models endpoint and cached for 24 hours, so you always get the best available version without manual configuration.

The post-correction applies strict guardrails: only changes with per-word confidence >= 0.8 are accepted, and any segment with more than 35% token churn from the verbatim is rejected and falls back to the original with an [!unverified] tag. This prevents the LLM from paraphrasing or hallucinating.
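
The churn guardrail is easy to sketch. Something like this, where the 35% threshold is the real one but the token diffing in the actual module is more careful, and the per-word confidence check is omitted:

import difflib

MAX_CHURN = 0.35  # reject corrections that change more than 35% of the tokens

def accept_correction(verbatim: str, corrected: str) -> str:
    """Keep the LLM's correction only if it stays close to the verbatim text."""
    a, b = verbatim.split(), corrected.split()
    # Ratio of matching tokens between the two versions (1.0 = identical).
    similarity = difflib.SequenceMatcher(None, a, b).ratio()
    if 1.0 - similarity > MAX_CHURN:
        # Too much rewriting: fall back to the original and flag it.
        return verbatim + " [!unverified]"
    return corrected

print(accept_correction(
    "ik heb hier net mijn man met de vloek",
    "ik heb hier net mijn man met de Fluke",
))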

Auto-stopping Ollama to free VRAM
#

My RTX 3090 has 24 GB of VRAM, which sounds generous. But Ollama — my always-running local LLM service — typically consumes about 20 GB of that. That doesn’t leave enough room for WhisperX.

The transcription script handles this automatically: it stops Ollama before transcription, then restarts it when the script exits — even if it exits with an error, thanks to trap cleanup EXIT. Peak VRAM usage during diarization is around 8 GB, well within the 3090’s capacity.

Persistent Docker volume for model caching
#

The alignment and diarization models are another ~3 GB of downloads on first run. Since the container uses --rm (ephemeral — deleted after each run), those downloads would happen every single time without persistence.

I solved this with a named Docker volume:

docker volume create whisper-hf-cache
# Mounted as: -v whisper-hf-cache:/root/.cache/huggingface

The models download once, then get served from the volume on every subsequent run. The container stays ephemeral, but the cache survives.

Credentials outside the container
#

The HuggingFace token (needed for pyannote’s gated models) lives at ~/.config/whisper/hf-token and is mounted read-only into the container. The optional Z.ai API key for cloud correction goes in ~/.config/whisper/zai-key. No secrets baked into the Docker image.

The journey (or: it wasn’t that simple)
#

The final setup is clean, but getting there was a process. Here are the highlights of what went wrong — across both v1 and v2.

CUDA library issues. My first Dockerfile used the nvidia/cuda:12.4.0-base-ubuntu22.04 image. Everything built fine, but at runtime WhisperX couldn’t find libcublas.so.12. The base CUDA image doesn’t include the full runtime libraries. Switching to the runtime variant (nvidia/cuda:12.4.0-runtime-ubuntu22.04) fixed it.

HuggingFace gated models. pyannote’s speaker diarization models are gated — you need a HuggingFace account and you need to manually accept the model terms before you can download them. I kept getting GatedRepoError: 403 until I visited each model page on huggingface.co and clicked “Agree”. There are three separate models to accept, and the error message doesn’t tell you which one is missing.

API changes in pyannote. After upgrading to a newer version of pyannote, the code that iterates over diarization segments stopped working. The DiarizeOutput object no longer supports direct iteration via itertracks(). The new API exposes results through a .speaker_diarization property instead. I ended up handling both formats:

try:
    diarize_df = [{'start': turn.start, 'end': turn.end, 'speaker': speaker}
                  for turn, _, speaker in diarize_segments.itertracks(yield_label=True)]
except AttributeError:
    annotation = diarize_segments.speaker_diarization
    diarize_df = [{'start': turn.start, 'end': turn.end, 'speaker': speaker}
                  for turn, _, speaker in annotation.itertracks(yield_label=True)]

The m4a decoding problem. pyannote couldn’t decode m4a files directly — it relies on torchcodec, which wasn’t installed in the container. Rather than adding another dependency, I load the audio via torchaudio.load() as a waveform tensor and pass that to the diarization pipeline. Two lines of code, problem solved.
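
For reference, the workaround looks roughly like this (pipeline loading shown for completeness; the token handling in the real script differs):

import os

import torchaudio
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],  # newer HF releases use token= instead
)

# torchaudio decodes the m4a to a (channels, samples) float tensor,
# so pyannote never has to open the file (and never needs torchcodec).
waveform, sample_rate = torchaudio.load("recording.m4a")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})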

The disappearing cache. After a failed download attempt, HuggingFace left behind a .no_exist marker file in the cache volume. On the next run, it saw the marker and skipped the download — but the model wasn’t actually there. Cleaning the marker from the Docker volume resolved it.

The use_auth_token parameter removal. A newer version of the HuggingFace libraries removed the use_auth_token parameter in favor of token=. A minor change, but it took a confusing stack trace to track down.

The poisoned auto-prompt. The v1 auto-prompt pipeline piped raw ollama output directly into Whisper’s initial_prompt. When using qwen3 models with thinking mode, the output included thinking blocks and ANSI escape codes that actively hurt transcription quality. One particularly bad prompt file ran to about 800 lines of qwen3 thinking output with terminal control codes. The v2 prompt builder now strips all of this before anything reaches Whisper.

The Z.ai API migration. I initially used Z.ai’s OpenAI-compatible PaaS endpoint for cloud post-correction. It worked but had compatibility quirks. Switching to their Anthropic-compatible Messages API endpoint (/api/anthropic/v1/messages) gave a cleaner integration and more reliable results.

None of these were showstoppers, but together they turned a “set up Whisper in Docker” afternoon into a multi-day project. The good news is that with the issues resolved, the setup is now completely repeatable.

Performance
#

For a ~42-minute recording on an RTX 3090, timings depend on the quality preset:

Quality     Time            Notes
fast        ~30 seconds     Small model, no enhancements
medium      ~1 minute       Medium model
good        ~2 minutes      Large-v3, no enhancements
perfect     ~3-5 minutes    Large-v3 + all enhancements + post-correction

The good preset breaks down like this:

Stage                  Time
Model load             ~5 seconds
Transcription          ~42 seconds (58x realtime)
Word alignment         ~5 seconds
Speaker diarization    ~60 seconds
Total                  ~2 minutes

The perfect preset adds audio preprocessing, optional multi-engine ASR with ROVER reconciliation, and LLM post-correction, which brings the total to three to five minutes depending on the number of segments that need correction.

With --summary, add about 30 seconds for LLM summary generation.

VRAM usage peaks at about 8 GB during the diarization stage. The large-v3 model itself uses ~4.5 GB, the alignment model loads and unloads (~1 GB), and the diarization model peaks around ~2 GB. Plenty of headroom on the 24 GB RTX 3090.

Using it
#

The whole thing is wrapped in a single bash script with quality presets and optional features:

# Quick draft — fast model, lower accuracy, quickest results
~/claudecode/projects/whisper/transcribe "./recording.m4a" --quality fast --language nl

# High quality with auto-generated vocabulary (recommended for most use cases)
~/claudecode/projects/whisper/transcribe "./recording.m4a" \
    --quality good --language nl --auto-prompt

# Maximum quality with all enhancements
~/claudecode/projects/whisper/transcribe "./recording.m4a" \
    --quality perfect --language nl --auto-prompt --summary

# Maximum quality with cloud post-correction
~/claudecode/projects/whisper/transcribe "./recording.m4a" \
    --quality perfect --language nl --auto-prompt --cloud-correct

# With a manual vocabulary prompt for a known domain
~/claudecode/projects/whisper/transcribe "./recording.m4a" \
    --model large-v3 --language nl --prompt prompt-medical.txt

The script handles everything: building the Docker image if it doesn’t exist, stopping Ollama, running each pipeline stage, writing the output files, and restarting Ollama. The output is two plain text files alongside the audio file — a verbatim transcript with speaker labels and an optional cleaned version with LLM corrections applied.

At the end of each run, the script prints a summary of which stages ran:

=== Pipeline summary ===
Ran:     preprocess asr (large-v3) diarize post-correct (12 segments fixed)
Skipped: ensemble (engine B unavailable)

Auto-prompting: letting the machine figure out the vocabulary
#

Writing vocabulary prompt files by hand works well, but it’s a manual step you need to repeat for every new context. A medical transcription needs different terms than a legal one. A podcast about technology needs different terms than one about cooking. I wanted to remove that friction.

The solution is a 5-step iterative pipeline that uses a second local model — an LLM running via Ollama — to automatically extract and refine relevant vocabulary:

audio.m4a
  |
  +-- Step 1: Quick scan (Whisper medium model, ~10 seconds on GPU)
  |    -> rough transcript text
  |
  +-- Step 2: Keyword extraction (Ollama LLM)
  |    -> comma-separated vocabulary list (prompt v1)
  |
  +-- Step 3: Refined scan (Whisper medium + prompt v1, ~10 seconds)
  |    -> cleaner transcript text
  |
  +-- Step 4: Keyword refinement (Ollama LLM)
  |    -> refined vocabulary list (prompt v2)
  |
  +-- Step 5: Full transcription (Whisper large-v3 + prompt v2)
       -> aligned, speaker-identified transcript

The iteration matters. The first scan (step 1) produces rough text — good enough to understand the topic, but with misrecognized words. The LLM extracts keywords from this rough text (step 2), which may include some garbage. But feeding those keywords back into a second medium-model scan (step 3) produces significantly cleaner text, because the vocabulary hints correct the most egregious misrecognitions. The LLM then refines the keyword list against this cleaner text (step 4), removing false positives and adding missed terms. This refined prompt (step 4’s output) goes into the final large-v3 transcription.

The v2 prompt builder sanitises all LLM output before it reaches Whisper — stripping ANSI codes, qwen3 thinking blocks, and enforcing length limits to prevent the poisoned prompt issue from v1.
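
The sanitisation itself is a handful of regexes. A sketch of what the prompt builder guards against (the length cap shown is illustrative):

import re

MAX_PROMPT_CHARS = 600  # illustrative cap; Whisper only uses ~224 prompt tokens anyway

ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")           # terminal colour/control codes
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)  # qwen3 reasoning blocks

def sanitize_prompt(raw: str) -> str:
    """Turn raw LLM output into a safe Whisper initial_prompt."""
    text = THINK_RE.sub("", ANSI_RE.sub("", raw))
    text = " ".join(text.split())  # collapse newlines and repeated whitespace
    return text[:MAX_PROMPT_CHARS]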

If Ollama isn’t available or the model isn’t downloaded yet, the script falls back gracefully — it just runs the full transcription without a prompt and warns the user. No hard dependencies.

Quality presets
#

Not every transcription needs the largest model. Sometimes you just want to quickly check what’s on a recording. Other times you need the best possible accuracy. The --quality flag bundles model selection, enhancement flags, and quality parameters into four presets:

Preset     Final model    Enhance    Denoise    Ensemble    Correct       Use case
fast       small          off        off        off         off           Quick drafts
medium     medium         off        off        off         off           Good balance
good       large-v3       off        off        off         off           High quality
perfect    large-v3       on         on         on          on (local)    Maximum quality

The perfect preset enables all enhancement stages: audio preprocessing, multi-engine ASR with ROVER, and local LLM post-correction. Use --cloud-correct alongside it to upgrade to cloud-based correction for better results on Flemish phonetics.

You can override individual stages with negation flags: --quality perfect --no-correct runs everything except post-correction, --quality perfect --no-denoise skips denoising, etc.

Quality parameters
#

The presets also configure three parameters that trade speed for accuracy:

Parameter     fast    medium    good / perfect    Purpose
beam_size     2       5         10                How many candidate transcriptions to search
best_of       2       5         10                How many candidates to sample before beam search
batch_size    16      12        8                 Smaller batches = more careful processing

You can override the final model from any preset with --model. For example, --quality fast --model large-v3 gives you the fast preset’s speed-optimized parameters but with the large-v3 model for the final pass — useful when you want high quality on the final transcription but don’t care about the scan stages.
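
Conceptually a preset is just a bundle of a model name, decoding parameters, and enhancement toggles, with --model layered on top. A sketch of that resolution logic (values taken from the tables above; the real script stores and applies them differently):

PRESETS = {
    "fast":    {"model": "small",    "beam_size": 2,  "best_of": 2,  "batch_size": 16,
                "enhance": False, "ensemble": False, "correct": False},
    "medium":  {"model": "medium",   "beam_size": 5,  "best_of": 5,  "batch_size": 12,
                "enhance": False, "ensemble": False, "correct": False},
    "good":    {"model": "large-v3", "beam_size": 10, "best_of": 10, "batch_size": 8,
                "enhance": False, "ensemble": False, "correct": False},
    "perfect": {"model": "large-v3", "beam_size": 10, "best_of": 10, "batch_size": 8,
                "enhance": True,  "ensemble": True,  "correct": True},
}

def resolve(quality, model_override=None):
    """--model overrides the preset's final model but keeps its other parameters."""
    cfg = dict(PRESETS[quality])
    if model_override:
        cfg["model"] = model_override
    return cfg

# --quality fast --model large-v3: fast decoding parameters, large-v3 final pass.
print(resolve("fast", model_override="large-v3"))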

Automatic summaries
#

The --summary flag generates a structured summary after transcription using a local LLM via Ollama. It appends three sections to the transcript file:

---

## Samenvatting
Het team besprak de voortgang van het bouwproject en de
leveringsproblemen bij de hoofdaannemer...

## Beslissingen
- De fundering wordt volgende week afgerond conform planning...
- Er wordt overgestapt op een alternatieve leverancier voor staal...

## Actiepunten
- [ ] Offerte aanvragen bij de nieuwe leverancier voor staalprofielen...
- [ ] Planning update delen met de opdrachtgever uiterlijk vrijdag...

The LLM is instructed to respond in the same language as the transcript and to include specific names, places, and numbers — not generic filler. It adds about 30 seconds to the total pipeline time.
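
The call itself is a single request to Ollama's local HTTP API. A sketch of that step (the model name and prompt wording are placeholders, not the script's actual ones):

import requests

def summarise(transcript: str, model: str = "qwen3:14b") -> str:
    """Ask the local Ollama instance for a structured summary of the transcript."""
    prompt = (
        "Respond in the same language as the transcript. "
        "Produce three sections: a summary, the decisions taken, and action items. "
        "Mention specific names, places and numbers, not generic filler.\n\n"
        + transcript
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]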

The summary works well combined with --auto-prompt: the auto-prompt pipeline generates an optimal vocabulary first, the final transcription uses that vocabulary for better accuracy, and then the summary is generated from the higher-quality transcript.

Making it your own
#

The setup is designed to be generic. To use it for a different context — medical transcription, legal depositions, podcast transcription, whatever — you can combine the features you need:

Quality presets — --quality fast for quick drafts, --quality good for most use cases, --quality perfect when accuracy matters most.

Manual prompting — create a text file with domain-specific terms and pass it via --prompt. Best when you know the vocabulary in advance.

Auto-prompting — just add --auto-prompt and let the pipeline figure it out. The LLM will analyze the content and generate an appropriate vocabulary prompt automatically.

Post-correction — add --correct for local LLM cleanup or --cloud-correct for cloud-based correction of brand names and phonetic errors.

Summaries — add --summary to append a structured summary with decisions and action items to the transcript.

Glossary — maintain a ~/.config/whisper/glossary.txt with common misrecognitions for your domain. The glossary informs prompts, ROVER voting, and post-correction.

The alignment model is selected automatically based on the detected language. The diarization is language-agnostic. The Docker container, GPU setup, and caching logic don’t change.

Wrapping up
#

The project lives at github.com/steemandavid/whisper-transcribe, alongside the other self-hosted tools on my home server. If you have an NVIDIA GPU with at least 8 GB of VRAM and a Docker setup with the NVIDIA runtime, you can have the same pipeline running in under an hour — assuming you don’t hit the same library and API issues I did. If you do, well, at least now you know the fixes.

The v2 pipeline overhaul takes the tool well beyond raw transcription. Multi-engine ASR with ROVER reconciliation combines the strengths of multiple models. Audio preprocessing handles the messy reality of phone call recordings and meetings. The glossary provides a structured way to teach the pipeline about domain-specific terminology. LLM post-correction catches the misrecognitions that slip through everything else — producing both a verbatim transcript and a cleaned version. And quality presets let you trade speed for accuracy depending on the situation.