Local Whisper Transcription with Speaker Diarization: My GPU-Powered Docker Setup

· David Steeman · AI, Self-Hosted

I wanted a transcription tool that runs entirely on my own hardware — no audio leaves the machine, no cloud APIs, no subscriptions. Something that handles any language (including tricky ones like Flemish dialect), produces speaker-labeled output, and can be tuned with domain-specific vocabulary for whatever context I’m transcribing.

What I ended up with is a Docker container powered by an NVIDIA RTX 3090 that transcribes audio with Whisper, aligns every word to a precise timestamp, and identifies who said what — all in about two minutes for a 42-minute recording.

This post covers what I built, why, and the bumps I hit along the way.

Why build this yourself?

Cloud transcription services are good. They’re also a privacy trade-off: your audio leaves your machine, passes through someone else’s infrastructure, and is stored on their terms. For meeting notes or voice memos, that may be fine. For anything sensitive — medical consultations, legal conversations, internal business meetings, personal recordings — it’s not.

Beyond privacy, there are practical limitations:

Language and dialect. Cloud services are optimized for clean, standard language. Regional dialects, accented speech, and informal conversation patterns consistently degrade their accuracy. I transcribe Flemish audio — spoken Dutch with heavy regional pronunciation, local place names, and informal speech patterns that standard Dutch models don’t handle well out of the box.

Domain terminology. Every field has its own vocabulary. Medical transcription involves drug names and anatomical terms. Legal recordings contain jargon that sounds like common words but means something specific. Generic models guess at these terms and often guess wrong.

Control. A local pipeline lets you choose the model size, tune quality parameters, add vocabulary hints, and process as many files as you want without per-minute costs or rate limits.

The solution architecture

The pipeline has three stages, all running inside a single Docker container with GPU passthrough to my NVIDIA RTX 3090:

Audio file (.m4a)
     |
     v
+---------------------------------------------+
|  Docker container: whisper-transcribe        |
|                                              |
|  1. WhisperX (large-v3 model, pre-loaded)    |
|     -> Transcription with vocabulary prompt  |
|     -> beam_size=10, best_of=10, temp=0      |
|                                              |
|  2. wav2vec2 alignment (Dutch)               |
|     -> Word-level timestamp alignment        |
|                                              |
|  3. pyannote speaker diarization             |
|     -> Identifies SPEAKER_00, SPEAKER_01...  |
|                                              |
|  Output: .txt with speaker labels            |
+---------------------------------------------+
     |
     v
RTX 3090 (CUDA, ~8 GB peak VRAM)

WhisperX handles the actual speech-to-text conversion, using the large-v3 Whisper model. It’s a batched implementation that’s significantly faster than vanilla Whisper.

wav2vec2 alignment takes the transcription and aligns every word to a precise timestamp in the audio. This is essential for the next step.

pyannote speaker diarization analyzes the audio to determine who is speaking when. It returns speaker segments with timestamps, which get matched against the word-level timestamps from alignment to label each transcribed word with a speaker.
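WhisperX ships a helper for this matching step (`assign_word_speakers`); conceptually, each word gets the speaker whose diarization turn overlaps its timestamps the most. A minimal standalone sketch of that idea, with dict field names that are illustrative rather than WhisperX's actual structures:

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn it overlaps most."""
    labeled = []
    for w in words:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for t in turns:
            # Overlap between the word's span and the speaker turn's span
            overlap = min(w["end"], t["end"]) - max(w["start"], t["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = t["speaker"], overlap
        labeled.append({**w, "speaker": best_speaker})
    return labeled
```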

The result is a plain text file with speaker labels:

[SPEAKER_00]
  Good morning, I have an appointment at 9.
  I wanted to ask about the project timeline.

[SPEAKER_01]
  Yes, come on in.
  Are you referring to the north or south site?
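Going from speaker-labeled words to the block format above can be sketched like this (illustrative only; the real script's formatting code may differ):

```python
def format_transcript(words):
    """Group consecutive same-speaker words into [SPEAKER_xx] blocks."""
    lines, current_speaker, current_words = [], None, []
    for w in words:
        if w["speaker"] != current_speaker:
            if current_words:
                lines.append("  " + " ".join(current_words))
            if current_speaker is not None:
                lines.append("")  # blank line between speaker blocks
            current_speaker = w["speaker"]
            lines.append(f"[{current_speaker}]")
            current_words = []
        current_words.append(w["word"])
    if current_words:
        lines.append("  " + " ".join(current_words))
    return "\n".join(lines)
```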

Key design decisions

Pre-loading the model into the Docker image

The Whisper large-v3 model is 2.9 GB. Downloading it at runtime is slow and unreliable — and if you’re running the container on a machine with spotty internet, it might fail entirely. I bake the model directly into the Docker image:

# Bake the 2.9 GB large-v3 weights and support files into the image
COPY model.bin /opt/whisper-models/large-v3/model.bin
COPY large-v3-support/ /opt/whisper-models/large-v3/
# Point the transcription script at the pre-loaded model directory
ENV WHISPER_MODEL_DIR=/opt/whisper-models

The image ends up around 7 GB, but it builds once and is cached after that. Every transcription run starts with the model ready to go.
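Resolving the baked-in model location from that env var at runtime is then trivial; a sketch (the helper name is mine, not from the actual script):

```python
import os

def local_model_path(model_name="large-v3"):
    # Resolve the model directory baked into the image via WHISPER_MODEL_DIR.
    root = os.environ.get("WHISPER_MODEL_DIR", "/opt/whisper-models")
    return os.path.join(root, model_name)
```

WhisperX's `load_model` accepts a `download_root` argument that can be pointed at this directory so nothing is fetched over the network; the exact parameter plumbing depends on the WhisperX version.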

Vocabulary prompt files

This was the most impactful quality improvement. Whisper supports an initial_prompt parameter — a string of text that primes the model before transcription begins. By feeding it a paragraph of domain-specific terminology, the model becomes much better at recognizing those terms correctly.

As an example, I transcribed conversations with a Flemish municipal council that involved a lot of local government jargon. My prompt file for that context contains about 1,400 characters of relevant vocabulary:

Dit is een gesprek op een Vlaamse gemeentelijke dienst. De gesprekken
gaan over mobiliteit, stedenbouw, onteigening, verkaveling,
omgevingsvergunning, gewestplan, ruimtelijk uitvoeringsplan,
bestemmingsplan, perceel, kadaster, rooilijn, schepen, college van
burgemeester en schepenen, ...

Without the prompt, the model consistently misrecognizes terms like onteigening (expropriation) as similar-sounding common words. With it, those terms are recognized correctly.

The same approach works for any domain. A medical prompt would list drug names and clinical terminology. A legal prompt would include case-specific terms. A podcast prompt might include recurring guest names and topic-specific jargon. You create a plain text file for each context and pass it via --prompt.
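Loading a prompt file and handing it to Whisper might look like this. It's a sketch: `load_prompt_options` is a hypothetical helper, and note that Whisper only attends to roughly the last 224 tokens of the prompt, so keep the file focused on the terms that matter most:

```python
from pathlib import Path

def load_prompt_options(prompt_path):
    # Read the domain vocabulary file and package it as an initial_prompt.
    # Whisper truncates the prompt to its last ~224 tokens, so the file
    # should stay short and dense with the terms you care about.
    prompt = Path(prompt_path).read_text(encoding="utf-8").strip()
    return {"initial_prompt": prompt}
```

In WhisperX this dict can be passed via `asr_options` when loading the model; in vanilla Whisper, `initial_prompt` goes straight to `transcribe()`. The exact parameter names depend on the library version.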

Auto-stopping Ollama to free VRAM

My RTX 3090 has 24 GB of VRAM, which sounds generous. But Ollama — my always-running local LLM service — typically consumes about 20 GB of that. That doesn’t leave enough room for WhisperX.

The transcription script handles this automatically:

OLLAMA_WAS_ACTIVE=false  # initialize so the cleanup check is always valid
if systemctl is-active --quiet ollama 2>/dev/null; then
    OLLAMA_WAS_ACTIVE=true
    echo "Stopping Ollama to free GPU memory..."
    sudo systemctl stop ollama
    sleep 2
fi

cleanup() {
    if $OLLAMA_WAS_ACTIVE; then
        echo "Restarting Ollama..."
        sudo systemctl start ollama
    fi
}
trap cleanup EXIT

It stops Ollama before transcription, then restarts it when the script exits — even if it exits with an error, thanks to trap cleanup EXIT. Peak VRAM usage during diarization is around 8 GB, well within the 3090’s capacity.

Persistent Docker volume for model caching

The alignment and diarization models are another ~3 GB of downloads on first run. Since the container uses --rm (ephemeral — deleted after each run), those downloads would happen every single time without persistence.

I solved this with a named Docker volume:

docker volume create whisper-hf-cache
# Mounted as: -v whisper-hf-cache:/root/.cache/huggingface

The models download once, then get served from the volume on every subsequent run. The container stays ephemeral, but the cache survives.

Credentials outside the container

The HuggingFace token (needed for pyannote’s gated models) lives at ~/.config/whisper/hf-token and is mounted read-only into the container:

-v ~/.config/whisper/hf-token:/run/secrets/hf-token:ro

No secrets baked into the Docker image. The token file is backed up with the rest of my home directory.
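Inside the container, the script reads the token from the mounted path; a minimal sketch (`read_token` is a hypothetical helper, not the script's actual code):

```python
from pathlib import Path

def read_token(path="/run/secrets/hf-token"):
    # The token is mounted read-only into the container; fail loudly if
    # the mount is missing rather than hitting a confusing 401 later.
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"HuggingFace token not found at {path}")
    return p.read_text(encoding="utf-8").strip()
```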

The journey (or: it wasn’t that simple)

The final setup is clean, but getting there was a process. Here are the highlights of what went wrong.

CUDA library issues. My first Dockerfile used the nvidia/cuda:12.4.0-base-ubuntu22.04 image. Everything built fine, but at runtime WhisperX couldn’t find libcublas.so.12. The base CUDA image doesn’t include the full runtime libraries. Switching to the runtime variant (nvidia/cuda:12.4.0-runtime-ubuntu22.04) fixed it.

HuggingFace gated models. pyannote’s speaker diarization models are gated — you need a HuggingFace account and you need to manually accept the model terms before you can download them. I kept getting GatedRepoError: 403 until I visited each model page on huggingface.co and clicked “Agree”. There are three separate models to accept, and the error message doesn’t tell you which one is missing.

API changes in pyannote. After upgrading to a newer version of pyannote, the code that iterates over diarization segments stopped working. The DiarizeOutput object no longer supports direct iteration via itertracks(). The new API exposes results through a .speaker_diarization property instead. I ended up handling both formats:

# Older pyannote: the pipeline output itself supports itertracks()
try:
    diarize_df = [{'start': turn.start, 'end': turn.end, 'speaker': speaker}
                  for turn, _, speaker in diarize_segments.itertracks(yield_label=True)]
# Newer pyannote: the Annotation is exposed via .speaker_diarization
except AttributeError:
    annotation = diarize_segments.speaker_diarization
    diarize_df = [{'start': turn.start, 'end': turn.end, 'speaker': speaker}
                  for turn, _, speaker in annotation.itertracks(yield_label=True)]

The m4a decoding problem. pyannote couldn’t decode m4a files directly — it relies on torchcodec, which wasn’t installed in the container. Rather than adding another dependency, I load the audio via torchaudio.load() as a waveform tensor and pass that to the diarization pipeline. Two lines of code, problem solved.
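The workaround, roughly (an untested sketch: `diarization_pipeline` stands for the already-loaded pyannote `Pipeline`, which accepts an in-memory `{"waveform", "sample_rate"}` dict instead of a file path):

```python
import torchaudio

# Decode the m4a with torchaudio instead of relying on pyannote's own
# loader (which needs torchcodec), then pass the waveform in memory.
waveform, sample_rate = torchaudio.load("recording.m4a")
diarize_segments = diarization_pipeline(
    {"waveform": waveform, "sample_rate": sample_rate}
)
```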

The disappearing cache. After a failed download attempt, HuggingFace left behind a .no_exist marker file in the cache volume. On the next run, it saw the marker and skipped the download — but the model wasn’t actually there. Cleaning the marker from the Docker volume resolved it.

The use_auth_token parameter removal. A newer version of the HuggingFace libraries removed the use_auth_token parameter in favor of token=. A minor change, but it took a confusing stack trace to track down.

None of these were showstoppers, but together they turned a “set up Whisper in Docker” afternoon into a multi-day project. The good news is that with the issues resolved, the setup is now completely repeatable.

Performance

For the 42-minute recording on an RTX 3090:

Stage                  Time
Model load             ~5 seconds
Transcription          ~42 seconds (58x realtime)
Word alignment         ~5 seconds
Speaker diarization    ~60 seconds
Total                  ~2 minutes

That’s 42 minutes of audio transcribed with word-level timestamps and speaker labels in roughly two minutes. On a CPU, the same process would take well over an hour.

VRAM usage peaks at about 8 GB during the diarization stage. The large-v3 model itself uses ~4.5 GB, the alignment model loads and unloads (~1 GB), and the diarization model peaks around ~2 GB. Plenty of headroom on the 24 GB RTX 3090.

Using it

The whole thing is wrapped in a single bash script:

# Basic transcription
~/claudecode/projects/whisper/transcribe "./recording.m4a" --model large-v3 --language nl

# With vocabulary prompt for your domain
~/claudecode/projects/whisper/transcribe "./recording.m4a" \
    --model large-v3 --language nl --prompt prompt-medical.txt

The script handles everything: building the Docker image if it doesn’t exist, stopping Ollama, running the pipeline, writing the output file, and restarting Ollama. The output is a plain text file alongside the audio file, with speaker labels and clean formatting.

Making it your own

The setup is designed to be generic. To use it for a different context — medical transcription, legal depositions, podcast transcription, whatever — you just need to:

  1. Create a vocabulary prompt file with domain-specific terms
  2. Pass it via --prompt
  3. Set --language to your target language (or omit it for auto-detection)

The alignment model is selected automatically based on the detected language. The diarization is language-agnostic. The Docker container, GPU setup, and caching logic don’t change.

Wrapping up

The project lives at github.com/steemandavid, alongside the other self-hosted tools on my home server. If you have an NVIDIA GPU with at least 8 GB of VRAM and a Docker setup with the NVIDIA runtime, you can have the same pipeline running in under an hour — assuming you don’t hit the same library and API issues I did. If you do, well, at least now you know the fixes.