Problem
Teams I worked with kept exporting Otter transcripts and then spending thirty minutes hand-labeling speakers. They didn’t want cloud transcription either — half the meetings were under NDA. Existing self-hosted options handled each meeting in isolation; nothing carried a speaker identity across sessions.
Approach
A persistent voice-print library, scoped per workspace. Enroll once with fifteen seconds of clean audio, and every later transcript in that workspace comes out with speakers already named. The library lives on the same box as the model; nothing phones home. Display names are editable, because names change and voices don't; the voice print behind a name stays the same.
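To make the enroll-then-match idea concrete, here is a minimal sketch in plain Python. It is illustrative only, not the shipped code: the class name `VoicePrintLibrary`, the cosine-similarity matching, and the 0.7 threshold are all assumptions, and real embeddings would come from the diarization model rather than three-element lists.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class VoicePrintLibrary:
    """Hypothetical per-workspace store: stable speaker id -> print + name."""
    threshold: float = 0.7  # assumed cut-off, would need tuning on real data
    prints: dict = field(default_factory=dict)  # speaker_id -> embedding
    names: dict = field(default_factory=dict)   # speaker_id -> display name
    next_id: int = 1

    def enroll(self, name, embedding):
        """Store an embedding once and return its stable speaker id."""
        speaker_id = self.next_id
        self.next_id += 1
        self.prints[speaker_id] = list(embedding)
        self.names[speaker_id] = name
        return speaker_id

    def rename(self, speaker_id, new_name):
        """Change the display name; the voice print is untouched."""
        self.names[speaker_id] = new_name

    def identify(self, embedding):
        """Return the best-matching name, or None if nothing clears the threshold."""
        best_id, best_score = None, self.threshold
        for speaker_id, reference in self.prints.items():
            score = cosine(embedding, reference)
            if score >= best_score:
                best_id, best_score = speaker_id, score
        return self.names[best_id] if best_id is not None else None


library = VoicePrintLibrary()
ana_id = library.enroll("Ana", [1.0, 0.0, 0.0])
first_match = library.identify([0.9, 0.1, 0.0])
library.rename(ana_id, "Ana M.")
second_match = library.identify([0.9, 0.1, 0.0])
no_match = library.identify([0.0, 1.0, 0.0])
```

Keying everything on a stable speaker id is what lets a rename apply to future transcripts without re-enrolling anyone.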
Stack
whisper.cpp (large-v3) for ASR, pyannote 3.1 with a custom embedding head for diarization, SQLite plus Faiss for the voice-print store, Tauri + Svelte for the desktop UI. Runs on a Mac mini M2 or any Linux box.
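For the voice-print store, a rough sketch of what the SQLite side could look like, using only the standard library. The table and column names are hypothetical, embeddings are packed as float32 blobs by assumption, and the Faiss index is left out entirely; the vectors returned by `load_workspace` are what such an index would be built from.

```python
import sqlite3
from array import array

# Hypothetical schema: names live apart from embeddings so a rename
# never touches the stored voice print.
SCHEMA = """
CREATE TABLE speakers (
    id        INTEGER PRIMARY KEY,
    workspace TEXT NOT NULL,
    name      TEXT NOT NULL
);
CREATE TABLE voiceprints (
    speaker_id INTEGER NOT NULL REFERENCES speakers(id),
    embedding  BLOB NOT NULL
);
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_speaker(conn, workspace, name, embedding):
    """Insert a speaker and its embedding; return the new speaker id."""
    cur = conn.execute(
        "INSERT INTO speakers (workspace, name) VALUES (?, ?)",
        (workspace, name),
    )
    speaker_id = cur.lastrowid
    conn.execute(
        "INSERT INTO voiceprints (speaker_id, embedding) VALUES (?, ?)",
        (speaker_id, array("f", embedding).tobytes()),
    )
    conn.commit()
    return speaker_id


def load_workspace(conn, workspace):
    """Return (speaker_id, name, embedding) for one workspace only."""
    rows = conn.execute(
        "SELECT s.id, s.name, v.embedding FROM speakers s "
        "JOIN voiceprints v ON v.speaker_id = s.id "
        "WHERE s.workspace = ? ORDER BY s.id",
        (workspace,),
    )
    result = []
    for speaker_id, name, blob in rows:
        vector = array("f")
        vector.frombytes(blob)
        result.append((speaker_id, name, list(vector)))
    return result


store = open_store()
add_speaker(store, "team-a", "Ana", [0.5, 0.25, 0.0])
add_speaker(store, "team-b", "Ben", [0.0, 0.5, 0.25])
team_a = load_workspace(store, "team-a")
```

Filtering by workspace in the query keeps one team's voice prints out of another team's matching.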
What shipped
v1 in February 2025 — enroll, transcribe, label, export markdown. v1.4 in April added Slack post-meeting summaries via a local model. Currently in production with three small teams; one of them runs it on a Mac mini under a conference-room TV.
What’s next
A proper eval framework that grades diarization against a held-out set, run on every commit. Right now it’s a notebook I run by hand once a week — which is exactly the kind of toil VoxTel exists to kill.
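As a starting point for that framework, here is a sketch of the kind of check a per-commit job could run. It is a simplified frame-level error rate, not full DER (no collars, no overlapped speech), and it compares names directly since the output is already labeled. The function names, the 10 ms step, and the 0.15 gate are all assumptions for illustration.

```python
def label_at(segments, t):
    """Return the speaker active at time t, or None during silence."""
    for start, end, speaker in segments:
        if start <= t < end:
            return speaker
    return None


def frame_error_rate(reference, hypothesis, step=0.01):
    """Fraction of reference speech frames where the hypothesis names
    a different speaker (or no speaker). Segments are (start, end, name)
    tuples in seconds."""
    if not reference:
        return 0.0
    end = max(seg_end for _, seg_end, _ in reference)
    n_frames = int(round(end / step))
    scored = errors = 0
    for i in range(n_frames):
        t = (i + 0.5) * step  # sample the middle of each frame
        ref_speaker = label_at(reference, t)
        if ref_speaker is None:
            continue  # only score frames where the reference has speech
        scored += 1
        if label_at(hypothesis, t) != ref_speaker:
            errors += 1
    return errors / scored if scored else 0.0


def passes_gate(rate, max_rate=0.15):
    """Hypothetical CI gate: fail the commit if the error rate is too high."""
    return rate <= max_rate


reference = [(0.0, 5.0, "ana"), (5.0, 10.0, "ben")]
hypothesis = [(0.0, 6.0, "ana"), (6.0, 10.0, "ben")]
rate = frame_error_rate(reference, hypothesis)
ok = passes_gate(rate)
```

In this toy case the hypothesis hands one second of Ben's speech to Ana, so one tenth of the scored frames are wrong and the gate still passes.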