EagerHQ

Voxlit Under the Hood: How We Built a Voice-First AI Agent for macOS

A full engineering breakdown of Voxlit. CoreML hotword detection, streaming STT over WebSocket, the tool-enabled agent, and the Go cloud backend that ties it together.

By Rajdeep Chaudhari · Technical

Voxlit is the voice-first AI agent we ship for macOS. Say "Hey Voxlit" and the app wakes, transcribes, and hands the result to an agent that can act in the app you are focused on. It is open source, it runs as a native binary, and the whole pipeline was designed to be predictable. This post is a full engineering tour.

Hotword on-device, streaming to the cloud only when the user asks, agent tools that act instead of chat. That is the shape of Voxlit.

01 / Client

The macOS app layer.

The Voxlit client is a single Swift target. SwiftUI for the menubar UI and settings, AppKit where we need to reach into accessibility and the clipboard. It ships as a universal binary for Apple Silicon and Intel, signed and notarized for Gatekeeper.

Audio capture

AVAudioEngine owns the input graph. A single AVAudioInputNode feeds two consumers: the on-device hotword model and a ring buffer that survives device route changes.

swift
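import AVFoundation

let engine = AVAudioEngine()
let input = engine.inputNode
// Native hardware format; it is 16 kHz only if the input device runs at 16 kHz.
let format = input.outputFormat(forBus: 0)

// bufferSize is a request; the system may deliver buffers of another size.
input.installTap(onBus: 0, bufferSize: 320, format: format) { buffer, _ in
  hotword.process(buffer)   // on-device wake-word model
  ringBuffer.write(buffer)  // ring buffer that survives route changes
}

engine.prepare()
try engine.start()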

The 320-frame buffer is 20 ms of audio at 16 kHz (320 / 16,000 = 0.02 s), which is the cadence the hotword model expects and also matches the 20 ms PCM frames of the streaming STT protocol.
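
The frame arithmetic is easy to get wrong when the device rate changes, so here is a minimal sketch of it. It is written in Go, the language of the backend, and the function names are hypothetical helpers, not part of the Voxlit codebase.

```go
package main

// framesPerChunk returns how many PCM frames cover chunkMs milliseconds
// of audio at the given sample rate.
func framesPerChunk(sampleRateHz, chunkMs int) int {
	return sampleRateHz * chunkMs / 1000
}

// pcm16Bytes returns the payload size of a mono 16-bit PCM chunk:
// two bytes per frame. Mono is an assumption made for this sketch.
func pcm16Bytes(frames int) int {
	return frames * 2
}
```

With these helpers, `framesPerChunk(16000, 20)` is 320 and `pcm16Bytes(320)` is 640, while a 48 kHz device would need 960 frames for the same 20 ms.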

Handling route changes

Ninety percent of the bug reports we saw in early beta came from Bluetooth headsets disconnecting or sample rates changing when the user switched outputs. The fix is to observe audio route changes (on native macOS that is the AVAudioEngineConfigurationChange notification, since AVAudioSession is an iOS-family API) and force an engine restart when the sample rate shifts.

02 / Hotword

On-device wake-word detection.

Hotword detection runs fully on the user's machine. Nothing leaves the laptop until the user explicitly triggers dictation, which protects privacy and also saves battery.

  • A 220KB CoreML model, quantised to 8-bit weights. Input is 40 mel filterbanks over 1.2 seconds of audio. Output is a single sigmoid score.
  • Sliding window evaluation every 200ms. A simple two-state debouncer prevents double triggers.
  • Neural Engine first, CPU fallback. The NE path draws so little power that always-on listening is measured in milliwatts.
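
One way to read "two-state debouncer" is a latch with hysteresis: fire once when the score crosses a threshold, then stay quiet until the score drops back down. The sketch below assumes that reading. The thresholds and type names are hypothetical, and it is written in Go for consistency with the backend even though the shipping client is Swift.

```go
package main

// Debouncer is a two-state latch (armed or latched) over the sigmoid
// scores produced by each sliding-window evaluation.
type Debouncer struct {
	fire    float64 // score at or above which an armed detector triggers
	release float64 // score below which a latched detector re-arms
	latched bool
}

// NewDebouncer builds a detector in the armed state.
func NewDebouncer(fire, release float64) *Debouncer {
	return &Debouncer{fire: fire, release: release}
}

// Update consumes the score of one evaluation window and reports whether
// this window should wake the app. It returns true at most once per
// utterance of the wake word.
func (d *Debouncer) Update(score float64) bool {
	if d.latched {
		if score < d.release {
			d.latched = false // re-arm, but never fire on the way down
		}
		return false
	}
	if score >= d.fire {
		d.latched = true
		return true
	}
	return false
}
```

Fed the scores 0.1, 0.9, 0.95, 0.6, 0.2, 0.9 with `NewDebouncer(0.8, 0.3)`, this fires on the second and sixth windows only, so consecutive windows that overlap the same "Hey Voxlit" do not trigger twice.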

For a walkthrough of building the model itself, from data collection through CoreML conversion, see On-Device Hotword Detection on macOS With CoreML.

03 / Streaming STT

From mic to transcript.

Once the hotword fires, the client opens a WebSocket to Voxlit Cloud. The protocol is intentionally boring: small JSON control messages, with the audio itself sent as binary frames.

protocol
client -> server: { "type": "start", "codec": "pcm16", "rate": 16000 }
client -> server: <binary pcm frame, 20ms>
client -> server: <binary pcm frame, 20ms>
server -> client: { "type": "partial", "text": "draft an email to" }
server -> client: { "type": "partial", "text": "draft an email to the team" }
client -> server: { "type": "end" }
server -> client: { "type": "final", "text": "Draft an email to the team." }
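
On the server side, the JSON half of this exchange could be modeled roughly as follows. This is a sketch under assumptions: the struct, the function names, and the strict check for pcm16 at 16000 Hz are illustrative and not taken from the Voxlit source. Only the message fields come from the listing above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ControlMsg covers every JSON message in the exchange. Fields that a
// given message type does not use are omitted from the encoded output.
type ControlMsg struct {
	Type  string `json:"type"`
	Codec string `json:"codec,omitempty"`
	Rate  int    `json:"rate,omitempty"`
	Text  string `json:"text,omitempty"`
}

// parseClientMessage decodes a text frame from the client and rejects
// anything other than a supported "start" or an "end".
func parseClientMessage(data []byte) (ControlMsg, error) {
	var m ControlMsg
	if err := json.Unmarshal(data, &m); err != nil {
		return ControlMsg{}, fmt.Errorf("decode control message: %w", err)
	}
	switch m.Type {
	case "start":
		// Assumption: this sketch accepts only the format shown above.
		if m.Codec != "pcm16" || m.Rate != 16000 {
			return ControlMsg{}, fmt.Errorf("unsupported stream %q at %d Hz", m.Codec, m.Rate)
		}
	case "end":
	default:
		return ControlMsg{}, fmt.Errorf("unexpected client message type %q", m.Type)
	}
	return m, nil
}

// encodeTranscript builds a "partial" or "final" message for the client.
func encodeTranscript(kind, text string) ([]byte, error) {
	if kind != "partial" && kind != "final" {
		return nil, errors.New("transcript kind must be partial or final")
	}
	return json.Marshal(ControlMsg{Type: kind, Text: text})
}
```

Binary audio frames never pass through this code path; a WebSocket library reports the frame type, so the handler can route text frames here and hand binary frames straight to the STT provider.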

  • PCM16 on the wire, not Opus. Voice-first STT providers give better accuracy on PCM, and assuming mono audio the stream is only 32 KB/s at 16 kHz, which is trivial for a single user session.
  • A voice-activity detector runs locally. If the user goes silent for more than 800ms after a committed partial, the client sends end automatically.
  • Partial transcripts paint to a transient HUD window. The HUD is non-interactive and disappears on final.
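
The automatic end rule can be sketched as a small counter over per-frame VAD verdicts. This assumes 20 ms frames as in the protocol above and treats any received partial as "committed". The type and method names are hypothetical, and the logic is shown in Go rather than the client's Swift.

```go
package main

const (
	frameMs        = 20  // duration of one PCM frame on the wire
	silenceLimitMs = 800 // silence tolerated after a committed partial
)

// EndDetector decides when the client should send {"type":"end"}.
type EndDetector struct {
	hasPartial bool // true once the server has returned a partial
	silentMs   int  // length of the current run of silent frames
}

// OnPartial records that a partial transcript has arrived.
func (d *EndDetector) OnPartial() {
	d.hasPartial = true
}

// OnFrame takes the VAD verdict for one frame and reports whether the
// session should end now. Speech resets the silence run.
func (d *EndDetector) OnFrame(isSpeech bool) bool {
	if isSpeech {
		d.silentMs = 0
		return false
	}
	d.silentMs += frameMs
	return d.hasPartial && d.silentMs > silenceLimitMs
}
```

Because the rule is "more than 800ms", forty silent frames are not enough and the forty-first triggers the end. Silence before any partial never ends the session, which keeps a slow start from cutting the user off.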

04 / Agent

The thing that actually does work.

The agent is a thin orchestrator. It builds a context packet from the transcript plus local state, picks a model, and runs a tool-use loop with a small, focused tool surface.

Context packet

  • The raw transcript.
  • Active application bundle ID and window title.
  • Currently selected text, captured via the Accessibility API with user consent.
  • A short rolling history of the last three commands for disambiguation.
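
As a rough illustration, the packet and its three-command history might look like this. The JSON field names, the types, and the helper functions are all assumptions made for the sketch; the post does not specify a wire format for the context packet.

```go
package main

import "encoding/json"

const historyLimit = 3 // the post keeps the last three commands

// ContextPacket mirrors the four items listed above.
type ContextPacket struct {
	Transcript   string   `json:"transcript"`
	BundleID     string   `json:"bundle_id"`
	WindowTitle  string   `json:"window_title"`
	SelectedText string   `json:"selected_text,omitempty"`
	History      []string `json:"history"`
}

// History is a rolling window over the most recent commands.
type History struct {
	commands []string
}

// Add appends a command and drops the oldest entries beyond the limit.
func (h *History) Add(cmd string) {
	h.commands = append(h.commands, cmd)
	if extra := len(h.commands) - historyLimit; extra > 0 {
		h.commands = h.commands[extra:]
	}
}

// Snapshot returns a copy, oldest first, that is safe to serialize.
func (h *History) Snapshot() []string {
	out := make([]string, len(h.commands))
	copy(out, h.commands)
	return out
}

// buildPacket assembles and encodes the packet for one command.
func buildPacket(transcript, bundleID, windowTitle, selection string, h *History) ([]byte, error) {
	return json.Marshal(ContextPacket{
		Transcript:   transcript,
		BundleID:     bundleID,
		WindowTitle:  windowTitle,
		SelectedText: selection,
		History:      h.Snapshot(),
	})
}
```

The selected text is marked `omitempty` so that a session without Accessibility consent simply produces a packet with no selection field.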

Tool surface

tools
write_to_cursor(text)           # insert at the current cursor position
paste_and_format(text, style)   # paste respecting target-app conventions
translate(text, target_lang)
explain_error(text)             # used inside terminals and IDEs
summarise_selection()
replace_selection(text)
open_url(url)

The tool set is deliberately small. Every tool added to an agent multiplies the failure surface. We only ship a tool once we can point at ten real user sessions where it is the right answer.
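
A narrow tool surface is also simple to dispatch. The sketch below shows one possible shape: a registry keyed by tool name that refuses duplicates and unknown names. The handler signature, the registry, and the in-memory Document are hypothetical stand-ins; the real tools act on the focused macOS app.

```go
package main

import "fmt"

// ToolFunc is an assumed handler shape: named string arguments in,
// a short status message out.
type ToolFunc func(args map[string]string) (string, error)

// Registry maps tool names to handlers.
type Registry struct {
	tools map[string]ToolFunc
}

func NewRegistry() *Registry {
	return &Registry{tools: make(map[string]ToolFunc)}
}

// Register adds a tool and rejects a second registration of the same name.
func (r *Registry) Register(name string, fn ToolFunc) error {
	if _, dup := r.tools[name]; dup {
		return fmt.Errorf("tool %q already registered", name)
	}
	r.tools[name] = fn
	return nil
}

// Call runs a tool by name, failing loudly if the model asked for a
// tool that does not exist.
func (r *Registry) Call(name string, args map[string]string) (string, error) {
	fn, ok := r.tools[name]
	if !ok {
		return "", fmt.Errorf("unknown tool %q", name)
	}
	return fn(args)
}

// Document is an in-memory stand-in for the focused text field.
type Document struct {
	Text   string
	Cursor int // byte offset into Text
}

// WriteToCursor returns a handler that inserts args["text"] at the cursor.
func WriteToCursor(doc *Document) ToolFunc {
	return func(args map[string]string) (string, error) {
		text, ok := args["text"]
		if !ok {
			return "", fmt.Errorf("write_to_cursor: missing %q argument", "text")
		}
		if doc.Cursor < 0 || doc.Cursor > len(doc.Text) {
			return "", fmt.Errorf("write_to_cursor: cursor %d out of range", doc.Cursor)
		}
		doc.Text = doc.Text[:doc.Cursor] + text + doc.Text[doc.Cursor:]
		doc.Cursor += len(text)
		return fmt.Sprintf("inserted %d bytes", len(text)), nil
	}
}
```

Registering `WriteToCursor(doc)` under the name `write_to_cursor` and calling it with `{"text": ","}` on a document holding "Hello world" with the cursor at offset 5 leaves "Hello, world". A call to any unregistered name returns an error instead of guessing.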

Users want agents that act, not chat. write_to_cursor is used roughly ten times more often than any other tool.

05 / Cloud

Voxlit Cloud in Go.

Voxlit Cloud is a small Go service that sits between the client and the upstream STT plus LLM providers. It runs on serverless containers behind a global load balancer.

  • Per-session WebSocket upgrade, then a goroutine per connection. No shared mutable state between sessions.
  • Provider abstraction for STT with automatic fallback. If the primary provider returns a non-recoverable error or latency exceeds a threshold, we swap on the fly.
  • Short-lived signed tokens on the client. The client never holds a long-lived API key for any upstream provider.
  • Zero transcript retention beyond the connection lifecycle. Anonymous usage metrics (tool calls, model picks, latency histograms) are the only telemetry.
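
The fallback rule can be sketched as a loop over providers with a latency budget on each attempt. This is a simplification under stated assumptions: it works per request rather than mid-stream, it treats every error as a reason to fall back, and the interface, stub, and function names are hypothetical. A production version would likely pass a context so that a slow upstream call is cancelled rather than abandoned.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// STTProvider is an assumed minimal interface over an upstream STT service.
type STTProvider interface {
	Name() string
	Transcribe(audio []byte) (string, error)
}

var errUpstream = errors.New("upstream unavailable")

// stubProvider simulates an upstream with a fixed delay and outcome.
type stubProvider struct {
	name  string
	delay time.Duration
	text  string
	err   error
}

func (s stubProvider) Name() string { return s.name }

func (s stubProvider) Transcribe(audio []byte) (string, error) {
	time.Sleep(s.delay)
	return s.text, s.err
}

type sttResult struct {
	text string
	err  error
}

// transcribeWithFallback tries providers in order and moves on when one
// fails or does not answer within budget. It returns the transcript and
// the name of the provider that produced it.
func transcribeWithFallback(providers []STTProvider, audio []byte, budget time.Duration) (string, string, error) {
	lastErr := errors.New("no providers configured")
	for _, p := range providers {
		done := make(chan sttResult, 1) // buffered so a late reply never blocks
		go func(p STTProvider) {
			text, err := p.Transcribe(audio)
			done <- sttResult{text, err}
		}(p)
		select {
		case r := <-done:
			if r.err == nil {
				return r.text, p.Name(), nil
			}
			lastErr = fmt.Errorf("%s: %w", p.Name(), r.err)
		case <-time.After(budget):
			lastErr = fmt.Errorf("%s: no result within %v", p.Name(), budget)
		}
	}
	return "", "", fmt.Errorf("all STT providers failed: %w", lastErr)
}
```

With a primary stub that sleeps for a second and a 100 ms budget, the call returns the secondary's transcript after roughly 100 ms. With a primary that returns `errUpstream`, the switch is immediate.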

Why Go

The workload is mostly I/O coordination with occasional light processing. Go's goroutines fit perfectly, the binary ships in under 20MB, and cold starts are fast enough that we do not need to keep instances warm.
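
To make the goroutine-per-connection model concrete, here is a minimal sketch in which channels stand in for WebSocket connections. The names are hypothetical and the real service reads from sockets, but the ownership rule is the same: each session's state lives inside its own goroutine and leaves only as a value sent over a channel.

```go
package main

import "sync"

// sessionStats is owned by exactly one goroutine for the life of a session.
type sessionStats struct {
	ID     string
	Frames int
	Bytes  int
}

// runSession drains one connection's audio frames. All state is local,
// so no locks are needed.
func runSession(id string, frames <-chan []byte) sessionStats {
	stats := sessionStats{ID: id}
	for f := range frames {
		stats.Frames++
		stats.Bytes += len(f)
	}
	return stats
}

// serveAll starts one goroutine per connection and collects the final
// stats by value once every session has finished.
func serveAll(conns map[string]<-chan []byte) map[string]sessionStats {
	results := make(chan sessionStats, len(conns))
	var wg sync.WaitGroup
	for id, frames := range conns {
		wg.Add(1)
		go func(id string, frames <-chan []byte) {
			defer wg.Done()
			results <- runSession(id, frames)
		}(id, frames)
	}
	wg.Wait()
	close(results)

	out := make(map[string]sessionStats, len(conns))
	for s := range results {
		out[s.ID] = s
	}
	return out
}
```

The only shared object is the results channel, and the result map is built after all session goroutines have exited.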

06 / What we learned

Lessons from a year of beta.

  • Audio plumbing on macOS outranks the entire AI layer in bug reports. Route changes, sample rates, and Bluetooth pairing dominate the tracker.
  • On-device hotword is not a nice-to-have. Users read "always listening" on the tin and need to be able to verify that audio stays on their machine until they trigger dictation.
  • The HUD that shows live partials is the single highest-value UI element. Without it, people think the app is hung.
  • Tool design matters more than prompt engineering. A clear, narrow tool wins against any amount of prompt gymnastics on a fuzzy one.
  • Latency budget to first partial is 300ms. Past that, the experience feels broken even if the final transcript is perfect.

07 / Try it

Get the app.

Voxlit is free during beta at voxlit.co. The source is on GitHub under the EagerHQ org. Pull requests and issues welcome, especially on audio plumbing.
