# Executive Summary
The morty-voice project — a Rust-based voice assistant powered by Gemini 3.1 Flash Live — has proven the concept but exposed critical failures in both the voice pipeline and the development lifecycle surrounding it. Over a single sprint session, we logged 50+ human interventions, 9 process failures, and roughly 6 hours lost to rate limits and pipeline bugs.
This proposal replaces the custom audio pipeline with Pipecat (the emerging standard for local voice + LLM orchestration), migrates development infrastructure to Linear Agent API + cloud CI + CodeRabbit, introduces automated voice quality testing via Hamming AI, and wraps deployment in Kamal for zero-touch, rollback-capable deploys.
The net cost increase is approximately $84/month. The expected result: a voice assistant that ships reliably without requiring Paul to babysit every merge.
# Current State: What's Broken
Let's be honest about where we are. The prototype works — Paul can talk to Gemini, trigger smart home commands, and query Linear. But the system around it is held together with duct tape and manual intervention.
## By the Numbers

- 50+ human interventions logged in a single sprint session
- 9 process failures
- ~6 hours lost to rate limits and pipeline bugs

## Specific Failures
- PM cron nudging the wrong platform. The automation sends nudges to GitHub instead of Linear, where the actual work is tracked. This is a wiring bug that's been burning cycles for weeks.
- CI running on self-hosted Mac mini. The same machine running the voice assistant is also running CI. A heavy test suite competes with audio processing for CPU and memory.
- Direct code pushes bypassing the pipeline. When the pipeline is painful, people route around it. That's not a discipline problem — it's a tooling problem.
- Cyrus (Claude Code) hitting rate limits and going silent. The AI developer hits Anthropic's rate limits, stops working, and nobody notices until the sprint stalls. No fallback, no alert, no graceful degradation.
- Webhook + comment scraping for Linear integration. Instead of using Linear's structured API, we're scraping webhook payloads and parsing comment text. Fragile, lossy, and creates phantom state mismatches.
- No automated audio testing. The core product is a voice assistant, and we have zero automated validation that audio quality hasn't regressed.
# Proposed Architecture

## Voice Pipeline: Pipecat + Gemini Live
The problem: The current custom Rust audio pipeline handles recording, playback, VAD, echo cancellation, and LLM communication in a monolithic architecture. Every change risks breaking the audio path, and there's no standard framework for testing or extending it.
The solution: Pipecat — a Python framework purpose-built for real-time voice AI pipelines. It's the emerging reference architecture for exactly what morty-voice does: local mic/speaker → VAD → LLM → tool calls → speech output.
### Architecture

```
┌─────────────────────────────────────────────────────┐
│                  Pipecat Pipeline                   │
│                                                     │
│  Shure MV7+ ──→ LocalTransport (16kHz input)        │
│                        │                            │
│                        ▼                            │
│  aec3-rs (Echo Cancellation) ──→ Silero VAD         │
│                                      │              │
│                                      ▼              │
│           GeminiMultimodalLiveLLMService            │
│              (Gemini 3.1 Flash Live)                │
│                        │                            │
│              ┌─────────┴─────────┐                  │
│              │                   │                  │
│              ▼                   ▼                  │
│       Speech Output      Async Tool Dispatch        │
│     (24kHz PCM → DAC)       ┌────┴────┐             │
│                             │    │    │             │
│                           Hue  Linear OpenClaw      │
└─────────────────────────────────────────────────────┘
```
### Key Design Decisions
- LocalTransport for audio I/O. Pipecat's LocalTransport handles the Shure MV7+ mic input (16kHz) and built-in speaker output (24kHz). No custom audio device management code.
- aec3-rs for echo cancellation. Gemini has no server-side AEC — this must be client-side. aec3-rs is a pure Rust port of WebRTC's AEC3 algorithm, battle-tested across billions of WebRTC sessions.
- Silero VAD for voice activity detection. Pipecat integrates Silero VAD natively. This replaces any custom VAD logic and provides proper barge-in support.
- Frame-based pipeline. Pipecat processes audio in frames, not raw streams. Every stage (AEC → VAD → LLM → output) is a composable, testable unit.
- Async tool dispatch. Tool calls (Hue, Linear, OpenClaw) MUST be async. Blocking tool execution causes dead air.
- Session reconnection. Gemini Live sessions have a 10-15 minute lifetime. The pipeline must handle reconnection transparently using resumption tokens.
### What Stays in Rust
Not everything moves to Python. The aec3-rs echo cancellation stays as a Rust library called via PyO3 or as a subprocess. Audio-critical path processing benefits from Rust's performance guarantees. The Pipecat pipeline orchestrates; Rust handles the hot path.
## Development Pipeline: Linear Agent API + Cloud CI
The problem: The current pipeline uses webhook scraping and GitHub comment parsing to bridge Linear and the development workflow. CI runs on the Mac mini. There's no automated code review, and PRs accumulate as zombies.
The solution: A proper structured pipeline using Linear's Agent API, cloud CI, and automated review.
### Linear Agent API
Linear launched their Agent API — a structured interface designed for exactly this use case: AI agents that need to read, create, update, and transition issues programmatically.
- Before: Webhook fires → parse JSON payload → scrape comment text → guess at state transitions
- After: Agent API call → structured issue object → explicit state mutation → confirmation
This eliminates the entire class of "phantom state" bugs where the system thinks an issue is in one state but Linear shows another.
### Cloud CI
| Runner Type | What It Runs | Why |
|---|---|---|
| Cloud (ubuntu-latest) | Linting, formatting, unit tests, integration tests, CodeRabbit review | Scalable, isolated, doesn't compete with voice assistant |
| Self-hosted (Mac mini) | Audio hardware integration tests ONLY | Needs physical access to Shure MV7+ and speakers |
### CodeRabbit for Automated Review
CodeRabbit provides AI-powered code review at $12-30/user/month. For a single-developer project with an AI coding agent, this is the highest-leverage spend available:
- Catches AI-generated issues. Research shows AI PRs have 1.7x more issues than human PRs and security bugs at 1.5-2x the rate.
- Reviews every PR. Unlike human reviewers, it never skips a review because it's 11pm.
- Integrates with GitHub. Comments directly on PRs with line-level suggestions.
### Human Approval Gate
Every PR that passes CI and CodeRabbit review still requires human approval before merge to production. This is non-negotiable.
## Deployment: Kamal + Health Checks
The problem: After a PR merges, nothing happens. Deployment is manual: SSH into the Mac mini, pull the latest code, rebuild, restart.
The solution: Kamal — Basecamp's zero-downtime deployment tool, designed for deploying Docker containers to a single machine via SSH, with health checks and automatic rollback.
### Deployment Flow

```
git push (merge to main)
        │
        ▼
GitHub Actions (cloud)
        ├── Build Docker image
        ├── Run test suite
        ├── Push to GitHub Container Registry
        │
        ▼
Kamal Deploy (via SSH to Mac mini)
        ├── Pull new container image
        ├── Start new container alongside old one
        ├── Run health check (HTTP + audio device probe)
        │
        ├── ✅ Health check passes → Route traffic to new container
        │                            → Stop old container
        │                            → Done
        │
        └── ❌ Health check fails   → Kill new container
                                     → Keep old container running
                                     → Create Linear issue automatically
                                     → Alert via Discord
```
### What This Means

- No manual restart, ever. git push is the deploy command.
- Docker containers, not bare processes. Pinned dependencies. No more "it works on my machine."
- Automatic rollback. If the new version can't pass a health check, Kamal keeps the old version running.
- Health checks that matter. Not just "is the process alive?" but "can it hear, talk, and reach Gemini?"
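A health check along these lines could be a small probe runner that only reports healthy when every probe passes. The probe names and bodies below are placeholders; real probes would open the audio device, play a test tone, and ping the Gemini endpoint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    name: str
    ok: bool
    detail: str = ""


def run_health_checks(probes: dict[str, Callable[[], bool]]) -> list[ProbeResult]:
    """Run every probe; the deploy goes live only if all of them pass."""
    results = []
    for name, probe in probes.items():
        try:
            results.append(ProbeResult(name, probe()))
        except Exception as exc:  # a crashing probe counts as a failing probe
            results.append(ProbeResult(name, False, str(exc)))
    return results


# Illustrative stand-in probes.
probes = {
    "audio_input": lambda: True,      # can it hear?
    "audio_output": lambda: True,     # can it talk?
    "gemini_reachable": lambda: True, # can it reach Gemini?
}
results = run_health_checks(probes)
healthy = all(r.ok for r in results)
```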
## Testing: Hamming AI + PESQ
The problem: The core product is a voice assistant and there are zero automated tests for voice quality. Audio regressions ship silently.
The solution: Automated voice quality testing using Hamming AI for end-to-end conversation testing and PESQ/MOS metrics for audio signal quality.
### Hamming AI — Conversation Quality
Hamming AI provides automated voice agent testing with 95-96% agreement with human evaluators. It runs synthetic conversations, evaluates response quality, detects regressions, and runs as part of CI/CD.
### PESQ/MOS — Audio Signal Quality

PESQ (Perceptual Evaluation of Speech Quality) provides an objective MOS score. If the score falls below 3.5, the deploy is blocked — catching echo cancellation regressions, audio gain issues, sample rate conversion bugs, and VAD cutting off speech.
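The gate itself is a few lines. The `scores` dict here stands in for per-scenario MOS values that a real PESQ implementation (for example the `pesq` Python package) would produce by comparing reference audio against pipeline output; the scenario names are illustrative.

```python
MOS_THRESHOLD = 3.5  # "Good" on the MOS scale, per ITU-T P.862


def mos_gate(scores: dict[str, float], threshold: float = MOS_THRESHOLD) -> list[str]:
    """Return the scenarios that fail the MOS floor; an empty list means deploy OK."""
    return [name for name, mos in scores.items() if mos < threshold]


# Simulated CI run: one scenario (heavy echo) regressed below the floor.
failures = mos_gate({"quiet_room": 4.2, "barge_in": 3.9, "echo_heavy": 3.2})
deploy_allowed = not failures
```

Reporting the failing scenarios by name, rather than a single pass/fail bit, makes the blocked deploy immediately actionable.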
### Deploy Gate

```
Deploy Pipeline:
├── Unit tests pass?            → ✅ Continue / ❌ Block
├── Integration tests pass?     → ✅ Continue / ❌ Block
├── CodeRabbit review clean?    → ✅ Continue / ❌ Block
├── Hamming AI score ≥ 90%?     → ✅ Continue / ❌ Block
├── PESQ MOS ≥ 3.5?             → ✅ Continue / ❌ Block
├── Human approval?             → ✅ Deploy   / ❌ Block
└── Health check post-deploy?   → ✅ Live     / ❌ Rollback
```
## Rate Limit Resilience
The problem: Cyrus (Claude Code) hits Anthropic rate limits and goes silent. No fallback, no alert, no queue. Work stops. This single failure mode has cost ~6 hours in one sprint.
The solution: LiteLLM proxy for multi-provider routing, plus operational guardrails.
### LiteLLM Proxy
- Multi-provider routing. If Anthropic rate-limits, route to a secondary provider for non-critical work.
- Token budget tracking. Real-time visibility into spend per consumer, per provider.
- Rate limit awareness. Proactively queues requests before hitting limits.
### Operational Guardrails
| Strategy | Implementation |
|---|---|
| Off-peak scheduling | Batch work runs during off-peak hours (2-6am PT) |
| Token budget alerts | Daily budget cap per consumer. At 80%, alert. At 100%, queue non-critical work. |
| Graceful degradation | When rate-limited: queue work items in Linear, notify via Discord, resume when capacity returns |
| Provider diversity | Critical path (voice) uses Gemini. Development uses Anthropic/OpenAI via LiteLLM. |
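The budget guardrail reduces to a simple threshold policy, sketched here with an assumed daily cap (the cap value and function name are illustrative):

```python
def budget_action(spent: float, daily_cap: float) -> str:
    """At 80% of the daily cap, alert; at or over the cap, queue non-critical work."""
    if spent >= daily_cap:
        return "queue_non_critical"
    if spent >= 0.8 * daily_cap:
        return "alert"
    return "ok"


# Simulated spend checkpoints against an assumed $10/day cap.
actions = [budget_action(spent, daily_cap=10.0) for spent in (5.0, 8.5, 10.0)]
```

In practice the spend figure would come from LiteLLM's per-consumer tracking, and "queue_non_critical" would mean parking work items in Linear until capacity returns.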
## Monitoring: Sentry + Grafana
The problem: When something breaks, the detection mechanism is "Paul notices." There's no crash reporting, no metrics dashboard, no automated alerting.
### Sentry — Error Tracking
Sentry (free tier: 5K errors/month) provides automatic crash detection with full context, auto-created Linear issues for new error classes, and release tracking per Kamal deploy.
### Grafana — Operational Dashboard
| Metric | Source | Alert Threshold |
|---|---|---|
| Voice response latency (P50, P95, P99) | Pipecat metrics | P95 > 2s |
| Audio quality (PESQ MOS) | Test pipeline | MOS < 3.5 |
| Gemini session reconnects/hour | Pipecat logs | > 6/hour |
| Tool dispatch success rate | Application logs | < 95% |
| Echo cancellation effectiveness | AEC metrics | Residual echo > -40dB |
| Rate limit events/hour | LiteLLM proxy | > 5/hour |
| Memory/CPU utilization | System metrics | CPU > 80% sustained |
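The table's thresholds can be encoded as a small rule set; here is a sketch (metric keys and sample values are illustrative):

```python
def check_thresholds(metrics: dict[str, float]) -> list[str]:
    """Return the alerts that should fire for the supplied metric samples."""
    rules = {
        "p95_latency_s": lambda v: v > 2.0,              # voice response latency
        "pesq_mos": lambda v: v < 3.5,                   # audio quality floor
        "reconnects_per_hour": lambda v: v > 6,          # Gemini session churn
        "tool_success_rate": lambda v: v < 0.95,         # tool dispatch health
        "rate_limit_events_per_hour": lambda v: v > 5,   # LiteLLM pressure
    }
    return [name for name, is_bad in rules.items()
            if name in metrics and is_bad(metrics[name])]


# Simulated scrape: latency has regressed, everything else is healthy.
alerts = check_thresholds({"p95_latency_s": 2.4,
                           "pesq_mos": 4.0,
                           "reconnects_per_hour": 3})
```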
### Alert Flow

```
Error detected (Sentry/Grafana)
        │
        ▼
Create Linear issue (automated)
        │
        ▼
Send Discord notification
        │
        ▼
If critical (voice assistant down):
        → Page Paul via Discord DM
        → Auto-rollback via Kamal if health check fails
```
# Human-in-the-Loop Gates
The contrarian research is unambiguous: zero-human-review is not viable for AI-generated code. 88% of AI agent projects fail before production. AI PRs carry 1.7x more issues. 43% of AI patches that pass CI introduce new failures under adversarial conditions.
This doesn't mean AI coding agents are useless — it means they need guardrails. Here are the five non-negotiable human gates:
## Architecture Decisions
AI agents optimize locally. They'll refactor a function beautifully while introducing a dependency that breaks the deployment model. Architecture requires system-level thinking that current AI can't reliably provide.
## Security-Sensitive Changes
AI-generated code has security bugs at 1.5-2x the human rate. For a system that controls smart home devices and has access to Linear/OpenClaw, a security regression isn't just a bug — it's a liability.
## Production Deploy Approval
The final human checkpoint. A 30-second review of "what changed, do I trust it?" This is the cheapest gate with the highest expected value.
## Test Scenario Authoring
AI can't test what it doesn't know is wrong. The subtle, domain-specific scenarios that catch real bugs require human authorship. AI can write the tests, but humans must design what to test.
## Weekly Code Quality Audit
Drift happens slowly. A weekly scan of merged PRs, error trends, and code quality metrics catches the gradual degradation that no individual PR review would flag.
# Migration Path
Four phases, one week each. Each phase is independently valuable — if we stop after Phase 1, we're still better off.
## Phase 1 — Week 1: Cloud CI + Linear Agent API + CodeRabbit
Goal: Fix the development pipeline. Stop the bleeding.
| Task | Effort | Dependency |
|---|---|---|
| Migrate CI to GitHub cloud runners | 4h | None |
| Keep self-hosted runner for audio tests only | 2h | CI migration |
| Integrate Linear Agent API (replace webhook scraping) | 8h | None |
| Set up CodeRabbit on the repo | 1h | None |
| Fix PM cron to target Linear (not GitHub) | 1h | Linear API |
| Clean up zombie PRs | 2h | None |
Success criteria: CI runs in the cloud. Linear issues update via structured API. Every PR gets automated review. Zero zombie PRs.
## Phase 2 — Week 2: Pipecat Voice Pipeline Migration
Goal: Replace the custom audio pipeline with Pipecat.
| Task | Effort | Dependency |
|---|---|---|
| Set up Pipecat with LocalTransport | 8h | None |
| Integrate GeminiMultimodalLiveLLMService | 8h | Pipecat setup |
| Wire aec3-rs into Pipecat pipeline (PyO3) | 6h | Pipecat setup |
| Migrate tool dispatch to Pipecat async pattern | 4h | Gemini integration |
| Implement session reconnection with resumption tokens | 4h | Gemini integration |
| Integration test: full voice conversation loop | 4h | All above |
Success criteria: Full voice conversation via Pipecat with equivalent or better quality. Tool calls, echo cancellation, and barge-in all working.
## Phase 3 — Week 3: Kamal Deployment + Hamming AI Testing
Goal: Zero-touch deployment with automated quality gates.
| Task | Effort | Dependency |
|---|---|---|
| Dockerize morty-voice | 4h | Phase 2 complete |
| Set up Kamal config for Mac mini | 4h | Docker |
| Implement health checks (audio device, Gemini, tools) | 4h | Docker |
| Create synthetic voice test corpus | 4h | None |
| Integrate Hamming AI for conversation testing | 6h | Test corpus |
| Set up PESQ scoring in CI pipeline | 4h | Test corpus |
| Wire deploy gate (all checks must pass) | 2h | All above |
Success criteria: git push → full test suite → human approval → deploy → health check → live. Automatic rollback if anything fails.
## Phase 4 — Week 4: Sentry Integration + Monitoring Dashboard
Goal: Know when things break before Paul does.
| Task | Effort | Dependency |
|---|---|---|
| Integrate Sentry SDK into morty-voice | 2h | None |
| Configure Sentry → Linear issue creation | 2h | Sentry |
| Set up LiteLLM proxy | 4h | None |
| Configure multi-provider routing + budget alerts | 4h | LiteLLM |
| Build Grafana dashboard (key metrics) | 6h | Sentry, LiteLLM |
| Configure alerting (Discord notifications) | 2h | Grafana |
| Document runbook for common failures | 4h | All above |
Success criteria: Errors auto-create tickets. Rate limits trigger fallback routing. Dashboard shows voice quality, latency, and error rates in real time.
# Cost Analysis

## Current Monthly Costs

| Item | Cost |
|---|---|
| Anthropic Max (20x plan) | $200 |
| GitHub Actions | $0 |
| Gemini API | ~$10 |
| Paul's time (babysitting) | Priceless |
## Proposed Monthly Costs

| Item | Cost |
|---|---|
| Anthropic Max (20x plan) | $200 |
| GitHub Actions (cloud) | ~$10 |
| Gemini API | ~$10 |
| CodeRabbit (Pro) | $24 |
| Hamming AI | ~$50 |
| Sentry / Grafana / Kamal / LiteLLM | $0 |
## Net Impact

Additional monthly cost: ~$84. In exchange, we eliminate 50+ human interventions per sprint, ~6 hours of lost productivity, silent failures, manual deploys, and zombie PRs.

At Paul's effective hourly rate, the 6 hours saved in a single sprint pays for roughly 18 months of the additional tooling, and that saving recurs every sprint.
# Risk Assessment
The contrarian research demands we acknowledge what could go wrong. Enthusiasm doesn't ship software; realism does.
**Pipecat is early-stage.** Breaking changes, incomplete documentation, and edge cases in Gemini Live integration could stall Phase 2.

Mitigation: Pin the Pipecat version. Maintain the ability to fall back to the v1 pipeline for 30 days post-migration. Keep the Rust audio path compilable as an escape hatch.

**Test coverage gaps.** 43% of AI patches that pass CI introduce new failures. More automation doesn't fix this if the test suite doesn't cover the right scenarios.

Mitigation: Human-authored test scenarios. Weekly code quality audit. PESQ regression detection as a hard deploy gate.

**Six new systems to maintain.** That's cognitive overhead.

Mitigation: Every tool has a free tier or is open source. If any tool creates more problems than it solves, rip it out. The architecture is modular.

**Docker audio on macOS.** Running in Docker on macOS adds virtualization overhead, and audio device passthrough is not trivial.

Mitigation: Test early in Phase 3. Kamal can deploy to bare metal with a process manager if Docker audio is problematic.

**Runaway LLM spend.** One unsupervised AI agent burned $5,623 in a month. Usage-based pricing can surprise you.

Mitigation: Hard budget caps in LiteLLM. Monthly cost review. Every service has known, bounded costs.

**Single point of failure.** Everything runs on one Mac mini. Hardware failure means total outage.

Mitigation: Accepted risk for a personal voice assistant. Kamal's Docker setup makes migration to a new machine a single command. Back up configuration and secrets offsite.
# Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                       MORTY-VOICE v2 — FULL ARCHITECTURE                        │
└─────────────────────────────────────────────────────────────────────────────────┘
```

```
VOICE PIPELINE (Runtime — Mac mini)
═══════════════════════════════════

Paul speaks
     │  16kHz (Shure MV7+ mic, via LocalTransport)
     ▼
aec3-rs (AEC) ──▶ Silero VAD ──▶ Gemini 3.1 Flash Live
     ▲                                │            │
     │ reference signal               │            ▼
     │                                │   Async Tool Dispatch
     └──── Speech Output ◀────────────┘            │
           (24kHz PCM)                ┌──────────┼──────────┐
                 │                    ▼          ▼          ▼
                 ▼               Philips Hue  Linear   OpenClaw
          Built-in Speakers                   API      (Claude)
                 │
                 ▼
            Paul hears

(aec3-rs through tool dispatch runs inside the Pipecat pipeline)
```

```
DEVELOPMENT PIPELINE (CI/CD)
════════════════════════════

┌──────────┐  PR   ┌──────────┐      ┌────────────────┐      ┌────────────┐
│ Cyrus    │──────▶│ GitHub   │─────▶│ GitHub Actions │─────▶│ CodeRabbit │
│ (Claude  │       │ Repo     │      │ (Cloud)        │      │ Review     │
│ Code)    │       └──────────┘      │  • Lint/fmt    │      └─────┬──────┘
└──────────┘                         │  • Unit tests  │            │
     ▲                               │  • Int. tests  │            ▼
     │                               │  • PESQ/MOS    │      ┌────────────┐
┌──────────┐                         │  • Hamming AI  │      │ Human      │
│ LiteLLM  │                         └───────┬────────┘      │ Approval   │
│ Proxy    │                                 │               └─────┬──────┘
│(fallback)│                                 ▼                     │
└──────────┘                      ┌───────────────────┐            ▼
                                  │ Self-hosted       │       Kamal Deploy
                                  │ Runner (Mac)      │
                                  │ Audio HW tests    │
                                  └───────────────────┘
```

```
DEPLOYMENT
══════════

                    ┌──────────┐
                    │  Kamal   │
                    │  Deploy  │
                    └────┬─────┘
                         │ SSH
                         ▼
┌──────────────────────────────┐
│           Mac mini           │
│                              │
│  ┌─────────────┐ ┌────────┐  │     Health Check:
│  │ morty-voice │ │ old    │  │     ✅ → swap & go live
│  │ (new)       │ │ (kept  │  │     ❌ → rollback,
│  │             │ │ until  │  │          create Linear issue,
│  │ Docker      │ │ new is │  │          alert via Discord
│  │ container   │ │ live)  │  │
│  └─────────────┘ └────────┘  │
└──────────────────────────────┘
```

```
MONITORING
══════════

┌────────────┐  auto-create  ┌────────────┐     ┌────────────┐
│ Sentry     │──────────────▶│ Linear     │     │ Discord    │
│ (errors)   │               │ (issues)   │     │ (alerts)   │
└─────┬──────┘               └────────────┘     └─────┬──────┘
      ▲                                               ▲
      │                                               │
      │           ┌────────────┐          threshold   │
      └───────────│ Grafana    │──────────────────────┘
                  │ (metrics)  │
                  │ • Latency  │
                  │ • MOS      │
                  │ • Sessions │
                  │ • Errors   │
                  └────────────┘
```

```
LINEAR INTEGRATION
══════════════════

┌──────────┐  Agent API   ┌──────────┐  assign   ┌──────────┐
│ Morty    │─────────────▶│ Linear   │──────────▶│ Cyrus    │
│ (PM)     │  structured  │ Agent    │  issues   │ (Dev)    │
│          │◀─────────────│ API      │◀──────────│          │
└──────────┘    status    └──────────┘  updates  └──────────┘
```
# Appendix: Key Technical Specs
| Parameter | Value | Source |
|---|---|---|
| Gemini input sample rate | 16kHz mono PCM | Gemini Live API docs |
| Gemini output sample rate | 24kHz mono PCM | Gemini Live API docs |
| Gemini session lifetime | 10-15 minutes | Gemini Live API docs |
| AEC algorithm | WebRTC AEC3 (via aec3-rs) | WebRTC project |
| VAD model | Silero VAD v5 | Silero models |
| PESQ quality threshold | MOS ≥ 3.5 (Good) | ITU-T P.862 |
| Hamming AI accuracy | 95-96% vs human evaluators | Hamming AI benchmarks |
| CodeRabbit pricing | $12-30/user/month | CodeRabbit.ai |
| Sentry free tier | 5,000 errors/month | Sentry pricing |
| Kamal | Free (open source, MIT) | kamal-deploy.org |
This document is a living proposal. It will be updated as implementation progresses and assumptions are validated or invalidated. The architecture is designed to be modular — every component can be replaced independently without rebuilding the system.