# Executive Summary

The morty-voice project — a Rust-based voice assistant powered by Gemini 3.1 Flash Live — has proven the concept but exposed critical failures in both the voice pipeline and the development lifecycle surrounding it. Over a single sprint session, we logged 50+ human interventions, 9 process failures, and roughly 6 hours lost to rate limits and pipeline bugs.

This proposal replaces the custom audio pipeline with Pipecat (the emerging standard for local voice + LLM orchestration), migrates development infrastructure to Linear Agent API + cloud CI + CodeRabbit, introduces automated voice quality testing via Hamming AI, and wraps deployment in Kamal for zero-touch, rollback-capable deploys.

The net cost increase is approximately $84/month. The expected result: a voice assistant that ships reliably without requiring Paul to babysit every merge.

# Current State: What's Broken

Let's be honest about where we are. The prototype works — Paul can talk to Gemini, trigger smart home commands, and query Linear. But the system around it is held together with duct tape and manual intervention.

## By the Numbers

- **50+** human interventions per sprint session
- **9** process failures in one evening
- **~6h** lost to rate limits and pipeline bugs
- **0** automated audio tests: regressions ship silently
- **0** automated deploys: every deploy is manual
- **Zombie PRs** polluting the repo

# Proposed Architecture

## Voice Pipeline: Pipecat + Gemini Live

The problem: The current custom Rust audio pipeline handles recording, playback, VAD, echo cancellation, and LLM communication in a monolithic architecture. Every change risks breaking the audio path, and there's no standard framework for testing or extending it.

The solution: Pipecat — a Python framework purpose-built for real-time voice AI pipelines. It's the emerging reference architecture for exactly what morty-voice does: local mic/speaker → VAD → LLM → tool calls → speech output.

### Architecture

┌─────────────────────────────────────────────────────┐
│                   Pipecat Pipeline                    │
│                                                       │
│  Shure MV7+ ──→ LocalTransport (16kHz input)         │
│       │                                               │
│       ▼                                               │
│  aec3-rs (Echo Cancellation) ──→ Silero VAD          │
│                                       │               │
│                                       ▼               │
│                    GeminiMultimodalLiveLLMService     │
│                         (Gemini 3.1 Flash Live)       │
│                              │                        │
│                    ┌─────────┴─────────┐              │
│                    │                   │              │
│                    ▼                   ▼              │
│              Speech Output      Async Tool Dispatch   │
│            (24kHz PCM → DAC)     ┌────┴────┐         │
│                                  │    │    │         │
│                                Hue  Linear OpenClaw  │
└─────────────────────────────────────────────────────┘
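
Conceptually, this is a chain of frame processors, the abstraction Pipecat builds on. The sketch below models that flow in plain Python; the class names and frame shape are illustrative stand-ins for this proposal, not Pipecat's actual API:

```python
# Illustrative frame-processor chain, loosely modeled on Pipecat's
# pipeline abstraction. All names here are invented for illustration.

class Processor:
    """One stage: takes a frame, returns a frame or None (dropped)."""
    def process(self, frame):
        raise NotImplementedError

class EchoCancel(Processor):
    # Stand-in for aec3-rs: subtract the scaled speaker reference signal.
    def __init__(self, reference_gain=1.0):
        self.reference_gain = reference_gain
    def process(self, frame):
        mic, ref = frame["mic"], frame["speaker_ref"]
        cleaned = [m - self.reference_gain * r for m, r in zip(mic, ref)]
        return {**frame, "mic": cleaned}

class VADGate(Processor):
    # Stand-in for Silero VAD: pass frames whose mean energy clears a threshold.
    def __init__(self, threshold=0.01):
        self.threshold = threshold
    def process(self, frame):
        energy = sum(s * s for s in frame["mic"]) / len(frame["mic"])
        return frame if energy >= self.threshold else None

def run_pipeline(stages, frames):
    out = []
    for frame in frames:
        for stage in stages:
            frame = stage.process(frame)
            if frame is None:
                break  # frame dropped by this stage
        else:
            out.append(frame)
    return out

frames = [
    {"mic": [0.5, -0.4, 0.6], "speaker_ref": [0.0, 0.0, 0.0]},  # speech
    {"mic": [0.3, 0.3, 0.3], "speaker_ref": [0.3, 0.3, 0.3]},   # pure echo
]
speech = run_pipeline([EchoCancel(), VADGate()], frames)
print(len(speech))  # 1: the echo-only frame is cancelled, then gated out
```

The real pipeline substitutes Pipecat's LocalTransport, Silero VAD, and the Gemini Live service for these stubs; the point of the framework is that each stage is swappable and testable in isolation.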

### Key Design Decisions

### What Stays in Rust

Not everything moves to Python. The aec3-rs echo cancellation stays as a Rust library called via PyO3 or as a subprocess. Audio-critical path processing benefits from Rust's performance guarantees. The Pipecat pipeline orchestrates; Rust handles the hot path.
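
At the boundary, the Python side can try the compiled extension and degrade gracefully when it isn't built. A minimal sketch, assuming aec3-rs is exposed to Python as an `aec3_rs` module with a `process` function (both names are placeholders for this proposal, not a published API):

```python
# Hypothetical Python/Rust boundary. "aec3_rs" as a PyO3 extension module
# is an assumption of this proposal; module and function names are placeholders.

try:
    import aec3_rs  # compiled Rust extension (PyO3): the hot path

    def cancel_echo(mic, speaker_ref):
        return aec3_rs.process(mic, speaker_ref)
except ImportError:
    # Dev-machine fallback: passthrough keeps the pipeline runnable
    # (with no echo cancellation) when the extension isn't built.
    def cancel_echo(mic, speaker_ref):
        return list(mic)
```

Keeping the fallback as a passthrough means the pipeline stays runnable on a dev machine without the Rust toolchain, at the cost of uncancelled echo.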


## Development Pipeline: Linear Agent API + Cloud CI

The problem: The current pipeline uses webhook scraping and GitHub comment parsing to bridge Linear and the development workflow. CI runs on the Mac mini. There's no automated code review, and PRs accumulate as zombies.

The solution: A proper structured pipeline using Linear's Agent API, cloud CI, and automated review.

### Linear Agent API

Linear launched their Agent API — a structured interface designed for exactly this use case: AI agents that need to read, create, update, and transition issues programmatically.

This eliminates the entire class of "phantom state" bugs where the system thinks an issue is in one state but Linear shows another.
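
Linear's API is GraphQL, so a state transition becomes one structured mutation instead of a parsed comment. A sketch of the request the agent would POST to `https://api.linear.app/graphql` (the `issueUpdate` mutation and `stateId` input follow Linear's public schema; the IDs and helper function are illustrative):

```python
import json

# Sketch: transition an issue via a single structured GraphQL mutation,
# instead of scraping webhooks or parsing GitHub comments. Issue and
# state IDs below are invented for illustration.

ISSUE_UPDATE = """
mutation($id: String!, $stateId: String!) {
  issueUpdate(id: $id, input: { stateId: $stateId }) {
    success
    issue { id state { name } }
  }
}
"""

def build_transition_request(issue_id: str, state_id: str) -> dict:
    """Build the POST body for Linear's GraphQL endpoint."""
    return {
        "query": ISSUE_UPDATE,
        "variables": {"id": issue_id, "stateId": state_id},
    }

body = build_transition_request("MOR-42", "state-in-progress")
print(json.dumps(body["variables"]))
```

Because the response reports the resulting state, the agent can confirm the transition actually happened rather than assuming it did.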

### Cloud CI

| Runner Type | What It Runs | Why |
| --- | --- | --- |
| Cloud (ubuntu-latest) | Linting, formatting, unit tests, integration tests, CodeRabbit review | Scalable, isolated, doesn't compete with the voice assistant |
| Self-hosted (Mac mini) | Audio hardware integration tests only | Needs physical access to the Shure MV7+ and speakers |

### CodeRabbit for Automated Review

CodeRabbit provides AI-powered code review at $12-30/user/month. For a single-developer project with an AI coding agent, this is the highest-leverage spend available.

### Human Approval Gate

Every PR that passes CI and CodeRabbit review still requires human approval before merge to production. This is non-negotiable.


## Deployment: Kamal + Health Checks

The problem: After a PR merges, nothing happens. Deployment is manual: SSH into the Mac mini, pull the latest code, rebuild, restart.

The solution: Kamal — Basecamp's zero-downtime deployment tool, designed for deploying Docker containers to a single machine via SSH, with health checks and automatic rollback.

### Deployment Flow

git push (merge to main)
        │
        ▼
GitHub Actions (cloud)
  ├── Build Docker image
  ├── Run test suite
  ├── Push to GitHub Container Registry
  │
  ▼
Kamal Deploy (via SSH to Mac mini)
  ├── Pull new container image
  ├── Start new container alongside old one
  ├── Run health check (HTTP + audio device probe)
  │
  ├── ✅ Health check passes → Route traffic to new container
  │                           → Stop old container
  │                           → Done
  │
  └── ❌ Health check fails  → Kill new container
                              → Keep old container running
                              → Create Linear issue automatically
                              → Alert via Discord
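
The health check in this flow can be one aggregation over named probes: HTTP 200 means route traffic, anything else means roll back. A sketch with stubbed probes (the probe set itself is an assumption of this design):

```python
# Sketch of the health endpoint Kamal would poll after starting the new
# container. Each probe returns True/False; the endpoint returns 200 only
# when every probe passes.

def check_health(probes: dict) -> tuple[int, dict]:
    """Run all probes; 200 if all pass, 503 otherwise (Kamal rolls back)."""
    results = {name: bool(probe()) for name, probe in probes.items()}
    status = 200 if all(results.values()) else 503
    return status, results

# Stub probes standing in for the real checks
# (audio device present, Gemini session reachable, tool registry loaded):
status, results = check_health({
    "audio_device": lambda: True,
    "gemini_session": lambda: True,
    "tool_dispatch": lambda: False,  # simulate a failing probe
})
print(status)  # 503: one probe failed, so the deploy rolls back
```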


## Testing: Hamming AI + PESQ

The problem: The core product is a voice assistant and there are zero automated tests for voice quality. Audio regressions ship silently.

The solution: Automated voice quality testing using Hamming AI for end-to-end conversation testing and PESQ/MOS metrics for audio signal quality.

### Hamming AI — Conversation Quality

Hamming AI provides automated voice agent testing with 95-96% agreement with human evaluators. It runs synthetic conversations, evaluates response quality, detects regressions, and runs as part of CI/CD.

### PESQ/MOS — Audio Signal Quality

PESQ (Perceptual Evaluation of Speech Quality) provides an objective MOS score. If the score drops below MOS 3.5, the deploy is blocked, catching echo cancellation regressions, audio gain issues, sample-rate conversion bugs, and VAD cutting off speech.

### Deploy Gate

Deploy Pipeline:
  ├── Unit tests pass?          → ✅ Continue / ❌ Block
  ├── Integration tests pass?   → ✅ Continue / ❌ Block
  ├── CodeRabbit review clean?  → ✅ Continue / ❌ Block
  ├── Hamming AI score ≥ 90%?   → ✅ Continue / ❌ Block
  ├── PESQ MOS ≥ 3.5?           → ✅ Continue / ❌ Block
  ├── Human approval?           → ✅ Deploy  / ❌ Block
  └── Health check post-deploy? → ✅ Live    / ❌ Rollback
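
The gate above reduces to an ordered list of checks where the first failure blocks the deploy and names itself. A minimal sketch, with the thresholds mirroring the gate (Hamming ≥ 90%, PESQ MOS ≥ 3.5):

```python
# Deploy gate as ordered checks: evaluate in sequence, first failure blocks.

def deploy_gate(checks):
    """checks: list of (name, passed) in gate order.
    Returns (ok, blocker): blocker names the first failing check."""
    for name, passed in checks:
        if not passed:
            return False, name
    return True, None

hamming_score, pesq_mos = 0.93, 3.2
ok, blocker = deploy_gate([
    ("unit_tests", True),
    ("integration_tests", True),
    ("coderabbit_review", True),
    ("hamming_score", hamming_score >= 0.90),
    ("pesq_mos", pesq_mos >= 3.5),
    ("human_approval", True),
])
print(ok, blocker)  # False pesq_mos: blocked by the audio-quality gate
```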

## Rate Limit Resilience

The problem: Cyrus (Claude Code) hits Anthropic rate limits and goes silent. No fallback, no alert, no queue. Work stops. This single failure mode has cost ~6 hours in one sprint.

The solution: LiteLLM proxy for multi-provider routing, plus operational guardrails.

### LiteLLM Proxy
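
What the proxy buys us, in miniature: try providers in order, fall back on rate limits, and park work when every provider is exhausted. This models the routing behavior in plain Python; it is not LiteLLM's actual API, and the provider stubs and prompt are invented:

```python
# Conceptual sketch of multi-provider routing with graceful degradation.
# Not LiteLLM's API; provider stubs are invented for illustration.

class RateLimited(Exception):
    pass

def route(prompt, providers, queue):
    """providers: ordered list of (name, call). Returns (provider, reply),
    or queues the prompt and returns (None, None) when all are exhausted."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited:
            continue  # fall through to the next provider
    queue.append(prompt)  # park the work item, resume when capacity returns
    return None, None

def anthropic(prompt):
    raise RateLimited()       # simulate: primary is rate-limited

def openai(prompt):
    return f"ok: {prompt}"    # fallback answers

queue = []
provider, reply = route("triage issue", [("anthropic", anthropic), ("openai", openai)], queue)
print(provider)  # openai: routed around the rate limit
```

In production this logic lives in the proxy's configuration (fallback lists and budget caps), so application code only ever talks to one endpoint.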

### Operational Guardrails

| Strategy | Implementation |
| --- | --- |
| Off-peak scheduling | Batch work runs during off-peak hours (2-6am PT) |
| Token budget alerts | Daily budget cap per consumer. At 80%, alert. At 100%, queue non-critical work. |
| Graceful degradation | When rate-limited: queue work items in Linear, notify via Discord, resume when capacity returns |
| Provider diversity | Critical path (voice) uses Gemini. Development uses Anthropic/OpenAI via LiteLLM. |

## Monitoring: Sentry + Grafana

The problem: When something breaks, the detection mechanism is "Paul notices." There's no crash reporting, no metrics dashboard, no automated alerting.

### Sentry — Error Tracking

Sentry (free tier: 5K errors/month) provides automatic crash detection with full context, auto-created Linear issues for new error classes, and release tracking per Kamal deploy.

### Grafana — Operational Dashboard

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Voice response latency (P50, P95, P99) | Pipecat metrics | P95 > 2s |
| Audio quality (PESQ MOS) | Test pipeline | MOS < 3.5 |
| Gemini session reconnects/hour | Pipecat logs | > 6/hour |
| Tool dispatch success rate | Application logs | < 95% |
| Echo cancellation effectiveness | AEC metrics | Residual echo > -40dB |
| Rate limit events/hour | LiteLLM proxy | > 5/hour |
| Memory/CPU utilization | System metrics | CPU > 80% sustained |
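
Each row is a threshold rule over a metric window. The latency alert, for instance, is just a percentile compare; a dependency-free sketch (nearest-rank percentile, sample values invented):

```python
# One alert rule from the table: P95 voice-response latency vs the 2 s threshold.

def percentile(samples, p):
    """Nearest-rank percentile (simple, dependency-free)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_s = [0.8, 0.9, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.8, 2.4]
p95 = percentile(latencies_s, 95)
alert = p95 > 2.0
print(p95, alert)  # 2.4 True: the tail latency trips the alert
```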

### Alert Flow

Error detected (Sentry/Grafana)
        │
        ▼
  Create Linear issue (automated)
        │
        ▼
  Send Discord notification
        │
        ▼
  If critical (voice assistant down):
    → Page Paul via Discord DM
    → Auto-rollback via Kamal if health check fails

# Human-in-the-Loop Gates

The contrarian research is unambiguous: zero-human-review is not viable for AI-generated code. 88% of AI agent projects fail before production. AI PRs carry 1.7x more issues. 43% of AI patches that pass CI introduce new failures under adversarial conditions.

This doesn't mean AI coding agents are useless — it means they need guardrails. Here are the five non-negotiable human gates:

## Architecture Decisions

**Who:** Paul · **When:** Before any structural change

AI agents optimize locally. They'll refactor a function beautifully while introducing a dependency that breaks the deployment model. Architecture requires system-level thinking that current AI can't reliably provide.

## Security-Sensitive Changes

**Who:** Paul · **When:** Any change touching auth, secrets, permissions

AI-generated code has security bugs at 1.5-2x the human rate. For a system that controls smart home devices and has access to Linear/OpenClaw, a security regression isn't just a bug — it's a liability.

## Production Deploy Approval

**Who:** Paul · **When:** After all automated checks pass

The final human checkpoint. A 30-second review of "what changed, do I trust it?" This is the cheapest gate with the highest expected value.

## Test Scenario Authoring

**Who:** Paul · **When:** New features or gaps in test coverage

AI can't test what it doesn't know is wrong. The subtle, domain-specific scenarios that catch real bugs require human authorship. AI can write the tests, but humans must design what to test.

## Weekly Code Quality Audit

**Who:** Paul · **When:** 30 minutes, Monday morning

Drift happens slowly. A weekly scan of merged PRs, error trends, and code quality metrics catches the gradual degradation that no individual PR review would flag.

# Migration Path

Four phases, one week each. Each phase is independently valuable — if we stop after Phase 1, we're still better off.

## Phase 1 — Week 1: Cloud CI + Linear Agent API + CodeRabbit

Goal: Fix the development pipeline. Stop the bleeding.

| Task | Effort | Dependency |
| --- | --- | --- |
| Migrate CI to GitHub cloud runners | 4h | None |
| Keep self-hosted runner for audio tests only | 2h | CI migration |
| Integrate Linear Agent API (replace webhook scraping) | 8h | None |
| Set up CodeRabbit on the repo | 1h | None |
| Fix PM cron to target Linear (not GitHub) | 1h | Linear API |
| Clean up zombie PRs | 2h | None |

Success criteria: CI runs in the cloud. Linear issues update via structured API. Every PR gets automated review. Zero zombie PRs.

## Phase 2 — Week 2: Pipecat Voice Pipeline Migration

Goal: Replace the custom audio pipeline with Pipecat.

| Task | Effort | Dependency |
| --- | --- | --- |
| Set up Pipecat with LocalTransport | 8h | None |
| Integrate GeminiMultimodalLiveLLMService | 8h | Pipecat setup |
| Wire aec3-rs into Pipecat pipeline (PyO3) | 6h | Pipecat setup |
| Migrate tool dispatch to Pipecat async pattern | 4h | Gemini integration |
| Implement session reconnection with resumption tokens | 4h | Gemini integration |
| Integration test: full voice conversation loop | 4h | All above |

Success criteria: Full voice conversation via Pipecat with equivalent or better quality. Tool calls, echo cancellation, and barge-in all working.

## Phase 3 — Week 3: Kamal Deployment + Hamming AI Testing

Goal: Zero-touch deployment with automated quality gates.

| Task | Effort | Dependency |
| --- | --- | --- |
| Dockerize morty-voice | 4h | Phase 2 complete |
| Set up Kamal config for Mac mini | 4h | Docker |
| Implement health checks (audio device, Gemini, tools) | 4h | Docker |
| Create synthetic voice test corpus | 4h | None |
| Integrate Hamming AI for conversation testing | 6h | Test corpus |
| Set up PESQ scoring in CI pipeline | 4h | Test corpus |
| Wire deploy gate (all checks must pass) | 2h | All above |

Success criteria: git push → full test suite → human approval → deploy → health check → live. Automatic rollback if anything fails.

## Phase 4 — Week 4: Sentry Integration + Monitoring Dashboard

Goal: Know when things break before Paul does.

| Task | Effort | Dependency |
| --- | --- | --- |
| Integrate Sentry SDK into morty-voice | 2h | None |
| Configure Sentry → Linear issue creation | 2h | Sentry |
| Set up LiteLLM proxy | 4h | None |
| Configure multi-provider routing + budget alerts | 4h | LiteLLM |
| Build Grafana dashboard (key metrics) | 6h | Sentry, LiteLLM |
| Configure alerting (Discord notifications) | 2h | Grafana |
| Document runbook for common failures | 4h | All above |

Success criteria: Errors auto-create tickets. Rate limits trigger fallback routing. Dashboard shows voice quality, latency, and error rates in real time.

# Cost Analysis

## Current Monthly Costs

| Item | Cost |
| --- | --- |
| Anthropic Max (20x plan) | $200 |
| GitHub Actions | $0 |
| Gemini API | ~$10 |
| Paul's time (babysitting) | Priceless |
| **Total** | **~$210/mo** |

## Proposed Monthly Costs

| Item | Cost |
| --- | --- |
| Anthropic Max (20x plan) | $200 |
| GitHub Actions (cloud) | ~$10 |
| Gemini API | ~$10 |
| CodeRabbit (Pro) | $24 |
| Hamming AI | ~$50 |
| Sentry / Grafana / Kamal / LiteLLM | $0 |
| **Total** | **~$294/mo** |

**Additional monthly cost: +$84/mo**

This eliminates 50+ human interventions, ~6 hours of lost productivity per sprint, silent failures, manual deploys, and zombie PRs.
At Paul's effective hourly rate, the 6 hours saved per sprint pays for ~18 months of additional tooling — every single sprint.
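
The totals above are simple sums; checking the arithmetic:

```python
# Sanity check on the cost tables: proposed total and the monthly delta.

current = {"anthropic_max": 200, "github_actions": 0, "gemini_api": 10}
proposed = {"anthropic_max": 200, "github_actions": 10, "gemini_api": 10,
            "coderabbit": 24, "hamming_ai": 50}  # Sentry/Grafana/Kamal/LiteLLM: $0

delta = sum(proposed.values()) - sum(current.values())
print(sum(proposed.values()), delta)  # 294 84
```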

# Risk Assessment

The contrarian research demands we acknowledge what could go wrong. Enthusiasm doesn't ship software; realism does.

## Risk 1: Pipecat Immaturity

**Likelihood:** Medium · **Impact:** High

Pipecat is early-stage. Breaking changes, incomplete documentation, and edge cases in Gemini Live integration could stall Phase 2.

Mitigation: Pin Pipecat version. Maintain ability to fall back to v1 pipeline for 30 days post-migration. Keep the Rust audio path compilable as an escape hatch.

## Risk 2: Over-Automation Leading to Quality Erosion

**Likelihood:** Medium · **Impact:** High

43% of AI patches that pass CI introduce new failures. More automation doesn't fix this if the test suite doesn't cover the right scenarios.

Mitigation: Human-authored test scenarios. Weekly code quality audit. PESQ regression detection as a hard deploy gate.

## Risk 3: Tooling Sprawl

**Likelihood:** Medium · **Impact:** Medium

Six new systems to maintain. That's cognitive overhead.

Mitigation: Every tool has a free tier or is open source. If any tool creates more problems than it solves, rip it out. The architecture is modular.

## Risk 4: Docker Overhead on Mac mini

**Likelihood:** Low · **Impact:** Medium

Running in Docker on macOS adds virtualization overhead. Audio device passthrough is not trivial.

Mitigation: Test early in Phase 3. Kamal can deploy to bare metal with a process manager if Docker audio is problematic.

## Risk 5: Cost Creep from AI Services

**Likelihood:** High · **Impact:** Medium

One unsupervised AI agent burned $5,623 in a month. Usage-based pricing can surprise you.

Mitigation: Hard budget caps in LiteLLM. Monthly cost review. Every service has known, bounded costs.

## Risk 6: Single-Machine Dependency

**Likelihood:** Low · **Impact:** Critical

Everything runs on one Mac mini. Hardware failure means total outage.

Mitigation: Accepted risk for a personal voice assistant. Kamal's Docker setup makes migration to a new machine a single command. Back up configuration and secrets offsite.

# Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────────┐
│                         MORTY-VOICE v2 — FULL ARCHITECTURE                      │
└─────────────────────────────────────────────────────────────────────────────────┘

  VOICE PIPELINE (Runtime — Mac mini)
  ═══════════════════════════════════

  Paul speaks         ┌──────────────────────────────────────────────┐
       │              │            Pipecat Pipeline                  │
       ▼              │                                              │
  ┌──────────┐        │  ┌────────┐   ┌─────────┐   ┌───────────┐  │
  │ Shure    │───16kHz──▶│ aec3-rs│──▶│ Silero  │──▶│ Gemini    │  │
  │ MV7+ Mic │        │  │ (AEC)  │   │  VAD    │   │ 3.1 Flash │  │
  └──────────┘        │  └────────┘   └─────────┘   │   Live    │  │
                      │       ▲                      └─────┬─────┘  │
  ┌──────────┐        │       │                            │        │
  │ Built-in │◀─24kHz─│───────┼────────────────────────────┤        │
  │ Speakers │        │       │                            │        │
  └──────────┘        │  Speaker output                    │        │
       │              │  (reference signal                 ▼        │
  Paul hears          │   for AEC)              ┌──────────────┐   │
                      │                         │ Async Tool   │   │
                      │                         │  Dispatch    │   │
                      │                         └──────┬───────┘   │
                      └────────────────────────────────┼───────────┘
                                                       │
                                    ┌──────────────────┼──────────────────┐
                                    │                  │                  │
                                    ▼                  ▼                  ▼
                              ┌──────────┐      ┌──────────┐      ┌──────────┐
                              │ Philips  │      │  Linear  │      │ OpenClaw │
                              │   Hue    │      │   API    │      │ (Claude) │
                              └──────────┘      └──────────┘      └──────────┘


  DEVELOPMENT PIPELINE (CI/CD)
  ════════════════════════════

  ┌──────────┐    ┌──────────┐    ┌──────────────┐    ┌──────────┐
  │  Cyrus   │───▶│  GitHub  │───▶│ GitHub       │───▶│ CodeRab  │
  │ (Claude  │ PR │   Repo   │    │ Actions      │    │  bit     │
  │  Code)   │    │          │    │ (Cloud)      │    │ Review   │
  └──────────┘    └──────────┘    │              │    └────┬─────┘
       ▲               │          │ • Lint/fmt   │         │
       │               │          │ • Unit tests │         │
  ┌──────────┐         │          │ • Int. tests │         ▼
  │ LiteLLM  │         │          │ • PESQ/MOS   │    ┌──────────┐
  │  Proxy   │         │          │ • Hamming AI │    │  Human   │
  │(fallback)│         │          └──────────────┘    │ Approval │
  └──────────┘         │                              └────┬─────┘
                       │                                   │
                       │          ┌──────────────┐         │
                       │          │  Self-hosted  │         │
                       └─────────▶│  Runner (Mac) │         │
                                  │ Audio HW tests│         │
                                  └──────────────┘         │
                                                           ▼
  DEPLOYMENT                                        ┌──────────┐
  ══════════                                        │  Kamal   │
                                                    │  Deploy  │
  ┌──────────────────────────────┐                  └────┬─────┘
  │          Mac mini            │                       │
  │  ┌────────────┐  ┌────────┐ │◀── SSH ────────────────┘
  │  │ morty-voice│  │  old   │ │
  │  │ (new)      │  │ (kept  │ │    Health Check:
  │  │            │  │  until │ │    ✅ → swap & go live
  │  │ Docker     │  │  new   │ │    ❌ → rollback, create
  │  │ container  │  │  is    │ │         Linear issue,
  │  │            │  │  live) │ │         alert Discord
  │  └────────────┘  └────────┘ │
  └──────────────────────────────┘


  MONITORING
  ══════════

  ┌────────────┐         ┌────────────┐         ┌────────────┐
  │   Sentry   │────────▶│   Linear   │         │  Discord   │
  │ (errors)   │ auto-   │  (issues)  │         │  (alerts)  │
  └────────────┘ create  └────────────┘         └────────────┘
       ▲                                              ▲
       │              ┌────────────┐                  │
       └──────────────│  Grafana   │──────────────────┘
                      │ (metrics)  │   threshold
                      │            │   alerts
                      │ • Latency  │
                      │ • MOS      │
                      │ • Sessions │
                      │ • Errors   │
                      └────────────┘


  LINEAR INTEGRATION
  ══════════════════

  ┌──────────┐  Agent API   ┌──────────┐  assign   ┌──────────┐
  │  Morty   │─────────────▶│  Linear  │──────────▶│  Cyrus   │
  │  (PM)    │  structured  │  Agent   │  issues   │  (Dev)   │
  │          │◀─────────────│   API    │◀──────────│          │
  └──────────┘  status      └──────────┘  updates  └──────────┘

# Appendix: Key Technical Specs

| Parameter | Value | Source |
| --- | --- | --- |
| Gemini input sample rate | 16kHz mono PCM | Gemini Live API docs |
| Gemini output sample rate | 24kHz mono PCM | Gemini Live API docs |
| Gemini session lifetime | 10-15 minutes | Gemini Live API docs |
| AEC algorithm | WebRTC AEC3 (via aec3-rs) | WebRTC project |
| VAD model | Silero VAD v5 | Silero models |
| PESQ quality threshold | MOS ≥ 3.5 (Good) | ITU-T P.862 |
| Hamming AI accuracy | 95-96% vs human evaluators | Hamming AI benchmarks |
| CodeRabbit pricing | $12-30/user/month | CodeRabbit.ai |
| Sentry free tier | 5,000 errors/month | Sentry pricing |
| Kamal | Free (open source, MIT) | kamal-deploy.org |

This document is a living proposal. It will be updated as implementation progresses and assumptions are validated or invalidated. The architecture is designed to be modular — every component can be replaced independently without rebuilding the system.