Audio processing microservice with multi-provider STT/TTS, asynchronous processing via Celery, MinIO storage, and intelligent fallback between providers.
WhatsApp users love sending voice messages. In some segments, such as clinics, more than 40% of messages are audio.
┌───────────────────────────────────────────────────────────┐
│               Audio Processor Architecture                │
└───────────────────────────────────────────────────────────┘

 WhatsApp Voice Note  Web/Mobile Audio      Text Response
          │                   │                   │
          ▼                   ▼                   ▼
┌───────────────────────────────────────────────────────────┐
│                Audio Processor (Port 8002)                │
│ ┌───────────────────────────────────────────────────────┐ │
│ │                   FastAPI + Celery                    │ │
│ └───────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘
              │                               │
      ┌───────┴───────┐               ┌───────┴───────┐
      ▼               ▼               ▼               ▼
┌───────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐
│    STT    │   │    STT    │   │    TTS    │   │    TTS    │
│  Primary  │   │ Fallback  │   │  Primary  │   │ Fallback  │
│   Groq    │   │  OpenAI   │   │ElevenLabs │   │ Edge TTS  │
│  Whisper  │   │  Whisper  │   │           │   │           │
└───────────┘   └───────────┘   └───────────┘   └───────────┘
      │               │               │               │
      └───────┬───────┘               └───────┬───────┘
              │                               │
              ▼                               ▼
       ┌─────────────┐                 ┌─────────────┐
       │    MinIO    │                 │    Redis    │
       │   Storage   │                 │  + Celery   │
       │ audio-files │                 │   Workers   │
       └─────────────┘                 └─────────────┘

Whisper Large v3 via the Groq API. Extremely fast (dedicated hardware), excellent for Brazilian Portuguese.
OpenAI's official API. Slower than Groq, but extremely reliable as a backup.
Additional providers can be configured for specific cases or compliance requirements.
Model: eleven_multilingual_v2. Ultra-realistic voices, excellent intonation in Portuguese.
Model: tts-1-hd, Voice: shimmer. High quality, a good backup when ElevenLabs is unavailable.
Voice: pt-BR-FranciscaNeural. Free (Microsoft Edge), used for development/testing or as an emergency fallback.
Audio is heavy. A 2-minute voice note can take 5-10 seconds to transcribe. Processing it synchronously would block the server. The solution: dedicated Celery workers.
The API receives the audio file (any format), validates it, and saves it to MinIO. It returns a job_id immediately.
A worker picks the job from the queue, downloads the file from MinIO, converts the format if needed, sends it to the STT provider, and saves the result.
The result is available via polling (GET /status/job_id) or via a webhook callback when configured.
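The polling path described above can be sketched from the client's side. This is only an illustration under stated assumptions, not the service's actual client: `wait_for_job`, the injected `fetch_status` callable, the backoff values, and the `"failed"` status are hypothetical. Only `queued` and `completed` appear in the responses shown in this document.

```python
import time
from typing import Callable, Dict

# Statuses after which polling stops. "completed" comes from the document;
# "failed" is an assumed value for the error case.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    timeout_s: float = 60.0,
    initial_delay_s: float = 0.5,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal status, backing off exponentially.

    fetch_status is a hypothetical callable that performs
    GET /api/v1/status/{job_id} and returns the decoded JSON body.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        payload = fetch_status(job_id)
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        # Give up if the next wait would take us past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)
```

Injecting `fetch_status` and `sleep` keeps the helper independent of any HTTP library and easy to exercise without a running service. When a webhook callback is configured, a client would not need this loop at all.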
# 1. Client sends audio
POST /api/v1/transcribe
Content-Type: multipart/form-data
file: voice_note.ogg

# Immediate response (~50ms)
{
  "job_id": "abc123",
  "status": "queued",
  "estimated_seconds": 8
}

# 2. Celery worker processes in the background
[Worker] Downloading from MinIO: voice_note.ogg
[Worker] Converting OGG → WAV (ffmpeg)
[Worker] Sending to Groq Whisper...
[Worker] Transcription complete: "Hi, I want to schedule an appointment..."
[Worker] Saving result to Redis

# 3. Client checks status
GET /api/v1/status/abc123
{
  "job_id": "abc123",
  "status": "completed",
  "transcription": "Hi, I want to schedule an appointment for tomorrow at 10am",
  "duration_seconds": 12.5,
  "provider": "groq",
  "confidence": 0.97
}

WhatsApp uses OGG/Opus. Browsers use WebM. iPhones use M4A. The Audio Processor accepts any of them and converts internally via FFmpeg.
The system detects the format automatically (from magic bytes, not the file extension). If the STT provider does not support the format, the file is converted to WAV via FFmpeg before being sent. All of this is transparent to the caller.
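Detection by magic bytes and the conversion step can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, the signature table covers only the formats mentioned here plus MP3, and the 16 kHz mono PCM target is an assumed choice for speech input rather than something the document specifies.

```python
from typing import List, Optional


def detect_audio_format(header: bytes) -> Optional[str]:
    """Guess the container format from the first bytes of a file."""
    if header.startswith(b"OggS"):
        return "ogg"   # OGG container (WhatsApp voice notes use OGG/Opus)
    if header.startswith(b"\x1a\x45\xdf\xa3"):
        return "webm"  # EBML header, shared by WebM and Matroska
    if len(header) >= 12 and header[4:8] == b"ftyp":
        return "m4a"   # ISO base media file (MP4/M4A family)
    if header.startswith(b"RIFF") and header[8:12] == b"WAVE":
        return "wav"
    if header.startswith(b"ID3") or (
        len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0
    ):
        return "mp3"   # ID3 tag or MPEG audio frame sync
    return None


def build_wav_conversion_command(src_path: str, dst_path: str) -> List[str]:
    """Build an FFmpeg argument list that converts any input to WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", src_path,       # input file, format auto-detected by FFmpeg
        "-ac", "1",           # mono (assumed target)
        "-ar", "16000",       # 16 kHz sample rate (assumed target)
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        dst_path,
    ]
```

The argument list can be passed to `subprocess.run(cmd, check=True)` on a machine with FFmpeg installed. Reading only the first 12 bytes of the upload is enough for every check above.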
MINIO_ENDPOINT=minio:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=*****
MINIO_BUCKET=audio-files
MINIO_USE_SSL=false
# Management console
# http://localhost:9003

AI providers fail. Groq can have latency spikes, and ElevenLabs can be down for maintenance. The system automatically tries the next provider in the fallback chain.
# STT fallback chain
# Config: TRANSCRIPTION_SERVICE=groq (primary)
try:
    result = groq_whisper.transcribe(audio)
except (Timeout, RateLimitError, ServiceUnavailable):
    logger.warning("Groq failed, falling back to OpenAI")
    try:
        result = openai_whisper.transcribe(audio)
    except Exception as exc:
        # The fallback failed too: nothing left to try for STT
        logger.error("All STT providers failed")
        raise AudioProcessingError("Transcription unavailable") from exc

# TTS fallback chain
# Config: TTS_SERVICE=elevenlabs (primary)
try:
    audio = elevenlabs.synthesize(text)
except (Timeout, QuotaExceeded):
    logger.warning("ElevenLabs failed, falling back to OpenAI")
    try:
        audio = openai_tts.synthesize(text)
    except Exception:
        logger.warning("Paid TTS failed, using free Edge TTS")
        audio = edge_tts.synthesize(text)  # Always available

If a provider does not respond within 30s, the system assumes failure and tries the next one.
Rate limit or quota exceeded? Automatic fallback without losing the message.
Edge TTS is free and always available as a last resort.
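The same behavior can be written once for any number of providers. The sketch below is a generic take on the idea, not the project's implementation: `call_with_fallback` and `AllProvidersFailed` are hypothetical names, and enforcing the 30-second limit with a worker thread is an assumed approach.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger("audio_processor.fallback")


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain failed or timed out."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[], T]]],
    timeout_s: float = 30.0,
) -> Tuple[str, T]:
    """Try each (name, call) pair in order; return the first success."""
    errors: List[str] = []
    for name, call in providers:
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(call)
        try:
            return name, future.result(timeout=timeout_s)
        except FutureTimeout:
            logger.warning("%s timed out after %ss, trying next", name, timeout_s)
            errors.append(f"{name}: timeout")
        except Exception as exc:  # rate limit, quota, outage, ...
            logger.warning("%s failed (%s), trying next", name, exc)
            errors.append(f"{name}: {exc}")
        finally:
            # Do not block on a hung call; its thread finishes in the background.
            pool.shutdown(wait=False)
    raise AllProvidersFailed("; ".join(errors))
```

A caller would pass the chain in priority order, for example `[("groq", lambda: groq_whisper.transcribe(audio)), ("openai", lambda: openai_whisper.transcribe(audio))]`, and record the returned provider name alongside the result.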
Each tenant can have custom TTS settings: a different voice, a preferred provider, a speech rate. These settings are stored in the Memory Engine and loaded at runtime.
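One way to resolve these settings at runtime is to layer the tenant's overrides on top of per-provider defaults. The sketch below assumes tenant documents shaped like the JSON examples in this section. `resolve_tts_config`, the `"edge"` provider key, and the default speed are hypothetical, while the model and voice names are the ones quoted in this document.

```python
from typing import Any, Dict, Optional

# Per-provider defaults, using the model and voice names quoted in the document.
PROVIDER_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "elevenlabs": {"model": "eleven_multilingual_v2", "speed": 1.0},
    "openai": {"model": "tts-1-hd", "voice": "shimmer", "speed": 1.0},
    "edge": {"voice": "pt-BR-FranciscaNeural", "speed": 1.0},
}
DEFAULT_PROVIDER = "elevenlabs"


def resolve_tts_config(tenant_doc: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge a tenant's tts_config over the defaults of its chosen provider."""
    overrides = dict((tenant_doc or {}).get("tts_config") or {})
    provider = overrides.get("provider", DEFAULT_PROVIDER)
    if provider not in PROVIDER_DEFAULTS:
        raise ValueError(f"unknown TTS provider: {provider!r}")
    # Later keys win: tenant overrides replace provider defaults.
    return {"provider": provider, **PROVIDER_DEFAULTS[provider], **overrides}
```

Keying the defaults by provider avoids pairing, say, an OpenAI voice with an ElevenLabs model when a tenant overrides only the provider.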
{
  "tenant_id": "clinica_sp",
  "tts_config": {
    "provider": "elevenlabs",
    "voice_id": "mPDAoQyGzxBSkE0OAOKw",
    "model": "eleven_multilingual_v2",
    "speed": 1.0,
    "stability": 0.5,
    "similarity_boost": 0.75
  }
}
{
  "tenant_id": "loja_xyz",
  "tts_config": {
    "provider": "openai",
    "voice": "shimmer",
    "model": "tts-1-hd",
    "speed": 1.1,
    "response_format": "opus"
  }
}

┌─────────────────────────────────────────────────────────────────────────┐
│                        WhatsApp Voice Note Flow                         │
└─────────────────────────────────────────────────────────────────────────┘

 Client sends audio            Audio Processor              AI Engine
          │                           │                         │
          ▼                           │                         │
    Evolution API                     │                         │
          │                           │                         │
          ▼                           │                         │
WhatsApp Integration                  │                         │
          │                           │                         │
          │  POST /transcribe         │                         │
          │──────────────────────────▶│                         │
          │                           │                         │
          │  { job_id: "abc" }        │                         │
          │◀──────────────────────────│                         │
          │                           │                         │
          │  [Celery processing]      │                         │
          │                           │                         │
          │  GET /status/abc          │                         │
          │──────────────────────────▶│                         │
          │                           │                         │
          │  { text: "I want..." }    │                         │
          │◀──────────────────────────│                         │
          │                           │                         │
          │  POST /chat (text)        │                         │
          │────────────────────────────────────────────────────▶│
          │                           │                         │
          │                           │  { response: "..." }    │
          │◀────────────────────────────────────────────────────│
          │                           │                         │
          │  POST /synthesize         │                         │
          │──────────────────────────▶│                         │
          │                           │                         │
          │  { audio_url: "..." }     │                         │
          │◀──────────────────────────│                         │
          │                           │                         │
          ▼                           │                         │
    Evolution API                     │                         │
          │                           │                         │
          ▼                           │                         │
Client receives audio                 │                         │

Always replies with audio. Ideal for users who prefer listening.
Always replies with text. Saves TTS cost.
Replies in the same format it received (audio → audio, text → text).
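The three reply behaviors reduce to a small decision function. The sketch below is illustrative only: the `ResponseMode` enum, its member names, and `choose_reply_format` are hypothetical and do not come from the project.

```python
from enum import Enum


class ResponseMode(Enum):
    AUDIO = "audio"    # always reply with audio
    TEXT = "text"      # always reply with text (no TTS cost)
    MIRROR = "mirror"  # reply in the format of the incoming message


def choose_reply_format(mode: ResponseMode, incoming_is_audio: bool) -> str:
    """Return "audio" or "text" for the outgoing reply."""
    if mode is ResponseMode.AUDIO:
        return "audio"
    if mode is ResponseMode.TEXT:
        return "text"
    # MIRROR: audio in gives audio out, text in gives text out.
    return "audio" if incoming_is_audio else "text"
```

Only a result of `"audio"` would need a call to the synthesis endpoint, so in text mode the TTS step is skipped entirely.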
FastAPI + Uvicorn
Celery + Redis
MinIO (S3-compatible)
FFmpeg
Groq Whisper Large v3
ElevenLabs Multilingual v2
OpenAI TTS-1-HD
Edge TTS (Microsoft)
STT, TTS, format conversion, chatbot integration: I have experience with the challenges these bring in production.