
The Problem

Chatbots built purely on LLMs suffer from a fundamental problem: non-determinism. The same input can produce different outputs, and critical business decisions are left at the mercy of "temperature" and random context.

Unacceptable Scenarios

✗ "Time hallucination": the LLM confirms an appointment at 2 p.m. when the user said 3 p.m. The patient misses the appointment and the clinic loses revenue.
✗ "Ambiguous confirmation": the user says "ok" and the LLM reads it as a confirmation, but it was "ok, got it" (an acknowledgement), not "ok, confirm".
✗ "Intent drift": a conversation about pricing turns into a booking without the user asking for it. The LLM is "helping too much".
✗ "Infinite loop": the LLM calls a tool, it fails, it calls again, it fails again... until a timeout hits or the cost explodes.

For an enterprise medical/dental scheduling system, this is unacceptable. The solution is not to "improve the prompt"; it is to separate what must be deterministic (business decisions) from what can be probabilistic (natural-language generation).

🏗️ Hybrid Architecture

The AI Engine implements a hybrid architecture in which the business logic is 100% deterministic (FSM + rules) and the LLM is used only for tasks where creativity is desirable (humanization, clarification).

Execution Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                              USER MESSAGE                                    │
└───────────────────────────────────┬─────────────────────────────────────────┘
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  workflow_hygiene → user_signal_node → semantic_routing                      │
│  (TTL/Reset)         (Intent Extract)    (Embeddings <10ms)                 │
└───────────────────────────────────┬─────────────────────────────────────────┘
                                    │
                    ┌───────────────┴───────────────────────┐
                    │                                       │
           [High Confidence]                       [Low Confidence]
                    │                                       │
                    ▼                                       ▼
┌───────────────────────────────┐       ┌─────────────────────────────────────┐
│     BYPASS LLM (Fast Path)    │       │   structured_analysis_node (LLM)    │
│     ~40% of requests          │       │   Intent + Entities Extraction      │
└───────────────┬───────────────┘       └─────────────────┬───────────────────┘
                │                                         │
                └────────────────────┬────────────────────┘
                                     ▼
        ┌────────────────────────────┴────────────────────────────┐
        │                                                          │
   [pricing]                  [schedule/reschedule/cancel]    [information]
        │                                                          │
        ▼                              ▼                           ▼
┌───────────────────┐   ┌──────────────────────────────┐   ┌──────────────┐
│ pricing_fast_lane │   │   BOOKING STATE MACHINE      │   │  agent_node  │
│ (Deterministic)   │   │   (FSM - 8 States)           │   │  (RAG/Tools) │
│ - RAG Search      │   │   ┌────────────────────────┐ │   │              │
│ - Cache Check     │   │   │ detect → discover_slots │ │   │              │
│ - Direct Response │   │   │ → await_selection       │ │   │              │
└─────────┬─────────┘   │   │ → await_confirmation    │ │   └──────┬───────┘
          │             │   │ → completed/handover    │ │          │
          │             │   └────────────────────────┘ │          │
          │             └──────────────┬───────────────┘          │
          │                            ▼                          │
          │             ┌──────────────────────────────┐          │
          │             │ booking_response_renderer    │          │
          │             │ (LLM Humanization)           │          │
          └─────────────┴──────────────┬───────────────┴──────────┘
                                       ▼
                        ┌──────────────────────────────┐
                        │    sanitize_followup_node    │
                        │    (Guard + Format)          │
                        └──────────────┬───────────────┘
                                       ▼
                        ┌──────────────────────────────┐
                        │      FINAL RESPONSE          │
                        └──────────────────────────────┘

LangGraph: Why Not Simple Chains?

graph_builder.py

# LangGraph offers granular control over flow
from langgraph.graph import StateGraph
from langgraph.types import RetryPolicy

graph = StateGraph(
    state_schema=ConversationState,  # Typed state
    context_schema=ContextSchema,    # Injected context
)

# Conditional edges based on state
graph.add_conditional_edges(
    "semantic_routing",
    lambda state: route_by_intent(state),  # Pure function
    {
        "pricing": "pricing_fast_lane",
        "appointment": "booking_flow_router",
        "information": "agent_node",
        "greeting": "format_response",
    }
)

# Differentiated RetryPolicy by error type
retry_policy = RetryPolicy(
    max_attempts=3,
    initial_interval=0.5,
    backoff_factor=2.0,
    retry_on=TRANSIENT_EXCEPTIONS,  # Only transient!
)

Chains are linear: input → processing → output. The booking flow needs complex branching: the user can cancel in the middle of a reschedule, ask for clarification, or escalate to a human at any point. LangGraph lets you model this as an actual graph.
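
The conditional edge above depends on route_by_intent being a pure function of the state. The source does not show that function, so the following is only a minimal sketch of what it could look like. The trimmed-down ConversationState, the pending_action rule, and the fallback to "information" are all assumptions made for illustration.

```python
from typing import TypedDict


class ConversationState(TypedDict, total=False):
    """Hypothetical, trimmed-down state; the real schema is not shown here."""
    routing_decision: str
    pending_action: str


# Route labels taken from the conditional-edge mapping in graph_builder.py.
KNOWN_ROUTES = {"pricing", "appointment", "information", "greeting"}


def route_by_intent(state: ConversationState) -> str:
    """Pure function: the same state always yields the same route label."""
    # Assumed rule: a pending confirmation always belongs to the booking flow.
    if state.get("pending_action", "").startswith("confirm_"):
        return "appointment"
    route = state.get("routing_decision", "")
    # Assumed fallback: unknown or missing routes go to the general agent.
    return route if route in KNOWN_ROUTES else "information"
```

Because the function reads only the state and has no side effects, it can be unit-tested without building the graph or calling any model.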

🎯 Semantic Routing

Before calling the LLM for structured analysis (expensive, ~200ms), the Semantic Router tries to classify the intent via embeddings (cheap, <10ms). If confidence is high, it bypasses the LLM entirely.

semantic_routing.py

async def semantic_routing_node(state: dict, service: Any, config: Any):
    """
    Semantic Routing (Pre-LLM).
    
    1. Check Priority-0 flags (Booking/Handover) → If yes, ignore semantic
    2. Call SemanticRouter.classify() via embeddings
    3. If High Confidence:
       a. Create synthetic structured_analysis
       b. Validate via RoutingDecisionEngine (Policies/Circuit Breaker)
       c. Set routing_decision + fast_path_taken=True
    4. If Low Confidence: add hints, proceed to LLM
    """
    
    # Priority-0: deterministic states have absolute priority
    pending_action = state.get("pending_action", "")
    if pending_action.startswith("confirm_"):
        # User is confirming something - don't reclassify
        return {"skip_semantic": True, "reason": "pending_confirmation"}
    
    # Classification via embeddings (~7ms p95)
    result = await semantic_router.classify(
        text=user_message,
        hybrid_alpha=0.7,        # 70% embedding, 30% keyword
        abstain_threshold=0.55,  # Below this → goes to LLM
        bias_weight=0.03,        # Boost for priority intents
    )
    
    if result.confidence >= 0.85:
        # High confidence: bypass LLM
        return {
            "routing_decision": result.route,
            "fast_path_taken": True,
            "semantic_confidence": result.confidence,
        }
    else:
        # Low confidence: hints for the LLM
        return {
            "semantic_hints": result.top_routes,
            "fast_path_taken": False,
        }

Hybrid Alpha: Embeddings + Keywords

Pure embeddings fail in cases where keywords are critical (e.g. "cancelar", cancel, vs "gostaria de remarcar", I would like to reschedule). The Semantic Router uses hybrid classification:

score_final = 0.7 × embedding_similarity + 0.3 × keyword_match

Configurable via env: SEMANTIC_ROUTER_HYBRID_ALPHA=0.7

HYBRID_ALPHA

0.7 = 70% embeddings. Lower values prioritize keywords.

ABSTAIN_THRESHOLD

0.55 = minimum confidence. Below this, the request goes to the LLM.

BIAS_WEIGHT

0.03 = boost for critical intents (booking > greeting).
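
To make the three parameters concrete, here is a small sketch of how the hybrid score and the abstain decision could be computed. The names RouteScore, hybrid_score, and classify are hypothetical, and so is the priority flag. Applying the bias additively and clamping the score at 1.0 are assumptions; the source only gives the weighted-sum formula and the parameter values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteScore:
    route: str
    embedding_similarity: float  # assumed to be normalized to [0, 1]
    keyword_match: float         # assumed to be normalized to [0, 1]
    priority: bool = False       # hypothetical flag for critical intents


def hybrid_score(s: RouteScore, alpha: float = 0.7, bias_weight: float = 0.03) -> float:
    """score = alpha * embedding + (1 - alpha) * keyword, plus an assumed additive bias."""
    score = alpha * s.embedding_similarity + (1 - alpha) * s.keyword_match
    if s.priority:
        score += bias_weight
    return min(score, 1.0)


def classify(candidates, alpha=0.7, abstain_threshold=0.55, bias_weight=0.03):
    """Return (route, score), or (None, score) when the router abstains.

    `candidates` must be a non-empty list of RouteScore.
    """
    best = max(candidates, key=lambda s: hybrid_score(s, alpha, bias_weight))
    score = hybrid_score(best, alpha, bias_weight)
    if score < abstain_threshold:
        return None, score  # below the threshold: defer to the LLM
    return best.route, score
```

With alpha at 0.7, an exact keyword hit can lift a route whose embedding similarity alone would lose to a semantically close but wrong intent, which is the "cancel" vs "reschedule" case described above.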

✓ Result

~40% of requests are classified with high confidence and bypass the LLM entirely. At ~$0.002 saved per bypassed request, that averages out to ~$0.0008 per request across all traffic, a significant reduction in OpenAI cost.

🔄 Booking State Machine

The heart of the booking flow is an FSM (Finite State Machine) that defines exactly which transitions are valid. It does not matter what the LLM "thinks": if the transition is not in the table, it does not happen.

States and Transitions

                                    ┌──────────┐
                                    │  detect  │
                                    └────┬─────┘
                    ┌───────────────────┼───────────────────┐
                    ▼                   ▼                   ▼
          ┌─────────────────┐  ┌───────────────┐  ┌──────────────────┐
          │ discover_booking│  │ discover_slots│  │  clarification   │
          │(search existing)│  │ (search slots)│  │  (ask details)   │
          └────────┬────────┘  └───────┬───────┘  └────────┬─────────┘
                   │                   │                   │
                   └─────────┬─────────┘                   │
                             ▼                             │
                   ┌─────────────────────┐                 │
                   │   await_selection   │◄────────────────┘
                   │ (multiple options)  │
                   └──────────┬──────────┘
                              ▼
                   ┌─────────────────────┐
                   │ await_confirmation  │
                   │ (user confirms)     │
                   └──────────┬──────────┘
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
        ┌───────────┐  ┌────────────┐  ┌─────────────┐
        │ completed │  │  handover  │  │ (back loop) │
        │ (success) │  │ (human)    │  │             │
        └───────────┘  └────────────┘  └─────────────┘

Valid Transitions (SSoT)

booking_fsm_contract.py

# Single Source of Truth for transitions
FSM_TRANSITIONS = {
    "detect": {
        "discover_booking",   # Search existing appointment
        "discover_slots",     # Search available slots
        "await_confirmation", # Direct confirmation
        "clarification",      # Ask for more details
        "handover",           # Escalate to human
    },
    "discover_booking": {
        "await_selection",    # Multiple found
        "discover_slots",     # Needs slots
        "await_confirmation", # Single found
        "clarification",      # Needs details
        "handover",
    },
    "discover_slots": {
        "await_selection",    # Multiple slots
        "await_confirmation", # Single slot
        "clarification",
        "handover",
    },
    "await_selection": {
        "await_confirmation", # User selected
        "discover_slots",     # New search
        "discover_booking",   # Change booking
        "handover",
    },
    "await_confirmation": {
        "completed",          # ✅ ERP executed
        "discover_slots",     # User changed mind
        "discover_booking",
        "handover",
    },
    # Terminal states
    "completed": {"completed", "handover"},
    "handover": {"handover"},
}

def is_valid_transition(from_state: str, to_state: str) -> bool:
    """Validate if transition is allowed."""
    return to_state in FSM_TRANSITIONS.get(from_state, set())

Supported Operations

schedule

New appointment. Slot search → selection → confirmation → ERP create.

reschedule

Reschedule an existing appointment. Booking lookup → slots → confirmation → ERP update.

cancel

Cancel an existing appointment. Booking lookup → confirmation → ERP delete.
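
One way to sanity-check the three operations against the transition table is to walk each happy path step by step. The sketch below copies FSM_TRANSITIONS from booking_fsm_contract.py. The HAPPY_PATHS dictionary and the first_invalid_step helper are hypothetical, and the stage sequences are one reading of the operation descriptions above, not paths confirmed by the source.

```python
# Transition table copied from booking_fsm_contract.py.
FSM_TRANSITIONS = {
    "detect": {"discover_booking", "discover_slots", "await_confirmation",
               "clarification", "handover"},
    "discover_booking": {"await_selection", "discover_slots",
                         "await_confirmation", "clarification", "handover"},
    "discover_slots": {"await_selection", "await_confirmation",
                       "clarification", "handover"},
    "await_selection": {"await_confirmation", "discover_slots",
                        "discover_booking", "handover"},
    "await_confirmation": {"completed", "discover_slots",
                           "discover_booking", "handover"},
    "completed": {"completed", "handover"},
    "handover": {"handover"},
}


def is_valid_transition(from_state: str, to_state: str) -> bool:
    return to_state in FSM_TRANSITIONS.get(from_state, set())


def first_invalid_step(path):
    """Return the first (from, to) pair the table rejects, or None if all pass."""
    for from_state, to_state in zip(path, path[1:]):
        if not is_valid_transition(from_state, to_state):
            return (from_state, to_state)
    return None


# Assumed happy paths, derived from the operation descriptions.
HAPPY_PATHS = {
    "schedule": ["detect", "discover_slots", "await_selection",
                 "await_confirmation", "completed"],
    "reschedule": ["detect", "discover_booking", "discover_slots",
                   "await_confirmation", "completed"],
    "cancel": ["detect", "discover_booking", "await_confirmation", "completed"],
}
```

A shortcut such as ["detect", "completed"] is rejected at its first step, so nothing can skip confirmation. The table as shown has no entry keyed by "clarification", so this check would also flag any path that continues from that state.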

📝 Context Composer (ECA)

The Context Composer implements the Enhanced Context Architecture (ECA): a deterministic context-assembly system that fetches data from multiple sources, applies a token budget, and formats the result into ordered blocks for the LLM.

context_composer.py

import asyncio
import os


class ContextComposer:
    """
    Composes unified context from multiple sources.
    
    Sources:
    - Memory Engine (handover state, customer memory)
    - Tenant Config (identity, rules)
    - Conversation History
    
    Features:
    - Budget management (CONTEXT_BUDGET_TOKENS=1200)
    - Priority-based truncation
    - Block-based format (never cuts IDENTITY, RULES, INPUT)
    """
    
    def __init__(self):
        self.budget_tokens = int(os.getenv("CONTEXT_BUDGET_TOKENS", "1200"))
        self.memory_engine_url = os.getenv("MEMORY_ENGINE_URL")
    
    async def compose(
        self,
        tenant_id: str,
        conversation_id: str,
        user_id: str,
        message: str,
        tenant_config: dict,
        existing_history: list | None = None,
    ) -> ContextContractV1:
        """
        Compose full context with budget enforcement.
        
        Block Priority (never truncated):
        1. IDENTITY - who the assistant is
        2. RULES - tenant business rules  
        3. INPUT - user message
        
        Truncatable (in priority order):
        4. MEMORY - customer memory
        5. HANDOVER - handover context
        6. FOCUS - focused context (RAG results)
        """
        # Fetch parallel (asyncio.gather)
        memory_ctx, handover_ctx = await asyncio.gather(
            self._fetch_memory(tenant_id, user_id),
            self._fetch_handover(tenant_id, conversation_id),
        )
        
        # Build identity from tenant config
        identity = self._build_identity(tenant_config)
        
        # Apply budget with priority truncation
        return self._apply_budget(
            identity=identity,
            memory=memory_ctx,
            handover=handover_ctx,
            message=message,
        )

Budget Management

With a 1200-token context and multiple sources, it is easy to blow the budget. The Context Composer uses smart truncation:

1. Untouchable blocks: IDENTITY, RULES, and INPUT are never truncated
2. Semantic compression: MEMORY can be compressed (90% reduction via the Memory Engine)
3. Priority-based truncation: FOCUS (RAG) is truncated first if needed
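
The truncation rules can be sketched as a loop that trims truncatable blocks in a fixed order until the total fits. This is an illustration under stated assumptions, not the actual _apply_budget. The Block type, the apply_budget signature, and the whitespace-based count_tokens are all hypothetical stand-ins, and the semantic compression done by the Memory Engine is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    text: str
    protected: bool = False  # IDENTITY, RULES, INPUT in the source


def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer (whitespace split); a real one would differ."""
    return len(text.split())


def truncate_to(text: str, max_tokens: int) -> str:
    return " ".join(text.split()[:max_tokens])


def apply_budget(blocks: list[Block], budget: int, drop_order: list[str]) -> list[Block]:
    """Trim truncatable blocks in place, following drop_order, until the total fits."""
    by_name = {b.name: b for b in blocks}
    total = sum(count_tokens(b.text) for b in blocks)
    for name in drop_order:
        if total <= budget:
            break
        block = by_name.get(name)
        if block is None or block.protected:
            continue  # protected blocks are never cut
        size = count_tokens(block.text)
        keep = max(size - (total - budget), 0)
        block.text = truncate_to(block.text, keep)
        total -= size - keep
    return blocks
```

Calling it with drop_order=["FOCUS", "HANDOVER", "MEMORY"] trims RAG results first and customer memory last. If the protected blocks alone exceed the budget, this sketch leaves them intact and returns an over-budget context; how the real composer handles that case is not stated in the source.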

⚡ Pricing Fast Lane

Price queries are the most common use case (~35% of messages) and follow a predictable pattern: RAG search + formatting. The Pricing Fast Lane executes this directly, without going through agent_node (which runs expensive LLM iterations).

pricing_fast_lane.py

# ==============================================================================
# 🔧 NOTE ON TOOL BUDGET (Tech Lead Review)
# ==============================================================================
# This module (pricing_fast_lane) is NOT subject to tool_budget because:
# 1. It's an OPTIMIZED path that executes tool directly
# 2. It's DETERMINISTIC - doesn't enter iteration loop
# 3. tool_budget is for controlling agent ITERATIONS
# 4. If fast lane fails, terminates without handover
# ==============================================================================

async def pricing_fast_lane_node(state: dict, service: Any, config: Any):
    """
    Optimized path for price queries.
    
    Flow:
    1. Check cache (intelligent_cache_service)
    2. If miss: RAG search via docling_semantic_search_tool
    3. Format response directly (no LLM)
    4. Return with fast_lane_executed=True
    """
    # Intelligent cache with coalescing
    cache_key = build_intelligent_cache_key(
        tenant_id=config.tenant_id,
        query_normalized=normalize_query(user_message),
    )
    
    cached = await get_or_compute_with_coalescing(
        key=cache_key,
        compute_fn=lambda: _do_rag_search(state, service),
        ttl_seconds=300,
    )
    
    if cached:
        return {
            "pricing_response": cached,
            "fast_lane_executed": True,
            "cache_hit": True,
        }
    
    # RAG search (deterministic)
    results = await docling_semantic_search(
        query=user_message,
        tenant_id=config.tenant_id,
        k=5,
        threshold=0.7,  # Adaptive threshold
    )
    
    # Format response (template-based, no LLM)
    response = format_pricing_response(results)
    
    return {
        "pricing_response": response,
        "fast_lane_executed": True,
    }

⚡ Result

The fast lane cuts latency from ~2s (agent iteration) to ~200ms (direct RAG). Cache hit: <50ms. No LLM token cost for ~35% of requests.
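
The source does not show how get_or_compute_with_coalescing works internally. As a rough illustration of request coalescing in general, the sketch below lets concurrent cache misses for the same key share a single computation. Everything in it is an assumption: the function name get_or_compute_sketch, the in-process dictionaries (the real service presumably uses a shared store such as Redis), and the (value, cache_hit) return shape.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable

_cache: dict[str, tuple[float, Any]] = {}   # key -> (expires_at, value)
_in_flight: dict[str, asyncio.Task] = {}    # key -> shared pending computation


async def get_or_compute_sketch(
    key: str,
    compute_fn: Callable[[], Awaitable[Any]],
    ttl_seconds: float,
) -> tuple[Any, bool]:
    """Return (value, cache_hit); concurrent misses share one computation."""
    entry = _cache.get(key)
    if entry is not None and entry[0] > time.monotonic():
        return entry[1], True

    task = _in_flight.get(key)
    if task is None:
        # Leader: start the computation and publish it for followers.
        task = asyncio.create_task(compute_fn())
        _in_flight[key] = task

        def _finish(done: asyncio.Task) -> None:
            _in_flight.pop(key, None)
            # Only successful results are cached; failures are not.
            if not done.cancelled() and done.exception() is None:
                _cache[key] = (time.monotonic() + ttl_seconds, done.result())

        task.add_done_callback(_finish)

    # shield() keeps one cancelled caller from cancelling the shared task.
    return await asyncio.shield(task), False
```

Under a burst of identical price questions, only the first caller triggers the RAG search; the others wait on the same task, and later callers within the TTL are served from the cache.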

🔒 Booking Flow Reducer

The Reducer is the guardian of the booking_flow state. Every update passes through it, and it validates the source whitelist, the monotonicity of turn_seq, and the validity of the transition. If anything violates the invariants, the update is rejected.

booking_flow_reducer.py

# Whitelist of allowed sources
ALLOWED_UPDATERS = {
    "booking_flow_router",      # Main state machine router
    "discover_booking_node",    # Booking discovery (IO task)
    "discover_slots_node",      # Slot discovery (IO task)
    "confirm_booking",          # Confirm booking node
    "confirm_reschedule",       # Confirm reschedule node
    "confirm_cancel",           # Confirm cancel node
    "task_create_appointment",  # ERP mutation task
    "memory_engine_sync",       # Memory Engine persistence
    "workflow_hygiene",         # Workflow hygiene reset
}

class BookingFlowUpdate:
    """
    Validated booking_flow update with all constraints checked.
    
    Constraints:
    1. Whitelist validation (source must be in ALLOWED_UPDATERS)
    2. Monotonicity check (turn_seq >= current)
    3. Immutability guards (stage transitions must be valid)
    4. Automatic normalization (version++, updated_at)
    """
    
    def __init__(self, update: dict, current: dict, source: str):
        # Whitelist check
        if source not in ALLOWED_UPDATERS:
            booking_flow_rejections_total.labels(reason="unauthorized_source").inc()
            raise BookingFlowUpdateError(
                f"Source '{source}' not authorized to update booking_flow"
            )
        
        # Monotonicity check
        new_seq = update.get("turn_seq", 0)
        current_seq = current.get("turn_seq", 0)
        if new_seq < current_seq:
            booking_flow_rejections_total.labels(reason="out_of_order").inc()
            raise BookingFlowUpdateError(
                f"Out-of-order update: {new_seq} < {current_seq}"
            )
        
        # Transition validation
        new_stage = update.get("stage")
        current_stage = current.get("stage", "detect")
        if new_stage and not is_valid_transition(current_stage, new_stage):
            booking_flow_rejections_total.labels(reason="invalid_transition").inc()
            raise BookingFlowUpdateError(
                f"Invalid transition: {current_stage} → {new_stage}"
            )

Why a Reducer?

✓ Prevents race conditions: multiple nodes may try to update the state simultaneously. The Reducer serializes and validates.
✓ Auditing: every update is logged with source, turn_seq, and transition, which makes debugging easier.
✓ Metrics: Prometheus counters for updates and rejections by reason.
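
The BookingFlowUpdate docstring lists a fourth constraint, automatic normalization (version++, updated_at), that the excerpt does not show. A minimal sketch of that step, assuming the update has already passed the three checks, might look like this. The function name apply_validated_update and the updated_by audit field are hypothetical.

```python
from datetime import datetime, timezone


def apply_validated_update(current: dict, update: dict, source: str) -> dict:
    """Return a new booking_flow dict; assumes `update` already passed validation."""
    merged = {**current, **update}                        # update wins on conflicts
    merged["version"] = current.get("version", 0) + 1     # version++
    merged["updated_at"] = datetime.now(timezone.utc).isoformat()
    merged["updated_by"] = source                         # hypothetical audit field
    return merged
```

Returning a new dictionary instead of mutating `current` keeps the previous state available for logging the before/after transition.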

🔁 Retry Policies

Not every error deserves a retry. Transient errors (timeout, rate limit) are retriable. Permanent errors (auth failure, bad request) are not; retrying them only wastes time and money.

graph_builder.py

# SOTA FIX (December 2025): Transient vs Permanent Exceptions

TRANSIENT_EXCEPTIONS = (
    httpx.TimeoutException,
    httpx.ConnectError,
    httpx.ReadTimeout,
    openai.APIConnectionError,
    openai.RateLimitError,       # Retriable with backoff
    openai.APITimeoutError,
    ConnectionError,
    TimeoutError,
    asyncio.TimeoutError,
)

# Permanent exceptions should NOT be retried
PERMANENT_EXCEPTIONS = (
    openai.AuthenticationError,   # Invalid API key
    openai.BadRequestError,       # Malformed prompt
    openai.PermissionDeniedError, # No permission
    ValueError,
    TypeError,
    KeyError,
)

# RetryPolicy uses whitelist (retry_on), not blacklist
retry_policy = RetryPolicy(
    max_attempts=3,
    initial_interval=0.5,         # 500ms
    backoff_factor=2.0,           # waits ~500ms, then ~1s (3 attempts = 2 retries)
    retry_on=TRANSIENT_EXCEPTIONS,
)

⚠️ Why a whitelist and not a blacklist?

A blacklist is dangerous: if a new kind of permanent error shows up (e.g. a new exception in the OpenAI SDK), it would be retried by default. A whitelist is fail-safe: it only retries what we know to be transient.
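
The fail-safe behavior is easy to see in a plain retry loop. The sketch below is a standard-library stand-in, not LangGraph's RetryPolicy implementation. The name call_with_retry and the injectable sleep parameter are assumptions made to keep the example self-contained and testable.

```python
import time

# Stand-in whitelist; the real one is TRANSIENT_EXCEPTIONS in graph_builder.py.
TRANSIENT = (TimeoutError, ConnectionError)


def call_with_retry(fn, retry_on=TRANSIENT, max_attempts=3,
                    initial_interval=0.5, backoff_factor=2.0, sleep=time.sleep):
    """Call fn(), retrying only exceptions listed in retry_on."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            # Whitelisted error: retry unless this was the last attempt.
            if attempt == max_attempts:
                raise
            sleep(interval)
            interval *= backoff_factor
        # Any exception NOT in retry_on is never caught, so it propagates
        # on the first attempt without any retry.
```

An exception type that did not exist when the whitelist was written falls into the uncaught branch by construction, which is exactly the default the paragraph above argues for.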

📊 Results

~40%

Requests bypass the LLM via Semantic Routing

<10ms

Embedding-based classification (p95)

Invalid FSM transitions in production

~200ms

Pricing Fast Lane (vs ~2s agent)

Key Technical Decisions

→ FSM separate from the LLM: business decisions are deterministic. The LLM only humanizes.
→ Semantic routing pre-LLM: cheap classification before spending tokens.
→ Reducer with whitelist: only authorized sources update critical state.
→ Retry with whitelist: only transient exceptions are retried.
→ Fast lanes for known patterns: pricing does not need LLM iteration.

Technical Stack

Python 3.12 · LangGraph · LangChain · FastAPI · OpenAI GPT-4 · Embeddings · Redis · PostgreSQL · Prometheus · structlog
