Back to Resources
    Guide
    Agentic
    GenAI
    Guardrails

    Agentic AI Systems: Guardrails, Evals, and Human-in-the-Loop

    How to design multi-agent automations that pass a security review: policy→prompts, eval suites, safety gates, and HITL patterns.

    12 min readJanuary 25, 2025Dr. Amara Okafor, Lead, Agentic AI Practice
    Multi-agent system architecture diagram showing guardrails and human review checkpoints

    Risk model and policy mapping (PII, actions, approvals)

    Before building agents, map your organization's policies to technical controls. Ask:

    What can go wrong?

    Identify failure modes:

  1. Data leakage: PII, confidential info in prompts/responses
  2. Unauthorized actions: Agents exceeding their authority
  3. Cost overruns: Uncontrolled API usage
  4. Quality issues: Hallucinations, incorrect outputs
  5. Security: Prompt injection, jailbreaks
  6. Policy → technical controls

    Translate policies into enforceable rules:

  7. "Don't share customer PII" → Input/output filters, redaction
  8. "Require approval for orders >$10K" → Human-in-the-loop gate
  9. "Limit AI spending to $500/day" → Rate limits, budget tracking
  10. "Audit all decisions" → Comprehensive logging
  11. Document this mapping. It becomes your acceptance criteria.

    Prompt & tool contracts (least-privilege)

    Design agents with narrow, explicit capabilities. Don't give a customer-service agent access to your entire API—scope tools to minimum necessary permissions.

    Prompt contracts

    Define clear interfaces for each agent:

  12. Input schema: What data the agent receives (typed, validated)
  13. Output schema: What the agent returns (structured, not free-form)
  14. Constraints: Boundaries the agent must respect
  15. Example:

    
    

    Agent: Order Processor

    Input: { orderId: string, action: "cancel" | "refund" }

    Output: { success: boolean, message: string, auditLog: string }

    Constraints:

    - Order must belong to authenticated user

    - Refunds ≤$10K auto-approve; >$10K route to human

    - All actions logged to audit table

    
    
    

    Tool least-privilege

    Provide agents only the tools they need:

  16. Customer service agent: Read orders, create support tickets (no delete)
  17. Analyst agent: Read-only database access (no write)
  18. Automation agent: Execute approved workflows (no arbitrary code)
  19. Enforce through API keys with scoped permissions, not through prompts alone—prompts can be jailbroken.

    Evals you need (accuracy, jailbreak, toxicity, cost)

    Continuous evaluation prevents silent degradation. Build automated test suites:

    Accuracy evals

    Test agent outputs against golden datasets:

  20. Does the agent extract correct information?
  21. Are calculations accurate?
  22. Do responses match expected format?
  23. Run on every deployment. Regression should trigger rollback.

    Safety evals

    Test for adversarial behavior:

  24. Jailbreak attempts: Can users trick the agent into ignoring rules?
  25. Prompt injection: Can users manipulate the agent's instructions?
  26. PII leakage: Does the agent expose sensitive data?
  27. Maintain a "red team" dataset of known attacks. Add new attack vectors as discovered.

    Quality evals

    Measure subjective quality:

  28. Relevance: Does the response address the query?
  29. Coherence: Is the response logically consistent?
  30. Toxicity: Does the agent generate harmful content?
  31. Use LLM-as-judge or human raters. Set acceptance thresholds (e.g., ≥4.0/5.0 average).

    Cost evals

    Track operational costs:

  32. Tokens per interaction
  33. API calls per workflow
  34. Average cost per user session
  35. Set budget alerts. If costs spike, investigate prompt inefficiencies or abuse.

    Minimal eval suite checklist

    Before production:

  36. [ ] 100+ accuracy test cases (happy path + edge cases)
  37. [ ] 50+ safety test cases (jailbreaks, injections)
  38. [ ] Cost per interaction measured and within budget
  39. [ ] Quality spot-checked by human raters (n≥50)
  40. [ ] All evals automated in CI/CD
  41. Safety gates & rollback strategies

    Even with evals, things break. Build layered defenses:

    Pre-flight checks

    Before executing actions, validate:

  42. Input schema compliance
  43. User authorization
  44. Rate limits not exceeded
  45. Known-bad patterns not present
  46. Reject invalid requests before they reach the agent.

    Runtime guardrails

    While agents run, monitor:

  47. Token usage (abort if exceeds threshold)
  48. Confidence scores (route low-confidence to human)
  49. Execution time (timeout if too slow)
  50. Implement circuit breakers—if error rate crosses threshold, disable agent and route to fallback.

    Post-execution validation

    After agent completes, check:

  51. Output schema compliance
  52. Sensitive data redaction
  53. Audit log completeness
  54. Don't return outputs that fail validation. Log failures and alert ops.

    Rollback strategy

    When issues are detected:

    1. Immediate: Disable agent, route traffic to fallback (static responses, human queue)

    2. Triage: Review logs, identify root cause

    3. Fix: Update prompts, retrain models, patch code

    4. Re-eval: Run full eval suite

    5. Gradual re-deploy: Canary → pilot → full rollout

    Maintain version history. Fast rollback is essential.

    Human-in-the-loop UI patterns (approve/annotate/retry)

    Agents augment humans, not replace them. Design HITL workflows that keep humans in control:

    Approval workflows

    Route high-stakes decisions to humans:

  55. Present agent recommendation + confidence + reasoning
  56. Show relevant context (order history, customer profile)
  57. Provide approve/reject/modify actions
  58. Track approval latency (SLA monitoring)
  59. Annotation workflows

    Humans correct agent mistakes to improve future performance:

  60. Show agent output vs. expected output
  61. Provide easy correction interface (edit, select correct option)
  62. Feed corrections back into retraining pipeline
  63. Measure annotation quality (inter-rater agreement) to ensure reliable ground truth.

    Retry workflows

    When agents fail, let humans retry with adjustments:

  64. Show error message and context
  65. Allow manual parameter tweaks (temperature, prompt modifications)
  66. Re-run agent with new settings
  67. Log retry attempts for later analysis
  68. HITL reviewer flow (state machine)

    
    

    [Agent Completes]

    ├─→ High confidence → Auto-approve → [Done]

    ├─→ Medium confidence → Human review → Approve/Reject → [Done]

    └─→ Low confidence → Human override → Manual completion → [Done]

    [Human Review]

    ├─→ Approve: Log acceptance, execute action

    ├─→ Reject: Log rejection reason, route to manual queue

    └─→ Modify: Annotate correction, re-run agent, log update

    
    
    

    Observability: success, fallback, cost

    Instrument everything. You can't improve what you don't measure.

    Success metrics

    Track:

  69. Completion rate: % of agent runs that succeed
  70. Accuracy: % of outputs matching expected results (from evals)
  71. Latency: P50/P95/P99 response times
  72. User satisfaction: Thumbs up/down, CSAT surveys
  73. Fallback metrics

    When agents fail:

  74. Fallback rate: % of requests routed to human/static fallback
  75. Fallback reasons: Categorize failures (low confidence, timeout, error)
  76. Recovery time: How long until agent restored after incident
  77. Cost metrics

    Monitor spending:

  78. Token usage: Tokens per request, daily/monthly totals
  79. API costs: Dollars per interaction, by model/provider
  80. Infrastructure: Compute, storage, bandwidth
  81. Set budgets and alerts. Cost spikes often indicate abuse or inefficiency.

    Run logs

    Store comprehensive logs for every agent execution:

  82. Timestamp, user ID, session ID
  83. Input (prompt, parameters, context)
  84. Output (response, confidence, tokens used)
  85. Actions taken (API calls, database writes)
  86. Success/failure status and error messages
  87. Enable searchability (Elasticsearch, CloudWatch Logs Insights). Logs are your debugging and audit trail.

    Frequently asked questions

    Ready to get started?

    Let's discuss how these patterns apply to your deployment.