AgentCore Performance Loop

Observe. Evaluate. Improve.
Amazon Bedrock AgentCore | 2026.05
Overview

Agenda

  1. The Problem
  2. Platform Overview
  3. The Loop: Observe → Evaluate → Improve
  4. Phase 1: Observe
  5. Phase 2: Evaluate
  6. Phase 3: Improve
  7. Real Results & When to Use
Challenge

The Problem

  • 15-30% failure rate at launch
  • Models drift daily; teams fix weekly
  • Developers guess instead of measuring
  • Each fix introduces new regressions
  • Manual tuning cannot scale
Architecture

The Platform

  • Build: tools + memory + identity
  • Deploy: runtime + gateway
  • Assess: observe + evaluate
  • Any framework supported
Figure: AgentCore Platform Architecture
Core Concept

Config Change, Not Code

  • Swap prompts via config
  • No redeploy needed
  • Targets prompts + tool descriptions
  • Not MLOps for model weights
Figure: Observe → Evaluate → Improve Cycle
Capabilities

3 Core Capabilities

Recommendations
Config Bundles
A/B Testing
Promote / Rollback
  • Bundle = versioned immutable JSON
  • Contains prompt + tools + guardrails
  • Roll back in one click
  • Feedback loop: winner feeds next cycle
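
A minimal sketch of the bundle idea above, assuming a bundle is just an immutable record of prompt, tools, and guardrails. The class names, field names, and the content-hash version id are hypothetical illustrations, not the AgentCore API.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ConfigBundle:
    # Hypothetical field names; the slide only says prompt + tools + guardrails.
    prompt: str
    tools: Tuple[str, ...]
    guardrails: Tuple[str, ...]

    def to_json(self) -> str:
        # Sorted keys give a stable serialization, so equal bundles hash equally.
        return json.dumps(
            {
                "prompt": self.prompt,
                "tools": list(self.tools),
                "guardrails": list(self.guardrails),
            },
            sort_keys=True,
        )

    def version(self) -> str:
        # Content-derived version id (an assumption, not the service's scheme).
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()[:12]


class BundleHistory:
    """Stack of promoted bundles; rollback re-activates the previous one."""

    def __init__(self, initial: ConfigBundle) -> None:
        self._promoted: List[ConfigBundle] = [initial]

    @property
    def active(self) -> ConfigBundle:
        return self._promoted[-1]

    def promote(self, bundle: ConfigBundle) -> None:
        self._promoted.append(bundle)

    def rollback(self) -> ConfigBundle:
        # Never pop the very first bundle, so there is always an active config.
        if len(self._promoted) > 1:
            self._promoted.pop()
        return self.active
```

Because the agent only reads the active bundle, promoting or rolling back changes behavior without touching code, which is the "config change, not code" point.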
Phase 1

Observe

  • Every call auto-captured
  • OpenTelemetry format
  • Real-time dashboards
  • Sessions + latency + cost
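
To make "sessions + latency + cost" concrete, here is a rough sketch that rolls captured spans up into per-session totals. The span fields (`session_id`, `start_ns`, `end_ns`, `cost_usd`) are assumed for illustration and are not taken from the AgentCore or OpenTelemetry schemas.

```python
from collections import defaultdict
from typing import Dict, Iterable


def summarize_sessions(spans: Iterable[dict]) -> Dict[str, dict]:
    """Aggregate span-like records into call count, latency, and cost per session."""
    totals: Dict[str, dict] = defaultdict(
        lambda: {"calls": 0, "latency_ms": 0.0, "cost_usd": 0.0}
    )
    for span in spans:
        session = totals[span["session_id"]]
        session["calls"] += 1
        # Timestamps are assumed to be nanoseconds; convert duration to ms.
        session["latency_ms"] += (span["end_ns"] - span["start_ns"]) / 1e6
        # Cost is optional on a span; treat a missing value as zero.
        session["cost_usd"] += span.get("cost_usd", 0.0)
    return dict(totals)
```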
Figure: Observability Dashboard
Deep Dive

Trace Detail

Figure: Trace Detail View

Each step in the agent's reasoning chain

Phase 2

Evaluate

  • Auto-score every trace
  • 10 built-in dimensions
  • Alarm on score drops
  • Online + on-demand modes
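
One way "alarm on score drops" could work is to compare the mean of the most recent scores with the mean of the window before it. This is a sketch under assumed defaults (`window=50`, `max_drop=0.05`); the slides do not specify how the alarm is computed.

```python
from statistics import mean
from typing import Sequence


def score_dropped(
    scores: Sequence[float], window: int = 50, max_drop: float = 0.05
) -> bool:
    """Return True when the latest window's mean fell by more than max_drop."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare two full windows
    recent = mean(scores[-window:])
    baseline = mean(scores[-2 * window : -window])
    return (baseline - recent) > max_drop
```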
Figure: Evaluation Scores
Evaluators

Evaluator Types

Built-in: 10
Score Range: 0-1
Custom
  • Correctness · Tool Selection · Goal Success
  • Safety · Helpfulness · Faithfulness
  • Custom LLM-as-Judge for your domain
Phase 3

Improve

Traces → Recommendations → Config Bundle → Batch Eval → A/B Test
  • Needs 200+ traces to start
  • Compares good vs bad traces
  • Generates prompt revision
  • Human reviews before shipping
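
The first two bullets can be sketched as a simple gate and split. The 200-trace minimum comes from the slide; the `score` field and the 0.7 cut-off between good and bad traces are assumptions made only for this example.

```python
from typing import List, Tuple

MIN_TRACES = 200  # from the slide: "Needs 200+ traces to start"


def split_traces(
    traces: List[dict], threshold: float = 0.7
) -> Tuple[List[dict], List[dict]]:
    """Split scored traces into (good, bad) groups for comparison."""
    if len(traces) < MIN_TRACES:
        raise ValueError(f"need at least {MIN_TRACES} traces, got {len(traces)}")
    # threshold is a hypothetical cut-off on the 0-1 evaluation score.
    good = [t for t in traces if t["score"] >= threshold]
    bad = [t for t in traces if t["score"] < threshold]
    return good, bad
```

The two groups would then feed whatever generates the prompt revision, with a human reviewing the proposal before it ships.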
Validation

A/B Testing

User Traffic → Gateway → Control (v1) | Treatment (v2)
  • Split traffic by percentage
  • Sticky per session
  • 500-1000 sessions per variant
  • Statistical confidence before promotion
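
The split and the confidence check above might look like the following. Hashing the session id is one common way to get a sticky percentage split, and the two-proportion z-test is a standard choice of significance test; neither is confirmed to be what the gateway actually uses, and the function names are hypothetical.

```python
import hashlib
import math


def assign_variant(session_id: str, treatment_pct: int) -> str:
    """Deterministically map a session to a variant, so it stays sticky."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "treatment" if bucket < treatment_pct else "control"


def two_proportion_p_value(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> float:
    """Two-sided p-value for a difference in success rates (pooled z-test)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # all successes or all failures: no detectable difference
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))
```

With session counts in the range the slide suggests, a promote decision could require the p-value to fall below a chosen level such as 0.05 (also an assumption here).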
Evidence

Real Results

Dataset    Baseline   Treatment
Retail     0.856      0.962
Holdout    0.849      0.926
Rail       0.715      0.759

Source: Caylent Private Beta (2026) + NTT DATA
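
For reference, the lift implied by the Real Results figures above can be worked out directly. The scores are copied from the slide; the helper name is only illustrative.

```python
from typing import Tuple

# (baseline, treatment) scores as reported under Real Results.
RESULTS = {
    "Retail": (0.856, 0.962),
    "Holdout": (0.849, 0.926),
    "Rail": (0.715, 0.759),
}


def lift(baseline: float, treatment: float) -> Tuple[float, float]:
    """Return (absolute lift, relative lift) of treatment over baseline."""
    absolute = treatment - baseline
    relative = absolute / baseline
    return absolute, relative


for name, (baseline, treatment) in RESULTS.items():
    absolute, relative = lift(baseline, treatment)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

That works out to absolute gains of 0.106, 0.077, and 0.044, or roughly 12.4%, 9.1%, and 6.2% relative to baseline.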

Fit Assessment

When to Use

Strong Fit

  • Already on AgentCore
  • Tracing enabled
  • 500+ sessions/week
  • Quality metrics defined

Weak Fit

  • Not on AgentCore
  • Under 100 sessions/week
  • Cannot define "good"
  • Need vendor-neutral tooling
Takeaway

Key Takeaways

  • Human approves; AI proposes
  • Config change, not code change
  • Loop: Observe → Evaluate → Improve
  • Evidence replaces intuition
  • Preview available now

Questions?

Performance Loop · Public Preview · May 2026

docs.aws.amazon.com/bedrock/agentcore
