How to Build an AI Agent: A 2026 Production Playbook

Eight steps from "we should build an agent" to a production system real users actually use, with the code patterns and the hard-earned lessons. April 2026 reference.

By Christian Vismara · 2026-04-29

Building an AI agent in April 2026 means: define a narrow task, pick a frontier model (Claude Opus 4.7 or GPT-5.5), choose an orchestration framework (Vercel AI SDK v6 for TypeScript or LangGraph 1.0 for stateful Python), wire in tool use and memory, add observability via Langfuse or LangSmith, build an eval set before shipping, and deploy with hard step and cost limits. Most production agents ship in 4-8 weeks.

What an AI agent actually is

An AI agent is software that uses an LLM to plan and execute multi-step tasks. The LLM is the decision-maker. The orchestration layer is the body that holds state, calls tools, and enforces limits.

Three things separate an agent from a chatbot:

  • Multi-step planning. The agent decides what to do next based on what just happened, not just what the user said.
  • Tool use. The agent calls external functions (APIs, databases, services) to take actions or retrieve information.
  • Bounded autonomy. The agent runs without human input until it completes the task, hits a step limit, or needs to escalate.

Step 1: Define the task scope

The single biggest predictor of whether an agent project succeeds is scope clarity. Agents that try to "handle anything a customer might ask" fail unpredictably. Agents that handle one well-defined task succeed reliably.

Good agent scope looks like:

  • Inputs are bounded. Customer message + their account context, not arbitrary input from anywhere.
  • Tools are bounded. 3-8 tools, not 50.
  • Success is measurable. "Resolved the ticket without escalation" or "Booked a qualified meeting" or "Generated a draft post within brand guidelines."
  • Failure is recoverable. Worst case the agent fails and a human picks up. Not worst case the agent processes a million-dollar refund.

If your scope can't fit those four constraints, narrow the scope. Don't build the agent yet.

Step 2: Choose your model and framework

April 2026 default choices for production agents:

Model selection

  • Claude Opus 4.7 when complex reasoning, agentic coding, or long-horizon autonomous work matters more than cost. $5/M input, $25/M output.
  • Claude Sonnet 4.6 for the bulk of production traffic. Strong tool use, lower cost.
  • GPT-5.5 when you need vision, lower hallucination rate, or you're standardised on OpenAI.
  • Gemini 3.1 Pro for cost-effective long-context (1M tokens). Strong on Google Cloud workloads.
  • Llama 4 Maverick / Mistral Large 3 / DeepSeek V4 when cost or data residency dominates.

Framework selection

  • Vercel AI SDK v6 for TypeScript/Next.js builds. Has the Agent abstraction, ToolLoopAgent, full MCP support, and integrates with Vercel AI Gateway.
  • OpenAI Agents SDK for Python single-agent with sandbox execution. April 2026 update is GA.
  • LangGraph 1.0 for stateful multi-step agents with explicit state transitions. Best for agents that need durable execution.
  • Claude Agent SDK when you want subagent transcript helpers, session storage, and Anthropic-native skills support.
  • CrewAI v1.10 for role-based multi-agent collaboration. Faster to ship than LangGraph for that pattern.
  • Mastra 1.0 for TypeScript-first teams who want a workflow engine plus agent abstractions in one stack.
  • Custom orchestrator for simple agents where the framework is more burden than benefit. ~200 lines of TypeScript or Python.

Step 3: Architecture (planner, executor, tools, memory)

A typical production agent has four parts:

The planner. The LLM call that decides what to do next given the goal and current state. In simple agents this is one prompt. In stateful ones (LangGraph) it's a node in the graph.

The executor. The runtime that calls tools, parses results, and feeds them back to the planner. Vercel AI SDK's Agent abstraction does this for you. So does LangGraph. Custom code does it manually.

The tools. Functions the agent can call. Database queries, API calls, file operations. Each tool has a schema (input/output types) and a description (what it does, when to use it).

The memory. Short-term: the conversation history kept inside the context window during one session. Long-term: external storage (vector DB or structured store) the agent retrieves from at the start of each turn.
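
A minimal sketch of how those four parts map to code in TypeScript (the type names here are illustrative, not from any framework):

// Illustrative shapes only, not from any specific framework.
type Tool = {
  name: string;                                  // what the planner sees
  description: string;                           // when to use it
  execute: (input: unknown) => Promise<string>;  // what the executor runs
};

type AgentState = {
  goal: string;          // the task the planner reasons about
  messages: unknown[];   // short-term memory: the running transcript
  stepCount: number;     // checked against a hard step limit
  costUsd: number;       // checked against a per-session budget
};

The executor is then just a loop: planner call, tool execution, append the result to messages, repeat until the planner stops asking for tools or a limit trips. Step 4 shows that loop concretely.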

Step 4: Tool use (with code)

Tool use is where the agent gets its hands on real systems. Function calling is the mechanism: the LLM emits structured JSON describing which function to call with which arguments, the runtime executes it, the result feeds back.

Example with the Anthropic SDK in TypeScript (Claude Opus 4.7):

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const tools: Anthropic.Tool[] = [{
  name: 'get_order_status',
  description: 'Look up the status of a customer order by order ID.',
  input_schema: {
    type: 'object',
    properties: {
      order_id: { type: 'string', description: 'The order ID' }
    },
    required: ['order_id']
  }
}];

async function runAgent(question: string) {
  const messages: Anthropic.MessageParam[] = [{ role: 'user', content: question }];
  let response = await client.messages.create({
    model: 'claude-opus-4-7',
    max_tokens: 1024,
    tools,
    messages,
  });

  // Keep looping while the model asks for a tool; feed each result back in.
  while (response.stop_reason === 'tool_use') {
    const toolUse = response.content.find(
      (c): c is Anthropic.ToolUseBlock => c.type === 'tool_use'
    );
    if (!toolUse) break;

    // executeToolCall is your own dispatcher: it maps the tool name to real code.
    const result = await executeToolCall(toolUse.name, toolUse.input);
    messages.push(
      { role: 'assistant', content: response.content },
      { role: 'user', content: [{ type: 'tool_result', tool_use_id: toolUse.id, content: result }] }
    );
    response = await client.messages.create({
      model: 'claude-opus-4-7',
      max_tokens: 1024,
      tools,
      messages,
    });
  }

  const finalText = response.content.find(
    (c): c is Anthropic.TextBlock => c.type === 'text'
  );
  return finalText?.text;
}
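
executeToolCall above is your own code, not part of the SDK. A minimal dispatcher looks like this (the fetchOrderStatus helper is a stand-in for your own API or database call):

// Hypothetical implementations; map each tool name to real code.
async function executeToolCall(name: string, input: unknown): Promise<string> {
  switch (name) {
    case 'get_order_status': {
      const { order_id } = input as { order_id: string };
      const order = await fetchOrderStatus(order_id); // your own API/DB lookup
      return JSON.stringify(order);
    }
    default:
      // Unknown tool: throw so the failure is visible, never return empty.
      throw new Error(`Unknown tool: ${name}`);
  }
}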

Same logic with Vercel AI SDK v6 (cleaner, less boilerplate):

import { generateText, tool } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';

const result = await generateText({
  model: anthropic('claude-opus-4-7'),
  tools: {
    get_order_status: tool({
      description: 'Look up the status of a customer order by order ID.',
      parameters: z.object({ order_id: z.string() }),
      execute: async ({ order_id }) => await fetchOrderStatus(order_id),
    }),
  },
  maxSteps: 10,
  prompt: 'Where is order #12345?',
});

console.log(result.text);

Step 5: Memory (short-term and long-term)

Short-term memory is the conversation history. Just an array of messages passed back to the LLM each turn. Trim it when it approaches the context window.
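
A simple trimming approach looks like the sketch below; the token estimate and budget are placeholders you'd replace with a real tokenizer and your model's actual limit.

// Rough sketch: keep the most recent messages under a token budget.
function trimHistory(messages: { role: string; content: string }[], maxTokens = 100_000) {
  // Placeholder heuristic (~4 chars per token); use a real tokenizer in practice.
  const estimateTokens = (text: string) => Math.ceil(text.length / 4);
  let total = 0;
  const kept: typeof messages = [];
  // Walk backwards so the newest turns survive the trim.
  for (let i = messages.length - 1; i >= 0; i--) {
    total += estimateTokens(messages[i].content);
    if (total > maxTokens) break;
    kept.unshift(messages[i]);
  }
  return kept;
}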

Long-term memory needs deliberate design. Common pattern: a vector store of past interactions, retrieved at the start of each session and prepended to the system prompt.

import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';

async function loadRelevantMemories(userId: string, query: string) {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-large'),
    value: query,
  });

  // Search vector DB (pgvector, Turbopuffer, Pinecone, etc.)
  const memories = await vectorStore.search({
    userId,
    embedding,
    limit: 5,
  });

  return memories.map(m => m.text).join('\n');
}
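
Wiring the retrieved memories in is then one step at session start, something like the following (userId, firstUserMessage, and baseSystemPrompt are whatever your app already has):

const memories = await loadRelevantMemories(userId, firstUserMessage);
const systemPrompt = `${baseSystemPrompt}\n\nRelevant context from past sessions:\n${memories}`;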

Don't over-engineer memory. Most production agents ship with conversation history only. Add long-term memory when you have a concrete use case for it (the user expects continuity across sessions, the agent needs facts learned in past interactions).

Step 6: Observability

Every prompt, every tool call, every retry: logged. April 2026 standards:

  • Langfuse for open-source observability. Self-hostable, OpenTelemetry GenAI compliant, full tracing of LLM calls.
  • LangSmith for tight LangGraph integration. Managed only.
  • Helicone for drop-in proxy with no code changes. Generous free tier.
  • Vercel AI Gateway if you're on Vercel. Sub-20ms routing, dashboard metrics, multi-provider failover.

What to log: every prompt, every model response, every tool call with inputs and outputs, total tokens, latency, cost. Without this you can't debug. The teams that skip observability are the teams that have unfixable agents in production.
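
If you're not ready to adopt one of those platforms yet, even a hand-rolled wrapper captures the essentials. A sketch (the sink and field names are illustrative):

// Minimal hand-rolled trace record: one entry per LLM call or tool call.
type TraceEvent = {
  sessionId: string;
  kind: 'llm_call' | 'tool_call';
  name: string;          // model name or tool name
  input: unknown;
  output: unknown;
  latencyMs: number;
};

async function traced<T>(
  base: { sessionId: string; kind: 'llm_call' | 'tool_call'; name: string; input: unknown },
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  const output = await fn();
  const event: TraceEvent = { ...base, output, latencyMs: Date.now() - start };
  // Swap console.log for your real sink (DB table, Langfuse, an OTel exporter).
  // Token counts and cost come off the provider response; add them here too.
  console.log(JSON.stringify(event));
  return output;
}

Wrap every model call and every tool execution in it, and you have enough to debug most production incidents.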

Step 7: Evaluation

An eval set is a list of (input, expected behaviour) pairs you run on every prompt change. Without it you can't tell if a tweak made the agent better or worse.

const evalSet = [
  {
    input: 'Where is order #12345?',
    expectedTools: ['get_order_status'],
    expectedOutcome: 'Provides order status with tracking link'
  },
  {
    input: 'I want to cancel my subscription',
    expectedTools: ['get_subscription', 'cancel_subscription'],
    expectedOutcome: 'Confirms cancellation with effective date'
  },
  // ... 50-200 cases
];

async function runEval() {
  const results = await Promise.all(
    evalSet.map(async (testCase) => {
      const result = await runAgent(testCase.input);
      const judgment = await llmJudge(result, testCase.expectedOutcome);
      return { case: testCase, result, pass: judgment.pass };
    })
  );

  const passRate = results.filter(r => r.pass).length / results.length;
  console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
}
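
llmJudge above is a small LLM-as-judge call. One way to write it is with the AI SDK's generateObject, assuming the v6 API keeps today's shape; the model choice and prompt wording are up to you:

import { generateObject } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { z } from 'zod';

async function llmJudge(agentOutput: string | undefined, expectedOutcome: string) {
  const { object } = await generateObject({
    model: anthropic('claude-sonnet-4-6'), // a cheaper model is fine for judging
    schema: z.object({ pass: z.boolean(), reason: z.string() }),
    prompt: `Expected outcome: ${expectedOutcome}\n\nAgent output: ${agentOutput}\n\nDoes the output satisfy the expected outcome?`,
  });
  return object;
}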

Build the eval set before you ship. We've never seen a team that built one regret it. We've seen plenty of teams that skipped it and regretted it.

Step 8: Deployment

Standard 2026 deployment patterns:

  • Vercel for Next.js + Vercel AI SDK builds. Cleanest path. Includes Vercel AI Gateway out of the box.
  • AWS Lambda + API Gateway for serverless TypeScript or Python. Standard for enterprise compliance.
  • Containerised on ECS / Cloud Run / Kubernetes for long-running agents that need durable connections.
  • Self-hosted on a VPS for n8n-based automation workflows. $20/month and no per-execution costs.

Pick based on what you already run. Don't introduce new infrastructure for an agent build unless the project specifically needs it.

Common production gotchas

Cost spikes from runaway agents. An agent that loops can burn $50-500 in tokens before you notice. Hard step limits and per-session cost limits. Always.
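
A per-session guard is a few lines; the limits below are illustrative, not recommendations:

// Sketch: abort the loop before the session blows its budget.
const MAX_STEPS = 15;
const MAX_COST_USD = 2.0;

function checkLimits(stepCount: number, costSoFarUsd: number) {
  if (stepCount >= MAX_STEPS) throw new Error('Step limit reached, escalating to human');
  if (costSoFarUsd >= MAX_COST_USD) throw new Error('Cost limit reached, escalating to human');
}

Call it at the top of every planner iteration (the while loop in Step 4) and catch the error at the session boundary.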

Tool description quality matters more than tool code. The LLM picks which tool to use based on the description. Vague descriptions = wrong tool calls. Spend time on descriptions.

System prompts get long fast. Then they hit the context window and the agent quality degrades. Periodically audit the system prompt; cut anything not earning its tokens.

Tools that fail silently are the worst. A tool that returns null when it should return data lets the LLM happily summarise the null as a normal result. Tools should throw on failure, not return empty.
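
In code, that means the tool's lookup throws instead of returning nothing; a sketch (db here stands in for your own data layer):

// Bad: return null and let the LLM narrate a non-answer.
// Good: throw, catch it in the executor, and surface an explicit error string.
async function getOrderStatus(orderId: string) {
  const order = await db.orders.findById(orderId); // db is your own data layer
  if (!order) {
    throw new Error(`No order found for ID ${orderId}`);
  }
  return order;
}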

Evaluating with another LLM has bias. LLM-as-judge is fast but biased toward verbose, confident outputs. Pair with deterministic checks where possible (did the right tool get called?).
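
A deterministic check next to the judge is cheap. A sketch, assuming your trace records which tools the agent actually called:

// Deterministic eval: did the agent call exactly the tools the case expects?
function checkToolCalls(expectedTools: string[], actualToolCalls: string[]) {
  const missing = expectedTools.filter(t => !actualToolCalls.includes(t));
  const unexpected = actualToolCalls.filter(t => !expectedTools.includes(t));
  return { pass: missing.length === 0 && unexpected.length === 0, missing, unexpected };
}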

A 4-week build plan

Realistic timeline for a single production agent:

Week 1: Scope. Define inputs, outputs, tools, success criteria. Build the eval set first (50+ test cases). Pick model and framework.

Week 2: Core build. System prompt, tool definitions, planner-executor loop. Get the agent passing 60% of the eval set on the happy path.

Week 3: Edge cases and observability. Tune prompts, add fallbacks, wire in Langfuse or LangSmith. Get the eval pass rate to 85%+.

Week 4: Production hardening. Deploy. Add cost limits, alerting, monitoring. Run on real traffic with low rollout (10% of users). Tune from real failures, not eval failures.

Most agents need another 2-3 weeks of tuning post-launch before they hit their stable production behaviour. Plan for it.

Frequently Asked Questions

How do I build my first AI agent?
Pick a narrow well-scoped task. Use Vercel AI SDK v6's Agent abstraction (TypeScript) or the Anthropic Python SDK with tool calling. 50 lines of code gets you a working agent. Add complexity only when the simple version hits a real wall, not before.

Should I still use LangChain?
LangChain (the original chain abstraction) is legacy in 2026. Use LangGraph 1.0 if you need stateful multi-step agents with explicit transitions, or Vercel AI SDK v6 for TypeScript-first builds. Custom orchestration (no framework) is the right call for simple agents where you want full control of the prompt loop.

Which model should I use?
Claude Opus 4.7 for complex reasoning, agentic coding, and long-horizon autonomous work. Claude Sonnet 4.6 for cost-sensitive production agents. GPT-5.5 for general-purpose, especially when vision or code generation matters. Open-source (Llama 4 Maverick) when cost or residency dominates. Pick per use case, not per ideology.

What guardrails does a production agent need?
Hard limits and observability. Max step count (10-20 typical). Max cost per session ($1-5 typical). Confidence threshold for escalation to human. Eval set tested on every prompt change. Real-time alerting on cost or latency anomalies. None of this is optional for production.

Do I need a vector database?
Only if your agent needs to retrieve information from a corpus larger than the context window. For internal docs Q&A, customer history lookup, or product catalogue search, yes. For agents that just take actions on structured data, no. Most agents we ship don't use a vector DB; the ones that do, use pgvector or Turbopuffer.

How long does it take to build a production agent?
4-8 weeks for most use cases. Single-task automation lands at 4 weeks. Multi-tool agents with several integrations and an eval set take 6-8. Multi-agent systems with handoffs take 8-12. The unhappy path is "we want a flexible agent that does many things" — those projects take 6 months and ship 3.

Building an agent and want senior eyes on the architecture?

30 minutes. We review your scope, the stack choice, and the failure modes before you write a line of code.