AI Agents Under The Hood: How They Really Work

When people hear the phrase “AI agent,” it often sounds like something almost magical - a digital teammate that thinks, plans, and acts on its own. Headlines call them autonomous. Product demos look like science fiction.

Underneath all of that, the machinery is much more grounded. Once you see how it actually works, two things happen. It stops feeling mysterious, and it starts to feel even more powerful, because you can see where the real control points are.

This post is a plain-language walk through what goes on inside modern AI agents, from the raw prediction engine at their core up through tools, memory, and orchestration. If you build or rely on agents in your marketing stack, understanding this will help you separate hype from real capability.

What is the model actually doing?

At the center of almost every agent system sits a large language model (LLM) - ChatGPT, Claude, Gemini, and similar systems. Despite all the features built around them, they do one core thing: given the text seen so far, predict the most likely next token.

A token is a small chunk of text, roughly a word or piece of a word. The model has been trained on huge volumes of human-written material - articles, books, code, chats - and has learned the statistical patterns of language. Given a prompt, it keeps predicting the next token, then the next, and the next, until it has produced a full response.

A helpful way to think about this:

  • Your phone’s autocomplete suggests “Paris” after “The capital of France is”.

  • An LLM does the same kind of thing, but using far more context and trained on far more data.

The gap is not in the basic operation. The gap is in the scale.
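
To make that concrete, here is a toy sketch of the generation loop in Python. The `predict_next_token` function is a stand-in for the model, not a real API; actual LLMs score every token in their vocabulary and sample from the resulting probability distribution.

```python
def predict_next_token(text: str) -> str:
    # Stand-in for the model: a real LLM scores every token in its
    # vocabulary and picks (or samples from) the most likely options.
    lookup = {"The capital of France is": " Paris"}
    return lookup.get(text, "<end>")

def generate(prompt: str, max_tokens: int = 50) -> str:
    text = prompt
    for _ in range(max_tokens):
        token = predict_next_token(text)  # one prediction per loop pass
        if token == "<end>":              # stop token: the model is done
            break
        text += token
    return text

print(generate("The capital of France is"))
# -> "The capital of France is Paris"
```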

What the model does not do

It is important to be clear about what is not happening. The model:

  • Does not “look up” answers from a database by itself.

  • Does not have persistent memory from one conversation to the next unless you wire it in.

  • Does not have beliefs, goals, or awareness.

  • Does not reliably know when it is wrong, which is why hallucinations appear.

Every reply is generated on the fly, token by token, driven by patterns it has learned and by the context you give it. All of the perceived “intelligence” comes from this very simple mechanism repeated at large scale.

Why structured output changed everything for agents

Out of the box, a language model is very good at freeform text. If you ask, “Can you help me write an email?”, it will generate a helpful paragraph. That is fine for chat, but not great if you want to build software that can act on its replies.

For software, the problem is that natural language is messy. It is hard for other programs to reliably parse and act on.

The big step forward for agents came when people realised you could ask the model to answer in a rigid format, not just in sentences - most commonly, JSON. JSON is a simple data format that software systems use to pass structured information around.

Instead of answering:

“Sure, I can help you send that email.”

you push the model to reply with something like:

```json
{
  "action": "send_email",
  "to": "sarah@thirdi.ai",
  "subject": "Q3 report ready",
  "body": "Hi Sarah, the Q3 report is attached."
}
```

Now, another part of your system can read this and know exactly what to do next. The model is still predicting tokens, but those tokens form a machine-readable instruction instead of a paragraph.

This shift - from natural language to structured output - is the foundation for most modern agent frameworks and workflow tools. It turns a conversational system into a programmable one:

  • The model chooses what should happen next.

  • The JSON shape tells the orchestration layer how to execute it.

  • The tools do the real work.

  • The result comes back in and the loop continues.
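
As a rough illustration, here is what that routing step might look like in Python. The `send_email` function and the `TOOLS` registry are hypothetical stand-ins for real integrations:

```python
import json

def send_email(to: str, subject: str, body: str) -> str:
    # Hypothetical tool: a real implementation would call an email API.
    print(f"Emailing {to}: {subject}")
    return "sent"

TOOLS = {"send_email": send_email}  # the agent's allowed actions

def dispatch(model_output: str) -> str:
    # Parse the model's structured reply and route it to the named tool.
    call = json.loads(model_output)
    action = call.pop("action")
    if action not in TOOLS:
        raise ValueError(f"Unknown action: {action}")
    return TOOLS[action](**call)

dispatch('{"action": "send_email", "to": "sarah@thirdi.ai", '
         '"subject": "Q3 report ready", "body": "Hi Sarah, ..."}')
```

The key point is that the model never executes anything itself. It only emits the instruction; ordinary code decides whether and how to run it.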

What actually makes something an “agent”?

An AI agent is not just “an LLM with a prompt.” It is a collection of pieces that turn a text predictor into a system that can observe, act, and adapt.

You can think of a typical agent stack as several layers working together:

  1. User task
    Someone gives a goal in natural language - for example:
    “Summarise last week’s sales and email the summary to the team.”

  2. Language model
    The LLM reads the request plus any context you have supplied (like past messages or stored data) and produces a structured output describing what should happen next.

  3. Structured output
    The response is formatted as JSON or a similar schema, so your software can parse it. It might include an action, target, and any parameters needed to perform the step (one possible shape is sketched just after this list).

  4. Orchestration layer
    This part of the system reads the JSON and decides which tool to call, how to handle errors, and what the overall workflow should be. It routes tasks, sequences steps, and can enforce policies.

  5. Tools
    These are the actual capabilities: query a database, call an external API, send an email, write a file, generate a report, and so on. The model cannot run SQL or send email by itself; tools do that.

  6. Memory and context
    Agents can be given short-term context (within a session) and long-term memory (stored in a database). This lets them remember prior tasks, user settings, and previous outcomes.

  7. Feedback loop
    The result of each action is fed back to the model as new context. The model reads what happened, decides the next step, and the cycle repeats until the goal is complete.
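
To make layer 3 concrete, here is one possible shape for that structured output, sketched in Python. This is an assumption, not a standard: real frameworks each define their own function-calling format, and this version nests arguments under a `parameters` key rather than flattening them as in the earlier email example.

```python
import json
from typing import TypedDict

class ToolCall(TypedDict):
    # One possible shape for the model's next step; real frameworks
    # each define their own function-calling schema.
    action: str       # which tool to invoke, e.g. "send_email"
    parameters: dict  # arguments for that tool

def parse_tool_call(raw: str) -> ToolCall:
    # Validate before acting: never trust the model's output blindly.
    call = json.loads(raw)
    if not isinstance(call.get("action"), str):
        raise ValueError(f"missing or invalid 'action': {raw}")
    if not isinstance(call.get("parameters"), dict):
        raise ValueError(f"missing or invalid 'parameters': {raw}")
    return call
```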

A simple end-to-end example

Take the instruction:

“Summarise last week’s sales and email it to the team.”

A well-built agent might:

  1. Call a “database” tool to fetch last week’s sales.

  2. Use the model to generate a summary in the requested format.

  3. Call an “email” tool with the subject, body, and recipients.

  4. Log the result and update its memory that this task was completed.

From the outside, you just see a single sentence and a finished email. Inside, the system is running this loop: predict → output JSON → route → act → read result → repeat.
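
Stripped to its bones, that loop might look something like the sketch below. Both `call_model` and `run_tool` are scripted stand-ins so the example runs on its own; a real system would call a model API and a real tool registry.

```python
def call_model(context: list[str]) -> dict:
    # Stand-in for the LLM call. A real implementation would send the
    # context to a model API and parse its JSON reply; here two steps
    # are scripted so the loop runs on its own.
    if not any("query_sales_db returned" in m for m in context):
        return {"action": "query_sales_db", "parameters": {"period": "last_week"}}
    return {"action": "done", "parameters": {"result": "summary emailed to the team"}}

def run_tool(step: dict) -> str:
    # Stand-in for the orchestration layer's tool dispatch.
    return "42 orders, $18k revenue"  # pretend database result

def run_agent(task: str, max_steps: int = 10) -> str:
    context = [task]                           # short-term memory for this run
    for _ in range(max_steps):
        step = call_model(context)             # predict -> output JSON
        if step["action"] == "done":           # goal reached, exit the loop
            return step["parameters"]["result"]
        result = run_tool(step)                # route -> act
        context.append(f"{step['action']} returned: {result}")  # feed result back
    raise RuntimeError("agent did not finish within max_steps")

print(run_agent("Summarise last week's sales and email it to the team."))
```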

Why agents look like they are thinking

If you watch a mature agent system run, it often looks like a human workflow. It breaks the problem down, tries something, checks what happened, and adjusts if needed.

There are a few design patterns that create this effect.

Step-by-step reasoning

Instead of asking the model to jump straight to an answer, you can prompt it to “think step by step.” This is often called chain-of-thought prompting.

In practice, that means the model first lays out its plan in text:

  1. Identify which data source to query.

  2. Fetch sales data for last week.

  3. Summarise key metrics.

  4. Draft an email.

Then it follows that plan. This simple technique usually improves accuracy and makes it easier to see where things went wrong when they do.
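
In practice, much of this is prompt design. The wording below is illustrative only; teams iterate heavily on these instructions:

```python
# A sketch of a planning prompt. Exact wording varies widely in
# practice; the instructions here are hypothetical.
PLANNING_PROMPT = """You are an assistant that completes tasks using tools.
Before acting, write out a numbered plan of the steps you will take.
Then execute the plan one step at a time, emitting one JSON tool call
per step.

Task: {task}"""

prompt = PLANNING_PROMPT.format(
    task="Summarise last week's sales and email it to the team."
)
```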

Tool use

Agents are given a list of tools they are allowed to call: search, calculator, database, email, code runner, and more. The model is trained or prompted to decide when to answer from its own “knowledge” and when to use a tool.

This is what lets an agent:

  • Look up fresh numbers instead of guessing.

  • Run a calculation instead of trying to do math in text.

  • Pull from your own systems instead of only the training data.
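
Tools are usually described to the model as a name, a description, and a JSON Schema for the parameters. The definitions below are hypothetical examples in roughly the shape most provider APIs accept, though exact field names vary between them:

```python
# Hypothetical tool definitions the orchestration layer would pass to
# the model alongside the prompt.
TOOL_DEFINITIONS = [
    {
        "name": "query_sales_db",
        "description": "Fetch sales records for a date range.",
        "parameters": {
            "type": "object",
            "properties": {
                "start_date": {"type": "string", "description": "ISO date"},
                "end_date": {"type": "string", "description": "ISO date"},
            },
            "required": ["start_date", "end_date"],
        },
    },
    {
        "name": "send_email",
        "description": "Send an email to one or more recipients.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "array", "items": {"type": "string"}},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    },
]
```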

Multi-step planning

Some tasks cannot be done in one shot. An agent might need to:

  • Gather data from several places.

  • Transform and filter it.

  • Make a decision based on the result.

  • Trigger follow-up actions.

By breaking the work into smaller actions and looping through “observe → decide → act,” an agent can handle tasks that no single LLM response would cover.

Self-correction

When a tool call fails, returns an error, or comes back with unexpected data, the agent can adjust. It might retry with different parameters, fall back to another tool, or ask the user for clarification.

From the outside, this looks like it is “reconsidering” or “debugging.” Internally, it is still the same loop of token prediction, but the prompts and context steer it toward diagnosing and fixing issues across steps.
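
Here is a sketch of what that recovery logic can look like in the orchestration layer, reusing the hypothetical `run_tool` and `call_model` helpers from the loop sketch earlier. The key move is feeding the error text back into the model's context so its next prediction accounts for the failure:

```python
import time

def act_with_recovery(step: dict, max_retries: int = 2) -> str:
    # Error handling in the orchestration layer, assuming the
    # hypothetical run_tool and call_model helpers defined above.
    for attempt in range(max_retries + 1):
        try:
            return run_tool(step)
        except Exception as err:
            if attempt == max_retries:
                raise                          # give up after the last retry
            # Let the model see the failure and propose a corrected step.
            step = call_model([f"{step['action']} failed: {err}. "
                               "Retry with corrected parameters."])
            time.sleep(2 ** attempt)           # back off before the next try
```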

Agents do not think like humans. They iterate like machines - quickly and often enough that it can look like thinking.

Why language has become the interface

For decades, using software meant learning its interface: menus, buttons, keyboard shortcuts, command-line syntax. The burden was on people to learn how the system wanted to be used.

Agents flip that around. Now the primary “interface” is natural language.

Instead of clicking through a CRM to build a complex filter, you can say:

“Find all enterprise customers who have not had a touchpoint in 90 days and draft a personalised follow-up for each.”

The agent:

  • Figures out which tools to call (CRM, templating, email).

  • Pulls the right records.

  • Drafts the emails.

  • Either sends them or queues them for review, depending on how you configure it.

You are still using software. But instead of learning its internal language, you let the model handle that translation layer for you.

What this architecture means for products like third i 

If you are building on top of these systems, or trusting them with live work, understanding this stack is not just an academic exercise. It directly shapes how reliable and safe your agents are.

Most users do not need to know what JSON looks like or how function calling works. But product and engineering teams do, because the quality of the agent depends on choices made at each layer:

  • How clearly you define the tools the model can use.

  • How you structure prompts and schemas so the model’s output is predictable.

  • How your orchestration layer handles errors, retries, and edge cases.

  • How you design memory so agents learn from real context without leaking or mixing it.

The model itself is increasingly a commodity. You can swap providers or move between versions. The enduring value is in the scaffolding you build around that model: the tools, the workflows, the guardrails, and the way it all fits your domain.

Anyone can call an LLM API. Building an agent that does useful work reliably, with clear boundaries and safe behavior, is an engineering and design challenge. That is where the real product lives.

Bringing it back to AI agents in marketing

For a marketing or growth team, this architecture is not theoretical. It decides:

  • Whether your “campaign agent” reads data and suggests actions, or actually changes budgets.

  • Whether a reporting agent quietly pushes wrong numbers when a data source fails.

  • Whether a creative analysis agent understands fatigue and seasonality, or just ranks ads by CTR.

Once you see agents as structured systems instead of magic, you can ask better questions:

  • Which tools should this agent have access to?

  • Where must a human always stay in the loop?

  • What should happen when something goes wrong?

Those are the questions that matter when you trust agents with real spend.

The question for us at third i is not “will AI agents get powerful?” They already are. The real question is whether we understand them well enough to build on them carefully, and to put them to work on the boring, high-value parts of marketing without creating chaos.

That starts with seeing what is actually happening under the hood.
