Context Engineering for Autonomous Marketing Agents: How We Went from Context Overflow to 90% Accuracy

Aravind Nair

A media buyer managing six-figure monthly ad spend across Google, Meta, and TikTok asks your AI chatbot: "Why did my ROAS drop last week?"

Think about what that requires. The system needs to identify which platforms matter, pull campaign-level data with the right date ranges, compare against historical baselines, detect anomalies, generate a root-cause analysis, and deliver it in language a marketer understands. Without knowing in advance that this question was coming.

This is the core challenge we faced building CoDI, the conversational AI analyst inside Third i. CoDI handles everything from simple lookups ("What was my spend last month?") to multi-platform diagnostics ("Which campaigns should I pause and why?") across Google Ads, Meta Ads, TikTok Ads, Instagram, and GA4. The user types plain English. CoDI figures out the rest.

The problem isn't building an agent that answers marketing questions. It's building one that answers any marketing question without drowning in its own context.

Anthropic's engineering team recently named this challenge context engineering — the practice of curating the optimal set of tokens at each step of inference. They describe context as "a finite resource with diminishing marginal returns" and highlight a pattern they call progressive disclosure: agents incrementally discovering relevant context through exploration rather than loading everything upfront.

We arrived at the same pattern independently, through four architecture iterations. Here's what happened at each one.

Iteration 1: Text-to-SQL with RAG — Fast to Ship, Fragile in Production

We started with Vanna, an open-source text-to-SQL framework. We ingested DDL schemas, documentation, and question-SQL pairs into a vector store. When a user asked a question, Vanna retrieved similar examples, bundled them with schema context, and sent it to the LLM to generate SQL.

For narrow, single-turn questions, it worked. "What was my Facebook spend last month?" retrieved the right schema fragment, and the SQL was correct. Token usage was lean — roughly 3,000–4,000 tokens per query.
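The retrieval loop behind this was simple. Here's a minimal sketch of the pattern, using hypothetical helpers rather than Vanna's actual API: embed the question, pull the nearest schema fragments and question-SQL pairs, and ask the LLM for SQL.

```python
# Minimal sketch of the retrieve-then-generate loop in iteration 1.
# `vector_store`, `embed`, and `llm` are hypothetical stand-ins, not Vanna's API.

def answer_with_rag(question: str, vector_store, embed, llm, k: int = 5) -> str:
    # 1. Retrieve the k most similar training items (DDL fragments,
    #    documentation snippets, prior question-SQL pairs).
    examples = vector_store.search(embed(question), top_k=k)

    # 2. Bundle the retrieved context with the question into a single prompt.
    context = "\n\n".join(item.text for item in examples)
    prompt = (
        "You are a SQL generator for a marketing analytics warehouse.\n"
        f"Relevant schema and examples:\n{context}\n\n"
        f"Question: {question}\nReturn only the SQL."
    )

    # 3. One LLM call produces the SQL; execution happens downstream.
    return llm.complete(prompt)
```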

Three failure modes killed this approach:

Context collision on cross-platform queries. "Compare my Google and Facebook campaign performance" pulled schema from multiple tables. Context ballooned past 15,000 tokens; the LLM started joining tables incorrectly or referencing wrong-platform columns.

No multi-turn reasoning. Marketers follow up: "Now break that down by ad set." But each question was brand new — no memory of prior analysis.

Analysis ceiling. "Why did my CPA spike?" requires computing anomalies and generating explanations. Text-to-SQL can fetch data. It cannot reason about it.

Iteration 2: Full Schema Injection + Semantic Layer

We made two moves. First, we introduced a CubeJS semantic layer in front of Postgres — a decision that proved foundational. Instead of raw SQL, we built 80+ CubeJS views with rich semantic descriptions. sma_14d became "14-day Simple Moving Average of the primary objective metric." bleed_flag became "indicator that this ad set is spending without delivering conversions." These descriptions gave the LLM dramatically better signal.
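For a flavor of what those descriptions look like, here is an illustrative fragment (a Python dict standing in for the actual CubeJS view definitions; the entries mirror the examples above):

```python
# Illustrative only: the kind of semantic descriptions each CubeJS view carries.
# Real definitions live in the CubeJS schema files.
ADSET_DIAGNOSTICS_DESCRIPTIONS = {
    "sma_14d": "14-day Simple Moving Average of the primary objective metric",
    "bleed_flag": "Indicator that this ad set is spending without delivering conversions",
    # ... one entry per measure and dimension, across 80+ views
}
```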

Second, we tried giving the LLM the entire schema catalog on every request. No RAG, no guessing. Just: here's everything, figure it out.

The semantic layer was a genuine leap. But the full catalog consumed 40,000–60,000 tokens before the LLM started reasoning. We saw infinite loops, hallucinated cube names, and queries against wrong platform views. More information, worse results.

We spent two weeks debugging a query where the LLM confidently generated a CubeJS request against a Facebook creative diagnostics view — for a Google Ads account. The schema was right there in context. The model just couldn't find it in 50,000 tokens of noise. That failure forced us to rethink the problem entirely. The question wasn't "how do we fit more context in?" It was "how do we give each step only what it needs?"

Iteration 3: The Progressive Disclosure Breakthrough

The insight: stop treating context as binary — everything or nothing — and start treating it as layers.

We rebuilt CoDI as a multi-step pipeline where each agent receives only the context required for its specific job:

The Planner sees lightweight metadata only — platform name, cube name, one-line description. For three connected platforms, roughly 2,500 tokens instead of 50,000+. Its job: select which 2–4 cubes are relevant.

The Query Builder receives detailed schema — dimensions, measures, data types — but only for the cubes the planner selected. Three cubes out of 80 means ~4,000 tokens instead of 50,000.
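A minimal sketch of those two disclosure layers, with hypothetical names and structures (the real catalog comes from CubeJS metadata):

```python
# Sketch of progressive disclosure over the cube catalog (hypothetical names/structures).

# Layer 1: lightweight catalog entries -- all the Planner ever sees (~2,500 tokens).
CATALOG = [
    {"platform": "Google Ads", "cube": "google_campaign_performance",
     "summary": "Daily campaign-level spend, clicks, and conversions"},
    {"platform": "Meta Ads", "cube": "meta_adset_diagnostics",
     "summary": "Ad-set delivery, saturation, and bleed signals"},
    # ... one line per cube across the full catalog
]

def planner_context() -> str:
    """Compact listing the Planner uses to pick the 2-4 relevant cubes."""
    return "\n".join(f"{c['platform']} / {c['cube']}: {c['summary']}" for c in CATALOG)

# Layer 2: detailed schema, loaded only for the cubes the Planner selected.
def query_builder_context(selected_cubes: list[str], full_schema: dict[str, str]) -> str:
    """Dimensions, measures, and types for the selected cubes only (~4,000 tokens)."""
    return "\n\n".join(full_schema[cube] for cube in selected_cubes)
```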

The Data Fetch step is entirely deterministic. CubeJS queries execute with zero LLM involvement. No tokens, no hallucination risk, no cost.

The Analyst receives the fetched data in a token-efficient columnar format we developed. Instead of row-based JSON ([{"campaign": "A", "spend": 100}, ...]), we send columnar JSON ({"campaign": ["A", "B"], "spend": [100, 200]}). For marketing datasets with 20–100 rows, this reduced data tokens by approximately 60%.
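The conversion itself is a few lines. A minimal sketch, assuming every row shares the same keys (which holds for CubeJS result sets):

```python
# Row-based JSON -> columnar JSON, the transformation behind the ~60% token savings.
def to_columnar(rows: list[dict]) -> dict:
    """Convert [{"campaign": "A", "spend": 100}, ...] into
    {"campaign": ["A", ...], "spend": [100, ...]}.

    Assumes every row has the same keys, as CubeJS result sets do."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}

# Example:
# to_columnar([{"campaign": "A", "spend": 100}, {"campaign": "B", "spend": 200}])
# -> {"campaign": ["A", "B"], "spend": [100, 200]}
```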

Per-query token usage dropped from 60,000–80,000 to 15,000–25,000. Response accuracy jumped from 70% to 85%. But the real shift was qualitative: each agent now operated in a clean, focused context window where every token was doing useful work.

Iteration 4: Multi-Level Orchestration for Real-World Complexity

The single pipeline handled most queries well. But marketing analysis often requires parallel workstreams. "How are my campaigns performing?" might need Facebook diagnostics, Google Ads trends, and GA4 website data simultaneously — each requiring different analysis strategies.

We evolved into a three-level orchestration architecture:

Level 1 — The Router classifies the query and decomposes it into sub-flows. Token cost: ~1,500. Non-marketing queries get rejected cheaply — no expensive downstream processing for off-topic questions.

Level 2 — The Orchestrator dispatches specialized agent flows in parallel threads. Diagnostics queries route to pre-configured flows; general questions route to the autonomous pipeline.

Level 3 — The Analysis Pipeline is a 10-step sequence where progressive disclosure operates at its finest. The critical design principle: each step builds on the outputs of previous steps, not their inputs. The recommendation agent sees synthesized insights, not raw data. The chart agent sees insights and recommendations, not the schema catalog.
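A compressed sketch of the routing and dispatch logic follows; the function names and return shapes are hypothetical, and the real flows carry far more state.

```python
import asyncio

# Simplified sketch of the three-level dispatch; names and shapes are hypothetical.
async def handle_query(query: str, llm, flows) -> str:
    # Level 1 -- Router: cheap classification and decomposition (~1,500 tokens).
    plan = await llm.classify(query)  # e.g. {"domain": "marketing", "subflows": [...]}
    if plan["domain"] != "marketing":
        return "I can only help with marketing analytics questions."

    # Level 2 -- Orchestrator: dispatch sub-flows in parallel. Known diagnostics
    # go to pre-configured flows; everything else goes to the autonomous pipeline.
    tasks = [
        flows.run_diagnostic(sub) if sub["kind"] == "diagnostic" else flows.run_autonomous(sub)
        for sub in plan["subflows"]
    ]
    results = await asyncio.gather(*tasks)

    # Level 3 -- each flow is itself a 10-step pipeline in which every step
    # consumes the *outputs* of earlier steps, never their raw inputs.
    return flows.synthesize(results)
```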

For known diagnostic patterns — budget underspend, audience saturation, creative fatigue, CPA anomalies — we go further. These flows use pre-configured queries with exact CubeJS dimensions and pre-computed statistical signals: 14-day moving averages, rolling 7-day metrics, consecutive decay counts. The LLM's job is purely analytical — evaluate triggers and translate technical signals into marketer-friendly language. This handles roughly 40% of queries with lower latency and cost.
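To make "pre-configured" concrete, here is a sketch of one such flow. The query shape follows CubeJS's JSON query format, but the measure names, threshold, and prompt are illustrative rather than our production values.

```python
# Sketch of a pre-configured diagnostic flow: a fixed CubeJS query, deterministic
# trigger checks, and an LLM call only for interpretation. Values are illustrative.
CREATIVE_FATIGUE_QUERY = {
    "dimensions": ["meta_adset_diagnostics.ad_set_name"],
    "measures": [
        "meta_adset_diagnostics.sma_14d",
        "meta_adset_diagnostics.rolling_7d_cpa",
        "meta_adset_diagnostics.consecutive_decay_days",
    ],
    "timeDimensions": [
        {"dimension": "meta_adset_diagnostics.date", "dateRange": "last 14 days"}
    ],
}

def is_triggered(row: dict) -> bool:
    """Deterministic trigger evaluation -- no LLM involved. Threshold is illustrative."""
    return row["consecutive_decay_days"] >= 3

def explain(triggered_rows: list[dict], llm) -> str:
    """The LLM's only job: translate technical signals into marketer-friendly language."""
    return llm.complete(
        "Explain these creative-fatigue signals to a media buyer and suggest next steps:\n"
        f"{triggered_rows}"
    )
```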


What the Numbers Show


| | V1 (RAG) | V2 (Full Schema) | V3 (Pipeline) | V4 (Multi-Level) |
|---|---|---|---|---|
| Tokens per query | 3K–8K | 60K–80K | 15K–25K | 20K–35K (across 10+ agents) |
| Response accuracy | ~60% | ~70% | 85% | 90%+ |
| Cross-platform queries | Unreliable | Context overflow | Reliable | Parallel execution |
| Diagnostic depth | None | Basic | Single-pass | Multi-signal, trigger-based |


V4 uses more total tokens than V3 for complex queries — it runs multiple specialized agents in parallel. But cost per unit of insight is dramatically lower, and the output quality is categorically different. Where V1 returned a raw data table, V4 returns a narrative that weaves together insight cards ("Your Facebook CPA spiked 40% — driven by audience saturation in two ad sets"), recommendations with projected impact ("Pause Ad Set X — projected savings of $500/week"), and executable actions tied to specific campaigns by name.

Five Lessons That Generalize

The semantic layer is the real enabler. Progressive disclosure works because CubeJS gives us clean abstraction layers: table-level metadata for planning, detailed metadata for query construction, structured data for analysis. Without this, progressive disclosure is just "give the LLM less stuff and hope."

Deterministic steps are your most reliable agents. Three of our ten pipeline steps involve zero LLM calls — pure API fetches against a typed schema. CubeJS queries either return correct data or a clear error. LLM-generated SQL can hallucinate column names and return plausible-looking wrong data — in analytics, that's the most dangerous failure mode there is.

Autonomy is expensive. Use it where it matters. When you know the diagnostic pattern, hardcode the data requirements and let the LLM focus on interpretation. We pre-configure roughly 40% of our flows this way, eliminating an entire LLM planning call per step.

Columnar JSON is underrated. Row-to-columnar data transformation reduced token counts by 60% across every data-carrying step. For any system passing structured data to LLMs, this is low-hanging fruit.

Compact conversation history, don't just truncate it. Between turns, we distill the prior conversation into a tight summary that preserves analysis decisions, unresolved questions, and key metrics — while discarding redundant tool outputs and verbose explanations. We cap these summaries at 800 characters. The goal isn't to remember less; it's to remember what matters.
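A minimal sketch of that compaction step; the prompt wording and helper names are illustrative, not our production prompt.

```python
# Sketch of between-turn history compaction (prompt and names are illustrative).
MAX_SUMMARY_CHARS = 800

def compact_history(prior_summary: str, latest_turn: str, llm) -> str:
    """Distill the conversation so far into <=800 characters, keeping decisions,
    unresolved questions, and key metrics while dropping raw tool output."""
    summary = llm.complete(
        "Update this running summary of an analytics conversation.\n"
        "Keep: decisions made, open questions, key metrics and entity names.\n"
        "Drop: raw tool outputs, repeated explanations.\n"
        f"Stay under {MAX_SUMMARY_CHARS} characters.\n\n"
        f"Current summary:\n{prior_summary}\n\nLatest turn:\n{latest_turn}"
    )
    return summary[:MAX_SUMMARY_CHARS]  # hard cap as a safety net
```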

What's Next

The current system tells marketers what to do. Next: doing it for them. Our action cards already identify specific campaigns, ad sets, and budget changes with projected impact — executing through platform APIs is the natural extension.

We're also expanding to LinkedIn Ads, Snapchat, and additional organic channels. Every new platform adds cubes to the catalog, but the planner's per-query token cost stays flat — the agent only loads what it needs.

The broader pattern — progressive disclosure through a semantic layer with multi-level orchestration — applies anywhere an autonomous agent reasons over a large structured data catalog. Context is a finite resource. The job isn't to minimize it. It's to maximize signal-to-noise at every step.
