Context Engineering for Autonomous Marketing Agents: How We Went from Context Overflow to 90% Accuracy

Aravind Nair

A media buyer managing six-figure monthly ad spend across Google, Meta, and TikTok asks your AI chatbot: "Why did my ROAS drop last week?"

Think about what that requires. The system needs to identify which platforms matter, pull campaign-level data with the right date ranges, compare against historical baselines, detect anomalies, generate a root-cause analysis, and deliver it in language a marketer understands. Without knowing in advance that this question was coming.

This is the core challenge we faced building CoDI, the conversational AI analyst inside Third i. CoDI handles everything from simple lookups ("What was my spend last month?") to multi-platform diagnostics ("Which campaigns should I pause and why?") across Google Ads, Meta Ads, TikTok Ads, Instagram, and GA4. The user types plain English. CoDI figures out the rest.

The problem isn't building an agent that answers marketing questions. It's building one that answers any marketing question without drowning in its own context.

Anthropic's engineering team recently named this challenge context engineering — the practice of curating the optimal set of tokens at each step of inference. They describe context as "a finite resource with diminishing marginal returns" and highlight a pattern they call progressive disclosure: agents incrementally discovering relevant context through exploration rather than loading everything upfront.

We arrived at the same pattern independently, through four architecture iterations. Here's what happened at each one.

Iteration 1: Text-to-SQL with RAG — Fast to Ship, Fragile in Production

We started with Vanna, an open-source text-to-SQL framework. We ingested DDL schemas, documentation, and question-SQL pairs into a vector store. When a user asked a question, Vanna retrieved similar examples, bundled them with schema context, and sent it to the LLM to generate SQL.

For narrow, single-turn questions, it worked. "What was my Facebook spend last month?" retrieved the right schema fragment, and the SQL was correct. Token usage was lean — roughly 3,000–4,000 tokens per query.
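The retrieval loop behind this was simple. Here's a minimal sketch of the pattern, using hypothetical helpers rather than Vanna's actual API: embed the question, pull the nearest schema fragments and question-SQL pairs, and ask the LLM for SQL.

```python
# Minimal sketch of the retrieve-then-generate loop in iteration 1.
# `vector_store`, `embed`, and `llm` are hypothetical stand-ins, not Vanna's API.

def answer_with_rag(question: str, vector_store, embed, llm, k: int = 5) -> str:
    # 1. Retrieve the k most similar training items (DDL fragments,
    #    documentation snippets, prior question-SQL pairs).
    examples = vector_store.search(embed(question), top_k=k)

    # 2. Bundle the retrieved context with the question into a single prompt.
    context = "\n\n".join(item.text for item in examples)
    prompt = (
        "You are a SQL generator for a marketing analytics warehouse.\n"
        f"Relevant schema and examples:\n{context}\n\n"
        f"Question: {question}\nReturn only the SQL."
    )

    # 3. One LLM call produces the SQL; execution happens downstream.
    return llm.complete(prompt)
```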

Three failure modes killed this approach:

Context collision on cross-platform queries. "Compare my Google and Facebook campaign performance" pulled schema from multiple tables. Context ballooned past 15,000 tokens; the LLM started joining tables incorrectly or referencing wrong-platform columns.

No multi-turn reasoning. Marketers follow up: "Now break that down by ad set." But each question was brand new — no memory of prior analysis.

Analysis ceiling. "Why did my CPA spike?" requires computing anomalies and generating explanations. Text-to-SQL can fetch data. It cannot reason about it.

Iteration 2: Full Schema Injection + Semantic Layer

We made two moves. First, we introduced a CubeJS semantic layer in front of Postgres — a decision that proved foundational. Instead of raw SQL, we built 80+ CubeJS views with rich semantic descriptions. sma_14d became "14-day Simple Moving Average of the primary objective metric." bleed_flag became "indicator that this ad set is spending without delivering conversions." These descriptions gave the LLM dramatically better signal.
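For a flavor of what those descriptions look like, here is an illustrative fragment (a Python dict standing in for the actual CubeJS view definitions; the entries mirror the examples above):

```python
# Illustrative only: the kind of semantic descriptions each CubeJS view carries.
# Real definitions live in the CubeJS schema files.
ADSET_DIAGNOSTICS_DESCRIPTIONS = {
    "sma_14d": "14-day Simple Moving Average of the primary objective metric",
    "bleed_flag": "Indicator that this ad set is spending without delivering conversions",
    # ... one entry per measure and dimension, across 80+ views
}
```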

Second, we tried giving the LLM the entire schema catalog on every request. No RAG, no guessing. Just: here's everything, figure it out.

The semantic layer was a genuine leap. But the full catalog consumed 40,000–60,000 tokens before the LLM started reasoning. We saw infinite loops, hallucinated cube names, and queries against wrong platform views. More information, worse results.

We spent two weeks debugging a query where the LLM confidently generated a CubeJS request against a Facebook creative diagnostics view — for a Google Ads account. The schema was right there in context. The model just couldn't find it in 50,000 tokens of noise. That failure forced us to rethink the problem entirely. The question wasn't "how do we fit more context in?" It was "how do we give each step only what it needs?"

Iteration 3: The Progressive Disclosure Breakthrough

The insight: stop treating context as binary — everything or nothing — and start treating it as layers.

We rebuilt CoDI as a multi-step pipeline where each agent receives only the context required for its specific job:

The Planner sees lightweight metadata only — platform name, cube name, one-line description. For three connected platforms, roughly 2,500 tokens instead of 50,000+. Its job: select which 2–4 cubes are relevant.

The Query Builder receives detailed schema — dimensions, measures, data types — but only for the cubes the planner selected. Three cubes out of 80 means ~4,000 tokens instead of 50,000.
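A minimal sketch of those two disclosure layers, with hypothetical names and structures (the real catalog comes from CubeJS metadata):

```python
# Sketch of progressive disclosure over the cube catalog (hypothetical names/structures).

# Layer 1: lightweight catalog entries -- all the Planner ever sees (~2,500 tokens).
CATALOG = [
    {"platform": "Google Ads", "cube": "google_campaign_performance",
     "summary": "Daily campaign-level spend, clicks, and conversions"},
    {"platform": "Meta Ads", "cube": "meta_adset_diagnostics",
     "summary": "Ad-set delivery, saturation, and bleed signals"},
    # ... one line per cube across the full catalog
]

def planner_context() -> str:
    """Compact listing the Planner uses to pick the 2-4 relevant cubes."""
    return "\n".join(f"{c['platform']} / {c['cube']}: {c['summary']}" for c in CATALOG)

# Layer 2: detailed schema, loaded only for the cubes the Planner selected.
def query_builder_context(selected_cubes: list[str], full_schema: dict[str, str]) -> str:
    """Dimensions, measures, and types for the selected cubes only (~4,000 tokens)."""
    return "\n\n".join(full_schema[cube] for cube in selected_cubes)
```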

The Data Fetch step is entirely deterministic. CubeJS queries execute with zero LLM involvement. No tokens, no hallucination risk, no cost.

The Analyst receives the fetched data in a token-efficient columnar format we developed. Instead of row-based JSON ([{"campaign": "A", "spend": 100}, ...]), we send columnar JSON ({"campaign": ["A", "B"], "spend": [100, 200]}). For marketing datasets with 20–100 rows, this reduced data tokens by approximately 60%.
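The conversion itself is a few lines. A minimal sketch, assuming every row shares the same keys (which holds for CubeJS result sets):

```python
# Row-based JSON -> columnar JSON, the transformation behind the ~60% token savings.
def to_columnar(rows: list[dict]) -> dict:
    """Convert [{"campaign": "A", "spend": 100}, ...] into
    {"campaign": ["A", ...], "spend": [100, ...]}.

    Assumes every row has the same keys, as CubeJS result sets do."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}

# Example:
# to_columnar([{"campaign": "A", "spend": 100}, {"campaign": "B", "spend": 200}])
# -> {"campaign": ["A", "B"], "spend": [100, 200]}
```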

Per-query token usage dropped from 60,000–80,000 to 15,000–25,000. Response accuracy jumped from 70% to 85%. But the real shift was qualitative: each agent now operated in a clean, focused context window where every token was doing useful work.

Iteration 4: Multi-Level Orchestration for Real-World Complexity

The single pipeline handled most queries well. But marketing analysis often requires parallel workstreams. "How are my campaigns performing?" might need Facebook diagnostics, Google Ads trends, and GA4 website data simultaneously — each requiring different analysis strategies.

We evolved into a three-level orchestration architecture:

Level 1 — The Router classifies the query and decomposes it into sub-flows. Token cost: ~1,500. Non-marketing queries get rejected cheaply — no expensive downstream processing for off-topic questions.

Level 2 — The Orchestrator dispatches specialized agent flows in parallel threads. Diagnostics queries route to pre-configured flows; general questions route to the autonomous pipeline.

Level 3 — The Analysis Pipeline is a 10-step sequence where progressive disclosure operates at its finest. The critical design principle: each step builds on the outputs of previous steps, not their inputs. The recommendation agent sees synthesized insights, not raw data. The chart agent sees insights and recommendations, not the schema catalog.
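A compressed sketch of the routing and dispatch logic follows; the function names and return shapes are hypothetical, and the real flows carry far more state.

```python
import asyncio

# Simplified sketch of the three-level dispatch; names and shapes are hypothetical.
async def handle_query(query: str, llm, flows) -> str:
    # Level 1 -- Router: cheap classification and decomposition (~1,500 tokens).
    plan = await llm.classify(query)  # e.g. {"domain": "marketing", "subflows": [...]}
    if plan["domain"] != "marketing":
        return "I can only help with marketing analytics questions."

    # Level 2 -- Orchestrator: dispatch sub-flows in parallel. Known diagnostics
    # go to pre-configured flows; everything else goes to the autonomous pipeline.
    tasks = [
        flows.run_diagnostic(sub) if sub["kind"] == "diagnostic" else flows.run_autonomous(sub)
        for sub in plan["subflows"]
    ]
    results = await asyncio.gather(*tasks)

    # Level 3 -- each flow is itself a 10-step pipeline in which every step
    # consumes the *outputs* of earlier steps, never their raw inputs.
    return flows.synthesize(results)
```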

For known diagnostic patterns — budget underspend, audience saturation, creative fatigue, CPA anomalies — we go further. These flows use pre-configured queries with exact CubeJS dimensions and pre-computed statistical signals: 14-day moving averages, rolling 7-day metrics, consecutive decay counts. The LLM's job is purely analytical — evaluate triggers and translate technical signals into marketer-friendly language. This handles roughly 40% of queries with lower latency and cost.
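To make "pre-configured" concrete, here is a sketch of one such flow. The query shape follows CubeJS's JSON query format, but the measure names, threshold, and prompt are illustrative rather than our production values.

```python
# Sketch of a pre-configured diagnostic flow: a fixed CubeJS query, deterministic
# trigger checks, and an LLM call only for interpretation. Values are illustrative.
CREATIVE_FATIGUE_QUERY = {
    "dimensions": ["meta_adset_diagnostics.ad_set_name"],
    "measures": [
        "meta_adset_diagnostics.sma_14d",
        "meta_adset_diagnostics.rolling_7d_cpa",
        "meta_adset_diagnostics.consecutive_decay_days",
    ],
    "timeDimensions": [
        {"dimension": "meta_adset_diagnostics.date", "dateRange": "last 14 days"}
    ],
}

def is_triggered(row: dict) -> bool:
    """Deterministic trigger evaluation -- no LLM involved. Threshold is illustrative."""
    return row["consecutive_decay_days"] >= 3

def explain(triggered_rows: list[dict], llm) -> str:
    """The LLM's only job: translate technical signals into marketer-friendly language."""
    return llm.complete(
        "Explain these creative-fatigue signals to a media buyer and suggest next steps:\n"
        f"{triggered_rows}"
    )
```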


What the Numbers Show


| | V1 (RAG) | V2 (Full Schema) | V3 (Pipeline) | V4 (Multi-Level) |
|---|---|---|---|---|
| Tokens per query | 3K–8K | 60K–80K | 15K–25K | 20K–35K (across 10+ agents) |
| Response accuracy | ~60% | ~70% | 85% | 90%+ |
| Cross-platform queries | Unreliable | Context overflow | Reliable | Parallel execution |
| Diagnostic depth | None | Basic | Single-pass | Multi-signal, trigger-based |


V4 uses more total tokens than V3 for complex queries — it runs multiple specialized agents in parallel. But cost per unit of insight is dramatically lower, and the output quality is categorically different. Where V1 returned a raw data table, V4 returns a narrative that weaves together insight cards ("Your Facebook CPA spiked 40% — driven by audience saturation in two ad sets"), recommendations with projected impact ("Pause Ad Set X — projected savings of $500/week"), and executable actions tied to specific campaigns by name.

Five Lessons That Generalize

The semantic layer is the real enabler. Progressive disclosure works because CubeJS gives us clean abstraction layers: table-level metadata for planning, detailed metadata for query construction, structured data for analysis. Without this, progressive disclosure is just "give the LLM less stuff and hope."

Deterministic steps are your most reliable agents. Three of our ten pipeline steps involve zero LLM calls — pure API fetches against a typed schema. CubeJS queries either return correct data or a clear error. LLM-generated SQL can hallucinate column names and return plausible-looking wrong data — in analytics, that's the most dangerous failure mode there is.

Autonomy is expensive. Use it where it matters. When you know the diagnostic pattern, hardcode the data requirements and let the LLM focus on interpretation. We pre-configure roughly 40% of our flows this way, eliminating an entire LLM planning call per step.

Columnar JSON is underrated. Row-to-columnar data transformation reduced token counts by 60% across every data-carrying step. For any system passing structured data to LLMs, this is low-hanging fruit.

Compact conversation history, don't just truncate it. Between turns, we distill the prior conversation into a tight summary that preserves analysis decisions, unresolved questions, and key metrics — while discarding redundant tool outputs and verbose explanations. We cap these summaries at 800 characters. The goal isn't to remember less; it's to remember what matters.
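A minimal sketch of that compaction step; the prompt wording and helper names are illustrative, not our production prompt.

```python
# Sketch of between-turn history compaction (prompt and names are illustrative).
MAX_SUMMARY_CHARS = 800

def compact_history(prior_summary: str, latest_turn: str, llm) -> str:
    """Distill the conversation so far into <=800 characters, keeping decisions,
    unresolved questions, and key metrics while dropping raw tool output."""
    summary = llm.complete(
        "Update this running summary of an analytics conversation.\n"
        "Keep: decisions made, open questions, key metrics and entity names.\n"
        "Drop: raw tool outputs, repeated explanations.\n"
        f"Stay under {MAX_SUMMARY_CHARS} characters.\n\n"
        f"Current summary:\n{prior_summary}\n\nLatest turn:\n{latest_turn}"
    )
    return summary[:MAX_SUMMARY_CHARS]  # hard cap as a safety net
```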

What's Next

The current system tells marketers what to do. Next: doing it for them. Our action cards already identify specific campaigns, ad sets, and budget changes with projected impact — executing through platform APIs is the natural extension.

We're also expanding to LinkedIn Ads, Snapchat, and additional organic channels. Every new platform adds cubes to the catalog, but the planner's per-query token cost stays flat — the agent only loads what it needs.

The broader pattern — progressive disclosure through a semantic layer with multi-level orchestration — applies anywhere an autonomous agent reasons over a large structured data catalog. Context is a finite resource. The job isn't to minimize it. It's to maximize signal-to-noise at every step.
