AI strategy – apiphani

Drop the Backpack: What $900/Day in AI Costs Taught Us About MCP

Josh Greenwell — Thu, 12 Feb 2026 18:14:26 +0000

TL;DR: MCPs are not efficient. Code execution makes tool usage intelligent, consistent, time efficient, and cost efficient while providing an additional layer of security between the data and the AI.

Introduction

Recently, I was talking with my colleague Braden about MCPs, tooling in AI, and token usage. Our team develops LuumenAI, the intelligence used in our observability and automation platform for monitoring ERP environments. As we got deeper into the conversation, I remember saying something along the lines of, “MCPs might become the NFTs of AI.” A bit hyperbolic, but I believe MCPs are a fad that will eventually die out as more practitioners grow disillusioned and move on.

For me, something never quite fit about the concept of MCPs in the way they were originally described or implemented. As it turns out, the creators of the Model Context Protocol (MCP), Anthropic, have come to a similar conclusion.

This whitepaper is about an expensive MCP lesson we learned the hard way: Three cost spikes that hit $100, then $300, then $900 per day in an environment with zero users and zero paying customers – just a handful of developers testing the system.

Here’s what went wrong, why the standard fix isn’t enough, and what we learned that actually works.

Part I: The Problem

First, What is MCP?

MCP is an open standard for connecting AI assistants to the systems where data lives, including content repositories, business tools, and development environments. The purpose of MCP is to help frontier models produce better, more relevant responses. This sounds compelling, in theory, as it’s an important problem to try to solve in AI efficiency. But in practice, the MCP standard has failed. Anthropic itself has stated that traditional MCP patterns increase agent cost and latency. And results from an experiment by Cloudflare have found that MCP tools are inefficient, costly, and make the models “dumber”.

How AI Pricing Works

Before we can talk about why AI is expensive, we need to understand how AI actually processes text. The answer is tokens.

A token is the fundamental unit that a language model works with. When you send text to an AI, it doesn’t read words or characters the way humans do. Instead, it breaks everything down into tokens, which are chunks of text that the model has learned to recognize. A token might be a whole word like “hello,” a partial word like “ing” or “tion,” or even just a single character for uncommon symbols.

Modern LLMs use an algorithm called Byte-Pair Encoding (BPE) to do this. BPE starts with individual characters and iteratively merges the most frequently occurring pairs until it builds up a vocabulary of common sub-word units. This is why common words like “the” become single tokens, while rare or technical terms get split into multiple pieces. For example, “unhappiness” might become three tokens: “un,” “happi,” and “ness.”

A rule of thumb for English text is that 1 token equals about 4 characters, or roughly 0.75 words. So, 1,000 tokens is approximately 750 words, but this can vary significantly. Code, technical documentation, and non-English text often tokenize less efficiently, meaning more tokens per word.
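As a rough illustration of that rule of thumb, here is a minimal sketch that converts between characters, words, and tokens. The function names are ours, and the ratios only hold for ordinary English prose; an exact count has to come from the model's own tokenizer.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # roughly 0.75 words per token


def estimate_tokens_from_chars(text: str) -> int:
    """Approximate token count from character length (English prose only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_words_from_tokens(tokens: int) -> int:
    """Approximate number of words that a given token count covers."""
    return round(tokens * WORDS_PER_TOKEN)


if __name__ == "__main__":
    print(estimate_words_from_tokens(1_000))        # 750
    print(estimate_tokens_from_chars("x" * 4_000))  # 1000
```

Code and non-English text will usually come out higher than this estimate, which is exactly why the estimate should not be used for billing.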

The reason AI providers charge by token comes down to computational cost. Every token that goes into a model (input) and every token that comes out (output) requires processing power. Input tokens and output tokens are often priced differently. Output tokens typically cost more because generating new text requires the model to run inference step by step, predicting one token at a time.

Let’s look at current pricing for Anthropic’s Claude 4.5 models:

Model	Input (per 1K tokens)	Output (per 1K tokens)
Claude Opus 4.5	$0.005	$0.025
Claude Sonnet 4.5	$0.003	$0.015
Claude Haiku 4.5	$0.001	$0.005

Notice that output tokens cost 5x more than input tokens. This pricing structure means that a chatty AI that generates long responses will cost significantly more than one that gives concise answers. It also means that any architecture that repeatedly sends large contexts back and forth will accumulate costs quickly.

To put it in perspective: If you’re running 100,000 tokens through Claude Sonnet 4.5 (a mix of 70K input and 30K output), that’s roughly $0.21 + $0.45 = $0.66 per request. Run that 1,000 times a day and you’re looking at $660/day. Run it across 100 users making 10 requests each? Same math.
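The arithmetic above is easy to reproduce. The sketch below uses the per-1K prices from the table; the short model keys are labels we made up for this example, not official model identifiers.

```python
# Prices in USD per 1,000 tokens, copied from the table above.
# The dictionary keys are illustrative labels, not API model names.
PRICES_PER_1K = {
    "opus-4.5": {"input": 0.005, "output": 0.025},
    "sonnet-4.5": {"input": 0.003, "output": 0.015},
    "haiku-4.5": {"input": 0.001, "output": 0.005},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, billing input and output tokens separately."""
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]


if __name__ == "__main__":
    per_request = request_cost("sonnet-4.5", 70_000, 30_000)
    print(f"per request: ${per_request:.2f}")          # $0.66
    print(f"per day:     ${per_request * 1_000:.2f}")  # $660.00
```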

Why MCP Bleeds Tokens

There are two compounding problems with how MCP handles context.

First, tool definitions load upfront. When you give an AI access to tools (functions it can call), each tool needs a definition. This definition includes the tool’s name, description, parameters, parameter types, and often examples of when to use it. A well-documented tool might be 100-300 tokens. Multiply that by 90 tools and you’re looking at 9,000 – 27,000 tokens of tool definitions sent with every request.¹

To put that in dollar terms: At Claude Sonnet 4.5 rates, 9,000 tokens cost about $0.027 and 27,000 tokens cost about $0.081. That doesn’t sound like much, but remember this cost is incurred on every single request – even before the user says anything useful. Run 10,000 requests per day and you’re looking at $270 – $810 daily just for tool definitions.
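To make that overhead concrete, here is a small sketch of the same calculation. It assumes the Sonnet 4.5 input price from the table and ignores prompt caching (covered in the appendix); the function name is ours.

```python
SONNET_INPUT_PER_1K = 0.003  # USD, Claude Sonnet 4.5 input price from the table above


def tool_definition_overhead(
    tools: int, tokens_per_tool: int, requests_per_day: int
) -> tuple[int, float, float]:
    """Return (tokens per request, USD per request, USD per day) for tool definitions alone."""
    tokens = tools * tokens_per_tool
    per_request = tokens / 1000 * SONNET_INPUT_PER_1K
    return tokens, per_request, per_request * requests_per_day


if __name__ == "__main__":
    # 90 tools at the low and high end of the 100-300 token range.
    for tokens_per_tool in (100, 300):
        tokens, per_request, per_day = tool_definition_overhead(90, tokens_per_tool, 10_000)
        print(f"{tokens:>6} tokens  ${per_request:.3f}/request  ${per_day:,.0f}/day")
```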

It gets worse. In the standard MCP approach, all tool definitions get sent with every request, even when the user’s query only needs one or two tools. That’s because the model needs to see all available options to decide which ones to use.

With LuumenAI, we had over 90 tools from just 5 implementations. This meant every base request started at roughly 20,000 tokens before the user typed anything. A simple “hello” message: 20,000 tokens – where the actual message was 1 token and the other 19,999 were the system prompt and tool definitions sitting there waiting to be useful.

Second, iterative calls compound context. When the model calls a tool and gets a result, that result is added to the context… then the model is called again with the original context PLUS the tool result. Then, if the model calls another tool, you get original context PLUS first result PLUS second result.

Each iterative call compounds the previous context, as illustrated in the following diagram.

Two tool calls compounding one user message
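The compounding can also be sketched numerically. The example below is a simplification with hypothetical sizes: it counts only the input tokens re-sent on each model call and ignores the model's own output being appended to the context.

```python
def billed_input_per_call(base_tokens: int, tool_result_tokens: list[int]) -> list[int]:
    """Input tokens sent on each model call when every tool result is appended to the context.

    The first call sends only the base context; each later call re-sends
    everything before it plus the newest tool result.
    """
    sizes = [base_tokens]
    for result in tool_result_tokens:
        sizes.append(sizes[-1] + result)
    return sizes


if __name__ == "__main__":
    # Hypothetical numbers: a 20,000-token base context and two 5,000-token tool results.
    per_call = billed_input_per_call(20_000, [5_000, 5_000])
    print(per_call)       # [20000, 25000, 30000]
    print(sum(per_call))  # 75000 input tokens billed for one user message
```

The base context is paid for three times here, and the first tool result twice, even though nothing in either of them changed between calls.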

The “Lost in the Middle” Problem

Current AI models are good at writing code but struggle to sustain long conversations with large amounts of context. One reason this happens is due to the “Lost in the Middle” problem.

The “Lost in the Middle” phenomenon was documented by Liu et al. in their 2024 research. It reveals a critical limitation in how language models process information within their context windows. Their experiments on multi-document question answering and key-value retrieval tasks demonstrated that model performance follows a distinctive U-shaped curve. Accuracy is highest when relevant information appears at the very beginning or end of the input context, but degrades significantly when critical data is positioned in the middle.

This occurs because of primacy and recency biases. Models tend to “remember” what they saw first and last, while information in the middle gets overshadowed. In some cases, GPT-3.5-Turbo performed worse with middle-positioned documents than when operating without any documents at all.

For our purposes, this means that as tool results accumulate in the context window, the AI increasingly struggles to locate and utilize the most relevant information, leading to degraded reasoning quality. The model literally gets “dumber” as context grows.

Part II: What Went Wrong with LuumenAI

We experienced this ourselves with LuumenAI’s early implementation. In the first version, we created a ReAct agent and connected tools to it. We tracked the agent’s decisions and had some cool visuals of the AI’s workflow. As we continued to iterate on the agent’s capability, token count skyrocketed and at times the model seemed to become less functional.

In this first iteration of LuumenAI, we experienced three major token spikes – with zero users. The first spike was ~$100 a day, the second was $300 a day, and the third was $600 – $900 a day. For an (at the time) unlaunched product, this was a serious problem.

Let’s look at the pitfall in each situation.

Spike #1: The Summarization Loop ($100/day)

First, we implemented a summarization tool that fed large amounts of vulnerability information into the LLM to output a summary for quick, concise reading. The problem was that we ran this summary for every server and every vulnerability on each of those servers every hour. It quickly became large amounts of text being repeatedly processed on our development machines.

Servers	Vulns/Server	Runs/Day	Tokens/Run	Daily Tokens	Daily Cost
3	~17	24	1.17M	28M	$100
10	20	24	5M	120M	$648
100	20	24	50M	1.2B	$6,480
1000	20	24	500M	12B	$64,800

1000 servers is a relatively normal number. Some companies have over 200 alone.

Just 1000 servers with 20 vulnerabilities costs $65,000 a day! We quickly implemented a caching layer and deduped our processing to bring things back into line.

Lesson learned: Never ask the AI to do the same work twice.
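One way to avoid paying for the same summary twice is to key a cache on a hash of the input text. This is a minimal in-memory sketch, not our production caching layer: the class name and the stubbed summarizer are assumptions for illustration, and a real version would need persistent storage and an invalidation policy.

```python
import hashlib
from typing import Callable


class SummaryCache:
    """Cache summaries by a hash of the input text so unchanged input is never re-sent."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._summarize = summarize  # the expensive LLM call (stubbed in the demo below)
        self._cache: dict[str, str] = {}
        self.llm_calls = 0

    def get(self, text: str) -> str:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.llm_calls += 1
            self._cache[key] = self._summarize(text)
        return self._cache[key]


if __name__ == "__main__":
    # Stand-in for a model call; a real summarizer would hit the LLM API here.
    cache = SummaryCache(lambda text: text[:40] + "...")
    report = "example vulnerability description for one server " * 50
    for _hour in range(24):  # the hourly job described above
        cache.get(report)
    print(cache.llm_calls)   # 1 instead of 24
```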

Spike #2: Tool Definition Overload ($300/day)

Next, we implemented a large suite of tools (70+ ServiceNow tools, 15+ Dynatrace tools, and others) which immediately brought every base request to 20,000 tokens. Sending the word “Hello” was 20,000 tokens. Anything more complicated requiring tools could push token counts to over 125,000.

Users	System Prompt	Tool Definitions	User Message	Cost
1	~4.5k	~15.5k	1	~$0.10
10	~4.5k	~15.5k	1	~$1.00
1000	~4.5k	~15.5k	1	~$100.00

1000 Users sending just “Hello” would cost ~$100.

That’s roughly 18,500x what that one token would cost on its own!

We quickly implemented optimizations and observability beyond what we already had to try to better understand how the token counts were getting so large. We culled tools we weren’t using yet and added dynamic tool insertion into prompts. That got us back in line, but left us with a lingering problem: What happens when we actually need 90 critical tools?

Lesson learned: Tool definitions are expensive. Give the model only what it needs when it needs it.
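Dynamic tool insertion can be as simple as filtering the catalog before building the prompt. The sketch below uses naive keyword overlap purely for illustration; the tool names, keywords, and token sizes are hypothetical, and a real selector would more likely use embeddings or a classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDef:
    name: str
    keywords: frozenset[str]  # words that suggest this tool is relevant
    tokens: int               # approximate size of the definition in tokens


def select_tools(message: str, catalog: list[ToolDef], max_tools: int = 5) -> list[ToolDef]:
    """Keep only tools whose keywords overlap the message, best matches first."""
    words = set(message.lower().split())
    scored = [(len(tool.keywords & words), tool) for tool in catalog]
    matches = [(score, tool) for score, tool in scored if score > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in matches[:max_tools]]


if __name__ == "__main__":
    catalog = [
        ToolDef("create_incident", frozenset({"incident", "ticket"}), 250),
        ToolDef("get_host_metrics", frozenset({"cpu", "memory", "metrics"}), 200),
        ToolDef("search_docs", frozenset({"docs", "documentation", "runbook"}), 180),
    ]
    print([t.name for t in select_tools("hello", catalog)])  # []
    print([t.name for t in select_tools("show cpu metrics for this host", catalog)])  # ['get_host_metrics']
```

With a filter like this, "hello" carries no tool definitions at all, and a metrics question carries one instead of ninety.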

Spike #3: The Perfect Storm ($600-$900/day)

Finally, we aggressively used the models to do huge multi-step processing on text-rich files. Doing this unintentionally created a combination of our first and second mistakes. There weren’t many tools anymore, but the large blocks of text being processed en masse through many stages still created those large token counts – which get replicated over each step. The result was two days of spikes, at $600 and $900 respectively.

Step	Input Tokens	Output Tokens	Total Tokens	Cost
User Msg	5,000	0	5,000	~$0.03
Tool 1	5,000	10,000	15,000	~$0.08
Tool 2	15,000	10,000	25,000	~$0.14
Tool 3	25,000	10,000	35,000	~$0.19
Tool 4	35,000	10,000	45,000	~$0.24
AI Response	45,000	500	45,500	~$0.25
Total				~$0.92

Total cost is the SUM of all costs!

Users	Requests	Total
1	1	$1
10	1	$10
100	1	$100
10	10	$100
100	10	$1000

100 users making just 10 complex requests a day costs $1000.

10,000 users making 10 complex requests a day costs $100,000 a day! This caused us to reevaluate the core architecture of our system.

Lesson learned: Iterative tool calling compounds costs. Find architectures that minimize round trips between the model and external systems.
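The per-step costs in the Spike #3 table can be approximated from the blended rate described in the Calculations appendix (4/5 of tokens billed as input, 1/5 as output). This sketch applies that rate to the "Total Tokens" column; individual rows may differ from the table by a cent because of rounding.

```python
INPUT_PER_1K = 0.003   # USD, Sonnet input price
OUTPUT_PER_1K = 0.015  # USD, Sonnet output price

# Blended rate: 4/5 of tokens billed as input, 1/5 as output (about 0.0054 USD per 1K).
BLENDED_PER_1K = 0.8 * INPUT_PER_1K + 0.2 * OUTPUT_PER_1K


def blended_cost(total_tokens: int) -> float:
    """Approximate USD cost of one step from its total token count."""
    return total_tokens / 1000 * BLENDED_PER_1K


# "Total Tokens" column from the table above, one entry per step.
STEP_TOTALS = [5_000, 15_000, 25_000, 35_000, 45_000, 45_500]

if __name__ == "__main__":
    for total in STEP_TOTALS:
        print(f"{total:>7,} tokens  ~${blended_cost(total):.2f}")
    request_total = sum(blended_cost(t) for t in STEP_TOTALS)
    print(f"one request: ~${request_total:.2f}")                       # ~$0.92
    print(f"100 users x 10 requests: ~${request_total * 1_000:,.0f}")  # ~$921
```

The user-scaling table rounds a single request up to $1, which is why it shows $1000 rather than roughly $921 for the same load.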

Part III: Anthropic’s Fix (And Why It’s Not Enough)

While I was writing this article, Anthropic came out with their solution to the MCP problem. It is a step forward. But using MCP feels like running a race wearing a backpack full of rocks – and Anthropic’s solution is to add more straps to help better distribute the weight. Yes, it’ll get you to the finish line, but you’ll be exhausted and you’ll have made lots of suboptimal decisions along the way. A better solution is to just drop the backpack.

Anthropic’s Three New Features

In their new article, Anthropic shared how Claude will start to use dynamic tooling and a level of code execution (sometimes) to be more efficient with token usage. They introduced three new features:

1. Tool Search Tool: Instead of loading all tool definitions upfront (which Anthropic admits can hit 134K tokens internally), Claude can now search for tools on demand. Tools are marked with defer_loading: true, and Claude only sees the Tool Search Tool itself (~500 tokens), plus always-loaded tools. When Claude needs a capability, it searches. This keeps context windows lean but adds another inference step.

2. Programmatic Tool Calling: Claude can now write code to orchestrate tools instead of making individual API round-trips. This allows for parallel execution and prevents intermediate results from piling into context. Anthropic claims a 37% reduction in token usage and reduced latency for multi-step tasks.

3. Tool Use Examples: Sample calls alongside schemas to improve accuracy. Anthropic’s testing showed parameter handling accuracy jumping from 72% to 90%. Schemas alone don’t communicate real-world usage well enough.

This does move us closer to a complete solution, but why keep piling on?

This Isn’t the Answer

The fundamental problem with Anthropic’s solution is that it’s complexity stacked on complexity. They’re not solving the underlying architectural issue; they’re adding more layers to manage the symptoms.

It’s still MCP at the core. Tool Search Tool and Programmatic Tool Calling are bandages over the MCP wound. You’re still defining tools in the MCP format, still dealing with MCP server connections, still working within a protocol that was designed for a different mental model. The complexity doesn’t disappear; it just gets shuffled around.

The training data gap. As Cloudflare pointed out: LLMs have seen millions of open source projects with real TypeScript and Python code. They’ve seen a tiny set of contrived tool-calling examples constructed by their own developers. Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It’s just not going to be his best work.

Inference overhead. With traditional tool calling, the output of each tool call must feed into the LLM’s neural network – just to be copied over to the inputs of the next call. Even with Programmatic Tool Calling, you’re still running inference to generate the orchestration code, when you could just, well, write code that calls APIs directly.

Feature fragmentation. These features are in beta and require specific headers. Implementation requires the advanced-tool-use-2025-11-20 header. The features aren’t mutually exclusive so you end up layering them: “Tool Search discovers tools, examples ensure correct invocation, programmatic calling handles orchestration.” That’s three systems to manage what should be just one.

Comparing the steps to using MCP vs. using Code Execution

From the start of working with AI, I always thought it was odd that we didn’t just call APIs. Like, it can write code and read docs. Why wouldn’t it just do curl requests or something similar? This is a very simplified take, but it is also not infeasible at this time.

Part IV: The Real Solution

Code Execution Over APIs

Cloudflare’s “Code Mode” approach represents what I believe is the correct direction. Instead of exposing MCP tools directly to the LLM, they convert MCP tools into a TypeScript API and ask the LLM to write code that calls that API.

The results are interesting. LLMs can handle many more tools, and more complex tools, when those tools are presented as a TypeScript API, rather than directly. This makes sense: LLMs have an enormous amount of real-world TypeScript in their training set, but only a small set of contrived examples of tool calls.

The approach shows its strength in multi-step operations. With traditional tool calling, intermediate results pile into context whether they’re useful or not. When the LLM writes code, it can skip all that and only read back the final results it needs. The code handles loops, conditionals, and data transformations. The LLM generates the logic once, the sandbox executes it, and only the relevant output returns to context.

The key insight: MCP is really just “a uniform way to expose an API for doing something, along with documentation needed for an LLM to understand it, with authorization handled out-of-band.” We don’t have to present tools as tools. We can convert them into a programming language API. The LLM writes against that API and a sandbox executes it.

How It Works

Instead of presenting tools directly to an LLM for individual invocation, we expose them as a programmatic API (typically TypeScript or Python) and let the model write code that orchestrates multiple tool calls. The code runs in a secure, isolated sandbox. This execution environment is deliberately restricted from general network access and can only interact with the specific APIs we provide.

The sandbox acts as a controlled runtime where the AI-generated code executes without the ability to access unauthorized resources or leak data. It can call our predefined APIs, process results in memory, filter and transform data, and only return the final curated output back to the model’s context. This means intermediate results stay within the sandbox and never bloat the context window. We’re talking about potentially large payloads: 10,000 spreadsheet rows or full document contents that never touch the LLM.
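To show the shape of this, here is a toy sketch of what a model-written script might do. It does not implement any sandboxing; `fetch_rows` is a hypothetical stand-in for one of the predefined APIs and returns synthetic data. The point is only the size difference between what the code touches and what goes back to the model.

```python
import json
import random


def fetch_rows(count: int) -> list[dict]:
    """Stand-in for an API call made inside the sandbox; returns synthetic rows."""
    rng = random.Random(0)
    return [{"host": f"host-{i % 50}", "cpu": rng.randint(1, 100)} for i in range(count)]


def model_written_script() -> dict:
    """What generated code might do: fetch, filter, aggregate, return a small summary."""
    rows = fetch_rows(10_000)                  # large payload stays in sandbox memory
    hot = [r for r in rows if r["cpu"] >= 95]  # filtering happens in code, not in the LLM
    worst = max(rows, key=lambda r: r["cpu"])
    return {"rows_scanned": len(rows), "hot_rows": len(hot), "worst_host": worst["host"]}


if __name__ == "__main__":
    rows_json = json.dumps(fetch_rows(10_000))
    summary_json = json.dumps(model_written_script())
    # Only summary_json would be returned to the model's context.
    print(len(rows_json), "characters fetched inside the sandbox")
    print(len(summary_json), "characters returned to the model")
```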

Cloudflare implements this using V8 isolates, which are lightweight JavaScript execution environments that can spin up in milliseconds and provide strong security guarantees. The key architectural shift is that tool definitions become filesystem-based TypeScript interfaces that the model can discover and import on demand, rather than loading all definitions upfront into the context.

Anthropic states in one of their examples: “This reduces the token usage from 150,000 tokens to 2,000 tokens – a time and cost savings of 98.7%.”

Architecture Comparison

Traditional MCP + Anthropic’s Fixes	Code Execution Approach
Load tool definitions → Tool Search → Inference to select → Tool call → Result to context → Repeat	Load TypeScript API docs → Generate code → Execute in sandbox → Return final result only
Multiple inference passes for multi-step tasks	Single inference pass generates full orchestration
Intermediate results bloat context	Only final results return to context
Limited training data for tool-calling format	Massive training data for TypeScript/Python

Same multi-tool call process as in the first image, but with Code Execution

Part V: How LuumenAI Is Solving It

LuumenAI’s solution creates specialized agents with code repositories that can execute specialized scripts and API calls written by the AI to interact with connections. This reduces the number of tools substantially while increasing accuracy and efficiency – and also reducing cost.

I want to be clear: We’re not fully there yet. LuumenAI is currently somewhere in the middle of this journey, using a hybrid approach as we continue to build toward full code execution. But we’re moving in that direction because the evidence is compelling, and every step we take reduces costs and improves quality.

Specialized Sub-Agents

Instead of giving an agent 90+ MCP tools and hoping it picks the right ones, we build specialized sub-agents. Each sub-agent has a focused domain (ServiceNow, monitoring, documentation) and a code-execution environment. When the AI needs to interact with a system, it doesn’t call a predefined tool. It writes a script against that system’s API, executes it in isolation, and returns only the relevant results.

By creating sub-agents, we are able to limit tools to groups of agent specialties and minimize context data. A monitoring sub-agent only has monitoring tools. A documentation sub-agent only has documentation tools. These sub-agents can process a request and give only relevant data back to the main agent, which keeps the overall context from compounding. Instead of one agent with 90 tools seeing 90 tool definitions, we have specialized agents each seeing 10-15 tools relevant to their domain.
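Structurally, the idea looks something like the sketch below. This is not LuumenAI's actual code: the class, the tool names, and the canned replies are all hypothetical, and each `handle` function stands in for a full sub-agent run with its own model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubAgent:
    domain: str
    tools: list[str]              # only this domain's tool definitions are loaded
    handle: Callable[[str], str]  # runs the sub-task and returns a short summary


def dispatch(agents: dict[str, SubAgent], domain: str, task: str) -> str:
    """Send a task to one sub-agent; the main agent sees only the returned summary."""
    return agents[domain].handle(task)


if __name__ == "__main__":
    agents = {
        "monitoring": SubAgent(
            "monitoring", ["get_metrics", "list_alerts"],
            lambda task: "2 open alerts on the selected system",
        ),
        "documentation": SubAgent(
            "documentation", ["search_docs"],
            lambda task: "1 runbook matches the open alerts",
        ),
    }
    # The main agent's context grows by two short strings, not by raw tool output.
    main_context = [
        dispatch(agents, "monitoring", "current problems?"),
        dispatch(agents, "documentation", "docs for those problems?"),
    ]
    print(main_context)
```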

Code Execution Environment

We implemented code execution to further optimize the tool calls in these agents. Rather than the model making sequential tool calls that each add to the context, it writes code that executes in a sandbox, processes results locally, and returns only the curated output.

The AI already knows how to write code. It’s been trained on millions of examples. Let it do what it’s good at.

This approach means:

Context stays small because we’re not loading 90 tool definitions
The AI’s responses are more accurate because it’s working in a paradigm it understands
Costs drop dramatically because we’re not compounding token usage across inference passes
We get a natural security boundary because the code executes in an isolated sandbox with only the permissions we grant

A Real-World Example

LuumenAI is a co-pilot tool that works with systems engineers and Linux admins to manage SAP/ERP systems. It has access to the Luumen ecosystem, including monitoring tools, incident management and reporting, documentation, and more.

When a user creates a chat with the co-pilot, it has context on the client the user is working on and the exact system or set of systems the user is interacting with, and it can use a suite of tools to get information about those systems and best practices/historical problem resolution for those instances or similar instances.

Let’s examine what happens when a user asks for current problems for the system they are viewing, along with any documentation that would help fix the issues. As outlined by Anthropic, the MCP approach requires that context be sent with EVERY request, meaning that every subsequent tool call and LLM call needs ALL the content from the previous calls, bloating the context dramatically. Code execution allows us to control, deterministically, how and what data we get. It allows us to process those results and provide a curated clean response back to the AI that reduces the number of tool calls (and layering of context info) and provides increased data security.

Rough example of Luumen multi-agent structure using Code Execution

The Results So Far

The combination of sub-agents and code execution transformed our cost structure. We went from $300-$900/day spikes with zero users to sustainable single-digit dollar costs during active development. More importantly, we built an architecture that can scale to real user loads without the exponential cost growth we were seeing before. This approach reduced our token usage by over 98% for complex multi-tool operations.

We’re continuing to push further toward full code execution as we develop LuumenAI. Every iteration gets us closer to the architecture we believe is the future of AI tooling.

Metric	Before (MCP)	After (Code Execution)
Daily Cost (dev testing)	$300-$900	$1-$5
Tokens per complex request	60,000-120,000+	~5,000-10,000
Tool definitions loaded	90 (all)	10-15 (relevant)
Context compounding	Yes (exponential)	No (controlled)

Conclusion: Drop the Backpack

AI costs are driven by tokens, and tokens accumulate faster than you might expect. Every system prompt, every tool definition, every intermediate result, and every response adds to the total. Without careful architecture, these costs can spiral out of control before you have a single paying customer.

Our experience with LuumenAI taught us three critical lessons:

Caching and deduplication are essential. Don’t ask the AI to do the same work twice.
Tool definitions are expensive. Give the model only what it needs when it needs it.
Iterative tool calling compounds costs. Find architectures that minimize round trips between the model and external systems.

The path forward isn’t to avoid AI. It’s to be intentional about how you use it. Observability tools like LangSmith gave us visibility into where tokens were being consumed. Architectural patterns like sub-agents and code execution gave us control over that consumption. Together, they let us build an AI-powered product that delivers value without bankrupting us in the process.

Anthropic is building better backpacks. Cloudflare is showing us we don’t need the backpack at all. We’re taking it a step further: Build the agents around code execution from the start, not as a feature bolted on top of a tool-calling protocol.

Drop the backpack. Run the race.

Appendix: A Note on Prompt Caching

¹Prompt caching can mitigate the tool definition load on repeated requests. When enabled, AI providers cache the static portions of your prompt (like tool definitions) and charge reduced rates for cached tokens. For example, Anthropic charges ~90% less for cached input tokens.

However, it’s important to note that cached tokens are still not free. You’re paying less, but you’re still paying for every request that includes those tool definitions. At Anthropic’s rates, cached tokens cost $0.0003 per 1K tokens for Sonnet 4.5 (compared to $0.003 for uncached). For 27,000 tokens of tool definitions across 10,000 requests per day, caching reduces your cost from $810/day to $81/day. That’s a meaningful savings, but you’re still paying $81 daily just for tool definitions.
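The same comparison in code, as a minimal sketch. It uses only the two rates quoted above and ignores any cost of writing to the cache and any cache expiry, so treat the cached figure as a lower bound.

```python
UNCACHED_PER_1K = 0.003  # USD, Sonnet 4.5 input
CACHED_PER_1K = 0.0003   # USD, cached rate quoted above


def daily_definition_cost(tokens: int, requests_per_day: int, rate_per_1k: float) -> float:
    """USD per day spent re-sending the same tool definitions."""
    return tokens / 1000 * rate_per_1k * requests_per_day


if __name__ == "__main__":
    uncached = daily_definition_cost(27_000, 10_000, UNCACHED_PER_1K)
    cached = daily_definition_cost(27_000, 10_000, CACHED_PER_1K)
    print(f"uncached: ${uncached:,.0f}/day")  # $810/day
    print(f"cached:   ${cached:,.0f}/day")    # $81/day
```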

More importantly, caching helps with the cost problem but doesn’t solve the architectural issues of context bloat and the “Lost in the Middle” phenomenon. You’re still loading all those definitions into context. The AI is still potentially getting confused by irrelevant options. And you’re still compounding context with each iterative tool call. Caching is a band-aid, not a cure.

Appendix: Calculations

² All cost calculations are using a standard Sonnet cost structure of 4/5 input tokens at $0.003 per 1,000 tokens and 1/5 output tokens at $0.015 per 1,000 tokens, then rounded to the nearest cent or dollar.

References

Anthropic – Advanced Tool Use
Anthropic – Code Execution with MCP
Cloudflare – Code Mode: The Better Way to Use MCP
Liu, N. F., Lin, K., Hewitt, J., et al. (2024). “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics, 12, 157-173. arXiv:2307.03172
Anthropic – Claude Pricing
Hugging Face – Byte-Pair Encoding Tokenization

About the Author

Josh Greenwell

Software engineer at apiphani and co-founder of Culture Booster

The Achilles’ Heel of AI Strategy

Mark Kujawski — Wed, 04 Feb 2026 16:35:10 +0000

Why 95% of AI Initiatives Fail and How Data Quality and Governance Can Fix It

As with prior technology waves, the current AI surge is marked by rapid adoption, inflated expectations, and uneven results. AI has become ubiquitous across enterprise strategy discussions, which often outpace the organizational foundations required to support it.

We are witnessing efforts to achieve remarkable outcomes in process automation and to use machine-level intelligence for decision-making in deployable operational platforms. These efforts, however, are generating a depressing statistic:

95% of Enterprise AI Projects Fail.*

That’s correct. 95 out of 100 AI projects fail to meet their success criteria, which raises the question: why? What is the Achilles’ heel of most AI strategies? What is preventing their success as well as an attractive return on investment?

The answer, though not trivial, is straightforward. The Achilles’ heel of AI strategy is a persistent lack of data quality and the absence of effective data governance. Without integrity in the data foundation, and clear accountability for how data is created, managed, and used, even the most sophisticated AI strategies collapse under their own weight.

A series of high-profile AI failures illustrates this reality:

These stories illustrate that AI fails not because it thinks poorly, but because it learns poorly from data that lacks governance. Let’s take a deeper dive into both subjects.

* Toscano, Joe, “Why 95% Of AI Projects Fail — And 4 Ways To Be In The 5% That Succeed”, Forbes, Sept 2025, Forbes


The Quality of Data

Data is everything to AI. AI requires enormous amounts of data inputs and sources to feed its voracious, machine-driven appetite and to refine and improve its logical models and neural networks. AI does not fix bad data; it amplifies it. If training data is incomplete, biased, or out-of-date, AI models produce distorted predictions that erode trust and create compliance risks.

This is compounded by another problem we’ve observed across the organizations we support. AI projects are often driven by technology aspirations rather than enterprise data realities. The result: Proof-of-concept models that never scale, analytics that contradict themselves, and insights no one fully trusts.

What are the root causes of data quality failures?

Fragmented Data Ecosystems: Data is scattered across ERP, CRM, and MES (for example), as well as unstructured sources with little synchronization. To achieve real-time decision-driving capabilities, all required data must be available and presented in real-time. This is a status that few organizations have achieved.
- Example: Customer churn models trained on CRM data without capturing support tickets or billing records.
- Impact: AI underestimates risk or misclassifies outcomes due to incomplete learning context.
Poor Data Quality and Data Origination: Inconsistent master data, missing lineage, and unreliable inputs feeding critical algorithms.
- Example: Manufacturing AI reading incorrect temperature values due to uncalibrated IoT sensors.
- Impact: Predictive maintenance or quality control models generate false alerts or fail to detect anomalies.
Duplicate or Redundant Data: This is one of the most prevalent and most difficult conditions for automated remedies: the issue of repeated records inflating the apparent frequency or weight of certain features.
- Example: One of the most important discoveries we made for the manufacturing division of a pharmaceutical company was navigating the “splits and collisions” of fragmented patient data.
- Impact: Multiple instances of the same patient data records skewed the results of AI algorithms for tracking insurance remediation.
Lack of Data Lineage and Traceability: This category involves the inability to track the origin, transformations, and ownership of data inputs. When these are not tracked, the consequences often include poor data quality, amplified bias, regulatory violations, and models that cannot generalize.
- Example: Unity Technologies’ $110 million ad-targeting error. The core issue stemmed from Unity’s ad targeting system, which utilized data from various sources to personalize ad delivery. A lack of clear data lineage meant that the origin and transformations of the data used to train and operate the ad-targeting AI were not fully understood or documented.
- Impact: This failure demonstrates how poor data management, including a lack of lineage, can lead to incorrect AI model outputs, resulting in a significant economic loss.

There are many more categories for examining the examples and impacts of poor data quality. The remedy for these problems is the focus of the second half of this examination: Strong corporate data and governance structures will largely eliminate the data problems that cause the high rate of failure for AI initiatives.

Corporate Governance for AI Strategy

Governance is often misunderstood as a bureaucratic layer, similar to other system guardrails like password management and trouble tickets. In practice, governance is the operating system of a well-functioning, data-driven enterprise and is a critical factor in using AI effectively and responsibly across the organization.

One of the earliest indicators of ineffective AI governance mirrors a challenge many organizations faced 15 years ago with the emergence of “shadow IT.” This happened as SaaS applications spread rapidly, introducing a subscription-based model that allowed individual teams to set up their own tools (e.g., separate Salesforce instances) without IT oversight.

The result was a Wild West scenario of uncontrolled data usage that exposed corporate intellectual property and sensitive financial data. It introduced considerable risk and left IT with limited opportunity to regain control without a fundamental shift in governance policies. The same issues are recurring today with the proliferation of AI projects at the enterprise department level. Unclear ownership, ad hoc data stewardship, and an absence of executive oversight are the primary contributors to ineffective AI strategy. One of the first ways to restore control is through the strict application of governance protocols designed for AI use cases and business-aligned deployments.

What are the hallmarks of an effective AI governance structure?

1. Strategic Alignment and Value Stewardship

AI governance ensures that AI investments are explicitly tied to enterprise objectives, not isolated technology initiatives. Governance bodies (typically operating at the Board and executive committee level) prioritize AI use cases based on measurable business value, risk tolerance, and strategic relevance.

This function answers the following fundamental questions:

Why is AI being deployed?
Where does it create competitive advantage?
Which AI initiatives should be scaled, paused, or terminated?

Without this layer, organizations experience AI sprawl, duplicated models, and fragmented investments with unclear ROI.

2. Data Integrity and Trust Enablement

Because AI systems are only as reliable as the data they consume, governance establishes ownership, accountability, and quality standards for enterprise data assets. This includes:

Data lineage and provenance requirements
Authoritative data sources (“single source of truth”)
Quality thresholds for model training and inference
Controls over synthetic, third-party, and externally sourced data

In mature organizations, governance treats data as a regulated strategic asset, not an operational byproduct. This directly mitigates the Achilles’ heel of AI: confidently automated decisions built on untrusted data.

3. Risk, Ethics, and Regulatory Oversight

AI governance institutionalizes risk management across the AI lifecycle, including:

Model bias and fairness
Explainability and auditability
Regulatory compliance (current and emerging)
Legal, reputational, and operational exposure

Rather than relying on ad hoc ethical reviews, mature governance embeds repeatable controls that are reviewed, tested, and enforced – like financial controls or cybersecurity frameworks. This is increasingly critical as regulators and courts treat AI-driven decisions as corporate acts, not technical artifacts.

4. Operating Model and Decision Rights

Effective AI governance clearly defines who owns what decisions:

Who approves AI use cases?
Who certifies models for production?
Who is accountable when AI outcomes are wrong?
Who can override or shut down an AI system?

As AI autonomy increases, governance replaces ambiguity with formal decision rights, escalation paths, and kill-switch authority. This prevents “shadow AI” and ensures humans remain accountable for machine-driven outcomes.

5. Continuous Oversight and Adaptation

Unlike static policies, mature AI governance is dynamic and evolutionary. It continuously:

Monitors model performance and drift
Reassesses risk as data and business conditions change
Incorporates new regulations and standards
Retires models that no longer meet trust or value thresholds

This transforms governance from a gatekeeper into a living management system, one that adapts at the same pace as AI itself. Adopting a new approach to governance is the first critical step in improving your data quality, putting effective guardrails around your data, and making your entire operating process ready for the effective use of AI technology.

Without governance, AI efforts degrade through model drift, shadow initiatives, and uncontrolled risk – eroding long-term value. Strong governance ensures higher-quality data, clear guardrails, and an operating model that enables AI to deliver reliable, sustainable outcomes.
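One common way to monitor drift is the Population Stability Index (PSI), which compares the distribution of a feature or model score at training time against what is seen in production. The sketch below is a plain-Python illustration and not a description of any apiphani tooling. The bin count, the epsilon used for empty bins, and the 0.25 alert threshold (a widely quoted rule of thumb) are assumptions to tune per use case.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        # Replace empty bins with a small epsilon so the logarithm is defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

DRIFT_THRESHOLD = 0.25  # assumed alert level, not a universal standard

baseline = [i / 100 for i in range(100)]        # scores seen at training time
shifted = [0.5 + i / 200 for i in range(100)]   # production scores bunched in the upper half

needs_review = psi(baseline, shifted) > DRIFT_THRESHOLD
```

A check like this can run on a schedule, with a breach routed to whoever holds the decision rights to retrain or retire the model.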

Determining the Proper Path to a Sustainable AI Strategy

Over-reliance on platforms and tools, rather than alignment with business goals and operating models, is a fundamental flaw of AI strategy that can be rectified through adoption of best practices. Apiphani works with enterprise organizations operating complex, mission-critical systems (like SAP), where reliability, accuracy, and accountability are non-negotiable. In these environments, AI initiatives cannot be separated from the conditions in which they operate.

What we consistently observe is that the models themselves rarely drive AI failures. Failures occur when advanced capabilities are introduced into environments with fragmented data, unclear ownership, and insufficient operational discipline.

Addressing this challenge does not require additional tools or more sophisticated algorithms. It requires establishing the foundational conditions that allow AI to operate reliably and predictably at scale. Our apiphani AI Strategy Framework is anchored by three pillars.

Here’s how we do it.

1. Data Integrity Foundation

A comprehensive data quality assessment (focused on accuracy, completeness, timeliness, and lineage) is the foundation for evaluating and optimizing data architecture for performance
The establishment of a data integrity index as a benchmark for AI readiness
Automated validation workflows using AI-driven data profiling and anomaly detection
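The text does not define how a data integrity index is calculated, so the following is only one plausible sketch: a weighted average of the four assessment dimensions named above, each scored between 0 and 1. The weights, the sample scores, and the readiness threshold are hypothetical.

```python
# Hypothetical weights for the four dimensions named in the assessment.
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "timeliness": 0.2, "lineage": 0.1}

def integrity_index(scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores, each expected in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

READINESS_THRESHOLD = 0.8  # assumed benchmark for "AI ready"

scores = {"accuracy": 0.9, "completeness": 0.8, "timeliness": 1.0, "lineage": 0.5}
index = integrity_index(scores)          # roughly 0.85 with these inputs
ready = index >= READINESS_THRESHOLD
```

A single number hides which dimension is weak, so in practice the per-dimension scores would be reported alongside the index. Here the low lineage score is the obvious place to start.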

2. Governance by Design

We’ve designed an AI Center of Excellence (CoE) that offers a consistent, scalable model for implementing effective AI strategy. Elements include:

A Data and AI Governance Council aligned to business domains
Policy frameworks for model lifecycle management, ethical AI, and compliance
Metadata management and lineage tracking to ensure transparency

3. AI Value Realization

Integration of governance metrics into AI ROI dashboards
Diagnostic tools to visualize data and governance health
Continuous improvement cycles connecting governance KPIs to business outcomes

The Path Forward

Organizations that treat governance as the backbone rather than the brake of AI strategy will outperform peers who chase the latest models without considering their foundations. The future of enterprise AI belongs to companies that understand this simple truth: AI is only as intelligent as the integrity of the data and governance that supports it.

Apiphani helps organizations generate powerful AI strategies by aligning data strategy, governance, and AI implementation into a single, coherent framework that delivers measurable business value.

The first step is our AI Readiness Assessment, which evaluates your organization across data readiness, platform and operational maturity, governance and risk controls, and the ability to safely deploy AI in mission-critical environments.


About the Author

Mark Kujawski