Whitepapers – apiphani

The Case for Adopting Data Products. Proven Methods for Building Better BI, ML, and AI Solutions

James Kendrick — Mon, 02 Mar 2026 15:36:29 +0000

Why Use Data Products

Data products aren’t mandatory for building BI dashboards, ML models or AI solutions, but they dramatically improve your odds of delivering successful, repeatable outcomes by adding semantic clarity and governance.

One Unified Intelligence Platform

In data and AI architecture, data products (trusted, reusable data assets with clear meaning, ownership, governance, and a defined way to be consumed) are the glue that makes the various architecture components operate as one unified intelligence platform.

Without data products, each tool in the architecture operates using its own interpretation of the data. With them, analytics, ML and AI share a consistent semantic foundation regardless of vendor stack. Consider how this plays out across two different ecosystems: An SAP-centric architecture and an AWS-native architecture.

SAP-Centric Architecture: SAP S/4HANA, Datasphere, Joule, and Databricks

SAP S/4HANA generates operational data
Datasphere models and governs it
Joule and AI services consume it
Databricks or similar platforms extend advanced analytics

All rely on shared, governed data products to maintain consistent business meaning.

AWS-Native Architecture: AWS S3, Redshift or Athena, SageMaker, and QuickSight

Raw data lands in AWS S3
The data is transformed through Redshift or Athena
Then feeds both SageMaker models and QuickSight dashboards

Using data products ensures consistent definitions, governance, and reusable interfaces across analytics and AI workflows in this architecture.

Strong Predictors of AI Success

Organizations experimenting with AI can produce early results without structured data products. But when AI initiatives are measured by scalability, reliability, and enterprise adoption, a consistent pattern emerges: High-performing organizations treat governed data products as a foundational layer rather than an afterthought.

Data products act as strong predictors of success because they:

Address a primary root cause of AI failure: Inconsistent semantics and unreliable data foundations.
Emerge independently across high-maturity AI organizations, even when different technology stacks are used.
Enable repeatability and governance, allowing models and analytics to move from experimentation to production.
Support cross-domain AI, where insights and models span multiple business functions.
Align with modern enterprise architectures, including SAP’s evolving data and AI strategy.
Correlate with stronger business outcomes: Organizations adopting governed data layers consistently outperform those that rely on ad-hoc pipelines.

BI dashboards, ML models and AI solutions can be built without formal data products, but why would you want to? Organizations seeking scalable, reliable, enterprise-grade outcomes consistently find data products indispensable.

Organizational alignment matters as much as the technology

A second critical predictor of success is proactive engagement from the C-suite. Data has long driven strategic advantage in data-intensive industries such as Finance, Media, and Retail. But today data’s importance extends across every sector.

Executive sponsorship ensures that data products are treated as business assets, not just technical artifacts.

Technical and operational readiness must progress together

Adopting data products requires both of the following:

Technical enablement: Platforms, architecture, and tooling
Operational capability: Ownership models, governance processes, and data modeling skills

These dimensions are interdependent. Any delay in adopting either one can slow time-to-value. We recommend beginning with focused technical pilots that demonstrate clear business outcomes through small, easily understood implementations.

Early wins help build momentum, validate governance approaches, and create the organizational alignment needed to scale.

Understanding the Technical Platform

Let’s return to our SAP-centric and AWS-native examples introduced earlier for this discussion.

Spaces

In SAP Datasphere architecture, Spaces are the primary organizational construct used to structure and govern data products, and the most crucial one to get right. Multiple Spaces support long-term data domain ownership, cross-domain collaboration, and shorter-term collaboration environments.

Create Spaces for different data domains like Customer Data, Product Data, Sales Data, Financial Planning & Analysis (FP&A), Social Data, Streaming Data, Financial Data, HR Data, and Manufacturing Data.

Enable data sharing and collaboration among these Spaces to encourage reuse (e.g., for an R&D project), while ensuring sensitive data is protected using methods like data masking and authorization. The PERMISSION Space authorization table, managed by designated security and administrative users, controls access rights for sharing data across these Spaces.

Architecture

The overall architecture to assemble and consume data products can be defined wholly within SAP Business Data Cloud. Alternatively, with a little (not a lot) more work, the architecture can be built with SAP Datasphere and Databricks tools – or with AWS cloud tools like S3, Redshift, Athena, and Quick Suite.

The architecture is typically organized into layered components:

Inbound Layer: Capture or federate raw data from source systems and external platforms.
Harmonization Layer: Standardize, transform, and clean data to ensure consistent structure and meaning across domains.
Propagation Layer: Create unified consumption entities – such as analytic data products, semantic models, and reporting views – that can be reused across BI, ML, and AI scenarios.
Reporting Layer: Optimize views specifically for reporting and analytics to support consistent branding, presentation, and user experience.

Governance

Effective data products require defined governance practices to ensure trust, consistency, and usability across domains. Establish clear ownership, naming conventions, and data lineage so users can understand and rely on the data they consume. Adopt a data catalog to manage data products and associated assets.
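
As a rough illustration of what clear ownership, naming conventions, and data lineage can mean in practice, the sketch below checks a catalog entry against three simple rules. Everything in it is hypothetical: the DataProduct fields, the <domain>_<entity> naming pattern, and the sample names are assumptions made for illustration, not an SAP Datasphere or AWS catalog API.

```python
import re
from dataclasses import dataclass, field

# Hypothetical naming convention: lowercase snake_case with at least two parts.
NAME_PATTERN = re.compile(r"^[a-z]+(_[a-z0-9]+)+$")


@dataclass
class DataProduct:
    name: str
    domain: str
    owner: str
    description: str
    upstream: list = field(default_factory=list)  # lineage: source objects


def governance_issues(product):
    """Return a list of governance problems found in a catalog entry."""
    issues = []
    if not product.owner.strip():
        issues.append("missing owner")
    if not NAME_PATTERN.match(product.name):
        issues.append("name violates naming convention")
    elif not product.name.startswith(product.domain + "_"):
        issues.append("name is not prefixed with its domain")
    if not product.upstream:
        issues.append("no lineage recorded")
    return issues


good = DataProduct(
    name="sales_orders_daily",
    domain="sales",
    owner="jane.doe",
    description="Daily order totals",
    upstream=["s4hana.sales_order_header"],  # illustrative source name
)
bad = DataProduct(name="OrdersFinal", domain="sales", owner="", description="")

print(governance_issues(good))  # []
print(governance_issues(bad))
```

A check like this could run whenever a data product is registered, so that entries without an owner or lineage never reach consumers.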

Roadmap

Adoption should follow a structured, value-driven roadmap. Build it around business value, organizational priorities, and the ability to execute, typically following a “crawl, walk, run” maturity approach.

Define initiatives by domain and cross-domain opportunities tied to clear business outcomes and supporting business cases. Select two to three visible data product opportunities that are achievable – not overly complex, but meaningful enough to demonstrate delivery and value.

Types of Data Products

Organizations typically work with two primary categories of data products:

Certified data products provided or governed centrally.
Custom data products built internally or through partners.

The following sections describe how these approaches apply within SAP Business Data Cloud (BDC) and broader data architectures.

Certified Data Products

Certified data products are governed, production-ready assets that follow standardized definitions, quality controls, and ownership models.

Data products arrive in SAP Business Data Cloud (BDC) in a basic form containing only the essential data for a business entity. Within SAP BDC, basic data products can be combined with other basic data products to form derived data products. These derived data products provide broader business context and are typically more useful for analytics and AI consumption.

Note: BDC is not mandatory to build your own or adopt preconfigured data products from SAP partners.

The following figure shows an SAP Business Data Cloud example of how source-level data products evolve into derived, consumer-oriented data products and higher-level business insights.

Source: SAP. Introducing Business Data Cloud. Focusing on Data Products and Intelligent Applications

Build Your Own

In addition to centrally certified data products, organizations may build their own commercial-grade or self-service data products tailored to specific business needs. These internally developed data products can still follow certified standards for governance, UX, and lifecycle management to ensure consistency and reuse across BI, ML, and AI initiatives.

For certified dashboards and commercial-grade data products, we recommend the following delivery lifecycle:

Data Product Stage	Delivery Approach
Specification and Visual Design	Follow a standard specification template and define the consumption design for data structures and user interaction.
System Connection	Establish pipeline connections to new or existing source systems.
Ingestion Data Streams	Configure ingestion or federation at defined frequencies.
Transformation Base Data Products (unit test)	Structure, transform, and store foundational data tables.
User Experience Design (UX)	Design dashboard experiences with product UX expertise and SAP Analytics Cloud (SAC) specialization where applicable.
Consumption Dashboards (unit test)	Develop analytic views and dashboards (e.g., Athena views, QuickSight, or Power BI).
Product Validation (integration/acceptance test)	Validate transformations and consumption layers through integration testing and business acceptance.
Production and Validation	Use CI/CD pipelines to promote development assets to production and validate production readiness.
Beta	Release to a small test group for feedback and refinement.
GA Onboarding	Assign standard roles to consumers and validate access permissions.
Launch	Client Data Product Owner responsible for training, communication, and consumer support following the Product Launch Checklist. Apiphani will provide all the launch checklist items associated with development and support.

Self-service is a more recent development that lets organizations draw foundational value from data products. Using existing BI dashboards and Spaces as a starting point, self-service users can rapidly bring new BI dashboards and Spaces into use with organized, certified data already available.

Data Product Marketplaces

Data product marketplaces provide curated assets that accelerate adoption by offering preconfigured datasets, models, and analytics aligned to specific business domains.

SAP: Available within SAP Business Data Cloud (BDC) via the SAP Business Accelerator Hub. These offerings include curated datasets, integration components, and analytical applications designed to support data-driven decision-making.

See: Data Product | Data Products | SAP Business Accelerator Hub

Apiphani: Available with or without SAP BDC. Organizations can select from an Apiphani catalog of preconfigured agents and KPIs spanning energy and manufacturing domains such as Finance, Engineering, Supply & Demand, Sales, and HR.

Implementation and Operational Considerations

Moving from data product concepts to real-world adoption requires a combination of governance practices, technical design decisions, and operating model alignment. The following considerations focus on how organizations evaluate potential data products, enable controlled self-service, and make architecture choices that balance agility with consistency.

These practices are not tied to a single platform; they apply across SAP Business Data Cloud, AWS-native environments, and hybrid architectures. Establishing clear evaluation criteria, access models, and data integration patterns helps ensure that data products remain scalable, governed, and reusable as adoption grows.

Data Products Evaluation Template

Use a consistent framework to evaluate and prioritize candidate data products:

Opportunity / Purpose
Business Priorities (Specific ROI or enabling priorities and strategies)
Core BI and AI Value (qualitative, quantitative, and strategic impact)
Technology and Data Availability
Deployable / Time to Value
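
One way to apply the template consistently is to rate each candidate from 1 to 5 on every criterion and combine the ratings into a weighted score. The sketch below is only an example of that idea: the weights, the criterion keys, and the candidate names are all hypothetical and would need to be set by each organization.

```python
# Hypothetical weights for the five template criteria; they sum to 1.0.
WEIGHTS = {
    "opportunity": 0.20,
    "business_priority": 0.25,
    "bi_ai_value": 0.25,
    "data_availability": 0.15,
    "time_to_value": 0.15,
}


def weighted_score(ratings):
    """Combine 1-5 ratings per criterion into a single weighted score."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return round(sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS), 2)


def prioritize(candidates, top_n=3):
    """Return the names of the top_n candidates, highest score first."""
    ranked = sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Illustrative candidates and ratings only.
candidates = {
    "customer_churn_view": {
        "opportunity": 4, "business_priority": 5, "bi_ai_value": 4,
        "data_availability": 3, "time_to_value": 3,
    },
    "plant_downtime_kpis": {
        "opportunity": 3, "business_priority": 3, "bi_ai_value": 3,
        "data_availability": 5, "time_to_value": 5,
    },
    "social_sentiment_feed": {
        "opportunity": 2, "business_priority": 2, "bi_ai_value": 3,
        "data_availability": 2, "time_to_value": 2,
    },
}

print(prioritize(candidates, top_n=2))
```

The score is a conversation aid rather than a decision rule; its main value is forcing every candidate through the same five questions.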

Self-Service Data Access

Implement self-service capabilities that allow business users to explore and model data independently while relying on governed data products as a foundation. This reduces reliance on centralized IT and increases agility without compromising consistency.

User Groups and Permissions

Define user groups and reusable roles to enforce appropriate access and authorization. Clearly structured roles help manage who can view, modify, or share data products across domains.

Remote Tables vs. Data Replication

Determine whether to use remote tables for real-time access without duplication, or replicated data for improved performance. Remote tables support immediate updates, while replication is better suited for performance-critical analytics.

CDS Views

When creating remote tables, we prefer using Core Data Services (CDS) views over direct S/4 tables to enhance performance and maintainability.

Operating Model

Successfully adopting data products requires more than architecture and tooling. It requires an operating model that aligns business leadership, governance structures, and technical delivery. Organizations that scale BI, ML, and AI initiatives treat data products as long-lived assets supported by clear ownership, domain leadership, and enterprise coordination.

The following roles and practices outline how operating models evolve to sustain governed, reusable data products across SAP Business Data Cloud, AWS-native, and hybrid environments.

Domain Leadership

Data domain strategy should align directly with business priorities and execution. Business leaders manage domains of defined size and scope, ensuring accountability for outcomes and data quality. While data products may integrate multiple domains, each data product should have a primary domain responsible for definition and implementation.

Data Product Owners guide success through key lifecycle phases — Concept, Business Planning, Development, Launch, and Support — shifting organizations from traditional project delivery toward a product-based operating model.

Data Product Ownership

A Data Product Owner is a business-savvy, technically aware steward responsible for ensuring that each data product remains accurate, governed, discoverable, and valuable for analytics and AI use cases.

This role operates at the intersection of business, data engineering, and data science and is one of the most important roles in a modern SAP data architecture. Key responsibilities include:

Promoting and communicating data product value
Representing consumer needs and adoption priorities
Owning business meaning, definitions, and semantic consistency
Ensuring data quality and trust
Coordinating with other Data Product Owners across domains

Center of Excellence

The Center of Excellence (CoE) provides enterprise-wide leadership across discovery, governance, innovation, and community engagement. The CoE partners with domain leaders and Data Product Owners to catalog and manage data assets, collaborates with IT infrastructure teams on permissions and standards, and maintains a shared forum for tools, patterns, and emerging use cases.

Data Catalog

IT and apiphani teams jointly maintain secure infrastructure operations, managing system requests, incidents, and ongoing platform optimization. A centralized data catalog supports discoverability, governance, and lifecycle management of data products.

Building effective data pipelines requires specialized expertise across architecture, engineering, DevOps, and consumption design. Successful implementation depends on strong integration practices, security alignment, and continuous performance monitoring across enterprise environments.

C-Suite Role

Executive sponsorship is essential to drive organizational alignment around data, analytics, and AI. The C-suite plays a critical role in shifting mindset and prioritizing data products as strategic assets.

Engage executive leadership early to establish visibility and alignment, and deepen involvement once initial pilot data products demonstrate measurable value.

Culture and Mindset Changes

Together with evolving ways of working across Domain Leaders, Data Product Owners, and the CoE, the C-suite enables the shift toward a data-driven culture. The following focus areas guide the transition to a steady-state operating model:

Executive teams recognize and expect data products as key drivers of business performance, consistently delivering above-benchmark results and exceptional outcomes in strategic initiatives.
Market leaders leverage embedded data products throughout their products, customer interactions, and operations. These organizations consistently generate and implement new ideas to enhance existing data products and develop new ones, driving continuous innovation.
Data Pipeline Acceleration shows how reusable solution components, reliable data transformations, and data views translate into system and user consumption with increasing speed to value (i.e., the AWS Data Flywheel).
Data Self Service enables comprehensive data access across the enterprise. The platform provides streamlined data discovery, enterprise-grade analytics, and automated business insights at scale powered by tools such as SAP Just Ask.

Conclusion

Data products provide the structure that allows BI, ML, and AI initiatives to move beyond experimentation into scalable, governed business capabilities. Whether implemented within SAP Business Data Cloud, AWS-native architectures, or hybrid environments, success depends on more than technology alone. It requires clear ownership, strong governance, and an operating model aligned to business outcomes.

Organizations that treat data as a product, supported by domain leadership and executive sponsorship, create a foundation for repeatable innovation, faster time to value, and sustained competitive advantage.

About the Authors

James Kendrick

Principal Director of Data and Analytics Products at apiphani.

Mario de Felipe

Global Director of SAP Technology and Innovation at apiphani.

Drop the Backpack: What $900/Day in AI Costs Taught Us About MCP

Josh Greenwell — Thu, 12 Feb 2026 18:14:26 +0000

TL;DR: MCPs are not efficient. Code execution makes tool usage intelligent, consistent, time efficient, and cost efficient while providing an additional layer of security between the data and the AI.

Introduction

Recently, I was talking with my colleague Braden about MCPs, tooling in AI, and token usage. Our team develops LuumenAI, the intelligence used in our observability and automation platform for monitoring ERP environments. As we got deeper into the conversation, I remember saying something along the lines of, “MCPs might become the NFTs of AI.” A bit hyperbolic, but I believe MCPs are a fad that will eventually die out as more practitioners grow disillusioned and move on.

For me, something never quite fit about the concept of MCPs in the way they were originally described or implemented. As it turns out, the creators of the Model Context Protocol (MCP), Anthropic, have come to a similar conclusion.

This whitepaper is about an expensive MCP lesson we learned the hard way: Three cost spikes that hit $100, then $300, then $900 per day in an environment with zero users and zero paying customers – just a handful of developers testing the system.

Here’s what went wrong, why the standard fix isn’t enough, and what we learned that actually works.

Part I: The Problem

First, What is MCP?

MCP is an open standard for connecting AI assistants to the systems where data lives, including content repositories, business tools, and development environments. The purpose of MCP is to help frontier models produce better, more relevant responses.

This sounds compelling, in theory, as it’s an important problem to try to solve in AI efficiency. But in practice, the MCP standard has failed. Anthropic itself has stated that traditional MCP patterns increase agent cost and latency. And results from an experiment by Cloudflare have found that MCP tools are inefficient, costly, and make the models “dumber”.

How AI Pricing Works

Before we can talk about why AI is expensive, we need to understand how AI actually processes text. The answer is tokens.

A token is the fundamental unit that a language model works with. When you send text to an AI, it doesn’t read words or characters the way humans do. Instead, it breaks everything down into tokens, which are chunks of text that the model has learned to recognize. A token might be a whole word like “hello,” a partial word like “ing” or “tion,” or even just a single character for uncommon symbols.

Most modern LLMs use Byte-Pair Encoding (BPE) or a close variant to do this. BPE starts with individual characters and iteratively merges the most frequently occurring pairs until it builds up a vocabulary of common sub-word units. This is why common words like “the” become single tokens, while rare or technical terms get split into multiple pieces. For example, “unhappiness” might become three tokens: “un,” “happi,” and “ness.”
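
To make the merge loop concrete, here is a toy version of BPE training in Python. It is a minimal sketch for intuition only: production tokenizers work on bytes, pre-split text, and learn tens of thousands of merges, and the tiny corpus here is invented for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts out as a sequence of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair merged.
        rewritten = Counter()
        for symbols, freq in corpus.items():
            rewritten[merge_pair(symbols, best)] += freq
        corpus = rewritten
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["the"] * 5 + ["then"] * 2 + ["zap"]
merges = train_bpe(words, num_merges=2)
print(tokenize("the", merges))  # ['the']
print(tokenize("zap", merges))  # ['z', 'a', 'p']
```

After only two merges the frequent word collapses to a single token while the rare one still costs three, which is the same effect that makes technical text tokenize less efficiently.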

A rule of thumb for English text is that 1 token equals about 4 characters, or roughly 0.75 words. So, 1,000 tokens is approximately 750 words, but this can vary significantly. Code, technical documentation, and non-English text often tokenize less efficiently, meaning more tokens per word.

The reason AI providers charge by token comes down to computational cost. Every token that goes into a model (input) and every token that comes out (output) requires processing power. Input tokens and output tokens are often priced differently. Output tokens typically cost more because generating new text requires the model to run inference step by step, predicting one token at a time.

Let’s look at current pricing for Anthropic’s Claude 4.5 models:

Model	Input (per 1K tokens)	Output (per 1K tokens)
Claude Opus 4.5	$0.005	$0.025
Claude Sonnet 4.5	$0.003	$0.015
Claude Haiku 4.5	$0.001	$0.005

Notice that output tokens cost 5x more than input tokens. This pricing structure means that a chatty AI that generates long responses will cost significantly more than one that gives concise answers. It also means that any architecture that repeatedly sends large contexts back and forth will accumulate costs quickly.

To put it in perspective: If you’re running 100,000 tokens through Claude Sonnet 4.5 (a mix of 70K input and 30K output), that’s roughly $0.21 + $0.45 = $0.66 per request. Run that 1,000 times a day and you’re looking at $660/day. Run it across 100 users making 10 requests each? Same math.
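
The same arithmetic can be written as a small helper, which makes it easy to try other token mixes. The prices are the per-1K figures from the table above; the dictionary keys are just labels chosen for this sketch, not official API model identifiers.

```python
# (input, output) USD per 1K tokens, copied from the pricing table above.
PRICE_PER_1K = {
    "opus-4.5": (0.005, 0.025),
    "sonnet-4.5": (0.003, 0.015),
    "haiku-4.5": (0.001, 0.005),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request with the given token counts."""
    input_price, output_price = PRICE_PER_1K[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price


per_request = request_cost("sonnet-4.5", input_tokens=70_000, output_tokens=30_000)
per_day = per_request * 1_000  # 1,000 requests, or 100 users x 10 requests

print(f"${per_request:.2f} per request, ${per_day:.0f} per day")
```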

Why MCP Bleeds Tokens

There are two compounding problems with how MCP handles context.

First, tool definitions load upfront. When you give an AI access to tools (functions it can call), each tool needs a definition. This definition includes the tool’s name, description, parameters, parameter types, and often examples of when to use it. A well-documented tool might be 100-300 tokens. Multiply that by 90 tools and you’re looking at 9,000 – 27,000 tokens of tool definitions sent with every request.¹

To put that in dollar terms: At Claude Sonnet 4.5 rates, 9,000 tokens cost about $0.027 and 27,000 tokens cost about $0.081. That doesn’t sound like much, but remember this cost is incurred on every single request – even before the user says anything useful. Run 10,000 requests per day and you’re looking at $270 – $810 daily just for tool definitions.
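
A short sketch of that calculation, assuming the Claude Sonnet 4.5 input price of $0.003 per 1K tokens quoted earlier. The function name and parameters are illustrative.

```python
def tool_definition_overhead(num_tools, tokens_per_tool, requests_per_day,
                             input_price_per_1k):
    """Return (tokens per request, USD per request, USD per day)
    spent on tool definitions alone."""
    tokens_per_request = num_tools * tokens_per_tool
    cost_per_request = tokens_per_request / 1000 * input_price_per_1k
    return tokens_per_request, cost_per_request, cost_per_request * requests_per_day


# Low and high ends of the 100-300 tokens-per-tool range, 10,000 requests a day.
low = tool_definition_overhead(90, 100, 10_000, 0.003)
high = tool_definition_overhead(90, 300, 10_000, 0.003)

print(f"{low[0]:,} tokens -> ${low[2]:.0f}/day")
print(f"{high[0]:,} tokens -> ${high[2]:.0f}/day")
```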

It gets worse. In the standard MCP approach, all tool definitions get sent with every request, even when the user’s query only needs one or two tools. That’s because the model needs to see all available options to decide which ones to use.

With LuumenAI, we had over 90 tools from just 5 implementations. This meant every base request started at roughly 20,000 tokens before the user typed anything. A simple “hello” message: 20,000 tokens – where the actual message was 1 token and the other 19,999 were the system prompt and tool definitions sitting there waiting to be useful.

Second, iterative calls compound context. When the model calls a tool and gets a result, that result is added to the context… then the model is called again with the original context PLUS the tool result. Then, if the model calls another tool, you get original context PLUS first result PLUS second result.

Each iterative call compounds the previous context, as illustrated in the following diagram.

Two tool calls compounding one user message
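
The compounding can also be simulated in a few lines. This is a simplified sketch: it counts only input tokens, ignores the model's own output being appended to the context, and uses made-up sizes for the two tool results.

```python
def cumulative_input_tokens(base_context, tool_result_sizes):
    """Return the input size of each model call and the total input billed."""
    context = base_context
    per_call = [context]           # first call: original context only
    for result_size in tool_result_sizes:
        context += result_size     # the tool result joins the context...
        per_call.append(context)   # ...and the model is called again with all of it
    return per_call, sum(per_call)


# Illustrative numbers: a 20,000-token base request and two tool results.
per_call, total = cumulative_input_tokens(20_000, [3_000, 5_000])
print(per_call)  # [20000, 23000, 28000]
print(total)     # 71000
```

Only 8,000 tokens of new information arrived, yet 71,000 input tokens were billed, because the base context was resent on every call.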

The “Lost in the Middle” Problem

Current AI models are good at writing code but struggle to sustain long conversations with large amounts of context. One reason is the “Lost in the Middle” problem.

The “Lost in the Middle” phenomenon was documented by Liu et al. in their 2024 research. It reveals a critical limitation in how language models process information within their context windows. Their experiments on multi-document question answering and key-value retrieval tasks demonstrated that model performance follows a distinctive U-shaped curve. Accuracy is highest when relevant information appears at the very beginning or end of the input context, but degrades significantly when critical data is positioned in the middle.

This occurs because of primacy and recency biases. Models tend to “remember” what they saw first and last, while information in the middle gets overshadowed. In some cases, GPT-3.5-Turbo performed worse with middle-positioned documents than when operating without any documents at all.

For our purposes, this means that as tool results accumulate in the context window, the AI increasingly struggles to locate and utilize the most relevant information, leading to degraded reasoning quality. The model literally gets “dumber” as context grows.

Part II: What Went Wrong with LuumenAI

We experienced this ourselves with LuumenAI’s early implementation. In the first version, we created a ReAct agent and connected tools to it. We tracked the agent’s decisions and had some cool visuals of the AI’s workflow. As we continued to iterate on the agent’s capability, token count skyrocketed and at times the model seemed to become less functional.

In this first iteration of LuumenAI, we experienced three major token spikes – with zero users. The first spike was ~$100 a day, the second was $300 a day, and the third was $600 – $900 a day. For an (at the time) unlaunched product, this was a serious problem.

Let’s look at the pitfall in each situation.

Spike #1: The Summarization Loop ($100/day)

First, we implemented a summarization tool that fed large amounts of vulnerability information into the LLM to output a summary for quick, concise reading. The problem was that we ran this summary for every server and every vulnerability on each of those servers every hour. It quickly became large amounts of text being repeatedly processed on our development machines.

Servers	Vulns/Server	Runs/Day	Tokens/Run	Daily Tokens	Daily Cost
3	~17	24	1.17M	28M	$100
10	20	24	5M	120M	$648
100	20	24	50M	1.2B	$6,480
1000	20	24	500M	12B	$64,800

1000 servers is a relatively normal number. Some companies have over 200 alone.

Just 1000 servers with 20 vulnerabilities costs $65,000 a day! We quickly implemented a caching layer and deduped our processing to bring things back into line.

Lesson learned: Never ask the AI to do the same work twice.

Spike #2: Tool Definition Overload ($300/day)

Next, we implemented a large suite of tools (70+ ServiceNow tools, 15+ Dynatrace tools, and others) which immediately brought every base request to 20,000 tokens. Sending the word “Hello” was 20,000 tokens. Anything more complicated requiring tools could push token counts to over 125,000.

Users	System Prompt	Tool Definitions	User Message	Cost
1	~4.5k	~15.5k	1	~$0.10
10	~4.5k	~15.5k	1	~$1.00
1000	~4.5k	~15.5k	1	~$100.00

1000 users sending just “Hello” would cost ~$100.

That’s ~18.5x the cost for that one token!

We quickly implemented optimizations and observability beyond what we already had to try to better understand how the token counts were getting so large. We culled tools we weren’t using yet and added dynamic tool insertion into prompts. That got us back in line, but left us with a lingering problem: What happens when we actually need 90 critical tools?

Lesson learned: Tool definitions are expensive. Give the model only what it needs when it needs it.
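
The source does not say how dynamic tool insertion was implemented. A minimal sketch of the idea, assuming a simple keyword-overlap filter, might look like the following; the tool names, descriptions, and stopword list are invented, and a real system would more likely use embeddings or a search index.

```python
STOPWORDS = {"a", "an", "and", "for", "of", "the", "to"}


def keywords(text):
    """Lowercase, split on whitespace, and drop common filler words."""
    return set(text.lower().split()) - STOPWORDS


def select_tools(query, tools, limit=3):
    """Return up to `limit` tool definitions whose text overlaps the query."""
    query_words = keywords(query)
    scored = []
    for tool in tools:
        tool_words = keywords(tool["name"].replace("_", " ") + " " + tool["description"])
        overlap = len(query_words & tool_words)
        if overlap:
            scored.append((overlap, tool["name"], tool))
    # Highest overlap first; tool name breaks ties deterministically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [tool for _, _, tool in scored[:limit]]


# Hypothetical tool catalog.
TOOLS = [
    {"name": "create_incident", "description": "Create a ServiceNow incident ticket"},
    {"name": "get_host_metrics", "description": "Fetch CPU and memory metrics for a host"},
    {"name": "list_vulnerabilities", "description": "List open vulnerabilities for a server"},
]

selected = select_tools("Show CPU metrics for host sap01", TOOLS)
print([tool["name"] for tool in selected])  # ['get_host_metrics']
print(select_tools("hello", TOOLS))         # []
```

With a filter like this, a plain “hello” carries no tool definitions at all, and a metrics question carries one instead of ninety.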

Spike #3: The Perfect Storm ($600-$900/day)

Finally, we aggressively used the models to do huge multi-step processing on text-rich files. Doing this unintentionally created a combination of our first and second mistakes. There weren’t many tools anymore, but the large blocks of text being processed en masse through many stages still created those large token counts – which get replicated over each step. The result was two days of spikes at $600 and $900, respectively.

Step	Input Tokens	Output Tokens	Total Tokens	Cost
User Msg	5,000	0	5,000	~$0.03
Tool 1	5,000	10,000	15,000	~$0.08
Tool 2	15,000	10,000	25,000	~$0.14
Tool 3	25,000	10,000	35,000	~$0.19
Tool 4	35,000	10,000	45,000	~$0.24
AI Response	45,000	500	45,500	~$0.25
				~$0.92

Total cost is the SUM of all costs!
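
The table can be reproduced with a few lines of Python. One assumption is needed: the article does not state the rate behind the Cost column, but the rows are consistent with a single blended price of roughly $5.40 per million tokens applied to each step's total, so the sketch uses that inferred figure.

```python
# Assumed blended rate, inferred from the table rows (not stated in the source).
BLENDED_RATE_PER_1K = 0.0054

# (step, input tokens, output tokens) exactly as listed in the table.
STEPS = [
    ("User Msg", 5_000, 0),
    ("Tool 1", 5_000, 10_000),
    ("Tool 2", 15_000, 10_000),
    ("Tool 3", 25_000, 10_000),
    ("Tool 4", 35_000, 10_000),
    ("AI Response", 45_000, 500),
]


def step_cost(input_tokens, output_tokens):
    """Cost in USD of one step at the assumed blended rate."""
    return (input_tokens + output_tokens) / 1000 * BLENDED_RATE_PER_1K


def request_total(steps):
    """A request costs the sum of every step, not just the final one."""
    return sum(step_cost(inp, out) for _, inp, out in steps)


total = request_total(STEPS)
print(f"${total:.2f} per complex request")  # $0.92 per complex request
```

The final step alone accounts for about a quarter of the bill; the rest is the price of resending the growing context at every intermediate step.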

Users	Requests	Total
1	1	$1
10	1	$10
100	1	$100
10	10	$100
100	10	$1000

100 users making just 10 complex requests a day costs $1000.

10,000 users making 10 complex requests a day costs $100,000 a day! This caused us to reevaluate the core architecture of our system.

Lesson learned: Iterative tool calling compounds costs. Find architectures that minimize round trips between the model and external systems.

Part III: Anthropic’s Fix (And Why It’s Not Enough)

While I was writing this article, Anthropic came out with their solution to the MCP problem. It is a step forward. But using MCP feels like running a race wearing a backpack full of rocks – and Anthropic’s solution is to add more straps to help better distribute the weight. Yes, it’ll get you to the finish line, but you’ll be exhausted and you’ll have made lots of suboptimal decisions along the way. A better solution is to just drop the backpack.

Anthropic’s Three New Features

In their new article, Anthropic shared how Claude will start to use dynamic tooling and a level of code execution (sometimes) to be more efficient with token usage. They introduced three new features:

1. Tool Search Tool: Instead of loading all tool definitions upfront (which Anthropic admits can hit 134K tokens internally), Claude can now search for tools on demand. Tools are marked with defer_loading: true, and Claude only sees the Tool Search Tool itself (~500 tokens), plus always-loaded tools. When Claude needs a capability, it searches. This keeps context windows lean but adds another inference step.

2. Programmatic Tool Calling: Claude can now write code to orchestrate tools instead of making individual API round-trips. This allows for parallel execution and prevents intermediate results from piling into context. Anthropic claims a 37% reduction in token usage and reduced latency for multi-step tasks.

3. Tool Use Examples: Sample calls alongside schemas to improve accuracy. Anthropic’s testing showed parameter handling accuracy jumping from 72% to 90%. Schemas alone don’t communicate real-world usage well enough.
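
To show why orchestrating tools in code keeps intermediate results out of context, here is a plain-Python sketch. It does not use Anthropic's API; the list_vulnerabilities tool and its data are hypothetical, and character counts stand in loosely for tokens.

```python
import json


def list_vulnerabilities(server):
    """Hypothetical tool: returns a verbose record for each vulnerability."""
    return [
        {
            "server": server,
            "id": f"VULN-{n}",
            "severity": "critical" if n % 5 == 0 else "low",
            "details": "long scanner output " * 20,
        }
        for n in range(20)
    ]


def critical_counts(servers):
    """Orchestration code: call the tool per server, keep only a small summary."""
    return {
        server: sum(1 for v in list_vulnerabilities(server) if v["severity"] == "critical")
        for server in servers
    }


servers = ["app01", "app02", "db01"]

# What the model would have to read if every raw tool result entered its context.
raw_chars = sum(len(json.dumps(list_vulnerabilities(s))) for s in servers)

# What the model reads when code does the filtering and returns only the answer.
summary = critical_counts(servers)
summary_chars = len(json.dumps(summary))

print(summary)  # {'app01': 4, 'app02': 4, 'db01': 4}
print(raw_chars, "vs", summary_chars, "characters")
```

The raw results never need to pass through the model; only the three-entry summary does.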

This does move us closer to a complete solution, but why keep piling on?

This Isn’t the Answer

The fundamental problem with Anthropic’s solution is that it’s complexity stacked on complexity. They’re not solving the underlying architectural issue; they’re adding more layers to manage the symptoms.

It’s still MCP at the core. Tool Search Tool and Programmatic Tool Calling are bandages over the MCP wound. You’re still defining tools in the MCP format, still dealing with MCP server connections, still working within a protocol that was designed for a different mental model. The complexity doesn’t disappear; it just gets shuffled around.

The training data gap. As Cloudflare pointed out: LLMs have seen millions of open source projects with real TypeScript and Python code. They’ve seen a tiny set of contrived tool-calling examples constructed by their own developers. Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It’s just not going to be his best work.

Inference overhead. With traditional tool calling, the output of each tool call must feed into the LLM’s neural network – just to be copied over to the inputs of the next call. Even with Programmatic Tool Calling, you’re still running inference to generate the orchestration code, when you could just, well, write code that calls APIs directly.

Feature fragmentation. These features are in beta, and implementation requires the advanced-tool-use-2025-11-20 header. The features aren’t mutually exclusive, so you end up layering them: “Tool Search discovers tools, examples ensure correct invocation, programmatic calling handles orchestration.” That’s three systems to manage what should be just one.

Comparing the steps to using MCP vs. using Code Execution

From the start of working with AI, I always thought it was odd that we didn’t just call APIs. Like, it can write code and read docs. Why wouldn’t it just do curl requests or something similar? This is a very simplified take, but it is also not infeasible at this time.

Part IV: The Real Solution

Code Execution Over APIs

Cloudflare’s “Code Mode” approach represents what I believe is the correct direction. Instead of exposing MCP tools directly to the LLM, they convert MCP tools into a TypeScript API and ask the LLM to write code that calls that API.

The results are interesting. LLMs can handle many more tools, and more complex tools, when those tools are presented as a TypeScript API, rather than directly. This makes sense: LLMs have an enormous amount of real-world TypeScript in their training set, but only a small set of contrived examples of tool calls.

The approach shows its strength in multi-step operations. With traditional tool calling, intermediate results pile into context whether they’re useful or not. When the LLM writes code, it can skip all that and only read back the final results it needs. The code handles loops, conditionals, and data transformations. The LLM generates the logic once, the sandbox executes it, and only the relevant output returns to context.

The key insight: MCP is really just “a uniform way to expose an API for doing something, along with documentation needed for an LLM to understand it, with authorization handled out-of-band.” We don’t have to present tools as tools. We can convert them into a programming language API. The LLM writes against that API and a sandbox executes it.

How It Works

Instead of presenting tools directly to an LLM for individual invocation, we expose them as a programmatic API (typically TypeScript or Python) and let the model write code that orchestrates multiple tool calls. The code runs in a secure, isolated sandbox. This execution environment is deliberately restricted from general network access and can only interact with the specific APIs we provide.

The sandbox acts as a controlled runtime where the AI-generated code executes without the ability to access unauthorized resources or leak data. It can call our predefined APIs, process results in memory, filter and transform data, and only return the final curated output back to the model’s context. This means intermediate results stay within the sandbox and never bloat the context window. We’re talking about potentially large payloads: 10,000 spreadsheet rows or full document contents that never touch the LLM.
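
A rough sketch of that flow, under loud assumptions: `fetch_rows` stands in for one of the predefined APIs and returns made-up data, and `run_in_sandbox` is an ordinary Python function, not a real isolated runtime. The point is only to show what stays local and what goes back to the model.

```python
import json


def fetch_rows(n=10_000):
    """Stand-in for an API call that returns a large payload (hypothetical data)."""
    return [{"id": i, "status": "error" if i % 1000 == 0 else "ok"} for i in range(n)]


def run_in_sandbox():
    """Model-written orchestration: fetch, filter, and summarize locally."""
    rows = fetch_rows()  # the large intermediate result never leaves this function
    errors = [r["id"] for r in rows if r["status"] == "error"]
    return {"total_rows": len(rows), "error_ids": errors}


raw_size = len(json.dumps(fetch_rows()))  # what direct tool calling would push into context
result = run_in_sandbox()
summary_size = len(json.dumps(result))    # what the code execution path returns

print(result)
print(raw_size, summary_size)
```

Character counts are only a proxy for tokens, but the gap between the two sizes shows why keeping intermediate data out of context matters.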

Cloudflare implements this using V8 isolates, which are lightweight JavaScript execution environments that can spin up in milliseconds and provide strong security guarantees. The key architectural shift is that tool definitions become filesystem-based TypeScript interfaces that the model can discover and import on demand, rather than loading all definitions upfront into the context.

Anthropic states in one of their examples: “This reduces the token usage from 150,000 tokens to 2,000 tokens – a time and cost savings of 98.7%.”

Architecture Comparison

Traditional MCP + Anthropic’s Fixes	Code Execution Approach
Load tool definitions → Tool Search → Inference to select → Tool call → Result to context → Repeat	Load TypeScript API docs → Generate code → Execute in sandbox → Return final result only
Multiple inference passes for multi-step tasks	Single inference pass generates full orchestration
Intermediate results bloat context	Only final results return to context
Limited training data for tool-calling format	Massive training data for TypeScript/Python

Same multi-tool call process as in the first image, but with Code Execution

Part V: How LuumenAI Is Solving It

LuumenAI’s solution creates specialized agents with code repositories that can execute specialized scripts and API calls written by the AI to interact with connections. This reduces the number of tools substantially while increasing accuracy and efficiency – and also reducing cost.

I want to be clear: We’re not fully there yet. LuumenAI is currently somewhere in the middle of this journey, using a hybrid approach as we continue to build toward full code execution. But we’re moving in that direction because the evidence is compelling, and every step we take reduces costs and improves quality.

Specialized Sub-Agents

Instead of giving an agent 90+ MCP tools and hoping it picks the right ones, we build specialized sub-agents. Each sub-agent has a focused domain (ServiceNow, monitoring, documentation) and a code-execution environment. When the AI needs to interact with a system, it doesn’t call a predefined tool. It writes a script against that system’s API, executes it in isolation, and returns only the relevant results.

By creating sub-agents, we are able to limit tools to groups matched to each agent’s specialty and minimize context data. A monitoring sub-agent only has monitoring tools. A documentation sub-agent only has documentation tools. These sub-agents can process a request and give only relevant data back to the main agent, which keeps the overall context from compounding. Instead of one agent with 90 tools seeing 90 tool definitions, we have specialized agents each seeing 10-15 tools relevant to their domain.
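
As a simplified illustration of that scoping (not LuumenAI's actual code), the sketch below maps each sub-agent to its own tool list. The domain names echo the ones above; the tool names and both helper functions are hypothetical.

```python
# Hypothetical sub-agents and tool names, chosen only for illustration.
SUB_AGENTS = {
    "monitoring": ["get_metrics", "list_alerts", "get_host_status"],
    "incidents": ["list_incidents", "get_incident", "update_incident"],
    "documentation": ["search_docs", "get_runbook"],
}


def tools_for(domain):
    """Return only the tool names the chosen sub-agent is allowed to see."""
    return SUB_AGENTS[domain]


def total_tools():
    """Count what a single monolithic agent would have to load."""
    return sum(len(tools) for tools in SUB_AGENTS.values())


print(tools_for("monitoring"))
print(total_tools())
```

Each request pays for one domain's definitions instead of the whole catalog.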

Code Execution Environment

We implemented code execution to further optimize the tool calls in these agents. Rather than the model making sequential tool calls that each add to the context, it writes code that executes in a sandbox, processes results locally, and returns only the curated output.

The AI already knows how to write code. It’s been trained on millions of examples. Let it do what it’s good at.

This approach means:

Context stays small because we’re not loading 90 tool definitions
The AI’s responses are more accurate because it’s working in a paradigm it understands
Costs drop dramatically because we’re not compounding token usage across inference passes
We get a natural security boundary because the code executes in an isolated sandbox with only the permissions we grant

A Real-World Example

LuumenAI is a co-pilot tool that works with systems engineers and Linux admins to manage SAP/ERP systems. It has access to the Luumen ecosystem, including monitoring tools, incident management and reporting, documentation, and more.

When a user creates a chat with the co-pilot, it has context on the client the user is working with and the exact system or set of systems the user is interacting with. It can also use a suite of tools to get information about those systems, along with best practices and historical problem resolutions for those or similar instances.

Let’s examine what happens when a user asks for current problems for the system they are viewing, along with any documentation that would help fix the issues. As outlined by Anthropic, the MCP approach requires that context be sent with EVERY request, meaning that every subsequent tool call and LLM call needs ALL the content from the previous calls, bloating the context dramatically. Code execution allows us to control, deterministically, how and what data we get. It allows us to process those results and provide a curated clean response back to the AI that reduces the number of tool calls (and layering of context info) and provides increased data security.
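
The compounding is easy to see with back-of-the-envelope arithmetic. The sketch below assumes a simple model: every pass re-sends the full context, and each tool result is appended before the next pass. The numbers are illustrative only, not measurements from LuumenAI, and the function name is made up.

```python
def cumulative_input_tokens(base_tokens, tokens_per_result, calls):
    """Total input tokens billed when every pass re-sends the full context."""
    total = 0
    context = base_tokens
    for _ in range(calls):
        total += context               # the whole context is sent on this pass
        context += tokens_per_result   # the tool result is appended for the next pass
    return total


# Illustrative numbers only.
iterative = cumulative_input_tokens(base_tokens=27_000, tokens_per_result=2_000, calls=5)
single_pass = cumulative_input_tokens(base_tokens=27_000, tokens_per_result=2_000, calls=1)
print(iterative, single_pass)
```

Under these assumptions, five sequential tool calls bill far more input tokens than five times the size of a single result, because the base context is paid for again on every pass.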

Rough example of Luumen multi-agent structure using Code Execution

The Results So Far

The combination of sub-agents and code execution transformed our cost structure. We went from $300-$900/day spikes with zero users to sustainable single-digit dollar costs during active development. More importantly, we built an architecture that can scale to real user loads without the exponential cost growth we were seeing before. This approach reduced our token usage by over 98% for complex multi-tool operations.

We’re continuing to push further toward full code execution as we develop LuumenAI. Every iteration gets us closer to the architecture we believe is the future of AI tooling.

Metric	Before (MCP)	After (Code Execution)
Daily Cost (dev testing)	$300-$900	$1-$5
Tokens per complex request	60,000-120,000+	~5,000-10,000
Tool definitions loaded	90 (all)	10-15 (relevant)
Context compounding	Yes (exponential)	No (controlled)

Conclusion: Drop the Backpack

AI costs are driven by tokens, and tokens accumulate faster than you might expect. Every system prompt, every tool definition, every intermediate result, and every response adds to the total. Without careful architecture, these costs can spiral out of control before you have a single paying customer.

Our experience with LuumenAI taught us three critical lessons:

Caching and deduplication are essential. Don’t ask the AI to do the same work twice.
Tool definitions are expensive. Give the model only what it needs when it needs it.
Iterative tool calling compounds costs. Find architectures that minimize round trips between the model and external systems.
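
The first lesson can be sketched in a few lines. This is a minimal in-memory example, assuming requests are JSON-serializable dictionaries; `expensive_model_call` is a stand-in for a paid LLM request, and none of these names come from our codebase.

```python
import hashlib
import json

_cache = {}
stats = {"calls": 0}


def expensive_model_call(prompt):
    """Stand-in for a paid LLM request (hypothetical)."""
    stats["calls"] += 1
    return f"summary of: {prompt['task']}"


def cached_call(prompt):
    """Reuse an earlier answer when the exact same request is seen again."""
    # sort_keys makes the hash independent of dictionary key order
    key = hashlib.sha256(json.dumps(prompt, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = expensive_model_call(prompt)
    return _cache[key]


first = cached_call({"task": "summarize incident 42", "system": "sap-prod"})
second = cached_call({"system": "sap-prod", "task": "summarize incident 42"})
print(first == second, stats["calls"])
```

A production version would also need expiry and invalidation, which this sketch leaves out.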

The path forward isn’t to avoid AI. It’s to be intentional about how you use it. Observability tools like LangSmith gave us visibility into where tokens were being consumed. Architectural patterns like sub-agents and code execution gave us control over that consumption. Together, they let us build an AI-powered product that delivers value without bankrupting us in the process.

Anthropic is building better backpacks. Cloudflare is showing us we don’t need the backpack at all. We’re taking it a step further: Build the agents around code execution from the start, not as a feature bolted on top of a tool-calling protocol.

Drop the backpack. Run the race.

Appendix: A Note on Prompt Caching

¹Prompt caching can mitigate the tool definition load on repeated requests. When enabled, AI providers cache the static portions of your prompt (like tool definitions) and charge reduced rates for cached tokens. For example, Anthropic charges ~90% less for cached input tokens.

However, it’s important to note that cached tokens are still not free. You’re paying less, but you’re still paying for every request that includes those tool definitions. At Anthropic’s rates, cached tokens cost $0.0003 per 1K tokens for Sonnet 4.5 (compared to $0.003 for uncached). For 27,000 tokens of tool definitions across 10,000 requests per day, caching reduces your cost from $810/day to $81/day. That’s a meaningful savings, but you’re still paying $81 daily just for tool definitions.
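
The arithmetic behind those figures, as a small sketch. The prices and volumes are the ones quoted above; the `daily_cost` helper is just a name chosen for this example.

```python
def daily_cost(tokens_per_request, requests_per_day, price_per_1k):
    """Daily spend in dollars for a fixed block of input tokens."""
    return tokens_per_request / 1000 * price_per_1k * requests_per_day


uncached = daily_cost(27_000, 10_000, 0.003)   # standard input rate
cached = daily_cost(27_000, 10_000, 0.0003)    # cached input rate
print(round(uncached, 2), round(cached, 2))
```

Rounded to the cent, this reproduces the $810 and $81 daily totals.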

More importantly, caching helps with the cost problem but doesn’t solve the architectural issues of context bloat and the “Lost in the Middle” phenomenon. You’re still loading all those definitions into context. The AI is still potentially getting confused by irrelevant options. And you’re still compounding context with each iterative tool call. Caching is a band-aid, not a cure.

Appendix: Calculations

² All cost calculations are using a standard Sonnet cost structure of 4/5 input tokens at $0.003 per 1,000 tokens and 1/5 output tokens at $0.015 per 1,000 tokens, then rounded to the nearest cent or dollar.
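
As a sketch of that cost structure, the function below applies the 4/5 input and 1/5 output split to a total token count. The 100,000-token request is an illustrative input, and `blended_cost` is a hypothetical helper, not part of any SDK.

```python
def blended_cost(total_tokens, input_share=0.8, input_per_1k=0.003, output_per_1k=0.015):
    """Dollar cost of a request split into input and output tokens."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return input_tokens / 1000 * input_per_1k + output_tokens / 1000 * output_per_1k


cost = blended_cost(100_000)  # illustrative request size
print(round(cost, 2))
```

For 100,000 tokens, that is 80,000 input tokens at $0.24 plus 20,000 output tokens at $0.30.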

References

Anthropic – Advanced Tool Use
Anthropic – Code Execution with MCP
Cloudflare – Code Mode: The Better Way to Use MCP
Liu, N. F., Lin, K., Hewitt, J., et al. (2024). “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics, 12, 157-173. arXiv:2307.03172
Anthropic – Claude Pricing
Hugging Face – Byte-Pair Encoding Tokenization

About the Author

Josh Greenwell

Software engineer at apiphani and co-founder of Culture Booster

The Achilles’ Heel of AI Strategy

Mark Kujawski — Wed, 04 Feb 2026 16:35:10 +0000

Why 95% of AI Initiatives Fail and How Data Quality and Governance Can Fix It

As with prior technology waves, the current AI surge is marked by rapid adoption, inflated expectations, and uneven results. AI has become ubiquitous across enterprise strategy discussions, which often outpace the organizational foundations required to support it.

We are witnessing efforts to achieve incredible outcomes in process automation and the leveraging of machine-level intelligence to produce great decision-making capabilities in a deployable operational platform. These efforts, however, are generating a depressing statistic:

95% of Enterprise AI Projects Fail.*

That’s correct. 95 out of 100 AI projects fail to meet their success criteria, which raises the question: why? What is the Achilles’ heel of most AI strategies? What is preventing their success as well as an attractive return on investment?

The answer, though not trivial, is straightforward. The Achilles’ heel of AI strategy is a persistent lack of data quality and the absence of effective data governance. Without integrity in the data foundation, and clear accountability for how data is created, managed, and used, even the most sophisticated AI strategies collapse under their own weight.

A series of high-profile AI failures illustrates this reality:

These stories illustrate that AI fails not because it thinks poorly, but because it learns poorly from data that lacks governance. Let’s take a deeper dive into both subjects.

* Toscano, Joe, “Why 95% Of AI Projects Fail — And 4 Ways To Be In The 5% That Succeed”, Forbes, Sept 2025, Forbes

The Quality of Data

Data is everything to AI. AI requires enormous amounts of data inputs and sources to feed its voracious, machine-driven appetite and to refine and improve its logical models and neural networks. AI does not fix bad data; it amplifies it. If training data is incomplete, biased, or out-of-date, AI models produce distorted predictions that erode trust and create compliance risks.

This is compounded by another problem we’ve observed across the organizations we support. AI projects are often driven by technology aspirations rather than enterprise data realities. The result: Proof-of-concept models that never scale, analytics that contradict themselves, and insights no one fully trusts.

What are the root causes of data quality failures?

Fragmented Data Ecosystems: Data is scattered across systems such as ERP, CRM, and MES, as well as unstructured sources, with little synchronization. To support real-time decision-making, all required data must be available and presented in real time. Few organizations have achieved this.
- Example: Customer churn models trained on CRM data without capturing support tickets or billing records.
- Impact: AI underestimates risk or misclassifies outcomes due to incomplete learning context.
Poor Data Quality and Data Origination: Inconsistent master data, missing lineage, and unreliable inputs feeding critical algorithms.
- Example: Manufacturing AI reading incorrect temperature values due to uncalibrated IoT sensors.
- Impact: Predictive maintenance or quality control models generate false alerts or fail to detect anomalies.
Duplicate or Redundant Data: Repeated records inflate the apparent frequency or weight of certain features. This is one of the most prevalent conditions and one of the most difficult to remedy automatically.
- Example: One of the most important discoveries we made for the manufacturing division of a pharmaceutical company came from navigating the “splits and collisions” of fragmented patient data.
- Impact: Multiple instances of the same patient records were skewing the results of AI algorithms for tracking insurance remediation.
Lack of Data Lineage and Traceability: This category involves the inability to track the origin, transformations, and ownership of data inputs. When these are not tracked, the consequences often include poor data quality, amplified bias, regulatory violations, and models that cannot generalize.
- Example: Unity Technologies’ $110 million ad-targeting error. The core issue stemmed from Unity’s ad targeting system, which utilized data from various sources to personalize ad delivery. A lack of clear data lineage meant that the origin and transformations of the data used to train and operate the ad-targeting AI were not fully understood or documented.
- Impact: This failure demonstrates how poor data management, including a lack of lineage, can lead to incorrect AI model outputs, resulting in a significant economic loss.
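
Two of the checks implied by these categories, duplicate detection and completeness, can be sketched in a few lines. The records and field names below are invented for illustration, and real entity resolution is considerably harder than the simple normalization shown here.

```python
from collections import Counter

# Hypothetical records; field names are assumptions for illustration.
records = [
    {"patient_id": "A-001", "name": "Jane Doe", "dob": "1980-02-01"},
    {"patient_id": "a-001 ", "name": "Jane Doe", "dob": "1980-02-01"},
    {"patient_id": "B-002", "name": "John Roe", "dob": None},
]


def normalize(value):
    """Trim and upper-case an identifier so trivial variants collide."""
    return value.strip().upper()


def duplicate_ids(rows):
    """Return identifiers that appear more than once after normalization."""
    counts = Counter(normalize(r["patient_id"]) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


print(duplicate_ids(records))
print(completeness(records, "dob"))
```

Checks like these are cheap to run before training and make the duplicate and missing-data problems visible early.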

There are many more categories for examining the examples and impacts of poor data quality. The remedy for these problems is the focus of the second half of this examination: Strong corporate data and governance structures will largely eliminate the data problems that cause the high rate of failure for AI initiatives.

Corporate Governance for AI Strategy

Governance is often misunderstood as a bureaucratic layer, similar to other system guardrails like password management and trouble tickets. In reality, governance is the operating system of a well-functioning, data-driven enterprise and a critical factor in using AI effectively and responsibly across the organization.

One of the earliest indicators of ineffective AI governance mirrors a challenge many organizations faced 15 years ago with the emergence of “shadow IT.” This happened as SaaS applications spread rapidly, introducing a subscription-based model that allowed individual teams to set up their own tools (e.g., separate Salesforce instances) without IT oversight.

The result was a Wild West scenario of uncontrollable data usage and the exposure of corporate intellectual property and sensitive financial data. It introduced considerable risk and left IT with limited opportunity to regain control without a fundamental shift in governance policies. The same issues are occurring today with the proliferation of AI projects at the enterprise department level. Unclear ownership, ad hoc data stewardship, and an absence of executive oversight are the primary contributors to ineffective AI strategy. One of the first ways to restore control is through the strict application of governance protocols designed for AI use cases and business-aligned deployments.

What are the hallmarks of an effective AI governance structure?

1. Strategic Alignment and Value Stewardship

AI governance ensures that AI investments are explicitly tied to enterprise objectives, not isolated technology initiatives. Governance bodies (typically operating at the Board and executive committee level) prioritize AI use cases based on measurable business value, risk tolerance, and strategic relevance.

This function answers the following fundamental questions:

Why is AI being deployed?
Where does it create competitive advantage?
Which AI initiatives should be scaled, paused, or terminated?

Without this layer, organizations experience AI sprawl, duplicated models, and fragmented investments with unclear ROI.

2. Data Integrity and Trust Enablement

Because AI systems are only as reliable as the data they consume, governance establishes ownership, accountability, and quality standards for enterprise data assets. This includes:

Data lineage and provenance requirements
Authoritative data sources (“single source of truth”)
Quality thresholds for model training and inference
Controls over synthetic, third-party, and externally sourced data

In mature organizations, governance treats data as a regulated strategic asset, not an operational byproduct. This directly mitigates the Achilles’ heel of AI: confidently automated decisions built on untrusted data.

3. Risk, Ethics, and Regulatory Oversight

AI governance institutionalizes risk management across the AI lifecycle, including:

Model bias and fairness
Explainability and auditability
Regulatory compliance (current and emerging)
Legal, reputational, and operational exposure

Rather than relying on ad hoc ethical reviews, mature governance embeds repeatable controls that are reviewed, tested, and enforced – like financial controls or cybersecurity frameworks. This is increasingly critical as regulators and courts treat AI-driven decisions as corporate acts, not technical artifacts.

4. Operating Model and Decision Rights

Effective AI governance clearly defines who owns what decisions:

Who approves AI use cases?
Who certifies models for production?
Who is accountable when AI outcomes are wrong?
Who can override or shut down an AI system?

As AI autonomy increases, governance replaces ambiguity with formal decision rights, escalation paths, and kill-switch authority. This prevents “shadow AI” and ensures humans remain accountable for machine-driven outcomes.

5. Continuous Oversight and Adaptation

Unlike static policies, mature AI governance is dynamic and evolutionary. It continuously:

Monitors model performance and drift
Reassesses risk as data and business conditions change
Incorporates new regulations and standards
Retires models that no longer meet trust or value thresholds

This transforms governance from a gatekeeper into a living management system, one that adapts at the same pace as AI itself. Adopting a new approach to governance is the first critical step toward improving your data quality, putting effective guardrails around your data, and preparing your operating processes for the effective use of AI technology.

Without governance, AI efforts degrade through model drift, shadow initiatives, and uncontrolled risk – eroding long-term value. Strong governance ensures higher-quality data, clear guardrails, and an operating model that enables AI to deliver reliable, sustainable outcomes.

Determining the Proper Path to a Sustainable AI Strategy

Over-reliance on platforms and tools, rather than alignment with business goals and operating models, is a fundamental flaw of AI strategy that can be rectified through adoption of best practices. Apiphani works with enterprise organizations operating complex, mission-critical systems (like SAP), where reliability, accuracy, and accountability are non-negotiable. In these environments, AI initiatives cannot be separated from the conditions in which they operate.

What we consistently observe is that the models themselves rarely drive AI failures. Failures occur when advanced capabilities are introduced into environments with fragmented data, unclear ownership, and insufficient operational discipline.

Addressing this challenge does not require additional tools or more sophisticated algorithms. It requires establishing the foundational conditions that allow AI to operate reliably and predictably at scale. Our apiphani AI Strategy Framework is anchored by three pillars.

Here’s how we do it.

1. Data Integrity Foundation

A comprehensive data quality assessment (focused on accuracy, completeness, timeliness, and lineage) is the foundation for evaluating and optimizing data architecture for performance
The establishment of a data integrity index as a benchmark for AI readiness
Automated validation workflows using AI-driven data profiling and anomaly detection
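
One way such an index could be computed is as a weighted average of the four assessment dimensions named above. This is a sketch under stated assumptions, not the framework's actual formula: the equal weights, the 0-to-1 scoring scale, and the sample scores are all hypothetical.

```python
# Equal weights are an assumption; the source does not define the index formula.
WEIGHTS = {"accuracy": 0.25, "completeness": 0.25, "timeliness": 0.25, "lineage": 0.25}


def integrity_index(scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores, each in the range 0..1."""
    return sum(scores[d] * w for d, w in weights.items()) / sum(weights.values())


# Hypothetical assessment results for one data domain.
scores = {"accuracy": 0.9, "completeness": 0.8, "timeliness": 0.7, "lineage": 0.6}
index = integrity_index(scores)
print(round(index, 2))
```

Tracking a single number like this over time gives a benchmark that can be compared across data domains, provided the weights are agreed on up front.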

2. Governance by Design

We’ve designed an AI Center of Excellence (CoE) that offers a consistent, scalable model for implementing effective AI strategy. Elements include:

A Data and AI Governance Council aligned to business domains
Policy frameworks for model lifecycle management, ethical AI, and compliance
Metadata management and lineage tracking to ensure transparency

3. AI Value Realization

Integration of governance metrics into AI ROI dashboards
Diagnostic tools to visualize data and governance health
Continuous improvement cycles connecting governance KPIs to business outcomes

The Path Forward

Organizations that treat governance as the backbone rather than the brake of AI strategy will outperform peers who chase the latest models without considering their foundations. The future of enterprise AI belongs to companies that understand this simple truth: AI is only as intelligent as the integrity of the data and governance that supports it.

Apiphani helps organizations generate powerful AI strategies by aligning data strategy, governance, and AI implementation into a single, coherent framework that delivers measurable business value.

The first step is our AI Readiness Assessment, which evaluates your organization across data readiness, platform and operational maturity, governance and risk controls, and the ability to safely deploy AI in mission-critical environments.