Datadog has revealed that infrastructure capacity constraints, rather than model quality, are now the leading cause of artificial intelligence failures in production environments. According to its latest AI engineering report, nearly 60 per cent of AI system failures stem from limited infrastructure capacity, highlighting a growing operational challenge for businesses scaling AI deployments.
The findings, based on anonymised usage data from thousands of organizations running large language models (LLMs), show that approximately 5 per cent of AI requests fail in production. Notably, most of these failures are linked to system limitations rather than issues with model performance. As a result, enterprises face growing difficulties as they move AI from experimentation to real-world applications.
Moreover, the report highlights a sharp rise in multi-model adoption. Around 69 per cent of companies now use three or more AI models simultaneously. OpenAI continues to dominate the market with a 63 per cent usage share, while Google’s Gemini and Anthropic’s Claude models have seen significant growth, increasing by 20 and 23 percentage points respectively.
At the same time, organizations are rapidly adopting agent-based frameworks, which have doubled in usage year over year. However, this expansion has introduced additional layers of complexity. For instance, the volume of data processed per AI request has surged, with token usage doubling for typical teams and quadrupling for high-usage organizations. Consequently, these trends are placing increasing pressure on infrastructure and operational systems.
Operational Complexity Becomes a Key Barrier
As AI systems grow more sophisticated, managing them efficiently has become a critical concern. The report indicates that fragmented workflows, repeated retries, and inefficient routing between models and tools are contributing to system instability.
Yanbing Li, Chief Product Officer at Datadog, drew a comparison to the early evolution of cloud computing.
“AI is starting to look a lot like the early days of cloud,” said Li. “The cloud made systems programmable but much more complex to manage. AI is now doing the same thing to the application layer. The companies that win won’t just build better models – they’ll build operational control around them. In this new era, AI observability becomes as essential as cloud observability was a decade ago.”
These insights suggest that organizations must now look beyond model accuracy. They need to manage uptime, optimize routing, and ensure seamless interaction between models, data pipelines, and agent frameworks. Even a 5 per cent failure rate can significantly impact customer-facing applications, leading to disruptions and higher operational costs.
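The report does not prescribe a fix, but the operational pattern it points to is familiar from distributed systems. The sketch below is a minimal, hypothetical illustration of retry-plus-fallback routing across providers with a recorded failure trail; the provider names and the `call_model` stub are placeholders for a real SDK, not anything Datadog recommends.

```python
import random
import time

# Hypothetical model chain: provider and model names are placeholders, and
# call_model() simulates capacity errors rather than hitting a real API.
MODEL_CHAIN = [
    ("openai", "gpt-4o"),
    ("google", "gemini-1.5-pro"),
    ("anthropic", "claude-sonnet"),
]

class CapacityError(Exception):
    """Signals a rate-limit or capacity failure (e.g. HTTP 429/503)."""

def call_model(provider: str, model: str, prompt: str) -> str:
    # Stand-in for a real SDK call; fails randomly to mimic constrained capacity.
    if random.random() < 0.3:
        raise CapacityError(f"{provider}/{model} over capacity")
    return f"[{provider}/{model}] response to: {prompt[:40]}"

def route_with_fallback(prompt: str, max_retries: int = 2) -> str:
    """Try each model in order, backing off briefly on capacity errors and
    keeping a failure trail so errors stay visible in logs, not silent."""
    failures = []
    for provider, model in MODEL_CHAIN:
        for attempt in range(max_retries):
            try:
                return call_model(provider, model, prompt)
            except CapacityError as exc:
                failures.append(f"{provider}/{model} attempt {attempt}: {exc}")
                time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"all providers exhausted: {failures}")

print(route_with_fallback("Summarise this incident report for the on-call team."))
```

The point of the failure trail is the observability argument made above: when roughly one request in twenty fails for capacity reasons, teams need the error path recorded as carefully as the success path.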
Regional Insights and Cost Pressures
Datadog also observed similar challenges in the Australia and New Zealand (A/NZ) region, where enterprises are rapidly adopting multi-model and agent-driven architectures.
“In A/NZ, the focus has firmly shifted to running AI reliably in production, and multi-model architectures and agentic workflows are becoming standard, but that maturity is exposing significant gaps. A failure rate sitting at around five per cent, largely driven by capacity constraints, is a material concern in industries where uptime and trust are non-negotiable. AI systems are increasingly resembling distributed systems, yet many teams are still not managing them with the operational discipline that demands,” said Yadi Narayana, Chief Technology Officer for APJ, Datadog.
Furthermore, rising token consumption is driving up operational costs. As AI requests increase in size and frequency, companies are spending more on both model usage and supporting infrastructure—especially when inefficiencies like duplicate requests or poor optimization occur.
“There is also a cost problem hiding in plain sight. Token consumption is climbing fast, while optimisation techniques, like prompt caching and smarter context design, remain largely untapped. The next phase is about closing the gap between how sophisticated these systems have become and how rigorously they’re being operated. Organisations will prioritise foundational capabilities like observability, governance, and cost control, over accelerating deployment speed, building AI systems that are reliable, scalable, and accountable,” said Narayana.
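As one hedged illustration of the gap Narayana describes, the sketch below shows the simplest form of the idea: a client-side lookup that avoids paying twice for identical requests. Provider-side prompt caching works differently (it reuses a shared prompt prefix inside the API rather than whole responses), and the model name and `call_model` stub here are placeholders.

```python
import hashlib

# Minimal, illustrative client-side cache keyed on (model, prompt).
# The cost intuition: duplicated work should not be re-tokenised and re-billed.
_cache: dict = {}

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real provider call."""
    return f"[{model}] processed {len(prompt)} characters"

def cached_completion(model: str, prompt: str) -> str:
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # duplicate request: no extra tokens spent
    response = call_model(model, prompt)
    _cache[key] = response
    return response

# The second identical call is served from the cache instead of the API.
cached_completion("placeholder-model", "Classify this support ticket.")
cached_completion("placeholder-model", "Classify this support ticket.")
```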
Industry-Wide Shift Toward System Reliability
The report reflects a broader shift across the technology sector. As businesses move beyond experimentation, the key challenge is no longer just accessing advanced models but ensuring that AI systems run reliably at scale.
This perspective aligns with insights from Guillermo Rauch, whose company Vercel builds modern web application tools.
“The next wave of agent failures won’t be about what agents can’t do but what teams can’t observe,” said Rauch. “We built agentic infrastructure at Vercel because agents need the same production feedback loops as great software. Unlike traditional software, agents have control flow driven by the LLM itself, making observability not just useful, but essential.”
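Rauch's point about LLM-driven control flow can be made concrete with a few lines of instrumentation. The sketch below is a generic, hypothetical illustration (not Vercel's or Datadog's tooling): each step an agent takes is wrapped so that its latency, outcome, and any error are appended to a trace, since the model, not the code, decides what happens next.

```python
import json
import time
import uuid

def traced_step(step_fn, step_name: str, trace: list):
    """Wrap one agent step (an LLM call or a tool call) so its outcome,
    latency, and any error are recorded. Because the model chooses the next
    step at runtime, this trace is often the only complete record of what
    the agent actually did."""
    event = {"id": str(uuid.uuid4()), "step": step_name, "start": time.time()}
    try:
        result = step_fn()
        event["status"] = "ok"
        return result
    except Exception as exc:
        event["status"] = "error"
        event["error"] = str(exc)
        raise
    finally:
        event["duration_s"] = round(time.time() - event["start"], 3)
        trace.append(event)

# Hypothetical usage: the lambdas stand in for a real planning call and tool call.
trace: list = []
plan = traced_step(lambda: "look up order status", "plan", trace)
result = traced_step(lambda: f"tool executed: {plan}", "tool_call", trace)
print(json.dumps(trace, indent=2))
```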
Overall, Datadog’s analysis concludes that AI adoption is evolving from a model-centric challenge to a systems-level issue. As organizations scale AI, reliability, visibility, and cost management will define long-term success.