Datadog has revealed that infrastructure capacity constraints, rather than model quality, are now the leading cause of artificial intelligence failures in production environments. According to its latest AI engineering report, nearly 60 per cent of AI system failures stem from limited infrastructure capacity, highlighting a growing operational challenge for businesses scaling AI deployments.
The findings, based on anonymised usage data from thousands of organizations running large language models (LLMs), show that approximately 5 per cent of AI requests fail in production. Notably, most of these failures are linked to system limitations rather than issues with model performance. As a result, enterprises face growing difficulties as they move AI from experimentation to real-world applications.
Moreover, the report highlights a sharp rise in multi-model adoption. Around 69 per cent of companies now use three or more AI models simultaneously. OpenAI continues to dominate the market with a 63 per cent usage share, while Google’s Gemini and Anthropic’s Claude models have seen significant growth, increasing by 20 and 23 percentage points respectively.
At the same time, organizations are rapidly adopting agent-based frameworks, which have doubled in usage year over year. However, this expansion has introduced additional layers of complexity. For instance, the volume of data processed per AI request has surged, with token usage doubling for typical teams and quadrupling for high-usage organizations. Consequently, these trends are placing increasing pressure on infrastructure and operational systems.
Operational Complexity Becomes a Key Barrier
As AI systems grow more sophisticated, managing them efficiently has become a critical concern. The report indicates that fragmented workflows, repeated retries, and inefficient routing between models and tools are contributing to system instability.
Yanbing Li, Chief Product Officer at Datadog, drew a comparison to the early evolution of cloud computing.
“AI is starting to look a lot like the early days of cloud,” said Li. “The cloud made systems programmable but much more complex to manage. AI is now doing the same thing to the application layer. The companies that win won’t just build better models – they’ll build operational control around them. In this new era, AI observability becomes as essential as cloud observability was a decade ago.”
These insights suggest that organizations must now look beyond model accuracy. They need to manage uptime, optimize routing, and ensure seamless interaction between models, data pipelines, and agent frameworks. Even a 5 per cent failure rate can significantly impact customer-facing applications, leading to disruptions and higher operational costs.
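The report does not prescribe a fix, but the operational pattern it points to is familiar from distributed systems. The sketch below is a minimal, hypothetical illustration of retry-plus-fallback routing across providers with a recorded failure trail; the provider names and the `call_model` stub are placeholders for a real SDK, not anything Datadog recommends.

```python
import random
import time

# Hypothetical model chain: provider and model names are placeholders, and
# call_model() simulates capacity errors rather than hitting a real API.
MODEL_CHAIN = [
    ("openai", "gpt-4o"),
    ("google", "gemini-1.5-pro"),
    ("anthropic", "claude-sonnet"),
]

class CapacityError(Exception):
    """Signals a rate-limit or capacity failure (e.g. HTTP 429/503)."""

def call_model(provider: str, model: str, prompt: str) -> str:
    # Stand-in for a real SDK call; fails randomly to mimic constrained capacity.
    if random.random() < 0.3:
        raise CapacityError(f"{provider}/{model} over capacity")
    return f"[{provider}/{model}] response to: {prompt[:40]}"

def route_with_fallback(prompt: str, max_retries: int = 2) -> str:
    """Try each model in order, backing off briefly on capacity errors and
    keeping a failure trail so errors stay visible in logs, not silent."""
    failures = []
    for provider, model in MODEL_CHAIN:
        for attempt in range(max_retries):
            try:
                return call_model(provider, model, prompt)
            except CapacityError as exc:
                failures.append(f"{provider}/{model} attempt {attempt}: {exc}")
                time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"all providers exhausted: {failures}")

print(route_with_fallback("Summarise this incident report for the on-call team."))
```

The point of the failure trail is the observability argument made above: when roughly one request in twenty fails for capacity reasons, teams need the error path recorded as carefully as the success path.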
Regional Insights and Cost Pressures
Datadog also observed similar challenges in the Australia and New Zealand (A/NZ) region, where enterprises are rapidly adopting multi-model and agent-driven architectures.
“In A/NZ, the focus has firmly shifted to running AI reliably in production, and multi-model architectures and agentic workflows are becoming standard, but that maturity is exposing significant gaps. A failure rate sitting at around five per cent, largely driven by capacity constraints, is a material concern in industries where uptime and trust are non-negotiable. AI systems are increasingly resembling distributed systems, yet many teams are still not managing them with the operational discipline that demands,” said Yadi Narayana, Chief Technology Officer for APJ, Datadog.
Furthermore, rising token consumption is driving up operational costs. As AI requests increase in size and frequency, companies are spending more on both model usage and supporting infrastructure—especially when inefficiencies like duplicate requests or poor optimization occur.
“There is also a cost problem hiding in plain sight. Token consumption is climbing fast, while optimisation techniques, like prompt caching and smarter context design, remain largely untapped. The next phase is about closing the gap between how sophisticated these systems have become and how rigorously they’re being operated. Organisations will prioritise foundational capabilities like observability, governance, and cost control, over accelerating deployment speed, building AI systems that are reliable, scalable, and accountable,” said Narayana.
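As one hedged illustration of the gap Narayana describes, the sketch below shows the simplest form of the idea: a client-side lookup that avoids paying twice for identical requests. Provider-side prompt caching works differently (it reuses a shared prompt prefix inside the API rather than whole responses), and the model name and `call_model` stub here are placeholders.

```python
import hashlib

# Minimal, illustrative client-side cache keyed on (model, prompt).
# The cost intuition: duplicated work should not be re-tokenised and re-billed.
_cache: dict = {}

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real provider call."""
    return f"[{model}] processed {len(prompt)} characters"

def cached_completion(model: str, prompt: str) -> str:
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # duplicate request: no extra tokens spent
    response = call_model(model, prompt)
    _cache[key] = response
    return response

# The second identical call is served from the cache instead of the API.
cached_completion("placeholder-model", "Classify this support ticket.")
cached_completion("placeholder-model", "Classify this support ticket.")
```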
Industry-Wide Shift Toward System Reliability
The report reflects a broader shift across the technology sector. As businesses move beyond experimentation, the key challenge is no longer just accessing advanced models but ensuring that AI systems run reliably at scale.
This perspective aligns with insights from Guillermo Rauch, whose company Vercel builds modern web application tools.
“The next wave of agent failures won’t be about what agents can’t do but what teams can’t observe,” said Rauch. “We built agentic infrastructure at Vercel because agents need the same production feedback loops as great software. Unlike traditional software, agents have control flow driven by the LLM itself, making observability not just useful, but essential.”
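Rauch's point about LLM-driven control flow can be made concrete with a few lines of instrumentation. The sketch below is a generic, hypothetical illustration (not Vercel's or Datadog's tooling): each step an agent takes is wrapped so that its latency, outcome, and any error are appended to a trace, since the model, not the code, decides what happens next.

```python
import json
import time
import uuid

def traced_step(step_fn, step_name: str, trace: list):
    """Wrap one agent step (an LLM call or a tool call) so its outcome,
    latency, and any error are recorded. Because the model chooses the next
    step at runtime, this trace is often the only complete record of what
    the agent actually did."""
    event = {"id": str(uuid.uuid4()), "step": step_name, "start": time.time()}
    try:
        result = step_fn()
        event["status"] = "ok"
        return result
    except Exception as exc:
        event["status"] = "error"
        event["error"] = str(exc)
        raise
    finally:
        event["duration_s"] = round(time.time() - event["start"], 3)
        trace.append(event)

# Hypothetical usage: the lambdas stand in for a real planning call and tool call.
trace: list = []
plan = traced_step(lambda: "look up order status", "plan", trace)
result = traced_step(lambda: f"tool executed: {plan}", "tool_call", trace)
print(json.dumps(trace, indent=2))
```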
Overall, Datadog’s analysis concludes that AI adoption is evolving from a model-centric challenge to a systems-level issue. As organizations scale AI, reliability, visibility, and cost management will define long-term success.