Inside Amazon’s Struggle to Crack Nvidia’s AI-Chip Dominance

Nvidia CEO Jensen Huang is not counting Amazon out, though, noting his customer will soon become his competitor: “They are about to design a chip to replace ours.”

By Eugene Kim • May 28, 2024



Key Takeaways

Amazon’s AI chips lag far behind Nvidia GPUs, with low adoption rates among large cloud customers.
Nvidia’s dominance with its CUDA platform poses significant challenges for Amazon’s AI chips.
Amazon is working with the open-source community to improve AI-chip adoption and market share.

^{This article originally appeared on Business Insider.}

Amazon is struggling to compete with Nvidia’s dominant AI chips, with low usage, “compatibility gaps,” and project-migration issues putting millions of dollars in cloud revenue at risk, according to confidential internal documents and people familiar with the situation.

It’s another sign of how far behind the cloud giant is in the generative-AI race. This also shows how hard it will be for tech companies including Microsoft, OpenAI, Google, Meta, and Amazon to break Nvidia’s grip on this huge and important market.

Stacy Rasgon, Bernstein Research’s veteran chip analyst, said every major tech company wanted a piece of Nvidia’s business but that no one had been able to make a mark.

“I’m not aware of anyone using AWS chips in any sort of large volumes,” Rasgon told Business Insider.

AI platforms and AWS profitability

^{Amazon Web Services’ outgoing CEO, Adam Selipsky, with Nvidia CEO Jensen Huang. Amazon via BI}

Amazon Web Services is the leading cloud company, renting access to computing power and storage over the internet. Part of its success, especially its profitability, is based on designing its own data-center chips, rather than paying Intel for these pricy components.

In the new generative-AI era, Amazon is trying to pull this off again by making its own artificial-intelligence chips, known as Trainium and Inferentia. This time, the idea is to avoid paying for expensive Nvidia GPUs, while still providing cloud customers with powerful AI services.

Amazon is at least four years into this effort, and so far, Nvidia is proving a harder nut to crack.

Nvidia spent more than a decade building a platform called CUDA to make it easier for developers to use its graphics processing units. The industry has gotten used to this. Amazon now has to somehow unwind all those habits and complex technical interrelationships.

Back when Amazon took on Intel’s chips in the first cloud boom, AWS was the pioneer, creating the platform that established standards and processes. This time, it’s Nvidia that has the beachhead. If Amazon can’t breach this in a significant way, it may be stuck paying Nvidia, and AWS profitability could suffer.

‘Friction points and stifled adoption’

The enormity of this challenge is revealed in the internal Amazon documents BI obtained and by the people familiar with the company’s AI-chip efforts. These people asked not to be identified discussing sensitive private matters. Their identities are known to BI.

Last year, the adoption rate of Trainium chips among AWS’s largest customers was just 0.5% of that of Nvidia’s GPUs, according to one of the internal documents. This assessment, which measures usage levels of different AI chips through AWS’s cloud service, was prepared in April.

Inferentia, an AWS chip designed for a type of AI task known as inference, was only slightly better, at 2.7% of the Nvidia usage rate.

The internal document said large cloud customers had faced “challenges adopting” AWS’s custom AI chips, in part due to the strong appeal of Nvidia’s CUDA platform.

“Early attempts from customers have exposed friction points and stifled adoption,” the document said.

This contradicts the more upbeat outlook Amazon executives have shared about AWS’s AI-chip efforts. In an April earnings call, CEO Andy Jassy said demand for AWS’s custom silicon was “quite high,” and his annual shareholder letter this year name-checked a number of early customers, such as Snap, Airbnb, and Anthropic.

Even the Anthropic example has a big asterisk next to it. Amazon invested about $4 billion in this AI startup, and part of that deal requires Anthropic to use AWS’s in-house AI chips.

The roots of Amazon’s chip business

^{Selipsky talks about the company’s Graviton chip at a re:Invent conference. Amazon via BI}

Amazon’s chip business started in earnest when it acquired the startup Annapurna Labs in 2015 for roughly $350 million. This helped the tech giant design its own chips. It now offers Arm-based Graviton central processing units for most non-AI computing tasks. Inferentia debuted in 2018, and Trainium first came out in 2020.

Other tech giants, including Google and Microsoft, are also designing custom AI chips. At the same time, the three major cloud firms, AWS, Microsoft Azure, and Google Cloud, are some of the largest Nvidia customers, as they sell access to those GPUs through their own cloud services.

The in-house AI-chip efforts have yet to make a major dent in Nvidia’s grip on the market. On Wednesday, Nvidia reported another blowout quarter, more than tripling revenue from a year ago. The chipmaker is valued at about $2.5 trillion. That’s at least $500 billion more than Amazon. Nvidia now accounts for roughly 80% of the AI-chip market, the research firm Omdia found.

^{Chart: Top suppliers of AI processors for cloud and enterprise data centers. Percent market share; 2024 projections.}

An AWS spokesperson said there’s “growing customer excitement” over the latest Trainium chip, which is being used by Anthropic and Databricks to train AI models.

AWS’s AI chips are still relatively new, so Amazon measures their success in terms of their overall usage and positive customer feedback, both of which “are growing well,” rather than their share of workloads, the spokesperson added.

The sheer amount of GPU capacity AWS has built up over the past decade contributes to “very large usage” of Nvidia chips, the spokesperson said.

“We’re encouraged by the progress we’re making with our custom AI chips,” they said in a statement. “Building great hardware is a long term investment, and one where we have experience being persistent and successful.”

‘Parity with CUDA’

Internally at Amazon, Nvidia’s CUDA platform is repeatedly cited as the biggest roadblock for the AI-chip initiative. Launched in 2006, CUDA is an ecosystem of developer tools, AI libraries, and programming languages that makes it easier to use Nvidia’s GPUs for AI projects. CUDA’s head start has given Nvidia an incredibly sticky platform, which many consider the secret sauce behind the company’s success.

Amazon expects only modest adoption of its AI chips unless its own software platform, AWS Neuron, can achieve “improved parity with CUDA capabilities,” one of the internal documents said.

Neuron is designed to help developers more easily build on top of AWS’s AI chips, but the current setup “prevents migration from NVIDIA CUDA,” the document added.

Meta, Netflix, and other companies have asked for AWS Neuron to start supporting fully sharded data parallel (FSDP), a technique for spreading the training of large AI models across many GPUs. Without that, these companies won’t “even consider” using Trainium chips for their AI-training needs, according to this internal Amazon document.

Snowflake’s CUDA decision

^{Sridhar Ramaswamy speaks at a Collision conference in Toronto. Eóin Noonan/Getty Images via BI}

Snowflake, a leading data-cloud provider, is one of the companies that chose Nvidia GPUs over Amazon’s AI chips when training its own large language model, Arctic. That’s even though Snowflake is a major AWS cloud customer.

Snowflake CEO Sridhar Ramaswamy told BI that familiarity with CUDA made it hard to transition to a different GPU, especially when doing so could risk dealing with unexpected outcomes.

Training AI models is expensive and time-consuming, so any hiccups can create costly delays. Picking what you already know — CUDA, in this case — is a no-brainer at the moment for many AI developers.

The cost efficiency and advanced performance of Nvidia GPUs may be “years” ahead of the competition, Ramaswamy said.

“We trained Arctic on Amazon EC2 P5 instances to use Nvidia silicon because we had architected it on CUDA,” he told BI. (EC2 P5 gives AWS customers cloud access to Nvidia H100 GPUs.)

Amazon’s spokesperson told BI that AWS was providing cloud customers “the choice and flexibility to use what works best for them,” rather than forcing them to switch.

Nvidia’s early investment in CUDA makes it an “essential tool” to develop with GPUs, but it’s “uniquely focused on Nvidia’s hardware,” the Amazon spokesperson added. AWS’s goal with Neuron is not necessarily to build parity with CUDA, the spokesperson said.

AWS can still generate revenue when customers use its cloud services for AI tasks — even if they choose the Nvidia GPU options, rather than Trainium and Inferentia. It’s just that this might be less profitable for AWS.

Failure to replace GPUs

Explosive demand for Nvidia chips is also causing a GPU shortage at Amazon, according to one of the internal documents and the people familiar with the matter.

An obvious response to this would be to have cloud customers use Amazon’s AI chips instead. However, some of the largest AWS customers have not been willing to use these homegrown alternatives, the documents said.

Inferentia chips fall behind the performance and cost efficiencies of Nvidia GPUs, some customers have told Amazon, and their performance issues have been escalated internally, according to this document.

The failure to shift customer demand from Nvidia GPUs to AWS’s own chips has resulted in delayed services and put millions of dollars in potential revenue at risk, the document said.

AWS discussed splitting workloads across different regions and setting up more flexible and higher-priced blocks of Nvidia GPU capacity to alleviate the shortage issue, according to this document.

Amazon even uses Nvidia GPUs for its own projects

^{Amazon VP and distinguished engineer James Hamilton. Amazon via BI}

Even some internal Amazon AI projects rely on Nvidia GPUs, rather than AWS’s homegrown chips.

Earlier this year, Amazon’s retail team used a cluster of Nvidia GPUs, including V100s, A10s, and A100s, to build a model for a new AI image-creation tool, another internal document showed. There was no mention of AWS chips.

In January, James Hamilton, an Amazon senior vice president and distinguished engineer, gave a presentation on machine learning and said one of his projects used 13,760 Nvidia A100 GPUs.

It’s unclear how well AWS AI chips are doing financially since Amazon doesn’t break out specific cloud-segment sales. But in April, Jassy disclosed that AWS’s array of AI products was on pace to generate multibillion-dollar revenue this year. That’s important for AWS, as its growth rate stagnated in recent years before bouncing back in the past few quarters.

Amazon is following a sales strategy of making a variety of AI chips available through its cloud service. According to an internal sales guideline seen by BI, Amazon sales reps are encouraged to mention access to both high-end Nvidia GPUs and more affordable AWS chips when selling AI compute services.

One person familiar with the situation told BI that Amazon saw a big enough opportunity in the lower-end market, even if that means ceding top-of-line customers to Nvidia for now.

Prioritizing open source

The AI-chip market is forecast to more than double to $140 billion by 2027, according to the research firm Gartner.

To get a larger share of the market, Amazon wants to work more closely with the open-source community, according to one of the internal documents obtained by BI.

For example, AWS’s AI chips still have “compatibility gaps” in certain open-source frameworks, making Nvidia GPUs a more popular option. By embracing open-source technologies, AWS can build a “differentiated experience,” it said.

As part of this effort, Amazon is prioritizing support for open-source models like Llama and Stable Diffusion, one of the people said.

Amazon’s spokesperson said AWS had partnered with Hugging Face, a hub for AI models, and was a founding member of an open-source machine-learning initiative called OpenXLA to make it easier for the open-source community to take advantage of AWS AI chips.

Don’t count Amazon out

Despite Amazon’s AI-chip struggles, the effort seems to have caught the attention of Nvidia CEO Jensen Huang.

During an interview earlier this year, Huang was asked about Nvidia’s competitive landscape.

“Not only do we have competition from competitors, we have competition from our customers,” Huang said, referring to cloud giants such as AWS that sell access to both Nvidia GPUs and their own chips. “And I’m the only competitor to a customer fully knowing they are about to design a chip to replace ours.”
