From infrastructure to chip strategy, Amazon Web Services leads the IaaS "AI era"

As generative AI has taken off, it has begun to "shine" in the business processes of many enterprises. Whether it is used to assist design, power intelligent customer service, or streamline internal management, the latest AI models have delivered striking efficiency gains.

At the same time, however, the enormous computing power required to train and run inference on large AI models has become a major hurdle that prospective users must face.

 

In this context, using public cloud IaaS (infrastructure as a service) instead of building computing infrastructure in-house has become an important way for many enterprises to cut costs, improve efficiency, and embrace the era of large AI models. In its recently released "IDC MarketScape: Worldwide Public Cloud Infrastructure as a Service (IaaS) 2025" report, the global market analyst firm International Data Corporation (IDC) points out that as enterprises migrate more workloads to the cloud and build new cloud-native applications, public cloud IaaS continues to grow rapidly, and the overall IaaS market is expected to keep expanding.

However, as the IDC report explains, AI is "reshaping" cloud infrastructure in many ways, which means that not every IaaS provider is ready for the demands of the AI era. Within the existing IaaS industry, IDC rated Amazon Web Services a leader with significant advantages in both capabilities and strategy.

 

So why Amazon Web Services, and what unique advantages does it hold in today's IaaS industry? Combining this IDC report with other public information, the answer is not hard to find.

 

Reliable infrastructure around the world is the foundation of Amazon Web Services

 

For any IaaS provider, secure, stable, and reliable infrastructure is the foundation of everything, and infrastructure build-out is precisely one of Amazon Web Services' most prominent strengths.

According to public information, Amazon Web Services' infrastructure now spans more than 100 Availability Zones across over 30 geographic Regions, and the company has announced plans for additional Regions and Availability Zones, including in New Zealand and Saudi Arabia.

 

Amazon Web Services has also introduced many innovative designs to improve the hardware reliability of the data centers themselves. For example, by simplifying the electrical and mechanical design of its data centers, it has reduced the number of racks that can be affected by electrical issues by 89% while raising infrastructure availability to 99.9999%. And by combining air cooling with liquid cooling, Amazon Web Services not only sharply reduces data-center cooling costs, lowering the cost of its own computing power, but also enables its data centers to support AI supercomputing workloads that remain stable even under sustained hyperscale pressure.
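The cooling-efficiency gains described here are usually summarized with Power Usage Effectiveness (PUE), the industry-standard ratio of total facility power to IT equipment power. A minimal sketch of the metric — the kilowatt figures below are illustrative assumptions, not Amazon Web Services' published numbers:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT power.
    1.0 is the theoretical ideal (zero overhead for cooling etc.)."""
    if it_equipment_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_equipment_kw

def cooling_overhead_kw(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Approximate power spent on cooling and other non-IT overhead."""
    return total_facility_kw - it_equipment_kw

# Hypothetical comparison: air-only cooling vs. mixed air/liquid cooling
# for the same 1,000 kW of IT load.
air_only = pue(1300.0, 1000.0)   # PUE 1.30 -> 300 kW of overhead
mixed = pue(1120.0, 1000.0)      # PUE 1.12 -> 120 kW of overhead
print(f"air-only PUE={air_only:.2f}, mixed PUE={mixed:.2f}")
```

At hyperscale, even a small PUE reduction translates into megawatts of saved power, which is why mixed air/liquid cooling matters for AI-dense racks.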

 

Of course, many enterprises run businesses that span multiple regions, and training large models often requires ultra-large computing clusters, both of which place higher demands on IaaS network performance.

In response, Amazon Web Services on the one hand supports real-time data consistency across multiple Regions, preparing its network infrastructure for large-scale multinational business. On the other hand, at re:Invent 2024 it launched the second-generation UltraCluster network architecture (also known as the "10p10u" network), which allows more than 20,000 GPUs to work together with over 10 Pb/s of aggregate bandwidth and latency under 10 microseconds. For tasks that must be trained on hyperscale clusters, this leap in network performance alone can cut training time by at least 15%. Coupled with the new SIDR (Scalable Intent Driven Routing) protocol, which can recover the network in under a second, Amazon Web Services' distributed computing network has become an industry benchmark for both efficiency and reliability.
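A cut of this size compounds quickly at cluster scale. A small sketch of the arithmetic, assuming the "at least 15%" reduction and an illustrative 30-day job on a 20,000-GPU cluster (both workload figures are hypothetical):

```python
def reduced_training_time(baseline_hours: float, reduction: float = 0.15) -> float:
    """Remaining wall-clock training time after a fractional speedup
    (0.15 matches the 'at least 15%' figure cited for the network)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return baseline_hours * (1.0 - reduction)

def gpu_hours_saved(baseline_hours: float, num_gpus: int,
                    reduction: float = 0.15) -> float:
    """GPU-hours freed up across the whole cluster by the speedup."""
    return baseline_hours * reduction * num_gpus

# Illustrative: a 720-hour (30-day) job on a 20,000-GPU cluster.
print(reduced_training_time(720.0))    # 612.0 hours remaining
print(gpu_hours_saved(720.0, 20_000))  # 2160000.0 GPU-hours saved
```

Freed GPU-hours can be redeployed to other jobs, which is why network performance shows up directly in cost per trained model.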

 

Of course, on top of advanced data-center and network hardware, Amazon Web Services puts "security" at the foundation of the whole system. Whether for infrastructure or services, everything is designed from the ground up with security as a primary goal, and new technologies are continually introduced into operations to raise the bar further. For example, through automated reasoning, Amazon Web Services provides rigorous mathematical assurance for the behavior of its critical systems. Notably, these security designs and technologies do not vary by customer type: start-ups and large corporations alike benefit from the same secure infrastructure innovations.
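Amazon Web Services' production automated-reasoning tooling is far more sophisticated than anything shown here; the toy sketch below, with a hypothetical access policy, only illustrates the underlying idea — checking that a property holds for *every* input in a bounded space, rather than for a handful of test cases:

```python
from itertools import product

# Hypothetical, simplified access policy: a request is allowed only if
# the (role, action) pair is explicitly granted.
GRANTS = {
    ("admin", "read"), ("admin", "write"),
    ("auditor", "read"),
}

def is_allowed(role: str, action: str) -> bool:
    return (role, action) in GRANTS

def verify_no_unintended_writes() -> bool:
    """Exhaustively check the invariant: no role other than 'admin'
    can ever be granted 'write'. Unlike spot tests, this covers the
    whole (bounded) input space -- the spirit of automated reasoning
    is a proof over all inputs, not a sample of them."""
    roles = ["admin", "auditor", "guest"]
    actions = ["read", "write"]
    for role, action in product(roles, actions):
        if action == "write" and role != "admin" and is_allowed(role, action):
            return False  # counterexample found
    return True

print(verify_no_unintended_writes())  # True
```

Real automated-reasoning systems replace this brute-force loop with solvers that handle unbounded input spaces, but the guarantee they deliver is the same in kind: every case, not most cases.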

 

The continuous iteration of self-developed chips makes AI computing power more readily available

 

If world-class, stable infrastructure is the "basic factor" behind Amazon Web Services' leadership in the IaaS industry, then its continuous innovation in how computing power is delivered is the "long-term advantage" that should keep it at the front of the AI cloud computing era.

As early as March 2024, Amazon Web Services and NVIDIA jointly announced that they would combine Amazon Web Services technologies, including the Nitro System, the Amazon KMS key management service, petabit-scale Elastic Fabric Adapter (EFA) networking, and Amazon EC2 UltraCluster hyperscale clustering, with NVIDIA's latest Blackwell platform and AI software to build multiple cloud-based AI supercomputer systems, including Project Ceiba.

It should be noted that, unlike other IaaS providers, Amazon Web Services not only offers mainstream cloud computing power based on NVIDIA GPUs and Intel and AMD x86 CPUs, but was also the first in the industry to invest continuously in self-developed chips: from the Nitro System, which offloads network and storage processing, to the general-purpose Graviton processors, to the Trainium machine-learning training chips and Inferentia inference chips. To date, all of these in-house chips have gone through multiple iterations, with each generation delivering a double-digit percentage price-performance improvement.
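A "double-digit price-performance improvement" per generation compounds multiplicatively across iterations. The per-generation gains below are hypothetical placeholders, purely to show the arithmetic:

```python
from functools import reduce

def compounded_price_perf(gains_per_generation: list[float]) -> float:
    """Multiplicative price-performance improvement across chip
    generations. Each entry is a fractional gain, e.g. 0.25 for +25%."""
    return reduce(lambda acc, g: acc * (1.0 + g), gains_per_generation, 1.0)

# Hypothetical example: three generations at +25%, +30%, and +40%
# compound to roughly 2.3x the original price performance.
factor = compounded_price_perf([0.25, 0.30, 0.40])
print(f"{factor:.3f}x")  # 2.275x
```

This compounding is why sustained in-house chip iteration, rather than any single generation, is the durable advantage.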

Take Trainium2, the latest self-developed training chip, launched at re:Invent 2024, as an example. In the Amazon EC2 Trn2 instances built on this chip, 16 Trainium2 chips deliver up to 20.8 petaflops of floating-point compute at 30-40% better price performance than comparable GPU-based EC2 instances, making them well suited to training and inference for large AI models with billions of parameters.
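The instance-level figure follows directly from per-chip throughput. In the sketch below, the ~1.3-petaflop per-chip value is back-calculated from the quoted 16-chip, 20.8-petaflop total, and the hourly prices are hypothetical:

```python
def instance_petaflops(chips: int, petaflops_per_chip: float) -> float:
    """Aggregate peak floating-point throughput of one instance."""
    return chips * petaflops_per_chip

def price_per_petaflop_hour(hourly_price: float, petaflops: float) -> float:
    """Cost efficiency metric: dollars per petaflop-hour of peak compute."""
    return hourly_price / petaflops

# 16 chips at ~1.3 PF each reproduces the quoted 20.8 PF per Trn2 instance.
trn2_pf = instance_petaflops(16, 1.3)
print(trn2_pf)  # 20.8

# Hypothetical prices: a 35% price-performance edge (mid-point of the
# quoted 30-40% range) means ~26% lower cost per petaflop-hour.
gpu_cost = price_per_petaflop_hour(100.0, 20.8)
trn_cost = gpu_cost / 1.35
print(f"{(1 - trn_cost / gpu_cost):.0%} cheaper per PF-hour")  # 26%
```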

Moreover, because the chip is developed in-house, Amazon Web Services can scale Trainium2 out as business needs dictate. Amazon EC2 Trn2 UltraServers interconnect 64 Trainium2 chips to deliver up to 83.2 petaflops of floating-point compute. Beyond that, Amazon Web Services is building an EC2 UltraCluster supercomputer known as Project Rainier, which will contain hundreds of thousands of Trainium2 chips and provide more than five times the compute used to train today's most advanced large AI models.
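To put 83.2 petaflops per UltraServer in perspective, a common back-of-the-envelope rule estimates dense-transformer training cost at roughly 6 × parameters × tokens floating-point operations. Everything below except the per-UltraServer figure (model size, token count, server count, utilization) is an illustrative assumption:

```python
def training_days(params: float, tokens: float,
                  cluster_petaflops: float, utilization: float = 0.4) -> float:
    """Rough training-time estimate using the ~6*N*D FLOPs rule of thumb.
    `utilization` is the fraction of peak compute actually sustained."""
    total_flops = 6.0 * params * tokens
    sustained_flops_per_sec = cluster_petaflops * 1e15 * utilization
    return total_flops / sustained_flops_per_sec / 86_400  # s per day

# Illustrative: a 70B-parameter model trained on 2T tokens, on a
# hypothetical pool of 12 UltraServers (12 * 83.2 ~= 998 PF peak)
# at 40% sustained utilization.
days = training_days(70e9, 2e12, 12 * 83.2)
print(f"~{days:.0f} days")  # ~24 days
```

The same formula shows why a cluster with hundreds of thousands of chips changes what model sizes are practical to train at all.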

And that is not all: at the end of 2024, Amazon Web Services also officially announced its next-generation AI training chip, Trainium3. As the company's first chip built on a 3nm process, Trainium3 is expected to deliver several times the performance of its predecessor in UltraServers. Most importantly, the new chips are expected to launch within the year, and it would not be surprising if they once again redefine the price performance of cloud AI training.

 

Although already rated a "leader", Amazon Web Services is still reinventing itself

 

Judging from public information, the global coverage of highly reliable infrastructure and the continuous hardware innovation represented by self-developed chips demonstrate Amazon Web Services' competitiveness in the IaaS industry in terms of both "basic capabilities" and "long-term strategy".

As IDC analyst and report author Dave McCarthy explains, "Amazon Web Services is a leader in the public cloud IaaS market through a broad portfolio of services and continued innovation. An extensive global infrastructure, combined with custom chip initiatives such as Amazon Graviton and significant investments in AI, gives it a unique position to meet the needs of enterprises. Its excellence in scalability, mature developer community, and active investment in AI infrastructure make it a top choice for businesses that need advanced cloud capabilities."

 

But even so, Amazon Web Services has not stopped pressing ahead. On Amazon's earnings call this February, CEO Andy Jassy confirmed that the company's capital investment in 2025 is expected to reach $100 billion, most of which will go toward building out Amazon Web Services' AI infrastructure.

Of course, this is good news for enterprises around the world that are eager to "move to the cloud" and experience the most cutting-edge generative AI technology. It means not only that Amazon Web Services will keep improving the capability and cost-effectiveness of its AI infrastructure, but also that the competition in IaaS AI computing power it is "leading" is likely to push the entire industry toward sustainable, healthy development.