Generative AI: 4 Ways We Engineer AWS Infrastructure for This Innovation

Adapting and Innovating the Infrastructure for Generative AI

Prasad Kalyanaraman July 3, 2024

7 minutes read

Generative Artificial Intelligence (Generative AI) has transformed our world seemingly overnight. Within a few months, it became commonplace for individuals and enterprises to use the new technology to enhance decision-making, transform customer experiences, and boost creativity and innovation. But the underlying infrastructure that powers generative AI was not built in day—in fact, it is the result of years of innovation.

Artificial Intelligence and Machine Learning (ML) have been a focus for Amazon for more than 25 years. Many Amazon capabilities that customers use daily are driven by ML, like shopping recommendations and packaging decisions. Within Amazon Web Services (AWS), we have been focused on bringing that knowledge and capability to our customers by putting ML in to the hands of every developer, data scientist, and expert practitioner.

Generative AI is now a multibillion-dollar revenue run rate business for AWS. Over 100,000 customers across industries—including adidas, New York Stock Exchange, Pfizer, Ryanair, and Toyota—are using AWS AI and ML services to reinvent experiences for their customers. Additionally, many of the leading generative AI models are trained and run on AWS.

All of this work is underpinned by AWS’s global infrastructure, including our data centres, global network, and custom AI chips. There is no compression algorithm for experience, and since we have been building large-scale data centres for more than 15 years and GPU-based servers for more than 12 years, we have a massive existing footprint of AI infrastructure.

AWS continues to adapt and improve upon our strong infrastructure foundation as the world changes rapidly, and we are delivering innovations specifically for generative AI. Here are some key ways we are innovating on our leading global infrastructure to support generative AI at scale.

1. Delivering Low-Latency, Large-Scale Networking

Generative AI models require massive amounts of data to train and run efficiently. The larger and more complex the model is, the longer the training duration. As you increase time to train, you are not only increasing operating costs but also slowing down innovation. Traditional networks are not sufficient for the low latency and large scale needed for generative AI model training.

At AWS, we are constantly working to reduce network latency and improve performance for customers (and allow them to maximise generative AI in the process. Our approach is unique in that we have built our own network devices and network operating systems for every layer of the stack—from the Network Interface Card to the top-of-rack switch to the data centre network to the internet-facing router and our backbone routers.

This approach not only gives us greater control over improving security, reliability, and performance for customers, but also enables us to move faster than others to innovate. For example, in 2019, we introduced Elastic Fabric Adapter (EFA), a network interface custom-built by AWS that provides operating system bypass capabilities to Amazon EC2 instances. This enables customers to run applications requiring high levels of inter-node communications at scale. EFA uses Scalable Reliable Datagram (SRD), a high-performance, lower-latency network transport protocol that was designed specifically by AWS, for AWS.

More recently, we moved fast to deliver a new network for generative AI workloads. Our first-generation UltraCluster network, built in 2020, supported 4,000 GPUs with a latency of eight microseconds between servers. The new network, UltraCluster 2.0, supports more than 20,000 GPUs with 25% latency reduction. It was built in just seven months—this speed would not have been possible without investments over the years in our own custom network devices and software.

Internally, we call UltraCluster 2.0 the ‘10p10u’ network, as it delivers tens of petabits per second of throughput, with a round-trip time of less than 10 microseconds. The new network results in at least 15% reduction in time to train a model.

2. Continuously Improving the Energy Efficiency of Our Data Centres

Training and running AI models can be energy-intensive, so efficiency efforts are critical. AWS is committed to running our business in an efficient way to reduce our impact on the environment. Not only is this the right thing to do for communities and for our planet, but it also helps AWS reduce costs, and we can then pass those cost savings on to our customers.

For many years, we have focused on improving energy efficiency across our infrastructure. Some examples include:

Optimising the longevity and airflow performance of the cooling media in our data centre cooling systems.
Using advanced modelling methods to understand how a data centre will perform before it is built and to optimise how we position servers in a rack and in the data hall so that we are maximising power utilisation.
Building data centres to be less carbon-intensive, including using lower-carbon concrete and steel, and transitioning to hydrotreated vegetable oil for backup generators.

New research by Accenture shows these efforts are paying off. The research estimates that AWS’s infrastructure is up to 4.1 times more efficient than on-premises and when optimising on AWS, associated workloads’ carbon footprint can be reduced by up to 99%. But we cannot stop there as power demand increases.

AI chips perform mathematical calculations at high speed, making them critical for ML models. They also generate much more heat than other types of chips, so new AI servers that require more than 1,000 watts of power per chip will need to be liquid-cooled. However, some AWS services utilise network and storage infrastructure that does not require liquid cooling, and therefore, cooling this infrastructure with liquid would be an inefficient use of energy.

AWS’s latest data centre design seamlessly integrates optimised air-cooling solutions alongside liquid cooling capabilities for the most powerful AI chipsets, like the NVIDIA Grace Blackwell Superchips. This flexible, multi-modal cooling design allows us to extract maximum performance and efficiency whether running traditional workloads or AI/ML models. Our team has engineered our data centres—from rack layouts to electrical distribution to cooling techniques—so that we continuously increase energy efficiency, no matter the compute demands.

3. Ensuring Security from the Ground Up

One of the most common infrastructure questions we hear from customers as they explore generative AI is how to protect their highly sensitive data. At AWS, security is our top priority, and it is built into everything we do.

Our infrastructure is monitored 24/7, and when data leaves our physical boundaries and travels between our infrastructure locations, it is encrypted at the underlying network layer. Not all clouds are built the same, which is adding to the number of companies moving their AI focus to AWS.

AWS is architected to be the most secure and reliable global cloud infrastructure. Our approach to securing AI infrastructure relies on three key principles:

Complete isolation of the AI data from the infrastructure operator, meaning the infrastructure operator must have no ability to access customer content and AI data, such as AI model weights and data processed with models.
Ability for customers to isolate AI data from themselves, meaning the data remains inaccessible from customers’ own users and software.
Protected infrastructure communications, meaning the communication between devices in the ML accelerator infrastructure must be protected.

In 2017, we launched the AWS Nitro System—a pioneering design of specialised hardware and software that protects customers’ code and data from unauthorised access during processing. The Nitro System fulfills the first principle of Secure AI Infrastructure by isolating customers’ AI data from AWS operators.

The second principle is fulfilled by our integrated solution between AWS Nitro Enclaves and AWS Key Management Service (AWS KMS). With Nitro Enclaves and AWS KMS, customers can encrypt their sensitive AI data using keys that they own and control, store that data in a location of their choice, and securely transfer the encrypted data to an isolated compute environment for inferencing.

Throughout this entire process, the sensitive AI data is encrypted and isolated from their own users and software on their EC2 instance, and AWS operators cannot access this data.

Previously, Nitro Enclaves operated only in the CPU. Recently, we took that a step further when we announced our plans to extend this Nitro end-to-end encrypted flow to include first-class integration with ML accelerators and GPUs, fulfilling the third principle.

4. Using AWS AI Chips for Generative AI

The chips that power generative AI are crucial, impacting how quickly, inexpensively, and sustainably you can train and run models.

For many years, AWS has innovated to reduce the costs of our services. This is no different for AI – by helping customers keep costs under control, we can ensure AI is accessible to customers of all sizes and industries. So for the last several years, we’ve been designing our own AI chips, including AWS Trainium and AWS Inferentia. These purpose-built chips offer superior price-performance and make it more energy efficient to train and run generative AI models.

Generative AI

AWS Trainium is designed to speed up and lower the cost of training ML models by up to 50 percent over other comparable training-optimised Amazon EC2 instances, and AWS Inferentia enables models to generate inferences more quickly and at lower cost, with up to 40% better price performance than other comparable inference-optimised Amazon EC2 instances.

Demand for our AI chips is quite high given their favourable price-performance benefits relative to available alternatives. Trainium2 is our third-generation AI chip and will be available later this year. Trainium2 is designed to deliver up to 4x faster training than first generation Trainium chips and will be able to be deployed in EC2 UltraClusters of up to 100,000 chips, making it possible to train foundation models and large language models in a fraction of the time, while improving energy efficiency up to 2x.

Additionally, AWS works with partners including NVIDIA, Intel, Qualcomm, and AMD to offer the broadest set of accelerators in the cloud for ML and generative AI applications. And we will continue to innovate in order to deliver future generations of AWS-designed chips that deliver even better price-performance for customers.

Amidst the AI (and generative AI) boom, it is important that organisations choose the right compute infrastructure to lower costs and ensure high performance. At AWS, we are proud to offer our customers the most secure, performant, cost-effective, and energy-efficient infrastructure for building and scaling ML applications.