This article was compiled by 半導體產業縱橫 (ID: ICVIEWS) from IEEE Spectrum.
Nvidia Blackwell leads the AI inference space, with AMD in second place.
In the latest round of machine learning benchmarks released by MLCommons, computers built on Nvidia's new Blackwell GPU architecture outperformed all others. But AMD's latest Instinct GPU, the MI325X, matched the performance of its rival, Nvidia's H200. The comparable results for the two came mainly from tests of one of the smaller large language models, Llama2 70B (70 billion parameters). Meanwhile, to keep up with the rapidly changing AI landscape, MLPerf has added three new benchmarks to better reflect where machine learning is headed.
MLPerf benchmarks machine learning systems and is designed to provide an apples-to-apples comparison between computer systems. Submitters use their own software and hardware, but the underlying neural network must be the same. There are now 11 server benchmarks in total, with 3 added this year.
"It is hard to keep up with the rapid development of the field," said Miro Hodak, co-chair of MLPerf Inference. ChatGPT only appeared at the end of 2022, OpenAI introduced its first large language model (LLM) that can reason through tasks last September, and LLMs have grown exponentially: GPT-3 has 175 billion parameters, while GPT-4 is believed to have nearly 2 trillion. Because of this blistering pace of innovation, "we have picked up the speed at which we bring new benchmarks into the field," Hodak said.
The new benchmarks include two LLMs. The popular and relatively compact Llama2 70B is already a full-fledged MLPerf benchmark, but the consortium wanted something that mimics the responsiveness people expect from chatbots today. So the new benchmark, "Llama2-70B Interactive," tightens the requirements: under all circumstances, the computer must produce at least 25 tokens per second, and it cannot take more than 450 milliseconds to begin an answer.
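The two interactive constraints described above can be expressed as a simple pass/fail check. Below is a minimal sketch in Python: the 25 tokens/s and 450 ms thresholds come from the text, while the function name and the timing numbers in the example are invented for illustration.

```python
# Sketch: checking MLPerf "Llama2-70B Interactive"-style latency constraints
# for a single streamed response. Thresholds are from the article; the
# example timing data is made up.

def meets_interactive_slo(first_token_ms, total_tokens, total_ms,
                          ttft_limit_ms=450.0, min_tokens_per_s=25.0):
    """Return True if a response satisfies both latency requirements."""
    ttft_ok = first_token_ms <= ttft_limit_ms        # time to first token
    throughput = total_tokens / (total_ms / 1000.0)  # tokens per second
    return ttft_ok and throughput >= min_tokens_per_s

# A response that starts in 300 ms and emits 200 tokens over 5 s passes;
# one that takes 600 ms to start fails on time-to-first-token.
print(meets_interactive_slo(300, 200, 5000))  # True
print(meets_interactive_slo(600, 200, 5000))  # False
```

A real benchmark harness would aggregate such checks over thousands of queries; this only shows the shape of the two requirements.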
Seeing the rise of "agentic AI" (neural networks capable of handling complex tasks), MLPerf sought to test an LLM with some of the needed characteristics. They chose Llama3.1 405B for the job. This LLM has what is called a wide context window: a measure of how much information (documents, code samples, and so on) it can take in at once. For Llama3.1 405B, that is 128,000 tokens, more than 30 times that of Llama2 70B.
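The "more than 30 times" comparison is simple arithmetic on the two context windows. A quick check, assuming the commonly cited 4,096-token window for Llama2 70B:

```python
# Context-window comparison from the text: 128,000 tokens for Llama 3.1 405B
# vs. 4,096 tokens for Llama 2 70B (the latter is the commonly cited figure).
llama31_405b_ctx = 128_000
llama2_70b_ctx = 4_096

ratio = llama31_405b_ctx / llama2_70b_ctx
print(f"{ratio:.2f}x")  # 31.25x, i.e. "more than 30 times"
```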
The last new benchmark, called RGAT, is what is known as a graph attention network. Its job is to classify information in a network. For example, the dataset used to test RGAT consists of scientific papers, with relationships among authors, institutions, and fields of study, making up 2 terabytes of data. RGAT must classify the papers into just under 3,000 topics.
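To make "graph attention" concrete, here is a toy sketch of the core operation such networks use: each node aggregates its neighbors' features, weighted by an attention score. The graph, features, and scoring function are invented; a real RGAT adds learned weights, multiple relation types, and many layers.

```python
import math

# Minimal, illustrative graph-attention step: each node takes a weighted
# average of its neighbors' feature vectors, with weights from a softmax
# over attention scores. All values here are made up for clarity.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_aggregate(features, neighbors, score):
    """For each node, return the attention-weighted average of its neighbors."""
    out = {}
    for node, nbrs in neighbors.items():
        weights = softmax([score(features[node], features[n]) for n in nbrs])
        dim = len(features[node])
        out[node] = [sum(w * features[n][d] for w, n in zip(weights, nbrs))
                     for d in range(dim)]
    return out

# Toy "citation graph": paper 0 is linked to papers 1 and 2.
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
print(attention_aggregate(feats, {0: [1, 2]}, dot))
```

Stacking such layers and ending with a classifier head is what lets the benchmark model assign each paper to one of its roughly 3,000 topics.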
Nvidia led the MLPerf benchmarks. Its first- and second-generation Hopper architecture GPUs, the H100 and the memory-enhanced H200, both performed well. Of the Hopper architecture, which went into production in 2022, Dave Salvator, director of accelerated computing products at Nvidia, said, "we've gotten another 60 percent of performance out of it in the last year. It still has some headroom in terms of performance."
However, it was Nvidia's Blackwell architecture GPU, the B200, that really dominated. "The only thing faster than Hopper is Blackwell," Salvator said. The B200 packs 36 percent more high-bandwidth memory than the H200, but more importantly, it can perform key machine learning math using numbers as small as 4 bits, rather than the 8-bit precision Hopper pioneered. Lower-precision compute units are smaller, so more of them fit on a GPU, speeding up AI computation.
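The precision trade-off described above can be sketched numerically: halving the bit width halves weight storage, at the cost of coarser values. This is a generic uniform-quantization sketch, not Nvidia's actual FP4/FP8 formats (which are floating-point, not uniform integer grids).

```python
# Sketch of the low-precision trade-off: the same number of weights takes
# half the bytes at 4 bits vs. 8 bits, but 4-bit values are coarser.
# This uses a simple uniform grid, NOT the actual FP4/FP8 formats.

def quantize(x, bits, lo=-1.0, hi=1.0):
    """Uniformly quantize x in [lo, hi] to `bits` of precision, then dequantize."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    q = round((min(max(x, lo), hi) - lo) / step)
    return lo + q * step

weights = [0.73, -0.31, 0.05]
approx_8bit = [quantize(w, 8) for w in weights]  # 255 levels: small error
approx_4bit = [quantize(w, 4) for w in weights]  # 15 levels: coarser, 2x smaller
print(approx_8bit, approx_4bit)

n_params = 405e9  # a 405-billion-parameter model, for scale
print(f"8-bit weights: {n_params * 1 / 1e9} GB, 4-bit weights: {n_params * 0.5 / 1e9} GB")
```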
In the Llama3.1 405B benchmark, Supermicro's eight-B200 system delivered nearly four times as many tokens per second as a Cisco eight-H200 system. The same Supermicro system was three times as fast as the fastest H200 computer on the interactive version of Llama2 70B.
Using its combination of Blackwell GPU and Grace CPU, called the GB200, Nvidia demonstrated how its NVL72 data link can integrate multiple servers in a rack so that they act like one giant GPU. In an unverified result the company shared, a full rack of GB200s delivered 869,200 tokens per second running Llama2 70B. The fastest system reported in this round of MLPerf was an Nvidia B200 server, which delivered 98,443 tokens per second.
AMD is positioning its latest Instinct GPU, the MI325X, as offering performance comparable to Nvidia's H200. The MI325X has the same architecture as its predecessor, the MI300, but adds more high-bandwidth memory and memory bandwidth: 256 gigabytes and 6 terabytes per second (33 percent and 13 percent improvements, respectively).
The added memory is meant to handle ever-larger LLMs. "Larger models are able to take advantage of these GPUs because the models can fit into a single GPU or a single server," said Mahesh Balasubramanian, director of data center GPU marketing at AMD. "So you don't have to incur the overhead of communication from one GPU to another or from one server to another. When you eliminate those communications, latency improves dramatically." AMD was able to leverage the extra memory through software optimizations to speed up inference of DeepSeek-R1 by as much as eight-fold.
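The "fits into a single GPU or a single server" point comes down to back-of-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter. A sketch, assuming the 256 GB MI325X capacity from the text and illustrative model sizes (it ignores activations, KV cache, and other runtime overhead):

```python
# Back-of-envelope check: do a model's weights fit in GPU memory?
# Weight memory (GB) ~= parameters (billions) * bytes per parameter.
# Ignores activations and KV cache, so this is a lower bound on real needs.

def fits(params_billions, bytes_per_param, gpu_mem_gb, gpus=1):
    """True if the model's weights alone fit in the pooled GPU memory."""
    need_gb = params_billions * bytes_per_param
    return need_gb <= gpu_mem_gb * gpus

# A 70B model at 8-bit precision needs ~70 GB: fits on one 256 GB GPU.
print(fits(70, 1, 256))            # True
# A 405B model at 16-bit needs ~810 GB: too big for one GPU...
print(fits(405, 2, 256))           # False
# ...but fits in one eight-GPU server (8 x 256 GB = 2,048 GB pooled).
print(fits(405, 2, 256, gpus=8))   # True
```

This is exactly why staying within one GPU, or one server, avoids the cross-device communication overhead Balasubramanian describes.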
In the Llama2 70B test, an eight-GPU MI325X computer came within 3 to 7 percent of the speed of similarly configured H200 systems. On image generation, the MI325X system was within 10 percent of the speed of an Nvidia H200 computer.
Another notable result this round came from AMD's partner MangoBoost, which demonstrated nearly four times the performance on the Llama2 70B test by spreading the computation across four computers.
Intel has historically entered CPU-only systems in the inference competition to show that for some workloads you do not really need a GPU. This round saw the first data from Intel's Xeon 6 chips, formerly known as Granite Rapids, which are made on Intel's 3-nanometer process. The best image-recognition result for a dual Xeon 6 computer was about one-third the performance of a Cisco computer with two Nvidia H100s.
Compared with the Xeon 5 results from 2024, the new CPU is about 80 percent better on that benchmark, with even bigger improvements on object detection and medical imaging. Since the first Xeon results (with Xeon 3) were submitted in 2021, the company has seen an 11-fold performance improvement on ResNet.
For now, Intel seems to have exited the AI accelerator chip race. Its alternative to the Nvidia H100, Gaudi 3, appeared neither in the new MLPerf results nor in version 4.1, released last October. Gaudi 3's release came later than planned because its software was not ready. In his opening remarks at Intel Vision 2025, the company's invitation-only customer conference, new CEO Lip-Bu Tan seemed to apologize for Intel's AI efforts. He told attendees: "I'm not happy with where we are today. You're not happy either. I hear you loud and clear. We are working toward a competitive system. It won't happen overnight, but we will get there for you."
Google's TPU v6e chip also performed well, although the results were limited to the image-generation task. At 5.48 queries per second, the 4-TPU system was 2.5 times as fast as a similar computer using its predecessor, the TPU v5e. Even so, 5.48 queries per second is roughly on par with a similarly sized Lenovo computer using Nvidia H100s.