Chiplets Enter The Supercomputer Race

Nations compete on speed using very different compute architectures.

Several entities from various nations are racing each other to deliver and deploy chiplet-based exascale supercomputers, a new class of systems roughly 1,000 times faster than the petascale supercomputers that preceded them.

The latest exascale supercomputer CPU and GPU designs mix and match complex dies in advanced packages, adding a new level of flexibility and customization for supercomputers. For years, various nations have been vying for the leadership position in this space, with benefits that extend well beyond just supercomputers. These large and expensive systems pave the way for tremendous breakthroughs in AI, biology, defense, energy, and science.

Today’s supercomputers, as well as the new exascale systems, are based on the principles of conventional computing, which is completely different from quantum computing. In conventional computing, information is stored in bits, which can be either a zero or a one. In quantum computing, information is stored in quantum bits, or qubits, which can exist as a zero, a one, or a combination of both. This superposition state enables a quantum computer to outperform traditional systems on certain classes of problems, but quantum systems are still years away from being practical.
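
As a concrete illustration (standard textbook notation, not a detail from this article), a qubit’s state can be written as a weighted combination of the two basis states:

    |\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \qquad |\alpha|^2 + |\beta|^2 = 1

Measuring the qubit returns a zero with probability |\alpha|^2 and a one with probability |\beta|^2.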

State-of-the-art conventional supercomputers can perform more than 1 quadrillion (10^15) floating-point operations per second (petaFLOPS or Pflop/s). Today, Fugaku, a supercomputer built by Riken and Fujitsu, is the world’s fastest system, with a high-performance Linpack (HPL) benchmark score of 442 Pflop/s. The HPL score reflects a system’s performance in solving certain linear equations; it doesn’t reflect the overall performance of a system.
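
For reference (a standard description of the benchmark, not a detail from this article), HPL reports the sustained rate at which a system solves one large, dense system of linear equations in double precision:

    Ax = b, \qquad A \in \mathbb{R}^{n \times n}

The score is essentially the operation count of that solve (roughly 2n^3/3 floating-point operations for the factorization) divided by the measured run time, which is why it rewards dense linear-algebra throughput rather than overall application performance.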

Exascale speed
Meanwhile, several entities from China, Europe, Japan, and the United States have been developing exascale-class supercomputers, which perform a quintillion (10^18) or more floating-point operations each second (exaFLOPS or Eflop/s).
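
The unit conversion is straightforward (simple arithmetic using the figures above):

    1~\text{Eflop/s} = 10^{18}~\text{flop/s} = 1{,}000~\text{Pflop/s}

So a 1 Eflop/s machine is roughly 1,000/442 ≈ 2.3 times faster than Fugaku’s HPL score, and a 1.5 Eflop/s system such as Frontier’s target is about 3.4 times faster.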

Recently, two supercomputers in China reportedly broke the Eflop/s barrier, although those results remain unsubstantiated. And later this year, the U.S. is expected to deploy its first exascale supercomputer, a 1.5 Eflop/s or faster system called Frontier. Based on AMD’s server processors and GPU accelerators, Frontier is located at Oak Ridge National Laboratory.

The U.S. also is developing two other exascale supercomputers, including Aurora, which is being built at the Argonne National Laboratory. Aurora is built around Intel’s server processors and GPUs.

From an architectural standpoint, all supercomputers are similar. These systems are composed of a multitude of racks, each of which contains many compute nodes, and each compute node has several CPUs and GPUs. Traditionally, many of these chips were large and complex system-on-a-chip (SoC) devices, where all functions are incorporated on a monolithic die.

That’s beginning to change. Some, but not all, exascale supercomputers are using a chiplet approach, particularly the U.S.-based systems. Instead of an SoC, the CPUs and GPUs in these systems are partitioned into smaller dies or tiles, which are fabricated separately and then reaggregated into advanced packages. Simply put, it’s easier to fabricate smaller dies with higher yields than it is to build large SoCs.

The idea of incorporating multiple dies in a package isn’t new, especially in high-performance computing (HPC). “The idea of putting multiple chips in a single package has been around for a long time. IBM used a multi-chip carrier in the early 1980s to build their mainframes,” said Bob Sorensen, senior vice president of research at Hyperion Research. “So in theory, chiplets are merely the most recent incarnation of multiple dies in a single package. But chiplets can allow an HPC designer to build the processor that has the exact computational, memory, and I/O capabilities best suited to an HPC’s expected workload.”

There are several changes and announcements in this market. Among them:

  • China is deploying exascale supercomputers.
  • The U.S. is readying its first exascale systems.
  • AMD and Intel disclosed details about their chips for the exascale era.
  • The industry released a new standard to connect chiplets in a package.

Fig. 1: Slated for deployment later this year, the Frontier exascale supercomputer targets 1.5 Eflop/s performance. Source: Oak Ridge National Laboratory

Supercomputer race
In total, the supercomputer market is projected to grow from $6.6 billion in 2021 to $7.8 billion in 2022, according to Hyperion Research. Hyperion splits the supercomputer market into three segments—leadership/exascale, large ($3 million and up each), and entry-level ($500,000 to $3 million). Each exascale system sells for roughly $600 million.

For years, supercomputers have been used for numerous applications. “Supercomputing is needed for many things, including massive simulation tasks like weather forecasting, massive arithmetic computing tasks like cryptocurrency mining, massive image processing tasks like satellite image processing, and massive neural network computing for deep learning training,” said Aki Fujimura, CEO of D2S. “It is used extensively in semiconductor manufacturing for problems like inverse lithography technologies, mask process correction, simulation-based verification of masks and wafers, and mask and wafer inspection.”

Viewed as a timeline, the computing field has made enormous progress. In 1945, the University of Pennsylvania developed ENIAC, the first general-purpose electronic digital computer. Using vacuum tubes to process the data, ENIAC executed 5,000 additions per second.
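
A back-of-the-envelope comparison (treating additions and floating-point operations as roughly comparable) shows how far the field has come. Dividing exascale’s 10^18 operations per second by ENIAC’s 5,000 additions per second gives:

    \frac{10^{18}}{5 \times 10^{3}} = 2 \times 10^{14}

In other words, an exascale system performs on the order of 200 trillion times more operations per second than ENIAC did.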

Starting in the 1950s, transistors replaced vacuum tubes in many systems, enabling faster computers. Transistors, the key building blocks in chips, act as switches in these devices.

In 1964, now-defunct Control Data introduced the CDC 6600, the world’s first supercomputer. The 6600 incorporated a transistor-based 60-bit processor that delivered 2 MIPS of performance. Since then, supercomputers have become far more powerful, and various nations continue to leapfrog each other for the performance leadership position.

For example, in 2008, IBM’s Roadrunner became the world’s fastest supercomputer, and the first to break the petaflop barrier, with a performance of 1.026 Pflop/s. Then, in 2010, China jumped into the leadership position with the Tianhe-1A, a supercomputer with a performance of 2.57 Pflop/s.

Since 2020, Japan’s Fugaku has held the No. 1 position in supercomputing. IBM’s Summit holds the No. 2 spot and is the fastest supercomputer in the U.S.

The Fugaku system consists of 158,976 compute nodes for a total of 7,630,848 Arm processor cores. “Each node is equipped with a processor called the A64FX, which consists of 48 general-purpose processor cores and four assistant cores. A64FX is fabricated with a 7nm process,” said Shuji Yamamura, a researcher at Fujitsu/Riken, in a paper at the recent ISSCC event.
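
The headline core count follows directly from those figures, using the 48 general-purpose cores per node (the four assistant cores are not included in the total):

    158{,}976 \times 48 = 7{,}630{,}848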

Fugaku uses a custom-built Arm processor rather than a chiplet architecture. China’s supercomputers likewise tend to use custom processors, while many non-exascale supercomputers rely on merchant chips.

“For the more mainstream HPC sector, hardware decisions are primarily based on the availability of more mainstream mass components,” Hyperion’s Sorensen said. “These might include Intel CPUs, Nvidia GPUs and InfiniBand interconnects. They may be configured to be best suited to the HPC workload environment or may have some aggressive packaging and cooling capabilities to deal with the power issues.”

Both CPUs and GPUs play a key role in HPC. “For sequential data processing type of programming, CPUs tend to be more cost-effective than GPUs. But for tasks that compute a lot for any given unit of data, GPUs can be much more efficient, particularly if a computing task can be cast into a single-instruction multiple-data (SIMD) problem. This is where much of the data is processed in parallel and executed in the same instructions on different data,” D2S’ Fujimura said.
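
As a minimal sketch of the SIMD idea Fujimura describes (a generic illustration, not code from any of these systems), the same multiply-add can be written as a sequential loop or as a single data-parallel operation. The vectorized form is what maps naturally onto SIMD units and GPUs, because one instruction is applied to many data elements at once:

    # Hypothetical example: scalar loop vs. data-parallel (SIMD-style) computation in NumPy.
    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    c = np.random.rand(1_000_000)

    # Sequential version: one element at a time, the way a scalar core would step through it.
    out_loop = np.empty_like(a)
    for i in range(a.size):
        out_loop[i] = a[i] * b[i] + c[i]

    # Data-parallel version: the same multiply-add applied across all elements at once.
    out_vec = a * b + c

    assert np.allclose(out_loop, out_vec)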

Exascale era
Going forward, supercomputing is entering the exascale era, which promises to deliver new breakthroughs in biology, defense, science and other fields.

Exascale systems are expensive to develop. “At the exascale range, a $500 million-plus HPC may have upward of 20% of its overall budget dedicated to the development of special features like custom chips, interconnects, and other components to meet some targeted workload requirement,” Hyperion’s Sorensen said.

Several entities are developing exascale supercomputers. China appears to have a narrow lead, followed closely by the United States, with Europe behind the pack. Earlier this year, the European High Performance Computing Joint Undertaking (EuroHPC) launched several new projects, including an exascale program, but it’s unclear when Europe will deploy a system.

According to Hyperion Research, China has three exascale supercomputers in the works: Sunway Oceanlite, Tianhe-3, and Sugon. Installed in the National Supercomputer Center in Wuxi, Sunway Oceanlite was completed in 2021. Last year, researchers claimed to have reached the 1.3 Eflop/s level in peak performance. This system is based on an internally designed SW39010 CPU. In total, the system consists of more than 38 million CPU cores, according to Hyperion.

Completed late last year, Tianhe-3 has demonstrated 1.7 Eflop/s of performance. Meanwhile, the Sugon system has been delayed. None of the performance results from China have been confirmed.

While China tends to use traditional custom processors, the U.S.-based exascale systems are taking another approach. Their CPUs and GPUs leverage chiplets, in which dies are mixed and matched and then assembled in advanced packages.

To date, AMD, Intel, Marvell and others have developed chiplet-based designs, mainly for server and other high-end applications. The concept is also ideal for supercomputing.

“Chiplets will be implemented in several applications that benefit from their characteristics, including significant size reduction, lower power consumption, and better high-speed performance,” said Richard Otte, president and CEO of Promex, the parent company of QP Technologies. “For example, the DoD and DARPA are working to get the fastest supercomputers into their labs, and chiplets will help enable this.”

Today, the U.S. has three exascale systems in the works—Aurora, El Capitan, and Frontier. Frontier is expected to be in operation in late 2022, followed by Aurora and El Capitan in 2023.

In 2019, the U.S. Department of Energy (DOE) awarded Cray the contract to build the Frontier exascale supercomputer at Oak Ridge National Laboratory. Later that year, Cray was acquired by Hewlett Packard Enterprise (HPE).

HPE built the platform for Frontier, which supports a multitude of compute nodes. Each compute node supports one of AMD’s server CPUs and four AMD GPU accelerators.

Based on a 6nm process from TSMC, AMD’s new GPU accelerator incorporates two dies, which in total consist of 58 billion transistors. The architecture surpasses 380 teraflops of peak performance.

The GPU architecture is incorporated in a 2.5D package with a twist. In most 2.5D/3D packages, dies are stacked or placed side-by-side on top of an interposer, which incorporates through-silicon vias (TSVs). The TSVs provide electrical connections from the dies down through the interposer to the package substrate.

“TSVs are the enabling technology of 3D-ICs, [providing] electrical connections between the stacked chips. The main advantage of the 3D-IC technology with TSVs is that it provides a much shorter interconnection between different components, which results in lower resistive-capacitive delay and smaller device footprint,” said Luke Hu, a researcher at UMC, in a recent paper.

Fig. 2: Different options for high-performance compute packaging, interposer-based 2.5D vs. Fan-Out Chip on Substrate (FOCoS). Source: ASE

In 2.5D/3D packages, the interposer works, but much of the structure is wasted space. So several companies have developed an alternative approach called a silicon bridge, a tiny piece of silicon with routing layers that connects one chip to another in a package. In one example, Intel has developed the Embedded Multi-die Interconnect Bridge (EMIB), a silicon bridge that is typically embedded in the substrate.

Meanwhile, in AMD’s GPU, the company places the GPU dies and high-bandwidth memory (HBM) side by side on a silicon bridge. HBM is basically a stack of DRAM dies.

Unlike EMIB, which is embedded in the substrate, AMD puts the bridge on top of the substrate. AMD calls this a 2.5D Elevated Fanout Bridge (EFB).

Fig. 3: Substrate-based bridge vs. AMD’s 2.5D Elevated Fanout Bridge (EFB). Source: AMD

Other exascale supercomputers are in the works. Not long ago, Lawrence Livermore National Laboratory, HPE and AMD announced El Capitan, an exascale system that is expected to exceed 2 Eflop/s. This system is based on AMD’s chiplet-based CPUs and GPUs.

In 2019, meanwhile, the DOE, Intel and HPE announced plans to build Aurora, a ≥2 Eflop/s system. Originally, Aurora was expected to be delivered to Argonne in 2021, but that was pushed out due to chip delays at Intel.

Aurora is based on HPE’s supercomputer platform, with more than 9,000 compute nodes. Each node consists of two of Intel’s Sapphire Rapids processors, six of Intel’s GPU accelerators (code-named Ponte Vecchio), and a unified memory architecture. The system includes 10 petabytes (PB) of memory and 230PB of storage.
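
Taken at face value, those node counts imply a very large processor pool (simple multiplication of the figures above, not an official tally), with more than:

    9{,}000 \times 6 = 54{,}000~\text{GPUs}, \qquad 9{,}000 \times 2 = 18{,}000~\text{CPUs}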

Sapphire Rapids is a next-generation Xeon processor, which incorporates 4 smaller CPU dies in a package. Based on Intel’s 7nm finFET process, the dies are connected using EMIB.

The processor consists of more than 100MB of shared L3 cache, 8 DDR5 channels, and 32GT/s PCIe/CXL lanes. “New technologies include Intel Advanced Matrix Extensions (AMX), a matrix multiplication capability for acceleration of AI workloads and new virtualization technologies to address new and emerging workloads,” said Nevine Nassif, a principal engineer at Intel, in a presentation at the recent ISSCC event.

In Aurora, the CPU works with Ponte Vecchio, a GPU based on Intel’s Xe-HPC microarchitecture. This complex device incorporates 47 tiles on five process nodes in a package. In total, the device consists of more than 100 billion transistors.

Basically, Ponte Vecchio stacks two base dies on a substrate. On each base die, Intel stacks a memory fabric, followed by compute and SRAM tiles. The device also has eight HBM2E tiles. To enable the dies to communicate with each other, Intel uses a proprietary die-to-die link.

Based on Intel’s 7nm process, the two base dies provide a communication network for the GPU. The dies include memory controllers, voltage regulators, power management and 16 PCIe Gen5/CXL host interface lanes.

On each base die, Intel stacks 8 compute tiles and 4 SRAM tiles. The compute tiles are based on TSMC’s 5nm process, while the SRAM is built around Intel’s 7nm technology.

In total, the device incorporates 16 compute tiles and 8 SRAM tiles. Each compute tile has 8 cores. “Each core contains 8 vector engines, processing 512-bit floating-point/integer operands, and 8 matrix engines with an 8-deep systolic array executing 4096-bit vector operations,” said Wilfred Gomes, an Intel fellow, in a paper at ISSCC.
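
Those figures multiply out as follows (straightforward arithmetic from the numbers quoted, not an additional Intel disclosure):

    2 \times 8 = 16~\text{compute tiles}, \qquad 16 \times 8 = 128~\text{cores}

And because a 512-bit operand holds 512/64 = 8 double-precision values, each vector engine works on eight FP64 elements per instruction.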

For power delivery, Intel implements so-called fully integrated voltage regulators (FIVRs) on the base dies. “FIVR on the base die delivers up to 300W per base die into a 0.7V supply,” Gomes said. “3D-stacked FIVRs enable high-bandwidth fine-grained control over multiple voltage domains and reduce input current.”
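
Gomes’ numbers show why on-die conversion matters. A back-of-the-envelope calculation for one base die:

    I = \frac{P}{V} = \frac{300~\text{W}}{0.7~\text{V}} \approx 429~\text{A}

Converting from a higher input voltage on the base die itself means that current of this magnitude flows only over the short 3D-stacked path, rather than being pushed across the package and board.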

Thermal management poses a significant challenge in advanced packaging. To address this issue, Intel applies a thermal interface material (TIM) over the top dies and then places a heat spreader on the GPU.

“The TIM eliminates air gaps caused by different die stack heights to reduce thermal resistance. In addition to the 47 functional tiles, there are 16 additional thermal shield dies stacked to provide a thermal solution over exposed base die area to conduct heat,” Gomes said.

How to develop chiplets
Supercomputing is just one of many applications for chiplets. Recently, several vendors have developed chiplet-like designs for servers. Future chiplet architectures are in the works.

Developing a chiplet-like design is appealing, but there are several challenges. It takes significant resources and several enabling technologies to develop chiplets.

As noted, with chiplets, instead of designing a large SoC, you architect the device from the ground up using smaller dies. Then you fabricate the dies and assemble them into a package. There are several design considerations associated with this.

“In a sense, this kind of advanced package or advanced product requires high-density interconnects,” said Choon Lee, chief technology officer of JCET. “So in that context, packaging itself is no longer just a single die in a package with encapsulation. In more advanced packaging, you have to think about the layout, the interactions with the chip and the package, and how to route these layers. The question is how do you really optimize the layout to get the optimal performance or maximum performance in the package.”

That’s not the only issue. In the package, some dies are stacked. Other dies reside elsewhere in the package. So you need a way to connect one die to another using die-to-die interconnects.

Today’s chiplet-like designs connect the dies using proprietary buses and interfaces, which is limiting the adoption of the technology. Several organizations have been working on open buses and interface standards.

In the latest effort, ASE, AMD, Arm, Google, Intel, Meta, Microsoft, Qualcomm, Samsung, and TSMC recently formed a consortium to establish a die-to-die interconnect standard for chiplets. The group ratified the UCIe (Universal Chiplet Interconnect Express) 1.0 specification, an open industry interconnect standard at the package level, covering the die-to-die I/O physical layer, die-to-die protocols, and software stack.

“The age of chiplets has truly arrived, driving the industry to evolve from silicon-centric thinking to system-level planning, and placing the crucial focus on co-design of the IC and package,” said Lihong Cao, director of engineering and technical marketing at ASE. “We are confident that UCIe will play a pivotal role in enabling ecosystem efficiencies by lowering development time and cost through open standards for interfaces between various IPs within a multi-vendor ecosystem, as well as utilization of advanced package-level interconnect.”

That doesn’t solve all problems. In all packages, the thermal budget is a big concern. “Power dissipation and power usage are big challenges,” said Michael Kelly, vice president of advanced packaging development and integration at Amkor. “It’s hitting home in the packaging industry because of the integration at the package level. Unfortunately, silicon generates a lot of wasted heat. It’s not thermally efficient. You need to dump the heat somewhere. We have to make that as thermally efficient as possible for whoever is doing the thermal dissipation in the final product, whether that’s in a phone case or a water cooler in the data center. How much actual electrical current we have to deliver into a high-performance package is also getting interesting. Power is not going down, but voltages are sliding down. To deliver the same total power or more power, our currents are going up. Things like electromigration need to be addressed. We’re probably going to need more voltage conversion and voltage regulation in the package. That way we can bring higher voltages into the package and then separate them into lower voltages. That means we don’t have to drag as much total current into the package. So power is hitting us in two ways. It’s heat, but it’s also managing that power delivery network electrically. That’s forcing more content into the package, while also doing your best on thermal power dissipation.”
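
Kelly’s point about current follows directly from P = V × I. As a purely illustrative example (hypothetical numbers, not figures from Amkor), delivering 500W at a 0.7V core rail versus bringing the same 500W into the package at 12V and converting it down inside looks like this:

    \frac{500~\text{W}}{0.7~\text{V}} \approx 714~\text{A}, \qquad \frac{500~\text{W}}{12~\text{V}} \approx 42~\text{A}

The higher the voltage crossing the package boundary, the less current has to be carried through the board, socket, and package, which is exactly the motivation for putting more voltage regulation inside the package.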

Conclusion
Clearly, chiplets constitute an enabling technology and they’re making their way into server designs. Recently, Apple introduced a Mac desktop with a chiplet-like processor design. Now chiplet-based exascale supercomputers are on the scene.

For exascale supercomputers, chiplet-based approaches are being used in the Frontier, El Capitan, and Aurora systems. Others, such as Fugaku and Sunway Oceanlite, continue to follow the traditional SoC-based approach. Both methods work. Let the race begin.



