
Blackwell Architecture Deep Dive: How NVIDIA Redefines AI Peaks with Advanced Packaging and HBM3E

  • Writer: Amiee
  • Apr 27
  • 9 min read

The wave of artificial intelligence is sweeping the globe at astonishing speed. From large language models to scientific computing, the demand for computing power seems insatiable. However, traditional monolithic chip designs are gradually approaching physical limits, and Moore's Law is faltering. Facing this challenge, NVIDIA's answer isn't merely process scaling but a more profound transformation—a packaging revolution centered around the Blackwell architecture.


This isn't just another GPU upgrade; it's a rethinking of chip design, manufacturing, and interconnection methods. Blackwell's emergence signals the acceleration of an era where advanced packaging technology "assembles" multiple chips to break through single-chip limitations. Whether you're an enthusiast wanting to understand the latest tech trends and why Blackwell is so significant, or a professional seeking deep technical details, eager to explore the underlying principles and challenges, this article will peel back the layers of Blackwell's mystery, from basic concepts to core technologies. We'll explore how it leverages key technologies like CoWoS-L packaging and HBM3E memory to once again define the limits of AI computation.


NVIDIA Blackwell Architecture (source: NVIDIA)


The AI Compute Race Escalates: Why Was Blackwell Born?


Imagine an AI model as an insatiably curious student needing to read ever-increasing volumes of books (data) to learn. The GPU is the brain helping the student read and understand quickly. As AI models become increasingly complex and the amount of data they need to process explodes (e.g., evolving from billions to trillions of parameters), the processing power of a single brain (a single GPU chip) starts to lag behind.

There are two main reasons:


  1. Physical Limits of Chip Size (Reticle Limit): Chips are manufactured on silicon wafers using photolithography, and the maximum area a lithography scanner can expose in a single shot (the reticle field) caps how large a single die can be. Manufacturing one massive chip beyond this limit is extremely difficult and costly with current technology.


  2. Yield Issues: The larger the chip, the higher the probability of defects. A tiny defect can render the entire giant chip useless, drastically reducing yield and skyrocketing costs.


Although the Hopper architecture is powerful, it still faces these bottlenecks when confronted with the demands of next-generation AI models. To continuously increase computing power, NVIDIA had to find a new approach. Blackwell's core idea is: since making one giant chip is hard, let's tightly connect two (or more) powerful "smaller" chips (dies) together, making them function like a single, more powerful, unified chip. This is the Multi-Chip Module (MCM) design, and the key to realizing it lies in advanced packaging technology.
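
To make the yield argument concrete, below is a minimal sketch of the classic Poisson defect-yield model, assuming an illustrative defect density and die sizes (a hypothetical ~1,600 mm² monolithic die versus two ~800 mm² dies); none of these numbers are NVIDIA or TSMC figures.

```python
import math

def poisson_yield(area_cm2: float, defect_density_per_cm2: float) -> float:
    """Classic Poisson model: probability a die has zero killer defects,
    Y = exp(-A * D0)."""
    return math.exp(-area_cm2 * defect_density_per_cm2)

D0 = 0.1  # assumed killer-defect density in defects/cm^2 (illustrative only)

monolithic = poisson_yield(16.0, D0)  # hypothetical single ~1,600 mm^2 die
per_die    = poisson_yield(8.0, D0)   # one of two ~800 mm^2 dies

print(f"Hypothetical ~1,600 mm^2 monolithic die yield: {monolithic:.0%}")
print(f"Each ~800 mm^2 die yield:                      {per_die:.0%}")
# With two separate dies, a defective die costs only half the silicon, and
# known-good dies can be paired at packaging time instead of scrapping the whole part.
```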



From Single Die to Dual Die: The Core Change in Blackwell Architecture


The most striking change in the Blackwell architecture is the shift from Hopper's single GPU die design to a dual GPU die design (using the B200 GPU as an example). These two independently manufactured GPU dies are tightly integrated via an ultra-fast internal interconnect technology.


The benefits of this design are clear:


  • Breaking Size Limits: Two smaller dies are easier to manufacture, avoiding the Reticle Limit problem of a single giant chip.

  • Improving Yield: If one small die has a defect, only that die is discarded rather than an entire (hypothetical) giant monolithic chip, and known-good dies can be screened before assembly.

  • Potential Cost-Effectiveness: Beyond a certain scale, the cost of manufacturing two smaller dies and combining them might be lower than producing a single giant chip with equivalent performance.


However, for two independent chips to operate efficiently as one, the "communication bridge" between them must be extremely fast and have low latency. This introduces another key Blackwell technology: the chip-to-chip link between the two dies, which NVIDIA refers to as NV-HBI (NVIDIA High-Bandwidth Interface). This high-speed interface is designed specifically for inter-die connection and provides a staggering 10 TB/s of bandwidth, ensuring seamless data transfer between the two Blackwell GPU dies so they can work together as a single, unified processing unit.
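
For intuition about what 10 TB/s of die-to-die bandwidth buys, here is a back-of-the-envelope sketch; the 1 GB tensor is an arbitrary example, and the PCIe figure is an approximate per-direction number included only as a familiar baseline.

```python
def transfer_time_us(size_gb: float, bandwidth_gb_s: float) -> float:
    """Idealized transfer time in microseconds, ignoring protocol overhead."""
    return size_gb / bandwidth_gb_s * 1e6

tensor_gb = 1.0  # hypothetical 1 GB activation tensor crossing the die boundary

print(f"Die-to-die link, 10 TB/s:    {transfer_time_us(tensor_gb, 10_000):>9.1f} us")
print(f"5th-gen NVLink, 1.8 TB/s:    {transfer_time_us(tensor_gb, 1_800):>9.1f} us")
print(f"PCIe Gen5 x16, ~64 GB/s:     {transfer_time_us(tensor_gb, 64):>9.1f} us")
# The on-package link is fast enough that software can largely treat the two
# dies as one GPU rather than two devices that must be explicitly partitioned.
```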



The Art of Packaging: How CoWoS-L Connects the Blackwell Duo


Having a high-speed chip-to-chip interconnect isn't enough. How to physically "mount" these two GPU dies and their required High Bandwidth Memory (HBM) together, providing all necessary connections, is the cornerstone of Blackwell's success. This is where TSMC's CoWoS-L (Chip-on-Wafer-on-Substrate with Local Silicon Interconnect) advanced packaging technology comes into play.


We can think of CoWoS-L as a highly complex "adapter board" or "interposer":


  • Chip-on-Wafer (CoW): First, the GPU dies and HBM memory stacks are precisely placed (mounted) onto this special interposer.


  • Local Silicon Interconnect (L): What makes CoWoS-L special is the embedding of small silicon bridges within the interposer. These silicon bridges provide extremely high-density connection lines. It's through these bridges that the ultra-high bandwidth, low-latency connections between the two Blackwell GPU dies, and between the GPU dies and HBM memory, are achieved. Compared to traditional organic interposers, silicon bridges can accommodate finer, denser wiring, enabling faster transmission speeds and lower power consumption.


  • on-Substrate (oS): Finally, this interposer, carrying the chips and HBM, is mounted onto a conventional package substrate, which then connects to the external printed circuit board (PCB).


CoWoS-L enables Blackwell's dual-die design. It provides a platform capable of not only housing two huge GPU dies (each already near the Reticle Limit) but also integrating up to eight HBM3E memory stacks, ensuring unprecedented data transfer capabilities between them. Arguably, without advanced packaging technology like CoWoS-L, Blackwell's grand vision could not be realized.
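
A rough area tally helps explain why a package like this cannot simply sit on one full-size silicon interposer; the die and HBM footprint numbers below are ballpark assumptions for illustration, not TSMC or NVIDIA specifications.

```python
# Rough silicon-area tally for a Blackwell-class package (illustrative numbers).
reticle_field_mm2 = 858   # approximate single-exposure lithography field limit
gpu_die_mm2       = 800   # assumed size of each near-reticle-limit GPU die
hbm_stack_mm2     = 110   # assumed footprint of one HBM3E stack

total_mm2 = 2 * gpu_die_mm2 + 8 * hbm_stack_mm2
print(f"Silicon mounted on the package: ~{total_mm2} mm^2")
print(f"One reticle field:              ~{reticle_field_mm2} mm^2")
print(f"Ratio: ~{total_mm2 / reticle_field_mm2:.1f}x a single exposure field")
# A monolithic silicon interposer big enough for all of this would itself exceed
# the reticle limit; CoWoS-L instead embeds small local silicon bridges only
# where dense die-to-die and die-to-HBM wiring is actually needed.
```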



Solving the Memory Bottleneck: The Crucial Role of HBM3E


For AI computation, the GPU's processing power is vital, but the ability to feed it data quickly enough is equally critical. Memory bandwidth, the rate at which data moves between the GPU and its dedicated memory, is often a significant bottleneck that determines real-world performance.


The Blackwell architecture incorporates the currently fastest High Bandwidth Memory technology: HBM3E.


  • HBM (High Bandwidth Memory): This is a 3D stacked memory technology. Imagine stacking memory chips vertically like building blocks, instead of laying them flat on a motherboard, and connecting them directly through vertical channels called Through-Silicon Vias (TSVs). This significantly shortens data transmission paths, increasing bandwidth and reducing power consumption.


  • HBM3E: This is an enhanced version of HBM3 ("E" stands for Evolved or Extended), offering higher transfer rates and larger capacity per stack.


In the Blackwell B200 GPU, each GPU die is surrounded by four HBM3E memory stacks, totaling eight stacks. This brings the total memory capacity to 192GB and, more importantly, provides a staggering 8 TB/s of memory bandwidth. For comparison, the previous generation Hopper H100's HBM3 bandwidth was approximately 3.35 TB/s.


Such high memory bandwidth is crucial for training large language models with trillions of parameters. It allows model parameters and training data to be loaded into the GPU much faster, reducing wait times and significantly boosting training efficiency. For inference applications dealing with massive datasets, high bandwidth also markedly reduces latency.
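
As a worked example of what the bandwidth jump means in practice, consider inference, where each generated token requires streaming the active weights from HBM; the 180-billion-parameter FP8 model below is hypothetical and chosen simply so the weights fit in 192 GB.

```python
def weight_pass_ms(params_billion: float, bytes_per_param: float,
                   bandwidth_tb_s: float) -> float:
    """Idealized time (ms) to stream all model weights once from HBM."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / (bandwidth_tb_s * 1e12) * 1e3

# Hypothetical 180B-parameter model in FP8 (1 byte/parameter) = 180 GB of weights,
# which fits within B200's 192 GB of HBM3E.
for label, bw in [("H100 HBM3, 3.35 TB/s", 3.35), ("B200 HBM3E, 8 TB/s", 8.0)]:
    print(f"{label}: {weight_pass_ms(180, 1.0, bw):.1f} ms per full weight pass")
# Lower time per weight pass translates directly into lower latency per token
# for bandwidth-bound inference.
```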



Beyond the Chip: Fifth-Gen NVLink and NVLink Switch for System-Level Interconnect


While the Blackwell GPU itself is extremely powerful, training modern hyperscale AI models often requires hundreds or even thousands of GPUs working in concert. Efficiently connecting these GPUs into a massive computing cluster is another key challenge.


NVIDIA addresses this with its fifth-generation NVLink technology and the new NVLink Switch chip.


  • Fifth-Gen NVLink: Provides each Blackwell GPU with up to 1.8 TB/s of bidirectional bandwidth for direct GPU-to-GPU interconnection. This doubles the 900 GB/s of the previous generation Hopper.


  • NVLink Switch Chip: This is a separate, specialized switch chip designed to connect a large number of GPUs. It's also built using advanced process technology and incorporates numerous NVLink ports. Using the NVLink Switch, a high-bandwidth, low-latency switch fabric can be constructed, connecting up to 576 Blackwell GPUs into a single NVLink Domain. This allows them to coordinate work like one giant super-GPU, without going through traditional, slower Ethernet or InfiniBand switches (though the latter are still used to connect different NVLink Domains).


This system-level interconnect design is vital for supporting distributed training of ultra-large models, effectively reducing communication overhead and enhancing the overall cluster's computational efficiency.
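
To see how the doubled NVLink bandwidth shows up at cluster scale, here is a simplified ring all-reduce estimate for gradient synchronization; the gradient volume, the 72-GPU group size, and the idealized formula (no latency, no compute overlap, full use of the quoted bidirectional bandwidth) are all assumptions for illustration.

```python
def ring_allreduce_ms(grad_gb: float, n_gpus: int, link_tb_s: float) -> float:
    """Idealized ring all-reduce: each GPU moves ~2*(N-1)/N of the data volume."""
    traffic_gb = 2.0 * (n_gpus - 1) / n_gpus * grad_gb
    return traffic_gb / (link_tb_s * 1e3) * 1e3  # GB / (GB/s) -> s -> ms

grad_gb = 32.0  # hypothetical per-step gradient volume (e.g., 16B params in bf16)
for gen, bw in [("4th-gen NVLink, 0.9 TB/s", 0.9), ("5th-gen NVLink, 1.8 TB/s", 1.8)]:
    print(f"{gen}: ~{ring_allreduce_ms(grad_gb, 72, bw):.0f} ms per all-reduce over 72 GPUs")
# Halving communication time per step is what lets ever-larger NVLink Domains
# behave more like one big GPU and less like a network of separate machines.
```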



Not Just GPUs: The Converged Power of the Grace Blackwell Superchip


NVIDIA didn't stop at just the Blackwell GPU. They also introduced the Grace Blackwell Superchip (GB200), continuing the design philosophy of the Grace Hopper (GH200) by integrating the CPU and GPU even more tightly.


The GB200 connects one Arm-based Grace CPU with two Blackwell B200 GPUs on the same module via the ultra-fast NVLink-C2C interface.


  • Potential for Shared Memory Space: The Grace CPU and Blackwell GPUs can access a unified memory pool (even though, physically, the CPU's LPDDR5X and the GPUs' HBM3E remain distinct). This simplifies the programming model and can reduce data copying between the CPU and GPU, especially for complex workloads requiring significant CPU pre-processing or post-processing.


  • Advantages for Specific Applications: For scenarios demanding close CPU-GPU collaboration, such as large-scale inference, large database queries, and recommender systems, the GB200 offers extremely high bandwidth and low-latency connectivity, promising significant performance improvements.


The GB200 Superchip embodies NVIDIA's vision for the future of heterogeneous computing (CPU + GPU collaboration), aiming to provide a more integrated and efficient platform.
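
A simple staging comparison illustrates why the tight CPU-GPU link matters for workloads like recommender systems; the ~900 GB/s NVLink-C2C and ~64 GB/s PCIe Gen5 x16 figures are commonly quoted ballpark numbers, and the 400 GB working set is hypothetical.

```python
def staging_time_s(dataset_gb: float, link_gb_s: float) -> float:
    """Idealized time (s) to move CPU-resident data into GPU memory."""
    return dataset_gb / link_gb_s

working_set_gb = 400.0  # hypothetical CPU-side embedding tables / feature cache
print(f"Over PCIe Gen5 x16 (~64 GB/s):  {staging_time_s(working_set_gb, 64.0):.1f} s")
print(f"Over NVLink-C2C   (~900 GB/s):  {staging_time_s(working_set_gb, 900.0):.1f} s")
# With a coherent, high-bandwidth CPU-GPU link, the GPU can also read such data
# in place rather than copying it wholesale, shrinking this cost further.
```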



Performance Leaps and Power Efficiency Challenges


According to NVIDIA's official data, the Blackwell architecture achieves significant performance gains in AI training and inference. For instance, training large models like GPT-MoE-1.8T is reportedly 4 times faster on Blackwell compared to Hopper. In inference, the speedup can reach 7 times or even higher (especially when using new FP4/FP6 number formats).


However, this performance comes at the cost of immense power consumption. A single B200 GPU's Thermal Design Power (TDP) is reported to reach as high as 1200W, far exceeding the H100's 700W. An HGX B200 server containing eight B200 GPUs can draw over 10kW in total, and a GB200-based rack (such as the NVL72) reaches a staggering 120kW.

Such high power consumption poses severe challenges for data center cooling (liquid cooling becomes essential) and power delivery infrastructure. While performance per watt is claimed to have improved, the rise in absolute power consumption remains an issue the industry must address together.
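
Simple arithmetic on the figures quoted above shows how both claims can be true at once, better efficiency per job alongside much higher absolute power; this is a rough estimate from reported numbers, not a measurement.

```python
# Energy per training job, using the reported figures above.
h100_tdp_w = 700.0
b200_tdp_w = 1200.0
speedup    = 4.0    # reported training speedup of Blackwell over Hopper

# Running 4x faster at 1200/700 the power changes energy per job by:
relative_energy = (b200_tdp_w / h100_tdp_w) / speedup
print(f"Energy per job vs Hopper: ~{relative_energy:.2f}x "
      f"(roughly a {1 / relative_energy:.1f}x efficiency gain)")

# Absolute power still climbs sharply: a ~120 kW GB200 NVL72 rack draws about as
# much as a dozen typical ~10 kW air-cooled racks, hence the move to liquid cooling.
```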



Blackwell vs. Hopper Key Specification Comparison

Feature | NVIDIA B200 (Blackwell) | NVIDIA H100/H200 (Hopper)
Architecture Design | Dual-Die MCM (Multi-Chip Module) | Monolithic Die
Process Node | TSMC 4NP (Custom 4nm) | TSMC 4N (Custom 5nm)
Transistor Count | 2 x 104 Billion = 208 Billion | 80 Billion (H100 / H200, same GPU die)
GPU Dies | 2 | 1
Memory Type | HBM3E | HBM3 (H100) / HBM3E (H200)
Memory Capacity | 192 GB | 80 GB (H100) / 141 GB (H200)
Memory Bandwidth | 8 TB/s | 3.35 TB/s (H100) / 4.8 TB/s (H200)
NVLink (GPU-GPU) | 5th Gen, 1.8 TB/s | 4th Gen, 900 GB/s
Die-to-Die Interconnect (NV-HBI) | 10 TB/s (internal) | N/A
Single GPU TDP | Up to 1200W | Up to 700W
AI Performance | Multiples of Hopper (task-dependent) | (Baseline)
Packaging Technology | CoWoS-L | CoWoS-S

Note: Some data may vary slightly depending on the specific product model. Performance improvements depend on the specific application and software optimization.



Manufacturing Challenges and Ecosystem Impact


Blackwell's ambition also brings significant manufacturing challenges:


  • CoWoS-L Capacity and Yield: CoWoS-L is an extremely complex and expensive packaging technology, and ensuring sufficient capacity and high yield puts immense pressure on TSMC. Its capacity directly limits Blackwell GPU shipments.


  • HBM3E Supply: High-speed, high-capacity HBM3E is also a scarce resource. Memory giants like SK Hynix, Samsung, and Micron are striving to expand production capacity to meet demand.


  • Supply Chain Integration: From chip fabrication and HBM production to final CoWoS-L packaging, every link in the supply chain must be closely coordinated and synchronized.


  • Testing Complexity: Testing a complex module containing two GPU dies and eight HBM stacks is far more difficult than testing a single-die GPU.


Blackwell's launch not only solidifies NVIDIA's leadership in AI hardware but also profoundly impacts the entire ecosystem. It drives the development and adoption of advanced packaging technologies, stimulates innovation and capacity expansion for HBM memory, and compels competitors (like AMD, Intel, and cloud providers with their custom chips) to accelerate their efforts. Simultaneously, its staggering power consumption pushes for transformations in data center infrastructure, particularly in cooling technologies.



Future Outlook: The AI Hardware Landscape After Blackwell


The Blackwell architecture represents the current pinnacle of overcoming single-chip limitations through advanced packaging technology. It clearly points the way for future high-performance computing chip development: a shift from pursuing single giant chips to intelligent multi-chip integration.


We can anticipate:


  • Maturation of the Chiplet Ecosystem: MCM designs similar to Blackwell will become increasingly common. More flexible chiplet designs may emerge in the future, allowing customers to combine different functional units (CPU, GPU, AI accelerators, I/O, etc.) based on their needs.


  • Continued Evolution of Packaging Technology: Beyond CoWoS, more advanced 3D stacking packaging technologies (e.g., direct chip-on-chip stacking) might appear, further reducing interconnect distances and boosting performance.


  • Rise of Optical Interconnects: When electrical interconnects reach their bottlenecks, using light signals for ultra-high-speed data transmission between chips or systems could become the next-generation solution.


  • Hardware-Software Co-design: To fully exploit the potential of complex hardware, software (like the CUDA platform) needs to be designed and optimized in closer coordination with the hardware architecture.


Blackwell is not the end but the beginning of a new era in AI hardware. It demonstrates that through systematic innovation—combining advanced processes, architectural design, memory technology, interconnect technology, and revolutionary packaging solutions—humanity can continue to push the boundaries of computation even in the face of seemingly insurmountable physical limits.



Conclusion


NVIDIA's Blackwell architecture is more than just a routine GPU upgrade; it's a revolution led by advanced packaging technology. By cleverly integrating two powerful GPU dies with high-speed HBM3E memory using CoWoS-L packaging, Blackwell successfully overcomes the physical limitations of monolithic chips, delivering unprecedented compute power leaps for AI training and inference. From the dual-die MCM design, the 10 TB/s die-to-die interconnect, and 8 TB/s of HBM3E memory bandwidth to the system-level fifth-gen NVLink and NVLink Switch, every element reflects the pursuit of ultimate performance.


For technology enthusiasts, the Blackwell story showcases how engineers use clever "assembly" methods to bypass traditional limitations and continuously drive technological progress, meeting the enormous demand for computing power in the AI era. For professionals, Blackwell's design details, the cutting-edge technologies employed like CoWoS-L and HBM3E, and the challenges and opportunities it presents for manufacturing, power consumption, and the ecosystem offer valuable insights and inspiration.


Despite facing challenges in power consumption and manufacturing complexity, Blackwell undoubtedly sets a new benchmark for the development of next-generation AI hardware, heralding a new era dominated by multi-chip integration and advanced packaging.


