
The Cooling Revolution: Navigating the AI Server Thermal Management Ecosystem and Supply Chain

  • Writer: Sonya
  • Aug 3
  • 14 min read

The AI Computing Boom and the Ensuing Thermal Crisis


This section aims to establish the fundamental market driver: the exponential growth in the power consumption of Artificial Intelligence (AI) accelerators, which presents severe thermal challenges for data center infrastructure, thereby catalyzing a shift from air cooling to liquid cooling.


Unprecedented Power Demands of Modern AI Accelerators


At the heart of the issue is the sharp increase in the Thermal Design Power (TDP) of processors designed for intensive AI workloads like deep learning. A modern AI Graphics Processing Unit (GPU) can have a thermal output exceeding 300 watts per chip, a level at which traditional air cooling becomes insufficient. High heat generation directly impedes performance, as overheating can lead to thermal throttling, system crashes, and even physical damage to the hardware.


This problem is rapidly worsening. While air cooling can handle server racks with power densities below 20-25 kW, AI racks now far exceed this threshold, making liquid cooling a necessity rather than an option.


NVIDIA's GPU Roadmap as a Market Catalyst


NVIDIA's product roadmap is the primary determinant of the pace of development for the entire data center cooling industry. The evolution of its product line—from Hopper to Blackwell, and on to a future roadmap including Rubin and Feynman—dictates the thermal management needs of server ODMs and data center operators.


  • Blackwell Generation (2024-2025): The Blackwell B200 GPU has a TDP of 1,200W, while the Blackwell Ultra B300 raises this to 1,400W, making liquid cooling mandatory for these platforms. The flagship GB200 NVL72 rack system has a peak power consumption of 120 kW, far beyond the capabilities of air cooling.

  • Rubin Generation (2026-2027): The upcoming Rubin R100 GPU is expected to have a TDP of around 1,800W. The Rubin Ultra will integrate four GPU chips in a single socket, pushing the expected TDP to 3,600W. These figures not only indicate that liquid cooling is essential but also mean that the intensity and efficiency of liquid cooling solutions will become key performance differentiators.

  • Feynman Generation and Beyond (2028+): Projections for the Feynman architecture and its successors are even more staggering, with per-package TDPs potentially reaching 4,400W to 9,000W. This trajectory signals a future shift from today's advanced liquid cooling technologies (like direct-to-chip) to more cutting-edge solutions, such as immersion cooling, and eventually to embedded cooling.


NVIDIA's GPU roadmap and its escalating TDPs are no longer just a chip supplier's product plan; they have effectively become the blueprint for future data center architecture. Each new GPU generation forces the entire downstream supply chain—from server design and facility construction to component manufacturing—to undertake corresponding technological upgrades and investments. Any enterprise wishing to deploy cutting-edge AI technology must adopt liquid cooling. Consequently, NVIDIA's decisions are directly shaping the direction of this multi-billion-dollar infrastructure transformation market.
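
To see how quickly per-package TDP compounds into rack-level heat, the short Python sketch below multiplies the roadmap figures out to whole-rack loads. It is a rough estimate under stated assumptions (72 packages per rack and a 35% non-GPU overhead for CPUs, switches, NICs, and power conversion), not an NVIDIA specification; the point is simply that every generation pushes racks further past the 20-25 kW air-cooling ceiling.

```python
# Back-of-the-envelope rack heat loads for successive GPU generations, using the
# per-package TDPs quoted above. The GPUs-per-rack count and the non-GPU overhead
# share are assumptions chosen for illustration, not NVIDIA specifications.

GPUS_PER_RACK = 72        # NVL72-style rack layout (assumed for every row)
NON_GPU_OVERHEAD = 0.35   # extra heat beyond the GPUs themselves (assumption)

tdp_per_package_w = {
    "Blackwell B200": 1_200,
    "Blackwell Ultra B300": 1_400,
    "Rubin R100 (expected)": 1_800,
    "Rubin Ultra (expected)": 3_600,
}

for name, tdp in tdp_per_package_w.items():
    rack_kw = GPUS_PER_RACK * tdp * (1 + NON_GPU_OVERHEAD) / 1_000
    print(f"{name:24} ~{rack_kw:5.0f} kW per rack")
```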


The Inevitable Limits of Air Cooling in High-Density Environments


The physics are clear: by volume, water can hold over 3,000 times as much heat as air, giving liquid a fundamental superiority as a heat transfer medium.


Air cooling is inefficient, capable of removing only about 30% of the heat generated by a server, whereas liquid cooling can capture nearly 100%. Cooling systems account for approximately 40% of a data center's total energy consumption, making the inefficiency of air cooling a major operational cost and sustainability issue. The Uptime Institute's 2023 survey predicts that by the end of this decade, direct liquid cooling will surpass air cooling as the primary method for IT infrastructure, confirming this industry-wide shift.
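
A back-of-the-envelope calculation makes the medium comparison concrete. The sketch below uses the sensible-heat balance Q = m_dot * c_p * dT with textbook fluid properties and assumed temperature rises to estimate how much air versus water must flow through a single high-density rack; the 120 kW load and every other input are illustrative assumptions, not figures from any facility.

```python
# How much air vs. water must flow to carry away one high-density rack's heat?
# Sensible-heat balance: Q = m_dot * c_p * dT  ->  m_dot = Q / (c_p * dT).
# All inputs are illustrative assumptions, not measurements from any facility.

HEAT_LOAD_W = 120_000   # assumed rack load (roughly a GB200 NVL72-class rack)

CP_AIR = 1_005          # J/(kg*K), air near room temperature
CP_WATER = 4_186        # J/(kg*K), water
RHO_AIR = 1.2           # kg/m^3
RHO_WATER = 1_000.0     # kg/m^3
DT_AIR = 12.0           # K, assumed server inlet-to-outlet air temperature rise
DT_WATER = 10.0         # K, assumed liquid-loop temperature rise

m_dot_air = HEAT_LOAD_W / (CP_AIR * DT_AIR)        # kg/s of air
m_dot_water = HEAT_LOAD_W / (CP_WATER * DT_WATER)  # kg/s of water

print(f"Air:   {m_dot_air:5.1f} kg/s  = {m_dot_air / RHO_AIR:5.1f} m^3/s of airflow")
print(f"Water: {m_dot_water:5.1f} kg/s  = {m_dot_water / RHO_WATER * 1_000:5.1f} L/s of water")
```

Under these assumptions, the same rack needs roughly 8 cubic meters of air per second but only about 3 liters of water per second, which is why the air-side approach collapses at high density.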


With the proliferation of liquid cooling, the industry needs new standards for assessing energy efficiency. The traditional Power Usage Effectiveness (PUE) metric, the ratio of total facility power to IT equipment power, is inadequate for evaluating liquid-cooled data centers: the power drawn by traditional air-cooling equipment counts as facility power, while some direct liquid cooling components (such as CDU pumps) may be counted as IT equipment power, so PUE comparisons across cooling architectures can mislead. The industry is therefore moving toward more comprehensive metrics such as Total Usage Effectiveness (TUE), defined as the ratio of total data center power to the power consumed by the computing components themselves. TUE more accurately reflects liquid cooling's combined gains in both facility and IT-system efficiency, making it the better yardstick for assessing the return on investment in liquid cooling.
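
To illustrate why the two metrics diverge, here is a minimal numerical sketch. Every power figure is invented for illustration only; it shows how server fans inflate the IT denominator of an air-cooled hall (flattering its PUE), while TUE, which divides by compute power alone, exposes the full overhead and credits the liquid-cooled design for eliminating the fans.

```python
# Illustrative PUE vs. TUE comparison for a notional air-cooled and liquid-cooled
# hall. Every power figure below (in kW) is an assumption for illustration only.

def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_kw

def tue(total_facility_kw: float, compute_kw: float) -> float:
    """Total Usage Effectiveness: total facility power / power of computing components."""
    return total_facility_kw / compute_kw

# Air-cooled hall: server fans count as IT power even though they do no computing.
air_compute, air_fans, air_cooling, air_other = 1_000, 120, 500, 80
air_it = air_compute + air_fans
air_total = air_it + air_cooling + air_other

# Liquid-cooled hall: fans largely disappear; CDU pumps may land on the IT side.
liq_compute, liq_pumps, liq_cooling, liq_other = 1_000, 30, 150, 80
liq_it = liq_compute + liq_pumps
liq_total = liq_it + liq_cooling + liq_other

print(f"Air-cooled:    PUE {pue(air_total, air_it):.2f}  TUE {tue(air_total, air_compute):.2f}")
print(f"Liquid-cooled: PUE {pue(liq_total, liq_it):.2f}  TUE {tue(liq_total, liq_compute):.2f}")
```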


Technical Analysis of Advanced Cooling Solutions


This section provides a detailed technical analysis of the primary liquid cooling methods, comparing their principles, performance, and applicable scenarios, concluding with a comprehensive summary table for quick comparison.


Direct-to-Chip (D2C) Cooling: Precision Thermal Management


  • Core Principle: Liquid coolant circulates through "cold plates" mounted directly on the primary heat-generating components (CPUs, GPUs), absorbing heat directly at the source. This is a highly targeted method, but it often still requires a supplementary air cooling system to handle the remaining 25-30% of the heat in the rack.

  • Single-Phase D2C: The coolant (typically a water-glycol mixture) remains in a liquid state throughout the entire circulation loop. Due to its relative simplicity, lower cost, and mature technology, it is the current mainstream choice. However, it is less efficient than two-phase D2C, and leaks of water-based coolants can be catastrophic. Key components include cold plates, pumps, tubing, and a Coolant Distribution Unit (CDU).

  • Two-Phase D2C: This method uses a dielectric fluid with a low boiling point. The liquid boils as it absorbs heat from the chip, and the phase change absorbs a large amount of latent heat, making it significantly more efficient. The vapor then flows to a condenser, where it releases heat and condenses back into liquid. Although more efficient, two-phase systems are more complex and costly, and they face severe challenges from PFAS (per- and polyfluoroalkyl substances) regulations. A rough flow-rate comparison of the two approaches follows this list.
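
To make the single-phase versus two-phase distinction concrete, the sketch below estimates the coolant mass flow a single ~1,200 W cold plate would need under each approach. The fluid properties and the allowed temperature rise are generic, assumed values rather than data for any specific coolant or product.

```python
# Coolant mass flow needed by one ~1,200 W cold plate under single-phase
# (sensible heat) vs. two-phase (latent heat) operation. The fluid properties
# and temperature rise are generic textbook-style assumptions, not product data.

TDP_W = 1_200                 # Blackwell-class package (from the roadmap above)

# Single-phase: Q = m_dot * c_p * dT
CP_WATER_GLYCOL = 3_500       # J/(kg*K), assumed ~25% glycol mixture
DT_SINGLE = 10.0              # K, assumed allowable coolant temperature rise
m_dot_single = TDP_W / (CP_WATER_GLYCOL * DT_SINGLE)

# Two-phase: Q = m_dot * h_fg, where the latent heat of vaporisation does the work
H_FG_DIELECTRIC = 140_000     # J/kg, assumed low-boiling-point dielectric fluid
m_dot_two_phase = TDP_W / H_FG_DIELECTRIC

print(f"Single-phase water-glycol: {m_dot_single * 1_000:5.1f} g/s per package")
print(f"Two-phase dielectric:      {m_dot_two_phase * 1_000:5.1f} g/s per package")
```

Under these assumptions the two-phase loop moves the same heat with roughly a quarter of the mass flow, which is the source of its efficiency advantage; the trade-off is the system complexity and PFAS exposure described above.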


Immersion Cooling: The Ultimate Thermal Solution


  • Core Principle: The entire server or its components are fully submerged in a non-conductive dielectric liquid, achieving 100% heat capture from all components, not just the main processors. This method eliminates the need for server fans and complex air handling infrastructure.

  • Single-Phase Immersion (1-PIC): Servers are submerged in a hydrocarbon-based liquid (similar to mineral oil). The liquid is circulated by pumps to a heat exchanger for cooling before returning to the tank. This method is favored for its simplicity, reliability, and lower upfront cost compared to two-phase immersion. Key players like Green Revolution Cooling (GRC) are proponents of this technology.

  • Two-Phase Immersion (2-PIC): Servers are submerged in a fluorocarbon-based liquid with a very low boiling point (e.g., around 50°C). The heat generated by the servers directly boils the surrounding liquid. The resulting vapor rises, condenses on cooling coils at the top of the sealed tank, and then drips back into the tank, creating a passive, pump-free cooling cycle. This is the most energy-efficient method available, with a PUE as low as 1.01-1.02. However, it faces extremely high costs (fluid, sealed tanks), complex maintenance, and an existential threat from PFAS regulations due to its required coolants.


The PFAS regulatory crisis has effectively split the liquid cooling market in two. This crisis stems from the reliance of two-phase cooling technologies on fluorocarbon-based liquids (i.e., PFAS, such as 3M's Novec products). In December 2022, due to the designation of PFAS as harmful "forever chemicals" by the U.S. Environmental Protection Agency (EPA) and EU authorities, 3M announced it would completely exit all PFAS production by the end of 2025. This decision has created an "immediate risk of obsolescence" for technologies dependent on such liquids and has reportedly led hyperscalers like Microsoft and Meta to halt their two-phase immersion cooling research. Consequently, the industry's most direct path forward is to double down on single-phase technologies that are not subject to this regulatory risk, such as single-phase D2C using water-glycol mixtures and single-phase immersion cooling using hydrocarbon oils.


Meanwhile, Direct-to-Chip (D2C) cooling is emerging as the dominant transitional technology because it offers a "brownfield-friendly" upgrade path for existing data centers. Operators have massive investments in their current air-cooled facilities, and a complete overhaul is not economically feasible. D2C systems can be retrofitted into existing racks and rooms to cool the hottest components, while allowing existing air handling systems to manage the remaining ambient heat, enabling a gradual upgrade. In contrast, immersion cooling represents a more radical, "greenfield-only" future. It requires entirely new infrastructure, such as large, heavy tanks, specialized hardware, and potentially even cranes for maintenance, making it almost exclusively suitable for new-build AI factories designed from the ground up.


Table 1: Comparative Summary of AI Server Cooling Technologies


The following table provides a structured comparison of the key characteristics of different AI server cooling technologies.

| Feature | Air Cooling | D2C (Single-Phase) | D2C (Two-Phase) | Immersion (Single-Phase) | Immersion (Two-Phase) |
| --- | --- | --- | --- | --- | --- |
| Cooling Efficiency (% Heat Capture) | ~30% | ~70-75% | >90% | ~100% | ~100% |
| Max Supported Rack Density (kW) | <35 | >140 | >140 | >200 | >250 |
| Typical PUE | 1.6-1.9 | Improved TUE | Improved TUE | 1.02-1.03 | 1.01-1.02 |
| Space Density | Low | High | High | Very High | Very High |
| Relative Capital Expenditure (CapEx) | Low | Medium | High | High | Very High |
| Relative Operating Expenditure (OpEx) | High | Low | Low | Very Low | Variable (due to fluid loss) |
| Maintenance Complexity | Low | Medium | High | Medium-High | High |
| Brownfield Retrofit Feasibility | N/A | High | Medium | Very Low | Very Low |
| Key Risks | Performance limits | Leaks, piping complexity | PFAS ban, fluid cost | Fluid compatibility, hardware maintenance | PFAS ban, fluid cost/loss |


Mapping the AI Cooling Market Ecosystem


This section identifies the main categories of players in the market and analyzes their strategic roles, interdependencies, and competitive dynamics, expanding from the chip level to the entire data center.


The Epicenter: AI Accelerator Designers (NVIDIA)


NVIDIA's role extends far beyond chip design; it also validates and certifies entire systems, including cooling solutions. The "NVIDIA-Certified Systems" program ensures that partners' servers meet standards for performance, manageability, and security. Its "DGX-Ready Colocation" program specifically certifies data centers that can provide the necessary infrastructure, including liquid cooling solutions, to host its high-end systems. This creates a strong incentive for colocation providers like Digital Realty and Equinix to adopt and standardize on NVIDIA-approved cooling architectures.


NVIDIA actively co-develops reference designs with partners, such as the GB300 NVL72 deployed with CoreWeave, Dell, and Vertiv, which showcases a pre-validated, rack-level solution integrating compute, power, and cooling. This co-design process effectively sets industry standards.


The Integrators: Server OEMs and ODMs


These companies are responsible for manufacturing the physical servers and rack-level systems. They are the primary customers for cooling component suppliers and the main suppliers to end-users (hyperscalers, enterprises).


  • Supermicro: A key player, Supermicro pursues a deep vertical integration strategy. They design and manufacture their own CDUs, cold plates, and manifolds, offering a single-vendor, complete rack-level liquid cooling solution. Their ability to ship thousands of liquid-cooled racks per month demonstrates mature, high-volume manufacturing capabilities. As of late 2024, the company holds a dominant share of about 75% in the liquid-cooled AI server rack market.

  • Foxconn: As the world's largest electronics manufacturing service provider, Foxconn leverages its massive scale and vertical integration capabilities. Its strategic alliance with industrial electromechanical and power infrastructure expert TECO aims to extend its value chain from server racks to complete data center construction, offering a one-stop solution.

  • Wiwynn/Wistron, Quanta, Inventec: These are major Taiwanese ODMs that primarily serve the hyperscale market, manufacturing custom servers for clients like Meta and Google. Their business model is based on co-design and high-volume, cost-effective manufacturing.


The Drivers: Hyperscalers and Cloud Service Providers (CSPs)


Companies like Google, Meta, Microsoft, and AWS are the primary demand drivers for AI infrastructure. Their immense scale and advanced technical requirements push the limits of technology. They have evolved from mere customers to active co-design partners, working directly with ODMs and component suppliers to develop custom solutions that meet their specific needs.


Their internal research and deployments often lead new technology trends. Google's work on AI-driven cooling optimization and Microsoft's focus on waterless, sustainable cooling solutions both set industry trends. Meta's active contributions to the Open Compute Project (OCP) drive open standards for hardware, including cooling.


The Specialists: Professional Cooling Infrastructure Providers


These companies provide the critical power and thermal management infrastructure that surrounds the IT equipment.


  • Vertiv & Schneider Electric: These are global giants with comprehensive portfolios covering power distribution, thermal management (both air and liquid), and monitoring solutions. They work closely with chipmakers (NVIDIA, Intel) and server OEMs to provide integrated solutions. For example, Vertiv's strategy includes a broad CDU portfolio and a focus on end-to-end solutions under its "Vertiv 360AI" initiative.

  • Immersion Cooling Specialists: These are niche but important players focused on immersion technology. Key companies include Green Revolution Cooling (GRC) and Submer in the single-phase sector.

  • Other Key Players: Companies like CoolIT Systems (a leader in D2C), Stulz, and Daikin are also significant competitors in the broader data center cooling market.


The AI infrastructure market has evolved from a linear "vendor-customer" model to a deeply collaborative "co-design ecosystem." In the past, chipmakers sold standardized products to OEMs, who then sold them to enterprises. Today, the extreme technical demands of AI have broken this model. Hyperscalers now design their own server specifications and contribute them to OCP, working directly with ODMs for manufacturing. Chipmakers like NVIDIA design rack-level reference architectures and collaborate directly with infrastructure specialists (like Vertiv) and ODMs (like Supermicro) to create pre-validated systems. This forms a triangular relationship: NVIDIA defines the thermal problem, hyperscalers define the operational and scale requirements, and ODMs and infrastructure providers handle the engineering integration. Success no longer depends on the superiority of a single component but on the quality of collaboration and integration across the entire ecosystem.


Against this backdrop, a strategic divergence has emerged among server suppliers: "vertically integrated solution providers" (like Supermicro) versus "hyperscale-focused ODMs" (like Wiwynn, Quanta). Supermicro's strategy is to control the entire technology stack, offering its own branded, complete solutions to a broad market, including enterprises and smaller CSPs, to capture higher margins and maintain technological control. In contrast, ODMs like Wiwynn and Quanta focus on providing large-scale, cost-effective manufacturing services for a few hyperscale clients. This differentiation will shape M&A activity, R&D priorities, and market share competition in the coming years.


Deconstructing the Global Liquid Cooling Supply Chain


This section provides a detailed breakdown of the supply chain at the component level, identifying key parts, their manufacturers, and the critical role of specific geographic regions, particularly Taiwan.


Bill of Materials: Core Component Suppliers


  • Coolant Distribution Units (CDUs) & Heat Exchangers: As the "engine" of the liquid cooling system, the CDU manages the flow, temperature, and pressure of the coolant loop. Major suppliers include large infrastructure players like Vertiv, Trane, and Schneider Electric, as well as server OEMs like Supermicro. Specialized manufacturers such as Boyd Corporation and LiquidStack also hold significant positions. Heat exchangers, a core component of CDUs, are manufactured by specialists like Kaori and global industrial companies like Alfa Laval and Xylem.

  • Cold Plates & Coolant Distribution Manifolds (CDMs): Cold plates are the components that make direct contact with the chips, while manifolds distribute the liquid within the rack. Taiwanese companies dominate this sector, with Auras, Delta, Sunon, and CCI being key suppliers to NVIDIA and major ODMs.

  • Pumps: These are critical components for circulating coolant in single-phase systems, and their reliability is paramount. Major manufacturers include industrial specialists like Moog, Wilo, Danfoss, and Cat Pumps.

  • Quick Disconnects (UQDs): These leak-proof connectors allow for server maintenance without draining the entire cooling loop, making them essential for serviceability. Key players include Parker Hannifin, CPC (Colder Products Company), and Motivair. The OCP's Universal Quick Disconnect (UQD) standard is driving interoperability in the industry.


The Lifeline: Dielectric Fluids and Coolant Market


The choice of coolant is critical for performance, safety, and material compatibility. 3M's decision to exit PFAS manufacturing by the end of 2025 has created a massive shock to the supply of two-phase cooling liquids (Novec, Fluorinert) and has effectively slowed the adoption of 2-PIC technology.


This shift has created a significant market opportunity for other chemical companies. Chemours is actively promoting its Opteon™ 2P50, an HFO-based liquid with an extremely low Global Warming Potential (GWP), intended as a successor for two-phase applications. They are actively forming partnerships with companies like NTT Data and Navin Fluorine to establish a supply chain and validate the technology, aiming for commercial production in 2026. Single-phase immersion liquids are typically hydrocarbon-based oils, supplied by companies like GRC (ElectroSafe).


Focus: Taiwan's Vertically Integrated Supply Chain


Taiwan is the undisputed global hub for AI server manufacturing, accounting for approximately 90% of the global market. This dominance extends deep into every segment of the liquid cooling supply chain. Its highly concentrated and integrated ecosystem fosters rapid innovation and collaboration between ODMs and component suppliers.


The liquid cooling supply chain is currently in a "component scramble." The pace of GPU development is outstripping the manufacturing capacity and standardization of key cooling components (like CDUs, high-pressure UQDs, and complex cold plates), creating significant supply chain risks and opportunities. The primary bottleneck for market growth in 2023 was the insufficient production capacity of components like CDUs, not a lack of demand. This indicates that companies that can rapidly scale up production of these critical components, such as Supermicro (with a capacity of 5,000 racks per month) or Kaori (expanding manifold capacity), will gain substantial market share.


However, the high concentration of the AI server and cooling supply chain in Taiwan also presents significant geopolitical and operational risks. In response to this single-point-of-failure risk, supply chain diversification has become a trend. Server ODMs are actively establishing new production bases in Thailand, Vietnam, and Malaysia, with capacity in these regions expected to approach 50% by 2026. Foxconn is also utilizing its facilities in Vietnam and Wisconsin, USA, and is investing in an AI server production hub in Texas. This suggests that while Taiwan will continue to be a center for R&D and engineering, final assembly and some component manufacturing will become more geographically dispersed.


Table 2: Key Taiwanese Suppliers in the AI Liquid Cooling Value Chain


The following table clearly illustrates Taiwan's key role in the AI liquid cooling supply chain.

| Component Category | Key Taiwanese Suppliers |
| --- | --- |
| Server Assembly/ODM | Foxconn, Quanta, Wiwynn, Inventec, Gigabyte |
| CDU | Kaori, Jih Maw, Kenmec |
| Cold Plate | Auras, CCI, Jentech, Delta |
| Manifold (CDM) | Kaori, Auras, CCI, Delta |
| Quick Disconnect (UQD) | Jazwares |
| Fan/Blower | Auras, CCI, Delta, Fulltech, AVC, SHYUAN YA |
| Rack/Chassis | Chenbro, Aicipc, Sheng Ming |
| Heat Exchanger | Kaori |


Strategic Forces Shaping the Future Trajectory


This section analyzes the macro forces defining the industry's future direction—standardization, sustainability, and long-term R&D.


The Standardization Driver: Open Compute Project (OCP)


OCP's mission is to apply open-source principles to hardware, fostering a multi-vendor, interoperable ecosystem to prevent vendor lock-in. Its Cooling Environments Project has dedicated workstreams for cold plates, immersion, door heat exchangers, and heat reuse, demonstrating its comprehensive approach to standardization.


OCP has published detailed requirement documents and guidelines for liquid cooling components, covering material compatibility, operating parameters, and reliability expectations. This work is crucial for building confidence among data center operators and enabling a competitive supply chain. By defining interface standards like UQDs, OCP allows operators to source components from multiple suppliers (e.g., Parker, CPC) and ensure their interoperability.


OCP plays a key role in de-risking the liquid cooling transition for the broader market, acting as a standard-setting counterweight to NVIDIA's de facto market dominance. NVIDIA's designs (like the GB200 rack) become industry default standards due to its market leadership, which could lead to risks of a single-vendor ecosystem and proprietary interfaces. However, OCP, backed by hyperscalers like Meta, creates open standards for the interfaces between components (like UQDs and manifold connections). This allows operators to flexibly combine products from different vendors, provided they adhere to OCP specifications. Thus, OCP's work is not in opposition to NVIDIA but is complementary; it translates the high-level requirements set by NVIDIA's roadmap into open, multi-vendor standards, thereby accelerating market adoption by reducing risk and fostering competition.


The Sustainability Mandate: From Energy Efficiency to Heat Reuse


The primary advantage of liquid cooling is energy efficiency, which directly translates to a lower carbon footprint. However, the sustainability narrative is moving beyond simply reducing power consumption. Liquid cooling captures heat in a more concentrated and usable form (hot water) compared to air cooling (dispersed hot air), which creates significant opportunities for heat reuse.


Real-world case studies demonstrate the viability of this approach. Meta's data center in Odense, Denmark, captures waste heat and supplies it to the local district heating network, warming nearly 11,000 homes. Other potential applications include greenhouse heating, industrial processes, and aquaculture. However, the main barriers to widespread heat reuse are economic and logistical. It requires data centers to be located near heat consumers (cities, factories) and necessitates substantial upfront infrastructure investment (pipelines, heat pumps).


Heat reuse in data centers is transitioning from a niche corporate social responsibility (CSR) activity to a potentially viable economic model. This shift is driven by both the efficient heat capture capabilities of liquid cooling and increasing regulatory pressure. The high-usability hot water produced by liquid cooling makes it a transportable and sellable commodity.


Meanwhile, European regulations are beginning to mandate heat reuse for new data centers. Economic analyses show that under the right conditions (e.g., proximity to a district heating network), the payback period for heat reuse infrastructure can be less than two years. This transforms waste heat from a liability that must be disposed of into an asset that can be sold, creating a new revenue stream for data center operators and making data centers a component of community energy infrastructure.
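
A simple payback estimate shows how the economics can work when a heat consumer is nearby. Every input in the sketch below is an assumption chosen only to illustrate the mechanics (a 10 MW liquid-cooled IT load, an assumed heat price, assumed capital and operating costs); real projects vary widely with location, network temperature, and contract terms.

```python
# Simple payback estimate for selling captured data center heat into a district
# heating network. All inputs are illustrative assumptions, not project data.

IT_LOAD_KW = 10_000                 # liquid-cooled IT load (assumption)
HEAT_CAPTURE = 0.90                 # share of IT heat recoverable as usable hot water
UTILISATION_HOURS = 8_000           # hours/year the network actually takes the heat
HEAT_PRICE_EUR_PER_MWH = 35         # assumed price paid for delivered heat
CAPEX_EUR = 4_000_000               # pipes, heat pumps, grid connection (assumption)
EXTRA_OPEX_EUR_PER_YEAR = 300_000   # heat-pump electricity, maintenance (assumption)

heat_sold_mwh = IT_LOAD_KW * HEAT_CAPTURE * UTILISATION_HOURS / 1_000
net_revenue = heat_sold_mwh * HEAT_PRICE_EUR_PER_MWH - EXTRA_OPEX_EUR_PER_YEAR
payback_years = CAPEX_EUR / net_revenue

print(f"Heat sold:      {heat_sold_mwh:,.0f} MWh/year")
print(f"Net revenue:    {net_revenue:,.0f} EUR/year")
print(f"Simple payback: {payback_years:.1f} years")
```

With these assumed figures the simple payback lands under two years, consistent with the favorable-case analyses cited above; with a distant or low-paying heat consumer the same arithmetic can stretch the payback well beyond the life of the equipment.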


The Next Frontier: Chip-Level Thermal Management Innovation


As GPU TDPs continue to march towards 10,000W and beyond, even today's advanced liquid cooling technologies may become insufficient. The future of cooling lies in integrating it directly into the semiconductor package itself.


  • Embedded Cooling/Microfluidic Cooling: This technology fabricates microchannels directly on the silicon die or interposer, allowing coolant to flow within millimeters of the transistors. This eliminates thermal interface materials (TIMs) and significantly reduces thermal resistance (a simple junction-temperature sketch follows this list).

  • Fluidic Through-Silicon Vias (F-TSVs): This is a concept where the through-silicon vias (TSVs) normally used for electrical signals are repurposed to allow coolant to flow vertically through 3D chip stacks. This is considered key for cooling future 3D-stacked memory (HBM) and logic chips.

  • Key R&D Hubs: Research institutions like imec are at the forefront of developing these next-generation solutions, demonstrating concepts like 3D-printed impingement coolers and modeling the thermal challenges in advanced chip architectures.
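
A series thermal-resistance model shows why moving coolant inside the package matters. In the sketch below, junction temperature is estimated as T_junction = T_coolant + P * (R_die + R_TIM + R_cold_plate); the resistance values are order-of-magnitude assumptions, and the "embedded" case simply assumes the TIM term disappears and the remaining plate-to-coolant resistance halves.

```python
# Junction temperature from a simple series thermal-resistance stack:
#   T_junction = T_coolant + P * (R_die + R_TIM + R_cold_plate)
# Resistance values are order-of-magnitude assumptions for illustration only.

T_COOLANT_C = 35.0     # assumed facility coolant supply temperature
R_DIE = 0.010          # K/W, die and package spreading resistance (assumption)
R_TIM = 0.015          # K/W, thermal interface material (assumption)
R_COLD_PLATE = 0.020   # K/W, cold plate wall to coolant (assumption)

def junction_temp_c(power_w: float, resistances_k_per_w: list[float]) -> float:
    """Junction temperature for a given package power and series resistance stack."""
    return T_COOLANT_C + power_w * sum(resistances_k_per_w)

for power_w in (1_200, 3_600, 9_000):
    cold_plate = junction_temp_c(power_w, [R_DIE, R_TIM, R_COLD_PLATE])
    # "Embedded" case: no TIM, and microchannels assumed to halve the plate resistance.
    embedded = junction_temp_c(power_w, [R_DIE, R_COLD_PLATE * 0.5])
    print(f"{power_w:5d} W: cold plate + TIM -> {cold_plate:6.1f} C, embedded -> {embedded:6.1f} C")
```

Under these assumed resistances, even the embedded case overshoots typical silicon limits at 9,000 W, which is precisely why research targets ever-lower total thermal resistance through microfluidics and F-TSVs.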


Conclusion and Strategic Recommendations


This section synthesizes the report's findings, presents concise conclusions, and offers forward-looking, actionable recommendations for key stakeholders among the target audience.


Summary of Core Findings


The AI-driven thermal crisis is forcing a rapid and comprehensive industry-wide shift to liquid cooling. This is not a cyclical trend but a permanent architectural transformation. On the technology front, D2C is the dominant transitional solution, while single-phase immersion cooling offers promise for new-build AI factories. The development of two-phase cooling has been severely hampered by regulatory constraints. In terms of market dynamics, the landscape has evolved into a collaborative co-creation ecosystem, with NVIDIA setting the pace, hyperscalers defining the scale, and a complex, Taiwan-centric global supply chain racing to deliver.
