Immersion Cooling

Overview of Data Center Cooling Needs

Modern data centers generate significant amounts of heat due to the high density of computing hardware. Efficient cooling is essential to maintain the reliability and longevity of IT equipment while minimizing energy consumption and operating costs.

Over time, various cooling technologies have been developed, each with different approaches to thermal management. The most common methods include traditional air cooling, direct-to-chip liquid cooling, and immersion cooling.

Traditional Air Cooling

Air cooling has been the predominant cooling method in data centers for decades. It relies on chilled air circulation to remove heat from servers and other IT hardware. The key components of an air-cooled data center include:

Computer Room Air Conditioners (CRACs) or Computer Room Air Handlers (CRAHs): These systems generate chilled air, often using refrigeration cycles or chilled water from an external chiller.

Raised Floors and Hot/Cold Aisle Containment: Raised floors allow chilled air to be distributed beneath the servers, while hot/cold aisle containment strategies optimize airflow efficiency by isolating hot and cold air streams.

Fans and Heat Sinks: Servers include internal fans and heat sinks to transfer heat from processors and memory to the surrounding air.

Limitations of Air Cooling

Inefficiency at high densities: As server density increases, air cooling becomes less effective due to thermal limitations and airflow constraints.

High energy consumption: Large fans, air conditioning units, and chillers consume significant power, leading to high operating costs and poor Power Usage Effectiveness (PUE).

Limited cooling capacity: Air cooling struggles to manage extreme heat loads in high-performance computing (HPC) and AI workloads.

To overcome these challenges, liquid cooling technologies have emerged, offering higher thermal efficiency and better scalability.

Direct-to-Chip Liquid Cooling

Direct-to-chip cooling (also known as cold plate cooling) improves thermal management by circulating a liquid coolant directly over heat-generating components such as CPUs, GPUs, and memory modules.

How Direct-to-Chip Cooling Works

1. Cold plates are mounted directly onto processors or other high-power components.
2. A closed-loop system circulates a dielectric or water-based coolant through the cold plates, absorbing heat.
3. The heated coolant is then transferred to a heat exchanger, where it is cooled before being recirculated.
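
To make the heat transfer concrete, here is a minimal Python sketch of the loop's energy balance (Q = ṁ · c_p · ΔT). The flow rate, inlet temperature, and fluid properties are illustrative assumptions, not vendor specifications:

```python
# Energy balance for a direct-to-chip cold-plate loop:
#   Q = m_dot * c_p * (T_out - T_in)
# All numbers below are illustrative assumptions, not vendor specs.

def coolant_outlet_temp(q_watts: float, flow_lpm: float,
                        t_in_c: float = 30.0,
                        rho: float = 1000.0,        # kg/m^3, water-based coolant
                        c_p: float = 4180.0) -> float:  # J/(kg*K)
    """Outlet temperature of a cold plate absorbing q_watts."""
    m_dot = rho * (flow_lpm / 1000.0) / 60.0  # L/min -> kg/s
    return t_in_c + q_watts / (m_dot * c_p)

# A 400 W CPU cooled by 1.5 L/min of water-based coolant:
print(f"{coolant_outlet_temp(400, 1.5):.1f} C")  # ~33.8 C outlet
```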

Advantages

  • Higher efficiency than air cooling due to better thermal conductivity of liquids.
  • Targeted cooling reduces hotspots and improves performance.
  • Lower energy costs as liquid cooling reduces reliance on high-powered air conditioning.

Challenges

  • Requires specialized infrastructure, including pumps, tubing, and coolant distribution units (CDUs).
  • Risk of leaks if not properly designed and maintained.
  • Only cools specific components, leaving other parts of the system reliant on air cooling.

While direct-to-chip cooling significantly improves efficiency, it is still a hybrid approach that combines liquid and air cooling. For maximum efficiency and simplicity, immersion cooling has emerged as a fully liquid-based solution.

Immersion Cooling: A Detailed Breakdown

Immersion cooling is a cutting-edge technology where entire server components are submerged in a thermally conductive, dielectric liquid that absorbs heat directly from the hardware. This method eliminates the need for traditional air cooling, fans, and heat sinks, offering significant energy savings and improved thermal management.

Types of Immersion Cooling

There are two primary types of immersion cooling: single-phase and two-phase cooling.

a. Single-Phase Immersion Cooling

In single-phase immersion cooling, IT hardware is submerged in a dielectric liquid that remains in its liquid state throughout the cooling process.

1. The heat generated by the hardware is absorbed by the surrounding coolant.
2. A pump or natural convection circulates the heated liquid to a heat exchanger, where it is cooled by an external cooling loop (e.g., chilled water or ambient cooling).
3. The cooled liquid is recirculated back into the immersion tank.
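
As a rough sizing illustration, the sketch below estimates the circulation flow needed to hold a tank's temperature rise to a target. The fluid properties approximate a hydrocarbon dielectric coolant and, like the 8 K rise, are assumptions for the example:

```python
# Sizing the circulation flow for a single-phase immersion tank:
#   m_dot = Q / (c_p * dT)
# Fluid properties approximate a hydrocarbon dielectric coolant;
# treat them (and the 8 K rise) as illustrative assumptions.

RHO = 800.0    # kg/m^3
C_P = 2100.0   # J/(kg*K)

def required_flow_lpm(q_watts: float, delta_t_k: float = 8.0) -> float:
    """Volumetric flow (L/min) to hold the tank's temperature rise to delta_t_k."""
    m_dot = q_watts / (C_P * delta_t_k)  # kg/s
    return m_dot / RHO * 1000.0 * 60.0   # kg/s -> L/min

# A 50 kW tank (e.g., roughly ten high-density servers):
print(f"{required_flow_lpm(50_000):.0f} L/min")  # ~223 L/min
```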

Advantages

  • Simplicity and reliability: No phase change reduces maintenance complexity.
  • Compatibility with existing infrastructure: Easier to implement in conventional data centers.
  • Low maintenance costs: No need for refrigerants or high-pressure cooling loops.

b. Two-Phase Immersion Cooling

In two-phase immersion cooling, a specialized dielectric fluid is used, which boils at relatively low temperatures (~50-60°C). The process works as follows:

1. Heat from the IT hardware causes the fluid to vaporize.
2. The vapor rises and condenses on a heat exchanger positioned above the liquid surface.
3. The condensed liquid then drips back into the tank, completing the cycle.
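
Because heat leaves as latent heat of vaporization, the required vapor mass flow follows directly from ṁ = Q / h_fg. The sketch below uses a latent heat typical of engineered dielectric fluids (~90 kJ/kg), which should be treated as an illustrative assumption:

```python
# In two-phase cooling, heat is carried away as latent heat of vaporization:
#   m_dot_vapor = Q / h_fg
# The h_fg value is typical of engineered dielectric fluids and is an
# illustrative assumption.

H_FG = 90_000.0  # J/kg, latent heat of vaporization

def boil_off_rate_kg_per_h(q_watts: float) -> float:
    """Mass of fluid vaporized per hour to carry away q_watts."""
    return q_watts / H_FG * 3600.0

# A 700 W GPU vaporizes roughly:
print(f"{boil_off_rate_kg_per_h(700):.0f} kg/h")  # ~28 kg/h, recondensed in the loop
```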

Advantages

  • Higher thermal efficiency: The phase change process removes heat more effectively.
  • Greater heat dissipation capacity: Ideal for ultra-high-density computing environments.
  • Reduced pumping requirements: Natural convection from the boiling process minimizes energy use.

Challenges

  • Higher costs due to specialized dielectric fluids and complex system design.
  • Fluid loss over time as small amounts of vapor escape.
  • More challenging maintenance due to the boiling process and vapor management.

Key Benefits of Immersion Cooling

Regardless of the specific approach, immersion cooling offers several distinct advantages over traditional cooling methods:

01. Drastic Energy Savings

  • Eliminates fans and reduces reliance on air conditioning.
  • Lower PUE (often approaching 1.03–1.10 vs. 1.4–2.0 for traditional air cooling).
  • Can leverage free cooling in many climates without expensive chillers.
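
PUE is simply total facility power divided by IT equipment power, so the gap between 1.05 and 1.6 translates directly into energy. A minimal sketch, assuming a 1 MW IT load:

```python
# Power Usage Effectiveness: PUE = total facility energy / IT equipment energy.
# The PUE values mirror the ranges quoted above; the 1 MW IT load is an
# illustrative assumption.

HOURS_PER_YEAR = 8760

def annual_overhead_mwh(it_kw: float, pue: float) -> float:
    """Non-IT energy (cooling, power distribution) per year, in MWh."""
    overhead_kw = it_kw * (pue - 1.0)
    return overhead_kw * HOURS_PER_YEAR / 1000.0

# 1 MW of IT load:
print(annual_overhead_mwh(1000, 1.60))  # air-cooled:       ~5256 MWh/yr overhead
print(annual_overhead_mwh(1000, 1.05))  # immersion-cooled:  ~438 MWh/yr overhead
```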

02. Superior Thermal Performance

  • More uniform heat dissipation across IT hardware.
  • Can handle higher power densities without risk of overheating.
  • Reduces hotspots and thermal cycling, increasing component lifespan.

03. Increased Hardware Reliability

  • No moving parts inside the server (fans removed).
  • Protects electronics from dust, humidity, and oxidation.
  • Reduces mechanical stress on components.

04. Space Efficiency

  • Compact immersion tanks allow for higher density deployments.
  • Eliminates bulky air-cooling infrastructure (raised floors, CRACs, ductwork).

05. Eco-Friendliness

  • Reduces water consumption compared to traditional cooling towers.
  • Can use waste heat recovery for secondary applications (e.g., district heating).

Challenges & Adoption Considerations

Despite its advantages, immersion cooling is still an emerging technology with some adoption barriers:

Upfront costs are higher than conventional cooling due to the need for specialized tanks, fluids, and infrastructure.

Fluid compatibility concerns exist, as not all IT components are designed for immersion.

Vendor lock-in risk, as immersion fluids and solutions are not yet standardized across all hardware manufacturers.

However, as high-performance computing, AI workloads, and edge computing demand more efficient cooling solutions, immersion cooling is gaining traction in data centers worldwide.

Conclusion

Immersion cooling represents a transformative shift in data center thermal management, offering unmatched energy efficiency, thermal performance, and space optimization. While air cooling and direct-to-chip liquid cooling remain viable solutions for many environments, immersion cooling provides a future-proof approach for handling extreme computing loads. As data center energy costs and sustainability concerns continue to rise, immersion cooling is poised to become a mainstream solution in next-generation infrastructure.

The Evolution of TDP in CPUs and GPUs: The Rising Power Demands of AI & HPC

The Power Consumption Challenge

Over the past decade, the power consumption of CPUs and GPUs has increased dramatically, driven by the growing demands of high-performance computing (HPC), artificial intelligence (AI), and data-intensive workloads. Traditional computing power consumption followed a steady growth curve, but the advent of AI accelerators, deep learning workloads, and high-density computing has caused a skyrocketing increase in total power per chip and per server.

This surge in power consumption has created significant challenges for thermal management, energy efficiency, and data center sustainability. In this section, we explore the historical evolution of CPU and GPU thermal design power (TDP), the impact of AI workloads on power density, and how these trends are pushing traditional cooling solutions to their limits.

Evolution of CPU & GPU TDP Over the Last Decade

CPU TDP Growth: From 100W to 500W

Thermal Design Power (TDP) is a measure of the maximum heat a processor is expected to generate under normal workloads.

Over the past decade, CPUs have seen a significant increase in TDP, driven by rising core counts, higher clock speeds, and increased power draw for AI inference and high-performance computing.

| Year | Intel Xeon (Server CPUs) TDP | AMD EPYC (Server CPUs) TDP | Consumer CPUs (Intel Core & AMD Ryzen) TDP |
|------|------------------------------|----------------------------|--------------------------------------------|
| 2014 | ~140W (Intel Xeon E5-2600 v3) | ~140W (AMD Opteron) | ~88-95W (Intel Haswell, AMD FX) |
| 2017 | ~165W (Intel Xeon Skylake) | ~180W (AMD EPYC Naples) | ~105W (Ryzen 7 1800X, Core i7-8700K) |
| 2020 | ~205-280W (Intel Ice Lake) | ~280W (AMD EPYC Milan) | ~125W (Ryzen 9 5950X, Core i9-10900K) |
| 2023 | ~350-400W (Intel Sapphire Rapids) | ~400-500W (AMD EPYC Genoa) | ~250W+ (Ryzen 9 7950X, Core i9-13900K) |

Server CPUs have transitioned from ~140W to over 400W per socket.

Consumer CPUs, once limited to 95-125W, now exceed 250W under load.

The introduction of AI acceleration in CPUs (e.g., Intel AMX, AMD XDNA) has further increased power draw.
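
To put the table in perspective, a quick calculation of the compound annual growth rate implied by its endpoints (the values are approximate midpoints of the server-CPU columns):

```python
# CAGR of server-CPU TDP implied by the table above (approximate
# midpoints of the Intel/AMD server columns).

tdp_watts = {2014: 140, 2017: 170, 2020: 240, 2023: 400}

first, last = min(tdp_watts), max(tdp_watts)
cagr = (tdp_watts[last] / tdp_watts[first]) ** (1 / (last - first)) - 1
print(f"Server CPU TDP grew ~{cagr:.1%} per year, {first}-{last}")  # ~12.4%/yr
```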

GPU TDP Growth: The AI Power Explosion

GPUs have seen an even more dramatic increase in power consumption, particularly with the rise of AI accelerators and deep learning workloads.

While GPUs were originally designed for graphics processing, modern AI workloads require thousands of matrix multiplications per second, leading to higher power densities and thermal loads.

| Year | NVIDIA Datacenter GPUs (Tesla/A100/H100) TDP | AMD Instinct Datacenter GPUs TDP | Consumer GPUs (RTX/Radeon) TDP |
|------|----------------------------------------------|----------------------------------|--------------------------------|
| 2014 | ~235W (NVIDIA Tesla K80) | ~275W (AMD FirePro S9150) | ~250W (GTX 980, Radeon R9 290X) |
| 2017 | ~300W (NVIDIA Tesla V100) | ~300W (AMD Radeon Instinct MI25) | ~300W (RTX 2080 Ti, RX 5700 XT) |
| 2020 | ~400W (NVIDIA A100) | ~300W (AMD Instinct MI100) | ~350W (RTX 3090, RX 6900 XT) |
| 2023 | ~700W (NVIDIA H100 SXM) | ~500W (AMD Instinct MI250X) | ~450W (RTX 4090, RX 7900 XTX) |
| 2024+ | ~1000W+ (NVIDIA B100?) | ~600W+ (AMD MI300X) | ~600W (RTX 5090 rumored) |

Datacenter GPUs have increased from ~235W in 2014 to nearly 1000W in 2024.

The NVIDIA H100, widely used for AI training, can consume up to 700W per unit.

AMD's MI300X AI accelerator is expected to exceed 600W.

Consumer GPUs now approach 500-600W, up from 250W a decade ago.

AI Workloads: The Catalyst for Skyrocketing Power Demands

AI models like GPT-4, Llama, and Stable Diffusion require massive amounts of computation, pushing GPU clusters to the limit.

Training a single AI model can draw megawatts of power, a demand comparable to the electricity consumption of a small town.

AI inference workloads, deployed at scale across cloud providers like AWS, Google, and Azure, further add to global data center energy consumption.
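
For a sense of scale, here is a back-of-envelope estimate of one training run's facility energy; every input (cluster size, GPU power, duration, overhead factor, PUE) is an illustrative assumption:

```python
# Back-of-envelope energy for a large training run. Every number here
# (cluster size, GPU power, duration, overhead, PUE) is an illustrative
# assumption, not a measured figure.

def training_energy_mwh(n_gpus: int, gpu_watts: float,
                        days: float, pue: float = 1.2,
                        overhead: float = 1.5) -> float:
    """Facility energy in MWh; `overhead` covers CPUs, networking, storage."""
    it_kw = n_gpus * gpu_watts * overhead / 1000.0
    return it_kw * pue * 24.0 * days / 1000.0

# 10,000 H100-class GPUs at 700 W each for 30 days:
print(f"{training_energy_mwh(10_000, 700, 30):,.0f} MWh")  # ~9,072 MWh
```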

The Explosion of Server Power Consumption

The rise of AI, HPC, and hyperscale cloud computing has led to unprecedented increases in power consumption per server, per rack, and per data center.

Server Power Consumption Trends

| Year | Typical Server Power (CPU & GPU Combined) | AI / High-Performance Node Power |
|------|-------------------------------------------|----------------------------------|
| 2014 | ~500W-1kW per server | ~1.5-2kW for high-performance nodes |
| 2017 | ~1-2kW per server | ~3-4kW per AI-focused node |
| 2020 | ~2-3kW per server | ~6-8kW for AI training servers |
| 2023 | ~5-7kW per server | ~10-15kW per AI node |
| 2024+ | ~8-12kW per server | ~20-30kW+ per AI node |

AI Clusters and Mega-Racks

AI supercomputers, like Meta's AI Research SuperCluster (RSC) and Microsoft's OpenAI clusters, consume tens of megawatts.

High-density AI racks now exceed 100kW per rack, compared to 10-20kW per rack in traditional data centers.

Hyperscale AI data centers require massive cooling solutions, far beyond what traditional air cooling can provide.
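
The rack-level arithmetic is straightforward: rack power is nodes per rack times power per node. A minimal sketch with assumed node counts:

```python
# Rack power = nodes per rack x power per node. Node counts below are
# illustrative assumptions, not measured configurations.

def rack_kw(nodes_per_rack: int, kw_per_node: float) -> float:
    return nodes_per_rack * kw_per_node

print(rack_kw(20, 0.75))  # 15.0 kW  -> traditional rack of ~750W servers
print(rack_kw(8, 15.0))   # 120.0 kW -> AI rack of eight ~15kW GPU nodes
```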

The Limits of Traditional Cooling

With server power exceeding 10kW per node and AI racks surpassing 100kW, traditional air cooling and direct-to-chip solutions are reaching their practical limits. The key issues include:

Air cooling cannot dissipate 500W+ per chip without excessive fan speeds, noise, and wasted energy (see the airflow sketch after this list).

Direct-to-chip water cooling is effective, but still requires cold plates, complex tubing, and supplemental air cooling for the remaining components.

Higher densities (20-30kW per server) mean traditional cooling becomes infeasible due to space, energy, and thermal constraints.
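
The airflow arithmetic shows why. Removing heat with air requires a volumetric flow of V̇ = Q / (ρ · c_p · ΔT); the sketch below assumes a 15 K air-side temperature rise, which is an illustrative figure:

```python
# Airflow needed to remove heat with air (rho ~1.2 kg/m^3, c_p ~1005 J/(kg*K)):
#   V_dot = Q / (rho * c_p * dT)
# The 15 K air-side temperature rise is an illustrative assumption.

RHO_AIR, CP_AIR = 1.2, 1005.0

def airflow_cfm(q_watts: float, delta_t_k: float = 15.0) -> float:
    """Volumetric airflow in CFM required to absorb q_watts."""
    v_dot = q_watts / (RHO_AIR * CP_AIR * delta_t_k)  # m^3/s
    return v_dot * 2118.88                            # m^3/s -> CFM

print(f"{airflow_cfm(500):.0f} CFM per 500 W chip")        # ~59 CFM
print(f"{airflow_cfm(100_000):,.0f} CFM per 100 kW rack")  # ~11,700 CFM
```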

This is why immersion cooling is emerging as the dominant solution for future AI and HPC workloads. By fully submerging hardware in dielectric liquid, immersion cooling enables ultra-high-density AI compute with lower energy costs and minimal cooling infrastructure.

The Future: Why Immersion Cooling is Inevitable

With CPU and GPU power skyrocketing, immersion cooling is no longer a niche solution; it is becoming a necessity for high-performance AI workloads. The key drivers include:

  • Ultra-high thermal efficiency: Can cool 100kW+ racks with minimal energy.
  • No moving parts inside the server: Eliminates fan failures, vibration, and dust issues.
  • Supports AI and HPC at scale: Designed for 500W-1000W GPUs and future megawatt-scale AI clusters.
  • Lower total cost of ownership (TCO): Reduces cooling infrastructure and energy costs.

As AI power demands continue to rise, immersion cooling is the only sustainable path forward for next-generation data centers.