Transcript for:
Future Trends in Liquid Cooling Technology

Hello, thanks for coming. I'm Jason Zeiler. I'm the liquid cooling product manager here at HPE. Today we're going to talk about the future of liquid cooling. I hope that you like this session. Afterwards, I'll be just off to the side of the stage, so feel free to come and grab me with any questions.

So we'll get kicked off here. One of the things I always like talking about, just to really set the stage for what's happening in the industry broadly, in both enterprise and HPC, is that chips are changing. For the last 20 years, even really the last five to ten years, we've seen a pretty flat power curve for chip densities, for TDP, the thermal design power. Think about what a high-end Intel CPU, NVIDIA GPU, or AMD CPU looked like over the last three to four years. For CPUs, we were talking about roughly 200 watts, and that was considered pretty hot. Today it's around 300 watts, and GPUs are around 500 watts. But there are some very interesting things happening in the market right now. This power war is increasing density very quickly. Our friends making these amazing products are moving into new design methodologies, really 3D silicon stacking, so we're starting to see more packed into a small footprint, which means the power is going up.

But there's something else happening with cooling. GPU power is going up, CPU power is going up, and that's not really a big surprise; if you've listened to presentations from different groups, we're going to see 1,000-watt and beyond GPUs very quickly. The thing that's not often talked about is the T-case, the silicon case temperature, which is really the maximum temperature these chips are allowed to reach. A chip from last generation could get very hot: it could run at 90, 95, nearly 100 Celsius and keep running. Today I've seen some T-cases as low as 60 Celsius. That's a corner case, but the number definitely isn't going up. Chips are becoming less tolerant of very high heat because there's so much happening inside of that package, and that's a big part of why there's such a good case for liquid cooling, which is what we're going to talk about today.

Now, whenever I talk about the value prop, why do you care about liquid cooling, why is this interesting, the top of this triangle, performance, has always been the easy part of the story for me to tell: if you want a 350-watt CPU running flat out all the time, liquid cooling is the way to go. It cools very consistently, and there's a lot of performance goodness there. But efficiency is becoming a really important part of the story. I'm getting more and more questions from folks in the finance department or the CFO's office asking, how can we use less power? How can we pay less for OpEx over time? Efficiency is a really good part of that story too. Liquid cooling, for the most part, is going to use substantially less power, not just at the rack but in the data center overall, to move that heat. Think about all the fans inside of the servers: what if we can remove them, or run them at idle? That has a really good efficiency element. The other part of this is density. When we talk with customers about rack densities today, 17 kilowatts is often the high end for them, which means many of these racks are nearly empty. But what if we can fully populate racks? We'll need more power and we'll need cooling, but we're going to need a far smaller data center overall. So density is the third part of the story we tell with liquid cooling.
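One way to see why rising TDP and falling T-case squeeze air cooling is a simple back-of-the-envelope thermal budget. The sketch below is not from the talk; the inlet temperatures and example parts are assumptions for illustration. It computes the case-to-coolant thermal resistance a cooling solution would need to achieve in each scenario.

```python
# Rough thermal-headroom sketch (illustrative numbers only, not HPE figures).
# Required thermal resistance from chip case to coolant or air:
#     R_required = (T_case_max - T_inlet) / TDP   [C per watt]
# The smaller R_required gets, the harder it is for air cooling to keep up.

def required_thermal_resistance(t_case_max_c, t_inlet_c, tdp_w):
    """Thermal resistance (C/W) the cooling path must achieve."""
    return (t_case_max_c - t_inlet_c) / tdp_w

# Assumed scenarios: last-generation parts versus a future high-TDP part.
scenarios = [
    ("300 W CPU, T-case 95 C, 35 C air",    95, 35, 300),
    ("500 W GPU, T-case 85 C, 35 C air",    85, 35, 500),
    ("1000 W GPU, T-case 65 C, 35 C air",   65, 35, 1000),
    ("1000 W GPU, T-case 65 C, 32 C water", 65, 32, 1000),
]

for label, t_case, t_inlet, tdp in scenarios:
    r = required_thermal_resistance(t_case, t_inlet, tdp)
    print(f"{label}: {r:.3f} C/W required")
```

Roughly speaking, once the required resistance drops toward a few hundredths of a degree per watt, an air heatsink in a server form factor runs out of headroom, while a cold plate fed with facility water can still meet the budget.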
Now, it's always interesting: we talk about all these big systems that we build at HPE. We're building all this amazing exascale-class technology, really big systems, and what's interesting is that they're liquid cooled. We build them rack-and-roll and ship them directly to customers. For enterprise, that hasn't always been very important. For AI, it's going to be very important. We're going to build very high-density racks, we're going to build them liquid cooled, and we want them to be ready to go when they show up so the customer has a really positive experience. That's where we're going to leverage all of this experience building these really big exascale-class systems and supercomputers, and apply it to AI.

Now, I always like showing some basic information on what should be table stakes with liquid cooling. We did a quick comparison on our XD2000 server, a higher-density HPC product. We took an air-cooled server and a liquid-cooled server, and when we benchmarked them against each other, there was nearly a 15% decrease in chassis power. That's not that surprising: we put cold plates on the CPUs and the GPUs, so the fans run at idle and the servers don't have to work very hard to cool themselves. This does not include what happens at the data center level; if we don't have to use perimeter cooling, the savings are much more substantial, but I always like showing what's happening at the chassis level. We also noticed a mild increase in performance, because the chips are being cooled so consistently. We didn't have any hot spots in the rack, we saw very consistent cooling at the chip, so there was a bit of a bump in performance. When we combine these numbers, we come away with about 20% more performance per kilowatt: for each kilowatt we put into the rack, because we're using so much less of it for the cooling infrastructure, we get to use more of it for performance. It just tells a nicer story at the end of the day.

When we take a higher-level view of this, it's always interesting to talk about what the real benefits of liquid cooling are at a more macro level. On the left here, looking at just cooling costs, air-based cooling versus DLC, we did this example with a 10,000-server cluster and assumed some variables for the electricity cost. This is primarily a North American or US power rate; if you're thinking about a European example, it would be drastically different, since power there can cost nearly four times as much. The electricity cost just for the cooling of an air-based data center in this configuration is just over two million dollars per year. When we looked at the liquid cooling cost, it was a fraction of that, about three hundred thousand. So that's the magnitude of ROI we would expect, the answer for the CFO's office on why we'd want to move towards liquid cooling. There's a lot of goodness baked into the energy cost.
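As a quick sketch of how the chassis-level comparison above combines into roughly 20% more performance per kilowatt; the exact percentages below are rounded assumptions for illustration, not measured data:

```python
# Illustrative performance-per-kilowatt arithmetic (rounded, assumed numbers).
air_power_kw    = 1.00   # normalized chassis power, air-cooled
dlc_power_kw    = 0.85   # ~15% lower chassis power with cold plates (fans at idle)
air_performance = 1.00   # normalized benchmark score, air-cooled
dlc_performance = 1.03   # assumed ~3% bump from more consistent chip temperatures

perf_per_kw_air = air_performance / air_power_kw
perf_per_kw_dlc = dlc_performance / dlc_power_kw

gain = perf_per_kw_dlc / perf_per_kw_air - 1
print(f"Performance per kW, air: {perf_per_kw_air:.2f}")
print(f"Performance per kW, DLC: {perf_per_kw_dlc:.2f}")
print(f"Improvement: {gain:.0%}")   # about 21% with these assumed inputs
```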
But what we're also talking a lot about in the industry today is carbon emissions, carbon footprint, energy neutrality. From a CO2 emissions standpoint, the difference is also substantial: we could expect about 8,700 tons of CO2 released to generate all that power in the air-cooled case, versus about 1,200 tons. We're just taking the power consumption and relating it to CO2 emissions. It's always fun talking about households, too. A lot of these big data centers today will use about the same power as 2,000 homes. That by itself is pretty large, but if we can reduce it to the equivalent of 280 homes, that's something we're a lot happier with for energy consumption. And the last one here, probably one of my favorites, is density. We talked about the triangle and why density is interesting: for this same system, 10,000 servers, using this power rate and some assumptions about how much we can actually pack into a rack, we would need almost five times as many server racks in an air-cooled configuration as we would for DLC. So that's a quick example of what this means for smaller data centers overall and greater efficiency.

Now, why HPE? I always like showing this visual because it shows our involvement in what's been happening in the industry over a long period of time. Liquid cooling may seem like a new idea, but really HPE, Cray, and a lot of our partners have been doing this for a long time. You can see way back in the 60s, IBM was doing some interesting work with immersion. And where I started putting the bubbles is where Cray Research entered the space: Cray was doing refrigeration-based cooling, Cray was doing immersion, and we were doing a lot of cold plate designs. Today we're a combination of all three of those companies, Cray, SGI, and HPE. We have been building innovative liquid-cooled systems for decades, and that is really where we are today. Everything that we do in HPC, or high-performance computing, we're trickling that technology down to all of our enterprise applications: the same cold plate designs, the same kind of CDUs, the coolant distribution units, the same kind of manifolds. We test it and prove it in HPC and then we roll it out to the broader market.

So, tangibly, what is the stuff that's liquid cooled today? I'll show you the servers and then the bigger hardware. In the ProLiant group, the ProLiant DL360, DL365, DL380, and DL385 are available with liquid cooling; you can check them out at the booth today, where we have all the cold plate systems. In HPC, there are the XD2000 and the XD6500: the 2000 is a CPU-based platform and the GPUs are in the 6500. Those are some of the most popular AI systems we're shipping today; we can do four-way systems, eight-way systems, all liquid cooled, and those are going to be among our front runners. And then there's everything we've done in Cray; as table stakes, those systems are only liquid cooled and have no fans. I'll show you where a lot of the industry is going: if we can, we want to reduce the reliance on air cooling in the server as much as possible and move to cold plate cooling, because of all the energy efficiency reasons I mentioned.

The other thing that's really exciting is that I have almost everything here on the HPC show floor. If you go towards the AI section, this stuff is there; you can touch and feel it, see how big it is. We'll start on the right and move towards the left. When we think about liquid cooling today, this is really the perfect example:
a lot of cold plates, coolant distribution units, no fans. This is the absolute best density and energy efficiency story. When we move into the middle, this is HPE Cray XD and ProLiant today. We have a lot of flexibility because we're using direct liquid cooling on all the hottest components, the GPUs and CPUs, but we're using air cooling for DIMMs, capacitors, VRs, all the other little stuff. It's a bit cheaper, and it allows a lot more flexibility in SKUs and what you'd like to put in the racks. And then on the left is liquid-to-air cooling. This is where there are a lot of different opinions in the industry, so I'll give you mine: anything that requires facility water, where the building's water gets pumped to the racks, I consider liquid cooling. So rear door heat exchangers and ARCS, to me, are also liquid cooling. We're not using cold plates, but we are taking facility water, plumbing it directly to this hardware, and creating cold air very close to the IT. If you're able to get over that barrier of adding liquid into the data center, you have a lot of options; all of this is in play.

So let's talk about the spread. Today, especially in enterprise, almost all of the racks are below 20 kilowatts; 7 kilowatts, 17 kilowatts, that's where a lot of that IT plays. But where we're going with higher chip densities, 350- to 500-watt CPUs, we can quickly move into that 40-kilowatt range, and AI is easily going to be between 60 and 80 kilowatts, depending on whether we're fully populating racks. So what is popular with HPC today is going to become very common in AI as well. Customers absolutely can buy one box or two boxes, but for that maximum efficiency of putting a lot of equipment in a rack and building a smaller data center, we're going to create higher-density racks, which means we'll need to think about cooling. You can see the spread of how these different technologies play at different power levels, but I think the sweet spot where a lot of AI systems are going to land is between 60 and 80 kilowatts, which means you have a lot of choice in what kind of cooling technology you want to use.

So let's talk about each one, especially if you've never seen them before; again, you can go and touch them. Liquid-to-air cooling: rear door heat exchangers. The main takeaway, because I'm going to go through this quickly, is that rear door heat exchangers are all about neutralizing the hot air coming out of the rack, to decrease the thermal footprint of these solutions in the data center. Imagine you're using 10-kilowatt racks today, and the data center's air handling is really designed for that. You want to add some AI racks, and they are now 40 kilowatts. Thirty kilowatts of additional heat can often overwhelm what the data center's air handling was designed for. Rear door heat exchangers take facility water in to cool a large coil, and the servers push their hot exhaust air through it, which cools it back down to a room-neutral temperature. As far as the data center is concerned, this has zero thermal footprint. You still need cold aisle management and hot aisle containment, because air is still moving around, but this allows the rack to exist in an older data center without overburdening its air infrastructure.
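For a rough feel of what neutralizing that exhaust air takes on the water side, here is a minimal heat-balance sketch; the rack power, allowed water temperature rise, and resulting flow are illustrative assumptions, not product specifications.

```python
# Minimal rear-door heat-exchanger heat balance (assumed values, not product specs).
# The coil must move the rack's heat into facility water: Q = m_dot * cp * dT.

CP_WATER = 4186.0    # J/(kg*K), specific heat of water
RHO_WATER = 1000.0   # kg/m^3, density of water

def water_flow_lpm(rack_heat_kw, water_delta_t_c):
    """Facility-water flow (liters/minute) needed to carry rack_heat_kw at a given rise."""
    mass_flow_kg_s = (rack_heat_kw * 1000.0) / (CP_WATER * water_delta_t_c)
    return mass_flow_kg_s / RHO_WATER * 1000.0 * 60.0

# Example: a 40 kW rack with facility water allowed to warm by 10 C across the door.
print(f"{water_flow_lpm(40, 10):.1f} L/min")   # roughly 57 L/min with these assumptions
```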
I always like to include a quick CFD model; it just helps drive home how this technology really works. Going one level up, we start talking about HPE ARCS, the Adaptive Rack Cooling System. Similar to rear door heat exchangers, we take facility water and cool down a coil, but the major difference is that ARCS generates cold air on the inlet side and then sucks in all the hot air. It's fairly similar to the in-row coolers you see in data centers today, where for every few racks there might be something making cold air, but those pump it out to the data center broadly. ARCS actually ties one of the ARCS coolers, sitting in the middle, to four racks, contained together in a pod structure. It's very quiet to walk near one of these because they're acoustically sealed and they trap all of the heat. We've had customers deploy these in warehouses, very non-typical data center environments, because it needs no external air handling; and similar to before, for data centers that are struggling with air heat, this also has zero thermal footprint from an air-handling perspective, but we can go a lot higher in rack densities. In the quick visual you can see the ARCS unit creating cold air and providing it to the racks on both sides, left and right. The racks suck it in like in a normal data center, but ARCS also sucks in all the exhaust heat and pushes all of that temperature rise into water. So this is essentially a 100% liquid-to-air solution.

Now, when we start talking about DLC, this is the really cool stuff with cold plates; this has the really strong energy efficiency story, and we have these on the floor where you can touch and feel them. For the most part, as a general statement, we capture about 70% of the server heat to liquid, and 30% needs to be handled by some kind of air handling, still your hot aisle containment, or you can marry this with a rear door heat exchanger or ARCS. But we take about 70% of the heat into direct liquid cooling. Now, when we look inside this technology, here's my one thing to give you today, for when you're at the bar talking about direct liquid cooling. The technology most often being used inside these cold plates across the industry is skived fin cold plates. Think about an air-cooled server today: you can look at any server and count the number of fins; visually it's easy. Skived fin cold plates also use fins, but they are much, much smaller. Often we will have 100 fins per inch. When you're looking at these cold plates, it's actually difficult to see that there are gaps between the fins, but this density lets us create a massive surface area, and that's why liquid cooling is so effective: we're taking a very small piece of hardware and effectively making it very large, because we're allowing a lot of liquid to touch the fins, and that's where we get all the thermal transfer. So tonight you'll sound really smart talking about skived fin cold plates and how they're made; it's worth looking it up on YouTube. They really do take a copper block and a sharp blade and peel individual fins back, over and over and over again. It's not that complicated, but if you didn't know it, nobody else will know it either. That's my one gift to you for tonight.
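To get a feel for what 100 fins per inch buys you, here is a rough wetted-surface-area estimate; the plate footprint and fin height are made-up illustrative dimensions, not the dimensions of any real part.

```python
# Rough surface-area gain from skived fins (illustrative dimensions only).
plate_width_in  = 3.0    # assumed cold-plate footprint over the chip
plate_length_in = 3.0
fins_per_inch   = 100    # figure mentioned in the talk
fin_height_in   = 0.25   # assumed fin height

flat_area = plate_width_in * plate_length_in        # bare plate, no fins
fin_count = int(plate_width_in * fins_per_inch)

# Each fin adds two wetted faces of (length x height); fin tips and channel
# floors roughly stand in for the original flat face, so count the added faces.
finned_area = flat_area + fin_count * 2 * plate_length_in * fin_height_in

print(f"Flat plate:  {flat_area:.0f} sq in")
print(f"With fins:   {finned_area:.0f} sq in")
print(f"Multiplier: ~{finned_area / flat_area:.0f}x wetted area")
```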
Now, when we start to talk about the other building blocks: inside of every rack, there are going to be two loops of liquid. This is one of the things most folks want to learn about, how liquid is actually being moved throughout these racks. The main takeaway is that there's a liquid cooling loop on the primary side, the building side; that's what the building is responsible for, and it's almost always water. What's happening inside of the rack is its own closed loop, and all of the solutions we ship use propylene glycol, 25% PG and 75% water. The main reason we do that is long-term loop health: we want these systems to run very reliably, we don't want any weird stuff to grow in them, and we can tackle that really easily with propylene glycol. The other piece in all systems is a coolant distribution unit. Mounted inside of the rack is a CDU; they weigh about 150 pounds and are 4U in height. They manage the pumping, the temperature, a reservoir, and the pressure, so they're really the heart of any liquid cooling solution, watching what's happening inside of the racks.

Now, what's really interesting is that for customers going even beyond 40 kilowatts a rack, who want to go higher and do 100% heat capture to liquid with no heat management left in the data center, we will often marry direct liquid cooling, that's about 70% per rack, with an ARCS system, so ARCS manages the remaining 30% across all of those racks. Think about 100-kilowatt racks: we can take 70 kilowatts of each rack into direct liquid cooling and then manage the other 30 kilowatts with ARCS. That's an extreme example, and you can scale it up or down any way you'd like, but when we combine them, all of that heat ends up in facility water. That's a really good energy efficiency story and really high density.

Now I'll touch on this really briefly. We do a lot of very high-end systems; everything we've done in exascale is built on this infrastructure, and we have it on display as well. This is the Cray EX4000: each of these cabinets is 100% liquid cooled at 400 kilowatts. This is about as dense as we go, very power hungry but very capable. We also build a smaller version that we introduced just two years ago, and we have one of these on the floor to look at: the EX2500, single-rack scalable, exascale technology, fully liquid cooled, but shipped as a single rack, so just a bit of a different spin on the technology. This is what the blades look like: no fans. There's not a single fan in this infrastructure, which allows us to get very high energy efficiency. Antonio talked at the keynote about the TOP500. The majority of the top 500 systems are on this infrastructure, but we're also dominating the top 100 of the Green500 list because they're so energy efficient. That's one of the nice things about liquid cooling: you get energy efficiency, performance, and density; it's all part of the product. And lastly, in everything we do on Cray EX, we also liquid cool the fabric: all of the Slingshot switches are 100% liquid cooled as well. So, just something interesting.
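Here is a tiny sketch of the 70/30 heat-budget split between cold plates and ARCS described a moment ago; the rack count and rack sizes are assumptions for illustration.

```python
# Splitting rack heat between direct liquid cooling and ARCS (illustrative numbers).
DLC_CAPTURE = 0.70                 # ~70% of server heat captured by cold plates
racks_kw = [100, 100, 100, 100]    # assumed pod of four 100 kW racks

to_dlc  = sum(kw * DLC_CAPTURE for kw in racks_kw)
to_arcs = sum(kw * (1 - DLC_CAPTURE) for kw in racks_kw)

print(f"Heat to cold plates / CDU loop: {to_dlc:.0f} kW")
print(f"Residual air heat for ARCS:     {to_arcs:.0f} kW")
# Combined, all of that heat ultimately lands in facility water.
```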
Now, as I start to wrap up here, one thing I always like to talk about a bit is immersion. Immersion is very popular right now, and the industry is talking a lot about it. At HPE, we primarily do direct liquid cooling, and the main reason is that those racks, for the most part, behave like an air-cooled rack: density grows vertically, we can build very dense racks, easily service them, and rack-and-roll them, shipping them direct to customers fully cabled, with all the hoses filled with liquid and all the networking in place. We can't do the same thing with immersion. We do have immersion offerings with customers today, but the primary systems we build use direct liquid cooling, and I believe that is going to be the future for the next several years.

Now, the last thing: how do you get into liquid cooling? It's not hard to get people excited about direct liquid cooling and liquid-to-air solutions, but after the excitement settles, the question is, how do we actually do this? Our data center has X capacity and we want to go beyond that. The first option, obviously, is retrofitting. It can be expensive, but it can be done. One thing about a lot of existing data centers is that there is already liquid somewhere in the building, maybe being used for perimeter cooling, some kind of refrigerant loop, and often we can use a lot of that infrastructure to start tying into liquid cooling systems. But what we talk a lot about today is planning for new data centers, and this doesn't always have to be a brand new building with concrete and foundations. We do a lot of work with pods: data centers inside of shipping containers, fully self-contained, including their liquid-to-air heat exchange systems out to atmosphere. That's something really interesting we talk about. The third option is co-location. I believe this is where we're going to see very fast growth in the market: co-location data centers that are enabling very high-powered racks with liquid cooling. Like I said, I don't find it very hard to get people excited about liquid cooling, even internally; a lot of companies believe this is the path, but administratively it can be difficult to get a new building built and all the permits approved. If we can sell a system and fully deploy it in a co-location data center, that often removes some of the hurdles around where this thing lives right now, this amazing 60-kilowatt machine with a lot of cooling and power requirements. And the fourth option is don't buy the hardware at all: we can do HPE AI cloud or supercomputing as a service, where it's really a pay-to-play service and you don't have to own the hardware at all. So there are a few ways you can play in this space.

I hope that you enjoyed that fast and furious lesson on liquid cooling. Like I said, I'm here off to the side to answer any questions, and I hope that you found it valuable. Thanks so much.