>> Good morning everyone and welcome back to theCUBE's live coverage of Red Hat Summit and AnsibleFest 2025 here at the Boston Convention Center. I'm your host, Rebecca Knight, alongside my co-host and analyst, Rob Strechay. Rob, we are both fresh from the main stage, where there were lots of Red Hatters and it felt almost visionary. There were obviously lots of cool new product announcements and the stuff that you usually see, but it was really cool. >> I think why I love open-source, and I love Red Hat and how Red Hat brings open-source to customers, is the fact that it helps people understand why certain pieces of open-source are really important. And to me that's why these sessions are super critical. >> Exactly. And with that, I would like to introduce our next guests. We have Joe Fernandes, vice president and general manager of the AI business unit at Red Hat. Thank you so much for coming on theCUBE, Joe. And welcoming back Brian Stevens, senior vice president and AI chief technology officer here at Red Hat. Thank you so much. So Brian, I'm going to start with you and actually ask you one of these big visionary questions. What is Red Hat's vision for the future of AI? >> It's a big question. Really, it parallels Linux, to be honest. The way AI is heading, it's a very fragmented kind of world. Amazing open-source models are coming out, but they're coming out with their own infrastructure and their own way to run them. And then every GPU or hardware accelerator, and there are actually lots of them, I know we don't know about all of them, we all know about one big one, but they too often bring their own kind of stack on how to use them. And so our vision has really been, how do we unify that into a common platform like we did with Linux, where there can be one core, in this case a project called vLLM, that can run all models and all accelerators. And then, in so doing, think about what that means to end users. They're not DIYing everything they want to do; they can just have one vLLM platform and get to use all the best current and future models and all the accelerators seamlessly. So it's kind of going back to the old days of what we did with Linux, but we think it really matters right here for AI as well. >> And I think to me, one of the keys around that is, and again, Kubernetes is now 11 years old or almost 11 years old, and Linux, you're on RHEL 10 now and it's been around for a long time. I cut my teeth on Unix and then Linux as an admin way back in the day, so it makes me feel old. But anyway, when you start to look at what's going on, Joe, we're really talking about the platform. Because I think a big factor for things like Kubernetes and Linux has always been the intimidation factor, and especially with AI. And I think some of the stuff that was talked about on the main stage today, and what's coming in RHEL 10 and in OpenShift around AI, seems to help make it less intimidating. Give us a view of what was announced and what's really been packaged together. >> So I mean, Red Hat's always been a platform company. As Brian mentioned, 20-plus years ago it was bringing x86 and Linux together to power enterprise applications, and we brought that to the enterprise. Then with OpenShift and Kubernetes it was containerized applications and cloud native application architectures, and enabling that across a hybrid cloud environment, because those apps needed to run not just in the data center but across all the major public clouds and out at the edge.
That's really powered the OpenShift business and the growth in Kubernetes and the cloud native ecosystem. I think AI is just the next evolution. And so as a platform provider, we need to enable customers to run their AI models across any environment, on any accelerator, and really with any model they choose to power their business. And again, it's new, and things that are new are scary. I remind customers that 10 years ago, containers and cloud native were new, but now if you're building a new application and you're not building it on a cloud native, containerized architecture, you're sort of out of the mainstream. So now the new application architecture is agents, but I think in three to five years, agents are just going to be how applications are built. They're going to infuse inference, they're going to infuse tool calling, and they're still going to run on this distributed Linux environment that, at Red Hat, goes back to the beginnings of the company. So we're really excited about that. >> And there were some announcements around the inference server and things like that, which to me is also a big piece of bringing things together as part of what Red Hat does, because it's a lot of different pieces that Red Hat had separated out previously, plus things like MCP, which I know people are still trying to wrap their heads around. Anthropic brought it to open-source, and now it's kind of how the models talk to other tooling and things like that. Help people understand what Red Hat AI Inference Server is and how it really helps. >> Yeah, I mean, the inference server, like I was saying before, is kind of the core. It's the equivalent of Linux, if you will, with Red Hat AI Inference Server being our chosen name and vLLM being the open-source project, the equivalent of the Linux kernel. So it really is that glue layer. It's the thing that can stay the same so that all the innovation in accelerators and models can reach users without change. Joe talked a lot about the platform, but that is the platform, that is the basis of the platform. And what's amazing about it is, once you get that part out of the way... can you imagine the churn for anybody trying to consume all this if everything changes every time they want to adopt something? It's like, "Throw all that out." You put people in a position where they have to make bets, and nobody should be making bets. So I think with what we've done with vLLM, Red Hat AI Inference Server can be that core platform, and then all the stuff that we've talked about, RAG, agents, MCP, all that stuff is the stuff that runs on top. And that's going to evolve, because we weren't even talking about MCP, I don't know, three months ago. We weren't talking about agents nine months ago. So I lie awake saying, "We definitely haven't figured it all out on top of the platform." That layer is going to innovate and evolve, but luckily the platform stays there, and it's the thing that can enable all of that to happen. >> Yeah, I think it was Google saying on stage that it's the year of inference, and we really agree with that, because inference is about bringing these models to production for productive use cases. And vLLM is that open-source standard inference server that abstracts those models from the infrastructure on which they run, again a hybrid environment at the accelerator level, not just at the public cloud to on-premise to edge level. And then agents are just a use case for inference.
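To make that concrete, here is a minimal sketch of what an application calling a vLLM-based inference server looks like. It assumes a vLLM (or Red Hat AI Inference Server) instance is already serving a model locally, for example with `vllm serve <model>` in a recent vLLM release, and that it exposes its OpenAI-compatible API; the model name, URL, and prompt below are illustrative placeholders, not anything announced on stage.

```python
# Minimal sketch: calling a vLLM / Red Hat AI Inference Server endpoint.
# Assumes something like `vllm serve Qwen/Qwen2.5-7B-Instruct` is already running
# locally on port 8000; the model name and URL are illustrative.
from openai import OpenAI

# vLLM presents an OpenAI-compatible API, so the standard client works unchanged.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # swap in any served model; this calling code stays the same
    messages=[{"role": "user", "content": "In one sentence, why does a common inference layer matter?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The point of the sketch is the portability argument: change the model, or the accelerator underneath it, and this application code does not change.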
An agent is an application that infuses inference and large language models into that application to provide it with more intelligence. And then you realize the model can't do everything, so it needs to also call out to tools. That's where MCP sort of fits in. But to Brian's point, we talked about what we're doing with MCP and also bringing in Llama Stack. There's a lot more that's going to happen, or that is happening, in the open-source community. So you'll see that top layer of the platform continue to grow and evolve as we start seeing those applications and those agents come to fruition. >> And how does it make AI run more efficiently? What is the key? >> So a couple of things, like Brian explained. First, just the inference server itself, that's the engine that's running the models. So the inference server needs to be efficient in terms of how it processes the model requests and sort of distributes them. And then Brian and the Neural Magic team also brought a lot of expertise in model compression. Maybe you want to add on that. >> The inference server, that's not a new thing. We've been doing models forever, right? Predictive models and natural language processing, people were summarizing documents, so they always had an inference server, but it was different back then, much simpler. That was only two or three years ago, but a lot has changed. Back then the inference server could be simpler, because when you did an inference request, you typically just gave it one document, or you gave it 100 documents, and it gave you the results back by doing one sweep through the model. One processing cycle through the model, and whoever was using it always knew exactly how long that would take. Now it's a new world where all of a sudden the inference requests are coming from generative applications. So guess what? The input coming into inference can be small, just a question, or it can be massive, you fed it all these documents. And then the generation, the output, varies too: you could say you just want a one-sentence summary, or you could say you want it to write you a physics paper. So the inputs and outputs are completely different. And moreover, on top of that, the inference server itself, instead of just doing that one sweep, is going around and around. It does a lot of work every time it generates a token. So it's gotten crazy complex in terms of juggling all these incoming requests. You're packing those together, feeding them into the GPUs to do all the math operations, and presenting an OpenAI-compatible inference layer in front because everybody wants compatibility. So it's really turned into, and as an old Unix person myself, you called us old and I am old, but that was the stuff you had to manage, that was what an OS did. It's all these complex algorithms, and for the first time they have to go inside of an inference server itself. So the efficiency side is all of that stuff put together: how do you make it efficient? Because like I was saying, GPUs have a really small amount of memory, and that amount of memory actually dictates how many requests you can handle simultaneously. If you had twice the memory, you could handle twice the requests, so your throughput would go up. But you don't.
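To put rough numbers on that memory point, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, FP16) is loosely in the range of a modern 8B-parameter model and is purely illustrative, as is the assumption of 40 GiB of accelerator memory left over after the weights; the arithmetic, not the exact figures, is the takeaway.

```python
# Back-of-the-envelope KV-cache sizing (illustrative numbers, not a benchmark).
layers, kv_heads, head_dim = 32, 8, 128  # roughly 8B-model-shaped (assumption)
bytes_per_value = 2                      # FP16

# Each prompt or generated token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")       # ~128 KiB

context_tokens = 4096
per_request = kv_bytes_per_token * context_tokens
print(f"KV cache per 4k-token request: {per_request / 2**20:.0f} MiB")  # ~512 MiB

free_hbm_gib = 40                        # memory left after model weights (assumption)
concurrent = (free_hbm_gib * 2**30) // per_request
print(f"Concurrent 4k-token requests that fit: {concurrent}")           # ~80
```

Double the free memory and the number of requests that fit roughly doubles, which is exactly the "twice the memory, twice the requests" relationship Brian describes, and why KV-cache management and model compression matter so much for throughput.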
So you've got to be really efficient with the memory, and then you have to be really efficient in how you actually program the GPUs and the accelerators and how you feed them. But with all this software work, vLLM was already being used by a lot of people because it was easy to use, and I think we've gotten four to five X the performance out of it in just the last year. So it just shows you, and there's still that much more work that can probably be done. >> And just building off of that, this is where the llm-d stuff comes in as well, because of how you use the KV cache and things of that nature. >> I love that you're watching. <inaudible> It's so hard to distill it down, even for myself, honestly. But think about it this way. Everything we just talked about was one server, and servers come with one to eight GPUs or accelerators attached to them. So that's kind of vLLM: lots of requests coming in, lots of accelerators underneath it, but that's one server. What we've learned is that now you're coming into enterprise IT, and they're not going to run one server, and they're not going to have one inference workload. They're going to have lots of inference workloads. So the question is, how do you orchestrate that like we did with Kubernetes and apps? How do you now orchestrate LLM serving across a cluster? So that's one part. We chose Kubernetes with the llm-d project. We said, let's not teach people something else; they already know Kubernetes, so let's get LLMs into the Kubernetes world. But the second part is the more complex part. The second part was around meeting the service level objectives of the individual users. Developers are going to get a little bit of quota, and they don't need quite the same response time, but then you're going to have production serving that's maybe going to scale more, and it has to have a certain response time. So with all of those running on the same cluster, llm-d does this thing called disaggregation: it separates the processing of all that input for the first token from the processing of the next tokens, which can be quick. In so doing, it can meet the varying SLOs the end users have, all the while driving more tokens per unit of infrastructure. >> And 10 years ago we were here talking about our Kubernetes launch. It's very parallel. Docker was the hot project. Docker innovated how you packaged applications for Linux container technology, which we'd been working on for many years, but ultimately it was Kubernetes that answered the next question: how do you take a bunch of containers and run them at scale? Because an enterprise application doesn't all run in one container, and with the load and capacity that's needed to power those applications in production, you need to distribute many containers across a cluster. So that relationship between Docker and Kubernetes is very similar to now: vLLM is a single inference server, so how are we going to run that at scale for these large deployments? I think it's a really big project, and I think we'll look back on this launch as a pivotal moment in evolving to at-scale production deployments. >> So along with that, Red Hat's vision is any model, any accelerator, any cloud. So, elevator pitch, how do the new products and community projects that were announced today help customers achieve this goal? >> So on the product side, we've made a few announcements. Any model, right? I think Brian or Chris mentioned this: two years ago, there were no open-source GenAI models.
It was sort of an OpenAI versus Google kind of world. And then with the launches from Meta and Mistral, you've seen a bunch of state-of-the-art open-source AI models released over the past two years, and that's great for customers, it gives customers choices. So we've announced a validated model program for Red Hat AI to give customers confidence that, "Hey, we've tested these models, the latest versions of these models. We've matched them up with the versions of our platform, we're providing data and so forth." So that's on the model side, but then where are you going to run those models? You need to run them on GPUs, on accelerated hardware. So again, Nvidia is the gorilla of the space. We need to run great on Nvidia, and vLLM helps us do that. But you heard from AMD today, you heard from Intel today, you heard from Google, who's building out TPU accelerators, AWS with Inferentia and Trainium, and IBM with Spyre. So a lot of companies are seeing this inference opportunity on the hardware side and trying to be part of that, and as a hybrid provider, we need to support that. And then any cloud, it's just like what we were saying for applications over the last 10 years. Not every model is going to run on Azure or on Google. They're going to run across all the clouds, in the data center, and even generative AI out at the edge. So it's extending that hybrid cloud vision we've been talking about for the past 10 years into hybrid AI. That's really what it means for us. And there are a lot more products and capabilities that go with that, but that's it in a nutshell. >> Yeah, one thing to note there: you talked about vLLM running on Nvidia. Let's face it, you walk into a data center today and it's Nvidia dominated, because they produce amazing accelerators, and of course they have their own stack, which they should, and they're the best at it because they've had so many years of experience. What's kind of stunning to me is the vision of how do we get any model to any accelerator; that was why we needed a new open-source project, vLLM. But vLLM is actually only 22 months old. It's not good enough if the story is, "Okay, on Nvidia use their stack, and then for all the other accelerators we use this other thing." Then it's kind of a toy, not good enough to be on Nvidia. But that's actually not what's happening. In 22 months, vLLM, even on Nvidia, is being chosen, in part just because it had the luck of being born later, into the Hugging Face world. It runs the models on Hugging Face, where all the models are, natively and seamlessly, so there's no extra work, the kind of extra work the older servers have just because they predated it. And then second, it used to be that, yeah, you got usability, but it wasn't as performant; now it's every bit as performant as the Nvidia stack. So it's really emerged as the de facto way, even on Nvidia. So even for a shop that's just an Nvidia shop, it's a really compelling, easy-to-use story. And Nvidia will say this themselves: because of the relationship we have in the vLLM community and at Red Hat with all those model providers, we get access to those models earlier, as do other people in the inference server space, but we can implement support inside of vLLM quicker. So when Gemma came out, when the Mistrals came out, when the Llamas came out, they don't give us much time, like I was saying, two days to two weeks, but we get it done.
And Nvidia has even publicly stated that the best way they can bring new models to their users is actually using vLLM themselves. So that's a huge testament, and that's a good thing. It's a testament to converging and combining our joint innovation on the same unified project. And I think that's why we're not competing with anybody any longer on that vision. We may get to claim the vision, but it's a common project now that has very much invited everybody else in. >> Absolutely. Yeah. One other thing, last question, because you had Ricardo up from Meta talking about Llama Stack. When I talk about Llama Stack, people get confused between the Llama models and Llama Stack. How does Llama Stack, from a Red Hat perspective and in terms of what you're using it for, help people get to these systems faster? What are the building blocks that really help? >> So yeah, as you know, Meta is known for their Llama models, and that really unlocked a lot of open-source generative AI innovation and made people believe that open-source can really compete in this space. So a lot of credit to them, including now the latest release of Llama 4. But what we need to do is also be able to leverage those models to build applications and agents. And so we have capabilities like inference, like RAG, like tool calling, like guardrails, and we were looking for an API to string that together for the end users. The end users would be your data scientists and your developers who are leveraging the platform to build their applications and agents. So we had a decision: do we just build our own API, or do we use something that's out there? Meta had just released Llama Stack as part of, I think, a Llama 3 launch. We liked what we saw there, and, as I think Brian made this point, open-source isn't just about a license, it's about a community. So it was an open-source license, but it also gave us the opportunity to work with Meta and other partners who had similar interests. And that's really what it is. So it's going to become the core API for end users who want to build agents and applications on the platform. Now, they could still just use vLLM and the inference API if that's all they need, and if they already have agents, they can bring those to our platform. But as they build new agents, it'll bring new capabilities and also integrate with other things we talked about, like the Model Context Protocol, MCP, from Anthropic, which is already integrated into the Llama Stack agents API for tool calling. And there's more to come. We're working with Google on their agent-to-agent protocol, and there'll be a lot more projects. If you think about what the cloud native ecosystem looked like 10 years ago versus what the CNCF community looks like today, that same dynamic is going to happen here. So we're really excited to work in the Llama Stack project with Meta and others, in MCP with Anthropic and others, and with Google, and there'll be more engagement like that in the community as we move along. >> Love it. The future is bright. Joe, Brian, thank you so much for coming on theCUBE, a really fun and interesting conversation. >> Great. Thank you. >> Thank you so much for having us. >> I'm Rebecca Knight for Rob Strechay. Stay tuned for more of theCUBE's live coverage of the Red Hat Summit. You're watching theCUBE, the leader in enterprise tech news and analysis.