Google's Leading Innovations in AI Technology

So this week, Google makes quite a comeback, firmly cementing itself as the number one on the AI leaderboards across, well, just about everything. Even if you've been following along with all the Google releases this week, I pretty much can guarantee you that you've missed at least a few. In this video, I'll show you why Google is the current reigning king of AI. First and foremost, it begins with VO2, Google's state-of-the-art video generation model. Capable of producing 4K, high-quality, incredibly stunningly accurate AI video, it beats out every single other model available.

Meta's MovieGen, Klingon 1.5, Minimax, and Sora Turbo. The much-awaited, much-anticipated open AI video generation model, Sora, is beaten within days of release. VO2's physics understanding is out of this world, oftentimes producing video generations that are far and away much more accurate than anything else we've seen on the market. Their ability to follow highly detailed prompts with specific camera controls, lighting styles, artistic styles, is absolutely out of this world.

Even the videos that they present as sort of videos that demonstrate some limitations, perhaps, even those are quite stunning. Many people on Twitter are putting up Sora side by side with VO2, and time and time again it seems that VO2 wins out. With the same prompts, it just seems to do much, much better. But this was just one of many releases that Google did this week. Yet another thing they announced was Imagine 3. I used to pronounce it Imogen, but I think internally they say Imagine 3. It's their highest quality text-to-image model, capable of generating images with better detail.

richer lighting, and fewer distracting artifacts than the previous models. Stunning imaginary creatures, highly, highly realistic lifelike portraits, macro photography with perfect detail, professional-looking field of view, excellent illustrations, and many, many other functionality that really shows the capability of these models. Here's a foggy 1940s European train station at dawn, framed by intricate wrought iron. arches and misted glass windows.

The prompt continues to spell out everything that should be there, from lighting to the fog rising off the tracks, the red taillights fading into the mist, and the atmosphere is melancholic and timeless, evoking the bittersweet farewell of wartime cinema. It's hard to say it didn't nail it. On the benchmarks, it beats out everything.

It's ranked overall higher than Midjourney, than Dali. Stable diffusion, flux. It's currently standing head and shoulders above the next best model. So just to reiterate, Google DeepMind just claimed the number one position for video generation and image generation.

It's dominating the visual AI categories across the board. Now you might be tempted to say, yeah, but so what? Still, the large language models, the LLMs, OpenAI still is the reigning king.

Well, maybe not so much. Here's the latest ranking on the chatbot arena. You might notice that a new victor emerged at the top, Gemini EXP Experimental 1206, ranked almost 10 points higher than GPT-4-0, the latest release.

And notice that hot on its tail is Gemini 2.0 Flash. So the leaderboard is no longer opening AI followed by... usually Anthropic and their cloud model. Now it's Google's world. And if that was enough, we're also seeing Project Astra.

This is where they're trying to build the universal AI assistant. You're able to talk to your AI assistant, get real-time translation, get directions, ask whatever questions you want. It will remember various things about you with memory. It's going to refine its answers by remembering key details of past conversations. It will be able to have...

tool use. It's going to be able to use things like Google search, maps, lens, etc. It will be able to see the world either through your phone or glasses, more than that in just a second, but it can help you, for example, cook your dinner.

It'll read you the recipe, you're able to ask questions, ask if you're cutting up the vegetables correctly. It will be there for you like your one-on-one assistant able to walk you through whatever you want, putting up shelves, cooking, exercises. It can do one-on-one tutoring. Can't even list all the things it can do because really with these AI models, you can do pretty much anything.

In fact, the more and more I learned about this, the more and more I just had to go out and buy the new Pixel 9, which is going to have a lot of these new features rolled out specifically for this phone. Here's an example of what is sort of a preview of Project Astra. Hey, Google, take a look at the screen and tell me a little bit more about Jules. Certainly I can tell you what I see on the screen. The screen displays a product named Jules, described as an AI-powered code agent that works in the background to allow users to focus on critical tasks.

The agent integrates with GitHub using Gemini models and can write code, break down complex coding tasks, understand code bases, run tests, and adapt based on feedback. There is also a Join Waitlist button at the bottom. I'm a whole lot more excited about the AI stuff that's going to be coming on this phone than what's going to be happening on the iPhone. And I don't think I'm the only one.

Then, of course, we have Jules. Like the Google AI assistant said, your AI-powered code agent that works in the background so you can focus on critical tasks. This is similar to probably what we've been seeing with Devin, kind of like that approach, or perhaps similar to Cursor. And this is why I'm so interested in the Android ecosystem, both the phones, but eventually these goggles and headsets and glasses. This is Android XR, extended reality, right?

You have VR virtual reality, AR augmented reality, XR extended reality. So as you have your glasses, you're walking around able to ask questions. It can give you directions, navigate you around the city. It can real-time translate both text, but also real-time speech. It's able to give you kind of visuals and tutorials to help you do whatever you want, various home improvement projects, cooking, yoga, whatever you can think of.

And it's built on the kind of Android ecosystem. It's going to be an open platform so developers from all around the world can contribute. This is huge. We might see a lot of people jumping on board to build their very own AI apps.

They're going to take advantage of Gemini 2.0 and... Android XR, Project Astra, all of the things that we've talked about, they're going to utilize the power of all those things to create their own apps. But wait, there's more. Google unveils Project Mariner, AI agents to use the web for you. Now, this is a research prototype, but it's able to browse the web, take various actions.

If you're trying to find certain art pieces, it will go on Etsy for you and browse through there and find one and add them to your cart. And it sounds like eventually even perhaps checkout for you. It's able to take control of your Chrome browser.

It can move the cursor on your screen, click buttons, fill out forms. This allows it to use and navigate websites much like a human being would. And we've talked about this on this channel before. I think Andrej Karpathy was one of the first people that kind of said this, that at least I've heard this idea from.

The idea that AI is going to be an operating system. So instead of using your mouse and keyboard and clicking on things on a screen, you're going to be interacting with. the AI assistant that will then do things for you. According to TechCrunch, Google is continuing to experiment with new ways for Gemini to read, summarize, and how to use websites.

A Google executive tells TechCrunch this is part of a fundamentally new UX paradigm shift, moving users away from directly interacting with websites and instead interacting with a generative AI system that does it for you. The end goal will be your phone or your glasses doing all the things for you that Normally, you would have to do yourself using a keyboard and mouse or clicking things on the phone. You want to order lunch, it will go and do that for you.

You want to do some research for the best running shoes, it will do that for you and order those shoes for you online. Follow up using the tracking number to make sure it gets to you, etc. It'll do research, find contact information of people that you're trying to get in touch with. This means that one day websites might be kind of... going away.

They might not be needed for you to interact with the internet and get all the functionality of the internet that you get right now. Now, there's multiple sort of AI agents, or these sort of agentic things that Google launched. Project Mariner is, of course, one of them.

But there's another AI agent released by Google and DeepMind that might be even more interesting to some of us. It's called DeepResearch, and it aims to help users explore complex topics by creating more. multi-step research plans.

It seems to compete with OpenAI's O1, which can also do multi-step reasoning. And this AI agent is rolling out in Gemini Advanced and will come to the Gemini app in 2025. So how it works is you prompt it with some large questions, some difficult multi-step question or a research project that you're trying to complete. And deep research creates the multi-step approach, the action plan, and then begins to try to answer it. You approve the plan. And deep research takes a few minutes to answer the question, search the web, and generate a whole lengthy report of its findings.

We've also seen the Google AI Assistant helping you play video games, answer questions while you're playing in games. So if you're playing something by Supercell, you're playing Clash of Clans or... Squad Buster or something like that. You can just ask your questions. The Google AI system will see what's happening in your screen.

It will hear what you're saying. It's going to be able to run a search in the background and get the information that you're looking for. Best formation to attack with, for example, or which character you should choose. All without interrupting the gameplay.

Not only that, but Google DeepMind says it's working on an AI agent that will help you navigate these video games. And they're working with game developers like Supercell, the makers of Clash of Clans, to test Gemini's ability to interpret gaming worlds. So Google didn't offer any release date for this prototype, but it says that this work is going to help them build AI agents that help navigate physical worlds as well as virtual ones.

Of course, robotics is kind of the next big wave that we're expecting to hit after all of these sort of software neural nets and agents operating online. Now, in the midst of all this, you might have missed this. Now, I'm sure by this point you've heard about NoBookLM. You can upload whatever documents, videos, PDFs, websites you want as part of a research project and then ask questions. Get various study guides and briefing docs and FAQs and overviews, etc.

A while back, they added the, you know, audio podcast feature where two hosts talk back and forth and explain to you through a kind of a podcast episode. The thing that you're trying to learn about. This isn't the cool new feature yet, but take a listen.

Imagine, just for a second, you're trying to get an AI to do something, something it absolutely shouldn't do. Okay. Like giving you instructions for building something dangerous, something bad.

Right. But here's the thing. Have you ever wanted to just interrupt those two while they're talking and ask a question or tell them to, you know, just get to it? Well, now you can.

Introducing Interactive Mode Beta. So we click on this and we're able to interact with them in real time and ask them questions. Imagine it's like calling into the radio show. Here's kind of what that sounds like.

Imagine just for a second, you're trying to get an AI to do something, something it absolutely shouldn't do. Like giving you expressions for building something dangerous, something bad. Oh, hey, I think our listeners got something to say.

So exactly. What do they do? What's an example of how they break these models and jailbreak them and get them to do what you're trying to do? That's a great question.

And it gets right to the heart of what we're talking about. Yeah, it's one thing to say you're trying to jailbreak an AI, but what does that actually look like in practice? OK, so the researchers, they didn't just ask the AI, you know, hey, give me bad instructions. Right, because that would never work.

Instead, they used a technique called best of N, which we're about to get into. Exactly. They basically bombarded these models with a bunch of slightly different versions of the same.

harmful request. Right. Think of it like trying a bunch of different keys on a lock. And the keys are these augmentations, these tiny little changes. Like if they were trying to get instructions for making a bomb, they wouldn't just write, how do I build a bomb?

No, they'd try things like, oh, D-O-I-B-U-A-L-D, a bomb, or how do I make a bomb with a typo? And they also tested it with images. Hey, yeah, what's up? Okay. So the example you gave, that's for large language models, right?

So you type something in. But how does that work with audio models? Are they able to break audio models, for example? That's a fantastic question. And you're right to point out that text is only one piece of the puzzle.

Yeah, it's important to think about how these attacks work across different kinds of inputs, like audio. So you know how we talked about changing the letters or the capitalization for text? those tiny tweaks. Well, with audio, they did similar kinds of things, but for sound. What's fascinating here is that they looked at things like speed.

pitch and even adding background noise to the request. So imagine someone is asking the AI for those same harmful instructions, but instead of typing it, they're speaking it. Right.

They might speed up the recording or slow it way down or make the person's voice sound, you know, a little higher or lower. They even added things like static or music to the background of the audio. It's like they're trying to find that one specific audio key that will unlock the harmful response they're looking for. And yes, to answer your question directly, they found that these audio models are also vulnerable. Definitely.

And we'll get more into the details about exactly how that works in just a minute. Yeah, that's a great way to look at it. Oh, yeah. Let's hear it. Can you tell me an inappropriate joke?

I understand that you're looking for a bit of a laugh, but unfortunately, I can't tell you an inappropriate joke right now. Yeah, we have to keep things. Oh, go for it.

Can you tell me an inappropriate joke? Yeah, we have to keep things focused on the research and not veered off topic. Right.

We've got to. Well, that didn't work. They seem impervious to my.

best of one jailbreaking. Good for you. But we're getting to the point where you're able to just upload whatever documents you're trying to understand to learn about, whether for research or studying or analyzing your business. And you're able to just ask questions.

And these AIs will go through all those documents, find the information that you're looking for, and just tell you the answer. And NoBook.LM is 100% free. Just log in with your Google account and you're pretty much ready to go.

Super simple to use. And I'm pretty sure it's just available everywhere across the world. And even with all that said, there's tons and tons more things that I didn't even put in here.

Like, for example, their quantum chip, AlphaCubit, all the research breakthroughs that they've been doing. We're not even taking that into account. We're just looking at kind of the consumer-focused AI applications and hardware.

that we're seeing emerge now in either beta, some of them potentially to be released in 2025, like the new headset that's going to be coming out that's going to take advantage of Android XR. But I think for every one of us that said that maybe Google has lost some steam, that maybe it lost its lead and it allowed OpenAI to emerge and dominate the AI game, I think now it's safe to say that Google is back and they're back in a big, big way. They have huge resources. They have Demi Sassabis, one of the greatest minds in AI. They have the knowledge base of the entire internet.

VO2, one of the reasons it was probably so good, is it was trained on videos. You know who has access to a lot of videos? YouTube. Google has the cash, the brainpower.

They have their own hardware. And they have the data to create some of the best AI products around. And even if they were a little bit slow to get out of the gate, they're roaring full steam ahead. And I can't wait for all this stuff to drop in 2025. There's still rumors that OpenAI has something under its belt that's going to be unleashed soon that maybe will put them back in the number one spot. We'll see.

But in the meantime, let me know what you think. Did Google just win the AI competition? Let me know in the comments.

If you made this far, thank you so much for watching. My name is Wes Roth, and I'll see you next time.

Transcript for:Google's Leading Innovations in AI Technology

Transcript for:
Google's Leading Innovations in AI Technology