What we can do with AI video is about to get even more wild. In this video, I'm going to show you seven different research papers that have come out recently that truly show off how good some of this AI tech is getting. And I'm presenting them roughly in order from least impressive to most impressive.
So make sure you stick around for all of them, because it just gets crazier and crazier as we see each of these things that we're on the brink of having access to. It's been a while since I've shared one of these videos where I break down a whole bunch of cool research, and I don't want to waste any more of your time, so let's just get right into it.
Now, the first couple of research papers focus on virtual try-on technology, starting with this research called CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models. This is a model that lets you give an input image of a person and an input image of an item of clothing you want them to try on, and it figures out how to basically superimpose that piece of clothing
over the original image in a way that maintains the pose and the person underneath. We can see a whole bunch of examples here, like this woman and these pants, and here's the new version it made. Or this guy here with this white suit, and here's the output where it put him in that white suit. In this person-to-person garment transfer, it can actually take outfits that other people are wearing and put them on your input person. These are two clearly different people: the one on the top is the input image, the one on the bottom is the one they want to use the pants from, and you can see those pants get transferred onto this person here. Here are some other fun examples showing that it works with anime characters. Here's a picture of Robert Downey Jr. and another picture of Dr. Strange, and it turned Robert Downey Jr. into Dr. Strange. Here's a picture of Elon on a red carpet and a person wearing what I could only describe as a K-pop-looking outfit, and it put Elon in that same outfit. And this model is designed to be very light and very simple. You can see "the efficiency of our model is demonstrated in three aspects: lightweight network, parameter-efficient training, and simplified inference." So it's designed to work quickly and simply enough that these virtual try-ons can happen on device; e-commerce companies could set this up right inside a mobile app and not need to use the cloud. It can actually do the virtual try-on directly on your device.
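The title gives away the core trick: rather than bolting on a separate garment encoder, the person image and the garment image are simply concatenated and denoised by a single diffusion model. Here's a minimal sketch of that concatenation idea; the function, names, and shapes are my own illustration, not the paper's actual code:

```python
import torch

# Illustrative only: the core CatVTON idea is concatenating the person
# latent and the garment latent along a spatial axis, then denoising
# with one diffusion UNet (no separate garment encoder).

def try_on_step(unet, person_latent, garment_latent, timestep):
    # person_latent, garment_latent: (batch, channels, H, W) VAE latents
    # Concatenate along width so the model sees both images side by side.
    joint = torch.cat([person_latent, garment_latent], dim=-1)  # (B, C, H, 2W)
    noise_pred = unet(joint, timestep)
    # Only the person half of the canvas is actually being inpainted;
    # the garment half just provides reference detail.
    person_pred, _ = noise_pred.chunk(2, dim=-1)
    return person_pred
```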
The code for this one's available on GitHub, and they have a Hugging Face space where you can try it out; I'll link that up in the description below. Here's a little test I did where I uploaded a full-body picture of myself and took one of the shirt pictures that was down at the bottom, and you can see how it sort of superimposed it onto me here. If I select one of these dresses and submit that, I instantly regret my decisions.
And if you think the idea of virtual try-ons is cool, that was a very, very basic tool for virtual try-ons. There's also Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks. It just rolls off the tongue. This one allows you to give an input image and multiple items of clothing, and then get images back of the original person wearing that clothing. And just like the other one, it works with upper-body, lower-body, and overall garments.
We can see some examples of that here, and they've got some nice comparisons to other models. You can see they have the input person image here and the garment image here, and you can see the various models: GP-VTON, OOTD, IDM-VTON, and CatVTON, the one we were just looking at.
Still looks pretty good. And then this is the new model here, which looks even better, and we can see all sorts of examples where this one works quite a bit better. Now, while CatVTON, the one we were looking at before, was basically designed to take an input image and a picture of some clothing, sort of overlap them, and figure out how to use the original image
along with the clothing and make it look good, this one actually allows you to do all sorts of different things. You can see it "can generate try-on results based on different textual instructions and model garment images to meet various needs, eliminating the reliance on masks, poses, and other conditions." So if you want to give it extra instructions, like make the shirt sleeveless, or use this suit but make it red, you can do things like that, which you can't do with the other model. Now, the first two research papers I've shown off here have both been focused on virtual try-on for images, and I said this video is all about how AI video is getting wild.
Well, the reason I wanted to show you these first is because I think it shows off the controllability we now have with AI images, where we can put anybody into any clothes we want. And one thing you've probably noticed, if you've paid any attention to AI video recently, is that some of the best AI video starts with an image. Tools like Kling, Sora, and MiniMax all tend to perform best when you start with an input image as opposed to just a prompt. And if we can now make images of any character we can imagine, wearing anything we can imagine, and use that as the starting frame of our generated video, it only leads me to believe that we're going to get so much more granular control over exactly what the people look like and what they're wearing when we generate videos. But the next handful of research papers I'm going to talk about are all very, very specifically focused on video.
Next up is DiffuEraser, a diffusion model for video inpainting. This is a model that lets you take a video input, mask out a person or literally anything in the video, and it will remove whatever you masked from the video. On the left, you can see the person that's masked out; in the middle, we can see what this type of technology typically looks like, where it kind of creates a ghost.
And then on the very right, you can see what this new tech does, where it actually uses AI to do a better job of guessing what would have been behind the person. So as we play through this 10-second video, you can see the ghost in the middle. And then on the right, there's still a little bit of it if you look really closely, but it's doing a much better job of estimating what the background should have looked like once the person was removed. Here's another example of a dog running an agility course. You can see the dog's masked out.
In the middle, you'll see the ghosting version, and then on the right, this new model. One thing you'll notice when you watch this video back is that you can still see a tiny bit of the blur, but it's definitely much better than the middle version. Now, if they could only make the AI remove the shadows too, that would be really cool as well.
If it figured out, okay, that shadow is from the dog, and removed that as well? I imagine that's only a matter of time, but right now you can see it's definitely quite a bit improved, a little more noticeably on the last video we just looked at. I will link up these research pages below so you can dive deeper if you really want to get into the weeds. But to me, some of this stuff really shows how much more control we're about to get over the videos we generate, or even the videos we film with our own cameras; we're going to be able to do whatever we want with them, even better than ever before. Here's another example where a car is driving through a parking garage, and you can see it didn't do a great job. You can still see the brake lights lighting everything up, but man, it's getting a lot better than it used to be.
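To give you a sense of what video inpainting involves, here's a toy sketch that just runs a standard diffusion inpainting pipeline from the diffusers library over each frame independently. To be clear, this is not DiffuEraser's actual code; the whole point of their model is temporal propagation between frames, which a naive per-frame loop like this lacks, and that's exactly why this kind of approach flickers:

```python
import torch
from diffusers import StableDiffusionInpaintPipeline

# Naive frame-by-frame diffusion inpainting, a toy baseline rather than
# DiffuEraser itself, which propagates information across frames.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def erase_object(frames, masks):
    # frames: list of PIL images; masks: list of PIL masks
    # (white = region to remove and fill in)
    cleaned = []
    for frame, mask in zip(frames, masks):
        out = pipe(prompt="empty background", image=frame,
                   mask_image=mask).images[0]
        cleaned.append(out)
    return cleaned  # without temporal modeling, expect flicker
```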
And that's just a small taste of some of the demos they have on here that you can check out. I will link up the research link below so you can check this one out as well. With whatever you're building online, you're going to want a place to show it off to the world, and that's why, for this video, I partnered with Hostinger.
Hostinger is the only platform you need to build an online presence, and it includes everything you need to grow online. Right now they're having a New Year's sale where you can get up to 80% off their plans. I personally recommend grabbing the Business Website Builder, because that's the plan that has all the cool AI features, and if you use the coupon code Matt Wolfe at checkout, you'll get an additional 10% off your order. Once you're inside your Hostinger account, it is super easy to build your website. Just click on Websites over on the left, click on the website list, then click Add Website. And I recommend Hostinger's website builder because this is the tool that's going to build your entire website for you using AI. Give Hostinger a few details about the site you're trying to build; for this example, I'm building a guitar website.
I'll click create website and just let AI do the work for me. In less than a minute, I have a fully designed website with a hero header here, some example content, images that are relevant to the topic of the website, and everything else you'd need to get online quickly. I can quickly change the color scheme with the click of a button over here. And when I'm ready to edit the site, everything is drag and drop.
I can arrange this however I want and dial in the site exactly to my needs. And when I say it has everything you need to grow online, I mean it. Check out all these AI tools, image generators, writers, page generators, section generators, blog generators, product details generators, AI heat maps, AI SEO assistants, and even an AI logo maker.
Let's use the AI SEO assistant, pick a few keywords we'd want to rank for, and then watch as the AI actually optimizes the site for SEO for me. It is crazy how quickly and easily you can get websites online these days. And again, Hostinger is offering up to 80% off right now, so check it out over at hostinger.com/mattwolfe. Don't forget to use the coupon code Matt Wolfe.
And thank you so much to Hostinger for sponsoring this video. This next one is called MatAnyone: Stable Video Matting with Consistent Memory Propagation. We can see in this one, they give it a video input, they mask out a person inside the video, and then it actually creates a matte out of that person, which lets them make a green-screen version of it. It even picks up all the fine details in the hair as the person moves through the frame. We can see some examples here: this guy walking through a war zone, and now we have a green-screen version. Or all these people dancing in a room, and you can see it finds every single one of them and puts a green screen behind them. This example with Marques talking to Elon: boom, green screen. It even took the chairs they were sitting on out of the video. Here's another one from a movie: boom, green-screened. This is an AI-generated video here, and even that one, boom, they were able to remove the background and create a matte out of it. And here's another example where they pulled it from a video game, Black Myth: Wukong, and they were able to green-screen out and matte out just that character.
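What a matting model like this actually produces for each frame is an alpha matte: a grayscale map of how opaque each foreground pixel is, including those wispy hair details. Once you have the matte, the green-screen effect is plain alpha compositing. Here's a quick sketch of that last step (my own illustration, not the paper's code):

```python
import numpy as np

def composite(frame, alpha, background):
    # frame, background: (H, W, 3) uint8 images
    # alpha: (H, W) float matte in [0, 1], as produced per frame by a
    # video matting model like MatAnyone
    a = alpha[..., None]  # broadcast the matte over the RGB channels
    out = a * frame.astype(np.float32) + (1 - a) * background.astype(np.float32)
    return out.astype(np.uint8)

# Green screen: composite each matted frame over solid green.
green = np.zeros((720, 1280, 3), np.uint8)
green[..., 1] = 255
```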
Using AI, we're much better able to not only remove things but also isolate very specific characters in a movie. You can take any one of these, with the green screen behind them, and overlay them on top of a completely different scene if you want to. Not only will our video editing workflows get a lot easier because of research and tools like this, but more and more of the AI video tools in the future are probably just going to implement stuff like this.
They're going to give you a little checkbox, and if you turn the checkbox on, it will just automatically make a transparent-background video. I also imagine, jumping back to DiffuEraser real quick, that we're going to have these kinds of tools in our phones pretty soon. The same way you're able to circle things in images now and easily remove them because of AI, you're going to have that same functionality in video pretty soon, where you can record a video of, say, your child's play, and maybe a parent gets up and walks in front of the camera.
Well, you'll be able to highlight that person, erase them out, and have just a video of your child in the play. I think that's really, really close, and probably the next wave of AI tech we're going to get in our phones, maybe next year. But things are getting even wilder here.
This is FilmAgent, a multi-agent framework for end-to-end film automation in virtual 3D spaces. This is like an AI film crew that works in a virtual environment; I believe these examples were made in Unity, the game engine. You can see FilmAgent "simulates key crew roles (directors, screenwriters, actors, and cinematographers) and integrates efficient human workflows within a sandbox environment. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations."
You can see they create a bunch of different environments inside of Unity: an apartment kitchen, a living room, a beverage room, a dining room, a billiard room, a gaming room, an office, a roadside, etc. They also had AI scriptwriters, AI positioning the cameras like camera operators, and AI actors in the scenes, and they were able to create these little short films where AI autonomously created the whole thing.
Now, the graphics aren't amazing; these are just basic 3D objects inside of Unity. But the real feat of this research was creating this sort of team of AI agents that covered all the various roles of filmmaking. You've got the AI cameraman picking where in the scene the camera sits. You've got the AI actors doing their thing. You've got the AI scriptwriters who wrote the scripts the actors are using.
And all of these AIs are communicating with each other to produce a final output. And when we look at the paper, it actually says "human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking."
This 3.98 out of 5 was judged by humans on plot coherence, alignment between dialogue and actor profiles, appropriateness of camera settings, and accuracy of actor actions. So for the most part, humans who watched these videos thought they were pretty coherent. This evaluation doesn't seem to be based on any sort of entertainment value, so I don't know if it's necessarily something people want to watch yet, but it's coherent. And based on this, it's only a matter of time before you'll be able to give a prompt describing the type of film you want to see made, and a series of agents, all communicating with each other, will go and actually create that film for you.
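The interesting engineering here is that collaboration loop. Stripped way down, the pattern looks something like the sketch below, where each crew role is just a language-model call with its own system prompt, and the director critiques drafts until one passes. All the names here are hypothetical, just to show the shape of the loop:

```python
# A hypothetical sketch of a director/screenwriter feedback loop in the
# style FilmAgent describes; this is not the project's actual code.

def llm(role: str, prompt: str) -> str:
    # Placeholder: swap in a real chat-model call that uses a
    # role-specific system prompt (director, screenwriter, etc.).
    return "APPROVED"

def write_scene(premise: str, max_rounds: int = 3) -> str:
    script = llm("screenwriter", f"Draft a scene for: {premise}")
    for _ in range(max_rounds):
        critique = llm("director", f"Critique this script:\n{script}")
        if "APPROVED" in critique:
            break  # the director signed off on this draft
        script = llm("screenwriter",
                     f"Revise.\nScript:\n{script}\nNotes:\n{critique}")
    # Downstream agents then stage the approved script: a cinematographer
    # agent picks camera setups, actor agents pick actions in the 3D scene.
    return script
```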
But things get even wilder. ByteDance recently showed off this research called OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models. This is basically a tool where you give a single image input and a single audio input, and it will create a video from those two. Now, we can see here that they didn't give us any of the starting images, but they do say, "for the sake of a clean layout, we have omitted the display of reference images, which are the first frame of the generated video in most cases." So what we're seeing here are the first frames, and when we press play, that's the generation we saw. So this pretty clearly AI-generated image was given a song as input,
and this was the output. That was created from just one image and then the song you heard. Here's another one where you've got one image and here's a song.
She's actually playing the piano in the video and singing. Here's another example. We're looking at the starting frame right now.
All three of these were most likely AI-generated images as the starting frame, but it also works with images of real people, and at that point it's basically just deepfaking. Here's an input image of Bill Maher, and then there's audio of Bill Maher actually speaking. Here's the sync: "My first guess is the man who made electric cars a thing and is currently working on perfecting reusable rockets, space travel, connecting the human brain directly to computers, connecting cities with electromagnetic bullet trains." And here's another one of a TED Talk: "These principles will not only make your user's journey more pleasant, they'll contribute to better business metrics as well." Just notice how he moves his hands around as he's talking, and it looks like he's giving a real TED Talk. And this whole thing started with just this one input image and the audio you heard.
It works with cartoon images, and it works with portraits of people, like this one: "And just explore all the musical theater options out there, but don't just stick to the song that everyone thinks is amazing." And there are tons of examples here. Again, this comes from one input image and one audio file. It takes that input image and animates it to match the audio file you gave it, essentially allowing you to create anything you can imagine.
You can generate a starting image with AI, give it an audio clip, and make that AI character look like they're speaking or singing. You can take images of real people, upload them, record audio of them speaking, upload that, match the two up, and make it look like they're actually saying it in whatever scenario they're in. Now, think about that combined with AI-generated video.
We've already got the technology to create AI images. We've already got the technology, in tools like ElevenLabs, to create realistic-sounding voices. How far off are we from doing all of that in a single prompt?
Make a video of Elon Musk saying some outlandish, crazy stuff, right? Give it some text that you want to see him speak. It will generate the image of Elon Musk, generate the audio in Elon Musk's voice using ElevenLabs, take the image that was created, take the audio that was created, and merge them together.
You've got a single text prompt that generated basically a deepfake of Elon Musk saying whatever you want. Eventually there will be one tool that does all of that from a single text prompt, but right now you can already start to do it by making multiple tools work together.
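The glue code for that kind of chain is almost trivially simple; the hard parts all live inside the models. Here's a hypothetical sketch where every function is a stand-in for a real service (a text-to-image model, a voice-cloning service like ElevenLabs, an OmniHuman-style animator). None of these are real APIs, just the shape of the orchestration:

```python
# Hypothetical glue code: every function below is a stand-in for a
# real service, not an actual API.

def generate_image(prompt: str) -> bytes:
    ...  # a text-to-image model generates the portrait

def generate_speech(text: str, voice: str) -> bytes:
    ...  # a TTS / voice-cloning service (something like ElevenLabs)

def animate(image: bytes, audio: bytes) -> bytes:
    ...  # an OmniHuman-style image-plus-audio-to-video model

def talking_video(prompt: str, script: str, voice: str) -> bytes:
    portrait = generate_image(prompt)        # the starting frame
    speech = generate_speech(script, voice)  # the audio track
    return animate(portrait, speech)         # lip-synced, gesturing video
```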
It's getting just wild, and honestly a little bit scary, how easy this is all getting. And then the final one I want to show you is the one I am absolutely the most impressed by right now. It is VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models. Remember how people were making gymnastics videos and they were coming out all freaky and weird, like this dude here?
Well, this new tool actually makes them look realistic and way more coherent. Here's a gymnastics video on the right of this clip here.
On the left, you can see the older style videos of somebody hula hooping, where the hula hoop just sort of moves completely off their body at one point. And on the right, you can see somebody that actually looks like a real human hula hooping. And there are all sorts of examples here.
An otter riding rollerblades on two legs in a bustling city park, twirling past picnicking families. Well, the one on the left looks wonky.
The one on the right actually looks fairly realistic. A woman doing pushup exercise. The one on the left, she's just kind of staying in the same position. The one on the right, she actually looks like she's doing pushups. A bear wobbling slightly as it rides a bicycle down a forest trail.
The one on the left has no back wheel and the bear's just floating. The one on the right looks like a bear riding a bicycle. They're actually figuring out how to make the physics work in this. Now, this isn't a new AI video generation model; it's a new way to train video models. So you'll probably see tools like Sora and Runway and Kling use this technique in their training to get the physics down and actually make the videos look like good, realistic videos. And you're going to want to check out this page.
I'll put it in the description because it's got all sorts of examples that are honestly just mind-blowing. We can see here somebody trying to do a headstand, and their head is literally removed from their body, and Sora, Kling, and the DiT baseline here don't look a whole lot better. And then we've got a woman who actually looks like she's doing a headstand, and it looks like what you'd think it should look like. And there are tons of other examples here, all with really, really impressive outputs down the right side.
This is one of the biggest advancements we've seen in AI video generation. We can see here that VideoJAM is "a framework that explicitly instills a strong motion prior to any video generation model." As soon as all the big AI video companies get their hands on this code, we're going to see a big leap in the actual physics, and in getting these videos to look right and more realistic.
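Conceptually, the trick is to make the model predict motion alongside pixels during training, so physics mistakes get penalized directly instead of hoping they fall out of a pixel loss. Here's a heavily simplified sketch of what such a joint objective could look like; this is my own illustration of the concept, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def joint_loss(model, noisy_latent, timestep, target_video, target_flow,
               motion_weight=1.0):
    # The model is trained to denoise a joint representation that decodes
    # to both appearance (the video) and motion (e.g., optical flow).
    pred_video, pred_flow = model(noisy_latent, timestep)
    appearance = F.mse_loss(pred_video, target_video)
    motion = F.mse_loss(pred_flow, target_flow)
    # Penalizing bad motion directly is what discourages floating bears
    # and hula hoops that drift off the body.
    return appearance + motion_weight * motion
```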
We're going to see a huge leap as soon as this research is out in the wild. And now, putting it all together, this is why I think AI video is just going to get more and more insane throughout this year. We're going to be able to put any outfit on any character we want. We're going to be able to erase anything from videos.
We're going to be able to mask out anything in videos. We're going to be able to create agents that know how to set up the camera angles, write the stories, and act in the videos: basically AIs that all work together to form a story for us. We're getting the ability to match images with audio and turn that into a video of anyone we can imagine saying or singing anything we can imagine.
And we're getting more and more realistic physics, and a better understanding of how the real world and motion actually work, to give us much more coherent videos. And once these technologies start intersecting and combining with each other, the sky's the limit on the videos we'll be able to produce using AI. I mean, I imagine a lot of this stuff will be built into Runway or Kling or Sora or your favorite tool of choice. A lot of this technology is going to work its way into those tools and give us more and more controllability over the exact output
of the videos we want to generate. And that, to me, is super exciting. I love it.
It's also a little bit scary, because it's making it so easy for anybody to do any of this stuff, which obviously has some negative implications if bad actors use it to try to fool people into believing things that aren't actually true. But for creative people, all of this stuff is amazing, and I'm so excited to get my hands on it, play with it, and be able to create anything my brain can think up and put it out into the world. So amazing. What a time to be alive.
I hope you guys are seeing what I'm seeing and how exciting this is all becoming. I hope you enjoyed this video. I hope you got a little bit of a glimpse into the future of what's coming.
A lot of this research isn't actually available to just anyone yet, but all of this code will be made available pretty soon, meaning more and more tools are going to get access to it, and anybody's going to be able to get their hands on these abilities pretty soon. So, really exciting stuff. That's what I got for you. If you like learning about the latest AI tools, the latest AI news, and just constantly staying in the loop, make sure you check out futuretools.io.
This is where I share everything that I come across, news, tools, research, everything in between. And I have a free newsletter where I share just the most important news and the coolest tools twice a week with you, straight to your inbox. And if you sign up, I'll give you free access to the AI income database, a cool database I've been building out of ways to make side income using various AI tools.
It's all free. You can find it at futuretools.io. Thank you so much for tuning into this one.
And thank you so much to Hostinger for sponsoring it. Hopefully, I'll see you in the next video. Bye-bye.