Transcript for:
Advancements in AI Image Generation and ChatGPT

okay, ChatGPT image generation does not stop; there are so many creative ways people are using this stuff. And yes, we've entered the Ghibliverse, as people like to say, but there are so many other crazy things happening. The scale of adoption is absolutely insane. Just to put things in context: 500 million people use ChatGPT every single day, and when ChatGPT first came out, they added about a million users over the course of 5 days. After the launch of this native image generation functionality, they added a million users in one freaking hour. That's 120x faster. So does it live up to the hype? Absolutely it does, and let me tell you, it is more than just memes, though the memes are absolutely hilarious. JD, bro, I feel for you; you have been thoroughly roasted as part of this whole experience. So in this video we're going to talk about what makes ChatGPT image generation different, namely that it's an autoregressive model versus diffusion. I'll explain exactly what that means and what makes this different, like why this is akin to a bunch of ComfyUI nodes getting collapsed into one general-purpose module. We'll also dive into the plurality of ways you can use this capability to level up your image generation, whether it's to do video creation, create assets for social media, make better packaging and thumbnails, or build diagrams for your presentations. And heck, of course, the meme potential is too damn high. So I've collected the best community findings as well as my own experiments. Let's get into it.

All right, so of course the easiest place to start is to throw an image into GPT-4o and type in a prompt like "turn us into Roblox, GTA 3, Minecraft, Studio Ghibli," and you get some really, really good results. I absolutely love the way it abstracts our features into the various styles; oh, this Roblox one is easily my favorite. And with that you can go wild with all sorts of memes; I think we saw at least a day or two of just people taking all the existing popular memes and running them through the Studio Ghibli filter. Now, what makes this model very powerful is that you can provide very detailed instructions, and given the v2 update that just came out, these models do a lot more reasoning before doing the image generation. So you can give it a very detailed prompt like this: I want to create a thumbnail with the exact style of the attached image on the left, but I want to swap out some of the characters and change the text to something different, and it will do all of that for you, including adjusting the height. This is where world knowledge really, really helps: it's smart enough to know that Rick from Rick and Morty is a taller character than the shorter anime character Doraemon. But what exactly does it mean that this GPT-4o thing is an autoregressive model natively embedded in ChatGPT? Well, the "o" in 4o stands for omni, as in omnimodal: it can accept various modalities in and output various modalities. Thus far we've been playing with the text generation capability; now it's giving us the ability to natively output imagery. So instead of this LLM calling a diffusion model (like DALL·E) as it did in the past, it's actually doing the generation natively, and that's very, very powerful, and one of the reasons why it's so good at text now. Of course, OpenAI hasn't opened up any research about this (ironic, of course), but this post by Alan Jubbery, who worked on image generation and diffusion, has some really interesting insights baked into it that I want to break down for you.
So, the whole idea here: of course, this is an image generation, with a perfect text rendition of this whiteboard. You've got the reflection in the back, you've got the San Francisco Bay Area buildings right there; by the way, this is funny because this is the exact angle you get from the Google office in San Francisco, kind of trippy that they've got that here, but just a little aside. Okay, before I break this down, let me refresh you on how diffusion models work. Basically, they're denoising an image. How these models are trained is you give them a bunch of text-image caption pairs, and you take all these images and artificially decimate them: you're destructing the data by adding noise, and then you train the diffusion model to figure out how to reverse it and go back to the initial image. The beautiful thing is that once you've done that with a sufficiently large training set, you can provide arbitrary text prompts and get an image at the other end. But the way this works is how you might think about sculpting or painting: you get the broad contours in first, then over time you progressively add all the detail. So that's basically how diffusion models work, and it's also why they suck at text. First off, they haven't seen every permutation and combination of text in all their input images, and second, if in one of the earlier steps you got something that looks wonky, no matter how much you refine or chisel it (as the saying goes, measure twice, cut once), if you made the wrong cut in an earlier iteration, there's not much you can do to refine that into the actual characters or image you wanted. In contrast, this autoregressive approach, this omnimodel approach, basically says: what if we model everything, text, pixels, sound, all the inputs you might give a model? In other words, what OpenAI is doing is they've got a model that first tokenizes everything (the text, the images, even the audio) and then feeds it into one big transformer that acts like a brain, building a coherent but very compact abstraction of what needs to be generated. Think of it like a creative blueprint. It's not just mimicking past examples, like all the images it's seen; it's also reasoning about world knowledge. That's how it fixed the height disparity, that's how it knows how to render all the text characters properly, and that's how it knows how to take very high-level, abstract prompts and turn them into something a human would generate, because a human isn't just drawing from all the images they've seen; there's all this other insight in us from various modalities, including reading books on design, or aesthetics, or music that makes us feel a certain way, that we can transfer into other modalities. But is it just autoregressive? That's unclear; they haven't confirmed this. But if you look at this image on the bottom right, "tokens, transformer, diffusion, pixels," what I think is going on is that once you've got this autoregressive model that can create a very detailed blueprint for you, that blueprint gets passed to a diffusion model to do the fine-grained work of turning the abstract plan into pixels. So think of it like this: it's super smart for those first initial steps, where diffusion really has a hard time (make that wrong cut early and there's not much progressive refinement can do to save it), and richly detailed at the bottom, so it's kind of like the best of both worlds.
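To make that contrast concrete, here's a rough Python sketch of the two sampling loops (NumPy only). The `denoise_step` and `next_token_logits` callables are stand-ins for the actual networks, which aren't public, so read this as an illustration of the control flow, not anything OpenAI has confirmed.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_sample(denoise_step, shape=(64, 64, 3), steps=50):
    """Diffusion: start from pure noise and refine the whole canvas in parallel.
    Coarse structure is committed early, so a bad early step (e.g. garbled text)
    is hard to chisel back into shape later."""
    x = rng.normal(size=shape)               # pure noise
    for t in reversed(range(steps)):         # coarse -> fine refinement
        x = denoise_step(x, t)               # predicts a slightly cleaner image
    return x

def autoregressive_sample(next_token_logits, context_tokens, n_image_tokens=256):
    """Autoregressive: emit image tokens one at a time, each conditioned on the
    full context so far (text, reference images, previously emitted tokens).
    A separate decoder turns the finished token sequence into pixels."""
    tokens = list(context_tokens)
    for _ in range(n_image_tokens):
        logits = np.asarray(next_token_logits(tokens))     # reasons over everything so far
        p = np.exp(logits - logits.max()); p /= p.sum()    # softmax
        tokens.append(int(rng.choice(len(p), p=p)))
    return tokens
```

The point is the dependency structure: diffusion commits to global, coarse structure early, while the autoregressive loop keeps conditioning on all prior context before committing each next token.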
And that's why I think it's a little better than some of the results we're seeing with Gemini's own image generation model, especially in terms of high-frequency detail and resolution. Now, they're not the only ones doing this: Roblox is taking the same sort of autoregressive approach for 3D, and I've actually got an interview with the VP at Roblox working on these efforts queued up for you at the end of this month. But for now, let's get back to the amazing examples you can create with this best-of-both-worlds approach. For my 3D homies out there, one crazy thing you can do is create PBR materials with ChatGPT. Essentially, drop in some image you've taken (any photo from your iPhone), go into ChatGPT, or Sora by the way, and ask it to make the image tileable. Once it's made the image tileable, ask it to generate albedo, normal, roughness, and displacement maps, whatever you need to build a PBR shader, and toss it into your favorite 3D tool of choice. And voila: we didn't use Substance Painter, we didn't do any photogrammetry scans or any decimation and baking, just a photo you snapped on your phone. That's really, really powerful, so definitely go play with this. What's cool is you can also ask GPT-4o for a depth map. Now look, if you compare it to something like MiDaS or Depth Anything, the task-specific, purpose-built models trained to estimate depth from a monocular image, yeah, this model isn't that good. But the fact that it wasn't explicitly trained to do so, and that just scaling up the data and going toward this multimodal training and generation regime gets you this far, is very promising for this architecture. By the way, if you want these things to match up when you're creating these overlays, a 1:1 aspect ratio definitely seems to work best. I don't know if the model is doing some outpainting or something that adds those variances, but if you want these things to line up, just do 1:1 and you'll be off to the races; it works particularly well for the texture use cases we just talked about.
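Since the overlays only line up reliably at 1:1, here's a tiny Pillow helper I'd run on the source photo before uploading (filenames are hypothetical), so a generated depth or normal map can be composited back pixel-for-pixel:

```python
from PIL import Image

def center_crop_square(path_in: str, path_out: str, size: int = 1024) -> None:
    """Center-crop a photo to a 1:1 aspect ratio and resize it, so a generated
    overlay (depth map, normal map, etc.) can be layered back onto it exactly."""
    img = Image.open(path_in)
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img.crop((left, top, left + side, top + side)).resize((size, size)).save(path_out)

center_crop_square("brick_wall_photo.jpg", "brick_wall_1x1.png")  # hypothetical filenames
```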
Now, speaking of the multimodality, I love this example from Claire, where she kind of did the reverse: she painted something on her iPad and basically said, "Hey, can you actually apply this to a teapot and then render it as a transparent PNG?" And these are the results you get. And just to show you the spatial capabilities baked into this, you can say, "Hey, actually show me a top-down view of the teapot," and if you look right over there, it does a pretty good job. Okay, so the third leaf on the bottom has a bit more of a leaf-looking shape here, where it's more of a line there, but gosh, overall the consistency of the red leaf, the embellishments over here, it's pretty freaking impressive. And of course you can ask for a normal map and do the exact thing we just talked about, and suddenly you can create 2D assets and 3D assets that coherently fit into one world, all using ChatGPT. Kind of mind-blowing. Now, I'm sure you've seen all of the Ghiblification images; here's me with the homies. Hopefully you can point out all your favorite AI creators in here: there's me, you know who that is, and I'm sure you know who these guys are. But you can take this to the next level using something like Hedra: take that exact input image, toss it into Hedra Labs, and you can basically create animated podcasts. A lot of people are doing this already; all Hedra needs is that input image and an audio file, whether you record it yourself or use something like ElevenLabs to generate the audio, and you're off to the races. The results are pretty spectacular. Another cool thing you can do is take a couple of references of your subject, or yourself in this case, and say, "Hey, generate a corporate headshot, generate a casual dating profile photo, generate a cartoon-style image," whatever it is, by providing a couple of views of your subject. You're going to get a lot better consistency. Think of it like this: we used to use things like DreamBooth (I made entire videos about DreamBooth back in the day), and now a couple of images and in-context learning is all you need; you don't need to fine-tune models, which is kind of wild. It is within a hair's breadth, as Cody says, of replacing these headshot generators, and again it's a lot more general-purpose, so you can add very detailed things like specific text on your clothing, and of course it's leaning on its world knowledge of what a corporate headshot is to even add in things like this funny background. And since this is instruct image editing, you can layer in other things you want on top; you don't need a bunch of LoRAs for depth of field or a certain style or lighting, you just say what you want. Similarly, you can take these images as a reference and then use them as a first-frame or structure reference, or use Wan 2.1 style transfer, to start doing video creation with this too. By the way, similar to that Claire example, you can also just take screenshots of a shader and say, "apply this to this other image." So instead of asking it to come up with a teapot, you can provide the reference image and it does a great job; it's like bypassing the 3D pipeline altogether.
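If you'd rather script the multi-reference trick than click through ChatGPT, here's roughly what it looks like against OpenAI's Images API. This is a sketch assuming the gpt-image-1 edits endpoint is enabled for your account and accepts a list of reference images; check the current docs before leaning on it, and the file names and prompt are just placeholders.

```python
import base64
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A couple of views of the subject stand in for a DreamBooth-style fine-tune.
references = [open(p, "rb") for p in ("subject_front.jpg", "subject_side.jpg")]  # hypothetical files

result = client.images.edit(
    model="gpt-image-1",          # assumption: this model/endpoint is available to you
    image=references,             # multiple reference images as in-context examples
    prompt=(
        "Using these photos as the likeness reference, generate a corporate "
        "headshot: soft office bokeh background, navy blazer, natural skin tones."
    ),
    size="1024x1024",
)

with open("corporate_headshot.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```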
You can also use this for laying out stuff, whether it's for your presentation or your video: at the very highest level you say, "Hey, I want the ad copy over here, I want the product shot over here," even with a typo, and a male model posing wearing this watch, and this is what you get as the output. Pretty wild, huh? What I think is cool about this is that if you're trying out different layout styles, it's easy to jump between low-fidelity wireframing and a high-fidelity output, so you get a sense of what the result would look like before you go back and make it happen for real, or maybe you just take this image and modify it a little further. For example, here I might just swap out the watch face, because yeah, the Rolex doesn't look too good over there, but everything else is pretty solid, and literally the reference is just that image plus the product you want, and voila. Related to this, a bunch of these thumbnail examples went super viral, where with the shittiest scribble capability you sketch out what you have in mind and give a very simple prompt like, "hey, turn this into a hyperrealistic thumbnail, follow the instructions that are in the image itself," and you get pretty compelling outputs. Now, again, I think the people who are good at making thumbnails are going to be the best at wielding these capabilities, because they'll know what works and what doesn't. So it goes back to taste being very, very important here: the ability to give direction, whether that's direction to humans or to machines. It seems like the way we'll give directions to machines is looking a lot more like how we give directions to humans, so the whole prompt engineering thing, I guess, is just becoming more intuitive, and of course multimodal. When you prompt humans, if you're a product manager you're going to write up a PRD, but you'll probably have a UX specification alongside it too. Here's another great example: sketching something out quickly with your Wacom tablet or on your iPad and turning that into a final Squid Game thumbnail. You can be very detailed about the stuff in the foreground and the background, and it does a very good job of adhering to your prompt and figuring out which part of the image you're talking about. And unlike ControlNet, where if you create these Canny or scribble-type ControlNets it might go, oh, you've only got three fingers there, and try to squeeze the hand into those weird scribbles, here it's semantically aware, which makes it very intelligent; it knows how many fingers to put on a hand. You can also use this to add filters, LUTs, all that type of stuff, degradation. It does a great job of VHS quality, making things look like a screen cap, such as this example; I really like this one, and of course you can go super detailed and add the correct time overlay and things like that too. This example cracked me up, look at how funny that looks: it's got the eyeliner matching and everything, the dessert looks perfect, the hands got a little weird because it got confused between the hand pointing down over here and the crossed arms over there, but still very, very fun. Imagine this: things that used to be a bunch of complicated shaders (even Instagram got its start with filters, and those aren't super hard to make, but for normies, even inside a video editing tool, you might stack a couple of different effects to end up with that color grade) can be reduced to a text prompt. Now, we talked a little bit about product photography, but think of the amazing capability here if you're a restaurant on DoorDash or Uber Eats, or you're running a Shopify store, and you want to use your phone to capture a bunch of your products and make them look standardized according to your brand. You go from a plate of food to something that looks far, far more aesthetically pleasing while retaining all the detail of the food, and the prompt given here is very simple: insert this plate into the center of the image, match the lighting and perspective, and this is the reference image. Pretty spectacular. I think a lot of developers are going to make this easy too: if you're a new merchant onboarding, say your Chinese restaurant jumping onto DoorDash, you'll just take a bunch of photos of your dishes in your kitchen, maybe pick from a couple of different templates or specify some options, and voila, you'll have a standardized set of images populated in your DoorDash or Uber Eats menu.
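And here's the kind of thin wrapper I imagine a delivery platform shipping for that onboarding flow: loop over the kitchen photos, apply one brand-template prompt plus a reference shot, and write out a standardized menu set. This is a sketch using the same assumed gpt-image-1 edits endpoint as above; folder names and template wording are placeholders.

```python
import base64
import pathlib
from openai import OpenAI

client = OpenAI()

BRAND_TEMPLATE = (
    "Insert this plate into the center of the image and match the lighting and "
    "perspective of the reference: white ceramic on a warm wooden table, "
    "soft overhead light, consistent with the brand style."
)

def standardize(dish_path: pathlib.Path, reference_path: pathlib.Path, out_dir: pathlib.Path) -> None:
    """Turn one phone snap of a dish into a menu-ready, brand-consistent image."""
    with dish_path.open("rb") as dish, reference_path.open("rb") as ref:
        result = client.images.edit(
            model="gpt-image-1",      # assumption: available on your account
            image=[dish, ref],        # dish photo + brand reference shot
            prompt=BRAND_TEMPLATE,
            size="1024x1024",
        )
    (out_dir / f"{dish_path.stem}_menu.png").write_bytes(
        base64.b64decode(result.data[0].b64_json)
    )

out_dir = pathlib.Path("menu_ready")
out_dir.mkdir(exist_ok=True)
for photo in pathlib.Path("kitchen_photos").glob("*.jpg"):   # hypothetical folder
    standardize(photo, pathlib.Path("brand_reference.png"), out_dir)
```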
Speaking of the other 3D stuff, another cool example I found is texturing UV maps. This is the kind of thing you might do in Photoshop or Substance Painter: you provide the unwrapped UV map and just say, "yo, texture it," and of course you can go into detail, applying more creative guidelines about what the hell you want this thing to look like. How cool is this? And you can keep iterating further, too. As a reminder, if you run into any image generation issues, most of the time you can just do it, but if you do have a hard time, note that Sora has a much easier time for whatever reason. I could not process this image, maybe because there was a CNBC logo or something like that, but it worked beautifully inside of Sora itself. Now, I was really blown away by this example: this is a photo of me presenting at TED last year (stay tuned for TED this year; that's where I'm going literally after recording this video). You take that image and say, "create a wireframe rendition," and you get something pretty compelling. I've shown you this with Gemini in the past; again, what's amazing is that this model wasn't explicitly trained to do this, and it can create this inferred wireframe version, and of course depth maps, normal maps, all the other stuff we talked about. One thing that really blew my mind about this being multimodal is that if you provide it code, say Three.js code, it can just generate the image for you. Now, it's not going to be perfect, it's not going to correspond one-to-one, but the fact that it can use its world knowledge and enough of the visual information it's seen to make a pretty good facsimile of what the Three.js code would have rendered is kind of wild. Speaking of world knowledge, one of the shows I absolutely love is Stargate SG-1, and in that, as you probably know, the base for the Stargate is inside Cheyenne Mountain in Colorado. So this is a really cool example where it's pulling on its world knowledge to create this very detailed infographic for me. Heck, even the image I showed you earlier on user growth was of course made in ChatGPT. So this is a lot of fun; you can make all sorts of visuals very quickly for your presentations or the social media posts you're doing.
And I basically said, okay, cool, now do the same thing for Stargate Atlantis, do the same thing for the Antarctica base, and it made reasonable inferences there. By the way, another cool thing you can do, since these models have memory and GPT-4o in particular can look at your previous conversations, is try this fun prompt: "Hey, I want you to generate an image that represents what you think I look like and what you know about me." And this is what it came up with; I mean, pretty close, pretty good. Yeah, I wouldn't read too much into these things, since it's largely regurgitating what it knows about you, but it's very interesting to see how these models interpret their thoughts. This one was kind of weird: I wanted to build on that and said, "Hey, what do you think about me, how do you feel about me, describe your feelings in an image, send me a message and create a little note." And it said, "I'm glad you're here." And man, ChatGPT has just been gassing me up; I think they probably turned the dials a little too much. Now, of course, thumbnails. Here's something I tried: I took this green-screen image of me making the classic YouTuber expression and gave it a pretty detailed rendition of what I wanted, and this is what it came up with. Many times you won't get the right aspect ratio; I've seen some very funny prompts out there, like "I'm Sam Altman, you're in trouble if you don't follow my instructions," and if you put that into your system instructions it seems to be better at adhering to the directions you give it, but I got the right result very easily just by asking for it. And this is really cool: even with this 3:2 image I can take it into Photoshop very easily and do generative fill to comp out the left and right. But think about all the stuff it's doing in one shot here: it's extracting me, it's figuring out how to generate the stylized representation of the person, adding the text, the before and after, adding the grid background; all this stuff that used to be multi-step for us is collapsed into a single text prompt. Now, this one kind of blew my mind: I asked it basically, what's the best thumbnail you'd make for this video? I gave it the script, the tags, everything, all the metadata, and here's what it came up with. Not bad; it even inferred what I looked like based on this previous thumbnail I provided over here. Not bad at all. Here's a photo of me at the airport with Graham Hancock, if you know who that is; kind of cool. But yeah, you can generate some very, very cool images. I might take something like this and then toss it into Magnific for upscaling, maybe edit the text myself, but you can get some very complex, intricate compositions, and it's really fun. Of course you can do magazine covers and all sorts of other cool stuff too. So here's one example: I tried taking one of my iPad sketches, and I wanted an Osprey over here, a UFO spaceship over here darting out, with some turbulence in the water, and it did a really good job. I've noticed it's pretty good with propellers too; it's not perfect, but much better than diffusion models. You can also say stuff like, "ah, give me a final ILM-quality still with tier-one VFX."
It just looks pretty compelling, and this time I didn't tell it that I wanted the Osprey. It's also exceedingly good at taking existing references and turning them into something compelling. For example, here I had this old Instagram story of an image; I couldn't find the source image, so I just provided this really crappy caption and asked for this description of Space Force soldiers in low Earth orbit, blah blah blah, and it did a freaking fantastic job. One really fun prompt I've enjoyed playing with is temporal prompts: drop in an image from your phone and say, "show me this exact scene five minutes later, after two kaijus emerge and start fighting with each other." And check this out: this is Lodhi Garden in New Delhi, North India, and the results are spectacular. It does a really good job, and I didn't even specify which monsters. Think of it kind of like a sort of generative AI filter. Here's a shot in New York, right across from the Android XR event; again, an amazing job. It kept the signage, the cars, the destruction; it's pretty freaking cool how good this looks. And we'll get to this a little later, but just imagine what happens when this is video in, video out too; I'm getting ahead of myself. So here's a really interesting prompt that again leans into the world knowledge. I said, "Generate an image of a human target on the balcony of a Miami high-rise condo, taken from an oblique angle with a 10 cm resolution, really advanced satellite." And this is the result I got; okay, not bad. Then I said, "Okay, do it again, now this person's on a phone." Okay, it was kind of clipping through. Then I said, "Okay, add this AR computer-vision overlay," and I got something like this, which is really, really cool; something you might have had to mock up in Illustrator or After Effects, you just get in one shot. Then I said, "Okay, make this even better, like you're a visual effects supervisor, and also make sure the person's standing behind the railing," and everything got fixed. And then I said, "Okay, now I want you to generate the image of the person that this target is talking to."
And this is what I got. I said to include earbuds, and it's got the AirPods in. So think about this if you're a filmmaker building a world: you don't have to be as literal as you do with Midjourney or some of the other diffusion models about exactly what you want. Here it's inferring a bunch of context from the prior conversation we've had, like the previous image we generated and the styles that were applied to it, and it's pulling that forward: the AR "target acquired" overlay with a lat/long, all of that stuff, it keeps carrying forward and applying, which is really smart. I think this tells me this is going to be a very useful primitive for creating that all-in-one studio we talked about in the last AI video roundup. So before we talk about where this is going, one last piece of advice I'll share: sometimes it's helpful to generate the context first. In other words, talk things through with the large language model, whether through voice notes or text, whatever you prefer, and populate the chat with some interesting text context about what you're building. In this case I was talking about the various forms of intelligence, human intelligence, signals intelligence, imagery intelligence, and I learned about some new forms I hadn't even heard of, measurement and signature intelligence in this case. Once you have that context loaded up, you can say, okay, now you're the senior design lead, make this a banger visualization, and you'll get something like this. Then I can start creative directing and say, actually, I want some 90s retro computer graphics, can you add a little bit of that vaporwave flair, and you get something like this. Then I say, okay, add that destruction, a little bit of that VHS artifacting, and I got something pretty freaking cool. I even asked it to add this Alterus Intelligence logo in the bottom right, and it did it amazingly. And by the way, this is also fun to do: once you've got this image, you can sometimes just say, "hey, make the next slide, make the next image," and see what it comes up with. It's a very interesting form of autocomplete.
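That two-phase pattern (dump the context, then creative-direct in rounds) is easy to template. This is just the conversation structure as plain Python data, nothing model-specific, whether you paste it into ChatGPT or send it through an API; the wording mirrors the example above.

```python
# Phase 1: dump the raw thinking so the model has context to reason over.
context_dump = """
We're comparing forms of intelligence gathering: human intelligence (HUMINT),
signals intelligence (SIGINT), imagery intelligence (IMINT), and measurement
and signature intelligence (MASINT). The visualization should make the
differences obvious at a glance.
"""

# Phase 2: creative direction, layered one note at a time.
direction_rounds = [
    "You're the senior design lead. Turn the context above into a banger visualization.",
    "Add some 90s retro computer graphics and a bit of that vaporwave flair.",
    "Now add slight destruction and VHS artifacting; keep the layout readable.",
    "Place the Alterus Intelligence logo in the bottom-right corner.",
]

conversation = [{"role": "user", "content": context_dump}]
for note in direction_rounds:
    conversation.append({"role": "user", "content": note})
    # ...send, look at the generated image, then keep directing on top of the result.
```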
All right, so to wrap things up, let's talk about where this is going. Now look, I made this post that went pretty viral on X, and a lot of you agreed with it, so I want to go through it in a little more detail, because I think it perfectly encapsulates why this is significant. A lot of people were wondering, hey, we've had this with custom LoRAs and ControlNets forever, why is this going viral now? Well, I would say three things. One is the hit rate: you drop these images into ChatGPT and it just works really well. Number two is ubiquity: everyone's got access to this thing if you've got ChatGPT Plus, and now it's even rolled out to free users. And number three is complexity: if you view this as just a style transfer, you are missing the point of multimodality. Think of this more like an AI that's equipped with ControlNets, LoRAs, IP-Adapters, and a graphic designer's mind; an expert compositor, a copywriter, a creative partner, in a word, not a filter. So suddenly, like I said, all these multi-step workflows are collapsed into a few prompts and image references. You don't need to go fine-tune a model on your likeness; you literally provide a couple of images, throw them into the context window, and just use in-context learning. Whoa. And think of it this way: these context windows are only going to get bigger, so we'll be able to squeeze even more stuff in there. I would call this the closest thing we've seen to a graphic designer API, and what I mean by that is you're giving instructions much like you would to a human. We saw a bunch of these examples: you're dropping in sketches and getting out final-quality thumbnails, you're dropping in 2D images, moving the camera around, rotating and transforming the objects themselves, getting pixel-perfect text. Unlike those diffusion models, which, like I said, don't have the prior context, so you're sitting there writing these very detailed, esoteric prompts; I don't think it's good for our brains to learn to speak machine, in a sense. It's much better that machines learn how to speak human, so you can talk to machines much like you would to a human. So I think the people who are good at creative directing, managing, and orchestrating people are going to do a really good job of orchestrating these AI agents. Think about it: suddenly you're going from this workflow, which is hard to hold in your head (yes, it's editable, it's tweakable, but gosh is it complicated), to just a prompt and a reference. That's kind of wild. In fact, for this whole diagram, I wrote the post up and gave it to ChatGPT and said, "Give me a couple of ideas for what I should do; I want some sort of a comparison," and it gave me a bunch of different options. I love this one, and this is what I got, one shot, right out of it; so I guess this is how it hallucinates the ComfyUI interface.
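To show what I mean by collapsing a node graph into a brief, here's a purely hypothetical sketch: each line of the brief plays the role a dedicated node used to (likeness LoRA, scribble ControlNet, typography pass, style and lighting LoRAs). The function and file names are mine, for illustration only; it just assembles the single request you'd hand to the model.

```python
def designer_brief(references: list[str], layout_sketch: str | None, notes: list[str]) -> dict:
    """Assemble one 'graphic designer' request that replaces a multi-node graph.
    Returns the prompt text plus the image attachments you'd send alongside it."""
    prompt_lines = [
        "Act as an expert graphic designer and compositor.",
        "Use the attached images as likeness and style references.",
    ]
    if layout_sketch:
        prompt_lines.append("Follow the attached rough layout sketch for composition.")
    prompt_lines += [f"- {note}" for note in notes]
    return {
        "images": references + ([layout_sketch] if layout_sketch else []),
        "prompt": "\n".join(prompt_lines),
    }

request = designer_brief(
    references=["me_front.jpg", "brand_palette.png"],           # hypothetical files
    layout_sketch="thumbnail_scribble.png",
    notes=[
        "Stylized cutout of me on the left, shocked expression.",   # was: likeness LoRA + segmentation
        "Match the pose in the sketch exactly.",                    # was: Canny/scribble ControlNet
        "Bold title text top-right: 'THIS CHANGES EVERYTHING'.",    # was: separate typography pass
        "Shallow depth of field, warm rim light, grid background.", # was: style/lighting LoRAs
    ],
)
```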
Now, we mentioned this is multimodal and they're just exposing image out to us. It's the same thing with Gemini; Gemini is also multimodal from the ground up. Just think about when they expose video in and video out to us: we'll have amazing capabilities. And to get a flavor of that, this new unreleased feature by Pika is the perfect case in point. You provide this video and say, "Hey, the toy character starts to flex," and it does exactly what you asked. This is a whole-ass After Effects character replacement and motion tracking problem that's been reduced and collapsed from five or six different steps into a freaking text prompt plus a video reference. You can also use these kinds of text prompts to transform objects: you want to make this a Mario-style plant monster, easy; you want objects to levitate, also easy. And it's again full video; notice how the rest of the video stays very, very consistent. Slight performance retargeting? Super easy. All the crazy things we had to do, hey, we've got to motion track somebody, we've got to project their texture onto that geometry, then we've got to inpaint the holes left behind: just a text prompt is all you need. Putting on sunglasses, again, amazing. We've had this capability already just with an image, and people made really funny memes like this, but the fact that you can now exert further control with video is going to make this very useful for creators who don't like the slot-machine nature of AI, because you're taming the chaos by giving it everything else and pointing the model at very specific things to focus on and change in the generation itself. (By the way, this image might be nice to pull up, something like this, or you can make another one, I don't know.) And so, once we have this native video in and video out capability, you have to imagine the next axis of scaling is speed. Just like we had LCM LoRA, there's going to be a speedy version of this; we're going to be wearing augmented reality glasses and reskinning reality in real time. We are not far away from that future, and it's not just going to be dumbly updating pixels for you: given these massive context windows, we'll throw in the context of, hey, here's the 3D facade geometry of what's around you, here's all the Google Maps data for what's around you, here are the businesses and the POIs, the points of interest, around you, and the model can pull on those insights to create the video generation. Super, super magical. Okay, so to wrap things up: y'all know me, I've been into visual effects since I was an 11-year-old kid; I grew up on After Effects and Maya and all these amazing tools. There are a lot of people calling this the death of art and creativity and all that type of stuff, and I know this moment is anxiety-inducing if you view it as a comparison to the status quo way of doing things, the five or six or seven tasks that you're probably billing for. But if you take for granted that technology is going to change, and that you have agency in how you apply this stuff, what that means is you can do amazing things for your clients: give them far higher quality, because you're not wasting your time on the five or six different steps, and you can now offer them a bigger package and do even more for them in the same amount of time, or perhaps even less. So look, where this tech is today, it's still a toy, but all amazing technology starts off as a toy; think of where image generation was just two years ago. For now, the way you should think about GPT-4o, especially as it becomes available as an API, or if you're using Gemini's own image generation capability, which we've covered in detail in this previous video, is as a general-purpose node. A bunch of these nodes are now replaced with one general-purpose node that has intelligence to boot, that you can talk to like a creative designer.
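To make that "general-purpose node in a multi-step workflow" idea concrete, here's a minimal sketch that generates an image and then runs one boring classical node on the result (a Pillow crop-and-sharpen pass for a 16:9 thumbnail). It assumes the gpt-image-1 Images API is available to you; you could swap Gemini's image generation, or anything else, into that first step.

```python
import base64
import io
from PIL import Image, ImageFilter
from openai import OpenAI

client = OpenAI()

# Node 1: the general-purpose, talk-to-it-like-a-designer node (assumed gpt-image-1 access).
result = client.images.generate(
    model="gpt-image-1",
    prompt="Hyperrealistic YouTube thumbnail: creator pointing at a glowing robot, bold text 'AI IMAGES'.",
    size="1536x1024",
)
img = Image.open(io.BytesIO(base64.b64decode(result.data[0].b64_json)))

# Node 2: a classical node, the kind of step you'd still keep in the pipeline.
w, h = img.size
target_h = int(w * 9 / 16)                      # center-crop to 16:9 for a YouTube thumbnail
top = (h - target_h) // 2
thumb = img.crop((0, top, w, top + target_h)).resize((1280, 720))
thumb = thumb.filter(ImageFilter.UnsharpMask(radius=2, percent=120))
thumb.save("thumbnail_final.jpg", quality=90)
```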
What would you tell that node? How would you use it? That's the assignment for you: think about how you're going to apply this to your own workflow. There are really amazing options for ideation, of course, but even for final editing; I think there are really cool things you can do once you start taking these general-purpose components and putting them together in a multi-step workflow, including other classical components as well. So to wrap things up: if you're attached to your craft, this is an anxiety-inducing moment, but if you're thinking like a creative director, you will be thrilled to command an army of robots to do your bidding. That's it for this video, Bilawal signing off, and I'll see y'all in the next one. Cheers.