Transcript for:
Overview of Llama 3.2 and Meta Connect

Meta Connect just happened and Meta just dropped Llama 3.2. We have new models, new sizes, vision capabilities, and so much more, so that's what we're going to go through today. And thank you to Meta for partnering with me on this video. I'm going to talk about the highlights right away, get you that information immediately, and then go more in depth on these topics in a moment.

First, Llama 3.2, that's the big news. Llama 3.1 was a huge improvement over Llama 3.0, and now we have 3.2. What's different about Llama 3.2? Well, now Llama has vision. Llama can actually see things, and that is an incredible update to the Llama family of models. We have an 11 billion parameter version and a 90 billion parameter version of the new vision-capable models, and these are drop-in replacements for Llama 3.1, which means you don't have to change any of your code if you're already using it. You don't really have to change anything; you simply drop in these new models. They're different sizes, but they have all the capabilities of the text-based intelligence, and now they also have vision-based intelligence.

They also dropped two text-only models that are tiny: 1 billion and 3 billion parameters. These are specifically made to run on edge devices. Now, if you've been watching my videos at all, you know I really believe in AI compute getting pushed to edge devices. And what are edge devices? Cell phones, computers, Internet of Things devices, basically anything that's not in the cloud. I truly believe more and more AI compute is going to be pushed to edge devices, and this is a huge step in that direction. Models are becoming much more capable at a much smaller size, and that's what we're seeing here with the Llama 3.2 1 billion and 3 billion parameter text-only versions. These are pre-trained and instruction-tuned, ready to go, so I can imagine them fitting easily into the Meta AI Ray-Ban glasses. The 1 billion and 3 billion parameter versions have 128K context windows out of the box, and they are state of the art compared to their peers on use cases like summarization, instruction following, and rewriting tasks, all again running locally. This again confirms what I really believe the future of AI looks like: a bunch of really small, capable, specialized models that can run on device, and these models are really good at exactly those types of tasks. If you remember when I worked with Qualcomm on that video, Qualcomm was very much about pushing AI compute to edge devices, and of course Meta is partnered with Qualcomm on this. These models are ready to go out of the box, optimized for Qualcomm and MediaTek processors, and, as I said, supported by a broad ecosystem.

The Llama 3.2 11B and 90B vision models are drop-in replacements for their corresponding text-model equivalents while exceeding on image understanding tasks compared to closed models such as Claude 3 Haiku. Now, you know I'm going to be testing all of these models in subsequent videos, so make sure you're subscribed to see those tests. Additionally, unlike other open multimodal models, both pre-trained and aligned models are available to be fine-tuned for custom applications using torchtune and deployed locally using torchchat, and they're also available to try using Meta's smart assistant, Meta AI. It's clear that Meta is investing a ton into their ecosystem, building out the tooling to fine-tune and the services to host, basically everything you need to have an open-source model in your personal life or your business.
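To make that "drop-in" point a little more concrete, here's a minimal sketch of what calling one of the small instruction-tuned models might look like with the Hugging Face transformers library. The exact repo name is an assumption based on how earlier Llama releases were published, so check the official model card, and note you need to accept the license and be logged in to Hugging Face.

```python
# Minimal sketch: chatting with a small Llama 3.2 instruct model locally.
# The model ID below is an assumed repo name; confirm it on the model card.
# Requires transformers + accelerate and a `huggingface-cli login` session.
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # assumed repo name

generator = pipeline("text-generation", model=model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize: Llama 3.2 adds 11B/90B vision models "
                                "and tiny 1B/3B text-only models for edge devices."},
]

# The text-generation pipeline applies the model's chat template to the
# message list before generating, then returns the full conversation.
output = generator(messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])  # assistant reply
```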
They're also releasing their first Llama Stack distributions, a set of tools developers can use to work with the Llama models and build everything around the core LLM that's necessary for production-level applications. Meta describes Llama Stack as a way to greatly simplify the way developers work with Llama models in different environments, including single node, on-prem, cloud, and on-device, enabling turnkey deployment of retrieval-augmented generation and tooling-enabled applications with integrated safety. And looking at the open-source Llama Stack GitHub repo, here are the things it supports: inference, safety, memory, agentic system, evaluation, post-training, synthetic data generation, and reward scoring, and each of those has a REST endpoint that you can use easily. You can download Llama 3.2 from llama.com or Hugging Face, and it's going to be available on some of Meta's cloud partners, including AMD, AWS, Databricks, Dell, Google Cloud, Groq, IBM, Intel, Azure, NVIDIA, Oracle Cloud, Snowflake, and more.

All right, now let's look at some of the benchmarks. Here are the benchmarks in this column, and in this row the different models being compared. Here's Llama 3.2 1B and Llama 3.2 3B versus Gemma 2B and Phi-3.5-mini, so this is comparing the small on-device models, and as we can see, the Llama 3.2 3B model performs incredibly well versus its peers in the same class: MMLU at 63, GSM8K at 77, the ARC Challenge at 78, and for tool use, Nexus and BFCL v2. Really, really good for being such a small model. Now let's look at the larger variants that have vision enabled. Here we have Llama 3.2 90B and 11B compared against Claude 3 Haiku and GPT-4o mini, and the Llama 3.2 90B seems to be the best in class almost across the board.

So let's test the tiny model first. I'm on groq.com, Llama 3.2 1B preview right there, and let's see how fast this thing goes. "Write me a story." Oh my God, 2,000-plus tokens per second, look at that. Let's give it something a little more specific now, just to see if it can do it: "Write the game Snake in Python." Okay, there it is, 2,000 tokens per second, and we'll see if it actually works. Oh, look at that, it worked. Unbelievable. With 2,000 tokens per second and a total output time of less than one second, a 1 billion parameter model got the Snake game on the first try. Very, very impressive.

I'm going to save the vision test for another video, but for now let me tell you a little more about it. The two largest models of the Llama 3.2 collection, 11B and 90B, support image reasoning use cases such as document-level understanding, including charts and graphs, captioning of images, and visual grounding tasks such as directionally pinpointing objects in images based on natural language descriptions. For example, a person could ask which month in a previous year their small business had the best sales, and Llama 3.2 can reason over an available graph and quickly provide the answer. Now, I definitely want to try Where's Waldo with this vision model. As the first Llama models to support vision tasks, the 11B and 90B models required an entirely new model architecture that supports image reasoning. To add image input support, they trained a set of adapter weights that integrate the pre-trained image encoder into the pre-trained language model, so it's built right into that core model, but they used a new technique to do so: the adapter consists of a series of cross-attention layers that feed image encoder representations into the language model.
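The cross-attention adapter idea is easy to picture in code. Here's a toy PyTorch sketch, not Meta's actual architecture, just the general shape: the image encoder's features act as keys and values, the language model's hidden states act as queries, and the gated residual lets the layer start out as a no-op so text-only behavior is preserved.

```python
# Toy illustration of one cross-attention adapter layer (not Meta's code).
# Language-model hidden states attend over image-encoder features; in the
# recipe described in the post, layers like this are trained while the
# language model's own weights stay frozen.
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    def __init__(self, text_dim: int, image_dim: int, n_heads: int = 8):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, text_dim)  # map image features into text space
        self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)
        self.gate = nn.Parameter(torch.zeros(1))          # starts at zero: layer is initially a no-op

    def forward(self, text_hidden, image_feats):
        # text_hidden: (batch, text_len, text_dim); image_feats: (batch, img_tokens, image_dim)
        kv = self.image_proj(image_feats)
        attended, _ = self.attn(query=text_hidden, key=kv, value=kv)
        # Gated residual: only the adapter's parameters change the text path.
        return self.norm(text_hidden + torch.tanh(self.gate) * attended)

# Example shapes
adapter = CrossAttentionAdapter(text_dim=4096, image_dim=1024)
text_hidden = torch.randn(2, 16, 4096)
image_feats = torch.randn(2, 256, 1024)
out = adapter(text_hidden, image_feats)  # (2, 16, 4096)
```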
They trained the adapter on text-image pairs to align the image representations with the language representations. During adapter training they also updated the parameters of the image encoder, but intentionally did not update the language model parameters; by doing that, they keep all the text-only capabilities intact, providing developers a drop-in replacement for Llama 3.1 models. So again, it's going to be just as good as the Llama 3.1 text models, but now it also has vision. If you want to read more about the details of how they actually achieved this, I will drop links to everything in the description below.

In post-training, they did several rounds of alignment with supervised fine-tuning, rejection sampling, and direct preference optimization (DPO). They leveraged synthetic data generation by using the Llama 3.1 model to filter and augment questions and answers on top of in-domain images. So synthetic data is here, it is ready, and Llama 3.1 is capable of it, let alone Llama 3.2. They also used the larger Llama 3.1 as a teacher model to teach a much smaller version, and that's how we got the 1 and 3 billion parameter Llama 3.2 versions. They used two methods, pruning and distillation, on the 1B and 3B models, making them the first highly capable lightweight Llama models that can fit on devices efficiently (a rough, generic sketch of that teacher-student idea is at the very end of this transcript). I am 100% behind on-device AI compute.

So that's it. Congrats to Meta on another fantastic open-source release. I am going to be testing all of these different models, and I'm going to create two different test videos: one for the text intelligence and one for the vision intelligence. Thanks again to Meta for partnering with me on this video. If you enjoyed this video, please consider giving a like and subscribing, and I'll see you in the next one.
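As promised above, here is a rough, generic knowledge-distillation sketch. This is illustrative only, not Meta's actual pruning-and-distillation recipe: a small student model is trained to match the softened output distribution of a larger frozen teacher, on top of the usual next-token loss.

```python
# Generic knowledge-distillation loss (illustrative only, not Meta's recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL divergence between softened teacher and student distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth tokens.
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Dummy example: batch of 2 sequences, 8 tokens each, vocab of 32,000.
student = torch.randn(2, 8, 32000, requires_grad=True)
teacher = torch.randn(2, 8, 32000)
labels = torch.randint(0, 32000, (2, 8))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```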