In this video we're going to look at using the models that are hosted on Hugging Face in LangChain: one, through the Hugging Face Hub, which is quite common and which we've done a number of times in other videos, and two, locally, so you can see how to load these models yourself and use them in LangChain. You'll see that I'm installing LangChain and the Hugging Face Hub package, and because I'm actually going to load the models locally I also need the Transformers library, plus Sentence Transformers for the embeddings later on.

So there are two ways to use the models. The first way is the most common way you'll see people talk about, and that's just using the Hugging Face Hub. That's just pinging an API, so you need your Hugging Face API token in there. The second way is loading the models locally, which works great for a lot of models, although for some models you'll find it's actually more optimal to ping the API if you can rather than host it yourself, because unless you've got a very fast GPU, hosting it yourself is probably going to be slow. But the Hub also raises some issues, because the Hugging Face Hub version doesn't support all the models: it only supports text-to-text generation models and text generation models. The text-to-text ones are basically the encoder-decoder models, things like BART and T5, while the text generation models are your decoder-only models, so things like the GPT models, GPT-2, etc. The thing you'll see is that there are times when the Hugging Face Hub doesn't work for some models, whereas loading them locally does, and that's what I want to show you here.

Okay, first off, let's go through the standard way of doing it with the Hugging Face Hub. Here you can see we're basically just setting up a large language model chain. We've created a simple prompt with a bit of chain-of-thought prompting in it: we pass in a question and we say "Answer: let's think about it step by step." Then I instantiate the LLM with HuggingFaceHub and the repo ID I'm passing in; the model I'm using in this case is Flan-T5 XL, and I can set my temperature and so on there. Then I can ping that model just by running the chain. Sure enough, if I ask it "What is the capital of France?", it comes back with "Paris is the capital of France... final answer: Paris." That shows us that not only is it getting it right, but there's chain of thought going on there. Another example is asking it a slightly more in-depth question: "What area is best for growing wine in France?" It answers that the best area for growing wine in France is the Loire Valley, talks a bit about that being in the south of France, and then gives us a final answer (just off screen there), which is the Loire Valley. So that's great if we want to use a Flan model or something like that.
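Roughly, the Hub setup looks like the sketch below. This is a minimal sketch rather than the exact notebook from the video; the prompt wording, the temperature value, and the environment-variable approach to the token are assumptions on my part.

```python
import os
from langchain import HuggingFaceHub, LLMChain, PromptTemplate

# The Hub route calls the hosted inference API, so a Hugging Face API token is needed.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_..."  # replace with your token

# A simple chain-of-thought style prompt (wording assumed).
template = """Question: {question}

Answer: Let's think about it step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Flan-T5 XL hosted on the Hub; the temperature value here is illustrative.
hub_llm = HuggingFaceHub(
    repo_id="google/flan-t5-xl",
    model_kwargs={"temperature": 1e-10},
)

llm_chain = LLMChain(prompt=prompt, llm=hub_llm)
print(llm_chain.run("What is the capital of France?"))
```

The nice thing about this pattern is that the same prompt and chain can be reused later with a locally loaded model, since LangChain treats both as interchangeable LLM objects.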
But what if we want to use something like the BlenderBot models? The BlenderBot models were created by Facebook as chat models; they're specifically trained on chat datasets, and there are a number of them in different sizes. The one I'm trying to access here is BlenderBot 1B distilled, which is distilled, I think, from either the 2.7-billion or the 9-billion parameter model, so it's distilled down from quite a big model. I set it up exactly the same way and, boom, it doesn't work. It's basically telling me that the Hub only supports text-to-text generation or text generation, and this model is what they call one of the conv AI models, conversational AI models. So you can see this is not ideal: we're trying to use this through the Hugging Face Hub and we can't. So let's look at using local models instead, and in a bit we'll try this one as a local model and see how it goes.

So what are the advantages of using a local model? One of the big ones is that you can fine-tune models and use them locally without having to put them up on Hugging Face; you can just load your model off your hard drive, that kind of thing. Another is GPU hosting: for a lot of the models on the Hugging Face Hub, unless you're paying Hugging Face you don't get a GPU version of the model, and even if you do, you might want to change which GPU it runs on. And then, like I said before, some of the models only work locally. We'll look at all of this as we go through.

So let's go back to that Flan model. I'm going to go for a smaller one here, flan-t5-large, just because I'm loading up a few different models and I don't want to overload the GPU. How are we going to do this? We're going to bring in the HuggingFacePipeline class. Pipelines are the part of Hugging Face that simplify the tokenization and everything else: you put it all in a pipeline, set some parameters on that pipeline, and then, based on what kind of pipeline it is, it generates the response. Here you can see we're bringing in flan-t5-large and setting up the tokenizer with an AutoTokenizer. Flan-T5 is an encoder-decoder model, so we've got both sides of the Transformer, which means we need AutoModelForSeq2SeqLM; that's basically just saying "an encoder-decoder model". You can load it in 8-bit, but you don't have to; you can just load it normally, depending on which model you pick, and if you've got enough GPU RAM you can just go for it.

Once you've got it loaded, you then need to set up the pipeline, and you need to tell it what kind of pipeline it is. Hugging Face has a whole bunch of different pipelines, for things like classification, named entity recognition, I think summarization, things like that as well, and currently, as far as I know, none of those are supported in LangChain. We pass in the model, we pass in the tokenizer, and we set a max length. Then, to use this just like any other language model, like we would with OpenAI or the Hugging Face Hub, we basically tell LangChain that our local LLM is going to be a HuggingFacePipeline, and we pass in the pipeline we've just created. Once we've done that, we can ping the local model just like we would any large language model. I can ask it "What is the capital of France?", and notice we've got no prompt here, this is just direct conditional generation from the model, and it comes out with "Paris". Then we can set it up with a large language model chain, passing in the prompt we created earlier on and the local LLM we just set up, and when I ask it "What is the capital of England?" it comes back with "The capital of England is London. London is the capital..." and so on.
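Here is a rough sketch of that local setup, assuming the flan-t5-large checkpoint and a max_length of 100 (the exact generation parameters used in the video may differ):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain import LLMChain, PromptTemplate
from langchain.llms import HuggingFacePipeline

model_id = "google/flan-t5-large"

# Flan-T5 is an encoder-decoder model, so we use the Seq2Seq auto class.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    # load_in_8bit=True, device_map="auto",  # optional 8-bit GPU loading (needs bitsandbytes + accelerate)
)

# "text2text-generation" is the pipeline type for encoder-decoder models.
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100,
)

# Wrap the pipeline so LangChain can treat it like any other LLM.
local_llm = HuggingFacePipeline(pipeline=pipe)

# Direct, unprompted generation.
print(local_llm("What is the capital of France?"))

# Or reuse the chain-of-thought prompt from before in an LLMChain.
template = """Question: {question}

Answer: Let's think about it step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=local_llm)
print(llm_chain.run("What is the capital of England?"))
```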
So the answer is London, and that shows you doing the same kind of thing locally that you were doing on the Hugging Face Hub.

You can also do this with decoder models, but you need to set them up a little differently. You can still use the AutoTokenizer (here I'm just going for one of the GPT-2 models), but then you use AutoModelForCausalLM, which is basically what loads up a decoder-only model. You can see that because it's a decoder-only model we have "text-generation" here, not the "text2text-generation" we had up above, so you need to look up the model you're using and check what kind of generation it does. For the pipeline I then set the model and the tokenizer just like before, and then I can query it. This model is GPT-2 medium, which is around 355 million parameters. We set it up and ask it the question just like normal, and it's not a great model (it's quite an old model now): it's talking about France, but it hasn't really answered our question, it's just generating text.

So let's go back to the BlenderBot model. We saw up above that it wasn't working through the Hub, and now we want to see if we can get it working locally. BlenderBot is an encoder-decoder model, so we bring it in, we use the AutoTokenizer like all of them, we use AutoModelForSeq2SeqLM, and then we set it up as a "text2text-generation" pipeline. So even though this is what they call a conv AI model, we can use text-to-text generation here. I'm not sure why it doesn't work on the Hugging Face Hub, but it certainly works locally. We've done the same things as before and brought it in. Now, this model is definitely not as big as some of the models we were using before, and it's not fine-tuned with any instructions or anything like that, it's just fine-tuned on chat. We ask it the same question and it gives us a very coherent, conversational response, although it doesn't seem to know the answer: "What area is best for growing wine in France?" "I'm not sure, but I do know that France is one of the largest producers of wine in the world." That's definitely an answer where we'd say, okay, yeah, we could imagine a human saying something like this. So that just shows you setting these up.

The other thing is that if you want to use an embedding model, you can do that locally too. Here you're making use of the Sentence Transformers package and models, and you can see we're bringing in a model that basically turns our text into a 768-dimensional vector. We set it up with HuggingFaceEmbeddings and just pass the model name in; it will download the model, and then we can embed a single string with hf.embed_query, or, if we want to embed a whole set of strings, we can use hf.embed_documents. So this is a way you can manually make the embeddings for things if you want to upload them to something like Pinecone or Weaviate for a semantic search task or something like that.
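Below is a rough sketch of those last three pieces: a decoder-only model, BlenderBot loaded locally, and local embeddings. The repo strings are my best guess at the checkpoints named in the video, and the max_length values and the sentence-transformers checkpoint (all-mpnet-base-v2, which produces 768-dimensional vectors) are assumptions on my part.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    pipeline,
)
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings

# Decoder-only model (GPT-2 medium): note the "text-generation" pipeline type.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
gpt2_pipe = pipeline(
    "text-generation",
    model=gpt2_model,
    tokenizer=gpt2_tokenizer,
    max_length=100,
)
gpt2_llm = HuggingFacePipeline(pipeline=gpt2_pipe)
print(gpt2_llm("What is the capital of France?"))  # tends to ramble rather than answer

# BlenderBot locally: it is an encoder-decoder model, so "text2text-generation" works
# even though the Hub rejects it as a conversational model.
bb_tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-1B-distill")
bb_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-1B-distill")
bb_pipe = pipeline(
    "text2text-generation",
    model=bb_model,
    tokenizer=bb_tokenizer,
    max_length=100,
)
bb_llm = HuggingFacePipeline(pipeline=bb_pipe)
print(bb_llm("What area is best for growing wine in France?"))

# Local embeddings via sentence-transformers (checkpoint assumed, 768-dimensional output).
hf = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
query_vector = hf.embed_query("What is the capital of France?")
doc_vectors = hf.embed_documents(
    ["Paris is the capital of France.", "The Loire Valley is known for its wine."]
)
print(len(query_vector))  # 768
```

Vectors produced this way can then be pushed into a vector store such as Pinecone or Weaviate for semantic search, as described above.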
Anyway, I just wanted to make this one quickly so you can see that you can actually use the Hugging Face models locally, and that for some of the models, like BlenderBot, it's currently the only way you can use them, and it definitely gets good results. You can play around with it, try these things locally, and then get a sense of what sort of GPU you'd want if you wanted to put this into production, etc. All right, if you've got any questions or anything, please put them in the comments. If you find this useful, please subscribe, and I'll be doing more videos like this in the future. Talk to you soon, bye.