Transcript for:
Web Scraping with Llama 3.1 8B Guide

All right, so here, instead of using GPT-4o mini or any other OpenAI model, we are going to choose Llama 3.1 8B, and as always, all we need to scrape any website is the URL and the fields. In this case we are going to scrape this website, scrapeme.live, which is a dummy website that we usually use to test our scraping scripts, and here we are going to scrape the names and the prices of these Pokémon. So let's get the URL, place it in here, add the name and the price fields, and click on Scrape. Once we do that, the scraping process starts, and if we go to the back end we can see the inference has already started: our Llama 3.1 is generating the JSON answer that we then use to display in our app. We end up with a successful scraping attempt, and as you can see here the total cost is $0 — it is free, and everything happened on our machine.

Okay, so now that we have seen that it does work with Llama 3.1, we need to talk about four things today. The first one is where you can find the code and how you can set this up on your own machine, because believe me, it was not easy to share the code with you guys: GitHub suspended my account without warning, and they didn't just suspend one project, they literally suspended the whole account. So I'm going to show you where you can see the code and how you can set it up on your own machine. After that, we are going to see how you can set up your own local Llama 3.1 server and how to actually start using it, so that you can run this universal web scraper with your local model. After doing that, we are going to see how to use it with Groq and also Gemini Flash, because the models we get with Groq and Gemini Flash are actually very important for our use case, and I'm going to tell you exactly why. And if you feel like you already know all of these things, just stay until the end, because this is where I would actually need your help. I got a lot of feedback last time, and it was very good because it gives me ideas on what I should add to this universal web scraper, and a big piece of feedback was pagination. Honestly, I don't have a clear idea on how to do it, so this is where I need your help, and I will share a couple of questions that I have in mind, so that we can take this from just a universal web scraper to a universal web scraper / crawler. So just stay until the end, because your feedback helps me tremendously in getting this project to the next level. With that being said, let's jump to my screen.

All right, so the first thing we are going to do to reproduce the project is create a new folder; let's call it ScrapeMaster 2.0. Let's open VS Code — you have to have VS Code with Python configured inside of it — and open the folder we have just created. Now we are going to go to this website. This is my website; it is not the prettiest of websites, guys, but don't worry, it is legitimately my website. It was just a project I was working on over a couple of weekends to see if I could launch a website for free, and it is literally free except for the domain name. Anyway, if you go to Downloads and then ScrapeMaster 2.0, you will find all the steps you need to reproduce the project. So the first thing we are going to do after creating the project is go to the terminal and type python -m venv venv to create our virtual environment.
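While the virtual environment is being created, here's a minimal sketch of the "URL plus fields in, JSON out" idea from the demo above. The prompt wording, the helper name, and the example prices are made up for illustration; this is not the exact code from scraper.py.

```python
import json

# Illustrative only: the fields the user types into the app...
fields = ["name", "price"]

def build_extraction_prompt(page_text: str, fields: list[str]) -> str:
    # ...get turned into an instruction that asks the model for pure JSON.
    return (
        f"Extract the fields {fields} for every product in the page content below. "
        "Answer with pure JSON only: a list of objects, one per product.\n\n"
        f"{page_text}"
    )

# A successful answer from Llama 3.1 8B would then look roughly like this,
# and the app just parses it and renders it as a table:
answer = '[{"name": "Bulbasaur", "price": "63.00"}, {"name": "Ivysaur", "price": "87.00"}]'
rows = json.loads(answer)
print(rows[0]["name"], rows[0]["price"])
```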
We let that finish, then we come back here and copy all the requirements — these are all the libraries that we need — create a new file called requirements.txt, and put all of the requirements inside of it. Then we activate the virtual environment that has just been created with venv\Scripts\activate, clear the terminal, and you can see the (venv) prefix in here. Then we run pip install -r requirements.txt so it installs all of our requirements. We let them install and we come back here.

As you can see, next we need to create a .env file. This is going to be very important; it is where we put all of our API keys. To actually get the values, go to platform.openai.com if you're going to use GPT-4o mini, do the same thing for Gemini (I think it's Google Cloud where you get the API key — it's very easy, just Google it), and get your API key for Groq if you want to use it. I'm going to do that quickly and put them in here, so these are my keys.

Let's go back, and now we need to download ChromeDriver. Just go here and choose the ChromeDriver build for your operating system and machine; for me it's win64, so I'm going to copy this link, paste it in here, and it will start the download. Then I put it inside the project I have just created: let me just open it with WinRAR and extract it in here, and I will have the ChromeDriver folder inside the project. That's very good.

Now let's go to the sixth step, which is where we create assets.py. So let's create that file, assets.py, and place all of this inside of it. I am so sorry, guys, that you have to copy all of this and don't have a button to just click and copy. I've actually tried to add that, but the platform where I am hosting my website is sites.google.com, and when you embed your code there you cannot have JavaScript — nothing is going to work — so even though I added copy scripts, they did not work. Let's create our second file, scraper.py, and put all of this inside of it; this is literally the heart of our application, where we have all of our functions. Then let's create our last file, which is streamlit_app.py, and as always copy all of this and put it inside. And that is basically it: now we can run the command streamlit run streamlit_app.py, and our application will be up and running.
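Here's a minimal sketch of how the ChromeDriver we just extracted gets used for a headless fetch, assuming Selenium is one of the installed requirements. The driver path and the helper name are illustrative; the real scraper.py may configure things differently.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def fetch_html(url: str) -> str:
    # Headless mode: the scrape runs without opening a visible browser window.
    options = Options()
    options.add_argument("--headless=new")
    # Point Selenium at the chromedriver we extracted into the project folder.
    service = Service("./chromedriver-win64/chromedriver.exe")
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

print(len(fetch_html("https://scrapeme.live/shop/")))
```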
Let's try GPT-4o mini first, just to see if it works, with something very simple like scrapeme.live. Let's place the URL in here — I want to scrape the name and the price — and click on Scrape. I am using the headless option, so it is not even going to open the website, but as you can see this is working great: we have the token counts, this is how much it cost, and the extraction is already working well. Let's choose Groq, for example, and see if it can do the same; it should be a bit faster. We're going to use Groq for the same scraping to see if it gives us the same results, and of course it is able to do the same thing — it is Llama 3.1 70B, so it has enough quality to give us the exact same result as GPT-4o mini — and as you can see here the total cost is zero, this is absolutely free. We can always see the tokens, and this is basically the extraction.

All right, so we're going to come back to Gemini Flash and Groq later on to talk about why they are important, but for the moment let's try Llama 3.1 8B and see if it works. And of course we have an error here, and that's because we haven't launched the server yet. We have two ways to do it: either we use Ollama or LM Studio. We need one of these programs to run a Llama 3.1 server locally on our machine. I'm going to choose LM Studio, just because it is so much more user friendly and I like it in general: you can see the inference, and starting the server is so much easier. So just go to lmstudio.ai — and this is totally free, guys — download it, and once you have it, open it on your machine. This is LM Studio, and here you're going to find Llama 3.1 literally being the first option. If that's not the case — if you're watching this video later on and a better model has come out — just search for "llama 3.1 8B", click on it, and you will see this lmstudio-community build, which always has one of the highest download counts, so just choose this one; I've tried it and it works just fine for me. Now look for a version that actually fits your machine. I already have a quantized version; it's not the highest quality, but it works just fine. If your machine is much better, choose a better variant — you can literally see whether your machine can handle a model or not; as you can see here, these are the models my machine can only partially handle, and those are going to be harder to run. Once you download your model, just go to Local Server, choose the model, and click on Start Server, and that's basically it, it's going to work just fine. Once you do that, you will be able to run Llama 3.1 as the model for your universal web scraper application.
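To give a rough idea of what talking to that local server looks like, here's a minimal sketch using the openai Python package: by default LM Studio serves an OpenAI-compatible API on localhost port 1234, the API key can be any placeholder, and the model name is just whatever you loaded, so don't treat this as the exact call scraper.py makes.

```python
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; port 1234 is its default.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    # Use whatever model identifier you loaded in LM Studio.
    model="lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You answer with pure JSON only."},
        {"role": "user", "content": "Extract name and price for each product from: <page content>"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```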
Okay, so now let's talk about why I added Groq and Gemini Flash to this application — why do we even need them? I've actually seen a comment saying that if this universal web scraper wants to scrape many websites at the same time, it can never scale. This is why I thought of Groq as our best bet in terms of speed: Groq helps by giving us the results as fast as possible. The first part — accessing the website and getting the data — is always going to be a problem, whether with traditional scrapers or with this new universal scraper, and when a traditional scraper gets to the website it only targets certain elements, but it still needs to wait for the website to load and everything. This is why I think that with Groq's speed of inference we will get almost the same total time as a traditional scraper. So, to answer this question of speed, Groq will be very important if we want to scrape many websites at the same time.

Now, why did I add Gemini Flash? Well, if you don't know about Gemini Flash pricing, it's actually very good. As you can see here, with Gemini we get 15 requests per minute free of charge, up to 1,500 requests per day, meaning that if you are scraping websites just for your own use and not at scale, you can use a really good model — it's closed source, not open source, but it will be free of charge. So if you don't mind that the data you're scraping is sent to Gemini 1.5 Flash, this is actually a very good alternative, and that's why I added Gemini 1.5 Flash here. When we use it and click on Scrape, the scraping of course works, and even though you will see a cost of 0.002, it is actually free because I am not over that daily limit; there's an asterisk that should be here saying that as long as you don't go over the limit, this is free. Also, even though GPT-4o mini is very cheap, Gemini 1.5 Flash is half the price, so it's even cheaper if you do go over the limit, and Gemini 1.5 Flash is not bad with unstructured data. Meaning that for our use case, I think the best right now is Gemini 1.5 Flash and Groq's Llama if you don't care about running it on your own machine; if you do, then you go with a local model like Llama 3.1.

Okay, so now that we have talked about the best models to run our universal web scraper on, let's talk about the last point, which is another feature a lot of you guys have asked about: pagination. Here I have a couple of questions, because honestly I don't know the best way to implement pagination inside this application. I was thinking of adding a little toggle in here, or a radio button or something, that says you want to scrape multiple pages — for example, scrape the URLs of these pages — and if the user chooses to do that, I would have a fine-tuned Llama 3.1 model, or Gemini Flash, or something like that, and this model's only job would be to give us a table with the URLs it detects, following the pattern of these pages. That's what I was thinking about, but again, there are websites that do not have this feature. This is a very simple website, and we can see that here we have page two, page three, page four, so we can detect these URLs, and we could even have a little function that only pulls the URLs out of the page's markdown and gives those to the model, to use fewer tokens and make it more efficient. Doing that, we can get the URLs, show them to the human, the human checks whether the URLs are correct or not, and then the user can choose to launch the scraping on all of the pages if they wish. But as I said, there are websites that do not give you this simple tree of pages, and it's going to be very hard to scrape those. So what do you guys think about this? How can we approach it in a universal way — not just for one website, but working on a maximum number of websites without failing to detect the pagination patterns?
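To make that "little function that only gets the URLs from the page" half of the idea concrete, here's a minimal sketch. The helper name and the pagination pattern check are made up, and the step where a small model (or the human) confirms the candidates is left out.

```python
import re
from urllib.parse import urljoin

def candidate_page_urls(base_url: str, html: str) -> list[str]:
    # Pull every href out of the page, resolve it against the base URL,
    # and keep only links that look like pagination (.../page/2/ or ...?page=3).
    hrefs = re.findall(r'href="([^"]+)"', html)
    urls = {urljoin(base_url, href) for href in hrefs}
    return sorted(u for u in urls if re.search(r"/page/\d+|[?&]page=\d+", u))

# These candidates could then be shown to the user to confirm before
# launching the scrape on every page, or handed to a small model instead.
```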
I read all the comments, guys, and I'll see if you can give me another idea on how we can implement this pagination. I've seen a couple of comments with really good ideas, but I still don't know how to approach it in a universal way. Anyway, that has been it from me, guys. Thank you so much for watching; if you like the content, drop a like and subscribe, it really does mean a lot, and I will catch you guys next time. Peace.