I'm about to show you the easiest way to scrape any website you want using Crawl4AI. You can scrape data from multiple different sources, do some data cleansing between them, or even create smarter RAG AI agents based on the data you now have saved in a personal database from scraping it. And this helps you not have to rely completely on an LLM to already know how to do everything. So, in this video, I'm going to show you how fast you can scrape any website. Now, some bonus points: this video will slowly turn into how to build a full REST API where you can type in any website URL you want and we're going to scrape all the content from that site. If you're new to the channel, I'm Eric Roby, a software engineer with over a decade of experience, and I've helped over 100,000 developers learn and grow within their craft. Now, with that, let's go ahead and jump into some code.

All right, so I already have an environment created for this tutorial with a couple of different main.py files that we're going to go through, which just keep increasing the skill level with Crawl4AI and let us do different things, and we're going to go through everything step by step. But to get started with Crawl4AI, you need to do a pip install of crawl4ai. That's going to download pretty much everything we need to get crawling. It's really that easy. And we can plug LLMs into it to generate different things for us, but we're not going to do that right now. Once you install, you're going to get a whole bunch of dependencies installed. We now need to do a setup. The setup just makes sure that you have the headless browsers and all the configuration on your computer to actually run the crawl. We have all the dependencies for our code to run, but it's going to want to open up a headless Chrome browser, which means you can't see what's going on, but it's used in the background to run the crawl. And really, you can just do this by typing in crawl4ai-setup. That will do all the setup automatically. It probably won't take too long on my machine because I've already done it before, but on yours, it might need to install certain things. So, we can see that it's running the post-installation setup, installing the Playwright browsers, which is what we're using for our crawl, database initialization, all this kind of stuff, getting ready to go.

All right. Now, what we're going to be crawling is my website, codingwithroby.com. Here I just have a little bit about who I am and what I do, with some reviews from students. We have my about page, which just talks about who I am. It has a newsletter page, which you can sign up for if you want to. Courses, where I sell online courses, and things like that. So, we're going to be scraping all of this data so we can get information about what's going on, and then you can see how you can use this on any other website.

So, to get started, if you go into main.py, right here we can see how easy it is to use Crawl4AI once you install it and run this setup. We have our main functionality, and we're just going to be running it asynchronously. We're going to be crawling my website, www.codingwithroby.com, and then we're just going to be printing result.markdown. So, let's go ahead and see what this would look like.
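For anyone following along, here's roughly what that main.py could look like. This is a minimal sketch based on the standard Crawl4AI quickstart pattern, not the exact file from the video; the URL is my site from the demo.

```python
# main.py -- minimal Crawl4AI crawl of a single page (a sketch of what the video shows)
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    # Open a headless browser session managed by Crawl4AI
    async with AsyncWebCrawler() as crawler:
        # Crawl one URL and print the page back as markdown
        result = await crawler.arun(url="https://www.codingwithroby.com")
        print(result.markdown)


if __name__ == "__main__":
    asyncio.run(main())
```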
We do a python3 main.py, just let it do its thing, and we can see, here it is. That's how quickly it crawled that entire page. We can see the reviews and we can see what's going on. Now, that is only crawling codingwithroby.com. So, when we went through here and we were looking at the about page, that's a different URL. That's /about. The newsletter page is /newsletter, courses is /courses, and login is /login. So what our current crawl did was only crawl codingwithroby.com itself. These are the links at the top, and then it just has all the words. There's so much you can do with web scraping, but this is just an example of how we can scrape one page really quickly.

Now, if we look at main2.py, we're going to do something a little bit different. What I showed you before is how you can crawl just codingwithroby.com, but a lot of the time you're not going to want to crawl just one page. You're going to want to crawl multiple pages. So, to crawl multiple pages, we need to set up a batch crawl. Our batch is going to use a browser config, so we can add a little bit more information. Crawl4AI already uses a headless browser by default, but we're setting our browser config to headless explicitly. We're going to set up our run configuration, which checks the cache, checks the robots.txt file, and says whether we want the results streamed, and we don't want them streamed right now. We just want to get all the results back at once.

Now, a quick note: when you are going to be crawling, I have to talk about the legality of web crawling. Right here, check_robots_txt. I don't have a robots.txt on my page, but a lot of websites do, like youtube.com. This file tells you what you are not allowed to scrape. If you don't enable this option, Crawl4AI will scrape whatever URL you put in there anyway. However, a lot of these companies have really good bot protection. What I mean by that is they can track your IP address. They can see where everything is going. So, if you're not respecting the robots.txt file, it is possible for their server to deny you access. So, if you took Crawl4AI, went to YouTube, and started hitting endpoints that they say they don't want you to hit, they might block your IP, and then all of a sudden you can't go to YouTube anymore. So, make sure you're using it with caution and you're following their robots.txt file. And YouTube wrote something funny right here: created in the distant future, the year 2000, after the robotic uprising of the mid-'90s started wiping out humans. So anyway, make sure that you are following the robots.txt file, and that is right here. That will let Crawl4AI follow it for you.

We're also going to set up our memory dispatcher, because we're doing more than one URL and we want to make sure we have memory allocated for the crawl. We are now passing in multiple URLs: the normal codingwithroby.com, plus about, contact, privacy, and terms. We're then going to run our async web crawler, passing in our URLs as a list. Once we get our results, they come back as a list of results, one for each URL. We then loop through all the results and process each one, which really is just pulling out the pieces we care about and combining them into one summary.
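Here's a rough sketch of what that main2.py might look like. The class names (BrowserConfig, CrawlerRunConfig, CacheMode, MemoryAdaptiveDispatcher) follow the Crawl4AI docs, and the import path for the dispatcher may vary between versions; the video also wires a CrawlerMonitor into the dispatcher for the live progress display, which is omitted here. The process_result helper and the exact URL list are my reconstruction of what's described, not the literal code from the video.

```python
# main2.py -- batch crawl of several pages (a sketch; names follow the Crawl4AI docs)
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

URLS = [
    "https://www.codingwithroby.com",
    "https://www.codingwithroby.com/about",
    "https://www.codingwithroby.com/contact",
    "https://www.codingwithroby.com/privacy",
    "https://www.codingwithroby.com/terms",
]


def process_result(result) -> str:
    """Collapse one crawl result into a short, readable summary."""
    preview = str(result.markdown)[:150]                  # first 150 characters of the page
    internal = len(result.links.get("internal", []))
    external = len(result.links.get("external", []))
    return (
        f"{result.url} -> status {result.status_code}\n"
        f"Preview: {preview}\n"
        f"Metadata: {result.metadata}\n"
        f"Internal links: {internal} | External links: {external}\n"
    )


async def main():
    browser_config = BrowserConfig(headless=True)         # explicit headless browser
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,  # cache behavior; the exact mode used in the video isn't shown
        check_robots_txt=True,         # respect each site's robots.txt rules
        stream=False,                  # collect all results at once instead of streaming them
    )
    dispatcher = MemoryAdaptiveDispatcher()  # throttles concurrency based on available memory

    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(urls=URLS, config=run_config, dispatcher=dispatcher)
        for result in results:
            print(process_result(result))


if __name__ == "__main__":
    asyncio.run(main())
```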
So, we're grabbing the markdown and trimming it to only 150 characters instead of printing the whole web page. We're grabbing the metadata, which gives you all the metadata key and value pairs. And then we're just going to report how many internal links and how many external links are in our web scrape. So if we run python3 main2.py, it gives us this nice little performance monitor. That performance monitor, which gives us an overview of the websites getting scraped, comes from up here where we set up the monitor: the crawling monitor displays detailed information, and that's this. Then we can see that we're getting status code 200, because we're printing the status code right here with the processing information. This is the content preview of 150 characters, the metadata right here, and then how many external and internal links we have, and I'm doing that for each one. So this is codingwithroby.com, /about, /terms, /privacy, and /contact. That's all the information we might want to web scrape.

Now, this is how you can just run it in your code. But a lot of the time you're going to want this inside an application or behind an API endpoint, where you can just pass in the URL without having to think too much about it. So I'm going to show you how we can set this up using FastAPI, where you pass in a single URL and we grab everything that website contains based on the URL we're passing in. So what I'm going to do here is a pip install for fastapi and uvicorn, and then I'm going to jump into main3.py. Now, these import warnings will go away once our packages are installed.

Here we can see that we're importing a lot more stuff. We have our CrawlResponse now, which is going to have our URL, status code, preview, metadata, and internal and external links. It's the same information we just showed in main2, but we're going to use Pydantic for data validation of the response. We then have one endpoint, a crawl, where we pass in a URL and it does the exact same crawl, but just on a single URL. So, we're making sure the URL comes in as an HttpUrl, but then we convert it to a string right here, because that's what the crawler wants when we run the crawl. And then we do the same thing, where we just process the result, print it, and return it back as a CrawlResponse.

So right here, if I say uvicorn main3:app --reload, go to our URL, add /docs, and come in here, I can now say try it out and pass in a URL. So I'm going to say https://www.codingwithroby.com and hit execute. That runs our code, and here we can see our response body. We have our URL, our status code, our metadata, and our internal and external links. Just like that, it returned exactly what we had before, but it's only returning one URL instead of all of them. So, we learned in main2.py how you can return a list of URLs, and here we're only returning one. Again, we get that performance monitor in our terminal, which is pretty helpful if you can get access to it after deployment or something like that. But there's something else that's really nice too, so I'm going to shut this down real quick so we can look at the other browser.
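Before moving on, here's roughly what that main3.py could look like. It's a sketch of what's described in the video: the CrawlResponse model comes from the video, while the endpoint path, HTTP method, and individual field names are my assumptions.

```python
# main3.py -- wrap the single-URL crawl in a FastAPI endpoint (a sketch, not the video's exact code)
from typing import Optional

from crawl4ai import AsyncWebCrawler
from fastapi import FastAPI
from pydantic import BaseModel, HttpUrl

app = FastAPI()


class CrawlResponse(BaseModel):
    """Pydantic model that validates what the endpoint returns."""
    url: str
    status_code: Optional[int]
    preview: str
    metadata: dict
    internal_links: int
    external_links: int


@app.post("/crawl", response_model=CrawlResponse)
async def crawl(url: HttpUrl):
    # Pydantic validates the incoming URL; Crawl4AI wants a plain string
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=str(url))

    return CrawlResponse(
        url=str(url),
        status_code=result.status_code,
        preview=str(result.markdown)[:150],
        metadata=result.metadata or {},
        internal_links=len(result.links.get("internal", [])),
        external_links=len(result.links.get("external", [])),
    )
```

Run it with uvicorn main3:app --reload and open /docs to try the endpoint from the interactive docs page, just like in the demo.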
Now, another thing that applications are going to have, and we can see this in youtube.com/robots.txt, is that it's going to tell us about sitemaps. A sitemap tells you all the accessible, available endpoints that you can crawl. So, if we come over here and go to sitemap.xml, it shows us a whole bunch of different URLs that are accessible from YouTube. Now, you still need to follow the robots.txt file, but it tells you all the different URLs available. So, if you go to codingwithroby.com and add /sitemap.xml, you can see all of the URLs that are associated with my website as of right now.

So, now that we know the sitemap.xml has all the access points for all the different URLs, we can allow a user to pass in a sitemap, and then we'll scrape all the information based on the sitemap instead of basing it on a specific URL. So if we go into main4.py, we have the same thing where we're creating a FastAPI application. We're using our CrawlResponse, but this time the URL must be a sitemap. And then once we get the sitemap, we crawl all the information based on the URL we pass in. We're going to pull out the locations. If we go back to our browser, we can see each entry in the XML has a <loc> element right here. We can split on the <loc> tags, pull out the data, and then go ahead and crawl those URLs right here. So, we're crawling twice, right? We crawl the sitemap XML, grab all the location entries, get all the pieces of information, and then we crawl again for everything that was in the sitemap. We then return it all. And right here we can see we're doing the same thing afterwards, where we grab all the information for the internal and external links. You can do whatever you want with links; you can split them and clean the data however you want. And then we just return this. (A rough sketch of what main4.py might look like is included at the end of this section.)

So I'm going to go back in here and say uvicorn main4:app --reload, and open this back up. I'm going to refresh it. But this time, if I pass in https://www.codingwithroby.com, it's going to fail, because it needs to be a sitemap. So we add /sitemap.xml. Now, if we execute this, it crawls the sitemap XML to grab all the URLs out, then it crawls each URL inside of there, and then it breaks it up just like we did before, all clean so we can see it. So, we can see "clean code effective functions", that's a blog post, and there's some information inside here. The next one, "clean code rules of comments". It starts with all my blog posts right here, telling us a ton of information about each one. Then we have our normal codingwithroby page. It just went through the sitemap, extracted the different URLs, and then we crawled each one specifically.

And there is so much power in this. When you are crawling websites and getting the data, you can really do so many cool things, especially with AI agents becoming so popular and being able to create RAG systems that decrease hallucination based on data you have saved in a database. Or, if you just want to build a more traditional website where you have all this data and you can show reviews and everything. Web crawling is so, so powerful. So, I hope you were able to learn something from this awesome technology that's now out.
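And as promised, here's a rough sketch of what that main4.py sitemap endpoint could look like. The endpoint path, the naive <loc> string splitting, and reading the sitemap body from result.html are all my assumptions; depending on the Crawl4AI version and how the headless browser renders raw XML, you may need to pull the sitemap text from a different field or fetch it with a plain HTTP request instead.

```python
# main4.py -- crawl every page listed in a sitemap.xml (a sketch of the approach described in the video)
from crawl4ai import AsyncWebCrawler
from fastapi import FastAPI, HTTPException
from pydantic import HttpUrl

app = FastAPI()


@app.post("/crawl-sitemap")
async def crawl_sitemap(url: HttpUrl):
    sitemap_url = str(url)
    if not sitemap_url.endswith("sitemap.xml"):
        raise HTTPException(status_code=400, detail="URL must point to a sitemap.xml")

    async with AsyncWebCrawler() as crawler:
        # First crawl: fetch the sitemap itself and pull every <loc> entry out of the XML
        # (naive string split; assumes the sitemap body shows up in result.html)
        sitemap_result = await crawler.arun(url=sitemap_url)
        locations = [
            chunk.split("</loc>")[0].strip()
            for chunk in sitemap_result.html.split("<loc>")[1:]
        ]

        # Second crawl: every URL the sitemap listed
        results = await crawler.arun_many(urls=locations)

    # Summarize each page the same way as before: preview, metadata, link counts
    return [
        {
            "url": result.url,
            "status_code": result.status_code,
            "preview": str(result.markdown)[:150],
            "metadata": result.metadata or {},
            "internal_links": len(result.links.get("internal", [])),
            "external_links": len(result.links.get("external", [])),
        }
        for result in results
    ]
```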