Transcript for:
Efficient Web Scraping with AgentQL

Hi there! In today's video I will show you how you can scrape products from multiple different online stores with the same exact Python code. I will be using Playwright along with a really cool new AI-powered web scraping tool called AgentQL, which is also the sponsor of today's video. What AgentQL does is remove the need for digging through a website's source code to find a CSS selector or an XPath identifier in order to click an element or extract information from it. Instead, you use AgentQL, or agent query language, to do it with AI, which basically means you can use the same exact scraping code on multiple different websites, even if they look different. It also makes your scraper resilient to changes, because you're not tied down to a CSS selector; instead, you're using AI to scrape. And you can transform or format the data with natural language. For example, if there's a date on the website and you want it in a different format than the one shown, you can just ask for that. Or if there's a price, you can decide whether to include the currency symbol or not. It's really fast to learn with their quick start guide, or just by watching this video, and it takes just a few minutes to set up. So if you want to try it out, you can go to agentql.com, click on Get API Key, and get started right away.

But now, let's actually see how this works in practice. Let's go to VS Code and create a new file. Let's call it main.py, and we're going to start with just basic Playwright boilerplate code. From playwright.sync_api we're going to import sync_playwright. Then, with sync_playwright() as p, we're going to do browser = p.chromium.launch(), and let's say headless is false so we actually see what's going on. What we normally do in Playwright is page = browser.new_page(), but to extend this with AgentQL, we need to import agentql and then simply wrap that new_page() call with agentql.wrap(). And that's all you have to do; now you have all the extra capabilities of AgentQL in Playwright.

So let's say we wanted to scrape this website. We want all these programmer t-shirts from this page, so we want the names and the prices. How do we do that? Well, first of all, let's go to the page. Let's do page.goto() and paste in the URL. Then we can call page.query_data(), and in here we put our agent query language. We create a multiline string, and we always start the query with curly braces; inside, we describe the data we want to extract. In this case we want the products, and we can add square brackets to make it a list. Then we put curly braces again and say that we want to extract the name and the price of each product. And that's pretty much it. So instead of digging through the source code to find a CSS selector, we just do this. And these are not CSS selectors or even keywords built into AgentQL; this can be anything. You can put whatever you want in here. For example, if we also want to scrape the name of the creator of each t-shirt, we can just add another field, creator, and now it will scrape the creator field too. And how we actually extract the information is we say response = page.query_data(...), and the response will be a dictionary, so the products are going to be response["products"].
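Here is a minimal sketch of the code built up so far, assuming the AgentQL Python SDK alongside Playwright's sync API; the store URL is a placeholder, since the video doesn't spell out the real one:

```python
import agentql
from playwright.sync_api import sync_playwright

# The AgentQL query: plain field descriptions, not CSS selectors.
# Square brackets mark "products" as a list.
PRODUCTS_QUERY = """
{
    products[] {
        name
        creator
        price
    }
}
"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    # Wrapping the Playwright page adds AgentQL's extra methods.
    page = agentql.wrap(browser.new_page())
    page.goto("https://example-store.com/programmer-t-shirts")  # placeholder URL
    response = page.query_data(PRODUCTS_QUERY)
    products = response["products"]
```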
And then we can just loop over the products. Let's do for product in products and print each one: the name is going to be product["name"], and let's also do the creator and the price, and let's put one more print() in there to separate the entries from each other. We also have to call browser.close(). So now, if we run this code with python3 main.py, let's see what happens: it opens the Chromium browser, goes to the website we want to scrape, and then scrapes the information from it with AgentQL using AI. And here it is; we got all the information extracted from this website. We have the names, the creators, and the prices.

Now you might notice there's a dollar sign in the price. If we want it as an actual number, we can simply add parentheses after price and say "as a number", and then it will be a number. Let's try it out. We run it, it does the same thing again, opening the website and scraping the information, and this time the price will be a number. And here we have it: now the prices are numbers. You might also notice that the ordering of the products is different, but that's actually the website's doing, because it orders them differently on every page load.

But now let's try to scrape the next pages as well, because we don't want just the first page, we want all of the products. So how would we do this? Let's take a look at the page. We have these search results here, and down here we have this page selector, where we can click on the second page. So let's do that with AgentQL. We can call page.query_elements(), which is similar to query_data(), but instead of returning a dictionary, it returns the actual elements from the page, and then we can interact with them. So here we again pass in a multiline string starting with curly braces, and we say that in the page selector, we want to get the next page link. And that's basically all we have to do. But to make this more compatible with multiple websites, it's better to add some more context. One thing you can do to add more context is simply include more of the surrounding elements. For example, on this page we have the product listing, the programmer t-shirts heading, and the Popular and Newest buttons. So we could add the heading here, along with a newest link, a popular link, and a product listing. This just gives it a bit more context as to where on the page you're trying to find this next page link.

But in this case there's actually an even better way, so we don't have to do any of that. We're of course going to put this in a loop, so let's write a while True loop, set the page number to 0, do page_number += 1, and put all of this inside the loop. Let's also add a limit: if the page number is greater than or equal to 3, we break. And now that we have the page number, we can add it as context to this query: we can write "link to page" followed by page_number + 1. Of course, this has to be an f-string, so now we have to escape all of the curly braces in the query.
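A sketch of the loop as described, which slots into the script above; the transcript doesn't pin down exactly where the page-limit check sits, so placing it after the scrape is an assumption here. Note how the query's literal braces are doubled inside the f-string:

```python
page_number = 0
while True:
    page_number += 1

    # Scrape the current page with the query from before,
    # now using price(as a number) for a numeric value.
    response = page.query_data(PRODUCTS_QUERY)
    for product in response["products"]:
        print(f"name: {product['name']}")
        print(f"creator: {product['creator']}")
        print(f"price: {product['price']}")
        print()

    if page_number >= 3:  # only scrape the first few pages
        break

    # Literal braces must be doubled inside an f-string, so the
    # query's own braces are escaped as {{ and }}.
    pagination_query = f"""
    {{
        page_selector {{
            next_page_link(link to page {page_number + 1})
        }}
    }}
    """
    # ...finding and clicking the link comes next in the video
```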
And how we get the next page link: we can say response = page.query_elements(...), and then the next page is response.page_selector.next_page_link. Note that this is not a dictionary; it's actually an object, so we use dot notation instead of bracket notation. Then we can check: if we have the next page, we do next_page.click(). Otherwise we can just break, because if there's no next page button anymore, that was the last page.

So let's see if this thing works. Let's run it again, and I know it's not going to work, because there's a problem with this website, which we'll fix. First it opens the browser and scrapes the products from the first page, which works just fine. But as you can see, we have this stupid pop-up here, which means we can't actually click on the next page button. So we have to solve this issue, and we can in fact solve it with AgentQL. Let's do this: we get the next page link, and if we don't have it, we just break. Then, before we click on the next page link, we check if we have a pop-up. We can say close_popup = page.get_by_prompt(...); this is a third method for getting elements on the page. Let me just put a prompt in here: "the button to close the popup if there is one". And it's actually intelligent enough to just return None if it can't find the element. So we can say if close_popup: if we have a button to close the pop-up, then we're just going to click it with close_popup.click(). And we should probably wait for a second or so in case there's some animation after the pop-up closes, so we do page.wait_for_timeout() and pass in a thousand milliseconds, which is a second. And since we phrased this prompt in a very general way, it should work on multiple websites: if there's a different kind of pop-up, it will close that one too.

So let's see if this works. We run it, it opens, loads the page, and scrapes the first page of products. And when the pop-up opens, we actually close it by clicking the X button. And there you go, we did it! We also clicked on the next page, so now we're on page number two. Then we just scrape this one and go to page number three. And this time we actually clicked directly on page number three, because I think there's some cookie so the pop-up only comes up the first time. Now we scrape this page, and then we have all the products from all of the pages; of course, we have the limit so that we only get three pages. And here we have them: all the products from all the pages, which is pretty cool.

Now, if that wasn't cool enough, how about this: we can actually use a different website with the same exact code. Let's try this toolnut.com website. This is a completely different e-commerce platform and a completely different website, but we still have products here, and we have a page selector down here. So let's slap this URL into the goto and run it: python3 main.py. We open toolnut.com and scrape the products from the first page. And look at this: again we have a modal here. Let's see if it will close this one as well. And there it goes; it closed it and then went to page number two. So the same exact code works on multiple different websites.
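Putting that together, the bottom of the loop might look like this sketch; per the transcript, get_by_prompt() returns None when nothing matches, which is what makes the truthiness check work:

```python
    # Locate the next-page link; query_elements() returns page
    # elements rather than a dictionary, hence the dot notation.
    next_page = page.query_elements(pagination_query).page_selector.next_page_link
    if not next_page:
        break  # no next-page link means we're on the last page

    # Close any pop-up that would block the click; get_by_prompt()
    # returns None if no matching element is found.
    close_popup = page.get_by_prompt("the button to close the popup if there is one")
    if close_popup:
        close_popup.click()
        page.wait_for_timeout(1000)  # wait out any closing animation

    next_page.click()
```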
And now we're on page two, so we scrape page two, then go to page three and scrape that, and that's it. Here we go to page number three, and finally we scrape this one. And here you have all of the products from all of the pages, listed nicely.

So let's try even a third website. Here we have this Off the Wagon site with weird, funny, and strange gag gifts. Let's see if we can scrape this one. Let's copy the URL, put it in our goto, and run it with the same exact code as before. Again, we open up the website and scrape the products from the first page. And once again we have a pop-up, but it doesn't matter, because we're going to close it. There it goes. Then we go to page number two. And then finally we go to page number three... or not. We're actually stuck here on page two. I'm not sure what it's trying to do; it's trying to find some button, but it can't find it. So what did it try to click? It seems like it tried to click on page number two, when it should have clicked on page number three. So let's see, do we have some problem in our code? We get the page number and pass it in here, but it still clicked the wrong one. What we can do is also add a current page link here, and say "link to page" with whatever the current page is, so it won't get confused between the two. Let's try it again with this one. We open it, scrape the products of the first page, get the pop-up and close it, and go to page 2. Let's see if page 3 works as well. And here we go to page 3, so now it actually worked. And here we have all the products from all of the pages from that website as well. That is pretty cool, I think.

Now, what we could do is make a function that only takes the URL of a website and then scrapes the whole thing. So let's actually do that. Let's define scrape_products, taking a url, which is going to be a string, and a page_limit, which is going to be an int. Let's put the whole thing inside, pass just url to the goto, and use page_limit down where the limit was. Let's also initialize a list of all_products, and when we get the products for a page, we can just say all_products += response["products"], like this. Then down at the bottom we return all_products. So this function takes the URL of an online store and a page limit for how many pages we want to scrape, and it returns a list of dicts, which are the products. And let's also take a headless parameter, defaulting to True actually, and pass it into launch. So now, if we want to scrape something, we can just say products = scrape_products(...) and pass in the URL. Let's do the first site again, slap the URL in here, do two pages, then do json.dumps on products and print it, and import json so we can see it nicely.

And let's run this thing. "launch() takes 1 positional argument but 2 were given" — so we actually have to say headless=headless as a keyword argument. And this will now actually run headless, because I didn't pass that in, so we see nothing here; it's all just happening in the background, and we should get a nice JSON list of products.
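Here is a consolidated sketch of the finished function as the video describes it, assembled from the pieces above; the URL is a placeholder and the exact field names follow what's dictated in the transcript, so treat it as an approximation of the on-screen code:

```python
import json

import agentql
from playwright.sync_api import sync_playwright

PRODUCTS_QUERY = """
{
    products[] {
        name
        creator
        price(as a number)
    }
}
"""


def scrape_products(url: str, page_limit: int, headless: bool = True) -> list[dict]:
    all_products = []
    with sync_playwright() as p:
        # headless must be passed as a keyword argument to launch().
        browser = p.chromium.launch(headless=headless)
        page = agentql.wrap(browser.new_page())
        page.goto(url)

        page_number = 0
        while True:
            page_number += 1
            response = page.query_data(PRODUCTS_QUERY)
            all_products += response["products"]
            if page_number >= page_limit:
                break

            # The current-page link gives extra context so the AI
            # doesn't confuse it with the next-page link.
            pagination_query = f"""
            {{
                page_selector {{
                    current_page_link(link to page {page_number})
                    next_page_link(link to page {page_number + 1})
                }}
            }}
            """
            next_page = page.query_elements(pagination_query).page_selector.next_page_link
            if not next_page:
                break

            close_popup = page.get_by_prompt("the button to close the popup if there is one")
            if close_popup:
                close_popup.click()
                page.wait_for_timeout(1000)

            next_page.click()

        browser.close()
    return all_products


products = scrape_products("https://example-store.com/programmer-t-shirts", page_limit=2)
print(json.dumps(products, indent=2))  # indent makes the output readable
```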
And here we have all the products. Now, I used JSON so that I could format this response, but I forgot to add the indent in there, so it looks like this. Anyway, I hope you can see that here we have the products, separately and nicely, as JSON. And the only thing we have to do is add the URL and say how many pages we want to scrape.

I think AgentQL is a pretty cool library. If you want to check it out, you can go to agentql.com, click Get API Key, and register, and you will get your API key. Then the only thing you have to do is run pip install agentql, and after that run agentql init, which will ask for your API key. And then you can just use it. In fact, they give you 1200 requests per month for free, so you can try it out at no cost. I will also leave a link to their website in the description. If you have any suggestions for what I should try to scrape with AgentQL, let me know in the comments. Thanks for watching, and I will see you in the next one.