Transcript for:
Web Scraping Amazon - Handling Pagination

In this video I'm going to show you a way of dealing with pagination when web scraping amazon.com search pages. Hi everyone and welcome, my name is John, let's get into it.

The method I'm going to use is to check the Next button here to see if it exists; if it does, I'll scrape the URL from it and move on with that page. On this page here we can see that we go up to six. On page five I have this pagination list, and there's a list item with the link in it whose class is a-last. If I go to the next page and look at the same thing, we'll see that the Next button has disappeared: we have no link, and it just has a slightly different class, a-disabled a-last. So we can use that.

Let's go to our code now, and I'm going to import the packages that we need. We're going to need requests-html to be able to scrape Amazon, so from requests_html we import HTMLSession, and that's going to let us actually scrape the page. Then we set up s = HTMLSession(), which lets us access the session more easily. I'm going to grab the URL for the page we were just looking at, but going back to page one, and copy it out; I'll chop it up a little first. We've got this long URL here, but we don't actually need everything after the search term for now, so I'm going to remove that. We're basically left with just the search, and you can see our search term there.

The next thing we want to do is write a function that actually gets the data; then we can check whether the button exists and go from there. We're also going to need BeautifulSoup, so: from bs4 import BeautifulSoup (I can never type this word).

Now let's write our first function. For this one, we want to take the URL of the page and we want to return the
soup, which we can then use to parse the HTML. I'm going to define this one as get_data and give it a url parameter. Within it I'll do r = s.get(url), because we want to use the session we created above, and then we want to take the text from our response and put it into our soup: soup = BeautifulSoup(r.text, "html.parser"). If this doesn't work or you get any errors, I would print out the text value and check that it doesn't say anything like "bot" or "captcha"; but if you do exactly what I've done, you should be fine. This is my method for web scraping Amazon at the moment.

Then I'm just going to return the soup out of this function. So when we give this function an Amazon URL like this one, it returns the soup, which is all the HTML that we can then search with BeautifulSoup's find or find_all. Just to test that this works, I'm going to print whatever comes out of get_data(url), giving it the URL we have here. This should print out the whole of the soup in its entirety, everything from this page. There it is, we can see it all, and within it is the HTML of everything we actually want: the products and so on. I'm just going to get rid of that print now.

So what we can do is search within that soup to find whether the next-page button exists. I'm going to do that in another function, and then we can see the two work together. I'll define a new function, get_next_page, and we need to give it the soup, so
I'm calling the parameter soup here; it doesn't matter what the name is, but this is what we're going to give it, the soup that comes out of get_data, because that's where the information is. What we want to do here is find that element if it's on the page. So I'm going to say page = soup.find(...). Let's go back to our page, scroll to the bottom, and hover over the pagination; you can see where it actually starts up here. I'm going to find this class first so we only get this list, just in case there are other lists on the page. So: soup.find, it was an unordered list (ul), and the class was this one. That's our page.

Then we say if not page.find(...), because we're now searching within that list. Using if not, we go to our page and look for the Next element that isn't an actual button, the one from the last page. So we're saying: if this element is not on the page. I'll copy that class and say find('li', ...) with it. If that doesn't exist, our url is going to come from the a tag inside the list item where the Next button is. Now I need to go back a page. So we're saying that if the disabled element doesn't exist, we want to find the link to the next page, here. I'll copy this, and say url = ...; and if we check this again, we can see this isn't a complete URL, it doesn't have the amazon.co.uk part at the start, so I'm just going to complete that manually here. You could add it as a variable at the top if you wanted to, but I'm just going to write it out: I think I need www.amazon.co.uk, and then we say plus, and I'm going to turn the next part into a string: page.find with the li tag and the class a-last, like that. And within that... let's go back to our tree, let's make this a bit
clearer. I started in the soup, and now I'm looking inside this list element here, so we need to find the a tag within that. We can just chain these find methods together, so we can now do .find('a'); I should just be able to find the a tag, because it's the only one there, and then we want the href from that.

This is quite a long line, but basically what we're saying is: if it doesn't find this class on the page, the one where the Next button is greyed out, it will create a url for us. We start it with our domain string, and we add a string of the a-last list item, the one which had the link in it; from there we find the a tag, and then take the href from that. Then we return the url out of our function here, and we can say else: return None.

So what this function does is search the soup, which is the HTML code of the page, for the pagination. If it doesn't find that disabled element, it constructs our url and returns it; if it does find it, it returns None. So we can just use this function normally in our code.

Let's check that it works. We'll call get_data, save the result into our soup variable, then call get_next_page(soup), and I'll print that out. This should return us a full URL for the next page... except I've missed a bracket. There we go; I just had my brackets in the wrong order, so I've sorted that out. After we run this we get our next-page URL, and we can see in here it says page=2. Before, what I might have done is a for loop with range, where you add the number into the page URL; that does work and it's a good option, but I think this is a
bit neater. And now that we've got this function written, you can save it, and any time you do any Amazon scraping you can just paste it back in and away you go.

I'll show you it working in a loop. The only function I'm really missing here is the parse function, so I'm just going to print the URL of the page every time instead. I'm going to say while True, so this loop will run indefinitely until we break out of it. Then our soup is our get_data call here, and our url, which is what we return from get_next_page, is set from that; so if we get a good URL out here, we'll get it back. Now I can say if not url: break, so if the function returns None, we just break out. I'll just put a print of the url here. This is the sort of loop we could use if we had a parse function; we could just stick it in the middle here and get the information from each page, but I'm leaving that out for this explanation.

So we're saying while True, which runs indefinitely until it breaks; we get the data into the soup; then we create our new url from the function we wrote; and if that returns None, which happens when we find the disabled element, it breaks out. For each page I'm just printing the URL, and you'll see the page URLs flick through up here: two, three, four, five, six. So that was all the pages, bearing in mind that we started on page one, which we scraped first; that's why we didn't generate a page-one URL here.

So that's it guys, hopefully you've found this useful. It's quite a cool way of getting the pagination to work: anywhere on Amazon you're scraping where you see these buttons, which is across most of the site, that function will work. So we can now save this and just use it wherever. What we could do is we could
create our own Amazon scraping package, and we could just import this function from it whenever we need it. That's quite a cool way of doing things.

So hopefully you've enjoyed this, hopefully it's been useful. And don't forget, Black Friday is coming up: next Wednesday I'm going to have a much more complete video with some more cool stuff in it for you. So if you're interested in that, and it's obviously before then when you watch this, hit the subscribe button, give me a thumbs up, and stick some comments down below; I try to respond to as many as I can. Let me know what you think, and I will see you guys in the next one. Thank you and goodbye.