Transcript for:
Scraping Amazon Data to CSV

In this video, we will scrape data from the Amazon website and convert it into CSV format. As a data engineer, you will write a lot of ETL pipelines. ETL stands for Extract, Transform, Load. The extract part of ETL can be done in multiple ways: you can pull data from an API, from website analytics, through web scraping, and more. In this video we are going to look at one of those ways to extract data and build your own dataset: web scraping. I will explain everything step by step in a simple manner, from understanding web scraping to understanding the structure of HTML data, and we will also discuss some of the prerequisites for this project. There are many cool things to do in this project.

If you are seeing this channel for the first time, my name is Darshil. I'm a freelance data engineer, and on this channel we mainly talk about data engineering, freelancing, and productivity. If you like this type of content, make sure you hit the subscribe button, and if you learn something new from this video, hit the like button. Without wasting time, let's get started.

Let's start by understanding basic HTML structure. HTML stands for HyperText Markup Language, and this is what the structure looks like. At the top of the document you define the document type declaration; this tells the browser that this is HTML code. All of the things you see here are called tags. The HTML language is built on top of tags, and each tag defines something. You write your HTML code between the html tags, and if you want to display something on the page, you write it inside the body tag. There are different tags available, such as the h1 tag to display a header and the p tag to display a paragraph, and there are many more if you want to change the font size, add color, and so on. We are not going to deep dive into HTML.
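As a minimal sketch of that structure, here is a toy HTML document parsed with Beautiful Soup (the library we install later in the video; the page content and tag values below are made up for illustration):

```python
from bs4 import BeautifulSoup  # third-party library: pip install bs4

# A toy HTML document: the DOCTYPE tells the browser this is HTML,
# and everything shown on the page lives inside the <body> tag.
html_doc = """
<!DOCTYPE html>
<html>
  <body>
    <h1>My Store</h1>
    <p>Welcome to the store.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.h1.text)  # text inside the h1 header tag
print(soup.p.text)   # text inside the p paragraph tag
```

Each tag becomes an object you can navigate, which is exactly the idea we use later to pull data out of Amazon's pages.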
If you want to learn the basics of HTML, I suggest you go to W3Schools. Here I will just give you some basic strategies you can use to understand HTML code and scrape data from a website.

So let's start by understanding the basic structure of the Amazon website. If you want to understand this price and how it is displayed on the page, you can go and check the code directly, and I will show you how to find the right code for a particular element. For now, just understand that this is the code used to display this price number. Here we see three things. First, a span; span is a tag available in HTML. Second, we have attributes. And third, the content: the text that is displayed on the page. As you can see, the 9,499 is displayed here, but the font size is large; that kind of formatting comes from the attribute. There is a class stored in the CSS, a-offscreen, that contains all of that formatting information. To scrape data, you don't have to understand this code fully; you just need to pick out the right information and put it into your Python code.

Now let's talk about the basic prerequisites for this project. The first thing you need is a stable internet connection and a laptop. Second, you need to install Python; version 3.5 or anything higher works fine. If you don't have Python, I already have a tutorial; I will put the link in the description, so you can go watch it and install Python. You also need to install Jupyter Notebook, and in that same tutorial I've explained how to install Jupyter Notebook. So you need Python and Jupyter Notebook; just watch the first five minutes of the video and you will understand everything.
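As a sketch of that price markup: the snippet below is invented, using the a-offscreen class name discussed above (Amazon's real markup is much larger and changes often), just to show the tag / attribute / content split:

```python
from bs4 import BeautifulSoup

# Invented snippet modelled on the price element discussed above:
# the tag (span), an attribute (class="a-offscreen"), and the content.
snippet = '<span class="a-offscreen">9,499</span>'

soup = BeautifulSoup(snippet, "html.parser")
# Find the span whose class attribute is "a-offscreen" and read its text.
price = soup.find("span", attrs={"class": "a-offscreen"}).text
print(price)
```

The tag name plus the attribute value is all the "right information" we need to carry into our Python code.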
Let's get started with the actual execution. Assuming you already have Python installed and a Jupyter Notebook that looks something like this, we will start writing our code.

Before writing any code, we need to install the packages we will be using in this tutorial. The first is Beautiful Soup, a library you can use to parse HTML code and find the elements inside it. All you have to do is run pip install bs4 and it will install Beautiful Soup on your PC. I already have it installed, so it tells me "requirement already satisfied", but if you don't have it, it will install from scratch. Second, you need to install requests, because we will be sending requests from our local PC to the Amazon website. Third, we need pandas, because once we scrape the data we need to convert it into CSV format; we will use pandas to turn the scraped text into a proper DataFrame and then store it as CSV. After you install all of these packages, you can clear the output by clicking this button.

Now we will start executing the project. The first thing is to import all of the packages: from bs4 import BeautifulSoup, import requests, and import pandas as pd. If you understand the basics of Python, we are just importing packages.

Now let's understand what we really want to do in this project. First, we want to send a request from our local PC to the Amazon website and get the HTML code back, so that we can extract all the other information from it. So the first step is to go to the Amazon website; I'm going to amazon.com, and this is what it looks like. You can search for anything here, so I will search for, let's say, gaming. You can search for whatever you want, and I see these products on my screen.
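The imports and the request step described above can be sketched like this. The search URL and the User-Agent string below are placeholders (we cover where to get your own User-Agent shortly), and the function is only defined here, not called:

```python
from bs4 import BeautifulSoup
import requests

# Placeholder values; replace with the URL you copied from your browser
# and with your own User-Agent string.
URL = "https://www.amazon.com/s?k=gaming"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (placeholder)",
    "Accept-Language": "en-US, en;q=0.5",
}

def fetch_page(url):
    """Request a page and parse the returned bytes into a soup object."""
    response = requests.get(url, headers=HEADERS)
    if response.status_code != 200:  # 200 means the request succeeded
        raise RuntimeError(f"Request failed with status {response.status_code}")
    return BeautifulSoup(response.content, "html.parser")
```

This is the same request-then-parse pattern we walk through step by step below.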
Now, if you want to see the code, all you have to do is right-click and click Inspect (on Windows it may appear as Inspect Element). You will see something like this on the side: this is the code of the entire page as served by the Amazon website. You can easily navigate the page and find the elements on it. Just click on this arrow icon, and then you can move anywhere on the page. Let's say I want to extract this title: click on it, and it will take you directly to the actual HTML code that displays the title. As you can see, this is an h2 tag; h2 is a header tag. If you expand these three dots, you will find the title that is being displayed. The title is nested inside multiple tags: it sits inside a span, this is the class name used on that tag, and this is the actual content we want to extract. That is how you work out the structure of the HTML. If you want the whole page, click View Page Source and it will show you the entire HTML document.

What we really want is to download this complete HTML document programmatically, and for that we will send requests from our Python code to the Amazon website. So let's extract information for a specific product. You can pick whatever you want, books, mobile phones; let's do the PlayStation 4.

To extract information about the PlayStation 4, our code needs two things. One is the URL: copy the URL, go into the code, write url = and paste it as it is. The second thing we need to define is the HTTP header. Whenever you visit a website, you send it an HTTP request, and an HTTP request contains many things; one of them is the header. Inside the header you will find many fields, and one of the most important is the User-Agent. The User-Agent tells the site that you are a genuine user by identifying your browser, along with some other information required to access the website. This is just the basics; if you want more of the technicalities of the User-Agent, you can go and read about it. But you need to get your own User-Agent, and here is how: go to whatismybrowser.com, click "Detect my settings", and then click "What is my user agent". You will see something like this; just copy that user agent (yours will differ based on your browser) and paste it into the code. So we are defining a header: inside it we put our User-Agent, and we also define the Accept-Language as English, because in the request we want to tell Amazon that we want everything in the English language. Paste your user agent there and run the cell; you have now defined your header and URL.

Next, we will make the request to the Amazon website. To make the request we use requests.get, passing our URL and our headers, and run it. Then check whether you got response 200; a 200 response means your HTTP request was successful. If you are getting a 503 error, you can try again after some time. If it still doesn't work, you can try some
other user agents; you can find them online. And if even that doesn't work, try amazon.in or amazon.co.uk depending on your country, and it should work. Once you get the 200 response, you're good. Let's print the content from the web page, and you will see the entire HTML document that we got back.

Currently this request returns the HTML document in bytes format; if you check it with type(), you will see it is bytes. We want to convert it into a properly parsed HTML object, and for that we use Beautiful Soup. This is very simple: call the BeautifulSoup function that we imported, pass in the web content, and tell it to use the HTML parser. If you then print soup, you will see the document parsed as proper HTML. You will see a lot of different things; we won't get into everything right now, we will go step by step.

So we were able to fetch our HTML document. Now we want to find the different product links on that page, so that we can iterate over each link, visit it, and extract information from the product pages. Go back to Inspect Element and look for the links. If you have a basic understanding of HTML, you will know that links are written inside the a tag, called the anchor tag. Here we also see an a tag with class information, and you will find the link itself inside the href attribute. So we need to find every a tag with that class name. Let's try it practically to understand what is happening: Beautiful Soup has a function called find_all, which finds all of the a tags on the page we just parsed where the class name matches whatever was inside the anchor tag. Copy that entire class string and paste it into your code; let me copy and paste it, run it, and see if we got the links.

We got a list of different links from that page. The first one is the PS5 Digital Edition Horizon Forbidden West bundle; if we go to our Amazon page, that is indeed the first product. The second is the PS5 Console Horizon Forbidden West bundle, and there it is on the page. So because we did find_all, we got all of the anchor tags from that page, which means we were able to extract the links.

For now we will work with just one link, extract all the information from it, and after that we will loop through every link and build proper functions so everything lands automatically in a pandas DataFrame. To take one link, just slice the list with links[0]. Now we need to visit this product's page, and for that we need to build the full link; as you can see, we only have half of the URL, so we need to prepend something. First, let's pull the link out of the a tag: write .get('href'), which reads the href attribute, run it, and you get the actual link. Store it in a variable called link, and then build product_link = 'https://www.amazon.com' + link. If I run this, I get the entire link. You can copy this link and paste it into your browser to see if it works; and it takes us to the same page we visited earlier, so our link works and we successfully built the right link in our program.

Now we repeat the same process: make a request to this product page to get the HTML document, and then parse that page as HTML. Do the same thing as before: write new_webpage, where the URL is the product_link and the headers are the same, run it, and check that the request returned 200, meaning it was successful. Then parse it: new_soup is BeautifulSoup applied to the new page content with the html.parser. This is fairly straightforward; it is what we did initially. What we are really doing here is going step by step to understand the whole process of getting elements from one single page; once we can do that, we can easily loop through all of these pages and pull out the right information.

Next, we need to get some information from these pages. Right-click, Inspect Element, and say we want to extract the title: click the arrow, click the title, and you will see span id="productTitle". So write something like new_soup.find('span', attrs={'id': 'productTitle'}); the attribute is id and its value is productTitle, exactly what we see in the markup. Run that, and you will see the entire HTML element extracted. To get just the text out of it, add .text. You will see something like this, but with some whitespace on either side; to remove it, call .strip(). This is a function available in Python, and it will remove all of the
whitespace from the left and right. So we extracted a clean title.

Let's look at some other information we want to extract. Say you want the price: click on it and you will find a span with a class along the lines of a-text-price a-size-medium. Go back to the code, copy that class exactly as it is, paste it into find, and run it. We get the element, but notice that inside this span there is another span that displays the price (the off-screen one), and inside that there is yet another span. So if we call .text at this level, we get duplicates. To handle this, chain another find on top of it: .find('span', attrs={'class': 'a-offscreen'}), which takes just the first inner element. Then .text extracts the price. How easy is that! So we were able to extract the price like this.

Again, if you want some other information, say these ratings: here we see span class a-icon-alt, so we repeat the same steps. Copy the class, paste it in, and hopefully it works; we extracted "4.8 out of 5 stars", and .text gives you the actual string. So you understand the process: for whatever element you want to extract, find the tag name and the attribute (it can be an id or a class), copy the right attribute value, and put it into your code. If the element is nested, call find again inside that element to reach the right inner element, just like we did for the price. That is the step-by-step approach, and with it we were able to extract several elements.

Note that these structures might be different in your region; Amazon keeps changing these things to protect their website from web scraping and the like. So if these exact class names don't work, go and check inside the code, see what is actually there, and modify your code accordingly. I'm not 100% sure this one version of the code will always work; you have to understand the structure and then put the right information into your code.

That was the procedural, step-by-step way of doing it. Now I'll show you some code I've already written and explain how it works, because you now understand the fundamentals of extracting data from the Amazon website; next you will see how to put everything together and make it work. I've already done this, so you don't have to spend your time doing the same thing again and again.

Here again we start by importing our packages. I will come back to these functions; let's understand the basic pieces of the code first. First is the header that we defined, then the URL (you can replace it with whatever you want). Then we do the same as before: send the request to the Amazon website and parse the response with Beautiful Soup. Then we find all of the anchor tags, exactly like we did earlier; this returns all of the links on the page as a list. Let me show you again so you remember: all of these links, inside a list. Then we define link_list to store each individual href. We loop through every link in that list and append link.get('href') into the link
list. If you print that, you get just the actual links.

Then we define a dictionary; we will come back to that part when we actually use it. The next thing we do is go through each link we appended and send a request to the Amazon website again, concatenating amazon.com with the link. This is the same 'https://www.amazon.com' + link step we did earlier, but now inside the loop. We assign this link as the new page, get the result directly, and then parse the HTML the same way as before.

Now, instead of doing everything one by one, we defined a function for each piece: get_title, get_price, get_rating, get_review_count, get_availability, and so on. You can create your own functions like this. I will explain one or two of them and then you can explore the code yourself; I'll put the code link in the description.

The important part here is a function like get_title. We pass the soup object into get_title when we call the function; if you understand the basics of Python, you will recognize that this is the argument. Everything is written inside a try/except block. Suppose the function is not able to find the product title on the Amazon page due to some bug: then an exception is raised because the title element isn't there, execution goes into the except branch, the string is set to empty, and the empty string is returned. But if it is able to find the element, it gets the text, strips the whitespace, and returns the title string. You can then append this title into the dictionary we defined. That dictionary, d, has one key for each element we are extracting: title, price, rating, reviews, availability. For each key we create a list, and we append each extracted element one by one.

Let me run this code and see what the output gives. While the code runs, let's understand what is actually happening: the code goes through each link and appends each extracted element to its list, so title ends up as one complete list of titles, price as one complete list of prices, and so on for every element. Then we use pandas' from_dict function to build a DataFrame; you can build a pandas DataFrame directly from a dictionary shaped like this. After that, we replace empty values with nulls, and then we drop all of the rows where we didn't find the title, because if we don't have the title, the row doesn't make sense.
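The dictionary-to-DataFrame step just described can be sketched like this; the rows below are made-up stand-ins for scraped values, since the real lists are filled by the get_* functions inside the link loop:

```python
import numpy as np
import pandas as pd

# Made-up scraped results; in the real script each list is appended to
# one element at a time while looping over the product links.
d = {
    "title": ["PS5 Console", "", "PS5 DualSense Controller"],
    "price": ["$499.00", "$399.00", "$69.00"],
    "rating": ["4.8 out of 5 stars", "4.5 out of 5 stars", "4.7 out of 5 stars"],
}

# Build the DataFrame directly from the dictionary of lists.
amazon_df = pd.DataFrame.from_dict(d)

# Replace empty strings with NaN, then drop rows with no title,
# because a row without a title doesn't make sense.
amazon_df["title"] = amazon_df["title"].replace("", np.nan)
amazon_df = amazon_df.dropna(subset=["title"])

# Store the result as a CSV file.
amazon_df.to_csv("amazon_data.csv", header=True, index=False)
```

After the dropna, only the rows that actually had a title remain, which is exactly the cleanup the final script performs before writing the CSV.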
So we replace empty values with nulls, then drop those null rows, and store the Amazon data in CSV format. One more thing you need to install is the numpy library; while running this code I got an error because I had not imported numpy. If you don't have it, just run pip install numpy, add import numpy as np at the top, and run the code again.

My code ran successfully, so let's print a few things. If we print the dictionary, you can see our titles; some of the titles were null, which is why we got nothing for those, but this is how the dictionary looks. Some elements we got and some we didn't, because a page might not have those elements. Now let's print amazon_df. You can see that some rows got dropped, the first few rows, because we didn't have titles there, and you can see the title, price, rating, reviews, and availability columns. We extracted all of this into the DataFrame, and if you go to the folder (for me, projects/web scraping) and open the Amazon data CSV, you will see all of the Amazon data in proper CSV file format.

If you didn't follow all of this, I suggest you go back and watch the tutorial again. What we really did: we understood the basics of HTML, then the basics of Amazon's HTML structure; then we started our code by importing the libraries, creating the User-Agent header, requesting the Amazon page, understanding the HTML structure, getting all of the links, concatenating each link with amazon.com, and going through every element to extract the titles, prices, and all the rest. Then I combined all of these pieces into proper functions, which you can see here.

That was everything for this video. If you learned something new, make sure you hit the like button; it helps this channel grow and reach more and more people. Thank you for watching, see you in the next video.