Hi, the PageRank algorithm for Big Data Analytics is used to find out the importance of a page to estimate how good the website is. Now to start with what is a PageRank? PageRank is a vote which is given by all other pages on the website. on the web about how important a particular page on the web is.
A link to a page counts as a vote of support. The number of times a page is referred to by the forward link. it adds up to the website value.
And the number of times it is taken as an input to the previous page also adds up to the web value. The original PageRank algorithm was designed by the Lawrence Page and Sergey Brin. So let's see how the actual functionality of PageRank works. The first model which is very commonly used is a random surfer model.
In a random surfer model, The user behavior is taken where a surfer clicks on the links at random with no regards towards the content. It is just clicked when a user or the random surfer is surfing the web. It is given by this equation which says PR it is a page rank A of any page is calculated as 1 minus D plus D with a page rank of the current page. page to which it is linked backwards and then this is the outbound links on that particular page depends on the number of pages this initial page is linked to so let's see what all these things stand for PR of a is the page rank for page a for which we are calculating the page rank PR T of I is the page rank of page TI which is linked to page A.
That is, there is an in-link to this particular page A. And C of TI is number of outbound links on the page TI. D is the damping factor, which is also known as a teleport factor, and which is set between 0 and 1. By default, if any time the damping factor is not given, the most suitable damping factor chosen is 0.85.
By default, just remember one thing while solving the page rank problem. If the teleport factor and the damping factor is not given, it is taken as 0.85. Now, the rank of a document is given by the rank of those documents which are linked to it. If I have a document over here.
And how it is linked to all other document. It is on that I calculate the rank of this particular document. If this is linked backwards by these three.
So it has a good page ranking. The PR of each page depends on the PR of the pages pointing to it. That's what I said. Backward how much are pointing inwards to it. The inbound to it makes a lot of a difference.
But we don't know. what PR of those pages until the pages pointing to them have their own PR calculated. So the PR of this page also depends on the PR of all these pages which are pointing to it.
So page rank is an iterative process which will be happening and we'll see how it is iterative. Now these are the various links which have to be considered. The inbound link, the outbound link and the third important one.
is a dangling link now what is an inbound link inbound link for a page always increases the pages page rank it is how many are linking it back to it and the important aspect of outbound link is a lack of them on the web pages now outlink sometimes leads to the dangling links if there is an outbound of a to b and b is not pointing anywhere b is dangling that we'll see later then those are known as a dangling links or dead links that add to the negative effect to my web page ranking. When a web page has no outbound link, this is very important, its page rank cannot be contributed to other pages. This will see how it is because as we have seen in the formula that when we are using the equation, the inbound value adds up to the page rank.
So what happens when there is no outbound link? bound to it then it does not contribute anything to it as it is divided by number of the outbounds if number of the outbounds are zero then the page ranking contribution also becomes zero now how does the dangling link come to picture this is how the dangling link comes into picture because if there are dangling links their contribution becomes zero as the outbound is zero so as the outbound is zero my division factor becomes zero we'll see how it happens Now this is the same graph where I have added a web graph where I have added another node D. As I have added this another node D, if you remember the formula, it says 1 minus D plus D. There is an inward capacity for the node divided by the number of outbound links it has.
As this link does not have any outbound link. so my factor divisibility factor is zero so it does not contribute anything to my page ranking that is why these are known as dangling links or dead links which contribute to the negativity or the negate the page rank for a particular web. Now let's take this example with a teleport or a damping factor is equal to 0.85. So let's start working with it how to solve this particular small example.
We have the initial page rank for all pages is equal to 1. By default if page rank is not provided it is always presumed that in initial page rank for all the web pages is equal to 1. That is at the iteration 0, all the page rankings will be 1. And then we start to solve this particular equation to calculate the page rank for further iterations. Let's see how it follows. Initial page rank for value for all the pages is 1. The damping or the teleport factor is 0.85. Now for this equation, it is 1 minus d plus d now what is this pr c this is the number of this is the node which is inbound to my this page if you see to my page a how much is inbound the inbound is coming here this is my inbound arrow from c to a so pr of c is taken then c of ti now what is ti c so c of ti is the outbound links So how many are the outbound links? It is also 1. So what happens is 1 divided by 1. So this value becomes 1 and 1 minus D, 1 minus 0.85.
This is my damping factor plus the damping factor into 1. So 0.15 plus 0.15. So this is my page rank PRA at iteration level 1. Now this will be considered. for my further calculations.
Now I come to PR of B. I need to calculate the page rank for B. How many are the inbound links?
Inbound links are from A to B. This is how it goes. So this link is taken. So it is one only link coming.
So PR of A, C of A, 1.85, 0.85, 1. Now why this factor is taken as 2? Because from A, I have two links which are going to B. going out that is 1 to C and 1 to B so the value is 2 so this is 2 so 1 divided by 2 is 0.5 0.85 0.1 0.5 and the calculation is done so page rank for B at the iteration level 1 is 0.575 now let's calculate the page rank for the page C if you see here for the page C the two nodes the two pages which are coming here is a and b so i have to consider the page rank of these two if you will see the page rank of a was calculated to be one and the page rank was b calculated to be 0.575 so one divided by two because it has two outbounds and this has one outbound so 0.5 plus 0.575 1.075 so this is my page rank p PR of C at iteration level 1. So now at the first iteration level I have got the value of PR of A, PR of B and PR of C that is page rank for A, B and C.
Now let us calculate the iteration at page rank at level 2. So what happens? previous values which we had calculated that will be putting over here the formula the equation remains the same and we start substituting the values that we had calculated at iteration level one So PR of C comes here and the calculation. So this is PR of A, page rank A at iteration level 2, page rank of B at iteration level 2 and this is page rank of 3 at iteration level 3. There is one request. Please do not round off the values of the page rank.
Whatever values you get at whatever decimal levels, please write it down as it is. because as we have to check when we go further to get the consistency of the page rank the last few bits add on to it if we keep rounding them off then we will not get the consistency I will not know how good is the page rank so it is just please write down as it is whatever is the calculation that comes over here now this is how we calculate and in the end see we can do the calculation how much ever we want we can go to 12 iterations 13 iterations so on and so forth till it becomes a consistent page rank that we get but we can stop here by saying that at iteration 2 this is my page rank the further further calculations can be done using these iterations so this particular graph a b c of the page gives me these iteration values at iteration 0 i have all ones this is iteration 1 and this is iteration 2 this is how it has to be summed up to explain an example of of a page rank algorithm. Now let's take another simple example if it is told for the same page map that you need to draw a matrix and then calculate its page rank. If it is not specified then we use the rough surfer problem where random surfer problem where we use the equation to solve it but we should know how to draw the matrix out of it.
Now these are the three pages web pages A, B and C so we write down A, B. B and C. Now what is this 1 by 2 and what is this 1 by 2? From A because there are two outwards going there are two outbounds going so this value is 1 by 2. So from A to B is 1 by 2 from A to C is 1 by 2. From B I am going only to C so at C I have 1 from C I am going only to A so I have 1. The teleporter damping factor is 0.85.
When we are working with a matrix the damping factor factor usually does not make a difference but it has something called a Wiggin factor. Now this Wiggin factor I have taken it as 1 by 3. If it is provided it is very good. It helps us to solve a problem.
If not I have taken it as 1 by 3. Now let's see how it works. Very simple multiplication of matrix. I have it over here but what happens is when I have taken this value I have got the output same. So my page rank becomes consistent with this multiplication factor.
but if it this was not happening then what I need to do is I need to retake these values on the next slide again multiply and this I get as iteration 1 and whatever I get as iteration 1 I again take and I keep on multiplying to get iteration 2 this is just a simple example to explain how the matrix is used to calculate the page rank now let's take a very simple example again but a little bigger one which I was has more web pages it has six of them plus it also has these dangling pages which is used to calculate the values. Now what happens with this. Now with this web page what I need to do? I need to calculate the factors. Now for this particular diagram.
I've just missed out one thing. Let's do it. This link should be disconnected. This link should not be considered. Now let's see if I start up with a calculation.
PR of A, PR of B, PR of C. Now why we have taken this? Let's go back to this diagram. Now when I calculate the page rank for the page A, what we need to do is, see there are two values which are coming in, that is from B and that is from C. So we need to consider the inbound values for B and C.
So this is how we calculate. And why we have taken 4? Because from B I have 4 outwards. Outbounds 1, 2, 3 and 4. So it is.
I have point 4 and from C I have only two out bounds that is 1 and 2. So it is 4 and it is 2 and this is how it is calculated. So this value comes out to be 0.8. Now when I calculate the page rank for B, I'll take you back. I have only thing going from B is to A. So that is how it will be from A to B.
So that is only the inward it has. So this value becomes 0.8 divided by 2 and I get the page rank as 0.52. Similar thing happens with page rank of C.
I get 0.52. now what happens to pr of d there's only one invert which is from b so i get the value 0.34 same thing for pr of e i just told you to make the correction in the web map or web graph where f to e should be removed so pr of b again so i get the page ranking as 304 and for f it is coming from b as well as c so it has a page ranking of 5 1 Now this is at iteration level 1. Initial page rank for all the web pages should be considered as 1. Using these values we again calculate the page ranking and here we have to make sure that the damping factor by default given is 0.8 and not 0.85. So all the calculations that we have done is with 0.8.
Damping factor, teleport factor is 0.8. So this is how all the calculations, this is iteration 2. PRA PRB PRC D E and F are calculated again we stopping at iteration 2 we can continuously go on till iteration how much ever we want till it's not how much ever we want till it becomes a consistent page rank so this is the final output of this particular page rank problem this helps us to identify the importance of the website and its pages how it is linked in inbound and outbound and also remove the dangling links or the dead links thank you