Hi, my name is Vic. Lately, when I've been searching for something on Google, I end up having to wade through all of these irrelevant results to actually find what I want. So I decided to do something about it.
I built a custom filtering engine that can actually rank results according to my own criteria. And today I'm going to show you how you can do the same thing on your own. Let's dive in and get started.
I end up searching for baby products for my son a lot. And whenever I search, I end up with a page a lot like this. So I've turned off adblock, turned off everything, and it's kind of overwhelming.
There are a lot of different results here, and not all of them are actually relevant. If I clicked into these links, you would see that a lot of them are just full of affiliate links, trying to sell you things without giving you great information about what the best stroller actually is. I think the best link on this page is actually this one: the review by Wirecutter, where you actually get detailed information about how they tested, along with good recommendations. But this result is buried beneath all of these other ones. So what we're gonna do today is figure out how to improve these search results through filtering. And what we'll end up with is this really clean search interface where we can type in a search term like best baby stroller and actually search. And you can see here with these results, what I consider to be the best result, the Wirecutter article, is actually second, which is great.
So the ranking here is a lot better and the results give you a lot better information. And here's the overview of the project. And if you want to actually read through the code or these steps, here's the link at the top. I'll stick this link in the description so you don't have to type it out. But basically what we're going to do is we're going to use an API that Google provides to create a custom search engine.
Then we're going to use Python code to query that search engine and store the results. And we'll write some filters that will filter the results and re-rank them to give us a custom ordering that emphasizes higher-quality pages. So let's dive in and start building. As I mentioned, we're going to be using the Google Custom Search JSON API.
And there are two steps we need to take to set this up. The first thing we need to do is create what's called a programmable search engine. So if you click on this link to go to the control panel, you can create one there.
You will need a Google account, but other than that, creating the programmable search engine is completely free. So let's click here to go to the control panel. And you can see I've already created a couple of engines, but I'll create another one just as a test here.
I'll call this test2. You should give your search engine a name that is meaningful to you. Under what to search, we want to select search the entire web. So this custom search engine will search across all pages.
And we want to select I'm not a robot. And we can hit create. And then we will see this JavaScript code.
So you can ignore the code, but you do want to copy this CX here. This is the ID of your programmable search engine. So make sure you copy that and store it somewhere. We'll be using that later. Then we need to go back to the custom search page.
And the second step we need is to get an API key. So you see this button that says get a key. You'll want to click this. And I already have a Google Cloud account.
But if you don't have one, you will need to create an account. And I'll just pick a project in my Google Cloud account that I want the API key to be created in. I'll hit next, and then I have my API key.
So you want to go ahead and click show key and then copy that key and store it somewhere. So copy both the ID of the programmable search engine and the API key. One thing to know about this search API is that it's free for up to 100 queries a day. And a query is a page of results.
And a page is 10 results. So if you're getting 20 results every time you search (I'll show you how to configure that), you'll actually get 50 free searches per day. If you're getting one page, you'll get 100 free searches per day. After that, you're charged $5 per thousand queries, but it's prorated: if you only make 100 extra queries, you'll only be charged 50 cents. Still, be careful about that; you may not want to make too many queries. And we're going to do some things that limit the number of queries you'll have to make.
So just an FYI. Now let's move on and actually talk about how to build the search engine. I'm going to be using an IDE called PyCharm to build this project, which is a free download, as you can see. But feel free to use VS Code or JupyterLab or any other IDE. It'll be easier if you use an IDE that integrates a terminal and a coding area, but feel free to use your own tooling if PyCharm doesn't work for you. And this is the PyCharm interface.
I've gone ahead and created a folder called Search Live, which will hold all the code for this project. And you'll be able to see the files on the left over here.
There are only two files right now: private.py, which holds my API keys (you won't need to create that file on your own; it's just a file I need so that I don't accidentally show my API keys in this tutorial), and readme.md, which gives you an overview of the whole project. And as I mentioned before, you can get this readme and all of the code at this URL, which will also be in the description. One other thing PyCharm has, at the bottom here, is a terminal window where I can type command line commands. There's also a Python console where I can run Python code quickly and see the results. So I'll be using those two features as well.
All right. The first thing we're going to do is create a file called requirements.txt. And we're going to use this file to list out some packages that we'll need to install.
So we'll need Flask, we'll need pandas, we'll need requests, and we'll need Beautiful Soup 4. Flask is a Python web framework, which we'll use to actually create our website and search form. Pandas is a Python data manipulation and analysis library that's just going to make some things easier for us with organizing data. Requests is a Python library that connects to web servers and downloads HTML. And then Beautiful Soup 4 is a Python library that lets us parse HTML and extract text and other data from it. Once you've created requirements.txt, you want to go to the terminal and just type pip install -r requirements.txt, and that will install our four requirements.
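For reference, requirements.txt is just those four package names, one per line:

```
flask
pandas
requests
beautifulsoup4
```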
Once we've done that, we need to create another file called settings.py, and this file will just hold some settings that are going to be common across all of our code files. These are settings like the search key. I hope you saved that API key from earlier, because that's what goes in here. I'm going to leave mine blank, because again, I don't want to show my API keys, but you should put your API key into the search key field.
There's search ID, which is going to be the ID of the programmable search engine. So we got both of these values earlier. Then we want to set country.
So this is a two-letter country code for your country. You can use US if you want, but your results will be a bit more relevant if you set it to your own country. And then we need to set up our search URL. This is the URL that our code is going to call to get search results.
So we're calling this URL, and then we're passing some query parameters to tell Google about who we are and what type of results we want. So we're gonna pass in our API key. We're going to pass in our ID. Then we're going to pass in our query.
So this is what we're searching for. And then we'll pass in our start. I'll explain start in a little bit, but it basically indicates which page of results you want. And then we're going to append &gl= plus our country code. All right.
So this URL has a few parameters that we'll need to fill in later. They're all templated for now, but we'll be replacing them with the actual values when we search.
All right. And then we want to define how many results we want every time we search. I'm going to set that to 20. And now we're done with settings.
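Here's a rough sketch of what settings.py could look like at this point. The variable names here (SEARCH_KEY, SEARCH_ID, and so on) are just the ones I'm assuming for these examples, so match them to whatever the rest of your code expects:

```python
# settings.py -- shared settings for the whole project.

SEARCH_KEY = ""  # the API key you saved from the Google Cloud console
SEARCH_ID = ""   # the cx id of your programmable search engine
COUNTRY = "us"   # two-letter country code, for more relevant results

# Templated URL for the Custom Search JSON API. The placeholders in
# braces get filled in with str.format() when we actually run a search.
SEARCH_URL = (
    "https://www.googleapis.com/customsearch/v1?"
    "key={key}&cx={cx}&q={query}&start={start}&gl=" + COUNTRY
)

RESULT_COUNT = 20  # results per search (two pages of 10)
```

The next file we'll create is called storage.py.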
And this file is actually going to store all our results. There are two reasons we want to store them. One, we don't want to have to keep querying the API if we've already searched for something.
Two, we want to be able to store our search history. So down the line, we can use that data to build better filters and potentially even apply machine learning to filter our data. So every time we actually run a search query, what we're going to do is store all of the results that come back, along with some other metadata that we add. So I'm going to go ahead and create a class that will enable us to do that easily.
So we're going to import sqlite3, because we'll be storing our data in a SQLite database. And then we're going to import pandas, because that's just going to help us save data to the database and read it back in an easy form. Then we're going to create a class called DBStorage.
And this class is going to handle all of our interactions with the database. So what we're going to do is create a database connection to a database file called links.db in our initialization. Whenever we call this class and initialize it, we'll create a connection to that database.
And then we're going to write a function called setup tables, which will actually create the table in our database. So we first need to create what's called a connection cursor. And this cursor is what will let us actually run queries against our database. And then what we'll say is results table equals the SQL that will actually create our table for us. So we're going to write a create table if not exists clause.
So this basically says, hey, if this table exists, ignore this code. If it doesn't exist, create the table. We're going to call the table results. And it's going to have an integer primary key called ID. It's going to have a text field called query, which is the actual query that we type in and what we're searching for.
We're going to have an integer called rank, which is the rank of the search result. So 1, 2, 3, 4, 5, 6, 7, 8, etc. We'll have a link field, a title field, which is the title of the result. So the title of the page. We'll have a snippet, which is just a short snippet from the page that we're going to use to display our search results.
We'll have the HTML, which is the full HTML of the page that we'll scrape. And this will help us actually filter and create custom ranking for our pages. Then we'll create a created field.
And the created field will be a date time. And it'll just help us sort our queries and create a query history if we want to down the line. We'll also create a relevance field, which is an integer.
And this will allow us to mark results as relevant or not relevant, which will help us later if we want to apply machine learning to filter our results. All right. Then, each query and link combination needs to be unique, right? We're not going to get the same link multiple times for the same search query. So we're just going to create a unique constraint on those two columns.
And then this is the code to actually create our table. And then we'll execute that code to actually go ahead and create that table, and then commit our changes to the database. And then we'll close our connection cursor. And then in the initialization, we'll go ahead and call setup tables.
So every time we initialize this DBStorage class, we'll connect to the database and set up the table if it doesn't already exist. Then we need to write a couple more functions in this DBStorage class. The first is query_results.
If we pass a query into this function, it will return all of the results that exist in the database for it. We'll say df = pd.read_sql(...), so we're running a SQL query against our database: select * from results where query equals our query, ordered by rank ascending. What this will do is find all of the results we've already stored in the database for any given query. We'll pass in our connection, and then we'll return that data frame. Okay, then we need to write a quick function to insert a row into our database. When we get results, we want to store them in our database, and that's what this function will do.
So we'll create a cursor, which again is the object that lets us query and interact with the database. Then we'll try to execute some SQL with this cursor. We'll say insert into results (query, rank, link, title, snippet, html, created) values (?, ?, ?, ?, ?, ?, ?). So this is inserting the values into the database, and we're putting question marks as placeholders for each value, seven of them. The reason we're using these question marks instead of directly formatting the string and adding the values in is that this ensures we escape the values properly and don't insert anything malicious into our database.
So what we're doing is saying: insert these values into the database, and we're passing the values in as a list. Then sqlite3, the Python package, will properly format these values and ensure they're escaped properly when they're inserted into the database.
And then what we'll try is self.con.commit. So write our changes to the database. If there's an integrity error, which could mean that the data already exists in the database, we'll just pass.
So we'll ignore it if we've already written the data to the database. And then we'll close our connection cursor. Okay, so this is our database storage class and this just helps us create a database to store our data, read results from the database, and insert new data into the database.
All right, so we've written our storage. The next thing we're going to write is what will actually do the search for us. So let's go ahead and create a file called search.py, and this file will query the Custom Search API to get our search results. We're going to import settings here, and we're going to import requests.
And we're going to say from requests.exceptions import RequestException. We also need to import pandas as pd, and we need to import our storage class from the file we just created. Okay, there are a couple more imports we need, but I'll show you those when we get to them.
So the first thing we're going to do is we're going to write a function that will actually hit our API endpoint and return our search results. And basically, each page of results gives us 10 search results. So if we set result count to let's say 30, that means we want three pages of results.
So we'll just set our pages value appropriately, based on the result count. We'll create an empty list of dictionaries called results, and then we'll say: for i in range(0, pages), start = i * 10 + 1. So start just defines the rank of the first record on each page.
So the rank of the first record on the first page of results is one. The rank of the first record on the second page of results is 11, and so on. So depending on what page we're on, we're going to set start to that index of the first record.
And then we will format our search URL to actually ensure that all the right values are in there. So we'll set key to search key. CX is going to equal search ID.
Query is going to be quote_plus of our query. We haven't imported quote_plus yet, so we'll import it now: from urllib.parse import quote_plus. We'll execute that function on our query, and then we'll say start=start. quote_plus basically ensures that our query is properly formatted to go in a URL. For example, if we're searching for baby stroller, there's a space in that query, and you can't have spaces in URLs. So we need to format the query so the space is turned into a different character, and that's what quote_plus does.
It properly quotes all of the special characters that can't be in URLs, to ensure they're encoded properly. Then we say response = requests.get(url). This will make a request to the Google Custom Search API, passing in our API key, our search engine ID, the query that we want to search for, and which page of results we want.
And then we'll get a response. And the response is going to be in JSON format. So we'll go ahead and decode it into this variable called data.
And then data is going to be a dictionary. So we will add the items key of that dictionary into results. And then results will be a list of dictionaries. All right.
Then what we can do is we can turn that list of dictionaries into a data frame. And this data frame is going to have fields that correspond to the fields in our storage class that we use to create the table. So we'll have query, rank, link, title, snippet, HTML, etc.
So we'll need to add those fields into the data frame before we store it to disk, but we'll add some of them in a bit. We're going to add the rank field first; the rank just indicates the overall rank of the result, 1 through 10 for the first page of results, and so on. And then we will just remove some of the extra fields that we don't need. The fields we'll keep are link, rank, snippet, and title; there are a bunch of extra fields beyond those.
So if you want to try to run this code yourself and take a look at the fields before we filter, you can definitely go ahead and do that. But we only need these fields that I've specified here for now. Okay, we now have a function that can take in a search query and return results.
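Here's a sketch of search.py so far, assuming the names from settings.py:

```python
# search.py -- query the Custom Search JSON API.
import pandas as pd
import requests
from urllib.parse import quote_plus

from settings import *


def search_api(query, pages=int(RESULT_COUNT / 10)):
    results = []
    for i in range(0, pages):
        # start is the 1-based rank of the first result on each
        # page: 1 for page one, 11 for page two, and so on.
        start = i * 10 + 1
        url = SEARCH_URL.format(
            key=SEARCH_KEY,
            cx=SEARCH_ID,
            query=quote_plus(query),
            start=start,
        )
        response = requests.get(url)
        data = response.json()
        # "items" is the key the API uses for the result list.
        results += data["items"]
    res_df = pd.DataFrame.from_dict(results)
    # Number the results across all pages, then keep only the
    # fields we care about.
    res_df["rank"] = list(range(1, res_df.shape[0] + 1))
    res_df = res_df[["link", "rank", "snippet", "title"]]
    return res_df
```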
So let's test this out. So at the bottom here, I'm going to click on Python console. And this will actually create a Python console in the current folder.
And then what I can do is say from search import search_api, and then run search_api("baby stroller"). It'll take a second, but then I get my results as a pandas data frame. So you can see all of the fields here: link, rank, snippet, and title. Rank and snippet are hidden, because pandas by default will only allow a certain number of characters per row. But they are there.
I promise. Okay. Then we're going to write a second function called scrape_page, which is going to take in a list of links (really, the link column from the results data frame), and it's going to get the full HTML of each of those pages. So we'll create an empty list called html, and then we'll iterate through each of the links in our list. And what we're going to do is say data = requests.get(link, timeout=5).
And this is going to actually download the HTML of that page. This is useful for our filtering later, because we can filter based on some of the things that are actually happening on that page, like the page content or the number of ads, things like that.
Then we're going to append the text property of data, which is the actual HTML, to our list. And then we'll say: except RequestException, html.append(''). A RequestException happens when, for whatever reason, requests can't download the page properly. That could be because the web server isn't responding, it took too long to respond, it's blocking us, or some other reason. In that case, we're just going to assume the HTML was empty. Then we return our html list. So that's scrape_page.
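Here's a sketch of that function, which also lives in search.py:

```python
# Also in search.py -- download the full HTML of each result link.
from requests.exceptions import RequestException


def scrape_page(links):
    html = []
    for link in links:
        try:
            data = requests.get(link, timeout=5)
            html.append(data.text)
        except RequestException:
            # The server didn't respond in time, blocked us, etc.
            # Treat the page as empty and move on.
            html.append("")
    return html
```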
And then we can finally write a function to pull everything together and run our whole search. We're going to first create a list of the columns that we want in our search results: query, rank, link, title, snippet, html, and created. These are the seven columns that we're going to pass into our storage to save to disk, and that we'll pull back out of our storage as well.
You'll remember that when we created our database table, these were the columns of that table. Then we're going to initialize our storage class, DBStorage. And then we'll say stored_results = storage.query_results(query).
So we're essentially checking, hey, did we already run this query? And if so, are the results in our database? And we're saying, hey, if there are more than zero rows, so if we returned any results, then let's just go ahead and return our database results. First, we'll have to parse our created value, because SQLite stores datetime timestamps as strings.
So we're just going to go ahead and convert that to a pandas datetime object, to make it easier to work with. And then we'll return our stored results, and we'll pass in columns just to make sure that our columns stay in the right order. All right. So that's if we've already run the search and the results are in our database.
If they're not in our database, then we're going to use this search API to actually find results. So that'll query our Google custom search API. And then we're going to scrape each of the pages to get HTML.
Actually, I'll just say results["html"] = scrape_page(results["link"]). So that's just going to scrape each of our pages and store the HTML in the data frame. And then we're going to remove any results where the HTML is empty.
Remember how, if there was an error downloading the web page, we appended an empty string? Basically what we're saying is: if there were any errors downloading the web page, or we couldn't get the HTML, let's ignore that search result, because maybe there's an issue with the website.
Okay, then we're going to assign the query column to results, and we're also going to assign the created column. I'll need to import datetime to make this work. And what we'll do is format it the way SQLite expects: this just creates a string from our current date and time, in a format where we can store it in SQLite.
And then we will remove any extraneous columns and put the columns into the right order. Then we'll insert each row into our database. What this apply function is doing is essentially iterating over each row in results, and each row is being inserted into the database by calling the insert_row method of the storage class, which we created earlier.
Then we can return our results. Okay, so this whole function is basically checking the database to see if we already searched for something and stored it to the database. If we did, it's going to return those results.
If we didn't, it's going to query the API, get new results, format them properly, save them to the database, and then return them. All right, and that is the search part of our application.
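Here's a sketch of that top-level search function, again assuming the names used so far:

```python
# Also in search.py -- tie the storage, API, and scraping together.
from datetime import datetime

from storage import DBStorage


def search(query):
    columns = ["query", "rank", "link", "title", "snippet", "html", "created"]
    storage = DBStorage()

    # If we've already run this query, return the stored results
    # instead of hitting the API again.
    stored_results = storage.query_results(query)
    if stored_results.shape[0] > 0:
        # SQLite stores timestamps as strings, so parse them back.
        stored_results["created"] = pd.to_datetime(stored_results["created"])
        return stored_results[columns]

    results = search_api(query)
    results["html"] = scrape_page(results["link"])
    # Drop any results whose pages we couldn't download.
    results = results[results["html"].str.len() > 0].copy()
    results["query"] = query
    results["created"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    results = results[columns]
    # Insert each row into the database.
    results.apply(lambda x: storage.insert_row(list(x)), axis=1)
    return results
```

Okay, the next part is the fun part.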
It's actually creating the web server to show our results properly. So let's go ahead and create a new file called app.py. And this is going to hold the code for a Flask app.
And Flask is an amazing Python library for creating a web server. So we're going to create a really simple web server that will render a search form and then render the results for us. And then from search, we're going to import that search function that we just created.
And then we're going to import html, which is just going to help us with rendering some HTML properly. Then we'll initialize our Flask application, and then we'll be able to render our pages properly.
So the first thing we're going to do is create a new route that allows GET and POST requests, and it's going to be called search_form. Okay, so a route is basically a URL that you can go to on the web server. When you visit just a regular domain, like www.dataquest.io, that's considered the slash (/) route. So when we visit our web server, which is going to serve at an address that looks like this, this is the default route we'll see. As for the methods: a GET request is what you typically see when you go to a web page; your browser basically asks the server for the HTML, gets the page, and shows it to you. A POST request is when you fill out a form or something along those lines.
Data is sent from your browser to the web server. That's a post request. And we're going to accept both types of requests in our search form.
So what we'll say is: if it's a POST request, then we're actually searching for something. We'll get the search query from our POST request data, and then we'll return the results of a function called run_search. We haven't written this function yet, so don't worry. If it's a GET request, we're going to return show_search_form. So if it's a POST request, we run our search and show the results; otherwise, we just show a form so someone can run a search.
Okay, let's write these functions. First, let's write show_search_form. Our search form is going to be a search template. Now, this isn't necessarily the best way to write HTML templates in Flask, but it is the easiest, so to keep things from getting too complicated, I'm just going to go with it. We're writing the HTML inline to create a very simple form, with an input where you can enter a search query. The template also renders a button that says Search, and when you click that button, it sends the search data to our web server so we can run the search. Now we just need to write the function called show_search_form, and all it does is return our search template for now.
And what this will do is it will call the search function we wrote earlier, pass in the query, then it will render the results page. So we're going to create another quick template to render each of our individual results. And this will render each of the links that we get back.
from our API and enable us to click on them. So this is doing the major work of a search engine. That first part is just rendering a little header that shows the rank of the search result and the link itself. Then we can go ahead and render the title of the page. Finally, we can render a little snippet that we can use to better understand what the page is about. So that's our result template. And then what we'll do here is say: for index, row in results.iterrows().
So this is iterating across each of the rows in our results. What we'll say is rendered += result_template.format(**row). At this point, row is basically a dictionary which contains the snippet, the link, the rank, etc., all the fields we pulled from our search.
So what we're going to do here is we're passing those into the template, and they will replace these values that are in brackets. So the rank of each result will replace this placeholder text. The link will replace this placeholder text and so on.
Then one more thing we need to do. We'll say results["snippet"] = results["snippet"].apply(html.escape). To show you what this does, let me actually show you the snippets we get back from our API.
So some of these snippets will actually have HTML tags in them sometimes. And if you just stick HTML into a web page, the browser will try to render it, treat it as actual HTML. But we don't want it to be treated as HTML. We want it to be treated as part of the snippet. So escaping the HTML makes sure that the browser doesn't render random HTML that happens to be in the snippet.
Okay, so then we just return rendered, and that returns our search results page. So this whole thing should now work.
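Here's a sketch of the result template and run_search together:

```python
# Also in app.py -- render the results below the search form.
result_template = """
<p class="site">{rank}: {link}</p>
<a href="{link}">{title}</a>
<p class="snippet">{snippet}</p>
"""


def run_search(query):
    results = search(query)
    rendered = search_template
    # Escape any HTML inside the snippets so the browser doesn't
    # try to render it as markup.
    results["snippet"] = results["snippet"].apply(html.escape)
    for index, row in results.iterrows():
        rendered += result_template.format(**row)
    return rendered
```

Now we can jump over to our terminal at the bottom, and run flask --debug run --port 5001. This creates a Flask web server at port 5001 on our local machine, which means we can access it from our web browser. And it's going to be running in debug mode, which does a few nice things.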
But the best thing it does is that if you change your code, it automatically restarts the server. So let's go ahead and run that. Then we can grab this URL down here and stick it into our browser. So we can head over to Chrome, type that address into the browser bar, and we get our search form rendered, which is very nice. And 127.0.0.1 is just a special IP address that refers to your local machine.
So when you run a web server on your machine, you can use this IP to access it. And 5001 is just the port it's running on. You can run web servers on different ports.
It doesn't matter too much, so I won't get into too much detail. And then you can type in a search query here and hit search. And then our search will execute. So it's first going to call the API.
Then it's going to scrape the content from each page and then render our results to us. So here are our results. And I think I actually searched for best baby stroller before.
I mean, best is clearly very important here. So that will actually return all of our results. Now, you may notice these results don't look great. They're not nicely formatted.
So we're going to go back and add a little bit of styling to our HTML to make them look nicer. Okay, I'm going to write a little bit of CSS here. CSS is how you style websites. If you don't know CSS, don't worry too much; this is just a little bit of styling that's not required.
It's totally optional. Okay, so our first CSS selector says: for any item on the page that has class="site", apply this styling. We're going to say font-size: 0.8rem, which shrinks our font size slightly, and make the color green. Then we'll apply some CSS to the snippet.
So we'll say anything that has class equals snippet, we'll apply to that. So here we'll say font size equals 0.9 rem. So a little bit larger than our site text. The color is going to be gray. And then we're going to apply a little bit of space to the bottom.
So a 30 pixel margin bottom. Okay, now we just save that. And you may have noticed in this output down here, the flask server reloaded.
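For reference, here's roughly what that styles string in app.py looks like, using the values I just described:

```python
# Also in app.py -- optional styling for the site and snippet text.
styles = """
<style>
.site {
    font-size: 0.8rem;
    color: green;
}

.snippet {
    font-size: 0.9rem;
    color: gray;
    margin-bottom: 30px;
}
</style>
"""
```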
So we can go back and search for best baby stroller again and reload the page. But before we go over to the browser, there's one thing I forgot to do with the styles: we need to add the styles into our search template.
So our search template is going to become styles plus the rest of the search template. That will just ensure our styles are rendered along with the rest of the site. And then when we save, Flask is going to reload again.
Now we can head over to our browser and take a look. So now when we type in best baby stroller, we get some nice formatting. So it's a lot easier to read the results now. You can definitely play around with the formatting on your own and make it look even nicer if you want. Okay, next up is the most important part, writing the filters.
Okay, so we'll call this file filter.py. And what we're going to do is filter our results and re-rank them, so that the highest-quality pages go to the top. We're going to say from bs4 import BeautifulSoup, from urllib.parse import urlparse, and then from settings import *. Okay, so we're going to create a class called Filter here. And when we initialize it, we'll be passing in a list of results.
And what we'll say is self.filtered = results.copy(). So when we first initialize our Filter class, we'll pass in a data frame of results, and those will be stored as part of the class. The first type of filter we're going to set up is a content filter.
So what we'll say is page_content = self.filtered.apply(get_page_content, axis=1). All right, so this is going to apply the function get_page_content to each row of the filtered data frame. Of course, we haven't written this function yet, so let's go ahead and do that. What it's going to do is strip all of the HTML out of a page and just get the text back. So, going back to the browser: if you right-click on a page and hit Inspect, you can see the HTML of the page.
And for this page, you can see this is the HTML we just created. So the form, the individual result templates, etc. What we want to do is strip out all of this other stuff from a website. So things like these tags, these attributes, and just get the text.
And that'll help us understand the quality of the text on a page. Okay, so we're going to use a Python library called Beautiful Soup to help us do that. Beautiful Soup, although strangely named, is very good at parsing HTML. It can do a lot more than just get text, but I'm going to call the get_text method, which will get only the text out of the HTML.
So that's our page content function. Then what we can do is we can find the word count for each of our web pages. So this will just count how many words appear on that page. And then what we'll do is we'll divide by the median.
So essentially, this will tell us: does a page have more words on it, or fewer, than the median for this set of search results? And my assumption is that if a web page has too few words, it's going to be low quality. It may mostly have ads or photos or affiliate links, or just low-quality things that we don't necessarily want to see. So what we'll say is: where word_count is less than or equal to 0.5, we'll set it equal to result count. I'll explain right after I write a little more code, because it'll be easier to explain in a second. And then self.filtered["rank"] += word_count. So what we're doing here is applying a penalty to the rank, based on whether the content has enough words or not.
So if the content has less than half as many words as the median web page for a given search, then we're going to penalize it by pushing it down to the end of the rankings. Result count is the number of results in our search, so if we add result count to the rank, that pushes the search result to the end of all the other results.
So that's essentially what we're doing here. We're saying if you have too little content on your page, we are going to rank your page very, very low. Okay.
Then we need to set up a method here called filter. We'll run this content filter, and then we'll say self.filtered = self.filtered.sort_values("rank", ascending=True). So this is going to re-sort our filtered data frame by rank, so things with rank one, two, three, four, etc. go first. And then we're going to go ahead and round our rank. This isn't relevant for this filter, but it could be relevant for other filters in the future that don't all use whole numbers.
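Here's a sketch of filter.py with the content filter in place. I'm writing the word-count penalty as a boolean mask times RESULT_COUNT, which is equivalent to setting the low-content rows to RESULT_COUNT and everything else to zero:

```python
# filter.py -- re-rank results, starting with a content filter.
from bs4 import BeautifulSoup

from settings import *


def get_page_content(row):
    # Strip the markup and keep just the visible text of the page.
    soup = BeautifulSoup(row["html"], "html.parser")
    return soup.get_text()


class Filter:
    def __init__(self, results):
        self.filtered = results.copy()

    def content_filter(self):
        page_content = self.filtered.apply(get_page_content, axis=1)
        word_count = page_content.apply(lambda x: len(x.split(" ")))
        # Compare each page's word count to the median for this search.
        word_count /= word_count.median()
        # Pages with less than half the median word count get pushed
        # to the bottom of the rankings.
        self.filtered["rank"] += (word_count <= 0.5) * RESULT_COUNT

    def filter(self):
        self.content_filter()
        self.filtered = self.filtered.sort_values("rank", ascending=True)
        self.filtered["rank"] = self.filtered["rank"].round()
        return self.filtered
```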
Okay, so that's setting up our filter. Now we head back to our app. And what we'll say is from filter import Filter. Then we're going to initialize our filter in this run_search piece: fi = Filter(results).
And then we'll say results = fi.filter(). What this will do is re-rank our results according to our filter criteria. Okay, so we can go ahead and save this and head back to our browser,
just to see the impact that has on the rank. So I can again search for best baby stroller here. And the ranks haven't changed a ton at the moment. This filter has pushed some things down to the end, but hasn't necessarily done a ton.
Now let's add another filter, which will have a more dramatic impact on our results. This one is going to be a tracker filter: it's going to push down any results that have a lot of ads, a lot of tracking links, or a lot of tracking JavaScript. So we're going to say tracker_count = self.filtered.apply(tracker_urls, axis=1). Now, we haven't written this tracker_urls function yet.
So let's go ahead and do that. In tracker_urls, we take in a row, and we're again going to use Beautiful Soup to parse our HTML. We'll say scripts = soup.find_all("script", src=True). Beautiful Soup can search through the HTML for specific tags, and for specific attributes in those tags. So we're looking for any script tag in our HTML that has an src property. This is usually used to load third-party JavaScript, like Google Analytics, Google Tag Manager, that kind of stuff. And then we'll say srcs = [s.get("src") for s in scripts].
And this is just going to get the URL that each script is being loaded from. So this will help us filter out the pages that load a lot of scripts, like Google Analytics and that kind of thing.
Then we'll do the same thing for links. So we're going to find all of the a tags on the page, which are the link tags, that have an href attribute. href is where the link tag is pointing.
And then we'll say hrefs = [l.get("href") for l in links]. So that's getting where each link is pointing. And then we'll say all_domains = [urlparse(s).hostname for s in srcs + hrefs]. So this is going to loop through the URLs in both of those lists, and it's going to pull out the hostname. So if a script is pointing to a full URL that ends in something like script.js, our parsing is going to get rid of all the rest and just leave us with the root domain that either the link or the script source is pointing to. And then what we're going to do is check whether each domain is in a set of bad domains, which we haven't loaded yet. So let me show you where we're going to get that list: from this GitHub URL.
So this is a list of a bunch of domains that have either bad content, so they show content that's malicious, or they show ads or trackers or something along those lines. So these are all bad domains. So this list of bad domains, you're going to want to download.
And that link is in the readme file that is in the description and that I shared earlier. Okay, so we'll go ahead and copy that file into a text file called blacklist.txt in our directory. So you can either download the file or you can go ahead and just copy paste.
And then we'll say: with open("blacklist.txt") as f, domains = set(f.read().split("\n")). So this will read our file in, split it up by line, and read it into a set called domains. By default, splitting gives us a list.
A set is like a list where every element is unique, and it's a lot faster to check whether something is in a set than in a list. So when we run this "in" operation, it goes a lot faster if the bad domain list is actually a set. Okay, and then we'll return the length of bad_domains. All right, so that's our tracker_urls function.
Then we can go back and finish up the tracker_filter method. So we have our tracker count. Okay, and because we all hate trackers so much, if the tracker count is greater than the median tracker count, we're going to set it to result count times two, to severely penalize it, and then we'll add it to our rank. So any website that has a lot of trackers, and links to a lot of other bad domains, will have a very poor rank in our search engine.
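Here's a sketch of those tracker filter pieces in filter.py. The tracker_filter method belongs inside the Filter class from before, and as you're about to see, it also needs to be called from the main filter method:

```python
# Also in filter.py -- count ad/tracker domains on each page.
from urllib.parse import urlparse

# Load the blacklist into a set; checking membership with "in" is
# much faster against a set than against a list.
with open("blacklist.txt") as f:
    domains = set(f.read().split("\n"))


def tracker_urls(row):
    soup = BeautifulSoup(row["html"], "html.parser")
    scripts = soup.find_all("script", src=True)
    srcs = [s.get("src") for s in scripts]
    links = soup.find_all("a", href=True)
    hrefs = [l.get("href") for l in links]
    # Keep just the hostname of each script source and link target.
    all_domains = [urlparse(s).hostname for s in srcs + hrefs]
    bad_domains = [d for d in all_domains if d in domains]
    return len(bad_domains)


# This method goes inside the Filter class:
def tracker_filter(self):
    tracker_count = self.filtered.apply(tracker_urls, axis=1)
    # Pages with more trackers than the median get a severe penalty.
    tracker_count[tracker_count > tracker_count.median()] = RESULT_COUNT * 2
    self.filtered["rank"] += tracker_count
```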
All right, now we can go back and actually search for best baby stroller, and we can see if our results have improved.
Although, the one thing I forgot to do is actually call self.tracker_filter() in our main filter method; we need to call it there to apply the filter. Okay. Now, when we search for best baby stroller, which I just did, we'll see that the results have improved. This Wirecutter article, which I think is probably the best result here, is now second. And a lot of results that just have a ton of trackers or low information are ranked pretty low now, which is great.
Okay, so now we've actually gone through and built a search engine with filtering. It gets results from the Google Search API, stores them, and filters them in order to give us better results.
There's a ton you can do to continue improving this on your own, and I'm going to show you one additional thing that will give you a nice springboard. If we go to our storage.py file, you might have noticed that I added a field called relevance that we haven't used. My goal for this field was to allow us to manually indicate which results were relevant and which ones weren't.
And if we store this data, once we do enough searches, we can eventually use machine learning to filter our results. Right now, our filters are all based on median thresholds, but we can eventually base them on machine learning; we just need data first. This relevance score will help us get that data. So I'll show you how to store that relevance score for yourself.
So in our storage class, we're going to add an extra method, which will help us store that relevance score. It's going to be called update_relevance, and it takes self, query, link, and relevance. And what we're going to say is cursor = self.con.cursor().
Then we're going to execute some SQL using that cursor. We're going to say update results set relevance = ? where query = ? and link = ?. And then we'll pass in relevance, query, and link.
And basically, sqlite3 will replace these question marks with the actual values, but in a safe way, one that doesn't allow someone entering malicious values to just totally destroy our database. Then we'll go ahead and commit our changes to the database and close the database cursor. Okay, so we now have a method to store the relevance.
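Here's a sketch of that method, which goes inside DBStorage in storage.py:

```python
    # Another method inside DBStorage, in storage.py.
    def update_relevance(self, query, link, relevance):
        cur = self.con.cursor()
        # The ? placeholders let sqlite3 escape the values safely.
        cur.execute(
            "UPDATE results SET relevance=? WHERE query=? AND link=?",
            [relevance, query, link],
        )
        self.con.commit()
        cur.close()
```

Now we actually need to use it, so let's head over to app.py. Here's where we'll add a little bit of code that enables a user to mark whether a result is relevant or not. We're going to add another route called relevant, and it's only going to take POST requests. And then we'll write a function called mark_relevant.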
So our data is going to be JSON data this time, so we're going to use request.get_json() to get it. First, we'll get the query; then we'll get the link.
Then we'll create a storage instance. Actually, I don't think we imported DBStorage here, so I'll go ahead and import it. So we'll go ahead and create our storage, and then we'll say storage.update_relevance(query, link, 10). Whenever anyone marks something relevant, we're going to set the relevance to 10. And if you want people to enter numbers themselves, or you want a little more flexibility, you can add that in on your own.
And then what we'll do is return jsonify(success=True). So we'll just return a little message that says, hey, everything went okay, your result has been marked as relevant.
We need to import jsonify from Flask. Okay, so that's mark_relevant.
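Here's a sketch of that route in app.py:

```python
# Also in app.py.
from flask import jsonify

from storage import DBStorage


@app.route("/relevant", methods=["POST"])
def mark_relevant():
    data = request.get_json()
    query = data["query"]
    link = data["link"]
    storage = DBStorage()
    # Mark this result as relevant with a fixed score of 10.
    storage.update_relevance(query, link, 10)
    return jsonify(success=True)
```

Now we need to give the user a way to mark relevance on the front end, in the website itself.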
So we're going to go into our result template and add a little HTML span tag, which means it'll be inline. It's going to have the rel_button class, meaning it's just a relevance button, and we'll label it Relevant. Then we'll add a little onclick property, which means that when you click this button, it's going to call a JavaScript function called relevant.
We'll pass in the query and the link to that call. And if you're not familiar with JavaScript or this onclick, don't worry too much about it. All that's important is that we're giving the user a way to mark that a result was good, and storing that in the database. Once it's stored in the database, we can do amazing things with it, like machine learning.
Okay, then we'll write a little bit of CSS for the rel_button class. We'll basically say that the cursor style should be pointer, which means it'll act like a button, and then we're going to color it blue so it stands out a little bit. Then we're going to write our relevant function. It's going to be defined as const relevant = function(query, link).
So it takes two parameters, and it will use the web fetch API to actually call our endpoint. We're making a POST request, so we say method: "POST". Then we tell the web server what kind of content we're sending: the Content-Type is application/json, and the Accept header says the response should be application/json as well. (Web requests are weird; they can be a little bit confusing.) Then we pass these parameters to our web server in the request body: the query and the link. And then we just close out the outer function. Great.
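Here's a sketch of those front-end pieces in app.py: the extra CSS and the JavaScript go in the styles string, and the span goes in the result template. Keeping the script inside styles matters, because styles is never run through .format, so the braces in the JavaScript won't confuse Python's string formatting:

```python
# Also in app.py -- the relevant button and the JavaScript behind it.
styles = """
<style>
.site { font-size: 0.8rem; color: green; }
.snippet { font-size: 0.9rem; color: gray; margin-bottom: 30px; }
.rel_button { cursor: pointer; color: blue; }
</style>
<script>
const relevant = function(query, link) {
    fetch("/relevant", {
        method: "POST",
        headers: {
            "Content-Type": "application/json",
            "Accept": "application/json"
        },
        body: JSON.stringify({"query": query, "link": link})
    });
};
</script>
"""

result_template = """
<p class="site">{rank}: {link}
  <span class="rel_button" onclick='relevant("{query}", "{link}");'>Relevant</span>
</p>
<a href="{link}">{title}</a>
<p class="snippet">{snippet}</p>
"""
```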
Okay. So what this will do is whenever we click on the relevant button, it's going to call this backend route relevant, and it's going to pass in the query and the link that we want marked relevant. Okay. Let's go ahead and save that and see how it runs. All right.
So when we search now, like we search for best baby stroller, we'll now see this blue button called relevant that shows up. And when we click on it, it will actually store that relevance information in the database. So let me show you what that looks like.
We'll head back over to PyCharm. And you can see down here in our Flask log that a POST request was made to the relevant endpoint, so that looks great. And the response was 200, which indicates success. Now what we can do is create a new terminal, and we can actually run sqlite3 in the terminal against links.db, and say select * from results where relevance is not null.
All right, so we now get some results where relevance is not null. We'll probably want to trim this down, though: the html field just has a ton of text, so we're going to filter that column out of the query. All right.
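In the terminal, that might look something like this:

```
sqlite3 links.db "SELECT query, link, relevance FROM results WHERE relevance IS NOT NULL;"
```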
So here we see the query was best baby stroller, this was the link, and the relevance was marked as 10. If you do this for enough searches, marking the relevant result and storing it in the database, eventually you'll get enough training data to actually do some machine learning. And that's a cool project that you can use to continue this exercise.
So essentially, you'd be building filters that query data from the database, train a model, and filter based on that model. All right, so we've come a long way in this project. We started out by registering for the Custom Search JSON API. Then we built a storage class to store information in a database. Then we created a search function that can pull stored results from our database or query the API. Then we created a little Flask application that actually runs the search for us.
And then we built some filters to filter and re-rank the results. So you now know everything you need to build a custom search engine with filtering and continue to extend the filters on your own. All right.
Hope you enjoyed this. And if you build on this and extend it, please leave a comment. I'd love to hear what you built.