In today's video we're going to walk through a natural language processing project from start to finish by doing sentiment analysis on Amazon reviews. Sentiment analysis is the use of natural language processing to identify the emotions behind text. We're going to walk through a traditional approach using Python's Natural Language Toolkit, or NLTK, and then we'll implement a more complex model called RoBERTa that's provided by Hugging Face. We'll do some analysis of how the different models perform, and we'll even explore some pre-trained pipelines that make sentiment analysis really quick and easy. Hi, my name is Rob and I make videos about data science, machine learning, and coding in Python. I'm sharing everything we do today in a Kaggle notebook; you can find the link in the description, copy that notebook, and explore all the stuff we do today. If you enjoy this video, please consider subscribing, liking, and following me on Twitch, where I stream live coding. All right, let's get to the code.

Okay, so here we are in a Kaggle notebook, and this is the basic outline of what we're going to be talking about today. We're going to be doing sentiment analysis in Python using two main techniques. The first is the older way of approaching sentiment analysis, a model called VADER, which uses a bag-of-words approach. Then we're going to look at a pre-trained model from Hugging Face, a RoBERTa-type model, which is a more advanced transformer model, and we'll see how the results compare between those two. We're also going to explore a Hugging Face pipeline. But before we get into that, let's talk about the data and do some basic analysis with the Natural Language Toolkit, which is a great library for Python. Over on the right side you can see the dataset we're going to use: a bunch of Amazon fine food reviews. These are text reviews for food products on Amazon, along with the rating out of five stars that each reviewer gave, and it's all in CSV format that we're going to pull in.

Before we get too far into it, let's do our imports. We're going to import pandas as pd and numpy as np, and for plotting we'll import matplotlib.pyplot as plt and seaborn as sns. We'll set a style sheet to use for our plots, and to start out we'll also import nltk, the Natural Language Toolkit we'll be using first. Now let's read in our data: I'll add a comment saying "read in data" and read Reviews.csv from the input directory. After reading it in, a head command shows that each row has a unique id, the product id, user id, and profile name, and then the really interesting stuff: the score, which is how many stars out of five the reviewer gave the item, and the text. It's a little small to see here, but if I take the text column and show the first entry, it's the actual text of the review the reviewer wrote for that product, and that's the column we'll be running our sentiment analysis on across the entire dataset. This dataset is actually quite large; if I print the shape there are almost half a million reviews, so just for time's sake let's downsample it. I can do that pretty simply with a head command that takes the first 500 rows, and if we print the data frame shape afterwards we'll see it's 500 rows.
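Roughly, that setup looks like this; the file path and the ggplot style sheet are my assumptions about the Kaggle environment, so adjust them for wherever you run this:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

plt.style.use('ggplot')  # any style sheet works; this one is just an assumption

# Read in data (path assumes the Kaggle "Amazon Fine Food Reviews" dataset)
df = pd.read_csv('../input/amazon-fine-food-reviews/Reviews.csv')
print(df.shape)   # roughly half a million rows

# Downsample to the first 500 reviews to keep the runtime short
df = df.head(500)
print(df.shape)
df.head()
```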
You could scale this project up very easily to all half a million reviews if you wanted to run a more intense analysis. I'll also put a data frame head command here so we can see and refer back to the columns we have available.

Now let's do some quick EDA to get an idea of what this dataset looks like. We'll take the score column, which we know holds values between 1 and 5, and run a value counts on it, which gives us the number of times each score occurs. Then we sort the index and make a bar plot, with the title "Count of Reviews by Stars" and a fig size of 10 by 5. Let's also add a label: I'll put in some line breaks to clean this up, call the result our axis, set the x label to "Review Stars", and do a plt.show. We can see that most of the reviews are actually 5 stars, and then it drops off, with a little uptick in the number of one-star reviews. So our dataset is heavily biased towards positive reviews, which is good to know before we go any further.

The next thing we'll do is some basic NLTK, starting with a single example review. Let's set example equal to the 50th value of the text column and print it. So what did this person say? "This oatmeal is not good. It's mushy, soft, I don't like it. Quaker Oats is where you go." That seems to be negative sentiment, but before we get into that, let's see some of the things NLTK can do out of the box. NLTK can tokenize this sentence: we take the example and run NLTK's word tokenizer, which basically splits it into the individual parts of each word in the sentence. It looks pretty clean at the beginning, but then you can see that "don't" gets split into "do" and "n't", so this is a little bit smarter than just splitting the text on spaces, and it gives us our actual tokenized result. In natural language processing you often need to convert the text into some format the computer can interpret, and tokenizing is how you do that. Let's call these our tokens and show the first 10 so we remember what they look like.

Once we have these tokens, another thing NLTK can do out of the box is find the part of speech for each word. We run NLTK's pos tag, for part-of-speech tagging, on the tokens, and now each token also has its part of speech. These part-of-speech values are codes, and we can load up a reference page that explains what each abbreviation means. Going back to our example, "oatmeal" is tagged NN, which in part-of-speech tagging means a singular noun. So every token in this text has now been given its part of speech. Let's call this tagged and show the first 10 again.

We can take this one step further and turn these part-of-speech tags into entities. With NLTK's chunk module and ne chunk, NLTK uses the recommended named entity chunker to chunk the given list of tagged tokens, grouping them into chunks of text. Let's run it to see what it looks like; we need to store the result since we're running in a notebook, and then we call pprint on the entities for a pretty print. You can see it has chunked this into a sentence and noted that one piece is an organization, so there's other interesting structure in the text that NLTK can extract automatically.
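Here's a sketch of that quick EDA plus the NLTK basics just described; the capitalized column names (Score, Text) match the Kaggle CSV, and the first run may need a few nltk.download() calls (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words):

```python
# Count of reviews by star rating
ax = df['Score'].value_counts().sort_index() \
    .plot(kind='bar', title='Count of Reviews by Stars', figsize=(10, 5))
ax.set_xlabel('Review Stars')
plt.show()

# Pull out one example review to experiment with
example = df['Text'][50]
print(example)

tokens = nltk.word_tokenize(example)    # split the review into word tokens
print(tokens[:10])

tagged = nltk.pos_tag(tokens)           # part-of-speech tag each token
print(tagged[:10])

entities = nltk.chunk.ne_chunk(tagged)  # group tagged tokens into named-entity chunks
entities.pprint()
```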
All right, so that's just a basic primer on NLTK, but we want to get into sentiment analysis, and we're going to start with VADER. What does VADER stand for? As I wrote up here, it's the Valence Aware Dictionary and sEntiment Reasoner. This approach essentially takes all the words in our sentence, where each word carries a positive, negative, or neutral value, and combines them; it just does the math across all the words and adds them up to tell you how positive, negative, or neutral the statement is. One thing to keep in mind is that this approach does not account for relationships between words, which in human speech is very important, but it's at least a good start. We also remove something called stop words: words like "and" and "the" that don't carry any positive or negative feeling and are just there for the structure of the sentence.

So let's do some sentiment analysis using this VADER approach. We do from nltk.sentiment import SentimentIntensityAnalyzer, and we also import tqdm from tqdm.notebook; that's just a progress bar tracker for when we loop over this data, and I've made a video about tqdm that you can watch if you're interested. Then we make our sentiment analyzer object by creating a SentimentIntensityAnalyzer and calling it sia (and note the import needs to come from tqdm.notebook). Now that we have our sentiment intensity analyzer object, we can run it on text and see what the sentiment is based on the words.

Let's try it on some examples. If we say "I am so happy!" with an exclamation point, we can see that VADER has tagged the negative score as zero; these scores run from zero to one, so negative is zero, neutral is about 0.3, and positive is 0.682, mostly positive. There's also a compound score, which is an aggregation of the negative, neutral, and positive scores; the compound value runs from negative one to positive one and represents how negative or positive the text is overall, and if you want more detail you can look at the breakdown of negative, neutral, and positive. So it did a good job: it tagged this as mostly positive. Let's try the opposite with sia polarity scores on "this is the worst thing ever". Now the polarity scores are mostly negative and neutral with nothing positive, and the compound score is about negative 0.62, so more on the negative side. Very interesting. Now we can run sia on the example we had before, the oatmeal comment, and see what we get: it's scored as fairly neutral, but with some negative, no positive at all, and an overall negative compound score.
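As a rough sketch of that VADER setup (you may need nltk.download('vader_lexicon') the first time):

```python
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm  # progress bars for the loops later on

sia = SentimentIntensityAnalyzer()

print(sia.polarity_scores('I am so happy!'))
print(sia.polarity_scores('This is the worst thing ever.'))
print(sia.polarity_scores(example))  # the oatmeal review from earlier
```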
Next we want to run the polarity score on the entire dataset, basically looping through the data frame, running it on every text field, and grabbing the polarity scores, which we can do with a simple loop. We loop with tqdm over the data frame's iterrows, setting the total to the length of the data frame so the progress bar shows it's out of 500. For each row we take the text, and we take the id column, which we'll call my id. We also want somewhere to store these results, so we make a dictionary called res, for results, and on every pass through the loop we store the polarity score of the text under that id. That's it; let's run it, and it finishes really fast. Now we have a results dictionary with the negative, neutral, positive, and compound score for each id, but we want to put it into a pandas data frame because that's easier to work with. We can do that quickly by running pd.DataFrame on the dictionary, since pandas can take in a dictionary pretty easily, except it comes out oriented the wrong way, so we run a .T on it to transpose it. Now we have an index that is our id, plus the negative, neutral, positive, and compound sentiment score for each review.

Let's call this vaders, our VADER results. Then we take vaders, reset the index, and rename that index column to id so we can merge it onto our original data frame with a left merge. Now we basically have our data frame with the sentiment scores plus all the other values from the original dataset, including the text; if I run a head on this, we can see we have sentiment scores and metadata together.

Now let's see whether this is generally in line with what we would expect. We're going to make an assumption about our data: if the reviewer gave the item a five-star review, the text is probably more positive than if it was a one-star review, and a one-star review will have a more negative connotation than a five-star one. We can check that with a simple bar plot using seaborn, which I imported earlier. We do a bar plot (let's title this part "Plot VADER results") where the data is vaders, the x value is the score, which remember is the star rating the reviewer gave, and the y value is compound, that negative-one-to-one overall sentiment of the text. We set the title to "Compound Score by Amazon Stars Review" and show it (after fixing my misspelling of compound). A one-star review has a lower compound score and a five-star review has a higher one, which is exactly what we would expect.

We can break this down further: instead of looking at the compound score, we can look at the positive, neutral, and negative scores for each star rating. We do that with sns barplot, data is vaders again, x is score again, and first we plot just the positive score to see what it looks like. Then let's make three of these side by side, positive on the left, then neutral, then negative on the right, using matplotlib subplots to make a 1-by-3 grid. We grab the returned axes, put the positive plot in the first position, neutral in the second, and negative in the third, set the titles to Positive, Neutral, and Negative so we remember which is which, and show the plot (I just needed to pass each barplot its ax). Making the figure a little less wide, we see that positivity rises as the star rating goes up, the neutral score is roughly flat, and the negative score goes down, so comments become less negative as the star rating gets higher. Great. This confirms what we hoped to see and shows that VADER captures a real connection between the sentiment score of the text and the actual rating the reviewers gave. Let's add a tight layout because the y-axis labels overlap a bit, but I think this is good.
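A sketch of that loop, the merge back onto the original frame, and the two plots, using the column names from this dataset:

```python
# Run VADER's polarity score on every review, keyed by the review Id
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)

# Dict of dicts -> DataFrame, transposed so each review is a row, then merged with the metadata
vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'Id'})
vaders = vaders.merge(df, how='left')

# Compound score by star rating
ax = sns.barplot(data=vaders, x='Score', y='compound')
ax.set_title('Compound Score by Amazon Star Review')
plt.show()

# Positive, neutral and negative scores side by side
fig, axs = plt.subplots(1, 3, figsize=(12, 3))
sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
sns.barplot(data=vaders, x='Score', y='neu', ax=axs[1])
sns.barplot(data=vaders, x='Score', y='neg', ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
plt.tight_layout()
plt.show()
```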
All right, so now we're going to take it up a notch. Our previous model looked at each word in the review and scored each word individually, but as we mentioned before, human language depends a lot on context. A sentence containing negative words can actually be sarcastic, or the words can relate to one another in a way that makes it a positive statement, and the VADER model won't pick up on that sort of relationship between words. More recently, transformer-based deep learning models have become very popular precisely because they can pick up on that context. We're going to use Hugging Face, one of the leaders in building these kinds of models, gathering them, and making them easily available, and import from transformers. This is Hugging Face's library; you can pip install transformers to get it on your local machine, or of course you can just run it in a Kaggle notebook like we are right now.

From transformers we import AutoTokenizer, which tokenizes text similarly to what we showed NLTK can do, and AutoModelForSequenceClassification (you can see from the autocomplete that Hugging Face has a lot of different model types). We also import softmax from scipy.special, which we'll apply to the model outputs, because the raw outputs don't have softmax applied and this squashes them into values between zero and one. Then we're going to pull in a very specific model, provided through Hugging Face, that has been pre-trained on a bunch of labeled data for exactly the kind of sentiment task we're trying to do.
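A sketch of pulling that model down; the exact checkpoint isn't named here, so the cardiffnlp Twitter sentiment checkpoint below is my assumption (any three-label RoBERTa sentiment model on the hub loads the same way):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

# Assumed checkpoint: a RoBERTa model fine-tuned for sentiment on labeled tweets
MODEL = 'cardiffnlp/twitter-roberta-base-sentiment'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
```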
When we run the AutoTokenizer and AutoModelForSequenceClassification from_pretrained methods to load that model, they pull down the model weights that have been stored, which is really great because we're essentially doing transfer learning: this model was trained on a bunch of labeled Twitter comments, and we don't have to retrain it at all, we can just take the trained weights, apply them to our dataset, and see what comes out. The first time you do this you'll see it download all of the weights, which is expected; once that's finished, we have a model and a tokenizer we can apply to the text.

Remember our example from before, the oatmeal comment, and what the polarity scores from the older model looked like; let's call those the VADER results on the example, with their negative and neutral values. Now we want to run the same example through the RoBERTa model we've pulled, which takes a few steps. First we encode the text: we take the tokenizer we pulled in, apply it to the example with return tensors set to "pt" for PyTorch, and call the result our encoded text; this takes the text and turns it into the numeric token ids the model will understand. Then we run our model on that encoded text, it's that simple, and the result is our output, a tensor with our results. Next we take that output, detach it from being a tensor and convert it to numpy so we can work with it locally, storing it as scores, and the last thing we do is apply the softmax we imported to those scores. If we print the scores, we have three values in a numpy array, similar to the previous model: the negative, the neutral, and the positive score for this text. Let's make a scores dictionary where roberta negative is the first value, then neutral, then positive, indices 0, 1, and 2 (with the missing comma added), and print it.

So the RoBERTa model, much more than the VADER model, thinks that this comment is negative, which makes sense from reading it; this is a very negative review of the product. Already we can see how much more powerful RoBERTa is than the VADER model. Let's run it on the entire dataset like we did before. We can do that pretty easily by turning the code we just wrote into a function called polarity scores roberta, which takes an example like our code did, runs all of these steps, and returns the scores dictionary, so we can run it on any single piece of text and get back the scores dictionary just like the code we wrote above.
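Putting those steps into the function described above, as a sketch; it assumes the tokenizer and model loaded in the previous snippet, and that the model's three output labels are ordered negative, neutral, positive:

```python
def polarity_scores_roberta(example):
    encoded_text = tokenizer(example, return_tensors='pt')  # token ids as PyTorch tensors
    output = model(**encoded_text)                           # raw logits from the model
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)                                 # squash the logits so they sum to 1
    return {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2],
    }

print(polarity_scores_roberta(example))  # the oatmeal review again
```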
Next we iterate over the dataset just like we did before, taking the loop code from above. For each row we keep our VADER result, since we'll still run VADER on the text, and we also get our RoBERTa result by calling the polarity scores roberta function we just wrote on that same text, and we break after the first row just to see how it ran for one iteration. We can see we have our VADER results and our RoBERTa results, exactly what we wanted, and now we want to combine them. There's a way to merge two dictionaries in newer versions of Python, but we're running an older version, so we'll just unpack both into one dictionary and call it both, for both results. Let's also rename the VADER keys from negative, neutral, and positive so they're explicitly labeled as VADER results; I'll copy in a bit of code that basically prefixes each key with "vader" before the original key name, and then we combine the two dictionaries.

Running that for one iteration, the results look good, so now we want to run it through all 500 examples. I'll take out the break and let it run through; oh, and we also need to actually store the combined result into the res dictionary keyed by the id. Now, I know this is going to break, because I ran it before, but I want to show what happens when it does. And there it is: it broke on one of the examples because the text had issues and couldn't be run through the RoBERTa model. It has to do with the size of the text; some reviews are simply too long for the model to handle. Rather than debugging that right now, we'll skip those by adding a try/except clause, so the loop runs through except when this RuntimeError occurs, in which case we just print a message so we know it broke for that id. Now we rerun and let it go through all 500.
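The full loop over all 500 reviews then looks roughly like this, skipping the reviews whose text is too long for the model:

```python
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    try:
        text = row['Text']
        myid = row['Id']
        vader_result = sia.polarity_scores(text)
        # prefix the VADER keys so they don't clash with the RoBERTa ones
        vader_result_rename = {f'vader_{k}': v for k, v in vader_result.items()}
        roberta_result = polarity_scores_roberta(text)
        both = {**vader_result_rename, **roberta_result}
        res[myid] = both
    except RuntimeError:
        print(f'Broke for id {myid}')
```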
Okay, so that's done running, and you can see it did break for two examples. We could have shortened those texts and they would have worked, but this gets us a good result for now. It was also pretty slow; keep in mind that's because I was running it only on a CPU. These RoBERTa and transformer models are optimized to run on a GPU, and if I went into the notebook settings and turned on the GPU it would have run a lot faster, but for this case a CPU is fine.

Now let's take the results and build a results data frame, similar to what we did before: the same line of code that transposes the results dictionary, which we'll call results df, merged back onto the main dataset. If I do a head command on it, we can see that for every row we now have all four VADER scores plus the RoBERTa scores.

Really quickly, let's compare scores between the models, and we can do this with seaborn's pair plot, which I think is a nice way to look at it: a pair plot lets us see the pairwise comparison between each of the features we choose. I'll run it on the results data frame and give it the variables we want it to look at, which are the VADER negative, neutral, and positive columns (let's drop the VADER compound, since I don't think we need it for this comparison) plus the RoBERTa columns. We color the hue of each dot by the score, that one-to-five-star rating, and give it a palette where we can easily tell the values apart.

Okay, there's a lot going on here, but a few things stand out. The five-star reviews are the purplish color, and if we look at VADER, those five-star reviews sit more toward the right on the positive axis, while for the RoBERTa model they're way over to the right. There are some correlations between the RoBERTa model and the VADER model, although it's a little hard to see exactly, but one thing that becomes very clear is that the VADER model is a lot less confident in its predictions than the RoBERTa model, which really separates the positive, neutral, and negative scores for each prediction. Look at the positive and neutral panels: the RoBERTa model gives very high positive scores to the five-star reviews, and most of the one-star reviews get very low positivity scores. Pretty cool.

Let's also review some specific examples. This is going to be fun, because now that we have a sentiment score and the five-star ranking of each review, we can look for places where the model says the opposite of what we'd expect. One way to do that is to take the results data frame and query for one-star reviews, so score equals one, then sort the values by the RoBERTa positivity score with ascending set to false, so the highest positivity score among reviews rated one star appears at the top, then take the text of that, grab the values, and print the top one.
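A sketch of building that results frame, the pair plot, and the "most positive one-star review" query:

```python
results_df = pd.DataFrame(res).T
results_df = results_df.reset_index().rename(columns={'index': 'Id'})
results_df = results_df.merge(df, how='left')

# Compare the two models' scores, colored by the star rating
sns.pairplot(data=results_df,
             vars=['vader_neg', 'vader_neu', 'vader_pos',
                   'roberta_neg', 'roberta_neu', 'roberta_pos'],
             hue='Score',
             palette='tab10')
plt.show()

# One-star review that RoBERTa scores as most positive
print(results_df.query('Score == 1')
      .sort_values('roberta_pos', ascending=False)['Text'].values[0])
```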
What we're looking at here is text the model says is positive even though the actual reviewer gave it one star. It says: "I felt energized within five minutes, but it lasts about 45 minutes. I paid $3.99 for this drink. I should have just drunk a cup of coffee and saved my money." This is a very nuanced bit of text: it starts off sounding sort of positive ("I felt energized", "it lasted 45 minutes"), so the model gets confused and reads it as more positive than it is, even though by the end of the statement we can tell it's negative. That's interesting, and it makes sense.

Let's do the same thing with the VADER score and look at its most positive-scoring one-star review. It says: "So we cancelled the order. It was cancelled without any problem. That is a positive note." The reviewer actually uses the word "positive", and "without any problem" sounds positive, but it's a negative review that's being a little sarcastic, a little tongue-in-cheek, and the model doesn't pick up on that, especially the VADER model, which only looks at a bag of words and scores each word individually.

Let's also look at negative sentiment in five-star reviews, starting with the RoBERTa model: we switch the query to five-star reviews and look at the most negative-scoring text. It says: "This was so delicious, but too bad I ate them too fast and gained two pounds! My fault." So it is sort of negative sentiment inside a positive review, and it's kind of funny that one came up. Doing the exact same thing for VADER happens to surface the exact same review, so both models got, I guess you could say, confused, although maybe this really is negative sentiment in a positive review, and they did a better job than we'd expect.

All right, we've explored a lot with sentiment analysis so far. The one extra bonus piece I want to show you is the Hugging Face Transformers pipelines, which make everything really simple; you can read about them on the Hugging Face website. You basically just import pipeline from the transformers library (spelling pipeline correctly), and then create a sentiment pipeline with it. There are a handful of tasks it's automatically set up for, and the one we want is sentiment analysis; it will automatically download a default model and embeddings for the pipeline, so you can run sentiment analysis with two lines of code, super quick and easy. You can also go in and change the model and tokenizer it uses, but the nice part is you don't have to set anything up: it downloads the model, gives us a default sentiment pipeline, and then we can just run text through it. Once the download is done, I can say "I love sentiment analysis!"; the output format it gives by default is different, but it says this is positive with very high confidence. Typing in "make sure to like and subscribe" also comes out positive, and just to try a bad example, "boo" shows as negative, so it's working.
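And the pipeline version, as a minimal sketch:

```python
from transformers import pipeline

sent_pipeline = pipeline('sentiment-analysis')  # downloads a default model and tokenizer

print(sent_pipeline('I love sentiment analysis!'))
print(sent_pipeline('Make sure to like and subscribe!'))
print(sent_pipeline('booo'))
```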
So that's it for our sentiment analysis project tutorial. We walked through two different types of models you can use for sentiment analysis, explored some of the differences between them, and ran them on a whole corpus of 500 Amazon reviews; you could scale this up, run it on all half a million examples, and see what insights you can find. Thanks again for watching my videos, make sure you subscribe so you'll be notified the next time I release a video, and I'll see you in the next one. Bye!