Transcript for:
Simple Linear Regression Example in Python

Alright, so now we are at least going to get started with setting up a simple linear regression example. The first thing we need to make sure we have is scikit-learn, pandas, and Quandl. So open up a terminal, command prompt, whatever, and pip install sklearn, pip install quandl, and pip install pandas. Once you have all those, you are good to go. So to install those, go ahead and pause the video and pick back up once you have them.
Okay, so once you have those, let's go ahead and get started with a simple example. We are starting with regression, and the idea of regression is to take continuous data and figure out a best-fit line for that data. Basically, what that boils down to is that we are trying to "model" your data, and the way we do that with regression, at least with simple linear regression, is just with a straight line. The equation of a line, as we will talk more about down the line, is, as you might remember from school, y = mx + b. So if you have x, and you also have m and b, you can figure out what y is. So basically the whole point of regression is to find out what m and b are.
So for example, a lot of people use regression with stock prices, so that's what we are gonna do, at least in this one. The idea is, this is continuous data: you've got months and months of stock prices, and each price belongs to its own unique day. But all the data is kind of one dataset together, as opposed to classification, where each group of data has its own unique label. With machine learning, at least with supervised machine learning, everything boils down to features and labels. Features are like your attributes, or in this case, the continuous data. So let's go ahead and get started, and we'll talk a little bit more about features.
So first of all, let's go ahead and import pandas as pd. And then we are gonna import Quandl with a capital Q.
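To make the "find out what m and b are" idea a bit more concrete, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The function name fit_line and the sample points are made up for illustration; they are not from the video, and this is not the scikit-learn code used later in the series.

```python
# Ordinary least squares for a single feature, written out by hand.
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Made-up points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = fit_line(xs, ys)
print(m, b)
```

Once m and b are known, predicting y for a new x is just m * x + b, which is the "if you have x, m, and b, you can figure out y" step from above.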
And then what we are gonna say is df, for dataframe, equals Quandl.get, and we'll put in the ticker. You can get this from Quandl. If you just go to quandl.com, you can use the little search and find stuff. Like, if I say "google stock," we can probably find it. Let's see, I am trying to find... we are using the WIKI dataset. Let's just do free. Anyway, you can find all kinds of different datasets here, but we are looking simply for the WIKI one. Here it is. You pick a dataset and you can come over here. You can either just download it here or, more importantly, here is the Quandl code, and then you can click on, like, Python, and this is the exact statement to get it.
If you have an account, you can make basically unlimited requests for free data. If you don't use an account, and we are not gonna use an account or an auth token here, I think it's limited to like 50 calls a day. We are actually only going to use Quandl fairly briefly here, and then maybe later on. So you really don't need to create an account, but if you like Quandl, you might as well make one at some point.
So anyway, Quandl.get, and then wiki/google was the ticker there. Then we can simply print df.head, just so we can see what it is we are working with. We'll see that basically each column here is a feature. So the open, high, low, close, these are features. In machine learning you can have all the features you want, but you want to have meaningful features, features that actually have something to do with your data. Some people are pretty avid believers in ideas like pattern recognition with stock prices, and that might be you, but do you need every single one of these open, high, low, close columns to do pattern recognition? No.
Also, you'll notice we've got open, high, low, close, volume, and then the adjusted versions. Adjusted means adjusted after things like stock splits. For a stock split, maybe your company has 10 shares and each share is $1000, and you decide, I want people to be able to buy shares of my company for less than $1000. So you might say, okay, BAM, every share is now two shares, so we have 20 total shares and the share price is $500. You have adjusted prices to account for that, so it doesn't look like the stock price went from $1000 to $500. That's what adjusted is, so we are gonna be using those.
But again, each one of these is really related to the others. Like, the correlation of these two columns is super high. So would you use each one of these columns? Does the next one really bring that much meaningful data? No. But one thing to always think about when you have features and labels is, what about the relationship between those columns? When we get into something like deep learning and some of the other algorithms, you can start to discover relationships between attributes, but with regression, just simply no. What you wanna do is simplify your data as much as possible.
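As a rough illustration of the stock split example, here is a small sketch in plain Python that back-adjusts a price series for a 2-for-1 split. The helper split_adjusted and the prices are hypothetical, and real adjusted series may account for other events as well, so treat this only as the $1000 to $500 case from above.

```python
def split_adjusted(prices, split_index, ratio):
    """Divide every price before the split by the split ratio.

    prices:      raw closing prices in date order
    split_index: position of the first price quoted after the split
    ratio:       new shares per old share (2 for a 2-for-1 split)
    """
    return [p / ratio if i < split_index else p for i, p in enumerate(prices)]


# Raw prices: $1000 per share before the split, $500 after it.
raw_close = [1000.0, 1000.0, 500.0, 500.0]
adj_close = split_adjusted(raw_close, split_index=2, ratio=2)

print(raw_close)  # looks like a 50% crash at index 2
print(adj_close)  # the split no longer shows up as a price move
```

With the adjustment applied, the series is flat, which matches what actually happened to the value of a holding: nothing.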
You want as many meaningful features as you can get, but useless features, as we'll show through this series, can really cause a lot of trouble for your machine learning classifiers, especially the simpler ones in supervised learning and so on.
Anyway, let's close out of this and go ahead and grab some features. First we are gonna pare this down. We are gonna say the dataframe, df, equals df, and then we are going to create a list of basically all of the columns that we wanna have. So we are gonna take adjusted open, and then I'm just gonna go ahead and copy this. Copy, okay. So that's adjusted open, and then after open we are gonna take high, low, close, and volume. Okay, so now we have just these columns. We kind of recreated our dataframe to just be the open, high, low, close, and volume of the adjusted ones.
So then, like I was saying, some of these columns are relatively worthless, but they do have some relationships. For example, what is interesting about high and low is that the margin between high and low tells us a little bit about volatility for the day. Also, the open price, that's the starting price for the day, and its relationship to the close price tells us: did the price go up, and if so, by how much? Did it go down, and if so, by how much? And so on. So the relationship there is very valuable. But a simple linear regression is not gonna seek out that relationship.
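The "pare it down to the adjusted columns" step could look something like the sketch below. The rows are made-up numbers standing in for the downloaded data, and the column names such as "Adj. Open" are an assumption about how the table is labeled, so check them against what df.head actually prints.

```python
import pandas as pd

# Made-up rows standing in for the downloaded price table.
df = pd.DataFrame({
    "Open": [100.0, 102.0],
    "Close": [101.0, 103.0],
    "Adj. Open": [50.0, 51.0],
    "Adj. High": [52.0, 53.0],
    "Adj. Low": [49.0, 50.5],
    "Adj. Close": [51.0, 52.5],
    "Adj. Volume": [1000.0, 1200.0],
})

# Indexing with a list of names keeps only those columns, in that order.
df = df[["Adj. Open", "Adj. High", "Adj. Low", "Adj. Close", "Adj. Volume"]]
print(df.head())
```

Note the double brackets: the outer pair indexes the dataframe, and the inner pair is the list of column names.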
It's just gonna work with whatever features you feed through it. So what we need to do is define those special relationships and then use those as our features, rather than almost-redundant prices that are not really gonna give us anything else very useful.
First let's do the high minus low percent, so this is like the percent volatility, almost. We are gonna define a new column, and we are gonna call it HL_percent, and that is going to be, I'm having a hard time here, that's gonna be equal to... so percent change in this case would be the high minus the low, divided by the low, times 100. So for us it would be df adjusted high minus df adjusted close, and what's happening here is just on a per-row basis, this column minus this column, and then that divided by df adjusted close, and then times 100. You can either multiply by 100 or not; the classifier is really not gonna care about that. We are just doing that for ourselves. So that's the high minus low percent.
Then we actually want just the daily percent change, like the daily move. So I'm just gonna copy that whole line, paste, and then we are gonna call this one percent_change, and that is equal to pretty much the same thing, only we need to change some stuff. Normally percent change is the new minus the old, divided by the old, times 100. So that would be adjusted close minus adjusted open, so new minus the old, divided by the old, times 100. Oh, I am sorry, ha, we did it the wrong way. Divided by the old, so this would be open, times 100. So that's percent change.
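Here is a hedged sketch of the two derived columns as they are spoken above: HL_percent as adjusted high minus adjusted close over adjusted close, and percent_change as adjusted close minus adjusted open over adjusted open, both times 100. The prices are made up, and the "Adj. ..." column names are assumed rather than confirmed by the transcript.

```python
import pandas as pd

# Two made-up trading days: one up day and one down day.
df = pd.DataFrame({
    "Adj. Open": [80.0, 50.0],
    "Adj. High": [105.0, 52.0],
    "Adj. Close": [100.0, 40.0],
})

# Column arithmetic in pandas works row by row:
# each row's high minus that same row's close, and so on.
df["HL_percent"] = (df["Adj. High"] - df["Adj. Close"]) / df["Adj. Close"] * 100.0

# New minus old, divided by old: close is the new price, open is the old one.
df["percent_change"] = (df["Adj. Close"] - df["Adj. Open"]) / df["Adj. Open"] * 100.0

print(df[["HL_percent", "percent_change"]])
```

On the first row the close is 25 percent above the open, and on the second it is 20 percent below, so the sign of percent_change says whether the day was up or down.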
Actually, you could pass close here; again, the classifier doesn't really care as long as everything is kind of normalized. But yeah, either way would have been fine. This is the actual way you should do it.
Anyway, once we have that data, we are gonna define a new dataframe. So it's gonna be df equals df[], and now we define the only columns that we really actually care about. In our case, the columns we care about are gonna be adjusted close, the high low percent, the percent change, and then volume is also somewhat useful to have. Volume is just how many trades occurred that day, basically, so volume is also kind of related to volatility. You could also make more features with some sort of relationship there, but we'll try to keep this pretty simple.
So for now we'll just print df.head, and we wait, just to make sure everything worked out, and sure enough it did. So we have all the numbers we are interested in. We've got our features, and eventually this will actually wind up being, possibly, our label, but we'll get to... I guess, think about this between now and the next tutorial: features are kind of the attributes that make up the label, and the label is, hopefully, some sort of prediction into the future. So will the adjusted close, this column, actually be a feature, or will it be a label, as it stands right now? Think about that, and in the next tutorial we'll pick it up and start getting closer to actually making real predictions with this data.
So if you have any questions, comments, whatever, leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.