Transcript for:
Loading Large CSV Files in Pandas by Using Chunks

What is going on guys, welcome back! In today's video we're going to learn how to load huge CSV files into pandas by chopping them up into chunks and loading them step by step. So let us get right into it.

Alright, so one problem that you will encounter as a data scientist is that you have to process huge data sets, maybe consisting of multiple gigabytes of data. For example, you might have a CSV file with 50 gigabytes of data and only 32 gigabytes of RAM. What pandas does when it loads a CSV file and converts it into a DataFrame is essentially parse it, convert it, and then keep it in RAM so that we can work with it. So when we have data = pandas.read_csv(...), we read the CSV file and then we have this DataFrame; we can apply operations, we can filter, we can query, we can do all sorts of things, because this DataFrame is in RAM. Whatever we do is done in RAM, and we have the full data set in RAM. However, as I said, if you have a data set with 50 gigabytes of data, you cannot fit it as a whole into 32 gigabytes of RAM. So in today's video we're going to learn how to split the data up into chunks, so that we can load part of it and process the full data set step by step, without having to fill all of our RAM, or needing more than our RAM has and basically getting an exception because we cannot load the full data set into RAM.

For this we're obviously going to need pandas. I think if you're watching this video you have pandas installed already; in case you don't, just type pip install pandas. Then you go into your Python script and type import pandas as pd.

Now, for this video I have prepared a simple data set, a CSV file called huge_dataset.csv. It's actually not that huge: it has about 4.23 gigabytes of data. So it's not a truly huge data set, but still, even if you have enough space in RAM, you don't necessarily have to fill it up with a data set if you don't need to. Let me just show you right away what happens if I say df = pd.read_csv('huge_dataset.csv'). By the way, you can use whatever data set you want; I just went on Google, typed something like "big dataset CSV kaggle", and downloaded one. Pick whatever data set you like, as long as it has a certain size.

When I run this, you can see that nothing happens at first; it's loading the data set. If I open up the Task Manager on Windows right now and go to the Performance tab, you can see that my memory usage is increasing the whole time: I have 15.9 gigabytes of RAM, and it's occupying 12, almost 13 gigabytes within a couple of seconds, and you can see that it's Python allocating all that memory. I'm soon going to be out of memory if I let this continue, because pandas is loading the full data set into RAM. Maybe I will have enough RAM for this one, maybe not; the point is that it's not very intelligent to just load the full data set into the script if we are limited in terms of resources. Now, if you have one terabyte of RAM, maybe it makes more sense to load the full data set, but in this case it wouldn't make sense.
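As a minimal sketch, the naive full load looks like this (huge_dataset.csv is just a stand-in name for whatever large file you downloaded, and the memory printout is an extra illustration, not something from the video):

```python
import pandas as pd

# Naive approach: parse the whole CSV and keep the resulting
# DataFrame in RAM. With a multi-gigabyte file this can exhaust memory.
df = pd.read_csv("huge_dataset.csv")

# Illustration only: how much RAM does the DataFrame occupy?
print(f"{df.memory_usage(deep=True).sum() / 1e9:.2f} GB in memory")
```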
We need to process it step by step instead, and one way to do that is of course to limit yourself to a portion of the data set, so that you don't load the full data set but only a part of it. You can say, for example, df = pd.read_csv('huge_dataset.csv', nrows=100) to focus on just the first 100 rows, and then print the DataFrame. In this case we get 100 rows immediately; we don't waste a lot of RAM, and it's done quite quickly, because we're only loading 100 entries. That's a pretty small amount of data to load into RAM, and it happens almost instantaneously. Maybe you don't want the first 100 rows of the file, but the first 100 rows after 500 rows: then you can additionally say skiprows=500. Print the DataFrame and you again get 100 rows, but not the first 100; you get rows 501, 502 and so on, up until 600.

That is one way to do it, but what we actually want is to load the full data set, just step by step: we want to process one part of the data set, get whatever we want to get out of it, and then continue with the next part. For this video we're just going to make up some arbitrary metric, nothing fancy, nothing intelligent. I don't even know what this data is, to be honest, because the focus of this video is not the data; the focus is how to work with pandas when we have huge data sets.

So let's just go ahead now and give this DataFrame some column names: we'll call the columns A, B, C, D, E, F, G and H. Those are now going to be our features, and if we print the DataFrame you can see them. Let's say these numbers are interesting, and for some reason we want to take the E feature and divide it by the G feature to calculate a metric; this is what we do when we process this data set. In this case that would be df.E / df.G, and this is the result that we want to save, for whatever reason. This is the data science work that we're doing.

So we can have something like metric_results, the metric that we want to calculate. This is going to be a pandas Series based on an empty list, with dtype float64. Then we say metric_results = pd.concat(...): we concatenate the metric results that we already have with df.E / df.G, so the E feature divided by the G feature. We take those results and combine them with the existing metric results, so that in the end we have all the results in one place. If we print metric_results, you can see we get length 100, and we have only the values that we calculated.
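Here's a minimal sketch of that sample run, assuming the same hypothetical huge_dataset.csv and that the file simply has eight columns (the names A through H and the E over G metric are just the arbitrary choices from the video; passing names= to read_csv is equivalent to assigning df.columns afterwards):

```python
import pandas as pd

# Load only a small slice: 100 rows, starting after the first 500.
df = pd.read_csv("huge_dataset.csv", skiprows=500, nrows=100,
                 names=["A", "B", "C", "D", "E", "F", "G", "H"])

# Running container for the metric, starting from an empty float series.
metric_results = pd.Series([], dtype="float64")

# Our made-up metric: feature E divided by feature G.
metric_results = pd.concat([metric_results, df.E / df.G])

print(metric_results)  # a Series of length 100
```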
Now, this can be done not only with 100 elements; this can be done with the full data set, step by step. What we can do is the following: we say counter = 0 (let's do it up at the top), and then we say for chunk in pd.read_csv('huge_dataset.csv', chunksize=1000). The chunk size is how many rows we want to read at once; in this case we're going to say a thousand, so we read a thousand rows, then the next thousand rows, and so on and so forth. Of course you can change that number if you want to. One chunk is then going to be a thousand rows of the data frame, and the next chunk is going to be a thousand more rows, so we don't have to load the full data set into RAM; we always just keep a thousand rows of the DataFrame in RAM.

Inside the loop, we essentially assign chunk.columns, the same names we decided on before, and then we take the chunk and process it however we want. In this case: metric_results = pd.concat([metric_results, chunk.E / chunk.G]). This would basically go on until the whole file has been read, so if I run this now, we're not going to see anything for a while; it's going through the individual chunks and processing everything. If we want to actually see a result, we can just go ahead and artificially break: we can say, okay, if the counter is equal to 20, for example, we break out of the loop, just so that we can see the progress. Oh, and I forgot a very important thing here: counter += 1 inside the loop. There you go. After 20 chunks, so 20 times a thousand, you can see that we have 20,000 of these results.

If we leave this running long enough, we process the full data frame without having to store it in RAM. This doesn't use a lot of RAM at all: we just need to keep a thousand rows in RAM and process them, then we load the next thousand rows, and the thousand rows from before are no longer in RAM. We have these chunks, and we never have to load the full data set into RAM. And this is how you process huge data sets in pandas in Python professionally: by using chunks.

Alright, so that's it for today's video. I hope you enjoyed it and I hope you learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below. And of course, don't forget to subscribe to this channel and hit the notification bell to not miss a single future video for free. Other than that, thank you very much for watching, see you in the next video, and bye!
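For reference, here's the chunked pipeline from the video reconstructed as one runnable sketch (same hypothetical huge_dataset.csv and column names as before; the break after 20 chunks is just the demo shortcut, so drop it to process the entire file):

```python
import pandas as pd

metric_results = pd.Series([], dtype="float64")
counter = 0

# Read the CSV lazily, 1000 rows at a time, instead of all at once.
for chunk in pd.read_csv("huge_dataset.csv", chunksize=1000,
                         names=["A", "B", "C", "D", "E", "F", "G", "H"]):
    # Process just this chunk: compute E / G and append the results.
    metric_results = pd.concat([metric_results, chunk.E / chunk.G])

    counter += 1
    if counter == 20:  # demo shortcut: 20 chunks * 1000 rows = 20,000 values
        break

print(metric_results)  # only ever held 1000 raw rows in RAM at a time
```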