Transcript for:
Sampling Techniques in Big Data Streams

Welcome back to my YouTube channel. In this video we'll be looking at the different sampling techniques that are used in big data streams. But before that, let's have a look at what exactly sampling means.

Sampling is the process of collecting a representative subset of the elements present in the entire streaming data. Instead of working on the entire stream, we can choose a representative sample from it and work on that, which consumes fewer computational resources and takes less time; hence we use the concept of sampling. When we compare the sampled data with the actual data stream, the sample will always be much smaller than the entire stream.

Always remember that a good sample retains all the significant characteristics and behavior of the stream. For example, if the entire stream is normally distributed, then a well-chosen sample will also be normally distributed; a good sample always preserves the characteristics of the original data stream. The sampled data can also be used to compute crucial aggregates over the stream: instead of applying aggregate operations on the entire stream, we can apply them to the sample and still get the results.

I hope you now have an overview of why sampling is important for data streams. Let's look at the different techniques used for sampling big data streams. The first technique is fixed proportion sampling, the second is fixed size sampling, the third is biased reservoir sampling, and the last is concise sampling. All of these techniques are important, so let's go through them one by one.

Starting with fixed proportion sampling: the name itself tells you that it samples the data at a fixed proportion, where proportion simply means a percentage. When can we calculate the proportion of a sample? Only when we know the length of the data, or at least an approximate count of the data points present in the entire stream. In that case we can use this type of sampling.

One advantage of this technique is that it usually produces a representative sample, meaning a sample that retains almost all the characteristics of the entire data stream. So whenever you use fixed proportion sampling, there is a high chance that the sample you create will represent the full volume of data well.

We can use fixed proportion sampling when the data is very large, provided you have enough computational capacity and resources. Consider a scenario where the stream is huge, say billions of records, and you choose the proportion to be 10 percent: 10 percent of billions of records is still a huge amount, so storing and processing that sample is itself a big task requiring high computational power and resources, although the final results will be good.

This technique is less biased when we compare it with the fixed size sampling technique, which we will learn next. Also make a note that fixed proportion sampling can run into two problems: under-representation and over-representation, meaning the sample may represent the entire stream poorly or may over-represent parts of it, which can again create problems. These are the points to remember for fixed proportion sampling.
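To make this concrete, here is a minimal Python sketch of fixed proportion sampling, assuming each arriving element is independently kept with a fixed probability. The simulated tweet stream, the 1 percent proportion, and the function name are illustrative assumptions, not something from the video.

```python
import random

def fixed_proportion_sample(stream, proportion=0.01, seed=42):
    """Keep each arriving element independently with the given probability.

    On average the sample holds `proportion` of the stream, so the expected
    sample size grows with the stream: a 1 percent sample of billions of
    records is still large, as discussed above.
    """
    rng = random.Random(seed)
    sample = []
    for element in stream:
        if rng.random() < proportion:
            sample.append(element)
    return sample

# Hypothetical usage: sample roughly 1 percent of 100,000 simulated tweets.
tweets = (f"tweet-{i}" for i in range(100_000))
sample = fixed_proportion_sample(tweets, proportion=0.01)
print(len(sample))  # close to 1,000 on average
```

Note that the expected sample size is the proportion times the stream length, which is exactly why a 1 percent sample of billions of records is still huge.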
Now let's look at an example that will clear up any doubts. Say we have a social media application, and the platform wants to analyze the sentiment of its users towards a particular topic. Millions of tweets are generated every single day, and it is very difficult to process and store all of them. Instead, the company applies fixed proportion sampling with a proportion of one percent: this one percent of all the tweets generated each day represents the entire stream of tweets and is used for statistical analysis of user sentiment on the topic. This makes the difficult task of analyzing a huge volume of tweets manageable, and because of the fixed proportion sampling technique, less computation and less time are consumed. I hope this type of sampling technique is clear to you all.

Now let's move on to the next technique, called fixed size sampling. As the name suggests, it samples a fixed number of records from the entire data stream. This technique does not guarantee a representative sample: there is no assurance that a sample built from a fixed number of records will retain all the characteristics of the data stream, whereas with fixed proportion sampling we had some assurance that the sample would represent the entire stream.

When you have very large data, fixed size sampling is very useful for reducing the data volume: if you have millions of records and you use a fixed number as the threshold for how much to keep, the sample you fetch from the stream will be very small.

Now imagine the entire data stream is normally distributed, or follows some particular distribution. If you are using fixed size sampling and randomly choosing some records for the sample, the result can be biased, because there is no guarantee that the fixed number of records you choose will reproduce the distribution the entire stream follows. But if your streaming data is randomly distributed, then fixed size sampling will work well. As I said, with a very large volume of data this technique may not retain all the significant characteristics of the entire stream, and hence it is less effective. I hope the overview of this technique is clear to you all.
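The video does not name a specific algorithm here, but the standard way to keep a fixed-size uniform sample from a stream of unknown length is reservoir sampling (Algorithm R). Below is a minimal sketch; the order stream and the sample size of 1,000 are assumptions chosen to match the example that follows.

```python
import random

def fixed_size_sample(stream, k=1000, seed=42):
    """Maintain a sample of exactly k elements from a stream of unknown length.

    Classic reservoir sampling (Algorithm R): the first k elements fill the
    reservoir, and the i-th element afterwards replaces a random slot with
    probability k / i, so the sample size stays fixed no matter how fast
    the stream grows.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, element in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(element)
        else:
            j = rng.randrange(i)  # uniform index in [0, i)
            if j < k:
                reservoir[j] = element
    return reservoir

# Hypothetical usage: one busy hour of orders.
orders = (f"order-{i}" for i in range(90_000))
print(len(fixed_size_sample(orders, k=1000)))  # always exactly 1000
```

Whether 10,000 or 90,000 orders arrive in an hour, the reservoir never holds more than k records, which is the fixed size behavior described above.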
Now let's look at an example. Say we have a data stream of customer records for an online store, and roughly 10,000 orders are generated every single hour; this count of 10,000 is approximate and may increase or decrease. The store chooses the fixed size sampling technique, under which 1,000 records per hour are chosen as the sample. Even if the order count grows from 10,000 to 90,000 per hour, still only 1,000 orders will be chosen as the sample, because we are using fixed size sampling. I hope this sampling technique is clear.

Now let's move on to the next technique, called biased reservoir sampling. This technique is used in streams to select a subset of the entire streaming data in a way that is not uniformly random: the selection follows a certain pattern, and through that pattern we choose the sample. The name itself tells you this is a biased sampling technique, and it produces a biased sample that may not be representative of the entire streaming data.

You might be wondering what makes the technique biased. There are criteria for selecting specific elements based on a predetermined probability distribution that experts define. Because of this predetermined distribution, high weights are assigned to certain elements or groups of elements, and only those highly weighted elements are chosen as part of the sample. Different factors can drive the weighting; for example, the frequency of occurrence of a certain type of data: if a data point is highly repeated, or is being generated at a faster rate, a high weight is assigned to it and it is chosen as part of the biased reservoir sample. This is just one factor; other factors may also be taken into consideration when creating samples with this technique.

Also note that this technique is used when there are constraints on the resources available for sampling the data, for example limited memory or limited computational power. And always remember: before using biased reservoir sampling, you should consider the potential biases it may introduce, because there is a chance the results will not match your expectations, so it is better to adjust your analysis parameters accordingly. That was the overview of biased reservoir sampling.
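The video frames biased reservoir sampling in terms of weights rather than a specific algorithm; one well-known way to realize it is the Efraimidis-Spirakis weighted reservoir scheme (A-Res), sketched below. The trusted-user flag and the 10x weight are hypothetical choices meant to mirror the ratings example that follows.

```python
import heapq
import random

def biased_reservoir_sample(stream, weight_fn, k=1000, seed=42):
    """Weighted reservoir sampling (Efraimidis-Spirakis A-Res sketch).

    Each element gets the random key u ** (1 / w), where w is its weight
    and u is uniform on [0, 1); keeping the k largest keys selects elements
    with probability that grows with their weight, so highly weighted
    elements dominate the sample.
    """
    rng = random.Random(seed)
    heap = []  # min-heap of (key, tiebreaker, element); smallest key evicted
    for i, element in enumerate(stream):
        w = max(weight_fn(element), 1e-12)  # guard against zero weights
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, i, element))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, element))
    return [element for _, _, element in heap]

# Hypothetical usage: ratings from "trusted" users get 10x the weight.
ratings = [{"user": f"u{i}", "stars": i % 5 + 1, "trusted": i % 7 == 0}
           for i in range(50_000)]
sample = biased_reservoir_sample(
    ratings, weight_fn=lambda r: 10.0 if r["trusted"] else 1.0, k=1000)
```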
Now let's look at an example. Consider a scenario where we have a data stream of product ratings and we want to select a sample of ratings to estimate a product's average rating. There might be millions of users rating a particular product, so we should create criteria for selecting only certain customers' ratings to be part of the sample. Using the biased reservoir sampling technique, we can assign a higher probability of selection to ratings from users who tend to give more accurate ratings. How can we know that a rating is accurate? By looking at that customer's history: if a particular customer is a regular and consistently gives correct ratings for products, we can take only those trusted customers' ratings for this scenario. I hope this example makes the biased reservoir sampling technique clear to you all.

Now let's look at the next and last technique, the concise sampling technique. The goal of this technique is to maintain a small reservoir of fixed size, similar to the fixed size sampling technique, but still achieve a representative sample of the data stream, which we were not getting with fixed size sampling. So we keep the idea of a fixed-size sample but avoid that technique's drawback: we get a fixed-size sample that is also representative. How is this achieved? We keep a fixed sample budget, but we also have the flexibility to vary the number of samples according to the main memory size and the size of the data stream. There may still be issues arising from limited memory, but concise sampling tries to overcome them by adjusting the sample size to the available main memory.

One more specialty of the concise sampling technique is that instead of selecting random records from the entire data stream, it chooses samples with unique or representative values of a particular attribute. For example, if the data contains many repetitions of distinct elements, we can keep just the distinct elements from the entire streaming data in a single sample. Because of this, the original characteristics of the streaming data are retained in the sample, and even though it is of fixed size, it represents the entire stream. I hope you now have an overview of the concise sampling technique; a short code sketch of it follows at the end of this transcript.

Let's look at the example. A bank wants to analyze customer spending habits from a stream of transactions. The stream will obviously be continuous and huge in size. The bank uses concise sampling, choosing distinct customer IDs as the attribute. It is given that the reservoir can contain only 1,000 customer transactions, so we apply concise sampling and adjust the sample size based on the available memory. This adjustable sample size is an additional parameter and an advantage: it lets us retain accuracy while also supporting efficient analysis. I hope you have understood the concise sampling technique.

With this, we have covered all the sampling techniques in big data streams, and I hope you have understood all of them. For more such videos, like, share, and subscribe to my channel, hit the bell icon, and don't forget to follow me on Instagram. Thanks a lot for watching, and have a good day ahead.
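As mentioned above, here is a minimal sketch of concise sampling in the spirit of Gibbons and Matias: the sample is stored as value-to-count pairs so repeated values are cheap, and the sampling threshold tau is raised whenever the number of distinct entries outgrows the memory budget. The customer-ID stream, the capacity of 1,000, and the 10 percent threshold increase are all illustrative assumptions.

```python
import random

def concise_sample(stream, capacity=1000, seed=42):
    """Concise sampling sketch: keep a {value: count} summary of the stream.

    Repeated values cost a single dictionary entry no matter how often they
    occur. Each arriving value is admitted with probability 1/tau; when the
    number of distinct entries exceeds the capacity, tau is raised and the
    existing counts are thinned so the summary behaves like a sample drawn
    at the new, lower rate.
    """
    rng = random.Random(seed)
    sample = {}  # value -> sampled count
    tau = 1.0    # inverse sampling rate; tau = 1 keeps every element
    for value in stream:
        if rng.random() < 1.0 / tau:
            sample[value] = sample.get(value, 0) + 1
        while len(sample) > capacity:
            new_tau = tau * 1.1  # raise the threshold by 10 percent
            for v in list(sample):
                # Each sampled unit survives with probability tau / new_tau.
                kept = sum(rng.random() < tau / new_tau
                           for _ in range(sample[v]))
                if kept:
                    sample[v] = kept
                else:
                    del sample[v]
            tau = new_tau
    return sample, tau

# Hypothetical usage: transactions with heavily repeated customer IDs.
rng = random.Random(7)
transactions = (f"cust-{rng.randrange(5000)}" for _ in range(100_000))
summary, tau = concise_sample(transactions, capacity=1000)
print(len(summary), round(tau, 2))  # at most 1,000 distinct IDs kept
```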