Transcript for:
Label Ambiguity and Data Quality in Machine Learning

In the last video, you saw how the right bounding boxes for an image can be ambiguous. Let's take a look at some more label ambiguity examples.

We briefly touched on speech recognition in the first week of this course. Here's another example. Given this audio clip, it sounds like someone was standing on a busy roadside asking for the nearest gas station, and then a car drove past. So did they say something right after that? I don't know. One way to transcribe this would be "Um, nearest gas station". In some places people spell "um" with two m's, so that would be a different way to spell it. We could also have used dot dot dot, or ellipses, instead of the comma, which would be another ambiguity. Or, given that the audio had noise after the last words, did they say something after "nearest gas station"? I'm not sure, actually. So would you transcribe it like this instead? There are combinatorially many ways to transcribe this: "um" with one m or two m's, comma or ellipses, and whether to write "unintelligible" at the end. Being able to standardize on one convention will help your speech recognition algorithm.

Let's also look at an example of structured data. A common application in many large companies is user ID merge. That's when you have multiple data records that you think correspond to the same person, and you want to merge these user data records together. For example, say you run a website that offers online listings of jobs. So this may be one data record that you have from one of your registered users, with the email, first name, last name, and address. Now say your company acquires a second company that runs a mobile app that allows people to log in to chat and get advice from each other about their resumes. It seems synergistic for your business: if you run a listing of online jobs, maybe you merge with or acquire a second company that runs a mobile app that lets people chat about their resumes. And from this mobile app, you have a different database of users. So given this data record and this
one, do you think these two are the same person? One approach to the user ID merge problem is to use a supervised learning algorithm that takes as input two user data records and tries to output either one or zero based on whether it thinks these two are actually the same physical human being. If you have a way to get ground truth data records, such as if a handful of users are willing to explicitly link the two accounts, then that could be a good set of labeled examples to train an algorithm. But if you don't have such a ground truth set of data, what many companies have done is ask human labelers, sometimes a product management team, to just manually look at some pairs of records that have been filtered to maybe similar names or similar zip codes, and then to just use human judgment to determine if these two records appear to be the same person. Because whether these two records really are the same person is genuinely ambiguous (they may be, and they may not be), different people will label these records inconsistently. And if there's a way to just get them to label the data a little more consistently, even when the ground truth is ambiguous, then that can help the performance of your learning algorithm. You'll see some examples of how to do this later.

User ID merging is a very common function in many companies. Let me just ask you to please do this only in ways that are respectful of the users' data and their privacy, and only if you are using the data in a way consistent with what they have given you permission for. User privacy is really important.

A few other examples from structured data. If you are trying to use a learning algorithm to look at a user account like these and predict whether it is a bot or a spam account, sometimes that can be ambiguous. Or if you look at an online purchase, is this a fraudulent transaction? Has someone stolen an account, and are they using the stolen account to interact with your website or to make purchases? Sometimes that too is ambiguous. Or if you look at someone's
interactions with your website and you want to know, are they looking for a new job at this moment in time? Based on how someone behaves on a job board website or a resume chat app, you can sometimes guess if they're looking for a job, but it's hard to be sure, so that's also a little bit ambiguous. In the face of potentially very important and valuable prediction tasks like these, the ground truth can be ambiguous. And so if you ask people to take their best guess at the ground truth label for tasks like these, giving labeling instructions that result in more consistent, less noisy, and less random labels will improve the performance of your learning algorithm.

So when defining the data for your learning algorithm, here are some important questions. First, what is the input x? For example, if you're trying to detect defects on smartphones, for the pictures you're taking, is the lighting good enough? Is the camera contrast good enough? Is the camera resolution good enough? If you find that you have a bunch of pictures like these, which are so dark that it's hard even for a person to see what's going on, the right thing to do may not be to take this input x and just label it. It may be to go to the factory and politely request improving the lighting, because it is only with this better image quality that the labeler can then more easily see scratches like this and label them. So sometimes, if your sensor or your imaging solution or your audio recording solution is not good enough, the best thing you could do is recognize that if even a person can't look at the input and tell what's going on, then improving the quality of your sensor, or improving the quality of the input x, can be an important first step to ensuring your learning algorithm can have reasonable performance.

And for structured data problems, defining which features to include can have a huge impact on your learning algorithm's performance. For example, for user ID merge, if you have a way of getting the user's
location, even a rough GPS location, and if you have permission from the user to use it, that can be a very useful cue for deciding whether two user accounts actually belong to the same person. And of course, please do this type of thing only if you have permission from the user to use their data this way.

In addition to defining the input x, you also have to figure out what the target label y should be. As you've seen from the preceding examples, one key question is: how can we ensure labelers give consistent labels?

In the last video and in this video, you saw a variety of problems with the labels being ambiguous, or in some cases the input x not being sufficiently informative, such as if the image is too dark. Let's take these data issues and put them into a more systematic framework that will allow us to devise solutions in a more systematic way. Let's go on to the next video to take a look.
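
Editor's illustration (not part of the spoken lecture): the point from the speech transcription example in this video, standardizing on one convention, can be sketched in a few lines of Python. The convention chosen here ("um" with one m, a comma for pauses, and a bracketed [unintelligible] marker) and the function name normalize_transcript are assumptions made up for this sketch, not something the lecture prescribes.

```python
import re


def normalize_transcript(text: str) -> str:
    """Map variant transcriptions onto one (hypothetical) labeling convention."""
    t = text.strip().lower()
    # Filler word: "umm", "ummm", ... -> "um"
    t = re.sub(r"\bum+\b", "um", t)
    # Pause marker: "..." or the single-character ellipsis -> comma
    t = re.sub(r"\s*(\.{3}|…)\s*", ", ", t)
    # Tidy the spacing around commas
    t = re.sub(r"\s*,\s*", ", ", t)
    # Unintelligible marker: "(unintelligible)" or "[ unintelligible ]" -> "[unintelligible]"
    t = re.sub(r"[\[(]\s*unintelligible\s*[\])]", "[unintelligible]", t)
    # Collapse any leftover runs of whitespace
    return re.sub(r"\s+", " ", t).strip()


# Three labelers, three spellings of the same audio clip
variants = [
    "Um, nearest gas station",
    "Umm, nearest gas station",
    "Um... nearest gas station",
]
# After normalization the set has a single element
print({normalize_transcript(v) for v in variants})
```

Note that a script like this only removes spelling and punctuation differences. Whether the speaker said anything after "nearest gas station" is still a judgment call that the labeling instructions have to settle.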