Transcript for:
Netflix Video Analysis Tech

Did you know that Netflix analyzes billions of frames across every TV show and movie? Using all of those frames, they can do some incredible things, and in today's video I'm going to show you three things you probably didn't know they could do with these algorithms.

Netflix is known for its innovative tech stack and state-of-the-art engineering powering everything it does, and funny enough, it actually started that way back when the business was shipping DVDs to your door: "Go to netflix.com, make a list of the movies you want to see, and in about one business day you'll get three DVDs. Keep them as long as you want, without late fees." One of the co-founders, more specifically Reed Hastings, was a computer scientist who had co-founded a company that built debugging tools for Unix software, which ended up being sold for $750 million. Whatever his cut was, he used it to fund Netflix. So from the start we had a computer scientist starting Netflix with his own money; it's almost like it was always going to happen, Netflix focusing on the tech of TV and film. One of those areas of innovation is computer vision.

But before we even talk about computer vision, let's talk about the first algorithm from Netflix: the match cut transition. When editing a movie, or a trailer for a movie, how do we transition between two completely different scenes? To understand this, we need to understand how our eyes analyze what's on the screen. Take 2001: A Space Odyssey: our focus stays on the bone as it's thrown up, and composition-wise we only see the sky. Clever filmmaking is to cut immediately to a next shot that follows the same structure, so we understand what happened almost instantly; our eyes don't have to play catch-up. Now, if you're making the film or TV show, you can plan for these shots. But how can Netflix do this when going through tens of thousands of movies and TV episodes, with up to 2,000 shots each? We need to automate it. But how do we even get a computer to understand what's on the screen?

Netflix engineers had been combing through thousands of hours of video to make the more common type of match cut, called frame matching, where the framing of a person is well defined. So they decided to use a machine learning algorithm (yes, machine learning, not everything is AI) to do something called instance segmentation. Our eyes can easily look at a photo and see the ground, the dog, the person, the water, the sky, but how do we train a computer to identify all that? Instance segmentation is exactly what it sounds like: figuring out which pixels belong to the human, which belong to the dog, and so on.

Before we had all these crazy AI and machine learning tools, we had to resort to good old-fashioned mathematics. The Viola-Jones framework, for example, was actually pretty clever at detecting faces by looking for patterns in the lighting of a face. Methods like that were great for specific objects, but you can take a picture of anything, and you don't want to build a separate algorithm for every individual item you might photograph. In 2015, "Fully Convolutional Networks for Semantic Segmentation" by Long, Shelhamer, and Darrell showed the first end-to-end deep learning approach to segmenting at the pixel level, using a skip architecture. A convolutional neural network is like a gigantic pipe with a billion filters in it: you put the thing that needs processing in at the top, and it outputs whatever the training asked for, like detecting whether an image is a bird. But for segmenting items in a picture you need a lot of data and a lot of context to understand the whole picture, and that's where the skip architecture comes in: it produces the final result ("a dog") while also gathering context about items and edges from earlier steps. And of course AI has skyrocketed over the last three or four years; we even have up-to-date models like Segment Anything 2 (SAM 2) from Meta.

So Netflix went through its giant catalog, segmented the main person of interest in each shot, and then computed the intersection over union to see how a potential match would look in the match cut. At first the focus was only on human characters, but eventually they started adding other objects and even animals into the equation, and the results looked pretty impressive.
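Just so this isn't totally abstract, here's a minimal sketch of what pixel-level instance segmentation looks like in practice, using a pretrained Mask R-CNN from torchvision. This is not Netflix's actual model; the file name "frame.jpg" and the 0.8 score threshold are just placeholders for illustration.

```python
# Hypothetical sketch: instance segmentation on a single video frame with a
# pretrained Mask R-CNN (COCO classes). Illustrates the "which pixels belong
# to which object" idea, not Netflix's pipeline.
import torch
from torchvision.io import read_image
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = read_image("frame.jpg").float() / 255.0      # (C, H, W), values in [0, 1]
with torch.no_grad():
    out = model([img])[0]                          # one result dict per input image

keep = out["scores"] > 0.8                         # keep confident detections only
masks = out["masks"][keep]                         # (N, 1, H, W) soft pixel masks
labels = out["labels"][keep]                       # COCO class ids; 1 == "person"
person_masks = (masks[labels == 1] > 0.5).squeeze(1)   # boolean masks for people
```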
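And once you have two of those masks, intersection over union is simple to compute. Here's a tiny sketch of the idea on boolean pixel masks; the 4x4 example grid is made up.

```python
import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union of two boolean masks of the same shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union else 0.0

# Two overlapping "characters" on a tiny 4x4 frame: IoU = 2 / 6 ~= 0.33.
a = np.zeros((4, 4), dtype=bool); a[1:3, 1:3] = True
b = np.zeros((4, 4), dtype=bool); b[1:3, 2:4] = True
print(mask_iou(a, b))
```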
Frame matching is very effective, but where the match cut really goes to another level is in combining action shots. The way this works is that the motion in clip A is cut midway through and continued in a very similar way in clip B. With image segmentation we're basically taking one frame and matching that way, so how do we do it with actions? Luckily, there's already a computer vision technique for this: optical flow. Optical flow looks at every pixel of an image and builds a map of how much each pixel has moved relative to the previous frame of the video. So Netflix would take a shot, calculate the optical flow, average it out into one image, and then use that to search for matches. What they got back was not really what they intended. Remember, the point of these action shots was to capture a subject of interest moving one way or another and transition that motion into another action; instead, they found that the matches were being made on similar camera movement. These findings were interesting, but how did they package it all together? They did it in five steps.

Step one: shot segmentation. A scene contains a collection of shots, and a shot is a sequence of frames between two cuts. Remember this, because we're going to come back to it a lot. There are lots of open source models that can detect shot boundaries; there's even a Python package for it, surprisingly enough. So everything needed to be broken down and grouped by shots.

Step two: shot deduplication. Think of a dialogue scene for a second: two characters going back and forth, with the camera cutting to the same setups over and over. So Netflix would take the beginning frame of each shot and convert it into an embedding, a mathematical way of representing an image (or any data point), so it could measure how similar that frame is to other frames. If two shots were too similar mathematically and came from the same episode or movie, chances are one was a duplicate, and it was filtered out.

Step three: compute representations. This step determined what type of algorithm was needed for the match cut: optical flow, image segmentation, and so on.

Step four: compute pair scores. A similarity score is given to each pair of shots; the higher the number, the more similar they are.

Step five: extract the top results. They take the top results based on whatever criteria they're looking for and hand them off to the editors. This saves hundreds, maybe even thousands, of hours of editing time, because you'd otherwise have to go through a huge catalog of video just to find the right moment.
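As a rough illustration of that "compute the flow, then average it into a single image" idea, here's a sketch using OpenCV's dense Farneback optical flow. The parameters are just typical defaults and the function is my own framing, not Netflix's pipeline.

```python
import cv2
import numpy as np

def average_flow_magnitude(video_path: str) -> np.ndarray:
    """Dense optical flow between consecutive frames, averaged into a single
    per-pixel motion-magnitude image that could then be used for matching."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise ValueError(f"could not read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    total, count = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        total = magnitude if total is None else total + magnitude
        count += 1
        prev_gray = gray

    cap.release()
    return total / max(count, 1)
```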
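And for step one, the kind of open source shot detection mentioned above looks something like this with the PySceneDetect package (assuming its current detect API); "episode.mp4" is a placeholder path.

```python
# Hypothetical sketch of shot (cut) detection with PySceneDetect.
from scenedetect import detect, ContentDetector

shots = detect("episode.mp4", ContentDetector())   # list of (start, end) timecodes
for i, (start, end) in enumerate(shots):
    print(f"shot {i}: {start.get_timecode()} -> {end.get_timecode()}")
```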
But wait a minute: searching for videos? How do I even search a catalog this large for something specific? Search is something we use every day: enter a query, get back matches. That works really well with text, but what about video? How can I search for a specific thing happening in a video? If I type in a query, how does that get represented in video form? I had no idea at first.

The Netflix engineers first tried the segmentation method from the match cutting work, but found it wasn't specific enough. For example, in Stranger Things, a Demogorgon wouldn't be recognized as a Demogorgon and would be pretty hard to search for otherwise; like, how do you even explain that thing? The good news is that they technically already had the answer: embeddings. They had pretrained image-text models that create a shared embedding space between the two, so you can measure the relationship and similarity between an image and a piece of text. Netflix extended this kind of model to include video sequences paired with text from preexisting data. They used Ray Train to train the model at scale, and Decord to decode videos much faster. Netflix then took these segmented videos with their text pairings, embedded them at the shot level, and put them in a database. And remember, embedding just means turning something like a video, text, or image into a mathematical format. Since embedding is really fast, what you can then do is take a search query, embed it before it ever hits the database, and run a cosine similarity comparison to get the desired results. They actually used this in the action sequence work I talked about before as well.

Here's how cosine similarity works: it measures how aligned two vectors are by calculating the cosine of the angle between them. That sounds complicated, okay, I'm sorry about that, it confused me too. But essentially it means you can determine how similar two items are even when they don't literally match. For example, the phrases "can I get a peanut butter sandwich" and "is there a food that contains nut spread of some sort" (yes, I wrote that in the script, and it is a really weird thing to say) would end up close to one another. Netflix already had a huge catalog of shots labeled with things like "close-up" or "panning shot," which gave them a significant head start in training this model.

So, to recap: Netflix took text and video pairs and converted them into a mathematical format; then a search query could be converted into the same format to determine how similar it is to the shots of a scene. If I search "exploding car," a bunch of clips of exploding cars come up. If I search "fantasy gremlin," that would probably pop up too.
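Here's a minimal sketch of that embed-then-compare loop, using an off-the-shelf CLIP model from sentence-transformers as a stand-in for Netflix's own video-text model. I'm simplifying a shot down to one representative frame, and the file names are made up.

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# One model that embeds both images and text into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

# Offline: embed one representative frame per shot (paths are placeholders).
frames = [Image.open(p) for p in ["shot_001.jpg", "shot_002.jpg", "shot_003.jpg"]]
shot_embeddings = model.encode(frames, normalize_embeddings=True)

# Online: embed the query the same way and rank shots by cosine similarity.
query = model.encode("exploding car", normalize_embeddings=True)
scores = shot_embeddings @ query        # dot product of unit vectors == cosine similarity
print(np.argsort(-scores))              # shot indices, best match first
```

Because the vectors are normalized, the dot product is exactly the cosine of the angle between them, which is the similarity measure described above.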
Now, remember when I said shots and scenes? How do we even determine when a scene has changed? We already discussed the difference between a shot and a scene, but to recap: a shot is pretty easy to understand visually, a sequence of frames that plays before getting cut to another shot, while a scene is a collection of those shots, usually with the same narrative, tone, pace, or whatever. And that's kind of the issue: the concept of a scene, at least from a computer vision standpoint, is a bit too nuanced; a lot of different things can make up a scene. Take the scene from Into the Spider-Verse: visually, all the shots line up in color, mood, dialogue, and pacing, but when Miles gets bitten by the spider we see a huge change in how the scene works for us. That change is obvious to us, and we can still tell it's the same scene, but a computer, at least based on the previous algorithms we've mentioned, might see a scene change: the colors have changed, the music has shifted, the tone has just shifted in general. So Netflix built two different ways to determine when a scene change happens using frames, and it's pretty awesome.

The first leverages aligned screenplay information. A screenplay is written to let producers and directors execute a written piece on the screen. Unlike a book, a screenplay is written with explicit visual instructions for how things should look and sound, kind of like a program for a computer (okay, I'm sorry, that was me reminding you that this is a coding channel). Since screenplays are written this way, they're actually fairly easy to parse using optical character recognition. But here's another issue: Netflix needed a way to align the screenplay with the closed captions of the movie, and the movie we actually see often differs a lot from the screenplay as written. Actors go off script all the time, and sometimes scenes change; Matthew McConaughey famously treats the screenplay as a suggestion rather than the source. We've already found a way to deal with this kind of thing, though: embeddings. Remember, a mathematical representation of something; I think you get it by now. If we embed each screenplay line and each line of spoken dialogue, we can use paraphrase identification to find the closest match. When we talked about cosine similarity in the last section, we were already most of the way to paraphrase identification. The same kind of software lets chatbots understand your general wants: "play music in the shower" means "play Spotify on the bathroom speaker," and so on.

This is where dynamic time warping, or DTW, was used. DTW is a sophisticated matching algorithm that helps Netflix line up screenplay content with what actually appears in the video, even when the timing isn't consistent. Think of it this way: we have two sequences, the screenplay text and the actual dialogue with timestamps, aka the closed captions in Netflix's case. DTW creates a timeline connecting all of the lines that match, even when some don't appear or are delivered late, by finding the best way to warp or stretch one sequence to match the other (I'll show a tiny sketch of this in a moment).

But how many times have you sat watching a movie with the screenplay in hand, trying to spot when the scene changes? People don't do that; it's actually kind of weird. So how do we identify scene changes without knowing the screenplay? That brings us to the second method: the multimodal sequential model. We're bringing back the good old frame analysis models. Going back to the shots, they used bidirectional gated recurrent unit networks (god, that was just scary to say). Let me explain what that is. A gated recurrent unit, or GRU, is like a memory cell that helps a model remember important information while forgetting irrelevant details. In a shot sequence, a one-directional model would have to guess what happens next based only on what it has already seen, which makes scene detection basically impossible. That's where the bidirectional part comes in: the sequence is processed from beginning to end and from end to beginning, so the model also knows what's coming after each shot, not just what came before. Netflix then separates the video and audio from each shot, and splits the audio into dialogue, background music, and sound effects, which gives the model better clues for spotting when something has changed. They embed all of these video and audio tracks, and when they run the model they're able to beat some of the better open source approaches out there, sometimes by over 3 to 7%.
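Here's that promised sketch: a bare-bones dynamic time warping over two sequences of line embeddings. It's the textbook algorithm, not Netflix's implementation, and using cosine distance as the per-pair cost is just one reasonable choice; the random vectors at the end are stand-ins for real screenplay and caption embeddings.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook dynamic time warping between two sequences of embedding
    vectors, a: (n, d) and b: (m, d), using cosine distance as the cost."""
    def cost(x, y):
        return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8)

    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(a[i - 1], b[j - 1])
            D[i, j] = c + min(D[i - 1, j],        # screenplay line not matched yet
                              D[i, j - 1],        # caption line not matched yet
                              D[i - 1, j - 1])    # screenplay line matched to caption
    return float(D[n, m])   # backtracking through D recovers the actual alignment

# Usage: pretend embeddings for 5 screenplay lines and 7 caption lines.
print(dtw_distance(np.random.rand(5, 16), np.random.rand(7, 16)))
```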
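And here's a toy version of the bidirectional GRU idea in PyTorch: a sequence of shot embeddings goes in, and a per-shot "does a new scene start here?" probability comes out. The dimensions and the sigmoid head are made up for illustration, not Netflix's architecture.

```python
import torch
import torch.nn as nn

class SceneBoundaryModel(nn.Module):
    """Toy bidirectional GRU over a sequence of shot embeddings; outputs, for
    every shot, the probability that a new scene starts there."""
    def __init__(self, shot_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(shot_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)      # forward + backward hidden states

    def forward(self, shots: torch.Tensor) -> torch.Tensor:
        # shots: (batch, num_shots, shot_dim), one embedding per shot
        states, _ = self.gru(shots)               # (batch, num_shots, 2 * hidden)
        return torch.sigmoid(self.head(states)).squeeze(-1)

# Usage: one "episode" of 40 shots, each embedded into 512 dimensions.
probabilities = SceneBoundaryModel()(torch.randn(1, 40, 512))   # shape (1, 40)
```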
Thanks to all the engineers who contributed to these models and wrote the articles on the Netflix tech blog; without them, this video wouldn't exist and we wouldn't get to learn about some of the cool ways computer vision is powering editing and analysis. I've linked the posts in the description below if you want to read them and get a deeper analysis than this, because honestly, it's just so epic. I also have another video about how Discord handles the world's largest server, so make sure you go check that out as well. Peace out, coders.