Transcript for:
Introduction to Natural Language Processing with Deep Learning

Hi everybody, welcome to Stanford CS224N, also known as Ling 284: Natural Language Processing with Deep Learning. I'm Christopher Manning and I'm the main instructor for this class.

What we hope to do today is to dive right in. I'm going to spend about ten minutes talking about the course, and then we'll get straight into content, for reasons I'll explain in a minute. We'll talk about human language and word meaning; I'll then introduce the ideas of the word2vec algorithm for learning word meaning; from there we'll work concretely through how you can work out objective function gradients for the word2vec algorithm, and say a teeny bit about how optimization works; and then right at the end of the class I want to spend a little bit of time giving you a sense of how these word vectors work and what you can do with them.

So really the key learning for today is that I want to give you a sense of how amazing deep learning word vectors are. We have this really surprising result that word meaning can be represented, not perfectly but really rather well, by a large vector of real numbers. That's in a way a commonplace of the last decade of deep learning, but it flies in the face of thousands of years of tradition, and it's really rather an unexpected result to start focusing on.

Okay, so quickly, what do we hope to teach in this course? We've got three primary goals. The first is to teach you the foundations: a good, deep understanding of the effective modern methods for deep learning applied to NLP. So we're going to start with and go through the basics, and then go on to the key methods that are used in NLP: recurrent networks, attention, transformers, and things like that. But we want to do something more than just that. We'd also like to give you some sense of a big-picture understanding of human languages and the reasons why they're actually quite difficult to understand and produce, even though humans seem to do it easily. Obviously, if you really want to learn a lot about this topic you should enroll in and start doing some classes in the linguistics department, but nevertheless, for a lot of you this is the only human language content you'll see during your master's degree or whatever, so we do hope to spend a bit of time on that, starting today. And then finally, we want to give you an understanding of, and an ability to, build systems in PyTorch for some of the major problems in NLP. So we'll look at learning word meanings, dependency parsing, machine translation, and question answering.

Let's dive into human language. Once upon a time I had a much longer introduction that gave lots of examples of how human languages can be misunderstood and complex; I'll show a few of those examples in later lectures. But since today we're focused on word meaning, I thought I'd just give one example, which comes from a very nice xkcd cartoon. It isn't about the syntactic ambiguities of sentences; instead it's really emphasizing the important point that language is a social system, constructed and interpreted by people, and it changes as people decide to adapt its construction. That's part of the reason why human languages are great as an adaptive system for human beings, but difficult as a system for our computers to understand to this day. So in this conversation between the two women, one says, "Anyway, I could care less."
And the other says, "I think you mean you couldn't care less; saying you could care less implies you care at least some amount." And the first one replies, "I don't know. We're these unbelievably complicated brains drifting through a void, trying in vain to connect with one another by blindly flinging words out into the darkness. Every choice of phrasing and spelling and tone and timing carries countless signals and contexts and subtexts and more, and every listener interprets those signals in their own way. Language isn't a formal system; language is glorious chaos. You can never know for sure what any words will mean to anyone. All you can do is try to get better at guessing how your words affect people, so you can have a chance of finding the ones that will make them feel something like what you want them to feel. Everything else is pointless. I assume you're giving me tips on how you interpret words because you want me to feel less alone. If so, then thank you: that means a lot. But if you're just running my sentences past some mental checklist so you can show off how well you know it, then I could care less."

Okay, so that's ultimately what our goal is: to do a better job of building computational systems that try to get better at guessing how their words will affect other people, and what other people mean by the words they choose to say.

An interesting thing about human language is that it's a system that was constructed by human beings, and constructed relatively recently, in some sense. In discussions of artificial intelligence, a lot of the time people focus on human brains and the neurons buzzing by, and this intelligence that's meant to be inside people's heads. But I just wanted to focus for a moment on the role of language. It's actually somewhat controversial, but it's not necessarily the case that humans are much more intelligent than some of the higher apes like chimpanzees or bonobos. Chimpanzees and bonobos have been shown to be able to use tools and to make plans, and in fact chimps have much better short-term memory than human beings do. Relative to that, if you look through the history of life on Earth, human beings developed language really recently. How recently we don't actually know, because there are no fossils that say "here's a language speaker," but most people estimate that language arose for human beings somewhere in the range of a hundred thousand to a million years ago. That's a while ago, but compared to the process of the evolution of life on Earth it's the blink of an eyelid. Yet that powerful tool of communication between human beings quickly set off our ascendancy over other creatures. It's kind of interesting that the ultimate power turned out not to be poisonous fangs or being super fast or super big, but having the ability to communicate with other members of your tribe.

It was much more recently again that humans developed writing, which allowed knowledge to be communicated across distances of time and space; the power of writing is only about 5,000 years old. In just a few thousand years, the ability to preserve and share knowledge took us from the Bronze Age to the smartphones and tablets of today. So a key question for artificial intelligence and human-computer interaction is how to get computers to be able to understand the information conveyed in human languages.
At the same time, artificial intelligence requires computers that have the knowledge of people. Fortunately, our AI systems might now be able to benefit from a virtuous cycle: we need knowledge to understand language and people well, but it's also the case that a lot of that knowledge is contained in language, spread out across the books and web pages of the world. One of the things we're going to look at in this course is how we can build on that virtuous cycle.

A lot of progress has already been made, and I just want to very quickly give a sense of that. In the last decade or so, and especially in the last few years with newer methods, we're now in a space where machine translation really works moderately well. From the history of the world, this is just amazing. For thousands of years, learning other people's languages was a human task that required a lot of effort and concentration. But now you can just hop on your web browser and think, "I wonder what the news is in Kenya today," head over to a Kenyan website, see something like this, and ask Google to translate it for you from Swahili. The translation isn't quite perfect, but it's reasonably good: "The newspaper Tuko has been informed that local government minister Lingsan and his transport counterpart Citime died within two separate hours." "Within two separate hours" is kind of awkward, but essentially we're doing pretty well at getting the information out of this page, and that's quite amazing.

The single biggest development in NLP over the last year, certainly in the popular media, was GPT-3, a huge new model that was released by OpenAI. What GPT-3 is about and why it's great is actually a bit subtle, so I can't really go through all the details here, but it's exciting because it seems like a first step on the path to what we might call universal models: you can train up one extremely large model on something like that library picture I showed before, and it just has knowledge of the world, knowledge of human languages, knowledge of how to do tasks, and then you can apply it to do all sorts of things. No longer are we building one model to detect spam, then a model to detect pornography, then a model to detect foreign-language content, and so on, building separate supervised classifiers for every different task; we've now built one model that understands a lot.

Exactly what it does is just predict following words. On the left, it's being told to write about Elon Musk in the style of Dr. Seuss. It was started off with some text, and it generates more text, literally by predicting one word at a time, the following words, to complete its text. But this gives a very powerful facility, because what you can do with GPT-3 is give it a couple of examples of what you'd like it to do. I can give it some text: "I broke the window. Change it into a question: What did I break?" "I gracefully saved the day. Change it into a question: What did I gracefully save?" This prompt tells GPT-3 what I want it to do, and so if I then give it another statement like "I gave John flowers," I can ask GPT-3 to predict what words come next, and it'll follow my prompt and produce "Who did I give flowers to?" Or I can say "I gave her a rose and a guitar," and it will follow the idea of the pattern and produce "Who did I give a rose and a guitar to?"
And actually, this one model can then do an amazing range of things, including many that are quite surprising for it to do at all. To give just one example, another thing you can do is get it to translate human language sentences into SQL, which could make it much easier to do CS145. Having given it a couple of examples of SQL translations of human language text, which this time I'm not showing because they won't fit on my slide, I can then give it a sentence like "How many users have signed up since the start of 2020?" and it turns it into SQL. Or I can give it another query, "What is the average number of influencers each user subscribes to?" and again it converts that into SQL. So GPT-3 knows a lot about the meaning of language, and the meaning of other things like SQL, and can fluently manipulate them.

Okay, so that leads us straight into this topic of meaning, and how we represent the meaning of a word. Well, what is meaning? We could look up something like Webster's dictionary: the idea that is represented by a word; the idea that a person wants to express by using words, signs, etc. Those dictionary definitions really focus on the word "idea," but this is pretty close to the commonest way that linguists think about meaning: they think of word meaning as a pairing between a word, which is a signifier or symbol, and the thing that it signifies, the signified, which is an idea or thing. So the meaning of the word "chair" is the set of things that are chairs. That's referred to as denotational semantics, a term that's also used, and similarly applied, for the semantics of programming languages. But this model isn't very deeply implementable: how do I go from the idea that "chair" means the set of chairs in the world to something that lets me manipulate meaning inside my computer?

Traditionally, the way that meaning has normally been handled in natural language processing systems is to make use of resources like dictionaries and thesauri. In particular, a popular one is WordNet, which organizes words and terms into both synonym sets (words that can mean the same thing) and hypernyms, which correspond to "is a" relationships. For the "is a" relationships, we can look at the hypernyms of "panda": a panda is a kind of procyonid (whatever those are; I guess that's the group with red pandas), which is a kind of carnivore, which is a kind of placental, which is a kind of mammal, and so you head up this hypernym hierarchy.

WordNet has been a great resource for NLP, but it's also been highly deficient. It lacks a lot of nuance: for example, in WordNet "proficient" is listed as a synonym for "good." Maybe that's sometimes true, but in a lot of contexts it's not, and you mean something rather different when you say "proficient" versus "good." It's also limited as a human-constructed thesaurus: there are lots of words, and lots of uses of words, that just aren't there, including anything that's more current terminology. "Wicked" is there for the wicked witch, but not for more modern colloquial uses; "ninja" certainly isn't there for the kind of description some people make of programmers. It's basically impossible to keep it up to date, so it requires a lot of human labor. And even when you have all that, it has sets of synonyms but doesn't really have a good sense of words that mean something similar: "fantastic" and "great" mean something similar without really being synonyms.
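If you want to poke at WordNet yourself, here is a small sketch using NLTK's WordNet interface (it assumes the nltk package and its WordNet data are installed; it mirrors the lookups just described rather than reproducing anything shown in the lecture):

```python
# a sketch of the kind of WordNet lookup described above, via NLTK
# (assumes nltk and its wordnet corpus are installed; not code from the lecture)
from nltk.corpus import wordnet as wn

# synonym sets containing "good", grouped by sense and part of speech
for synset in wn.synsets("good")[:5]:
    print(synset.name(), synset.lemma_names())

# the "is a" (hypernym) chain upward from "panda"
panda = wn.synset("panda.n.01")
hypernym = lambda s: s.hypernyms()
print(list(panda.closure(hypernym)))
# -> procyonid, carnivore, placental, mammal, ... up to entity
```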
This idea of meaning similarity is something that would be really useful to make progress on, and it's where deep learning models excel.

So what's the problem with a lot of traditional NLP? The problem is that words are regarded as discrete symbols. We have symbols like "hotel," "conference," "motel," which in deep-learning speak we refer to as a localist representation, because if you want to represent these symbols in a statistical or machine learning system, each of them is a separate thing. The standard way of representing them, and this is what you do in something like a statistical model if you're building a logistic regression model with words as features, is as one-hot vectors: you have a dimension for each different word. So maybe here are my representations as vectors for "motel" and "hotel." That means we have to have huge vectors, corresponding to the number of words in our vocabulary. A high school English dictionary probably has about 250,000 words in it, but there are many, many more words in the language really, so maybe we'd want at least a 500,000-dimensional vector to be able to cope with that.

But the even bigger problem with discrete symbols is that we don't have a notion of word relationships and similarity. For example, in web search, if a user searches for "Seattle motel," we'd also like to match documents containing "Seattle hotel." But our problem is that we've got these one-hot vectors for the different words, and in a formal mathematical sense these two vectors are orthogonal: there's no natural notion of similarity between them whatsoever. Well, there are some things we could try to do about that, and people did do them before 2010: we could use WordNet synonyms and count things listed as synonyms as similar, or we could somehow build up representations of words that have meaning overlap. People did all of those things, but they tended to fail badly from incompleteness. So instead, what I want to introduce today is the modern deep learning method, where we encode similarity in a real-valued vector itself.

How do we go about doing that? The way we do it is by exploiting this idea called distributional semantics. The idea of distributional semantics is again something that, when you first see it, maybe feels a little bit crazy, because rather than something like denotational semantics, what we're now going to say is that a word's meaning is going to be given by the words that frequently appear close to it. J.R. Firth was a British linguist from the middle of the last century, and one of his pithy slogans that everyone quotes at this point is: "You shall know a word by the company it keeps." This idea, that you can represent a sense of a word's meaning as a notion of what contexts it appears in, has been a very successful idea, one of the most successful ideas used throughout statistical and deep learning NLP. It's actually an interesting idea more philosophically: there are interesting connections, for example, to Wittgenstein's later writings, where he became enamored of a use theory of meaning, and this is in some sense a use theory of meaning. Whether it's the ultimate theory of semantics is actually still pretty controversial, but it has proved to be an extremely useful computational sense of semantics, which has led to it being used everywhere, very successfully, in deep learning systems.
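To make the contrast in this passage concrete: one-hot vectors give you exactly zero similarity signal, while dense vectors can carry graded similarity. Here is a tiny NumPy illustration (the dense numbers are invented purely for the example, not taken from any real model):

```python
import numpy as np

# one-hot "localist" representations: every pair of distinct words is orthogonal,
# so "motel" and "hotel" look no more alike than "motel" and "seattle"
vocab = ["motel", "hotel", "seattle"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot["motel"] @ one_hot["hotel"])       # 0.0 -- no similarity signal at all

# dense vectors (made-up numbers, just for illustration) can encode graded similarity
dense = {
    "motel":   np.array([0.29, 0.79, -0.40]),
    "hotel":   np.array([0.33, 0.82, -0.35]),
    "seattle": np.array([-0.60, 0.10, 0.90]),
}
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(dense["motel"], dense["hotel"]))    # close to 1
print(cosine(dense["motel"], dense["seattle"]))  # much lower
```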
So when a word appears in a text, it has a context, which is the set of words that appear nearby. For a particular word, my example here is "banking": we'll find a bunch of places where "banking" occurs in texts, we'll collect the nearby words as context words, and we'll say that those context words, the ones in that muddy brown color around "banking," will in some sense represent the meaning of the word "banking."

While I'm here, let me mention one distinction that will come up regularly when we're talking about a word. In natural language processing we have two senses of "word," referred to as types and tokens. There's a particular instance of a word: in the first example, "government debt problems turning into banking crises," there's "banking" there, and that's a token of the word "banking." But when I collect a bunch of instances of "the word banking" and talk about them together, I'm treating "banking" as a type, which refers to the uses and meaning the word "banking" has across instances.

Okay, so what are we going to do with these distributional models of language? Based on looking at the words that occur in context, we want to build up a dense, real-valued vector for each word that in some sense represents the meaning of that word, and the way it will represent the meaning of that word is that this vector will be useful for predicting the other words that occur in its context. In this example, to keep things manageable on the slide, the vectors are only eight-dimensional, but in reality we use considerably bigger vectors: a very common size is 300 dimensions. So for each word, that is, each word type, we're going to have a word vector. These are also used under other names: they're referred to as neural word representations, or, for a reason that will become clearer on the next slide, as word embeddings. These are now a distributed representation, not a localist representation, because the meaning of the word "banking" is spread over all 300 dimensions of the vector.

They're called word embeddings because, effectively, when we have a whole bunch of words, these representations place them all in a high-dimensional vector space, so they're embedded into that space. Now, unfortunately, human beings are very bad at looking at 300-dimensional vector spaces, or even eight-dimensional vector spaces, so the only thing I can really display here is a two-dimensional projection of that space. Even that is useful, but it's also important to realize that when you're making a two-dimensional projection of a 300-dimensional space, you're losing almost all of the information in that space, and a lot of things will be crushed together that don't actually deserve to be near each other. So here are my word embeddings; of course you can't see any of them at all, but if I zoom in, and then zoom in further, what you'll see is that the representations we've learned distributionally do a good job of grouping together similar words. In this overall picture I can zoom into one part of the space, the part that's up here in this view, and it's got words for countries. Not only are countries generally grouped together; even the particular subgroupings of countries make a certain amount of sense, and down here we then have nationality words. If we go to another part of the space, we can see a different kind of word: here are verbs, and we have ones like "come" and "go" that are very similar; saying and thinking words, "say," "think," "expect," are kind of similar; and nearby, over in the bottom right, we have verbal auxiliaries and copulas, so "have," "had," "has," and forms of the verb "to be." Certain contentful verbs are similar to copula verbs because they describe states, you know, "he remained angry," "he became angry," so they're grouped close to the verb "to be." So there's a lot of interesting structure in this space that then represents the meaning of words.
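For anyone who wants to make a picture like this themselves, here is a sketch of one way to do it, assuming word vectors like the GloVe vectors used later in this lecture are available through gensim's downloader. This is illustrative plotting code, not the code that produced the slide:

```python
# project a handful of word vectors to 2-D with PCA and label them
# (assumes internet access for gensim's downloader; any vectors would do)
import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

kv = api.load("glove-wiki-gigaword-100")        # 100-dimensional GloVe vectors
words = ["france", "germany", "japan", "china",
         "french", "german", "japanese", "chinese",
         "come", "go", "say", "think"]
xy = PCA(n_components=2).fit_transform([kv[w] for w in words])

plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))
plt.show()   # remember: a 2-D projection throws away almost all of the 100 dimensions
```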
So the algorithm I'm going to introduce now is one called word2vec, which was introduced by Tomas Mikolov and colleagues in 2013 as a framework for learning word vectors, and it's a simple and easy-to-understand place to start. The idea is: we have a lot of text from somewhere, which we commonly refer to as a corpus of text ("corpus" is just the Latin word for body, so it's a body of text). We then choose a fixed vocabulary, which will typically be large but nevertheless truncated, so we get rid of some of the really rare words; we might say a vocabulary size of 400,000. And we then create for ourselves a vector for each word.

So then what we do is work out what's a good vector for each word, and the really interesting thing is that we can learn these word vectors from just a big pile of text, by doing this distributional similarity task of being able to predict which words occur in the context of other words. In particular, we're going to iterate through the words in the text, and at any moment we have a center word, c, and context words outside of it, which we'll call o. Then, based on the current word vectors, we calculate the probability of a context word occurring given the center word, according to our current model. But we know which words actually did occur in the context of that center word, so what we want to do is keep adjusting the word vectors to maximize the probability that's assigned to the words that actually occur in the context of the center word, as we proceed through these texts.

To start making that a bit more concrete, this is what we're doing. We have a piece of text; we choose our center word, which here is "into"; and then we say: we have a model for predicting the probability of context words given the center word (this model we'll come to in a minute, but it's defined in terms of our word vectors), so let's see what probability it gives to the words that actually occurred in the context of this word. It gives them some probability, but it would be nice if the probability assigned were higher. So how can we change our word vectors to raise those probabilities? We'll do some calculations with "into" as the center word, then we'll just go on to the next word, do the same kind of calculations, and keep on going. The big question, then, is how we work out the probability of a word occurring in the context of the center word, and that's the central part of what we develop as the word2vec model.

So this is the overall model that we want to use: for each position t in our corpus, our body of text, we want to predict the context words within a window of fixed size m, given the center word w_t, and we want to become good at doing that.
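To make the iteration just described concrete, here is a minimal sketch (not the course's starter code) of how every position in a text supplies one center word and up to 2m surrounding context words, which are the pairs the model is asked to predict:

```python
# enumerate (center, outside) training pairs with a window of size m
corpus = "problems turning into banking crises as has happened in 2009".split()
m = 2  # window size

pairs = []  # each pair is one term the objective below will score
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

print(pairs[:6])
# [('problems', 'turning'), ('problems', 'into'), ('turning', 'problems'), ...]
```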
That is, we want to give high probability to words that occur in the context. So we're going to work out what's formally the data likelihood: how good a job do we do at predicting the words in the context of other words? Formally, that likelihood is going to be defined in terms of our word vectors, which are the parameters of our model, and it's calculated by taking a product over using each word as the center word and, within that, a product over each word in a window around it, of the probability of predicting that context word given the center word.

To learn this model, we're going to have an objective function, sometimes also called a cost or a loss, that we want to optimize. Essentially, we want to maximize the likelihood of the context we see around center words, but following standard practice we slightly fiddle that: rather than dealing with products, it's easier to deal with sums, so we work with the log likelihood, and once we take the log, all of our products turn into sums. We also work with the average log likelihood, so we've got a 1/T term for the number of words in the corpus. And finally, for no particular reason, we like to minimize our objective function rather than maximize it, so we stick a minus sign in there. So then, minimizing this objective function J(θ) corresponds to maximizing our predictive accuracy.

Okay, so that's the setup, but we still haven't made any progress on how we calculate the probability of a word occurring in the context given the center word. The way we're actually going to do that is that we have vector representations for each word, and we work out the probability simply in terms of the word vectors. At this point there's a little technical detail: we're actually going to give each word two word vectors, one for when it's used as the center word and a different one for when it's used as a context word. This is done because it simplifies the math and the optimization; it seems a little bit ugly, but it actually makes building word vectors a lot easier, and we can come back and discuss it later.

So then, once we have these word vectors, the equation we're going to use for the probability of a context word appearing given the center word is the expression in the middle bottom of my slide. Let's pull that apart a little. For a particular center word c and a particular context word o, we look up the vector representation of each word, u_o and v_c, and then we simply take the dot product of those two vectors. The dot product is a natural measure of similarity between words: in any particular dimension, if both components are positive, you get a contribution that adds to the dot product; if both are negative, it also adds to the dot product; if one's positive and one's negative, it subtracts from the similarity measure; and if either of them is zero, it doesn't change the similarity. So it seems a plausible idea to just take a dot product and think: if two words have a larger dot product, that means they're more similar. And then, beyond that, we're really doing nothing more than saying we want to use dot products to represent word similarity.
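Spoken aloud these formulas are hard to follow, so here they are written out, matching the slide being described (T is the number of words in the corpus, m the window size, V the vocabulary size, θ the collection of word vectors; u are outside/context vectors and v are center vectors):

```latex
\text{Likelihood: } \quad
L(\theta) = \prod_{t=1}^{T} \;\prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t;\, \theta)

\text{Objective: } \quad
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t;\, \theta)

\text{Softmax probability: } \quad
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}
```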
Now let's do the simplest thing we know to turn this into a probability distribution. What do we do? Taking the dot product of two vectors might come out positive or negative, but we can't have negative probabilities, and a simple way to avoid negative numbers is to exponentiate, because then everything is positive, so we always get a positive number in the numerator. But for probabilities we also want the numbers to add up to one, so that we have a probability distribution, and so we normalize in the obvious way: we divide through by the sum of the numerator quantity over every word in the vocabulary, and then necessarily that gives us a probability distribution.

Everything I just talked through is what's called the softmax function. The softmax function takes any vector in R^n and turns its components into numbers between zero and one that form a probability distribution. The name comes from the fact that it's sort of like a max: because we exponentiate, the largest values really dominate when calculating the similarities, so most of the probability goes to the most similar things. It's called "soft" because it doesn't do that absolutely: it will still give some probability to everything that's in the slightest bit similar. On the other hand, it's a slightly weird name, because "max" normally takes a set of things and returns just one, the biggest of them, whereas the softmax takes a set of numbers, rescales them, and returns a whole probability distribution.

Okay, so now we have all the pieces of our model. How do we make our word vectors? The idea is that we want to fiddle our word vectors in such a way that we minimize our loss, i.e., maximize the probability of the words we actually saw in the context of the center word. Theta (θ) represents all of our model parameters in one very long vector. For our model here, the only parameters are the word vectors: for each word we have two vectors, its context vector and its center vector, each of which is a d-dimensional vector where d might be 300, and we have V-many words. So we end up with this huge parameter vector of length 2Vd; with a 500,000-word vocabulary and 300-dimensional vectors, that's 2 × 500,000 × 300 = 300 million parameters. So I've got hundreds of millions of parameters, and we somehow want to fiddle them all to maximize the prediction of context words.

The way we're going to do that is to use calculus. We take the math we've seen previously and say: with this objective function, we can work out derivatives, so we can work out where the gradient is, i.e., how we can walk downhill to minimize the loss. We're at some point, we can figure out which direction is downhill, and we can then progressively walk downhill and improve our model. So our job is going to be to compute all of those vector gradients. At this point I want to show a little bit more about how we can actually do that; I have a couple more slides here, but maybe I'll just switch things around and move to my interactive whiteboard.
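Before following the whiteboard derivation, it may help to see the softmax piece on its own as code. This is a small NumPy sketch for illustration, not the assignment's implementation:

```python
import numpy as np

def softmax(z):
    """Map any real-valued vector to a probability distribution."""
    z = z - z.max()            # subtracting the max changes nothing but avoids overflow
    e = np.exp(z)
    return e / e.sum()

# toy illustration: a 5-word vocabulary with 4-dimensional vectors
rng = np.random.default_rng(1)
U = rng.normal(size=(5, 4))    # one outside vector u_w per vocabulary word
v_c = rng.normal(size=4)       # the center word's vector

p = softmax(U @ v_c)           # P(o | c) for every candidate outside word o
print(p, p.sum())              # nonnegative and sums to 1
```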
Right, so we had our overall J(θ) that we were wanting to minimize, our average negative log likelihood. That was

J(θ) = −(1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t),

where T is the text length and we go through the words in each context, with j ranging over the m words on each side, excluding the center itself. So inside there we're working out the log probability of the context word at position t+j given the word in the center position t. And then we converted that into our word vectors by saying that the probability of o given c is expressed as this softmax of the dot product:

P(o | c) = exp(u_o · v_c) / Σ_{w=1}^{V} exp(u_w · v_c).

Now what we want to do is work out the gradient, the direction of downhill, for this loss. The way we do that is by working out the partial derivative of this expression with respect to every parameter in the model, and all the parameters in the model are the components, the dimensions, of the word vectors of every word: the center word vectors and the outside word vectors. Here I'm just going to do the center word vectors; on a future homework, assignment 2, the outside word vectors will show up, and they're quite similar. So we're working out the partial derivative with respect to our center word vector v_c, which is maybe a 300-dimensional vector, of this log probability:

∂/∂v_c log P(o | c) = ∂/∂v_c log [ exp(u_o · v_c) / Σ_{w=1}^{V} exp(u_w · v_c) ].

Well, at this point things start off pretty easy, because what we have here is the log of a/b, and we can turn that into log a − log b. But before I go further, I'll just make a comment. At this point my audience divides: there are some people in the audience, maybe a lot of people, for whom this is really elementary math that you've seen a million times before, and I'm not even explaining it very well; if you're in that group, feel free to look at your email or the newspaper or whatever else suits you. But I think there are also other people in the class for whom the last time you saw calculus was in high school, and so I wanted to spend a few minutes going through this a bit concretely, to try to get across the idea that even though most of deep learning, and even word vector learning, seems like magic, it's not really magic: it's really just doing math, and one of the things we hope is that you actually understand the math that's being done. So I'll keep going and do a bit more of it.

Okay, so using that way of rewriting the log, the expression above equals the partial derivative with respect to v_c of the log of the numerator, log exp(u_o · v_c), minus the partial derivative with respect to v_c of the log of the denominator, log Σ_{w=1}^{V} exp(u_w · v_c). The first part, the numerator part, is really easy: log and exp are just inverses of each other, so they cancel, and that becomes the derivative with respect to v_c of just what's left behind, which is u_o · v_c.
The thing to be aware of is that we're doing multivariate calculus here, so this is calculus with respect to a vector, like hopefully you saw in Math 51 or some other place, not high-school single-variable calculus. But to the extent that you only half-remember this stuff, most of the time you can do perfectly well by thinking about what happens one dimension at a time, and it generalizes to the multivariable case. If about all you remember of calculus is that d/dx (ax) = a, that's essentially what we're using here. The dot product u_o · v_c, written out, is u_{o,1} v_{c,1} + u_{o,2} v_{c,2} + …, so when we take the derivative with respect to v_{c,1}, the only thing left is u_{o,1}; when we take the derivative with respect to v_{c,2}, the only thing left is u_{o,2}; and so on. So the end result of taking the vector derivative of u_o · v_c with respect to v_c is simply u_o. Great, that's progress.

Then we go on and say: oh damn, we still have the denominator. That's slightly more complex, but not so bad. We want to take the partial derivative with respect to v_c of the log of the denominator, and the one tool we need to know and remember is the chain rule. The chain rule is for taking derivatives of compositions of functions, f(g(v_c)). Here the outer function f is the log, and the inner function, call it z = g(v_c), is the sum Σ_{w=1}^{V} exp(u_w · v_c). To apply the chain rule, we take the derivative of f at the point z, and here we have to remember that the derivative of log is the 1/x function, so that gives us

1 / Σ_{w=1}^{V} exp(u_w · v_c),

multiplied by the derivative of the inner function. There's one trick here: at this point we want a change of index, so we write the inner sum as Σ_{x=1}^{V} exp(u_x · v_c), since we can get into trouble if we don't switch to a different variable. So we're making some progress, but we still need the derivative of that, and for it we apply the chain rule once more: now f is the exponential and the new z = g(v_c) is u_x · v_c. We can move the derivative inside the sum, the derivative of exp is itself, and the derivative of u_x · v_c is what we worked out before, u_x. So we get

Σ_{x=1}^{V} exp(u_x · v_c) u_x.

Putting all of that together, the partial derivative with respect to v_c of the log probability is the numerator part, which was just u_o, minus the denominator part divided by that 1/x factor:

∂/∂v_c log P(o | c) = u_o − [ Σ_{x=1}^{V} exp(u_x · v_c) u_x ] / [ Σ_{w=1}^{V} exp(u_w · v_c) ],

and this is where the change of variables became important. Rewriting that a little, the fraction weighting each u_x is exactly the softmax probability we started with, so this equals

u_o − Σ_{x=1}^{V} P(x | c) u_x.

And what we have at that moment is an expectation: the second term is an average over all the context vectors, weighted by their probability according to the model. It's always the case with these softmax-style models that what you get out for the derivative is the observed minus the expected: our model is good if, on average, it predicts exactly the word vector that we actually see, and so we're going to try to adjust the parameters of our model so that it does that as much as possible.
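If you want to convince yourself the algebra is right, a finite-difference check takes only a few lines. This is a toy sketch, not course code; it compares the formula just derived, u_o − Σ_x P(x|c) u_x, against a numerical derivative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 7, 5                              # tiny vocabulary size and dimensionality
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w, one row per word
v_c = rng.normal(scale=0.1, size=d)      # the center word's vector
o = 3                                    # index of the observed outside word

def log_prob(v):
    scores = U @ v                       # u_w . v_c for every w in the vocabulary
    p = np.exp(scores - scores.max())
    p /= p.sum()                         # softmax
    return np.log(p[o]), p

_, p = log_prob(v_c)
analytic = U[o] - p @ U                  # observed minus expected outside vector

numeric = np.zeros(d)
eps = 1e-6
for i in range(d):
    e = np.zeros(d); e[i] = eps
    numeric[i] = (log_prob(v_c + e)[0] - log_prob(v_c - e)[0]) / (2 * eps)

print(np.abs(analytic - numeric).max())  # tiny, e.g. around 1e-10
```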
Now, of course, as you'll find, you can never get close. If I just say to you, "the word is croissant: which words are going to occur in the context of croissant?" you can't answer that; there are all sorts of sentences you could say that involve the word croissant. So our particular probability estimates are going to be kind of small, but nevertheless we want to fiddle our word vectors to make those estimates as high as we possibly can.

So I've gone on about this stuff a bit but haven't actually shown you any of what happens, and I just want to quickly show you a bit of what actually happens with word vectors. Here's a simple little IPython notebook, which is also what you'll be using for assignment one. In the first cell I import a bunch of stuff: we've got NumPy for our vectors, Matplotlib for plotting, scikit-learn as kind of your machine-learning Swiss Army knife, and Gensim, a package that you may well not have seen before. Gensim is often used for word vectors; it's not really used for deep learning, so this is the only time you'll see it in the class, but if you just want a good package for working with word vectors in some other application, it's a good one to know about.

Then in my second cell I'm loading a particular set of word vectors: these are our GloVe word vectors that we made at Stanford in 2014, and I'm loading 100-dimensional word vectors so that things are a little bit quicker for me while I'm doing things here. Let's look at this model with "bread" and "croissant." What I've got here are word vectors, and I just wanted to show you that there really are word vectors. Hmm, well, maybe I should have loaded those word vectors in advance... let's see... okay, I'm in business.
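For anyone following along at home, here is a hedged reconstruction of those first notebook cells. The lecture loads 100-dimensional GloVe vectors from a local file; the sketch below fetches the same kind of vectors through gensim's downloader instead, which is the easiest way to reproduce the demo:

```python
# load 100-d GloVe vectors and inspect a couple of them
import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")     # gensim KeyedVectors with GloVe embeddings

print(kv["bread"][:6])                       # first few components of each raw vector
print(kv["croissant"][:6])
print(np.dot(kv["bread"], kv["croissant"]))  # a reasonably large dot product
```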
Right, so here are my word vectors for "bread" and "croissant," and you can see that maybe these two words are a bit similar: both of them are negative in the first dimension, positive in the second, negative in the third, positive in the fourth, negative in the fifth. So it looks like they might have a fair bit of dot product, which is what we want, because bread and croissant are kind of similar.

What we can also do is actually ask the model (these are Gensim functions now): what are the most similar words? I can ask for "croissant," what are the most similar words to that, and it tells me things like brioche, baguette, focaccia, so that's pretty good; pudding is perhaps a little bit more questionable. We can ask for the most similar to "usa," and it says Canada, America, U.S.A. with periods, United States — that's pretty good. Most similar to "banana," and I get coconuts, mangoes, bananas — fairly tropical, great.

Before finishing, though, I want to show you something slightly more than just similarity, which is one of the amazing things people observed with these word vectors: you can actually do arithmetic in this vector space that makes sense. In particular, people suggested this analogy task. The idea of the analogy task is that you should be able to start with a word like "king," subtract out a male component from it, add back in a woman component, and then ask what word is over there, and what you'd like is for the word over there to be "queen." We're going to do that with the same most_similar function, which, as well as taking positive words, can also take negative words. You might wonder what's most negatively similar to a banana; you might be thinking it's some kind of meat or something, but actually that by itself isn't very useful, because when you just ask for the most negatively similar things, you tend to get crazy strings that were found in the data set and whose meaning, if any, you don't know. But if we put the two together, we can use the most_similar function with positives and negatives to do analogies: we say we want, positively, "king"; we want to subtract out, negatively, "man"; we want to add in, positively, "woman"; and then find out what's most similar to this point in the space. My analogy function does precisely that, taking a couple of most-similar positives and subtracting out the negative one.

So we can try out this analogy function. I can do the analogy I showed in the picture: man is to king as woman is to... oh, sorry, I haven't run my cells... okay: man is to king as woman is to queen. So that's great, and that works well, and you can do it the other way around: king is to man as queen is to woman. If this only worked for that one freakish example, you maybe wouldn't be very impressed, but it's not perfect and yet you can do all sorts of fun analogies with this, and they actually work. I could ask for something like: Australia is to beer as France is to what? You can think about what you expect the answer to be, and it comes out as champagne, which is pretty good.
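The similarity and analogy queries being demonstrated map onto gensim calls roughly like this (continuing the loading sketch above; the exact outputs depend on which vectors you load):

```python
import gensim.downloader as api
kv = api.load("glove-wiki-gigaword-100")

print(kv.most_similar("croissant")[:3])        # e.g. brioche, baguette, focaccia
print(kv.most_similar("banana")[:3])

def analogy(x1, x2, y1):
    # "x1 is to x2 as y1 is to ?" -- add y1 and x2, subtract x1
    return kv.most_similar(positive=[y1, x2], negative=[x1])[0][0]

print(analogy("man", "king", "woman"))         # queen
print(analogy("australia", "beer", "france"))  # champagne, with these vectors
print(analogy("tall", "tallest", "long"))      # longest
```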
Or I could ask for something like: pencil is to sketching as camera is to what, and it says photographing. You can also do the analogies with people. At this point I have to point out that this data and model were built in 2014, so you can't ask anything about Donald Trump as president (Trump is in there, but not as president), but I could ask something like: Obama is to Clinton as Reagan is to what? You can think about what you consider the right analogy there; the analogy it returns is Nixon, so I guess whether that was a good analogy depends on what you think of Bill Clinton. You can also do linguistic analogies with it: tall is to tallest as long is to what, and it does longest. So it really just knows a lot about the meaning and behavior of words, and I think when these methods were first developed, and hopefully still for you, people were just gobsmacked at how well this actually worked at capturing the meaning of words. These word vectors then went everywhere, as a new representation that was so powerful for working out word meaning. So that's our starting point for this class; we'll say a bit more about them next time, and they're also the basis of what you're looking at for the first assignment.

[Student] Can I ask a quick question about the distinction between the two vectors per word? My understanding is that there can be several context words per word in the vocabulary, but then if there are only two vectors — I thought the distinction between the two was that one is the actual word and one is the context word — with multiple context words, how do you pick just two?

Well, we're doing every one of them. Maybe I won't turn the screen share back on, but in the objective function there was a sum: you've got this big corpus of text, and you're taking a sum over every word appearing as the center word, and inside that there's a second sum over each word in the context. So you are going to count each word as a context word, and for one particular term of that objective function you've got a particular context word and a particular center word, but you're then summing over the different context words for each center word, and then summing over all the choices of center word.

And to say just a sentence more about having two vectors: in some sense it's an ugly detail, but it was done to make things simple and fast. If you look at the math carefully and you treat the two vectors as the same — if you use the same vector for center and context and work out the derivatives — things get uglier, and the reason they get uglier is that when I'm iterating over all the choices of context word, sometimes the context word is going to be the same as the center word, and that messes with working out my derivatives; whereas by taking them as separate vectors, that never happens, so it's easy. But the kind of interesting thing is that having these two different representations ends up doing essentially no harm, and my wave-my-hands argument for that is: since we're moving through each position of the corpus one by one, a word that is the center word at one moment is going to be a context word at the next moment, and the word that was the context word will have become the center word.
So you're doing the computation both ways in each case, and you should be able to convince yourself that the two representations for a word end up being very similar. They're not identical, for technical reasons to do with the ends of documents and things like that, but they're very, very similar. So effectively you tend to get two very similar representations for each word, and we just average them and call that the word vector, and when we use word vectors we just have one vector for each word.

[Student] That makes sense, thank you.

[Student] I have a question, purely out of curiosity. When we projected the word vectors onto the 2D surface, we saw little clusters of words that are similar to each other, and later on, with the analogies, we saw that there are these directional vectors that sort of indicate "the ruler of" or "the CEO of," something like that. So I'm wondering: are there relationships between those relational vectors themselves? Like, is the "ruler of" vector similar to the "CEO of" vector, and very different from, say, a "makes a good sandwich with" vector? Is there any research on that?

That's a good question — wow, you've stumped me already in the first lecture. I can't actually think of a piece of research, so I'm not sure I have a confident answer. It seems like a really easy thing to check once you have one of these sets of word vectors: for any relationship that's represented well enough by word vectors, you should be able to see whether it comes out similar. We could look and see.

[Student] That's totally okay, just curious. Sorry, I missed the last little bit of your answer to the first question: when you wanted to collapse the two vectors for the same word, did you say you usually take the average?

Different people have done different things, but the most common practice is — and there's still a bit more I have to cover about running word2vec that we didn't get through today, so I've still got a bit more to do on Thursday — once you've run your word2vec algorithm, your output is two vectors for each word, one for when it's the center word and one for when it's the context word, and typically people just average those two vectors and say, okay, that's the representation of the word "croissant," and that's what appears in the word vectors file, like the one I loaded.

[Student] That makes sense, thank you.

[Student] Thanks. My question is: if a word has two different meanings, or multiple different meanings, can we still represent it as the same single vector?

Yes, that's a very good question, and there's actually some content on that in Thursday's lecture, so I can say more about it then. But the first reaction is that you kind of should be scared, because something I've said nothing about at all is that most words, especially short common words, have lots of meanings. If you have a word like "star," that can be an astronomical object, or it can be a film star, a Hollywood star, or it can be something like the gold stars you got in elementary school, and we're just taking all those uses of the word "star" and collapsing them together into one word vector. You might think that's really crazy and bad, but it actually turns out to work rather well.
Maybe I won't go through all of that right now, because there is actually material on that in Thursday's lecture.

[Student] Oh, I see — I think we can look ahead at the slides for next time, then.

[Student] Do we look at how to implement something like the stack behind Alexa, providing speech-to-context actions, in this course, or is it primarily understanding?

So, this is an unusual quarter, and for this quarter there's a very clear answer, which is that there's also a speech class being taught, CS224S, by Andrew Maas. That's a class that hasn't been regularly offered — sometimes it's only been offered every third year — but it's being offered right now, so if what you want to do is learn about speech recognition and about methods for building dialogue systems, you should do CS224S. For this class in general, the vast bulk of it is working with text and doing various kinds of text analysis and understanding, so we do tasks like some of the ones I mentioned: machine translation, question answering, how to parse the structure of sentences, and things like that. In other years I sometimes say a little bit about speech, but since this quarter there's a whole different class that's focused on speech, that seemed a little bit silly.

[Student] I guess part of it is knowing your audience... could you say a bit more about what the speech class covers?

I'm now getting a bad echo — I'm not sure if that's my fault or yours — but anyway, to answer: the speech class does a mix of stuff. The classic pure speech problems have been speech recognition, going from a speech signal to text, and text-to-speech, going from text to a speech signal, and both of those are problems which are now normally done, including by the cell phone that sits in your pocket, using neural networks, so it covers both of those. But in between, the class covers quite a bit more, and in particular it starts off by looking at building dialogue systems — something like Alexa, Google Assistant, or Siri — that is, assuming you have a speech recognition and a text-to-speech system, so you do have text in and text out, what are the ways people go about building dialogue systems like the ones I just mentioned.

[Student] I actually had a question: some people in the chat were noticing that opposites were really near to each other, which seemed kind of odd, and I was also wondering about positive and negative valence, or affect — is that captured well in this type of model, or not, like with the opposites?

So the short answer, for both of those — and this is a good question and a good observation — is no: both of those are captured really, really badly. When I say really badly, what I mean is that if that's what you want to focus on, you've got problems; it's not that the algorithm doesn't work. Precisely what you find is that antonyms generally occur in very similar contexts — whether it's saying "John is really tall" or "John is really short," or "that movie was fantastic" or "that movie was terrible" — you get antonyms occurring in the same contexts, and because of that their vectors are very similar.
And similarly for affect- and sentiment-based words like "great" and "terrible": their contexts are similar, so if you're just learning this kind of predict-words-in-context model, no, that's not captured. Now, that's not the end of the story. People absolutely wanted to use neural networks for sentiment and other kinds of connotation and affect, and there are very good ways of doing that, but somehow you have to do something more than simply predicting words in context, because that by itself isn't sufficient to capture that dimension. More on that later.

[Student] What about very basic adjectives — well, words like "so" and "not"? Because those would appear in similar contexts too, right? Like, "this is so cool."

That's actually a good question as well. There are these very common words, commonly referred to by linguists as function words, which include ones like "so" and "not," but also "and" and prepositions like "to" and "on." You might suspect that the word vectors for those don't work out very well, because they occur in all kinds of different contexts and aren't very distinct from each other in many cases, and to a first approximation I think that's true — it's part of why I didn't use those as examples in my slides. But at the end of the day we do build up vector representations of those words too, and you'll see in a few lectures' time, when we start building what we call language models, that they actually do a great job on those words as well. To explain what I mean: another feature of the word2vec model is that it actually ignores the position of words. It says, "I'm going to predict every word around the center word, but I'm predicting them all in the same way": it doesn't predict the word before me differently from the word after me, or the word two away in either direction; they're all predicted by that one probability function. If that's all you've got, that destroys your ability to do a good job at capturing these more grammatical words like "so," "not," "and," and "but." When we build slightly different models that are more sensitive to the structure of sentences, we start doing a good job on those too.

[Student] Okay, thank you. I had a question about the characterization of word2vec, because it was slightly different from how I'd seen it presented elsewhere — are these two complementary versions?

Yes — I've still got more to say, so stay tuned on Thursday for more on word vectors. Word2vec is really a framework for building word vectors, and there are several variant precise algorithms within the framework. One choice is whether you're predicting the context words or predicting the center word: the model I showed was predicting the context words, so it was the skip-gram model. Then there's a further detail of how in particular you do the optimization, and what I presented was the easiest way to do it, which is naive optimization with the softmax equation for word vectors. It turns out that that naive optimization is needlessly expensive, and people have come up with faster ways of doing it; in particular, the commonest thing you see is what's called skip-gram with negative sampling.
Negative sampling is then a much more efficient way to estimate things, and I'll mention it on Thursday.

[Student] Okay, thank you. Someone is asking for more information about how word vectors are constructed, beyond the summary of random initialization and then gradient-based iterative optimization.

I'll do a bit more connecting this together in the Thursday lecture — I guess there's only so much one can fit in the first class — but the picture is essentially the one I showed the pieces of. To learn word vectors, you start off by having a vector for each word type, both for when it's a center word and for when it's an outside (context) word, and you initialize those vectors randomly: you just put small, randomly generated numbers in each vector component, and that's your starting point. From there on, you use an iterative algorithm where you progressively update those word vectors so that they do a better job of predicting which words appear in the context of other words. The way we do that is by using the gradients I was starting to show how to calculate; once you have a gradient, you can walk in the opposite direction of the gradient, and you're then walking downhill, i.e., minimizing your loss, and we do lots of that until our word vectors get as good as possible. So it really is all math, but in some sense word vector learning is miraculous, since you do literally start off with completely random word vectors, run this algorithm of predicting words for a long time, and out of nothing emerge these word vectors that represent meaning well.
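To tie the pieces of today together, here is a toy end-to-end sketch of that loop: random initialization, (center, outside) pairs from a window, the naive softmax, the gradient derived earlier, and plain gradient-descent updates. It is an illustration only, not the course's implementation or the efficient negative-sampling version, and on a corpus this tiny the learned neighbors won't mean much:

```python
import numpy as np

corpus = ("the quick brown fox jumps over the lazy dog "
          "the quick red fox jumps over the lazy cat").split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d, m, lr = len(vocab), 8, 2, 0.1

rng = np.random.default_rng(0)
Vc = rng.uniform(-0.5, 0.5, (V, d)) / d      # center vectors, small random init
Uo = rng.uniform(-0.5, 0.5, (V, d)) / d      # outside vectors, small random init

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(200):
    for t, center in enumerate(corpus):
        c = idx[center]
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < len(corpus)):
                continue
            o = idx[corpus[t + j]]
            p = softmax(Uo @ Vc[c])          # P(. | c) over the whole vocabulary
            grad_c = -(Uo[o] - p @ Uo)       # d(-log P(o|c)) / d v_c = -(u_o - expected u)
            grad_U = np.outer(p, Vc[c])      # d(-log P(o|c)) / d U ...
            grad_U[o] -= Vc[c]               # ... minus v_c for the observed word
            Vc[c] -= lr * grad_c             # walk opposite the gradient (downhill)
            Uo -= lr * grad_U

W = (Vc + Uo) / 2                            # average the two sets, as discussed in the Q&A
def nearest(w, k=3):
    v = W[idx[w]]
    sims = W @ v / (np.linalg.norm(W, axis=1) * np.linalg.norm(v))
    return [vocab[i] for i in np.argsort(-sims)[1:k + 1]]
print(nearest("fox"))
```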