Welcome to the Stanford CS224W course. Today I will be your guest instructor on behalf of Jure; my name is Jiaxuan, and you can find more information about me on the website. I was also one of the head TAs in the previous offering, so I'm quite familiar with the course. Today I'm excited to introduce the topic of graph neural networks.

Let's start with a quick recap of the previous lecture. The key concept we learned is node embedding. The intuition is that we want to encode the nodes of a network into a low-dimensional vector space; in particular, we want to learn a function that takes an input graph and embeds it into a low-dimensional node embedding space (here we project into two dimensions), and the key problem of machine learning on graphs is how to define this function f.

Here's a recap of how to encode a node. The goal is to define a similarity function that represents how similar two nodes are in the network, and to approximate that similarity in the embedding space. We need to define two things: the similarity function, which captures how close two nodes are in the network, and the encoder function, which tells us how to map a node in the graph into the embedding space. So there are two key components: an encoder that maps each node into an embedding, and a decoder that defines the similarity function. The encoder takes a node as input and generates an embedding in d-dimensional space, while the similarity function specifies the relationships in the network and is expressed through the loss function as the decoder.

How do we define similarity? Last lecture we talked about random-walk-based similarity functions: for example, if two nodes co-occur on a short random walk, we consider them similar. That's one way to define similarity, and we can come up with others. How about encoding? We have seen one of the simplest ways to encode a node, which is a shallow encoder, an embedding lookup. In this shallow encoder we have an embedding matrix where one column represents one node, and the other dimension of the matrix is the size of the node embedding. The embedding lookup means that to represent a node we pick the corresponding column of the embedding matrix and use that vector as the node representation. That's the shallow encoder idea.

So what are the issues with this shallow encoder? We list three key limitations. First, this way of encoding nodes is not really scalable, because we assign one learnable vector to each node, so in total there are on the order of |V| times d parameters, and that wouldn't work for real-world graphs with, say, billions of nodes. Another limitation is that this approach is transductive, which means the encoder can only be applied to nodes seen during training: suppose we have the embedding matrix and a new node arrives; we do not know how to encode that new node, because it isn't even in the embedding matrix. And lastly, this approach does not incorporate node attributes. In real-world graphs the node attributes are usually very valuable; for example, in protein interaction graphs we have many protein properties associated with each node.
Without those attributes, this kind of embedding-lookup encoding cannot make use of such valuable information, so we wouldn't be learning from everything the network gives us. These are the limitations of the shallow encoder, and that's why today we want to dive deeper into how we can design more expressive, more powerful encoders. Our idea is to use deep graph encoders: rather than a simple embedding lookup, we define multiple layers of neural network transformations and use that deep neural network to encode a node in the graph. Note that these deep encoders can be combined with the node similarity functions we just introduced; for example, we can still define a random-walk-based similarity function on top of the deep encoders introduced today.

Here is the plan, the pipeline of a deep graph encoder. Our input is a complex network. We pass it through several graph convolutional layers; between these layers we can inject, for example, activation functions and then some regularization layers. After computing through this deep encoder we get the output. The output is most commonly node embeddings, but we can also generate embeddings for subgraphs or even for entire graphs. These are some of the options for what this deep graph encoder can produce.

Suppose we have such an encoder. What can we do with it? We can define many exciting tasks on graphs, at at least four different levels. At the node level, we can use this idea to encode individual nodes and perform node classification; for example, given a drug-protein interaction network, we can predict whether a drug is toxic or whether it can be used to treat a given disease. We can also define link prediction on a network, which is quite useful for, say, recommender systems, where we define an interaction network between users and items and predict whether a user will buy a certain item. We can define community detection over a network; this is useful in, say, financial networks, where a cluster of fraudsters may have suspicious transactions among each other, and our goal is to detect those anomalous clusters. Finally, we can define network similarity tasks, which are graph-level tasks or graph-level predictions; for example, we can encode different drug molecules, treating each molecule as a graph, and classify different types of molecules. This is just a glimpse of the different ways you can define machine learning tasks on graphs.

Okay, so far we have motivated why we want to use deep encoders for graphs. We'll start with a very brief introduction to the basics of deep learning. A lot of the course material is also posted online; we won't cover all of it in this lecture, because we assume most of you already have some idea of deep learning, but if you don't feel very confident about the material, please take a look at the website. We also have recitations and office hours, so you can ask the TAs specific questions about deep learning basics. Here I will just summarize some of the key ideas we'll use in today's lecture. Our starting point is to formulate machine learning as an optimization problem.
The optimization problem has an input x, and the goal is to predict the label y. The input x can be of different data modalities, such as vectors, sequences, or matrices, and today we'll talk about graphs. We formulate the task as an optimization problem that looks like this: our goal is to find the optimal parameters theta that minimize a loss function. The loss function involves the labels and the predictions, and theta is the set of parameters we want to learn; it could be the weight matrices of a deep neural network, and in the shallow encoder example, theta is the entire embedding matrix that we learn. We can define different types of loss functions depending on the task we care about: for regression tasks we usually use the L2 loss shown here, and for classification tasks we use the cross-entropy loss. One useful resource is this PyTorch documentation link, which gives both the theory and concrete implementation examples for a bunch of loss functions that are useful in deep learning; we suggest you take a look, and it will also be useful for your Colabs and potential course project.

Another concept that will come up often in this lecture is the acronym MLP, which stands for multi-layer perceptron. It is usually written as follows: we have the input x, where the superscript l means we consider the input to the l-th layer of the network; we apply a linear transformation to the input, add a bias term, which is also trainable, and then apply a nonlinear activation function, which could be ReLU or sigmoid, to get the output: x^(l+1) = sigma(W_l x^(l) + b_l). This layer is pretty simple, but it is really the building block for tons of useful deep learning architectures, for example in graph neural networks, and even in other domains: in Transformers, for instance, we use MLPs for the linear and nonlinear projections. Here's a concrete example: suppose we have a two-dimensional input to an MLP, so there are two input neurons; we apply an intermediate hidden layer, say three-dimensional, and the output is one-dimensional. This is a very frequently used module in deep learning architectures.

To summarize, we formulate machine learning as an optimization problem of finding the optimal theta for a deep neural network f, where f can be anything: a linear layer, an MLP, or a complex graph neural network. The way we train it is through propagation: we first forward-propagate to make predictions, and then we back-propagate to find the optimal theta, using stochastic gradient descent to optimize theta over many iterations. That's the basic idea of deep learning.
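To make that MLP building block concrete, here is a minimal sketch in PyTorch, the framework the course materials point to. The layer sizes 2, 3, and 1 just mirror the toy example above, so treat this as an illustration rather than anything taken from the slides.

import torch
import torch.nn as nn

# A minimal two-layer MLP matching the toy example:
# 2-dimensional input, a 3-dimensional hidden layer, 1-dimensional output.
# Layer rule: x^(l+1) = sigma(W_l x^(l) + b_l), with ReLU as the nonlinearity.
mlp = nn.Sequential(
    nn.Linear(2, 3),   # W_1 x + b_1
    nn.ReLU(),         # sigma(.)
    nn.Linear(3, 1),   # W_2 x + b_2
)

x = torch.randn(8, 2)  # a batch of 8 two-dimensional inputs
y_hat = mlp(x)         # shape (8, 1)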
So let's jump into the more exciting topic of today: deep learning for graphs. How do we define that paradigm, particularly for graph-structured data? Here is our plan, a summary of the content. There are two major steps to defining a full deep learning method on graphs. We'll first consider the local network structure, where we describe the aggregation function that tells us how to summarize information from the local neighborhood of a node, and then define the concrete computational graph that executes the computation. Having defined this setup, we'll specify a concrete GNN architecture, which involves multiple layers, and we'll talk about how to train this model, with some examples of unsupervised and supervised learning on these deep graph networks.

Let's begin with the setup. We assume we are given a graph as the input to this deep graph learning model. The graph can be defined as follows: we have V as the vertex set and A as the adjacency matrix, which we assume to be binary, and X is the matrix of node features, with dimensions |V| (the number of nodes) times d, where d is the feature dimension of the input. We use a lowercase v to denote a node in the vertex set, and the neighborhood function N(v) defines the set of direct neighbors of a given node in the graph. A very useful aspect is that we have node features associated with the nodes, so we don't just have plain nodes in the network; we also associate a lot of meaningful features with each node. For example, in social networks we can associate user profiles or product images with the nodes, and for biological networks we can use gene expression profiles, gene function information, and many other useful attributes.

You can also imagine cases where we don't have node features in the graph dataset, either because they are missing or because we simply don't know them, and there are ways to deal with that. One option is to use an indicator vector, a one-hot encoding of the node: suppose we have five nodes in the graph, we can use five-dimensional one-hot vectors labeling the nodes 1 through 5, and that gives us some initial feature encoding. We can also consider using a constant as the node feature; in that case we just use a constant 1 for all nodes in the network, which says that all nodes are treated equally and don't carry much additional information. So those are the options when you don't have node features in your graph dataset.

Question: in the constant-vector case, how do you decide what the dimension should be? Good question. If we don't know the node features and want to insert a constant vector, a one-dimensional scalar is essentially sufficient, because it doesn't carry much meaning; the only purpose of this feature is to be easily processed by a neural network, and a constant 1 means all nodes are treated equally. You can imagine assigning different values to different nodes, but that carries additional prior information: say, in a given network we assign nodes A and B different values, which basically means you can differentiate those two nodes. So this initialization expresses how you want to treat or value different nodes in the graph.

Question: how do we incorporate edge features during training? Great question. In this setup we simplified the case and did not explicitly add edge features: we assumed the adjacency matrix is binary. But you can also imagine meaningful edge features; for example, in a user-item interaction graph a user may click an item or purchase an item, so there are different edge types. In that case we can extend the adjacency matrix: rather than keeping it binary, we can make its entries more meaningful, for example categorical values where different values indicate different edge types. That's one way to extend to edge attributes.
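Before we move on, here is a tiny sketch of the two fallback featurizations mentioned a moment ago, one-hot indicator vectors versus a constant scalar feature. The five-node graph is hypothetical and this is just an illustration, not code from the course.

import torch

num_nodes = 5  # a hypothetical five-node graph

# Option 1: one-hot indicator features, one unique vector per node.
# Note this ties the feature dimension to the number of nodes.
x_one_hot = torch.eye(num_nodes)        # shape (5, 5)

# Option 2: a constant scalar feature, so all nodes are treated equally.
x_constant = torch.ones(num_nodes, 1)   # shape (5, 1)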
Okay, any more questions? Okay. So this is the setup for graph deep learning. Now, suppose we hadn't had this lecture: what might you consider doing to learn from a graph with deep learning? The most straightforward idea is that, since we have an adjacency matrix and a node feature matrix, we can concatenate the two matrices together and apply the simplest deep learning method, the MLP we just introduced: throw this concatenated matrix representation of the graph into an MLP and see how it predicts. This can be implemented and may even work in certain cases, but there are some notable issues. First, the approach is not really scalable: the input dimension grows with the number of nodes, plus the additional feature dimension d, so the number of parameters is on the order of |V| times d. Second, this approach is not applicable to graphs of different sizes, because the input dimension varies with the number of nodes: if you train the network on a five-node graph, you cannot apply the same network to a six-node graph, which is quite limiting. And finally, the approach is very sensitive to node ordering. Suppose we have the same network but we reorder the nodes, so instead of calling them A through E we call them E through A; all the values in the input get shuffled as well, and you can see the output will be different. These limitations tell us that we cannot directly define a graph learning method from the standard deep learning building blocks; instead, we need to define architectures specific to graph data.

Another thing in our toolbox we might consider is the convolutional neural network, because CNNs easily process image data, and in fact in all our materials we visualize a graph as an image. So it seems natural that we could pad or downsample the graph into a 2D grid and then use a CNN to process it. Actually, in the early days of deep learning on graphs researchers did exactly that, and it works in some cases, for example molecule classification and similar tasks where the graphs are relatively simple. The issue is that a network does not have the regular structure of an image: a network can be very complex in its structure, and most notably it does not have a fixed notion of locality. For example, you may think two nodes drawn near each other are close together, but in reality a node drawn far away may be just as close to a given node, because they are directly connected. You have tons of options for how to lay out a network as an image, which gives you a lot of freedom but also prevents the model from learning anything meaningful from the network. Also, a graph is permutation invariant: it is an unordered object, so even within a convolutional filter patch you could arbitrarily shuffle the order of the nodes and the meaning wouldn't change, yet the CNN would treat the inputs differently. This is a thought experiment that shows why we cannot directly apply CNNs to graphs. So let's discuss what properties we really need from deep learning methods for graphs.
Let's start with the first property, so-called permutation invariance. The idea is that a graph does not have a canonical ordering: we can choose arbitrary order plans for a given network. Here's a concrete example, a network we'll use as a running example. We define a specific node ordering and the corresponding node feature matrix and adjacency matrix that describe what this graph looks like. In the meantime, we can shuffle the order of the nodes and come up with a second way to order the graph, and you see that after this shuffling, or permutation, the node feature matrix and the adjacency matrix are different. Our point is that because these two order plans describe the same network, what we want is that the graph and node representations computed from the two order plans of the same network should be identical. This is our goal when we define a deep learning method for graphs.

Let's describe this permutation invariance property more rigorously. What do we mean by the graph representations being the same under permutation? We define a projection, or embedding, function f that takes the adjacency matrix A and the node feature matrix X. Given two different order plans of the network, (A1, X1) and (A2, X2), if we feed both into the same encoding function f, we want the outputs to be the same. That basically says: given two orderings of the same network, the encoding function f, say a graph neural network, should always generate the same representation for the same graph. Visually, for the two order plans the output of f should be the same, and we can state it formally: for permutation invariance, we should consider not just two orderings, order plan A and order plan B, but any pair of order plans i and j, and require that the outputs are the same. We can also view this in matrix form: we define a graph function that maps a node feature matrix (of size |V| times m) and the adjacency matrix to a vector, and we say this function f is permutation invariant if, writing A and X as the inputs, permuting the adjacency matrix and permuting the feature matrix leaves the output unchanged for any permutation P: f(P A P^T, P X) = f(A, X). These are different ways to express the same thing, but this is the definition of permutation invariance.

Another concept here is permutation equivariance, which is a different notion from invariance. What does it mean? With invariance we talked about encoding a graph into a graph-level representation, where we have one vector for the whole graph. Now we consider representations at the node level: instead of one vector, we have one vector for each node in the graph, so in the end we have an embedding, or output, matrix for the network; the output is a matrix rather than a single vector. What is the nice property we would like to see here? If we take the same network and permute the order of the nodes, we would like the encoded value for each node to stay the same, with only its position in the output changing according to the position of the node in the ordering. Here is an illustrative example.
Let's say we care about this yellow node. In the first order plan the yellow node is labeled A and appears in this position. Our hope is that, because we only shuffled the order of the nodes, the encoding of the yellow node should be the same, and it should appear at the position corresponding to its new label E in the second order plan. In other words, the encoding of a node stays the same and is tied only to which node it is: it is associated with label A in one plan and label E in the other, and the same holds for all the other nodes in the network. Take another example, the green node: here it is labeled C and there it is labeled D, and with this nice equivariance property, after we shuffle the node ordering the value stays the same and follows the node's label. We can also define permutation equivariance formally: if the output vector of a node at the same position in the graph remains unchanged for any permutation of the ordering, we say the function is permutation equivariant. In matrix form, we now define a node-level function that encodes a network into a matrix instead of a single vector, and this function is permutation equivariant if permuting the input adjacency matrix and node feature matrix permutes the output matrix accordingly with the same permutation matrix P: f(P A P^T, P X) = P f(A, X).

Having introduced these two properties, it is worth reviewing and comparing them. Here are the matrix definitions of invariance and equivariance. In plain language, permutation invariance means that if we permute the input, the output stays the same; the output here is usually a vector, one vector for the whole graph. Permutation equivariance means that if we permute the input, the output gets permuted accordingly; this usually concerns the case where we map a graph into a matrix containing the embedding vector of each node.

Here are three concrete examples of which functions are invariant and which are equivariant. The first example is f(A, X) = 1^T X, which is a summation of the matrix over its rows, that is, over the nodes. You can easily see this function is permutation invariant: because it is a summation, no matter how you permute the input X, the output stays the same. Here's another example, which is permutation equivariant and pretty simple: f(A, X) = X, where we always output the input X directly. This is basically the shallow encoder we saw, where we have an embedding matrix and essentially output it as-is, and it's easy to see it's equivariant: if you permute the input, of course the output gets permuted in exactly the same way. Finally, f(A, X) = A X is a permutation equivariant function; here we use the property that a permutation matrix times its transpose cancels out into the identity, so P A P^T P X = P (A X), and you see this function is also permutation equivariant. This idea of A times X is the simplest version of graph message aggregation in graph learning, and we'll talk about it later as well.
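You can also verify these claims numerically. Here is a small sketch, my own illustration with a random symmetric adjacency matrix, that checks that 1^T X is permutation invariant and that A X is permutation equivariant.

import torch

num_nodes, dim = 5, 4
X = torch.randn(num_nodes, dim)                       # node feature matrix
A = (torch.rand(num_nodes, num_nodes) > 0.5).float()  # random matrix for illustration
A = ((A + A.t()) > 0).float()                         # make it a symmetric "adjacency"

P = torch.eye(num_nodes)[torch.randperm(num_nodes)]   # a random permutation matrix

# Invariant example: f(A, X) = 1^T X (sum over the node dimension).
# Permuting the rows of X does not change the result.
print(torch.allclose(X.sum(dim=0), (P @ X).sum(dim=0)))   # True

# Equivariant example: f(A, X) = A X (sum over each node's neighbors).
# Permuting the inputs permutes the output rows in the same way.
out = A @ X
out_perm = (P @ A @ P.t()) @ (P @ X)
print(torch.allclose(out_perm, P @ out))                  # True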
Any questions here? I know the concepts are a bit dense. Question: what exactly is the permutation matrix P, what does it look like? We don't have a visualization here, but it is a binary matrix of zeros and ones, and it basically describes how we shuffle the rows of a matrix. Say we have a matrix with six rows; then the permutation matrix is 6 by 6, and the locations of the ones tell us where each row of the input should go in the output. For example, with a 2 by 2 permutation matrix [[0, 1], [1, 0]], we swap rows A and B. That's the idea of a permutation matrix, and you can also search the web for more. Any other questions? Okay, we can continue.

Why do we want to study these two properties, invariance and equivariance? Because they are quite important and are essentially the key building blocks of graph learning methods. As an example, a graph learning method usually involves multiple permutation equivariant layers and permutation invariant layers, and this is how we make sure the method faithfully represents the graph rather than scrambling it, as the naive MLP we introduced earlier would. Here is a visual example of why an MLP does not satisfy this property: take an MLP, take a vector as input, and permute the values of that vector. After feeding it through the MLP, the output will clearly be different, because the MLP has no mechanism to aggregate the input in a permutation invariant way. This is precisely why the naive MLP approach to encoding a network fails: it does not have this nice permutation invariance property. Next I'm going to talk about how we specifically define a graph neural network with these two properties, and in particular how we pass and aggregate messages from a node's neighbors.

Question: going back to the previous slide, it feels like this is a problem that would come up when building a neural network for any kind of unstructured data where ordering is not important; is this the first time it came up, with graph neural networks, or was it seen before? That's a good question: can we use graph learning methods to encode other unordered objects, and where did the idea originate? The idea of graph learning methods, as we'll see when we dive deeper later, is inspired by learning from sets: the set of neighbors of a given node is a set, and if we have a method to learn from sets, we can essentially build a complete graph learning method. So it definitely has roots in other areas of deep learning as well.

Okay, let's continue our discussion. So far we have talked about the motivation for graph deep learning and the properties we require of a graph deep learning method. Next I'm going to talk about a concrete example, the so-called graph convolutional network, which is arguably the most popular way to encode a graph and the most popular way to implement a graph neural network. Here is the idea of a graph convolutional network: we encode a node based on its neighborhood structure, and we follow a two-step process.
Suppose we want to encode node i in the network. We first define its computational graph based on its local neighborhood structure: we take its one-hop neighbors, the neighbors of those neighbors, and so on, depending on how many layers we have. Then, having defined this computational graph, we propagate and transform information along it. The key concept here is how we learn and propagate information across the network as defined by the node's neighborhood structure.

We'll introduce the idea through a running example. This is the input graph, and our goal is to encode the yellow node A. Following the idea I just described, we unroll the computational graph by finding A's direct neighbors in the network, B, C, and D, and then we iteratively unroll the neighborhood structure of each of those neighbors: for node B we look at its neighbors A and C, for node C we look at its neighbors, and so on. You can imagine this unrolling happening iteratively until we reach the number of rounds we want, which is the number of layers in the network. Another thing to note is that a node can appear multiple times in this computational graph. What do I mean? You can see that node A itself also appears here, several times. The interpretation is that node A is a two-hop neighbor of itself: if you look two hops away from a given node, you encounter the node itself again, so it's fine to have these duplicates in the computational graph.

Having defined the structure of the computation, how do we actually implement it? These gray boxes are neural networks. What such a neural network does is compress, or condense, the information from a set of nodes into a new representation, which then gets compressed again at the next level. This is where neural networks come in for graph learning methods like GNNs. We can repeat this process and define a computational graph for every node in the network; we've been talking about the yellow node, but you can define a computational graph for each node in a graph. The intuition is that different nodes have different computational graphs because their local neighborhood structures differ, and this is precisely how graph learning methods differentiate nodes in a network: because they have different neighborhood structures, we define different computational graphs, and therefore the embeddings we generate for different nodes are also different.

Question: is it true that, for the problem of node embedding, the higher a node's degree, the more complicated its embedding, meaning that nodes without many neighbors are easier to embed? The embedding basically scales with the neighbors, right? If there's one neighbor it's easy, but with a thousand neighbors there's a lot of information to embed. That's a great question: is a node with fewer neighbors easier to encode? Rigorously speaking we would need to define what "easy" means, but intuitively a node with fewer neighbors takes in less information, and that information is more easily compressed. There is actually a research frontier in GNNs around a problem called over-squashing, because a GNN is fundamentally about squashing information together, and that problem describes exactly this.
Suppose the network is very deep and the degree of a node is very high; the network may then tend to over-squash the information coming from the graph. And your intuition is correct in the sense that if a node has fewer neighbors, it is easier to compute its embedding. Okay, let's continue. Another concept here is the number of layers. In my example we have a two-layer graph neural network, which basically says we look at the neighbors and the neighbors of the neighbors. This concept is quite important for GNNs, because the number of layers really tells us how far away we look in the network: with a two-layer GNN we can usually see the nodes that are two hops away. This is different from other deep learning architectures: in a convolutional neural network the number of layers is usually just a hyperparameter that you pick for whatever makes the model perform best, but for a GNN it has a concrete physical meaning; the number of GNN layers tells us how far we can see in the network. Usually we pick the number of layers approximately equal to the diameter of the graph. This graph has diameter two, which means any pair of nodes is at distance at most two, so in this case a two-layer GNN is usually sufficient for the network to pick up the essential information. So we have a fairly principled way to set the number of layers for a graph neural network.

We also have the concept of neighborhood aggregation, which is done by the gray boxes in the example. The question is how we actually define these boxes. The simplest approach is to take an average of the information: to be more specific, we first average the messages computed from each neighbor of a given node, and then we apply a neural network that further transforms this aggregated, averaged information. This two-step process is essentially what we will see in a graph convolutional network. So here is the mathematical... sorry, there's a question.

Question: can you go back one slide? If we end up choosing, say, K equals 2, so two convolutional layers, but by going two hops away we end up having basically all the other nodes as inputs to the embedding process, don't all the nodes end up with very similar representations after two hops? You're asking a lot of insightful questions. The question is: if we unroll the network and look at the computational graph of each node, wouldn't all nodes end up looking similar in the end? We just talked about the over-squashing problem for GNNs; another research problem here is so-called over-smoothing, which is precisely what you described. If we continue this unrolling for many layers, eventually all the nodes have very similar neighborhood structures, which makes them indistinguishable, and they all end up with the same embedding. This is also an open research question, but usually the way to avoid the over-smoothing problem is simply not to make the GNN very deep: we set the number of layers roughly equal to the diameter of the network, and in that case, as in the example shown here, the computational graphs of different nodes still differ enough to tell them apart.
But suppose you keep unrolling for many more layers: you may face the issue that all the nodes look almost the same and we cannot differentiate them. This is different from convolutional networks or ResNets, where you can stack, I don't know, 150 layers; for graphs, if you unroll 100 layers the model will essentially learn nothing, because it cannot differentiate any nodes.

Okay, we've had a lot of visualizations; now we can give a concrete mathematical definition of what exactly a graph convolutional network is. Our input is the node attributes x_v, which describe the features of a given node, and, to make the computation concrete, we consider the case of embedding a single node v; you'll see that all the expressions here carry the subscript v, and the output is z_v. So we have a node attribute vector and we want to generate the embedding z_v for this given node v. How do we start the computation? We assign x_v to h_v^(0), which is the zeroth layer, the initialization of the input to the graph neural network. Then we repeatedly apply the following transformation: it takes the embedding of v from the previous layer k together with the average of the neighbors' previous-layer embeddings,

h_v^(k+1) = sigma( W_k * (1/|N(v)|) * sum over u in N(v) of h_u^(k) + B_k * h_v^(k) ).

In this transformation we do two things. First, we consider how to represent the node itself from the previous layer: we take that embedding and apply a linear transformation B_k to encode, or transform, that message. The second question is how to represent the neighbors of the given node: we take their average, normalizing by the neighborhood size, that is, the degree of the node, and after taking the average we apply another transformation W_k, which tells us how to transform and further encode the neighbors' information. So there are essentially just two parts: how we encode the node itself and how we encode the node's neighbors. After this encoding, because stacking linear functions still gives a linear function, we have to introduce some nonlinearity sigma; the most popular choice is ReLU, which is also popular in other deep learning architectures. This defines how we compute one layer of the GNN, and in practice we have multiple layers, where K is the total number of layers: we iteratively apply this update starting from the zeroth-layer embeddings, and in the end we get the embedding z_v = h_v^(K) after K rounds of computation. This can be described as doing K rounds of neighborhood aggregation, and notice that because we use a summation over the neighbors, this function is permutation invariant, so it follows the nice property that we discussed for any graph learning method.
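Here is a minimal sketch of that single GCN layer in PyTorch, written for a dense adjacency matrix to keep it short; the class name SimpleGCNLayer and the dense-matrix interface are my own illustration, not the course's reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGCNLayer(nn.Module):
    """One layer of the GCN update described above:
    h_v^(k+1) = ReLU( W_k * mean_{u in N(v)} h_u^(k) + B_k * h_v^(k) )."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)  # transforms the averaged neighbor message
        self.B = nn.Linear(in_dim, out_dim)  # transforms the node's own previous embedding

    def forward(self, A, H):
        # A: (N, N) binary adjacency matrix, H: (N, d) previous-layer embeddings.
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees, avoiding divide-by-zero
        neighbor_mean = (A @ H) / deg                  # average of each node's neighbors
        return F.relu(self.W(neighbor_mean) + self.B(H))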
Any questions here? Question: why do you use the average of the neighbors' previous-layer embeddings, rather than just the neighbors' raw features, the embeddings at layer zero? I see, so you're suggesting that instead of looking only at the previous-layer embedding, we could also look at the layer-zero, initialization embedding. Yes, because to me it sounds like you're adding this node at this step, so it doesn't quite make sense to look at all the previous layers. Well, first, the motivation here is that we want a so-called deep representation, an interleaved computation for each node; that's why we iteratively generate embeddings and feed them into the next layer. Suppose we replaced this representation with the layer-zero one: the embedding would then be shallow, because you would never go beyond a one-layer network; you would directly transform the raw, layer-zero input. But your idea is actually helpful and does appear in the literature. Some of you probably know the notion of DenseNet; essentially your suggestion is that during the computation we always keep a connection back to the raw, layer-zero input: in a stack of multiple layers, each layer also connects back to the input. That's one interpretation of your suggestion, and I think it's valuable, but in this basic version of the GCN we just use the previous layer. Any other questions?

Okay, so this is basically the densest slide of this lecture; it describes what a GCN looks like. We talked about permutation equivariance and permutation invariance, so let's check whether the network we have defined satisfies these two properties. Let's first talk about permutation invariance. The argument is as follows: given a node in the network, the GCN computation of its embedding is permutation invariant. Our example is embedding this blue node, and the computation of this node's embedding is permutation invariant for two reasons: first, we share the neural network weights when we transform the neighbors of the given node, and second, the aggregation we use is itself permutation invariant, for example just a summation. For these two reasons we can conclude that the full computational graph is also permutation invariant. Why is this property interesting, or non-trivial? Because we could also define the computation differently, and in fact in the early GNN literature people tried other ways of aggregating the neighbors, which may perform well but do not satisfy the permutation invariance property. A popular choice at the time was to use a recurrent neural network to encode the set of nodes: given neighbors B, C, D, you treat them as a sequence, feed that sequence into a recurrent network, and use the output as the encoded value. Although this idea makes sense, it does not satisfy permutation invariance. In our case, shared neural network weights plus a sum aggregation make sure the computation is permutation invariant.

The other property to check is permutation equivariance, and the argument is as follows. Consider encoding all the nodes in the network: previously we focused on encoding one node, say the blue node, but now suppose we consider the embedding function applied to every node in the network. We claim that this embedding function is permutation equivariant. Just to remind you what equivariance means: we take the same network with two different order plans, and we check that after the permutation the same node has the same encoded value, with only its position in the output differing accordingly. Why is this true for a GNN? It is because of two reasons.
First, we make sure the feature matrix and the embedding matrix are aligned row by row. What I mean is that the first row always refers to node A, in both the input node feature matrix and the output node embedding matrix, the second row is about node B, the third row node C, and so on. The second reason is that we just showed that, for a given node, applying the GNN is permutation invariant, so after permuting the nodes, the color, which represents the value of the embedding, is always the same because of this invariance property. Given these two reasons, if we shuffle the ordering of the nodes in the network, we can see that globally the computation exhibits this permutation equivariance property. Why is this property also important? Because in our case we want to encode all the nodes in the network, and we want a reliable way to extract the embedding of a given node. Suppose we compute the embeddings in parallel: a possible failure mode is that we correctly embed each node, A through F, but get the order of the embeddings wrong. Instead of having them aligned, first row node A, second row node B, we might end up with, say, the third row referring to node A. So even if the set of embeddings is globally correct, their ordering could be wrong. The permutation equivariance property ensures that we can always reliably retrieve the embedding of a given node, so this property is also non-trivial and really important for GNNs.

Okay, we can move forward. So far we have defined most of the graph neural network; next I'm going to talk about how we train such a network. To train the network, the key is really to define a loss function based on the embeddings z_v that we get. What does our training pipeline look like? This is the same definition of the GCN that we have just seen, and it has two main trainable parameters, W_k and B_k. To understand the equation: B_k tells us how to transform and process the information of the node itself, and the matrix W_k tells us how to encode the neighbors of the given node. These are the trainable parameters, and we can then feed the output node embeddings into any loss function you like: at the beginning, in the review, we talked about the L2 loss for regression and the cross-entropy loss for classification, so you can define those loss functions on top of the computed node embeddings. And another... sorry, question.

Question: can you go back one slide? No, not this one, sorry, back... this is forward... okay, another one... the one with the embedding on the right, here we go. Let's say this is a classification case and we have these two representations of the network, and it's the same network at the end of the day; how do we make sure that embedding H1 and embedding H2 end up with the same output, say class A? Do we just stack them up as one vector and then do some kind of summation, and that's it? How do we go from these embeddings to a classification? Actually, in the following slides we will talk about exactly that, but the idea is that for a given node we get a vector, and we can add, for example, an additional layer that projects this vector into a single score; the score gives the predicted class of that node. For example, in binary classification the classes are true or false.
Then we can project that embedding vector into a single scalar in the range 0 to 1, and that tells us how to classify the given node. Follow-up: but that function itself should also be a summation or something where the order doesn't matter, right? Because we want H1 and H2 to have the same numerical value when we pass them through it. So here I want to clarify: we are defining a node classification problem, so the labels are not associated with the embedding matrix as a whole but with a particular node. For example, we know that this blue node has the label 1. It is not related to the order of the nodes: we always find this blue node and assign the label 1 to it, and this way we can always reliably associate a node with its label and train the network regardless of how we order the nodes. Okay, any other questions?

Okay, so we talked about training a graph neural network, where the goal is to optimize these weight matrices based on the embeddings. Another important topic is how we can efficiently train and optimize the GNN; the success of modern deep learning really relies on parallel computing and GPU acceleration. Earlier our formulation was at the local level: we only considered how to compute the embedding of a given node, and we didn't talk about how to parallelize the algorithm. Now our goal is to rewrite the definition of the GNN in matrix form, because once we have it in matrix form we can easily use those parallel computing devices to accelerate training. Here is the idea. Earlier we said that the core of neighborhood aggregation is summing over the neighbors of a given node. We can equivalently write this in matrix form: it is essentially A_v H, where A_v is the row of the adjacency matrix describing node v, and H is the matrix holding the embeddings of all the nodes in the network. Another notion is D, a diagonal matrix whose entries are the degrees of the nodes; we can use the inverse of D to express one over the node degree. Based on these two matrices, the aggregation function that takes the average of each node's neighbors can be written in matrix form as D^-1 A H, and you can check that the two expressions are the same; this way we turn the local view of the GNN into a global matrix view. Following this idea, we can write the entire GNN transformation in matrix form: H^(k+1) = sigma( D^-1 A H^(k) W_k^T + H^(k) B_k^T ). The first term, shown in red, describes how we aggregate and transform each node's neighbors, and the second term, in blue, describes how we transform the node itself, now in a global view. In practice, this means we can compute a GNN efficiently not node by node but through global sparse matrix multiplications, which is very helpful: if you can accelerate sparse matrix multiplication, you can essentially accelerate any graph learning algorithm of this form.
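As a sketch of that matrix form, here is one GCN layer written with a sparse adjacency matrix in PyTorch; the random edge list and the sizes are made up purely for illustration.

import torch

# One GCN layer in matrix form: H' = ReLU(D^-1 A H W^T + H B^T),
# using a sparse adjacency matrix so the aggregation scales to large graphs.
N, d_in, d_out = 1000, 16, 32
edge_index = torch.randint(0, N, (2, 5000))              # hypothetical random edge list
values = torch.ones(edge_index.shape[1])
A = torch.sparse_coo_tensor(edge_index, values, (N, N)).coalesce()  # coalesce merges duplicate random edges

H = torch.randn(N, d_in)                                  # previous-layer embeddings
W = torch.randn(d_out, d_in)                              # neighbor transformation W_k
B = torch.randn(d_out, d_in)                              # self transformation B_k

deg = torch.sparse.sum(A, dim=1).to_dense().clamp(min=1).unsqueeze(1)  # node degrees, the diagonal of D
H_next = torch.relu((torch.sparse.mm(A, H) / deg) @ W.t() + H @ B.t())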
As a side note, and somewhat in contrast to what I just said, not every GNN can be written in this matrix form, because the aggregation function can be more complex. What are some examples? For example, when we perform the aggregation, instead of taking the average of the neighbors we could take the maximum. Because the max operator cannot be written in this matrix form, we cannot express such a GNN succinctly as a matrix product, and it cannot directly benefit from sparse matrix multiplication.

Okay, so we talked about how to efficiently train the network; next I'm going to give some concrete examples of how to train it in concrete machine learning settings. As you probably know, there are two major paradigms of machine learning: supervised learning and unsupervised learning. In supervised learning we have concrete labels associated with the graph; say we have a label for each node, and then we can define a loss function for classification or regression. Alternatively, we can consider the unsupervised learning paradigm for graphs, where we don't have any external labels for the nodes, but we can instead use the graph structure itself as the supervision. Here is an example of an unsupervised learning technique, and the idea is actually closely related to what we learned in the previous lecture about random-walk embeddings. In this case the loss function takes two node embeddings as input, the embedding of node u and of node v; we have a decoder function, for example an inner product, and we compare this prediction with a node-pair label y_uv. How do we label the pair of nodes (u, v)? The idea is that we label it based on node similarity: if a pair of nodes is considered similar we label it 1, and if not we label it 0. And how do we define similarity? It can be precisely what we taught in the previous lecture: for example, with random walks, if two nodes co-occur on a walk we assign the pair the label 1, and otherwise 0. We can also use the idea of matrix factorization, also mentioned in the previous lecture, where we aim to reconstruct the network structure: if two nodes are direct neighbors of each other we label them 1, and if not we label them 0. We can additionally use node proximity in the graph: for example, if two nodes are two-hop neighbors we can label them 2, three-hop neighbors 3, and so on. So there are a lot of creative ways to define these labels and use them as self-supervised learning objectives for graphs. An alternative term for this idea is self-supervised learning: we use the graph structure itself to create the supervision for the network. Self-supervised learning has many other use cases across deep learning, but on graphs it is particularly exciting because you can define very complex and very interesting objectives. So this is how we can do unsupervised learning on graphs when we don't have any labels.
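Here is a rough sketch of what such an unsupervised objective could look like in PyTorch, with an inner-product decoder and binary labels y_uv over node pairs; the function name and the way positive and negative pairs are supplied are my own assumptions for illustration.

import torch
import torch.nn.functional as F

# z: (N, d) node embeddings from the GNN encoder.
# pos_pairs / neg_pairs: (M, 2) index tensors; a "positive" pair means the two
# nodes are considered similar (e.g. they co-occur on a random walk, or are
# direct neighbors), a "negative" pair means they are not.
def unsupervised_loss(z, pos_pairs, neg_pairs):
    def score(pairs):
        return (z[pairs[:, 0]] * z[pairs[:, 1]]).sum(dim=1)  # inner-product decoder
    logits = torch.cat([score(pos_pairs), score(neg_pairs)])
    labels = torch.cat([torch.ones(len(pos_pairs)), torch.zeros(len(neg_pairs))])
    return F.binary_cross_entropy_with_logits(logits, labels)  # compare predictions with y_uv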
Here is a visualization of a concrete supervised learning example. Suppose we have a drug-drug interaction network and we want to classify whether a given node, a drug, is toxic or not, and we have a label for that node. What we can do is take the embedding of that node, z_v, and feed it into this loss function, which is really the binary cross-entropy loss that is widely used for classification. Here z_v is the encoder output from the graph neural network, and we have some classification weights that map this final-layer embedding into a scalar; then we can optimize with this binary cross-entropy loss. What is this loss function really saying? There is a negative sign, so minimizing the loss means maximizing the expression inside. Suppose y, the ground-truth class label of the node, takes the value 1: then the second term cancels out, and we are essentially maximizing the predicted value, so if the class label is 1 we want to push the prediction up. Alternatively, if the class label is 0, the first term disappears and only the second term remains, and to maximize the whole expression we want the predicted value to be as low as possible; so if the class is 0 we want to minimize the prediction. That's the interpretation of the loss function.

So we have now covered basically everything about the GNN; here is an end-to-end description of how the full pipeline works. We have a network and we want to embed node A. We unroll the neighborhood structure of node A into a computational graph, which lets us generate the node embedding, and we can then define the loss function we just went over on top of the predicted embedding. We train with that loss over a set of input nodes; for example, we might train on nodes A, B, and C in this network. After training, we can do prediction, or inference, for nodes in the graph, and it is worth noticing that we can make predictions even for nodes we never trained on: we trained on nodes A, B, and C, and nodes D, E, and F were never shown to the model during training, but we can still use the trained network to make predictions for those three unseen nodes. This property is the so-called inductive learning capability; inductive learning is really about whether we can take a trained network and generalize it to unseen nodes. The reason a GNN can perform inductive learning is that we share the parameters across the computational graphs: because the parameters are shared, we are not tied to the particular input nodes, and we can reuse the same network on any unseen structure in the network.

Here are some examples of how we can use this inductive capability. We can train the neural network on one graph and then apply it to a new graph that was never seen during training. What are some use cases? For example, we can define graphs over the protein interactions of different organisms; we can train the network on one organism for which we have precise training labels, and then, given the trained model, generate embeddings for an unseen organism, which is quite useful in the biological domain. Another example is using this property to predict on new nodes. In this case we have a given graph, a snapshot of a graph, and over time new nodes arrive in the network, and our goal is to generate embeddings for the new nodes. This use case is very popular in industry, where you can represent a lot of online services as graphs, such as Reddit or YouTube. Suppose someone uploads a new video to YouTube: that is essentially a new node arriving in the network, and we can use a pre-trained graph neural network to generate the embedding for this new node. This fully relies on the fact that the GNN is inductive.
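Putting the pieces together, here is a rough end-to-end sketch of the supervised node classification setup just described, reusing the SimpleGCNLayer from the earlier sketch; the toy graph, labels, and training indices are all made up, so treat this as an outline rather than the course's reference code.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy data standing in for a real graph (all made up for illustration).
N, d_in = 100, 8
A = (torch.rand(N, N) > 0.9).float()
A = ((A + A.t()) > 0).float()              # symmetric binary adjacency
X = torch.randn(N, d_in)                   # node features
y = torch.randint(0, 2, (N,)).float()      # binary node labels, e.g. toxic / not toxic
train_idx = torch.arange(60)               # train on the first 60 nodes, predict on the rest

class TwoLayerGCN(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.layer1 = SimpleGCNLayer(d_in, d_hidden)   # from the earlier sketch
        self.layer2 = SimpleGCNLayer(d_hidden, d_hidden)
        self.classifier = nn.Linear(d_hidden, 1)       # maps the final embedding to one score

    def forward(self, A, X):
        h = self.layer1(A, X)
        h = self.layer2(A, h)
        return self.classifier(h).squeeze(-1)          # one logit per node

model = TwoLayerGCN(d_in, d_hidden=64)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for step in range(100):
    opt.zero_grad()
    logits = model(A, X)                                            # scores for every node
    loss = F.binary_cross_entropy_with_logits(logits[train_idx], y[train_idx])
    loss.backward()
    opt.step()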
Going back to that inductive property: if instead we used what we learned in the previous lecture, say node2vec, this would be hard, because we cannot directly generate an embedding for a new node; we would have to run a few iterations of optimization before we could produce its embedding. So this is a nice property of GNNs. Question: I was wondering if there are any results on successively adding new nodes to a graph you pre-trained on; is there a point after which the accuracy starts to degrade? That's a great question, and I think it is also a frontier in research: if we continuously add new nodes to the network, at some point the model will no longer be able to predict well, and how do we find that point? It's a nice research question to ask. In practice, what people usually do is keep an ongoing validation set: over time they re-evaluate the network on a held-out validation set, and if performance on the validation set deteriorates, that basically says the model is failing to cope with the new graph structure, and that's when you want to retrain or fine-tune the network on the new graph. Question: but we only have, for example on YouTube, one graph, right? That's a great question; we will talk specifically about how to split graph data for training in the next lecture, so stay tuned, but in the meantime I can quickly say that usually we just do random splitting: given this network, we randomly pick some of the nodes as training nodes and some as validation nodes, and we make sure that during training we never see the held-out nodes. That's the basic idea, but there are a lot of different variants that we will cover in the next lecture. Question: when you add a new node to the network, do you need to know all its connections to the existing network, or can you add a new node and then predict where it should be connected? Yes, great question: can we also predict how the new node will connect to the existing nodes? A bit later in the course we actually have a lecture on deep generative models for graphs, and that is precisely the problem you describe: we predict how a node connects to each existing node, and this lets us generate graphs, because we know how each new node connects and we can iterate to generate a full network. So that is definitely doable; alternatively, you can think of it as a link prediction task, where you predict the missing links between a new node and the existing graph. That's a nice application.

Okay, let's move on to the last part of the lecture. So far we have talked about the basics of deep learning on graphs and the particular architecture of the GCN. In this final part I want to discuss the relationship between GNNs and CNNs, because a lot of you are already familiar with convolutional networks for images, and I want to compare them and argue that the GNN is actually more general than the CNN and subsumes it. How do we compare a GNN with a CNN? This is how a CNN works: we have an image, we define a convolutional kernel, and we move that kernel around the input image to generate the output feature map.
Usually we write this in convolutional form, but now that we have just learned the idea of aggregation over neighbors, we can equivalently rewrite the CNN in that language. We can define the neighbors of a pixel based on pixel proximity: for a given pixel and a 3 by 3 image patch, we can take the eight surrounding pixels as its neighbors, not including the pixel itself. Having defined this neighborhood, we can define a transformation W over each input pixel, take the summation, and apply the nonlinear transformation, and this is essentially an alternative way to express a CNN. Another way to encode image data is to simply convert the grid structure into a graph, which looks like this: we have a node in the center, a self-edge that connects the node to itself, and edges connecting it to all the neighboring pixels, so essentially the node now has nine neighbors, and we can define a GNN over a graph built this way, reusing the GCN expression we defined earlier. If we compare what we get from the GNN with the CNN form we just derived on the previous slide, we find that the two expressions are almost identical. The only difference is that in the CNN the weight carries a subscript u, while in the GNN it does not; that basically says that in the GNN we share the parameters across all the neighbors, because we want the permutation invariance property, whereas in the CNN we do not impose that constraint. The reason is that in an image we know the notion of location and we want to express it explicitly in the network definition: more precisely, in image data we are aware of the relative positions of the pixels, for example we can label each neighboring pixel of a given pixel by its relative location, (-1, -1), (-1, 0), and so on. Because we have this very explicit ordering, we don't have to worry about permutation invariance, and we can define a unique weight for each neighbor position. In a graph, because we don't have this prior information about the location of the neighbors, we have to treat them equally, so we define a permutation invariant function that shares the weights. That is actually the only difference between a GNN and a CNN on this kind of grid-like data. Here is a summary of further differences between CNNs and GNNs: the size of the filter, the neighborhood, is predefined for a CNN but not for a GNN; a GNN can process arbitrary graphs with a different degree for each node; and a CNN is not permutation invariant or equivariant, as we just discussed, because it relies on knowing the exact positions of the neighboring pixels, whereas a GNN does not have such information and therefore uses a permutation invariant aggregation over the neighbors.
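To make that one difference concrete, here is a small sketch, my own illustration and not from the slides, contrasting a 3 by 3 convolution, which has a distinct weight for each relative position, with a GCN-style aggregation that shares one weight across all nine neighbors.

import torch

c_in, c_out = 3, 8
patch = torch.randn(3, 3, c_in)            # a 3x3 image patch, c_in channels per pixel

# CNN: one distinct weight matrix W_u per relative position (-1,-1), ..., (1,1);
# this is only possible because the grid gives a fixed ordering of the neighbors.
W_cnn = torch.randn(3, 3, c_out, c_in)
cnn_out = sum(W_cnn[i, j] @ patch[i, j] for i in range(3) for j in range(3))

# GCN-style: a single shared weight over all nine "neighbors" (the center pixel
# included via its self-edge), so the result ignores how the neighbors are ordered.
W_gcn = torch.randn(c_out, c_in)
gcn_out = W_gcn @ patch.reshape(9, c_in).mean(dim=0)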
That's everything we have in this lecture. To summarize, we talked about the basics of neural networks, we motivated the idea of deep learning for graphs and the ideal properties we want a graph learning method to have, we then focused on one architecture, the graph convolutional network, defining it as an aggregation over neighbors and discussing how to train it, and finally we compared GNNs with CNNs. That's it for today, and I hope you enjoyed it.