Thank you so much for inviting me to give this talk, and thank you so much for coming in such numbers today. I have a lot of things to talk about, but what's interesting to my mind is that these things are linked together in a really interesting way, or at least that's what I'll try to convince you of. So the title talks about System 2 cognition, which is about high-level cognition, and that's connected to consciousness, as I'll explain. It's also connected to issues that are very important today in machine learning, about out-of-distribution generalization, and it's also connected to the notion of agency and reinforcement learning, and to how agents face problems that standard static machine learning hasn't been looking at enough. So, deep learning has made amazing progress in the last few years and decades, and some people think it might be enough to take what we have and just grow the size of the datasets, the model sizes, the computer speeds: just get a bigger brain. But I'm not quite of this opinion. I think we're missing out in qualitative ways that matter if we want to approach human-level AI. We currently have machines that learn in a very narrow way. They need much more data to learn a new task than our human examples of intelligence do; they need humans to provide high-level concepts through labels; and they still make really stupid mistakes. They're not very robust to changes in distribution; there are adversarial examples and things like this. And so some people, contrary to the first ones I mentioned, think we have to start afresh and invent something completely new to face these challenges, and maybe go back to classical AI to deal with things like high-level cognition. What I'm trying to tell you today is that I think there is a path to go from where we are now, extending the abilities of deep learning, to approach these kinds of high-level cognition questions of System 2. So I talked about these two things, System 1 and System 2; let me try to be a
bit more precise. I really got introduced to these concepts by reading the amazing book by Daniel Kahneman, Thinking, Fast and Slow, which I suggest you look at. It introduces two kinds of cognitive tasks. System 1 covers the kinds of things we do intuitively, unconsciously, that we can't explain verbally and that, in the case of behavior, are habitual. This is what current deep learning is good at. So if you're driving a car, going back home to a place you know, you don't need to pay a lot of attention to the road; you can talk to the person next to you. On the other hand, if you're in a new city and you don't know the way, maybe somebody gave you directions, but you have to really pay attention at every corner, you have to read signs, you have to be on the lookout, and if somebody tries talking to you, you'll ask them: please, let me drive. So what's going on there is that you're generalizing in a more powerful way, and you're doing it in a conscious way that potentially you could explain. If I asked you to add 34 and 56, you can do it in your head, but it's going to be sequential and slow compared to a calculator. The kinds of things we do with System 2 include programming, where we come up with algorithms and recipes; we can plan, we can reason, we can use logic. Usually these things are very slow compared to what computers can do on some of these problems, and these are the things I'm going to argue we want future deep learning to do as well. So here are some things I think we're missing in order to approach human-level AI with deep learning. We need to do better on out-of-distribution generalization and transfer: when you learn a new task, you want to be able to do it with very little data. We want to be able to handle these high-level cognitive tasks that I mentioned, including using these tasks to force the learner to discover the kind of high-level representations which deep learning was supposed to be about in the first
place. You know, deep learning was about learning multiple levels of representation, with the highest level supposedly corresponding to the kind of concepts we manipulate with language. Many of these concepts have to do with causality: the kinds of concepts we manipulate with words tend to be causes or effects. And once you start talking about causality, you also have to talk about the agent perspective, something the mainstream of machine learning hasn't paid enough attention to; not just because we want to solve reinforcement learning problems, but because we want machines that understand the world, that build good world models, that understand cause and effect, and that can act in the world in order to acquire knowledge. So a big theme of my presentation today is that these different questions are not independent research directions; they are connected. When you make progress on one, you can make progress on the others, and understanding how they're linked can help us plan a path for our research. So let me talk a little bit about consciousness and the kinds of functionalities I think we need to add to machine learning. Here is the road map for this talk. I'm going to start by talking about the need for machine learning systems that can handle changes in distribution, and why that's particularly important for agents, because of the non-stationarities in their environment. Then I'm going to talk about the building blocks that I think can help us get there: in the last few years, maybe unnoticed, we have added things to the deep learning toolbox, in particular attention mechanisms, which I think are actually the key to moving to the next stage I'm talking about. In this part of the talk I'll also tell you a little bit about how much work has been done by our colleagues in cognitive neuroscience to better understand the human side of the equation. Then I'll talk about several priors that may be linked to consciousness. So the main theme here is that
there's an advantage for human beings to have these high-level cognitive abilities, and you can think of these advantages as resting on assumptions about the world; usually we talk about priors. The first one I'll talk about is the idea that the joint distribution between the high-level concepts can be represented by a sparse factor graph. Any joint distribution can be represented by a factor graph, but a sparse one tells a different story. Then I'll talk about another prior that has to do with how the world changes, and this is going to connect with the notion of agency: the fact that most of the time, when things change in the world, it's because agents like people do something. We intervene, we do things. The hypothesis I'm proposing, with other people, is to consider those changes as localized in some abstract space, and if we use these hypotheses, as I'm going to show through some of the recent work we've done, you can actually discover relationships between variables that are causal relationships. We'll also talk about meta-learning, which is connected to the problem of learning out of distribution. The final bit of my presentation is going to be more about the detailed architectural structures we should explore to introduce the kind of compositionality that System 2 processing requires; in particular, I think we should be moving towards neural nets that can not just operate on vectors but also operate on sets of elements, sets of vectors, sets of objects that are pointable, that can be referred to and operated on by dynamically recombined modules. So that's the outline of my talk; right now I'm going to go into a little more detail on each of these points. Let's start with changes in distribution. The classical framework for machine learning is based on the hypothesis of iid data, independently and identically distributed, which means the test data has the same
distribution as the training data. That's very important, because if we didn't have that assumption, we would not be able to say anything about generalization: why would a function learned on some data work on some other data? Unfortunately, this assumption is too strong, and reality is not like this; most of the data we get isn't iid. So in practice, what many people do, in industry or in academia, when they collect data, is to shuffle it in order to make it iid. Here I want to quote my friend Léon Bottou, who in his ICML keynote this year said something like: nature does not shuffle the data, and we shouldn't. The reason we shouldn't is that when we do that, we destroy important information about the changes in distribution that are inherent in the data we collect. Instead of destroying that information, we should use it in order to learn how the world changes. This is important because there are, for example, rare events, like black-swan events, that are highly unlikely to happen but could have severe consequences, and this question of out-of-distribution generalization really breaks the iid hypothesis. Now, the iid hypothesis was good, but if we discard it we need to replace it with something else, and that's why I've been talking about priors. One of the important ones is going to be: how is the test distribution related to the training distribution, if it's not the same? Okay, so let's talk about out-of-distribution generalization. What I mean by out-of-distribution generalization is essentially the phenomenon of a learner being able to generalize, in some way, to a different distribution; it doesn't say anything about how we do it. But let's see why we need it. Well, if you are a learning agent (agent means actions, right, so it's a learning system embedded in some environment), you are almost always facing non-stationarities, for several reasons. There are changes in distribution due to the actions of
the agent itself (I move to a different place), the actions of other agents, the fact that as I move I'm looking at different sensory signals at different times, different sensors, different goals, different policies; all kinds of things are changing. And as you heard from Blaise's talk this morning, once we start looking at multi-agent systems it gets even more complicated; you can't even talk about optimization, as he was explaining. But certainly, from the point of view of each agent, you don't have a stationary distribution anymore. And if you think about a child learning, for example: their world is changing all the time, their body is changing all the time. So we need systems that are going to be able to handle those changes and do things like continual learning, lifelong learning, and so on. This has been a long-standing goal for machine learning, but I think we haven't yet built the solutions to it, and one of the crucial elements, according to me, in order to be successful at this (and I'll come back to this at the end of the presentation) is introducing more forms of compositionality. So what does that mean? It means being able to learn, from some finite set of combinations, about a much larger set of combinations. We already get that from distributed representations. Distributed representations are really at the heart of why neural nets work; they were introduced by Geoff Hinton in the early 80s, and in the last few years we actually have theory that helps us see why you get an exponential advantage, potentially at least. If we're making the right assumptions about the world, in terms of compositionality, in terms of the data being explained by a number of different variables and factors, then distributed representations can be exponentially advantageous, because essentially, once you've trained a bunch of features, you can generalize to new combinations of those features. This is what a
single hidden layer already gives you. Now, if you have a deeper network, you also get compositionality because each layer gets composed with the next one, and we've shown that this gives another exponential advantage. And I think there are other forms of compositionality. The one we know best is the one you find in language; linguists have been talking about this for a long time, and they call it systematicity, or systematic generalization, and I'll talk a lot more about that. This opens the door to better powers of analogy and abstract reasoning, and that remains to be done in machine learning, I think. So systematic generalization really is aiming at out-of-distribution generalization and fast transfer, but it's about how we get there: the idea that we can get there by dynamically recombining existing concepts. It could be in language, but it could be in other settings; like in the picture here from Lake et al. (2015), we invent a new type of vehicle by combining properties of different vehicles. What's interesting about this is that dynamic recombination potentially allows us to generalize to combinations that have zero probability under the training distribution. It's not just that they're not present in the training data; they would have zero probability under the training distribution. As an example, if I tell you a science-fiction scenario, clearly you haven't lived anything like it in your life, but you can still imagine it. If you've been living all your life in some continent or some city and then you move to a completely different place and have to drive in this unknown city, as in the example I gave earlier, you are also doing a form of systematic generalization. Unfortunately, current neural net architectures are not that good at this, and it's been shown through several papers and experiments, starting in particular with the work of Lake and Baroni, and more recently the work we did at Mila led by
Bahdanau and collaborators, presented at ICLR. Currently we have a paper that's going to be on arXiv, probably tomorrow or the day after, built on CLEVR, a dataset for visual question answering, where there are combinations of the linguistic concepts present in the questions that just don't come up in the way the training data is generated, and current methods, when you ask them to answer these kinds of questions, fail completely, whereas a human wouldn't; we don't even realize that these combinations are not present in the data. Okay. One question people may come up with is: how is what you're proposing different from the classical AI program of symbolic logic and so on? That's a good question. Well, I think there are a number of reasons those classical AI programs ran into trouble, and in the work we need to do in order to achieve System 2 performance we want to avoid some of these pitfalls. We want to make sure those systems will be able to generalize efficiently at large scale. The concepts we want to learn need to be grounded, with System 1, in low-level perception and action. We want to keep the generalization power of distributed representations. We want to make sure that the kind of search involved in things like reasoning and planning can be done efficiently, whereas the classical approaches require exploring a huge number of trajectories of how things could unfold, or of how you could combine concepts, rules, and so on. And finally, we need to make sure we build systems that can properly handle uncertainty in the world; machine learning has been doing a pretty good job at that up to now. But we want to achieve those extra goals we're not doing very well at, like the systematicity I explained, and factorizing the knowledge the learner is acquiring into small exchangeable pieces, to get the exponential advantage I've been talking about, and
that includes being able to manipulate variables, the kind of thing you do in programming and in logic formulations; dealing with instances that are associated with more general categories, if you like; and having references and indirection, things that don't seem natural in the neural-net world. But, as I'll try to convince you, we have now actually built the tools for doing that in deep learning, using attention mechanisms. Okay, so let's talk about attention and consciousness. What is attention? Attention is about doing computation in a focused way: we sequentially focus computation on one or a few elements at a time. We realized around 2014 that this was extremely powerful, and it was the reason we were able to get a breakthrough in machine translation. When you produce the next word in English while trying to translate from, say, French, you want to focus the computation on just the right few words in the French sentence that are relevant for the translation. So we introduced a particular form of attention called content-based soft attention, which is very convenient because you can backprop through it, and so you can learn it; in other words, we can learn where to attend. The way it works is that the computation being done at the next level uses as input a selection from the previous level of computation, and that selection is a soft selection: we take a convex weighted combination of value vectors from the previous level, and those convex weights come from a softmax conditioned on each of the elements. For each of the elements, we see how well they match the context, to decide on which ones the attention should be focused. So in a way, attention is parallel: we consider all the possible elements in some set and compute a score for each of them, in order to decide on which one or which ones we're going to put attention. And there's been recent work in cognitive
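A minimal sketch of the content-based soft attention just described, in plain Python (the shapes and the toy key/value numbers are illustrative, not any particular system's implementation):

```python
import math

def soft_attention(query, keys, values):
    """Content-based soft attention: score each element's key against the
    query, softmax the scores into convex weights, and return the weighted
    combination of the value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # convex: non-negative, sum to 1
    dim = len(values[0])
    combined = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return combined, weights

# three elements, each a (key, value) pair; the query matches the second key best
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0], [20.0], [30.0]]
out, weights = soft_attention([0.0, 5.0], keys, values)
```

Because the weights are a differentiable function of the match scores, gradients flow through the selection, which is exactly why this form of attention can be learned by backprop.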
neuroscience showing that attention should be thought of as an internal action: the way your brain attends is very similar to the way your motor system decides to move your arm, and so we want to learn these attention policies. Attention has been very, very useful. I mentioned machine translation, but essentially all of today's state-of-the-art NLP systems rely on attention; look at all the work based on Transformers and their variants. Attention mechanisms are also at the heart of memory-augmented neural nets; we had a paper last year, and more ongoing work, on how attention connected to memory can also unlock the problem of vanishing gradients. And, as I'll mention, attention also allows us to change neural nets from machines that process vectors into machines that process sets, in particular sets of key-value pairs. Okay, so let's see this picture again. We can think of attention as creating a dynamic connection between two layers: whereas in the traditional setting the connections are fixed, here we pick which of the inputs is going to be sent to whatever module we're considering that uses an attention mechanism. Now, this is great, but from the point of view of the receiving module there is a problem: it gets a value, which is one of those in the set of input elements, but it doesn't know where it's coming from. It's the value of what? So what we do with attention mechanisms is that, in addition to the value, we have some notion of a key; in other words, a kind of identifier for where the value is coming from. Currently we use those keys to decide which element should get the attention, but the key is also sent to the next level, and so downstream computation can know what the value it's getting actually is: where it's coming from, what kind of object it is, what type it is. You can think of this as creating a name for these objects, and creating a form of indirection. And as I
said, we now have these systems operating on sets. Why is it a set? Because the attention mechanism doesn't care about the order in which we put these elements in the first layer; it just picks one according to how well it matches some kind of learned policy, and the information about the relative position of these elements, say in your brain or in a neural architecture, doesn't matter anymore. Okay, now let's connect to what our friends in neuroscience and cognitive science are doing. In our community the C-word, consciousness, is still kind of taboo, but in their community it's not anymore, and that's great. They've made a lot of progress in understanding several aspects of consciousness, and there are a number of theories, but many of them are related to what's called the global workspace theory, which originated with Baars, and there have been very important improvements to it from Dehaene and collaborators. What this theory says is that what's going on with consciousness is a bottleneck of information: some elements of what is computed in your brain get selected and then broadcast to the rest of the brain, influencing the rest of the brain. This is related to short-term memory, where the things that have been selected are available and heavily condition perception and action, and it also gives rise naturally to the kind of System 2 abilities I've been talking about. All right, so what's the connection with machine learning? Machine learning could be used to help the brain sciences better understand consciousness, but what we understand about consciousness could also help machine learning develop better abilities. First of all, the work that's been done in neuroscience is in general based on fairly qualitative descriptions of the functionalities we think are associated with consciousness, and what machine learning can do is help
us formalize, in a way that's more mechanistic, what these functionalities exactly mean, and that could feed back into the research in neuroscience by providing more specific tests of these theories. Of course, for me at least, one of the motivations is also to get rid of the sort of fuzziness and magic that surrounds the word consciousness, and I think machine learning is in a good position to provide a justification for these particular forms of computation, in the sense of: why have they evolved, what kinds of computational and statistical advantages come with them? And of course, once we understand these things, we also want them in our learning agents. Consciousness is closely related to language, because the way we know that somebody is conscious is by asking them to report what they're thinking about; this is the direct way we know about consciousness. And that means there's a very strong link between the thoughts you're conscious of and language: one can be translated to the other fairly easily, with little loss of information when you go from your thoughts to language. It also means there is a connection between System 1 and System 2 here, because those high-level concepts we communicate with language are anchored in the System 1 side, the sort of intuitive system that connects your brain to the rest of the world through perception and action. And I think that's one of the big important directions for natural language research: we really want to explore things like grounded language learning, where we don't just learn from text; we learn from environments which involve perception and action, the perception-action loop through the environment, and which allow a learner to get not just patterns over sequences of words, but also what they refer to, giving it an understanding, potentially
implicit, of how the world works. I refer you here to some work we've done recently, published at the last ICLR, on grounded language learning, something we called BabyAI. Okay, so now we're at the part of my presentation where I'm going to talk about the kinds of priors, structures, assumptions, and regularizers we could use in order to encourage our learning systems to do a good job at out-of-distribution generalization and at the kind of conscious-processing abilities I've been talking about. The first one I want to talk about is the sparse-factor-graph assumption I mentioned; I called it, in a 2017 paper, the consciousness prior. In that paper I talk about the way cognitive neuroscience has been understanding conscious processing: the fact that we form these low-dimensional conscious thoughts, obtained by selecting elements from a larger unconscious state. So now, instead of having a single top-level representation, as we usually have in our systems, we have two. There's the very high-dimensional unconscious state, which contains all of the things you could think about, and then there's a tiny one which contains only those that went through the bottleneck recently; that's the conscious state. Attention, of course, is used to select from the first what goes into the second, and attention is also used in a top-down way to condition further processing in the System 1 computation. Now, that way of thinking about conscious processing is about the sort of computation we're doing, but in machine learning we're also used to thinking about what these computations mean, and often they mean some kind of inference with respect to some model of the world, which might be implicit. A good way to think about a model of the world includes what kind of joint distribution between the high-level concepts we are talking about, and so the consciousness prior proposes that it's a
sparse factor graph. A factor graph is just a particular representation of joint distributions; you can represent any joint distribution this way. It has these squares that correspond to factors, which are kinds of relations between variables; the other nodes are the variables; and sparsity here just has to do with how many neighbors each node has. So why do I think this is a good hypothesis? Well, think about the kind of statements we can make with natural language, like when I say: if I drop the ball, it will fall on the ground. This is a very powerful statement, because it's true with high probability, and yet it involves very few variables, just the words that are there. The fact that it uses very few words means that the relationship being described can tightly capture some element of the joint probability through very few variables at a time. So you can think of each of these black squares, each factor, as corresponding to a relationship between a small set of variables, of the kind you could state in natural language. Now, this is different from the usual assumption you find in many papers these days that study high-level representations, where the variables are supposed to be independent of each other; this is the so-called disentangling work, which I think rests on a misunderstanding of the goal of deep learning, which was to learn these high-level representations. Instead of thinking of these variables as completely independent at the top level, we should think of them as having a very structured joint distribution. And they have to be like this, because high-level concepts, like, say, ball and ground, which I just used, are not independent; they come in these very powerful, strong relationships, as I showed. So what I'm saying is: instead of imposing the very strong prior of complete marginal independence at the high level, we can have this slightly weaker, but still very strong, prior
that the joint is represented by a sparse factor graph. The reason I say this is still a very strong prior is that it would not work, for example, in the space of pixels: you can't easily find a small group of five pixels in images such that you can predict one of the pixels with high accuracy given the four others. But we can do that with natural language, when we express those statements at this high level. Factor graphs are expressed by writing the joint as a product of factors, like the φk here, where each factor involves only a small subset of the variables. What it means, if we were to impose something like this, is that it puts pressure on the encoder, say, that maps the low-level data to the high-level representations, to have the property that the factor graph is sparse. Okay, so next I want to talk about the meta-learning aspect, and another hypothesis that's important for dealing with how the world changes. Meta-learning is something really hot and cool these days, but it actually started several decades ago; my brother and I were working on this in the early 90s, and it was actually Samy's PhD subject. What it's really about is having multiple time scales of learning, or multiple time scales of iterative, optimization-like computation. Typically you have an inner loop, like normal learning, and an outer loop, like evolution, which optimizes whatever the inner loop is producing. In this way we can talk about evolution as the outer loop with individual learning in the inner loop, or, within the life of an individual, lifetime learning as the outer loop and adaptation to a new environment as the fastest time scale. The thing that's cool about meta-learning is that it allows us to explicitly optimize for generalization, and in particular it can be used to explicitly optimize for out-of-distribution generalization: if the agent sees multiple environments, we can train its slow
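To write out the factor-graph decomposition mentioned a moment ago (the notation is mine, not the slide's: Z is a normalizing constant and S_k is the small subset of variables touched by factor k):

```latex
p(x_1, \dots, x_n) \;=\; \frac{1}{Z} \prod_k \phi_k(S_k),
\qquad S_k \subseteq \{x_1, \dots, x_n\}, \quad |S_k| \ll n
```

Sparsity here means both that each S_k is small (each factor relates only a few variables, like the few words in "if I drop the ball, it will fall") and that each variable participates in only a few factors.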
timescale meta-parameters so that it will generalize well to new environments. Now there's the question I mentioned earlier: what kind of hypothesis can we make about these changes in distribution? Because the underlying physics means that an action by an agent is typically localized in space and time, we can assume that the changes in distribution correspond to, are caused by, or are the consequence of an intervention by an agent that acts on just a few causes, a few mechanisms that relate variables to each other. This extends the hypothesis of independent mechanisms proposed by Schölkopf and collaborators, and what they mean is informationally independent: the relationships between variables, the mechanisms, the conditional distributions, are independent in the sense that what you learn about one doesn't tell you anything about another. So if something changes in the world, one of these mechanisms changes, one of the conditional distributions, corresponding to, say, one of the nodes in a graphical model; and if something like this changes due to an intervention, you only need to adapt the part of the model that corresponds to the change. For example, if I put on sunglasses, the data I'm getting at my retina is very, very different, but it can be explained by a tiny change: one variable that flips its value from zero to one. That's really interesting, because if we have such a hypothesis, and we have a good representation of the interactions between all those high-level variables, then very few bits are needed to account for those changes, and thus very few observations are needed to adapt, or to infer what has happened, and from this we can get good out-of-distribution generalization. So the idea is: since out-of-distribution generalization can be obtained if we have the right decomposition of knowledge, we can use out-of-distribution performance as a training signal for factorizing
knowledge. Okay, so we did some work in that direction. There's a paper called "A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms" where we apply this idea in a very simple setting: you have just two variables, A and B (or, say, four variables A, B, X, Y), and A is the cause of B, but the learner doesn't know that, and you might observe just X and Y, which are simple transformations, like rotations, of A and B. It turns out that if you have the right decomposition, if you factor the joint distribution between A and B the right way, the one where we have P(A), for the cause, multiplied by P(B|A), for the effect given the cause, then when there's a change in distribution due to an intervention on the cause, the learner that has the right model adapts much faster; it doesn't need as much data. The x-axis on this plot is the amount of data the learner needs in order to recover from a change in P(A). It turns out you can also use this to learn how to map the X and Y, which are things like pixels that don't have a causal structure, onto the A's and B's, the high-level variables that do have causal structure. The same remark applies here as I made earlier about the consciousness prior, which does not apply at the pixel level: the assumption that these high-level variables are causal does not work on things like pixels, and you can't really find a pixel that's the cause of another pixel, so that's not the right space for doing these things. We had a more recent paper called "Learning Neural Causal Models from Unknown Interventions" where we extend these ideas to larger graphs, in such a way that we can avoid the exponential explosion in the number of graphs that need to be considered, by parametrizing the distribution over graphs in a factorized way. One of the things we find is that, in order to really facilitate the learning of the causal structure, the learner should try to infer what the intervention was: on which variable was the change
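A toy illustration of why the causal factorization adapts faster (a hypothetical discrete example of my own, not the paper's actual experiment): after an intervention that shifts P(A), the mechanism P(B|A) in the causal factorization is unchanged, so only the small P(A) factor needs re-fitting, whereas in the anti-causal factorization P(A|B) shifts and must be re-learned.

```python
import random

def fit(pairs):
    """Estimate both factorizations of a binary joint from (a, b) samples."""
    n = len(pairs)
    p_a = sum(a for a, _ in pairs) / n
    p_b = sum(b for _, b in pairs) / n
    p_b_given_a1 = sum(b for a, b in pairs if a) / max(1, sum(a for a, _ in pairs))
    p_a_given_b1 = sum(a for a, b in pairs if b) / max(1, sum(b for _, b in pairs))
    return p_a, p_b, p_b_given_a1, p_a_given_b1

def sample(p_a, n):
    """Ground truth A -> B: B copies A with probability 0.9."""
    out = []
    for _ in range(n):
        a = random.random() < p_a
        b = a if random.random() < 0.9 else not a
        out.append((int(a), int(b)))
    return out

random.seed(0)
before = fit(sample(p_a=0.2, n=50000))
after = fit(sample(p_a=0.8, n=50000))   # intervention: P(A) is shifted

# causal direction: P(B|A=1) is the stable mechanism, it barely moves
print(abs(before[2] - after[2]))
# anti-causal direction: P(A|B=1) shifts a lot, so it has to be re-learned
print(abs(before[3] - after[3]))
```

With few adaptation samples, re-estimating one stable-plus-one-changed factor is cheaper than re-estimating two changed factors, which is the intuition behind using adaptation speed as the training signal.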
And that's something we do all the time: most of the time, at least, my brain is trying to figure out what caused what I've observed, or what explains the change that I'm seeing. We tested these ideas on various small graphs and found that this actually works quite well, better than the commonly used causal induction methods. But more importantly, the way that we're attacking this problem is very deep-learning-friendly, in the sense that we just define an overall objective, some regularization terms, the usual things, and we do gradient descent on it.

Okay, the last bit of my presentation is about operating on sets of objects that may be pointable, using dynamically recombined modules, which I promised at the beginning. We have another recent paper called RIMs, for Recurrent Independent Mechanisms, and it's about modularizing the computation and operating on these sets of named and typed objects, but in a deep learning way, as you'll see. We apply this idea to recurrent nets: the state of the recurrent net is broken into pieces. Let me see if this works. All right, so each of these boxes represents a small recurrent net that is updated based on its previous state, but it can also be updated based on the states of the other sub-networks. However, we are constraining the way that these sub-networks talk to each other, so that it's sparse, and so that it's done in a dynamic way. What that means is that, using attention mechanisms, the connectivity pattern between those modules is changed and computed on the fly. We also use the attention mechanisms to select a subset of the modules that are going to be activated, so this is the idea that there's a bottleneck that dominates the computation. The other important element is that the things communicated between those sub-networks are not the usual standard vectors that we use in recurrent nets; they are sets of key-value pairs. What that means is that you can think of what these networks are exchanging as variables along with something like their type. What we found is that this leads to better out-of-distribution generalization than standard methods that don't use these kinds of structures. We've also tested this in reinforcement learning setups, where we just replaced the LSTMs that were used in a PPO baseline for Atari games, and found that it helped on the majority of the Atari games.

Okay, we're close to the end of my presentation. Let me recap some of the hypotheses for conscious processing by agents, and for systematic generalization, that I'm proposing we pay more attention to. I mentioned the consciousness prior, this idea that there would be a sparse factor graph relating the joint distribution between the high-level semantic variables. I suggested that these high-level variables, for the most part, can be considered as causal variables, that they can be cause or effect of each other, and what that really means is that they are about agents, about intentions, and about objects that are controllable, and their attributes. One thing I didn't talk about is that those relationships between variables don't have their own parameters for each factor of the graph, for each potential function; instead, they should be shared modules that can be reused across different tuples. The fact that we have these key-value pairs means that a particular sub-network is going to receive input that's different depending on the context, so it's like a little rule, except of course it's a neural net, that is going to be applied to different instances. So the graphical model that I'm talking about is more like a dynamic Bayes net, where the same parameters can be reused in many, many places. Of course, this is also connected to the Markov logic networks that Domingos proposed a long time ago.
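A minimal sketch of the RIMs-style bottleneck described above. This is an illustration under my own simplifications, not the actual implementation; in the real model the active modules also read from each other through a second key-value attention step, which I omit here. Each module scores the input with its own query, only the top-k fire, and the rest carry their state forward unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n_mods, d, k = 6, 8, 2  # number of modules, state size, active modules per step

states = rng.standard_normal((n_mods, d)) * 0.1
queries = rng.standard_normal((n_mods, d))         # per-module attention query
W_rec = rng.standard_normal((n_mods, d, d)) * 0.1  # per-module recurrent weights
W_in = rng.standard_normal((n_mods, d, d)) * 0.1   # per-module input weights

def rims_step(x, states):
    # Input attention: each module scores how relevant the current input is to it.
    scores = queries @ x
    active = np.argsort(scores)[-k:]   # sparse bottleneck: only the top-k run
    new_states = states.copy()         # inactive modules keep their state as-is
    for m in active:
        new_states[m] = np.tanh(W_rec[m] @ states[m] + W_in[m] @ x)
    return new_states, sorted(active.tolist())

x = rng.standard_normal(d)
new_states, active = rims_step(x, states)
print(active)  # indices of the k modules that fired on this step
```

The per-module weight matrices play the role of the shared "rules" in the recap: the same small network is applied to whatever context-dependent input the attention routes to it.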
Another really important hypothesis that I spent time on is the idea that the changes in distribution are mostly localized if you represent information in the right way, in this semantic space; so that's another property of that semantic space. One thing I didn't spend much time on, but that, for example, Arjovsky and Bottou have talked about in their recent paper, is that the things that are preserved across those changes in distribution have to do with the stable properties of the world; those could, for example, be grounded by an encoder, a system 1 aspect, that captures the stable and robust aspects of the world. So, to conclude: I think that after decades of work on consciousness by cognitive science, it's time that machine learning looked at these questions and explored the functionalities, and the justifications for having these kinds of capabilities. I think that this would bring new priors that could help systematic generalization and out-of-distribution generalization. It could benefit neuroscience, of course, because we would be able to provide a more detailed account of these functionalities, which can then be tested in real people or animals. And it would allow us to expand deep learning from system 1 to system 2. Thank you. [Applause]

Thank you very much, Yoshua, for this great talk. So please, if you have questions, come to the left side or the right side, close to the mic.

Hey, so you're talking about consciousness, and this is something I think is really interesting. I think there's actually pretty broad consensus among moral philosophers that having conscious experience is an important part of what makes one a moral patient, in other words, deserving of moral consideration. Philosophers also like to talk about the easy problem of consciousness versus the hard problem of consciousness, and I'm wondering what your thoughts are on the moral implications of building machines that may be conscious, whether you think the way you're talking about consciousness has any relevance to that kind of question, and also whether you see a way for us to determine whether or not the AI systems we're building are having subjective experiences that make them relevant moral patients.

Today I've only talked about the easy problem of consciousness. There is the question of subjective experience, which I haven't talked about and which deserves much more work and attention, but on the neuroscience side people have been thinking about this for much longer, and there are some interesting theories, which I think connect to issues of self-knowledge and predicting our own actions, and which might explain the impression of subjective experience.

Over here. The other major theory of consciousness is of course integrated information theory, which measures consciousness by this phi quantity, essentially a measure of the mutual information between the parts of the system: the higher the mutual information, the more consciousness you have. That seems like the polar opposite of your sparse factor graph hypothesis. How do you reconcile the two?

I don't. I think the IIT theory is more on the mystical side of things; it attributes consciousness to any atom in the universe, and I'm more interested in the kind of consciousness that we can actually see in brains.

But they measure consciousness by this phi measure in brains.

Yes, there is a quantity that is being measured, but I don't think it's related to the kind of computational abilities that I have been talking about.

Hi, here. Thank you for the very inspirational talk. My question is regarding the topics that you touched upon, the prior part and also the factorization part. In recent cognition work it has been shown that the human mind uses the spatial world where we navigate as a kind of prior to sort our thoughts, and this has recently been summarized in a book by Barbara Tversky, I don't know if you're familiar with that. But my question is: do you see a role for this kind of spatial prior, as a way to organize certain concepts, and how could we go about that?

Yeah, possibly. I think one of the big lessons of the last few decades of machine learning is that we need all kinds of priors in order to encourage good solutions to the problems we're looking at, and clearly the brain is exploiting the fact that a very important aspect of the world is its spatial structure. Of course some of our models do too, but I think on the consciousness side and the memory side these are also important, and I haven't looked at that. Thanks.

Thank you. Over here. I'm inspired, and also heartened, to see that you're exploring how to integrate some aspects of, as you put it, the symbolic program into deep learning. I guess my question is how far you anticipate that going. There's been a lot of work, for example, on extending logic programming, which is far more powerful than just knowledge graphs or hypergraphs, to have numerical uncertainty, including computationally scalably, and that might involve interleaving these kinds of fine-grained symbolic reasoning techniques with neural-network-style learning and inference as it's usually conceived of in machine learning terms. How much of that, and how soon, do you think there ought to be?

Well, indeed, a lot of people have been exploring how we could paste symbolic logic tools on top of the neural net computation, and I don't think that will work.

What I'm talking about is interleaving it tightly, as opposed to just bolting it on.

So what I'm talking about is different too. Of course, I don't know what will work, I don't have a crystal ball, but what I'm talking about is more like implementing some of the functionalities of logic and reasoning with the neural net machinery, in the hope of keeping both kinds of properties in one place. I didn't have time to talk about where this matters, which is in search.
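As an aside, the idea of a fast learned heuristic guiding an explicit search can be sketched with a toy best-first search. Everything here is illustrative: the grid stands in for a combinatorial space of trajectories, and the hand-coded distance heuristic stands in for a trained "system 1" network:

```python
import heapq

GOAL = (7, 7)  # thousands of distinct monotone paths lead here from (0, 0)

def heuristic(state):
    # Stand-in for a trained network: fast, approximate guidance on what
    # to attend to next, without any explicit justification.
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])

def neighbors(state):
    x, y = state
    return [(x + 1, y), (x, y + 1)]

def guided_search(start):
    """Best-first search: system 1 scores candidates, system 2 expands them."""
    frontier = [(heuristic(start), start)]
    visited, expanded = {start}, 0
    while frontier:
        _, state = heapq.heappop(frontier)
        expanded += 1
        if state == GOAL:
            return expanded
        for nxt in neighbors(state):
            if nxt not in visited and nxt[0] <= GOAL[0] and nxt[1] <= GOAL[1]:
                visited.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt), nxt))
    return expanded

print(guided_search((0, 0)))  # expands only a handful of states
```

Without the heuristic ordering, the same loop would expand states breadth-first and visit most of the space before reaching the goal; the cheap scorer is what keeps the deliberate search tractable.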
In classical AI, a big obstacle is that we have to search through a potentially exponentially large number of trajectories or combinations of things, and that's where system 1, which is the neural net, comes in: we learn what to attend to, what to think about. It happens outside of our consciousness, we don't know why we think about something, but that's crucial to enable efficient reasoning and search and planning and so on.

Well, thanks, and we can talk more about it later.

Thanks for a great talk. I'm curious to hear your thoughts on what the data distribution is, how we know if we're in it or out of it, and what the relationship is between the data distribution and the empirical distribution: given a training set, how can we exactly characterize the underlying distribution of that data?

Well, obviously we don't have access to the data distribution, except when we're doing machine learning experiments and we sort of cheat and synthesize the data from some handmade distribution. So we can do it mathematically, we can do it in simulations where we know what it is, and that's very useful. But I think we also have to think about how that distribution came about, as a non-stationary process where things change in the world, and sometimes what looks like one distribution can be broken down into sub-distributions that are related; think about data coming from different places or different times, for example. We don't pay enough attention to that right now.

Hi, a quick question. You made a call to move beyond just perception, to higher-level reasoning and consciousness, and the question is: how do we actually measure progress toward that goal?

So, I talked about out-of-distribution generalization, right, and things like transfer learning, few-shot learning, continual learning and so on have benchmarks, which I think is a good starting point. But in general, especially if we introduce the agent aspect, I think we have to start thinking about building environments in which we're going to test how well our learner can cope with those changes in distribution, when those changes are not just a simple set of distributions like we have right now in the standard meta-learning benchmarks, for example.

Yeah, I'm here. So you have these high-level semantic variables: at one point you said they're in a factor graph, and then you also said they're causal. Usually we put causal variables into DAGs, so can you say something about this? Do you just put arrows on the factor graph, and that works?

Okay, well, you have the joint distribution structure, and causality is just another thing on top, which provides extra information, and these are the arrows. There could be a longer answer, but that's the short answer.

To the right.

Hi, great talk, thank you. My question is: what relationship do you see between sparse factor graphs and the relational and associative memory supported by the hippocampus in humans, given that humans with bilateral hippocampal removal appear to be able to be conscious, but not to flexibly use relational associations?

That's interesting, I didn't know about that, so I don't know. I guess maybe you need that attention going back and forth between the two sides of the brain in order to coordinate the different elements of, say, our relations. One thing that may have been misleading in my picture is that I've drawn the conscious state as if it were something physically separate from the unconscious state, but in the brain that's not how it is. It's more like, within the brain, some subset of neurons becomes excited a lot more and others become inhibited, and of course that goes through global communication, so if you break the global communication, I could imagine it would hurt the whole thing.

It's time for the coffee break, and we are going to thank again Yoshua. Thank you.