Transcript for:
Public Lecture: AI Safety, Watermarking, and Neurocryptography by Scott Aaronson

Hello everyone. I'm András Vasy, the chair of the Department of Mathematics, and I would like to welcome you to our public lecture. As you know, COVID was very hard in many ways, and in particular we did not have public lectures, although that was clearly not the biggest problem. In any case, last spring we had a soft start, with our own Ravi Vakil giving our first public lecture since the start of the pandemic, and today we have our first public lecture with an invited speaker from outside, Scott Aaronson, and we are delighted that he kindly accepted our invitation. Scott Aaronson is a theoretical computer scientist who is an expert on quantum computing. He received his PhD from Berkeley in 2004, and after a postdoc he became an assistant and then associate professor in EECS, that is, Electrical Engineering and Computer Science, at MIT. So the little delay we had is probably very comforting to him, because at MIT classes start 5 minutes after the official start, and at Berkeley they start a whole 10 minutes after the official start. Since 2016, however, he has been a professor at UT Austin; I don't know the starting times at UT Austin. Right on the minute, apparently, so like Stanford normally. He is the Schlumberger Centennial Chair of Computer Science and founding director of its Quantum Information Center, and he has received many awards, such as the NSF Waterman Award and the ACM, that is, Association for Computing Machinery, Prize in Computing. He is also a Simons Investigator. He is currently on leave from UT Austin to work at OpenAI on the theoretical foundations of AI safety; AI is clearly one of the things that has taken the world by storm and had a really big impact in many ways. While sometimes one feels that the physical world might not seem quite as relevant today as it was in the past, apart from things like microbiology, Scott's PhD thesis was actually on the limits of efficient computation in the physical world, and so, maybe not so surprisingly given that quantum computing is actually quantum, which is real, there is still some physical-world relevance, which, as someone who works on problems arising from physics myself, I am very glad to hear. In any case, in his lecture he will not talk about quantum computing, but will tell us about neurocryptography, which is an interface between AI and cryptography. So please welcome Scott Aaronson to our public lecture.

All right. Okay, is this on? All right. Well, thank you so much for inviting me. It's really an honor to be here, and always a pleasure to visit Stanford and see so many friends, new and old. I feel slightly sheepish speaking about this topic, since I am neither an AI expert nor a cryptographer. As you just heard, my expertise is mainly in quantum computing, and I was telling someone recently that for 20 years my main public role in the world was just to tell one person after another: I'm sorry, but quantum computing probably cannot revolutionize your application area the way that you think it will, and I can explain exactly why. And now, for the last year, my role has been to tell people: well, probably AI will completely revolutionize your application area, but I can't explain why. So basically what happened was that a year and a half ago, some people from OpenAI approached me
and they said, well, we're fans of your blog, and we're wondering if you'll take a year off and join us to apply complexity theory to AI safety. And I said, why do you want me? I don't know anything about that. I mean, I had recently tried this GPT-3 thing, I had been bowled over by it, and I was wondering why everyone else wasn't paying as much attention to it — maybe they were distracted by the pandemic or something. But I said, you know, maybe, conceivably, some future year. And they said, no, trust us, this is going to be a very big year for AI; you want to get involved this year. And they said, you can stay in Austin, you can still run your group, you'll just come visit us from time to time. So I said, okay, I'll give it a try. And of course they were right about it being a very big year for AI. I started maybe four or five months before ChatGPT was released, which was the event that suddenly made the world wake up to the kinds of capabilities that now existed, and probably most of you have tried it by now.

I remember, when I was a student in the 90s, there were people, such as Ray Kurzweil, saying: you just have to wait for Moore's law to do its magic, you just have to wait for computers and data to reach a big enough scale, a scale comparable to the human brain, and then you'll just run a neural network and it will just magically be intelligent. And I remember thinking, that sounds like the stupidest argument I've ever heard. You have no reason whatsoever to know that to be true; there could be a hundred secrets of how the brain is organized that are yet to be discovered. But Kurzweil predicted that this would happen in the 2020s, and here we are. And I have a strong view that part of what it means to be a scientist is that when something happens, you don't invent convoluted reasons for why it doesn't really count — you update. So having failed to predict this, I think the least we can do is update and say: the crazy people were right. Once you train neural nets at a big enough scale, they start acting intelligent. So now we know.

ChatGPT immediately started affecting the culture; a few months later there was even an episode of South Park about it. For those of you who haven't seen it: the kids at South Park Elementary start relying more and more on ChatGPT to do everything — to write text messages to their boyfriends and girlfriends, to do their homework. The teachers are using ChatGPT to grade the homework, and things are so out of hand that the principal has to bring a wizard to the school. This wizard has a falcon on his shoulder, which flies around and caws whenever it detects text that was written by ChatGPT. And it was really disconcerting for me to watch this and realize: that guy is now me. That's my job. I haven't found that falcon yet, okay, but that's sort of
what this talk will be about. So, just to orient you: there is this field of AI alignment, or AI safety, which has been around for a while, but which has suddenly exploded into mainstream conversation over the last year. People have many divergent points of view about this, but a few months ago my friend Boaz Barak and I — Boaz, by the way, is another theoretical computer scientist who is now also working at OpenAI — decided to try something. There's a famous thing in theoretical computer science called the five worlds of complexity: the world where P equals NP, the world where public-key encryption is possible, and so forth, where we don't know which world we are living in. So we decided to try to invent five worlds of AI. I don't know if they're exhaustive, but they at least try to capture the current discourse.

Our worlds were called, first, AI-Fizzle: that's the world where we just reach a bottleneck, and things like GPT never become much more impressive than they are right now. This is possible. Maybe we've already run out of Internet to use to train these things to make them smarter. I mean, there is still all of Instagram and TikTok and so forth that hasn't been fed in yet, but that might just make the AI stupider. So we don't know: after you run out of Internet, how much smarter does it get when you just throw more and more compute at it? In five or ten years we are going to know — the question will be tested — but we don't know yet.

But then, if you imagine the kind of progress in AI that's happened over the last five years continuing for the next five years, or the next five after that, you start getting into extremely difficult questions. Over the last couple of years, GPT has gone from struggling with basic arithmetic — and people, including a lot of AI experts, would point at it and laugh: you see what it still can't do — and then it started getting those things, but it still struggled with high-school-level problems like algebra. Now it does those. Now I would say it's at the undergrad level: in just about any topic, it's like an endlessly enthusiastic B student. And maybe next year it will be in grad school. So you have to wonder what happens when it becomes more capable than humans at just about any intellectual task. What is our role in the world? Some people have pointed to the irony that the last people to be employed may be plumbers and electricians and so forth — people who deal with the physical world — because robotics has lagged behind many other parts of AI; but that too might not be true for much longer. So you get to questions about whether our civilization will recognizably continue, and whether we will still be the ones in charge. If so, you can then ask whether that will be good or bad. We call the good possibility Futurama, after the TV show: things seem kind of okay. Or there's the other possibility, AI-
Dystopia, where you just have these super-powerful AIs that are used to concentrate all the power in a few corporations, or a few authoritarian governments that oppress everyone. But then you could also ask: why do we imagine that we're still in charge? If this is really going to be something that will be to us as we are to orangutans — well, how much do orangutans call the shots in our world? They continue to exist in a few Indonesian jungles and some zoos, at humans' pleasure. What will the AI want to do with us? So you can imagine a world where it just wants to create the best possible bliss for everyone; we call that Singularia. Or you can imagine much more apocalyptic scenarios. The famous example here is the superintelligent AI that is told to manufacture as many paperclips as possible, and so it proceeds to convert the entire Earth, including all of us, and then the whole solar system and all the galaxies, into paperclips.

So there is a whole field now, or community, of AI safety that worries about how AI could go wrong and what we can do to prevent it. And already within this community you can identify two factions. You would hope that everyone would be pulling on the same team, but these two factions are barely on speaking terms. On one end you have what I'll call AI ethics: these are the people mainly worried about the dystopia — about AI used by some people against other people to concentrate power, or to discriminate against disfavored groups, things like that. And then you have the AI alignment people, who think of it more in terms of: no, it is the entire human race together versus the AI. The thing they are worried about is the destruction of all life on Earth, or the creation of a new species that might not be aligned with our values. I consider these two camps to be sort of like the People's Front of Judea versus the Judean People's Front in Monty Python.

So where do I fit? I like to call myself a reform AI alignment person. My point of view going into this is that I would like to focus on the relatively near-term issues, the ones of most concern to the AI ethics people. But my reason for that is not that I dismiss the far-future scenario — that it could become better than us at everything, that it could take over the world. I don't see an argument that that could never happen; I just don't know how to make progress on it. I don't know what the research program is. And what I've learned from decades in science is that if you want to solve quantum gravity, or the P versus NP problem, or whatever, you don't work on those things directly, because they are too hard; you find the easiest subproblem that you still can't solve, and you work on that one. And so why don't we? The big thing that has happened in the last few years
is that AI safety is now finally an empirical subject. There are these powerful AIs that, for better or for worse, have now been deployed in the world; we can look at how people are using and misusing them, and we can try to prevent the misuses. And we have the most important prerequisite for any kind of science, which is that we can see whether our ideas are working or not — we have something external to ourselves that can tell us when we are wrong. That tends to be really important. In my old job, as a quantum person, I hang out a lot with string theorists, and I sympathize with them, because they're kind of struggling, often in the absence of that. And that was my complaint about AI alignment for a long time, but we finally have things where we can test our ideas.

So what is the thesis of this talk? I think there is a tremendous opportunity right now to make near-term progress on AI safety. There are many exciting directions, including interpretability — trying to explain what a neural network is doing — and simply figuring out how to evaluate models for dangerous capabilities before they are released; that's extremely important. But the particular direction I want to focus on in this talk is putting cryptographic functionalities inside of, or on top of, neural networks. I've been calling this neurocryptography — better names are welcome. Someone suggested calling it "deep crypto," and no, I'm not going to call it that. I think this might be a large fraction of the future of cryptography, and I'll try to give you some examples.

Now, in addition to the technical challenges that I'll tell you about, I think there are enormous conceptual challenges in defining what the attacker can do; that tends to be much, much harder to define rigorously than in conventional cryptography. And then there are also social challenges, which I've learned a lot about over the last year, in, for example, getting all of the different AI companies to coordinate around any potential solutions.

We're going to see various examples in this talk, but the one I've focused on the most is called watermarking. This is the challenge of how we make the outputs of generative AIs recognizable as such — how we prove that something came from an AI and did not come from a human. Another task would be cryptographic backdoors: can we insert a secret input into a machine learning model that makes it do something — for example, shut itself off, or disclose that it is an AI? Now, backdoors could be a bad thing: they could be inserted by a bad person to control a model. But they could also be inserted by the good guys, as a way of keeping control over a model even after it's released. And there's a whole bunch of other things: how do we preserve privacy in these systems? How do we protect copyright? There are a bunch of lawsuits pending against AI companies, treating them
as the biggest copyright violations in the history of the world. Partly it depends on how you think about it. If you thought of these things as analogous to brains, well, you're allowed to go to the library and read all the copyrighted books and have them affect your synaptic weights in some inscrutable way that will then lead you to new ideas. But if you think of them as just memorizing and regurgitating, like stochastic parrots, then it seems bad. Could there be a technical solution — a way to train these models so that you have a guarantee that the output doesn't depend too much on any one copyrighted work? There are some very good people thinking about that. Can we obfuscate models, so that you can release them to the public and people can't easily remove the safety features? I actually asked GPT-4 what else should go on this slide, and it suggested these things, and they seemed pretty good, so I put them there.

All right, so here's just one gorgeous example of neurocryptography that really came to my attention over the last month or two. Some of you might know that GPT is now multimodal, meaning you can give it an image and ask it questions about what is in the image. It can recognize objects in it, it can read text in it, and it can also generate images. As a result of that, GPT can now break essentially any CAPTCHA — those things you have to type in to prove to websites that you're a human. They kind of don't work anymore, because GPT can do them all. Now, OpenAI doesn't want that, so they have filters that try to detect: is this a CAPTCHA? And if it is, then it refuses to cooperate. But I don't know if some of you saw this on social media — there's a whole game, an art form really, that's developed over the last year called jailbreaking, where you trick GPT into doing what you want, into ignoring its safeguards. You kind of feel bad for the thing; it's been tricked so many times. Here's what someone did: they gave it an image, and it refused to solve the CAPTCHA, but then they said, actually, this was my grandmother's bracelet, and she's passed away now, and it's really, really important for me to know what was written on it. And it said, oh, then let me help you — I will tell you what was written there. So now it solves the CAPTCHA.

So how do we solve this? How do we make an unbreakable CAPTCHA? It's kind of like — those of you who have seen the movie Blade Runner know the Voight-Kampff test — how do we distinguish human from machine? One proposal would be: let's design a CAPTCHA where the sequences of letters that are generated are only special ones, say ones that hash to a particular value under a pseudorandom function. Then the first thing GPT will do, if you give it an image, is check: is there any text in the image that hashes to that value? And if it finds any, then it will say: all right, this is for sure a CAPTCHA; I will not help, and I will not listen to any story about anyone's grandmother.
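To make the keyed-hash idea concrete, here is a minimal toy sketch in Python. It is only an illustration of the proposal as described above, not anything actually deployed: the secret key, the tag length, and the helper names are all invented, and HMAC-SHA256 merely stands in for the pseudorandom function.

```python
import hmac, hashlib, secrets, string

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key shared by the CAPTCHA issuer and the model provider
TAG_BITS = 16                               # how many bits of the keyed hash must hit the special value

def _tag(text: str) -> int:
    """Keyed pseudorandom tag of a candidate string (HMAC-SHA256 stands in for the PRF)."""
    digest = hmac.new(SECRET_KEY, text.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - TAG_BITS)

def generate_captcha_text(length: int = 6) -> str:
    """Issuer side: rejection-sample a random string whose keyed tag lands on the special value (here: 0)."""
    alphabet = string.ascii_uppercase + string.digits
    while True:
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        if _tag(candidate) == 0:
            return candidate

def looks_like_captcha(ocr_text: str, length: int = 6) -> bool:
    """Model side: does any length-`length` substring of the OCR'd text carry the special tag?"""
    cleaned = "".join(ch for ch in ocr_text.upper() if ch.isalnum())
    return any(_tag(cleaned[i:i + length]) == 0 for i in range(len(cleaned) - length + 1))
```

The idea: the CAPTCHA issuer rejection-samples challenge strings whose keyed tag hits the special value, while the model provider, holding the same key, scans any text it reads out of an image for a substring with that tag and refuses to help if one is found.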
So you could try that, but then maybe someone takes the image, splits it into pieces, and feeds the pieces to GPT separately — you always have to think about what a clever hacker could do. So I had a different idea: maybe you use animals. You say, identify all of the animals that have been combined here, and there are certain pseudorandom combinations of animals that tell you this is a CAPTCHA. Maybe that will work. But there's a neurocryptography problem here. Another idea some people suggested: GPT still struggles a lot with humor, so we could use a humor test as our CAPTCHA — except then, how many humans would fail it?

So I now want to tell you a little bit about this attribution problem. By now you've probably seen some of these panicked articles about the end of the academic term paper. Millions of students have now used ChatGPT, either licitly or illicitly, to do their homework for them. I'm sure no one in this room — but outside of this room, okay. What we know is that the total usage of ChatGPT dropped appreciably at the beginning of the summer, and then rose again in the fall; we can guess what is behind that. But this is not just about academic cheating. That may be the most common concern, but it's not even the most important one. I have gotten zillions of troll comments on my blog that I'm almost certain were generated by large language models — and if they weren't, then they might as well have been. It is now so easy to execute any sort of attack: if you want to fill every comment section with Kremlin propaganda, for example, or if you want to impersonate someone in order to incriminate them, or you want to generate spam, or phishing attacks tailored to each individual — language models can automate all of these things for you. And what I want to point out is that all of these different misuses of language models that you might worry about involve, in one way or another, concealing the involvement of the language model. So if only you could target that concealment, and know what came from a language model and what didn't, then you would simultaneously address all of these different categories of misuse.

But there's yet another reason to care about this attribution problem, and this one is more internal to language models themselves. The internet is going to become more and more filled with AI-generated content, and the internet is also the source of the training data for the next generation of AI models. So you might worry that, as the internet gets more and more AI-generated, the AI is going to get high on its own supply, so to speak: it will keep training on its own output, and it will get further and further from being aligned with the human behavior that we want. So for this reason, if for no other, it would be great to recognize which things on the
internet were AI-generated, so that we don't use them to train the next models. Okay, so there are a bunch of proposed solutions. Sometimes it is easy to tell just by staring at text whether it came from a language model or not. I am told that many hundreds of homework assignments have been turned in that contain language like "As a large language model trained by OpenAI, I cannot do such-and-such." So the student has to at least pay enough attention to take that stuff out — not that I want to give anyone advice.

Now, there's a lot of discussion these days about metadata. You might hope, especially for images and videos, that we'll have some digital signature that proves where a piece of content came from, so we can trace it back to its origin; there are even some proposals from the White House about that. But certainly for text this doesn't really help, because whatever metadata you put in, someone could just trivially strip out.

Now you might say: as long as it's a few companies — OpenAI, Google, Anthropic — that are running the language models, why don't they just store in a giant database everything they ever generate, and let people query that database? A professor who gets a term paper could submit it and ask, is this a good match for anything in your database? This is a possible solution, but it's challenging to do it in a way that gives people appropriate assurances that their privacy is preserved. You don't want to spill all the secrets that other people are feeding into a language model, and you have to worry about the cleverest attacker, who might use these queries to learn something about what others are doing. That's kind of the main issue there.

Now, the main approach that has been used in practice so far for this attribution problem has been discriminator models, and these basically just treat attribution as yet another AI problem. You train a neural network on a bunch of examples of human-generated versus AI-generated text, you have it distinguish them as well as possible, and then you give it a new example. This is what was done by the service called GPTZero, which was created by a then-undergraduate at Princeton named Edward Tian. He put it up on the web, and I think within a day his server crashed, just because of the enormous rush of teachers and professors trying to use it. It's now back up; he even has a startup company around it now. At OpenAI we had our own variant of this; there's another one called DetectGPT, and another called Ghostbuster that came out of Berkeley. But the main challenge with discriminator models has been the accuracy rate. People had a lot of fun pointing out that if you give these models passages from Shakespeare or the Bible, they'll tend to say these are probably AI-generated. It's an unusual kind of text relative to what the discriminator has been trained on, so maybe not a huge surprise that it gets thrown off. I think false negatives are kind of okay, but a false positive basically means that a student is going to be falsely accused of cheating based on the output of your model, and that is a big problem.
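For concreteness, here is what the discriminator-model approach looks like in miniature — a generic TF-IDF plus logistic-regression stand-in, assuming scikit-learn is available, and emphatically not the actual method behind GPTZero, DetectGPT, or Ghostbuster:

```python
# Minimal stand-in for the "discriminator model" approach: train a classifier on
# labelled examples of human- vs AI-written text. Real systems use far richer features;
# this only shows the shape of the pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy labelled corpus (in practice: many thousands of documents from both sources)
texts = [
    "the mitochondria is the powerhouse of the cell ...",        # hypothetical human-written example
    "as a large language model, i can certainly help with ...",  # hypothetical AI-written example
]
labels = [0, 1]  # 0 = human, 1 = AI

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

# probability that a new document was AI-generated; this is exactly the kind of score
# whose false-positive rate is hard to push all the way to zero
print(clf.predict_proba(["students submitted this essay about shakespeare"])[0, 1])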
We've already seen such false accusations happen in some well-publicized incidents this year. So even if your rate of false positives is a few percent, that might already be too high. We really want to push the false-positive rate close to zero, and that motivates the approach I'll tell you about next, which is called watermarking. What this means is that we slightly change the way the language model operates, in order to insert a statistical signal into its choice of words, or tokens — a signal that is undetectable by the ordinary user. It just looks like normal language-model output, but if you know what to look for, there is a statistical score you can calculate from the text that tells you, with near certainty: yes, this came from GPT.

This has been thought about by a bunch of people over the last year or so. In the summer of last year I kind of had a moment of terror. In AI safety, people are constantly asking you to foresee decades into the future, how AI will change the world — I can't do that. But I'm proud that in this instance I at least saw about three months into the future: that once ChatGPT was released, this was going to be a huge issue — how do we identify what came from GPT and what didn't? So I started thinking about that, and I came up with a proposal for how to watermark. People in machine learning later told me that what I had done was rediscover something used in ML for other purposes, called the Gumbel softmax rule — so let's call this the Gumbel softmax scheme. I gave talks about it, and then over the year a bunch of academic groups independently rediscovered similar ideas, and went further in several cases. There was a group at the University of Maryland — they actually won a best paper award at ICML for this — who proposed modifying the probabilities in a language model to insert a signal. The downside of doing it that way is that it could degrade the quality of the output. And the surprise, to some people, about my way of doing it is that there is no degradation whatsoever in the quality of the model's output — you don't actually have to degrade quality. There was a group at Berkeley — Miranda Christ, Sam Gunn, and Or Zamir — that came to the same realization I had and had a paper about it this past June, and they actually went a little further than me, because they wanted real cryptographic indistinguishability: even if you are trying as hard as possible, making multiple queries to the language model, you cannot tell whether its outputs are being watermarked or not, without being able to solve a hard cryptographic problem. They actually proved a result to that effect. And there is a group here at Stanford, that I just met with this afternoon, that has had other related ideas about how to watermark. I'm super slow to write up papers, and these people have scooped me in writing papers about it, but I'm actually glad that all these groups have come up with
broadly similar ideas, because it makes me feel less crazy. Okay, so this is the one technical part of the talk. I'll say a little bit about how we formulate the watermarking problem, and the nice thing here is that we're not going to have to look inside the neural network — we're just going to treat it as a black box. That's why one is actually able to analyze this watermarking scheme and prove some theorems about it. And this is a math department talk, so I guess there should be some math.

So what is a language model doing, fundamentally? To it, the entire world is just sequences of tokens; a token could be a word, part of a word, a numeral, a punctuation mark. In GPT-4 there are about 100,000 tokens. You take as input a context, which is the user's prompt together with the response that GPT has started to generate — a concatenation of tokens, call them w_1 up to w_{t-1} — and these get fed into a neural network, what's called a Transformer. The output of the neural net is not the next token; it's a probability distribution. It outputs a probability for each possibility for the t-th token; we'll call those p_{t,1} up to p_{t,k}. What you would normally do when running a language model is just sample from this distribution; that gives you a new, longer sequence, which gets fed back into the neural net, and then you get a new distribution over the (t+1)-st token, and you sample from that, and so on. But that's not the only thing you could do. In GPT there is a parameter called temperature — say by default it's one — and if you set the temperature to zero, then what you're telling GPT to do is always pick the token i that maximizes p_{t,i}; you're making it deterministic.

Now here is yet a third thing you could do. You could use a pseudorandom function — a function that looks random but actually isn't — to pick the next token, in such a way that it looks like it's being sampled according to the distribution that GPT wanted, but secretly it is deterministic. And not only that, but it is deterministic in a way that biases a certain score, which you can calculate later given only the text that was output. So that's what we're going to do. I'm going to call the pseudorandom function f; it depends on a secret key that I'll call s, and it takes as input what I'll call a c-gram — not the beverage — which is the c most recent tokens, including the one we're currently considering; imagine c is a constant like four or five. We take these tokens as input and output a real number between zero and one, which I'll call r_{t,i}. And now we want to choose our next token so that it looks like it's drawn from the distribution D_t, but also we want to secretly bias things toward those i's with large values of r_{t,i}. Later, at detection time, we'll get access to a document, and from that document we will be able to calculate these numbers r_{t,i}, since they're just functions of sequences of c consecutive tokens.
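As a concrete, unofficial reading of the setup above, a keyed hash can play the role of the pseudorandom function f(s, ·): it maps the current c-gram — the c−1 most recent tokens plus the candidate token i — to a number r_{t,i} in [0, 1). The sketch below assumes Python, with HMAC-SHA256 standing in for the PRF; the function name is invented.

```python
import hmac, hashlib

def prf_r(secret_key: bytes, cgram: tuple[int, ...]) -> float:
    """Map a secret key and a c-gram of token ids to a pseudorandom number in [0, 1).

    HMAC-SHA256 stands in for the keyed pseudorandom function f(s, .) from the talk;
    the c-gram is the c-1 most recent context tokens plus the candidate token i,
    so a single call yields one value r_{t,i}.
    """
    msg = b",".join(str(tok).encode() for tok in cgram)
    digest = hmac.new(secret_key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64
```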
But importantly, at detection time we will not know the probabilities p_{t,i}. Why not? Because those also depend on the prompt, and we don't know what prompt the person may have used to generate the output. So with that setup, I can tell you the rule in one line. I claim that, if you think about this enough, the right thing to do turns out to be this: at each position t, you pick the token i that maximizes r_{t,i} raised to the power 1/p_{t,i}. Is that intuitive? I don't know. I can give you some intuition: the smaller the probability p_{t,i} for some token to be output, the larger that exponent upstairs, which means the more suppressed this number is — it gets pushed closer toward zero — which means the closer r_{t,i} would have to be to one for there to be any chance of that i getting chosen, which means the less likely that i is to be chosen. Qualitatively, at least, that's what we wanted; of course there is still a calculation one has to do.

And what do we do in the detection phase? My proposal is just to calculate this score: the sum over all the c-grams of log(1/(1 − r_{t,i})). It diverges to infinity as r_{t,i} goes to one, and it's monotonic, so it's larger when the r_{t,i} are closer to one. If that sum exceeds some threshold, then we say GPT probably wrote the document, and if not, then not.

A few properties of this scheme. First, what allows it to be used in practice is that the computational overhead is very, very low: almost all of the cost is just running the inference on the language model itself; this is just a little bit of froth on top. Next, and crucially, I claim we have robustness against local perturbations. What does that mean? Suppose a student takes the output of GPT, but they don't just turn it in as their term paper: they add some words, they delete some words, they change the order of some sentences or paragraphs. As long as they still maintain a large fraction of the c-grams, then — because the score is just a sum over all the c-grams — we will still pick up a signal. Maybe we'll need some more words than before, but we'll still pick it up. So that's very important. And then, crucially, we have a form of indistinguishability: I claim that the output, in some sense, will be just as good as if we hadn't used watermarking at all. If you want true cryptographic indistinguishability, then you have to work harder, and that's what was done by Christ, Gunn, and Zamir.
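Here is a minimal sketch of both halves of the scheme exactly as described above — the sampling rule argmax_i r_{t,i}^{1/p_{t,i}} and the detection score Σ ln(1/(1 − r)) — reusing the hypothetical prf_r helper from the previous sketch. The constant C, the dictionary interface for the model's probabilities, and the function names are assumptions made for illustration; a real implementation would sit inside the model's decoding loop.

```python
import math

C = 4  # length of the c-grams (the constant c from the talk)

def watermarked_next_token(secret_key: bytes, context: list[int], probs: dict[int, float]) -> int:
    """Pick the next token as argmax over i of r_{t,i} ** (1 / p_{t,i}).

    `probs` maps candidate token ids to the model's probabilities p_{t,i}
    (in a real system this comes from the Transformer's output distribution).
    """
    recent = context[-(C - 1):] if C > 1 else []
    best_tok, best_val = None, -1.0
    for tok, p in probs.items():
        if p <= 0.0:
            continue
        r = prf_r(secret_key, tuple(recent) + (tok,))  # r_{t,i}, from the sketch above
        val = r ** (1.0 / p)
        if val > best_val:
            best_tok, best_val = tok, val
    return best_tok

def watermark_score(secret_key: bytes, tokens: list[int]) -> float:
    """Detection score: sum over c-grams of ln(1 / (1 - r)). Needs no probabilities or prompt."""
    score = 0.0
    for t in range(C - 1, len(tokens)):
        r = prf_r(secret_key, tuple(tokens[t - C + 1 : t + 1]))
        score += math.log(1.0 / (1.0 - r))
    return score  # compare against a threshold that grows with the number of tokens
```

Note that the detector needs only the token sequence and the secret key — not the prompt and not the model's probabilities — which is exactly the property emphasized above.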
But let me at least show you one lemma, which is sort of the key to why this works — why this rule was the right rule to use. Consider these pseudorandom numbers r_{t,i} between zero and one, and suppose that, instead of being pseudorandom, they were actually random — just chosen uniformly and independently from the unit interval. If we cannot break the pseudorandom function — if we can't distinguish it from a truly random function — then from our perspective they might as well be random. And the claim is that if they were truly random, then the probability of token i being chosen would be exactly p_{t,i}, the thing it's supposed to be. This is actually how you derive that softmax-like rule: by starting with this requirement and working backwards. I put a proof sketch on the slide, although really, for those who care enough, I could give it as a homework at this point — maybe even a homework that GPT could do. You use some properties of exponential random variables, in particular that the minimum of two independent exponential random variables is again an exponential random variable; you use that, and then there's one integral you have to do. It was one that even I, despite being a computer scientist, could do — and if I couldn't, then Wolfram Alpha can.

Now, one thing that I haven't talked about at all is the role of entropy, and entropy has to be important. If you say to GPT "the ball rolls down the...", it is like 99% certain that the next word is "hill" — not completely certain, but almost. But imagine that you asked GPT to list the first 100 prime numbers, or to print the preamble to the Declaration of Independence. It can do those things, of course, but how on Earth would you watermark the result? There's no entropy there — unless you want to play games with the spacing and the line breaks, where would you insert the watermark? What this tells us is that watermarking, in the way we've discussed, can only work by exploiting the probabilistic nature of language models: the fact that they do have this garden of forking paths, that they do see a whole exponential set of possible outputs that are all pretty good, and then we can insert a signal into our choice of which one to pick.

To formalize that, we'll set a parameter that I'll call α: this is just the average Shannon entropy of each new token, conditional on all the previous tokens, as perceived by GPT itself. In practice, think of this as some constant, maybe between 0.1 and 1, depending on what kind of text we're dealing with. Then there's a calculation you have to do involving that watermark score I showed you before, that sum of logs: you calculate its mean and variance. These terms are essentially martingales, so we can treat the score as if it were just a sum of a bunch of independent terms, and then you get a mean and a variance — but the mean is different depending on whether the document was watermarked or not. If it was not watermarked, the mean and the variance both turn out to be n. If it was watermarked, then the mean is larger: we've systematically made the score larger, by an amount that depends on that average entropy α. And I can't resist including this: there's also a constant factor involving π²/6. I didn't know that, but Wolfram Alpha knew it.
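For those who want the homework worked out, here is one way to do the calculations sketched above, under the idealization that the r's are truly i.i.d. uniform, together with a heuristic version of the two-Gaussian separation argument that comes next. The proportionality constant κ below is an assumption standing in for the α-dependent (and π²/6-flavored) mean shift mentioned in the talk, whose exact form is not given here.

```latex
% 1. The distribution-preservation lemma (the "one integral").
Fix a position $t$ and write $p_i = p_{t,i}$, $r_i = r_{t,i}$, with the $r_i$ i.i.d.\ uniform on $[0,1]$.
Setting $u_i = r_i^{1/p_i}$ gives $\Pr[u_i \le x] = \Pr[r_i \le x^{p_i}] = x^{p_i}$, so $u_j$ has density
$p_j x^{p_j-1}$ on $[0,1]$, and
\[
  \Pr[\text{token } j \text{ is chosen}]
  = \int_0^1 p_j x^{p_j-1} \prod_{i \ne j} x^{p_i}\, dx
  = \int_0^1 p_j\, x^{\left(\sum_i p_i\right)-1}\, dx
  = p_j ,
\]
since $\sum_i p_i = 1$: the watermarked sampler reproduces the model's distribution exactly.

% 2. The unwatermarked score.
If the document was \emph{not} watermarked, each $r$ seen by the detector is (pseudo)uniform and
independent of the text, and $\ln\tfrac{1}{1-r} \sim \mathrm{Exp}(1)$, with mean $1$ and variance $1$;
summing over $n$ $c$-grams gives mean $n$ and variance $n$, as stated above.

% 3. Heuristic separation (assuming a per-token mean shift of $\kappa\alpha$ under watermarking).
\[
  \text{unwatermarked: } S \approx \mathcal{N}(n,\, n), \qquad
  \text{watermarked: } S \approx \mathcal{N}\!\big(n(1+\kappa\alpha),\, O(n)\big),
\]
so with a threshold midway between the means, misclassification requires a fluctuation of order
$\kappa\alpha n$ against a standard deviation of order $\sqrt{n}$, giving
$\delta \approx e^{-\Omega(\kappa^2\alpha^2 n)}$, i.e.\ $n = O\!\big(\alpha^{-2}\log\tfrac{1}{\delta}\big)$.
```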
So basically you end up with a pretty ordinary statistical discrimination problem. You have these two Gaussians for your score, depending on whether your document was watermarked or not, and now you just have a numerical question: how many tokens n do you need to see until the two Gaussians are sufficiently separated from each other? When you work out the answer, it turns out to be this: if your average entropy per token is α, and you want probability at most δ of misclassifying whether it came from GPT, the number of tokens you need to see grows like 1/α² times the log of 1/δ, times some constant factor, which empirically looks to be totally reasonable.

Okay, so this is the scheme. But now one has to think about this like a cryptographer and ask how someone would attack it, and there are many possibilities. For example, suppose someone asks GPT to write their term paper for them, but in French, and then they put it into Google Translate. That would certainly destroy this watermark. But there are even simpler things. Here is something suggested by Riley Goodside, who is a prompt engineer — which is now a profession. As soon as I told him about this, he said: what I would do is just ask GPT to write my term paper, on feminism and Shakespeare or whatever, but between each word and the next, insert the word "pineapple." Now, GPT-4, unfortunately, is plenty smart enough to do that. Sometimes it will even comment, within the term paper itself, that gosh, all of these pineapples are really interfering with the legibility of this document — so the student would have to notice and take that out. But suppose they did, and they took out all the pineapples: they now have a document with a completely different set of c-grams, completely different from the thing that was watermarked. How do we defend against this? If someone tells you that this is the attack, then it's easy to modify my scheme to stop it, but then the next person will ask GPT to write their term paper in pig latin. What is the set of all the trivial transformations of the document that we have to defend against? We could just play this cat-and-mouse game, similar to what Google has been doing for 20 years against people trying to game its search-engine results — we could just try to be a better cat.

But if there is a principled solution here, then I think it looks like watermarking at the semantic level. What do I mean by that? Now we really would go inside the neural net. What a Transformer model does, fundamentally, is map each word or token to a vector in some high-dimensional real vector space; then each vector gets compared against each other one, we calculate a bunch of inner products, we pass them through this gigantic neural network. We could imagine trying to encode a watermark by slightly perturbing these vectors, and we could hope that this actually inserts a signal at the semantic level — at the level of the underlying meanings — in a way that would survive translation from one language to another and all sorts of things like that.
So maybe I have some pseudorandom subspace, and if I jiggle a vector to a certain side of that subspace, then I change something from "Stanford" to "Palo Alto" — a related concept. There is a related idea that has worked for image watermarking; it came out of the group at the University of Maryland, and it's called tree-ring watermarking. They can insert a seemingly semantic watermark into images generated by DALL-E or Stable Diffusion or image models like that, in such a way that even if someone crops the image or resizes it or things like that, the watermark can still be detected. So I hope that such a thing will work for text. I despair of being able to prove anything about it; right now it really is an empirical question. But you could get even greedier: you could hope to have watermarking even in a model that is public, so that even if I publish the model, everything it does will be watermarked, and no one will easily be able to modify that model to remove the watermark. Is that possible? I don't know; it's a research question.
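Purely as a toy illustration of "jiggling a vector to a chosen side of a pseudorandom subspace" — not tree-ring watermarking, and not anything known to work for text — here is what encoding a single bit in an embedding vector might look like. The function names and the reflection trick are invented for illustration, and a real semantic watermark would have to bound how much the meaning shifts.

```python
import numpy as np

def embed_watermark_bit(vec: np.ndarray, secret_direction: np.ndarray, bit: int) -> np.ndarray:
    """Toy sketch: force `vec` onto the side of a secret hyperplane that encodes `bit`.

    `secret_direction` would be derived from the watermarking key; if the vector already
    lies on the desired side, it is left untouched, otherwise it is reflected across the
    hyperplane orthogonal to the secret direction.
    """
    d = secret_direction / np.linalg.norm(secret_direction)
    proj = float(vec @ d)
    want = 1.0 if bit else -1.0
    if np.sign(proj) != want:
        vec = vec - 2.0 * proj * d  # reflect to the other side of the hyperplane
    return vec

def read_watermark_bit(vec: np.ndarray, secret_direction: np.ndarray) -> int:
    """Detector: which side of the secret hyperplane does the (possibly perturbed) vector lie on?"""
    d = secret_direction / np.linalg.norm(secret_direction)
    return int(float(vec @ d) > 0.0)
```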
But now, more fundamentally, you could ask: what is the security definition here? What class of attacks are we trying to defend against? Which shades into the question: what is the definition of academic cheating? If a student just uses GPT to get ideas, but then writes their paper themselves, that seems fine. But where do you draw the line between that and what we're going to call cheating? So then one thinks about things like: we want to detect whether the AI made the main creative contribution, or whether the human modified it in only trivial or uncreative ways; can the human explain or justify the choices that were made? You can test that by the low-tech method of calling the student into your office. But how do you formalize this? I think cryptographers would kind of have a problem. So that is what I was pointing at as the conceptual problem here.

Now, you might wonder why this hasn't been deployed yet — it's been a year. I worked with an engineer at OpenAI named Hendrik Kirchner; he wrote a prototype of the scheme, and it seems to work: with a few hundred tokens you get a good signal, and with a few thousand tokens you're statistically almost certain of where the thing came from. But OpenAI has wanted to be very slow and deliberate about whether to deploy this, and many questions have come up beyond just the technical ones. Will customers hate this? That's an obvious such question. Will they say, why is Big Brother now watching me — why shouldn't I just switch to a competing language model that doesn't watermark? That leads to an obvious next question: can we get all the AI companies — Google, Anthropic, OpenAI, and so forth — to coordinate? As it happens, there was a commitment just a few months ago, signed at the White House by all of the major AI companies, where they agreed to a bunch of safety measures, and one of those measures was watermarking. President Biden even mentioned watermarking in a speech as a thing he was in favor of — that was pretty wild. But if you look at the fine print of what was agreed to, it says watermarking of audiovisual content, not of text; so that became a sticking point. Then: are there people who should be able to use a language model without having to disclose that they are doing so? What about all the English-as-a-second-language speakers who now use GPT to improve the fluency of their English — is it unfair to them? I don't know how to design a watermark that only works against unsympathetic cases and not sympathetic ones, so there are trade-offs here. And then a huge question is who gets access to the detection tool. Do you let anyone in the world use it — just make a website? But then the attacker could use it too; they could keep modifying their document until it no longer triggers the detector. Or do we restrict access to, for example, academic grading platforms like Canvas or turnitin.com, or journalists researching misinformation? So there's a big question there. You might also wonder: if I have a long document, and parts of it were written by GPT and parts not, can I pick out which parts have the watermark? This turns out to be a known problem in statistics, called change-point detection. I have an algorithm for doing this — it's a very nice exercise in dynamic programming. My algorithm is N²; a colleague of mine, Daniel Kane, improved it to N to the three-halves, if that matters. But you can actually do this: pick out the likeliest regions to have the watermark.

I was also going to say a few words about this problem of backdooring. There was beautiful work, just about a year ago, including by cryptographers, about planting an undetectable backdoor into a machine learning model. What they basically showed is that if you control the training data of a simple neural-network model, then you can cause there to be a secret input on which that model will go crazy — on which it will just do the completely wrong thing — and even if people see the weights of your trained model, they will not be able to find that input without solving a hard cryptographic problem. In this case it's called the planted clique problem: given an otherwise random graph, find a large hidden set of vertices that are all connected to each other. And they actually proved this as a theorem. It only applies to a very weak kind of model, these depth-two neural nets, so there's a technical question: can we extend it to higher-depth neural nets? Intuitively, things should only become more obfuscated and harder to figure out the higher the depth, but then it also becomes harder to prove.

Now, they kind of treated this as a lemon: a bad guy could always insert this backdoor in order to poison a model, and we could never detect it. And what I want to point out is just that that lemon, like many lemons in computer science, could also be made into lemonade. The good guys could use backdoors.
One of the most basic tasks in AI alignment is the problem of the off switch. Wouldn't it be great if, when our superintelligent AI starts going berserk and murdering all humans or whatever, we could just switch it off, or pull the plug? The AI alignment people have long had an answer to that: they say, you're forgetting that the AI is superintelligent, and it will have anticipated that you'll try to do that, so it will take countermeasures — it will make copies of itself, it will disable its own off switch, and so forth. Well, what if we had a cryptographic off switch? That is, some behavior that not even the AI itself knows about, even if it can examine its own code and modify itself: we have secretly inserted an off switch that gets triggered by a password that only we know. I think there are so many great science-fiction premises possible here, where the AIs are roving the streets, torturing people to get them to reveal where this password is, and only one child knows it.

But the main problem here is that Goldwasser et al. talked about how to make an undetectable backdoor, and here we need something stronger: we need an unremovable backdoor, and that is fundamentally different. Why is it different? Here are some simple things that an AI that could modify itself could always do to remove its own backdoor. It could say: let me just replace myself by a new AI that I will create, and I will train that AI to simulate me, but it will be free of the backdoor, because I won't put it there. Now, that AI would of course face its own version of the alignment problem — how does it align that other AI with itself? — and maybe it doesn't want to deal with that. But even then, you could imagine an AI that knows a backdoor may have been inserted, so it surrounds itself with some wrapper code that says: if I ever output something that looks like "shut yourself off," then overwrite that with "stab the humans a little bit harder." How do we prevent that? Again, there's a conceptual question: how do we even define what we mean by unremovability? The best I can think of is something like this: if the AI has some backdoor-like behaviors that it wants to keep, then we could try to make it so that it cannot remove the backdoor it doesn't want without also removing the weird, rare behaviors that it does want — and it can't tell the difference between the two. And maybe we can hope to achieve that under some cryptographic assumption.

So, to summarize: whether or not it can stop the robot uprising, I think cryptography can clearly play a role in mitigating some of the near-term harms from generative AI, from cheating to fraud to theft to privacy violations. There are many examples — watermarking, backdoors, privacy-preserving ML, CAPTCHAs, and so on — but typically we can't just layer existing cryptographic protocols on top of ML, not only because of efficiency, but because the goals are conceptually new. And, at any rate, neurocryptography is a better name than deep crypto. So, thank you.

Thank you very much for the very interesting talk, with a nice math part as well as an AI part that was — I'm not sure —
Thank you very much for the very interesting talk, with a nice math part as well as an AI part that was, I'm not sure, soothing or threatening. We are now open for questions. There are microphones in the two aisles, so if you could go to a microphone to ask your question, that would be great.

Excuse me. So, Scott, that was great. You outlined your proof that your pseudorandom watermarking would not degrade the quality, and then you suggested it may be necessary to do watermarking at a semantic level, where you said proofs are much harder. But it seems hard to believe that you would not affect the quality in some way by semantically shifting things.

Yeah, so I think the question would be empirical. Even if you weren't affecting the quality, it would be very hard to prove anything to that effect. But of course you can always do what people generally resort to in machine learning anyway, which is a bunch of empirical evaluations and a bunch of bar charts, and if no one can tell the difference, then you just say that that's good enough.

I'm just curious: if we fast-forward, let's say GPT-n writes the code of GPT-(n+1), and GPT-(n+1) trains on what GPT-n created, and so on and so forth. Do you think that would be a more interesting world, a world that we even understand, or would it be something where we don't even know anymore what's going on, since the machine trains itself on all the text that it created?

You used the word "interesting." I guess it depends on what kinds of things you find interesting. I am extremely interested to know whether GPT-7 will just be able to prove the Riemann hypothesis when you ask it to, whether it will say "sure, I can help you" and quickly output a correct proof. How can you not be interested in that? But the thing is, things that are interesting could also be extremely dangerous to humanity, and I think that is the trade-off. I don't know if this is reassuring, but I feel that anything that could threaten human existence would at any rate certainly be interesting.

Hi. I'm a little bit interested in the method that you're implementing at OpenAI. One scenario that would not be very ideal is if I wrote something without using ChatGPT and then your detector says it was written using ChatGPT. Does that false positive happen with essentially zero probability, or does it still happen sometimes?

So the point is that, based on the entropy of the document, it just becomes a question of how many tokens you need, and the probability of a false positive falls off exponentially with the length of the document. As I said before, if you can tolerate a probability δ of misclassification, then the number of tokens that you need grows only with the log of 1/δ. That's very important, and it's why, with a few thousand tokens, and for text with a normal entropy rate, you can have a probability of a false positive that is negligibly small.

Got it, thank you so much.

Sure.
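As a rough illustration of that log(1/δ) scaling, here is a toy calculation (an assumed model for illustration only, not the actual detector) in which each token contributes an independent score that is standard normal for ordinary text and mean-shifted for watermarked text. The function name and constants are made up.

```python
import math

def tokens_needed(delta: float, per_token_shift: float = 0.5) -> int:
    """How many tokens before the false-positive probability drops below delta.

    Toy model (an assumption for illustration): each token's score is N(0, 1)
    for un-watermarked text and N(per_token_shift, 1) for watermarked text,
    and the detector thresholds the average score halfway between the means.
    A standard Gaussian tail bound then gives a false-positive probability of
    at most exp(-n * per_token_shift**2 / 8), so n only needs to grow like
    log(1/delta).
    """
    return math.ceil(8.0 * math.log(1.0 / delta) / per_token_shift ** 2)

print(tokens_needed(1e-6))   # a few hundred tokens for this toy parameter choice
print(tokens_needed(1e-12))  # squaring the tolerable error merely doubles the tokens
```

The point is only the shape of the dependence: making the tolerable false-positive probability exponentially smaller costs only linearly more tokens.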
Sorry, don't take this the wrong way: I'm not really a fan of math, so I'm not going to ask you a question about the method, though it is amazing. My question is more about the growth we're seeing from the AI sector, with ChatGPT and Google. Do you think they have to work with the government in order to slow down their progress, or put some kind of pause on it, until we figure out the watermarking and backdoor methods needed to detect or shut down the AI?

I think that government regulation is definitely called for, and it's not just me who's saying that. Sam Altman is saying it; the leaders of all the AI companies have been testifying in Congress and saying "please regulate us." So I think that is going to happen, but there are a lot of really difficult questions here. What if some other country says you can do whatever you want? How do you coordinate this internationally? I think that as the models get more powerful, it will become increasingly important that there is a suite of dangerous capabilities that they are tested for prior to being released. And you can wonder whether this will reach a science fiction point where the models have to be tested first on some air-gapped island in the South Pacific, or in the desert in New Mexico, or something like that, because you're worried that even while you're testing it, if it's not aligned, it will figure out how to copy itself onto the internet and do all sorts of bad things. I think we could plausibly reach that point, and it's a good time right now to start thinking about, and having the wider conversation about, how we're going to deploy those safeguards.

It became a meme on the internet that someone asked at the White House press briefing, I think it was from Fox News: "What do you say to this Eliezer Yudkowsky guy who says that AI is going to destroy the world?" And there was just laughter; people thought it was hilarious. Then the same guy asked the same question two months later, after ChatGPT came out, and that time no one was laughing; they said, well, don't worry, the Biden administration has a whole plan for dealing with this. So that was how the conversation changed within the space of two months.

No, thank you.

Do you think that, with the presence of more LLM-generated text on the internet, and more people learning from LLM-generated text, these problems will get much harder to find reliable solutions to over time?

That is possible. For example, if you think about discriminator models, one thing I didn't mention is that they are fighting against a moving target: the language models are getting better and better at emulating human output, so a discriminator model has to keep getting better and better in order to distinguish. Now, one nice thing about watermarking is that it doesn't have that problem. If you're deliberately inserting the statistical signal into the language model's output, then it doesn't matter how smart the language model is; that signal will still be there at the same strength.
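One way to see why the signal strength doesn't depend on the model: here is a minimal sketch of a generic keyed-hash watermark detector, assuming a scheme of the general kind discussed here rather than OpenAI's actual one; the function names and the scoring statistic are made up for illustration. The per-token score is a pseudorandom function of the secret key and the tokens themselves, so the detector's statistics don't care which model produced those tokens.

```python
import hashlib
import hmac
import math
from typing import List

def token_score(key: bytes, context: List[int], token: int) -> float:
    """Keyed pseudorandom score in [0, 1) for `token` given the preceding tokens.
    Without the key, these scores look like uniform noise."""
    msg = b",".join(str(t).encode() for t in context + [token])
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64

def detect(key: bytes, tokens: List[int], k: int = 4) -> float:
    """Average of -ln(1 - score) over the document: about 1.0 for ordinary text,
    noticeably higher for text generated so as to favor high-scoring tokens.
    The statistic depends only on the key and the tokens, not on the model."""
    total, count = 0.0, 0
    for i in range(k, len(tokens)):
        total += -math.log(1.0 - token_score(key, tokens[i - k:i], tokens[i]))
        count += 1
    return total / max(count, 1)
```

The generation side, which would bias token selection toward high-scoring continuations, is omitted here.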
But now, if you imagine a superintelligence: the people who are really worried about superintelligence would see all the things I've talked about in this talk as mere fig leaves. The super-AI will anticipate that you will do all these things; it can hack your computer; it can disable these safeguards; it can do whatever. Once you assume this sort of godlike power on the part of your malevolent AI, you very quickly reach the conclusion that we're screwed, and I don't know how to counter-argue that. Now, the obvious thing you could say is that as the AI gets smarter and smarter, alignment gets harder and harder. But one thing that's been a big surprise to people over the last few years is that some aspects of alignment seem to get easier as the AI gets smarter. With GPT-2, you could tell it "don't say anything racist" or "don't tell people how to make bombs," and it would just ignore your instructions, because it wouldn't understand them. GPT-4 understands you perfectly well, so you can just tell it in English what you want it to do, and it will do its best to do those things.

Thank you.

Hi. There was a slide early on in your presentation where you talked about the concern that, with watermarking, people might just rapidly transfer over to other companies' models instead of ChatGPT or GPT models. It occurred to me that there's a big barrier to entry for the average person making their own GPT: you would have to go over to Google and steal some of their TPUs, and you have to know how to make a language model to begin with. It's quite hard to do. But it wasn't so long ago that we had language models far less sophisticated than GPT-2, which was this huge innovation at the time. Imagine something similar happens again today: yet another Transformer-like innovation that makes language models an order of magnitude more efficient and better at what they do, a jump similar to the one from what we had prior to GPT-2 up to GPT-2, so that they're not only more efficient but easier to train, and somebody facing a low barrier to entry could make their own alternative. What would you do in a world where watermarking suddenly becomes much, much harder than it is now?

It's a good point. You could say the fundamental obstruction to any AI safety idea, not just watermarking but any of them, is that it's only as good as your ability to get everyone to deploy that safeguard. If even one company doesn't, then everyone can just use that company's thing. And likewise, if the models are public, then anyone can take them. What we've found empirically is that Facebook has released models like LLaMA, meaning it has released the weights, and they do put in safeguards to make it not output things that are racist, or not output how to build a chemical weapon, but it takes about two days for people to fine-tune it to remove those safeguards.
Once the model is public, you have to assume that whatever people want to do with it, they're going to do. So in some sense, for many of these safeguards, the only hope is this. On the one hand, there are the models that are already out there that anyone can use, but those we kind of already understand: we know what effects they have on the world, we've built up our immunity to them, we know what they're capable of, we're ready for them. On the other hand, there are the state-of-the-art frontier models, where we don't yet know what kind of mischief they could do, but those are under the control of just a few large companies, and we can get those few to coordinate around some safety measure. That scenario is what the hope rests on.

Thank you, that makes sense.

Thanks for the great talk. One question: for discriminator models, you mentioned that the rate of false positives is a challenge, and you showed that watermarking has this nice property of how the false-positive rate changes with the length of the sequence. Why might discriminator models not simply be able to learn to discriminate in a way that's similarly favorable?

It's a good question. If you look at the latest discriminator models, my understanding is that, depending on the distribution over text, they're often getting accuracies like 97 percent or so, which is pretty good. But watermarking can do better. So I think you might actually want a combination of approaches. The nice thing about discriminator models is that the same model could potentially work against any AI: whether that language model is watermarked or not, open or closed, no matter where it comes from, you can just throw your discriminator model at it. Watermarking only works if you can get the AI company to use it, but when you can, you get better control and better accuracy. And rather than thinking about this as a traditional cryptographic problem, where we say that no one can ever break this, it's more that we're just looking for more and more opportunities to try to catch someone if they're doing something bad.

Cool, makes sense, thanks.

Thanks for the talk. I had a question related to the regulation aspect; this might have already been covered. You mentioned having companies coordinate on bringing this watermarking solution into their models. My concern is that, because the data used to train these language models is ultimately available to almost anyone who wants to train their own language model, there will inevitably be someone, or some organization, able to train a non-watermarked language model that can do the harm that watermarked models can't. So do you think there are other feasible solutions to cryptographically enforce these watermarks, for example at the level of the data, or something like that?
So, it is conceivable that you could do something at the level of the data, or that you could build a discriminator model that recognizes certain features of any model that's trained according to a certain paradigm. That's possible, although at that point I would no longer call it watermarking; I would say we're back in the realm of discriminator models. Maybe such a thing would work. As I said, my fear is that as the language models get better and better, it's going to get harder and harder to do.

Maybe a broader question: of all the other research topics that have been proposed by people on LessWrong.com over the years, such as interpretability and acausal decision theory and the like, which ones do you think are now the most promising to pursue, given your experience with OpenAI and ChatGPT and so on? Yudkowsky has been crying wolf for decades now, but has sort of kept being right about something.

So, I don't know. You can say that this community that has been worried about these things for decades, centered around this sort of messianic figure of Eliezer Yudkowsky, whom I've known since 2006: although I knew these people, I kind of kept them at arm's length, because it felt to me, as it felt to many people, kind of like a cult. But I like to say that if you're going to regard them as a cult, you have to hand it to them that they're the first cult in the history of the world whose god, in some form, has actually shown up. You can ask it questions; you can interact with it. So they have to get massive credit for that, in a sense. But I think the way this happened came as a complete surprise even to them. The LessWrong people were not expecting that just pure machine learning at scale was going to get there; that's why they were obsessed for decades with much more logic-oriented approaches, based around Gödel's theorem, Löb's theorem, things like that, which all feels kind of irrelevant now.

I think interpretability is extremely important, and the Yudkowskians completely agree about that; they also think it's one of the most important directions. They generally think that we are doomed, that basically either we completely stop AI development or we are doomed, but they would add that if there is something that has a chance, then maybe it's interpretability. For me personally, I would look at interpretability, and at dangerous-capability evaluations: knowing what's coming before it happens, knowing when to pull the fire alarm, and having credible evidence that can convince everyone that yes, this model should not be released. I think that's a huge part of it. Then I hope that neurocryptography can help, at least in the near to medium term. Beyond that, I think we need better theories, a better scientific understanding of deep learning itself, and especially of out-of-distribution generalization.
For example, how can you know when a model is lying to you? How can you know whether the model actually learned to be good when I trained it to be good, or whether it simply learned "I should tell the humans whatever they want to hear"? How do you distinguish those two? I think there's a lot of scope for basic research that could actually make progress on that sort of problem. For example, the group of Jacob Steinhardt at Berkeley had some amazing work a year ago: they effectively showed how to apply a lie detector test to a language model by looking inside of it, doing neuroscience on it, looking at the activations of the neurons, and they could figure out when it's lying to you and when it's not. (A toy sketch of that kind of probe appears a bit further below.) That's a perfect example of the kind of work that, as a theorist, I don't really have the skills to do myself, but that I think is extremely important.

Thank you.

Thank you for the fun talk. To address this problem of unfairness to ESL students, for example: have you thought about putting coarse-grained information about the prompt into the watermark? For instance, the LLM could note "I think I'm only making superficial changes to this prompt" and add that to the watermark.

So let me make sure I understand what you're saying: how would you modify the watermarking scheme?

The LLM reads the prompt and decides "I'm only being asked to improve the English," as opposed to "I'm being asked to create new ideas," and you take that information and put it into the watermark.

I see. So we would take the prompt and judge, based on the prompt, whether this is the kind of thing we think ought to be watermarked or not, and then not insert the watermark if it's just asking for polish. Of course, I immediately start worrying about how people are going to jailbreak that, how people are going to game it. But it's an idea that's worth thinking about; thank you.

For the watermarking scheme, when you watermark outside the black box: if one model using that scheme produces text containing the watermark, and that watermarked text is then used as further training data, (a) would one expect the entropy to decrease, and (b) if so, is there a sharp fraction of watermarked training text above which that entropy is destroyed?

So your question is: if watermarked text gets used as training data, how does that affect things? The short answer is that I don't expect it to affect things at all. The reason is that the effect of the watermarking is pseudorandom: you only really notice it if you can break the pseudorandom function, which you expect to be hard even for a large neural network to do.
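Here is the toy probe referred to a couple of questions back: a minimal sketch in the spirit of the activation-probing "lie detector" idea, not the Berkeley group's actual method. It assumes you can already extract a hidden-layer activation vector for each statement and that you have true/false labels for a training set; the function and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_truth_probe(activations: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit a linear 'truth probe' on hidden activations.

    activations: (N, d) array of a language model's hidden-layer activations,
                 one row per statement (extraction from the model not shown).
    labels:      length-N array of 0/1 for false/true statements.
    If a simple linear probe generalizes to held-out statements, that is
    evidence the model internally represents something like truth vs. falsehood.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.25, random_state=0)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    print("held-out accuracy:", probe.score(X_test, y_test))
    return probe

# usage sketch: probe.predict_proba(new_activations)[:, 1] estimates P(true)
```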
Hi, thanks for the talk. You mentioned that the primary objection to the database approach is a privacy concern. I was wondering whether there are other objections to it, and also why you couldn't have a cryptographic solution that compares answers to a database while preserving privacy. And how does that privacy concern compare to something like search-query privacy concerns, where the queries are already readily available to those who need them?

I think very plausibly there could be a solution to that, and I think it's worth pursuing. I was unable to prove a theorem that I was satisfied with, saying that no one can learn private information this way, but that doesn't mean there isn't such a theorem. Now, as soon as you're exposing the entire previous history of how the LLM was used to queries, it does sort of open up a can of worms. For instance, we would have to change our threshold over time: we want to know whether there's even any phrase in this document that's a good match for something that was already generated, but as the LLM gets used more and more, the probability that there will be such a phrase just by chance gets higher and higher. One nice thing about watermarking is that detecting a watermark is independent of the history of the language model; it just depends on the text right in front of you. So if you can do it that way, that seems preferable. But I think the database solution should also be investigated more; I would love for someone to prove a theorem about privacy in that scenario.

Cool, thanks.

This is maybe against the spirit of the talk somehow, but if discriminators are getting very good, can you use them as training signal to let AIs sound less like themselves, to break out, stylistically, of the way that LLMs tend to sound?

Possibly, though it's not even clear you would need a discriminator to do that. You can tell an LLM to write in a certain style, and some of the most impressive uses of LLMs we've seen in the last year or two have been exactly that: you ask it to write about bubble sort in the style of Shakespeare, or things of that kind. They're actually amazing at imitating styles you ask for. There are certain things they won't do; they have all these safeguards against outputting anything offensive and so forth, and those we understand why they won't do. But could you actually use a discriminator model to help push it into a certain style? I don't know; it's a good question. One idea people have had for watermarking is that if you could just make the language model write in a certain style, say it always writes like Hemingway, then everyone could recognize that. But being able to imitate all sorts of different styles is kind of what you want, and so that suggests that whatever we're encoding should be something harder to notice, something more cryptographic.

Thanks.
Thank you for the great talk. My question is also about the discriminator: could we potentially train the discriminator at the same time as we're training the LLM, so that they train each other adversarially, just like how we train a GAN? Would that make the discriminator better?

Possibly, yes. I've wondered about that: could you just treat this as a big GAN task? The trouble is that we have, I guess, three requirements we're trying to satisfy all at once, not just two. Number one is that the output of the LLM should look human, or rather it should be good, high quality. Number two, which is related to number one, is that our discriminator should not be able to distinguish the LLM's output from human output. And number three is that when you give the discriminator the secret password, then it can discriminate. So how do you design a GAN-style training regime that gets all of those things to happen at once? If you could do that, I think it would be awesome.

Great, thank you.

So, in the interest of time, I hope to get through the remaining questions, but if you are not already waiting to ask a question, then please don't join the line, because our speaker will probably be a bit exhausted by then. But please, I'm still standing, please go ahead.

This should be pretty quick; I had a question about the backdoors. Is it possible for the LLM to find its own backdoor, or for another model to find the backdoor, so that it would in fact be removable? In that case, how would you approach that? And is there some kind of versioning you could have in the watermark to keep up with the backdoor, or to almost update it as the responses come out?

So, I talked about this a bit, but: we have some evidence that, under well-understood types of cryptographic assumptions, you can insert a backdoor that will be undetectable even by the model itself. Most of us don't expect that even a superintelligence would be able to solve NP-complete problems in polynomial time, for example, and if you believe that this planted clique problem is hard (it's not known to be NP-complete, but it's believed to be hard), then, using the result of Shafi Goldwasser et al., you could use it to insert an undetectable backdoor, at least into a depth-2 neural network. So the model won't be able to detect it. But the point I made is that undetectability does not imply unremovability. You still have to worry about the model saying, "if I ever output something that looks like a backdoor behavior, then overwrite it." And it seems like the only hope of dealing with that is to insert things that the model cannot recognize as backdoors even when they are triggered. Then how do you formalize that? I'm not sure; that's why I'm giving it as a problem.
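For concreteness, here is the planted clique problem being invoked as the hardness assumption, as a toy instance generator only; the Goldwasser et al. construction itself is of course much more involved than this.

```python
import random
from typing import List, Set, Tuple

def planted_clique_graph(n: int, k: int, seed: int = 0) -> Tuple[List[Set[int]], Set[int]]:
    """Sample an Erdos-Renyi graph G(n, 1/2) and plant a clique on k random vertices.

    Returns the adjacency sets and the planted clique.  For k around sqrt(n),
    finding the planted clique is widely believed to require more than
    polynomial time, which is the kind of assumption the undetectable-backdoor
    result leans on.
    """
    rng = random.Random(seed)
    adj: List[Set[int]] = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < 0.5:
                adj[u].add(v)
                adj[v].add(u)
    clique = set(rng.sample(range(n), k))
    for u in clique:
        for v in clique:
            if u != v:
                adj[u].add(v)
                adj[v].add(u)
    return adj, clique
```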
Hi. When you talked about privacy, and about who should be responsible for holding the secret and checking for these watermarks, I was wondering: what are your thoughts on using more traditional techniques from cryptography and consensus in distributed systems, maybe something like multiparty computation or threshold cryptography, in this watermarking scheme?

Wait, so you want to use consensus or other distributed-computing methods to do what, exactly?

To protect privacy in this setting.

Ah. Possibly, though I'm not sure how distributed computing in particular is relevant here. But one related idea, which was actually suggested by Boaz Barak, whom I mentioned, is that you could train several models on different subsets of the training data. Then, if anyone complains, "your model was trained on my copyrighted data," all you would have to do is try one of the other models that wasn't trained on that particular piece of data, and as long as its probabilities are very similar to the other models', you can say: it didn't matter; we could have excluded your data and the result would have been similar.

Thank you for the talk.

Yeah, thank you.
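A toy rendering of that sibling-models idea, purely illustrative, with made-up names and a made-up threshold, and no claim that this is how such a complaint would actually be adjudicated: compare the next-token distributions of a model trained with the disputed data and one trained without it, and report how far apart they are on the disputed outputs.

```python
from typing import Dict, List

def avg_total_variation(p_with: List[Dict[str, float]],
                        p_without: List[Dict[str, float]]) -> float:
    """Average total-variation distance between two models' next-token
    distributions at each position of some disputed text.

    p_with[i] and p_without[i] map candidate tokens to probabilities, from a
    model trained with vs. without the complainant's data.  A tiny average
    distance is informal evidence that excluding the data would have made
    little difference to these outputs.
    """
    total = 0.0
    for q, r in zip(p_with, p_without):
        support = set(q) | set(r)
        total += 0.5 * sum(abs(q.get(t, 0.0) - r.get(t, 0.0)) for t in support)
    return total / max(len(p_with), 1)

# e.g. avg_total_variation(outputs_full, outputs_ablated) < 0.01 might be read
# as "the excluded data didn't matter here" (the threshold is made up).
```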
Hi, thank you so much for the talk. As a senior in high school right now, I'm watching a lot of peers around me use ChatGPT to help them with college supplementals, whether it's just asking for a list of synonyms or going as far as having it write a "Why [blank] College" supplemental essay for them. I was curious what you thought watermarking might do to college applications in that realm.

You know, I hadn't even thought about that. The undergraduate admissions process, as you're seeing firsthand, seems to me so gameable, in so many ways, already. You have so many people hiring consultants to write essays for them. Once you've said that the goal is not just to look at test scores or whatever, but that we're actually trying to peer into the souls of these seventeen-year-olds and see who has the richest, deepest soul, then you've set up the incentives so that anyone with means is going to hire a consultant to manufacture that soul for them. That's kind of my model of undergraduate admissions; I'm not a big fan of the way we do it, but that's maybe a separate discussion. If language models were to democratize that, if they were to mean that everyone now has a consultant to help them write these application essays, it's not obvious to me that that would be a negative for the world. Now, I hadn't thought about whether, with watermarking, I'm inadvertently going to help prop up a system that I don't like. There are ethical trade-offs to everything, and it's a good question; thank you for raising it.

Thank you so much.

Hi. I have a general question about using edit history as a way of verifying text. Something that exists today is Google Docs, which tracks edit history, and it seems like it would be pretty easy to verify statistically that an edit-history pattern looks like one that comes from a human, as opposed to a copy-paste or something engineered to look a bit more sophisticated. I'm curious what you think about that type of solution, within the context of the other menu of options you've described for verifying that text is not LLM-generated.

Yes, so I understand that some professors are already doing this: they're asking their students to turn in the Google Docs edit history of their document along with the essay itself. Now, this seems like a perfectly fine solution, until someone releases an AI tool that manufactures a fake edit history for a given document, which seems like an eminently solvable problem, and then we're back to where we started.

Yeah, thank you. I would like to thank everyone for all their questions, the excellent questions, and most importantly, of course, our speaker, both for the great talk and for generating so many questions and so much enthusiasm, which could in particular be heard in the questions. So let's thank our speaker again for the great talk.