Transcript for:
AI Insights from Last Week's Podcast

[Music]

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. I'm Andrey Kurenkov, one of your hosts. You may have just heard the AI version of me doing the intro — we'll see if I swap it in. Maybe it is, maybe it isn't, you never know. For a bit of context, I did a PhD studying AI at Stanford and I now work at a generative AI startup.

I think you might have found a loophole there, Andrey. Everyone's always saying you need to disclose when you use AI, and you're saying "I'm not saying I did — I may have." You're kind of sneaking through; that's pretty good. Actually, that announcement makes me just a little bit nervous too, because one of the things I wanted to mention — sorry, my name is Jeremie Harris, as you know if you listen to the podcast, co-founder of Gladstone AI, an AI national security company, all that jazz — is that my wife is expecting a baby with a due date in mid-September. So it's very comforting to hear that my human requirement for some time to deal with the immediate aftermath of the birth is going to intersect with AI that can replicate my voice perfectly. We're feeling very, very comfortable with our jobs here at Last Week in AI, and we will absolutely cover this stuff in a neutral way, because we are not biased by the fact that these AIs are coming for our jobs and, it must be said, our children.

And with that, I do hear having babies is a bit time-consuming — when they first arrive it takes a bit of time to take care of them, from what I hear. The thing is, Andrey, my wife is very reasonable about almost everything, except when I proposed that we have GPT-4o stand in for me through the first three weeks or so. I was like, look, it's literally drawing on vast information reserves that I don't have; it can answer questions you have about market economics and geopolitics that I just can't help with, so why not give it a shot? But she didn't want to. I even showed her the Figure 02. I know, exactly — as we'll cover, there's a humanoid robot that has GPT now. I don't know why you would be opposed to that for your baby, personally, but I guess not everyone is that excited about it. And our listeners say that I'm biased when it comes to mothers — I don't know where they get this. Sorry guys, that was a joke that came from last episode. Yeah, what is up with these in-jokes we've got?

Moving on to a quick mention of listener comments and feedback: we had a review on Apple Podcasts — the title is "very pro-motherhood." Okay, keep it going, guys. It also mentions liking the AI stuff, but at the cost of covering the Marvel Land experience — so there you go, here's your Marvel Land content for this week. I really appreciate that. I'm glad we're getting listeners duking it out in the comment section about whether or not we're pro-motherhood; that's where we want to be, that's what we care about here at Last Week in AI. The next one, also an Apple review — thank you for that — says we're the most substantive AI podcast out there. If you're a new listener it may not seem like it so far, but when we do get to the news, perhaps we'll justify that. And just two more, both YouTube comments — I've loved seeing YouTube comments come in, please do keep that going.
The first one says that apparently Jeremie's positivity is why they watch each week — so Jeremie, you're the reason some people watch, and I'm sure this listener will be glad to have you back. This episode, by the way, will come out pretty quickly after the previous one; I'm still a bit behind, so I'm going to try and catch up. We just need that Figure 03 to come out with podcast-editing capabilities, Andrey. Yeah, I wish we had that.

The last comment is a question from YouTube: do we have any tips for an AI that would integrate with other services — for instance, ask an AI to generate a business plan, and then it includes a website and creates a Facebook and Insta page. So basically, it sounds like an AI that can do not just one thing but multiple things all together. This is a tricky one for me, because for the individual things — generating a business plan, starting a website, creating a Facebook and Insta page — we do have AI for each of these things individually. So far I don't think we have a unified way of doing all of it, and I think that's what a lot of the agentic startups and efforts are going toward: you spin up an AI agent, you tell it "generate a business plan, start a website, create socials, make AI posts on those socials," and then the agent, using APIs or something like that, goes ahead and does all those things for you by itself. We're not there yet. For now you basically have to create your own script, with Python, to use the various services — and you can use ChatGPT to help write such a thing — but off the shelf I think we're not quite there yet (a rough sketch of that kind of glue script is below).

Yeah, and if you're going to set that up, one of the problems you'll run into pretty fast — we've talked about this a lot — is that these agents may have fairly low failure rates on individual steps in their process. They may, say, 90% of the time nail each step of the overall goal of setting up a website and a business, but when you add up those probabilities of failure, it only takes one failed step to interrupt the whole flow — at 90% per step, a ten-step flow succeeds end to end only about 35% of the time. I'm grossly oversimplifying by calling this one flow, but you get the idea. So this is an ongoing problem. We'll actually be talking about it today in the context of some of the audits that were done of GPT-4o — the model card came out this morning or yesterday or something — so we'll get to that, but it's definitely a hot topic here.
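To make that concrete, here is a minimal, hypothetical sketch of the kind of glue script described above: a language model drafts each artifact and the results get pushed to other services. The LLM call uses the real OpenAI Python SDK; `create_site` and `post_to_social` are invented stand-ins for whatever website-builder or social-media APIs you would actually wire up. The last few lines illustrate the compounding-failure point.

```python
# Hypothetical sketch of the "one agent does every step" idea discussed above.
# The LLM call uses the OpenAI Python SDK; create_site() and post_to_social()
# are made-up stand-ins, not real services.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_llm(prompt: str) -> str:
    """Single LLM step: draft the text for one stage of the pipeline."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def create_site(copy: str) -> str:
    """Stand-in for a website-builder API call (not a real service)."""
    print("Would create a landing page with this copy:\n", copy[:200])
    return "https://example.com/my-new-business"


def post_to_social(network: str, text: str) -> None:
    """Stand-in for a Facebook/Instagram posting API (not a real service)."""
    print(f"Would post to {network}: {text[:120]}")


if __name__ == "__main__":
    plan = ask_llm("Write a one-page business plan for a dog-walking startup.")
    site_copy = ask_llm(f"Turn this plan into landing-page copy:\n{plan}")
    url = create_site(site_copy)
    announcement = ask_llm(f"Write a short launch post announcing {url}.")
    for network in ("facebook", "instagram"):
        post_to_social(network, announcement)

    # The compounding-failure point: even at 90% per step,
    # a pipeline of n steps succeeds only 0.9**n of the time.
    for n in (5, 10, 20):
        print(f"{n} steps at 90% each -> {0.9 ** n:.0%} end-to-end success")
```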
Starting with the first story: Google's hiring of Character.AI's founders is viewed as a sign that part of the AI startup world is starting to implode. I'm not sure if we covered this news last week, but it was a pretty big deal when Google hired the leadership team of Character.AI. These are former Google researchers, Noam Shazeer and Daniel De Freitas, who were actually pioneers of this chatbot technology: they worked on LaMDA, which was a sort of precursor to ChatGPT — a large language model optimized for chatbot-type activity — and which famously had the whole drama with the claim at the time, from a person inside Google, that it was conscious, leading Google to back down on ideas of releasing it publicly. As I've said before, I often suspect that had that media episode about consciousness not happened, maybe Google would have been first, before ChatGPT. Who knows. So this is covering that story a bit more, in light of the trend we've seen.

So Google hired the founders of Character.AI, and reportedly that came with around $2.5 billion in investment in Character.AI. Similarly, we've seen Microsoft essentially buying out Inflection AI, and with that the hiring of the founders and team, which led to the investors of Inflection AI receiving about $600 million. The trend here is that it doesn't seem like these companies are making it work in terms of being profitable, and these kinds of moves are indicative of them needing to continue to take in cash, probably to survive. That does make sense: as we've covered many times, with AI models it's hard to get good margins — they're expensive to run, much more so than any other sort of software — and with a business model of $20-per-month subscriptions you may or may not break even, depending on how much people use it. Character.AI users in particular are very engaged; people who are spending $20 a month are likely chatting with chatbots a whole lot, possibly enough to offset that revenue and make it not very profitable. So it's an interesting trend, and it's continuing — in fact, there's another story later this episode about Adept being hired away by Amazon. I don't know what you think about it, Jeremie.

Yeah, I think you're right. I think it's part of a number of trends, one of which — I don't think it was covered in the article, I could be wrong — is the motivation behind these kinds of weird pseudo-acquisitions. We're not seeing Microsoft come in and say, hey, we're going to buy Inflection outright; they do this weird thing where they bring over the CEO and they bring over the staff, but they leave this hollowed-out husk of a company to just die on the vine. That's what's happening here with Character.AI, and that's what seems to have been happening lately. The reasoning behind it seems to be antitrust concerns: there's a concern that governments are going to go after these companies for big acquisitions, for consolidating the space, and this makes it easier, in a sense, to say, well, look, we never actually acquired the company, so there's no actual antitrust situation here.

Another trend — I don't think this was highlighted in the article either, but I think it's important to note — is that so many of the companies in this position are founded by authors of the "Attention Is All You Need" paper, that famous Transformer paper from 2017.
In this instance, Noam Shazeer, the CEO of Character.AI, was one of those authors, and there are a whole bunch of other examples, including the co-founders of Adept, which you mentioned, who obviously had a similar experience earlier in the year. And then of course there is Aidan Gomez at Cohere. Cohere is right now another one of these — I would argue, and I have argued for years, that they're going to be in trouble, and I keep saying this as they keep raising more money, so the stakes keep getting higher and higher. But I really do think that, structurally, Cohere is in some hot water long term; they need to find a way to turn a profit. Yes, they've raised $500 million, but they're in fundamentally the same circumstance as all these other companies: how are they going to turn a profit if they're not coupled very closely to one of these cloud hyperscalers — like Microsoft, like Google — in the way that OpenAI is, in the way that Google DeepMind is, and in the way that, in some respects, Anthropic is, though they may arguably fall into that bucket to a certain extent as well.

So I think there are a whole bunch of key structural challenges in the space for all the mid-cap companies, the ones that aren't the DeepMinds and the OpenAIs, to sort out how they're going to make that buck. This space is getting flooded with more and more competition: Character.AI has a whole bunch of competitors — I think Kevin Roose from The New York Times was doing a play test of about half a dozen comparable companies — and any time you're in that kind of business, you're in trouble. Now, worth noting: Character.AI was at least nominally an AGI lab; their goal was actually to build AGI. So this is another AGI lab that's now been folded into Google. And their cap table was impressive: we had Andreessen Horowitz jumping in as part of a $150 million round that they led, at a billion-dollar valuation, just back in March of 2023. So these are really good investors making this same mistake — hey, if Marc Andreessen listened to the podcast, maybe he'd have a few million dollars more. Not actually serious about that. But anyhow, this is, I think, the big trend in the space.
Character.AI, by the way, was kneecapped by Meta coming out and saying, look, we're debuting our own family of AI characters. That was back in October of last year, so about nine or ten months ago, and then they recently added a feature that lets users create their own. So you're really seeing the proliferation of this capability: people don't have to sign up to a specialized service anymore, Meta is already offering it, and more players will as well, you have to assume. So I think this move makes all the sense in the world, but it is an indication of this underlying trend, which seems to be pretty persistent.

That's right. And in addition, another component of this trend, which has been active for a while — at least a year and a half — is the intense competition to acquire AI talent. All of these companies are paying a lot, either in actual salary — the salaries at OpenAI for the top researchers are supposedly easily breaking into... six figures? Seven figures. Yeah, that's right, six or seven figures, which, if you don't know, means into the millions. And this kind of hiring is also indicative of that competition: just to hire the leadership and most of the talent of Inflection AI, Microsoft paid something like $600 million. That's a lot of money for an acqui-hire. Another reason for this could be the competition for data: Character.AI and Inflection AI have a lot of users chatting with their chatbots, and that is a very strong source of data that no one else has access to. So yeah, a lot of trends coming out of these kinds of acqui-hires and the search for talent.

And speaking of that, we do have an update on the acqui-hiring of Adept by Amazon. This happened earlier this year — the top employees of the company were hired by Amazon back in June — and now investors in Adept will receive some reimbursement: Adept will receive $25 million, while investors will roughly recoup their investment. So it's very much like the Inflection AI situation, where it was also the case that when Microsoft — sorry, when Microsoft hired away the company's leadership — the investors in Inflection did recoup their investment via that payment. So, pretty much a similar story. And it is quite a weird trend for the tech sector: for a long time, companies were generally acquiring outright rather than acqui-hiring, and acqui-hires were generally of smaller companies, partially for the tech and partially for the employees. To my knowledge, hires at this scale, for this amount of money, were just not happening — and now they are.

Yeah, the typical structure for an acqui-hire is, like you said, that the acquiring company really doesn't care at all about your IP or your product; the view is that you have good talent, they want it, so they'll buy the company wholesale and then basically hollow it out, take the employees, but also own the IP. In this case, what's happening is — oh wait, sorry, one important note first: in Silicon Valley, when this happens, it usually happens with early-stage startups, so companies that have raised, say, on the order of tens of millions of dollars, tops. The idea is that if you're raising hundreds of millions, there's no way your value isn't in part your product; anything else would be a very weird notion. It's just that, again, as you say, the talent here is so sought after that you're seeing what is effectively an acqui-hire happening at this scale.
When acqui-hires happen, by the way, it's very, very unusual for investors to get their money back. One of the key things about an acqui-hire is that it's usually structured so that employees are taken on board and given very generous compensation packages, and that's the form the acquisition capital takes. This allows the acquirer to get around having to pay off investors for their equity in the company. You can view that as a negative outcome for investors, for sure, and maybe a bit of a dark strategy: it certainly gives the founders of the acquired company a really nice soft landing, but it leaves investors holding the bag.

In this case, they're doing it at such a scale, and with great investors like Greylock and General Catalyst who've put over $400 million into Adept, that at this point they're saying, well, look, we're going to make you whole — which, again, is really weird for an acqui-hire. They're also in a hurry to not call it that, explicitly saying, look, we're not interested in any of the IP here, we're not interested in the product — almost to an insulting degree if you're an Adept founder; they're saying, look, we don't care about this [ __ ], we want the people. So that's just the weird nature of these things. Part of it, really, is again that goal of avoiding regulatory scrutiny on antitrust grounds: don't make this look like you were just straight-up acquiring a company. You want this strange process that gets around the merger notification rules that would potentially trigger FTC interest and things like that. So, a very interesting time. I don't know how much this is going to hold up, because the obvious, let's say, strategy here is to acquire — the end result is the same — so I'm curious how regulators will look at this and what tools they even have to poke at this kind of deal.

That's right. And we have seen antitrust being enforced more strictly, at least it seems that way, in recent years — certainly the EU is very active on the antitrust side, but also the US; as we'll cover later in this episode, Google had a big development on that front this week. And just to read a quick quote covered in this article: last year the FTC did make a statement about this trend. Here's part of it: "Firms hoping to compete in the generative AI space need expertise, and companies that acquire both engineering experience and professional talent will be better positioned to gain market share. Since requisite engineering talent is scarce, powerful companies may be incentivized to lock in workers and thereby stifle competition from actual or would-be rivals. To ensure a competitive and innovative marketplace, it is critical that talented individuals be permitted to move freely." So that's talking more about clauses in your contract that forbid you from going to competitors, but it could also apply to these quasi-acqui-hires.

Yeah, and actually one last note, for context on why the refund is happening in this case. One big source of pressure: every time I've invested in an early-stage company and there's been an acqui-hire, there's always this question from the founders of, do we pay back our investors — just reputationally, because those founders are going to want to go off and found another company someday, and they may want to raise from those same investors.
Silicon Valley is a small world, man. Early-stage startup investors — a lot of us know each other and that sort of thing — and at the later stages it gets even more like that, because at General Catalyst and these big firms the partners move around and all that; they talk to each other. So the goal here is really reputational control as well. Don't mistake this: there's no good outcome here. Getting a refund is not the intended goal for an investor, especially in a company like this, but reputation management is important, and essentially Microsoft — or sorry, Amazon in this case — is allowing the reputations of the founders to be managed with respect to their previous investors, to allow them potentially to go on and do this again. So anyway, that's an important ingredient that I'm sure is part of the decision-making here.

That's right. And for companies that aren't public, to my knowledge it's much harder to do a sort of aggressive acquisition by buying up their stock, so you pretty much need the agreement of the founders to be acquired — or, in these cases, to be hired. So the founders could very well be laying out these conditions: for them to allow themselves to be hired, this has to happen. Yeah, and there's always a question of leverage there, right — how much does Amazon really want Adept, can Adept survive on its own — that's all part of that negotiation, you're absolutely right, along with what form the comp takes and all that.

Moving on, we've got the story that AI chip startup Groq has seen its valuation rise to $2.8 billion, after closing a $640 million funding round led by BlackRock. They've also announced two new appointments: Stuart Pann, a senior Intel executive, has joined as COO, and Yann LeCun has joined as the company's newest technical advisor — Yann being a very famous AI researcher who leads the AI research efforts at Meta. So a pretty big deal for them. They had previously closed a $300 million funding round back in 2021, at which point they were valued at $1 billion, and earlier this year, in March, they reported having deployed around 4,500 chips. This will presumably help them scale and move faster. We've covered Groq a good deal, and out of the companies trying to build novel chip infrastructure to compete with Nvidia — to make sort of AI-native chips — so far it appears they are in the lead and might actually be finding a decent amount of success.

Yes. Their big lead is in that niche of inference: their chips, the so-called language processing units or LPUs that we've talked about a lot on the show, are just for inference, not for training, so in that sense it's one slice of the market. The big question is going to be whether they can scale production. They've shown, at least through demos, that they can achieve outrageously fast inference speeds that compare very favorably to things like the H100, even the B100s and B200s that are coming online. Right now the question is whether they can achieve volume production — that's what this is about. Andrey, you mentioned the roughly 4,500 LPU units deployed as of March this year; well, by March 2025 their goal is to roll out more than 100,000, so about a 25x increase in production. For context, at the scales Nvidia operates at, you're looking at millions of chips per year.
So we're talking about an actually significant level of scale coming from Groq — 100,000 versus millions, you're still an order of magnitude off — which, for a company at this early a stage, is a pretty impressive play. Another thing to note: you always have to wonder what the margins are like, especially as you're scaling up and haven't optimized all your processes yet; you're doing a shakedown cruise for every new product line. What they say is they're aiming for, quote, "a full dollar return for every dollar we spend on hardware — we don't intend to lose money." Unusual for a play like this, but when you're in hardware you kind of have to do that; you can't burn money in quite the same way, because things are just so expensive. As people say in Silicon Valley, hardware is hard. So it will be interesting to see if they can actually make good on that intended scale of 108,000 LPU units — and if they 10x again, they're really in Nvidia-scale territory. They've got a bunch of partnerships, by the way; these include Meta and Samsung, plus a whole bunch of sovereign nations and sovereign wealth funds, things like that. So a lot of serious people with smart money are taking these guys very seriously, and we'll see if the scaling comes out the way they hope it will.

Right. And also, to me it's interesting that, to my knowledge, Groq isn't quite like Nvidia, in the sense that they're using a lot of their AI chips — perhaps all of them, I'm not sure — to run a cloud offering with which you can do inference on openly released models, so Mistral, Llama, etc. That's their money-making machine; they're competing more with Amazon and Google in that sense than with Nvidia, since they're not selling off the chips. And as we've covered before, the big differentiator is their inference speed: they're lightning fast — I think the latest benchmarks are maybe three or four times faster than if you were to run the same model on Nvidia chips. That's very significant, and I think there's still not much competition on that front in terms of inference speed; of course a lot of people are trying to compete there, but Groq is a clear leader. That's another advantage: if you're not trying to sell chips to customers and you're just scaling up internally, there are lots of factors that make it easier to move fast.

Yes — and some that, in fairness, make it harder as well, because you've got to have a large enough user base to run really large batch sizes. One of the things that drives down the cost of inference the most is being able to process large batches of data in parallel at the same time: you're parallelizing the usage and getting more value out of your existing units of compute. And one of the challenges for Groq is that you then have to be in the distribution game as well — you've got to actually host these models, you've got to convince people to use your thing instead of Amazon's thing, and you don't natively have that distribution, those massive enterprise deals and all that. Especially given the way their hardware is set up — the per-chip memory is really limited, so you need many, many of these units to hold one Llama-scale model — that increases even further their need for very large amounts of throughput to amortize across all those devices they need to have in their server farm.
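As a back-of-the-envelope illustration of that batching point, here is some toy arithmetic. All numbers are invented placeholders, not Groq's or Nvidia's actual figures, and it deliberately ignores the fact that per-request speed degrades as batches grow — the point is only that fixed hardware cost gets amortized over however much traffic you can batch together.

```python
# Toy arithmetic for why batch size (i.e., having enough traffic) matters so much
# for inference economics. All numbers are illustrative placeholders, not vendor data.

def cost_per_million_tokens(cluster_cost_per_hour: float,
                            tokens_per_second_per_request: float,
                            batch_size: int) -> float:
    """Fixed hardware cost spread over however many tokens the batch produces.
    Simplification: assumes per-request speed stays constant as the batch grows."""
    tokens_per_hour = tokens_per_second_per_request * batch_size * 3600
    return cluster_cost_per_hour / tokens_per_hour * 1_000_000


for batch in (1, 8, 64, 256):
    cost = cost_per_million_tokens(
        cluster_cost_per_hour=100.0,        # hypothetical cost of the whole deployment
        tokens_per_second_per_request=300,  # hypothetical per-request decode speed
        batch_size=batch,
    )
    print(f"batch size {batch:>3}: ~${cost:,.2f} per 1M output tokens")
```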
So I think all of this is making Groq a very distinct company, and it could be a banging success, or we could find out there are weird issues with this model as it scales. I think in the next couple of months we're going to learn an awful lot about this — March 2025 is not that far away.

Right, and one last comment — there's a lot to say here — they are also competing against OpenAI and Anthropic, in the sense that if you use an API for a chatbot or an LLM, you're kind of choosing between OpenAI, Anthropic, and, in this case, open-source models. So for them it's a very big deal that Llama 3 has come out and is on par with GPT-4 and Anthropic's models, and perhaps that's part of the reason we've seen GPT-4 costs just plummet — in commodity areas, OpenAI is very much trying to compete on price to a significant extent. So that's another challenge.

The next story is a bit more drama, you could say, from OpenAI, and another trend we've seen: John Schulman, a co-founder of OpenAI and one of their research leads, who joined OpenAI basically out of grad school, has been there since the beginning, and was quite impactful in the development of ChatGPT, has left the company to join Anthropic. Once again, I think the reason cited was wanting to focus more on AI alignment and safety. Alongside that, we've seen Greg Brockman, the OpenAI president and co-founder, taking an extended leave until the end of the year to, quote, "relax and recharge," and finally Peter Deng, a product leader who has been at OpenAI since last year, has also left the company. So maybe a coincidence, maybe not. It certainly seems like a lot of people in AI safety and alignment want to be at Anthropic, which is not surprising — Anthropic is more focused on that compared to OpenAI and just about any other player — but some of these other high-profile departures are perhaps a bit more surprising.

Yeah, absolutely. And it shouldn't be lost on us that we're now on the fourth generation of OpenAI alignment leadership — that's not a good sign. We had Paul Christiano, who originally was the head of alignment; he left to start what's now METR — or I guess it was ARC, or ARC Evals, back in the day. Then we had Jan Leike, who left following the very public disagreement he had about the course of OpenAI's progress on superalignment, and the extent to which Sam Altman seemed not to be living up to his commitment on the compute side — 20% of their committed compute, as of about a year ago, going to superalignment. So he left, and now we've got John Schulman. I remember watching him on the Dwarkesh Patel podcast — he gave this long interview shortly after, I think, Jan Leike departed, or anyway it was in the context of all that — and I remember being really impressed. I was like, wow, this is a guy who seems to take the issue pretty seriously; he may have different views from Jan or whatever, but he generally seems quite engaged with it. Which was sort of surprising, given Sam Altman has seemed to make moves that suggest, at least to me, that he's not terribly serious about the issues he has at one time claimed to be serious about. And now, sure enough, John Schulman leaves. So not a great sign, I think, for a lot of people who are concerned about OpenAI living up to its word here. Also not a great sign, necessarily, given that he's a co-founder of OpenAI.
A lot of people have been talking about that: what do you do if you genuinely believe that OpenAI is on the cusp of building a superintelligence — do you actually leave the company? That would seem like a weird move for somebody to make if they genuinely thought OpenAI was on course to do what they seem to think it's on course to do. Now, I personally don't put too much stock in that argument — I don't take it too seriously. Just based on talking to folks at OpenAI, they very much seem confident, and I haven't seen that change. I think you've got to wait: the thing with scaling is that we only get to discover what the scaling curve can do every two to two and a half years — it was two and a half years between GPT-3 and GPT-4 — so on that schedule we've got a year to go before GPT-5, and we're actually not behind schedule. People seem to forget that it just takes a while to get to the point where you can discover what the next point on that curve looks like. So I suspect things are actually going better technically at OpenAI than most people realize, and I wouldn't be surprised if we see something impressive come out from them in due course. But this does seem to reinforce the narrative that, on the safety side, OpenAI is less than serious — or at least less serious than the level they claimed they would be at. That's my impression, personally.

The product departure is interesting. The Greg Brockman leave — again, a lot of people are reading a lot into that, and I'm not so sure it's an indicator either. He's been there nine years; you can see somebody like that needing to take five or six months off. Maybe he's got a training run he has to wait on and he's taking a breather. It's hard to know what to make of all this. Definitely OpenAI's upheaval is going to be the narrative that's around, and I think it's partly accurate — as I've said, especially on the John Schulman side; that one I think is the most interesting and informative piece. I do know there are people at OpenAI who've known this was coming for some time. But I don't know how much to read into the Greg departure.

Right, I do agree — I think the Greg part of this may not indicate very much. It's pretty realistic that he needs some time off; we can be certain it's been a very chaotic time at OpenAI for the past year and a half. They have grown — I don't know by how much; I believe they had maybe 100 to 150 employees as of 2022, if I recall correctly, and now I wouldn't be surprised if they're four or six times that. Something like that — I think they're over a thousand now. Yeah, exactly, so maybe like 10x scaling, and when you do that in a short period of time, one year for a startup, it's insane, it's crazy, it's chaos. Scale does wreck a lot of things in startups. One thing it usually doesn't wreck — because it tends to be correlated with success — is founder retention; you usually don't see founder-level people leaving like this. But OpenAI is not a normal startup, right — this is a company that's trying to build AGI — so, the unexpected happens. And to your point, these departures mean something; it's just not fully clear what.
Some of them, though — again, the John Schulman one — I think are absolutely, to me, highly suggestive of the kinds of problems that Jan Leike highlighted, the kinds of problems that whistleblowers leaving OpenAI left, right, and center have been talking about, with respect in particular to Sam Altman's leadership and the extent to which he's been, at least allegedly, manipulating his way around and through the board. That seems to stack up with all of this. And again, Greg Brockman isn't leaving, he's taking a leave, but a four-to-five-month leave is pretty significant, so that's worth noting.

On to the lightning round, with another story about OpenAI drama: Elon Musk has filed a new lawsuit against OpenAI. This is kind of the same story as before — he filed a first lawsuit alleging that OpenAI breached the founding agreement of the company. Musk, of course, was one of the co-founders and put in a lot of money at the beginning; he parted ways with OpenAI in 2018. He said that was due to also working on AI at Tesla and not wanting conflicts of interest; sources and reporting have suggested it was perhaps more because Musk wanted to take over, wanted to be the leader of OpenAI, and was not able to take that position from Altman. The previous lawsuit argued the same thing and was basically withdrawn by Musk — it was dismissed by him, as opposed to by a judge — and it was pretty flimsy from what we know: it basically said there was a founding agreement to be a nonprofit, but it was not even a contract, just a sort of informal agreement. At that point OpenAI had dismissed Musk's claims as incoherent and frivolous in a blog post that included Musk's emails. This new lawsuit is about twice the length of the first, was filed in federal court in California, and also claims that OpenAI is engaging in racketeering activity. So there you go — Elon is continuing to be on the attack against OpenAI.

Yeah, this is pretty interesting. The narrative, as you said, is: look, I put my money into OpenAI on the basis that this was going to be a nonprofit, and then Sam Altman says, oh, we just discovered that AI scaling is a thing, and that means we're going to need a ton of money if we're going to achieve our goal of building AGI, which means we need to become a for-profit — or at least a capped for-profit, that kind of weird structure they have over there — and Elon says, well, you can't just do that, that's a violation of such-and-such. Now, the emails that OpenAI then published essentially show Elon acknowledging this — acknowledging the need for the company to make a ton of money in order to fund all the compute they need to scale their AI — so their argument is: look, you knew about this, you were okay with it, you recognized the need. This is kind of interesting. I'm not a lawyer and I don't feel remotely equipped to assess the validity of this lawsuit, but it's got some bombastic language in it — it refers to the "perfidy and deceit" of OpenAI as being "of Shakespearean proportions," which I really appreciated, and it alleges that Elon was, quote, "betrayed by Altman and his accomplices." So, some very strong language. Obviously there's no love lost between Elon and Sam; this is a rivalry that goes back quite a ways, to the first split, when Elon was, by some accounts, forced out. And I mean, just look at some of this language.
"Elon Musk's case against Sam Altman and OpenAI is a textbook tale of altruism versus greed. Altman, in concert with other defendants, intentionally courted and deceived Musk, preying on Musk's humanitarian concern about the existential dangers posed by AI." So this is really intense stuff. And you're right about the racketeering piece — when I read "racketeering," again, not a lawyer, but I've seen enough mob movies to know racketeering is an interesting thing to be alleging. So where does this come from? I think this is the quote that brings up the racketeering concern and how they're justifying it: they say that, in partnership with Microsoft, Altman "established an opaque web of for-profit OpenAI affiliates, engaged in rampant self-dealing, seized OpenAI's board, and systematically drained the nonprofit of its valuable technology and personnel." So this is a really interesting angle for the racketeering case. The case is: look, OpenAI itself technically might not be a for-profit, or not entirely, but it's making deals with hardware companies, and those hardware companies are owned or indirectly owned by Sam Altman; it might be making deals with energy companies that are owned or indirectly owned by Sam Altman. So now you've got this indirect way in which OpenAI is fueling this growth, the idea being that its technology and especially its personnel are being siphoned off. Maybe this is an allusion to that kind of power play where Altman threatened to leave OpenAI for Microsoft — because Satya invited him over — and to bring all the researchers with him; that didn't actually end up happening, but it certainly was threatened. I don't know, but I just thought the racketeering piece was really interesting, and it seems like about as good a case as you could make, in this context, for that kind of charge.

And next, again about OpenAI, indirectly, but a bit less dramatic: the story is about Figure, the humanoid robot company, which has unveiled the Figure 02, the successor to the Figure 01 humanoid robot that was unveiled just last year. This one was developed in partnership with OpenAI, which helped Figure raise a $675 million Series B round back in February, and as we sort of knew when that was announced, the idea is that Figure 02 is powered by ChatGPT when it comes to intelligence — and now, we learn, also for generally natural speech conversations, which of course plays in nicely with the GPT-4o announcements from OpenAI and the development of real-time voice chat from that model. So again, a trend we've seen for the last year: a lot of investment in humanoid robotics and a lot of rapid progress. There are quite a few competitors — 1X, Agility, others — trying to build humanoid robots, and of course even Tesla is investing a lot in this direction. So it's pretty exciting for me, and apparently Figure 02 has a ground-up hardware and software redesign, has a bunch of cool hardware on there, and certainly looks cool, which is the most important thing.

The hands — yeah, apparently hands are a key differentiator. I'm always amused: I'm more on the model side, the capability side, scaling, all that jazz, and when it comes to hardware, the part I focus on is all about the processors and that sort of thing, not the robots.
So I keep surprising myself at how little I know about robotics. One of the things, apparently, is that there's a camp of people who think human-inspired hands are way too delicate and just over-engineered, so there's this debate as to whether you should optimize for human-like hands or not. Figure is making human-like hands their whole thing — they're diving into that and going to try to make it work. So Figure 02 — there are a lot of videos you can watch that emphasize heavily the hand dexterity, that sort of thing. The other thing I didn't really clue into until this article put enough of it together: car manufacturers seem to be, for whatever reason, the go-to early use case for these kinds of robots. Figure is beginning pilots with BMW, and the Figure 02 robot has apparently already been over to a BMW plant in South Carolina to collect data, but they also mention Agility, Apptronik, and Sanctuary AI, all of which have similar pilots with carmakers. When they said that, I was like, oh yeah, they do — I'd never connected those dots. And then you mentioned Tesla — well, Tesla Optimus, that's kind of where that comes from. So for whatever reason, that seems to be the high-margin, low-hanging-fruit application for these systems, which I thought was pretty interesting.

Yeah, and I think that makes a lot of sense, because we do have a lot of humans doing physical labor at these plants, and there is in some cases a shortage of talent, because you do need skilled people on those lines; you need to move fast, you need to keep working, and that's one of the benefits — you can go 24/7 with humanoid robots, which you can't do with humans. One other thing I'll note: this is an interesting case of competition, because for Figure, 1X, etc., when you're at a car plant you don't necessarily need general-purpose reasoning capabilities, you don't necessarily need a lot of knowledge about politics or the other things ChatGPT provides. What you really need is very robust control of the robot — you need to be able to move, use your arms, do all of that quickly and accurately — and that's been a long-standing challenge in robotics. That, in addition to hardware, is the point of competition among these companies: making the software that allows robots to work on par with humans. That hasn't been achieved — I don't think it's the case yet — but a trend in robotics has been collecting more data and being able to scale up models, really training machine learning models rather than relying on classic control techniques. I'd be surprised if we see humanoid robots able to fully replace humans on plant lines until maybe a few years from now, but it does seem to be heading in that direction.

And next, another story related to hardware, this one about chips: it's about how ASML and Tokyo Electron have dodged new US chip export rules. The US is considering invoking the foreign direct product rule, which, as we've said, lets it regulate foreign products made with American technology, but as the story says, so far ASML and Tokyo Electron have been able to continue selling to China, in part because they have a potential exemption from the rules — the US may be excluding these Dutch and Japanese semiconductor equipment companies.
Partially this may be because the countries themselves are likely to conform to the stricter export policies without the US invoking the FDPR. And yes, ASML is very, very important for being able to make cutting-edge chips, which China is very much behind on — they are nowhere near the scale of TSMC, which is the provider for Nvidia — so it's very important for China to try to catch up, and the US is trying to curb that possibility.

Yeah, that's right. ASML is the Dutch company that makes the machines that make the machines: essentially, they make the photolithography machines that TSMC in Taiwan uses to make the chips that Nvidia uses in its GPUs, so they are way, way at the very top — or bottom — of the supply chain, the very beginning of it. And this is actually about not just the export of ASML machines but their ongoing maintenance as well. One thing not a lot of people realize is that when ASML sends a machine over to, say, China, they don't just send the machine — they send a team of about a dozen personnel to maintain the system, integrate it, and ensure its proper use and functioning. So if the US were to come in and say, hey, you can't do that, they could not just prevent China from receiving new machines, but significantly hamper — maybe even halt — the ongoing operation of ASML machines in China right now. That's a really, really powerful tool.

To your point, there's this question of, okay, the US comes in and imposes this rule. The rule essentially says that if a product is made with even the slightest sliver of American technology, it falls under the rule; in other words, the US can tell you, as a blanket matter, that you're not allowed to export to the firms on this list that they're making — or expanding, rather. So essentially we're in a situation where it seems like the US might be banking on these companies just adhering to those constraints voluntarily, in a sense, without having to impose the rule — or they're okay with the FDPR not actually being applied. It's a little unclear how that's going to fall out. But just to give you an indication of how much of an impact this is having on ASML's bottom line: their shares went up 11%, and Tokyo Electron's went up about 7%, on this news. Essentially, the market feels that the value of China to ASML is so significant that the US declining to invoke this rule is worth more than 10% of the company in expected value — with no guarantees, of course, that these constraints will in practice be followed, because the companies might be having their arms twisted to do this pseudo-voluntarily. We'll have to see how it plays out, but this is an interesting next move. We talked last time about how ASML and Tokyo Electron were pushing back really hard against Washington's push to invoke the FDPR, and trying to pressure their respective governments to push back too. But you've also got American companies at home who, because they're in US jurisdiction, are already forced to adhere to these protocols — companies like Applied Materials, Lam Research, KLA — and they've been complaining: look, we're shouldering this burden while ASML can ship to China and Tokyo Electron can ship to China; we're facing a huge disadvantage here. They're basically lobbying the US government to make this a unilateral play so that everybody plays by the same rules.
That's all part of the political and economic game being played here, and we're going to have to wait and see what, in practice, is the effect of the US seemingly choosing not to invoke this rule at this point.

And to provide some context: ASML, as you said, provides the hardware necessary to make cutting-edge chips — the latest generation of the smallest chips, which is necessary to make Nvidia-quality GPUs — and these machines cost, I think, about $400 million. They are not easy to make; ASML is, to my knowledge, the only provider of the latest generation of these extreme ultraviolet — EUV — lithography machines. So this is a very big deal to think about. And the US does have some leverage, in the sense that, as we've covered, there's been a lot of investment in domestic chip production, on the order of billions of dollars, so these companies presumably do want those new efforts to also buy from them — TSMC is partnering with some US companies on building plants in the US. So there are very interesting dynamics going on here. Maybe a bit less interesting if you're not into geopolitics and hardware production, but for us, very interesting.

And finally, the last story of the section — business taking up half this episode, surprisingly — is that OpenAI has reportedly led a $60 million round for webcam startup Opal. This company develops consumer electronics for high-end webcams: these are $150 webcams; they have one called the Tadpole that's designed to clip onto laptop monitors — really small, kind of pretty, and very high resolution. Now, that is not an AI product, so the suspicion is that they will now work on an AI product, given the investment from OpenAI.

Yeah, apparently so. Opal has actually done some AI-integrated devices in the past, so this wouldn't be completely out of left field for them — not that it tells us anything about what they're planning to do next. Prior to launching their latest thing — that $150 webcam called the Tadpole, their flagship product — they had a webcam called the C1 that was twice as expensive and used some AI tools on board to optimize the quality of the video it captured and to do background blur effects, that sort of thing. That's about as close as it gets to AI natively at this company, so that leaves us with a lot of open questions — those don't seem like obviously OpenAI-flavored things, but we'll see what ends up being made of this.

Right, and it again follows a trend: Sam Altman in particular seems to be a big believer that AI-powered hardware is important. We've seen investment in Humane, of the infamous Humane AI Pin, and here's another example of investing in an AI device. We've also seen reports of the designer of the iPhone, Jony Ive, possibly working with OpenAI on another AI consumer device. So far it hasn't really worked out; we'll see if one of these companies finally nails it.

And we are now done with business, moving on to Tools and Apps, starting out again with OpenAI: the story is that OpenAI has cut GPT-4o prices again and has also launched structured outputs. This is a big price reduction — it makes the model 50% cheaper for input tokens and 33% cheaper for output tokens — and this comes after a previous price cut, with GPT-4o already being the cheapest model on the market of those usable by companies.
That's a big deal because, I think, the main source of income for these companies now is the users of the API — the customers making use of these models in their own products — so it certainly matters that they're able to do this. Structured outputs means you can produce more, let's say, programmatic outputs that are easier for programs to parse — things like spreadsheet-formatted data, database-formatted data, etc. We've seen that already launched by other companies, so it's not fully novel, but it's certainly an important capability.

Yeah, and I should mention too: the closest comp to this is probably Gemini 1.5 Pro, which, rather than OpenAI's $2.50 per one million input tokens, is $3.50 per million input tokens, and rather than OpenAI's $10 per one million output tokens, is $10.50 per million output tokens. So OpenAI is just trying to sneak under that threshold to be the cheapest model — mission accomplished relative to Gemini 1.5 Pro, in any case — but obviously those prices are going to keep going down as hardware gets better and software improves.

One of the cool things that came with this was a blog post where OpenAI talked about how they actually got the model to produce these structured outputs. For a little context: one of the key structures you might want to work with is something called JSON. JSON is roughly the format you put data in if you're going to move it around the internet — say you have some sort of web service and somebody wants to ping it to get data they're going to show on their website, you might package that data up as a JSON file, or basically share it in that format. JSON is just a way of formatting data: you have curly braces, you have a field name, a colon, and so on — some basic formatting rules you have to adhere to if you're going to produce valid JSON.
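For listeners who want to see what this looks like from the developer side, here is a minimal sketch of requesting schema-constrained output through the API, based on the shape of OpenAI's structured-outputs announcement. The schema itself is an invented example, and the exact parameter names should be checked against the current docs rather than taken as authoritative.

```python
# Minimal sketch of the structured-outputs feature described above, based on
# OpenAI's announcement. The schema is an invented example; treat parameter
# names as approximate, not authoritative.
import json
from openai import OpenAI

client = OpenAI()

event_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "date": {"type": "string"},
        "attendees": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "date", "attendees"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Alice and Bob meet for standup on Friday."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "event", "strict": True, "schema": event_schema},
    },
)

# With strict mode, the returned content is meant to parse as JSON matching the schema.
event = json.loads(resp.choices[0].message.content)
print(event["name"], event["attendees"])
```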
OpenAI first tried to just fine-tune their latest model, GPT-4o, to understand some of those more complicated schemas and make sure it produced outputs that matched them. The problem is that no matter how much you do this, your language model's behavior is never going to be guaranteed — it's always going to be a little bit random — so you're going to see some failure rate. What they found was that through fine-tuning, through all these operations on the base model, they were only able to get up to 93% on their benchmark. If you're a developer pinging an API and you want to get back a data package in a certain format, 93% is just not good enough; it's not going to cut it. You need, every single time you ping that thing, to get your data back in a format that's not going to cause everything to crash. So OpenAI said, okay, 93% isn't good enough, we're going to have to figure out another approach. What they do is essentially use a technique to constrain the model's outputs in a fairly blunt way. By default, when you generate an output from a model, the outputs are totally unconstrained — in principle the model could choose any next character, and it just has to be smart enough to choose the right next character — and that flexibility is what leads to mistakes. So what OpenAI ends up doing is saying, look, we're going to have a separate kind of parser that determines, given what has already been output, what the acceptable next characters are that would keep this output a piece of valid JSON, and we're going to force the model to choose only characters that fall into that category. This requires more computing — there's more compute going on in the background to run this assessment and determine that, yes, this is a valid JSON package — and in fact they say that the first time you try to get a result using this API, it's going to take about 10 seconds to process that first request, or as much as a minute for more complicated schemas, just because the system has to figure out that rule set and make sure it's adhered to; things get faster thereafter. I just thought this was really interesting — you can think of it almost as a mix of symbolic reasoning, because you're imposing actual hard, deterministic constraints that restrict what the model can in fact output. So anyway, this is a really interesting tool: if, like us, you're a developer who uses these tools and you need a guarantee of a certain format or structure, this is actually going to unlock some use cases.

Exactly. It's in particular a big deal for companies, for the enterprise customers. We've seen other companies, like Lamini, already offering this kind of guarantee, and as OpenAI has been competing for enterprise customers, I wouldn't be surprised if this came out of the requirements of those types of customers.
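Here is a toy version of the masking idea just described — nothing like OpenAI's actual implementation, just an illustration of "only let the model pick characters that keep the output valid." It uses a trivially simple grammar (a bracketed list of digits) instead of full JSON, and a random-number stand-in instead of real model logits.

```python
# Toy illustration of constrained decoding: at each step, a checker decides which
# next characters would keep the partial output valid, and generation is restricted
# to that set. A sketch of the general idea, not OpenAI's implementation.
import random

VOCAB = list("0123456789,[] ")


def allowed_next_chars(partial: str) -> set[str]:
    """Tiny 'grammar': a bracketed, comma-separated list of digits, e.g. [1,23,4]."""
    if not partial:
        return {"["}
    if partial[-1] == "[":
        return set("0123456789")
    if partial[-1] in "0123456789":
        return set("0123456789") | {",", "]"}
    if partial[-1] == ",":
        return set("0123456789")
    return set()  # after "]" the output is complete


def fake_model_scores(partial: str) -> dict[str, float]:
    """Stand-in for model logits: random preferences over the whole vocabulary."""
    return {ch: random.random() for ch in VOCAB}


def constrained_generate(max_len: int = 12) -> str:
    out = ""
    for _ in range(max_len):
        legal = allowed_next_chars(out)
        if not legal:
            break  # output is already complete
        scores = fake_model_scores(out)
        # Mask: only consider characters the checker says are legal right now.
        out += max(legal, key=lambda ch: scores[ch])
    return out


print(constrained_generate())  # every step obeys the grammar; a real system would also force the closing "]"
```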
Exactly. It's in particular a big deal for companies and enterprise customers. We've seen other companies like Lamini already offering this kind of guarantee, and since OpenAI has been competing for enterprise customers, I wouldn't be surprised if this came out of the requirements of exactly those types of customers. One final note on the cost-cutting side: the price war hasn't just been between OpenAI and Google, it's also been between inference providers. There are a bunch of these, such as Together AI, that have so far done a lot of the undercutting on cost, and this is probably going to hit them pretty hard; it might become harder for them to stay afloat.

Next, someone we haven't touched on in a while: Apple. The story is that Apple Intelligence could get a $20-plus version. Wow, where have we heard that before? This might be named Apple Intelligence Plus and would offer extra features for a monthly fee. This is still at the "reportedly" stage, we don't know if it's the case, but it seems not unlikely, and it fits the now-universal model: $20 a month for better models and more features. Apparently the challenge they're looking at is that if they introduce a paid subscription, most iPhones can't run the Apple Intelligence capability, so only users with one of the newer phones would actually be in a position to pay, and Apple will have to find a way to convince people to upgrade to cover all that cost. So we'll see. I think Apple is playing from behind here; we're seeing them lean on a lot of third-party products, which isn't necessarily a bad play. There is a world where the value accrues at the hardware level, that becomes the least fungible thing in the ecosystem, and maybe Apple ends up winning without having to play this game. But we can't blame them for trying to spin up this sort of business model, if only so they have a part of the company that responds to those incentives and is heavily incentivized to push forward on AI capabilities and product integration. And by the way, the other subscriptions at $20 per month I was referring to: that's OpenAI, that's Anthropic, that's Google, that's Perplexity. It seems like now you have to go at 20; you can't go above it, just because everyone else is going for 20.

On to just a couple more stories, a bit more quickly. We go back to Amazon, and the story is that Audible is testing an AI-powered search feature. It's supposedly called Maven and will help users find specific audiobook titles using natural language. They are also playing around with some other features, like AI-curated collections and AI-generated review summaries. Probably not too surprising. This is currently only available to a select group of US customers and is limited to a subset of the audiobook library, so they're just starting to roll it out, but it's likely a feature that will continue to expand. Yeah, we don't know what models are powering this feature, so there's a lot of uncertainty here. A spokesperson said that Maven is going to leverage, quote, "the strengths of multiple models" and that they will, quote, "continuously evaluate as models improve," which obviously makes sense, but it's clear they're not tying themselves to any one model in particular. The interesting thing is that this is an Amazon-owned company, so if they don't end up using Amazon's internal models and instead use, say, OpenAI models, that could be seen as an interesting sign. And there has apparently been a recent report that thousands of AI-voiced audiobooks are being listened to by Audible users; they add that as a side note in this article, and I thought it was noteworthy, because it tells us something about the actual market value of this tech at this point. So lots of jobs under the gun these days, it seems.

Next, a very similar story: Reddit is going to test AI-powered search result pages. This is seemingly coming later this year, and these AI-powered search results will provide users with AI-generated summaries of the top search results. I guess that's everyone now, right? We've got Google, we've got Bing, Perplexity of course, that's their whole product, Audible is seemingly doing this, and now Reddit is also jumping on the train. Perhaps not surprising given that Reddit partnered with OpenAI back in May, so it's a pretty clear application of ChatGPT there.

On to research and advancements, and the first story is an exciting one; I think it's generated the most buzz in researcher and technical AI circles. The title of the paper is "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters." Basically this is about inference-time, or test-time, compute: after training, when you actually run your model, how much compute do you allocate to that? We've seen many ways you can allocate more compute; you can run additional models and combine their outputs, for instance. Here they analyze two mechanisms: you can search against dense, process-based verifier reward models, which is not dissimilar in spirit to the OpenAI structured-output technique we just discussed, and you can also update the model's
distribution over the responses using that reward model. So the story is that the compute-optimal strategy can apparently improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline, and in a FLOPs-matched evaluation — an evaluation with the same total amount of compute — they found that even when you use a smaller base model for a problem, if you use the right test-time compute, you can outperform a model up to 14 times larger. Yeah, this paper is, I think, badly needed; it's the paper I personally was waiting for for a long time. We've talked on the podcast a lot about this idea of a kind of exchange rate between the compute you use during training and the compute you use at test time, or inference time, depending on how you want to phrase it. There have been a lot of papers showing you can trade them off: you can choose to invest your marginal FLOP in training the model more, or in making the model think harder after it's been trained, focusing on the particular problem, the particular prompt you've just fed it. And what we're learning here is that the picture is actually quite complicated. Some of this research has been contradictory, by the way: some papers on this trade-off say, oh yeah, it works great, there's a kind of blanket exchange rate between the two, and others say, well, actually this works really poorly, especially on complex logical tasks. So the question they're trying to answer here, in part, is which of those stories is true, how do they break down, and what does it mean to scale test-time compute optimally. When we think about scaling training-time compute, it's easy to picture what that involves: train the model on more data, say — you're just spending more computing power, more FLOPs, to dial in the weights of the model. But what does it actually mean to scale up test-time compute? What does it mean to get a model to think harder, if you will, when contemplating a particular prompt at inference time? There are a couple of different strategies, and the first thing this paper has to figure out is how to identify strategies that achieve this. One naive one is called best-of-N sampling: you give the model a prompt, it generates N outputs — a bunch of different outputs, in parallel, responding to the same prompt — and then you find a way to select the one that scores highest according to some reward model or verifier model, which reviews them and goes, okay, that's the best answer, and that's the answer you use. So you can see how that does involve putting in more compute at inference time: it gives you more candidates, you sort through those candidates to pick the best one, and you can see how that would result in a better output, just by investing more at test time.
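As a minimal sketch of that best-of-N baseline — `generate` and `verifier_score` here are stand-ins for a language-model call and a reward/verifier model, not the paper's actual code:

```python
def best_of_n(prompt, generate, verifier_score, n=16):
    """Best-of-N: sample N candidate answers (ideally at some temperature so
    they differ), score each with a verifier / reward model, keep the best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(prompt, c))
```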
That's not the only strategy you can use, though. What they do in this paper is use best-of-N sampling as the benchmark that all their strategies get compared against, but the strategies they explore most deeply involve taking — because this is Google — a fine-tuned version of the PaLM 2 model, a Google language model, and fine-tuning it to revise incorrect answers to math problems pulled from the MATH benchmark. The idea is: make a model that takes in a prompt for a math problem we're trying to solve, generates an output, and is fine-tuned to then look at that output and revise it, to try to make it a bit better once the whole output is produced. That's one strategy. The second strategy is instead to look at the correctness of individual steps along the reasoning chain using what's known as a process-based reward model — we've talked about those before. So you've got the idea of doing a wholesale revision of a fully formed response, compared to doing a step-by-step correctness verification process. You can do both of these more and more thoroughly by investing more and more compute: you can revise an incorrect answer N times, and you're using roughly the same amount of compute as you would to generate N different responses. So it's an interesting question which approach works best, and the answer turns out to depend on the kind of problem you're trying to solve. If you're solving a relatively simple problem — and by simple, what I mean is a problem the base language model can already more or less solve in one shot, with low but okay results — then you'll get better results from doing N revisions of an initial answer than from doing N attempts in parallel. That makes sense: the base LLM is already doing okay, so it has enough knowledge to get a lot of juice out of revising that answer; you're in refinement mode, if you will. Whereas if a problem is harder, especially problems that require searching over a lot of different high-level approaches to problem solving, then resampling new responses from scratch — basically re-prompting, and doing an independent tree-search style approach — turns out to work a lot better. So they use these to create what they call a compute-optimal strategy, which is adaptive. What that means is: give me a prompt, and in a dynamic way the system first assesses, okay, what kind of problem is this? Is this one of those easy problems where I can get a decent initial solution and then iterate on it, or is this a harder problem where I need to try a lot of different solutions in parallel, using strategies that in some cases actually look a lot like AlphaGo? It sorts prompts into those buckets and then implements the solution using the corresponding technique, and they reduce the compute requirements by about 4x, fourfold, relative to their naive baseline.
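Very roughly, that adaptive, compute-optimal policy amounts to something like the sketch below; the difficulty estimate and helper functions are placeholders for illustration, not DeepMind's implementation:

```python
def compute_optimal_answer(prompt, budget, estimate_difficulty,
                           generate, revise, sample_in_parallel):
    """Spend a fixed test-time budget differently depending on how hard the
    problem looks: sequentially revise one answer for easier problems,
    sample/search many independent answers for harder ones.
    All helper functions here are placeholders."""
    if estimate_difficulty(prompt) < 0.5:      # easier: base model is already close
        answer = generate(prompt)              # one initial attempt...
        for _ in range(budget - 1):            # ...then spend the rest refining it
            answer = revise(prompt, answer)
        return answer
    # harder: breadth over depth, e.g. best-of-N or a tree search over approaches
    return sample_in_parallel(prompt, n=budget)
```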
This is really, really interesting, and it has a lot of implications for the future of AI, because if you're going to take a system and have it run as an autonomous agent over long periods of time, that implicitly means you've already trained the model; you're now looking at test-time compute, finding ways to optimize how the model solves problems in real time. The last thing worth flagging — this wasn't mentioned in the paper, but I think it's a trade-off worth keeping in mind — is that training compute is in a sense better because it's a one-time expense, so it's not quite a one-to-one exchange: if you pre-train a model to the point where it can solve a problem in one shot, that's an advantage the model retains for the next problem you try to solve, whereas any compute you invest at inference time — unless you have some persistent memory you can store intermediate results in — is sort of lost forever once the problem is solved and the session ends. Other than that, though, I thought this was just a really cool paper; it's the paper I've been waiting for in so many different ways, so it's really exciting to see it out there. Definitely. And to that comment about AlphaGo, a little more context: the naive view of how these models work is that you have a large language model, you give it an input, it produces a corresponding output, and that's your answer. In practice, the way LLMs are run can be tuned in different ways. When you generate a sentence, you probably want to be able to look ahead and think about what you're going to say later — if I say this word next, will it make sense in the context of what comes after it? That's something called beam search, where you explore multiple possible continuations and keep looking at what could come after each. And then, similar to AlphaGo, you ask: what is the final outcome, does the final outcome look correct? That's where you can use that PRM, the process reward model, to explore and evaluate every possible step, very much like what AlphaGo does. So yeah, a very important paper coming from UC Berkeley and DeepMind; I like to see universities still being impactful. In this case it was a person from UC Berkeley doing an internship at Google DeepMind, so perhaps technically it's fully DeepMind, but universities are still the place where the researchers who go to DeepMind and Meta and so on originate from.

And speaking of DeepMind, the next story is again from them, and it's on robotics this time, not large language models. The title is "Achieving Human Level Competitive Robot Table Tennis," and the claim is that they present the first learned robot agent that achieves amateur human-level performance in table tennis, in ping pong. It uses a hierarchical and modular policy architecture — so going back to something like reinforcement learning — with low-level controllers that implement detailed skills and a high-level controller that chooses among the low-level skills. Nothing new there; it's a common architecture: you decide what you want to do at the high level and figure out how to do it at the low level. They also use zero-shot sim-to-real transfer, so you train in simulation and then use it in the real world without any real-world fine-tuning, which makes a lot of sense for ping pong: you can simulate the physics very well, and you can train it against itself, against another AI, pretty easily. So maybe a less impactful paper on the surface, but it's important to keep training robots and keep transferring from simulation to the real world. Now, performance was evaluated over 29 matches against human players, and the robot still only won 45% of those games overall, although it did win 100% of matches against beginner players and 55% against intermediate players, and it lost everything against advanced players. So we'll see — maybe if it trains some more it can beat those better players, although I'm pretty sure it's not going
to go Olympic level — not in time for the end of the Olympics anyway, or maybe they've already finished, I'm not tracking. Yeah, it's interesting. One of the cool things about ping pong — you said it — is that it's a very constrained environment, so you're not going to have that many factors that take you by surprise; it's not that easy to go out of distribution. And in fact, in the paper they trained it — or tested it, rather — in out-of-distribution contexts with new players it hadn't seen before. In a sense that's out of distribution, but the distribution is: you're at a ping pong table, there's a net, there's a table, there's a racket; everything is fairly contained and constrained. So in a way I think this is an interesting way to learn about generalization, because you have such a favorable context for it. The failure modes you do see might be a bit more obvious, and you can have physics simulation engines that more accurately capture what's going on, which (a) is part of what's allowing them to succeed at this out-of-distribution generalization in this case, but (b) will let you detect cases where out-of-distribution generalization fails and make it easier to interrogate those failures. So from an academic standpoint I think this is actually quite interesting as well. I'm not going to be playing ping pong against a device like this that can kick my ass, but it looks damn cool in the images. Yeah, there are videos of this, and as always with robotics, at the very least you get very fun videos, a little less boring than language models, let's say. It's just very fun to see robots doing stuff, and in this case even more fun seeing them play table tennis. If you look at the robot — yes, it would definitely destroy me at table tennis.

Next up, back to neural networks: the paper is "Self-Compressing Neural Networks," and as the title says, the focus is on reducing the size of neural networks, which is a big advantage in making them cost less and consume less power. That's been a real trend: with big LLMs we've seen 7-billion and 2-billion-parameter models, typically produced via distillation. Here they propose a method called self-compression, which aims to remove redundant weights and reduce the number of bits needed to represent the remaining weights — quantization, also very popular — and they say you can maintain the same accuracy with as few as 3% of the bits and 18% of the weights remaining in the network. I don't feel it's necessarily an overly novel approach or result, but it's a big, important trend and an important result on that front. Yeah, for me at least, this was a really cool paper. One of the things they do that's distinct here: when you train a neural network nowadays, you get the model to make a prediction of some kind — if it's a text autocomplete system like GPT-4, you're predicting the next word in the sequence — and based on whether you were right or wrong, you update the values of all the weights, all those parameters, those numbers in the neural network. That lets you do a really good job, but what if not all of those weights actually (a) need to be there, or (b) need to be represented at full resolution, full floating-point precision? It's often the case that we
want to, as you said, quantize — we want to reduce the resolution we use to represent the weights in the neural network, if only because it makes it easier to fit these things on edge devices, it reduces the cost of inference, and so on. The challenge is that historically you've had to choose: are we training a neural network with weights that have this many bits of representational precision, or that many? You lock it in, then you train your model. What these folks are saying is: why don't we also make the resolution — the precision of the representation of these weights — trainable? Why not make it so the model, over time, learns not only the values of the weights but also the level of precision with which each weight needs to be represented? The really cool thing is that over the course of training it will dynamically learn, okay, take weight number 112: maybe I can represent it with four bits — still works pretty well; what about three bits — still works pretty well; and it can go all the way down to zero bits, and when that happens you just remove the weight. So baked into this process is a natural weight-pruning mechanism, essentially implied by it. And that's what they find: automatically, just by making the representation precision trainable, the model prunes weights it doesn't need and reduces the number of bits needed to represent the remaining weights. One consequence of this, and this is really cool, I've never seen a plot like it: as you train the model from one epoch to the next — an epoch of training is one full pass over your whole training set — the amount of training time per epoch actually decreases, because the complexity of the model is dropping over time. We're losing those weights, we're reducing the bits used to represent the weights in the network, so the model gets cheaper and cheaper to train as training proceeds. It's a really interesting strategy; you can think of it as the model dynamically re-architecting itself. It's kind of deciding, should I be a convolutional network today — that's an extreme example, but should I look like that? Okay, if so, I'm going to learn that these weights need to just disappear and these ones don't, and so on. So this takes away some of the responsibility of the developer to specify the full architecture of the model, in addition to giving you this really interesting additional degree of freedom. They use an interesting strategy to create this so-called quantization-aware training process, QAT, based on a paper Yoshua Bengio put together back in 2013. It's really cool; I just found this so, so interesting.
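For a rough feel of what "learnable precision" could look like in code, here is a toy sketch in the spirit of quantization-aware training with a trainable bit depth; this is an illustration of the idea, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class SelfCompressingLinear(nn.Module):
    """Toy linear layer whose per-output-channel bit depth is itself trainable."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bits = nn.Parameter(torch.full((out_features, 1), 8.0))  # learnable precision

    def forward(self, x):
        bits = torch.relu(self.bits)            # precision can't go negative
        scale = 2.0 ** bits                     # finer grid when more bits are kept
        scaled = self.weight * scale
        # Straight-through estimator: forward pass uses rounded (quantized) weights,
        # backward pass treats rounding as identity so gradients reach both the
        # weights and the learnable bit depths.
        rounded = scaled + (torch.round(scaled) - scaled).detach()
        w_q = rounded / scale
        w_q = w_q * (bits > 0).float()          # bit depth learned down to zero => pruned
        return x @ w_q.t()

    def size_penalty(self):
        # Added to the task loss so the optimizer is rewarded for shrinking bits.
        return torch.relu(self.bits).sum()

# Sketch of the training objective: task accuracy plus pressure to compress.
layer = SelfCompressingLinear(32, 16)
x, target = torch.randn(4, 32), torch.randn(4, 16)
loss = nn.functional.mse_loss(layer(x), target) + 1e-3 * layer.size_penalty()
loss.backward()
```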
One challenge with this kind of strategy — not architecture, strategy — is that it's somewhat short-term greedy. If your model is given a batch of training data and, just for that batch, it turns out a particular weight isn't needed, the model might get rid of that weight, but maybe it was only in that batch that it wasn't useful, and in the next batch it turns out it would have been. So there's a risk you discard things a little too eagerly; I forget the term they coin for it in the paper — something like permanent forgetting. Anyway, I just thought this was really interesting, and it could be scalable; I don't know. The big question, as always with these new paradigms, is how scalable it will really be, but just as a concept it's one of those beautifully simple things that makes you go, why didn't I think of that before? But that's the beauty of human — for the moment, human — innovation. Definitely. They do compare to a 2022 paper that had a similar idea of figuring out the right bit depths during training; they have a couple of modifications, including being able to round all the way down to zero and so remove weights, and in the comparison they seemingly get much better performance — they prune a lot more and therefore train a lot faster. Although, as you said, it's unclear whether this actually scales: they test it on a relatively small model from back in 2018, on a small dataset, CIFAR-10. Not surprisingly, this isn't from DeepMind or Meta; it's from two researchers at a place called Imagination, and I don't imagine they have a ton of compute, but I wouldn't be surprised if there's follow-up research from places like DeepMind. Yeah, and I do think that rounding-to-zero piece is, in a sense, the part that makes it so the model dynamically re-architects itself, because when we think about model architectures, a big part of it is deciding which weights truly go to zero; that one little innovation, I think, is behind a lot of what makes this thing so interesting. But you're right, we've got to see, and also, thinking about fine-tuning downstream, that could get a lot more challenging if fine-tuning would otherwise have revealed the need for weights that had previously been discarded. So I think it's wobbly in some ways, but really interesting; to your point, let's see if it scales. And for some more context on what makes it special: typically these kinds of things, compression and quantization, are done after training. The ideas aren't novel — in fact it's very common these days to quantize to four bits, three bits, even two bits to fit your models on less compute — but you do that after training, whereas here it happens during training.

Just a couple more research papers. The first one is actually related to structured outputs, as we covered with ChatGPT. It's titled "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models." They're asking: if you limit an LLM to output in a certain format, as opposed to letting it answer however it wants, does it perform as well? Does it actually solve your problem as well as if you hadn't constrained it to a certain output structure? And, perhaps surprisingly, they observe a significant decline in LLMs' reasoning abilities under structured format restrictions. To be clear, one of the ways they test this is to compare prompts on a problem like: Eliza's rate per hour for the first 40 hours she works each week is $10, and she earns more per hour after that; if Eliza worked 45 hours this week, how much are her earnings for the week? The first prompt says to reason step by step and give a final answer; the second prompt says to output the answer in JSON, although you can include step-by-step reasoning inside that JSON. So they seem not too dissimilar, but in practice the constrained version performs a fair bit worse — a very interesting and practical result.
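For instance, the two prompting styles being compared look roughly like this; the wording is paraphrased for illustration, not copied from the paper:

```python
# Style 1: unconstrained reasoning, answer extracted afterwards.
free_form_prompt = (
    "Solve the problem below. Reason step by step, "
    "then give the final answer on the last line."
)

# Style 2: the model must emit a JSON object directly.
# The paper finds this kind of format restriction measurably hurts reasoning.
json_constrained_prompt = (
    "Solve the problem below. Respond only with a JSON object of the form "
    '{"reasoning": "<your step-by-step reasoning>", "answer": <number>} '
    "and output nothing else."
)
```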
Yeah, and I think one of the immediate implications of this is that you want to break the problem-solving and the formatting into separate steps. Generally that's not going to be very computationally expensive, because the formatting is a pretty simple operation — you could probably even do it symbolically in different ways — but in most cases you'll want to first ask your AI system to generate a step-by-step reasoning chain and then convert it into some kind of JSON or XML or whatever you're trying to ship. They show some interesting side-by-sides for different models and different formats: they compare XML to JSON to a whole bunch of other things, and they look at what happens when you force your model to generate outputs in those formats while the problem you're trying to solve takes different forms — what if it's a math problem, what if it's a logical natural-language problem, what if it's a problem that involves identifying the last letter in a sentence, things like that. And you see that the task and the format both play important roles. I would not have expected that asking for the result in XML versus asking for it in JSON would make one a lot easier for the model than the other. That's kind of interesting; it suggests that some of these formats come with what you might think of as a higher or lower formatting tax, a computational tax at test time, that differs between formats and causes models to stumble when they go to solve the logical-reasoning piece, because they've invested so much of their capacity into just the formatting piece. So yeah, I thought that was pretty interesting, and it's another example, I guess, of Moravec's paradox, where things that are really easy for humans are sometimes hard for AIs and vice versa. This is a pretty counterintuitive thing; it definitely wouldn't have occurred to me that this would be the case. And on average it appears that XML is the best format, which to me is surprising — I would have expected it to be JSON or something like that. Yeah, it's also kind of sad; nobody likes XML, so why is that the good one? I don't understand what the hell is going on, because JSON to me is conceptually just friendlier, more intuitive to use. My guess is that JSON is less flexible, so there are more restrictions — it has to be all dictionaries — whereas with XML you can do a lot more, so that could be the reason.

And the last paper is titled "Berkeley Humanoid: A Research Platform for Learning-Based Control." This is a new, reliable, and low-cost mid-scale humanoid research platform, built in-house, specifically designed for learning algorithms, with low simulation complexity, anthropomorphic motion, and high reliability across falls — very important for humanoid control. They say there's a narrow simulation-to-reality gap, so you can train in simulation and then deploy in the real world, and they're able to use reinforcement learning to control the
robot. So there you go, another humanoid robotics story; once again it's pretty fun to look at the videos.

Moving on to policy and safety, we're going to start with — I guess you'd call it a story; it's really a Twitter thread and then a report, but all the most salient stuff, I think, is in the Twitter thread. There's this company called METR, an offshoot that was formerly known as ARC Evals, where ARC stood for the Alignment Research Center. There still is ARC, the main org, which was founded by Paul Christiano, who used to head up alignment work at OpenAI back in the day. So this is a very, very talented org, and I believe a well-funded one, that does audits of language models; they famously conducted the original GPT-4 audit that showed GPT-4 could, for example, convince a human to solve a CAPTCHA for it, all that jazz. So here we have METR coming back: they've done a bunch of audits of Anthropic models and of OpenAI models, and they're introducing us to their approach, their strategy. A lot of what they focus on is these autonomy evals, these survival-and-flourishing-type evaluations: basically seeing whether an agent can, for example, self-replicate, exfiltrate its own code or its own weights, run or spin up another agent, things like that. What they're basically saying, from their initial results looking at Claude 3.5 Sonnet and GPT-4o, is that these models — if they're turned into agents with some basic scaffolding — can complete a good proportion of tasks that are similar to what humans can do in about 30 minutes. There are all kinds of exceptions to this, but these models are able to perform a good chunk of tasks on the order of what humans can do in 30 minutes. They have a great plot showing that they use the amount of time it would take a human to perform these tasks as a kind of waypoint to gauge how effective the agents are. They've got a whole new suite of autonomous-capability evaluations that we don't have to get into, but the broad areas are cybersecurity, machine learning, and software engineering — all the things that would be involved in exfiltrating, in breaking out, if you were an AI agent trying to, well, take over the world; that's kind of the idea here. So they look at that, and they focus essentially on scaffolding improvements to language-model agents that work within a fixed token budget and a fixed compute budget, and they see how well an agent can do when it's trying to perform one of these tasks. An interesting result — back to this question we've just been talking about, inference-time or test-time compute — is that they experiment with what happens when you increase the token budget, roughly the compute budget, of these models at test time as agents, and what they find is a steady increase, a scaling law of a sort, that starts to plateau as you get to around, in this case, 100,000 or so tokens. So you get diminishing returns beyond a certain point with current models; I thought that was really interesting. It's also noteworthy that the scaling law seems to start to plateau independent of the scale of the underlying model — at least that's what it looks like from here. Okay, so —
sorry, I should be more specific: it looks from this as if you actually could keep increasing the fraction of tasks completed simply by scaling the model. You have to look at the curve, but they have a curve showing that the fraction of tasks completed as you increase the token budget goes up faster for larger models, so you could keep increasing the size of the underlying model and — presumably, if the scaling law just continues — get some decent success. Interesting to see how far that goes. One of the results they also share, which I'll add as the last note here: when an agent can do a task, it typically does it at about a thirtieth of the cost of the median hourly wage of a US bachelor's degree holder. That's an interesting result. Claude 3.5 Sonnet as an agent, for example, fixed bugs in some library at a cost of under two bucks, whereas a human baseline took over two hours; there's obviously a lot of variation there. We've seen versions of this before, especially in the context of concerning capabilities like offensive cyberattacks, where we saw that GPT-4 — GPT-4 Turbo, that is — could automate the discovery and exploitation of one-day and zero-day cyber vulnerabilities, in both cases with high success rates and at or below the cost of a human cyberattacker, if you will. So this is kind of interesting as these things reach that economic escape velocity. But yeah, ARC — sorry, METR — doing a great job of laying the results out here. A few more details: they have 50 automatically scored tasks. On the simple end you have things like converting JSON data from one structure to another, very easy; on the harder side, writing CUDA kernels to improve Python performance, or training a machine learning model to classify audio recordings. And they evaluate this against 200 people who have STEM undergraduate degrees and about three years of technical work experience — so, let's say, pretty capable people — to solve the tasks. On the most difficult end are tasks that require 16 to 64 hours; I will say those are machine learning and software engineering tasks, so maybe my job is secure, who knows. They do also have a lot of tasks here that take four to 15 minutes or 15 to 60 minutes, which, to be fair, is not really what senior software engineers do; that's more junior kinds of work.

Next story: we are again covering responses to California's SB 1047 AI regulation bill. This time the response is from the godmother of AI, Fei-Fei Li. We've seen responses from many of the most important figures in AI — Andrew Ng, Yann LeCun — all opposed to the bill, and now Fei-Fei Li has also come out and argued that this AI bill is wrong and will hurt the AI ecosystem in the US, by, she says, harming the public sector, academia, smaller tech companies, the open-source community, and even academic AI research, because they face more potential liability. There is a liability clause that would hold responsible not just the party misusing a model but also the original developer, and there are other requirements: they mandate a kill switch for models over a certain threshold, which could deter developers from releasing big models and would especially impact the open-source community. So yeah, a lot of opposition to this bill, which is very much concerned with AI safety. Yeah, and I think if we're going to talk about the opposition, we
obviously have to also talk about the people in favor, and arguably even more high-profile AI researchers, including academic ones, have come out in favor of it, including Yoshua Bengio and Geoffrey Hinton, who very prominently said this bill is actually well designed. There are all kinds of views here, and it's sort of the usual camps: Fei-Fei Li and Yann LeCun — okay, no surprise they're going to be against any approach that takes existential risk seriously; Geoffrey Hinton and Yoshua Bengio — okay, no surprise they're going to be in favor of approaches that propose things like licensing, compute-based thresholds, kill switches, and so on. There's really nothing new under the sun here; Fei-Fei Li's position has been consistently this for as long as anyone can remember. One of the interesting things is that the position piece Fei-Fei Li has here doesn't really touch on the catastrophic-risk argument that's at the core of the motivation for this bill. She talks about other issues — this would be bad for ancillary considerations like academia and so on — and that's valid, that's important, but the challenge is that if you want to be heard, if you want a constructive dialogue, you also have to engage with the reasons the bill is being proposed in the first place, and there are real reasons it has been proposed. Unfortunately, I think there's also a lot of pork in there; it's trying to tick all the boxes and get everybody on board — there's stuff about workplace displacement and things like that — and it's not as targeted as it perhaps ought to be on the core issues that motivate the bill, which makes it possible for people like Fei-Fei Li to write this article and circumvent that entire argument at the core of the bill. Look, the reality is that a lot of the objections we've heard skip over the fact that there's a very significant cutoff — $100 million — required before you're even in the scope of the regulation. The model has to have a $100 million training budget or above, and the thesis there is: if you can afford a $100 million budget, you can afford regulatory oversight of the sort the bill asks for. We can have arguments about whether the underlying threat model is accurate, but those have to be debates about whether the underlying threat model is accurate. The idea that we're just going to say, okay, well, it would be bad for all these reasons, fails to account for the pros and cons that any reasonable policy discussion has to include. So I think this is a bit disappointing as a write-up — not too surprising, because Fei-Fei Li is not like Yann LeCun — and, you know, this is my own bias showing, of course; everybody who's tracking the show knows I've done a ton of work in this space. There are a lot of really compelling and interesting arguments for the more catastrophic end of the risk spectrum here. It's not a guarantee, but we have to deal with uncertainty in all WMD circumstances, including nuclear war, including chemical weapons, and that doesn't mean we do nothing; it means we have to have a robust discussion, and unfortunately I think we're not talking about the
pros and cons together here. I think we're missing an opportunity to have a more productive dialogue, if you will. Good points, and as you said, it's unsurprising given that Yann LeCun, Andrew Ng, and to some extent Fei-Fei Li are more on the industry side of AI development, more on the big-company side — and wow, big companies oppose regulation, who would have thought — whereas Yoshua Bengio and Geoffrey Hinton aren't affiliated directly with any companies, and in some cases aren't in Silicon Valley or the US, so to be fair, that does matter. If I were to quickly give my take: this is probably a fairly reasonable bill, although a few of these things, like holding the original developers of the AI model liable, do seem a little unreasonable to me. In some ways you want to hold them accountable — certainly the EU AI Act puts restrictions on certain categories of risk, and I think that approach makes a lot of sense — but it's unclear to me, at least for now, how this bill addresses that point, and open source is certainly a much trickier question in terms of how you regulate it. Yeah, actually, to that point, it's maybe worth noting: the bill, to my understanding, is saying, look, you're going to be held liable for catastrophic harms — for failures of process and failures of outcome — so if your model leads to a catastrophic outcome, you're going to be liable for that, and for not doing certain things before that happens. One of the positions that labs like Anthropic have taken — and I'm very curious to hear their actual justification, because they are very safety-minded and safety-conscious — well, I've talked to a lot of researchers, including researchers at Anthropic, who don't necessarily agree with that position. The idea being that if we're going to wait for a catastrophe to unfold and only then hold companies accountable, that's a little concerning, especially given the scope of the catastrophes that labs like Anthropic themselves seem to consider quite plausible. If you're talking about WMD-scale harms, saying, hey, we're not going to have a process in place that you're held liable for upholding, but don't worry, we'll deal with it if the catastrophic outcome actually happens — that's a little tough if you're talking about potentially millions of lives that could be under the gun here. So that's been part of the back-and-forth. I certainly understand the Anthropic argument as well: hey, we need some latitude, we need a bit of a safe harbor, we need guarantees so we can continue to do our work. But on that core argument, I've just heard enough opposition among people who would normally agree with Anthropic on a lot of the work they're doing to be very interested, let's say, to hear more at this point. Right, and on that topic, Anthropic's statement was that SB 1047 has substantial drawbacks that harm its safety aspects and could blunt America's competitive edge. They argue the bill should focus on frontier AI safety and move away from approaches that aren't adaptable enough for a rapidly evolving technology, and they also want the bill to shift to outcome-based deterrence as opposed to pre-harm enforcement, meaning AI companies would develop and deploy safety protocols and be held liable for
catastrophes they cause. So yeah, a bit nuanced. And the people who do oppose the bill — again, Yann LeCun, Andrew Ng, Fei-Fei Li — are also people who are to some extent dismissive of catastrophic risk and in some cases oppose AI safety to an extent. Yeah, and I think that's the disappointing thing about Fei-Fei Li's article: we're not going to make progress if we just pretend that the core arguments motivating this whole position don't even exist. That seems to me like a fundamental failure to engage with the good-faith arguments that have been put forward on all sides, and we all lose in that situation. There are interesting arguments to be had — we had a whole episode going back and forth about whether this risk set is plausible, and I think there are arguments for and against that are really interesting — but when we don't even have them, we lose an opportunity, and that's a bit unfortunate.

On to the lightning round. The first story is a spicy one: a judge has ruled that Google has monopolized search through illegal deals. It's drama week, Andre, I don't know what's going on. I know, I love it, drama with all the big companies. In this case, the reason they've monopolized search is by making deals with Apple and Samsung to make their search engine the default option on smartphones and web browsers. Obviously this is a big deal; it was a 286-page ruling, and to quote, "Google's distribution agreements foreclose a substantial portion of the general search services market and impair rivals' opportunities to compete." It reminds me as well of the ruling in the EU that the iPhone maker has to let people choose their search provider rather than setting a default, so it's kind of along those lines. As we always say, we aren't lawyers, but if I had to guess, this does seem to be a pretty clear-cut example of the sort of thing that gets cited as anti-competitive: it's not just about whether you're the only player, whether you're a monopoly; it's also about whether you're a dominant player engaging in acts that make it unfair for other players in the market and stifle competition, and you could very well argue that's what this is doing. Yeah, and the claim is that there's been a very concrete impact from all this: the judge says the trial evidence firmly established that Google's monopoly power, maintained by the exclusive distribution agreements, has enabled Google to increase text ad prices without any meaningful competitive constraints. So their claim is there has actually been an increase in prices as a result, and the consumer has lost out. I thought this was fascinating; I'm embarrassed not to have known this lawsuit was going on. Same here. Okay, so it's not just me. So Alphabet shares slid about four and a half percent, and Apple is down 4.8%. The reason Apple is down is that — sorry, although Google is paying Apple for this, that's the whole point — Google is presumably not going to be making its next payment to Apple, assuming this result sticks. Now, there is an appeal process, Google is appealing this, no surprise, and who knows how far that could go, but the consequences of this could be really [ __ ] serious. This is
another reason I was like, what the hell, I wasn't tracking this. Apparently there's a hearing next month where the judge is scheduled to discuss the timing of a separate trial on the remedy — in other words, what they're actually going to do about this, what the consequences will be. The DOJ could apparently demand — and I don't know how plausible this is, how likely it is to materialize — the separation of Alphabet's search business from other products like Android and Chrome, and if that happens it would be the biggest forced breakup of a US company since AT&T was dismantled in 1984. So this is a big, big deal. There are other alternatives, though: the judge could stop short of a breakup and instead just unwind the exclusive search deals, which to me, as a complete nincompoop on this, seems like the lighter-touch approach compared to dismantling the company, but who knows. And then apparently the third option is to require Google to license its search index, the data it uses to build up its results. So there are a whole bunch of alternatives here; I don't know which will end up being realistic or plausible, and of course there's the prospect of appeal, so we'll see if this result sticks. But for the moment, wow — 5% drops in the stocks, maybe an overreaction, and amid a broader market selloff of course, but still, damn, I can't believe I wasn't tracking this. Yeah, to be fair it kind of came out of nowhere: the original case by the Department of Justice was three years ago, and now Judge Amit Mehta has issued his ruling. To the lawyers in the audience: apparently Google has violated Section 2 of the Sherman Act. Oh, the Sherman Act, that's my favorite. I know. And it came out through the trial that Google paid $20 billion to Apple to keep its position as the default search engine, so it makes sense that Apple shares slid, given that's a fair bit of money. Yeah, and the article says this fueled more than $300 billion in annual revenue, largely generated by search ads — I guess that was already publicly known. I was trying to figure out whether this screws Google over in future negotiations if people now know what the deal is worth, but that doesn't seem to be the case; that seems to be their overall total, because that's the only number that would make sense. But anyway, wild.

And speaking of antitrust, the next story is that Amazon faces a UK merger probe over its $4 billion Anthropic AI investment. This is the UK's Competition and Markets Authority, which has begun a phase one investigation into Amazon's ties to the AI research firm Anthropic. It has to do with the exclusivity agreements Amazon has with Anthropic, and the question is whether that is anti-competitive. This follows plenty of antitrust-related action in the EU around things like Microsoft and OpenAI, so it will be interesting to see where this goes.

And the last story, as Jeremy hinted, has to do with the GPT-4o system card. System cards are a thing that's been going on for a while: alongside a model release, you release a sort of standardized overview of the model — things like its capabilities, its safety concerns, training data, et cetera. Now we have the system card for GPT-4o, which comes with safety evaluations and mitigations that I'm
pretty sure Jeremy has taken more of a look at than I have. Well, it's actually quite a few pages — I think it would be about a 30-page read, a big, beefy document, so I don't blame you; it's my job to read these things. Well then, I'm pleased to announce some good news: OpenAI managed to pass their own self-administered and self-designed safety tests with flying colors, so there's that. Actually, I think it's a better result than that sounds: the tests do seem interesting and legitimate. OpenAI has this preparedness framework that they set up a little while ago — Sam Altman seems to have been quite intimately involved in that process — and it involved developing a bunch of requirements, including the system card, which they now put out with all their new models. This one is specifically for GPT-4o, and the biggest newly identified vulnerability with this model comes from the fact that it can respond to audio inputs and generate audio outputs, so you get a whole bunch of new risks that come with that: risks from unauthorized voice copying, risks of it being used maliciously in a bunch of different ways. And of course it has these very quick response times — something like 300 milliseconds, which is comparable to human response time — so you really can imagine these being used in the wild in serious ways. They go over a whole bunch of stuff: how they red-teamed the model, who they hired, and so on. It's a very comprehensive document, along the lines of the METR report we were talking about earlier. A couple of little nuggets. First, how do they make these evals? If you're OpenAI, you've got a whole bunch of text-based evaluations that you developed for previous generations of models that were text-only, and you might think, okay, maybe we can reuse those for the speech-in, speech-out version of GPT-4o — and they actually do that: they use text-to-speech models like Voice Engine to convert their text evals into audio evals, which I found really interesting. There are a couple of issues there, not least of which is that in real audio situations there's a whole bunch of background noise, things that make it messier, and that's not present when you do those sorts of evals. My suspicion is that this is going to result in some really interesting jailbreaks that exploit the fact that these evals, and part of the training, were done on really clean audio — though their training process also appears to include some of this messier data, I suspect that's one thing we'll end up seeing. One possible example of that, which was really interesting: in the section where they're testing unauthorized voice generation, they casually mention a case where a user was interacting with GPT-4o and, all of a sudden, mid-response, the model just shouts "no!" and then replicates the voice of the person it was talking to and has it say some creepy [ __ ] back. What the [ __ ], this is really weird. So that's one of the evals they ran, and it's really funny, if you look at the document, how
subtly they put it: they say voice generation can also occur in non-adversarial situations, such as their use of that ability to generate voices for ChatGPT's Advanced Voice Mode, and "during testing, we also observed rare instances where the model would unintentionally generate an output emulating the user's voice." Apparently, they say in a footnote, some instances of this behavior — which is really weird; you can listen to the audio, it's all over Twitter — are correlated with short, often inaudible voice messages made by the user, which are often produced when users are in a high-background-noise environment, such as using the model hands-free while driving, or simply needing to cough. So I think this is a really interesting case where you're seeing the model break in that context, and it further suggests there may be a jailbreak possibility here using background noise; you're already finding a way past the filters, at least to make it break. Last thing I'll mention: on the evals, they basically say they passed with flying colors — the model is low-risk according to their preparedness framework on everything except one category, which is persuasion, where they say it just narrowly crosses into the medium-risk threshold. For a bit of context: for low risk and medium risk, OpenAI says they're happy to deploy; anything that drifts into high-risk territory, they'll work on reducing to medium or lower before actually deploying. So this is still deployable, but they do a bunch of experiments showing that using the text-based version of GPT-4o, you can change people's political views, if only transiently — about a week later people snap back to where they were before, because humans are humans — but they highlight this as a possible risk. They also test this with the voice, the audio version of GPT-4o; I don't see a result there. I will say, in this context, we've talked to AI safety researchers at a lot of labs, including labs other than OpenAI, and a lot of them are concerned about persuasion becoming a thing fairly soon, before we're able to figure out how to deal with it. Anyway, that was interesting. They also brought in Apollo Research — we haven't talked about them much, maybe a couple of times; I'm a really big fan — another company, like METR, that does evaluations, in their case for deception. They came in and evaluated capabilities related to what's known as scheming in GPT-4o: whether it can do things like model itself, reason about itself as a separate entity, or reason about others — kind of theory of mind — and a whole bunch of other tasks. It showed self-awareness in some contexts, less so in an agentic or applied-agent setting. So some early results, some stuff that starts to hint that maybe we're seeing a little bit of takeoff in some of these dimensions, but still safe enough to deploy. Right, and to OpenAI's credit, they do include a section on third-party assessments, including METR and Apollo, so yeah, a good model card — good job, OpenAI, following up on policy. They didn't say whether you can make the model mimic the voice of Scarlett Johansson, so that's still unknown on that front, but
otherwise lots of clarity. Ah, what a wasted opportunity. Yeah, and that's it — a bit of a long episode, lots of drama, so hopefully people enjoyed it. Maybe I will put out our first mini episode with AI-generated Andre, with maybe five minutes' worth of summaries; we'll see. Please do let us know if you'd like that to exist, and whether you like the episode if it does come out. Otherwise, thank you for listening to Last Week in AI. As always, please do share, please do review — we are at 196 reviews on Apple Podcasts, getting close to 200. But that aside, please do keep listening, and do enjoy our latest AI-generated song.

[Music]