Hello, hello, welcome. It's been a long time. I've got some screens over here, some screens over there, I've got my folding phone. Let's see if anyone still watches this stream. Where have I been? I don't even know. I haven't been streaming, that's true, but in some ways I think the meta of the George Hotz character is played out. I think we're entering a new era; the pendulum is swinging back. So yeah, I don't really know what this stream should be. You can see I'm in a new place: this is my office. It's a shared office, but we're here, we're streaming, and it's Sunday. Mostly what I want to do is show off some of the stuff in tinygrad, and then maybe we can do some work on it. Tons of progress has been made in tinygrad, a lot of commits. I don't really know what to show you. There's beautiful_mnist, which is pretty nice, but it's the same beautiful_mnist that was there before. Actually, the accuracy is a bit higher now because we improved the initialization. If we go in here to nn/__init__ you can see that our initialization is simpler now. When you look at Conv2d it's really just a scaled Tensor.uniform; it used to be Kaiming uniform, but I don't really know what Kaiming uniform is, like if you asked me to just describe it. This scaled uniform is what torch uses, so we just copied torch, and now it's a little bit more accurate, which is cool (quick sketch of that init below). tinygrad has docs, if you haven't seen them. I saw some guy on Twitter who liked how much math there is in the docs; see, all math. Maybe we should try to implement a paper, I think that might be more fun. I don't know how far we'll get, though. "The new tutorial is helpful." Which tutorial? Well, we have the quickstart and we have the MNIST tutorial. The MNIST tutorial is pretty nice. "Where have you been so long?" I went to Poland. Where else did I go? I went to Italy, I went to a lot of places. And it's cool, because tinygrad is a remote company. One of the ways we distinguish ourselves from comma is that, while we do have some stuff in the comma office, tinygrad is fundamentally a remote company. Two of our employees are fully remote, because fundamentally it's a GitHub and a Discord. If you haven't seen our beautiful GitHub, it's github.com/tinygrad/tinygrad. We have a lot of stars. What is tinygrad for? It's going to replace PyTorch, probably. We'll see. Now look, something I say about this a lot: when I was in the self-driving car space, I was competing against largely idiots. Sorry, everyone who worked on self-driving cars, you were mostly idiots, no respect. Whereas the people I compete against here, like PyTorch, JAX, Mojo, MLX, these are people who are very smart, people I respect. So we're playing on a real playing field now. Do I respect the ADAS engineers at Toyota? No offense, but no, you're not good. Oh man, we had a Ford engineer come over, we showed him the docs, and he said, wait, that's the same docs we have. I'm like, yeah, anyone can go to Ford's tech info site and buy them. They're just not on the Pareto frontier of coding.
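To make the init change concrete, here's a minimal sketch of the scaled-uniform Conv2d init mentioned above. It mirrors what torch does (uniform in plus or minus 1/sqrt(fan_in)); the real nn.Conv2d code in the repo is the source of truth, and the Tensor.uniform keyword names here are an assumption about the API.

```python
# Minimal sketch of a scaled-uniform Conv2d weight init, assuming
# tinygrad's Tensor.uniform(*shape, low=..., high=...) signature.
import math
from tinygrad import Tensor

def conv2d_init(in_channels: int, out_channels: int, kernel_size: int = 3):
  fan_in = in_channels * kernel_size * kernel_size
  bound = 1 / math.sqrt(fan_in)          # the same bound torch uses for Conv2d
  weight = Tensor.uniform(out_channels, in_channels, kernel_size, kernel_size,
                          low=-bound, high=bound)
  bias = Tensor.uniform(out_channels, low=-bound, high=bound)
  return weight, bias
```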
The reason the Toyota ADAS sucks has nothing to do with us being better engineers than them or anything; it's not even that. It's just structurally broken, whereas there's nothing at all structurally broken about PyTorch or JAX. PyTorch and JAX have made different tradeoffs from tinygrad, and it's a question of who's going to win, but think about it more like high-level sports teams competing. You can talk about the other team, but fundamentally they're live players. Whereas openpilot's competition, except for Tesla, which isn't really our competition, are not live players. Tesla and openpilot don't share much market; it's just iPhone and Android. But we are, of course, trying to take all of the market share from PyTorch, JAX, and every machine learning library. Oh, let's see: I rafted down the Colorado River. Man, it was 115 degrees outside. It was cool. Well, it was not cool, it was very hot, but it was fun. Hello, we're streaming. Yeah, this is a shared office; you can say hi if you want, he works at tiny corp. Yeah, we have got to make MCTS fast, like 500 nodes per second. How do I parallelize it? Well, I can at least parallelize the rendering, and I don't know what percent of the time is spent on the rendering versus the actual compilation. So what we're discussing here is: I just added MCTS search, so if I put in BEAM numbers greater than 100 it doesn't actually use beam search, it uses MCTS search. What this is doing is searching over tons and tons of kernels; you see it found the best one there at 63. But this is a more high-level tutorial. Okay, I don't know why I read the comments. What do you mean NVIDIA will go open source? NVIDIA's always been open source. Not always, but for the last couple of years they've had an open source driver, and they're switching now to solely the open source driver. Their closed source driver didn't use the GSP. It's just, who am I even talking to? The AMD thing, the amount of complete misunderstandings... wow, I thought we weren't going to rant, I thought we were past this on stream. But okay, SCALE, CUDA on AMD. So this stuff comes out and people get really excited about it. Look, this is totally the wrong approach. There's no reason anyone should be doing this. Fundamentally the question is not "can you run CUDA", the question is "can you run CUDA fast", and the answer is you never will be able to. And I can show you why. I heard there were some docs that addressed this, but I'll tell you why it'll never be good. So here, tinygrad supports both AMD and NVIDIA, and I can show you, where is it, it's not in lowerer, it's in kernel, yeah, here. AMD and CUDA have very different tensor cores (I should probably include the comment for the AMD ones), and because of this you'll never be able to take CUDA code, compile it, and make it fast on AMD. So this isn't really going to help people the way they think it will. You want to be at a level of abstraction higher than this, a level where you can still mess with your shapes so that the tensor cores can be made fast.
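To illustrate the point about tensor cores: the matrix-fragment shapes the hardware MMA instructions want differ per vendor, so a tiling chosen for one is the wrong shape for the other. The shapes below are illustrative approximations (roughly an NVIDIA mma fragment versus an RDNA3 WMMA fragment), not a vendor spec, and the real table tinygrad uses lives in the codegen.

```python
# Illustrative only: why a tiling tuned for one vendor's tensor cores
# does not line up with another's. Shapes are approximate.
TENSOR_CORE_SHAPES = {
  "CUDA": (16, 8, 16),   # rough (M, N, K) fragment shape for an NVIDIA mma
  "AMD":  (16, 16, 16),  # rough (M, N, K) fragment shape for RDNA3 WMMA
}

def tiling_fits(device: str, tile_m: int, tile_n: int, tile_k: int) -> bool:
  m, n, k = TENSOR_CORE_SHAPES[device]
  return tile_m % m == 0 and tile_n % n == 0 and tile_k % k == 0

# A tile hand-picked around CUDA's fragments does not fit AMD's:
print(tiling_fits("CUDA", 64, 8, 32))  # True
print(tiling_fits("AMD",  64, 8, 32))  # False -> needs a different tiling
```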
So yeah, this will never be fast; it's the wrong approach. Now, we'll give a shout out to Tenstorrent. I love to see this: they're shipping computers. Respect. There's a button here, there's a deposit, they have a ship time. Oh, we're shipping tiny boxes too. I don't even know what to say about this; I've got a room full of tiny boxes over there. I think we've shipped like 25 of them. We are very slowly making our way through the huge pre-order list, and if you want, add yourself to it, it's on tinygrad.org, just click tinybox. These are the tinyboxes we're selling. We can peek around a tinybox here, see if anyone's on this one. This is a tinybox green; it has six 4090s in it. Let's just kick off an MCTS search here, something like this, why not, let it search. Make sure it works, I haven't really tested it there. MCTS is still very experimental, it's in extra. But this is the tinybox. "What's the ratio of what people are ordering?" What do you think? Okay, it's 10 to 1, and you can decide which direction it goes, despite saving $10,000. But you know, we are almost ready to upgrade this driver quality from "mediocre" to "acceptable", because again, with no help from AMD, we've completely succeeded. Shout out to nimlgen, who did most of this work. He works at tiny corp, you can find him on our Discord. This is a complete AMD driver. It uses PM4. A lot of the bugs in AMD basically existed in two places. One of them has to do with this feature called CWSR, compute wave save restore, which is basically context switching for the waves, implemented partly on the CPU. GPUs do a quick context switch for the waves, but there's a slow context switch where one in a hundred times, one in a thousand times, the restore just doesn't work; there were race conditions in that. So we disabled it in the driver; it's just a flag you can use. And then we wrote our own: this is an AMD driver that speaks directly to the hardware using ioctls, and the key thing is that we communicate with it through PM4 packets, not through AQL. The other bugs mostly existed in AQL. Remember all those ones about the queues they were messing with, in-order and out-of-order, remember that thing they released? All stupid. We just bypassed all of it. So the main firmware that actually runs these things on the GPU... oh, we have some decent docs now. Again, why wait for AMD to do anything? I just went in there, you know I used to be a famous reverse engineer, and we just dove in and said, okay, what's actually going on here? This is the main controller, the micro engine compute (MEC), that actually controls the compute resources of the GPU. It's a 250K binary, and 90% of that code deals with AQL, but there's a lower-level thing called PM4. So you can see PM4 instead of AQL; it used to be called kfd, now it's just called AMD. We switched from AQL to PM4, and this is actually just the launch of a compute kernel. Here you see we have a function called exec. This specifies the location of the program, a bunch of resources that the program needs, the local size... where's global size? Oh, it's in this pointer to dispatch packet. Wait, what? I didn't know this was still used.
Huh, interesting, I haven't read this code that much. Pointer to a dispatch packet? Why do I need a dispatch packet? Why am I not just putting this stuff in directly? I am putting it in directly, we're doing direct dispatch, so why is this still in here? pointer_to_dispatch_packet, get_command_idx... oh, maybe it's just a useful struct to use, because it doesn't actually seem like it's doing anything. I'm going to ask about this in Discord: why is hsa_kernel_dispatch_packet still used in ops_amd? Yeah, okay, I would imagine we didn't actually need this HSA thing, because we're not really using it. So you can see these set_sh_reg calls; these are actually setting hardware registers in the GPU, and you can see here's where we set up the local size and here's where we set up the global size. So this is just directly specifying the registers; it barely even uses the firmware. The MEC is executing very small pieces of code to run these register sets, and they're pretty much exactly what you think they are. There is still this PACKET3 DISPATCH_DIRECT; these are the different PM4 commands. But it turns out that when you bypass all of AMD's crap, there aren't really bugs in it anymore. There aren't really bugs in the hardware; it seems like they did a good job with the hardware. Their bugs exist in the driver and in their runtime. So we turned off the one buggy feature of the driver, and we bypass their runtime entirely. So yeah, AMD: we can upgrade this driver quality from mediocre to acceptable. Should we do that? Should we upgrade it from mediocre to acceptable? I'll do it later. Okay, so we also have the same kind of driver for NVIDIA. This is the NVIDIA version; it speaks directly to the GPU. It's amazing how short these things end up being. One of the next big things we're pushing into: we're still actually using the vendor compilers here. We're using nvrtc, which is an in-process CUDA compiler, plus some weird stuff for PTX that's not on my computer, and the same thing for the AMD compiler: we have a helper here that calls compile_hip. I don't really know why that's not in ops_amd; it should probably be moved there if we don't use the same compiler anywhere else. Why is this a separate file? It probably belongs in ops_amd. I'll mention this in Discord too: is there a reason we can't merge the HIP comgr file into ops_amd.py? Then the ELF stuff can go to helpers.py and we remove a support dir. I don't like this support dir, it's kind of upsetting. So this stuff should probably go in there, and this stuff should probably go in the generic tinygrad helpers, though helpers is getting kind of big. Yeah, it's an ELF loader. Wow, it imports libc. We don't want ELF in helpers if it imports libc. Let's see. Okay, you guys are just really watching me work here. "Does tinygrad work with Mesa?" No, we don't use it. My understanding is that Mesa is a userspace library for GPUs, and tinygrad just replaces that. We've done pretty well with line count; we're at 8,461 right now. We can go through what these files are. The biggest file is tensor. Tensor has all these methods, and look, it has docs now. "tinygrad is so unreadable." Yeah, you haven't read it in a long time; it's actually very readable now. And then all these docstrings just get compiled into the docs,
so you can see, for example, what randn is. Oh, here: normal distribution, mean zero, standard deviation one. Wow, you can pass the device keyword, that's cool. Yeah: movement ops, processing ops, unary ops, everything's a method on Tensor, which is cool. "The docs became great." Thank you. "Does your AMD driver support only RDNA 3?" I believe so. It should not be hard to add RDNA 2 support; I think the only difference you'd have to deal with is that the scratch stuff is a little different. "How does tinygrad perform versus torch.compile?" Well, again, it all depends on your platform. tinygrad natively, if you don't run it with anything, is slower, but you can run it with search, and that's what I'm talking about here with MCTS. Right now it's searching for a kernel, so I can show you something. We like this example here: this is inference on a ResNet. Normally we're getting only three teraflops, 170 milliseconds, which isn't great. But if you run it with BEAM=3 (it takes a little while to compile, and I'll run it with DEBUG=2 so you can see the beam), BEAM=3 is the old search, and you see it searches all these different kernels. Before, kernel zero was taking 2.71 milliseconds, and now it got down to 1.61 milliseconds thanks to beam search. We can also use MCTS here. Actually, this one was already searched via MCTS, so you see that the hand-coded version, the one without any optimizations, took 2.7 milliseconds, the beam one took 1.6, and the MCTS one took 1.43. So it searches: that's the beam search there, this is the MCTS search, see, it already found a good one at 42. The timings aren't that great on Mac; they're better on NVIDIA. Yeah, this one, when it actually reran, was slower, so it used the built-in tensor core optimization. So what are we searching? I'll run this with DEBUG=3 so you guys can see. These are all the different ways to represent the kernel: we're messing with the strides and the views, all that kind of stuff, and we're messing with some arguments to the tensor cores, like which axis it is. Oh, this is actually really cool, I've never hit this before: sometimes we split the reduce axis, so you'll see two sums go in there. Now we're searching the next kernel. So we search over all these basic ways to run the kernel. "DEBUG=3 totally makes it understandable." It's not that bad, I'll break it down for you. Let's compare this kernel to this kernel. They're the same, but let's see what the difference is: this one has a 16 here and this one has a 4 here. This explains it: the blue ones here are the globals, the GPU globals, like get_global_id; these are the GPU locals, get_local_id; the red ones are reduces done in a for loop; the purple ones are reduces that are unrolled for loops; and the yellow ones are upcasted. Okay, are you interested in this? We can explain it further. Let's just do a little gemm here. First let me do it with NOOPT=1 so you guys can really see it. Okay, so this is a matrix multiply that's 64 by 64 times 64 by 64.
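If you want to poke at the same thing at home, here's a minimal sketch of the 64x64 GEMM being inspected. Run it under the env vars mentioned above (NOOPT=1, DEBUG=2 or 3, BEAM=...) to see the kernel shapes and the search output; the exact debug formatting is whatever your tinygrad version prints.

```python
# Minimal sketch of the 64x64 GEMM under discussion. Run it like:
#   NOOPT=1 DEBUG=3 python3 gemm64.py   (no optimizations, show the kernel)
#   BEAM=3  DEBUG=2 python3 gemm64.py   (beam-search the kernel instead)
from tinygrad import Tensor

a = Tensor.randn(64, 64)
b = Tensor.randn(64, 64)
c = (a @ b).realize()      # forces the matmul kernel to be built and run
print(c.numpy().shape)     # (64, 64)
```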
We have two buffers being loaded here, and you see why they have stride zero. This is the analogy I always give, I talked about it on Lex: it's like if you have a cube, you can put the matrices on the sides of the cube, multiply, sum up, and that gives you the top matrix. Put them on the sides of a cube, multiply, sum up, that's the thing. Here are some meta parameters about the kernel, and then this is the actual generated Metal code. You see that these dark blue ones are global dimensions. Now it gets a lot fancier when we start enabling optimizations. You can see this ran in 41 microseconds; with the hand-coded optimizations it went down to 24 microseconds, so it's faster, and you can see the code that's generated. It's not that much more complex, it's still very readable, but basically it expanded the... see that 2-4-4 at the end. Well, I'll disable tensor cores for now. With tensor cores disabled it's not quite as fast, we're getting 28 microseconds, and then this is a 4-4-4, which is the absolute classic. This is called coarsening in some GPU books, but basically think about it like this: it's doing a 4x4 chunk of the matrix all at once. This is that 4x4 chunk, it's accumulating into that 4x4 chunk, and then it's storing that 4x4 chunk. Each run of the kernel does 4x4 instead of 1x1. So that's what that is. Let's take a look at a few cool options for tinygrad too. We can graph the uops, which shows you what the inner generation looks like. The other big thing I've been doing while I haven't been streaming is refactoring the linearizer into the lowerer, and I know these terms don't mean anything to you, but if you follow tinygrad they do. So this is what that looks like: there are, well, actually, eight loads and four stores, and these are float4 stores, you can see these loads are float4. This is the graph of the compute of that gemm. I'll show you one more thing: if I set the debug expand flag it won't actually render, but it'll show me a much simplified version of the graph before these expands. You see there are nodes in this graph called EXPAND, and what that node basically does is push through the graph, expanding this node four times. So this multiply is 64 times whatever is being expanded here, 0, 1, 2, 3, and when this node goes through it actually runs 64*0, 64*1, 64*2, and that can be folded. If you're interested in seeing where that is, there's a file called uopgraph; it's in the expand pattern matcher, and there's a function called do_expand which actually runs that expansion. But this is what the linearizer became, and it's all written as these graph rewrite rules. This is the big chunk of graph rewrite rules. Some of them are stupid, like x*1 = x, x+0 = x, x/1 = x, simple rewrite rules, and then the rewrite rules get fancy to do all sorts of things. When I run it with debug expand it keeps all the expands and contracts around, but those can actually be rendered, so it pushes them through the graph, generates a graph, and turns it into actual code. This is what that code looks like without tensor cores: you can see it just stays a sum and a multiply.
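The idea behind those rewrite rules is simple enough to sketch without tinygrad's actual UPat machinery: pattern-match a little expression graph and return a replacement. This is a toy, not the real uopgraph code.

```python
# Toy version of graph rewrite rules like x*1 -> x and x+0 -> x.
# tinygrad's real version matches UOps with UPat patterns; this just shows the idea.
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
  op: str                      # "MUL", "ADD", "CONST", "VAR"
  src: tuple = ()
  arg: object = None

def rewrite(n: Node) -> Node:
  src = tuple(rewrite(s) for s in n.src)          # rewrite children first
  n = Node(n.op, src, n.arg)
  if n.op == "MUL" and src[1].op == "CONST" and src[1].arg == 1: return src[0]  # x*1 -> x
  if n.op == "ADD" and src[1].op == "CONST" and src[1].arg == 0: return src[0]  # x+0 -> x
  return n

x = Node("VAR", arg="x")
expr = Node("ADD", (Node("MUL", (x, Node("CONST", arg=1))), Node("CONST", arg=0)))
print(rewrite(expr))   # Node(op='VAR', src=(), arg='x')
```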
But if I allow it to use tensor cores (tensor cores are a feature of the GPU, so we'll run with them here), there's this magic WMMA 8x8x8 float function. It's fancy how it works; you'll just have to look up tensor cores if you're not familiar with how GPUs work, you've got to just read. So these are all those WMMAs. Let's run that with debug expand too, and I can show you: it's a lot simpler, right? It's basically one WMMA that's just expanded across a couple of axes, plus contracts. Someone who's been messing with Triton is going to look at this and go, holy crap, this is a lot nicer than what we have. And I'm like, yeah, I know. "Is Triton a lot faster?" Again, they're much more in line with the PyTorch way of doing things: take the one kernel we want and make it fast, don't focus on the generic abstract case. It's all about tradeoffs. And again, everyone in self-driving cars except for Tesla, and sort of Waymo (Waymo is a fine project if you have unlimited money), you meet all these people and they're just idiots. There's no redeeming factor; they're not making a tradeoff, they're just idiots. But people in the ML accelerator space and the ML framework space are not idiots; they're usually just making different tradeoffs than we are. "Is tinygrad's jit search faster than JAX's jit?" Okay, "faster" is a very... what do you mean by faster? On what platform, on what model? Saying faster in general doesn't really mean anything. There are places right now where tinygrad is going to win, but JAX can do a lot of fancy fusions too, so again, it all depends. "Why except Tesla?" Tesla doesn't seem idiotic to me. Tesla is making different tradeoffs from us, but we're all operating in a world of modern engineering practices, whereas these older companies just are not. You can watch a video with the Ford CEO; he says, well, we outsourced all the modules, and now we have 100 modules in the car and they all have software in them, and then we realized that the problem with the car is the software, but we can't get the software because we bought the modules and they came with the software, so then we have to email this company and they never get back to us. Yeah. No. They took an old style of procurement (it's still busy searching, by the way): if you're buying metal pieces, going with suppliers is fine; if you're buying software, you shouldn't really do this. You should have an integrated thing, you basically want a monorepo for your car, and you don't want to deal with "there's a tiny piece of software from this company and a tiny piece of software from that company." "That sounds sketchy for Ford." I mean, people say weird stuff like "sketchy" or whatever, but they don't really know anything, right? They're just falling back on proxies. They're falling back on "well, but it's Ford." I've seen their logo, it looks like that, right? Yeah, there we go, come on, look, they have a logo like that.
Clearly... all right, we're not ripping off Ford. But when you compare this to Toyota, which has this self-driving car folder, and this is just all the cars, guess what, they're tested in CI. You think Ford has CI? AMD doesn't even have CI. Well, that's our docs. Why are those ones little and those ones big? Ah, let's see. Think about how many lines of convoluted C that is across Ford's platform. "The auto industry is just learning about git and AWS." Yeah, see, that's what I mean, they're clowns, they're not using real engineering practices. Whereas, again, when you compare us to our competitors: we're tinygrad, they're PyTorch, and they know how to use git. They know how to use git, they've got more stars than us. So again, we're competing against real people, and it's so nice to be playing against them. A lot of things about tiny corp make me a lot happier than comma. There are things that are cool about comma: comma's great if you like to mess with hardware, if you want to really understand how all the hardware works. We have such a culture there of, you know, we bought some of those Unitree dogs and we tear them right open; we just know how everything works. We're building cell phones from scratch in the basement. Not total scratch, but pretty close. Now we're making our own SoMs, we have our own pick-and-place machine and reflow oven, and these aren't "oh, we have these things but actually we outsource everything." We're not outsourcing. The purpose is the process. The purpose is the process. Yeah: "tinygrad goes as low as bypassing the GPU driver, simplicity is the way to win." It is, it really is. A lot of the complexity in a GPU driver comes from supporting graphics; the compute models on GPUs are much simpler than the graphics models. And this matters, because our goal eventually is to make chips. There's a lot more work to do before we get there, but if you can't build a highly competitive framework on NVIDIA, you're not going to succeed with your own chip. "MCTS will always be bottlenecked by the forward pass." Okay, right now what's slow in MCTS is the compilation time. MCTS is also single-process, whereas beam is multiprocess, so we can look into improving that. I don't know, this might just be a tinygrad tutorial stream. "If no NVIDIA, no own chip?" Okay, so NVIDIA's chip is better than yours, certainly better than the first chip you tape out. Your chip may eventually become better than NVIDIA's when you're on the third or fifth generation of it, but the first chip you tape out will always be worse than an H100, regardless of what you're trying to make. It's always just going to be worse. So if you can't build a competitive stack on better hardware, you certainly can't build a competitive stack on worse hardware. "But NVIDIA is not documented." That doesn't matter. The documentation you're going to have for your own chip... sure, you could read the Verilog, but how much Verilog can you read? There's some quirk in the PCIe, blah blah blah, it's just not tested. So basically, if you can't build a competitive stack on NVIDIA, if you can't build something that's driver-level on NVIDIA, why build your own?
So this is basically an NVIDIA driver, and you can read it. We're succeeding at this. I haven't been streamed in a while; maybe I stream because I'm miserable, and I'm just not that miserable anymore. Things are actually working. "NVIDIA is moving to more open source drivers." Again, you have to be so careful with what you're reading in the news. NVIDIA has had an open source driver for a long time, the open GPU kernel modules; they're just deprecating the closed source driver. But okay, is this really open source? There's this huge binary called the GSP. Let's go find it. Where is it... lib... there it is. So let's go into this driver. We have two GSPs, one for the GA series and one for the Turing series; I think the AD series runs the GA one as well. So if you call a 36 megabyte opaque GSP binary open source, then sure, it's open source. Do they not even have strings in this thing? Junk strings, huh. "Some kernels require a pointer to HSA": nimlgen replied to me on Discord, wow. You're welcome to join our Discord too; I'm just typing in there now: what do you do in a kernel to have that property set? Okay, apparently his answer was: here's where we allocate space for the dispatch packet and the kernargs, so there's apparently some property of the code generated by the compiler, sometimes, where it does this extra dispatch. Also, when you go down to this level, you see all the complexity that they've hidden, and a lot of the time this kind of complexity is why your stuff isn't fast. There's an attribute, let's go find it in renderer/cstyle; in the HIP renderer you'll see it. You see this kernel modifier? There we go. It makes hlb_cifar10 twice as fast, basically by specifying the max parameters it's going to launch with, because this has to do with, what's the word for it, the register pool on GPUs is shared, and if you... God, what's the word, it's like resource exhaustion of registers... register pressure, that's the word I'm looking for. "How did we discover it?" It's not insider knowledge, it's open source, you just read the code, read the driver. Here's the docs for the driver: "cwsr_enable: default is 1 to enable this feature, setting 0 disables it. Preempts shader execution in the middle of a compute wave." Of course that's broken. Yeah, reading docs, how else do you want to do it? I don't know, I don't really want to code on stream. I got kind of sick of coding on stream because it teaches me bad practices, and it teaches you bad practices too: you like the stuff that's flashy, when in reality I code a lot slower, especially now, especially with the stuff in tinygrad where you're not writing that many lines. You just code nice and slow and make sure it's correct. Don't spam code. "Have we sent any boxes to the EU?" No, we've only shipped US ones; it's such a pain to ship, the box weighs 90 pounds. We've got two of them over there if you want to see some tinyboxes chilling. Why do I care? You know I stream for me, not for you, right? We've had this discussion before. "What chair is this, the new keyboard?" That's what you guys want to talk about?
We wrote a line of code. It was a comment. I'll be right back. All right, we'll do a little coding. So, it's a little annoying on Metal because you can't actually parallelize this compiler super well. I kind of want it to also say nodes per second, that'd be cool. So we want to know the number of nodes per second; we divide... can I not do it per second? I think we can just do that. Okay, let's get rid of the float entirely. All right, so we get about 27 per second right now. Don't need that, don't need that, how does it look? Do we like it first, or should we put it over there? I don't really like it first. I'm just tweaking something stupid. Okay. "When do you think you'll start taping out your own chips?" What do you think I'm going to say to that? Look, this one searches a lot faster. "Does MCTS always converge to the same optimal solution?" Bro, I don't know what tiny search spaces you're searching. No. The search space is huge, so I'll explain the search a little. We have this thing called OptOps, and those are the actions. Oh, it's in kernel, okay. These are the actions you can take. Actually, that's not even right, these are the actions you could take. We have something like 40 actions, not all of them available at every move, and a kernel usually has about seven moves, so it's 40 to the 7th; that's the size of the space you're searching. And you get about 500 kernel runs, so you have 500 tests to search a space that large. Maybe it's more like 20 actions per position, so okay, it's that big. So no, it's not optimal, but that's okay, because you can make some assumptions. If you were searching the space randomly, then yeah, forget it, you can't search that space. But you can make assumptions. Here, I'll show you something: we have a convenient graph function for MCTS. You can see it's finding the best one here, it's around 1,500; the best I've ever seen it get is about 1,450, but this one will probably stop at 1,521, which is pretty good. Oh, nope, it got better: 1,504. Okay. So we wrote a graph out, let's take a look at it. This shows all the nodes that were searched, so there should be about 500 nodes in this graph; it's actually a bit less. This is the root node here. We don't actually run the root node, that's just for speed; there's no reason to run it, you're always going to do better with at least one optimization. So then it tries all the optimizations. You see it gets this TC 1, which gets it down to 3,300. Does it spend all of its time on the TC 1? No, there are other ones down here; it also does this TC 1. See, these are all three pretty good actions. The winning action turned out to be this one, which was 5; then it went in there, did this upcast, got it down to 2,800, and with that unroll it got down to 1,504. So this is the searched space, and you can see it doesn't waste time expanding a node like this one: it's 177,000, so massive that it's unlikely to be worth expanding. We do intelligent search, where nodes that are pretty good are likely to be good places to look for more nodes that are really good. This is how tree search works.
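Just to put numbers on "the search space is huge": with roughly 20 to 40 actions per step, about seven steps deep, and only around 500 timed kernels, the fraction of the space you can visit is vanishingly small. A quick arithmetic sketch, using the rough figures from above:

```python
# Rough size of the kernel-optimization search space vs. the timing budget.
budget = 500                      # ~500 kernels actually compiled and timed
for actions_per_move in (20, 40): # rough bounds mentioned above
  space = actions_per_move ** 7   # ~7 optimization moves per kernel
  print(f"{actions_per_move}^7 = {space:.2e}, fraction searchable = {budget/space:.2e}")
# 20^7 = 1.28e+09, fraction ~ 3.9e-07
# 40^7 = 1.64e+11, fraction ~ 3.1e-09
```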
Okay, so we're getting 29 per second; let's see if we can increase that. Our real goal is something like 500 per second. 500 per second would be a good speed, then we could do this entire search quickly. So, something I was playing with: git stash pop. Oh, can we not git stash pop? How do I look at the stash... git stash list... whoa, look at all these stashes... git stash show... yeah, okay, here. So we added this early check to not waste time compiling: "early check for same as optimized AST". Right now, sometimes you get to two nodes and they're actually the same code, because different optimizations ended up leading to the same thing. "Shouldn't that all be caught by the AST check? Need to fix canonicalization." I talked about this in Discord. Quinn is making progress, challenges with loop unrolling, cool, that's what I'll look at later. But so this is an early check for "same AST", so we don't have to put it through the compiler. Hold on, import LazyOp here. This check should always be correct, if not always complete. Basically, it goes to the current node, gets the optimized AST, and checks if it's already seen it before; if so, we just remove the node and continue, which is also kind of wrong. What we actually want to do is this. Yeah, that should be right. Let's see if the searching is faster. 30, okay. We only expected a little speed from that, not a ton. It seems a little better than before; actually it's pretty much the same. Oh, but look at the one we hit. It's just luck, but you see we got a better outcome here at 257. We can look at that graph again; it's the same graph of the search, we just got a little luckier about how it expanded the nodes. So it's within a couple percent. You see it's 2x better than the hand-coded one, and the hand-coded ones on Mac are pretty good; the hand-coded ones on NVIDIA are terrible, and it's just because we really put effort into the Mac ones. Okay, cool. One thing we can do is mock out this rollout function, and then we can see how fast everything else is. So if we do a mock rollout... wow, look at that speed, 2,000 a second. Wow, the rollout is slow as hell. All right. So this is the thing we could parallelize, with a lot of work, actually, with a lot of work. Oh, and I changed this function to be a sample function; this was Chenyu's idea last night. It basically does a softmax and then a random choice: it samples from the tree instead of taking the max of the UCB, so it's not exactly textbook Monte Carlo tree search. You can read it, by the way, it's the MCTS search in extra in tinygrad; it's in extra because it's not ready for the main code yet. So we found out where all the slowness is, and all the slowness is in... well, actually, let's find out: is the slowness here or is the slowness here? Let's check quickly. This is the kind of code Copilot would probably help me write, but I'd probably waste a ton of time reading the code that it wrote; it's just faster to do it myself. All right, well, that's interesting: it looks like we're spending about half the time on compilation and half on running. We can also turn that down to three; that didn't really help much. And even within compile, that can be broken down further.
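The instrumentation being added here is basically just a split timer around the two phases of a rollout. A minimal sketch of the idea; compile_kernel and run_kernel are placeholders for whatever the search actually calls, not tinygrad APIs.

```python
# Sketch of splitting per-node time into "compile" vs "run" percentages.
# compile_kernel / run_kernel are stand-ins for whatever your search calls.
import time

def timed_rollout(node, compile_kernel, run_kernel):
  t0 = time.perf_counter()
  prg = compile_kernel(node)
  t1 = time.perf_counter()
  tm = run_kernel(prg)
  t2 = time.perf_counter()
  total = t2 - t0
  print(f"compile {100*(t1-t0)/total:5.1f}% | run {100*(t2-t1)/total:5.1f}% | {1/total:6.1f} nodes/s")
  return tm
```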
Interesting. Okay, so most of the time is spent in the compile function, but a sizable amount is spent in the runtime. This isn't allocating buffers, is it? Buffers are pre-allocated. I didn't do anything stupid in here, did I? clear_l2 is false; make sure that isn't actually hitting. Wait, okay, so why is it so slow versus the reported time? Is that not the default? I'm looking at the actual time spent in the Python loop versus the time it tells me it took on the GPU. Oh, and then there's another hack here too: there's a hack for the global size which doesn't actually run the whole thing; none of these are actually running the whole thing. Well, that's slow as hell. Okay, that's actually spending all the time on the GPU, so we have a hack to work around it. Interesting, I didn't realize how much that mattered; it all becomes GPU time then. Actually, I want this as a percent. All right. Okay, wow, and that's actually all the time, none of it is spent anywhere else, which is interesting. Okay, so put this guy back, that goes there, that goes there. It starts to become overwhelmed by the compile time, so there's not much we can do about the runtime. Really, three should reduce the runtime percent a bit, but I guess it doesn't matter: those kernels are all fast, and any slow kernels get hit by early stopping. We have this thing called early stop. Actually, is that bugged? Okay, we're multiplying there by factor... no, that's actually right. Cool. I like that extra letter. How do I easily make that colored? Can we nest f-strings? We were just talking about that; you can. Colored: compile time should be cyan (and flat's not used, why is it imported), runtime is red, see how that looks. "25 is not in list." Oh, did I do it backwards? Yeah. I've been listening to the new Say Anything; they came out with a new album. Max Bemis is proof that you don't have to be boring when you're old. "Fanfiction" is really... a lot of these emo bands just completely mellow out, like Good Charlotte writing "Centuries", and it's just terrible, absolutely generic pop music that no longer sounds like "Lifestyles of the Rich and Famous". But Max Bemis has not sold out. "My life might end up shorter than yours, but it's quality not quantity, you thimble man." "Your therapist knows you're a failure." He came out with a whole album; I liked it, one of the songs was really good. Compare that to "Detroit was hard, where I grew up it was hard, and I came up from nothing." Yeah, you're Eminem, right, he's still on about that. It's kind of like when people get older they get frozen in their ways; he's still very talented, but it's not new. Whereas in some ways I think this has evolved. What he talks about is relevant, and I sympathize with it. The song's called "Fan Fiction"; it's about the parasocial relationship between the fans of a band and the band. I don't know if I'm the only person who still listens to these, but if Panic! at the Disco comes out with a new song, you have to deal with basically this problem. Yeah, mind-blowing; I've probably listened to the song a hundred times.
"Oh, it should be possible to determine the runtime of a kernel without doing all these computations." Good luck. Good luck. I have tried, and the cool thing is you can figure out exactly how accurate your estimator is. So yeah, try to build an estimator of kernel time without actually running the kernel; I think you'll be shocked at how hard it is. Okay, so this part can be entirely parallelized, but this part can't, so what does that get us, like 3-4x? It's probably worth it; make a branch. "There's a very detailed cost analysis in LLVM." Yeah, exactly, and I'm not writing that. So, the thing you want probably isn't the thing you want. The question is not "given this kernel and these kernel arguments, how long does it take to run"; you're never going to build a good estimator for that, there are too many things that interplay. But here's where the low-hanging fruit is right now, if you do want to improve this search. And again, I caution you about trying to improve this search: if you improve it so that it runs faster and gives you the same answer, that is a straight-up win, but if you're trying to improve it by being more clever, how exactly have you figured out that it's better? We don't have a framework for evaluating whether the new search is better than the old search, and all the work is in the framework. It's the same thing at comma: so much of modern machine learning is just figuring out the correct test framework to see whether you're making progress. So you make some tweak to the search, you run it on one kernel, and it does better. What does that mean? Does that mean the search is better? Who knows. Stockfish solves this problem in a very elegant way: a tweak of Stockfish plays against a whole set of older versions of Stockfish, and you just take a win percentage, and that's pretty good because it's the exact thing you're trying to optimize for. So what you're doing here is saying, well, okay, I made something and it's faster on my computer on my kernels. That doesn't really say anything. This is what I mean: if you get speed, it's a straight-up win, but if you get a "smarter" search, is it smarter always? There's no free lunch in search and optimization. So if you are interested, the answer is probably not figuring out how to estimate kernel time, or, well, it's sort of estimating it: the answer is figuring out which nodes are worth expanding and which aren't. You'll see, if we have unexplored children we just use random.choice here, and we also use random.choice right here if we just expanded a node. And whenever you see a random.choice, you're always thinking: well, some choices are better than others.
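Here's a sketch of what replacing those random.choice calls with a prior-weighted pick could look like: a softmax over per-action scores with a temperature, the same flavor as the sampling tweak mentioned earlier. The prior dict here is made up for illustration, not something tinygrad tracks today.

```python
# Sketch: replace random.choice(children) with a prior-weighted sample.
# `prior` would come from "which optimizations have helped before"; here it's fake.
import math, random

def sample_child(children, prior, temperature=1.0):
  # score each child by its prior (default 1.0 = no opinion), softmax, then sample
  scores = [prior.get(c, 1.0) / temperature for c in children]
  mx = max(scores)
  weights = [math.exp(s - mx) for s in scores]   # subtract max for stability
  return random.choices(children, weights=weights, k=1)[0]

children = ["UPCAST", "UNROLL", "LOCAL", "TC"]
prior = {"TC": 3.0, "UPCAST": 2.0}               # pretend these helped in the past
print(sample_child(children, prior))
```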
And yes, some choices are better than others. So that would involve putting a prior on the actions: you're at a node, and you want a prior over the action space. There are a few ways to get that prior. You could get it online, and again, I think you're going to spend 95% of your time on your evaluation harness and 5% actually implementing this sort of code. There are hyperparameters too. Here are two we have: the temperature on this sampling, and the C from the UCB function, which is math.sqrt(2) for now. Tweaking those hyperparameters may get you better results, but the question is: does it always get you better results, or only sometimes? So by putting priors on those random choices you could probably do a lot better. Chenyu and I discussed a simple way to do the priors yesterday: just figure out which optimizations, overall, have improved runtime when selected for a given kernel, and search those first when you expand new nodes. This is probably a good idea; I bet it would improve search efficiency by something like 2x. The other thing you can do is build offline models: offline, given a device, you can figure out that the device is more amenable to certain kinds of optimizations than others, and you can have that offline prior. And the cool thing about priors is you can multiply them together; generally, if you have two good priors from different sources, you multiply them and you just do better. So there's that. But again, the main thing you'll have to build before playing in that direction at all is a robust evaluation harness, one that evaluates the effectiveness of your search across a wide variety of hardware and a wide variety of kernels. Whereas if you just want to work on speed, well, speed is just speed. All right, can we ban the b-word and the t-word? Any of that stuff, it's all banned, the b-word and the t-word, that's it. "What kind of camera is this?" The same one I have at home, I think. God, now we bring them up; see, you can't even say them, man. "I was referring to trains and buses." We are not discussing trains and buses. T-word and b-word. Long live trains. Yes, if you want to follow along, it's in the parallel MCTS branch of tinygrad. So let's see what crap I wrote to make things parallel for beam. Part of the problem with parallelizing on Mac is that you need the device open in order to do compilation. I also kind of want to break compilation down into two steps. There's a lot of complex crap here; let's get rid of this, we don't actually need this dev compiler stuff. So there are two different steps: the to_program step and the compile step. And we could... don't do that, no, not that one, or was it here... okay, we could actually break these out further. Let's stick this one here and see how much time we're still spending on compilation. Yeah, not much anymore. Okay, so it's really all just that, which is cool, because that can be parallelized. to_program can be parallelized... wait, what, no, okay. Let's get a pool going here. So there's to_program and there's compile, and they're separate things: sometimes compile can fail, but to_program should never fail. If to_program fails, it's a bug in tinygrad.
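A sketch of that to_program / compile split with a pool on the compile side only: to_program stays in-process (a failure there is a tinygrad bug), compile goes wide, and a failed compile just drops that candidate. The function names here are placeholders for whatever the search actually calls, not exact tinygrad entry points.

```python
# Sketch: keep to_program serial (a failure there is a tinygrad bug),
# parallelize only the compile step, and tolerate compile failures.
# to_program / compile_src are placeholders, not exact tinygrad entry points.
from multiprocessing import Pool

def compile_one(src):
  try:
    return compile_src(src)        # may legitimately fail for bad candidates
  except Exception:
    return None                    # drop this candidate from the search

def compile_candidates(kernels, workers=8):
  srcs = [to_program(k) for k in kernels]   # serial: never expected to fail
  with Pool(workers) as p:
    return [b for b in p.map(compile_one, srcs) if b is not None]
```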
Well, that actually worked out to be fewer lines than using the stupid _try_compile_linearized_w_idx. Oh, so this lowering is all in Python. I wrote this crap, so you see how much we can speed up by... wait, why is it faster now? It's just faster now. What did I change? Oh, the signal crap; how much time are we spending on that? You see, no one benchmarks anything, so it's just slow. And the combination of linearize and uops... I don't know how much of that is not reused anymore. Oh, I run that twice. Wow, I run that always, that's kind of annoying. All right, well, that was a big waste of speed; there's a big speed regression there, and we can fix it for beam as well. I mean, we need this to be in a shared framework. Wow, there were so many regressions when we went to the lowerer, but everything's so much simpler now. Cool, look at that, look at how much faster we're searching. Love it. That was completely free: we made things both simpler and faster, no parallelism. What was it before, like 24? Yeah, 33% faster, guys, we're winning on this stream. "There is a free lunch in search." That's right, we found a free lunch in search: we just stopped running the function twice. It's cool that that bug actually existed. I mean, my whole next week is going to be spent on search. I don't know why I posted it in the employees-only channel, but we have a breakdown. We have five people who work at tiny corp now; one's on vacation next week, so four of us are each going to take a different part of the speed problem. I'm going to work on our part of it, and this thing is going to be faster than torch in no time. Wow, I love my new fast MCTS. Okay, so there's still a 2x to be had: you can see these percentages don't add to 100, and it's because I put the lowering outside the timing. Does that make sense? This turned into a pretty good stream, we're actually coding, look at that. We got so much good gain in to_program. We want to try to parallelize this, but parallelizing is annoying; you have to take multiple samples of the tree. This crap all looks slow, look how slow this all looks. What here is actually slow? Decorate this with profile from helpers; that'd be cool if I could use this function as a decorator. It is a ContextDecorator. "__call__ takes two positional arguments but four were given." I think I have to tell it stuff. Come on, profile for me. Yeah, profile. All right, so where are we spending all the time? A ton of time in the cstyle renderer. Wow, wait, why is that not sorted? Oh, it's sorted by that. Oh, cstyle render is there just because it's doing the uop graph. Okay, so half the time is being spent on stupid uopgraph stuff; we know how slow uopgraph is. Oh, here we go, here's ops_metal compile. ops_metal compile is actually a pretty small part of the time. Wow, we're spending a lot of time in symbolic floordiv; this symbolic-on-uops stuff is so slow. Okay, but that's cool: almost all the time is being spent in to_program. Well, not almost all, but a good percent. Wow, look how nice the tinygrad infrastructure is, that profiling just worked, like that. That was cool. "Do you understand the stream so far?" Great, we're trying to be more understandable and approachable for everybody, except noobs. Okay, it's a pretty good speedup already. If we want to parallelize this... I don't really want to parallelize this, I just want to make it faster, I hate how slow this is. Part of the problem with parallelizing is that it would actually be slightly different, because we'd have to sample from the tree more: take multiple samples from the tree, run them, and backprop them all when they come in.
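The "decorate this with profile" helper used above is roughly a cProfile wrapper that works as both a context manager and a decorator. Here's a minimal stand-alone version of that idea; it's not the actual helper in tinygrad's helpers.py, just the same pattern.

```python
# Minimal profile helper usable as `with profile(): ...` or `@profile()`.
# Same idea as the helper used on stream, not the actual tinygrad code.
import cProfile, pstats
from contextlib import ContextDecorator

class profile(ContextDecorator):
  def __init__(self, sort="cumtime", top=20):
    self.sort, self.top = sort, top
  def __enter__(self):
    self.pr = cProfile.Profile()
    self.pr.enable()
    return self
  def __exit__(self, *exc):
    self.pr.disable()
    pstats.Stats(self.pr).sort_stats(self.sort).print_stats(self.top)
    return False

@profile(top=10)
def slow_thing():
  return sum(i * i for i in range(1_000_000))

slow_thing()
```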
35 a second, 47 a second. The tinygrad infrastructure is so good, it's so easy to write things like this. That is how you make progress. I was thinking about that, I even thought about tweeting it: there's really one group of people that moves the world forward, and that's the people building infrastructure. Think about the tech tree of any game; we're progressing through that same sort of tech tree. I know a lot of people are talking about this stuff now, but it's cool to be explicit about it: we're moving through a tech tree, and the way you move through a tech tree is you build infrastructure. That's kind of what I do with my life. Eventually I want to rewrite all of openpilot in tinygrad. tinygrad's a programming language, but a very restrictive one: it only lets you express certain kinds of operations, but those operations are pretty much all you're going to need to build brains and stuff. So a lot of openpilot can be rewritten in tinygrad, and then it's the job of the tinygrad backend to actually make that run fast on hardware. The first chip we tape out is probably going to be an inference chip for comma. Once openpilot is ported to tinygrad, we can switch comma devices to tinygrad taped-out chips. It's also a lot cheaper to tape out a small chip. We would have to care more about power, but you have to care about power anyway, so it's good: you start with the mobile ASIC and then you blow it up. As much as they piss me off, I think Qualcomm's undervalued. Let's look at Qualcomm compared to AMD: 204 billion, oh, similar. Unbelievable, guys. I think that if Qualcomm understood what they had and what they were doing, I'd actually be bullish on them building better neural network accelerators than, for example, AMD. All right, we have faster search. On the question of making this faster: so it's a kernel, right, can I give it a type, does it still pass? Where's my mypy incantation? I don't know why, when I run this in CI... this is the mypy that runs in CI, it doesn't include the MCTS file, but when I run it there... okay, incompatible, oh, fine. So if you're interested in making this faster, I think we should just focus on making lowering faster. I know the branch is called parallel MCTS, but parallelization is always going to be a trick that's available to you, whereas if you can straight-up make improvements, there are just wins out there. This one, unfortunately, is something we already track pretty well, so I don't know how much low-hanging fruit there is. There are definitely gains, they're just not that low-hanging. Almost lunchtime. So I have this thing called external_benchmark_schedule. We're getting about 653 there, and 653 is kind of an improvement from what it was, but you can see this is 10x slower than all the other steps in the model. uops is the thing we were running twice, by the way; it's this linearize that's running all these graph rewrite rules. This is the thing that's new in tinygrad, and the old one was slow too, it's not like the old one was fast.
But yeah, instead of hand-coding rules that write this code in a linear way, things have been rewritten as graph rewrite rules. So this is the matching pattern, and this is the code that runs if it matches; you can see the pattern matcher here. There are a lot of gains to be had by instrumenting this, and I have some ideas to speed it up. Where's my pattern matcher... here's the pattern matcher. So, I'll compile a dictionary so I can do a fast dictionary lookup, but in theory I think you can do this recursively. There's also probably a lot of stupidity in how we iterate over these things: you can see that if something's a list, I just create a list of all the permutations, which is correct, but there's probably a smarter way to do that. I think greedy is fine, it's greedy with a tree; it's nice because it's an easy thing to write, but I think a lot of time is wasted there. And this list comprehension is all pretty slow. So if I could make that 2x faster, we're looking at something like 60 nodes per second. I want 500 nodes per second, but there's no way we're getting there, even if the only thing left were GPU runtime, which I can't parallelize at all. This, in theory, can be parallelized, it's just a little annoying on Metal, but this red stuff can never be parallelized, at least not the way tinygrad works now. You could in theory parallelize it across GPUs if you have a multi-GPU computer. So, 40%... wait, that's 25, I thought it was 30. It was 30 before, why did it lose speed? Oh, maybe it's faster sometimes than others depending on the nodes it happens to search. Look, this one's faster, interesting. Wait, and this one messed up: this one didn't find a good kernel, and this one found a good one and was faster. Whatever. Yeah, see, and that's the other problem: the search is heuristic-based, so sometimes it lands on a bad one, and then you could do stupid things like put the search in a loop and run it five times, but then you just got 5x slower. Can you parallelize that? Sort of. Okay, so if we're at 30 per second and we're spending 38% of our time on runtime, the best we could possibly do, even if we got everything else to zero, is 78, and that's still not fast enough. There's no getting around that. All right, so sometimes the time is spent in the enqueue and sometimes it's spent getting the code and realizing it, constructing this CompiledRunner. I really want a line profiler on this. Interesting, so most of the time isn't actually spent waiting for the kernel, most of it is spent constructing the kernel, except in some weird cases where the kernel's probably already constructed; that one's all in the wait check. Why is it sometimes this fast and sometimes this slow? We're doing it precompiled, right? Yeah, precompiled equals lib, so that gets bypassed, that's not running. Device... nothing slow there. This must just be the Metal construction that's slow. Not triggering that. dispatch data create, new function with name, lib... you know, we're actually already creating this program when we compile it. Where are we... library... yeah, see, this is calling newLibraryWithSource, and this is calling newLibraryWithData, so we sacrifice a little bit there. Are we ever hitting the compile cache? No, wait, this is all after that.
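The "compile a dictionary" idea for the pattern matcher mentioned above is just: instead of trying every rule on every node, bucket the rules by the root op they can match and only try that bucket. A toy sketch of that dispatch, not tinygrad's PatternMatcher:

```python
# Toy sketch of dictionary dispatch for a pattern matcher: index rules by the
# root op so each node only tries the handful of rules that could match it.
from collections import defaultdict

# each rule: (root_op, predicate, rewrite_fn) over a simple (op, args) tuple node
RULES = [
  ("MUL", lambda a: a[1] == ("CONST", 1), lambda a: a[0]),  # x*1 -> x
  ("ADD", lambda a: a[1] == ("CONST", 0), lambda a: a[0]),  # x+0 -> x
]

BY_OP = defaultdict(list)
for op, pred, fn in RULES:
  BY_OP[op].append((pred, fn))      # the "compiled dictionary"

def rewrite_once(node):
  op, args = node
  for pred, fn in BY_OP.get(op, ()): # only rules whose root op matches
    if pred(args): return fn(args)
  return node

print(rewrite_once(("MUL", (("VAR", "x"), ("CONST", 1)))))  # ('VAR', 'x')
```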
Yeah, so sometimes the time is spent in wait_check and sometimes the time is spent in the init. I wish I could aggregate these somehow. Okay, and this is happening before this, so wait_check... yeah, you see it's a decent multiple of this, right? So it's telling me that the kernel runs in 4 milliseconds and then it's spending this long running it, which, again, there's not really going to be a way around: if you have a 4 millisecond kernel, that's just how long it takes. This only seems to be slow sometimes. I wonder why it's slow sometimes and not slow other times. Let's profile it, can you just do that? Great: slow sometimes and not slow other times, and we don't have any more information on that. Oh, maybe it's this GC. I don't know, maybe it's just still running something on the GPU and it takes a long time. Okay, nothing sketchy in wait_check. Yeah, I don't know, this is just how long it takes. These are big kernels too; these kernels are on the order of milliseconds, so if your average kernel is three milliseconds and you're running it three times...

Okay, well, one thing we can do is lower the early... where's early... we have early stopping; this can probably be three. Doesn't matter: 40 per second, cool. I don't know, I doubt that matters. Well, it definitely made things faster, but three might be a little aggressive there. What do I set it to in BEAM? I do three in BEAM, so three is probably fine. Again, I need that framework to do proper evaluation of these things; you could just get unlucky, get a slow one, and then it exits and it's done. I mean, three is what I do in BEAM, so it can't be that bad. It hit there, but that could all just be luck; seems like I get most of the gains with five anyway. So what this is, is it'll only run the kernel once if it's... Should we adjust BEAM to match? BEAM is so wasteful now because it's not even doing that compilation correctly. Okay, 36 per second, that's pretty good. You see MCTS is winning most. So we have the hand-coded optimizer, the tensor core optimizer, I can put BEAM in there as well, and we have MCTS there. Wow, that's still going, that's so slow. Let's switch to parallel MCTS. I didn't push yet... and now it's pushed, so the stuff is cached, so you can see that's all fast now, 34 per second.

Oh, look at this: 69% of the time is spent compiling and only 1% of the time is spent on runtime, because these kernels are so much smaller. Actually that's probably also going to be generally true; I happen to be running this on something that has big kernels, but if I run it on something without big kernels... JITBEAM means only BEAM the things in the JIT, so you see all that stuff that's not in the JIT, it doesn't run the search on it; as soon as we get to something in the JIT, the tradeoffs become different. Why is it doing this? Let's fix that bug: probabilities contain NaN. I exclude the ones that are infinity; let's see what's going on, why it does that. Oh, that's a different problem. Well yeah, because we no longer put the compile in... no, there's a tinygrad device CompileError, that's fine though, it's still interesting. But this shouldn't be like that; this except should be changed to include CompileError. That's a bug, if someone wants to look into it. See, this kernel is very slow, so we're spending 90% of the time on runtime. Okay, same bug. By the way, does math.isinf work for negative infinity too? Cool. Oh, we got through all that. Is that a good time? Seems pretty good, 7 milliseconds.
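On the math.isinf question a moment ago: yes, it is true for negative infinity as well. A quick sanity check with made-up scores, which is also roughly what excluding the NaN and infinite candidates before turning them into probabilities looks like:

```python
import math

scores = [12.5, float("inf"), float("-inf"), float("nan"), 7.0]

# math.isinf is True for both +inf and -inf; math.isnan catches the NaNs,
# so this keeps only the finite candidates
finite = [s for s in scores if not math.isinf(s) and not math.isnan(s)]
print(finite)                      # [12.5, 7.0]
print(math.isinf(float("-inf")))   # True

# equivalently, math.isfinite rejects inf, -inf and nan in one check
print([s for s in scores if math.isfinite(s)])   # [12.5, 7.0]
```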
Who knows if it's good. How much faster are the searches on these? We're almost getting to that 500 per second: 239 per second. This is a slower kernel, so it's not that much slower. There's so much diversity in these kernels, though; you do something to make it faster for one... there are real wins out there. All right, go to the bathroom, I'm going to grab a matcha latte, and then we'll do some questions, and then that's the end of today's stream. We've got to look into that HSA thing, that's crap, that should just be removed. We should exclude those kernels. Search is really addictive to watch, you can just sit and watch searches. Wow, it's searching, look at it search. Oh yeah, someone should look into this not-folding crap. I know where that code is, but you want to see if you can fold those; that would be a cool thing to look at. And you can look into some of those bugs and see why the linearizer doesn't always work. Perfection: we pursue perfection in tinygrad, and I hope someday we'll be there. It's true that tinygrad used to have different-looking bugs; the bugs now are deep bugs, there aren't many shallow bugs anymore, which I guess is how bugs work. Good progress today. Yeah, 8,461 lines changed, a lot. And I'll do questions.

Wow, we got energy from caffeine, focus from theanine, clarity from Lion's Mane, and calm from ashwagandha. That sounds lovely. You know, they never sponsor me. Tag, tag, you tag the sponsor. Lion's Mane, big W. Probably because of my rants? No, no, look, whatever's politically correct, just change it. Maybe stream tomorrow? Nah, not worth it, I'll buy my own drinks. So a good time overall for this is sub 20 milliseconds; if it's sub 20 milliseconds we're doing really well, so this will finish shortly while we do questions.

Appreciate the more educational stream. I want to get more people involved in tinygrad, both usage and development. I think it's now at the point where it's quite usable. Unless you're doing these searches, the speed is still slower than torch; with these searches it's competitive. Like, if you do BEAM 3 or MCTS 500, it's competitive with torch on most platforms. It's definitely competitive on Mac, it's definitely competitive on AMD; on Nvidia it's probably still a little slower. You couldn't get it to work on your 3080? It should work on a 3080 now. It works on my 3090 and it's the same chip, so yeah, it should work fine in the NV backend. If you have a 3080, try it right now. Can you cache the search across runs? Yeah, it's cached automatically, so you only have to do the search once. I did hear about Karpathy starting Eureka. I like that, he seems chill. Are pytrees ever coming to tinygrad? I don't know what a pytree is, so probably not. Why would I want this? Oh, I see, we kind of have some implicit stuff that does this, I'll show you the function. You mean like here: we have functions like get_parameters, which will get the state, and this is kind of like a pytree. What exactly do you want a pytree to do? Can you solve the ARC challenge using tinygrad? Maybe, maybe you can. Isn't there a named tensor axis? We don't have that; apparently PyTorch had it and deprecated it. Maybe just map and scan over pytrees? Stop it, stop it, it looks so ugly now. Can you reduce the kernels in size for faster running? Yeah, we actually do that. In my time-kernel function you see this thing called factor, so this will reduce the global size and then return a factor, and then run it and use that as the estimation. So that's already being done.
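A toy version of that factor trick, just to show the shape of it: shrink the launch, time it, and scale the measurement back up. The `run_kernel` stand-in, the `max_items` cutoff, and the numbers are all invented; this is not tinygrad's actual timing code.

```python
import time

def run_kernel(global_size):
    # stand-in for launching a GPU kernel; pretend runtime is proportional to work
    time.sleep(1e-7 * global_size[0] * global_size[1])

def time_kernel_estimate(global_size, max_items=2**16):
    items = global_size[0] * global_size[1]
    factor = max(1, items // max_items)                    # how much we shrink dim 0 by
    reduced = (max(1, global_size[0] // factor), global_size[1])
    st = time.perf_counter()
    run_kernel(reduced)                                    # time the smaller launch
    et = time.perf_counter() - st
    return et * factor                                     # scaled estimate of the full run

print(f"estimated full-size time: {time_kernel_estimate((4096, 4096)) * 1e3:.2f} ms")
```

The tradeoff is the obvious one: the estimate is noisier than timing the real launch, but during a search you only need a ranking, not an exact number.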
What else we got? Oh, on your 3060: I think that one's supported too; sorry, that's not the same chip as the 3090. If it's not supported, it's probably a very small change to make it supported; we're interested in supporting everything back to the Turing series, like the 2080 series. JAX also has higher-order gradients, do we not? Can I not just grad twice? All right, so we got 21 milliseconds, could be better. By the way, it caches, so you can see it'll just run through the cache again; it's very fast the second time. We got 21; I've gotten 19 before. That's what I mean, the search ain't perfect. Wow, repeated stores in the UOps, interesting. Yeah, what we need to do for all of these things is... oh, this is great at finding all sorts of bugs in tinygrad... we need a context that prints out... see, we have tensor operations; we need another one of these that prints out the kernel that's trying to be lowered. Like, you see "error lowering MetaOps.KERNEL"; we should print that kernel out. Well, actually, I should just be able to... why do I only print the op there? Will we hit that again? Nope, we don't hit it again, tragic. So we can print the kernel there, but that still doesn't even print the... All right, we need to do the refactor: the schedule should be optimized separately from running, so the AST should be lowered. In this MCTS search thing where we have this get-optimized-AST, that should actually be pre-run, and that's what the beam search... no, it doesn't return a Kernel, it returns an AST. Yeah, I believe it's just grad twice; the Hessian is a matrix. I don't know.

Do I still have my old cache? Interesting, that's the old cache. I'm not getting 19s now. Sometimes I think it depends on how hot my computer is, or maybe we get lucky with the RAM allocations and they're not strided across pages or something, who knows. Apple we have a lot less control over, because we didn't write our own driver for Apple. The Apple backend here, you can see, just uses normal Metal stuff: here's MetalProgram, we just do newFunctionWithName and a new compute pipeline state, so it's not as low level as the wonderful... The AMD one's easier to read; the Nvidia one has a lot of crap that hasn't been cleaned up yet. But I love this, that is how you launch a kernel on a GPU, and we're going to get this removed because we don't need that, that's not real. "Release me" is such a misleading name. These need better documentation, the three and the zero, these need to be better documented. This update API is interesting; this is how we update things in the JIT, we have update the queue. I don't have an obviously better way to do it, but I wonder why it's slow.

Oh yeah, here, let me show off something else to you guys: we can run LLaMA. This is LLaMA sharded across four GPUs. Those are the weights loading; that could be faster, and there's a reason why that's not getting us the full speed, but we're getting 138 tokens per second across four GPUs. Now, across six GPUs it's actually still slower. This is LLaMA 7B, unquantized, on a tinybox; once you go across six GPUs it actually gets slower, because you see, even with our crazy good drivers, we're still spending a lot of time here on enqueue.
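For reference, the sharding being shown looks roughly like the sketch below in user code: a tensor is split (or replicated) across a tuple of devices and ops on it run per-shard. This assumes a recent tinygrad with `Tensor.shard` and a machine that actually exposes several devices, and it is a simple data-parallel matmul rather than the llama example's weight sharding; treat it as a sketch, not the exact example code.

```python
from tinygrad import Tensor, Device

# e.g. ("NV:0", "NV:1", "NV:2", "NV:3") on a box with four of the default device
GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(4))

x = Tensor.rand(64, 1024).shard(GPUS, axis=0)      # batch split across the devices
w = Tensor.rand(1024, 256).shard(GPUS, axis=None)  # weights replicated on every device

y = (x @ w).realize()      # each device computes its slice of the batch
print(y.shape, y.device)
```

The enqueue-time point above is why the sharding alone isn't enough: once the per-device kernels are small, the time to get work onto six queues starts to dominate.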
There was a PR put up this morning to lower that enqueue time, but you see, again, if you're using CUDA you have no control over your enqueue time, whereas... I don't exactly get how this works, it keeps state around, but look at that, it shows the enqueue time. So across six GPUs we only get 89 tokens per second, which is still a little bit under what gpt-fast can get, so we have some work to do still.

All right, do we have any other good questions? How do you build superintelligence right now? Have kids. Start by having a high IQ, then have kids, and then train your kids to work together. That's the most straightforward path to superintelligence. Teach your kids tinygrad. What do you need to know to understand this, what domain of programming? I don't know; I didn't learn it in school or anything. I've been working on tinygrad for like five years, it's been a long time, but we're getting close. Look, we have docs. If this wasn't legit, would there be docs? Like, real docs. How is inference speed compared to llama.cpp? A lot of those projects are focused on quantization; again, if what you're trying to do is just run LLaMA, use llama.cpp, it probably is going to be faster. The idea, the whole theory behind tinygrad, is the Bitter Lesson. This is one of the things I always shill for: "the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin." The reason for this is Moore's law. And, where is it, see here: "one thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great," and the two methods that seem to scale this way arbitrarily are search and learning. I wouldn't use the word learning, I would use the word optimization, but yeah: search and optimization scale with the compute. By the way, this is a great thing to read if you haven't read it; I shill for it on stream a lot.

That's kind of the difference between tinygrad and torch, right: we don't write the kernels, we specify a generic framework with which we can search for kernels, and as search methods continue to improve, tinygrad is kind of this new class of compiler. And there can be search at every level, too. All of this search is being done at this level, called kernel.py; these are those optimizations we talked about, that search space. But there are tons of other potential searches. For example, I'll give you one: here's a function in image.py called image_conv2d. This uses textures to represent the inputs and outputs to a convolution. If you use textures on a GPU instead of buffers... mobile GPUs in particular put a lot more effort into their texture cache than into their buffer cache, so on Qualcomm this is like 3x faster. This is mostly written for comma, but this stuff can be searched, right? How we choose to represent them right here is hand-coded, but that can be searched, so you can search at this very high level.
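A toy contrast between the hand-coded choice and the searched one. `time_config` is a stand-in for "apply these optimizations to the kernel, compile it, and time it," and the option lists are made up; the point is only that once timing is cheap, the search finds the fast configuration without anyone hand-picking it.

```python
import itertools, random

UPCAST, UNROLL, LOCAL = [1, 2, 4, 8], [1, 2, 4], [4, 8, 16, 32]

def time_config(cfg):
    # fake but deterministic "runtime" in milliseconds for a given configuration
    random.seed(hash(cfg) & 0xffffffff)
    return random.uniform(1.0, 10.0)

# hand-coded optimizer: one fixed guess, picked by a human
hand_coded = (4, 2, 16)
print("hand-coded:", time_config(hand_coded))

# exhaustive search over the whole space (what BEAM approximates greedily,
# and MCTS samples, when the real space is far too big to enumerate)
best = min(itertools.product(UPCAST, UNROLL, LOCAL), key=time_config)
print("searched  :", best, time_config(best))
```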
Then we have another file called schedule.py, and this is the scheduler that determines what order kernels run in. This stuff can also be searched; it's effectively a topological sort. So we have two big topological sorts in tinygrad: we have the topological sort that determines what order the kernels run in, then within there we have a bunch of graph transformations, and then we have a second topological sort when we actually turn the UOps into a list. Those UOps, which I showed you in those UOp graphs (remember the UOp graphs, like this is a UOp graph): when we turn these UOp graphs into a list, and fundamentally a program, we have to determine what order we put them in, and that order is not fixed. The only rule is that the order has to be a topological sort on the graph; if you don't know what that is, look it up in a computer science textbook. But there are many lists that satisfy the property of being a topological sort, so yeah, you can search there as well. And eventually we're going to abstract all of this out into a very powerful general search framework and be able to find the absolute fastest kernels to run all this stuff. And when you're talking about a training run at GPT-4 scale or even bigger, when you're spending $100 million on a training run, if you can somehow make it 1% more efficient, you save a million dollars, so you're willing to put tons and tons of compute into this search. It's a question of how well the search scales: can you keep searching for better kernels? And the answer is sure.

Let's just do one as an example. I don't know, what should we set it to, 5,000? It'll just work. It's annoying that it's wrapping; let it go, we'll see what it finds. Does it get better? I've never run one for 5,000 before. But the beauty is, as we make this faster (we could parallelize it, we could make the Python faster), this method only becomes more effective as computers get faster. When Apple releases the M6 chip, it's not going to require any hand work; you can just increase your search budget. You don't need to think about what the GPU is; as long as you've captured everything you need fully flexibly, then let it search. No, don't do that, stop wrapping. Make it a little bit shorter so it doesn't wrap. Would you like to know how the AMD accelerators are? Yeah: we disabled the broken feature in their driver and we wrote our own runtime, and now it's fine. Oh look, we found a faster one at 2028. Look at that, that's a new record for that kernel, I've never seen a kernel be that fast before, and it only searched 2028 kernels to find it.

As for analog computers, I don't think there are big gains to be had; you give up a lot. Digital computers run 100% repeatably. You don't really need that property, but it's annoying to not have it, it makes debugging stuff really hard. So no, I don't think so. I think what we're going to start to see is smaller and smaller data types; Nvidia talks about this a lot. If you can make training work on something like FP4, that's really power efficient, because most things scale with the square: multiplies scale with the square of the operand size. Basically you can think about a generic two-op table, where I have X here and Y here; it's a square.
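Going back to the linearization point: a DAG generally admits many topological orders, and each one is a different but equally correct program you could emit, which is exactly why that ordering is itself searchable. A minimal sketch with a made-up five-node graph, not real UOps:

```python
from itertools import permutations

# edges: a node maps to the nodes that must come AFTER it
graph = {"load_a": ["mul"], "load_b": ["mul"], "mul": ["store"],
         "const": ["store"], "store": []}

def is_topo_order(order, graph):
    # every edge u -> v must have u appear before v in the order
    pos = {n: i for i, n in enumerate(order)}
    return all(pos[u] < pos[v] for u in graph for v in graph[u])

valid = [o for o in permutations(graph) if is_topo_order(o, graph)]
print(len(valid), "valid orders")   # 8 here: the loads and the const can move around
for o in valid[:3]:
    print(o)
```

On a real kernel graph the number of valid orders explodes, and they are not all equal once register pressure and memory latency enter the picture, which is what makes it a search problem rather than a free choice.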
Am I searching randomly? No, we're using Monte Carlo tree search; it's a guided search. So no, we're searching over a set of permutations, like I showed you how that works. Yeah, CrowdStrike needs it right now. Green tinybox versus red: to be honest, the red one's probably a better deal. Again, it's $10,000 cheaper; it's a question of how much you want to deal with annoying drivers. The answer is, like I said, one box gets bought 10x more than the other, and I'll leave it up to you guys to decide which one. Multiplying huge integers using Fourier transforms? Oh, it's even faster. Yo, we've never seen a kernel that fast before, that's a new record, we just needed to search 4,854 kernels. But imagine we got that 23 per second up to 2,300 per second. Can I summarize why they're worse? Because AMD is bad at software. It's really just that Nvidia invests in software and AMD thinks you don't have to, and they're going to lose as a company because of it. Look at this, MCTS just continues to get better. I don't know what HVM is, I don't know what symbolic ML is; I don't think this stuff is... oh look, look at how long it took on that one, it hung. This kernel's a lot simpler, so it gets into the weeds: if you tell it to keep searching, it will keep searching, but eventually it gets into the weeds and starts finding dumb kernels. I think I got my stop condition correct on MCTS; when MCTS exhausts the tree you kind of have to deal with that, it's a little annoying. See, you can just put more zeros on this number and it does better. What if we searched for 50,000, what would we find? Thank you. Yeah, I like the colors too, it's cool. You learn to read these really fast, too; they're really simple: globals, locals, reduce, unroll, upcast, and they're in the kernel name too, which is neat. How many kernel permutations? We went over this before: 100 million; well, maybe if you prune them down it's more like 10 million. I mean, even 5,000 is crazy high; you're getting like 500 test points in a space of, let's say, 10 million. Do you know if the search space includes very fast solutions? Well, what you can generally do is plot it, and the thing looks like it converges to an asymptote, so I would doubt there are magical super-fast solutions.

Okay, actually I can tell you something we definitely know. The GPU in this computer is an Apple M3 Max, so if we go here... not here... wow, Apple has a lot of different ones... the M3 Max has 14.13 teraflops of compute, so we are at 86%, pretty much 85.8%, efficiency on this kernel. So it can't get much faster: the fastest time that you could possibly get on this kernel is going to be 1.43 divided by... sorry, multiplied by that, so the fastest time you could possibly get is about 1.22, and 86% is good. Multiplication on log numbers is an add, that's true, but what's an add on log numbers? That's what's annoying about that. Yeah, 86% MFU. We're trying to get our training MFU up; our training MFU is not that good yet. Well, it's not bad, our training MFU for resnets is like 33% on the tinybox. All right, are we out of good questions? Are we in dumb-question land now?
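The back-of-the-envelope from a moment ago, written out. The 14.13 TFLOPS peak and the 85.8% figure come straight from the stream; the assumption here is that the 1.43 is the measured kernel time in milliseconds, and the FLOP count is just derived from those numbers.

```python
peak_flops  = 14.13e12   # M3 Max peak compute, FLOPS
measured_ms = 1.43       # what the kernel actually ran in
mfu         = 0.858      # measured fraction of peak this kernel hit

kernel_flops = peak_flops * (measured_ms / 1e3) * mfu   # work the kernel really does
floor_ms     = kernel_flops / peak_flops * 1e3          # time if it hit 100% of peak

print(f"~{kernel_flops / 1e9:.1f} GFLOP of work")
print(f"theoretical floor: {floor_ms:.2f} ms")          # ~1.2 ms, the 1.22 from the stream
```

Which is the point being made: at 86% of peak there is almost nothing left for the search to find on this kernel, whereas 33% training MFU leaves a 3x gap.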
What are the nodes in the MCTS? I showed the action space: each node is a different way of running basically the same kernel. Wow, we're really off into the weeds, guys. They're cousins, they don't collude, okay? It's not like Jensen and Lisa Su get together and say "hey, why don't you make your software suck so we could be worth $3 trillion and you could be worth 250 billion," and Lisa Su is like "okay, Mr. Jensen." You think that's what happens? Come on. They just suck at software and they don't see... it's not important to them, it's very clear that it's not important to them, and they'll continue to suck at software and they'll continue to have 10% of the market share of Nvidia. It's not going to change. People think, oh, AMD is going to get fixed. It's really not. I don't know, I don't want to go into this rabbit hole again. You know: how many therapists does it take to change a light bulb? One, but the light bulb has to really want to change. Where do I work out? Right here in the comma gym, yeah, let's go. Didn't AMD buy some AI companies? Yes, completely the wrong ones. Guys, they want to lose, okay? Some people want to lose.

Bug bounty programs: these are the people who ruined security, man. Security used to be showing people goatse in the middle of a CTF and laughing at noobs; those were the good old days. Well, I heard we're going to make America great again, so maybe we can make hacking great again [Laughter] too. I don't even know what the A and the N are, but we're definitely not talking about trains and buses, man. Any plans for using RL to optimize the search? Well, if you've watched my channel, you know exactly what I think about RL. AMD is a culture problem, not a manpower problem; that's true. RL is a scam, but not as big a one... no, RL will work someday. Bug bounties: you'll always be a loser if you do bug bounties. What people should really do with zero-days is just drop them on Twitter. "George, you're influential in the security community, you can't say that." Oh my god, we got really spicy on this stream. Just drop your zero-days on Twitter, man. Sorry, X. "I dropped it on X, man." Respect. That's how you gain respect as a security researcher: here's a weaponized Chrome zero-day, dropped on Twitter. Don't actually take any of my advice; everything on this stream is for entertainment purposes only, and I take no responsibility if you choose to murder somebody after watching my stream. I do frequently say not to kill people. I think that's a good note to end things on. Isn't RL just MCTS but worse? RL has incredible potential; RL will one day grow up to be something, but today RL does not work. I mean, it kind of works, I don't know. Remember "press the light-up button"? Remember getting frustrated by... "You almost forgot not to kill people." Well, good, I'm here to remind you. We're back. But don't do bug bounties, man. "I got paid $3,000 for this Chrome zero-day that could have brought down the internet." Yeah, great, man [Laughter]. DreamerV3 works, PPO does not? Yeah, that's probably the truth; it's probably going to be some fancy hyperparameter-tuned thing that kind of works. Like, GANs work but your GAN doesn't work; StyleGAN works and DreamerV3 works, but your GAN and your RL don't work. So yeah, I don't know. But build infrastructure, make it easier.
Look at that, look at that kernel there. It found it at 4,432; it just continues to get better. Yuri, thank you for gifting subs. Do you have a last question to ask the stream? Bye.