The ARC Prize Competition: An Overview

The ARC Prize is a new competition to surpass human-level performance on a very interesting AI benchmark called ARC-AGI. What's so fascinating about this benchmark is that it consists of puzzles that are actually very simple for people to solve, but current AI systems really struggle with them. There are many different benchmarks to test AI, but one of the things we've seen over and over again is that a benchmark gets introduced, and then within a couple of years AI is able to surpass human-level performance on it.

Now, the creators of the ARC Prize, Mike Knoop and François Chollet, argue that this is a very different sort of AGI benchmark, because it doesn't test for skills and knowledge; rather, it tests for skill acquisition. What makes the ARC-AGI benchmark so different is that it really requires the ability to acquire new skills based on just a few small examples. So it's not something that can simply be memorized, like information off of a wiki page or even some basic sort of functions. You really need to understand what's going on in these inputs and then generate a new output for a particular test. As a result, it's been almost 5 years since the ARC-AGI benchmark came out, and AI has only been able to achieve 34% completion, whereas typical human-level performance is getting about 85% of the questions correct.

Let me show you an example of what these problems actually look like, and then we're going to get into the details of what sort of solutions might work and why this is so challenging for AI.

Okay, so if we go to arcprize.org/play, it's going to show us a number of these ARC puzzles. They're all these sort of two-dimensional grids of colors, which is why this puzzle is so interesting: the data is actually very simple. It just represents a grid with axes and a certain number representing a color, and the expectation is that we have these examples and that we're able to generate a result for a test input. And by the way, there are two
different test sets: one that is for training, which is kind of easier, and then the evaluation set, which is a public evaluation set that is hard. There's also a private data set that they keep to make sure that any models people submit aren't actually just memorizing the solutions, that they haven't been trained directly on the solutions to this evaluation data set. But anyways, let's take a look at some of these puzzles.

Okay, so in this first example we've got a few different colors, and it looks like this is a sort of pattern that's repeating. And in this case, okay, it looks like it's flipping over on this second row, right? So is that what we're seeing in the second example? Yeah, we've got that same pattern. It's right there, and then it flips over. For the output, we also see that it goes from a 2x2 grid to a 6x6, so we're going to need a 6x6 grid here, and we're going to draw that pattern here real quick. Okay, let's just double-check that that's the right shape, and then it flips over. Let's see if that's correct. All right, cool, so we got this one correct. Again, it was a pretty simple puzzle, but this is the sort of thing that current AI systems really struggle with, and I'll show you that later in the video.

Let's do another example. Let's look at this easy set and just skip around a little bit. Okay, let's take a look at this one. I think this one's pretty interesting because it's very different from the other example. We've got this shape and a blue block, and then I guess it comes down to it. In this case, well, it again kind of moves towards the blue block, and similar here. So it's almost as if this blue block has a sort of gravity to it, where this shape is moving towards it, right? So if we use that same idea here: okay, we'll erase that shape, and it's going to move towards the blue block. I think that's right. Yeah, there we go, awesome.

Okay, cool. So there are tons of these different puzzles, and
you can see, like, this one's pretty interesting, right? There's a different color outlining a portion of the picture, and then we have to basically zoom in on that fragment. So we're going to take these blocks, and that's going to be our solution: whatever is inside of this green square. But this is the kind of thing that does require some abstract thinking. It's really hard to go from just simple data points to a solution here. You have to really look at these examples and understand that context to be able to generate a response.

Something that I find really interesting about this benchmark is that there is a clear correct answer for every puzzle. So it's not something subjective like SWE-bench, where you have to write some code that might satisfy an arbitrary set of requirements. Every ARC puzzle has a discrete answer, and the data structures used to set it up are very simple and clear. So it can be really surprising to see that LLMs like ChatGPT can't actually do this very well. But let's put it to the test: let's see how well ChatGPT can do with some of these ARC puzzles.

So I took this example, which is relatively straightforward. It looks like you've got to basically center all of these wherever the blue block is, and then just move the red and yellow blocks to the right place. So this is the solution, and this is what the data looks like: we can just see representations of the whole grid and the colors that are found in each of the blocks.

Okay, so for this puzzle we are saying: "You are tasked with completing a puzzle which will include three example input-output pairs and then a test input for which you must generate the output. These data points represent colors (the numbers you see in the arrays) on a 2D grid. You must find the pattern of what is happening in each of the inputs and apply that same abstract approach to solving the test example." Here's the set of inputs and outputs, and then here is the test input, and we ask what you
believe the output should be.

Okay: "The output should follow the pattern identified in the provided examples. You can observe the following pattern: each color block, represented by numbers other than zero in the input grid, is repositioned within the output grid, maintaining their relative positioning. Colors tend to shift to align vertically or horizontally within the output grid." Yes, that's true.

Okay, let's take a look at this. You can kind of see that shape in this array, right? Because ones just represent the blue blocks right there. So it actually got this shape correct; you can see that blue block right here. But what happened here? Look, we've got two rectangles. This is what the ChatGPT solution looks like, so it's way off, right? I mean, it's got at least the grid, it responded with an array, but you can see that the abstract thinking just isn't there. It doesn't really understand what we want it to do and the pattern that we see in these other examples.

Okay, so ChatGPT doesn't do super well. But what if we just add scale? What if we add more data? What if we keep training it? Well, eventually it probably can actually solve these puzzles. But François, on the Dwarkesh Podcast, made a really interesting point: any system, if you feed it enough data, is going to be able to answer things if it is able to just retrieve that data from its knowledge base. So if you solve this challenge in a way that simply memorizes all of the answers, yes, that is a possible solution. But what happens when you create a new ARC puzzle? It might actually struggle with that. So the heart of this challenge is really to create a system that is going to be able to see these puzzles, understand the actual intent, and figure out how to solve them.

François made a really interesting point about what sets this challenge apart, and this was featured on the Dwarkesh Podcast. Check out the full podcast in the link below, but here's a clip of his explanation:

"If you look at the benchmarks we're using for LLMs, they're all memorization-based benchmarks. Sometimes they're literally just knowledge-based, like a school test. And even if you look at the ones that are explicitly about reasoning, you realize, if you look closely, that in order to solve them it's enough to memorize a finite set of reasoning patterns, and then you just reapply them. They're like static programs. LLMs are very good at memorizing static programs, small static programs, and they've got this sort of bank of solution programs. When you give them a new puzzle, they can just fetch the appropriate program and apply it, and it's looking like it's reasoning, but really it's not doing any sort of on-the-fly program synthesis. All it's doing is program fetching."

So I think this part is interesting, because François does acknowledge that LLMs are able to generate things based on all of these different patterns that they recognize in their training data set. So we can indeed see some original things, but they're combinations of all of the different patterns seen in training. And he makes a distinction between that and actually having program synthesis, where you are searching for a solution programmatically rather than pulling it out of your memory.

"So you can actually solve all these benchmarks with memorization. And so what you're scaling up here, if you look at the models, they are big parametric curves fitted to a data distribution via gradient descent. So they are basically these big interpolative databases, interpolative memories. And of course, if you scale up the size of your database and you cram into it more knowledge, more patterns, and so on, you are going to be increasing its performance as measured by a memorization benchmark. That's kind of obvious. But as you're doing it, you are not increasing the intelligence of the system one bit. You are increasing the skill of the system, you are increasing its usefulness, its scope of applicability, but not its intelligence, because skill is not intelligence. And that's the fundamental confusion that people run into: they're confusing skill and intelligence.

"If you scale up your database and you keep adding to it more knowledge, more program templates, then sure, it becomes more and more skillful, you can apply it to more and more tasks. But general intelligence is not task-specific skill scaled up to many skills, because there is an infinite space of possible skills. General intelligence is the ability to approach any problem, any skill, and very quickly master it using very little data, because this is what makes you able to face anything you might ever encounter. This is the definition of generality. Generality is not specificity scaled up. It is the ability to apply your mind to anything at all, to arbitrary things, and this fundamentally requires the ability to adapt, to learn on the fly, efficiently.

"The only thing that makes ARC special is that it was designed with this intent to resist memorization. This is the only thing, and this is the huge blocker for LLM performance."

But why does any of this matter? What if we can solve the ARC-AGI prize and we just have a program that's able to generate other programs and figure this out? Well, the thing is, if we can arrive at a system that can figure things out programmatically, we're going to be one step closer to AGI. It's not going to be the full AGI solution, but many experts in the industry agree that one of the missing components for AGI is reasoning and planning, and this is the capability that the ARC Prize seeks to actually create. We're already able to memorize a bunch of information and generate things, but it is very difficult to get current AI systems to actually follow through on a plan and reason about what they're doing. AI agents are currently capable of this to some degree, but time and time again we've seen them struggle to actually follow through
on a plan or create a very effective plan. That's why we've had systems like AutoGPT around for a while, and now Devin is around, but these systems still struggle to execute on larger tasks. That's because they don't have very good planning and reasoning capabilities, because those are actually missing components of the LLM architecture. Now, if we can come up with a new system that can augment that LLM system, I agree with François that we're going to be much closer to AGI, because we'll have systems that can reason and use their extensive knowledge base.

Let's talk a little bit about the competition itself: what sort of limitations there are and what prizes are on the line. One of the important things to mention is that to get a prize, you actually need to open-source the code that you write. This is an attempt to get the community thinking about new ideas and pushing the frontier of AGI research. So if you submit a solution, you're expected to open-source your code in order to get one of the prizes.

Now, the grand prize for the first team that's able to achieve human-level performance, an 85% success rate on the ARC-AGI benchmark, is $500,000. But the organizers of the ARC Prize don't believe that we'll actually achieve that level of performance this year, so the intent is for this to be an ongoing challenge, and there will be other prizes for making progress, for the top teams that submit solutions and open-source them. There are also some additional prizes in case you write a paper that explains the approach you took.

The current state-of-the-art solutions for the ARC Prize involve programs that have no AI in them whatsoever, as well as an AI system that fine-tunes an AI model on the fly based on the examples that are shown for a particular ARC puzzle. So it's definitely very compute-intensive, and even that model only performs at a 34% success rate.

The competition has just started and will remain open until November, so you can go check out the data sets that they have
available and start playing around. To ensure that the challenge is fair, the team has constructed a private data set with their own ARC puzzles that they're going to run through any sort of software that you submit to them. This is what's going to end up on the official leaderboard.

If you do choose to apply, keep in mind there are some rules for the challenge, one of them being that you can't use public AI systems or access the internet in your program. There will also be a secondary leaderboard that allows you to use the internet, but that leaderboard doesn't have any official prizes associated with it.

So definitely check out arcprize.org and see if you have what it takes to push the frontier of AI research. Thanks for watching. Take care.
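
For anyone who wants to poke at the puzzle format described earlier in code, here is a minimal Python sketch. It shows a grid as rows of integer color codes, a toy rule loosely modelled on the first puzzle (a 2x2 input repeated into a 6x6 output with the middle band flipped), and exact-match checking, since every puzzle has one discrete answer. The function names, the color numbering, and the rule itself are illustrative assumptions, not the official ARC code or the actual solution to that puzzle, and the real competition's scoring rules may differ.

```python
from typing import List

# A grid is a list of rows; each cell is an integer color code.
# Which integer maps to which color is an assumption in this sketch.
Grid = List[List[int]]


def tile_with_flipped_band(grid: Grid, times: int = 3) -> Grid:
    """Hypothetical rule in the spirit of the first puzzle: repeat the grid
    `times` across and down, mirroring every other band left to right."""
    out: Grid = []
    for band in range(times):
        for row in grid:
            source = row[::-1] if band % 2 == 1 else row
            out.append(list(source) * times)
    return out


def is_correct(predicted: Grid, expected: Grid) -> bool:
    """Exact match: same shape and the same color in every cell."""
    return predicted == expected


def success_rate(predictions: List[Grid], answers: List[Grid]) -> float:
    """Fraction of puzzles answered exactly right (assumed scoring)."""
    if len(predictions) != len(answers):
        raise ValueError("need one prediction per answer")
    if not answers:
        return 0.0
    solved = sum(is_correct(p, a) for p, a in zip(predictions, answers))
    return solved / len(answers)


if __name__ == "__main__":
    small = [[1, 2], [3, 4]]
    big = tile_with_flipped_band(small)
    for row in big:
        print(row)
    # One of the two predictions matches its answer exactly.
    print(success_rate([big, small], [big, [[0, 0], [0, 0]]]))
```

Writing a rule like this by hand for one puzzle is easy. The hard part the challenge is about is a system that works out the rule by itself from two or three examples.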
