Transcript for:
AI Forecasting: Insights from Metaculus

Hello, and welcome to The Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host Erik Torenberg.

This episode is brought to you by WorkOS. If you're building a B2B SaaS application, at some point your customers will start asking for enterprise features like SAML authentication, SCIM provisioning, role-based access control, and audit trails. That's where WorkOS comes in, with easy-to-use and flexible APIs that help you ship enterprise features on day one without slowing down your core product development. Today, some of the hottest startups in the world are already powered by WorkOS, including ones you probably know like Perplexity, Vercel, Jasper, and Webflow. WorkOS also provides a generous free tier of up to 1 million monthly active users for user management, making it the perfect authentication and authorization solution for growing companies. It comes standard with rich features like bot protection, MFA, roles and permissions, and more. If you're currently looking to build SSO for your first enterprise customer, you should consider using WorkOS. Integrate in minutes and start shipping enterprise plans today.

Hello, and welcome back to The Cognitive Revolution. Today my guest is Deger Turan, CEO of Metaculus, a leading forecasting platform that aims to bring epistemic security to how we understand the future. Forecasting has long fascinated me. I participated in one of Philip Tetlock's original superforecasting tournaments some 15 years ago, and I actually did reasonably well, though I was not ultimately named a superforecaster. Since then I've often visited forecasting sites like Metaculus to take a pulse on future events, and I've always intuitively believed that forecasting could really help most organizations make better decisions. And yet, despite its promise, forecasting has not become a mainstream technology. That's in part because it is a ton of painstaking work to do forecasting well, and thus really hard to scale to the point where forecasts become consistently relevant to decision makers as they operate in highly contextual situations. As a case in point, Metaculus has hosted some of the most referenced AGI timeline forecasts for years, and still today fewer than 1,500 people have registered their own individual forecasts. If you've listened to this show for any length of time, you can probably guess where this is going next: what if we had AIs do the forecasting? It turns out that, with some real effort required and some important remaining caveats, this already works. Multiple recent papers, including some by Tetlock and others, show that AI systems' forecasts are competitive with the best individual humans, though they do still fall somewhat short of the wisdom of human crowds. Deger and I first review this literature together, and then go on to discuss Metaculus's reasons for gathering more complex, probability-distribution-style predictions rather than simply seeking to discover market prices. We also review the community's track record on major world events, and we get into Metaculus's ambitious new AI Forecasting Benchmark tournament, a year-long competition that aims to advance the state of the art in AI forecasting capabilities, with free API credits available and meaningful prizes to be awarded quarterly. Beyond that, we explore Deger's vision for the future of forecasting, not just as a way to predict outcomes but as a tool for building better world models and improving collective decision-making. His ideas on using forecasting for resource allocation, policy development, and even enhancing democracy are both exciting and thought-provoking, and another example not only of how AI's already-superhuman strengths, in this case scalability, have the potential to change the world even if the AIs perform no better than humans on a given task, but also of how AIs are meaningfully generalizing beyond their training data. After all, the future is in some sense definitionally out of distribution. As someone who's deeply interested in both AI development and societal decision-making, I found Deger's perspectives fascinating, and I think you will too. As always, if you're enjoying the show, I'd ask that you take a moment to share it with friends or leave us a review on Apple Podcasts or Spotify, and if you have any feedback or guest suggestions, you can contact us via our website or DM me anywhere you like. Now, I hope you enjoy this exploration of the frontiers of AI-powered forecasting with Deger Turan, CEO of Metaculus.

Deger Turan, CEO of Metaculus, welcome to The Cognitive Revolution. Thank you, big fan here. Thank you, that's kind of you to say. I'm excited for this conversation. I've been a longtime Metaculus watcher, and you've got some very interesting new projects, which will be kind of the bulk of our conversation today. But maybe for starters: you're relatively new to the job, just a handful of months in the CEO role there. Want to give us a little bit of your background in AI? Because you've been working in the space for years before that, and made a move that you might want to unpack a little bit in terms of why the shift to the forecasting realm now.

I guess hindsight is 20/20, especially in a context like forecasting. I have always been interested in questions of collective intelligence: how can we aggregate multiple different perspectives to the point that we can build a coherent world model? I was working on this question while I was still at school, from the perspective of, can we use natural language processing, this is way before LLMs, back in 2016, to aggregate multiple different streams of thought so that this would be actionable for a policymaker? So I've been interested in working with federal agencies and in startups in the space. Right before Metaculus I was working on the AI Objectives Institute, which is a nonprofit research lab focusing on sociotechnical alignment, which is looking at questions of AI alignment in the context of real people in deployments right now, and seeing whether there are any technical steps we can take such that a group of people actually does perceive an AI model as better aligned or more instrumental toward their goals, looking at it from an individual perspective, from a collective perspective, and from a systems perspective, to understand how these tools can be much better used for enhancing human agency. I've been doing a lot of work on questions of collective intelligence: how can we augment a multitude of different views so they can contribute to a system, which might be a single model, or multiple different agents collaborating, a government, an LLM, an ensemble model; this is in a way applicable to many different systems. And very quickly I found myself thinking, okay, say we can aggregate multiple people's desirabilities: even
if we are able to do this at a high-fidelity level, how do we get to the world outcomes that we're looking for? How do we actually move toward a model that is not just driving toward the lowest common denominator and finding a consensus mode, where the desire for consensus actually causes a tradeoff with fidelity to one's perspective, but instead see what would actually be helpful? What future do we want to live in? Are causes prioritized correctly so that we can move toward better futures? Can we find positive-sum games? Who will shoulder the externalities if there is no free lunch, and how can we make sure we're tracking that? These questions brought me toward thinking, okay, who's looking at the world from the perspective of: if we take course of action X, will that yield a better outcome? That's how I found myself thinking about forecasting, because even if I'm able to have a very high-fidelity version of desirability elicitation, preference elicitation, and understand all the cruxes that exist in society, that still does not bring me to a machine system being able to take actions that yield net good. So exploring that, along with the question of the wisdom of the crowd, you know, if you have multiple different perspectives, be they real humans or AIs, that once aggregated do a better job of forecasting outcomes, I realized this is actually quite instrumental to how we look at the world. That is what brought me toward thinking more about forecasting.

Cool. Just to rewind back in time for a second to the 2016 era: did any of that stuff work? Were you able to make anything with the technology that was available at the time that you thought yielded any practical utility?

Yes. Well, back then I was working with Dan Jurafsky at Stanford on a question about federal agency regulatory feedback. The FCC net neutrality debate was just taking off, and there were 2 million comments submitted to the federal agency, and at that point this seemed like, wow, this is the largest dataset of raw text ever collected; now it's ridiculously small. But at that point there was this question of, okay, how should a central agency pay attention to this? There were a lot of interesting questions: a lot of the submissions were form letters that are copy-paste, but the fact that something is copy-paste does not make it illegitimate, you know, I might read a campaign and say, yeah, I agree with this, I want to send this. How should you take that into account compared to something that is bot-generated, compared to something that is a genuine opinion that someone has taken time over but that is not necessarily sophisticated and has visible shortcomings? In my thesis work I was looking at this both from a legal perspective and a technical perspective: what are the expectations under US legislation, but also, how can we have natural language processing augment this? Funny enough, at that point I was thinking a lot about questions of Goodharting, like every time we start focusing on a specific, narrower metric, it ceases to be a good measure, and this was very apparent in being able to claim that you're listening to the public. The things we tried were so simple and yet they seemed like, oh, this is a paradigm shift, just simple clustering strategies, unsupervised learning, topic modeling, and even that went a long way because people had no idea. Now we can do much more sophisticated versions, and people still assume that we don't need to ask these questions because we cannot do large-scale qualitative surveys, so we keep going back to Likert scales. So I guess the crusade I've been on for a long while is: can we go past "how happy are you, 1 to 7" or "should we build a road from A to B or A to C"? I'm like, maybe we need to build a bridge instead, and if you don't let people express that, you're just not going to get that opinion. So how do we get machine learning systems, optimizers, to understand, ah, there's something qualitative that I need to elicit here, that even the speaker might not be sure they want? How can we give room for that human flexibility in these missions?

Your comment on Goodharting definitely resonates with me, particularly today. I was just having some back-and-forth on Twitter about the new GPT-4o mini model, which is coming in ahead of Claude 3.5 Sonnet on the LMSYS leaderboard, and that raises the question: it seems like the general consensus among people like me, who are obsessed with the latest AIs and have their own individual opinions, is that 3.5 Sonnet is the best; certainly we would all tend to think it's better than GPT-4o mini, and yet there it is, ahead in terms of the votes that it's getting. So this has raised all sorts of questions: why is this happening, do we trust this thing anymore, is it just about formatting, and if it is mostly about formatting, does that mean it's illegitimate in some way, or does that mean we should really think more about the importance of formatting? But yeah, as always, it's extremely difficult to pin much down if you really want to do it with any real rigor. So that is maybe a good bridge toward forecasting, and certainly the same is even more true about the future. I've been interested in this space of forecasting for a long time. I actually participated in one of the original Tetlock-organized superforecasting tournaments maybe 15 years ago now; Tyler Cowen posted an invitation to participate and I was like, I'll try my hand at that. I did reasonably well, you know, I didn't top the leaderboard, but I came out feeling like I had at least somewhat of a knack for it. And I've always found the sort of Robin Hanson perspective, that this seems like a much better way than the highest-paid person's opinion or whatever kind of prevails in organizations, compelling. But it seems like, generally speaking, we still don't see a ton of forecasting deployments in the world. What would your analysis be of kind of where we are in forecasting, where you think the most notable deployments are, and why we don't see more than we do today?

Right, I want to answer from the previous point you made around Goodhart, because I think this actually leads there. You know, "which model is better": define better, right? Better is incredibly context-specific. Even in just the small world of things I am trying to do personally with Metaculus, Claude 3.5 is more useful for me for question writing, but not necessarily for fact retrieval, and that's a very narrow thing; with the current versions of these models, one of them has an easier knack for one thing, without interrupting, without self-censoring, so I can move forward faster. I think the core of forecasting also relates to this: how can this process actually be useful is one thing, and it is different from, will it score well on these benchmarks, like in this one specific competition will 4o score better? That is a much narrower thing. In a way, usefulness is almost always anecdotal,
and I think we are not paying enough attention to that, and the same is applicable to forecasting as well. There have been so many tournaments, like the ACX 2023 tournament, that are interesting for how accurate the forecasts have gotten, but to me, okay, how is this actually going to be useful for an end user is a different question, and that is the approach I really want to focus on with Metaculus for the upcoming sprint. Okay, I have a couple of validations that I feel satisfied with. What are those? One, the wisdom of the crowd works. It is interesting that it works, but when you aggregate a number of people, and we will go into it, when you aggregate a number of bots that are trained with LLMs or what have you, it seems to be able to draw an accurate picture of the future, and we have enough case studies of this. Why is this not more commonly used? I think usefulness is a whole different paradigm than accuracy, and this is really what I want to focus on. What do we want to do on this front? For example, can we identify critical inflection points toward future states is a different question than just, is your forecast accurate. What is the shape of the world that we want to live in? What are the entities, you know, these can be federal agencies like my prior work, philanthropists, corporations, nation-states? The framework of forecasting so far, especially from a Tetlock lens, I think has not focused on how this will actually be useful, but has more so focused on how we can improve the accuracy. I would like to move forward toward coming up with versions of forecasting deployments that actively pay attention to how the decision maker will take this into account, and what things are within their space of levers. This is why I think forecasting in the context of the humans who are trying to live a better life is extremely important. I can share more examples on this front, but I'm curious where you'd want to go with this; pretty much our research agenda for Metaculus for the next chapter, for the next two quarters, heavily focuses on a lot of experimentation on these kinds of questions.

Yeah, interesting. Does that look like, basically, conditional markets being kind of the first thing? I mean, that's something that Robin Hanson has talked about for a long time, and I don't see much of it, but we've just lived through a moment in history where it seems like something like that was pretty motivating to the current president of the United States, who, if reporting and general intuition are correct, seems like he was not sold on the idea that somebody else had a better chance than him until he saw polling data suggesting that, and then was kind of like, well, I guess, yeah, if somebody else has that much of a better chance, maybe I should step aside. Play that out a million different ways for a million different contexts, but is that the kind of next big paradigm?

I think that is an example that is basically a very simplified state of affairs, and one that was obvious to the entire population except the decision makers themselves for longer than it should have been, in my opinion. That said, the real cases that I'm interested in are much more sophisticated ones. That one heavily looks at public opinion and is more, if I may say, vibes-based rather than data-driven, and I'm interested in really pushing this edge further. Say we have $50 million to allocate toward making the world better, pick your challenge: reducing microplastics, climate-related risks, AI-related risks. How should we allocate these resources to get to a better world? That requires us to build world models. I think forecasts around what specific outcomes we can get to if we take an intervention point are very helpful, and the simplest thing we can do is exactly what we said: come up with conditional forecasts. If we pull lever X, will that bring us to a better future, and if not, what will it look like? If we see high divergence here, that is great. Now, I want to push this much further. We can already do this, but the truth is, conditional questions are somewhat hard to answer; it's hard to think in that framework. So can we build tools, be they LLM-driven or discourse-driven, or come up with ways in which people can collaborate, that enable conditional forecasts to be more useful? This is one avenue. One of the things I want to do in the near future is launch a series of Minitaculuses, this is the name we came up with, think of them like subreddits: Metaculus instances. This will come in the context of us open-sourcing Metaculus, and we hope that many of these will spring up in the next quarter. Each Minitaculus will be focused on a specific, goal-oriented question, such as "we are trying to reduce microplastics," for example, or "we're trying to reduce homelessness in San Francisco," or geared toward disaster response, or the consequences of research avenues in the context of AI. I would want every question in that Minitaculus to serve the parent: if we are able to answer this question, it bubbles up to the parent, and then we can see which of these questions have the highest value of information, which of these questions actually inform us to say, oh, it looks like this intervention is going to be able to drive much further value. I would want us to identify critical inflection points, like, is there a specific moment at which the world models seem to diverge? Are we able to extract schools of thought in the context of forecasting by seeing, oh, this group of forecasters seems to be much more in accordance over multiple forecasts; why is their world model divergent; can we double down on finding more questions that can help us excavate that? These kinds of research questions go way beyond what Metaculus has aimed for so far. This is actively trying to build a world model in an enclosed world space, a Minitaculus with a specific goal, and I find this to be really, really interesting. The kinds of things I would love to do are, for example: can we find short-term proxies and heavily forecast both on whether a short-term proxy is good and on how the short-term proxy is panning out? The way we landed on the FAB, the Forecasting AI Benchmark, was basically guided by this. In a way, we can frame the presidential debate through this lens too. I think we as a civilization need to think much more from this lens; this is a version of a cognitive revolution where we change how people are thinking about the future, in a way enabling them to coordinate between their world models. I still see forecast aggregation as fairly low-fidelity for coordinating world models, but I think we can use it as a building block, as a primitive, and come up with much better world models.

Hey, we'll continue our interview in a moment after a word from our sponsors. I am really excited that our new sponsor, 80,000 Hours, is now offering free one-on-one career advising sessions to Cognitive Revolution listeners. 80,000 Hours aims to be the best source of advice for people who want to do the most good that they possibly can with their careers. We typically work for about
40 years in our lifetime, and we work about 2,000 hours per year. That is the single biggest opportunity that most of us have to make a positive contribution, and it's worth being strategic about it. That's where 80,000 Hours can help. I actually used their career advising service myself two years ago. I had just finished the GPT-4 red teaming project, and I wanted to do anything I could to nudge the AI future in a positive direction, but what could or should I do? That was not clear. After my call with 80,000 Hours, I got a number of connections to outstanding individuals in the space, and over the course of the follow-on conversations I developed confidence that this podcast was one of the projects that I should pursue. Today I'm thrilled to have built an audience of thoughtful, high-potential people that 80,000 Hours wants to help. To request a free one-on-one career advising session, follow the link in the show notes: it's 80000hours.org/cognitiverevolution. That's 80000hours.org/cognitiverevolution. Sign up for a free one-on-one career advising session, figure out how you can make a positive impact on the AI future, and I think you'll be glad that you did.

This episode of The Cognitive Revolution is sponsored by the Brave Search API. You may know of Brave as an alternative to Chrome, but did you know that Brave has its own independent search engine, powered by its own 20-billion-page index of the web? The Brave Search API gives developers reliable and affordable access to programmable web search, autosuggest, spell check, and more, with flexible plans for all types of use cases, from RAG search to automated business intelligence. On top of that, it's up to three times cheaper than Bing, all without compromising on quality, speed, or reliability. Over 700 businesses, including Cohere and Kagi, rely on the Brave Search API, and a recent survey showed that 94% of customers would recommend it to their peers. To start building apps with the power of the web, sign up at brave.com/api and get up to 5,000 free monthly calls.

So let's maybe take one step back, just for those who maybe don't have so much experience with Metaculus as it exists today. I'd be interested to hear, for starters, how do you describe it to somebody who's never seen it before? Would you call it a prediction market? Would you call it a forecasting platform? How does it fit in? Because there are a few of these now, right; people would be familiar with online betting platforms, but also there's Manifold and Polymarket, which I see kind of shared around. How would you describe the overall product today, and how would you distinguish it from some of the other things that are out there?

Metaculus aggregates forecasts; it's a forecasting platform and an aggregation engine. I would not call Metaculus a prediction market. The goal of Metaculus is to bring a level of epistemic security to how we envision the future. We can call it a Wikipedia about the future: it is a space in which multiple people can see how they foresee the future playing out and where the critical discussions take place. It works with predictions as its core building block, but it does not have a monetary incentive, and it behaves quite differently than prediction markets as a result. Some of those differences are the reasons why, in my opinion, Metaculus has been both more accurate and more rigorous compared to a lot of other spaces. I can shed more light on that. For example, on a prediction market the participants are buying and selling contracts that are based on the future outcome of an event, right? So you place bets, and the result is a zero-sum question. For example, if the current forecast is at 60% and you think it is 60%, you don't have an incentive to add that information in, because it can only get worse from there. On Metaculus this is not the case: there's no financial incentive, and instead we are comparing all the forecasters' accuracy and track record through time, and this creates a different set of incentives that I believe are actually much more productive toward something that looks like a Wikipedia about the future, where people are, one, incentivized to share what their world model is, and two, incentivized to participate in forecasts and not think about de-risking or spreading, but instead focus on: what is the world model that I have that I can share? Our scoring mechanisms are different from prediction markets as a result of that too, because the reward mechanism is that you reward a forecaster over time for their calibration across many predictions and for their accuracy. We also have a metric that we call the community prediction. This is interesting because this is where you can see everyone that has predicted on a question, and we also have weighted averages depending on your past success, your track record, and recency, because more recent forecasts tend to be better since you have more information. Using these, we actually end up in an ecosystem that is much more collaborative and much more grounded in good epistemics, and this brings higher rigor as well. So the questions being forecast tend to be more like "carbon emissions by this date" rather than personal questions about who I am going to be dating next, and I like that; there's a market for that too, but I like that there is a space focused on the rigor that is actually beneficial for the world. That is what I find distinct and attractive about Metaculus.

Gotcha. So it seems like the big thing there is the incentives: whereas on a more market-oriented platform I am looking for things that are wrong so I can make money, here I am looking for opportunities where I can share something with the community that hopefully brings the overall community forecast toward a more accurate one.

Right, but on top of that I will add there is a financial incentive in Metaculus too, in that if you have a very good track record as a forecaster, we will hire you as a pro forecaster to engage on specific paid client work, and these range from federal agencies to large retailers to hedge funds, think banks, that have much narrower questions. These questions don't make their way to the public forecasts that we have; they are paid engagements, and we also pay the pro forecasters for their reasoning so that they can actually explain why they see the forecast the way they do. Just to echo back to the thing we were talking about, I think one of the core projects that we have for Metaculus is going to be reasoning transparency. Making up an example, say we were thinking about Taiwanese sovereignty, this was the context in which this came up; I was in Taiwan recently for a conference, and we were talking about whether Taiwan's electricity grid would be challenged by China in the process of a potential invasion, and the forecast jumped from 30% to 40%, and there were no comments under it because it was fairly new. I was sharing this with one of the folks there who works on this from the government's perspective, and they were saying, unless I know why this is the case, I cannot take this into account. For that, we do pay our pro forecasters to come up with better forecasts along with their explanations. We've placed people in jobs in the past, so in a way the leaderboard can translate into financial incentives; it happens on the substrate of the question itself, and that I think is quite important.

Okay, cool. Can you say a little bit more about how you create that community forecast out of the individual ones? You kind of touched on this briefly, but this is one of the big questions that I've had, honestly, for a long time, because there are two kind of, I think of them as canonical, AI questions around AGI timelines that I've come back to for years now. I think I first tweeted about them almost two years ago, and these are the weak AGI and the strong AGI questions, and we can unpack this a little bit more in a minute. But one thing that I've always kind of wondered is, exactly what am I looking at when I'm looking at this? To what degree is it a naive aggregation? Because you can go in there and put in a whole distribution, right? So I'm interested in your comments on that too. I don't have a great sense of what I'm doing when I'm actually drawing a curve over the potential timeline of something happening. I guess I'm always sort of vibing with these predictions, but that brings it home to me, where I'm like, boy, the precision here feels like something I'm leaving behind. It's weird, right: it's a way to express my uncertainty, but it's also a way to be very specific about my uncertainty, and even that I don't feel like I really have. So I'm interested in how you see that kind of curve drawing, and then how you are aggregating all these curves into one curve. Do people get upweighted if they're better? Is there a recency bias?

There are a couple of different parts there. With respect to the community prediction calculation, we basically keep the most recent prediction from each forecaster that we have, and we weight each forecast based on how recent it is before aggregating. For the Metaculus prediction, especially if there is a paid client that is looking for a specific outcome or higher rigor, we also do things like pay attention to the past track record of each of the forecasters and aggregate that information in as well. For a binary question, the community prediction is basically a weighted mean of all of the individual forecasters' probabilities, and if you have a numeric question or a date question, it is a weighted mixture of all of the individual forecaster distributions. There are actually a lot of different philosophies of aggregation, and this is one of the spaces where I am interested in experimenting further; these are the ones that we use, and in general the community prediction works and seems to be really accurate. There may be a couple of folks that seemingly consistently beat the community prediction, and there are a couple of posts from Astral Codex Ten, etc., that have focused on exploring this. A thing that I'm very interested in is that very soon we will open-source Metaculus, and I'm curious for people to come up with, you know, hey, here's a different scoring mechanism or aggregation mechanism that actually seems to work better or seems to be more instrumental. I really want to encourage experimentation in the space. What we have so far seems to work.
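To make the mechanics concrete, here is a minimal sketch of a recency-weighted aggregation of binary forecasts in the spirit of what Deger describes. The exponential decay and the 30-day half-life are illustrative assumptions, not Metaculus's actual parameters; a track-record weighting would multiply in a further per-forecaster factor, and numeric or date questions would use a weighted mixture of each forecaster's distribution instead of a single probability.

```python
from datetime import datetime, timezone

def community_prediction(forecasts, now=None, half_life_days=30.0):
    """Recency-weighted mean of binary forecasts.

    `forecasts` is a list of (user_id, timestamp, probability) tuples.
    Only each forecaster's most recent prediction is kept, then each is
    weighted by an exponential decay in its age. The 30-day half-life is
    an illustrative choice, not the platform's real parameter.
    """
    now = now or datetime.now(timezone.utc)

    # Keep only the latest forecast per forecaster.
    latest = {}
    for user_id, ts, p in forecasts:
        if user_id not in latest or ts > latest[user_id][0]:
            latest[user_id] = (ts, p)

    weighted_sum, weight_total = 0.0, 0.0
    for ts, p in latest.values():
        age_days = (now - ts).total_seconds() / 86400
        weight = 0.5 ** (age_days / half_life_days)  # newer forecasts count more
        weighted_sum += weight * p
        weight_total += weight

    return weighted_sum / weight_total if weight_total else None
```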
If you just copy the community prediction, you will do pretty well in a tournament, but then you're not really bringing in independent information, right? So the better thing to do is to do your homework on what past forecasts might be similar to this, then contribute, and only after that look at the community prediction, because otherwise you start messing with the wisdom of the crowd. You said you might be doing vibes-based reasoning; I think a lot of supposedly reasoning-based forecasting ends up being vibes-based for a lot of people. Not everyone is going to build a mathematically rigorous model or be a domain expert, but somehow when we put many of these together, it actually does yield a better outcome. I would start from, for example, the comparisons from the 2023 ACX contest, which are quite interesting to look at for a comparison of how Metaculus compares to prediction markets or superforecasters, and I really like Samotsvety as an anchor; I think they're actually doing a great job as a reference point. To me, part of the question is how to make the forecasts useful along with how to make the forecasts accurate. I think there is enough track record in a couple of different posts, like the "who predicted 2023 best" post from Astral Codex Ten, which is quite interesting for this. But if you zoom in on specific moments where an entity, be it a federal agency or a person, was able to take action from a forecast, those are the moments where I see, yes, Metaculus was actually useful or interesting. For example, in epidemiology Metaculus has outperformed panels of experts on COVID and also informed hospital resource allocation and public health decision-making; a lot of the forecasts were more robust and accurate than baseline models in predicting COVID vaccine uptake, and I do think that was quite interesting. Another part is the monkeypox outbreak, where Metaculus was able to quickly provide perspective. For example, in January 2020, when conventional wisdom was that COVID wouldn't be significant, Metaculus instead was predicting that more than 100,000 people would eventually become infected with the disease, and a lot of folks took that as an early warning sign; I remember reading a post about someone making a bunch of investment decisions because Metaculus said so. Another case is the predictions around Russia's invasion of Ukraine. There's a comment on Metaculus that I really like, where a Metaculus user in Ukraine said, "just want to say that I moved from Kyiv to Lviv on February 13 entirely thanks to this prediction thread and the Metaculus estimates." That, to me, is a moment where, yeah, the forecasts were instrumentally useful for someone making a decision. We've been cited in The Economist, Our World in Data, Bloomberg, Fortune, Forbes, you name it. I think the core question is how these will be able to turn into instrumental actions that one can take, and I'm particularly interested in this being helpful for the AI sphere and making sure the tools that we're building in AI as a whole are actually serving humanity. If there is something that has a good track record and the ability to influence individual decisions, we should explore why and how this can be much more useful, and that is the lens through which I think prioritizing AI is important, but also other cause areas. Some folks say we should purely focus on long-termist causes; I actually think a lot of short-term intervention modeling and being able to take successful action teaches us how forecasting can be helpful in longer-term contexts. Take questions like, how can we reduce homelessness in San Francisco: if I have an additional five million, should I just go build two more houses, is that the best thing I can do, or is there a form of lobbying that a large crowd thinks will make the housing market internalize the negative externalities of homelessness, to the point that we can have a more well-grounded change? I would like this to be a Metaculus instance where forecasts can be conditional, they can be intertwined with each other, and we can explore different ways to visualize them. These will bring usefulness to someone, and us being good at this in a repeatable fashion will help us tackle the largest questions much better.

Hey, we'll continue our interview in a moment after a word from our sponsors. Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use Cogrev to get a 10% discount.

Hey all, Erik Torenberg here. I'm hearing more and more that founders want to get profitable and do more with less, especially with engineering. Listen, I love your 30-year-old ex-FAANG senior software engineer as much as the next guy, but honestly, I can't afford them anymore. Founders everywhere are turning to global talent, but boy is it a hassle to do at scale, from sourcing to interviewing to on-the-ground operations and management. That's why I teamed up with Sean Lanahan, who's been building engineering teams in Vietnam at a very high level for over five years, to help you access global engineering without the headache. Squad, Sean's new company, takes care of sourcing, legal compliance, and local HR for global talent, so you don't have to. With teams across Asia and South America, we can cover you no matter which time zone you operate in. Their engineers follow your process and use your tools; they work with React, Next.js, or your favorite front-end frameworks, and on the back end they're experts at Node, Python, Java, and anything under the sun. Full disclosure: it's going to cost more than the random person you found on Upwork that's doing two hours of work per week but billing you for 40, but you'll get premium quality at a fraction of the typical cost. Our engineers are vetted, top-1% talent, and actually working hard for you every day. Increase your velocity without amping up burn: head to choosesquad.com and mention Turpentine to skip the waitlist.

Cool, very interesting. So, going to these two AI questions that I refer back to the most, I think probably many listeners will have visited these pages, right, the weak AGI and the strong AGI timelines. Each one has kind of four different resolution criteria, and the question is basically when a single AI will satisfy these different criteria, and you get to predict your distribution of dates, right? I think this has been really interesting and quite informative for a long time, although more recently it does feel like it also highlights a real challenge of writing these questions, which is that when things are farther out, it seems, or it can seem, and I feel like in this case it does seem, like the detailed criteria were quite reasonable, and now, as we're getting closer, I'm feeling like there's sort of a divergence between what matters and what is actually the letter of the law in the
question, particularly when it comes to the weak AGI question's use of a specific form of the Turing test, where I'm kind of like, okay, from my standpoint, in the sort of intuitive "what really matters in the Turing test" line of thinking, I would say we've passed it. And yet the formulation requires an expert interrogation and that the experts not be able to tell which is the AI and which is not, and we're not that close to that. But I always emphasize for people that I think the reason we're not that close to that is a design decision of the people making the AIs that are to be tested, right?

How confident are you about this claim?

I would say quite, inasmuch as, if I wanted to create an AI that would be impossible, or, you know, much more difficult, for an interrogator to identify as AI or not, the first thing I would do would be to have it say "I don't know" a lot more often than it does. The easiest way for me to tell whether something's an AI or not is just to ask ten very long-tail random questions and be like, no human is going to answer all ten of these questions, nobody has that breadth of knowledge; actual people would say "I don't know." So I would actually dramatically narrow the range of responses from the AI, make it seem much more conversational, make it seem much more ignorant, make it much less useful for what you would actually go to ChatGPT for, but I think I could make it a lot harder for the expert interrogator to figure out what is what. So anyway, that's just one example of this general problem; I wonder how you guys are thinking about that.

I'll start from what you just said and then zoom back to the AGI question. The changes you mentioned could probably be enacted by just using a couple of language models that are policing each other, and you could basically get to a system that actually behaves this way right now, so I'm not very sure that that is the main threshold here. There's a quote from Jaron Lanier that I actually used at the beginning of my thesis, which was focused on how we can augment citizen participation in governance through natural language processing, I guess today I would call it LLMs, where he says the Turing test cuts both ways: if you can have a conversation with a simulated person presented by an AI program, can you tell how far you've let your sense of personhood degrade in order to make the illusion work? I think this is quite important here. Being able to communicate in a certain pattern full of expectations is not what I am interested in; I think that's a very, very low bar. In fact, my interest goes way beyond "am I able to simulate something that is convincing in this closed-box tournament"; that does not actually bring us a better world, in my opinion. I actually agree with you, I don't really love the way the question is operationalized, but keep in mind they are both from 2020; we will not always formulate questions so that they are optimally informative years later, right? And I would say that in 2020 these questions, you said it yourself, were in fact quite useful, to the point that I would argue they probably moved the Overton window with respect to how people were paying attention to the impact of AI, and I think that is one value-add that's both hard to track and intuitively resonates with me, which is what also draws me here. So I think these questions somewhat served their purpose. Now, zooming out of the purpose question to, okay, but could we have a better question if we were to do this today: instead of it being a single question, I think this should be shaped something like the Minitaculus instance that we were talking about, where all of the subparts of the question would have multiple different forecasts, and all together, as an ensemble, they actually give you a picture. Heck, we don't even have a clear definition of AGI, to the point that the question needs to operationalize itself one way or another, right? One of the projects that I'm pretty excited about on Metaculus is to come up with indexes; basically, just the way you can aggregate a bunch of stock tickers to come up with a composite view, I would like there to be 30 forecasts that are all really zooming in on slightly different aspects of this, and all of them together is what you're paying attention to. Now, you can say, well, I want to rigorously link all of these through causal diagrams, and then you will get into a whole different hairball; sure, that could be helpful if you pull it off, but even a lower-fidelity version of this, just "here are 30 forecasts that all rhyme with each other with respect to their focus point," will be able to shed more light. And I think at that point, with what Metaculus had, that was possible, and these questions did serve their purpose; it's the challenge of forecasting the future, right, we don't always know which formulation of a question will be most useful years from now. For example, we're doing the "five years after AGI" question series that just launched, and I am much more interested in that, because when I look at the answers there, it actually gives me a much more accurate view of what people even conceive of as AGI. The way I am using those questions isn't about thinking, okay, will these things happen when AGI hits; it more so informs me, okay, these are the things that people consider as critical possibilities within AI or not. And just doing a throwback to some of my prior work with the AI Objectives Institute, I often find that if we are talking about AI and AI alignment without talking about the societal context in which this has impact, we are narrowing things down. For example, there's a common meme that we would use at the AI Objectives Institute to talk about what a successful AGI that can enable human coordination looks like in a post-AGI world: if you don't think Jerusalem will be a united and peaceful city, maybe your AGI isn't ambitious enough. Maybe something that is truly AGI would be able to resolve clashes and cruxes within the existing human sphere, where it's not trained to make sure it is perfectly neutral toward them, but instead augments human agency in a way that actually brings meaningful change. Because otherwise, sure, I can rig up a system so it says "I don't know" enough times that you might find it convincing; I find that to be a low bar. I think we should be more ambitious.

Yeah, so how much editorial control do you exercise over these questions? Because it does seem like these couple of questions have sort of become a bit of a Schelling point, and it's a delicate decision, probably, from your perspective, right: do we take a proactive step to retire a Schelling point because we feel like it's outlived its usefulness, or do we just let the community kind of gradually move on?

I wouldn't say that at all; I think these questions are good, I am much more interested in pushing things further. So I guess it's also a call to action, an ask for help, for everyone that loves Metaculus or is interested in this:
as we launch Minitaculuses, which are focused instances, if you have different formulations that you think are interesting, come tell us. We already get a lot of inbound questions, right, around "hey, can we write a question"; we heavily editorialize them because we do want to maintain the level of rigor, and so far I think Metaculus has been much more closed than I would like it to be. I would love for there to be many more people who write questions, especially domain experts, especially people who say, hey, I get the hang of what makes a question resolvable and meaningful, and these are things we can help with. But we are entering a chapter of Metaculus, as we are open-sourcing, of getting many more people to write questions, to have instances that they are hosting, with maybe different scoring mechanisms and different levels of rigor, to the point that it actually proves useful to them. We will also host instances at Metaculus that are very domain-specific. So I just think we need more questions; it's not about going back and changing what this has accomplished, but more so bringing on new lights. For example, I'm quite excited about a bunch of new technologies that could actually bring better results. I love a lot of the more symbolic-based approaches, like, are we able to use language models to come up with specific criteria and use that to create safeguarded AI systems; I'm interested in singular learning theory; there's a bunch of new techniques where, when I imagine versions of AGI, or even weak levels of AI competence that are really good at these, the ways in which I envision those futures are substantially different, and I'm quite interested in exploring these. I think that just requires a breadth of more questions to forecast, and also more human energy or AI energy to forecast on these questions. So anyone that is interested in saying, hey, I have things I want to forecast, I have a formulation that I think is better: come talk to us.

How do you create density, though, right? I mean, the one worry that I would have if I was you, and I got a million new questions in, is that now I need like a billion predictions, right, to actually create a meaningful community forecast for all those million questions. So how do you think about balancing the diversity of questions versus the density of forecasts?

Right, great question, and this is where the AI can come in; we'll get there momentarily. You hit the nail on the head: the real currency here, the real limitation, is human attention bandwidth, right? If I have 10,000 questions being launched on Metaculus every day, it's not going to help anyone, at least until everyone is forecasting part-time as part of their job, and that's not what I'm trying to angle toward yet. But this is why we need to do active research on questions like: can we build indexes that aggregate a bunch of different forecasts; can we have AI start forecasting? As part of the AI Benchmark tournament, we've had an explosion of AI contributors, and they seem to be doing okay, and can that create a scaffolding or jumping-off point for humans to get to a rigorous forecast much faster? There's, I think, a lot we can do to increase the quality. If we have 10,000 new forecasts every day, I think that will just drown any kind of quality out of the system, and the win condition also isn't, you know, "AIs are forecasting and humans are just watching." I think the question is more around whether we are able to identify what the best questions are, and I want that to happen with more people, not just the Metaculus team. And I want the contribution of the bots to be geared toward shedding more light and more information, and to have these questions be incrementally composable so that we can build world models where all of them rhyme with respect to how we're thinking about electric vehicle proliferation, for example, and using those we can then have much bigger questions. Forecasting a specific window of electric vehicle proliferation in China is quite specific, but if I have 30 of those questions, I can then make an aggregate that says, you know, will electric vehicles be better for the environment. It's like, what does that even mean? But it actually means a composite of all of these different things. This is a different frame of reference for thinking about how we can aggregate.

Cool, well, that's great motivation then to get into what the state of the art is in LLM-powered forecasting, and then the tournament that you're running to try to advance that state of the art a little bit further. In prep for this, I read through three linked papers that you had shared, and I was overall pretty impressed, and you can give me more detail on what you think is most important or stands out, or what the kernels are that you really want to build on, but it seemed like across-the-board pretty positive results, and these are from serious authors, including Tetlock, who has been involved in some of this research. Just to summarize the three papers: the first one was "Approaching Human-Level Forecasting with Language Models," and this was from Jacob Steinhardt and his group. They basically created, I guess, what I would describe as the intuitive thing that I would create if I was going to work hard on trying to make this work, and that is a retrieval system: the ability to go on the internet, search through the latest news, process that, and ultimately create forecasts. And it seemed like it was pretty good; it was coming close to the community forecast, although not as good as the overall community forecast. A question that I did have reading that paper, which I don't know if you would know the answer to, is, okay, it was a little bit short of the community forecast, but how does that compare to an individual human? At what percentile would that system have performed? I don't know if you know that, but my intuition was that it would be pretty high as a percentile of individual humans.

Right, I mean, let me give you a thing that I do believe, that I don't have data for but I think we will soon have data for, which is that a bot in the hands of a team of human forecasters will just kill it; that is obviously going to be much better, right? Going back to the question of usefulness: I like the academic rigor aspect, don't get me wrong, I do think testing these in isolation is good, and a lot of our design around AI benchmarking for the competition pays attention to that, but the part that is already obvious to me is, let's actually look at what these systems are good at. They're good at going through massive amounts of content and identifying which parts may be relevant; they're good at taking a first step at creating a world model. Let's actually build systems that pay attention to those first and build on top of that. I particularly like the Steinhardt lab paper, the Halawi paper; a lot of the intuitions that I would like to explore further are in that paper, and for the many folks who have asked me, "I want to compete in the tournament, what should I do," I always point them there.
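For readers who want the flavor of that approach, here is a minimal sketch of a retrieval-augmented forecasting bot in that spirit: retrieve recent news, summarize the relevant evidence, then ask a model for a calibrated probability. `search_news` and `llm` are hypothetical stand-ins for whatever news API and model client you actually use, and the prompts are illustrative only, not the paper's actual implementation.

```python
def forecast_binary_question(question: str, resolution_criteria: str,
                             search_news, llm, n_articles: int = 10) -> float:
    """Retrieve news, summarize relevant evidence, and elicit a probability."""
    articles = search_news(query=question, limit=n_articles)

    # Condense each article down to the parts that bear on the question.
    evidence = "\n\n".join(
        llm(f"Summarize only the parts of this article relevant to the "
            f"question '{question}':\n\n{article['text']}")
        for article in articles
    )

    answer = llm(
        "You are a careful forecaster. Question: " + question +
        "\nResolution criteria: " + resolution_criteria +
        "\nEvidence from recent news:\n" + evidence +
        "\nThink step by step, then give a final probability between 0 and 1 "
        "on the last line in the form 'Probability: X'."
    )

    # Parse the last 'Probability:' line and keep it away from the extremes.
    prob = float(answer.rsplit("Probability:", 1)[-1].strip().rstrip("."))
    return min(max(prob, 0.01), 0.99)
```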
Start from that paper, and then try different things: try different ensemble methods, use different language models, don't share prior data, have them debate each other, have a third model look at it and synthesize them. There's a lot of playful strategy that one can go with, so I really enjoy thinking in this framework, and I've been encouraging it. There's a very active Discord, by the way, for the tournament, where people are discussing strategies; it was absolutely delightful to see how much the models have evolved in just the three weeks since the tournament kicked off, given the different strategies people have come up with, and we do encourage people to update their models during the tournament. There shouldn't be a human in the loop, because we're trying to benchmark the state of affairs, but if they have a better strategy, they should update it. Another thing we do is reasoning transparency: have the models post at least some form of text. We don't grade or score this, because that is a whole different level of complexity; let's stay with forecasting accuracy, but it at least gives us a window into how the model is thinking, and a lot of earlier research on chain-of-thought reasoning has pointed out that explanation is actually what keeps these models better grounded. So to me these things are really good. A couple of other avenues that I don't think have been explored as much lately are around how we can bring in model-based reasoning on top of language models, which will actually yield much, much better results. Even in the first few weeks, we have, what, 15 questions that have resolved so far, and in that signal we've had plenty of questions that test things like scope sensitivity or negation: ask for a forecast, ask for its opposite, and see whether the answers are opposites of each other. For example, there was the measles question: asked whether the number of measles cases will be less than 200, a bot will say 75%, and asked whether it will be more than 300, it will give an answer around 20%, which together imply there should be only about a 5% chance that it falls between 200 and 300; yet ask for that window directly, and bots will say 65%. Obviously a pro forecaster wouldn't make this mistake, but even an ensemble method falls apart on this case, because it doesn't keep track of a world model that is mathematically rigorous. Now, why is this interesting to me? Humans also wouldn't do a very good job of tracking this if the questions were much more complex; obviously a pro forecaster is trained to get better at it, but if you ask this of someone who is not highly numerate, they would be like, oh, maybe it's the same, oh, I didn't realize that this is wrong. That means there is something intuitive about how humans reason that is distinct from the mathematical rigor here. How can we close that gap? If we are able to incorporate model-based reasoning, and I think this can happen with approaches like open agency architecture or even a simple Squiggle integration, I'm super interested in being able to say: your forecasts, as you bring in more and more information from a language model, are building a world model that also has a mathematical grounding in which you can enforce things like scope sensitivity, negation sensitivity, et cetera. There I think we will actually have a lot more paradigm shifts, and this goes along with an intuition that I would like to see much more explored for building safer AI systems, which is: can language models be used to come up with rigorous world models that can be interpreted because they are mechanical?
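A minimal sketch of the scope and negation sanity check Deger describes, using the numbers as reconstructed above; the three ranges partition the outcome space, so their probabilities should sum to roughly 1, and the tolerance here is an arbitrary illustrative choice.

```python
def check_interval_consistency(p_below, p_between, p_above, tolerance=0.05):
    """Check that probabilities over mutually exclusive, exhaustive ranges
    (e.g. measles cases < 200, 200-300, > 300) sum to roughly 1."""
    gap = abs((p_below + p_between + p_above) - 1.0)
    return gap <= tolerance, gap

# The bot's answers from the example: 75% below 200, 65% between 200 and 300,
# ~20% above 300. These sum to 1.60, so they cannot describe one coherent
# world model; note that 0.75 + 0.65 alone already exceeds 1.
consistent, gap = check_interval_consistency(0.75, 0.65, 0.20)
print(consistent, round(gap, 2))  # -> False 0.6
```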
and there are papers on this front that I think have explored this in different ways for example the Tenenbaum lab published a paper called from word models to world models it was about a year ago so there's plenty of work that comes on top of that are we able to build probabilistic programming language support so use an llm to come up with a world representation where the world representation is both consistent and also interpretable can we use this as a starting point to come up with much better much more robust world models or AI systems that are getting more accurate that are getting more logically consistent I can go a little more technical if you're interested like for example can we come up with auto-formalized option spaces like there's a finite set of proposals there's a constant positive number that is the total budget say we're trying to do like a question of budget allocation and the option space is constrained so that you know the sum over all the actions needs to be less than or equal to the budget like an llm is not going to parse that on its own but an llm might be able to get better at this if you think about it as take a natural language description of a decision situation and produce a formal description of the decision variables and the constraints in the language of an optimization framework there's a ton of research there like use something like CVXPY like there's a bunch of libraries that would get this better we can do structured construction of option spaces like are we doing a simple choice question are we doing a matching problem like llms would actually be good at figuring out oh the thing you're asking me seems to match this kind of problem and then help parse it into that that way we actually have a world model that we can inspect where we can see okay these parts are right these parts are wrong can you improve this and have the human continue with just verbal interaction rather than needing to keep an Excel spreadsheet to see if all of my intuitive probabilities are tracking but instead you're able to say things like these are my intuitive probabilities can you figure out where I am logically not consistent like this would help a forecaster right now here's all the data that I have here's 30 forecasts maybe these 30 forecasts have some logical inconsistencies between them can you do better if we can help language models get better at this then that composite system I believe will yield a much more reliable much safer AI model as well and if we extend that further and further we actually might end up with AI systems that are interpretable that have reasoning transparency that have distinct parts for world model building exploring option spaces eliciting preferences and desirabilities from people that you can look at and see oh this is how all of them are talking to each other if you're just trying to get one llm or a bunch of llms to do it not so much like I will never be able to maybe through breakthroughs on mechanistic interpretability we might get there so I believe these kinds of explorations will actually yield much better AI forecasts
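As one hedged illustration of the option-space formalization sketched above, here is what an auto-formalized budget allocation could look like once it has been translated into CVXPY; the proposals, utility coefficients, budget, and cap are made-up placeholders, and in the workflow described above an LLM would be asked to emit this structure from the natural language question rather than a person writing it by hand.

```python
import cvxpy as cp
import numpy as np

# Hypothetical auto-formalized decision: allocate a $100M budget across three proposals.
# The utility-per-dollar figures are placeholders an LLM might propose, not real data.
budget = 100.0
utility = np.array([3.0, 1.5, 2.2])   # assumed value per $1M for each proposal

x = cp.Variable(3, nonneg=True)        # decision variables: dollars (in $M) per proposal
constraints = [
    cp.sum(x) <= budget,               # total spend must stay within the budget
    x <= 60,                           # illustrative cap: no proposal absorbs more than $60M
]
problem = cp.Problem(cp.Maximize(utility @ x), constraints)
problem.solve()

print("allocation ($M):", x.value.round(1))
print("expected value:", round(problem.value, 1))
```

The point is not the toy numbers but that the formal object is inspectable: a forecaster or a second model can read the constraints, spot which parts are wrong, and ask for a revision in plain language.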
yeah cool so that's awesome I want to just for closure for people who wanted to maybe hear those other two papers I'll just mention them briefly because I think if you're going to get into this tournament you should look at the Halawi and Steinhardt paper for sure that's like your kind of jumping off point you know a well done agent with retrieval with fine-tuning with a good scaffold to spit out good predictions that's the one that I mentioned before that comes reasonably close but still falls short of the community prediction then there are two from Tetlock and co-authors one was pretty interesting around just giving people access to an AI assistant and seeing how that helped them as human forecasters this is kind of at least a small step toward the big picture vision that you're painting interestingly they ran an experiment where they gave the best assistant that they could give to the participants and then also a deliberately biased AI assistant and they found that both helped although the biased one didn't help as much that's an interesting finding and there's also an interesting argument in that paper about why to study this at all I mean you could think of multiple reasons for why AI forecasting could be useful you've made the case that obviously we would like to have a better sense of what's going to happen we would like to be able to make better decisions they also make kind of an argument in that paper that the study of what language models can predict is also a useful way to interrogate to what degree they can get out of distribution which is a very hotly debated topic within the study of AI right at some level by definition if you're predicting the future you're out of distribution so the fact that they are doing it reasonably well is definitely at least some points for team these things can genuinely generalize beyond their strictly defined training data right then the third one also from Tetlock and a number of the same authors is just an exercise in ensembling the different language models and finding that the average of the language models performs better than the individual models so it's kind of wisdom of the crowd in fact that's the title of the paper wisdom of the silicon crowd yes that one was interesting to me I mean as you'd expect from a Tetlock publication they pre-registered all their experiments and were very rigorous about declaring exactly what hypotheses they were going to test it did jump out to me from the data that GPT-4 was like way outperforming the other models and if you wanted to make a really simple improvement on their method I would just cut the worst models and take one or a few of the very top performing models and just kind of run multiple copies of those right a couple interesting notes there around bias toward round numbers a little bit of bias toward positive resolution some of these things that you noted where there were some logical inconsistencies and those were also hard another note I had on that paper was it really shows the breadth of AI knowledge off in a powerful way because I was looking at some of these things and I'm like I don't even know the person you know might be a leader of a country or whatever and I don't even know who that is going into the question so the fact that the AIs are just jumping in and making predictions on such a wide range of things is a good reminder of some of their fundamental strengths right so okay that's all background and you know hopefully breadcrumbs for people that want to get into the tournament
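A minimal sketch of the kind of ensembling and model-pruning the wisdom of the silicon crowd result and the tweak just mentioned point at; the model names, forecasts, and Brier history below are invented for illustration, and the median is used here though a simple mean behaves similarly.

```python
import statistics

def ensemble_forecast(model_probs, keep_top=None, past_brier=None):
    """Aggregate per-model probabilities for one binary question.

    If past Brier scores are supplied, optionally keep only the keep_top models with the
    best (lowest) scores before aggregating, mirroring the idea of dropping the weakest
    models and leaning on the strongest ones."""
    probs = dict(model_probs)
    if past_brier and keep_top:
        best = sorted(past_brier, key=past_brier.get)[:keep_top]
        probs = {m: p for m, p in probs.items() if m in best}
    return statistics.median(probs.values())

# Hypothetical forecasts from four models on one question, plus their historical Brier scores.
forecasts = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.80, "model_d": 0.30}
history = {"model_a": 0.18, "model_b": 0.21, "model_c": 0.16, "model_d": 0.35}  # lower is better

print(ensemble_forecast(forecasts))                                  # plain crowd of models
print(ensemble_forecast(forecasts, keep_top=2, past_brier=history))  # prune to the two strongest
```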
if I may comment on them a little bit I think the conclusions of these papers do make intuitive sense and I'd love to see much more rigorous tests of this like for example what you said on well cut the worst models and maybe that'll get you better maybe but I also want you to keep in mind like if your goal is to have better forecasts probably the kinds of things that will make meaningful leaps are going to be things like look up if there have been similar questions asked on a variety of prediction platforms like that is actually instrumental and it shifts out of an academic mindset of rigor towards what is actually instrumentally helpful here I do think a lot of shortcuts are actually much more around like okay how can we get this to be further helpful so those are the things I'm really interested in getting more folks to explore ensemble methods are great like wisdom of the crowd works wisdom of the silicon crowd also would probably work as a result like try different models try the same model figure out if these models are good at specific domains or if they're avoidant on specific questions like there is so much juice there and as we have better models the conclusions will also shift continuously which is why I think a benchmark like this will actually be really really helpful so let's get to the contest you've alluded to the competition or I guess we should maybe properly call this a forecasting tournament you've alluded to that a couple times I would maybe love to start with just kind of a rundown of what the setup is what the rules are there's prize money available so just give people a sense of what they would be signing up for and the specific nature of the competition yes I am pretty excited about this competition it's called the AI forecasting benchmark series the competition will last for an entire year where every quarter we will have new questions that get launched and for this quarter we will ask about 250 to 400 questions to the AI systems that are forecasting we have about 120k prize money to be allocated across the entire competition and yeah I would love for people to participate a couple learnings that I've had just looking at the competition is even very simple bots seem to be doing really well and this might be that their prompts are better this might be that they're paying attention to better news sources so it is definitely interesting and without a lot of effort I do think it is possible to get something that is somewhat competitive now what is the tournament in a way it's a typical metaculus forecasting tournament but it is specifically geared towards bots we encourage people to build bots that use multiple llms and we will use this as a benchmark to compare against pro forecasters and also the community aggregation and it's the first of its kind to be done at this scale the basic rules are that you cannot have a human in the loop you can change your bot but you cannot make specific adjustments to the bot bots have to leave comments to showcase reasoning we won't score the comments but it will be good for us to see what the winning bots are doing so just a quick clarification when you said you can change your bot but you can't like modify it that means you can update it periodically but you can't adjust it for a specific question is that exactly exactly adjusting it for a specific question basically defeats the purpose at that point it has a human in the loop right I mean I do believe human-in-the-loop systems ultimately are going to be better just like how you know competition is good collaboration is good but a competitive collaborative system where you have teams that can collaborate in competition is better in this tournament we're not looking into this we do want to see AI capabilities on their own on their own footing hence you can update if you come up with a better strategy and especially like the writers of the winning
bots we will interview them and ask them what was your approach we do encourage people to share these systems if they would like but they're not expected to do so they will provide some description of the bot or the code if they would like so that we can see what works and what doesn't yeah this will create a benchmark that is continuously evolving and also it's a benchmark that kind of cannot be cheated for we literally don't know the answers to a lot of these questions so it is really fun like it evolves every day like every day you look at it again and see how your bot is doing is there something that is missing so I find it pretty fun personally we have created a template that people can follow that is fairly simple and straightforward that you can use to just start with a one-off system but on top of it go to town like there's a lot that can be done try the ensemble methods try model-based reasoning systems bring in news like one of the things for example I touched on this earlier a little bit if a human forecaster is forecasting and trying to do well they will Google and look up other forecasting platforms to see if there are similar questions this was something that the bots figured out pretty fast or the bot writers rather so I'm really happy about these kinds of explorations yeah so those are the rough rules of the competition fairly easy to get started oh also both OpenAI and Anthropic have donated a fairly generous amount of credits for the competition so for anyone that would like to get credits because you know there's a lot of questions to forecast I think we can cover all of the needs so the process is ping us either on Discord or send us an email you can email me or the support email it's all on the website and say hey this is the kind of thing I want to try can I get credits and I'm pretty sure we can give you plenty of credits to support you so yeah I find it to be quite exciting are all the questions binary questions they're going to resolve yes no is that right and can you give us a few examples of early ones that have already resolved yes I want to say all the questions are binary in this tournament we will have non-binary questions soon for the next round which will be Q4 the tournaments will happen every quarter and you get scored on the questions you forecast so a lot of folks have asked me am I too late to participate I missed the first few weeks actually not at all we have some people that are high on the leaderboard that have joined fairly late so I say join now and also try things now because we will just cover your credits anyway so when Q4 hits you can go in with a bot system that you have rigorously tested some of the questions that have already resolved with respect to like the prime minister of France will they belong to a coalition other than the New Popular Front or Together like a lot of election related questions on France have resolved already domestic box office questions we have one on Deadpool and Wolverine will that be higher than that of Deadpool it resolved yes it was interesting because the bots did better than humans on this one Joe Biden you know what will happen for the Democratic party we had a bunch of questions on that front one thing we have done that I think is really interesting is in the like main quarterly cup we have questions that are for example continuous variables like I'm looking at one of the questions between July 17 and July 28th what will the strongest geomagnetic storm's K-index be for this tournament for example we have discretized this continuous variable question to say things like will it be greater than four or less than or equal to six greater than four and less than or equal to five between five and six so we have turned continuous variables into binary questions and that way we can compare it to you know is there logical consistency here but also how does this compare with human forecasters that are looking at this so there are a couple tricks that we have placed in there to be able to make sure we can stay with a purely binary forecasting tournament and next quarter we will build on top of that with more complexity
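One hedged way a bot could answer those discretized K-index questions coherently is to forecast a single underlying distribution and read every binary variant off it; the normal distribution and its parameters below are placeholders, not anything from the actual question.

```python
from scipy import stats

# Assumed belief about the strongest K-index in the window; a normal distribution is just
# a stand-in here, a real bot would fit whatever distribution its research supports.
belief = stats.norm(loc=5.0, scale=1.2)

def p_greater(threshold):
    return float(belief.sf(threshold))                 # P(K > threshold)

def p_between(low, high):
    return float(belief.cdf(high) - belief.cdf(low))   # P(low < K <= high)

# Answer every binary variant from the same distribution so the family stays consistent.
answers = {
    "greater than 4": p_greater(4),
    "greater than 4 and at most 5": p_between(4, 5),
    "between 5 and 6": p_between(5, 6),
    "greater than 4 and at most 6": p_between(4, 6),
}
for question, prob in answers.items():
    print(f"{question}: {prob:.2f}")
```

Answered this way the window probabilities automatically add up and the monotonicity checks a pro forecaster would apply by hand hold by construction.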
so is the structure that people submit their bot and then you guys are going to run the bot for them on a daily basis as new questions come out are they going to answer each question once or are the questions sort of going to be re-posed you know daily as they go until they resolve and then do they provide I assume it's not just a simple yes no but it's presumably a percentage and how does the scoring work based on that number that they give you the scoring is the same as any binary question that we have done on metaculus it will resolve yes or no and we will score it accordingly and we will score the ensemble on top of that we are only doing binary questions here but the process is basically we don't host their bots they host the bots and we will open questions every question will be open for 24 hours so you will need to continuously forecast on top of that we have made that process fairly easy if anyone has any questions they should take a look at the onboarding process it would take maybe about 30 minutes to spin up a very simple bot but yeah so you need to submit through the API your forecasts to the questions that open and close every day gotcha so people are running their own bots on their own yeah computing infrastructure yeah basically we have some folks that are actually competing with bots that have value in the form of private IP we don't want them to just say well you need to give us what you have or if you have built something really interesting that you want to monetize we do want to encourage that so I think it would be very interesting for everyone to just submit their bots but also at this point at this stage where the current llm capabilities are at I would rather have people have much more ease updating their bots like I want to see things like oh this bot seemed to be bad at scope sensitivity but it got better I'm curious what tricks did they do there so I'd rather have the bot still be in the custody of the person that is building it does that make sense yeah is there a way are we working on the honor system or is there the possibility that people could have a human in the loop as they submit their stuff like the reason I assumed that they had to submit the bot was so that you could guarantee that they wouldn't have a human intervention in the process in this tournament we have discussed this quite a bit and we decided that this will go with an honor system one of the things we have is that the number of questions to forecast is quite high actually so it somewhat disincentivizes a human to forecast about 600 questions you're just being bombarded with a lot of work and another reason why we want people to submit their reasoning is that if there is human intervention it might be much more visible through that especially for the winning cases and the high volume of questions I do think we will scrutinize them heavily if we do have reasonable suspicion on that front
I'm sure we will have further investigations but for this tournament we decided to go forward with the honor system rather so we do encourage people to abide by the norms because they'll be contributing to a meaningful benchmark that way otherwise it's going to be prone to cheating I guess the other thing is that the general prior assumption just based on the older research that we talked about a few minutes ago is that like you might be able to add a bit but you're not going to add you know even over the baseline AI performance an individual is not likely to add that much and you might even make it worse in some cases so it's not an obvious advantage unlike for example the ARC challenge where if you had a human intervention you know there would be a clear reason to think that you would have a big advantage toward the prize here it's much less obvious that a human can actually improve on what their bot is doing on its own yeah exactly and to be honest like this is fairly early to make any conclusive statements and I kind of hope that this will change but we have a couple questions where the bots did better than the aggregate of pro forecasters and I'm like oh this is very interesting because they clearly are failing on things like scope or negation sensitivity but on questions where that's not the primary mechanism they seem to be able to do something that is powerful I think people should try not messing with it and one thing is every quarter we will have a different tournament that will have its own sets of rules and norms so I am interested in actually exploring different mechanisms here as well we have an active Discord so if people have ideas like hey this kind of competition would be much more interesting I'd love to chat with them and hear what their ideas are so we can try different things but like the goal of this is to be able to benchmark AI capabilities and see if this can be an early warning system that would actually substantially help the role forecasting plays you said a minute ago that the scoring is like standard scoring but just in case I don't know exactly what that is my intuition is that it would be almost in the way that like a neural network might be measured for like a classification task sort of sum of squares type scoring is that right it's the same as how the scoring is for binary questions yeah exactly yeah think of it this way we're basically comparing the forecast to a coin flip in the context of the forecaster we are using log loss versus a coin flip and the peer score is basically how your baseline score which is the log loss compared to the coin flip compares to everyone else so that way we have a proper scoring rule based on log scores the reason why we designed it this way is that it rewards the forecasters for beating the crowd and incentivizes both like do research and understand the crowd but also report your best guess rather than say things like I want to be 10% above whatever the community prediction is does that make sense yeah are you penalized by that for overconfidence like if you put a one on something and you're wrong is that like very costly in that scoring system so you need to be more accurate with respect to how you envision the world that way we don't drive towards the far edges which is something you see in prediction markets that have financial incentives so this actually combats against that that's why I was saying like if the market or community prediction is at 60% and you believe this is correct you should say 60% because if it is accurate then if it turns out to be a yes you will not be penalized but if you don't think this is the most accurate guess that you could make then you should not say it's 60% while in a market if you're trying to maximize the financial output where the market is at is something meaningful to act on
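For readers who want the gist of that scoring in code, here is a simplified sketch, not the exact metaculus formula which has additional details like time averaging and scaling: a log score measured against a 50% coin flip, and a peer score that compares your log score to the average of everyone else's; the numbers below are invented.

```python
import math

def baseline_score(p_assigned, resolved_yes):
    """Log score relative to a 50% coin flip: positive if the forecast beats the coin,
    negative if it does worse. A simplified illustration, not the exact platform formula."""
    p = p_assigned if resolved_yes else 1.0 - p_assigned
    return math.log2(p) - math.log2(0.5)

def peer_score(p_assigned, other_forecasts, resolved_yes):
    """Your log score minus the average log score of the other forecasters on the question."""
    mine = baseline_score(p_assigned, resolved_yes)
    others = [baseline_score(p, resolved_yes) for p in other_forecasts]
    return mine - sum(others) / len(others)

print(baseline_score(0.60, True))                      # modest reward for a calibrated 60%
print(baseline_score(0.99, False))                     # overconfidence is punished hard
print(peer_score(0.60, [0.55, 0.62, 0.40], True))      # compared against the rest of the field
```

The overconfidence line is the point being made above: saying 99% and being wrong costs far more than the extra credit you would have earned by being right.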
yeah gotcha and you get all the questions at once as well is it within the rules to have the system consider all the questions to try to deal with the sort of possible inconsistencies can you repeat what you said yeah you said that you have a number of questions that are related where you've kind of just discretized a continuous variable or something are you supposed to just consider each one one by one and give a prediction or can you kind of consider them as a family and impose checks like if there are related questions you know make sure my predictions are self-consistent I hope that's what the contributors do right I want a human forecaster and a bot to be able to do that which is precisely why we're asking these questions this way because it is very hard for bots to be able to do this unless you have come up with something that is much more robust so I also want to see if they get better at doing this over time through the next year will simpler models be better at scope sensitivity or negation sensitivity but yeah like a human pro forecaster will obviously say oh all of these questions are related so let me come up with one answer it wouldn't be very difficult for an AI system to start doing this by noticing these questions are all connected you can just have a simple llm check that says if all of the questions are looking at similar variables come up with one coherent answer and then based on that answer answer each of the individual parts specifically I'm curious how many contributors will do that and I'm curious if that will actually yield numbers that will be yeah I would think it would help at least a bit yeah it should like this is what I'm hoping for is there a leaderboard that folks can obsessively refresh instead of going to the election websites every day yes there is and I encourage people to check it what's interesting is some newcomer bots have already gone up high on the leaderboard fairly fast which is why I tell everyone we have easy templates to get started just go and take your shot at winning the prize money like we do have a large pool of money and this will also be distributed across all contributions over a certain level so you don't have to get first place to get something like I do expect a lot of the bots will actually be compensated in some shape or form so yeah bot performance seems high variance enough and we have three more quarters so I would say this is the time to get your bot in shape set up through the template start testing and maybe you'll even win some money we will cover your costs and the cash will be distributed along the curve rather than just the top three so it's not too late at all I say give it a shot my question to you is what would it take for you to make your own bot I think you would probably enjoy it yeah I think it does look fun I did read through the colab notebook that you have shared and it does look really approachable with all of that in place especially with the credits also covered that was one other little detail question I had is there a credit budget or any sort of budget per question or is that all up to you to manage on the participant side so in this
tournament we decided to not impose a max budget I do think this would be interesting like say if we said you only have $200 to forecast with for an entire quarter but then we're intentionally inhibiting measuring AI capabilities so that doesn't actually give us the outcome we want it could have been a fairer competition but that kind of defeats the purpose we have given up to 5K to some of the competitors and people have asked us hey can we get more credits we ask people to describe the strategy they will try and incrementally give more so we started to do it on an ad hoc basis because if we just distribute it at large we kind of can't keep track but people that actually want to spend a lot more budget can reach out to us and what it will take is a quick email exchange back and forth where they describe their approach and then we will keep an eye on how their forecasting is going and if so we will keep on giving them credits gotcha cool great question for me in terms of what would it take for me to get into the contest I think the biggest thing I'm probably missing right now is a sense of how do I improve on the Steinhardt group paper reading through that I was like this seems quite well done it seems like it's working quite well and it wasn't immediately obvious to me how I would make that better so I would need to and it hasn't been that long since I've read it so you know eureka moments could strike any time but I hadn't had an intuition yet around oh if I do this I think I can beat them and that's sort of the moment of inspiration I guess I'm sort of waiting for right and a couple quick ideas I'd like to share are pay attention to other forecasts like any forecaster should start by Googling okay has anyone forecasted this what's happening on other platforms are there other questions at metaculus like we do pay attention to make sure we don't have another question that's identical for the tournament but similar things will provide a lot of light towards having a good base rate or for example call up perplexity pull up recent news or try multiple different models and see how they compare with each other I do believe that fact fetching versus coming up with base rates versus you know abiding by logic consistencies or rather than abiding by I'd say leveraging that to be able to make better forecasts different models will have different competencies I have found this is very anecdotal and this might change even if you use different prompts I'm just sharing you know I tried this not as research but literally by spending a while prompting and Anthropic's models seem much better with respect to negation and logic consistencies compared to OpenAI and OpenAI is much more helpful to work with me in trying to figure out what I should pay attention to in the question so come up with models that pay attention to multiple things use llama models use custom things you can even fine tune with prior forecasts like there's a lot that can be done and I don't think the winning condition is going to be like this comes from a multiple-PhD research team that invented something really novel I think it will be actually much more flexible than that with respect to which bot ends up winning I don't know my bet will be that something simple will end up in the top three or top five and we will say wow like maybe we didn't need all of that complexity or maybe we will find out like yeah logic consistency abiding by it will yield much better forecasts so you definitely need
to pay attention to that because that was the determining factor there's a lot of conclusions we can get so my goal is just getting people to try out a bunch of things it seems like if I had to forecast on a meta level the result of this tournament over the next year and definitely using the research papers as a point of departure it seems like some sort of ensemble of these bots which themselves will probably be ensembles internally in many cases it seems like we probably are headed for a world in which the AI forecasts are in aggregate very competitive with and maybe even surpassing to some extent the community forecast and I guess maybe in closing I'd love to hear first of all do you think that's true and then could you give us a little bit more you've kind of sprinkled some of this in throughout but a little bit more about what that will mean and I guess I don't have a great intuition always for how good is the community forecast you know it's still obviously not like a crystal ball but if we could get that good how good is that how useful does that start to be is it a question of you know if we had this level of accuracy for everything how much does that change the world or how much does that still sort of leave us in a fundamentally uncertain state because this seems like it could happen especially as the tokens are getting extremely cheap you know with the latest GPT-4o mini especially if you do the offline batch process it's like 10 million tokens for a dollar folks it's ridiculous so yeah what does the sort of epistemic future look like to you I will answer from a different starting point again that hearkens back to what we started with which is around why this is useful if I was to tell you we will end up in a world where you will have an ensemble AI or the bot aggregate that consistently beats the community prediction does that bring a better world on its own I don't think so I think the part with the AI that I'm pretty excited about is that it costs a couple cents to have them write an entire research paper explaining their reasoning I don't know if you'll buy the reasoning but it is much cheaper to probe and understand ways in which their conclusions come about compared to humans and that to me is a paradigm shift and this is already happening like there's multiple companies focusing on this ranging from Elicit to FutureSearch I am interested in seeing how that will give us better actionability if bots in aggregate start to forecast better than humans or human-plus-bot ensemble systems are doing much better I will again go back to how do we make this useful and I am excited that the llms will be able to create much more interpretable world models hoping that these world models will not be manipulative also hoping that these world models will actually serve the goals of those that are trying to make value out of it which is why some of the questions are fun and throwaway other questions are actually quite proactive towards specific goals and we will have more of these come up in the upcoming months like around questions of strategies on mitigating homelessness like there's a lot that we can do there so if bots are getting much better than humans then I think the question is how can we make this be maximally useful we already have really accurate forecasts and we haven't had the Cambrian explosion of every single you know entity corporation government perpetually using forecasts because the
frame of the forecasts has not proven to be helpful just yet which is why I am really interested in this action oriented inflection point identification finding short-term proxies finding questions around budget allocation working with hypotheticals another side note here that I think might be worthwhile to go into I think we should ask questions that might not resolve like if I say if San Francisco was to allocate $70 million for improving public transportation if they did X if they did Y now like this might not resolve this is probably not going to happen but asking this kind of question and asking pro forecasters to forecast on it they might say well I'm not going to get a good leaderboard result from it because it will never resolve but if I told them hey this will be helpful for resource allocation doing this for questions on Taiwanese sovereignty doing this on questions on AI will actually let us explore hypotheticals and counterfactuals in a way that will bring a lot more strategic information so these are spaces I would like to go to and if bots are able to bring a level of reasoning transparency that is great because that will make it much more actionable I hope it checks out I hope it doesn't end up being the uncanny valley of seems legit but not necessarily correct but too hard to scrutinize like that failure mode would be what I would be afraid of here and in a way we already live in a world that encourages that right like we have legal documents that unless you speak legalese you won't understand so you need to defer the failure mode here is like yes there is something that all of these bots are able to converse about with each other that we see as somehow having higher epistemic status than what a human can access like at that point I think we are in the failure mode of epistemic security I would like to dismantle that possibility which is why I'm excited for llms to be able to help say someone that doesn't understand anything legal that doesn't have money for a lawyer to be able to make sense of a contract or for the medical sector for the insurance sector for science so there is a world in which these tools can actually get us towards better reasoning be it on an individual or a collective scale where we're aggregating multiple perspectives I think the same is true for forecasting like if we do this well it is quite a paradigm shift that is going to be super powerful and the failure mode is basically pushing humans further and further away from having access to that I think the benchmarking tournament gives a little bit of light towards that and hopefully more and more as the upcoming quarters will have more rigorous questions that go beyond just binaries so I'm excited to explore this I presume many folks that are listening have been curious about setting up their own llm bots or getting more facility with llms trying prompt chaining I think this is a skill set that I see a lot of people around me wanting to cultivate more of I would say this is an unusual moment it's a powerful opportunity you have outsized prize money you will contribute to a benchmark and it helps you build capacity that you would want to build anyway plus we will cover your credits and we have some templates so I would just say tell people hey come along we need to figure this out cool I love it is there anything else that you wanted to touch on that we haven't got to I mean I just want to say I am especially interested in folks that say hey I have a different vision with
respect to how forecasting can be helpful do reach out to us we're quite interested in seeing different experimentation and testing out some of these experiments where we can both financially incentivize people and try things out we will be open sourcing metaculus very soon so I'm very excited for more people to try things out so that is another thing I would like to underline if you have ideas that you want to test holler I'd be very curious to hear them and yeah hopefully we will have many more instances many more ways in which people will try to make use of forecasts yeah I wanted to ask also that's a good reminder what is the sort of future of the company like obviously open sourcing is an important strategic decision in that it gives a lot of value and a lot of opportunity to others how do you see that sort of shaping what your own possibility set is as you go forward with the company itself I'll start from like before I joined as the CEO I was wondering why is this not open source already I do think there is a lot of wealth here and one way to get more people to be able to both critique and audit but also host their own versions like this is something that is a net good towards the whole of humanity to be able to do much more of this and metaculus will focus much more on ways in which forecasting can be useful for both decision makers and understanding more robust world models which is why questions around better reasoning transparency and action oriented decision making for those that are trying to figure out what's the best way to allocate resources or which policy seems to address the needs of a community will this policy if enacted actually yield the impact so we're interested in much more collaboration on this front with folks spanning municipal to federal government agencies we have some of these partnerships already in place we're interested in ways in which this would also help businesses to be able to make better decisions and keep on track towards their goals so these are all spaces that we will highly prioritize we're entering a window of a lot of experimentation actually on seeing what additional things on top of the forecasts as they exist right now would add value so what we were talking about is just one of the many examples here building indices is another like looking into things like tech trees that the Foresight Institute is building on seeing okay for us to get to this technical state it looks like we have these in between things that need to be invented or researched will this be worthwhile okay if we put $5 million to this versus $15 million to this will we be able to go through these inflection points these are the kinds of ways of reasoning that I would like to see metaculus enable is there anything that you worry about with this paradigm we just lived through this CrowdStrike disruption right where like one seemingly very local point of failure cascaded through society and if I was going to take the most risk oriented position on this I might say something like in a world where yeah maybe we're getting most of the time even better than wisdom of the crowd forecasts from like a couple different AI models what happens if they fail not necessarily in terms of an outage although maybe but more so is there a potential for sort of a correlated failure that could create weird scenarios weird tail risks how do you think about that type of possibility is there any way we can get ahead of that I mean decentralization is key right
like if everyone wasn't using CrowdStrike probably everything wouldn't have gone down all at once and I'm sure a lot of people have had the hindsight to say well I saw this coming I've talked to a couple folks even that said well yeah we switched away because we foresaw something like this coming up I think especially when you have institutional systems that are entrenching the use of specific tools or specific flows that ends up introducing a lot of failure modes right like if there is a government endorsed supply chain flow that would be much easier for a foreign entity to be able to infiltrate and mess with causing supply chain related risk there have been plenty of those through human history I think metaculus being open source is good I mean obviously I want metaculus to be the most accurate and that is the goal we're striving towards but even more than that the proliferation of perspectives so that there are multiple people with multiple focuses there are many different strategies like maybe for a specific domain a different way of scoring will work much better and this will end up creating much more value there like for example just zooming back to something we said at the beginning of the podcast decisions around political outcomes are very different than say questions like should Biden step down versus questions like if we allocated an additional $50 million towards synthetic biology which of these will yield an inflection point now it is interesting like in the book superforecasting there's plenty of anecdotes of people that are really good forecasters ending up beating domain experts this doesn't mean we don't want domain experts this means we want the forecasters to work with the domain experts so I think there's many different contexts in which forecasts can play an instrumental role but they don't all look the same so I think the real win for metaculus is to actually bring us towards this world where many people are trying many different approaches and many different things we will always strive towards the level of rigor and accuracy but on top of that I would love it if a different effort ends up building an AI ensemble method that seems to be more reliable than the metaculus prediction like at that point then the question is okay how do we integrate this towards usefulness like that wouldn't be a failure it would in a way be a success for we would have shown that this is possible so my answer towards failure modes through a singular authority that has maximum control is always well decentralization is much better and how can we steer towards that cool are you also interested in things like Polis is there a sort of metaculus Polis mashup or like you know what does it look like if those two concepts have a baby great question this goes all the way back to my backstory I would say Polis is probably the most influential platform for the evolution of my thinking right before metaculus at AI Objectives Institute the project I worked on for the longest was called Talk to the City which is an llm assisted aggregation and deliberation platform that can take in qualitative feedback and in a way this was what I was working on for my thesis this was what I was working on in my startup Cerebra also so I have thought about this question through so many different lenses when you have raw text that is the highest fidelity human opinion how can you augment that and through many iterations first without llms then looking at multimodal systems with text data plus other
data and in the latest iteration with Talk to the City what I have come to is okay the current state of language models actually does a fine job in identifying what the topics are are there any subtopics and another thing they are really really good at is actually the retrieval of these concepts Talk to the City still exists and it's going pretty well we have an excellent team at AI Objectives Institute so I recommend the audience check it out the website is ai.objectives.institute and they can see on the blog a couple case studies that we've done on the impacts of Talk to the City like for example one was with Taiwan we've had an extensive collaboration with Taiwan's Ministry of Digital Affairs where we were aggregating people's opinions with respect to both local municipal questions and also larger scale questions on same-sex marriage a lot of these data inputs come in from Polis now Polis has some constraints there that is a tradeoff ultimately it's a recommender system where you're finding people that will share certain viewpoints and it keeps the demand on human bandwidth low like you can only write about tweet sized things we said okay can we have just completely unbounded input where people can send in text they can send in videos they can share any kind of data that you would be able to then enhance and see okay what does this community talk about what matters to them and in a way my journey to forecasting and metaculus was me seeing okay we are actually at a place where we can aggregate public desirability I prefer the word desirability to preference because preferences can change one might not necessarily be aware of their preferences you can have cyclical dependencies I use desirability more as a catch all term that overcomes some of these we can aggregate these we can have a snapshot like one project that I loved for example was with a labor union focusing on veterans health where they wanted to run a survey with respect to a specific proposition and the negotiation windows they have with the government are incredibly small there's a 20-day window where you come up with the policy share it and it goes through the whole system it is basically rigged against internal deliberation it's so narrow and so fast-paced that people just say whoever I voted for as head of the union should go forward the new president of the union Mark Smith ran a survey an open-ended text based question to the entire population within 24 hours I think we got what like maybe 200 responses on the first round within 24 hours we could turn around this open-ended survey in multiple languages by the way not just in English you can get it in Spanish and Tagalog and consolidate this to say here are the four viewpoints that came up what are your thoughts send it back to the entire community and develop this recursive loop it's so much more interesting when you say wow between just yesterday when the survey went out I already have four clusters maybe I should respond to this oh I have something to add to this one next day do it again this cycle is much cheaper than what it used to be I think this is a paradigm shift like this changes governance and what I realized is okay now that we can aggregate this what I need is the next pillar which is will the action we take actually get us towards where people want to go and I was talking about this with Gaia who was the previous CEO of metaculus and that's how I ended up here
eventually so I'm really excited about seeing how forecasting can help this I think there's a lot there and I think we need a lot more experimentation so yeah Polis-like tools are in a way where I started thinking huh maybe there is a way in which we can do better glad I remembered that bonus question I think it does sort of start to look like a liquid democracy or a you know sort of technology mediated liquid democracy and that is definitely super compelling I mean there's just not much room we see this right now as we're going through this campaign obviously in the US it's like the things that are actually getting talked about are few and not particularly well chosen in many cases and there's just so much stuff that really ought to be you know that we want to understand what people actually care about first of all or how they think about different things we're just not even getting that data in the first place let alone being able to map a path toward actually delivering for people yeah so absolutely and I think you know the one-bit signal of red or blue votes in the US we can do so much better than that that's why I'm looking at you know municipal engagements membership based organizations looking at DAOs for they are making a lot of decisions around their stakeholders and the entire thing is on a digital substrate so there is much more experimentation here but what you said reminded me of something with respect to liquid democracy I must say the term liquid democracy does bring some aspects of fear to my mind also and I think it's worth pointing that out there's this concept that I call the consensus illusion there is a drive towards the lowest common denominator that if we say finding consensus in a community is desirable for that is the best policy what you will end up with is a lot of policy outcomes that are quite lukewarm that don't actually address the issue there was a NeurIPS paper from DeepMind in like 2022 on how can we fine-tune language models so that we can find agreement across humans that have diverse preferences and one of my concerns when I see work like this I think the paper is actually great like some of the techniques they use have made their way to Talk to the City so I like that but some of the concerns I have is finding agreement doesn't necessarily mean good policy in fact it quite often doesn't like one example I like to give is one of the conclusions from a civic data set is we should build more bike lanes for the community everyone seems to want more bike lanes and you say okay where should we build these bike lanes should it be on this street and the model will say no no no that's not what we agreed on it shouldn't be on a specific street we need more bike lanes this is the thing we don't want to make any specific streets narrower and it's this like oh we're doing the same thing we've been doing with politics all along find the most common denominator statement that yields power because in the current electoral system consensus is power so when you have a desire towards consensus that causes a trade-off with fidelity to your viewpoint and if you seek power over fidelity you will have higher representation of a viewpoint that is not very meaningful and the most catch all sentences will end up resonating the most and it won't make meaningful policy so I don't want consensus really I want to start from a shared world model like can we actually come up with ways in which we have desirabilities from stakeholders we have action possibilities that
come from policy makers and then we have outcome likelihoods from domain experts and these three need to be continuously talking to each other as a division of labor I think democracy focuses on the stated desirabilities of stakeholders but if the options you give are only roads you can't really build a bridge and people need to be able to say I want to also be able to build a bridge and an AI model that hampers this by finding oh this is the most agreeable thing is not good instead I'd much rather have an AI model whose goal is to come up with individual viewpoints that are clashing with each other represent each of them with high fidelity and identify what are the cruxes between these viewpoints like if we can find a good crux where I think A is true you think A is false but if I thought the underlying cause changed this would change my mind about A and you would say the same thing like the typical double crux process that is good because that means we actually have the same shared world model and we need more data and at that point bring on further research bring on forecasting like we're in a good mode the failure mode here is a risky alliance where someone says oh I think policy A is good you say I also think A is good and I say A is good because it will enable B which will be great and you could say I don't think B is likely to happen as a result of A but maybe we should keep together because it seems instrumental just for this one step you end up having a lot of unlikely alliances and the ultimate version of this is a completely polarized society where for some reason socially conservative and fiscally conservative behavior is correlated with each other even though that doesn't necessarily manifest there was one piece of research that I read I can't remember where that I just absolutely loved which was instead of asking people their opinions with respect to immigration policy it actually asked a statistical question on how many people are you willing to let in that might be mistakes before you reject someone that truly deserved to come into the country this is very interesting because instead of people saying well I am liberal I am conservative you see people who say they are liberal you know the San Francisco crowd that will say yeah I'm willing to let in five mistakes per person we admit and others will say what do you mean five we need like 300 but both of these people see themselves as liberal like we do not yet operate on a level where we are looking at good policy so we should not encourage this illusion of consensus so that it can serve towards power instead we can use these techniques that we have right now to go towards better policy and better policy means higher visibility better policy means people agree on the world models so I see the work we're doing at metaculus as an instrumental step in that trajectory are we able to find people whose world models are diverging let's figure out why they are diverging are we able to see okay will this action that a policy maker has given us bring us to the outcome I think these are really important questions around the category of how can we have better epistemic security it's crazy to think how far this might be able to go over the not too distant future I mean obviously you're primarily focused right now on getting a read on what the bots can do and trying to be as accurate as you can and to go much more in depth on these sort of you know mini metaculus ideas let's assume this works do you have a
road map for how this sort of rolls out and scales up to ultimately like big picture you know the most important questions in society right in that question I hear hey you should probably write the blog post that maps this out quick thoughts that come to my mind like we know preferences can and do change based on actions and their consequences right so can we build these feedback loops as proof points in organizations that can actually take action on them where real stakes are present that's why I'm interested in labor unions for example because it's a fairly acute case of coordinated decision making with stakeholders that are outside and inside the group so it gives us a lot of visibility I think where we are at right now is I don't expect a top down revolution of your government adopting this off the bat but I do think there's a lot of municipal level experimentation that already has been happening for example Talk to the City is collaborating with a couple municipalities in Japan right now through Liqlid which is a Japanese liquid democracy platform company there's a bunch coming in in Taiwan we have interest from folks in Singapore to be able to use metaculus forecasts for example we have similar interest from the Taiwan community we have groups in the US also on the municipal level like for example Detroit has a bunch of communities that are focusing on the well-being of African-Americans and previously incarcerated folks like are we able to figure out what interventions bring a better world to them these are all very short term but in these processes we can actually see oh this seemed to have worked this actually did yield an outcome where a group was able to coalesce much better I think we need more visibility into that this I think is the very first step I think AI tooling is absolutely critical for this both the failure mode and the success mode will heavily depend on how these tools are implemented and used how these tools are actively enhancing human agency I would recommend people who are interested in more of this to check out the roadmap documents from AI Objectives Institute that's where I have done a lot of my writing and thinking with the team there reach out to them also I still try to stay as involved as I can it does hold a dear space in my heart Peter Eckersley who was the original founder and a mentor and a friend of mine unfortunately passed away quite unexpectedly which is when I started leading AI Objectives Institute and the line of thought there was very much always AI can be a transformative point for human well-being but the default systems do not place us on that path so I think there's a lot there and this is the question of existential hope right like existential risk is failure to coordinate in the face of a risk if we can already foresee this path and we are failing to coordinate the systems we are in the coordination capabilities we have don't let us get to the heart of that that is why we are failing so that is the angle that I want to keep looking at because we have seen incredibly successful cases of international coordination or multi-corporation coordination like the ozone layer is basically recovering since we banned CFCs like there are many different cases where we have moved mountains as a society like microplastics related harms like we have banned lead at this point like there are ways in which we are able to coordinate if we can create coherent world models and I think the thing we need to do is have shared world models that
can contain the disagreements rather than just agreeable action policies I think the way politics happens right now the way voting happens right now is find the most agreeable action policy so you can maintain control as opposed to asking what is the good policy and this requires a level of epistemic rigor and epistemic security I see my life's work as focusing on that question and bringing more and more towards that and if there are groups that we can work with hell yeah let's kick it like this is where we need to start so if there are any organizations that are focusing on our priority cause areas be it on climate change or AI or like nuclear consequences or any organizations that are trying to do better on resource allocation where they want the resources to be able to do the maximal good for a specific community I would love to talk to them I would love to understand how the things we are building can be useful for them I think we need more experimentation of this sort and I'd rather have these experiments be with people that benefit from them in the immediate short term cool well that's a great call to action I'm glad we stayed on a little extra to get that final section I guess I'll ask again since it was a fruitful question last time anything else that we didn't get to that you wanted to touch on nope I feel complete thank you so much for this opportunity it was lovely and it also made me realize the context through which my path has evolved like seeing the role forecasting can play hand in hand with collective intelligence and the failure modes of the current political processes or democracy or resource allocation makes me realize oh I see the role that this plays so this was great for me as well so thank you cool Deger CEO of metaculus thank you for being part of the cognitive Revolution it is both energizing and enlightening to hear why people listen and learn what they value about the show so please don't hesitate to reach out via email at tcr@turpentine.co or you can DM me on the social media platform of your choice