Hey, this is Lance from LangChain. Anthropic just released Claude 3.7. This is a really exciting development because it's their first explicit reasoning model, so let me show how to use it and then talk a little about how it works.

First, install langchain-anthropic and make sure your Anthropic API key is set. You're going to select claude-3-7-sonnet-latest as your model. Now, this is where things are a bit new: you can set max tokens for your response just like before, but you can also set this thinking parameter and a budget of thinking tokens. I'll talk more about this later. Pass an input, and the response is an AI message just like normal, but what's new here is that the content is actually a list with two blocks: one is a thinking block and the other is the response. What's interesting is that Anthropic made the decision to expose the thinking to the user, so we can see Claude's full thinking process transparently in the thinking block, and then we see the final answer. We can open LangSmith and see the overall latency, 20 seconds, as well as the full output with the thinking block and the text block.

Now, this model is also compatible with building agents, so I'll build a simple agent here with three arithmetic tools. We'll just use the LangGraph prebuilt for a classic ReAct-style tool-calling agent, and we'll go ahead and run that by passing "add 3 + 4". We can see something pretty cool here: in the AI message, this thinking block is present, and this is actually very nice with tool use because it tells you its thinking around when and why to use a specific tool. You can see: the user is asking to add numbers, this is arithmetic, and I have access to the add function, which is suitable for this task. That's really cool, because oftentimes you don't know why models are actually choosing to call tools, so it's great in this case to see that visibly. It chooses to call the tool, there's your tool call as normal, we get the tool message back, and we get a response.
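As a rough sketch, the configuration described above looks like this. The parameter names (`max_tokens`, `thinking`, `budget_tokens`) follow Anthropic's extended-thinking API, but the exact values here are illustrative, and the actual model call is only indicated in a comment since it needs a live API key:

```python
# Illustrative configuration for Claude 3.7 with extended thinking.
# Parameter names follow the Anthropic "extended thinking" API shape;
# the specific numbers are just examples.
model_kwargs = {
    "model": "claude-3-7-sonnet-latest",
    "max_tokens": 5000,  # cap on the full response, thinking included
    "thinking": {"type": "enabled", "budget_tokens": 2000},  # thinking budget
}

def validate_thinking_config(kwargs: dict) -> bool:
    """The thinking budget must fit inside max_tokens, since thinking
    tokens count toward the overall generation limit."""
    budget = kwargs["thinking"]["budget_tokens"]
    return 0 < budget < kwargs["max_tokens"]

# With langchain-anthropic installed and ANTHROPIC_API_KEY set, you would
# pass these kwargs to ChatAnthropic(**model_kwargs) and call .invoke(...).
print(validate_thinking_config(model_kwargs))
```

The one invariant worth internalizing is the last one: the thinking budget is carved out of `max_tokens`, so the budget always has to be smaller than the overall cap.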
So these thinking models are entirely compatible with building agents with tool calling, as you'd expect.

Now, I just want to talk about this a little more in general terms. We have some prior videos discussing chat models versus reasoning models. In general, they use two different scaling paradigms: chat models are trained on next-token prediction, reasoning models with RL on chain of thought. They have different reasoning types, System 1 versus System 2, as I show a little below. Often the way you instruct the model differs a bit for reasoning models versus chat models, and the interaction modes are a bit different: chat models are typically strong for short-form chat conversation, while reasoning models are better for longer-running, reasoning-heavy tasks.

So, just as with other reasoning models that have been released, for example R1 and what we saw from OpenAI, Claude 3.7 is post-trained with reinforcement learning. What's neat is that they do expose the reasoning traces to the user. What's also interesting is they give a lot of control over how long the model can think. With the o-series models there are qualitative reasoning tiers like medium and high; with Claude 3.7 you can explicitly instruct the model to use a particular thinking budget of tokens, with max tokens being the maximum number of tokens to generate before stopping, and the thinking budget being the number of tokens you allocate for the thinking process.

Now, a few other interesting things about this model. It has an October 2024 knowledge cutoff, which is an improvement over what we saw previously, and it allows up to 128,000 output tokens, which is potentially interesting for certain applications. They report strong performance on software engineering: on SWE-bench Verified you can see it's a big jump over Claude 3.5, 49% versus 62%. And so overall, looking at performance here, you can see a very high emphasis was placed on coding, which makes a lot of sense, because Claude 3.5 Sonnet was already one of the strongest LLMs for coding, obviously very heavily used in tools like Cursor.
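The agent example above boils down to three arithmetic tools plus a tool-execution step. Here is a minimal stand-in sketch: plain functions instead of bound LangChain tools, and a hand-rolled dispatcher in place of the agent's tool-executor node. The tool-call dict shape mirrors what tool-calling models emit (a name plus args), but this is an illustration, not the LangGraph internals:

```python
# The three arithmetic tools from the agent example, as plain functions.
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

def subtract(a: float, b: float) -> float:
    """Subtract b from a."""
    return a - b

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

TOOLS = {"add": add, "subtract": subtract, "multiply": multiply}

def run_tool_call(tool_call: dict) -> float:
    """Execute one tool call of the {"name": ..., "args": {...}} shape
    a tool-calling model emits -- a minimal stand-in for the agent's
    tool-execution step."""
    return TOOLS[tool_call["name"]](**tool_call["args"])

# For "add 3 + 4", the model's tool call looks roughly like this:
print(run_tool_call({"name": "add", "args": {"a": 3, "b": 4}}))  # → 7
```

With langgraph installed, you'd instead wrap these as tools and hand them to the prebuilt `create_react_agent` along with the model, as in the video.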
This appears to be a significant jump over 3.5 Sonnet. What's also kind of interesting is the pricing: Claude 3.5 Sonnet input tokens are $3 per million tokens, output tokens are $15 per million tokens, and Claude 3.7 Sonnet is the same. Now, it's worth noting the number of output tokens with 3.7 will frequently be larger because of thinking, but the pricing is the same.

Now, some tips for usage. Anthropic mentions using this on challenging STEM problems. This is kind of an interesting thing that they note: for complex tasks, consider over 16,000 tokens of thinking budget; 4,000 to 8,000 is considered acceptable for simpler reasoning tasks. You'd have to experiment with that given your particular needs to see what the right threshold is for you, but it's a nice thing that they give you a lot of control over the amount of tokens you allocate for reasoning. Of course, you have to think about latency here, because as you bump up the thinking tokens, you'll also bump up the latency.

They also made some interesting notes on very long outputs: you can actually request a detailed outline with word counts down to the paragraph level if you want, and you can ask Claude to index paragraphs to the outline and maintain specific word counts. So you actually have a lot of potential configurability and control over these very long generation outputs, and that's another thing that's worth experimenting with a bit.

Some nice tricks they mention with respect to prompting: avoid predetermined instructions. When you're working with chat models, you often think about "think through your problem step by step" and enumerate the task very explicitly. Here, think more about general instructions: "think about this problem thoroughly and in great detail, consider multiple approaches." So you're not really telling it how to think explicitly; you're giving it more general instructions and the task you want solved. Now, we've seen that same kind of approach with other reasoning models.
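Since the pricing stays the same but thinking inflates the output-token count, a quick back-of-the-envelope cost estimate makes the latency/cost trade-off concrete. This sketch uses the $3 / $15 per-million figures quoted above and assumes thinking tokens are billed at the output-token rate; check Anthropic's pricing page before relying on it:

```python
# Rough per-call cost at the quoted Sonnet pricing, ASSUMING thinking
# tokens are billed at the output-token rate (verify against the
# official pricing docs).
INPUT_PER_M = 3.00    # dollars per million input tokens
OUTPUT_PER_M = 15.00  # dollars per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int,
                  thinking_tokens: int = 0) -> float:
    """Estimated dollar cost of one call."""
    billed_output = output_tokens + thinking_tokens
    return (input_tokens / 1e6) * INPUT_PER_M + (billed_output / 1e6) * OUTPUT_PER_M

# Same prompt and answer, with and without a fully used 16k thinking budget:
print(round(estimate_cost(1_000, 500, thinking_tokens=16_000), 4))
print(round(estimate_cost(1_000, 500), 4))
```

The point of the comparison: a fully used 16k thinking budget can dominate the cost of a short answer, which is why tuning the budget per task matters.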
In terms of parameters, again, we talked about that budget tokens parameter being quite important. We show that here when working with the model itself: you pass your budget tokens relative to your max tokens, where budget tokens should be less than your max tokens. And just as we showed above, here is example usage. As mentioned, the response is going to have a thinking block and a response block, which you can very easily segment: you can extract the thinking as well as the text from each block very easily.

It's also worth noting that the thinking block contains a signature field. You might wonder what that is: it's a cryptographic token that verifies the thinking block was generated by Claude, that's all that is. You can see that right here: if I look at my response, that first element in the thinking block is going to be the signature, and then I have my overall thinking right here.

So you can see this is a very interesting and powerful new model. You can configure the thinking tokens very precisely, it has very long generation capacity, and it has state-of-the-art performance across a number of tasks, including coding. It works with tool calling, as we showed with our simple agent example, so it should be a really interesting model to experiment with. I'm excited to test this further, but I just wanted to give you a quick overview. Thanks, and feel free to leave any comments below.
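The block segmentation described above can be sketched like this. The content is a list of typed blocks; the shapes below mirror Anthropic's documented thinking/text blocks, but this particular response is mocked for illustration (including the truncated signature string), not real model output:

```python
# A mocked response content list in the documented two-block shape:
# one "thinking" block (with its signature) and one "text" block.
mock_content = [
    {
        "type": "thinking",
        "thinking": "The user wants 3 + 4. This is simple arithmetic.",
        "signature": "sig-abc123",  # illustrative placeholder, not a real signature
    },
    {"type": "text", "text": "3 + 4 = 7"},
]

def split_blocks(content: list) -> tuple:
    """Return (thinking, answer) from a content-block list."""
    thinking = next(b["thinking"] for b in content if b["type"] == "thinking")
    text = next(b["text"] for b in content if b["type"] == "text")
    return thinking, text

thinking, answer = split_blocks(mock_content)
print(answer)  # → 3 + 4 = 7
```

Filtering on the `type` field, rather than assuming block order, keeps the extraction robust if the provider ever interleaves blocks differently.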