Transcript for:
Presentation on Anything LLM

Hey everyone, my name is Timothy Carambat, founder of Mintplex Labs and creator and maintainer of AnythingLLM. Today I'm going to showcase AnythingLLM, just how it works, but also show you something that makes Ollama models really powerful. We're going to give agent capabilities to any LLM available on Ollama, so you can search the web, save things to memory, scrape websites, do whatever you want, even make charts, and I'm going to show you how to unlock all of those abilities just by downloading AnythingLLM and connecting it to Ollama. It'll be really simple, but first I want to give a little background on what Ollama is, what quantization is, and what agents even are.

First, Ollama. If you found this video, you've definitely heard of Ollama, because Ollama is in the title. Ollama is an application you can install on Mac, Windows, and Linux, and it allows you to run LLMs using your own computer's hardware: no cloud, nothing like that, so it's totally private. The way this is possible, given that Llama 3 is a massive model that would take dozens of GPUs to run at full precision, is through a process called quantization. Quantization is basically how we get these models small enough to run on your CPU or your GPU. I'm not going to get into the weeds of how it works, but in general you should know what quantization is: it's essentially compression of an LLM. When we get into agents, I'll tell you why that's really important.

The next part of this really short lecture is: what is an agent? You have LLMs, and they respond to you with text, right? They don't really do anything. An agent does something. It's an LLM that is able to execute what people call tools or skills (there's a whole bunch of terminology), but the point is that from your input it doesn't just respond with text. It actually goes and runs some program, interface, or API, gets that information, does that action, and then comes back to you with the result, the response, or your question answered with that tool's help. It's like RAG, but you're doing things instead of just chatting with a chunk of a document. And you can see that RAG actually shows up on the top part of this graph as short-term and long-term memory, which is a common use case for retrieval-augmented generation: chat with your docs, all the same thing. What we're going to do is get this working for any LLM.

You're probably familiar with cloud-based models like OpenAI, Anthropic's Claude, or Perplexity, where you can say things to the model and sometimes it can go and do something like search the web, which is a very common use case. However, if you are using Ollama and you tell your model to search the web, it'll just tell you that it can't do that. Well, now with AnythingLLM, any LLM can be an agent, can even search the web, and can do all of this for free on your computer with 100% privacy. So I'm going to show you how to unlock that today.

The first thing we need to do is find a good model. As I said, any LLM will work with AnythingLLM and its agent capabilities. However, coming back to quantization, there's one detail that people tend to overlook when it comes to Ollama: by default, Ollama installs a Q4 quantization. That probably doesn't mean much to you, but here's the rule of thumb: Q1 is the most compressed version of a model, and Q8 is the least compressed version, though still compressed, not the raw model.
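To make that more concrete, here's a rough back-of-the-envelope sketch of what those quantization levels mean for model size. The bits-per-weight numbers are approximations I'm using for illustration, and real downloads carry extra overhead, so treat the output as ballpark figures rather than Ollama's exact file sizes.

```python
# Rough on-disk size estimate for a quantized model:
# size ~= parameter_count * bits_per_weight / 8 bytes (ignoring metadata and overhead).
PARAMS_8B = 8_000_000_000
PARAMS_70B = 70_000_000_000

# Approximate bits per weight for common quantization levels (illustrative values only).
BITS_PER_WEIGHT = {"fp16 (unquantized)": 16, "q8": 8.5, "q4": 4.5, "q2": 2.6}

def approx_size_gb(params: int, bits: float) -> float:
    """Convert a parameter count and bits-per-weight into an approximate size in GB."""
    return params * bits / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"8B  @ {name:>18}: ~{approx_size_gb(PARAMS_8B, bits):5.1f} GB")
    print(f"70B @ {name:>18}: ~{approx_size_gb(PARAMS_70B, bits):5.1f} GB")
```

The point is simply that an 8B model at Q8 stays manageable (in the ballpark of the eight and a half gigs we'll see in a moment), while a 70B model at Q8 would be enormous, which is why you'd drop to Q4 there.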
If you have a model that is 8 billion parameters and you compress it a lot, down to something like Q2 or Q3, you've basically taken something that's already small and then compressed it heavily. Now you have a pretty bad model: you'll get hallucinations, you'll get weird outputs, it'll just go off the rails and sometimes not even respond to your questions. All of these become problems when smaller models are quantized very heavily. So what we're going to do today is intentionally download Llama 3 from Ollama, but use the Q8 version so that it's more robust, the calls are more reliable, and the responses are just better. If we were messing with the 70-billion-parameter model, we probably wouldn't download the Q8; it'd be something like 70 gigs. We'd use the Q4 and have a good time, because 70 billion parameters is a lot. I know that sounds very technical, but hopefully you understand why quantization and picking the right model depend on your use case. It's something you should understand if you're messing with LLMs at all.

If you go to Ollama, open Llama 3, and scroll down, you'll see that the 8B tag and the latest tag, which is what downloads by default, are the same. That tag also matches the instruct model, and it is a Q4. So this is a pretty small model, basically the middle of the road between size and performance. But we want really good performance because we're dealing with agents. I'm going to go find the Q8 version of this model, which you can do by just typing in Q8, and you'll see that it's right here. It's eight and a half gigs.

I'm running on an Intel MacBook Pro, which is pretty bad for inferencing in general, but I have a Windows computer in the other room. So I'm actually going to run Ollama on that computer, run AnythingLLM on this computer, and keep it all on my private network. Here I am on my Windows computer, and I have Ollama installed; if I type in ollama, we can see it's running. I need to pull in that Q8 model, and the easiest way to do that is with an ollama pull of that tag. I already have it downloaded because I wasn't going to wait while making this video, and you'll see it pulls all of the layers. We're good to go. The only thing left is ollama serve to make sure the server is running, and the server is already running. As you can see, I also have ngrok running, tunneling the Ollama server on my desktop in one room so my Mac in the other room can connect to it.

This is where we can get into AnythingLLM. AnythingLLM is an all-in-one AI agent and RAG tool that runs on your desktop fully locally, connects with pretty much anything you care about, and works on Mac, Windows, and Linux. All you do is go to useanything.com/download and click on the proper operating system and chip architecture. Since I already have AnythingLLM downloaded, we're going to boot it up. Because I've never run it before on this computer, it's going to ask us what LLM we want to use; that should be the first question. So here we are in onboarding, and it asks us what LLM we want. AnythingLLM actually ships with Ollama inside of it, so the whole business of setting up Ollama on my Windows computer is completely unnecessary if the machine you're on has a capable GPU. I'm on an old Intel MacBook, though, so I'm going to use the external Ollama connection. All I'm going to do is paste in that address from ngrok, and you'll see that my chat models are loaded. I want to use the Q8 8B model, and because I know this model, I know it has an 8192-token context window.
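If you want to sanity-check that setup outside of AnythingLLM, here's a minimal sketch of hitting the Ollama server's REST API over the tunnel. The ngrok URL below is a placeholder for whatever your own tunnel prints, and I'm assuming the Q8 instruct tag as listed on the Ollama library page; adjust both to match your setup.

```python
# Minimal sketch: verify the tunneled Ollama server is reachable, the Q8 model is
# available, and a quick test prompt works. URL and model tag are assumptions.
import requests

OLLAMA_URL = "https://example-tunnel.ngrok-free.app"  # placeholder for your ngrok URL
MODEL = "llama3:8b-instruct-q8_0"                     # Q8 instruct tag pulled on the Windows box

# /api/tags lists every model the server has pulled.
tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=30).json()
print([m["name"] for m in tags.get("models", [])])

# /api/chat runs a single non-streaming chat completion against that model.
resp = requests.post(
    f"{OLLAMA_URL}/api/chat",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Hello! Reply in one short sentence."}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```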
It's really annoying that they don't publish the context window for every model; you have to go and Google it. But anyway, we'll just continue. You can see this is kind of a privacy overview. We're going to use AnythingLLM's built-in embedder, so everything will embed on this device, and we're going to use the built-in vector database as well, so that basically none of my chats leave my local network at all. All of my data stays on premises and it all just works very nicely. And of course you can skip the survey; it's totally optional.

Let's make a workspace, and we'll call it "sample" for now. The very first thing people want to do is test whether the model works, so let's just say hello. What this is doing is sending a request to my Windows computer, and Ollama on that computer streams the response back. You can see it works about as well as you'd expect, and it's fast. However, while it might be fast, because I'm using a 4090 in the other room, it's still pretty dumb, and the reason we can say that is that it doesn't know anything about what I might want it to know about. For example, AnythingLLM, while people love it and it's great and it's cool, isn't popular enough for an LLM to know about it. So if we ask the question "what is AnythingLLM?", it's likely going to make something up. It says that AnythingLLM is an LLM, which is totally wrong; this is all a hallucination, and none of it is accurate.

But what can we do to improve its ability to know about AnythingLLM? Well, the easiest way is RAG, so let's do that first. We're going to upload a document. I actually have AnythingLLM's GitHub readme already downloaded as a PDF, so I'm just going to upload that and then move it over to the workspace, so that when I'm in this workspace chatting with Ollama, it will use this set of documents. You can see it was embedded successfully, so we can close this window. Now let's reset the chat and ask that same question again: what is AnythingLLM? What we'd hope to see is a response back, and wow, that was quick, and we get citations. We can actually see exactly which chunks were relevant to my query and allowed the LLM to complete this. It says AnythingLLM is a full-stack application, blah, blah, blah, does all this stuff. That is accurate; this is actually factual information. We can go into the workspace settings and, you know, go to the vector database, increase the number of snippets per chat, and change the way documents are deemed relevant.

But there's actually an easier way to just use LLMs, and that is with agents. As I said before, this is not a capability built into Ollama, and it's not a capability built into Llama 3. It's something we've been able to make work for any LLM, even ones that don't support function calling, and function calling is how all of this magic works. Now you can unlock it when you use AnythingLLM with any LLM. So for the agent settings, we want to use Ollama. We have Ollama, we have our model, and that's it. We really don't want to use a worse model here; we have Llama 3, so let's stick with the Q8 version. And there are some default skills that exist: of course, RAG and long-term memory. We already saw that that's built into AnythingLLM.
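Before walking through the rest of the default skills, here's a rough sketch of the general pattern behind that "function calling for any LLM" trick: prompt the model to answer either in plain text or with a small JSON tool call, run the tool, then hand the result back for a final answer. To be clear, this is my own illustration, not AnythingLLM's actual prompts or code; the tool names, prompt wording, and the local Ollama endpoint are all assumptions made for the example.

```python
# Illustrative tool-calling loop for a model with no native function-calling API.
# The model is instructed to emit JSON when it wants a tool; the host runs the tool
# and feeds the result back. Not AnythingLLM's implementation, just the pattern.
import json
import requests

OLLAMA_URL = "http://localhost:11434"
MODEL = "llama3:8b-instruct-q8_0"

SYSTEM_PROMPT = (
    "You can call tools. If a tool is needed, reply ONLY with JSON like "
    '{"tool": "<name>", "args": {...}}. Available tools: '
    "web_search(query), scrape_website(url). Otherwise reply normally."
)

def chat(messages):
    """One non-streaming chat completion against the local Ollama server."""
    r = requests.post(f"{OLLAMA_URL}/api/chat",
                      json={"model": MODEL, "messages": messages, "stream": False},
                      timeout=300)
    return r.json()["message"]["content"]

def run_tool(name, args):
    """Stub tools standing in for real skills (search provider, scraper, etc.)."""
    if name == "scrape_website":
        return requests.get(args["url"], timeout=30).text[:4000]  # truncated page text
    if name == "web_search":
        return f"(pretend search results for: {args.get('query', '')})"
    return f"unknown tool: {name}"

def agent(question):
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question}]
    reply = chat(messages)
    try:
        call = json.loads(reply)
    except json.JSONDecodeError:
        return reply  # model answered directly in plain text, no tool needed
    if not isinstance(call, dict) or "tool" not in call:
        return reply
    result = run_tool(call["tool"], call.get("args", {}))
    messages += [{"role": "assistant", "content": reply},
                 {"role": "user", "content": f"Tool result:\n{result}\n\nNow answer the original question."}]
    return chat(messages)

print(agent("What are the key features listed on https://useanything.com ?"))
```

The shape of the loop is the important part: the model emits a structured request, the host application executes it, and the model finishes with the tool's output in hand. AnythingLLM does that plumbing for you.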
We have the ability to look at the documents in our workspace, modify them, summarize them, and commit new information to long-term memory just from chatting, all of that. We can summarize these documents. We can scrape websites; that's a feature just built into AnythingLLM. We can also generate charts, though I'll admit this one is a little model-dependent: you could paste in a CSV and say "make a bar chart", and some models kill it, but Llama 3 honestly isn't that great at it. We can generate and save files to the browser, so if we're talking to it and we say, hey, can you save that contact information to, you know, tim.txt, it'll download it and save it on your desktop on this device. And then, of course, live web search and browsing. This makes any LLM that you download and run locally basically on par with Perplexity, and you can actually do it for free. I'm sure you're thinking, ah, but I need an API key. You do, but Google actually offers this service totally for free. You can just click the link we provide, and it opens up the Programmable Search Engine setup. You get 100 queries a day, which is honestly pretty good. We do support other search-results providers, but this one is totally free, and anybody with a Google account can sign up. So let's connect mine so we can get web browsing. Okay, I have that information put in, I'll click update, and now everything is saved. Let's go back to the chat window.

Now, keep in mind we had information about AnythingLLM already stored in here, so let's remove that right now. If we reset the chat and say "what is AnythingLLM?", we should again get a made-up response that has nothing to do with the actual tool. But we can get an agent into the loop on this. The way you do that is by typing @agent, or you can click this and we tell you about how agents work; if you click it, you can see that @agent is how you invoke it. So you'd say: @agent, can you scrape useanything.com, which is our website, and tell me the key features? What we'd hope to see is this model go to useanything.com, scrape it, compile that information, specifically the key features, and hopefully give us back a pretty good text response. And you can see we actually get what I would consider a pretty decent response. But keep in mind, this is not in long-term memory. So let's ask the model to remember that for later: "Thank you. Can you remember that information for later?" What we'd hope to see is the model recognize this as an available function and say, yes, of course, I will take the chat as it is right now, summarize it, and save it for later, so that when we ask in regular chat, it would work. And you can see it's done that.

Now let's look at summarization. Summarization is one of the most asked-about and most-used features of AnythingLLM. It's not how RAG works; it's actually a pretty big misunderstanding that you can just upload a document into a vector database and say "summarize my document." That's just not how vector databases work. But with AnythingLLM, you can do it. So I'm going to open up a new workspace, we'll just call it AnythingLLM, and we're going to upload that same readme document. Because I've already embedded it in another workspace, embedding is instant.
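Before running that, a quick note on how agent-style summarization generally works, since you can't get it from a vector lookup: the document itself is read in pieces, each piece is summarized, and the partial summaries are combined. Here's a minimal sketch of that chunk-then-combine pattern against a local Ollama server; it illustrates the general approach rather than AnythingLLM's internals, and the chunk size, model tag, and file path are arbitrary choices for the example.

```python
# Minimal sketch of chunk-then-combine document summarization against a local
# Ollama server. Illustrates the general pattern, not AnythingLLM's internals.
import requests

OLLAMA_URL = "http://localhost:11434"
MODEL = "llama3:8b-instruct-q8_0"
CHUNK_CHARS = 8_000  # arbitrary chunk size chosen so each piece fits the context window

def ask(prompt: str) -> str:
    """Single non-streaming completion from the local model."""
    r = requests.post(f"{OLLAMA_URL}/api/chat",
                      json={"model": MODEL, "stream": False,
                            "messages": [{"role": "user", "content": prompt}]},
                      timeout=600)
    return r.json()["message"]["content"]

def summarize(text: str) -> str:
    # 1) Summarize each chunk independently (the "map" step).
    chunks = [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]
    partials = [ask(f"Summarize this section of a document:\n\n{c}") for c in chunks]
    # 2) Combine the partial summaries into one final summary (the "reduce" step).
    joined = "\n\n".join(partials)
    return ask(f"Combine these section summaries into one concise summary:\n\n{joined}")

if __name__ == "__main__":
    with open("README.md", encoding="utf-8") as f:  # plain-text stand-in for readme.pdf
        print(summarize(f.read()))
```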
Now, with no other setup or leading prompts or anything like that, let's just ask the agent: can you summarize readme.pdf? That's the name of the file in the workspace. You can see it looks at the available documents, finds a document called readme.pdf, and then begins to summarize it. Again, this is all running locally within my network, because I'm using my Windows computer, but it is summarizing. You can see it says it summarized it, blah, blah, blah, did all the stuff, and mentions it's MIT licensed.

That is the quick preview of what agents can do for any LLM when you put them in AnythingLLM. And while I recognize that this list of default skills is pretty limited right now, I really want to emphasize that this is just the beginning for AnythingLLM. We're going to add the ability for you to define your own agents, like you would in tools such as CrewAI and the other agent builders already out there; that will just exist inside AnythingLLM. AnythingLLM plus Ollama can be your go-to not only for RAG, but also for AI agents that can do things for you. We have a lot more cooking on this front, so I'm really excited to show you this even in its current state.

I also want to remind everybody that AnythingLLM is open source. You can use the app I just showed you, right now, today, for free, with no ifs, ands, or buts; you just download it and get it running. The easiest way to support us is by starring us on GitHub; we would really appreciate that. Even more so, I'd appreciate feedback and suggestions on new tools you'd like to see agents accomplish. We'd love to know what you're working on and how AnythingLLM fits into that flow. So that's it for this short video. I really appreciate your time. Thank you.