Transcript for:
Running GPT Locally with GPU Support

Have you ever thought about installing a GPT locally on your PC? Your data stays yours, and you can even use the more versatile uncensored models. Speed used to be a major issue when running GPTs locally, but that is no longer the case. Nomic AI has released a GPT4All version that supports the Vulkan GPU interface and accelerates PCs with AMD, Nvidia, and Intel Arc GPUs. See the difference: on the left, Mistral OpenOrca runs slowly with CPU support, as in the older GPT4All versions, and on the right the same model runs with GPU support. It is more than five times faster. In this video you will learn how to install GPT4All, enable GPU support, download an uncensored model, and what else to consider for GPU support. Let's start.

On Nomic AI's GitHub page you find GPT4All. It's open source under the MIT license. By the way, you find all links in the description below. In the README there are some links, for example to the web page and the documentation, and in the chat client section you find the links to the installers. We choose Windows. After the successful download we start the executable. Sorry for the German in the installer; I haven't found a way to change the language in this tool. The installation is very simple: select a suitable directory, accept the license, and add a link to the start menu. After a very short time, the installation is finished.

Before recording this installation I had already tested the whole thing, and that's why you can see several model files listed here. In the settings menu you should check that the download path suits you; all new model files will go there. Here you can select the number of threads to be used, and in this section you can choose whether your GPU shall be used. Auto is a good value here.

As promised, we will use the Mistral LLM with GPT4All. In the download section you find a list of several large language models. We will download both Mistral models. Now we select Mistral OpenOrca and do a quick test; before that, we just check that the GPU is selected. 44 tokens per second, so that's definitely using the GPU. (A scripted version of this speed test is sketched after the transcript.)

There are a lot more models that can be used with GPT4All, and I'll show you how, for example with the uncensored Llama 2 model. By the way, if you want to see more videos like this, press the subscribe button. On Hugging Face you can find lots more models. It's important to search for GGUF models; that's the new format for GPT4All. I've selected Llama and the Llama 2 7B Chat Uncensored GGUF. The files are quantized differently. Just remember, all downloaded GGUF files have to be put into the download path from the settings; a scripted download is sketched below as well. There is one important point to consider here, which you will understand in a few seconds.

Let's choose the Q8 model of the Llama 2 7B Chat. What's that? Obviously not the GPU but the CPU is used, and that's wrong. In the README, under "What's New", we find that the Nomic Vulkan support only covers the Q4_0 and Q6 quantizations. That's the reason why it did not work with our Q8 GGUF model. But I have downloaded a Q6_K GGUF model; let's try this one. And again, this does not work. So maybe it's only Q4_0. And yes, obviously this works: 41, so around 40 tokens per second. That's great. Now we try the Llama 13B Q4_0 model. No, again only the CPU is used and the GPU is ignored. In contrast to the entry in the README, it seems that currently only Q4_0 models can be used with GPU support. And my 13B Q4_0 model did not use the GPU either, so support for models larger than 7B will most probably follow in time. (The last sketch below flags which of your downloaded files can be expected to run on the GPU.)
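If you want to reproduce the CPU-versus-GPU comparison outside the chat client, the GPT4All Python bindings expose the same Vulkan backend. The following is a minimal sketch, assuming the `gpt4all` package is installed and that the Mistral OpenOrca file name matches what the download section gave you; the prompt is just a placeholder, and word count stands in for the exact token count, which the bindings do not report here.

```python
import time

from gpt4all import GPT4All  # pip install gpt4all

# File name as shown in GPT4All's download section (an assumption; adjust
# to the GGUF file you actually downloaded).
MODEL_FILE = "mistral-7b-openorca.Q4_0.gguf"
PROMPT = "Explain in two sentences why running an LLM locally protects your data."


def words_per_second(device: str) -> float:
    """Generate once on the given device and return a rough throughput.

    The bindings do not report an exact token count here, so
    whitespace-separated words serve as a crude proxy for tokens.
    """
    model = GPT4All(MODEL_FILE, device=device)
    start = time.perf_counter()
    reply = model.generate(PROMPT, max_tokens=200)
    return len(reply.split()) / (time.perf_counter() - start)


cpu = words_per_second("cpu")
gpu = words_per_second("gpu")  # Vulkan backend on AMD, Nvidia, and Intel Arc
print(f"CPU: {cpu:.1f} words/s  GPU: {gpu:.1f} words/s  speedup: {gpu / cpu:.1f}x")
```

Words per second is not the same number the chat client shows, but since both runs use the same prompt and model, the ratio between CPU and GPU is still a fair comparison.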
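Downloading a GGUF file from Hugging Face can also be scripted with the `huggingface_hub` package. In this sketch the repository name, the file name, and the default Windows download path are assumptions taken from TheBloke's usual naming scheme and GPT4All's usual settings; verify all three against the model card and the settings dialog before running.

```python
from pathlib import Path

from huggingface_hub import hf_hub_download  # pip install huggingface-hub

# Repository and file names follow TheBloke's usual naming scheme; verify
# both on the actual model card before running (assumptions, not checked).
REPO_ID = "TheBloke/llama2_7b_chat_uncensored-GGUF"
FILENAME = "llama2_7b_chat_uncensored.Q4_0.gguf"

# GPT4All's usual download path on Windows; replace it with whatever the
# settings dialog in your installation shows.
MODELS_DIR = Path.home() / "AppData" / "Local" / "nomic.ai" / "GPT4All"

path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=MODELS_DIR)
print(f"Saved to {path}")
```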
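Finally, since only Q4_0 files actually ran on the GPU in my tests, a quick scan of the download folder can tell you in advance which files will fall back to the CPU. This is only a file-name heuristic, and the folder path is the same assumption as in the download sketch.

```python
import re
from pathlib import Path

# Same path assumption as in the download sketch; point this at the
# download folder shown in GPT4All's settings.
MODELS_DIR = Path.home() / "AppData" / "Local" / "nomic.ai" / "GPT4All"

# In my tests only Q4_0 GGUF files actually ran on the Vulkan backend,
# despite the README listing further quantizations.
GPU_QUANTS = {"Q4_0"}

for gguf in sorted(MODELS_DIR.glob("*.gguf")):
    # Pull the quantization tag (e.g. Q4_0, Q6_K, Q8_0) out of the file name.
    match = re.search(r"(Q\d+(?:_[A-Z0-9]+)*)", gguf.name, re.IGNORECASE)
    quant = match.group(1).upper() if match else "unknown"
    backend = "GPU (Vulkan)" if quant in GPU_QUANTS else "CPU only"
    print(f"{gguf.name:55} {quant:8} -> expected: {backend}")
```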
If you enjoyed the video, please leave a like or a comment.