In my quest to explore generative AI and LLM models, I tried setting up a local/offline LLM model. This is where privateGPT and gpt4all made their presence felt.
I wanted to try both and realized that gpt4all needs a GUI to work in most cases, and there is still a long way to go before you get proper headless support out of the box. PrivateGPT, however, has its own ingestion logic and supports both GPT4All and LlamaCpp model types, so I started exploring it in more detail. There are a lot of prerequisites for working with these models, the most important being plenty of RAM and plenty of CPU for processing power (GPUs are better, but I was stuck with machines without a GPU, so I focused specifically on an optimized CPU configuration).
This article is mostly a reminder to myself for when I run into these errors again, and I hope it helps others going through the same process.
Python
You need Python 3.10 to run these systems; it seems to be the most commonly mentioned version. Ubuntu 20.04 and similar releases do not ship it by default, so you will need a PPA to get Python 3.10 on these systems.
Commands to configure this
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.10 python3.10-dev python3.10-distutils
Installing pip and other packages
Expert Tip: Use venv to avoid corrupting your machine’s base Python.
Create a new venv in the folder containing privateGPT. This is a one-time step.
python3.10 -m venv venv
Later sessions only need the following two commands to activate and deactivate it:
source venv/bin/activate
deactivate
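Once the venv is active, it’s worth a quick sanity check that it really picked up Python 3.10; nothing project-specific here, just the standard library:
import sys

# privateGPT and the bindings below expect Python 3.10+
assert sys.version_info >= (3, 10), f"expected Python 3.10+, got {sys.version.split()[0]}"
print(sys.executable)  # should point inside the venv folder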
Find the models
The problem with these GPT setups is that everything depends heavily on the input model you use. The gpt4all model list contains many interesting models. However, these models did not work for me either.
- ggml-gpt4all-j-v1.3-groovy.bin
Using embedded DuckDB with persistence: data will be stored in: db
Found model file.
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 5401.45 MB
Illegal instruction (core dumped)
- ggml-stable-vicuna-13B.q4_2.bin
$ python3 privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
Found model file.
gptj_model_load: loading model from 'models/ggml-stable-vicuna-13B.q4_2.bin' - please wait ...
gptj_model_load: invalid model file 'models/ggml-stable-vicuna-13B.q4_2.bin' (bad magic)
GPT-J ERROR: failed to load model from models/ggml-stable-vicuna-13B.q4_2.bin
This put an end to my experimentation and I had to start looking for alternatives. The first step, as always, is to check the issue logs on the project repositories themselves. Things got a bit complicated because we are looking at three projects: llama.cpp, gpt4all and privateGPT.
- This /issues/203 issue confirmed my suspicion that my older CPU could be the problem in this case; in particular, they need AVX2 support (see the quick check after this list).
- To get the gpt4all models to work, the suggestion seems to be to recompile gpt4all, but that gets complicated because I’m not using gpt4all directly, I’m going through a Python binding, so that would be a mess in itself.
- This is where I turned my attention to llama.cpp, because I had played with it earlier and it had worked at some level.
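Before blaming the model files, it is worth checking whether the CPU actually advertises the instruction sets these builds are usually compiled with. A quick Linux-only sketch that just reads /proc/cpuinfo:
# Quick Linux-only check for the CPU features prebuilt ggml binaries typically assume.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

for needed in ("avx", "avx2", "f16c", "fma"):
    print(f"{needed}: {'present' if needed in flags else 'MISSING'}")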
I’d like to spend a few minutes on llama.cpp because it can help us convert models into a usable format, especially when they were created with different software. llama.cpp was one of the first projects to come out and has a lot of tools built up around model handling.
I would again recommend using venv for this project as well. After compiling llama.cpp, you will be able to play with the model files listed in the README’s #prepare-data-run section. I won’t go into detail on how to get these models; there are several ways, and this article will simply assume you already have them and work from that point.
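For completeness, here is roughly the convert-and-quantize flow I used, driven from Python. The script and binary names (convert.py, quantize) come from the llama.cpp README of that period and may have changed since; the paths are just examples:
# Hypothetical wrapper around llama.cpp's model prep steps; names may differ in newer versions.
import subprocess

LLAMA_CPP_DIR = "../llama.cpp"  # wherever you cloned and built llama.cpp

# 1. Convert the original 7B weights into a ggml f16 file.
subprocess.run(["python3", "convert.py", "models/7B/"], cwd=LLAMA_CPP_DIR, check=True)

# 2. Quantize the f16 file down to q4_0 so it fits comfortably in RAM.
subprocess.run(
    ["./quantize", "models/7B/ggml-model-f16.bin", "models/7B/ggml-model-q4_0.bin", "q4_0"],
    cwd=LLAMA_CPP_DIR,
    check=True,
)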
Now, I remembered that I had set up llama.cpp and converted a 7B model a few days earlier, and I had that model working in chat mode. It was fairly bare-bones, though, and not really what I was after, because my ultimate goal was to get answers out of my own notes. So I thought I would try it with this system, and for privateGPT all I had to do was change the model path in .env:
MODEL_TYPE=LlamaCpp
MODEL_PATH=../llama.cpp/models/7B/ggml-model-q4_0.bin
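For reference, privateGPT picks these values up with python-dotenv at startup; stripped down, the relevant bit looks roughly like this (variable names come from the project’s example.env):
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file sitting next to privateGPT.py

model_type = os.environ.get("MODEL_TYPE")  # "LlamaCpp" or "GPT4All"
model_path = os.environ.get("MODEL_PATH")  # e.g. ../llama.cpp/models/7B/ggml-model-q4_0.bin
print(model_type, model_path)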
Running python3 privateGPT.py now opened another can of worms.
First error: model format is no longer supported
error loading model: this format is no longer supported (see [ggerganov/llama.cpp#1305](/pull/1305))
It was funny, but fair enough: I had built the model a few weeks earlier, and a few weeks in the AI/GPT world might as well be a decade. The #description section linked above confirmed that the quantization format had changed in llama.cpp. The fix is simple: git pull, recompile the project, and run the conversion again, which should solve the problem. After a few minutes, I was ready to use the updated model.
Next error: unknown magic, version
error loading model: unknown (magic, version) combination: 67676a74, 00000003; is this really a GGML file?
It was interesting. I was absolutely sure it was a ggml file because I had just created it, but the error was reminiscent of other programs that fail when the magic bytes in a file don’t match what they expect. I went back to the issue log, found /issues/409#issuecomment-1559128238 and many other similar entries pointing to upgrading the llama-cpp-python binding, and realized that requirements.txt had the binding pinned at 0.1.50, so a pip install of the newer version plus a matching change to requirements.txt fixed the problem.
Command for reference:
pip install llama-cpp-python==0.1.53
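To double-check which version of the binding actually ended up in the venv (and that it matches what requirements.txt now says), the standard library can tell you:
from importlib.metadata import version

# The ggml format of the converted model has to match what this binding understands.
print("llama-cpp-python:", version("llama-cpp-python"))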
If these two errors do not match the one you are facing, I suggest keeping an eye on /issues/276#issuecomment-1554262627 for updates. For me, upgrading the llama-cpp-python binding did the trick and finally got my privateGPT instance working. I’m still experimenting with data ingestion.
Ingesting files into the vector database
- Ingestion also picks up files in subdirectories.
- There is a problem if a file contains lines or characters that cannot be decoded as UTF-8.
I made a minor change to the code to get the name of the file causing ingestion errors due to Unicode issues: I added a try/except block that prints the name of the offending file before re-raising the error.
# Patched load_single_document() in ingest.py: print the offending file before re-raising
def load_single_document(file_path: str) -> Document:
    ext = "." + file_path.rsplit(".", 1)[-1]
    if ext in LOADER_MAPPING:
        loader_class, loader_args = LOADER_MAPPING[ext]
        try:
            loader = loader_class(file_path, **loader_args)
            return loader.load()[0]
        except UnicodeDecodeError:
            print(file_path)
            raise
    raise ValueError(f"Unsupported file extension '{ext}'")
This has worked well enough for me to identify the source files that need to be removed. I haven’t found a better way to deal with these files yet; I simply delete the offending file and rely on the files that ingest cleanly.
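If you would rather find the offending files before kicking off a long ingestion run, a small pre-scan of the source folder also works. This is a hypothetical helper of my own, not part of privateGPT; it assumes the default source_documents folder and only checks extensions that are read as plain text:
from pathlib import Path

# Hypothetical pre-scan: list files that will trip UTF-8 decoding during ingestion.
TEXT_EXTS = {".txt", ".md", ".csv"}  # extensions loaded as plain text

for path in Path("source_documents").rglob("*"):
    if path.suffix.lower() not in TEXT_EXTS or not path.is_file():
        continue
    try:
        path.read_text(encoding="utf-8")
    except UnicodeDecodeError as err:
        print(f"{path}: {err}")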
Speed up responses
Once the ingestion process has worked its wonders, you will be able to run python3 privateGPT.py and get a prompt that will hopefully answer your questions, listing all the sources used to build the answer. However, you’ll immediately realize that it is painfully slow. I fired up htop to check how much load the process was putting on the server and, to my amusement and as I half expected, it was only using a single thread, with RAM usage also well under control.
So it looked like the thread count was something I needed to bump and, luckily, there was a discussion in the privateGPT repo covering exactly this: /discussions/286#discussioncomment-5945851. A quick edit to my privateGPT.py with the code below did the job.
import os  # already imported near the top of privateGPT.py

# Use every CPU core the process is allowed to run on
n_cpus = len(os.sched_getaffinity(0))
match model_type:
    case "LlamaCpp":
        llm = LlamaCpp(model_path=model_path, n_threads=n_cpus, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False)
Now, running the code, I can see all 32 of my threads in use while it tries to work out the “meaning of life”.
Bonus tip:
If you’re just looking for a super-fast search engine over your notes of all kinds, the vector DB alone makes your life easier. Set a fake model name in .env so that no LLM is loaded, type your search string at the prompt, and it will point you to all the files and sources used to retrieve the relevant text.
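If you want to skip the prompt loop entirely, you can also query the persisted Chroma store directly from a few lines of Python. This is only a sketch, assuming the default db folder and the all-MiniLM-L6-v2 embeddings model from the project’s example.env; adjust to whatever your .env actually uses:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Reuse the same embeddings model that was used at ingestion time, otherwise results will be junk.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory="db", embedding_function=embeddings)

for doc in db.similarity_search("what did I note about backups?", k=4):
    print(doc.metadata.get("source"), "->", doc.page_content[:120])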
What next?
I feel like I’m just at the start of a crazy adventure, and there’s still a lot to learn and play with. I may write more on this topic as I make further progress in this area.