Running Llama 2 on CPU Inference Locally for Document Q&A

Large language models (LLMs) are becoming increasingly popular for a variety of applications, including document Q&A. However, running LLMs can be computationally expensive, especially on CPUs.

In this article, we will show you how to run Llama 2 inference locally on a CPU for document Q&A. We will use the C Transformers library, GGML, and LangChain.

Prerequisites

You will need a Linux or macOS machine with a CPU that supports SSE4.2.
You will need to install the following Python packages:
- ctransformers
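- ggml (the tensor library and model file format that ctransformers builds on; ctransformers bundles it, so it typically needs no separate install)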
- langchain
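
Before going further, it can help to confirm that the packages are importable from the Python environment you plan to use. The following is a minimal sketch, not part of the repository: the helper name missing_modules is hypothetical, and it assumes the import names match the package names listed above. GGML is left out on the assumption that it ships inside ctransformers rather than as its own module.

```python
import importlib.util

# Import names assumed to match the package names in the prerequisites list.
REQUIRED_MODULES = ["ctransformers", "langchain"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If the script reports missing packages, installing them with pip into the same environment should resolve it.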

Instructions

Clone the Llama 2 Open-Source LLM CPU Inference repository (https://github.com/kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference):

git clone https://github.com/kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference.git

From inside the cloned directory, install the dependencies by running the following command:

pip install -r requirements.txt

Download the Llama 2 GGML binary file. You can find the download link in the Llama 2 Open-Source LLM CPU Inference repository (https://github.com/kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference).
Once you have downloaded the binary file, extract it to the data directory.
Run the document Q&A application, passing your question as a command-line argument:

python main.py "<your question>"

The application uses Llama 2 to answer the question and prints the result.

Example

Let’s say you want to ask the following question:

What is the capital of France?

To answer this question, you would run the following command:

python main.py "What is the capital of France?"

The application will then print the answer, which is Paris.
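
The flow behind a command like the one above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual main.py: the function names, the model path argument, and the config values are hypothetical, and the sketch leaves out the document-retrieval step entirely, showing only how a question might be read from the command line and passed to a GGML model through LangChain's CTransformers wrapper.

```python
import argparse


def parse_question(argv=None):
    """Read the question as a single positional command-line argument."""
    parser = argparse.ArgumentParser(description="Ask a question on the command line")
    parser.add_argument("question", type=str, help="the question to answer")
    return parser.parse_args(argv).question


def load_llm(model_path):
    """Load a GGML Llama 2 model through LangChain's CTransformers wrapper."""
    # Imported lazily so that argument parsing works even without LangChain installed.
    try:
        from langchain_community.llms import CTransformers  # newer LangChain releases
    except ImportError:
        from langchain.llms import CTransformers  # older LangChain releases
    return CTransformers(
        model=model_path,      # path to the downloaded GGML binary file
        model_type="llama",
        # Illustrative settings only; tune them for your own use case.
        config={"max_new_tokens": 256, "temperature": 0.01},
    )


def answer_question(question, model_path):
    """Send the question straight to the model, with no document retrieval."""
    llm = load_llm(model_path)
    # Older LangChain releases may need llm(question) instead of llm.invoke(question).
    return llm.invoke(question)
```

A script built on this sketch would call parse_question() to read the question and then answer_question() with the path to the model file you downloaded.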

Troubleshooting

If you are having trouble running the application, you can check the following:

- Make sure that you have installed all of the required Python packages.
- Make sure that the Llama 2 GGML binary file is extracted to the data directory.
- Make sure that the CPU in your machine supports SSE4.2.
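
The model file and CPU checks above can be scripted. The sketch below is a rough aid rather than anything shipped with the repository: the function names are hypothetical, the .bin suffix is an assumption about how the GGML file is named, and the CPU check relies on /proc/cpuinfo on Linux and the machdep.cpu.features sysctl key on Intel Macs. On other systems it cannot read the flags and returns an empty string.

```python
import platform
import subprocess
from pathlib import Path


def has_sse42_flag(cpu_flags_text):
    """Return True if the text lists SSE4.2 (Linux spells it sse4_2, macOS SSE4.2)."""
    text = cpu_flags_text.lower()
    return "sse4_2" in text or "sse4.2" in text


def read_cpu_flags():
    """Return raw CPU feature text, or an empty string if it cannot be read."""
    system = platform.system()
    try:
        if system == "Linux":
            return Path("/proc/cpuinfo").read_text()
        if system == "Darwin":
            result = subprocess.run(
                ["sysctl", "-n", "machdep.cpu.features"],
                capture_output=True, text=True, check=False,
            )
            return result.stdout
    except OSError:
        pass
    return ""


def find_model_files(directory="data", suffix=".bin"):
    """Return files with the given suffix in the directory, sorted by name."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.is_file() and p.suffix == suffix)
```

Running has_sse42_flag(read_cpu_flags()) and find_model_files() from the repository root gives a quick view of whether the last two items are satisfied. An empty result from read_cpu_flags() means the check was inconclusive, not that SSE4.2 is missing.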

If you are still having trouble, you can open an issue on the project's GitHub repository: https://github.com/kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference.
