Running Llama 2 Locally on CPU for Document Q&A

Large language models (LLMs) are becoming increasingly popular for a variety of applications, including document Q&A. However, running LLMs is computationally expensive, and inference is especially slow on CPUs.

In this article, we will show you how to run Llama 2 locally on a CPU for document Q&A, using the C Transformers library, GGML, and LangChain.


  • You will need a Linux or macOS machine with a CPU that supports SSE4.2.
  • You will need to install the following Python packages:
    • ctransformers
    • ggml
    • langchain
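
As a quick way to check the SSE4.2 requirement listed above, the following sketch reads the CPU feature list and looks for the SSE4.2 flag. It is only an illustration under a few assumptions: on Linux the flag appears as sse4_2 in /proc/cpuinfo, and on Intel Macs it appears as SSE4.2 in the output of sysctl -n machdep.cpu.features (that key may not be available on other Macs). The function names are made up for this example and are not part of any library used in the article.

```python
import platform
import subprocess
from pathlib import Path


def flags_include_sse42(flags_text: str) -> bool:
    """Return True if a CPU feature string lists SSE4.2.

    Linux reports the flag as "sse4_2"; macOS (Intel) reports "SSE4.2".
    """
    tokens = flags_text.lower().replace(":", " ").split()
    return "sse4_2" in tokens or "sse4.2" in tokens


def read_cpu_flags() -> str:
    """Best-effort read of the CPU feature list; returns "" if unavailable."""
    system = platform.system()
    try:
        if system == "Linux":
            return Path("/proc/cpuinfo").read_text()
        if system == "Darwin":
            # Assumption: this sysctl key exists (it does on Intel Macs).
            result = subprocess.run(
                ["sysctl", "-n", "machdep.cpu.features"],
                capture_output=True, text=True, check=False,
            )
            return result.stdout
    except OSError:
        pass
    return ""


if __name__ == "__main__":
    print("SSE4.2 reported:", flags_include_sse42(read_cpu_flags()))
```

If the script reports False, the feature list either does not include SSE4.2 or could not be read on your platform.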


  1. Clone the Llama 2 Open-Source LLM CPU Inference repository using git clone.
  2. Install the dependencies by running the following command:
pip install -r requirements.txt
  3. Download the Llama 2 GGML binary file. You can find the download link in the Llama 2 Open-Source LLM CPU Inference repository.
  4. Once you have downloaded the binary file, extract it to the data directory.
  5. Run the following command to start the document Q&A application:

The application will prompt you to enter a question. Once you have entered a question, the application will use Llama 2 to answer the question.
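
To give a rough idea of what happens when a question is answered, here is a minimal sketch that loads a GGML model with the C Transformers library and asks it a question about a piece of context. It is not the repository's actual code: the model file name, the prompt wording, and the helper names build_prompt and answer are assumptions made for illustration, and the context is passed in by hand rather than retrieved from your documents.

```python
from pathlib import Path

# Hypothetical file name; use whichever GGML file you placed in the data directory.
MODEL_PATH = Path("data") / "llama-2-7b-chat.ggmlv3.q8_0.bin"

# Illustrative prompt wording, not taken from the repository.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the answer is not in the context, say that you don't know.\n\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the template with a context passage and the user's question."""
    return PROMPT_TEMPLATE.format(context=context.strip(), question=question.strip())


def answer(question: str, context: str, model_path: Path = MODEL_PATH) -> str:
    """Load the GGML model on CPU and generate an answer."""
    # Imported lazily so this file can be loaded even if ctransformers is missing.
    from ctransformers import AutoModelForCausalLM

    llm = AutoModelForCausalLM.from_pretrained(str(model_path), model_type="llama")
    return llm(build_prompt(context, question), max_new_tokens=128, temperature=0.01)


if __name__ == "__main__":
    if MODEL_PATH.exists():
        print(answer("What is the capital of France?", "Paris is the capital of France."))
    else:
        print(f"Model file not found at {MODEL_PATH}")
```

In a full document Q&A pipeline, the context would come from a retrieval step over your documents (for example, via LangChain) instead of being written by hand.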


Let’s say you want to ask the following question:

What is the capital of France?

To answer this question, you would run the following command:

python "What is the capital of France?"

The application will then print the answer, which is Paris.


If you are having trouble running the application, you can check the following:

  • Make sure that you have installed all of the required Python packages.
  • Make sure that the Llama 2 GGML binary file is extracted to the data directory.
  • Make sure that the CPU in your machine supports SSE4.2.
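
The first two checks above can be automated with a small script like the one below. Treat it as a sketch: it assumes that each package's import name matches the name listed in this article, that the model file has a .bin extension, and that the data directory is relative to where you run the script. The helper names are illustrative.

```python
import importlib.util
from pathlib import Path

# Package names as listed in this article; import names are assumed to match.
REQUIRED_PACKAGES = ["ctransformers", "ggml", "langchain"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def model_files(data_dir="data"):
    """List .bin files in the data directory (empty if the directory is absent)."""
    directory = Path(data_dir)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.bin"))


if __name__ == "__main__":
    print("Missing packages:", missing_packages(REQUIRED_PACKAGES) or "none")
    print("Model files in data/:", [p.name for p in model_files()] or "none found")
```

Any package reported as missing can be installed with pip, and an empty model list suggests the GGML file is not where the application expects it.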

If you are still having trouble, you can post a question on the Llama 2 GitHub repository.

