Run LLMs on Your CPU with Llama.cpp: A Step-by-Step Guide

Large language models (LLMs) are becoming increasingly popular, but they can be computationally expensive to run. There have been several advancements, such as support for loading models in 4-bit and 8-bit precision on Hugging Face, but these approaches still require a GPU. This has limited their use to people with access to specialized hardware. Running LLMs on a CPU has been possible, but the performance was usually too limited to be practical.

Recent work by Georgi Gerganov has made it possible to run LLMs on CPUs with high performance, thanks to his llama.cpp library, which provides high-speed inference for a variety of LLMs.

The original llama.cpp library focuses on running models locally from a shell. This does not offer much flexibility and makes it hard to leverage the vast range of Python libraries for building applications. Recently, LLM frameworks such as LangChain have added support for llama.cpp through the llama-cpp-python package.

In this blog post, we will see how to use llama.cpp in Python via the llama-cpp-python package, which provides Python bindings for the library.

We will also see how to use llama-cpp-python to run Zephyr, an open-source LLM based on the Mistral model.

Set up llama-cpp-python

Setting up the Python bindings is as simple as running the following command:

pip install llama-cpp-python

For more detailed installation instructions, please see the llama-cpp-python documentation: https://github.com/abetlen/llama-cpp-python#installation-from-pypi-recommended.
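The default wheel builds llama.cpp for the CPU. If you want to experiment with hardware-accelerated backends, the llama-cpp-python README describes optional build flags that are passed through the CMAKE_ARGS environment variable. The exact flag names change between releases, so treat the commands below as illustrative sketches and check the README for your version:

# Illustrative only: build with OpenBLAS support (flag names vary by release)
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python

# Illustrative only: build with Metal support on Apple Silicon
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python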

Using an LLM with llama-cpp-python

Once you have installed the llama-cpp-python package, you can start using it to run LLMs.

You can use any language model with llama.cpp provided that it has been converted to the GGUF format (the successor to the older GGML format). GGUF versions of most popular LLMs are already available and can easily be found on Hugging Face.

An important thing to note is that the original LLMs are quantized when they are converted to GGUF. This greatly reduces the memory required to run these large models, without a significant loss in quality. For example, it lets us load a 7-billion-parameter model that would normally take about 13 GB of memory in less than 4 GB of RAM.
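A rough back-of-the-envelope calculation shows why quantization helps. The numbers below ignore the small per-block overhead that quantized formats add, but they give the right order of magnitude:

# Approximate memory footprint of a 7B-parameter model at different precisions.
# Rough estimate only: ignores quantization block overhead and runtime buffers.
params = 7_000_000_000

fp16_gb = params * 2 / 1024**3    # 2 bytes per weight  -> ~13 GB
q8_gb   = params * 1 / 1024**3    # 1 byte per weight   -> ~6.5 GB
q4_gb   = params * 0.5 / 1024**3  # 4 bits per weight   -> ~3.3 GB

print(f"fp16: ~{fp16_gb:.1f} GB, Q8: ~{q8_gb:.1f} GB, Q4: ~{q4_gb:.1f} GB")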

In this article, we use the GGUF version of Zephyr-7B-Beta, which is available on the Hugging Face Hub.

The model can be downloaded from here: https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF.

Downloading the GGUF file and Loading the LLM

The following code can be used to download the model. It fetches the required GGUF file, in this case zephyr-7b-beta.Q4_0.gguf, from the Hugging Face Hub, and checks whether the file is already present before attempting to download it.

import os
import urllib.request


def download_file(file_link, filename):
    # Checks if the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")

# Downloading the GGUF model from Hugging Face
gguf_model_path = "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_0.gguf"
filename = "zephyr-7b-beta.Q4_0.gguf"

download_file(gguf_model_path, filename)
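If you prefer, the huggingface_hub library can handle the download and local caching for you. A minimal sketch:

# Optional alternative: download the GGUF file with huggingface_hub
# (pip install huggingface_hub). The file is cached locally and the
# function returns the path to the cached file.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_0.gguf",
)
print(model_path)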

The next step is to load the model that you want to use. This can be done using the following code:

from llama_cpp import Llama

llm = Llama(model_path="zephyr-7b-beta.Q4_0.gguf", n_ctx=512, n_batch=126)

There are two important parameters that should be set when loading the model.

  • n_ctx: This is used to set the maximum context size of the model. The default value is 512 tokens.

The context size is the sum of the number of tokens in the input prompt and the maximum number of tokens the model can generate. A model with a smaller context size generates text much more quickly than one with a larger context size. If your use case does not require very long prompts or generations, reducing the context length will improve performance.

The number of tokens in the prompt and the generated text can be estimated with OpenAI's free Tokenizer tool (the count is only approximate, since Zephyr uses a different tokenizer); see the sketch below for counting tokens with the model's own tokenizer.

  • n_batch: This is used to set the maximum number of prompt tokens to batch together when generating the text. The default value is 512 tokens.

The n_batch parameter should be set carefully. Lowering n_batch can speed up text generation on multithreaded CPUs, but reducing it too much may cause text generation to deteriorate significantly.

The complete list of parameters can be viewed here: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama
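OpenAI's tokenizer is not the same as the one Zephyr uses, so for an exact count you can ask the loaded model itself. A minimal sketch, assuming the llm object created above and llama-cpp-python's tokenize method (which takes bytes and returns token ids):

# Count prompt tokens with the model's own tokenizer.
prompt = "Who is the CEO of Apple?"
tokens = llm.tokenize(prompt.encode("utf-8"))
print(f"The prompt uses {len(tokens)} of the 512 tokens in the context window.")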

Generating Text using the LLM

The following code defines a simple wrapper function for generating text with the LLM.

def generate_text(
    prompt="Who is the CEO of Apple?",
    max_tokens=256,
    temperature=0.1,
    top_p=0.5,
    echo=False,
    stop=["#"],
):
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=stop,
    )
    output_text = output["choices"][0]["text"].strip()
    return output_text


def generate_prompt_from_template(user_input):
    chat_prompt_template = f"""<|im_start|>system
You are a helpful chatbot.<|im_end|>
<|im_start|>user
{user_input}<|im_end|>"""
    return chat_prompt_template


prompt = generate_prompt_from_template(
    "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions."
)

generated_text = generate_text(
    prompt,
    max_tokens=356,
)
print(generated_text)
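The prompt template above follows the ChatML convention. As an alternative to assembling prompt strings by hand, llama-cpp-python also exposes a higher-level chat API. The sketch below assumes a version of the package that supports the chat_format argument and the create_chat_completion method:

# Optional alternative: let llama-cpp-python apply the chat template itself.
llm_chat = Llama(
    model_path="zephyr-7b-beta.Q4_0.gguf",
    n_ctx=512,
    chat_format="chatml",
)

response = llm_chat.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful chatbot."},
        {"role": "user", "content": "Who is the CEO of Apple?"},
    ],
    max_tokens=128,
    temperature=0.1,
)
print(response["choices"][0]["message"]["content"])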

The llm call inside generate_text takes several important parameters:

  • prompt: The input prompt to the model. This text is tokenized and passed to the model.

  • max_tokens: This parameter sets the maximum number of tokens the model can generate and therefore controls the length of the generated text. The default value is 128 tokens.

  • temperature: The token sampling temperature to use, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. Default value is 1.

  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

  • echo: Boolean parameter that controls whether the prompt is returned (echoed) at the beginning of the generated text.

  • stop: A list of strings used to stop text generation. If the model generates any of these strings, generation stops at that point. This is useful for preventing the model from running on and producing unnecessary text.

The llm call returns a dictionary object of the form:

{
  "id": "xxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # text generation id 
  "object": "text_completion",              # object name
  "created": 1679561337,                    # time stamp
  "model": "./models/7B/zephyr-7b-model.gguf",    # model path
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.", # generated text
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,       # Number of tokens present in the prompt
    "completion_tokens": 28,   # Number of tokens present in the generated text
    "total_tokens": 42
  }
}

The generated text can be easily extracted from the dictionary object using output["choices"][0]["text"].
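For longer generations, it can be more pleasant to stream tokens as they are produced rather than waiting for the full completion. A minimal sketch, assuming the stream=True option of the completion call:

# Stream tokens to the console as they are generated.
for chunk in llm(prompt, max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()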

Example text generation using Zephyr-7B

import os
import urllib.request
from llama_cpp import Llama


def download_file(file_link, filename):
    # Checks if the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")


# Downloading the GGUF model from Hugging Face
gguf_model_path = "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_0.gguf"
filename = "zephyr-7b-beta.Q4_0.gguf"

download_file(gguf_model_path, filename)


llm = Llama(model_path="zephyr-7b-beta.Q4_0.gguf", n_ctx=512, n_batch=126)


def generate_text(
    prompt="Who is the CEO of Apple?",
    max_tokens=256,
    temperature=0.1,
    top_p=0.5,
    echo=False,
    stop=["#"],
):
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=stop,
    )
    output_text = output["choices"][0]["text"].strip()
    return output_text


def generate_prompt_from_template(user_input):
    chat_prompt_template = f"""<|im_start|>system
You are a helpful chatbot.<|im_end|>
<|im_start|>user
{user_input}<|im_end|>"""
    return chat_prompt_template


prompt = generate_prompt_from_template(
    "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions."
)

generated_text = generate_text(
    prompt,
    max_tokens=356,
)
print(generated_text)

Generated text:

As the sun began to set over the Pacific Ocean, I found myself standing on the shores of Waikiki Beach in Honolulu, Hawaii. The vibrant colors of the sky painted a breathtaking scene that left me speechless. This was just the beginning of my unforgettable journey through the Aloha State.

Hawaii is a place like no other, where the lush green mountains meet the crystal-clear waters of the ocean. The culture and traditions of Hawaii are deeply rooted in its people, and I was eager to immerse myself in this unique experience.

One of my first stops was the historic Pearl Harbor. As an American, it was a humbling experience to learn about the events that took place here during World War II. The USS Arizona Memorial is a powerful tribute to the men and women who lost their lives during the attack on December 7, 1941.

Next, I headed to the North Shore of Oahu, where I was greeted by the stunning views of Turtle Bay Resort. Here, I had the opportunity to learn about Hawaiian culture through traditional activities such as lei making and ukulele lessons. The locals were incredibly welcoming and eager to share their heritage with me.

One of my favorite experiences in Hawaii was attending a traditional Hawaiian luau. The feast was filled with delicious local cuisine, including poke (raw fish), kalua pig, and poi (a staple food made from taro root). The entertainment included hula dancing, fire knife dancing, and other cultural performances that left me in awe.

The notebook with the example can be viewed here.
The complete code for running the examples can be found on GitHub.

Conclusion

In this blog post, we explored how to use the llama.cpp library in Python with the llama-cpp-python package. Together, these tools enable high-performance, CPU-based execution of LLMs. llama.cpp is under very active development: inference speed keeps improving, and the community regularly adds support for new models. You can also convert your own PyTorch language models to the GGUF format; llama.cpp ships a "convert.py" script that will do that for you.
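The exact script names and options change between llama.cpp releases, so the following is only an illustrative sketch of what a conversion and quantization step typically looks like; check the repository for the current workflow:

# Illustrative only: convert a PyTorch/Hugging Face checkpoint to GGUF,
# then quantize it. Script names and flags vary across llama.cpp versions.
python convert.py path/to/your-model --outfile your-model-f16.gguf
./quantize your-model-f16.gguf your-model-q4_0.gguf q4_0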

The llama.cpp library and llama-cpp-python package provide robust solutions for running LLMs efficiently on CPUs. If you're interested in incorporating LLMs into your applications, I recommend exploring these resources.