Summary

This guide provides step-by-step instructions for installing llama-cpp-python with NVIDIA GPU acceleration on Windows for local LLM development.

Abstract

The guide is designed for developers looking to leverage hardware-accelerated llama-cpp-python on Windows. It outlines the prerequisites, such as installing Visual Studio with C++ CMake tools for Windows, C++ core features, and Windows 10/11 SDK. Additionally, it requires installing the CUDA Toolkit 12.2 from NVIDIA's official website. The installation steps involve opening a new command prompt, activating the Python environment, and running specific commands. The guide also includes troubleshooting tips for common issues, such as CUDA not being configured correctly. The installation can be verified by running a Python code snippet, which should display a BLAS = 1 indicator in the model properties if the installation is correct.

Opinions

The guide aims to simplify the installation process and help developers avoid common pitfalls.
The author shares their personal experience and challenges encountered during their own installation journey.
The guide recommends using the --verbose option during installation for extra assurance that cuBLAS is being used in compilation.
The guide suggests adjusting the n_gpu_layers parameter based on the user's GPU and model.
The guide concludes by encouraging users to dive into local llama development with enhanced performance.
The guide assumes that the user has a basic understanding of Python and command-line interfaces.

Installing llama-cpp-python with NVIDIA GPU Acceleration on Windows: A Short Guide

https://github.com/abetlen/llama-cpp-python

Are you a developer looking to harness the power of hardware-accelerated llama-cpp-python on Windows for local LLM development? Look no further! In this guide, I’ll walk you through the process step by step, helping you avoid the pitfalls I encountered during my own installation journey.

Prerequisites:

1. Visual Studio:

Install Visual Studio with the following components:

C++ CMake tools for Windows
C++ core features
Windows 10/11 SDK

Visual Studio 2022 Enterprise with required components installed.

2. CUDA Toolkit:

Download and install CUDA Toolkit 12.2 from NVIDIA’s official website.
Verify the installation with nvcc --version and nvidia-smi.

Add CUDA_PATH (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2) to your environment variables.
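The checks above can also be scripted. The following is a minimal sketch, not part of the original installation steps: the helper name check_cuda_prerequisites is hypothetical, and it only confirms that nvcc and nvidia-smi are discoverable on PATH and that CUDA_PATH points to an existing folder. It does not validate the toolkit version.

```python
import os
import shutil


def check_cuda_prerequisites(env=None):
    """Return a dict of basic CUDA prerequisite checks (hypothetical helper)."""
    env = os.environ if env is None else env
    path = env.get("PATH", "")
    cuda_path = env.get("CUDA_PATH", "")
    return {
        # True if the executable can be found in the given PATH.
        "nvcc_on_path": shutil.which("nvcc", path=path) is not None,
        "nvidia_smi_on_path": shutil.which("nvidia-smi", path=path) is not None,
        # True if CUDA_PATH is set and points to an existing directory.
        "cuda_path_valid": bool(cuda_path) and os.path.isdir(cuda_path),
    }


if __name__ == "__main__":
    for name, ok in check_cuda_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

If any entry is reported as MISSING, open a new command prompt after fixing the environment variables so the changes are picked up.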

Installation Steps:

Open a new command prompt and activate your Python environment (e.g., using conda). Run the following commands:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Add the --verbose option to the pip install command if you want confirmation in the build output that cuBLAS (CUDA) is being used during compilation.

If CUDA is not configured correctly, llama-cpp-python will still be installed, but without hardware acceleration.
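If you prefer to drive the installation from Python rather than typing the commands, the same steps can be sketched as below. This is an illustrative sketch under assumptions: build_install is a hypothetical helper, and the CMAKE_ARGS value is copied from the commands above. Other llama-cpp-python releases may expect a different CMake option name, so check the project README for the version you install.

```python
import os
import sys


def build_install(verbose=False, base_env=None):
    """Return (env, cmd) mirroring the install commands above (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # Same variables the guide sets with `set` in the command prompt.
    env["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=on"
    env["FORCE_CMAKE"] = "1"
    # Use the current interpreter's pip so the active environment is targeted.
    cmd = [
        sys.executable, "-m", "pip", "install", "llama-cpp-python",
        "--force-reinstall", "--upgrade", "--no-cache-dir",
    ]
    if verbose:
        cmd.append("--verbose")
    return env, cmd
```

To actually run the install, pass the result to subprocess.run(cmd, env=env, check=True) from the activated Python environment. The helper only builds the command, so importing it has no side effects.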

If CUDA is detected but you get a "No CUDA toolset found" error, do the following:

Copy the files from:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\extras\visual_studio_integration\MSBuildExtensions

to (for the Enterprise edition):

C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations

or (for the Community edition):

C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations

copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\extras\visual_studio_integration\MSBuildExtensions" "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations"

(Adjust the paths to match your installation. Writing to folders under Program Files typically requires a command prompt run as administrator.)
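The copy step can also be done from Python, which makes it easier to see exactly which files were transferred. This is a minimal sketch, assuming both folders already exist; copy_build_customizations is a hypothetical helper name, and it copies only the files directly inside the source folder.

```python
import shutil
from pathlib import Path


def copy_build_customizations(src_dir, dst_dir):
    """Copy every file in src_dir into dst_dir and return the copied names."""
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"Source folder not found: {src}")
    if not dst.is_dir():
        raise FileNotFoundError(f"Destination folder not found: {dst}")
    copied = []
    for item in sorted(src.iterdir()):
        if item.is_file():
            shutil.copy2(item, dst / item.name)  # overwrites an existing file
            copied.append(item.name)
    return copied
```

Call it with the MSBuildExtensions folder as src_dir and the BuildCustomizations folder for your Visual Studio edition as dst_dir, using raw strings (r"C:\...") so the backslashes are not treated as escapes.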

Testing

Verify the installation by running the following Python code:

from llama_cpp import Llama
llm = Llama(model_path="model.gguf", n_gpu_layers=30, n_ctx=3584, n_batch=512, verbose=True)
# adjust n_gpu_layers as per your GPU and model
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)

Using Llama-2-7B-Chat with 30 layers offloaded to the GPU.

If the installation is correct, you’ll see a BLAS = 1 indicator in the model properties.
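Because the verbose output is long, the indicator is easy to miss. One option is to paste the log text into a small check like the one below. This is only a sketch: blas_enabled is a hypothetical helper, and the sample string is illustrative, since the exact set of flags printed varies between versions.

```python
import re


def blas_enabled(system_info):
    """Return True if the text contains a 'BLAS = 1' flag."""
    # \b keeps names that merely end in "BLAS" (e.g. "cuBLAS") from matching.
    match = re.search(r"\bBLAS\s*=\s*(\d+)", system_info)
    return match is not None and match.group(1) == "1"


# Illustrative input only; real output contains more flags.
sample = "AVX = 1 | AVX2 = 1 | BLAS = 1 | SSE3 = 1"
print(blas_enabled(sample))
```

A result of False on your own log suggests the package was built without GPU support, in which case rerunning the installation with --verbose is a reasonable next step.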

Conclusion:

By following these steps, you should have successfully installed llama-cpp-python with cuBLAS acceleration on your Windows machine. This guide aims to simplify the process and help you avoid the common pitfalls.

Now you’re ready to dive into local llama development with enhanced performance. Happy GPU Offloading!