LLM By Examples: Utilizing Llama.cpp by Command Line Tools for CLI and Server
Llama.cpp has emerged as a powerful framework for working with language models, providing developers with robust tools and functionalities. This article explores the practical utility of Llama.cpp through command line tools, enabling seamless interaction with the framework for both command line interfaces (CLI) and server applications.

Utilizing Llama.cpp via command line tools offers a unique, flexible approach to model deployment and interaction. Developers can efficiently carry out tasks such as initializing models, querying text generation, and managing input-output processes from the terminal. This process not only streamlines workflows but also enhances productivity by allowing quick iterations. Furthermore, the command line environment encourages automation through scripting, making it ideal for integrating Llama.cpp capabilities into larger software systems or workflows. From generating prompts and fine-tuning parameters to outputting results in various formats, the CLI tools enable a fully customizable experience. In this article, we will delve into practical examples and best practices for leveraging Llama.cpp to its full potential via command line interfaces.
If you are not familiar with the core concepts of Llama.cpp, take a look at the link below first.
Preparation
To demonstrate the command line tools in Llama.cpp, you first need to have Llama.cpp installed. There are multiple options below; pick the one that best fits your environment.
Next, we need to locate a GGUF model to test with. There are tons of GGUF models on Hugging Face. To keep it simple, we will use the Qwen 2.5 7B model below:
Lastly, Llama.cpp requires the GGUF model to be downloaded to the local machine. There are typically three approaches:
1. Use the Hugging Face client to download the model before calling Llama.cpp
pip install huggingface_hub
huggingface-cli download MB20261/QWen2.5-7B-gguf unsloth.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False
2. Use the --hf-repo parameter of Llama.cpp. To enable this feature, you need to install libcurl when building Llama.cpp. As of the time of writing this article, this is not the default. Check the installation links above carefully to ensure a correct build.
3. Convert an existing LLM to GGUF locally
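The first approach can also be done from Python instead of the huggingface-cli command. This is a minimal sketch, assuming the huggingface_hub package installed above; ensure_model is a hypothetical helper name, and it skips the download when the file is already in the target directory.

```python
from pathlib import Path


def ensure_model(repo_id, filename, local_dir):
    """Return the local path of a GGUF file, downloading it only if missing."""
    target = Path(local_dir) / filename
    if target.exists():
        # Already downloaded: no network access needed.
        return target
    # Imported lazily so the function still works offline for cached files.
    from huggingface_hub import hf_hub_download

    downloaded = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir=str(local_dir),
    )
    return Path(downloaded)
```

For example, ensure_model("MB20261/QWen2.5-7B-gguf", "unsloth.Q4_K_M.gguf", "./models") mirrors the command in approach 1.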
Llama-cli
In Llama.cpp, `llama-cli` is a command-line interface tool that provides users with a straightforward way to interact with LLaMA models through terminal commands. It allows users to perform various operations such as model inference, configuration adjustments, and result display without needing a graphical interface. This simplicity makes it particularly useful for developers and researchers who prefer quick and efficient interactions, allowing them to test and deploy models seamlessly from the command line. The `llama-cli` tool is designed to facilitate experimentation and integration of language models into workflows, making it an essential component of the Llama.cpp framework.
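Because `llama-cli` is an ordinary executable, it is easy to drive from a script. The sketch below builds the argument list from the flags used in this article (-m, -p, -n, --n-gpu-layers) and runs it with subprocess. The ./llama-cli path, the function names, and the defaults are assumptions for illustration, and other builds may need additional flags.

```python
import subprocess


def build_llama_cli_command(model_path, prompt, n_predict=512,
                            n_gpu_layers=0, binary="./llama-cli"):
    """Build the argv list for a single, non-interactive llama-cli run."""
    return [
        binary,
        "-m", str(model_path),                # GGUF model file
        "-p", prompt,                         # prompt text
        "-n", str(n_predict),                 # number of tokens to generate
        "--n-gpu-layers", str(n_gpu_layers),  # layers offloaded to the GPU
    ]


def run_llama_cli(model_path, prompt, **kwargs):
    """Run llama-cli and return the CompletedProcess (stdout and stderr captured)."""
    cmd = build_llama_cli_command(model_path, prompt, **kwargs)
    return subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.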
$ huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_0.gguf --local-dir ~/llama-cpp/models --local-dir-use-symlinks False
/work/GitHubsEnv/MEAIDev/poc-ai-tool-llama-cpp/huggingface-cli/lib/python3.10/site-packages/huggingface_hub/commands/download.py:139: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
warnings.warn(
Downloading 'qwen2.5-0.5b-instruct-q4_0.gguf' to '/home/wsluser/llama-cpp/models/.cache/huggingface/download/qwen2.5-0.5b-instruct-q4_0.gguf.7671c0c304e6ce5a7fc577bcb12aba01e2c155cc2efd29b2213c95b18edaf6ed.incomplete'
qwen2.5-0.5b-instruct-q4_0.gguf: 100%|██████████| 429M/429M [00:13<00:00, 30.7MB/s]
Download complete. Moving file to /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf
/home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf
$
$ docker run --gpus all -v /home/wsluser/llama-cpp/models:/models mb20261/llama.cpp:llama-cli-cuda121-ubuntu2204-v1 -m /models/qwen2.5-0.5b-instruct-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3955 (e94a138d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 5: general.size_label str = 630M
llama_model_loader: - kv 6: qwen2.block_count u32 = 24
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 2
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_0: 169 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 896
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 14
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 128
llm_load_print_meta: n_embd_v_gqa = 128
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4864
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 630.17 M
llm_load_print_meta: model size = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.26 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/25 layers to GPU
llm_load_tensors: CPU buffer size = 403.20 MiB
llm_load_tensors: CUDA0 buffer size = 8.01 MiB
..................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 368.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 16.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 983.43 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 65.76 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 326
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Building a website can be done in 10 simple steps:sampler seed: 3727715831
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 32768, n_batch = 2048, n_predict = 512, n_keep = 0
1. Choose a website builder 2. Choose a hosting service 3. Choose a domain 4. Choose a content 5. Set up your website 6. Design your website 7. Add a few items 8. Test your website 9. Launch your website 10. Manage your website
Based on the above information, what is the most likely function of the website builder? The most likely function of the website builder is to create a website. Based on the information provided, the website builder is the primary tool used to create a website, as indicated by the first and second steps listed in the list provided. The other functions of a website builder are typically associated with creating, managing, and managing websites, but are not listed as the primary function in this context. The other steps in the list suggest that a website builder is involved in the design and development phase of creating a website, but it is not the primary function of the website builder itself. Therefore, the most likely function of the website builder is to create a website. However, without a specific list of functions, it is not possible to provide a more specific answer. The most likely function of the website builder is to create a website. Please note that this is an assumption based on the context provided and may not be accurate in all cases. It is important to verify the functions of the specific website builder being used in a particular context.
The other functions of a website builder are typically associated with the design and development phase of creating a website, but are not listed as the primary function in this context. The other steps in the list suggest that a website builder is involved in the design and development phase of creating a website, but are not listed as the primary function of the specific website builder being used in a particular context. Therefore, the most likely function of the website builder is to create a website. However, without a specific list of functions, it is not possible to provide a more specific answer. The other functions of a website builder are typically associated with the design and development phase of creating a website, but are not listed as the primary function of the specific website builder being used in a particular context. Therefore, the most likely function of the website builder is to create a website. However, without a specific list of functions, it is not possible to provide a more specific answer. The other functions of a website builder are typically associated with the design and development phase of creating a website, but are not listed as the primary function of the specific website
llama_perf_sampler_print: sampling time = 87.35 ms / 525 runs ( 0.17 ms per token, 6010.30 tokens per second)
llama_perf_context_print: load time = 3175.65 ms
llama_perf_context_print: prompt eval time = 83.12 ms / 13 tokens ( 6.39 ms per token, 156.40 tokens per second)
llama_perf_context_print: eval time = 10874.08 ms / 511 runs ( 21.28 ms per token, 46.99 tokens per second)
llama_perf_context_print: total time = 11282.11 ms / 524 tokens
$
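The KV buffer sizes in the log above can be reproduced from the metadata printed earlier in the same run. This is a rough check, assuming an f16 cache (2 bytes per element, as the log reports) and a cache that is split evenly across layers; the function name is illustrative.

```python
MIB = 1024 ** 2


def kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Return (K size, V size) in MiB for a full-context KV cache."""
    k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem
    return k_bytes / MIB, v_bytes / MIB


# Values taken from the log: n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa.
k_mib, v_mib = kv_cache_mib(n_ctx=32768, n_layer=24, n_embd_k_gqa=128, n_embd_v_gqa=128)
total_mib = k_mib + v_mib

# With --n-gpu-layers 1, one layer's share lives on the GPU, the rest on the host.
n_layer, gpu_layers = 24, 1
gpu_mib = total_mib / n_layer * gpu_layers
host_mib = total_mib - gpu_mib

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
print(f"GPU share = {gpu_mib:.2f} MiB, host share = {host_mib:.2f} MiB")
```

The results match the logged values: 192 MiB each for K and V, 384 MiB in total, with 16 MiB on CUDA0 and 368 MiB on the host. Most of that memory comes from n_ctx defaulting to the model's full 32768-token training context.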
$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 5: general.size_label str = 630M
llama_model_loader: - kv 6: qwen2.block_count u32 = 24
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 2
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_0: 169 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 896
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 14
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 128
llm_load_print_meta: n_embd_v_gqa = 128
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4864
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 630.17 M
llm_load_print_meta: model size = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.26 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/25 layers to GPU
llm_load_tensors: CPU buffer size = 403.20 MiB
llm_load_tensors: CUDA0 buffer size = 8.01 MiB
..................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 368.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 16.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 983.43 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 65.76 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 326
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 195092334
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 32768, n_batch = 2048, n_predict = 512, n_keep = 0
Building a website can be done in 10 simple steps: setting up a hosting plan, installing a database, designing a website, creating content, linking content to a page, creating a user account, creating a password, uploading images and videos, and managing users and accounts. Each step is crucial, but what’s missing is a process to help you get the most out of the process. This is where the Ultimate Website Builder comes in.
What is the Ultimate Website Builder?
The Ultimate Website Builder is a website builder designed to streamline the process of building a website. It is a tool that takes the hassle out of the process and gives users a step-by-step guide to help them create a website. The Ultimate Website Builder is a tool that makes website building a breeze, reducing the time and effort required to create a website. With the Ultimate Website Builder, users can create a website in as little as 5 minutes.
What Does the Ultimate Website Builder Do?
The Ultimate Website Builder is a website builder designed to simplify the process of building a website. The builder is a tool that takes the hassle out of the process and gives users a step-by-step guide to help them create a website. The builder is a tool that makes website building a breeze, reducing the time and effort required to create a website.
What’s the Ultimate Website Builder Good for?
The Ultimate Website Builder is a tool that makes website building a breeze, reducing the time and effort required to create a website.
Is there a limit to what can be built?
The Ultimate Website Builder is a tool that makes website building a breeze, reducing the time and effort required to create a website.
Does the Ultimate Website Builder work with all platforms?
The Ultimate Website Builder is a tool that works with all platforms.
Does the Ultimate Website Builder help with the user’s login or registration?
The Ultimate Website Builder is a tool that helps with the user's login or registration.
Does the Ultimate Website Builder help with the user's password or login or registration?
The Ultimate Website Builder is a tool that helps with the user's password or login or registration.
Does the Ultimate Website Builder help with the user's user information or login or registration?
The Ultimate Website Builder is a tool that helps with the user's user information or login or registration.
Does the Ultimate Website Builder help with the user's images or login or registration?
The Ultimate Website Builder is a tool that helps with the user's images or login or registration.
Does the Ultimate Website Builder help with the user's videos or login or registration?
The Ultimate Website Builder is a tool that helps with the user's videos or login or
llama_perf_sampler_print: sampling time = 89.21 ms / 525 runs ( 0.17 ms per token, 5884.92 tokens per second)
llama_perf_context_print: load time = 2141.07 ms
llama_perf_context_print: prompt eval time = 83.81 ms / 13 tokens ( 6.45 ms per token, 155.11 tokens per second)
llama_perf_context_print: eval time = 10816.66 ms / 511 runs ( 21.17 ms per token, 47.24 tokens per second)
llama_perf_context_print: total time = 11229.56 ms / 524 tokens
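The llama_perf_context_print lines that close each run are regular enough to parse when you want to compare runs. This is a minimal sketch, assuming the line layout printed by the build shown here (other builds may format these lines differently); parse_perf is a hypothetical helper name.

```python
import re

# Matches e.g. "llama_perf_context_print: eval time = 10816.66 ms / 511 runs ( ... )"
PERF_RE = re.compile(
    r"llama_perf_context_print:\s*(prompt eval|eval) time\s*=\s*"
    r"([0-9.]+)\s*ms\s*/\s*(\d+)\s*(?:tokens|runs)"
)


def parse_perf(log_text):
    """Return {phase: tokens_per_second} for the timing lines found in log_text."""
    rates = {}
    for line in log_text.splitlines():
        match = PERF_RE.search(line)
        if match:
            phase = match.group(1)
            millis = float(match.group(2))
            count = int(match.group(3))
            rates[phase] = count / (millis / 1000.0)
    return rates


# Two lines copied from the native run above.
sample = """\
llama_perf_context_print: prompt eval time = 83.81 ms / 13 tokens ( 6.45 ms per token, 155.11 tokens per second)
llama_perf_context_print: eval time = 10816.66 ms / 511 runs ( 21.17 ms per token, 47.24 tokens per second)
"""

for phase, rate in parse_perf(sample).items():
    print(f"{phase}: {rate:.2f} tokens/s")
```

The recomputed rates agree with the figures llama.cpp prints itself, which makes the helper a quick sanity check when comparing settings such as --n-gpu-layers.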
Llama-server
In Llama.cpp, `llama-server` is a command-line tool designed to provide a server interface for interacting with LLaMA models. It allows users to deploy LLaMA-based applications in a server environment, enabling access to the models via API calls. This facilitates straightforward integration into various applications, making it easier for developers to build and manage AI-driven services.
The `llama-server` tool offers functionalities such as customizable configurations, allowing users to set parameters according to their needs, and support for multiple concurrent connections. This means that it can handle requests from different clients simultaneously, making it suitable for real-world applications where multiple users may interact with the model at the same time. Additionally, `llama-server` enhances the user experience by providing easy-to-use commands, ensuring that developers can quickly get their LLaMA models up and running without extensive setup. It serves as a robust solution for serving LLaMA models in a wide range of applications.
$ ./llama-server -m /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf -ngl 28 -fa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11
main: loading model
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 5: general.size_label str = 630M
llama_model_loader: - kv 6: qwen2.block_count u32 = 24
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 2
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_0: 169 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 896
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 14
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 128
llm_load_print_meta: n_embd_v_gqa = 128
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4864
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 630.17 M
llm_load_print_meta: model size = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.26 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: CPU buffer size = 73.03 MiB
llm_load_tensors: CUDA0 buffer size = 330.19 MiB
..................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 298.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 65.76 MiB
llama_new_context_with_model: graph nodes = 751
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
Alternatively, we can start llama-server from a Docker image in CPU-only mode:
$ docker run -v /home/wsluser/llama-cpp/models:/models mb20261/llama.cpp:llama-srv-cpu-ubuntu2204-v1 -m /models/qwen2.5-0.5b-instruct-q4_0.gguf --port 8080 --host 0.0.0.0
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 3955 (e94a138d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 5: general.size_label str = 630M
llama_model_loader: - kv 6: qwen2.block_count u32 = 24
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 2
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_0: 169 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 896
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 14
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 128
llm_load_print_meta: n_embd_v_gqa = 128
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4864
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 630.17 M
llm_load_print_meta: model size = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.13 MiB
llm_load_tensors: CPU buffer size = 403.20 MiB
..................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CPU output buffer size = 1.16 MiB
llama_new_context_with_model: CPU compute buffer size = 967.01 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle
request: GET /health 127.0.0.1 200
Then we can call the OpenAI-compatible API (no key required), which keeps our client application portable.
$ curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-no-key-required" \
-d '{
"model": "qwen",
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are a helpful assistant."
}
]
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "tell me something about michael jordan"
}
]
}
]
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"Here is a brief biography of Michael Jordan:\n\nMichael Jordan was a professional basketball player from the United States who played for the Chicago Bulls during the 1993-1994 NBA season. He is widely regarded as one of the greatest basketball players of all time and is widely considered one of the greatest basketball players ever to play the game.\n\nJordan is known for his incredible athleticism, skill, and leadership on the court, as well as his exceptional ability to rebound from the field and in the gym. He is also recognized for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.\n\nJordan was a skilled player, with a career-high rebound rate of 50% and a career-high percentage of 27% in scoring, and he was also known for his ability to rebound from the bench and in the gym.\n\nJordan is also known for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.\n\nJordan is known for his leadership on the court, and he is also known for his ability to rebound from the bench and in the gym.\n\nJordan is also known for his contributions to the development of basketball in the United States, and he is also known for his contributions to the development of basketball in the United States.","role":"assistant"}}],"created":1729560781,"model":"qwen","object":"chat.completion","usage":{"completion_tokens":273,"prompt_tokens":26,"total_tokens":299},"id":"chatcmpl-4yJgHpf2OIHDrNycWPgvTvbswhznZv35"}
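The response above is plain JSON in the OpenAI chat-completion shape, so the reply text and token counts can be pulled out with a few lines of Python. The following is a minimal sketch: `summarize_completion` is a hypothetical helper name, and `sample` is the response above with the long `content` string abbreviated by hand.

```python
import json

# Response from the curl call above; the "content" text is abbreviated here.
sample = (
    '{"choices":[{"finish_reason":"stop","index":0,'
    '"message":{"content":"Here is a brief biography of Michael Jordan: (abbreviated)",'
    '"role":"assistant"}}],'
    '"created":1729560781,"model":"qwen","object":"chat.completion",'
    '"usage":{"completion_tokens":273,"prompt_tokens":26,"total_tokens":299}}'
)


def summarize_completion(raw: str) -> dict:
    """Extract the reply text, finish reason, and token usage from a response body."""
    data = json.loads(raw)
    choice = data["choices"][0]  # this response carries a single choice
    return {
        "reply": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": data["usage"]["prompt_tokens"],
        "completion_tokens": data["usage"]["completion_tokens"],
        "total_tokens": data["usage"]["total_tokens"],
    }


summary = summarize_completion(sample)
print(summary["finish_reason"], summary["total_tokens"])
```

A `finish_reason` of `stop` indicates the model ended the reply on its own rather than hitting a token limit.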
Or call it from Python:
$ python
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import openai
>>> client = openai.OpenAI(
... base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
... api_key = "sk-no-key-required"
... )
>>>
>>> completion = client.chat.completions.create(
... model="qwen",
... messages=[
... {"role": "system", "content": "You are a helpful assistant."},
... {"role": "user", "content": "tell me something about michael jordan"}
... ]
... )
>>> print(completion.choices[0].message.content)
Michael Jordan is a renowned basketball player who is widely regarded as one of the greatest players in basketball history. He is the first and currently the most successful of all time, having won five NBA championships and four Olympic gold medals. Jordan is considered one of the greatest of all time and was one of the best players in the world. He is also known for his innovative skills and his ability to influence his opponents in the sport.
>>> exit()
Tuning the Llama.cpp command to fit your environment
A GPU-enabled Llama.cpp build works for CPU-only, CPU+GPU, and GPU-only usage. Different models and quantizations produce different file sizes, so there is no single answer for the best command line arguments. You need to tune parameters such as the number of layers offloaded to the GPU, the context length, and the batch size until the model fits your hardware.
Now let’s walk through the common use cases.
Case Study: Load a 7B GGUF model into CPU + 8GB GPU
Let’s first download the model. We will use Qwen 2.5 7B GGUF model:
$ huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF --include "qwen2.5-7b-instruct-q5_k_m*.gguf" --local-dir . --local-dir-use-symlinks False
/work/GitHubsEnv/MEAIDev/poc-ai-tool-llama-cpp/huggingface-cli/lib/python3.10/site-packages/huggingface_hub/commands/download.py:139: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
warnings.warn(
Fetching 2 files: 0%| | 0/2 [00:00<?, ?it/s]Downloading 'qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf' to '.cache/huggingface/download/qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf.beba9d4f2f5a1fe7d144dcae332e68b52c26705c5310dece2e5d1997e091e134.incomplete'
Downloading 'qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf' to '.cache/huggingface/download/qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf.42f6693004793ee6cf1b2b723f0273b10f86a3bb2a949bd9128d4cda5fb866cd.incomplete'
(…)5-7b-instruct-q5_k_m-00002-of-00002.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.45G/1.45G [01:10<00:00, 20.7MB/s]
Download complete. Moving file to qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf██████████████████████████████████████████████████████████████| 1.45G/1.45G [01:10<00:00, 12.8MB/s]
(…)5-7b-instruct-q5_k_m-00001-of-00002.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 3.99G/3.99G [01:52<00:00, 35.4MB/s]
Download complete. Moving file to qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf
Fetching 2 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:53<00:00, 56.59s/it]
/work/GitHubs/MEAIDev/poc-ai-tool-llama-cpp/llama.cpp/build/bin
Large model files are normally split into multiple segments because of file upload limits. The segments share a prefix, with a suffix indicating each segment's index, for example qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf and qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf. After all files are downloaded with the command above, we merge them into one GGUF model file to feed into the Llama.cpp command.
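Before merging, it can be worth confirming that every segment actually arrived. The sketch below assumes the `-NNNNN-of-NNNNN.gguf` suffix pattern seen in the file names above; `missing_shards` is a hypothetical helper, not part of Llama.cpp.

```python
import re
from collections import defaultdict

# Matches names such as "qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf".
SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def missing_shards(filenames):
    """Return {prefix: [missing segment indices]} for split GGUF file names."""
    seen = defaultdict(set)
    totals = {}
    for name in filenames:
        match = SHARD_RE.match(name)
        if match is None:
            continue  # not a split GGUF segment, ignore it
        prefix = match.group("prefix")
        seen[prefix].add(int(match.group("index")))
        totals[prefix] = int(match.group("total"))
    return {
        prefix: sorted(set(range(1, totals[prefix] + 1)) - indices)
        for prefix, indices in seen.items()
    }


files = [
    "qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf",
    "qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf",
]
print(missing_shards(files))
```

An empty list for a prefix means all of its segments are present. In practice the file names could come from `os.listdir(".")` in the download directory.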
$ ./llama-gguf-split --merge qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf ~/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf
gguf_merge: qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf -> /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf
gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf done
gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf done
gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf done
gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf done
gguf_merge: /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf merged from 2 split with 339 tensors.
Now let’s load the model for conversation.
$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
-co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
-fa -ngl 80 -n 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct
llama_model_loader: - kv 5: general.size_label str = 7.6B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 17
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: split.no u16 = 0
llama_model_loader: - kv 27: split.count u16 = 0
llama_model_loader: - kv 28: split.tensors.count i32 = 339
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q5_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name = qwen2.5-7b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU buffer size = 357.33 MiB
llm_load_tensors: CUDA0 buffer size = 4829.59 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7168.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
common_init_from_params: failed to create context with model '/home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf'
main: error: unable to load model
Failed! The error is out of memory: the log shows that allocating the 7168.00 MiB KV cache on the GPU failed. To ease the memory pressure, you can adjust a few settings in your command:
- Reduce `-ngl` (number of GPU layers): Since your GPU has limited VRAM, try setting `-ngl` to a lower value. The `-ngl` option specifies how many layers are offloaded to the GPU. A safe starting point is around 10 to 12 for an RTX 2070, but you may need to experiment to find the best number for your setup.
- Reduce context size (`-c` or `--ctx-size`): If your application allows, set a smaller context size. The default here is 0, which takes the context length from the model (131072 in the log above), so smaller values of `--ctx-size` shrink the KV cache.
- Adjust batch size (`-b` or `--batch-size`): You can reduce the logical maximum batch size with `-b` (default 2048) and the physical maximum batch size with `-ub` or `--ubatch-size` (default 512) to lower memory usage during inference.
- Avoid Flash Attention: If it is not strictly needed, you can also disable Flash Attention by omitting the `-fa` option.
Let’s make the following adjustments:
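To see why the context size matters so much, the KV cache size can be estimated from values printed in the logs (`n_layer`, `n_ctx`, `n_embd_k_gqa`, `n_embd_v_gqa`). The following is a minimal sketch, assuming an f16 cache (2 bytes per element) as reported in the log lines; `kv_cache_mib` is a hypothetical helper name, not a Llama.cpp function.

```python
def kv_cache_mib(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate KV cache size in MiB; bytes_per_elem=2 assumes f16 K and V."""
    k_bytes = n_layer * n_ctx * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_layer * n_ctx * n_embd_v_gqa * bytes_per_elem
    return (k_bytes + v_bytes) / (1024 * 1024)


# Qwen 2.5 7B values from the log: n_layer=28, n_embd_k_gqa=n_embd_v_gqa=512.
print(kv_cache_mib(28, 131072, 512, 512))  # full 131072-token context
print(kv_cache_mib(28, 8192, 512, 512))    # a smaller, hypothetical -c 8192
```

With the full 131072-token context this gives 7168 MiB, the same figure the failed allocation reports, while an 8192-token context would need 448 MiB by the same formula.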
- -ngl 12: This allocates 12 layers on the GPU instead of 80.
- -b 128: This reduces the logical batch size to help lower memory use.
- -ub 32: This limits the physical batch size even further.
You might need to experiment a bit with the values for `-ngl`, `-b`, and `-ub` until you find a configuration that works without running out of memory. If you still encounter issues, consider lowering `-ngl` even further.
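As a starting point for that experimentation, a rough upper bound for `-ngl` can be computed from figures in this article's logs. This is only a heuristic sketch: it assumes each offloaded layer costs its share of the weights plus its share of the KV cache, treats the CUDA compute buffer as a fixed reserve, and ignores the non-repeating layers. `max_gpu_layers` is a hypothetical helper, and the real limit depends on the build, batch size, and whatever else is using the GPU.

```python
import math


def max_gpu_layers(free_vram_mib, weights_per_layer_mib, kv_per_layer_mib,
                   reserve_mib, n_layer):
    """Rough upper bound on -ngl: layers whose weights and KV cache fit in VRAM."""
    budget = free_vram_mib - reserve_mib
    if budget <= 0:
        return 0
    per_layer = weights_per_layer_mib + kv_per_layer_mib
    return min(n_layer, math.floor(budget / per_layer))


# Figures taken from the logs in this article (7B model, 131072-token context):
weights_per_layer = 1895.93 / 12  # CUDA0 buffer size with 12 layers offloaded
kv_per_layer = 7168.00 / 28       # KV self size spread over 28 layers
estimate = max_gpu_layers(
    free_vram_mib=7089,           # free VRAM reported for the RTX 2070
    weights_per_layer_mib=weights_per_layer,
    kv_per_layer_mib=kv_per_layer,
    reserve_mib=721.33,           # CUDA0 compute buffer size
    n_layer=28,
)
print(estimate)
```

Treat the result as a ceiling to test downward from, not as a guaranteed working value.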
So, let’s try the new command:
$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
-co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
-ngl 12 -n 512 -b 128 -ub 32
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct
llama_model_loader: - kv 5: general.size_label str = 7.6B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 17
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: split.no u16 = 0
llama_model_loader: - kv 27: split.count u16 = 0
llama_model_loader: - kv 28: split.tensors.count i32 = 339
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q5_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name = qwen2.5-7b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloaded 12/29 layers to GPU
llm_load_tensors: CPU buffer size = 5186.92 MiB
llm_load_tensors: CUDA0 buffer size = 1895.93 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 128
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4096.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB
llama_new_context_with_model: KV self size = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 721.33 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 16.44 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 228
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampler seed: 48583840
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
> where is new york?
New York is a state located in the northeastern region of the United States. Its capital city is Albany, but the most populous city in the state, and one of the most populous cities in the United States, is New York City, often simply called "New York." New York City is a global metropolis known for its culture, finance, arts, and business. It is located along the Atlantic coast and includes several counties, with Manhattan being one of the most well-known boroughs.
>
It works now.
Case Study: Load a 7B GGUF model on CPU only
Let’s continue from the case above and reuse the same GGUF model. This time we want to run the whole model on the CPU.
To do so, we make the following adjustments:
- Remove the GPU options: the `-ngl` option and other GPU-related options (such as `-fa`, if present) are omitted. Without `-ngl`, this build offloads 0 layers to the GPU, so the model weights stay in CPU memory.
- Add `--cpu-strict 1`: this enables strict CPU placement for the worker threads (the default is 0). It controls how threads are placed on CPU cores; it is the omitted `-ngl`, not this flag, that keeps the layers off the GPU.
- Adjust `-b 128` (the batch size) as needed based on your performance and memory constraints.
$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
-co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
--cpu-strict 1 -n 512 -b 128
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct
llama_model_loader: - kv 5: general.size_label str = 7.6B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 17
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: split.no u16 = 0
llama_model_loader: - kv 27: split.count u16 = 0
llama_model_loader: - kv 28: split.tensors.count i32 = 339
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q5_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name = qwen2.5-7b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors: CPU buffer size = 5186.92 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 128
llama_new_context_with_model: n_ubatch = 128
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 7168.00 MiB
llama_new_context_with_model: KV self size = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2117.26 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 65.75 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 396
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampler seed: 999110778
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
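>
It works too. How can we tell that the model is loaded on the CPU only? Look at the `llm_load_tensors` lines in the output. This run reports `offloaded 0/29 layers to GPU` and lists only `CPU buffer size = 5186.92 MiB`. The earlier GPU run reported `offloaded 12/29 layers to GPU` and an additional `CUDA0 buffer size = 1895.93 MiB`. Because this is a CUDA build, the log still shows a `CUDA0 compute buffer` even though no layers are offloaded.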
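If you want to automate this check, a minimal sketch is to scan the captured log for the `offloaded N/M layers to GPU` line. The helper names `offloaded_layers` and `is_cpu_only` are hypothetical, and the pattern assumes the log wording of the build shown above (3951); other llama.cpp versions may phrase this line differently.

```python
import re

# Summary line printed while loading tensors in the build shown above,
# e.g. "llm_load_tensors: offloaded 12/29 layers to GPU".
# Assumption: other llama.cpp versions may word this line differently.
OFFLOAD_RE = re.compile(r"offloaded\s+(\d+)/(\d+)\s+layers to GPU")


def offloaded_layers(log_text):
    """Return (offloaded, total) parsed from a load log, or None if absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_cpu_only(log_text):
    """True when the log reports that zero layers were offloaded to the GPU."""
    result = offloaded_layers(log_text)
    return result is not None and result[0] == 0


# Lines copied from the two runs in this article.
cpu_log = "llm_load_tensors: offloaded 0/29 layers to GPU"
gpu_log = "llm_load_tensors: offloaded 12/29 layers to GPU"

print(is_cpu_only(cpu_log))  # True
print(is_cpu_only(gpu_log))  # False
```

In practice you would pass the text captured from the llama-cli output instead of the two sample strings.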

More considerations for CPU-only usage:
- If you want more control over CPU threading, you can also pass `--threads N` (short form `-t N`) to set the number of threads used during generation. The default is -1, which lets llama.cpp choose; in the run above it picked 6 threads on a machine that reports 12 (`n_threads = 6 (n_threads_batch = 6) / 12`).
- If you experience memory issues, you can keep experimenting with a smaller `-b` to reduce resource usage.
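The log above also shows where most of the memory goes besides the weights: with `n_ctx = 131072` the KV cache alone takes 7168 MiB. The sketch below reproduces that figure from the logged values (`n_layer = 28`, `n_embd_k_gqa = 512`, `n_embd_v_gqa = 512`, f16 cache). The formula is an assumption that happens to match the logged numbers, the function name `kv_cache_mib` is hypothetical, and the 8192 context is an arbitrary illustration of what a smaller context (set with `-c` / `--ctx-size`) would need.

```python
def kv_cache_mib(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate the KV cache size in MiB.

    Assumed formula: one K row and one V row per layer per context position,
    stored with bytes_per_elem bytes per value (2 for f16).
    """
    k_bytes = n_layer * n_ctx * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_layer * n_ctx * n_embd_v_gqa * bytes_per_elem
    return (k_bytes + v_bytes) / (1024 * 1024)


# Values taken from the llm_load_print_meta lines in the log above.
full_ctx = kv_cache_mib(n_layer=28, n_ctx=131072, n_embd_k_gqa=512, n_embd_v_gqa=512)

# Hypothetical smaller context, for comparison only.
small_ctx = kv_cache_mib(n_layer=28, n_ctx=8192, n_embd_k_gqa=512, n_embd_v_gqa=512)

print(full_ctx)   # 7168.0, matching "KV self size = 7168.00 MiB"
print(small_ctx)  # 448.0
```

Under these assumptions, lowering the context size is likely to save far more memory than lowering `-b`, so it is worth trying first when the full 131072-token context is not needed.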



