MB20261


Abstract

he models via API calls. This facilitates straightforward integration into various applications, making it easier for developers to build and manage AI-driven services.</p><p id="b812">The llama-server tool offers functionalities such as customizable configurations, allowing users to set parameters according to their needs, and support for multiple concurrent connections. This means that it can handle requests from different clients simultaneously, making it suitable for real-world applications where multiple users may interact with the model at the same time. Additionally, llama-server enhances the user experience by providing easy-to-use commands, ensuring that developers can quickly get their LLaMA models up and running without extensive setup. It serves as a robust solution for serving LLaMA models in a wide range of applications.</p><div id="a688"><pre>$ ./llama-server -m /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf -ngl 28 -fa <span class="hljs-section">ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no</span> <span class="hljs-section">ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no</span> <span class="hljs-section">ggml_cuda_init: found 1 CUDA devices:</span> Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes <span class="hljs-section">build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu</span> system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

<span class="hljs-section">system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |</span>

<span class="hljs-section">main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11</span> <span class="hljs-section">main: loading model</span> <span class="hljs-section">llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free</span> <span class="hljs-section">llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))</span> <span class="hljs-section">llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.</span> <span class="hljs-section">llama_model_loader: - kv 0: general.architecture str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 1: general.type str = model</span> <span class="hljs-section">llama_model_loader: - kv 2: general.name str = qwen2.5-0.5b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 3: general.version str = v0.1</span> <span class="hljs-section">llama_model_loader: - kv 4: general.finetune str = qwen2.5-0.5b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 5: general.size_label str = 630M</span> <span class="hljs-section">llama_model_loader: - kv 6: qwen2.block_count u32 = 24</span> <span class="hljs-section">llama_model_loader: - kv 7: qwen2.context_length u32 = 32768</span> <span class="hljs-section">llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896</span> <span class="hljs-section">llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864</span> <span class="hljs-section">llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14</span> <span class="hljs-section">llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2</span> <span class="hljs-section">llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000</span> <span class="hljs-section">llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 
0.000001</span> <span class="hljs-section">llama_model_loader: - kv 14: general.file_type u32 = 2</span> <span class="hljs-section">llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2</span> <span class="hljs-section">llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "", "%", "&amp;", "'", ...</span> <span class="hljs-section">llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</span> <span class="hljs-section">llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...</span> <span class="hljs-section">llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645</span> <span class="hljs-section">llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false</span> <span class="hljs-section">llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '&lt;|im_start|&gt;...</span> <span class="hljs-section">llama_model_loader: - kv 25: general.quantization_version u32 = 2</span> <span class="hljs-section">llama_model_loader: - type f32: 121 tensors</span> <span class="hljs-section">llama_model_loader: - type q4_0: 169 tensors</span> <span class="hljs-section">llama_model_loader: - type q8_0: 1 tensors</span> <span class="hljs-section">llm_load_vocab: special tokens cache size = 22</span> <span class="hljs-section">llm_load_vocab: token to piece cache size = 0.9310 MB</span> <span class="hljs-section">llm_load_print_meta: format = GGUF V3 (latest)</span> <span class="hljs-section">llm_load_print_meta: arch = qwen2</span> <span class="hljs-section">llm_load_print_meta: vocab type = 
BPE</span> <span class="hljs-section">llm_load_print_meta: n_vocab = 151936</span> <span class="hljs-section">llm_load_print_meta: n_merges = 151387</span> <span class="hljs-section">llm_load_print_meta: vocab_only = 0</span> <span class="hljs-section">llm_load_print_meta: n_ctx_train = 32768</span> <span class="hljs-section">llm_load_print_meta: n_embd = 896</span> <span class="hljs-section">llm_load_print_meta: n_layer = 24</span> <span class="hljs-section">llm_load_print_meta: n_head = 14</span> <span class="hljs-section">llm_load_print_meta: n_head_kv = 2</span> <span class="hljs-section">llm_load_print_meta: n_rot = 64</span> <span class="hljs-section">llm_load_print_meta: n_swa = 0</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_k = 64</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_v = 64</span> <span class="hljs-section">llm_load_print_meta: n_gqa = 7</span> <span class="hljs-section">llm_load_print_meta: n_embd_k_gqa = 128</span> <span class="hljs-section">llm_load_print_meta: n_embd_v_gqa = 128</span> <span class="hljs-section">llm_load_print_meta: f_norm_eps = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_norm_rms_eps = 1.0e-06</span> <span class="hljs-section">llm_load_print_meta: f_clamp_kqv = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_max_alibi_bias = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_logit_scale = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: n_ff = 4864</span> <span class="hljs-section">llm_load_print_meta: n_expert = 0</span> <span class="hljs-section">llm_load_print_meta: n_expert_used = 0</span> <span class="hljs-section">llm_load_print_meta: causal attn = 1</span> <span class="hljs-section">llm_load_print_meta: pooling type = 0</span> <span class="hljs-section">llm_load_print_meta: rope type = 2</span> <span class="hljs-section">llm_load_print_meta: rope scaling = linear</span> <span 
class="hljs-section">llm_load_print_meta: freq_base_train = 1000000.0</span> <span class="hljs-section">llm_load_print_meta: freq_scale_train = 1</span> <span class="hljs-section">llm_load_print_meta: n_ctx_orig_yarn = 32768</span> <span class="hljs-section">llm_load_print_meta: rope_finetuned = unknown</span> <span class="hljs-section">llm_load_print_meta: ssm_d_conv = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_inner = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_state = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_rank = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_b_c_rms = 0</span> <span class="hljs-section">llm_load_print_meta: model type = 1B</span> <span class="hljs-section">llm_load_print_meta: model ftype = Q4_0</span> <span class="hljs-section">llm_load_print_meta: model params = 630.17 M</span> <span class="hljs-section">llm_load_print_meta: model size = 403.20 MiB (5.37 BPW)</span> <span class="hljs-section">llm_load_print_meta: general.name = qwen2.5-0.5b-instruct</span> <span class="hljs-section">llm_load_print_meta: BOS token = 151643 '&lt;|endoftext|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOS token = 151645 '&lt;|im_end|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOT token = 151645 '&lt;|im_end|&gt;'</span> <span class="hljs-section">llm_load_print_meta: PAD token = 151643 '&lt;|endoftext|&gt;'</span> <span class="hljs-section">llm_load_print_meta: LF token = 148848 'ÄĬ'</span> <span class="hljs-section">llm_load_print_meta: FIM PRE token = 151659 '&lt;|fim_prefix|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM SUF token = 151661 '&lt;|fim_suffix|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM MID token = 151660 '&lt;|fim_middle|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM PAD token = 151662 '&lt;|fim_pad|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM REP token = 
151663 '&lt;|repo_name|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM SEP token = 151664 '&lt;|file_sep|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151643 '&lt;|endoftext|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151645 '&lt;|im_end|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151662 '&lt;|fim_pad|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151663 '&lt;|repo_name|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151664 '&lt;|file_sep|&gt;'</span> <span class="hljs-section">llm_load_print_meta: max token length = 256</span> <span class="hljs-section">llm_load_tensors: ggml ctx size = 0.26 MiB</span> <span class="hljs-section">llm_load_tensors: offloading 24 repeating layers to GPU</span> <span class="hljs-section">llm_load_tensors: offloading non-repeating layers to GPU</span> <span class="hljs-section">llm_load_tensors: offloaded 25/25 layers to GPU</span> <span class="hljs-section">llm_load_tensors: CPU buffer size = 73.03 MiB</span> <span class="hljs-section">llm_load_tensors: CUDA0 buffer size = 330.19 MiB</span> .................................................. 
<span class="hljs-section">llama_new_context_with_model: n_ctx = 32768</span> <span class="hljs-section">llama_new_context_with_model: n_batch = 2048</span> <span class="hljs-section">llama_new_context_with_model: n_ubatch = 512</span> <span class="hljs-section">llama_new_context_with_model: flash_attn = 1</span> <span class="hljs-section">llama_new_context_with_model: freq_base = 1000000.0</span> <span class="hljs-section">llama_new_context_with_model: freq_scale = 1</span> <span class="hljs-section">llama_kv_cache_init: CUDA0 KV buffer size = 384.00 MiB</span> <span class="hljs-section">llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB</span> <span class="hljs-section">llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB</span> <span class="hljs-section">llama_new_context_with_model: CUDA0 compute buffer size = 298.50 MiB</span> <span class="hljs-section">llama_new_context_with_model: CUDA_Host compute buffer size = 65.76 MiB</span> <span class="hljs-section">llama_new_context_with_model: graph nodes = 751</span> <span class="hljs-section">llama_new_context_with_model: graph splits = 2</span> <span class="hljs-section">common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable)</span> srv init: initializing slots, n_slots = 1 slot init: id 0 | task -1 | new slot n_ctx_slot = 32768 <span class="hljs-section">main: model loaded</span> <span class="hljs-section">main: chat template, built_in: 1, chat_example: '&lt;|im_start|&gt;system</span> You are a helpful assistant&lt;|im_end|&gt; &lt;|im_start|&gt;user Hello&lt;|im_end|&gt; &lt;|im_start|&gt;assistant Hi there&lt;|im_end|&gt; &lt;|im_start|&gt;user How are you?&lt;|im_end|&gt; &lt;|im_start|&gt;assistant ' <span class="hljs-section">main: server is listening on 127.0.0.1:8080 - starting the main loop</span> srv update_slots: all slots are idle</pre></div><p id="0cdf">Alternatively, we can start llama-server from a CPU-only Docker image:</p><div id="bcf1"><pre>$ docker run -v /home/wsluser/llama-cpp/models:/models mb20261/llama.cpp:llama-srv-cpu-ubuntu2204-v1 -m /models/qwen2.5-0.5b-instruct-q4_0.gguf --port 8080 --host 0.0.0.0 <span class="hljs-section">warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host</span> <span class="hljs-section">build: 3955 (e94a138d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu</span> system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

<span class="hljs-section">system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |</span>

<span class="hljs-section">main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11</span> <span class="hljs-section">main: loading model</span> <span class="hljs-section">llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))</span> <span class="hljs-section">llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.</span> <span class="hljs-section">llama_model_loader: - kv 0: general.architecture str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 1: general.type str = model</span> <span class="hljs-section">llama_model_loader: - kv 2: general.name str = qwen2.5-0.5b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 3: general.version str = v0.1</span> <span class="hljs-section">llama_model_loader: - kv 4: general.finetune str = qwen2.5-0.5b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 5: general.size_label str = 630M</span> <span class="hljs-section">llama_model_loader: - kv 6: qwen2.block_count u32 = 24</span> <span class="hljs-section">llama_model_loader: - kv 7: qwen2.context_length u32 = 32768</span> <span class="hljs-section">llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896</span> <span class="hljs-section">llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864</span> <span class="hljs-section">llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14</span> <span class="hljs-section">llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2</span> <span class="hljs-section">llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000</span> <span class="hljs-section">llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001</span> <span class="hljs-section">llama_model_loader: - kv 14: general.file_type u32 = 2</span> <span class="hljs-section">llama_model_loader: - kv 
15: tokenizer.ggml.model str = gpt2</span> <span class="hljs-section">llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...</span> <span class="hljs-section">llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</span> <span class="hljs-section">llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...</span> <span class="hljs-section">llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645</span> <span class="hljs-section">llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false</span> <span class="hljs-section">llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...</span> <span class="hljs-section">llama_model_loader: - kv 25: general.quantization_version u32 = 2</span> <span class="hljs-section">llama_model_loader: - type f32: 121 tensors</span> <span class="hljs-section">llama_model_loader: - type q4_0: 169 tensors</span> <span class="hljs-section">llama_model_loader: - type q8_0: 1 tensors</span> <span class="hljs-section">llm_load_vocab: special tokens cache size = 22</span> <span class="hljs-section">llm_load_vocab: token to piece cache size = 0.9310 MB</span> <span class="hljs-section">llm_load_print_meta: format = GGUF V3 (latest)</span> <span class="hljs-section">llm_load_print_meta: arch = qwen2</span> <span class="hljs-section">llm_load_print_meta: vocab type = BPE</span> <span class="hljs-section">llm_load_print_meta: n_vocab = 151936</span> <span class="hljs-section">llm_load_print_meta: n_merges = 151387</span> <span 
class="hljs-section">llm_load_print_meta: vocab_only = 0</span> <span class="hljs-section">llm_load_print_meta: n_ctx_train = 32768</span> <span class="hljs-section">llm_load_print_meta: n_embd = 896</span> <span class="hljs-section">llm_load_print_meta: n_layer = 24</span> <span class="hljs-section">llm_load_print_meta: n_head = 14</span> <span class="hljs-section">llm_load_print_meta: n_head_kv = 2</span> <span class="hljs-section">llm_load_print_meta: n_rot = 64</span> <span class="hljs-section">llm_load_print_meta: n_swa = 0</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_k = 64</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_v = 64</span> <span class="hljs-section">llm_load_print_meta: n_gqa = 7</span> <span class="hljs-section">llm_load_print_meta: n_embd_k_gqa = 128</span> <span class="hljs-section">llm_load_print_meta: n_embd_v_gqa = 128</span> <span class="hljs-section">llm_load_print_meta: f_norm_eps = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_norm_rms_eps = 1.0e-06</span> <span class="hljs-section">llm_load_print_meta: f_clamp_kqv = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_max_alibi_bias = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_logit_scale = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: n_ff = 4864</span> <span class="hljs-section">llm_load_print_meta: n_expert = 0</span> <span class="hljs-section">llm_load_print_meta: n_expert_used = 0</span> <span class="hljs-section">llm_load_print_meta: causal attn = 1</span> <span class="hljs-section">llm_load_print_meta: pooling type = 0</span> <span class="hljs-section">llm_load_print_meta: rope type = 2</span> <span class="hljs-section">llm_load_print_meta: rope scaling = linear</span> <span class="hljs-section">llm_load_print_meta: freq_base_train = 1000000.0</span> <span class="hljs-section">llm_load_print_meta: freq_scale_train = 1</span> <span 
class="hljs-section">llm_load_print_meta: n_ctx_orig_yarn = 32768</span> <span class="hljs-section">llm_load_print_meta: rope_finetuned = unknown</span> <span class="hljs-section">llm_load_print_meta: ssm_d_conv = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_inner = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_state = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_rank = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_b_c_rms = 0</span> <span class="hljs-section">llm_load_print_meta: model type = 1B</span> <span class="hljs-section">llm_load_print_meta: model ftype = Q4_0</span> <span class="hljs-section">llm_load_print_meta: model params = 630.17 M</span> <span class="hljs-section">llm_load_print_meta: model size = 403.20 MiB (5.37 BPW)</span> <span class="hljs-section">llm_load_print_meta: general.name = qwen2.5-0.5b-instruct</span> <span class="hljs-section">llm_load_print_meta: BOS token = 151643 '<|endoftext|>'</span> <span class="hljs-section">llm_load_print_meta: EOS token = 151645 '<|im_end|>'</span> <span class="hljs-section">llm_load_print_meta: EOT token = 151645 '<|im_end|>'</span> <span class="hljs-section">llm_load_print_meta: PAD token = 151643 '<|endoftext|>'</span> <span class="hljs-section">llm_load_print_meta: LF token = 148848 'ÄĬ'</span> <span class="hljs-section">llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'</span> <span class="hljs-section">llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'</span> <span class="hljs-section">llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'</span> <span class="hljs-section">llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'</span> <span class="hljs-section">llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'</span> <span class="hljs-section">llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151643 
'<|endoftext|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151645 '<|im_end|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151663 '<|repo_name|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151664 '<|file_sep|>'</span> <span class="hljs-section">llm_load_print_meta: max token length = 256</span> <span class="hljs-section">llm_load_tensors: ggml ctx size = 0.13 MiB</span> <span class="hljs-section">llm_load_tensors: CPU buffer size = 403.20 MiB</span> .................................................. <span class="hljs-section">llama_new_context_with_model: n_ctx = 32768</span> <span class="hljs-section">llama_new_context_with_model: n_batch = 2048</span> <span class="hljs-section">llama_new_context_with_model: n_ubatch = 512</span> <span class="hljs-section">llama_new_context_with_model: flash_attn = 0</span> <span class="hljs-section">llama_new_context_with_model: freq_base = 1000000.0</span> <span class="hljs-section">llama_new_context_with_model: freq_scale = 1</span> <span class="hljs-section">llama_kv_cache_init: CPU KV buffer size = 384.00 MiB</span> <span class="hljs-section">llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB</span> <span class="hljs-section">llama_new_context_with_model: CPU output buffer size = 1.16 MiB</span> <span class="hljs-section">llama_new_context_with_model: CPU compute buffer size = 967.01 MiB</span> <span class="hljs-section">llama_new_context_with_model: graph nodes = 846</span> <span class="hljs-section">llama_new_context_with_model: graph splits = 1</span> <span class="hljs-section">common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable)</span> srv init: initializing slots, n_slots = 1 slot init: id 0 | task -1 | new slot n_ctx_slot = 32768 <span class="hljs-section">main: model loaded</span> <span class="hljs-section">main: chat template, built_in: 1, chat_example: '<|im_start|>system</span> You are a helpful assistant<|im_end|> <|im_start|>user Hello<|im_end|> <|im_start|>assistant Hi there<|im_end|> <|im_start|>user How are you?<|im_end|> <|im_start|>assistant ' <span class="hljs-section">main: server is listening on 0.0.0.0:8080 - starting the main loop</span> srv update_slots: all slots are idle

<span class="hljs-section">request: GET /health 127.0.0.1 200</span> </pre></div><p id="0b61">We can then call the server through its OpenAI-compatible API (no key required), which keeps the client application portable.</p><div id="47bd"><pre>$ curl http:<span class="hljs-comment">//localhost:8080/v1/chat/completions </span> <span class="hljs-operator">-</span><span class="hljs-type">H</span> <span class="hljs-string">"Content-Type: application/json"</span> \
<span class="hljs-operator">-</span><span class="hljs-type">H</span> <span class="hljs-string">"Authorization: Bearer sk-no-key-required"</span> \
<span class="hljs-operator">-</span>d '{ <span class="hljs-string">"model"</span>: <span class="hljs-string">"qwen"</span>, <span class="hljs-string">"messages"</span>: [ { <span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: [ { <span class="hljs-string">"type"</span>: <span class="hljs-string">"text"</span>, <span class="hljs-string">"text"</span>: <span class="hljs-string">"You are a helpful assistant."</span> } ] }, { <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: [ { <span class="hljs-string">"type"</span>: <span class="hljs-string">"text"</span>, <span class="hljs-string">"text"</span>: <span class="hljs-string">"tell me something about michael jordan"</span> } ] } ] }'

{<span class="hljs-string">"choices"</span>:[{<span class="hljs-string">"finish_reason"</span>:<span class="hljs-string">"stop"</span>,<span class="hljs-string">"index"</span>:<span class="hljs-number">0</span>,<span class="hljs-string">"message"</span>:{<span class="hljs-string">"content"</span>:<span class="hljs-string">"Here is a brief biography of Michael Jordan:<span class="hljs-subst">\n</span><span class="hljs-subst">\n</span>Michael Jordan was a professional basketball player from the United States who played for the Chicago Bulls during the 1993-1994 NBA season. He is widely regarded as one of the greatest basketball players of all time and is widely considered one of the greatest basketball players ever to play the game.<span class="hljs-subst">\n</span><span class="hljs-subst">\n</span>Jordan is known for his incredible athleticism, skill, and leadership on the court, as well as his exceptional ability to rebound from the field and in the gym. He is also recognized for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.<span class="hljs-subst">\n</span><span class="hljs-subst">\n</span>Jordan was a skilled player, with a career-high rebound rate of 50% and a career-high percentage of 27% in scoring, and he was also known for his ability to rebound from the bench and in the gym.<span class="hljs-subst">\n</span><span class="hljs-subst">\n</span>Jordan is also known for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.<span class="hljs-subst">\n</span><span class="hljs-subst">\n</span>Jordan is known for his leadership on the court, and he is also known for his ability to rebound from the bench and in the gym.<span class="hljs-subst">\n</span><span class="hljs-subst">\n</span>Jordan is also known for his contributions to the development of basketball in the United States, and he is also known for his contributions to the development 
of basketball in the United States."</span>,<span class="hljs-string">"role"</span>:<span class="hljs-string">"assistant"</span>}}],<span class="hljs-string">"created"</span>:<span class="hljs-number">1729560781</span>,<span class="hljs-string">"model"</span>:<span class="hljs-string">"qwen"</span>,<span class="hljs-string">"object"</span>:<span class="hljs-string">"chat.completion"</span>,<span class="hljs-string">"usage"</span>:{<span class="hljs-string">"completion_tokens"</span>:<span class="hljs-number">273</span>,<span class="hljs-string">"prompt_tokens"</span>:<span class="hljs-number">26</span>,<span class="hljs-string">"total_tokens"</span>:<span class="hljs-number">299</span>},<span class="hljs-string">"id"</span>:<span class="hljs-string">"chatcmpl-4yJgHpf2OIHDrNycWPgvTvbswhznZv35"</span>}

</pre></div><p id="fb7a">Or call it from Python:</p><div id="a6be"><pre>$ python
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import openai
>>> client = openai.OpenAI(
...     base_url="http://localhost:8080/v1",  # "http://&lt;Your api-server IP&gt;:port"
...     api_key="sk-no-key-required"
... )
>>> completion = client.chat.completions.create(
...     model="qwen",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "tell me something about michael jordan"}
...     ]
... )
>>> print(completion.choices[0].message.content)
Michael Jordan is a renowned basketball player who is widely regarded as one of the greatest players in basketball history. He is the first and currently the most successful of all time, having won five NBA championships and four Olympic gold medals. Jordan is considered one of the greatest of all time and was one of the best players in the world. He is also known for his innovative skills and his ability to influence his opponents in the sport.
>>> exit()</pre></div><h1 id="748e">Tuning the Llama.cpp command to fit your environment</h1><p id="11b9">A GPU-enabled Llama.cpp build works for CPU-only, CPU+GPU, and GPU-only usage. Different models and quantization levels produce different file sizes, so there is no single best set of command-line arguments. You need to tune parameters such as the number of layers offloaded to the GPU, the context length, and the batch size so that the model fits your hardware.</p><p id="2af2">Now let’s walk through the common use cases.</p><h2 id="e0e0">Case Study: Load a 7B GGUF model into CPU + 8GB GPU</h2><p id="e4e6">Let’s first download the model. 
We will use Qwen 2.5 7B GGUF model:</p><div id="9e13" class="link-block"> <a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF"> <div> <div> <h2>Qwen/Qwen2.5-7B-Instruct-GGUF · Hugging Face</h2> <div><h3>We're on a journey to advance and democratize artificial intelligence through open source and open science.</h3></div> <div><p>huggingface.co</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*BYIkavMDsxV5oV6f)"></div> </div> </div> </a> </div><div id="1c8b"><pre> huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF --include <span class="hljs-string">"qwen2.5-7b-instruct-q5_k_m*.gguf"</span> --local-dir . --local-dir-use-symlinks False /work/GitHubsEnv/MEAIDev/poc-ai-tool-llama-cpp/huggingface-cli/lib/python3.10/site-packages/huggingface_hub/commands/download.py:139: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a <span class="hljs-built_in">local</span> directory does not use symlinks anymore. warnings.warn( Fetching 2 files: 0%| | 0/2 [00:00&lt;?, ?it/s]Downloading <span class="hljs-string">'qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf'</span> to <span class="hljs-string">'.cache/huggingface/download/qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf.beba9d4f2f5a1fe7d144dcae332e68b52c26705c5310dece2e5d1997e091e134.incomplete'</span> Downloading <span class="hljs-string">'qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf'</span> to <span class="hljs-string">'.cache/huggingface/download/qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf.42f6693004793ee6cf1b2b723f0273b10f86a3bb2a949bd9128d4cda5fb866cd.incomplete'</span> (…)5-7b-instruct-q5_k_m-00002-of-00002.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.45G/1.45G [01:10&lt;00:00, 20.7MB/s] Download complete. 
Moving file to qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf██████████████████████████████████████████████████████████████| 1.45G/1.45G [01:10&lt;00:00, 12.8MB/s] (…)5-7b-instruct-q5_k_m-00001-of-00002.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 3.99G/3.99G [01:52&lt;00:00, 35.4MB/s] Download complete. Moving file to qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf Fetching 2 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:53&lt;00:00, 56.59s/it] /work/GitHubs/MEAIDev/poc-ai-tool-llama-cpp/llama.cpp/build/bin</pre></div><p id="7681">Large model files are normally split into multiple segments because of file-upload size limits. The segments share a prefix, and a suffix indicates each segment’s index, for example <code>qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf</code> and <code>qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf</code>. After the command above has downloaded all the files, we merge them into one GGUF model file to feed into the Llama.cpp commands.</p><div id="634d"><pre> ./llama-gguf-split --merge qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf ~/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf gguf_merge: qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf -> /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf <span class="hljs-keyword">done</span> gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf <span class="hljs-keyword">done</span> gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf <span class="hljs-keyword">done</span> gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf <span class="hljs-keyword">done</span> gguf_merge: /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf merged from 2 <span class="hljs-built_in">split</span> 
with 339 tensors.</pre></div><p id="9c21">Now let’s load the model for conversation.</p><div id="25a0"><pre>$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
-co -cnv -p <span class="hljs-string">"You are Qwen, created by Alibaba Cloud. You are a helpful assistant."</span> \
-fa -ngl 80 -n 512 <span class="hljs-section">ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no</span> <span class="hljs-section">ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no</span> <span class="hljs-section">ggml_cuda_init: found 1 CUDA devices:</span> Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes <span class="hljs-section">build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu</span> <span class="hljs-section">main: llama backend init</span> <span class="hljs-section">main: load the model and apply lora adapter, if any</span> <span class="hljs-section">llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free</span> <span class="hljs-section">llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))</span> <span class="hljs-section">llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.</span> <span class="hljs-section">llama_model_loader: - kv 0: general.architecture str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 1: general.type str = model</span> <span class="hljs-section">llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 3: general.version str = v0.1</span> <span class="hljs-section">llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 5: general.size_label str = 7.6B</span> <span class="hljs-section">llama_model_loader: - kv 6: qwen2.block_count u32 = 28</span> <span class="hljs-section">llama_model_loader: - kv 7: qwen2.context_length u32 = 131072</span> <span class="hljs-section">llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584</span> <span class="hljs-section">llama_model_loader: - kv 9: qwen2.feed_for
ward_length u32 = 18944</span> <span class="hljs-section">llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28</span> <span class="hljs-section">llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4</span> <span class="hljs-section">llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000</span> <span class="hljs-section">llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001</span> <span class="hljs-section">llama_model_loader: - kv 14: general.file_type u32 = 17</span> <span class="hljs-section">llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2</span> <span class="hljs-section">llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "", "%", "&amp;", "'", ...</span> <span class="hljs-section">llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</span> <span class="hljs-section">llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...</span> <span class="hljs-section">llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645</span> <span class="hljs-section">llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false</span> <span class="hljs-section">llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '&lt;|im_start|&gt;...</span> <span class="hljs-section">llama_model_loader: - kv 25: general.quantization_version u32 = 2</span> <span class="hljs-section">llama_model_loader: - kv 26: split.no u16 = 0</span> <span class="hljs-section">llama_model_loader: - kv 27: split.count u16 = 0</span> <span 
class="hljs-section">llama_model_loader: - kv 28: split.tensors.count i32 = 339</span> <span class="hljs-section">llama_model_loader: - type f32: 141 tensors</span> <span class="hljs-section">llama_model_loader: - type q5_K: 169 tensors</span> <span class="hljs-section">llama_model_loader: - type q6_K: 29 tensors</span> <span class="hljs-section">llm_load_vocab: special tokens cache size = 22</span> <span class="hljs-section">llm_load_vocab: token to piece cache size = 0.9310 MB</span> <span class="hljs-section">llm_load_print_meta: format = GGUF V3 (latest)</span> <span class="hljs-section">llm_load_print_meta: arch = qwen2</span> <span class="hljs-section">llm_load_print_meta: vocab type = BPE</span> <span class="hljs-section">llm_load_print_meta: n_vocab = 152064</span> <span class="hljs-section">llm_load_print_meta: n_merges = 151387</span> <span class="hljs-section">llm_load_print_meta: vocab_only = 0</span> <span class="hljs-section">llm_load_print_meta: n_ctx_train = 131072</span> <span class="hljs-section">llm_load_print_meta: n_embd = 3584</span> <span class="hljs-section">llm_load_print_meta: n_layer = 28</span> <span class="hljs-section">llm_load_print_meta: n_head = 28</span> <span class="hljs-section">llm_load_print_meta: n_head_kv = 4</span> <span class="hljs-section">llm_load_print_meta: n_rot = 128</span> <span class="hljs-section">llm_load_print_meta: n_swa = 0</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_k = 128</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_v = 128</span> <span class="hljs-section">llm_load_print_meta: n_gqa = 7</span> <span class="hljs-section">llm_load_print_meta: n_embd_k_gqa = 512</span> <span class="hljs-section">llm_load_print_meta: n_embd_v_gqa = 512</span> <span class="hljs-section">llm_load_print_meta: f_norm_eps = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_norm_rms_eps = 1.0e-06</span> <span class="hljs-section">llm_load_print_meta: f_clamp_kqv = 
0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_max_alibi_bias = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_logit_scale = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: n_ff = 18944</span> <span class="hljs-section">llm_load_print_meta: n_expert = 0</span> <span class="hljs-section">llm_load_print_meta: n_expert_used = 0</span> <span class="hljs-section">llm_load_print_meta: causal attn = 1</span> <span class="hljs-section">llm_load_print_meta: pooling type = 0</span> <span class="hljs-section">llm_load_print_meta: rope type = 2</span> <span class="hljs-section">llm_load_print_meta: rope scaling = linear</span> <span class="hljs-section">llm_load_print_meta: freq_base_train = 1000000.0</span> <span class="hljs-section">llm_load_print_meta: freq_scale_train = 1</span> <span class="hljs-section">llm_load_print_meta: n_ctx_orig_yarn = 131072</span> <span class="hljs-section">llm_load_print_meta: rope_finetuned = unknown</span> <span class="hljs-section">llm_load_print_meta: ssm_d_conv = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_inner = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_state = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_rank = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_b_c_rms = 0</span> <span class="hljs-section">llm_load_print_meta: model type = ?B</span> <span class="hljs-section">llm_load_print_meta: model ftype = Q5_K - Medium</span> <span class="hljs-section">llm_load_print_meta: model params = 7.62 B</span> <span class="hljs-section">llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)</span> <span class="hljs-section">llm_load_print_meta: general.name = qwen2.5-7b-instruct</span> <span class="hljs-section">llm_load_print_meta: BOS token = 151643 '&lt;|endoftext|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOS token = 151645 '&lt;|im_end|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOT 
token = 151645 '&lt;|im_end|&gt;'</span> <span class="hljs-section">llm_load_print_meta: PAD token = 151643 '&lt;|endoftext|&gt;'</span> <span class="hljs-section">llm_load_print_meta: LF token = 148848 'ÄĬ'</span> <span class="hljs-section">llm_load_print_meta: FIM PRE token = 151659 '&lt;|fim_prefix|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM SUF token = 151661 '&lt;|fim_suffix|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM MID token = 151660 '&lt;|fim_middle|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM PAD token = 151662 '&lt;|fim_pad|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM REP token = 151663 '&lt;|repo_name|&gt;'</span> <span class="hljs-section">llm_load_print_meta: FIM SEP token = 151664 '&lt;|file_sep|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151643 '&lt;|endoftext|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151645 '&lt;|im_end|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151662 '&lt;|fim_pad|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151663 '&lt;|repo_name|&gt;'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151664 '&lt;|file_sep|&gt;'</span> <span class="hljs-section">llm_load_print_meta: max token length = 256</span> <span class="hljs-section">llm_load_tensors: ggml ctx size = 0.30 MiB</span> <span class="hljs-section">llm_load_tensors: offloading 28 repeating layers to GPU</span> <span class="hljs-section">llm_load_tensors: offloading non-repeating layers to GPU</span> <span class="hljs-section">llm_load_tensors: offloaded 29/29 layers to GPU</span> <span class="hljs-section">llm_load_tensors: CPU buffer size = 357.33 MiB</span> <span class="hljs-section">llm_load_tensors: CUDA0 buffer size = 4829.59 MiB</span> ....................................................................................... 
<span class="hljs-section">llama_new_context_with_model: n_ctx = 131072</span> <span class="hljs-section">llama_new_context_with_model: n_batch = 2048</span> <span class="hljs-section">llama_new_context_with_model: n_ubatch = 512</span> <span class="hljs-section">llama_new_context_with_model: flash_attn = 1</span> <span class="hljs-section">llama_new_context_with_model: freq_base = 1000000.0</span> <span class="hljs-section">llama_new_context_with_model: freq_scale = 1</span> <span class="hljs-section">ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7168.00 MiB on device 0: cudaMalloc failed: out of memory</span> <span class="hljs-section">llama_kv_cache_init: failed to allocate buffer for kv cache</span> <span class="hljs-section">llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache</span> <span class="hljs-section">common_init_from_params: failed to create context with model '/home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf'</span> <span class="hljs-section">main: error: unable to load model</span></pre></div><p id="471d">Failed! The error is out of memory: the log shows that the 7168.00 MiB KV cache could not be allocated on the GPU after the model weights had already taken 4829.59 MiB of the 7089 MiB free. To relieve the memory pressure, you can adjust a few settings in your command:</p><ol><li><b>Reduce <code>-ngl</code> (number of GPU layers)</b>: Since your GPU has limited VRAM, try setting <code>-ngl</code> to a lower value. The <code>-ngl</code> option specifies how many transformer layers are stored on the GPU. A safe starting point would be around 10–12 for an RTX 2070, but you may need to experiment to find the number that works best for your setup.</li><li><b>Reduce Context Size (<code>-c</code> or <code>--ctx-size</code>)</b>: If your application allows, you might also want to set a smaller context size. In the failed run the full 131072-token context was used, so the f16 KV cache alone needed 2 (K and V) × 28 layers × 131072 tokens × 512 (<code>n_embd_k_gqa</code>) × 2 bytes = 7168 MiB. 
Although the default is 0 (which loads the context length from the model), you could experiment with smaller values for <code>--ctx-size</code> if your application is flexible.</li><li><b>Adjust Batch Size (<code>-b</code> or <code>--batch-size</code>)</b>: You can reduce the maximum logical batch size with <code>-b</code> (default is 2048) and the maximum physical batch size with <code>--ubatch-size</code> (default is 512) to lower memory usage during inference.</li><li><b>Avoid Flash Attention</b>: If not strictly needed, you can also disable Flash Attention by omitting the <code>-fa</code> option.</li></ol><p id="3e05">Let’s make the adjustments below:</p><ul><li><i>-ngl 12</i>: This offloads 12 layers to the GPU instead of all 29 (the earlier <i>-ngl 80</i> exceeds the model’s layer count, so every layer was offloaded).</li><li><i>-b 128</i>: This reduces the logical batch size to help lower memory use.</li><li><i>-ub 32</i>: This limits the physical batch size even further.</li><li><i>-fa</i> (removed): This disables Flash Attention.</li></ul><p id="c5e8">You might need to experiment a bit with the values for <code>-ngl</code>, <code>-b</code>, and <code>-ub</code> until you find a configuration that works without running out of memory. If you still encounter issues, consider lowering <code>-ngl</code> even further.</p><p id="4c81">So, let’s try the new command:</p><div id="32fd"><pre> ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
-co -cnv -p <span class="hljs-string">"You are Qwen, created by Alibaba Cloud. You are a helpful assistant."</span> \
-ngl 12 -n 512 -b 128 -ub 32 <span class="hljs-section">ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no</span> <span class="hljs-section">ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no</span> <span class="hljs-section">ggml_cuda_init: found 1 CUDA devices:</span> Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes <span class="hljs-section">build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu</span> <span class="hljs-section">main: llama backend init</span> <span class="hljs-section">main: load the model and apply lora adapter, if any</span> <span class="hljs-section">llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free</span> <span class="hljs-section">llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))</span> <span class="hljs-section">llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.</span> <span class="hljs-section">llama_model_loader: - kv 0: general.architecture str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 1: general.type str = model</span> <span class="hljs-section">llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 3: general.version str = v0.1</span> <span class="hljs-section">llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 5: general.size_label str = 7.6B</span> <span class="hljs-section">llama_model_loader: - kv 6: qwen2.block_count u32 = 28</span> <span class="hljs-section">llama_model_loader: - kv 7: qwen2.context_length u32 = 131072</span> <span class="hljs-section">llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584</span> <span class="hljs-section">llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944</span> 
<span class="hljs-section">llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28</span> <span class="hljs-section">llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4</span> <span class="hljs-section">llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000</span> <span class="hljs-section">llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001</span> <span class="hljs-section">llama_model_loader: - kv 14: general.file_type u32 = 17</span> <span class="hljs-section">llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2</span> <span class="hljs-section">llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...</span> <span class="hljs-section">llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</span> <span class="hljs-section">llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...</span> <span class="hljs-section">llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645</span> <span class="hljs-section">llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false</span> <span class="hljs-section">llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...</span> <span class="hljs-section">llama_model_loader: - kv 25: general.quantization_version u32 = 2</span> <span class="hljs-section">llama_model_loader: - kv 26: split.no u16 = 0</span> <span class="hljs-section">llama_model_loader: - kv 27: split.count u16 = 0</span> <span class="hljs-section">llama_model_loader: - kv 
28: split.tensors.count i32 = 339</span> <span class="hljs-section">llama_model_loader: - type f32: 141 tensors</span> <span class="hljs-section">llama_model_loader: - type q5_K: 169 tensors</span> <span class="hljs-section">llama_model_loader: - type q6_K: 29 tensors</span> <span class="hljs-section">llm_load_vocab: special tokens cache size = 22</span> <span class="hljs-section">llm_load_vocab: token to piece cache size = 0.9310 MB</span> <span class="hljs-section">llm_load_print_meta: format = GGUF V3 (latest)</span> <span class="hljs-section">llm_load_print_meta: arch = qwen2</span> <span class="hljs-section">llm_load_print_meta: vocab type = BPE</span> <span class="hljs-section">llm_load_print_meta: n_vocab = 152064</span> <span class="hljs-section">llm_load_print_meta: n_merges = 151387</span> <span class="hljs-section">llm_load_print_meta: vocab_only = 0</span> <span class="hljs-section">llm_load_print_meta: n_ctx_train = 131072</span> <span class="hljs-section">llm_load_print_meta: n_embd = 3584</span> <span class="hljs-section">llm_load_print_meta: n_layer = 28</span> <span class="hljs-section">llm_load_print_meta: n_head = 28</span> <span class="hljs-section">llm_load_print_meta: n_head_kv = 4</span> <span class="hljs-section">llm_load_print_meta: n_rot = 128</span> <span class="hljs-section">llm_load_print_meta: n_swa = 0</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_k = 128</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_v = 128</span> <span class="hljs-section">llm_load_print_meta: n_gqa = 7</span> <span class="hljs-section">llm_load_print_meta: n_embd_k_gqa = 512</span> <span class="hljs-section">llm_load_print_meta: n_embd_v_gqa = 512</span> <span class="hljs-section">llm_load_print_meta: f_norm_eps = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_norm_rms_eps = 1.0e-06</span> <span class="hljs-section">llm_load_print_meta: f_clamp_kqv = 0.0e+00</span> <span 
class="hljs-section">llm_load_print_meta: f_max_alibi_bias = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_logit_scale = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: n_ff = 18944</span> <span class="hljs-section">llm_load_print_meta: n_expert = 0</span> <span class="hljs-section">llm_load_print_meta: n_expert_used = 0</span> <span class="hljs-section">llm_load_print_meta: causal attn = 1</span> <span class="hljs-section">llm_load_print_meta: pooling type = 0</span> <span class="hljs-section">llm_load_print_meta: rope type = 2</span> <span class="hljs-section">llm_load_print_meta: rope scaling = linear</span> <span class="hljs-section">llm_load_print_meta: freq_base_train = 1000000.0</span> <span class="hljs-section">llm_load_print_meta: freq_scale_train = 1</span> <span class="hljs-section">llm_load_print_meta: n_ctx_orig_yarn = 131072</span> <span class="hljs-section">llm_load_print_meta: rope_finetuned = unknown</span> <span class="hljs-section">llm_load_print_meta: ssm_d_conv = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_inner = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_state = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_rank = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_b_c_rms = 0</span> <span class="hljs-section">llm_load_print_meta: model type = ?B</span> <span class="hljs-section">llm_load_print_meta: model ftype = Q5_K - Medium</span> <span class="hljs-section">llm_load_print_meta: model params = 7.62 B</span> <span class="hljs-section">llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)</span> <span class="hljs-section">llm_load_print_meta: general.name = qwen2.5-7b-instruct</span> <span class="hljs-section">llm_load_print_meta: BOS token = 151643 '<|endoftext|>'</span> <span class="hljs-section">llm_load_print_meta: EOS token = 151645 '<|im_end|>'</span> <span class="hljs-section">llm_load_print_meta: EOT token = 151645 
'<|im_end|>'</span> <span class="hljs-section">llm_load_print_meta: PAD token = 151643 '<|endoftext|>'</span> <span class="hljs-section">llm_load_print_meta: LF token = 148848 'ÄĬ'</span> <span class="hljs-section">llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'</span> <span class="hljs-section">llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'</span> <span class="hljs-section">llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'</span> <span class="hljs-section">llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'</span> <span class="hljs-section">llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'</span> <span class="hljs-section">llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151643 '<|endoftext|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151645 '<|im_end|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151663 '<|repo_name|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151664 '<|file_sep|>'</span> <span class="hljs-section">llm_load_print_meta: max token length = 256</span> <span class="hljs-section">llm_load_tensors: ggml ctx size = 0.30 MiB</span> <span class="hljs-section">llm_load_tensors: offloading 12 repeating layers to GPU</span> <span class="hljs-section">llm_load_tensors: offloaded 12/29 layers to GPU</span> <span class="hljs-section">llm_load_tensors: CPU buffer size = 5186.92 MiB</span> <span class="hljs-section">llm_load_tensors: CUDA0 buffer size = 1895.93 MiB</span> ....................................................................................... 
<span class="hljs-section">llama_new_context_with_model: n_ctx = 131072</span> <span class="hljs-section">llama_new_context_with_model: n_batch = 128</span> <span class="hljs-section">llama_new_context_with_model: n_ubatch = 32</span> <span class="hljs-section">llama_new_context_with_model: flash_attn = 0</span> <span class="hljs-section">llama_new_context_with_model: freq_base = 1000000.0</span> <span class="hljs-section">llama_new_context_with_model: freq_scale = 1</span> <span class="hljs-section">llama_kv_cache_init: CUDA_Host KV buffer size = 4096.00 MiB</span> <span class="hljs-section">llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB</span> <span class="hljs-section">llama_new_context_with_model: KV self size = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB</span> <span class="hljs-section">llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB</span> <span class="hljs-section">llama_new_context_with_model: CUDA0 compute buffer size = 721.33 MiB</span> <span class="hljs-section">llama_new_context_with_model: CUDA_Host compute buffer size = 16.44 MiB</span> <span class="hljs-section">llama_new_context_with_model: graph nodes = 986</span> <span class="hljs-section">llama_new_context_with_model: graph splits = 228</span> <span class="hljs-section">common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)</span> <span class="hljs-section">main: llama threadpool init, n_threads = 6</span> <span class="hljs-section">main: chat template example:</span> <|im_start|>system You are a helpful assistant<|im_end|> <|im_start|>user Hello<|im_end|> <|im_start|>assistant Hi there<|im_end|> <|im_start|>user How are you?<|im_end|> <|im_start|>assistant

<span class="hljs-section">system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |</span>

<span class="hljs-section">main: interactive mode on.</span> sampler seed: 48583840 sampler params: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist <span class="hljs-section">generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0</span>

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to the AI.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

system You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

> where is new york? New York is a state located in the northeastern region of the United States. Its capital city is Albany, but the most populous city in the state, and one of the most populous cities in the United States, is New York City, often simply called <span class="hljs-string">"New York."</span> New York City is a global metropolis known for its culture, finance, arts, and business. It is located along the Atlantic coast and includes several counties, with Manhattan being one of the most well-known boroughs.

> </pre></div><p id="71a7">It works now.</p><h2 id="5775">Case Study: Load a 7B GGUF model into CPU only</h2><p id="972a">Let’s continue from the case above and reuse the same GGUF model. Now we want to run the whole model on the CPU only.</p><p id="b728">To do so, we need the adjustments below:</p><ol><li><b>Remove GPU Options</b>: Omit the <code>-ngl</code> option and other GPU-related options (like <code>-fa</code>, if present). This keeps all layers on the CPU.</li><li><b><i>--cpu-strict 1</i></b>: This enables strict CPU placement for the compute threads.</li><li>Adjust <b><i>-b 128</i></b> (the batch size) as needed based on your performance and memory constraints.</li></ol><div id="5fe2"><pre> ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \ -co -cnv -p <span class="hljs-string">"You are Qwen, created by Alibaba Cloud. You are a helpful assistant."</span> \ --cpu-strict 1 -n 512 -b 128 <span class="hljs-section">ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no</span> <span class="hljs-section">ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no</span> <span class="hljs-section">ggml_cuda_init: found 1 CUDA devices:</span> Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes <span class="hljs-section">build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu</span> <span class="hljs-section">main: llama backend init</span> <span class="hljs-section">main: load the model and apply lora adapter, if any</span> <span class="hljs-section">llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free</span> <span class="hljs-section">llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))</span> <span class="hljs-section">llama_model_loader: Dumping metadata keys/values. 
Note: KV overrides do not apply in this output.</span> <span class="hljs-section">llama_model_loader: - kv 0: general.architecture str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 1: general.type str = model</span> <span class="hljs-section">llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 3: general.version str = v0.1</span> <span class="hljs-section">llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct</span> <span class="hljs-section">llama_model_loader: - kv 5: general.size_label str = 7.6B</span> <span class="hljs-section">llama_model_loader: - kv 6: qwen2.block_count u32 = 28</span> <span class="hljs-section">llama_model_loader: - kv 7: qwen2.context_length u32 = 131072</span> <span class="hljs-section">llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584</span> <span class="hljs-section">llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944</span> <span class="hljs-section">llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28</span> <span class="hljs-section">llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4</span> <span class="hljs-section">llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000</span> <span class="hljs-section">llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001</span> <span class="hljs-section">llama_model_loader: - kv 14: general.file_type u32 = 17</span> <span class="hljs-section">llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2</span> <span class="hljs-section">llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2</span> <span class="hljs-section">llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...</span> <span class="hljs-section">llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</span> 
<span class="hljs-section">llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...</span> <span class="hljs-section">llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645</span> <span class="hljs-section">llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643</span> <span class="hljs-section">llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false</span> <span class="hljs-section">llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...</span> <span class="hljs-section">llama_model_loader: - kv 25: general.quantization_version u32 = 2</span> <span class="hljs-section">llama_model_loader: - kv 26: split.no u16 = 0</span> <span class="hljs-section">llama_model_loader: - kv 27: split.count u16 = 0</span> <span class="hljs-section">llama_model_loader: - kv 28: split.tensors.count i32 = 339</span> <span class="hljs-section">llama_model_loader: - type f32: 141 tensors</span> <span class="hljs-section">llama_model_loader: - type q5_K: 169 tensors</span> <span class="hljs-section">llama_model_loader: - type q6_K: 29 tensors</span> <span class="hljs-section">llm_load_vocab: special tokens cache size = 22</span> <span class="hljs-section">llm_load_vocab: token to piece cache size = 0.9310 MB</span> <span class="hljs-section">llm_load_print_meta: format = GGUF V3 (latest)</span> <span class="hljs-section">llm_load_print_meta: arch = qwen2</span> <span class="hljs-section">llm_load_print_meta: vocab type = BPE</span> <span class="hljs-section">llm_load_print_meta: n_vocab = 152064</span> <span class="hljs-section">llm_load_print_meta: n_merges = 151387</span> <span class="hljs-section">llm_load_print_meta: vocab_only = 0</span> <span class="hljs-section">llm_load_print_meta: n_ctx_train = 131072</span> <span 
class="hljs-section">llm_load_print_meta: n_embd = 3584</span> <span class="hljs-section">llm_load_print_meta: n_layer = 28</span> <span class="hljs-section">llm_load_print_meta: n_head = 28</span> <span class="hljs-section">llm_load_print_meta: n_head_kv = 4</span> <span class="hljs-section">llm_load_print_meta: n_rot = 128</span> <span class="hljs-section">llm_load_print_meta: n_swa = 0</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_k = 128</span> <span class="hljs-section">llm_load_print_meta: n_embd_head_v = 128</span> <span class="hljs-section">llm_load_print_meta: n_gqa = 7</span> <span class="hljs-section">llm_load_print_meta: n_embd_k_gqa = 512</span> <span class="hljs-section">llm_load_print_meta: n_embd_v_gqa = 512</span> <span class="hljs-section">llm_load_print_meta: f_norm_eps = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_norm_rms_eps = 1.0e-06</span> <span class="hljs-section">llm_load_print_meta: f_clamp_kqv = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_max_alibi_bias = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: f_logit_scale = 0.0e+00</span> <span class="hljs-section">llm_load_print_meta: n_ff = 18944</span> <span class="hljs-section">llm_load_print_meta: n_expert = 0</span> <span class="hljs-section">llm_load_print_meta: n_expert_used = 0</span> <span class="hljs-section">llm_load_print_meta: causal attn = 1</span> <span class="hljs-section">llm_load_print_meta: pooling type = 0</span> <span class="hljs-section">llm_load_print_meta: rope type = 2</span> <span class="hljs-section">llm_load_print_meta: rope scaling = linear</span> <span class="hljs-section">llm_load_print_meta: freq_base_train = 1000000.0</span> <span class="hljs-section">llm_load_print_meta: freq_scale_train = 1</span> <span class="hljs-section">llm_load_print_meta: n_ctx_orig_yarn = 131072</span> <span class="hljs-section">llm_load_print_meta: rope_finetuned = unknown</span> <span 
class="hljs-section">llm_load_print_meta: ssm_d_conv = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_inner = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_d_state = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_rank = 0</span> <span class="hljs-section">llm_load_print_meta: ssm_dt_b_c_rms = 0</span> <span class="hljs-section">llm_load_print_meta: model type = ?B</span> <span class="hljs-section">llm_load_print_meta: model ftype = Q5_K - Medium</span> <span class="hljs-section">llm_load_print_meta: model params = 7.62 B</span> <span class="hljs-section">llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)</span> <span class="hljs-section">llm_load_print_meta: general.name = qwen2.5-7b-instruct</span> <span class="hljs-section">llm_load_print_meta: BOS token = 151643 '<|endoftext|>'</span> <span class="hljs-section">llm_load_print_meta: EOS token = 151645 '<|im_end|>'</span> <span class="hljs-section">llm_load_print_meta: EOT token = 151645 '<|im_end|>'</span> <span class="hljs-section">llm_load_print_meta: PAD token = 151643 '<|endoftext|>'</span> <span class="hljs-section">llm_load_print_meta: LF token = 148848 'ÄĬ'</span> <span class="hljs-section">llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'</span> <span class="hljs-section">llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'</span> <span class="hljs-section">llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'</span> <span class="hljs-section">llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'</span> <span class="hljs-section">llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'</span> <span class="hljs-section">llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151643 '<|endoftext|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151645 '<|im_end|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151662 
'<|fim_pad|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151663 '<|repo_name|>'</span> <span class="hljs-section">llm_load_print_meta: EOG token = 151664 '<|file_sep|>'</span> <span class="hljs-section">llm_load_print_meta: max token length = 256</span> <span class="hljs-section">llm_load_tensors: ggml ctx size = 0.15 MiB</span> <span class="hljs-section">llm_load_tensors: offloading 0 repeating layers to GPU</span> <span class="hljs-section">llm_load_tensors: offloaded 0/29 layers to GPU</span> <span class="hljs-section">llm_load_tensors: CPU buffer size = 5186.92 MiB</span> ....................................................................................... <span class="hljs-section">llama_new_context_with_model: n_ctx = 131072</span> <span class="hljs-section">llama_new_context_with_model: n_batch = 128</span> <span class="hljs-section">llama_new_context_with_model: n_ubatch = 128</span> <span class="hljs-section">llama_new_context_with_model: flash_attn = 0</span> <span class="hljs-section">llama_new_context_with_model: freq_base = 1000000.0</span> <span class="hljs-section">llama_new_context_with_model: freq_scale = 1</span> <span class="hljs-section">llama_kv_cache_init: CUDA_Host KV buffer size = 7168.00 MiB</span> <span class="hljs-section">llama_new_context_with_model: KV self size = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB</span> <span class="hljs-section">llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB</span> <span class="hljs-section">llama_new_context_with_model: CUDA0 compute buffer size = 2117.26 MiB</span> <span class="hljs-section">llama_new_context_with_model: CUDA_Host compute buffer size = 65.75 MiB</span> <span class="hljs-section">llama_new_context_with_model: graph nodes = 986</span> <span class="hljs-section">llama_new_context_with_model: graph splits = 396</span> <span class="hljs-section">common_init_from_params: warming up the model with an empty run - please wait ... 
(--no-warmup to disable)</span> <span class="hljs-section">main: llama threadpool init, n_threads = 6</span> <span class="hljs-section">main: chat template example:</span> <|im_start|>system You are a helpful assistant<|im_end|> <|im_start|>user Hello<|im_end|> <|im_start|>assistant Hi there<|im_end|> <|im_start|>user How are you?<|im_end|> <|im_start|>assistant

<span class="hljs-section">system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |</span>

<span class="hljs-section">main: interactive mode on.</span> sampler seed: 999110778 sampler params: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist <span class="hljs-section">generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0</span>

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to the AI.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

system You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

></pre></div><p id="c09b">It works too. How can you tell that the model is loaded on the CPU only? Check the parameters below in the output and compare them with the earlier GPU runs: here the log reports "offloaded 0/29 layers to GPU" and a single CPU buffer of 5186.92 MiB for the weights.</p><figure id="3b09"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Ga4qCMSPcPC68HvmXB0ygg.png"><figcaption></figcaption></figure><p id="acff">More considerations for CPU-only usage:</p><ul><li>If you want more control over CPU threading, you can also pass --threads N to set the number of threads used during generation. The default is -1, which lets llama.cpp pick the count; in the log above it chose 6 of the 12 available threads.</li><li>If you run into memory issues, keep experimenting with smaller -b values to reduce resource usage.</li></ul></article></body>

LLM By Examples: Utilizing Llama.cpp by Command Line Tools for CLI and Server

Llama.cpp has emerged as a powerful framework for working with language models, providing developers with robust tools and functionalities. This article explores the practical utility of Llama.cpp through command line tools, enabling seamless interaction with the framework for both command line interfaces (CLI) and server applications.

Utilizing Llama.cpp via command line tools offers a unique, flexible approach to model deployment and interaction. Developers can efficiently carry out tasks such as initializing models, querying text generation, and managing input-output processes from the terminal. This process not only streamlines workflows but also enhances productivity by allowing quick iterations. Furthermore, the command line environment encourages automation through scripting, making it ideal for integrating Llama.cpp capabilities into larger software systems or workflows. From generating prompts and fine-tuning parameters to outputting results in various formats, the CLI tools enable a fully customizable experience. In this article, we will delve into practical examples and best practices for leveraging Llama.cpp to its full potential via command line interfaces.

If you are not familiar with the core concepts of Llama.cpp, take a look at the link below first.

Preparation

To demonstrate the command line tools in llama.cpp, you first need to have llama.cpp installed. There are multiple options below; pick the one that best fits your environment.

Next, we need a GGUF model to test with. There are tons of GGUF models on Hugging Face. To keep it simple, we will use the Qwen 2.5 7B model below:

Last, Llama.cpp requires the GGUF model to be available locally. There are typically three approaches:

1. Use the Hugging Face client to download the model before calling Llama.cpp:

pip install huggingface_hub

huggingface-cli download MB20261/QWen2.5-7B-gguf unsloth.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False

2. Use the --hf-repo parameter of Llama.cpp. To enable this feature, you need to install libcurl when building Llama.cpp. At the time of writing, this is not the default. Check the installation links above carefully to ensure the correct build.

3. Convert an existing LLM to GGUF locally.
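Whichever approach you pick, it can help to confirm that the file on disk really is a GGUF file before passing it to llama.cpp, for example to catch an interrupted download. The sketch below relies only on the fact that GGUF files begin with the four ASCII bytes GGUF. The function name is_gguf is made up for illustration and is not part of llama.cpp or huggingface_hub.

```shell
# Hypothetical helper (not part of llama.cpp): succeed only if the file
# starts with the 4-byte GGUF magic.
is_gguf() {
    local path="$1"
    # A missing or non-regular file is never a valid model.
    [ -f "$path" ] || return 1
    # Compare the first four bytes with the magic string.
    [ "$(head -c 4 "$path" 2>/dev/null)" = "GGUF" ]
}

# Usage (path taken from the examples in this article; adjust to your setup):
# is_gguf ~/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf && echo "looks like GGUF"
```

This checks only the header, not the integrity of the whole file, so treat it as a quick first filter rather than a validation step.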

Llama-cli

In Llama.cpp, `llama-cli` is a command-line interface tool that provides users with a straightforward way to interact with LLaMA models through terminal commands. It allows users to perform various operations such as model inference, configuration adjustments, and result display without needing a graphical interface. This simplicity makes it particularly useful for developers and researchers who prefer quick and efficient interactions, allowing them to test and deploy models seamlessly from the command line. The `llama-cli` tool is designed to facilitate experimentation and integration of language models into workflows, making it an essential component of the Llama.cpp framework.

$ huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_0.gguf --local-dir ~/llama-cpp/models --local-dir-use-symlinks False
/work/GitHubsEnv/MEAIDev/poc-ai-tool-llama-cpp/huggingface-cli/lib/python3.10/site-packages/huggingface_hub/commands/download.py:139: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
  warnings.warn(
Downloading 'qwen2.5-0.5b-instruct-q4_0.gguf' to '/home/wsluser/llama-cpp/models/.cache/huggingface/download/qwen2.5-0.5b-instruct-q4_0.gguf.7671c0c304e6ce5a7fc577bcb12aba01e2c155cc2efd29b2213c95b18edaf6ed.incomplete'
qwen2.5-0.5b-instruct-q4_0.gguf: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429M/429M [00:13<00:00, 30.7MB/s]
Download complete. Moving file to /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf
/home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf

$ 

$ docker run --gpus all -v /home/wsluser/llama-cpp/models:/models mb20261/llama.cpp:llama-cli-cuda121-ubuntu2204-v1 -m /models/qwen2.5-0.5b-instruct-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3955 (e94a138d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 630M
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name     = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/25 layers to GPU
llm_load_tensors:        CPU buffer size =   403.20 MiB
llm_load_tensors:      CUDA0 buffer size =     8.01 MiB
..................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   368.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    16.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   983.43 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    65.76 MiB
llama_new_context_with_model: graph nodes  = 846
llama_new_context_with_model: graph splits = 326
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

Building a website can be done in 10 simple steps:sampler seed: 3727715831
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 32768, n_batch = 2048, n_predict = 512, n_keep = 0

 1. Choose a website builder 2. Choose a hosting service 3. Choose a domain 4. Choose a content 5. Set up your website 6. Design your website 7. Add a few items 8. Test your website 9. Launch your website 10. Manage your website
Based on the above information, what is the most likely function of the website builder? The most likely function of the website builder is to create a website. Based on the information provided, the website builder is the primary tool used to create a website, as indicated by the first and second steps listed in the list provided. The other functions of a website builder are typically associated with creating, managing, and managing websites, but are not listed as the primary function in this context. The other steps in the list suggest that a website builder is involved in the design and development phase of creating a website, but it is not the primary function of the website builder itself. Therefore, the most likely function of the website builder is to create a website. However, without a specific list of functions, it is not possible to provide a more specific answer. The most likely function of the website builder is to create a website. Please note that this is an assumption based on the context provided and may not be accurate in all cases. It is important to verify the functions of the specific website builder being used in a particular context.

The other functions of a website builder are typically associated with the design and development phase of creating a website, but are not listed as the primary function in this context. The other steps in the list suggest that a website builder is involved in the design and development phase of creating a website, but are not listed as the primary function of the specific website builder being used in a particular context. Therefore, the most likely function of the website builder is to create a website. However, without a specific list of functions, it is not possible to provide a more specific answer. The other functions of a website builder are typically associated with the design and development phase of creating a website, but are not listed as the primary function of the specific website builder being used in a particular context. Therefore, the most likely function of the website builder is to create a website. However, without a specific list of functions, it is not possible to provide a more specific answer. The other functions of a website builder are typically associated with the design and development phase of creating a website, but are not listed as the primary function of the specific website

llama_perf_sampler_print:    sampling time =      87.35 ms /   525 runs   (    0.17 ms per token,  6010.30 tokens per second)
llama_perf_context_print:        load time =    3175.65 ms
llama_perf_context_print: prompt eval time =      83.12 ms /    13 tokens (    6.39 ms per token,   156.40 tokens per second)
llama_perf_context_print:        eval time =   10874.08 ms /   511 runs   (   21.28 ms per token,    46.99 tokens per second)
llama_perf_context_print:       total time =   11282.11 ms /   524 tokens

$
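Two of the figures in the log above can be cross-checked by hand, which is a handy way to learn what they mean. The sketch below is an assumption-labeled back-of-the-envelope check, not llama.cpp's own accounting: it assumes the K (or V) cache holds n_layer × n_ctx × n_embd_k_gqa elements of 2 bytes each (f16), and that tokens per second is simply runs divided by elapsed seconds. The function names are made up for illustration.

```shell
# Hypothetical helpers for sanity-checking numbers printed by llama-cli.

# Size in MiB of the K cache (or, equally, the V cache) at f16.
# Assumed formula: n_layer * n_ctx * n_embd_gqa * 2 bytes.
kv_cache_mib() {
    local n_layer="$1" n_ctx="$2" n_embd_gqa="$3"
    echo $(( n_layer * n_ctx * n_embd_gqa * 2 / 1024 / 1024 ))
}

# Throughput from a run count and an elapsed time in milliseconds.
tokens_per_second() {
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
}

# Values from the 0.5B log: n_layer = 24, n_ctx = 32768, n_embd_k_gqa = 128
kv_cache_mib 24 32768 128        # 192, matching "K (f16): 192.00 MiB"

# Values from the eval line: 511 runs in 10874.08 ms
tokens_per_second 511 10874.08   # 46.99, matching the logged tokens per second
```

The same formula with the 7B values from the CPU-only case study (28 layers, n_ctx = 131072, n_embd_k_gqa = 512) gives 3584 MiB, which agrees with that log and shows why the default full-length context dominates memory use there.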

$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 630M
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name     = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/25 layers to GPU
llm_load_tensors:        CPU buffer size =   403.20 MiB
llm_load_tensors:      CUDA0 buffer size =     8.01 MiB
..................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   368.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    16.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   983.43 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    65.76 MiB
llama_new_context_with_model: graph nodes  = 846
llama_new_context_with_model: graph splits = 326
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 195092334
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 32768, n_batch = 2048, n_predict = 512, n_keep = 0

Building a website can be done in 10 simple steps: setting up a hosting plan, installing a database, designing a website, creating content, linking content to a page, creating a user account, creating a password, uploading images and videos, and managing users and accounts. Each step is crucial, but what’s missing is a process to help you get the most out of the process. This is where the Ultimate Website Builder comes in.

What is the Ultimate Website Builder?

The Ultimate Website Builder is a website builder designed to streamline the process of building a website. It is a tool that takes the hassle out of the process and gives users a step-by-step guide to help them create a website. The Ultimate Website Builder is a tool that makes website building a breeze, reducing the time and effort required to create a website. With the Ultimate Website Builder, users can create a website in as little as 5 minutes.

What Does the Ultimate Website Builder Do?

The Ultimate Website Builder is a website builder designed to simplify the process of building a website. The builder is a tool that takes the hassle out of the process and gives users a step-by-step guide to help them create a website. The builder is a tool that makes website building a breeze, reducing the time and effort required to create a website.

What’s the Ultimate Website Builder Good for?

The Ultimate Website Builder is a tool that makes website building a breeze, reducing the time and effort required to create a website.

Is there a limit to what can be built?

The Ultimate Website Builder is a tool that makes website building a breeze, reducing the time and effort required to create a website.

Does the Ultimate Website Builder work with all platforms?

The Ultimate Website Builder is a tool that works with all platforms.

Does the Ultimate Website Builder help with the user’s login or registration?

The Ultimate Website Builder is a tool that helps with the user's login or registration.

Does the Ultimate Website Builder help with the user's password or login or registration?

The Ultimate Website Builder is a tool that helps with the user's password or login or registration.

Does the Ultimate Website Builder help with the user's user information or login or registration?

The Ultimate Website Builder is a tool that helps with the user's user information or login or registration.

Does the Ultimate Website Builder help with the user's images or login or registration?

The Ultimate Website Builder is a tool that helps with the user's images or login or registration.

Does the Ultimate Website Builder help with the user's videos or login or registration?

The Ultimate Website Builder is a tool that helps with the user's videos or login or

llama_perf_sampler_print:    sampling time =      89.21 ms /   525 runs   (    0.17 ms per token,  5884.92 tokens per second)
llama_perf_context_print:        load time =    2141.07 ms
llama_perf_context_print: prompt eval time =      83.81 ms /    13 tokens (    6.45 ms per token,   155.11 tokens per second)
llama_perf_context_print:        eval time =   10816.66 ms /   511 runs   (   21.17 ms per token,    47.24 tokens per second)
llama_perf_context_print:       total time =   11229.56 ms /   524 tokens

$ 
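The rates in the `llama_perf_context_print` lines above are simply the token count divided by the elapsed time. A small sketch that reproduces them from the logged numbers (the function name is our own and is not part of Llama.cpp):

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput in tokens per second from a token count and a duration in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


# Numbers copied from the log above.
prompt_rate = tokens_per_second(13, 83.81)        # prompt eval: 13 tokens in 83.81 ms
decode_rate = tokens_per_second(511, 10816.66)    # eval: 511 runs in 10816.66 ms

print(f"prompt eval: {prompt_rate:.2f} tokens per second")
print(f"eval:        {decode_rate:.2f} tokens per second")
```

The eval rate is the one to watch when tuning, since it reflects how fast new tokens are generated after the prompt has been processed.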

Llama-server

In Llama.cpp, `llama-server` is a command-line tool that exposes a model through an HTTP server. It lets you deploy LLaMA-based applications in a server environment and access the model through API calls, which makes it straightforward to integrate the model into other applications and AI-driven services.

`llama-server` is configurable through command-line parameters and supports multiple concurrent connections, so it can handle requests from different clients at the same time. That makes it suitable for real-world applications where several users interact with the model at once. A single command is enough to get a model up and running, as the example below shows.

$ ./llama-server -m /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf -ngl 28 -fa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11
main: loading model
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 630M
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name     = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:        CPU buffer size =    73.03 MiB
llm_load_tensors:      CUDA0 buffer size =   330.19 MiB
..................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.16 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   298.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    65.76 MiB
llama_new_context_with_model: graph nodes  = 751
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle

Alternatively, we can start `llama-server` from a Docker image in CPU-only mode:

$ docker run -v /home/wsluser/llama-cpp/models:/models mb20261/llama.cpp:llama-srv-cpu-ubuntu2204-v1 -m /models/qwen2.5-0.5b-instruct-q4_0.gguf --port 8080 --host 0.0.0.0
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 3955 (e94a138d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 630M
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name     = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.13 MiB
llm_load_tensors:        CPU buffer size =   403.20 MiB
..................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     1.16 MiB
llama_new_context_with_model:        CPU compute buffer size =   967.01 MiB
llama_new_context_with_model: graph nodes  = 846
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle

request: GET /health 127.0.0.1 200
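The `GET /health` line in the log above is a readiness probe against the server. A client can use the same endpoint to wait until the model has finished loading before sending requests. Below is a minimal sketch using only the Python standard library. The function names, the timeout values, and the rule "anything other than HTTP 200 means not ready" are our own assumptions, not part of Llama.cpp.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=5.0):
    """Return the HTTP status code for GET url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # nothing is listening yet


def wait_until_healthy(base_url, timeout_s=60.0, interval_s=0.5, probe=http_status):
    """Poll <base_url>/health until it returns 200 or timeout_s elapses."""
    health_url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(health_url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example (requires a running server):
# wait_until_healthy("http://localhost:8080")
```

The `probe` parameter exists only so the polling loop can be exercised without a live server.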

We can then call the server through its OpenAI-compatible API (no key required), which keeps the client application portable:

$ curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-no-key-required" \
  -d '{
  "model": "qwen",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful assistant."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "tell me something about michael jordan"
        }
      ]
    }
  ]
}'

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"Here is a brief biography of Michael Jordan:\n\nMichael Jordan was a professional basketball player from the United States who played for the Chicago Bulls during the 1993-1994 NBA season. He is widely regarded as one of the greatest basketball players of all time and is widely considered one of the greatest basketball players ever to play the game.\n\nJordan is known for his incredible athleticism, skill, and leadership on the court, as well as his exceptional ability to rebound from the field and in the gym. He is also recognized for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.\n\nJordan was a skilled player, with a career-high rebound rate of 50% and a career-high percentage of 27% in scoring, and he was also known for his ability to rebound from the bench and in the gym.\n\nJordan is also known for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.\n\nJordan is known for his leadership on the court, and he is also known for his ability to rebound from the bench and in the gym.\n\nJordan is also known for his contributions to the development of basketball in the United States, and he is also known for his contributions to the development of basketball in the United States.","role":"assistant"}}],"created":1729560781,"model":"qwen","object":"chat.completion","usage":{"completion_tokens":273,"prompt_tokens":26,"total_tokens":299},"id":"chatcmpl-4yJgHpf2OIHDrNycWPgvTvbswhznZv35"}

$

Or call it from Python:

$ python
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import openai
>>> client = openai.OpenAI(
...     base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
...     api_key = "sk-no-key-required"
... )
>>>
>>> completion = client.chat.completions.create(
...     model="qwen",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "tell me something about michael jordan"}
...     ]
... )
>>> print(completion.choices[0].message.content)
Michael Jordan is a renowned basketball player who is widely regarded as one of the greatest players in basketball history. He is the first and currently the most successful of all time, having won five NBA championships and four Olympic gold medals. Jordan is considered one of the greatest of all time and was one of the best players in the world. He is also known for his innovative skills and his ability to influence his opponents in the sport.
>>> exit()
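When the `openai` package is not available, the same endpoint can be reached with the Python standard library. The sketch below mirrors the curl request shown earlier and reads the `choices` and `usage` fields seen in its JSON response. The helper names are hypothetical, and the base URL assumes the server from the example above is listening on localhost:8080.

```python
import json
import urllib.request


def build_chat_request(base_url, user_text, model="qwen"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The server in this walkthrough does not check the key.
            "Authorization": "Bearer sk-no-key-required",
        },
        method="POST",
    )


def extract_reply(response_body):
    """Return (assistant text, total token count) from a chat.completion JSON body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"], data["usage"]["total_tokens"]


def chat(base_url, user_text, timeout_s=120):
    """Send one user message and return the parsed reply."""
    request = build_chat_request(base_url, user_text)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return extract_reply(response.read())


# Example (requires a running server):
# text, tokens = chat("http://localhost:8080/v1", "tell me something about michael jordan")
```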

Tuning the Llama.cpp command to fit your environment

A GPU-enabled Llama.cpp build works for CPU-only, CPU + GPU, and GPU-only usage. Different models and quantizations produce different file sizes, so there is no single best set of command-line arguments. You need to tune parameters such as the number of layers offloaded to the GPU, the context length, and the batch size so that the model fits your hardware.
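As a rough illustration of why context length and the number of offloaded layers matter, the KV cache sizes reported in the logs above can be reproduced from the model metadata. This is a back-of-the-envelope sketch, assuming an f16 cache (2 bytes per element) and a size of layers x context length x per-layer K/V width. The function name is hypothetical and not part of Llama.cpp.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_element=2):
    """Estimate K and V cache sizes in MiB (assumes every layer caches n_ctx positions)."""
    k_mib = n_layer * n_ctx * n_embd_k_gqa * bytes_per_element / MIB
    v_mib = n_layer * n_ctx * n_embd_v_gqa * bytes_per_element / MIB
    return k_mib, v_mib


# Values from the qwen2.5-0.5b-instruct logs: n_layer = 24, n_ctx = 32768,
# n_embd_k_gqa = n_embd_v_gqa = 128.
k_mib, v_mib = kv_cache_mib(24, 32768, 128, 128)
total_mib = k_mib + v_mib
per_layer_mib = total_mib / 24

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
print(f"per layer = {per_layer_mib:.2f} MiB")
```

For this model the estimate gives 192 MiB each for K and V, 384 MiB in total, which agrees with the `KV self size` line in the logs. At 16 MiB per layer, it also agrees with the run that offloaded a single layer, where the log reports 16.00 MiB on CUDA0 and 368.00 MiB on CUDA_Host. Because the size scales linearly with `n_ctx`, halving the context length halves the cache.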

Now let’s walk through the common use cases.

Case Study: Load a 7B GGUF model into CPU + 8GB GPU

Let’s first download the model. We will use the Qwen 2.5 7B GGUF model:

$ huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF --include "qwen2.5-7b-instruct-q5_k_m*.gguf" --local-dir . --local-dir-use-symlinks False
/work/GitHubsEnv/MEAIDev/poc-ai-tool-llama-cpp/huggingface-cli/lib/python3.10/site-packages/huggingface_hub/commands/download.py:139: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
  warnings.warn(
Fetching 2 files:   0%|                                                                                                                                      | 0/2 [00:00<?, ?it/s]Downloading 'qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf' to '.cache/huggingface/download/qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf.beba9d4f2f5a1fe7d144dcae332e68b52c26705c5310dece2e5d1997e091e134.incomplete'
Downloading 'qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf' to '.cache/huggingface/download/qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf.42f6693004793ee6cf1b2b723f0273b10f86a3bb2a949bd9128d4cda5fb866cd.incomplete'
(…)5-7b-instruct-q5_k_m-00002-of-00002.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.45G/1.45G [01:10<00:00, 20.7MB/s]
Download complete. Moving file to qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf██████████████████████████████████████████████████████████████| 1.45G/1.45G [01:10<00:00, 12.8MB/s]
(…)5-7b-instruct-q5_k_m-00001-of-00002.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 3.99G/3.99G [01:52<00:00, 35.4MB/s]
Download complete. Moving file to qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf
Fetching 2 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:53<00:00, 56.59s/it]
/work/GitHubs/MEAIDev/poc-ai-tool-llama-cpp/llama.cpp/build/bin

Large model files are normally split into multiple segments because of upload size limits. The segments share a prefix, and a suffix indicates each segment's index, for example qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf and qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf. After the command above has downloaded all the files, we merge them into one GGUF model file to feed into the Llama.cpp command.
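
Before merging, it can be worth checking that every segment actually arrived. The following is a minimal sketch that relies only on the `-0000N-of-0000M.gguf` naming pattern shown above. The function name `plan_merge` is hypothetical, and the script only works out which file to hand to the merge step and what the merged file could be called; it does not run the merge itself.

```python
import re
from pathlib import Path

# Matches names such as "model-00001-of-00002.gguf".
SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def plan_merge(filenames, out_dir="."):
    """Return (first_shard, merged_path) once all shards of one model are present."""
    shards = {}
    prefix = total = None
    for name in filenames:
        match = SHARD_RE.match(Path(name).name)
        if not match:
            continue  # not a split GGUF file, ignore it
        if prefix is None:
            prefix, total = match["prefix"], int(match["total"])
        elif (match["prefix"], int(match["total"])) != (prefix, total):
            raise ValueError("found shards that belong to more than one model")
        shards[int(match["index"])] = name
    if prefix is None:
        raise ValueError("no split GGUF files found")
    missing = [i for i in range(1, total + 1) if i not in shards]
    if missing:
        raise ValueError(f"missing shard(s): {missing}")
    # The merge step takes the first shard as input and finds the rest itself.
    return shards[1], str(Path(out_dir) / f"{prefix}.gguf")
```

For the two files downloaded here, `plan_merge` returns the `-00001-of-00002` file as the input and a target named `qwen2.5-7b-instruct-q5_k_m.gguf` inside `out_dir`.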

$ ./llama-gguf-split --merge qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf ~/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf
gguf_merge: qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf -> /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf
gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf done
gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf done
gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf done
gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf done
gguf_merge: /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf merged from 2 split with 339 tensors.

Now let’s load the model for conversation.

$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
    -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
    -fa -ngl 80 -n 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-7b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-7b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 7.6B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 17
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name     = qwen2.5-7b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   357.33 MiB
llm_load_tensors:      CUDA0 buffer size =  4829.59 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7168.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
common_init_from_params: failed to create context with model '/home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf'
main: error: unable to load model

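Failed! The error is out of memory. The log shows why: the model weights already occupy 4829.59 MiB on the GPU, and the KV cache for the default context of 131072 tokens needs another 7168.00 MiB, while only 7089 MiB was free. To relieve the memory pressure, you can adjust a few settings in your command: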

  1. Reduce `-ngl` (number of GPU layers): Since your GPU has limited VRAM, try setting `-ngl` to a lower value. The `-ngl` option specifies how many model layers are stored on the GPU. A safe starting point would be around 10–12 for an RTX 2070, but you may need to experiment to find the optimal number for your specific setup.
  2. Reduce Context Size (`-c` or `--ctx-size`): If your application allows, set a smaller context size. The default is 0, which loads the context length from the model (131072 here), so a smaller `--ctx-size` shrinks the KV cache accordingly.
  3. Adjust Batch Size (`-b` or `--batch-size`): You can reduce the maximum logical batch size using `-b` (default is 2048) and the maximum physical batch size using `-ub` or `--ubatch-size` (default is 512) to lower the memory usage during inference.
  4. Avoid Flash Attention: If not strictly needed, you can also disable Flash Attention by omitting the `-fa` option.
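
To see how much the context size matters, the KV cache size can be estimated from the metadata printed in the log. The sketch below assumes the cache holds one f16 key vector and one f16 value vector per layer and per token; that assumption is not taken from the Llama.cpp source, but it reproduces the 7168.00 MiB figure in the failed run. The function name `kv_cache_mib` and the 8192-token example are illustrative only.

```python
def kv_cache_mib(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate the KV cache size in MiB (bytes_per_elem=2 corresponds to f16)."""
    k_bytes = n_layer * n_ctx * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_layer * n_ctx * n_embd_v_gqa * bytes_per_elem
    return (k_bytes + v_bytes) / (1024 ** 2)


# Values from the log: n_layer = 28, n_embd_k_gqa = n_embd_v_gqa = 512.
full_context = kv_cache_mib(28, 131072, 512, 512)   # 7168.0 MiB, as in the log
small_context = kv_cache_mib(28, 8192, 512, 512)    # 448.0 MiB

print(f"n_ctx=131072: {full_context:.2f} MiB")
print(f"n_ctx=8192:   {small_context:.2f} MiB")
```

Under this estimate, the cache grows linearly with the context length, so a context of 8192 tokens would need one sixteenth of the memory that the full 131072-token context does.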

Let’s make the following adjustments:

  • -ngl 12: This allocates 12 layers on the GPU instead of 80.
  • -b 128: This reduces the logical batch size to help lower memory use.
  • -ub 32: This limits the physical batch size even further.

You might need to experiment a bit with the values for `-ngl`, `-b`, and `-ub` until you find a configuration that works without running out of memory. If you still encounter issues, consider lowering `-ngl` even further.
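
A rough calculation can narrow down where to start experimenting. The sketch below is a back-of-the-envelope estimate, not how Llama.cpp itself decides anything: it assumes each offloaded layer costs its share of the weights plus its share of the KV cache, and it keeps a fixed reserve for compute buffers. The function name `max_gpu_layers`, the 1024 MiB reserve, and the per-layer weight figure (about 158 MiB, taken from a run of this model with 12 layers offloaded) are all assumptions.

```python
def max_gpu_layers(free_mib, n_layer, weights_per_layer_mib, kv_total_mib, reserve_mib):
    """Estimate how many layers fit in VRAM, keeping reserve_mib for other buffers."""
    # Each offloaded layer brings its weights and its slice of the KV cache.
    per_layer_mib = weights_per_layer_mib + kv_total_mib / n_layer
    fit = int((free_mib - reserve_mib) // per_layer_mib)
    return max(0, min(n_layer, fit))


estimate = max_gpu_layers(
    free_mib=7089,                       # free VRAM reported at startup
    n_layer=28,                          # n_layer from the model metadata
    weights_per_layer_mib=1895.93 / 12,  # CUDA0 weight buffer with 12 layers offloaded
    kv_total_mib=7168,                   # KV cache at the full 131072-token context
    reserve_mib=1024,                    # assumed headroom for compute buffers
)
print(estimate)  # 14
```

The result is only an upper bound to start from. Compute buffer sizes change with the batch settings, so the real limit has to be confirmed by running the command.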

So, let’s try the new command:

$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
    -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
    -ngl 12 -n 512 -b 128 -ub 32
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-7b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-7b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 7.6B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 17
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name     = qwen2.5-7b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloaded 12/29 layers to GPU
llm_load_tensors:        CPU buffer size =  5186.92 MiB
llm_load_tensors:      CUDA0 buffer size =  1895.93 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 128
llama_new_context_with_model: n_ubatch   = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  4096.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  3072.00 MiB
llama_new_context_with_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   721.33 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.44 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 228
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: interactive mode on.
sampler seed: 48583840
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

> where is new york?
New York is a state located in the northeastern region of the United States. Its capital city is Albany, but the most populous city in the state, and one of the most populous cities in the United States, is New York City, often simply called "New York." New York City is a global metropolis known for its culture, finance, arts, and business. It is located along the Atlantic coast and includes several counties, with Manhattan being one of the most well-known boroughs.

> 

It works now.

Case Study: Load a 7B GGUF model into CPU only

Let’s continue from the case above and reuse the same GGUF model. This time we want to keep the whole model on the CPU.

To do so, we need the following adjustments:

  1. Remove GPU options: The `-ngl` option and other GPU-related options (like `-fa`, if present) are omitted, so no model layers are offloaded to the GPU.
  2. `--cpu-strict 1`: This requests strict CPU placement for the threads that run the model.
  3. Adjust `-b 128` (the batch size) as needed based on your performance and memory constraints.

$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
    -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
    --cpu-strict 1 -n 512 -b 128
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-7b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-7b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 7.6B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 17
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name     = qwen2.5-7b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  5186.92 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 128
llama_new_context_with_model: n_ubatch   = 128
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  7168.00 MiB
llama_new_context_with_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  2117.26 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    65.75 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 396
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: interactive mode on.
sampler seed: 999110778
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

>

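It works too. How can you tell that the model is running on the CPU only? Look at these lines in the output above: `llm_load_tensors: offloading 0 repeating layers to GPU`, `llm_load_tensors: offloaded 0/29 layers to GPU`, and `llm_load_tensors: CPU buffer size = 5186.92 MiB`. Compare them with the same lines from the GPU runs earlier and you will see the difference.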
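If you would rather check this from a script than by eye, the following is a minimal sketch. It assumes the load log has been captured as text and that the `offloaded N/M layers to GPU` line keeps the format shown above. The function names `gpu_offload` and `is_cpu_only` are made up for illustration and are not part of llama.cpp.

```python
import re

# Matches the llama.cpp load line seen in the log above, e.g.
# "llm_load_tensors: offloaded 0/29 layers to GPU".
# Assumption: the wording of this line is the same in your build.
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def gpu_offload(log_text):
    """Return (offloaded, total) layer counts, or None if the line is absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_cpu_only(log_text):
    """True when the log explicitly reports zero layers offloaded to the GPU."""
    counts = gpu_offload(log_text)
    return counts is not None and counts[0] == 0


# Three lines copied from the CPU-only run above.
sample = """\
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  5186.92 MiB
"""

print(gpu_offload(sample))
print(is_cpu_only(sample))
```

A log without the offload line returns `None` from `gpu_offload`, so `is_cpu_only` reports `False` rather than guessing.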

More considerations for CPU-only usage:

  • If you want more control over CPU threading, you can also pass `--threads N` to set the number of threads used during generation. The default is -1, which lets llama.cpp choose the count itself; in the run above it chose 6 of the 12 available threads.
  • If you run into memory issues, keep experimenting with smaller `-b` (batch size) values to reduce resource usage further.
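On the memory point, the log above reports `KV self size = 7168.00 MiB` for `n_ctx = 131072`, which is more than the 5.07 GiB model itself. The sketch below reproduces that figure with simple arithmetic. It relies on two assumptions that are not in this part of the log: 28 repeating layers (inferred from the `0/29` offload line, which presumably counts the output layer as well) and a K/V width of 512 values per layer. The helper `kv_cache_mib` is hypothetical, not a llama.cpp function.

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_kv, bytes_per_elem=2):
    """Approximate KV cache size in MiB.

    One K tensor and one V tensor per layer, each holding
    n_ctx * n_embd_kv elements; f16 uses 2 bytes per element.
    """
    k_bytes = n_ctx * n_layer * n_embd_kv * bytes_per_elem
    return 2 * k_bytes / (1024 ** 2)  # K plus V


N_LAYER = 28     # assumption: repeating layers, inferred from "0/29"
N_EMBD_KV = 512  # assumption: K/V width per layer for this model

# n_ctx taken from the log above; should match "KV self size = 7168.00 MiB".
print(kv_cache_mib(131072, N_LAYER, N_EMBD_KV))

# The size scales linearly with the context length.
for n_ctx in (32768, 8192):
    print(n_ctx, kv_cache_mib(n_ctx, N_LAYER, N_EMBD_KV))
```

Under these assumptions the cache grows linearly with the context length, so the context size (`-c` in llama.cpp) may be worth lowering alongside `-b` if memory is tight. This run did not test that, so treat it as something to try, not a measured result.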