he models via API calls. This makes it straightforward to integrate the models into applications and easier for developers to build and manage AI-driven services.

The llama-server tool offers configurable parameters and support for multiple concurrent connections, so it can handle requests from different clients at the same time. That makes it suitable for real-world applications where several users interact with the model simultaneously. Its commands are simple, so developers can get their LLaMA models up and running without extensive setup. It is a practical way to serve LLaMA models in a wide range of applications.<div id="a688"><pre>$ ./llama-server -m /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf -ngl 28 -fa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11
main: loading model
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 5: general.size_label str = 630M
llama_model_loader: - kv 6: qwen2.block_count u32 = 24
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 2
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_0: 169 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 896
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 14
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 128
llm_load_print_meta: n_embd_v_gqa = 128
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4864
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 630.17 M
llm_load_print_meta: model size = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.26 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: CPU buffer size = 73.03 MiB
llm_load_tensors: CUDA0 buffer size = 330.19 MiB
..................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 298.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 65.76 MiB
llama_new_context_with_model: graph nodes = 751
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle</pre></div>Alternatively, we can start llama-server from a Docker image in CPU-only mode:<div id="bcf1"><pre>$ docker run -v /home/wsluser/llama-cpp/models:/models mb20261/llama.cpp:llama-srv-cpu-ubuntu2204-v1 -m /models/qwen2.5-0.5b-instruct-q4_0.gguf --port 8080 --host 0.0.0.0
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 3955 (e94a138d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-0.5b-instruct
llama_model_loader: - kv 5: general.size_label str = 630M
llama_model_loader: - kv 6: qwen2.block_count u32 = 24
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 2
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_0: 169 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 896
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 14
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 128
llm_load_print_meta: n_embd_v_gqa = 128
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4864
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 630.17 M
llm_load_print_meta: model size = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.13 MiB
llm_load_tensors: CPU buffer size = 403.20 MiB
..................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CPU output buffer size = 1.16 MiB
llama_new_context_with_model: CPU compute buffer size = 967.01 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle

request: GET /health 127.0.0.1 200 </pre></div>Then we can call the standard OpenAI-compatible API (no key required), which keeps our client application portable.<div id="47bd"><pre>$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
-H "Authorization: Bearer sk-no-key-required" \
-d '{ "model": "qwen", "messages": [ { "role": "system", "content": [ { "type": "text", "text": "You are a helpful assistant." } ] }, { "role": "user", "content": [ { "type": "text", "text": "tell me something about michael jordan" } ] } ] }'

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"Here is a brief biography of Michael Jordan:\n\nMichael Jordan was a professional basketball player from the United States who played for the Chicago Bulls during the 1993-1994 NBA season. He is widely regarded as one of the greatest basketball players of all time and is widely considered one of the greatest basketball players ever to play the game.\n\nJordan is known for his incredible athleticism, skill, and leadership on the court, as well as his exceptional ability to rebound from the field and in the gym. He is also recognized for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.\n\nJordan was a skilled player, with a career-high rebound rate of 50% and a career-high percentage of 27% in scoring, and he was also known for his ability to rebound from the bench and in the gym.\n\nJordan is also known for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.\n\nJordan is known for his leadership on the court, and he is also known for his ability to rebound from the bench and in the gym.\n\nJordan is also known for his contributions to the development of basketball in the United States, and he is also known for his contributions to the development of basketball in the United States.","role":"assistant"}}],"created":1729560781,"model":"qwen","object":"chat.completion","usage":{"completion_tokens":273,"prompt_tokens":26,"total_tokens":299},"id":"chatcmpl-4yJgHpf2OIHDrNycWPgvTvbswhznZv35"}

$</pre></div>Or call it from Python:<div id="a6be"><pre>$ python
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import openai
>>>
>>> client = openai.OpenAI(
...     base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
...     api_key = "sk-no-key-required"
... )
>>>
>>> completion = client.chat.completions.create(
...     model="qwen",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "tell me something about michael jordan"}
...     ]
... )
>>> print(completion.choices[0].message.content)
Michael Jordan is a renowned basketball player who is widely regarded as one of the greatest players in basketball history. He is the first and currently the most successful of all time, having won five NBA championships and four Olympic gold medals. Jordan is considered one of the greatest of all time and was one of the best players in the world. He is also known for his innovative skills and his ability to influence his opponents in the sport.
>>> exit()</pre></div><h1 id="748e">Tuning the Llama.cpp command to fit your environment</h1>A GPU-enabled Llama.cpp build works for CPU-only, CPU+GPU, and GPU-only usage. Different models and quantizations produce different file sizes, so there is no single best set of command-line arguments. You need to tune parameters such as the number of layers offloaded to the GPU, the context length, and the batch size until the model fits your hardware. Now let’s walk through the common use cases.<h2 id="e0e0">Case Study: Load a 7B GGUF model into CPU + 8GB GPU</h2>Let’s first download the model. 
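How much fits in memory depends heavily on the KV cache, which grows linearly with the context length. As a rough sketch (not llama.cpp's own code), the cache size can be estimated from four values that llama.cpp prints in its load log: `n_layer`, `n_ctx`, `n_embd_k_gqa`, and `n_embd_v_gqa`. The function name `kv_cache_mib` and the default of 2 bytes per element (an f16 cache) are assumptions made for illustration, and the estimate ignores the model weights and compute buffers.

```python
def kv_cache_mib(n_layer: int, n_ctx: int, n_embd_k_gqa: int,
                 n_embd_v_gqa: int, bytes_per_elem: int = 2) -> float:
    """Estimate the KV cache size in MiB.

    Each layer stores one K row and one V row per context position.
    bytes_per_elem=2 assumes an f16 cache (an assumption of this sketch).
    """
    total_bytes = n_layer * n_ctx * (n_embd_k_gqa + n_embd_v_gqa) * bytes_per_elem
    return total_bytes / (1024 ** 2)


# Values taken from the load logs shown in this article.
print(kv_cache_mib(28, 131072, 512, 512))  # 7B model, full 131072 context -> 7168.0
print(kv_cache_mib(24, 32768, 128, 128))   # 0.5B model, 32768 context -> 384.0

# Hypothetical smaller context for the 7B model.
print(kv_cache_mib(28, 8192, 512, 512))    # -> 448.0
```

The first two results match the `KV self size` lines in the logs (7168.00 MiB and 384.00 MiB). On an 8 GB GPU, the full-context cache of the 7B model alone would take most of the memory, which is why context length (set with `-c` / `--ctx-size` in llama.cpp) is worth tuning alongside `-ngl`.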
We will use the Qwen 2.5 7B GGUF model:<div id="9e13" class="link-block"> <a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF"> <div> <div> <h2>Qwen/Qwen2.5-7B-Instruct-GGUF · Hugging Face</h2> <div><h3>We're on a journey to advance and democratize artificial intelligence through open source and open science.</h3></div> <div>huggingface.co</div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*BYIkavMDsxV5oV6f)"></div> </div> </div> </a> </div>The Q5_K_M quantization comes as two split files, so we first merge them into a single GGUF file:<div id="1c8b"><pre>$ ./llama-gguf-split --merge qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf /llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf
gguf_merge: qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf -> /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf
gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf done
gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf done
gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf done
gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf done
gguf_merge: /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf merged from 2 split with 339 tensors.</pre></div>Now let’s load the model for conversation.<div id="25a0"><pre>$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
-co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
-fa -ngl 80 -n 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct
llama_model_loader: - kv 5: general.size_label str = 7.6B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 17
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "

$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
-co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
-ngl 12 -n 512 -b 128 -ub 32
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct
llama_model_loader: - kv 5: general.size_label str = 7.6B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 17
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: split.no u16 = 0
llama_model_loader: - kv 27: split.count u16 = 0
llama_model_loader: - kv 28: split.tensors.count i32 = 339
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q5_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name = qwen2.5-7b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloaded 12/29 layers to GPU
llm_load_tensors: CPU buffer size = 5186.92 MiB
llm_load_tensors: CUDA0 buffer size = 1895.93 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 128
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4096.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB
llama_new_context_with_model: KV self size = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 721.33 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 16.44 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 228
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

main: interactive mode on.
sampler seed: 48583840
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0

== Running in interactive mode. ==

Press Ctrl+C to interject at any time.
Press Return to return control to the AI.
To return control without starting a new line, end your input with '/'.
If you want to submit another line, end your input with '\'.

system You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

> where is new york?
New York is a state located in the northeastern region of the United States. Its capital city is Albany, but the most populous city in the state, and one of the most populous cities in the United States, is New York City, often simply called "New York." New York City is a global metropolis known for its culture, finance, arts, and business. It is located along the Atlantic coast and includes several counties, with Manhattan being one of the most well-known boroughs.

> </pre></div>It works now: 12 of the 29 layers are offloaded to the GPU and the rest run on the CPU.<h2 id="5775">Case Study: Load a 7B GGUF model into CPU only</h2>Let’s continue from the case above and reuse the same GGUF model. This time we want to run the whole model on the CPU only. To do so, we make the following adjustments:<ol><li>Remove GPU options: omit -ngl and other GPU-related options (such as -fa, if present), so that the model is not offloaded to the GPU.</li><li>--cpu-strict 1: enables strict CPU placement for the inference threads.</li><li>Adjust -b 128 (the batch size) as needed based on your performance and memory constraints.</li></ol><div id="5fe2"><pre>$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
-co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
--cpu-strict 1 -n 512 -b 128
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = qwen2.5-7b-instruct
llama_model_loader: - kv 3: general.version str = v0.1
llama_model_loader: - kv 4: general.finetune str = qwen2.5-7b-instruct
llama_model_loader: - kv 5: general.size_label str = 7.6B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 17
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: split.no u16 = 0
llama_model_loader: - kv 27: split.count u16 = 0
llama_model_loader: - kv 28: split.tensors.count i32 = 339
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q5_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name = qwen2.5-7b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors: CPU buffer size = 5186.92 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 128
llama_new_context_with_model: n_ubatch = 128
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 7168.00 MiB
llama_new_context_with_model: KV self size = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2117.26 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 65.75 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 396
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

main: interactive mode on.
sampler seed: 999110778
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0

== Running in interactive mode. ==

Press Ctrl+C to interject at any time.
Press Return to return control to the AI.
To return control without starting a new line, end your input with '/'.
If you want to submit another line, end your input with '\'.

system You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

></pre></div>It works too. How can we tell that the model is loaded on the CPU only? Look at the parameters below, taken from the output, and compare them with the GPU commands above to see the difference.<figure id="3b09"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Ga4qCMSPcPC68HvmXB0ygg.png"><figcaption></figcaption></figure>More considerations on CPU-only usage:<ul><li>If you want more control over CPU threading or affinities, you can also include --threads N to specify the number of threads to use during generation (the default is -1, which lets llama.cpp pick the thread count itself; the run above reports n_threads = 6).</li><li>If you experience memory issues, you can keep reducing -b to lower resource usage further. The context size also matters: in the run above the KV cache alone takes 7168 MiB at n_ctx = 131072.</li></ul></article></body>

LLM By Examples: Utilizing Llama.cpp by Command Line Tools for CLI and Server

Llama.cpp has emerged as a powerful framework for working with language models, providing developers with robust tools and functionalities. This article explores the practical utility of Llama.cpp through command line tools, enabling seamless interaction with the framework for both command line interfaces (CLI) and server applications.

Utilizing Llama.cpp via command line tools offers a unique, flexible approach to model deployment and interaction. Developers can efficiently carry out tasks such as initializing models, querying text generation, and managing input-output processes from the terminal. This process not only streamlines workflows but also enhances productivity by allowing quick iterations. Furthermore, the command line environment encourages automation through scripting, making it ideal for integrating Llama.cpp capabilities into larger software systems or workflows. From generating prompts and fine-tuning parameters to outputting results in various formats, the CLI tools enable a fully customizable experience. In this article, we will delve into practical examples and best practices for leveraging Llama.cpp to its full potential via command line interfaces.

If you are not familiar with the core concepts of Llama.cpp, take a look at the link below first.

LLM By Examples: A overview of Llama.cpp

Developed with an emphasis on performance and ease-of-use, Llama.cpp brings together the power of advanced algorithms…

medium.com

Preparation

To try the command line tools in llama.cpp, you first need to have llama.cpp installed. There are several options below; pick the one that best fits your environment.

LLM By Examples: Build Llama.cpp with GPU (CUDA) support

As the demand for advanced language models continues to surge, developers increasingly seek high-performance solutions…

medium.com

LLM By Examples: Build Llama.cpp for CPU only

In the evolving landscape of artificial intelligence, Llama.cpp stands out as an efficient tool for working with large…

medium.com

LLM By Examples: Llama.cpp Installation from pre-built binary

Llama.cpp is a versatile and efficient framework designed to support large language models, providing an accessible…

medium.com

LLM By Examples: Build Llama.cpp with customized Docker Images

Llama.cpp is an innovative library designed to facilitate the development and deployment of large language models. Its…

medium.com

Next, we need a GGUF model to run the tests. There are tons of GGUF models on Hugging Face. To keep it simple, we will use the Qwen 2.5 7B model below:

unsloth.Q4_K_M.gguf · MB20261/QWen2.5-7B-gguf at main

We're on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

Finally, Llama.cpp needs the GGUF model to be available locally. There are typically three approaches:

1. Use the Hugging Face client to download the model before calling Llama.cpp

pip install huggingface_hub

huggingface-cli download MB20261/QWen2.5-7B-gguf unsloth.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False

2. Use the --hf-repo parameter of Llama.cpp. To enable this feature, libcurl must be installed when you build Llama.cpp. At the time of writing, this is not the default. Check the installation links above carefully to make sure you have the correct build.

3. Convert an existing LLM to GGUF locally
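Approaches 1 and 2 can also be driven from a script. The snippet below is only a rough sketch: the function names `download_with_hub` and `hf_repo_argv` are made up for this example, the first one assumes the `huggingface_hub` package (installed above) and its `hf_hub_download` function, and the second assumes your llama-cli build accepts `--hf-repo` together with `--hf-file`. Check `./llama-cli --help` on your build before relying on it.

```python
import shlex
from pathlib import Path


def download_with_hub(repo_id: str, filename: str, local_dir: str) -> Path:
    """Approach 1 through the Python API instead of huggingface-cli.

    The import is done lazily so the rest of this sketch works even when
    huggingface_hub is not installed.
    """
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=repo_id, filename=filename,
                                local_dir=local_dir))


def hf_repo_argv(repo_id: str, filename: str, prompt: str,
                 binary: str = "./llama-cli") -> list[str]:
    """Approach 2: llama-cli fetches the file itself (needs a libcurl build).

    The flag names are an assumption about your build; this helper only
    assembles the argument list and does not run anything.
    """
    return [binary, "--hf-repo", repo_id, "--hf-file", filename, "-p", prompt]


if __name__ == "__main__":
    argv = hf_repo_argv("MB20261/QWen2.5-7B-gguf", "unsloth.Q4_K_M.gguf", "Hello")
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(argv))
```

The `--local-dir-use-symlinks False` option is left out on purpose: the download log in this article shows a FutureWarning saying it is ignored.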

Llama-cli

In Llama.cpp, `llama-cli` is a command-line interface tool that provides users with a straightforward way to interact with LLaMA models through terminal commands. It allows users to perform various operations such as model inference, configuration adjustments, and result display without needing a graphical interface. This simplicity makes it particularly useful for developers and researchers who prefer quick and efficient interactions, allowing them to test and deploy models seamlessly from the command line. The `llama-cli` tool is designed to facilitate experimentation and integration of language models into workflows, making it an essential component of the Llama.cpp framework.

$ huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_0.gguf --local-dir ~/llama-cpp/models --local-dir-use-symlinks False
/work/GitHubsEnv/MEAIDev/poc-ai-tool-llama-cpp/huggingface-cli/lib/python3.10/site-packages/huggingface_hub/commands/download.py:139: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
  warnings.warn(
Downloading 'qwen2.5-0.5b-instruct-q4_0.gguf' to '/home/wsluser/llama-cpp/models/.cache/huggingface/download/qwen2.5-0.5b-instruct-q4_0.gguf.7671c0c304e6ce5a7fc577bcb12aba01e2c155cc2efd29b2213c95b18edaf6ed.incomplete'
qwen2.5-0.5b-instruct-q4_0.gguf: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429M/429M [00:13<00:00, 30.7MB/s]
Download complete. Moving file to /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf
/home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf

$ 
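The two runs that follow pass the same llama-cli arguments, once through a Docker image and once through a locally built binary. Here is a small sketch of how the two relate, using only the flags that appear in this article (`-m`, `-p`, `-n`, `--n-gpu-layers`). The helper names are hypothetical. The only real difference is the model path, which inside the container is resolved under the `/models` mount point.

```python
import shlex
from pathlib import PurePosixPath


def llama_args(model_path, prompt, n_predict=512, n_gpu_layers=1):
    """Arguments shared by both launchers (flags as used in this article)."""
    return ["-m", str(model_path), "-p", prompt,
            "-n", str(n_predict), "--n-gpu-layers", str(n_gpu_layers)]


def native_argv(models_dir, model_file, prompt, binary="./llama-cli"):
    """Run the locally built binary against the model on the host."""
    return [binary] + llama_args(PurePosixPath(models_dir) / model_file, prompt)


def docker_argv(models_dir, model_file, prompt, image, mount_point="/models"):
    """Run the same arguments in a container with the host directory bind-mounted."""
    return (["docker", "run", "--gpus", "all",
             "-v", f"{models_dir}:{mount_point}", image]
            + llama_args(PurePosixPath(mount_point) / model_file, prompt))


if __name__ == "__main__":
    models = "/home/wsluser/llama-cpp/models"
    gguf = "qwen2.5-0.5b-instruct-q4_0.gguf"
    prompt = "Building a website can be done in 10 simple steps:"
    image = "mb20261/llama.cpp:llama-cli-cuda121-ubuntu2204-v1"
    print(shlex.join(native_argv(models, gguf, prompt)))
    print(shlex.join(docker_argv(models, gguf, prompt, image)))
```

Keeping the shared arguments in one place makes it easier to compare the two logs, since any difference then comes from the environment rather than from the flags.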

$ docker run --gpus all -v /home/wsluser/llama-cpp/models:/models mb20261/llama.cpp:llama-cli-cuda121-ubuntu2204-v1 -m /models/qwen2.5-0.5b-instruct-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3955 (e94a138d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 630M
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name     = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/25 layers to GPU
llm_load_tensors:        CPU buffer size =   403.20 MiB
llm_load_tensors:      CUDA0 buffer size =     8.01 MiB
..................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   368.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    16.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   983.43 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    65.76 MiB
llama_new_context_with_model: graph nodes  = 846
llama_new_context_with_model: graph splits = 326
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

Building a website can be done in 10 simple steps:sampler seed: 3727715831
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 32768, n_batch = 2048, n_predict = 512, n_keep = 0

 1. Choose a website builder 2. Choose a hosting service 3. Choose a domain 4. Choose a content 5. Set up your website 6. Design your website 7. Add a few items 8. Test your website 9. Launch your website 10. Manage your website
Based on the above information, what is the most likely function of the website builder? The most likely function of the website builder is to create a website. Based on the information provided, the website builder is the primary tool used to create a website, as indicated by the first and second steps listed in the list provided. The other functions of a website builder are typically associated with creating, managing, and managing websites, but are not listed as the primary function in this context. The other steps in the list suggest that a website builder is involved in the design and development phase of creating a website, but it is not the primary function of the website builder itself. Therefore, the most likely function of the website builder is to create a website. However, without a specific list of functions, it is not possible to provide a more specific answer. The most likely function of the website builder is to create a website. Please note that this is an assumption based on the context provided and may not be accurate in all cases. It is important to verify the functions of the specific website builder being used in a particular context.

The other functions of a website builder are typically associated with the design and development phase of creating a website, but are not listed as the primary function in this context. The other steps in the list suggest that a website builder is involved in the design and development phase of creating a website, but are not listed as the primary function of the specific website builder being used in a particular context. Therefore, the most likely function of the website builder is to create a website. However, without a specific list of functions, it is not possible to provide a more specific answer. The other functions of a website builder are typically associated with the design and development phase of creating a website, but are not listed as the primary function of the specific website builder being used in a particular context. Therefore, the most likely function of the website builder is to create a website. However, without a specific list of functions, it is not possible to provide a more specific answer. The other functions of a website builder are typically associated with the design and development phase of creating a website, but are not listed as the primary function of the specific website

llama_perf_sampler_print:    sampling time =      87.35 ms /   525 runs   (    0.17 ms per token,  6010.30 tokens per second)
llama_perf_context_print:        load time =    3175.65 ms
llama_perf_context_print: prompt eval time =      83.12 ms /    13 tokens (    6.39 ms per token,   156.40 tokens per second)
llama_perf_context_print:        eval time =   10874.08 ms /   511 runs   (   21.28 ms per token,    46.99 tokens per second)
llama_perf_context_print:       total time =   11282.11 ms /   524 tokens

$
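Two groups of numbers in the log above can be checked by hand: the KV cache size and the tokens-per-second figures. The sketch below recomputes both. The KV formula (context length × layers × `n_embd_k_gqa` × 2 bytes for f16) is inferred from the values printed in the log, not taken from the llama.cpp source, and the function names are hypothetical.

```python
import re

MIB = 1024 * 1024


def kv_cache_mib(n_ctx, n_layer, n_embd_kv, bytes_per_elem=2):
    """Size of the K cache alone (the V cache is the same size), in MiB.

    Assumes one n_embd_kv-wide vector per layer and per context position,
    stored as f16 (2 bytes per element), as reported in the log.
    """
    return n_ctx * n_layer * n_embd_kv * bytes_per_elem / MIB


# Matches "... = <ms> ms / <count> runs" or "... tokens" in llama_perf_* lines.
PERF_RE = re.compile(r"=\s*([\d.]+) ms /\s*(\d+) (?:runs|tokens)")


def tokens_per_second(perf_line):
    """Recompute throughput from a llama_perf_* line; None if no count is present."""
    match = PERF_RE.search(perf_line)
    if match is None:
        return None
    elapsed_ms, count = float(match.group(1)), int(match.group(2))
    return count / (elapsed_ms / 1000.0)


if __name__ == "__main__":
    # 0.5B run: n_ctx = 32768, n_layer = 24, n_embd_k_gqa = 128
    print(kv_cache_mib(32768, 24, 128))
    # 7B CPU-only run: n_ctx = 131072, n_layer = 28, n_embd_k_gqa = 512
    print(kv_cache_mib(131072, 28, 512))
    line = ("llama_perf_context_print:        eval time =   10874.08 ms /   511 runs"
            "   (   21.28 ms per token,    46.99 tokens per second)")
    print(round(tokens_per_second(line), 2))
```

For the 0.5B run this gives 192 MiB for K, so 384 MiB for K and V together, which matches the `KV self size` line. For the 7B run it gives 3584 MiB for K, or 7168 MiB in total. That is more than the 5.07 GiB model itself, so with the full 131072-token context the KV cache, not the weights, is the larger memory consumer.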

$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 630M
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name     = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/25 layers to GPU
llm_load_tensors:        CPU buffer size =   403.20 MiB
llm_load_tensors:      CUDA0 buffer size =     8.01 MiB
..................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   368.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    16.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   983.43 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    65.76 MiB
llama_new_context_with_model: graph nodes  = 846
llama_new_context_with_model: graph splits = 326
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 195092334
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 32768, n_batch = 2048, n_predict = 512, n_keep = 0

Building a website can be done in 10 simple steps: setting up a hosting plan, installing a database, designing a website, creating content, linking content to a page, creating a user account, creating a password, uploading images and videos, and managing users and accounts. Each step is crucial, but what’s missing is a process to help you get the most out of the process. This is where the Ultimate Website Builder comes in.

What is the Ultimate Website Builder?

The Ultimate Website Builder is a website builder designed to streamline the process of building a website. It is a tool that takes the hassle out of the process and gives users a step-by-step guide to help them create a website. The Ultimate Website Builder is a tool that makes website building a breeze, reducing the time and effort required to create a website. With the Ultimate Website Builder, users can create a website in as little as 5 minutes.

What Does the Ultimate Website Builder Do?

The Ultimate Website Builder is a website builder designed to simplify the process of building a website. The builder is a tool that takes the hassle out of the process and gives users a step-by-step guide to help them create a website. The builder is a tool that makes website building a breeze, reducing the time and effort required to create a website.

What’s the Ultimate Website Builder Good for?

The Ultimate Website Builder is a tool that makes website building a breeze, reducing the time and effort required to create a website.

Is there a limit to what can be built?

The Ultimate Website Builder is a tool that makes website building a breeze, reducing the time and effort required to create a website.

Does the Ultimate Website Builder work with all platforms?

The Ultimate Website Builder is a tool that works with all platforms.

Does the Ultimate Website Builder help with the user’s login or registration?

The Ultimate Website Builder is a tool that helps with the user's login or registration.

Does the Ultimate Website Builder help with the user's password or login or registration?

The Ultimate Website Builder is a tool that helps with the user's password or login or registration.

Does the Ultimate Website Builder help with the user's user information or login or registration?

The Ultimate Website Builder is a tool that helps with the user's user information or login or registration.

Does the Ultimate Website Builder help with the user's images or login or registration?

The Ultimate Website Builder is a tool that helps with the user's images or login or registration.

Does the Ultimate Website Builder help with the user's videos or login or registration?

The Ultimate Website Builder is a tool that helps with the user's videos or login or

llama_perf_sampler_print:    sampling time =      89.21 ms /   525 runs   (    0.17 ms per token,  5884.92 tokens per second)
llama_perf_context_print:        load time =    2141.07 ms
llama_perf_context_print: prompt eval time =      83.81 ms /    13 tokens (    6.45 ms per token,   155.11 tokens per second)
llama_perf_context_print:        eval time =   10816.66 ms /   511 runs   (   21.17 ms per token,    47.24 tokens per second)
llama_perf_context_print:       total time =   11229.56 ms /   524 tokens

$
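
The `llama_perf` lines at the end of a run are a convenient way to compare settings. As a minimal sketch, the timings can be converted into tokens per second as follows. The regular expression and the helper name `parse_perf` are illustrative assumptions written against the log format shown above, which may differ in other builds.

```python
import re

# Matches timing lines that report both a duration and a count, for example:
#   llama_perf_context_print:        eval time =   10816.66 ms /   511 runs   ( ...
PERF_LINE = re.compile(r":\s+(.*?) time =\s+([\d.]+) ms /\s+(\d+) (?:runs|tokens)")


def parse_perf(log_text: str) -> dict[str, float]:
    """Map each timed phase (e.g. 'eval') to its throughput in tokens per second."""
    rates = {}
    for line in log_text.splitlines():
        match = PERF_LINE.search(line)
        if not match:
            continue  # e.g. the "load time" line has no token count
        phase = match.group(1)
        millis = float(match.group(2))
        count = int(match.group(3))
        if millis > 0:
            rates[phase] = count / (millis / 1000.0)
    return rates


# Lines copied from the run above.
sample = """\
llama_perf_context_print:        load time =    2141.07 ms
llama_perf_context_print: prompt eval time =      83.81 ms /    13 tokens (    6.45 ms per token,   155.11 tokens per second)
llama_perf_context_print:        eval time =   10816.66 ms /   511 runs   (   21.17 ms per token,    47.24 tokens per second)
"""

for phase, rate in parse_perf(sample).items():
    print(f"{phase}: {rate:.2f} tokens/s")
```

The `eval` rate is the generation speed, while `prompt eval` covers processing the 13 prompt tokens. Comparing these two numbers across different `-ngl` values shows how much the GPU offload helps on a given machine.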

Llama-server

In Llama.cpp, `llama-server` is a command-line tool that exposes a loaded model over HTTP. It keeps the model in memory and answers API calls, including the OpenAI-compatible endpoint used later in this section, so applications can integrate with it like any other web service.

The server is configured through command-line options, such as the model path (`-m`), the number of layers offloaded to the GPU (`-ngl`), flash attention (`-fa`), and the listening `--host` and `--port`. It accepts connections from multiple clients, which makes it suitable for applications where several users interact with the model. Note that the log below reports a single slot (`n_slots = 1`) with these default settings.

$ ./llama-server -m /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf -ngl 28 -fa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11
main: loading model
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/wsluser/llama-cpp/models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 630M
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name     = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:        CPU buffer size =    73.03 MiB
llm_load_tensors:      CUDA0 buffer size =   330.19 MiB
..................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.16 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   298.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    65.76 MiB
llama_new_context_with_model: graph nodes  = 751
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
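
The `chat_example` printed in the log shows how the server flattens a list of chat messages into a single prompt. A minimal sketch of that layout is below. The real template is a Jinja template stored in the GGUF metadata (`tokenizer.chat_template`) and also handles tools, so the hypothetical helper `render_chatml` only reproduces the simple case shown in the log.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages in the layout printed in the server's chat_example."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# The same conversation the server prints at startup.
example = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
    {"role": "user", "content": "How are you?"},
]
print(render_chatml(example))
```

Clients of the chat endpoint never build this string themselves. They send structured messages, and the server applies the template before tokenizing.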

Alternatively, we can start llama-server from a Docker image that runs on the CPU only:

$ docker run -v /home/wsluser/llama-cpp/models:/models mb20261/llama.cpp:llama-srv-cpu-ubuntu2204-v1 -m /models/qwen2.5-0.5b-instruct-q4_0.gguf --port 8080 --host 0.0.0.0
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 3955 (e94a138d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 11
main: loading model
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/qwen2.5-0.5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-0.5b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 630M
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 2
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  169 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 896
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 14
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 128
llm_load_print_meta: n_embd_v_gqa     = 128
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 4864
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 630.17 M
llm_load_print_meta: model size       = 403.20 MiB (5.37 BPW)
llm_load_print_meta: general.name     = qwen2.5-0.5b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.13 MiB
llm_load_tensors:        CPU buffer size =   403.20 MiB
..................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     1.16 MiB
llama_new_context_with_model:        CPU compute buffer size =   967.01 MiB
llama_new_context_with_model: graph nodes  = 846
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on 0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle

request: GET /health 127.0.0.1 200
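
The `GET /health` request in the log suggests a simple way for a client to wait until the model has finished loading. The following is a minimal sketch using only the standard library. The helper names, the retry count, and the delay are assumptions for illustration, and any non-200 answer or refused connection is treated as "not ready yet".

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(base_url, attempts=30, delay=1.0, probe=probe_health):
    """Poll the health endpoint until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(base_url):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    # Assumes a server started as shown above, listening on port 8080.
    print(wait_until_ready("http://localhost:8080", attempts=3))
```

The `probe` parameter exists so the polling loop can be exercised without a running server, for example in a unit test.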

Then we can call the standard OpenAI-compatible API (no key required), which keeps our client application portable.

$ curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-no-key-required" \
  -d '{
  "model": "qwen",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful assistant."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "tell me something about michael jordan"
        }
      ]
    }
  ]
}'

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"Here is a brief biography of Michael Jordan:\n\nMichael Jordan was a professional basketball player from the United States who played for the Chicago Bulls during the 1993-1994 NBA season. He is widely regarded as one of the greatest basketball players of all time and is widely considered one of the greatest basketball players ever to play the game.\n\nJordan is known for his incredible athleticism, skill, and leadership on the court, as well as his exceptional ability to rebound from the field and in the gym. He is also recognized for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.\n\nJordan was a skilled player, with a career-high rebound rate of 50% and a career-high percentage of 27% in scoring, and he was also known for his ability to rebound from the bench and in the gym.\n\nJordan is also known for his role as the owner of the Chicago Bulls, who won four NBA championships with Jordan as the team's head coach.\n\nJordan is known for his leadership on the court, and he is also known for his ability to rebound from the bench and in the gym.\n\nJordan is also known for his contributions to the development of basketball in the United States, and he is also known for his contributions to the development of basketball in the United States.","role":"assistant"}}],"created":1729560781,"model":"qwen","object":"chat.completion","usage":{"completion_tokens":273,"prompt_tokens":26,"total_tokens":299},"id":"chatcmpl-4yJgHpf2OIHDrNycWPgvTvbswhznZv35"}

$
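
The JSON response above follows the OpenAI chat completion layout, so the reply text and the token counts can be pulled out with a few dictionary lookups. This is a minimal sketch. The helper name `summarize` is an assumption, and the `raw` string is a trimmed copy of the response above with the long `content` value shortened.

```python
import json

# Trimmed copy of the response above; only the "content" text is shortened.
raw = (
    '{"choices":[{"finish_reason":"stop","index":0,'
    '"message":{"content":"Here is a brief biography of Michael Jordan: ...",'
    '"role":"assistant"}}],"created":1729560781,"model":"qwen",'
    '"object":"chat.completion",'
    '"usage":{"completion_tokens":273,"prompt_tokens":26,"total_tokens":299}}'
)


def summarize(response: dict) -> dict:
    """Extract the reply text, stop status, and token usage from a chat completion."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "text": choice["message"]["content"],
        # "stop" means the model ended its turn on its own; in the OpenAI
        # convention, "length" would mean the token limit cut it off.
        "finished": choice["finish_reason"] == "stop",
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


summary = summarize(json.loads(raw))
print(summary["finished"], summary["total_tokens"])
```

Here `total_tokens` is the sum of the 26 prompt tokens and the 273 generated tokens, which is the figure to watch against the context length.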

Or call it from Python:

$ python
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import openai
>>> client = openai.OpenAI(
...     base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
...     api_key = "sk-no-key-required"
... )
>>>
>>> completion = client.chat.completions.create(
...     model="qwen",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "tell me something about michael jordan"}
...     ]
... )
>>> print(completion.choices[0].message.content)
Michael Jordan is a renowned basketball player who is widely regarded as one of the greatest players in basketball history. He is the first and currently the most successful of all time, having won five NBA championships and four Olympic gold medals. Jordan is considered one of the greatest of all time and was one of the best players in the world. He is also known for his innovative skills and his ability to influence his opponents in the sport.
>>> exit()

Tuning the Llama.cpp command to fit your environment

A GPU-enabled Llama.cpp build works for CPU-only, CPU+GPU, and GPU-only usage. Different models and quantizations produce different file sizes, so there is no single best set of command-line arguments. You need to tune parameters such as the number of layers loaded to the GPU, the context length, and the batch size so that the model fits your hardware.
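
Context length is one of the more expensive parameters, because the KV cache grows linearly with it. As a rough sketch, the cache size reported in the logs above can be reproduced from the printed model metadata. This assumes an f16 cache (2 bytes per element) as in those logs, and the function name `kv_cache_mib` is made up for illustration.

```python
def kv_cache_mib(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_element=2):
    """Estimate the KV cache size in MiB from values printed in the load log."""
    k_bytes = n_layer * n_ctx * n_embd_k_gqa * bytes_per_element
    v_bytes = n_layer * n_ctx * n_embd_v_gqa * bytes_per_element
    return (k_bytes + v_bytes) / (1024 * 1024)


# Values from the qwen2.5-0.5b-instruct log: n_layer = 24, n_ctx = 32768,
# n_embd_k_gqa = n_embd_v_gqa = 128 (n_head_kv = 2 times a head size of 64).
print(kv_cache_mib(24, 32768, 128, 128))  # 384.0, matching "KV self size  =  384.00 MiB"

# The same model with a 4096-token context needs one eighth of that.
print(kv_cache_mib(24, 4096, 128, 128))   # 48.0
```

When GPU memory is tight, lowering the context size (the `-c` option) frees memory that can be spent on offloading more layers instead.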

Now let’s walk through the common use cases.

Case Study: Load a 7B GGUF model into CPU + 8GB GPU

Let’s first download the model. We will use the Qwen 2.5 7B GGUF model:

Qwen/Qwen2.5-7B-Instruct-GGUF · Hugging Face

$ huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF --include "qwen2.5-7b-instruct-q5_k_m*.gguf" --local-dir . --local-dir-use-symlinks False
/work/GitHubsEnv/MEAIDev/poc-ai-tool-llama-cpp/huggingface-cli/lib/python3.10/site-packages/huggingface_hub/commands/download.py:139: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
  warnings.warn(
Fetching 2 files:   0%|                                                                                                                                      | 0/2 [00:00<?, ?it/s]Downloading 'qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf' to '.cache/huggingface/download/qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf.beba9d4f2f5a1fe7d144dcae332e68b52c26705c5310dece2e5d1997e091e134.incomplete'
Downloading 'qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf' to '.cache/huggingface/download/qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf.42f6693004793ee6cf1b2b723f0273b10f86a3bb2a949bd9128d4cda5fb866cd.incomplete'
(…)5-7b-instruct-q5_k_m-00002-of-00002.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.45G/1.45G [01:10<00:00, 20.7MB/s]
Download complete. Moving file to qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf██████████████████████████████████████████████████████████████| 1.45G/1.45G [01:10<00:00, 12.8MB/s]
(…)5-7b-instruct-q5_k_m-00001-of-00002.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 3.99G/3.99G [01:52<00:00, 35.4MB/s]
Download complete. Moving file to qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf
Fetching 2 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:53<00:00, 56.59s/it]
/work/GitHubs/MEAIDev/poc-ai-tool-llama-cpp/llama.cpp/build/bin

Large model files are usually split into multiple segments because of file upload limits. The segments share a prefix, with a suffix indicating each segment's index, for example qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf and qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf. After the command above has downloaded all the files, we merge them into a single GGUF model file to pass to the Llama.cpp commands.

$ ./llama-gguf-split --merge qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf ~/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf
gguf_merge: qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf -> /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf
gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf done
gguf_merge: reading metadata qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf done
gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf done
gguf_merge: writing tensors qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf done
gguf_merge: /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf merged from 2 split with 339 tensors.
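
Before merging, it can be worth confirming that every segment was actually downloaded, since an interrupted download leaves a gap. The sketch below checks a list of file names against the `-0000N-of-0000M.gguf` naming pattern seen above. The helper name `missing_shards` and the exact regular expression are assumptions based only on those two file names.

```python
import re

# Matches names such as qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf
SHARD = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def missing_shards(filenames):
    """Return, for each split model prefix, the sorted list of absent shard indices."""
    seen = {}
    for name in filenames:
        match = SHARD.match(name)
        if match:
            key = (match["prefix"], int(match["total"]))
            seen.setdefault(key, set()).add(int(match["index"]))
    return {
        prefix: sorted(set(range(1, total + 1)) - found)
        for (prefix, total), found in seen.items()
    }


files = [
    "qwen2.5-7b-instruct-q5_k_m-00001-of-00002.gguf",
    "qwen2.5-7b-instruct-q5_k_m-00002-of-00002.gguf",
]
print(missing_shards(files))
```

An empty list for a prefix means all of its segments are present. In practice the input would come from something like `os.listdir(".")` in the download directory.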

Now let’s load the model for conversation.

$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
    -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
    -fa -ngl 80 -n 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-7b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-7b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 7.6B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 17
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name     = qwen2.5-7b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   357.33 MiB
llm_load_tensors:      CUDA0 buffer size =  4829.59 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7168.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
common_init_from_params: failed to create context with model '/home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf'
main: error: unable to load model

Failed! The error is out of memory: the model weights were offloaded (CUDA0 buffer size = 4829.59 MiB), but the 7168.00 MiB KV cache allocation that followed did not fit on a GPU that reported 7089 MiB free. To alleviate memory issues like this, you can adjust a few settings in your command:

Reduce `-ngl` (number of GPU layers): Since the GPU has limited VRAM, try setting `-ngl` to a lower value. The `-ngl` option specifies how many model layers are stored on the GPU. A safe starting point would be around 10–12 for an RTX 2070, but you may need to experiment to find the value that works for your setup.
Reduce the context size (`-c` or `--ctx-size`): If your application allows, set a smaller context size. The default is 0, which takes the context length from the model (131072 here, as the `n_ctx` line in the log shows), so try smaller values for `--ctx-size` if your application is flexible.
Adjust the batch size (`-b` or `--batch-size`): You can reduce the logical maximum batch size with `-b` (default is 2048) and the physical maximum batch size with `-ub` or `--ubatch-size` (default is 512) to lower memory usage during inference.
Avoid Flash Attention: If not strictly needed, you can also disable Flash Attention by omitting the `-fa` option.
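
To see why the context size matters so much, the 7168.00 MiB allocation from the failed run can be reproduced from values printed in the log. This is a rough sketch, assuming an f16 cache (2 bytes per element) and the sizing formula `n_ctx * n_layer * n_embd_k_gqa` for K (and likewise for V); the function name `kv_cache_mib` is hypothetical.

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate the KV cache size in MiB (assumes f16 entries by default)."""
    k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem
    return (k_bytes + v_bytes) / (1024 * 1024)


# Values from the log: n_ctx = 131072, n_layer = 28, n_embd_k_gqa = n_embd_v_gqa = 512
full = kv_cache_mib(131072, 28, 512, 512)
print(full)  # 7168.0, the size of the allocation that failed

# The same model with a much smaller context window
small = kv_cache_mib(8192, 28, 512, 512)
print(small)  # 448.0
```

Under these assumptions, passing something like `-c 8192` would shrink the cache from about 7 GiB to under half a GiB, which is usually a larger saving than any batch-size change.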

Let’s make the following adjustments:

-ngl 12: This allocates 12 layers on the GPU instead of 80.
-b 128: This reduces the logical batch size to help lower memory use.
-ub 32: This limits the physical batch size even further.

You might need to experiment a bit with the values for `-ngl`, `-b`, and `-ub` until you find a configuration that works without running out of memory. If you still encounter issues, consider lowering `-ngl` even further.

So, let’s try the new command:

$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
    -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
    -ngl 12 -n 512 -b 128 -ub 32
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-7b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-7b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 7.6B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 17
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name     = qwen2.5-7b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloaded 12/29 layers to GPU
llm_load_tensors:        CPU buffer size =  5186.92 MiB
llm_load_tensors:      CUDA0 buffer size =  1895.93 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 128
llama_new_context_with_model: n_ubatch   = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  4096.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  3072.00 MiB
llama_new_context_with_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   721.33 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.44 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 228
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: interactive mode on.
sampler seed: 48583840
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

> where is new york?
New York is a state located in the northeastern region of the United States. Its capital city is Albany, but the most populous city in the state, and one of the most populous cities in the United States, is New York City, often simply called "New York." New York City is a global metropolis known for its culture, finance, arts, and business. It is located along the Atlantic coast and includes several counties, with Manhattan being one of the most well-known boroughs.

>

It works now.
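
The log above also hints at why this configuration fits. The KV cache appears to be divided between GPU and host in proportion to the offloaded layers (3072.00 MiB on CUDA0, 4096.00 MiB on CUDA_Host). The sketch below reproduces those numbers; the proportional split is an assumption inferred from this log rather than taken from the llama.cpp source, and the function name `kv_split_mib` is hypothetical.

```python
def kv_split_mib(kv_total_mib, n_layer, n_gpu_layers):
    """Split the KV cache between GPU and host, proportionally to offloaded layers.

    Assumption inferred from the log: each repeating layer owns an equal share.
    """
    gpu_layers = min(n_gpu_layers, n_layer)  # -ngl 80 cannot exceed the 28 layers
    gpu_mib = kv_total_mib * gpu_layers / n_layer
    return gpu_mib, kv_total_mib - gpu_mib


GPU_FREE_MIB = 7089.0  # "7089 MiB free" in the log

# Working run: -ngl 12
gpu_kv, host_kv = kv_split_mib(7168.0, 28, 12)
print(gpu_kv, host_kv)  # 3072.0 4096.0, matching the two KV buffer lines
needed = 1895.93 + gpu_kv + 721.33  # weights + KV + compute buffer on CUDA0
print(needed < GPU_FREE_MIB)  # True

# Failed run: -ngl 80 puts all 28 layers and the whole cache on the GPU
gpu_kv_all, _ = kv_split_mib(7168.0, 28, 80)
print(4829.59 + gpu_kv_all < GPU_FREE_MIB)  # False
```

This kind of estimate is only a starting point for choosing `-ngl`; the compute buffer size also changes with the batch settings.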

Case Study: Load a 7B GGUF model into CPU only

Let’s continue from the case above and reuse the same GGUF model. This time we want to load the whole model on the CPU only.

To do so, we need the following adjustments:

Remove GPU options: The `-ngl` option and other GPU-related options (such as `-fa`) are omitted. Without `-ngl`, no layers are offloaded, so the model weights stay in CPU memory.
`--cpu-strict 1`: This sets strict CPU placement for the threads that run the model.
Adjust `-b 128` (the batch size) as needed based on your performance and memory constraints.

$ ./llama-cli -m /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf \
    -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
    --cpu-strict 1 -n 512 -b 128
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
build: 3951 (dbd5f2f5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2070) - 7089 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from /home/wsluser/llama-cpp/models/qwen2.5-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2.5-7b-instruct
llama_model_loader: - kv   3:                            general.version str              = v0.1
llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-7b-instruct
llama_model_loader: - kv   5:                         general.size_label str              = 7.6B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 17
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name     = qwen2.5-7b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  5186.92 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 128
llama_new_context_with_model: n_ubatch   = 128
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  7168.00 MiB
llama_new_context_with_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  2117.26 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    65.75 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 396
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: interactive mode on.
sampler seed: 999110778
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 128, n_predict = 512, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

>

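It works too. How can we tell that the model is loaded on the CPU only? Look at these lines in the output above: `llm_load_tensors: offloading 0 repeating layers to GPU`, `llm_load_tensors: offloaded 0/29 layers to GPU`, and `llm_load_tensors: CPU buffer size = 5186.92 MiB`, with no `CUDA0 buffer size` line for the weights. Comparing them with the earlier GPU runs shows the difference. The log still reports a `CUDA0 compute buffer`, so some GPU memory is allocated even though no layers are offloaded.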
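
The `offloaded N/M layers to GPU` line is also easy to check from a script, for example when comparing several runs. This is a small sketch that assumes the log wording of this particular build (3951); the helper name `offloaded_layers` is hypothetical, and the wording may differ in other llama.cpp versions.

```python
import re

# Pattern taken from the "llm_load_tensors: offloaded 0/29 layers to GPU" log line.
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def offloaded_layers(log_text):
    """Return (offloaded, total) parsed from a llama.cpp log, or None if absent."""
    m = OFFLOAD_RE.search(log_text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


cpu_log = """llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  5186.92 MiB"""

gpu_log = """llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloaded 12/29 layers to GPU"""

print(offloaded_layers(cpu_log))  # (0, 29): nothing offloaded
print(offloaded_layers(gpu_log))  # (12, 29): partial offload
```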

More considerations for CPU-only usage:

If you want more control over CPU threading, you can also pass `-t N` or `--threads N` to set the number of threads used during generation. The default is -1, which lets llama.cpp pick the count automatically; in the runs above it chose 6 threads on a machine with 12 (`n_threads = 6 ... / 12`).
If you experience memory issues, you can keep experimenting with reducing `-b` to further optimize resource usage.