Deploying Models with Xinference
Today, let’s explore Xinference, a deployment and inference tool for Large Language Models (LLMs). It is quick to deploy, easy to use, and efficient at inference, supports a wide range of open-source models, and provides both a WebGUI interface and API endpoints for convenient model deployment and inference. Let’s dive in!
Introduction to Xinference
Xorbits Inference (Xinference) is a powerful and comprehensive distributed inference framework suitable for various models. With Xinference, you can effortlessly deploy your own or cutting-edge open-source models with just one click. Whether you’re a researcher, developer, or data scientist, Xinference connects you with the latest AI models, unlocking more possibilities. Below is a comparison of Xinference with other model deployment and inference tools:
Installing Xinference
Xinference supports two installation methods: Docker image and local installation. For those interested in the Docker method, please refer to the official Docker Installation Documentation. Here, we will focus on local installation.
First, install Xinference’s Python dependencies:
pip install "xinference[all]"
Since Xinference depends on many third-party libraries, the installation might take some time. Once completed, you can start the Xinference service with the following command:
xinference-local
Upon successful startup, access the Xinference WebGUI interface via http://localhost:9997.
Note: During the installation of Xinference, it might install a different version of PyTorch (due to its dependency on the vllm component), which could cause issues with GPU servers. Therefore, after installing Xinference, you can execute the following command to check if PyTorch is working correctly:
python -c "import torch; print(torch.cuda.is_available())"
If the output is True, PyTorch is working correctly. Otherwise, you may need to reinstall PyTorch, following the instructions on PyTorch's website.
Deploying and Using Models
Deploying models in Xinference’s WebGUI interface is straightforward. Let’s see how to deploy an LLM model.
First, in the Launch Model menu, select the LANGUAGE MODELS tab and enter the keyword chatglm3 to search for the ChatGLM3 model to deploy.
Then, click the chatglm3 card to see the following interface:
When deploying an LLM model, you have several parameters to choose from:
- Model Format: The model format, with quantized and non-quantized options. The non-quantized format is pytorch, while quantized formats include ggml, gptq, etc.
- Model Size: The model's parameter size. For ChatGLM3, the only option is 6B, but for Llama2, options include 7B, 13B, 70B, etc.
- Quantization: Quantization precision, with options such as 4bit and 8bit.
- N-GPU: Select which GPU to use.
- Model UID (optional): A custom model name; if not specified, the original model name is used by default.
After filling in the parameters, click the rocket icon button to start the model deployment. Based on the selected parameters, the system will download the quantized or non-quantized LLM model. Once deployed, the interface automatically redirects to the Running Models menu, where you can see the deployed ChatGLM3-6B model in the LANGUAGE MODELS tab.
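As an alternative to the WebGUI, the same deployment can also be triggered programmatically with Xinference's Python client. Below is a minimal sketch; the argument values mirror the WebGUI parameters above and are illustrative assumptions, and depending on your Xinference version the launch_model call may accept or require additional arguments:
from xinference.client import Client

# Connect to the local Xinference service
client = Client("http://localhost:9997")

# Launch ChatGLM3 with the same parameters as the WebGUI form (values illustrative)
model_uid = client.launch_model(
    model_name="chatglm3",
    model_format="pytorch",        # Model Format
    model_size_in_billions=6,      # Model Size
    quantization="none",           # Quantization ("none" for the non-quantized pytorch format)
)
print(model_uid)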
Back in the Running Models menu, if you click the red square Launch Web UI icon, a browser window will pop up with the LLM model's web interface, allowing you to converse with the model as shown below:
API Endpoints
If you’d rather not use the LLM model’s web interface, you can use the API endpoints instead. The WebGUI interface and API endpoints are both made available when the Xinference service starts. Opening http://localhost:9997/docs/ in your browser displays the list of API endpoints.
The list includes a wide range of endpoints, not just for LLM models but also for other model types (such as Embedding and Rerank), and they are all compatible with the OpenAI API. For example, to use the chat capability of an LLM model, you can call it with curl as follows:
curl -X 'POST' \
  'http://localhost:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "chatglm3",
    "messages": [
      {
        "role": "user",
        "content": "hello"
      }
    ]
  }'
# Response
{
  "model": "chatglm3",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 29,
    "total_tokens": 37
  }
}
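Because the endpoints are OpenAI-compatible, the same chat call can also be made from Python with the official openai package. A minimal sketch; the API key can be any non-empty string, and the model name must match the UID used at deployment:
import openai

# Point the OpenAI client at the local Xinference service
client = openai.Client(api_key="not empty", base_url="http://localhost:9997/v1")

response = client.chat.completions.create(
    model="chatglm3",  # the deployed model's UID
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)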
Multimodal Models
Let’s also deploy a multimodal model, which is an LLM capable of recognizing images. The deployment process is similar to that for LLM models.
First, open the Launch Model menu and, under the LANGUAGE MODELS tab, filter by Model Ability and choose vl-chat. You will see the two currently supported multimodal models:
Choose the qwen-vl-chat model for deployment. The parameter selection process is similar to that for LLM models. After selecting the parameters, click the rocket icon button to deploy. Once deployed, it automatically enters the Running Models menu, as shown below:
Clicking the Launch Web UI button opens the multimodal model's web interface in the browser, where you can converse with the model using both images and text, as shown below:
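The multimodal model can also be called through the OpenAI-compatible chat endpoint. The sketch below assumes the vl-chat model accepts OpenAI-style vision messages (a content list mixing text and image_url parts); the image URL is a placeholder:
import openai

client = openai.Client(api_key="not empty", base_url="http://localhost:9997/v1")

response = client.chat.completions.create(
    model="qwen-vl-chat",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Placeholder image URL; replace with a real, reachable image
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)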
Embedding Models
Embedding models convert text into vectors. Deploying them with Xinference is even simpler: just select the Embedding tab in the Launch Model menu, choose the desired model, and deploy directly; unlike LLM models, no parameters need to be selected. Here, we deploy the bge-base-en-v1.5 Embedding model.
We can verify the deployed Embedding model using the Curl command to call its API endpoint:
curl -X 'POST' \
  'http://localhost:9997/v1/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "bge-base-en-v1.5",
    "input": "hello"
  }'
# Response
{
  "object": "list",
  "model": "bge-base-en-v1.5-1-0",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [0.0007792398682795465, …]
    }
  ],
  "usage": {
    "prompt_tokens": 37,
    "total_tokens": 37
  }
}
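Since the /v1/embeddings endpoint follows the OpenAI API, the openai Python package works here as well. A minimal sketch:
import openai

client = openai.Client(api_key="not empty", base_url="http://localhost:9997/v1")

result = client.embeddings.create(model="bge-base-en-v1.5", input="hello")
# Each input string gets one vector; print its dimensionality
print(len(result.data[0].embedding))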
Rerank Models
Rerank models sort texts by relevance to a query. Deploying them with Xinference is straightforward and similar to Embedding models. The deployment steps are shown below; here we deploy the bge-reranker-base Rerank model:
We can verify the deployed Rerank model by calling its API endpoint with the Curl command:
curl -X 'POST' \
  'http://localhost:9997/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "bge-reranker-base",
    "query": "What is Deep Learning?",
    "documents": [
      "Deep Learning is ...",
      "hello"
    ]
  }'
# Response
{
  "id": "88177e80-cbeb-11ee-bfe5-0242ac110007",
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9165927171707153,
      "document": null
    },
    {
      "index": 1,
      "relevance_score": 0.00003880404983647168,
      "document": null
    }
  ]
}
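The same call can be made from Python. The openai package has no rerank endpoint, so the sketch below posts to /v1/rerank directly with the requests library:
import requests

payload = {
    "model": "bge-reranker-base",
    "query": "What is Deep Learning?",
    "documents": [
        "Deep Learning is ...",
        "hello",
    ],
}
resp = requests.post("http://localhost:9997/v1/rerank", json=payload)
# Print each document's index and relevance score
for item in resp.json()["results"]:
    print(item["index"], item["relevance_score"])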
Image Models
Xinference also supports image models for functions like text-to-image and image-to-image. Several image models are built into Xinference, including various versions of Stable Diffusion (SD). The deployment process is similar to that for text models and is performed via the WebGUI interface without needing parameter selection. However, because SD models are large, ensure your server has 50 GB or more of free disk space before deploying image models. Here, we deploy the sdxl-turbo image model; screenshots of the deployment steps are shown below:
We can use Python code to call the image model for generating images, with an example as follows:
from xinference.client import Client

# Connect to the local Xinference service
client = Client("http://localhost:9997")
# Get a handle to the deployed sdxl-turbo model
model = client.get_model("sdxl-turbo")
# Generate an image from a text prompt
model.text_to_image("An astronaut walking on the mars")
Here, we used Xinference’s client tool for the text-to-image function. The generated images are automatically saved in the image folder under Xinference's Home directory, which defaults to ~/.xinference. You can also specify the Home directory when starting the Xinference service with the following command:
XINFERENCE_HOME=/tmp/xinference xinference-local
Audio Models
Audio models, a recent addition to Xinference, enable functions like speech-to-text and audio translation. Before deploying audio models, you must first install the ffmpeg component; using Ubuntu as an example:
sudo apt update && sudo apt install ffmpeg
Currently, Xinference does not support deploying audio models via the WebGUI interface; they must be deployed via the command line. Ensure the Xinference service has been started (xinference-local) before executing the deployment command:
xinference launch -u whisper-1 -n whisper-large-v3 -t audio
- -u: Model ID
- -n: Model name
- -t: Model type
This command-line deployment method applies not only to audio models but also to other model types. We can use the deployed audio model by calling its API endpoint, which is compatible with OpenAI's Audio API, so OpenAI's Python package can be used; an example follows:
import openai
# The API key can be any non-empty string
client = openai.Client(api_key="not empty", base_url="http://127.0.0.1:9997/v1")
audio_file = open("/your/audio/file.mp3", "rb")
# Using OpenAI's method to call the audio model
completion = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
print(f"completion: {completion}")
audio_file.close()
Model Sources
By default, Xinference downloads models from HuggingFace. To use models from other websites, set the XINFERENCE_MODEL_SRC environment variable. Starting the Xinference service with the following command will download models from ModelScope:
XINFERENCE_MODEL_SRC=modelscope xinference-local
Model GPU Exclusivity
During deployment with Xinference, if your server has only one GPU, you can deploy only one LLM model, multimodal model, image model, or audio model at a time, because Xinference currently implements a one-model-per-GPU approach for these types. Attempting to deploy multiple such models on one GPU results in the error: No available slot found for the model.
However, there is no such restriction for Embedding or Rerank models, allowing multiple models to be deployed on the same GPU.
Conclusion
Today, we introduced Xinference, an open-source deployment and inference tool notable for its convenient deployment and broad model support. We hope this article helps more people discover this tool. If you encounter any issues while using it, or have any questions or comments, feel free to leave a message in the comments section. Follow me to learn about new developments in artificial intelligence and AIGC.