Javier Calderon Jr



Supercharging NLP Inference: Leveraging INT8 BGE-1.5 Models on Intel CPUs for Ultra-Low Latency

Accelerating Natural Language Processing: Harnessing INT8 BGE-1.5 Models on Hugging Face with Intel CPUs

Introduction

Natural Language Processing (NLP) has taken giant strides in the past decade, with transformer-based models becoming the de facto standard for a wide range of tasks. However, the computational cost of deploying such models can be prohibitive, particularly when low latency is a critical requirement. This is where INT8 precision models come into play, offering a promising solution for achieving high performance on CPUs. Hugging Face, a leading hub for NLP models, and Intel have collaborated to provide optimized versions of these models that can achieve impressive speeds. In this article, we’ll explore how to utilize the INT8 BGE-1.5 models to achieve around 5ms latency for embedding sequences with a length of 512 on Intel CPUs.

Prerequisites

To work with these models, ensure that you have:

  • An Intel CPU that supports vector neural network instructions (VNNI).
  • Python 3.6 or later.
  • The transformers and intel-extension-for-transformers libraries installed.
pip install transformers
pip install intel-extension-for-transformers
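Before installing anything, it can help to confirm the VNNI prerequisite. The following is a minimal sketch for Linux only, assuming the kernel exposes CPU flags in /proc/cpuinfo and reports VNNI as avx512_vnni or avx_vnni. The helper name has_vnni is illustrative and not part of any library.

```python
from pathlib import Path

# Flag names the Linux kernel reports for VNNI (AVX-512 and AVX2 variants).
VNNI_FLAGS = {"avx512_vnni", "avx_vnni"}


def has_vnni(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in the cpuinfo text lists a VNNI flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and VNNI_FLAGS & set(value.split()):
            return True
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only; absent on macOS and Windows
    if cpuinfo.exists():
        print("VNNI supported:", has_vnni(cpuinfo.read_text()))
    else:
        print("No /proc/cpuinfo here; check the CPU's spec sheet instead.")
```

On other operating systems, the CPU's published specification is the more reliable source.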

Best Practices for Model Inference

When working with INT8 models, there are several best practices to follow:

Load the Model Appropriately: Use the model from Hugging Face that is already quantized to INT8. This eliminates the need for manual quantization and ensures compatibility.

from transformers import AutoModelForSequenceClassification

model_name = "Intel/bge-small-en-v1.5-sst2-int8-static"
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Optimize for Intel CPUs: Use the Intel extension for transformers. This optimizes operations for Intel CPUs to achieve better performance.

from intel_extension_for_transformers import pipeline

classifier = pipeline('sentiment-analysis', model=model_name, device=-1) # In Hugging Face-style pipelines, -1 selects the CPU; 0 would select the first GPU

Batch Your Data: Even though we’re discussing low-latency inference, batching can provide throughput benefits without significantly impacting latency.

texts = ["This is a great product!", "I had a terrible experience."]
predictions = classifier(texts, batch_size=2)
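When the input list is large, one option is to feed the pipeline in fixed-size chunks so that results can be handled incrementally. The sketch below uses a hypothetical helper named chunked, and it is plain Python with no dependency on the pipeline itself.

```python
from typing import Iterator, List, Sequence


def chunked(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# Usage with the article's pipeline (not executed here):
# for batch in chunked(texts, 2):
#     predictions = classifier(batch, batch_size=len(batch))
```

The best batch size depends on the workload: larger batches tend to raise throughput but also lengthen the wait for any single result.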

Warm Up the Model: Run some inference before timing to load everything into memory and ensure that the CPU is ready to perform at its best.

_ = classifier(["Warm up the model."])
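To see the effect of warming up, the timed runs can be separated from the warm-up runs. This is a minimal sketch with a hypothetical helper, measure_latency_ms, that times any zero-argument callable. The default warm-up and run counts are arbitrary assumptions.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Call `fn` untimed `warmup` times, then time `runs` calls in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are deliberately excluded from the samples

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    return {"median": statistics.median(samples), "min": samples[0], "max": samples[-1]}


# Usage with the article's pipeline (not executed here):
# stats = measure_latency_ms(lambda: classifier(["This is a great product!"]))
```

Reporting the median rather than the mean keeps a single slow outlier from distorting the figure.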

Monitor and Tune Performance: Use profiling tools to monitor the performance and tune as necessary. Adjusting the number of threads can sometimes yield better performance.

import os
# Set this before importing torch or transformers so the OpenMP runtime picks it up.
os.environ["OMP_NUM_THREADS"] = "1" # Adjust based on your CPU's core count

Step-by-Step Implementation

Now, let’s walk through the steps for achieving low latency with the BGE-1.5 models.

Load the Model with INT8 Precision

The INT8 models are designed for CPUs, particularly those with VNNI support. With the model loaded as shown earlier, load the matching tokenizer from the same checkpoint using the Hugging Face transformers library.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

Tokenize Your Input

Tokenize the input text to convert it into token IDs that the model can consume. To work at a sequence length of 512, pad and truncate every input to exactly 512 tokens.

inputs = tokenizer("Example input text", return_tensors="pt", max_length=512, truncation=True, padding='max_length')
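To make the effect of max_length, truncation, and padding concrete, here is a toy illustration in plain Python. It is not the tokenizer's actual implementation: the function name pad_or_truncate is hypothetical, the pad ID of 0 is an assumption (it matches BERT-style vocabularies), and real tokenizers also preserve the closing special token when truncating.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Force a token ID list to exactly `max_length` entries and build its attention mask."""
    kept = token_ids[:max_length]        # truncation: drop anything past max_length
    padding = max_length - len(kept)     # padding: number of filler positions needed
    return {
        "input_ids": kept + [pad_id] * padding,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * padding,
    }
```

Padding everything to 512 keeps the input shape fixed, but short inputs then cost as much compute as long ones.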

Execute the Model Inference

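Now perform the inference. The pipeline created earlier expects raw text and tokenizes it internally, so the pre-tokenized tensors are passed to the model itself. This is where you’ll notice the speed advantage.

# The pipeline expects raw strings, so the pre-tokenized tensors go to the model directly.
outputs = model(**inputs)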

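On a VNNI-capable Intel CPU, the latency should be around 5ms, which is remarkably low for sequences of this length. Actual figures vary with the CPU model and thread settings, so measure on your own hardware.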
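When a sequence-classification model is called directly, it returns raw logits rather than labels, so one more step is needed to get a prediction. The sketch below applies a softmax in plain Python. The label order in LABELS is an assumption for an SST-2 style model and should be checked against the model's configuration.

```python
import math
from typing import List, Tuple

# Assumed label order for an SST-2 style model; verify against the model's config.
LABELS = ["negative", "positive"]


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: List[float]) -> Tuple[str, float]:
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```

With a PyTorch model output, the input would be something like outputs.logits[0].tolist().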

Why INT8 Precision Matters

INT8 precision reduces the model’s memory footprint and increases the throughput on CPUs. This precision is particularly effective when combined with Intel’s VNNI, which is designed to accelerate INT8 calculations. The lower precision does not significantly affect the accuracy for most tasks but provides a substantial boost in speed.
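To show why the accuracy cost is usually small, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It illustrates the general idea only, not the exact scheme used to produce the Intel models, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the INT8 values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.003]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    for original, approx in zip(weights, restored):
        print(f"{original:+.4f} -> {approx:+.4f}")
```

Each value takes one byte instead of the four used by FP32, and the round-trip error per value is at most half the scale factor.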

Conclusion

By leveraging the power of Hugging Face’s INT8 models with the optimization capabilities of Intel CPUs, we unlock new possibilities for deploying NLP applications. Whether it’s real-time sentiment analysis, chatbots, or text classification, achieving low-latency inference without compromising accuracy is now within reach. With the simple steps outlined in this guide, developers can integrate these advancements into their applications, delivering a seamless user experience that was once thought to be the exclusive domain of high-end GPUs.

The Ultimate Goal

The target is clear: to enable widespread adoption of advanced NLP by optimizing for performance and cost. With these techniques, we’re not just opening the door to more interactive and
