
llmlingua

LLMs Data Engineering

LLMLingua: a model to compress prompts

Intro

LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.
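
Under the hood, the idea is that a small causal language model scores every token in the prompt; tokens the small model finds highly predictable carry little information and can be dropped. The snippet below is a minimal sketch of that intuition using GPT-2 and Hugging Face transformers. It is not the actual LLMLingua algorithm (which compresses iteratively with budget control at the segment and token level), just an illustration of perplexity-based pruning.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: drop the tokens a small LM finds most predictable.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Can you please tell me what the capital of France is?"
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits

# Self-information (surprisal) of each token given its prefix; the first token has no prefix.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
surprisal = -log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

# Keep the first token plus the most surprising half; predictable tokens are removed.
keep = surprisal >= surprisal.median()
kept = [ids[0, 0].item()] + [t.item() for t, k in zip(ids[0, 1:], keep) if k]
print(tokenizer.decode(kept))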

Why

  • Ever encountered the token limit when asking ChatGPT to summarize lengthy texts?
  • Frustrated with ChatGPT forgetting earlier instructions in long, multi-turn conversations?
  • Experienced high costs using the GPT-3.5/GPT-4 API for experiments, despite excellent results?

While Large Language Models like ChatGPT and GPT-4 excel in generalization and reasoning, they often face challenges like prompt length limits and prompt-based pricing schemes.

Now you can use LLMLingua, LongLLMLingua, and LLMLingua-2!

These tools offer an efficient solution to compress prompts by up to 20x, enhancing the utility of LLMs.

  • Cost Savings: Reduces both prompt and generation lengths with minimal overhead.
  • Extended Context Support: Enhances support for longer contexts, mitigates the “lost in the middle” issue, and boosts overall performance.
  • Robustness: No additional training needed for LLMs.
  • Knowledge Retention: Maintains original prompt information such as in-context learning (ICL) examples and reasoning chains.
  • KV-Cache Compression: Accelerates the inference process.
  • Comprehensive Recovery: GPT-4 can recover all key information from compressed prompts.

Quick Start

To get started with LLMLingua, install it using pip:

pip install llmlingua
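
Note that PromptCompressor() with no arguments downloads the default LLaMA-2-7B checkpoint, which is a large download. Per the project README you can point it at a different model via the model_name argument, or switch to the LLMLingua-2 token classifier with use_llmlingua2=True. The GPT-2 name below is only an illustrative smaller choice, not a recommendation; the LLMLingua-2 checkpoint name is taken from the README and may change over time.

from llmlingua import PromptCompressor

# Smaller causal LM instead of the default LLaMA-2-7B (illustrative choice)
llm_lingua = PromptCompressor(model_name="openai-community/gpt2", device_map="cpu")

# LLMLingua-2: a small token-classification model, much faster on CPU
llm_lingua2 = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)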

Setup

This example uses Ollama with the deepseek-r1 model and LLMLingua:

python -m venv venv
source venv/bin/activate
pip install ollama llmlingua
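
This assumes the Ollama daemon is installed and running locally, and that the deepseek-r1 model has been pulled at least once:

ollama pull deepseek-r1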

Test

The following example can be used to test the compression rate for a given prompt and to check the response obtained from the model.

import ollama
from llmlingua import PromptCompressor

# Define the model and original prompt
model = "deepseek-r1"  # You can change this to other models like "llama2"
original_prompt = "Can you please tell me what the capital of France is? I need this information for my geography project."


# Compress the prompt with LLMLingua (compress_prompt returns a dict, not a string)
# Note: this toy prompt is already shorter than target_token, so little will be removed;
# in practice you would compress a much longer prompt.
llm_lingua = PromptCompressor()
result = llm_lingua.compress_prompt(original_prompt, instruction="", question="", target_token=200)
compressed_prompt = result["compressed_prompt"]

print(f"Compressed Prompt: {compressed_prompt}")

# Generate response using Ollama
response = ollama.chat(model=model, messages=[{"role": "user", "content": compressed_prompt}])

# Print the response
print("AI Response:", response["message"]["content"])

You can also test it directly using the following demo.

Conclusion

With commercial products that charge per token, a compressor like this makes sense. It is still important to assess each use case to understand whether it pays off for the volume of tokens you are considering. This post also sat in my drafts for a while, and given the speed of innovation in the AI field there may be a better way or model to do this today.

References