Integrating LLM with Scikit-Learn Using Scikit-LLM

The popular Python package, Scikit-Learn, has long been a staple for creating machine learning models and classifiers for industrial applications. However, it traditionally relied on TF-IDF and frequency-based methods for natural language tasks, lacking language comprehension capabilities. With the rise of large language models (LLMs), the Scikit-LLM library aims to bridge this gap. It integrates LLMs to create text classifiers using the familiar Scikit-Learn API.

In this article, we explore Scikit-LLM and implement a zero-shot text classifier on a demo dataset.

### Setup and Installation

Scikit-LLM is available on PyPI, making it easy to install via pip. Use the command below to install the package.
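A minimal installation from PyPI (assuming `pip` is available on your PATH):

```shell
pip install scikit-llm
```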

### LLM Backend Support

Scikit-LLM currently supports both API-based and locally hosted large language models. Custom models hosted on-premises or on cloud platforms can also be integrated. We'll walk through setting up each backend in the sections below.

#### OpenAI

GPT models are among the most widely used language models globally, powering numerous applications. To configure an OpenAI model using Scikit-LLM, set up the API credentials and specify the model name.


```python
from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("<your_key>")
SKLLMConfig.set_openai_org("<your_organization_id>")
```

Once the API credentials are configured, we can use Scikit-LLM's zero-shot classifier, which defaults to the OpenAI model.

```python
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

clf = ZeroShotGPTClassifier(model="gpt-4")
```

#### LlamaCPP and GGUF Models

While OpenAI is popular, it can be costly and impractical in some scenarios. Scikit-LLM supports locally run quantized GGUF or GGML models via llama-cpp.

Run the following commands to install the required support packages:

```bash
pip install 'scikit-llm[gguf]' --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu --no-cache-dir
pip install 'scikit-llm[llama-cpp]'
```

We can now use Scikit-LLM's zero-shot classifier to load GGUF models. Note that only a few models are currently supported; the list of supported models is available [here](https://skllm.beastbyte.ai/docs/introduction-backend-families#gguf).

We'll use the GGUF quantized version of Gemma-2B for our purpose. Use the `gguf::<model_name>` syntax to load a quantized GGUF model in Scikit-LLM.

```python
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

clf = ZeroShotGPTClassifier(model="gguf::gemma2-2b-q6")
```

#### External Models

Lastly, self-hosted models that follow the OpenAI API standard can be used, whether hosted locally or in the cloud. Simply provide the model's API URL.

Load the model from a custom URL using the code below:

```python
from skllm.config import SKLLMConfig
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

SKLLMConfig.set_gpt_url("http://localhost:8000/")
clf = ZeroShotGPTClassifier(model="custom_url::<custom_model_name>")
```

### Model and Inference Using the Core Scikit-Learn API

We can now train the model on a classification dataset using the Scikit-Learn API. We'll demonstrate a basic implementation using a sentiment prediction dataset of movie reviews.

#### Dataset

The dataset, provided by the scikit-llm package, contains 100 movie reviews labeled with positive, neutral, or negative sentiment. We'll load the dataset and split it into training and test sets for our demo.

Use traditional Scikit-Learn methods to load and split the dataset:

```python
from sklearn.model_selection import train_test_split

from skllm.datasets import get_classification_dataset

X, y = get_classification_dataset()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```

#### Fit and Predict

Training and prediction with the large language model follow the same Scikit-Learn API. First, fit the model on the training dataset, then use it to make predictions on unseen test data.

```python
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```

On the test set, we achieve **100% accuracy with the Gemma2-2B model**, owing to the dataset's simplicity. Here are a few examples from the test samples:

```
Sample Review: "Under the Same Sky was an okay movie. The plot was decent, and the performances were fine, but it lacked depth and originality. It is not a movie I would watch again."
Predicted Sentiment: ['neutral']

Sample Review: "The cinematography in Awakening was nothing short of spectacular. The visuals alone are worth the ticket price. The storyline was unique and the performances were solid. An overall fantastic film."
Predicted Sentiment: ['positive']

Sample Review: "I found Hollow Echoes to be a complete mess. The plot was non-existent, the performances were overdone, and the pacing was all over the place. Not worth the hype."
Predicted Sentiment: ['negative']
```

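The accuracy figure above can be reproduced with scikit-learn's standard metrics. A minimal sketch: in a real run, `y_test` and `predictions` come from the earlier snippets, but here they are stubbed with the three sample labels above for illustration.

```python
from sklearn.metrics import accuracy_score

# Stand-ins for the labels and predictions produced by the snippets above.
y_test = ["neutral", "positive", "negative"]
predictions = ["neutral", "positive", "negative"]

print(accuracy_score(y_test, predictions))  # 1.0
```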
### Conclusion

The Scikit-LLM package is gaining popularity for its familiar API, which makes integration into existing pipelines seamless. It improves on traditional frequency-based methods by adding language comprehension: integrating language models brings reasoning and understanding to the text input, which can boost the performance of standard models.

Additionally, it offers few-shot and chain-of-thought classifiers, as well as other text modeling tasks such as summarization. Explore the package and its documentation on the official site to find what suits your needs.

**Kanwal Mehreen**
Kanwal is a machine learning engineer and technical writer passionate about data science and the intersection of AI and medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT." As a Google Generation Scholar 2022 for APAC, she advocates for diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Fellow. Kanwal is a change advocate, having founded FEMCodes to empower women in STEM fields.