Introduction
Large language models (LLMs) have been a hot topic in machine learning lately, and many job openings now seek developers with experience fine-tuning them. As a 7th-semester student preparing to apply for jobs, I want to share my experience and insights into the fine-tuning process.
What will we talk about?
LLM
LLM stands for Large Language Model. It is a type of artificial intelligence trained on vast amounts of text data to understand and generate human-like language. LLMs, like GPT, can perform various tasks, such as answering questions, writing text, translating languages, and more. They work by predicting the next word in a sequence based on the context provided. For a deeper understanding, you can read this Medium post about LLMs by Andreas Stöffelbauer.
HuggingFace
Hugging Face is a GitHub-like platform for machine learning, providing an open-source ecosystem that fosters collaboration among researchers, developers, and data scientists. It hosts a wide array of pre-trained models, datasets, and tools, allowing users to easily access and share state-of-the-art models across various domains, including natural language processing (NLP), computer vision, and more.
DistilBERT
DistilBERT is a transformer-based model derived from BERT (Bidirectional Encoder Representations from Transformers) that has been "distilled" to be smaller and more efficient, while still retaining much of BERT's language-understanding capability. According to the documentation, DistilBERT can be fine-tuned for a variety of tasks, making it a versatile choice for many NLP applications.
Sentiment Analysis
Sentiment Analysis is a process where a large language model (LLM) analyzes text to determine the emotional tone or sentiment behind it. The goal is to classify text as positive, negative, or neutral. LLMs use patterns in the text to understand the emotions, opinions, or attitudes expressed, making them useful for tasks like social media monitoring, customer feedback analysis, and product reviews. To gain a better understanding of sentiment analysis with LLMs, you can check out this blog.
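To get a quick taste before we build our own, here is what sentiment analysis looks like with Hugging Face's pipeline API and its default pre-trained model. This is just an illustration; it is not the model we fine-tune below.

from transformers import pipeline

# Downloads a default pre-trained sentiment model on first run
classifier = pipeline("sentiment-analysis")

print(classifier("This product exceeded my expectations!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]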
CRISP-DM
In this blog post, we will walk through the steps of building a sentiment analysis model using DistilBERT, a popular transformer model. We will use the CRISP-DM framework to guide us through the process. CRISP-DM stands for Cross-Industry Standard Process for Data Mining, which provides a structured methodology for tackling data analysis tasks. We'll be applying this approach to a Kaggle dataset of Amazon product reviews to classify sentiment into "Good Review" and "Bad Review."
Business Understanding
The goal of this project is to classify Amazon product reviews into two categories: "Good Review" and "Bad Review". Sentiment analysis is a powerful tool for understanding customer feedback and improving user experiences. By automating sentiment classification, businesses can quickly analyze large volumes of reviews and make informed decisions.
Data Understanding
The dataset used in this project contains Amazon product reviews compressed into .bz2 files, which you can download here. We will need to extract and preprocess these reviews for use in our sentiment analysis model.
import bz2

def decompress_bz2(file_path, output_path):
    # Read the compressed file as text and write it back out uncompressed
    with bz2.open(file_path, 'rt', encoding='utf-8') as file:
        with open(output_path, 'w', encoding='utf-8') as out_file:
            out_file.write(file.read())

decompress_bz2('/kaggle/input/amazonreviews/train.ft.txt.bz2', 'train.ft.txt')
decompress_bz2('/kaggle/input/amazonreviews/test.ft.txt.bz2', 'test.ft.txt')
After decompression, we parse the data into a DataFrame and combine both the training and test datasets. The labels are mapped to binary values (0 for "Bad Review" and 1 for "Good Review"), and the title and text fields are concatenated into a single input_text field for better model accuracy.
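The parsing step itself isn't shown above, so here's a minimal sketch. It assumes the raw files use this dataset's fastText format, where each line starts with __label__1 or __label__2 and the review title is separated from the body by ": " (both assumptions about the raw layout).

import pandas as pd

def parse_ft_file(path):
    rows = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            # '__label__2 Great CD: My lovely ...' -> label token, then review
            label_token, _, review = line.partition(' ')
            label = int(label_token.replace('__label__', ''))
            title, _, text = review.partition(': ')
            rows.append({'label': label, 'title': title, 'text': text.strip()})
    return pd.DataFrame(rows)

# Combine the training and test sets into one DataFrame
df = pd.concat([parse_ft_file('train.ft.txt'), parse_ft_file('test.ft.txt')], ignore_index=True)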
# Here's how I map it: the raw labels are 1 (bad) and 2 (good)
df['label'] = df['label'].apply(lambda x: 0 if x == 1 else 1)
Data Preparation
Given the limited memory available in Kaggle's environment, we will only sample 5% of the dataset for training and evaluation.
from datasets import Dataset

# Combine title and text into a single input field
df['input_text'] = df['title'] + " " + df['text']
df_prep = df[['input_text', 'label']]

# Stratified 5% sample per label: keeps the class balance while fitting in memory
sampled_df = df_prep.groupby('label', group_keys=False).apply(lambda x: x.sample(frac=0.05, random_state=42))
dataset = Dataset.from_pandas(sampled_df)
Here's an example of the dataset:
| input_text | label |
|:--------------------------------------------------|:------|
| I HATE THIS CD.. And you should too...Avoid th... | 0 |
| Too cheap Bought 2 of these 75 to 300 ohm matc... | 0 |
| It had a good walkthrough but it lacked many s... | 0 |
| Didn't know what to think. I like Jim Carrey, ... | 0 |
| Valeo Olympic Spring Collars I should have lis... | 0 |
| ... | ... |
| Fun and magical read! What a magical book! My ... | 1 |
| Terrific This is a wonderful book!! I was givi... | 1 |
| fun toy I have had good luck with the batterie... | 1 |
| Info Love the Family Guy DVD's I enjoy watchin... | 1 |
| What happend? The only reason I gave this game... | 1 |
Next, we tokenize the text data using DistilBERT's tokenizer. This prepares the dataset for model training by converting the text into a format that the model can understand.
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Pad/truncate every review to 128 tokens
def tokenize_function(examples):
    return tokenizer(examples['input_text'], padding='max_length', truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
Don't forget to split your dataset!
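A minimal sketch using the train_test_split method from the datasets library; the 80/20 ratio and the fixed seed here are my choices, not from the original run.

# The result is indexed by 'train' and 'test',
# which is how it gets passed to the Trainer in the next section
train_test_split = tokenized_datasets.train_test_split(test_size=0.2, seed=42)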
Modeling
We use DistilBERT, a lighter version of the BERT transformer model, for sequence classification. The model is fine-tuned on the sentiment analysis task, and the training process is set up using the Trainer class from the Hugging Face library.
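The Trainer below is passed a compute_metrics callback that isn't shown in the post. Here's a minimal sketch that produces the accuracy and macro-averaged metrics reported in the Evaluation section, using scikit-learn (my choice of implementation, not necessarily the original):

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Macro averaging weights both classes equally
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    return {
        'accuracy': accuracy_score(labels, preds),
        'precision_macro': precision,
        'recall_macro': recall,
        'f1_macro': f1,
    }

The Trainer automatically prefixes these keys with eval_ during evaluation, which matches the output shown below.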
from transformers import DistilBertForSequenceClassification, TrainingArguments, Trainer

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

training_args = TrainingArguments(
    output_dir='/kaggle/working/results',
    num_train_epochs=5.0,
    ...
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_test_split['train'],
    eval_dataset=train_test_split['test'],
)
trainer.train()
Evaluation
After training, we evaluate the model's performance on the test dataset using various metrics such as accuracy, precision, recall, and F1-score.
results = trainer.evaluate()
print(results)
"""
{
'eval_accuracy': 0.953925,
'eval_precision_macro': 0.9539209871607255,
'eval_recall_macro': 0.9539319939428168,
'eval_f1_macro': 0.9539242719746999,
'eval_loss': 0.15511418879032135,
'eval_runtime': 90.9442,
'eval_samples_per_second': 439.83,
'eval_steps_per_second': 6.872,
'epoch': 5.0
}
"""
The metrics show the overall effectiveness of the model. We save the model and tokenizer for future use, including deployment or further fine-tuning.
from transformers import AutoConfig, AutoModelForSequenceClassification

# Label mapping
label2id = {"Bad Review": 0, "Good Review": 1}
id2label = {0: "Bad Review", 1: "Good Review"}
model_ckpt = "kohendru/distilbert-amazon-sentiment-model"

# Define config with human-readable label names
config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)

# Load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config)

model.save_pretrained("/kaggle/working/distilbert-sentiment-model")
tokenizer.save_pretrained("/kaggle/working/distilbert-sentiment-model")
model.save_pretrained("/kaggle/working/distilbert-sentiment-model", safe_serialization=False)  # If you want to save `pytorch_model.bin`
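If you also want to publish the model to the Hugging Face Hub (as I did), here's a minimal sketch, assuming you've already authenticated with huggingface-cli login or an HF_TOKEN environment variable:

# Push the fine-tuned model and its tokenizer to the Hub
model.push_to_hub("kohendru/distilbert-amazon-sentiment-model")
tokenizer.push_to_hub("kohendru/distilbert-amazon-sentiment-model")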
Deployment
Here's how you can test the model by predicting the sentiment of a sample text.
import torch

def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    prediction = torch.argmax(logits, dim=-1).item()
    return prediction  # 0 = Bad Review, 1 = Good Review

text = "I'm really disappointed with this product"
predicted_sentiment = predict_sentiment(text)
print(f"Predicted Sentiment: {predicted_sentiment}")
If you want to test it yourself, I've uploaded my model to a Hugging Face repository, or you can try the Streamlit app I made.
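For a quick local test of the published model, the pipeline API works too. Since the config includes the id2label mapping, the output uses the human-readable label names (the exact score below is illustrative):

from transformers import pipeline

clf = pipeline("text-classification", model="kohendru/distilbert-amazon-sentiment-model")
print(clf("I'm really disappointed with this product"))
# e.g. [{'label': 'Bad Review', 'score': 0.99}]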
Conclusion
Using the CRISP-DM methodology, we were able to build a sentiment analysis model that classifies Amazon reviews into "Good" and "Bad" categories. The model was successfully trained, evaluated, and deployed, and it provides a foundation for future improvements such as hyperparameter tuning or deployment to production systems.
This approach can be applied to other text classification tasks, and using pretrained models like DistilBERT can significantly speed up the process of building effective NLP models.