Introduction
Welcome back! 🎉
This is Part 2 of our series “How to Train and Publish Your Own LLM with Hugging Face.”
👉 If you missed Part 1: Getting Started, I highly recommend starting there first. In Part 1, we covered:
- What Hugging Face is
- Installing the right tools
- Running your first pre-trained model
- Creating a simple custom dataset
Now that you’re set up, let’s take the next step: fine-tuning your own model.
What You’ll Learn in This Post
- Loading your dataset into Hugging Face
- Choosing the right pre-trained model
- Fine-tuning on your dataset
- Evaluating results
- Saving your model locally
By the end, you’ll have your very first custom-trained model running on your machine! 🚀
Step 1: Load Your Dataset
In Part 1, we created a simple CSV dataset. Let’s load it:
```python
from datasets import load_dataset

# Load the CSV from Part 1; a single file ends up in a "train" split by default
dataset = load_dataset("csv", data_files="my_dataset.csv")
print(dataset["train"][0])
```
👉 Hugging Face wraps your CSV in a DatasetDict, a format that works seamlessly with the models and Trainer we'll use below.
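Our toy CSV only has a train split. If you'd like a held-out validation set for the evaluation step later, here's a minimal optional sketch using train_test_split (for the rest of this post we'll keep reusing the single split to stay simple):
```python
# Optional: carve a small validation set out of the single "train" split
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
print(split)  # now a DatasetDict with "train" and "test" splits
# Later, you could pass the tokenized "test" split as eval_dataset to the Trainer
```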
Step 2: Pick a Pre-Trained Model
For fine-tuning, we’ll use a small GPT-like model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# GPT-2 tokenizers have no pad token; reuse eos so batches can be padded later
tokenizer.pad_token = tokenizer.eos_token
```
Step 3: Tokenize Your Dataset
Models don’t understand raw text — they need tokens.
```python
def tokenize_function(examples):
    # Convert each text row into input_ids plus an attention mask
    return tokenizer(examples["text"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```
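As a quick sanity check, you can peek at one tokenized row and decode it back (the exact ids depend on your Part 1 dataset):
```python
sample = tokenized_datasets["train"][0]
print(sample["input_ids"][:10])                    # integer token ids
print(tokenizer.decode(sample["input_ids"][:10]))  # the same ids back as text
```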
Step 4: Set Up Training
We’ll use the Hugging Face Trainer API to keep things simple:
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints get written
    evaluation_strategy="epoch",     # evaluate every epoch (newer transformers releases rename this to eval_strategy)
    learning_rate=2e-5,
    per_device_train_batch_size=2,   # tiny batches so this runs on modest hardware
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",           # also save a checkpoint every epoch
)
```
```python
from transformers import DataCollatorForLanguageModeling

# mlm=False means causal language modeling (predict the next token),
# which is what GPT-style models like distilgpt2 are trained for
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
```
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    # To keep this example tiny we evaluate on the training split too;
    # for a real project, pass a held-out validation split here instead
    eval_dataset=tokenized_datasets["train"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
```
Step 5: Train Your Model
Now let’s run training 🚀:
```python
trainer.train()
```
You’ll start seeing logs of loss decreasing — that means your model is learning!
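We also promised a look at evaluating results. Here's a minimal sketch: trainer.evaluate() reports the loss on the eval dataset we configured, and exponentiating that loss gives perplexity, a common sanity check for language models (your numbers will depend on your dataset):
```python
import math

# Evaluate on the eval_dataset we passed to the Trainer
metrics = trainer.evaluate()
print(metrics["eval_loss"])
print("perplexity:", math.exp(metrics["eval_loss"]))  # lower is better
```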
Step 6: Save Your Model
After training, save it locally:
```python
# Writes the model weights and config, plus the tokenizer files, into my_custom_model/
trainer.save_model("my_custom_model")
tokenizer.save_pretrained("my_custom_model")
```
You now have a custom fine-tuned model on your dataset. 🎉
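That saved folder behaves just like a model you'd download from the Hub, so you can reload it later in a fresh session:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the fine-tuned model and tokenizer from the local directory
model = AutoModelForCausalLM.from_pretrained("my_custom_model")
tokenizer = AutoTokenizer.from_pretrained("my_custom_model")
```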
Step 7: Test Your Model
Let’s test the output:
```python
from transformers import pipeline

# Point the pipeline at the directory we just saved
generator = pipeline("text-generation", model="my_custom_model")
print(generator("Hugging Face is", max_length=30))
```
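If the output looks repetitive, you can ask the pipeline to sample instead of decoding greedily; the values below are just illustrative starting points:
```python
# Sample with some randomness and generate two different completions
outputs = generator(
    "Hugging Face is",
    max_new_tokens=30,
    do_sample=True,
    temperature=0.8,
    num_return_sequences=2,
)
for out in outputs:
    print(out["generated_text"])
```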
Wrap-Up
In this post, you:
- Loaded your dataset into Hugging Face
- Tokenized it for training
- Fine-tuned a pre-trained GPT model
- Saved your own custom-trained model
🎯 Congratulations — you now have a working, fine-tuned LLM!
In the next post (Part 3), we’ll cover:
- How to publish your model on Hugging Face Hub
- How to share it with others
- How to create an interactive demo using Hugging Face Spaces
👉 Continue to: “How to Train and Publish Your Own LLM with Hugging Face (Part 3: Publishing & Sharing)”