Introduction
Welcome back! 🎉
This is Part 2 of our series “How to Train and Publish Your Own LLM with Hugging Face.”
👉 If you missed Part 1: Getting Started, I highly recommend starting there first. In Part 1, we covered:
- What Hugging Face is
- Installing the right tools
- Running your first pre-trained model
- Creating a simple custom dataset
Now that you’re set up, let’s take the next step: fine-tuning your own model.
What You’ll Learn in This Post
- Loading your dataset into Hugging Face
- Choosing the right pre-trained model
- Fine-tuning on your dataset
- Evaluating results
- Saving your model locally
By the end, you’ll have your very first custom-trained model running on your machine! 🚀
Step 1: Load Your Dataset
In Part 1, we created a simple CSV dataset. Let’s load it:
```python
from datasets import load_dataset

# Load the CSV from Part 1; a single file ends up in a "train" split by default
dataset = load_dataset("csv", data_files="my_dataset.csv")
print(dataset["train"][0])
```
👉 Hugging Face wraps your CSV in a DatasetDict, a format that works seamlessly with the models and Trainer we'll use below.
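Our toy CSV only has a train split. If you'd like a held-out validation set for the evaluation step later, here's a minimal optional sketch using train_test_split (for the rest of this post we'll keep reusing the single split to stay simple):
```python
# Optional: carve a small validation set out of the single "train" split
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
print(split)  # now a DatasetDict with "train" and "test" splits
# Later, you could pass the tokenized "test" split as eval_dataset to the Trainer
```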
Step 2: Pick a Pre-Trained Model
For fine-tuning, we’ll use a small GPT-like model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# GPT-2 tokenizers have no pad token; reuse eos so batches can be padded later
tokenizer.pad_token = tokenizer.eos_token
```
Step 3: Tokenize Your Dataset
Models don’t understand raw text — they need tokens.
```python
def tokenize_function(examples):
    # Convert each text row into input_ids plus an attention mask
    return tokenizer(examples["text"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```
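As a quick sanity check, you can peek at one tokenized row and decode it back (the exact ids depend on your Part 1 dataset):
```python
sample = tokenized_datasets["train"][0]
print(sample["input_ids"][:10])                    # integer token ids
print(tokenizer.decode(sample["input_ids"][:10]))  # the same ids back as text
```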
Step 4: Set Up Training
We’ll use the Hugging Face Trainer API to keep things simple:
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints get written
    evaluation_strategy="epoch",     # evaluate every epoch (newer transformers releases rename this to eval_strategy)
    learning_rate=2e-5,
    per_device_train_batch_size=2,   # tiny batches so this runs on modest hardware
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",           # also save a checkpoint every epoch
)
```
```python
from transformers import DataCollatorForLanguageModeling

# mlm=False means causal language modeling (predict the next token),
# which is what GPT-style models like distilgpt2 are trained for
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
```
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    # To keep this example tiny we evaluate on the training split too;
    # for a real project, pass a held-out validation split here instead
    eval_dataset=tokenized_datasets["train"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
```
Step 5: Train Your Model
Now let’s run training 🚀:
```python
trainer.train()
```
You’ll start seeing logs of loss decreasing — that means your model is learning!
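We also promised a look at evaluating results. Here's a minimal sketch: trainer.evaluate() reports the loss on the eval dataset we configured, and exponentiating that loss gives perplexity, a common sanity check for language models (your numbers will depend on your dataset):
```python
import math

# Evaluate on the eval_dataset we passed to the Trainer
metrics = trainer.evaluate()
print(metrics["eval_loss"])
print("perplexity:", math.exp(metrics["eval_loss"]))  # lower is better
```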
Step 6: Save Your Model
After training, save it locally:
```python
# Writes the model weights and config, plus the tokenizer files, into my_custom_model/
trainer.save_model("my_custom_model")
tokenizer.save_pretrained("my_custom_model")
```
You now have a custom fine-tuned model on your dataset. 🎉
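That saved folder behaves just like a model you'd download from the Hub, so you can reload it later in a fresh session:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the fine-tuned model and tokenizer from the local directory
model = AutoModelForCausalLM.from_pretrained("my_custom_model")
tokenizer = AutoTokenizer.from_pretrained("my_custom_model")
```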
Step 7: Test Your Model
Let’s test the output:
```python
from transformers import pipeline

# Point the pipeline at the directory we just saved
generator = pipeline("text-generation", model="my_custom_model")
print(generator("Hugging Face is", max_length=30))
```
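If the output looks repetitive, you can ask the pipeline to sample instead of decoding greedily; the values below are just illustrative starting points:
```python
# Sample with some randomness and generate two different completions
outputs = generator(
    "Hugging Face is",
    max_new_tokens=30,
    do_sample=True,
    temperature=0.8,
    num_return_sequences=2,
)
for out in outputs:
    print(out["generated_text"])
```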
Wrap-Up
In this post, you:
- Loaded your dataset into Hugging Face
- Tokenized it for training
- Fine-tuned a pre-trained GPT model
- Saved your own custom-trained model
🎯 Congratulations — you now have a working, fine-tuned LLM!
In the next post (Part 3), we’ll cover:
- How to publish your model on Hugging Face Hub
- How to share it with others
- How to create an interactive demo using Hugging Face Spaces
👉 Continue to: “How to Train and Publish Your Own LLM with Hugging Face (Part 3: Publishing & Sharing)”