Introduction
If you’ve been following the rise of AI, you’ve probably heard of Hugging Face — the platform that has become the home of modern machine learning models. But if you’re new to this world, it can feel overwhelming: How do you train your own AI model? What’s a dataset? And how do you actually get your model online so others can use it?
This is the first post in our 3-part series: “How to Train and Publish Your Own LLM with Hugging Face.” In this post, we’ll take the very first steps — setting up Hugging Face, understanding the basics, and preparing a simple dataset.
What You’ll Learn in This Post
- What Hugging Face is and why it matters
- How to install the right libraries
- How to load a pre-trained model
- How to create your own toy dataset for experiments
- How to run your first mini-training loop
By the end, you’ll have a simple example running locally — your very first step toward training an LLM!
Step 1: What is Hugging Face?
Think of Hugging Face as the GitHub of AI models.
It has:
- Models Hub → a collection of pre-trained models you can download and use
- Datasets Hub → ready-made datasets for NLP, vision, speech, etc.
- Spaces → share interactive apps powered by models
In this series, we’ll focus on training & publishing models.
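If you're curious, you can even browse the Models Hub straight from Python using the huggingface_hub library we'll install in Step 2. Here's a quick sketch (assuming a reasonably recent version of huggingface_hub):

```python
from huggingface_hub import list_models

# Search the Hub for models matching "distilgpt2" -- the model
# we'll use later in this post
for model in list_models(search="distilgpt2", limit=5):
    print(model.id)
```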
Step 2: Install the Tools
You’ll need three main libraries:

```bash
pip install transformers datasets huggingface_hub
```

- transformers → models & training
- datasets → load and prepare datasets
- huggingface_hub → connect with your Hugging Face account

You’ll also need a deep-learning backend for transformers to actually run models; if you don’t have one yet, `pip install torch` installs PyTorch.
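A quick way to confirm the installs worked is to import each library and print its version (the exact numbers will vary on your machine):

```python
import transformers
import datasets
import huggingface_hub

print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("huggingface_hub:", huggingface_hub.__version__)
```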
Step 3: Load a Pre-Trained Model
Instead of starting from scratch (which is expensive), we usually fine-tune a model. Let’s start small:
```python
from transformers import pipeline

# Downloads the model from the Hub on first run, then caches it locally
generator = pipeline("text-generation", model="distilgpt2")
print(generator("Hello world, this is my first model", max_length=30))
```
👉 This uses DistilGPT-2, a distilled, lightweight version of GPT-2 that's small enough to experiment with on a laptop.
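The pipeline is a convenient wrapper; under the hood it pairs a tokenizer with a model. Since fine-tuning in Part 2 will work with those pieces directly, here's a sketch of the same generation done by hand (max_new_tokens and pad_token_id are optional tweaks that keep the output tidy):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("Hello world, this is my first model", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,  # silences a warning for GPT-2
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```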
Step 4: Create a Simple Dataset
You don’t need a huge dataset to start experimenting. Let’s create one in Python:
```python
import pandas as pd

data = {
    "text": [
        "I love learning about AI.",
        "Transformers are amazing for NLP.",
        "Hugging Face makes training easy.",
        "Custom datasets help models adapt.",
    ]
}

df = pd.DataFrame(data)
df.to_csv("my_dataset.csv", index=False)
print("Dataset saved as my_dataset.csv")
```
👉 In Step 5 below (and in much more depth in the next post), we'll use Hugging Face's datasets library to load this CSV.
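As an aside, you don't strictly need the CSV round-trip: the datasets library can build a Dataset straight from the Python dict above. A small sketch:

```python
from datasets import Dataset

# Reuses the `data` dict defined above
dataset = Dataset.from_dict(data)
print(dataset[0])  # {'text': 'I love learning about AI.'}
```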
Step 5: Your First Mini-Training (Optional)
If you want to try training right now, start by loading the CSV you just created:
```python
from datasets import load_dataset

dataset = load_dataset("csv", data_files="my_dataset.csv")
print(dataset["train"][0])
```
This loads your dataset into Hugging Face's Dataset format, the same structure we'll feed into fine-tuning in Part 2. If you're curious what a training step actually looks like, there's a small sketch below.
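Here is a minimal sketch of a single training step, assuming PyTorch is installed and reusing the `dataset` variable from the snippet above. It's deliberately tiny (one gradient update on four sentences); Part 2 does this properly:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Tokenize the four sentences we saved in my_dataset.csv
texts = dataset["train"]["text"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# For causal LMs, the labels are the input ids themselves; mask out
# padding positions so they don't contribute to the loss
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

# A single gradient step: forward pass, loss, backward pass, update
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {outputs.loss.item():.4f}")
```

Don't worry if the loss number means little yet; the point is that the whole pipeline (tokenize → forward pass → loss → gradient step) runs on your machine.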
Wrap-Up
🎉 You did it! You:
- Installed Hugging Face tools
- Ran a pre-trained model
- Created your first custom dataset
In the next post, we’ll fine-tune a model on your dataset, measure performance, and save it locally.
👉 Stay tuned for:
“How to Train and Publish Your Own LLM with Hugging Face (Part 2: Fine-Tuning Your Model)”