How to Train and Publish Your Own LLM with Hugging Face (Part 1: Getting Started)

Introduction

If you’ve been following the rise of AI, you’ve probably heard of Hugging Face — the platform that has become the home of modern machine learning models. But if you’re new to this world, it can feel overwhelming: How do you train your own AI model? What’s a dataset? And how do you actually get your model online so others can use it?

This is the first post in our 3-part series: “How to Train and Publish Your Own LLM with Hugging Face.” In this post, we’ll take the very first steps — setting up Hugging Face, understanding the basics, and preparing a simple dataset.


What You’ll Learn in This Post

  • What Hugging Face is and why it matters
  • How to install the right libraries
  • How to load a pre-trained model
  • How to create your own toy dataset for experiments
  • Running your first mini-training loop

By the end, you’ll have a simple example running locally — your very first step toward training an LLM!


Step 1: What is Hugging Face?

Think of Hugging Face as the GitHub of AI models.
It has:

  • Models Hub → a collection of pre-trained models you can download and use
  • Datasets Hub → ready-made datasets for NLP, vision, speech, etc.
  • Spaces → share interactive apps powered by models

In this series, we’ll focus on training & publishing models.


Step 2: Install the Tools

You’ll need three main libraries, plus PyTorch as the model backend and pandas for building a toy dataset later in this post:

pip install transformers datasets huggingface_hub torch pandas

  • transformers → models & training
  • datasets → load and prepare datasets
  • huggingface_hub → connect with your Hugging Face account
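
To sanity-check the setup, you can print each library’s version. If you already have a Hugging Face account, huggingface_hub’s login() will also store an access token for you; that’s only needed once we publish to the Hub later in the series, so feel free to skip it for now:

import transformers, datasets, huggingface_hub

# Confirm everything imported and print the installed versions
print(transformers.__version__, datasets.__version__, huggingface_hub.__version__)

# Optional: authenticate with the Hub (paste an access token from
# https://huggingface.co/settings/tokens when prompted)
# from huggingface_hub import login
# login()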

Step 3: Load a Pre-Trained Model

Instead of training from scratch (which is expensive), we usually fine-tune an existing pre-trained model. Let’s start small:

from transformers import pipeline

# Download DistilGPT-2 and wrap it in a ready-to-use text-generation pipeline
generator = pipeline("text-generation", model="distilgpt2")

# max_new_tokens caps how many tokens are generated after the prompt
print(generator("Hello world, this is my first model", max_new_tokens=30))

👉 This uses DistilGPT-2, a distilled, lightweight version of GPT-2 that downloads and runs quickly, even on a laptop CPU.
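
The pipeline returns a list of dictionaries rather than a plain string. If you just want the text, each result carries it under the generated_text key:

# Grab only the generated string from the first (and only) result
result = generator("Hello world, this is my first model", max_new_tokens=30)
print(result[0]["generated_text"])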


Step 4: Create a Simple Dataset

You don’t need a huge dataset to start experimenting. Let’s create one in Python:

import pandas as pd

# A tiny toy dataset: a single "text" column with four example sentences
data = {
    "text": [
        "I love learning about AI.",
        "Transformers are amazing for NLP.",
        "Hugging Face makes training easy.",
        "Custom datasets help models adapt."
    ]
}

# Save it as a CSV (index=False drops pandas' row-number column)
df = pd.DataFrame(data)
df.to_csv("my_dataset.csv", index=False)
print("Dataset saved as my_dataset.csv")

👉 Step 5 below gives a sneak peek at loading this CSV with Hugging Face’s datasets library; in the next post, we’ll put it to work for fine-tuning.


Step 5: Your First Mini-Training (Optional)

If you want to try training right now, first load the dataset in Hugging Face format:

from datasets import load_dataset

# The "csv" builder reads our file into a DatasetDict
dataset = load_dataset("csv", data_files="my_dataset.csv")

# By default, all rows land in a split named "train"
print(dataset["train"][0])

This loads your dataset into Hugging Face format. We’ll build on this in Part 2, where we’ll fine-tune an actual model.
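
And if you really can’t wait for Part 2, here is a minimal sketch of what a mini-training run on this toy dataset could look like, using the Trainer API. Treat it as a preview under simplifying assumptions: the output folder my_first_model is just a placeholder, the hyperparameters are arbitrary, and depending on your transformers version you may also need pip install accelerate:

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

dataset = load_dataset("csv", data_files="my_dataset.csv")

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 models ship without a pad token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=64)

# Tokenize every row and drop the raw "text" column
tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

args = TrainingArguments(
    output_dir="my_first_model",      # placeholder folder for checkpoints
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # mlm=False → causal language modeling (predict the next token)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

Four sentences won’t teach the model anything meaningful, of course; the goal is just to watch all the pieces click together before we do it properly in Part 2.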


Wrap-Up

🎉 You did it! You:

  • Installed Hugging Face tools
  • Ran a pre-trained model
  • Created your first custom dataset

In the next post, we’ll fine-tune a model on your dataset, measure performance, and save it locally.

👉 Stay tuned for:
How to Train and Publish Your Own LLM with Hugging Face (Part 2: Fine-Tuning Your Model)
