Fine-tuning an LLM on my blog posts

Ever wondered what it would be like to have an AI that writes exactly in your style? I did. And in this post, I share what I did about it. This is a very practical guide on how to fine-tune an LLM using LoRA with MLX on Apple Silicon.
At the start of the year I shared this blog post, where I converted all my blog posts into a Q&A dataset that I could use to fine-tune an LLM.

After sharing this, I spent time trying to fine-tune an LLM - but the results were not great and, with everything happening at OpenBB, I didn't have much time to dedicate to it.
But I hate leaving things halfway, and this task hasn’t left my TODO list for the past 6 months.
So I finally took matters into my own hands last weekend, and I’m going to share the entire journey: the what, the why and the how.
Buckle up, this will be a long post - and more technical than previous ones. And all the code will be available here: https://github.com/DidierRLopes/fine-tune-llm.
Context
Most AI models are like Wikipedia - they know a little about everything but lack the depth and personality that comes from lived experience.
Think of it this way: RAG is like giving someone a reference book during an exam. Fine-tuning is like actually teaching them the subject until it becomes part of how they think.
“Once you’ve maximized the performance gains from prompting, you might wonder whether to do RAG or finetuning next. The answer depends on whether your model’s failures are information-based or behavior-based.
If the model fails because it lacks information, a RAG system that gives the model access to the relevant sources of information can help. (…) On the other hand, if the model has behavioral issues, finetuning might help.”
- Chip Huyen, AI Engineering (Chapter 7: Finetuning)
When you fine-tune a model on your writing, you're not just feeding it information (particularly with small models and LoRA) - you're rewiring how it processes and responds to ideas. The same neural pathways that learned to write about quantum physics now learn your specific way of sharing thoughts on open source, MCP, boxing, and more.
In this case, because we will fine-tune an instruct model, even the system prompt becomes part of this personalization process from the very first token. It’s not a simple “You are a helpful assistant” but “You are Didier, CEO of OpenBB. You write with clarity and impact, focusing on fintech, open source, AI, and the future of research workflows”.
This will result in a fine-tuned model that thinks in your voice and operates with your expertise baseline. To some extent, that is: we will see later that the information transfer could be better. I attribute that to three things: we are using a small model (3.8B parameters), we are doing partial fine-tuning (only 0.08% of weights will be updated), and I didn’t spend a lot of time iterating on the hyperparameters.
0. Setting up the foundation
Model
I chose Microsoft's Phi-3 mini model (3.8B parameters) for several strategic reasons beyond just "it fits on my Mac":
Technical sweet spot: At 3.8B parameters, Phi-3 mini hits the perfect balance - large enough to produce coherent, contextual responses, but small enough to fine-tune efficiently on consumer hardware. Larger models like 7B+ would require more aggressive quantization.
Instruct-optimized foundation: This isn't a raw base model. Phi-3 mini is already instruction-tuned with supervised fine-tuning (SFT) and likely RLHF, meaning it understands how to follow prompts and maintain conversational flow. This gives me a much better starting point than training from a base model. Note: Microsoft actually did not release the base model.
Ecosystem support:
- This code reference gave me a working starting point
- There was an official cookbook with best practices
- There was a good model card on Hugging Face with clear usage examples
Hardware compatibility: With my M3 Max and 48GB RAM, this model fits comfortably in memory with room for LoRA adapters and training overhead.
Finetuning Technique
Traditional fine-tuning updates all 3.8 billion parameters, requiring enormous compute resources and risking catastrophic forgetting (where the model loses its general capabilities while learning your specific data).
LoRA's elegant solution: Low-Rank Adaptation works by decomposing weight updates into smaller matrices. Instead of modifying a large d × k weight matrix W directly, LoRA freezes W and learns two smaller matrices, B (d × r) and A (r × k), so that the adapted weight becomes W + BA, where the rank r is much smaller than d and k. More on LoRA here.
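To make the W + BA idea concrete, here is a minimal sketch in plain Python lists (no MLX, so it stays dependency-free). The sizes, helper names and random values are all illustrative assumptions, not what the training code uses. It shows that keeping the adapter separate gives the same output as merging it into W, and how few parameters the adapter adds.

```python
import random

random.seed(0)

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols):
    """Matrix of standard normal values (illustrative data only)."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

d, k, r = 6, 5, 2  # toy sizes; real layers have d and k in the thousands

W = rand_matrix(d, k)  # frozen pretrained weight, d x k
B = rand_matrix(d, r)  # trainable, d x r (random here so the check is non-trivial)
A = rand_matrix(r, k)  # trainable, r x k

# Adapted weight: W + BA. Only A and B would receive gradients.
W_adapted = add(W, matmul(B, A))

x = rand_matrix(k, 1)  # one input column vector

# Two equivalent ways to apply the adapted layer:
y_merged = matmul(W_adapted, x)                       # merge first, then apply
y_split = add(matmul(W, x), matmul(B, matmul(A, x)))  # keep the adapter separate

def lora_fraction(d, k, r):
    """Adapter parameters (r*d + r*k) relative to the d*k frozen matrix."""
    return r * (d + k) / (d * k)
```

For a single 4096 × 4096 matrix at rank 8, `lora_fraction(4096, 4096, 8)` is 0.00390625, i.e. about 0.39% extra parameters for that matrix. The overall percentage for a model is lower still, since adapters are usually attached to only some of the layers.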

Why this matters:
- Parameter efficiency: I'm only training a small percentage (<0.2%) of the entire 3.8B model
- Memory efficiency: Base model stays frozen, only adapter weights need gradients
- Modularity: Can swap different LoRA adapters for different tasks/personalities
- Reduced overfitting: Smaller parameter space makes it harder to memorize training data (which is also why fine-tuning is not the best choice for giving a model more information)
Framework
MLX is specifically designed for Apple's unified memory architecture. While PyTorch can run on Mac, it wasn't built with Apple Silicon's unique characteristics in mind.
Key MLX benefits:
- Memory efficiency: Unified memory means no CPU/GPU transfers, so the LoRA adapters and the base model share the same memory pool efficiently
- Lazy evaluation: Only computes what's needed, when it's needed - crucial for memory-constrained fine-tuning
- Native optimization: Built for Apple Silicon, with GPU kernels written for Metal
Most production fine-tuning still happens on NVIDIA GPUs with PyTorch. But for Apple Silicon users, MLX offers several advantages:
- Lower barrier to entry: No need for cloud GPUs or expensive NVIDIA hardware
- Rapid experimentation: Faster iteration cycles for smaller models
- Privacy: Everything runs locally, no data leaves your machine
Note that I was able to do this because I was working with a <10B parameter model and had Apple Silicon with 48GB RAM. But more importantly, this was done for experimentation, and not production - so I chose what allowed me to get my hands dirty faster.
1. Preparing the data
Code can be found here: https://github.com/DidierRLopes/fine-tune-llm/blob/main/scripts/01_prepare_data.py.
For the data we will be using a Q&A dataset based on my blogposts. The repository where I turned my blog posts into this dataset can be found here.

The dataset contains 91 blog posts transformed into conversational Q&A pairs - roughly 2,100 exchanges covering everything from OpenBB's journey to technical deep-dives on open source.
Each entry in the dataset contains conversations with user questions and my responses. But raw conversational data (which I parsed from a blogpost) isn't something you can just throw at a model. It needs structure, and more importantly, it needs the right structure for your chosen model.
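As a rough illustration of how one blog post's conversation could be flattened into individual samples, here is a minimal sketch. The `extract_qa_pairs` helper, the `"user"` / `"assistant"` role values and the example entry are assumptions for illustration - the dataset schema only tells us that each message has a `role` and a `content` string.

```python
def extract_qa_pairs(conversation):
    """Pair each user message with the assistant message that directly follows it."""
    pairs = []
    for prev, curr in zip(conversation, conversation[1:]):
        if prev["role"] == "user" and curr["role"] == "assistant":
            pairs.append({"question": prev["content"], "answer": curr["content"]})
    return pairs

# Made-up entry, shaped like the dataset's `conversation` field
post = {
    "title": "Example post",
    "conversation": [
        {"role": "user", "content": "First placeholder question?"},
        {"role": "assistant", "content": "First placeholder answer."},
        {"role": "user", "content": "Second placeholder question?"},
        {"role": "assistant", "content": "Second placeholder answer."},
    ],
}

samples = extract_qa_pairs(post["conversation"])
print(len(samples), "samples from one post")
```

Flattening like this is how a small number of posts can turn into a much larger number of training samples: every question/answer exchange becomes its own sample.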
Formatting for phi-3-mini-4k-instruct
Phi-3-mini-4k-instruct has been trained with a specific chat template, and we need to follow it - otherwise results won't be optimal (this was one of my first mistakes!)

You can find that template in the model card on HF: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
Important: Since this is an instruct model, it is important to retain the system prompt in the training samples. (I also made a mistake here!)
Example:
<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
Those special tokens (<|system|>, <|user|>, <|assistant|>, <|end|>) aren't decorative - they're semantic markers that tell the model exactly where each part of the conversation begins and ends. (Do not forget these, and ensure there are no typos! I did not make a mistake here, hehe.)
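A minimal sketch of turning one Q&A pair into a training string in this template could look like the following. The function name and the placeholder system prompt are hypothetical, and closing the assistant answer with <|end|> is an assumption that mirrors how the other turns are closed.

```python
# Placeholder; swap in your own personalized system prompt
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."

def format_phi3_sample(question, answer, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Wrap a single Q&A pair in Phi-3 style chat markers, system prompt included."""
    return (
        f"<|system|>\n{system_prompt}<|end|>\n"
        f"<|user|>\n{question}<|end|>\n"
        f"<|assistant|>\n{answer}<|end|>"  # assumption: the answer is closed with <|end|> too
    )

sample = format_phi3_sample(
    "How to explain Internet for a medieval knight?",
    "Placeholder answer.",
)
print(sample)
```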
I actually added a function to validate that the required tokens exist and are in the right order.
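The version in the repository may differ, but a check along these lines is enough to catch missing tokens, typos and wrong ordering. This sketch assumes single-turn samples (exactly one system, user and assistant block each); the names are hypothetical.

```python
ROLE_TOKENS = ["<|system|>", "<|user|>", "<|assistant|>"]  # expected order
END_TOKEN = "<|end|>"

def validate_sample(text):
    """Return (is_valid, reason) for a single-turn formatted sample."""
    position = 0  # index just past the previous block's <|end|>
    for token in ROLE_TOKENS:
        if text.count(token) != 1:
            return False, f"expected exactly one {token}"
        start = text.find(token)
        if start < position:
            return False, f"{token} is out of order"
        end = text.find(END_TOKEN, start + len(token))
        if end == -1:
            return False, f"missing {END_TOKEN} after {token}"
        position = end + len(END_TOKEN)
    return True, "ok"

good = "<|system|>\nS<|end|>\n<|user|>\nQ<|end|>\n<|assistant|>\nA<|end|>"
print(validate_sample(good))
```

Running this on a handful of samples from each split before training is cheap, and a typo in a special token is much harder to spot once training has already started.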
Training split
One of the most common mistakes in fine-tuning is treating your test data as validation data. Here's how I split the ~2,100 samples:
- Training (80%, ~1,700 samples): The model learns from these
- Validation (10%, ~210 samples): Monitors training progress in real-time.
- In typical ML systems, this is used to tune hyperparameters. In this case it tracks the validation loss during training, which helps you catch overfitting by making sure the validation loss doesn’t diverge from the training loss.
- Test (10%, ~210 samples): Final evaluation, never touched during training
But before splitting, I shuffle all samples from all conversations. This avoids temporal bias where training data represents one era of thinking while test data represents another.
One reason I recommend printing the number of samples is that it lets you put yourself in the model's shoes: knowing how many samples it will see helps you make better decisions about the training and model configs.
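The shuffle-then-split step can be sketched like this. The function name, the fixed seed and the rounding choice (floor the validation and test sizes, give the remainder to training) are assumptions, not necessarily what the script does.

```python
import random

def shuffle_and_split(samples, val_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle all samples, then cut them into train / validation / test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_val = int(len(shuffled) * val_ratio)
    n_test = int(len(shuffled) * test_ratio)
    n_train = len(shuffled) - n_val - n_test  # remainder goes to training

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Stand-in for the real samples: just their indices
train, val, test = shuffle_and_split(range(2129))
print(len(train), len(val), len(test))
```

With 2,129 samples this rounding gives 1,705 / 212 / 212, the same counts as in the logs below, though the actual script may arrive at them differently.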
Preparing data logs
$ python scripts/01_prepare_data.py --config config/data_config.yaml
============================================================
DATA PREPARATION PIPELINE
============================================================
>>> Step 1: Loading raw dataset...
Loading dataset: didierlopes/my-blog-qa-dataset
Dataset loaded successfully. Available splits: ['train']
Dataset size: 91 samples
Dataset features: {'title': Value('string'), 'conversation': List({'content': Value('string'), 'role': Value('string')}), 'context': Value('string'), 'url': Value('string'), 'date': Value('string')}
>>> Step 2: Processing and formatting data...
Starting data preprocessing...
Extracted 2129 conversation samples
Data split created:
Training samples: 1705 (80.1%)
Validation samples: 212 (10.0%)
Test samples: 212 (10.0%)
Data preprocessing completed successfully!
>>> Step 3: Validating processed data...
Validating training data...
Validating 10 samples...
📊 Validation Summary:
Total samples checked: 10
Valid samples: 10
Invalid samples: 0
Validation rate: 100.0%
✅ All samples passed validation!
Validating validation data...
Validating 10 samples...