The hidden tax on LLM fine-tuning: why data prep eats 73% of the work

Adam  ·  March 2026  ·  9 min read

There's a stat that gets thrown around in ML circles: engineers spend roughly 73% of their fine-tuning time on data preparation and only 27% on actual training. I used to think that was exaggerated. After a year of fine-tuning LLMs — LoRA adapters on Mistral, instruction-tuning Llama, domain-specific models for coding and Q&A — I think it's conservative.

The model code for a typical fine-tune takes about an hour. The data work takes days. And it's the same data work every time.

This article breaks down where that time actually goes, what I found when I started measuring it, and the pipeline I ended up building to eliminate most of it.

Where the time goes

I tracked my workflow across 11 fine-tuning projects over six months. The breakdown was consistent enough to be depressing:

Finding data: 15-20% of total time. Browsing Hugging Face, checking Kaggle, searching academic repos. Hugging Face has incredible breadth — over 200,000 datasets — but quality varies from research-grade to "I exported my ChatGPT conversations and uploaded them." Kaggle is strong for tabular and structured data but thin on instruction-tuned conversation sets. Most searches end with downloading 3-4 candidate datasets and manually inspecting each one.

Cleaning and filtering: 35-40% of total time. This is the killer. Every dataset has its own problems. Duplicates (some datasets have 10-15% exact duplicates — I measured this). Empty fields. Encoding issues. Samples that are just noise — responses that don't answer the question, instructions that are ambiguous or broken, entries in the wrong language. I wrote cleanup scripts for each project. Different datasets, same types of problems, slightly different edge cases each time.

Format conversion: 10-15% of total time. Different datasets use different schemas. Some use question/answer. Others use prompt/completion or input/output or instruction/response. Some use CSV, some use Parquet, some use JSONL with inconsistent field names. The training framework expects a specific format. Every project starts with a conversion step.

Spot-checking and validation: 10-15% of total time. Automated filters catch most problems but not all. Responses that are technically well-formed but factually wrong. Instructions that are subtly ambiguous. Quality issues that only become visible when you actually read the samples. This step is manual and slow.

Actual training: ~25% of total time. Writing the training script, configuring hyperparameters, running the job, evaluating results. The part that's supposed to be the point of the whole exercise.

That's the 73/27 split in practice. Most of the data work isn't intellectually hard — it's just tedious and repetitive.

The duplication problem is worse than expected

One thing that surprised me was the scale of duplication in public datasets. I ran deduplication analysis on 14 popular instruction-tuning datasets from Hugging Face using both exact matching and fuzzy matching (MinHash with a Jaccard threshold of 0.85).

Results:

| Dataset size | Exact duplicates | Near-duplicates (Jaccard > 0.85) | Combined |
|---|---|---|---|
| < 10K samples | 2-4% | 3-6% | 5-10% |
| 10K-100K samples | 5-8% | 6-12% | 11-18% |
| > 100K samples | 8-15% | 10-18% | 16-28% |
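Measuring the exact-duplicate rate is the easy half of this analysis, and it's worth running on any dataset before training. A minimal sketch — it hashes normalized text so memory stays bounded on large datasets, and counts every repeat of an earlier sample:

```python
import hashlib

def exact_duplicate_rate(samples):
    """Fraction of samples that exactly duplicate an earlier sample.

    `samples` is a list of strings (e.g. each row's response text).
    Normalization (strip + lowercase) catches trivial variants;
    hashing keeps the seen-set small on 100K+ row datasets.
    """
    seen = set()
    dupes = 0
    for text in samples:
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes / len(samples) if samples else 0.0
```

The fuzzy half (MinHash at a Jaccard threshold) is what catches the larger near-duplicate fractions in the table above; that pass is sketched later in the pipeline section.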

Larger datasets tend to have more duplication. This makes sense — they're often assembled by scraping or combining multiple sources, and the same content shows up in different places.

The impact on training isn't trivial. Duplicate samples mean the model sees the same patterns multiple times, which can lead to overfitting on specific phrasings and reduced generalization. In one test, I trained the same LoRA adapter on a 50K coding instruction set before and after deduplication (which removed ~12% of samples). The deduplicated version scored 4.2% higher on HumanEval despite having fewer training samples.

Fewer samples. Better results. Because the samples were actually distinct.

What a proper cleaning pipeline looks like

After building one-off scripts for a dozen projects, I eventually standardized the process into a pipeline. This is what runs behind Neurvance (neurvance.com), but the approach applies to any dataset preparation workflow.

Stage 1: Schema detection and normalization. The pipeline ingests a raw dataset and maps whatever field names it finds to a standard schema: instruction, response, and optional system and context fields. This handles the conversion problem — whether the source uses question/answer or prompt/completion or something else entirely, the output is always the same format. Standard JSONL.

A surprising amount of work goes into edge cases here. Some datasets embed metadata in the instruction field. Others have multi-turn conversations flattened into single rows. Some use nested JSON structures inside individual cells. The normalization layer handles all of this, but getting it right took longer than I expected.
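The core of the normalization layer is just a field-alias map. A minimal sketch — the alias lists here are illustrative, not the pipeline's actual (much longer) map, and real multi-turn and nested-JSON handling adds more code on top:

```python
import json

# Illustrative alias map -- an assumption, not the pipeline's real list.
FIELD_ALIASES = {
    "instruction": ["instruction", "question", "prompt", "input"],
    "response": ["response", "answer", "completion", "output"],
    "system": ["system", "system_prompt"],
    "context": ["context", "passage"],
}

def normalize_row(row):
    """Map a raw row's field names onto the standard schema.

    Returns a dict with `instruction` and `response` (required) plus
    optional `system`/`context`; returns None when a required field is
    missing so the caller can drop the row.
    """
    out = {}
    for target, aliases in FIELD_ALIASES.items():
        for name in aliases:
            if name in row and row[name]:
                out[target] = row[name]
                break
    if "instruction" not in out or "response" not in out:
        return None
    return out

def to_jsonl(rows):
    """Normalize every row and serialize the keepers as standard JSONL."""
    normalized = (normalize_row(r) for r in rows)
    return "\n".join(json.dumps(n, ensure_ascii=False)
                     for n in normalized if n is not None)
```

Whatever the source schema, the output is one JSON object per line with the same four possible keys — which is what lets the training script read a single format.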

Stage 2: Deduplication. Two passes — exact match first (fast, hash-based), then fuzzy match using MinHash from the datasketch library. The Jaccard threshold is configurable but defaults to 0.85, which catches near-duplicates without being so aggressive that it removes legitimately similar-but-different samples.

The key insight here: deduplication needs to happen on the response field, not the instruction field. Multiple valid instructions can be phrased similarly (e.g., "Write a Python function that sorts a list" vs "Write a Python function to sort a list"), but what matters is whether the responses are substantively different.
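The two-pass structure, keyed on the response field, can be sketched as below. For clarity this version computes exact Jaccard similarity over word shingles, which is O(n²); the real pipeline replaces that inner loop with datasketch's MinHash to make the fuzzy pass scale:

```python
def _shingles(text, k=3):
    """Word k-shingles used for Jaccard comparison."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(rows, threshold=0.85):
    """Two-pass dedup keyed on the response field.

    Pass 1 drops exact duplicates via a normalized-text set. Pass 2
    drops rows whose response shingles reach Jaccard >= threshold
    against any already-kept row. (MinHash approximates this pass in
    production to avoid the pairwise comparison shown here.)
    """
    seen_exact = set()
    kept, kept_shingles = [], []
    for row in rows:
        resp = row["response"].strip().lower()
        if resp in seen_exact:
            continue                      # pass 1: exact duplicate
        seen_exact.add(resp)
        sh = _shingles(row["response"])
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue                      # pass 2: near-duplicate
        kept.append(row)
        kept_shingles.append(sh)
    return kept
```

Note the function never looks at the instruction field — similar instructions with substantively different responses survive, per the point above.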

Stage 3: Quality filtering. This is the most opinionated part of the pipeline and the one that required the most iteration. Current filters:

Response length filtering — responses below a minimum character threshold (configurable, default 50 chars) or above a maximum (default 8000 chars) get flagged. Too short usually means the response is unhelpful. Too long usually means it's copy-pasted from documentation or includes a lot of boilerplate.

Information density scoring — a simple heuristic that measures the ratio of unique tokens to total tokens in the response. Low-density responses tend to be repetitive filler. This isn't a perfect metric but it catches the worst offenders.
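The density heuristic is only a few lines — a minimal sketch, using whitespace tokenization as an assumption (the real tokenizer may differ):

```python
def information_density(text):
    """Ratio of unique whitespace tokens to total tokens.

    Repetitive filler ("great great great ...") scores low; varied
    prose scores near 1.0. Empty text scores 0.0 so it gets filtered.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

A flagging threshold on this score is tunable per dataset; as noted, it's a blunt metric that exists to catch the worst offenders, not to rank good samples.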

Language detection — using langdetect to identify and remove samples that are in the wrong language or are garbled text. This is more common than you'd think in multilingual datasets.

Instruction-response coherence — a lightweight check that the response actually addresses the instruction. This uses embedding similarity between the instruction and response. It's not trying to verify factual accuracy — just that the response is topically relevant to the question.
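The coherence check reduces to a cosine-similarity floor between the two embeddings. A sketch parameterized over any embedding function, so the similarity logic is visible without depending on a specific model — the threshold value and the model name in the usage note are assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def is_coherent(instruction, response, encode, threshold=0.3):
    """Flag responses topically unrelated to their instruction.

    `encode` maps a string to an embedding vector -- e.g. a
    sentence-transformers model's .encode method. This is a topical
    relevance floor, not a factual-accuracy check.
    """
    return cosine(encode(instruction), encode(response)) >= threshold
```

With sentence-transformers, `encode` would be something like `SentenceTransformer("all-MiniLM-L6-v2").encode` (model choice illustrative).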

Stage 4: Manual spot-check. The pipeline randomly samples 200-500 rows from the filtered dataset and generates a report with the samples, their quality scores, and any flags from the automated filters. I go through these by hand. This step is slow but it catches things automated filters can't — like responses that are grammatically correct and topically relevant but subtly wrong in a way that would teach the model bad patterns.
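The sampling step itself is simple; the value is in the fixed seed, which makes a review session reproducible. A minimal sketch — the JSONL report format and field names here are assumptions:

```python
import json
import random

def spotcheck_report(rows, n=300, seed=42):
    """Sample rows for manual review and serialize them as a JSONL report.

    Each sampled row is emitted as one JSON line, carrying whatever
    quality scores and filter flags were already attached upstream.
    A fixed seed makes the same sample reproducible across runs.
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, min(n, len(rows)))
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in sample)
```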

The whole thing runs in Python. Dependencies are minimal: pandas, datasketch, langdetect, sentence-transformers for the coherence check, and standard library JSON handling. Nothing exotic.

The format problem nobody talks about

One underappreciated source of fine-tuning friction is format inconsistency. There's no universal standard for instruction-tuning data. Hugging Face datasets use different column names, different structures, different file formats. A dataset that works with one training framework might need conversion to work with another.

This sounds trivial but it compounds. Every time you switch datasets, you check the schema, write a conversion step, validate the output. Across a dozen projects, this easily adds up to days of cumulative work.

The approach I settled on was aggressive standardization. Every dataset that goes through the pipeline comes out as JSONL with identical field names and identical structure. The training script reads one format. Always. If I want to swap datasets, I just point it at a different file. No conversion step.

This is one of those optimizations that doesn't sound impressive until you've done 10+ fine-tuning runs and realized how much time format juggling actually costs.

Benchmarking the pipeline

To validate that the pipeline actually improves training outcomes (and doesn't just remove useful data), I ran a controlled experiment. I took three popular instruction-tuning datasets, trained identical LoRA adapters on the raw versions and the pipeline-processed versions, and compared performance on standard benchmarks.

Setup: Llama 3.1 8B base model. LoRA rank 16, alpha 32. Same hyperparameters across all runs. 3 epochs. Evaluated on MMLU (knowledge), HumanEval (coding), and a held-out instruction-following test set.

Results:

| Dataset | Raw MMLU | Cleaned MMLU | Raw HumanEval | Cleaned HumanEval | Samples removed |
|---|---|---|---|---|---|
| Coding-50K | 54.2 | 55.8 | 31.1 | 35.3 | 14% |
| General-QA-100K | 61.4 | 63.1 | — | — | 22% |
| Multi-domain-30K | 57.8 | 59.2 | 28.4 | 30.1 | 11% |

The cleaned versions consistently outperformed the raw versions despite having fewer training samples. The largest improvement was on the coding dataset, where HumanEval scores jumped 4.2 points after removing 14% of the data. The General-QA dataset lost 22% of its samples to deduplication and filtering — and still scored higher on MMLU.

This aligns with what the literature suggests but it's still satisfying to see it in practice. Data quality compounds through training in ways that data quantity doesn't.

What I got wrong

Two things I had to learn the hard way.

First, I over-filtered early on. My initial quality thresholds were too aggressive. I was removing samples that were unusual but not actually bad — like very long multi-step coding explanations or responses that used informal language. Unusual is not the same as low-quality. I had to relax several filters and add more nuance to the scoring. The current pipeline removes less data than my first version, and the results are better for it.

Second, I underestimated the value of keeping the pipeline simple. My first version had a dozen different quality signals, weighted and combined into a composite score. It was elegant and nearly impossible to debug when something went wrong. The current version uses straightforward, independently interpretable filters. Each one does one thing. When a sample gets removed, I can point to exactly which filter caught it and why. Simpler is better when you need to trust the output.

Where this goes

The pipeline behind Neurvance currently handles a few dozen datasets across coding, general Q&A, and some domain-specific categories. Every dataset on the platform is free to download. There's a $10/mo API for programmatic access if you want to pull data directly into your training loop, with a Python client on GitHub.

The catalog is small right now. I'm adding datasets as fast as the pipeline and manual review process allows. The bottleneck is quality control, not engineering: each new dataset takes time to properly clean and validate, and I'm not willing to ship datasets I haven't manually spot-checked.

I'm also working on more rigorous quality evaluation: training small probe models on cleaned vs. uncleaned versions of each dataset and publishing the benchmark deltas alongside the datasets themselves. The goal is to make the value of the cleaning process measurable, not just anecdotal.

If you're doing fine-tuning work and have run into the same data prep bottleneck, the datasets are at neurvance.com.