
Fine-Tuning & Custom Models

Train AI models on your data. From LoRA adapters to full RLHF — domain-specific intelligence at any scale.

Approach Spectrum

Why fine-tune?

Fine-tuning sits at the sweet spot of the customization spectrum — more powerful than prompting, more practical than training from scratch.

Prompt Engineering

Best when you need quick results with off-the-shelf models and no training data.

Cost: Low
Latency: none added
Data needed: zero — just examples in the prompt

RAG

Best when your knowledge base changes frequently and you need grounded, citation-backed answers.

Cost: Medium
Latency: +50–200ms retrieval
Data needed: document corpus (any size)
Fine-Tuning (sweet spot)

Best when you need the model to internalize domain vocabulary, tone, reasoning patterns, or structured output formats.

Cost: Medium–High
Latency: none at inference
Data needed: 100–10,000 curated examples

Training from Scratch

Best when no existing model covers your domain and you have massive proprietary corpora (rare).

Cost: Very High
Latency: none at inference
Data needed: billions of tokens + months of compute
Techniques

Fine-tuning techniques

Six approaches to adapting a model — from full parameter updates to lightweight adapters and preference learning.

Full Fine-Tuning

Updates every parameter in the model. The original approach — maximum flexibility but maximum cost. All weights are unfrozen and trained on your dataset with a low learning rate.

How it works

1. Load the pre-trained model and unfreeze all layers
2. Prepare your dataset in instruction/completion format
3. Train with a low learning rate (1e-5 to 5e-5) to avoid catastrophic forgetting
4. Evaluate on a held-out validation set and checkpoint the best model

Best for: organizations with large GPU clusters and datasets of 50K+ examples that need deep behavioral changes.

Parameters: all (~7B–405B)
Memory: 2–4× model size in VRAM
Time: days to weeks

Advantages

Maximum control over model behavior
Can fundamentally shift model capabilities
Well-understood training dynamics

Trade-offs

Requires full model in GPU memory (e.g. 140GB for 70B)
Risk of catastrophic forgetting without careful tuning
Slow: days to weeks on multiple GPUs
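The 140GB figure is just the bf16 weights; gradients and optimizer state multiply it. A back-of-envelope estimator (a sketch, assuming bf16 weights and gradients plus two fp32 AdamW moments per parameter — real usage also depends on activations, sequence length, and sharding):

```python
# Rough VRAM estimate for full fine-tuning with AdamW (illustrative only).
def full_ft_vram_gb(n_params_b, bytes_weights=2, bytes_grads=2, bytes_optim=8):
    # bf16 weights (2B) + bf16 grads (2B) + two fp32 AdamW moments (8B) per param
    return n_params_b * (bytes_weights + bytes_grads + bytes_optim)

print(full_ft_vram_gb(70, bytes_grads=0, bytes_optim=0))  # 140 GB: weights alone for 70B
print(full_ft_vram_gb(70))                                # 840 GB with grads + optimizer state
```

This is why full fine-tuning a 70B model typically requires a multi-GPU cluster with sharded optimizer state, while LoRA fits on far less hardware.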
Base Models

Model landscape

Choosing the right base model determines your ceiling. Open-weight models offer full control; API models trade flexibility for convenience.

Llama (Meta)
Sizes: 8B / 70B / 405B
License: Meta Community License

Best open-weight all-rounder. Strong reasoning, multilingual, 128K context. The default choice for most fine-tuning projects.

Fine-tuning: Full, LoRA, QLoRA — full ecosystem support

Mistral / Mixtral (Mistral AI)
Sizes: 7B / 8×7B / 8×22B
License: Apache 2.0

Exceptional efficiency. The Mixtral MoE architecture activates only 2 of 8 experts per token — 70B-class quality at 13B inference cost.

Fine-tuning: Full, LoRA, QLoRA — strong community support
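The "2 of 8 experts" routing above can be sketched in a few lines — a toy numpy illustration with made-up shapes, not Mixtral's actual implementation:

```python
import numpy as np

# Toy sketch of Mixtral-style top-2 routing: a router scores all 8 experts,
# but only the best 2 actually run for each token (all shapes illustrative).
rng = np.random.default_rng(0)
d, n_experts = 16, 8
router = rng.normal(size=(d, n_experts))      # router projection
experts = rng.normal(size=(n_experts, d, d))  # one toy FFN matrix per expert

def moe_forward(x):
    logits = x @ router
    top2 = np.argsort(logits)[-2:]            # indices of the 2 best experts
    w = np.exp(logits[top2] - logits[top2].max())
    w /= w.sum()                              # softmax over the chosen pair
    # Only 2 of 8 expert FFNs execute -- the source of the low active-param cost
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top2))

y = moe_forward(rng.normal(size=d))           # output vector, shape (16,)
```

All experts' weights must still sit in memory; only the compute per token is reduced.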

Phi (Microsoft)
Sizes: 3.8B / 7B / 14B
License: MIT

Punches above its weight. Data-quality-focused training yields strong reasoning in a tiny model. Ideal for on-device and edge deployment.

Fine-tuning: Full, LoRA, QLoRA — excellent for constrained environments

Gemma (Google)
Sizes: 2B / 9B / 27B
License: Gemma Terms of Use

Strong instruction following and safety alignment out of the box. Good multilingual performance with an efficient architecture.

Fine-tuning: Full, LoRA — well-supported in Keras and JAX

OpenAI API
Sizes: API only (undisclosed)
License: Proprietary API

Frontier capability across all modalities. Fine-tuning is available via API — no infrastructure management required.

Fine-tuning: API fine-tuning only — no weight access

Anthropic API
Sizes: API only (undisclosed)
License: Proprietary API

Best-in-class instruction following, safety, and long-context performance. Prompt-based customization only — no fine-tuning API.

Fine-tuning: none — prompt engineering and RAG only

Data Quality

Data requirements

The quality of your fine-tuned model is bounded by the quality of your data. Relevance matters more than volume.

Data quality pyramid

Domain relevance (95% importance): data from your actual domain and use cases
Diversity (80% importance): covers edge cases, formats, and variations
Quality (65% importance): accurate, well-formatted, consistent examples
Quantity (45% importance): enough examples to learn patterns

LoRA / QLoRA: 100–10,000 curated examples
Full fine-tune: 10,000–100,000+ diverse examples
RLHF / DPO: 5,000–50,000 preference pairs
Distillation: 50,000–500,000+ teacher outputs

Data preparation pipeline

1. Collection: gather domain-specific instruction/response pairs from existing workflows, documentation, and expert annotations.
2. Cleaning: remove duplicates, fix encoding, strip PII, normalize formatting. Bad data in → bad model out.
3. Formatting: convert to the model's expected format — Alpaca, ShareGPT, or ChatML. Add system prompts and metadata.
4. Validation: expert review of a random sample. Check for hallucinations, bias, formatting errors, and label quality.
5. Splitting: train (80%) / validation (10%) / test (10%). Stratify by category to ensure balanced evaluation.
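The formatting and splitting steps can be sketched with the standard library alone — field names follow the common Alpaca convention, and the categories here are made up for illustration:

```python
import random
from collections import defaultdict

# Alpaca-style record: instruction + optional input context + expected output.
def to_alpaca(instruction, output, input_text=""):
    return {"instruction": instruction, "input": input_text, "output": output}

# Stratified 80/10/10 split: each category is shuffled and split separately,
# so every category is represented in validation and test.
def stratified_split(examples, seed=42):
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex["category"]].append(ex)
    train, val, test = [], [], []
    rng = random.Random(seed)
    for group in by_cat.values():
        rng.shuffle(group)
        n_train, n_val = int(len(group) * 0.8), int(len(group) * 0.1)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

data = [dict(to_alpaca(f"question {i}", f"answer {i}"), category=cat)
        for cat in ("legal", "billing") for i in range(10)]
train, val, test = stratified_split(data)   # 16 / 2 / 2 examples
```

A fixed seed keeps the split reproducible across runs, which matters when you compare checkpoints against the same held-out set.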

Quality thresholds

Accuracy: 95%
Consistency: 90%
Coverage: 85%
Decision Guide

When to fine-tune vs. other approaches

The right approach depends on your data, latency requirements, and how deeply you need to change model behavior.

Scenario → best-fit approach

Need domain-specific vocabulary and jargon → Fine-Tune
Need real-time access to changing data → RAG
Need specific output format or tone → Fine-Tune
Need to reduce hallucinations with citations → RAG
Need specialized reasoning patterns → Fine-Tune
Need to work within budget constraints → Prompt Engineering
Need model to run on-device / at the edge → Fine-Tune (small base model)
Need to combine external knowledge + custom behavior → RAG + Fine-Tune

Combine RAG + Fine-Tuning

Use RAG for real-time knowledge and fine-tuning for domain behavior — the most powerful pattern for enterprise AI.

Start with LoRA

For 90% of enterprise use cases, LoRA delivers the best ROI: fast training, tiny adapters, and production-ready results.

Evaluate continuously

Benchmark on held-out data before and after. Track perplexity, task accuracy, and human preference scores.
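Why LoRA delivers that ROI is easy to see in numbers. A minimal numpy sketch (toy shapes, not a real model): the frozen pretrained weight W gets a trainable low-rank update B @ A, scaled by alpha / r as in the LoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-init

def lora_forward(x):
    # B starts at zero, so the adapter changes nothing until it is trained
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # identical to the base model at init
# Trainable params: r * (d_in + d_out) = 8,192 vs 262,144 in W (about 3%).
```

Only A and B receive gradients, so optimizer state shrinks proportionally — and the adapter ships as a tiny file that can be merged into W or swapped per customer.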

Train a model on your data.

Describe your domain and data — we'll recommend the right fine-tuning approach, base model, and training strategy.