## Definition
**Data synthesis for AI** is the programmatic generation of training examples — either derived from existing real data (augmentation) or created without a real-world reference (synthesis) — to expand data quantity, broaden coverage, improve quality, enable privacy-safe training, or distill larger model capabilities into smaller ones.
## Why Synthesize Data
Five primary motivations:
1. **Quantity** — produce training examples at scale for tasks where real-world data is scarce (e.g., rare weather events, robotic accidents, medical edge cases).
2. **Coverage** — generate targeted examples for underrepresented classes, adversarial scenarios, or specific output formats (e.g., very short/long texts, toxic examples for safety classifiers, tool-use traces).
3. **Quality** — AI can sometimes outperform humans for structured tasks; AI-generated preference ratings are more consistent than individual human raters; AI can produce arbitrarily complex math problems.
4. **Privacy** — synthetic records (patient notes, financial transactions) that carry no real PII allow training where regulations would otherwise prohibit use of real data.
5. **Distillation** — generate examples from a large teacher model to train a smaller student model with comparable performance. See [[Model Distillation]].
## Synthesis Techniques
### Rule-Based and Template-Based
Use predefined schemas populated with random generators (e.g., Faker). Useful for structured documents (invoices, tax forms, configuration files), regex/grammar generation, and math equations. DeepMind used 100 million synthetic geometry examples to train AlphaGeometry (Trinh et al., 2024).
Text augmentation variants: synonym replacement (via dictionary or nearby embedding vectors), gender-pronoun swapping for bias mitigation, perturbation (adding noise to existing samples to create adversarial examples, used in BERT pretraining by Devlin et al., 2018).
### Simulation
Run experiments in virtual environments to collect data cheaply and safely — especially common in robotics (CARLA for self-driving), agent training (OpenAI's Dota 2 bot played ~180 years of games per day via self-play), and tool-use data generation (simulating API calls rather than invoking real APIs).
### AI-Powered Synthesis
Modern LLMs can synthesize text, code, preference labels, and structured data indistinguishable from human output.
Key patterns:
- **Paraphrase/translate** — rewrite existing examples in different styles or languages. Yu et al. (2023) created MetaMath (~400K examples) from 15K MATH/GSM-8K examples; resulting models outperformed larger baselines.
- **Reverse instruction** — given high-quality long-form content, use AI to generate the instruction that would elicit it. Avoids hallucinations in responses because the response is human-authored.
- **Self-instruct** — seed with a small diverse set, use a model to generate new instruction-response pairs resembling the seeds (Wang et al., 2022). Used for Alpaca (52K pairs from 175 seeds).
- **Self-play** — AI agents play against or interact with each other to generate interaction data (chess, negotiation, customer support simulations).
- **Code synthesis pipeline** — generate problem descriptions → generate solutions → generate unit tests → fix failures iteratively → translate to other languages → back-translate for documentation. Llama 3.1 generated 2.7M synthetic coding examples this way (Dubey et al., 2024).
## Data Verification
Synthetic data quality must be verified before training. Methods:
- **Functional correctness** — execute code, check unit tests, run parsers/linters.
- **Back-translation** — translate to a target language, translate back, compare with original; divergence signals low-quality translation.
- **AI judges** — general-purpose foundation model evaluators or specialised scorers. To avoid first-position bias, run the judge twice with response order swapped and accept only consistent verdicts.
- **Heuristics** — filter empty, too-short, too-long, or repetitive examples; filter by keyword or metadata.
## Limitations
**Quality control** — AI-generated data can be low-quality and hard to verify automatically.
**Superficial imitation** — student models trained on teacher outputs learn style but may not acquire factual accuracy or generalization (Gudibande et al., 2023, "The False Promise of Imitating Proprietary LLMs").
**Model collapse** — training recursively on AI-generated data causes performance degradation over iterations as rare events become under-represented (Shumailov et al., 2023). Avoided by mixing synthetic with real data (Gerstgrasser et al., 2024).
**Obscured data lineage** — AI outputs may incorporate copyrighted material or benchmark contamination without the generator or user being aware.
## Related
- [[Instruction Dataset Design]]
- [[Model Distillation]]
- [[Self-Supervised Learning]]
- [[Fine-Tuning]]
## Sources
- [[AI Engineering - Chip Huyen]]