Data Synthesis for AI - Albert Masoliver's learning site

## Definition

**Data synthesis for AI** is the programmatic generation of training examples, either derived from existing real data (augmentation) or created without a real-world reference (synthesis), to expand data quantity, broaden coverage, improve quality, enable privacy-safe training, or distill the capabilities of larger models into smaller ones.

## Why Synthesize Data

Five primary motivations:

1. **Quantity**: produce training examples at scale for tasks where real-world data is scarce (e.g., rare weather events, robotic accidents, medical edge cases).
2. **Coverage**: generate targeted examples for underrepresented classes, adversarial scenarios, or specific output formats (e.g., very short/long texts, toxic examples for safety classifiers, tool-use traces).
3. **Quality**: AI can sometimes outperform humans on structured tasks. AI-generated preference ratings can be more consistent than those of individual human raters, and AI can produce arbitrarily complex math problems.
4. **Privacy**: synthetic records (patient notes, financial transactions) that carry no real PII allow training where regulations would otherwise prohibit the use of real data.
5. **Distillation**: generate examples from a large teacher model to train a smaller student model with comparable performance. See [[Model Distillation]].

## Synthesis Techniques

### Rule-Based and Template-Based

Use predefined schemas populated with random generators (e.g., Faker). Useful for structured documents (invoices, tax forms, configuration files), regex/grammar generation, and math equations. DeepMind used 100 million synthetic geometry examples to train AlphaGeometry (Trinh et al., 2024).

Text augmentation variants:

- **Synonym replacement**: swap words for synonyms found via a dictionary or nearby embedding vectors.
- **Gender-pronoun swapping**: used for bias mitigation.
- **Perturbation**: add noise to existing samples, for example to create adversarial examples. BERT pretraining (Devlin et al., 2018) perturbs its input by replacing a small fraction of tokens with random tokens.
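The template-based approach above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it uses Python's standard `random` module as a stand-in for a generator library such as Faker, and the vocabularies, the invoice template, and the function names (`make_invoice`, `make_dataset`) are all hypothetical.

```python
import random
import string

# Hypothetical vocabularies standing in for a generator library such as Faker.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev", "Emi"]
LAST_NAMES = ["Ito", "Khan", "Lopez", "Meyer", "Novak"]
ITEMS = {"widget": 4.50, "gasket": 1.25, "bracket": 7.80, "valve": 12.00}

# Hypothetical document template; each {field} is filled from the schema.
TEMPLATE = (
    "INVOICE {invoice_id}\n"
    "Bill to: {customer}\n"
    "{lines}\n"
    "TOTAL: {total:.2f}"
)


def make_invoice(rng: random.Random) -> dict:
    """Fill the schema with random values and render one synthetic invoice."""
    invoice_id = "INV-" + "".join(rng.choices(string.digits, k=6))
    customer = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    # Pick 1 to 3 distinct items, each with a random quantity.
    chosen = rng.sample(sorted(ITEMS), k=rng.randint(1, 3))
    line_items = [(name, rng.randint(1, 5), ITEMS[name]) for name in chosen]
    total = round(sum(qty * price for _, qty, price in line_items), 2)
    lines = "\n".join(
        f"{qty} x {name} @ {price:.2f}" for name, qty, price in line_items
    )
    text = TEMPLATE.format(
        invoice_id=invoice_id, customer=customer, lines=lines, total=total
    )
    # The structured fields double as ground-truth labels for an extraction task.
    labels = {"invoice_id": invoice_id, "customer": customer, "total": total}
    return {"text": text, "labels": labels}


def make_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n synthetic invoices from a seeded generator."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [make_invoice(rng) for _ in range(n)]
```

Calling `make_dataset(1000)` would yield 1,000 text/label pairs. Because the labels come from the same values that filled the template, they are correct by construction, which is the main appeal of rule-based synthesis for structured documents.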
### Simulation

Run experiments in virtual environments to collect data cheaply and safely. This is especially common in robotics (CARLA for self-driving), agent training (OpenAI's Dota 2 bot played ~180 years of games per day via self-play), and tool-use data generation (simulating API calls rather than invoking real APIs).

### AI-Powered Synthesis

Modern LLMs can synthesize text, code, preference labels, and structured data that is often hard to distinguish from human output. Key patterns:

- **Paraphrase/translate**: rewrite existing examples in different styles or languages. Yu et al. (2023) created MetaMath (~400K examples) from 15K MATH/GSM8K examples; the resulting models outperformed larger baselines.
- **Reverse instruction**: given high-quality long-form content, use AI to generate the instruction that would elicit it. Because the response is human-authored, this avoids hallucinations in the response.
- **Self-instruct**: seed with a small, diverse set of examples, then use a model to generate new instruction-response pairs resembling the seeds (Wang et al., 2022). Used for Alpaca (52K pairs from 175 seeds).
- **Self-play**: AI agents play against or interact with each other to generate interaction data (chess, negotiation, customer support simulations).
- **Code synthesis pipeline**: generate problem descriptions → generate solutions → generate unit tests → fix failures iteratively → translate to other languages → back-translate for documentation. The Llama 3.1 team generated 2.7M synthetic coding examples this way (Dubey et al., 2024).

## Data Verification

Synthetic data quality must be verified before training. Methods:

- **Functional correctness**: execute code, check unit tests, run parsers/linters.
- **Back-translation**: translate to a target language, translate back, and compare with the original; divergence signals a low-quality translation.
- **AI judges**: general-purpose foundation model evaluators or specialized scorers. To avoid first-position bias, run the judge twice with the response order swapped and accept only consistent verdicts.
- **Heuristics**: filter empty, too-short, too-long, or repetitive examples; filter by keyword or metadata.

## Limitations

- **Quality control**: AI-generated data can be low-quality and hard to verify automatically.
- **Superficial imitation**: student models trained on teacher outputs learn the teacher's style but may not acquire its factual accuracy or generalization (Gudibande et al., 2023, "The False Promise of Imitating Proprietary LLMs").
- **Model collapse**: training recursively on AI-generated data degrades performance over iterations as rare events become under-represented (Shumailov et al., 2023). Mixing synthetic data with real data mitigates this (Gerstgrasser et al., 2024).
- **Obscured data lineage**: AI outputs may incorporate copyrighted material or contaminated benchmark data without the generator or user being aware.

## Related

- [[Instruction Dataset Design]]
- [[Model Distillation]]
- [[Self-Supervised Learning]]
- [[Fine-Tuning]]

## Sources

- [[AI Engineering - Chip Huyen]]