Achieving 97% accuracy with 95% latency reduction in voice dictation
With only 10K failure-mode-targeted synthetic examples, TinyLlama (1.1B) beats GPT-4o mini and an equally sized real-data baseline on PubMedQA. We show that the right data matters more for performance than sheer data volume or human annotation.
The Need to Fine-tune
Willow Voice, an AI-powered dictation tool designed to work across all applications, faced a critical technical challenge. Their initial product relied on GPT-4o to provide accurate transcription and formatting, but this approach introduced unacceptable latency for real-time dictation use cases.
When testing Llama 3.1 8B as a faster alternative, they found it achieved only 51% accuracy on their benchmark. "Llama 3.1 8B's instruction following capability wasn't sufficient for our embedded application logic in formatting transcriptions," explained Lawrence Liu, CTO at Willow Voice. They couldn't tolerate hallucinations or formatting errors. When a user is dictating an email to their boss or a medical report, accuracy is essential.
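For illustration, a benchmark like the one described can be as simple as exact-match scoring over (transcript, expected output) pairs. The sketch below assumes a JSONL eval file and a placeholder `format_text` call; the file name, schema, and scoring rule are our assumptions, not Willow's actual harness.

```python
# A minimal sketch of an exact-match formatting benchmark, assuming a JSONL
# eval set of {"raw": ..., "expected": ...} pairs. The file name, schema, and
# scoring rule are illustrative assumptions, not Willow's actual harness.
import json

def format_text(raw: str) -> str:
    # Stand-in for the model under test (e.g., a Llama 3.1 8B endpoint).
    # The identity baseline here simply returns the transcript unchanged.
    return raw

def normalize(s: str) -> str:
    # Collapse whitespace so scoring isn't dominated by spacing noise.
    return " ".join(s.split())

def benchmark(path: str) -> float:
    with open(path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f]
    hits = sum(
        normalize(format_text(c["raw"])) == normalize(c["expected"])
        for c in cases
    )
    return hits / len(cases)

if __name__ == "__main__":
    print(f"accuracy: {benchmark('dictation_eval.jsonl'):.1%}")
```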
Challenges in Fine-tuning
Willow's team knew fine-tuning was the answer but encountered several obstacles:
Identifying model weaknesses: Difficulty pinpointing exactly where the model failed in real-world scenarios, leading to uncertainty about how to improve the model.
Data preparation inefficiency: Weeks spent manually reviewing transcription errors, compounded by synthetic data generation attempts that resulted in repetitive, inaccurate data and frequent hallucinations.
Lack of training data: The required domain-specific data did not exist publicly, and initial synthetic data generation attempts failed, prompting consideration of expensive, time-consuming human annotation.
Accelerated LLM Development with Targeted, Curated Synthetic Data
Partnering with Phinity, Willow developed a synthetic benchmark that precisely reflected real-world use cases. Leveraging Phinity's synthetic data pipeline, Willow then generated diverse, accurate, and domain-rule-adherent training datasets in two days. This targeted approach enabled rapid identification of failure patterns and systematic fine-tuning (a code sketch follows the list):
Precisely isolated representative failure cases in the model
Generated small, diverse, and accurate datasets that adhered to application logic
Efficiently automated the verification of synthetic data samples to ensure high quality and relevance
Rapidly evaluated model improvements, significantly accelerating development cycles
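As a rough illustration of the approach above, a failure-mode-targeted generation loop might pair an LLM generator with a cheap automated verifier. The failure-mode list, prompt, and `check_formatting_rules` filter below are hypothetical, not Phinity's actual pipeline; only the OpenAI client usage is standard.

```python
# A hypothetical failure-mode-targeted generation loop with automated
# verification, assuming the OpenAI Python client. The failure-mode list,
# prompt, and check_formatting_rules filter are illustrative, not Phinity's
# actual pipeline.
import json
from openai import OpenAI

client = OpenAI()

FAILURE_MODES = [
    "spoken punctuation commands ('new line', 'comma') left verbatim",
    "email salutations and sign-offs not placed on their own lines",
    "numbers dictated as words where digits are expected",
]

def generate_pair(failure_mode: str) -> dict:
    # Ask a strong model to synthesize a transcript exhibiting one known
    # failure mode, plus the correctly formatted target output.
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Write a realistic raw dictation transcript exhibiting this "
                f"failure mode: {failure_mode}. Then write the correctly "
                "formatted output. Return JSON with keys 'raw' and 'expected'."
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)

def check_formatting_rules(pair: dict) -> bool:
    # Cheap rule-based filters; a second LLM pass could grade subtler rules.
    expected = pair.get("expected", "")
    return bool(expected.strip()) and "new line" not in expected.lower()

dataset = [
    pair
    for mode in FAILURE_MODES
    for pair in (generate_pair(mode) for _ in range(5))  # scale per mode
    if check_formatting_rules(pair)  # keep only verified samples
]

with open("verified_pairs.jsonl", "w", encoding="utf-8") as f:
    f.writelines(json.dumps(p) + "\n" for p in dataset)
```

Filtering at generation time is what keeps a small dataset accurate: rejected samples never reach training, which addresses the repetition and hallucination problems described earlier.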
This accurate, diverse synthetic data allowed Willow to rapidly enhance performance, achieving 97% accuracy in under a week without incurring the high costs of human annotation or extensive manual labor.
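To make the fine-tuning step concrete, here is a minimal sketch of supervised fine-tuning with LoRA via Hugging Face TRL on the verified pairs from the previous sketch. The prompt template, file name, and hyperparameters are illustrative assumptions, not Willow's actual training recipe.

```python
# A minimal sketch of LoRA fine-tuning on the verified synthetic pairs using
# Hugging Face TRL. The file name, prompt template, and hyperparameters are
# illustrative assumptions, not Willow's actual training recipe.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="verified_pairs.jsonl", split="train")

def to_text(example):
    # Fold each (raw, expected) pair into one instruction-style training string.
    return {"text": f"Format this dictation:\n{example['raw']}\n---\n{example['expected']}"}

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset.map(to_text),
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    args=SFTConfig(
        output_dir="willow-format-lora",
        dataset_text_field="text",
        num_train_epochs=3,
    ),
)
trainer.train()
```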
Results: GPT-4o to Custom Llama 3.1 8B
The outcomes were transformative:
Latency: Reduced from 2 seconds (GPT-4o) to 100 ms, almost instantaneous. Willow Voice is now 20x faster.
Accuracy: Improved from 51% to 97%.
User satisfaction significantly improved, with 97% of Willow users reporting that they use the application daily for professional communications.
With Phinity's research-backed approach to synthetic data generation, Willow took the guesswork, cost, and fear out of fine-tuning, achieving a 95% reduction in latency in an accuracy-critical application.
Key Takeaways
A custom benchmark that matches application scenarios is crucial during development to ensure the model will perform in production.
Accurate, diverse, domain-adherent synthetic data enables cost-effective fine-tuning.
Small, high-quality, error-targeted datasets, rather than large generic ones, make fine-tuning a systematic, efficient process that reaches target performance quickly.