Fine-Tuning Efficient Chinese Speech Models beyond the Pareto Frontier

This repo asks a narrow question under a strict compute budget: when does speech fine-tuning actually buy you something, and when does a strong baseline already dominate?

[Figure: zh-CN benchmark — WER / CER vs model size, baseline vs fine-tuned]

Take-homes

  1. Fine-tuning is not a universal win. On zh-CN it helps decisively; on fr-FR it often regresses once the base model is already strong.
  2. Model size and data overlap matter more than adapter choice alone. Tiny still has headroom on French; small/medium/turbo mostly do not.
  3. The right first step under a small budget is a baseline sweep, not blind SFT. Gap-to-ceiling is the main diagnostic signal in this repo.
  4. Qwen3-ASR and Granite were added as counterpoints. They show how much the conclusion depends on backbone quality, pre-training mix, and the evaluation slice.
  5. RL (MWER/GSPO) fixes the SFT regression. On Qwen3-ASR-0.6B, GSPO brings French WER below baseline (6.13 % vs 6.35 %) and MWER achieves the best Chinese CER at that scale (7.62 % vs 10.41 % baseline). An RL stage at half an epoch recovers what SFT lost — and then some.

This repo bundles four tracks under one roof: the original Whisper fine-tuning work, the asr_bench baseline benchmark, the Qwen3-ASR pilot, and the Granite Speech pilot. The earlier zh-CN Whisper run lives intact under archive_zh/; the present fr-FR Whisper run is in outputs/. Tiny was added later to control for the gap-to-ceiling effect discussed in §3.3.

The benchmark plots now include Qwen3-ASR 0.6B and 1.7B baseline and fine-tune points on fixed dev100 slices, and Granite points will be reported on that same reduced slice for apples-to-apples overlay.

TL;DR

| Size | zh-CN (CV21) baseline / best FT (Δ rel) | fr-FR (FLEURS) baseline / best FT (Δ rel) |
|---|---|---|
| tiny | CER 59.4 % → 35.5 % (full FT, −40.2 %) | WER 45.5 % → 42.1 % (full FT, −7.5 %) |
| small | CER 33.5 % → 22.1 % (full FT, −34.1 %) | WER 13.9 % → 14.6 % (full FT, +5.0 %) |
| medium | CER 28.7 % → 13.2 % (LoRA, −54.0 %) | WER 8.20 % → 8.59 % (LoRA, +4.8 %) |
| turbo | n/a | WER 5.81 % → 6.17 % (LoRA, +6.2 %) |

Same code, same recipe family, same sub-sampling protocol. Three patterns:

  1. Sign flips with the language at small/medium. zh fine-tunes help by 30–54 %. fr fine-tunes hurt by 5 % at the same scale.
  2. At tiny, both languages benefit from fine-tuning — fr full FT improves WER by 7.5 % too. Tiny is small enough that the baseline has not saturated FLEURS.
  3. Best result on fr is Whisper-large-v3-turbo baseline — no fine-tune — at WER 5.81 %. It even beats whisper-large-v3-french-distil-dec4 (6.97 % zero-shot), a model specifically distilled for French.

The mechanism is gap-to-ceiling × pre-training distribution overlap:

  • Common Voice 21 zh-CN → Whisper-v3 has weak zh coverage → gap to close at every size → fine-tuning closes it (−30 to −54 % CER).
  • FLEURS fr-FR at small/medium/turbo → French is one of Whisper's strongest languages → no gap to close → fine-tuning adds noise.
  • FLEURS fr-FR at tiny → tiny has not absorbed enough of Whisper's multilingual corpus to saturate FLEURS-fr → real gap → full FT helps.

The take-away that matters more than the numbers: fine-tuning helps when (target distribution is poorly covered) OR (the model is small enough that it has not saturated the data even if covered). A fine-tuner under compute constraint should test this first, before spending the GPU budget.

A second hypothesis follows directly (cf. §5): our recipe is sub-optimal for in-distribution data. High-LR / small-batch / single-dataset / no-SpecAugment is calibrated for picking up new distributions; on already-seen data it produces drift instead of staying at the equilibrium. So the regression on French is partly a recipe diagnostic.

A live Gradio server (bash scripts/start_demo.sh) lets you record / upload audio in your browser and compare baseline vs fine-tuned side-by-side.

Repo layout

.
├── README.md # this file
├── asr_bench/ # baseline benchmark + comparative plots
│ ├── plot.py
│ ├── preds/
│ └── figures/
├── Qwen3-ASR/ # Qwen3-ASR codebase + finetuning runs
│ ├── finetuning/
│ └── qwen_asr/
├── granite_speech/ # Granite Speech finetuning + eval helpers
│ └── finetuning/
├── archive_zh/ # zh-CN run, preserved
│ ├── README.zh.md
│ ├── metrics.json # 5-row zh table
│ ├── error_analysis.json
│ └── preds/ # 5 prediction JSONs
├── requirements.txt
├── src/
│ ├── data.py # FLEURS fr_fr load → cast 16 kHz → filter → features+labels
│ ├── train.py # CLI: --mode lora | full | scratch
│ ├── eval.py # generation + WER/CER on test split
│ ├── analyze.py # results table + worst-100 error bucketing
│ ├── render_table.py # metrics.json → Markdown table
│ ├── train_w2v.py # alt paradigm: wav2vec2 / XLS-R + CTC head
│ ├── eval_w2v.py # CTC eval helper
│ └── server.py # Gradio (mic + upload, baseline vs fine-tuned)
├── scripts/
│ ├── run_phase_tiny.sh # whisper-tiny fr : baseline + LoRA + full FT
│ ├── run_phase_tiny_zh.sh # whisper-tiny zh-CN : baseline + LoRA + full FT
│ ├── run_phase_a.sh # whisper-small : baseline / LoRA-zh / full FT / scratch
│ ├── run_phase_a2.sh # whisper-small + LoRA "fr recipe"
│ ├── run_phase_b.sh # whisper-medium : baseline + LoRA
│ ├── run_phase_c.sh # whisper-large-v3-turbo : baseline + LoRA
│ ├── run_phase_d.sh # zero-shot refs: wav2vec2-fr + whisper-fr-distil
│ ├── run_phase_e.sh # (optional) whisper-large-v3 + LoRA, needs bnb 4-bit
│ ├── run_all_phases.sh # A → B → C → analyze
│ ├── run_post.sh # post-pipeline: A2 + D + analyze + render_table
│ ├── quick_summary.sh # one-liner WER/CER over outputs/preds
│ └── start_demo.sh # launches Gradio
└── outputs/
    ├── preds/ # 17 prediction JSONs (fr + zh-tiny)
    ├── adapters/ # LoRA weights, full-FT and scratch checkpoints
    ├── logs/ # stdout per step
    ├── metrics_fr.json # 14 fr rows
    ├── metrics_zh.json # 8 zh rows
    ├── table_fr.md # rendered fr table
    ├── table_zh.md # rendered zh table
    └── error_analysis_fr.json

1. Datasets

1.1 zh-CN

Common Voice 21 zh-CN via the parquet rehost keeve101/common-voice-21.0-2025-03-14-zh-CN-split. Read speech, ~3–7 s per clip, 32 kHz mp3 cast to 16 kHz mono. Sub-sampled: 4 000 train / 300 dev / 500 test. Heavy long-tail vocabulary (place names, historical text). Full description: archive_zh/README.zh.md.

1.2 fr-FR

FLEURS sub-corpus fr_fr via google/fleurs. Read speech, 16 kHz mono, transcripts already lowercased and stripped of punctuation. Splits: 3 193 train / 289 dev / 676 test → sub-sampled to 3 193 / 289 / 500. Train volume 10.32 h, mean clip 11.6 s (3.8–29.4 s), 24.1 words per sentence on average.
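
A minimal loading sketch of what src/data.py describes (the shuffle-then-select sub-sampling shown here is an assumption; only the split sizes, the seed, and the 16 kHz cast come from this section):

```python
from datasets import load_dataset, Audio

# FLEURS fr_fr: read speech, transcripts already lowercased / unpunctuated.
fleurs = load_dataset("google/fleurs", "fr_fr", trust_remote_code=True)

# Decode audio at 16 kHz mono (Whisper's expected sampling rate).
fleurs = fleurs.cast_column("audio", Audio(sampling_rate=16_000))

# Deterministic 500-utterance test slice (seed 42), as used for all result tables.
test_500 = fleurs["test"].shuffle(seed=42).select(range(500))
print(len(fleurs["train"]), len(fleurs["validation"]), len(test_500))
```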

Why not Common Voice fr? CV21-fr is huge (≥ 100 GB streamed). FLEURS-fr is the exact "few hours of audio" the spec asks for, with cleaner splits, and is the standard FLEURS leaderboard entry.

Whisper feature extractor / tokenizer is bit-identical between small and medium (80-bin mel, 51 865-token vocab) → encoded processed/ is reused between Phase A and Phase B (saves ~5 min). Whisper-large-v3-turbo uses 128-bin mel + 1 extra vocab token → Phase C re-encodes.
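
This interchangeability is easy to verify directly; a hedged sanity check using the stock openai/* processors (the asserts are illustrative, not part of the repo's code):

```python
from transformers import WhisperProcessor

small = WhisperProcessor.from_pretrained("openai/whisper-small")
medium = WhisperProcessor.from_pretrained("openai/whisper-medium")
turbo = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

# small and medium share the 80-bin mel front end and the same vocab,
# so features encoded once can be reused across Phase A and Phase B.
assert small.feature_extractor.feature_size == medium.feature_extractor.feature_size == 80
assert small.tokenizer.get_vocab() == medium.tokenizer.get_vocab()

# turbo uses 128 mel bins (plus one extra vocab token), hence the Phase C re-encode.
assert turbo.feature_extractor.feature_size == 128
```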

2. Models, recipes, and rationale

2.0 Two-pass recipe — why

We ran the test in two passes:

  1. "zh recipe" (LR 1e-4 LoRA, rank 32, 1 epoch, warmup 10 %) — the recipe that worked on Chinese. Applied to the small-model French run as the first-cut.
  2. "fr recipe corrected" (LR 5e-5 LoRA, 2 epochs) — applied to medium and turbo after we observed the small fine-tune regressed. We also re-ran small with LR 3e-5 / 2 epochs to test whether the corrected recipe rescues small.

Both recipes regress on French (cf. §3). On Chinese the zh recipe wins clearly.

2.0b Phase Tiny — Whisper-tiny (39 M) on both languages

Three variants per language, identical recipes side-by-side:

| Variant | Init | Trainable | What |
|---|---|---|---|
| baseline_tiny[_zh] | OAI pre-trained | 0 | no fine-tune |
| lora_tiny[_zh] | OAI pre-trained | 1.5 M (~3.8 %) | LoRA q/k/v/out_proj enc+dec, LR 1e-4, 1 ep |
| full_tiny[_zh] | OAI pre-trained | 39 M (100 %) | full FT, LR 1e-5, 1 ep |

Tiny is the cleanest test of the gap-to-ceiling hypothesis: on a small under-trained backbone, baseline WER/CER is far from the model's plateau, so any well-calibrated fine-tune has real headroom to fill — even on in-distribution data.

2.1 Phase A — Whisper-small (244 M) test bench

Five variants on identical data, identical seed:

| Variant | Init | Trainable | What |
|---|---|---|---|
| baseline_small | OAI pre-trained | 0 | no fine-tune |
| lora_small | OAI pre-trained | 7.1 M (~2.8 %) | LoRA q/k/v/out_proj enc+dec |
| lora_small_v2 | OAI pre-trained | 7.1 M | same LoRA, LR 3e-5, 2 ep |
| full_small | OAI pre-trained | 244 M (100 %) | full FT, LR 1e-5 |
| scratch_small | random init | 244 M | from-scratch, 5 ep, LR 5e-4 |

The scratch_small variant answers the "fine-tune vs from-scratch at small config" question: same architecture, random weights, 5× the LoRA compute budget. What is being tested is the value of the pre-trained weights, not of a from-scratch BPE.

2.2 Phase B — Whisper-medium (769 M) scale-up

LoRA on the same projections, 2 epochs at LR 5e-5 (corrected). Per-device batch 8 + grad-accum 2 + grad ckpt → effective batch 16, ~10 GB VRAM peak.

2.3 Phase C — Whisper-large-v3-turbo (809 M)

The 2024 OAI variant: encoder of large-v3 + 4-decoder-layer pruned head, ~8× faster than large-v3 for similar quality. Officially recommended production checkpoint for 2025–2026. Same LoRA recipe, batch 4 + grad-accum 4 + grad ckpt.

Considered and skipped: NVIDIA Parakeet-TDT v3 (conflicting venv deps), SeamlessM4T-medium (heavier, no advantage), wav2vec2 / XLS-R + CTC (different paradigm, code staged in src/train_w2v.py).

2.4 Phase D — zero-shot off-the-shelf references

Two French-specialized public models, evaluated zero-shot on the same 500 FLEURS-fr test clips:

  • bofenghuang/asr-wav2vec2-ctc-french — encoder-only + CTC head, 315 M, fine-tuned on CV9-fr + Voxpopuli-fr. Different paradigm.
  • bofenghuang/whisper-large-v3-french-distil-dec4 — Whisper-large-v3 distilled specifically for French (4-decoder-layer head).

2.5 Hyperparameters (final)

| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Precision | bf16 (L4 supports it) |
| LR (LoRA "zh recipe") | 1e-4 |
| LR (LoRA "fr recipe corrected") | 5e-5 (small_v2: 3e-5; medium/turbo: 5e-5) |
| LR (full FT) | 1e-5 |
| LR (from scratch) | 5e-4 |
| Warmup ratio | 0.1 (LoRA/full); 0.05 (scratch) |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| LoRA targets | q_proj, k_proj, v_proj, out_proj (enc + dec) |
| Effective batch (small / medium / turbo) | 16 (16×1 / 8×2 / 4×4) |
| Epochs (small LoRA / full / _v2 / scratch) | 1 / 1 / 2 / 5 |
| Epochs (medium / turbo) | 2 |
| Seed | 42 |
| Generation | greedy, max_new_tokens 225, lang=fr task=transcribe |
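
The LoRA rows map one-to-one onto a PEFT config; a sketch assuming the standard peft API (the wrapper code in src/train.py may differ in details):

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_cfg = LoraConfig(
    r=32,                 # rank
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # encoder + decoder attention
    bias="none",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # ~7.1 M trainable (~2.8 %) for whisper-small, per the table above
```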

3. Results

All rows evaluated on the same 500-utterance FLEURS-fr test slice (deterministic seed 42), greedy decoding, bf16, num_beams = 1. Primary metric: WER for French; CER for Chinese.
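
Scoring itself is plain jiwer over hypothesis/reference pairs; a minimal sketch (the prediction-JSON schema and file name below are hypothetical, and src/eval.py's normalization may add more steps):

```python
import json
import jiwer

def score(pred_json_path: str) -> tuple[float, float]:
    """Corpus-level WER and CER from a predictions file of
    [{"reference": ..., "hypothesis": ...}, ...] records (illustrative schema)."""
    with open(pred_json_path) as f:
        rows = json.load(f)
    refs = [r["reference"].lower().strip() for r in rows]
    hyps = [r["hypothesis"].lower().strip() for r in rows]
    return jiwer.wer(refs, hyps), jiwer.cer(refs, hyps)

wer, cer = score("outputs/preds/baseline_small_fr.json")  # hypothetical file name
print(f"WER {wer:.4%}  CER {cer:.4%}")
```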

3.1 fr-FR table (FLEURS, WER primary)

| Run | Trainable | Wall eval (s) | WER | CER | Δ WER abs | Δ WER rel |
|---|---|---|---|---|---|---|
| Whisper-tiny baseline | — | 10.7 | 0.4547 | 0.2047 | — | — |
| Whisper-tiny + LoRA (zh recipe, LR 1e-4, 1 ep) | 1.5 M | 12.0 | 0.4560 | 0.2051 | +0.0013 | +0.3 % |
| Whisper-tiny full FT | 39 M | 22.6 | 0.4205 | 0.1933 | −0.0342 | −7.5 % |
| Whisper-small baseline | — | 31.8 | 0.1386 | 0.0508 | — | — |
| Whisper-small + LoRA (zh recipe, LR 1e-4, 1 ep) | 7.1 M | 34.1 | 0.1551 | 0.0600 | +0.0165 | +11.9 % |
| Whisper-small + LoRA (fr recipe, LR 3e-5, 2 ep) | 7.1 M | 30.6 | 0.1547 | 0.0571 | +0.0161 | +11.6 % |
| Whisper-small full FT | 244 M | 32.8 | 0.1456 | 0.0549 | +0.0070 | +5.0 % |
| Whisper-small from scratch (random init, 5 ep) | 244 M | 17.1 | 0.9615 | 0.7593 | +0.8229 | +593.7 % |
| Whisper-medium baseline | — | 88.6 | 0.0820 | 0.0292 | — | — |
| Whisper-medium + LoRA (fr recipe) | 18.9 M | 88.3 | 0.0859 | 0.0307 | +0.0039 | +4.8 % |
| Whisper-large-v3-turbo baseline | — | 63.6 | 0.0581 | 0.0199 | — | — |
| Whisper-large-v3-turbo + LoRA (fr recipe) | ~12 M | 62.1 | 0.0617 | 0.0213 | +0.0036 | +6.2 % |
| ref: wav2vec2-CTC-français (zero-shot, CTC) | — | 13.5 | 0.1037 | 0.0465 | — | — |
| ref: Whisper-large-v3 distil-fr-dec4 (zero-shot) | — | 64.1 | 0.0697 | 0.0256 | — | — |

3.2 zh-CN table (CV21, CER primary)

| Run | Trainable | Wall eval (s) | CER | Δ CER abs | Δ CER rel |
|---|---|---|---|---|---|
| Whisper-tiny baseline | — | 21.6 | 0.5938 | — | — |
| Whisper-tiny + LoRA (zh recipe, LR 1e-4, 1 ep) | 1.5 M | 14.8 | 0.4168 | −0.1770 | −29.8 % |
| Whisper-tiny full FT | 39 M | 16.6 | 0.3551 | −0.2387 | −40.2 % |
| Whisper-small baseline | — | 20.9 | 0.3352 | — | — |
| Whisper-small + LoRA (zh recipe, LR 1e-4, 1 ep) | 7.1 M | 25.8 | 0.2322 | −0.1030 | −30.7 % |
| Whisper-small full FT | 244 M | 27.0 | 0.2208 | −0.1144 | −34.1 % |
| Whisper-medium baseline | — | 62.2 | 0.2873 | — | — |
| Whisper-medium + LoRA (zh recipe, LR 1e-4, 2 ep) | 18.9 M | 62.7 | 0.1322 | −0.1551 | −54.0 % |

3.3 Cross-language and cross-size — the gap-to-ceiling effect

| Size | zh-CN baseline | zh-CN best FT | Δ zh | fr-FR baseline | fr-FR best FT | Δ fr |
|---|---|---|---|---|---|---|
| tiny | CER 59.4 % | CER 35.5 % (full FT) | −40.2 % | WER 45.5 % | WER 42.1 % (full FT) | −7.5 % |
| small | CER 33.5 % | CER 22.1 % (full FT) | −34.1 % | WER 13.9 % | WER 14.6 % (full FT) | +5.0 % |
| medium | CER 28.7 % | CER 13.2 % (LoRA) | −54.0 % | WER 8.20 % | WER 8.59 % (LoRA) | +4.8 % |
| turbo | n/a | n/a | n/a | WER 5.81 % | WER 6.17 % (LoRA) | +6.2 % |

Two patterns to read out:

  1. Sign flips with the language for small/medium/turbo. zh-CN fine-tunes help a lot (−30 to −54 %) because Whisper's zh prior is weak; fr-FR fine-tunes hurt a little (+5 to +6 %) because Whisper's fr prior is already at the ceiling FLEURS allows.
  2. At tiny, the sign is the same on both languages: fine-tuning helps. Tiny is a small under-trained backbone — its baseline is far from the model-class plateau.

Operationally: before spending the GPU budget on a fine-tune, estimate gap-to-ceiling. Cheap proxies: (a) is the baseline already "much better than I need"? If so, a fine-tune is unlikely to help. (b) After one epoch of training against a held-out dev set, does train loss drop faster than dev loss? If not, the model is already at its plateau.
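
A baseline-only sweep to probe proxy (a) is cheap to script; a hedged sketch with the transformers pipeline (the checkpoints, the 50-clip probe slice, and the single-GPU device=0 are illustrative choices, not this repo's harness):

```python
import jiwer
from datasets import load_dataset, Audio
from transformers import pipeline

dev = load_dataset("google/fleurs", "fr_fr", split="validation", trust_remote_code=True)
dev = dev.cast_column("audio", Audio(sampling_rate=16_000)).select(range(50))  # tiny probe slice

for ckpt in ["openai/whisper-tiny", "openai/whisper-small",
             "openai/whisper-medium", "openai/whisper-large-v3-turbo"]:
    asr = pipeline("automatic-speech-recognition", model=ckpt, device=0,
                   generate_kwargs={"language": "fr", "task": "transcribe"})
    hyps = [asr(ex["audio"])["text"].lower() for ex in dev]
    refs = [ex["transcription"].lower() for ex in dev]
    print(ckpt, f"WER {jiwer.wer(refs, hyps):.3f}")
# If the largest baseline is already "better than you need", skip the fine-tune.
```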

3.4 Statistical significance

500 test utterances × 24 words in fr / × 13 chars in zh ≈ 12 000 words / 6 500 chars per language. A 95 % Wilson interval at WER ≈ 10 % is ±0.5 pt absolute. The zh deltas (−10 to −24 pt CER) and the fr-tiny full-FT delta (−3.4 pt WER) are massively significant. The fr small/medium/turbo regressions (+0.4 to +1.7 pt WER) sit at the edge of significance.
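
For reference, the ±0.5 pt figure follows from the standard Wilson score interval at p ≈ 0.10 and n ≈ 12 000 words; a small helper:

```python
import math

def wilson_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the 95 % Wilson score interval for an error rate p over n tokens."""
    denom = 1 + z**2 / n
    return (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))

print(wilson_halfwidth(0.10, 12_000))  # ≈ 0.0054 → ±0.5 pt absolute on French WER
print(wilson_halfwidth(0.30, 6_500))   # zh CER scale: ≈ ±1.1 pt
```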

4. Error analysis

4.1 Per-utterance distribution (Whisper-small fr)

| Model (small, fr) | Perfect | < 5 % | < 10 % | < 25 % | Median | Worst |
|---|---|---|---|---|---|---|
| baseline | 80 | 126 | 217 | 405 | 0.118 | 0.700 |
| LoRA (zh recipe) | 60 | 102 | 191 | 383 | 0.133 | 1.000 |
| full FT | 63 | 110 | 204 | 402 | 0.125 | 0.783 |

Both fine-tunes reduce the count of perfect transcriptions (80 → 60 / 63), exactly the opposite of the intended effect. LoRA (zh recipe) introduces at least one 100 %-error sentence (e.g. chocolat chaud → cocochot-style hallucinations).

4.2 Worst-100 categorization on baseline_turbo (best fr model)

| Category | Count |
|---|---|
| word_substitution | 188 |
| agreement_or_plural | 22 |
| deletion | 9 |
| accent_only | 5 |
| insertion | 4 |
| homophone | 3 |

  • Proper-noun substitutions dominate (188). Toponyms, foreign patronyms, rare technical terms — typical Whisper failure mode, irreducible without an external LM.
  • Plural / gender agreement (22) — il propose vs ils proposent. Often acoustically indistinguishable.
  • Accent-only diffs (5) — a vs à, ou vs où.
  • Homophones (3) — Whisper-large-v3's internal LM disambiguates most of these.

No hallucination_long, no truncation. The remaining errors are essentially un-fixable from the audio alone — large-v3 plus an external 4-gram or beam search + LM rescoring would be the next move.

4.3 What lora_small got wrong that baseline got right

| Reference | Baseline (correct) | LoRA (zh recipe) |
|---|---|---|
| chocolat chaud | chocolat chaud | cocochot |
| ses racines philosophiques | ses racines philosophiques | sérasines philosophiques |
| l'occident s'est retrouvé | l'occident s'est retrouvé | l'occidence se retrouvait |
| quiconque se rend | qui conquiseront | kikong seront |

Phonetically plausible French nonsense, drifting toward sub-lexical fragments over-represented in the FLEURS train set. Classic over-fit signature.

5. Why fine-tuning regresses on French — recipe diagnostic

The cross-language flip in §3.3 has two non-mutually-exclusive explanations:

(a) Distribution overlap. FLEURS-style content is heavily represented in Whisper-v3's pre-training mix. There is no gap to close. zh-CN is the opposite — Whisper's zh data is relatively sparse and CV21's distribution differs, so fine-tuning closes a real gap.

(b) Recipe sub-optimality on in-distribution data. If the Whisper authors had merged a copy of FLEURS-fr into their pre-training with their original recipe, the model would not have regressed. Ours does because the recipe is mis-calibrated for the in-distribution case:

| Aspect | Authors' regime (continual pre-training) | Our regime (fine-tune from cold) |
|---|---|---|
| Effective LR | ~1e-5 to 5e-6 (cosine decay, ≥ 1 M steps) | 1e-4 / 5e-5 / 1e-5 |
| Effective batch | 256+ (multi-task interleaved) | 16 (single dataset) |
| Warmup | thousands of steps | 20–40 steps |
| Regularization | SpecAugment, weight decay, dropout, multi-task gradient | bf16 + AdamW only |
| Steps | ≥ 1 M | 200–400 |

The tiny rows are the natural control. Tiny is not saturated on FLEURS-fr (baseline WER 45 %), so the gap-to-ceiling argument predicts that fine-tuning should work even though FLEURS-fr is in-distribution. And it does: full FT at LR 1e-5 reduces WER by 7.5 % relative, while LoRA at LR 1e-4 stays neutral.

Concretely, the corrective experiments we would run with more time:

  1. LoRA at LR 1e-5 / 2 epochs / SpecAugment + freq-mask + time-mask — matches end-of-pre-training dynamics more closely (see the sketch after this list).
  2. LoRA at LR 5e-5 / 2 epochs / co-training (50 % FLEURS-fr + 50 % VoxPopuli-fr) — diluted gradient should kill the regression.
  3. Sweep (rank, LR, steps) against dev WER — predict-with-generate every 50 steps, early-stop on dev.
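
Item 1 is mostly a config change; a hedged sketch using the SpecAugment-style masking switches exposed by the Hugging Face Whisper implementation (the field names are transformers config options, the mask values are illustrative, and the trainer wiring is not shown):

```python
from transformers import WhisperForConditionalGeneration

# Load the base model, then switch on time/feature masking for training.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.apply_spec_augment = True
model.config.mask_time_prob = 0.05       # time masking ("time-mask")
model.config.mask_time_length = 10
model.config.mask_feature_prob = 0.05    # mel-channel masking ("freq-mask")
model.config.mask_feature_length = 10

# Corrected recipe knobs from item 1: LR 1e-5, 2 epochs (pass to the Seq2SeqTrainer / CLI).
training_kwargs = dict(learning_rate=1e-5, num_train_epochs=2, warmup_ratio=0.1)
```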

5b. Strategic implications

  1. Test for gap-to-ceiling before spending the GPU budget. A 30-minute baseline-only sweep across 3–4 model sizes tells you whether you have a fine-tune problem, a recipe problem, or no problem at all.
  2. Pick the largest pre-trained model that fits VRAM, then evaluate. Our large-v3-turbo baseline (5.81 % WER) beats bofenghuang/whisper-large-v3-french-distil-dec4 (6.97 %) — a model specifically distilled for French. Pre-training scale beat post-hoc specialization.
  3. Where to actually fine-tune (out-of-distribution data): CV fr accents (Québec, Suisse, Maghreb); telephony 8 kHz; spontaneous speech (ESLO, BREF); low-resource languages that Whisper-v3 saw < 50 h of.
  4. Where the recipe needs work. Add SpecAugment + RIR + additive noise augmentation, drop LR by ~5×, lengthen warmup, and co-train with a small in-distribution buffer.

6. With more time

  • Recipe ablation (§5 list) — most informative use of additional GPU.
  • Common Voice fr fine-tune — direct test of the in-distribution-overlap hypothesis.
  • Whisper-large-v3 full (32-decoder) + LoRA + 4-bit base via bitsandbytes. Phase E staged at scripts/run_phase_e.sh.
  • wav2vec2 / XLS-R 300 M + CTC fine-tuned from scratch on FLEURS train (src/train_w2v.py is ready).
  • NVIDIA Parakeet-TDT v3 / OWSM-CTC v3.1 in a separate venv — ESPnet / NeMo bring different inductive biases.
  • Beam search + KenLM 4-gram fr Wikipedia rescoring — typically worth 0.5–1 pt WER on the rare-vocabulary tail.
  • Streaming / chunked long-form in src/server.py via chunk_length_s + voice-activity-aware segmentation.
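
For the chunked long-form item above, the transformers ASR pipeline already handles chunking; a minimal sketch (the VAD-aware segmentation and the src/server.py wiring are not shown, and the file name is hypothetical):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    chunk_length_s=30,   # split long audio into 30 s chunks with stride overlap
    batch_size=8,
    generate_kwargs={"language": "fr", "task": "transcribe"},
)

# Any long recording; the pipeline stitches chunk transcripts back together.
print(asr("long_interview.wav")["text"])  # hypothetical file name
```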

7. Transferring the scaffolding to Arabic

Same code transfers, but the failure modes shift:

  • Diglossia / dialects — MSA vs Egyptian / Levantine / Maghrebi diverge enough that a single fine-tune under-covers them. Pick the closest dialect or train a multi-dialect adapter conditioned on a tag.
  • Diacritics — most production text is unvocalized; metric must strip diacritics before scoring.
  • Hamza / alef normalization — أ ا إ آ → ا, ى → ي, ة → ه (standard pre-eval normalization); see the sketch after this list.
  • Code-switching — English insertions are common in Arabic audio; LoRA recipe must keep encoder weights free for bilingual frames.
  • Tokenizer — orthographic WER is more meaningful in Arabic than Chinese; lead with WER and report CER as secondary.
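
A minimal pre-scoring normalizer covering the diacritics and hamza/alef bullets above; the exact rule set is benchmark-dependent, this is only the standard baseline:

```python
import re

# Arabic diacritics (tashkeel) plus tatweel.
_DIACRITICS = re.compile(r"[\u0617-\u061A\u064B-\u0652\u0670\u0640]")

def normalize_arabic(text: str) -> str:
    """Standard pre-eval normalization: strip diacritics, fold hamza/alef forms,
    fold alef maqsura and ta marbuta, collapse whitespace."""
    text = _DIACRITICS.sub("", text)
    text = re.sub(r"[أإآ]", "ا", text)  # hamza/alef variants → bare alef
    text = text.replace("ى", "ي")        # alef maqsura → ya
    text = text.replace("ة", "ه")        # ta marbuta → ha
    return re.sub(r"\s+", " ", text).strip()

print(normalize_arabic("إنَّ الْقِرَاءَةَ مُفِيدَةٌ"))  # → "ان القراءه مفيده"
```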

What stays the same: dataset filtering, LoRA recipe (rank 32, alpha 64, q/k/v/out_proj), bf16, greedy, Gradio harness.

Qwen3-ASR pilot

A parallel Qwen3-ASR path under Qwen3-ASR/finetuning:

  • prepare_qwen3_asr_data.py — exports the same two datasets and split sizes into JSONL + 16 kHz WAV.
  • qwen3_asr_sft.py — supports --mode lora|full with PEFT LoRA and gradient-checkpointing fixes.
  • eval_qwen3_asr.py — scores Qwen checkpoints with the same French / Chinese normalization logic.
  • run_qwen3_matrix.sh — wires the full matrix together.

The Qwen code path must be run from a venv with transformers==4.57.6; the repo's current Whisper env (transformers==5.6.0) breaks the upstream Qwen model import path.

Full-FT matrix

Held-out slice: first 100 examples of the dev split.

| Model | Language | Baseline | Full FT (1 ep) | Delta vs baseline |
|---|---|---|---|---|
| Qwen3-ASR-0.6B | French | WER 6.35 % | WER 7.94 % | +25.0 % |
| Qwen3-ASR-0.6B | Chinese | CER 10.41 % | CER 9.26 % | −11.0 % |
| Qwen3-ASR-1.7B | French | WER 3.75 % | WER 3.57 % | −4.7 % |
| Qwen3-ASR-1.7B | Chinese | CER 7.02 % | CER 5.81 % | −17.2 % |

One extra LoRA reference row on French:

| Model | Language | Baseline | LoRA (1 ep) | Delta vs baseline |
|---|---|---|---|---|
| Qwen3-ASR-0.6B | French | WER 6.35 % | WER 7.11 % | +11.8 % |

Take-aways:

  • Chinese improves at both sizes under full FT (−11.0 % at 0.6B, −17.2 % at 1.7B).
  • French splits by scale. 0.6B regresses under both LoRA and full FT; 1.7B recovers a small gain under full FT (−4.7 % WER).
  • Qwen shows the same broad language-dependent sign flip at small scale, but at 1.7B it appears more stable on in-distribution French than the Whisper medium/turbo runs.

Fit / runtime findings on 1× NVIDIA L4 (24 GB)

  • Qwen3-ASR-0.6B full FT fits comfortably with batch_size=2, grad_acc=8; French ~9.6 min / epoch, Chinese ~12.0 min / epoch.
  • Qwen3-ASR-1.7B full FT fits with batch_size=1, grad_acc=16 and gradient checkpointing; French ~16.9 min / epoch, Chinese ~20.3 min / epoch.
  • Parallelism helps only for the small job. A 1.7B full-FT run can still OOM at optimizer-state allocation if another training process is already holding several GiB on the same GPU.

RL fine-tuning: MWER & GSPO

Two reinforcement-learning algorithms were added on top of the SFT checkpoint to push WER/CER further without extra labelled data.

MWER (Minimum Word Error Rate)

MWER turns beam-search hypotheses into a differentiable training signal (Shannon et al., 2017). For each training utterance the model generates an N-best list (greedy + beam), teacher-forces each hypothesis through the decoder to get log-probabilities, re-normalises to a posterior distribution P̂, and minimises the expected word-error rate under that distribution. A small cross-entropy interpolation (λ_ce = 0.01) prevents collapse.

L_MWER = Σ_i P̂(y_i|x) · WER(y_i, y*)   +   λ_ce · CE(y*|x)

where  P̂(y_i|x) = softmax( logP(y_i|x) )  over the N-best list
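
A sketch of that objective in code, assuming the N-best sequence log-probabilities, per-hypothesis WERs, and the reference cross-entropy have already been computed (the actual qwen3_asr_mwer.py adds generation, batching, and feature caching around this core):

```python
import torch
import torch.nn.functional as F

def mwer_loss(seq_logprobs: torch.Tensor,  # (N,) log P(y_i | x) over the N-best list
              wers: torch.Tensor,          # (N,) WER(y_i, y*) per hypothesis
              ce_loss: torch.Tensor,       # scalar cross-entropy on the reference y*
              lambda_ce: float = 0.01) -> torch.Tensor:
    # Re-normalise the N-best scores to a posterior P̂ over the list.
    posterior = F.softmax(seq_logprobs, dim=0)
    # Expected WER under that posterior, plus a small CE anchor to prevent collapse.
    expected_wer = (posterior * wers).sum()
    return expected_wer + lambda_ce * ce_loss
```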

GSPO (Group Sequence Policy Optimisation)

GSPO (Dang et al., 2025) is a GRPO-style objective adapted for sequence-level ASR rewards. A frozen old-policy snapshot (synced every 32 grad steps) generates G rollouts per utterance; the advantage of each rollout is z-scored within the group; the policy update uses an asymmetric clipped surrogate on the length-normalised log-ratio:

r_i  = exp( (log P_θ(y_i) − log P_θ_old(y_i)) / |y_i| )

L_GSPO = −E[ min( r_i · A_i,  clip(r_i, 1−ε_lo, 1+ε_hi) · A_i ) ]

ε_lo = 3e-4,  ε_hi = 4e-4          # tight because length-norm ratio ≈ 1.0

A format bonus (+0.1) is added to rollouts that contain the expected <lang>…</lang> wrapper.
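
And the corresponding GSPO surrogate, again only the core of the update (rollout generation, the old-policy sync every 32 steps, and the format bonus live in qwen3_asr_gspo.py; the reward definition here is an assumption, e.g. negative WER plus the +0.1 bonus):

```python
import torch

def gspo_loss(logp_new: torch.Tensor,  # (G,) log P_theta(y_i) for the group's rollouts
              logp_old: torch.Tensor,  # (G,) log P_theta_old(y_i) from the frozen snapshot
              lengths: torch.Tensor,   # (G,) token lengths |y_i|
              rewards: torch.Tensor,   # (G,) sequence-level rewards for the rollouts
              eps_lo: float = 3e-4,
              eps_hi: float = 4e-4) -> torch.Tensor:
    # Group-relative advantage: z-score the rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Length-normalised importance ratio (≈ 1.0, hence the tight clip range).
    ratio = torch.exp((logp_new - logp_old.detach()) / lengths)
    clipped = torch.clamp(ratio, 1 - eps_lo, 1 + eps_hi)
    # Asymmetric clipped surrogate, maximised → minimise the negative.
    return -torch.min(ratio * adv, clipped * adv).mean()
```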

Hyperparameters

| Setting | MWER | GSPO |
|---|---|---|
| N-best / group size (0.6B) | 4 | 4 |
| N-best / group size (1.7B) | 2 | 2 |
| Generation policy | temperature sampling (T=0.9, top_p=0.95) | current policy rollouts |
| MWER audio microbatch | 2 audios for 0.6B on this GPU | — |
| GSPO audio microbatch | — | 4 audios for 0.6B on this GPU |
| Learning rate (0.6B) | 5e-6 | 5e-6 |
| Learning rate (1.7B) | 2e-6 | 2e-6 |
| CE / format coefficient | λ_ce = 0.01 | format_α = 0.1 |
| Gradient accumulation | 4 | 4 |
| Training epochs (completed 0.6B run) | 0.5 | 0.5 |
| Old-model sync | — | every 32 grad steps |
| Clip range | — | ε_lo = 3e-4, ε_hi = 4e-4 |

Results (first 100 dev examples)

WER for French, CER for Chinese. ↓ is better. Best result per row is bold.

| Model | Language | Metric | Baseline | SFT full-FT | MWER (RL) | GSPO (RL) |
|---|---|---|---|---|---|---|
| Qwen3-ASR-0.6B | French | WER | 6.35 % | 7.94 % | 6.22 % | **6.13 %** |
| Qwen3-ASR-0.6B | Chinese | CER | 10.41 % | 9.26 % | **7.62 %** | 8.77 % |
| Qwen3-ASR-1.7B | French | WER | 3.75 % | **3.57 %** | — | — |
| Qwen3-ASR-1.7B | Chinese | CER | 7.02 % | **5.81 %** | — | — |

The 0.6B RL rows are completed 0.5-epoch runs from 2026-05-11. Dashes mean the 1.7B RL variants have not been run yet.

RL implementation note

The initial MWER trainer generated and scored one utterance at a time, so the 0.6B run spent most of its wall time on skinny generation/scoring calls. The trainer now batches the outer audio loop (--mwer_batch_size; 2 is stable for 0.6B full runs on this GPU), extracts mel features once per audio microbatch and reuses them across the N-best scoring and CE paths, and defaults MWER N-best generation to sampling rather than beam search. GSPO mirrors that structure with --gspo_batch_size and cached audio features. Sequence scoring is row-chunked (QWEN_ASR_SCORE_ROW_CHUNK=2 by default), MWER backpropagates the large sequence-risk graph before building the CE graph, and both trainers explicitly release CUDA cache between microbatches so VRAM does not monotonically grow. The completed 0.6B sweep used two-audio MWER microbatches and four-audio GSPO microbatches; all four runs completed from run_rl_0p6b_fast.sh on 2026-05-11, with the final log at Qwen3-ASR/finetuning/outputs/logs/run_rl_0p6b_fast_cleanup_20260511_202755.log and result JSONs under Qwen3-ASR/finetuning/outputs/qwen3_0p6b_{mwer,gspo}_{fr,ch}_dev100.json.

Paper cross-check

This RL setup is intentionally conservative relative to the two most relevant LLM-ASR reports:

  • Seed-ASR motivates an RL stage because cross-entropy SFT is mismatched with inference-time WER/CER, then applies MWER interpolated with CE over an N-best set. The same section reports that weighted WER and preserving context-style data improve robustness, which is a useful follow-up for named entities and hard cases.
  • Qwen3-ASR reports a final ASR RL stage using GSPO, with about 50k utterances mixed across Chinese/English, multilingual, and functional data. That makes our GSPO branch the closer match to the released model recipe, while MWER remains the most direct metric-aligned ablation.

Recommended next configuration: keep the current 0.6B pass as a throughput and signal check; if French MWER still regresses, prioritize 1.7B + GSPO and/or a mixed-language RL set over longer 0.6B French-only training. For MWER quality, the most paper-faithful next upgrade is weighted WER: upweight named entities, numbers, and keyword spans rather than treating every token equally.

Scripts: qwen3_asr_mwer.py (MWER trainer), qwen3_asr_gspo.py (GSPO trainer), run_rl_0p6b_fast.sh (0.6B four-run sequence), run_rl_matrix.sh (full 8-run matrix).

8. Reproduction

Pinned: transformers==5.6.0, datasets>=3.6,<4.0, peft==0.19.1, accelerate==1.13.0, torch==2.11.0+cu128, jiwer==4.0.0, librosa==0.11.0, gradio>=5.0,<6. Seed = 42.

# Install (reuse venv on the test machine)
/venv/bin/pip install -r requirements.txt

# HF auth (FLEURS is open but HF_TOKEN avoids rate limits)
export HF_TOKEN=$(cat ~/.cache/huggingface/token)
export HF_HOME=/data/speech2text/outputs/cache
export HF_DATASETS_TRUST_REMOTE_CODE=1   # FLEURS ships via loader script

# Main pipeline (data + 3 phases) — ~2-3 h on L4
bash scripts/run_all_phases.sh

# Post pipeline (fr-recipe LoRA-small + zero-shot references)
bash scripts/run_post.sh

# Or phase by phase:
bash scripts/run_phase_tiny.sh    # whisper-tiny fr : baseline / LoRA / full FT
bash scripts/run_phase_tiny_zh.sh # whisper-tiny zh-CN : baseline / LoRA / full FT
bash scripts/run_phase_a.sh       # whisper-small fr : baseline / LoRA-zh / full / scratch
bash scripts/run_phase_a2.sh      # whisper-small + LoRA-fr (LR 3e-5, 2 ep)
bash scripts/run_phase_b.sh       # whisper-medium fr : baseline + LoRA-fr
bash scripts/run_phase_c.sh       # whisper-large-v3-turbo : baseline + LoRA-fr
bash scripts/run_phase_d.sh       # zero-shot refs

# Analyze passes
/data/venv/bin/python -m src.analyze --language fr --out-name metrics_fr.json
/data/venv/bin/python -m src.analyze --language zh --out-name metrics_zh.json
/data/venv/bin/python -m src.render_table --metrics outputs/metrics_fr.json --out outputs/table_fr.md
/data/venv/bin/python -m src.render_table --metrics outputs/metrics_zh.json --out outputs/table_zh.md

# Demo (mic + upload, baseline vs fine-tuned)
bash scripts/start_demo.sh
# Override via env: BASE, LORA, FULL, SERVER_PORT, SERVER_HOST

Browser microphone access requires a secure context (HTTPS or localhost). Either SSH-tunnel:

ssh -L 7860:localhost:7860 user@<server-ip>
# then http://localhost:7860

or pass --share to src.server for a *.gradio.live HTTPS URL.

System dep: Gradio decodes uploaded audio via ffmpeg. On a fresh machine: sudo apt-get install -y ffmpeg.

9. Limitations and why the French plot is last

[Figure: fr-FR benchmark — WER vs model size, baseline vs fine-tuned]

The French figure is deliberately parked at the end because it is mostly a limitation study, not a clean fine-tuning success story.

  • French is already heavily covered in pre-training. FLEURS-fr sits close to the distribution Whisper was already trained to solve well.
  • Small/medium/turbo are already near their plateau on this benchmark. That leaves little gap for adaptation to close.
  • Our recipe is tuned for picking up new distributions, not for staying at equilibrium on in-distribution data. Small batch, single-dataset gradients, no SpecAugment, and comparatively aggressive learning rates make over-specialization more likely.
  • Tiny is the exception that proves the rule. It still has real headroom on FLEURS-fr, so full FT helps there even though the dataset itself is not novel.

So the French plot is useful, but mainly as a warning: a strong multilingual ASR baseline on in-distribution data can make a naive fine-tune look busy without making it better. The optimistic direction for future work is to continue the RL stage on top of the already-successful Qwen SFT results.