RESEARCH RESULT

Finding a FedAvg failure mode in computational pathology

I tested federated pathology learning on 10,611 PANDA-derived Phikon slide features and found a clear failure mode: FedAvg holds up when every site is clean, but it is vulnerable when the largest simulated site becomes unreliable.

Matthew Vaishnav

1 Jun 2026

Federated learning usually treats sample count as trust. In FedAvg, the largest client gets the most influence because it contributes the most examples. That assumption is reasonable when bigger also means better. It becomes dangerous when a high-volume site is systematically less reliable.
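
To make the weighting concrete, here is a minimal sketch of FedAvg aggregation in plain Python. The site sizes and the one-dimensional parameter vectors are made up for illustration and are not the split used in the experiment.

```python
def fedavg_weights(site_sizes):
    """Return FedAvg aggregation weights: each site's share of all samples."""
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("site_sizes must sum to a positive number")
    return [n / total for n in site_sizes]


def fedavg_aggregate(site_params, site_sizes):
    """Sample-size-weighted average of per-site parameter vectors."""
    weights = fedavg_weights(site_sizes)
    dim = len(site_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, site_params))
        for i in range(dim)
    ]


# Hypothetical federation: one dominant site and four smaller ones.
sizes = [6000, 1500, 1000, 1000, 500]
weights = fedavg_weights(sizes)

# Each site contributes a one-dimensional "model" for illustration.
params = [[1.0], [0.0], [0.0], [0.0], [0.0]]
global_params = fedavg_aggregate(params, sizes)
```

With these sizes the dominant site holds 60% of the aggregation weight no matter how good or bad its labels are, which is exactly the assumption the stress test targets.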

I tested that mechanism in a computational pathology setting using PANDA-derived Phikon features. The experiment simulates a multi-site federation from real slide-level feature vectors and stresses the largest simulated site.

Dataset and setup
├── 10,611 readable PANDA-derived slide feature vectors
├── 768-dimensional mean-pooled Phikon embeddings
├── 5 simulated sites
├── 15 random seeds
├── Task: ISUP grade prediction, classes 0-5
└── Metrics: global QWK, worst-site QWK, mean-site QWK, accuracy, macro-F1
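
Quadratic weighted kappa (QWK) penalizes each disagreement by the squared distance between grades, which suits an ordinal scale such as ISUP 0-5. The sketch below is a plain-Python version of the standard definition, plus a hypothetical helper, `worst_site_qwk`, that takes the minimum over sites. It is illustrative and not necessarily the evaluation code behind these results.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=6):
    """QWK between two integer label sequences with classes 0..n_classes-1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)

    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(row[j] for row in observed) for j in range(n_classes)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            num += weight * observed[i][j]
            den += weight * true_hist[i] * pred_hist[j] / n

    if den == 0.0:
        # Both sequences use one identical class; treat as perfect agreement.
        return 1.0
    return 1.0 - num / den


def worst_site_qwk(per_site):
    """Minimum QWK over sites; per_site maps site -> (y_true, y_pred)."""
    return min(quadratic_weighted_kappa(t, p) for t, p in per_site.values())
```

Reporting the worst site alongside the global score keeps a strong pooled number from hiding one badly served site.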

The failure mode

The stress test is simple: keep validation labels clean, but make the largest simulated training site unreliable. Then compare ordinary FedAvg against a cross-site blending strategy that reduces dependence on raw sample volume.
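
One way to simulate an unreliable dominant site is sketched below. It assumes that corruption means replacing a fixed fraction of the largest site's training labels with a different, uniformly chosen ISUP grade; the exact noise model used in the experiment may differ, and the function name and arguments are hypothetical.

```python
import random


def corrupt_dominant_site(labels_by_site, noise_rate, n_classes=6, seed=0):
    """Return (corrupted copy, dominant site name); other sites are untouched."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)

    # The dominant site is the one with the most training labels.
    dominant = max(labels_by_site, key=lambda site: len(labels_by_site[site]))

    # Copy every list so the caller's clean labels are never modified.
    corrupted = {site: list(labels) for site, labels in labels_by_site.items()}
    labels = corrupted[dominant]

    n_flip = round(noise_rate * len(labels))
    for idx in rng.sample(range(len(labels)), n_flip):
        # Assumption: a corrupted label is any other grade, chosen uniformly.
        alternatives = [c for c in range(n_classes) if c != labels[idx]]
        labels[idx] = rng.choice(alternatives)
    return corrupted, dominant


# Hypothetical toy federation; validation labels would be kept clean elsewhere.
train_labels = {"site_a": [0] * 100, "site_b": [1] * 10}
noisy_labels, dominant_site = corrupt_dominant_site(train_labels, noise_rate=0.25)
```

Only training labels are altered here, matching the setup in which validation labels stay clean.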

Mechanism
FedAvg:
  more samples -> more aggregation influence

Failure condition:
  largest site becomes less reliable

Consequence:
  sample count remains high, but trust should be lower

Alternative:
  blend sample-size weighting with cross-site contribution signals

Label-corruption result

Under clean conditions, cross-site blending is not universally better, and that matters: the benefit is conditional on the dominant site being unreliable, rather than a method that happens to win everywhere.

15-seed full-PANDA label-noise stress
0% noise:
  FedAvg and cross-site blending are essentially tied globally
  FedAvg retains better clean worst-site QWK

25% dominant-site label corruption:
  cross-site global QWK +0.0083 vs FedAvg
  95% CI: [+0.0016, +0.0150]

35% dominant-site label corruption:
  cross-site global QWK +0.0083 vs FedAvg
  95% CI: [+0.0051, +0.0115]

45% dominant-site label corruption:
  cross-site worst-site QWK +0.0111 vs FedAvg
  95% CI: [+0.0011, +0.0211]
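
The write-up does not say how these intervals were computed. One common choice for per-seed comparisons is a paired t-interval on the seed-wise differences, sketched below under that assumption. The default `t_crit` of 2.145 is the two-sided 95% critical value for 14 degrees of freedom, which matches 15 seeds; the function name and the example inputs are hypothetical.

```python
import math
import statistics


def paired_diff_ci(method_scores, baseline_scores, t_crit=2.145):
    """Mean paired difference and a t-interval around it.

    Scores must be aligned by seed: entry k of each list comes from seed k.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must be aligned by seed")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two seeds")

    mean = statistics.fmean(diffs)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean, (mean - t_crit * std_err, mean + t_crit * std_err)


# Toy example with two seeds and an explicit critical value.
mean_diff, (low, high) = paired_diff_ci([1.0, 3.0], [0.0, 0.0], t_crit=1.0)
```

An interval that excludes zero, like the ones reported above, indicates that the gain is consistent across seeds rather than driven by a few lucky runs.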

A more pathology-like stress test

Random label corruption is useful for exposing the mechanism, but pathology disagreement is often more systematic. So I also tested ordinal threshold shift: a dominant site grades more aggressively or more conservatively by shifting selected training labels up or down by one ISUP grade.
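
A minimal sketch of the threshold-shift corruption follows. It assumes that "conservative" means shifting a label down one grade (`direction=-1`) and "aggressive" means shifting it up (`direction=+1`), that the selected labels are chosen uniformly at random, and that shifted grades are clipped to the 0-5 range. These details and the function name are illustrative assumptions.

```python
import random


def threshold_shift(labels, shift_rate, direction, n_classes=6, seed=0):
    """Shift a fraction of ordinal labels by one grade, clipped to valid grades."""
    if direction not in (-1, 1):
        raise ValueError("direction must be -1 (conservative) or +1 (aggressive)")
    if not 0.0 <= shift_rate <= 1.0:
        raise ValueError("shift_rate must be in [0, 1]")
    rng = random.Random(seed)

    shifted = list(labels)  # copy so the clean labels are preserved
    n_shift = round(shift_rate * len(shifted))
    for idx in rng.sample(range(len(shifted)), n_shift):
        # Labels already at the boundary stay where they are after clipping.
        shifted[idx] = min(max(shifted[idx] + direction, 0), n_classes - 1)
    return shifted


# Hypothetical dominant site where every slide is truly ISUP grade 3.
clean = [3] * 100
conservative = threshold_shift(clean, shift_rate=0.25, direction=-1)
```

Unlike uniform label noise, every error here lands on an adjacent grade and points in the same direction, which is closer to how a systematically strict or lenient grader behaves.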

The strongest transfer result came from conservative grading bias. Cross-site blending improved every major metric at 25%, 35%, and 45% conservative threshold shift.

15-seed conservative threshold-shift stress
25% shift:
  global QWK +0.0057
  worst-site QWK +0.0053
  macro-F1 +0.0088

35% shift:
  global QWK +0.0060
  worst-site QWK +0.0071
  macro-F1 +0.0102

45% shift:
  global QWK +0.0116
  worst-site QWK +0.0141
  macro-F1 +0.0199

Dominance-aware switching

Cross-site blending helps under corrupted regimes, but applying it unconditionally has a cost when every site is clean, where FedAvg retains the better worst-site QWK. A better design is a switch: keep FedAvg while its validation diagnostics look normal, and move away from sample-size dominance when those diagnostics become abnormal.

Dominance-aware switch
1. Calibrate normal FedAvg diagnostics on clean validation runs
2. Track global QWK, worst-site QWK, site-QWK spread, ordinal error, severe error
3. If enough diagnostics leave the clean-calibrated range, switch to cross-site blending
4. Otherwise keep FedAvg
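
The four steps above can be sketched as follows. This version assumes the clean-calibrated range for each diagnostic is simply the minimum and maximum seen across clean runs, and that the switch fires when at least `min_abnormal` diagnostics fall outside their range. Both choices, the function names, and the diagnostic values are hypothetical; the tuned detector in the experiments may use different rules.

```python
def calibrate_ranges(clean_runs):
    """Per-diagnostic (min, max) over clean FedAvg validation runs."""
    names = clean_runs[0].keys()
    return {
        name: (min(run[name] for run in clean_runs),
               max(run[name] for run in clean_runs))
        for name in names
    }


def choose_aggregator(diagnostics, clean_ranges, min_abnormal=2):
    """Return (strategy, abnormal diagnostic names) for the current round."""
    abnormal = [
        name for name, value in diagnostics.items()
        if not clean_ranges[name][0] <= value <= clean_ranges[name][1]
    ]
    strategy = "cross_site_blend" if len(abnormal) >= min_abnormal else "fedavg"
    return strategy, abnormal


# Hypothetical diagnostics from two clean calibration runs.
clean_runs = [
    {"global_qwk": 0.80, "worst_site_qwk": 0.70, "site_qwk_spread": 0.05},
    {"global_qwk": 0.82, "worst_site_qwk": 0.74, "site_qwk_spread": 0.08},
]
ranges = calibrate_ranges(clean_runs)

# A round where two of the three diagnostics leave the clean range.
current = {"global_qwk": 0.75, "worst_site_qwk": 0.60, "site_qwk_spread": 0.06}
strategy, abnormal = choose_aggregator(current, ranges)
```

Raising `min_abnormal` makes the switch more conservative, trading fewer false switches on clean data for slower reaction to a genuinely unreliable site.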

On label-noise stress, a tuned observable detector reduced clean false switching to 6.7% while preserving significant global-QWK gains at 25% and 35% dominant-site noise. On conservative threshold shift, the detector transferred strongly in corrupted regimes, but clean false-trigger control still needs improvement.

What I claim, and what I do not claim

The claim is not that hospitals commonly have 45% random label noise. The claim is that sample volume and client reliability can diverge, and FedAvg has no built-in way to notice when the largest client should no longer be trusted the most.

Supported
- FedAvg has a dominant-site reliability failure mode in simulated pathology federations
- Cross-site blending helps when the largest simulated site is unreliable
- The effect transfers from random label corruption to systematic conservative ordinal grading bias
- Observable detector switches can recover much of the benefit under label noise and conservative threshold shift

Not yet supported
- real hospital federated validation
- clinical deployment
- diagnostic use
- proof that one detector calibration works across every failure mode

This is research infrastructure and simulated-federation evidence, not clinical software. The next validation step is to test the same idea on real multi-center pathology benchmarks such as Camelyon17 and on additional site-shift patterns.

Full documentation is available at the dominance-aware switch results page. The source code lives at github.com/matthewvaishnav/computational-pathology-research.