Finding a FedAvg failure mode in computational pathology
I tested federated pathology learning on 10,611 PANDA-derived Phikon slide features and found a clear failure mode: FedAvg is safe when sites are clean, but vulnerable when the largest simulated site becomes unreliable.
Matthew Vaishnav
1 Jun 2026 | 6 min read
Federated learning usually treats sample count as trust. In FedAvg, the largest client gets the most influence because it contributes the most examples. That assumption is reasonable when bigger also means better. It becomes dangerous when a high-volume site is systematically less reliable.
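To make the weighting concrete, here is a minimal sketch of sample-count aggregation in plain Python. The site sizes and the two-element parameter vectors are toy values chosen for illustration, not the partition used in the experiment, and the function names are hypothetical.

```python
def fedavg_weights(sample_counts):
    """Each client's aggregation weight is its share of all training samples."""
    total = sum(sample_counts)
    return [n / total for n in sample_counts]


def aggregate(client_params, weights):
    """Weighted average of per-client parameter vectors (lists of floats)."""
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(dim)
    ]


# Toy five-site federation (illustrative sizes only).
sample_counts = [5000, 2000, 1500, 1000, 500]
weights = fedavg_weights(sample_counts)

# Site 0 pulls toward [1, 0]; every other site pulls toward [0, 1].
client_params = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
global_params = aggregate(client_params, weights)
```

In this toy case the largest site alone carries as much weight as the other four combined, regardless of how reliable its labels are.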
I tested that mechanism in a computational pathology setting using PANDA-derived Phikon features. The experiment simulates a multi-site federation from real slide-level feature vectors and stresses the largest simulated site.
Dataset and setup
├── 10,611 readable PANDA-derived slide feature vectors
├── 768-dimensional mean-pooled Phikon embeddings
├── 5 simulated sites
├── 15 random seeds
├── Task: ISUP grade prediction, classes 0-5
└── Metrics: global QWK, worst-site QWK, mean-site QWK, accuracy, macro-F1 (QWK = quadratic weighted kappa)
The failure mode
The stress test is simple: keep validation labels clean, but make the largest simulated training site unreliable. Then compare ordinary FedAvg against a cross-site blending strategy that reduces dependence on raw sample volume.
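One way such a blend could look is sketched below. This is an assumption-laden illustration, not the implementation in the repository: the per-site `contribution_scores`, the mixing coefficient `alpha`, and the linear mixing rule are all hypothetical stand-ins for whatever cross-site signal is actually used.

```python
def blended_weights(sample_counts, contribution_scores, alpha=0.5):
    """Mix sample-share weights with normalized contribution-score weights.

    alpha = 0.0 reproduces FedAvg; alpha = 1.0 ignores sample counts entirely.
    Both alpha and the scores are hypothetical, for illustration only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    n_total = sum(sample_counts)
    s_total = sum(contribution_scores)
    if n_total <= 0 or s_total <= 0:
        raise ValueError("counts and scores must have positive sums")
    return [
        (1.0 - alpha) * n / n_total + alpha * s / s_total
        for n, s in zip(sample_counts, contribution_scores)
    ]


# Toy federation: the largest site (index 0) has a low contribution score.
sample_counts = [5000, 2000, 1500, 1000, 500]
contribution_scores = [0.2, 0.8, 0.8, 0.8, 0.8]

fedavg = blended_weights(sample_counts, contribution_scores, alpha=0.0)
blended = blended_weights(sample_counts, contribution_scores, alpha=0.5)
```

With these toy numbers the largest site's weight falls from 0.50 under pure sample-size weighting to roughly 0.28 after blending, while the weights still sum to one.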
Mechanism
FedAvg:
  more samples -> more aggregation influence
Failure condition:
  largest site becomes less reliable
Consequence:
  sample count remains high, but trust should be lower
Alternative:
  blend sample-size weighting with cross-site contribution signals
Label-corruption result
Under clean conditions, cross-site blending is not universally better. That matters: the advantage is conditional on the dominant site being unreliable, so this is not just a method that happens to win everywhere.
15-seed full-PANDA label-noise stress
0% noise:
  FedAvg and cross-site blending are essentially tied globally
  FedAvg retains better clean worst-site QWK
25% dominant-site label corruption:
  cross-site global QWK +0.0083 vs FedAvg
  95% CI: [+0.0016, +0.0150]
35% dominant-site label corruption:
  cross-site global QWK +0.0083 vs FedAvg
  95% CI: [+0.0051, +0.0115]
45% dominant-site label corruption:
  cross-site worst-site QWK +0.0111 vs FedAvg
  95% CI: [+0.0011, +0.0211]
A more pathology-like stress test
Random label corruption is useful for exposing the mechanism, but pathology disagreement is often more systematic. So I also tested ordinal threshold shift: a dominant site grades more aggressively or more conservatively by shifting selected training labels up or down by one ISUP grade.
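The two perturbations can be contrasted in a short sketch. Several details here are assumptions rather than the experiment's actual procedure: labels are selected independently at random with probability `rate`, random corruption draws a different grade uniformly, shifted grades are clipped to the 0-5 range, and `direction=-1` is used only as an example of a one-grade downward shift.

```python
import random

NUM_GRADES = 6  # ISUP grades 0-5


def corrupt_random(labels, rate, rng):
    """Replace each label, with probability `rate`, by a different random grade."""
    out = []
    for y in labels:
        if rng.random() < rate:
            y = rng.choice([g for g in range(NUM_GRADES) if g != y])
        out.append(y)
    return out


def shift_ordinal(labels, rate, direction, rng):
    """Shift each label, with probability `rate`, by one grade up (+1) or down (-1).

    Shifted grades are clipped so they stay inside the valid 0-5 range.
    """
    out = []
    for y in labels:
        if rng.random() < rate:
            y = min(NUM_GRADES - 1, max(0, y + direction))
        out.append(y)
    return out


rng = random.Random(0)
dominant_site_labels = [0, 1, 2, 3, 4, 5] * 100  # toy labels, not PANDA data

noisy = corrupt_random(dominant_site_labels, rate=0.25, rng=rng)
shifted = shift_ordinal(dominant_site_labels, rate=0.25, direction=-1, rng=rng)
```

The difference matters for an ordinal metric such as QWK: random corruption can move a label by up to five grades, whereas a threshold shift always lands on an adjacent grade, which is closer to how graders tend to disagree.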
The strongest transfer result came from conservative grading bias. Cross-site blending improved every major metric at 25%, 35%, and 45% conservative threshold shift.
15-seed conservative threshold-shift stress
25% shift:
  global QWK +0.0057
  worst-site QWK +0.0053
  macro-F1 +0.0088
35% shift:
  global QWK +0.0060
  worst-site QWK +0.0071
  macro-F1 +0.0102
45% shift:
  global QWK +0.0116
  worst-site QWK +0.0141
  macro-F1 +0.0199
Dominance-aware switching
Cross-site blending helps under corrupted regimes, but using it all the time can create clean-regime costs. The better idea is a switch: use FedAvg when FedAvg validation behavior looks normal, and switch away from sample-size dominance when diagnostics become abnormal.
Dominance-aware switch
1. Calibrate normal FedAvg diagnostics on clean validation runs
2. Track global QWK, worst-site QWK, site-QWK spread, ordinal error, severe error
3. If enough diagnostics leave the clean-calibrated range, switch to cross-site blending
4. Otherwise keep FedAvg
On label-noise stress, a tuned observable detector reduced clean false switching to 6.7% while preserving significant global-QWK gains at 25% and 35% dominant-site noise. On conservative threshold shift, the detector transferred strongly in corrupted regimes, but clean false-trigger control still needs improvement.
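The four switching steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the tuned detector: the "clean-calibrated range" is taken to be the min-max range over clean runs plus an optional `margin`, the `min_violations` threshold is hypothetical, and all diagnostic values are made-up placeholders rather than experimental results.

```python
DIAGNOSTICS = (
    "global_qwk",
    "worst_site_qwk",
    "site_qwk_spread",
    "ordinal_error",
    "severe_error",
)


def calibrate(clean_runs, margin=0.0):
    """Step 1: record the range each diagnostic covers on clean FedAvg runs."""
    ranges = {}
    for name in DIAGNOSTICS:
        values = [run[name] for run in clean_runs]
        ranges[name] = (min(values) - margin, max(values) + margin)
    return ranges


def choose_aggregator(diagnostics, ranges, min_violations=2):
    """Steps 2-4: count diagnostics outside the clean range, then decide."""
    violations = [
        name
        for name in DIAGNOSTICS
        if not ranges[name][0] <= diagnostics[name] <= ranges[name][1]
    ]
    choice = "cross_site_blending" if len(violations) >= min_violations else "fedavg"
    return choice, violations


# Placeholder clean-run diagnostics (illustrative values only).
clean_runs = [
    {"global_qwk": 0.80, "worst_site_qwk": 0.74, "site_qwk_spread": 0.05,
     "ordinal_error": 0.60, "severe_error": 0.04},
    {"global_qwk": 0.82, "worst_site_qwk": 0.76, "site_qwk_spread": 0.07,
     "ordinal_error": 0.56, "severe_error": 0.03},
]
ranges = calibrate(clean_runs)

normal = {"global_qwk": 0.81, "worst_site_qwk": 0.75, "site_qwk_spread": 0.06,
          "ordinal_error": 0.58, "severe_error": 0.035}
abnormal = {"global_qwk": 0.74, "worst_site_qwk": 0.62, "site_qwk_spread": 0.15,
            "ordinal_error": 0.58, "severe_error": 0.035}

normal_choice, normal_violations = choose_aggregator(normal, ranges)
abnormal_choice, abnormal_violations = choose_aggregator(abnormal, ranges)
```

Requiring several diagnostics to leave the clean range before switching trades sensitivity for fewer false switches on clean data, which is the trade-off the false-switching rate measures.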
What I claim, and what I do not claim
The claim is not that hospitals commonly have 45% random label noise. The claim is that sample volume and client reliability can diverge, and FedAvg has no built-in way to notice when the largest client should no longer be trusted the most.
Supported
- FedAvg has a dominant-site reliability failure mode in simulated pathology federations
- Cross-site blending helps when the largest simulated site is unreliable
- The effect transfers from random label corruption to systematic conservative ordinal grading bias
- Observable detector switches can recover much of the benefit under label noise and conservative threshold shift
Not yet supported
- real hospital federated validation
- clinical deployment
- diagnostic use
- proof that one detector calibration works across every failure mode
This is research infrastructure and simulated-federation evidence, not clinical software. The next validation step is to test the same idea on real multi-center pathology benchmarks like Camelyon17 and on additional site-shift patterns.
Full documentation is available at the dominance-aware switch results page. The source code lives at github.com/matthewvaishnav/computational-pathology-research.