PCam Federated Benchmark Plan
Status: 📋 Planned
Previous Step: PCam federated smoke tests (5 rounds) — ✅ Complete
Next Step: Full PCam federated benchmark (20-50 rounds)
Objective
Run a comprehensive federated learning benchmark on real PCam pathology patches to compare institutional weighting strategies under controlled conditions with extended training.
Motivation
The smoke tests (5 rounds) validated that:
- All strategies execute end-to-end without crashes
- Real pathology patches load and train correctly
- Weights are computed and logged properly
- No numerical instabilities (NaN/Inf)
However, 5 rounds is insufficient to:
- Observe convergence behavior
- Measure final performance differences
- Evaluate weight dynamics over time
- Assess calibration quality
The benchmark extends training to 20-50 rounds to capture these dynamics.
Scope
What This Benchmark IS
- ✅ Extended training on real PCam pathology patches
- ✅ Controlled comparison of 4 weighting strategies
- ✅ Measurement of convergence, performance, and weight dynamics
- ✅ Validation of FAIR-WEIGHTS-H behavior over extended training
What This Benchmark IS NOT
- ❌ Real multi-center Camelyon17 validation
- ❌ True hospital-level domain shift evaluation
- ❌ Slide-level WSI aggregation
- ❌ Clinical validation
Experimental Design
Dataset
- Source: Real PCam pathology patches (Camelyon16-derived)
- Size: 5,000 training samples (for benchmark speed)
- Sites: 5 simulated federated sites (1,000 samples each)
- Distribution: Balanced positive rates across sites
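The site split described above can be sketched as a stratified round-robin deal. This is a minimal illustration, not the project's loader: the function name `partition_balanced`, the toy label vector, and the 50% positive rate are all assumptions made for the example.

```python
import random

def partition_balanced(labels, n_sites=5, seed=42):
    """Split sample indices into n_sites shards with matched positive rates."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    sites = [[] for _ in range(n_sites)]
    # Deal each class round-robin so every site receives the same class mix.
    for k, idx in enumerate(pos):
        sites[k % n_sites].append(idx)
    for k, idx in enumerate(neg):
        sites[k % n_sites].append(idx)
    return sites

# Toy labels standing in for 5,000 PCam patches (50% positive is an assumption).
labels = [i % 2 for i in range(5000)]
sites = partition_balanced(labels)
print([len(s) for s in sites])
```

Fixing the shuffle seed keeps the partition identical across strategies, so differences between runs come from the weighting rule rather than from the split.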
Strategies to Compare
- Equal — Uniform weighting (baseline)
- Volume — Dataset-size-based weighting
- Prestige — Accuracy-based weighting
- FAIR-WEIGHTS-H — Fairness-aware weighting (default settings)
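The three baseline rules can be sketched as follows. The function names are hypothetical, and the prestige rule shown here (weights proportional to each site's accuracy) is an assumption, since the plan only states that it is accuracy-based. FAIR-WEIGHTS-H is left out because its rule is defined in the implementation file rather than in this plan.

```python
def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def equal_weights(n_sites):
    # Baseline: every site contributes equally.
    return [1.0 / n_sites] * n_sites

def volume_weights(site_sizes):
    # Weight proportional to the number of local training samples.
    return normalize(site_sizes)

def prestige_weights(site_accuracies):
    # Assumed rule: weight proportional to each site's reported accuracy.
    return normalize(site_accuracies)

print(equal_weights(5))
print(volume_weights([1000] * 5))
print(prestige_weights([0.80, 0.82, 0.78, 0.85, 0.75]))  # illustrative accuracies
```

With five sites of 1,000 samples each, volume weighting reduces to equal weighting, so any difference between those two runs would point to a bug rather than a real effect.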
Training Configuration
- Rounds: 20-50 (final value to be chosen based on convergence; the Phase 2 commands use 30)
- Local epochs per round: 1
- Batch size: 32
- Learning rate: 0.01
- Model: Simple CNN (same as smoke tests)
- Seeds: 3 independent runs for statistical reliability
Metrics to Track
Performance Metrics
- Global AUC — Overall model performance
- Site-wise AUC — Per-site performance (5 sites)
- Worst-site AUC — Minimum site performance (fairness proxy)
- Global accuracy — Overall classification accuracy
- Site-wise accuracy — Per-site accuracy
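Site-wise and worst-site AUC can be computed as in the sketch below. It uses the pairwise (Mann-Whitney) definition of AUC in plain Python so that it has no dependencies; the function names are hypothetical, and for full-size evaluation sets a library routine such as scikit-learn's `roc_auc_score` would be the more practical choice.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def worst_site_auc(site_scores, site_labels):
    """Return (minimum AUC over sites, list of per-site AUCs)."""
    per_site = [auc(s, y) for s, y in zip(site_scores, site_labels)]
    return min(per_site), per_site

# Two toy sites with made-up scores, for illustration only.
worst, per_site = worst_site_auc(
    [[0.9, 0.8, 0.3, 0.1], [0.6, 0.4, 0.7, 0.2]],
    [[1, 1, 0, 0], [1, 1, 0, 0]],
)
print(per_site, worst)
```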
Calibration Metrics
- ECE (Expected Calibration Error) — Global calibration quality
- Site-wise ECE — Per-site calibration quality
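A minimal sketch of ECE for the binary tumor/normal task is shown below. It assumes equal-width confidence bins and takes the confidence of the predicted class; the bin count of 10 and the function name are assumptions, not settings taken from the project.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions; probs are P(y=1), labels are 0 or 1."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p      # confidence in the predicted class
        correct = 1.0 if pred == y else 0.0
        b = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the last bin
        bins[b].append((conf, correct))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        # Each bin contributes its |accuracy - confidence| gap, weighted by its share of samples.
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

print(expected_calibration_error([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0]))
```

Site-wise ECE follows by calling the same function on each site's predictions separately.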
Weight Dynamics
- Weight entropy — Distribution uniformity (H = -Σ w_i log w_i)
- N_eff (Effective institution count) — exp(H)
- Weight trajectories — How weights evolve over rounds
- Weight variance — Stability of weight assignments
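The entropy and N_eff definitions above translate directly into code. The sketch below uses the natural logarithm, which matches the log(5) ≈ 1.61 reference value for five uniform weights; the function names are hypothetical.

```python
import math

def weight_entropy(weights):
    """Shannon entropy H = -sum(w_i * log(w_i)), skipping zero weights."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def effective_institution_count(weights):
    """N_eff = exp(H): 1 when one site dominates, n for n uniform weights."""
    return math.exp(weight_entropy(weights))

uniform = [0.2] * 5
print(weight_entropy(uniform), effective_institution_count(uniform))
```

N_eff is easier to read than raw entropy in plots: a value near 5 means all sites contribute, while a value near 1 means a single site effectively determines the aggregate.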
Convergence Metrics
- Training loss — Per-round training loss
- Validation loss — Per-round validation loss
- Rounds to convergence — When performance plateaus
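One possible way to operationalize "rounds to convergence" is a patience rule on a per-round metric such as validation AUC. The plan does not fix a definition, so the `patience` and `min_delta` values below are assumptions to be tuned.

```python
def rounds_to_convergence(metric_per_round, patience=3, min_delta=1e-3):
    """Return the 1-based round of the last meaningful improvement,
    or None if the metric has not yet plateaued for `patience` rounds."""
    best = float("-inf")
    best_round = 0
    for r, value in enumerate(metric_per_round, start=1):
        if value > best + min_delta:
            best = value
            best_round = r
        elif r - best_round >= patience:
            return best_round
    return None

# Made-up AUC trajectory, for illustration only.
print(rounds_to_convergence([0.70, 0.80, 0.85, 0.8505, 0.8503, 0.8504]))
```

The same rule can be applied to negated validation loss if loss is the preferred convergence signal.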
Implementation Plan
Phase 1: Script Enhancement
Extend scripts/federated/run_pcam_federated_smoke.py to support:
- Configurable number of rounds (--rounds N, with N between 20 and 50)
- Multiple seeds (--seed S, one run per seed for 42, 43, and 44)
- Extended metrics logging (AUC, ECE, weight dynamics)
- Validation set evaluation
- Convergence detection
Phase 2: Execution
Run benchmark for each strategy × seed combination:
# Equal weighting
python scripts/federated/run_pcam_federated_smoke.py --weighting equal --rounds 30 --seed 42
python scripts/federated/run_pcam_federated_smoke.py --weighting equal --rounds 30 --seed 43
python scripts/federated/run_pcam_federated_smoke.py --weighting equal --rounds 30 --seed 44
# Volume weighting
python scripts/federated/run_pcam_federated_smoke.py --weighting volume --rounds 30 --seed 42
python scripts/federated/run_pcam_federated_smoke.py --weighting volume --rounds 30 --seed 43
python scripts/federated/run_pcam_federated_smoke.py --weighting volume --rounds 30 --seed 44
# Prestige weighting
python scripts/federated/run_pcam_federated_smoke.py --weighting prestige --rounds 30 --seed 42
python scripts/federated/run_pcam_federated_smoke.py --weighting prestige --rounds 30 --seed 43
python scripts/federated/run_pcam_federated_smoke.py --weighting prestige --rounds 30 --seed 44
# FAIR-WEIGHTS-H
python scripts/federated/run_pcam_federated_smoke.py --weighting fair_weights_h --rounds 30 --seed 42
python scripts/federated/run_pcam_federated_smoke.py --weighting fair_weights_h --rounds 30 --seed 43
python scripts/federated/run_pcam_federated_smoke.py --weighting fair_weights_h --rounds 30 --seed 44
Total: 12 runs (4 strategies × 3 seeds)
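The 12 commands above can also be generated rather than typed by hand. The sketch below only builds and prints the command lines; it assumes the `--rounds` and `--seed` flags from Phase 1 have already been added to the script, and the `build_commands` helper is hypothetical.

```python
import itertools

SCRIPT = "scripts/federated/run_pcam_federated_smoke.py"
STRATEGIES = ["equal", "volume", "prestige", "fair_weights_h"]
SEEDS = [42, 43, 44]
ROUNDS = 30

def build_commands(strategies=STRATEGIES, seeds=SEEDS, rounds=ROUNDS):
    """Return one argument list per strategy x seed combination."""
    return [
        ["python", SCRIPT, "--weighting", s, "--rounds", str(rounds), "--seed", str(seed)]
        for s, seed in itertools.product(strategies, seeds)
    ]

commands = build_commands()
for cmd in commands:
    print(" ".join(cmd))
    # To execute instead of print: subprocess.run(cmd, check=True)
```

Driving the runs from one list keeps the round count and seeds consistent across strategies and makes it simple to add seeds later.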
Phase 3: Analysis
Generate comparison report with:
- Performance comparison tables (mean ± std across seeds)
- Weight dynamics plots (entropy, N_eff over rounds)
- Convergence curves (loss, accuracy, AUC over rounds)
- Calibration comparison (ECE across strategies)
- Statistical significance tests (paired t-tests)
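The seed-level aggregation and the paired t-test can be sketched with the standard library alone. The function names are hypothetical and the numbers below are placeholders, not results. Note that with 3 seeds the paired test has only 2 degrees of freedom, so the two-sided 5% critical value is about 4.30 and only large, consistent differences will reach significance.

```python
import math
import statistics

def mean_std(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for per-seed metrics a vs. b."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1

# Placeholder per-seed AUCs (seeds 42, 43, 44), for illustration only.
strategy_a = [0.871, 0.865, 0.880]
strategy_b = [0.869, 0.860, 0.872]
print(mean_std(strategy_a))
print(paired_t_statistic(strategy_a, strategy_b))
```

Pairing by seed is valid here because each seed fixes the same data split and initialization across strategies.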
Phase 4: Documentation
Create docs/validation/pcam-federated-benchmark-results.md with:
- Executive summary
- Detailed results tables
- Visualizations
- Interpretation
- Limitations
- Next steps
Expected Outcomes
Hypothesis 1: Convergence
All strategies should converge to similar global AUC (~0.85-0.90 for PCam) since sites are balanced.
Hypothesis 2: Weight Dynamics
- Equal: Constant uniform weights (entropy = log(5) ≈ 1.61)
- Volume: Constant uniform weights (sites have equal size)
- Prestige: Dynamic weights favoring better-performing sites
- FAIR-WEIGHTS-H: Balanced weights with slight adjustments for fairness
Hypothesis 3: Calibration
FAIR-WEIGHTS-H may show better calibration if fairness-aware weighting reduces overconfidence.
Hypothesis 4: Worst-Site Performance
FAIR-WEIGHTS-H should maintain competitive worst-site AUC compared to baselines.
Success Criteria
The benchmark is successful if:
- ✅ All 12 runs complete without crashes
- ✅ Metrics are logged and saved correctly
- ✅ Results are reproducible across seeds (low variance)
- ✅ FAIR-WEIGHTS-H shows no performance degradation vs. baselines
- ✅ Weight dynamics match theoretical expectations
- ✅ Comprehensive report is generated
Timeline Estimate
- Script enhancement: 2-4 hours
- Execution: 4-6 hours (12 runs × ~20-30 min each)
- Analysis: 2-3 hours
- Documentation: 1-2 hours
- Total: 1-2 days
Risks and Mitigations
Risk 1: Long Runtime
Mitigation: Use a smaller dataset (5K samples) and an efficient model. Run on a GPU if one is available.
Risk 2: High Variance Across Seeds
Mitigation: Use 3 seeds minimum. If variance is high, add more seeds.
Risk 3: No Performance Differences
Mitigation: This outcome is expected for balanced, equal-size sites and does not indicate a failure. The value lies in validating weight dynamics and establishing baseline performance.
Risk 4: Numerical Instabilities
Mitigation: Smoke tests already validated stability. Monitor for NaN/Inf during runs.
Future Extensions
After this benchmark:
- Heterogeneous Sites: Introduce class imbalance across sites
- New Mathematical Modes: Test log_linear and mirror_descent
- Hyperparameter Sensitivity: Vary beta, eta, temperature
- Real Camelyon17: Move to true multi-center WSI validation
References
- Smoke test report: docs/FAIR_WEIGHTS_H_PCAM_FEDERATED_SMOKE_REPORT.md
- Implementation: src/features/federated/pathology_fl/weighting/fair_weights_h.py
- Test script: scripts/federated/run_pcam_federated_smoke.py
Validation Ladder Position
✅ 1. Synthetic Camelyon17-like smoke
✅ 2. PCam federated smoke (equal)
✅ 3. PCam federated smoke (all strategies)
📋 4. PCam federated benchmark ← YOU ARE HERE
⏭️ 5. Real Camelyon17 subset smoke
⏭️ 6. Real Camelyon17 full validation
This benchmark represents the transition from "does it run?" to "how does it perform?"