PCam Federated Benchmark Plan

Status: 📋 Planned
Previous Step: PCam federated smoke tests (5 rounds) — ✅ Complete
Next Step: Full PCam federated benchmark (20-50 rounds)

Objective

Run a comprehensive federated learning benchmark on real PCam pathology patches to compare institutional weighting strategies under controlled conditions with extended training.

Motivation

The smoke tests (5 rounds) validated that:

All strategies execute end-to-end without crashes
Real pathology patches load and train correctly
Weights are computed and logged properly
No numerical instabilities (NaN/Inf)

However, 5 rounds is insufficient to:

Observe convergence behavior
Measure final performance differences
Evaluate weight dynamics over time
Assess calibration quality

The benchmark extends training to 20-50 rounds to capture these dynamics.

Scope

What This Benchmark IS

✅ Extended training on real PCam pathology patches
✅ Controlled comparison of 4 weighting strategies
✅ Measurement of convergence, performance, and weight dynamics
✅ Validation of FAIR-WEIGHTS-H behavior over extended training

What This Benchmark IS NOT

❌ Real multi-center Camelyon17 validation
❌ True hospital-level domain shift evaluation
❌ Slide-level WSI aggregation
❌ Clinical validation

Experimental Design

Dataset

Source: Real PCam pathology patches (Camelyon16-derived)
Size: 5,000 training samples (for benchmark speed)
Sites: 5 simulated federated sites (1,000 samples each)
Distribution: Balanced positive rates across sites
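
The site split described above can be sketched as a stratified round-robin partition. This is a minimal illustration, not the smoke script's actual loader: the function name `partition_balanced` and the toy label list are assumptions, and real PCam labels would replace the stand-in data.

```python
import random
from collections import defaultdict


def partition_balanced(labels, n_sites=5, seed=42):
    """Split sample indices into n_sites shards with near-equal size and positive rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    sites = [[] for _ in range(n_sites)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal each class round-robin so every site receives the same share of it.
        for k, idx in enumerate(idxs):
            sites[k % n_sites].append(idx)
    return sites


if __name__ == "__main__":
    # Toy labels standing in for PCam: 5,000 samples, half positive (assumption).
    labels = [i % 2 for i in range(5000)]
    for s, idxs in enumerate(partition_balanced(labels)):
        pos_rate = sum(labels[i] for i in idxs) / len(idxs)
        print(f"site {s}: n={len(idxs)}, positive rate={pos_rate:.3f}")
```

Dealing each class separately keeps the positive rate identical across sites whenever the class counts divide evenly by the number of sites, which matches the balanced setup this benchmark assumes.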

Strategies to Compare

Equal — Uniform weighting (baseline)
Volume — Dataset-size-based weighting
Prestige — Accuracy-based weighting
FAIR-WEIGHTS-H — Fairness-aware weighting (default settings)
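
To make the three baseline strategies concrete, the sketch below computes their weights and applies them in a FedAvg-style weighted average. It rests on assumptions: the function names are hypothetical, Prestige is taken to be proportional to raw site accuracy, and FAIR-WEIGHTS-H is left out because its formula is not stated in this plan.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return [s / total for s in scores]


def equal_weights(n_sites):
    # Uniform baseline: every site contributes equally.
    return [1.0 / n_sites] * n_sites


def volume_weights(site_sizes):
    # Weight proportional to the number of local training samples.
    return normalize(site_sizes)


def prestige_weights(site_accuracies):
    # Assumption: weight proportional to raw accuracy; the real rule may differ.
    return normalize(site_accuracies)


def aggregate(site_params, weights):
    """Weighted average of per-site parameter vectors (FedAvg-style)."""
    n_params = len(site_params[0])
    return [
        sum(w * params[j] for w, params in zip(weights, site_params))
        for j in range(n_params)
    ]
```

With five sites of 1,000 samples each, `volume_weights([1000] * 5)` returns the same vector as `equal_weights(5)`, which is why the plan expects Equal and Volume to behave identically here.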

Training Configuration

Rounds: 20-50 (final value to be chosen based on observed convergence; the Phase 2 commands use 30)
Local epochs per round: 1
Batch size: 32
Learning rate: 0.01
Model: Simple CNN (same as smoke tests)
Seeds: 3 independent runs for statistical reliability

Metrics to Track

Performance Metrics

Global AUC — Overall model performance
Site-wise AUC — Per-site performance (5 sites)
Worst-site AUC — Minimum site performance (fairness proxy)
Global accuracy — Overall classification accuracy
Site-wise accuracy — Per-site accuracy
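
The site-wise and worst-site AUC metrics can be computed as in the following sketch. It uses the rank-based (Mann-Whitney) form of AUC in pure Python so it has no dependencies; the names `roc_auc` and `site_auc_summary` are illustrative assumptions, and a library routine such as scikit-learn's `roc_auc_score` could be substituted.

```python
def roc_auc(y_true, y_score):
    """AUC as P(random positive scores above random negative); ties count as 0.5."""
    pairs = sorted(zip(y_score, y_true))
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the block of tied scores pairs[i:j] and give it the average rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(y for _, y in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def site_auc_summary(site_predictions):
    """Map {site: (y_true, y_score)} to per-site AUCs plus the worst (site, AUC)."""
    per_site = {site: roc_auc(y, s) for site, (y, s) in site_predictions.items()}
    worst_site = min(per_site, key=per_site.get)
    return per_site, (worst_site, per_site[worst_site])
```

Reporting the worst site by name as well as by value makes it easier to see whether the same site is consistently the weakest across rounds and seeds.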

Calibration Metrics

ECE (Expected Calibration Error) — Global calibration quality
Site-wise ECE — Per-site calibration quality
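
ECE can be computed in several ways, so the sketch below fixes one common choice as an assumption: binary predictions thresholded at 0.5, confidence taken as the probability of the predicted class, and 10 equal-width bins. The plan does not specify the binning, and the function name is hypothetical.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binary ECE: sample-weighted mean of |accuracy - confidence| over confidence bins."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and the same length")

    conf_sum = [0.0] * n_bins      # summed confidence per bin
    correct_sum = [0.0] * n_bins   # number of correct predictions per bin
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        b = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        conf_sum[b] += conf
        correct_sum[b] += 1.0 if pred == y else 0.0

    # |acc_b - conf_b| * n_b / n simplifies to |correct_sum_b - conf_sum_b| / n.
    n = len(y_true)
    return sum(abs(c - a) for c, a in zip(conf_sum, correct_sum)) / n
```

Calling the same function on each site's predictions gives the site-wise ECE; whatever binning is chosen should be held fixed across strategies so the comparison is fair.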

Weight Dynamics

Weight entropy — Distribution uniformity (H = -Σ w_i log w_i)
N_eff (Effective institution count) — exp(H)
Weight trajectories — How weights evolve over rounds
Weight variance — Stability of weight assignments
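
The entropy and N_eff definitions above translate directly into code. The sketch uses the natural logarithm, consistent with N_eff = exp(H); the function names are illustrative.

```python
import math


def weight_entropy(weights):
    """Shannon entropy H = -sum(w_i * log(w_i)), skipping zero weights."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def effective_institution_count(weights):
    """N_eff = exp(H): equals the site count for uniform weights, 1 for one-hot."""
    return math.exp(weight_entropy(weights))


if __name__ == "__main__":
    uniform = [0.2] * 5
    print(f"H = {weight_entropy(uniform):.3f}")
    print(f"N_eff = {effective_institution_count(uniform):.3f}")
```

N_eff reads as "how many sites effectively contribute": it falls from 5 toward 1 as the weights concentrate on a single site.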

Convergence Metrics

Training loss — Per-round training loss
Validation loss — Per-round validation loss
Rounds to convergence — When performance plateaus
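
"When performance plateaus" needs an operational definition. One possible rule, sketched below, is patience-based: training counts as converged at the last round that improved the metric by more than `min_delta`, once `patience` further rounds pass without such an improvement. The thresholds and the function name are assumptions, not values fixed by this plan.

```python
def rounds_to_convergence(metric_per_round, patience=5, min_delta=1e-3):
    """Return the 1-based round of the last meaningful improvement, or None.

    The metric is assumed to be higher-is-better (e.g. validation AUC).
    None means the run never went `patience` rounds without improving.
    """
    best = float("-inf")
    best_round = None
    for rnd, value in enumerate(metric_per_round, start=1):
        if value > best + min_delta:
            best, best_round = value, rnd
        elif best_round is not None and rnd - best_round >= patience:
            return best_round
    return None
```

For a lower-is-better metric such as validation loss, the negated values can be passed in. A result of `None` after the final round suggests the round budget was too small for that strategy.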

Implementation Plan

Phase 1: Script Enhancement

Extend scripts/federated/run_pcam_federated_smoke.py to support:

Configurable number of rounds (--rounds N, with N between 20 and 50)
Multiple seeds, one run per seed (--seed 42, --seed 43, --seed 44)
Extended metrics logging (AUC, ECE, weight dynamics)
Validation set evaluation
Convergence detection
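
A possible shape for the extended command-line interface is sketched below with `argparse`. The flag names `--weighting`, `--rounds`, and `--seed` come from the commands in this plan; the defaults and the `build_parser` helper are assumptions, and the existing script may already define some of these options differently.

```python
import argparse

STRATEGIES = ("equal", "volume", "prestige", "fair_weights_h")


def build_parser():
    parser = argparse.ArgumentParser(description="PCam federated benchmark (sketch)")
    parser.add_argument("--weighting", choices=STRATEGIES, default="equal",
                        help="institutional weighting strategy")
    # Default of 5 mirrors the smoke-test length; the benchmark passes 20-50.
    parser.add_argument("--rounds", type=int, default=5,
                        help="number of federated rounds")
    parser.add_argument("--seed", type=int, default=42,
                        help="random seed for one independent run")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"weighting={args.weighting} rounds={args.rounds} seed={args.seed}")
```

Restricting `--weighting` to a fixed set of choices makes a mistyped strategy name fail immediately instead of partway through a long run.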

Phase 2: Execution

Run benchmark for each strategy × seed combination:

```bash
# Equal weighting
python scripts/federated/run_pcam_federated_smoke.py --weighting equal --rounds 30 --seed 42
python scripts/federated/run_pcam_federated_smoke.py --weighting equal --rounds 30 --seed 43
python scripts/federated/run_pcam_federated_smoke.py --weighting equal --rounds 30 --seed 44

# Volume weighting
python scripts/federated/run_pcam_federated_smoke.py --weighting volume --rounds 30 --seed 42
python scripts/federated/run_pcam_federated_smoke.py --weighting volume --rounds 30 --seed 43
python scripts/federated/run_pcam_federated_smoke.py --weighting volume --rounds 30 --seed 44

# Prestige weighting
python scripts/federated/run_pcam_federated_smoke.py --weighting prestige --rounds 30 --seed 42
python scripts/federated/run_pcam_federated_smoke.py --weighting prestige --rounds 30 --seed 43
python scripts/federated/run_pcam_federated_smoke.py --weighting prestige --rounds 30 --seed 44

# FAIR-WEIGHTS-H
python scripts/federated/run_pcam_federated_smoke.py --weighting fair_weights_h --rounds 30 --seed 42
python scripts/federated/run_pcam_federated_smoke.py --weighting fair_weights_h --rounds 30 --seed 43
python scripts/federated/run_pcam_federated_smoke.py --weighting fair_weights_h --rounds 30 --seed 44
```

Total: 12 runs (4 strategies × 3 seeds)
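
The 12 runs can also be generated programmatically instead of being listed by hand. The driver below is a sketch: the script path and flags are taken from the commands above, while `build_commands`, `run_all`, and the dry-run default are assumptions added for illustration.

```python
import itertools
import subprocess
import sys

SCRIPT = "scripts/federated/run_pcam_federated_smoke.py"
STRATEGIES = ["equal", "volume", "prestige", "fair_weights_h"]
SEEDS = [42, 43, 44]


def build_commands(rounds=30):
    """One command per strategy x seed combination, in strategy-major order."""
    return [
        [sys.executable, SCRIPT, "--weighting", strategy,
         "--rounds", str(rounds), "--seed", str(seed)]
        for strategy, seed in itertools.product(STRATEGIES, SEEDS)
    ]


def run_all(rounds=30, dry_run=True):
    for cmd in build_commands(rounds):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sweep at the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)  # print the commands without launching training
```

A dry run prints the full matrix first, which is a cheap way to confirm the count before committing several hours of compute.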

Phase 3: Analysis

Generate comparison report with:

Performance comparison tables (mean ± std across seeds)
Weight dynamics plots (entropy, N_eff over rounds)
Convergence curves (loss, accuracy, AUC over rounds)
Calibration comparison (ECE across strategies)
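Statistical significance tests (paired t-tests across seeds; with only 3 seeds these have 2 degrees of freedom and low power, so treat them as indicative)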
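
The across-seed summaries can be sketched with the standard library alone. The helper names are hypothetical, and the code stops at the t statistic rather than a p-value, since the latter needs a t-distribution routine such as `scipy.stats.ttest_rel`.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for per-seed metrics a vs. b.

    a[i] and b[i] must come from the same seed.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With 3 seeds the test has 2 degrees of freedom, so the two-sided 5% critical value is about 4.30. Only large and consistent differences will clear that bar, which is one reason to add seeds if a comparison looks borderline.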

Phase 4: Documentation

Create docs/validation/pcam-federated-benchmark-results.md with:

Executive summary
Detailed results tables
Visualizations
Interpretation
Limitations
Next steps

Expected Outcomes

Hypothesis 1: Convergence

All strategies should converge to similar global AUC (~0.85-0.90 for PCam) since sites are balanced.

Hypothesis 2: Weight Dynamics

Equal: Constant uniform weights (entropy = log(5) ≈ 1.61)
Volume: Constant uniform weights, identical to Equal (sites have equal size)
Prestige: Dynamic weights favoring better-performing sites
FAIR-WEIGHTS-H: Balanced weights with slight adjustments for fairness

Hypothesis 3: Calibration

FAIR-WEIGHTS-H may show better calibration if fairness-aware weighting reduces overconfidence.

Hypothesis 4: Worst-Site Performance

FAIR-WEIGHTS-H should maintain competitive worst-site AUC compared to baselines.

Success Criteria

The benchmark is successful if:

✅ All 12 runs complete without crashes
✅ Metrics are logged and saved correctly
✅ Results are reproducible across seeds (low variance)
✅ FAIR-WEIGHTS-H shows no performance degradation vs. baselines
✅ Weight dynamics match theoretical expectations
✅ Comprehensive report is generated

Timeline Estimate

Script enhancement: 2-4 hours
Execution: 4-6 hours (12 runs × ~20-30 min each)
Analysis: 2-3 hours
Documentation: 1-2 hours
Total: 9-15 hours of work (1-2 days)

Risks and Mitigations

Risk 1: Long Runtime

Mitigation: Use smaller dataset (5K samples) and efficient model. Consider GPU if available.

Risk 2: High Variance Across Seeds

Mitigation: Use 3 seeds minimum. If variance is high, add more seeds.

Risk 3: No Performance Differences

Mitigation: This is expected for balanced sites. The benchmark's value lies in validating weight dynamics and establishing baseline performance.

Risk 4: Numerical Instabilities

Mitigation: Smoke tests already validated stability. Monitor for NaN/Inf during runs.
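
Monitoring for NaN/Inf can be as simple as a per-round check that fails fast. The helper below is a sketch; its name and the choice of `FloatingPointError` are assumptions, and it expects plain Python floats (tensor values would need converting first).

```python
import math


def assert_finite(name, values, round_idx):
    """Raise if any logged value for this round is NaN or infinite."""
    bad = [i for i, v in enumerate(values) if not math.isfinite(v)]
    if bad:
        raise FloatingPointError(
            f"round {round_idx}: {name} has non-finite values at positions {bad}"
        )


if __name__ == "__main__":
    assert_finite("loss", [0.52, 0.47], round_idx=1)
    assert_finite("weights", [0.2, 0.2, 0.2, 0.2, 0.2], round_idx=1)
    print("all values finite")
```

Checking both the losses and the aggregation weights each round pinpoints the first round where an instability appears, rather than finding it later in the final metrics.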

Future Extensions

After this benchmark:

Heterogeneous Sites: Introduce class imbalance across sites
New Mathematical Modes: Test log_linear and mirror_descent
Hyperparameter Sensitivity: Vary beta, eta, temperature
Real Camelyon17: Move to true multi-center WSI validation

References

Smoke test report: docs/FAIR_WEIGHTS_H_PCAM_FEDERATED_SMOKE_REPORT.md
Implementation: src/features/federated/pathology_fl/weighting/fair_weights_h.py
Test script: scripts/federated/run_pcam_federated_smoke.py

Validation Ladder Position

✅ 1. Synthetic Camelyon17-like smoke
✅ 2. PCam federated smoke (equal)
✅ 3. PCam federated smoke (all strategies)
📋 4. PCam federated benchmark ← YOU ARE HERE
⏭️ 5. Real Camelyon17 subset smoke
⏭️ 6. Real Camelyon17 full validation

This benchmark represents the transition from "does it run?" to "how does it perform?"
