Computational Pathology Research Framework

A tested PyTorch framework for computational pathology research with working benchmarks on PatchCamelyon and CAMELYON16

View on GitHub matthewvaishnav/computational-pathology-research

PatchCamelyon Baseline Comparison Guide

Date: 2026-04-07
Status: ✅ OPERATIONAL

Overview

This guide documents the reproducible PCam baseline comparison pipeline that enables systematic evaluation of different model variants on the PatchCamelyon dataset.

What This Enables

The comparison pipeline lets you train and evaluate multiple model variants on the same PCam subset under identical settings, then aggregate their metrics into a single comparison report.

Available Baselines

1. baseline_resnet18 (Default)

2. resnet50

3. simple_head

Quick Start

Run All Comparisons (Quick Test - 3 epochs)

python experiments/compare_pcam_baselines.py \
  --configs experiments/configs/pcam_comparison/*.yaml \
  --quick-test

Run All Comparisons (Full Training - 20 epochs)

python experiments/compare_pcam_baselines.py \
  --configs experiments/configs/pcam_comparison/*.yaml

Run Specific Variants

python experiments/compare_pcam_baselines.py \
  --configs \
    experiments/configs/pcam_comparison/baseline_resnet18.yaml \
    experiments/configs/pcam_comparison/resnet50.yaml

Output Structure

results/pcam_comparison/
├── comparison_results.json          # Aggregated comparison metrics
├── baseline_resnet18/
│   ├── metrics.json                 # Detailed metrics
│   ├── confusion_matrix.png         # Confusion matrix plot
│   └── roc_curve.png                # ROC curve plot
├── resnet50/
│   ├── metrics.json
│   ├── confusion_matrix.png
│   └── roc_curve.png
└── simple_head/
    ├── metrics.json
    ├── confusion_matrix.png
    └── roc_curve.png

checkpoints/pcam_comparison/
├── baseline_resnet18/best_model.pth
├── resnet50/best_model.pth
└── simple_head/best_model.pth

logs/pcam_comparison/
├── baseline_resnet18/               # TensorBoard logs
├── resnet50/
└── simple_head/
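After a run, it can be useful to confirm that every variant actually produced the artifacts shown in the tree above. A minimal sketch (the helper name `missing_artifacts` is illustrative, not part of the pipeline):

```python
from pathlib import Path

EXPECTED_FILES = ["metrics.json", "confusion_matrix.png", "roc_curve.png"]

def missing_artifacts(results_root, variants):
    """Return {variant: [missing files]} for variants lacking expected outputs."""
    missing = {}
    for variant in variants:
        variant_dir = Path(results_root) / variant
        absent = [f for f in EXPECTED_FILES if not (variant_dir / f).exists()]
        if absent:
            missing[variant] = absent
    return missing

# Example:
# missing_artifacts("results/pcam_comparison",
#                   ["baseline_resnet18", "resnet50", "simple_head"])
```

An empty dict means all variants completed evaluation; a non-empty entry usually points at a failed training or evaluation step for that variant.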

Comparison Results Format

The comparison_results.json file contains aggregated metrics for all variants:

{
  "timestamp": "2026-04-07 15:42:52",
  "variants": [
    {
      "name": "baseline_resnet18",
      "config_path": "experiments/configs/pcam_comparison/baseline_resnet18.yaml",
      "training_status": "success",
      "evaluation_status": "success",
      "training_time_seconds": 31.85,
      "test_accuracy": 1.0,
      "test_auc": 1.0,
      "test_f1": 1.0,
      "test_precision": 1.0,
      "test_recall": 1.0,
      "model_parameters": {
        "feature_extractor": 11176512,
        "encoder": 987904,
        "head": 33281,
        "total": 12197697
      },
      "inference_time_seconds": 0.68,
      "samples_per_second": 146.5,
      "checkpoint_path": "checkpoints/pcam_comparison/baseline_resnet18/best_model.pth",
      "results_dir": "results/pcam_comparison/baseline_resnet18"
    }
  ]
}
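Since the file is plain JSON, ranking variants programmatically is straightforward. A sketch using only the fields shown above (the function name `rank_variants` is illustrative):

```python
import json

def rank_variants(results_path, metric="test_auc"):
    """Load comparison_results.json and sort variants by a metric, best first."""
    with open(results_path) as f:
        results = json.load(f)
    ranked = sorted(results["variants"], key=lambda v: v.get(metric, 0.0),
                    reverse=True)
    for v in ranked:
        total_m = v["model_parameters"]["total"] / 1e6
        print(f"{v['name']:20s} {metric}={v.get(metric, 0.0):.4f} "
              f"params={total_m:.1f}M time={v['training_time_seconds']:.1f}s")
    return ranked
```

Swapping `metric` for `test_f1` or `test_accuracy` re-ranks without re-running any training.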

Example Comparison Results (Quick Test - 3 epochs)

Variant            Accuracy  AUC     F1      Parameters  Training Time
baseline_resnet18  1.0000    1.0000  1.0000  12.2M       31.8s
simple_head        0.5500    1.0000  0.3548  12.2M       25.1s

Observations (3-epoch quick test): baseline_resnet18 fits the small synthetic subset perfectly, while simple_head reaches only 0.55 accuracy despite a perfect AUC, which suggests its ranking of samples is correct but its decision threshold is poorly calibrated after so few epochs.

Reproducibility

Fixed Settings Across All Variants

What Varies Between Baselines

Adding New Baselines

To add a new baseline variant:

  1. Create a new config file in experiments/configs/pcam_comparison/
  2. Set unique paths for checkpoints, logs, and results
  3. Modify architecture as needed
  4. Run comparison with the new config included

Example:

experiment:
  name: my_new_variant
  description: Description of what's different
  tags: [pcam, my-tag]

# ... architecture config ...

checkpoint:
  checkpoint_dir: ./checkpoints/pcam_comparison/my_new_variant

logging:
  log_dir: ./logs/pcam_comparison/my_new_variant

evaluation:
  output_dir: ./results/pcam_comparison/my_new_variant
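Because each variant must use unique checkpoint, log, and results paths (step 2 above), a path collision between two configs will silently overwrite outputs. A small sketch that catches this after configs are parsed, e.g. with yaml.safe_load (the helper name `find_path_collisions` is illustrative):

```python
def find_path_collisions(configs):
    """Given parsed variant configs (dicts shaped like the YAML example above),
    return (path, first_variant, second_variant) tuples for any output path
    claimed by more than one variant."""
    seen = {}  # path -> first variant name that claimed it
    collisions = []
    for cfg in configs:
        name = cfg["experiment"]["name"]
        paths = [
            cfg["checkpoint"]["checkpoint_dir"],
            cfg["logging"]["log_dir"],
            cfg["evaluation"]["output_dir"],
        ]
        for p in paths:
            if p in seen and seen[p] != name:
                collisions.append((p, seen[p], name))
            seen.setdefault(p, name)
    return collisions
```

Running this over all configs before launching a comparison turns a silent overwrite into an explicit error.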

Important Caveats

Dataset Scale

Validation Scope

Honest Claims Enabled

What you CAN say:

What you CANNOT say:

Next Steps for Stronger Benchmarks

To make stronger claims, you would need to:

  1. Download full PCam dataset (~7GB)
  2. Train on full 262K training set
  3. Evaluate on full 32K test set
  4. Implement published baselines (e.g., standard ResNet, DenseNet from PCam paper)
  5. Run fair comparisons with same preprocessing and evaluation protocol
  6. Compute confidence intervals with bootstrap or cross-validation
  7. Test on other datasets (CAMELYON16, TCGA) for generalization
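For step 6, a percentile bootstrap over per-sample correctness is a simple starting point. A stdlib-only sketch, assuming you have a 0/1 correctness flag per test sample (the function name `bootstrap_ci` is illustrative):

```python
import random

def bootstrap_ci(correct_flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from per-sample 0/1 correctness."""
    rng = random.Random(seed)
    n = len(correct_flags)
    # Resample the test set with replacement and recompute accuracy each time.
    stats = sorted(
        sum(rng.choices(correct_flags, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

On the full 32K-sample PCam test set the interval will be far tighter than on a 100-sample subset, which is one concrete reason the quick-test numbers above cannot support strong claims.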

Commands Reference

Quick Test (3 epochs, ~2 minutes per variant)

python experiments/compare_pcam_baselines.py \
  --configs experiments/configs/pcam_comparison/*.yaml \
  --quick-test

Full Training (20 epochs, roughly 3-4 minutes per variant on CPU, extrapolating from the ~32s quick-test training time)

python experiments/compare_pcam_baselines.py \
  --configs experiments/configs/pcam_comparison/*.yaml

Skip Training (Evaluate Existing Checkpoints)

python experiments/compare_pcam_baselines.py \
  --configs experiments/configs/pcam_comparison/*.yaml \
  --skip-training

Custom Output Path

python experiments/compare_pcam_baselines.py \
  --configs experiments/configs/pcam_comparison/*.yaml \
  --output results/my_comparison/results.json

Troubleshooting

Checkpoint Not Found

If evaluation fails with “checkpoint not found”, ensure training completed successfully. Check:

Out of Memory

If training fails with OOM:

Inconsistent Results

If results vary between runs:
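One common cause is unpinned random seeds. A minimal seeding helper (the name `set_global_seed` is illustrative; whether the pipeline already seeds this way is an assumption):

```python
import random

def set_global_seed(seed: int = 42) -> None:
    """Pin all RNGs so repeated runs see identical batches and init weights."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade cuDNN speed for deterministic convolution algorithms.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
```

Call this once at the top of the comparison script, before any data loaders or models are constructed, so every variant starts from the same RNG state.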


Status: Comparison pipeline operational ✅
Dataset: Synthetic PCam subset (500/100/100) ⚠️
Clinical validation: Not applicable ❌
Scientific benchmark: Requires full dataset ⚠️