A tested PyTorch framework for computational pathology research with working benchmarks on PatchCamelyon and CAMELYON16
View on GitHub matthewvaishnav/computational-pathology-research
Date: 2026-04-07
Status: Analysis Complete
Goal: Transform from synthetic-only validation to real dataset experiments
Framework Capabilities:
Benchmark Evidence:
| Aspect | Current (Synthetic) | Required (Real) | Gap |
|---|---|---|---|
| Train Size | 500 samples | 262,144 samples | 524x scale-up |
| Test Size | 100 samples | 32,768 samples | 327x scale-up |
| Data Format | H5 (synthetic) | H5 (real PCam) | Format compatible ✅ |
| Download | None (generated) | ~7GB download | Need download script |
| Training Time | 40 seconds (CPU) | ~4-8 hours (GPU) | Need GPU resources |
| Memory | <4GB RAM | 16GB+ RAM | Need optimization |
| Aspect | Current (Synthetic) | Required (Real) | Gap |
|---|---|---|---|
| Slides | 30 slides | 400 slides | 13x scale-up |
| Patches/Slide | 100 patches | 10,000+ patches | 100x scale-up |
| Data Format | H5 features | Raw .tif WSI | Need WSI preprocessing |
| Download | None (generated) | ~1TB raw WSI | Need download + storage |
| Feature Extraction | Synthetic | ResNet-50 on patches | Need extraction pipeline |
| Training Time | 7 seconds (CPU) | Days (GPU) | Need distributed training |
Why Start Here:
Implementation: Extend src/data/pcam_dataset.py
```python
# Already has download capability via TFDS or direct GitHub;
# just need to enable the full download (currently generates synthetic data).

# Option A: Use TensorFlow Datasets (recommended)
import tensorflow_datasets as tfds

# Note: the TFDS registry name for PatchCamelyon is 'patch_camelyon'
dataset = tfds.load('patch_camelyon', split='train', download=True)

# Option B: Direct download from GitHub
# URLs already defined in PCamDataset.PCAM_URLS
```
Action Items:
Estimated Time: 2-4 hours (implementation + testing)
Storage Required: ~7GB
Download Time: 30-60 minutes (depends on connection)
Current Bottlenecks:
Required Changes:
```yaml
# experiments/configs/pcam_full.yaml
training:
  num_epochs: 20
  batch_size: 256        # Increase from 128
  learning_rate: 1e-3
  use_amp: true          # Already supported
  num_workers: 4         # Parallel data loading
  device: cuda           # GPU required
  # Add gradient accumulation for larger effective batch size
  gradient_accumulation_steps: 4  # Effective batch = 256 * 4 = 1024
```
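The gradient-accumulation setting above corresponds to a standard training-loop pattern. A minimal sketch, using a toy model and random data (all names here are illustrative, not from the repository):

```python
import torch
import torch.nn as nn

# Gradient accumulation: run several forward/backward passes before each
# optimizer step. With batch_size=8 and 4 accumulation steps here, the
# effective batch is 32; in the config above, 256 * 4 = 1024.
torch.manual_seed(0)
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
gradient_accumulation_steps = 4

batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]

num_updates = 0
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(batches):
    loss = criterion(model(inputs), labels)
    # Scale the loss so accumulated gradients average rather than sum
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        num_updates += 1
```

Scaling the loss by the accumulation factor keeps the gradient magnitude comparable to a single large batch, so the learning rate does not need retuning.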
Action Items:
Create pcam_full.yaml config for the full dataset
Estimated Time: 4-8 hours (implementation)
Hardware Required: NVIDIA GPU with 16GB+ VRAM (RTX 3090, A5000, A100)
Training Time: 4-8 hours for 20 epochs
Target Baselines (from PCam leaderboard):
Implementation: Extend experiments/compare_pcam_baselines.py
```
# Already have comparison runner infrastructure;
# just need to add baseline configs:
#   experiments/configs/pcam_comparison/resnet50.yaml
#   experiments/configs/pcam_comparison/densenet121.yaml
#   experiments/configs/pcam_comparison/efficientnet_b0.yaml
```
Action Items:
Estimated Time: 8-16 hours (implementation + training all baselines)
Computational Cost: ~$50-100 in GPU time (cloud)
Current Limitations:
Required Additions:
```python
# Add to experiments/evaluate_pcam.py
import numpy as np

def compute_bootstrap_ci(predictions, labels, n_bootstrap=1000):
    """Compute 95% confidence intervals via bootstrap."""
    metrics = []
    for _ in range(n_bootstrap):
        indices = np.random.choice(len(labels), len(labels), replace=True)
        boot_preds = predictions[indices]
        boot_labels = labels[indices]
        # compute_metrics: existing metrics helper in evaluate_pcam.py
        metrics.append(compute_metrics(boot_preds, boot_labels))
    return np.percentile(metrics, [2.5, 97.5], axis=0)
```
Action Items:
Estimated Time: 4-6 hours
Output: Statistically rigorous benchmark results
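A self-contained demo of the bootstrap-CI idea, using plain accuracy as the metric (the labels and ~90%-accurate predictions here are simulated, and the inline accuracy computation stands in for the repository's `compute_metrics`):

```python
import numpy as np

# Simulate 1000 labels and predictions that agree ~90% of the time
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
predictions = np.where(rng.random(1000) < 0.9, labels, 1 - labels)

# Resample with replacement and recompute accuracy each time
accuracies = []
for _ in range(1000):
    idx = rng.choice(len(labels), len(labels), replace=True)
    accuracies.append((predictions[idx] == labels[idx]).mean())

# 95% confidence interval from the bootstrap distribution
lo, hi = np.percentile(accuracies, [2.5, 97.5])
```

For a sample of this size the interval comes out roughly ±2 percentage points around the point estimate, which is exactly the kind of uncertainty a leaderboard comparison needs to report.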
Deliverables:
README Updates:
Impact: Removes 2 of 5 ❌ items, partially addresses a third
Why Second:
Current Gap: No raw WSI handling, only pre-extracted features
Required Components:
```python
# scripts/data/extract_camelyon_features.py
from typing import List

import numpy as np
import openslide
import torch
from torchvision import transforms

def extract_patches_from_wsi(
    wsi_path: str,
    patch_size: int = 256,
    magnification: float = 20.0,
    stride: int = 256,
) -> List[np.ndarray]:
    """Extract patches from a WSI at the specified magnification."""
    slide = openslide.OpenSlide(wsi_path)
    # Tissue detection (remove background); detect_tissue and
    # get_patch_coordinates are helpers defined elsewhere in the script
    tissue_mask = detect_tissue(slide)
    # Extract patches from tissue regions
    patches = []
    for x, y in get_patch_coordinates(tissue_mask, patch_size, stride):
        # read_region returns RGBA; convert to RGB before stacking
        patch = slide.read_region((x, y), level=0, size=(patch_size, patch_size))
        patches.append(np.array(patch.convert('RGB')))
    return patches

def extract_features_batch(
    patches: List[np.ndarray],
    model: torch.nn.Module,
    batch_size: int = 32,
) -> np.ndarray:
    """Extract features from patches using a pretrained model."""
    model.eval()
    features = []
    for i in range(0, len(patches), batch_size):
        batch = patches[i:i + batch_size]
        # preprocess: torchvision transform pipeline defined at module level
        batch_tensor = torch.stack([preprocess(p) for p in batch])
        with torch.no_grad():
            batch_features = model(batch_tensor)
        features.append(batch_features.cpu().numpy())
    return np.concatenate(features)
```
Action Items:
Estimated Time: 16-24 hours (implementation + testing)
Dependencies: openslide-python, opencv-python
Computational Cost: ~100-200 GPU-hours for full CAMELYON16
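The extraction sketch above assumes a `get_patch_coordinates` helper. One plausible implementation, sliding a window over a low-resolution tissue mask and keeping windows with enough tissue (the function name, the 32x `mask_downsample`, and the tissue-fraction threshold are all assumptions):

```python
import numpy as np

def get_patch_coordinates(tissue_mask, patch_size, stride,
                          mask_downsample=32, min_tissue_fraction=0.5):
    """Yield level-0 (x, y) coordinates of patches covering tissue.

    tissue_mask is a binary array at 1/mask_downsample of level-0 resolution;
    coordinates are scaled back up to level-0 pixels.
    """
    coords = []
    mp = max(patch_size // mask_downsample, 1)  # patch size in mask pixels
    ms = max(stride // mask_downsample, 1)      # stride in mask pixels
    h, w = tissue_mask.shape
    for my in range(0, h - mp + 1, ms):
        for mx in range(0, w - mp + 1, ms):
            window = tissue_mask[my:my + mp, mx:mx + mp]
            # Keep only windows that are mostly tissue
            if window.mean() >= min_tissue_fraction:
                coords.append((mx * mask_downsample, my * mask_downsample))
    return coords
```

Filtering on the downsampled mask is what keeps the pipeline tractable: coordinates are enumerated on a tiny array, and only tissue-bearing regions are ever read from the multi-gigabyte WSI.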
Dataset Details:
Action Items:
Estimated Time: 1-2 days (mostly download time)
Storage Required: 1TB+ (raw) + 500GB (features)
Cost: Free (registration required)
Current Gap: Stub implementation in src/data/camelyon_annotations.py
Required Implementation:
```python
# src/data/camelyon_annotations.py
import xml.etree.ElementTree as ET
from typing import List, Tuple

import cv2
import numpy as np
from shapely.geometry import Polygon

def parse_asap_xml(xml_path: str) -> List[Polygon]:
    """Parse an ASAP XML annotation file into polygons."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    polygons = []
    for annotation in root.findall('.//Annotation'):
        coords = []
        for coord in annotation.findall('.//Coordinate'):
            x = float(coord.get('X'))
            y = float(coord.get('Y'))
            coords.append((x, y))
        if len(coords) >= 3:
            polygons.append(Polygon(coords))
    return polygons

def create_annotation_mask(
    polygons: List[Polygon],
    slide_dimensions: Tuple[int, int],
    downsample: int = 32,
) -> np.ndarray:
    """Create a binary mask from annotation polygons."""
    width, height = slide_dimensions
    mask = np.zeros((height // downsample, width // downsample), dtype=np.uint8)
    for polygon in polygons:
        # Rasterize each polygon exterior at the downsampled resolution
        pts = (np.array(polygon.exterior.coords) / downsample).astype(np.int32)
        cv2.fillPoly(mask, [pts], 1)
    return mask
```
Action Items:
Estimated Time: 8-12 hours
Dependencies: shapely, opencv-python
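The XML traversal in `parse_asap_xml` can be checked on a minimal hand-written annotation. The snippet below mirrors ASAP's export structure (the coordinate values are made up, and shapely is skipped here so the demo only collects raw coordinate lists):

```python
import xml.etree.ElementTree as ET

# Minimal ASAP-style annotation: one square polygon with four vertices
xml_text = """<ASAP_Annotations>
  <Annotations>
    <Annotation Name="_0" Type="Polygon">
      <Coordinates>
        <Coordinate Order="0" X="0" Y="0"/>
        <Coordinate Order="1" X="100" Y="0"/>
        <Coordinate Order="2" X="100" Y="100"/>
        <Coordinate Order="3" X="0" Y="100"/>
      </Coordinates>
    </Annotation>
  </Annotations>
</ASAP_Annotations>"""

root = ET.fromstring(xml_text)
annotations = []
for annotation in root.findall('.//Annotation'):
    # Same traversal as parse_asap_xml: one coordinate list per Annotation
    coords = [(float(c.get('X')), float(c.get('Y')))
              for c in annotation.findall('.//Coordinate')]
    if len(coords) >= 3:  # need at least a triangle to form a polygon
        annotations.append(coords)
```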
Current: SimpleSlideClassifier (mean/max pooling)
Needed: Attention-based aggregation for competitive results
Target Architectures:
Implementation:
```python
# src/models/attention_mil.py
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMIL(nn.Module):
    """Attention-based Multiple Instance Learning."""

    def __init__(self, feature_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, features):
        # features: [num_patches, feature_dim]
        attention_weights = self.attention(features)  # [num_patches, 1]
        attention_weights = F.softmax(attention_weights, dim=0)
        # Attention-weighted aggregation
        slide_features = (features * attention_weights).sum(dim=0)
        return self.classifier(slide_features)
```
Action Items:
Estimated Time: 16-24 hours
Training Time: 2-4 days for all architectures
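A quick, self-contained sanity check of the attention-pooling step used in the MIL head above: the softmax weights over patches sum to 1, so the pooled vector is a convex combination of patch features and keeps the feature dimension (the sizes here are arbitrary toy values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
feature_dim, hidden_dim = 16, 8

# Same attention scorer shape as in AttentionMIL above
attention = nn.Sequential(
    nn.Linear(feature_dim, hidden_dim),
    nn.Tanh(),
    nn.Linear(hidden_dim, 1),
)

features = torch.randn(50, feature_dim)          # one "slide" = 50 patch features
weights = F.softmax(attention(features), dim=0)  # [50, 1], sums to 1 over patches
pooled = (features * weights).sum(dim=0)         # [feature_dim]
```

Softmaxing over the patch dimension (dim=0) rather than the feature dimension is the key detail: it makes the pooling a weighted average over patches, which is what lets the model localize which patches drive a slide-level prediction.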
Deliverables:
README Updates:
Impact: Removes 3 of 5 ❌ items
Why Last: Requires institutional partnerships, IRB approval, clinical expertise
NOT Achievable in This Repository:
Potentially Achievable:
Action Items:
Estimated Time: 3-6 months (institutional processes)
Cost: Varies (may require funding)
Requirements:
Action Items:
Estimated Time: 6-12 months
Cost: Significant (pathologist time, data management)
Study Designs:
Action Items:
Estimated Time: 12-24 months
Cost: High (pathologist time, statistical analysis)
Deliverables:
README Updates:
Impact: Removes remaining 2 ❌ items, but requires major effort
| Resource | Requirement | Cost Estimate |
|---|---|---|
| Storage | 10GB | Negligible |
| GPU | RTX 3090 or A5000 | $50-100 (cloud) |
| Time | 1-2 weeks | Part-time work |
| Difficulty | Low | Mostly config changes |
| Resource | Requirement | Cost Estimate |
|---|---|---|
| Storage | 1.5TB | $50-100/month (cloud) |
| GPU | A100 or multi-GPU | $500-1000 (cloud) |
| Time | 1-2 months | Full-time work |
| Difficulty | Medium-High | WSI preprocessing complex |
| Resource | Requirement | Cost Estimate |
|---|---|---|
| Partnerships | Pathology dept | Varies |
| IRB | Institutional approval | 3-6 months |
| Pathologists | Expert annotations | $10K-50K |
| Time | 12-24 months | Full-time research |
| Difficulty | Very High | Requires clinical expertise |
Goal: Remove 2 ❌ items with minimal effort
Outcome:
Goal: Demonstrate slide-level capabilities
Outcome:
Goal: Clinical validation (if desired)
Outcome:
Instead of removing all ❌ items, consider repositioning the repository:
From: “This will eventually be clinically validated”
To: “This is a well-tested research framework for computational pathology”
Updated README Section:
## What This Repository Provides
**Research Framework** (not clinical tool):
- ✅ Modular, tested implementations of pathology AI architectures
- ✅ Benchmark pipelines for PCam and CAMELYON datasets
- ✅ Reproducible training and evaluation workflows
- ✅ Comparison tools for systematic baseline evaluation
- ✅ Extensible codebase for research experimentation
**Validated Capabilities**:
- ✅ PCam: 94% accuracy on synthetic subset (framework validation)
- ✅ CAMELYON: Functional slide-level pipeline (architecture validation)
- ✅ 62% test coverage with comprehensive unit tests
- ✅ Cross-platform compatibility and CI/CD
**Research Use Cases**:
- Algorithm development and prototyping
- Baseline implementations for comparison
- Educational resource for computational pathology
- Starting point for research projects
**NOT Provided**:
- ❌ Clinical validation or FDA approval
- ❌ Production deployment infrastructure
- ❌ Real-time inference optimization
- ❌ Clinical decision support features
**To Use in Research**:
1. Download PCam or CAMELYON16 datasets
2. Run training with provided configs
3. Compare against implemented baselines
4. Extend with your own methods
5. Publish results with proper attribution
This framing:
Easiest Path Forward: Priority 1 (Full PCam)
Most Impactful: Priority 2 (CAMELYON16)
Most Ambitious: Priority 3 (Clinical Validation)
Pragmatic Alternative: Honest Repositioning
Recommendation: Start with Priority 1 (Full PCam) to demonstrate the framework works on real data at scale, then decide whether to pursue Priority 2 or reposition as a research framework.