Computational Pathology Research Framework

Logo

A tested PyTorch framework for computational pathology research with working benchmarks on PatchCamelyon and CAMELYON16

View on GitHub matthewvaishnav/computational-pathology-research

Roadmap to Address Remaining ❌ Items

Date: 2026-04-07
Status: Analysis Complete
Goal: Transform from synthetic-only validation to real dataset experiments

Current Status Summary

✅ What Works Today

Framework Capabilities:

Benchmark Evidence:

❌ What’s Missing (From README)

  1. No experiments on full-scale published datasets
  2. No validation of clinical effectiveness
  3. No comparison to published methods
  4. No trained models on real clinical data
  5. No proof these ideas work at scale

Gap Analysis: Synthetic vs Real Data

PCam Dataset

Aspect Current (Synthetic) Required (Real) Gap
Train Size 500 samples 262,144 samples 524x scale-up
Test Size 100 samples 32,768 samples 327x scale-up
Data Format H5 (synthetic) H5 (real PCam) Format compatible ✅
Download None (generated) ~7GB download Need download script
Training Time 40 seconds (CPU) ~4-8 hours (GPU) Need GPU resources
Memory <4GB RAM 16GB+ RAM Need optimization

CAMELYON16 Dataset

Aspect Current (Synthetic) Required (Real) Gap
Slides 30 slides 400 slides 13x scale-up
Patches/Slide 100 patches 10,000+ patches 100x scale-up
Data Format H5 features Raw .tif WSI Need WSI preprocessing
Download None (generated) ~1TB raw WSI Need download + storage
Feature Extraction Synthetic ResNet-50 on patches Need extraction pipeline
Training Time 7 seconds (CPU) Days (GPU) Need distributed training

Priority 1: Full-Scale PCam Experiments

Why Start Here:

Required Steps

1.1 Download Real PCam Dataset

Implementation: Extend src/data/pcam_dataset.py

# Already has download capability via TFDS or direct GitHub
# Just need to enable full download (currently generates synthetic)

# Option A: Use TensorFlow Datasets (recommended)
dataset = tfds.load('pcam', split='train', download=True)

# Option B: Direct download from GitHub
# URLs already defined in PCamDataset.PCAM_URLS

Action Items:

Estimated Time: 2-4 hours (implementation + testing)
Storage Required: ~7GB
Download Time: 30-60 minutes (depends on connection)

1.2 Scale Up Training Infrastructure

Current Bottlenecks:

Required Changes:

# experiments/configs/pcam_full.yaml
training:
  num_epochs: 20
  batch_size: 256  # Increase from 128
  learning_rate: 1e-3
  use_amp: true  # Already supported
  num_workers: 4  # Parallel data loading
  device: cuda  # GPU required

# Add gradient accumulation for larger effective batch size
gradient_accumulation_steps: 4  # Effective batch = 256 * 4 = 1024

Action Items:

Estimated Time: 4-8 hours (implementation)
Hardware Required: NVIDIA GPU with 16GB+ VRAM (RTX 3090, A5000, A100)
Training Time: 4-8 hours for 20 epochs

1.3 Implement Published Baselines

Target Baselines (from PCam leaderboard):

Implementation: Extend experiments/compare_pcam_baselines.py

# Already have comparison runner infrastructure
# Just need to add baseline configs

# experiments/configs/pcam_comparison/resnet50.yaml
# experiments/configs/pcam_comparison/densenet121.yaml
# experiments/configs/pcam_comparison/efficientnet_b0.yaml

Action Items:

Estimated Time: 8-16 hours (implementation + training all baselines)
Computational Cost: ~$50-100 in GPU time (cloud)

1.4 Statistical Validation

Current Limitations:

Required Additions:

# Add to experiments/evaluate_pcam.py

def compute_bootstrap_ci(predictions, labels, n_bootstrap=1000):
    """Compute 95% confidence intervals via bootstrap."""
    metrics = []
    for _ in range(n_bootstrap):
        indices = np.random.choice(len(labels), len(labels), replace=True)
        boot_preds = predictions[indices]
        boot_labels = labels[indices]
        metrics.append(compute_metrics(boot_preds, boot_labels))
    return np.percentile(metrics, [2.5, 97.5], axis=0)

Action Items:

Estimated Time: 4-6 hours
Output: Statistically rigorous benchmark results

Expected Outcomes (Priority 1)

Deliverables:

  1. Full PCam training on 262K samples
  2. Evaluation on 32K test samples
  3. Comparison to 3-5 published baselines
  4. Statistical validation with confidence intervals
  5. Updated PCAM_BENCHMARK_RESULTS.md with real results

README Updates:

Impact: Removes 2 of 5 ❌ items, partially addresses a third

Priority 2: Real CAMELYON16 Integration

Why Second:

Required Steps

2.1 WSI Preprocessing Pipeline

Current Gap: No raw WSI handling, only pre-extracted features

Required Components:

# scripts/data/extract_camelyon_features.py

import openslide
from torchvision import transforms

def extract_patches_from_wsi(
    wsi_path: str,
    patch_size: int = 256,
    magnification: float = 20.0,
    stride: int = 256,
) -> List[np.ndarray]:
    """Extract patches from WSI at specified magnification."""
    slide = openslide.OpenSlide(wsi_path)
    
    # Tissue detection (remove background)
    tissue_mask = detect_tissue(slide)
    
    # Extract patches from tissue regions
    patches = []
    for x, y in get_patch_coordinates(tissue_mask, patch_size, stride):
        patch = slide.read_region((x, y), level=0, size=(patch_size, patch_size))
        patches.append(np.array(patch))
    
    return patches

def extract_features_batch(
    patches: List[np.ndarray],
    model: torch.nn.Module,
    batch_size: int = 32,
) -> np.ndarray:
    """Extract features from patches using pretrained model."""
    features = []
    for i in range(0, len(patches), batch_size):
        batch = patches[i:i+batch_size]
        batch_tensor = torch.stack([preprocess(p) for p in batch])
        with torch.no_grad():
            batch_features = model(batch_tensor)
        features.append(batch_features.cpu().numpy())
    return np.concatenate(features)

Action Items:

Estimated Time: 16-24 hours (implementation + testing)
Dependencies: openslide-python, opencv-python
Computational Cost: ~100-200 GPU-hours for full CAMELYON16

2.2 Download CAMELYON16 Dataset

Dataset Details:

Action Items:

Estimated Time: 1-2 days (mostly download time)
Storage Required: 1TB+ (raw) + 500GB (features)
Cost: Free (registration required)

2.3 Annotation Processing

Current Gap: Stub implementation in src/data/camelyon_annotations.py

Required Implementation:

# src/data/camelyon_annotations.py

import xml.etree.ElementTree as ET
from shapely.geometry import Polygon

def parse_asap_xml(xml_path: str) -> List[Polygon]:
    """Parse ASAP XML annotation file to polygons."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    
    polygons = []
    for annotation in root.findall('.//Annotation'):
        coords = []
        for coord in annotation.findall('.//Coordinate'):
            x = float(coord.get('X'))
            y = float(coord.get('Y'))
            coords.append((x, y))
        if len(coords) >= 3:
            polygons.append(Polygon(coords))
    
    return polygons

def create_annotation_mask(
    polygons: List[Polygon],
    slide_dimensions: Tuple[int, int],
    downsample: int = 32,
) -> np.ndarray:
    """Create binary mask from annotation polygons."""
    # Implementation for rasterizing polygons to mask
    pass

Action Items:

Estimated Time: 8-12 hours
Dependencies: shapely, opencv-python

2.4 Advanced MIL Architectures

Current: SimpleSlideClassifier (mean/max pooling)
Needed: Attention-based aggregation for competitive results

Target Architectures:

Implementation:

# src/models/attention_mil.py

class AttentionMIL(nn.Module):
    """Attention-based Multiple Instance Learning."""
    
    def __init__(self, feature_dim: int, hidden_dim: int):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )
        self.classifier = nn.Linear(feature_dim, num_classes)
    
    def forward(self, features):
        # features: [num_patches, feature_dim]
        attention_weights = self.attention(features)  # [num_patches, 1]
        attention_weights = F.softmax(attention_weights, dim=0)
        
        # Attention-weighted aggregation
        slide_features = (features * attention_weights).sum(dim=0)
        
        return self.classifier(slide_features)

Action Items:

Estimated Time: 16-24 hours
Training Time: 2-4 days for all architectures

Expected Outcomes (Priority 2)

Deliverables:

  1. WSI preprocessing pipeline
  2. Feature extraction for CAMELYON16
  3. Training on real CAMELYON16 data
  4. Comparison of MIL architectures
  5. Evaluation on official test set
  6. Updated CAMELYON_TRAINING_STATUS.md

README Updates:

Impact: Removes 3 of 5 ❌ items

Priority 3: Clinical Validation (Long-Term)

Why Last: Requires institutional partnerships, IRB approval, clinical expertise

What “Clinical Validation” Actually Means

NOT Achievable in This Repository:

Potentially Achievable:

Required Steps (If Pursuing)

3.1 Institutional Partnerships

Action Items:

Estimated Time: 3-6 months (institutional processes)
Cost: Varies (may require funding)

3.2 Clinical Dataset Curation

Requirements:

Action Items:

Estimated Time: 6-12 months
Cost: Significant (pathologist time, data management)

3.3 Clinical Validation Studies

Study Designs:

Action Items:

Estimated Time: 12-24 months
Cost: High (pathologist time, statistical analysis)

Expected Outcomes (Priority 3)

Deliverables:

  1. Clinical validation study results
  2. Comparison with pathologist performance
  3. Multi-center validation
  4. Peer-reviewed publication
  5. Clinical deployment considerations

README Updates:

Impact: Removes remaining 2 ❌ items, but requires major effort

Computational Requirements Summary

Priority 1 (Full PCam)

Resource Requirement Cost Estimate
Storage 10GB Negligible
GPU RTX 3090 or A5000 $50-100 (cloud)
Time 1-2 weeks Part-time work
Difficulty Low Mostly config changes

Priority 2 (CAMELYON16)

Resource Requirement Cost Estimate
Storage 1.5TB $50-100/month (cloud)
GPU A100 or multi-GPU $500-1000 (cloud)
Time 1-2 months Full-time work
Difficulty Medium-High WSI preprocessing complex

Priority 3 (Clinical Validation)

Resource Requirement Cost Estimate
Partnerships Pathology dept Varies
IRB Institutional approval 3-6 months
Pathologists Expert annotations $10K-50K
Time 12-24 months Full-time research
Difficulty Very High Requires clinical expertise

Phase 1: Quick Wins (2-4 weeks)

Goal: Remove 2 ❌ items with minimal effort

  1. Download full PCam dataset (1 day)
  2. Train on full PCam with existing architecture (1 week)
  3. Implement 2-3 baseline comparisons (1 week)
  4. Statistical validation and documentation (3 days)

Outcome:

Phase 2: Scale Up (2-3 months)

Goal: Demonstrate slide-level capabilities

  1. Implement WSI preprocessing (2 weeks)
  2. Download CAMELYON16 (1 week)
  3. Extract features for all slides (1 week)
  4. Implement attention MIL architectures (2 weeks)
  5. Train and evaluate on CAMELYON16 (2 weeks)
  6. Comparison study and documentation (1 week)

Outcome:

Phase 3: Clinical Translation (12+ months)

Goal: Clinical validation (if desired)

  1. Establish institutional partnerships (3-6 months)
  2. IRB approval and data curation (6-12 months)
  3. Clinical validation studies (6-12 months)
  4. Publication and dissemination (3-6 months)

Outcome:

Alternative: Honest Repositioning

Instead of removing all ❌ items, consider repositioning the repository:

Current Framing (Aspirational)

“This will eventually be clinically validated”

Alternative Framing (Honest)

“This is a well-tested research framework for computational pathology”

Updated README Section:

## What This Repository Provides

**Research Framework** (not clinical tool):
- ✅ Modular, tested implementations of pathology AI architectures
- ✅ Benchmark pipelines for PCam and CAMELYON datasets
- ✅ Reproducible training and evaluation workflows
- ✅ Comparison tools for systematic baseline evaluation
- ✅ Extensible codebase for research experimentation

**Validated Capabilities**:
- ✅ PCam: 94% accuracy on synthetic subset (framework validation)
- ✅ CAMELYON: Functional slide-level pipeline (architecture validation)
- ✅ 62% test coverage with comprehensive unit tests
- ✅ Cross-platform compatibility and CI/CD

**Research Use Cases**:
- Algorithm development and prototyping
- Baseline implementations for comparison
- Educational resource for computational pathology
- Starting point for research projects

**NOT Provided**:
- ❌ Clinical validation or FDA approval
- ❌ Production deployment infrastructure
- ❌ Real-time inference optimization
- ❌ Clinical decision support features

**To Use in Research**:
1. Download PCam or CAMELYON16 datasets
2. Run training with provided configs
3. Compare against implemented baselines
4. Extend with your own methods
5. Publish results with proper attribution

This framing:

Conclusion

Easiest Path Forward: Priority 1 (Full PCam)

Most Impactful: Priority 2 (CAMELYON16)

Most Ambitious: Priority 3 (Clinical Validation)

Pragmatic Alternative: Honest Repositioning

Recommendation: Start with Priority 1 (Full PCam) to demonstrate the framework works on real data at scale, then decide whether to pursue Priority 2 or reposition as a research framework.