A tested PyTorch framework for computational pathology research with working benchmarks on PatchCamelyon and CAMELYON16
View on GitHub matthewvaishnav/computational-pathology-research
Complete guide to installing and using the Computational Pathology Research Framework.
Minimum:
Recommended:
Have an RTX 4070 Laptop? Check out our RTX 4070 Optimization Guide for hardware-specific batch sizes, training times, and optimization tips!
git clone https://github.com/matthewvaishnav/computational-pathology-research.git
cd computational-pathology-research
Linux/macOS:
python -m venv venv
source venv/bin/activate
Windows:
python -m venv venv
venv\Scripts\activate
# Install core dependencies
pip install -r requirements.txt
# Install package in development mode
pip install -e .
# Run quick test
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import src; print('Installation successful!')"
# Check GPU availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
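The individual checks above can also be gathered into one script. The sketch below uses only the standard library to report which packages can be located for import. The package list is an assumption based on the imports used later in this guide; it is not read from requirements.txt.

```python
import importlib.util

# Assumed package list, taken from the imports used in this guide
# (not read from requirements.txt).
PACKAGES = ["torch", "sklearn", "matplotlib", "seaborn", "src"]


def find_missing(packages):
    """Return the names in `packages` that cannot be located for import."""
    missing = []
    for name in packages:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for some broken or partially initialised modules.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages found.")
```

If src is reported missing, the development-mode install (pip install -e .) has probably not been run in the active environment.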
For testing and development, generate synthetic datasets:
# Generate PCam synthetic data
python scripts/generate_synthetic_pcam.py
# Generate CAMELYON synthetic data
python scripts/generate_synthetic_camelyon.py
This creates:
data/pcam/ - Synthetic patch-level data
data/camelyon/features/ - Synthetic slide-level features
Train a simple model on PCam:
python experiments/train_pcam.py \
    --config experiments/configs/pcam.yaml \
    --epochs 5
Expected output:
Epoch 1/5: Train Loss: 0.6234, Acc: 0.6500
Epoch 1/5: Val Loss: 0.5123, Acc: 0.7500
...
Training complete! Best accuracy: 0.9400
python experiments/evaluate_pcam.py \
    --checkpoint checkpoints/pcam/best_model.pth \
    --data-root data/pcam \
    --output-dir results/pcam
Results saved to results/pcam/:
metrics.json - Evaluation metrics
confusion_matrix.png - Confusion matrix visualization
roc_curve.png - ROC curve
from src.data import PatchCamelyonDataset
from torch.utils.data import DataLoader
# Create dataset
train_dataset = PatchCamelyonDataset(
    root_dir="data/pcam",
    split="train"
)
# Create data loader
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)
from src.models import SimpleClassifier
import torch
# Create model
model = SimpleClassifier(num_classes=2, dropout=0.5)
# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
from torch import nn, optim
# Loss function
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,
    weight_decay=0.0001
)
# Learning rate scheduler
scheduler = optim.lr_scheduler.StepLR(
    optimizer,
    step_size=5,
    gamma=0.1
)
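To see what this scheduler does to the learning rate over a 10-epoch run, the step schedule can be worked out by hand. The sketch below reproduces the StepLR rule (multiply the rate by gamma every step_size epochs) in plain Python; step_lr is a hypothetical helper written for illustration, not part of the framework or of PyTorch.

```python
def step_lr(base_lr, step_size, gamma, epoch):
    """Learning rate under a step schedule: multiply by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Values from the configuration above: lr=0.001, step_size=5, gamma=0.1.
for epoch in range(10):
    print(f"epoch {epoch}: lr = {step_lr(0.001, 5, 0.1, epoch):.6f}")
```

With these settings the first five epochs (0 to 4) train at 0.001 and the next five (5 to 9) at 0.0001.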
from src.training import train_epoch, evaluate
from src.utils import save_checkpoint
best_acc = 0.0
for epoch in range(10):
    # Train
    train_metrics = train_epoch(
        model, train_loader, criterion, optimizer, device, epoch
    )
    # Validate (val_loader is not defined in this guide; it is assumed to be
    # built like train_loader, from the validation split)
    val_metrics = evaluate(
        model, val_loader, criterion, device
    )
    # Update learning rate
    scheduler.step()
    # Save best model
    if val_metrics['accuracy'] > best_acc:
        best_acc = val_metrics['accuracy']
        save_checkpoint(
            model, optimizer, epoch, val_metrics,
            "checkpoints/best_model.pth"
        )
    # Print progress
    print(f"Epoch {epoch+1}/10")
    print(f" Train - Loss: {train_metrics['loss']:.4f}, Acc: {train_metrics['accuracy']:.4f}")
    print(f" Val - Loss: {val_metrics['loss']:.4f}, Acc: {val_metrics['accuracy']:.4f}")
print(f"Training complete! Best accuracy: {best_acc:.4f}")
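The checkpointing rule in this loop keeps only the model from the first epoch that reaches the highest validation accuracy. The sketch below isolates that selection logic in plain Python so it can be checked without training anything; best_epoch is a hypothetical helper and the accuracy values are made up for illustration.

```python
def best_epoch(val_accuracies):
    """Return (epoch_index, accuracy) of the first epoch with the highest accuracy.

    Mirrors the strict `>` comparison in the training loop: a later epoch
    that only ties the best accuracy does not replace the saved checkpoint.
    """
    best_acc = 0.0
    best_idx = None
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc = acc
            best_idx = epoch
    return best_idx, best_acc


# Made-up validation accuracies, for illustration only.
history = [0.75, 0.82, 0.80, 0.82, 0.79]
print(best_epoch(history))
```

Here epoch index 1 is selected: the tie at index 3 does not overwrite the earlier checkpoint.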
from src.utils import load_checkpoint
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc
import seaborn as sns
# Load best model
checkpoint = load_checkpoint("checkpoints/best_model.pth", model)
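# Evaluate (test_loader is not defined in this guide; it is assumed to be
# built like train_loader, from the test split)
test_metrics = evaluate(model, test_loader, criterion, device)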
# Confusion matrix
cm = confusion_matrix(test_metrics['labels'], test_metrics['predictions'])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.savefig('confusion_matrix.png')
# ROC curve
fpr, tpr, _ = roc_curve(test_metrics['labels'], test_metrics['probabilities'][:, 1])
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.savefig('roc_curve.png')
print(f"Test Accuracy: {test_metrics['accuracy']:.4f}")
print(f"Test AUC: {roc_auc:.4f}")
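The confusion matrix also gives the per-class rates that accuracy alone hides. The sketch below derives accuracy, sensitivity and specificity from a 2x2 matrix in the layout scikit-learn returns for labels 0 and 1 (rows are true labels, columns are predictions). binary_metrics is a hypothetical helper and the counts are made up for illustration.

```python
def binary_metrics(cm):
    """Compute accuracy, sensitivity and specificity from a 2x2 confusion matrix.

    Expects the scikit-learn layout: rows are true labels, columns are
    predicted labels, so cm = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Made-up counts, for illustration only.
print(binary_metrics([[90, 10], [20, 80]]))
```

The helper assumes both classes appear in the test set; a class with no samples would cause a division by zero.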
# Download from official source
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_train_x.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_train_y.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_valid_x.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_valid_y.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_test_x.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_test_y.h5.gz
# Extract
gunzip *.gz
# Move to data directory
mkdir -p data/pcam
mv *.h5 data/pcam/
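Before starting a long training run it can help to confirm that all six files arrived and were extracted. The sketch below checks for the file names produced by the download and gunzip steps above; the helper names are hypothetical and the script does not inspect the file contents.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")
KINDS = ("x", "y")  # x = image patches, y = labels


def expected_pcam_files():
    """File names produced by the download and gunzip steps above."""
    return [
        f"camelyonpatch_level_2_split_{split}_{kind}.h5"
        for split in SPLITS
        for kind in KINDS
    ]


def missing_pcam_files(data_root):
    """Return the expected PCam files that are absent from data_root."""
    root = Path(data_root)
    return [name for name in expected_pcam_files() if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_pcam_files("data/pcam")
    print("Missing files:", missing if missing else "none")
```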
python experiments/train_pcam.py \
    --config experiments/configs/pcam.yaml \
    --data-root data/pcam \
    --epochs 50 \
    --batch-size 64
# Download from CAMELYON16 challenge website
# https://camelyon16.grand-challenge.org/
# Extract features using your preferred method
# (e.g., pretrained ResNet50)
# Organize as HDF5 files
# data/camelyon/features/slide_000.h5
# data/camelyon/features/slide_001.h5
# ...
python experiments/train_camelyon.py \
    --config experiments/configs/camelyon.yaml \
    --data-root data/camelyon \
    --epochs 50
Solution: If import src fails with a ModuleNotFoundError, install the package in development mode:
pip install -e .
Solution: If training runs out of GPU memory, reduce the batch size in the config:
data:
  batch_size: 16 # Reduce from 32
Solution: If data loading is the bottleneck, increase the number of workers:
data:
  num_workers: 8 # Increase from 4
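Extra workers tend to help only while there are CPU cores to run them, so a value of 8 may be too high on a smaller machine. The sketch below suggests a worker count from the number of available cores. The cap of 8 comes from the config above; leaving one core free is a rule of thumb assumed here, not a framework recommendation, and suggest_num_workers is a hypothetical helper.

```python
import os


def suggest_num_workers(cap=8, cpu_count=None):
    """Suggest a DataLoader worker count: at most `cap`, leaving one core free."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(0, min(cap, cpu_count - 1))


print(f"Suggested num_workers: {suggest_num_workers()}")
```

A result of 0 is still a valid DataLoader setting: batches are then loaded in the main process.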
Solutions: