A tested PyTorch framework for computational pathology research with working benchmarks on PatchCamelyon and CAMELYON16
View on GitHub matthewvaishnav/computational-pathology-research
This project was developed using an RTX 4070 Laptop GPU (8GB VRAM). This guide documents the hardware-specific optimizations and configurations used.
Development System:
RTX 4070 Laptop GPU:
Capabilities:
Training Time Results:
The full PCam dataset was downloaded from Zenodo:
# Create download directory
mkdir -p data/pcam_real
cd data/pcam_real
# Download all splits (~7GB total)
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_train_x.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_train_y.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_valid_x.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_valid_y.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_test_x.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_test_y.h5.gz
# Extract (takes ~5 minutes)
gunzip *.gz
cd ../..
Storage: plan for ~15GB (≈7GB of compressed downloads plus the extracted HDF5 files; gunzip deletes each .gz archive after extracting it, so the final footprint is smaller)
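After extraction, a quick h5py check can confirm the files are intact. A minimal sketch, assuming the Zenodo files expose their arrays under the 'x' (images) and 'y' (labels) HDF5 keys:

# Quick integrity check after extraction (assumes the Zenodo HDF5 layout)
import h5py

with h5py.File("data/pcam_real/camelyonpatch_level_2_split_train_x.h5", "r") as f:
    print(f["x"].shape)  # expected: (262144, 96, 96, 3) uint8 patches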
Configuration used for RTX 4070 Laptop:
# experiments/configs/pcam_rtx4070_laptop.yaml
data:
  root_dir: "data/pcam_real"
  num_workers: 0          # Start with 0, increase to 4-8 after testing
  pin_memory: true
  prefetch_factor: 2      # Only used when num_workers > 0
  download: false         # Data already downloaded

model:
  embed_dim: 256          # Embedding dimension for encoder output
  feature_extractor:
    model: "resnet18"
    pretrained: true
    feature_dim: 512      # ResNet18 output dimension
  wsi:
    input_dim: 512        # Must match feature_extractor.feature_dim
    hidden_dim: 512
    num_heads: 8
    num_layers: 2
    pooling: "mean"

task:
  type: "classification"
  classification:
    hidden_dims: [128]
    dropout: 0.3

training:
  batch_size: 64          # Optimized for 8GB VRAM
  num_epochs: 20
  learning_rate: 0.001
  weight_decay: 0.0001
  max_grad_norm: 1.0
  use_amp: true           # Critical for 8GB VRAM - provides 2-3x speedup
  dropout: 0.1
  optimizer:
    name: "adamw"
    betas: [0.9, 0.999]
  scheduler:
    name: "cosine"
    min_lr: 1e-6
  validation:
    interval: 1           # Validate every epoch
    metric: "val_auc"
    maximize: true
  checkpoint:
    checkpoint_dir: "checkpoints/pcam_real"
    save_interval: 5
    save_best: true
  early_stopping:
    enabled: true
    patience: 10
    min_delta: 0.001

device: "cuda"            # Use CUDA for GPU training
cudnn_benchmark: true     # 10-20% speedup
seed: 42

logging:
  log_dir: "logs/pcam_real"
  log_interval: 100
  use_tensorboard: true
  use_wandb: false
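The training script reads this file at startup. As a sketch of what that amounts to (assuming standard PyYAML parsing):

# Minimal sketch of loading the config (assumes PyYAML is installed)
import yaml

with open("experiments/configs/pcam_rtx4070_laptop.yaml") as f:
    config = yaml.safe_load(f)

print(config["training"]["batch_size"])  # 64
print(config["data"]["num_workers"])     # 0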
Training command:
# Activate Python 3.11 environment with CUDA support
# (Python 3.14 doesn't have CUDA wheels yet)
source venv311/bin/activate # Linux/Mac
# or
.\venv311\Scripts\activate # Windows
# Mixed precision enabled for 2-3x speedup
python experiments/train_pcam.py \
--config experiments/configs/pcam_rtx4070_laptop.yaml
# GPU monitoring (separate terminal)
watch -n 1 nvidia-smi
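Peak VRAM can also be checked from inside the training script itself, using PyTorch's built-in allocator counters:

# Log peak GPU memory at the end of a run (built-in PyTorch counter)
import torch

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.2f} GB")  # should stay well under 8 GB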
Dataset Format: The PCamDataset class now supports both file layouts directly:
- Zenodo format: camelyonpatch_level_2_split_train_x.h5, camelyonpatch_level_2_split_train_y.h5, etc.
- Custom format: train/images.h5py, train/labels.h5py, etc.

Actual Results (Verified):
Evaluation on test set:
python experiments/evaluate_pcam.py \
--checkpoint checkpoints/pcam_real/best_model.pth \
--data-root data/pcam_real \
--output-dir results/pcam_real
# View results
cat results/pcam_real/metrics.json
Metrics achieved:
| Model | Batch Size | Peak VRAM | Training Time (20 epochs) |
|---|---|---|---|
| ResNet-18 | 64 | 7GB | 3h |
| ResNet-50 | 48 | 7.5GB | 4h |
| DenseNet-121 | 48 | 7GB | 4.5h |
| EfficientNet-B0 | 56 | 6.5GB | 3.5h |
| ViT-Base | 24 | 7.5GB | 6h |
Script used for baseline comparison:
cat > experiments/run_all_baselines_rtx4070_laptop.sh << 'EOF'
#!/bin/bash
MODELS=("resnet18" "resnet50" "densenet121" "efficientnet_b0" "vit_base_patch16_224")
BATCH_SIZES=(64 48 48 56 24)
for i in "${!MODELS[@]}"; do
MODEL="${MODELS[$i]}"
BS="${BATCH_SIZES[$i]}"
echo "Training $MODEL with batch size $BS..."
python experiments/train_pcam.py \
--model $MODEL \
--batch-size $BS \
--epochs 20 \
--mixed-precision \
--output-dir checkpoints/baselines/$MODEL \
--device cuda
echo "$MODEL complete!"
done
echo "All baselines trained!"
EOF
chmod +x experiments/run_all_baselines_rtx4070_laptop.sh
# Execute overnight
nohup ./experiments/run_all_baselines_rtx4070_laptop.sh > baselines.log 2>&1 &
Total training time: ~21 hours
Implementation used (memory-efficient for 8GB VRAM):
# experiments/train_pcam_cv_rtx4070_laptop.py
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset
from sklearn.model_selection import StratifiedKFold

def train_with_cv(config, n_folds=5):
    # Load full dataset
    full_dataset = PatchCamelyonDataset(config.data_root)

    # Get labels for stratification (iterates the dataset once)
    labels = [full_dataset[i][1] for i in range(len(full_dataset))]

    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    results = []

    for fold, (train_idx, val_idx) in enumerate(skf.split(range(len(full_dataset)), labels)):
        print(f"\n{'='*50}")
        print(f"Fold {fold+1}/{n_folds}")
        print(f"{'='*50}\n")

        # Create fold-specific loaders
        train_subset = Subset(full_dataset, train_idx)
        val_subset = Subset(full_dataset, val_idx)

        train_loader = DataLoader(
            train_subset,
            batch_size=64,      # RTX 4070 Laptop optimized
            shuffle=True,
            num_workers=12,     # Utilize 16-core CPU
            pin_memory=True,
        )
        val_loader = DataLoader(
            val_subset,
            batch_size=96,      # Larger batch for inference
            shuffle=False,
            num_workers=12,
            pin_memory=True,
        )

        # Train fold
        model = create_model(config)
        metrics = train_fold(model, train_loader, val_loader, config)
        results.append(metrics)

        # Clear GPU memory between folds
        del model
        torch.cuda.empty_cache()

    # Aggregate results
    print_cv_results(results)
    return results

def print_cv_results(results):
    accs = [r['accuracy'] for r in results]
    aucs = [r['auc'] for r in results]

    print(f"\n{'='*50}")
    print("Cross-Validation Results")
    print(f"{'='*50}")
    print(f"Accuracy: {np.mean(accs):.4f} ± {np.std(accs):.4f}")
    print(f"AUC: {np.mean(aucs):.4f} ± {np.std(aucs):.4f}")
    print("\nPer-fold results:")
    for i, r in enumerate(results):
        print(f"  Fold {i+1}: Acc={r['accuracy']:.4f}, AUC={r['auc']:.4f}")
Execution:
python experiments/train_pcam_cv_rtx4070_laptop.py \
--config experiments/configs/pcam_rtx4070_laptop.yaml \
--n-folds 5
# Total time: ~15-18 hours
Mixed precision training was used throughout for 2-3x speedup:
# Add to training loop
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for epoch in range(num_epochs):
    for batch in train_loader:
        images, labels = batch
        images = images.cuda()
        labels = labels.cuda()

        optimizer.zero_grad()

        # Mixed precision forward pass
        with autocast():
            outputs = model(images)
            loss = criterion(outputs, labels)

        # Mixed precision backward pass
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
This provided 2-3x speedup with no accuracy loss.
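On recent PyTorch releases the same loop can be written against the device-agnostic torch.amp API, which is replacing torch.cuda.amp; only the imports and an explicit device argument change:

# Equivalent setup on newer PyTorch (torch.amp supersedes torch.cuda.amp)
from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")
# ...inside the loop, the context manager takes the device type:
with autocast("cuda"):
    outputs = model(images)
    loss = criterion(outputs, labels)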
Grad-CAM heatmaps were generated in batches to stay within the 8GB VRAM budget:
# src/utils/gradcam_rtx4070.py
import numpy as np
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
import matplotlib.pyplot as plt

def generate_gradcam_batch(model, images, target_layer, batch_size=32):
    """
    Generate Grad-CAM for a batch of images.
    Batch size of 32 chosen for the RTX 4070 Laptop (8GB VRAM).
    """
    model.eval()
    cam = GradCAM(model=model, target_layers=[target_layer])

    all_cams = []
    # Process in batches to avoid OOM
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size].cuda()
        # Note: no torch.no_grad() here - Grad-CAM needs gradients
        grayscale_cams = cam(input_tensor=batch)  # returns a numpy array
        all_cams.append(grayscale_cams)

        # Clear cache between batches
        torch.cuda.empty_cache()

    return np.concatenate(all_cams)

# Usage example
model = load_model('checkpoints/best_model.pth').cuda()
target_layer = model.layer4[-1]  # Last conv block (ResNet)

# Generate for test set
test_images = load_test_images()
heatmaps = generate_gradcam_batch(model, test_images, target_layer)

# Save visualizations
save_gradcam_grid(test_images, heatmaps, 'results/gradcam_examples.png')
GPU power settings were configured for maximum performance:
# Windows: NVIDIA Control Panel → Manage 3D Settings → Power Management Mode → Prefer Maximum Performance
# Verify power limit
nvidia-smi -q -d POWER
# RTX 4070 Laptop: 140W TGP
Enabled for 10-20% speedup:
import torch
torch.backends.cudnn.benchmark = True # 10-20% speedup
Used for faster CPU-GPU transfer:
train_loader = DataLoader(
    dataset,
    batch_size=64,
    pin_memory=True,   # Faster CPU->GPU transfer
    num_workers=12     # Utilize 16-core CPU (start with 0, increase gradually)
)
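To pick a worker count empirically, a rough timing loop over the DataLoader works; a sketch, where dataset stands in for the PCam training set:

# Rough data-loading benchmark for choosing num_workers
import time
from torch.utils.data import DataLoader

for nw in (0, 4, 8, 12):
    loader = DataLoader(dataset, batch_size=64, num_workers=nw, pin_memory=True)
    start = time.time()
    for _ in loader:
        pass  # iterate only; no model in the loop
    print(f"num_workers={nw}: {time.time() - start:.1f}s to stream the dataset")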
Implemented for multiprocessing compatibility:
# PCamDataset now opens its h5 files lazily in each worker process,
# which avoids pickling errors with multiprocessing
def __getitem__(self, idx):
    # Open h5 files lazily if not already open
    if self._images_h5 is None or self._labels_h5 is None:
        self._open_h5_files()
    # Load data...
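Fleshed out, the lazy-open pattern looks roughly like the sketch below (assuming the Zenodo 'x'/'y' HDF5 keys; the real PCamDataset additionally handles both file formats):

# Sketch of a lazily-opening HDF5 dataset: file handles are created
# per worker process and never pickled across the process boundary
import h5py
from torch.utils.data import Dataset

class LazyH5Dataset(Dataset):
    def __init__(self, images_path, labels_path):
        self.images_path = images_path
        self.labels_path = labels_path
        self._images_h5 = None   # opened lazily in __getitem__
        self._labels_h5 = None
        with h5py.File(images_path, "r") as f:
            self._length = len(f["x"])  # read length once, then close

    def _open_h5_files(self):
        self._images_h5 = h5py.File(self.images_path, "r")
        self._labels_h5 = h5py.File(self.labels_path, "r")

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        if self._images_h5 is None or self._labels_h5 is None:
            self._open_h5_files()
        image = self._images_h5["x"][idx]
        label = self._labels_h5["y"][idx]
        return image, label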
Used for larger effective batch sizes:
accumulation_steps = 2  # Effective batch size = 64 * 2 = 128

for i, (images, labels) in enumerate(train_loader):
    outputs = model(images)
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
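If accumulation is combined with the mixed-precision loop above, the scaler's step and update calls move inside the accumulation check; a minimal sketch:

# Gradient accumulation with AMP: scale every loss, but only
# step/update once per accumulation window
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
accumulation_steps = 2

for i, (images, labels) in enumerate(train_loader):
    images, labels = images.cuda(), labels.cuda()
    with autocast():
        outputs = model(images)
        loss = criterion(outputs, labels) / accumulation_steps
    scaler.scale(loss).backward()
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()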
Monitoring tools used:
# nvitop for enhanced monitoring
pip install nvitop
nvitop
# Standard nvidia-smi
watch -n 1 nvidia-smi
Cache clearing strategy:
# Clear cache between experiments
torch.cuda.empty_cache()
# Delete unused tensors
del model, optimizer
torch.cuda.empty_cache()
Solutions applied when encountering OOM:
Cache clearing implementation:
# Cache clearing in training loop
if batch_idx % 100 == 0:
    torch.cuda.empty_cache()
If you encounter h5py pickling errors with num_workers > 0:
- Set num_workers: 0 to verify training works, then increase it gradually

Performance checks performed:
Profiling code used:
# Profiling for bottleneck detection
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    train_one_epoch(model, train_loader)

print(prof.key_averages().table(sort_by="cuda_time_total"))
Current Status: Week 1 complete, training successfully running
Estimated Total GPU Time: ~120-150 hours
Estimated Total Calendar Time: 4 weeks
Hardware Cost: $0 (owned hardware)
Power Consumption: ~140W during training (~$0.02/hour at $0.15/kWh)
Estimated Total Electricity Cost: ~$3-4 for the entire project
To replicate this setup:
# 1. Create Python 3.11 environment (for CUDA support)
python3.11 -m venv venv311
source venv311/bin/activate # Linux/Mac
# or
.\venv311\Scripts\activate # Windows
# 2. Install PyTorch with CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# 3. Install project dependencies
pip install -e .
# 4. Download PCam dataset from Zenodo
mkdir -p data/pcam_real && cd data/pcam_real
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_train_x.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_train_y.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_valid_x.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_valid_y.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_test_x.h5.gz
wget https://zenodo.org/record/2546921/files/camelyonpatch_level_2_split_test_y.h5.gz
# Extract files
gunzip *.gz
cd ../..
# 5. Train first model
python experiments/train_pcam.py \
--config experiments/configs/pcam_rtx4070_laptop.yaml
# 6. Monitor GPU (separate terminal)
watch -n 1 nvidia-smi
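Finally, a quick sanity check that the CUDA build is active before launching training:

# Confirm the CUDA build of PyTorch sees the GPU
import torch

print(torch.__version__)
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # expect the RTX 4070 Laptop GPU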