BM-Net/H&E morphology increment on PDC#
Overview#
This tutorial shows how to run a BM-Net-style H&E morphology increment pilot on the aligned breast Xenium RNA + H&E contour workflow. The goal is not simply to add another image model. The goal is to ask a more specific question:
Does H&E-derived contour morphology add information beyond Xenium-native DAPI, cell-boundary, and nucleus-boundary morphology?
The workflow extends the existing RNA + contour + H&E tutorial with three extra pieces:
a named pathology backend that can emit BM-Net-style breast pathology features such as
bmnet__whole__invasive_proba PDC runner that crops aligned H&E patches for each contour and writes contour-level model features
an increment test that compares H&E morphology blocks against Xenium-native morphology blocks
The implementation is intentionally downstream-only. It does not change the
default behavior of run_contour_boundary_ecology_pilot.
Warning
The completed PDC run documented below used the deterministic-smoke backend.
It validates contour cropping, schema, Slurm execution, artifact writing, and
increment-analysis plumbing. It is not biological evidence and should not be
interpreted as trained BM-Net prediction.
Biological Question#
The breast Xenium export already contains DAPI-derived nucleus segmentation and stain-informed cell segmentation. H&E may still add information because it captures tissue architecture, stromal texture, lumen/duct organization, necrosis-like appearance, tumor-front morphology, and eosin/hematoxylin contrast that are not fully represented by Xenium-native cell and nucleus geometries.
This tutorial therefore treats H&E morphology as a candidate increment, not as a replacement for Xenium-native morphology.
Model Choices#
The BM-Net paper describes a breast whole-slide image classifier with four diagnostic classes: normal, benign, in situ carcinoma, and invasive carcinoma (Bioengineering 2022). Public weights were not confirmed during setup, so the PDC scaffold supports three backend levels:
Backend |
Purpose |
Output semantics |
|---|---|---|
|
Dependency-light smoke test for PDC and artifact schema |
BM-Net-like feature names, no biological meaning |
|
Use when a compatible trained BM-Net checkpoint is available |
|
|
Use a pathology foundation/surrogate model when BM-Net weights are unavailable |
|
Two useful Hugging Face candidates discovered during setup were
1aurent/vit_small_patch8_224.lunit_dino
and wisdomik/QuiltNet-B-32.
The default surrogate in the runner is the Lunit DINO ViT model because it can
be loaded through the current timm/Hugging Face path.
Inputs#
The completed smoke run used the Atera WTA FFPE breast cancer Xenium export on PDC:
/cfs/klemming/scratch/h/hutaobo/pyxenium_cci_benchmark_2026-04/data/source_cache/breast/WTA_Preview_FFPE_Breast_Cancer_outs
The expected input files are:
cell_feature_matrix.h5cells.parquetWTA_Preview_FFPE_Breast_Cancer_cell_groups.csvxenium_explorer_annotations.s1_s5.generated.geojsonan aligned H&E OME-TIFF named
WTA_Preview_FFPE_Breast_Cancer_he_image.ome.tifthe matching H&E alignment/keypoint files when available
For the first smoke run, a downsampled aligned H&E OME-TIFF was staged on PDC to avoid uploading the full 17.7 GB H&E image. The downsample was used only to validate the workflow.
PDC Setup#
From the staged pyXenium repository on PDC:
export PDC_ROOT=/cfs/klemming/scratch/h/hutaobo/pyxenium_bmnet_morphology_2026-04
export PDC_XENIUM_ROOT=/cfs/klemming/scratch/h/hutaobo/pyxenium_cci_benchmark_2026-04/data/source_cache/breast/WTA_Preview_FFPE_Breast_Cancer_outs
bash benchmarking/bmnet_pdc/scripts/bootstrap_pdc_env.sh
The bootstrap script creates the optional BM-Net/PDC environment with the image model dependencies used by the real backends.
Smoke Run#
Run a small deterministic smoke job first:
bash benchmarking/bmnet_pdc/scripts/submit_pdc_bmnet_pilot.sh \
--backend deterministic-smoke \
--smoke-max-contours 5
The completed smoke run used:
Field |
Value |
|---|---|
Slurm job |
|
Backend |
|
Run directory |
|
Contours |
5 |
Status |
|
Elapsed time |
|
Max RSS |
about |
The retained contour IDs were S1 S1 #1.1 through S1 S1 #5.1.
Output Artifacts#
Each run writes a compact artifact bundle:
bmnet_patch_predictions.csv
bmnet_pdc_run_summary.json
contour_features_with_bmnet.csv
feature_redundancy.csv
he_morphology_features.csv
incremental_prediction.csv
matched_review_table.csv
morphology_increment_summary.json
partial_associations.csv
program_scores.csv
xenium_native_morphology.csv
The smoke run summary contained the expected high-level checks:
{
"n_contours": 5,
"contour_key": "s1_s5_contours",
"n_he_morphology_features": 139,
"n_xenium_native_morphology_features": 10,
"has_bmnet_features": true,
"evaluation_mode": "in_sample_small_n"
}
bmnet_patch_predictions.csv contains BM-Net-style named features, including:
bmnet__whole__normal_prob
bmnet__whole__benign_prob
bmnet__whole__in_situ_prob
bmnet__whole__invasive_prob
bmnet__outer_rim__invasive_prob
bmnet__outer_minus_inner__invasive_prob
edge_contrast__bmnet__whole__invasive_prob
For a trained BM-Net checkpoint, these columns would represent contour-level breast pathology probabilities and rim/region contrasts. In the smoke backend, they are deterministic H&E color/texture proxies with the same schema.
Increment Analysis#
The increment module writes six analysis artifacts:
Artifact |
Meaning |
|---|---|
|
Morphology derived from Xenium-native cell and nucleus boundaries |
|
H&E pathomics, BM-Net, and named pathology features |
|
Correlation/redundancy between H&E and Xenium-native feature blocks |
|
Nested prediction models comparing baseline, Xenium-native, H&E, and combined blocks |
|
Feature associations after adjusting for selected covariates |
|
Contour-level review table for manual inspection |
For the five-contour smoke run, incremental_prediction.csv is only a schema and
plumbing check because the evaluation is explicitly in_sample_small_n. A
biological interpretation requires a larger contour set and a real trained or
validated pathology backend.
Real Model Runs#
When a compatible BM-Net checkpoint is available, run:
bash benchmarking/bmnet_pdc/scripts/submit_pdc_bmnet_pilot.sh \
--backend bmnet-local \
--checkpoint /cfs/klemming/scratch/h/hutaobo/models/bmnet/bmnet.pt \
--include-full
When BM-Net weights are unavailable, run a Hugging Face pathology surrogate:
bash benchmarking/bmnet_pdc/scripts/submit_pdc_bmnet_pilot.sh \
--backend hf-pathology-backbone \
--hf-model 1aurent/vit_small_patch8_224.lunit_dino \
--smoke-max-contours 20
The surrogate backend writes pathology__... features instead of pretending to
be BM-Net. This distinction is important for downstream reports.
Python API#
The same workflow can be called directly:
from pyXenium.multimodal import run_bmnet_morphology_increment_pilot
result = run_bmnet_morphology_increment_pilot(
dataset_root="/path/to/WTA_Preview_FFPE_Breast_Cancer_outs",
output_dir="/path/to/output/bmnet_smoke",
contour_geojson="/path/to/xenium_explorer_annotations.s1_s5.generated.geojson",
backend="deterministic-smoke",
max_contours=5,
program_library="breast_boundary_bmnet_v1",
)
print(result["summary"]["artifact_files"])
The runner performs four steps:
load the Xenium export, cell table, clusters, contours, and aligned H&E image
crop whole-contour, inner, and outer-rim H&E patches
write named H&E pathology features into the contour feature table
compare H&E morphology against Xenium-native morphology with redundancy, nested prediction, partial association, and shuffle-control outputs
Troubleshooting Notes From The Smoke Run#
Several practical issues were resolved during the first PDC run:
If no H&E image is detected, confirm that the OME-TIFF and alignment files are staged under the Xenium export root and that
include_images=Truecan discover them.If GeoJSON contours do not contain
polygon_id, the runner falls back to the contournamefield.If
adata.obsm["spatial"]is absent, pass or stagecells.parquetso the runner can reconstruct Xenium-native morphology.Tiny smoke patches can collapse to a one-pixel RGB array; the current image conversion path handles that case.
Interpretation Rules#
Use the output in three tiers:
Smoke-only result: validates software, files, schema, and PDC orchestration.
Surrogate pathology result: useful for exploring whether H&E representation adds signal, but should be labeled as a surrogate.
Trained BM-Net result: can support BM-Net-specific biological interpretation when the checkpoint, training data, inference date, and labels are recorded in
morphology_increment_summary.json.
For a publishable analysis, the next full run should use at least the full S1/S5 contour set, keep the shuffle control, and report whether H&E/BM-Net blocks add held-out predictive value beyond Xenium-native cell and nucleus morphology.