BM-Net/H&E morphology increment on PDC

Overview

This tutorial shows how to run a BM-Net-style H&E morphology increment pilot on the aligned breast Xenium RNA + H&E contour workflow. The goal is not simply to add another image model but to ask a more specific question:

Does H&E-derived contour morphology add information beyond Xenium-native DAPI, cell-boundary, and nucleus-boundary morphology?

The workflow extends the existing RNA + contour + H&E tutorial with three extra pieces:

a named pathology backend that can emit BM-Net-style breast pathology features such as bmnet__whole__invasive_prob
a PDC runner that crops aligned H&E patches for each contour and writes contour-level model features
an increment test that compares H&E morphology blocks against Xenium-native morphology blocks

The implementation is intentionally downstream-only. It does not change the default behavior of run_contour_boundary_ecology_pilot.

Warning

The completed PDC run documented below used the deterministic-smoke backend. It validates contour cropping, schema, Slurm execution, artifact writing, and increment-analysis plumbing. It is not biological evidence and should not be interpreted as predictions from a trained BM-Net model.

Biological Question

The breast Xenium export already contains DAPI-derived nucleus segmentation and stain-informed cell segmentation. H&E may still add information because it captures tissue architecture, stromal texture, lumen/duct organization, necrosis-like appearance, tumor-front morphology, and eosin/hematoxylin contrast that are not fully represented by Xenium-native cell and nucleus geometries.

This tutorial therefore treats H&E morphology as a candidate increment, not as a replacement for Xenium-native morphology.

Model Choices

The BM-Net paper describes a breast whole-slide image classifier with four diagnostic classes: normal, benign, in situ carcinoma, and invasive carcinoma (Bioengineering 2022). The availability of public weights could not be confirmed during setup, so the PDC scaffold supports three backend levels:

Backend	Purpose	Output semantics
`deterministic-smoke`	Dependency-light smoke test for PDC and artifact schema	BM-Net-like feature names, no biological meaning
`bmnet-local`	Use when a compatible trained BM-Net checkpoint is available	`bmnet__...` four-class probabilities
`hf-pathology-backbone`	Use a pathology foundation/surrogate model when BM-Net weights are unavailable	`pathology__...` embedding features, not BM-Net probabilities

Two candidate Hugging Face models were identified during setup: 1aurent/vit_small_patch8_224.lunit_dino and wisdomik/QuiltNet-B-32. The runner's default surrogate is the Lunit DINO ViT model because it can be loaded through the current timm/Hugging Face path.

Inputs

The completed smoke run used the Atera WTA FFPE breast cancer Xenium export on PDC:

/cfs/klemming/scratch/h/hutaobo/pyxenium_cci_benchmark_2026-04/data/source_cache/breast/WTA_Preview_FFPE_Breast_Cancer_outs

The expected input files are:

cell_feature_matrix.h5
cells.parquet
WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv
xenium_explorer_annotations.s1_s5.generated.geojson
an aligned H&E OME-TIFF named WTA_Preview_FFPE_Breast_Cancer_he_image.ome.tif
the matching H&E alignment/keypoint files when available
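Before submitting a job, it can help to confirm that the required files are present. The following is a minimal sketch, assuming every file sits directly under the Xenium export root; the `missing_inputs` helper is hypothetical and not part of pyXenium. The optional alignment/keypoint files are left out because their names are not listed here.

```python
from pathlib import Path

# File names are copied from the list above; the helper itself is hypothetical.
REQUIRED_FILES = (
    "cell_feature_matrix.h5",
    "cells.parquet",
    "WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv",
    "xenium_explorer_annotations.s1_s5.generated.geojson",
    "WTA_Preview_FFPE_Breast_Cancer_he_image.ome.tif",
)


def missing_inputs(export_root, required=REQUIRED_FILES):
    """Return the required file names that are absent under export_root."""
    root = Path(export_root)
    # Assumption: files live at the top level of the export directory.
    return [name for name in required if not (root / name).is_file()]
```

An empty return value means all required inputs were found; any returned name points to a file that still needs staging.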

For the first smoke run, a downsampled aligned H&E OME-TIFF was staged on PDC to avoid uploading the full 17.7 GB H&E image. The downsampled image was used only to validate the workflow.

PDC Setup

From the staged pyXenium repository on PDC:

export PDC_ROOT=/cfs/klemming/scratch/h/hutaobo/pyxenium_bmnet_morphology_2026-04
export PDC_XENIUM_ROOT=/cfs/klemming/scratch/h/hutaobo/pyxenium_cci_benchmark_2026-04/data/source_cache/breast/WTA_Preview_FFPE_Breast_Cancer_outs

bash benchmarking/bmnet_pdc/scripts/bootstrap_pdc_env.sh

The bootstrap script creates the optional BM-Net/PDC environment, including the image-model dependencies needed by the bmnet-local and hf-pathology-backbone backends.

Smoke Run

Run a small deterministic smoke job first:

bash benchmarking/bmnet_pdc/scripts/submit_pdc_bmnet_pilot.sh \
  --backend deterministic-smoke \
  --smoke-max-contours 5

The completed smoke run used:

Field	Value
Slurm job	`20143908`
Backend	`deterministic-smoke`
Run directory	`/cfs/klemming/scratch/h/hutaobo/pyxenium_bmnet_morphology_2026-04/runs/bmnet_smoke5_deterministic_smoke_v2`
Contours	5
Status	`COMPLETED`
Elapsed time	`00:05:52`
Max RSS	about `7.7 GB`

The retained contour IDs were S1 S1 #1.1 through S1 S1 #5.1.

Output Artifacts

Each run writes a compact artifact bundle:

bmnet_patch_predictions.csv
bmnet_pdc_run_summary.json
contour_features_with_bmnet.csv
feature_redundancy.csv
he_morphology_features.csv
incremental_prediction.csv
matched_review_table.csv
morphology_increment_summary.json
partial_associations.csv
program_scores.csv
xenium_native_morphology.csv

The smoke run summary contained the expected high-level checks:

{
  "n_contours": 5,
  "contour_key": "s1_s5_contours",
  "n_he_morphology_features": 139,
  "n_xenium_native_morphology_features": 10,
  "has_bmnet_features": true,
  "evaluation_mode": "in_sample_small_n"
}
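These high-level checks could be automated along the following lines. This is a minimal sketch: the keys come from the summary above, while the `check_summary` helper and its rules are illustrative assumptions rather than part of the runner.

```python
import json


def check_summary(summary, expected_contours):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if summary.get("n_contours") != expected_contours:
        problems.append(
            f"expected {expected_contours} contours, got {summary.get('n_contours')}"
        )
    # Both morphology blocks must contain at least one feature.
    for key in ("n_he_morphology_features", "n_xenium_native_morphology_features"):
        if not summary.get(key, 0) > 0:
            problems.append(f"{key} is missing or zero")
    if not summary.get("has_bmnet_features", False):
        problems.append("no BM-Net-style features were written")
    return problems


# The summary text is the smoke-run summary shown above.
summary = json.loads("""
{
  "n_contours": 5,
  "contour_key": "s1_s5_contours",
  "n_he_morphology_features": 139,
  "n_xenium_native_morphology_features": 10,
  "has_bmnet_features": true,
  "evaluation_mode": "in_sample_small_n"
}
""")
print(check_summary(summary, expected_contours=5))  # prints []
```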

bmnet_patch_predictions.csv contains BM-Net-style named features, including:

bmnet__whole__normal_prob
bmnet__whole__benign_prob
bmnet__whole__in_situ_prob
bmnet__whole__invasive_prob
bmnet__outer_rim__invasive_prob
bmnet__outer_minus_inner__invasive_prob
edge_contrast__bmnet__whole__invasive_prob

For a trained BM-Net checkpoint, these columns would represent contour-level breast pathology probabilities and rim/region contrasts. In the smoke backend, they are deterministic H&E color/texture proxies with the same schema.
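To make the naming scheme concrete, the sketch below turns per-region four-class logits into columns of the form `bmnet__<region>__<class>_prob`. It rests on several assumptions that the runner may not share: the probabilities come from a softmax, an `inner` region exists under that name, and the `outer_minus_inner` contrast is a plain difference of outer-rim and inner probabilities. The `named_features` helper is hypothetical.

```python
import math

# Class suffixes follow the column names listed above.
CLASSES = ("normal", "benign", "in_situ", "invasive")


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def named_features(region_logits):
    """Map {"whole": [4 logits], "inner": [...], "outer_rim": [...]} to columns."""
    feats = {}
    for region, logits in region_logits.items():
        for cls, prob in zip(CLASSES, softmax(logits)):
            feats[f"bmnet__{region}__{cls}_prob"] = prob
    # Assumption: the contrast is outer-rim probability minus inner probability.
    if "outer_rim" in region_logits and "inner" in region_logits:
        for cls in CLASSES:
            feats[f"bmnet__outer_minus_inner__{cls}_prob"] = (
                feats[f"bmnet__outer_rim__{cls}_prob"]
                - feats[f"bmnet__inner__{cls}_prob"]
            )
    return feats
```

A positive `bmnet__outer_minus_inner__invasive_prob` would then mean that the rim looks more invasive than the contour interior under this definition.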

Increment Analysis

The increment module writes six analysis artifacts:

Artifact	Meaning
`xenium_native_morphology.csv`	Morphology derived from Xenium-native cell and nucleus boundaries
`he_morphology_features.csv`	H&E pathomics, BM-Net, and named pathology features
`feature_redundancy.csv`	Correlation/redundancy between H&E and Xenium-native feature blocks
`incremental_prediction.csv`	Nested prediction models comparing baseline, Xenium-native, H&E, and combined blocks
`partial_associations.csv`	Feature associations after adjusting for selected covariates
`matched_review_table.csv`	Contour-level review table for manual inspection
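One simple way to think about block redundancy is to ask, for each H&E feature, how strongly it correlates with its closest Xenium-native feature. The sketch below computes that with NumPy. It illustrates the idea only; the statistic actually written to feature_redundancy.csv may differ, and the `max_abs_correlation` helper is hypothetical.

```python
import numpy as np


def max_abs_correlation(he_block, native_block):
    """For each H&E column, the largest |Pearson r| against any native column."""
    he = np.asarray(he_block, dtype=float)
    native = np.asarray(native_block, dtype=float)

    def zscore(block):
        sd = block.std(axis=0)
        sd[sd == 0] = 1.0  # constant columns end up with correlation 0
        return (block - block.mean(axis=0)) / sd

    # With population z-scores, the scaled cross-product is the Pearson matrix.
    corr = zscore(he).T @ zscore(native) / he.shape[0]
    return np.abs(corr).max(axis=1)
```

Values near 1 flag H&E features that are largely recoverable from Xenium-native morphology, while values near 0 mark candidates for genuinely new information.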

For the five-contour smoke run, incremental_prediction.csv is only a schema and plumbing check because the evaluation is explicitly in_sample_small_n. A biological interpretation requires a larger contour set and a trained or validated pathology backend.
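The difference between in-sample and held-out evaluation can be illustrated with a nested comparison on synthetic data. The sketch below scores a Xenium-native block alone and then the combined block using out-of-fold ridge predictions. The data are randomly generated stand-ins, and the `heldout_r2` helper is a hypothetical illustration rather than the model used by the increment module.

```python
import numpy as np


def heldout_r2(X, y, n_folds=5, alpha=1.0, seed=0):
    """Cross-validated R^2 of a ridge model; every prediction is out-of-fold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for test in np.array_split(order, n_folds):
        train = np.setdiff1d(order, test)
        # Standardize with training-fold statistics only, to avoid leakage.
        mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
        sd[sd == 0] = 1.0
        Xtr, Xte = (X[train] - mu) / sd, (X[test] - mu) / sd
        y_mu = y[train].mean()
        gram = Xtr.T @ Xtr + alpha * np.eye(X.shape[1])
        coef = np.linalg.solve(gram, Xtr.T @ (y[train] - y_mu))
        pred[test] = Xte @ coef + y_mu
    return 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()


# Synthetic stand-in data (not from the Xenium run): the outcome depends on
# one native feature and one H&E feature.
rng = np.random.default_rng(1)
native = rng.normal(size=(200, 3))
he = rng.normal(size=(200, 4))
y = native[:, 0] + 2.0 * he[:, 0] + rng.normal(scale=0.5, size=200)

r2_native = heldout_r2(native, y)
r2_combined = heldout_r2(np.hstack([native, he]), y)
print(f"native={r2_native:.2f} combined={r2_combined:.2f} "
      f"increment={r2_combined - r2_native:.2f}")
```

With only five contours, each fold would hold a single contour and the estimate would be too noisy to interpret, which is why the smoke run reports an in-sample mode instead.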

Real Model Runs

When a compatible BM-Net checkpoint is available, run:

bash benchmarking/bmnet_pdc/scripts/submit_pdc_bmnet_pilot.sh \
  --backend bmnet-local \
  --checkpoint /cfs/klemming/scratch/h/hutaobo/models/bmnet/bmnet.pt \
  --include-full

When BM-Net weights are unavailable, run a Hugging Face pathology surrogate:

bash benchmarking/bmnet_pdc/scripts/submit_pdc_bmnet_pilot.sh \
  --backend hf-pathology-backbone \
  --hf-model 1aurent/vit_small_patch8_224.lunit_dino \
  --smoke-max-contours 20

The surrogate backend writes pathology__... features rather than reusing the bmnet__ prefix, so downstream reports can distinguish surrogate embeddings from BM-Net probabilities.
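A downstream report could enforce this distinction with a small guard such as the one sketched below. The backend names come from the backend table; the `report_label` helper, its label strings, and the example column name `pathology__whole__emb_0` are hypothetical.

```python
# Backend names follow the backend table; the label strings are illustrative.
REPORT_LABEL = {
    "deterministic-smoke": "smoke test (no biological meaning)",
    "bmnet-local": "trained BM-Net",
    "hf-pathology-backbone": "surrogate pathology backbone",
}


def report_label(backend, columns):
    """Return a report label, refusing surrogate runs that carry bmnet__ columns."""
    if backend not in REPORT_LABEL:
        raise KeyError(f"unknown backend: {backend}")
    if backend == "hf-pathology-backbone":
        # Substring match also catches derived names such as edge_contrast__bmnet__...
        leaked = [name for name in columns if "bmnet__" in name]
        if leaked:
            raise ValueError(f"surrogate run contains BM-Net-named columns: {leaked}")
    return REPORT_LABEL[backend]


# Hypothetical surrogate column name, used only for illustration.
print(report_label("hf-pathology-backbone", ["pathology__whole__emb_0"]))
```

Note that the label depends on the backend, not on the column prefix alone, because the smoke backend also emits bmnet__ names without biological meaning.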

Python API

The same workflow can be called directly:

from pyXenium.multimodal import run_bmnet_morphology_increment_pilot

result = run_bmnet_morphology_increment_pilot(
    dataset_root="/path/to/WTA_Preview_FFPE_Breast_Cancer_outs",
    output_dir="/path/to/output/bmnet_smoke",
    contour_geojson="/path/to/xenium_explorer_annotations.s1_s5.generated.geojson",
    backend="deterministic-smoke",
    max_contours=5,
    program_library="breast_boundary_bmnet_v1",
)

print(result["summary"]["artifact_files"])

The runner performs four steps:

load the Xenium export, cell table, clusters, contours, and aligned H&E image
crop whole-contour, inner, and outer-rim H&E patches
write named H&E pathology features into the contour feature table
compare H&E morphology against Xenium-native morphology with redundancy, nested prediction, partial association, and shuffle-control outputs
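The cropping step depends on how the whole, inner, and outer-rim regions are defined. A minimal sketch of one plausible definition follows, assuming the contour has already been rasterized to a boolean mask and that the rim width is a fixed number of pixels. The `region_masks` helper and the 4-neighbour morphology are assumptions, not the runner's actual geometry code.

```python
import numpy as np


def _dilate_once(mask):
    """One step of 4-neighbour binary dilation."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out


def dilate(mask, steps):
    for _ in range(steps):
        mask = _dilate_once(mask)
    return mask


def erode(mask, steps):
    # Erosion is the complement of dilating the complement. Pixels beyond the
    # image edge count as inside, so a mask touching the edge is not eroded there.
    return ~dilate(~mask, steps)


def region_masks(contour_mask, rim_px):
    """Split a rasterized contour into whole, inner, and outer-rim regions."""
    whole = np.asarray(contour_mask, dtype=bool)
    return {
        "whole": whole,
        "inner": erode(whole, rim_px),
        "outer_rim": dilate(whole, rim_px) & ~whole,
    }
```

Each mask can then be used to select the H&E pixels passed to the backend for that region.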

Troubleshooting Notes From The Smoke Run

Several practical issues were resolved during the first PDC run:

If no H&E image is detected, confirm that the OME-TIFF and alignment files are staged under the Xenium export root and that include_images=True can discover them.
If GeoJSON contours do not contain polygon_id, the runner falls back to the contour name field.
If adata.obsm["spatial"] is absent, pass or stage cells.parquet so the runner can reconstruct Xenium-native morphology.
Tiny smoke patches can collapse to a one-pixel RGB array; the current image conversion path handles that case.
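The contour-ID fallback described in these notes can be sketched as follows. The property key `name` for the contour name field and the final index-based fallback are assumptions; the `contour_id` helper is hypothetical.

```python
def contour_id(feature, index):
    """Pick an ID for a GeoJSON feature: polygon_id first, then the name field."""
    props = feature.get("properties") or {}
    # Assumption: the contour name is stored under the "name" property.
    for key in ("polygon_id", "name"):
        value = props.get(key)
        if value not in (None, ""):
            return str(value)
    # Assumption: fall back to a positional ID when neither field is present.
    return f"contour_{index}"
```

Converting the value to a string keeps numeric and text IDs comparable when contour tables are joined later.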

Interpretation Rules

Use the output in three tiers:

Smoke-only result: validates software, files, schema, and PDC orchestration.
Surrogate pathology result: useful for exploring whether H&E representation adds signal, but should be labeled as a surrogate.
Trained BM-Net result: can support BM-Net-specific biological interpretation when the checkpoint, training data, inference date, and labels are recorded in morphology_increment_summary.json.
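For the trained-model tier, the provenance fields named above could be attached to the summary before it is written. The sketch below shows one way to do that; the `model_provenance` key, its field names, and the `add_provenance` helper are hypothetical and not a schema defined by the runner.

```python
import json
from datetime import date


def add_provenance(summary, checkpoint, training_data, labels, inference_date=None):
    """Return a copy of the summary with a model-provenance record attached."""
    record = dict(summary)
    record["model_provenance"] = {
        "checkpoint": str(checkpoint),
        "training_data": training_data,
        "inference_date": (inference_date or date.today()).isoformat(),
        "labels": list(labels),
    }
    return record


# Placeholder values for illustration only.
example = add_provenance(
    {"n_contours": 5},
    checkpoint="bmnet.pt",
    training_data="placeholder description of the training cohort",
    labels=["normal", "benign", "in_situ", "invasive"],
)
print(json.dumps(example, indent=2))
```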

For a publishable analysis, the next full run should use at least the full S1/S5 contour set, keep the shuffle control, and report whether H&E/BM-Net blocks add held-out predictive value beyond Xenium-native cell and nucleus morphology.
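A shuffle control for the held-out increment can be sketched as a permutation test: the rows of the H&E block are shuffled across contours, which breaks the link to the outcome while keeping the block's internal structure. The code below runs on synthetic stand-in data; the `heldout_r2` and `shuffle_control` helpers are hypothetical and use ordinary least squares with contiguous folds for brevity.

```python
import numpy as np


def heldout_r2(X, y, n_folds=5):
    """Out-of-fold R^2 of a least-squares model with an intercept."""
    index = np.arange(len(y))
    pred = np.empty(len(y))
    for test in np.array_split(index, n_folds):
        train = np.setdiff1d(index, test)
        design = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(design, y[train], rcond=None)
        pred[test] = np.column_stack([np.ones(len(test)), X[test]]) @ coef
    return 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()


def shuffle_control(native, he, y, n_shuffles=99, seed=0):
    """Observed held-out increment and its permutation p-value."""
    rng = np.random.default_rng(seed)
    base = heldout_r2(native, y)
    observed = heldout_r2(np.hstack([native, he]), y) - base
    # Null distribution: H&E rows permuted across contours.
    null = np.array([
        heldout_r2(np.hstack([native, he[rng.permutation(len(y))]]), y) - base
        for _ in range(n_shuffles)
    ])
    p_value = (1 + (null >= observed).sum()) / (1 + n_shuffles)
    return observed, p_value


# Synthetic stand-in data (not from the Xenium run).
rng = np.random.default_rng(2)
native = rng.normal(size=(120, 3))
he = rng.normal(size=(120, 4))
y = native[:, 0] + 2.0 * he[:, 0] + rng.normal(scale=0.5, size=120)

observed, p_value = shuffle_control(native, he, y)
print(f"held-out increment={observed:.2f}, permutation p={p_value:.3f}")
```

The smallest attainable p-value is 1 / (n_shuffles + 1), so the number of shuffles should be chosen with the intended significance threshold in mind.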