Atera contour-GMI workflow

This workflow promotes the S1/S5 Atera breast contour analysis into the canonical pyXenium.gmi surface. It reuses the contour annotations generated by the S1-S5 contour tutorial and treats each independent contour polygon as one GMI sample.

Dataset

The workflow is parameterized for the Atera Xenium WTA breast export:

WTA_Preview_FFPE_Breast_Cancer_outs

It expects the standard Xenium export plus the generated S1/S5 contour file:

xenium_explorer_annotations.s1_s5.generated.geojson

Pass --contour-geojson if the file is stored elsewhere.
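The rule that each independent contour polygon becomes one GMI sample can be sketched as below. This is a minimal illustration, not pyXenium's actual loader: the helper name `iter_contour_samples`, the `properties["name"]` label field, and the sample-id scheme are all assumptions made for the example.

```python
from typing import Dict, Iterator, List, Tuple


def iter_contour_samples(geojson: Dict) -> Iterator[Tuple[str, List]]:
    """Yield (sample_id, polygon_rings) for every independent polygon.

    Assumptions (not confirmed by pyXenium): the file is a GeoJSON
    FeatureCollection and each feature keeps its label in properties["name"].
    """
    for i, feature in enumerate(geojson.get("features", [])):
        label = feature.get("properties", {}).get("name", f"contour{i}")
        geom = feature["geometry"]
        if geom["type"] == "Polygon":
            parts = [geom["coordinates"]]
        elif geom["type"] == "MultiPolygon":
            parts = geom["coordinates"]  # one entry per independent polygon
        else:
            continue  # points and lines cannot serve as contour samples
        for j, rings in enumerate(parts):
            # Hypothetical id scheme: label, feature index, polygon index.
            yield f"{label}_{i}_{j}", rings


# Toy input: one S1 polygon and one S5 multipolygon with two parts.
example = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "S1"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]],
            },
        },
        {
            "type": "Feature",
            "properties": {"name": "S5"},
            "geometry": {
                "type": "MultiPolygon",
                "coordinates": [
                    [[[2, 2], [3, 2], [3, 3], [2, 2]]],
                    [[[5, 5], [6, 5], [6, 6], [5, 5]]],
                ],
            },
        },
    ],
}

sample_ids = [sample_id for sample_id, _ in iter_contour_samples(example)]
```

For a real file, the dictionary would come from `json.load` on the contour GeoJSON. Splitting multipolygons keeps spatially separate regions from being pooled into a single sample.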

Primary run

pyxenium gmi run \
  --dataset-root /path/to/WTA_Preview_FFPE_Breast_Cancer_outs \
  --output-dir pyxenium_gmi_outputs/full_contour_top500_spatial100 \
  --rna-feature-count 500 \
  --spatial-feature-count 100 \
  --min-cells-per-contour 20

The model compares S1 invasive tumor/CAF contours against S5 apocrine-luminal DCIS contours. Endpoint contours that fail QC remain in sample_metadata.tsv and in the report's QC plots, but they do not enter model fitting.
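The QC behaviour described here, where failing contours are flagged rather than dropped, might look like the following sketch. The field names (`contour_id`, `endpoint`, `n_cells`, `passes_qc`) are hypothetical, and treating the threshold as inclusive is an assumption about how `--min-cells-per-contour` is applied.

```python
from typing import Dict, List

MIN_CELLS_PER_CONTOUR = 20  # mirrors --min-cells-per-contour 20


def flag_qc(samples: List[Dict], min_cells: int = MIN_CELLS_PER_CONTOUR) -> List[Dict]:
    """Annotate each contour with a QC flag instead of removing it.

    Assumption: a contour with exactly `min_cells` cells passes.
    """
    return [{**s, "passes_qc": s["n_cells"] >= min_cells} for s in samples]


# Toy contours with hypothetical ids and cell counts.
samples = [
    {"contour_id": "S1_a", "endpoint": "S1", "n_cells": 412},
    {"contour_id": "S1_b", "endpoint": "S1", "n_cells": 7},
    {"contour_id": "S5_a", "endpoint": "S5", "n_cells": 20},
]

metadata = flag_qc(samples)  # every contour is kept for reporting
model_rows = [s for s in metadata if s["passes_qc"]]  # only these are fitted
```

Keeping the flagged rows in the metadata makes it possible to audit which contours were excluded and why.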

PDC validation presets

The v0.4.1 validation ran on PDC Dardel as a serial Slurm dependency chain:

  • smoke: top 200 RNA features plus 50 spatial features

  • full: top 500 RNA features plus 100 spatial features

  • stability: 5-fold stratified spatial CV, 10 bootstrap repeats, label permutation, coordinate shuffle, and spatial-feature shuffle

  • RNA-only QC20

  • spatial-only QC20

  • no-coordinate QC20

  • top1000 RNA sensitivity

  • all-nonempty contour sensitivity
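As a rough illustration of how the feature budgets in these presets add up, the sketch below models a preset as an RNA count plus a spatial count. The `Preset` type and the `PRESETS` mapping are hypothetical and are not part of pyXenium. Only the smoke and full budgets are stated in the list above; the remaining stages are left out rather than guessed.

```python
from typing import Dict, NamedTuple


class Preset(NamedTuple):
    """Hypothetical container for one stage's feature budget."""

    rna_features: int
    spatial_features: int

    @property
    def total_features(self) -> int:
        # Total columns offered to the model for this stage.
        return self.rna_features + self.spatial_features


# Budgets taken from the preset list; the stage keys are illustrative names.
PRESETS: Dict[str, Preset] = {
    "smoke": Preset(rna_features=200, spatial_features=50),
    "full": Preset(rna_features=500, spatial_features=100),
}

totals = {name: preset.total_features for name, preset in PRESETS.items()}
```

The totals of 250 and 600 match the Features column reported for the smoke and full stages in the validation results.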

Generate the reproducibility manifest with:

pyxenium gmi pdc-plan \
  --pdc-xenium-root /cfs/klemming/scratch/h/hutaobo/pyxenium_cci_benchmark_2026-04/data/source_cache/breast/WTA_Preview_FFPE_Breast_Cancer_outs \
  --pdc-root /cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_contour_2026-04 \
  --output-json /cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_contour_2026-04/logs/pdc_gmi_plan.json

Submit and monitor with the PDC scaffold:

bash benchmarking/gmi_pdc/scripts/bootstrap_pdc_env.sh
bash benchmarking/gmi_pdc/scripts/prepare_pdc_inputs.sh
bash benchmarking/gmi_pdc/scripts/submit_pdc_chain.sh
bash benchmarking/gmi_pdc/scripts/monitor_pdc_gmi.sh
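A serial Slurm dependency chain of the kind described above can be sketched in a few lines. This is not the content of submit_pdc_chain.sh, which is not shown here; the function names and script list are hypothetical. The sketch relies only on standard `sbatch` options: `--parsable` to print the job id and `--dependency=afterok:<id>` to start a job only after the previous one succeeds.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def sbatch(script: str, after: Optional[str] = None) -> str:
    """Submit one job and return its id.

    With --parsable, sbatch prints the job id, optionally followed by
    ';cluster', so everything after the first ';' is discarded.
    """
    cmd = ["sbatch", "--parsable"]
    if after is not None:
        cmd.append(f"--dependency=afterok:{after}")
    cmd.append(script)
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return out.strip().split(";")[0]


def submit_chain(
    scripts: Sequence[str],
    submit: Callable[[str, Optional[str]], str] = sbatch,
) -> List[str]:
    """Submit scripts so that each one waits for the previous job to succeed."""
    job_ids: List[str] = []
    previous: Optional[str] = None
    for script in scripts:
        previous = submit(script, previous)
        job_ids.append(previous)
    return job_ids
```

Passing the submit function as a parameter lets the chain logic be exercised with a stub on a machine without Slurm. With `afterok`, a failed stage leaves the later jobs unable to start instead of letting them run on incomplete inputs.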

The workflow writes GMI outputs only under the configured GMI scratch root. It does not write into the Xenium source cache except for the optional generated S1/S5 GeoJSON when that file is missing.

Artifacts

Each successful stage writes:

  • design_matrix.tsv.gz

  • sample_metadata.tsv

  • feature_metadata.tsv

  • gmi_fit.rds

  • main_effects.tsv

  • interaction_effects.tsv

  • groups.tsv

  • cv_metrics.tsv

  • stability.tsv

  • heterogeneity.tsv

  • summary.json

  • report.md

  • contour figures under figures/
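A quick completeness check for a stage directory could be written as follows. The helper `missing_artifacts` is a hypothetical convenience, not a pyXenium function; it only tests for the file names listed above and for a figures/ subdirectory.

```python
from pathlib import Path
from typing import List

# File names copied from the artifact list for a successful stage.
EXPECTED_ARTIFACTS = [
    "design_matrix.tsv.gz",
    "sample_metadata.tsv",
    "feature_metadata.tsv",
    "gmi_fit.rds",
    "main_effects.tsv",
    "interaction_effects.tsv",
    "groups.tsv",
    "cv_metrics.tsv",
    "stability.tsv",
    "heterogeneity.tsv",
    "summary.json",
    "report.md",
]


def missing_artifacts(output_dir: str) -> List[str]:
    """Return the expected artifacts that are absent from output_dir."""
    root = Path(output_dir)
    missing = [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
    if not (root / "figures").is_dir():
        missing.append("figures/")
    return missing
```

An empty return value means every listed artifact is present; it says nothing about whether the contents are valid.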

After any stage, derive supervised spatial gene modules with:

pyxenium gmi modules \
  --gmi-output-dir pyxenium_gmi_outputs/full_contour_top500_spatial100

For the Atera validation, the key module-level question is whether NIBAN1 and SORL1 form one S5/DCIS RNA module and whether its contour scores remain separable from luminal/apocrine DCIS composition, CAF/ECM, vascular/pericyte, immune, and coordinate-derived spatial features.

The fresh module validation and tutorial use a separate PDC root so the v0.4.1 GMI validation remains intact:

/cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_modules_2026-04-30

That module validation completed all 8 stages under Slurm jobs 20207833 through 20207840. The primary QC20 module is S5/DCIS-high and anchored by NIBAN1 and SORL1; RNA-only and no-coordinate controls retain the module, while the spatial-only module is driven by luminal-like amorphous DCIS composition. The top1000 sensitivity adds EFHD1; the all-nonempty sensitivity switches to S1 11q13 invasive tumor composition and should be read as a QC stress test rather than the primary result.

Biological readout

Interpret selected effects only after comparing the full run with RNA-only, spatial-only, no-coordinate, permutation, coordinate-shuffle, and spatial-block shuffle controls.
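To make the role of the permutation control concrete, here is a simplified sketch of a label-permutation test on AUC. It is not the procedure pyXenium runs: a full control would refit the model on every shuffled label vector, whereas this example holds the scores fixed to stay short. The function names and toy data are illustrative.

```python
import random
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties contribute 0.5, which makes this the rank-based AUC.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(
    labels: Sequence[int],
    scores: Sequence[float],
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Share of label shuffles whose AUC matches or beats the observed AUC."""
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the label-score link, keeps class sizes
        hits += auc(shuffled, scores) >= observed
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: four contours per endpoint with perfectly separated scores.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
observed_auc = auc(toy_labels, toy_scores)
p_value = permutation_p_value(toy_labels, toy_scores)
```

A selected effect is easier to trust when the observed AUC sits well outside the distribution obtained from shuffled labels.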

The original contour-GMI PDC validation completed all 8 stages under job chain 20008045-20008052. The final summary artifacts are:

/cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_contour_2026-04/reports/pdc_gmi_validation_summary.json
/cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_contour_2026-04/reports/pdc_gmi_validation_summary.md

| Stage | Contours | Features | Selected main effects | Train AUC | CV mean AUC |
|---|---|---|---|---|---|
| smoke top200+spatial50 | 80 | 250 | none | 0.50 | - |
| full top500+spatial100 | 80 | 600 | NIBAN1, SORL1 | 1.00 | - |
| stability top500+spatial100 | 80 | 600 | NIBAN1, SORL1 | 1.00 | 1.00 |
| RNA-only QC20 | 80 | 500 | NIBAN1, SORL1 | 1.00 | 1.00 |
| spatial-only QC20 | 80 | 100 | luminal-like amorphous DCIS composition | 0.98 | 0.94 |
| no-coordinate QC20 | 80 | 600 | NIBAN1, SORL1 | 1.00 | 1.00 |
| top1000 QC20 | 80 | 1100 | EFHD1, SORL1 | 1.00 | 1.00 |
| all-nonempty | 102 | 600 | 11q13 invasive tumor composition | 0.95 | 0.95 |

Primary interpretation: the QC20 contrast between S1 invasive tumor/CAF and S5 apocrine-luminal DCIS is driven mainly by an S5/DCIS RNA expression program led by NIBAN1 and SORL1. The RNA-only and no-coordinate controls support this, since both select the same two genes. The spatial-only model is predictive, but its signal comes from composition features, especially luminal-like amorphous DCIS fractions, rather than from a direct coordinate effect.
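The reasoning behind this interpretation can be expressed as a set comparison across stages. The sketch below uses the selected main effects reported in the validation results; the dictionary keys and the helper `rna_supported` are hypothetical names chosen for the example.

```python
from typing import Dict, Set

# Selected main effects per stage, as reported in the validation results.
SELECTED: Dict[str, Set[str]] = {
    "full": {"NIBAN1", "SORL1"},
    "rna_only": {"NIBAN1", "SORL1"},
    "no_coordinate": {"NIBAN1", "SORL1"},
    "spatial_only": {"luminal-like amorphous DCIS composition"},
}


def rna_supported(selected: Dict[str, Set[str]]) -> Set[str]:
    """Effects from the full run that persist in both RNA-side controls."""
    return selected["full"] & selected["rna_only"] & selected["no_coordinate"]


robust = rna_supported(SELECTED)
# Effects that appear only when RNA features are withheld.
spatial_specific = SELECTED["spatial_only"] - SELECTED["full"]
```

Here `robust` contains NIBAN1 and SORL1, while the spatial-only stage contributes a composition feature that the full model never selects.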

Sensitivity interpretation: top1000 keeps SORL1 and shows bootstrap support for NIBAN1, but adds EFHD1, so results from larger RNA feature budgets are supportive but less stable in which sparse features they select. The all-nonempty run shifts the model to 11q13 invasive tumor composition features, which indicates that low-cell contours can dominate the spatial composition signal. For release-level biology, keep QC20 as the primary result and use all-nonempty only as a QC stress test.

CAF/ECM remodeling, angiogenesis/pericyte, myeloid-vascular context, Notch, IGF/MAPK, Wnt, and TGF-beta programs were not selected as primary sparse drivers in this PDC validation.