Atera contour-GMI workflow
This workflow promotes the S1/S5 Atera breast contour analysis into the
canonical pyXenium.gmi surface. It reuses the contour annotations generated by
the S1-S5 contour tutorial and treats each independent contour polygon as one
GMI sample.
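The "one independent contour polygon, one GMI sample" rule can be illustrated with a minimal sketch. It assumes the contour file is a standard GeoJSON FeatureCollection; the `name` property key, the sample ID scheme, and the function name are hypothetical and are not taken from pyXenium.

```python
import json

def contour_samples(geojson_text, label_key="name"):
    """Return one (sample_id, polygon_coordinates) pair per independent polygon."""
    collection = json.loads(geojson_text)
    samples = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        # label_key is an assumed property name, not a documented pyXenium field.
        label = (feature.get("properties") or {}).get(label_key, "contour")
        if geometry.get("type") == "Polygon":
            polygons = [geometry["coordinates"]]
        elif geometry.get("type") == "MultiPolygon":
            polygons = geometry["coordinates"]  # split into independent polygons
        else:
            continue  # points and lines are not contour samples
        for polygon in polygons:
            samples.append((f"{label}_{len(samples):03d}", polygon))
    return samples

example = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "S1"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [4, 0], [4, 4], [0, 0]]]}},
        {"type": "Feature", "properties": {"name": "S5"},
         "geometry": {"type": "MultiPolygon",
                      "coordinates": [[[[10, 10], [12, 10], [12, 12], [10, 10]]],
                                      [[[20, 20], [22, 20], [22, 22], [20, 20]]]]}},
    ],
})
samples = contour_samples(example)
print([sample_id for sample_id, _ in samples])
```

In this sketch a MultiPolygon feature contributes one sample per member polygon, so the two features above yield three samples.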
Dataset
The workflow is parameterized for the Atera Xenium WTA breast export:
WTA_Preview_FFPE_Breast_Cancer_outs
It expects the standard Xenium export plus the generated S1/S5 contour file:
xenium_explorer_annotations.s1_s5.generated.geojson
Pass --contour-geojson if the file is stored elsewhere.
Primary run
pyxenium gmi run \
--dataset-root /path/to/WTA_Preview_FFPE_Breast_Cancer_outs \
--output-dir pyxenium_gmi_outputs/full_contour_top500_spatial100 \
--rna-feature-count 500 \
--spatial-feature-count 100 \
--min-cells-per-contour 20
The model compares S1 invasive tumor/CAF contours against S5
apocrine-luminal DCIS contours. Endpoint contours that fail QC, such as those
below the --min-cells-per-contour threshold, remain in sample_metadata.tsv and
in the report QC plots, but they are excluded from model fitting.
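The QC behaviour described above (failing contours stay in the metadata but are kept out of the fit) can be sketched as follows. This is an illustration only: the column names, the function name, and the inclusive `>=` comparison are assumptions, not the pyXenium implementation.

```python
def split_by_qc(cell_counts, min_cells=20):
    """Flag every contour; all rows are kept, only passing contours are fit."""
    metadata = [
        # Hypothetical column names; the real sample_metadata.tsv may differ.
        {"contour_id": contour_id, "n_cells": n_cells, "passes_qc": n_cells >= min_cells}
        for contour_id, n_cells in sorted(cell_counts.items())
    ]
    fit_ids = [row["contour_id"] for row in metadata if row["passes_qc"]]
    return metadata, fit_ids

metadata, fit_ids = split_by_qc({"S1_a": 140, "S1_b": 7, "S5_a": 20})
print(len(metadata), fit_ids)
```

Here all three contours remain in `metadata`, while only the two that meet the threshold appear in `fit_ids`.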
PDC validation presets
The v0.4.1 validation ran on PDC Dardel as a serial Slurm dependency chain:
smoke: top 200 RNA features plus 50 spatial features
full: top 500 RNA features plus 100 spatial features
stability: 5-fold stratified spatial CV, 10 bootstrap repeats, label permutation, coordinate shuffle, and spatial-feature shuffle
RNA-only QC20
spatial-only QC20
no-coordinate QC20
top1000 RNA sensitivity
all-nonempty contour sensitivity
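As a rough illustration of the stratified part of the 5-fold cross-validation in the stability preset, the sketch below deals contours into folds within each class so that every fold keeps the S1/S5 balance. It does not model the spatial aspect of the split, and the function name and fold assignment scheme are assumptions rather than the pyXenium procedure.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Map each sample ID to a fold index while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, label in sorted(labels.items()):
        by_class[label].append(sample_id)
    folds = {}
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)
        for position, sample_id in enumerate(ids):
            folds[sample_id] = position % n_folds  # round-robin within the class
    return folds

# Toy labels: ten S1 and ten S5 contours (IDs are made up for the example).
labels = {f"S1_{i}": "S1" for i in range(10)}
labels.update({f"S5_{i}": "S5" for i in range(10)})
folds = stratified_folds(labels)
```

With ten contours per class and five folds, each fold receives two S1 and two S5 contours.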
Generate the reproducibility manifest with:
pyxenium gmi pdc-plan \
--pdc-xenium-root /cfs/klemming/scratch/h/hutaobo/pyxenium_cci_benchmark_2026-04/data/source_cache/breast/WTA_Preview_FFPE_Breast_Cancer_outs \
--pdc-root /cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_contour_2026-04 \
--output-json /cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_contour_2026-04/logs/pdc_gmi_plan.json
Submit and monitor with the PDC scaffold:
bash benchmarking/gmi_pdc/scripts/bootstrap_pdc_env.sh
bash benchmarking/gmi_pdc/scripts/prepare_pdc_inputs.sh
bash benchmarking/gmi_pdc/scripts/submit_pdc_chain.sh
bash benchmarking/gmi_pdc/scripts/monitor_pdc_gmi.sh
The workflow writes GMI outputs only under the configured GMI scratch root. It does not write into the Xenium source cache except for the optional generated S1/S5 GeoJSON when that file is missing.
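For readers unfamiliar with serial Slurm dependency chains, the sketch below generates the kind of shell lines such a chain typically uses: each stage is submitted with `sbatch --dependency=afterok:<previous job>`. It is not a reconstruction of submit_pdc_chain.sh; the stage identifiers, the script names, the stage order, and the choice of `afterok` are assumptions made for the example.

```python
# Hypothetical identifiers for the eight presets listed in this section.
STAGES = [
    "smoke", "full", "stability", "rna_only",
    "spatial_only", "no_coordinate", "top1000", "all_nonempty",
]

def chain_lines(stages, script_template="run_{stage}.sbatch"):
    """Return shell lines that submit each stage after the previous one succeeds."""
    lines = []
    previous = None
    for stage in stages:
        dependency = f" --dependency=afterok:$JOB_{previous}" if previous else ""
        script = script_template.format(stage=stage)  # hypothetical script name
        # --parsable makes sbatch print the job ID in a form that can be captured.
        lines.append(f"JOB_{stage}=$(sbatch --parsable{dependency} {script})")
        previous = stage
    return lines

for line in chain_lines(STAGES):
    print(line)
```

Because every job waits on its predecessor, a failed stage leaves the later stages unsatisfied instead of running them on incomplete inputs.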
Artifacts
Each successful stage writes:
design_matrix.tsv.gz
sample_metadata.tsv
feature_metadata.tsv
gmi_fit.rds
main_effects.tsv
interaction_effects.tsv
groups.tsv
cv_metrics.tsv
stability.tsv
heterogeneity.tsv
summary.json
report.md
contour figures under figures/
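A quick way to confirm that a stage directory is complete is to check it against the artifact names listed above. The helper below is a minimal sketch; the function name is hypothetical and it only tests for presence, not content.

```python
from pathlib import Path

# File names as listed in this section.
EXPECTED_FILES = [
    "design_matrix.tsv.gz", "sample_metadata.tsv", "feature_metadata.tsv",
    "gmi_fit.rds", "main_effects.tsv", "interaction_effects.tsv",
    "groups.tsv", "cv_metrics.tsv", "stability.tsv", "heterogeneity.tsv",
    "summary.json", "report.md",
]

def missing_artifacts(output_dir):
    """Return the expected artifacts that are absent from output_dir."""
    root = Path(output_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    if not (root / "figures").is_dir():
        missing.append("figures/")
    return missing
```

For example, `missing_artifacts("pyxenium_gmi_outputs/full_contour_top500_spatial100")` returns an empty list when the primary run wrote everything.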
After any stage, derive supervised spatial gene modules with:
pyxenium gmi modules \
--gmi-output-dir pyxenium_gmi_outputs/full_contour_top500_spatial100
For the Atera validation, the key module-level question is whether NIBAN1 and
SORL1 form one S5/DCIS RNA module and whether its contour scores remain
separable from luminal/apocrine DCIS composition, CAF/ECM, vascular/pericyte,
immune, and coordinate-derived spatial features.
The fresh module validation and tutorial use a separate PDC root so the v0.4.1 GMI validation remains intact:
/cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_modules_2026-04-30
That module validation completed all 8 stages under Slurm jobs 20207833
through 20207840. The primary QC20 module is S5/DCIS-high and anchored by
NIBAN1 and SORL1; RNA-only and no-coordinate controls retain the module,
while the spatial-only module is driven by luminal-like amorphous DCIS
composition. The top1000 sensitivity adds EFHD1; the all-nonempty sensitivity
switches to S1 11q13 invasive tumor composition and should be read as a QC
stress test rather than the primary result.
Biological readout
Interpret selected effects only after comparing the full run with RNA-only, spatial-only, no-coordinate, permutation, coordinate-shuffle, and spatial-block shuffle controls.
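To make the role of the label-permutation control concrete, the sketch below computes a rank-based AUC and compares it with AUCs obtained after shuffling the labels. It is a simplification: it shuffles labels against fixed scores, whereas a full permutation control would refit the model on each shuffled label set. The function names and toy data are assumptions and do not come from pyXenium.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def permutation_p_value(scores, labels, n_permutations=1000, seed=0):
    """Return the observed AUC and the share of shuffles that match or beat it."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)

# Toy scores: 1 stands for one endpoint group, 0 for the other.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
observed, p_value = permutation_p_value(scores, labels)
```

A selected effect is more convincing when the observed AUC sits far in the tail of the shuffled distribution; with only a handful of contours even a perfect AUC cannot produce a very small p-value.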
The original contour-GMI PDC validation completed all 8 stages under job chain
20008045-20008052.
The final summary artifacts are:
/cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_contour_2026-04/reports/pdc_gmi_validation_summary.json
/cfs/klemming/scratch/h/hutaobo/pyxenium_gmi_contour_2026-04/reports/pdc_gmi_validation_summary.md
| Stage | Contours | Features | Selected main effects | Train AUC | CV mean AUC |
|---|---|---|---|---|---|
| smoke top200+spatial50 | 80 | 250 | none | 0.50 | - |
| full top500+spatial100 | 80 | 600 | | 1.00 | - |
| stability top500+spatial100 | 80 | 600 | | 1.00 | 1.00 |
| RNA-only QC20 | 80 | 500 | | 1.00 | 1.00 |
| spatial-only QC20 | 80 | 100 | luminal-like amorphous DCIS composition | 0.98 | 0.94 |
| no-coordinate QC20 | 80 | 600 | | 1.00 | 1.00 |
| top1000 QC20 | 80 | 1100 | | 1.00 | 1.00 |
| all-nonempty | 102 | 600 | 11q13 invasive tumor composition | 0.95 | 0.95 |
Primary interpretation: the QC20 S1 invasive tumor/CAF versus S5
apocrine-luminal DCIS contrast is driven mainly by an S5/DCIS RNA expression
program led by NIBAN1 and SORL1. This is supported by RNA-only and
no-coordinate controls. The spatial-only model is predictive but composition
driven, especially luminal-like amorphous DCIS fractions, rather than a direct
coordinate effect.
Sensitivity interpretation: top1000 keeps SORL1 and shows bootstrap support
for NIBAN1, but adds EFHD1, so larger RNA feature budgets support the primary
result while giving a less stable sparse selection. All-nonempty shifts the
model to 11q13 invasive tumor composition features, meaning low-cell contours
can dominate the spatial composition signal. For release-level biology, keep
QC20 as the primary result and use all-nonempty only as a QC stress test.
CAF/ECM remodeling, angiogenesis/pericyte, myeloid-vascular context, Notch, IGF/MAPK, Wnt, and TGF-beta programs were not selected as primary sparse drivers in this PDC validation.