pyXenium.gmi tutorial#
Overview#
pyXenium.gmi uses independent contour polygons as samples for sparse GMI
modeling. The first canonical workflow reuses the Atera WTA breast S1/S5
contours: S1 is the invasive tumor/CAF endpoint and S5 is the
apocrine-luminal DCIS endpoint.
The workflow is contour-first. It does not build spatial tiles.
Biological question#
The reference task asks which RNA programs and numeric contour features separate S1 invasive tumor/CAF contours from S5 apocrine-luminal DCIS contours, and whether the same feature space can describe within-label contour heterogeneity.
Setup#
Create the S1/S5 contour GeoJSON with the contour tutorial, then run GMI from the Xenium export root:
pyxenium gmi run \
    --dataset-root /path/to/WTA_Preview_FFPE_Breast_Cancer_outs \
    --output-dir pyxenium_gmi_outputs/atera_s1_s5 \
    --rna-feature-count 500 \
    --spatial-feature-count 100 \
    --spatial-cv-folds 5 \
    --bootstrap-repeats 10 \
    --label-permutation-control \
    --spatial-feature-shuffle-control
If the default generated contour file is not present beside the dataset, pass it explicitly:
pyxenium gmi run \
    --dataset-root /path/to/WTA_Preview_FFPE_Breast_Cancer_outs \
    --contour-geojson /path/to/xenium_explorer_annotations.s1_s5.generated.geojson \
    --output-dir pyxenium_gmi_outputs/atera_s1_s5
Core workflow#
GMI builds one sample per retained S1/S5 contour. RNA counts are aggregated
inside each contour, normalized to contour-level logCPM, and combined with
numeric contour features from build_contour_feature_table(...). Feature
metadata marks every column as rna or spatial.
The default QC keeps contours with at least 20 cells and nonzero library size.
Dropped endpoint contours remain in sample_metadata.tsv with retained and
drop_reason fields so QC can be visualized and audited.
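The aggregation and QC steps above can be sketched in plain Python. This is a minimal illustration, not the pyXenium implementation: the log2 base, the pseudocount of 1, the helper names `contour_logcpm` and `qc_contour`, and the `drop_reason` strings are all assumptions made for the example. Only the 20-cell threshold and the nonzero-library-size rule come from the text.

```python
import math


def contour_logcpm(counts, pseudocount=1.0):
    """Convert one contour's summed gene counts to log2(CPM + pseudocount).

    The log base and pseudocount are assumptions for this sketch.
    """
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("zero library size; QC should drop this contour")
    return {
        gene: math.log2(count / library_size * 1e6 + pseudocount)
        for gene, count in counts.items()
    }


def qc_contour(n_cells, library_size, min_cells=20):
    """Return (retained, drop_reason); the reason strings are hypothetical."""
    if library_size == 0:
        return False, "zero_library_size"
    if n_cells < min_cells:
        return False, "too_few_cells"
    return True, ""


# Toy contours with made-up counts, used only to exercise the helpers.
toy_contours = {
    "contour_a": {"n_cells": 45, "counts": {"GENE_A": 10, "GENE_B": 90}},
    "contour_b": {"n_cells": 12, "counts": {"GENE_A": 3, "GENE_B": 7}},
}

for name, contour in toy_contours.items():
    library_size = sum(contour["counts"].values())
    retained, reason = qc_contour(contour["n_cells"], library_size)
    if retained:
        print(name, contour_logcpm(contour["counts"]))
    else:
        print(name, "dropped:", reason)
```

Dropped contours are reported rather than discarded silently, which mirrors how the run keeps them in sample_metadata.tsv for auditing.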
Outputs#
Each run writes:
- design_matrix.tsv.gz
- sample_metadata.tsv
- feature_metadata.tsv
- gmi_fit.rds
- main_effects.tsv
- interaction_effects.tsv
- groups.tsv
- cv_metrics.tsv
- stability.tsv
- heterogeneity.tsv
- summary.json
- report.md
- figures/ (contour overlays, QC maps, prediction maps, and gene logCPM maps)
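As a quick audit of a finished run, the QC outcome can be tallied from sample_metadata.tsv with the standard library. This is a sketch under assumptions: the `retained` and `drop_reason` column names come from the text above, but the encoding of `retained` as `True`/`False`, the other column names, and the reason strings in the inline example are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_qc(handle):
    """Count retained contours and tally drop reasons from a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    retained = 0
    dropped = Counter()
    for row in reader:
        # Assumption: retained is written as True/False (or 1/0).
        if row["retained"].strip().lower() in {"true", "1"}:
            retained += 1
        else:
            dropped[row["drop_reason"] or "unspecified"] += 1
    return retained, dropped


# Inline stand-in for sample_metadata.tsv so the sketch runs on its own.
example = io.StringIO(
    "contour_id\tlabel\tretained\tdrop_reason\n"
    "c1\tS1\tTrue\t\n"
    "c2\tS5\tFalse\ttoo_few_cells\n"
    "c3\tS1\tFalse\tzero_library_size\n"
)
retained, dropped = summarize_qc(example)
print(retained, dict(dropped))
```

For a real run, pass an open file instead, for example `open("pyxenium_gmi_outputs/atera_s1_s5/sample_metadata.tsv", newline="")`, after checking the actual column values in that file.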
To derive supervised spatial gene modules from a completed run:
pyxenium gmi modules \
    --gmi-output-dir pyxenium_gmi_outputs/atera_s1_s5
This creates a modules/ subdirectory with spatial_modules.tsv,
module_features.tsv, module_scores.tsv.gz, enrichment and spatial
autocorrelation tables, and optional module score maps. The first version is
GMI-native: selected or stable effects seed modules, while correlation,
spatial-lag correlation, and GMI interaction edges expand them.
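The seed-and-expand idea can be illustrated with correlation alone. The sketch below is not the pyXenium module builder: it ignores spatial-lag correlation and GMI interaction edges, and the function names, the 0.8 threshold, and the toy profiles are assumptions chosen for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def expand_module(seed, profiles, min_r=0.8):
    """Grow a module from one seed feature by adding correlated features."""
    members = [seed]
    for feature, values in profiles.items():
        if feature != seed and pearson(profiles[seed], values) >= min_r:
            members.append(feature)
    return members


# Toy per-contour logCPM profiles (made-up values, four contours each).
profiles = {
    "seed_gene": [1.0, 2.0, 3.0, 4.0],
    "co_gene": [1.1, 2.1, 2.9, 4.2],
    "anti_gene": [4.0, 3.0, 2.0, 1.0],
    "flat_gene": [2.0, 2.0, 2.0, 2.0],
}
module = expand_module("seed_gene", profiles)
print(module)
```

Only the positively co-varying feature joins the seed here; the anti-correlated and constant features stay out.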
Controls#
Use RNA-only, spatial-only, no-coordinate, label-permutation, coordinate-shuffle, and spatial-feature-shuffle runs to separate expression programs from spatial layout artifacts. The PDC workflow encodes these presets as reproducible Slurm stages.
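The logic of a label-permutation control can be shown with a small rank-based AUC. This is a simplified sketch: a full control would refit the model on each set of permuted labels, whereas this example only re-scores fixed predictions. The scores, labels, function names, and the choice of 200 permutations are illustrative assumptions.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: probability a positive outscores a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_null(scores, labels, n_perm=200, seed=0):
    """AUC values obtained after shuffling labels, breaking any real signal."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(auc(scores, shuffled))
    return null


# Toy predictions: 1 = one endpoint label, 0 = the other (made-up values).
scores = [0.9, 0.8, 0.85, 0.7, 0.2, 0.1, 0.3, 0.25]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

observed = auc(scores, labels)
null = permutation_null(scores, labels)
# Add-one correction keeps the empirical p-value strictly above zero.
p_value = (1 + sum(a >= observed for a in null)) / (1 + len(null))
print(observed, p_value)
```

If the observed AUC sits well above the permuted distribution, the separation is unlikely to come from label imbalance or chance alone.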
Biological interpretation#
The PDC Dardel validation for v0.4.1 completed all 8 stages. The primary
QC20 model retained 80 of 131 endpoint contours and selected the RNA features
NIBAN1 and SORL1, with train AUC 1.0 and 5-fold stratified spatial CV mean
AUC 1.0 in the stability stage. RNA-only and no-coordinate validations also
selected NIBAN1 and SORL1, supporting the interpretation that the main
S1/S5 separation is an S5/DCIS RNA expression program rather than a direct
centroid or slide-position artifact.
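For readers unfamiliar with stratified cross-validation, the fold assignment can be sketched as follows. The sketch covers only the stratification part, which keeps the S1/S5 ratio similar in every fold. How pyXenium makes the folds spatial is not described here, so no spatial blocking is attempted, and the function name and toy label counts are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample a fold index so every fold keeps the label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal shuffled samples of this label round-robin across the folds.
        for position, index in enumerate(indices):
            folds[index] = position % n_folds
    return folds


# Toy label vector (made-up counts, not the 80 retained contours).
labels = ["S1"] * 10 + ["S5"] * 15
folds = stratified_folds(labels)
for fold in range(5):
    held_out = [labels[i] for i, f in enumerate(folds) if f == fold]
    print(fold, held_out.count("S1"), held_out.count("S5"))
```

With few contours per label, stratification prevents a fold from ending up with a single class, which would make its AUC undefined.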
The spatial-only validation selected luminal-like amorphous DCIS composition
features, not coordinate features. This means spatial context is predictive, but
the sparse spatial signal is mainly endpoint composition rather than an
independent CAF/ECM, vascular/pericyte, immune, Notch, IGF/MAPK, Wnt, or
TGF-beta axis. The top1000 sensitivity run kept SORL1 in the main model and had
bootstrap support for NIBAN1, but it also introduced EFHD1, so results from the
expanded RNA feature space should be read as a sensitivity analysis rather than
a primary finding.
The all-nonempty sensitivity run retained 102 contours, and its selected
features switched to 11q13 invasive tumor cell composition features. This is
useful as a QC warning: low-cell contours can move GMI toward label-composition
structure, so QC20 remains the primary biological result.
Caveats#
GMI is sparse and sample-size sensitive. For contour pseudo-bulk analysis, keep QC20 as the primary model and use all-nonempty contours only as a sensitivity analysis. Zero-cell contours should not enter model fitting.
Next steps#
Archive the PDC scratch artifacts if they need long-term retention. Then run new datasets through the same QC20, RNA-only, spatial-only, no-coordinate, top1000, and all-nonempty template before promoting additional selected features to biological claims.