pyXenium.contour Boundary Ecology Tutorial#
Overview#
This tutorial shows how pyXenium combines contour-specific H&E patches with Xenium spatial transcriptomics to produce contour-level biological hypotheses. The unit of analysis is a contour, not a single cell or a fixed square tile. Each row in the study table links one contour to its H&E patch, geometry, inner and outer rim molecular summaries, matched controls, and boundary-program scores.
The tutorial uses the Atera WTA FFPE breast Xenium sample and the checked artifact bundle in manuscript/atera_contour_boundary_ecology/. The notebook is safe for documentation builds because notebook execution is disabled in Sphinx; set RUN_REGENERATION = True below to rerun the real-data workflow locally.
Biological question#
Can contour-level H&E morphology explain spatial transcriptomic programs at tumor, immune, and stromal boundaries?
The default program library asks six boundary-ecology questions:
immune_exclusion: are immune signals enriched outside a contour but depleted inside?myeloid_vascular_belt: do macrophage-like and endothelial/perivascular programs form a boundary belt?emt_invasive_front: do EMT and stromal matrix programs increase at the outer edge?stromal_encapsulation: does the H&E and molecular context look matrix-rich and wrapped?tls_adjacent_activation: are lymphoid and CXCL13-like programs adjacent to the contour?necrotic_hypoxic_rim: does the contour edge show hypoxia-like or low-cellularity morphology?
Why contour-native multimodal analysis?#
Recent histology-plus-spatial-transcriptomics methods motivate this workflow, but pyXenium intentionally keeps v1 interpretable. STPath motivates joint whole-slide image and spatial transcriptomics modeling; Hist2ST motivates local image patch plus spatial-neighborhood representations; iIMPACT motivates histology-assisted spatial domain discovery; TITAN motivates optional foundation-model embeddings.
This tutorial uses the pathomics-first branch: color, texture, edge density, nuclear-density proxy, shape, rim contrasts, pseudobulk RNA, pathway summaries, cell-cell interaction summaries, and matched controls. A foundation model can later be passed through embedding_backend, but it is not required for this example.
Dataset#
The default local data path is:
Y:\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs
The tutorial artifact run uses a 28-contour subset sampled from xenium_explorer_annotations.generated.geojson, with four contours from each of seven Atera Explorer structures. This keeps the notebook quick while preserving real H&E alignment, real cell positions, and real WTA expression.
from __future__ import annotations
import json
import os
import shutil
import sys
from pathlib import Path
import pandas as pd
from IPython.display import Image, Markdown, display
def find_repo_root() -> Path:
for candidate in (Path.cwd(), *Path.cwd().parents):
if (candidate / "pyproject.toml").exists() and (candidate / "src" / "pyXenium").exists():
return candidate
raise RuntimeError("Could not find the pyXenium repository root.")
REPO_ROOT = find_repo_root()
SRC_ROOT = REPO_ROOT / "src"
if str(SRC_ROOT) not in sys.path:
sys.path.insert(0, str(SRC_ROOT))
ATERA_DATASET_PATH = Path(
os.environ.get(
"PYXENIUM_ATERA_DATASET",
r"Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_FFPE_Breast_Cancer_outs",
)
)
ARTIFACT_DIR = REPO_ROOT / "manuscript" / "atera_contour_boundary_ecology"
CONTOUR_KEY = "atera_boundary_ecology_contours"
SLIDE_PATH = ATERA_DATASET_PATH / "xenium_slide.zarr"
print(f"Dataset: {ATERA_DATASET_PATH}")
print(f"Artifacts: {ARTIFACT_DIR}")
Inspect the stored result package#
The checked artifact package is the source of the biological statements in this tutorial. It was generated from real Atera cells and H&E patches, but it deliberately uses a small contour subset so the tutorial stays reproducible on a workstation.
summary = json.loads((ARTIFACT_DIR / "summary.json").read_text(encoding="utf-8"))
program_scores = pd.read_csv(ARTIFACT_DIR / "program_scores.csv")
contour_features = pd.read_csv(ARTIFACT_DIR / "contour_features.csv")
ecotypes = pd.read_csv(ARTIFACT_DIR / "ecotype_assignments.csv")
matched = pd.read_csv(ARTIFACT_DIR / "matched_exemplars.csv")
deltas = pd.read_csv(ARTIFACT_DIR / "program_feature_deltas.csv")
hypotheses = pd.read_csv(ARTIFACT_DIR / "hypothesis_ranking.csv")
display(Markdown(
f"""
### Artifact summary
- Sample: `{summary['sample_summary']['sample_id']}`
- Contours: `{summary['sample_summary']['n_contours']}`
- Cells: `{summary['sample_summary']['n_cells']}`
- Ecotypes: `{summary['sample_summary']['n_ecotypes']}`
- Bootstrap ARI: `{summary['sample_summary']['bootstrap_mean_ari']:.3f}`
- Features used for ecotype clustering: `{summary['sample_summary']['feature_count']}`
"""
))
pd.Series(summary["top_program_counts"], name="n_contours").rename_axis("top_program").reset_index()
Core workflow#
A ready pyXenium slide store can be loaded directly with read_slide(xenium_slide.zarr) when it already contains contour_images/. The Atera store in this workspace currently contains the aligned H&E image but not contour patches, so the regeneration path below imports the GeoJSON and extracts patches before running the workflow.
The fast tutorial run uses cell-level spatial transcriptomics (include_transcripts=False) so it finishes quickly. For a longer production run, set include_transcripts=True, stream_transcripts=True to add transcript-level signed-distance gradients.
RUN_REGENERATION = False
if RUN_REGENERATION:
from pyXenium.contour import add_contours_from_geojson
from pyXenium.io import read_slide, read_xenium
from pyXenium.multimodal import run_contour_boundary_ecology_pilot
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
for child in ARTIFACT_DIR.iterdir():
if child.name == "atera_tutorial_contour_subset.geojson":
continue
if child.is_dir():
shutil.rmtree(child)
else:
child.unlink()
structures = [
"11q13 Invasive Tumor Cells",
"Basal-like Structured DCIS Cells",
"Macrophages",
"Plasma Cells",
"Endothelial Cells",
"Luminal-like Amorphous DCIS Cells",
"Apocrine Cells",
]
per_structure = 4
source_geojson = ATERA_DATASET_PATH / "xenium_explorer_annotations.generated.geojson"
subset_geojson = ARTIFACT_DIR / "atera_tutorial_contour_subset.geojson"
payload = json.loads(source_geojson.read_text(encoding="utf-8"))
counts = {name: 0 for name in structures}
selected = []
for feature in payload.get("features", []):
structure = feature.get("properties", {}).get("assigned_structure")
if structure not in counts or counts[structure] >= per_structure:
continue
selected.append(feature)
counts[structure] += 1
if all(value >= per_structure for value in counts.values()):
break
subset_geojson.write_text(json.dumps({"type": "FeatureCollection", "features": selected}), encoding="utf-8")
if (SLIDE_PATH / "contour_images").exists():
slide = read_slide(SLIDE_PATH)
else:
slide = read_xenium(
str(ATERA_DATASET_PATH),
as_="slide",
prefer="h5",
include_transcripts=False,
include_boundaries=False,
include_images=True,
cells_parquet="cells.parquet",
clusters_relpath="analysis/analysis/clustering/gene_expression_graphclust/clusters.csv",
)
add_contours_from_geojson(
slide,
subset_geojson,
key=CONTOUR_KEY,
id_key="name",
extract_he_patches=True,
copy=False,
)
result = run_contour_boundary_ecology_pilot(
slide,
contour_key=CONTOUR_KEY,
output_dir=ARTIFACT_DIR,
)
print(result["sample_summary"])
else:
print("Using checked artifacts. Set RUN_REGENERATION = True to rerun the real Atera workflow.")
Visual outputs#
exemplar_montage.png pairs high-scoring contours with matched controls. Matching is important because a raw high score can otherwise reflect contour size, position, or cell density instead of a boundary-specific biological program.
montage = ARTIFACT_DIR / "exemplar_montage.png"
if montage.exists():
display(Image(filename=str(montage)))
else:
print("Montage not found. Rerun the workflow cell to generate it.")
display(hypotheses[["program", "max_score", "n_matched_exemplars", "hypothesis"]])
display(matched[["program", "exemplar_id", "control_id", "delta_score"]].head(12))
display(deltas[["program", "feature_kind", "feature_name", "effect_size", "matched_control_delta", "fdr"]].head(15))
Biological interpretation#
The 28-contour tutorial run should be read as a discovery vignette, not a powered cohort analysis. It still shows what contour-native H&E plus spatial omics can reveal:
The workflow assigns every contour one dominant boundary program. In this run,
immune_exclusionis the most frequent top program, whilestromal_encapsulation,emt_invasive_front,tls_adjacent_activation,necrotic_hypoxic_rim, andmyeloid_vascular_belteach appear in smaller contour subsets.The highest maximum scores are
stromal_encapsulation(0.999),emt_invasive_front(0.948), andtls_adjacent_activation(0.925), suggesting that the strongest tutorial-scale signals are matrix wrapping, invasive-front morphology, and lymphoid-adjacent activation.Matched controls make the hypothesis concrete. For example, the
emt_invasive_fronttable highlights positive matched-control deltas forSPP1-CD44,CXCL12-CXCR4,TGFB1-TGFBR2, andVIM, which are consistent with a stromal or invasive-front interpretation.The small subset is not intended to claim statistical significance. Most FDR values are high because the run contains only 28 contours; the value is in ranking candidates and generating exemplar/control pairs for follow-up review.
Caveats#
The checked run uses cell-level WTA pseudobulk summaries for speed. Transcript-level signed-distance gradients are available when the sample is loaded with streamed transcripts, but that is a longer run on the full Atera transcript store.
The H&E patch is not resampled into Xenium coordinates. pyXenium keeps the original pixels and stores affine metadata, so patch-level interpretation depends on correct alignment.
Program scores are hypothesis generators. They should be validated by pathologist review, marker-specific maps, larger contour sets, and orthogonal assays where possible.
Foundation model embeddings are intentionally off in this tutorial. Pass an
embedding_backendto add them without replacing the interpretable pathomics branch.
Next steps#
Increase
per_structureor use the full generated GeoJSON for a stronger within-sample analysis.Enable streamed transcript gradients for sentinel genes such as
MMP11,COL1A1,COL1A2,CXCL13,SPP1, andKDR.Review exemplar/control pairs in Xenium Explorer or a pathology viewer.
Aggregate
contour_features.csvacross samples to turn this single-sample pilot into a cohort-level contour ecology study.