RNA + contour + H&E#

Overview#

This tutorial shows how pyXenium.multimodal combines contour-specific H&E patches with Xenium RNA to produce contour-level biological hypotheses. The unit of analysis is a contour, not a single cell or a fixed square tile. Each row in the study table links one contour to its H&E patch, geometry, inner and outer rim molecular summaries, matched controls, and boundary-program scores.

The tutorial uses the Atera WTA FFPE breast Xenium sample and the checked artifact bundle in manuscript/atera_contour_boundary_ecology/. The notebook is safe for documentation builds because notebook execution is disabled in Sphinx; set RUN_REGENERATION = True below to rerun the real-data workflow locally.

Biological question#

Can contour-level H&E morphology explain Xenium RNA programs at tumor, immune, and stromal boundaries?

The default program library asks six boundary-ecology questions:

  • immune_exclusion: are immune signals enriched outside a contour but depleted inside?

  • myeloid_vascular_belt: do macrophage-like and endothelial/perivascular programs form a boundary belt?

  • emt_invasive_front: do EMT and stromal matrix programs increase at the outer edge?

  • stromal_encapsulation: does the H&E and molecular context look matrix-rich and wrapped?

  • tls_adjacent_activation: are lymphoid and CXCL13-like programs adjacent to the contour?

  • necrotic_hypoxic_rim: does the contour edge show hypoxia-like or low-cellularity morphology?

Why RNA + contour + H&E?#

This broader multimodal pattern is useful whenever a Xenium sample has aligned H&E imagery and biological regions represented as contours. Recent histology-plus-spatial-transcriptomics methods motivate the workflow, but pyXenium intentionally keeps v1 interpretable. STPath motivates joint whole-slide image and spatial transcriptomics modeling; Hist2ST motivates local image patch plus spatial-neighborhood representations; iIMPACT motivates histology-assisted spatial domain discovery; TITAN motivates optional foundation-model embeddings.

This tutorial uses the pathomics-first branch: color, texture, edge density, nuclear-density proxy, shape, rim contrasts, pseudobulk RNA, pathway summaries, cell-cell interaction summaries, and matched controls. A foundation model can later be passed through embedding_backend, but it is not required for this example.

Dataset#

The default local data path is:

Y:\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs

The tutorial artifact run uses a 28-contour subset sampled from xenium_explorer_annotations.generated.geojson, with four contours from each of seven Atera Explorer structures. This keeps the notebook quick while preserving real H&E alignment, real cell positions, and real WTA expression.

from __future__ import annotations

import json
import os
import shutil
import sys
from pathlib import Path

import pandas as pd
from IPython.display import Image, Markdown, display


def find_repo_root() -> Path:
    for candidate in (Path.cwd(), *Path.cwd().parents):
        if (candidate / "pyproject.toml").exists() and (candidate / "src" / "pyXenium").exists():
            return candidate
    raise RuntimeError("Could not find the pyXenium repository root.")


REPO_ROOT = find_repo_root()
SRC_ROOT = REPO_ROOT / "src"
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

ATERA_DATASET_PATH = Path(
    os.environ.get(
        "PYXENIUM_ATERA_DATASET",
        r"Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_FFPE_Breast_Cancer_outs",
    )
)
ARTIFACT_DIR = REPO_ROOT / "manuscript" / "atera_contour_boundary_ecology"
CONTOUR_KEY = "atera_boundary_ecology_contours"
SLIDE_PATH = ATERA_DATASET_PATH / "xenium_slide.zarr"

print(f"Dataset: {ATERA_DATASET_PATH}")
print(f"Artifacts: {ARTIFACT_DIR}")

Inspect the stored result package#

The checked artifact package is the source of the biological statements in this tutorial. It was generated from real Atera cells and H&E patches, but it deliberately uses a small contour subset so the tutorial stays reproducible on a workstation.

summary = json.loads((ARTIFACT_DIR / "summary.json").read_text(encoding="utf-8"))
program_scores = pd.read_csv(ARTIFACT_DIR / "program_scores.csv")
contour_features = pd.read_csv(ARTIFACT_DIR / "contour_features.csv")
ecotypes = pd.read_csv(ARTIFACT_DIR / "ecotype_assignments.csv")
matched = pd.read_csv(ARTIFACT_DIR / "matched_exemplars.csv")
deltas = pd.read_csv(ARTIFACT_DIR / "program_feature_deltas.csv")
hypotheses = pd.read_csv(ARTIFACT_DIR / "hypothesis_ranking.csv")

display(Markdown(
    f"""
### Artifact summary

- Sample: `{summary['sample_summary']['sample_id']}`
- Contours: `{summary['sample_summary']['n_contours']}`
- Cells: `{summary['sample_summary']['n_cells']}`
- Ecotypes: `{summary['sample_summary']['n_ecotypes']}`
- Bootstrap ARI: `{summary['sample_summary']['bootstrap_mean_ari']:.3f}`
- Features used for ecotype clustering: `{summary['sample_summary']['feature_count']}`
    """
))

pd.Series(summary["top_program_counts"], name="n_contours").rename_axis("top_program").reset_index()

Core workflow#

A ready pyXenium slide store can be loaded directly with read_slide(xenium_slide.zarr) when it already contains contour_images/. For a new sample, the regeneration path below loads Xenium RNA plus aligned H&E, imports a contour GeoJSON, extracts H&E patches, and then runs the multimodal contour workflow.

The fast tutorial run uses cell-level spatial transcriptomics (include_transcripts=False) so it finishes quickly. For a longer production run, set include_transcripts=True, stream_transcripts=True to add transcript-level signed-distance gradients.

RUN_REGENERATION = False

if RUN_REGENERATION:
    from pyXenium.contour import add_contours_from_geojson
    from pyXenium.io import read_slide, read_xenium
    from pyXenium.multimodal import run_contour_boundary_ecology_pilot

    ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
    for child in ARTIFACT_DIR.iterdir():
        if child.name == "atera_tutorial_contour_subset.geojson":
            continue
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()

    structures = [
        "11q13 Invasive Tumor Cells",
        "Basal-like Structured DCIS Cells",
        "Macrophages",
        "Plasma Cells",
        "Endothelial Cells",
        "Luminal-like Amorphous DCIS Cells",
        "Apocrine Cells",
    ]
    per_structure = 4
    source_geojson = ATERA_DATASET_PATH / "xenium_explorer_annotations.generated.geojson"
    subset_geojson = ARTIFACT_DIR / "atera_tutorial_contour_subset.geojson"
    payload = json.loads(source_geojson.read_text(encoding="utf-8"))
    counts = {name: 0 for name in structures}
    selected = []
    for feature in payload.get("features", []):
        structure = feature.get("properties", {}).get("assigned_structure")
        if structure not in counts or counts[structure] >= per_structure:
            continue
        selected.append(feature)
        counts[structure] += 1
        if all(value >= per_structure for value in counts.values()):
            break
    subset_geojson.write_text(json.dumps({"type": "FeatureCollection", "features": selected}), encoding="utf-8")

    if (SLIDE_PATH / "contour_images").exists():
        slide = read_slide(SLIDE_PATH)
    else:
        slide = read_xenium(
            str(ATERA_DATASET_PATH),
            as_="slide",
            prefer="h5",
            include_transcripts=False,
            include_boundaries=False,
            include_images=True,
            cells_parquet="cells.parquet",
            clusters_relpath="analysis/analysis/clustering/gene_expression_graphclust/clusters.csv",
        )
        add_contours_from_geojson(
            slide,
            subset_geojson,
            key=CONTOUR_KEY,
            id_key="name",
            extract_he_patches=True,
            copy=False,
        )

    result = run_contour_boundary_ecology_pilot(
        slide,
        contour_key=CONTOUR_KEY,
        output_dir=ARTIFACT_DIR,
    )
    print(result["sample_summary"])
else:
    print("Using checked artifacts. Set RUN_REGENERATION = True to rerun the real Atera workflow.")

Visual outputs#

exemplar_montage.png pairs high-scoring contours with matched controls. Matching is important because a raw high score can otherwise reflect contour size, position, or cell density instead of a boundary-specific biological program.

montage = ARTIFACT_DIR / "exemplar_montage.png"
if montage.exists():
    display(Image(filename=str(montage)))
else:
    print("Montage not found. Rerun the workflow cell to generate it.")
display(hypotheses[["program", "max_score", "n_matched_exemplars", "hypothesis"]])
display(matched[["program", "exemplar_id", "control_id", "delta_score"]].head(12))
display(deltas[["program", "feature_kind", "feature_name", "effect_size", "matched_control_delta", "fdr"]].head(15))

Biological interpretation#

The 28-contour tutorial run should be read as a discovery vignette, not a powered cohort analysis. It still shows what RNA + contour + H&E analysis can reveal:

  • The workflow assigns every contour one dominant boundary program. In this run, immune_exclusion is the most frequent top program, while stromal_encapsulation, emt_invasive_front, tls_adjacent_activation, necrotic_hypoxic_rim, and myeloid_vascular_belt each appear in smaller contour subsets.

  • The highest maximum scores are stromal_encapsulation (0.999), emt_invasive_front (0.948), and tls_adjacent_activation (0.925), suggesting that the strongest tutorial-scale signals are matrix wrapping, invasive-front morphology, and lymphoid-adjacent activation.

  • Matched controls make the hypothesis concrete. For example, the emt_invasive_front table highlights positive matched-control deltas for SPP1-CD44, CXCL12-CXCR4, TGFB1-TGFBR2, and VIM, which are consistent with a stromal or invasive-front interpretation.

  • The small subset is not intended to claim statistical significance. Most FDR values are high because the run contains only 28 contours; the value is in ranking candidates and generating exemplar/control pairs for follow-up review.

Caveats#

  • The checked run uses cell-level WTA pseudobulk summaries for speed. Transcript-level signed-distance gradients are available when the sample is loaded with streamed transcripts, but that is a longer run on the full Atera transcript store.

  • The H&E patch is not resampled into Xenium coordinates. pyXenium keeps the original pixels and stores affine metadata, so patch-level interpretation depends on correct alignment.

  • Program scores are hypothesis generators. They should be validated by pathologist review, marker-specific maps, larger contour sets, and orthogonal assays where possible.

  • Foundation model embeddings are intentionally off in this tutorial. Pass an embedding_backend to add them without replacing the interpretable pathomics branch.

Next steps#

  • Increase per_structure or use the full generated GeoJSON for a stronger within-sample analysis.

  • Enable streamed transcript gradients for sentinel genes such as MMP11, COL1A1, COL1A2, CXCL13, SPP1, and KDR.

  • Review exemplar/control pairs in Xenium Explorer or a pathology viewer.

  • Aggregate contour_features.csv across samples to turn this single-sample pilot into a cohort-level contour ecology study.