pyXenium.io Tutorial

Overview

This notebook uses the public 10x Genomics FFPE human renal cell carcinoma Xenium RNA + protein study to show how pyXenium.io preserves the structures that matter for downstream biology: cells, transcript points, aligned H&E, and cluster-aware metadata.

Biological question

Before we ask multimodal or topology questions, how do we keep the Xenium export organized enough to recover tissue architecture, image context, and molecular measurements in one reproducible container?

from __future__ import annotations

import json
import os
import sys
from pathlib import Path

import pandas as pd
from IPython.display import Image, Markdown, display


def find_repo_root() -> Path:
    for candidate in (Path.cwd(), *Path.cwd().parents):
        if (candidate / "pyproject.toml").exists():
            return candidate
    raise RuntimeError("Could not locate the pyXenium repository root.")


REPO_ROOT = find_repo_root()
SRC_ROOT = REPO_ROOT / "src"
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

pd.set_option("display.max_columns", 20)
pd.set_option("display.max_rows", 12)

RENAL_DATASET_PATH = Path(
    os.environ.get(
        "PYXENIUM_RENAL_DATASET",
        r"Y:\long\10X_datasets\Xenium\Xenium_Renal\Xenium_V1_Human_Kidney_FFPE_Protein",
    )
)
SMOKE_ARTIFACT_DIR = REPO_ROOT / "manuscript" / "evidence" / "smoke_auto"
RUN_LIVE_IO_DEMO = RENAL_DATASET_PATH.exists()

RENAL_DATASET_PATH, SMOKE_ARTIFACT_DIR, RUN_LIVE_IO_DEMO

(WindowsPath('Y:/long/10X_datasets/Xenium/Xenium_Renal/Xenium_V1_Human_Kidney_FFPE_Protein'),
 WindowsPath('D:/GitHub/pyXenium/manuscript/evidence/smoke_auto'),
 True)
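
The setup cell above resolves the dataset location from the PYXENIUM_RENAL_DATASET environment variable and falls back to a hard-coded default. The sketch below isolates that pattern so it can be tried without touching the real environment. The helper name resolve_dataset_path and the example paths are hypothetical and are not part of pyXenium.

```python
import os
from pathlib import Path


def resolve_dataset_path(default, env_var="PYXENIUM_RENAL_DATASET", environ=None):
    """Return the dataset path from an environment mapping, or the default.

    `environ` is injectable so the lookup can be exercised without
    modifying the real process environment (hypothetical helper).
    """
    environ = os.environ if environ is None else environ
    return Path(environ.get(env_var, default))


# Hypothetical paths, used only to illustrate the two branches.
overridden = resolve_dataset_path(
    "/data/default_bundle",
    environ={"PYXENIUM_RENAL_DATASET": "/data/xenium/renal"},
)
fallback = resolve_dataset_path("/data/default_bundle", environ={})

print(overridden)
print(fallback)
```

Setting the variable before launching Jupyter is then enough to point the notebook at a different local copy of the bundle. RUN_LIVE_IO_DEMO stays False whenever the resolved path does not exist.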

Dataset

Raw study: 10x Genomics Xenium FFPE human renal cell carcinoma RNA + protein bundle.
Versioned evidence in this repository: manuscript/evidence/smoke_auto/ and manuscript/figures/figure1_pyxenium_validation.png.
Public surface used in this tutorial: read_xenium, read_slide, write_xenium, and export_xenium_to_slide_zarr.

Setup

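The committed outputs below come from a run on the real renal dataset and are read back from manuscript/evidence/smoke_auto/summary.json. The live demo cell additionally loads the local Xenium bundle, but only when RENAL_DATASET_PATH exists on disk.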

smoke_payload = json.loads((SMOKE_ARTIFACT_DIR / "summary.json").read_text(encoding="utf-8"))
summary = smoke_payload["summary"]

core_summary = pd.DataFrame(
    {
        "metric": [
            "cells",
            "RNA features",
            "protein markers",
            "sparse nnz",
            "spatial coordinates",
            "cluster labels",
        ],
        "value": [
            summary["n_cells"],
            summary["n_rna_features"],
            summary["n_protein_markers"],
            summary["x_nnz"],
            summary["has_spatial"],
            summary["has_cluster"],
        ],
    }
)

display(core_summary)
display(pd.DataFrame(summary["largest_clusters"]).head(5))
display(pd.DataFrame(summary["top_protein_markers_by_mean_signal"]).head(5))

	metric	value
0	cells	465545
1	RNA features	405
2	protein markers	27
3	sparse nnz	16454170
4	spatial coordinates	True
5	cluster labels	True

	cluster	n_cells
0	1	87757
1	2	67261
2	3	59896
3	4	53975
4	5	35331

	marker	mean_signal	positive_cells
0	Vimentin	234.769989	455851
1	CD45	206.936478	446921
2	PTEN	149.335358	464946
3	CD3E	142.478271	285619
4	CD68	120.736671	244801
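
The committed counts allow a quick sanity check on how sparse the expression matrix is and how much of the tissue the largest clusters cover. The sketch below is only arithmetic on the numbers printed above. It assumes that x_nnz counts the non-zero entries of a cells x RNA-features matrix, which the summary does not state explicitly.

```python
# Values copied from the committed summary tables above.
n_cells = 465_545
n_rna_features = 405
x_nnz = 16_454_170
top5_cluster_sizes = [87_757, 67_261, 59_896, 53_975, 35_331]

# Assumption: x_nnz refers to the cells x RNA-features count matrix.
density = x_nnz / (n_cells * n_rna_features)
mean_detected_features_per_cell = x_nnz / n_cells

# Share of all cells that fall into the five largest clusters.
top5_fraction = sum(top5_cluster_sizes) / n_cells

print(f"matrix density: {density:.2%}")
print(f"mean non-zero features per cell: {mean_detected_features_per_cell:.1f}")
print(f"cells in five largest clusters: {top5_fraction:.1%}")
```

Under that assumption, roughly 8.7% of the matrix entries are non-zero, or about 35 detected features per cell, and the five largest clusters account for about 65% of cells.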

Core workflow

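The canonical I/O path keeps Xenium data close to the original assay structure: read_xenium loads the bundle as a slide, write_xenium writes that slide to a Zarr store, and read_slide reloads the store. export_xenium_to_slide_zarr takes the raw bundle path directly.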

from pyXenium.io import (
    export_xenium_to_slide_zarr,
    read_slide,
    read_xenium,
    write_xenium,
)

slide = read_xenium(
    RENAL_DATASET_PATH,
    as_="slide",
    prefer="h5",
    include_transcripts=True,
    stream_transcripts=True,
    include_images=True,
)

payload = write_xenium(slide, "./renal_example.zarr", format="slide", overwrite=True)
reloaded = read_slide(payload["output_path"])
compat_store = export_xenium_to_slide_zarr(RENAL_DATASET_PATH, overwrite=True)
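
After a round-trip like the one above, a natural follow-up is to confirm that the reloaded slide has the same table dimensions as the original. The helper below is a hypothetical sketch, not a pyXenium function. It relies only on the .table.n_obs and .table.n_vars attributes that the live demo cell in this notebook reads, and it uses SimpleNamespace stand-ins so it runs without the dataset.

```python
from types import SimpleNamespace


def roundtrip_report(original, reloaded):
    """Compare table dimensions of two slide-like objects (hypothetical helper)."""
    checks = {
        "n_obs": (original.table.n_obs, reloaded.table.n_obs),
        "n_vars": (original.table.n_vars, reloaded.table.n_vars),
    }
    return {
        name: {"original": a, "reloaded": b, "match": a == b}
        for name, (a, b) in checks.items()
    }


def fake_slide(n_obs, n_vars):
    """Stand-in object exposing only the attributes the report needs."""
    return SimpleNamespace(table=SimpleNamespace(n_obs=n_obs, n_vars=n_vars))


same = roundtrip_report(fake_slide(1000, 405), fake_slide(1000, 405))
different = roundtrip_report(fake_slide(1000, 405), fake_slide(999, 405))

print(same)
print(different)
```

With real data, the same call would be roundtrip_report(slide, reloaded). Comparing dimensions is a cheap first check and does not prove that matrix values, images, or points survived unchanged.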

For the committed notebook output we execute only the loading step, because the full round-trip writes a large real-data store.

from pyXenium.io import read_xenium

if RUN_LIVE_IO_DEMO:
    slide = read_xenium(
        str(RENAL_DATASET_PATH),
        as_="slide",
        prefer="h5",
        include_transcripts=True,
        stream_transcripts=True,
        include_boundaries=False,
        include_images=True,
    )
    live_summary = pd.DataFrame(
        {
            "component": [
                "table cells",
                "RNA features",
                "protein markers",
                "streamed point layers",
                "image layers",
                "cluster key",
            ],
            "value": [
                int(slide.table.n_obs),
                int(slide.table.n_vars),
                int(slide.table.obsm["protein"].shape[1]),
                ", ".join(slide.component_summary()["points"]),
                ", ".join(slide.component_summary()["images"]),
                slide.metadata.get("cluster_key"),
            ],
        }
    )
    display(live_summary)
else:
    display(Markdown("Set `PYXENIUM_RENAL_DATASET` to a local renal Xenium bundle to run the live loading demo."))

	component	value
0	table cells	465545
1	RNA features	405
2	protein markers	27
3	streamed point layers	transcripts
4	image layers	he
5	cluster key	gene_expression_graphclust

Visual outputs

The validation figure below summarizes the same renal study at the documentation level: loader checks, feature recovery, and multimodal readiness.

display(Image(filename=str(REPO_ROOT / "manuscript" / "figures" / "figure1_pyxenium_validation.png")))

[Figure: manuscript/figures/figure1_pyxenium_validation.png]

Biological interpretation

For renal immune-resistance work, I/O is not a neutral preprocessing step. It decides whether we can later connect protein-dominant states, transcript neighborhoods, and histology-aligned niches back to the same tissue coordinates. The validated renal bundle shows that pyXenium keeps enough structure to recover 465,545 cells, 405 RNA features, 27 protein markers, and aligned image context without collapsing those layers into a lossy flat table.

Caveats

The full round-trip examples (write_xenium, read_slide, export_xenium_to_slide_zarr) are shown as code only and are not executed here, because writing complete real-data stores is far heavier than a documentation build should be.
This notebook streams transcript points instead of materializing them into memory; that is the preferred pattern for large Xenium exports.
The public renal dataset is a validated reference, not a universal benchmark for every panel or tissue.
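
To make the streaming caveat concrete, the sketch below counts transcripts per feature while holding only a small chunk of rows in memory at a time. It is a generic illustration of the pattern using the standard library, not a description of how pyXenium streams points internally. The column names (feature_name, x_location, y_location) and the toy gene labels are assumptions made for the example.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def count_features_streaming(handle, chunk_size=2):
    """Count rows per feature without loading the whole file at once."""
    counts = Counter()
    reader = csv.DictReader(handle)
    for chunk in iter_chunks(reader, chunk_size):
        counts.update(row["feature_name"] for row in chunk)
    return counts


# Toy transcript table standing in for a large on-disk export (assumed schema).
toy_transcripts = io.StringIO(
    "feature_name,x_location,y_location\n"
    "GENE_A,10.0,12.5\n"
    "GENE_B,11.0,13.0\n"
    "GENE_A,40.2,8.1\n"
    "GENE_C,5.5,9.9\n"
    "GENE_A,7.7,3.3\n"
)

feature_counts = count_features_streaming(toy_transcripts)
print(feature_counts.most_common())
```

Peak memory here scales with the chunk size and the number of distinct features, not with the number of transcripts, which is the property that matters for exports with many millions of points.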

Next steps

Continue to the multimodal notebook to see how this same renal study feeds joint-state, discordance, and niche analyses.
Use read_xenium(..., as_="slide") when you need images, shapes, or streamed transcripts together.
Use load_rna_protein_anndata(...) when you are ready to work on joint RNA + protein analysis directly.