S1-S5 breast cancer contour application#

This advanced pyXenium.contour tutorial uses the Atera Xenium WTA FFPE breast cancer export at Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_FFPE_Breast_Cancer_outs to build five HistoSeg-backed contour annotations and then run contour-native biology and topology analyses.

The five annotation classes are intentionally broader than the basic contour tutorial:

S class	Biological compartment	Aggregated Atera cell groups
S1	Invasive tumor plus invasion-associated CAFs	CAFs Invasive Associated, 11q13 invasive tumor, G1/S, mitotic
S2	Basal-like structured DCIS plus lymphoid/DC	Basal DCIS, dendritic cells, B cells, T lymphocytes
S3	Myeloid-vascular-stromal niche	Mast/myeloid/macrophage, CXCL14+ fibroblast, endothelial, pericyte
S4	Myoepithelial/DCIS CAF/plasma compartment	Myoepithelial, DCIS-associated CAFs, plasma cells
S5	Apocrine/luminal amorphous DCIS axis	Apocrine and luminal-like amorphous DCIS

The committed RTD snapshot is generated from a whole-transcriptome transcript scan. The lightweight tables below are top slices from the all-gene results; no curated DEG panel is used for gene selection.

from __future__ import annotations

from pathlib import Path
import json
import shutil
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Markdown, display


def _find_repo_root() -> Path:
    cwd = Path.cwd().resolve()
    for candidate in (cwd, *cwd.parents):
        if (candidate / 'src' / 'pyXenium').exists():
            return candidate
    return cwd

PROJECT_ROOT = _find_repo_root()
SRC_ROOT = PROJECT_ROOT / 'src'
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

from pyXenium.contour import (
    add_contours_from_geojson,
    compare_contour_cell_composition,
    compare_contour_transcript_de,
    generate_barrier_contour_shells,
    generate_xenium_explorer_annotations,
)
from pyXenium.io import XeniumFrameChunkSource, read_xenium
from pyXenium.io.api import TRANSCRIPT_POINT_COLUMNS, TRANSCRIPT_POINT_COLUMN_TYPES
from pyXenium.io.xenium_artifacts import iter_transcript_chunks

sns.set_theme(context='notebook', style='whitegrid')

DATASET_ROOT = Path(r'Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_FFPE_Breast_Cancer_outs')
HISTOSEG_ROOT = Path(r'D:\GitHub\HistoSeg')
OUTPUT_ROOT = DATASET_ROOT / 'pyxenium_tutorial_outputs' / 'atera_s1_s5_contour_application'
FIGURE_DIR = OUTPUT_ROOT / 'figures'
STATIC_SNAPSHOT_DIR = PROJECT_ROOT / 'docs' / '_static' / 'tutorials' / 'contour_s1_s5'
GRAPHCLUST_RELPATH = Path('analysis/analysis/clustering/gene_expression_graphclust/clusters.csv')
CELL_GROUPS_PATH = DATASET_ROOT / 'WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv'
CELLS_PARQUET_RELPATH = Path('cells.parquet')
TRANSCRIPTS_PATH = DATASET_ROOT / 'transcripts.parquet'
S1_S5_GEOJSON = DATASET_ROOT / 'xenium_explorer_annotations.s1_s5.generated.geojson'
PIXEL_SIZE_UM = 0.2125
QUALITY_QUERY = 'quality_score >= 20 and valid == True'

RUN_HISTOSEG_EXPORT = False
RUN_TRANSCRIPT_SNAPSHOT = False

OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
STATIC_SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)


def top_ok(frame: pd.DataFrame, n: int = 200) -> pd.DataFrame:
    local = frame.loc[frame.get('status', 'ok').eq('ok')].copy()
    if local.empty:
        local = frame.copy()
    return local.head(n).reset_index(drop=True)


def top_one_vs_rest_snapshot(frame: pd.DataFrame, per_case: int = 30) -> pd.DataFrame:
    local = frame.loc[frame.get('status', 'ok').eq('ok')].copy()
    if local.empty:
        local = frame.copy()
    return (
        local.sort_values(['case', 'fdr', 'p_value', 'gene'], na_position='last', kind='stable')
        .groupby('case', group_keys=False)
        .head(per_case)
        .reset_index(drop=True)
    )

1. Build the S1-S5 structure definitions#

The official GraphClust table stores numeric cluster IDs, while the Atera cell-group CSV stores readable labels and colors. We bridge those tables and aggregate cluster IDs into the five S classes. Two user-facing names are aliased to the exact dataset labels: CAFs Invasive Associated -> CAFs, Invasive Associated and CAFs DCIS Associated -> CAFs, DCIS Associated.

S1_S5_PANEL = {
    'S1': [
        'CAFs Invasive Associated',
        '11q13 Invasive Tumor Cells',
        '11q13 Invasive Tumor Cells (G1/S)',
        '11q13 Invasive Tumor Cells (Mitotic)',
    ],
    'S2': [
        'Basal-like Structured DCIS Cells',
        'Dendritic Cells',
        'B Cells',
        'T Lymphocytes',
    ],
    'S3': [
        'Mast Cells',
        'Myeloid Cells',
        'Macrophages',
        'CXCL14+ Fibroblasts',
        'Endothelial Cells',
        'Pericytes',
    ],
    'S4': [
        'Myoepithelial Cells',
        'CAFs DCIS Associated',
        'Plasma Cells',
    ],
    'S5': [
        'Apocrine Cells',
        'Luminal-like Amorphous DCIS Cells',
    ],
}

LABEL_ALIASES = {
    'CAFs Invasive Associated': 'CAFs, Invasive Associated',
    'CAFs DCIS Associated': 'CAFs, DCIS Associated',
}
S_COLORS = {'S1': '#C43C39', 'S2': '#3F7CAC', 'S3': '#4A9D67', 'S4': '#B9802D', 'S5': '#9A5BC7'}
S_DESCRIPTIONS = {
    'S1': 'Invasive tumor + CAFs',
    'S2': 'Basal DCIS + lymphoid',
    'S3': 'Myeloid-vascular-stromal',
    'S4': 'Myoepithelial/DCIS CAF/plasma',
    'S5': 'Apocrine/luminal DCIS',
}

cluster_df = pd.read_csv(DATASET_ROOT / GRAPHCLUST_RELPATH)
cell_groups_df = pd.read_csv(CELL_GROUPS_PATH).rename(columns={'cell_id': 'Barcode', 'group': 'cell_group'})
cluster_bridge = cluster_df.merge(
    cell_groups_df[['Barcode', 'cell_group', 'color']],
    on='Barcode',
    how='inner',
)
label_to_clusters = (
    cluster_bridge.groupby('cell_group')['Cluster']
    .apply(lambda values: sorted(pd.Series(values).dropna().astype(int).unique().tolist()))
    .to_dict()
)
label_counts = cluster_bridge.groupby('cell_group').size().to_dict()

structures = []
structure_rows = []
for idx, (structure_name, requested_labels) in enumerate(S1_S5_PANEL.items(), start=1):
    cluster_ids = []
    resolved_labels = []
    input_cell_count = 0
    for requested_label in requested_labels:
        resolved_label = LABEL_ALIASES.get(requested_label, requested_label)
        if resolved_label not in label_to_clusters:
            raise KeyError(f'Missing Atera cell group: {requested_label!r} -> {resolved_label!r}')
        cluster_ids.extend(label_to_clusters[resolved_label])
        resolved_labels.append(resolved_label)
        input_cell_count += int(label_counts[resolved_label])

    structures.append(
        {
            'structure_name': structure_name,
            'cluster_ids': sorted(set(cluster_ids)),
            'structure_color': S_COLORS[structure_name],
            'structure_id': idx,
            'member_labels': tuple(resolved_labels),
        }
    )
    structure_rows.append(
        {
            'structure_name': structure_name,
            'description': S_DESCRIPTIONS[structure_name],
            'cluster_ids': sorted(set(cluster_ids)),
            'member_labels': ', '.join(resolved_labels),
            'input_cell_count': input_cell_count,
        }
    )

structure_definition_table = pd.DataFrame(structure_rows)
display(structure_definition_table)

2. Generate HistoSeg-backed Xenium Explorer annotations#

generate_xenium_explorer_annotations(...) calls the HistoSeg multi-structure contour API and writes a Xenium Explorer-compatible GeoJSON. The legacy xenium_explorer_annotations.geojson is left untouched; this tutorial writes xenium_explorer_annotations.s1_s5.generated.geojson.

if RUN_HISTOSEG_EXPORT or not S1_S5_GEOJSON.exists():
    artifacts = generate_xenium_explorer_annotations(
        DATASET_ROOT,
        structures=structures,
        output_relpath=OUTPUT_ROOT.relative_to(DATASET_ROOT),
        clusters_relpath=GRAPHCLUST_RELPATH,
        cells_parquet_relpath=CELLS_PARQUET_RELPATH,
        histoseg_root=HISTOSEG_ROOT if HISTOSEG_ROOT.exists() else None,
        min_cells=500,
        min_component_pixels=180,
        xenium_pixel_size_um=PIXEL_SIZE_UM,
    )
    shutil.copy2(artifacts['geojson'], S1_S5_GEOJSON)
else:
    artifacts = {
        'geojson': str(S1_S5_GEOJSON),
        'out_dir': str(OUTPUT_ROOT),
        'preview_png': str(OUTPUT_ROOT / 'multi_structure_contour_preview.png'),
    }

contour_counts = pd.read_csv(OUTPUT_ROOT / 's1_s5_contour_counts.csv') if (OUTPUT_ROOT / 's1_s5_contour_counts.csv').exists() else pd.read_csv(OUTPUT_ROOT / 'structure_contour_cell_counts.csv')
display(contour_counts)
artifacts

Snapshot contour counts#

The committed RTD snapshot includes this small result table, so the page remains informative even with notebook execution disabled.

structure_name	description	input_cell_count	assigned_cell_count	contour_count
S1	Invasive tumor + CAFs	72873	71725	106
S2	Basal DCIS + lymphoid	23298	15935	358
S3	Myeloid-vascular-stromal	26690	37038	491
S4	Myoepithelial/DCIS CAF/plasma	33753	36486	313
S5	Apocrine/luminal DCIS	13431	8861	28

Annotation preview#

The generated S1-S5 export contains multiple contours per class. In the snapshot run, the HistoSeg count table reported 106 S1, 358 S2, 491 S3, 313 S4, and 28 S5 contour components before GeoJSON polygon serialization.

S1-S5 annotation preview

3. Load Xenium, import contours, and attach cell-group labels#

We load the cell table without materializing the full transcript parquet. When RUN_TRANSCRIPT_SNAPSHOT=True, the notebook attaches the unfiltered transcripts.parquet stream and lets compare_contour_transcript_de(..., genes=None) discover all genes from the whole transcriptome. The RTD page keeps execution off and displays the committed whole-transcriptome snapshot tables and figures.

slide = read_xenium(
    str(DATASET_ROOT),
    as_='slide',
    include_transcripts=False,
    include_boundaries=False,
    include_images=False,
    clusters_relpath=str(GRAPHCLUST_RELPATH),
)
cell_group_map = cell_groups_df.set_index('Barcode')['cell_group']
slide.table.obs['cell_group'] = slide.table.obs.index.map(cell_group_map).fillna('Unassigned')

add_contours_from_geojson(
    slide,
    S1_S5_GEOJSON,
    key='s1_s5_contours',
    id_key='name',
    pixel_size_um=PIXEL_SIZE_UM,
)

print(slide.component_summary())
print(sorted(slide.shapes['s1_s5_contours']['assigned_structure'].dropna().unique()))

def attach_whole_transcript_source(slide, genes=None) -> None:
    gene_filter = None if genes is None else set(str(gene) for gene in genes)
    slide.points.pop('transcripts', None)
    slide.point_sources['transcripts'] = XeniumFrameChunkSource(
        columns=TRANSCRIPT_POINT_COLUMNS,
        column_types=TRANSCRIPT_POINT_COLUMN_TYPES,
        chunk_iter_factory=lambda: iter_transcript_chunks(str(TRANSCRIPTS_PATH), genes=gene_filter),
        attrs={
            'source_path': str(TRANSCRIPTS_PATH),
            'gene_filter': None if gene_filter is None else tuple(sorted(gene_filter)),
            'analysis_mode': 'whole_transcriptome',
        },
        preserve_extra_columns=True,
    )


if RUN_TRANSCRIPT_SNAPSHOT:
    attach_whole_transcript_source(slide)

4. Transcript-level contour DEG#

compare_contour_transcript_de(...) aggregates raw transcript counts into contour-level pseudobulks, computes contour-level CPM/log1p CPM, and treats each contour as a replicate. Here genes=None means the DEG scan is whole-transcriptome: every gene observed in the unfiltered transcript stream is tested before the top snapshot rows are selected for RTD. For this whole-transcriptome screening step, assignment='cell_id' uses the cell_id field in transcripts.parquet and assigns cell-associated transcripts through the contour membership of each cell centroid, which is much faster than point-in-polygon testing 740M transcript coordinates.

if RUN_TRANSCRIPT_SNAPSHOT:
    deg_result_full = compare_contour_transcript_de(
        slide,
        contour_key='s1_s5_contours',
        groupby='assigned_structure',
        genes=None,
        assignment='cell_id',
        comparisons=('global', 'one_vs_rest'),
        transcript_query=QUALITY_QUERY,
        min_total_transcripts_per_contour=1,
        min_contours_per_group=2,
        include_zero_counts=False,
        return_contour_gene=False,
    )
    deg_result_full['global_de'].to_csv(
        OUTPUT_ROOT / 's1_s5_transcript_de_global.whole_transcriptome.full.csv', index=False
    )
    deg_result_full['one_vs_rest_de'].to_csv(
        OUTPUT_ROOT / 's1_s5_transcript_de_one_vs_rest.whole_transcriptome.full.csv', index=False
    )
    deg_result = {
        'global_de': top_ok(deg_result_full['global_de'], 200),
        'one_vs_rest_de': top_one_vs_rest_snapshot(deg_result_full['one_vs_rest_de'], per_case=30),
    }
    deg_result['global_de'].to_csv(OUTPUT_ROOT / 's1_s5_transcript_de_global.csv', index=False)
    deg_result['one_vs_rest_de'].to_csv(OUTPUT_ROOT / 's1_s5_transcript_de_one_vs_rest.csv', index=False)
else:
    deg_result = {
        'global_de': pd.read_csv(STATIC_SNAPSHOT_DIR / 's1_s5_transcript_de_global.csv'),
        'one_vs_rest_de': pd.read_csv(STATIC_SNAPSHOT_DIR / 's1_s5_transcript_de_one_vs_rest.csv'),
    }

display(deg_result['global_de'].head(12))
display(deg_result['one_vs_rest_de'].head(12))

Snapshot DEG tables#

Top global genes from the whole-transcriptome contour-level Kruskal-Wallis test. These rows are a committed top slice from 18,028 tested genes, not a curated panel.

gene	top_group	bottom_group	delta_log1p_cpm	delta_mean_density_per_um2	fdr
EGFL7	S3	S4	3.47	0.00167	6.01e-88
SHANK3	S3	S4	3.17	0.00148	2.48e-73
EPAS1	S3	S2	3.08	0.004026	2.48e-73
CD36	S3	S4	3.35	0.00161	3.09e-73
VWF	S3	S4	3.48	0.003595	2.02e-71
KDR	S3	S4	3.01	0.001556	9.05e-67
CDH5	S3	S2	2.77	0.0009097	1.02e-64
PECAM1	S3	S4	3.16	0.002475	8.99e-64

Top two one-vs-rest whole-transcriptome genes per S class:

case	gene	log2fc_cpm	fdr
S1	SPATA20	1.71	5.98e-36
S1	SLC35F5	0.68	2.03e-33
S2	EGFL7	-2.44	9.62e-37
S2	PDE2A	-3.19	4.2e-36
S3	CD36	2.6	3.62e-82
S3	EGFL7	2.97	4.46e-78
S4	PCDH17	-2.64	4.86e-32
S4	SHANK3	-2.11	7.78e-30
S5	GRHL2	1.83	1.73e-175
S5	VWA2	1.77	2.61e-156

DEG visualization#

The heatmap shows mean contour-level log1p(CPM) for top genes selected from the whole-transcriptome global test. The dotplot summarizes the strongest whole-transcriptome one-vs-rest signals per S class.

S1-S5 DEG heatmap

S1-S5 DEG dotplot

5. Cell-type composition across contour classes#

compare_contour_cell_composition(...) assigns cell centroids to every contour independently. It reports both proportions and quantities, then tests S-class differences with Kruskal-Wallis globally and Mann-Whitney tests pairwise.

if RUN_TRANSCRIPT_SNAPSHOT:
    cell_result = compare_contour_cell_composition(
        slide,
        contour_key='s1_s5_contours',
        groupby='assigned_structure',
        cell_type_key='cell_group',
    )
    cell_result['group_summary'].to_csv(OUTPUT_ROOT / 's1_s5_cell_composition_group_summary.csv', index=False)
    pd.concat(
        [
            cell_result['global_stats'].assign(test_scope='global'),
            cell_result['pairwise_stats'].assign(test_scope='pairwise'),
        ],
        ignore_index=True,
        sort=False,
    ).to_csv(OUTPUT_ROOT / 's1_s5_cell_composition_stats.csv', index=False)
else:
    cell_result = {
        'group_summary': pd.read_csv(STATIC_SNAPSHOT_DIR / 's1_s5_cell_composition_group_summary.csv'),
        'stats': pd.read_csv(STATIC_SNAPSHOT_DIR / 's1_s5_cell_composition_stats.csv'),
    }

display(cell_result['group_summary'].head(12))
if 'stats' in cell_result:
    display(cell_result['stats'].head(12))

Snapshot cell-composition statistics#

Top global cell-type composition differences across S1-S5 contours:

cell_type	metric	top_group	bottom_group	fdr
Pericytes	n_cells	S3	S2	6.01e-98
Endothelial Cells	n_cells	S3	S2	1.40e-94
Pericytes	fraction	S3	S5	3.49e-90
Endothelial Cells	fraction	S3	S1	1.06e-89
Luminal-like Amorphous DCIS Cells	n_cells	S5	S1	5.84e-77
CXCL14+ Fibroblasts	n_cells	S1	S2	5.35e-76
CAFs, DCIS Associated	fraction	S4	S1	7.76e-70
Luminal-like Amorphous DCIS Cells	fraction	S5	S1	2.62e-67

Composition visualization#

S1-S5 cell-type proportion

S1-S5 cell-type quantity

6. Barrier-aware S1/S4/S5 boundary shells#

The boundary question asks how S1, S4, and S5 contours behave when expanded inward and outward in 20 um steps. We generate 10 signed rings: [-100,-80] ... [-20,0] and [0,20] ... [80,100]. Outward rings are clipped by all other S1-S5 contour interiors, so a shell stops contributing when it reaches another annotated contour.

selected_frame = slide.shapes['s1_s5_contours'].loc[
    slide.shapes['s1_s5_contours']['assigned_structure'].isin(['S1', 'S4', 'S5'])
].copy()
slide.shapes['s1_s4_s5_contours'] = selected_frame
slide.metadata.setdefault('contours', {})['s1_s4_s5_contours'] = {
    'units': 'micron',
    'derived_from_key': 's1_s5_contours',
}

if 's1_s4_s5_barrier_shells' not in slide.shapes:
    generate_barrier_contour_shells(
        slide,
        contour_key='s1_s4_s5_contours',
        barrier_contour_key='s1_s5_contours',
        inward=100.0,
        outward=100.0,
        step_size=20.0,
        output_key='s1_s4_s5_barrier_shells',
    )

shell_frame = slide.shapes['s1_s4_s5_barrier_shells']
slide.shapes['s1_s4_s5_boundary_outward_0_60'] = shell_frame.loc[
    shell_frame['shell_direction'].eq('outward')
    & (shell_frame['shell_start'] >= 0)
    & (shell_frame['shell_end'] <= 60)
].copy()
slide.metadata.setdefault('contours', {})['s1_s4_s5_boundary_outward_0_60'] = {
    'units': 'micron',
    'derived_from_key': 's1_s4_s5_barrier_shells',
}

display(shell_frame[['contour_id', 'source_contour_id', 'assigned_structure', 'shell_start', 'shell_end', 'shell_direction']].head())
print('shell contours:', shell_frame['contour_id'].nunique())
print('outward 0-60 um shells:', slide.shapes['s1_s4_s5_boundary_outward_0_60']['contour_id'].nunique())

if RUN_TRANSCRIPT_SNAPSHOT:
    boundary_result_full = compare_contour_transcript_de(
        slide,
        contour_key='s1_s4_s5_boundary_outward_0_60',
        groupby='assigned_structure',
        genes=None,
        assignment='cell_id',
        comparisons=('global',),
        transcript_query=QUALITY_QUERY,
        min_total_transcripts_per_contour=0,
        min_contours_per_group=2,
        include_zero_counts=False,
        return_contour_gene=False,
    )
    boundary_stats = top_ok(boundary_result_full['global_de'], 200)
    boundary_result_full['global_de'].to_csv(
        OUTPUT_ROOT / 's1_s4_s5_boundary_de_global_outward_0_60.whole_transcriptome.full.csv', index=False
    )
    boundary_stats.to_csv(OUTPUT_ROOT / 's1_s4_s5_boundary_de_global_outward_0_60.csv', index=False)

    top_boundary_genes = boundary_stats['gene'].head(10).tolist()
    attach_whole_transcript_source(slide, genes=top_boundary_genes)
    shell_result = compare_contour_transcript_de(
        slide,
        contour_key='s1_s4_s5_barrier_shells',
        groupby='assigned_structure',
        genes=top_boundary_genes,
        comparisons=('global',),
        transcript_query=QUALITY_QUERY,
        min_total_transcripts_per_contour=0,
        min_contours_per_group=2,
        include_zero_counts=True,
        return_contour_gene=True,
    )
    shell_gene = shell_result['contour_gene'].copy()
    boundary_curves = (
        shell_gene.loc[shell_gene['gene'].isin(top_boundary_genes)]
        .groupby(['assigned_structure', 'gene', 'shell_start', 'shell_end', 'shell_mid'], as_index=False)
        .agg(
            mean_density_per_um2=('density_per_um2', 'mean'),
            sem_density_per_um2=('density_per_um2', lambda values: float(pd.Series(values).sem())),
            n_shells=('contour_id', 'nunique'),
            total_count=('count', 'sum'),
        )
        .sort_values(['gene', 'assigned_structure', 'shell_mid'], kind='stable')
    )
    boundary_curves.to_csv(OUTPUT_ROOT / 's1_s4_s5_barrier_ring_density_top10.csv', index=False)
else:
    boundary_curves = pd.read_csv(STATIC_SNAPSHOT_DIR / 's1_s4_s5_barrier_ring_density_top10.csv')
    boundary_stats = pd.read_csv(STATIC_SNAPSHOT_DIR / 's1_s4_s5_boundary_de_global_outward_0_60.csv')

display(boundary_stats.head(10))
display(boundary_curves.head(10))

Snapshot boundary DE table#

Top boundary genes were selected from a whole-transcriptome DEG scan on outward 0-60 um barrier-clipped shells. The screening step uses cell-associated transcripts assigned by cell centroid for scalability; the density curves below re-count these top genes from transcript coordinates across all 10 signed-distance steps.

gene	top_group	bottom_group	delta_log1p_cpm	delta_mean_density_per_um2	fdr
SLC39A6	S1	S4	1.48	0.00204	2.81e-21
AGR3	S1	S4	1.56	0.003781	2.81e-21
CERS2	S1	S4	1.57	0.001149	2.81e-21
TPBG	S1	S5	1.07	0.0009283	2.81e-21
TJP1	S1	S4	1.16	0.0006458	2.91e-21
ELAPOR1	S1	S4	1.65	0.002875	2.98e-21
DDB1	S1	S4	1.53	0.001435	4.46e-21
GNS	S5	S4	2.03	0.01711	4.49e-21
TNFSF10	S1	S4	1.68	0.004175	4.49e-21
SPATA20	S1	S4	1.52	0.002	5.14e-21

Boundary density curves#

Top boundary genes were selected from a whole-transcriptome DEG scan on cell-associated transcripts whose cell centroids fall in the outward 0-60 um barrier-clipped shells. After those top genes are selected, their density curves are re-counted from transcript coordinates across the full 10 signed-distance steps. Negative distances are inside the source contour; positive distances are outside but barrier-clipped so they do not enter another contour.

S1/S4/S5 barrier-aware density curves

7. Biological interpretation#

The S1-S5 contours recover expected compartment-level organization, but the whole-transcriptome scan also makes the dominant axes more explicit than the earlier marker-panel snapshot.

S1 remains an invasive tumor/CAF-rich compartment. Its one-vs-rest top genes include SPATA20, SLC35F5, LRIG1, XBP1, CGN, CPD, GNS, and ANXA2, consistent with an epithelial tumor program mixed with stress/secretory and invasion-associated biology.
S2 is a basal-like structured DCIS plus lymphoid/DC compartment. In the whole-transcriptome one-vs-rest table, its strongest contrasts are largely negative vascular/endothelial genes such as EGFL7, PDE2A, CCDC102B, CDH5, and FABP4, supporting that S2 is less vascular than the S3/S5-rich microenvironments.
S3 is the clearest vascular-stromal niche. The global top genes are dominated by endothelial/perivascular markers, including EGFL7, EPAS1, VWF, KDR, CDH5, PECAM1, CLEC14A, EMCN, ESAM, RAMP2, PTPRB, PLVAP, MMRN2, and RGS5.
S4 combines myoepithelial cells, DCIS-associated CAFs, and plasma cells. Its strongest one-vs-rest contrasts include matrix-associated COL6A1 and COL6A2, while several endothelial genes are lower than the rest, separating it from the S3 vascular niche.
S5 is the luminal/apocrine DCIS axis. Its one-vs-rest top genes include epithelial/luminal regulatory and differentiation genes such as MLPH, TFAP2A, ESRP2, FOXA1, GRHL2, and VWA2.
For S1/S4/S5 boundary ecology, whole-transcriptome outward 0-60 um screening selected SLC39A6, AGR3, CERS2, TPBG, TJP1, ELAPOR1, DDB1, GNS, TNFSF10, and SPATA20. Coordinate-based density curves show many of these genes are highest inside S1 and/or S5 and decline across the boundary, which supports an epithelial/luminal boundary program rather than a purely stromal outward signal.

A practical analysis pattern emerges: use whole-transcriptome compare_contour_transcript_de to identify contour-level programs, compare_contour_cell_composition to explain which cell types make those programs plausible, and generate_barrier_contour_shells plus coordinate-counted transcript density curves to test whether a program sits inside the contour, peaks at the boundary, or extends into the surrounding microenvironment.

8. Contour adjacency network analysis#

HistoSeg also provides a contour adjacency network layer. Here we reuse the already imported S1-S5 contour geometries, aggregate pairwise boundary-neighbor and enclosure relationships by assigned_structure, and render a compact network/heatmap snapshot for RTD. The adjacency score combines shared border length with enclosure-derived inner contour perimeter, so it captures both geography-like border contact and one compartment wrapping around another.

histoseg_src = HISTOSEG_ROOT / 'src'
if histoseg_src.exists() and str(histoseg_src) not in sys.path:
    sys.path.insert(0, str(histoseg_src))

from histoseg.contour import ContourAdjacencyConfig, run_contour_adjacency
from pyXenium.contour._geometry import contour_frame_to_geometry_table

ADJACENCY_ROOT = OUTPUT_ROOT / 's1_s5_contour_adjacency'
ADJACENCY_ROOT.mkdir(parents=True, exist_ok=True)

adjacency_contours = contour_frame_to_geometry_table(
    slide.shapes['s1_s5_contours'],
    contour_key='s1_s5_contours',
)[['contour_id', 'assigned_structure', 'classification_name', 'structure_id', 'geometry']].copy()
adjacency_contours = adjacency_contours.rename(columns={'assigned_structure': 'structure'})

adjacency_result = run_contour_adjacency(
    ContourAdjacencyConfig(
        contours=adjacency_contours,
        out_dir=ADJACENCY_ROOT,
        groupby='structure',
        contour_id_col='contour_id',
        geometry_col='geometry',
        boundary_tolerance=1.0,
        min_shared_boundary=0.0,
        enclosure_min_fraction=0.95,
        dpi=220,
        save_preview_png=True,
    )
)

adjacency_edges = pd.read_csv(adjacency_result.edges_csv)
adjacency_matrix = pd.read_csv(adjacency_result.matrix_csv, index_col=0)

snapshot_files = {
    adjacency_result.edges_csv: STATIC_SNAPSHOT_DIR / 's1_s5_contour_adjacency_edges.csv',
    adjacency_result.matrix_csv: STATIC_SNAPSHOT_DIR / 's1_s5_contour_adjacency_matrix.csv',
    adjacency_result.network_png: STATIC_SNAPSHOT_DIR / 's1_s5_contour_adjacency_network.png',
    adjacency_result.heatmap_png: STATIC_SNAPSHOT_DIR / 's1_s5_contour_adjacency_heatmap.png',
    adjacency_result.combined_png: STATIC_SNAPSHOT_DIR / 's1_s5_contour_adjacency_overview.png',
}
for src, dst in snapshot_files.items():
    if src is not None:
        shutil.copy2(src, dst)

display(adjacency_edges.sort_values('total_adjacency_length_um', ascending=False).head(10))
display(adjacency_matrix)
adjacency_result

Snapshot adjacency edge table#

The committed RTD snapshot below was generated with HistoSeg cca55df using boundary_tolerance=1.0 um and enclosure_min_fraction=0.95. It summarizes 1578 S1-S5 contours into 10 cross-structure group pairs; 9 pairs have non-zero adjacency.

group_a	group_b	boundary_pair_count	boundary_shared_length_um	enclosure_pair_count	enclosure_inner_boundary_length_um	total_adjacency_length_um
S3	S4	557	169686	133	80123	249809
S1	S3	323	101744	24	14576.8	116321
S2	S3	672	61709.2	17	2671.4	64380.6
S1	S2	319	33515.6	25	8347.4	41863
S2	S4	374	20373.1	38	4923	25296.1
S4	S5	43	6750.1	15	9422.4	16172.5
S1	S4	163	10859.2	1	61.1	10920.2
S3	S5	75	10650.1	1	266.9	10917
S2	S5	12	554.3	0	0	554.3

Adjacency network and heatmap#

S1-S5 contour adjacency overview

The network emphasizes the same geography seen in the table. S3 and S4 form the dominant adjacency axis (249809 um total adjacency), reflecting extensive contact between the myeloid-vascular-stromal niche and the myoepithelial/DCIS CAF/plasma compartment. S1 is most strongly connected to S3 (116321 um), supporting an invasive tumor/CAF boundary with vascular-myeloid stroma. S2 also contacts S3 (64381 um), while S5 is comparatively peripheral and has no direct positive adjacency with S1 in this S1-S5 contour export.

For reproducibility, the adjacency matrix in microns is:

	S1	S2	S3	S4	S5
S1	nan	41863	116321	10920.2	0
S2	41863	nan	64380.6	25296.1	554.3
S3	116321	64380.6	nan	249809	10917
S4	10920.2	25296.1	249809	nan	16172.5
S5	0	554.3	10917	16172.5	nan

9. Output files#

The lightweight RTD snapshot uses these committed top-slice DEG artifacts plus the compact HistoSeg adjacency outputs:

Local reruns also write full all-gene DEG tables and the full adjacency run directory under pyxenium_tutorial_outputs/atera_s1_s5_contour_application/ beside the Xenium export. Those full tables are intentionally not committed to the docs repository.