pyXenium.contour Tutorial#

This notebook shows an end-to-end contour workflow for the Atera Xenium WTA FFPE breast sample:

Read the Xenium export and inspect the Atera directory layout.
Bridge numeric GraphClust IDs to the human-readable labels in WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv.
Generate a HistoSeg-backed Xenium Explorer annotation export.
Copy the generated export to xenium_explorer_annotations.generated.geojson without touching the legacy xenium_explorer_annotations.geojson.
Import the generated GeoJSON into XeniumSlide.
Derive structure-level analysis contours, expand them by 50 um with Voronoi partitioning, and prepare shell-density profiling.
Run ring_density and smooth_density_by_distance with cache-first helpers over the curated Atera marker panel.

The notebook is written in English and is intended to be shipped as package source and linked from the Read the Docs workflows section.

from __future__ import annotations

from pathlib import Path
import json
import shutil
import sys
from typing import Iterable

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Markdown, display
from shapely import union_all

def _find_repo_root() -> Path:
    cwd = Path.cwd().resolve()
    for candidate in (cwd, *cwd.parents):
        if (candidate / 'src' / 'pyXenium').exists():
            return candidate
    return cwd

PROJECT_ROOT = _find_repo_root()
SRC_ROOT = PROJECT_ROOT / 'src'
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

DATASET_ROOT = Path(r"Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_FFPE_Breast_Cancer_outs")
HISTOSEG_ROOT = Path(r"D:\GitHub\HistoSeg")
OUTPUT_ROOT = DATASET_ROOT / 'pyxenium_tutorial_outputs' / 'atera_contour_workflow'
DENSITY_CACHE_DIR = OUTPUT_ROOT / 'density_cache'
FIGURE_DIR = OUTPUT_ROOT / 'figures'
GENERATED_GEOJSON = DATASET_ROOT / 'xenium_explorer_annotations.generated.geojson'
LEGACY_GEOJSON = DATASET_ROOT / 'xenium_explorer_annotations.geojson'
GRAPHCLUST_RELPATH = Path('analysis/analysis/clustering/gene_expression_graphclust/clusters.csv')
CELL_GROUPS_PATH = DATASET_ROOT / 'WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv'
CELLS_PARQUET_RELPATH = Path('cells.parquet')
PIXEL_SIZE_UM = 0.2125
CONTOUR_EXPAND_DISTANCE_UM = 50.0
ANALYSIS_SIMPLIFY_TOLERANCE_UM = 5.0
MIN_CELLS_OVERRIDE = 250
RUN_RING_DENSITY = False
RUN_PROFILE_DENSITY = False
LOAD_CACHED_DENSITY = True
TARGET_QUERY = 'valid == True and quality_score >= 20'

from pyXenium.contour import (
    add_contours_from_geojson,
    expand_contours,
    generate_xenium_explorer_annotations,
    ring_density,
    smooth_density_by_distance,
)
from pyXenium.contour._geometry import contour_frame_to_geometry_table, geometry_table_to_contour_frame
from pyXenium.io import read_xenium

sns.set_theme(context='talk', style='whitegrid')
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
DENSITY_CACHE_DIR.mkdir(parents=True, exist_ok=True)
FIGURE_DIR.mkdir(parents=True, exist_ok=True)

1. Inspect the Atera Xenium export#

The Atera sample uses the official Xenium output layout, but two details matter for this tutorial:

The default GraphClust labels live under analysis/analysis/... instead of the single-layer analysis/... layout.
The readable cell-group labels and colors live in WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv.

We also keep the existing xenium_explorer_annotations.geojson untouched and write a new side-by-side file named xenium_explorer_annotations.generated.geojson.

experiment_metadata = json.loads((DATASET_ROOT / 'experiment.xenium').read_text(encoding='utf-8'))
layout_table = pd.DataFrame(
    {
        'path': [
            'analysis/analysis/clustering/gene_expression_graphclust/clusters.csv',
            'WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv',
            'cells.parquet',
            'cell_feature_matrix.h5',
            'transcripts.zarr.zip',
            'xenium_explorer_annotations.geojson',
            'xenium_explorer_annotations.generated.geojson',
        ],
        'exists': [
            (DATASET_ROOT / GRAPHCLUST_RELPATH).exists(),
            CELL_GROUPS_PATH.exists(),
            (DATASET_ROOT / CELLS_PARQUET_RELPATH).exists(),
            (DATASET_ROOT / 'cell_feature_matrix.h5').exists(),
            (DATASET_ROOT / 'transcripts.zarr.zip').exists(),
            LEGACY_GEOJSON.exists(),
            GENERATED_GEOJSON.exists(),
        ],
    }
)
display(layout_table)
print('pixel_size_um =', experiment_metadata.get('pixel_size'))

	path	exists
0	analysis/analysis/clustering/gene_expression_g...	True
1	WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv	True
2	cells.parquet	True
3	cell_feature_matrix.h5	True
4	transcripts.zarr.zip	True
5	xenium_explorer_annotations.geojson	True
6	xenium_explorer_annotations.generated.geojson	True

pixel_size_um = 0.2125

2. Bridge GraphClust IDs to readable Atera labels#

clusters.csv contains numeric GraphClust IDs, while WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv contains the named biological groups and their Explorer colors.

We build a one-row-per-cluster bridge table and then extract the 7 default structure panel used in this tutorial.

cluster_df = pd.read_csv(DATASET_ROOT / GRAPHCLUST_RELPATH)
cell_groups_df = pd.read_csv(CELL_GROUPS_PATH)
cluster_bridge_long = cluster_df.merge(
    cell_groups_df[['Barcode', 'Clusters', 'color']],
    on='Barcode',
    how='inner',
)
cluster_label_counts = cluster_bridge_long.groupby('Cluster')['Clusters'].nunique(dropna=True)
assert int(cluster_label_counts.max()) == 1, 'Each numeric cluster should map to exactly one readable label.'

cluster_bridge = (
    cluster_bridge_long.groupby('Cluster', as_index=False)
    .agg(
        structure_label=('Clusters', 'first'),
        color=('color', 'first'),
        n_cells=('Barcode', 'size'),
    )
    .sort_values('Cluster', kind='stable')
    .reset_index(drop=True)
)
display(cluster_bridge.head(20))
print('Mapped numeric clusters:', len(cluster_bridge))

	Cluster	structure_label	color	n_cells
0	1	11q13 Invasive Tumor Cells	#E91E63	19796
1	2	11q13 Invasive Tumor Cells	#E91E63	16784
2	3	CAFs, DCIS Associated	#FFC107	15057
3	4	CAFs, DCIS Associated	#FFC107	9385
4	5	Endothelial Cells	#795548	8624
5	6	Basal-like Structured DCIS Cells	#1AFE02	8285
6	7	11q13 Invasive Tumor Cells	#E91E63	8028
7	8	11q13 Invasive Tumor Cells	#E91E63	7353
8	9	Luminal-like Amorphous DCIS Cells	#F943D2	7219
9	10	11q13 Invasive Tumor Cells	#E91E63	6928
10	11	T Lymphocytes	#1976D2	6146
11	12	Luminal-like Amorphous DCIS Cells	#F943D2	5809
12	13	Pericytes	#8D8D00	5296
13	14	11q13 Invasive Tumor Cells	#E91E63	5147
14	15	Dendritic Cells	#00BCD4	4008
15	16	CAFs, Invasive Associated	#FF9800	4001
16	17	Myoepithelial Cells	#2E7D32	3915
17	18	Pericytes	#8D8D00	2791
18	19	CXCL14+ Fibroblasts	#C2640E	2547
19	20	B Cells	#6E7CFE	2512

Mapped numeric clusters: 34

DEFAULT_STRUCTURE_PANEL = [
    '11q13 Invasive Tumor Cells',
    'Basal-like Structured DCIS Cells',
    'Macrophages',
    'Plasma Cells',
    'Endothelial Cells',
    'Apocrine Cells',
    'Luminal-like Amorphous DCIS Cells',
]

selected_bridge = (
    cluster_bridge_long.loc[cluster_bridge_long['Clusters'].isin(DEFAULT_STRUCTURE_PANEL)]
    .groupby('Clusters', as_index=False)
    .agg(
        cluster_ids=('Cluster', lambda values: sorted(pd.Series(values).dropna().astype(int).unique().tolist())),
        color=('color', 'first'),
        n_cells=('Barcode', 'size'),
    )
)
selected_bridge['structure_id'] = selected_bridge['Clusters'].map(
    {name: idx for idx, name in enumerate(DEFAULT_STRUCTURE_PANEL, start=1)}
)
selected_bridge = selected_bridge.sort_values('structure_id', kind='stable').reset_index(drop=True)
selected_bridge

	Clusters	cluster_ids	color	n_cells	structure_id
0	11q13 Invasive Tumor Cells	[1, 2, 7, 8, 10, 14]	#E91E63	64036	1
1	Basal-like Structured DCIS Cells	[6, 32]	#1AFE02	8760	2
2	Macrophages	[24, 26]	#512DA8	3861	3
3	Plasma Cells	[30]	#0F51BD	873	4
4	Endothelial Cells	[5]	#795548	8624	5
5	Apocrine Cells	[33]	#F48FB1	403	6
6	Luminal-like Amorphous DCIS Cells	[9, 12]	#F943D2	13028	7

Apocrine Cells has only 403 mapped cells in this sample. HistoSeg keeps min_cells=500 as its package default, so this notebook overrides the run-specific threshold to 250 in order to keep the full 7-group Atera panel in the generated export.

DEFAULT_STRUCTURES = [
    {
        'structure_name': row['Clusters'],
        'cluster_ids': row['cluster_ids'],
        'structure_color': row['color'],
        'structure_id': int(row['structure_id']),
    }
    for _, row in selected_bridge.iterrows()
]
DEFAULT_STRUCTURES

[{'structure_name': '11q13 Invasive Tumor Cells',
  'cluster_ids': [1, 2, 7, 8, 10, 14],
  'structure_color': '#E91E63',
  'structure_id': 1},
 {'structure_name': 'Basal-like Structured DCIS Cells',
  'cluster_ids': [6, 32],
  'structure_color': '#1AFE02',
  'structure_id': 2},
 {'structure_name': 'Macrophages',
  'cluster_ids': [24, 26],
  'structure_color': '#512DA8',
  'structure_id': 3},
 {'structure_name': 'Plasma Cells',
  'cluster_ids': [30],
  'structure_color': '#0F51BD',
  'structure_id': 4},
 {'structure_name': 'Endothelial Cells',
  'cluster_ids': [5],
  'structure_color': '#795548',
  'structure_id': 5},
 {'structure_name': 'Apocrine Cells',
  'cluster_ids': [33],
  'structure_color': '#F48FB1',
  'structure_id': 6},
 {'structure_name': 'Luminal-like Amorphous DCIS Cells',
  'cluster_ids': [9, 12],
  'structure_color': '#F943D2',
  'structure_id': 7}]

3. Generate a HistoSeg-backed Xenium Explorer export#

pyXenium.contour.generate_xenium_explorer_annotations(...) is a thin wrapper around the new HistoSeg multi-structure API. The generated files are written into pyxenium_tutorial_outputs/atera_contour_workflow/ and the GeoJSON is copied to xenium_explorer_annotations.generated.geojson beside the legacy file.

artifacts = generate_xenium_explorer_annotations(
    DATASET_ROOT,
    structures=DEFAULT_STRUCTURES,
    output_relpath=OUTPUT_ROOT.relative_to(DATASET_ROOT),
    clusters_relpath=GRAPHCLUST_RELPATH,
    cells_parquet_relpath=CELLS_PARQUET_RELPATH,
    histoseg_root=HISTOSEG_ROOT if HISTOSEG_ROOT.exists() else None,
    min_cells=MIN_CELLS_OVERRIDE,
)
shutil.copy2(artifacts['geojson'], GENERATED_GEOJSON)
artifacts

c:\Users\taobo.hu\AppData\Local\miniconda3\Lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

{'out_dir': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow',
 'geojson': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\xenium_explorer_annotations.geojson',
 'csv': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\xenium_explorer_annotations.csv',
 'summary': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\xenium_explorer_annotations_summary.csv',
 'preview_png': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\multi_structure_contour_preview.png',
 'partition_table': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\cells_with_structure_partition.parquet',
 'structure_count_csv': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\structure_contour_cell_counts.csv',
 'metrics_json': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\structure_contour_metrics.json'}

structure_count_table = pd.read_csv(OUTPUT_ROOT / 'structure_contour_cell_counts.csv')
metrics_payload = json.loads((OUTPUT_ROOT / 'structure_contour_metrics.json').read_text(encoding='utf-8'))
display(structure_count_table)
print('valid_structures =', metrics_payload['isoline_metrics']['valid_structures'])
print('skipped_structures =', metrics_payload['isoline_metrics']['skipped_structures'])

	structure_id	structure_name	selected_cluster_ids	input_cell_count	assigned_cell_count	contour_count
0	1	11q13 Invasive Tumor Cells	1, 2, 7, 8, 10, 14	64036	69851	110
1	2	Basal-like Structured DCIS Cells	6, 32	8760	13231	386
2	3	Macrophages	24, 26	3861	25865	516
3	4	Plasma Cells	30	873	7131	326
4	5	Endothelial Cells	5	8624	38357	544
5	6	Apocrine Cells	33	403	626	13
6	7	Luminal-like Amorphous DCIS Cells	9, 12	13028	14984	78

valid_structures = [1, 2, 3, 4, 5, 6, 7]
skipped_structures = []

Diagnostic comparison with the legacy GeoJSON#

This comparison is only a sanity check. The generated file is a named 7-group export, while the legacy file is a pre-existing reference artifact with a different structure definition. They are not expected to match byte-for-byte.

def geojson_digest(path: Path) -> dict[str, object]:
    payload = json.loads(path.read_text(encoding='utf-8'))
    features = payload.get('features', [])
    structures = sorted(
        {
            feature.get('properties', {}).get('assigned_structure')
            for feature in features
            if feature.get('properties', {}).get('assigned_structure') is not None
        }
    )
    return {
        'path': str(path),
        'n_features': len(features),
        'assigned_structures': structures,
    }

comparison_table = pd.DataFrame(
    [
        geojson_digest(LEGACY_GEOJSON),
        geojson_digest(GENERATED_GEOJSON),
    ]
)
comparison_table

	path	n_features	assigned_structures
0	Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_...	300	[Structure 1, Structure 2, Structure 3, Struct...
1	Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_...	2606	[11q13 Invasive Tumor Cells, Apocrine Cells, B...

4. Read Xenium as `XeniumSlide` and import the generated contours#

We intentionally use stream_transcripts=True here to exercise the Atera transcript reader on the real transcripts.zarr.zip layout. On this sample, the streamed chunks do not expose cell_id, and that is now handled explicitly.

slide = read_xenium(
    str(DATASET_ROOT),
    as_='slide',
    prefer='auto',
    stream_transcripts=True,
)
print(slide.component_summary())
first_chunk = next(slide.point_sources['transcripts'].iter_chunks())
print('first streamed transcript columns:', first_chunk.columns.tolist())
first_chunk.head()

{'tables': ['cells'], 'points': ['transcripts'], 'shapes': ['cell_boundaries', 'nucleus_boundaries'], 'images': [], 'contour_images': [], 'labels': []}
first streamed transcript columns: ['x', 'y', 'gene_identity', 'gene_name', 'quality_score', 'valid', 'z', 'id_0', 'id_1', 'uuid_0', 'uuid_1', 'codeword_identity_0', 'codeword_identity_1']

	x	y	gene_name	quality_score	valid	z	id_0	id_1	uuid_0	uuid_1	codeword_identity_0	codeword_identity_1
0	4917.281250	5133.515625	A1BG	37.0	True	13.171875	933743	28	933743	65564	2233	4294967295
1	4920.468750	5134.812500	A1BG	40.0	True	13.046875	1095646	28	1095646	65564	2233	4294967295
2	4829.171875	5041.406250	A1BG	40.0	True	14.437500	1240538	28	1240538	65564	2233	4294967295
3	4815.625000	5087.046875	A1BG	25.0	True	17.171875	2189269	28	2189269	65564	2233	4294967295
4	4875.671875	5118.031250	A1BG	40.0	True	14.781250	139045	28	139045	65564	6803	4294967295

CONTOUR_KEY = 'atera_explorer_contours'
if CONTOUR_KEY in slide.shapes:
    del slide.shapes[CONTOUR_KEY]
    slide.metadata.get('contours', {}).pop(CONTOUR_KEY, None)

add_contours_from_geojson(
    slide,
    GENERATED_GEOJSON,
    key=CONTOUR_KEY,
    id_key='name',
    pixel_size_um=PIXEL_SIZE_UM,
)

sorted(slide.shapes[CONTOUR_KEY]['assigned_structure'].dropna().unique().tolist())

['11q13 Invasive Tumor Cells',
 'Apocrine Cells',
 'Basal-like Structured DCIS Cells',
 'Endothelial Cells',
 'Luminal-like Amorphous DCIS Cells',
 'Macrophages',
 'Plasma Cells']

5. Derive structure-level analysis contours and expand them by 50 um#

The Explorer export contains many polygon components. For shell profiling we dissolve polygons with the same assigned_structure into a single analysis contour per named structure and simplify them moderately. This keeps the original Explorer layer available for round-tripping while making downstream density analysis much more tractable.

def build_structure_level_analysis_contours(
    slide,
    *,
    source_key: str,
    output_key: str,
    simplify_tolerance_um: float = 5.0,
) -> pd.DataFrame:
    if output_key in slide.shapes:
        del slide.shapes[output_key]
        slide.metadata.get('contours', {}).pop(output_key, None)

    geometry_table = contour_frame_to_geometry_table(slide.shapes[source_key], contour_key=source_key)
    metadata_columns = [column for column in geometry_table.columns if column not in {'contour_id', 'geometry'}]
    rows = []
    for assigned_structure, group in geometry_table.groupby('assigned_structure', sort=True):
        geometry = union_all(list(group['geometry']))
        geometry = geometry.simplify(simplify_tolerance_um, preserve_topology=True).buffer(0)
        exemplar = group.iloc[0]
        row = {
            'contour_id': assigned_structure,
            'geometry': geometry,
        }
        for column in metadata_columns:
            if column in {'classification_name', 'assigned_structure'}:
                row[column] = assigned_structure
            else:
                row[column] = exemplar[column]
        rows.append(row)

    analysis_table = pd.DataFrame(rows)
    analysis_frame = geometry_table_to_contour_frame(analysis_table)
    slide.shapes[output_key] = analysis_frame
    slide.metadata.setdefault('contours', {})[output_key] = {
        'derived_from_key': source_key,
        'n_contours': int(analysis_table.shape[0]),
        'units': 'micron',
        'simplify_tolerance_um': float(simplify_tolerance_um),
    }
    slide._validate()
    return analysis_table

ANALYSIS_CONTOUR_KEY = 'atera_structure_contours'
analysis_contour_table = build_structure_level_analysis_contours(
    slide,
    source_key=CONTOUR_KEY,
    output_key=ANALYSIS_CONTOUR_KEY,
    simplify_tolerance_um=ANALYSIS_SIMPLIFY_TOLERANCE_UM,
)
analysis_contour_table[['contour_id', 'assigned_structure', 'structure_id']]

	contour_id	assigned_structure	structure_id
0	11q13 Invasive Tumor Cells	11q13 Invasive Tumor Cells	1
1	Apocrine Cells	Apocrine Cells	6
2	Basal-like Structured DCIS Cells	Basal-like Structured DCIS Cells	2
3	Endothelial Cells	Endothelial Cells	5
4	Luminal-like Amorphous DCIS Cells	Luminal-like Amorphous DCIS Cells	7
5	Macrophages	Macrophages	3
6	Plasma Cells	Plasma Cells	4

EXPANDED_CONTOUR_KEY = 'atera_structure_contours_expand50_voronoi'
if EXPANDED_CONTOUR_KEY in slide.shapes:
    del slide.shapes[EXPANDED_CONTOUR_KEY]
    slide.metadata.get('contours', {}).pop(EXPANDED_CONTOUR_KEY, None)

expand_contours(
    slide,
    contour_key=ANALYSIS_CONTOUR_KEY,
    output_key=EXPANDED_CONTOUR_KEY,
    distance=CONTOUR_EXPAND_DISTANCE_UM,
    mode='voronoi',
)

expanded_geometry_table = contour_frame_to_geometry_table(
    slide.shapes[EXPANDED_CONTOUR_KEY],
    contour_key=EXPANDED_CONTOUR_KEY,
)
expanded_geometry_table[['contour_id', 'assigned_structure', 'structure_id']]

	contour_id	assigned_structure	structure_id
0	11q13 Invasive Tumor Cells	11q13 Invasive Tumor Cells	1
1	Apocrine Cells	Apocrine Cells	6
2	Basal-like Structured DCIS Cells	Basal-like Structured DCIS Cells	2
3	Endothelial Cells	Endothelial Cells	5
4	Luminal-like Amorphous DCIS Cells	Luminal-like Amorphous DCIS Cells	7
5	Macrophages	Macrophages	3
6	Plasma Cells	Plasma Cells	4

6. Define the validated Atera marker programs#

The biology section uses the curated marker panel already validated for this sample. The shell-density helpers below are cache-first because full transcript profiling is the long-running part of the workflow.

MARKER_PROGRAMS = {
    'macrophage': ['C1QA', 'CD163', 'CSF1R', 'SIGLEC1', 'C3'],
    'plasma': ['IGHA1', 'IGHA2', 'JCHAIN', 'IGHM'],
    'vascular': ['CDH5', 'MMRN2', 'EPAS1', 'KDR', 'FLT1'],
    'basal_dcis': ['SOSTDC1', 'KLK5', 'ITGB6', 'DSC3', 'KRT23'],
    'apocrine': ['TAT', 'MYBPC1'],
    'luminal_amorphous': ['HSPB8', 'CLIC6', 'PIP', 'ESR1', 'PGR'],
    'invasive_stromal_rim': ['MMP11', 'COL11A1', 'POSTN', 'COL1A1', 'COL1A2'],
}
GENE_TO_PROGRAM = {
    gene: program
    for program, genes in MARKER_PROGRAMS.items()
    for gene in genes
}
SENTINEL_GENES = ['C1QA', 'IGHA1', 'CDH5', 'SOSTDC1', 'TAT', 'ESR1', 'POSTN']
all_marker_genes = sorted(GENE_TO_PROGRAM)
print('marker genes =', len(all_marker_genes))
print(all_marker_genes)

marker genes = 31
['C1QA', 'C3', 'CD163', 'CDH5', 'CLIC6', 'COL11A1', 'COL1A1', 'COL1A2', 'CSF1R', 'DSC3', 'EPAS1', 'ESR1', 'FLT1', 'HSPB8', 'IGHA1', 'IGHA2', 'IGHM', 'ITGB6', 'JCHAIN', 'KDR', 'KLK5', 'KRT23', 'MMP11', 'MMRN2', 'MYBPC1', 'PGR', 'PIP', 'POSTN', 'SIGLEC1', 'SOSTDC1', 'TAT']

def _density_cache_path(kind: str, gene: str) -> Path:
    return DENSITY_CACHE_DIR / f'{kind}_{gene}.parquet'


def _combine_cached_frames(paths: Iterable[Path]) -> pd.DataFrame | None:
    existing = [path for path in paths if path.exists()]
    if not existing:
        return None
    frames = [pd.read_parquet(path) for path in existing]
    return pd.concat(frames, ignore_index=True)


def load_or_run_ring_density_for_panel(
    slide,
    *,
    contour_key: str,
    genes: list[str],
    run_if_missing: bool,
) -> tuple[pd.DataFrame | None, list[str]]:
    frames = []
    missing = []
    for gene in genes:
        cache_path = _density_cache_path('ring', gene)
        if LOAD_CACHED_DENSITY and cache_path.exists():
            frames.append(pd.read_parquet(cache_path))
            continue
        if not run_if_missing:
            missing.append(gene)
            continue
        gene_frame = ring_density(
            slide,
            contour_key=contour_key,
            target='transcripts',
            target_query=TARGET_QUERY,
            feature_values=gene,
            inward=CONTOUR_EXPAND_DISTANCE_UM,
            outward=CONTOUR_EXPAND_DISTANCE_UM,
            ring_width=10.0,
        )
        gene_frame.to_parquet(cache_path, index=False)
        frames.append(gene_frame)
    if not frames:
        return None, missing
    return pd.concat(frames, ignore_index=True), missing


def load_or_run_profile_density_for_panel(
    slide,
    *,
    contour_key: str,
    genes: list[str],
    run_if_missing: bool,
) -> tuple[pd.DataFrame | None, list[str]]:
    frames = []
    missing = []
    for gene in genes:
        cache_path = _density_cache_path('profile', gene)
        if LOAD_CACHED_DENSITY and cache_path.exists():
            frames.append(pd.read_parquet(cache_path))
            continue
        if not run_if_missing:
            missing.append(gene)
            continue
        gene_frame = smooth_density_by_distance(
            slide,
            contour_key=contour_key,
            target='transcripts',
            target_query=TARGET_QUERY,
            feature_values=gene,
            inward=CONTOUR_EXPAND_DISTANCE_UM,
            outward=CONTOUR_EXPAND_DISTANCE_UM,
            bandwidth=10.0,
            grid_step=5.0,
        )
        gene_frame.to_parquet(cache_path, index=False)
        frames.append(gene_frame)
    if not frames:
        return None, missing
    return pd.concat(frames, ignore_index=True), missing

The next two cells are cache-first by design.

If cached parquet files already exist under density_cache/, they are loaded immediately.
If caches are missing and RUN_RING_DENSITY or RUN_PROFILE_DENSITY is False, the notebook reports which genes are still pending.
Set the flags to True when you want to launch the full shell-density run.

ring_density_table, missing_ring_genes = load_or_run_ring_density_for_panel(
    slide,
    contour_key=EXPANDED_CONTOUR_KEY,
    genes=all_marker_genes,
    run_if_missing=RUN_RING_DENSITY,
)
print('ring_density_table loaded:', ring_density_table is not None)
print('missing ring genes:', missing_ring_genes)

ring_density_table loaded: False
missing ring genes: ['C1QA', 'C3', 'CD163', 'CDH5', 'CLIC6', 'COL11A1', 'COL1A1', 'COL1A2', 'CSF1R', 'DSC3', 'EPAS1', 'ESR1', 'FLT1', 'HSPB8', 'IGHA1', 'IGHA2', 'IGHM', 'ITGB6', 'JCHAIN', 'KDR', 'KLK5', 'KRT23', 'MMP11', 'MMRN2', 'MYBPC1', 'PGR', 'PIP', 'POSTN', 'SIGLEC1', 'SOSTDC1', 'TAT']

profile_density_table, missing_profile_genes = load_or_run_profile_density_for_panel(
    slide,
    contour_key=EXPANDED_CONTOUR_KEY,
    genes=SENTINEL_GENES,
    run_if_missing=RUN_PROFILE_DENSITY,
)
print('profile_density_table loaded:', profile_density_table is not None)
print('missing profile genes:', missing_profile_genes)

profile_density_table loaded: False
missing profile genes: ['C1QA', 'IGHA1', 'CDH5', 'SOSTDC1', 'TAT', 'ESR1', 'POSTN']

7. Program-level ring summaries#

The scoring rule is:

Run ring_density per gene.
Average per-gene densities within each program for each contour and shell.
Summarize by assigned_structure.
Report outward-vs-inward enrichment.

if ring_density_table is None:
    display(Markdown(
        'Ring-density caches are not available yet. '
        'Set `RUN_RING_DENSITY = True` and rerun the previous cell to populate them.'
    ))
else:
    ring_density_table = ring_density_table.copy()
    ring_density_table['gene'] = ring_density_table['feature_values'].astype(str)
    ring_density_table['program'] = ring_density_table['gene'].map(GENE_TO_PROGRAM)

    program_ring = (
        ring_density_table.groupby(
            ['assigned_structure', 'contour_id', 'program', 'ring_start', 'ring_end', 'ring_mid'],
            as_index=False,
        )
        .agg(
            mean_density=('density', 'mean'),
            total_count=('count', 'sum'),
            mean_area=('area', 'mean'),
            n_genes=('gene', 'nunique'),
        )
    )
    inward = (
        program_ring.loc[program_ring['ring_mid'] < 0]
        .groupby(['assigned_structure', 'program'], as_index=False)['mean_density']
        .mean()
        .rename(columns={'mean_density': 'inward_density'})
    )
    outward = (
        program_ring.loc[program_ring['ring_mid'] > 0]
        .groupby(['assigned_structure', 'program'], as_index=False)['mean_density']
        .mean()
        .rename(columns={'mean_density': 'outward_density'})
    )
    program_enrichment = inward.merge(outward, on=['assigned_structure', 'program'], how='outer')
    program_enrichment['outward_minus_inward'] = (
        program_enrichment['outward_density'] - program_enrichment['inward_density']
    )
    program_enrichment['outward_over_inward'] = (
        program_enrichment['outward_density'] / program_enrichment['inward_density'].replace(0, np.nan)
    )

    program_ring.to_parquet(OUTPUT_ROOT / 'program_ring_density.parquet', index=False)
    program_enrichment.to_csv(OUTPUT_ROOT / 'program_outward_vs_inward.csv', index=False)
    display(program_enrichment.sort_values(['assigned_structure', 'outward_minus_inward'], ascending=[True, False]))

Ring-density caches are not available yet. Set RUN_RING_DENSITY = True and rerun the previous cell to populate them.

if ring_density_table is None:
    display(Markdown('Skip the heatmap until ring-density caches are available.'))
else:
    heatmap_data = program_enrichment.pivot(
        index='assigned_structure',
        columns='program',
        values='outward_minus_inward',
    )
    plt.figure(figsize=(10, 6))
    sns.heatmap(heatmap_data, cmap='coolwarm', center=0.0, linewidths=0.5)
    plt.title('Atera program enrichment: outward shell minus inward shell')
    plt.tight_layout()
    plt.savefig(FIGURE_DIR / 'atera_program_outward_minus_inward_heatmap.png', dpi=200)
    plt.show()

Skip the heatmap until ring-density caches are available.

8. Continuous signed-distance profiles for sentinel genes and programs#

The profile cell below keeps the sentinel list intentionally short. It is usually the most readable summary for manuscript-style figures because it exposes whether a marker peaks in the contour core, at the boundary, or in the outward stromal shell.

if profile_density_table is None:
    display(Markdown(
        'Profile caches are not available yet. '
        'Set `RUN_PROFILE_DENSITY = True` and rerun the profile cell to populate them.'
    ))
else:
    profile_density_table = profile_density_table.copy()
    profile_density_table['gene'] = profile_density_table['feature_values'].astype(str)
    profile_density_table['program'] = profile_density_table['gene'].map(GENE_TO_PROGRAM)

    program_profiles = (
        profile_density_table.groupby(
            ['assigned_structure', 'program', 'signed_distance'],
            as_index=False,
        )['density']
        .mean()
        .rename(columns={'density': 'program_density'})
    )
    program_profiles.to_parquet(OUTPUT_ROOT / 'program_signed_distance_profiles.parquet', index=False)

    sentinel_profiles = profile_density_table.loc[
        profile_density_table['gene'].isin(SENTINEL_GENES)
    ].copy()
    sentinel_profiles.to_parquet(OUTPUT_ROOT / 'sentinel_gene_profiles.parquet', index=False)

    display(program_profiles.head())
    display(sentinel_profiles.head())

Profile caches are not available yet. Set RUN_PROFILE_DENSITY = True and rerun the profile cell to populate them.

if profile_density_table is None:
    display(Markdown('Skip the sentinel profile figure until profile caches are available.'))
else:
    plot_table = (
        profile_density_table.loc[profile_density_table['gene'].isin(SENTINEL_GENES)]
        .groupby(['assigned_structure', 'gene', 'signed_distance'], as_index=False)['density']
        .mean()
    )
    g = sns.relplot(
        data=plot_table,
        x='signed_distance',
        y='density',
        kind='line',
        hue='gene',
        col='assigned_structure',
        col_wrap=3,
        facet_kws={'sharey': False},
        height=3.0,
        aspect=1.2,
    )
    g.set_titles('{col_name}')
    g.set_axis_labels('signed distance (um)', 'mean density')
    plt.savefig(FIGURE_DIR / 'atera_sentinel_signed_distance_profiles.png', dpi=200)
    plt.show()

Skip the sentinel profile figure until profile caches are available.

9. Interpreting the biology#

Once the density caches have been computed, the intended biological readout is straightforward:

Macrophages should show inward enrichment for C1QA, CD163, CSF1R, SIGLEC1, and C3.
Plasma Cells should localize immunoglobulin-heavy shells (IGHA1, IGHA2, JCHAIN, IGHM).
Endothelial Cells should anchor the vascular program (CDH5, MMRN2, EPAS1, KDR, FLT1).
Basal-like Structured DCIS Cells, Apocrine Cells, and Luminal-like Amorphous DCIS Cells should separate their lineage programs by contour core versus outward shell.
11q13 Invasive Tumor Cells is the structure where outward enrichment of POSTN, COL1A1, COL1A2, COL11A1, and MMP11 is most interpretable as a stromal rim signal.

Because the notebook saves both intermediate density parquet files and derived summary tables, it is easy to rerun only the final plotting and interpretation cells after the long transcript-shell computations have finished.