pyXenium.contour Tutorial#

This notebook shows an end-to-end contour workflow for the Atera Xenium WTA FFPE breast sample:

  1. Read the Xenium export and inspect the Atera directory layout.

  2. Bridge numeric GraphClust IDs to the human-readable labels in WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv.

  3. Generate a HistoSeg-backed Xenium Explorer annotation export.

  4. Copy the generated export to xenium_explorer_annotations.generated.geojson without touching the legacy xenium_explorer_annotations.geojson.

  5. Import the generated GeoJSON into XeniumSlide.

  6. Derive structure-level analysis contours, expand them by 50 um with Voronoi partitioning, and prepare shell-density profiling.

  7. Run ring_density and smooth_density_by_distance with cache-first helpers over the curated Atera marker panel.

The notebook is written in English and is intended to be shipped as package source and linked from the Read the Docs workflows section.

from __future__ import annotations

from pathlib import Path
import json
import shutil
import sys
from typing import Iterable

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Markdown, display
from shapely import union_all

def _find_repo_root() -> Path:
    cwd = Path.cwd().resolve()
    for candidate in (cwd, *cwd.parents):
        if (candidate / 'src' / 'pyXenium').exists():
            return candidate
    return cwd

PROJECT_ROOT = _find_repo_root()
SRC_ROOT = PROJECT_ROOT / 'src'
if str(SRC_ROOT) not in sys.path:
    sys.path.insert(0, str(SRC_ROOT))

DATASET_ROOT = Path(r"Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_FFPE_Breast_Cancer_outs")
HISTOSEG_ROOT = Path(r"D:\GitHub\HistoSeg")
OUTPUT_ROOT = DATASET_ROOT / 'pyxenium_tutorial_outputs' / 'atera_contour_workflow'
DENSITY_CACHE_DIR = OUTPUT_ROOT / 'density_cache'
FIGURE_DIR = OUTPUT_ROOT / 'figures'
GENERATED_GEOJSON = DATASET_ROOT / 'xenium_explorer_annotations.generated.geojson'
LEGACY_GEOJSON = DATASET_ROOT / 'xenium_explorer_annotations.geojson'
GRAPHCLUST_RELPATH = Path('analysis/analysis/clustering/gene_expression_graphclust/clusters.csv')
CELL_GROUPS_PATH = DATASET_ROOT / 'WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv'
CELLS_PARQUET_RELPATH = Path('cells.parquet')
PIXEL_SIZE_UM = 0.2125
CONTOUR_EXPAND_DISTANCE_UM = 50.0
ANALYSIS_SIMPLIFY_TOLERANCE_UM = 5.0
MIN_CELLS_OVERRIDE = 250
RUN_RING_DENSITY = False
RUN_PROFILE_DENSITY = False
LOAD_CACHED_DENSITY = True
TARGET_QUERY = 'valid == True and quality_score >= 20'

from pyXenium.contour import (
    add_contours_from_geojson,
    expand_contours,
    generate_xenium_explorer_annotations,
    ring_density,
    smooth_density_by_distance,
)
from pyXenium.contour._geometry import contour_frame_to_geometry_table, geometry_table_to_contour_frame
from pyXenium.io import read_xenium

sns.set_theme(context='talk', style='whitegrid')
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
DENSITY_CACHE_DIR.mkdir(parents=True, exist_ok=True)
FIGURE_DIR.mkdir(parents=True, exist_ok=True)

1. Inspect the Atera Xenium export#

The Atera sample uses the official Xenium output layout, but two details matter for this tutorial:

  • The default GraphClust labels live under analysis/analysis/... instead of the single-layer analysis/... layout.

  • The readable cell-group labels and colors live in WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv.

We also keep the existing xenium_explorer_annotations.geojson untouched and write a new side-by-side file named xenium_explorer_annotations.generated.geojson.

experiment_metadata = json.loads((DATASET_ROOT / 'experiment.xenium').read_text(encoding='utf-8'))
layout_table = pd.DataFrame(
    {
        'path': [
            'analysis/analysis/clustering/gene_expression_graphclust/clusters.csv',
            'WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv',
            'cells.parquet',
            'cell_feature_matrix.h5',
            'transcripts.zarr.zip',
            'xenium_explorer_annotations.geojson',
            'xenium_explorer_annotations.generated.geojson',
        ],
        'exists': [
            (DATASET_ROOT / GRAPHCLUST_RELPATH).exists(),
            CELL_GROUPS_PATH.exists(),
            (DATASET_ROOT / CELLS_PARQUET_RELPATH).exists(),
            (DATASET_ROOT / 'cell_feature_matrix.h5').exists(),
            (DATASET_ROOT / 'transcripts.zarr.zip').exists(),
            LEGACY_GEOJSON.exists(),
            GENERATED_GEOJSON.exists(),
        ],
    }
)
display(layout_table)
print('pixel_size_um =', experiment_metadata.get('pixel_size'))
path exists
0 analysis/analysis/clustering/gene_expression_g... True
1 WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv True
2 cells.parquet True
3 cell_feature_matrix.h5 True
4 transcripts.zarr.zip True
5 xenium_explorer_annotations.geojson True
6 xenium_explorer_annotations.generated.geojson True
pixel_size_um = 0.2125

2. Bridge GraphClust IDs to readable Atera labels#

clusters.csv contains numeric GraphClust IDs, while WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv contains the named biological groups and their Explorer colors.

We build a one-row-per-cluster bridge table and then extract the 7 default structure panel used in this tutorial.

cluster_df = pd.read_csv(DATASET_ROOT / GRAPHCLUST_RELPATH)
cell_groups_df = pd.read_csv(CELL_GROUPS_PATH)
cluster_bridge_long = cluster_df.merge(
    cell_groups_df[['Barcode', 'Clusters', 'color']],
    on='Barcode',
    how='inner',
)
cluster_label_counts = cluster_bridge_long.groupby('Cluster')['Clusters'].nunique(dropna=True)
assert int(cluster_label_counts.max()) == 1, 'Each numeric cluster should map to exactly one readable label.'

cluster_bridge = (
    cluster_bridge_long.groupby('Cluster', as_index=False)
    .agg(
        structure_label=('Clusters', 'first'),
        color=('color', 'first'),
        n_cells=('Barcode', 'size'),
    )
    .sort_values('Cluster', kind='stable')
    .reset_index(drop=True)
)
display(cluster_bridge.head(20))
print('Mapped numeric clusters:', len(cluster_bridge))
Cluster structure_label color n_cells
0 1 11q13 Invasive Tumor Cells #E91E63 19796
1 2 11q13 Invasive Tumor Cells #E91E63 16784
2 3 CAFs, DCIS Associated #FFC107 15057
3 4 CAFs, DCIS Associated #FFC107 9385
4 5 Endothelial Cells #795548 8624
5 6 Basal-like Structured DCIS Cells #1AFE02 8285
6 7 11q13 Invasive Tumor Cells #E91E63 8028
7 8 11q13 Invasive Tumor Cells #E91E63 7353
8 9 Luminal-like Amorphous DCIS Cells #F943D2 7219
9 10 11q13 Invasive Tumor Cells #E91E63 6928
10 11 T Lymphocytes #1976D2 6146
11 12 Luminal-like Amorphous DCIS Cells #F943D2 5809
12 13 Pericytes #8D8D00 5296
13 14 11q13 Invasive Tumor Cells #E91E63 5147
14 15 Dendritic Cells #00BCD4 4008
15 16 CAFs, Invasive Associated #FF9800 4001
16 17 Myoepithelial Cells #2E7D32 3915
17 18 Pericytes #8D8D00 2791
18 19 CXCL14+ Fibroblasts #C2640E 2547
19 20 B Cells #6E7CFE 2512
Mapped numeric clusters: 34
DEFAULT_STRUCTURE_PANEL = [
    '11q13 Invasive Tumor Cells',
    'Basal-like Structured DCIS Cells',
    'Macrophages',
    'Plasma Cells',
    'Endothelial Cells',
    'Apocrine Cells',
    'Luminal-like Amorphous DCIS Cells',
]

selected_bridge = (
    cluster_bridge_long.loc[cluster_bridge_long['Clusters'].isin(DEFAULT_STRUCTURE_PANEL)]
    .groupby('Clusters', as_index=False)
    .agg(
        cluster_ids=('Cluster', lambda values: sorted(pd.Series(values).dropna().astype(int).unique().tolist())),
        color=('color', 'first'),
        n_cells=('Barcode', 'size'),
    )
)
selected_bridge['structure_id'] = selected_bridge['Clusters'].map(
    {name: idx for idx, name in enumerate(DEFAULT_STRUCTURE_PANEL, start=1)}
)
selected_bridge = selected_bridge.sort_values('structure_id', kind='stable').reset_index(drop=True)
selected_bridge
Clusters cluster_ids color n_cells structure_id
0 11q13 Invasive Tumor Cells [1, 2, 7, 8, 10, 14] #E91E63 64036 1
1 Basal-like Structured DCIS Cells [6, 32] #1AFE02 8760 2
2 Macrophages [24, 26] #512DA8 3861 3
3 Plasma Cells [30] #0F51BD 873 4
4 Endothelial Cells [5] #795548 8624 5
5 Apocrine Cells [33] #F48FB1 403 6
6 Luminal-like Amorphous DCIS Cells [9, 12] #F943D2 13028 7

Apocrine Cells has only 403 mapped cells in this sample. HistoSeg keeps min_cells=500 as its package default, so this notebook overrides the run-specific threshold to 250 in order to keep the full 7-group Atera panel in the generated export.

DEFAULT_STRUCTURES = [
    {
        'structure_name': row['Clusters'],
        'cluster_ids': row['cluster_ids'],
        'structure_color': row['color'],
        'structure_id': int(row['structure_id']),
    }
    for _, row in selected_bridge.iterrows()
]
DEFAULT_STRUCTURES
[{'structure_name': '11q13 Invasive Tumor Cells',
  'cluster_ids': [1, 2, 7, 8, 10, 14],
  'structure_color': '#E91E63',
  'structure_id': 1},
 {'structure_name': 'Basal-like Structured DCIS Cells',
  'cluster_ids': [6, 32],
  'structure_color': '#1AFE02',
  'structure_id': 2},
 {'structure_name': 'Macrophages',
  'cluster_ids': [24, 26],
  'structure_color': '#512DA8',
  'structure_id': 3},
 {'structure_name': 'Plasma Cells',
  'cluster_ids': [30],
  'structure_color': '#0F51BD',
  'structure_id': 4},
 {'structure_name': 'Endothelial Cells',
  'cluster_ids': [5],
  'structure_color': '#795548',
  'structure_id': 5},
 {'structure_name': 'Apocrine Cells',
  'cluster_ids': [33],
  'structure_color': '#F48FB1',
  'structure_id': 6},
 {'structure_name': 'Luminal-like Amorphous DCIS Cells',
  'cluster_ids': [9, 12],
  'structure_color': '#F943D2',
  'structure_id': 7}]

3. Generate a HistoSeg-backed Xenium Explorer export#

pyXenium.contour.generate_xenium_explorer_annotations(...) is a thin wrapper around the new HistoSeg multi-structure API. The generated files are written into pyxenium_tutorial_outputs/atera_contour_workflow/ and the GeoJSON is copied to xenium_explorer_annotations.generated.geojson beside the legacy file.

artifacts = generate_xenium_explorer_annotations(
    DATASET_ROOT,
    structures=DEFAULT_STRUCTURES,
    output_relpath=OUTPUT_ROOT.relative_to(DATASET_ROOT),
    clusters_relpath=GRAPHCLUST_RELPATH,
    cells_parquet_relpath=CELLS_PARQUET_RELPATH,
    histoseg_root=HISTOSEG_ROOT if HISTOSEG_ROOT.exists() else None,
    min_cells=MIN_CELLS_OVERRIDE,
)
shutil.copy2(artifacts['geojson'], GENERATED_GEOJSON)
artifacts
c:\Users\taobo.hu\AppData\Local\miniconda3\Lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
{'out_dir': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow',
 'geojson': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\xenium_explorer_annotations.geojson',
 'csv': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\xenium_explorer_annotations.csv',
 'summary': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\xenium_explorer_annotations_summary.csv',
 'preview_png': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\multi_structure_contour_preview.png',
 'partition_table': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\cells_with_structure_partition.parquet',
 'structure_count_csv': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\structure_contour_cell_counts.csv',
 'metrics_json': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\structure_contour_metrics.json'}
structure_count_table = pd.read_csv(OUTPUT_ROOT / 'structure_contour_cell_counts.csv')
metrics_payload = json.loads((OUTPUT_ROOT / 'structure_contour_metrics.json').read_text(encoding='utf-8'))
display(structure_count_table)
print('valid_structures =', metrics_payload['isoline_metrics']['valid_structures'])
print('skipped_structures =', metrics_payload['isoline_metrics']['skipped_structures'])
structure_id structure_name selected_cluster_ids input_cell_count assigned_cell_count contour_count
0 1 11q13 Invasive Tumor Cells 1, 2, 7, 8, 10, 14 64036 69851 110
1 2 Basal-like Structured DCIS Cells 6, 32 8760 13231 386
2 3 Macrophages 24, 26 3861 25865 516
3 4 Plasma Cells 30 873 7131 326
4 5 Endothelial Cells 5 8624 38357 544
5 6 Apocrine Cells 33 403 626 13
6 7 Luminal-like Amorphous DCIS Cells 9, 12 13028 14984 78
valid_structures = [1, 2, 3, 4, 5, 6, 7]
skipped_structures = []

Diagnostic comparison with the legacy GeoJSON#

This comparison is only a sanity check. The generated file is a named 7-group export, while the legacy file is a pre-existing reference artifact with a different structure definition. They are not expected to match byte-for-byte.

def geojson_digest(path: Path) -> dict[str, object]:
    payload = json.loads(path.read_text(encoding='utf-8'))
    features = payload.get('features', [])
    structures = sorted(
        {
            feature.get('properties', {}).get('assigned_structure')
            for feature in features
            if feature.get('properties', {}).get('assigned_structure') is not None
        }
    )
    return {
        'path': str(path),
        'n_features': len(features),
        'assigned_structures': structures,
    }

comparison_table = pd.DataFrame(
    [
        geojson_digest(LEGACY_GEOJSON),
        geojson_digest(GENERATED_GEOJSON),
    ]
)
comparison_table
path n_features assigned_structures
0 Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_... 300 [Structure 1, Structure 2, Structure 3, Struct...
1 Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_... 2606 [11q13 Invasive Tumor Cells, Apocrine Cells, B...

4. Read Xenium as XeniumSlide and import the generated contours#

We intentionally use stream_transcripts=True here to exercise the Atera transcript reader on the real transcripts.zarr.zip layout. On this sample, the streamed chunks do not expose cell_id, and that is now handled explicitly.

slide = read_xenium(
    str(DATASET_ROOT),
    as_='slide',
    prefer='auto',
    stream_transcripts=True,
)
print(slide.component_summary())
first_chunk = next(slide.point_sources['transcripts'].iter_chunks())
print('first streamed transcript columns:', first_chunk.columns.tolist())
first_chunk.head()
{'tables': ['cells'], 'points': ['transcripts'], 'shapes': ['cell_boundaries', 'nucleus_boundaries'], 'images': [], 'contour_images': [], 'labels': []}
first streamed transcript columns: ['x', 'y', 'gene_identity', 'gene_name', 'quality_score', 'valid', 'z', 'id_0', 'id_1', 'uuid_0', 'uuid_1', 'codeword_identity_0', 'codeword_identity_1']
x y gene_identity gene_name quality_score valid z id_0 id_1 uuid_0 uuid_1 codeword_identity_0 codeword_identity_1
0 4917.281250 5133.515625 0 A1BG 37.0 True 13.171875 933743 28 933743 65564 2233 4294967295
1 4920.468750 5134.812500 0 A1BG 40.0 True 13.046875 1095646 28 1095646 65564 2233 4294967295
2 4829.171875 5041.406250 0 A1BG 40.0 True 14.437500 1240538 28 1240538 65564 2233 4294967295
3 4815.625000 5087.046875 0 A1BG 25.0 True 17.171875 2189269 28 2189269 65564 2233 4294967295
4 4875.671875 5118.031250 0 A1BG 40.0 True 14.781250 139045 28 139045 65564 6803 4294967295
CONTOUR_KEY = 'atera_explorer_contours'
if CONTOUR_KEY in slide.shapes:
    del slide.shapes[CONTOUR_KEY]
    slide.metadata.get('contours', {}).pop(CONTOUR_KEY, None)

add_contours_from_geojson(
    slide,
    GENERATED_GEOJSON,
    key=CONTOUR_KEY,
    id_key='name',
    pixel_size_um=PIXEL_SIZE_UM,
)

sorted(slide.shapes[CONTOUR_KEY]['assigned_structure'].dropna().unique().tolist())
['11q13 Invasive Tumor Cells',
 'Apocrine Cells',
 'Basal-like Structured DCIS Cells',
 'Endothelial Cells',
 'Luminal-like Amorphous DCIS Cells',
 'Macrophages',
 'Plasma Cells']

5. Derive structure-level analysis contours and expand them by 50 um#

The Explorer export contains many polygon components. For shell profiling we dissolve polygons with the same assigned_structure into a single analysis contour per named structure and simplify them moderately. This keeps the original Explorer layer available for round-tripping while making downstream density analysis much more tractable.

def build_structure_level_analysis_contours(
    slide,
    *,
    source_key: str,
    output_key: str,
    simplify_tolerance_um: float = 5.0,
) -> pd.DataFrame:
    if output_key in slide.shapes:
        del slide.shapes[output_key]
        slide.metadata.get('contours', {}).pop(output_key, None)

    geometry_table = contour_frame_to_geometry_table(slide.shapes[source_key], contour_key=source_key)
    metadata_columns = [column for column in geometry_table.columns if column not in {'contour_id', 'geometry'}]
    rows = []
    for assigned_structure, group in geometry_table.groupby('assigned_structure', sort=True):
        geometry = union_all(list(group['geometry']))
        geometry = geometry.simplify(simplify_tolerance_um, preserve_topology=True).buffer(0)
        exemplar = group.iloc[0]
        row = {
            'contour_id': assigned_structure,
            'geometry': geometry,
        }
        for column in metadata_columns:
            if column in {'classification_name', 'assigned_structure'}:
                row[column] = assigned_structure
            else:
                row[column] = exemplar[column]
        rows.append(row)

    analysis_table = pd.DataFrame(rows)
    analysis_frame = geometry_table_to_contour_frame(analysis_table)
    slide.shapes[output_key] = analysis_frame
    slide.metadata.setdefault('contours', {})[output_key] = {
        'derived_from_key': source_key,
        'n_contours': int(analysis_table.shape[0]),
        'units': 'micron',
        'simplify_tolerance_um': float(simplify_tolerance_um),
    }
    slide._validate()
    return analysis_table

ANALYSIS_CONTOUR_KEY = 'atera_structure_contours'
analysis_contour_table = build_structure_level_analysis_contours(
    slide,
    source_key=CONTOUR_KEY,
    output_key=ANALYSIS_CONTOUR_KEY,
    simplify_tolerance_um=ANALYSIS_SIMPLIFY_TOLERANCE_UM,
)
analysis_contour_table[['contour_id', 'assigned_structure', 'structure_id']]
contour_id assigned_structure structure_id
0 11q13 Invasive Tumor Cells 11q13 Invasive Tumor Cells 1
1 Apocrine Cells Apocrine Cells 6
2 Basal-like Structured DCIS Cells Basal-like Structured DCIS Cells 2
3 Endothelial Cells Endothelial Cells 5
4 Luminal-like Amorphous DCIS Cells Luminal-like Amorphous DCIS Cells 7
5 Macrophages Macrophages 3
6 Plasma Cells Plasma Cells 4
EXPANDED_CONTOUR_KEY = 'atera_structure_contours_expand50_voronoi'
if EXPANDED_CONTOUR_KEY in slide.shapes:
    del slide.shapes[EXPANDED_CONTOUR_KEY]
    slide.metadata.get('contours', {}).pop(EXPANDED_CONTOUR_KEY, None)

expand_contours(
    slide,
    contour_key=ANALYSIS_CONTOUR_KEY,
    output_key=EXPANDED_CONTOUR_KEY,
    distance=CONTOUR_EXPAND_DISTANCE_UM,
    mode='voronoi',
)

expanded_geometry_table = contour_frame_to_geometry_table(
    slide.shapes[EXPANDED_CONTOUR_KEY],
    contour_key=EXPANDED_CONTOUR_KEY,
)
expanded_geometry_table[['contour_id', 'assigned_structure', 'structure_id']]
contour_id assigned_structure structure_id
0 11q13 Invasive Tumor Cells 11q13 Invasive Tumor Cells 1
1 Apocrine Cells Apocrine Cells 6
2 Basal-like Structured DCIS Cells Basal-like Structured DCIS Cells 2
3 Endothelial Cells Endothelial Cells 5
4 Luminal-like Amorphous DCIS Cells Luminal-like Amorphous DCIS Cells 7
5 Macrophages Macrophages 3
6 Plasma Cells Plasma Cells 4

6. Define the validated Atera marker programs#

The biology section uses the curated marker panel already validated for this sample. The shell-density helpers below are cache-first because full transcript profiling is the long-running part of the workflow.

MARKER_PROGRAMS = {
    'macrophage': ['C1QA', 'CD163', 'CSF1R', 'SIGLEC1', 'C3'],
    'plasma': ['IGHA1', 'IGHA2', 'JCHAIN', 'IGHM'],
    'vascular': ['CDH5', 'MMRN2', 'EPAS1', 'KDR', 'FLT1'],
    'basal_dcis': ['SOSTDC1', 'KLK5', 'ITGB6', 'DSC3', 'KRT23'],
    'apocrine': ['TAT', 'MYBPC1'],
    'luminal_amorphous': ['HSPB8', 'CLIC6', 'PIP', 'ESR1', 'PGR'],
    'invasive_stromal_rim': ['MMP11', 'COL11A1', 'POSTN', 'COL1A1', 'COL1A2'],
}
GENE_TO_PROGRAM = {
    gene: program
    for program, genes in MARKER_PROGRAMS.items()
    for gene in genes
}
SENTINEL_GENES = ['C1QA', 'IGHA1', 'CDH5', 'SOSTDC1', 'TAT', 'ESR1', 'POSTN']
all_marker_genes = sorted(GENE_TO_PROGRAM)
print('marker genes =', len(all_marker_genes))
print(all_marker_genes)
marker genes = 31
['C1QA', 'C3', 'CD163', 'CDH5', 'CLIC6', 'COL11A1', 'COL1A1', 'COL1A2', 'CSF1R', 'DSC3', 'EPAS1', 'ESR1', 'FLT1', 'HSPB8', 'IGHA1', 'IGHA2', 'IGHM', 'ITGB6', 'JCHAIN', 'KDR', 'KLK5', 'KRT23', 'MMP11', 'MMRN2', 'MYBPC1', 'PGR', 'PIP', 'POSTN', 'SIGLEC1', 'SOSTDC1', 'TAT']
def _density_cache_path(kind: str, gene: str) -> Path:
    return DENSITY_CACHE_DIR / f'{kind}_{gene}.parquet'


def _combine_cached_frames(paths: Iterable[Path]) -> pd.DataFrame | None:
    existing = [path for path in paths if path.exists()]
    if not existing:
        return None
    frames = [pd.read_parquet(path) for path in existing]
    return pd.concat(frames, ignore_index=True)


def load_or_run_ring_density_for_panel(
    slide,
    *,
    contour_key: str,
    genes: list[str],
    run_if_missing: bool,
) -> tuple[pd.DataFrame | None, list[str]]:
    frames = []
    missing = []
    for gene in genes:
        cache_path = _density_cache_path('ring', gene)
        if LOAD_CACHED_DENSITY and cache_path.exists():
            frames.append(pd.read_parquet(cache_path))
            continue
        if not run_if_missing:
            missing.append(gene)
            continue
        gene_frame = ring_density(
            slide,
            contour_key=contour_key,
            target='transcripts',
            target_query=TARGET_QUERY,
            feature_values=gene,
            inward=CONTOUR_EXPAND_DISTANCE_UM,
            outward=CONTOUR_EXPAND_DISTANCE_UM,
            ring_width=10.0,
        )
        gene_frame.to_parquet(cache_path, index=False)
        frames.append(gene_frame)
    if not frames:
        return None, missing
    return pd.concat(frames, ignore_index=True), missing


def load_or_run_profile_density_for_panel(
    slide,
    *,
    contour_key: str,
    genes: list[str],
    run_if_missing: bool,
) -> tuple[pd.DataFrame | None, list[str]]:
    frames = []
    missing = []
    for gene in genes:
        cache_path = _density_cache_path('profile', gene)
        if LOAD_CACHED_DENSITY and cache_path.exists():
            frames.append(pd.read_parquet(cache_path))
            continue
        if not run_if_missing:
            missing.append(gene)
            continue
        gene_frame = smooth_density_by_distance(
            slide,
            contour_key=contour_key,
            target='transcripts',
            target_query=TARGET_QUERY,
            feature_values=gene,
            inward=CONTOUR_EXPAND_DISTANCE_UM,
            outward=CONTOUR_EXPAND_DISTANCE_UM,
            bandwidth=10.0,
            grid_step=5.0,
        )
        gene_frame.to_parquet(cache_path, index=False)
        frames.append(gene_frame)
    if not frames:
        return None, missing
    return pd.concat(frames, ignore_index=True), missing

The next two cells are cache-first by design.

  • If cached parquet files already exist under density_cache/, they are loaded immediately.

  • If caches are missing and RUN_RING_DENSITY or RUN_PROFILE_DENSITY is False, the notebook reports which genes are still pending.

  • Set the flags to True when you want to launch the full shell-density run.

ring_density_table, missing_ring_genes = load_or_run_ring_density_for_panel(
    slide,
    contour_key=EXPANDED_CONTOUR_KEY,
    genes=all_marker_genes,
    run_if_missing=RUN_RING_DENSITY,
)
print('ring_density_table loaded:', ring_density_table is not None)
print('missing ring genes:', missing_ring_genes)
ring_density_table loaded: False
missing ring genes: ['C1QA', 'C3', 'CD163', 'CDH5', 'CLIC6', 'COL11A1', 'COL1A1', 'COL1A2', 'CSF1R', 'DSC3', 'EPAS1', 'ESR1', 'FLT1', 'HSPB8', 'IGHA1', 'IGHA2', 'IGHM', 'ITGB6', 'JCHAIN', 'KDR', 'KLK5', 'KRT23', 'MMP11', 'MMRN2', 'MYBPC1', 'PGR', 'PIP', 'POSTN', 'SIGLEC1', 'SOSTDC1', 'TAT']
profile_density_table, missing_profile_genes = load_or_run_profile_density_for_panel(
    slide,
    contour_key=EXPANDED_CONTOUR_KEY,
    genes=SENTINEL_GENES,
    run_if_missing=RUN_PROFILE_DENSITY,
)
print('profile_density_table loaded:', profile_density_table is not None)
print('missing profile genes:', missing_profile_genes)
profile_density_table loaded: False
missing profile genes: ['C1QA', 'IGHA1', 'CDH5', 'SOSTDC1', 'TAT', 'ESR1', 'POSTN']

7. Program-level ring summaries#

The scoring rule is:

  1. Run ring_density per gene.

  2. Average per-gene densities within each program for each contour and shell.

  3. Summarize by assigned_structure.

  4. Report outward-vs-inward enrichment.

if ring_density_table is None:
    display(Markdown(
        'Ring-density caches are not available yet. '
        'Set `RUN_RING_DENSITY = True` and rerun the previous cell to populate them.'
    ))
else:
    ring_density_table = ring_density_table.copy()
    ring_density_table['gene'] = ring_density_table['feature_values'].astype(str)
    ring_density_table['program'] = ring_density_table['gene'].map(GENE_TO_PROGRAM)

    program_ring = (
        ring_density_table.groupby(
            ['assigned_structure', 'contour_id', 'program', 'ring_start', 'ring_end', 'ring_mid'],
            as_index=False,
        )
        .agg(
            mean_density=('density', 'mean'),
            total_count=('count', 'sum'),
            mean_area=('area', 'mean'),
            n_genes=('gene', 'nunique'),
        )
    )
    inward = (
        program_ring.loc[program_ring['ring_mid'] < 0]
        .groupby(['assigned_structure', 'program'], as_index=False)['mean_density']
        .mean()
        .rename(columns={'mean_density': 'inward_density'})
    )
    outward = (
        program_ring.loc[program_ring['ring_mid'] > 0]
        .groupby(['assigned_structure', 'program'], as_index=False)['mean_density']
        .mean()
        .rename(columns={'mean_density': 'outward_density'})
    )
    program_enrichment = inward.merge(outward, on=['assigned_structure', 'program'], how='outer')
    program_enrichment['outward_minus_inward'] = (
        program_enrichment['outward_density'] - program_enrichment['inward_density']
    )
    program_enrichment['outward_over_inward'] = (
        program_enrichment['outward_density'] / program_enrichment['inward_density'].replace(0, np.nan)
    )

    program_ring.to_parquet(OUTPUT_ROOT / 'program_ring_density.parquet', index=False)
    program_enrichment.to_csv(OUTPUT_ROOT / 'program_outward_vs_inward.csv', index=False)
    display(program_enrichment.sort_values(['assigned_structure', 'outward_minus_inward'], ascending=[True, False]))

Ring-density caches are not available yet. Set RUN_RING_DENSITY = True and rerun the previous cell to populate them.

if ring_density_table is None:
    display(Markdown('Skip the heatmap until ring-density caches are available.'))
else:
    heatmap_data = program_enrichment.pivot(
        index='assigned_structure',
        columns='program',
        values='outward_minus_inward',
    )
    plt.figure(figsize=(10, 6))
    sns.heatmap(heatmap_data, cmap='coolwarm', center=0.0, linewidths=0.5)
    plt.title('Atera program enrichment: outward shell minus inward shell')
    plt.tight_layout()
    plt.savefig(FIGURE_DIR / 'atera_program_outward_minus_inward_heatmap.png', dpi=200)
    plt.show()

Skip the heatmap until ring-density caches are available.

8. Continuous signed-distance profiles for sentinel genes and programs#

The profile cell below keeps the sentinel list intentionally short. It is usually the most readable summary for manuscript-style figures because it exposes whether a marker peaks in the contour core, at the boundary, or in the outward stromal shell.

if profile_density_table is None:
    display(Markdown(
        'Profile caches are not available yet. '
        'Set `RUN_PROFILE_DENSITY = True` and rerun the profile cell to populate them.'
    ))
else:
    profile_density_table = profile_density_table.copy()
    profile_density_table['gene'] = profile_density_table['feature_values'].astype(str)
    profile_density_table['program'] = profile_density_table['gene'].map(GENE_TO_PROGRAM)

    program_profiles = (
        profile_density_table.groupby(
            ['assigned_structure', 'program', 'signed_distance'],
            as_index=False,
        )['density']
        .mean()
        .rename(columns={'density': 'program_density'})
    )
    program_profiles.to_parquet(OUTPUT_ROOT / 'program_signed_distance_profiles.parquet', index=False)

    sentinel_profiles = profile_density_table.loc[
        profile_density_table['gene'].isin(SENTINEL_GENES)
    ].copy()
    sentinel_profiles.to_parquet(OUTPUT_ROOT / 'sentinel_gene_profiles.parquet', index=False)

    display(program_profiles.head())
    display(sentinel_profiles.head())

Profile caches are not available yet. Set RUN_PROFILE_DENSITY = True and rerun the profile cell to populate them.

if profile_density_table is None:
    display(Markdown('Skip the sentinel profile figure until profile caches are available.'))
else:
    plot_table = (
        profile_density_table.loc[profile_density_table['gene'].isin(SENTINEL_GENES)]
        .groupby(['assigned_structure', 'gene', 'signed_distance'], as_index=False)['density']
        .mean()
    )
    g = sns.relplot(
        data=plot_table,
        x='signed_distance',
        y='density',
        kind='line',
        hue='gene',
        col='assigned_structure',
        col_wrap=3,
        facet_kws={'sharey': False},
        height=3.0,
        aspect=1.2,
    )
    g.set_titles('{col_name}')
    g.set_axis_labels('signed distance (um)', 'mean density')
    plt.savefig(FIGURE_DIR / 'atera_sentinel_signed_distance_profiles.png', dpi=200)
    plt.show()

Skip the sentinel profile figure until profile caches are available.

9. Interpreting the biology#

Once the density caches have been computed, the intended biological readout is straightforward:

  • Macrophages should show inward enrichment for C1QA, CD163, CSF1R, SIGLEC1, and C3.

  • Plasma Cells should localize immunoglobulin-heavy shells (IGHA1, IGHA2, JCHAIN, IGHM).

  • Endothelial Cells should anchor the vascular program (CDH5, MMRN2, EPAS1, KDR, FLT1).

  • Basal-like Structured DCIS Cells, Apocrine Cells, and Luminal-like Amorphous DCIS Cells should separate their lineage programs by contour core versus outward shell.

  • 11q13 Invasive Tumor Cells is the structure where outward enrichment of POSTN, COL1A1, COL1A2, COL11A1, and MMP11 is most interpretable as a stromal rim signal.

Because the notebook saves both intermediate density parquet files and derived summary tables, it is easy to rerun only the final plotting and interpretation cells after the long transcript-shell computations have finished.