pyXenium.contour Tutorial#
This notebook shows an end-to-end contour workflow for the Atera Xenium WTA FFPE breast sample:
Read the Xenium export and inspect the Atera directory layout.
Bridge numeric GraphClust IDs to the human-readable labels in
WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv.Generate a HistoSeg-backed Xenium Explorer annotation export.
Copy the generated export to
xenium_explorer_annotations.generated.geojsonwithout touching the legacyxenium_explorer_annotations.geojson.Import the generated GeoJSON into
XeniumSlide.Derive structure-level analysis contours, expand them by 50 um with Voronoi partitioning, and prepare shell-density profiling.
Run
ring_densityandsmooth_density_by_distancewith cache-first helpers over the curated Atera marker panel.
The notebook is written in English and is intended to be shipped as package source and linked from the Read the Docs workflows section.
from __future__ import annotations
from pathlib import Path
import json
import shutil
import sys
from typing import Iterable
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Markdown, display
from shapely import union_all
def _find_repo_root() -> Path:
cwd = Path.cwd().resolve()
for candidate in (cwd, *cwd.parents):
if (candidate / 'src' / 'pyXenium').exists():
return candidate
return cwd
PROJECT_ROOT = _find_repo_root()
SRC_ROOT = PROJECT_ROOT / 'src'
if str(SRC_ROOT) not in sys.path:
sys.path.insert(0, str(SRC_ROOT))
DATASET_ROOT = Path(r"Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_FFPE_Breast_Cancer_outs")
HISTOSEG_ROOT = Path(r"D:\GitHub\HistoSeg")
OUTPUT_ROOT = DATASET_ROOT / 'pyxenium_tutorial_outputs' / 'atera_contour_workflow'
DENSITY_CACHE_DIR = OUTPUT_ROOT / 'density_cache'
FIGURE_DIR = OUTPUT_ROOT / 'figures'
GENERATED_GEOJSON = DATASET_ROOT / 'xenium_explorer_annotations.generated.geojson'
LEGACY_GEOJSON = DATASET_ROOT / 'xenium_explorer_annotations.geojson'
GRAPHCLUST_RELPATH = Path('analysis/analysis/clustering/gene_expression_graphclust/clusters.csv')
CELL_GROUPS_PATH = DATASET_ROOT / 'WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv'
CELLS_PARQUET_RELPATH = Path('cells.parquet')
PIXEL_SIZE_UM = 0.2125
CONTOUR_EXPAND_DISTANCE_UM = 50.0
ANALYSIS_SIMPLIFY_TOLERANCE_UM = 5.0
MIN_CELLS_OVERRIDE = 250
RUN_RING_DENSITY = False
RUN_PROFILE_DENSITY = False
LOAD_CACHED_DENSITY = True
TARGET_QUERY = 'valid == True and quality_score >= 20'
from pyXenium.contour import (
add_contours_from_geojson,
expand_contours,
generate_xenium_explorer_annotations,
ring_density,
smooth_density_by_distance,
)
from pyXenium.contour._geometry import contour_frame_to_geometry_table, geometry_table_to_contour_frame
from pyXenium.io import read_xenium
sns.set_theme(context='talk', style='whitegrid')
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
DENSITY_CACHE_DIR.mkdir(parents=True, exist_ok=True)
FIGURE_DIR.mkdir(parents=True, exist_ok=True)
1. Inspect the Atera Xenium export#
The Atera sample uses the official Xenium output layout, but two details matter for this tutorial:
The default GraphClust labels live under
analysis/analysis/...instead of the single-layeranalysis/...layout.The readable cell-group labels and colors live in
WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv.
We also keep the existing xenium_explorer_annotations.geojson untouched and write a new side-by-side file named xenium_explorer_annotations.generated.geojson.
experiment_metadata = json.loads((DATASET_ROOT / 'experiment.xenium').read_text(encoding='utf-8'))
layout_table = pd.DataFrame(
{
'path': [
'analysis/analysis/clustering/gene_expression_graphclust/clusters.csv',
'WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv',
'cells.parquet',
'cell_feature_matrix.h5',
'transcripts.zarr.zip',
'xenium_explorer_annotations.geojson',
'xenium_explorer_annotations.generated.geojson',
],
'exists': [
(DATASET_ROOT / GRAPHCLUST_RELPATH).exists(),
CELL_GROUPS_PATH.exists(),
(DATASET_ROOT / CELLS_PARQUET_RELPATH).exists(),
(DATASET_ROOT / 'cell_feature_matrix.h5').exists(),
(DATASET_ROOT / 'transcripts.zarr.zip').exists(),
LEGACY_GEOJSON.exists(),
GENERATED_GEOJSON.exists(),
],
}
)
display(layout_table)
print('pixel_size_um =', experiment_metadata.get('pixel_size'))
| path | exists | |
|---|---|---|
| 0 | analysis/analysis/clustering/gene_expression_g... | True |
| 1 | WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv | True |
| 2 | cells.parquet | True |
| 3 | cell_feature_matrix.h5 | True |
| 4 | transcripts.zarr.zip | True |
| 5 | xenium_explorer_annotations.geojson | True |
| 6 | xenium_explorer_annotations.generated.geojson | True |
pixel_size_um = 0.2125
2. Bridge GraphClust IDs to readable Atera labels#
clusters.csv contains numeric GraphClust IDs, while WTA_Preview_FFPE_Breast_Cancer_cell_groups.csv contains the named biological groups and their Explorer colors.
We build a one-row-per-cluster bridge table and then extract the 7 default structure panel used in this tutorial.
cluster_df = pd.read_csv(DATASET_ROOT / GRAPHCLUST_RELPATH)
cell_groups_df = pd.read_csv(CELL_GROUPS_PATH)
cluster_bridge_long = cluster_df.merge(
cell_groups_df[['Barcode', 'Clusters', 'color']],
on='Barcode',
how='inner',
)
cluster_label_counts = cluster_bridge_long.groupby('Cluster')['Clusters'].nunique(dropna=True)
assert int(cluster_label_counts.max()) == 1, 'Each numeric cluster should map to exactly one readable label.'
cluster_bridge = (
cluster_bridge_long.groupby('Cluster', as_index=False)
.agg(
structure_label=('Clusters', 'first'),
color=('color', 'first'),
n_cells=('Barcode', 'size'),
)
.sort_values('Cluster', kind='stable')
.reset_index(drop=True)
)
display(cluster_bridge.head(20))
print('Mapped numeric clusters:', len(cluster_bridge))
| Cluster | structure_label | color | n_cells | |
|---|---|---|---|---|
| 0 | 1 | 11q13 Invasive Tumor Cells | #E91E63 | 19796 |
| 1 | 2 | 11q13 Invasive Tumor Cells | #E91E63 | 16784 |
| 2 | 3 | CAFs, DCIS Associated | #FFC107 | 15057 |
| 3 | 4 | CAFs, DCIS Associated | #FFC107 | 9385 |
| 4 | 5 | Endothelial Cells | #795548 | 8624 |
| 5 | 6 | Basal-like Structured DCIS Cells | #1AFE02 | 8285 |
| 6 | 7 | 11q13 Invasive Tumor Cells | #E91E63 | 8028 |
| 7 | 8 | 11q13 Invasive Tumor Cells | #E91E63 | 7353 |
| 8 | 9 | Luminal-like Amorphous DCIS Cells | #F943D2 | 7219 |
| 9 | 10 | 11q13 Invasive Tumor Cells | #E91E63 | 6928 |
| 10 | 11 | T Lymphocytes | #1976D2 | 6146 |
| 11 | 12 | Luminal-like Amorphous DCIS Cells | #F943D2 | 5809 |
| 12 | 13 | Pericytes | #8D8D00 | 5296 |
| 13 | 14 | 11q13 Invasive Tumor Cells | #E91E63 | 5147 |
| 14 | 15 | Dendritic Cells | #00BCD4 | 4008 |
| 15 | 16 | CAFs, Invasive Associated | #FF9800 | 4001 |
| 16 | 17 | Myoepithelial Cells | #2E7D32 | 3915 |
| 17 | 18 | Pericytes | #8D8D00 | 2791 |
| 18 | 19 | CXCL14+ Fibroblasts | #C2640E | 2547 |
| 19 | 20 | B Cells | #6E7CFE | 2512 |
Mapped numeric clusters: 34
DEFAULT_STRUCTURE_PANEL = [
'11q13 Invasive Tumor Cells',
'Basal-like Structured DCIS Cells',
'Macrophages',
'Plasma Cells',
'Endothelial Cells',
'Apocrine Cells',
'Luminal-like Amorphous DCIS Cells',
]
selected_bridge = (
cluster_bridge_long.loc[cluster_bridge_long['Clusters'].isin(DEFAULT_STRUCTURE_PANEL)]
.groupby('Clusters', as_index=False)
.agg(
cluster_ids=('Cluster', lambda values: sorted(pd.Series(values).dropna().astype(int).unique().tolist())),
color=('color', 'first'),
n_cells=('Barcode', 'size'),
)
)
selected_bridge['structure_id'] = selected_bridge['Clusters'].map(
{name: idx for idx, name in enumerate(DEFAULT_STRUCTURE_PANEL, start=1)}
)
selected_bridge = selected_bridge.sort_values('structure_id', kind='stable').reset_index(drop=True)
selected_bridge
| Clusters | cluster_ids | color | n_cells | structure_id | |
|---|---|---|---|---|---|
| 0 | 11q13 Invasive Tumor Cells | [1, 2, 7, 8, 10, 14] | #E91E63 | 64036 | 1 |
| 1 | Basal-like Structured DCIS Cells | [6, 32] | #1AFE02 | 8760 | 2 |
| 2 | Macrophages | [24, 26] | #512DA8 | 3861 | 3 |
| 3 | Plasma Cells | [30] | #0F51BD | 873 | 4 |
| 4 | Endothelial Cells | [5] | #795548 | 8624 | 5 |
| 5 | Apocrine Cells | [33] | #F48FB1 | 403 | 6 |
| 6 | Luminal-like Amorphous DCIS Cells | [9, 12] | #F943D2 | 13028 | 7 |
Apocrine Cells has only 403 mapped cells in this sample. HistoSeg keeps min_cells=500 as its package default, so this notebook overrides the run-specific threshold to 250 in order to keep the full 7-group Atera panel in the generated export.
DEFAULT_STRUCTURES = [
{
'structure_name': row['Clusters'],
'cluster_ids': row['cluster_ids'],
'structure_color': row['color'],
'structure_id': int(row['structure_id']),
}
for _, row in selected_bridge.iterrows()
]
DEFAULT_STRUCTURES
[{'structure_name': '11q13 Invasive Tumor Cells',
'cluster_ids': [1, 2, 7, 8, 10, 14],
'structure_color': '#E91E63',
'structure_id': 1},
{'structure_name': 'Basal-like Structured DCIS Cells',
'cluster_ids': [6, 32],
'structure_color': '#1AFE02',
'structure_id': 2},
{'structure_name': 'Macrophages',
'cluster_ids': [24, 26],
'structure_color': '#512DA8',
'structure_id': 3},
{'structure_name': 'Plasma Cells',
'cluster_ids': [30],
'structure_color': '#0F51BD',
'structure_id': 4},
{'structure_name': 'Endothelial Cells',
'cluster_ids': [5],
'structure_color': '#795548',
'structure_id': 5},
{'structure_name': 'Apocrine Cells',
'cluster_ids': [33],
'structure_color': '#F48FB1',
'structure_id': 6},
{'structure_name': 'Luminal-like Amorphous DCIS Cells',
'cluster_ids': [9, 12],
'structure_color': '#F943D2',
'structure_id': 7}]
3. Generate a HistoSeg-backed Xenium Explorer export#
pyXenium.contour.generate_xenium_explorer_annotations(...) is a thin wrapper around the new HistoSeg multi-structure API. The generated files are written into pyxenium_tutorial_outputs/atera_contour_workflow/ and the GeoJSON is copied to xenium_explorer_annotations.generated.geojson beside the legacy file.
artifacts = generate_xenium_explorer_annotations(
DATASET_ROOT,
structures=DEFAULT_STRUCTURES,
output_relpath=OUTPUT_ROOT.relative_to(DATASET_ROOT),
clusters_relpath=GRAPHCLUST_RELPATH,
cells_parquet_relpath=CELLS_PARQUET_RELPATH,
histoseg_root=HISTOSEG_ROOT if HISTOSEG_ROOT.exists() else None,
min_cells=MIN_CELLS_OVERRIDE,
)
shutil.copy2(artifacts['geojson'], GENERATED_GEOJSON)
artifacts
c:\Users\taobo.hu\AppData\Local\miniconda3\Lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
{'out_dir': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow',
'geojson': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\xenium_explorer_annotations.geojson',
'csv': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\xenium_explorer_annotations.csv',
'summary': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\xenium_explorer_annotations_summary.csv',
'preview_png': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\multi_structure_contour_preview.png',
'partition_table': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\cells_with_structure_partition.parquet',
'structure_count_csv': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\structure_contour_cell_counts.csv',
'metrics_json': '\\\\storage3.ad.scilifelab.se\\mn-moldia\\long\\10X_datasets\\Xenium\\Atera\\WTA_Preview_FFPE_Breast_Cancer_outs\\pyxenium_tutorial_outputs\\atera_contour_workflow\\structure_contour_metrics.json'}
structure_count_table = pd.read_csv(OUTPUT_ROOT / 'structure_contour_cell_counts.csv')
metrics_payload = json.loads((OUTPUT_ROOT / 'structure_contour_metrics.json').read_text(encoding='utf-8'))
display(structure_count_table)
print('valid_structures =', metrics_payload['isoline_metrics']['valid_structures'])
print('skipped_structures =', metrics_payload['isoline_metrics']['skipped_structures'])
| structure_id | structure_name | selected_cluster_ids | input_cell_count | assigned_cell_count | contour_count | |
|---|---|---|---|---|---|---|
| 0 | 1 | 11q13 Invasive Tumor Cells | 1, 2, 7, 8, 10, 14 | 64036 | 69851 | 110 |
| 1 | 2 | Basal-like Structured DCIS Cells | 6, 32 | 8760 | 13231 | 386 |
| 2 | 3 | Macrophages | 24, 26 | 3861 | 25865 | 516 |
| 3 | 4 | Plasma Cells | 30 | 873 | 7131 | 326 |
| 4 | 5 | Endothelial Cells | 5 | 8624 | 38357 | 544 |
| 5 | 6 | Apocrine Cells | 33 | 403 | 626 | 13 |
| 6 | 7 | Luminal-like Amorphous DCIS Cells | 9, 12 | 13028 | 14984 | 78 |
valid_structures = [1, 2, 3, 4, 5, 6, 7]
skipped_structures = []
Diagnostic comparison with the legacy GeoJSON#
This comparison is only a sanity check. The generated file is a named 7-group export, while the legacy file is a pre-existing reference artifact with a different structure definition. They are not expected to match byte-for-byte.
def geojson_digest(path: Path) -> dict[str, object]:
payload = json.loads(path.read_text(encoding='utf-8'))
features = payload.get('features', [])
structures = sorted(
{
feature.get('properties', {}).get('assigned_structure')
for feature in features
if feature.get('properties', {}).get('assigned_structure') is not None
}
)
return {
'path': str(path),
'n_features': len(features),
'assigned_structures': structures,
}
comparison_table = pd.DataFrame(
[
geojson_digest(LEGACY_GEOJSON),
geojson_digest(GENERATED_GEOJSON),
]
)
comparison_table
| path | n_features | assigned_structures | |
|---|---|---|---|
| 0 | Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_... | 300 | [Structure 1, Structure 2, Structure 3, Struct... |
| 1 | Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_... | 2606 | [11q13 Invasive Tumor Cells, Apocrine Cells, B... |
4. Read Xenium as XeniumSlide and import the generated contours#
We intentionally use stream_transcripts=True here to exercise the Atera transcript reader on the real transcripts.zarr.zip layout. On this sample, the streamed chunks do not expose cell_id, and that is now handled explicitly.
slide = read_xenium(
str(DATASET_ROOT),
as_='slide',
prefer='auto',
stream_transcripts=True,
)
print(slide.component_summary())
first_chunk = next(slide.point_sources['transcripts'].iter_chunks())
print('first streamed transcript columns:', first_chunk.columns.tolist())
first_chunk.head()
{'tables': ['cells'], 'points': ['transcripts'], 'shapes': ['cell_boundaries', 'nucleus_boundaries'], 'images': [], 'contour_images': [], 'labels': []}
first streamed transcript columns: ['x', 'y', 'gene_identity', 'gene_name', 'quality_score', 'valid', 'z', 'id_0', 'id_1', 'uuid_0', 'uuid_1', 'codeword_identity_0', 'codeword_identity_1']
| x | y | gene_identity | gene_name | quality_score | valid | z | id_0 | id_1 | uuid_0 | uuid_1 | codeword_identity_0 | codeword_identity_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4917.281250 | 5133.515625 | 0 | A1BG | 37.0 | True | 13.171875 | 933743 | 28 | 933743 | 65564 | 2233 | 4294967295 |
| 1 | 4920.468750 | 5134.812500 | 0 | A1BG | 40.0 | True | 13.046875 | 1095646 | 28 | 1095646 | 65564 | 2233 | 4294967295 |
| 2 | 4829.171875 | 5041.406250 | 0 | A1BG | 40.0 | True | 14.437500 | 1240538 | 28 | 1240538 | 65564 | 2233 | 4294967295 |
| 3 | 4815.625000 | 5087.046875 | 0 | A1BG | 25.0 | True | 17.171875 | 2189269 | 28 | 2189269 | 65564 | 2233 | 4294967295 |
| 4 | 4875.671875 | 5118.031250 | 0 | A1BG | 40.0 | True | 14.781250 | 139045 | 28 | 139045 | 65564 | 6803 | 4294967295 |
CONTOUR_KEY = 'atera_explorer_contours'
if CONTOUR_KEY in slide.shapes:
del slide.shapes[CONTOUR_KEY]
slide.metadata.get('contours', {}).pop(CONTOUR_KEY, None)
add_contours_from_geojson(
slide,
GENERATED_GEOJSON,
key=CONTOUR_KEY,
id_key='name',
pixel_size_um=PIXEL_SIZE_UM,
)
sorted(slide.shapes[CONTOUR_KEY]['assigned_structure'].dropna().unique().tolist())
['11q13 Invasive Tumor Cells',
'Apocrine Cells',
'Basal-like Structured DCIS Cells',
'Endothelial Cells',
'Luminal-like Amorphous DCIS Cells',
'Macrophages',
'Plasma Cells']
5. Derive structure-level analysis contours and expand them by 50 um#
The Explorer export contains many polygon components. For shell profiling we dissolve polygons with the same assigned_structure into a single analysis contour per named structure and simplify them moderately. This keeps the original Explorer layer available for round-tripping while making downstream density analysis much more tractable.
def build_structure_level_analysis_contours(
slide,
*,
source_key: str,
output_key: str,
simplify_tolerance_um: float = 5.0,
) -> pd.DataFrame:
if output_key in slide.shapes:
del slide.shapes[output_key]
slide.metadata.get('contours', {}).pop(output_key, None)
geometry_table = contour_frame_to_geometry_table(slide.shapes[source_key], contour_key=source_key)
metadata_columns = [column for column in geometry_table.columns if column not in {'contour_id', 'geometry'}]
rows = []
for assigned_structure, group in geometry_table.groupby('assigned_structure', sort=True):
geometry = union_all(list(group['geometry']))
geometry = geometry.simplify(simplify_tolerance_um, preserve_topology=True).buffer(0)
exemplar = group.iloc[0]
row = {
'contour_id': assigned_structure,
'geometry': geometry,
}
for column in metadata_columns:
if column in {'classification_name', 'assigned_structure'}:
row[column] = assigned_structure
else:
row[column] = exemplar[column]
rows.append(row)
analysis_table = pd.DataFrame(rows)
analysis_frame = geometry_table_to_contour_frame(analysis_table)
slide.shapes[output_key] = analysis_frame
slide.metadata.setdefault('contours', {})[output_key] = {
'derived_from_key': source_key,
'n_contours': int(analysis_table.shape[0]),
'units': 'micron',
'simplify_tolerance_um': float(simplify_tolerance_um),
}
slide._validate()
return analysis_table
ANALYSIS_CONTOUR_KEY = 'atera_structure_contours'
analysis_contour_table = build_structure_level_analysis_contours(
slide,
source_key=CONTOUR_KEY,
output_key=ANALYSIS_CONTOUR_KEY,
simplify_tolerance_um=ANALYSIS_SIMPLIFY_TOLERANCE_UM,
)
analysis_contour_table[['contour_id', 'assigned_structure', 'structure_id']]
| contour_id | assigned_structure | structure_id | |
|---|---|---|---|
| 0 | 11q13 Invasive Tumor Cells | 11q13 Invasive Tumor Cells | 1 |
| 1 | Apocrine Cells | Apocrine Cells | 6 |
| 2 | Basal-like Structured DCIS Cells | Basal-like Structured DCIS Cells | 2 |
| 3 | Endothelial Cells | Endothelial Cells | 5 |
| 4 | Luminal-like Amorphous DCIS Cells | Luminal-like Amorphous DCIS Cells | 7 |
| 5 | Macrophages | Macrophages | 3 |
| 6 | Plasma Cells | Plasma Cells | 4 |
EXPANDED_CONTOUR_KEY = 'atera_structure_contours_expand50_voronoi'
if EXPANDED_CONTOUR_KEY in slide.shapes:
del slide.shapes[EXPANDED_CONTOUR_KEY]
slide.metadata.get('contours', {}).pop(EXPANDED_CONTOUR_KEY, None)
expand_contours(
slide,
contour_key=ANALYSIS_CONTOUR_KEY,
output_key=EXPANDED_CONTOUR_KEY,
distance=CONTOUR_EXPAND_DISTANCE_UM,
mode='voronoi',
)
expanded_geometry_table = contour_frame_to_geometry_table(
slide.shapes[EXPANDED_CONTOUR_KEY],
contour_key=EXPANDED_CONTOUR_KEY,
)
expanded_geometry_table[['contour_id', 'assigned_structure', 'structure_id']]
| contour_id | assigned_structure | structure_id | |
|---|---|---|---|
| 0 | 11q13 Invasive Tumor Cells | 11q13 Invasive Tumor Cells | 1 |
| 1 | Apocrine Cells | Apocrine Cells | 6 |
| 2 | Basal-like Structured DCIS Cells | Basal-like Structured DCIS Cells | 2 |
| 3 | Endothelial Cells | Endothelial Cells | 5 |
| 4 | Luminal-like Amorphous DCIS Cells | Luminal-like Amorphous DCIS Cells | 7 |
| 5 | Macrophages | Macrophages | 3 |
| 6 | Plasma Cells | Plasma Cells | 4 |
6. Define the validated Atera marker programs#
The biology section uses the curated marker panel already validated for this sample. The shell-density helpers below are cache-first because full transcript profiling is the long-running part of the workflow.
MARKER_PROGRAMS = {
'macrophage': ['C1QA', 'CD163', 'CSF1R', 'SIGLEC1', 'C3'],
'plasma': ['IGHA1', 'IGHA2', 'JCHAIN', 'IGHM'],
'vascular': ['CDH5', 'MMRN2', 'EPAS1', 'KDR', 'FLT1'],
'basal_dcis': ['SOSTDC1', 'KLK5', 'ITGB6', 'DSC3', 'KRT23'],
'apocrine': ['TAT', 'MYBPC1'],
'luminal_amorphous': ['HSPB8', 'CLIC6', 'PIP', 'ESR1', 'PGR'],
'invasive_stromal_rim': ['MMP11', 'COL11A1', 'POSTN', 'COL1A1', 'COL1A2'],
}
GENE_TO_PROGRAM = {
gene: program
for program, genes in MARKER_PROGRAMS.items()
for gene in genes
}
SENTINEL_GENES = ['C1QA', 'IGHA1', 'CDH5', 'SOSTDC1', 'TAT', 'ESR1', 'POSTN']
all_marker_genes = sorted(GENE_TO_PROGRAM)
print('marker genes =', len(all_marker_genes))
print(all_marker_genes)
marker genes = 31
['C1QA', 'C3', 'CD163', 'CDH5', 'CLIC6', 'COL11A1', 'COL1A1', 'COL1A2', 'CSF1R', 'DSC3', 'EPAS1', 'ESR1', 'FLT1', 'HSPB8', 'IGHA1', 'IGHA2', 'IGHM', 'ITGB6', 'JCHAIN', 'KDR', 'KLK5', 'KRT23', 'MMP11', 'MMRN2', 'MYBPC1', 'PGR', 'PIP', 'POSTN', 'SIGLEC1', 'SOSTDC1', 'TAT']
def _density_cache_path(kind: str, gene: str) -> Path:
return DENSITY_CACHE_DIR / f'{kind}_{gene}.parquet'
def _combine_cached_frames(paths: Iterable[Path]) -> pd.DataFrame | None:
existing = [path for path in paths if path.exists()]
if not existing:
return None
frames = [pd.read_parquet(path) for path in existing]
return pd.concat(frames, ignore_index=True)
def load_or_run_ring_density_for_panel(
slide,
*,
contour_key: str,
genes: list[str],
run_if_missing: bool,
) -> tuple[pd.DataFrame | None, list[str]]:
frames = []
missing = []
for gene in genes:
cache_path = _density_cache_path('ring', gene)
if LOAD_CACHED_DENSITY and cache_path.exists():
frames.append(pd.read_parquet(cache_path))
continue
if not run_if_missing:
missing.append(gene)
continue
gene_frame = ring_density(
slide,
contour_key=contour_key,
target='transcripts',
target_query=TARGET_QUERY,
feature_values=gene,
inward=CONTOUR_EXPAND_DISTANCE_UM,
outward=CONTOUR_EXPAND_DISTANCE_UM,
ring_width=10.0,
)
gene_frame.to_parquet(cache_path, index=False)
frames.append(gene_frame)
if not frames:
return None, missing
return pd.concat(frames, ignore_index=True), missing
def load_or_run_profile_density_for_panel(
slide,
*,
contour_key: str,
genes: list[str],
run_if_missing: bool,
) -> tuple[pd.DataFrame | None, list[str]]:
frames = []
missing = []
for gene in genes:
cache_path = _density_cache_path('profile', gene)
if LOAD_CACHED_DENSITY and cache_path.exists():
frames.append(pd.read_parquet(cache_path))
continue
if not run_if_missing:
missing.append(gene)
continue
gene_frame = smooth_density_by_distance(
slide,
contour_key=contour_key,
target='transcripts',
target_query=TARGET_QUERY,
feature_values=gene,
inward=CONTOUR_EXPAND_DISTANCE_UM,
outward=CONTOUR_EXPAND_DISTANCE_UM,
bandwidth=10.0,
grid_step=5.0,
)
gene_frame.to_parquet(cache_path, index=False)
frames.append(gene_frame)
if not frames:
return None, missing
return pd.concat(frames, ignore_index=True), missing
The next two cells are cache-first by design.
If cached parquet files already exist under
density_cache/, they are loaded immediately.If caches are missing and
RUN_RING_DENSITYorRUN_PROFILE_DENSITYisFalse, the notebook reports which genes are still pending.Set the flags to
Truewhen you want to launch the full shell-density run.
ring_density_table, missing_ring_genes = load_or_run_ring_density_for_panel(
slide,
contour_key=EXPANDED_CONTOUR_KEY,
genes=all_marker_genes,
run_if_missing=RUN_RING_DENSITY,
)
print('ring_density_table loaded:', ring_density_table is not None)
print('missing ring genes:', missing_ring_genes)
ring_density_table loaded: False
missing ring genes: ['C1QA', 'C3', 'CD163', 'CDH5', 'CLIC6', 'COL11A1', 'COL1A1', 'COL1A2', 'CSF1R', 'DSC3', 'EPAS1', 'ESR1', 'FLT1', 'HSPB8', 'IGHA1', 'IGHA2', 'IGHM', 'ITGB6', 'JCHAIN', 'KDR', 'KLK5', 'KRT23', 'MMP11', 'MMRN2', 'MYBPC1', 'PGR', 'PIP', 'POSTN', 'SIGLEC1', 'SOSTDC1', 'TAT']
profile_density_table, missing_profile_genes = load_or_run_profile_density_for_panel(
slide,
contour_key=EXPANDED_CONTOUR_KEY,
genes=SENTINEL_GENES,
run_if_missing=RUN_PROFILE_DENSITY,
)
print('profile_density_table loaded:', profile_density_table is not None)
print('missing profile genes:', missing_profile_genes)
profile_density_table loaded: False
missing profile genes: ['C1QA', 'IGHA1', 'CDH5', 'SOSTDC1', 'TAT', 'ESR1', 'POSTN']
7. Program-level ring summaries#
The scoring rule is:
Run
ring_densityper gene.Average per-gene densities within each program for each contour and shell.
Summarize by
assigned_structure.Report outward-vs-inward enrichment.
if ring_density_table is None:
display(Markdown(
'Ring-density caches are not available yet. '
'Set `RUN_RING_DENSITY = True` and rerun the previous cell to populate them.'
))
else:
ring_density_table = ring_density_table.copy()
ring_density_table['gene'] = ring_density_table['feature_values'].astype(str)
ring_density_table['program'] = ring_density_table['gene'].map(GENE_TO_PROGRAM)
program_ring = (
ring_density_table.groupby(
['assigned_structure', 'contour_id', 'program', 'ring_start', 'ring_end', 'ring_mid'],
as_index=False,
)
.agg(
mean_density=('density', 'mean'),
total_count=('count', 'sum'),
mean_area=('area', 'mean'),
n_genes=('gene', 'nunique'),
)
)
inward = (
program_ring.loc[program_ring['ring_mid'] < 0]
.groupby(['assigned_structure', 'program'], as_index=False)['mean_density']
.mean()
.rename(columns={'mean_density': 'inward_density'})
)
outward = (
program_ring.loc[program_ring['ring_mid'] > 0]
.groupby(['assigned_structure', 'program'], as_index=False)['mean_density']
.mean()
.rename(columns={'mean_density': 'outward_density'})
)
program_enrichment = inward.merge(outward, on=['assigned_structure', 'program'], how='outer')
program_enrichment['outward_minus_inward'] = (
program_enrichment['outward_density'] - program_enrichment['inward_density']
)
program_enrichment['outward_over_inward'] = (
program_enrichment['outward_density'] / program_enrichment['inward_density'].replace(0, np.nan)
)
program_ring.to_parquet(OUTPUT_ROOT / 'program_ring_density.parquet', index=False)
program_enrichment.to_csv(OUTPUT_ROOT / 'program_outward_vs_inward.csv', index=False)
display(program_enrichment.sort_values(['assigned_structure', 'outward_minus_inward'], ascending=[True, False]))
Ring-density caches are not available yet. Set RUN_RING_DENSITY = True and rerun the previous cell to populate them.
if ring_density_table is None:
display(Markdown('Skip the heatmap until ring-density caches are available.'))
else:
heatmap_data = program_enrichment.pivot(
index='assigned_structure',
columns='program',
values='outward_minus_inward',
)
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, cmap='coolwarm', center=0.0, linewidths=0.5)
plt.title('Atera program enrichment: outward shell minus inward shell')
plt.tight_layout()
plt.savefig(FIGURE_DIR / 'atera_program_outward_minus_inward_heatmap.png', dpi=200)
plt.show()
Skip the heatmap until ring-density caches are available.
8. Continuous signed-distance profiles for sentinel genes and programs#
The profile cell below keeps the sentinel list intentionally short. It is usually the most readable summary for manuscript-style figures because it exposes whether a marker peaks in the contour core, at the boundary, or in the outward stromal shell.
if profile_density_table is None:
display(Markdown(
'Profile caches are not available yet. '
'Set `RUN_PROFILE_DENSITY = True` and rerun the profile cell to populate them.'
))
else:
profile_density_table = profile_density_table.copy()
profile_density_table['gene'] = profile_density_table['feature_values'].astype(str)
profile_density_table['program'] = profile_density_table['gene'].map(GENE_TO_PROGRAM)
program_profiles = (
profile_density_table.groupby(
['assigned_structure', 'program', 'signed_distance'],
as_index=False,
)['density']
.mean()
.rename(columns={'density': 'program_density'})
)
program_profiles.to_parquet(OUTPUT_ROOT / 'program_signed_distance_profiles.parquet', index=False)
sentinel_profiles = profile_density_table.loc[
profile_density_table['gene'].isin(SENTINEL_GENES)
].copy()
sentinel_profiles.to_parquet(OUTPUT_ROOT / 'sentinel_gene_profiles.parquet', index=False)
display(program_profiles.head())
display(sentinel_profiles.head())
Profile caches are not available yet. Set RUN_PROFILE_DENSITY = True and rerun the profile cell to populate them.
if profile_density_table is None:
display(Markdown('Skip the sentinel profile figure until profile caches are available.'))
else:
plot_table = (
profile_density_table.loc[profile_density_table['gene'].isin(SENTINEL_GENES)]
.groupby(['assigned_structure', 'gene', 'signed_distance'], as_index=False)['density']
.mean()
)
g = sns.relplot(
data=plot_table,
x='signed_distance',
y='density',
kind='line',
hue='gene',
col='assigned_structure',
col_wrap=3,
facet_kws={'sharey': False},
height=3.0,
aspect=1.2,
)
g.set_titles('{col_name}')
g.set_axis_labels('signed distance (um)', 'mean density')
plt.savefig(FIGURE_DIR / 'atera_sentinel_signed_distance_profiles.png', dpi=200)
plt.show()
Skip the sentinel profile figure until profile caches are available.
9. Interpreting the biology#
Once the density caches have been computed, the intended biological readout is straightforward:
Macrophagesshould show inward enrichment forC1QA,CD163,CSF1R,SIGLEC1, andC3.Plasma Cellsshould localize immunoglobulin-heavy shells (IGHA1,IGHA2,JCHAIN,IGHM).Endothelial Cellsshould anchor the vascular program (CDH5,MMRN2,EPAS1,KDR,FLT1).Basal-like Structured DCIS Cells,Apocrine Cells, andLuminal-like Amorphous DCIS Cellsshould separate their lineage programs by contour core versus outward shell.11q13 Invasive Tumor Cellsis the structure where outward enrichment ofPOSTN,COL1A1,COL1A2,COL11A1, andMMP11is most interpretable as a stromal rim signal.
Because the notebook saves both intermediate density parquet files and derived summary tables, it is easy to rerun only the final plotting and interpretation cells after the long transcript-shell computations have finished.