Atera WTA Breast CCI Benchmark
pyXenium ships a dedicated benchmark scaffold for comparing cell-cell interaction methods on the Atera Xenium WTA breast dataset used by the CCI tutorial.
What This Workflow Produces
- a frozen AnnData bundle for the full dataset and a stratified smoke subset
- sparse cross-language matrices and metadata tables for external Python and R methods
- a shared CCI resource table for common-db benchmarking
- a standardized output schema across methods
- biology-oriented summaries for canonical recovery, pathway relevance, spatial coherence, robustness, and novelty support
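A stratified smoke subset can be pictured as sampling a fixed fraction of cells within each group, so that rare populations survive the downsampling. The following is a minimal sketch under stated assumptions: the function name, the `min_per_group` floor, and the choice of a cell-type label as the stratification key are all hypothetical and are not taken from the pyXenium implementation.

```python
import math
import random
from collections import defaultdict


def stratified_subset(labels, fraction, min_per_group=1, seed=0):
    """Return sorted indices that sample `fraction` of every label group.

    Hypothetical helper: the real benchmark may stratify differently.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)

    chosen = []
    for label in sorted(groups):
        members = groups[label]
        # Keep at least `min_per_group` cells, but never more than exist.
        target = max(min_per_group, math.ceil(len(members) * fraction))
        chosen.extend(rng.sample(members, min(target, len(members))))
    return sorted(chosen)


# Illustrative labels only; these are not counts from the Atera dataset.
labels = ["tumor"] * 80 + ["t_cell"] * 15 + ["mast"] * 5
subset = stratified_subset(labels, fraction=0.25)
```

Sampling per group rather than globally is what keeps small groups represented in the smoke run.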
Benchmark Root
The benchmark workspace lives under:
benchmarking/cci_2026_atera/
This directory contains:
- configs/ for method registry and biology scoring panels
- envs/ for one environment manifest per method
- scripts/ for preparation, environment creation, aggregation, report rendering, and A100 staging
- runners/ for method-side adapters
Dataset Panel
The benchmark now has an explicit manuscript dataset panel in
benchmarking/cci_2026_atera/configs/datasets.yaml.
- atera_breast_wta: primary Atera Xenium WTA breast discovery and validation dataset.
- atera_cervical_wta: Atera Xenium WTA cervical cancer cross-tissue generalization dataset at Y:\long\10X_datasets\Xenium\Atera\WTA_Preview_FFPE_Cervical_Cancer_outs.
- public_non_xenium_spatial: placeholder for the required public CosMx/MERSCOPE/MERFISH/Visium HD cross-platform dataset.
Prepare or dry-run a dataset-specific bundle with:
pyxenium benchmark atera-cci prepare-dataset --dataset-id atera_cervical_wta --dry-run
pyxenium benchmark atera-cci prepare-dataset --dataset-id atera_cervical_wta --skip-full-h5ad
CLI Entry Points
Prepare the shared input bundle:
pyxenium benchmark atera-cci prepare
Prepare a full sparse bundle locally without requiring a full .h5ad:
pyxenium benchmark atera-cci prepare --skip-full-h5ad
Run the built-in pyXenium smoke benchmark:
pyxenium benchmark atera-cci smoke-pyxenium
Dry-run one method adapter:
pyxenium benchmark atera-cci run-method --method squidpy --database-mode common-db --dry-run
Run the first-wave core smoke panel:
pyxenium benchmark atera-cci smoke-core --methods pyxenium,squidpy,liana,commot,cellchat --database-mode common-db
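Before launching the smoke panel, it can be useful to dry-run each adapter in turn. The sketch below only builds and prints the command lines, using the `run-method` flags shown above. It assumes that `run-method` accepts the same method names that `smoke-core` lists, which this page does not confirm for every method.

```python
import shlex

# Method names copied from the smoke-core example; whether run-method
# accepts each of them is an assumption.
METHODS = ["pyxenium", "squidpy", "liana", "commot", "cellchat"]


def dry_run_command(method, database_mode="common-db"):
    """Build the argv list for a dry run of one method adapter."""
    return [
        "pyxenium", "benchmark", "atera-cci", "run-method",
        "--method", method,
        "--database-mode", database_mode,
        "--dry-run",
    ]


for method in METHODS:
    # Only print the command; nothing is executed here.
    print(shlex.join(dry_run_command(method)))
```

To actually execute a command, the list returned by `dry_run_command` can be passed to `subprocess.run(..., check=True)` inside the matching per-method environment.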
Aggregate standardized results:
pyxenium benchmark atera-cci aggregate
Render a markdown report:
pyxenium benchmark atera-cci report
Generate A100 staging commands:
pyxenium benchmark atera-cci stage-a100 --plan-only \
--remote-xenium-root /mnt/taobo.hu/long/10X_datasets/Xenium/Atera/WTA_Preview_FFPE_Breast_Cancer_outs \
--remote-root /data/taobo.hu/pyxenium_cci_benchmark_2026-04
Build and dry-run the A100 full common-db plan. The plan includes a prepare_full_bundle job that reads from the read-only /mnt Xenium export and writes all bundle/runs/logs/reports under /data/taobo.hu/pyxenium_cci_benchmark_2026-04:
pyxenium benchmark atera-cci prepare-a100-bundle --phase full --database-mode common-db \
--remote-xenium-root /mnt/taobo.hu/long/10X_datasets/Xenium/Atera/WTA_Preview_FFPE_Breast_Cancer_outs \
--remote-root /data/taobo.hu/pyxenium_cci_benchmark_2026-04
pyxenium benchmark atera-cci run-a100-plan --plan-json benchmarking/cci_2026_atera/logs/a100_bundle_plan.json
Generate A100 result recovery commands:
pyxenium benchmark atera-cci collect-a100-results --host <host> --user <user>
Practical Note
The full Xenium matrix is exported as sparse Matrix Market rather than a dense TSV because the full matrix is too large to move or parse safely as a dense text file. The benchmark prep still emits meta.tsv, coords.tsv, genes.tsv, barcodes.tsv, and the shared CCI resource tables expected by the method adapters.
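For readers unfamiliar with the format, a coordinate Matrix Market file stores only the non-zero entries as 1-based `row column value` triplets after a banner line and a size line. The standard-library sketch below parses that layout to show how little structure is involved. The file name `matrix.mtx` and the tiny demo matrix are hypothetical, and the orientation of the real export (genes by cells or cells by genes) is not stated here. In practice `scipy.io.mmread` is the usual way to load such a file.

```python
import os
import tempfile


def read_matrix_market(path):
    """Return (shape, entries) with entries as {(row, col): value}, 0-based.

    Handles only integer/real 'matrix coordinate' files with general symmetry.
    """
    shape = None
    entries = {}
    with open(path, encoding="utf-8") as handle:
        banner = [token.lower() for token in handle.readline().split()]
        if banner[:3] != ["%%matrixmarket", "matrix", "coordinate"]:
            raise ValueError("expected a coordinate Matrix Market file")
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # skip comment and blank lines
            fields = line.split()
            if shape is None:
                # First data line: number of rows, columns, stored entries.
                shape = (int(fields[0]), int(fields[1]))
                continue
            # Indices in the file are 1-based; convert to 0-based.
            entries[(int(fields[0]) - 1, int(fields[1]) - 1)] = float(fields[2])
    return shape, entries


# Tiny hypothetical 3 x 2 matrix with two non-zero counts.
demo = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "% demo only\n"
    "3 2 2\n"
    "1 1 5\n"
    "3 2 7\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "matrix.mtx")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(demo)
    shape, entries = read_matrix_market(demo_path)
```

Because only non-zero entries are written, the file size scales with the number of detected transcripts per cell rather than with the full cells-by-genes grid.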
The first-wave real adapter contract covers pyXenium, Squidpy ligrec, LIANA+ spatial bivariate, COMMOT, and CellChat v3 / SpatialCellChat. Third-party package installation remains isolated per method environment; missing packages should fail inside the method run with a reproducible run_summary.json rather than changing the shared schema.
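The "fail inside the method run" behavior can be sketched as a wrapper that always writes `run_summary.json`, whether or not the method's package imports. This is an illustration only: the function name and the `method`, `status`, and `error` fields are hypothetical and do not reflect the benchmark's actual summary schema.

```python
import importlib
import json
from pathlib import Path


def run_with_summary(method, required_package, run_dir, runner):
    """Run `runner()` and always write run_summary.json into `run_dir`.

    Field names are hypothetical, not the benchmark's real schema.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    summary = {"method": method, "status": "success", "error": None}
    try:
        # A missing third-party package surfaces here, inside the run.
        importlib.import_module(required_package)
        runner()
    except Exception as exc:
        summary["status"] = "failed"
        summary["error"] = f"{type(exc).__name__}: {exc}"
    summary_path = run_dir / "run_summary.json"
    summary_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return summary
```

Recording the failure in the per-run summary keeps the shared output schema untouched, so aggregation can report the method as failed instead of crashing.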
Use the declared per-method environments rather than a base Python environment. In particular, the Squidpy environment pins zarr<3 because current ome-zarr-heavy imaging stacks can fail to import against incompatible zarr releases.
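A quick way to confirm that the active environment honors the `zarr<3` pin is to inspect the installed distribution version. The helper names below are hypothetical, and the check reads only the major version, which is a simplification of full version-specifier matching.

```python
from importlib import metadata


def zarr_pin_satisfied(version):
    """Return True when a version string such as '2.18.3' satisfies zarr<3."""
    major = int(version.split(".")[0])
    return major < 3


def installed_zarr_ok():
    """Check the active environment; None means zarr is not installed."""
    try:
        return zarr_pin_satisfied(metadata.version("zarr"))
    except metadata.PackageNotFoundError:
        return None
```

Running `installed_zarr_ok()` in a base environment versus the declared Squidpy environment makes a mismatch visible before any imaging import fails.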
A100 orchestration writes a portable stage/job manifest and never stores passwords. The A100 source/destination split is explicit: /mnt/taobo.hu/long/10X_datasets/Xenium/Atera/WTA_Preview_FFPE_Breast_Cancer_outs is read-only input, while /data/taobo.hu/pyxenium_cci_benchmark_2026-04 is the only writable benchmark root. The report step automatically includes run status, engineering reproducibility, canonical pair rank matrix, and A100 resource summary when the corresponding run summaries or A100 plan exist.
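The source/destination split can be enforced with a small path guard that rejects any output location outside the writable root. This is a sketch of the idea, not the orchestration code: the helper name is hypothetical, and the check is purely lexical (it does not resolve symlinks on the remote host).

```python
from pathlib import PurePosixPath

# Paths copied from the text above.
READ_ONLY_INPUT = PurePosixPath(
    "/mnt/taobo.hu/long/10X_datasets/Xenium/Atera/WTA_Preview_FFPE_Breast_Cancer_outs"
)
WRITABLE_ROOT = PurePosixPath("/data/taobo.hu/pyxenium_cci_benchmark_2026-04")


def is_under(path, root):
    """Return True when `path` lies lexically inside `root`."""
    candidate = PurePosixPath(path)
    if ".." in candidate.parts:
        return False  # refuse paths that could climb out of the root
    try:
        candidate.relative_to(root)
    except ValueError:
        return False
    return True
```

A staging script could call `is_under(output_path, WRITABLE_ROOT)` before scheduling each job and abort when it returns `False`.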
Nature Methods Readiness Gates
The reviewer-facing benchmark is tracked in
benchmarking/cci_2026_atera/configs/publication_readiness.yaml. Before final
submission, TopoLink-CCI should have at least three real datasets, ten comparison
methods, one topology-preserving synthetic truth panel, five false-positive
control layers, and five stratified bootstrap repeats. The public method should
report CCI_score as a discovery score and reserve cci_pvalue, cci_fdr,
null_z, downstream support, cross-method consensus, and robustness for
orthogonal validation.
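
The numeric gates above lend themselves to a simple automated check. The sketch below encodes the stated minimums and reports which ones are unmet. The key names are hypothetical and are not the keys used in publication_readiness.yaml, and the observed counts are illustrative rather than the project's actual status.

```python
# Minimums taken from the paragraph above; key names are hypothetical.
MINIMUMS = {
    "real_datasets": 3,
    "comparison_methods": 10,
    "synthetic_truth_panels": 1,
    "false_positive_control_layers": 5,
    "bootstrap_repeats": 5,
}


def unmet_gates(observed, minimums=MINIMUMS):
    """Return {gate: (observed, required)} for every gate below its minimum."""
    return {
        gate: (observed.get(gate, 0), required)
        for gate, required in minimums.items()
        if observed.get(gate, 0) < required
    }


# Illustrative counts only.
example = {
    "real_datasets": 2,
    "comparison_methods": 5,
    "synthetic_truth_panels": 1,
    "false_positive_control_layers": 5,
    "bootstrap_repeats": 5,
}
missing = unmet_gates(example)
```

An empty result from `unmet_gates` would indicate that every counted gate is satisfied; qualitative requirements such as how `CCI_score` is reported still need manual review.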