pyXenium.multimodal.rna_protein_cluster_analysis
- rna_protein_cluster_analysis(adata, *, n_clusters=12, n_pcs=30, cluster_key='rna_cluster', random_state=0, target_sum=1e4, min_cells_per_cluster=50, min_cells_per_group=20, protein_split_method='median', protein_quantile=0.75, test_size=0.2, hidden_layer_sizes=(64, 32), max_iter=200, early_stopping=True)
Joint RNA/protein analysis for Xenium AnnData objects.
The pipeline performs three consecutive steps:
- RNA preprocessing – library-size normalisation (counts per target_sum) followed by log1p. A TruncatedSVD is fitted to obtain n_pcs latent dimensions.
- Clustering – KMeans is applied on the latent representation to create n_clusters RNA-driven cell groups. Cluster assignments are stored in adata.obs[cluster_key] and the latent space in adata.obsm['X_rna_pca'].
- Protein explanation – for every cluster and every protein marker, the cells are divided into “high” vs. “low” groups (median split by default). A small neural network (MLPClassifier) is trained to predict the binary labels from the RNA latent features. The training/test accuracies and optional ROC-AUC are reported.
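The three steps can be sketched with NumPy and scikit-learn on synthetic data. This is a minimal illustration under stated assumptions, not the function's actual implementation: the random count matrix, the single synthetic protein, the choice of the largest cluster, the stratified split, and the use of StandardScaler before the network are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 600 cells x 40 genes, one protein marker.
counts = rng.poisson(lam=2.0, size=(600, 40)).astype(float)
protein = rng.gamma(shape=2.0, scale=1.0, size=600)

# Step 1: library-size normalisation to target_sum, log1p, then TruncatedSVD.
target_sum = 1e4
lib_size = counts.sum(axis=1, keepdims=True)
lib_size[lib_size == 0] = 1.0  # avoid division by zero for empty cells
log_norm = np.log1p(counts / lib_size * target_sum)
n_pcs = min(30, counts.shape[1] - 1)  # cap at n_genes - 1
latent = TruncatedSVD(n_components=n_pcs, random_state=0).fit_transform(log_norm)

# Step 2: KMeans on the latent representation.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(latent)

# Step 3: inside one cluster, median-split the protein and train a small MLP.
largest = np.bincount(labels).argmax()  # pick the largest cluster for the demo
in_cluster = labels == largest
X = latent[in_cluster]
y = (protein[in_cluster] > np.median(protein[in_cluster])).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)  # scaling step is an assumption
clf = MLPClassifier(
    hidden_layer_sizes=(64, 32), max_iter=200, early_stopping=True, random_state=0
)
clf.fit(scaler.transform(X_train), y_train)

train_acc = clf.score(scaler.transform(X_train), y_train)
test_acc = clf.score(scaler.transform(X_test), y_test)
test_auc = roc_auc_score(y_test, clf.predict_proba(scaler.transform(X_test))[:, 1])
print(f"train={train_acc:.2f} test={test_acc:.2f} auc={test_auc:.2f}")
```

Because the synthetic protein is independent of the counts, the test accuracy here should hover around chance; on real data, a score well above chance would suggest that the RNA latent space carries information about that protein within the cluster.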
Parameters
- adata:
AnnData object returned by pyXenium.multimodal.load_rna_protein_anndata(). Requires adata.layers['rna'] (or adata.X) and adata.obsm['protein'].
- n_clusters:
Number of RNA clusters to compute with KMeans.
- n_pcs:
Number of latent components extracted with TruncatedSVD. The value is automatically capped at n_genes - 1.
- cluster_key:
Column name added to adata.obs that stores cluster labels.
- random_state:
Seed for the SVD, KMeans and neural networks. Use None for random initialisation.
- target_sum:
Target library size after normalisation (Counts Per target_sum).
- min_cells_per_cluster:
Clusters with fewer cells are skipped entirely.
- min_cells_per_group:
Minimum number of cells required in both “high” and “low” protein groups to train a neural network.
- protein_split_method:
Either "median" (default) for a median split or "quantile" to keep only the top protein_quantile and bottom 1 - protein_quantile fractions of cells (discarding the middle portion).
- protein_quantile:
Quantile used when protein_split_method='quantile'.
- test_size:
Fraction of the cluster reserved for the test split when training the neural network.
- hidden_layer_sizes:
Hidden-layer configuration passed to MLPClassifier.
- max_iter:
Maximum number of training iterations for the neural network.
- early_stopping:
Whether to use early stopping in MLPClassifier.
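The two protein_split_method options and the min_cells_per_group check can be illustrated with a short sketch. The helpers split_high_low and has_enough_cells are hypothetical names introduced here, not part of pyXenium. The quantile branch assumes that cells at or above the protein_quantile quantile are labelled "high" and cells at or below the 1 - protein_quantile quantile are labelled "low"; the library's exact comparison rules may differ.

```python
import numpy as np


def split_high_low(values, method="median", quantile=0.75):
    """Hypothetical helper: return (keep_mask, labels), labels 1=high, 0=low."""
    values = np.asarray(values, dtype=float)
    if method == "median":
        # Every cell is kept; cells strictly above the median are "high".
        keep = np.ones(values.shape, dtype=bool)
        high = values > np.median(values)
    elif method == "quantile":
        # Assumed reading: keep only the two tails, drop the middle portion.
        upper = np.quantile(values, quantile)
        lower = np.quantile(values, 1.0 - quantile)
        high = values >= upper
        keep = high | (values <= lower)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return keep, high[keep].astype(int)


def has_enough_cells(labels, min_cells_per_group):
    """Hypothetical helper: both groups must reach the minimum size."""
    return bool(np.bincount(labels, minlength=2).min() >= min_cells_per_group)


values = np.arange(1, 21, dtype=float)  # toy protein intensities 1..20

keep_med, y_med = split_high_low(values, "median")
keep_q, y_q = split_high_low(values, "quantile", quantile=0.75)

print(keep_med.sum(), y_med.sum())  # 20 cells kept, 10 of them "high"
print(keep_q.sum(), y_q.sum())      # 10 cells kept, 5 of them "high"
print(has_enough_cells(y_q, min_cells_per_group=20))  # False
```

In this toy case the quantile split keeps only five cells per group, so a threshold of min_cells_per_group=20 would leave that cluster/protein pair without a trained model.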
Returns
- summary:
pandas.DataFrame summarising the trained models. Columns are ['cluster', 'protein', 'threshold', 'n_cells', 'n_high', 'n_low', 'train_accuracy', 'test_accuracy', 'test_auc'].
- models:
Nested dictionary {cluster -> {protein -> ProteinModelResult}} containing the fitted neural networks and scalers for downstream use.
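One way to work with the returned summary is to rank the cluster/protein pairs by held-out performance. The sketch below uses a hand-made placeholder table with the documented column names; the values and the protein names "A" and "B" are made up for illustration. It also assumes that a missing ROC-AUC appears as NaN, which the source does not confirm.

```python
import pandas as pd

# Placeholder rows (made up for illustration) with the documented columns.
summary = pd.DataFrame(
    {
        "cluster": ["cluster_0", "cluster_0", "cluster_1"],
        "protein": ["A", "B", "A"],
        "threshold": [0.5, 0.1, 0.7],
        "n_cells": [500, 500, 300],
        "n_high": [250, 250, 150],
        "n_low": [250, 250, 150],
        "train_accuracy": [0.90, 0.70, 0.85],
        "test_accuracy": [0.84, 0.55, 0.80],
        "test_auc": [0.91, 0.58, float("nan")],  # NaN = AUC not available (assumed)
    }
)

# Keep pairs whose held-out AUC is available and clears a chosen cut-off.
well_predicted = (
    summary.dropna(subset=["test_auc"])
    .query("test_auc >= 0.8")
    .sort_values("test_auc", ascending=False)
)

# For each cluster, the protein that the RNA latent space predicts best.
best_rows = summary.groupby("cluster")["test_accuracy"].idxmax()
best_per_cluster = summary.loc[best_rows, ["cluster", "protein", "test_accuracy"]]

print(well_predicted[["cluster", "protein", "test_auc"]])
print(best_per_cluster)
```

The cluster and protein values of a selected row can then be used as the two keys into the nested models dictionary, as in models[cluster][protein], to retrieve the corresponding ProteinModelResult.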
Examples
>>> from pyXenium.multimodal import rna_protein_cluster_analysis
>>> summary, models = rna_protein_cluster_analysis(adata, n_clusters=8)
>>> summary.head()
     cluster         protein  threshold  n_cells  ...  test_accuracy  test_auc
0  cluster_0      EPCAM (µm)   0.563100      512  ...           0.84      0.91
1  cluster_0  Podocin (µm^2)   0.118775      512  ...           0.79      0.87