pyXenium.multimodal.rna_protein_cluster_analysis

rna_protein_cluster_analysis(adata, *, n_clusters=12, n_pcs=30, cluster_key='rna_cluster', random_state=0, target_sum=1e4, min_cells_per_cluster=50, min_cells_per_group=20, protein_split_method='median', protein_quantile=0.75, test_size=0.2, hidden_layer_sizes=(64, 32), max_iter=200, early_stopping=True)

Joint RNA/protein analysis for Xenium AnnData objects.

The pipeline performs three consecutive steps:

1. RNA preprocessing – library-size normalisation (counts per target_sum) followed by log1p. A TruncatedSVD is fitted to obtain n_pcs latent dimensions.
2. Clustering – KMeans is applied on the latent representation to create n_clusters RNA-driven cell groups. Cluster assignments are stored in adata.obs[cluster_key] and the latent space in adata.obsm['X_rna_pca'].
3. Protein explanation – for every cluster and every protein marker, the cells are divided into “high” vs. “low” groups (median split by default). A small neural network (MLPClassifier) is trained to predict the binary labels from the RNA latent features. The training/test accuracies and optional ROC-AUC are reported.
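
The three steps above can be approximated with plain NumPy and scikit-learn. The following is a minimal sketch on synthetic data, not pyXenium's actual implementation: the random count matrix, the single synthetic protein, the use of StandardScaler, the stratified split, and the choice of the largest cluster are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 600 cells x 40 genes, one protein per cell.
counts = rng.poisson(2.0, size=(600, 40)).astype(float)
protein = rng.gamma(2.0, 1.0, size=600)

# Step 1: library-size normalisation to target_sum, log1p, then TruncatedSVD.
target_sum, n_pcs = 1e4, 30
lib_size = counts.sum(axis=1, keepdims=True)
lib_size[lib_size == 0] = 1.0  # guard against empty cells
log_norm = np.log1p(counts / lib_size * target_sum)
n_components = min(n_pcs, counts.shape[1] - 1)  # cap at n_genes - 1
latent = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(log_norm)

# Step 2: KMeans on the latent representation.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(latent)

# Step 3: inside one cluster, median-split the protein and fit a small MLP.
cluster_id = np.bincount(labels).argmax()  # largest cluster, for a stable demo
in_cluster = labels == cluster_id
X = latent[in_cluster]
y = (protein[in_cluster] > np.median(protein[in_cluster])).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)  # fitted on the training split only
clf = MLPClassifier(
    hidden_layer_sizes=(64, 32), max_iter=200, early_stopping=True, random_state=0
)
clf.fit(scaler.transform(X_train), y_train)

train_accuracy = clf.score(scaler.transform(X_train), y_train)
test_accuracy = clf.score(scaler.transform(X_test), y_test)
test_auc = roc_auc_score(y_test, clf.predict_proba(scaler.transform(X_test))[:, 1])
```

Because the synthetic protein is independent of the counts, the accuracies here should hover around chance; on real data, a test accuracy or AUC well above 0.5 suggests the RNA latent space carries information about that protein within the cluster.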

Parameters

adata: AnnData object returned by pyXenium.multimodal.load_rna_protein_anndata(). Requires adata.layers['rna'] (or adata.X) and adata.obsm['protein'].
n_clusters: Number of RNA clusters to compute with KMeans.
n_pcs: Number of latent components extracted with TruncatedSVD. The value is automatically capped at n_genes - 1.
cluster_key: Name of the column added to adata.obs that stores the cluster labels.
random_state: Seed for the SVD, KMeans and neural networks. Use None for random initialisation.
target_sum: Target library size after normalisation (counts per target_sum).
min_cells_per_cluster: Clusters with fewer than this many cells are skipped entirely.
min_cells_per_group: Minimum number of cells required in each of the “high” and “low” protein groups to train a neural network.
protein_split_method: Either "median" (default) for a median split, or "quantile" to keep only the cells above the protein_quantile quantile (“high”) and below the 1 - protein_quantile quantile (“low”), discarding the middle portion.
protein_quantile: Quantile used when protein_split_method='quantile'.
test_size: Fraction of each cluster's cells reserved for the test split when training the neural network.
hidden_layer_sizes: Hidden-layer configuration passed to MLPClassifier.
max_iter: Maximum number of training iterations for the neural network.
early_stopping: Whether to use early stopping in MLPClassifier.
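
To make the difference between the two split methods concrete, here is a small NumPy sketch. The helper split_high_low is hypothetical and not part of pyXenium; in particular, the handling of ties (strict vs. non-strict comparisons) and the reading of the quantile mode as "above the q-th quantile vs. below the (1 - q)-th quantile" are assumptions.

```python
import numpy as np


def split_high_low(values, method="median", quantile=0.75):
    """Hypothetical helper: return (keep_mask, labels) for a high/low split.

    keep_mask marks the cells that take part in training; labels is 1 for
    "high" and 0 for "low" (only meaningful where keep_mask is True).
    """
    values = np.asarray(values, dtype=float)
    if method == "median":
        # Every cell is kept; cells above the median are "high".
        keep = np.ones(values.shape, dtype=bool)
        labels = (values > np.median(values)).astype(int)
    elif method == "quantile":
        # Only the two tails are kept; the middle portion is discarded.
        high = values >= np.quantile(values, quantile)
        low = values <= np.quantile(values, 1.0 - quantile)
        keep = high | low
        labels = high.astype(int)
    else:
        raise ValueError(f"unknown split method: {method!r}")
    return keep, labels


values = np.arange(1, 101, dtype=float)  # toy protein intensities 1..100

keep_med, y_med = split_high_low(values)
keep_q, y_q = split_high_low(values, method="quantile", quantile=0.75)

# A min_cells_per_group-style check: both groups must be large enough.
n_high = int(y_q[keep_q].sum())
n_low = int(keep_q.sum()) - n_high
enough_cells = min(n_high, n_low) >= 20
```

On this toy input the median split keeps all 100 cells (50 high, 50 low), while the quantile split with 0.75 keeps 50 cells (25 high, 25 low). The quantile mode trades sample size for cleaner labels, so min_cells_per_group is more likely to exclude small clusters in that mode.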

Returns

summary: pandas.DataFrame summarising the trained models. Columns are ['cluster', 'protein', 'threshold', 'n_cells', 'n_high', 'n_low', 'train_accuracy', 'test_accuracy', 'test_auc'].
models: Nested dictionary {cluster -> {protein -> ProteinModelResult}} containing the fitted neural networks and scalers for downstream use.
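
One common way to use the summary table is to rank cluster/protein pairs by test_auc. The sketch below builds a placeholder DataFrame with the documented columns; every value in it is made up for illustration, and the assumption that a missing ROC-AUC appears as NaN is not confirmed by this page.

```python
import numpy as np
import pandas as pd

# Placeholder rows using the documented column names; all numbers are invented.
summary = pd.DataFrame(
    {
        "cluster": ["cluster_0", "cluster_0", "cluster_1", "cluster_1"],
        "protein": ["protein_A", "protein_B", "protein_A", "protein_B"],
        "threshold": [0.5, 0.1, 0.4, 0.2],
        "n_cells": [500, 500, 300, 300],
        "n_high": [250, 250, 150, 150],
        "n_low": [250, 250, 150, 150],
        "train_accuracy": [0.90, 0.70, 0.80, 0.88],
        "test_accuracy": [0.85, 0.60, 0.75, 0.80],
        "test_auc": [0.91, 0.62, np.nan, 0.85],  # NaN: AUC not available (assumed)
    }
)

# Keep pairs with a usable AUC of at least 0.8, best first.
ranked = (
    summary.dropna(subset=["test_auc"])
    .query("test_auc >= 0.8")
    .sort_values("test_auc", ascending=False)
    .reset_index(drop=True)
)
```

The cluster and protein values of a ranked row can then be used as keys into the models dictionary, following the {cluster -> {protein -> ProteinModelResult}} layout described above.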

Examples

>>> from pyXenium.multimodal import rna_protein_cluster_analysis
>>> summary, models = rna_protein_cluster_analysis(adata, n_clusters=8)
>>> summary.head()
      cluster          protein  threshold  n_cells  ...  test_accuracy  test_auc
0    cluster_0      EPCAM (µm)   0.563100      512  ...           0.84      0.91
1    cluster_0  Podocin (µm^2)   0.118775      512  ...           0.79      0.87

Parameters:

adata (AnnData)
n_clusters (int)
n_pcs (int)
cluster_key (str)
random_state (int | None)
target_sum (float)
min_cells_per_cluster (int)
min_cells_per_group (int)
protein_split_method (str)
protein_quantile (float)
test_size (float)
hidden_layer_sizes (Tuple[int, ...])
max_iter (int)
early_stopping (bool)

Return type:

Tuple[DataFrame, Dict[str, Dict[str, ProteinModelResult]]]