pyXenium.multimodal.rna_protein_cluster_analysis

rna_protein_cluster_analysis(adata, *, n_clusters=12, n_pcs=30, cluster_key='rna_cluster', random_state=0, target_sum=1e4, min_cells_per_cluster=50, min_cells_per_group=20, protein_split_method='median', protein_quantile=0.75, test_size=0.2, hidden_layer_sizes=(64, 32), max_iter=200, early_stopping=True)

Joint RNA/protein analysis for Xenium AnnData objects.

The pipeline performs three consecutive steps:

  1. RNA preprocessing – library-size normalisation (counts per target_sum) followed by log1p. A TruncatedSVD is fitted to obtain n_pcs latent dimensions.

  2. Clustering – KMeans is applied on the latent representation to create n_clusters RNA-driven cell groups. Cluster assignments are stored in adata.obs[cluster_key] and the latent space in adata.obsm['X_rna_pca'].

  3. Protein explanation – for every cluster and every protein marker, the cells are divided into “high” vs. “low” groups (median split by default). A small neural network (MLPClassifier) is trained to predict the binary labels from the RNA latent features. The training/test accuracies and optional ROC-AUC are reported.
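The three steps above can be sketched on synthetic data with the scikit-learn classes of the same names. This is a minimal illustration, not the function's actual implementation: the toy arrays, the use of StandardScaler, the stratified split and the `results` dictionary are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
rna = rng.poisson(2.0, size=(300, 40)).astype(float)  # toy cells x genes counts
protein = rng.gamma(2.0, 1.0, size=300)               # one toy protein marker

# Step 1: counts per target_sum, log1p, then TruncatedSVD.
target_sum, n_pcs = 1e4, 10
totals = rna.sum(axis=1, keepdims=True)
totals[totals == 0] = 1.0                             # assumption: avoid division by zero
logged = np.log1p(rna / totals * target_sum)
n_pcs = min(n_pcs, rna.shape[1] - 1)                  # cap at n_genes - 1
latent = TruncatedSVD(n_components=n_pcs, random_state=0).fit_transform(logged)

# Step 2: KMeans on the latent representation.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(latent)

# Step 3: per cluster, median split of the protein and an MLP on the latent features.
results = {}
for c in np.unique(clusters):
    idx = clusters == c
    labels = (protein[idx] > np.median(protein[idx])).astype(int)
    n_high = int(labels.sum())
    if min(n_high, labels.size - n_high) < 20:
        continue  # mirrors the min_cells_per_group check
    X_train, X_test, y_train, y_test = train_test_split(
        latent[idx], labels, test_size=0.2, random_state=0, stratify=labels)
    scaler = StandardScaler().fit(X_train)
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200,
                        early_stopping=True, random_state=0)
    mlp.fit(scaler.transform(X_train), y_train)
    Xs_test = scaler.transform(X_test)
    results[int(c)] = {
        "train_accuracy": mlp.score(scaler.transform(X_train), y_train),
        "test_accuracy": mlp.score(Xs_test, y_test),
        "test_auc": roc_auc_score(y_test, mlp.predict_proba(Xs_test)[:, 1]),
    }
```

Because the toy protein values are independent of the RNA counts, the accuracies here should hover around chance; on real data, a high test accuracy suggests that the RNA latent space carries information about that protein within the cluster.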

Parameters

adata:

AnnData object returned by pyXenium.multimodal.load_rna_protein_anndata(). Requires adata.layers['rna'] (or adata.X) and adata.obsm['protein'].

n_clusters:

Number of RNA clusters to compute with KMeans.

n_pcs:

Number of latent components extracted with TruncatedSVD. The value is automatically capped at n_genes - 1.

cluster_key:

Column name added to adata.obs that stores cluster labels.

random_state:

Seed for the SVD, KMeans and neural networks. Use None for random initialisation.

target_sum:

Target library size after normalisation (Counts Per target_sum).
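As an illustration of counts-per-target_sum scaling, the following NumPy sketch rescales each cell so that its counts sum to target_sum and then applies log1p. It assumes a dense cells × genes matrix; the helper name `normalize_log1p` and the handling of zero-count cells are hypothetical choices for this example, not a description of pyXenium internals.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each row (cell) to sum to target_sum, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: cells with zero total counts stay all-zero instead of dividing by 0.
    safe_totals = np.where(totals == 0, 1.0, totals)
    scaled = counts / safe_totals * target_sum
    return np.log1p(scaled)

counts = np.array([[1, 3, 0], [10, 0, 10], [0, 0, 0]])
normalized = normalize_log1p(counts, target_sum=1e4)
```

After scaling (and before the log transform) every non-empty cell has the same total, so differences in sequencing depth no longer dominate the latent space.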

min_cells_per_cluster:

Clusters with fewer cells are skipped entirely.

min_cells_per_group:

Minimum number of cells required in each of the “high” and “low” protein groups to train a neural network.

protein_split_method:

Either "median" (default) for a median split, or "quantile" to keep only the cells above the protein_quantile quantile (“high”) and below the 1 - protein_quantile quantile (“low”), discarding the middle portion.

protein_quantile:

Quantile used when protein_split_method='quantile'.
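The two split modes, together with the min_cells_per_group check, might look roughly like the sketch below. The helper names `split_high_low` and `enough_cells`, the treatment of ties, and the reading of "quantile" mode as "above the protein_quantile quantile vs. below the 1 - protein_quantile quantile" are assumptions for illustration.

```python
import numpy as np

def split_high_low(values, method="median", quantile=0.75):
    """Return (keep_mask, labels); labels are 1 for 'high' and 0 for 'low'."""
    values = np.asarray(values, dtype=float)
    if method == "median":
        threshold = np.median(values)
        keep = np.ones(values.shape, dtype=bool)   # every cell is kept
        labels = (values > threshold).astype(int)  # assumption: ties count as 'low'
    elif method == "quantile":
        upper = np.quantile(values, quantile)
        lower = np.quantile(values, 1.0 - quantile)
        high = values >= upper
        low = values <= lower
        keep = high | low                          # middle portion is discarded
        labels = high.astype(int)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return keep, labels[keep]

def enough_cells(labels, min_cells_per_group=20):
    """Both groups must reach the minimum size before a model is trained."""
    n_high = int(labels.sum())
    n_low = int(labels.size - n_high)
    return min(n_high, n_low) >= min_cells_per_group

protein = np.arange(100, dtype=float)  # toy protein intensities 0..99
keep_med, labels_med = split_high_low(protein, method="median")
keep_q, labels_q = split_high_low(protein, method="quantile", quantile=0.75)
```

With 100 evenly spaced values, the median split keeps all cells (50 high, 50 low), whereas the 0.75 quantile split keeps only the 25 highest and 25 lowest. The quantile mode trades sample size for cleaner labels, which is why the group-size check matters more there.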

test_size:

Fraction of the cluster reserved for the test split when training the neural network.

hidden_layer_sizes:

Hidden-layer configuration passed to MLPClassifier.

max_iter:

Maximum number of training iterations for the neural network.

early_stopping:

Whether to use early stopping in MLPClassifier.

Returns

summary:

pandas.DataFrame summarising the trained models. Columns are ['cluster', 'protein', 'threshold', 'n_cells', 'n_high', 'n_low', 'train_accuracy', 'test_accuracy', 'test_auc'].

models:

Nested dictionary {cluster -> {protein -> ProteinModelResult}} containing the fitted neural networks and scalers for downstream use.
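Since summary is an ordinary pandas.DataFrame, it can be filtered and ranked with standard pandas operations. The sketch below uses a hand-made table that only mimics the documented columns; the protein names, all numbers and the 0.8 cut-off are hypothetical.

```python
import pandas as pd

# Hypothetical rows that only mimic the documented summary columns.
summary = pd.DataFrame(
    {
        "cluster": ["cluster_0", "cluster_0", "cluster_1"],
        "protein": ["A", "B", "A"],
        "threshold": [0.56, 0.12, 0.40],
        "n_cells": [512, 512, 300],
        "n_high": [256, 256, 150],
        "n_low": [256, 256, 150],
        "train_accuracy": [0.90, 0.85, 0.70],
        "test_accuracy": [0.84, 0.79, 0.55],
        "test_auc": [0.91, 0.87, 0.58],
    }
)

# Keep models that generalise reasonably well, best first.
good = summary[summary["test_auc"] >= 0.8].sort_values("test_auc", ascending=False)

# Best-explained protein per cluster.
best_rows = summary.groupby("cluster")["test_auc"].idxmax()
best = summary.loc[best_rows, ["cluster", "protein", "test_auc"]]
```

The cluster and protein values of a selected row can then be used as the two keys into the nested models dictionary, i.e. models[cluster][protein]. If a missing AUC is stored as NaN, the comparison in the filter drops that row automatically.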

Examples

>>> from pyXenium.multimodal import rna_protein_cluster_analysis
>>> summary, models = rna_protein_cluster_analysis(adata, n_clusters=8)
>>> summary.head()
      cluster          protein  threshold  n_cells  ...  test_accuracy  test_auc
0    cluster_0      EPCAM (µm)   0.563100      512  ...           0.84      0.91
1    cluster_0  Podocin (µm^2)   0.118775      512  ...           0.79      0.87

Parameters:
  • adata (AnnData)

  • n_clusters (int)

  • n_pcs (int)

  • cluster_key (str)

  • random_state (int | None)

  • target_sum (float)

  • min_cells_per_cluster (int)

  • min_cells_per_group (int)

  • protein_split_method (str)

  • protein_quantile (float)

  • test_size (float)

  • hidden_layer_sizes (Tuple[int, ...])

  • max_iter (int)

  • early_stopping (bool)

Return type:

Tuple[DataFrame, Dict[str, Dict[str, ProteinModelResult]]]