### Evaluate the presence of sample level batch effects by PILOT (Subgroups of patients)

In this tutorial, we will demonstrate how to use statistical tests to evaluate the potential association between the detected subgroups of patients with any experimental or clinical variable present in a data set. This example will be based on the kidney single cell data.

```python import pilotpy as pl import scanpy as sc import pandas as pd ``` #### Reading the original Anndata: First, we consider the original kidney single cell data. You can download the Anndata (h5ad) file from [here](https://costalab.ukaachen.de/open_data/PILOT/Kidney_ori.h5ad), and place it in the _Datasets_ folder. ```python adata = sc.read_h5ad('/Datasets/Kidney_ori.h5ad') ``` ##### Loading the required information and computing the Wasserstein distance:

In order to work with PILOT, ensure that your Anndata object is loaded and contains the required information. Use the following parameters to configure PILOT for your analysis (Setting Parameters): - adata: Pass your loaded Anndata object to PILOT. - emb_matrix: Provide the name of the variable in the obsm level that holds the dimension reduction (PCA representation). - clusters_col: Specify the name of the column in the observation level of your Anndata that corresponds to cell types or clusters. - sample_col: Indicate the column name in the observation level of your Anndata that contains information about samples or patients. - status: Provide the column name that represents the status or disease (e.g., "control" or "case").

```python pl.tl.wasserstein_distance( adata, emb_matrix = 'X_pca', clusters_col = 'cell_type', sample_col = 'donor_id', status = 'disease' ) ``` ##### Patients sub-group detection by clustering the Wasserstein distance.

The Silhouette Score Curve is used to find the optimal number of clusters by plotting the average Silhouette Score for different numbers of clusters. The number of clusters corresponding to the highest average Silhouette Score is considered the optimal number of clusters. Then, we demonstrate the heatmap of the found clusters.

```python pl.pl.select_best_sil(adata, start = 0.2) ``` ![png](Kidney_clusters_files/Kidney_clusters_9_1.png) ![png](Kidney_clusters_files/Kidney_clusters_9_2.png) ```python proportion_df=pl.pl.clustering_emd(adata, res = adata.uns['best_res'],show_gene_labels=False,sorter_leiden=['1','0','2'],save=True) ``` ![png](Kidney_clusters_files/Kidney_clusters_10_1.png) ##### Evaluation of the association of estimated subgroups with experimental factor: A very important question is if PILOT analysis is affected by experimental artefacts (location of tissues, batch processing). To evaluate this, we use ANOVA statistics and Chi-Squared correlation to check any association between any variables describing the found clusters. To run these functions, provide the sample_col as the Sample/Patient column and your interested variables. Of note, these functions only show the significant variables (p-value < 0.05). ```python categorical = ['BMI','hypertension','development_stage','sex','eGFR','diabetes_history','disease','tissue'] ``` ```python pl.tl.correlation_categorical_with_clustering(adata, proportion_df, sample_col = 'donor_id', features = categorical) ```

	Feature	ChiSquared_Statistic	ChiSquared_PValue
7	tissue	21.906065	0.001259
0	BMI	18.291352	0.005544
6	disease	10.600704	0.031438

Similarly, you can do the same analysis for numerical variables. ```python numeric = ['degen.score','aStr.score','aEpi.score','matrisome.score','collagen.score','glycoprotein.score','proteoglycan.score'] ``` ```python pl.tl.correlation_numeric_with_clustering(adata, proportion_df, sample_col = 'donor_id', features = numeric) ```

	Feature	ANOVA_F_Statistic	ANOVA_P_Value

We observe that there is a clear association between the sub-clusters and the origin of the biopsies. This is not surprising as PILOT uses cellular composition information, as samples at distinct regions do have distinct cells. To double check these results, one can of course plot the heatmap of clusters by showing the variables with 'clinical_variables_corr_sub_clusters' function. ###### Disease ```python pl.tl.clinical_variables_corr_sub_clusters(adata,sorter_order=['1','0','2'],sample_col='donor_id',feature='disease',proportion_df=proportion_df,figsize_legend=(1,1)) ``` ![png](Kidney_clusters_files/Kidney_clusters_20_0.png) ![png](Kidney_clusters_files/Kidney_clusters_20_1.png) ###### Tissue ```python pl.tl.clinical_variables_corr_sub_clusters(adata,sorter_order=['1','0','2'],sample_col='donor_id',feature='tissue',proportion_df=proportion_df,figsize_legend=(1,1)) ``` ![png](Kidney_clusters_files/Kidney_clusters_22_0.png) ![png](Kidney_clusters_files/Kidney_clusters_22_1.png) ###### BMI ```python pl.tl.clinical_variables_corr_sub_clusters(adata,sorter_order=['1','0','2'],sample_col='donor_id',feature='BMI',proportion_df=proportion_df,figsize_legend=(1,1)) ``` ![png](Kidney_clusters_files/Kidney_clusters_24_0.png) ![png](Kidney_clusters_files/Kidney_clusters_24_1.png) #### Filtering of samples Therefore, we only focus on biopsies from sample locationn 'kidney', which were used in our benchmarking. You can download the Anndata (h5ad) file from [here](https://costalab.ukaachen.de/open_data/PILOT/Kidney_filtered.h5ad), and place it in the _Datasets_ folder. ```python adata_filtered=sc.read_h5ad('/Datasets/Kidney.h5ad') ``` ```python pl.tl.wasserstein_distance( adata_filtered, emb_matrix = 'X_pca', clusters_col = 'cell_type', sample_col = 'donor_id', status = 'disease' ) ``` Run the same functions for the filtered data. ```python pl.pl.select_best_sil(adata_filtered, start = 0.3) ``` ![png](Kidney_clusters_files/Kidney_clusters_30_0.png) ![png](Kidney_clusters_files/Kidney_clusters_30_1.png) ```python proportion_df=pl.pl.clustering_emd(adata_filtered, res = adata_filtered.uns['best_res'],show_gene_labels=False,sorter_leiden=['0','1'],save=True) ``` ![png](Kidney_clusters_files/Kidney_clusters_31_1.png) We next recheck the association of variables with the estimated sub-clusters. ```python pl.tl.correlation_categorical_with_clustering(adata_filtered,proportion_df,sample_col= 'donor_id', features = categorical) ```

	Feature	ChiSquared_Statistic	ChiSquared_PValue
6	disease	17.488757	0.000159
5	diabetes_history	6.784615	0.009195
0	BMI	7.857195	0.049057

In the data we observed that the true label (disease) has the highest association followed by Diabetes History and BMI. We can show the distribution of variables in the found clusters: ##### Disease ```python meta=adata_filtered.obs[['diabetes_history','BMI','donor_id','disease']] meta = meta.drop_duplicates(subset='donor_id').rename(columns={'donor_id': 'sampleID'}) proportion=proportion_df[['Predicted_Labels','sampleID']].merge(meta,on='sampleID') contingency_table = pd.crosstab(proportion['disease'], proportion['Predicted_Labels'],values=proportion['sampleID'], aggfunc=pd.Series.nunique, margins=True, margins_name='Total Unique Patients') # Display the contingency table contingency_table ```

Predicted_Labels	0	1	Total Unique Patients
disease
Normal	17.0	1.0	18
acute kidney failure	NaN	5.0	5
chronic kidney disease	9.0	4.0	13
Total Unique Patients	26.0	10.0	36

```python ``` ##### BMI ```python contingency_table = pd.crosstab(proportion['BMI'], proportion['Predicted_Labels'],values=proportion['sampleID'], aggfunc=pd.Series.nunique, margins=True, margins_name='Total Unique Patients') contingency_table ```

Predicted_Labels	0	1	Total Unique Patients
BMI
18.5—24.9	2.0	NaN	2
25.0—29.9	10.0	1.0	11
30.0 and Above	4.0	NaN	4
unknown	10.0	9.0	19
Total Unique Patients	26.0	10.0	36

```python ``` ##### Diabetes history ```python contingency_table = pd.crosstab(proportion['diabetes_history'], proportion['Predicted_Labels'],values=proportion['sampleID'], aggfunc=pd.Series.nunique, margins=True, margins_name='Total Unique Patients') # Display the contingency table contingency_table ```

Predicted_Labels	0	1	Total Unique Patients
diabetes_history
No	17	1	18
Yes	9	9	18
Total Unique Patients	26	10	36

To double check these results, one can of course plot the heatmap of clusters by showing the variables with 'clinical_variables_corr_sub_clusters' function. ###### Disease ```python pl.tl.clinical_variables_corr_sub_clusters(adata_filtered, sorter_order=['0','1'],sample_col='donor_id',feature= 'disease',proportion_df = proportion_df,figsize_legend=(1,1)) ``` ![png](Kidney_clusters_files/Kidney_clusters_45_0.png) ![png](Kidney_clusters_files/Kidney_clusters_45_1.png) ###### BMI ```python pl.tl.clinical_variables_corr_sub_clusters(adata_filtered, sorter_order = ['0','1'],sample_col = 'donor_id',feature = 'BMI',proportion_df = proportion_df,figsize_legend=(1,1)) ``` ![png](Kidney_clusters_files/Kidney_clusters_47_0.png) ![png](Kidney_clusters_files/Kidney_clusters_47_1.png) ###### Diabetes_history ```python pl.tl.clinical_variables_corr_sub_clusters(adata_filtered, sorter_order = ['0','1'],sample_col = 'donor_id',feature = 'diabetes_history',proportion_df = proportion_df,figsize_legend=(1,1)) ``` ![png](Kidney_clusters_files/Kidney_clusters_49_0.png) ![png](Kidney_clusters_files/Kidney_clusters_49_1.png)