Trajectory analysis of Myocardial Infarction using PILOT

PILOT

Welcome to the PILOT Package Tutorial for scRNA Data!

Here we show the whole process for applying PILOT to scRNA data using Myocardial Infarction scRNA Data, you can download the Anndata (h5ad) file from here, and place it in the Datasets folder.

import PILOT as pl
import scanpy as sc

Reading Anndata

adata = sc.read_h5ad('Datasets/myocardial_infarction.h5ad')

Loading the required information and computing the Wasserstein distance:

In order to work with PILOT, ensure that your Anndata object is loaded and contains the required information.

Use the following parameters to configure PILOT for your analysis (Setting Parameters):

adata: Pass your loaded Anndata object to PILOT.
emb_matrix: Provide the name of the variable in the obsm level that holds the dimension reduction (PCA representation).
clusters_col: Specify the name of the column in the observation level of your Anndata that corresponds to cell types or clusters.
sample_col: Indicate the column name in the observation level of your Anndata that contains information about samples or patients.
status: Provide the column name that represents the status or disease (e.g., “control” or “case”).

pl.tl.wasserstein_distance(
    adata,
    emb_matrix = 'PCA',
    clusters_col = 'cell_subtype',
    sample_col = 'sampleID',
    status = 'Status',
    )

Ploting the Cost matrix and the Wasserstein distance:

Here we show the heatmaps of Cost matrix (cells) and Wasserstein distance (samples). At this point, you should have a 'Results_PILOT' folder inside the folder where you are running your analysis, which includes all the plots created by PILOT.

pl.pl.heatmaps(adata)

png

Trajectory:

Here we show the Diffusion map of Wasserstein distance. In the upcoming trajectory analysis, the labels "IZ" stands for the ischaemic zone tissue.

pl.pl.trajectory(adata, colors = ['Blue','red'])

png

Fit a principal graph:

The difussion map creates an embedding that potentially reveals a trajectory in the data. Next, PILOT explores EIPLGraph to find the structure of the trajectory. An important parameter is the source_node, which indicates the start of the trajectory. Here, we selected a control sample (node with id = 7). This method returns ranked samples, which we define as a disease progression score (t = t1, ..., tn), where tl represents the ranking of the nth sample.

pl.pl.fit_pricipla_graph(adata, source_node = 7)

png

Cell-type importance:

Next, we can use the robust regression model to find cells whose proportions change linearly or non-linearly with disease progression. As indicated in the paper, major hallmark of MI progression are detected, i.e., a decrease of cardiomyocyte cells (CM) and an increase of fibroblasts and myeloid cells.

pl.tl.cell_importance(adata)

png

Applying PILOT for finding Markers

Gene selection:

Given that we found interesting cell types, we would like to investigate genes associated with these trajectories, i.e. genes, whose expression changes linearly or quadratically with the disease progression.

After running the command, you can find a folder named ‘Markers’ inside the ‘Results_PILOT’ folder. There, we will have a folder for each cell type. The file ‘Whole_expressions.csv’ contains all statistics associated with genes for that cell type.

Here, we run the genes_importance function for all cell types.
You need to set names of columns that show cell_types/clusters and Samples/Patients in your object.
Note that if you are running this step on a personal computer, this step might take time.

for cell in adata.uns['cellnames']:
    pl.tl.genes_importance(adata,
    name_cell = cell,
    sample_col = 'sampleID',
    col_cell = 'cell_subtype',
    plot_genes = False)

Cluster Specific Marker Changes:

The previous test only finds genes with significant changes over time for a given cell type. However, it does not consider if a similar pattern and expression values are found in other clusters. To further select genes, we use a Wald test that compares the fit of the gene in the cluster vs. the fit of the gene in other clusters. In the code below, we consider top genes (regarding the regression fit) for two interesting cell types discussed in the manuscript (‘healthy CM’ and ‘Myofib’).

pl.tl.gene_cluster_differentiation(adata,cellnames = ['healthy_CM','Myofib'], number_genes = 70)

Test results are saved in ‘gene_clusters_stats_extend.csv’. To find a final list of genes, we only consider genes with a fold change higher than 0.5, i.e. genes which expression is increased in the cluster at hand; and we sort the genes based on the Wald test p-value. These can be seen bellow.

df = pl.tl.results_gene_cluster_differentiation(cluster_name = 'Myofib',).head(50)
df.head(15)

	gene	cluster	waldStat	pvalue	FC	Expression pattern	fit-pvalue	fit-mod-rsquared
2642	GAS7	Myofib	212.477292	8.487275e-46	1.086644	linear up quadratic down	1.873033e-107	0.570704
2151	EXT1	Myofib	125.383128	5.344198e-27	0.786136	linear up quadratic down	3.159831e-35	0.555757
4979	PKNOX2	Myofib	89.738712	2.492742e-19	0.855504	quadratic down	1.039404e-117	0.544122
2529	FN1	Myofib	70.641696	3.110595e-15	1.573680	linear down quadratic up	2.947389e-188	0.633774
1437	COL6A3	Myofib	54.751169	7.758841e-12	1.069156	linear down quadratic up	3.514298e-172	0.608543
5775	RORA	Myofib	52.486295	2.359167e-11	0.899459	quadratic down	7.232834e-174	0.587234
2832	GXYLT2	Myofib	24.247113	2.218154e-05	2.000205	linear up quadratic down	2.402171e-85	0.537920
3783	MGP	Myofib	23.244418	3.591226e-05	0.871041	quadratic down	1.327779e-225	0.571374
4726	PCDH9	Myofib	20.439646	1.376052e-04	0.604830	linear down	0.000000e+00	0.596035
1231	CHD9	Myofib	20.389564	1.409364e-04	0.527488	linear up quadratic down	7.658862e-77	0.559604
1710	DCN	Myofib	19.656307	1.999818e-04	1.033697	linear up quadratic down	1.866152e-284	0.588602
2824	GSN	Myofib	18.015612	4.366007e-04	0.638136	linear up quadratic down	2.942472e-279	0.601684
1392	COL3A1	Myofib	17.276479	6.199787e-04	1.240454	linear down quadratic up	0.000000e+00	0.665616
1372	COL1A2	Myofib	14.068816	2.812963e-03	1.327753	linear down quadratic up	0.000000e+00	0.655032
7245	VCAN	Myofib	12.610158	5.560192e-03	0.838764	linear down quadratic up	1.761922e-164	0.571981

Here is the GO enrichment for the 50 first top genes of Myofib (FC >= 0.5 and p-value < 0.01). Plot is saved at Go folder.

pl.pl.go_enrichment(df, cell_type = 'Myofib')

png

We can visualize specific genes, for example the ones discussed in PILOT manuscript (COL1A2, DCN and EXT1). In the plot, the orange line indicates the fit in the target cell type (shown as orange lines) compared to other cell types (represented by grey lines). Plots of genes are saved at 'plot_genes_for_Myofib' folder.

pl.pl.exploring_specific_genes(cluster_name = 'Myofib', gene_list = ['COL1A2','DCN','EXT1'])

png

We can repeat the same analysis for healthy_CM cell type by using the following commands.

df=pl.tl.results_gene_cluster_differentiation(cluster_name = 'healthy_CM').head(50)
df.head(15)

	gene	cluster	waldStat	pvalue	FC	Expression pattern	fit-pvalue	fit-mod-rsquared
6165	SORBS1	healthy_CM	1574.665604	0.000000e+00	1.296470	linear down quadratic up	8.946560e-05	0.522953
1772	DLG2	healthy_CM	1055.313030	1.801893e-228	1.155496	linear down quadratic up	1.323610e-256	0.556306
6733	THSD4	healthy_CM	834.288239	1.583902e-180	1.671315	linear down quadratic up	6.088694e-250	0.582085
1276	CMYA5	healthy_CM	752.301407	9.561746e-163	1.559703	linear down quadratic up	3.774063e-66	0.527869
3281	LDB3	healthy_CM	542.239458	3.342198e-117	1.426196	linear down quadratic up	1.511694e-238	0.546327
36	ABLIM1	healthy_CM	379.423867	6.335728e-82	0.979378	linear up	3.296026e-13	0.513734
6903	TNNT2	healthy_CM	373.037957	1.530428e-80	1.392561	linear down quadratic up	2.329698e-118	0.535236
2398	FHOD3	healthy_CM	343.341326	4.125161e-74	1.731741	linear down quadratic up	0.000000e+00	0.612758
6663	TECRL	healthy_CM	338.848075	3.875199e-73	1.261289	linear up quadratic down	0.000000e+00	0.570571
4056	MYBPC3	healthy_CM	296.157297	6.751814e-64	0.686940	linear up quadratic down	0.000000e+00	0.557570
5652	RCAN2	healthy_CM	287.996090	3.940736e-62	1.214055	linear down quadratic up	0.000000e+00	0.566313
1830	DOCK3	healthy_CM	269.653643	3.667754e-58	0.534836	linear down quadratic up	2.678914e-202	0.527979
4177	MYOM1	healthy_CM	236.482875	5.483541e-51	1.637375	linear down	1.381677e-268	0.548281
1915	EFNA5	healthy_CM	236.263957	6.115035e-51	1.089847	linear down	1.127515e-164	0.532078
5436	PXDNL	healthy_CM	227.066418	5.957588e-49	1.284827	linear down quadratic up	1.815885e-03	0.518712

Plot is saved at Go folder.

pl.pl.go_enrichment(df, cell_type = 'healthy_CM')

png

Plots of genes are saved at 'plot_genes_for_healthy_CM' folder.

pl.pl.exploring_specific_genes(cluster_name = 'healthy_CM', gene_list = ['MYBPC3','MYOM1','FHOD3'])

png

Group genes by pattern:

Here, we cluster genes based on the pattern found for each and plot their heatmap. Below the heatmap, we depict the pattern of each group's genes and top 10 genes having significant changes through disease progression.

You can find curves activities’ statistical scores that show the fold changes of the genes through disease progression in the ‘Markers‘ folder for each cell type separately.

The method used to compute curves activities had been inspired by Dictys project with Wang, L., et al. 2023 paper.

pl.pl.genes_selection_analysis(adata, 'healthy_CM', scaler_value = 0.6)

png

png png No information for cluster 3! png png png

In the table, you can check the curves activities of some genes of the healthy_CM:

Gene ID	Expression pattern	adjusted P-value	R-squared	mod_rsquared_adj	Terminal_logFC	Transient_logFC	Switching_time	area	cluster
GRXCR2	linear down quadratic up	0.00	0.29	0.65	-0.02	-1.8	0.03	59.63	4
SEMA5A	linear down quadratic up	0.00	0.25	0.63	-0.12	-0.82	0.1	48.83	2
PRKG1	linear down quadratic down	0.00	0.19	0.61	-0.19	-0.01	0.62	10.97	1
FHOD3	linear down quadratic up	0.00	0.19	0.61	0.01	-2.06	0.97	64.71	4
L3MBTL4	linear down quadratic up	0.00	0.19	0.6	-0.05	-1.51	0.04	56.02	4
…	…	…	…	…	…	…	…
SNHG25	linear down	0.00	-0.29	0.36	-0.2	0	0.51	0.71	2
BAG3	linear down	0.00	-0.3	0.35	-0.21	0	0.5	0.27	2
ATP6V1C1	linear down	0.02	-0.3	0.35	-0.2	0	0.5	0.31	2
SDC4	quadratic up	0.00	-0.31	0.35	0.2	0	0.64	13.74	5
ZNF490	linear down	0.00	-0.31	0.35	-0.2	0	0.49	0.65	2

The complete table can be found here: healthy_CM_curves_activities.csv

pl.pl.genes_selection_analysis(adata, 'Myofib', scaler_value = 0.5)

png

png png No information for cluster 3! png

In the table, you can check the curves activities of some genes of the Myofib:

Gene ID	Expression pattern	adjusted P-value	R-squared	mod_rsquared_adj	Terminal_logFC	Transient_logFC	Switching_time	area	cluster
LAMA2	linear up quadratic down	0.00	0.38	0.74	-0.15	0.05	0.76	28.73	2
COL1A1	linear down quadratic up	0.00	0.31	0.68	0.14	-0.39	0.86	48.08	1
NEGR1	quadratic down	0.00	0.28	0.67	-0.17	0	0.64	16.17	2
COL3A1	linear down quadratic up	0.00	0.28	0.67	0.12	-0.63	0.89	54.59	1
ZBTB20	linear up quadratic down	0.00	0.27	0.68	-0.15	0.08	0.77	30.9	2
…	…	…	…	…	…	…	…
MAN2A1	quadratic down	0.00	-0.35	0.40	-0.17	0	0.65	17.56	2
ADGRD1	linear down	0.00	-0.36	0.42	-0.17	0	0.49	1.45	4
ATR	quadratic down	0.00	-0.36	0.39	-0.17	0	0.65	17.71	2
URI1	quadratic down	0.00	-0.36	0.39	-0.17	0	0.65	17.49	2
TRA2A	quadratic down	0.01	-0.39	0.38	-0.17	0	0.65	17.01	2

The complete table can be found here: Myofib_curves_activities.csv

Plot specific genes:

Here, you can check the expression pattern of interested genes

pl.pl.plot_gene_list_pattern(['DYNC2H1', 'NEGR1', 'COL3A1'], 'Myofib')

png