Analyzing Pathways from PTMs: A Guide

The PTMsToPathways (P2P) package takes mass spectrometry (MS) data on protein post-translational modifications (PTMs) measured under different experimental conditions and uses machine learning to identify PTM clusters that represent functional modules in cell signaling. These clusters are then used to identify protein-protein interactions and interactions between cell signaling pathways. This tutorial is a step-by-step guide to using the P2P package. It describes each function; the steps must be run in order because each one requires data produced by the previous steps. Example code, example outputs, and estimated run-times accompany each description. The run-times are based on a preliminary dataset of ~9000 PTMs and 69 experimental conditions processed with a 12th Gen i7 processor and 16GB of RAM.

An important note about this package: the outputs returned by the functions can take a while to generate. They may be saved in an RData object so that the user can reload them later and pick up where they left off.

Installing the package

Ensure you have the latest version of RStudio and base R installed. You will also need to install the devtools package, which can be installed with:

install.packages("devtools")

Next, install the package with:

devtools::install_github("UM-Applied-Algorithms-Lab/PTMsToPathways")

Starting Data

For the tutorial, we will be using two example datasets – a smaller dataset consisting of 933 PTMs and 18 experimental conditions (the example used in the RawDataProcessing vignette) and a larger dataset containing around 9000 PTMs and 69 experimental conditions. The datasets are available with the package as a variable or can be downloaded from our GitHub page ——- to the user’s working directory and imported.

If you are using the smaller dataset, it is already available as a package variable and can be inspected with:

dim(ex_small_ptm_table)
ex_small_ptm_table[38:50, 1:4]

If you want to use the bigger dataset, use the following code:

dim(ex_full_ptm_table)

If you want to download the data and import the dataset, use the following code:

allptmtable <- read.table("AlldataPTMs.txt", sep = "\t", skip = 0, header = TRUE, blank.lines.skip = TRUE, fill = TRUE, quote = "\"", dec = ".", comment.char = "", stringsAsFactors = FALSE)

Processing the data

The MS data needs to be transformed into a data frame with PTMs as row names and numeric data per experiment as columns before it can be analyzed with the P2P package. Please refer to the RawDataProcessing vignette for a tutorial showing all the steps needed to transform an MS output file into a P2P input data frame.
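
Before starting Step 1, it can help to confirm that a table has the shape described above. The check below is a minimal sketch in base R. `check_ptm_table` is a hypothetical helper, not part of P2P, and the toy PTM names and values are made up for illustration.

```r
# Hypothetical helper (not part of P2P): checks the input shape described above
check_ptm_table <- function(ptm.table) {
  if (!is.data.frame(ptm.table)) stop("input must be a data frame")
  # Every experimental condition (column) should hold numeric data
  non.numeric <- names(ptm.table)[!vapply(ptm.table, is.numeric, logical(1))]
  if (length(non.numeric) > 0) {
    stop("non-numeric columns: ", paste(non.numeric, collapse = ", "))
  }
  # Automatic row names (1, 2, 3, ...) suggest the PTM names were not set
  if (identical(rownames(ptm.table), as.character(seq_len(nrow(ptm.table))))) {
    warning("row names look automatic; PTM names are expected as row names")
  }
  # Return the fraction of missing values per condition
  colMeans(is.na(ptm.table))
}

# Toy table: 3 PTMs (rows) x 2 conditions (columns)
toy <- data.frame(
  cond1 = c(1.2, NA, -0.4),
  cond2 = c(0.8, 2.1, NA),
  row.names = c("GENEA p Y10", "GENEA p S20", "GENEB ack K5")
)
missing.fraction <- check_ptm_table(toy)
print(missing.fraction)
```

The same call could be applied to ex_small_ptm_table or to a table imported with read.table.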

Step 1: Make Cluster List

MakeClusterList is the first step. This function takes the PTM table, a data frame, and computes three statistical measures of distance between PTMs: Euclidean distance, Spearman dissimilarity (1 - |Spearman correlation|), and SED, the average of Spearman dissimilarity and Euclidean distance. Combining the two dissimilarities leads to better resolution of the data and is useful in pattern recognition. A correlation table is generated based on the distances calculated for each pair of PTMs. The function then runs the distance matrices through t-SNE to generate clusters and returns a cluster list, common.clusters. The returned adj.consensus and ptm.correlation.matrix are used in the next step to create co-cluster correlation networks (CCCNs).

# Set seed to obtain repeatable t-SNE graphs
set.seed(88)
clusterlist.data <- MakeClusterList(ex_small_ptm_table, keeplength = 2, toolong = 3.5)
common.clusters <- clusterlist.data[[1]]
adj.consensus <- clusterlist.data[[2]]
ptm.correlation.matrix <- clusterlist.data[[3]]

Note: t-SNE involves an element of randomness from pseudorandom initialization and stochastic steps within the algorithm. To get the same results across runs, call set.seed(#) first (# = any integer of your choice). toolong controls the size of the clusters and is set to 3.5 here, a recommended starting point for t-SNE that can be changed at the researcher's discretion. This function takes a while to run, so it is recommended to save the output as an RData object, which can then be imported later.
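
The three distance measures used in this step can be sketched in base R as follows. This is an illustration on a made-up toy table, not the package's implementation. In particular, rescaling the Euclidean distances to [0, 1] before averaging is an assumption made here so that the two measures are on comparable scales. The t-SNE embedding and clustering are not shown.

```r
# Toy PTM table: 4 PTMs (rows) x 5 conditions (columns); values are made up
toy <- data.frame(
  c1 = c(1.0, 2.0, -1.0, 0.5),
  c2 = c(2.0, 4.1, -2.2, 0.1),
  c3 = c(3.0, 6.2, -2.9, 0.9),
  c4 = c(4.0, 7.9, -4.1, 0.2),
  c5 = c(5.0, 10.0, -5.0, 0.7),
  row.names = c("GENEA p Y10", "GENEA p S20", "GENEB ack K5", "GENEC p T7")
)

# Spearman dissimilarity between PTMs: 1 - |rho|
# cor() correlates columns, so the table is transposed to compare rows (PTMs)
rho <- cor(t(toy), method = "spearman", use = "pairwise.complete.obs")
spearman.dis <- 1 - abs(rho)

# Euclidean distance between PTMs, rescaled to [0, 1] (assumption)
euclid <- as.matrix(dist(toy, method = "euclidean"))
euclid.scaled <- euclid / max(euclid)

# SED: average of the two dissimilarities
sed <- (spearman.dis + euclid.scaled) / 2
print(round(sed, 3))
```

Because the absolute value of the correlation is used, "GENEA p Y10" and "GENEB ack K5", which are perfectly anti-correlated in this toy table, have a Spearman dissimilarity of 0 even though their Euclidean distance is large.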

Estimated run-time

~60min

Step 2: Make Co-Cluster Correlation Networks (PTM and Gene)

The cluster list generated in the previous step is used next to create a network of strongly associated PTMs called the co-cluster correlation network (CCCN). The Spearman correlations between co-clustered PTMs are used as edge weights in this network. The MakeCorrelationNetwork function restricts the PTM correlation matrix to PTMs that co-cluster, creating a PTM CCCN. It then relates PTMs to the proteins they modify and creates a gene CCCN in which the sum of PTM correlations serves as the edge weight. The output of this function can be saved as an RData object.

In addition to CCCN edge lists, this function returns the igraph objects ptm.cccn.g and gene.cccn.g, which can be used later to extract edge lists or adjacency matrices, to plot the networks, and with many other functions available in the igraph package.

CCCN.data <- MakeCorrelationNetwork(adj.consensus, ptm.correlation.matrix)
ptm.cccn.g <- CCCN.data[[1]] # igraph data object
gene.cccn.g <- CCCN.data[[2]] # igraph data object
ptm.cccn.edges <- CCCN.data[[3]] # PTM CCCN edge list
gene.cccn.edges <- CCCN.data[[4]] # Gene CCCN edge list

head(gene.cccn.edges)

Estimated run-time

~10min
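
The idea behind the two networks in this step can be sketched in base R: keep correlations only between PTMs that share a cluster, then sum them per pair of genes. This is an illustration under stated assumptions, not the package's code. The toy matrix, the cluster list, the edge list column names, and the convention that a PTM name begins with the gene symbol followed by a space are all assumptions made for the example.

```r
# Toy inputs (made up for illustration)
ptms <- c("GENEA p Y10", "GENEA p S20", "GENEB ack K5", "GENEC p T7")
ptm.cor <- matrix(c(
   1.0,  0.9, -0.8,  0.3,
   0.9,  1.0, -0.7,  0.2,
  -0.8, -0.7,  1.0,  0.1,
   0.3,  0.2,  0.1,  1.0
), nrow = 4, byrow = TRUE, dimnames = list(ptms, ptms))
clusters <- list(c("GENEA p Y10", "GENEA p S20", "GENEB ack K5"), "GENEC p T7")

# Adjacency matrix: 1 where two different PTMs share a cluster
adj <- matrix(0, 4, 4, dimnames = list(ptms, ptms))
for (members in clusters) adj[members, members] <- 1
diag(adj) <- 0

# PTM CCCN: correlations survive only between co-clustered PTMs
ptm.cccn <- ptm.cor * adj
idx <- which(upper.tri(ptm.cccn) & ptm.cccn != 0, arr.ind = TRUE)
ptm.edges <- data.frame(
  source = ptms[idx[, "row"]],
  target = ptms[idx[, "col"]],
  Weight = ptm.cccn[idx],
  stringsAsFactors = FALSE
)

# Gene CCCN: sum PTM correlations for each pair of different genes
gene.s <- sub(" .*$", "", ptm.edges$source)  # gene symbol = text before first space
gene.t <- sub(" .*$", "", ptm.edges$target)
between <- gene.s != gene.t
gene.pairs <- data.frame(
  source = pmin(gene.s, gene.t)[between],  # order each pair alphabetically
  target = pmax(gene.s, gene.t)[between],
  Weight = ptm.edges$Weight[between],
  stringsAsFactors = FALSE
)
gene.edges <- aggregate(Weight ~ source + target, data = gene.pairs, FUN = sum)
print(ptm.edges)
print(gene.edges)
```

In this toy example the two GENEA PTMs are each negatively correlated with the GENEB PTM, so the single gene-level edge carries the summed weight of both PTM-level edges.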

Step 3: Retrieve Database Edgefiles

The third step of the P2P package requires the use of multiple protein-protein interaction (PPI) databases. The data is used to generate a PPI network in which proteins are nodes and their interactions are edges; it represents all known interactions observed across a wide range of cell types, disease states, and environmental conditions. The P2P package allows users to integrate data from three external databases: STRING, GeneMANIA, and PhosphoSitePlus. Other databases can also be downloaded and added to the PPI network. Each of the three external databases has a different interface for downloading data.

1. STRINGdb

# Install the STRINGdb package from Bioconductor
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("STRINGdb")

stringdb.edges <- GetSTRINGdb(gene.cccn.edges)

head(stringdb.edges)

Note:

Please ensure that only physical interactions are retained.

2. GeneMANIA

To our knowledge, no R package exists to programmatically query GeneMANIA. Using the data from GeneMANIA therefore involves two steps: first generating the input file and then processing the output files. The MakeDBInput function automatically generates an input file for querying GeneMANIA from within Cytoscape.

MakeDBInput(gene.cccn.edges, file.path.name = "db_nodes.txt")
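
Conceptually, a query file of this kind is a list of the genes that appear in the gene CCCN. The base R sketch below shows one way such a file could be produced. It is not the MakeDBInput implementation: the edge list column names (source, target) and the one-gene-per-line layout are assumptions, and the file is written to a temporary directory.

```r
# Toy gene CCCN edge list; column names are assumptions
gene.edges.toy <- data.frame(
  source = c("GENEA", "GENEA"),
  target = c("GENEB", "GENEC"),
  Weight = c(-1.5, 0.4),
  stringsAsFactors = FALSE
)

# Every gene that appears at either end of an edge, without duplicates
nodes <- sort(unique(c(gene.edges.toy$source, gene.edges.toy$target)))

# Write one gene per line (assumed layout) to a temporary file
nodes.file <- file.path(tempdir(), "db_nodes_demo.txt")
writeLines(nodes, nodes.file)
print(readLines(nodes.file))
```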

The ProcessGMEdgefile function processes the output files expected from the GeneMANIA query, together with the input file generated by MakeDBInput.

genemania.edges <- ProcessGMEdgefile(gm.edgefile.path, gm.nodetable.path, db_nodes.path)

head(genemania.edges)

3. PhosphoSitePlus

The kinase-substrate data can be downloaded from the PhosphoSitePlus database. Users must create an account and sign in to download the data.
The format.kinsub.table function reads the downloaded file and formats it so that all the PPI edge data frames share the same format for the next step.

kinsub.edges <- format.kinsub.table(kinasesubstrate.filename = "Kinase_Substrate_Dataset.txt")

head(kinsub.edges)

Step 4: Build PPI Network and Cluster Filtered Network

The cluster filtered network (CFN) lets users filter protein-protein interaction networks using the PTM clusters generated previously. PPIs are retained in the CFN only if the interacting proteins share statistically correlated PTMs identified via t-SNE clusters. The BuildClusterFilteredNetwork function combines all the PPI data obtained in Step 3 while retaining the desired edge weights. It then normalizes the weights to a scale of 0-1 and outputs a cluster filtered network that retains only interacting proteins whose genes are within the co-cluster correlation network created in Step 2.

network.list <- BuildClusterFilteredNetwork(stringdb.edges, genemania.edges, kinsub.edges, gene.cccn.edges, db.filepaths = c())
combined.PPIs <- network.list[[1]]
cfn <- network.list[[2]]
# To reduce clutter on graphs, the cfn edges can be merged:
cfn.merged <- mergeEdges(cfn)

head(cfn.merged)
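
The combine, normalize, and filter logic described for this step can be sketched in base R. This is an illustration, not the BuildClusterFilteredNetwork implementation. The toy edge tables, their column names, the per-database division by the maximum weight, and the rule that a PPI is kept when the same gene pair is an edge in the gene CCCN are all assumptions made for the example.

```r
# Toy PPI edge tables from two databases (made up; column names are assumptions)
stringdb.toy <- data.frame(
  source = c("GENEA", "GENEB"), target = c("GENEB", "GENEC"),
  Weight = c(900, 450), interaction = "STRINGdb", stringsAsFactors = FALSE
)
kinsub.toy <- data.frame(
  source = "GENEC", target = "GENEA",
  Weight = 1, interaction = "kinase-substrate", stringsAsFactors = FALSE
)
# Toy gene CCCN edge list
gene.cccn.toy <- data.frame(
  source = c("GENEA", "GENEA"), target = c("GENEB", "GENEC"),
  Weight = c(-1.5, 0.4), stringsAsFactors = FALSE
)

# Rescale the weights of one database to the 0-1 range (assumed scheme)
normalize01 <- function(w) w / max(abs(w))

# Combine all databases into one edge list with normalized weights
combined <- do.call(rbind, lapply(list(stringdb.toy, kinsub.toy), function(e) {
  e$Weight <- normalize01(e$Weight)
  e
}))

# Direction-independent key for a gene pair, e.g. "GENEA|GENEC"
pair.key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "|")

# Keep only PPIs whose gene pair is also linked in the gene CCCN
cccn.keys <- pair.key(gene.cccn.toy$source, gene.cccn.toy$target)
cfn.toy <- combined[pair.key(combined$source, combined$target) %in% cccn.keys, ]
print(cfn.toy)
```

Here the GENEB-GENEC interaction is dropped because that pair has no edge in the toy gene CCCN, while the kinase-substrate edge from GENEC to GENEA is kept regardless of its direction.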

Step 5: Pathway Crosstalk Network

Note: This step is directory sensitive. The user can check and set the working directory in R using getwd() and setwd("yourdirectoryhere") respectively. The function needs a path to the BioPlanet file and writes an edge list file to the working directory or to another path if one is given. If the file cannot be found, please check the working directory first.

Step 5, our final analysis step, is the Pathway Crosstalk Network (PCN). This step requires an external database file from NCATS BioPlanet (https://tripod.nih.gov/bioplanet/download/pathway.csv) that contains groups of genes (proteins) involved in various cellular processes, known as pathways. PathwayCrosstalkNetwork turns this data file into a list of pathways and converts those pathways into a pathway x pathway edge list (PCNedgelist) that carries multiple weights: a Jaccard similarity and a score. The score is derived from Cluster-Pathway Evidence using the common clusters found in Step 1 (Make Cluster List). Information about the Cluster-Pathway Evidence score can be found at: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010690. For graphing in Cytoscape, the Cluster-Pathway Evidence and Jaccard similarity edges are listed separately in the edge list called pathway.crosstalk.network.

PCN.data <- PathwayCrosstalkNetwork(common.clusters, bioplanet.file = "pathway.csv", createfile = getwd())
pathway.crosstalk.network <- PCN.data[[1]]
PCNedgelist <- PCN.data[[2]]
pathways.list <- PCN.data[[3]]

head(PCNedgelist)
head(pathway.crosstalk.network)
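
The Jaccard similarity part of the edge list can be sketched in base R: two pathways are similar to the extent that they share genes. This is an illustration on a made-up table, not the package's code. The column names PATHWAY_NAME and GENE_SYMBOL are assumptions about the layout of the BioPlanet file, and the Cluster-Pathway Evidence score is not computed here.

```r
# Toy stand-in for the BioPlanet table (column names are assumptions)
bioplanet.toy <- data.frame(
  PATHWAY_NAME = c("Pathway 1", "Pathway 1", "Pathway 1",
                   "Pathway 2", "Pathway 2", "Pathway 2", "Pathway 3"),
  GENE_SYMBOL  = c("GENEA", "GENEB", "GENEC",
                   "GENEB", "GENEC", "GENED", "GENEE"),
  stringsAsFactors = FALSE
)

# List of pathways: one character vector of genes per pathway
pathways <- split(bioplanet.toy$GENE_SYMBOL, bioplanet.toy$PATHWAY_NAME)

# Jaccard similarity: shared genes divided by all genes in either pathway
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pathway x pathway edge list over all unordered pairs
pairs <- t(combn(names(pathways), 2))
pcn.toy <- data.frame(source = pairs[, 1], target = pairs[, 2],
                      stringsAsFactors = FALSE)
pcn.toy$jaccard <- unname(mapply(
  function(s, t) jaccard(pathways[[s]], pathways[[t]]),
  pcn.toy$source, pcn.toy$target
))

# Keep only pathway pairs that share at least one gene
pcn.toy <- pcn.toy[pcn.toy$jaccard > 0, ]
print(pcn.toy)
```

Pathway 1 and Pathway 2 share two of their four distinct genes, giving a Jaccard similarity of 0.5; Pathway 3 shares no genes with the others and drops out.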

Saving Data

If you want to save your data to a file, all data structures can either be exported with the save function and loaded later or saved to a csv file with the write.csv function.

save(object, file = "filepath/name.rda")     # Saves object as a .rda
load("filepath/name.rda")                    # Loads object saved to a file
# For multiple objects
save(object1, object2, object.etc, file = "NewFile.RData")
utils::write.csv(object, file = "filepath/name.csv")   # Saves object as a .csv
object <- utils::read.csv(file = "filepath/name.csv")  # Loads object from .csv

You may also save your entire workspace (the Global Environment) using the save.image function, as shown below:

save.image(file = "filepath/name.RData")
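
For long-running steps, a small caching wrapper can make it easy to pick up where you left off. The sketch below uses base R's saveRDS and readRDS, which store a single object per file. `run_or_load` is a hypothetical helper, not part of P2P, and the slow step is simulated.

```r
# Hypothetical helper: run an expensive step once, then reload it from disk
run_or_load <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))  # Reuse the saved result
  }
  result <- compute()      # Otherwise compute it now
  saveRDS(result, path)    # and save it for next time
  result
}

# Demonstration with a simulated slow step and a temporary file
cache.file <- file.path(tempdir(), "clusterlist_demo.rds")
unlink(cache.file)  # Start from a clean state for the demonstration
slow_step <- function() {
  Sys.sleep(0.1)
  list(clusters = list(c("a", "b")), note = "computed")
}
first <- run_or_load(cache.file, slow_step)
# The second call reads the file, so its compute function is never run
second <- run_or_load(cache.file, function() stop("should not be recomputed"))
print(identical(first, second))
```

In this tutorial the same pattern could wrap Step 1, for example by passing function() MakeClusterList(ex_small_ptm_table, keeplength = 2, toolong = 3.5) as the compute argument, with set.seed called inside that function if repeatable clusters are needed.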

Nagashree Avabhrath, Mikhail Ukrainetz, Madison Moffett, Grant Smith, Lucia Williams, Mark Grimes
