
Analyzing Pathways from PTMs: A Guide
Nagashree Avabhrath, Mikhail Ukrainetz, Mark Grimes, Madison Moffett, Grant Smith, Lucia Williams
PTMsToPathways.Rmd
The PTMsToPathways package takes mass spectrometry data of post-translational modifications under different experimental conditions and implicates pathways that are involved. These pathways are generated based on analysis of which ptms cluster together (based on the similar reactiosn to the same environmental conditions) compared to how those proteins are known to interact. This clustering information is also used to find interactions between known, existing pathways.
An important note about this package: There are no returned outputs from any of the functions. All outputs listed are assigned to the Global Namespace in order to prevent loss of data and promote ease of use.
Starting Data
This package provides an example data set using 497 ptms and 69 experimental conditions in the correct format for entry. Below is an example selection of the input dataset:
dim(PTMsToPathways::ex.ptmtable)
PTMsToPathways::ex.ptmtable[38:50, 1:4]
#> [1] 497 69
#> H3122SEPTM.C1 H3122SEPTM.C2 H3122SEPTM.C3 H3122SEPTM.D1
#> CTTN ack K107 19.23272 19.85052 19.58630 21.23678
#> CTTN ack K124 17.50665 19.73125 19.12403 21.14083
#> CTTN ack K161 NA NA NA NA
#> CTTN ack K171 NA NA NA NA
#> CTTN ack K198 20.53742 20.89560 20.04847 21.37846
#> CTTN ack K235 20.62203 20.74460 21.06424 22.02580
#> CTTN ack K272 NA NA NA 21.44114
#> CTTN ack K309 21.52164 21.03127 21.14289 22.49696
#> CTTN ack K87 19.61967 20.63351 20.49436 22.14973
#> CTTN p S156 21.09816 23.48837 22.33529 23.80410
#> CTTN p S405 NA NA NA NA
#> CTTN p S417 NA 18.22809 18.97026 17.71552
#> CTTN p S418 NA 18.22809 18.97026 17.71552
Note that it is very important that the rownames are the names of post-translational modifications. All of their respective columns are the environmental conditions under which the post-translational modifications occurred. NAs – which are importantly not zeroes – represent condition-PTM combinations that were not studied. The numeric values are data output by the mass spectrometer. Ambigious PTMs, cases where a PTM could be any one of several PTMs, should also be separated by semicolons. (e.g. “AARS ubi k747; AMBLIL p U123”).
Pre-Processing Data
For data that does not conform to the above structure, the data must be converted. The following provides a set of general functions that may assist in this process.
Importing the Dataframe
R supports many file types and can automatically convert them into a data frame. For example, read.csv() will take a csv file and convert it into a data frame. The csv file type is recommended for data storage.
Turning Columns of PTMs into Rownames
If there is/are column(s) that contain parts of PTM names (e.g. the first column may contain the protein, the second column may contain the post-translational modification, and the third column may contain the location), the following command concatenates those columns and converts them into data frame. The vector “name.columns” is the indices of the columns which contain names. For the example below, the strings in columns 1, 2, and 3 will be concatenated in each row and then turned into a rowname. It is also possible to have name.columns equal a vector like c(2, 4, 6), which takes columns 2, 4, and 6.
Replacing Patterns
If the list of PTMs possesses symbols that are unnecessary such as “AARS ~K ubi K747”, this command will remove all strings included in the “patterns” vector from the rownames. Any pattern can be chosen so long as the user ensures that any special character (such as $ or @) have a “\\” in front of them.
patterns <- c(" ~K", "zzzz", "/", "\\$", "\\@")
patterns <- patterns[order(nchar(patterns), patterns, decreasing = TRUE)]
rownames(ptmtable) <- sapply(rownames(ptmtable), function(x) {for(p in patterns) x <- gsub(p, "", x); return(x)})
Pipeline
The following pipeline is intended to be a step-by-step guide to walk users through the process of using the PTMsToPathways package. It includes descriptions of each function. This pipeline must be run in order as subsequent steps require the data produced in previous steps. Example code and example outputs as well as estimated run-times are included with each description and are based on a preliminary dataset of ~9,000 post-translational modifications and 69 experimental conditions processed with a 12th Gen i5 processor and 8GB of RAM.
Step 1: Make Cluster List
Make Cluster List is the first step in the analyzing one’s data. This function takes the post-translational modification table and runs it through three calculations of distance: Euclidean Distance, Spearman Dissimilarity (1 - |Spearman Correlation|), and the average of the two of these. These calculations find the ‘distance’ between ptms based upon under what conditions they occur. In other words, they found how dissimilar each pair of PTMs are. These matricies are then run through t-SNE in order to put them into a 3-dimensional space and the clusters that are present in each present in each matrix are identified as well. Please note: t-SNE involves an element of randomness; in order to get the same results, set.seed(#) must be called. A correlation table is also produced based on the Spearman Correlation values for each pair of PTMs.
Code
MakeClusterList(ptmtable, correlation.matrix.name = "ptm.correlation.matrix", clusters.list.name = "clusters.list", tsne.coords.name = "all.tsne.coords", common.clusters.name = "common.clusters", keeplength = 2, toolong = 3.5)

Figure 1 Example plot produced by MakeClusterList calculated using Euclidean Distance. The Euclidean Distance between every PTM is calculated and that information is put into a large matrix. This matrix is then condensed using t-SNE to get coordinates in two-dimensional space, which is what the above figure shows. These data points – and the PTMs they represent – are put into clusters based on these positions. This process is also undergone using Spearman Dissimilarity and the average of Euclidean Distance and Spearman Dissimilarity.

Figure 2 Output of MakeClusterList. Automatic print statements from the function ordiplot from the package vegan.
Figure 3 The second cluster created by Euclidean Distance. The left column is the name of the PTM in this cluster. The right column is the cluster number.

Figure 4 The second common cluster. All six of these post-translational modifications were clustered together whether distance was calculated by Spearman, Euclidean, or the average.

Figure 5 Coordinates of the t-SNE plot produced by Spearman Dissimilarity. Each row is a ptm, and each column is its positional location along the x, y, and z axes.
Step 2: Make Correlation Network
Next, Make Correlation Network is run to filter the correlation matrix of PTMs by specific PTMs that cluster together or have react similarly to the same conditions. It groups the PTM correlation matrix based on the Genes of PTMs. By summing these submatrices, it produces a gene by gene cocluster correlation network shows strength of relationships between proteins using the common clusters between the three distance metrics. The PTM version is also saved for analysis by the user.
Code
MakeCorrelationNetwork(common.clusters, ptm.correlation.matrix, ptm.cccn.name = "ptm.cccn", gene.cccn.name = "gene.cccn")

Figure 6 First 17 rows and columns of the ptm.cccn produced by MakeCorrelationNetwork. PTMs that cluster together in all three distance metrics have entries that represent how strongly they correlate, or how alike their responses are under the same environment. PTMs that don’t cluster in all three distance metrics are correlated by an NA. Self-correlations are also marked by an NA to prevent self-similarity skewing.

Figure 7 First 17 rows and columns of the genes.cccn produced by MakeCorrelationNetwork. The ptm.cccn was condensed by combining all of the scores between PTMs corresponding to one gene. These new summed scores make up the scores of the genes.ccn. Self-correlations are marked by an NA to prevent self-similarity skewing.
Step 3: Retrieve Database Edgefiles
Description
The third step of the PTMsToPathways pipeline is broken up into several parts as there is a lot of choice for the user during each sub-step. PPI (protein-protein interaction) databases are consulted in order to filter the clusters by proteins that are known to interact with each other as well as how strongly they are known to interact. The standard recommended PPI database is STRINGdb; getting data from this database is the first (optional) step. This is accomplished with the function GetSTRINGdb. The function GetSTRINGdb requires the STRINGdb package to be downloaded. Code for this is supplied with the code to run the function itself. Please note, however, that the user may consult any database that they choose. After getting STRINGdb data (or not), the user runs MakeDBInput which produces a text file of all of their gene names. This information can be copy and pasted into any database that the user chooses in order to get data from other PPI networks. Step three is getting a GeneMANIA network, which is also optional but recommended. The user pastes their input data into GeneMANIA on the Cytoscape app and saves the edgefile and the nodetable. These files are then input into ProcessGMEdgefile in order to sort the data.
Note again that the database input can be used in any PPI database that the user chooses, though this package only explicitly supports STRINGdb and GeneMANIA. If another database is chosen, its file will have to be filtered manually by the user before moving on to step 4. The file should have four columns. Column one and two should strictly be labeled “Gene.1” and “Gene.2” in order to integrate with other PPI databases. Column three should name the interaction type between the two proteins (naming of this column is not important). The fourth column should contain the edgeweight and may be named however the user chooses. It is recommended, though, that the database is specified as well as the term ‘weight’ in the column name.
Part 1 — Get STRINGdb Data
Code
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("STRINGdb")
GetSTRINGdb(gene.cccn, stringdb.name = "stringdb.edges", nodenames.name = "nodenames")

Figure 8 First 18 rows of stringdb.edges produced by GetSTRINGdb. The first two columns represent the genes that are known to interact. The second column is the type of interaction. The GetSTRINGdb function limits the interaction types to database, database_transferred (database information from other species’ that are homologous to humans), experimental, and experimental_transferred (database information from other species’ that are homologous to humans). The final column is the score associated with this interaction.

Figure 9 First 18 rows of nodenames produced by GetSTRINGdb. Nodenames is a data frame consisting of every gene in gene.cccn. This represents all of the genes of the study (that cocluster).
Part 2 — Get File for Database Input
Code
MakeDBInput(gene.cccn, file.path.name = "db_nodes.txt")
Figure 10 First 15 lines from the produced text file named by file.path.name. Lists all of the gene names separated by newlines with no other extra characters for clean copy and pasting into a database input.
Part 3 — Process GeneMANIA File
Code
ProcessGMEdgefile(gm.edgefile.path, gm.nodetable.path, db_nodes.path, gm.network.name = "gm.network")
Figure 11 The node table produced by the GeneMANIA query.
Figure 12 The edge table produced by the GeneMANIA query.
Figure 13 First 44 rows of the GeneMANIA network. The first two columns represent the genes that are known to interact. The second column is the type of interaction. The ProcessGmEdgefile function limits the interaction types to Pathway and Physical Interactions. The final column is the score associated with this interaction.
Step 4: Build PPI Network
The fourth step in the pipeline, BuildPPINetwork, combines the data of protein-protein interactions from each of the provided databases. This package explicitly processes STRINGdb and GeneMANIA in Step 3, but the user may consult as many databases as desired (the processing for which is described in Step 3). BuildPPINetwork combines these databases as efficiently as possible while retaining desired edgeweights from each database. BuildPPINetwork also normalizes the weights on a scale of 0-1 and removes duplicate rows from the data frame.
Code
BuildPPINetwork(stringdb.edges = NA, gm.network = NA, db.filepaths = c(), ppi.network.name = "ppi.network")

Figure 14 First 19 rows of the ppi_network produced by find_ppi_edges. The first two columns represent the two genes that interact according to the consulted PPI databases. The third column is all of the interaction types concatenated together from each database. All subsequent columns (two, in the base case of just using STRINGdb and GeneMANIA), show the weights of the interaction from the associated database. These have been normalized to all be on a scale of 0-1.
Step 5: Cluster Filtered Network
Cluster Filtered Network, which is step five in the pipeline and one of the final outputs of the package, checks all of the edges in the PPI network to ensure that both of the genes are within the cocluster correlation network created in step two and that its weight is nonzero. If either of these conditions are not met, then it will be removed from the list of PPI edges. This new, cluster filtered network is then returned.
Code
ClusterFilteredNetwork(gene.cccn, ppi.network, cfn.name = "cfn")

Figure 15 First 19 rows of the cfn produced by ClusterFilteredNetwork. The first two columns represent the two genes that interact according to the consulted PPI databases. The third column is all of the interaction types concatenated together from each database. The final column is the sum of the weights from each inputted database.
Step 6: Pathway Crosstalk Network
Note: This step is directory sensitive. The user can check and set the working directory in R using getwd() and setwd(“yourdirectoryhere”) respectively. It needs a path to the bioplanet file and will put an edgelist file in the working directory or the otherwise given path. If the file cannot be found, please check the working directory first.
Step 6, our final analysis step, is the Pathway Crosstalk Network. This step requires input of an external database, bioplanet, that contains groups of genes (proteins) involved in various cellular processes known as pathways. PCN turns this database into a list of pathways and converts those pathways into a pathway x pathway edgelist that possesses multiple weights, a jaccard similarity, and a score. The score is derived from Cluster-Pathway Evidence using the common clusters found in Make Correlation Network. Info about the Cluster-Pathway Evidence score can be found at: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010690
Code
PathwayCrosstalkNetwork(common.clusters, file = "bioplanet.csv", edgelist.name = "PTPedgelist", createfile = getwd())
Figure 16 The Pathway To Pathway Edgelist created by Pathway Crosstalk Network. This table shows the relationships between each pair of pathways represented in Bioplanet and the dataset as well as two different scores for the strength of the interaction.
Step 7: Graph Cluster Filtered Network
This optional function GraphCfn creates a graph using the cluster filtered network in the Cytoscape app. When graphed, Cytoscape provides an interactive interface to view the data.
Code
GraphCfn(cfn, ptmtable, funckey = PTMsToPathways::ex.funckey, Network.title = "cfn", Network.collection = "PTMsToPathways",
visual.style.name = "PTMsToPathways.style", background.color = "#949494", edge.label.color = '#17202a', node.label.color = '#000000',
default.font = "Times New Roman", node.font.size = 20, edge.font.size = 8, edge.line.style = 'SOLID',
edge.opacity = 175, edge.label.opacity = 255, border.opacity = 255, node.label.opacity = 255, node.fill.opacity = 255)

Figure 17 Default cytoscape graph. Nodes represent the genes of the study. The score of each node is taken as the minimum value of the gene in the ptmtable. Edges are the lines between the nodes and they represent relationships. See the key below for a full description of each characteristic.
Key
The thicker the edge is, the stronger the interaction between these two genes. The border and shape of the node represent the type of protein this gene is.
Node Size:
- Greater the node size, larger the absolute value of the score
Node Color:
- Blue Node
- Negative score
- Negative score
- Yellow Node
- Positive score
- Positive score
- Green Node
- Approximately zero score
Node Shapes:
- “ELLIPSE”
- unknown
- unknown
- “ROUND_RECTANGLE”
- receptor tyrosine kinase
- receptor tyrosine kinase
- “VEE”
- SH2 protein
- SH2-SH3 protein
- SH2 protein
- “TRIANGLE”
- SH3 protein
- SH3 protein
- “HEXAGON”
- tyrosine kinase
- tyrosine kinase
- “DIAMOND”
- SRC-family kinase
- SRC-family kinase
- “OCTAGON”
- kinase
- phosphatase
- kinase
- “PARALLELOGRAM”
- transcription factor
- transcription factor
- “RECTANGLE”
- RNA binding protein
Node Border Colors:
- Orange
- deacetylase
- acetyltransferase
- deacetylase
- Blue
- demethylase
- methyltransferase
- demethylase
- Royal Purple
- membrane protein
- membrane protein
- Red
- kinase
- tyrosine kinase
- SRC-family kinase
- kinase
- Yellow
- phosphatase
- tyrosine phosphatase
- phosphatase
- Lilac
- G protein-coupled receptor
- receptor tyrosine kinase
- G protein-coupled receptor
- Grey
- default
Edge Colors:
- Red
- Phosphorylation
- pp
- controls-phosphorylation-of
- Phosphorylation
- Bright Magenta
- controls-expression-of
- controls-expression-of
- Dull Magenta
- controls-transport-of
- controls-transport-of
- Purple
- controls-state-change-of
- controls-state-change-of
- Blood Orange
- Acetylation
- Acetylation
- Lime Green
- Physical interactions
- Physical interactions
- Green
- BioPlex
- BioPlex
- Dull Green
- in-complex-with
- in-complex-with
- Seafoam Green
- experiments
- experiments_transferred
- experiments
- Cyan
- database
- database_transferred
- database
- Teal
- Pathway
- Predicted
- Pathway
- Dark Turquoise
- Genetic interactions
- Genetic interactions
- Yellow-Orange
- correlation
- correlation
- Royal Blue
- negative correlation
- negative correlation
- Bright Yellow
- positive correlation
- positive correlation
- Grey
- combined_score
- combined_score
- Dark Grey
- merged
- merged
- Light Grey
- intersect
- intersect
- Black
- peptide
- peptide
- Orange
- homology
- homology
- Dull Orange
- Shared protein domains
- Shared protein domains
- White
- Default
Arrow Types:
- Arrow
- Phosphorylation
- pp
- controls-phosphorylation-of
- controls-expression-of
- controls-transport-of
- controls-state-change-of
- Acetylation
- Phosphorylation
- No Arrow
- Default
Post-Processing Data
This is a completely optional step for users that want to visualize their data in different ways or interact with it in a desktop environment.
As an igraph Object
Most data structure outputs in this package are data frames or matrices with columns of nodes and column(s) of weights. You will need to view the data and manually set which two columns are the nodes and which column is the edges. The following code creates an igraph named graph.
df <- yourdatastructure
nodecolumns <- c(1, 2)
edgecolumn <- 3
graph <- igraph::graph_from_data_frame( df[,nodecolumns], directed = FALSE )
igraph::E(graph)$weight <- df[,3]
Saving Data
If you want to save your data to a file, all data structures can either be exported with the save function and loaded later or saved to a csv file with the write.csv function.
save(object, filename = "filepath/name.rda") # Saves object as a .rda
load("filepath/name.rda") # Loads object saved to a file
utils::write.csv(object, file = "filepath/name.csv") # Saves object as a .csv
utils::read.csv(file = "filepath/name.csv") # Loads object from .csv
You may also save your entire Global Environment namespace using the save.image function as shown below:
save.image(file = "filepath/name.RData")