CoClusterCorrelationNetwork.Rmd
The CCCN_CFN package takes experimental data on post-translational modifications (PTMs) measured under different experimental conditions and generates clusters of likely pathways. These pathways are derived by comparing which PTMs cluster together (that is, which PTMs behave similarly under the same conditions) with how the corresponding proteins are known to interact (using the STRINGdb database).
An important note about this package: none of the functions return their outputs. All outputs listed are instead assigned to the global environment in order to prevent loss of data and promote ease of use. Some functions also read variables from the global environment. Ensure that all required data is loaded into the global environment, especially if the analysis is carried out across multiple sessions.
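The pattern described above can be sketched in base R. The function `make_example_output` and the object name `doubled` are hypothetical stand-ins, not part of the package; the sketch only shows how an object assigned to the global environment can be checked for and carried over to a later session.

```r
# Hypothetical stand-in for a package function: it assigns its result to the
# global environment under a user-chosen name instead of returning it.
make_example_output <- function(x, output.name = "example.output") {
  result <- x * 2
  assign(output.name, result, envir = .GlobalEnv)
  invisible(NULL)
}

make_example_output(1:3, output.name = "doubled")

# Confirm the object exists before a later step tries to read it
doubled_exists <- exists("doubled", envir = .GlobalEnv)
print(doubled_exists)

# Save the object so that it can be reloaded in a later session
rds_path <- file.path(tempdir(), "doubled.rds")
saveRDS(get("doubled", envir = .GlobalEnv), rds_path)
doubled_restored <- readRDS(rds_path)
```

In a new session, reading the saved object back into the global environment (for example with `doubled <- readRDS(rds_path)`) makes it available to later steps again.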
This vignette is intended to be a step-by-step guide to walk users through the process of using the CCCN_CFN package. It includes an example pipeline demonstrating how to run the full analysis, along with descriptions of each function. This pipeline must be run in order, as subsequent steps require the data produced in previous steps. Estimated run-times are included with each description and are based on a preliminary dataset of ~9,000 post-translational modifications and 70 experimental conditions processed with a 12th Gen i5 processor and 8GB of RAM.
MakeClusterList(ptmtable, correlation.matrix.name = "ptm.correlation.matrix", list.name = "clusters.list", toolong = 3.5)
Figure 1 Example plot produced by MakeClusterList, calculated using Euclidean Distance
Figure 2 Output of MakeClusterList
#> $`1`
#> PTM.Name group
#> 1 AARS ubi K747 1
#> 2 ANAPC5 ubi K289 1
#> 3 CUTA ubi K112 1
#> 4 CYHR1 ubi K349 1
#> 5 EEF2 ack K498 1
#> 6 F11R ubi K97 1
#> 7 GMPS ack K9 1
#> 8 HERC2 ubi K20 1
#> 9 KPNB1 ubi K541 1
#> 10 LASP1 ubi K59 1
#> 11 LNPEP ubi K32 1
#> 12 MRPS27 ubi K94 1
#> 13 NME1 ack K39 1
#> 14 PCNA ubi K110 1
#> 15 PKM ack K141 1
#> 16 PLAU ubi K403 1
#> 17 PLK1 ubi K492 1
#> 18 PNPLA2 ubi K435 1
#> 19 PSMC1 ubi K237 1
#> 20 RBCK1 ubi K342 1
#> 21 RPS15A ubi K12 1
#> 22 TCAF1 ubi K817 1
#> 23 TUBB4B ubi K379; TUBB2A ubi K379; TUBB2B ubi K379 1
#> 24 UIMC1 ubi K245 1
#> 25 USP5 ubi K318 1
#> 26 VAMP7 ubi K125 1
#> 27 VCP ubi K295 1
Figure 3 First cluster created by Euclidean Distance
Make Cluster List is the first step in analyzing one's data. This function takes the post-translational modification table and runs it through three distance calculations: Euclidean distance, Spearman dissimilarity (1 - |Spearman correlation|), and the average of the two. These calculations find the 'distance' between PTMs based on the conditions under which they occur. The resulting matrices are then run through t-SNE in order to embed them in a 3-dimensional space. Please note: t-SNE involves an element of randomness, so set.seed() must be called with a fixed seed beforehand in order to get the same results. A correlation matrix based on the Spearman correlation is also produced (named ptm.correlation.matrix by default).
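The three distance calculations can be illustrated with base R on a small simulated table. This is a minimal sketch, not the package's implementation: the table `toy_ptms` is made-up data, the helper `rescale01` is hypothetical, and rescaling both matrices to [0, 1] before averaging is an assumption made here so that neither metric dominates. The t-SNE step is omitted.

```r
set.seed(42)  # fixes the simulated data; a seed is likewise needed before t-SNE

# Toy PTM table (made-up values): rows are PTMs, columns are conditions
toy_ptms <- matrix(rnorm(5 * 6), nrow = 5,
                   dimnames = list(paste0("PTM", 1:5), paste0("cond", 1:6)))

# Euclidean distance between PTM rows
euclid <- as.matrix(dist(toy_ptms, method = "euclidean"))

# Spearman correlation between PTMs; cor() compares columns, so transpose
spearman_cor <- cor(t(toy_ptms), method = "spearman")

# Spearman dissimilarity: 1 - |Spearman correlation|
spearman_dis <- 1 - abs(spearman_cor)

# Assumption: rescale each matrix to [0, 1] before averaging
rescale01 <- function(m) (m - min(m)) / (max(m) - min(m))
combined <- (rescale01(euclid) + rescale01(spearman_dis)) / 2

print(round(combined, 2))
```

Real PTM tables usually contain missing values, which this sketch does not handle.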
MakeCorrelationNetwork(tsne.matrices, ptm.correlation.matrix, keeplength = 2, common.clusters.name = "common.clusters", cccn.name = "cccn_matrix")
Figure 4 First 17 rows and columns of the cccn_matrix produced by MakeCorrelationNetwork
Make Correlation Network first finds the intersection of the Euclidean, Spearman, and SED cluster matrices, that is, the clusters common to all three groups. It then adds the genes in these PTMs to a list of common clusters and turns that list into an adjacency matrix. This adjacency matrix is used to filter the relevant data (the clusters) from the Spearman correlation matrix. The resulting cocluster correlation network shows the strength of relationships between proteins, using the clusters common to the three distance metrics.
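The intersection-and-filter idea can be sketched on four made-up PTMs. Everything here is illustrative: the group assignments, the helper `co_membership`, and the toy correlation values are assumptions, and the step that maps PTMs to genes is left out. Two PTMs stay connected only if all three clusterings place them in the same group.

```r
ptms <- c("A ubi K1", "B ack K2", "C ubi K3", "D ubi K4")

# Hypothetical cluster assignments from the three distance metrics
euclid_groups   <- c(1, 1, 2, 2)
spearman_groups <- c(1, 1, 1, 2)
sed_groups      <- c(1, 1, 2, 3)

# 1 where two PTMs share a cluster, 0 otherwise
co_membership <- function(groups, labels) {
  m <- outer(groups, groups, "==") * 1
  dimnames(m) <- list(labels, labels)
  m
}

# Element-wise product keeps only pairs that co-cluster under all three metrics
adjacency <- co_membership(euclid_groups, ptms) *
  co_membership(spearman_groups, ptms) *
  co_membership(sed_groups, ptms)
diag(adjacency) <- 0  # ignore self-pairs

# Toy correlation matrix (made-up values)
toy_cor <- matrix(0.5, nrow = 4, ncol = 4, dimnames = list(ptms, ptms))
diag(toy_cor) <- 1

# Keep correlations only where the adjacency matrix has an edge
filtered <- toy_cor
filtered[adjacency == 0] <- NA
print(filtered)
```

In this toy example only the pair "A ubi K1" and "B ack K2" shares a cluster under all three metrics, so it is the only correlation that survives the filter.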
PPI (protein-protein interaction) databases are consulted in order to filter the clusters down to proteins that are known to interact with each other, and to record how strongly they are known to interact. The standard PPI database used is STRINGdb, and getting data from this database is the first step. This is accomplished with the function GetSTRINGdb. Please note, however, that users may consult any database they choose. After getting STRINGdb data (or not), the user runs MakeDBInput, which produces a text file of all of their gene names. This list can be copied and pasted into any database the user chooses in order to get other PPI networks. Step three is getting a GeneMANIA network, which is optional but recommended. The user pastes their input data into GeneMANIA in the Cytoscape app and saves the edgefile and the nodetable. These files are then passed to ProcessGMEdgefile in order to sort the data.
Note again that the database input can be used with any PPI database the user chooses, though this package only explicitly supports STRINGdb and GeneMANIA. If another database is chosen, its file will have to be filtered manually by the user before moving on to step 4. The file should have three columns. Columns one and two must be labeled exactly "Gene.1" and "Gene.2" so that the file can be integrated with the other PPI databases. The third column should contain the edge weight and may be named however the user chooses. It is recommended, though, that the column name include both the name of the database and the term 'weight'.
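A file in the three-column layout described above could be prepared along these lines. This is a sketch under assumptions: the data frame `raw_edges`, its column names, the gene names, and the weights are all made up, and writing a comma-separated file is a guess, since the expected delimiter is not stated here.

```r
# Hypothetical export from another PPI database, with its own column names
raw_edges <- data.frame(
  protein_a = c("GENE1", "GENE2", "GENE3"),
  protein_b = c("GENE2", "GENE3", "GENE4"),
  score     = c(0.91, 0.78, 0.45),
  stringsAsFactors = FALSE
)

# Rename to the required layout: Gene.1, Gene.2, then a weight column whose
# name mentions both the database ("mydb" is a placeholder) and 'weight'
custom_edges <- data.frame(
  Gene.1      = raw_edges$protein_a,
  Gene.2      = raw_edges$protein_b,
  mydb.weight = raw_edges$score,
  stringsAsFactors = FALSE
)

# Assumption: a comma-separated file; adjust if another format is expected
out_path <- file.path(tempdir(), "mydb_edges.csv")
write.csv(custom_edges, out_path, row.names = FALSE)

print(read.csv(out_path))
```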
GetSTRINGdb(cccn_matrix, STRINGdb.name = "string.edges", nodenames.name = "nodenames")
Figure 5 First 18 rows of string.edges produced by GetSTRINGdb
Figure 6 First 18 rows of nodenames produced by GetSTRINGdb
MakeDBInput(cccn_matrix, file.path.name = "db_nodes.txt")
Figure 7 First 15 lines from the produced text file
ProcessGMEdgefile(gm.edgefile.path, gm.nodetable.path, nodenames, gm.network.name = "gm.network")
Figure 8 First 44 rows of the GeneMANIA network
BuildPPINetwork(cccn_matrix, db_file_paths = c(), ppi.network.name = "ppi.network")
Figure 9 First 19 rows of the ppi_network produced by find_ppi_edges
Note: Examples take about 5-10 minutes to run.
Protein-Protein Interaction (or PPI) networks show how different proteins are known to interact with each other. STRINGdb, a database of these PPI networks, is automatically consulted along with any other database files that are generated and entered by the user. The function gathers data from the PPI networks and filters it down to the genes of interest. The data from STRINGdb and any provided files are then combined into a single data frame that shows how strongly the proteins are known to interact.
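One way to picture the combining step is a merge of two edge tables on their gene columns. This is only a sketch of the idea and may differ from what the package does internally: the tables, weights, and the helper `normalize_pairs` are hypothetical, and ordering each gene pair alphabetically (so that an undirected edge matches regardless of direction) is an assumption.

```r
# Made-up edges from two sources, each with its own weight column
string_edges <- data.frame(
  Gene.1 = c("G1", "G2"),
  Gene.2 = c("G2", "G3"),
  STRINGdb.weight = c(0.9, 0.4),
  stringsAsFactors = FALSE
)
gm_edges <- data.frame(
  Gene.1 = c("G2", "G3"),
  Gene.2 = c("G1", "G4"),
  GeneMANIA.weight = c(0.2, 0.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: order each pair alphabetically so G2-G1 matches G1-G2
normalize_pairs <- function(edges) {
  first  <- pmin(edges$Gene.1, edges$Gene.2)
  second <- pmax(edges$Gene.1, edges$Gene.2)
  edges$Gene.1 <- first
  edges$Gene.2 <- second
  edges
}

# Full outer merge: every edge is kept, with NA where a source lacks it
combined_edges <- merge(normalize_pairs(string_edges),
                        normalize_pairs(gm_edges),
                        by = c("Gene.1", "Gene.2"), all = TRUE)
print(combined_edges)
```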
ClusterFilteredNetwork(cccn.matrix, ppi.network, cfn.name = "cfn")
Figure 10 First 19 rows of the cfn produced by ClusterFilteredNetwork
Cluster Filtered Network checks every edge in the PPI network to ensure that both of its genes are within our cocluster correlation network and that its weight is nonzero. If either condition is not met, the edge is removed from the list of PPI edges. The resulting cluster filtered network is the output of this step.
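The filter can be sketched with a toy matrix and edge list. Note the assumption: "its weight" is read here as the cocluster correlation network entry for the gene pair, which is only one possible interpretation. All names and values below are made up.

```r
genes <- c("G1", "G2", "G3")

# Toy cocluster correlation network: only G1 and G2 are connected
toy_cccn <- matrix(0, nrow = 3, ncol = 3, dimnames = list(genes, genes))
toy_cccn["G1", "G2"] <- 0.8
toy_cccn["G2", "G1"] <- 0.8

# Toy PPI edges; G4 is not in the cocluster correlation network
toy_ppi <- data.frame(
  Gene.1 = c("G1", "G1", "G2", "G4"),
  Gene.2 = c("G2", "G3", "G3", "G1"),
  weight = c(0.9, 0.5, 0.3, 0.7),
  stringsAsFactors = FALSE
)

# Condition 1: both genes appear in the cocluster correlation network
in_network <- toy_ppi$Gene.1 %in% rownames(toy_cccn) &
  toy_ppi$Gene.2 %in% colnames(toy_cccn)
candidates <- toy_ppi[in_network, ]

# Condition 2 (assumed reading): the network entry for the pair is nonzero
cccn_weight <- toy_cccn[cbind(candidates$Gene.1, candidates$Gene.2)]
toy_cfn <- candidates[cccn_weight != 0, ]
print(toy_cfn)
```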
PathwayCrosstalkNetwork(file = "bioplanet.csv", common.clusters, edgelist.name = "edgelist")
Figure 11 Will exist at some point
Note: This step is directory sensitive. You can check and set your working directory in R using getwd() and setwd("yourdirectoryhere") respectively. The function needs a path to the bioplanet file and will write an edgelist file to your working directory (the directory reported by getwd()). If you cannot find a file, please check your directories first.
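A quick check along these lines, run before the step, can save some searching. The file name matches the default shown in the function signature; the variable names are illustrative only.

```r
# Where R currently resolves relative paths
current_dir <- getwd()
print(current_dir)

# Path to the bioplanet file, relative to the working directory
bioplanet_path <- "bioplanet.csv"
bioplanet_found <- file.exists(bioplanet_path)

if (bioplanet_found) {
  message("Found: ", normalizePath(bioplanet_path))
} else {
  message("No '", bioplanet_path, "' in ", current_dir,
          "; call setwd() or pass a full path instead.")
}
```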
Pathway Crosstalk Network is the final step in the pipeline. It requires an external database, bioplanet, which consists of groups of genes (proteins) involved in various cellular processes. The PCN turns this database into a list of pathways and converts those pathways into a pathway x pathway edgelist that carries two weights: a Jaccard similarity and a score derived from Cluster-Pathway Evidence, using the common clusters found in Make Correlation Network.
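The Jaccard similarity part of the edgelist can be illustrated on three made-up pathways. This is a sketch only: the pathway contents, the column names `source` and `target`, and the choice to drop pairs with no shared genes are assumptions, and the Cluster-Pathway Evidence score is not computed here.

```r
# Made-up pathways: each is a set of gene names
toy_pathways <- list(
  pathwayA = c("G1", "G2", "G3"),
  pathwayB = c("G2", "G3", "G4"),
  pathwayC = c("G5", "G6")
)

# Jaccard similarity: shared genes divided by all genes in either pathway
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Every unordered pair of pathways
pathway_pairs <- t(combn(names(toy_pathways), 2))
edgelist <- data.frame(source = pathway_pairs[, 1],
                       target = pathway_pairs[, 2],
                       stringsAsFactors = FALSE)

edgelist$jaccard <- unname(mapply(
  function(from, to) jaccard(toy_pathways[[from]], toy_pathways[[to]]),
  edgelist$source, edgelist$target
))

# Assumption: pairs with no shared genes are not kept as edges
edgelist <- edgelist[edgelist$jaccard > 0, ]
print(edgelist)
```

Here pathwayA and pathwayB share two of four distinct genes, giving a Jaccard similarity of 0.5, while pathwayC shares none with either and is dropped.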