1. Overview

This guide will serve as a brief overview to the pathway significance testing workflow with the pathwayPCA package. We will discuss the four basic steps of pathway significance testing with the pathwayPCA package. These steps are: importing data, creating an Omics data object, testing pathways for significance, and inspecting the results. For detailed discussions of these steps, see the following appropriate vignettes:

  1. Download Packages
  2. Import and Tidy Data (vignette)
  3. Create Data Objects (vignette)
  4. Test Pathway Significance (vignette)
  5. Visualize Results (vignette)

Installing the Package

pathwayPCA is a package for R, so you need R first. We also strongly recommend the RStudio integrated development environment as a user-friendly graphical wrapper for R.

Stable Build

The stable build of our package will be available on Bioconductor in May of 2019. To access Bioconductor packages, first install BiocManager, then use BiocManager to install this package:


Development Build

Because we are currently in the development phase for version 2 of this package, you can install the package from GitHub. In order to install a package from GitHub, you will need the devtools:: package (https://github.com/r-lib/devtools) and either Rtools (for Windows) or Xcode (for Mac). Then you can install the development version of the pathwayPCA package from GitHub:


Loading Packages

Also, if you want your analysis to be performed with parallel computing, you will need a package to help you. We recommend the parallel package (it comes with R automatically). We also recommend the tidyverse package to help you run some of the examples in these vignettes (while the tidyverse package suite is required for many of the examples in the vignettes, it is not required for any of the functions in this package).


2. Import Data

This section is a quick overview of the material covered in the Import and Tidy Data vignette. Here we show how to import pathway information, assay and phenotype data, and how to join the assay and phenotype data into one data frame.

2.1 Import .gmt Files

The .gmt format is a commonly used file format for storing pathway information. Lists of pathways in the Molecular Signatures Database (MSigDB) can be downloaded from the MSigDB Collections page, and you can use the read_gmt function to import such a .gmt file into R. All .gmt files have a “description” field, which contains additional information on the pathway. However, this field can be left empty. In this example, we use description = FALSE to skip importing the “description” field.

The imported .gmt file is stored as a pathwayCollection list object. This list contains:

  • a list of the gene symbols or gene IDs contained in each pathway (pathways),
  • the names of the pathways (TERMS), and
  • (OPTIONAL) the description of each pathway (description). For Canonical Pathways files, if you specify description = TRUE, this is the hyperlink to the pathway description card on the GSEA website.

2.2 Import and Tidy Assay Data

We assume that the assay data (e.g. transcriptomic data) is either in an Excel file or flat text file. For example, your data may look like this:

In this example data set, the columns are individual samples. The values in each row are the gene expression measurements for that gene. Use the read_csv function from the readr package to import .csv files into R as tibble (table and data frame) objects. The read_csv function prints messages to the screen informing you of the assumptions it makes when importing your data. Specifically, this message tells us that all the imported data is numeric (col_double()) except for the gene name column (X1 = col_character()).

Incidentally, we consider gene symbols to adhere to the following conditions:

  • gene symbols must start with an English letter (a-z or A-Z), and
  • gene symbols can only contain English letters, Arabic numerals (0-9), and possibly a dash (-).

Furthermore, if your data has samples in the columns and -omic feature measurements in the rows (like the data set above), you’ll need to “tidy” the imported assay with the TransposeAssay function. The transposed data set will appear similar to the following:

2.3 Import Phenotype Info

Use the read_csv function to import the phenotype data. Once again, the read_csv function displays a message informing us of the types of data in each column. The following phenotype dataset for subject survival information contains the subject ID (Sample), survival time after disease onset in months (eventTime), and a logical (or binary) variable indicating if the subject died (TRUE or 1) or was lost to follow up (eventObserved; 0 or FALSE).

3. Create an Omics Data Object

This section is a quick overview of the material covered in the Creating Data Objects vignette.

3.1 Create an Object

Using the data you just imported, create a data object specific to survival, regression, or categorical responses. For our example dataset, we will create a survival Omics object to hold our assay, pathways, and survival responses. If your indicator is a binary variable, the CreateOmics function will attempt to coerce it to a logical variable. Therefore, death indicators should be coded as 0-1, not 1-2.

This package includes a subject-matched colon cancer survival assay subset (colonSurv_df) and a toy pathway collection with 15 pathways (colon_pathwayCollection). When we create this OmicsSurv object, the pathwayPCA package checks the overlap between the -omes recorded in the assay data and the gene symbols in the supplied pathway collection. The CreateOmics() function also prints some diagnostic messages to inform you of how well your pathway collection overlaps your data.

3.3 Detailed Object Views

Because the printing procedure for Omics objects is to show a summary of the contents, you need to use the get*() functions to view the individual components of the colon_OmicsSurv object we just created. Overall, you can use accessor functions to extract, edit, or replace data contained in the object. The accessor functions are listed in more detail in the Table of Accessors subsection of Chapter 3. Use these functions to confirm that the data object you created accurately reflects the data you intend to analyze.

3.3.1 View the Assay

#> # A tibble: 250 x 656
#>       JUN    SOS2   PAK3    RAF1  PRKCB    BTC     SHC1   PRKCA   ELK1    NRG1
#>     <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>   <dbl>
#>  1 -0.880  0.367   1.30   0.480  -0.867 -0.790 -2.84e-1  0.0104  1.04   0.670 
#>  2 -1.11   2.15    1.62   0.165   0.547 -0.261 -4.91e-4  1.18    1.49   1.30  
#>  3 -0.752  0.764   0.330 -0.541   0.164  0.452  6.91e-1 -0.950   0.667  0.128 
#>  4  1.03  -0.751   2.75  -0.878  -1.09  -0.481 -1.34e+0  1.52    2.23   3.65  
#>  5 -1.73   0.613   2.66   0.550   0.197 -1.02   1.56e-1 -1.29    1.68  -2.15  
#>  6 -0.162  0.126  -0.324  0.627  -0.106 -1.17   1.11e+0 -1.27    0.964  1.79  
#>  7 -0.980 -0.520   1.28  -0.343  -0.762 -0.489 -2.57e-1  0.0518  1.21  -0.312 
#>  8  0.616  0.0633  0.343 -0.0195  0.578 -1.37   6.38e-1  1.37    1.05  -0.703 
#>  9  1.35  -0.467  -0.149 -0.718  -0.898  0.527 -6.04e-2  1.32   -1.000 -0.266 
#> 10 -0.538  0.413  -0.503 -1.27   -0.466  0.178  3.45e-1 -1.40   -0.378  0.0623
#> # … with 240 more rows, and 646 more variables: PAK2 <dbl>, MTOR <dbl>,
#> #   PAK4 <dbl>, MAP2K4 <dbl>, EIF4EBP1 <dbl>, BAD <dbl>, PRKCG <dbl>,
#> #   NRG3 <dbl>, MAPK9 <dbl>, ERBB4 <dbl>, MAPK10 <dbl>, PTK2 <dbl>,
#> #   ERBB2 <dbl>, ERBB3 <dbl>, MAP2K2 <dbl>, TGFA <dbl>, BRAF <dbl>,
#> #   MAP2K1 <dbl>, MAP2K7 <dbl>, ABL1 <dbl>, NRG2 <dbl>, AKT1 <dbl>, ABL2 <dbl>,
#> #   AKT2 <dbl>, SHC4 <dbl>, RPS6KB1 <dbl>, RPS6KB2 <dbl>, AKT3 <dbl>,
#> #   NRAS <dbl>, GRB2 <dbl>, AREG <dbl>, STAT5B <dbl>, MAPK3 <dbl>,
#> #   STAT5A <dbl>, PAK6 <dbl>, SOS1 <dbl>, MYC <dbl>, MAPK1 <dbl>, NCK1 <dbl>,
#> #   PIK3R5 <dbl>, NRG4 <dbl>, HRAS <dbl>, MAPK8 <dbl>, EGFR <dbl>, GSK3B <dbl>,
#> #   CBLB <dbl>, KRAS <dbl>, CBL <dbl>, SHC3 <dbl>, CDKN1B <dbl>, CDKN1A <dbl>,
#> #   EGF <dbl>, EREG <dbl>, ARAF <dbl>, NCK2 <dbl>, SRC <dbl>, PIK3R3 <dbl>,
#> #   CAMK2A <dbl>, CAMK2B <dbl>, CAMK2D <dbl>, CAMK2G <dbl>, PAK1 <dbl>,
#> #   CBLC <dbl>, CRK <dbl>, PIK3CA <dbl>, PIK3CB <dbl>, CRKL <dbl>,
#> #   PIK3CD <dbl>, GAB1 <dbl>, PLCG1 <dbl>, PLCG2 <dbl>, SHC2 <dbl>,
#> #   HBEGF <dbl>, PIK3CG <dbl>, PIK3R1 <dbl>, PIK3R2 <dbl>, EPHB2 <dbl>,
#> #   EPHB4 <dbl>, EFNA5 <dbl>, PXN <dbl>, CDC42 <dbl>, EFNB3 <dbl>, RRAS <dbl>,
#> #   GRB7 <dbl>, SYNJ1 <dbl>, EPHB3 <dbl>, EFNB1 <dbl>, DNM1 <dbl>,
#> #   MAP4K4 <dbl>, GRIA1 <dbl>, EPHB1 <dbl>, ROCK1 <dbl>, ITSN1 <dbl>,
#> #   RAP1A <dbl>, RAC1 <dbl>, RAP1B <dbl>, EFNB2 <dbl>, WASL <dbl>, TF <dbl>,
#> #   KALRN <dbl>, …

3.3.3 View the Event Time

We can use the vector subsetting mechanic in R (vector[]) to view only the first ten event times.

3.3.4 View the Event Indicator

4. Test Pathways for Significance

After you have confirmed that the CreateOmics function created the Omics object you wanted, you can analyze the object with adaptive, elastic-net, sparse (AES) PCA or supervised PCA. This section is a quick overview of the material covered in the Test Pathway Significance vignette. For details of these methods functions, please see their respective sections in Chapter 4.

The function arguments are as follows. Both the AESPCA_pVals and SuperPCA_pVals functions take in an Omics object as the value to the object argument. AES-PCA can use permutation-based \(p\)-values, so the numReps argument controls how many permutations to take. If we set the number of permutations to 0, then the \(p\)-values will be calculated parametrically. The numPCs argument specifies how many principal components will be extracted from each pathway. The parallel and numCores arguments are used to control if and how the functions make use of parallel computing. Finally, the adjustment argument allows you to specify a family-wise error rate (FWER) or false discovery rate (FDR) adjustment for the pathway \(p\)-values. These options are documented in the adjustRaw_pVals function (see the help documentation for details).

5. Inspect Results

This section is a quick overview of the material covered in the Visualizing the Results vignette. The output of AESPCA_pVals() is list object with class aespcOut. The output of SuperPCA_pVals() is a list object with class superpcOut. Both of these objects have the following three elements:

  • pVals_df: a data frame with \(p\)-values and their details each pathway
  • PCs_ls: a list of the data frames of the first selected principal component(s) extracted from the assay data subset corresponding to each pathway.
  • loadings_ls: a list of the matrices of the loading vectors that correspond to the principal components in PCs_ls.

5.1 Analysis Output Table

For a quick and easy view of the pathway significance testing results, we can use the getPathpVals() function to access and subset the output data frame. If you are not using the tidyverse package suite, your results will print differently.

This function has the following modifying arguments:

  • score = FALSE: . Return the raw \(p\)-values. Return the \(p\)-value score, \(-\log(p)\), instead of the unadjusted (raw) \(p\)-values with score = FALSE.
  • numPaths = 20: Return the top 20 pathways by \(p\)-value score.
  • alpha = NULL: Return all pathways with raw \(p\)-values less than alpha. If you specify alpha, then do not specify numPaths.

5.2 Graph of Top Pathways

To visualize the significance of the pathways based on FDR or uncorrected \(p\)-values, we can use the ggplot2 package to create summary graphics of the analysis results.