1. Overview

This vignette is the third chapter in the “Pathway Significance Testing with pathwayPCA” workflow, providing a detailed perspective to the Creating Data Objects section of the Quickstart Guide. This vignette builds on the material covered in the “Import and Tidy Data” vignette. This guide will outline the major steps needed to create a data object for analysis with the pathwayPCA package. These objects are called Omics-class objects.

1.1 Outline

Before we move on, we will outline our steps. After reading this vignette, you should be able to

  1. Describe the data components within the Omics object class.
  2. Create a few Omics objects.
  3. Inspect and edit individual elements contained in these objects.

First, load the pathwayPCA package and the tidyverse package suite.

library(tidyverse)
library(pathwayPCA)

1.2 Import Data

Because this is the third chapter in the workflow, we assume that

  1. Your assay is “tidy”.
  2. Your gene pathways list is stored in a pathwayCollection object.
  3. Your phenotype and assay data have already been ID-matched.

If you are unsure about any of the three points above (or you don’t know what these mean), please review the Import and Tidy Data vignette first. It isn’t very long, but it will help you set up your data in the right way. If your data is not in the proper form, the steps in this vignette may be very difficult.

For the purpose of example, we will load some “toy” data: a combined assay / phenotype data frame and a pathwayCollection list. These objects already fit the three criteria above. This tidy data set has 656 gene expression measurements (columns) on 250 colon cancer patients (rows).

data("colonSurv_df")
colonSurv_df
#> # A tibble: 250 x 659
#>    sampleID OS_time OS_event   JUN  SOS2  PAK3  RAF1 PRKCB   BTC  SHC1 PRKCA
#>    <chr>      <dbl>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 subj1     64.9          0  9.29  5.48  8.21  8.03  5.49  6.65  8.26  8.94
#>  2 subj2     59.8          0  9.13  6.35  8.33  7.94  6.26  7.02  8.39  9.61
#>  3 subj3     62.4          0  9.37  5.67  7.82  7.74  6.05  7.52  8.69  8.40
#>  4 subj4     54.5          0 10.6   4.94  8.79  7.64  5.37  6.87  7.81  9.80
#>  5 subj5     46.3          1  8.70  5.60  8.75  8.05  6.07  6.49  8.45  8.21
#>  6 subj6     55.9          0  9.78  5.36  7.56  8.07  5.90  6.39  8.87  8.22
#>  7 subj7     58.0          0  9.22  5.05  8.20  7.80  5.55  6.86  8.28  8.97
#>  8 subj8     54.0          0 10.3   5.33  7.82  7.89  6.27  6.25  8.66  9.71
#>  9 subj9      0.427        1 10.8   5.07  7.63  7.69  5.48  7.57  8.36  9.69
#> 10 subj10    41.4          0  9.52  5.50  7.48  7.53  5.71  7.33  8.54  8.14
#> # … with 240 more rows, and 648 more variables: ELK1 <dbl>, NRG1 <dbl>,
#> #   PAK2 <dbl>, MTOR <dbl>, PAK4 <dbl>, MAP2K4 <dbl>, EIF4EBP1 <dbl>,
#> #   BAD <dbl>, PRKCG <dbl>, NRG3 <dbl>, MAPK9 <dbl>, ERBB4 <dbl>, MAPK10 <dbl>,
#> #   PTK2 <dbl>, ERBB2 <dbl>, ERBB3 <dbl>, MAP2K2 <dbl>, TGFA <dbl>, BRAF <dbl>,
#> #   MAP2K1 <dbl>, MAP2K7 <dbl>, ABL1 <dbl>, NRG2 <dbl>, AKT1 <dbl>, ABL2 <dbl>,
#> #   AKT2 <dbl>, SHC4 <dbl>, RPS6KB1 <dbl>, RPS6KB2 <dbl>, AKT3 <dbl>,
#> #   NRAS <dbl>, GRB2 <dbl>, AREG <dbl>, STAT5B <dbl>, MAPK3 <dbl>,
#> #   STAT5A <dbl>, PAK6 <dbl>, SOS1 <dbl>, MYC <dbl>, MAPK1 <dbl>, NCK1 <dbl>,
#> #   PIK3R5 <dbl>, NRG4 <dbl>, HRAS <dbl>, MAPK8 <dbl>, EGFR <dbl>, GSK3B <dbl>,
#> #   CBLB <dbl>, KRAS <dbl>, CBL <dbl>, SHC3 <dbl>, CDKN1B <dbl>, CDKN1A <dbl>,
#> #   EGF <dbl>, EREG <dbl>, ARAF <dbl>, NCK2 <dbl>, SRC <dbl>, PIK3R3 <dbl>,
#> #   CAMK2A <dbl>, CAMK2B <dbl>, CAMK2D <dbl>, CAMK2G <dbl>, PAK1 <dbl>,
#> #   CBLC <dbl>, CRK <dbl>, PIK3CA <dbl>, PIK3CB <dbl>, CRKL <dbl>,
#> #   PIK3CD <dbl>, GAB1 <dbl>, PLCG1 <dbl>, PLCG2 <dbl>, SHC2 <dbl>,
#> #   HBEGF <dbl>, PIK3CG <dbl>, PIK3R1 <dbl>, PIK3R2 <dbl>, EPHB2 <dbl>,
#> #   EPHB4 <dbl>, EFNA5 <dbl>, PXN <dbl>, CDC42 <dbl>, EFNB3 <dbl>, RRAS <dbl>,
#> #   GRB7 <dbl>, SYNJ1 <dbl>, EPHB3 <dbl>, EFNB1 <dbl>, DNM1 <dbl>,
#> #   MAP4K4 <dbl>, GRIA1 <dbl>, EPHB1 <dbl>, ROCK1 <dbl>, ITSN1 <dbl>,
#> #   RAP1A <dbl>, RAC1 <dbl>, RAP1B <dbl>, EFNB2 <dbl>, WASL <dbl>, …

Notice that the assay and survival response information have already been merged, so we have two additional columns (for Overall Survival Time and its corresponding death indicator). We also have a small collection of 15 pathways which correspond to our example colon cancer assay.

The pathway collection and tidy assay (with matched phenotype information) are all the information we need to create an Omics-class data object.


2. Omics-Class Objects Defined

Now that we have our data loaded, we can create an analysis object for the pathwayPCA package.

2.1 Class Overview

In this package, all primary input data will be in an Omics data object. There are three classes of Omics* objects, but one function (CreateOmics) creates all of them. Each class contains a tidy assay and pathwayCollection list. The classes differ in the type of response information they can hold. The classes, and their responses, are

  1. OmicsSurv—a data object for survival information, which includes event time (the time of last follow-up with a subject) and event indicator (did the subject die, or was the observation right-censored).
  2. OmicsReg—a data object for continuous responses (usually a linear regression response).
  3. OmicsCateg—a data object for categorical responses, the dependent variable of a generalized linear model. Currently, we only support binary classification (through logistic regression).
  4. OmicsPathway—a data object with no response. This is the “parent” class for the other three Omics classes.

2.2 Review of Data Types in R

Take a quick look back at the structure of our colonSurv_df object. We have a table data frame with the first two columns as subject response information and the rest as an expression design matrix. Look at the types of the columns of this data frame (the <dbl> and <int> tags directly under the column names): these tags tell us that the columns contain “double / numeric” (dbl) and “integer” (int) information. The other tags we could potentially see here are <chr> (character), <lgl> (logical), or <fct> (factor). These tags are important because they identify which “class” of data is in each column.

Here are some examples of how to change data between types. We inspect the first 10 entries of each object.

The CreateOmics function puts the response information into specific classes:

  • Survival data is stored as a pair of numeric (time) and logical (death indicator) vectors.
  • Regression data is stored in a numeric or integer vector.
  • Binary classification data is stored in a factor vector.

These restrictions are on purpose: the internal data creation functions in the pathwayPCA package have very specific requirements about the types of data they take as inputs. This ensures the integrity of your data analysis.


3. Create New Omics Objects

3.1 Overview of Subtypes

All new Omics objects are created with the CreateOmics function. You should use this function to create Omics-class objects for survival, regression, or categorical responses. This CreateOmics function internally calls on a specific creation function for each response type:

  • Survival response: the CreateOmicsSurv() function creates an Omics object with class OmicsSurv. This object will contain:
    • eventTime: a numeric vector of event times.
    • eventObserved: a logical vector of death (or other event) indicators. This format precludes the option of recurrent-event survival analysis.
    • assayData_df: a tidy data.frame or tibble of assay data. Rows are observations or subjects; the columns are -Omics measures (e.g. transcriptome). The column names must match a subset of the genes provided in the pathways list (in the pathwayCollection object).
    • pathwayCollection: a list of pathway information, as returned by the read_gmt function (see the Import and Tidy Data vignette for more details). The names of the genes in these pathways must match a subset of the genes recorded in the assay data frame (in the assayData_df object).
  • Regression response: the CreateOmicsReg() function creates an Omics object with class OmicsReg. This object will contain:
    • response: a numeric vector of the response.
    • assayData_df: a tidy data.frame or tibble of assay data, as described above.
    • pathwayCollection: a list of pathway information, as described above.
  • Binary Classification response: the CreateOmicsCateg() function creates an Omics object with class OmicsCateg. In future versions, this function will be able to take in \(n\)-ary responses and ordered categorical responses, but we only support binary responses for now. This object will contain:
    • response: a factor vector of the response.
    • assayData_df: a tidy data.frame or tibble of assay data, as described above.
    • pathwayCollection: a list of pathway information, as described above.

In order to create example Omics-class objects, we will consider the overall patient survival time (and corresponding censoring indicator) as our survival response, the event time as our regression response, and event indicator as our binary classification response.

3.2 Create a Survival Omics Data Object

Now we are prepared to create our first survival Omics object for later analysis with either AES-PCA or Supervised PCA. Recall that the colonSurv_df data frame has the survival time in the first column, the event indicator in the second column, and the assay expression data in the subsequent columns. Therefore, the four arguments to the CreateOmics function will be:

  • assayData_df: this will be only the expression columns of the colonSurv_df data frame (i.e. all but the first two columns). In R, we can remove the first two columns of the colonSurv_df data frame by negative subsetting: colonSurv_df[, -(1:2)].
  • pathwayCollection_ls: this will be the colon_pathwayCollection list object. Recall that you can import a .gmt file into a pathwayCollection object via the read_gmt function, or create a pathwayCollection list object by hand with the CreatePathwayCollection function.
  • response: this will be the first two columns of the colonSurv_df data frame. The survival time stored in the OS_time column and the event indicator stored in the OS_event column.
  • respType: this will be the word "survival" or an abbreviation of it.

Also, when you create an Omics*-class object, the CreateOmics() function prints helpful diagnostic messages about the overlap between the features in the supplied assay data and those in the pathway collection.

The last three sentences inform you of how strong the overlap is between the genes measured in your data and the genes selected in your pathway collection. This messages tells us that 9% of the 676 total genes included in all pathways were not measured in the assay; zero pathways were removed from the pathways list for having too few genes after gene trimming; and the genes in the pathways list call for 93.8% of the 656 genes measured in the assay. The last number is the most important: it measures how well your pathway collection overlaps with the genes measured in your assay. This number should be as close to 100% as possible. These diagnostic messages depend on the overlap between the pathway collection and the assay, so these messages are response agnostic.

3.3 View the New Object

In order to view a summary of the contents of the colon_OmicsSurv object, you need simply to print it to the R console.

Also notice that the CreateOmics() function stores a “cleaned” copy of the pathway collection. The object creation functions within the pathwayPCA package subset the feature data frame by the genes in each pathway. Therefore, if we have genes in the pathways that are not recorded in the data frame, then we will necessarily create missing (NA) predictors. To circumvent this issue, we check if each gene in each pathway is recorded in the data frame, and remove from each pathway the genes for which the assay does not have recorded expression levels. However, if we remove genes from pathways which do not have recorded levels in the predictor data frame, we could theoretically remove all the genes from a given pathway. Thus, we also check to make sure that each pathway in the given pathways list still has some minimum number of genes present (defaulting to three or more) after we have removed genes without corresponding expression levels.

The IntersectOmicsPwyCollct() function performs these two actions simultaneously, and this function is called and executed automatically within the object creation step. This function removes the unrecorded genes from each pathway, trims the pathways that have fewer than the minimum number of genes allowed, and returns a “trimmed” pathway collection. If there are any pathways removed by this execution, the pathways list within the trimPathwayCollection object within the Omics object will have a character vector of the pathways removed stored as the "missingPaths" attribute. Access this attribute with the attr() function.

3.4 Regression and Classification Omics Data Objects

We create regression- and categorical-type Omics data objects identically to survival-type Omics objects. We will use the survival time as our toy regression response and the death indicator as the toy classification response.


4. Inspecting and Editing Omics-Class Objects

In order to access or edit a specific component of an Omics object, we need to use specific accessor functions. These functions are named with the component they access.

4.1 Example “Get” Function

The get* functions access the part of the data object you specify. You can save these objects to their own variables, or simply print them to the screen for inspection. Here we print the assay data frame contained in the colon_OmicsSurv object to the screen:

getAssay(colon_OmicsSurv)
#> # A tibble: 250 x 656
#>       JUN    SOS2   PAK3    RAF1  PRKCB    BTC     SHC1   PRKCA   ELK1    NRG1
#>     <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>   <dbl>
#>  1 -0.880  0.367   1.30   0.480  -0.867 -0.790 -2.84e-1  0.0104  1.04   0.670 
#>  2 -1.11   2.15    1.62   0.165   0.547 -0.261 -4.91e-4  1.18    1.49   1.30  
#>  3 -0.752  0.764   0.330 -0.541   0.164  0.452  6.91e-1 -0.950   0.667  0.128 
#>  4  1.03  -0.751   2.75  -0.878  -1.09  -0.481 -1.34e+0  1.52    2.23   3.65  
#>  5 -1.73   0.613   2.66   0.550   0.197 -1.02   1.56e-1 -1.29    1.68  -2.15  
#>  6 -0.162  0.126  -0.324  0.627  -0.106 -1.17   1.11e+0 -1.27    0.964  1.79  
#>  7 -0.980 -0.520   1.28  -0.343  -0.762 -0.489 -2.57e-1  0.0518  1.21  -0.312 
#>  8  0.616  0.0633  0.343 -0.0195  0.578 -1.37   6.38e-1  1.37    1.05  -0.703 
#>  9  1.35  -0.467  -0.149 -0.718  -0.898  0.527 -6.04e-2  1.32   -1.00  -0.266 
#> 10 -0.538  0.413  -0.503 -1.27   -0.466  0.178  3.45e-1 -1.40   -0.378  0.0623
#> # … with 240 more rows, and 646 more variables: PAK2 <dbl>, MTOR <dbl>,
#> #   PAK4 <dbl>, MAP2K4 <dbl>, EIF4EBP1 <dbl>, BAD <dbl>, PRKCG <dbl>,
#> #   NRG3 <dbl>, MAPK9 <dbl>, ERBB4 <dbl>, MAPK10 <dbl>, PTK2 <dbl>,
#> #   ERBB2 <dbl>, ERBB3 <dbl>, MAP2K2 <dbl>, TGFA <dbl>, BRAF <dbl>,
#> #   MAP2K1 <dbl>, MAP2K7 <dbl>, ABL1 <dbl>, NRG2 <dbl>, AKT1 <dbl>, ABL2 <dbl>,
#> #   AKT2 <dbl>, SHC4 <dbl>, RPS6KB1 <dbl>, RPS6KB2 <dbl>, AKT3 <dbl>,
#> #   NRAS <dbl>, GRB2 <dbl>, AREG <dbl>, STAT5B <dbl>, MAPK3 <dbl>,
#> #   STAT5A <dbl>, PAK6 <dbl>, SOS1 <dbl>, MYC <dbl>, MAPK1 <dbl>, NCK1 <dbl>,
#> #   PIK3R5 <dbl>, NRG4 <dbl>, HRAS <dbl>, MAPK8 <dbl>, EGFR <dbl>, GSK3B <dbl>,
#> #   CBLB <dbl>, KRAS <dbl>, CBL <dbl>, SHC3 <dbl>, CDKN1B <dbl>, CDKN1A <dbl>,
#> #   EGF <dbl>, EREG <dbl>, ARAF <dbl>, NCK2 <dbl>, SRC <dbl>, PIK3R3 <dbl>,
#> #   CAMK2A <dbl>, CAMK2B <dbl>, CAMK2D <dbl>, CAMK2G <dbl>, PAK1 <dbl>,
#> #   CBLC <dbl>, CRK <dbl>, PIK3CA <dbl>, PIK3CB <dbl>, CRKL <dbl>,
#> #   PIK3CD <dbl>, GAB1 <dbl>, PLCG1 <dbl>, PLCG2 <dbl>, SHC2 <dbl>,
#> #   HBEGF <dbl>, PIK3CG <dbl>, PIK3R1 <dbl>, PIK3R2 <dbl>, EPHB2 <dbl>,
#> #   EPHB4 <dbl>, EFNA5 <dbl>, PXN <dbl>, CDC42 <dbl>, EFNB3 <dbl>, RRAS <dbl>,
#> #   GRB7 <dbl>, SYNJ1 <dbl>, EPHB3 <dbl>, EFNB1 <dbl>, DNM1 <dbl>,
#> #   MAP4K4 <dbl>, GRIA1 <dbl>, EPHB1 <dbl>, ROCK1 <dbl>, ITSN1 <dbl>,
#> #   RAP1A <dbl>, RAC1 <dbl>, RAP1B <dbl>, EFNB2 <dbl>, WASL <dbl>, TF <dbl>,
#> #   KALRN <dbl>, …

This function is rather simple: it shows us what object is stored in the assayData_df slot of the colon_OmicsSurv data object. As we should expect, we see all the columns of the colonSurv_df data frame except for the first two (the survival time and event indicator).

4.2 Example “Set” Function

If we needed to edit the assay data frame in the colon_OmicsSurv object, we can use the “replacement” syntax of the getAssay function. These are the “set” functions, and they use the getSLOT(object) <- value syntax. For example, if we wanted to remove all of the genes except for the first ten from the assay data, we can replace this assay data with a subset of the the original colonSurv_df data frame. The SLOT shorthand name is Assay, and the replacement value is the first ten gene expression columns (in columns 3 through 12) of the colonSurv_df data frame: colonSurv_df[, (3:12)].

getAssay(colon_OmicsSurv) <- colonSurv_df[, (3:12)]

Now, when we inspect the colon_OmicsSurv data object, we see only ten variables measured in the assayData_df slot, instead of our original 656.

Before we move on, we should resest the data in the assayData_df slot to the full data by

getAssay(colon_OmicsSurv) <- colonSurv_df[, -(1:2)]

4.3 Table of Accessors

Here is a table listing each of the “get” and “set” methods for the Omics class, and which sub-classes they can access or modify.

Command Omics Sub-class Function
getAssay(object) All Extract the assayData_df data frame stored in object.
getAssay(object) <- value All Set assayData_df stored in object to value.
getSampleIDs(object) All Extract the sampleIDs_char vector stored in object.
getSampleIDs(object) <- value All Set sampleIDs_char stored in object to value.
getPathwayCollection(object) All Extract the pathwayCollection list stored in object.
getPathwayCollection(object) <- value All Set pathwayCollection stored in object to value.
getEventTime(object) Surv Extract the eventTime_num vector stored in object.
getEventTime(object) <- value Surv Set eventTime_num stored in object to value.
getEvent(object) Surv Extract the eventObserved_lgl vector stored in object.
getEvent(object) <- value Surv Set eventObserved_lgl stored in object to value.
getResponse(object) Reg or Categ Extract the response vector stored in object.
getResponse(object) <- value Reg or Categ Set response stored in object to value.

The response vector accessed or edited with the getResponse method depends on if the object supplied is a “regression” Omics-class object or a “categorical” one. For regression Omics objects, getResponse(object) and getResponse(object) <- value get and set, respectively, the response_num slot. However, for categorical Omics objects, getResponse(object) and getResponse(object) <- value get and set, respectively, the response_fact slot. This is because regression objects contain numeric response vectors while categorical objects contain factor response vectors.

4.4 Inspect the Updated pathwayCollection List

As we mentioned in the Importing with the read_gmt Function subsection of the previous vignette, the pathwayCollection object will be modified upon Omics-object creation. Before, this list only had two elements, pathways and TERMS (we skipped importing the “description” field). Now, it has a third element: setsize—the number of genes contained in each pathway.


5. Review

We now summarize our steps so far. We have

  1. Defined the Omics class and three sub-classes: survival, regression, and categorical (and the “parent” class).
  2. Created an Omics object for the three sub-classes.
  3. Inspected and edited individual elements contained in these objects.

Now we are prepared to analyze our created data objects with either AES-PCA or Supervised PCA. Please read vignette chapter 4 next: Test Pathway Significance.

Here is the R session information for this vignette:

sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] pathwayPCA_1.1.4 forcats_0.4.0    stringr_1.4.0    dplyr_0.8.3     
#>  [5] purrr_0.3.3      readr_1.3.1      tidyr_1.0.0      tibble_2.1.3    
#>  [9] ggplot2_3.2.1    tidyverse_1.3.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_0.2.5 xfun_0.11        splines_3.6.2    haven_2.2.0     
#>  [5] lattice_0.20-38  colorspace_1.4-1 vctrs_0.2.1      generics_0.0.2  
#>  [9] htmltools_0.4.0  yaml_2.2.0       utf8_1.1.4       survival_3.1-8  
#> [13] rlang_0.4.2      pkgdown_1.4.1    pillar_1.4.3     withr_2.1.2     
#> [17] glue_1.3.1       DBI_1.1.0        dbplyr_1.4.2     modelr_0.1.5    
#> [21] readxl_1.3.1     lifecycle_0.1.0  munsell_0.5.0    gtable_0.3.0    
#> [25] cellranger_1.1.0 rvest_0.3.5      memoise_1.1.0    evaluate_0.14   
#> [29] knitr_1.26       parallel_3.6.2   fansi_0.4.0      lars_1.2        
#> [33] broom_0.5.3      Rcpp_1.0.3       backports_1.1.5  scales_1.1.0    
#> [37] desc_1.2.0       jsonlite_1.6     fs_1.3.1         hms_0.5.2       
#> [41] digest_0.6.23    stringi_1.4.3    grid_3.6.2       rprojroot_1.3-2 
#> [45] cli_2.0.0        tools_3.6.2      magrittr_1.5     lazyeval_0.2.2  
#> [49] crayon_1.3.4     pkgconfig_2.0.3  zeallot_0.1.0    Matrix_1.2-18   
#> [53] MASS_7.3-51.5    xml2_1.2.2       reprex_0.3.0     lubridate_1.7.4 
#> [57] assertthat_0.2.1 rmarkdown_2.0    httr_1.4.1       rstudioapi_0.10 
#> [61] R6_2.4.1         nlme_3.1-143     compiler_3.6.2