Read a set list file in Gene Matrix Transposed (.gmt) format, with special performance consideration for large files. Return the contents as a pathwayCollection object.

read_gmt(
  file,
  setType = c("pathways", "genes", "regions"),
  description = FALSE,
  nChars = 1e+07,
  delim = "\t"
)

Arguments

file

A path to a file or a connection. This file must be a .gmt file; otherwise, the parsed output will likely be nonsense. See the "Details" section for more information.

setType

The type of set: "pathways" (sets of genes), "genes" (sets of sites in RNA or DNA), or "regions" (sets of CpGs). Defaults to "pathways".

description

Should the "description" field (the second field in the .gmt file on each line) be included in the output? Defaults to FALSE.

nChars

The number of characters to read from a connection. The largest .gmt file we have encountered is the full C5 pathway collection from MSigDB (5917 pathways), which has roughly 5 million characters in UTF-8 encoding. Therefore, we default this argument to be twice the size of the largest pathway collection we have seen so far, 10,000,000.
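
For a file larger than the default allows, one way to pick a safe value is to use the file size: in UTF-8 every character occupies at least one byte, so the size in bytes is an upper bound on the number of characters. The following is a minimal base R sketch of that idea, not part of the package; the temporary file and its contents are hypothetical.

```r
# Toy .gmt file written to a temporary location (contents are made up).
tmp_path <- tempfile(fileext = ".gmt")
writeLines(
  c(
    "SET_A\thttp://example.org/a\tGENE1\tGENE2\tGENE3",
    "SET_B\thttp://example.org/b\tGENE2\tGENE4"
  ),
  con = tmp_path
)

# file.size() returns the size in bytes. In UTF-8 every character takes at
# least one byte, so this is an upper bound on the character count.
n_bytes <- file.size(tmp_path)

# readChar() stops at the end of the file, so asking for more characters
# than the file holds is harmless.
gmt_text <- readChar(tmp_path, nchars = n_bytes)
n_chars <- nchar(gmt_text, type = "chars")
```

Under this assumption, a call such as read_gmt(tmp_path, nChars = n_bytes) should read the whole file.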

delim

The .gmt delimiter. As proper .gmt files are tab delimited, this defaults to "\t".

Value

A pathwayCollection list of sets. This list has three elements:

  • 'setType' : A named list of character vectors. Each vector contains the names of the individual genes, sites, or CpGs within that set as a vector of character strings. The name of this list entry is equal to the value specified in setType.

  • TERMS : A character vector the same length as the 'setType' list with the proper names of the sets.

  • description : (OPTIONAL) A character vector the same length as the 'setType' list with a note on that set (for the .gmt file included with this package, this field contains hyperlinks to the MSigDB description card for that pathway). This field is included when description = TRUE.
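
To make the layout above concrete, here is a hand-built stand-in with the same three elements for setType = "pathways". It is a plain list with hypothetical set names and genes, not an object produced by read_gmt(), and it carries no class attribute.

```r
# Hand-built stand-in for the list described above (hypothetical contents).
toy_collection <- list(
  pathways = list(
    c("GENE1", "GENE2", "GENE3"),
    c("GENE2", "GENE4")
  ),
  TERMS = c("SET_A", "SET_B"),
  description = c("http://example.org/a", "http://example.org/b")
)

# Members of the first set, and its proper name.
first_genes <- toy_collection$pathways[[1]]
first_name <- toy_collection$TERMS[1]

# Number of members per set, labelled with the proper set names.
set_sizes <- lengths(toy_collection$pathways)
names(set_sizes) <- toy_collection$TERMS
```

The three elements are parallel: position i in TERMS and description refers to the i-th entry of the set list.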

Details

This function uses R's readChar function to read character input faster than readLines (and far faster than scan).
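
As an illustration of the readChar approach, the sketch below reads a toy tab-delimited file in a single call and splits it into sets. It is a simplified base R approximation under stated assumptions (a small hypothetical file, no error handling), not the package's actual implementation.

```r
# Toy tab-delimited .gmt file (hypothetical sets and genes).
tmp_path <- tempfile(fileext = ".gmt")
writeLines(
  c(
    "SET_A\thttp://example.org/a\tGENE1\tGENE2\tGENE3",
    "SET_B\thttp://example.org/b\tGENE2\tGENE4"
  ),
  con = tmp_path
)

# Read the whole file as one string in a single call.
raw_text <- readChar(tmp_path, nchars = 1e7)

# Split into lines (tolerating Windows line endings) and drop empty lines.
lines_char <- strsplit(raw_text, split = "\r?\n")[[1]]
lines_char <- lines_char[nzchar(lines_char)]

# Split each line on the delimiter: set name, description, then members.
fields_ls <- strsplit(lines_char, split = "\t", fixed = TRUE)

parsed_ls <- list(
  pathways = lapply(fields_ls, function(x) x[-(1:2)]),
  TERMS = vapply(fields_ls, `[`, character(1), 1),
  description = vapply(fields_ls, `[`, character(1), 2)
)
```

Reading the text in one call and splitting it in memory avoids the per-line overhead of line-oriented readers, which is the performance point made above.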

See the Broad Institute's "Data Formats" page for a description of the Gene Matrix Transposed file format: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29

See also

Examples

# If you have installed the package:
data_path <- system.file(
  "extdata", "c2.cp.v6.0.symbols.gmt",
  package = "pathwayPCA", mustWork = TRUE
)
geneset_ls <- read_gmt(data_path, description = TRUE)

# # If you are using the development version from GitHub:
# geneset_ls <- read_gmt(
#   "inst/extdata/c2.cp.v6.0.symbols.gmt",
#   description = TRUE
# )