6 Demographics Table With Table1
6.1 Introduction
In most scientific research journals, the first table is commonly referred to as “Table 1”. It presents descriptive statistics of the baseline characteristics of the study population, stratified by exposure. The table1 package makes it fairly straightforward to produce such a table in R. A Table 1 includes descriptive statistics for the total study sample, with the rows (explanatory variables) consisting of the key study variables that are often included in the final analysis.[1] The columns correspond to the levels of the stratifying variable (here, the outcome of interest/response variable). Cells report n (%) for categorical variables and the mean, SD, or median for continuous variables. A total column is also provided, which helps in assessing the overall sample.
6.2 Necessary Packages
The table1 package provides the table1() function used to create a Table 1. Its output is an HTML table, which makes life easy when copying the table into a Word document.
The boot package was created to aid in bootstrap analysis. It ships with numerous data sets, including clinical trial data sets, to make this possible. However, no code book is provided when the data are downloaded as a csv file. A page on GitHub explains every data set within the package.[2]
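As a minimal sketch of the setup, the check below reports which packages are missing before attaching them. The package list is an assumption: dplyr is included because recode_factor(), used later in this lesson, comes from it, and the object names `needed`, `installed`, and `missing_pkgs` are illustrative only.

```r
# Packages this lesson relies on (assumed list; dplyr supplies recode_factor())
needed <- c("table1", "dplyr", "boot")

# TRUE for each package that is already installed
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  # Tell the reader what to install instead of failing later
  message("Please install: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(table1)  # table1(), label(), units()
  library(dplyr)   # recode_factor()
}
```

The boot package does not need to be attached, because the data set is loaded with data(melanoma, package = "boot").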
6.3 Data source and description
Today, we will be using the melanoma data set, which consists of measurements taken on patients with malignant melanoma. Each patient had their tumor surgically removed between 1962 and 1977 at the Department of Plastic Surgery, University Hospital of Odense, Denmark. Each surgery consisted of the complete removal of the tumor together with about 2.5 cm of the surrounding skin. Afterwards, the thickness of the tumor was recorded, along with whether it was ulcerated. Both are important prognostic indicators: patients with a thick and/or ulcerated tumor have an increased chance of death from melanoma.
data(melanoma, package = "boot")
melanoma_data <- melanoma
# Now that we loaded the raw data set, we will conduct a visual exploration
# before wrangling the data and applying any functions, while also considering
# the requirements involved in the construction of a table1.
summary(melanoma_data)
      time          status          sex              age             year
 Min.   :  10   Min.   :1.00   Min.   :0.0000   Min.   : 4.00   Min.   :1962
 1st Qu.:1525   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1968
 Median :2005   Median :2.00   Median :0.0000   Median :54.00   Median :1970
 Mean   :2153   Mean   :1.79   Mean   :0.3854   Mean   :52.46   Mean   :1970
 3rd Qu.:3042   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:65.00   3rd Qu.:1972
 Max.   :5565   Max.   :3.00   Max.   :1.0000   Max.   :95.00   Max.   :1977
   thickness         ulcer
 Min.   : 0.10   Min.   :0.000
 1st Qu.: 0.97   1st Qu.:0.000
 Median : 1.94   Median :0.000
 Mean   : 2.92   Mean   :0.439
 3rd Qu.: 3.56   3rd Qu.:1.000
 Max.   :17.42   Max.   :1.000
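summary() reports means for status, sex, and ulcer even though these are numeric codes for groups. A short sketch of a complementary check is to count how often each code occurs. The object name `coded_counts` is illustrative and not part of the original lesson.

```r
data(melanoma, package = "boot")

# Frequency of each numeric code for the variables that represent groups
coded_counts <- lapply(melanoma[c("status", "sex", "ulcer")], table)
coded_counts

# Number of patients in the data set
nrow(melanoma)
```

Counts like these make it easier to confirm later that no observations were lost or mislabeled during recoding.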
6.4 Cleaning the data to create a model data frame
Let us now explore the type of variables within the data set.
typeof(melanoma_data$status)
[1] "double"
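Rather than checking one column at a time, the storage type of every column can be listed in a single call. This is a small sketch in base R, and the name `column_types` is illustrative.

```r
data(melanoma, package = "boot")

# Storage type of each column, returned as a named character vector
column_types <- vapply(melanoma, typeof, character(1))
column_types

# TRUE if every column is stored as a number (no factors yet)
all(vapply(melanoma, is.numeric, logical(1)))
```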
We will first provide a basic table1 to illustrate how the function works. Currently, all the variables are in numeric/double format; for a basic table1, however, it is important to convert the dependent/response variable of interest into categories (a factor).
Our main variable of interest (dependent/response) is status. According to the code book found on GitHub, status is coded into three levels that indicate the patient’s status at the end of the study. Level 1 indicates that the patient died from melanoma, Level 2 that the patient was still alive at the conclusion of the study, and Level 3 that the patient died from causes unrelated to melanoma. With this in mind, let us convert status into a factor variable with three levels. For ease of analysis we will use 2 = “Alive” as the reference level. This can be done in two ways:
Although more time consuming, it is highly recommended that beginners first use as.factor() and then the recode_factor() function (from the dplyr package) to minimize errors.
Once you are more skilled and understand how the factor() function works, you can do everything in one step: factor() accepts the levels and the labels in a single call instead of splitting the work across several functions.
For our example we will use as.factor() and then recode_factor(), with 2 = “Alive” as our reference group.
melanoma_data$status <-
as.factor(melanoma_data$status)
# print the first six observations
head(melanoma_data$status)
[1] 3 3 2 3 1 1
Levels: 1 2 3
# Recode
melanoma_data$status <- recode_factor(
melanoma_data$status,
"2" = "Alive", # this is the reference group
"1" = "Died from melanoma",
"3" = "Non-Melanoma death"
)
# Print the first six observations
head(melanoma_data$status)
[1] Non-Melanoma death Non-Melanoma death Alive Non-Melanoma death
[5] Died from melanoma Died from melanoma
Levels: Alive Died from melanoma Non-Melanoma death
As the levels show, “Alive” is now the reference level because it is listed first. Choosing the reference level deliberately is extremely important: it determines the column order of the table and highlights the outcome of interest in your hypothesis, which lays the foundation of a well-organized table.
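For comparison, the one-step factor() approach mentioned earlier can be sketched as follows. It works on a fresh copy of the raw codes so that it does not disturb melanoma_data, and the name `status_one_step` is illustrative only.

```r
data(melanoma, package = "boot")

# levels gives the raw codes in the desired order (reference level first);
# labels gives the descriptive names in the same order
status_one_step <- factor(
  melanoma$status,
  levels = c(2, 1, 3),
  labels = c("Alive", "Died from melanoma", "Non-Melanoma death")
)

levels(status_one_step)
table(status_one_step)
```

The order of `levels` and `labels` must match position by position, which is the main source of mistakes with this approach and the reason the two-step route is suggested for beginners.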
6.5 Creation of basic table 1
Now that our main variable of interest is a factor with three levels, we will run a basic table1 with the independent/explanatory variables of interest: sex, age, ulcer, and thickness.
Recall that the explanatory variables of interest are still in “double” format. Conveniently, table1 lets us wrap such a variable in factor() directly in the formula, so we can inspect the results by level before the variable is formally converted to a factor and labeled. This only applies to variables stored as numbers in which each number represents a group. For instance, 0 is stored as a number, but we know it stands for a group, such as female.
For independent variables wrapped in factor(), the table reports the number of cases (observations) and the percentage in each level. For continuous variables, we get the mean, the SD, the median, the minimum, and the maximum.
basic_table1 <- table1(
~ factor(sex) + age + factor(ulcer) + thickness | status,
data = melanoma_data
)
basic_table1
| | Alive (N=134) | Died from melanoma (N=57) | Non-Melanoma death (N=14) | Overall (N=205) |
|---|---|---|---|---|
| factor(sex) | | | | |
| 0 | 91 (67.9%) | 28 (49.1%) | 7 (50.0%) | 126 (61.5%) |
| 1 | 43 (32.1%) | 29 (50.9%) | 7 (50.0%) | 79 (38.5%) |
| age | | | | |
| Mean (SD) | 50.0 (15.9) | 55.1 (17.9) | 65.3 (10.9) | 52.5 (16.7) |
| Median [Min, Max] | 52.0 [4.00, 84.0] | 56.0 [14.0, 95.0] | 65.0 [49.0, 86.0] | 54.0 [4.00, 95.0] |
| factor(ulcer) | | | | |
| 0 | 92 (68.7%) | 16 (28.1%) | 7 (50.0%) | 115 (56.1%) |
| 1 | 42 (31.3%) | 41 (71.9%) | 7 (50.0%) | 90 (43.9%) |
| thickness | | | | |
| Mean (SD) | 2.24 (2.33) | 4.31 (3.57) | 3.72 (3.63) | 2.92 (2.96) |
| Median [Min, Max] | 1.36 [0.100, 12.9] | 3.54 [0.320, 17.4] | 2.26 [0.160, 12.6] | 1.94 [0.100, 17.4] |
Note that the table1 package uses a familiar formula interface, where the variables to include in the table are separated by ‘+’ symbols, the “stratification” variable (which creates the columns) appears to the right of a “conditioning” symbol ‘|’, and the data argument specifies a data.frame that contains the variables in the formula.
If we do not wrap a grouped variable in factor(), the following happens:
wrong_table1 <- table1(
~ sex + age + ulcer + thickness | status,
data = melanoma_data
)
wrong_table1
| | Alive (N=134) | Died from melanoma (N=57) | Non-Melanoma death (N=14) | Overall (N=205) |
|---|---|---|---|---|
| sex | | | | |
| Mean (SD) | 0.321 (0.469) | 0.509 (0.504) | 0.500 (0.519) | 0.385 (0.488) |
| Median [Min, Max] | 0 [0, 1.00] | 1.00 [0, 1.00] | 0.500 [0, 1.00] | 0 [0, 1.00] |
| age | | | | |
| Mean (SD) | 50.0 (15.9) | 55.1 (17.9) | 65.3 (10.9) | 52.5 (16.7) |
| Median [Min, Max] | 52.0 [4.00, 84.0] | 56.0 [14.0, 95.0] | 65.0 [49.0, 86.0] | 54.0 [4.00, 95.0] |
| ulcer | | | | |
| Mean (SD) | 0.313 (0.466) | 0.719 (0.453) | 0.500 (0.519) | 0.439 (0.497) |
| Median [Min, Max] | 0 [0, 1.00] | 1.00 [0, 1.00] | 0.500 [0, 1.00] | 0 [0, 1.00] |
| thickness | | | | |
| Mean (SD) | 2.24 (2.33) | 4.31 (3.57) | 3.72 (3.63) | 2.92 (2.96) |
| Median [Min, Max] | 1.36 [0.100, 12.9] | 3.54 [0.320, 17.4] | 2.26 [0.160, 12.6] | 1.94 [0.100, 17.4] |
As you can see above, the values reported for the categorical explanatory variables are incorrect. For example, for sex we expect to see the number of individuals recorded as male or female, but instead we observe the mean, which is not a proper descriptive statistic because sex is a categorical variable.
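One way to catch this mistake early is to flag numeric columns that take only a handful of distinct values, since these are usually group codes rather than measurements. The sketch below is a rough heuristic, not part of the table1 package: the cutoff of 3 distinct values and the names `n_distinct_values` and `coded_vars` are assumptions chosen for this data set.

```r
data(melanoma, package = "boot")

# Number of distinct values in each column
n_distinct_values <- vapply(melanoma, function(x) length(unique(x)), integer(1))

# Columns with very few distinct values are likely categorical codes
# (cutoff of 3 is an assumption that suits this data set)
coded_vars <- names(n_distinct_values)[n_distinct_values <= 3]
coded_vars
```

Any variable flagged this way should still be checked against the code book before it is converted to a factor.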
To avoid this issue, as well as problems in other procedures (like logistic regression), it is crucial to convert such variables to factors before running any function. The basic table shown earlier is correct, but because we don’t have nice labels for the variables and categories, it doesn’t look great. To improve things, we can create factors with descriptive labels for the categorical variables (sex and ulcer), label each variable the way we want, and specify units for the continuous variables (age and thickness). According to the code book, sex is coded 1 = male, 0 = female, and ulcer is an indicator of ulceration: 1 = present, 0 = absent. We will also label the overall column “Total”, position it on the left, and add a caption and footnote. We begin by recoding sex and ulcer:
melanoma_data$sex <- as.factor(melanoma_data$sex)
# print the first six observations
head(melanoma_data$sex)
[1] 1 1 1 0 1 1
Levels: 0 1
# Recode
melanoma_data$sex <- recode_factor(
melanoma_data$sex,
"0" = "Female",
"1" = "Male"
)
# Print the first six observations
head(melanoma_data$sex)
[1] Male Male Male Female Male Male
Levels: Female Male
typeof(melanoma_data$ulcer)
[1] "double"
melanoma_data$ulcer <- as.factor(melanoma_data$ulcer)
# print the first six observations
head(melanoma_data$ulcer)
[1] 1 0 0 0 1 1
Levels: 0 1
# Recode
melanoma_data$ulcer <- recode_factor(
melanoma_data$ulcer,
"0" = "Absent",
"1" = "Present"
)
# Print the first six observations
head(melanoma_data$ulcer)
[1] Present Absent Absent Absent Present Present
Levels: Absent Present
In addition, we need to add units to the two continuous variables, age and thickness. According to the code book, age is the patient’s age measured in years and thickness corresponds to the tumor’s thickness in millimeters (mm). The table1 package provides an easy way to attach this measurement information through the units() function.
Additionally, for visual and descriptive purposes, the label() function sets the variable labels that will be shown in the final table. Finally, the caption argument provides a title for the table and the footnote argument provides any footnote information; in the call that follows, they are supplied through the objects caption_char and footnote_char.
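The labeling step itself is not shown in the text, so the following is a hedged reconstruction rather than the original code. The labels, units, and footnote text are taken from the rendered table; the caption wording is an assumption, because the caption does not appear in the output. The data are rebuilt with factor() only so that the sketch runs on its own; if melanoma_data already exists from the earlier steps, that part can be skipped.

```r
library(table1)

# Rebuild the recoded data so this sketch is self-contained
data(melanoma, package = "boot")
melanoma_data <- melanoma
melanoma_data$status <- factor(melanoma_data$status, levels = c(2, 1, 3),
  labels = c("Alive", "Died from melanoma", "Non-Melanoma death"))
melanoma_data$sex <- factor(melanoma_data$sex, levels = c(0, 1),
  labels = c("Female", "Male"))
melanoma_data$ulcer <- factor(melanoma_data$ulcer, levels = c(0, 1),
  labels = c("Absent", "Present"))

# Variable labels shown in the row headers
label(melanoma_data$sex) <- "Sex"
label(melanoma_data$age) <- "Age"
label(melanoma_data$ulcer) <- "Ulceration"
label(melanoma_data$thickness) <- "Thickness*"

# Units appended to the labels of the continuous variables
units(melanoma_data$age) <- "years"
units(melanoma_data$thickness) <- "mm"

# Caption wording is assumed; footnote text matches the rendered table
caption_char <- "Patient characteristics by status at the end of the study"
footnote_char <- "*Also known as Breslow thickness"
```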
Below is the final table1 layout. As you can see, we no longer wrap the variables in factor() inside the formula, because we already converted them to factors in the previous steps.
table1(
~ sex + age + ulcer + thickness | status,
data = melanoma_data,
overall = c(left = "Total"),
caption = caption_char,
footnote = footnote_char
)
| | Total (N=205) | Alive (N=134) | Died from melanoma (N=57) | Non-Melanoma death (N=14) |
|---|---|---|---|---|
| Sex | | | | |
| Female | 126 (61.5%) | 91 (67.9%) | 28 (49.1%) | 7 (50.0%) |
| Male | 79 (38.5%) | 43 (32.1%) | 29 (50.9%) | 7 (50.0%) |
| Age (years) | | | | |
| Mean (SD) | 52.5 (16.7) | 50.0 (15.9) | 55.1 (17.9) | 65.3 (10.9) |
| Median [Min, Max] | 54.0 [4.00, 95.0] | 52.0 [4.00, 84.0] | 56.0 [14.0, 95.0] | 65.0 [49.0, 86.0] |
| Ulceration | | | | |
| Absent | 115 (56.1%) | 92 (68.7%) | 16 (28.1%) | 7 (50.0%) |
| Present | 90 (43.9%) | 42 (31.3%) | 41 (71.9%) | 7 (50.0%) |
| Thickness* (mm) | | | | |
| Mean (SD) | 2.92 (2.96) | 2.24 (2.33) | 4.31 (3.57) | 3.72 (3.63) |
| Median [Min, Max] | 1.94 [0.100, 17.4] | 1.36 [0.100, 12.9] | 3.54 [0.320, 17.4] | 2.26 [0.160, 12.6] |

*Also known as Breslow thickness
6.6 Changing the table’s appearance
The default style of table1 uses an Arial font and resembles the booktabs style commonly used in LaTeX. While this default style is not ugly, inevitably there will be a desire to customize the visual appearance of the table (fonts, colors, gridlines, etc.). The package provides a limited number of built-in options for changing the style, while further customization can be achieved in R Markdown documents using CSS.[3]
6.6.1 Using built-in styles
The package includes a limited number of built-in styles including:
zebra: alternating shaded and unshaded rows (zebra stripes)
grid: show all grid lines
shade: shade the header row(s) in gray
times: use a serif font
These styles can be selected using the topclass argument of table1(). Some examples follow:
table1(~ sex + age + ulcer + thickness | status,
data = melanoma_data,
overall = c(left = "Total"),
caption = caption_char,
footnote = footnote_char,
topclass="Rtable1-zebra"
)
| | Total (N=205) | Alive (N=134) | Died from melanoma (N=57) | Non-Melanoma death (N=14) |
|---|---|---|---|---|
| Sex | | | | |
| Female | 126 (61.5%) | 91 (67.9%) | 28 (49.1%) | 7 (50.0%) |
| Male | 79 (38.5%) | 43 (32.1%) | 29 (50.9%) | 7 (50.0%) |
| Age (years) | | | | |
| Mean (SD) | 52.5 (16.7) | 50.0 (15.9) | 55.1 (17.9) | 65.3 (10.9) |
| Median [Min, Max] | 54.0 [4.00, 95.0] | 52.0 [4.00, 84.0] | 56.0 [14.0, 95.0] | 65.0 [49.0, 86.0] |
| Ulceration | | | | |
| Absent | 115 (56.1%) | 92 (68.7%) | 16 (28.1%) | 7 (50.0%) |
| Present | 90 (43.9%) | 42 (31.3%) | 41 (71.9%) | 7 (50.0%) |
| Thickness* (mm) | | | | |
| Mean (SD) | 2.92 (2.96) | 2.24 (2.33) | 4.31 (3.57) | 3.72 (3.63) |
| Median [Min, Max] | 1.94 [0.100, 17.4] | 1.36 [0.100, 12.9] | 3.54 [0.320, 17.4] | 2.26 [0.160, 12.6] |

*Also known as Breslow thickness
table1(~ sex + age + ulcer + thickness | status,
data = melanoma_data,
overall = c(left = "Total"),
caption = caption_char,
footnote = footnote_char,
topclass="Rtable1-grid"
)
| | Total (N=205) | Alive (N=134) | Died from melanoma (N=57) | Non-Melanoma death (N=14) |
|---|---|---|---|---|
| Sex | | | | |
| Female | 126 (61.5%) | 91 (67.9%) | 28 (49.1%) | 7 (50.0%) |
| Male | 79 (38.5%) | 43 (32.1%) | 29 (50.9%) | 7 (50.0%) |
| Age (years) | | | | |
| Mean (SD) | 52.5 (16.7) | 50.0 (15.9) | 55.1 (17.9) | 65.3 (10.9) |
| Median [Min, Max] | 54.0 [4.00, 95.0] | 52.0 [4.00, 84.0] | 56.0 [14.0, 95.0] | 65.0 [49.0, 86.0] |
| Ulceration | | | | |
| Absent | 115 (56.1%) | 92 (68.7%) | 16 (28.1%) | 7 (50.0%) |
| Present | 90 (43.9%) | 42 (31.3%) | 41 (71.9%) | 7 (50.0%) |
| Thickness* (mm) | | | | |
| Mean (SD) | 2.92 (2.96) | 2.24 (2.33) | 4.31 (3.57) | 3.72 (3.63) |
| Median [Min, Max] | 1.94 [0.100, 17.4] | 1.36 [0.100, 12.9] | 3.54 [0.320, 17.4] | 2.26 [0.160, 12.6] |

*Also known as Breslow thickness
table1(~ sex + age + ulcer + thickness | status,
data = melanoma_data,
overall = c(left = "Total"),
caption = caption_char,
footnote = footnote_char,
topclass="Rtable1-grid Rtable1-shade Rtable1-times"
)
| | Total (N=205) | Alive (N=134) | Died from melanoma (N=57) | Non-Melanoma death (N=14) |
|---|---|---|---|---|
| Sex | | | | |
| Female | 126 (61.5%) | 91 (67.9%) | 28 (49.1%) | 7 (50.0%) |
| Male | 79 (38.5%) | 43 (32.1%) | 29 (50.9%) | 7 (50.0%) |
| Age (years) | | | | |
| Mean (SD) | 52.5 (16.7) | 50.0 (15.9) | 55.1 (17.9) | 65.3 (10.9) |
| Median [Min, Max] | 54.0 [4.00, 95.0] | 52.0 [4.00, 84.0] | 56.0 [14.0, 95.0] | 65.0 [49.0, 86.0] |
| Ulceration | | | | |
| Absent | 115 (56.1%) | 92 (68.7%) | 16 (28.1%) | 7 (50.0%) |
| Present | 90 (43.9%) | 42 (31.3%) | 41 (71.9%) | 7 (50.0%) |
| Thickness* (mm) | | | | |
| Mean (SD) | 2.92 (2.96) | 2.24 (2.33) | 4.31 (3.57) | 3.72 (3.63) |
| Median [Min, Max] | 1.94 [0.100, 17.4] | 1.36 [0.100, 12.9] | 3.54 [0.320, 17.4] | 2.26 [0.160, 12.6] |

*Also known as Breslow thickness
Note that the style name needs to be preceded by the prefix Rtable1-. Multiple styles can be applied in combination by separating them with a space.
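When several styles are combined, the topclass string can be assembled from the short style names instead of being typed by hand. This is a small base R convenience, not a feature of the package, and the names `styles` and `topclass_value` are illustrative.

```r
# Short names of the built-in styles to combine
styles <- c("grid", "shade", "times")

# Add the required prefix to each name and join them with spaces
topclass_value <- paste0("Rtable1-", styles, collapse = " ")
topclass_value
```

The resulting string can then be passed as topclass = topclass_value in the table1() call.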
6.7 Conclusion
In conclusion, a Table 1 is one of the most widely used tools in scientific research, and understanding how to use the table1 package in R can benefit many. It is important to note that this presentation is just a brief summary of what is possible with this package. For example, you can add extra columns to the table, beyond the descriptive statistics, using the extra.col option. In addition, you can also regroup the response variable to highlight two of the responses, like dead or alive in our example.
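As a sketch of that last idea, the three-level status can be collapsed into a two-level version before it is used as the stratification variable. The grouping below (any death versus alive) is one possible choice rather than the authors' method, and `status_two_level` is an illustrative name.

```r
data(melanoma, package = "boot")

# Code 2 means alive at the end of the study; codes 1 and 3 are deaths
status_two_level <- factor(
  ifelse(melanoma$status == 2, "Alive", "Dead"),
  levels = c("Alive", "Dead")  # "Alive" stays the reference level
)

table(status_two_level)
```

Used in place of status on the right-hand side of the formula, a variable like this would yield a table with two outcome columns plus the overall column.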