Skimr is an R package that provides summary statistics about variables in data frames, tibbles, data tables and vectors. Its summaries are customizable: you can add statistics that are not part of R's default summary() function. Skimr lets us quickly assess data quality by variable and type in a compact report. This is a critical step in data exploration, where understanding our data helps us generate hypotheses and determine which analyses are appropriate.
This presentation will cover the simplest and most effective ways to explore data in R.
5.1.1 Packages
To begin, we will load the packages needed for the lesson:
readxl to import our data file, which is an Excel workbook
knitr, which provides the kable() function for constructing and customizing tables
tidyverse, which includes the dplyr package for data manipulation, ggplot2 for visualization, and readr for reading CSV files
skimr, which provides a compact summary of the variables in a dataset
# install.packages("skimr")
# install.packages("knitr")
# install.packages("tidyverse")
# load all the packages we will need to analyze the data and use the skim
# function
library(skimr)
library(knitr)
library(readxl)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
5.1.2 Census Data
For this assignment we will be using the Census_2010 dataset. There is no code book associated with the data, making it difficult to provide an accurate description of the variables. The information recorded shows the United States population estimates from the years 2010-2015, as well as relevant variables like net population change, number of births, number of deaths, international and domestic migration. Within the dataframe, there are 3,193 observations and 100 variables.
# import the data
# census_2010 <- read_csv("Data/census_2010.csv")
census_2010 <- readxl::read_xlsx("../data/01_census_2010.xlsx")
# what are the variables
colnames(census_2010) %>%
  head(n = 10)
In base R, the function most similar to skim() is summary(), which quickly summarizes the values in a data frame or vector.
The following code shows the summary() function applied to a column of our data set and to a vector:
#| label: Summary-syntax-with-data
# Example using summary function with data
summary(census_2010$CENSUS2010POP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
82 11299 26424 193387 71404 37253956
# Example using summary function with vector
# Define vector
x <- c(3, 4, 23, 5, 7, 8, 9, 12, 26, 15, 20, 21, NA)
# Summarize values in vector
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3.00 6.50 10.50 12.75 20.25 26.00 1
For a numeric vector, the summary() function automatically calculates the minimum value, the 1st quartile (25th percentile), the median, the mean, the 3rd quartile (75th percentile) and the maximum value. If the vector contains missing values (NA), summary() excludes them when calculating these statistics and reports how many there were in a separate NA's entry.
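As a quick check of the NA handling described above, the following base R sketch compares summary() with mean() and quantile() on the same vector x. The object names s and q are only illustrative.

```r
# Same vector as in the example above, including one missing value
x <- c(3, 4, 23, 5, 7, 8, 9, 12, 26, 15, 20, 21, NA)

# summary() drops the NA before computing the statistics and reports
# the number of missing values in a separate "NA's" entry
s <- summary(x)
s

# Without na.rm = TRUE, the mean of a vector containing NA is NA
mean(x)

# With na.rm = TRUE, it matches the "Mean" entry of summary()
mean(x, na.rm = TRUE)

# quantile() with na.rm = TRUE reproduces the minimum, quartiles and maximum
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
q
```

Functions such as mean() and quantile() need na.rm = TRUE to be set explicitly, whereas summary() handles missing values on its own.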
The skim() function will generate a summary of the variables in your dataset, including their data type, number of non-missing values, minimum and maximum values, median, mean, standard deviation, and more (Waring et al. 2022).
The following code checks that the object returned by skim() is a valid skim data frame, so that other Skimr functions can be used on it.
# is the summary data a skimr dataframe
skim(census_2010) %>%
  is_skim_df()
# TRUE
[1] TRUE
attr(,"message")
character(0)
We can explore the data as a tibble:
# use skim to get descriptive statistics of the data
skim(census_2010) %>%
  head(n = 10)
Data summary

Name                      census_2010
Number of rows            3193
Number of columns         100
_______________________
Column type frequency:
  character               2
  numeric                 8
________________________
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
STNAME         0          1              4    20   0      51        0
CTYNAME        0          1              4    33   0      1927      0

Variable type: numeric

skim_variable      n_missing  complete_rate  mean       sd          p0  p25    p50    p75    p100      hist
SUMLEV             0          1              49.84      1.25        40  50     50     50     50        ▁▁▁▁▇
REGION             0          1              2.67       0.81        1   2      3      3      4         ▁▆▁▇▂
DIVISION           0          1              5.19       1.97        1   4      5      7      9         ▂▇▅▆▃
STATE              0          1              30.26      15.15       1   18     29     45     56        ▃▇▆▆▇
COUNTY             0          1              101.92     107.63      0   33     77     133    840       ▇▁▁▁▁
CENSUS2010POP      0          1              193387.05  1176201.45  82  11299  26424  71404  37253956  ▇▁▁▁▁
ESTIMATESBASE2010  0          1              193396.87  1176244.25  82  11299  26446  71491  37254503  ▇▁▁▁▁
POPESTIMATE2010    0          1              193765.65  1178710.28  83  11275  26467  71721  37334079  ▇▁▁▁▁
Skimr functions give a cleaner and more detailed display of the results than the summary() function. This example shows the first ten variables in our data set. The data summary section reports the number of rows and columns, the column type frequency and any group variables. Each variable also gets additional descriptive information, such as the number of missing values and, for character variables, the number of unique values.
This information is relevant for data cleaning and for understanding each variable's distribution. Both are critical for determining which statistical analyses are most appropriate for a project.
5.4 Other Skimr Features
5.4.1 Separate dataframes by type
The data frames produced by skim() are wide and sparse, filled with columns that are mostly NA. For that reason, it can be convenient to work with “by type” subsets of the original data frame. These smaller subsets have their NA columns removed.
Features:
partition() - Creates a list of smaller data frames. Each entry in the list holds the summary for one data type in the original skim data frame.
bind() - Takes the list and rebuilds the original skim data frame.
yank() - Extracts the subtable for one particular type from a skim data frame.
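To see how the by-type tables compare with the full skim table, here is a minimal sketch. It assumes the built-in iris data rather than the census file, and the object names iris_skim and iris_parts are illustrative.

```r
library(skimr)

iris_skim <- skim(iris)

# The full skim table has one block of columns per variable type;
# rows of one type hold NA in the columns that belong to other types
names(iris_skim)

# partition() returns a named list with one narrower table per type
iris_parts <- partition(iris_skim)
names(iris_parts)

# The numeric table keeps only the columns relevant to numeric variables,
# so it has fewer columns than the full skim table
names(iris_parts$numeric)
ncol(iris_parts$numeric) < ncol(iris_skim)

# yank() pulls out a single type without building the whole list
yank(iris_skim, "factor")
```

iris has numeric and factor columns only, so the list has two entries; a dataset with more column types would produce more.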
The following code uses partition() to split the skim summary of census_2010 by variable type.
# split the character and numeric data
separate_df <- partition(skim(census_2010))
# check only the character data
separate_df$character
Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
STNAME         0          1              4    20   0      51        0
CTYNAME        0          1              4    33   0      1927      0
# create summary statistics for only numeric variables
numeric_separate_df <- separate_df[2]
# pull out the desired summary statistics in the nested list
head(numeric_separate_df$numeric["mean"]) %>%
  kable(digits = 1)
mean
49.8
2.7
5.2
30.3
101.9
193387.1
The following code uses bind() to combine the character and numeric tables back into the original skim data frame.
# combine the character and numeric data
head(bind(separate_df))
Data summary

Name                      census_2010
Number of rows            3193
Number of columns         100
_______________________
Column type frequency:
  character               2
  numeric                 4
________________________
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
STNAME         0          1              4    20   0      51        0
CTYNAME        0          1              4    33   0      1927      0

Variable type: numeric

skim_variable  n_missing  complete_rate  mean   sd     p0  p25  p50  p75  p100  hist
SUMLEV         0          1              49.84  1.25   40  50   50   50   50    ▁▁▁▁▇
REGION         0          1              2.67   0.81   1   2    3    3    4     ▁▆▁▇▂
DIVISION       0          1              5.19   1.97   1   4    5    7    9     ▂▇▅▆▃
STATE          0          1              30.26  15.15  1   18   29   45   56    ▃▇▆▆▇
# confirm that the bound table is the same as the original skimmed table
identical(bind(separate_df), skim(census_2010))
[1] TRUE
The following code uses yank() to extract a single table, e.g. the character table, to examine.
# Extract character data
yank(skim(census_2010), "character")
Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
STNAME         0          1              4    20   0      51        0
CTYNAME        0          1              4    33   0      1927      0
5.4.2 Skimr with Dplyr
Skimr functions can be used in combination with dplyr functions to examine specific variables within the census dataset.
The following example uses skim() with filter() to keep only the summary row for the variable CENSUS2010POP. A skim data frame can also be reduced with select(); the tibble printed below shows only the skim_type and skim_variable columns, that is, the data type and name of the first six variables.
# use dplyr functions on the statistics summary table
census_filter <- skim(census_2010) %>%
  filter(skim_variable == "CENSUS2010POP")
census_filter
# A tibble: 6 × 2
skim_type skim_variable
<chr> <chr>
1 character STNAME
2 character CTYNAME
3 numeric SUMLEV
4 numeric REGION
5 numeric DIVISION
6 numeric STATE
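The filter() and select() steps described above can be sketched together on a small dataset. This is a minimal example that assumes the built-in iris data instead of the census file; the object names are illustrative, and the conversion with as_tibble() is a choice made here so that the result prints as an ordinary tibble.

```r
library(skimr)
library(dplyr)

# Skim a built-in dataset and convert the result to a plain tibble
iris_summary <- skim(iris) %>%
  tibble::as_tibble()

# Keep only the row that describes one variable ...
one_variable <- iris_summary %>%
  filter(skim_variable == "Sepal.Length")

# ... and keep only the columns holding its type and name
result <- one_variable %>%
  select(skim_type, skim_variable)
result
```

Because each row of the skim table describes one variable of the original data, filter() chooses variables and select() chooses statistics.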
You can also narrow the output of the skim() function by naming the columns to summarize after the data argument, for example skim(census_2010, CENSUS2010POP). The statistics that are computed for each column type are customized with skim_with(), covered in Section 5.4.3.
Using mutate() in combination with skim(), we will compute a new variable in the census data and then include it in our skim data frame.
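As a small illustration of restricting what skim() reports, the sketch below passes column names and a tidyselect helper after the data argument. It assumes the built-in iris data rather than the census file, and petal_summary is an illustrative name.

```r
library(skimr)

# Name the columns to summarize after the data argument
skim(iris, Sepal.Length, Species)

# tidyselect helpers work as well; this keeps the columns whose
# names start with "Petal"
petal_summary <- skim(iris, dplyr::starts_with("Petal"))
petal_summary$skim_variable
```

Selecting columns this way is convenient for wide data such as the census file, where summarizing all 100 variables at once is hard to read.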
# create a new variable: the change in the number of births from 2010 to 2011
census_2010 %>%
  # new variable
  mutate(net_birth = BIRTHS2011 - BIRTHS2010) %>%
  # move the new variable to just after CENSUS2010POP
  relocate(net_birth, .after = CENSUS2010POP) %>%
  # summary statistics table
  skim() %>%
  # only the first fifteen variables
  head(n = 15) %>%
  # change the formatting
  kable(digits = 2)
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | STNAME | 0 | 1 | 4 | 20 | 0 | 51 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | CTYNAME | 0 | 1 | 4 | 33 | 0 | 1927 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | SUMLEV | 0 | 1 | NA | NA | NA | NA | NA | 49.84 | 1.25 | 40 | 50 | 50 | 50 | 50 | ▁▁▁▁▇ |
| numeric | REGION | 0 | 1 | NA | NA | NA | NA | NA | 2.67 | 0.81 | 1 | 2 | 3 | 3 | 4 | ▁▆▁▇▂ |
| numeric | DIVISION | 0 | 1 | NA | NA | NA | NA | NA | 5.19 | 1.97 | 1 | 4 | 5 | 7 | 9 | ▂▇▅▆▃ |
| numeric | STATE | 0 | 1 | NA | NA | NA | NA | NA | 30.26 | 15.15 | 1 | 18 | 29 | 45 | 56 | ▃▇▆▆▇ |
| numeric | COUNTY | 0 | 1 | NA | NA | NA | NA | NA | 101.92 | 107.63 | 0 | 33 | 77 | 133 | 840 | ▇▁▁▁▁ |
| numeric | CENSUS2010POP | 0 | 1 | NA | NA | NA | NA | NA | 193387.05 | 1176201.45 | 82 | 11299 | 26424 | 71404 | 37253956 | ▇▁▁▁▁ |
| numeric | net_birth | 0 | 1 | NA | NA | NA | NA | NA | 1870.12 | 11792.85 | -3 | 96 | 232 | 639 | 386443 | ▇▁▁▁▁ |
| numeric | ESTIMATESBASE2010 | 0 | 1 | NA | NA | NA | NA | NA | 193396.87 | 1176244.25 | 82 | 11299 | 26446 | 71491 | 37254503 | ▇▁▁▁▁ |
| numeric | POPESTIMATE2010 | 0 | 1 | NA | NA | NA | NA | NA | 193765.65 | 1178710.28 | 83 | 11275 | 26467 | 71721 | 37334079 | ▇▁▁▁▁ |
| numeric | POPESTIMATE2011 | 0 | 1 | NA | NA | NA | NA | NA | 195251.40 | 1189647.76 | 90 | 11277 | 26417 | 72387 | 37700034 | ▇▁▁▁▁ |
| numeric | POPESTIMATE2012 | 0 | 1 | NA | NA | NA | NA | NA | 196744.52 | 1200508.37 | 81 | 11195 | 26362 | 72496 | 38056055 | ▇▁▁▁▁ |
| numeric | POPESTIMATE2013 | 0 | 1 | NA | NA | NA | NA | NA | 198200.69 | 1211123.45 | 89 | 11180 | 26519 | 72222 | 38414128 | ▇▁▁▁▁ |
| numeric | POPESTIMATE2014 | 0 | 1 | NA | NA | NA | NA | NA | 199754.09 | 1222669.36 | 87 | 11121 | 26483 | 72257 | 38792291 | ▇▁▁▁▁ |
5.4.3 Adding Variables
As mentioned, skim() displays a default set of statistics; however, you can use skim_with() to change the summary statistics that are returned.
skim_with() is a closure: a function that returns a new function. This lets you have several skimming functions in a single R session, but it also means that you need to assign the return value of skim_with() before you can use it.
You assign values within skim_with() by using the sfl() helper (skimr function list). It lets you name the skimming functions you want to add and also identify those you want to remove, by setting them to NULL. Assign an sfl to each column type that you wish to modify.
skim_with() also accepts the following arguments:
base - An sfl that sets skimmers for all column types.
append - Whether the provided options should be in addition to the defaults already in skim. Default is TRUE.
For example, we will add the following statistics to the summary table: median, min, max and IQR for numeric columns, and length for character columns.
my_skim <- skim_with(
  numeric = sfl(median, min, max, IQR),
  character = sfl(length),
  append = TRUE
)
# add new variables into the summary table
census_2010 %>%
  my_skim() %>%
  head(n = 10)
Data summary

Name                      Piped data
Number of rows            3193
Number of columns         100
_______________________
Column type frequency:
  character               2
  numeric                 8
________________________
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace  length
STNAME         0          1              4    20   0      51        0           3193
CTYNAME        0          1              4    33   0      1927      0           3193

Variable type: numeric

skim_variable      n_missing  complete_rate  mean       sd          p0  p25    p50    p75    p100      hist   median  min  max       IQR
SUMLEV             0          1              49.84      1.25        40  50     50     50     50        ▁▁▁▁▇  50      40   50        0
REGION             0          1              2.67       0.81        1   2      3      3      4         ▁▆▁▇▂  3       1    4         1
DIVISION           0          1              5.19       1.97        1   4      5      7      9         ▂▇▅▆▃  5       1    9         3
STATE              0          1              30.26      15.15       1   18     29     45     56        ▃▇▆▆▇  29      1    56        27
COUNTY             0          1              101.92     107.63      0   33     77     133    840       ▇▁▁▁▁  77      0    840       100
CENSUS2010POP      0          1              193387.05  1176201.45  82  11299  26424  71404  37253956  ▇▁▁▁▁  26424   82   37253956  60105
ESTIMATESBASE2010  0          1              193396.87  1176244.25  82  11299  26446  71491  37254503  ▇▁▁▁▁  26446   82   37254503  60192
POPESTIMATE2010    0          1              193765.65  1178710.28  83  11275  26467  71721  37334079  ▇▁▁▁▁  26467   83   37334079  60446
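The example above only adds statistics. Removing a default with NULL, or replacing the defaults with append = FALSE, can be sketched as follows. This minimal example assumes the built-in iris data rather than the census file, and the names skim_no_hist, skim_iqr_only and iqr are illustrative.

```r
library(skimr)

# Remove one default: setting a skimmer to NULL drops it, here the
# inline histogram for numeric columns (append = TRUE is the default)
skim_no_hist <- skim_with(numeric = sfl(hist = NULL))

# Replace the numeric defaults: with append = FALSE, only the functions
# listed in sfl() are computed for numeric columns
skim_iqr_only <- skim_with(numeric = sfl(iqr = IQR), append = FALSE)

# Each call to skim_with() returned a new skimming function
no_hist <- skim_no_hist(iris)
iqr_only <- skim_iqr_only(iris)

# Compare the statistic columns produced by the two custom skimmers
names(no_hist)
names(iqr_only)
```

Naming the entry inside sfl(), as in iqr = IQR, sets the column name that appears in the summary table.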
5.5 Conclusion
Overall, Skimr is a useful package for quickly summarizing the variables in a dataset and gaining insights into its structure and content.
5.6 References
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. Skimr: Compact and Flexible Summaries of Data. https://docs.ropensci.org/skimr/.