R for Epidemiology
44 Random Error
This chapter is under heavy development and may still undergo significant changes.