Data Science for the Biomedical Sciences
Welcome
Preface
What to expect
Acknowledgements
Who drew the illustrations?
Dedication
About the Authors
Daniel Chen, MPH
Anne Brown, PhD
Who is this book for
The Personas
Alex Academic
Clare Clinician
Patricia Programmer
Samir Student
Code of Conduct
Workshop attendees
About this Document
Setup
Datasets
Spreadsheet
Programming language
R
Python
Binder (For Installation Issues)
R + RStudio
Python + Jupyter
Workshop Logistics
Zoom
Screen Layouts
Pin Video
1
Introduction
1.1
A glossary of terms
1.2
The Learning Process
1.3
Expectations
2
Spreadsheets
2.1
Learning Objectives
2.2
Introduction
2.3
Summary
2.4
Additional resources
3
R + RStudio
3.1
Introduction
3.2
The panels
3.3
Executing R code
3.4
Global settings
3.5
RStudio projects
4
Load Data
4.1
Learning Objectives
4.2
Introduction
4.3
Find your files
4.3.1
Paths in Windows
4.4
Set your working directory
4.5
Reading text files (CSV)
4.6
Reading Excel files
4.7
Selecting columns
4.8
Filtering rows
4.9
Subsetting columns and rows
4.10
Saving out data
4.11
Summary
4.12
Additional Resources
5
Descriptive Calculations
5.1
Introduction
5.2
Learning Objectives
5.3
Building the pipeline
5.4
Summary statistics
5.5
Groupby operations
5.6
Summary
5.7
Additional Resources
6
Clean Data (Tidy)
6.1
Introduction
6.2
What is tidy data?
6.3
Common data problems
6.4
Column headers are values, not variable names
6.5
Multiple variables stored in one column
6.6
Variables are stored in both rows and columns
6.7
Summary
6.8
Additional Resources
7
Visualization (Intro)
7.1
Introduction
7.2
Data Types
7.3
Grammar of Graphics
7.4
Data + geometries
7.4.1
Layer values
7.5
Geometries
7.5.1
Univariate
7.5.2
Bivariate
7.6
Other Aesthetic mappings
7.7
Facets
7.8
Themes
7.9
Additional Resources
8
Analysis Intro
8.1
Logistic Regression
8.1.1
Model 1
8.1.2
Model 2
8.1.3
Model 3
9
30-Day Readmittance
9.1
Data Filtering
9.2
Working with dates
9.2.1
Converting to datetime objects
9.2.2
Datetime calculations
9.2.3
Lead and lag time
9.3
Grouped column mutations
9.4
Find 30-day readmittance
10
Working with multiple datasets
10.1
Joins
10.1.1
Keep track of the number of rows
10.1.2
Formally test your assumptions
10.1.3
Different types of joins
10.1.4
Age at time of heart attack
10.2
Databases
10.2.1
Database connection
10.2.2
Database tables
10.2.3
SQL
10.2.4
Show SQL query
11
APIs (Application Programming Interface)
11.1
Leading Causes of Deaths
11.2
Adjusted Counts
11.3
Census API
11.3.1
Find the tables you need
11.3.2
Download the data
11.3.3
Write a function
11.3.4
Download and Tidy Data
11.3.5
Population counts
11.4
Population Adjusted Death
12
Functions
12.1
Break down the problem
12.1.1
Partial solution
12.1.2
Complete solution
12.2
Test your code
12.3
Use the glue package
12.4
Apply functions to data (purrr)
12.5
Use processed columns
13
Survival Analysis
13.1
Between Groups
13.2
Multiple variables
13.3
Cox Regression
13.3.1
Hazard Ratio
14
Machine Learning (tidymodels)
14.1
Logistic Regression
14.1.1
glm: Model 1
14.1.2
glm: Model 2
14.2
Tidymodels
14.2.1
Make a prediction
14.3
Train Test Split
14.4
Get class predictions and probabilities
14.5
Model Metrics and Evaluation
14.6
Feature Engineering
15
Additional resources
15.1
Communities
References