Chapter 4 Load Data
4.1 Learning Objectives
4.2 Introduction
Loading data into R is the first step!
First we need to load up a package to make loading data sets easier.
We will be using the tidyverse
set of packages for all of our data processing needs in R
1
and the pandas
library in Python.
R
We first will load up the tidyverse
packages using the library
function.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5 ✔ purrr 0.3.4
✔ tibble 3.1.4 ✔ dplyr 1.0.7
✔ tidyr 1.1.4 ✔ stringr 1.4.0
✔ readr 2.0.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
The first time you load up the tidyverse
library,
there will be some output that lists the packages tidyverse
loads,
along with any functions that share the same name as other functions (i.e., conflicts).
As long as you are mindful about starting a new RStudio session before you work,
you can ignore this output for now.
Now we can use all the functions within the Tidyverse to do our data processing.
If you are following along and you run a piece of code and end up with an could not find function
error,
make sure you typed library(tidyverse)
correctly and executed the line of code.
Python
import pandas as pd
4.3 Find your files
Let the below figures represent a ds4biomed
folder on your Desktop on Windows and Mac, respectively.
C:\
|- Users\
|- Daniel\
|- Desktop\
|- ds4biomed\
|- data\
| |- medicaldata_tumorgrowth.csv
|- output\
|- 01-load_data.R
|- README.md
|- ds4biomed.Rproj
/
|- Users/
|- Daniel/
|- Desktop/
|- ds4biomed/
|- data/
| |- medicaldata_tumorgrowth.csv
|- output/
|- 01-load_data.R
|- README.md
|- ds4biomed.Rproj
Suppose we are “in” the ds4biomed
folder,
where we can see the data
and output
folders
along with the
01-load_data.R
, README.md
, and ds4biomed.Rproj
files.
We can reference our medicaldata_tumorgrowth.csv
file inside the data
folder as: data/medicaldata_tumorgrowth.csv
.
That is, we can use the backslash, /
to move into folders.
We can write data/medicaldata_tumorgrowth.csv
because we are “starting from” the ds4biomed
folder.
This is called a relative path because the location of the csv
file is relative to the ds4biomed
starting point (aka working directory).
If we want to refer to any arbitrary filer or folder on the computer, we can specify the full path of the file.
The full path will start with a drive letter on windows,
C:\Users\Daniel\Desktop\ds4biomed\data\medicaldata_tumorgrowth.csv
,
and a /
on a Mac,
/Users/Daniel/Desktop/ds4biomed/data/medicaldata_tumorgrowth.csv
.
Suppose we create an analysis
folder for our 01-load_data.R
script so that our folder structure looks like this
(only the mac version is shown in the example below):
/
|- Users/
|- Daniel/
|- Desktop/
|- ds4biomed/
|- data/
| |- medicaldata_tumorgrowth.csv
|- output/
|- analysis
| |- 01-load_data.R
|- README.md
|- ds4biomed.Rproj
Now, if our working directory is now in the analysis
folder,
we need a way to reference one folder up to the ds4biomed
folder and then back down to the data
folder.
The way we can relatively reference the previous folder is with 2 dots, ..
, ../data/medicaldata_tumorgrowth.csv
Given the following example folder structure where we have a chart_review
and rct_m22-0305
folder in our projects
folder:
/
|- Users/
|- Daniel/
|- projects/
|-chart_review/
| |- data/
| | |- patients.csv
| |- analysis/
| |- demographics.R # you are working here
|- rct_m22-0305
|- data/
|- patients.csv
Let’s say we are currently in the chart_review/analysis
folder, working on our demographics.R
as denoted by the #
.
- Write the relative path to the
patients.csv
file in therct_m22-0305
folder. - Write the absolute path to the
patients.csv
file in thechart_review
folder.
4.3.1 Paths in Windows
When you are looking at file paths in the Windows Explorer,
you will notice that all Windows paths will use the backslash, \
,
instead of the forward slash, /
to refer to files.
In a lot of programming languages, including R, the \
is a special character,
so if you want to use \
for file paths in Windows, you will have to use 2 backslashes, e.g., ..\\data\\patients.csv\\
.
However, you can still use the regular /
in Windows to refer to folders just like other operating systems.
4.4 Set your working directory
So far, we have been talking about a “starting point” or “working directory”, when we have been referring to files around our computer. In order to quickly and reliably set your working directory,
we use RStudio Projects (Section 3.5), or Jupyter Lab
4.5 Reading text files (CSV)
Now that we know how to find our files, let’s load up our first data set.
When trying to type in a file path,
you can hit the <TAB>
key to autocomplete the files.
This will help you with a lot of potential spelling mistakes.
R
read_csv("data/medicaldata_tumorgrowth.csv")
Rows: 574 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Grp
dbl (4): Group, ID, Day, Size
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 574 × 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 564 more rows
Python
"data/medicaldata_tumorgrowth.csv") pd.read_csv(
Grp Group ID Day Size
0 1.CTR 1 101 0 41.8
1 1.CTR 1 101 3 85.0
2 1.CTR 1 101 4 114.0
3 1.CTR 1 101 5 162.3
4 1.CTR 1 101 6 178.3
.. ... ... ... ... ...
569 4.D+R 4 409 19 940.6
570 4.D+R 4 409 20 1057.0
571 4.D+R 4 409 21 1107.6
572 4.D+R 4 409 24 1694.2
573 4.D+R 4 409 25 1423.6
[574 rows x 5 columns]
Debug help:
- If the above code returns a could not find function "read_csv"
make sure you have loaded up the proper library with library(tidyverse)
- If the above code returns a does not exist in current working directory
,
make sure the working directory it lists is your expected “starting point” (i.e., working directory),
and make sure the file path is spelled correctly.
read_csv
will show us the columns that were read in,
as well as the data type of that column (e.g., character, double – a number).
Loading a data set is great, but we need a convenient way to refer to the data set.
We don’t want to re-load the data set every time we want to perform an action on it.
We can take this loaded data set and assign it to a variable.
We can do this with the assignment operator, <-
.
Note the way it is typed, a less than symbol (<
) followed immediately by the dash (-
) without any spaces in between.
The right side of the assignment operator, <-
, will be executed and then assigned to the variable on the left.
R
<- read_csv("data/medicaldata_tumorgrowth.csv") tumor
Rows: 574 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Grp
dbl (4): Group, ID, Day, Size
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Python
= pd.read_csv("data/medicaldata_tumorgrowth.csv") tumor
Notice this time we no longer see the dataset being printed.
The “Environment” tab in the RStudio panel will now have an entry for the variable you used.
Clicking on the right data set icon will open a view of your dataset,
clicking on the arrow will show you the column-by-column text representation (technically it’s called the structure
).
To look at our dataset we can execute just the variable we assigned the dataset to.
R
tumor
# A tibble: 574 × 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 564 more rows
Python
tumor
Grp Group ID Day Size
0 1.CTR 1 101 0 41.8
1 1.CTR 1 101 3 85.0
2 1.CTR 1 101 4 114.0
3 1.CTR 1 101 5 162.3
4 1.CTR 1 101 6 178.3
.. ... ... ... ... ...
569 4.D+R 4 409 19 940.6
570 4.D+R 4 409 20 1057.0
571 4.D+R 4 409 21 1107.6
572 4.D+R 4 409 24 1694.2
573 4.D+R 4 409 25 1423.6
[574 rows x 5 columns]
This tabular dataset that has now been loaded into R is called a data frame object (or simply dataframe),
the tidyverse
uses a tibble
.
For the most part, a data.frame
object will behave like a tibble
object.
4.6 Reading Excel files
R
To read an excel file we will need a separate library that can handle Excel files, readxl
.
library(readxl)
The readxl
library gives us access to the read_excel
function which we can use to read in an Excel file.
<- read_excel("data/medicaldata_tumorgrowth.xlsx") tumor_xl
Python
= pd.read_excel("data/medicaldata_tumorgrowth.xlsx") tumor_xl
And we can view the loaded Excel file just like before.
R
tumor_xl
# A tibble: 574 × 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 564 more rows
Python
tumor_xl
Grp Group ID Day Size
0 1.CTR 1 101 0 41.8
1 1.CTR 1 101 3 85.0
2 1.CTR 1 101 4 114.0
3 1.CTR 1 101 5 162.3
4 1.CTR 1 101 6 178.3
.. ... ... ... ... ...
569 4.D+R 4 409 19 940.6
570 4.D+R 4 409 20 1057.0
571 4.D+R 4 409 21 1107.6
572 4.D+R 4 409 24 1694.2
573 4.D+R 4 409 25 1423.6
[574 rows x 5 columns]
If you are ever lost as to what object you are working with, you can use the class
function in R
or the type
function in Python to tell you.
R
class(tumor)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
class(tumor_xl)
[1] "tbl_df" "tbl" "data.frame"
Python
type(tumor)
<class 'pandas.core.frame.DataFrame'>
type(tumor_xl)
<class 'pandas.core.frame.DataFrame'>
Once you have see a "tbl_df"
or "data.frame"
(R) or <class 'pandas.core.frame.DataFrame'>
(Python) you can be sure that the dataframe processing functions
we’ll learn will work.
Take away: Once you load a dataset and end up with either a data.frame
or tibble
object you can continue using
the same data processing methods we cover in these materials.
- Make sure you have the
ds4biomed
RStudio project open - Load up the
exercise-010-050.csv
file underdata > spreadsheets
(you can use the relative path or full path). - Save this dataframe object to the variable,
spreadsheet_df
<- read_csv("data/spreadsheets/exercise-010-050.csv") spreadsheet_df
Rows: 3 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): first_name, last_name
dbl (6): diastolic, systolic, heart_rate, respiratory_rate, pulse_ox, tempe...
lgl (1): temperature_unit
time (1): time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
4.7 Selecting columns
Here is the tumor dataset we are working with.
R
tumor
# A tibble: 574 × 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 564 more rows
Python
tumor
Grp Group ID Day Size
0 1.CTR 1 101 0 41.8
1 1.CTR 1 101 3 85.0
2 1.CTR 1 101 4 114.0
3 1.CTR 1 101 5 162.3
4 1.CTR 1 101 6 178.3
.. ... ... ... ... ...
569 4.D+R 4 409 19 940.6
570 4.D+R 4 409 20 1057.0
571 4.D+R 4 409 21 1107.6
572 4.D+R 4 409 24 1694.2
573 4.D+R 4 409 25 1423.6
[574 rows x 5 columns]
R
In order to subset the columns of our dataset, we can use the select
function.
The first argument of the select
function is the dataset we want to select columns from.
Then, the other arguments are the columns we want to select.
select(tumor, Size)
# A tibble: 574 × 1
Size
<dbl>
1 41.8
2 85
3 114
4 162.
5 178.
6 325
7 624.
8 648.
9 836.
10 1030.
# … with 564 more rows
If we want multiple columns (or to re-order columns) we can specify each column we want selected.
select(tumor, Group, Day, Size)
# A tibble: 574 × 3
Group Day Size
<dbl> <dbl> <dbl>
1 1 0 41.8
2 1 3 85
3 1 4 114
4 1 5 162.
5 1 6 178.
6 1 7 325
7 1 10 624.
8 1 11 648.
9 1 12 836.
10 1 13 1030.
# … with 564 more rows
Python
"Size"] tumor[
0 41.8
1 85.0
2 114.0
3 162.3
4 178.3
...
569 940.6
570 1057.0
571 1107.6
572 1694.2
573 1423.6
Name: Size, Length: 574, dtype: float64
"Group", "Day", "Size"]] tumor[[
Group Day Size
0 1 0 41.8
1 1 3 85.0
2 1 4 114.0
3 1 5 162.3
4 1 6 178.3
.. ... ... ...
569 4 19 940.6
570 4 20 1057.0
571 4 21 1107.6
572 4 24 1694.2
573 4 25 1423.6
[574 rows x 3 columns]
"Size"]] tumor[[
Size
0 41.8
1 85.0
2 114.0
3 162.3
4 178.3
.. ...
569 940.6
570 1057.0
571 1107.6
572 1694.2
573 1423.6
[574 rows x 1 columns]
R
4.7.1 The %>%
pipe operator
A common way tidyverse
code gets written is using the %>%
operator.
The %>%
takes the object on the left, and passes that into the first argument to the function on the right.
We can re-write the above select statements as follows.
R interprets the code in the exact same way.
%>%
tumor select(Size)
# A tibble: 574 × 1
Size
<dbl>
1 41.8
2 85
3 114
4 162.
5 178.
6 325
7 624.
8 648.
9 836.
10 1030.
# … with 564 more rows
The way we pronounce the pipe, %>%
is “then”.
So the above statement is read “tumor, then, select size”.
%>%
tumor select(Group, Day, Size)
# A tibble: 574 × 3
Group Day Size
<dbl> <dbl> <dbl>
1 1 0 41.8
2 1 3 85
3 1 4 114
4 1 5 162.
5 1 6 178.
6 1 7 325
7 1 10 624.
8 1 11 648.
9 1 12 836.
10 1 13 1030.
# … with 564 more rows
The benefits of the pipe may not be as apparent now, but when we start chaining multiple data sets together, it’ll make the code a little more readable.
4.8 Filtering rows
Now that we know how to select columns on our dataframe, we can now learn how to filter the rows.
R
To filter rows, we can use the filter
function.
The first argument of filter
is the dataframe object (the same pattern as the select
function),
and the other arguments are statements that return some TRUE
or FALSE
value (this is known as a boolean condition).
For example if we want to filter the tumor
dataset such that the Group
value is 1
we can filter the rows using that condition.
filter(tumor, Group == 1)
# A tibble: 97 × 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 87 more rows
We can write the “pipe” version as such:
%>%
tumor filter(Group == 1)
# A tibble: 97 × 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 3 85
3 1.CTR 1 101 4 114
4 1.CTR 1 101 5 162.
5 1.CTR 1 101 6 178.
6 1.CTR 1 101 7 325
7 1.CTR 1 101 10 624.
8 1.CTR 1 101 11 648.
9 1.CTR 1 101 12 836.
10 1.CTR 1 101 13 1030.
# … with 87 more rows
Python
"Group"] == 1] tumor.loc[tumor[
Grp Group ID Day Size
0 1.CTR 1 101 0 41.8
1 1.CTR 1 101 3 85.0
2 1.CTR 1 101 4 114.0
3 1.CTR 1 101 5 162.3
4 1.CTR 1 101 6 178.3
.. ... ... ... ... ...
92 1.CTR 1 108 6 484.9
93 1.CTR 1 108 7 584.5
94 1.CTR 1 108 11 1581.9
95 1.CTR 1 108 12 1930.6
96 1.CTR 1 108 13 2295.9
[97 rows x 5 columns]
If we want to combine multiple conditions, where Group
is 1
and Day
is 0
.
R
we can put each statement as a separate parameter into the filter
function.
%>%
tumor filter(Group == 1, Day == 0)
# A tibble: 8 × 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 102 0 79.4
3 1.CTR 1 103 0 44.8
4 1.CTR 1 104 0 67.7
5 1.CTR 1 105 0 54.7
6 1.CTR 1 106 0 60
7 1.CTR 1 107 0 46.8
8 1.CTR 1 108 0 49.4
We can also use the &
for the and operator.
R
%>%
tumor filter(Group == 1 & Day == 0)
# A tibble: 8 × 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 102 0 79.4
3 1.CTR 1 103 0 44.8
4 1.CTR 1 104 0 67.7
5 1.CTR 1 105 0 54.7
6 1.CTR 1 106 0 60
7 1.CTR 1 107 0 46.8
8 1.CTR 1 108 0 49.4
Python
In pandas,
if you have multiple conditions you must put each condition around a pair of round brackets ( )
.
"Group"] == 1) & (tumor["Day"] == 0)] tumor.loc[(tumor[
Grp Group ID Day Size
0 1.CTR 1 101 0 41.8
16 1.CTR 1 102 0 79.4
32 1.CTR 1 103 0 44.8
44 1.CTR 1 104 0 67.7
57 1.CTR 1 105 0 54.7
68 1.CTR 1 106 0 60.0
81 1.CTR 1 107 0 46.8
89 1.CTR 1 108 0 49.4
Or use the |
for the or operator.
R
%>%
tumor filter(Day == 0 | Day == 13)
# A tibble: 60 × 5
Grp Group ID Day Size
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR 1 101 0 41.8
2 1.CTR 1 101 13 1030.
3 1.CTR 1 102 0 79.4
4 1.CTR 1 102 13 619.
5 1.CTR 1 103 0 44.8
6 1.CTR 1 104 0 67.7
7 1.CTR 1 105 0 54.7
8 1.CTR 1 105 13 1699.
9 1.CTR 1 106 0 60
10 1.CTR 1 107 0 46.8
# … with 50 more rows
Python
"Day"] == 0) | (tumor["Day"] == 13)] tumor.loc[(tumor[
Grp Group ID Day Size
0 1.CTR 1 101 0 41.8
9 1.CTR 1 101 13 1030.4
16 1.CTR 1 102 0 79.4
25 1.CTR 1 102 13 618.8
32 1.CTR 1 103 0 44.8
44 1.CTR 1 104 0 67.7
57 1.CTR 1 105 0 54.7
66 1.CTR 1 105 13 1699.3
68 1.CTR 1 106 0 60.0
81 1.CTR 1 107 0 46.8
88 1.CTR 1 107 13 2342.6
89 1.CTR 1 108 0 49.4
96 1.CTR 1 108 13 2295.9
97 2.D 2 201 0 49.1
106 2.D 2 201 13 455.5
116 2.D 2 202 0 60.6
131 2.D 2 203 0 41.5
147 2.D 2 204 0 46.8
162 2.D 2 205 0 39.5
171 2.D 2 205 13 552.5
178 2.D 2 206 0 53.5
196 2.D 2 207 0 43.5
205 2.D 2 207 13 62.4
215 2.D 2 208 0 64.4
230 2.D 2 209 0 47.5
249 2.D 2 210 0 71.7
258 2.D 2 210 13 743.5
268 3.R 3 301 0 44.1
277 3.R 3 301 13 2058.7
278 3.R 3 302 0 42.1
287 3.R 3 302 13 589.5
298 3.R 3 303 0 42.5
307 3.R 3 303 13 442.4
318 3.R 3 304 0 56.9
327 3.R 3 304 13 1066.4
330 3.R 3 305 0 46.7
339 3.R 3 305 13 455.3
349 3.R 3 306 0 51.2
358 3.R 3 306 13 1177.5
363 3.R 3 307 0 44.0
372 3.R 3 307 13 651.8
380 3.R 3 308 0 59.8
387 3.R 3 308 13 956.9
389 3.R 3 309 0 40.7
398 3.R 3 309 13 565.4
410 3.R 3 310 0 58.2
419 3.R 3 310 13 1372.7
429 4.D+R 4 401 0 41.3
445 4.D+R 4 402 0 53.5
451 4.D+R 4 403 0 45.8
460 4.D+R 4 403 13 624.1
471 4.D+R 4 404 0 48.2
490 4.D+R 4 405 0 47.7
499 4.D+R 4 405 13 527.1
511 4.D+R 4 406 0 69.2
518 4.D+R 4 406 13 1227.7
524 4.D+R 4 407 0 43.9
541 4.D+R 4 408 0 59.3
558 4.D+R 4 409 0 51.1
566 4.D+R 4 409 13 691.4
4.9 Subsetting columns and rows
R
To subset on both columns and rows,
we call the corresponding subseting function one after the other.
Here we are taking the tumor
column, then select
ing the Group
, Day
and Size
columns, then
filter
ing the rows where the Size
is greater than 2000
.
%>%
tumor select(Group, Day, Size) %>%
filter(Size > 2000)
# A tibble: 14 × 3
Group Day Size
<dbl> <dbl> <dbl>
1 1 15 2406.
2 1 14 2163.
3 1 16 2125.
4 1 13 2343.
5 1 13 2296.
6 2 22 2047.
7 2 25 2269.
8 2 27 2115.
9 3 13 2059.
10 3 28 2362.
11 3 27 2157.
12 3 20 2160.
13 3 24 2189.
14 4 22 2343
This is an example where the pipe, %>%
makes the code a little easier to read.
We can “un-pipe” our code and it looks like this:
filter(select(tumor, Group, Day, Size), Size > 2000)
# A tibble: 14 × 3
Group Day Size
<dbl> <dbl> <dbl>
1 1 15 2406.
2 1 14 2163.
3 1 16 2125.
4 1 13 2343.
5 1 13 2296.
6 2 22 2047.
7 2 25 2269.
8 2 27 2115.
9 3 13 2059.
10 3 28 2362.
11 3 27 2157.
12 3 20 2160.
13 3 24 2189.
14 4 22 2343
Python
the .loc
notation actually has 2 parts, and looks like .loc[rows columns]
.
So in order to specifiy rows and columns, we separate how we want to select columns with a comma.
"Size"] > 2000, ["Group", "Day", "Size"]] tumor.loc[tumor[
Group Day Size
43 1 15 2405.7
67 1 14 2162.8
80 1 16 2125.1
88 1 13 2342.6
96 1 13 2295.9
130 2 22 2046.7
248 2 25 2269.3
267 2 27 2114.6
277 3 13 2058.7
297 3 28 2362.4
348 3 27 2157.4
362 3 20 2159.6
428 3 24 2188.7
523 4 22 2343.0
4.10 Saving out data
Saving out your data frame objects is the first step in creating pipelines. Even if all you do is read in data, filter it, and write out the subsetted dataset for another use case, you have created your first pipeline. Now, you have a repeatable way of performing the same action when your input dataset is updated or changed.
Let’s say we want to work with a filtered version of our tumor
dataset,
we can first write the code to subset the data we want.
R
<- tumor %>%
filtered select(Group, ID, Day, Size) %>%
filter(Day == 0 | Day == 13)
filtered
# A tibble: 60 × 4
Group ID Day Size
<dbl> <dbl> <dbl> <dbl>
1 1 101 0 41.8
2 1 101 13 1030.
3 1 102 0 79.4
4 1 102 13 619.
5 1 103 0 44.8
6 1 104 0 67.7
7 1 105 0 54.7
8 1 105 13 1699.
9 1 106 0 60
10 1 107 0 46.8
# … with 50 more rows
Python
= tumor.loc[(tumor["Day"] == 0) | (tumor["Day"] == 13), ["Group", "ID", "Day", "Size"]] filtered
filtered
Group ID Day Size
0 1 101 0 41.8
9 1 101 13 1030.4
16 1 102 0 79.4
25 1 102 13 618.8
32 1 103 0 44.8
44 1 104 0 67.7
57 1 105 0 54.7
66 1 105 13 1699.3
68 1 106 0 60.0
81 1 107 0 46.8
88 1 107 13 2342.6
89 1 108 0 49.4
96 1 108 13 2295.9
97 2 201 0 49.1
106 2 201 13 455.5
116 2 202 0 60.6
131 2 203 0 41.5
147 2 204 0 46.8
162 2 205 0 39.5
171 2 205 13 552.5
178 2 206 0 53.5
196 2 207 0 43.5
205 2 207 13 62.4
215 2 208 0 64.4
230 2 209 0 47.5
249 2 210 0 71.7
258 2 210 13 743.5
268 3 301 0 44.1
277 3 301 13 2058.7
278 3 302 0 42.1
287 3 302 13 589.5
298 3 303 0 42.5
307 3 303 13 442.4
318 3 304 0 56.9
327 3 304 13 1066.4
330 3 305 0 46.7
339 3 305 13 455.3
349 3 306 0 51.2
358 3 306 13 1177.5
363 3 307 0 44.0
372 3 307 13 651.8
380 3 308 0 59.8
387 3 308 13 956.9
389 3 309 0 40.7
398 3 309 13 565.4
410 3 310 0 58.2
419 3 310 13 1372.7
429 4 401 0 41.3
445 4 402 0 53.5
451 4 403 0 45.8
460 4 403 13 624.1
471 4 404 0 48.2
490 4 405 0 47.7
499 4 405 13 527.1
511 4 406 0 69.2
518 4 406 13 1227.7
524 4 407 0 43.9
541 4 408 0 59.3
558 4 409 0 51.1
566 4 409 13 691.4
We can then write out the datafrme to a file (in this case a CSV text file), using the same file pathing string format as we loaded a dataset earlier.
R
write_csv(filtered, "data/tumor_filtered.csv")
Python
"data/tumor_filtered_py.csv") filtered.to_csv(
One thing to be mindful of is most programs and computer languages don’t like having spaces in file names.
In general, you’ll end up typing more characters dealing with spaces in file names,
instead of using an underscore, _
, or dash, -
.
Also as a general rule, keeping files in lower-case will prevent a lot of potential spelling mistakes.
- Read in all the patient data from the spreadsheets chapter (chapter 2) found in
data > spreadsheets > all.csv
select
thefirst_name
,last_name
,day
,time
, andtemp_f
columns.filter
the rows such that the person’s name is “Aniya Warner”- Save the filtered dataset into the
output
folder namedaniya_warner.csv
.
4.11 Summary
4.12 Additional Resources
This is not the only way you can process data in R, but from experience, it seems to be the easier way to learn R due to its consistency, community, and learning materials.↩︎