Chapter 4 Load Data

4.1 Learning Objectives

4.2 Introduction

Loading data into R is the first step!

First we need to load up a package to make loading data sets easier. We will be using the tidyverse set of packages for all of our data processing needs in R 1 and the pandas library in Python.

"Tidyverse Celestial" by Allison Horst. "tidyverse" hex sticker in space surrounded by other tidyverse package hex logos.

Figure 0.1: “Tidyverse Celestial” by Allison Horst. “tidyverse” hex sticker in space surrounded by other tidyverse package hex logos.

R

We first will load up the tidyverse packages using the library function.

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.4     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

The first time you load up the tidyverse library, there will be some output that lists the packages tidyverse loads, along with any functions that share the same name as other functions (i.e., conflicts). As long as you are mindful about starting a new RStudio session before you work, you can ignore this output for now.

Now we can use all the functions within the Tidyverse to do our data processing. If you are following along and you run a piece of code and end up with an could not find function error, make sure you typed library(tidyverse) correctly and executed the line of code.

Python
import pandas as pd

4.3 Find your files

Let the below figures represent a ds4biomed folder on your Desktop on Windows and Mac, respectively.

C:\
  |- Users\
    |- Daniel\
      |- Desktop\
        |- ds4biomed\
         |- data\
         |    |- medicaldata_tumorgrowth.csv
         |- output\
         |- 01-load_data.R
         |- README.md
         |- ds4biomed.Rproj
/
  |- Users/
    |- Daniel/
      |- Desktop/
        |- ds4biomed/
          |- data/
          |    |- medicaldata_tumorgrowth.csv
          |- output/
          |- 01-load_data.R
          |- README.md
          |- ds4biomed.Rproj

Suppose we are “in” the ds4biomed folder, where we can see the data and output folders along with the 01-load_data.R, README.md, and ds4biomed.Rproj files.

We can reference our medicaldata_tumorgrowth.csv file inside the data folder as: data/medicaldata_tumorgrowth.csv. That is, we can use the backslash, / to move into folders. We can write data/medicaldata_tumorgrowth.csv because we are “starting from” the ds4biomed folder. This is called a relative path because the location of the csv file is relative to the ds4biomed starting point (aka working directory). If we want to refer to any arbitrary filer or folder on the computer, we can specify the full path of the file. The full path will start with a drive letter on windows, C:\Users\Daniel\Desktop\ds4biomed\data\medicaldata_tumorgrowth.csv, and a / on a Mac, /Users/Daniel/Desktop/ds4biomed/data/medicaldata_tumorgrowth.csv.

Suppose we create an analysis folder for our 01-load_data.R script so that our folder structure looks like this (only the mac version is shown in the example below):

/
  |- Users/
    |- Daniel/
      |- Desktop/
        |- ds4biomed/
          |- data/
          |    |- medicaldata_tumorgrowth.csv
          |- output/
          |- analysis
          |    |- 01-load_data.R
          |- README.md
          |- ds4biomed.Rproj

Now, if our working directory is now in the analysis folder, we need a way to reference one folder up to the ds4biomed folder and then back down to the data folder. The way we can relatively reference the previous folder is with 2 dots, .., ../data/medicaldata_tumorgrowth.csv

Given the following example folder structure where we have a chart_review and rct_m22-0305 folder in our projects folder:

/
  |- Users/
    |- Daniel/
      |- projects/
        |-chart_review/
        |    |- data/
        |    |    |- patients.csv
        |    |- analysis/
        |         |- demographics.R # you are working here
        |- rct_m22-0305
             |- data/
                  |- patients.csv

Let’s say we are currently in the chart_review/analysis folder, working on our demographics.R as denoted by the #.

  1. Write the relative path to the patients.csv file in the rct_m22-0305 folder.
  2. Write the absolute path to the patients.csv file in the chart_review folder.

4.3.1 Paths in Windows

When you are looking at file paths in the Windows Explorer, you will notice that all Windows paths will use the backslash, \, instead of the forward slash, / to refer to files. In a lot of programming languages, including R, the \ is a special character, so if you want to use \ for file paths in Windows, you will have to use 2 backslashes, e.g., ..\\data\\patients.csv\\. However, you can still use the regular / in Windows to refer to folders just like other operating systems.

4.4 Set your working directory

So far, we have been talking about a “starting point” or “working directory”, when we have been referring to files around our computer. In order to quickly and reliably set your working directory,

we use RStudio Projects (Section 3.5), or Jupyter Lab

4.5 Reading text files (CSV)

Now that we know how to find our files, let’s load up our first data set. When trying to type in a file path, you can hit the <TAB> key to autocomplete the files. This will help you with a lot of potential spelling mistakes.

R
read_csv("data/medicaldata_tumorgrowth.csv")
Rows: 574 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Grp
dbl (4): Group, ID, Day, Size

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 574 × 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 564 more rows
Python
pd.read_csv("data/medicaldata_tumorgrowth.csv")
       Grp  Group   ID  Day    Size
0    1.CTR      1  101    0    41.8
1    1.CTR      1  101    3    85.0
2    1.CTR      1  101    4   114.0
3    1.CTR      1  101    5   162.3
4    1.CTR      1  101    6   178.3
..     ...    ...  ...  ...     ...
569  4.D+R      4  409   19   940.6
570  4.D+R      4  409   20  1057.0
571  4.D+R      4  409   21  1107.6
572  4.D+R      4  409   24  1694.2
573  4.D+R      4  409   25  1423.6

[574 rows x 5 columns]

Debug help: - If the above code returns a could not find function "read_csv" make sure you have loaded up the proper library with library(tidyverse) - If the above code returns a does not exist in current working directory, make sure the working directory it lists is your expected “starting point” (i.e., working directory), and make sure the file path is spelled correctly.

read_csv will show us the columns that were read in, as well as the data type of that column (e.g., character, double – a number).

Loading a data set is great, but we need a convenient way to refer to the data set. We don’t want to re-load the data set every time we want to perform an action on it. We can take this loaded data set and assign it to a variable. We can do this with the assignment operator, <-. Note the way it is typed, a less than symbol (<) followed immediately by the dash (-) without any spaces in between. The right side of the assignment operator, <-, will be executed and then assigned to the variable on the left.

R
tumor <- read_csv("data/medicaldata_tumorgrowth.csv")
Rows: 574 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Grp
dbl (4): Group, ID, Day, Size

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Python
tumor = pd.read_csv("data/medicaldata_tumorgrowth.csv")

Notice this time we no longer see the dataset being printed. The “Environment” tab in the RStudio panel will now have an entry for the variable you used. Clicking on the right data set icon will open a view of your dataset, clicking on the arrow will show you the column-by-column text representation (technically it’s called the structure).

To look at our dataset we can execute just the variable we assigned the dataset to.

R
tumor
# A tibble: 574 × 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 564 more rows
Python
tumor
       Grp  Group   ID  Day    Size
0    1.CTR      1  101    0    41.8
1    1.CTR      1  101    3    85.0
2    1.CTR      1  101    4   114.0
3    1.CTR      1  101    5   162.3
4    1.CTR      1  101    6   178.3
..     ...    ...  ...  ...     ...
569  4.D+R      4  409   19   940.6
570  4.D+R      4  409   20  1057.0
571  4.D+R      4  409   21  1107.6
572  4.D+R      4  409   24  1694.2
573  4.D+R      4  409   25  1423.6

[574 rows x 5 columns]

This tabular dataset that has now been loaded into R is called a data frame object (or simply dataframe), the tidyverse uses a tibble. For the most part, a data.frame object will behave like a tibble object.

4.6 Reading Excel files

R

To read an excel file we will need a separate library that can handle Excel files, readxl.

library(readxl)

The readxl library gives us access to the read_excel function which we can use to read in an Excel file.

tumor_xl <- read_excel("data/medicaldata_tumorgrowth.xlsx")
Python
tumor_xl = pd.read_excel("data/medicaldata_tumorgrowth.xlsx")

And we can view the loaded Excel file just like before.

R
tumor_xl
# A tibble: 574 × 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 564 more rows
Python
tumor_xl
       Grp  Group   ID  Day    Size
0    1.CTR      1  101    0    41.8
1    1.CTR      1  101    3    85.0
2    1.CTR      1  101    4   114.0
3    1.CTR      1  101    5   162.3
4    1.CTR      1  101    6   178.3
..     ...    ...  ...  ...     ...
569  4.D+R      4  409   19   940.6
570  4.D+R      4  409   20  1057.0
571  4.D+R      4  409   21  1107.6
572  4.D+R      4  409   24  1694.2
573  4.D+R      4  409   25  1423.6

[574 rows x 5 columns]

If you are ever lost as to what object you are working with, you can use the class function in R or the type function in Python to tell you.

R
class(tumor)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
class(tumor_xl)
[1] "tbl_df"     "tbl"        "data.frame"
Python
type(tumor)
<class 'pandas.core.frame.DataFrame'>
type(tumor_xl)
<class 'pandas.core.frame.DataFrame'>

Once you have see a "tbl_df" or "data.frame" (R) or <class 'pandas.core.frame.DataFrame'> (Python) you can be sure that the dataframe processing functions we’ll learn will work.

Take away: Once you load a dataset and end up with either a data.frame or tibble object you can continue using the same data processing methods we cover in these materials.

  1. Make sure you have the ds4biomed RStudio project open
  2. Load up the exercise-010-050.csv file under data > spreadsheets (you can use the relative path or full path).
  3. Save this dataframe object to the variable, spreadsheet_df
spreadsheet_df <- read_csv("data/spreadsheets/exercise-010-050.csv")
Rows: 3 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_name, last_name
dbl  (6): diastolic, systolic, heart_rate, respiratory_rate, pulse_ox, tempe...
lgl  (1): temperature_unit
time (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4.7 Selecting columns

Here is the tumor dataset we are working with.

R
tumor
# A tibble: 574 × 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 564 more rows
Python
tumor
       Grp  Group   ID  Day    Size
0    1.CTR      1  101    0    41.8
1    1.CTR      1  101    3    85.0
2    1.CTR      1  101    4   114.0
3    1.CTR      1  101    5   162.3
4    1.CTR      1  101    6   178.3
..     ...    ...  ...  ...     ...
569  4.D+R      4  409   19   940.6
570  4.D+R      4  409   20  1057.0
571  4.D+R      4  409   21  1107.6
572  4.D+R      4  409   24  1694.2
573  4.D+R      4  409   25  1423.6

[574 rows x 5 columns]
R

In order to subset the columns of our dataset, we can use the select function. The first argument of the select function is the dataset we want to select columns from. Then, the other arguments are the columns we want to select.

select(tumor, Size)
# A tibble: 574 × 1
     Size
    <dbl>
 1   41.8
 2   85  
 3  114  
 4  162. 
 5  178. 
 6  325  
 7  624. 
 8  648. 
 9  836. 
10 1030. 
# … with 564 more rows

If we want multiple columns (or to re-order columns) we can specify each column we want selected.

select(tumor, Group, Day, Size)
# A tibble: 574 × 3
   Group   Day   Size
   <dbl> <dbl>  <dbl>
 1     1     0   41.8
 2     1     3   85  
 3     1     4  114  
 4     1     5  162. 
 5     1     6  178. 
 6     1     7  325  
 7     1    10  624. 
 8     1    11  648. 
 9     1    12  836. 
10     1    13 1030. 
# … with 564 more rows
Python
tumor["Size"]
0        41.8
1        85.0
2       114.0
3       162.3
4       178.3
        ...  
569     940.6
570    1057.0
571    1107.6
572    1694.2
573    1423.6
Name: Size, Length: 574, dtype: float64
tumor[["Group", "Day", "Size"]]
     Group  Day    Size
0        1    0    41.8
1        1    3    85.0
2        1    4   114.0
3        1    5   162.3
4        1    6   178.3
..     ...  ...     ...
569      4   19   940.6
570      4   20  1057.0
571      4   21  1107.6
572      4   24  1694.2
573      4   25  1423.6

[574 rows x 3 columns]
tumor[["Size"]]
       Size
0      41.8
1      85.0
2     114.0
3     162.3
4     178.3
..      ...
569   940.6
570  1057.0
571  1107.6
572  1694.2
573  1423.6

[574 rows x 1 columns]
R

4.7.1 The %>% pipe operator

A common way tidyverse code gets written is using the %>% operator. The %>% takes the object on the left, and passes that into the first argument to the function on the right. We can re-write the above select statements as follows. R interprets the code in the exact same way.

tumor %>%
  select(Size)
# A tibble: 574 × 1
     Size
    <dbl>
 1   41.8
 2   85  
 3  114  
 4  162. 
 5  178. 
 6  325  
 7  624. 
 8  648. 
 9  836. 
10 1030. 
# … with 564 more rows

The way we pronounce the pipe, %>% is “then”. So the above statement is read “tumor, then, select size”.

tumor %>%
  select(Group, Day, Size)
# A tibble: 574 × 3
   Group   Day   Size
   <dbl> <dbl>  <dbl>
 1     1     0   41.8
 2     1     3   85  
 3     1     4  114  
 4     1     5  162. 
 5     1     6  178. 
 6     1     7  325  
 7     1    10  624. 
 8     1    11  648. 
 9     1    12  836. 
10     1    13 1030. 
# … with 564 more rows

The benefits of the pipe may not be as apparent now, but when we start chaining multiple data sets together, it’ll make the code a little more readable.

4.8 Filtering rows

Now that we know how to select columns on our dataframe, we can now learn how to filter the rows.

R

To filter rows, we can use the filter function. The first argument of filter is the dataframe object (the same pattern as the select function), and the other arguments are statements that return some TRUE or FALSE value (this is known as a boolean condition).

For example if we want to filter the tumor dataset such that the Group value is 1 we can filter the rows using that condition.

filter(tumor, Group == 1)
# A tibble: 97 × 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 87 more rows

We can write the “pipe” version as such:

tumor %>%
  filter(Group == 1)
# A tibble: 97 × 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101     3   85  
 3 1.CTR     1   101     4  114  
 4 1.CTR     1   101     5  162. 
 5 1.CTR     1   101     6  178. 
 6 1.CTR     1   101     7  325  
 7 1.CTR     1   101    10  624. 
 8 1.CTR     1   101    11  648. 
 9 1.CTR     1   101    12  836. 
10 1.CTR     1   101    13 1030. 
# … with 87 more rows
Python
tumor.loc[tumor["Group"] == 1]
      Grp  Group   ID  Day    Size
0   1.CTR      1  101    0    41.8
1   1.CTR      1  101    3    85.0
2   1.CTR      1  101    4   114.0
3   1.CTR      1  101    5   162.3
4   1.CTR      1  101    6   178.3
..    ...    ...  ...  ...     ...
92  1.CTR      1  108    6   484.9
93  1.CTR      1  108    7   584.5
94  1.CTR      1  108   11  1581.9
95  1.CTR      1  108   12  1930.6
96  1.CTR      1  108   13  2295.9

[97 rows x 5 columns]

If we want to combine multiple conditions, where Group is 1 and Day is 0.

R

we can put each statement as a separate parameter into the filter function.

tumor %>%
  filter(Group == 1, Day == 0)
# A tibble: 8 × 5
  Grp   Group    ID   Day  Size
  <chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR     1   101     0  41.8
2 1.CTR     1   102     0  79.4
3 1.CTR     1   103     0  44.8
4 1.CTR     1   104     0  67.7
5 1.CTR     1   105     0  54.7
6 1.CTR     1   106     0  60  
7 1.CTR     1   107     0  46.8
8 1.CTR     1   108     0  49.4

We can also use the & for the and operator.

R
tumor %>%
  filter(Group == 1 & Day == 0)
# A tibble: 8 × 5
  Grp   Group    ID   Day  Size
  <chr> <dbl> <dbl> <dbl> <dbl>
1 1.CTR     1   101     0  41.8
2 1.CTR     1   102     0  79.4
3 1.CTR     1   103     0  44.8
4 1.CTR     1   104     0  67.7
5 1.CTR     1   105     0  54.7
6 1.CTR     1   106     0  60  
7 1.CTR     1   107     0  46.8
8 1.CTR     1   108     0  49.4
Python

In pandas, if you have multiple conditions you must put each condition around a pair of round brackets ( ).

tumor.loc[(tumor["Group"] == 1) & (tumor["Day"] == 0)]
      Grp  Group   ID  Day  Size
0   1.CTR      1  101    0  41.8
16  1.CTR      1  102    0  79.4
32  1.CTR      1  103    0  44.8
44  1.CTR      1  104    0  67.7
57  1.CTR      1  105    0  54.7
68  1.CTR      1  106    0  60.0
81  1.CTR      1  107    0  46.8
89  1.CTR      1  108    0  49.4

Or use the | for the or operator.

R
tumor %>%
  filter(Day == 0 | Day == 13)
# A tibble: 60 × 5
   Grp   Group    ID   Day   Size
   <chr> <dbl> <dbl> <dbl>  <dbl>
 1 1.CTR     1   101     0   41.8
 2 1.CTR     1   101    13 1030. 
 3 1.CTR     1   102     0   79.4
 4 1.CTR     1   102    13  619. 
 5 1.CTR     1   103     0   44.8
 6 1.CTR     1   104     0   67.7
 7 1.CTR     1   105     0   54.7
 8 1.CTR     1   105    13 1699. 
 9 1.CTR     1   106     0   60  
10 1.CTR     1   107     0   46.8
# … with 50 more rows
Python
tumor.loc[(tumor["Day"] == 0) | (tumor["Day"] == 13)]
       Grp  Group   ID  Day    Size
0    1.CTR      1  101    0    41.8
9    1.CTR      1  101   13  1030.4
16   1.CTR      1  102    0    79.4
25   1.CTR      1  102   13   618.8
32   1.CTR      1  103    0    44.8
44   1.CTR      1  104    0    67.7
57   1.CTR      1  105    0    54.7
66   1.CTR      1  105   13  1699.3
68   1.CTR      1  106    0    60.0
81   1.CTR      1  107    0    46.8
88   1.CTR      1  107   13  2342.6
89   1.CTR      1  108    0    49.4
96   1.CTR      1  108   13  2295.9
97     2.D      2  201    0    49.1
106    2.D      2  201   13   455.5
116    2.D      2  202    0    60.6
131    2.D      2  203    0    41.5
147    2.D      2  204    0    46.8
162    2.D      2  205    0    39.5
171    2.D      2  205   13   552.5
178    2.D      2  206    0    53.5
196    2.D      2  207    0    43.5
205    2.D      2  207   13    62.4
215    2.D      2  208    0    64.4
230    2.D      2  209    0    47.5
249    2.D      2  210    0    71.7
258    2.D      2  210   13   743.5
268    3.R      3  301    0    44.1
277    3.R      3  301   13  2058.7
278    3.R      3  302    0    42.1
287    3.R      3  302   13   589.5
298    3.R      3  303    0    42.5
307    3.R      3  303   13   442.4
318    3.R      3  304    0    56.9
327    3.R      3  304   13  1066.4
330    3.R      3  305    0    46.7
339    3.R      3  305   13   455.3
349    3.R      3  306    0    51.2
358    3.R      3  306   13  1177.5
363    3.R      3  307    0    44.0
372    3.R      3  307   13   651.8
380    3.R      3  308    0    59.8
387    3.R      3  308   13   956.9
389    3.R      3  309    0    40.7
398    3.R      3  309   13   565.4
410    3.R      3  310    0    58.2
419    3.R      3  310   13  1372.7
429  4.D+R      4  401    0    41.3
445  4.D+R      4  402    0    53.5
451  4.D+R      4  403    0    45.8
460  4.D+R      4  403   13   624.1
471  4.D+R      4  404    0    48.2
490  4.D+R      4  405    0    47.7
499  4.D+R      4  405   13   527.1
511  4.D+R      4  406    0    69.2
518  4.D+R      4  406   13  1227.7
524  4.D+R      4  407    0    43.9
541  4.D+R      4  408    0    59.3
558  4.D+R      4  409    0    51.1
566  4.D+R      4  409   13   691.4

4.9 Subsetting columns and rows

R

To subset on both columns and rows, we call the corresponding subseting function one after the other. Here we are taking the tumor column, then selecting the Group, Day and Size columns, then filtering the rows where the Size is greater than 2000.

tumor %>%
  select(Group, Day, Size) %>%
  filter(Size > 2000)
# A tibble: 14 × 3
   Group   Day  Size
   <dbl> <dbl> <dbl>
 1     1    15 2406.
 2     1    14 2163.
 3     1    16 2125.
 4     1    13 2343.
 5     1    13 2296.
 6     2    22 2047.
 7     2    25 2269.
 8     2    27 2115.
 9     3    13 2059.
10     3    28 2362.
11     3    27 2157.
12     3    20 2160.
13     3    24 2189.
14     4    22 2343 

This is an example where the pipe, %>% makes the code a little easier to read. We can “un-pipe” our code and it looks like this:

filter(select(tumor, Group, Day, Size), Size > 2000)
# A tibble: 14 × 3
   Group   Day  Size
   <dbl> <dbl> <dbl>
 1     1    15 2406.
 2     1    14 2163.
 3     1    16 2125.
 4     1    13 2343.
 5     1    13 2296.
 6     2    22 2047.
 7     2    25 2269.
 8     2    27 2115.
 9     3    13 2059.
10     3    28 2362.
11     3    27 2157.
12     3    20 2160.
13     3    24 2189.
14     4    22 2343 
Python

the .loc notation actually has 2 parts, and looks like .loc[rows columns]. So in order to specifiy rows and columns, we separate how we want to select columns with a comma.

tumor.loc[tumor["Size"] > 2000, ["Group", "Day", "Size"]]
     Group  Day    Size
43       1   15  2405.7
67       1   14  2162.8
80       1   16  2125.1
88       1   13  2342.6
96       1   13  2295.9
130      2   22  2046.7
248      2   25  2269.3
267      2   27  2114.6
277      3   13  2058.7
297      3   28  2362.4
348      3   27  2157.4
362      3   20  2159.6
428      3   24  2188.7
523      4   22  2343.0

4.10 Saving out data

Saving out your data frame objects is the first step in creating pipelines. Even if all you do is read in data, filter it, and write out the subsetted dataset for another use case, you have created your first pipeline. Now, you have a repeatable way of performing the same action when your input dataset is updated or changed.

Let’s say we want to work with a filtered version of our tumor dataset, we can first write the code to subset the data we want.

R
filtered <- tumor %>%
  select(Group, ID, Day, Size) %>%
  filter(Day == 0 | Day == 13)
filtered
# A tibble: 60 × 4
   Group    ID   Day   Size
   <dbl> <dbl> <dbl>  <dbl>
 1     1   101     0   41.8
 2     1   101    13 1030. 
 3     1   102     0   79.4
 4     1   102    13  619. 
 5     1   103     0   44.8
 6     1   104     0   67.7
 7     1   105     0   54.7
 8     1   105    13 1699. 
 9     1   106     0   60  
10     1   107     0   46.8
# … with 50 more rows
Python
filtered = tumor.loc[(tumor["Day"] == 0) | (tumor["Day"] == 13), ["Group", "ID", "Day", "Size"]]
filtered
     Group   ID  Day    Size
0        1  101    0    41.8
9        1  101   13  1030.4
16       1  102    0    79.4
25       1  102   13   618.8
32       1  103    0    44.8
44       1  104    0    67.7
57       1  105    0    54.7
66       1  105   13  1699.3
68       1  106    0    60.0
81       1  107    0    46.8
88       1  107   13  2342.6
89       1  108    0    49.4
96       1  108   13  2295.9
97       2  201    0    49.1
106      2  201   13   455.5
116      2  202    0    60.6
131      2  203    0    41.5
147      2  204    0    46.8
162      2  205    0    39.5
171      2  205   13   552.5
178      2  206    0    53.5
196      2  207    0    43.5
205      2  207   13    62.4
215      2  208    0    64.4
230      2  209    0    47.5
249      2  210    0    71.7
258      2  210   13   743.5
268      3  301    0    44.1
277      3  301   13  2058.7
278      3  302    0    42.1
287      3  302   13   589.5
298      3  303    0    42.5
307      3  303   13   442.4
318      3  304    0    56.9
327      3  304   13  1066.4
330      3  305    0    46.7
339      3  305   13   455.3
349      3  306    0    51.2
358      3  306   13  1177.5
363      3  307    0    44.0
372      3  307   13   651.8
380      3  308    0    59.8
387      3  308   13   956.9
389      3  309    0    40.7
398      3  309   13   565.4
410      3  310    0    58.2
419      3  310   13  1372.7
429      4  401    0    41.3
445      4  402    0    53.5
451      4  403    0    45.8
460      4  403   13   624.1
471      4  404    0    48.2
490      4  405    0    47.7
499      4  405   13   527.1
511      4  406    0    69.2
518      4  406   13  1227.7
524      4  407    0    43.9
541      4  408    0    59.3
558      4  409    0    51.1
566      4  409   13   691.4

We can then write out the datafrme to a file (in this case a CSV text file), using the same file pathing string format as we loaded a dataset earlier.

R
write_csv(filtered, "data/tumor_filtered.csv")
Python
filtered.to_csv("data/tumor_filtered_py.csv")

One thing to be mindful of is most programs and computer languages don’t like having spaces in file names. In general, you’ll end up typing more characters dealing with spaces in file names, instead of using an underscore, _, or dash, -. Also as a general rule, keeping files in lower-case will prevent a lot of potential spelling mistakes.

  1. Read in all the patient data from the spreadsheets chapter (chapter 2) found in data > spreadsheets > all.csv
  2. select the first_name, last_name, day, time, and temp_f columns.
  3. filter the rows such that the person’s name is “Aniya Warner”
  4. Save the filtered dataset into the output folder named aniya_warner.csv.

4.11 Summary

4.12 Additional Resources


  1. This is not the only way you can process data in R, but from experience, it seems to be the easier way to learn R due to its consistency, community, and learning materials.↩︎