Chapter 2 Spreadsheets
2.1 Learning Objectives
- Identify when spreadsheets are useful
- Assess when a task should not be done in spreadsheet software
2.2 Introduction
If you have worked with data, chances are you have used spreadsheets before. In an introductory data analytics class, spreadsheet programs give learners a visual representation of the dataset, wrapped in a program that lets the user manipulate the data. That is, spreadsheet programs create a graphical user interface (GUI, pronounced like "gooey") for one's data.
A GUI allows the user to quickly explore and look at their data; the ability to click on things should never be underestimated. However, the convenience of clickable graphical elements comes at the price of flexibility when more complex and nuanced tasks are needed.
You are one of the researchers assigned to a pharmacokinetics study of a phase 3 clinical trial for Hepatitis C. You have 3 patients who need to have their blood drawn and various vitals recorded:
- Gemma Dunn:
  - Blood drawn at 8:00am
  - Heart rate: 65 bpm
  - Respiratory rate: 12 bpm
  - Blood pressure: 128/78 mmHg
  - Pulse ox: 98%
  - Temperature: 96.8 F
- Priscilla Holding:
  - Blood drawn at 8:20am
  - Heart rate: 75 bpm
  - Respiratory rate: 18 bpm
  - Blood pressure: 130/85 mmHg
  - Pulse ox: 95%
  - Temperature: 97.0 F
- Aniya Warner:
  - Blood drawn at 8:40am
  - Heart rate: 63 bpm
  - Respiratory rate: 13 bpm
  - Blood pressure: 115/73 mmHg
  - Pulse ox: 98%
  - Temperature: 96.5 F
How would you input these values into a spreadsheet?
1. Input values without units
Units might be understood and assumed. If there is an established convention for units and a codebook specifying them, you may not need the unit information in the dataset itself. In this case, if the units were swapped between Fahrenheit and Celsius, we would be able to detect that, since body temperature values would only make sense with a particular unit.
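As a rough illustration of that kind of check, here is a minimal R sketch; guess_temp_unit() is a hypothetical helper (not from any package), and the plausible ranges are assumptions:

```r
# Plausible body temperatures near 95-105 only make sense in
# Fahrenheit, and near 35-41 only in Celsius
guess_temp_unit <- function(temp) {
  ifelse(temp >= 95 & temp <= 105, "F",
         ifelse(temp >= 35 & temp <= 41, "C", NA_character_))
}

guess_temp_unit(c(96.8, 37.0, 212))
#> [1] "F" "C" NA
```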
2. Add units to values
One way to incorporate units is to place them directly with the value. This ensures you do not forget to log the unit in a separate column, and makes each value more self-sufficient. The downside is that the unit turns the cell into a character/string/text value, and you won't be able to do any numeric calculations without first stripping the unit from the value. The column also ends up holding two pieces of information: the value and the unit. In general, you want each cell to hold only one piece of information.
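If you do inherit values with the unit glued on, the numbers can usually be recovered. A hedged sketch using readr's parse_number(), assuming a column of strings like "65bpm":

```r
library(readr)

heart_rate <- c("65bpm", "75bpm", "63bpm")

# parse_number() drops the surrounding non-numeric characters and
# returns a numeric vector you can actually calculate with
parse_number(heart_rate)
#> [1] 65 75 63
```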
3. Split names into 2 columns
Splitting the name into first and last names makes searching, sorting, and finding observations easier. It also prevents manual data entry from mixing name formats ("first name last name" versus "last name, first name"). However, be mindful of how non-English names would be entered when structuring data this way.
4. Place unit information in the column names
You can choose to place the unit information in the column names themselves. This makes it clearer what the units for a particular value are, without having to look them up in a codebook or log them in a separate column. Chances are the units will not change across values, so keeping them in a separate column would be highly repetitive. However, avoid making the column names too "pretty". In general, you want to avoid special characters and spaces in your column names, and, if possible, keep everything in a single case (e.g., all lower case). This prevents errors from leading and trailing spaces (e.g., "last name" versus " last name ") when referring to column names, and spares you from worrying about the differences between "Last Name", "last name", "lastName", "last_name", etc.
This solution isn’t perfect, but it has a lot of features of good spreadsheet data.
- All the column names are lower-cased and do not contain spaces or special characters
- Each cell in a column contains only one piece of information (separate first and last names, blood pressure values in separate columns, units not part of the value)
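If column names have already been entered "pretty", one option for cleaning them up is the janitor package; a small sketch, assuming headers like the ones discussed above:

```r
library(janitor)

messy <- data.frame(check.names = FALSE,
                    `Last Name` = c("Dunn", "Holding", "Warner"),
                    `Pulse Ox`  = c(98, 95, 98))

# clean_names() lower-cases the names and replaces spaces and special
# characters with underscores
names(clean_names(messy))
#> [1] "last_name" "pulse_ox"
```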
What's not immediately clear is whether time is recorded in 12-hour or 24-hour format; this may become more evident as we fill in more data.
I realized when I loaded this dataset into R that the temperature_unit column of "F" values was read in as FALSE values by default. I could either write the data import code to accommodate this default behavior (which I did), or rename the temperature column to temp_f so the unit information is directly in the column name.
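For the first option, a minimal sketch of the import-side fix using readr; vitals.csv is a hypothetical file name:

```r
library(readr)

# Force temperature_unit to be read as character so that "F" is not
# guessed to be the logical value FALSE
vitals <- read_csv("vitals.csv",
                   col_types = cols(temperature_unit = col_character()))
```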
When you're manually entering data, it's hard to know how much to break values down into separate columns. As a general rule of thumb, having more columns in your data is much better than having a single column that contains multiple pieces of information. In the case of blood pressure values, we should have separate columns for the systolic and diastolic values, instead of a single column containing both. This gives us a lot more flexibility later on if we only want to look at one of the values, and also gives us the flexibility to combine them into a single reading if needed. Since we're manually entering values, reporting the blood pressure in a single column runs the risk of formatting inconsistencies such as "120/80", "120 / 80", "120 80", etc.
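If a combined column has already been entered, it can be split after the fact; a hedged sketch using tidyr's separate() on values like those above:

```r
library(tidyr)

bp <- data.frame(last_name      = c("Dunn", "Holding", "Warner"),
                 blood_pressure = c("128/78", "130/85", "115/73"))

# Split at the "/" into two columns; convert = TRUE turns the pieces
# into integers instead of leaving them as text
separate(bp, blood_pressure,
         into = c("bp_systolic", "bp_diastolic"),
         sep = "/", convert = TRUE)
```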
Now, let's say you performed these tasks multiple times over the course of 72 hours. You know that at the end of the PK study, you want to be able to see whether there were any drastic changes to the patients' vitals. Here's a sample of more data. Does knowing what you want to look at change how you set up your data collection?
Example additional reading 1:
Taken on the same day, 12 hours after the first reading:
- Gemma Dunn:
  - Blood drawn at 8:00pm
  - Heart rate: 65 bpm
  - Respiratory rate: 12 bpm
  - Blood pressure: 128/78 mmHg
  - Pulse ox: 98%
  - Temperature: 96.8 F
- Priscilla Holding:
  - Blood drawn at 8:20pm
  - Heart rate: 75 bpm
  - Respiratory rate: 18 bpm
  - Blood pressure: 130/85 mmHg
  - Pulse ox: 95%
  - Temperature: 97.0 F
- Aniya Warner:
  - Blood drawn at 8:40pm
  - Heart rate: 63 bpm
  - Respiratory rate: 13 bpm
  - Blood pressure: 115/73 mmHg
  - Pulse ox: 98%
  - Temperature: 96.5 F
Example additional reading 2:
Taken the next day, 24 hours after the first reading:
- Gemma Dunn:
  - Blood drawn at 8:00am
  - Heart rate: 65 bpm
  - Respiratory rate: 12 bpm
  - Blood pressure: 128/78 mmHg
  - Pulse ox: 98%
  - Temperature: 96.8 F
- Priscilla Holding:
  - Blood drawn at 8:20am
  - Heart rate: 75 bpm
  - Respiratory rate: 18 bpm
  - Blood pressure: 130/85 mmHg
  - Pulse ox: 95%
  - Temperature: 97.0 F
- Aniya Warner:
  - Blood drawn at 8:40am
  - Heart rate: 127 bpm
  - Respiratory rate: 36 bpm
  - Blood pressure: 115/73 mmHg
  - Pulse ox: 98%
  - Temperature: 96.5 F
1. Separate tables for each patient
Table for Gemma:
Table for Priscilla:
Table for Aniya:
One way to input these new readings is to create a separate table for each person. This could be done as separate files, or as separate sheets in an Excel workbook. It has the benefit of making calculations on a single person easy, but you would need to combine the tables before you could compute overall summaries.
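A small sketch of that combining step with dplyr's bind_rows(); the tiny per-patient data frames below are stand-ins for the full tables:

```r
library(dplyr)

# Stand-ins for the per-patient tables (heart rate only, for brevity)
gemma     <- data.frame(time = c("8:00am", "8:00pm"), heart_rate = c(65, 65))
priscilla <- data.frame(time = c("8:20am", "8:20pm"), heart_rate = c(75, 75))
aniya     <- data.frame(time = c("8:40am", "8:40pm"), heart_rate = c(63, 63))

# bind_rows() stacks the tables; .id records which table each row came
# from, restoring the patient name as a proper column
bind_rows(list(gemma = gemma, priscilla = priscilla, aniya = aniya),
          .id = "name")
```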
2. Multiple sections in the same table
This is very similar to the previous example, but instead of having separate files for each person, there’s some kind of separation in the main spreadsheet to distinguish between individuals. This allows us to calculate group statistics for a particular column (assuming the blank rows are accounted for) and we can save temporary calculations on a patient-by-patient basis. While this may be a format that is useful during the data collection phase, the empty rows between data values can cause data import problems. One-off calculations below the tables or to the side can also disrupt data import.
3. Values across the rows
Single table of all patients in chronological order
Having all the data in a single dataset allows for the greatest amount of post-data-collection analysis. You don't need the additional step of combining and joining tables together, since all the values are already available. It gives you the flexibility to filter on values or observations of interest for whatever calculation is needed. The chronological order is just for data entry; the data can be rearranged as necessary (e.g., by patient name).
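A brief sketch of that flexibility in dplyr, using a stand-in version of the combined table:

```r
library(dplyr)

vitals <- data.frame(
  last_name  = c("Dunn", "Holding", "Warner", "Dunn", "Holding", "Warner"),
  heart_rate = c(65, 75, 63, 65, 75, 127)
)

# Group summaries need no joining or table-combining step
vitals %>%
  group_by(last_name) %>%
  summarise(max_heart_rate = max(heart_rate))

# Re-arranging by patient instead of chronologically is one line
vitals %>% arrange(last_name)
```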
In general, each row in your dataset represents a single observation. In our case, the observational unit is a person at a given time, so each row contains one set of readings for a given person and time. You may also encounter "wide" layouts, where a single row holds multiple readings for the same patient, with a separate numbered set of columns for each reading. There is nothing wrong with having time1, time2, heart_rate1, heart_rate2, etc. columns per se, but when you're thinking longer term about data visualization and analysis, you would have to process your data further to get back to one reading per row.
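One sketch of that further processing, using tidyr's pivot_longer() on a hypothetical wide layout:

```r
library(tidyr)

# Hypothetical "wide" layout: one row per patient, one numbered column
# per reading
wide <- data.frame(last_name   = c("Dunn", "Warner"),
                   heart_rate1 = c(65, 63),
                   heart_rate2 = c(65, 127))

# Reshape back to one row per patient-and-reading
pivot_longer(wide, cols = starts_with("heart_rate"),
             names_to = "reading", values_to = "heart_rate")
```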
With proper planning, there is no such thing as “messy” data.
Having said that, since this is a manual data entry process, if it is easier to enter and check your data in a certain format, then that is the better data format for this use case. As long as you have a plan for the later data processing steps, choose the format that makes the current task easier.
The protocol asks you to report an adverse event if any of the patient's vitals become abnormally elevated. How do you mark abnormal observations?
Elevated vitals include:
- Blood pressure greater than or equal to 140/90
- Resting heart rate greater than or equal to 90
- Resting respiratory rate over 20
- Pulse ox under 95
- Temperature over 100 degrees Fahrenheit
Note that the language is specific about “greater than or equal to” versus “over”.
Color code the values of interest
Using colors to make values of interest stand out is a common way to "flag" cells. It can be done either manually in a spreadsheet program, or with code that colors values conditionally. In essence, you are turning the spreadsheet of data into a data visualization that behaves similarly to a heatmap. However, one of the main takeaways from Exercise 1 was to avoid encoding multiple pieces of information in a single cell. By adding a color that represents a "good" or "bad" value, we have assigned multiple pieces of information to a single cell.
This can cause further problems later on when loading Excel spreadsheets, since the color information may show up as additional characters, or may be dropped entirely. At best you have to extract the color information from the stored values in your dataset; at worst, you have lost the very piece of information you tried to encode.
Separate column to indicate status
A much better way is to create a separate column that denotes whether or not an observation needs to be reported. This way each column holds only one piece of information, transfers cleanly between programs, and you won't lose any information as the data gets passed around. You can then use this column in the spreadsheet program and/or a programming language to filter results.
If you absolutely need colored cells in your spreadsheet, then this new column is a good place to use color, since the color and the value in the cell will encode the same information.
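A minimal sketch of building such a flag column with dplyr, using the protocol thresholds above; the column names here are assumptions:

```r
library(dplyr)

vitals <- data.frame(
  last_name        = c("Dunn", "Holding", "Warner"),
  heart_rate       = c(65, 75, 127),
  respiratory_rate = c(12, 18, 36),
  pulse_ox         = c(98, 95, 98),
  temp_f           = c(96.8, 97.0, 96.5),
  bp_systolic      = c(128, 130, 115),
  bp_diastolic     = c(78, 85, 73)
)

# One logical column encodes the protocol; note where >= is used
# versus > ("greater than or equal to" versus "over")
vitals <- vitals %>%
  mutate(adverse_event = bp_systolic >= 140 | bp_diastolic >= 90 |
           heart_rate >= 90 | respiratory_rate > 20 |
           pulse_ox < 95 | temp_f > 100)

# The flag can now drive filters in R or in the spreadsheet program
filter(vitals, adverse_event)
```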
To recap: because spreadsheet programs give you a graphical view of your dataset, it is very easy to add colors and other visual markers to data values to make them stand out. But a colored cell stores multiple pieces of information: the value, and whatever the color represents. The more "shareable" approach is the separate column indicating a "good" or "bad" value, which is straightforward when each row is a separate patient reading.
Treat your dataset as data, not as a data visualization.
You can take this to a further extreme, where each row is just a patient, time, reading type, and reading value, which makes it easy to flag any individual "bad" value, but that layout would make the data entry process very difficult.
During the examination you ask the patient if they are feeling any discomfort and why they are breathing so heavily, and they tell you they were stretching and doing yoga poses to “get the blood moving”. How do you document this in your data?
When actually doing data collection, it's impossible to predict everything, and it's common to have a separate open text field for notes.
It's good to have some kind of notes column so that comments travel with the data, rather than living somewhere else where they can easily be lost or forgotten. The better your dataset documents itself, the lower the chance of mistakes during the data processing and analysis phases. There are times when a dataset is provided without a codebook, and values in the dataset get interpreted the wrong way.
As long as you are consistent within a particular column (and ideally within the entire dataset), modern programming languages can work with data encoded as "yes"/"no", "true"/"false", or "< 18"/"18-25"/"26-30"/"31+", etc.; the values do not need to be numerically encoded as 0, 1, 2, and so on. This makes the dataset much more self-sufficient, without relying on a separate codebook file.
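As a small illustration, R's ordered factors let labeled categories like the age groups above sort and compare correctly without a numeric recode:

```r
# Declare the intended ordering of the labels once, in code, instead
# of recoding the data itself as 0, 1, 2, ...
age_group <- factor(c("18-25", "< 18", "31+", "26-30"),
                    levels = c("< 18", "18-25", "26-30", "31+"),
                    ordered = TRUE)

sort(age_group)
age_group < "26-30"  # comparisons respect the declared ordering
#> [1]  TRUE  TRUE FALSE FALSE
```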
2.3 Summary
- Spreadsheet programs serve as a graphical user interface to one’s data
- GUIs are very convenient, but come at the cost of flexibility
- Spreadsheets have a legitimate place in the data science process
- Not all tasks are best suited to a spreadsheet program
2.4 Additional resources
- Data Carpentry spreadsheet lesson
- Data Organization in Spreadsheets, Karl W. Broman & Kara H. Woo:
- Be consistent
- Choose good names for things
- Write dates as YYYY-MM-DD
- No empty cells
- Put just one thing in a cell
- Make it a rectangle
- Create a data dictionary
- No calculations in the raw data files
- Do not use font color or highlighting as data
- Make backups
- Use data validation to avoid errors
- Save the data in plain text files