Chapter 4 Data Manipulation 1
In this chapter we’ll look at tidying up and cleaning our data with the tidyr package (Wickham and Henry 2018) from the tidyverse (Wickham 2017). We’ll also use the dplyr package (Wickham et al. 2017) from the tidyverse (Wickham 2017) to join different sets of data together into a useful format for plotting and running analyses.
This data cleaning is an important first step on your way to working with your data, so it’s important for you to get acquainted with the different ways your data can be formatted, and how to get it in the format we want.
This chapter, particularly the section on joins, was informed by the Glasgow University Research Cycle by Dale Barr and Lisa DeBruine. Please see this resource for further reading.
4.1 Getting Started
First, we’ll load the packages necessary for this class. Nicely, tidyr and dplyr are part of the tidyverse family, so we don’t need to load this separately to the other packages in our library.
library(tidyverse)
Next, we’ll load some data from the languageR
library. The data set we’ll look at is the lexdec
data set, which looks at lexical decision latencies for English nouns from native and non-native speakers of English. If you load this data set from the languageR
library then it’ll already be in the correct format for us to perform our analyses. So, I’ve made this data more messy; adding missing values, additional columns that represent more than one source of data, and few extra participants.
On top of this, I’ve produced a separate data set which stores some (fake) demographic data for the participants. This data set contains information about the participant ID, their gender, age, and any programming languages that they know. Why did I add this last column? Well, sometimes your data contains additional information that isn’t important for your current analyses, so it’s good to get some experience with filtering our data. (We’ll cover filtering data in more detail in Lesson 5.)
4.2 Data Formats
I’ve saved the data in both wide and long formats.
In wide formats, each row represents one participant, and any information gathered from these participants is stored in a new column. So, let’s say we have several items where we gather some data (e.g. reaction times), here each column will represent an item, and each cell will store a participant’s score.
In long formats, each column represents one measurement. Here, we could have a column for participant ID, a column for item number, and a column for reaction times. In this instance, each row should be unique by its combination of our three columns, but IDs (e.g. participant ID, item ID) will be repeated several times.
4.2.1 Loading Data
Let’s load the messy data sets from the csv files in the Lesson 4 lesson_materials folder to get a better grasp of these formats. We’ll get some (scary looking, red) messages when we load the data. That’s because we haven’t specified how we want each column to be parsed.
<- read_csv("inputs/lexical_decision_demographic_data_wide.csv")
demo_wide <- read_csv("inputs/lexical_decision_demographic_data_long.csv") demo_long
When we use read_csv()
this function tries to guess the data type contained within any column. Basically, if all the data in a column are numbers, then it’ll be parsed as a numeric data type. If even one cell in a column is text, the whole column will be parsed as text. This can cause problems if most of the data are numbers and you want to do some calculations with this column, as you can’t add, subract, or divide with text columns!
Often, dates are parsed as datetimes, which allows for some easy calculations for differences in times. But, our completion_time column was parsed as a character. Why? Because this column contains two datetimes separated by an underscore. It looks like the researchers (me) were too lazy to actually calculate the completion times, and just threw the start and end times togerther into one column!
4.2.2 Wide and Long Data
Now we’ve loaded the data, and understand how R reads data, let’s look back at wide and long data formats.
4.2.2.1 Wide Data
In the wide data format, each row is a participant (ID). We have columns representing all of the programming languages reported to be known by the participants. If a participant knows the language, they get a 1 in this column, otherwise they have an NA. We also have several other columns covering other information gathered.
demo_wide
## # A tibble: 29 × 14
## ID `C++` FORTRAN JavaScr…¹ Python R Ruby LANGU…² progr…³ gender age
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 22 1 1 1 NA NA NA OTHER FINISH female 30
## 2 23 1 NA NA NA NA NA <NA> no male 30
## 3 24 NA 1 NA NA NA NA ENGLISH END female 30
## 4 25 1 1 NA NA NA NA <NA> ethics male 18
## 5 26 NA NA NA 1 NA 1 english ethics male 31
## 6 27 NA 1 1 NA NA NA ENGLISH END female 44
## 7 28 NA NA NA NA 1 1 <NA> ethics male 23
## 8 29 NA NA 1 NA NA NA english ethics male 34
## 9 30 NA 1 NA NA NA NA <NA> <NA> female 32
## 10 A2 1 1 1 NA NA NA English END female 33
## # … with 19 more rows, 3 more variables: tester <chr>, funRec <chr>,
## # completion_time <chr>, and abbreviated variable names ¹JavaScript,
## # ²LANGUAGE, ³progress
4.2.2.2 Long Data
With the long data format, we have each ID for our data in one column, and the measurements for these variables in each cell. The main difference here is that we have a column at the end called computer_language which simply lists the language each participant knows. This cuts down on the need for redundant columns for a computer language when a participant doesn’t know that language. Compare the languages known for participant 22 in this data set and how it’s represented in the wide data set.
In order to display this properly, I’ll cut out the columns tester, funRec, and completion time, when printing for this website, but you needn’t do this in R.
demo_long
## # A tibble: 54 × 6
## ID LANGUAGE progress gender age computer_language
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 22 OTHER FINISH female 30 C++
## 2 22 OTHER FINISH female 30 FORTRAN
## 3 22 OTHER FINISH female 30 JavaScript
## 4 23 <NA> no male 30 C++
## 5 24 ENGLISH END female 30 FORTRAN
## 6 25 <NA> ethics male 18 C++
## 7 25 <NA> ethics male 18 FORTRAN
## 8 26 english ethics male 31 Python
## 9 26 english ethics male 31 Ruby
## 10 27 ENGLISH END female 44 FORTRAN
## # … with 44 more rows
We end up with some repetition here (several rows with the same ID, language, gender, etc.), but each row is unique when we consider all measurements. This is a common data format for raw data, as we’ll see next.
4.2.2.3 Understanding our Loaded Data
The NA value is important in R; as Hadley Wickham says, it is the evidence of absence. Missing values however are more problematic in that they are the absence of evidence. If you want to indicate missing data, use NA, and not N/A or N_A etc. as you have to tell R to parse these as NAs.
We also have columns indicating the language known by the participant, their progress in the experiment (i.e. did they finish it or not?), their gender, age, who tested them and two final columns. The funRec column tells us whether they liked the experiment or not (on a 0-7 scale) and whether they’d recommend the experiment to others (yes/no). Unfortunately, these values are separated by a dash. This is bad practice as each cell should represent one data point, not two. The same can be said for completion time, with the dates and times for starting and ending the experiment separated by an underscore.
4.3 Reformatting Data
4.3.1 Gathering Data
Let’s say we want to perform some operations to change how our data looks. What if we want to turn our data from a wide format into long format? We might do this if we want to make a bar plot which counts how many people know each programming language.
To gather data that is spread across several columns, we use the gather()
function.
In this function, we have to specify a few things. As always, with our tidyverse functions we need to tell R which data set on which to perform the function.
We have to say what we will call our new column which contains the headings of the columns we want to gather. Here, we call it prog_lang
, and it will contain the column names for each programming language. (Normally, I’d use a more readable name, but I want to show as many columns for this data on the website.)
Next, we need to specify a value, which will contain the numbers from the programming language columns we’ve gathered together. This will essentially tell us whether or not people know that language or not.
Finally, we need to give the function the columns to gather together. We can do this by name, or by number. Since our programming language columns are all together, from column number 2 to 7, we can just specify the range of 2:7
.
gather(data = demo_wide,
key = prog_lang,
value = known,
2:7
)
## # A tibble: 174 × 8
## ID LANGUAGE progress gender age tester prog_lang known
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 22 OTHER FINISH female 30 GW C++ 1
## 2 23 <NA> no male 30 GW C++ 1
## 3 24 ENGLISH END female 30 GW C++ NA
## 4 25 <NA> ethics male 18 GW C++ 1
## 5 26 english ethics male 31 GW C++ NA
## 6 27 ENGLISH END female 44 GW C++ NA
## 7 28 <NA> ethics male 23 GW C++ NA
## 8 29 english ethics male 34 GW C++ NA
## 9 30 <NA> <NA> female 32 GW C++ NA
## 10 A2 English END female 33 RHB C++ 1
## # … with 164 more rows
That worked nicely, but it seems that it’s formatted the data so it goes through each programming language alphabetically first, so our IDs are spread all over the data set.
To fix this, we can use another function, called arrange
, which takes a data argument and a column by which to arrange the data.
4.3.1.1 The Pipe
At this point, I’ll introduce you to a new way of writing our commands which is better when we want to apply several functions. Instead of nesting it all together, we can write our commands from left to right, like how we read English text. Here, we can use the pipe %>%
at the end of a line, which can be read as “and then do…”. Below, we simply give our data.frame, demo_wide, and use the pipe to apply the gather function. This is the same code as above, just represented in a different way. Read this like, “take our data, and then, gather the columns together.”
%>%
demo_wide gather(
key = prog_lang,
value = known,
2:7
)
## # A tibble: 174 × 8
## ID LANGUAGE progress gender age tester prog_lang known
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 22 OTHER FINISH female 30 GW C++ 1
## 2 23 <NA> no male 30 GW C++ 1
## 3 24 ENGLISH END female 30 GW C++ NA
## 4 25 <NA> ethics male 18 GW C++ 1
## 5 26 english ethics male 31 GW C++ NA
## 6 27 ENGLISH END female 44 GW C++ NA
## 7 28 <NA> ethics male 23 GW C++ NA
## 8 29 english ethics male 34 GW C++ NA
## 9 30 <NA> <NA> female 32 GW C++ NA
## 10 A2 English END female 33 RHB C++ 1
## # … with 164 more rows
Still, we’re left with the same grouping problem, so we can apply another function, arrange()
at the end to arrange the data by ID.
%>%
demo_wide gather(
key = prog_lang,
value = known,
2:7
%>%
) arrange(ID)
## # A tibble: 174 × 8
## ID LANGUAGE progress gender age tester prog_lang known
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 22 OTHER FINISH female 30 GW C++ 1
## 2 22 OTHER FINISH female 30 GW FORTRAN 1
## 3 22 OTHER FINISH female 30 GW JavaScript 1
## 4 22 OTHER FINISH female 30 GW Python NA
## 5 22 OTHER FINISH female 30 GW R NA
## 6 22 OTHER FINISH female 30 GW Ruby NA
## 7 23 <NA> no male 30 GW C++ 1
## 8 23 <NA> no male 30 GW FORTRAN NA
## 9 23 <NA> no male 30 GW JavaScript NA
## 10 23 <NA> no male 30 GW Python NA
## # … with 164 more rows
This is now in a better format. However, we have a lot of rows with NA in the language, where people don’t know that language. This is redundant information. Additionally, the known column is now redundant if we remove the languages people don’t know, so we can remove this column too.
We’ll save this data under the name demo_gathered for comparison with the long formatted data set we already loaded.
<- demo_wide %>%
demo_gathered gather(
key = prog_lang,
value = known,
2:7,
na.rm = TRUE
%>%
) arrange(ID) %>%
select(-known)
demo_gathered
## # A tibble: 54 × 7
## ID LANGUAGE progress gender age tester prog_lang
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 22 OTHER FINISH female 30 GW C++
## 2 22 OTHER FINISH female 30 GW FORTRAN
## 3 22 OTHER FINISH female 30 GW JavaScript
## 4 23 <NA> no male 30 GW C++
## 5 24 ENGLISH END female 30 GW FORTRAN
## 6 25 <NA> ethics male 18 GW C++
## 7 25 <NA> ethics male 18 GW FORTRAN
## 8 26 english ethics male 31 GW Python
## 9 26 english ethics male 31 GW Ruby
## 10 27 ENGLISH END female 44 GW FORTRAN
## # … with 44 more rows
We used the argument, na.rm = TRUE
to remove any rows in our value column, known, with an NA. Since we did that, the known column contains all 1s, as the only languages left are the ones people know. So, we used the select
function from dplyr to remove the known column. This function is used to select the column you want to keep in your data set. If you provide a column name with the -
prefix, this tells R to keep everything except that column; so we drop it from our data set!
4.3.2 Separating Columns
In our wide formatted data, we have two columns which store two data points in each cell: funRec has information on whether people found the experiment fun, and whether they’d recommend it to others. Let’s split this into separate columns. We just need to supply the name of the column to separate col
, and what you want it split into, as a list of the names the columns should take, into
.
As in previous examples, I’ll remove the middle rows from the data here, but feel free to print them all yourself in R. I’ve presented the code to do this below.
Remember, if we want to supply multiple names, we need to concatenate (c
) these names together.
%>%
demo_gathered separate(
col = funRec,
into = c("fun", "recommend")
)
## # A tibble: 54 × 5
## ID tester fun recommend completion_time
## <chr> <chr> <chr> <chr> <chr>
## 1 22 GW 7 no 2018-03-22 23:06:11_2018-03-23 00:25:51
## 2 22 GW 7 no 2018-03-22 23:06:11_2018-03-23 00:25:51
## 3 22 GW 7 no 2018-03-22 23:06:11_2018-03-23 00:25:51
## 4 23 GW 6 yes 2018-03-26 00:30:20_2018-03-26 02:15:52
## 5 24 GW 4 yes 2018-03-21 11:09:38_2018-03-21 12:16:28
## 6 25 GW 0 no 2018-03-25 13:03:58_2018-03-25 14:45:25
## 7 25 GW 0 no 2018-03-25 13:03:58_2018-03-25 14:45:25
## 8 26 GW 4 no 2018-03-24 06:46:30_2018-03-24 08:17:29
## 9 26 GW 4 no 2018-03-24 06:46:30_2018-03-24 08:17:29
## 10 27 GW 5 no 2018-03-21 03:23:57_2018-03-21 04:28:01
## # … with 44 more rows
Take a look at the two new columns. They are both parsed as characters, even though the fun column only contains numbers. We can ask R to convert the data types for the split column during in the separate function using convert = TRUE
.
%>%
demo_gathered separate(
col = funRec,
into = c("fun", "recommend"),
convert = TRUE
)
## # A tibble: 54 × 5
## ID tester fun recommend completion_time
## <chr> <chr> <int> <chr> <chr>
## 1 22 GW 7 no 2018-03-22 23:06:11_2018-03-23 00:25:51
## 2 22 GW 7 no 2018-03-22 23:06:11_2018-03-23 00:25:51
## 3 22 GW 7 no 2018-03-22 23:06:11_2018-03-23 00:25:51
## 4 23 GW 6 yes 2018-03-26 00:30:20_2018-03-26 02:15:52
## 5 24 GW 4 yes 2018-03-21 11:09:38_2018-03-21 12:16:28
## 6 25 GW 0 no 2018-03-25 13:03:58_2018-03-25 14:45:25
## 7 25 GW 0 no 2018-03-25 13:03:58_2018-03-25 14:45:25
## 8 26 GW 4 no 2018-03-24 06:46:30_2018-03-24 08:17:29
## 9 26 GW 4 no 2018-03-24 06:46:30_2018-03-24 08:17:29
## 10 27 GW 5 no 2018-03-21 03:23:57_2018-03-21 04:28:01
## # … with 44 more rows
That looks much better!
separate()
is smart enough to know how to separate values if they are split by special characters. Before, we had an underscore in the funRec column, so it split the data by that. If this fails, you can directly specify how the values are separated using the sep
argument.
We also wanted to split the completion_time column. It looks like the first value is the start time, and the second is the end time. So lets separate these together with the funRec column.
%>%
demo_gathered separate(
col = funRec,
into = c("fun", "recommend"),
convert = TRUE
%>%
) separate(
col = completion_time,
into = c("start_time", "end_time")
)
## Warning: Expected 2 pieces. Additional pieces discarded in 54 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## # A tibble: 54 × 6
## ID tester fun recommend start_time end_time
## <chr> <chr> <int> <chr> <chr> <chr>
## 1 22 GW 7 no 2018 03
## 2 22 GW 7 no 2018 03
## 3 22 GW 7 no 2018 03
## 4 23 GW 6 yes 2018 03
## 5 24 GW 4 yes 2018 03
## 6 25 GW 0 no 2018 03
## 7 25 GW 0 no 2018 03
## 8 26 GW 4 no 2018 03
## 9 26 GW 4 no 2018 03
## 10 27 GW 5 no 2018 03
## # … with 44 more rows
Oops, it looks like separate struggled to split our completion_time column correctly. That’s because it wants to split at every dash, whitespace, colon, and underscore; pretty much the whole completion_time column! Let’s be more specific and tell separate()
to just split the columns at the underscore.
Let’s also overwrite our demo_wide data (assign the new data to the old data name) to use this new format in the next section.
<- demo_gathered %>%
demo_gathered separate(
col = funRec,
into = c("fun", "recommend"),
convert = TRUE
%>%
) separate(
col = completion_time,
into = c("start_time", "end_time"),
sep = "_"
)
# see the data
demo_gathered
## # A tibble: 54 × 6
## ID tester fun recommend start_time end_time
## <chr> <chr> <int> <chr> <chr> <chr>
## 1 22 GW 7 no 2018-03-22 23:06:11 2018-03-23 00:25:51
## 2 22 GW 7 no 2018-03-22 23:06:11 2018-03-23 00:25:51
## 3 22 GW 7 no 2018-03-22 23:06:11 2018-03-23 00:25:51
## 4 23 GW 6 yes 2018-03-26 00:30:20 2018-03-26 02:15:52
## 5 24 GW 4 yes 2018-03-21 11:09:38 2018-03-21 12:16:28
## 6 25 GW 0 no 2018-03-25 13:03:58 2018-03-25 14:45:25
## 7 25 GW 0 no 2018-03-25 13:03:58 2018-03-25 14:45:25
## 8 26 GW 4 no 2018-03-24 06:46:30 2018-03-24 08:17:29
## 9 26 GW 4 no 2018-03-24 06:46:30 2018-03-24 08:17:29
## 10 27 GW 5 no 2018-03-21 03:23:57 2018-03-21 04:28:01
## # … with 44 more rows
That looks a lot better! Notice that we didn’t try to convert the start_time and end_time columns as this data type doesn’t play nicely with separte. We’ll look into how to convert between data types in Lesson 5.
Note: If every row doesn’t produce the same number of columns, you can control what happens here with the extra
argument. We won’t cover this, but it’s useful to know if you get into problems with separate()
because of this issue.
Let’s compare our gathered data to the long formatted data we already loaded. It’s exactly the same, only we separated the two problematic columns – funRec and completion_time – and we have a different label for the programming languages known (prog_lang vs. computer_language).
head(demo_gathered)
head(demo_long)
## # A tibble: 6 × 11
## ID LANGUAGE progress gender age tester fun recommend start_time end_t…¹
## <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr> <chr> <chr>
## 1 22 OTHER FINISH female 30 GW 7 no 2018-03-2… 2018-0…
## 2 22 OTHER FINISH female 30 GW 7 no 2018-03-2… 2018-0…
## 3 22 OTHER FINISH female 30 GW 7 no 2018-03-2… 2018-0…
## 4 23 <NA> no male 30 GW 6 yes 2018-03-2… 2018-0…
## 5 24 ENGLISH END female 30 GW 4 yes 2018-03-2… 2018-0…
## 6 25 <NA> ethics male 18 GW 0 no 2018-03-2… 2018-0…
## # … with 1 more variable: prog_lang <chr>, and abbreviated variable name
## # ¹end_time
## # A tibble: 6 × 9
## ID LANGUAGE progress gender age tester funRec completion_time compu…¹
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 22 OTHER FINISH female 30 GW 7-no 2018-03-22 23:06:1… C++
## 2 22 OTHER FINISH female 30 GW 7-no 2018-03-22 23:06:1… FORTRAN
## 3 22 OTHER FINISH female 30 GW 7-no 2018-03-22 23:06:1… JavaSc…
## 4 23 <NA> no male 30 GW 6-yes 2018-03-26 00:30:2… C++
## 5 24 ENGLISH END female 30 GW 4-yes 2018-03-21 11:09:3… FORTRAN
## 6 25 <NA> ethics male 18 GW 0-no 2018-03-25 13:03:5… C++
## # … with abbreviated variable name ¹computer_language
4.3.3 Spreading Data
What if we want to go from long format to wide format? This can be useful if we want to do a paired-samples t-test, where we might want the first scores in one column, and the second scores in another. (We’ll cover t-tests in Lesson 6.)
To make our long data wide, we use the spread()
function from tidyr.
To spread our data we need a key
, the column containing the values we’d like to make column headers. We then also need a value, indicating the column containing the scores associated with the values. Often, this would be conditions in an experiment and test results.
Our problem here is a little more complex. Remember that we dropped the redundant column telling us whether or not people knew a programming language? Well, we need this back so we have a value
column to work from. We’ll mutate our data to create this column. To do this, we use the mutate()
function from dplyr. (We’ll look at this process in detail in Lesson 5.) Here, we just set everything in our new known column to 1 as we know if a language is present in a participant’s row, then they know it!
%>%
demo_gathered mutate(known = 1) %>%
spread(key = prog_lang, value = known)
## # A tibble: 29 × 11
## ID fun recomm…¹ start…² end_t…³ `C++` FORTRAN JavaS…⁴ Python R Ruby
## <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22 7 no 2018-0… 2018-0… 1 1 1 NA NA NA
## 2 23 6 yes 2018-0… 2018-0… 1 NA NA NA NA NA
## 3 24 4 yes 2018-0… 2018-0… NA 1 NA NA NA NA
## 4 25 0 no 2018-0… 2018-0… 1 1 NA NA NA NA
## 5 26 4 no 2018-0… 2018-0… NA NA NA 1 NA 1
## 6 27 5 no 2018-0… 2018-0… NA 1 1 NA NA NA
## 7 28 0 no 2018-0… 2018-0… NA NA NA NA 1 1
## 8 29 6 no 2018-0… 2018-0… NA NA 1 NA NA NA
## 9 30 1 yes 2018-0… 2018-0… NA 1 NA NA NA NA
## 10 A2 4 yes 2018-0… 2018-0… 1 1 1 NA NA NA
## # … with 19 more rows, and abbreviated variable names ¹recommend, ²start_time,
## # ³end_time, ⁴JavaScript
Great, that looks exactly the same as our demo_wide data, only with our nicely split columns.
Note: If you have your values spread across several columns, spread will spread by every unique value, so be sure to collapse your values into one column before you do this. You can do this using the unite()
function, or pasting values together with mutate()
, but we’ll cover this more in Lesson 5.
4.4 Joins
Finally, we’ll look at combining data together from separate tables. This is a common problem when we store demographic information in one data set, and test scores in another. Let’s say we’re interested in differences in performance by age. To do this, we somehow need to join together the demographic information of age with the correct participant ID in the test data.
Let’s load some raw data for the lexical decision times to see how we might join data together from separate data sets.
<- read_csv("inputs/lexical_decision_raw_data.csv") lexdec_data
To make these examples easier to digest, we’ll simply look at a single trial for an individual participant. Again, to do this we’ll use some subsetting techniques that we’ll go into in more detail in Lesson 5.
# keep only trials (rows) where the word is ant
<- lexdec_data %>% filter(word == "ant") lexdec_subset
How does the data look? We have 26 recorded entries. For two participants, they have missing values (NA) for these trials, indicating that they didn’t complete this trial, or the trial wasn’t recorded.
lexdec_subset
## # A tibble: 25 × 9
## subject trial native_language word class frequency length correct RT
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 A1 157 English ant animal 5.35 3 correct 876.
## 2 A3 41 Other ant animal 5.35 3 correct 607.
## 3 C 86 English ant animal 5.35 3 correct 725.
## 4 D 138 Other ant animal 5.35 3 correct 628.
## 5 I 105 Other ant animal 5.35 3 correct 560.
## 6 J 43 Other ant animal 5.35 3 correct 516.
## 7 K 183 English ant animal 5.35 3 correct 442.
## 8 M1 171 English ant animal 5.35 3 correct 427.
## 9 M2 117 Other ant animal 5.35 3 correct 530.
## 10 P 115 Other ant animal 5.35 3 correct 533.
## # … with 15 more rows
This data set contains all the information for our trials, including the data to identify each trial, and what the score was on our dependent variables (correct/incorrect response, reaction time). Here we have the data in a nice format where each column represents a variable, and each cell represents a value for that variable.
Let’s look at our demographic data set.
As you can tell, our our data is in a long format. Additionally, it looks like we don’t have any record of the language spoken by subject 23. On top of this, we have the cryptically named “no” entry in the progress column. I’m guessing this means that they decided to withdraw from the experiment. This means that we have more information on participants than we will have in the lexdec_subset dataset. The implications of this will become apparent as we try out different joining operations.
demo_gathered
## # A tibble: 54 × 11
## ID LANGUAGE progress gender age tester fun recommend start_t…¹ end_t…²
## <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr> <chr> <chr>
## 1 22 OTHER FINISH female 30 GW 7 no 2018-03-… 2018-0…
## 2 22 OTHER FINISH female 30 GW 7 no 2018-03-… 2018-0…
## 3 22 OTHER FINISH female 30 GW 7 no 2018-03-… 2018-0…
## 4 23 <NA> no male 30 GW 6 yes 2018-03-… 2018-0…
## 5 24 ENGLISH END female 30 GW 4 yes 2018-03-… 2018-0…
## 6 25 <NA> ethics male 18 GW 0 no 2018-03-… 2018-0…
## 7 25 <NA> ethics male 18 GW 0 no 2018-03-… 2018-0…
## 8 26 english ethics male 31 GW 4 no 2018-03-… 2018-0…
## 9 26 english ethics male 31 GW 4 no 2018-03-… 2018-0…
## 10 27 ENGLISH END female 44 GW 5 no 2018-03-… 2018-0…
## # … with 44 more rows, 1 more variable: prog_lang <chr>, and abbreviated
## # variable names ¹start_time, ²end_time
4.4.1 Mutating Joins
There are a number of joining operations we can do that will mutate (change the look of) our data:
left_join(data_one, data_two)
: Keeps everything in data_one and adds everything present in both data_one and data_tworight_join(data_one, data_two)
: Keeps everything in data_two and adds everything present in both data_two and data_oneinner_join(data_one, data_two)
: Keeps everything present in both data sets.full_join(data_one, data_two)
: Keeps everything from both data sets. Adds NAs if information is present in only one data set.
Don’t worry about the number of different joins here, they all take a similar form, but just do slightly different things to your data.
All of these joins take a by
argument, which asks you which columns by which you want to combine the data. If we want to make sure we match up the data, we have to make sure our columns have the same headings across the two data sets.
Take a look at the two data sets above, it looks like we identify subjects with subject in the lexdec_data data set, and by ID in the demo_gathered data set. We also have the identifier for the language spoken as native_language in the lexdec_data data set, and as LANGUAGE in the demo_gathered data set.
We can use rename
from dplr to rename our columns. Here we just supply the new name and the old name. The names in demo_gathered are messy, so we’ll change those to match the lexdec_data names.
<- rename(demo_gathered,
demo_gathered subject = ID,
native_language = LANGUAGE
) demo_gathered
## # A tibble: 54 × 11
## subject native_la…¹ progr…² gender age tester fun recom…³ start…⁴ end_t…⁵
## <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr> <chr> <chr>
## 1 22 OTHER FINISH female 30 GW 7 no 2018-0… 2018-0…
## 2 22 OTHER FINISH female 30 GW 7 no 2018-0… 2018-0…
## 3 22 OTHER FINISH female 30 GW 7 no 2018-0… 2018-0…
## 4 23 <NA> no male 30 GW 6 yes 2018-0… 2018-0…
## 5 24 ENGLISH END female 30 GW 4 yes 2018-0… 2018-0…
## 6 25 <NA> ethics male 18 GW 0 no 2018-0… 2018-0…
## 7 25 <NA> ethics male 18 GW 0 no 2018-0… 2018-0…
## 8 26 english ethics male 31 GW 4 no 2018-0… 2018-0…
## 9 26 english ethics male 31 GW 4 no 2018-0… 2018-0…
## 10 27 ENGLISH END female 44 GW 5 no 2018-0… 2018-0…
## # … with 44 more rows, 1 more variable: prog_lang <chr>, and abbreviated
## # variable names ¹native_language, ²progress, ³recommend, ⁴start_time,
## # ⁵end_time
Note: We could alternatively set by
to by = c("subject" = "ID", "native_language" = "LANGUAGE")
to join by variables with different names across the data sets, but I find it’s good practice to be consistent with your naming.
4.4.1.1 Full Join
Now we can join the data sets together. We’ll do a full_join()
first, just to see what happens.
full_join(lexdec_subset, demo_gathered, by = c("subject", "native_language"))
## # A tibble: 55 × 18
## subject trial nativ…¹ word class frequ…² length correct RT progr…³ gender
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 A1 157 English ant anim… 5.35 3 correct 876. <NA> <NA>
## 2 A3 41 Other ant anim… 5.35 3 correct 607. END male
## 3 C 86 English ant anim… 5.35 3 correct 725. END female
## 4 C 86 English ant anim… 5.35 3 correct 725. END female
## 5 D 138 Other ant anim… 5.35 3 correct 628. END non-b…
## 6 D 138 Other ant anim… 5.35 3 correct 628. END non-b…
## 7 I 105 Other ant anim… 5.35 3 correct 560. END male
## 8 J 43 Other ant anim… 5.35 3 correct 516. END female
## 9 J 43 Other ant anim… 5.35 3 correct 516. END female
## 10 J 43 Other ant anim… 5.35 3 correct 516. END female
## # … with 45 more rows, 7 more variables: age <dbl>, tester <chr>, fun <int>,
## # recommend <chr>, start_time <chr>, end_time <chr>, prog_lang <chr>, and
## # abbreviated variable names ¹native_language, ²frequency, ³progress
We’ve successfully merged the two data sets, but we now have multiple rows for our responses because we kept the programming language column. This is problematic if we want to calculate any statistics directly on this data frame, as we’ll end up with what seems like multiple observations for a single trial.
We have a couple of workarounds for this problem:
- Merge with the demographic data in a wide format, in which case we’ll have multiple columns each representing a different programming language.
- Merge with the demographic data set in a long format, but exclude the prog_lang column and filter the leftover duplicate rows prior to merging.
For now, we’ll stick with 1 as it required fewer steps. But first, we want to transform our nicely tidied demographic data set into a wide format. Just reuse the code from the spreading section to do this:
<- demo_gathered %>%
tidy_demo_wide mutate(known = 1) %>% # create a value column
spread(key = prog_lang, value = known) # data to wide format
# see the output
tidy_demo_wide
## # A tibble: 29 × 16
## subject native_la…¹ progr…² gender age tester fun recom…³ start…⁴ end_t…⁵
## <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr> <chr> <chr>
## 1 22 OTHER FINISH female 30 GW 7 no 2018-0… 2018-0…
## 2 23 <NA> no male 30 GW 6 yes 2018-0… 2018-0…
## 3 24 ENGLISH END female 30 GW 4 yes 2018-0… 2018-0…
## 4 25 <NA> ethics male 18 GW 0 no 2018-0… 2018-0…
## 5 26 english ethics male 31 GW 4 no 2018-0… 2018-0…
## 6 27 ENGLISH END female 44 GW 5 no 2018-0… 2018-0…
## 7 28 <NA> ethics male 23 GW 0 no 2018-0… 2018-0…
## 8 29 english ethics male 34 GW 6 no 2018-0… 2018-0…
## 9 30 <NA> <NA> female 32 GW 1 yes 2018-0… 2018-0…
## 10 A2 English END female 33 RHB 4 yes 2018-0… 2018-0…
## # … with 19 more rows, 6 more variables: `C++` <dbl>, FORTRAN <dbl>,
## # JavaScript <dbl>, Python <dbl>, R <dbl>, Ruby <dbl>, and abbreviated
## # variable names ¹native_language, ²progress, ³recommend, ⁴start_time,
## # ⁵end_time
Now, if we try the full join, we’ll merge together the two data sets so we have all of the information in one place! We want to match the data sets by the subject ID and the native language spoken by the participants, as these two columns appear in both data sets.
As before, we have so much data that I’ll subset things so we can see the relevant information, but be sure to print the whole output in R yourself.
full_join(lexdec_subset, tidy_demo_wide, by = c("subject", "native_language"))
## # A tibble: 5 × 6
## subject trial native_language progress correct RT
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 A1 157 English <NA> correct 876.
## 2 23 41 <NA> no correct 602.
## 3 24 43 ENGLISH END correct 541.
## 4 28 NA <NA> ethics <NA> NA
## 5 A2 NA English END <NA> NA
As you can see, we now have 1 row for each subject.
In cases where we don’t have data on a subject, we simply have NAs in those cells.
Look at subject 23, you can see that we don’t have data on their native language, but we have their trial information. This means they are in both data sets, but they have missing data in both cases.
Look at subject A2, they have missing data for their trial information, but we know their native language and progress. This means they are missing from the lexdec_subset data set, but they are present in the tidy_demo_wide data set.
Look at subject 28, they have missing trial data, indicating they aren’t present in the lexdec_subset data set, and they are missing a native language, which indicates this data is also missing in the tidy_demo_wide data set.
4.4.1.2 Inner Join
This keeps data only present in both data sets.
We have lost participants A1 and A2 because A1 wasn’t present in the tidy_demo_wide data set, and A2 wasn’t present in the lexdec_subset data set.
inner_join(lexdec_subset, tidy_demo_wide, by = c("subject", "native_language"))
## # A tibble: 24 × 6
## subject trial native_language progress correct RT
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 A3 41 Other END correct 607.
## 2 C 86 English END correct 725.
## 3 D 138 Other END correct 628.
## 4 I 105 Other END correct 560.
## 5 J 43 Other END correct 516.
## 6 K 183 English END correct 442.
## 7 M1 171 English END correct 427.
## 8 M2 117 Other END correct 530.
## 9 P 115 Other END correct 533.
## 10 R1 30 English END correct 483.
## # … with 14 more rows
4.4.1.3 Left Join
Left joins only keep the data that is present in the left data set (lexdec_subset) and adds anything that matches up from the right data set (tidy_demo_wide).
Here we have participant A1 because they are in the lexdec_subset, even if they are missing from the tidy_demo_wide data set.
left_join(lexdec_subset, tidy_demo_wide, by = c("subject", "native_language"))
## # A tibble: 25 × 6
## subject trial native_language progress correct RT
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 A1 157 English <NA> correct 876.
## 2 A3 41 Other END correct 607.
## 3 C 86 English END correct 725.
## 4 D 138 Other END correct 628.
## 5 I 105 Other END correct 560.
## 6 J 43 Other END correct 516.
## 7 K 183 English END correct 442.
## 8 M1 171 English END correct 427.
## 9 M2 117 Other END correct 530.
## 10 P 115 Other END correct 533.
## # … with 15 more rows
4.4.1.4 Right Join
This works like the left join, only it keeps everything present in the right data set and anything matching from the left data set.
Here, we do not have data on participant A1 because they are not present in the tidy_demo_wide data set.
right_join(lexdec_subset, tidy_demo_wide, by = c("subject", "native_language"))
## # A tibble: 29 × 6
## subject trial native_language progress correct RT
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 A3 41 Other END correct 607.
## 2 C 86 English END correct 725.
## 3 D 138 Other END correct 628.
## 4 I 105 Other END correct 560.
## 5 J 43 Other END correct 516.
## 6 K 183 English END correct 442.
## 7 M1 171 English END correct 427.
## 8 M2 117 Other END correct 530.
## 9 P 115 Other END correct 533.
## 10 R1 30 English END correct 483.
## # … with 19 more rows
4.4.2 Filtering Joins
We can filter data by using joins. These next joins don’t merge columns, but instead allow us to just subset our data.
4.4.2.1 Semi Join
With a semi-join we keep all rows and columns from the left data set where we have matching values in the right data set. Crucially, we do not keep the columns from the right data set.
semi_join(lexdec_subset, tidy_demo_wide, by = c("subject", "native_language"))
## # A tibble: 24 × 9
## subject trial native_language word class frequency length correct RT
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 A3 41 Other ant animal 5.35 3 correct 607.
## 2 C 86 English ant animal 5.35 3 correct 725.
## 3 D 138 Other ant animal 5.35 3 correct 628.
## 4 I 105 Other ant animal 5.35 3 correct 560.
## 5 J 43 Other ant animal 5.35 3 correct 516.
## 6 K 183 English ant animal 5.35 3 correct 442.
## 7 M1 171 English ant animal 5.35 3 correct 427.
## 8 M2 117 Other ant animal 5.35 3 correct 530.
## 9 P 115 Other ant animal 5.35 3 correct 533.
## 10 R1 30 English ant animal 5.35 3 correct 483.
## # … with 14 more rows
Here we only kept data in the lexdec_subset for subjects that were present in both data sets. Notice how we do not have data for subjects A1 and A2.
This works like an inner join, but does not duplicate rows.
Notice that we get the same result with the long demographic data set as with the wide demographic data set.
semi_join(lexdec_subset, demo_gathered, by = c("subject", "native_language"))
## # A tibble: 24 × 9
## subject trial native_language word class frequency length correct RT
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 A3 41 Other ant animal 5.35 3 correct 607.
## 2 C 86 English ant animal 5.35 3 correct 725.
## 3 D 138 Other ant animal 5.35 3 correct 628.
## 4 I 105 Other ant animal 5.35 3 correct 560.
## 5 J 43 Other ant animal 5.35 3 correct 516.
## 6 K 183 English ant animal 5.35 3 correct 442.
## 7 M1 171 English ant animal 5.35 3 correct 427.
## 8 M2 117 Other ant animal 5.35 3 correct 530.
## 9 P 115 Other ant animal 5.35 3 correct 533.
## 10 R1 30 English ant animal 5.35 3 correct 483.
## # … with 14 more rows
4.4.2.2 Anti Join
An anti-join works like the inverse of a semi-join. Here, we get all the values from the left table that do not have a match in the right table.
anti_join(lexdec_subset, tidy_demo_wide, by = c("subject", "native_language"))
## # A tibble: 1 × 9
## subject trial native_language word class frequency length correct RT
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 A1 157 English ant animal 5.35 3 correct 876.
In this case, we only get participant A1 from the lexdec_subset data set, as we do not have any demographic information on this subject in the tidy_demo_wide data set.
4.4.3 Binding Joins
We can bind rows from separate data sets with the same number of columns using the bind_rows()
command. This is useful if we have ran an experiment in two parts on differnet sets of subjects, and we simply want to put all of the responses in one data set.
Alternatively, we can bind columns from separate data sets with the same number of rows using the bind_cols()
command. This is useful if we have an experiment in two parts where we want to want to add some additional information about all of participants to one data set.
4.5 Checking for Unique and Duplicate Information
Finally, we can use a number of functions to check for unique information across two different data sets.
intersect()
gives us all the rows in two tables that match exactly. This is useful if we have messy data stored in multiple tables and we’re not sure if we have duplicates. Note that every cell has to match exactly for this to work.union()
gives us all of the rows from two tables except any duplicates.setdiff()
gives us rows from our first data set that aren’t present in the second.
4.6 Exercises
4.6.1 Introduction and Setup
For these exercises, we will look at the core concepts from this lesson. We’ll also get some hands-on experience with binding joins and checking for duplicates, two concepts that we’ve touched on but not went into much detail.
For these exercises we’ll use some toy data sets; ex_demo_data, which has demographic information on 6 participants, and ex_test_data, which has IQ test scores for 6 participants. Crucially, the first data set has some missing values, and the second has the same participant tested twice.
# load the tidyverse
library(tidyverse)
# demographic data
<- tibble(
ex_demo_data subject = seq(1: 6),
height = c(NA, 170, 160, 165, NA, 180),
weight = c(70, 65, 80, NA, 77, 90),
age = c(18, 19, 19, NA, 22, 28)
)
# IQ test scores
<- tibble(
ex_test_data subject = c(1, 3, 4, 4, 5, 6, 7),
IQ = c(150, 160, 155, 155, 190, 120, 140)
)
4.6.2 Long and Wide Data
4.6.2.1 Question 1
Put the ex_demo_data
into a long format with three columns: subject, measurement_id, and measurement. The measurement column should contain the scores for the height, weight, and age of the participants. The measurement_id column should contain text specifying which measurement belongs to which variable (height, weight, or age). Assign this to the variable long_data
and return this table of data.
4.6.3 Uniting and Separating Columns
Here we have some messy data where we have two values for two variables in one column; height_weight
.
<- unite(ex_demo_data,
messy_demo_data "height_weight",
c("height", "weight"),
sep = "_"
) messy_demo_data
## # A tibble: 6 × 3
## subject height_weight age
## <int> <chr> <dbl>
## 1 1 NA_70 18
## 2 2 170_65 19
## 3 3 160_80 19
## 4 4 165_NA NA
## 5 5 NA_77 22
## 6 6 180_90 28
4.6.4 Mutating Joins
4.6.4.1 Question 4
Join the ex_demo_data
and ex_test_data
together by subject number, keeping only data with a match in ex_test_data
.
4.6.6 Binding Joins
Here we have some new data looking at the demographic scores for new subjects. We also have another rating for all of the participants from our study and we want to add this to the demographic data.
<- tibble(
new_demographics subject = c(9, 10),
height = c(170, 190),
weight = c(76, 85),
age = c(40, 59)
)
<- tibble(
eye_colour eye_colour = sample(c("blue", "brown", "green"),
size = 8,
replace = TRUE
) )