Chapter 6 Data Transformation 4: Fine-tuned Data Manipulation
Intended Learning Outcomes
The whole purpose of this lecture is to expose you to a couple of useful functions that would make your daily life easier when dealing with your own data. By the end of today, you will know
- How to rename column headers
- How to recode cell values based on conditions
- How to obtain unique values from your data
This lesson is led by Gaby Mahrholz.
6.1 Pre-steps
Before we can start to focus on some other functions that might come in handy for data manipulation, we need to make sure to load tidyverse
into the library.
library(tidyverse)
6.2 rename()
rename()
is a very useful function if we wanted to change column names. All available column are retained, so nothing is lost. If you had 5 variables in a tibble, and wanted to change the name of one of them, your output would be 5 columns in total - one column with a changed name and 4 columns with the previous names. The rename()
function follows a very simple pattern of
rename(data, new_column_name = old_column_name)
You can rename multiple column headers by including more arguments, separating them with a comma. Let’s take a look at a simple example in table1
.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
If we wanted to change the column headers of country
to Country
and population
to Population
, and keep the columns year
and cases
as they are, we would code
rename(table1, Country = country,
Population = population)
## # A tibble: 6 x 4
## Country year cases Population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
If you want to select a few columns and rename them in the process, you could use select()
. The structure is the same as with rename()
, the only difference is that the new data only retains the columns mentioned explicitly within select()
.
select(data, new_column_name = old_column_name)
Modifying the rename()
example from above would lead to selecting only the columns country
and population
from table1
and renaming them as Country
and Population
at the same time.
select(table1, Country = country,
Population = population)
## # A tibble: 6 x 2
## Country Population
## <chr> <int>
## 1 Afghanistan 19987071
## 2 Afghanistan 20595360
## 3 Brazil 172006362
## 4 Brazil 174504898
## 5 China 1272915272
## 6 China 1280428583
6.3 Conditional statements
Conditional statements are used when you want to create an output value that is conditioned on an evaluation. Here, we are showing you how to use them in combination with mutate()
by either modifying the values within one column, or by adding a new column.
Example scenario: Imagine that we have data in which a column Sex
holds values of 1
and 2
that stand for Male
, and Female
participants respectively. Conditional statements make an evaluation: if Sex
is 1
, then put the value of Male
in a new column called Sex_char
; if Sex
is 2
put Female
in column Sex_char
.
Let’s create some fake data for this. 40 participants (17 male, 20 female, 3 other), they are between 16 and 40 years of age, they study at 4 different institutions (University of Glasgow, University of Strathclyde, Glasgow Caledonian University, University of the West of Scotland), and we had them complete an R assignment and recorded their grade points.
set.seed(999) #so we all get the same random data
fake_data <- tibble(PP_ID = 1:40,
Sex = c(rep(1,17), rep(2,20), rep(3,3)),
Age = sample(16:40, 40, replace = TRUE),
Institution = sample(c(rep(1,10), rep(2,8), rep(3,10), rep(4,10), rep(99,2))),
Grade_points = sample(c(0,9:22), 40, replace = TRUE))
glimpse(fake_data)
## Observations: 40
## Variables: 5
## $ PP_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ Age <int> 19, 22, 24, 29, 16, 25, 37, 18, 38, 39, 30, 20, 22, 21...
## $ Institution <dbl> 3, 1, 1, 4, 2, 3, 1, 4, 3, 4, 3, 4, 4, 4, 2, 3, 3, 99,...
## $ Grade_points <dbl> 0, 20, 15, 20, 11, 22, 21, 9, 13, 22, 19, 13, 10, 0, 1...
kable(fake_data)
PP_ID | Sex | Age | Institution | Grade_points |
---|---|---|---|---|
1 | 1 | 19 | 3 | 0 |
2 | 1 | 22 | 1 | 20 |
3 | 1 | 24 | 1 | 15 |
4 | 1 | 29 | 4 | 20 |
5 | 1 | 16 | 2 | 11 |
6 | 1 | 25 | 3 | 22 |
7 | 1 | 37 | 1 | 21 |
8 | 1 | 18 | 4 | 9 |
9 | 1 | 38 | 3 | 13 |
10 | 1 | 39 | 4 | 22 |
11 | 1 | 30 | 3 | 19 |
12 | 1 | 20 | 4 | 13 |
13 | 1 | 22 | 4 | 10 |
14 | 1 | 21 | 4 | 0 |
15 | 1 | 34 | 2 | 12 |
16 | 1 | 26 | 3 | 17 |
17 | 1 | 33 | 3 | 20 |
18 | 2 | 16 | 99 | 22 |
19 | 2 | 20 | 2 | 12 |
20 | 2 | 34 | 3 | 20 |
21 | 2 | 27 | 3 | 10 |
22 | 2 | 24 | 4 | 11 |
23 | 2 | 36 | 1 | 15 |
24 | 2 | 38 | 99 | 21 |
25 | 2 | 20 | 2 | 19 |
26 | 2 | 40 | 2 | 15 |
27 | 2 | 19 | 4 | 18 |
28 | 2 | 39 | 3 | 16 |
29 | 2 | 40 | 4 | 21 |
30 | 2 | 30 | 2 | 20 |
31 | 2 | 38 | 1 | 16 |
32 | 2 | 31 | 1 | 19 |
33 | 2 | 26 | 4 | 17 |
34 | 2 | 17 | 1 | 18 |
35 | 2 | 24 | 1 | 21 |
36 | 2 | 16 | 3 | 21 |
37 | 2 | 33 | 1 | 18 |
38 | 3 | 39 | 2 | 18 |
39 | 3 | 23 | 1 | 17 |
40 | 3 | 19 | 2 | 18 |
Now, the data doesn’t look very readable at the moment. If we pass this data set onto a fellow researcher, they would ask us how the columns Sex
and Institution
are coded.
Best would be to adjust the values to make that clearer. There are the options of using the function recode()
, an ifelse
statement or case_when()
.
6.3.1 recode()
Can be used for character, numeric, and factors columns.
The structure for recode()
is:
recode(variable_to_recode, old_value = new_value)
This is a bit tricky to remember as it doesn’t follow the usual tidyverse logic. In select()
, rename()
, and mutate()
for example it is the other way around new_thing = old_thing
or new_thing = something_computational
)
In combination with mutate()
, the structure looks as follows:
mutate(data, new_column_name = recode(variable_to_recode, old_value = new_value))
What is a bit unusual in this structure is that we need a new column name within the mutate()
function but also a reference column within recode()
on which the conditional recoding is based upon.
The new column name within mutate()
can either be an entirely new column name, or if it is an existing column name that original column would be overwritten by the new values we recode.
So, let’s turn to our fake_data
, and apply that logic for recoding the values of the column Sex
where 1 = “Male”, 2 = “Female”, and 3 = “Other”. We will add a new column Sex_char
onto our fake_data
.
The only tricky thing to remember here is that recoding numeric values requires backticks (if the numbers are on the left side of the equation).
fake_data <- mutate(fake_data, Sex_char = recode(Sex,
`1` = "Male",
`2` = "Female",
`3` = "Other"))
glimpse(fake_data)
## Observations: 40
## Variables: 6
## $ PP_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ Age <int> 19, 22, 24, 29, 16, 25, 37, 18, 38, 39, 30, 20, 22, 21...
## $ Institution <dbl> 3, 1, 1, 4, 2, 3, 1, 4, 3, 4, 3, 4, 4, 4, 2, 3, 3, 99,...
## $ Grade_points <dbl> 0, 20, 15, 20, 11, 22, 21, 9, 13, 22, 19, 13, 10, 0, 1...
## $ Sex_char <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male"...
If you forget to recode one condition (for example `3` = “Other”
), values for 3 would be displayed as NA
in the new column. This becomes more important when you decide to overwrite an existing column.
Your turn
The next variable to recode is Institution
:
- 1 = UoG,
- 2 = Strathclyde,
- 3 = GCU,
- 4 = UWS
In the questionnaire, two participants forgot to indicate which university they are studying at (coded as 99), but fortunately we know that they are at the University of Strathclyde (the test was distributed to 10 students at each institution, however, Strathclyde only has 8 responses).
Add a new column Institution_char
to fake_data
that holds the recoded values as
fake_data <- mutate(fake_data, Institution_char = recode(Institution,
`1` = "UoG",
`2` = "Strathclyde",
`3` = "GCU",
`4` = "UWS",
`99` = "Strathclyde"))
glimpse(fake_data)
## Observations: 40
## Variables: 7
## $ PP_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Age <int> 19, 22, 24, 29, 16, 25, 37, 18, 38, 39, 30, 20, 22...
## $ Institution <dbl> 3, 1, 1, 4, 2, 3, 1, 4, 3, 4, 3, 4, 4, 4, 2, 3, 3,...
## $ Grade_points <dbl> 0, 20, 15, 20, 11, 22, 21, 9, 13, 22, 19, 13, 10, ...
## $ Sex_char <chr> "Male", "Male", "Male", "Male", "Male", "Male", "M...
## $ Institution_char <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "GCU", ...
An alternative to recode()
is ifelse()
. Let’s see how Sex
and Institution
could have been recoded using ifelse()
.
6.3.2 ifelse()
ifelse()
is a good option when dealing with character or numeric input but struggles with dates and factors. None of the columns in fake_data
are dates or factors, so we can easily use ifelse()
for recoding variables.
The generic structure for ifelse()
is:
ifelse(test, yes, no)
- test: is an expression to test that creates an output of
TRUE
orFALSE
- yes: if test has a
TRUE
outcome, this is what would be coded in the new column - no: if test has a
FALSE
outcome, this is what would be coded in the new column
The interpretation of ifelse()
is: if the test expression is TRUE
, write the value of yes, otherwise no.
In combination with mutate()
, the structure looks as follows:
mutate(data, new_column_name = ifelse(test, yes, no))
How would we apply that to recoding Sex
? We would want the expression to test whether the Sex column holds a value of 1, then a new column should code it as “Male”, otherwise as “Female”. Let’s add a new column Sex_ifelse
to fake_data
fake_data <- mutate(fake_data, Sex_ifelse = ifelse(Sex == 1, "Male", "Female"))
glimpse(fake_data)
## Observations: 40
## Variables: 8
## $ PP_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Age <int> 19, 22, 24, 29, 16, 25, 37, 18, 38, 39, 30, 20, 22...
## $ Institution <dbl> 3, 1, 1, 4, 2, 3, 1, 4, 3, 4, 3, 4, 4, 4, 2, 3, 3,...
## $ Grade_points <dbl> 0, 20, 15, 20, 11, 22, 21, 9, 13, 22, 19, 13, 10, ...
## $ Sex_char <chr> "Male", "Male", "Male", "Male", "Male", "Male", "M...
## $ Institution_char <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "GCU", ...
## $ Sex_ifelse <chr> "Male", "Male", "Male", "Male", "Male", "Male", "M...
AHHHH!!! But what about the 3 participants that chose option 3? They are now coded as “Female”. Hmmm. One way to get around that would be a nested ifelse()
statement. Instead of replacing the no argument (FALSE
evaluation) with “Female”, we could include a second ifelse()
statement.
fake_data <- mutate(fake_data, Sex_ifelse = ifelse(Sex == 1, "Male",
ifelse(Sex == 2, "Female", "Other")))
Now everything is coded as it should be.
Your turn
Recode Institution
using ifelse()
. Add a new column Institution_ifelse
to fake_data
. A quick reminder of the Institution
values:
- 1 = “UoG”,
- 2 and 99 = “Strathclyde”,
- 3 = “GCU”, and
- 4 = “UWS”
fake_data <- mutate(fake_data, Institution_ifelse = ifelse(Institution == 1, "UoG",
ifelse(Institution == 3, "GCU",
ifelse(Institution == 4, "UWS", "Strathclyde"))))
glimpse(fake_data)
## Observations: 40
## Variables: 9
## $ PP_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ Sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Age <int> 19, 22, 24, 29, 16, 25, 37, 18, 38, 39, 30, 20, ...
## $ Institution <dbl> 3, 1, 1, 4, 2, 3, 1, 4, 3, 4, 3, 4, 4, 4, 2, 3, ...
## $ Grade_points <dbl> 0, 20, 15, 20, 11, 22, 21, 9, 13, 22, 19, 13, 10...
## $ Sex_char <chr> "Male", "Male", "Male", "Male", "Male", "Male", ...
## $ Institution_char <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "GCU"...
## $ Sex_ifelse <chr> "Male", "Male", "Male", "Male", "Male", "Male", ...
## $ Institution_ifelse <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "GCU"...
It does look a bit convoluted, and there is definitely the need to count opening and closing brackets, but it gets us the output we wanted. Before showing you how this can be coded in a slightly neater way with case_when()
, I’d like to highlight one advantage ifelse()
has over recode()
, namely that the test expression can evaluate more complex “things” as long as they return a logical value. For example, if we wanted to add a column Pass_Fail
to fake_data
, that holds values of “Pass” for grade points above 9, and “Fail” for values of 9 and below, we would code
fake_data <- mutate(fake_data, Pass_Fail = ifelse(Grade_points > 9, "Pass", "Fail"))
glimpse(fake_data)
## Observations: 40
## Variables: 10
## $ PP_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ Sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Age <int> 19, 22, 24, 29, 16, 25, 37, 18, 38, 39, 30, 20, ...
## $ Institution <dbl> 3, 1, 1, 4, 2, 3, 1, 4, 3, 4, 3, 4, 4, 4, 2, 3, ...
## $ Grade_points <dbl> 0, 20, 15, 20, 11, 22, 21, 9, 13, 22, 19, 13, 10...
## $ Sex_char <chr> "Male", "Male", "Male", "Male", "Male", "Male", ...
## $ Institution_char <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "GCU"...
## $ Sex_ifelse <chr> "Male", "Male", "Male", "Male", "Male", "Male", ...
## $ Institution_ifelse <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "GCU"...
## $ Pass_Fail <chr> "Fail", "Pass", "Pass", "Pass", "Pass", "Pass", ...
We could also add a column Grade
to fake_data
that tells us the primary band the students got in their test. Grade_points
between 18 and 22 result in A, 15 to 17 = B, 12 to 14 = C, 9 to 11 = D, 6 to 8 = E, 3 to 5 = F, 1 and 2 = G, and 0 = H. Here we can use the %in%
operator.
fake_data <- mutate(fake_data, Grade = ifelse(Grade_points %in% 18:22, "A",
ifelse(Grade_points %in% 15:17, "B",
ifelse(Grade_points %in% 12:14, "C",
ifelse(Grade_points %in% 9:11, "D",
ifelse(Grade_points %in% 6:8, "E",
ifelse(Grade_points %in% 3:5, "F",
ifelse(Grade_points %in% 1:2, "G","H"))))))))
glimpse(fake_data)
## Observations: 40
## Variables: 11
## $ PP_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ Sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Age <int> 19, 22, 24, 29, 16, 25, 37, 18, 38, 39, 30, 20, ...
## $ Institution <dbl> 3, 1, 1, 4, 2, 3, 1, 4, 3, 4, 3, 4, 4, 4, 2, 3, ...
## $ Grade_points <dbl> 0, 20, 15, 20, 11, 22, 21, 9, 13, 22, 19, 13, 10...
## $ Sex_char <chr> "Male", "Male", "Male", "Male", "Male", "Male", ...
## $ Institution_char <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "GCU"...
## $ Sex_ifelse <chr> "Male", "Male", "Male", "Male", "Male", "Male", ...
## $ Institution_ifelse <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "GCU"...
## $ Pass_Fail <chr> "Fail", "Pass", "Pass", "Pass", "Pass", "Pass", ...
## $ Grade <chr> "H", "A", "B", "A", "D", "A", "A", "D", "C", "A"...
6.3.3 case_when()
case_when()
vectorises multiple ifelse()
statements, and might potentially be a touch easier to digest visually than ifelse()
.
There are few guidelines to follow when using case_when()
:
- A sequence of two-sided formulas is needed.
- The left hand side (LHS) determines which values match this case. The right hand side (RHS) provides the replacement value.
- The LHS must evaluate to a logical vector. The RHS does not need to be logical, but all RHSs must evaluate to the same type of vector.
Let’s apply this to recoding the values in the column Institution
. Combining case_when()
with mutate()
, we will add another column called Institution_case_when
to fake_data
.
fake_data <- mutate(fake_data, Institution_case_when = case_when(
Institution == 1 ~ "UoG",
Institution == 3 ~ "GCU",
Institution == 4 ~ "UWS",
TRUE ~ "Strathclyde"
)
)
glimpse(fake_data)
## Observations: 40
## Variables: 12
## $ PP_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ Sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Age <int> 19, 22, 24, 29, 16, 25, 37, 18, 38, 39, 30, 2...
## $ Institution <dbl> 3, 1, 1, 4, 2, 3, 1, 4, 3, 4, 3, 4, 4, 4, 2, ...
## $ Grade_points <dbl> 0, 20, 15, 20, 11, 22, 21, 9, 13, 22, 19, 13,...
## $ Sex_char <chr> "Male", "Male", "Male", "Male", "Male", "Male...
## $ Institution_char <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "G...
## $ Sex_ifelse <chr> "Male", "Male", "Male", "Male", "Male", "Male...
## $ Institution_ifelse <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "G...
## $ Pass_Fail <chr> "Fail", "Pass", "Pass", "Pass", "Pass", "Pass...
## $ Grade <chr> "H", "A", "B", "A", "D", "A", "A", "D", "C", ...
## $ Institution_case_when <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "G...
The TRUE
statement at the end is similar to the the “else” in the ifelse()
function. A further similarity between the two is that arguments are evaluated in order, so you must proceed from the most specific to the most general.
Your turn
Add a column Grade_case_when
into fake_data
using a combination of mutate()
and case_when()
, that recodes grade points to primary bands. Quick reminder:
- grade points 18 - 22 = A,
- 15 to 17 = B,
- 12 to 14 = C,
- 9 to 11 = D,
- 6 to 8 = E,
- 3 to 5 = F,
- 1 and 2 = G, and
- 0 = H.
fake_data <- mutate(fake_data, Grade_case_when = case_when(
Grade_points %in% 18:22 ~ "A",
Grade_points %in% 15:17 ~ "B",
Grade_points %in% 12:14 ~ "C",
Grade_points %in% 9:11 ~ "D",
Grade_points %in% 6:8 ~ "E",
Grade_points %in% 3:5 ~ "F",
Grade_points %in% 1:2 ~ "G",
TRUE ~ "H"
)
)
glimpse(fake_data)
## Observations: 40
## Variables: 13
## $ PP_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ Sex <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Age <int> 19, 22, 24, 29, 16, 25, 37, 18, 38, 39, 30, 2...
## $ Institution <dbl> 3, 1, 1, 4, 2, 3, 1, 4, 3, 4, 3, 4, 4, 4, 2, ...
## $ Grade_points <dbl> 0, 20, 15, 20, 11, 22, 21, 9, 13, 22, 19, 13,...
## $ Sex_char <chr> "Male", "Male", "Male", "Male", "Male", "Male...
## $ Institution_char <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "G...
## $ Sex_ifelse <chr> "Male", "Male", "Male", "Male", "Male", "Male...
## $ Institution_ifelse <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "G...
## $ Pass_Fail <chr> "Fail", "Pass", "Pass", "Pass", "Pass", "Pass...
## $ Grade <chr> "H", "A", "B", "A", "D", "A", "A", "D", "C", ...
## $ Institution_case_when <chr> "GCU", "UoG", "UoG", "UWS", "Strathclyde", "G...
## $ Grade_case_when <chr> "H", "A", "B", "A", "D", "A", "A", "D", "C", ...
6.4 distinct()
The distinct()
function is used to remove duplicate rows in your dataframe. If we are using distinct()
without specifying any arguments, R checks the whole row and excludes any row that is exactly repeated.
fake_data_no_dup <- fake_data %>%
distinct()
We can see that all rows are still there. This is not surprising, given every participant has a unique ID number. However, distinct()
does take arguments as well, namely the variable columns you are trying to determine uniqueness for. Say we we wanted to see how many different institutions we have data from.
institution_distinct <- fake_data %>%
distinct(Institution_char)
institution_distinct
## # A tibble: 4 x 1
## Institution_char
## <chr>
## 1 GCU
## 2 UoG
## 3 UWS
## 4 Strathclyde
We can see that there are 4 distinct institutions. But what happened to the output? We still have the output from the column we specified but all other information is gone. One quick solution would be to use the argument .keep_all
and set that to TRUE
(the default here is FALSE
).
institution_distinct <- fake_data %>%
distinct(Institution_char, .keep_all = TRUE)
institution_distinct
## # A tibble: 4 x 13
## PP_ID Sex Age Institution Grade_points Sex_char Institution_char
## <int> <dbl> <int> <dbl> <dbl> <chr> <chr>
## 1 1 1 19 3 0 Male GCU
## 2 2 1 22 1 20 Male UoG
## 3 4 1 29 4 20 Male UWS
## 4 5 1 16 2 11 Male Strathclyde
## # ... with 6 more variables: Sex_ifelse <chr>, Institution_ifelse <chr>,
## # Pass_Fail <chr>, Grade <chr>, Institution_case_when <chr>,
## # Grade_case_when <chr>
In the new institution_distinct
dataframe, we still have 4 institutions listed but we kept all 13 columns. If there are multiple rows with the same input (in this case 40 participants from 4 different institutions), distinct()
only preserves the first row a new value occurs.
The distinct()
function also allows you to eliminate duplicate rows with multiple variables in the dataframe. So if we wanted to see where participants were studying but also the different sexes taking part in the study, we would code…
institution_sex_distinct <- fake_data %>%
distinct(Institution_char, Sex_char)
institution_sex_distinct
## # A tibble: 10 x 2
## Institution_char Sex_char
## <chr> <chr>
## 1 GCU Male
## 2 UoG Male
## 3 UWS Male
## 4 Strathclyde Male
## 5 Strathclyde Female
## 6 GCU Female
## 7 UWS Female
## 8 UoG Female
## 9 Strathclyde Other
## 10 UoG Other
institution_sex_distinct
has 10 observations, so that means, we have at least 1 male and 1 female participant from each of the 4 institutions. Additionally, Strathclyde and Glasgow have participants classifying as neither male nor female (here coded as “Other”).
You can see they are not sorted by either of the column. Why is that?
The values are sorted by occurrence. R goes through the tibble row by row to determine whether the values are distinct from the ones in the row before. So if you look at your original dataframe fake_data
, you can see the first row is a male participant from Glasgow Caledonian University. The participant in row 2 is from the University of Glasgow, but so is participant 3 (which means he is not listed in the distinct tibble). The next unique value would be detected in row 4, then in row 7, then in row 18, 20, 22, 23, 38, and 39.
6.5 Recap
Let’s just have some fun with fake_data
whilst recapping a little bit of what we did in the past 5 weeks.
6.5.1 Question 1
Select the variables PP_ID
, Sex_char
, Age
, Institution_char
, Grade_points
, Grade
, and Pass_Fail
from fake_data
and save it as fake_data
in your Global Environment (i.e. overwrite/replace fake_data
). Rename the following variables in the process:
PP_ID
asParticipant_ID
Sex_char
asSex
Institution_char
asInstitution
fake_data <- NULL
# solution using select:
fake_data <- fake_data %>%
select(Participant_ID = PP_ID, Sex = Sex_char, Age, Institution = Institution_char, Grade_points, Grade, Pass_Fail)
fake_data
## # A tibble: 40 x 7
## Participant_ID Sex Age Institution Grade_points Grade Pass_Fail
## <int> <chr> <int> <chr> <dbl> <chr> <chr>
## 1 1 Male 19 GCU 0 H Fail
## 2 2 Male 22 UoG 20 A Pass
## 3 3 Male 24 UoG 15 B Pass
## 4 4 Male 29 UWS 20 A Pass
## 5 5 Male 16 Strathclyde 11 D Pass
## 6 6 Male 25 GCU 22 A Pass
## 7 7 Male 37 UoG 21 A Pass
## 8 8 Male 18 UWS 9 D Fail
## 9 9 Male 38 GCU 13 C Pass
## 10 10 Male 39 UWS 22 A Pass
## # ... with 30 more rows
# solution involving select and rename:
# fake_data <- fake_data %>%
# select(PP_ID, Sex_char, Age, Institution_char, Grade_points, Grade, Pass_Fail) %>%
# rename(Participant_ID = PP_ID, Sex = Sex_char, Institution = Institution_char)
6.5.2 Question 2
Calculate the average grade points for each University. Store the output as a new object grade_points
in your Global Environment
. The new column should be named Average_grade_points
.
grade_points <- NULL
grade_points <- fake_data %>%
group_by(Institution) %>%
summarise(Average_grade_points = mean(Grade_points)) %>%
ungroup()
grade_points
## # A tibble: 4 x 2
## Institution Average_grade_points
## <chr> <dbl>
## 1 GCU 15.8
## 2 Strathclyde 16.8
## 3 UoG 18
## 4 UWS 14.1
6.5.3 Question 3
Join the fake_data
and grade_points
, and calculate how much each person differs from the mean of their university. Results should be stored in a column named grade_diff
. Store the output in your Global Environment
as a new object called fake_data_meandiff
. Sort the data by Institution
and Sex
.
fake_data_meandiff <- NULL
fake_data_meandiff <- left_join(fake_data, grade_points) %>%
mutate(grade_diff = Grade_points - Average_grade_points) %>%
arrange(Institution, Sex)
## Joining, by = "Institution"
fake_data_meandiff
## # A tibble: 40 x 9
## Participant_ID Sex Age Institution Grade_points Grade Pass_Fail
## <int> <chr> <int> <chr> <dbl> <chr> <chr>
## 1 20 Fema~ 34 GCU 20 A Pass
## 2 21 Fema~ 27 GCU 10 D Pass
## 3 28 Fema~ 39 GCU 16 B Pass
## 4 36 Fema~ 16 GCU 21 A Pass
## 5 1 Male 19 GCU 0 H Fail
## 6 6 Male 25 GCU 22 A Pass
## 7 9 Male 38 GCU 13 C Pass
## 8 11 Male 30 GCU 19 A Pass
## 9 16 Male 26 GCU 17 B Pass
## 10 17 Male 33 GCU 20 A Pass
## # ... with 30 more rows, and 2 more variables: Average_grade_points <dbl>,
## # grade_diff <dbl>
6.5.4 Question 4
Our colleague is interested in what the grade point average for each university would be if we were only looking at the students who passed the R assignment initially (i.e. grade points above 9). Store the output in your Global Environment
as a new object called fake_data_pass
.
fake_data_pass <- NULL
# filter Grade_points
fake_data_pass <- fake_data %>%
filter(Grade_points > 9) %>%
group_by(Institution) %>%
summarise(Average_grade_points_pass = mean(Grade_points)) %>%
ungroup()
# filter Pass_Fail
fake_data_pass <- fake_data %>%
filter(Pass_Fail == "Pass") %>%
group_by(Institution) %>%
summarise(Average_grade_points_pass = mean(Grade_points)) %>%
ungroup()
fake_data_pass
## # A tibble: 4 x 2
## Institution Average_grade_points_pass
## <chr> <dbl>
## 1 GCU 17.6
## 2 Strathclyde 16.8
## 3 UoG 18
## 4 UWS 16.5
6.5.5 Question 5
Now our colleague wants to know whether the males or the females did better at each university. They are asking for a table with 4 columns in which the first column titled Institution
lists the university, the second column is called Males
for the average grade point scores of the males at each institution, the third column called Females
with average grade point scores for the females, and the fourth column Better
should say "female"
or "male"
if the females or males had higher grade points respectively. Store the output in your Global Environment
as a new object called fake_data_sex
. Use fake_data
as the reference dataframe.
fake_data_sex <- NULL
fake_data_sex <- fake_data %>%
filter(Sex != "Other") %>% # this step could actually be left out
group_by(Institution, Sex) %>%
summarise(Average_gp = mean(Grade_points)) %>%
ungroup() %>%
spread(Sex, Average_gp) %>%
select(Institution, Males = Male, Females = Female) %>%
mutate(Better = ifelse(Males > Females, "male", "female"))
fake_data_sex
## # A tibble: 4 x 4
## Institution Males Females Better
## <chr> <dbl> <dbl> <chr>
## 1 GCU 15.2 16.8 female
## 2 Strathclyde 11.5 18.2 female
## 3 UoG 18.7 17.8 male
## 4 UWS 12.3 16.8 female
6.6 Summative Homework
The third summative assignment is available on moodle now.
Good luck.
Check that your Rmd file knits into a html file before submitting. Upload your Rmd file (not the knitted html) to moodle.