Chapter 5 Data Transformation 3: Tidy Data
Intended Learning Outcomes
The whole purpose of the next two lecture is to tidy up dataframe and expose you to a bunch of useful functions that would make your daily life easier when dealing with your own data. By the end of today, you will know
- How to rearrange data from long format into to wide format and vice versa
- How to combine and separate columns
This lesson is led by Gaby Mahrholz.
5.1 Pre-steps
All the functions for tidying data are within tidyverse
so we need to make sure to load it into the library. Today’s lecture will make use of built-in dataframes which means we do not have to load any data into our Global Environment
for the minute.
library(tidyverse)
5.2 Tidy data
“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham
In this chapter, you will learn a consistent way to organise your data in R. Hadley Wickham and Garrett Grolemund call this tidy data. In most cases, data does not come in a tidy format. Researchers spend most of their time getting the data into the shape it needs to be for a particular analysis. The tools we introduce you to this week and next should form the basics for data wrangling. The idea is that tidy datasets are easier to work with in the long run.
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
If you want to know more about tidy data, Hadley Wickham’s paper can be found on https://vita.had.co.nz/papers/tidy-data.pdf or check out chapter 12 in the book R for Data Science (https://r4ds.had.co.nz/tidy-data.html).
5.3 gather() and spread()
gather()
and spread()
are part of tidyr
(also part of the tidyverse package) and help you to rearrange the shape of your data. To get your data into long format, you would use gather()
; if you wanted to get your data from long to wide format, you would need spread()
. Here is a nice gif animation by Garrick Aden-Buie (https://www.garrickadenbuie.com/project/tidyexplain/#spread-and-gather) that shows you how gather()
and spread()
work.
5.3.1 gather()
We can start this lecture by looking at a practical example in the build-in dataset table1
to table5
. They show the number of Tuberculosis cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000 (the full dataset is called who
). The data is part of the tidyverse
package, so as long as we have tidyverse
loaded into the library, there should not be an issue in accessing this dataset.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table1
has a very tidy structure as it follows the three principles above - each variable is its own column, each observation is its own row, and each value is its own cell.
You can see the data contains values associated with four variables (country
, year
, cases
, and population
). table4a
however, looks a touch not right.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
Somehow the values for the columns year
and cases
are missing and instead we find two separate columns for 1999
and 2000
that have all cases
values listed. This is a problem as 1999
and 2000
are actual values and not variables. What we want to do, is gather 1999
and 2000
in one column year
and respective values into a second column called cases
.
Currently the data is in what is called a wide format
. The function gather()
comes in handy when we want to get the data into the shape of a long format
like in table1
. gather()
is structured like:
Let’s apply this formula to table4a
- We need a dataframe that we want to modify ->
table4a
- We need to come up with the name of a new column in which the values of 1999 and 2000 will be stored ->
year
- We need to come up with the name of a new column in which all the case values will be stored ->
cases
- And finally, we need to tell R which columns in
table4a
we want to turn into long format (i.e. the columns of the 7 subscales) -> 1999 and 2000 (Note that “1999” and “2000” are non-syntactic names as they don’t start with a letter. The solution is to surround them in backticks.)
gather(table4a, key = year, value = cases, `1999`:`2000`)
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Here we have only 2 columns to reshape 1999
and 2000
. We could have written gather(table4a, year, cases, `1999`, `2000`)
.
Your turn
Look at table4b
. There are two separate columns 1999
and 2000
with population values for Afghanistan, Brazil, and China. Use the gather()
function on table4b
to reshape it from wide format into long format.
gather(table4b, key = year, value = population, `1999`:`2000`)
## # A tibble: 6 x 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
5.3.2 spread()
Now onto the reverse. spread()
will spread the data from long
into wide format
. When you have observations that are scattered across multiple rows, you can use spread()
to split those rows up into separate variables. Let’s take a look at table2
.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2
has the variables cases
and population
stored in one column type
and their respective values in count
. However, cases
and population
are separate variables rather than values of type
. So best to get them spread out into their own columns as shown in the image below.
Take a look how spread()
achieves that.
Let’s apply the formula above to table2
:
- We need a dataframe that we want to modify ->
table2
- We need to decide which of the columns in
table2
will become the new column headers ->type
- We need to decide which of the columns in
table2
has all the data in it that will become the new cell values ->count
spread(table2, key = type, value = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Voila, cases
and population
are in their own columns.
Let me modify table1
a tiny bit.
table1_mod <- table1 %>%
mutate(percent = cases/ population * 100) %>%
gather(variable, values, cases:percent)
country | year | variable | values |
---|---|---|---|
Afghanistan | 1999 | cases | 7.450000e+02 |
Afghanistan | 2000 | cases | 2.666000e+03 |
Brazil | 1999 | cases | 3.773700e+04 |
Brazil | 2000 | cases | 8.048800e+04 |
China | 1999 | cases | 2.122580e+05 |
China | 2000 | cases | 2.137660e+05 |
Afghanistan | 1999 | population | 1.998707e+07 |
Afghanistan | 2000 | population | 2.059536e+07 |
Brazil | 1999 | population | 1.720064e+08 |
Brazil | 2000 | population | 1.745049e+08 |
China | 1999 | population | 1.272915e+09 |
China | 2000 | population | 1.280429e+09 |
Afghanistan | 1999 | percent | 3.727400e-03 |
Afghanistan | 2000 | percent | 1.294470e-02 |
Brazil | 1999 | percent | 2.193930e-02 |
Brazil | 2000 | percent | 4.612360e-02 |
China | 1999 | percent | 1.667500e-02 |
China | 2000 | percent | 1.669490e-02 |
Your turn
In the modified table above table1_mod
, we added a percent
column to see how much percentage of the population had Tuberculosis in 1999 and 2000 in each of the countries Afghanistan, Brazil, and China. But somehow, all numerical information has ended up in 1 column (values).
Your task is to spread the data back out, so that cases, population, and percentage values are in their own columns, titled as cases
, population
, and percent
respectively. Store the new object in the Global Environment
as table1_mod_corrected
. When you are done, table1_mod_corrected
should look like this…
country | year | cases | percent | population |
---|---|---|---|---|
Afghanistan | 1999 | 745 | 0.0037274 | 19987071 |
Afghanistan | 2000 | 2666 | 0.0129447 | 20595360 |
Brazil | 1999 | 37737 | 0.0219393 | 172006362 |
Brazil | 2000 | 80488 | 0.0461236 | 174504898 |
China | 1999 | 212258 | 0.0166750 | 1272915272 |
China | 2000 | 213766 | 0.0166949 | 1280428583 |
table1_mod_corrected <- spread(table1_mod, key = variable, value = values)
5.4 separate() and unite()
separate()
turns a single character column into multiple columns, whereas unite()
combines the values of multiple columns into one column.
5.4.1 separate()
Take a look at table3
.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
In table3
, two out of three principles of tidy data are violated. There are multiple values stored in column rate
(rule 3) but these multiple values also belong to two different variables - cases
and population
respectively (rule 1).
Here, we would use separate()
to split the column rate
into two columns cases
and population
.
The function separate()
is structured like this:
We would need
- the data in which the column to separate is located ->
table3
, - the actual column we want to separate ->
rate
, - the into argument asking into how many columns we want to separate the old column and what we want to call these new columns ->
c("cases", "population")
, and - a separator argument ->
"/"
.
separate(table3, rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
R has put all cell information before the /
into a column we named cases
, and everything after the /
into the column we decided to call population
. /
will not appear in either column cases
or population
. The data is now in a tidy format.
By the way, the separator argument sep =
can be any non-alphanumeric character, for example:
- underscore (“_“),
- comma (“,”),
- semi-colon (“;”),
- colon (“:”),
- forward-slash (“/”),
- space (" “) or
- nothing (“”).
But watch out! When the separator is a full stop
, you would need to add a double backward slash in front of it (sep = “\\.”
).
5.4.2 additional arguments for separate()
5.4.2.1 Retaining the original column
If for some reason, you wanted to keep the original column, you can set an additional argument remove
to FALSE
.
separate(table3, rate, into = c("cases", "population"), sep = "/", remove = FALSE)
## # A tibble: 6 x 5
## country year rate cases population
## <chr> <int> <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071 745 19987071
## 2 Afghanistan 2000 2666/20595360 2666 20595360
## 3 Brazil 1999 37737/172006362 37737 172006362
## 4 Brazil 2000 80488/174504898 80488 174504898
## 5 China 1999 212258/1272915272 212258 1272915272
## 6 China 2000 213766/1280428583 213766 1280428583
5.4.2.2 Dropping parts of the original column
What if we were just interested retaining part of the data in the cell, for example keeping the column cases
but not population
. The solution would be to work with an NA
argument. Defining the new columns as cases
and NA
will keep everything before the separator in the column cases
, and drops everything after the separator.
separate(table3, rate, into = c("cases", NA), sep = "/")
## # A tibble: 6 x 3
## country year cases
## <chr> <int> <chr>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
Your turn
Separate the column rate
in table3
to keep only the values after the separator in a column called population
. Drop the values in front of the separator. Retain the column rate
as well.
separate(table3, rate, into = c(NA, "population"), sep = "/", remove = FALSE)
## # A tibble: 6 x 4
## country year rate population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745/19987071 19987071
## 2 Afghanistan 2000 2666/20595360 20595360
## 3 Brazil 1999 37737/172006362 172006362
## 4 Brazil 2000 80488/174504898 174504898
## 5 China 1999 212258/1272915272 1272915272
## 6 China 2000 213766/1280428583 1280428583
5.4.2.3 Separating by position
It is also possible to separate by position. sep = 2
would split the column between the second and third character/ number etc.
Let’s try that the separation by position. We will keep 4 letters from each country
name rather than the whole word, and drop the rest.
separate(table3, country, into = c("country", NA), sep = 4)
## # A tibble: 6 x 3
## country year rate
## <chr> <int> <chr>
## 1 Afgh 1999 745/19987071
## 2 Afgh 2000 2666/20595360
## 3 Braz 1999 37737/172006362
## 4 Braz 2000 80488/174504898
## 5 Chin 1999 212258/1272915272
## 6 Chin 2000 213766/1280428583
5.4.2.4 Separating into more than 2 columns
Special case. Let’s modify table3
a little bit by adding a new column with a value of 200 and unite that column with column rate
.
table3_mod <- table3 %>%
mutate(new = 200) %>%
unite(rate, rate, new, sep = "/")
Now column rate
has 3 sets of numbers, separated by two /
. If we define we want to split the data into 2 columns, the first 2 sets of numbers are kept, and the third one is dropped with a warning. The code is still executed but the warning indicates that it might not have gone as planned. Hence, it is always good to check that the data object in the Global Environment
looks exactly as what you have intended.
separate(table3_mod, rate, into = c("cases", "population"), sep = "/")
## Warning: Expected 2 pieces. Additional pieces discarded in 6 rows [1, 2, 3, 4,
## 5, 6].
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
If we are separating the columns and wanted to keep all sets of numbers, one in each column, we would use an into argument that includes 3 new columns.
separate(table3_mod, rate, into = c("cases", "population", "200"), sep = "/")
## # A tibble: 6 x 5
## country year cases population `200`
## <chr> <int> <chr> <chr> <chr>
## 1 Afghanistan 1999 745 19987071 200
## 2 Afghanistan 2000 2666 20595360 200
## 3 Brazil 1999 37737 172006362 200
## 4 Brazil 2000 80488 174504898 200
## 5 China 1999 212258 1272915272 200
## 6 China 2000 213766 1280428583 200
Technically, you would not need a separator argument in the case above, as separate()
splits the data whenever it detects a non-alphanumeric character (i.e. a character that isn’t a number or letter). Saying that, the separator argument is needed when your column contains non-alphanumeric characters as part of your cell values.
If the values in population
and 200
(separated by a /
) were some sort of ID_code
, AND we wanted to split the data into 2 columns, after the first /
only, we would have to include an extra
argument, and set it to "merge"
.
separate(table3_mod, rate, into = c("cases", "ID_code"), sep = "/", extra = "merge")
## # A tibble: 6 x 4
## country year cases ID_code
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071/200
## 2 Afghanistan 2000 2666 20595360/200
## 3 Brazil 1999 37737 172006362/200
## 4 Brazil 2000 80488 174504898/200
## 5 China 1999 212258 1272915272/200
## 6 China 2000 213766 1280428583/200
There is still a tiny bit off in the tb_cases
. Can you spot what it is? * Hint: Think about if you wanted to sum up all the TB cases occurred in the 3 countries within the years 1999, and 2000.
The new columns ID_code
and cases
are character
columns. Whilst this is fine for ID_code
, it is less beneficial for cases.
If we wanted to sum up the cases column, values would need to be numerical
. We could use the converting function as.integer()
.
table3_mod %>%
separate(rate, into = c("cases", "ID_code"), sep = "/", extra = "merge") %>%
mutate(cases = as.integer(cases))
## # A tibble: 6 x 4
## country year cases ID_code
## <chr> <int> <int> <chr>
## 1 Afghanistan 1999 745 19987071/200
## 2 Afghanistan 2000 2666 20595360/200
## 3 Brazil 1999 37737 172006362/200
## 4 Brazil 2000 80488 174504898/200
## 5 China 1999 212258 1272915272/200
## 6 China 2000 213766 1280428583/200
5.4.2.5 Converting data types
The slightly quicker and perhaps more elegant solution would be to add another argument convert
to separate()
and set it to TRUE
. This will convert cases
to integer
and leave ID_code
as character
due to the non-numerical /
.
table3_mod %>%
separate(rate, into = c("cases", "ID_code"), sep = "/", extra = "merge", convert = TRUE)
## # A tibble: 6 x 4
## country year cases ID_code
## <chr> <int> <int> <chr>
## 1 Afghanistan 1999 745 19987071/200
## 2 Afghanistan 2000 2666 20595360/200
## 3 Brazil 1999 37737 172006362/200
## 4 Brazil 2000 80488 174504898/200
## 5 China 1999 212258 1272915272/200
## 6 China 2000 213766 1280428583/200
5.4.3 unite()
Take a look at table5
.
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
Here the values in column year
are split into century
and year
which is not very useful. Best to merge the 2 columns together as shown in the image below.
unite()
(the inverse of separate) can help to combine the values of century
and year
into a single column. It is used less frequently than it’s counterpart, but is still a useful function to know.
The function unite()
is structured like this:
We would need the data table5
, a new column we want to create year
, the two columns we would like to combine (century
and year
), and define the separator as “nothing” since the default is sep = “_“.
unite(table5, year, century, year, sep = "")
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Yay! The data of table5
is in a tidier format now - at least for column year
. Ah, but we didn’t save the changes as an object in our Global Environment
.
Your turn
Let’s combine the two steps of tidying table5
in a single pipe %>%
- separate the rate
column into cases
and population
, and unite the columns year
and century
into a new column year
. Store the new object as tb_cases
in the Global Environment
. Make sure, the columns cases
and population
are integer
values.
tb_cases <- table5 %>%
separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE) %>%
unite(year, century, year, sep = "")
tb_cases
## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
The year column is still a character
column. In this case, it does not matter too much, as 1999 and 2000 are levels of the column year
and very unlikely to be involved in any calculations. In case we are dealing with actual numerical value and have the need to convert the data type, we need to use the function as.integer()
as unite does not have a convert
argument available.
tb_cases_converted <- tb_cases %>%
mutate(year = as.integer(year))
tb_cases_converted
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
5.4.4 additional arguments for unite()
5.4.4.1 Retaining the original column
Similarly to separate()
, unite also comes with a remove statement, so if you wanted to keep the original columns before merging them together, you would code
unite(table5, year_combined, century, year, sep = "", remove = FALSE)
## # A tibble: 6 x 5
## country year_combined century year rate
## <chr> <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 19 99 745/19987071
## 2 Afghanistan 2000 20 00 2666/20595360
## 3 Brazil 1999 19 99 37737/172006362
## 4 Brazil 2000 20 00 80488/174504898
## 5 China 1999 19 99 212258/1272915272
## 6 China 2000 20 00 213766/1280428583
5.4.4.2 Removing missing values before uniting columns
unite()
also comes with an na.rm
statement that will remove missing values prior to uniting the values of a column (actually, the values have to be a “character” column). Let me modify table5
a wee bit to make that point.
table5_mod <- table5 %>%
mutate(new = rep(c("200", NA), 3))
table5_mod
## # A tibble: 6 x 5
## country century year rate new
## <chr> <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071 200
## 2 Afghanistan 20 00 2666/20595360 <NA>
## 3 Brazil 19 99 37737/172006362 200
## 4 Brazil 20 00 80488/174504898 <NA>
## 5 China 19 99 212258/1272915272 200
## 6 China 20 00 213766/1280428583 <NA>
If we wanted to combine the values of rate
and new
, we would code
unite(table5_mod, rate, rate, new, sep = "/")
## # A tibble: 6 x 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071/200
## 2 Afghanistan 20 00 2666/20595360/NA
## 3 Brazil 19 99 37737/172006362/200
## 4 Brazil 20 00 80488/174504898/NA
## 5 China 19 99 212258/1272915272/200
## 6 China 20 00 213766/1280428583/NA
See how R adds the missing values onto rate, which we do not want. To avoid that happening, we can use the na.rm
statement which removes missing values in columns of data type character before merging the columns.
unite(table5_mod, rate, rate, new, sep = "/", na.rm = TRUE)
## # A tibble: 6 x 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071/200
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362/200
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272/200
## 6 China 20 00 213766/1280428583
5.5 Bonus materials: mutate row names from the mtcars
Remember, in lecture 2 when we were talking about the mtcars dataset and how the car types were not in listed as a separate column but rather row names? Let’s just read in the data and have another quick look to jog our memories…
df_mtcars <- mtcars
head(df_mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
You can see much better what I’m talking about when you view your df_mtcars
.
Now, that we have learnt about the mutate()
function, we can attempt to get all of these row names into a new column. All we have to do is combining a function called rownames()
with mutate()
. We also could use some more of these handy pipes %>%
you learnt about last week.
df_mtcars <- mtcars %>%
mutate(Car_type = rownames(mtcars))
In this example, we have taken the dataset mtcars
, then added a new column called Car_type
with mutate()
. The column Car_type
should then hold all the values that were previously listed as the row names in the original mtcars
data.
head(df_mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb Car_type
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant
We can now look at the data df_mtcars
, and see that the new column was added at the end of the dataset. What would we have to do to make Car_type
the first column in the dataframe?
# We could use the fuction select()
df_mtcars <- mtcars %>%
mutate(Car_type = rownames(mtcars)) %>%
select(Car_type, mpg:carb)
head(df_mtcars)
## Car_type mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
5.6 Additional information
More information on stuff we covered today and beyond can be found here:
- https://r4ds.had.co.nz “R 4 Data Science” by Garrett Grolemund and Hadley Wickham is the official guidebook for R
- https://suzan.rbind.io/categories/tutorial/ She has some pretty cool tutorials on data wrangling
5.7 Formative Homework
The formative assessment can now be downloaded from moodle. There is no need to submit your answers to moodle.