Introduction
In this series, we are going to talk about data manipulation in R. The data we obtain or collect from various sources is usually unstructured or comprises information that is futile for our purpose. Therefore, spending time on the entire dataset is needless and inessential.
So, Data Manipulation is the key. We transform our data according to our needs which is significant for robust analysis. We manipulate data for further easy processing.
R has many built-in libraries and functions for our desired task.
In this episode, we start with the widely used tidyverse which is an opinionated collection of R packages designed for data science.
tidyverse
The tidyverse is an umbrella term comprising many packages for data interaction. Let’s learn its installation process which similar to any other installation.
Also load it into your R session.
#install.packages("tidyverse")
library(tidyverse)
There are four main packages for Data wrangling and Transformation: * dplyr * tidyr * stringr * forcats
We are going to learn about the dplyr package in depth in this episode.
dplyr
dplyr is a grammer of data manipulation providing a consistent set of verbs that solve the most common data manipulation challenges.
#install.packages("dplyr")
library(dplyr)
Single Table Verbs This tutorial guides you through dplyr’s basic set of tools.
Choosing dataset
Using data()
to reveal in-built datasets.
In this tutorial, we are going to use the Titanic
dataset.
#data()
#Loading the dataset
data("Titanic")
Viewing the description of the dataset
str(Titanic)
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
Creating a local dataset for easier working
df_titanic <- as.data.frame(Titanic)
Observations x Features
dim(df_titanic)
## [1] 32 5
The pipe
Function - a better alternative
All the functions take the dataset as the first argument. Using the %>%
operator that is the pipe
operator makes our working easier. For instance, an argument or the dataset x
in some function(x,y)
becomes x %>% function(y)
which means that results from previous steps can be piped into another or the next steps.
Single Table Verbs
- Functions manipulating the dataset with rows
1. filter()
: subsetting our observations/rows based on filtered colums/features
How many children survived in the tragedy?
df_titanic %>% filter(Age == "Child")
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Child Yes 5
## 10 2nd Male Child Yes 11
## 11 3rd Male Child Yes 13
## 12 Crew Male Child Yes 0
## 13 1st Female Child Yes 1
## 14 2nd Female Child Yes 13
## 15 3rd Female Child Yes 14
## 16 Crew Female Child Yes 0
How many adult females from the 3rd class survived the tragedy?
df_titanic %>% filter(Sex=="Female",Age == "Adult",Class=="3rd")
## Class Sex Age Survived Freq
## 1 3rd Female Adult No 89
## 2 3rd Female Adult Yes 76
2. arrange()
: subsetting our observations/rows based on arranging colums/features in unique or different order
Arranging rows according to the increasing order of number of people who survived:
df_titanic %>% arrange(Freq)
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 Crew Male Child No 0
## 4 1st Female Child No 0
## 5 2nd Female Child No 0
## 6 Crew Female Child No 0
## 7 Crew Male Child Yes 0
## 8 Crew Female Child Yes 0
## 9 1st Female Child Yes 1
## 10 Crew Female Adult No 3
## 11 1st Female Adult No 4
## 12 1st Male Child Yes 5
## 13 2nd Male Child Yes 11
## 14 2nd Female Adult No 13
## 15 3rd Male Child Yes 13
## 16 2nd Female Child Yes 13
## 17 3rd Female Child Yes 14
## 18 2nd Male Adult Yes 14
## 19 3rd Female Child No 17
## 20 Crew Female Adult Yes 20
## 21 3rd Male Child No 35
## 22 1st Male Adult Yes 57
## 23 3rd Male Adult Yes 75
## 24 3rd Female Adult Yes 76
## 25 2nd Female Adult Yes 80
## 26 3rd Female Adult No 89
## 27 1st Male Adult No 118
## 28 1st Female Adult Yes 140
## 29 2nd Male Adult No 154
## 30 Crew Male Adult Yes 192
## 31 3rd Male Adult No 387
## 32 Crew Male Adult No 670
#in the descending order
df_titanic %>% arrange(desc(Freq))
## Class Sex Age Survived Freq
## 1 Crew Male Adult No 670
## 2 3rd Male Adult No 387
## 3 Crew Male Adult Yes 192
## 4 2nd Male Adult No 154
## 5 1st Female Adult Yes 140
## 6 1st Male Adult No 118
## 7 3rd Female Adult No 89
## 8 2nd Female Adult Yes 80
## 9 3rd Female Adult Yes 76
## 10 3rd Male Adult Yes 75
## 11 1st Male Adult Yes 57
## 12 3rd Male Child No 35
## 13 Crew Female Adult Yes 20
## 14 3rd Female Child No 17
## 15 3rd Female Child Yes 14
## 16 2nd Male Adult Yes 14
## 17 2nd Female Adult No 13
## 18 3rd Male Child Yes 13
## 19 2nd Female Child Yes 13
## 20 2nd Male Child Yes 11
## 21 1st Male Child Yes 5
## 22 1st Female Adult No 4
## 23 Crew Female Adult No 3
## 24 1st Female Child Yes 1
## 25 1st Male Child No 0
## 26 2nd Male Child No 0
## 27 Crew Male Child No 0
## 28 1st Female Child No 0
## 29 2nd Female Child No 0
## 30 Crew Female Child No 0
## 31 Crew Male Child Yes 0
## 32 Crew Female Child Yes 0
3. slice()
: indexing rows/observations using their integer locations
#rows between 6 to 27
df_titanic %>% slice(6:27)
## Class Sex Age Survived Freq
## 1 2nd Female Child No 0
## 2 3rd Female Child No 17
## 3 Crew Female Child No 0
## 4 1st Male Adult No 118
## 5 2nd Male Adult No 154
## 6 3rd Male Adult No 387
## 7 Crew Male Adult No 670
## 8 1st Female Adult No 4
## 9 2nd Female Adult No 13
## 10 3rd Female Adult No 89
## 11 Crew Female Adult No 3
## 12 1st Male Child Yes 5
## 13 2nd Male Child Yes 11
## 14 3rd Male Child Yes 13
## 15 Crew Male Child Yes 0
## 16 1st Female Child Yes 1
## 17 2nd Female Child Yes 13
## 18 3rd Female Child Yes 14
## 19 Crew Female Child Yes 0
## 20 1st Male Adult Yes 57
## 21 2nd Male Adult Yes 14
## 22 3rd Male Adult Yes 75
#first 3 rows
df_titanic %>% slice_head(n=3)
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
#for selecting observation at random
df_titanic %>% slice_sample(n=6)
## Class Sex Age Survived Freq
## 1 2nd Male Child No 0
## 2 3rd Female Adult Yes 76
## 3 2nd Female Adult No 13
## 4 2nd Female Adult Yes 80
## 5 1st Male Child No 0
## 6 1st Male Adult Yes 57
#rows with top five values of Freq
df_titanic %>% slice_max(Freq,n=5)
## Class Sex Age Survived Freq
## 1 Crew Male Adult No 670
## 2 3rd Male Adult No 387
## 3 Crew Male Adult Yes 192
## 4 2nd Male Adult No 154
## 5 1st Female Adult Yes 140
- Functions manipulating the dataset with columns
1. select()
: selecting or choosing a few columns while discarding others that are of no use thereby reducing our valuable time
How many males and females survived the tragedy?
#shows only two columns
df_titanic %>% select(Sex,Survived)
## Sex Survived
## 1 Male No
## 2 Male No
## 3 Male No
## 4 Male No
## 5 Female No
## 6 Female No
## 7 Female No
## 8 Female No
## 9 Male No
## 10 Male No
## 11 Male No
## 12 Male No
## 13 Female No
## 14 Female No
## 15 Female No
## 16 Female No
## 17 Male Yes
## 18 Male Yes
## 19 Male Yes
## 20 Male Yes
## 21 Female Yes
## 22 Female Yes
## 23 Female Yes
## 24 Female Yes
## 25 Male Yes
## 26 Male Yes
## 27 Male Yes
## 28 Male Yes
## 29 Female Yes
## 30 Female Yes
## 31 Female Yes
## 32 Female Yes
But,
#shows all the columns from 'sex' through to 'survived'(inclusive)
df_titanic %>% select(Sex:Survived)
## Sex Age Survived
## 1 Male Child No
## 2 Male Child No
## 3 Male Child No
## 4 Male Child No
## 5 Female Child No
## 6 Female Child No
## 7 Female Child No
## 8 Female Child No
## 9 Male Adult No
## 10 Male Adult No
## 11 Male Adult No
## 12 Male Adult No
## 13 Female Adult No
## 14 Female Adult No
## 15 Female Adult No
## 16 Female Adult No
## 17 Male Child Yes
## 18 Male Child Yes
## 19 Male Child Yes
## 20 Male Child Yes
## 21 Female Child Yes
## 22 Female Child Yes
## 23 Female Child Yes
## 24 Female Child Yes
## 25 Male Adult Yes
## 26 Male Adult Yes
## 27 Male Adult Yes
## 28 Male Adult Yes
## 29 Female Adult Yes
## 30 Female Adult Yes
## 31 Female Adult Yes
## 32 Female Adult Yes
And this code below deselects the mentioned columns as arguments:
df_titanic %>% select(-c(Sex,Age))
## Class Survived Freq
## 1 1st No 0
## 2 2nd No 0
## 3 3rd No 35
## 4 Crew No 0
## 5 1st No 0
## 6 2nd No 0
## 7 3rd No 17
## 8 Crew No 0
## 9 1st No 118
## 10 2nd No 154
## 11 3rd No 387
## 12 Crew No 670
## 13 1st No 4
## 14 2nd No 13
## 15 3rd No 89
## 16 Crew No 3
## 17 1st Yes 5
## 18 2nd Yes 11
## 19 3rd Yes 13
## 20 Crew Yes 0
## 21 1st Yes 1
## 22 2nd Yes 13
## 23 3rd Yes 14
## 24 Crew Yes 0
## 25 1st Yes 57
## 26 2nd Yes 14
## 27 3rd Yes 75
## 28 Crew Yes 192
## 29 1st Yes 140
## 30 2nd Yes 80
## 31 3rd Yes 76
## 32 Crew Yes 20
Let us learn some of the helper functions
that are used alomg with the select()
function:
- starts_with()
- ends_with()
- matches()
- contains()
For instance:
df_titanic %>% select(starts_with("S"))
## Sex Survived
## 1 Male No
## 2 Male No
## 3 Male No
## 4 Male No
## 5 Female No
## 6 Female No
## 7 Female No
## 8 Female No
## 9 Male No
## 10 Male No
## 11 Male No
## 12 Male No
## 13 Female No
## 14 Female No
## 15 Female No
## 16 Female No
## 17 Male Yes
## 18 Male Yes
## 19 Male Yes
## 20 Male Yes
## 21 Female Yes
## 22 Female Yes
## 23 Female Yes
## 24 Female Yes
## 25 Male Yes
## 26 Male Yes
## 27 Male Yes
## 28 Male Yes
## 29 Female Yes
## 30 Female Yes
## 31 Female Yes
## 32 Female Yes
2. rename()
: Using this function is recommended in place of select()
###rename
df_titanic %>% rename(Cl_ass = Class)
## Cl_ass Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Adult No 118
## 10 2nd Male Adult No 154
## 11 3rd Male Adult No 387
## 12 Crew Male Adult No 670
## 13 1st Female Adult No 4
## 14 2nd Female Adult No 13
## 15 3rd Female Adult No 89
## 16 Crew Female Adult No 3
## 17 1st Male Child Yes 5
## 18 2nd Male Child Yes 11
## 19 3rd Male Child Yes 13
## 20 Crew Male Child Yes 0
## 21 1st Female Child Yes 1
## 22 2nd Female Child Yes 13
## 23 3rd Female Child Yes 14
## 24 Crew Female Child Yes 0
## 25 1st Male Adult Yes 57
## 26 2nd Male Adult Yes 14
## 27 3rd Male Adult Yes 75
## 28 Crew Male Adult Yes 192
## 29 1st Female Adult Yes 140
## 30 2nd Female Adult Yes 80
## 31 3rd Female Adult Yes 76
## 32 Crew Female Adult Yes 20
###select
df_titanic %>% select(S_ex = Sex)
## S_ex
## 1 Male
## 2 Male
## 3 Male
## 4 Male
## 5 Female
## 6 Female
## 7 Female
## 8 Female
## 9 Male
## 10 Male
## 11 Male
## 12 Male
## 13 Female
## 14 Female
## 15 Female
## 16 Female
## 17 Male
## 18 Male
## 19 Male
## 20 Male
## 21 Female
## 22 Female
## 23 Female
## 24 Female
## 25 Male
## 26 Male
## 27 Male
## 28 Male
## 29 Female
## 30 Female
## 31 Female
## 32 Female
Note: Return back to original data frame.
df_titanic <- as.data.frame(Titanic)
2. mutate()
: It is basically used to add new columns which are created using the existing columns as the new columns that are to be added i.e mutated are functions of the existng columns
Suppose there were about a thousand people on board, then what percenatge of people survived?
#Assume total = 1000 thus 100/1000 for percentage i.e 0.1
df_titanic %>% mutate(Freq_per = Freq * 0.1)
## Class Sex Age Survived Freq Freq_per
## 1 1st Male Child No 0 0.0
## 2 2nd Male Child No 0 0.0
## 3 3rd Male Child No 35 3.5
## 4 Crew Male Child No 0 0.0
## 5 1st Female Child No 0 0.0
## 6 2nd Female Child No 0 0.0
## 7 3rd Female Child No 17 1.7
## 8 Crew Female Child No 0 0.0
## 9 1st Male Adult No 118 11.8
## 10 2nd Male Adult No 154 15.4
## 11 3rd Male Adult No 387 38.7
## 12 Crew Male Adult No 670 67.0
## 13 1st Female Adult No 4 0.4
## 14 2nd Female Adult No 13 1.3
## 15 3rd Female Adult No 89 8.9
## 16 Crew Female Adult No 3 0.3
## 17 1st Male Child Yes 5 0.5
## 18 2nd Male Child Yes 11 1.1
## 19 3rd Male Child Yes 13 1.3
## 20 Crew Male Child Yes 0 0.0
## 21 1st Female Child Yes 1 0.1
## 22 2nd Female Child Yes 13 1.3
## 23 3rd Female Child Yes 14 1.4
## 24 Crew Female Child Yes 0 0.0
## 25 1st Male Adult Yes 57 5.7
## 26 2nd Male Adult Yes 14 1.4
## 27 3rd Male Adult Yes 75 7.5
## 28 Crew Male Adult Yes 192 19.2
## 29 1st Female Adult Yes 140 14.0
## 30 2nd Female Adult Yes 80 8.0
## 31 3rd Female Adult Yes 76 7.6
## 32 Crew Female Adult Yes 20 2.0
3. relocate()
: basically relocating or moving columns
In the code below, the columns between ‘sex’ and ‘Age’ (both inclusive) are placed before the column ‘Class’ using the .before
attribute.
df_titanic %>% relocate(Sex:Age, .before = Class)
## Sex Age Class Survived Freq
## 1 Male Child 1st No 0
## 2 Male Child 2nd No 0
## 3 Male Child 3rd No 35
## 4 Male Child Crew No 0
## 5 Female Child 1st No 0
## 6 Female Child 2nd No 0
## 7 Female Child 3rd No 17
## 8 Female Child Crew No 0
## 9 Male Adult 1st No 118
## 10 Male Adult 2nd No 154
## 11 Male Adult 3rd No 387
## 12 Male Adult Crew No 670
## 13 Female Adult 1st No 4
## 14 Female Adult 2nd No 13
## 15 Female Adult 3rd No 89
## 16 Female Adult Crew No 3
## 17 Male Child 1st Yes 5
## 18 Male Child 2nd Yes 11
## 19 Male Child 3rd Yes 13
## 20 Male Child Crew Yes 0
## 21 Female Child 1st Yes 1
## 22 Female Child 2nd Yes 13
## 23 Female Child 3rd Yes 14
## 24 Female Child Crew Yes 0
## 25 Male Adult 1st Yes 57
## 26 Male Adult 2nd Yes 14
## 27 Male Adult 3rd Yes 75
## 28 Male Adult Crew Yes 192
## 29 Female Adult 1st Yes 140
## 30 Female Adult 2nd Yes 80
## 31 Female Adult 3rd Yes 76
## 32 Female Adult Crew Yes 20
- Functions manipulating the dataset with group of data
1. summarise()
: collapses a data frame to a single row
Using mean of the Freq
column and removing any NA values that exist
df_titanic %>% summarise(Freq = mean(Freq, na.rm = T))
## Freq
## 1 68.78125
Note: The similarities in all of the above functions is their syntax like the dataset as their first argument which can be piped as well and the next set of arguments as to what action has to be performed on the dataset, etc.
So, this is it. We chose a simple, small dataset to start with and learnt how to manipulate data using a few simple functions.
You can chose any other dataset from the in-built ones or from many websites offering datasets for free. Here are some of the sites that I recommend: Gapminder Data.gov Data.world
Stay tuned for more tutorials!
Thank You!