R tidyverse examples
Author: Gibran Hemani
In this talk I pointed to this video, which talked through a set of examples for tidyverse. Unfortunately that set of examples has now disappeared, so for the sake of preservation I'm posting another set of examples here.
First of all, load the required packages, all of which are available from CRAN.
library(tidyverse)
library(lubridate)
library(nycflights13)
1. Create a new column and count by it
flights %>%
mutate(long_flight = (air_time >= 6 * 60)) %>%
View()
The script above creates a new logical column, long_flight, and opens the result in the data viewer. The next step is to count the number of long flights.
flights %>%
mutate(long_flight = (air_time >= 6 * 60)) %>%
count(long_flight)
The two steps above can be combined into one, because count() can create the column itself.
flights %>%
count(long_flight = air_time >= 6 * 60)
Any derived column can be counted in the same way. Here is one example.
flights %>%
count(flight_path = str_c(origin, " -> ", dest), sort = TRUE)
2. Create a new column inside group_by()
group_by() can also create a column on the fly. The script below builds a date column and summarises the flights for each date.
flights %>%
group_by(date = make_date(year, month, day)) %>%
summarise(flights_n = n(), air_time_mean = mean(air_time, na.rm = TRUE)) %>%
ungroup()
3. Randomly sample rows
To draw a random sample of 15 rows from the data, use slice_sample() as below.
flights %>%
slice_sample(n = 15)
The prop argument samples a proportion of the rows instead of a fixed number.
flights %>%
slice_sample(prop = 0.15)
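Because slice_sample() draws rows at random, each run returns a different sample. The following is a minimal sketch, on a made-up tibble (toy is hypothetical and not part of nycflights13), of fixing the seed with base R's set.seed() so that a draw can be repeated.

```r
library(tidyverse)

# Hypothetical toy data: 100 rows with an id column
toy <- tibble(id = 1:100)

# Setting the same seed before each draw gives the same sample
set.seed(42)
first_draw <- toy %>% slice_sample(n = 15)

set.seed(42)
second_draw <- toy %>% slice_sample(n = 15)

# TRUE: both draws contain the same rows in the same order
identical(first_draw$id, second_draw$id)
```

The seed value 42 is arbitrary. Any fixed integer works, as long as it is set immediately before the sampling call.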
4. Date column creation
In the original data set, year, month and day are stored as separate columns. make_date() from lubridate combines them into a single date column.
flights %>%
select(year, month, day) %>%
mutate(date = make_date(year, month, day))
5. Number Parsing
To extract only the number from a string, use parse_number().
numbers_1 <- tibble(number = c("#1", "Number8", "How are you 3"))
numbers_1 %>% mutate(number = parse_number(number))
6. Select columns with starts_with() and ends_with()
You can select columns by name prefix, suffix or substring with starts_with(), ends_with() and contains(). Here are some examples.
flights %>%
select(starts_with("dep_"))
flights %>%
select(ends_with("hour"))
flights %>%
select(contains("hour"))
These helpers are among the most useful in day-to-day work.
7. case_when() to create a column when conditions are met
case_when() is a handy tool for setting a column's values according to a series of conditions.
flights %>%
mutate(origin = case_when(
(origin == "EWR") & dep_delay > 20 ~ "Newark International Airport - DELAYED",
(origin == "EWR") & dep_delay <= 20 ~ "Newark International Airport - ON TIME DEPARTURE",
)) %>%
count(origin)
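One thing to watch in the example above: rows that match none of the conditions (flights from other airports, or flights with a missing dep_delay) end up as NA. The following is a minimal sketch on made-up data (toy_flights and the shortened labels are hypothetical) that adds a catch-all branch to keep the original value instead.

```r
library(tidyverse)

# Hypothetical toy data standing in for flights
toy_flights <- tibble(
  origin    = c("EWR", "EWR", "JFK", "EWR"),
  dep_delay = c(35, 5, 50, NA)
)

labelled <- toy_flights %>%
  mutate(origin = case_when(
    origin == "EWR" & dep_delay > 20  ~ "Newark - DELAYED",
    origin == "EWR" & dep_delay <= 20 ~ "Newark - ON TIME",
    TRUE                              ~ origin  # fallback: keep the original value
  ))

labelled$origin
```

Conditions are checked in order, so the TRUE branch only applies to rows that the earlier conditions did not match. That includes the EWR row whose dep_delay is NA.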
8. str_replace_all() to find and replace multiple patterns at once
Most people know str_replace() from the stringr package. str_replace_all() also accepts a named vector of pattern = replacement pairs, so several replacements can be made in one call.
flights %>%
mutate(origin = str_replace_all(origin, c(
"^EWR$" = "Newark International", "^JFK$" = "John F. Kennedy International"
))) %>%
count(origin)
9. Filter groups without making a new column
Filtering is essential for cleaning and checking data sets. A grouped filter() can use n() directly, so there is no need to add a count column first.
flights_top_carriers <- flights %>%
group_by(carrier) %>%
filter(n() >= 10000) %>%
ungroup()
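To show what "without making a new column" saves, here is a small sketch on made-up data (toy, n_flights and the threshold of 2 are hypothetical). It compares the grouped filter with the longer route of adding a helper count column and dropping it afterwards.

```r
library(tidyverse)

# Hypothetical toy data: one row per flight
toy <- tibble(carrier = c("AA", "AA", "AA", "UA", "DL", "DL"))

# Longer route: add a helper count column, filter on it, then drop it
via_column <- toy %>%
  add_count(carrier, name = "n_flights") %>%
  filter(n_flights >= 2) %>%
  select(-n_flights)

# Shorter route: filter on the group size directly
via_group <- toy %>%
  group_by(carrier) %>%
  filter(n() >= 2) %>%
  ungroup()

# Both keep the AA and DL rows and drop the single UA row
identical(via_column$carrier, via_group$carrier)
```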
10. Extract rows from the first table which are matched in the second table
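First build the second table with str_detect(), then use semi_join() to keep only the flights whose carrier appears in it.
beginning_with_am <- airlines %>%
filter(name %>% str_detect("^Am"))
flights %>%
semi_join(beginning_with_am, by = "carrier")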
11. Extract rows from the first table which are not matched in the second table
In the same way, anti_join() drops the rows of the first table that have a match in the second table.
airways_beginning_with_a <- airlines %>%
filter(name %>% str_detect("^A"))
flights %>%
anti_join(airways_beginning_with_a, by = "carrier")
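The two filtering joins in dplyr are complementary: semi_join() keeps the rows that have a match and anti_join() keeps the rows that do not. A minimal sketch on made-up tables (toy_flights and toy_lookup are hypothetical):

```r
library(tidyverse)

# Hypothetical toy tables
toy_flights <- tibble(carrier = c("AA", "UA", "AS", "DL"), flight = 1:4)
toy_lookup  <- tibble(carrier = c("AA", "AS"))

# Rows of toy_flights with a matching carrier in toy_lookup
kept <- toy_flights %>% semi_join(toy_lookup, by = "carrier")

# Rows of toy_flights without a matching carrier in toy_lookup
dropped <- toy_flights %>% anti_join(toy_lookup, by = "carrier")

# Every row lands in exactly one of the two results
nrow(kept) + nrow(dropped) == nrow(toy_flights)
```

Neither join adds columns from the second table. They only filter the rows of the first.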
12. fct_reorder() to sort for charts creation
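Ordering the bars is a key step when creating charts. fct_reorder() from forcats reorders the levels of a factor by another variable, here the count n. The first chart below uses the default alphabetical order and the second uses fct_reorder().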
airline_names <- flights %>%
left_join(airlines, by = "carrier")
airline_names %>%
count(name) %>%
ggplot(aes(name, n)) +
geom_col()
airline_names %>%
count(name) %>%
mutate(name = fct_reorder(name, n)) %>%
ggplot(aes(name, n)) +
geom_col()
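The effect of fct_reorder() is easiest to see on the factor levels themselves, since ggplot2 draws the bars in level order. A minimal sketch on made-up counts (toy_counts and its values are hypothetical):

```r
library(tidyverse)

# Hypothetical toy counts
toy_counts <- tibble(
  name = c("Delta", "United", "Alaska"),
  n    = c(30, 10, 50)
)

# Default factor levels are alphabetical
default_levels <- levels(factor(toy_counts$name))

# fct_reorder() sorts the levels by ascending n
reordered <- fct_reorder(toy_counts$name, toy_counts$n)
reordered_levels <- levels(reordered)

default_levels    # "Alaska" "Delta"  "United"
reordered_levels  # "United" "Delta"  "Alaska"
```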
13. coord_flip() to display counts more clearly
coord_flip() swaps the x and y axes, so the airline names run down the side of the chart and stay readable.
airline_names %>%
count(name) %>%
mutate(name = fct_reorder(name, n)) %>%
ggplot(aes(name, n)) +
geom_col() +
coord_flip()
14. Generate all combinations using crossing
Like expand.grid() in base R, crossing() from tidyr creates all possible combinations of its inputs.
crossing(
customer_channel = c("Bus", "Car"),
customer_status = c("New", "Repeat"),
spend_range = c("$0-$10", "$10-$20", "$20-$50", "$50+"))
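Unlike expand.grid(), crossing() also de-duplicates and sorts its inputs before combining them. A minimal sketch with made-up values (the channel and status columns are hypothetical):

```r
library(tidyverse)

# "Bus" appears twice and the values are not sorted
combos <- crossing(
  channel = c("Car", "Bus", "Bus"),
  status  = c("New", "Repeat")
)

# 2 unique channels x 2 statuses = 4 rows, sorted by channel then status
combos
```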
15. Group by with a custom summary function
Write a summary function to suit your requirements and it will respect any grouping applied before it is called.
flights_with_airline_names <- flights %>%
left_join(airlines, by = 'carrier')
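# Named summarise_stats() rather than summary() so that base R's summary() is not masked
summarise_stats <- function(data, col_names, na.rm = TRUE) {
  data %>%
    summarise(across(
      {{ col_names }},
      # Lambdas pass na.rm explicitly; passing extra arguments
      # through across() is deprecated as of dplyr 1.1.0
      list(
        min = ~ min(.x, na.rm = na.rm),
        max = ~ max(.x, na.rm = na.rm),
        median = ~ median(.x, na.rm = na.rm),
        mean = ~ mean(.x, na.rm = na.rm)
      ),
      .names = "{col}_{fn}"
    ))
}
# Summarise the whole table
flights_with_airline_names %>%
  summarise_stats(c(air_time, arr_delay))
# Summarise per carrier
flights_with_airline_names %>%
  group_by(carrier) %>%
  summarise_stats(c(air_time, arr_delay))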