dplyr and ggplot2 Basics

Apr 05, 2024

Pipe Operators

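The pipe operator (%>%) passes the result of one step to the next step as its first argument. For example, the following code filters the original dataset and then arranges the filtered result, in that order. Without the pipe, each verb would need the dataset passed to it explicitly, and a bare call such as filter() would fail with an error.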

dataset %>%
     filter() %>%
     arrange()
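
As a minimal sketch of what the pipe does, the example below runs the same two steps twice, once piped and once as nested function calls. The scores data frame and its columns are hypothetical and exist only for illustration.

```r
library(dplyr)

# Hypothetical example data (not from a real dataset)
scores <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Dee"),
  score = c(72, 91, 85, 60)
)

# Piped version: reads top to bottom, one step per line
piped <- scores %>%
  filter(score > 70) %>%
  arrange(desc(score))

# Equivalent nested version: reads from the inside out
nested <- arrange(filter(scores, score > 70), desc(score))

print(piped)
```

Both versions should return the same three rows; the piped form is usually easier to read once there are more than two steps.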

Filtering a Dataset

Use the filter verb to keep only the rows that meet a condition on a variable

dataset %>%
     filter(variable > x)

The comparison can use >, <, >=, <=, ==, or !=. Note that if you're filtering for a string (text), the value needs to be in quotation marks.

Sorting Data

You can sort data based on a column using the arrange verb. By default it will arrange data in ascending order, but by adding desc() you can have it sort in descending order.

dataset %>%
    arrange(desc(variable))
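
arrange() also accepts several columns, in which case later columns break ties in earlier ones. A small sketch, assuming a made-up staff data frame:

```r
library(dplyr)

# Hypothetical example data
staff <- data.frame(
  dept   = c("ops", "dev", "ops", "dev"),
  salary = c(50, 70, 65, 60)
)

# Sort by department (ascending), then by salary (descending) within each department
sorted_staff <- staff %>%
  arrange(dept, desc(salary))

print(sorted_staff)
```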

Calculate a New Column

Use the mutate verb to calculate a new field from existing data. You can use arithmetic operators such as +, -, *, /, ^, and %% (remainder), as well as functions such as log().

dataset %>%
     mutate(new_variable = variable_1 + variable_2)
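
The sketch below shows a few of these operators inside a single mutate() call. The orders data frame and the new column names are hypothetical.

```r
library(dplyr)

# Hypothetical example data
orders <- data.frame(
  price    = c(10, 25, 40),
  quantity = c(3, 2, 5)
)

orders_calc <- orders %>%
  mutate(
    total     = price * quantity,  # multiplication
    log_price = log(price),        # natural logarithm
    leftover  = quantity %% 2      # remainder after dividing by 2
  )

print(orders_calc)
```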

Summarize

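The summarize verb collapses the dataset into a single row of summary statistics. Common summary functions include mean(), median(), sd() (standard deviation), IQR(), min(), max(), and n() (count).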

dataset %>%
     summarize(summary_name = mean(variable))

Summarize by groups (Pivot Table)

group_by() followed by summarize() reduces every category to a single row. You can also choose to group by multiple factors (but be careful, because the order matters)

dataset %>%
     group_by(variable) %>%
     summarize(mean_variable = mean(variable_2))
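
As an illustration, the following sketch computes several summaries per group at once. The sales data frame and its column names are assumptions made up for the example.

```r
library(dplyr)

# Hypothetical example data
sales <- data.frame(
  region = c("East", "East", "West", "West", "West"),
  amount = c(100, 300, 50, 150, 250)
)

by_region <- sales %>%
  group_by(region) %>%
  summarize(
    mean_amount = mean(amount),  # average sale per region
    max_amount  = max(amount),   # largest sale per region
    n_sales     = n()            # number of rows per region
  )

print(by_region)
```

The result has one row per region, much like a pivot table with region as the row field.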

Select Columns for a Simplified Dataset

Use the select verb. A colon can be used to select a range of columns

simplified_dataset <- dataset %>%
     select(variable_2, variable_4:variable_6)

Selection helpers: contains(), starts_with(), ends_with(), last_col(), matches()
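
A short sketch of a few of these helpers, using a hypothetical survey data frame whose column names were chosen to make the patterns visible:

```r
library(dplyr)

# Hypothetical example data
survey <- data.frame(
  id       = 1:2,
  q1_score = c(4, 5),
  q2_score = c(3, 2),
  q1_text  = c("a", "b"),
  notes    = c("x", "y")
)

# Columns whose names start with "q1", plus id
q1_cols <- survey %>% select(id, starts_with("q1"))

# Columns whose names end with "score"
score_cols <- survey %>% select(ends_with("score"))

# Columns whose names contain "text", plus the last column
text_and_last <- survey %>% select(contains("text"), last_col())

names(q1_cols)
names(score_cols)
names(text_and_last)
```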

Glimpse Verb

Prints a transposed preview of the dataset in the console: each column appears on its own line with its name, its type, and the first few values.

glimpse(dataset)

Sort observations in your dataset by a variable

Use the arrange() verb. By default it goes in ascending order, so specify descending with desc() if needed

dataset %>%
     arrange(variable)

dataset %>%
     arrange(desc(variable))

Filter Verb

Allows you to filter rows based on a condition

dataset %>%
     filter(variable == "x")

Filter for multiple values of a variable using %in% c()

dataset %>%
     filter(variable %in% c("value 1", "value 2"))
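
The sketch below puts the common filter patterns side by side: a numeric comparison, a string match, a match against several values, and two conditions combined. The pets data frame is hypothetical.

```r
library(dplyr)

# Hypothetical example data
pets <- data.frame(
  species = c("cat", "dog", "bird", "dog"),
  age     = c(3, 7, 1, 2)
)

# Numeric comparison
older <- pets %>% filter(age > 2)

# String match: the value must be quoted
dogs <- pets %>% filter(species == "dog")

# Match any of several values
cats_or_birds <- pets %>% filter(species %in% c("cat", "bird"))

# Conditions separated by a comma must all be true
old_dogs <- pets %>% filter(species == "dog", age > 5)

print(old_dogs)
```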

Create new columns

Use the mutate() verb. Control which columns are retained in the output using the .keep argument. The second example would output a table with variable_1, variable_2, and the new variable only.

dataset %>%
     mutate(new_variable = variable_1 + variable_2)
dataset %>%
     mutate(variable_1, variable_2, new_variable = variable_6 / 2, .keep = "none")
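
Besides "none", .keep also accepts "all" (the default), "used", and "unused". A minimal sketch of the last two, assuming a made-up orders data frame:

```r
library(dplyr)

# Hypothetical example data
orders <- data.frame(
  order_id = 1:3,
  price    = c(10, 25, 40),
  quantity = c(3, 2, 5),
  note     = c("a", "b", "c")
)

# "used": keep only the columns that went into the calculation, plus the new one
kept_used <- orders %>%
  mutate(total = price * quantity, .keep = "used")

# "unused": drop the columns that went into the calculation
kept_unused <- orders %>%
  mutate(total = price * quantity, .keep = "unused")

names(kept_used)
names(kept_unused)
```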

Grouped Mutate

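Achieved by using mutate() and group_by() together, which calculates the new column within each group while keeping every row. Make sure to ungroup() afterwards so that later steps operate on the whole dataset rather than group by group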

dataset %>%
     group_by(variable) %>%
     mutate(total_variable = sum(variable_2)) %>%
     ungroup()
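
A common use of a grouped mutate is computing each row's share of its group total. The sketch below assumes a hypothetical sales data frame.

```r
library(dplyr)

# Hypothetical example data
sales <- data.frame(
  region = c("East", "East", "West"),
  amount = c(100, 300, 50)
)

shares <- sales %>%
  group_by(region) %>%
  mutate(
    region_total = sum(amount),          # total repeated on every row of the group
    share        = amount / region_total # this row's fraction of the group total
  ) %>%
  ungroup()

print(shares)
```

Unlike summarize(), the output keeps all three original rows.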

Find the frequency of a value

Use the count() verb. It produces a table with the number of observations for each value of the variable. Use the sort argument to arrange the table in descending order and the wt argument to weight the observations by another variable (both optional)

dataset %>%
     count(variable_1, sort = TRUE, wt = variable_2)
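
To make the effect of wt concrete, the sketch below counts the same hypothetical orders data twice: once by number of rows and once weighted by units sold.

```r
library(dplyr)

# Hypothetical example data
orders <- data.frame(
  product = c("pen", "pen", "ink", "pad", "pen"),
  units   = c(2, 1, 10, 4, 2)
)

# Unweighted: n is the number of rows per product
row_counts <- orders %>%
  count(product, sort = TRUE)

# Weighted: n is the sum of units per product
unit_counts <- orders %>%
  count(product, wt = units, sort = TRUE)

print(row_counts)
print(unit_counts)
```

Here "pen" has the most rows, but "ink" comes first once the counts are weighted by units.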

View the Largest/Smallest Observation for a Group

Use slice_min() for the smallest observations and slice_max() for the largest observations. Generally used in conjunction with the group_by() verb. Specify the number of results you want per group in the output table with the n argument

dataset %>%
     group_by(categorical_variable) %>%
     slice_max(continuous_variable, n = 3)  # n = number of rows to return per group
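
As a small worked case, the sketch below keeps the two highest-scoring rows per team. The athletes data frame is hypothetical.

```r
library(dplyr)

# Hypothetical example data
athletes <- data.frame(
  team   = c("A", "A", "A", "B", "B"),
  points = c(12, 30, 25, 8, 19)
)

top_two <- athletes %>%
  group_by(team) %>%
  slice_max(points, n = 2) %>%  # two largest values of points within each team
  ungroup()

print(top_two)
```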

Remove Variables

Use select with a minus sign in front of the variable you want to drop

dataset %>%
     select(-variable_1)

Rename a Column

dataset %>%
     rename(new_variable = old_variable)

You can also rename columns within the select verb

dataset %>%
     select(variable_1:variable_4, variable_new = variable_old)

Relocate Columns

Use the .after and .before arguments to place a column relative to existing columns. Use last_col() as a placeholder for the last column in the dataset

Initial Dataset:
variable_1     variable_2     variable_3

dataset %>%
     relocate(variable_1, .after = variable_2)

Output:
variable_2     variable_1     variable_3
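
A runnable sketch of the same idea, including .before and last_col(). The one-row data frame is made up so that only the column order matters.

```r
library(dplyr)

# Hypothetical one-row dataset; only the column order is of interest
df <- data.frame(variable_1 = 1, variable_2 = 2, variable_3 = 3)

# Move variable_1 to the end
moved_last <- df %>% relocate(variable_1, .after = last_col())

# Move variable_3 in front of variable_1
moved_first <- df %>% relocate(variable_3, .before = variable_1)

names(moved_last)
names(moved_first)
```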

Lag Function

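Shifts every value in a vector down by one place and fills the first position with NA. Useful for comparing the value of a variable in adjacent observations. Because lag() works on a vector rather than on a whole dataset, call it inside mutate().

dataset %>%
     mutate(previous_variable = lag(variable))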
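
lag() operates on a vector, so it is normally used inside mutate(). The sketch below uses it to compute a day-over-day change; the temps data frame and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical example data
temps <- data.frame(
  day  = 1:4,
  temp = c(20, 23, 21, 26)
)

changes <- temps %>%
  arrange(day) %>%                       # lag() depends on row order, so sort first
  mutate(
    previous_temp = lag(temp),           # value from the row above; NA for the first row
    change        = temp - previous_temp # difference from the previous day
  )

print(changes)
```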

ggplot2

Basic format

ggplot(dataset, aes(x = variable_1, y = variable_2))+
     geom_point()

Note: Each specification in a ggplot needs to be connected with a + sign

Geom options

geom_point (scatter plot), geom_histogram, geom_line, geom_bar, geom_boxplot, etc.

Full list available here: https://ggplot2.tidyverse.org/reference/
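
Different geoms expect different aesthetics, which is easy to trip over. The sketch below uses the built-in mtcars dataset (an assumption for illustration; any data frame works) to show which geoms take both x and y and which take x only.

```r
library(ggplot2)

# Scatter plot: needs both x and y
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Histogram: needs only x; the count on the y axis is computed for you
histogram <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Box plot: categorical x, continuous y
boxes <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Bar chart: needs only x; bar height is the number of rows per category
bars <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Typing a plot's name at the console (or calling print() on it) draws it
```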

Specify a logarithmic scale

ggplot(dataset, aes(x = variable_1, y = variable_2))+
     geom_line()+
     scale_x_log10()

Use color to represent a categorical variable

ggplot(dataset, aes(x = variable_1, y = variable_2, color = variable_3))+
     geom_point()

Use point size to represent a continuous variable

ggplot(dataset, aes(x = variable_1, y = variable_2, size = variable_3))+
     geom_point()

Divide into smaller plots based on categorical variable

ggplot(dataset, aes(x = variable_1, y = variable_2))+
     geom_point()+
     facet_wrap(~variable_3)

Make sure the x or y axis includes a certain value

ggplot(dataset, aes(x = variable_1, y = variable_2))+
     geom_line()+
     expand_limits(y = 0)

Add title to graph

ggplot(dataset, aes(x = variable_1))+
     geom_histogram()+
     ggtitle("Graph Title")
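
These pieces can be combined in a single plot. The sketch below layers color, a log scale, facets, an expanded axis, and a title on the built-in mtcars dataset, which is used here only as convenient example data.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +                      # scatter plot
  scale_x_log10() +                   # logarithmic x axis
  facet_wrap(~am) +                   # one panel per transmission type
  expand_limits(y = 0) +              # make sure the y axis includes 0
  ggtitle("Fuel efficiency by weight")

# Typing p at the console (or calling print(p)) draws the plot
```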
