dplyr and ggplot2 basics
Pipe Operators
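Passes the dataset from one step to the next: the result on the left of %>% becomes the first argument of the function on the right. For example, the following code filters the original dataset and then arranges the filtered result. If you leave out the pipe between steps, the next verb never receives the dataset and R returns an error.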
dataset %>%
filter() %>%
arrange()
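
To make the pipe concrete, here is a minimal sketch that compares a piped chain with the equivalent nested call. The scores table and its columns are made up for illustration and are not part of the original notes.

```r
library(dplyr)

# Hypothetical example data
scores <- tibble(
  name  = c("Ana", "Ben", "Cy", "Dee"),
  score = c(72, 95, 88, 60)
)

# Piped version: read top to bottom
piped <- scores %>%
  filter(score > 70) %>%
  arrange(score)

# Equivalent nested version: read inside out
nested <- arrange(filter(scores, score > 70), score)

identical(piped, nested)  # TRUE: both keep Ana, Cy, Ben in ascending score order
```

The two versions do the same work; the piped form is usually easier to read once more than two steps are chained.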
Filtering a Dataset
Allows you to filter based on a variable
dataset %>%
filter(variable > x)
The comparison can be >, <, >=, <=, ==, or !=. Note that if you’re filtering for a string (text), the value needs to be in quotation marks.
Sorting Data
You can sort data based on a column using the arrange verb. By default it arranges data in ascending order; wrap the column in desc() to sort in descending order.
dataset %>%
arrange(desc(variable))
Calculate a New Column
Use the mutate verb to calculate a new field using existing data. You can use arithmetic operators such as +, -, *, /, ^, and %% (remainder), as well as functions such as log().
dataset %>%
mutate(new_variable = variable_1 + variable_2)
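
As a rough illustration of the operators listed above, the sketch below adds three columns in a single mutate() call. The orders table and the column names are hypothetical.

```r
library(dplyr)

# Hypothetical example data
orders <- tibble(
  price    = c(10, 25, 8),
  quantity = c(3, 2, 5)
)

orders_with_totals <- orders %>%
  mutate(
    total     = price * quantity,  # 30, 50, 40
    remainder = quantity %% 2,     # 1, 0, 1
    log_price = log(price)         # natural log by default
  )
```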
Summarize
Common summary functions: mean(), median(), sd() (standard deviation), IQR(), min(), max(), and n() (count).
dataset %>%
summarize(summary_name = mean(variable))
Summarize by groups (Pivot Table)
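group_by() followed by summarize() reduces every group to a single row. You can also group by multiple variables, but be careful because order matters: it determines how the output is sorted, and by default summarize() drops only the last grouping variable.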
dataset %>%
group_by(variable) %>%
summarize(mean_variable = mean(variable_2))
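
A small worked example may help here. The sketch below assumes a made-up sales table and collapses it to one row per region.

```r
library(dplyr)

# Hypothetical example data
sales <- tibble(
  region = c("East", "East", "West", "West", "West"),
  amount = c(100, 200, 50, 150, 100)
)

by_region <- sales %>%
  group_by(region) %>%
  summarize(
    mean_amount = mean(amount),  # East: 150, West: 100
    n_sales     = n()            # East: 2,   West: 3
  )
```

Five input rows become two output rows, one for each value of region.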
Select Columns for a Simplified Dataset
Use the select verb. A colon can be used to select a range of columns
simplified_dataset <- dataset %>%
select(variable_2, variable_4:variable_6)
Selection helpers - contains(), starts_with(), ends_with(), last_col(), matches()
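
The sketch below shows what each selection helper picks from a small, hypothetical survey table. The column names are invented so that each helper matches something different.

```r
library(dplyr)

# Hypothetical example data
survey <- tibble(
  id        = 1:2,
  age_years = c(30, 41),
  q1_score  = c(5, 3),
  q2_score  = c(4, 4),
  notes     = c("ok", "late")
)

score_cols    <- survey %>% select(id, ends_with("_score"))  # id, q1_score, q2_score
question_cols <- survey %>% select(starts_with("q"))         # q1_score, q2_score
age_cols      <- survey %>% select(contains("age"))          # age_years
final_col     <- survey %>% select(last_col())               # notes
pattern_cols  <- survey %>% select(matches("^q[0-9]"))       # regular expression: q1_score, q2_score
```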
Glimpse Verb
Gives you a glimpse of the dataset by transposing in the console (i.e. you see a list of column names and the first few values in each one)
glimpse(dataset)
Sort observations in your dataset by a variable
Use the arrange() verb. By default it sorts in ascending order, so specify descending with desc() if needed.
dataset %>%
arrange(variable)
dataset %>%
arrange(desc(variable))
Filter Verb
Allows you to filter rows based on a condition
dataset %>%
filter(variable == "x")
Filter for multiple values of a variable using %in% c()
dataset %>%
filter(variable %in% c("value 1", "value 2"))
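
As a quick check of how %in% behaves, the sketch below filters a hypothetical fruit table and compares the result with the longer form that chains == tests with | (or).

```r
library(dplyr)

# Hypothetical example data
fruit <- tibble(
  name  = c("apple", "kiwi", "plum", "fig"),
  price = c(1.2, 0.5, 0.8, 2.0)
)

# Keep rows whose name is any of the listed values
picked <- fruit %>%
  filter(name %in% c("kiwi", "fig"))

# Same result written with == and | (or)
picked_long <- fruit %>%
  filter(name == "kiwi" | name == "fig")
```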
Create new columns
Use the mutate() verb. Control which columns are retained in the output using the “.keep” argument. The second example would output a table with variable_1, variable_2, and the new variable only.
dataset %>%
mutate(new_variable = variable_1 + variable_2)
dataset %>%
mutate(variable_1, variable_2, new_variable = variable_6 / 2, .keep = "none")
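
The sketch below compares the values that .keep accepts, using a tiny made-up table with columns a, b, and c. It assumes dplyr 1.0.0 or later, where the .keep argument was introduced.

```r
library(dplyr)

# Hypothetical example data
df <- tibble(a = 1:2, b = 3:4, c = 5:6)

keep_all    <- names(df %>% mutate(d = a + b))                    # a, b, c, d (default is "all")
keep_used   <- names(df %>% mutate(d = a + b, .keep = "used"))    # a, b, d
keep_unused <- names(df %>% mutate(d = a + b, .keep = "unused"))  # c, d
keep_none   <- names(df %>% mutate(d = a + b, .keep = "none"))    # d
```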
Grouped Mutate
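Achieved by using mutate() and group_by() together. Unlike summarize(), this keeps every row and repeats the group-level value on each row of the group. Make sure to ungroup afterwards so that later steps operate on the whole dataset rather than within groups.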
dataset %>%
group_by(variable) %>%
mutate(total_variable = sum(variable_2)) %>%
ungroup()
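
One common use of a grouped mutate is computing each row's share of its group total. The sketch below assumes a hypothetical sales table.

```r
library(dplyr)

# Hypothetical example data
sales <- tibble(
  region = c("East", "East", "West"),
  amount = c(100, 300, 50)
)

with_share <- sales %>%
  group_by(region) %>%
  mutate(
    region_total = sum(amount),           # 400, 400, 50
    share        = amount / region_total  # 0.25, 0.75, 1
  ) %>%
  ungroup()
```

The output still has three rows, one per original observation.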
Find the frequency of a value
Use the count() verb. Produces a table with the number of observations for each value of the variable. Use the sort argument to arrange the table in descending order of count and the wt argument to weight the observations by another variable (optional).
dataset %>%
count(variable_1, sort = TRUE, wt = variable_2)
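
To show the difference that wt makes, the sketch below counts a hypothetical orders table twice: once by number of rows and once weighted by units.

```r
library(dplyr)

# Hypothetical example data
orders <- tibble(
  product = c("pen", "pen", "ink", "pad", "ink", "pen"),
  units   = c(2, 1, 10, 3, 5, 1)
)

# Number of rows per product: pen 3, ink 2, pad 1
row_counts <- orders %>%
  count(product, sort = TRUE)

# Sum of units per product: ink 15, pen 4, pad 3
unit_counts <- orders %>%
  count(product, wt = units, sort = TRUE)
```

In both cases the result column is named n.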
View the Largest/Smallest Observation for a Group
Use slice_min() for the smallest observations and slice_max() for the largest observations. Generally used in conjunction with the group_by() verb. Specify the number of rows you want per group in the output table by assigning the value to “n”.
dataset %>%
group_by(categorical_variable) %>%
slice_max(continuous_variable, n = 3)  # n = number of rows to return per group
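
The sketch below applies both verbs to a hypothetical table of race times, keeping the fastest runner and the two slowest runners on each team.

```r
library(dplyr)

# Hypothetical example data
runners <- tibble(
  team = c("A", "A", "A", "B", "B"),
  time = c(12.1, 11.4, 13.0, 10.9, 11.8)
)

# Smallest time per team: A 11.4, B 10.9
fastest <- runners %>%
  group_by(team) %>%
  slice_min(time, n = 1) %>%
  ungroup()

# Two largest times per team: A 13.0 and 12.1, B 11.8 and 10.9
slowest_two <- runners %>%
  group_by(team) %>%
  slice_max(time, n = 2) %>%
  ungroup()
```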
Remove Variables
Use select and put a minus sign in front of the variable you want to drop
dataset %>%
select(-variable_1)
Rename a Column
dataset %>%
rename(new_variable = old_variable)
You can also rename columns within the select verb
dataset %>%
select(variable_1:variable_4, variable_new = variable_old)
Relocate Columns
Use .after and .before to place a column relative to existing columns. Use last_col() as a placeholder for the last column in the dataset.
Initial Dataset:
variable_1 variable_2 variable_3
dataset %>%
relocate(variable_1, .after = variable_2)
Output:
variable_2 variable_1 variable_3
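
The sketch below runs relocate() on a one-row table with the same three column names and records the resulting column order for .after, .before, and last_col(). The values in the table are placeholders.

```r
library(dplyr)

# Placeholder data with the three columns from the example above
dataset <- tibble(variable_1 = 1, variable_2 = 2, variable_3 = 3)

after_2     <- names(dataset %>% relocate(variable_1, .after = variable_2))   # variable_2, variable_1, variable_3
before_1    <- names(dataset %>% relocate(variable_3, .before = variable_1))  # variable_3, variable_1, variable_2
to_the_end  <- names(dataset %>% relocate(variable_1, .after = last_col()))   # variable_2, variable_3, variable_1
to_the_front <- names(dataset %>% relocate(variable_3))                       # with no position given, moves to the front
```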
Lag Function
Shifts every value in a vector down by one place and fills the first position with NA. Useful for comparing the value of a variable in adjacent observations. lag() works on a vector rather than a whole dataset, so call it inside mutate().
dataset %>%
mutate(previous_variable = lag(variable))
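
A typical use is computing the change from one observation to the next. The sketch below assumes a hypothetical table of daily temperatures that is already sorted by day.

```r
library(dplyr)

# Hypothetical example data, already in day order
temps <- tibble(
  day  = 1:4,
  temp = c(20, 23, 21, 25)
)

with_change <- temps %>%
  mutate(
    prev_temp = lag(temp),        # NA, 20, 23, 21
    change    = temp - prev_temp  # NA, 3, -2, 4
  )
```

The first row has no previous observation, so both new columns are NA there.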
ggplot2
Basic format
ggplot(dataset, aes(x = variable_1, y = variable_2))+
geom_point()
Note: Each specification in a ggplot needs to be connected with a + sign
Geom options
geom_point (scatter plot), geom_histogram, geom_line, geom_bar, geom_boxplot, etc.
Full list available here: https://ggplot2.tidyverse.org/reference/
Specify a logarithmic scale
ggplot(dataset, aes(x = variable_1, y = variable_2))+
geom_line()+
scale_x_log10()
Use color to represent a categorical variable
ggplot(dataset, aes(x = variable_1, y = variable_2, color = variable_3))+
geom_point()
Use point size to represent a continuous variable
ggplot(dataset, aes(x = variable_1, y = variable_2, size = variable_3))+
geom_point()
Divide into smaller plots based on categorical variable
ggplot(dataset, aes(x = variable_1, y = variable_2))+
geom_point()+
facet_wrap(~variable_3)
Make sure the x or y axis includes a certain value
ggplot(dataset, aes(x = variable_1, y = variable_2))+
geom_line()+
expand_limits(y = 0)
Add title to graph
ggplot(dataset, aes(x = variable_1))+
geom_histogram()+
ggtitle("Graph Title")
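
The pieces above can be combined in one plot. The sketch below is one possible combination, built on a made-up growth table: color for a categorical variable, a logarithmic y axis, an x axis expanded to include a chosen value, and a title.

```r
library(ggplot2)

# Hypothetical example data
growth <- data.frame(
  year       = rep(2001:2004, times = 2),
  population = c(1000, 2000, 4000, 8000, 500, 600, 700, 800),
  region     = rep(c("North", "South"), each = 4)
)

p <- ggplot(growth, aes(x = year, y = population, color = region)) +
  geom_line() +
  scale_y_log10() +
  expand_limits(x = 2000) +
  ggtitle("Population by region")

# Type p (or print(p)) at the console to draw the plot
```

Storing the plot in a variable is optional; it just makes it easy to add more layers later with +.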