---
title: "practical2"
author: ""
output: html_document
---

## Task 0: Configure your R session

```{r packages, eval=FALSE}
# Bundle of packages for data manipulation.
library(tidyverse)
# For working with geospatial data.
library(sf)
# For performing microsimulation. Install the dev version from GitHub. Uncomment to run.
# devtools::install_github("philmikejones/rakeR")
# For simplifying shapefiles. Uncomment to run.
# install.packages("rmapshaper")
# Set default ggplot2 theme.
theme_set(theme_minimal())
```

## Task 1: Study available variables

### Task 1.1: Attach OAC and Airport names

Lookup tables containing full names are provided as `.csv` files in the `./data/info` directory. We can add these to our simulated dataset using `dplyr::left_join()`.

```{r join-attributes, eval=FALSE}
# Read in lookup tables.
oac_lookup <- read_csv("./data/info/oac_lookup.csv") %>%
  # Cast supergroup_code as character to join with factor variable in simulated data.
  mutate(supergroup_code=as.character(supergroup_code))
airport_lookup <- read_csv("./data/info/airport_lookup.csv")
# Join OAC names with simulated_oac_age_sex.
# Specify variables on which to join as names not identical.
simulated_oac_age_sex <- simulated_oac_age_sex %>%
  left_join(oac_lookup, by=c("oac_grp"="supergroup_code"))
# Join airport names with simulated_oac_age_sex.
# Specify variables on which to join as names not identical.
simulated_oac_age_sex <- simulated_oac_age_sex %>%
  # Do separately for overseas.
  left_join(airport_lookup, by=c("overseas_airport"="airport_code")) %>%
  # Add dest_ prefix so can distinguish destination from origin.
  rename_at(vars(airport_name:airport_country), ~paste0("dest_",.)) %>%
  # And UK airports.
  left_join(airport_lookup, by=c("uk_airport"="airport_code")) %>%
  # Add orig_ prefix so can distinguish origin from destination.
  rename_at(vars(airport_name:airport_country), ~paste0("orig_",.))
```

You may wish to view this updated tibble to check that OAC and Airport data were successfully attached.

```{r view-joined-attributes, eval=FALSE}
View(simulated_oac_age_sex)
```

### Task 1.2: Generate frequency plots

A first step in exploring how the simulated population is distributed across these categorical variables is to plot frequencies for each category. The code snippet below, also in the `.Rmd`, displays the frequency of the simulated population in each `age_band`, using [`ggplot2`](https://ggplot2.tidyverse.org).

```{r plot-frequencies, eval=FALSE}
ggplot(data=simulated_oac_age_sex, mapping=aes(x=age_band))+
  geom_bar()
```

### Task 1.3: Generate frequency plots on variables with many categories

The code block below is a ggplot2 specification for generating a Cleveland dot plot of frequencies by airport destination. The plot is ordered by the frequency with which countries are visited and, within countries, by the frequency with which airports are visited. To achieve this ordering, we cast the country variable (`dest_airport_country`) and airport variable (`overseas_airport`) as [factors](https://r4ds.had.co.nz/factors.html) and order the factor levels by frequency.

```{r plot-frequencies-dest, eval=FALSE}
# Generate tibble of countries ordered by frequency (for ordered factors).
order_country <- simulated_oac_age_sex %>%
  group_by(dest_airport_country) %>%
  summarise(count=n()) %>%
  arrange(-count)
# Cleveland dot plot of destinations, grouped by country.
plot <- simulated_oac_age_sex %>%
  # Order dest countries, casting as a factor and ordering levels on frequency.
  mutate(dest_airport_country=factor(dest_airport_country,levels=order_country$dest_airport_country)) %>%
  # Calculate num holidays to each dest airport.
  group_by(overseas_airport) %>%
  summarise(count_airport=n(), dest_airport_country=first(dest_airport_country)) %>%
  # Order by these frequencies.
  arrange(count_airport) %>%
  # Cast as factor and order levels.
  mutate(overseas_airport=factor(overseas_airport,levels=.$overseas_airport)) %>%
  # List airports vertically and frequencies horizontally.
  ggplot(aes(x=count_airport,y=overseas_airport))+
  geom_segment(aes(x=0, y=overseas_airport, xend=count_airport, yend=overseas_airport), colour="#636363")+
  geom_point(colour="#636363", fill="#cccccc", shape=21)+
  # Facet the plot on country to display group freq by destination country.
  facet_grid(dest_airport_country~., scales="free_y", space="free_y")+
  theme(strip.text.y = element_text(angle=0))
# Print plot to console
plot
# Save plot
ggsave("./dests_cleveland.png", plot, width=4, height=10)
```

## Task 2: Explore research questions

### Task 2.1: Calculate totals and proportions

The code block below makes use of `dplyr` functions to calculate:

1. The number of Leeds households holidaying in Palma Mallorca by OAC group, expressed as a proportion of all holiday-makers to Palma Mallorca
2. The number of Leeds households by OAC group, expressed as a proportion of all Leeds households

```{r calc-proportions-oac, eval=FALSE}
simulated_oac_age_sex %>%
  # Dummy variable identifying destination to be profiled: PMI.
  mutate(
    # Edit the if_else() to switch focus.
    dest_focus=if_else(overseas_airport=="PMI",1,0),
    total_focus=sum(dest_focus)
  ) %>%
  # Calculate proportions for each OAC group.
  group_by(supergroup_name) %>%
  summarise(
    oac_grp_focus=sum(dest_focus)/first(total_focus),
    oac_grp_control=n()/nrow(.),
    diff_focus_control=oac_grp_focus-oac_grp_control
  )
```

### Task 2.2: Plot totals and proportions

The code block below defines the function `calculate_props()`. Once you have created it by running the code, you will see it listed in the _Global Environment_ window of RStudio:

```{r calc-proportions-function, eval=FALSE}
#' Calculate proportions for a focus and control variable, with user-defined grouping.
#' @param data A df with focus and control variables (focus_var, control_var)
#'   and their totals (focus_total, control_total).
#' @param group_var The grouping variable, passed as a symbol (e.g. created with rlang::sym()).
#' @return A df of focus and control proportions, with diff values.
calculate_props <- function(data, group_var) {
  data %>%
    group_by(!!group_var) %>%
    summarise(
      focus_prop=sum(focus_var)/first(focus_total),
      control_prop=sum(control_var)/first(control_total),
      diff_prop=first(focus_prop)-first(control_prop),
      variable_type=rlang::as_string(group_var)
    ) %>%
    ungroup() %>%
    mutate(variable_name=!!group_var) %>%
    select(-!!group_var)
}
```

Then create a new dataset that will hold the summary statistics (control and focus proportions) by variable type.

1. Create a vector containing the names of grouping variables supplied to `calculate_props()`. You can change these variables by editing the fields supplied to `select()`. Also identify the destination(s) to focus on.
2. Create a new dataset with the control and focus variables (used by `calculate_props()`) defined. Note that by default the control variable covers the entire population, but you could select a particular subset for comparison here in the same way as the focus variable is defined.
3. Using a [`map`](https://r4ds.had.co.nz/iteration.html#the-map-functions) function, iterate over each grouping variable name and use this to parameterise `calculate_props()`.

```{r generate-plot-proportions, eval=FALSE}
# Create a vector of the names of grouping variables that will be summarised over.
groups <- names(simulated_oac_age_sex %>%
  select(supergroup_name, age_band:household_income, satisfaction_overall))
# Identify the destination in focus.
focus <- "IBZ"
control <- "ALL"
# calculate_props() requires control and focus variables, not yet contained in
# simulated dataset. Create a new dataset with these added.
temp_simulated_data <- simulated_oac_age_sex %>%
  mutate(
    focus_var=if_else(overseas_airport %in% focus,1,0),
    control_var=1,
    focus_total=sum(focus_var),
    control_total=sum(control_var),
    # Cast as character so variable_name has a consistent type when rows are bound.
    number_children=as.character(number_children)
  )
# Iterate over each grouping variable, using map_df to bind rows of returned
# data frames.
temp_plot_data <- purrr::map_df(groups, ~calculate_props(temp_simulated_data, rlang::sym(.x)))
rm(temp_simulated_data)
```

Finally, plot these data to explore demographic and attitudinal characteristics that are _over-_ or _under-represented_ among holiday-makers to the destination in focus.

```{r plot-proportions, eval=FALSE}
# Define two colours used to colour pos and neg bars differently
fill_colours <- c("#ffffff", "#cccccc")
plot <- temp_plot_data %>%
  gather(key=stat_type, value=stat_value, -c(variable_type, variable_name)) %>%
  mutate(
    stat_type=factor(stat_type, levels=c("focus_prop","control_prop","diff_prop")),
    stat_sign=stat_value>0
  ) %>%
  filter(!is.na(variable_name)) %>%
  ggplot(aes(x=variable_name, y=stat_value))+
  # stat_sign is a boolean identifying whether stat_value is pos or neg.
  geom_col(aes(fill=stat_sign), colour="#636363", size=0.3)+
  # guide="none" hides the legend (guide=FALSE is deprecated in recent ggplot2).
  scale_fill_manual(values=fill_colours, guide="none")+
  facet_grid(variable_type~stat_type, scales="free", space="free_y")+
  labs(caption=paste0("focus var : ",focus," | control var : ",control))+
  coord_flip()+
  theme(axis.title=element_blank(), strip.text.y = element_text(angle=0))
# Print plot to console
plot
# Generate as .png
ggsave("./proportions_bars.png", plot, width=8, height=7)
```
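
As a quick sanity check on the focus and control proportions used throughout Task 2, the sketch below applies the same calculation to a small made-up dataset. The `toy` tibble and its values are hypothetical and are not part of the practical's data; only the column names `supergroup_name` and `overseas_airport` are borrowed from the simulated dataset. With six invented households, the proportions are easy to verify by hand.

```r
library(dplyr)

# Hypothetical toy data (not from the practical): six households,
# two OAC supergroups and two overseas destinations.
toy <- tibble(
  supergroup_name  = c("A", "A", "A", "B", "B", "B"),
  overseas_airport = c("PMI", "PMI", "IBZ", "PMI", "IBZ", "IBZ")
)

toy_props <- toy %>%
  mutate(
    # Dummy variable flagging the destination in focus.
    focus_var   = if_else(overseas_airport == "PMI", 1, 0),
    focus_total = sum(focus_var)
  ) %>%
  group_by(supergroup_name) %>%
  summarise(
    # Share of all PMI holiday-makers falling in this group.
    focus_prop   = sum(focus_var) / first(focus_total),
    # Share of all households falling in this group.
    control_prop = n() / nrow(toy),
    # Positive values: group is over-represented at the focus destination.
    diff_prop    = focus_prop - control_prop
  )

toy_props
```

Each of `focus_prop` and `control_prop` sums to 1 across groups, so the `diff_prop` values sum to 0. A positive `diff_prop` marks a group that is over-represented at the focus destination relative to the population as a whole.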