This post is a sub-module of the project and focuses on analysing countries’ attitudes towards COVID-19 vaccination. Data for the analyses is obtained from the Imperial College London YouGov Covid 19 Behaviour Tracker Data Hub.
The project focuses on leveraging the richness of COVID-19 data available across various countries to glean interesting insights and analyses. Some of the key aspects of COVID-19 in focus include trends and patterns in new cases of COVID-19, deaths arising and attitudes towards vaccination, across the different countries.
This post is a sub-module of the project and focuses on analysing countries’ attitudes towards COVID-19 vaccination. Data for the analyses is obtained from the Imperial College London YouGov Covid 19 Behaviour Tracker Data Hub.
YouGov has partnered with the Institute of Global Health Innovation (IGHI) at Imperial College London to gather global insights on people’s behaviours in response to COVID-19. The survey covers 29 countries and interviews around 21,000 people each week and datasets can be obtained from https://github.com/YouGov-Data/covid-19-tracker/tree/master/data.
Relevant questions relating to public’s attitudes towards vaccination have been extracted for the analyses in this post.
The scope of this post is as follows.
1. Preparation of Data
2. Exploratory Data Analysis
3. Review of visual analytic techniques
4. Packages and Functions for Visualization
5. Storyboard for the Shiny Visual Analytics Application
The following code is first input to define Global Settings for code chunks in this post.
knitr::opts_chunk$set(fig.retina=3,
echo = TRUE,
eval = TRUE,
message = TRUE,
warning = FALSE)
The relevant packages required in the analysis are then installed and loaded with the following code.
packages = c('tidyverse','dplyr','readr','HH',
'plotly','UpSetR','ggplot2',
'naniar','dlookr','ggridges','forcats')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The survey responses of the different countries were stored in separate csv files. Data for the first country was first imported to examine the available variables of interest. This was done with reference to the data dictionary for the datasets (available at https://github.com/YouGov-Data/covid-19-tracker/blob/master/codebook.xlsx).
# Read in 1st csv (Australia)
australia_df <- read_csv("data/australia.csv",col_types = cols(
age = col_integer(), .default = col_character()
))
head(australia_df,5)
# A tibble: 5 x 398
RecordNo endtime household_children qweek i1_health i2_health
<chr> <chr> <chr> <chr> <chr> <chr>
1 310 01/04/2020 16~ 0 week~ 1 1
2 311 01/04/2020 16~ 0 week~ 0 6
3 312 01/04/2020 16~ 1 week~ 0 0
4 313 01/04/2020 16~ 1 week~ 2 20
5 314 01/04/2020 16~ 1 week~ 2 0
# ... with 392 more variables: i7a_health <chr>, i3_health <chr>,
# i4_health <chr>, i5_health_1 <chr>, i5_health_2 <chr>,
# i5_health_3 <chr>, i5_health_4 <chr>, i5_health_5 <chr>,
# i5_health_99 <chr>, i5a_health <chr>, i6_health <chr>,
# i7b_health <chr>, i8_health <chr>, i9_health <chr>,
# i10_health <chr>, i11_health <chr>, i12_health_1 <chr>,
# i12_health_2 <chr>, i12_health_3 <chr>, i12_health_4 <chr>,
# i12_health_5 <chr>, i12_health_6 <chr>, i12_health_7 <chr>,
# i12_health_8 <chr>, i12_health_9 <chr>, i12_health_10 <chr>,
# i12_health_11 <chr>, i12_health_12 <chr>, i12_health_13 <chr>,
# i12_health_14 <chr>, i12_health_15 <chr>, i12_health_16 <chr>,
# i12_health_17 <chr>, i12_health_18 <chr>, i12_health_19 <chr>,
# i12_health_20 <chr>, i13_health <chr>, i14_health_1 <chr>,
# i14_health_2 <chr>, i14_health_3 <chr>, i14_health_4 <chr>,
# i14_health_5 <chr>, i14_health_6 <chr>, i14_health_7 <chr>,
# i14_health_8 <chr>, i14_health_9 <chr>, i14_health_10 <chr>,
# i14_health_96 <chr>, i14_health_98 <chr>, i14_health_99 <chr>,
# i14_health_other <chr>, d1_health_1 <chr>, d1_health_2 <chr>,
# d1_health_3 <chr>, d1_health_4 <chr>, d1_health_5 <chr>,
# d1_health_6 <chr>, d1_health_7 <chr>, d1_health_8 <chr>,
# d1_health_9 <chr>, d1_health_10 <chr>, d1_health_11 <chr>,
# d1_health_12 <chr>, d1_health_13 <chr>, d1_health_98 <chr>,
# d1_health_99 <chr>, weight <chr>, age <int>, gender <chr>,
# state <chr>, household_size <chr>, employment_status <chr>,
# WCRex2 <chr>, WCRV_4 <chr>, CORE_B2_4 <chr>,
# cantril_ladder <chr>, PHQ4_1 <chr>, PHQ4_2 <chr>, PHQ4_3 <chr>,
# PHQ4_4 <chr>, m1_1 <chr>, m1_2 <chr>, m1_3 <chr>, m1_4 <chr>,
# m2 <chr>, m3 <chr>, m4_1 <chr>, m4_2 <chr>, m4_3 <chr>,
# m4_4 <chr>, m4_96 <chr>, m4_99 <chr>, m4_other <chr>, m5_1 <chr>,
# m5_2 <chr>, m6_1 <chr>, m6_2 <chr>, m6_3 <chr>, m6_4 <chr>,
# m6_5 <chr>, ...
While the dataset had 398 variables, only variables which were of interest (i.e.socio-demographic variables and variables related to vaccination attitudes) were retained.
# Create main_df with selected variables from Australia dataset
fieldNames <- c("endtime", "age", "gender", "household_size",
"household_children", "vac_1", "vac2_1", "vac2_2",
"vac2_3", "vac2_4", "vac2_5", "vac2_6", "vac_3",
"vac4", "vac5", "vac6", "vac7")
main_df <- australia_df %>%
select(any_of(fieldNames)) %>% # select required variables
mutate(country = "Australia") %>% # Add column to store country name
drop_na(vac_1) # remove all records which did not
# answer vac_1 - willing to take vaccine
glimpse(main_df)
Rows: 7,761
Columns: 18
$ endtime <chr> "11/11/2020 11:51", "11/11/2020 11:51", "~
$ age <int> 69, 49, 70, 69, 48, 69, 20, 60, 69, 36, 2~
$ gender <chr> "Male", "Male", "Female", "Female", "Male~
$ household_size <chr> "2", "2", "1", "2", "5", "1", "4", "1", "~
$ household_children <chr> "0", "0", "0", "0", "1", "0", "0", "0", "~
$ vac_1 <chr> "1 - Strongly agree", "1 - Strongly agree~
$ vac2_1 <chr> "1 - Strongly agree", "4", "5 – Strongly ~
$ vac2_2 <chr> "3", "3", "5 – Strongly disagree", "1 - S~
$ vac2_3 <chr> "1 - Strongly agree", "2", "1 - Strongly ~
$ vac2_4 <chr> "2", "4", "5 – Strongly disagree", "4", "~
$ vac2_5 <chr> "2", "4", "5 – Strongly disagree", "5 – S~
$ vac2_6 <chr> "1 - Strongly agree", "2", "1 - Strongly ~
$ vac_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ vac4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ vac5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ vac6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ vac7 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ country <chr> "Australia", "Australia", "Australia", "A~
The other country datasets were then imported and merged to the main dataset created (main_df). As the main focus of the sub-module is to gain insights on countries’ attitudes towards vaccination, non-responses (NA) to vac_1 are excluded from main_df.
countrylist <- c("Canada","Denmark","Finland",
"France","Germany","Israel",
"Italy","Japan","Netherlands",
"Norway","Singapore","South-Korea",
"Spain","Sweden","United-Kingdom",
"United-States")
for(ctry in countrylist){
# loop to read individual country datasets and append to main_df
fileName <- paste0("data/",ctry,".csv")
df <- read_csv(fileName,col_types = cols(
age = col_integer(),
.default = col_character()
))
tableName <- paste0(ctry,"_df")
tableName <- df %>%
select(any_of(fieldNames)) %>%
mutate(country = ctry) %>%
drop_na(vac_1)
# main purpose of analysis is to look at vaccination attitudes
# so records with no response to vac_1 is filtered out
main_df <- bind_rows(main_df,tableName)
}
For some of the responses, recoding needs to be done to assign numerical values to allow for easier manipulation of the data. As the responses to the survey are categorical, the numerical values stored in the variables are recoded and stored as characters.
#recode variables
main_df <- main_df %>% mutate(vac_1 = case_when(
vac_1 == "1 - Strongly agree" ~ "1",
vac_1 == "5 – Strongly disagree" ~ "5",
TRUE ~ as.character(vac_1)))
main_df <- main_df %>% mutate(vac2_1 = case_when(
vac2_1 == "1 - Strongly agree" ~ "1",
vac2_1 == "5 – Strongly disagree" ~ "5",
TRUE ~ as.character(vac2_1)))
main_df <- main_df %>% mutate(vac2_2 = case_when(
vac2_2 == "1 - Strongly agree" ~ "1",
vac2_2 == "5 – Strongly disagree" ~ "5",
TRUE ~ as.character(vac2_2)))
main_df <- main_df %>% mutate(vac2_3 = case_when(
vac2_3 == "1 - Strongly agree" ~ "1",
vac2_3 == "5 – Strongly disagree" ~ "5",
TRUE ~ as.character(vac2_3)))
main_df <- main_df %>% mutate(vac2_4 = case_when(
vac2_4 == "1 - Strongly agree" ~ "1",
vac2_4 == "5 – Strongly disagree" ~ "5",
TRUE ~ as.character(vac2_4)))
main_df <- main_df %>% mutate(vac2_5 = case_when(
vac2_5 == "1 - Strongly agree" ~ "1",
vac2_5 == "5 – Strongly disagree" ~ "5",
TRUE ~ as.character(vac2_5)))
main_df <- main_df %>% mutate(vac2_6 = case_when(
vac2_6 == "1 - Strongly agree" ~ "1",
vac2_6 == "5 – Strongly disagree" ~ "5",
TRUE ~ as.character(vac2_6)))
main_df <- main_df %>% mutate(vac_3 = case_when(
vac_3 == "1 - Strongly agree" ~ "1",
vac_3 == "5 – Strongly disagree" ~ "5",
TRUE ~ as.character(vac_3)))
main_df <- main_df %>% mutate(vac4 = case_when(
vac4 == "Not at all important" ~ "1",
vac4 == "A little important" ~ "2",
vac4 == "Moderately important" ~ "3",
vac4 == "Very important" ~ "4"))
main_df <- main_df %>% mutate(vac5 = case_when(
vac5 == "Yes" ~ "1",
vac5 == "No" ~ "2",
vac5 == "Not sure" ~ "99"))
main_df <- main_df %>% mutate(vac6 = case_when(
vac6 == "Yes" ~ "1",
vac6 == "No" ~ "2",
vac6 == "Not sure" ~ "99"))
main_df <- main_df %>% mutate(vac7 = case_when(
vac7 == "Not at all" ~ "1",
vac7 == "A little" ~ "2",
vac7 == "Moderately" ~ "3",
vac7 == "Very much" ~ "4"))
main_df <- main_df %>% replace_with_na(replace = list(vac5 = "99", vac6 = "99" ))
# declare responses with "Not Sure" as missing values
After preparing the data, it would be timely to carry out EDA to further understand distribution of the variables, correlation and relationships between variables. This can be done using the dlookr package.
For the main_df, there is only 1 variable that is numerical, age. We use the describe() function to compute descriptive statistics for age.
describe(main_df)
# A tibble: 1 x 26
variable n na mean sd se_mean IQR skewness kurtosis
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age 103535 0 47.9 16.5 0.0512 27 -0.00853 -1.03
# ... with 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>,
# p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
# p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
# p99 <dbl>, p100 <dbl>
The density plot of age can also be plotted to investigate the mean age and age distribution of respondents who respondents to vac_1.
a <- ggplot(main_df, aes(x=age))
a + geom_density(aes(y=..count..), fill = "slategrey") +
geom_vline(aes(xintercept = mean(age)),
linetype = "dashed", size = 0.8,
color = "red")

To investigate the age distribution by country, we could use the package ggridges to do a comparison.
# need to install ggridges package
theme_set(theme_ridges())
ggplot(main_df, aes(x = `age`, y = `country`)) +
geom_density_ridges_gradient(aes(fill = ..x..), scale = 3, size = 0.3) +
scale_fill_gradientn(colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
name = "Age")+
labs(title = 'Age Distribution by Country')
Picking joint bandwidth of 2.65

From the plot, we observe that countries like Sweden, Israel and Finland had a greater proportion of younger respondents, whereas countries like Singapore, Norway and Netherlands had a relatively even distribution of respondents across all ages.
main_df %>%
mutate(CountryFct = fct_rev(as.factor(country))) %>%
ggplot(aes(y = CountryFct)) +
geom_density_ridges(
aes(x = age, fill = paste(CountryFct, gender)),
alpha = .5, color = "white", from = 0, to = 100) +
labs(x = "Age (years)",y = "Country",
title = "Age Distribution of Respondents by Gender and Country") +
scale_y_discrete(expand = c(0, 0)) +
scale_x_continuous(expand = c(0, 0)) +
scale_fill_cyclical(
breaks = c("Male", "Female"),
labels = c(`Male` = "Male", `Female` = "Female"),
values = c("#ff0000", "#0000ff", "#ff8080", "#8080ff"),
name = "Gender", guide = "legend") +
coord_cartesian(clip = "off") +
theme_ridges(grid = FALSE)
Picking joint bandwidth of 3.04

From the plots, it is observed that in general, most countries have similar proportions of male and females (purple areas), except for Finland, France, Germany, Italy and Spain where there is a slightly higher proportion of female (red areas)at the older age bands.
For Categorical data, we could use table to carry out Univariate EDA.
table(main_df$vac_1)
1 2 3 4 5
41528 11978 20280 9948 19801
table(main_df$vac2_1)
1 2 3 4 5
23943 22310 29635 15804 11843
table(main_df$vac2_2)
1 2 3 4 5
29117 21198 25083 16326 11811
For Bivariate Categorical EDA, we could use the following to find the frequencies or proportions between 2 categorical variables.
my_table_0 <- table(main_df$vac_1, main_df$gender)
print.table(my_table_0)
Female Male
1 19896 21632
2 5899 6079
3 11010 9270
4 5354 4594
5 11353 8448
my_table_01 <- addmargins(my_table_0)
print.table(my_table_01)
Female Male Sum
1 19896 21632 41528
2 5899 6079 11978
3 11010 9270 20280
4 5354 4594 9948
5 11353 8448 19801
Sum 53512 50023 103535
my_table_3 <- prop.table(my_table_0, margin = 2) %>%
as.data.frame.matrix()
print.data.frame(my_table_3, digits = 2)
Female Male
1 0.37 0.432
2 0.11 0.122
3 0.21 0.185
4 0.10 0.092
5 0.21 0.169
From the above, we are able to observe that males form a greater proportion of those who disagreed or strongly disagreed (responded 2 or 1) while females form a greater proportion of those who agreed or strongly agreed.
To investigate the relationship between a categorical and continuous variable, we can use the relate function as shown below. The user is able to select the 2 variables for investigation as shown.
# User to select target variable and predictor numerical
# variables (i.e. vac_1, and Age)
variable_selected <- "vac_1" # to be selected by user
pred_var <- "age" # to be selected by user
categ <- target_by(main_df, variable_selected)
Note: Using an external vector in selections is ambiguous.
i Use `all_of(variable_selected)` instead of `variable_selected` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
cat_num <- relate(categ, pred_var)
Note: Using an external vector in selections is ambiguous.
i Use `all_of(pred_var)` instead of `pred_var` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
cat_num
# A tibble: 6 x 27
variable vac_1 n na mean sd se_mean IQR skewness
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age 1 41528 0 52.4 16.6 0.0814 26 -0.312
2 age 2 11978 0 46.3 16.2 0.148 28 0.0556
3 age 3 20280 0 44.3 15.6 0.110 26 0.190
4 age 4 9948 0 43.8 15.6 0.156 26 0.204
5 age 5 19801 0 45.2 15.5 0.110 26 0.154
6 age total 103535 0 47.9 16.5 0.0512 27 -0.00853
# ... with 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>,
# p05 <dbl>, p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>,
# p50 <dbl>, p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>,
# p95 <dbl>, p99 <dbl>, p100 <dbl>
plot(cat_num)

It can be observed that respondents who responded 2,3,4 or 5 have similar age distribution patterns. However, for those that strongly disagreed that they are keen on vaccination, a large proportion come from the older age groups.
The above is a density plot that is plotted when a categorical variable is plotted with a continuous variable. For relationship between 2 categorical variables, the following would be a good plot for visualization.
# User to select target variable and predictor numerical
# variables (i.e. vac_1, and vac2_1)
variable_selected <- "vac_1" # to be selected by user
pred_var <- "vac2_1" # to be selected by user
categ <- target_by(main_df, variable_selected)
cat_num <- relate(categ, pred_var)
cat_num
vac2_1
vac_1 1 2 3 4 5
1 13213 9497 10309 5310 3199
2 2402 3805 3383 1803 585
3 3401 4284 8131 3093 1371
4 1482 2205 2996 2349 916
5 3445 2519 4816 3249 5772
plot(cat_num)

It can be observed that a large proportion of respondents who responded 1 in vac_1 also responded 1 for vac2_1. The same is observed for those who responded 5 in vac_1, where many of them also responded 5 for vac2_1.
Cat3 <- main_df %>%
select(gender, vac_1, vac2_1, household_size) %>%
ftable()
Cat3
household_size 1 2 3 4 5 6 7 8 8 or more Don't know Prefer not to say
gender vac_1 vac2_1
Female 1 1 1455 2720 1261 1046 339 102 31 0 39 13 46
2 1032 1741 686 579 178 55 22 2 12 2 21
3 1245 1901 746 597 194 63 15 0 19 7 36
4 586 790 325 262 90 23 4 0 13 3 12
5 340 499 167 158 69 19 7 1 8 5 15
2 1 249 462 290 248 90 35 11 0 4 6 12
2 312 559 351 339 124 47 20 2 10 5 21
3 322 507 296 299 102 35 12 1 9 10 16
4 167 276 133 135 57 13 5 0 3 1 3
5 58 84 42 35 18 6 1 0 6 2 2
3 1 362 650 475 460 184 56 25 0 14 3 25
2 359 696 504 482 189 43 12 0 10 9 37
3 683 1133 812 792 336 107 41 1 23 77 113
4 282 447 285 313 124 41 17 0 7 11 21
5 141 209 116 114 45 16 7 0 4 4 8
4 1 156 276 209 175 69 33 10 0 2 2 9
2 216 352 260 226 115 28 4 0 10 4 19
3 275 514 325 288 134 36 14 1 4 8 22
4 214 322 207 207 81 36 19 0 11 8 16
5 76 118 84 70 29 10 6 0 4 2 2
5 1 375 648 491 386 152 63 30 1 8 7 32
2 268 459 312 256 103 26 11 1 5 5 12
3 589 912 605 475 199 57 20 1 13 12 53
4 412 542 360 282 115 41 15 2 10 9 22
5 659 862 518 402 152 63 28 1 21 20 72
Male 1 1 1123 2138 1122 938 295 106 46 0 30 32 77
2 997 2044 894 751 215 75 32 0 9 5 26
3 1150 2310 805 729 220 73 18 1 16 13 34
4 701 1334 472 416 137 30 11 0 7 4 13
5 433 702 285 260 94 30 7 0 14 8 19
2 1 143 266 245 208 65 24 9 2 5 0 16
2 341 551 417 432 139 45 15 0 8 11 32
3 326 490 344 344 135 48 15 0 6 12 26
4 219 309 198 158 60 19 7 0 4 2 17
5 82 99 57 47 25 7 2 0 2 3 5
3 1 159 318 256 227 87 35 10 0 8 8 18
2 340 493 429 391 130 57 21 0 10 8 29
3 698 891 772 846 354 121 32 2 23 93 134
4 286 390 312 288 124 60 15 0 10 10 26
5 151 198 134 118 47 15 7 0 5 9 11
4 1 74 134 123 117 41 19 5 0 7 4 8
2 169 256 185 215 72 30 7 0 6 4 14
3 235 363 246 290 116 53 18 0 5 11 21
4 210 308 223 250 107 52 23 0 11 5 19
5 121 139 84 89 32 19 1 1 4 3 9
5 1 220 349 272 213 87 17 6 0 8 4 28
2 227 301 201 177 64 23 7 0 7 1 13
3 393 539 349 328 113 33 11 0 11 11 36
4 312 469 248 205 94 26 13 0 6 5 17
5 805 809 461 426 161 54 22 3 27 40 87
For Likert responses, there are several ways in which they can be represented for visualization. The following are some ways.
3.1.1 Grouped Column / Bar Charts One way is to group responses by questions (or other categories) and to represent the responses as column or bar charts in terms of the Likert scale as shown below.
Figure 1: Grouped Bar Chart representing responses to a Likert Scale survey
While Bar Charts allow easy comparison on the count (frequency) of responses across the different Likert Scores, they do not allow the easy comparison of proportions of the responses (e.g., what is the proportion who strongly agree or agree to a question?)
3.1.2 Multiple Pie Charts While Pie Charts depict the proportions of each type of response to a question (as shown below), it is difficult to compare and visualize the differences in proportion between the questions.
Figure 2: Multiple Pie Charts representing responses to a Likert Scale survey
3.1.3 Stacked Bar Charts
Stacked bar charts such as the below chart shows clearly the proportion of the various responses for each question asked in a survey. The use of different tones of colours (i.e. Light Green to Dark Green) provide the reader with an idea of increasing levels of responses (i.e. Agree to Strongly Agree).
Figure 3: Stacked Bar Charts representing responses to Likert Scale survey
A benefit of Stacked Bar Charts is that it allows readers to compare different proportions across different bars since each set of stacked bar sums up to 100%. For the example above, it is evident that while a large proportion agreed that they liked the presentation, a lower proportion actually felt that theu learnt something from the presentation.
3.1.4 Diverging Stacked Bar Chart
Another way of representing responses to a Likert Scale survey is Diverging Stacked Bar Chart. This visualization is similar to the Stacked Bar Chart but differs in that the bars have a common vertical baseline located in the the centre of the diagram. The lengths of the segments of the bar charts are proportional to the number of responses for each value of the Likert scale for each question. Segments which represent favourable responses are usually on the right of the baseline while those that are unfavarouble are on the left. Neutral responses are located on the central baseline.
Figure 4: Diverging Stacked Bar Charts representing responses to Likert Scale survey
The Diverging Stacked Bar Chart is most appropriate to represent responses from Likert Scale Surveys as it allows easy interpretation and comparison of multiple categories.
3.1.5 Comparing the mean Likert Score
While it may seem appropriate to compare the mean Likert Score of each category, there is a potential pitfall in doing so.
Using the below column graph that depicts the responses to a survey on 3 different courses, it is evident that the responses for each of the 3 courses are very different. However when the mean score of responses is computed for each course, the mean appears similar.
Figure 5: Column Bar Charts with responses for 3 different courses
Mean values of the above responses for each course are as shown.
| Course | Mean |
|---|---|
| Course_1 | 2.84 |
| Course_2 | 3.04 |
| Course_3 | 3.00 |
To mitigate this issue, it would be better to find the proportion of responses that meet a certain value (e.g. proportion of respondents who gave a 4 or 5 score).
As the results of a survey are usually used to represent a population, the sample (survey) proportion should be represented with confidence intervals to allow readers an indication of the confidence level that the actual population proportion will be within the confidence interval.
3.2.1 Bar plots with Error Bars
One of the ways to represent the confidence intervals of sample data is with the use of error bars as shown below.
Figure 6: Bar Plots with Error Bars
For surveys, it is useful to study the correlation and relationship between responses to understand if there are certain determinants or factors that affect the response of certain questions.
For this project, while we examine the inclination of countries towards vaccination. It would be useful to investigate if that inclination is dependent on certain socio-demographical factors (e.g., age, gender, household size or number of children in the household) or certain attitudes or beliefs (e.g., confidence of vaccine efficacy, or concerns on the side effects of the vaccine).
To visualize the correlation between factors, one of the methods usually used is to build a correlation matrix or correlation scatterplot. However this is usually used to depict the correlation between continuous variables.
Figure 7: Correlation Matrix
Figure 8: Correlation Scatterplot
For correlations between categorical variables, one way to visualize correlation is via the use of the UpSet plot via the use of the UpSetR function (see next section).
An UpSet plot (seen below) allows the reader to see how frequently each combination or intersection of different factors takes place. Combinations that occur more frequently indicate a stronger correlation between the factors in the combination.
Figure 9: Upset plot
As discussed in Section 3.1.4, a Diverging Stacked Bar Chart is the most suitable for visualization of the responses from a Likert Scale survey. The following steps involves steps needed to create a Diverging Stacked Bar Chart from the dataset created.
To prepare the dataset for use by the likert function subsequently, the gather function is used to consolidate the variables and the corresponding responses into 2 columns.
# Make the table long - Gather question data from columns to rows
main_gathered <- gather(main_df, measure, response, c(6:17))
To see responses for the different questions, users will be able to select different variables for qn_selected in the eventual Shiny app. For now, we can manually change the values for qn_selected to simulate the selection.
# Obtain dataset with only responses for selected question
qn_selected <- "vac_1" # variable of question selected
vac_1 <- filter(main_gathered, measure == qn_selected)
# Obtain contingency table for vac_1
vac_1_df <- table(vac_1$country,vac_1$response) %>% as.data.frame.matrix()
# Change column names, now labelled as 1-5
colnames(vac_1_df) <- c("Strongly Disagree",
"Disagree",
"Neutral",
"Agree",
"Strongly Agree")
rownames(vac_1_df) <- c("Australia","Canada","Denmark",
"Finland","France","Germany",
"Israel","Italy","Japan",
"Netherlands","Norway","Singapore",
"South Korea","Spain","Sweden",
"United Kingdom","United States")
# Remove other columns containing other responses (for other questions)
vac_1_df <- vac_1_df[,c(1:5)]
# Add a column with rownames
vac_1_df <- tibble::rownames_to_column(vac_1_df, var="Country")
# Matching appropriate graph titles to qn_selected
if (qn_selected == "vac_1") {
div_chart_title <- "Proportion who are willing to take vaccine"
} else if (qn_selected == "vac2_1") {
div_chart_title <- "Proportion worried about getting COVID-19"
} else if (qn_selected == "vac2_2") {
div_chart_title <- "Proportion worried about side effects of COVID-19 vaccines"
} else if (qn_selected == "vac2_3") {
div_chart_title <- "Proportion confident government will provide effective COVID-19 vaccines"
} else if (qn_selected == "vac2_4") {
div_chart_title <- "Proportion confident vaccine will completely protect recipients from health effects of COVID-19"
} else if (qn_selected == "vac2_5") {
div_chart_title <- "Proportion confident vaccine will completely prevent transmission of COVID-19 from recipient to others"
} else if (qn_selected == "vac2_6") {
div_chart_title <- "Proportion who feel they will regret if they do not take the vaccine"
} else if (qn_selected == "vac_3") {
div_chart_title <- "Proportion who will take the vaccine if available in 1 year"
}
The likert function from the HH package is used to plot the Diverging Stacked Bar Chart in the following code chunk.
# Plot diverging bar chart
likert(Country ~ .,data = vac_1_df, ylab = NULL,
RefernceZero = 3, as.percent=TRUE,
positive.order=TRUE,
main = list(div_chart_title,
x =unit(.55,"npc")),
sub = list("Response", x=unit(.57,"npc")),
xlim = c(-100,-80,-60,-40,-20,0,20,40,60,80,100),
strip = FALSE,
par.strip.text=list(cex=.7)
)

From the Diverging Stacked Bar Chart plotted, we are able to observe the various responses for respective countries. For the above example where “vac_1” was selected as qn_selected, we observe that France, followed by the United States and Singapore, were the most receptive to vaccination among the other countries.
By changing the value of qn_selected to other questions (e.g. vac2_1, vac4), we will be able to visualize the responses to the different questions and hence get a deeper understanding of the possible association and determinants behind the attitudes and inclination towards vaccination.
For illustration, when qn_selected is changed to “vac2_1” and the below code chunk is run, the corresponding Diverging is also produced below.
# Obtain dataset with only responses for selected question
qn_selected <- "vac2_1" # variable of question selected
vac_1 <- filter(main_gathered, measure == qn_selected)
# Obtain contingency table for vac_1
vac_1_df <- table(vac_1$country,vac_1$response) %>% as.data.frame.matrix()
# Change column names, now labelled as 1-5
colnames(vac_1_df) <- c("Strongly Disagree",
"Disagree",
"Neutral",
"Agree",
"Strongly Agree")
rownames(vac_1_df) <- c("Australia","Canada","Denmark",
"Finland","France","Germany",
"Israel","Italy","Japan",
"Netherlands","Norway","Singapore",
"South Korea","Spain","Sweden",
"United Kingdom","United States")
# Remove other columns containing other responses (for other questions)
vac_1_df <- vac_1_df[,c(1:5)]
# Add a column with rownames
vac_1_df <- tibble::rownames_to_column(vac_1_df, var="Country")
# Matching appropriate graph titles to qn_selected
if (qn_selected == "vac_1") {
div_chart_title <- "Proportion who are willing to take vaccine"
} else if (qn_selected == "vac2_1") {
div_chart_title <- "Proportion worried about getting COVID-19"
} else if (qn_selected == "vac2_2") {
div_chart_title <- "Proportion worried about side effects of COVID-19 vaccines"
} else if (qn_selected == "vac2_3") {
div_chart_title <- "Proportion confident government will provide effective COVID-19 vaccines"
} else if (qn_selected == "vac2_4") {
div_chart_title <- "Proportion confident vaccine will completely protect recipients from health effects of COVID-19"
} else if (qn_selected == "vac2_5") {
div_chart_title <- "Proportion confident vaccine will completely prevent transmission of COVID-19 from recipient to others"
} else if (qn_selected == "vac2_6") {
div_chart_title <- "Proportion who feel they will regret if they do not take the vaccine"
} else if (qn_selected == "vac_3") {
div_chart_title <- "Proportion who will take the vaccine if available in 1 year"
}
# Plot diverging bar chart
likert(Country ~ .,data = vac_1_df, ylab = NULL,
RefernceZero = 3, as.percent=TRUE,
positive.order=TRUE,
main = list(div_chart_title,
x =unit(.55,"npc")),
sub = list("Response", x=unit(.57,"npc")),
xlim = c(-100,-80,-60,-40,-20,0,20,40,60,80,100),
strip = FALSE,
par.strip.text=list(cex=.7)
)

From the 2nd chart, it is observed that United States had a relatively high proportion of people agreeing that they were worried about getting COVID-19 which could explain why United States was one of the countries that were more inclined to vaccination.
Singapore on the other hand, had most respondents disagreeing that there were worried about getting COVID-19. Closer examination would be needed to understand the factors that could be affecting different countries’ inclination towards vaccination.
As discussed in Section 3.1.5, we will use a sample statistic such as the proportion of respondents who strongly agree that they are keen on vaccination (i.e. vac_1 = 5). The statistic will also be plotted with confidence intervals as discussed in Section 3.2.
The following code chunk plots the bar plot and error bars. User will be able to vary and select the following variables to explore the different bar plots.
(1) responseLevel
(2) confidence level
responseLevel <- 5
vac_1 <- vac_1 %>% mutate(countSA = case_when(
response == responseLevel ~ 1,
TRUE ~ as.double(0)
))
vac_1 <- vac_1 %>% mutate(countResponse = 1)
# Obtain contingency table for Prop of SA
propSA_df <- table(vac_1$country,vac_1$countSA) %>% as.data.frame.matrix()
countResponse_df <- table(vac_1$country,vac_1$countResponse) %>% as.data.frame.matrix()
# Change column names, now labelled as 1-5
colnames(propSA_df) <- c("NotSA","SA")
# Compute proportion of SA
propSA_df$propSA <- (propSA_df$SA / (propSA_df$SA + propSA_df$NotSA))
# Compute SE
propSA_df$SE <- sqrt((propSA_df$propSA*(1 - propSA_df$propSA))/ (propSA_df$SA + propSA_df$NotSA))
# compute z-value
confidence_level <- "0.95" # to select between 0.9, 0.95, 0.98, 0.99
if (confidence_level == "0.9") {
z_value <- 1.645
} else if (confidence_level == "0.95") {
z_value <- 1.96
} else if (confidence_level == "0.98") {
z_value <- 2.33
} else if (confidence_level == "0.99") {
z_value <- 2.58
}
# Compute Upper and Lower Limits
propSA_df$UppLim <- propSA_df$propSA + z_value*propSA_df$SE
propSA_df$LowLim <- propSA_df$propSA - z_value*propSA_df$SE
propSA_df$country <- c("Australia","Canada","Denmark",
"Finland","France","Germany",
"Israel","Italy","Japan",
"Netherlands","Norway","Singapore",
"South Korea","Spain","Sweden",
"United Kingdom","United States")
#vertical bar chart with error plots
ggplot(propSA_df) +
geom_bar(aes(reorder(country,propSA),y = propSA), stat = "identity", fill ="dodgerblue3", alpha = 0.8) +
geom_errorbar(aes(x = country, ymin = LowLim, ymax = UppLim), width = 0.5, colour = "firebrick2",
alpha = 0.9, size = 0.6) +
coord_flip()

theme(legend.position = "top")
List of 1
$ legend.position: chr "top"
- attr(*, "class")= chr [1:2] "theme" "gg"
- attr(*, "complete")= logi FALSE
- attr(*, "validate")= logi TRUE
From the above bar plot, we can observe the countries which have the highest proportions of Strongly Agreed to vac2_1 as well as the correspondig confidence intervals.
For Singapore, we can see that it has a lower proportion of people who Strongly Agree to vac2_1 (“worried about COVID19”) than Spain. However as their confidence intervals overlap, this difference may not be statistically significant. However when Singapore is compared to Italy, we are 95% confident that Singapore’s proportion is lower than Italy’s since the confidence intervals do not overlap.
To understand the association between categorical variables as discussed in Section 3.3, we will explore the creation of an UpSet plot.
To start, the user will be able to select / input the following variables.
(1) countrySelected
(2) strength
# Select country - User to input selected country
countrySelected = "Australia"
strength = 4 # For user to determine level of agreement - 4 Agree, 5 Strongly agree
upset_df <- filter(main_df, country == countrySelected)
# vac_1 - willing to get vaccine
upset_df <- upset_df %>% mutate(vac_1_ag = case_when(
vac_1 >= strength ~ as.integer(1),
TRUE ~ as.integer(0)
)
)
# vac2_1 - worried about getting COVID19
upset_df <- upset_df %>% mutate(vac2_1_ag = case_when(
vac2_1 >= strength ~ as.integer(1),
TRUE ~ as.integer(0)
))
# vac2_2 - worried about side effects of vaccine (inv relationship)
upset_df <- upset_df %>% mutate(vac2_2_ag = case_when(
vac2_2 <= (6-strength) ~ as.integer(1), # when strength 4, equivalent is 2. When strength 5, equiv is 1
TRUE ~ as.integer(0)
))
# vac2_3 - confidence in government providing effective vaccine
upset_df <- upset_df %>% mutate(vac2_3_ag = case_when(
vac2_3 >= strength ~ as.integer(1),
TRUE ~ as.integer(0)
))
# vac2_4 - confident vaccine protects recipient from COVID19 effects
upset_df <- upset_df %>% mutate(vac2_4_ag = case_when(
vac2_4 >= strength ~ as.integer(1),
TRUE ~ as.integer(0)
))
# vac2_5 - confidence in vaccine preventing recipient from spreading COVID19
upset_df <- upset_df %>% mutate(vac2_5_ag = case_when(
vac2_5 >= strength ~ as.integer(1),
TRUE ~ as.integer(0)
))
# vac2_6 - will regret if do not take vaccine
upset_df <- upset_df %>% mutate(vac2_6_ag = case_when(
vac2_6 >= strength ~ as.integer(1),
TRUE ~ as.integer(0)
))
# vac_3 - keen to get vaccine 1 year later
upset_df <- upset_df %>% mutate(vac_3_ag = case_when(
vac_3 >= strength ~ as.integer(1),
TRUE ~ as.integer(0)
))
# vac4 - how important vaccine is for health
upset_df <- upset_df %>% mutate(vac4_ag = case_when(
vac4 >= (strength-1) ~ as.integer(1),
TRUE ~ as.integer(0)
))
# vac5 - will get vaccine if available
upset_df <- upset_df %>% mutate(vac5_ag = case_when(
vac5 == 1 ~ as.integer(1),
TRUE ~ as.integer(0)
))
# vac6 - will get vaccine if available
upset_df <- upset_df %>% mutate(vac6_ag = case_when(
vac6 == 1 ~ as.integer(1),
TRUE ~ as.integer(0)
))
# vac7 - trust COVID19 vaccines
upset_df <- upset_df %>% mutate(vac7_ag = case_when(
vac7 >= (strength-1) ~ as.integer(1),
TRUE ~ as.integer(0)
))
# create combination matrix
combiMatrix <- select(upset_df,vac_1_ag,vac2_1_ag,vac2_2_ag,
vac2_3_ag,vac2_4_ag,vac2_5_ag,
vac2_6_ag,vac_3_ag,vac4_ag,
vac5_ag,vac6_ag,vac7_ag,)
combiMatrix <- filter(combiMatrix, vac_1_ag == 1)
# UpSetR cannot have 0 length argument (i.e. 0 0 0 0 0) so investigate only vac_1 = 1 cases
upset(combiMatrix, sets = c("vac_1_ag","vac2_1_ag","vac2_2_ag",
"vac2_3_ag","vac2_4_ag","vac2_5_ag",
"vac2_6_ag","vac_3_ag","vac4_ag",
"vac5_ag","vac6_ag","vac7_ag"),
mb.ratio = c(0.55,0.45), order.by = "freq")

In addition to the UpSet plot, queries can be used to identify specific combinations. For example, if we want to know what the strength of association is like for vac_1 and vac2_1, we can input them in the code below. The corresponding UpSet plot with the required combination will be highlighted (in blue).
upset_df_filtered <- filter(upset_df, vac_1_ag == 1)
upset(upset_df_filtered, main.bar.color = "black",
shade.color = "slategray",
sets = c("vac_1_ag","vac2_1_ag","vac2_2_ag",
"vac2_3_ag","vac2_4_ag","vac2_5_ag",
"vac2_6_ag","vac_3_ag","vac4_ag",
"vac5_ag","vac6_ag","vac7_ag"),
mb.ratio = c(0.55,0.45), order.by = "freq",
queries = list(
list(query = intersects, params = list("vac_1_ag","vac2_1_ag"), active = T)
))

In addition to queries, the UpSet plot can also be plotted with attribute plots below which allow the user to visualize the distribution of the selected combination in relation to other variables in the attribute plots.
myplot <- function(mydata, x, y) {
plot <- (ggplot(data = mydata, aes_string(x = x, y = y, colour = "color")) +
geom_point() + scale_color_identity() + theme(plot.margin = unit(c(0,
0, 0, 0), "cm")))
}
upset(upset_df_filtered, main.bar.color = "black",
shade.color = "slategray",
sets = c("vac_1_ag","vac2_1_ag","vac2_2_ag",
"vac2_3_ag","vac2_4_ag","vac2_5_ag",
"vac2_6_ag","vac_3_ag","vac4_ag",
"vac5_ag","vac6_ag","vac7_ag"),
mb.ratio = c(0.55,0.45), order.by = "freq",
queries = list(
list(query = intersects, params = list("vac_1_ag","vac2_1_ag"), active = T)),
attribute.plots = list(gridrows = 40,
plots = list(list(plot = myplot,
x = "household_children",
y = "household_size", queries = T))))

For the Shiny App, 3 pages are proposed.
(1) Survey Findings
On this page, the survey findings will be presented to the user in the form of a Diverging Stacked Bar Chart, which displays the responses for the selected question for all countries.
Also presented is the Bar Plots with Error Bars that will provide users an idea of the proportions of respondents for each country which strongly agreed or agreed with the question, as well as the confidence interval of the proportions.
Users will be able to select the question of interest to be viewed as well as the confidence level for the error bars.
Figure 10: Survey Findings page
(2) Association of Factors
On this page, users will be able to select the country of interest, strength of response and multiple factors of interest.
What will be displayed is an UpSet plot which provides an overview of the strength of association of the various factors selected.
Users will also be able to select factors to be queried and highlighted in the plot.
There will also be a multiple drop down box for users to select factors to be included in the attributes plot at the bottom.
Figure 11: Association of Factors page
(3) Data Exploration
This page allows users to explore the association between 2 selected variables for a selected country.
If the 2 variables selected are both categorical, a mosaic plot will be displayed. However if 1 variable is continuous, then a density plot will be displayed.
Figure 12: Data Exploration page
With the interactive visualization provided on this sub-module, users will gain deeper insights into the attitudes and perceptions towards vaccination by various countries, and also the determinants and factors that influence those perceptions.