Receptivity of Countries towards COVID-19 Vaccination

This post is a sub-module of the project and focuses on analysing countries’ attitudes towards COVID-19 vaccination. Data for the analyses is obtained from the Imperial College London YouGov Covid 19 Behaviour Tracker Data Hub.

Author

Affiliation

Daniel Lin Yongyan

 

Published

April 1, 2021

DOI

Overview

The project focuses on leveraging the richness of COVID-19 data available across various countries to glean interesting insights and analyses. Some of the key aspects of COVID-19 in focus include trends and patterns in new cases of COVID-19, deaths arising and attitudes towards vaccination, across the different countries.

This post is a sub-module of the project and focuses on analysing countries’ attitudes towards COVID-19 vaccination. Data for the analyses is obtained from the Imperial College London YouGov Covid 19 Behaviour Tracker Data Hub.

YouGov has partnered with the Institute of Global Health Innovation (IGHI) at Imperial College London to gather global insights on people’s behaviours in response to COVID-19. The survey covers 29 countries and interviews around 21,000 people each week and datasets can be obtained from https://github.com/YouGov-Data/covid-19-tracker/tree/master/data.

Relevant questions relating to public’s attitudes towards vaccination have been extracted for the analyses in this post.

The scope of this post is as follows.

1. Preparation of Data
2. Exploratory Data Analysis
3. Review of visual analytic techniques
4. Packages and Functions for Visualization
5. Storyboard for the Shiny Visual Analytics Application

1. Exploration and Preparation of Data

The following code is first input to define Global Settings for code chunks in this post.

knitr::opts_chunk$set(fig.retina=3,
                      echo = TRUE,
                      eval = TRUE,
                      message = TRUE,
                      warning = FALSE)

Installation and loading of required packages

The relevant packages required in the analysis are then installed and loaded with the following code.

packages = c('tidyverse','dplyr','readr','HH',
             'plotly','UpSetR','ggplot2',
             'naniar','dlookr','ggridges','forcats')
for (p in packages){
  if(!require(p, character.only = T)){
  install.packages(p)
  }
  library(p,character.only = T)
}

Importing of datasets and creation of combined dataset

The survey responses of the different countries were stored in separate csv files. Data for the first country was first imported to examine the available variables of interest. This was done with reference to the data dictionary for the datasets (available at https://github.com/YouGov-Data/covid-19-tracker/blob/master/codebook.xlsx).

# Read in 1st csv (Australia)
australia_df <- read_csv("data/australia.csv",col_types = cols(
  age = col_integer(), .default = col_character()
  ))

head(australia_df,5)
# A tibble: 5 x 398
  RecordNo endtime        household_children qweek i1_health i2_health
  <chr>    <chr>          <chr>              <chr> <chr>     <chr>    
1 310      01/04/2020 16~ 0                  week~ 1         1        
2 311      01/04/2020 16~ 0                  week~ 0         6        
3 312      01/04/2020 16~ 1                  week~ 0         0        
4 313      01/04/2020 16~ 1                  week~ 2         20       
5 314      01/04/2020 16~ 1                  week~ 2         0        
# ... with 392 more variables: i7a_health <chr>, i3_health <chr>,
#   i4_health <chr>, i5_health_1 <chr>, i5_health_2 <chr>,
#   i5_health_3 <chr>, i5_health_4 <chr>, i5_health_5 <chr>,
#   i5_health_99 <chr>, i5a_health <chr>, i6_health <chr>,
#   i7b_health <chr>, i8_health <chr>, i9_health <chr>,
#   i10_health <chr>, i11_health <chr>, i12_health_1 <chr>,
#   i12_health_2 <chr>, i12_health_3 <chr>, i12_health_4 <chr>,
#   i12_health_5 <chr>, i12_health_6 <chr>, i12_health_7 <chr>,
#   i12_health_8 <chr>, i12_health_9 <chr>, i12_health_10 <chr>,
#   i12_health_11 <chr>, i12_health_12 <chr>, i12_health_13 <chr>,
#   i12_health_14 <chr>, i12_health_15 <chr>, i12_health_16 <chr>,
#   i12_health_17 <chr>, i12_health_18 <chr>, i12_health_19 <chr>,
#   i12_health_20 <chr>, i13_health <chr>, i14_health_1 <chr>,
#   i14_health_2 <chr>, i14_health_3 <chr>, i14_health_4 <chr>,
#   i14_health_5 <chr>, i14_health_6 <chr>, i14_health_7 <chr>,
#   i14_health_8 <chr>, i14_health_9 <chr>, i14_health_10 <chr>,
#   i14_health_96 <chr>, i14_health_98 <chr>, i14_health_99 <chr>,
#   i14_health_other <chr>, d1_health_1 <chr>, d1_health_2 <chr>,
#   d1_health_3 <chr>, d1_health_4 <chr>, d1_health_5 <chr>,
#   d1_health_6 <chr>, d1_health_7 <chr>, d1_health_8 <chr>,
#   d1_health_9 <chr>, d1_health_10 <chr>, d1_health_11 <chr>,
#   d1_health_12 <chr>, d1_health_13 <chr>, d1_health_98 <chr>,
#   d1_health_99 <chr>, weight <chr>, age <int>, gender <chr>,
#   state <chr>, household_size <chr>, employment_status <chr>,
#   WCRex2 <chr>, WCRV_4 <chr>, CORE_B2_4 <chr>,
#   cantril_ladder <chr>, PHQ4_1 <chr>, PHQ4_2 <chr>, PHQ4_3 <chr>,
#   PHQ4_4 <chr>, m1_1 <chr>, m1_2 <chr>, m1_3 <chr>, m1_4 <chr>,
#   m2 <chr>, m3 <chr>, m4_1 <chr>, m4_2 <chr>, m4_3 <chr>,
#   m4_4 <chr>, m4_96 <chr>, m4_99 <chr>, m4_other <chr>, m5_1 <chr>,
#   m5_2 <chr>, m6_1 <chr>, m6_2 <chr>, m6_3 <chr>, m6_4 <chr>,
#   m6_5 <chr>, ...

While the dataset had 398 variables, only variables which were of interest (i.e.socio-demographic variables and variables related to vaccination attitudes) were retained.

# Create main_df with selected variables from Australia dataset
fieldNames <- c("endtime", "age", "gender", "household_size",
                "household_children", "vac_1", "vac2_1", "vac2_2",
                "vac2_3", "vac2_4", "vac2_5", "vac2_6", "vac_3",
                "vac4", "vac5", "vac6", "vac7")

main_df <- australia_df %>%
  select(any_of(fieldNames)) %>%      # select required variables
  mutate(country = "Australia") %>%   # Add column to store country name
    drop_na(vac_1)                    # remove all records which did not 
                                      # answer vac_1 - willing to take vaccine

glimpse(main_df)
Rows: 7,761
Columns: 18
$ endtime            <chr> "11/11/2020 11:51", "11/11/2020 11:51", "~
$ age                <int> 69, 49, 70, 69, 48, 69, 20, 60, 69, 36, 2~
$ gender             <chr> "Male", "Male", "Female", "Female", "Male~
$ household_size     <chr> "2", "2", "1", "2", "5", "1", "4", "1", "~
$ household_children <chr> "0", "0", "0", "0", "1", "0", "0", "0", "~
$ vac_1              <chr> "1 - Strongly agree", "1 - Strongly agree~
$ vac2_1             <chr> "1 - Strongly agree", "4", "5 – Strongly ~
$ vac2_2             <chr> "3", "3", "5 – Strongly disagree", "1 - S~
$ vac2_3             <chr> "1 - Strongly agree", "2", "1 - Strongly ~
$ vac2_4             <chr> "2", "4", "5 – Strongly disagree", "4", "~
$ vac2_5             <chr> "2", "4", "5 – Strongly disagree", "5 – S~
$ vac2_6             <chr> "1 - Strongly agree", "2", "1 - Strongly ~
$ vac_3              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ vac4               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ vac5               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ vac6               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ vac7               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ country            <chr> "Australia", "Australia", "Australia", "A~

The other country datasets were then imported and merged to the main dataset created (main_df). As the main focus of the sub-module is to gain insights on countries’ attitudes towards vaccination, non-responses (NA) to vac_1 are excluded from main_df.

countrylist <- c("Canada","Denmark","Finland",
                 "France","Germany","Israel",
                 "Italy","Japan","Netherlands",
                 "Norway","Singapore","South-Korea",
                 "Spain","Sweden","United-Kingdom",
                 "United-States")

for(ctry in countrylist){ 
  # loop to read individual country datasets and append to main_df
  fileName <- paste0("data/",ctry,".csv")
  df <- read_csv(fileName,col_types = cols(
    age = col_integer(),
    .default = col_character() 
  ))

  tableName <- paste0(ctry,"_df")
  tableName <- df %>%
  select(any_of(fieldNames)) %>%
  mutate(country = ctry) %>%
    drop_na(vac_1)  
  # main purpose of analysis is to look at vaccination attitudes
  # so records with no response to vac_1 is filtered out
  
  main_df <- bind_rows(main_df,tableName)
}

Recoding of variables

For some of the responses, recoding needs to be done to assign numerical values to allow for easier manipulation of the data. As the responses to the survey are categorical, the numerical values stored in the variables are recoded and stored as characters.

#recode variables
main_df <- main_df %>% mutate(vac_1 = case_when(
  vac_1 == "1 - Strongly agree" ~ "1",
  vac_1 == "5 – Strongly disagree" ~ "5",
  TRUE ~ as.character(vac_1)))

main_df <- main_df %>% mutate(vac2_1 = case_when(
  vac2_1 == "1 - Strongly agree" ~ "1",
  vac2_1 == "5 – Strongly disagree" ~ "5",
  TRUE ~ as.character(vac2_1)))

main_df <- main_df %>% mutate(vac2_2 = case_when(
  vac2_2 == "1 - Strongly agree" ~ "1",
  vac2_2 == "5 – Strongly disagree" ~ "5",
  TRUE ~ as.character(vac2_2)))

main_df <- main_df %>% mutate(vac2_3 = case_when(
  vac2_3 == "1 - Strongly agree" ~ "1",
  vac2_3 == "5 – Strongly disagree" ~ "5",
  TRUE ~ as.character(vac2_3)))

main_df <- main_df %>% mutate(vac2_4 = case_when(
  vac2_4 == "1 - Strongly agree" ~ "1",
  vac2_4 == "5 – Strongly disagree" ~ "5",
  TRUE ~ as.character(vac2_4)))

main_df <- main_df %>% mutate(vac2_5 = case_when(
  vac2_5 == "1 - Strongly agree" ~ "1",
  vac2_5 == "5 – Strongly disagree" ~ "5",
  TRUE ~ as.character(vac2_5)))

main_df <- main_df %>% mutate(vac2_6 = case_when(
  vac2_6 == "1 - Strongly agree" ~ "1",
  vac2_6 == "5 – Strongly disagree" ~ "5",
  TRUE ~ as.character(vac2_6)))

main_df <- main_df %>% mutate(vac_3 = case_when(
  vac_3 == "1 - Strongly agree" ~ "1",
  vac_3 == "5 – Strongly disagree" ~ "5",
  TRUE ~ as.character(vac_3)))

main_df <- main_df %>% mutate(vac4 = case_when(
  vac4 == "Not at all important" ~ "1",
  vac4 == "A little important" ~ "2",
  vac4 == "Moderately important" ~ "3",
  vac4 == "Very important" ~ "4"))

main_df <- main_df %>% mutate(vac5 = case_when(
  vac5 == "Yes" ~ "1",
  vac5 == "No" ~ "2",
  vac5 == "Not sure" ~ "99"))

main_df <- main_df %>% mutate(vac6 = case_when(
  vac6 == "Yes" ~ "1",
  vac6 == "No" ~ "2",
  vac6 == "Not sure" ~ "99"))

main_df <- main_df %>% mutate(vac7 = case_when(
  vac7 == "Not at all" ~ "1",
  vac7 == "A little" ~ "2",
  vac7 == "Moderately" ~ "3",
  vac7 == "Very much" ~ "4"))

main_df <- main_df %>% replace_with_na(replace = list(vac5 = "99", vac6 = "99" ))
# declare responses with "Not Sure" as missing values

Return to Top

2. Exploratory Data Analysis (EDA)

After preparing the data, it would be timely to carry out EDA to further understand distribution of the variables, correlation and relationships between variables. This can be done using the dlookr package.

Univariate data EDA

For the main_df, there is only 1 variable that is numerical, age. We use the describe() function to compute descriptive statistics for age.

describe(main_df)
# A tibble: 1 x 26
  variable      n    na  mean    sd se_mean   IQR skewness kurtosis
  <chr>     <int> <int> <dbl> <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 age      103535     0  47.9  16.5  0.0512    27 -0.00853    -1.03
# ... with 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>,
#   p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
#   p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
#   p99 <dbl>, p100 <dbl>

The density plot of age can also be plotted to investigate the mean age and age distribution of respondents who respondents to vac_1.

a <- ggplot(main_df, aes(x=age))
a + geom_density(aes(y=..count..), fill = "slategrey") +
  geom_vline(aes(xintercept = mean(age)),
             linetype = "dashed", size = 0.8,
             color = "red")

To investigate the age distribution by country, we could use the package ggridges to do a comparison.

# need to install ggridges package
theme_set(theme_ridges())

ggplot(main_df, aes(x = `age`, y = `country`)) +
  geom_density_ridges_gradient(aes(fill = ..x..), scale = 3, size = 0.3) +  
  scale_fill_gradientn(colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
                       name = "Age")+
  labs(title = 'Age Distribution by Country') 
Picking joint bandwidth of 2.65

From the plot, we observe that countries like Sweden, Israel and Finland had a greater proportion of younger respondents, whereas countries like Singapore, Norway and Netherlands had a relatively even distribution of respondents across all ages.

main_df %>%
  mutate(CountryFct = fct_rev(as.factor(country))) %>%
  ggplot(aes(y = CountryFct)) +
  geom_density_ridges(
    aes(x = age, fill = paste(CountryFct, gender)), 
    alpha = .5, color = "white", from = 0, to = 100) +
  labs(x = "Age (years)",y = "Country",
       title = "Age Distribution of Respondents by Gender and Country") +
  scale_y_discrete(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_cyclical(
    breaks = c("Male", "Female"),
    labels = c(`Male` = "Male", `Female` = "Female"),
    values = c("#ff0000", "#0000ff", "#ff8080", "#8080ff"),
    name = "Gender", guide = "legend") +
  coord_cartesian(clip = "off") +
  theme_ridges(grid = FALSE)
Picking joint bandwidth of 3.04

From the plots, it is observed that in general, most countries have similar proportions of male and females (purple areas), except for Finland, France, Germany, Italy and Spain where there is a slightly higher proportion of female (red areas)at the older age bands.

For Categorical data, we could use table to carry out Univariate EDA.

table(main_df$vac_1)

    1     2     3     4     5 
41528 11978 20280  9948 19801 
table(main_df$vac2_1)

    1     2     3     4     5 
23943 22310 29635 15804 11843 
table(main_df$vac2_2)

    1     2     3     4     5 
29117 21198 25083 16326 11811 

Bivariate EDA for Categorical Data

For Bivariate Categorical EDA, we could use the following to find the frequencies or proportions between 2 categorical variables.

my_table_0 <- table(main_df$vac_1, main_df$gender)
print.table(my_table_0)
   
    Female  Male
  1  19896 21632
  2   5899  6079
  3  11010  9270
  4   5354  4594
  5  11353  8448
my_table_01 <- addmargins(my_table_0)
print.table(my_table_01)
     
      Female   Male    Sum
  1    19896  21632  41528
  2     5899   6079  11978
  3    11010   9270  20280
  4     5354   4594   9948
  5    11353   8448  19801
  Sum  53512  50023 103535
my_table_3 <- prop.table(my_table_0, margin = 2) %>% 
  as.data.frame.matrix() 
print.data.frame(my_table_3, digits = 2)
  Female  Male
1   0.37 0.432
2   0.11 0.122
3   0.21 0.185
4   0.10 0.092
5   0.21 0.169

From the above, we are able to observe that males form a greater proportion of those who disagreed or strongly disagreed (responded 2 or 1) while females form a greater proportion of those who agreed or strongly agreed.

Bivariate EDA between categorical and continuous variable

To investigate the relationship between a categorical and continuous variable, we can use the relate function as shown below. The user is able to select the 2 variables for investigation as shown.

# User to select target variable and predictor numerical 
# variables (i.e. vac_1, and Age)

variable_selected <- "vac_1"   # to be selected by user
pred_var <- "age"              # to be selected by user

categ <- target_by(main_df, variable_selected)
Note: Using an external vector in selections is ambiguous.
i Use `all_of(variable_selected)` instead of `variable_selected` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
cat_num <- relate(categ, pred_var)
Note: Using an external vector in selections is ambiguous.
i Use `all_of(pred_var)` instead of `pred_var` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
cat_num
# A tibble: 6 x 27
  variable vac_1      n    na  mean    sd se_mean   IQR skewness
  <chr>    <chr>  <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>    <dbl>
1 age      1      41528     0  52.4  16.6  0.0814    26 -0.312  
2 age      2      11978     0  46.3  16.2  0.148     28  0.0556 
3 age      3      20280     0  44.3  15.6  0.110     26  0.190  
4 age      4       9948     0  43.8  15.6  0.156     26  0.204  
5 age      5      19801     0  45.2  15.5  0.110     26  0.154  
6 age      total 103535     0  47.9  16.5  0.0512    27 -0.00853
# ... with 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>,
#   p05 <dbl>, p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>,
#   p50 <dbl>, p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>,
#   p95 <dbl>, p99 <dbl>, p100 <dbl>
plot(cat_num)

It can be observed that respondents who responded 2,3,4 or 5 have similar age distribution patterns. However, for those that strongly disagreed that they are keen on vaccination, a large proportion come from the older age groups.

The above is a density plot that is plotted when a categorical variable is plotted with a continuous variable. For relationship between 2 categorical variables, the following would be a good plot for visualization.

# User to select target variable and predictor numerical 
# variables (i.e. vac_1, and vac2_1)

variable_selected <- "vac_1"   # to be selected by user
pred_var <- "vac2_1"              # to be selected by user

categ <- target_by(main_df, variable_selected)
cat_num <- relate(categ, pred_var)
cat_num
     vac2_1
vac_1     1     2     3     4     5
    1 13213  9497 10309  5310  3199
    2  2402  3805  3383  1803   585
    3  3401  4284  8131  3093  1371
    4  1482  2205  2996  2349   916
    5  3445  2519  4816  3249  5772
plot(cat_num)

It can be observed that a large proportion of respondents who responded 1 in vac_1 also responded 1 for vac2_1. The same is observed for those who responded 5 in vac_1, where many of them also responded 5 for vac2_1.

Cat3 <- main_df %>%                       
  select(gender, vac_1, vac2_1, household_size) %>%             
  ftable()                                            
Cat3   
                    household_size    1    2    3    4    5    6    7    8 8 or more Don't know Prefer not to say
gender vac_1 vac2_1                                                                                              
Female 1     1                     1455 2720 1261 1046  339  102   31    0        39         13                46
             2                     1032 1741  686  579  178   55   22    2        12          2                21
             3                     1245 1901  746  597  194   63   15    0        19          7                36
             4                      586  790  325  262   90   23    4    0        13          3                12
             5                      340  499  167  158   69   19    7    1         8          5                15
       2     1                      249  462  290  248   90   35   11    0         4          6                12
             2                      312  559  351  339  124   47   20    2        10          5                21
             3                      322  507  296  299  102   35   12    1         9         10                16
             4                      167  276  133  135   57   13    5    0         3          1                 3
             5                       58   84   42   35   18    6    1    0         6          2                 2
       3     1                      362  650  475  460  184   56   25    0        14          3                25
             2                      359  696  504  482  189   43   12    0        10          9                37
             3                      683 1133  812  792  336  107   41    1        23         77               113
             4                      282  447  285  313  124   41   17    0         7         11                21
             5                      141  209  116  114   45   16    7    0         4          4                 8
       4     1                      156  276  209  175   69   33   10    0         2          2                 9
             2                      216  352  260  226  115   28    4    0        10          4                19
             3                      275  514  325  288  134   36   14    1         4          8                22
             4                      214  322  207  207   81   36   19    0        11          8                16
             5                       76  118   84   70   29   10    6    0         4          2                 2
       5     1                      375  648  491  386  152   63   30    1         8          7                32
             2                      268  459  312  256  103   26   11    1         5          5                12
             3                      589  912  605  475  199   57   20    1        13         12                53
             4                      412  542  360  282  115   41   15    2        10          9                22
             5                      659  862  518  402  152   63   28    1        21         20                72
Male   1     1                     1123 2138 1122  938  295  106   46    0        30         32                77
             2                      997 2044  894  751  215   75   32    0         9          5                26
             3                     1150 2310  805  729  220   73   18    1        16         13                34
             4                      701 1334  472  416  137   30   11    0         7          4                13
             5                      433  702  285  260   94   30    7    0        14          8                19
       2     1                      143  266  245  208   65   24    9    2         5          0                16
             2                      341  551  417  432  139   45   15    0         8         11                32
             3                      326  490  344  344  135   48   15    0         6         12                26
             4                      219  309  198  158   60   19    7    0         4          2                17
             5                       82   99   57   47   25    7    2    0         2          3                 5
       3     1                      159  318  256  227   87   35   10    0         8          8                18
             2                      340  493  429  391  130   57   21    0        10          8                29
             3                      698  891  772  846  354  121   32    2        23         93               134
             4                      286  390  312  288  124   60   15    0        10         10                26
             5                      151  198  134  118   47   15    7    0         5          9                11
       4     1                       74  134  123  117   41   19    5    0         7          4                 8
             2                      169  256  185  215   72   30    7    0         6          4                14
             3                      235  363  246  290  116   53   18    0         5         11                21
             4                      210  308  223  250  107   52   23    0        11          5                19
             5                      121  139   84   89   32   19    1    1         4          3                 9
       5     1                      220  349  272  213   87   17    6    0         8          4                28
             2                      227  301  201  177   64   23    7    0         7          1                13
             3                      393  539  349  328  113   33   11    0        11         11                36
             4                      312  469  248  205   94   26   13    0         6          5                17
             5                      805  809  461  426  161   54   22    3        27         40                87

Return to Top

3. Review of visual analytic techniques

3.1 Representation of Likert Scales

For Likert responses, there are several ways in which they can be represented for visualization. The following are some ways.

3.1.1 Grouped Column / Bar Charts One way is to group responses by questions (or other categories) and to represent the responses as column or bar charts in terms of the Likert scale as shown below.

Grouped Bar Chart representing responses to a Likert Scale survey

Figure 1: Grouped Bar Chart representing responses to a Likert Scale survey

While Bar Charts allow easy comparison on the count (frequency) of responses across the different Likert Scores, they do not allow the easy comparison of proportions of the responses (e.g., what is the proportion who strongly agree or agree to a question?)

3.1.2 Multiple Pie Charts While Pie Charts depict the proportions of each type of response to a question (as shown below), it is difficult to compare and visualize the differences in proportion between the questions.

Multiple Pie Charts representing responses to a Likert Scale survey

Figure 2: Multiple Pie Charts representing responses to a Likert Scale survey

3.1.3 Stacked Bar Charts

Stacked bar charts such as the below chart shows clearly the proportion of the various responses for each question asked in a survey. The use of different tones of colours (i.e. Light Green to Dark Green) provide the reader with an idea of increasing levels of responses (i.e. Agree to Strongly Agree).

Stacked Bar Charts representing responses to Likert Scale survey

Figure 3: Stacked Bar Charts representing responses to Likert Scale survey

A benefit of Stacked Bar Charts is that it allows readers to compare different proportions across different bars since each set of stacked bar sums up to 100%. For the example above, it is evident that while a large proportion agreed that they liked the presentation, a lower proportion actually felt that theu learnt something from the presentation.

3.1.4 Diverging Stacked Bar Chart

Another way of representing responses to a Likert Scale survey is Diverging Stacked Bar Chart. This visualization is similar to the Stacked Bar Chart but differs in that the bars have a common vertical baseline located in the the centre of the diagram. The lengths of the segments of the bar charts are proportional to the number of responses for each value of the Likert scale for each question. Segments which represent favourable responses are usually on the right of the baseline while those that are unfavarouble are on the left. Neutral responses are located on the central baseline.

Diverging Stacked Bar Charts representing responses to Likert Scale survey

Figure 4: Diverging Stacked Bar Charts representing responses to Likert Scale survey

The Diverging Stacked Bar Chart is most appropriate to represent responses from Likert Scale Surveys as it allows easy interpretation and comparison of multiple categories.

3.1.5 Comparing the mean Likert Score

While it may seem appropriate to compare the mean Likert Score of each category, there is a potential pitfall in doing so.

Using the below column graph that depicts the responses to a survey on 3 different courses, it is evident that the responses for each of the 3 courses are very different. However when the mean score of responses is computed for each course, the mean appears similar.

Column Bar Charts with responses for 3 different courses

Figure 5: Column Bar Charts with responses for 3 different courses

Mean values of the above responses for each course are as shown.

Course Mean
Course_1 2.84
Course_2 3.04
Course_3 3.00

To mitigate this issue, it would be better to find the proportion of responses that meet a certain value (e.g. proportion of respondents who gave a 4 or 5 score).

3.2 Visualizing Uncertainty

As the results of a survey are usually used to represent a population, the sample (survey) proportion should be represented with confidence intervals to allow readers an indication of the confidence level that the actual population proportion will be within the confidence interval.

3.2.1 Bar plots with Error Bars

One of the ways to represent the confidence intervals of sample data is with the use of error bars as shown below.

Bar Plots with Error Bars

Figure 6: Bar Plots with Error Bars

3.3 Visualizing Correlation

For surveys, it is useful to study the correlation and relationship between responses to understand if there are certain determinants or factors that affect the response of certain questions.

For this project, while we examine the inclination of countries towards vaccination. It would be useful to investigate if that inclination is dependent on certain socio-demographical factors (e.g., age, gender, household size or number of children in the household) or certain attitudes or beliefs (e.g., confidence of vaccine efficacy, or concerns on the side effects of the vaccine).

To visualize the correlation between factors, one of the methods usually used is to build a correlation matrix or correlation scatterplot. However this is usually used to depict the correlation between continuous variables.

Correlation Matrix

Figure 7: Correlation Matrix

Correlation Scatterplot

Figure 8: Correlation Scatterplot

For correlations between categorical variables, one way to visualize correlation is via the use of the UpSet plot via the use of the UpSetR function (see next section).

An UpSet plot (seen below) allows the reader to see how frequently each combination or intersection of different factors takes place. Combinations that occur more frequently indicate a stronger correlation between the factors in the combination.

Upset plot

Figure 9: Upset plot

Return to Top

4. Packages and Functions for Visualization

4.1 Diverging Stacked Bar Chart

As discussed in Section 3.1.4, a Diverging Stacked Bar Chart is the most suitable for visualization of the responses from a Likert Scale survey. The following steps involves steps needed to create a Diverging Stacked Bar Chart from the dataset created.

To prepare the dataset for use by the likert function subsequently, the gather function is used to consolidate the variables and the corresponding responses into 2 columns.

# Make the table long - Gather question data from columns to rows
main_gathered <- gather(main_df, measure, response, c(6:17))

To see responses for the different questions, users will be able to select different variables for qn_selected in the eventual Shiny app. For now, we can manually change the values for qn_selected to simulate the selection.

# Obtain dataset with only responses for selected question

qn_selected <- "vac_1"    # variable of question selected

vac_1 <- filter(main_gathered, measure == qn_selected)

# Obtain contingency table for vac_1
vac_1_df <- table(vac_1$country,vac_1$response) %>% as.data.frame.matrix()
# Change column names, now labelled as 1-5
colnames(vac_1_df) <- c("Strongly Disagree",
                        "Disagree",
                        "Neutral",
                        "Agree",
                        "Strongly Agree")

rownames(vac_1_df) <- c("Australia","Canada","Denmark",
                        "Finland","France","Germany",
                        "Israel","Italy","Japan",
                        "Netherlands","Norway","Singapore",
                        "South Korea","Spain","Sweden",
                        "United Kingdom","United States")
# Remove other columns containing other responses (for other questions)
vac_1_df <- vac_1_df[,c(1:5)]

# Add a column with rownames
vac_1_df <- tibble::rownames_to_column(vac_1_df, var="Country")
# Matching appropriate graph titles to qn_selected
if (qn_selected == "vac_1") {
  div_chart_title <- "Proportion who are willing to take vaccine"
} else if (qn_selected == "vac2_1") {
  div_chart_title <- "Proportion worried about getting COVID-19"
} else if (qn_selected == "vac2_2") {
  div_chart_title <- "Proportion worried about side effects of COVID-19 vaccines"
} else if (qn_selected == "vac2_3") {
  div_chart_title <- "Proportion confident government will provide effective COVID-19 vaccines"
} else if (qn_selected == "vac2_4") {
  div_chart_title <- "Proportion confident vaccine will completely protect recipients from health effects of COVID-19"
} else if (qn_selected == "vac2_5") {
  div_chart_title <- "Proportion confident vaccine will completely prevent transmission of COVID-19 from recipient to others"
} else if (qn_selected == "vac2_6") {
  div_chart_title <- "Proportion who feel they will regret if they do not take the vaccine"
} else if (qn_selected == "vac_3") {
  div_chart_title <- "Proportion who will take the vaccine if available in 1 year"
}

The likert function from the HH package is used to plot the Diverging Stacked Bar Chart in the following code chunk.

# Plot diverging bar chart

likert(Country ~ .,data = vac_1_df, ylab = NULL,
       RefernceZero = 3, as.percent=TRUE,
       positive.order=TRUE,
       main = list(div_chart_title, 
                   x =unit(.55,"npc")),
       sub = list("Response", x=unit(.57,"npc")),
       xlim = c(-100,-80,-60,-40,-20,0,20,40,60,80,100),
       strip = FALSE,
       par.strip.text=list(cex=.7)
       )

From the Diverging Stacked Bar Chart plotted, we are able to observe the various responses for respective countries. For the above example where “vac_1” was selected as qn_selected, we observe that France, followed by the United States and Singapore, were the most receptive to vaccination among the other countries.

By changing the value of qn_selected to other questions (e.g. vac2_1, vac4), we will be able to visualize the responses to the different questions and hence get a deeper understanding of the possible association and determinants behind the attitudes and inclination towards vaccination.

For illustration, when qn_selected is changed to “vac2_1” and the below code chunk is run, the corresponding Diverging is also produced below.

# Obtain dataset with only responses for selected question

qn_selected <- "vac2_1"    # variable of question selected

vac_1 <- filter(main_gathered, measure == qn_selected)

# Obtain contingency table for vac_1
vac_1_df <- table(vac_1$country,vac_1$response) %>% as.data.frame.matrix()
# Change column names, now labelled as 1-5
colnames(vac_1_df) <- c("Strongly Disagree",
                        "Disagree",
                        "Neutral",
                        "Agree",
                        "Strongly Agree")

rownames(vac_1_df) <- c("Australia","Canada","Denmark",
                        "Finland","France","Germany",
                        "Israel","Italy","Japan",
                        "Netherlands","Norway","Singapore",
                        "South Korea","Spain","Sweden",
                        "United Kingdom","United States")
# Remove other columns containing other responses (for other questions)
vac_1_df <- vac_1_df[,c(1:5)]

# Add a column with rownames
vac_1_df <- tibble::rownames_to_column(vac_1_df, var="Country")
# Matching appropriate graph titles to qn_selected
if (qn_selected == "vac_1") {
  div_chart_title <- "Proportion who are willing to take vaccine"
} else if (qn_selected == "vac2_1") {
  div_chart_title <- "Proportion worried about getting COVID-19"
} else if (qn_selected == "vac2_2") {
  div_chart_title <- "Proportion worried about side effects of COVID-19 vaccines"
} else if (qn_selected == "vac2_3") {
  div_chart_title <- "Proportion confident government will provide effective COVID-19 vaccines"
} else if (qn_selected == "vac2_4") {
  div_chart_title <- "Proportion confident vaccine will completely protect recipients from health effects of COVID-19"
} else if (qn_selected == "vac2_5") {
  div_chart_title <- "Proportion confident vaccine will completely prevent transmission of COVID-19 from recipient to others"
} else if (qn_selected == "vac2_6") {
  div_chart_title <- "Proportion who feel they will regret if they do not take the vaccine"
} else if (qn_selected == "vac_3") {
  div_chart_title <- "Proportion who will take the vaccine if available in 1 year"
}
# Plot diverging bar chart

likert(Country ~ .,data = vac_1_df, ylab = NULL,
       RefernceZero = 3, as.percent=TRUE,
       positive.order=TRUE,
       main = list(div_chart_title, 
                   x =unit(.55,"npc")),
       sub = list("Response", x=unit(.57,"npc")),
       xlim = c(-100,-80,-60,-40,-20,0,20,40,60,80,100),
       strip = FALSE,
       par.strip.text=list(cex=.7)
       )

From the 2nd chart, it is observed that United States had a relatively high proportion of people agreeing that they were worried about getting COVID-19 which could explain why United States was one of the countries that were more inclined to vaccination.

Singapore on the other hand, had most respondents disagreeing that there were worried about getting COVID-19. Closer examination would be needed to understand the factors that could be affecting different countries’ inclination towards vaccination.

4.2 Bar plot with Error Bars

As discussed in Section 3.1.5, we will use a sample statistic such as the proportion of respondents who strongly agree that they are keen on vaccination (i.e. vac_1 = 5). The statistic will also be plotted with confidence intervals as discussed in Section 3.2.

The following code chunk plots the bar plot and error bars. User will be able to vary and select the following variables to explore the different bar plots.

(1) responseLevel

(2) confidence level

responseLevel <- 5

vac_1 <- vac_1 %>% mutate(countSA = case_when(
  response == responseLevel ~ 1,
  TRUE ~ as.double(0)
))

vac_1 <- vac_1 %>% mutate(countResponse = 1)
# Obtain contingency table for Prop of SA
propSA_df <- table(vac_1$country,vac_1$countSA) %>% as.data.frame.matrix()
countResponse_df <- table(vac_1$country,vac_1$countResponse) %>% as.data.frame.matrix()

# Change column names, now labelled as 1-5
colnames(propSA_df) <- c("NotSA","SA")

# Compute proportion of SA
propSA_df$propSA <- (propSA_df$SA / (propSA_df$SA + propSA_df$NotSA))

# Compute SE
propSA_df$SE <- sqrt((propSA_df$propSA*(1 - propSA_df$propSA))/ (propSA_df$SA + propSA_df$NotSA))

# compute z-value

confidence_level <- "0.95"  # to select between 0.9, 0.95, 0.98, 0.99

if (confidence_level == "0.9") {
  z_value <- 1.645
} else if (confidence_level == "0.95") {
  z_value <- 1.96
} else if (confidence_level == "0.98") {
  z_value <- 2.33
} else if (confidence_level == "0.99") {
  z_value <- 2.58
}

# Compute Upper and Lower Limits
propSA_df$UppLim <- propSA_df$propSA + z_value*propSA_df$SE
propSA_df$LowLim <- propSA_df$propSA - z_value*propSA_df$SE
propSA_df$country <- c("Australia","Canada","Denmark",
                        "Finland","France","Germany",
                        "Israel","Italy","Japan",
                        "Netherlands","Norway","Singapore",
                        "South Korea","Spain","Sweden",
                        "United Kingdom","United States")
#vertical bar chart with error plots
ggplot(propSA_df) +
 geom_bar(aes(reorder(country,propSA),y = propSA), stat = "identity", fill ="dodgerblue3", alpha = 0.8) +
 geom_errorbar(aes(x = country, ymin = LowLim, ymax = UppLim), width = 0.5, colour = "firebrick2",
               alpha = 0.9, size = 0.6) +
  coord_flip()
  theme(legend.position = "top")
List of 1
 $ legend.position: chr "top"
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi FALSE
 - attr(*, "validate")= logi TRUE

From the above bar plot, we can observe the countries which have the highest proportions of Strongly Agreed to vac2_1 as well as the correspondig confidence intervals.

For Singapore, we can see that it has a lower proportion of people who Strongly Agree to vac2_1 (“worried about COVID19”) than Spain. However as their confidence intervals overlap, this difference may not be statistically significant. However when Singapore is compared to Italy, we are 95% confident that Singapore’s proportion is lower than Italy’s since the confidence intervals do not overlap.

4.3 UpSet plots with UpSetR package

To understand the association between categorical variables as discussed in Section 3.3, we will explore the creation of an UpSet plot.

To start, the user will be able to select / input the following variables.

(1) countrySelected

(2) strength

# Select country - User to input selected country
countrySelected = "Australia"   
strength = 4  # For user to determine level of agreement - 4 Agree, 5 Strongly agree

upset_df <- filter(main_df, country == countrySelected)
# vac_1 - willing to get vaccine
upset_df <- upset_df %>% mutate(vac_1_ag = case_when(
  vac_1 >= strength ~ as.integer(1),
  TRUE ~ as.integer(0)
)
)

# vac2_1 - worried about getting COVID19
upset_df <- upset_df %>% mutate(vac2_1_ag = case_when(
  vac2_1 >= strength ~ as.integer(1),
  TRUE ~ as.integer(0)
))

# vac2_2 - worried about side effects of vaccine (inv relationship)
upset_df <- upset_df %>% mutate(vac2_2_ag = case_when(
  vac2_2 <= (6-strength) ~ as.integer(1),   # when strength 4, equivalent is 2. When strength 5, equiv is 1
  TRUE ~ as.integer(0)
))

# vac2_3 - confidence in government providing effective vaccine
upset_df <- upset_df %>% mutate(vac2_3_ag = case_when(
  vac2_3 >= strength ~ as.integer(1),   
  TRUE ~ as.integer(0)
))

# vac2_4 - confident vaccine protects recipient from COVID19 effects
upset_df <- upset_df %>% mutate(vac2_4_ag = case_when(
  vac2_4 >= strength ~ as.integer(1),   
  TRUE ~ as.integer(0)
))

# vac2_5 - confidence in vaccine preventing recipient from spreading COVID19
upset_df <- upset_df %>% mutate(vac2_5_ag = case_when(
  vac2_5 >= strength ~ as.integer(1),   
  TRUE ~ as.integer(0)
))

# vac2_6 - will regret if do not take vaccine
upset_df <- upset_df %>% mutate(vac2_6_ag = case_when(
  vac2_6 >= strength ~ as.integer(1),   
  TRUE ~ as.integer(0)
))

# vac_3 - keen to get vaccine 1 year later
upset_df <- upset_df %>% mutate(vac_3_ag = case_when(
  vac_3 >= strength ~ as.integer(1),   
  TRUE ~ as.integer(0)
))

# vac4 - how important vaccine is for health
upset_df <- upset_df %>% mutate(vac4_ag = case_when(
  vac4 >= (strength-1) ~ as.integer(1),
  TRUE ~ as.integer(0)
))

# vac5 - will get vaccine if available
upset_df <- upset_df %>% mutate(vac5_ag = case_when(
  vac5 == 1 ~ as.integer(1),   
  TRUE ~ as.integer(0)
))

# vac6 - will get vaccine if available
upset_df <- upset_df %>% mutate(vac6_ag = case_when(
  vac6 == 1 ~ as.integer(1),  
  TRUE ~ as.integer(0)
))

# vac7 - trust COVID19 vaccines
upset_df <- upset_df %>% mutate(vac7_ag = case_when(
  vac7 >= (strength-1) ~ as.integer(1),
  TRUE ~ as.integer(0)
))
# create combination matrix

combiMatrix <- select(upset_df,vac_1_ag,vac2_1_ag,vac2_2_ag,
                      vac2_3_ag,vac2_4_ag,vac2_5_ag,
                      vac2_6_ag,vac_3_ag,vac4_ag,
                      vac5_ag,vac6_ag,vac7_ag,)

combiMatrix <- filter(combiMatrix, vac_1_ag == 1) 
# UpSetR cannot have 0 length argument (i.e. 0 0 0 0 0) so investigate only vac_1 = 1 cases
upset(combiMatrix, sets = c("vac_1_ag","vac2_1_ag","vac2_2_ag",
                            "vac2_3_ag","vac2_4_ag","vac2_5_ag",          
                            "vac2_6_ag","vac_3_ag","vac4_ag",
                            "vac5_ag","vac6_ag","vac7_ag"),
      mb.ratio = c(0.55,0.45), order.by = "freq")

In addition to the UpSet plot, queries can be used to identify specific combinations. For example, if we want to know what the strength of association is like for vac_1 and vac2_1, we can input them in the code below. The corresponding UpSet plot with the required combination will be highlighted (in blue).

upset_df_filtered <- filter(upset_df, vac_1_ag == 1) 
upset(upset_df_filtered, main.bar.color = "black",
      shade.color = "slategray",
      sets = c("vac_1_ag","vac2_1_ag","vac2_2_ag",
                "vac2_3_ag","vac2_4_ag","vac2_5_ag",          
                "vac2_6_ag","vac_3_ag","vac4_ag",
                "vac5_ag","vac6_ag","vac7_ag"),
      mb.ratio = c(0.55,0.45), order.by = "freq",
      queries = list(
        list(query = intersects, params = list("vac_1_ag","vac2_1_ag"), active = T)
        ))

In addition to queries, the UpSet plot can also be plotted with attribute plots below which allow the user to visualize the distribution of the selected combination in relation to other variables in the attribute plots.

myplot <- function(mydata, x, y) {
    plot <- (ggplot(data = mydata, aes_string(x = x, y = y, colour = "color")) + 
        geom_point() + scale_color_identity() + theme(plot.margin = unit(c(0, 
        0, 0, 0), "cm")))
}

upset(upset_df_filtered, main.bar.color = "black",
      shade.color = "slategray",
      sets = c("vac_1_ag","vac2_1_ag","vac2_2_ag",
                "vac2_3_ag","vac2_4_ag","vac2_5_ag",          
                "vac2_6_ag","vac_3_ag","vac4_ag",
                "vac5_ag","vac6_ag","vac7_ag"),
      mb.ratio = c(0.55,0.45), order.by = "freq",
      queries = list(
        list(query = intersects, params = list("vac_1_ag","vac2_1_ag"), active = T)),
      attribute.plots = list(gridrows = 40,
                             plots = list(list(plot = myplot, 
                                               x = "household_children",
                                               y = "household_size", queries = T))))

Return to Top

5. Storyboard for the Shiny Visual Analytics Application

For the Shiny App, 3 pages are proposed.

(1) Survey Findings

On this page, the survey findings will be presented to the user in the form of a Diverging Stacked Bar Chart, which displays the responses for the selected question for all countries.

Also presented is the Bar Plots with Error Bars that will provide users an idea of the proportions of respondents for each country which strongly agreed or agreed with the question, as well as the confidence interval of the proportions.

Users will be able to select the question of interest to be viewed as well as the confidence level for the error bars.

Survey Findings page

Figure 10: Survey Findings page

(2) Association of Factors

On this page, users will be able to select the country of interest, strength of response and multiple factors of interest.

What will be displayed is an UpSet plot which provides an overview of the strength of association of the various factors selected.

Users will also be able to select factors to be queried and highlighted in the plot.

There will also be a multiple drop down box for users to select factors to be included in the attributes plot at the bottom.

Association of Factors page

Figure 11: Association of Factors page

(3) Data Exploration

This page allows users to explore the association between 2 selected variables for a selected country.

If the 2 variables selected are both categorical, a mosaic plot will be displayed. However if 1 variable is continuous, then a density plot will be displayed.

Data Exploration page

Figure 12: Data Exploration page

Conclusion

With the interactive visualization provided on this sub-module, users will gain deeper insights into the attitudes and perceptions towards vaccination by various countries, and also the determinants and factors that influence those perceptions.

Return to Top

Footnotes