Exploring the Temporal Evolution of COVID-19 Cases

This post covers the planning and prototyping process for one of the three sub-modules of a Shiny-based Visual Analytics Application project for the course ISSS608. The purpose of this sub-module is to investigate the spread of Coronavirus (COVID-19) over time for different countries, and to build models that forecast the near-future case load in each country, using the Johns Hopkins COVID-19 dataset.

Siow Mun Chai
04-09-2021

Overview

The purpose of this assignment is to provide estimates of the near-future COVID-19 case load in each country, based on global data collated by Johns Hopkins University. The data is a sequence of observations stored in time order (time series data).

Time series analysis aims to extract meaningful characteristics from time series data and generate useful insights. In our case, we will attempt to forecast the near-future number of confirmed cases. Time series have several characteristics that make their analysis different from other types of data, such as trend, seasonality, and autocorrelation.

These characteristics will affect the outcome of the forecasting models. Decomposing a time series into its components (trend, cycle, seasonality, and random/stationary remainder) is an important part of time series analysis: it helps us understand the time series better and can also improve the forecast result.
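As a minimal illustration of classical decomposition (using base R's built-in AirPassengers monthly series rather than the COVID-19 data):

```r
# Classical decomposition of a built-in monthly series (frequency = 12)
# into trend, seasonal, and random components.
components <- decompose(AirPassengers)

head(components$trend)     # smoothed long-term movement
head(components$seasonal)  # repeating within-year pattern
plot(components)           # observed, trend, seasonal, and random panels
```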

Literature Review of Current Visualisations on COVID-19 Confirmed Cases

Most of the sites that forecast the near-future case load use time series charts and/or present values in tabular form in white papers (see Fig 1). These sites/papers give detailed explanations of the mathematical models, but do not let users explore the dataset.

Fig 1: Time Series Chart

Proposed Enhancement

Suggested Visualisations and R packages

Most of the visualisations for the forecast of Covid-19 confirmed cases are time series charts with little interactivity for the users. Most of the time, only one model's result is shown.

In this assignment, we will attempt to create an interactive visualisation. First, the application will have a page that allows users to explore the time series chart (trend, seasonality, anomalies, etc.). The forecasting page will then allow users to choose the date range to be used for the forecast. Countries have seen spikes in Covid-19 confirmed cases at different times and have taken different measures to control the spread; letting users choose the date range therefore produces forecasts that best suit their requirements. Multiple models will also be shown so that users can compare the outcomes.

The following R packages will be explored:

Visualisation               Packages
------------------------    --------------------
EDA of Time Series Data     ggplot, TimeTK
Forecasting                 Forecast, ModelTime

Data Source

The COVID-19 dataset is a time series dataset, with observations recorded daily. It is sourced from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, which collates the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries. The data used for this assignment is the cumulative daily confirmed cases for all affected countries, and can be downloaded at https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv.

Installing and launching R packages

A number of R packages are required for this assignment. This code chunk installs the missing packages and loads them into the R environment.

packages = c('xgboost', 'tidymodels', 'modeltime', 'tidyverse', 'lubridate', 'timetk',
             'earth', 'tseries', 'forecast', 'viridis', 'plotly', 'tsibble',
             'tsintermittent', 'randomForest', 'fpp3', 'Metrics', 'gghighlight')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Loading the dataset onto R

Data import was accomplished using read_csv() from the readr package.

confirmed_cases_raw <- read_csv("data/time_series_covid19_confirmed_global.csv") 
confirmed_cases_raw
# A tibble: 274 x 435
   `Province/State`  `Country/Region`   Lat   Long `1/22/20` `1/23/20`
   <chr>             <chr>            <dbl>  <dbl>     <dbl>     <dbl>
 1 <NA>              Afghanistan       33.9  67.7          0         0
 2 <NA>              Albania           41.2  20.2          0         0
 3 <NA>              Algeria           28.0   1.66         0         0
 4 <NA>              Andorra           42.5   1.52         0         0
 5 <NA>              Angola           -11.2  17.9          0         0
 6 <NA>              Antigua and Bar~  17.1 -61.8          0         0
 7 <NA>              Argentina        -38.4 -63.6          0         0
 8 <NA>              Armenia           40.1  45.0          0         0
 9 Australian Capit~ Australia        -35.5 149.           0         0
10 New South Wales   Australia        -33.9 151.           0         0
# ... with 264 more rows, and 429 more variables: one <dbl> column per day,
#   from 1/24/20 through the latest date in the file

The following are observed from the dataset: it is in wide format (one column per date), the counts are cumulative, and some countries (e.g. Australia) are recorded at the Province/State level.

Data Cleaning

The following code converts the data to the correct format for subsequent analysis.

confirmed_cases_country_level <- confirmed_cases_raw %>%
  gather(Date, Total_Cases, -`Province/State`, -`Country/Region`, -Lat, -Long) %>%  # collect all the date columns into a single Date variable
  group_by(`Country/Region`, Date) %>%
  summarize(total = sum(Total_Cases)) %>%
  select(`Country/Region`, Date, total) %>%
  set_names(c("Country", "Date", "TotalCases")) %>%  # rename the variables
  ungroup()

confirmed_cases_country_level$Date <- as.Date(confirmed_cases_country_level$Date,format="%m/%d/%y")  #converting the Date variable to Date format

# finding the daily new confirmed cases.
confirmed_cases_country_level <- confirmed_cases_country_level %>%
  group_by(Country) %>%
  arrange(Country,Date)%>%
  mutate(Daily_new_cases = TotalCases - lag(TotalCases, n =1))%>%
  drop_na()

confirmed_cases_country_level
# A tibble: 82,560 x 4
# Groups:   Country [192]
   Country     Date       TotalCases Daily_new_cases
   <chr>       <date>          <dbl>           <dbl>
 1 Afghanistan 2020-01-23          0               0
 2 Afghanistan 2020-01-24          0               0
 3 Afghanistan 2020-01-25          0               0
 4 Afghanistan 2020-01-26          0               0
 5 Afghanistan 2020-01-27          0               0
 6 Afghanistan 2020-01-28          0               0
 7 Afghanistan 2020-01-29          0               0
 8 Afghanistan 2020-01-30          0               0
 9 Afghanistan 2020-01-31          0               0
10 Afghanistan 2020-02-01          0               0
# ... with 82,550 more rows
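As an aside, gather() has been superseded in tidyr; the same reshaping can be expressed with pivot_longer(). A sketch on a hypothetical two-day miniature of the wide CSSE layout (toy values, not the real data):

```r
library(tidyr)

# Hypothetical two-day miniature of the wide CSSE layout.
wide <- tibble::tibble(
  `Province/State` = NA_character_,
  `Country/Region` = c("Singapore", "Vietnam"),
  Lat  = c(1.35, 14.06),
  Long = c(103.82, 108.28),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(1, 2)
)

# Collect the date columns into Date/Total_Cases pairs,
# mirroring the gather() step used for the full dataset.
long <- pivot_longer(
  wide,
  cols      = -c(`Province/State`, `Country/Region`, Lat, Long),
  names_to  = "Date",
  values_to = "Total_Cases"
)
long  # 4 rows: one per country-date pair
```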

EDA - Visualise the Distribution of the Daily Confirmed Cases for each Country - Using TimeTK package.

To allow users to explore and visualise the time series data for each country, time series charts will be plotted. Since the dataset contains many countries, it is useful to show several countries together for comparison. The following sections explore the optimal number of charts that can be displayed at any one time, using the TimeTK package.

South East Asia - 11 Countries in 4 rows of 3 countries each

Create Time Series Plots for South East Asia (Brunei, Burma, Cambodia, Timor-Leste, Indonesia, Laos, Malaysia, Philippines, Singapore, Thailand, Vietnam).

Setting .facet_ncol =3 to have 4 rows of 3 countries.

#selecting only the south east asia countries from the dataset
#(use %in%, not ==, so that all 11 countries are matched)
SEA <- filter(confirmed_cases_country_level,
              Country %in% c("Brunei", "Burma", "Cambodia", "Timor-Leste", "Indonesia", "Laos",
                             "Malaysia", "Philippines", "Singapore", "Thailand", "Vietnam")) #11 countries

SEA %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 3, .facet_scales = "free",
                   .interactive = TRUE)

South East Asia - 11 Countries in 6 rows of 2 countries each

Setting .facet_ncol =2 to have 6 rows of 2 countries.

SEA %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 2, .facet_scales = "free",
                   .interactive = TRUE)

Europe - 15 Countries in 4 rows of 4 countries each

Create Time Series Plots for European Countries (Austria, Belgium, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, Sweden, United Kingdom).

Setting .facet_ncol =4 to have 4 rows of 4 countries.

#choosing the european countries (use %in%, not ==, so that all 15 countries are matched)
EU <- filter(confirmed_cases_country_level,
             Country %in% c("Austria", "Belgium", "Denmark", "Finland", "France", "Germany",
                            "Greece", "Ireland", "Italy", "Luxembourg", "Netherlands",
                            "Portugal", "Spain", "Sweden", "United Kingdom"))

EU %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 4, .facet_scales = "free",
                   .interactive = TRUE)

Europe - 15 Countries in 5 rows of 3 countries each

Setting .facet_ncol =3 to have 5 rows of 3 countries.

EU %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 3, .facet_scales = "free",
                   .interactive = TRUE)

US and China - 2 Countries with slider charts

Setting .plotly_slider = TRUE to try out the slider function. Sliders can be used to change the data displayed.

Other1 <- filter(confirmed_cases_country_level, Country %in% c("US", "China")) #use %in%, not ==, to match both countries

Other1 %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 2, .facet_scales = "free",
                   .interactive = TRUE,
                   .plotly_slider = TRUE)

US - 1 Country with slider chart

US_Confirmed_Cases <- filter(confirmed_cases_country_level,Country == "US")

US_Confirmed_Cases %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 2, .facet_scales = "free",
                   .interactive = TRUE,
                   .plotly_slider = TRUE)

STL Diagnostics - included in TimeTK package - Determining the Trend and Seasonality of the data

plot_stl_diagnostics() generates a Seasonal-Trend decomposition using Loess (STL). It allows users to view the three components of the time series: the trend, the seasonality, and the remainder.

#STL Diagnostics

#Using US dataset
US_Confirmed_Cases %>%
    plot_stl_diagnostics(
        Date, Daily_new_cases,
        .frequency = "auto", .trend = "auto",
        .feature_set = c("observed", "season", "trend", "remainder"),
        .interactive = TRUE)

Anomaly Diagnostics - included in TimeTK package - Determining any anomalies in the dataset.

plot_anomaly_diagnostics is an interactive and scalable function for visualizing anomalies in time series data.

US_Confirmed_Cases %>%
  plot_anomaly_diagnostics(Date, Daily_new_cases, .facet_ncol = 2)

Lag Diagnostics - included in TimeTK package - Determining the Correlation in the dataset

Autocorrelation is the correlation between a time series and lagged versions of itself. plot_acf_diagnostics() returns the ACF and PACF of a target, and optionally the CCFs of one or more lagged predictors, in interactive plotly plots.

US_Confirmed_Cases %>%
    plot_acf_diagnostics(Date, Daily_new_cases,
                         .facet_ncol = 2,
                         .facet_scales = "free",
                         .interactive = TRUE)

EDA - Visualise the daily confirmed cases for each country - Using ggplot package.

We will explore using ggplot to plot the time series charts for different countries.

The following code will plot the chart for South East Asia.

ggplot(SEA, aes(Date,Daily_new_cases, color=Country)) +
  geom_path(color='blue', lineend = "round", size=1) +
  facet_wrap(~Country, scales="free") +
  geom_smooth()

Trying out the gghighlight() function and removing the smooth line.

ggplot(SEA, aes(Date,Daily_new_cases, color=Country)) +
  geom_path(color='blue', lineend = "round", size=1) +
  gghighlight() +
  facet_wrap(~Country, scales="free") 

Decomposing the time series into Trend, Seasonality and Remainder.

(The following code returns an error - Error in decompose(.): time series has no or less than 2 periods - because decompose() expects a ts object with an explicit frequency, whereas US_Confirmed_Cases is a tibble.)

US_Confirmed_Cases %>%
  decompose() %>%
  autoplot()
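A possible workaround (an assumption on our part, not part of the original analysis): decompose() needs a ts object with an explicit frequency and at least two full periods, so the daily counts can first be converted with an assumed weekly seasonality of 7.

```r
# Sketch: convert the tibble column to a ts object before decomposing.
# frequency = 7 assumes weekly seasonality in the daily counts.
US_Confirmed_Cases$Daily_new_cases %>%
  ts(frequency = 7) %>%
  decompose() %>%
  autoplot()
```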

Observations for Packages that support EDA

Both the TimeTK and ggplot packages have functions to support EDA.

From the visualisations, the TimeTK package is observed to be more customisable and flexible to use. Therefore, we will be using the TimeTK package for EDA.


Building Time Series Forecasting Models, using the Forecast/fpp3 packages

For a start, the US dataset would be used to build the forecasting models.

Creating the Training and Test dataset

We will first create a Training and Test dataset.

A training set is used to build the model, while a test set is used to validate it; data points in the training set are excluded from the test set.

The following code breaks the US dataset into a Training set (80%) and a Test set (20%).

US_train_tbl <- training(initial_time_split(US_Confirmed_Cases, prop = 0.8)) # 80% training and 20% test
US_test_tbl <- testing(initial_time_split(US_Confirmed_Cases, prop = 0.8))

US_train_tbl %>% mutate(type = "train") %>%
  bind_rows(US_test_tbl%>%mutate(type="test"))%>%
  ggplot(aes(x=Date,y=Daily_new_cases,color=type)) +
  geom_line() +  ggtitle("US")
US_train_tbl
# A tibble: 344 x 4
# Groups:   Country [1]
   Country Date       TotalCases Daily_new_cases
   <chr>   <date>          <dbl>           <dbl>
 1 US      2020-01-23          1               0
 2 US      2020-01-24          2               1
 3 US      2020-01-25          2               0
 4 US      2020-01-26          5               3
 5 US      2020-01-27          5               0
 6 US      2020-01-28          5               0
 7 US      2020-01-29          6               1
 8 US      2020-01-30          6               0
 9 US      2020-01-31          8               2
10 US      2020-02-01          8               0
# ... with 334 more rows
US_test_tbl
# A tibble: 86 x 4
# Groups:   Country [1]
   Country Date       TotalCases Daily_new_cases
   <chr>   <date>          <dbl>           <dbl>
 1 US      2021-01-01   20252310          153510
 2 US      2021-01-02   20552726          300416
 3 US      2021-01-03   20761467          208741
 4 US      2021-01-04   20945766          184299
 5 US      2021-01-05   21180814          235048
 6 US      2021-01-06   21436203          255389
 7 US      2021-01-07   21714268          278065
 8 US      2021-01-08   22007515          293247
 9 US      2021-01-09   22270172          262657
10 US      2021-01-10   22483440          213268
# ... with 76 more rows

Fitting the Training data to the respective Models - FPP3 Package.

A total of 10 models are created, namely: NAIVE Model, SNAIVE Model, DRIFT Model, SES Model, Holt’s Linear Model, Damped Holt’s Linear Model, STL ETS Model, ARIMA, Dynamic Harmonic Regression Model and TSLM.

US_tsbl <- as_tsibble(US_train_tbl)

fpp3_model_table <- US_tsbl %>%
  model(
    ## Model 1: Naive ----
    naive_mod = NAIVE(Daily_new_cases),
    ## Model 2: Snaive ----
    snaive_mod = SNAIVE(Daily_new_cases),
    ## Model 3: Drift ----
    drift_mod = RW(Daily_new_cases ~ drift()),
    ## Model 4: SES ----
    ses_mod = ETS(Daily_new_cases ~ error("A") + trend("N") + season("N"), opt_crit = "mse"),
    ## Model 5: Holt's Linear ----
    hl_mod = ETS(Daily_new_cases  ~ error("A") + trend("A") + season("N"), opt_crit = "mse"),
    ## Model 6: Damped Holt's Linear ----
    hldamp_mod = ETS(Daily_new_cases  ~ error("A") + trend("Ad") + season("N"), opt_crit = "mse"),
    ## Model 7: STL decomposition with ETS ----
    stl_ets_mod = decomposition_model(STL(Daily_new_cases), ETS(season_adjust ~ season("N"))),
    ## Model 8: ARIMA ----
    arima_mod = ARIMA(Daily_new_cases),
    ## Model 9: Dynamic harmonic regression ----
    dhr_mod = ARIMA(Daily_new_cases ~ PDQ(0,0,0) + fourier(K=2)),
    ## Model 10: TSLM ----
    tslm_mod = TSLM(Daily_new_cases ~ Date)
  )

Forecast and calibrate using the Test Data, and display the RMSE for each model.

US_tsbl_test <- as_tsibble(US_test_tbl)

forecast_tbl <- fpp3_model_table %>%
  forecast(US_tsbl_test, times = 0)%>%
   as_tibble() %>%
   select(Date, Daily_new_cases, .model, fc_qty = .mean)

forecast_tbl <- US_Confirmed_Cases %>%
  as_tibble() %>% # ensure a plain (ungrouped) tibble
  select(Date, Daily_new_cases) %>%
  right_join(forecast_tbl, by = "Date") %>% # join the forecast values
  mutate(fc_qty = ifelse(fc_qty < 0, 0, fc_qty)) # floor negative forecasts at zero

accuracy_tbl <- forecast_tbl %>%
  group_by(.model) %>%
  summarise(accuracy_rmse = Metrics::rmse(Daily_new_cases.x, fc_qty)) # calculate RMSE
  
  accuracy_tbl
# A tibble: 10 x 2
   .model      accuracy_rmse
 * <chr>               <dbl>
 1 arima_mod         122817.
 2 dhr_mod           151250.
 3 drift_mod         170477.
 4 hl_mod            180090.
 5 hldamp_mod        138663.
 6 naive_mod         136876.
 7 ses_mod           121623.
 8 snaive_mod        109711.
 9 stl_ets_mod       126826.
10 tslm_mod           95737.
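As a quick sanity check on the metric reported above, RMSE can be computed by hand on toy numbers (not the COVID-19 data):

```r
# RMSE is the root of the mean squared forecast error.
actual    <- c(100, 150, 200)
predicted <- c(110, 140, 190)
sqrt(mean((actual - predicted)^2))  # 10; Metrics::rmse() computes the same quantity
```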

Building Time Series Forecasting Models, using the ModelTime package.

For comparison, we will be building forecasting models using the ModelTime package as well.

Creating the Training and Test dataset

Similarly to the previous section, we will create a Training and Test dataset.

US_train_data <- training(initial_time_split(US_Confirmed_Cases, prop = 0.8)) # 80% training and 20% test
US_test_data <- testing(initial_time_split(US_Confirmed_Cases, prop = 0.8))

US_train_data %>% mutate(type = "train") %>%
  bind_rows(US_test_data%>%mutate(type="test"))%>%
  ggplot(aes(x=Date,y=Daily_new_cases,color=type)) +
  geom_line() +  ggtitle("US")

Fitting the Training data to the respective Models - ModelTime Package.

USmodel_fit_arima_no_boost <- arima_reg() %>%
    set_engine(engine = "auto_arima") %>%
    fit(Daily_new_cases ~ Date, data = US_train_data)

USmodel_fit_arima_no_boost
parsnip model object

Fit time:  461ms 
Series: outcome 
ARIMA(0,1,2)(1,0,2)[7] 

Coefficients:
          ma1     ma2    sar1     sma1    sma2
      -0.7187  0.1412  0.8711  -0.7250  0.1975
s.e.   0.0546  0.0578  0.0686   0.0966  0.1002

sigma^2 estimated as 148451337:  log likelihood=-3713.61
AIC=7439.22   AICc=7439.47   BIC=7462.24
USmodel_fit_arima_boosted <- arima_boost(
    min_n = 2,
    learn_rate = 0.015
) %>%
    set_engine(engine = "auto_arima_xgboost") %>%
    fit(Daily_new_cases ~ Date + as.numeric(Date) + factor(month(Date, label = TRUE), ordered = F),
        data = US_train_data)

USmodel_fit_arima_boosted
parsnip model object

Fit time:  3.4s 
ARIMA(1,1,1)(1,0,2)[7] w/ XGBoost Errors
---
Model 1: Auto ARIMA
Series: outcome 
ARIMA(1,1,1)(1,0,2)[7] 

Coefficients:
          ar1      ma1    sar1     sma1    sma2
      -0.2135  -0.5056  0.8590  -0.7097  0.2031
s.e.   0.0848   0.0743  0.0691   0.0968  0.0994

sigma^2 estimated as 148520204:  log likelihood=-3713.6
AIC=7439.2   AICc=7439.45   BIC=7462.23

---
Model 2: XGBoost Errors

xgboost::xgb.train(params = list(eta = 0.015, max_depth = 6, 
    gamma = 0, colsample_bytree = 1, min_child_weight = 2, subsample = 1, 
    objective = "reg:squarederror"), data = x$data, nrounds = 15, 
    watchlist = x$watchlist, verbose = 0, nthread = 1)
USmodel_fit_ets <- exp_smoothing() %>%
    set_engine(engine = "ets") %>%
    fit(Daily_new_cases ~ Date , data = US_train_data)

USmodel_fit_ets
parsnip model object

Fit time:  80ms 
ETS(A,A,A) 

Call:
 forecast::ets(y = outcome, model = model_ets, damped = damping_ets) 

  Smoothing parameters:
    alpha = 0.3533 
    beta  = 0.006 
    gamma = 0.1348 

  Initial states:
    l = -1537.1621 
    b = 205.0553 
    s = 2179.517 -21.3413 -5304.516 -6968.096 990.3832 6167.667
           2956.387

  sigma:  12388.67

     AIC     AICc      BIC 
8506.083 8507.025 8552.171 
USmodel_fit_prophet <- prophet_reg() %>%
    set_engine(engine = "prophet") %>%
    fit(Daily_new_cases ~ Date, data = US_train_data)

USmodel_fit_prophet
parsnip model object

Fit time:  790ms 
PROPHET Model
- growth: 'linear'
- n.changepoints: 25
- changepoint.range: 0.8
- yearly.seasonality: 'auto'
- weekly.seasonality: 'auto'
- daily.seasonality: 'auto'
- seasonality.mode: 'additive'
- changepoint.prior.scale: 0.05
- seasonality.prior.scale: 10
- holidays.prior.scale: 10
- logistic_cap: NULL
- logistic_floor: NULL
- extra_regressors: 0
USmodel_fit_lm <- linear_reg() %>%
    set_engine("lm") %>%
    #fit(Daily_new_cases ~ as.numeric(Date) + factor(month(Date, label = TRUE), ordered = FALSE),
    fit(Daily_new_cases ~ as.numeric(Date),
        data = US_train_data)

USmodel_fit_lm
parsnip model object

Fit time:  0ms 

Call:
stats::lm(formula = Daily_new_cases ~ as.numeric(Date), data = data)

Coefficients:
     (Intercept)  as.numeric(Date)  
      -9403557.1             512.7  
USmodel_spec_mars <- mars(mode = "regression") %>%
    set_engine("earth") 

USrecipe_spec <- recipe(Daily_new_cases ~ Date, data = US_train_data) %>%
    step_date(Date, features = "month", ordinal = FALSE) %>%
    step_mutate(date_num = as.numeric(Date)) %>%
    step_normalize(date_num) %>%
    step_rm(Date)
  
USwflw_fit_mars <- workflow() %>%
    add_recipe(USrecipe_spec) %>%
    add_model(USmodel_spec_mars) %>%
    fit(US_train_data)

USwflw_fit_mars
== Workflow [trained] ================================================
Preprocessor: Recipe
Model: mars()

-- Preprocessor ------------------------------------------------------
4 Recipe Steps

* step_date()
* step_mutate()
* step_normalize()
* step_rm()

-- Model -------------------------------------------------------------
Selected 7 of 7 terms, and 3 of 12 predictors
Termination condition: RSq changed by less than 0.001 at 7 terms
Importance: date_num, Date_monthJul, Date_monthApr, ...
Number of terms at each degree of interaction: 1 6 (additive model)
GCV 184188893    RSS 58662556016    GRSq 0.9527876    RSq 0.9560333
USmodel_snaive <- naive_reg() %>%
    set_engine(engine = "snaive") %>%
    fit(Daily_new_cases ~ Date, data = US_train_data)

USmodel_snaive
parsnip model object

Fit time:  31ms 
SNAIVE [7]
--------
Model: 
# A tibble: 7 x 2
  Date        value
  <date>      <dbl>
1 2020-12-25  97615
2 2020-12-26 226303
3 2020-12-27 155634
4 2020-12-28 174651
5 2020-12-29 200191
6 2020-12-30 233634
7 2020-12-31 235565
USmodel_ETS <- exp_smoothing( 
    error = "additive",
    trend = "additive",
    season = "none",) %>%
    set_engine(engine = "ets") %>%
    fit(Daily_new_cases ~ Date , data = US_train_data)

USmodel_ETS
parsnip model object

Fit time:  30ms 
ETS(A,A,N) 

Call:
 forecast::ets(y = outcome, model = model_ets, damped = damping_ets) 

  Smoothing parameters:
    alpha = 0.3757 
    beta  = 0.0045 

  Initial states:
    l = 0.6736 
    b = -0.0317 

  sigma:  13545.84

     AIC     AICc      BIC 
8560.675 8560.853 8579.879 
# adding the models to table

USmodels_tbl <- modeltime_table(
    USmodel_fit_arima_no_boost,
    USmodel_fit_arima_boosted,
    USmodel_fit_ets,
    USmodel_fit_prophet,
    USmodel_fit_lm,
    USwflw_fit_mars,
    USmodel_snaive,
    USmodel_ETS
)

USmodels_tbl
# Modeltime Table
# A tibble: 8 x 3
  .model_id .model     .model_desc                             
      <int> <list>     <chr>                                   
1         1 <fit[+]>   ARIMA(0,1,2)(1,0,2)[7]                  
2         2 <fit[+]>   ARIMA(1,1,1)(1,0,2)[7] W/ XGBOOST ERRORS
3         3 <fit[+]>   ETS(A,A,A)                              
4         4 <fit[+]>   PROPHET                                 
5         5 <fit[+]>   LM                                      
6         6 <workflow> EARTH                                   
7         7 <fit[+]>   SNAIVE [7]                              
8         8 <fit[+]>   ETS(A,A,N)                              

Calibrate the model using test data

UScalibration_tbl <- USmodels_tbl %>%
    modeltime_calibrate(new_data = US_test_data)

UScalibration_tbl %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = TRUE,
        .title = "US Accuracy Table",
        .searchable = FALSE
    )
UScalibration_tbl
# Modeltime Table
# A tibble: 8 x 5
  .model_id .model   .model_desc                .type .calibration_da~
      <int> <list>   <chr>                      <chr> <list>          
1         1 <fit[+]> ARIMA(0,1,2)(1,0,2)[7]     Test  <tibble [86 x 4~
2         2 <fit[+]> ARIMA(1,1,1)(1,0,2)[7] W/~ Test  <tibble [86 x 4~
3         3 <fit[+]> ETS(A,A,A)                 Test  <tibble [86 x 4~
4         4 <fit[+]> PROPHET                    Test  <tibble [86 x 4~
5         5 <fit[+]> LM                         Test  <tibble [86 x 4~
6         6 <workfl~ EARTH                      Test  <tibble [86 x 4~
7         7 <fit[+]> SNAIVE [7]                 Test  <tibble [86 x 4~
8         8 <fit[+]> ETS(A,A,N)                 Test  <tibble [86 x 4~

Comparing ModelTime and Forecast Packages

From the results, it was observed that the Forecast and ModelTime packages are able to create similar models (e.g. ARIMA, ETS). However, it is easier to code using the ModelTime package. Moreover, the results for the overlapping models are identical; see the image below, which highlights three models whose parameters are aligned and which returned the same results.

Upon further investigation, it was observed that the ModelTime package includes models from the Forecast and Prophet packages, so it is able to support more models. It also integrates well with the TimeTK package. Therefore, we will be using the ModelTime package for the model building.

Showing visualisation of the forecast (calibrated with Test Data)

Fitting the forecast to the Test Data and plotting the visualisation. The visualisation allows users to choose the models they wish to view by clicking on the legend, and to zoom into a specific time period.

UScalibration_tbl %>%
    modeltime_forecast(
        new_data    = US_test_data,
        actual_data = US_Confirmed_Cases
    ) %>%
    plot_modeltime_forecast(
      .interactive      = TRUE
    )

Forecasting US next 10 days of confirmed cases

The models are re-calibrated using the full dataset to forecast the next 10 days of confirmed cases. The time series chart shows the forecast outcome, and the accuracy table shows the value of each metric (MAE, MASE, etc.).

USrefit_tbl <- UScalibration_tbl %>%
    modeltime_refit(data = US_Confirmed_Cases)

USrefit_tbl %>%
    modeltime_forecast(h = "10 days", actual_data = US_Confirmed_Cases) %>%
    plot_modeltime_forecast(
      .legend_max_width = 25, # For mobile screens
      .interactive      = TRUE,
      .title = "US"
    )
USrefit_tbl %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = TRUE,
      .title = "US"
    )

Using a Shorter Time Period for the Forecast (Calibrating the Models)

As discussed earlier, we should allow users to choose the period used for the forecast. Therefore, we need to explore the minimum period that the models will accept.

The US dataset shows a spike in confirmed cases from Nov 2020 to Feb 2021. This "anomaly" might affect the accuracy of the forecast. We explore using only data from May 2020 to Oct 2020 (6 months of data) to see whether the forecast result improves.

The following code selects only the May 2020 to Oct 2020 US data, and then breaks it into Training and Test sets.

chosendata <- confirmed_cases_country_level%>%
  filter(Country == "US")%>%
  filter_by_time(Date, "2020-05", "2020-10")

traindata <- training(initial_time_split(chosendata, prop = 0.8)) # 80% training and 20% test
testdata <- testing(initial_time_split(chosendata, prop = 0.8))

traindata %>% mutate(type = "train") %>%
  bind_rows(testdata%>%mutate(type="test"))%>%
  ggplot(aes(x=Date,y=Daily_new_cases,color=type)) +
  geom_line() +  ggtitle("US")

Fitting the Train Data into the Models

 model_fit_arima_no_boost <- arima_reg() %>%
    set_engine(engine = "auto_arima") %>%
    fit(Daily_new_cases ~ Date, data = traindata)

model_fit_arima_no_boost
parsnip model object

Fit time:  371ms 
Series: outcome 
ARIMA(0,1,1)(0,1,1)[7] 

Coefficients:
          ma1     sma1
      -0.4638  -0.5883
s.e.   0.0752   0.0912

sigma^2 estimated as 17631434:  log likelihood=-1357.46
AIC=2720.91   AICc=2721.09   BIC=2729.72
model_fit_arima_boosted <- arima_boost(
    min_n = 2,
    learn_rate = 0.015
) %>%
    set_engine(engine = "auto_arima_xgboost") %>%
    fit(Daily_new_cases ~ Date + as.numeric(Date) + factor(month(Date, label = TRUE), ordered = F),
        data = traindata)

model_fit_arima_boosted
parsnip model object

Fit time:  500ms 
ARIMA(0,1,1)(0,1,1)[7] w/ XGBoost Errors
---
Model 1: Auto ARIMA
Series: outcome 
ARIMA(0,1,1)(0,1,1)[7] 

Coefficients:
          ma1     sma1
      -0.4638  -0.5883
s.e.   0.0752   0.0912

sigma^2 estimated as 17631434:  log likelihood=-1357.46
AIC=2720.91   AICc=2721.09   BIC=2729.72

---
Model 2: XGBoost Errors

xgboost::xgb.train(params = list(eta = 0.015, max_depth = 6, 
    gamma = 0, colsample_bytree = 1, min_child_weight = 2, subsample = 1, 
    objective = "reg:squarederror"), data = x$data, nrounds = 15, 
    watchlist = x$watchlist, verbose = 0, nthread = 1)
model_fit_ets <- exp_smoothing() %>%
    set_engine(engine = "ets") %>%
    fit(Daily_new_cases ~ Date , data = traindata)

model_fit_ets
parsnip model object

Fit time:  310ms 
ETS(M,Ad,M) 

Call:
 forecast::ets(y = outcome, model = model_ets, damped = damping_ets) 

  Smoothing parameters:
    alpha = 0.516 
    beta  = 0.0357 
    gamma = 0.0028 
    phi   = 0.9473 

  Initial states:
    l = 28791.9403 
    b = -481.1038 
    s = 1.0864 1.0125 0.9771 0.8876 0.8696 1.012
           1.1547

  sigma:  0.0999

     AIC     AICc      BIC 
3160.327 3163.064 3199.203 
model_fit_prophet <- prophet_reg() %>%
    set_engine(engine = "prophet") %>%
    fit(Daily_new_cases ~ Date, data = traindata)

model_fit_prophet
parsnip model object

Fit time:  51ms 
PROPHET Model
- growth: 'linear'
- n.changepoints: 25
- changepoint.range: 0.8
- yearly.seasonality: 'auto'
- weekly.seasonality: 'auto'
- daily.seasonality: 'auto'
- seasonality.mode: 'additive'
- changepoint.prior.scale: 0.05
- seasonality.prior.scale: 10
- holidays.prior.scale: 10
- logistic_cap: NULL
- logistic_floor: NULL
- extra_regressors: 0
#Multivariate Adaptive Regression Spline model

model_spec_mars <- mars(mode = "regression") %>%
   set_engine("earth") 

recipe_spec <- recipe(Daily_new_cases ~ Date, data = traindata) %>%
    step_date(Date, features = "month", ordinal = FALSE) %>%
    step_mutate(date_num = as.numeric(Date)) %>%
    step_normalize(date_num) %>%
    step_rm(Date)
  
wflw_fit_mars <- workflow() %>%
    add_recipe(recipe_spec) %>%
    add_model(model_spec_mars) %>%
    fit(traindata)

wflw_fit_mars
== Workflow [trained] ================================================
Preprocessor: Recipe
Model: mars()

-- Preprocessor ------------------------------------------------------
4 Recipe Steps

* step_date()
* step_mutate()
* step_normalize()
* step_rm()

-- Model -------------------------------------------------------------
Selected 6 of 13 terms, and 2 of 12 predictors
Termination condition: RSq changed by less than 0.001 at 13 terms
Importance: date_num, Date_monthJul, Date_monthFeb-unused, ...
Number of terms at each degree of interaction: 1 5 (additive model)
GCV 28665468    RSS 3606778880    GRSq 0.8892415    RSq 0.9038943
model_fit_lm <- linear_reg() %>%
    set_engine("lm") %>%
    # a month factor could also be added here as an extra regressor
    fit(Daily_new_cases ~ as.numeric(Date),
        data = traindata)

model_fit_lm
parsnip model object

Fit time:  0ms 

Call:
stats::lm(formula = Daily_new_cases ~ as.numeric(Date), data = data)

Coefficients:
     (Intercept)  as.numeric(Date)  
      -3548183.4             194.4  
models_tbl <- modeltime_table(
    model_fit_arima_no_boost,
    model_fit_arima_boosted,
    model_fit_ets,
    model_fit_prophet,
    model_fit_lm,
    wflw_fit_mars
)
calibration_tbl <- models_tbl %>%
    modeltime_calibrate(new_data = testdata)


calibration_tbl %>%
    modeltime_forecast(
        new_data    = testdata,
        actual_data = chosendata
    ) %>%
    plot_modeltime_forecast(
      .interactive      = TRUE
    )
calibration_tbl %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = TRUE,
        .title = "Accuracy Table",
        .searchable = FALSE
    )

Choosing Only a 1-Month Data Range

To check whether the models can work with only one month of data, the May 2020 data is chosen for this test.

The outcome shows that certain models (e.g. SNAIVE) require a longer date period for forecasting and are therefore unable to produce any forecast.

chosendata <- confirmed_cases_country_level %>%
  filter(Country == "US") %>%
  filter_by_time(Date, "2020-05", "2020-05")

traindata <- training(initial_time_split(chosendata, prop = 0.8)) # 80% training and 20% test
testdata <- testing(initial_time_split(chosendata, prop = 0.8))

traindata %>% mutate(type = "train") %>%
  bind_rows(testdata %>% mutate(type = "test")) %>%
  ggplot(aes(x = Date, y = Daily_new_cases, color = type)) +
  geom_line() + ggtitle("US")
model_fit_arima_no_boost <- arima_reg() %>%
    set_engine(engine = "auto_arima") %>%
    fit(Daily_new_cases ~ Date, data = traindata)

model_fit_arima_no_boost
parsnip model object

Fit time:  120ms 
Series: outcome 
ARIMA(1,1,0)(0,1,0)[7] 

Coefficients:
          ar1
      -0.8015
s.e.   0.1686

sigma^2 estimated as 5360153:  log likelihood=-146.66
AIC=297.31   AICc=298.23   BIC=298.86
model_fit_arima_boosted <- arima_boost(
    min_n = 2,
    learn_rate = 0.015
) %>%
    set_engine(engine = "auto_arima_xgboost") %>%
    fit(Daily_new_cases ~ Date + as.numeric(Date) + factor(month(Date, label = TRUE), ordered = FALSE),
        data = traindata)

model_fit_arima_boosted
parsnip model object

Fit time:  171ms 
ARIMA(1,1,0)(0,1,0)[7]
---
Model 1: Auto ARIMA
Series: outcome 
ARIMA(1,1,0)(0,1,0)[7] 

Coefficients:
          ar1
      -0.8015
s.e.   0.1686

sigma^2 estimated as 5360153:  log likelihood=-146.66
AIC=297.31   AICc=298.23   BIC=298.86

---
Model 2: XGBoost Errors

NULL
model_fit_ets <- exp_smoothing() %>%
    set_engine(engine = "ets") %>%
    fit(Daily_new_cases ~ Date , data = traindata)

model_fit_ets
parsnip model object

Fit time:  271ms 
ETS(M,N,N) 

Call:
 forecast::ets(y = outcome, model = model_ets, damped = damping_ets) 

  Smoothing parameters:
    alpha = 0.6722 

  Initial states:
    l = 32127.6278 

  sigma:  0.137

     AIC     AICc      BIC 
469.4644 470.6644 472.9986 
model_fit_prophet <- prophet_reg() %>%
    set_engine(engine = "prophet") %>%
    fit(Daily_new_cases ~ Date, data = traindata)

model_fit_prophet
parsnip model object

Fit time:  180ms 
PROPHET Model
- growth: 'linear'
- n.changepoints: 18
- changepoint.range: 0.8
- yearly.seasonality: 'auto'
- weekly.seasonality: 'auto'
- daily.seasonality: 'auto'
- seasonality.mode: 'additive'
- changepoint.prior.scale: 0.05
- seasonality.prior.scale: 10
- holidays.prior.scale: 10
- logistic_cap: NULL
- logistic_floor: NULL
- extra_regressors: 0
# Multivariate Adaptive Regression Spline (MARS) model

model_spec_mars <- mars(mode = "regression") %>%
   set_engine("earth") 

recipe_spec <- recipe(Daily_new_cases ~ Date, data = traindata) %>%
    step_date(Date, features = "month", ordinal = FALSE) %>%
    step_mutate(date_num = as.numeric(Date)) %>%
    step_normalize(date_num) %>%
    step_rm(Date)
  
wflw_fit_mars <- workflow() %>%
    add_recipe(recipe_spec) %>%
    add_model(model_spec_mars) %>%
    fit(traindata)

wflw_fit_mars
== Workflow [trained] ================================================
Preprocessor: Recipe
Model: mars()

-- Preprocessor ------------------------------------------------------
4 Recipe Steps

* step_date()
* step_mutate()
* step_normalize()
* step_rm()

-- Model -------------------------------------------------------------
Selected 2 of 3 terms, and 1 of 12 predictors
Termination condition: RSq changed by less than 0.001 at 3 terms
Importance: date_num, Date_monthFeb-unused, Date_monthMar-unused, ...
Number of terms at each degree of interaction: 1 1 (additive model)
GCV 9755897    RSS 179264608    GRSq 0.2703689    RSq 0.3917442
model_fit_lm <- linear_reg() %>%
    set_engine("lm") %>%
    fit(Daily_new_cases ~ as.numeric(Date),
        data = traindata)

model_fit_lm
parsnip model object

Fit time:  0ms 

Call:
stats::lm(formula = Daily_new_cases ~ as.numeric(Date), data = data)

Coefficients:
     (Intercept)  as.numeric(Date)  
       4973584.8            -269.1  
models_tbl <- modeltime_table(
    model_fit_arima_no_boost,
    model_fit_arima_boosted,
    model_fit_ets,
    model_fit_prophet,
    model_fit_lm,
    wflw_fit_mars
)

calibration_tbl <- models_tbl %>%
    modeltime_calibrate(new_data = testdata)




calibration_tbl %>%
    modeltime_forecast(
        new_data    = testdata,
        actual_data = chosendata
    ) %>%
    plot_modeltime_forecast(
      .interactive      = TRUE
    )
calibration_tbl %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = TRUE,
        .title = "Accuracy Table",
        .searchable = FALSE
    )

Next, we forecast the next 10 days in Mar 2021 using the May 2020 dataset.

The outcome shows that the models require the dates to be continuous. As seen from the plot, despite putting in the full US dataset to predict the Mar 2021 data, the models calculated the outcome for Jun 2020. This is a limitation of the models: users have to include the most recent dates in the chosen period.

refit_tbl <- calibration_tbl %>%
    modeltime_refit(data = US_Confirmed_Cases)

refit_tbl %>%
    modeltime_forecast(h = "10 days", actual_data = US_Confirmed_Cases) %>%
    plot_modeltime_forecast(
      .legend_max_width = 25, # For mobile screens
      .interactive      = TRUE,
      .title = "US"
    )
refit_tbl %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = TRUE,
      .title = "US"
    )

Observations - Shortened Time Periods

From the results, it appears that "anomalies" in the dataset do affect the forecast. For example, the Prophet RMSE is 33675.32 for the Jul-Oct 2020 dataset but 241226.03 for the full US dataset. To improve model accuracy, the application should allow users to choose the period used for forecasting. The tests also show that the models can accommodate as little as one month of data. However, some models require a longer dataset and cannot be used if a short date range is chosen; this is a limitation on the date range. It was also observed that the May 2020 models cannot be used to forecast Mar 2021, so the most recent data has to be included in the forecast.

Exploring Other Parameters for the Prophet Model

Currently, all the models use their default settings. The parameters within each model can be changed to alter the forecast outcome. The subsequent section explores the parameter settings for the Prophet model.

Prophet is an open-source library published by Facebook that is based on decomposable (trend + seasonality + holidays) models. It makes time series predictions with good accuracy using simple, intuitive parameters, and supports custom seasonality and holiday effects.
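The decomposable structure behind Prophet can be written out explicitly; in the notation of the Prophet paper, the additive form is:

```latex
% Prophet models the series as a sum of interpretable components:
%   g(t): trend (piecewise-linear or logistic growth)
%   s(t): periodic seasonality (modelled with Fourier series)
%   h(t): holiday / event effects
%   \epsilon_t: error term
y(t) = g(t) + s(t) + h(t) + \epsilon_t
```

In the multiplicative mode (tried later in this section), the seasonal and holiday terms scale with the trend instead of being added to it.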

Prophet Model

Default Setting

model_fit_prophet <- prophet_reg() %>%
    set_engine(engine = "prophet") %>%
    fit(Daily_new_cases ~ Date, data = traindata)

modeltime_table(
    model_fit_prophet
) %>%
    modeltime_calibrate(new_data = testdata) %>%
    modeltime_forecast(
        new_data      = testdata,
        actual_data   = chosendata,
        conf_interval = 0.95
    ) %>%
    plot_modeltime_forecast(.interactive = FALSE)
modeltime_table(
    model_fit_prophet
) %>%
    modeltime_calibrate(new_data = testdata) %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = TRUE,
        .title = "Accuracy Table",
        .searchable = FALSE
    )

Since the US data shows changing variance over time, we can stabilise the variance by applying a log transformation using the log() function.

model_fit_prophet <- prophet_reg() %>%
    set_engine(engine = "prophet") %>%
    fit(log(Daily_new_cases) ~ Date, data = traindata)

modeltime_table(
    model_fit_prophet
) %>%
    modeltime_calibrate(new_data = testdata) %>%
    modeltime_forecast(
        new_data      = testdata,
        actual_data   = chosendata,
        conf_interval = 0.95
    ) %>%
    plot_modeltime_forecast(.interactive = FALSE)
modeltime_table(
    model_fit_prophet
) %>%
    modeltime_calibrate(new_data = testdata) %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = TRUE,
        .title = "Accuracy Table",
        .searchable = FALSE
    )

Next, we enable daily seasonality and change to a multiplicative model.

model_fit_prophet <- prophet_reg(seasonality_daily = TRUE, season = "multiplicative") %>%
    set_engine(engine = "prophet") %>%
    fit(Daily_new_cases ~ Date, data = traindata)

modeltime_table(
    model_fit_prophet
) %>%
    modeltime_calibrate(new_data = testdata) %>%
    modeltime_forecast(
        new_data      = testdata,
        actual_data   = chosendata,
        conf_interval = 0.95
    ) %>%
    plot_modeltime_forecast(.interactive = FALSE)
modeltime_table(
    model_fit_prophet
) %>%
    modeltime_calibrate(new_data = testdata) %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = TRUE,
        .title = "Accuracy Table",
        .searchable = FALSE
    )

Observations - Changing Parameters for Prophet Model

Changing the parameters of the model does change the forecast outcome. Users should explore the dataset and adjust the model parameters to find the best forecast. The application will allow users to change the parameters of the models.
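As a rough sketch of what such controls could look like, the tunable arguments of modeltime's prophet_reg() (the same function used above) could be wired to user inputs. The values below are illustrative defaults, not recommendations, and this snippet is not run here:

```r
# Illustrative only: exposing prophet_reg() arguments that the app
# could let users adjust (argument names are from the modeltime package)
model_fit_prophet_tuned <- prophet_reg(
    growth                   = "linear",          # or "logistic" (needs a cap)
    changepoint_num          = 25,                # potential trend changepoints
    changepoint_range        = 0.8,               # share of history eligible for changepoints
    seasonality_weekly       = TRUE,
    seasonality_daily        = FALSE,
    season                   = "multiplicative",  # or "additive"
    prior_scale_changepoints = 0.05               # trend flexibility
) %>%
    set_engine(engine = "prophet") %>%
    fit(Daily_new_cases ~ Date, data = traindata)
```

In the Shiny app, each of these arguments would map naturally onto a slider or dropdown input.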

Conclusion

In this sub-module, the Shiny app will allow users to (1) explore the dataset and (2) choose among different models for forecasting. We will be using the timetk and modeltime packages for the development. For EDA, users will be able to view the time series plot, STL plot, anomaly plot and ACF/PACF plot. Using these plots, users will gain a better understanding of the dataset and can choose the right time period and parameters for the forecast.
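For reference, the four EDA views listed above correspond directly to timetk plotting functions; a minimal sketch, assuming the same chosendata tibble (with Date and Daily_new_cases columns) used in the tests above:

```r
# Sketch of the proposed EDA views using timetk
chosendata %>% plot_time_series(Date, Daily_new_cases, .interactive = TRUE)  # time series plot
chosendata %>% plot_stl_diagnostics(Date, Daily_new_cases)                   # STL decomposition
chosendata %>% plot_anomaly_diagnostics(Date, Daily_new_cases)               # anomaly plot
chosendata %>% plot_acf_diagnostics(Date, Daily_new_cases)                   # ACF / PACF
```

These functions return ggplot or plotly objects, so they slot directly into Shiny output panes.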

Storyboard

Sketch of Proposed Storyboard

There will be a total of two pages under the Time Series Analysis tab.

Page 1 will allow users to explore the time series dataset, while Page 2 will show the forecast results.



Suggested Interactivity for the Proposed Visualisation