This post covers the planning and prototyping process for one of the three sub-modules of a Shiny-based Visual Analytics Application project for the course ISSS608. The purpose of this sub-module is to investigate the spread of Coronavirus (COVID-19) over time for different countries and to build models that forecast the near-future case load in each country, using the Johns Hopkins COVID-19 dataset.
The purpose of this assignment is to provide estimates of the near-future COVID-19 case load in each country, based on global data collated by Johns Hopkins University. The data is a sequence of observations stored in time order (time series data).
Time series analysis aims to extract meaningful characteristics from time series data and generate useful insights. In our case, we will attempt to forecast the near-future number of confirmed cases. Time series have several characteristics that make their analysis different from other types of data, namely trend, cyclic behaviour, seasonality, and random (stationary) variation.
These characteristics affect the outcome of the forecasting models. Decomposing a time series into its components (trend, cycle, seasonality, and random (stationary) remainder) is an important part of time series analysis: it helps us understand the series better and can also improve the forecast results.
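As a minimal illustration of the idea (a sketch only, using the built-in AirPassengers monthly series rather than the COVID-19 data), a classical decomposition can be produced with base R:

```r
# Sketch: classical decomposition of a built-in monthly series into
# trend, seasonal and random (remainder) components.
decomposed <- decompose(AirPassengers)  # AirPassengers ships with base R
plot(decomposed)  # panels: observed, trend, seasonal, random
```

We will use the STL decomposition from the timetk package later, but the components it surfaces are the same.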
Most of the sites that forecast the near-future case load use time series charts and/or present values in tabular form in white papers (see Fig 1). These sites/papers give detailed explanations of the mathematical models, without letting users explore the dataset.

These visualisations focus on showing the result of the chosen models, without allowing users to change any of the model parameters. Most of the time, users are also not given a choice of which forecasting model to use. Where possible, we should offer a few models so that users can compare them and choose the one they want.
As with all forecasting/prediction models, most forecasts do show the confidence interval of the outcome, which tells users the range in which the forecast is expected to fall. We will maintain this in our application.
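As a sketch of what this looks like in R (using the forecast package loaded below, with AirPassengers purely as a stand-in series), forecast() returns 80% and 95% prediction intervals by default:

```r
library(forecast)
fit <- auto.arima(AirPassengers)  # fit an ARIMA model automatically
fc  <- forecast(fit, h = 12)      # 12-step-ahead forecast
# the forecast object carries Lo 80 / Hi 80 / Lo 95 / Hi 95 interval columns
head(as.data.frame(fc))
```

These interval columns are what the forecast plots shade around the point forecast.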
Most forecasting sites do not allow users to explore the dataset. It would be good if users could observe the "characteristics" of the dataset before proceeding to the forecast. We should also explore allowing users to select the date range to be used for the forecast.
Most sites only show the time series plot of the forecast, without the accuracy metrics of the result (RMSE etc.). It would be good to show users these values so that they can gauge the accuracy of the models.
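For reference, RMSE (root mean squared error) is the square root of the mean squared difference between actual and forecast values; a toy computation (with made-up numbers) looks like:

```r
actual    <- c(100, 120, 140)  # observed case counts (made-up)
predicted <- c(110, 115, 150)  # model forecasts (made-up)
sqrt(mean((actual - predicted)^2))  # RMSE, sqrt(75) ~ 8.66
# equivalently: Metrics::rmse(actual, predicted)
```

Lower RMSE means the forecasts sit closer to the actual values, which is how we will compare models later.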
Most visualisations of COVID-19 confirmed-case forecasts are time series charts with little interactivity, and usually only one model's result is shown.
In this assignment, we will attempt to create an interactive visualisation. First, the application will have a page that lets users explore the time series chart (trend, seasonality, anomalies etc.). The forecasting page will then allow users to choose the date range to be used for the forecast. Countries have seen spikes in confirmed COVID-19 cases and have taken different measures to control the spread, so letting users choose the date range produces different outcomes that can best suit their requirements. Several models will also be shown so that users can compare the outcomes.
The following R packages will be explored:
| Visualisation | Packages |
|---|---|
| EDA of Time Series Data | ggplot2, timetk |
| Forecasting | forecast, modeltime |
The COVID-19 dataset is a time series dataset (with observations recorded daily). The dataset used is sourced from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, which collates the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries. The data used for this assignment is the cumulative daily confirmed cases for all affected countries and can be downloaded at https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv.
A list of packages is required for this assignment. This code chunk installs the required packages and loads them into the R environment.
packages = c('xgboost', 'tidymodels', 'modeltime', 'tidyverse', 'lubridate',
             'timetk', 'earth', 'tseries', 'forecast', 'viridis', 'plotly',
             'tsibble', 'tsintermittent', 'randomForest', 'fpp3', 'Metrics',
             'gghighlight')
for (p in packages) {
  if (!require(p, character.only = TRUE)) { # install any package not yet present
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Data import was accomplished using read_csv() from the readr package.
confirmed_cases_raw <- read_csv("data/time_series_covid19_confirmed_global.csv")
confirmed_cases_raw
# A tibble: 274 x 435
`Province/State` `Country/Region` Lat Long `1/22/20` `1/23/20`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 <NA> Afghanistan 33.9 67.7 0 0
2 <NA> Albania 41.2 20.2 0 0
3 <NA> Algeria 28.0 1.66 0 0
4 <NA> Andorra 42.5 1.52 0 0
5 <NA> Angola -11.2 17.9 0 0
6 <NA> Antigua and Bar~ 17.1 -61.8 0 0
7 <NA> Argentina -38.4 -63.6 0 0
8 <NA> Armenia 40.1 45.0 0 0
9 Australian Capit~ Australia -35.5 149. 0 0
10 New South Wales Australia -33.9 151. 0 0
# ... with 264 more rows, and 429 more variables: 1/24/20 <dbl>,
#   1/25/20 <dbl>, 1/26/20 <dbl>, ... (one <dbl> column per day up to the latest date)
The following are observed from the dataset:
The dates are recorded as variables (columns) and the numbers of cases as observations, so all the date columns need to be gathered into a single date variable.
The date variable produced by the reshaping is stored as text and needs to be recoded into Date format.
The dataset contains cases at province/state level for some countries (e.g. Australia), which need to be aggregated to country level.
The following code converts the data to the correct format for subsequent analysis.
Create a “confirmed_cases_country_level” variable to store the cumulative confirmed cases for each country (aggregating the province/state and city level data to Country/Region).
We are interested in the daily new confirmed cases, but the current data records the cumulative daily total, so the “daily new confirmed cases” must be calculated using the lag() function.
Select only Country, Date, TotalCases and Daily_new_cases to be stored, dropping all other variables (e.g. Lat and Long).
confirmed_cases_country_level <- confirmed_cases_raw %>%
  gather(Date, Total_Cases, -`Province/State`, -`Country/Region`, -Lat, -Long) %>% # collect all the date columns into a single Date variable
  group_by(`Country/Region`, Date) %>%
  summarize(total = sum(Total_Cases)) %>%
  select(`Country/Region`, Date, total) %>%
  set_names(c("Country", "Date", "TotalCases")) %>% # rename the variables
  ungroup()
confirmed_cases_country_level$Date <- as.Date(confirmed_cases_country_level$Date, format = "%m/%d/%y") # convert the Date variable to Date format
# finding the daily new confirmed cases
confirmed_cases_country_level <- confirmed_cases_country_level %>%
  group_by(Country) %>%
  arrange(Country, Date) %>%
  mutate(Daily_new_cases = TotalCases - lag(TotalCases, n = 1)) %>%
  drop_na()
confirmed_cases_country_level
# A tibble: 82,560 x 4
# Groups: Country [192]
Country Date TotalCases Daily_new_cases
<chr> <date> <dbl> <dbl>
1 Afghanistan 2020-01-23 0 0
2 Afghanistan 2020-01-24 0 0
3 Afghanistan 2020-01-25 0 0
4 Afghanistan 2020-01-26 0 0
5 Afghanistan 2020-01-27 0 0
6 Afghanistan 2020-01-28 0 0
7 Afghanistan 2020-01-29 0 0
8 Afghanistan 2020-01-30 0 0
9 Afghanistan 2020-01-31 0 0
10 Afghanistan 2020-02-01 0 0
# ... with 82,550 more rows
To allow users to explore and visualise the time series data for each country, time series charts will be plotted. Since our dataset contains many countries, it would be good to show several countries side by side for comparison. The following section explores the optimal number of charts that can be displayed at any one time, using the timetk package.
Create Time Series Plots for South East Asia (Brunei, Burma, Cambodia, Timor-Leste, Indonesia, Laos, Malaysia, Philippines, Singapore, Thailand, Vietnam).
Setting .facet_ncol = 3 gives 4 rows of 3 countries.
# selecting only the South East Asian countries from the dataset
# (%in% is needed here; == would recycle the vector and silently drop rows)
SEA <- filter(confirmed_cases_country_level,
              Country %in% c("Brunei", "Burma", "Cambodia", "Timor-Leste", "Indonesia",
                             "Laos", "Malaysia", "Philippines", "Singapore",
                             "Thailand", "Vietnam")) # 11 countries
SEA %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 3, .facet_scales = "free",
                   .interactive = TRUE)
Setting .facet_ncol = 2 gives 6 rows of 2 countries.
SEA %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 2, .facet_scales = "free",
                   .interactive = TRUE)
Create Time Series Plots for European Countries (Austria, Belgium, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, Sweden, United Kingdom).
Setting .facet_ncol = 4 gives 4 rows of 4 countries.
# choosing the European countries (again using %in% rather than ==)
EU <- filter(confirmed_cases_country_level,
             Country %in% c("Austria", "Belgium", "Denmark", "Finland", "France",
                            "Germany", "Greece", "Ireland", "Italy", "Luxembourg",
                            "Netherlands", "Portugal", "Spain", "Sweden",
                            "United Kingdom"))
EU %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 4, .facet_scales = "free",
                   .interactive = TRUE)
Setting .facet_ncol = 3 gives 5 rows of 3 countries.
EU %>%
  plot_time_series(Date, Daily_new_cases,
                   .facet_ncol = 3, .facet_scales = "free",
                   .interactive = TRUE)
Setting .plotly_slider = TRUE to try out the slider function. The slider can be used to change the date range displayed.
US_Confirmed_Cases <- filter(confirmed_cases_country_level,Country == "US")
US_Confirmed_Cases%>%
plot_time_series(Date, Daily_new_cases,
.facet_ncol =2, .facet_scales = "free",
.interactive = TRUE,
.plotly_slider = TRUE)
plot_stl_diagnostics() generates a Seasonal-Trend decomposition using Loess (STL), allowing users to view the components of the time series, namely the trend, seasonality and remainder.
#STL Diagnostics
#Using US dataset
US_Confirmed_Cases %>%
plot_stl_diagnostics(
Date, Daily_new_cases,
.frequency = "auto", .trend = "auto",
.feature_set = c("observed", "season", "trend", "remainder"),
.interactive = TRUE)
plot_anomaly_diagnostics() is an interactive and scalable function for visualising anomalies in time series data.
US_Confirmed_Cases %>%
plot_anomaly_diagnostics(Date, Daily_new_cases, .facet_ncol = 2)
Autocorrelation is the correlation between a time series and lagged versions of itself. plot_acf_diagnostics() returns the ACF and PACF of a target, and optionally the CCFs of one or more lagged predictors, in interactive plotly plots.
US_Confirmed_Cases %>%
plot_acf_diagnostics(Date, Daily_new_cases,
.facet_ncol = 2,
.facet_scale = "free",
.interactive = TRUE)
We will explore using ggplot2 to plot the time series charts for different countries.
The following code will plot the chart for South East Asia.
ggplot(SEA, aes(Date, Daily_new_cases, color = Country)) +
  geom_path(color = 'blue', lineend = "round", size = 1) +
  facet_wrap(~Country, scales = "free") +
  geom_smooth()

Trying out the gghighlight() function and removing the smooth line.
ggplot(SEA, aes(Date, Daily_new_cases, color = Country)) +
  geom_path(color = 'blue', lineend = "round", size = 1) +
  gghighlight() +
  facet_wrap(~Country, scales = "free")

Decomposing the time series into trend, seasonality and remainder.
(The following code returns an error — Error in decompose(.): time series has no or less than 2 periods — because decompose() expects a ts object with a known frequency covering at least two full periods, which our tibble does not provide.)
US_Confirmed_Cases %>%
decompose() %>%
autoplot()
Both the timetk and ggplot2 packages provide functions to support EDA.
From the visualisation, the following can be observed:
For the most optimal view, a maximum of 12 countries should be plotted (4 rows of 3 countries each) to view the distribution of the data; any more and the charts become too squeezed. However, the scale of the y-axis differs between facets, which might confuse users.
The timetk package allows interactivity such as pan and zoom to be incorporated into the charts with a single argument (.interactive = TRUE). With ggplot2, the plotly package would need to be incorporated to achieve the same effect.
timetk allows a slider to be incorporated with a single argument (.plotly_slider = TRUE). With ggplot2, this again requires the plotly package.
The timetk package has built-in functions to view the STL decomposition and anomalies, and is flexible about the amount of data. With ggplot2 and decompose(), the time series needs a minimum of 2 periods for the data to be shown.
In conclusion, the timetk package is more customisable and flexible, so we will use timetk for the EDA.
To start, the US dataset will be used to build the forecasting models.
We will first create training and test datasets.
A training set is used to build the model, while a test set is used to validate it. Data points in the training set are excluded from the test set.
The following code splits the US dataset into training (80%) and test (20%) sets.
US_train_tbl <- training(initial_time_split(US_Confirmed_Cases, prop = 0.8)) # 80% training and 20% test
US_test_tbl <- testing(initial_time_split(US_Confirmed_Cases, prop = 0.8))
US_train_tbl %>% mutate(type = "train") %>%
  bind_rows(US_test_tbl %>% mutate(type = "test")) %>%
  ggplot(aes(x = Date, y = Daily_new_cases, color = type)) +
  geom_line() + ggtitle("US")
US_train_tbl
# A tibble: 344 x 4
# Groups: Country [1]
Country Date TotalCases Daily_new_cases
<chr> <date> <dbl> <dbl>
1 US 2020-01-23 1 0
2 US 2020-01-24 2 1
3 US 2020-01-25 2 0
4 US 2020-01-26 5 3
5 US 2020-01-27 5 0
6 US 2020-01-28 5 0
7 US 2020-01-29 6 1
8 US 2020-01-30 6 0
9 US 2020-01-31 8 2
10 US 2020-02-01 8 0
# ... with 334 more rows
US_test_tbl
# A tibble: 86 x 4
# Groups: Country [1]
Country Date TotalCases Daily_new_cases
<chr> <date> <dbl> <dbl>
1 US 2021-01-01 20252310 153510
2 US 2021-01-02 20552726 300416
3 US 2021-01-03 20761467 208741
4 US 2021-01-04 20945766 184299
5 US 2021-01-05 21180814 235048
6 US 2021-01-06 21436203 255389
7 US 2021-01-07 21714268 278065
8 US 2021-01-08 22007515 293247
9 US 2021-01-09 22270172 262657
10 US 2021-01-10 22483440 213268
# ... with 76 more rows
A total of 10 models are created, namely: NAIVE Model, SNAIVE Model, DRIFT Model, SES Model, Holt’s Linear Model, Damped Holt’s Linear Model, STL ETS Model, ARIMA, Dynamic Harmonic Regression Model and TSLM.
US_tsbl <- as_tsibble(US_train_tbl)
fpp3_model_table <- US_tsbl %>%
model(
## Model 1: Naive ----
naive_mod = NAIVE(Daily_new_cases),
## Model 2: Snaive ----
snaive_mod = SNAIVE(Daily_new_cases),
## Model 3: Drift ----
drift_mod = RW(Daily_new_cases ~ drift()),
## Model 4: SES ----
ses_mod = ETS(Daily_new_cases ~ error("A") + trend("N") + season("N"), opt_crit = "mse"),
## Model 5: Holt's Linear ----
hl_mod = ETS(Daily_new_cases ~ error("A") + trend("A") + season("N"), opt_crit = "mse"),
## Model 6: Damped Holt's Linear ----
hldamp_mod = ETS(Daily_new_cases ~ error("A") + trend("Ad") + season("N"), opt_crit = "mse"),
## Model 7: STL decomposition with ETS ----
stl_ets_mod = decomposition_model(STL(Daily_new_cases), ETS(season_adjust ~ season("N"))),
## Model 8: ARIMA ----
arima_mod = ARIMA(Daily_new_cases),
## Model 9: Dynamic harmonic regression ----
dhr_mod = ARIMA(Daily_new_cases ~ PDQ(0,0,0) + fourier(K=2)),
## Model 10: TSLM ----
tslm_mod = TSLM(Daily_new_cases ~ Date)
)
US_tsbl_test <- as_tsibble(US_test_tbl)
forecast_tbl <- fpp3_model_table %>%
  forecast(US_tsbl_test, times = 0) %>%
  as_tibble() %>%
  select(Date, Daily_new_cases, .model, fc_qty = .mean)
forecast_tbl <- US_Confirmed_Cases %>%
  as_tibble() %>%
  select(Date, Daily_new_cases) %>%
  right_join(forecast_tbl, by = c("Date")) %>% # join the forecast values (actuals become Daily_new_cases.x)
  mutate(fc_qty = ifelse(fc_qty < 0, 0, fc_qty)) # floor negative forecasts at zero
accuracy_tbl <- forecast_tbl %>%
group_by(.model) %>%
summarise(accuracy_rmse = Metrics::rmse(Daily_new_cases.x, fc_qty)) # calculate RMSE
accuracy_tbl
# A tibble: 10 x 2
.model accuracy_rmse
* <chr> <dbl>
1 arima_mod 122817.
2 dhr_mod 151250.
3 drift_mod 170477.
4 hl_mod 180090.
5 hldamp_mod 138663.
6 naive_mod 136876.
7 ses_mod 121623.
8 snaive_mod 109711.
9 stl_ets_mod 126826.
10 tslm_mod 95737.
For comparison, we will be building forecasting models using the ModelTime package as well.
Similar to the previous section, we will create training and test datasets.
US_train_data <- training(initial_time_split(US_Confirmed_Cases, prop = 0.8)) # 80% training and 20% test
US_test_data <- testing(initial_time_split(US_Confirmed_Cases, prop = 0.8))
US_train_data %>% mutate(type = "train") %>%
  bind_rows(US_test_data %>% mutate(type = "test")) %>%
  ggplot(aes(x = Date, y = Daily_new_cases, color = type)) +
  geom_line() + ggtitle("US")

USmodel_fit_arima_no_boost <- arima_reg() %>%
set_engine(engine = "auto_arima") %>%
fit(Daily_new_cases ~ Date, data = US_train_data)
USmodel_fit_arima_no_boost
parsnip model object
Fit time: 461ms
Series: outcome
ARIMA(0,1,2)(1,0,2)[7]
Coefficients:
ma1 ma2 sar1 sma1 sma2
-0.7187 0.1412 0.8711 -0.7250 0.1975
s.e. 0.0546 0.0578 0.0686 0.0966 0.1002
sigma^2 estimated as 148451337: log likelihood=-3713.61
AIC=7439.22 AICc=7439.47 BIC=7462.24
USmodel_fit_arima_boosted <- arima_boost(
min_n = 2,
learn_rate = 0.015
) %>%
set_engine(engine = "auto_arima_xgboost") %>%
fit(Daily_new_cases ~ Date + as.numeric(Date) + factor(month(Date, label = TRUE), ordered = F),
data = US_train_data)
USmodel_fit_arima_boosted
parsnip model object
Fit time: 3.4s
ARIMA(1,1,1)(1,0,2)[7] w/ XGBoost Errors
---
Model 1: Auto ARIMA
Series: outcome
ARIMA(1,1,1)(1,0,2)[7]
Coefficients:
ar1 ma1 sar1 sma1 sma2
-0.2135 -0.5056 0.8590 -0.7097 0.2031
s.e. 0.0848 0.0743 0.0691 0.0968 0.0994
sigma^2 estimated as 148520204: log likelihood=-3713.6
AIC=7439.2 AICc=7439.45 BIC=7462.23
---
Model 2: XGBoost Errors
xgboost::xgb.train(params = list(eta = 0.015, max_depth = 6,
gamma = 0, colsample_bytree = 1, min_child_weight = 2, subsample = 1,
objective = "reg:squarederror"), data = x$data, nrounds = 15,
watchlist = x$watchlist, verbose = 0, nthread = 1)
USmodel_fit_ets <- exp_smoothing() %>%
set_engine(engine = "ets") %>%
fit(Daily_new_cases ~ Date , data = US_train_data)
USmodel_fit_ets
parsnip model object
Fit time: 80ms
ETS(A,A,A)
Call:
forecast::ets(y = outcome, model = model_ets, damped = damping_ets)
Smoothing parameters:
alpha = 0.3533
beta = 0.006
gamma = 0.1348
Initial states:
l = -1537.1621
b = 205.0553
s = 2179.517 -21.3413 -5304.516 -6968.096 990.3832 6167.667
2956.387
sigma: 12388.67
AIC AICc BIC
8506.083 8507.025 8552.171
USmodel_fit_prophet <- prophet_reg() %>%
set_engine(engine = "prophet") %>%
fit(Daily_new_cases ~ Date, data = US_train_data)
USmodel_fit_prophet
parsnip model object
Fit time: 790ms
PROPHET Model
- growth: 'linear'
- n.changepoints: 25
- changepoint.range: 0.8
- yearly.seasonality: 'auto'
- weekly.seasonality: 'auto'
- daily.seasonality: 'auto'
- seasonality.mode: 'additive'
- changepoint.prior.scale: 0.05
- seasonality.prior.scale: 10
- holidays.prior.scale: 10
- logistic_cap: NULL
- logistic_floor: NULL
- extra_regressors: 0
USmodel_fit_lm <- linear_reg() %>%
set_engine("lm") %>%
#fit(Daily_new_cases ~ as.numeric(Date) + factor(month(Date, label = TRUE), ordered = FALSE),
fit(Daily_new_cases ~ as.numeric(Date),
data = US_train_data)
USmodel_fit_lm
parsnip model object
Fit time: 0ms
Call:
stats::lm(formula = Daily_new_cases ~ as.numeric(Date), data = data)
Coefficients:
(Intercept) as.numeric(Date)
-9403557.1 512.7
USmodel_spec_mars <- mars(mode = "regression") %>%
set_engine("earth")
USrecipe_spec <- recipe(Daily_new_cases ~ Date, data = US_train_data) %>%
step_date(Date, features = "month", ordinal = FALSE) %>%
step_mutate(date_num = as.numeric(Date)) %>%
step_normalize(date_num) %>%
step_rm(Date)
USwflw_fit_mars <- workflow() %>%
add_recipe(USrecipe_spec) %>%
add_model(USmodel_spec_mars) %>%
fit(US_train_data)
USwflw_fit_mars
== Workflow [trained] ================================================
Preprocessor: Recipe
Model: mars()
-- Preprocessor ------------------------------------------------------
4 Recipe Steps
* step_date()
* step_mutate()
* step_normalize()
* step_rm()
-- Model -------------------------------------------------------------
Selected 7 of 7 terms, and 3 of 12 predictors
Termination condition: RSq changed by less than 0.001 at 7 terms
Importance: date_num, Date_monthJul, Date_monthApr, ...
Number of terms at each degree of interaction: 1 6 (additive model)
GCV 184188893 RSS 58662556016 GRSq 0.9527876 RSq 0.9560333
USmodel_snaive <- naive_reg() %>%
set_engine(engine = "snaive") %>%
fit(Daily_new_cases ~ Date, data = US_train_data)
USmodel_snaive
parsnip model object
Fit time: 31ms
SNAIVE [7]
--------
Model:
# A tibble: 7 x 2
Date value
<date> <dbl>
1 2020-12-25 97615
2 2020-12-26 226303
3 2020-12-27 155634
4 2020-12-28 174651
5 2020-12-29 200191
6 2020-12-30 233634
7 2020-12-31 235565
USmodel_ETS <- exp_smoothing(
error = "additive",
trend = "additive",
season = "none",) %>%
set_engine(engine = "ets") %>%
fit(Daily_new_cases ~ Date , data = US_train_data)
USmodel_ETS
parsnip model object
Fit time: 30ms
ETS(A,A,N)
Call:
forecast::ets(y = outcome, model = model_ets, damped = damping_ets)
Smoothing parameters:
alpha = 0.3757
beta = 0.0045
Initial states:
l = 0.6736
b = -0.0317
sigma: 13545.84
AIC AICc BIC
8560.675 8560.853 8579.879
# adding the models to table
USmodels_tbl <- modeltime_table(
USmodel_fit_arima_no_boost,
USmodel_fit_arima_boosted,
USmodel_fit_ets,
USmodel_fit_prophet,
USmodel_fit_lm,
USwflw_fit_mars,
USmodel_snaive,
USmodel_ETS
)
USmodels_tbl
# Modeltime Table
# A tibble: 8 x 3
.model_id .model .model_desc
<int> <list> <chr>
1 1 <fit[+]> ARIMA(0,1,2)(1,0,2)[7]
2 2 <fit[+]> ARIMA(1,1,1)(1,0,2)[7] W/ XGBOOST ERRORS
3 3 <fit[+]> ETS(A,A,A)
4 4 <fit[+]> PROPHET
5 5 <fit[+]> LM
6 6 <workflow> EARTH
7 7 <fit[+]> SNAIVE [7]
8 8 <fit[+]> ETS(A,A,N)
UScalibration_tbl <- USmodels_tbl %>%
modeltime_calibrate(new_data = US_test_data)
UScalibration_tbl %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = TRUE,
.title = "US Accuracy Table",
.searchable = FALSE
)
UScalibration_tbl
# Modeltime Table
# A tibble: 8 x 5
.model_id .model .model_desc .type .calibration_da~
<int> <list> <chr> <chr> <list>
1 1 <fit[+]> ARIMA(0,1,2)(1,0,2)[7] Test <tibble [86 x 4~
2 2 <fit[+]> ARIMA(1,1,1)(1,0,2)[7] W/~ Test <tibble [86 x 4~
3 3 <fit[+]> ETS(A,A,A) Test <tibble [86 x 4~
4 4 <fit[+]> PROPHET Test <tibble [86 x 4~
5 5 <fit[+]> LM Test <tibble [86 x 4~
6 6 <workfl~ EARTH Test <tibble [86 x 4~
7 7 <fit[+]> SNAIVE [7] Test <tibble [86 x 4~
8 8 <fit[+]> ETS(A,A,N) Test <tibble [86 x 4~
From the results, it was observed that the forecast and modeltime packages are able to create similar models (e.g. ARIMA, ETS). However, it is easier to code using the modeltime package. Moreover, the results for the models are identical; see the image below (three models whose parameters are aligned are highlighted, and they return the same results).
Upon further investigation, it was observed that the modeltime package includes models from the forecast and prophet packages, so modeltime supports more models. It also integrates well with the timetk package. Therefore, we will use the modeltime package for the model building.
Fitting the forecast to the test data and plotting the visualisation. The visualisation allows users to choose the models they wish to view by clicking on the legend, and to zoom into a specific time period.
UScalibration_tbl %>%
modeltime_forecast(
new_data = US_test_data,
actual_data = US_Confirmed_Cases
) %>%
plot_modeltime_forecast(
.interactive = TRUE
)
We then re-calibrate the models using the full dataset and forecast the next 10 days of confirmed cases. The time series chart shows the forecast outcome, and the accuracy table shows the value of each metric (MAE, MASE etc.).
USrefit_tbl <- UScalibration_tbl %>%
modeltime_refit(data = US_Confirmed_Cases)
USrefit_tbl %>%
modeltime_forecast(h = "10 days", actual_data = US_Confirmed_Cases) %>%
plot_modeltime_forecast(
.legend_max_width = 25, # For mobile screens
.interactive = TRUE,
.title = "US"
)
USrefit_tbl %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = TRUE,
.title = "US"
)
As discussed earlier, we should allow users to choose the period used for the forecast. Therefore, we need to explore the minimum period that the models accept.
We observed that the US dataset has a spike of confirmed cases from Nov 2020 to Feb 2021. This “anomaly” might affect the accuracy of the forecast. We explore using only data from May 2020 to Oct 2020 (six months of data) to see if the forecast result improves.
The following code selects only the May 2020 to Oct 2020 US data, then splits it into training and test sets.
chosendata <- confirmed_cases_country_level %>%
  filter(Country == "US") %>%
  filter_by_time(Date, "2020-05", "2020-10")
traindata <- training(initial_time_split(chosendata, prop = 0.8)) # 80% training and 20% test
testdata <- testing(initial_time_split(chosendata, prop = 0.8))
traindata %>% mutate(type = "train") %>%
  bind_rows(testdata %>% mutate(type = "test")) %>%
  ggplot(aes(x = Date, y = Daily_new_cases, color = type)) +
  geom_line() + ggtitle("US")

model_fit_arima_no_boost <- arima_reg() %>%
set_engine(engine = "auto_arima") %>%
fit(Daily_new_cases ~ Date, data = traindata)
model_fit_arima_no_boost
parsnip model object
Fit time: 371ms
Series: outcome
ARIMA(0,1,1)(0,1,1)[7]
Coefficients:
ma1 sma1
-0.4638 -0.5883
s.e. 0.0752 0.0912
sigma^2 estimated as 17631434: log likelihood=-1357.46
AIC=2720.91 AICc=2721.09 BIC=2729.72
model_fit_arima_boosted <- arima_boost(
min_n = 2,
learn_rate = 0.015
) %>%
set_engine(engine = "auto_arima_xgboost") %>%
fit(Daily_new_cases ~ Date + as.numeric(Date) + factor(month(Date, label = TRUE), ordered = F),
data = traindata)
model_fit_arima_boosted
parsnip model object
Fit time: 500ms
ARIMA(0,1,1)(0,1,1)[7] w/ XGBoost Errors
---
Model 1: Auto ARIMA
Series: outcome
ARIMA(0,1,1)(0,1,1)[7]
Coefficients:
ma1 sma1
-0.4638 -0.5883
s.e. 0.0752 0.0912
sigma^2 estimated as 17631434: log likelihood=-1357.46
AIC=2720.91 AICc=2721.09 BIC=2729.72
---
Model 2: XGBoost Errors
xgboost::xgb.train(params = list(eta = 0.015, max_depth = 6,
gamma = 0, colsample_bytree = 1, min_child_weight = 2, subsample = 1,
objective = "reg:squarederror"), data = x$data, nrounds = 15,
watchlist = x$watchlist, verbose = 0, nthread = 1)
model_fit_ets <- exp_smoothing() %>%
set_engine(engine = "ets") %>%
fit(Daily_new_cases ~ Date , data = traindata)
model_fit_ets
parsnip model object
Fit time: 310ms
ETS(M,Ad,M)
Call:
forecast::ets(y = outcome, model = model_ets, damped = damping_ets)
Smoothing parameters:
alpha = 0.516
beta = 0.0357
gamma = 0.0028
phi = 0.9473
Initial states:
l = 28791.9403
b = -481.1038
s = 1.0864 1.0125 0.9771 0.8876 0.8696 1.012
1.1547
sigma: 0.0999
AIC AICc BIC
3160.327 3163.064 3199.203
model_fit_prophet <- prophet_reg() %>%
set_engine(engine = "prophet") %>%
fit(Daily_new_cases ~ Date, data = traindata)
model_fit_prophet
parsnip model object
Fit time: 51ms
PROPHET Model
- growth: 'linear'
- n.changepoints: 25
- changepoint.range: 0.8
- yearly.seasonality: 'auto'
- weekly.seasonality: 'auto'
- daily.seasonality: 'auto'
- seasonality.mode: 'additive'
- changepoint.prior.scale: 0.05
- seasonality.prior.scale: 10
- holidays.prior.scale: 10
- logistic_cap: NULL
- logistic_floor: NULL
- extra_regressors: 0
#Multivariate Adaptive Regression Spline model
model_spec_mars <- mars(mode = "regression") %>%
set_engine("earth")
recipe_spec <- recipe(Daily_new_cases ~ Date, data = traindata) %>%
step_date(Date, features = "month", ordinal = FALSE) %>%
step_mutate(date_num = as.numeric(Date)) %>%
step_normalize(date_num) %>%
step_rm(Date)
wflw_fit_mars <- workflow() %>%
add_recipe(recipe_spec) %>%
add_model(model_spec_mars) %>%
fit(traindata)
wflw_fit_mars
== Workflow [trained] ================================================
Preprocessor: Recipe
Model: mars()
-- Preprocessor ------------------------------------------------------
4 Recipe Steps
* step_date()
* step_mutate()
* step_normalize()
* step_rm()
-- Model -------------------------------------------------------------
Selected 6 of 13 terms, and 2 of 12 predictors
Termination condition: RSq changed by less than 0.001 at 13 terms
Importance: date_num, Date_monthJul, Date_monthFeb-unused, ...
Number of terms at each degree of interaction: 1 5 (additive model)
GCV 28665468 RSS 3606778880 GRSq 0.8892415 RSq 0.9038943
model_fit_lm <- linear_reg() %>%
set_engine("lm") %>%
#fit(Daily_new_cases ~ as.numeric(Date) + factor(month(Date, label = TRUE), ordered = FALSE),
fit(Daily_new_cases ~ as.numeric(Date),
data = traindata)
model_fit_lm
parsnip model object
Fit time: 0ms
Call:
stats::lm(formula = Daily_new_cases ~ as.numeric(Date), data = data)
Coefficients:
(Intercept) as.numeric(Date)
-3548183.4 194.4
models_tbl <- modeltime_table(
model_fit_arima_no_boost,
model_fit_arima_boosted,
model_fit_ets,
model_fit_prophet,
model_fit_lm,
wflw_fit_mars
)
calibration_tbl <- models_tbl %>%
modeltime_calibrate(new_data = testdata)
calibration_tbl %>%
modeltime_forecast(
new_data = testdata,
actual_data = chosendata
) %>%
plot_modeltime_forecast(
.interactive = TRUE
)
calibration_tbl %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = TRUE,
.title = " Accuracy Table",
.searchable = FALSE
)
To check whether the models accept only one month of data, May 2020 data is chosen for this test.
The outcome shows that certain models (e.g. SNAIVE) require a longer date period for forecasting; as a result, they are unable to produce a forecast.
chosendata <- confirmed_cases_country_level %>%
  filter(Country == "US") %>%
  filter_by_time(Date, "2020-05", "2020-05")
traindata <- training(initial_time_split(chosendata, prop = 0.8)) # 80% training and 20% test
testdata <- testing(initial_time_split(chosendata, prop = 0.8))
traindata %>% mutate(type = "train") %>%
  bind_rows(testdata %>% mutate(type = "test")) %>%
  ggplot(aes(x = Date, y = Daily_new_cases, color = type)) +
  geom_line() + ggtitle("US")
model_fit_arima_no_boost <- arima_reg() %>%
set_engine(engine = "auto_arima") %>%
fit(Daily_new_cases ~ Date, data = traindata)
model_fit_arima_no_boost
parsnip model object
Fit time: 120ms
Series: outcome
ARIMA(1,1,0)(0,1,0)[7]
Coefficients:
ar1
-0.8015
s.e. 0.1686
sigma^2 estimated as 5360153: log likelihood=-146.66
AIC=297.31 AICc=298.23 BIC=298.86
model_fit_arima_boosted <- arima_boost(
min_n = 2,
learn_rate = 0.015
) %>%
set_engine(engine = "auto_arima_xgboost") %>%
fit(Daily_new_cases ~ Date + as.numeric(Date) + factor(month(Date, label = TRUE), ordered = FALSE),
    data = traindata)
model_fit_arima_boosted
parsnip model object
Fit time: 171ms
ARIMA(1,1,0)(0,1,0)[7]
---
Model 1: Auto ARIMA
Series: outcome
ARIMA(1,1,0)(0,1,0)[7]
Coefficients:
ar1
-0.8015
s.e. 0.1686
sigma^2 estimated as 5360153: log likelihood=-146.66
AIC=297.31 AICc=298.23 BIC=298.86
---
Model 2: XGBoost Errors
NULL
model_fit_ets <- exp_smoothing() %>%
set_engine(engine = "ets") %>%
fit(Daily_new_cases ~ Date , data = traindata)
model_fit_ets
parsnip model object
Fit time: 271ms
ETS(M,N,N)
Call:
forecast::ets(y = outcome, model = model_ets, damped = damping_ets)
Smoothing parameters:
alpha = 0.6722
Initial states:
l = 32127.6278
sigma: 0.137
AIC AICc BIC
469.4644 470.6644 472.9986
model_fit_prophet <- prophet_reg() %>%
set_engine(engine = "prophet") %>%
fit(Daily_new_cases ~ Date, data = traindata)
model_fit_prophet
parsnip model object
Fit time: 180ms
PROPHET Model
- growth: 'linear'
- n.changepoints: 18
- changepoint.range: 0.8
- yearly.seasonality: 'auto'
- weekly.seasonality: 'auto'
- daily.seasonality: 'auto'
- seasonality.mode: 'additive'
- changepoint.prior.scale: 0.05
- seasonality.prior.scale: 10
- holidays.prior.scale: 10
- logistic_cap: NULL
- logistic_floor: NULL
- extra_regressors: 0
#Multivariate Adaptive Regression Spline model
model_spec_mars <- mars(mode = "regression") %>%
set_engine("earth")
recipe_spec <- recipe(Daily_new_cases ~ Date, data = traindata) %>%
step_date(Date, features = "month", ordinal = FALSE) %>%
step_mutate(date_num = as.numeric(Date)) %>%
step_normalize(date_num) %>%
step_rm(Date)
wflw_fit_mars <- workflow() %>%
add_recipe(recipe_spec) %>%
add_model(model_spec_mars) %>%
fit(traindata)
wflw_fit_mars
== Workflow [trained] ================================================
Preprocessor: Recipe
Model: mars()
-- Preprocessor ------------------------------------------------------
4 Recipe Steps
* step_date()
* step_mutate()
* step_normalize()
* step_rm()
-- Model -------------------------------------------------------------
Selected 2 of 3 terms, and 1 of 12 predictors
Termination condition: RSq changed by less than 0.001 at 3 terms
Importance: date_num, Date_monthFeb-unused, Date_monthMar-unused, ...
Number of terms at each degree of interaction: 1 1 (additive model)
GCV 9755897 RSS 179264608 GRSq 0.2703689 RSq 0.3917442
model_fit_lm <- linear_reg() %>%
set_engine("lm") %>%
fit(Daily_new_cases ~ as.numeric(Date),
data = traindata)
model_fit_lm
parsnip model object
Fit time: 0ms
Call:
stats::lm(formula = Daily_new_cases ~ as.numeric(Date), data = data)
Coefficients:
(Intercept) as.numeric(Date)
4973584.8 -269.1
models_tbl <- modeltime_table(
model_fit_arima_no_boost,
model_fit_arima_boosted,
model_fit_ets,
model_fit_prophet,
model_fit_lm,
wflw_fit_mars
)
calibration_tbl <- models_tbl %>%
modeltime_calibrate(new_data = testdata)
calibration_tbl %>%
modeltime_forecast(
new_data = testdata,
actual_data = chosendata
) %>%
plot_modeltime_forecast(
.interactive = TRUE
)
calibration_tbl %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = TRUE,
.title = "Accuracy Table",
.searchable = FALSE
)
Next, we forecast the next 10 days of Mar 2021, using models built on the May 2020 dataset.
The outcome shows that the models require the dates to be continuous. As seen from the plot, despite refitting on the full US dataset to predict the Mar 2021 figures, the models calculated the outcome for Jun 2020. This is a limitation of the models: users have to include the most recent dates in the chosen period.
refit_tbl <- calibration_tbl %>%
modeltime_refit(data = US_Confirmed_Cases)
refit_tbl %>%
modeltime_forecast(h = "10 days", actual_data = US_Confirmed_Cases) %>%
plot_modeltime_forecast(
.legend_max_width = 25, # For mobile screens
.interactive = TRUE,
.title = "US"
)
refit_tbl %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = TRUE,
.title = "US"
)
The results suggest that "anomalies" in the dataset do affect the forecast accuracy. For example, the Prophet RMSE is 33675.32 for the Jul to Oct 2020 dataset, but 241226.03 for the full set of US data. To improve the accuracy of the models, the application should therefore allow users to choose the period used for forecasting. The tests also show that the models can accommodate as little as one month of data, although some models require a longer dataset and cannot be used if a short date range is chosen; this is a limitation on the date range. It was also observed that models built on May 2020 data cannot forecast Mar 2021, so the most recent data has to be included in the forecast.
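Since anomalies in the dataset degrade forecast accuracy, the application could help users spot them before choosing a training period. A sketch using timetk's anomaly diagnostics, assuming the same `chosendata` tibble with `Date` and `Daily_new_cases` columns:

```r
# Hedged sketch: flag anomalous daily case counts with timetk.
library(timetk)

chosendata %>%
  plot_anomaly_diagnostics(
    .date_var    = Date,
    .value       = Daily_new_cases,
    .alpha       = 0.05,   # sensitivity of the anomaly detection
    .interactive = TRUE
  )
```

Points flagged as anomalies in this plot are candidates for exclusion when the user selects the forecasting period.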
Currently, all the models use their default settings. The parameters within each model can be changed to alter the forecast outcome. The subsequent section explores the parameter settings for the Prophet model.
Prophet is an open-source library published by Facebook that is based on decomposable (trend + seasonality + holidays) models. It allows us to make time series predictions with good accuracy using simple, intuitive parameters, and supports the inclusion of custom seasonality and holiday effects.
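In modeltime, these Prophet components map onto arguments of `prophet_reg()`, which are the parameters we intend to expose in the application. A sketch of the main tunable arguments (the values shown are illustrative, not recommendations):

```r
# Hedged sketch: the main prophet_reg() parameters a user could tune.
library(modeltime)
library(parsnip)

model_fit_prophet_tuned <- prophet_reg(
  growth            = "linear",  # trend type: "linear" or "logistic"
  changepoint_num   = 25,        # number of potential trend changepoints
  changepoint_range = 0.8,       # fraction of history used for changepoints
  seasonality_weekly = TRUE,     # weekly seasonal component
  season            = "additive" # "additive" or "multiplicative" seasonality
) %>%
  set_engine(engine = "prophet") %>%
  fit(Daily_new_cases ~ Date, data = traindata)
```

The sections below vary a subset of these parameters, starting from the defaults.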
Default setting
model_fit_prophet <- prophet_reg() %>%
set_engine(engine = "prophet") %>%
fit(Daily_new_cases ~ Date, data = traindata)
modeltime_table(
model_fit_prophet
) %>%
modeltime_calibrate(new_data = testdata) %>%
modeltime_forecast(
new_data = testdata,
actual_data = chosendata,
conf_interval = 0.95
) %>%
plot_modeltime_forecast(.interactive = FALSE)
modeltime_table(
model_fit_prophet
) %>%
modeltime_calibrate(new_data = testdata) %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = TRUE,
.title = "Accuracy Table",
.searchable = FALSE
)
Since the US data shows changing variance over time, we can stabilise the variance by applying a log transformation using the log() function.
model_fit_prophet <- prophet_reg() %>%
set_engine(engine = "prophet") %>%
fit(log(Daily_new_cases) ~ Date, data = traindata)
modeltime_table(
model_fit_prophet
) %>%
modeltime_calibrate(new_data = testdata) %>%
modeltime_forecast(
new_data = testdata,
actual_data = chosendata,
conf_interval = 0.95
) %>%
plot_modeltime_forecast(.interactive = FALSE)
modeltime_table(
model_fit_prophet
) %>%
modeltime_calibrate(new_data = testdata) %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = TRUE,
.title = "Accuracy Table",
.searchable = FALSE
)
Next, we change the seasonality to daily and switch to a multiplicative model.
model_fit_prophet <- prophet_reg(seasonality_daily = TRUE, season = "multiplicative") %>%
set_engine(engine = "prophet") %>%
fit(Daily_new_cases ~ Date, data = traindata)
modeltime_table(
model_fit_prophet
) %>%
modeltime_calibrate(new_data = testdata) %>%
modeltime_forecast(
new_data = testdata,
actual_data = chosendata,
conf_interval = 0.95
) %>%
plot_modeltime_forecast(.interactive = FALSE)
modeltime_table(
model_fit_prophet
) %>%
modeltime_calibrate(new_data = testdata) %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = TRUE,
.title = "Accuracy Table",
.searchable = FALSE
)
Changing the parameters of a model does change the forecast outcome. Users should explore the dataset and adjust the model parameters to find the best forecast. The application will therefore allow users to change the parameters of the models.
In this sub-module, the Shiny app will allow users to (1) explore the dataset and (2) choose between different forecasting models. We will use the timetk and modeltime packages for the development. For EDA, users will be able to view the time series plot, STL plot, anomaly plot and ACF/PACF plot. With these plots, users will gain a better understanding of the dataset and can choose the right time period and parameters for the forecast.
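The four EDA views listed above all have direct timetk equivalents, which is one reason we chose the package. A sketch of the four diagnostics, assuming a `chosendata` tibble with `Date` and `Daily_new_cases` columns:

```r
# Hedged sketch: the four planned EDA plots via timetk.
library(timetk)

# Time series plot
chosendata %>% plot_time_series(Date, Daily_new_cases, .interactive = TRUE)

# STL decomposition (trend / seasonality / remainder)
chosendata %>% plot_stl_diagnostics(Date, Daily_new_cases)

# Anomaly plot
chosendata %>% plot_anomaly_diagnostics(Date, Daily_new_cases)

# ACF / PACF plot
chosendata %>% plot_acf_diagnostics(Date, Daily_new_cases)
```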
There will be a total of two pages under the Time Series Analysis tab.
Page 1 will allow users to explore the time series dataset, and Page 2 will show the forecast result.
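A minimal sketch of this two-page layout (the input ids and choices here are hypothetical placeholders; the actual application layout may differ):

```r
# Hedged sketch: skeleton of the Time Series Analysis tab in Shiny.
library(shiny)

ui <- navbarPage(
  "Time Series Analysis",
  tabPanel("Explore",                      # Page 1: EDA plots
    sidebarLayout(
      sidebarPanel(
        selectInput("country", "Country", choices = NULL),  # filled at runtime
        dateRangeInput("period", "Period")
      ),
      mainPanel(plotOutput("eda_plot"))
    )
  ),
  tabPanel("Forecast",                     # Page 2: forecast result
    sidebarLayout(
      sidebarPanel(
        selectInput("model", "Model",
                    choices = c("ARIMA", "Boosted ARIMA", "ETS",
                                "Prophet", "Linear", "MARS"))
      ),
      mainPanel(plotOutput("forecast_plot"))
    )
  )
)
```

The server side would refit the chosen model on the user-selected country and period, then render the `modeltime_forecast()` output.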

