# CoronaDash app use case - Clustering countries' COVID-19 active cases trajectories

The spread of COVID-19 has hit the whole world, including the community of mathematicians, statisticians, machine learning researchers, and people in related fields.
These experts want to help understand, for example, future trends (forecasts) of the coronavirus spread.
My motivation, in this case, was to create an **interactive dashboard** about COVID-19 that informs about various scenarios in every country and compares them through **data mining methods**.

I created the **CoronaDash** `shinydashboard` application, which is hosted on the **petolau.shinyapps.io** RStudio platform.
The dashboard provides various data mining and visualization techniques for **comparing countries’ COVID-19 statistics**, such as:

- extrapolating total confirmed cases by exponential smoothing model,
- trajectories of cases/ deaths spread,
- multidimensional clustering of countries’ data/ statistics - with a dendrogram and a table of cluster averages,
- aggregated views for the whole World,
- hierarchical clustering of countries’ trajectories based on DTW distance and preprocessing by SMA (+ normalization), for fast comparison of a large number of countries’ COVID-19 magnitudes and trends.

This blog post is about the last item in the list above - **clustering of countries’ trajectories**.
This use case is challenging because it involves **clustering time series with different lengths**.

#### CovidR contest

I also submitted my Shiny application to the **CovidR Contest**, an interesting initiative of the eRum 2020 organizers.

## Preprocessing COVID-19 open-data

First, load all the packages needed for the analysis.

I use data from several sources: case and death counts come from the Johns Hopkins CSSE GitHub repository, recovered-case counts from the GitHub repository by ulklc, test data from the COVID19 API, and population sizes from worldometers.

I prepared a snapshot of the time series data as of *2020-05-24*, with various statistics also computed **per 1 million population** (which makes countries much easier to compare), so let’s read it.

Since I want to analyze (cluster) the trajectories of countries’ active cases, I need to set a **starting position for every country’s time series** - in this case (as in many other analyses out there) the *100th* cumulative confirmed case is set as the starting point.
I will also use only the top *82* affected countries (+ Slovakia as my home country) for the whole analysis.
Let’s transform our data into ‘since the 100th case’ trajectories for each country (with the same lengths!).

You can see that we now have time series of the same length for every country.
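The transformation can be sketched in base R on a tiny made-up data set. The column names (`Country`, `Confirmed`, `Active`), the numbers, and the padding of shorter trajectories with `NA` to reach one common length are all assumptions for illustration, not the code used in the app.

```r
# Toy cumulative confirmed cases and active cases per 1 million population
# (made-up numbers, not the real snapshot).
toy <- data.frame(
  Country   = rep(c("A", "B"), times = c(6, 5)),
  Confirmed = c(20, 80, 120, 300, 700, 900,   50, 99, 100, 150, 400),
  Active    = c(1.0, 3.5, 5.0, 11.0, 24.0, 30.0,   0.5, 1.1, 1.2, 1.9, 4.4)
)

# Keep only the days from the 100th cumulative confirmed case onwards.
since_100 <- lapply(split(toy, toy$Country), function(d) {
  d$Active[d$Confirmed >= 100]
})

# Pad the shorter trajectories with NA so that all of them share one length
# (assumed approach; the real observations still differ in length).
max_len <- max(lengths(since_100))
trajectories <- sapply(since_100, function(x) c(x, rep(NA, max_len - length(x))))

print(trajectories)
```

Each column is one country and each row is a day since the 100th case, so countries that crossed the threshold later simply end with missing values.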

Now comes the preparation of the **trajectory data for clustering**.
We have to remove missing rows/ columns, if there are any, and I will preprocess the time series with a **Simple Moving Average** (SMA) to smooth the trajectories a little (it removes noise). The function `repr_sma` is implemented in my **TSrepr package**.
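To show what SMA smoothing does to a single series, here is a minimal base R sketch. The helper `sma()` is a hypothetical stand-in written for this illustration, it is not the `repr_sma` implementation, and the window of 3 is an arbitrary choice.

```r
# Hypothetical base R helper: simple moving average over a sliding window.
# Each output value is the mean of `order` consecutive observations,
# so the result has length(x) - order + 1 values.
sma <- function(x, order) {
  stopifnot(order >= 1, length(x) >= order)
  cs <- cumsum(c(0, x))
  (cs[(order + 1):length(cs)] - cs[1:(length(x) - order + 1)]) / order
}

noisy <- c(10, 14, 9, 15, 20, 18, 25)   # made-up noisy trajectory
smoothed <- sma(noisy, order = 3)
print(smoothed)
```

The day-to-day zigzag in `noisy` disappears in `smoothed`, while the overall upward trend is kept.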

## Clustering trajectories with the hierarchical method with DTW distance

Since we use data with different lengths, we cannot use Euclidean (or Manhattan, etc.) distance, which compares series point by point and therefore requires equal lengths.
This is where the **Dynamic Time Warping** (DTW) distance measure comes in very handy: it can compute distances between time series with various lags and different lengths.

As a clustering method, I picked **hierarchical clustering with the Ward criterion** because of its nice post-analysis tools such as **dendrograms**.

Let’s define a clustering function with DTW distance, including the additional data preprocessing steps necessary for the `dtwclust` package. I also let the user vary the number of clusters and the normalization of the time series before clustering.
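The idea behind such a function can be sketched without any packages. The code below is a base R stand-in for illustration only: `dtw_dist()` and `cluster_trajectories()` are hypothetical names, the DTW variant is the plain unconstrained one with absolute difference as the local cost, and z-score normalization is an assumed choice. The app itself relies on `dtwclust`, which is not used here.

```r
# Plain DTW distance (absolute difference as local cost, no window constraint).
dtw_dist <- function(x, y) {
  n <- length(x); m <- length(y)
  cost <- matrix(Inf, n + 1, m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      # cheapest way to reach (i, j): diagonal match, or a step along x or y
      step <- min(cost[i, j], cost[i, j + 1], cost[i + 1, j])
      cost[i + 1, j + 1] <- abs(x[i] - y[j]) + step
    }
  }
  cost[n + 1, m + 1]
}

# Hypothetical helper: cluster a list of series (lengths may differ) into k groups.
cluster_trajectories <- function(series, k, normalize = FALSE) {
  series <- lapply(series, function(x) x[!is.na(x)])   # drop missing values
  if (normalize) {
    series <- lapply(series, function(x) (x - mean(x)) / sd(x))
  }
  n <- length(series)
  d <- matrix(0, n, n, dimnames = list(names(series), names(series)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d[i, j] <- d[j, i] <- dtw_dist(series[[i]], series[[j]])
    }
  }
  hc <- hclust(as.dist(d), method = "ward.D2")          # Ward criterion
  list(hclust = hc, distmat = d, cluster = cutree(hc, k = k))
}

# Made-up trajectories of different lengths: two rising, two falling.
toy <- list(
  up_short   = c(1, 2, 4, 8),
  up_long    = c(1, 2, 3, 4, 6, 8),
  down_short = c(8, 4, 2, 1),
  down_long  = c(8, 7, 5, 3, 2, 1)
)
res <- cluster_trajectories(toy, k = 2, normalize = TRUE)
print(res$cluster)
```

The rising series end up in one cluster and the falling ones in the other, even though the series within a cluster have different lengths. The nested loops are fine for a few dozen countries; an optimized implementation is the better choice for anything larger.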

Let’s cluster the data into *14* clusters with **normalization of countries’ trajectories**, in order to extract clusters with the **same trends (curves)** - not magnitudes! Normalization is a very important step before every clustering/ classification task.
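Why normalization separates trend from magnitude can be seen on two made-up series. The sketch assumes z-score normalization, which is one common choice and not necessarily the one used in the app, and `z_norm()` is a hypothetical helper name.

```r
# Two made-up trajectories with the same shape but a 100x different magnitude.
small <- c(1, 2, 4, 8, 16)
large <- 100 * small

# z-score normalization: subtract the mean, divide by the standard deviation.
z_norm <- function(x) (x - mean(x)) / sd(x)

# Largest point-wise gap between the two series, before and after normalization.
raw_gap  <- max(abs(small - large))
norm_gap <- max(abs(z_norm(small) - z_norm(large)))
print(c(raw = raw_gap, normalized = norm_gap))
```

Without normalization, any distance measure would put these two series far apart; after it they are practically identical, so only the shape of the curve drives the clustering.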

Let’s prepare the clustered data for visualization:

You can also search for your preferred country in the `datatable`.

Finally, here comes the **plot of cluster members** made with the `ggplot2` package (a log scale is used for better comparison of trends):

We can see nicely distinguishable clusters with various active cases trends (settled, rapid/ steady increase/ decrease).

Let’s check some clusters interactively with the `dygraphs` package:

Here, I picked two clusters (2 and 6) with nice decreasing trends - they contain countries mostly from Central and Western Europe.

Let’s see also clusters with increasing trends of active cases per 1 mil. population:

We can see that, as of this day, the countries with an increasing trend of active cases are mostly in Western Asia, South America, and Africa.

## Post-analysis visualizations with dendrograms and MDS

To see the overall connectivity between countries’ clusters as a tree, we can use, for example, a **dendrogram**.
Here, we can simply use the clustering result object to generate the tree:
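As a minimal base R sketch of this step, a hierarchical clustering result can be turned into a dendrogram object and drawn. The distance matrix below is made up and stands in for the stored DTW distances, and base graphics are an assumption, so the plot will look different from the one in the app.

```r
# Made-up symmetric distance matrix standing in for the DTW distances.
countries <- c("A", "B", "C", "D", "E")
d <- matrix(c(0, 1, 6, 7, 8,
              1, 0, 5, 6, 7,
              6, 5, 0, 2, 3,
              7, 6, 2, 0, 1,
              8, 7, 3, 1, 0),
            nrow = 5, dimnames = list(countries, countries))

hc <- hclust(as.dist(d), method = "ward.D2")
dend <- as.dendrogram(hc)

# Leaves in the order in which they appear when the tree is drawn.
print(labels(dend))

# Draw the tree only in an interactive session.
if (interactive()) plot(dend, horiz = TRUE)
```

Countries that merge low in the tree (here A with B, and D with E) are the closest pairs, and cutting the tree at a given height yields the clusters.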

To see, for example, the connections between countries in a 2D scatter plot, we can use the **dimensionality reduction method Multidimensional Scaling (MDS)**. It uses a (stored) distance matrix between objects - and we have one in our clustering result object (yay, `clust_res@distmat`!). For the country labels, I use the great `ggrepel` package.
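Classical MDS is available in base R as `cmdscale()`. The sketch below uses made-up 2D points so that the result is easy to check; with real DTW distances, which are not Euclidean, the 2D coordinates only approximate the stored distances. The names `pts` and `mds` are hypothetical.

```r
# Made-up points; their Euclidean distances stand in for the distance matrix.
pts <- cbind(c(0, 1, 5, 6), c(0, 0, 3, 3))
rownames(pts) <- c("A", "B", "C", "D")
d <- dist(pts)

# Classical MDS: find 2D coordinates whose distances match `d` as well as possible.
coords <- cmdscale(d, k = 2)

# Data frame ready for a labelled scatter plot (one row per country).
mds <- data.frame(Country = rownames(coords), x = coords[, 1], y = coords[, 2])
print(mds)

# How well do the 2D coordinates reproduce the original distances?
print(max(abs(as.vector(dist(coords)) - as.vector(d))))
```

The coordinates themselves are arbitrary up to rotation and reflection; only the distances between the points carry meaning.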

In both graphs (dendrogram and MDS scatter plot), we can clearly see how far (or close) countries are from each other based on **DTW distance**.

## Summary

In this blog post, I showed you how to **cluster time series with different lengths** with DTW distance and hierarchical method, and how to visualize the results of such an analysis.
As a use case, I picked data of **countries’ COVID-19 active cases trajectories computed per 1 mil. population** to see trends of the disease spread.

### Sources

The application is running on the **petolau.shinyapps.io** platform, and the source code of the whole app is in the **CoronaDash GitHub** repository.

*Take care of yourself!*