Time series data mining in R. Bratislava, Slovakia.
CoronaDash app use case - Clustering countries' COVID-19 active cases trajectories
Written on 2020-05-27
The spread of COVID-19 hit the whole world, and with it the community of mathematicians, statisticians, machine learning researchers, and people in related fields.
These experts want to help to understand, for example, future trends (forecasts) of the coronavirus spread.
My motivation, in this case, was to create an interactive dashboard about COVID-19 that informs about various scenarios in every country and compares them through data mining methods.
I created the CoronaDash shinydashboard application, which is hosted on the petolau.shinyapps.io RStudio platform.
The dashboard provides various data mining/visualization techniques for comparing countries’ COVID-19 data statistics, such as:
extrapolating total confirmed cases by exponential smoothing model,
trajectories of cases/ deaths spread,
multidimensional clustering of countries’ data/statistics, with a dendrogram and a table of cluster averages,
aggregated views for the whole World,
hierarchical clustering of countries’ trajectories based on DTW distance and preprocessing by SMA (+ normalization), for fast comparison of a large number of countries’ COVID-19 magnitudes and trends.
This blog post is about the last item in the list above: clustering of countries’ trajectories.
This use case is challenging because it involves clustering time series of different lengths.
CovidR contest
I also submitted my shiny application to an interesting initiative by the eRum 2020 organizers, the CovidR Contest.
Preprocessing COVID-19 open-data
First, load all the packages needed for the analysis.
I prepared time series data as a snapshot from 2020-05-24, with various statistics also computed per 1 million population (which makes countries much easier to compare), so let’s read them.
Since I want to analyze (cluster) the trajectories of countries’ active cases, I need to set a starting point for every country’s time series. In this case (as in many other analyses out there), the 100th cumulative confirmed case is set as the starting point.
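The alignment step can be sketched in base R as follows. This is only an illustration on made-up numbers: the data frame and its column names (Country, DateRep, Cases_cumsum, Days_since_first100) are hypothetical and not necessarily those used in the dashboard data.

```r
# Hypothetical long-format data: one row per country and day.
cases <- data.frame(
  Country = rep(c("A", "B"), each = 5),
  DateRep = rep(as.Date("2020-03-01") + 0:4, times = 2),
  Cases_cumsum = c(20, 80, 120, 300, 700,
                   5, 40, 90, 150, 400)
)

# Keep only the days on which a country already had at least 100 cumulative cases.
aligned <- cases[cases$Cases_cumsum >= 100, ]

# Number the remaining days within each country: 1 = first day at or above 100 cases.
aligned$Days_since_first100 <- ave(seq_len(nrow(aligned)), aligned$Country,
                                   FUN = seq_along)
aligned
```

After this step every country starts at day 1, regardless of the calendar date on which it crossed the threshold.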
I will also use only the top 82 affected countries (+ Slovakia as my home country) for the whole analysis.
Let’s transform our data into ‘since the first 100th case’ trajectories for each country (with the same lengths!).
You can see that we nicely got time series of the same length for every country.
Now comes the preparation of the trajectories’ data for clustering.
We have to remove missing rows/columns, if there are any. I will also preprocess the time series with a Simple Moving Average (SMA) to smooth the trajectories a little (it removes noise). The function repr_sma is implemented in my TSrepr package.
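One way to picture such a table is a wide matrix with one column per country, where countries that crossed the 100-case threshold later are padded with missing values at the end. The sketch below assumes exactly that layout (it is an assumption, and the toy values are invented), and shows how the observed part of each trajectory can be recovered again.

```r
# Toy trajectories of different observed lengths (hypothetical values).
trajectories <- list(A = c(12, 30, 70), B = c(15, 40))

# Pad every trajectory with NA up to the longest one, giving equal-length columns.
max_len <- max(lengths(trajectories))
wide <- sapply(trajectories, function(x) {
  c(x, rep(NA_real_, max_len - length(x)))
})
wide

# Recover the observed (non-missing) part of each column.
observed <- lapply(as.data.frame(wide), function(col) col[!is.na(col)])
lengths(observed)
```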
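To show what SMA smoothing does, here is a base R stand-in. It is not the TSrepr implementation; the helper name sma_sketch and the window length of 3 are illustrative choices only.

```r
# Simple moving average: each output value is the mean of the last `order` inputs.
sma_sketch <- function(x, order) {
  smoothed <- stats::filter(x, rep(1 / order, order), sides = 1)
  # The first (order - 1) positions have no full window and come back as NA.
  as.numeric(smoothed[!is.na(smoothed)])
}

noisy <- c(1, 2, 3, 4, 5)
sma_sketch(noisy, order = 3)
```

The smoothed series is shorter than the input by order - 1 values, which is one more reason why the distance measure used later has to cope with series of different lengths.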
Clustering trajectories with the hierarchical method with DTW distance
Since we use data with different lengths, we have to use a distance measure other than Euclidean (or Manhattan, etc.).
This is where the Dynamic Time Warping (DTW) distance measure comes in very handy: it can compute distances between time series with various lags and different lengths.
Let’s define a clustering function with DTW distance, including the additional data preprocessing steps necessary for the dtwclust package. I also let the user vary the number of clusters and choose whether to normalize the time series before clustering.
Let’s cluster the data into 14 clusters, with normalization of countries’ trajectories, in order to extract clusters with the same trends (curves), not magnitudes! This is a very important step before every clustering/classification task.
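To make the idea concrete, below is a minimal textbook version of DTW in base R, using the absolute difference as the local cost and no window constraint. It is a teaching sketch, not the optimized routine from dtwclust, and the function name dtw_distance is made up for this example.

```r
# Plain dynamic-programming DTW between two numeric vectors.
dtw_distance <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # acc[i + 1, j + 1] = cheapest cost of aligning x[1:i] with y[1:j]
  acc <- matrix(Inf, nrow = n + 1, ncol = m + 1)
  acc[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- abs(x[i] - y[j])
      acc[i + 1, j + 1] <- cost + min(acc[i, j + 1],  # advance in x only
                                      acc[i + 1, j],  # advance in y only
                                      acc[i, j])      # advance in both
    }
  }
  acc[n + 1, m + 1]
}

short <- c(1, 2, 3)
long  <- c(1, 2, 2, 3)  # same shape, one value repeated
dtw_distance(short, long)
```

Because one point of a series may be matched to several points of the other, the two vectors do not need the same length, which is exactly what Euclidean distance cannot handle.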
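The overall shape of such a function can be sketched in base R. This is a stand-in under stated assumptions, not the dtwclust-based function from the post: the names cluster_trajectories_sketch and dtw_cost are hypothetical, z-score normalization and Ward linkage ("ward.D2") are choices made for this illustration, and the toy series are invented.

```r
# Compact DTW (absolute-difference cost) that keeps only two rows of the cost table.
dtw_cost <- function(x, y) {
  prev <- c(0, rep(Inf, length(y)))
  for (xi in x) {
    curr <- rep(Inf, length(y) + 1)
    for (j in seq_along(y)) {
      curr[j + 1] <- abs(xi - y[j]) + min(prev[j + 1], curr[j], prev[j])
    }
    prev <- curr
  }
  prev[length(y) + 1]
}

# series: named list of numeric vectors (lengths may differ); k: number of clusters.
cluster_trajectories_sketch <- function(series, k, normalize = FALSE) {
  if (normalize) {
    # z-score each series so that only its shape (trend) matters
    series <- lapply(series, function(x) (x - mean(x)) / sd(x))
  }
  n <- length(series)
  distmat <- matrix(0, n, n, dimnames = list(names(series), names(series)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      distmat[i, j] <- distmat[j, i] <- dtw_cost(series[[i]], series[[j]])
    }
  }
  hc <- hclust(as.dist(distmat), method = "ward.D2")  # linkage is an assumption
  list(hc = hc, cluster = cutree(hc, k = k), distmat = distmat)
}

toy <- list(
  rise_a = c(1, 2, 4, 8, 16),
  rise_b = c(10, 20, 40, 80, 160, 320),
  fall_a = c(16, 8, 4, 2, 1),
  fall_b = c(300, 160, 80, 40, 20, 10)
)
res <- cluster_trajectories_sketch(toy, k = 2, normalize = TRUE)
res$cluster
```

With normalization switched on, the rising series end up together and the falling series end up together, even though their magnitudes differ by more than an order of magnitude.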
Let’s prepare clustered data for visualization:
You can also search for your preferred country in the datatable.
Finally, here comes the plot of cluster members made with the ggplot2 package (a log scale is used for better comparison of trends):
We can see nicely distinguishable clusters with various trends in active cases (settled, rapid/steady increase/decrease).
Here, I picked two clusters (2 and 6) with nice decreasing trends; they contain countries mostly from Central/Western Europe.
Let’s also look at clusters with increasing trends of active cases per 1 mil. population:
We can see that, as of this day, the countries with an increasing trend of active cases are mostly in Western Asia, South America, and Africa.
Post-analysis visualizations with dendrograms and MDS
In order to see the whole connectivity between countries’ clusters as a tree, we can use, for example, a dendrogram.
Here, we can simply use the clustering result object to generate the tree:
In order to see, for example, connections between countries in a 2D scatter plot, we can use the dimensionality reduction method Multidimensional scaling (MDS). It uses a (stored) distance matrix between objects, and we have it in our clustering result object (Yay, clust_res@distmat!). For country labels, I use the great ggrepel package.
In both graphs (dendrogram and MDS scatter plot), we can clearly see how far (or close) countries are from each other based on DTW distance.
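As a base R illustration of the same idea, the sketch below builds a tree from a small, invented distance matrix with hclust and as.dendrogram. The object names and the "average" linkage are assumptions for this example; in the post the tree comes from the dtwclust result object instead.

```r
# Invented pairwise distances between four placeholder countries.
countries <- c("A", "B", "C", "D")
toy_dist <- matrix(c(0, 1, 6, 7,
                     1, 0, 5, 6,
                     6, 5, 0, 2,
                     7, 6, 2, 0),
                   nrow = 4, dimnames = list(countries, countries))

hc <- hclust(as.dist(toy_dist), method = "average")
dend <- as.dendrogram(hc)

# Horizontal layout keeps country labels readable when there are many leaves.
plot(dend, horiz = TRUE, main = "Toy dendrogram")
```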
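A minimal sketch of the MDS step with base R's cmdscale is shown below. The distance matrix here is a toy one built from four points on a 3 x 4 rectangle, standing in for clust_res@distmat; the names distmat and coords are illustrative, and plotting with ggplot2 and ggrepel is left out to keep the example dependency-free.

```r
# Toy symmetric distance matrix (distances between corners of a 3 x 4 rectangle).
countries <- c("A", "B", "C", "D")
distmat <- matrix(c(0, 3, 4, 5,
                    3, 0, 5, 4,
                    4, 5, 0, 3,
                    5, 4, 3, 0),
                  nrow = 4, dimnames = list(countries, countries))

# Classical MDS: find 2D coordinates whose distances approximate distmat.
mds <- cmdscale(as.dist(distmat), k = 2)
coords <- data.frame(country = rownames(mds), x = mds[, 1], y = mds[, 2])
coords
```

For this toy matrix the 2D coordinates reproduce the distances exactly. A real DTW distance matrix is generally not Euclidean, so there the scatter plot is only an approximation of the true distances.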
Summary
In this blog post, I showed you how to cluster time series of different lengths with DTW distance and the hierarchical method, and how to visualize the results of such an analysis.
As a use case, I picked data on countries’ COVID-19 active cases trajectories, computed per 1 mil. population, to see trends in the spread of the disease.