Time series data mining in R. Bratislava, Slovakia.
Enernoc smart meter data - forecast electricity consumption with similar day approach in R
Written on 2016-11-12
Deployment of smart grids gives space to an occurrence of new methods of machine learning and data analysis. Smart grids can contain of millions of smart meters, which produce a large amount of data of electricity consumption (long time series). In addition to time series of electricity consumption, we can have extra information about the consumer like ZIP code, type of consumer (consumer vs. prosumer) and so on. These data can be used to support intelligent grid control, make an accurate forecast or to detect anomalies. In this blog post, I will focus on the exploration of available open smart meter data and on the creation of a simple forecast model, which uses similar day approach (will be drawn up in detail below).
Firstly, we must download smart meter data of electricity consumption. These data can be downloaded from the site open enernoc data. This dataset was produced by the EnerNOC company and consists of 100 anonymized commercial buildings for 2012. The dataset contains time series of 100 consumers and theirs corresponding meta data. Let’s download the all-data.tar.gz file and explore it.
Exploring meta data of consumers (IDs)
I will do every reading and filtering (cleaning) of the dataset by means of the awesome package data.table. As you may already know, with data.table you can change dataset by reference (:= or set), which is very effective. You can do similar things with package dplyr, but I prefer data.table, because of performance and memory usage. An interesting comparison of both packages can be seen on this stackoverflow question. To visualize interesting relations, I will use package ggplot2. Manipulation with date and time can be done easily by package lubridate.
So…I hope I haven’t forgotten something, go ahead to the programming and exploration part of this post. First step - scan all of the needed packages.
Read the meta data by function fread and show their structure. Of course, you must firstly set your working directory by setwd("YOUR PATH"), where smart meter data are situated.
These are some nice features for you to explore…
We can do something interesting with the INDUSTRY, SUB_INDUSTRY, SQ_FT, LAT and LNG features. For instance, making a frequency table of industries and sub-industries would be nice. This can be done by package data.table very effectively:
Plot to table:
With the package ggmap it is easy to map location of our consumers to the map of USA. Let’s split them by industries.
Now look at the SQ_FT feature. Firstly, I transform square feets to square meters (I am an European…). Histogram of SQ_M of buildings.
Looks like we have a majority of buildings under 20,000 \( m^2 \).
Let’s do something similar, but now do density plot for our 4 industries separately.
Looks like Food Sales & Storage buildings have relatively small size. On the other hand, Commercial Property buildings have very variable size.
Now try to combine meta data with real electricity consumption data and see interesting relations between them.
Load all .csv files containing electricity consumption in one data.table by functions rbindlist and lapply. See the structure of these data and change the column names. V2 to ID and dttm_utc to date.
Prepare meta_data to merging with DT. Remove useless columns, change the column name and unify class of column ID.
Let’s extract possible interesting features from IDs - mean, median and sum of consumption.
Merge it with meta_data and aggregate result by SUB_INDUSTRY.
Bar plot of mean load by sub-industries:
Looks like the biggest consumers in average are manufacturers, shopping centers and business service buildings. On the other hand, schools have the lowest consumption.
Look at possible (maybe obvious) dependence between amount of consumption and SQ_M. I will use median load and simple linear regression.
There is an evident correlation between median load and square meters of consumers.
Prepare dataset to forecast and explore time series of load
Let’s do the must changes to construct our forecast model. Transform characters of date to classical date and date&time format (POSIXct). Remove the useless columns and look at structure of current dataset, which is full of time series.
Not every ID has all measurements during the observed year. So extract IDs
with a whole length (105408). This is necessary to facilitate further work with time series.
Our extracted (filtered) IDs:
Extract date with all measurements during the day (288). First and last date has not all measurements - so remove them. So our period of day (daily seasonality) is 288.
Let’s finally look at one ID and corresponding time series - num. 99.
There is strong dependence on time. Daily, weekly and monthly seasonalities are represented.
It is highly recommended to aggregate our time series to lower dimension, thus reduce dimension (period) from 288 measurements per day to 48 per day. Do it this way:
Plot typical representants of 4 groups of sub-industries. ID 213 is from the Primary/Secondary School segment, ID 401 is the Grocer/Market, ID 832 is the Corporate Office and ID 9 is the Manufactory.
Forecast of electricity consumption, in practice, is mainly done for some area of consumers. So aggregate (cumulative) consumption is used. Aggregate it for all our consumers (43) and plot it.
For utility (distribution) companies it is very helpful to create daily profiles of consumers or daily profiles for some area. It deals with characteristic behavior of consumer during the day. So let’s create median daily profile of aggregate consumption with MAD (median absolute deviation). I use medians and MAD because of theirs robustness.
Looks like the biggest peak of the load is during the evening.
Similarly, we can do this now with week pattern. So let’s make median weekly profile of aggregate consumption with MAD.
This is a much more interesting plot than the previous one, isn’t it? We can see 5 different patterns (separated by vertical lines) in behavior of the consumers during the week. From Monday till Friday the consumption is quite similar, but Monday starts with low consumption (because of the weekend), so it’s different than others. Friday has a similar pattern, but consumption is a little bit lower than on Thursday. It is obvious that weekend is absolutely different than workdays. Thereto Saturday and Sunday are different too.
Creation of forecast model for different days during the week (similar day approach)
As we have seen in the previous plot (plots), a forecast of time series of electricity consumption will be a challenging task. We have 2 main seasonalities - daily and weekly. So it is necessary to adapt the forecast model to this problem. One of the ideas to overcome this problem is to use a similar day approach - separate forecast models for groups of days.
Let’s prepare data and define all functions to achieve this goal. Add corresponding weekdays to date for datasets DT_48 and DT_agg.
Now we have datasets with all the needed features to build a model for different days.
Extract date, ID, weekdays and period for further better working with subsetting.
Let’s define basic forecast methods - functions, which will produce forecasts. I am using two powerful methods, which are based on the decomposition of time series. STL + ARIMA and STL + exponential smoothing. STL decomposition is widely used decomposition for seasonal time series, based on Loess regression. With package forecast it can be combined to produce very accurate forecasts. We have two main possibilities of usage - with ARIMA and with exponential smoothing. We will use both to compare the performance (accuracy) of both. It should be added that the functions return forecast of the length of one period (in this case 48 values).
Next, it is necessary to define the metric, with which our forecasts will be evaluated and compared. For a simple comparison, MAPE is used. Let’s define a function to compute Mean Absolute Percentage Error.
Now it’s “simple” to define the function, which will produce a forecast for the whole week. It’s based on subsetting the given data.table by a group of weekdays. We can simply vary these features as arguments of a function: training data (data), a function of a forecast (FUN), a set of dates (set_of_date) and length of training window (train_win).
Let’s do some examples of using predictWeek function. Run the forecast for selection of dates on aggregated consumption DT_agg and compute MAPE for both methods (STL+ARIMA and STL+EXP).
Not so bad, actually very accurate.
Compute MAPE for every day of the week separately - for better analysis.
And of course…plot computed forecast for one week ahead.
Seems like ARIMA can produce more accurate forecast on aggregated consumption than exponential smoothing.
Let’s try it on a disaggregated load, so on one consumer (ID). We can simply use variable nID to subset dataset DT_48.
Similar results, but obviously not so accurate because of stochastic behavior of the consumer.
Plot of computed forecasts for one week ahead.
Again, STL+ARIMA is winning against STL+EXP.
In this way, you can make a forecast for any consumer, any set of dates and with your own forecast method.
Except for the two methods shown, you can try for example these functions (methods): HoltWinters, ar, arima and snaive, which are suitable for seasonal time series. These methods are starters (benchmarks) for time series analyses and forecast. In R, there are already implemented methods of time series, which can handle two seasons. For example dshw and tbats (both in the package forecast). Their disadvantage is high computational complexity and not as good results of the forecast as the custom functions that I have shown you. Script (1post-EnernocDataForecast.R) for a creation of whole analysis (tutorial) is on my GitHub repository.
To sum up, I have pointed out to you, which interesting features are contained in smart meter data. Then a forecast model with similar day approach was proposed.
In my future posts, I want to focus mainly on regression methods for time series forecasting, because they can handle similar day approach (double seasonality) much easier. So methods like multiple linear regression, generalized additive model, support vector regression, regression trees and forests and artificial neural networks will be demonstrated.