Image by Jordan Holiday on Pixabay

To defrost or not to defrost

Christoph · Published in Analytics Vidhya · 13 min read · Feb 19, 2021


Using recurrent neural networks to predict defrost events in an air/water heat pump system

Some years ago my wife and I built a new plus-energy house based on the passive house standard. The central heating unit is an air/water heat pump at the front of our new home. We learned rather quickly that the default mode of operation for the heat pump was not really efficient and a lot of energy was wasted or used inefficiently. Something needs to be done! Of course, as an IT architect it was a must for me to integrate all major components into a central home automation system to be able to monitor and control everything.

The current implementation, based on openHAB, controls the heat pump based on the current power production of the photovoltaics on the roof, inside and outside temperatures, seasonal demands and weather forecasts. This works really well except for one small effect that happens from time to time: defrosting.

Let me explain. An air/water heat pump uses the energy in the outside air to heat the water for the underfloor heating and for hot water. It uses an internal compressor to transfer the heat energy from the source (air) to the destination (water). This works down to temperatures of -20°C, but as the air loses energy it gets colder and is no longer able to hold the same amount of water vapor as before. The absolute humidity decreases as the water condenses on the slats of the heat pump as ice. Depending on air humidity, temperature and a lot of other factors, the ice gets thicker and thicker, blocking the air flow and therefore the heat pump's ability to pump heat. To get rid of the ice the heat pump is able to defrost itself: it reverses the water flow and uses the heat energy from the underfloor heating to melt the ice. A defrosting takes some minutes and happens, depending on temperature and humidity, up to 10 times a day.

This process needs a lot of energy, but it is absolutely necessary and can't be prevented unless you can predict a defrosting, stop the heat pump beforehand and make sure that the ice has enough time to melt naturally in the next hours. It's not enough to monitor the heat pump and stop everything when a defrosting happens: at least our heat pump will carry on until the defrosting is done, even if you command it to stop. And so it happens every now and then that the monitoring and control algorithm decides to stop heating just seconds after a defrosting started. A lot of energy is used to defrost the heat pump, and then everything shuts down as no more heating is needed for the day.

That is not efficient and I hate it. Something needs to be done!

The problem

My openHAB-based solution has monitored the heat pump for the last five years and persisted all sensor values, as-is and to-be temperatures, heating settings and defrosting events in a MySQL database. Is it possible to use this data to predict upcoming defrosting events before they happen?

The data consists of independent data points for every value change, i.e. only when a temperature changes is a new value written to the database, together with a timestamp of when the change happened. These independent data points need to be imported, aggregated and resampled before a time series analysis can happen.

My goal is to use the resampled data to train a machine learning algorithm on the defrost events from the last years. The algorithm should learn to predict upcoming defrosting events based on data from some past timespan (e.g. some minutes or hours), to hopefully predict a defrosting event in the next 5 minutes.

The data

I exported all data tables as csv files. These are the data points I used for this exercise:

  • return: as-is temperature after underfloor heating
  • reference_return: to-be return temperature for heating
  • supply: as-is supply temperature after heat pump
  • servicewater_reference: to-be temperature for hot water
  • servicewater: as-is temperature for hot water
  • hot_gas: temperature of the hot gas inside the compressor
  • probe_in: outside air temperature
  • mk1: as-is temperature before underfloor heating
  • state: State information (heating, defrosting, etc.)
  • extended_state: Extended state information (heating, defrosting, etc.)

(Why no outside temperature, you may ask? The reason is that the sensor is not in the correct spot and is not used for controlling the heat pump right now.)

All csv files contain timestamp based value changes for one value only. This is what the heatpump_hot_gas.csv looks like:

...
"2018-10-01 15:42:33";"68.7"
"2018-10-01 15:42:54";"68.8"
"2018-10-01 15:43:14";"68.9"
"2018-10-01 15:43:44";"69"
"2018-10-01 15:44:24";"69.1"
"2018-10-01 15:44:44";"69.3"
"2018-10-01 15:45:15";"69.4"
...
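
The original import code is not shown here, but a minimal pandas sketch of how such a semicolon-separated value file could be read might look like this (the helper name and column names are my own, not from the repository):

import pandas as pd

# Hypothetical loader for one of the value CSVs shown above.
def load_value_csv(path, name):
    df = pd.read_csv(path, sep=";", header=None, names=["timestamp", name],
                     parse_dates=["timestamp"])
    df[name] = pd.to_numeric(df[name], errors="coerce")
    return df.set_index("timestamp")

hot_gas = load_value_csv("heatpump_hot_gas.csv", "hot_gas")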

The state and extended_state files are a little bit different and needed to be processed while reading because of their size and their text-based values with an additional timestamp. To explain, let's have a look at heatpump_extended_state.csv:

"2016-01-16 14:41:43";"heating: 04:36:16"
"2016-01-16 14:41:53";"heating: 04:36:29"
"2016-01-16 14:42:03";"heating: 04:36:37"
"2016-01-16 14:42:13";"defrosting: 04:36:48"
"2016-01-16 14:42:23";"defrosting: 04:36:58"
"2016-01-16 14:42:34";"defrosting: 04:37:06"

Because of the sheer amount of data (600 MB for this file) the additional timestamp was immediately stripped (the heat pump's clock was wrong anyway), leaving simple strings like “heating” and “defrosting”.
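
Not the original implementation, but one way to do this stripping while reading the large file in chunks could look like the following pandas sketch (the chunk size is an arbitrary assumption):

import pandas as pd

# Read the large state CSV in chunks and keep only the plain state string,
# dropping the heat pump's own (unreliable) timestamp after the colon.
parts = []
for chunk in pd.read_csv("heatpump_extended_state.csv", sep=";", header=None,
                         names=["timestamp", "extended_state"],
                         parse_dates=["timestamp"], chunksize=1_000_000):
    chunk["extended_state"] = chunk["extended_state"].str.split(":").str[0].str.strip()
    parts.append(chunk)

extended_state = pd.concat(parts).set_index("timestamp")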

All those values were merged into one sparse dataframe with 12 columns, 16 million rows and a lot of NaN values, indexed by the timestamp at which a value change occurred.

The merged dataframe

The dataframe is backed by sparse columns because the memory consumption would be huge otherwise. Therefore some detours were needed for some data conversions because of Pandas' sparse implementation. The state and extended_state columns were converted to “binary” values with their own columns like compressor_heating, heatpump_heating, heatpump_servicewater, heatpump_running and finally defrosting. For now everything is kept sparse, with a lot of NaN values where no changes occur.
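
As a rough illustration (not the original code), the defrosting indicator could be derived from the extended_state strings like this, with the other binary columns following the same pattern; the exact mapping of states to columns is an assumption:

# Derive binary indicator columns from the plain state strings.
extended_state["defrosting"] = (extended_state["extended_state"] == "defrosting").astype("int8")
extended_state["heatpump_heating"] = (extended_state["extended_state"] == "heating").astype("int8")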

Now it's time for a first look at the charts. This is what the data looks like at this stage of the implementation:

The defrosting events are what we are looking for. As you can see, there is not one strong signal from any sensor in the minutes before. It just “happens” out of nowhere, at least for the first defrosting. For the second defrosting at 14:00 you can see a switch from producing hot water to floor heating 15 minutes earlier. The reason is that the heat pump needs heated floors beforehand to get back enough energy for the defrosting. This happens for every defrosting while producing hot water, but those are only a small fraction of all defrosting events.

Let’s have a look at the overall defrosting distribution over the years.

Those are 1853 defrosting events, happening mostly during winter. This should be enough data to use a machine learning model to predict a defrosting event before it happens.

By the way, you can see a slight shift of defrosting events from year to year: they start a little bit earlier in autumn and stop a little bit earlier in spring. That's because of my optimizations regarding the use of weather forecasts and predictions of what the outside temperature might be in a few days. I managed to reduce the power consumption of our home by ~20% over the years.

A lot of preprocessing

To use this time series data for machine learning, it needed to be converted from sparse to dense, resampled to regular intervals (1 min) and normalized to a usable range (between 0.0 and 1.0).
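
A minimal sketch of this step, assuming the merged dataframe is called merged and that values are carried forward between change events (both are my assumptions):

# Densify to a regular 1-minute grid and min-max normalize to [0, 1].
dense = merged.resample("1min").last().ffill()
normalized = (dense - dense.min()) / (dense.max() - dense.min())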

This is the same chart as above but normalized:

There are many ways and algorithms for time series analysis, but I decided to focus on deep learning with recurrent neural networks (RNNs). It's important to note that there is some data missing in the data set. There are holes of some minutes, hours and even days, which need to be addressed by the way the RNN uses this data. I decided to divide the data into smaller overlapping chunks of 120 minutes and to make sure that every chunk has data for all 120 minutes. If a chunk is shorter, it's not used in the training. After a lot of training, testing and revalidating my models I ended up with chunks that contain 115 minutes of input features for timesteps 0–114:

  • hot_gas
  • probe_in
  • return
  • heatpump_running
  • defrosting

and 115 minutes of output data with one feature, defrosting, for timesteps 5–119. The idea is to train the RNN to predict a defrosting event 5 minutes ahead from up to 115 minutes of preceding data.

To get enough training data the chunks overlap each other, each one shifted by 3 minutes, so in an ideal case 10 chunks are generated out of 150 minutes of data. As there are many more “not defrosting” than “defrosting” chunks, the “not defrosting” ones are filtered on a random basis so that only 20% are accepted into the data set. And last but not least the data is filtered to only include the winter season (September to April) and daytime hours, as the heat pump operation is correlated with solar power production (11:00 am to 6:00 pm).
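
The following NumPy sketch illustrates the chunking idea; it omits the season/daytime filters and the completeness check, and all names are my own rather than the original code:

import numpy as np

FEATURES = ["hot_gas", "probe_in", "return", "heatpump_running", "defrosting"]
WINDOW, HORIZON, SHIFT = 120, 5, 3   # 120-minute chunks, 5-minute horizon, 3-minute shift

def make_chunks(day_df, keep_negative=0.2, rng=np.random.default_rng(0)):
    """Turn one contiguous block of 1-minute data into overlapping (input, target) chunks."""
    values = day_df[FEATURES].to_numpy(dtype=np.float32)
    target = day_df["defrosting"].to_numpy(dtype=np.float32)
    xs, ys = [], []
    for start in range(0, len(day_df) - WINDOW + 1, SHIFT):
        x = values[start:start + WINDOW - HORIZON]        # input timesteps 0-114
        y = target[start + HORIZON:start + WINDOW]        # target timesteps 5-119
        # randomly drop ~80% of the chunks without a defrosting at the end
        if y[-1] == 0 and rng.random() > keep_negative:
            continue
        xs.append(x)
        ys.append(y)
    return np.stack(xs), np.stack(ys)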

I decided not to use a random chunk split for test and training data, as this would result in an information leak from training to test data because the chunks mostly overlap. Therefore the test and training data are split on full days depending on the day of month: every 5th day is used for the test data, so no test and training chunks overlap each other.
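
Assuming the dense dataframe from above, the split could be as simple as this sketch (taking “every 5th day” to mean days of month divisible by five is my interpretation):

# Day-of-month based split: every 5th day goes into the test set,
# so training and test chunks never come from the same day.
is_test_day = dense.index.day % 5 == 0
train_data, test_data = dense[~is_test_day], dense[is_test_day]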

After all this filtering and splitting we end up with 10722 chunks for the training set and 2443 chunks for the test set. The training set contains 2062 chunks with a defrosting event in the last timestep.

Training

For the training it is important to define a valid loss function. All neural nets in this exercise use the cross-entropy loss with two classes (not-defrosting and defrosting). As the two classes are still not represented equally in the data sets (~19% defrosting and ~81% not-defrosting), the loss calculation is re-balanced with class weights. The accuracy also uses those weights to give a better estimation of training progress: with these weights a fully random prediction gives an accuracy of 50%, while an unweighted accuracy would give 81%. Therefore the weighted accuracy is used as a more valid metric of the network's performance against the test data set.
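
In PyTorch the re-balancing can be done by passing class weights to the loss; the inverse-frequency weighting below is an assumption about how the weights were chosen:

import torch
import torch.nn as nn

# Two classes: index 0 = not-defrosting (~81%), index 1 = defrosting (~19%).
class_weights = torch.tensor([1.0 / 0.81, 1.0 / 0.19])
criterion = nn.CrossEntropyLoss(weight=class_weights)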

The training is done in mini-batches of 32 samples and every training run stops after 200 epochs. I played around with different optimizers and learning rates but ended up with RMSprop and a learning rate of 0.0001. Adam with a learning rate of 0.001 gave good results too, but converged a little too slowly, especially at the beginning (< epoch #30), so testing different hyperparameters was much faster with RMSprop.

Gradient clipping is also used, as some network topologies had problems with exploding gradients, especially when multiple RNN layers were used.
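
Put together, a single training epoch could look like this sketch; model, criterion and train_loader are the pieces defined elsewhere, the per-timestep loss over the output sequence is my reading of the setup, and the clipping threshold is an assumption:

import torch

optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)

for x_batch, y_batch in train_loader:            # mini-batches of 32 chunks
    optimizer.zero_grad()
    logits = model(x_batch)                      # (batch, timesteps, 2)
    # CrossEntropyLoss expects (batch, classes, timesteps) for sequence targets
    loss = criterion(logits.transpose(1, 2), y_batch.long())
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()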

I did not do a full grid search through all hyperparameters, because unfortunately I don't have a CUDA-enabled workstation available. Every training run was done on my Windows tablet with an Intel i5 processor and only 8 GB RAM and took up to several hours. Don't do this at home!

Results

As I wanted to learn about recurrent neural networks, the first network is based on a simple LSTM layer with a linear output layer at the end. This is the full spec:

DefrostLSTMLin(
(lstm): LSTM(5, 20, batch_first=True)
(linear): Linear(in_features=20, out_features=2, bias=True)
(dropout): Dropout(p=0.0, inplace=False) )

(a dropout of 0.0 is like no dropout!)
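
For reference, a module producing this printed spec could be defined roughly as follows; applying the linear layer to every timestep's hidden state is my reading of the per-timestep output described above, not necessarily the original implementation:

import torch.nn as nn

class DefrostLSTMLin(nn.Module):
    def __init__(self, n_features=5, hidden_size=20, n_classes=2, dropout=0.0):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, n_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, timesteps, features) -> per-timestep class logits
        out, _ = self.lstm(x)
        return self.linear(self.dropout(out))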

And here is the training chart:

A) 20 x 1 x LSTM + Linear: 0.9280 (Overfitting #150)

This very simple neural network achieved a weighted accuracy of 0.9280 at epoch #195. Not bad for a start, but you have to keep in mind that a random result would already give 50% weighted accuracy. Nevertheless, that is quite good for a first try. As you can see in the chart, the network tends to overfit after epoch #150, so early stopping could have been appropriate, but I decided to train all networks the same way to allow a better (visual) comparison. I'll try to give an estimation of when overfitting happens, but as seen in this first chart it's not always easy to tell. Somewhere after epoch #150 you can see that the test loss is no longer decreasing, apart from some statistical fluctuations.

I tested some more complex networks and achieved some interesting results I will show here:

  • Added dropout after the LSTM layer to reduce overfitting
DefrostLSTMLin(
(lstm): LSTM(5, 20, num_layers=1, batch_first=True, dropout=0.2)
(linear): Linear(in_features=20, out_features=2, bias=True)
(dropout): Dropout(p=0.0, inplace=False))
B) 20 x 1 x LSTM + Linear + dropout: 0.9255 (Overfitting #170)

-> Overfitting slightly reduced, but same weighted accuracy.

  • Increased hidden size of the LSTM layer to 30
DefrostLSTMLin( 
(lstm): LSTM(5, 30, batch_first=True, dropout=0.2)
(linear): Linear(in_features=30, out_features=2, bias=True)
(dropout): Dropout(p=0.0, inplace=False) )
C) 30 x 1 x LSTM + Linear + dropout: 0.9286 (Overfitting #70)

-> Small improvement in weighted accuracy, but earlier overfitting.

  • Tried a second LSTM layer instead
DefrostLSTMLin( 
(lstm): LSTM(5, 20, num_layers=2, batch_first=True, dropout=0.2)
(linear): Linear(in_features=20, out_features=2, bias=True)
(dropout): Dropout(p=0.0, inplace=False) )
D) 20 x 2 x LSTM + Linear: 0.9351 (Overfitting #80)

-> Much better result than the last test! A second LSTM layer might be needed.

  • Added a second dropout before output
DefrostLSTMLinDO( 
(lstm): LSTM(5, 20, num_layers=2, batch_first=True, dropout=0.2)
(linear): Linear(in_features=20, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False) )
E) 20 x 2 x LSTM + Linear + dropout: 0.9327 (Overfitting #110)

-> Again a slight improvement regarding overfitting.

As you can see, more complexity results in a higher weighted accuracy, but earlier overfitting. Dropout helps prevent this a little, but the overall weighted accuracy seems to stay below 0.94.

Also, one strange characteristic shows up: the test loss is lower than the training loss. I tried to understand why, and after digging into this phenomenon (e.g. here) I think it's a mixture of multiple shortcomings. My theory is that the test/training split based on every 5th day might have some side effects I'm currently not aware of. But getting a good explanation for this would be a valid research question on its own and is currently not in my focus, so I'll ignore it for now, as it is not that important for the training success.

Nevertheless, the current results encouraged me to test even more complex networks. Therefore I changed the architecture to a combination of a convolutional network and an LSTM. The idea is that the kernels of the convolutional layer learn the smaller patterns of the input features, giving the LSTM layers the ability to focus on the temporal structure of the full input sequence.

Let's start with a CNN with 5 output channels and a kernel size of 5, as in this spec:

DefrostCNNLSTM(
(conv): Conv1d(5, 5, kernel_size=(5,), stride=(1,), padding=(2,), padding_mode=replicate)
(lstm): LSTM(5, 20, num_layers=2, batch_first=True, dropout=0.2)
(linear): Linear(in_features=20, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False) )

This results in 5 filters for all 5 input channels (= 25 kernels). As you can see in the training chart, this architecture works much better:

F) 5 x 5 CNN + 20 x 2 x LSTM + dropout + Linear + dropout: 0.9878 (No overfitting)

This network increased the weighted accuracy to 0.9878!
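
A sketch of this CNN + LSTM combination, reconstructed from the printed spec; the exact forward pass is my assumption, not the original code:

import torch.nn as nn

class DefrostCNNLSTM(nn.Module):
    def __init__(self, n_features=5, conv_channels=5, hidden_size=20,
                 num_layers=2, n_classes=2, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(n_features, conv_channels, kernel_size=5,
                              padding=2, padding_mode="replicate")
        self.lstm = nn.LSTM(conv_channels, hidden_size, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        self.linear = nn.Linear(hidden_size, n_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, timesteps, features); Conv1d wants (batch, channels, timesteps)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(z)
        return self.linear(self.dropout(out))   # per-timestep logits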

So I went on and changed some hyperparameters. These are the networks I came up with:

  • Increased number of LSTM layers to 3
DefrostCNNLSTM(   
(conv): Conv1d(5, 5, kernel_size=(5,), stride=(1,), padding=(2,), padding_mode=replicate)
(lstm): LSTM(5, 20, num_layers=3, batch_first=True, dropout=0.2)
(linear): Linear(in_features=20, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False) )
G) 5 x 5 CNN + 20 x 3 x LSTM + dropout + Linear + dropout: 0.9880 (No overfitting)

-> Again a slight improvement in weighted accuracy.

  • Increased hidden size of the LSTM layer to 30
DefrostCNNLSTM(   
(conv): Conv1d(5, 5, kernel_size=(5,), stride=(1,), padding=(2,), padding_mode=replicate)
(lstm): LSTM(5, 30, num_layers=3, batch_first=True, dropout=0.2)
(linear): Linear(in_features=30, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False) )
H) 5 x 5 CNN + 30 x 3 x LSTM + dropout + Linear + dropout: 0.9883 (No overfitting)
  • Increased the convolutional filters from 5 to 10:
DefrostCNNLSTM(   
(conv): Conv1d(5, 10, kernel_size=(5,), stride=(1,), padding=(2,), padding_mode=replicate)
(lstm): LSTM(10, 30, num_layers=3, batch_first=True, dropout=0.2)
(linear): Linear(in_features=30, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False) )
I) 5 x 10 CNN + 30 x 3 x LSTM + dropout + Linear + dropout: 0.9948 (No overfitting)

As you can see, the last network did really well: no overfitting was happening and the weighted accuracy was still good at later epochs, so I think this network architecture is a good fit for this kind of problem.

Conclusion

This project shows that it is possible to predict defrosting events for our heat pump from only a few features (remember, I used three sensor values, hot_gas, probe_in and return, and two binary state values, heatpump_running and defrosting) with an impressive weighted accuracy of 0.9948.

I’m sure there are multiple ways to increase this accuracy even further, e.g.

  • adding more features
  • using data augmentation to generate more training data
  • reshaping the data set with more or fewer timesteps per sequence
  • increasing the depth of the network
  • using grid search to find better hyperparameters

and I could try different network architectures like

  • GRU
  • IndRNN
  • pre-trained autoencoder

but as already mentioned, the available hardware was limited, as was my available time. For the next project I would use a cloud-based setup, as this would reduce the training time dramatically.

Image by Myriams-Fotos on Pixabay

What's missing is to bring this PyTorch-based model to my Java-based openHAB instance. I'm sure the current network can be used on an embedded ARM device with limited memory, as not that many parameters are needed (~20k parameters for model I). But porting this to Java is another project that wants to be implemented at another time…

You can find all PyTorch code in my GitHub repository here.
