Winners' report on [MLSS 2015 Predictive Modeling Challenge] Solar energy production prediction

The first and second place winners kindly share their solutions on [MLSS 2015 Predictive Modeling Challenge] Solar energy production prediction.

1st place winner: Dmitry Kit ("Dmitry")

1. Summary

In this document I describe the methodology used for predicting future solar energy production given historical data. Time of day is a very important feature for determining how much sunlight is available to the solar panel. Therefore, most of the feature engineering focused on it. The date string was replaced by two features: "time of day" and "time of week." The wind can be an indicator of stormy weather, so an additional feature was added that represented the magnitude of the wind vector. Through data exploration it was found that "prmsl" and "temperature" appeared to provide a little more predictive power than the other features. In the end, however, all features were used, as the procedure employed proved fairly robust to non-informative features.

It was also found that taking into account the entire dataset as a single sequence produced poor results. Therefore, to maximize accuracy, the data was grouped by the hour of the day and for each group a separate regression model was trained. A zero was predicted for any group where historical data was always zero. For each group Canonical Correlation Analysis (CCA) was used to embed the data into a subspace which maximized the correlation between the observations and target. A Gaussian Process Regression (GPR) was used on this transformed data to model the relationship between inputs and outputs. Validation sets were used to determine the appropriate GPR parameters for each group.
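
The grouping step described above can be sketched as follows. This is a minimal illustration, not the author's code: the row layout `(time_of_day, features, target)`, the function names, and the toy values are all assumptions made for the example.

```python
from collections import defaultdict

def group_by_hour(rows):
    """Group (time_of_day, features, target) rows by their time-of-day label."""
    groups = defaultdict(list)
    for time_of_day, features, target in rows:
        groups[time_of_day].append((features, target))
    return dict(groups)

def always_zero_slots(groups):
    """Return the time slots whose historical target was zero in every row."""
    return {slot for slot, samples in groups.items()
            if all(target == 0 for _, target in samples)}

# Hypothetical toy data: (time of day, feature vector, energy produced).
rows = [
    ("00:30:00", [1.0, 0.2], 0.0),
    ("00:30:00", [0.9, 0.3], 0.0),
    ("12:30:00", [0.4, 0.8], 5.1),
    ("12:30:00", [0.7, 0.5], 3.2),
]
groups = group_by_hour(rows)
zero_slots = always_zero_slots(groups)  # slots that get a constant zero prediction
```

Every slot in `zero_slots` would simply be predicted as zero, and a separate regression model would be trained on each remaining group.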

To make predictions, the observations were first transformed to the subspace determined by CCA, then GPR was used to predict the output. The output was then projected back into the original space.

2. Features Selection / Extraction

The features provided by the competition were expanded to include three more features: time of day, day of the week, and wind strength. Knowing the day of the week had little predictive power, but knowing the time of day proved paramount. Since there was not enough historical data, the "day of the year" feature was not informative and was removed. Furthermore, "rh" and "temperature" were correlated with energy production. The issue, however, is that energy production depends on the amount of sunlight falling on the solar panel, and most of the features, with the exception of time of day, are weak predictors of that. In fact, time of day is the strongest predictor of how much energy a solar panel will produce.
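
As a rough sketch of the three derived features, the snippet below uses only the Python standard library. The timestamp format, the function name `extract_features`, and the argument names `u` and `v` for the wind components are assumptions; the competition's actual column layout is not given in this report.

```python
import math
from datetime import datetime

def extract_features(timestamp, u, v):
    """Derive time-of-day, day-of-week and wind-strength features.

    The timestamp format is an assumption made for this example.
    """
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "time_of_day": t.hour + t.minute / 60.0,  # hours since midnight
        "day_of_week": t.weekday(),               # Monday = 0
        "wind_strength": math.hypot(u, v),        # magnitude of the (u, v) vector
    }

# Hypothetical reading: a Monday morning with wind components 3 and 4.
features = extract_features("2015-03-02 06:30:00", 3.0, 4.0)
```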

Therefore, to make the learning problem more manageable, the data was grouped by time of day. The intuition was that on a clear day the solar panel will produce the maximum amount of energy. The only reason it would not is if the sunlight is blocked due to atmospheric conditions. Temperature and wind speed, for example, could be used as indicators of whether a storm system is present.

3. Modeling Techniques and Training

It became apparent that treating all hours of the day as a single training set was not correct, as the relationship between observations and output differed greatly depending on the hour. By themselves, almost none of the features are good indicators of energy production, as the distributions appear to be almost uniform (i.e. uncorrelated). Visual inspection of the plots in Figure 1 identifies only the feature "rh" as being useful; it appears to be negatively correlated with the output. Empirical tests, however, showed that using other variables also improved performance. It would be advantageous to automatically determine the relationship between all the variables and the output. Correlation between two sets of variables, X and Y, is measured as:

Equation 1

A natural question is whether X and Y can be linearly transformed by some matrices U and V such that this correlation is maximized:

Equation 2
This is known as canonical correlation analysis (CCA). Figure 2 shows that CCA is able to find a useful relationship between observations and output. In our problem Y is a one-dimensional variable and therefore V is simply a scalar. This also means that U is a vector that scales the different dimensions of X before the variables are added together. This scaling can tell us about the usefulness of different features. Figure 3 plots the coefficients of U for different hours of the day. As we can see, the variable "temperature" is mostly useful in the mornings and evenings. This could be because the days are getting longer and there is more variability in temperature as the output of the solar panel increases. During the day it is not as important. As expected, "rh" has a strong negative correlation, especially during midday. Interestingly, the "u" component of wind velocity is not very useful, while the "v" component demonstrates a moderate negative relationship. "prmsl" is mainly useful in the afternoon and evening, where it has a positive and a negative relationship, respectively. Since CCA found a set of transformations that maximizes the correlation, the inference was made in this space and the prediction was projected back into the original space.
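
When the target is one-dimensional, the CCA weight vector U is proportional to inv(cov(X)) · cov(X, y), which is the ordinary least-squares direction. The NumPy sketch below uses that fact instead of a general CCA solver. The function name, the normalization choices, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def cca_scalar_target(X, y):
    """Return the weight vector u maximizing corr(X @ u, y) for a 1-D target.

    With a scalar target the CCA direction is proportional to
    inv(cov(X)) @ cov(X, y), i.e. the least-squares direction.
    """
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)   # normalize each feature
    yn = (y - y.mean()) / y.std()               # normalize the target
    n = len(y)
    cov_xx = Xn.T @ Xn / n
    cov_xy = Xn.T @ yn / n
    u = np.linalg.solve(cov_xx, cov_xy)
    u /= np.sqrt(u @ cov_xx @ u)                # scale to a unit-variance projection
    return u, Xn @ u

# Synthetic stand-in data: three features, two of which drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

u, z = cca_scalar_target(X, y)
rho = np.corrcoef(z, y)[0, 1]                   # canonical correlation
best_single = max(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(3))
```

The signs and magnitudes of the entries of `u` play the role of the per-feature coefficients discussed above, and `rho` is never smaller than the correlation achieved by any single feature on the same data.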

Initially, polynomial regression was used to achieve the first promising results. In particular, a 3rd order polynomial seemed to maximize regression performance. However, the relationship between input and output differed depending on the time of day. The other issue is that polynomials of different degrees were necessary for different days. Instead of trying to determine the optimal degree (and potentially overfitting the data), Gaussian Process Regression (GPR), a non-parametric model, was used. A GPR only requires the user to define a kernel that provides a relationship between the training data points. For this application the widely used Squared Exponential kernel was employed.

Equation 3
The results were generated by setting σ = 1, while l was varied depending on the time of day being modeled. Ideally, l should be estimated by maximizing the likelihood of the training data, but due to the small number of samples this led to a model that overfit the data. Instead, a manual search for appropriate parameters was performed using a custom validation set built from part of the training data, together with the submission results.
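
For reference, the Squared Exponential kernel is commonly written as k(x, x') = σ² exp(−(x − x')² / (2 l²)); this is assumed here to match the missing Equation 3. The sketch below implements that kernel and the standard GPR posterior for 1-D inputs with NumPy. The `noise` parameter, the function names, and the toy data are assumptions and are not taken from the report.

```python
import numpy as np

def sq_exp_kernel(a, b, sigma=1.0, length=1.0):
    """Squared Exponential kernel between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return sigma**2 * np.exp(-0.5 * (d / length) ** 2)

def gpr_predict(x_train, y_train, x_query, length, sigma=1.0, noise=0.1):
    """Posterior mean and standard deviation of a zero-mean GP.

    `noise` (observation noise std) is an assumption, not from the report.
    The returned std describes the latent function, without added noise.
    """
    K = sq_exp_kernel(x_train, x_train, sigma, length)
    K += noise**2 * np.eye(len(x_train))
    K_s = sq_exp_kernel(x_train, x_query, sigma, length)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    v = np.linalg.solve(K, K_s)
    var = sigma**2 - np.sum(K_s * v, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Toy data: one query inside the training range, one far outside it.
x_train = np.linspace(-2.0, 2.0, 15)
y_train = np.sin(x_train)
x_query = np.array([0.5, 10.0])
mean, std = gpr_predict(x_train, y_train, x_query, length=1.0)
```

In this sketch a per-hour value of l would be passed as `length`. The query far from the training data falls back to the zero prior mean with a standard deviation close to σ.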

GPR returns the mean and standard deviation of the prediction (i.e. the posterior distribution over the target values). Figure 4 is the plot of the estimates made by GPR in the transformed space. For the submission only the means were used. It can be seen that the most consistent predictions were made at the '15:30:00' and the noon periods. In the evenings ('18:30:00') and mornings ('06:30:00' and '09:30:00') the uncertainty starts out low, but increases greatly as the sum of transformed observations increases. Although some arguments could be made (e.g. lengthening of the days), further study should be undertaken to determine the reason for this pattern.

The query procedure was then as follows:

  • Transform the observations by the results of CCA (on normalized training data).
  • Predict the output by using GPR.
  • Project the prediction back into the original space.
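
The three steps above can be sketched as a small pipeline. It assumes that "projecting back" means undoing the normalization of the target, which the report does not spell out. The function names are hypothetical, and an identity function stands in for the GPR predictor to keep the example short.

```python
import numpy as np

def fit_transform_params(X, y):
    """Store normalization statistics and the projection vector u."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    y_mean, y_std = y.mean(), y.std()
    Xn, yn = (X - x_mean) / x_std, (y - y_mean) / y_std
    u = np.linalg.solve(Xn.T @ Xn, Xn.T @ yn)   # CCA direction for a scalar y
    return {"x_mean": x_mean, "x_std": x_std,
            "y_mean": y_mean, "y_std": y_std, "u": u}

def query(params, X_new, regressor):
    """Transform observations, predict in the subspace, project back."""
    z = ((X_new - params["x_mean"]) / params["x_std"]) @ params["u"]
    y_sub = regressor(z)                          # prediction in the CCA space
    return y_sub * params["y_std"] + params["y_mean"]

# Synthetic, exactly linear data so the round trip can be checked.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - X[:, 1]
params = fit_transform_params(X, y)
pred = query(params, X, regressor=lambda z: z)   # identity stands in for GPR
```

Any callable that maps projected inputs to predictions in the normalized space, such as the mean of a fitted GPR, could replace the identity function.
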
Figure 1

Figure 2

4. Additional Comments and Observations

This dataset was fairly noisy and small. Furthermore, there was a stronger relationship between the last month and the test month. A validation set where only the last few days were used proved to be a better indicator of final performance. Additionally, although my validation results would sometimes predict performance improvement, the intermediate score would increase instead. My final submission had a worse score than the submission before that despite empirical data showing that a dramatic increase in performance should have been observed. On the other hand, I specifically avoided overfitting the test data, allowing me to be a bit more robust on the data used to calculate the final score.
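
A validation set built from only the last few days can be sketched as a simple chronological split. The row layout `(day, value)`, the function name, and the choice of five days are assumptions for illustration.

```python
from datetime import date, timedelta

def split_last_days(rows, n_days):
    """Hold out the rows from the last `n_days` calendar days for validation."""
    last_day = max(day for day, _ in rows)
    cutoff = last_day - timedelta(days=n_days)
    train = [r for r in rows if r[0] <= cutoff]
    valid = [r for r in rows if r[0] > cutoff]
    return train, valid

# Hypothetical data: one reading per day for 30 days.
rows = [(date(2015, 1, 1) + timedelta(days=i), float(i)) for i in range(30)]
train, valid = split_last_days(rows, n_days=5)
```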

Depending on the time of day, the same set of measurements could lead to a considerably different outcome.

This is why grouping the data by hour was helpful. There was a possibility of weather changes between readings, and so I did not investigate the correlation in readings between groups. However, there is probably at least a weak relationship that should be exploited. This could be incorporated into the feature vector. I did not model the change of daylight hours. This was not really an issue once the last month of training data was included in the training, but a validation set that used the last month was a poor indicator of performance as some hours might have been set to zero as there was no data. Finally, the posterior distribution over values returned by GPR should be exploited.


Standard techniques were used in this work. However, to learn more about Gaussian Processes the reader is encouraged to read "Gaussian processes for machine learning" (Carl Edward Rasmussen and Christopher K. I. Williams. Cambridge, Mass. MIT Press, 2006.) available at

2nd place winner: Artem Oboturov ("eraser")

1. Summary

Linear regression with lagged data was used. Stepwise AIC was used to obtain regression models with the coefficients specified in Tables 1-5. Other predicted values were set to zero.

Table 1

Table 2

Table 3

Table 4

Table 5

2. Features Selection / Extraction

Besides all the originally available features, a couple more were added. The first is the wind speed, which is the square root of the sum of the squared linear speeds in the north and east directions. The slsap was normalized by the standard atmospheric pressure of 101325 Pa, so that it would be close to one most of the time. Lags of order up to two were added for temp.C., rh and slsap.
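
A minimal sketch of these derived features, using only the standard library, is shown below. The dictionary keys (`u`, `v`, `temp`, `rh`, `slsap`), the function name, and the sample readings are assumptions; only the constant 101325 and the lag order come from the text.

```python
import math

STANDARD_PRESSURE_PA = 101325.0

def build_features(rows, max_lag=2):
    """Add wind speed, normalized pressure and lagged columns to each row."""
    out = []
    for i, row in enumerate(rows):
        feat = dict(row)
        feat["wind_speed"] = math.hypot(row["u"], row["v"])
        feat["slsap"] = row["slsap"] / STANDARD_PRESSURE_PA
        for lag in range(1, max_lag + 1):
            for name in ("temp", "rh", "slsap"):
                # None marks lags that are unavailable at the start of the series.
                feat[f"{name}_lag{lag}"] = (
                    out[i - lag][name] if i >= lag else None)
        out.append(feat)
    return out

# Hypothetical consecutive readings.
rows = [
    {"u": 3.0, "v": 4.0, "temp": 10.0, "rh": 0.80, "slsap": 101000.0},
    {"u": 1.0, "v": 0.0, "temp": 12.0, "rh": 0.70, "slsap": 101500.0},
    {"u": 0.0, "v": 2.0, "temp": 15.0, "rh": 0.60, "slsap": 102000.0},
]
features = build_features(rows)
```

Rows whose lags are `None` would need to be dropped or imputed before fitting a regression.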

3. Modeling Techniques and Training

Instead of a single model covering every time slot, five different models were used. Predictions for the time slots at 0, 3 and 21 hours were set to zero for all rows. It helped a lot that only 30-minute intervals were predicted, so the error introduced by this approximation was minimal. For the other time slots, linear regressions with the formula:

Equation 1
were used, where T stands for temperature in °C, ρ for relative humidity and S for slsap. This regression contained all available features, plus wind speed and lags of order up to two. Stepwise AIC was then used to select the models.
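
One way to carry out stepwise selection by AIC is backward elimination on an ordinary least-squares fit, sketched below with NumPy. This is an illustrative variant and not necessarily the procedure or tool the author ran. The function names, the column names, and the synthetic data are assumptions.

```python
import numpy as np

def aic(X, y):
    """AIC of an ordinary least-squares fit, up to an additive constant."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def backward_stepwise_aic(X, y, names):
    """Drop one column at a time while doing so lowers the AIC."""
    keep = list(range(X.shape[1]))
    best = aic(X[:, keep], y)
    improved = True
    while improved and len(keep) > 1:
        improved = False
        for j in keep:
            trial = [c for c in keep if c != j]
            score = aic(X[:, trial], y)
            if score < best:                     # remember the best removal so far
                best, best_trial, improved = score, trial, True
        if improved:
            keep = best_trial
    return [names[c] for c in keep], best

# Synthetic data: two informative regressors and one unrelated column.
rng = np.random.default_rng(0)
n = 200
T, rh, unrelated = rng.normal(size=(3, n))
X = np.column_stack([np.ones(n), T, rh, unrelated])
y = 1.0 + 2.0 * T - 1.5 * rh + rng.normal(scale=0.3, size=n)
names = ["const", "T", "rh", "unrelated"]
selected, score = backward_stepwise_aic(X, y, names)
```

In the setting described above, one such selection would be run per time slot, with the lagged columns included among the candidate regressors.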

4. Additional Comments and Observations

This competition, unfortunately, did not provide geographic coordinates for the location of the measurements. If the solar arrays were concentrated in a small enough region, one could have computed the integral of the solar energy flux over each 30-minute interval (even though we know that the flux itself varies), which should then have been used as the main regressor variable. Other features would probably have had only negative regression coefficients in that case.
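
To illustrate what such a regressor might look like, the sketch below integrates the top-of-atmosphere flux on a horizontal surface over a 30-minute interval. It relies on several simplifying assumptions: Cooper's approximation for the solar declination, time given as local solar time, a constant irradiance of about 1361 W/m², and no atmosphere or panel tilt. The latitude and day of year in the example are hypothetical, since the real coordinates were not provided.

```python
import math

SOLAR_CONSTANT = 1361.0  # W/m^2, approximate top-of-atmosphere irradiance

def cos_zenith(lat_deg, day_of_year, solar_hour):
    """Cosine of the solar zenith angle (clipped at 0 when the sun is down)."""
    decl = math.radians(23.45) * math.sin(
        2 * math.pi * (284 + day_of_year) / 365)     # Cooper's approximation
    hour_angle = math.radians(15.0 * (solar_hour - 12.0))
    lat = math.radians(lat_deg)
    c = (math.sin(lat) * math.sin(decl)
         + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    return max(c, 0.0)

def interval_energy(lat_deg, day_of_year, start_hour, minutes=30, steps=30):
    """Integrate horizontal top-of-atmosphere flux over an interval (J/m^2)."""
    dt_h = minutes / 60.0 / steps
    total = 0.0
    for i in range(steps):
        a = cos_zenith(lat_deg, day_of_year, start_hour + i * dt_h)
        b = cos_zenith(lat_deg, day_of_year, start_hour + (i + 1) * dt_h)
        total += 0.5 * (a + b) * dt_h * 3600.0       # trapezoidal rule, in seconds
    return SOLAR_CONSTANT * total

# Hypothetical site at 45 degrees north around the June solstice.
night = interval_energy(45.0, 172, 0.0)
morning = interval_energy(45.0, 172, 6.0)
noon = interval_energy(45.0, 172, 11.75)
```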

5. Simple Features and Methods

Simply put, this was just standard linear regression with a couple of tweaks. I regret not using regularization to prevent overfitting.