1. Introduction
In recent years, there has been an increasing interest in renewable energy development as a response to global warming and environmental problems
[1]. In this context, solar photovoltaic installations and their growth are of great importance. The use of photovoltaic (PV) systems on city rooftops can help to increase self-sufficiency, and they are safe and do not produce noise or other disruptions
[2]. However, because of the need to balance electricity generation with demand in real time, accurate forecasting of PV production is required for better integration of this resource in the grid
[3]. Real-time predictions are required in different fields (e.g., energy, health, and finances) to process information as data are received continuously. This helps in taking action and making decisions with information that is constantly updated.
However, the dependence of PV generation on environmental conditions makes prediction a challenging problem
[4]. The amount of energy generated depends mainly on the irradiation reaching the panel, which in turn depends on the hour, season, and climatic conditions (cloud coverage and precipitation). Therefore, in recent decades, different methods and approaches have been proposed, from traditional statistical and physics-based models to machine learning and deep learning models.
Physical models are methods that use meteorological data as input in equations to calculate the solar irradiation and output power
[5]. For example, numerical weather prediction (NWP)
[6] is used to forecast the weather by using numerical methods that simulate the atmosphere’s behavior. NWP uses mathematical equations that describe the physical processes occurring in the atmosphere, such as thermodynamics, fluid mechanics, and heat transfer. These models incorporate data from various sources, such as weather stations and satellites, to provide initial conditions for the computation. Typically, NWP output is used to feed other analytical models which calculate irradiation or PV power on the panels
[7]. Other physical models use sky images to predict the movement of the clouds
[8].
Statistical models make predictions based on previous values of the time series
[7]. Autoregressive integrated moving average (ARIMA) models
[9] are a combination of autoregression (AR), differencing (I), and moving average (MA) terms. Autoregression predicts values from previous values; moving average makes predictions based on previous errors; and differencing removes the trend and seasonality to make the series stationary. The model used can combine only some of the terms, as in ARMA models
[10]. Although these models provide forecasting information, their use is limited by their inability to model complex nonlinear behaviors. The persistence model assumes that the predicted value will not change with respect to the previous value in the series
[11]. ARIMA and persistence models are typically used as a benchmark reference for the models proposed in the studies, consisting mainly of machine learning techniques
[7]. However, other mathematical approaches can be used for PV forecasting as well
[12].
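The persistence baseline and the differencing step described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in any of the cited studies; the function names and the sample readings are hypothetical.

```python
def persistence_forecast(series, horizon=1):
    """Forecast each value as the observation `horizon` steps earlier."""
    if horizon < 1 or horizon >= len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    # The forecast for index t is the observation at index t - horizon.
    return list(series[:-horizon])


def difference(series, lag=1):
    """Return series[t] - series[t - lag], i.e. the differencing (I) step."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical hourly PV power readings in kW (illustrative values only).
pv_power = [0.0, 1.0, 3.0, 4.0, 3.0, 1.0]

predicted = persistence_forecast(pv_power, horizon=1)
actual = pv_power[1:]
errors = [a - p for a, p in zip(actual, predicted)]
first_diff = difference(pv_power, lag=1)
```

With a horizon of one step, the persistence errors coincide with the first difference of the series, which is one way to see why persistence is a natural reference point for more elaborate models.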
Machine learning (ML) is a field of computer science that uses large datasets to model complex functions or relationships
[13]. In recent years, its application has grown in many fields thanks to developments in computational capacity and data processing. There are several techniques depending on the problem to solve. They can be divided into Supervised Learning, Unsupervised Learning and Reinforcement Learning
[13]. In Supervised Learning, the models are fed with labelled data to learn the relationship between the input features and the known outputs. Some typical algorithms are linear and polynomial regression
[14], logistic regression
[13], support vector machine
[3], decision trees
[3] or random forest
[15]. These algorithms can be used for different tasks depending on the complexity of the problem. While linear or polynomial regression is well suited to simple mathematical functions, other tools such as random forests or support vector machines can model far more complex problems where the relevant physics is not well understood or involves nonlinear mathematical equations.
Deep learning (DL)
[16] is a branch of the machine learning field that makes use of artificial neural networks (ANN)
[17] to model complex, nonlinear behaviors in different fields. Inspired by the functionality and structure of the human brain, these models are composed of computational units called neurons that are interconnected in multiple layers. Each neuron receives inputs from other neurons, computes a weighted sum and applies an activation function to produce an output that is transmitted to the next neurons.
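The computation performed by a single neuron, and by a layer of such neurons, can be sketched as follows. This is a simplified illustration: the sigmoid activation and the bias term are assumptions chosen for the example (the text above only mentions a weighted sum and an activation function), and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic activation; one of many possible activation functions."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the inputs (plus an assumed bias), then activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def dense_layer(inputs, weight_rows, biases, activation=sigmoid):
    """One layer: every neuron receives the same inputs, with its own weights."""
    return [neuron(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]


# Weighted sum is 2.0 * 0.5 + 1.0 * (-1.0) = 0.0, and sigmoid(0.0) = 0.5.
single_output = neuron([0.5, -1.0], [2.0, 1.0])

# Two neurons sharing the same inputs; the outputs would feed the next layer.
layer_output = dense_layer([0.5, -1.0], [[2.0, 1.0], [0.0, 0.0]], [0.0, 0.0])
```

Stacking several such layers, with the output of one used as the input of the next, gives the multilayer structure described in the text.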
There are several architectures, depending on the connections, types of neurons and activation functions. Feedforward ANN are the most basic, consisting of an input layer, one or more hidden layers, and an output layer. The information goes from the input layer to the output, with the hidden layers processing the relationships in the data. In recurrent neural networks (RNN)
[18] the output of the neurons is connected to their own input and the input of the neurons of the same or previous layers. This feedback gives the network the ability to handle complex relationships between past and future observations. There are other types of architectures, like convolutional neural networks (CNN), encoder-decoders, or transformers, that can be used for different goals.
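The feedback connection that distinguishes a recurrent network can be illustrated with a single recurrent unit. This is a deliberately reduced sketch (one scalar unit, tanh activation, hand-picked weights), not an LSTM or GRU and not the architecture of any cited study; all names and values are assumptions.

```python
import math


def rnn_step(x, h_prev, w_in, w_rec, bias):
    """New state from the current input and the unit's own previous output."""
    return math.tanh(w_in * x + w_rec * h_prev + bias)


def run_rnn(sequence, w_in=0.5, w_rec=0.8, bias=0.0):
    """Process a sequence step by step, carrying the hidden state forward."""
    h = 0.0  # initial state before any observation has been seen
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states


# A single non-zero input followed by zeros: the state stays non-zero
# afterwards because the unit's previous output is fed back into it.
states = run_rnn([1.0, 0.0, 0.0])
```

In a feedforward network the second and third outputs would depend only on the zero inputs; here they still carry a decaying trace of the first input, which is the mechanism that lets recurrent networks relate past and future observations.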
2. Real-Time Photovoltaic Prediction Systems
Advances in ML and DL in recent years have made PV prediction methods more widely used. Artificial neural networks of different types are the most recent and most frequently used methods. Most of the articles present some machine learning or deep learning algorithm, along with traditional models (such as persistence or ARIMA) generally used for comparison [19]. Hybrid and ensemble methods, usually including a machine learning model, are proposed to improve accuracy. The results show that combining several models and data provides the best results, although it increases the computational cost of the model. Other machine learning algorithms, like support vector machines and random forests, are also widespread, while several articles use only traditional methods like ARIMA or physical models.
The increasing use of neural networks has provided a large number of possible architectures and configurations. RNNs and specifically LSTM networks
[18][20][21][22][23][24][25][26][27] are the ones used most often. These types of networks are usually applied to time series forecasting because their recurrent structure lets them capture temporal dependencies. Articles comparing RNNs with other types of ANNs show that RNNs can achieve higher prediction accuracy. CNNs are very common for PV prediction as well, especially in combination with other types of networks like RNNs or feed-forward networks (FFNN). In
[21] the authors tried different types of recurrent networks, including RNN, LSTM, GRU, and combinations with convolutional layers. The results showed that using bidirectional LSTM and GRU cells, and CNN layers, yielded the best results. Nevertheless, the best model changed depending on the weather conditions and forecasting horizon. This implies that some architectures offer advantages for this task, and GRU and LSTM networks may capture temporal patterns better. However, the large number of studies using different network architectures shows that good results can be achieved with any type of network if the data and hyperparameters are well tuned, and some articles use specific architectures designed for the research. In
[18] the authors compare different configurations of RNN, changing the number of layers and time steps; and in
[20] they compare different CNN architectures. This shows that further research is needed to find the network that performs the task best.
The machine learning algorithms most frequently used are support vector machines and random forests. These methods are very common in many applications due to their ability to generalize and model nonlinear functions, and they are well established in the field of machine learning. Other papers
[3][14][15] propose some type of linear regression model or decision trees or develop a different algorithm
[19]. The results show no large differences between the models used.
Regarding the results, hybrid or ensemble methods seem to be the best in terms of accuracy. The drawback of these models is their higher computational cost. If the forecasting method runs on hardware embedded in the PV installation, the size of the model is limited by that hardware.
Due to the chaotic nature of the weather
[4], and the dependency of PV generation on the environment, long-term forecasting cannot be done with accuracy
[21]. This makes day-ahead prediction the usual option for grid management applications. The forecast range from a few hours to one day is used to adapt the load to the demand, which helps to improve PV penetration
[3]. Long-term forecasting can be used for trend analysis and planning. The articles that try different horizons show that increasing the horizon leads to a rapid decrease in prediction accuracy
[20]. In general, horizons of a few hours produce good results and can be used by grid operators to manage the energy production
[3].
The parameters used as input for the prediction vary with the type of model and data origin, but most of the articles use weather variables consisting of temperature, relative humidity, wind speed and direction, and solar irradiation. The production of the panel is correlated with the irradiation it receives, so irradiation is the most used feature. Temperature also influences energy generation, and wind speed, direction, and relative humidity are used to predict the atmospheric behavior and improve accuracy. While some studies include more variables, like pressure, precipitation, cloud coverage or dew point temperature, including them does not have a large impact on the results
[28]. These parameters are taken from several sources. Some articles take information from sensors close to the installation for which the predictions are made, while others use values forecast by weather agencies or the predictions of a physical model like NWP. Additionally, some studies use meteorological data from open databases. The dataset size does not seem to have a strong effect on the predictions, and some articles choose online learning when datasets are small. Models developed with smaller datasets achieve results as good as those in articles using 6 or 7 years of data.
Although the most relevant information is the generated power, solar irradiation is strongly correlated with it. For this reason and considering that irradiation is more directly dependent on physical variables, it is chosen in some studies
[20][29][30][31][32][33][34][35] as the prediction objective. However, most of the articles analyzed forecast PV power. Due to the data-driven nature of machine learning algorithms, forecasting power avoids the need for further modelling and correlating with irradiance.
With respect to the hardware used, all studies seem to have been developed and run on personal computers or workstations, with only four articles considering deployment of the model on an embedded system at the PV installation site. For the forecasts to be used widely, this could be investigated further in the future. It would give the system more autonomy and allow it to make predictions without relying on external computers.
Regarding accuracy metrics, the lack of a standard measure that can be used independently of the data and results makes quantitative analysis difficult. Using the same metrics across all articles in the field could help improve the research by making results easier to understand and compare.
The metric most used by the articles (28 out of 60) is the RMSE, which scales with the data used and has the same units (W/m² for irradiance and W for power). This makes RMSE unsuitable for comparison between articles using different datasets. Other metrics like MAE, used by 23 studies, have the same problem. A total of 18 articles use normalized RMSE, and 11 consider MAPE, both of which give percentage errors. These metrics can be used to compare results of different articles, but there is another issue that prevents an accurate quantitative analysis. The articles that provide metrics for different conditions show that there are parameters that influence the accuracy of the forecasts more than the choice of model. The horizon is the most important factor, and it varies significantly across the articles. The data used to train and test the models are also highly relevant. In
[21], the most accurate model differs depending on the weather. The climatic characteristics of the place strongly influence the forecasts, as some locations have more unpredictable conditions than others. All these variables imply that to get a good quantitative analysis of which models perform better, a standardized testbench should be designed, making use of the same dataset, time horizons, and evaluation metrics.
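The four metrics discussed above can be written out explicitly to show why RMSE and MAE depend on the units of the data while nRMSE and MAPE do not. This is a sketch under stated assumptions: the normalization constant for nRMSE varies between studies (range, mean, or installed capacity are all used), so it is left here as an explicit argument, and zero-valued observations are skipped in MAPE to avoid division by zero. The sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error, in the same units as the data."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def nrmse(actual, predicted, norm):
    """RMSE as a percentage of `norm` (assumed: range, mean, or capacity)."""
    return 100.0 * rmse(actual, predicted) / norm


def mape(actual, predicted):
    """Mean absolute percentage error, skipping zero observations."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Hypothetical power values in kW, and the same values expressed in W.
actual_kw = [2.0, 4.0, 4.0, 2.0]
predicted_kw = [1.0, 5.0, 3.0, 3.0]
actual_w = [1000.0 * v for v in actual_kw]
predicted_w = [1000.0 * v for v in predicted_kw]

rmse_kw = rmse(actual_kw, predicted_kw)      # 1.0 (kW)
rmse_w = rmse(actual_w, predicted_w)         # 1000.0 (W): changes with units
mape_kw = mape(actual_kw, predicted_kw)      # 37.5 (%)
mape_w = mape(actual_w, predicted_w)         # 37.5 (%): unchanged
nrmse_range = nrmse(actual_kw, predicted_kw,
                    norm=max(actual_kw) - min(actual_kw))  # 50.0 (%)
```

The same forecast yields a different nRMSE depending on the chosen normalization, which is one more reason why percentage metrics reported in different articles are not always directly comparable.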
The influence of the horizon on accuracy can be observed in the articles that consider different horizons. In
[36], the nRMSE increases from 3.49% when forecasting 1 h ahead to 7.92% at 5 h ahead. In
[11], an increase in nRMSE can be observed as well between the different horizons: 30 min (3%), 3 h (16%), and one day (17%). This shows that accuracy is high for very short-term predictions (a few minutes ahead), but it drops quickly with longer horizons. In
[32], the nRMSE increases from 21% at a horizon of 30 min to 33% at 6 h. In
[37], the nRMSE rises from 2.6% when predicting one hour ahead to 11.7% when predicting 24 h ahead.
Regarding the input variables, in
[38] the results show an increase in accuracy when more weather variables were included, with the MAPE falling from 17% to 10%. In
[7] the results show a substantial drop in accuracy due to the presence of clouds. They also show large differences between seasons, with the best forecasts obtained for spring. For the developed LSTM model, the MAPE increases from 3% for sunny days in spring to 22% for cloudy days. The drop in accuracy due to the increase in the horizon is larger for cloudy days than for sunny days. This happens because predicting the presence and movement of clouds is harder over longer time spans, while sunny days lead to more stable energy generation. The study also shows that no model clearly outperforms the rest. Comparing several deep learning, machine learning, statistical and hybrid models, the results show that each model performs better for particular seasons, horizons, or types of day. This helps explain why ensemble models perform better: individual algorithms perform slightly better under particular conditions, and combining several of them can give better predictions than any individual model.
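The intuition behind ensembles can be illustrated with the simplest possible combination, an unweighted average of the forecasts of two models. This is only a sketch: the surveyed articles may combine models in other ways (weighting, stacking, or switching by condition), and the forecasts below are hypothetical values chosen so that the two models err in different places.

```python
def mae(actual, predicted):
    """Mean absolute error between observations and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def average_ensemble(*forecasts):
    """Unweighted mean of several forecasts, taken time step by time step."""
    return [sum(step) / len(step) for step in zip(*forecasts)]


# Hypothetical observations and forecasts from two different models (kW).
actual = [2.0, 4.0, 3.0]
model_a = [3.0, 4.0, 2.0]   # errs on the first and last steps
model_b = [1.0, 5.0, 4.0]   # errs on every step, partly in the opposite direction

ensemble = average_ensemble(model_a, model_b)

mae_a = mae(actual, model_a)
mae_b = mae(actual, model_b)
mae_ensemble = mae(actual, ensemble)
```

In this constructed case the ensemble error is lower than that of either model because their errors partly cancel. When the individual errors are strongly correlated, averaging brings little benefit, and the extra models still add computational cost.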