Ge, L.;  Du, T.;  Li, C.;  Li, Y.;  Yan, J.;  Rafiq, M.U. Virtual Collection for Distributed Photovoltaic Data. Encyclopedia. Available online: (accessed on 02 December 2023).
Virtual Collection for Distributed Photovoltaic Data

With the rapid development of distributed photovoltaic systems (DPVS), the shortage of data monitoring devices and the difficulty of achieving comprehensive measurement coverage have become more pronounced, posing major challenges to the efficient management and maintenance of DPVS. Virtual collection is a new, cost-effective and computationally efficient DPVS data collection scheme that meets the needs of distributed energy management but has so far received little research attention.

distributed photovoltaic; virtual collection; similarity analysis; artificial intelligence

1. Introduction

In the context of the current global energy crisis and increasing environmental pollution, photovoltaic (PV) power generation has received strong support from countries worldwide due to its high efficiency and cleanliness, rapidly becoming the third largest renewable energy source after hydropower and wind power [1]. According to the reports of the International Energy Agency (IEA), more than 175 GW of new PV capacity was installed worldwide in 2021, accounting for more than half of the new renewable energy capacity. By the end of 2021, the cumulative installed PV capacity worldwide had exceeded 942 GW. Figure 1 illustrates the changing dynamics of the global PV market and the substantial influence of the Chinese PV market [1].
Figure 1. 2017–2021 growth per region.
PV stations are mainly divided into two types: centralized systems and distributed systems. Distributed photovoltaic systems have developed rapidly due to their flexible installation, outstanding environmental benefits, and coexistence of power generation and consumption [2]. As shown in Figure 2, the newly installed capacity of distributed photovoltaics exceeded that of centralized photovoltaics in 2021 [3]. Therefore, efficient and accurate access to DPVS operational data is becoming increasingly important. High-quality operational data can help assess the output performance of DPVS, improving both the reliability of PV plant operation and maintenance and the accuracy of DPVS output prediction. It can also provide power companies with accurate electricity metering and billing audit indicators, support better market monitoring, and help prolong the service life of DPVS. However, most DPVS are scattered and uncoordinated, with many installation points spread over wide areas. To manage them effectively, a large number of sensors, collectors, and concentrators need to be deployed to monitor the output of DPVS, along with dedicated communication channels, servers, databases, and data monitoring software [4]. The high implementation costs and personal privacy concerns make a significant portion of PV users reluctant to purchase these data monitoring services, limiting the further development of the PV industry. In addition, as the scale of distributed PV continues to expand and the operating environment becomes more complex and diverse, the collection of operating data often suffers from transmission blockage and equipment failure. For this reason, it is crucial to develop a cost-effective and computationally efficient data collection method for large-scale DPVS clusters that relies on a relatively small number of sensing devices. If these devices are deployed at strategic locations with proper redundancy, the reduced sensing network can still provide low-cost yet sufficiently accurate measurements of the DPVS network for various power operations.
Figure 2. Research motivation of virtual collection technology.
Aware of this need, relevant scholars have been inspired by the virtual collection concept [4] and have researched virtual collection for DPVS. The core idea of virtual collection for DPVS is to use the power data of selected reference power stations (RPSs) in the region as input to infer the power data of other stations through computational intelligence algorithms. To reduce the number of sensors, Ref. [5] fitted all DPVS station operation data in the selected region by a deep recurrent denoising autoencoder and introduced a bionic artificial neural network to dynamically select the best subset of reference stations in the set of candidate RPSs.
It can be seen that there are relatively few studies combining artificial intelligence algorithms for the virtual collection of DPVS data, an approach which is still in the early exploration stage. Moreover, there is no comprehensive introduction to current knowledge in virtual collection research, resulting in a lack of attention to the virtual collection of DPVS in the industry. Various methods suitable for virtual collection and application scenarios of the virtual collection have not been summarized.

2. Overview of DPVS Virtual Collection

The word “virtual” in virtual collection indicates that the technology does not collect PV data through field collection equipment such as sensors, collectors, and concentrators. Virtual collection is a new type of inference technology for data that cannot be collected in real time or is difficult to collect. In essence, it is the system identification and state estimation of a large system composed of multiple subsystems [5].
As shown in Figure 3, the virtual collection process is divided into three steps according to the implementation conditions of DPVS virtual collection:
  • Similarity analysis of DPVS in the virtual collection region.
  • Selection of reference power stations (RPSs) for virtual collection.
  • Data inference of DPVS, i.e., accurate estimation of the output of all power stations in the region through computational intelligence.
Figure 3. Schematic diagram of the DPVS virtual collection process.
Firstly, the data of all power stations need to be transmitted to the company’s PV intelligent operation and maintenance cloud platform through 5G, ZigBee, and other wireless communication means. Then, the similarity between each power station in terms of geography, equipment, and climate is analyzed by the similarity analysis method to obtain the set of power stations that meet the prerequisites for virtual collection. Further, the best RPSs in the whole PV system are selected through clustering or intelligent optimization algorithms to deploy the sensing equipment so as to accurately estimate the operation data of the whole PV system.

3. Process and Challenges of DPVS Virtual Collection

The virtual collection of DPVS data is a new field that few scholars have studied. Therefore, to facilitate understanding, this section compares the steps of virtual collection with related methods and analyzes their similarities and differences. The similarities suggest directions for DPVS virtual collection research, while the differences highlight the challenges it faces. This approach to elaboration is novel for the review literature and can help the reader clearly understand the connections and differences between virtual collection and other studies.

3.1. Similarity Analysis of Regional DPVS

PV data are closely correlated with external conditions, such as the geographic location of the installation site and environmental factors, which have a significant impact on the accuracy of the virtual collection model. Therefore, one prerequisite for virtual collection is that the station to be collected and the RPS have similar external factors. From the data point of view, the data sets should follow similar distributions as far as possible, thus providing higher-quality input data for supervised learning. This similarity makes virtual collection more robust and helps preserve the accuracy of the virtually collected data even when the weather changes.
To illustrate, Badong County in southwestern Hubei Province, China, and Jiangning District in Nanjing City, Jiangsu Province, produce widely different power data due to differences in terrain, topography, meteorology, and other conditions. Figure 4 shows the power output of PV stations in Badong County and Jiangning District on a typical summer day. For the DPVS of Jiangning District, using the DPVS operation data of Badong County for data inference would seriously reduce the accuracy of the virtual collection because of the extremely low similarity between them. Therefore, it is necessary to define, in advance and by similarity analysis, clusters of PV stations that satisfy the similarity requirement.
Figure 4. Badong County and Jiangning District DPVS output on a typical day.
Many factors affecting the PV output state are coupled with each other [6]. The main factors influencing the solar energy conversion process are shown in Figure 5. The solar irradiance received by the PV module is significantly influenced by geographical location and meteorological conditions. Climate is the overall pattern of atmospheric states and weather processes in a given area over a long time scale and is an important factor affecting the level of light resources, whereas meteorology refers to physical phenomena of the atmosphere on a short time scale, such as temperature and cloud cover. In addition, the conversion of solar irradiance into power is closely related to the selection of equipment, the design of the station, and electrical efficiency. The final PV power output results from this series of energy conversion processes. Therefore, similarity analysis can be performed from two perspectives: influencing factors (causes) and power output trends (results). However, from the perspective of influencing factors, similarity is difficult to analyze because many factors affect PV output, their dimensions differ significantly, and their characteristics are of complex types. From the perspective of the PV output trend, the trend changes are complicated and the time scale is long, which makes it challenging to analyze the trend characteristics.
Figure 5. Factors affecting DPVS power output.
The importance of data similarity is also reflected in many areas of research. In Ref. [7], an anomaly identification and reconstruction model based on curve similarity analysis with a BP neural network is proposed for detecting anomalous PV historical data and compensating for missing data. Similar to virtual collection, the method also requires the power of neighboring PV stations. Considering the periodicity of PV power, Ref. [8] proposes a data cleaning method based on approximately periodic time series, effectively improving the quality of PV data. Considering the uncertainty of PV power generation due to variation in weather conditions, Ref. [9] proposes a prediction framework that incorporates similar-day selection techniques. In this framework, the authors first screen external variables that can accurately capture the similarity between different days and then, based on these variables, select historical days that are highly similar to the day to be predicted, thus improving the prediction accuracy. Although the research methods of the above studies differ, they all aim to obtain higher-quality data.
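As a rough illustration of similarity analysis from the power output perspective, the sketch below groups stations by the Pearson correlation of their daily power curves. The threshold of 0.9, the function name `similar_stations`, and the synthetic curves are assumptions made for this example only; the cited studies use their own similarity measures.

```python
import numpy as np

def similar_stations(power, target, threshold=0.9):
    """Return indices of stations whose power curve correlates with the
    target station's curve at or above `threshold` (Pearson correlation).

    power: array of shape (n_stations, n_samples).
    """
    sim = np.corrcoef(power)[target]
    return [j for j in range(power.shape[0]) if j != target and sim[j] >= threshold]

# Synthetic one-day curves (hypothetical data, 96 fifteen-minute samples).
t = np.linspace(0.0, 1.0, 96)
clear_sky = np.sin(np.pi * t)                      # idealized clear-sky bell
station_a = 5.0 * clear_sky                        # 5 kW station
station_b = 4.0 * clear_sky + 0.05 * np.sin(8 * np.pi * t)  # similar site
shading = np.where(np.sin(6 * np.pi * t) > 0, 0.2, 1.0)     # intermittent cloud
station_c = 3.0 * clear_sky * shading              # dissimilar site
power = np.vstack([station_a, station_b, station_c])

group = similar_stations(power, target=0)          # stations similar to station_a
print(group)
```

Only stations in the returned group would be considered candidates for inferring the target station's output; station_c, whose curve is dominated by intermittent shading, falls below the threshold.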

3.2. RPS Selection for Virtual Collection

Selecting the RPSs is the most crucial step in the virtual collection process. The RPSs’ real-time power data will be input into the computational intelligence algorithm as multidimensional features to estimate the output of all regional DPVS. Researchers aim to select the subset of DPVS among the regional DPVS that can estimate the data of other stations with higher accuracy. The key to this step is to identify selective sensor locations where the most important data is collected to monitor the status of all DPVS. Therefore, researchers will analyze the differences and associations between the RPS selection step and other methods from the perspectives of input data and equipment placement.
As shown in Figure 6, from the perspective of input data, the selection of RPSs can be approximated as a feature selection problem in machine learning. Both aim to improve the accuracy of the results as much as possible by selecting RPSs (features). Therefore, although there are few studies on the selection of reference power stations, the relatively mature theory of feature selection can provide inspiration. In data mining, the feature quality of the input data strongly affects model performance, so the feature selection problem has been widely studied. Ref. [10] systematically examined existing sparse learning models for feature selection from the perspectives of individual and group sparse feature selection and analyzed the differences and connections among them. Ref. [11] proposes a new incremental feature selection method that is robust to dynamically ordered data. Ref. [12] proposes a grasshopper optimization algorithm for binary optimization problems that selects, from a large set of original features, a subset that better characterizes the data attributes, thus improving classification accuracy. These studies offer effective treatments of the feature selection problem and can provide a theoretical reference for selecting RPSs, for example by transforming RPS selection into a combinatorial optimization problem. However, it is worth noting that once a power station is selected as an RPS, it serves as an input feature, and the remaining power stations become the stations to be collected. The RPS selection problem for virtual collection is therefore similar to the high-dimensional feature selection problem [13] yet different from feature selection in traditional regression and classification [14] problems, because the candidate features and the estimation targets come from the same pool of stations. Therefore, choosing reasonable RPSs is more challenging than feature selection.
Figure 6. RPS selection and feature selection process.
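To make the feature selection analogy concrete, the following sketch treats RPS selection as greedy forward selection: at each step it adds the station that, used as an input together with those already chosen, best reconstructs all remaining stations by ordinary least squares. This is a simple stand-in chosen for illustration, not the method of Ref. [5] or of the cited feature selection papers; the function names, mixing weights, and synthetic data are all hypothetical.

```python
import numpy as np

def estimation_error(power, rps):
    """Mean squared error when every station outside `rps` is fitted by
    least squares (with intercept) from the stations inside `rps`."""
    others = [i for i in range(power.shape[0]) if i not in rps]
    X = np.column_stack([power[rps].T, np.ones(power.shape[1])])
    Y = power[others].T
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.mean((X @ coef - Y) ** 2))

def greedy_rps_selection(power, n_rps):
    """Greedy forward selection of `n_rps` reference stations."""
    if not 0 < n_rps < power.shape[0]:
        raise ValueError("n_rps must leave at least one station to estimate")
    selected = []
    for _ in range(n_rps):
        candidates = [i for i in range(power.shape[0]) if i not in selected]
        best = min(candidates, key=lambda i: estimation_error(power, selected + [i]))
        selected.append(best)
    return selected

# Hypothetical region: six stations driven by two shared weather patterns.
t = np.linspace(0.0, 1.0, 96)
f1 = np.sin(np.pi * t)                                # clear-sky component
f2 = f1 * (0.5 + 0.5 * np.cos(10 * np.pi * t))        # cloud-modulated component
mix = np.array([[1.0, 0.0], [0.8, 0.2], [0.5, 0.5],
                [0.2, 0.8], [0.0, 1.0], [0.6, 0.4]])  # assumed station weights
power = mix @ np.vstack([f1, f2])                     # shape (6, 96)

one = greedy_rps_selection(power, 1)
two = greedy_rps_selection(power, 2)
err_one = estimation_error(power, one)
err_two = estimation_error(power, two)
print(one, err_one)
print(two, err_two)
```

Note that, unlike ordinary feature selection, moving a station into the RPS set also removes it from the set of targets, which is why the error is recomputed over the remaining stations at every step. In this noise-free toy region two reference stations suffice because only two weather patterns drive all outputs; real data would require a richer model and a cost term for sensing equipment.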
From a sensor placement perspective, the selection of RPSs can also be inspired by the data aggregation point (DAP) selection problem in smart meter networks. As shown in Figure 7, both DAP selection and RPS selection can be regarded as the optimal configuration of transmission nodes in the system. DAPs are selected to reduce data redundancy and bandwidth requirements by aggregating data locally at sensors or intermediate nodes, forming high-quality information and reducing the number of packets sent to the base station, thus saving energy and bandwidth. Ref. [15] treats DAP placement as a mixed integer programming problem and proposes a new heuristic algorithm that selects the optimal DAP locations by minimizing installation, transmission, and delay costs. Ref. [16] proposes an improved k-means clustering algorithm to assign DAPs, significantly reducing the number of DAPs installed.
Figure 7. RPS selection and DAP selection process.
Although there are certain commonalities between the selection of DAPs and RPSs, many challenges remain to be studied. DAPs are chosen with the objective of achieving the lowest transmission and delay cost among all smart meter (SM) layout points so that data for the whole system can be aggregated and transmitted. RPSs, by contrast, are a subset of PV stations chosen from the regional PV systems so that the data of the whole system can be estimated. Therefore, the elements considered in the selection of RPSs are more diversified. In addition to communication and equipment costs, the accuracy of data estimation for the whole system from different RPS sets needs to be considered, as well as the temporal and spatial coupling characteristics.
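The clustering idea borrowed from DAP placement can be sketched as follows: stations are clustered by location with plain k-means, and the station closest to each centroid is proposed as an RPS candidate. This is a minimal sketch using standard k-means with a deterministic farthest-point initialization, not the improved algorithm of Ref. [16]; the coordinates are invented, and a practical selection would also weigh estimation accuracy and spatio-temporal coupling as discussed above.

```python
import numpy as np

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dist, axis=1)

def kmeans(points, k, n_iter=100):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(np.argmax(nearest))])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(points, centroids), centroids

def select_rps_by_clustering(points, k):
    """Propose one RPS per cluster: the station nearest to the centroid."""
    labels, centroids = kmeans(points, k)
    rps = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:        # skip clusters that ended up empty
            continue
        d = np.linalg.norm(points[members] - centroids[j], axis=1)
        rps.append(int(members[np.argmin(d)]))
    return sorted(rps)

# Hypothetical station coordinates in km, forming three neighborhoods.
coords = np.array([[0, 0], [2, 0], [1, 1], [0, 2], [2, 2],
                   [10, 0], [12, 0], [11, 1], [10, 2], [12, 2],
                   [5, 10], [7, 10], [6, 11], [5, 12], [7, 12]], dtype=float)

rps = select_rps_by_clustering(coords, k=3)
print(rps)
```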

3.3. Data Inference for Regional DPVS

The final step of the virtual collection technique is to infer the operational data of the whole DPVS through an artificial intelligence algorithm. This step maps the relationship between the RPS and the whole system by building a computational intelligence model between the RPSs and the power stations to be collected in the region, using the data from the RPSs selected in the second step as the input. This step is similar to the method used in PV prediction techniques, both of which require certain historical data as a driver to obtain the unknown PV output power.
There is relatively little research in the industry on DPVS virtual data inference, with most studies focusing only on PV power prediction, which uses historical data, real-time weather, and other environmental information to predict PV power output. Fortunately, current DPVS power prediction algorithms are relatively mature and can provide some theoretical reference for virtual DPVS data collection. However, it is worth noting that data inference in virtual collection differs from traditional PV prediction in model construction and use. Virtual data collection estimates the current PV power output in real time through a data inference model, whereas a PV predictor estimates future power output. The input to the virtual collection model is real-time PV data from the RPSs, and the input to the PV predictor is historical operational data and environmental information. Because of this real-time nature, the data inference model for virtual collection must meet stricter robustness and accuracy requirements than a PV prediction model.
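The difference in inputs can be illustrated with a minimal inference model: a ridge regression is fitted offline on historical data in which both RPS and target stations were measured, and at run time only the current RPS readings are needed to estimate the current output of the other stations. Ridge regression is used here purely as a simple placeholder for the computational intelligence models discussed in the text; the mixing matrices, the regularization value, and the noise-free synthetic history are assumptions for this sketch.

```python
import numpy as np

def fit_inference_model(rps_hist, target_hist, ridge=1e-6):
    """Fit a linear map (with intercept) from RPS power to target-station power.

    rps_hist:    (n_samples, n_rps) historical RPS measurements
    target_hist: (n_samples, n_targets) historical target measurements
    """
    X = np.column_stack([rps_hist, np.ones(len(rps_hist))])
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ target_hist)

def infer_current_output(weights, rps_now):
    """Estimate the current output of all target stations from live RPS data."""
    return np.append(rps_now, 1.0) @ weights

# Hypothetical history: two weather components drive every station.
t = np.linspace(0.0, 1.0, 96)
f1 = np.sin(np.pi * t)
f2 = f1 * (0.5 + 0.5 * np.cos(10 * np.pi * t))
factors = np.column_stack([f1, f2])                          # (96, 2)
rps_mix = np.array([[5.0, 1.0], [1.0, 4.0]])                 # 2 reference stations
target_mix = np.array([[3.0, 2.0], [0.5, 2.5], [4.0, 0.5]])  # 3 unmonitored stations
rps_hist = factors @ rps_mix.T
target_hist = factors @ target_mix.T

weights = fit_inference_model(rps_hist, target_hist)

# Real-time step: only the RPS readings for the current instant are available.
state_now = np.array([0.7, 0.3])             # unobserved weather state
rps_now = rps_mix @ state_now                # what the RPS sensors report
estimate = infer_current_output(weights, rps_now)
truth = target_mix @ state_now               # kept only to check the sketch
print(estimate, truth)
```

A PV predictor, by contrast, would take lagged measurements and weather forecasts as input and output values for future time steps; here no forecast is involved, and the estimate refers to the same instant as the RPS readings.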

4. Methods for DPVS Virtual Collection

The previous section introduces the specific implementation steps of virtual collection and its purpose, and pinpoints the urgent need to provide solutions to the challenges faced by the above steps. Therefore, this section provides theoretical support for the development of virtual collection technology by summarizing the methods applicable to DPVS similarity analysis, RPS selection, and DPVS data inference in various fields. Various methods for DPVS virtual collection are summarized in Figure 8.


Figure 8. Summary of methods for virtual collection.

5. Application Scenarios of Virtual Collection Technology

With the expansion of DPVS, its application scenarios are becoming increasingly complex and variable. The acquisition of operation and maintenance information often suffers from incomplete data collection, transmission blockage, and high collection and transmission costs. Therefore, to draw more scholarly attention to the practical value of virtual collection, a variety of application scenarios for virtual collection based on multi-source information are identified, including but not limited to the following:

  • DPVS operation data anomaly detection;
  • DPVS fault diagnosis;
  • DPVS missing data recovery;
  • DPVS real-time operation data collection.

Figure 9 summarizes the four application scenarios and the significance of DPVS virtual collection technology.


Figure 9. Application scenarios of DPVS virtual collection technology.
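Two of these scenarios, missing data recovery and anomaly detection, can be sketched by comparing field measurements with virtually collected estimates. In the sketch below, missing readings (NaN) are filled with the inferred value, and a reading is flagged when it deviates from the inferred value by more than a relative tolerance. The 20% tolerance, the 0.1 kW floor, and the sample numbers are illustrative assumptions, not values from the source.

```python
import numpy as np

def reconcile(measured, inferred, rel_tol=0.2, floor=0.1):
    """Fill missing measurements and flag anomalous ones.

    measured: field readings in kW, NaN where no data arrived
    inferred: virtual collection estimates for the same stations
    rel_tol:  allowed relative deviation (assumed value)
    floor:    minimum reference level in kW, avoids tiny limits near zero output
    """
    measured = np.asarray(measured, dtype=float)
    inferred = np.asarray(inferred, dtype=float)
    missing = np.isnan(measured)
    recovered = np.where(missing, inferred, measured)
    limit = rel_tol * np.maximum(np.abs(inferred), floor)
    anomalous = ~missing & (np.abs(recovered - inferred) > limit)
    return recovered, missing, anomalous

# Hypothetical snapshot for four stations.
measured = [1.0, np.nan, 3.0, 0.2]
inferred = [1.05, 2.0, 1.5, 0.21]

recovered, missing, anomalous = reconcile(measured, inferred)
print(recovered, missing, anomalous)
```

In this snapshot the second station's gap is filled with its estimate and the third station is flagged, since its reading is twice the inferred value. A flagged reading only indicates a disagreement; whether the fault lies in the station, the sensor, or the inference model would need further diagnosis.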


  1. Masson, G.; Bosch, E.; Kaizuka, I.; Jäger-Waldau, A.; Donoso, J. Snapshot of Global PV Markets 2022 Task 1 Strategic PV Analysis and Outreach PVPS; IEA PVPS: Paris, France, 2022.
  2. Allouhi, A.; Rehman, S.; Buker, M.S.; Said, Z. Up-to-date literature review on Solar PV systems: Technology progress market status and R&D. J. Clean. Prod. 2022, 362, 132339.
  3. National Energy Administration (NEA) China. Available online: (accessed on 25 September 2022).
  4. Zhu, C.; Long, X.H.; Han, G.J.; Jiang, J.F.; Zhang, S. A virtual grid-based real-time data collection algorithm for industrial wireless sensor networks. Eurasip J. Wirel. Commun. Netw. 2018, 2018, 134.
  5. Ge, L.; Liu, H.; Yan, J.; Li, Y.; Zhang, J. A Virtual Data Collection Model of DPVs considering Spatio-Temporal Coupling and Affine Optimization Reference. IEEE Trans. Power Syst. 2022, 1–12.
  6. Sobri, S.; Koohi-Kamali, S.; Abd Rahim, N. Solar photovoltaic generation forecasting methods: A review. Energy Convers. Manag. 2018, 156, 459–497.
  7. Lin, S.M.; Li, P.Q.; Xue, W.Q.; Tang, X.X.; Wang, J.F. Recognition and Reconstruction of Photovoltaic Output Abnormal Data Based on Geographic Correlation. In Proceedings of the 2021 3rd Asia Energy and Electrical Engineering Symposium, Chengdu, China, 26–29 March 2021; pp. 942–948.
  8. Zhang, J.; Zhang, S.; Liang, J.; Tian, B.; Hou, Z.; Liu, B.Z. Photovoltaic Generation Data Cleaning Method Based on Approximately Periodic Time Series. IOP Conf. Ser. Earth Environ. 2017, 63, 12008.
  9. Zhang, Y.; Beaudin, M.; Taheri, R.; Zareipour, H.; Wood, D. Day-Ahead Power Output Forecasting for Small-Scale Solar Photovoltaic Electricity Generators. IEEE Trans. Smart Grid 2015, 6, 2253–2262.
  10. Li, X.P.; Wang, Y.D.; Ruiz, R. A Survey on Sparse Learning Models for Feature Selection. IEEE Trans. Cybern. 2022, 52, 1642–1660.
  11. Sang, B.B.; Chen, H.M.; Yang, L.; Li, T.R.; Xu, W.H. Incremental Feature Selection Using a Conditional Entropy Based on Fuzzy Dominance Neighborhood Rough Sets. IEEE Trans. Fuzzy Syst. 2022, 30, 1683–1697.
  12. Hichem, H.; Elkamel, M.; Rafik, M.; Mesaaoud, M.T.; Ouahiba, C. A new binary grasshopper optimization algorithm for feature selection problem. J. King Saud. Univ.-Com. 2022, 34, 316–328.
  13. Liang, J.N.; Yang, S.; Winstanley, A. Invariant optimal feature selection: A distance discriminant and feature ranking based solution. Pattern Recognit. 2008, 41, 1429–1439.
  14. Zhang, L.; Mistry, K.; Lim, C.P.; Neoh, S.C. Feature selection using firefly optimization for classification and regression models. Decis. Support Syst. 2018, 106, 64–85.
  15. Lang, A.; Wang, Y.; Feng, C.; Stai, E.; Hug, G. Data Aggregation Point Placement for Smart Meters in the Smart Grid. IEEE Trans. Smart Grid 2022, 13, 541–554.
  16. Wang, G.D.; Zhao, Y.X.; Huang, J.; Winter, R.M. On the Data Aggregation Point Placement in Smart Meter Networks. In Proceedings of the 2017 26th International Conference on Computer Communication and Networks (Icccn 2017), Vancouver, BC, Canada, 31 July–3 August 2017.
Update Date: 08 Dec 2022