分布式光伏数据的虚拟采集

分布式光伏数据的虚拟采集: Comparison

Please note this is a comparison between Version 2 by Leijiao Ge and Version 5 by Rita Xu.

With the rapid development of distributed photovoltaic systems (DPVS), the shortage of data monitoring devices and the difficulty of comprehensive coverage of measurement equipment has become more significant, bringing great challenges to the efficient management and maintenance of DPVS. Virtual collection is a new DPVS data collection scheme with cost-effectiveness and computational efficiency that meets the needs of distributed energy management but lacks attention and research.

随着分布式光伏系统（DPVS）的快速发展，数据监测设备的短缺和测量设备全面覆盖的难度变得更加突出，给DPVS的高效管理和维护带来了巨大的挑战。虚拟采集是一种新型的DPVS数据采集方案，具有成本效益和计算效率，满足分布式能源管理的需求，但缺乏关注和研究。为了填补当前研究领域的空白，本文对DPVS虚拟馆藏进行了全面系统的综述。

distributed photovoltaic
virtual collection
similarity analysis
artificial intelligence

1. Introduction

In the context of the current global energy crisis and increasing environmental pollution, photovoltaic (

简介

在当前全球能源危机和环境污染日益严重的背景下，光伏（PV) power generation has received strong support from countries worldwide due to its high efficiency and cleanliness, rapidly becoming the third largest renewable energy source after hydropower and wind power ^[1]. According to the reports of the ）发电因其高效率和清洁性而得到了世界各国的大力支持，迅速成为仅次于水电和风力发电的第三大可再生能源[1]。根据国际能源署（International Energy Agency (IEA), more than 175 GW of new PV capacity was installed worldwide in 2021, accounting for more than half of the new renewable energy capacity. By the end of 2021, the cumulative installed PV capacity worldwide had exceeded 942 GW. Figure 1 illustrates the changing dynamics of the global PV market and the substantial influence of the Chinese PV market ^[1]. A）的报告，2021年全球新增光伏装机容量超过175吉瓦，占新增可再生能源装机容量的一半以上。截至2021年底，全球累计光伏装机容量已超过942吉瓦。图1说明了全球光伏市场的变化动态以及中国光伏市场的重大影响[1]。

Figure 图1. 2017–2021 growth per region.-2021 年每个地区的增长。

PV stations are mainly divided into two types: centralized systems and distributed systems. Distributed photovoltaic systems have been rapidly developed due to their flexible installation, outstanding environmental benefits, and coexistence of power generation and consumption ^[2]. As shown in Figure

光伏电站主要分为集中式系统和分布式系统两种。分布式光伏系统由于其安装灵活，环境效益突出以及发电和消费共存而迅速发展[2]。如图2, the newly installed capacity of distributed photovoltaics exceeded that of centralized photovoltaics in 所示，2021 ^[3]. Therefore, efficient and accurate access to 年分布式光伏新增装机容量超过集中式光伏[3]。因此，高效准确地访问DPVS operational data is becoming increasingly important. High-quality operational data can help assess the output performance index of 操作数据变得越来越重要。高质量的运行数据可以帮助评估DPVS to improve the reliability of the PV plant in terms of the operation and maintenance and the accuracy of DPVS output prediction. It can also provide power companies with accurate electricity metering and billing audit indicators, better monitoring of the market, and prolong the service life of DPVS. However, most DPVS are scattered and disorderly, with many points and wide areas. To effectively manage them, a large number of sensors, collectors, and concentrators need to be deployed to monitor the output of DPVS, as well as dedicated communication channels, servers, databases, and data monitoring software ^[4]. However, higher implementation costs and personal privacy requirements make a significant portion of PV users reluctant to purchase these data monitoring services, limiting the further development of the PV industry. In addition, as the scale of distributed PV continues to expand and the operating environment becomes more complex and diverse, the collection of its operating data often suffers from transmission blockage and equipment failure. For this reason, it is crucial and beneficial to develop a cost-effective and computationally efficient data collection method for large-scale 的输出性能指标，以提高光伏电站在运行和维护方面的可靠性以及DPVS输出预测的准确性。还可以为电力公司提供准确的电表计费审计指标，更好地监控市场，延长DPVS的使用寿命。然而，大多数DPVS是分散和无序的，点多，区域广。为了有效地管理它们，需要部署大量的传感器、收集器和集中器来监控DPVS的输出，以及专用的通信通道、服务器、数据库和数据监控软件[4]。然而，较高的实施成本和个人隐私要求使得很大一部分光伏用户不愿意购买这些数据监控服务，限制了光伏产业的进一步发展。此外，随着分布式光伏规模的不断扩大和运行环境变得更加复杂和多样化，其运行数据的收集往往遭受传输堵塞和设备故障的困扰。因此，为部署在战略位置的传感设备数量相对较少的大规模DPVS clusters with relatively small numbers of sensing devices deployed at strategic locations. If deployed at strategic locations with proper redundancy, the reduced sensing network can still provide low-cost yet highly sufficiently accurate measurements of the 集群开发一种具有成本效益和计算效率的数据收集方法至关重要且有益。如果部署在具有适当冗余的战略位置，减少的传感网络仍然可以为各种电力操作提供低成本但足够准确的DPVS networks for various power operations. 网络测量。

Figure 图2. Research motivation of virtual collection technology.虚拟采集技术的研究动机。

Aware of this need, relevant scholars have been inspired by the virtual collection concept ^[4] and have researched virtual collection for

意识到这一需求，相关学者受到虚拟馆藏概念的启发[4]，并研究了DPVS. The core idea of virtual collection for 的虚拟馆藏。DPVS is to use the power data of selected reference power stations (RPSs) in the region as input to infer the power data of other stations through computational intelligence algorithms. To reduce the number of sensors, Ref. ^[5] fitted all 虚拟采集的核心思想是利用区域内选定参考电站（RPS）的电力数据作为输入，通过计算智能算法推断其他电站的电力数据。为了减少传感器的数量，参考文献[5]通过深度循环降噪自动编码器拟合选定区域内的所有DPVS station operation data in the selected region by a deep recurrent denoising autoencoder and introduced a bionic artificial neural network to dynamically select the best subset of reference stations in the set of candidate 台站运行数据，并引入仿生人工神经网络来动态选择候选RPSs. It can be seen that there are relatively few studies combining artificial intelligence algorithms for the virtual collection of 集中的最佳参考站子集。可以看出，结合人工智能算法进行DPVS data, an approach which is still in the early exploration stage. Moreover, there is no comprehensive introduction to current knowledge in virtual collection research, resulting in a lack of attention to the virtual collection of 数据虚拟采集的研究相对较少，这种方法仍处于早期探索阶段。而且，目前在虚拟馆藏研究中没有对知识进行全面介绍，导致业界对DPVS in the industry. Various methods suitable for virtual collection and application scenarios of the virtual collection have not been summarized. 的虚拟馆藏缺乏重视。适合虚拟收藏的各种方法和虚拟收藏的应用场景尚未总结。

2. Overview of DPVS Virtual Collection虚拟馆藏概述

The word 虚拟采集中的“virtual” in the virtual collection indicates that the technology does not collect PV data through collection equipment such as sensors, collectors, and concentrators in the field. The virtual collection is a new type of inference technology for data that cannot be collected in real-time or is difficult to collect. Its essence is the system identification and state estimation of a large system composed of multiple subsystems ^[5].虚拟”一词表示该技术不通过现场的传感器、集热器和聚光器等采集设备收集光伏数据。虚拟采集是一种新型的推理技术，用于无法实时采集或难以采集的数据。其本质是由多个子系统组成的大型系统的系统识别和状态估计[5]。 As shown in Figure 如图3, the virtual collection process into three steps according to the implementation conditions of 所示，本文根据DPVS virtual collection:虚拟采集的实现条件，将虚拟采集过程分为三个步骤：

Figure 图3. Schematic diagram of the DPVS virtual collection process.虚拟采集过程示意图。

Similarity analysis of DPVS in the virtual collection region.在虚拟采集区域的相似性分析。
Selection选择用于虚拟采集的参考电站 of reference power stations (RPSs) for virtual collection.（RPS）。
Data inference of DPVS, i.e., accurate estimation of the output of all power stations in the region through computational intelligence.的数据推断，即通过计算智能准确估计该地区所有电站的输出。

Firstly, the data of all power stations need to be transmitted to the company’s PV intelligent operation and maintenance cloud platform through 首先，所有电站的数据都需要通过5G, 、ZigBee, and other wireless communication means. Then, the similarity between each power station in terms of geography, equipment, and climate is analyzed by the similarity analysis method to obtain the set of power stations that meet the prerequisites for virtual collection. Further, the best RPSs in the whole PV system are selected through clustering or intelligent optimization algorithms to deploy the sensing equipment so as to accurately estimate the operation data of the whole PV system.

等无线通信方式传输到公司的光伏智能运维云平台。然后，通过相似性分析方法分析各电站在地理、设备、气候等方面的相似性，得到满足虚拟采集前提的电站集。此外，通过聚类或智能优化算法选择整个光伏系统中最好的RPS来部署传感设备，从而准确估计整个光伏系统的运行数据。

3. Process and Challenges of DPVS Virtual Collection虚拟采集的过程与挑战

The virtual collection of DPVS data is a new field that few scholars have studied. Therefore, in this section, to facilitate understanding, researchers compare and analyze the similarities and differences between the steps of virtual collection and other methods. Focusing on their similarities, researchers give directions for DPVS virtual collection research, and then focusing on their differences, they summarize the challenges faced by DPVS virtual collection. It is worth noting that this approach to elaboration is novel for the review literature and can help the reader understand the connections and differences clearly between virtual collection and other studies.

3.1. Similarity Analysis of Regional 区域DPVS的相似性分析

PV data is most closely correlated with external conditions, and factors such as the geographic location of the installation site. Environmental factors have a significant impact on the accuracy of the virtual collection model. Therefore, one of the prerequisites assumed for the realization of virtual collection is that the station to be collected and the 光伏数据与外部条件以及安装地点的地理位置等因素密切相关。环境因素对虚拟馆藏模型的准确性有重大影响。因此，实现虚拟采集的前提之一是要采集的站点和RPS have similar external factors. From the data point of view, researchers want the data set to obey similar distribution patterns as much as possible, thus providing higher-quality input data for supervised learning. This similarity can make the virtual collection more robust and ensure the virtual collection data’s accuracy even in weather changes.具有相似的外部因素。从数据的角度来看，我们希望数据集尽可能遵循相似的分布模式，从而为监督学习提供更高质量的输入数据。这种相似性可以使虚拟馆藏更加健壮，并确保虚拟馆藏数据即使在天气变化的情况下也能准确无误。 To illustrate, Badong County in southwestern Hubei Province, China, and Jiangning District in Nanjing City, Jiangsu Province, produce widely different power data due to different terrain, topography, meteorology and other conditions. Figure 举例来说，中国湖北省西南部的巴东县和江苏省南京市的江宁区，由于地形、地形、气象和其他条件的不同，产生的电力数据差异很大。图4 shows the power output of PV stations in Badong County and Jiangning 显示了典型夏日巴东县和江宁区光伏电站的功率输出。对于江宁区的District on a typical summer day. For the DPVS of Jiangning district, using the DPVS operation data of Badong district for data inference would seriously reduce the accuracy of the virtual collection because of the extremely low similarity between them. Therefore, it is necessary to define clusters of PV stations that satisfy the similarity requirement by similarity analysis in advance.PVS来说，使用巴东区的DPVS运行数据进行数据推断，由于两者之间的相似度极低，会严重降低虚拟采集的准确性。因此，有必要通过相似性分析提前定义满足相似性要求的光伏电站集群。

Figure 4. Badong county and Jiangning district DPVS output on a typical day.

Many factors affecting the PV output state are coupled with each other [6]. The main factors influencing the solar energy conversion process are shown in Figure 5. It can be seen that the degree of solar irradiance received by the PV module is significantly influenced by the geographical location and meteorological conditions. The climate is the comprehensive pattern in the general state of the atmosphere and weather processes in a certain area on a long timescale, which is an important factor affecting the level of light resources, and meteorology refers to the physical phenomena of the atmosphere on a short time scale, such as temperature, clouds, etc. Secondly, the link of solar irradiance to power for conversion is closely related to the selection of equipment, the design of the station, and electrical efficiency. After the series of the energy conversion process mentioned above, the final PV power output is obtained. Therefore, similarity analysis can be performed from two perspectives: influencing factors (causes) and power output trends (results). However, from the perspective of influencing factors, it is difficult to analyze the similarity due to the large number of factors affecting PV output, the significant difference between the dimensions, and the complex types of characteristics. From the perspective of the PV output trend, the trend changes are complicated, and the time scale is long, which makes it challenging to analyze the trend characteristics.

Figure 图5. Factors affecting 影响DPVS power output.功率输出的因素。

The importance of data similarity is also reflected in many areas of research. In Ref. ^[7], an anomaly identification and reconstruction model based on curve similarity analysis with a 数据相似性的重要性也反映在许多研究领域。在参考文献[7]中，提出了一种基于BP neural network is proposed for detecting anomalous and compensating missing 神经网络曲线相似性分析的异常识别和重建模型，用于检测异常并补偿缺失的PV historical data. Similar to the virtual collection, the method also requires the power of neighboring PV stations. Considering the periodicity of PV power, Ref. ^[8] proposes a data cleaning method based on approximate periodic time series, effectively improving the quality of PV data. Considering the uncertainty of PV power generation due to the variation in weather conditions, Ref. ^[9] proposes a prediction framework combining similar day selection techniques. In this framework, the authors first screen external variables that can accurately capture the similarity between different days and select dates with higher similarity based on these external variables for the historical day and the day to be predicted, thus improving the prediction accuracy. Although the research methods of the above studies are different, they all desire to obtain higher-quality data.历史数据。与虚拟采集类似，该方法也需要相邻光伏电站的电源。考虑到光伏发电的周期性，参考文献[8]提出了一种基于近似周期时间序列的数据清洗方法，有效地提高了光伏数据的质量。考虑到由于天气条件的变化而导致光伏发电的不确定性，参考文献[9]提出了一个结合类似日期选择技术的预测框架。在此框架中，作者首先筛选出能够准确捕捉不同日期相似度的外部变量，并根据这些外部变量选择相似度较高的日期进行历史日期和待预测日期，从而提高预测精度。

3.2. 虚拟集合的 RPS Selection for Virtual Collection选择

Selecting选择 the RPSs is the most crucial step in the virtual collection process. The RPSs’ real-time power data will be input into the computational intelligence algorithm as multidimensional features to estimate the output of all regional DPVS. Researchers aim to select the subset of DPVS among the regional DPVS that can estimate the data of other stations with higher accuracy. The key to this step is to identify selective sensor locations where the most important data is collected to monitor the status of all DPVS. Therefore, researchers will analyze the differences and associations between the RPS selection step and other methods from the perspectives of input data and equipment placement.RPS 是虚拟收集过程中最关键的一步。RPS的实时功率数据将作为多维特征输入计算智能算法，以估计所有区域DPV的输出。我们的目标是在区域DPVS中选择DPVS的子集，这些DPVS可以更准确地估计其他台站的数据。此步骤的关键是确定收集最重要数据的选择性传感器位置，以监控所有DPV的状态。因此，我们将从输入数据和设备放置的角度分析RPS选择步骤与其他方法之间的差异和关联。 As shown in Figure 如图6, from the perspective of input data, the selection of 所示，从输入数据的角度来看，RPSs can be approximated as the feature selection problem of machine learning. Both aim to improve the accuracy of the results as much as possible by selecting RPSs (features). Therefore, although there are few studies on the selection of reference power stations, the relatively mature feature selection theory can also provide researchers with inspiration. For data mining techniques, the feature quality of input data seriously affects the model’s performance, so many scholars have researched the feature selection problem. 的选择可以近似为机器学习的特征选择问题。两者都旨在通过选择 RPS（特征）尽可能提高结果的准确性。因此，虽然关于参考电站选择的研究很少，但相对成熟的特征选择理论也可以为我们提供启示。对于数据挖掘技术，输入数据的特征质量严重影响模型的性能，因此许多学者都研究了特征选择问题。参考文献[10]从单个稀疏特征选择和组稀疏特征选择的角度系统地研究了现有的稀疏学习模型的特征选择。它分析了各种稀疏学习模型之间的差异和联系。参考文献[11]提出了一种新的增量特征选择，使该方法对动态排序的数据具有鲁棒性。参考文献[12]提出了一种蚱蜢优化算法，该算法可以通过从大量原始特征中选择可以更好地表征数据属性的特征子集来解决二元优化问题，从而提高分类精度。上述研究对特征选择问题提出了有效的处理方法，可为选择Ref. ^[10] systematically examined the existing sparse learning models for feature selection from the perspective of individual sparse feature selection and group sparse feature selection. It analyzed the differences and connections among various sparse learning models. Ref. ^[11] proposes a new incremental feature selection that makes the method robust to dynamically ordered data. Ref. ^[12] proposes a grasshopper optimization algorithm that can solve the binary optimization problem by selecting a subset of features that can better characterize the data attributes from a large set of original features, thus improving the classification accuracy. The above studies proposed effective processing for the feature selection problem, which can provide some theoretical reference for selecting RPSs, such as transforming the 提供一定的理论参考，例如将RPS selection into a combinatorial optimization problem. However, it is worth noting that if a power station is selected as the RPS, it is used as the input feature, and the remaining power stations are used as the power stations to be collected. It can be seen that the RPS selection problem for virtual collection is similar to the high-dimensional feature selection problem ^[13] yet different from the feature selection in the traditional regression and classification ^[14] problems. Therefore, choosing reasonable 选择转化为组合优化问题。但值得注意的是，如果选择某电站作为RPS，则将其用作输入要素，其余电站则用作要采集的电站。可以看出，虚拟集合的RPS选择问题类似于高维特征选择问题[13]，但与传统回归和分类[14]问题中的特征选择不同。因此，选择合理的RPSs is more challenging than feature selection.比选择功能更具挑战性。

Figure 图6. RPS selection and feature selection process.选择和功能选择过程。

From a sensor placement perspective, the selection of 从传感器放置的角度来看，RPSs can also be inspired by the data aggregation point (DAP) selection problem in smart meters. As shown in Figure 的选择也可以受到智能电表中数据聚合点（DAP）选择问题的启发。如图7, both 所示，DAP selection and RPS selection can be regarded as the optimal configuration of transmission nodes in the system. DAPs are选择和RPS选择都可以看作是系统中传输节点的优化配置。选择DAP是为了降低数据冗余和带宽需求，将数据本地聚合在传感器或中间节点，形成高质量的信息，降低发送到基站的数据包质量，从而节省能源和带宽。参考文献 [15] selected将 to reduce data redundancy and bandwidth requirements by aggregating data locally at the sensor or intermediate nodes to form high-quality information and reduce the quality of packets sent to the base station, thus saving energy and bandwidth. Ref. ^[15] treats DAP placement as a放置视为混合整数规划问题，并提出了一种新的启发式算法，以最小化安装、传输和延迟成本，以选择最佳 mixed integer programming problem and proposes a new heuristic algorithm to minimize installation, transmission, and delay costs to select the optimal DAP placement location. Ref. ^[16] proposes an improved DAP 放置位置。参考文献[16]提出了一种改进的k-means clustering algorithm to assign 均值聚类算法来分配DAPs, significantly reducing the number of DAPs installed.，从而大大减少了安装的DAP数量。

Figure 图7. RPS selection and DAP selection process.选择和 DAP 选择过程。

Although there are certain commonalities between the selection of 尽管DAPs and RPSs, there are still many challenges that need to be studied. Data aggregation points are obtained with the objective of determining the lowest transmission and delay cost among all SM layout points to achieve aggregation and transmission of data for the whole system. The RPS is selected by selecting a subset of PVs among the regional PV systems to achieve a data estimation of the whole system. Therefore, the elements considered in the selection of RPSs are more diversified. In addition to communication and equipment costs, the accuracy of data estimation for the whole system from different RPS sets needs to be considered, as well as the time and space coupling characteristics.和RPS的选择存在一定的共性，但仍有许多挑战需要研究。获取数据聚合点的目的是确定所有SM布局点中传输和延迟成本最低的，以实现整个系统数据的聚合和传输。通过在区域光伏系统中选择一个光伏子集来选择RPS，以实现整个系统的数据估计。因此，在选择RPS时考虑的要素更加多样化。除了通信和设备成本外，还需要考虑不同RPS集对整个系统的数据估计精度，以及当前研究所缺乏的时空耦合特性。

3.3. Data区域 Inference for Regional DPVS 的数据推断

The final step of the virtual collection technique is to infer the operational data of the whole 虚拟采集技术的最后一步是通过人工智能算法推断整个DPVS through an artificial intelligence algorithm. This step maps the relationship between the RPS and the whole system by building a computational intelligence model between the RPSs and the power stations to be collected in the region, using the data from the RPSs selected in the second step as the input. This step is similar to the method used in PV prediction techniques, both of which require certain historical data as a driver to obtain the unknown PV output power.的运行数据。此步骤通过在RPS和要在区域中收集的电站之间构建计算智能模型，使用第二步中选择的RPS数据作为输入，映射RPS与整个系统之间的关系。此步骤类似于PV预测技术中使用的方法，两者都需要一定的历史数据作为驱动因素来获得未知的PV输出功率。 There is relatively little research in the industry on 业界对DPVS virtual data inference, with most studies focusing only on PV power prediction, using historical data, real-time weather, and other environmental information to predict PV power output. Thankfully, the current DPVS power prediction algorithms are relatively mature and can provide some theoretical references for virtual DPVS data collection. However, it is worth noting that data inference in virtual collection differs from traditional PV prediction in model construction and use. Virtual data collection estimates the current PV power output in real time through a data inference model, whereas the PV predictor estimates the future power output. The input to the virtual collection model is real-time PV data from the RPSs, and the input to the PV predictor is historical operational data and environmental information. This real-time nature makes it necessary that the data inference model for virtual collection has better robustness and higher accuracy requirements than that for PV prediction.虚拟数据推断的研究相对较少，大多数研究仅关注光伏发电量预测，利用历史数据、实时天气等环境信息来预测光伏发电量。值得庆幸的是，目前的DPVS功率预测算法相对成熟，可以为虚拟DPVS数据采集提供一些理论参考。但值得注意的是，虚拟采集中的数据推断在模型构建和使用上与传统的PV预测存在差异。虚拟数据收集通过数据推理模型实时估计当前的光伏功率输出，而光伏预测器估计未来的功率输出。虚拟采集模型的输入是来自 RPS 的实时 PV 数据，PV 预测器的输入是历史运行数据和环境信息。这种实时性使得虚拟采集的数据推理模型必须比PV预测模型具有更好的鲁棒性和更高的精度要求。

4. Methods for DPVS Virtual CollectionDPVS虚拟采集方法

The previous section introduces the specific implementation steps of virtual collection and its purpose, and pinpoints the urgent need to provide solutions to the challenges faced by the above steps. Therefore, this section provides theoretical support for the development of virtual collection technology by summarizing the methods applicable to 上一节介绍了虚拟采集的具体实施步骤及其目的，并指出了为上述步骤面临的挑战提供解决方案的迫切需要。因此，本节通过总结适用于DPVS similarity analysis, RPS selection, and DPVS相似性分析、RPS选择和DPVS数据推理的方法，为虚拟采集技术的发展提供理论支持。图 8 data总结了 inference in various fields. Various methods for DPVS virtual collection are summarized in Figure 8.DPVS 虚拟采集的各种方法。

Figure 图8. Summary of methods for virtual collection.虚拟收集方法摘要。

5. Application Scenarios of Virtual Collection Technology虚拟采集技术的应用场景

With the scale expansion of 随着DPVS, the DPVS application scenarios are more and more complex and variable. The acquisition of operation and maintenance information often suffers from incomplete data collection, transmission blockage, and high collection and transmission costs. Therefore, to bring more scholars’ attention to the practical application value of virtual collection, a variety of application scenarios for virtual collection based on multi-source information, including but not limited to the following:的规模扩大，DPVS的应用场景越来越复杂多变。运维信息的获取往往存在数据采集不完整、传输堵塞、采集传输成本高等问题。因此，为引起更多学者对虚拟馆藏实际应用价值的关注，本文创新性地总结了基于多源信息的虚拟馆藏的多种应用场景，包括但不限于以下内容：

DPVSoperation data anomaly detection;操作数据异常检测。
DPVS fault diagnosis;故障诊断。
DPVSmissing data recovery;缺少数据恢复。
DPVSreal-time operation data collection;实时操作数据收集。

Figure 图9 summarizes the four application scenarios and the significance of 总结了四种应用场景以及DPVS virtual collection technology.虚拟采集技术的意义。

Figure 图9. Application scenarios of DPVS virtual collection technology.

虚拟采集技术的应用场景。