Fruit Sizing with Machine Vision: History

Forward estimates of harvest load require information on fruit size as well as number. The task of sizing fruit and vegetables has been automated in the packhouse, progressing from mechanical methods to machine vision. This shift is now occurring for size assessment of fruit on trees, i.e., in the orchard.

  • estimation
  • fruit sizing
  • image segmentation
  • machine vision
  • deep-learning

1. Background

RGB imagery is adequate for fruit identification (detection or segmentation) and on-tree fruit counting, e.g., in mango orchards [1,36]. For fruit sizing, however, the dimensions of a detected object must be converted from image pixels to real-world units. A ‘first generation’ approach involved placement of an object or scale of known dimensions in the field of view [23,27,37], or acquisition of images at a known camera-to-object distance [38,39]. A ‘second generation’ approach involved use of combined RGB and depth sensors in a single (RGB-D) camera to obtain camera-to-object distance information. In both approaches, fruit segmentation is undertaken on the 2D RGB image. In a ‘third generation’ approach, segmentation can be based on 3D point clouds generated from RGB-D or LiDAR data. The 2D method involves imaging a given fruit from a single camera perspective, with conversion of the lineal pixel dimensions of the height and width of the imaged fruit to metric dimensions. The 3D method involves imaging a given fruit from several perspectives to allow generation of a 3D reconstruction, from which size metrics are then assessed. Other advances have occurred in the methods used for fruit detection and segmentation.
Fruit may be partly occluded by leaves or other fruit in images of whole tree canopies. Detected partly occluded fruit should be included in fruit counting applications, but in a fruit sizing pipeline these detections must either be rejected, e.g., as undertaken by [13] and Neupane et al. [40], or the geometry of the whole fruit reconstructed from the visible portions, e.g., as undertaken by Wang and Chen [41], Gené-Mola et al. [42], and Mirbod, Choi, Heinemann, Marini, and He [28].

2. Application Scenarios

Fruit size estimation using machine vision has been implemented in fruit pack-lines since the 1980s, e.g., [68]. Commercial vision systems in pack-line applications utilize a structured imaging environment, e.g., fixed camera angles and distances and use of a lighting box with optimum illumination, to facilitate vision assessment of fruit attributes. Multiple cameras are typically employed, providing multiple perspectives of each fruit, and roller cups or conveyors can rotate the fruit as it passes under the field of view of the camera.
In contrast, imaging conditions are far less controlled in an orchard setting. Image quality in daytime is significantly affected by the range of illumination conditions across the image, from over-exposure to strong shadow. This issue is exacerbated in the strong sunlight of tropical settings. Artificial lighting can be used to provide consistent imaging conditions, either as high-intensity strobe lighting with very short exposure times to reduce the effect of sunlight [69], or by image acquisition at night with lower cost lighting [1].
For in-orchard fruit size estimation, three application scenarios have been reported: (i) the use of a smartphone or tablet as a handheld imaging device with sufficient computing power for image processing or communication capacity to enable cloud processing; (ii) the use of a depth camera mounted to a mobile platform which moves through orchard inter-rows; and (iii) a camera in a fixed position, used for continuous measurement of fruit size.
Publications on the use of handheld imaging solutions have reported use of a physical marker as a scale for inference of fruit dimensions. For example, Wang, Koirala, Walsh, Anderson, and Verma [23] employed a backing board that incorporated a scale, placed behind the hanging fruit. The fruit was positioned relative to the camera such that fruit length and width were captured in the image. An RMSE of about 4 mm on measurements of both fruit length and width was achieved. Application issues include strong lighting, and tilt and yaw of the camera relative to the object plane. Possible improvements include use of a fixed frame to hold the camera parallel to the backing plane, although this reduces the portability of the system.
From the published work (that involves inclusion of a scale in the image), there appears to be little advantage in sampling speed or ease of use of the mobile device fruit sizing application compared to use of calipers with a data transfer/storage option, although there is advantage in technology accessibility—mobile devices are ubiquitous.
In the second application, images are collected in a ‘drive-by’ mode, as performed for fruit number estimation, e.g., [13,40,63]. RMSEs of 4.9 and 4.3 mm on fruit length and width, respectively, were reported for the system employed by Wang, Walsh, and Verma [13]. The advantage of this system over the handheld system lies in the ability to upscale to an orchard level given the ability to rapidly collect image data. Disadvantages include (i) a higher capital cost; (ii) a potential for higher RMSE arising from the increased camera to fruit distance and uncontrolled fruit orientation; and (iii) a potential sampling bias if a size difference exists between the fully visible fruits processed for sizing and remaining partly and fully occluded fruit.
In a third application, real-time monitoring of fruit growth using machine vision estimation of fruit size was reported by Behera, Sethy, Sahoo, Panigrahi, and Rajpoot [47]. This system employed a fixed position camera with a 4G internet connection to a remote server for processing.

3. Hardware for Machine Vision-Based Fruit Sizing

Fruit can be localized and segmented within an RGB image, with object length and width measured in units of image pixels. The conversion of object pixel dimensions to real-world dimensions can be undertaken using the camera-to-fruit distance, which can be assessed using LiDAR, one of an expanding range of RGB-D cameras, or other depth sensing technology.
A number of studies have compared the performance of depth camera technologies (stereovision, Time of Flight (ToF), structured light, and active IR stereo) in the context of the orchard application scenario [13,71,72,73,74,75,76]. For example, ToF cameras provide better depth accuracy than stereovision [13,74], but the technique is not recommended for use in strong sunlight. Other factors, such as the field of view (FoV) of the depth sensor, frame capture rate, use of color or monochrome imagery, and weather proofing, also impact the choice of hardware for the use case of in-orchard measurement. Commercial product life is also a consideration, as exemplified by the Microsoft Kinect v1 and v2, which each entered and exited the market within a period of 5 years.
Combination RGB and Time-of-Flight (ToF) depth cameras have dominated horticultural sizing applications. For example, the Kinect v2 was used in the sizing of mangoes [11,13], onions [65], citrus fruits [51], and pears [41]; the ToF RealSense L515 was used for sizing of peppers [57] and apples [46]; and the PMD CamCube 3.0 ToF camera was used by [25] for apple fruit size estimation. The successor in the Kinect camera series, the Azure Kinect ToF camera, was released in 2019 with improved depth and RGB sensor resolutions, good angular resolution, lower noise, and better accuracy [77]. The Azure Kinect camera was used for mango fruit sizing by Neupane, Koirala, and Walsh [40].
Depth cameras based on active IR stereoscopy technology have also been used for fruit sizing, e.g., the Intel RealSense D435 for grape cluster sizing [63] and peach fruit sizing [78], and the RealSense D415 for cucumber, eggplant, tomato, and pepper sizing [55]. The ZED mini stereo camera was used for sizing of tomato fruit by Hsieh, Huang, Hsiao, Tuan, Shih, Hsieh, Chen, and Yang [54].
Neupane, Koirala, Wang and Walsh [74] evaluated the accuracy of eight depth cameras of various technologies for the application of in-orchard fruit localization and sizing. The Azure Kinect was recommended in terms of depth accuracy, outdoor use, cost, and its integrated RGB-D capability. The Blaze 101 (Basler, Ahrensburg, Germany) was recommended for its relative insensitivity to daylight, achieved through use of 940 rather than 850 nm illumination, and for its IP67 rating.

4. Software for Machine Vision-Based Fruit Sizing

Fruit size estimation using machine vision requires object detection, followed by extraction of pixels belonging to fruit using a color- or intensity-based threshold or a deep learning-based segmentation method. Having segmented the object of interest, either the ‘2D’ or the ‘3D’ method can be applied. These topics are covered in this section.

4.1. Image Segmentation

One of the common approaches used for segmentation of fruit pixels in an image is thresholding, which involves setting a threshold value for pixel intensity, with pixels categorized as fruit or background according to whether they fall above or below the threshold. For example, the Otsu method [79], in which the threshold is set from the grayscale histogram, has been used for fruit segmentation by Wang and Li [65] for sweet onion; Wang, Walsh, and Verma [13] and Wang, Koirala, Walsh, Anderson, and Verma [23] for mangoes; Gongal, Karkee, and Amatya [25] and Lu, Chen, Zhang, and Karkee [27] for apples; and by Ponce, Aquino, Millán, and Andújar [59] for olives. Thresholding methods, including Otsu segmentation, fail if the object of interest (fruit) and background objects have similar characteristics, e.g., color and texture, with a false segmentation mask resulting in a false sizing result. Other color- and intensity-based thresholding methods, e.g., [64], can also fail to properly segment fruit pixels from the background.
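As a minimal sketch of the Otsu approach (a NumPy-only illustration, not the implementation used in the cited studies), the threshold is found by exhaustively searching for the grayscale value that maximises the between-class variance of the two resulting pixel classes:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: return the grayscale threshold that maximises
    the between-class variance of background vs. foreground pixels."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = float(np.dot(np.arange(256), hist))
    best_t, best_var = 0, 0.0
    w0, sum0 = 0.0, 0.0
    for t in range(256):
        w0 += hist[t]                  # pixel count at or below t
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        m0 = sum0 / w0                           # mean of lower class
        m1 = (sum_all - sum0) / (total - w0)     # mean of upper class
        var_between = w0 * (total - w0) * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

A fruit mask would then be obtained as `gray > otsu_threshold(gray)`, which, as noted above, is only reliable when fruit and background intensities are well separated.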
An alternative approach involves use of a CNN based semantic segmentation network, such as U-Net [80]. Semantic segmentation algorithms categorize image pixels into classes but do not separate instances of the same class. This limits the use of the technique in sizing of fruits that overlap in bunches/clusters. For example, Fukuda, Okuno, and Yuki [56] used a U-Net based segmentation method for segmentation and sizing of on-tree pear fruit.
CNN instance segmentation networks are capable of segmenting pixels belonging to different object classes and separating instances of each object class. Mask R-CNN [82] is a popular instance segmentation network based on the two-stage detection method of R-CNN [83]. Mask R-CNN was used for segmentation and sizing of tomato by Lee, Nazki, Baek, Hong, and Lee [53] and Hsieh, Huang, Hsiao, Tuan, Shih, Hsieh, Chen, and Yang [54]; for mango by Neupane, Koirala, and Walsh [40]; and for apple by Mirbod, Choi, Heinemann, Marini, and He [28]. YOLOv8 (https://github.com/ultralytics/ultralytics, accessed on 8 March 2023) is a recently developed object detection and instance segmentation network which offers better speed, based on the one-stage detection method of YOLO. This network is recommended for applications requiring real-time fruit sizing.
Following the separation of object (fruit) instances using an instance segmentation technique, a check is required to verify whether each instance is the shape mask of a complete fruit or the mask of a partly occluded fruit.

4.2. 2D Segmentation

The 2D method involves measurement of the lineal dimensions of the detected object in terms of image pixels, followed by conversion of pixel dimensions to real-world dimensions. This conversion can be based on the use of a reference scale placed in the image plane of the object, or on use of a camera pin-hole model, given the camera-to-fruit distance, the thin lens formula, and intrinsic camera parameters such as focal length. As noted earlier, camera-to-fruit distance can be obtained using RGB-D cameras based on stereo-vision, structured light, ToF, or active infrared (IR) technologies.
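The pin-hole conversion itself is a one-line relationship. As an illustrative sketch (function name and example values are hypothetical, not drawn from the cited studies), with the focal length expressed in pixel units:

```python
def pixels_to_mm(length_px, depth_mm, focal_px):
    """Pin-hole camera model: an object spanning `length_px` pixels,
    at camera-to-object distance `depth_mm`, with focal length
    `focal_px` (in pixel units), has a real-world length of
    length_px * depth_mm / focal_px millimetres."""
    return length_px * depth_mm / focal_px
```

For example, a fruit spanning 120 px at a depth of 1.0 m with a 1400 px focal length would measure roughly 86 mm, which also illustrates why depth error translates proportionally into sizing error.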
Example applications follow, chosen to represent a progression in the evolution of techniques in the 2D method.
A ‘first generation’ approach to scaling involves placement of an object of known size in the object plane, as illustrated by the work of Wang, Koirala, Walsh, Anderson, and Verma [23], which involved an app on a mobile device, with fruit imaged against a backing board of a contrasting color that positioned the fruit relative to the camera such that fruit length and width were captured in the image. A circular marker on the board was used as a scale, with a correction for the difference between the plane of the fruit perimeter and that of the scale based on an allometric relationship between fruit length and thickness. For sizing of citrus fruit in canopy images, Apolo-Apolo, Martínez-Guanter, Egea, Raja, and Pérez-Ruiz [37] placed a rectangular marker of known size into the canopy to act as a reference scale. Kohno, Ting, Kondo, Iida, Kurita, Yamakawa, Shiigi, and Ogawa [49] developed a mobile citrus fruit size grading system in which the camera-to-fruit distance was fixed.
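The reference-marker calculation reduces to a simple ratio (a sketch with hypothetical function names and values; the plane-offset correction described above is not included):

```python
def mm_per_pixel(marker_px, marker_mm):
    """Scale factor derived from a reference marker of known physical
    size lying in (approximately) the same plane as the fruit."""
    return marker_mm / marker_px

def fruit_size_mm(fruit_px, marker_px, marker_mm):
    """Fruit dimension in mm from its pixel dimension and the marker scale."""
    return fruit_px * mm_per_pixel(marker_px, marker_mm)
```

For instance, a 40 mm marker spanning 200 px gives 0.2 mm/px, so a fruit spanning 520 px measures about 104 mm; any tilt of the camera relative to the marker plane biases this scale, which is the application issue noted above.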
In a variant on the inclusion of a scale in image, Hsieh, Huang, Hsiao, Tuan, Shih, Hsieh, Chen, and Yang [54] used the ratio of bounding box dimensions of detected tomato fruit in images to the physical dimensions of sampled fruit as a calibration factor to predict fruit sizes on new images.
A ‘second generation’ approach to image scaling involves use of an RGB-D camera to obtain camera-to-object distance information, as illustrated by the work of Wang, Walsh, and Verma [13], who used depth information from the Kinect v2 RGB-D camera for conversion of bounding box pixel dimensions into real-world dimensions. Similarly, Kurtser, Ringdahl, Rotstein, Berenstein, and Edan [63] employed depth information from the RealSense D435 RGB-D camera to estimate grape cluster size, and Bortolotti, Mengoli, Piani, Grappadelli, and Manfrini [78] used distance measurements from the RealSense D435 and D455.
In a variant on this approach, Wittstruck, Kühling, Trautz, Kohlbrecher, and Jarmer [67] estimated the size of pumpkin fruit from high-resolution aerial imagery of a UAV flown at a known height above the ground. Measurement error in this application is relatively high, with a reported standard deviation on measurement residuals of 3.0 cm on fruit diameter, for fruit with a mean diameter of 13.8 cm. This error could be associated with low pixel resolution and/or error in height measurement.
Other advances have occurred in the methods used in fruit detection and segmentation, as described in the previous section. A ‘first generation’ approach is illustrated by Kohno, Ting, Kondo, Iida, Kurita, Yamakawa, Shiigi, and Ogawa [49] who created binary masks for citrus fruit generated using a simple color thresholding technique. Similarly, Wang, Koirala, Walsh, Anderson, and Verma [23] used a binary mask of mango fruit obtained using Otsu’s dynamic thresholding, then a morphological operator to remove the fruit stalk before extracting pixel dimensions. Patel, Kar, and Khan [48] used HIS color space thresholding to obtain contours of mango fruits.
A ‘second generation’ approach to fruit detection and segmentation involves the use of machine learning for object detection, with fruit dimensions taken from a fitted bounding box. For example, Wang, Walsh and Verma [13] used a cascade classifier model trained on histogram of oriented gradient (HOG) features, followed by Otsu’s thresholding. Apolo-Apolo, Martínez-Guanter, Egea, Raja, and Pérez-Ruiz [37] used the deep learning Faster R-CNN model for fruit detection on images of tree canopies.
Kurtser, Ringdahl, Rotstein, Berenstein, and Edan [63] reported that of four algorithms trialed, viz., percentile bounding box edges, percentile bounding box diagonals, ellipsoid fitting, and cylinder fittings, the lowest average absolute error was obtained using enclosing bounding box on refined segmentation through color-based K-Means clustering.
A ’third generation’ approach involves deep learning-based semantic or instance segmentation to obtain object perimeters. For example, both [53] and [54] used the Mask R-CNN instance segmentation method to extract tomato fruit masks from images. Fukuda, Okuno and Yuki [56] utilized a deep learning semantic segmentation method (UNet). Zaenker, Smitt, McCool, and Bennewitz [57] employed the instance segmentation method (YOLACT) to extract masks of pepper fruit.
There is a general trend of improvement in reported measurement accuracy and precision across these ‘generations’ of technology, although direct comparison of published results is compromised by the use of different image sets and the reporting of different performance metrics. An RMSE of <5 mm is now routinely achieved on measurement of lineal fruit dimensions for non-occluded fruit. For a ‘best case’ result involving controlled lighting against a plain artificial background and a high camera resolution [53], mean average errors of 2.3 and 2.6 mm were achieved for fruit length and width estimates, respectively. This result likely represents the error of the reference method, caliper measurement, for which an SD of repeated measurements of around 2 mm was reported by Anderson et al. [86].
It is recommended that the performance metric of RMSE always be reported, to facilitate inter-study comparisons. The public release of RGB-D data sets for a number of commodities and imaging conditions is also recommended, to allow direct comparison of new techniques.
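For clarity of what is being recommended, RMSE over a set of machine vision estimates and reference (e.g., caliper) measurements is computed as follows (a generic sketch; the function name is illustrative):

```python
import numpy as np

def rmse(predicted_mm, reference_mm):
    """Root-mean-square error between machine vision size estimates
    and reference measurements, both in millimetres."""
    p = np.asarray(predicted_mm, dtype=float)
    r = np.asarray(reference_mm, dtype=float)
    return float(np.sqrt(np.mean((p - r) ** 2)))
```

Unlike mean (signed) error, RMSE penalises scatter as well as bias, which is why it supports inter-study comparison better than bias-only metrics.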

4.3. 3D Segmentation

Fruit segmentation on 2D image data fails when fruit are visually similar (in shape, color, and texture) to the background, although instance segmentation can improve results markedly. 3D information can also assist in segmentation. 3D point clouds can be generated from RGB-D data, with a method then required for identifying the cluster of points associated with the object of interest, i.e., the fruit. Information from multiple image captures, involving multiple perspectives of the fruit, can also be combined in generating the 3D point cloud.
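Generating a point cloud from a single RGB-D frame is a back-projection of the depth map through the pin-hole intrinsics. A minimal sketch (assuming a depth map in metres and intrinsics fx, fy, cx, cy in pixel units; names are illustrative):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map of shape (H, W) into an (H*W, 3)
    point cloud using the pin-hole model:
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

Point clouds from several camera poses can then be registered into a common frame before clustering out individual fruit.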
Color can be used as a criterion in 3D segmentation. Lin, Tang, Zou, Xiong, and Fang [51] reported color based segmentation of the 3D point cloud from RGB-D images of citrus fruit on tree using a Bayes-classifier, followed by grouping of adjacent points using a SVM classifier and a density clustering method. Wang and Chen [41] clustered RGB-D point clouds associated with pear fruit within canopy images using the locally connected convex patches (LCCP) method, followed by use of a principal component analysis bounding box algorithm to acquire morphological features of the fruit.
Gené-Mola, Sanz-Cortiella, Rosell-Polo, Escolà, and Gregorio [42] used structure from motion (SfM) and multi-view stereo (MVS) methods for generation of a 3D point cloud in an on-tree apple fruit sizing application. Of the methods of M-estimator sample consensus (MSAC), template matching and least squares, the template matching technique provided the lowest mean absolute error (MAE) for occluded fruits. In a banana fruit study, Hartley, Jackson, Pound, and French [38] used a 3D reconstruction method based on cycleGAN (generative adversarial network-based model). Zheng, Sun, Meng, and Nan [55] used a key-point RCNN model to identify six key points on vegetables from input color images, with mapping of the key points to a 3D coordinate system to obtain physical dimensions. Freeman and Kantor [46] used a YOLACT instance segmentation model for fruit detection, segmentation, and ROI generation. The 3D fruit surface from the point cloud was generated using the DBSCAN clustering algorithm and the axes of a fitted ellipse were used as dimension measures for fruit sizing. A comparison was made between the ‘ROI Viewpoint Planner’ (RVP) and the ‘Fruitlet Viewpoint Planner’ (FVP) methods, with the latter recommended for lower error in the sizing result. Future work should see consensus emerge on a recommended technique.
The 3D segmentation method has an increased computation requirement compared to the 2D methods, which can be problematic. For example, an average time of 1.25 s was reported for identification and localization of an individual fruit on the computing hardware used by Lin, Tang, Zou, Xiong, and Fang [51].
Using ‘ideal’ conditions (optimal indoor lighting and imaging of fruit on a turntable to capture multiple perspectives), Wang and Chen [41] reported RMSEs of 1.17 and 1.03 mm for pear fruit height and diameter, respectively, for sizing based on segmentation of the 3D point cloud. X, Y, and Z-axis positional errors of 7, −4, and 13 mm, respectively, were reported for fruit localization using the 3D point cloud of citrus fruit on trees [51], with a bias of −1 mm and a median absolute deviation of error of 4 mm reported in fruit size estimation. Using RGB-D images collected at an approximate camera-to-fruit distance of 1 m, MAPE on length measurements was 14.2% for cucumber, 7.4% for eggplant, 11.6% for tomato, and 14.5% for pepper [55]. At an approximate fruit length of 100 mm, these values are equivalent to errors of around 10 mm. Freeman and Kantor [46] reported an MAE of 1.04 mm in their application.
In summary, a major attraction to the use of the 3D method is its potential to improve segmentation of occluded fruit and to improve sizing accuracy, as indicated in the work of Gené-Mola, Sanz-Cortiella, Rosell-Polo, Escolà, and Gregorio [42].

5. Dealing with Occlusion

As noted earlier, occluded fruit in images of fruit on-tree must be excluded from size analysis, or a morphological operator employed to reconstruct the outline of the entire fruit.
Partly occluded fruit within 2D images have been excluded from analysis following identification using various geometric rules. For example, Wang, Walsh, and Verma [13] and Neupane, Koirala, and Walsh [40] applied ellipse fitting to segmentation masks, with a pixel-area overlap threshold used to validate fully visible (complete) fruit. Occluded fruit were also filtered on the basis of pixel mask area, object depth values, and ellipse eccentricity.
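The eccentricity criterion can be sketched as follows (a NumPy-only illustration assuming a binary mask with more than a handful of pixels; the function names and the 0.8 cutoff are hypothetical, not the thresholds used in the cited studies). The eccentricity is taken from the ellipse with the same second moments as the mask:

```python
import numpy as np

def mask_eccentricity(mask):
    """Eccentricity of the ellipse sharing the binary mask's second
    moments: 0 for a circle, approaching 1 for an elongated shape."""
    ys, xs = np.nonzero(mask)
    cov = np.cov(np.stack([xs, ys]))          # 2x2 covariance of pixel coords
    evals = np.sort(np.linalg.eigvalsh(cov))  # ascending eigenvalues
    return float(np.sqrt(max(0.0, 1.0 - evals[0] / evals[1])))

def looks_complete(mask, max_ecc=0.8):
    """Accept a mask for sizing only if its shape is plausibly a whole
    fruit; occlusion typically distorts the outline, raising eccentricity."""
    return mask_eccentricity(mask) < max_ecc
```

In practice this check would be combined with the mask-area and depth filters mentioned above, since a symmetrically occluded fruit can still yield a low eccentricity.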
In another approach using 2D image data, Mirbod, Choi, Heinemann, Marini, and He [28] attempted to train a neural network to classify fruit as occluded or non-occluded. However, false classification rates were relatively high, impacting size estimation.
Template matching or geometry reconstruction approaches have been used with circularly symmetrical fruit, such as blueberries, citrus fruit, and apples, for both 2D and 3D shape fitting, e.g., [41,42]. Gené-Mola, Sanz-Cortiella, Rosell-Polo, Escolà, and Gregorio [42] reported an MAE of 3.7 mm on apple fruit diameter using the M-estimator sample consensus (MSAC) algorithm-based sphere fitting method. Similarly, Mirbod, Choi, Heinemann, Marini, and He [28] used a sphere fitting approach on point clouds for apple fruit diameter estimation, reporting an MAE of 3.93 mm. The authors indicate that noisy point cloud data along the fruit surface expanded the fitted sphere, contributing to the error in fruit diameter estimation.
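The core of sphere fitting is a linear least-squares problem (a sketch only: MSAC, as used in the cited work, adds robust sample consensus on top of this to reject outlier points, which is not shown here). Expanding |p − c|² = r² gives a system that is linear in the centre c and in r² − |c|²:

```python
import numpy as np

def fit_sphere(points):
    """Least-squares sphere fit to an (N, 3) point cloud.
    |p - c|^2 = r^2 rearranges to 2 p.c + (r^2 - |c|^2) = |p|^2,
    linear in the unknowns (c, r^2 - |c|^2)."""
    A = np.hstack([2.0 * points, np.ones((len(points), 1))])
    b = (points ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = float(np.sqrt(sol[3] + center @ center))
    return center, radius
```

Because every surface point contributes equally, noise along the visible fruit surface inflates the fitted radius, consistent with the error source noted by the authors above.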

6. Commercial Offers for Fruit Sizing

As the technology for machine vision-based fruit sizing matures, commercial tools are beginning to be offered for grower use. These systems involve a diverse range of technologies, from the RGB and depth cameras of consumer-grade handheld mobile devices to specialty imaging systems mounted on mobile ground or aerial platforms. Some systems use edge computing capacity for in-field image processing, while others rely on transfer of data for cloud processing. Some systems employ 2D processing techniques, while others employ 3D processing. The range of technologies offered is highlighted in the following examples.
CropTrackerTM offers a tablet-based system. Images of fruit in a harvest bin are acquired at a scan distance of around 50 cm, with 3D reconstruction occurring in the cloud. Fruit diameter estimates within 1–3 mm of actual size are claimed.
SpectreTM is deployed on a smart tablet or phone or as a camera on a fixed portal frame, imaging the top layer of fruit in trucks. An image of harvested apple fruit in a field bin is processed using a deep learning model for fruit detection, and sizing is based on the known dimensions of the field bin. A perspective transformation is used to adjust the image of the field bin into a ‘top–down’ view, to accommodate operator difficulty in holding the camera parallel to the object. A color calibration feature allows adjustment for the lighting environment (outdoors vs. indoors). The app outputs a fruit count and a size distribution for the top layer of fruit.
Harvest Quality VisionTM is deployed either as an RGB-D enabled tablet equipped with a macro lens for imaging of fruit in field bins, or as an array of cameras on a fixed portal frame, imaging the top layer of fruit in trucks. The handheld application involves the acquisition of multiple images of fruit in field bins over approximately 3 s at a camera-to-fruit distance of 50 cm. A 3D reconstruction is undertaken on a cloud-based processor, providing color, quantity, and size information. A 1–3 mm sizing accuracy is claimed, enabled by the close camera-to-fruit distance and the 3D reconstruction.
PixofarmTM offers fruit count and sizing capability deployed on a mobile device. A sticker must be affixed to each fruit to be measured, providing a scale (akin to the approach of Wang, Koirala, Walsh, Anderson, and Verma [23]). An output of average size, size class distribution and growth rate are provided.
FruitScoutTM is an app delivered through a mobile device for estimation of trunk diameter, and bud, blossom, and fruit count and size [87].
AeroboticsTM offers an app for an iPhone device with three cameras for monitoring yield and for size estimation (Fresh Plaza, 2022). Images are uploaded to a cloud server for processing, with a claimed error of within 1.5 mm. A UAV-based solution is used for orchard imaging, providing ‘Smart Sampling Locations’ for fruit size measurement, with the app used to guide scouts to geo-referenced locations.
Tevel AeroboticsTM is developing a tethered UAV based apple harvester. The UAV is equipped with an RGB-D camera (RealSense D415) that is used in estimation of fruit size at the time of harvest.
The commercially available systems are evolving rapidly with improvements in consumer sensors in mobile devices and in cloud computing and communication network coverage of orchards. For example, it is notable that a number of the commercial products use cloud processing of images. Additionally notable is the quick adoption of the LiDAR camera introduced with the iPhone 12 and iPad Pro [88] into fruit sizing apps.
A commercially relevant solution can differ from a researcher’s solution for farm implementation reasons. For example, mobile phone processor capacity now allows for image processing using lightweight deep learning models, e.g., [89], but a practical implementation involving large numbers of images can be problematic. Thus, a number of commercial solutions involve cloud processing.
The merging of different solutions in the commercial products is also encouraging, e.g., in-field fruit sizing with a system for representative sample selection and location, and fruit sizing at harvest. However, continued research is required to validate the solutions being proposed.

This entry is adapted from the peer-reviewed paper 10.3390/s23083868
