Semi-Supervised Image Stitching from Unstructured Camera Arrays

Image stitching combines multiple images of the same scene, captured from different viewpoints, into a single image with an expanded field of view. While this technique has various applications in computer vision, traditional methods rely on the successive stitching of image pairs taken from multiple cameras. This approach is effective for organized camera arrays, but it can pose challenges for unstructured ones, especially when handling scene overlaps.

Keywords: image stitching; self-supervised learning; image blending

1. Introduction

Image stitching is an important technique in many computer vision applications. It aims to combine multiple images captured from different viewpoints into a single image with a wider field of view (FOV) that encompasses all contributing images. The topic is well studied, with widespread applications [1][2][3][4][5], and has proven very useful in domains such as virtual reality, teleconferencing, sports broadcasting, and immersive technologies [6][7][8]. However, existing stitching methods, despite their broad adoption, do not scale well to systems with many unorganized cameras.
State-of-the-art techniques often use sequential pairwise image stitching to generate panoramic images from multiple cameras [8][9][10]. Sequential pairwise stitching is a multi-step process in which images are stitched two at a time: at every step, a new image is stitched onto the composite of all previously stitched ones. This technique simplifies the problem of complex overlapping regions introduced by multiple cameras, making it easy to find the intersection of the overlapping regions for efficient processing. However, the sequential process has several issues regarding time complexity and error propagation. An error introduced during an early merge is likely to be maintained rather than corrected, leading to many issues in the final output. Additionally, by only examining two sub-images at a time, current solutions often fail to properly merge the content in the resulting image, leading to broken objects and ghosting effects.
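To make the failure mode concrete, the following is a minimal sketch of the sequential pairwise loop using standard OpenCV primitives (ORB features and a RANSAC homography); it illustrates the general scheme rather than any particular published method, and it deliberately uses a fixed canvas and a hard overwrite for brevity:

```python
# A minimal sketch of sequential pairwise stitching; illustrative only, not
# the method proposed in this entry. Fixed canvas, hard overwrite for brevity.
import cv2
import numpy as np

def stitch_pair(composite, new_image):
    orb = cv2.ORB_create()
    g1 = cv2.cvtColor(composite, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(new_image, cv2.COLOR_BGR2GRAY)
    k1, d1 = orb.detectAndCompute(g1, None)
    k2, d2 = orb.detectAndCompute(g2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Homography mapping the new image into the composite's frame of reference.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
    warped = cv2.warpPerspective(new_image, H,
                                 (composite.shape[1], composite.shape[0]))
    out = composite.copy()
    empty = composite.sum(axis=2) == 0     # fill only unpopulated pixels
    out[empty] = warped[empty]
    return out

def sequential_stitch(images, canvas_scale=2):
    h, w = images[0].shape[:2]
    composite = np.zeros((h * canvas_scale, w * canvas_scale, 3), np.uint8)
    composite[:h, :w] = images[0]          # seed the canvas with the first image
    for img in images[1:]:
        # A registration error here is baked into every later iteration: each
        # new image is aligned against the possibly already-distorted result.
        composite = stitch_pair(composite, img)
    return composite
```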
As an alternative, seam-based approaches limit error propagation by differentiating the pixels that previously belonged to a seam; however, they introduce seam-related artifacts [3][11], and an additional step is usually required to remove them. Methods such as Poisson blending [4] are often the primary postprocessing choice for removing color artifacts; however, Poisson blending assumes that the boundary between each overlapping pair of images (not the overlap of all images jointly, but the overlap between each pair) is well-defined. In a multi-camera system with unorganized camera arrays, this assumption does not hold: there may be many individual sets of overlapping images, complicating the optimization used in Poisson blending. As an alternative to Poisson blending, multi-band blending bypasses the complications stemming from large amounts of overlap; however, it requires many iterations to produce an acceptable result under the same conditions, raising the issue of time complexity.
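For reference, OpenCV exposes pairwise Poisson blending through seamlessClone. The sketch below (the file names are hypothetical) blends one patch into a destination image; it works precisely because the mask gives the pair a single well-defined boundary, the assumption that fails for unorganized arrays:

```python
# Pairwise Poisson blending via OpenCV's seamlessClone; file names are
# hypothetical, and the patch is assumed to fit inside the destination.
import cv2
import numpy as np

src = cv2.imread("patch.png")            # image to insert
dst = cv2.imread("panorama.png")         # image to insert into
mask = 255 * np.ones(src.shape[:2], dtype=np.uint8)  # whole patch participates
center = (dst.shape[1] // 2, dst.shape[0] // 2)      # (x, y) placement
# seamlessClone solves the Poisson equation with dst's colors as the boundary
# condition, which is why a well-defined pairwise boundary is required.
blended = cv2.seamlessClone(src, dst, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("blended.png", blended)
```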
It is common to find approaches in the literature that tackle each of these problems individually very well; unfortunately, these methods are rarely able to produce an artifact-free panorama stream in real time when used with more complex and heterogeneous setups involving a large number of cameras [12]. In some approaches, significant computational resources are required to remove visual artifacts. For example, seam-based methods require additional processing, such as Poisson image blending, to eliminate visual artifacts caused by camera exposure [11][13][14].
A few recent works have examined image stitching using deep learning [15][16][17][18]. However, all of these focused on pairwise stitching, wherein two inputs are stitched together and the output is stitched with the next input until all inputs have been used, similar to the traditional methods discussed above. This can lead to several problems, in particular error propagation, as a single error early in the process can cascade; with each iteration, pre-existing errors are built upon and expanded.
Deep learning is a powerful tool in computer vision, and recent advances in image generation, particularly through Generative Adversarial Networks (GANs), have inspired a large body of work. In a GAN, a generator creates an image and a discriminator scores it, with the ultimate goal being the creation of an image that can fool the discriminator [19]. Image stitching is similar in scope; the central idea is to create a new image from several smaller ones such that the result maintains all the information, content, and structure of the constituent images. In [20], Shen et al. proposed using a GAN for image stitching; however, their solution has a few key weaknesses. Notably, their work focuses on pairwise stitching and requires a precomputed, entirely accurate binary mask to highlight the overlap between image pairs when generating an image, which adds computational overhead.
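The adversarial objective can be summarized in a short schematic PyTorch update, assuming a generator G and a discriminator D that ends in a sigmoid; the networks themselves are left unspecified, since this sketches the training game rather than any specific stitching model:

```python
# A schematic GAN update in PyTorch, assuming a generator G, a discriminator D
# ending in a sigmoid, and their optimizers; this sketches the objective only.
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z):
    # Discriminator step: push scores toward 1 on real data, 0 on fakes.
    fake = G(z).detach()                 # detach so only D is updated here
    d_real, d_fake = D(real), D(fake)
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step: produce fakes that D scores as real (i.e., fool D).
    d_fake = D(G(z))
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```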

2. Image and Video Stitching

Image stitching aims to create seamless and natural photo-mosaics from multiple images taken from different sources. A comprehensive survey of traditional image stitching algorithms is provided in [5]. Recent studies have focused on structure deformation and its extension to video stitching [14][21]. These approaches assume pairwise overlapping of cameras and use dynamic programming to search for the optimal seam, which is unsuitable for unstructured multi-camera systems. To handle unstructured camera arrays, Perazzi et al. [10] used a straightforward approach for camera alignment by applying pairwise homography directly to the input videos. While this makes camera alignment initialization more flexible and straightforward, it requires additional processing to handle lens distortion and exposure compensation.
Machine learning, particularly deep learning, is highly impactful in several domains, particularly computer vision. Convolutional Neural Networks (CNNs), a type of deep learning architecture, are extremely successful when applied to traditional computer vision problems such as image classification [22], object detection [23], image segmentation [24], and human pose detection [25].
Image stitching has received far less attention from deep learning experts than these subdomains, though this is not to say there has been no prior work. Song et al. used CNNs in [16][18], making use of weak supervision and expanding their network to work with images taken in a simulated outdoor environment, which can be more difficult as such images show more variation in exposure levels. In [15], Chilukuri et al. stitched two images together and leveraged auto-encoders [26] in addition to standard convolutional layers when constructing their network. Specifically, they encoded two input images into a shared space and then decoded the result into a single output image. Shen et al. proposed a method in [20] involving the use of a Generative Adversarial Network to stitch two images with an overlapping field of view together using a CNN. Their work heavily leveraged a mirror to finely tune the amount of overlap between the fields of view of the images and to create perfectly aligned images for use as ground truth. However, while their proposed network introduces few artifacts and is able to run in real time, which are among the greatest challenges in image stitching, it requires a precomputed binary mask to highlight the overlap between the input images. Finally, in [17], Nie et al. proposed a method using deep learning to better solve the problem of rectangling in image stitching. Again, this requires a precomputed binary mask and only attempts to solve pairwise stitching.

3. Parallax-Tolerant Stitching

Many recent approaches have focused on parallax-tolerant stitching. One variety of these approaches assumes that all images sharing the same projection center are parallax-tolerant. It is possible to manipulate images to meet this constraint by carefully rotating each camera in the scene [2][5]; however, many errors can be introduced through misalignment of these projection centers, caused by objects moving during image acquisition or incorrect mitigation of lens distortion. These errors can be removed using Multi-Band Blending (MBB) [2], content-preserving warping [13], and seam selection [3]. MBB usually provides satisfactory results; however, several iterations may be required for the algorithm to converge, making it unsuitable for real-time video stitching.
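As an illustration of the mechanics behind MBB, the sketch below implements a basic two-image multi-band blend with Laplacian pyramids in OpenCV; each frequency band is blended separately with a smoothed mask and the pyramid is then collapsed. The multi-image, iterative variants discussed above repeat this machinery across all overlaps, which is where the runtime cost arises:

```python
# A two-image multi-band (Laplacian pyramid) blend, sketched with OpenCV; a
# fixed band count and a single float mask in [0, 1] are assumed for brevity.
import cv2
import numpy as np

def laplacian_pyramid(img, levels):
    gp = [img.astype(np.float32)]
    for _ in range(levels):
        gp.append(cv2.pyrDown(gp[-1]))
    # Each Laplacian level stores the detail lost by one down/up round trip.
    lp = [gp[i] - cv2.pyrUp(gp[i + 1], dstsize=gp[i].shape[1::-1])
          for i in range(levels)]
    lp.append(gp[-1])                    # coarsest Gaussian level caps the pyramid
    return lp

def multiband_blend(a, b, mask, levels=4):
    la, lb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    gm = [mask.astype(np.float32)]
    for _ in range(levels):
        gm.append(cv2.pyrDown(gm[-1]))   # smooth the mask at every band
    # Blend each frequency band separately with its matching mask level.
    bands = [m[..., None] * x + (1.0 - m[..., None]) * y
             for x, y, m in zip(la, lb, gm)]
    out = bands[-1]
    for lev in bands[-2::-1]:            # collapse the pyramid coarse-to-fine
        out = cv2.pyrUp(out, dstsize=lev.shape[1::-1]) + lev
    return np.clip(out, 0, 255).astype(np.uint8)
```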

4. Gradient Domain Smoothing

The main challenge with the seam-based approach is finding a good compromise between preserving the structure of an image and minimizing the visual perception of the seam. When the emphasis is on preserving the structure of the objects in the scene, the stitches appear with a more visible seam. Additional steps are often required to remove seam-related artifacts using the Poisson equation, as formulated by Pérez et al. [4]. The Poisson equation blends the images based on the assumption that the boundary of the intersection area is well-defined. To the best of our knowledge, it has not been formulated for blending more than two images simultaneously. One solution often proposed in the literature is to formulate the problem in the frequency domain and then use a guidance vector field to approximate the solution with the FFT [27]. This reformulation is known as the Fourier implementation of Poisson Image Editing [27][28][29][30]. These algorithms effectively remove artifacts as long as the composite image is not multi-style; for instance, if one part of the scene is in shadow and another part is under strong illumination, the resulting image tends to be either too bright or too dark.
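The following is a minimal sketch of the FFT route: under periodic boundary assumptions, the discrete Poisson equation diagonalizes, so solving it reduces to a pointwise division in the frequency domain. This follows the spirit of [27]; the published implementations differ in details such as boundary handling:

```python
# A minimal FFT Poisson solve in the spirit of [27]; periodic boundaries and a
# single channel are assumed here, unlike the published variants.
import numpy as np

def poisson_solve_fft(div_g):
    """Solve the discrete Poisson equation lap(u) = div_g up to a constant."""
    h, w = div_g.shape
    wx = 2.0 * np.pi * np.fft.fftfreq(w)
    wy = 2.0 * np.pi * np.fft.fftfreq(h)
    # Eigenvalues of the 5-point Laplacian under periodic boundary conditions.
    lam = (2.0 * np.cos(wy)[:, None] - 2.0) + (2.0 * np.cos(wx)[None, :] - 2.0)
    lam[0, 0] = 1.0                       # avoid dividing the DC term by zero
    U = np.fft.fft2(div_g) / lam
    U[0, 0] = 0.0                         # the mean is undetermined; set it to 0
    return np.real(np.fft.ifft2(U))

# Sanity check: rebuilding an image from its own gradient field recovers it
# (minus its mean) exactly under these periodic conventions.
img = np.random.rand(64, 64)
gx = np.roll(img, -1, axis=1) - img       # forward differences, wrapping
gy = np.roll(img, -1, axis=0) - img
div = (gx - np.roll(gx, 1, axis=1)) + (gy - np.roll(gy, 1, axis=0))
rec = poisson_solve_fft(div)              # ~ img - img.mean()
```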

5. Supervised, Unsupervised, and Semi-Supervised Networks

As machine learning becomes increasingly popular, its limitations are becoming more apparent. One of the largest drawbacks of supervised learning (the most common machine and deep learning approach) is that it requires large datasets with accurate ground truth labels [31]. The algorithm learns by making predictions on inputs and comparing its results with the known true results. The differences between the two are used to calculate a loss function, which the network then attempts to minimize. As it does so, the network's predictions and the ground truth become more aligned and the model grows more accurate. Ideally, the loss eventually reaches a minimum value, at which point the network's outputs closely align with the ground truth.
It is not always possible to obtain a large dataset with accurate labels when training a network. One possible solution to this problem is unsupervised learning. In unsupervised learning, there are no known ground truth labels; instead, the characteristics of the data itself are used as labels. A classic example of this in deep learning is the use of autoencoders for noise reduction [31]. In these networks, the inputs are taken as labels. An encoder-decoder network might receive an image, perform convolutions to lower its resolution, deconvolve it to upscale it, and use the difference between the original image and output to calculate the loss. Following this approach, labels can be quickly and automatically generated rather than being found through human input or expensive computation.
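A toy PyTorch version of such a denoising autoencoder is sketched below; the clean input serves as its own label, so no human annotation is needed (the layer sizes here are arbitrary):

```python
# A toy denoising autoencoder: the clean input acts as its own label, so
# training requires no annotation. Layer sizes are illustrative only.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(      # downsample 64 -> 32 -> 16
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(      # upsample 16 -> 32 -> 64
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.rand(8, 3, 64, 64)              # stand-in batch of images
noisy = clean + 0.1 * torch.randn_like(clean) # corrupt the input
loss = nn.functional.mse_loss(model(noisy), clean)  # the input is the label
opt.zero_grad()
loss.backward()
opt.step()
```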
Semi-supervised learning combines the two approaches. The network first trains on a small labeled dataset in the same manner as in supervised learning; soon after, additional unlabeled data are added to the training set, on which training proceeds in an unsupervised manner. By priming the network with supervised data, it should have a better chance of converging to a low loss than one trained in an entirely unsupervised fashion, while requiring less labeled data than fully supervised learning.
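Schematically, this semi-supervised schedule can be written as follows, where the model, optimizer, data loaders, and both loss functions are assumed to be supplied by the caller:

```python
# A schematic semi-supervised schedule matching the description above; the
# model, optimizer, loaders, and loss functions are assumed to be supplied.
def train_semi_supervised(model, opt, labeled_loader, unlabeled_loader,
                          supervised_loss, self_supervised_loss,
                          warmup_epochs=5, total_epochs=20):
    for _ in range(warmup_epochs):            # phase 1: supervised priming
        for x, y in labeled_loader:
            loss = supervised_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    for _ in range(total_epochs - warmup_epochs):  # phase 2: unlabeled data
        for x in unlabeled_loader:
            # The label is derived from the data itself, e.g. reconstructing x.
            loss = self_supervised_loss(model(x), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
```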

6. Image Quality Assessment (IQA)

An important aspect of the training process is efficiently assessing the quality of the generated output in a way that correlates with human judgment. This task is challenging in the context of unstructured image stitching for two main reasons. First, the camera registration process that aligns images in a common frame of reference prior to stitching consists of geometric transformations. These transformations often depend on the perspective of dominant objects in the picture; homography estimation, for example, tends to favor dominant planar structures. Thus, the transformation matrices used for projecting images into a common warping space are obtained as a trade-off between the content of the image and the objective scene [2][5][10]. For this reason, it is difficult to design a ground truth dataset for an unstructured array of cameras without it being subject to the geometry-related errors introduced during registration. Second, in unsupervised image stitching the goal is to compare the generated image to the warped images. Because of the geometric errors introduced during alignment, pixel-based metrics such as MSE, PSNR, and SSIM, which assess image quality through direct pixel-to-pixel comparison, are not suitable for evaluating the generated image against the warped inputs. In addition, these metrics do not usually correlate with human judgment, as shown in [32][33][34].
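To illustrate the sensitivity of pixel-based metrics to misalignment, the snippet below scores a pair of images that differ only by a two-pixel horizontal shift, using the scikit-image implementations; for these random stand-in images, the shift alone collapses all three scores even though no content has changed:

```python
# Pixel-based scores on a pair differing only by a 2-pixel shift; scikit-image
# implementations are used, and the images are random stand-ins.
import numpy as np
from skimage.metrics import (mean_squared_error, peak_signal_noise_ratio,
                             structural_similarity)

ref = np.random.rand(128, 128)
shifted = np.roll(ref, 2, axis=1)            # same content, shifted 2 pixels
print("MSE :", mean_squared_error(ref, shifted))
print("PSNR:", peak_signal_noise_ratio(ref, shifted, data_range=1.0))
print("SSIM:", structural_similarity(ref, shifted, data_range=1.0))
```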
Recently, several metrics have been proposed to evaluate the performance of GAN-based image generation models [32][33][34][35][36]. These metrics can be categorized as feature-based, as they evaluate the quality of images using high-level features from pretrained networks. As opposed to pixel-based metrics (SSIM, PSNR, MSE, etc.), which compute the similarity between two images from pixel values directly, feature-based metrics correlate well with human perception [32]. The Fréchet Inception Distance (FID) [34] was created to evaluate the performance of GANs by measuring the Fréchet distance, in feature space, between the real image dataset and the generated one. The FID has been widely adopted in the literature, along with other metrics such as the Inception Score (IS) and LPIPS for IQA.
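Given feature vectors for the two image sets (e.g., Inception pool activations, assumed to be extracted upstream), the FID is the closed-form Fréchet distance between two Gaussians fitted to those features, as in this sketch:

```python
# FID as the closed-form Frechet distance between Gaussians fitted to feature
# sets [34]: d^2 = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2(C1 C2)^(1/2)).
# feats_a, feats_b: (N, D) arrays of features extracted upstream (assumed).
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    mu1, mu2 = feats_a.mean(axis=0), feats_b.mean(axis=0)
    c1 = np.cov(feats_a, rowvar=False)
    c2 = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(c1 @ c2, disp=False)
    covmean = np.real(covmean)   # drop tiny imaginary parts from numerics
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```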

References

  1. Szeliski, R. Computer Vision: Algorithms and Applications; Springer: London, UK; New York, NY, USA, 2011.
  2. Brown, M.; Lowe, D.G. Automatic Panoramic Image Stitching using Invariant Features. Int. J. Comput. Vis. 2007, 74, 59–73.
  3. Kwatra, V.; Schödl, A.; Essa, I.; Turk, G.; Bobick, A. Graphcut Textures: Image and Video Synthesis Using Graph Cuts. ACM Trans. Graph. 2003, 22, 277–286.
  4. Pérez, P.; Gangnet, M.; Blake, A. Poisson Image Editing. ACM Trans. Graph. 2003, 22, 313–318.
  5. Szeliski, R. Image Alignment and Stitching: A Tutorial. Found. Trends. Comput. Graph. Vis. 2006, 2, 1–104.
  6. Tchinda, E.N.; Kwadjo, D.T.; Bobda, C. A Distributed Smart Camera Apparatus to Enable Scene Immersion: Work-in-progress. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis Companion, New York, NY, USA, 13–18 October 2019; ACM: New York, NY, USA, 2019. CODES/ISSS ’19. pp. 18:1–18:2.
  7. Lee, W.T.; Chen, H.I.; Chen, M.S.; Shen, I.C.; Chen, B.Y. High-resolution 360 Video Foveated Stitching for Real-time VR. Comput. Graph. Forum 2017, 36, 115–123.
  8. Yang, W.; Qian, Y.; Kämäräinen, J.; Cricri, F.; Fan, L. Object Detection in Equirectangular Panorama. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2190–2195.
  9. Lee, K.Y.; Sim, J.Y. Warping Residual Based Image Stitching for Large Parallax. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020.
  10. Perazzi, F.; Sorkine-Hornung, A.; Zimmer, H.; Kaufmann, P.; Wang, O.; Watson, S.; Gross, M.H. Panoramic Video from Unstructured Camera Arrays. Comput. Graph. Forum 2015, 34, 57–68.
  11. He, B.; Yu, S. Parallax-Robust Surveillance Video Stitching. Sensors 2016, 16, 7.
  12. Lai, W.S.; Gallo, O.; Gu, J.; Sun, D.; Yang, M.H.; Kautz, J. Video stitching for linear camera arrays. arXiv 2019, arXiv:1907.13622.
  13. Jiang, W.; Gu, J. Video stitching with spatial-temporal content-preserving warping. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, 7–12 June 2015; pp. 42–48.
  14. Li, J.; Xu, W.; Zhang, J.; Zhang, M.; Wang, Z.; Li, X. Efficient Video Stitching Based on Fast Structure Deformation. IEEE Trans. Cybern. 2015, 45, 2707–2719.
  15. Chilukuri, P.K.; Padala, P.; Padala, P.; Desanamukula, V.S.; Pvgd, P.R. l, r-Stitch Unit: Encoder-Decoder-CNN Based Image-Mosaicing Mechanism for Stitching Non-Homogeneous Image Sequences. IEEE Access 2021, 9, 16761–16782.
  16. Song, D.Y.; Um, G.M.; Lee, H.K.; Cho, D. End-to-End Image Stitching Network via Multi-Homography Estimation. IEEE Signal Process. Lett. 2021, 28, 763–767.
  17. Nie, L.; Lin, C.; Liao, K.; Liu, S.; Zhao, Y. Deep Rectangling for Image Stitching: A Learning Baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5740–5748.
  18. Song, D.Y.; Lee, G.; Lee, H.; Um, G.M.; Cho, D. Weakly-Supervised Stitching Network for Real-World Panoramic Image Generation; Springer: Berlin/Heidelberg, Germany, 2022; pp. 54–71.
  19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27.
  20. Shen, C.; Ji, X.; Miao, C. Real-Time Image Stitching with Convolutional Neural Networks. In Proceedings of the 2019 IEEE International Conference on Real-time Computing and Robotics (RCAR), Irkutsk, Russia, 4–9 August 2019; pp. 192–197.
  21. Jia, J.; Tang, C. Image Stitching Using Structure Deformation. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 617–631.
  22. Wang, J.; Yang, Y.; Mao, J.; Huang, Z.; Huang, C.; Xu, W. CNN-RNN: A unified framework for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2285–2294.
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
  24. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
  25. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186.
  26. Ye, J.C.; Sung, W.K. Understanding geometry of encoder-decoder CNNs. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 7064–7073.
  27. Morel, J.M.; Petro, A.; Sbert, C. Fourier implementation of Poisson image editing. Pattern Recognit. Lett. 2012, 33, 342–348.
  28. Di Martino, J.M.; Facciolo, G.; Meinhardt-Llopis, E. Poisson Image Editing. Image Process. Line 2016, 6, 300–325.
  29. Petro, A.B.; Sbert, C. Selective Contrast Adjustment by Poisson Equation. Image Process. Line 2013, 3, 208–222.
  30. Farbman, Z.; Hoffer, G.; Lipman, Y.; Cohen-Or, D.; Lischinski, D. Coordinates for Instant Image Cloning. ACM Trans. Graph. 2009, 28, 67:1–67:9.
  31. Ng, A. Sparse autoencoder. CS294A Lect. Notes 2011, 72, 1–19.
  32. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
  33. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. arXiv 2016, arXiv:1606.03498.
  34. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Klambauer, G.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  35. Barratt, S.; Sharma, R. A Note on the Inception Score. arXiv 2018, arXiv:1801.01973.
  36. Liu, S.; Wei, Y.; Lu, J.; Zhou, J. An Improved Evaluation Framework for Generative Adversarial Networks. arXiv 2018, arXiv:1803.07474.