Image stitching combines multiple images of the same scene, captured from different viewpoints, into a single image with an expanded field of view. Although this technique has many applications in computer vision, traditional methods rely on successively stitching pairs of images taken from multiple cameras. While this approach is effective for organized camera arrays, it poses challenges for unstructured ones, especially when handling scene overlaps.
1. Introduction
Image stitching is an important technique in many computer vision applications. It aims to combine multiple images captured from different viewpoints into a single image with a wider field of view (FOV) that encompasses all contributing images. Image stitching is a well-studied topic with widespread applications
[1][2][3][4][5], and has proven to be very useful in domains such as virtual reality, teleconferencing, sports broadcasting, and immersive technologies
[6][7][8]. However, existing stitching methods, despite their broad adoption, do not scale well to systems with many unorganized cameras.
State-of-the-art techniques often use sequential pairwise image stitching to generate panoramic images from multiple cameras
[8][9][10]. Sequential pairwise stitching is a multi-step process in which only two images are stitched at a time: at every step, a new image is stitched with the composite of all previously stitched images. This technique simplifies the problem of the complex overlapping regions introduced by multiple cameras, making it easy to find the intersection of the overlapping regions for efficient processing. However, the sequential process has several issues regarding time complexity and error propagation. An error introduced during an early merge is likely to be maintained rather than corrected, degrading the final output. Additionally, by only examining two sub-images at a time, current solutions often fail to properly merge the content in the resulting image, leading to broken objects and ghosting effects.
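To make the sequential pairwise process concrete, the following sketch illustrates it with OpenCV: each new image is aligned to the current composite through a feature-based homography and pasted into the still-empty regions. The image list frames, the assumption that all inputs are placed on a canvas of the same size, and the naive compositing rule are illustrative choices rather than the pipeline of any cited method.

```python
import cv2
import numpy as np

def stitch_pair(composite, new_img):
    """Warp new_img onto the composite using a feature-based homography."""
    orb = cv2.ORB_create(4000)
    k1, d1 = orb.detectAndCompute(composite, None)
    k2, d2 = orb.detectAndCompute(new_img, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d2, d1)
    src = np.float32([k2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = composite.shape[:2]
    warped = cv2.warpPerspective(new_img, H, (w, h))
    # Naive compositing: copy warped pixels only where the composite is still empty.
    empty = composite.sum(axis=2) == 0
    composite[empty] = warped[empty]
    return composite

def stitch_sequential(frames):
    """Sequential pairwise stitching: errors in early merges propagate forward."""
    composite = frames[0].copy()
    for img in frames[1:]:
        composite = stitch_pair(composite, img)
    return composite
```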
As an alternative, seam-based approaches provide a way to limit error propagation by differentiating the pixels that previously belonged to a seam; however, this approach introduces seam-related artifacts
[3][11]. Usually, an additional step is required to remove these artifacts. Methods such as Poisson blending
[4] are often the primary postprocessing choice for removing color artifacts; however, Poisson blending assumes that the boundary between overlapping pairs of images (i.e., not the overlap of all images at once, but the overlap between each pair of images) is well-defined. In a multi-camera system with unorganized camera arrays, this assumption does not hold: there may be many individual sets of overlapping images, which complicates the optimization process used in Poisson blending. As an alternative to Poisson blending, multi-band blending bypasses the complications stemming from large amounts of overlap; however, it requires many iterations to produce an acceptable result under the same conditions, raising the issue of time complexity.
Approaches that tackle each of these problems individually very well are common in the literature; unfortunately, these methods are rarely able to produce an artifact-free panorama stream in real time when applied to more complex and heterogeneous setups involving a large number of cameras
[12]. In some approaches, significant computational resources are required to remove visual artifacts. For example, seam-based methods require additional processing, such as Poisson image blending, to eliminate visual artifacts caused by differences in camera exposure
[11][13][14].
A few recent works have examined image stitching using deep learning
[15][16][17][18]. However, all of these works focus on pairwise stitching, wherein two inputs are stitched together and the result is then stitched with the next input until all inputs have been used, similar to the traditional methods discussed above. This can lead to several problems, in particular error propagation: a single error early in the process can cascade, with each iteration building upon and expanding pre-existing errors.
Deep learning is a powerful tool in computer vision, and recent advances in image generation, particularly through Generative Adversarial Networks (GANs), have inspired a large body of work. In a GAN, a generator creates an image, and a discriminator scores or rates it, with the ultimate goal being the creation of an image that can fool the discriminator
[19]. Image stitching is similar in scope; the central idea is finding a method to create a new image from several smaller ones such that the resulting image maintains all the information, content, and structure from the constituent images. In
[20], Shen et al. proposed using a GAN for image stitching; however, their solution has a few key weaknesses. Notably, their work focuses on pairwise stitching and requires a precomputed and entirely accurate binary mask to highlight the overlap between image pairs when generating an image, which adds additional computational overhead.
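For context, the adversarial objective described above is typically optimized by alternating two updates. The PyTorch sketch below shows one such step with hypothetical generator, discriminator, optimizers, and a batch of real images; it is a generic outline of GAN training, not the architecture of [19] or [20].

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real_batch, z_dim=128):
    """One alternating GAN update with the standard binary cross-entropy losses.
    The discriminator is assumed to output one raw logit per image."""
    b = real_batch.size(0)
    z = torch.randn(b, z_dim)
    fake = generator(z)

    # Discriminator: score real images as 1 and generated images as 0.
    d_opt.zero_grad()
    d_loss = F.binary_cross_entropy_with_logits(discriminator(real_batch), torch.ones(b, 1)) + \
             F.binary_cross_entropy_with_logits(discriminator(fake.detach()), torch.zeros(b, 1))
    d_loss.backward()
    d_opt.step()

    # Generator: try to make the discriminator score its output as real.
    g_opt.zero_grad()
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), torch.ones(b, 1))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```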
2. Image and Video Stitching
Image stitching aims to create seamless and natural photo-mosaics using multiple images from different sources. A comprehensive survey of traditional image stitching algorithms is provided in
[5]. Recent studies have focused on structure deformation and its extension to video stitching
[14][21]. These approaches assume a pairwise overlapping of cameras and use dynamic programming to search for the optimal seam, an approach that is unsuitable for unstructured multi-camera systems. To handle unstructured camera arrays, Perazzi et al.
[10] used a simple approach for camera alignment by applying pairwise homographies directly to the input videos. While this makes camera alignment initialization more flexible and straightforward, additional processing is required to handle lens distortion and exposure compensation.
Machine learning, particularly deep learning, has had a major impact in several domains, notably computer vision. Convolutional Neural Networks (CNNs), a type of deep learning architecture, are extremely successful when applied to traditional computer vision problems such as image classification
[22], object detection
[23], image segmentation
[24], and human pose detection
[25].
Image stitching has received far less attention from the deep learning community than these subdomains, though prior work does exist. Song et al. used CNNs in
[16][18], making use of weak supervision and expanding their network to work with images taken in a simulated outdoor environment, which can be more difficult as these images have more variation in exposure levels. In
[15], Chilukuri et al. stitched two images together and leveraged auto-encoders
[26] in addition to standard convolutional layers when constructing their network. Specifically, they encoded two input images into a shared space and then decoded the result into a single output image. Shen et al. proposed a method in
[20] involving the use of a Generative Adversarial Network to stitch together two images with overlapping fields of view using a CNN. Their work heavily leveraged a mirror setup to finely tune the amount of overlap between the fields of view of the images and to create perfectly aligned images for use as ground truth. However, while their proposed network introduces few artifacts and runs in real time, addressing two of the greatest challenges in image stitching, it requires a precomputed binary mask to highlight the overlap between the input images. Finally, in
[17], Nie et al. proposed a method using deep learning to better solve the problem of rectangling in image stitching. Again, this requires a precomputed binary mask and only attempts to solve pairwise stitching.
3. Parallax-Tolerant Stitching
Many recent approaches have focused on addressing parallax-tolerant stitching. One variety of these approaches assumes that all images with the same projection center are parallax-tolerant. It is possible to manipulate images to meet this constraint by carefully rotating each camera in the scene
[2][5]; however, many errors can be introduced through misalignment of these projection centers caused by objects moving during image acquisition or incorrect mitigation of lens distortion. These errors can be removed using Multi-Band Blending (MBB)
[2], content-preserving warping
[13], and seam selection
[3]. MBB usually provides satisfactory results; however, several iterations may be required for the algorithm to converge, making it unsuitable for real-time video stitching.
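As a point of reference, multi-band blending merges the Laplacian pyramids of pre-aligned images using a Gaussian pyramid of the blend mask. The NumPy/OpenCV sketch below handles a simplified two-image case with hypothetical inputs img_a and img_b and a float mask of the same shape; it illustrates the idea rather than the multi-image formulation in [2].

```python
import cv2
import numpy as np

def multiband_blend(img_a, img_b, mask, levels=5):
    """Blend two aligned images by mixing their Laplacian pyramids with a
    Gaussian pyramid of the blend mask (float in [0, 1], same shape as the images)."""
    gp_a = [img_a.astype(np.float32)]
    gp_b = [img_b.astype(np.float32)]
    gp_m = [mask.astype(np.float32)]
    for _ in range(levels):
        gp_a.append(cv2.pyrDown(gp_a[-1]))
        gp_b.append(cv2.pyrDown(gp_b[-1]))
        gp_m.append(cv2.pyrDown(gp_m[-1]))

    blended = None
    for i in range(levels, -1, -1):
        if i == levels:
            la, lb = gp_a[i], gp_b[i]  # coarsest level: Gaussian, not Laplacian
        else:
            size = (gp_a[i].shape[1], gp_a[i].shape[0])
            la = gp_a[i] - cv2.pyrUp(gp_a[i + 1], dstsize=size)
            lb = gp_b[i] - cv2.pyrUp(gp_b[i + 1], dstsize=size)
        level = gp_m[i] * la + (1.0 - gp_m[i]) * lb
        if blended is None:
            blended = level
        else:
            blended = cv2.pyrUp(blended, dstsize=(level.shape[1], level.shape[0])) + level
    return np.clip(blended, 0, 255).astype(np.uint8)
```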
4. Gradient Domain Smoothing
The main challenge with the seam-based approach is finding a good compromise between preserving the structure of an image and limiting the visual perception of the seam. When the emphasis is placed on preserving the structure of the objects in the scene, the seam becomes more visible in the stitched result. Additional steps are often required to remove seam-related artifacts using the Poisson equation, as formulated by Pérez et al.
[4]. The Poisson equation is designed to blend the images based on the assumption that the boundary of the intersection area is well-defined. To the best of our knowledge, this equation has not been formulated for blending several images (more than two) simultaneously. One solution often proposed in the literature involves reformulating the problem in the frequency domain and then using a guidance vector field to find an approximate solution with the FFT
[27]. This reformulation is known as the Fourier implementation of Poisson Image Editing
[27][28][29][30]. These algorithms effectively remove the remaining artifacts as long as the composite image does not mix very different styles. For instance, if one part of the scene is in shadow while another is under strong illumination, the resulting image tends to be either too bright or too dark.
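To illustrate the frequency-domain reformulation, the sketch below reconstructs a single-channel image from a guidance gradient field by solving the Poisson equation with the FFT under periodic boundary assumptions. The guidance field (gx, gy) and the fixed mean intensity are illustrative assumptions, and the code is a simplified version of the idea in [27], not its exact formulation.

```python
import numpy as np

def poisson_reconstruct_fft(gx, gy, mean_level=0.5):
    """Reconstruct an image whose gradients approximate the guidance field (gx, gy)
    by solving the Poisson equation in the frequency domain (periodic boundaries)."""
    H, W = gx.shape
    # Divergence of the guidance field via backward differences (wrap-around).
    div = (gx - np.roll(gx, 1, axis=1)) + (gy - np.roll(gy, 1, axis=0))
    # Eigenvalues of the periodic discrete Laplacian.
    fx = np.fft.fftfreq(W)[None, :]
    fy = np.fft.fftfreq(H)[:, None]
    denom = (2.0 * np.cos(2.0 * np.pi * fx) - 2.0) + (2.0 * np.cos(2.0 * np.pi * fy) - 2.0)
    denom[0, 0] = 1.0                        # avoid division by zero at the DC term
    u_hat = np.fft.fft2(div) / denom
    u_hat[0, 0] = mean_level * H * W         # fix the otherwise unconstrained mean intensity
    return np.real(np.fft.ifft2(u_hat))
```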
5. Supervised, Unsupervised, and Semi-Supervised Networks
As machine learning becomes increasingly popular, its limitations are becoming more apparent. One of the largest drawbacks of supervised learning (the most common machine and deep learning approach) is that it requires large datasets with accurate ground truth labels
The algorithm learns by making a prediction for each input and comparing the result with the known true result. The difference between the two is used to calculate a loss function, which the network then attempts to minimize. As it does so, the network’s predictions and the ground truth become more aligned and the model grows more accurate. Ideally, the loss eventually reaches a minimum value, resulting in outputs that closely match the ground truth.
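A minimal sketch of this loop in PyTorch, assuming a hypothetical model, labeled pairs (x, y), and a mean-squared-error objective:

```python
import torch
import torch.nn.functional as F

def supervised_step(model, optimizer, x, y):
    """One supervised update: predict, compare to ground truth, minimize the loss."""
    optimizer.zero_grad()
    prediction = model(x)
    loss = F.mse_loss(prediction, y)  # gap between prediction and ground truth
    loss.backward()                   # gradients of the loss w.r.t. the weights
    optimizer.step()                  # nudge the weights to reduce the loss
    return loss.item()
```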
It is not always possible to obtain a large dataset with accurate labels when training a network. One possible solution to this problem is unsupervised learning. In unsupervised learning, there are no known ground truth labels; instead, the characteristics of the data itself are used as labels. A classic example of this in deep learning is the use of autoencoders for noise reduction
[31]. In these networks, the inputs are taken as labels. An encoder-decoder network might receive an image, perform convolutions to lower its resolution, deconvolve it to restore the original resolution, and use the difference between the original image and its output to calculate the loss. Following this approach, labels can be generated quickly and automatically rather than through human annotation or expensive computation.
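A minimal denoising-autoencoder sketch along these lines, in which the clean input itself serves as the label; the layer sizes and noise level are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAE(nn.Module):
    """Encode to a lower-resolution representation, decode back, and learn to undo noise."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def unsupervised_step(model, optimizer, images, noise_std=0.1):
    """The input image is the label: reconstruct it from a corrupted copy."""
    optimizer.zero_grad()
    noisy = images + noise_std * torch.randn_like(images)
    loss = F.mse_loss(model(noisy), images)
    loss.backward()
    optimizer.step()
    return loss.item()
```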
Semi-supervised learning combines these two strategies. The network first trains on a labeled dataset in the same manner as in supervised learning; additional unlabeled data are then added to the training set, and training continues in an unsupervised manner. By priming the network with supervised data, it has a better chance of converging to a low loss than a network trained in an entirely unsupervised fashion, while requiring less labeled data than fully supervised learning.
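The two-phase schedule can be sketched as follows; the split between labeled and unlabeled loaders, the warm-up length, and the reconstruction loss used in the second phase are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def semi_supervised_training(model, optimizer, labeled_loader, unlabeled_loader,
                             warmup_epochs=5, total_epochs=50, noise_std=0.1):
    """Phase 1: supervised priming on labeled pairs.
    Phase 2: unsupervised refinement where inputs serve as reconstruction targets."""
    for _ in range(warmup_epochs):
        for x, y in labeled_loader:              # labeled pairs (input, ground truth)
            optimizer.zero_grad()
            F.mse_loss(model(x), y).backward()
            optimizer.step()
    for _ in range(total_epochs - warmup_epochs):
        for x in unlabeled_loader:               # unlabeled images only
            optimizer.zero_grad()
            noisy = x + noise_std * torch.randn_like(x)
            F.mse_loss(model(noisy), x).backward()   # the input is the label
            optimizer.step()
    return model
```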
6. Image Quality Assessment (IQA)
An important aspect of the training process is efficiently assessing the quality of the generated output in a way that correlates with human judgment. This task is challenging in the context of unstructured image stitching for two main reasons. First, the camera registration process that allows images to be aligned in a frame of reference prior to stitching consists of geometric transformations. These transformations often depend on the perspective of dominant objects in the picture. Homography, for example, seeks to favor dominant planar structures. Thus, the transformation matrices used for projecting images into a common warping space are obtained as a trade-off between the content of the image and the objective scene
[2][5][10]. For this reason, it is difficult to design a ground truth dataset for an unstructured array of cameras without it being subject to the geometric errors introduced during the registration process. Second, in unsupervised image stitching the goal is to compare the generated image to the warped images. Because of the geometric errors introduced during the alignment process, pixel-based metrics such as MSE, PSNR, and SSIM, which assess image quality through direct pixel-to-pixel comparison, are not suitable for evaluating the quality of the generated image against the warped inputs. In addition, these metrics do not usually correlate with human judgment, as shown by
[32][33][34].
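For completeness, these pixel-based metrics reduce to direct array comparisons; the short sketch below uses scikit-image (assuming a recent version that supports channel_axis, and 8-bit RGB images). Any residual misalignment between the warped inputs and the generated image directly lowers these scores, regardless of perceptual quality.

```python
import numpy as np
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

def pixel_metrics(reference, generated):
    """Direct pixel-to-pixel comparisons between two images of identical size."""
    return {
        "mse": mean_squared_error(reference, generated),
        "psnr": peak_signal_noise_ratio(reference, generated, data_range=255),
        "ssim": structural_similarity(reference, generated, data_range=255, channel_axis=-1),
    }
```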
Recently, several metrics have been proposed to evaluate performance in GAN-based image generation models
[32][33][34][35][36]. These metrics can be categorized as feature-based, as they evaluate image quality using high-level features from pretrained networks. As opposed to pixel-based metrics (SSIM, PSNR, MSE, etc.), which compute the similarity between two images directly from pixel values, feature-based metrics correlate well with human perception
[32]. The Fréchet Inception Distance (FID)
[34] was created to evaluate the performance of GANs by measuring the Fréchet distance, in feature space, between the distribution of real images and that of generated images. The FID has been widely adopted in the literature for IQA, along with other metrics such as the Inception Score (IS) and LPIPS.
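The FID itself reduces to a closed-form Fréchet distance between two Gaussians fitted to Inception activations. The NumPy/SciPy sketch below assumes the feature matrices real_feats and fake_feats (shape: samples by features) have already been extracted from a pretrained Inception network.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    """FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2)),
    computed on (n_samples, n_features) arrays of Inception activations."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```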