1. Introduction
Despite significant advancements in the medical field, the prevalence of Non-Communicable Diseases (NCDs), such as cardiovascular diseases, cancers, chronic respiratory diseases, obesity, and diabetes, remains alarmingly high. According to a report from the World Health Organization (WHO)
[1], in 2022, NCDs were responsible for 41 million deaths, accounting for 74% of all global deaths, with 40% of these occurring prematurely before the age of 70. This NCD epidemic not only has devastating health consequences for individuals, families, and communities, but also poses a significant burden on healthcare systems worldwide. This burden makes their prevention and control a crucial priority for the 21st century.
Diet plays an important role in the prevention and treatment of NCDs
[2]. Unhealthy dietary habits and a lack of knowledge about proper nutrition often contribute to poor diet choices. Fortunately, dietary assessment can help monitor daily food intake and promote healthier eating habits. In recent years, researchers in the field of computer vision and health have shown great interest in dietary assessments
[3]. Tools for automating the dietary assessment process have emerged with the widespread use of smartphones with high capacities and the advancements in computer vision models. These tools are known as vision-based dietary assessment (VBDA) systems
[4][5][6][7][8]. They use computer vision models to identify food categories, estimate food volume, and derive nutrient content directly from smartphone camera pictures. VBDA systems typically involve three stages: food image analysis, portion estimation, and nutrient derivation
[4]. The performance of the first two stages heavily relies on the effectiveness of artificial intelligence algorithms and the availability of good food datasets, while the final stage depends on a nutritional composition database. The food image analysis stage entails segmenting food regions from the background and recognizing each type of food present in the image. The next step involves evaluating the quantity or volume of each detected food item.
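The data flow between the three stages can be sketched as follows. This is an illustrative sketch only: the `FoodSegment` type, the `derive_energy_kcal` function, the food labels, and every density and energy value are hypothetical placeholders, not figures from any composition database. It assumes stages one and two have already produced a label and a volume for each segmented region, and shows only the final nutrient-derivation step.

```python
from dataclasses import dataclass


@dataclass
class FoodSegment:
    label: str        # output of stage 1 (food image analysis)
    volume_ml: float  # output of stage 2 (portion estimation)


# Hypothetical composition table: label -> (density in g/ml, kcal per 100 g).
# Placeholder numbers for illustration only.
COMPOSITION = {
    "food_a": (1.0, 100.0),
    "food_b": (0.5, 200.0),
}


def derive_energy_kcal(segments, composition):
    """Stage 3: convert each segment's volume to mass, then to energy."""
    total = 0.0
    for seg in segments:
        density, kcal_per_100g = composition[seg.label]
        mass_g = seg.volume_ml * density
        total += mass_g * kcal_per_100g / 100.0
    return total


meal = [FoodSegment("food_a", 200.0), FoodSegment("food_b", 100.0)]
energy = derive_energy_kcal(meal, COMPOSITION)
print(energy)
```

The sketch makes the dependency explicit: an error in the predicted label or volume propagates directly into the nutrient estimate, which is why the first two stages dominate overall accuracy.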
Food image segmentation and recognition pose significant challenges for several reasons. One of the primary challenges is the non-rigid structure of food, which differs from common objects and makes shape an unreliable feature for machine learning models. Additionally, foods usually have high intra-class variation: the visual characteristics of the same food can differ significantly from one instance to another. This variation is particularly pronounced in African foods, further complicating accurate food recognition. Inter-class resemblance is another source of recognition errors, as different food items can appear very similar, as illustrated in Figure 1. Examples among generic foods include brownies and chocolate cake, or margarine and butter. Moreover, a dish may be prepared with varying ingredients, so the same dish can look visually distinct from one serving to another. Another significant challenge is the scarcity of publicly available datasets for model training, which hinders the development of accurate segmentation models.
Figure 1. Different kinds of Cameroonian food with a similar yellow texture.
Current research on food image segmentation and recognition focuses mainly on images of Asian and Western foods. Only a few datasets are publicly available for image segmentation, and none of them includes images of African foods, as shown in Table 1. Yet African foods, including Cameroonian foods, present their own unique challenges. African dishes often consist of multiple mixed classes of food, as depicted in Figure 1. This complexity adds significant difficulty when attempting to segment and recognize individual food items: the more food classes are mixed together on a plate, the harder it is to accurately detect the contours of each food component in the dish.
2. Food Image Dataset
With advancements in deep learning models for computer vision, the field of food segmentation and recognition techniques is rapidly evolving. However, the performance of these techniques heavily relies on the availability of large and diverse well-annotated image datasets. Collecting such datasets is a labor-intensive task, and the quality of annotations directly affects the performance of the models.
While some publicly available food image datasets exist, only a few of them are annotated for image segmentation and detection tasks
[9][10]. The annotation process for segmentation is particularly tedious and sensitive. Image segmentation datasets vary in terms of the geographic origin of the food and the methods used for image collection. Some datasets are annotated with only bounding boxes (UECFOOD100
[11], UECFOOD256
[12]), while others include polygon or mask annotations (MyFood Dataset
[13], FoodSeg103
[14], UECFoodPixComplete
[15]).
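The practical difference between the two annotation types can be illustrated with a small sketch. The polygon coordinates and the function names `polygon_to_bbox` and `polygon_area` are hypothetical and not taken from any of the datasets above. The sketch derives a bounding box from a polygon and compares the two areas, showing how much background a box can include.

```python
def polygon_to_bbox(polygon):
    """Return (x_min, y_min, width, height) enclosing a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def polygon_area(polygon):
    """Shoelace formula for a simple (non-self-intersecting) polygon."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical triangular food region, in pixel coordinates.
triangle = [(10, 10), (50, 10), (10, 40)]
bbox = polygon_to_bbox(triangle)
# Fraction of the bounding box actually covered by the annotated food region.
coverage = polygon_area(triangle) / (bbox[2] * bbox[3])
print(bbox, coverage)
```

A polygon can always be reduced to a box, but not the reverse, which is one reason a dataset annotated only with bounding boxes cannot be used directly to train a segmentation model.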
There are four main methods for collecting images for food image datasets. First, images can be captured in a standardized laboratory environment (e.g., UNIMIB2016
[16]), which ensures high-resolution and good-quality images. However, this method typically limits the number of collected images. Second, images can be downloaded from the internet, either from social networks
[17] or search engines
[13]. This approach allows for the collection of large numbers of images, but can also result in a large number of non-food images that need to be sorted out. Downloaded images may also vary in quality and include blurry images, images with text, low-resolution images, or retouched images. Third, images can be collected directly from users
[18], which provides a realistic representation of real-life scenarios. However, implementing this method can be challenging, as it requires a large number of users and an extended period to collect a substantial number of images. Finally, some datasets are built with images from other existing datasets. For instance, the UECFoodPixComplete
[15] dataset was built by annotating UECFOOD100
[11] images. Likewise, Food201-Segmented is made from segmented Food-101
[19] images (see
Table 1).
Table 1 lists the publicly available food image datasets for detection and segmentation tasks identified at the present stage of the researchers' investigation. These datasets are classified based on their main characteristics, such as their usage, number of classes, total number of images, method of image collection, and the origin of the dishes represented in the images. Notably, the available datasets mostly focus on Asian or Western foods, and there is currently no dataset available for African foods. As part of this work, the researchers propose CamerFood10, the first dataset specifically designed for image segmentation of African food.
3. Segmentation Model for Food Image
Food image segmentation occurs in the first stage of a VBDA system. Its purpose is to separate food items from the background and from each other. The task is challenging when food items overlap or lack strong visual features that contrast with the other food items on a plate. Several methods have been proposed to address issues in food image segmentation. They can be classified into three categories
[9]: (i) semi-automatic approaches, (ii) automatic approaches involving machine learning (ML) with handcrafted feature extraction, and (iii) automatic approaches with deep learning feature extraction.
In semi-automatic techniques for food segmentation, the user is asked to select regions of interest in the image or mark some pixels as food items or background. A drawback of the semi-automatic method is that it is tedious. It adds many additional actions for the user, unlike automatic segmentation approaches, where the user only needs to capture the food image. Automatic food image segmentation methods with handcrafted feature (e.g., colour, texture, and shape) extraction rely on traditional image processing techniques
[9][25][26], such as region growing and merging
[27], Normalized Cuts
[28], Simple Linear Iterative Clustering (SLIC), the Deformable Part Model (DPM), the JSEG segmentation algorithm, K-means
[29], and GrabCut
[30]. One of the most popular works using these techniques was presented by Matsuda and Yanai
[11] and addresses the problem of images containing multiple foods. It detects several candidate food regions by fusing the outputs of several region detectors (DPM, a circle detector, and JSEG). It then recognizes each candidate region independently using various feature descriptors (bag of SIFT features, HOG, Gabor textures) and a support vector machine (SVM).
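To give a feel for how a handcrafted colour feature can drive segmentation, the sketch below clusters pixel colours with plain K-means. It is not the method of Matsuda and Yanai or of any cited work. The toy pixel values, the hand-picked initial centres, and the function name `kmeans_pixels` are assumptions made for illustration, and a real system would also use texture and spatial information.

```python
def kmeans_pixels(pixels, centers, iterations=10):
    """Cluster RGB pixels with plain k-means; returns (labels, centers)."""
    centers = [tuple(float(v) for v in c) for c in centers]
    labels = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: nearest centre by squared Euclidean distance.
        for i, p in enumerate(pixels):
            labels[i] = min(
                range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
        # Update step: each centre moves to the mean of its assigned pixels.
        for k in range(len(centers)):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:  # keep the old centre if a cluster is empty
                centers[k] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return labels, centers


# Toy 2x3 "image" flattened to a pixel list: yellowish food vs. white plate.
pixels = [(230, 200, 40), (225, 190, 50), (250, 250, 250),
          (240, 205, 45), (245, 245, 245), (255, 255, 255)]
labels, centers = kmeans_pixels(pixels, centers=[pixels[0], pixels[2]])
print(labels)
```

Colour clustering of this kind separates food from a contrasting plate easily, but it cannot distinguish two foods that share a similar colour, which is exactly the inter-class resemblance problem described in the introduction.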
With the introduction of deep learning, deep neural networks automatically extract food image features and perform better than methods using traditional image processing techniques
[7][9]. Im2Calories
[20] was one of the pioneering works that used deep convolutional neural networks (CNNs) for semantic segmentation of food images. Pouladzadeh et al.
[31] combined graph-cut segmentation with CNNs for calorie measurement, although their approach was limited to images with a single food label. Some studies have focused on simultaneous localization and recognition of foods using object detection models, such as
[32], which uses Fast-RCNN, and
[33], which uses YOLO. Chiang et al.
[34] proposed a model based on a mask region-based convolutional neural network (Mask-RCNN) with a union post-processing technique.
In 2021, Wu et al.
[14] proposed a semantic segmentation method consisting of recipe learning (ReLeM) and image segmentation modules. They used a long short-term memory (LSTM) network as an encoder and a vision transformer architecture as a decoder, and they achieved 43.9% mIoU on their FoodSeg103 dataset. Okamoto and Yanai
[15] used the DeepLabv3+ model on their UECFoodPixComplete dataset and obtained a 55.5% mIoU. Liang et al.
[18] introduced a model called ChineseFoodSeg to address challenges specific to Chinese food images, such as blurred outlines, rich colors, and varied appearances. Their model outperformed DeepLabv3+, U-Net, and Mask-RCNN on the ChineseDiabetesFood187 dataset, achieving an accuracy of 94% and an mIoU of 79%. However, their proposed method is more complex and less time-efficient than DeepLabv3+. Sharma et al.
[35] proposed a novel architecture named GourmetNet, which incorporates channel and spatial attention into an expanded multi-scale feature representation using an advanced Waterfall Atrous Spatial Pooling (WASPv2)
[36] module. GourmetNet achieved state-of-the-art performance on the UNIMIB2016 and UECFoodPix datasets, with an mIoU of 71.79% and 65.13%, respectively. A more recent work,
[37], proposed Bayesian versions of DeepLabv3+ and GourmetNet
[35] to perform multi-class segmentation of foods.
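The mIoU figures quoted above average, over classes, the ratio between the intersection and the union of predicted and ground-truth pixels. The sketch below shows one common way to compute it on a single pair of label maps. The toy label maps and the function name `mean_iou` are illustrative assumptions, and the cited works may differ in details such as accumulating counts over a whole dataset or excluding the background class.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Flattened 2x4 label maps: 0 = background, 1 and 2 = two food classes.
target = [0, 0, 1, 1, 0, 2, 2, 2]
pred = [0, 1, 1, 1, 0, 2, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 4))
```

Because each class contributes equally regardless of its pixel count, mIoU penalizes errors on small food items more heavily than pixel accuracy does, which helps explain why a model can report 94% accuracy alongside a 79% mIoU.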
It is worth noting that the quality of the image dataset plays a significant role in the performance of these models. Well-arranged food on plates, photographed with good clarity, often leads to better results. Unfortunately, real-world images rarely meet these conditions.