Vision-based dietary assessment (VBDA) systems generally consist of three main stages: food image analysis, portion estimation, and nutrient derivation. The effectiveness of the first stage depends heavily on accurate segmentation and image recognition models and on the availability of high-quality training datasets. Food image segmentation still faces various challenges, and most existing research focuses on Asian and Western food images.
1. Introduction
Despite significant advancements in the medical field, the prevalence of Non-Communicable Diseases (NCDs), such as cardiovascular diseases, cancers, chronic respiratory diseases, obesity, and diabetes, remains alarmingly high. According to a report from the World Health Organization (WHO) [1], in 2022, NCDs were responsible for 41 million deaths, accounting for 74% of all global deaths, with 40% of these occurring prematurely before the age of 70. This NCD epidemic not only has devastating health consequences for individuals, families, and communities, but also poses a significant burden on healthcare systems worldwide. This burden makes their prevention and control a crucial priority for the 21st century.
Diet plays an important role in the prevention and treatment of NCDs [2]. Unhealthy dietary habits and a lack of knowledge about proper nutrition often contribute to poor diet choices. Fortunately, dietary assessment can help monitor daily food intake and promote healthier eating habits. In recent years, researchers in computer vision and health have shown great interest in dietary assessment [3]. Tools for automating the dietary assessment process have emerged with the widespread use of high-capacity smartphones and advances in computer vision models. These tools are known as vision-based dietary assessment (VBDA) systems [4][5][6][7][8]. They use computer vision models to identify food categories, evaluate their volume, and estimate nutrient content directly from smartphone camera pictures. VBDA systems typically involve three stages: food image analysis, portion estimation, and nutrient derivation [4]. The performance of the first two stages relies heavily on the effectiveness of artificial intelligence algorithms and the availability of good food datasets, while the final stage depends on a nutritional composition database. The food image analysis stage entails segmenting food regions from the background and recognizing each type of food present in the image. The portion estimation stage then evaluates the quantity or volume of each detected food item.
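The three stages can be pictured as a simple chain of functions. The following is only a minimal sketch under stated assumptions, not the implementation of any cited system: the function names, the `FoodRegion` type, the `grams_per_pixel` scale factor, and the `KCAL_PER_100G` table (with placeholder values) are all hypothetical, and the first two stages are stubs standing in for trained models.

```python
from dataclasses import dataclass

# Hypothetical composition table (kcal per 100 g). The values are
# placeholders for illustration only, not real nutritional data.
KCAL_PER_100G = {"food_a": 100.0, "food_b": 250.0}


@dataclass
class FoodRegion:
    label: str       # class assigned by the recognition step
    pixel_area: int  # number of pixels in the segmented region


def analyze_image(image):
    """Stage 1 (stub): segment and recognize food regions.

    A real system would run a segmentation/recognition model here. This
    stub assumes `image` is already a 2D grid of class labels, with None
    marking background pixels.
    """
    areas = {}
    for row in image:
        for label in row:
            if label is not None:
                areas[label] = areas.get(label, 0) + 1
    return [FoodRegion(label, area) for label, area in areas.items()]


def estimate_portion_g(region, grams_per_pixel):
    """Stage 2 (stub): convert a region to grams via an assumed scale factor."""
    return region.pixel_area * grams_per_pixel


def derive_energy_kcal(label, portion_g):
    """Stage 3: look up the composition table and scale by portion size."""
    return KCAL_PER_100G[label] * portion_g / 100.0


def assess(image, grams_per_pixel):
    """Run the three stages in order and return kcal per detected food."""
    result = {}
    for region in analyze_image(image):
        grams = estimate_portion_g(region, grams_per_pixel)
        result[region.label] = derive_energy_kcal(region.label, grams)
    return result


# Toy 2x3 "image" whose pixels are already labelled.
image = [
    ["food_a", "food_a", None],
    ["food_b", None, None],
]
print(assess(image, grams_per_pixel=50.0))
```

The sketch makes the dependency structure visible: errors in the segmentation stage propagate into the portion estimate and then into the nutrient figure, while the last stage depends only on the composition table.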
Food image segmentation and recognition pose significant challenges for several reasons. One of the primary challenges is the non-rigid structure of food, which distinguishes it from common objects and makes shape an unreliable feature for machine learning models. Additionally, foods usually have high intra-class variation, meaning that the visual characteristics of the same food can differ significantly from one instance to another. This variation is particularly pronounced in African foods, further complicating accurate food recognition. Inter-class resemblance is another source of recognition errors, as different food items can appear very similar, as illustrated in Figure 1. Examples of generic foods with such resemblances include brownies and chocolate cake, and margarine and butter. Moreover, certain dishes may contain various ingredients, so the same dish can take on distinct visual appearances. Another significant challenge is the scarcity of publicly available datasets for model training, which hinders the development of accurate segmentation models.
Figure 1. Different kinds of Cameroonian food with a similar yellow texture.
Current research on food image segmentation and recognition focuses mainly on images of Asian and Western foods. Unfortunately, there are only a few publicly available datasets for image segmentation, and none of them incorporate images of African foods, as shown in Table 1. However, African foods, including Cameroonian foods, present their own unique challenges. African dishes often consist of multiple mixed classes of food, as depicted in Figure 1. This complexity adds significant difficulty when attempting to segment and recognize individual food items. The more food classes mixed together on a plate, the more challenging it is to accurately detect the contours of each food component in the dish.
2. Food Image Dataset
With advancements in deep learning models for computer vision, the field of food segmentation and recognition techniques is rapidly evolving. However, the performance of these techniques heavily relies on the availability of large and diverse well-annotated image datasets. Collecting such datasets is a labor-intensive task, and the quality of annotations directly affects the performance of the models.
While some publicly available food image datasets exist, only a few of them are annotated for image segmentation and detection tasks [9][10]. The annotation process for segmentation is particularly tedious and sensitive. Image segmentation datasets vary in terms of the geographic origin of the food and the methods used for image collection. Some datasets are annotated with only bounding boxes (UECFOOD100 [11], UECFOOD256 [12]), while others include polygon or mask annotations (MyFood Dataset [13], FoodSeg103 [14], UECFoodPixComplete [15]).
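The difference between the two annotation types can be illustrated with a small sketch. It assumes a mask stored as a 2D grid of 0/1 values and a bounding box given as inclusive pixel coordinates `(x_min, y_min, x_max, y_max)`; both conventions, and the helper names, are assumptions for illustration rather than the format of any dataset listed here.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) enclosing the nonzero pixels.

    Coordinates are inclusive pixel indices. Returns None for an empty mask.
    """
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def mask_area(mask):
    """Count the pixels that belong to the food region."""
    return sum(1 for row in mask for value in row if value)


def bbox_area(bbox):
    """Count the pixels covered by an inclusive bounding box."""
    x_min, y_min, x_max, y_max = bbox
    return (x_max - x_min + 1) * (y_max - y_min + 1)


# A small cross-shaped food region.
mask = [
    [0, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
bbox = mask_to_bbox(mask)
print(bbox, mask_area(mask), bbox_area(bbox))
```

A mask can always be reduced to a bounding box, but the reverse is not possible: in this toy case the box covers more pixels than the region itself, and the extra pixels would be counted as food by any downstream step that relied on the box alone.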
There are four main methods for collecting images for food image datasets. First, images can be captured in a standardized laboratory environment (e.g., UNIMIB2016 [16]), which ensures high-resolution and good-quality images. However, this method typically limits the number of collected images. Second, images can be downloaded from the internet, either from social networks [17] or search engines [13]. This approach allows for the collection of large numbers of images, but can also yield many non-food images that need to be filtered out. Downloaded images also vary in quality and may be blurry, low-resolution, retouched, or overlaid with text. Third, images can be collected directly from users [18], which provides a realistic representation of real-life scenarios. However, implementing this method can be challenging, as it requires a large number of users and an extended period to collect a substantial number of images. Finally, some datasets are built with images from other existing datasets. For instance, the UECFoodPixComplete [15] dataset was built by annotating UECFOOD100 [11] images. Likewise, Food201-Segmented is built from segmented Food-101 [19] images (see Table 1).
Table 1 lists the publicly available food image datasets for detection and segmentation tasks identified at the present stage of the researchers' investigation. These datasets are classified based on their main characteristics, such as their usage, number of classes, total number of images, method of image collection, and the origin of the dishes represented in the images. Notably, the available datasets mostly focus on Asian or Western foods, and there is currently no dataset available for African foods. As part of this work, the researchers propose CamerFood10, the first dataset specifically designed for image segmentation of African food.