Food and fluid intake monitoring is essential for reducing the risk of dehydration, malnutrition, and obesity. Existing research has focused predominantly on dietary monitoring, while fluid intake monitoring is often neglected. Food and fluid intake monitoring can be based on wearable sensors, environmental sensors, smart containers, or the collaborative use of multiple sensors. Vision-based intake monitoring methods have been widely exploited with the development of visual devices and computer vision algorithms. Vision-based methods provide non-intrusive solutions and have shown promising performance in food/beverage recognition and segmentation, human intake action detection and classification, and food volume/fluid amount estimation. However, occlusion, privacy, computational efficiency, and practicality pose significant challenges.
1. Introduction
Maintaining healthy food intake and adequate hydration is significant for humans’ physiological and physical health
[1][2][3].
The quality of food intake has been shown to be associated with the metabolic function of the human body
[4]. Unbalanced nutrition intake increases the risk of many diseases, including diabetes, obesity, cardiovascular disease, and certain cancers
[1][5]. When understanding human body dynamics associated with underweight, overweight, and obesity, it is important to objectively assess energy intake (EI); energy intake assessment is related to food type recognition, amount consumed estimation, and portion size estimation
[6]. Being underweight can result from energy expenditure exceeding energy intake over an extended period, which leads to health risks such as malnutrition and premature death
[7]. Overweight and obesity are associated with energy intake exceeding energy expenditure, leading to chronic diseases such as type 2 diabetes, cardiovascular diseases, cancers, and musculoskeletal disorders
[6][7][8]. A dietary assessment system could be used to monitor daily food intake and control eating habits by triggering a just-in-time intervention during energy intake to prevent health issues
[8].
Low-intake dehydration, caused by inadequate fluid intake, has endangered public health and is often underemphasised
Mild dehydration occurs commonly and increases the risk of chronic diseases
[11][12]. A notable example is the significant association between urolithiasis (kidney stones) and low daily water intake
[4][5]. Furthermore, dehydration is closely associated with disability, hospitalisation and mortality
[13] in hospitals
[14][15][16][17] and long-term care systems
[10][11][13][18]. In the hydration and outcome in older patients (HOOP) prospective cohort study of 200 older adults in a large UK teaching hospital, 37% of the participants admitted as emergencies were dehydrated, 7% of the participants died in the hospital, and 79% of those who died were dehydrated at admission
[15]. Dehydration and drinking status could also be related to children’s and adults’ attention and memory performance
[19].
2. Overview of Vision-Based Intake Monitoring
2.1. Active and Passive Methods
In vision-based methods, there are two approaches to capturing images: active and passive
[20][21]. Active methods require the user to take pictures and record their intake manually, while passive methods automatically access the food or fluid intake information. Active methods are widely used in practice. Traditionally, active food intake monitoring was in the form of food records, recalls, or questionnaires
[22]. For active fluid intake monitoring, a fluid balance chart is used as a self-reporting tool to identify a positive (fluid input higher than output) or negative (fluid output higher than input) balance in hospitals or nursing homes
[18][23]. A fluid balance chart includes information on the time, approach, and amount of body fluid input (oral, intravenous, etc.) and output (urine, tube, etc.), which can be completed by trained nurses, doctors, or patients themselves
[23].
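The structure of a fluid balance chart described above can be sketched as a simple record type. The field names and the net-balance helper below are illustrative, not a standard taken from the cited studies:

```python
from dataclasses import dataclass

@dataclass
class FluidRecord:
    time: str       # e.g. "08:30"
    route: str      # "oral", "intravenous", "urine", "tube", ...
    volume_ml: int  # recorded amount in millilitres
    direction: str  # "input" or "output"

def net_balance(records):
    """Positive result = fluid input exceeds output; negative = the reverse."""
    total_in = sum(r.volume_ml for r in records if r.direction == "input")
    total_out = sum(r.volume_ml for r in records if r.direction == "output")
    return total_in - total_out

chart = [
    FluidRecord("08:30", "oral", 250, "input"),
    FluidRecord("10:00", "intravenous", 500, "input"),
    FluidRecord("12:15", "urine", 400, "output"),
]
print(net_balance(chart))  # 350 -> positive balance
```

In practice, such records would be entered by trained staff or the patients themselves, as described above.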
With the development of cameras, images of meals and drinks are more commonly used for dietary monitoring. In vision-based monitoring methods, active methods are less widely seen than passive methods and mostly rely on mobile phone cameras. For example,
[24] proposed a food and nutrition measurement approach that analysed images taken by users before and after a meal, achieving up to 97.2% correct classification of food type with only 1% of nutrient information misreported.
Nevertheless, manually recording the intake information by writing, weighing, or triggering a camera can be time-consuming and burdensome
[21][25], hence not ideal for daily application. Moreover, self-reporting is not an option for patients with motor difficulties or older adults with cognitive decline. Therefore, in recent years, passive sensing methods with different devices and automatic strategies have predominated over traditional active methods.
2.2. Environmental Settings
The environmental settings were categorised into the free-living environment, pseudo-free-living environment, laboratory environment, and others. In free-living-based studies, the systems were assessed with sufficient data collected by sensors configured in the user’s natural living environment. The pseudo-free-living environment tried to replicate the user’s natural living environment in a laboratory. The controlled laboratory environment only covers the specific actions needed as input data for the system (e.g., biting or drinking), while the others include methods based on existing datasets without considering the experimental environment. A camera’s viewing angle can be either first-person or third-person. Notably, most third-person methods were only considered in a controlled testing environment and have not been validated in free-living scenarios. As for first-person cameras in free-living conditions, the highest accuracy achieved on food and non-food classification is 95%
[26][27], and on eating action detection
[28] was 89.68%. One recent free-living study reached an F1-score of 89% on drinking and eating episode detection but only 56.7% precision for fluid intake level estimation
[29]. Therefore, there is still a gap in harnessing cameras in free-living scenarios. Factors identified as affecting performance include unstable lighting conditions, occlusion, low frame rate, and motion blur.
2.3. Privacy Issue
Most papers investigating vision-based monitoring failed to discuss privacy issues, even though some concerns were evident, with participants’ faces and bodies shown in the figures of the papers
[30][31][32][33][34][35]. In active methods, cameras can be controlled manually to avoid taking images with privacy concerns, which is inconvenient and labour-demanding
[36][37]. Another approach seen in both active and passive methods was reviewing the photos after they were taken and deleting the ones with privacy concerns, which could also be time-consuming and burdensome
[28][38]. Hence, passive methods with strategies to eliminate privacy issues received the most consideration. Passive methods are more prone to privacy concerns because most passively captured images are not related solely to food or drink consumption
[20]. Therefore, in some designs, the intake action was detected by a smart watch or glasses, and the camera was only turned on when the eating or drinking episode was highly probable
[20][29][39]. In the survey of privacy concerns among users of AIM-2, the average level of concern was reduced from 5.0 to 1.9 when images were captured only during intake actions rather than continuously
[20]. However, this method requires the users to wear multiple sensors, which can be cosmetically unpleasant, uncomfortable, and intrusive, especially for the elderly or groups with disabilities
[37].
3. Viewing Angles and Devices in Monitoring Systems
In passive vision-based intake monitoring, a camera’s viewing angle can be either first-person or third-person. A first-person camera (egocentric camera/wearable camera) is typically attached to the human body, pointing out at the food or container. In contrast, a third-person camera is mounted in the living environment, pointing at the subject. Of the included papers that proposed a monitoring system or nutrition log application, 49 studies were based on first-person cameras, 39 on external third-person cameras, and 28 took advantage of the users’ smartphones. Notably, most of the phone-based methods were active, meaning the users needed to take the food/drink picture manually, e.g., the food/nutrition/dietary logs proposed in
[40][41][42].
Four tasks were identified in intake monitoring methods: binary classification, to distinguish food/drink intake from other activities; food/drink type classification, to detect the type of items consumed; food/fluid amount estimation, crucially related to energy intake; and intake action recognition, to recognise the human body movement. In the binary classification task, elements such as fingers, hands, containers, cutlery, and food can be detected, and different criteria can be set and followed as indications of an intake activity.
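As a toy illustration of the criterion-based binary classification described above, a frame can be flagged when a chosen set of detected elements co-occurs. The labels and the criterion set below are hypothetical, not taken from any cited system:

```python
def is_intake_candidate(detected_labels, criterion=frozenset({"hand", "container"})):
    """Flag a frame as a candidate intake activity when the detected object
    labels satisfy a chosen criterion set (labels and criterion are illustrative)."""
    return criterion <= set(detected_labels)

print(is_intake_candidate(["hand", "container", "table"]))  # True
print(is_intake_candidate(["hand", "phone"]))               # False
```

Real systems would derive the labels from an object detector and may combine several such criteria.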
Regarding the placement of devices, first-person cameras could take the form of glasses, watches, and pendants, while third-person cameras could be mounted on the ceiling for a top-down view or placed around the subjects. The selected devices vary across studies; the cameras seen in the existing papers were mainly RGB and RGB-D cameras, and RGB cameras were also used in combination with other non-vision sensors. The pattern of viewing angles and devices found in the papers is shown in Figure 1.
Figure 1. Pattern of the viewing angles and devices selected.
From the pattern of device selection shown in Figure 1, it is evident that RGB cameras were the most used, primarily as first-person cameras. In contrast, depth cameras were not used as first-person cameras and were barely used collaboratively with non-vision sensors. Moreover, no system covered all three sensor types: RGB camera, depth camera, and any of the non-vision sensors.
3.1. First-Person Approaches
As shown in
Figure 1, of the 49 first-person methods, 36 relied on RGB information alone. The remaining 13 used RGB cameras collaboratively with other non-vision sensors, including accelerometers
[20][43][44][45], gyroscopes, flex sensors
[20][43], load cells
[46], proximity sensors and IMUs
The most common technology setting is an inertial smartwatch and a wearable RGB camera
[26][27][39][47][48]. For example, Annapurna
[26][27][48] is a smartwatch with a built-in camera proposed for autonomous food recording. The inertial sensor of the watch was used for gesture recognition to identify the eating activity, and the camera then took only pictures likely to be useful. Thus, compared to methods with a camera constantly in operation, redundant images were reduced, as were storage requirements, privacy concerns, computation, and camera power consumption. However, one fundamental problem with an inertial smartwatch is that the intake action could be missed when the user is drinking with a straw or using the hand that is not wearing the watch. Unlike the approaches mentioned above, which mainly focus on food intake detection, one system monitored fluid intake by combining glasses, smartwatches, and phones
[39].
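The trigger-the-camera-on-candidate-gesture strategy shared by these systems can be sketched as a hysteresis rule over a stream of gesture probabilities: capture starts when the probability rises above an upper threshold and stops only when it falls below a lower one, avoiding flicker on short sips. The thresholds below are illustrative, not values from the cited papers:

```python
def capture_schedule(probs, on=0.8, off=0.4):
    """Hysteresis trigger: start capturing when intake probability rises above
    `on`, keep capturing until it drops below `off`."""
    capturing, schedule = False, []
    for p in probs:
        if not capturing and p >= on:
            capturing = True
        elif capturing and p < off:
            capturing = False
        schedule.append(capturing)
    return schedule

print(capture_schedule([0.1, 0.9, 0.6, 0.5, 0.3, 0.2]))
# [False, True, True, True, False, False]
```

The same scheme saves power and storage because the camera runs only inside detected episodes.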
Smart glasses are another form of wearable device. Automatic Ingestion Monitor Version 2 (AIM-2)
[20] was proposed with an accelerometer, a flex sensor and an RGB camera. However, in this design, images captured by the camera were only for validating the performance of other wearable sensors on intake detection; no visual methods were considered. FitByte
[29] was a glasses-based diet monitoring system that applied six IMUs for chewing and swallowing detection, a proximity sensor for hand-to-mouth gesture detection, and an RGB camera pointing downward to capture potential food images. Both eating and drinking episodes were detected in this design. However, only 56.7% precision was achieved in fluid intake detection, versus 92.8% for food intake. FitByte performed significantly worse in fluid intake detection because it was only sensitive to simple, continuous drinking scenarios, rather than short sips or drinking that coincided with other unrelated activities.
Assessments have been made of the efficiency of first-person cameras for dietary monitoring. Thomaz et al. (2013) proposed and evaluated a dietary monitoring system based on a neck-worn camera and human computation. Images were taken by the camera every thirty seconds and sent to Amazon’s Mechanical Turk (AMT), a platform providing human intelligence labour, where food was identified by human workers. This design resulted in 89.68% accuracy in identifying eating activities
[28].
The feasibility evaluations mentioned above revealed the limitations of utilising first-person cameras for passive dietary monitoring. The first was occlusion of the view: for example, if the image did not provide a complete observation of the food, the estimation accuracy of portion size could be low
[49]. The uncertainty of wear time, battery sustainability, and noncompliance in wearing the camera were further problems, especially for older adults or patients with cognitive decline. As for image acquisition, dark and blurry images obtained in poor lighting conditions could make classification difficult.
In summary, the common system architecture of first-person methods was combining one first-person RGB camera with other sensors. Cameras can be placed in the form of smartwatches
[26][27], glasses
[20][29], or even caps
[47]. Combining cameras with other sensors can reduce the energy consumption of cameras, extend the use time of batteries, save storage space and rule out privacy concerns by turning the camera on only when a candidate movement is detected
[26][27][48]. One fundamental limitation of inertial smartwatches is that the intake action could be missed when the user drinks with a straw or uses the contralateral hand not wearing the watch. The inconvenience of wearable devices was another limitation.
3.2. Third-Person Approaches
Compared to first-person cameras, third-person cameras have the advantage of being non-intrusive to the user
[50]. The placement of cameras is one of the primary issues to consider. Most research used only a single camera in one position, placed on the ceiling for a top-down view
[51][52][53], or placed pointing at the subject with a fixed distance from 0.6 m to 2 m
[21][30][31][54][55]. Multiple cameras could be placed around the subject to provide different viewing angles and compensate for possible occlusion, achieving a more robust system
[32][56][57].
In third-person methods, RGB and depth information can be used individually or collaboratively for action detection. Specifically, 17 papers used RGB information only, nine were with depth information from an RGB-D camera, and seven were based on the fusion of RGB and depth information, as seen in
Figure 2. Unlike first-person cameras, non-vision sensors are used less frequently with third-person cameras. The main reason is that the kinematic or distance information provided by IMUs and proximity sensors can also be obtained from the visual information of the third-person camera
[30][55].
Microsoft Kinect was predominantly adopted in existing research; it can work day and night, with the infrared sensor generating depth images and the skeleton-tracking toolkit providing joint coordinates
[55][58]. The effectiveness of MS Kinect was tested for detecting the eating behaviour of older adults by placing the camera in front of the subject, resulting in an average success rate of 89%
[30].
Regarding privacy and image data concerns, some studies used only the depth information from RGB-D cameras. For example, Kinect skeletal tracking was used for counting bites by tracking the jaw face point
[59] or the wrist roll joint of users based on depth information, achieving an overall accuracy of around 94%
[55]. A system with an average accuracy of 96.2% was proposed, relying on the depth information of wrist-joint and elbow-joint motion obtained by a Kinect camera. However, although the study was presented for free-living calorie intake monitoring, only one camera position was tested, and no occlusion problem was considered
[60]. The fusion of depth and RGB information was another option, with the depth information used for skeleton definition and body-movement tracking while the RGB data were used for detecting specific intake-related objects
[52].
RGB cameras were also popular third-person devices in intake monitoring. They can be embedded in the ceiling, pointing down
[53], or put on the dining table, pointing at the subject
[31]. The fusion of RGB and depth information has the potential to reach higher accuracy than a single modality. An example can be seen in
[50], where an adapted version of the self-organising map algorithm was applied to the skeleton model obtained from depth information for movement tracking, while the RGB stream was used for recognising eating-related items such as a glass. This method achieved a 98.3% overall accuracy. All RGB-D cameras were used as third-person cameras (as seen in
Figure 1).
4. Algorithms by Task
4.1. Binary Classification
Eliminating unrelated images was a preliminary step for identifying candidate intake activities. This was commonly formulated as a binary classification approach to distinguish food/drink from other objects, or to identify and delete low-quality images. For example, to distinguish sharp images from blurry ones for adequate image quality, the Fast Fourier Transform (FFT) of each image was computed to analyse sharpness, resulting in a 10–15% misclassification rate
[43].
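A minimal sketch of such FFT-based sharpness screening, assuming the common heuristic that blurry images concentrate spectral energy at low frequencies (the cutoff value and test images are illustrative, not from the cited work):

```python
import numpy as np

def high_freq_energy_ratio(img, cutoff=0.1):
    """Fraction of spectral energy outside a central low-frequency disc;
    blurry images concentrate energy at low frequencies and so score lower."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    low = radius < cutoff * min(h, w)
    return spectrum[~low].sum() / spectrum.sum()

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))
# Crude box blur: average each pixel with its four neighbours.
blurred = (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, -1, 0)
           + np.roll(sharp, 1, 1) + np.roll(sharp, -1, 1)) / 5
print(high_freq_energy_ratio(sharp) > high_freq_energy_ratio(blurred))  # True
```

A sharpness threshold on this ratio then separates usable frames from blurry ones.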
Im2Calories was a food intake monitoring system proposed in 2015, in which a GoogLeNet CNN was trained on a modified Food101 dataset. One of the tasks for Im2Calories was to determine whether an image was related to a meal, at which an accuracy of 99.02% was achieved
[61].
The GoogLeNet in Im2Calories
[61] was tuned on a Titan X GPU with 12 GB of memory and then implemented in an Android app of less than 40 MB, which could classify an image within one second. iLog could also operate on edge-level, low-performance computing platforms, such as mobile phones, sensors, and single-board computers
[62]. Apart from the networks mentioned above, for real-time and portable monitoring, a derived MobileNet was proposed and implemented on a Cortex-M7 microcontroller for dietary image capture, achieving an average precision of 82% in identifying food-related images
[63]. The training was conducted on Google Colab using 400 food images and 400 non-food images, taking up to 5.5 h, while only 761.99 KB of flash and 501.76 KB of RAM were needed to implement this algorithm.
Annapurna was a multimodal system with a camera mounted on an inertial smartwatch for dietary recording
[26][27]. In this design, the camera was only switched on when the watch detected an intake action. A mobile phone was first used as a lightweight computing platform to eliminate images with human faces or blurred edges. Then, the remaining 37% of images, which contained food items, were transferred to a server for further processing, where the Clarifai API was used to identify the presence of food items in pictures based on a CNN, and a depth map was created to detect food too far from the camera (considered unrelated to the meal). As a result, 95% of the meals could be recalled by the proposed system in a free-living environment. For Annapurna, computation was performed first on mobile phones to remove blank, blurry, and misleading images and reduce the runtime of further computing. However, the latency was around 0.9 s for the smartwatch to capture an image, which limited the response speed of the whole system
[27].
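Annapurna’s staged filtering idea, running cheap checks on the phone before server-side analysis, can be sketched generically. The predicates below are placeholders standing in for the actual blur and face detectors, and the image records are synthetic:

```python
# Hypothetical staged filter: cheap checks run on the phone; only the
# surviving images are uploaded for heavier server-side recognition.
def run_pipeline(images, stages):
    surviving = images
    for stage in stages:
        surviving = [img for img in surviving if stage(img)]
    return surviving

# Placeholder predicates standing in for blur detection and face removal.
def is_not_blurry(img):
    return img.get("sharp", False)

def has_no_face(img):
    return not img.get("face", False)

images = [
    {"id": 1, "sharp": True,  "face": False},
    {"id": 2, "sharp": False, "face": False},  # dropped: blurry
    {"id": 3, "sharp": True,  "face": True},   # dropped: privacy
]
kept = run_pipeline(images, [is_not_blurry, has_no_face])
print([img["id"] for img in kept])  # [1]
```

Ordering the cheapest filters first minimises both upload volume and server load, which is the design rationale described above.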
The server-side service used in Annapurna
[26][27], the Clarifai API, was also used in
[37], where it generated tag outputs (e.g., ‘food’, ‘car’, ‘dish’) for an input image to determine whether the image was food-related. This method was tested on both Food-5K and e-Button and reached a specificity of 87% on Food-5K (created in
[64]), higher than the results on e-Button. This was because e-Button was an egocentric free-living dataset with 17.7% blurred images, complex backgrounds, and more diverse objects. According to the authors, although the burden of manually observing and recording dietary activities in previous work
[65] was reduced, the effectiveness of automatic monitoring was still limited due to the quality of the captured images.
Only a limited number of papers addressed binary classification of fluids/drinks/beverages. An example covering both food and fluid was
[66], which trained a YOLOv5 network to detect and localise food and beverage items. The study aimed to distinguish food and beverages from other objects and added ‘screen’ and ‘person’ as extra classes. As a result, an overall mean average precision of 80.6% was achieved for classifying these four classes, which was still far from practical use. NutriNet was another deep neural network proposed for both food and beverages; the detection model’s output was either ‘food/drink’ or ‘other’
[67].
4.2. Food/Drink Type Classification
ML methods included support vector machines (SVM), principal component analysis (PCA) [62][68][69], K-means classifiers, random forest (RF), fully connected neural networks (NN), artificial neural networks (ANN) [70], and image matching and retrieval methods such as content-based image retrieval (CBIR) [71], dynamic time warping (DTW) [72], and bag of features (BoF) [33][73][74], which clusters features into visual words. Image features could be extracted by methods including speeded-up robust features (SURF) and scale-invariant feature transform (SIFT). Among these methods, SVM was the most common and was often used collaboratively with DL methods. Various networks were seen in DL methods, such as GoogLeNet [61][64][67][70][72][75][76][77][78], MobileNetV2 [68][69], AlexNet [67][75][76], Inception-V3 [72][78][79][80][81], NutriNet [67][76][82], K-foodNet, very deep convolutional neural networks [75], DenseNet161 [72], fully convolutional networks (FCN) [76][82][83][84], YOLO [83][85][86], extreme learning machines (ELM) [87][88][89], neural trees [87], graph convolutional networks (GCN) [90], the deep learning PDE model (DPM) [73], SibNet [91], VGG16 or VGG365 [72][75][84][92], ResNet, ResNet50, and ResNet152 [67][70][72][75][76][78][79][80][93][94], EfficientNet [95], EfficientDet [96], and Faster R-CNN [39][80][94]. GoogLeNet and ResNet, with their variants, were the most popular. Apart from learning-based methods, other approaches, including the region-growing algorithm [68], the mean-shift algorithm [33], template matching [97][98], and other image-processing algorithms, were also used for image segmentation, recognition, and even amount estimation.
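As an illustration of the bag-of-features (BoF) idea from the list above, local descriptors are clustered into visual words and each image is then represented by a word histogram. The sketch below uses random stand-in descriptors and scikit-learn’s KMeans rather than real SURF/SIFT features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in local descriptors (SURF/SIFT would give 64-/128-D vectors per keypoint).
descriptors = rng.random((200, 16))

# Build a "visual vocabulary" by clustering descriptors into k visual words.
k = 8
vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

def bof_histogram(image_descriptors):
    """Represent an image as a normalised histogram of visual-word counts."""
    words = vocab.predict(image_descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

h = bof_histogram(rng.random((50, 16)))
print(h.shape)  # (8,)
```

The resulting fixed-length histograms can then be fed to any of the classifiers listed above, e.g. an SVM.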
An early study harnessed an SVM with a Gaussian radial basis kernel to train a food-type classifier and achieved an accuracy of 97.3% when the training data comprised 50% of the dataset [24]. Notably, only 1% of nutrient information was misreported in the study. However, there was only one food item in each image, so the robustness of the proposed algorithm could be limited when tested on images with multiple food items or complex backgrounds [24].
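A minimal sketch of the Gaussian (RBF) kernel SVM approach, using scikit-learn’s SVC on synthetic two-class feature vectors; the features and class structure are illustrative, not the cited study’s data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Toy stand-in for colour/texture feature vectors of two food classes.
X0 = rng.normal(0.0, 0.3, size=(40, 8))
X1 = rng.normal(1.0, 0.3, size=(40, 8))
X = np.vstack([X0, X1])
y = np.array([0] * 40 + [1] * 40)

# SVM with a Gaussian radial basis kernel, as in the early classifier above.
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.score(X, y))
```

In the cited setup, half of the labelled images would form the training split and the rest would be held out for evaluation.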
For automatic and larger-scale image analysis, computer vision algorithms were used in later research. CNNs trained on labelled image data provided another method for food classification. Im2Calories, mentioned in the last section, is an example of a GoogLeNet CNN being trained with datasets created from existing datasets online: it trained the GoogLeNet with a self-made multi-label dataset and achieved an average precision of 80% [61]. In Ref. [64], Food-11 was created for training, validation, and evaluation, resulting in the recognition of 83.6% of food categories.
Deep neural networks were most likely to achieve extremely high performance (over 99% accuracy) in classification and recognition tasks. The networks could be used for both food and fluid classification, in which Inception ResNet V2, ResNet50 V2, ResNet152, MobileNet V2 and V3, and GoogLeNet achieved over 95% accuracy. Apart from deep neural networks, machine learning methods such as RF, SVM, and KNN could also reach over 90% accuracy. However, DL methods can require high-performance devices and be time-consuming to train, and model performance relies on a sufficient amount and variety of training data. The value of a deep neural network lies in the trade-off between its performance and simplicity.
4.3. Intake Action Recognition
The process of an intake activity can be segmented into preparing, delivering, and swallowing phases, where the preparing phase includes grasping a container and delivering refers to lifting the hand to the mouth. Most methods took the observation of food or fluid in human hands as a representation of intake, which turned the action recognition problem into a simple object detection problem. However, taking the presence of food/drink objects as the representation of intake activities yields a high false-positive rate.
Most action detection tasks depended on third-person rather than first-person cameras, and among those, depth cameras were more popular than RGB cameras. Microsoft Kinect was the most used device; its SDK can provide skeleton tracking of 25 joints per body for up to six people at distances from 0.8 to 4 m, as well as six stream types, including depth, infrared, colour, skeleton, and audio
[51][55]. As for hardware settings,
[51] was tested on a computer running Windows 8 with an Intel Core i5 processor and 8 GB of RAM.
Starting with third-person methods, RGB information was first used for intake detection before the development of depth cameras. One example was a method based on fuzzy vector quantization proposed in 2012, in which activities were considered as 3D volumes formed by a sequence of human poses
[54]. Fuzzy vector quantization was used to associate the 3D volume representation of an activity video with 3D volume prototypes; linear discriminant analysis was then used to map activity representations into a low-dimensional discriminant feature space.
The Naive Bayes classifier was first used with Kinect in
[3] in 2013 to classify the input images for patient fluid intake monitoring. The performance was tested with different positions of the subject and partial occlusions of the camera. However, a limitation of this method was that the Naive Bayes classifier was only applicable to a relatively small dataset and test case, so its effectiveness in large-scale free-living scenarios remained to be validated. Moreover, the experimental test set was insufficient, with only three replications and 10 s of data for each
[3].
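A Gaussian Naive Bayes classifier of the kind described can be sketched with scikit-learn on synthetic frame features; the feature choice (e.g., hand height and hand-to-head distance) and values are illustrative only:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
# Toy frame features for two classes: drinking vs. idle.
drinking = rng.normal([0.1, 0.1], 0.05, size=(30, 2))
idle = rng.normal([0.5, 0.6], 0.05, size=(30, 2))
X = np.vstack([drinking, idle])
y = np.array([1] * 30 + [0] * 30)

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.12, 0.08]])[0])  # 1 -> drinking
```

Naive Bayes fits small datasets like this well, which matches the limitation noted above: its simplicity makes large-scale free-living generalisation uncertain.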
In later years, more information from the camera, rather than a single indicator, was used for more accurate detection. In
[30], the skeleton coordinates from the depth image provided by Kinect were used to analyse the movements of drinking soup, drinking water, and eating the main course. The distances from both hands to the head and from the plate to the head were used as features for classifying the gestures, resulting in an 89% average success rate for three subjects
[30]. However, no algorithm was presented in the study, no occlusion was considered during the test, and only three subjects were observed and evaluated, which could introduce bias from personal dietary habits. Despite these limitations, the study validated the feasibility of using the distances between hands, head, and plate for intake monitoring.
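The hand-to-head and plate-to-head distance cues used in such studies can be sketched as a simple rule over 3-D joint coordinates from a skeleton tracker; the threshold and coordinates below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def intake_gesture(head, hand, plate, near_mouth=0.15):
    """Flag a candidate intake moment when a hand comes close to the head
    while the plate stays farther away. Joints are 3-D coordinates in metres."""
    hand_head = np.linalg.norm(hand - head)
    plate_head = np.linalg.norm(plate - head)
    return bool(hand_head < near_mouth and plate_head > hand_head)

head = np.array([0.0, 1.6, 2.0])
hand_near = np.array([0.05, 1.55, 2.0])
plate = np.array([0.0, 1.0, 1.8])
print(intake_gesture(head, hand_near, plate))  # True
```

Thresholding such distances frame by frame yields candidate gesture segments that a classifier can then label.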
4.4. Intake Amount Estimation
The studies mentioned above were mostly related to intake detection and classification rather than intake amount estimation. However, food volume estimation is another problem to be considered, which is the ‘how much’ problem
Meal estimation could be realised based on the respective numbers of intake gestures for consuming liquid, soup, and a meal
[34], but the accuracy was not evaluated. Volume estimation based on 3D reconstruction algorithms from images taken by phone was seen in
[103], resulting in less than 0.02 inches of absolute error for radius estimation (for radii ranging from 0.8 to 1.45 inches). Im2Calories was another example: it first predicted the distance of each pixel from the camera using a CNN trained on the NYUv2 RGB-D dataset, resulting in an average relative error of 0.18 m, which was too high for precise volume estimation
[61]. Then, the depth map was converted to voxel representation for food size estimation, resulting in less than 400 mL absolute volume error
[61].
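A simplified version of the depth-to-volume idea (not Im2Calories’ actual voxel pipeline) integrates a per-pixel food height map over the pixel footprint area; the dimensions below are a made-up worked example:

```python
import numpy as np

def volume_from_height_map(height_m, pixel_size_m):
    """Integrate a per-pixel height map (food surface minus table plane)
    into a volume: each pixel contributes height x pixel area."""
    return float(np.clip(height_m, 0, None).sum() * pixel_size_m ** 2)

# Toy example: a 10 cm x 10 cm x 4 cm block sampled at 1 mm pixels.
h = np.full((100, 100), 0.04)
vol_m3 = volume_from_height_map(h, pixel_size_m=0.001)
print(round(vol_m3 * 1e6))  # 400 (millilitres)
```

The depth-estimation error quoted above propagates directly into the height map, which is why an accurate depth prediction is critical for this class of methods.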
As for fluid amount estimation, a design called ‘Playful Bottle’ was proposed, which combined the camera and accelerometer on the phone to realise fluid intake tracking and reminding
[104]. The accelerometer was used for drinking action detection, in which case 21.1% of false-positive detections could be caused by shaking the bottle without actually drinking from it.
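A duration-based check of the kind that could separate a sustained drinking tilt from a brief bottle shake can be sketched as follows; the tilt threshold, sampling rate, and signals are illustrative assumptions, not the Playful Bottle design:

```python
import numpy as np

def likely_drink(tilt_deg, min_tilt=45.0, min_duration_s=1.0, dt=0.1):
    """Treat a tilt as a drink only if it stays above `min_tilt` degrees
    for at least `min_duration_s`; brief shakes fail the duration check."""
    window = int(min_duration_s / dt)
    above = (np.asarray(tilt_deg) >= min_tilt).astype(int)
    runs = np.convolve(above, np.ones(window, dtype=int), mode="valid")
    return bool((runs >= window).any())

drink = [10] * 5 + [60] * 15 + [10] * 5   # sustained 1.5 s tilt
shake = [10] * 10 + [70] * 3 + [10] * 12  # brief 0.3 s spike
print(likely_drink(drink), likely_drink(shake))  # True False
```

Filters of this kind are one way to reduce the shaking-induced false positives reported for accelerometer-only detection.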