Vision-Autocorrect: Comparison

The last two years have seen a rapid rise in the amount of time that both adults and children spend on screens, driven by the recent COVID-19 pandemic. A key adverse effect is digital eye strain (DES). Recent trends in human-computer interaction and user experience have proposed voice- or gesture-guided designs that offer more effective and less intrusive automated solutions. These approaches inspired the design of a solution that uses facial expression recognition (FER) techniques to detect DES and autonomously adapt the application to enhance the user’s experience.

  • digital eye strain
  • facial expression recognition
  • software self-adaptation

1. Introduction

Globally, media outlets have highlighted the increased amount of time that adults and children are spending on digital screens. For example, in the United Kingdom, it was reported that people spend up to three and a half hours each day looking at TV screens, four hours staring at laptops, and two hours on mobile phones [1]. This figure almost doubled when stay-at-home measures were enforced during the COVID-19 pandemic, with adults spending an average of six and a half hours each day in front of a screen [2][3]. The average American teen spends 7 h and 22 min on a screen outside of their regular schoolwork [4]. Today, many learning and work-related activities have moved online, forcing people to spend more and more time on screens. Staring at a screen for long hours each day can result in dry eyes or eye strain, gradually contributing to permanent eye problems such as myopia [3].
The increased use of the internet for research requires users to navigate web pages designed with different fonts, font sizes, font colors, and background colors. These websites do not always meet the basic requirements of visual ergonomics. The online eLearning trend has teachers posting materials on learning management systems (LMS) without paying much attention to the content’s appearance. Content developers use their discretion to choose fonts and backgrounds. This freedom often leads to the publishing of content that is difficult to read, even for users with no visual disabilities. As a user navigates through online content, the differences in content presentation introduce temporary visual challenges. Users strain their eyes as they try to adjust to different display settings. Similarly, the user’s environment can temporarily influence their ability to read information on a screen, for example, where the lighting in a room is poor or where the screen is too close or too far.
A popular approach for ensuring that technology addresses user disabilities is assistive technology, which calls for specialized products that partly compensate for the loss of autonomy experienced by disabled users. Here, the user is required to acquire a new product or adapt an existing one using available tools before use. Where user requirements are not known a priori or change dynamically, the approach is ineffective because it forces redeployment or reconstruction of the system [5]. Additionally, disabilities vary widely in severity, and mild or undiagnosed disabilities often go unsupported. Further, persons with mild disabilities tend to shun assistive technology because it underlines the disability, is associated with dependence, and degrades the user’s image [6], thus impairing social acceptance. The net result is that many users have become accustomed to squinting or glaring to change the focus of items on the screen. Some users will move closer to or further from the screen depending on whether they are myopic or hyperopic. In such cases, the burden of adapting to the technology falls on the user’s behavior. This approach can present further health challenges to the user, such as damaging their posture.

2. Facial Expression Recognition

Recent advances in image-recognition algorithms have made it possible to detect facial expressions such as happiness, sadness, anger, or fear [7][8], with several reviews also touching on the subject [9][10]. Such initiatives find applications in detecting consumer satisfaction with a product [11][12][13] or in healthcare to diagnose certain health issues [14][15] such as autism or stroke. Recently, FER has found applications in detecting fatigue, which can be dangerous, especially for drivers [16][17][18]. One of the few studies identified used the blink rate and sclera area color to detect DES using a Raspberry Pi camera [19]. The research, however, reported poor results for certain skin tones or where there was little light-intensity difference between the sclera and the skin region. Additionally, users with spectacles generated reflections that interfered with color detection. Therefore, an approach that does not rely on color would address these limitations. Eye tracking is becoming one of the most used sensor modalities in affective computing for monitoring fatigue. The eye tracker in such experiments also captures additional information, such as blink frequency and pupil diameter changes [20]. A typical eye tracker (such as video-oculography) consists of a video camera that records the movements of the eyes and a computer that saves and analyzes the gaze data [21]. Monitoring fatigue using this approach differs from monitoring basic facial emotions (anger, contempt, disgust, fear, happiness, sadness, surprise) because specific facial points are monitored, such as the percentage of eye closure (PERCLOS), head nodding, head orientation, eye blink rate, eye gaze direction, saccadic movement, or eye color. However, fatigue is also expressed through a combination of other facial expressions, such as yawning or placing hands on the face.
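To make the eye-monitoring measures mentioned above more concrete, the sketch below computes the eye aspect ratio (EAR) and a PERCLOS-style closure percentage from pre-detected eye landmarks. This is a minimal illustration under stated assumptions, not the cited studies’ implementations: the six-point landmark layout follows a common blink-detection formulation, the 0.2 closure threshold is arbitrary, and the landmarks themselves would come from a separate face-landmark detector.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """EAR from six (x, y) eye landmarks ordered around the eye contour:
    indices 0 and 3 are the eye corners, 1/5 and 2/4 are upper/lower lid pairs."""
    eye = np.asarray(eye, dtype=float)
    vert1 = np.linalg.norm(eye[1] - eye[5])   # first upper/lower lid distance
    vert2 = np.linalg.norm(eye[2] - eye[4])   # second upper/lower lid distance
    horiz = np.linalg.norm(eye[0] - eye[3])   # eye-corner distance
    return (vert1 + vert2) / (2.0 * horiz)

def perclos(ear_sequence, closed_threshold=0.2):
    """Fraction of frames in which the eye is considered closed
    (EAR below an assumed threshold) -- a PERCLOS-style measure."""
    ears = np.asarray(ear_sequence, dtype=float)
    return float(np.mean(ears < closed_threshold))

# Hypothetical usage with landmarks produced by any face-landmark detector:
# ear_per_frame = [eye_aspect_ratio(frame["left_eye"]) for frame in video_landmarks]
# fatigue_score = perclos(ear_per_frame)
```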

3. Machine Learning Techniques for FER

Recent studies on facial expression recognition [22] acknowledge that machine learning plays a major role in automated facial expression recognition, with deep learning algorithms achieving state-of-the-art performance on a variety of FER tasks.

3.1. FER Datasets

Studies using relatively limited datasets are constrained by poor representation of certain facial expressions, age groups, or ethnic backgrounds. To address this, the authors in [23] recommend using large datasets. In their review of FER studies, they note that the Cohn–Kanade AU-Coded Face Expression Database (Cohn–Kanade) [24] is the most used database for FER. A more recent review [9] introduced newer datasets such as the Extended Cohn–Kanade (CK+) database [25], which they noted was still the most extensively used laboratory-controlled database for evaluating FER systems. It has 593 images compared to the original version, which only had 210. Another notable dataset introduced was FER2013 [26], a large-scale and unconstrained database collected automatically through the Google image search API. The dataset contains 35,887 images extracted from real-life scenarios. The review [9] noted that data bias and inconsistent annotations are common across facial expression datasets due to different collection conditions and the subjectiveness of annotation. Because researchers evaluate algorithms using specific datasets, the same results cannot be replicated on unseen test data. Therefore, using a large dataset on its own is not sufficient. It is helpful to merge data from several datasets to ensure generalizability. Additionally, when some datasets exhibit class imbalance, the balance should be addressed during preprocessing by augmenting the data with data from other datasets. These findings motivated the researchers’ decision to use more than one dataset as well as large datasets. The researchers used images from the CK+ and FER2013 datasets and conducted class balancing during preprocessing. Notably, most FER datasets have images labeled with the seven basic emotions (disgust, fear, joy, surprise, sadness, anger, and neutral). Therefore, preprocessing this study’s data called for re-labeling the images to represent digital eye strain expressions such as squint, glare, and fatigue. This exercise called for manually reviewing the images and identifying those that fell into each class. The researchers assigned the images a new label representing the digital eye strain expressions; for instance, fatigue was labeled 1, whereas glare was labeled 2. To do this, the original images were rebuilt from the FER2013 dataset of pixels to enable the researchers to see the expressions. Once the labeling process was complete, a new dataset of pixels was generated with the new labels. By automatically detecting these facial expressions and autonomously adjusting font sizes or screen contrast, the user does not need to glare or squint to accommodate the screen settings. This also creates awareness for the user when an alert occurs, and they can take a break to address fatigue.
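The sketch below illustrates, in Python, the kind of workflow described above: rebuilding viewable images from the FER2013 pixel strings so they can be manually reviewed, attaching new eye-strain labels, and balancing classes by simple oversampling. The column names, the hypothetical strain_labels.csv file, and the oversampling strategy are assumptions for illustration; in the study the label assignment itself was done by manual review.

```python
import os
import numpy as np
import pandas as pd
from PIL import Image

# FER2013 stores each 48x48 grayscale face as a space-separated pixel string.
# Rebuilding the images lets a reviewer inspect them and assign new
# eye-strain labels (e.g., 1 = fatigue, 2 = glare) by hand, as described above.
def rebuild_images(csv_path="fer2013.csv", out_dir="fer_images"):
    os.makedirs(out_dir, exist_ok=True)
    df = pd.read_csv(csv_path)
    for idx, row in df.iterrows():
        pixels = np.array(row["pixels"].split(), dtype=np.uint8).reshape(48, 48)
        Image.fromarray(pixels, mode="L").save(os.path.join(out_dir, f"{idx}.png"))

# After manual review, a hypothetical CSV with columns "image_id" and
# "strain_label" replaces the original emotion labels.
def attach_new_labels(csv_path="fer2013.csv", labels_path="strain_labels.csv"):
    data = pd.read_csv(csv_path).reset_index().rename(columns={"index": "image_id"})
    labels = pd.read_csv(labels_path)
    return data.merge(labels, on="image_id")[["strain_label", "pixels"]]

# Naive class balancing by randomly oversampling minority classes.
def oversample(df, label_col="strain_label", seed=0):
    target = df[label_col].value_counts().max()
    parts = [grp.sample(target, replace=True, random_state=seed)
             for _, grp in df.groupby(label_col)]
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
```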

3.2. Image Processing and Feature Extraction

Image processing refers to the enhancement of pictures for ease of interpretation [27]. Common image-processing activities include adjusting pixel values, adjusting image colors, and binarization [28]. Several image-processing libraries that support these operations exist, such as OpenCV [29] and scikit-image [30]. They integrate easily with popular open-source machine-learning tools such as Python and R. After the image is preprocessed, feature extraction reduces the initial set of raw image data to a more manageable size for classification purposes. Previous FER reviews [31] describe action unit (AU) analysis and facial point (FP) analysis as two key methods used for feature extraction of classic facial emotions. Action units find applications when analyzing the entire face.
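A minimal preprocessing sketch with OpenCV is shown below, assuming face images are available as files on disk. The grayscale conversion, resizing to 48 × 48, histogram equalization, 0–1 scaling, and fixed-threshold binarization are typical examples of the operations mentioned above, not the exact pipeline used in the study.

```python
import cv2
import numpy as np

def preprocess_face(image_path, size=(48, 48)):
    """Typical FER preprocessing: grayscale, resize, contrast enhancement,
    and scaling pixel values to the 0-1 range."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # drop color information
    img = cv2.resize(img, size)                          # fixed input size for the classifier
    img = cv2.equalizeHist(img)                          # reduce lighting differences
    return img.astype(np.float32) / 255.0                # normalize pixel values

def binarize(gray, threshold=127):
    """Simple fixed-threshold binarization, one of the operations mentioned above."""
    _, binary = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    return binary
```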

3.3. FER Algorithms

When the eyes squint, several things occur: the pupils get smaller as they converge, the eyelids pull together, and the edges of the eyelids fold to contract the cornea [32]. Sometimes the eyebrows bend inwards, and the nose bridge moves upwards to enhance the eyes’ focus. FER techniques can detect these expressions and alert the user or adjust text sizes and color contrasts in an application to relieve eye strain. The FER process generally involves the acquisition of a facial image, extracting features useful in detecting the expression, and analyzing the image to recognize the expression [23]. Machine learning algorithms such as deep neural networks successfully perform FER. A popular algorithm, according to recent FER reviews [9][10], is the convolutional neural network (CNN), which achieves better accuracy with big data [33]. It extracts features more effectively than deep belief networks, especially for expressions of classic emotions such as contempt, fear, and sadness [9]. The results of these studies inspired the choice of CNN as the algorithm for implementing FER in this study [5]. However, it is worth noting that these results depend on the specific dataset used. For instance, the models that yielded the best accuracy on the FER2013 dataset are Ensemble CNNs with an accuracy of 76.82% [26], Local Learning Deep+BOW with an accuracy of 75.42% [34], and LHC-Net with an accuracy of 74.42% [35]. The models that yielded the best accuracy on the CK+ dataset include ViT + SE with an accuracy of 99.8% [36], FAN with an accuracy of 99.7% [37], and nonlinear evaluation on SL + SSL puzzling with an accuracy of 98.23% [38]. Sequential forward selection yielded the best accuracy on the CK dataset, with 88.7% accuracy [39]. The highest-performing models on the AffectNet dataset are EmotionGCN with an accuracy of 66.46% [40], EmoAffectNet with an accuracy of 66.36% [41], and Multi-task EfficientNet-B2 with an accuracy of 66.29% [42]. Although numerous datasets exist for facial expression recognition, this study sought to detect expressions outside of the classic emotions. The absence of labeled datasets in this area called for relabeling of images. The choice of the dataset for relabeling the images was not crucial. Future research should seek to relabel images from larger datasets such as AffectNet.
With CNNs, deeper networks with a larger width, depth, or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturates [43]. Adding dropout layers increases accuracy by preventing weights from converging at the same position; the key idea is to randomly drop units (along with their connections) from the neural network during training, which prevents units from co-adapting too much [44]. Adding batch normalization layers increases test accuracy by normalizing the inputs to each layer, addressing the internal covariate shift that occurs when input distributions change during training [45]. Pooling layers included in models decrease each frame’s spatial size, reducing the computational cost of deep learning frameworks. The pooling operation usually picks the maximum value in each slice of the image [46]. A summary of this process is depicted in Figure 1.
Figure 1. Distribution of facial expressions.
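As a concrete illustration of how the layer types discussed above (convolution, max pooling, batch normalization, and dropout) fit together, the following is a minimal Keras sketch of a small FER classifier. The layer counts, filter sizes, and the three-class output (e.g., squint, glare, fatigue) are illustrative assumptions rather than the study’s reported architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_fer_cnn(input_shape=(48, 48, 1), num_classes=3):
    """Small illustrative CNN: convolution blocks with batch normalization
    and max pooling, followed by dropout to limit co-adaptation of units."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),            # normalize layer inputs during training
        layers.MaxPooling2D(),                  # halve spatial size, cut computation
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                    # randomly drop units during training
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```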
Popular CNN designs are based on the theoretical foundations laid by AlexNet [47], VGG [48], and ResNet [49]. AlexNet uses the ReLU (rectified linear unit), given by f(x) = max(0, x), for the non-linear part instead of a tanh or sigmoid function, which speeds up training and mitigates the vanishing gradient problem. AlexNet also reduces overfitting by using dropout layers [50]. These choices present fewer computational requirements than VGGNets. VGGNets advocate for multiple stacked smaller-size kernels rather than one larger kernel because this increases the depth of the network [51], which enables it to learn more complex features. Increasing the depth, however, introduces other challenges, such as the vanishing gradient problem and higher training error values. ResNets address these challenges by introducing a global average pooling layer [52] and residual modules [53]. This reduces the number of parameters and improves the learning of the earlier layers.
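To make the ResNet ideas above concrete, the sketch below shows a simplified residual module with an identity shortcut and a global average pooling classification head, written with Keras functional-style layers. It is an illustrative simplification, not the published ResNet design; the filter count and three-class head are assumptions, and the shortcut assumes the input already has the same number of channels as the convolution output.

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Identity-shortcut residual module: the input is added back to the
    output of two convolutions, easing gradient flow in deep networks.
    Assumes x already has `filters` channels so the shapes match."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])      # skip connection
    return layers.Activation("relu")(y)

def classification_head(x, num_classes=3):
    """Global average pooling replaces large dense layers, reducing parameters."""
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Dense(num_classes, activation="softmax")(x)
```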