The increased use of the internet for research requires users to navigate web pages designed with different fonts, font sizes, font colors, and background colors. These websites do not always meet basic visual-ergonomics requirements. The trend toward online eLearning has teachers posting materials on learning management systems (LMS) without paying much attention to the content's appearance. Content developers use their own discretion to select fonts and backgrounds. This freedom often leads to published content that is difficult to read, even for users with no visual disabilities. As a user navigates through online content, differences in presentation introduce temporary visual challenges: users strain their eyes as they adjust to varying display settings. Similarly, the user's environment can temporarily impair their ability to read information on a screen, for example, when the lighting in a room is poor or when the screen is too close or too far away.
2. Facial Expression Recognition
Recent advances in image-recognition algorithms have made it possible to detect facial expressions such as happiness, sadness, anger, or fear [13,14], with several reviews also touching on the subject [15,16]. Such initiatives find applications in detecting consumer satisfaction with a product [17,18,19] or in healthcare to diagnose certain health issues [20,21], such as autism or stroke. Recently, FER has found applications in detecting fatigue, which can be dangerous, especially for drivers [22,23,24]. One of the few studies identified used blink rate and sclera-area color to detect DES with a Raspberry Pi camera [25]. That study, however, reported poor results for certain skin tones and where the light-intensity difference between the sclera and the surrounding skin was limited. Users wearing spectacles also generated reflections that interfered with color detection. An approach that does not rely on color would therefore address these limitations.
Eye tracking has become one of the most used sensor modalities in affective computing for monitoring fatigue. The eye tracker in such experiments also captures additional information, such as blink frequency and changes in pupil diameter [26]. A typical eye tracker (such as a video-oculography system) consists of a video camera that records eye movements and a computer that stores and analyzes the gaze data [27]. Monitoring fatigue with this approach differs from monitoring the basic facial emotions (anger, contempt, disgust, fear, happiness, sadness, surprise) because specific facial cues are tracked, such as the percentage of eye closure (PERCLOS), head nodding, head orientation, eye blink rate, eye gaze direction, saccadic movement, or eye color. Fatigue is also expressed through a combination of other facial behaviors, such as yawning or placing the hands on the face.
3. Machine Learning Techniques for FER
Recent studies on facial expression recognition [28] acknowledge that machine learning plays a major role in automated facial expression recognition, with deep learning algorithms achieving state-of-the-art performance across a variety of FER tasks.
3.1. FER Datasets
Studies using relatively limited datasets are constrained by poor representation of certain facial expressions, age groups, or ethnic backgrounds. To address this, the authors in [29] recommend using large datasets. In their review of FER studies, they note that the Cohn–Kanade AU-Coded Face Expression Database (Cohn–Kanade) [30] is the most used database for FER. A more recent review [15] introduced newer datasets such as the Extended Cohn–Kanade (CK+) database [31], which it noted was still the most extensively used laboratory-controlled database for evaluating FER systems; it has 593 images, compared with only 210 in the original version. Another notable dataset is FER2013 [32], a large-scale, unconstrained database collected automatically via the Google image search API; it contains 35,887 images extracted from real-life scenarios. The review [15] noted that data bias and inconsistent annotations are common across facial expression datasets owing to differing collection conditions and the subjectivity of annotation. Because researchers evaluate algorithms on specific datasets, the same results often cannot be replicated on unseen test data. Using a large dataset on its own is therefore not sufficient; merging data from several datasets helps ensure generalizability.
Additionally, when datasets exhibit class imbalance, the balance should be restored during preprocessing by augmenting the data with samples from other datasets. These findings motivated our decision to use more than one dataset, as well as large datasets. We used images from the CK+ and FER2013 datasets and performed class balancing during preprocessing. Notably, most FER datasets label images with the seven basic emotions (disgust, fear, joy, surprise, sadness, anger, and neutral). Preprocessing this study's data therefore required re-labeling the images to represent digital eye strain expressions such as squint, glare, and fatigue. This involved manually reviewing the images, identifying those that fell into each class, and assigning each a new label representing a digital eye strain expression; for instance, fatigue was labeled 1, whereas glare was labeled 2. To do this, the original images were rebuilt from the FER2013 pixel data so that the researchers could see the expressions. Once labeling was complete, a new pixel dataset was generated with the new labels. By automatically detecting these facial expressions and autonomously adjusting font sizes or screen contrast, the user no longer needs to glare or squint to accommodate the screen settings. An alert also creates awareness for the user, who can then take a break to address fatigue.
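The rebuilding step can be sketched as follows. FER2013 stores each face as a space-separated string of 2304 grayscale values (a 48 × 48 image); the `relabel_map` below is a hypothetical stand-in for the outcome of the manual review, not the study's actual mapping.

```python
import numpy as np

IMG_SIZE = 48  # FER2013 faces are 48x48 grayscale

def pixels_to_image(pixel_string: str) -> np.ndarray:
    """Rebuild a 48x48 image array from a FER2013 pixel string."""
    values = np.array([int(p) for p in pixel_string.split()], dtype=np.uint8)
    return values.reshape(IMG_SIZE, IMG_SIZE)

# Hypothetical result of the manual review: row index -> new eye-strain label
# (1 = fatigue, 2 = glare, following the scheme described in the text).
relabel_map = {0: 1, 1: 2}

def relabel(rows, mapping):
    """Keep only manually reviewed rows, replacing the original emotion label
    with the new digital-eye-strain label."""
    return [(mapping[i], pixels) for i, (_, pixels) in enumerate(rows) if i in mapping]
```

Each rebuilt array can then be displayed for review, and the relabeled `(label, pixels)` rows written back out as the new dataset.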
3.2. Image Processing and Feature Extraction
Image processing refers to the enhancement of pictures for ease of interpretation [33]. Common image-processing activities include adjusting pixel values, converting image colors, and binarizing [34]. Several image-processing libraries support these operations, such as OpenCV [35] and scikit-image [36], and they integrate easily with popular open-source machine-learning environments such as Python and R. After an image is preprocessed, feature extraction reduces the initial raw image data to a more manageable size for classification. Previous FER reviews [37] describe action unit (AU) analysis and facial point (FP) analysis as two key methods for feature extraction of classic facial emotions. Action units find applications when analyzing the entire face.
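Two of the preprocessing steps mentioned above can be sketched without a full library. The example below implements grayscale conversion with ITU-R BT.601 luma weights and a simple fixed-threshold binarization in plain NumPy; real pipelines would typically use OpenCV or scikit-image, which also offer adaptive and Otsu thresholding.

```python
import numpy as np

def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB image to (H, W) grayscale via BT.601 luma weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def binarize(gray: np.ndarray, threshold: float = 128.0) -> np.ndarray:
    """Map each pixel to 0 or 1 around a fixed global threshold."""
    return (gray >= threshold).astype(np.uint8)

# A 2x2 image with one white pixel: only that pixel survives binarization.
rgb = np.zeros((2, 2, 3))
rgb[0, 0] = [255.0, 255.0, 255.0]
print(binarize(to_grayscale(rgb)))
```

The threshold value 128 is an arbitrary midpoint for illustration; in practice it is tuned or computed per image.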
3.3. FER Algorithms
When our eyes squint, several things occur: the pupils get smaller as the eyes converge, the eyelids pull together, and the edges of the eyelids fold to contract the cornea [9]. The eyebrows may also bend inwards, and the bridge of the nose moves upwards to enhance the eyes' focus. FER techniques can detect these expressions and alert the user or adjust text sizes and color contrasts in an application to relieve eye strain. The FER process generally involves acquiring a facial image, extracting features useful for detecting the expression, and analyzing the image to recognize the expression [29]. Machine learning algorithms, notably deep neural networks, perform FER successfully. A popular algorithm, according to recent FER reviews [15,16], is the convolutional neural network (CNN), which achieves better accuracy with big data [38]. It extracts features more effectively than deep belief networks, especially for expressions of classic emotions such as contempt, fear, and sadness [15]. The results of these studies inspired the choice of CNN as the algorithm for implementing FER in this study [5].
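To make the feature-extraction idea concrete, the core operation of a convolutional layer can be written directly in NumPy. The vertical-edge kernel below is purely illustrative (a CNN learns its kernels from data rather than using hand-crafted ones); it responds strongly where intensity changes left to right, such as at an eyelid boundary in a grayscale face crop.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 2-D cross-correlation: slide the kernel over the image and
    take a weighted sum at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-crafted vertical-edge detector (for illustration only).
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])

# A 5x5 image that is bright on the left and dark on the right: the response
# is large near the vertical boundary and zero in the flat dark region.
image = np.zeros((5, 5))
image[:, :2] = 1.0
print(conv2d_valid(image, edge_kernel))
```

A trained CNN stacks many such filters, interleaved with non-linearities and pooling, so that later layers respond to increasingly complex facial features.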
However, it is worth noting that these results depend on the specific dataset used. For instance, the models that yielded the best accuracy on the FER2013 dataset are Ensemble CNNs with an accuracy of 76.82% [32], Local learning Deep+BOW with an accuracy of 75.42% [39], and LHC-Net with an accuracy of 74.42% [40]. The best-performing models on the CK+ dataset include ViT + SE with an accuracy of 99.8% [41], FAN with an accuracy of 99.7% [42], and Nonlinear eval on SL + SSL puzzling with an accuracy of 98.23% [43]. Sequential forward selection yielded the best accuracy on the CK dataset, at 88.7% [44]. The highest-performing models on the AffectNet dataset are EmotionGCN with an accuracy of 66.46% [45], EmoAffectNet with an accuracy of 66.36% [46], and Multi-task EfficientNet-B2 with an accuracy of 66.29% [47]. Although numerous datasets exist for facial expression recognition, this study sought to detect expressions outside the classic emotions. The absence of labeled datasets in this area called for relabeling images, and the choice of dataset for relabeling was not crucial. Future research should seek to relabel images from larger datasets such as AffectNet.
With CNNs, deeper networks with larger width, depth, or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturates [48]. Adding dropout layers increases accuracy by preventing weights from converging to the same position; the key idea is to randomly drop units (along with their connections) from the network during training, which prevents units from co-adapting too much [49]. Adding batch normalization layers increases test accuracy by normalizing each layer's inputs within a mini-batch, addressing the internal covariate shift that occurs when input distributions change during training [50]. Pooling layers decrease each frame's spatial size, reducing the computational cost of deep learning frameworks; the pooling operation usually picks the maximum value in each slice of the image [51].
A summary of this process is depicted in Figure 1.
Figure 1. Distribution of facial expressions.
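The pooling and dropout operations discussed above reduce to a few lines of NumPy. The sketch below uses the standard formulations (2 × 2 max pooling and inverted dropout), not the study's own implementation.

```python
import numpy as np

def max_pool2d(x: np.ndarray, k: int = 2) -> np.ndarray:
    """Keep the maximum of each k-by-k patch, shrinking each spatial
    dimension by a factor of k (edges that don't divide evenly are cropped)."""
    h, w = (x.shape[0] // k) * k, (x.shape[1] // k) * k
    return x[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def dropout(x: np.ndarray, rate: float, rng: np.random.Generator) -> np.ndarray:
    """Inverted dropout: zero units at random during training and rescale
    the survivors by 1/(1-rate) so the expected activation is unchanged."""
    mask = rng.random(x.shape) >= rate
    return np.where(mask, x / (1.0 - rate), 0.0)

# A 4x4 feature map pools down to 2x2, keeping the largest value per patch.
feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(feature_map))  # the maxima 5, 7, 13, 15
```

At inference time dropout is simply disabled; because of the rescaling during training, no further adjustment is needed.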
Popular CNN designs are based on the theoretical foundations laid by AlexNet [52], VGG [53], and ResNet [54]. AlexNet uses the ReLU (rectified linear unit), given by f(x) = max(0, x), for the non-linear part instead of a tanh or sigmoid function; training is faster because ReLU mitigates the vanishing gradient problem. AlexNet also reduces overfitting by using dropout layers [55], and it has lower computational requirements than VGGNets. VGGNets advocate multiple stacked smaller-size kernels rather than a single larger one because this increases the depth of the network [56], enabling it to learn more complex features. Increasing the depth, however, introduces other challenges, such as the vanishing gradient problem and higher training error. ResNets address these challenges by introducing a global average pooling layer [57] and residual modules [58], which reduce the number of parameters and improve learning in the earlier layers.
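The residual idea reduces to adding an identity shortcut around a transformation. In the toy NumPy version below, a single linear map stands in for ResNet's convolution-and-normalization stack, which is an assumption of this sketch.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Rectified linear unit: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def residual_block(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """y = ReLU(F(x) + x). The identity shortcut (+ x) lets the signal and
    its gradients bypass F, which is what allows very deep stacks to train.
    Here F is just a single linear map standing in for conv layers."""
    return relu(x @ weight + x)

# With a zero weight matrix the block reduces to ReLU(x): the identity path
# alone carries the signal, so a block can start out as a near-identity
# and only learn a residual correction.
x = np.array([[1.0, -2.0]])
print(residual_block(x, np.zeros((2, 2))))
```

This near-identity behavior at initialization is why adding residual blocks does not raise the training error the way adding plain layers does.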