Breast cancer is a prevalent disease that affects mostly women, and early diagnosis expedites its treatment. Recently, machine learning (ML) techniques have been employed in biomedicine and informatics to help fight breast cancer. Extracting information from data to support the clinical diagnosis of breast cancer is a tedious and time-consuming task. The use of machine learning and feature extraction techniques has significantly changed the whole process of breast cancer diagnosis.
According to the World Health Organization (WHO) [1], cancer is a large group of diseases that can occur in any part or tissue of the body when abnormal cells grow uncontrollably beyond their usual boundaries, invading adjoining parts of the body and destroying body tissues.
The WHO reports that cancers, such as breast, cervical, ovarian, lung and prostate cancer, accounted for over 10 million deaths in 2022. Breast cancer is the most prevalent cancer at 2.26 million cases and is the leading cause of premature mortality among women globally, with 685,000 deaths [2]. Breast cancer (BC) is one of the most prevalent cancers among women worldwide, with fewer cases in men [3].
Breast cancer is a medical abnormality in which the cells lining the breast ducts form clumps with malignant characteristics. It is the most common cancer in women, found mostly in low- and middle-income countries of sub-Saharan Africa, most especially Nigeria [4].
The primary concern of breast cancer treatment is accurately predicting the presence of cancer and classifying its type to determine how to treat it [6]. However, predicting breast cancer type is among the classic problems in health-related research [3]. Accurate classification of breast cancer translates to early detection, diagnosis, treatment, and, where possible, full eradication. Furthermore, the accurate classification of benign tumors can prevent patients from undergoing unnecessary treatments [7].
Over the last few decades, several organizations have acquired vast repositories of data collected from diverse sources in distinct formats [8]. These data can be used in application domains such as medicine, agriculture and weather forecasting [10]. Their increasingly large volume surpasses the ability of traditional methods to analyze them and to search for the patterns and information hidden in them for decision making [11]. Data obtained from medical data repositories can be analyzed using machine learning algorithms such as classification, clustering, and regression algorithms. Machine learning algorithms, and their usefulness in knowledge detection from medical data repositories, have been valuable tools for the success of disease prediction [13]. A good number of research works have reported the use of machine learning algorithms for breast cancer prediction [15]. Machine learning algorithms have been prevalent in the development of predictive models to support effective decision-making for breast cancer prediction [16].
Machine learning algorithms have been used to create predictive models for BC that support physicians’ decisions with acceptable accuracy [17]. However, these models show some limitations, such as fitting a model to the dataset without considering feature extraction techniques [18]; proper feature extraction techniques effectively reduce dimensionality for better prediction of the disease [19]. There is also increasing concern regarding the methods of handling missing values in the dataset [20]. Hence, we developed an improved machine learning model to give accurate breast cancer predictions and increase survivability rates in women.
2. Breast Cancer Classification
Prediction is one of the most important and essential tasks in machine learning [21]. Extensive research has been conducted using machine learning algorithms on different medical datasets, especially in BC prediction. Most ML techniques used showed good prediction accuracy.
In 2015, [19] used SVM, an artificial neural network, a naïve Bayes classifier and AdaBoost for breast cancer prediction, with principal component analysis used for feature space reduction.
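The idea of reducing the feature space with principal component analysis before classification can be sketched as follows. This is a minimal illustration and not the cited authors' implementation: it assumes scikit-learn, uses the diagnostic Wisconsin data bundled with that library, and the number of components (5), the RBF kernel and the 70/30 split are arbitrary choices made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def pca_svm_accuracy(n_components=5, seed=0):
    """Fit scaler -> PCA -> SVM and return (reduced dimension, test accuracy).

    n_components and seed are illustrative assumptions, not values
    taken from the studies discussed in the text.
    """
    X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # Scaling first matters: PCA is sensitive to the units of each feature.
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        SVC(kernel="rbf"),
    )
    model.fit(X_train, y_train)
    reduced_dim = model.named_steps["pca"].n_components_
    return reduced_dim, model.score(X_test, y_test)


if __name__ == "__main__":
    dim, acc = pca_svm_accuracy()
    print(f"features after PCA: {dim}, test accuracy: {acc:.3f}")
```

Wrapping the steps in a pipeline means the scaler and the PCA projection are fitted on the training portion only, so the held-out data do not leak into the reduction step.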
In 2020, [22] used an artificial neural network (ANN) and SVM for the prognosis of breast cancer recurrence as well as patient death within 32 months of undergoing surgery. SVM had the best performance, with an accuracy of 96.86%. A separate study applied four machine learning techniques, namely SVM, RF, naïve Bayes, and k-NN, to the Wisconsin breast cancer dataset from the UCI machine learning repository. The authors used the Waikato Environment for Knowledge Analysis (Weka) software to simulate the algorithms. In their results, SVM had the best overall performance in terms of effectiveness and efficiency.
Chaurasia et al. [24] used naïve Bayes, an RBF network, and J48 to predict benign and malignant breast cancer in the Wisconsin breast cancer database (WBCD), with the aim of improving the accuracy of the BC prediction model; the results showed that naïve Bayes was the best predictor. Kumar [25] used naïve Bayes, logistic regression, and a decision tree in a performance analysis of data mining algorithms for breast cancer cell detection.
Rajbharath and Sankari [7] used a hybrid of random forest (RF) and logistic regression (LR) algorithms to build a breast cancer survivability prediction model. RF was used to perform a preliminary screening of the variables and to rank them. A reduced dataset was then extracted from the initial WDBC dataset and input into the logistic regression procedure, which builds interpretable models for predicting breast cancer survivability.
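A two-stage hybrid of this kind, where a random forest ranks the variables and logistic regression is fitted on the top-ranked ones, might look like the sketch below. It is an assumption-laden illustration rather than the published model: it uses scikit-learn and its bundled diagnostic Wisconsin data (benign versus malignant labels, not survivability), and the cut-off `top_k=10` and the forest size are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def rf_then_logreg(top_k=10, seed=0):
    """Rank features with a random forest, then fit logistic regression
    on the top_k ranked features. Returns (feature names, test accuracy).

    top_k and seed are illustrative assumptions.
    """
    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.3,
        stratify=data.target, random_state=seed,
    )

    # Stage 1: preliminary screening, ranking variables by importance.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X_train, y_train)
    ranked = np.argsort(forest.feature_importances_)[::-1]  # most important first
    selected = ranked[:top_k]

    # Stage 2: interpretable model on the reduced variable set.
    logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    logreg.fit(X_train[:, selected], y_train)

    names = [str(data.feature_names[i]) for i in selected]
    return names, logreg.score(X_test[:, selected], y_test)


if __name__ == "__main__":
    names, acc = rf_then_logreg()
    print("selected features:", names)
    print(f"test accuracy: {acc:.3f}")
```

The ranking is computed on the training split only, so the choice of variables is not informed by the test samples.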
In 2016, Asri et al. [26] compared four machine learning algorithms, support vector machine (SVM), decision tree (C4.5), naïve Bayes (NB) and k-nearest neighbors (k-NN), on the Wisconsin Breast Cancer (original) dataset for breast cancer risk prediction and diagnosis. In their experiments, SVM gave the highest accuracy with a low error rate.
Ricciardi et al. [27] combined linear discriminant analysis (LDA) and principal component analysis (PCA) for the classification of coronary artery disease. PCA was used to create new features and LDA to perform the classification, which improved the diagnosis of patients.
Kumar et al. [3] predicted malignant and benign breast cancer using 12 algorithms: AdaBoost M1, decision table, J-Rip, J48, Lazy IBK, Lazy K-star, logistic regression, multiclass classifier, multilayer perceptron, naïve Bayes, random forest and random tree. The primary data were drawn from the Wisconsin breast cancer database, and Lazy K and the random tree had the highest accuracy.
Furthermore, Gupta and Gupta [28] performed a comparative analysis of four widely used machine learning techniques, namely multilayer perceptron (MLP), decision tree (C4.5), support vector machine (SVM), and K-nearest neighbor (KNN), on the Wisconsin Breast Cancer dataset to predict breast cancer recurrence. The main objective of their work was to identify the best of the four classifiers in terms of accuracy, precision and recall. They concluded that MLP performed better than the other techniques, including under 10-fold cross-validation.
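A comparison of this kind, scoring several classifiers on accuracy, precision and recall under 10-fold cross-validation, can be sketched as below. The example is only indicative: it assumes scikit-learn and its bundled diagnostic Wisconsin data (benign versus malignant rather than recurrence), uses default hyperparameters, and substitutes scikit-learn's CART decision tree for C4.5, which that library does not provide.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(seed=0):
    """Return mean 10-fold accuracy, precision and recall per classifier.

    Precision and recall are computed for label 1, which is the benign
    class in scikit-learn's copy of the dataset.
    """
    X, y = load_breast_cancer(return_X_y=True)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

    # Hyperparameters are library defaults, an assumption of this sketch.
    candidates = {
        "MLP": make_pipeline(
            StandardScaler(), MLPClassifier(max_iter=1000, random_state=seed)
        ),
        "Decision tree (CART)": DecisionTreeClassifier(random_state=seed),
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    }

    metrics = ("accuracy", "precision", "recall")
    summary = {}
    for name, model in candidates.items():
        scores = cross_validate(model, X, y, cv=folds, scoring=list(metrics))
        summary[name] = {m: float(scores[f"test_{m}"].mean()) for m in metrics}
    return summary


if __name__ == "__main__":
    for name, result in compare_classifiers().items():
        print(name, {m: round(v, 3) for m, v in result.items()})
```

Using the same fold object for every classifier keeps the comparison fair, because each model is evaluated on identical train and test partitions.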
Zheng et al. [29] proposed a hybrid of K-means and support vector machine (K-SVM) algorithms. K-means was used to recognize the hidden patterns of the malignant and benign tumors separately, and SVM was then used to generate the new classifier. Evaluated with 10-fold cross-validation on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, the approach obtained an accuracy of 97.38%, which was higher than the scores for the other six algorithms it was compared against.
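One way to read the K-means plus SVM idea is sketched below: cluster each class separately, describe every tumor by its distance to each learned centroid, and train an SVM on those distances. This is a simplified interpretation, not the published K-SVM procedure. The class name `ClassWiseKMeans`, the choice of three clusters per class, and the use of scikit-learn with its bundled WDBC copy are all assumptions of the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class ClassWiseKMeans(TransformerMixin, BaseEstimator):
    """Hypothetical helper: run K-means on each class separately and map
    every sample to its distances from all learned centroids."""

    def __init__(self, n_clusters=3, random_state=0):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y):
        centroids = []
        for label in np.unique(y):
            km = KMeans(
                n_clusters=self.n_clusters, n_init=10,
                random_state=self.random_state,
            )
            km.fit(X[y == label])  # patterns of one tumor class only
            centroids.append(km.cluster_centers_)
        self.centroids_ = np.vstack(centroids)
        return self

    def transform(self, X):
        # Euclidean distance from each sample to each centroid.
        diffs = X[:, None, :] - self.centroids_[None, :, :]
        return np.linalg.norm(diffs, axis=2)


def kmeans_svm_cv_accuracy(n_clusters=3, seed=0):
    """Mean 10-fold accuracy of scaler -> class-wise K-means -> SVM."""
    X, y = load_breast_cancer(return_X_y=True)
    model = make_pipeline(
        StandardScaler(),
        ClassWiseKMeans(n_clusters=n_clusters, random_state=seed),
        SVC(),
    )
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return float(cross_val_score(model, X, y, cv=folds).mean())


if __name__ == "__main__":
    print(f"mean 10-fold accuracy: {kmeans_svm_cv_accuracy():.3f}")
```

Because the clustering step sits inside the pipeline, the centroids are re-learned within each training fold, which keeps the cross-validated score free of information from the held-out fold.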
In another study, Sivakami and Saraswathi [20] worked on breast cancer prediction using a DT–SVM hybrid model that combines a decision tree and a support vector machine. The decision tree was used for feature selection, and the proposed methodology improved the accuracy of breast cancer prediction to 91%. In another recent study on breast cancer, Wu and Hicks [30] investigated four ML algorithms, support vector machine, K-nearest neighbor, naïve Bayes and decision trees, to classify patients as having triple-negative or non-triple-negative breast cancer using gene expression data. SVM gave better classifications than the other three algorithms.
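The DT–SVM hybrid described above, in which a decision tree selects features and an SVM performs the classification, could be sketched as follows. This is an illustration under stated assumptions and not the cited model: it uses scikit-learn's `SelectFromModel` with its default mean-importance threshold, a CART decision tree, and the library's bundled diagnostic Wisconsin data, so its accuracy is not comparable with the 91% reported in the study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def dt_svm_accuracy(seed=0):
    """Select features with a decision tree, classify with an SVM.

    Returns (number of selected features, test accuracy). The split
    ratio and seed are illustrative assumptions.
    """
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = make_pipeline(
        StandardScaler(),
        # Keeps features whose tree importance is at least the mean importance.
        SelectFromModel(DecisionTreeClassifier(random_state=seed)),
        SVC(),
    )
    model.fit(X_train, y_train)
    kept = model.named_steps["selectfrommodel"].get_support()
    return int(kept.sum()), model.score(X_test, y_test)


if __name__ == "__main__":
    n_kept, acc = dt_svm_accuracy()
    print(f"features kept: {n_kept} of 30, test accuracy: {acc:.3f}")
```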
This entry is adapted from 10.3390/biomedinformatics2030022