Human-centric Computing and Information Sciences volume 12, Article number: 54 (2022)
In recent decades, microarray datasets have played an important role in triple negative breast cancer (TNBC) detection. Microarray data classification is challenging because the data contain numerous redundant and irrelevant features. Feature selection, which eliminates unneeded feature vectors from the system, is therefore indispensable in this research field. Because selecting an optimal feature subset is an NP-hard problem, a rough set-based feature selection algorithm is used in this manuscript to select the optimal feature values. Initially, the datasets related to TNBC are acquired from the Gene Expression Omnibus under the accession numbers GSE45827, GSE76275, GSE65194, GSE3744, GSE21653, and GSE7904. Then, a robust multi-array average technique is used to eliminate the outlier TNBC/non-TNBC samples, which helps enhance classification performance. Further, the pre-processed microarray data are fed to rough set theory for optimal gene selection, and the selected genes are then given as inputs to the ensemble classification technique for classifying low-risk genes (non-TNBC) and high-risk genes (TNBC). The experimental evaluation showed that the ensemble-based rough set model obtained a mean accuracy of 97.24%, which is superior to that of the comparative machine learning techniques.
Ensemble Classifier, Machine-Learning Technique, Microarray Data, Robust Multi-Array Average Technique, Rough Set Theory, Triple Negative Breast Cancer
Cancer is one of the most serious health problems in the world, and it begins in the cells of the human body. Cancer is defined as an uncontrolled growth of abnormal cells anywhere in the human body; these cells are termed malignant, tumor, or cancer cells, and they affect normal body tissues. Cancer develops from a series of genetic mutations that disable the checks on normal cell growth, so that the cells continue to grow, divide, and develop into cancer [2, 3]. Cancer that develops in the breast tissues is called breast cancer, and it usually develops in the inner lining of the milk ducts or in the lobules that supply the ducts with milk [4, 5]. In recent decades, breast cancer has been a major cause of the high mortality rate among women. Around 10%–15% of breast cancer patients have pain in the breast. Breast cancer symptoms include swelling, dimpling of the skin surface, skin with an orange appearance, skin irritation, nipple discharge, tenderness, and nipple inversion [7, 8]. Sometimes, the overgrowing cancer cells dilate the veins on the breast surface. The characteristics of cancer cells depend on the structure of the nucleus, the cell outline, the capacity to metastasize, and the cell shape. Generally, manual detection of triple negative breast cancer (TNBC) using microarray data is effective, but it is not accurate in all circumstances. In addition, manual detection has two major drawbacks: it consumes more time for classifying TNBC and non-TNBC genes, and it is suitable only for small amounts of data [9, 10]. To address these concerns, researchers have developed numerous machine-learning techniques that have had a considerable impact on TNBC detection. In this manuscript, a new model is implemented for effective TNBC detection using microarray data. The major contributions of this work are listed as follows:
Firstly, microarray data related to TNBC/non-TNBC are acquired from six datasets, namely GSE45827, GSE76275, GSE65194, GSE3744, GSE21653, and GSE7904.
Next, the outlier samples of TNBC/non-TNBC are eliminated using the robust multi-array average (RMA) technique. The RMA technique summarizes the perfect-match probe intensities through median polish, which is robust, and its behavior depends on the number of analyzed samples. In microarray data classification, the RMA technique offers three major advantages: quality control, spot filtering, and background correction.
After normalizing the TNBC/non-TNBC samples, gene selection is carried out by using the rough set-based feature selection algorithm on individual gene expression omnibus IDs.
After selecting the optimal genes in every gene expression omnibus ID, the ensemble classifier (a combination of a k-nearest neighbor [kNN] and a support vector machine [SVM]) is applied to classify the low-risk genes (non-TNBC) and high-risk genes (TNBC). The motivation behind an ensemble classifier is to learn a set of classifiers (here, a kNN and an SVM) and combine their outputs using soft voting, which obtains better results than the individual classifiers. Soft voting predicts the class with the highest summed probability from the kNN and SVM.
The effectiveness of the proposed ensemble-based rough set model is tested in terms of the Matthews correlation coefficient (MCC), F-measure, precision, recall, and accuracy.
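The soft-voting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB implementation: the function name `soft_vote` and the per-class probabilities are hypothetical, and in practice the probabilities would come from trained kNN and SVM classifiers.

```python
from typing import Dict, Sequence

Probs = Dict[str, float]  # class label -> predicted probability


def soft_vote(prob_sets: Sequence[Probs]) -> str:
    """Return the class with the highest summed probability across classifiers."""
    totals: Dict[str, float] = {}
    for probs in prob_sets:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    # Ties are resolved in favor of the label seen first.
    return max(totals, key=lambda label: totals[label])


# Hypothetical probabilities for one sample (illustrative values only).
knn_probs = {"TNBC": 0.40, "non-TNBC": 0.60}
svm_probs = {"TNBC": 0.90, "non-TNBC": 0.10}

prediction = soft_vote([knn_probs, svm_probs])
```

In this toy case a hard (majority) vote would tie, because each classifier favors a different class; summing the probabilities lets the more confident classifier decide.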
Li et al. introduced a new machine-learning model for identifying TNBC-related genes. In this study, seven gene expression datasets, namely GSE15852, GSE45255, GSE32646, GSE20271, GSE20194, GSE9574, and GSE31519, were utilized for experimental evaluation. The KEGG pathway examination showed that 54 genes were related to viral carcinogenesis, and a gene ontology investigation indicated that the cellular response to organic cyclic compounds influences the onset of breast cancer. Additionally, the machine learning technique used an SVM to predict a high risk of breast cancer. The experimental examination showed that the presented model effectively identifies cancer-related genes and assists physicians in medical diagnosis. However, the SVM classifier performs only binary classification and was inappropriate for multi-class classification. Cai et al. used Dijkstra’s algorithm for finding the genes that mediate bone cancer metastasis to breast cancer. Many putative genes were determined using Dijkstra’s algorithm from large networks, and then a protein-to-protein interaction (PPI) network was constructed using the selected genes of breast and bone cancers. Overall, eighteen putative genes were determined utilizing Dijkstra’s algorithm, and the experimental results confirmed that these putative genes participate in metastasis. However, Dijkstra’s algorithm performs a blind search, which is time-consuming and expends resources unnecessarily.
Sarkar et al. used a random forest classifier with seven feature selection algorithms (joint mutual information, minimum redundancy maximum relevance, double input symmetrical relevance, conditional infomax feature extraction, mutual information maximization, interaction capping, and conditional mutual information maximization) to predict miRNA biomarkers of breast cancer subtypes. Additionally, a Cox regression-based survival investigation was carried out to find important miRNAs for breast cancer detection. However, the computational time of the developed model was high because several feature selection algorithms were implemented in this study. Mahapatra et al. combined an extreme gradient boosting classifier and a deep neural network to predict PPIs. In this study, three sequence-based features, namely a local descriptor, conjoint triad composition, and amino acid composition, were given as the input to a hybrid classifier. The experimental analysis showed that the hybrid classifier effectively predicts inter-species and intra-species PPIs, and that it obtained better classification accuracy on the independent test sets, which indicates that it could be used for cross-species prediction. Pan et al. combined the fast Walsh-Hadamard transform and a random forest classifier for predicting plant PPIs. The performance of the introduced model was tested on three plant PPI datasets: Arabidopsis thaliana, maize, and rice. However, the random forest classifier consumes more computational power and resources for predicting plant PPIs, which was considered a major concern in this study.
Naorem et al. used a correlation-based feature selection algorithm and a naïve Bayes classifier for classifying non-TNBC and TNBC samples from the GSE45827, GSE21653, GSE76275, GSE7904, GSE3744, and GSE65194 datasets. The experimental investigations suggested that the selected key candidate genes were therapeutic targets for TNBC treatment. The implemented model was more appropriate for structured data, but obtained limited performance with unstructured data. Wang et al. implemented a new computational model that integrates a random forest, rough set-based rule learning, and a Monte Carlo feature selection algorithm for identifying the genes that were related to original human tumors and breast cancer. Among 831 breast tumors, 32 optimal genes were determined for constructing a prediction model. The presented model experienced a class imbalance problem in a few circumstances, which was a major issue in this study. Zhang et al. developed a random walk with restart algorithm and a PPI network for identifying the proliferative diabetic retinopathy-related genes. The random walk with restart algorithm was applicable to two-class classification, but not to multiclass classification. Al-Safi et al. and Iswisi et al. developed a Harris Hawks optimization (HHO) algorithm for effective feature selection. Additionally, a majority voting learning method was utilized to diagnose the disease type in medical centers. Al-Safi et al. integrated the HHO algorithm and an artificial neural network for heart disease diagnosis. In addition, several optimization algorithms, such as a black widow spider optimization algorithm, particle swarm optimization (PSO), hybrid particle swarm optimization, the artificial bee colony (ABC) optimization algorithm, polar bear optimization, and principal component analysis, were also preferred in gene selection.
A review of the existing literature reveals some common concerns faced by researchers in breast cancer detection using microarray data, which are listed as follows:
The microarray data acquisition and pre-processing stage faces a major problem: it is difficult for a user to acquire quality medical data because of the limits of the capturing technology or adverse environmental conditions.
When supervised machine learning methods are used, the semantic gap between the feature values is maximized, which leads to poor classification performance.
Clustering techniques improve the gene classification accuracy, but they are time-consuming because they compute the neighborhood term in every iteration step. To address the highlighted concerns, a new ensemble-based rough set model is implemented in this manuscript to improve the performance of TNBC and non-TNBC detection using microarray data.
The ensemble-based rough set model includes four phases in microarray data classification as follows:
- Data collection: TNBC microarray expression datasets (GSE45827, GSE76275, GSE65194, GSE3744, GSE21653, and GSE7904);
- Data pre-processing: robust multi-array average technique;
- Optimal gene selection: rough set theory; and
- Gene classification: ensemble classifier.
A flowchart of the ensemble-based rough set model is illustrated in Fig. 1.
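The gene-selection phase can be illustrated with the dependency degree from rough set theory, which measures the fraction of samples whose class is fully determined by a chosen set of attributes. The text does not spell out the exact reduct procedure, so the sketch below assumes a common greedy (QuickReduct-style) search over discretized expression levels; the gene names, the "low"/"high" levels, and the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Row = Dict[str, str]  # attribute name -> discretized value


def dependency(rows: Sequence[Row], features: Sequence[str], decision: str) -> float:
    """Fraction of rows whose equivalence class on `features` has one decision value."""
    if not rows:
        return 0.0
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in features)  # rows with equal keys are indiscernible
        groups[key].append(row[decision])
    positive = sum(len(labels) for labels in groups.values() if len(set(labels)) == 1)
    return positive / len(rows)


def quick_reduct(rows: Sequence[Row], candidates: Sequence[str], decision: str) -> List[str]:
    """Greedily add the feature that raises the dependency degree the most."""
    selected: List[str] = []
    best = dependency(rows, selected, decision)
    target = dependency(rows, candidates, decision)
    while best < target:
        gains = [(dependency(rows, selected + [f], decision), f)
                 for f in candidates if f not in selected]
        score, feature = max(gains, key=lambda g: g[0])
        if score <= best:  # no single feature helps; stop with a partial subset
            break
        selected.append(feature)
        best = score
    return selected


# Hypothetical, already-discretized expression table (illustrative only).
samples = [
    {"gene_a": "high", "gene_b": "low", "gene_c": "high", "label": "TNBC"},
    {"gene_a": "high", "gene_b": "high", "gene_c": "low", "label": "TNBC"},
    {"gene_a": "low", "gene_b": "low", "gene_c": "high", "label": "non-TNBC"},
    {"gene_a": "low", "gene_b": "high", "gene_c": "low", "label": "non-TNBC"},
]
reduct = quick_reduct(samples, ["gene_a", "gene_b", "gene_c"], "label")
```

Here `gene_a` alone separates the two classes, so the greedy search stops after one step. Real microarray values are continuous and would need to be discretized before this kind of selection applies.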
|Accession No.||Sample IDs|
|GSE45827||GSM1116215, GSM1116087, GSM1116190, GSM1116092, GSM1116146, and GSM1116093|
|GSE7904||GSM194406 and GSM194408|
|GSE3744||GSM85484, GSM85482, and GSM85497|
|GSE21653||GSM540108, GSM540323, GSM540109, GSM540324, GSM540110, GSM540325, GSM540130, GSM540332, GSM540139, GSM540343, GSM540141, GSM540322, GSM540148, GSM540319, GSM540201, GSM540231, GSM540195, GSM540214, and GSM540317|
|GSE76275||GSM1978928, GSM1978939, GSM1978917, GSM1978900, GSM1974760, GSM1978916, GSM1974736, GSM1974750, GSM1974732, GSM1974716, GSM1974723, GSM1974584, GSM1974605, GSM1974666, and GSM1974717|
|GSE65194||GSM1588987, GSM1588986, GSM1589015, GSM1589012, and GSM1589116|
The efficiency of the ensemble-based rough set model is validated using MATLAB (version 2020) on a system with 64 GB of random access memory, a 4 TB hard disk, an Intel Core i9 processor, and the Windows 10 operating system. The proposed model is evaluated using the performance measures MCC, F-measure, precision, recall, and accuracy on the GSE45827, GSE76275, GSE65194, GSE3744, GSE21653, and GSE7904 datasets. Precision quantifies the proportion of positive class predictions that actually belong to the positive class, while recall quantifies the proportion of actual positive samples that are correctly predicted. Further, the F-measure balances recall and precision in a single score. The mathematical formulas of precision, recall, and F-measure are denoted in Equations (13)–(15), respectively. In addition, accuracy measures the ratio of the number of correctly classified samples to the overall number of samples, and the MCC measures the correlation between the predicted and actual classes. The equations of the MCC and accuracy are defined in Equations (16) and (17), respectively.
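Equations (13)–(17) are not reproduced in this text, so the sketch below uses the standard confusion-matrix definitions of the five measures. The function name and the counts are hypothetical and are not results from the paper.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, F-measure, accuracy, and MCC from confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    # MCC is a correlation coefficient in [-1, 1]; it is set to 0 when undefined.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts with TNBC as the positive class (illustrative only).
metrics = classification_metrics(tp=45, tn=50, fp=2, fn=3)
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which makes it more informative when the TNBC and non-TNBC classes are imbalanced.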
In this manuscript, an ensemble-based rough set model is proposed for identifying the key genes of TNBC and non-TNBC. The ensemble-based rough set model includes two key phases: gene selection and classification. After eliminating the outlier samples, rough set theory is applied to select the optimal TNBC and non-TNBC genes from the 818 samples. The selected optimal TNBC and non-TNBC genes are given as the input to the ensemble classifier (a combination of the SVM and kNN classifiers) to classify the low-risk genes (non-TNBC) and high-risk genes (TNBC). In the results section, the effectiveness of the ensemble-based rough set model is validated based on MCC, F-measure, precision, recall, and accuracy. The experimental investigations showed that the ensemble-based rough set model achieved a mean accuracy of 97.24%, which is better than that of the other feature selection techniques (i.e., ReliefF and infinite) and the individual classifiers (i.e., random forest, kNN, and naïve Bayes). The proposed model significantly reduces the computational time and complexity, which are the major issues highlighted in the literature section. As a future direction of work, a new deep learning model can be developed and analyzed on unstructured multi-modal data to further improve gene classification for other diseases for early treatment and diagnosis.
Conceptualization, SP, KRB, PBD. Funding acquisition, JF, JN. Investigation and methodology, SP, KRB, PBD. Project administration, JF. Resources, SP, SK. Supervision, PBD. Writing of the original draft, SP, KRB. Writing of the review and editing, PBD, JF. Validation, SP, KRB, PBD, JN. Formal Analysis, PBD, JF, JN. Data curation, SP, KRB, PBD. Visualization, SP, KRB, JF, SK. All the authors have proofread the final version.
This work was supported by the Ministry of Education, Youth, and Sports (Grant No. SP2022/18, SP2022/34, and SP2022/5) conducted by VSB-Technical University of Ostrava.
The authors declare that they have no competing interests.
Name : Sujata N Patil
Affiliation : KLE Dr. M S Sheshgiri college of Engineering KLE Technological University, Hubbali
Biography : Sujata Patil received her Ph.D. with a specialization in artificial intelligence and machine learning (AI & ML) for automating human embryo grading and predicting the potential embryo. Her research work on neurological imaging is in progress, and she was the PI for the Vision Group of Science & Technology in setting up a center of excellence in AI & ML. She has received a Competitive Research Grant of Rs. 2 Lakh under TEQIP.
Name : Kavitha Rani Balmuri
Affiliation : CMR Technical Campus, Kandlakoya, Hyderabad
Biography : Kavitha Rani Balmuri completed her M.Tech in Computer Science and Engineering from JNTUH University, Hyderabad in 2009 and her Ph.D. in Computer Science and Engineering from JNTUK University, Kakinada, Andhra Pradesh, India, in 2015. She has 14 years of experience in industry and teaching. Presently she is working as Professor & Head, Department of Information Technology at CMR Technical Campus, Kandlakoya, Hyderabad. She has authored 3 books and published 36 international journal and conference papers. She is a member of many professional bodies. Her research interests include natural language processing, information retrieval, deep learning, and artificial intelligence.
Name : Jaroslav Frnda
Affiliation : Department of Quantitative Methods and Economic Informatics, Faculty of Operation and Economics of Transport and Communication, University of Zilina, Slovakia
Biography : Jaroslav Frnda was born in 1989 in Slovakia. He received his M.Sc. and Ph.D. from the VSB–Technical University of Ostrava (Czechia), Department of Telecommunications, in 2013 and 2018 respectively. Now he works as an assistant professor at the University of Zilina in Slovakia. His research interests include Quality of multimedia services in IP networks, data analysis and machine learning algorithms. In 2021, he was elevated to IEEE Senior Member grade. He has authored and co-authored 24 SCI-E and 7 ESCI papers in WoS.
Name : Parameshachari B.D.
Affiliation : GSSS Institute of Engineering and Technology for Women, Mysuru, India
Biography : Dr. Parameshachari B.D. is currently working as Professor and Head in the Department of Telecommunication Engineering at GSSS Institute of Engineering & Technology for Women, Mysuru, India. He has a total of 17+ years of teaching and research experience. He has published 120+ articles in SCI, SCOPUS, and other indexed journals and in conferences. He is a Book Editor, Associate Editor, and Guest Editor for several reputed indexed journals. He has been serving as a reviewer for journals from several publishers, such as IEEE, Springer, Elsevier, Wiley, and IGI-Global.
Name : Srinivas Konda
Affiliation : CMR Technical Campus, Kandlakoya, Hyderabad
Biography : Srinivas Konda completed his B.E in Computer Science and Engineering from National Institute of Engineering, Mysore in 1990, his M.Tech in Software Engineering from Kakatiya University, Warangal in 2008, and his Ph.D. in Computer Science and Engineering from JNTUH University, Hyderabad, in 2015. He has 23 years of experience in industry and teaching. Presently he is working as Professor & Head, Department of CSE (Data Science) at CMR Technical Campus, Hyderabad. He has authored 4 books and published 34 international journal and conference papers. He is a member of many professional bodies. His research interests include IoT, machine learning, and artificial intelligence.
Name : Jan Nedoma
Affiliation : Department of Telecommunications, Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava-Poruba, Czech Republic
Biography : Jan Nedoma was born in 1988 in the Czech Republic. In 2014 he received his Master's degree in Information and Communication Technology from the Technical University of Ostrava. Since 2014 he has worked there as a Research Fellow. In 2018 he successfully defended his dissertation thesis and worked as an assistant professor at the same university. He became an Associate Professor in Communication Technologies in 2021 after defending the habilitation thesis titled 'Fiber-optic sensors in biomedicine: Monitoring of vital functions of the human body in Magnetic Resonance (MR) environment'. His areas of scientific interest include optical communications, optical atmospheric communications, optoelectronics, optical measurements, measurements in telecommunication technology, fiber-optic sensory systems, data processing from fiber-optic sensors, and the use of fiber-optic sensors within SMART technological concepts (Smart Home, Smart Home Care, Intelligent Building, Smart Grids, Smart Metering, Smart Cities, etc.) and for the needs of Industry 4.0. He has more than 150 journal and conference articles in his research areas and 9 valid patents.
Sujata Patil, Kavitha Rani Balmuri, Jaroslav Frnda, Parameshachari B.D., Srinivas Konda, and Jan Nedoma, Identification of Triple Negative Breast Cancer Genes Using Rough Set Based Feature Selection Algorithm & Ensemble Classifier, Article number: 12:54 (2022)