A Novel Computer Assisted Genomic Test Method to Detect Breast Cancer in Reduced Cost and Time Using Ensemble Technique
  • Madhuri Gupta1, Bharat Gupta2, Abdoh Jabbari3, Ishan Budhiraja1,*, Deepak Garg1, Ketan Kotecha4,*, and Celestine Iwendi4

Human-centric Computing and Information Sciences volume 13, Article number: 08 (2023)
https://doi.org/10.22967/HCIS.2023.13.008

Abstract

Breast cancer is the leading cause of cancer-related death among women around the world. It is a primary malignancy for which genetic markers have revealed the ability to support clinical decision making. It is a genetic disease that arises from gene mutations, but the cost of a genetic test is relatively high for many patients in developing nations such as India. The results of a genetic test can also take several weeks, and this delay influences prognosis since certain patients suffer from a high rate of malignant cell proliferation. Therefore, a computer-assisted genetic test method (CAGT) is proposed to detect breast cancer. This test method predicts the gene expressions, converts these expressions into mutation states (under-expression: -1, transition: 0, overexpression: 1), and afterwards performs classification into benign and malignant classes in reduced time and at reduced cost. In the research work, machine learning techniques are applied to identify the most responsive genes of breast cancer on the basis of a patient's clinical report, and a CAGT is generated. A hard voting ensemble approach is then applied to detect breast cancer on the basis of the most responsive genes provided by CAGT, which improves cancer classification accuracy by 3.5%.


Keywords

Electronic Health Record, Breast Cancer, Machine Learning, Ensemble Modeling, Genomics


Introduction

Cancer is a genetic disease. It is characterized by the uncontrolled growth of cells in the body due to mutations in genes [1]. It can develop anywhere in the body: it starts when a cluster of cells grows uncontrollably and crowds out neighboring cells for resources. Breast cancer is the most frequently diagnosed form of cancer and has consistently been a leading cause of cancer fatalities for decades. Once metastasized, cancerous cells are capable of infiltrating any organ of the body and hampering its normal functioning [2]. It therefore becomes imperative to diagnose cancer at a primary stage. Because cancer is a genetic disease, a genomic test gives a more accurate detection of cancer. A gene expression report contains gene mutation information, but the genomic test is costly and time-consuming in developing nations like India. Genetic tests are unaffordable for several families in India [3], and most of the expenses have to be borne by patients, as very few government hospitals offer genetic tests. Research centers with the infrastructure to offer genetic testing usually provide results in 4 weeks, while private testing facilities generate the test report in 2–4 weeks. The idle time spent waiting for the reports can allow the cancer to grow and spread throughout the body [3]. So, a test method is required that can diagnose breast cancer on the basis of gene mutation in a reduced time.

In the research work, a computer-assisted gene test (CAGT) method is generated that can predict the most responsive genes on the basis of the clinical report of patients. To generate the CAGT, the dataset of van’t Veer is used, which contains the clinical report and gene expressions of the same patients. The clinical report contains the test results of different biomarkers that are involved in cancer, such as follow-up time (in years), diameter (in mm), metastases, angioinvasion, grade, lymphocytic infiltrate, estrogen receptor (ER) positive, progesterone receptor (PR) positive, and BRCA1 mutation. The gene expression dataset contains the genomic information of the patient and is generated by microarray technology [4]. In the genomic research area, microarray technology is a widely used tool for identifying genes for specific therapy [5–7]. It is mainly applicable to cancer prediction [8, 9], drug action investigation [10], and disease diagnosis [11]. The microarray technique offers high-dimensional data that comprises gene expression information under various environmental conditions as a large matrix. High-throughput technology is used to test the genes and generates gene expression data in large amounts, which calls for data analysis techniques, machine learning (ML), and tools to extract details from large data. ML and data analytics are key tools in breast cancer research when it comes to detection and other clinical issues. ML emphasizes the advancement of computer algorithms that process data and enable the machine to learn [12, 13]. In the research work, gene expression microarray and clinical data of the same patients are analyzed and processed for an advanced testing model. The following points summarize the research contributions:

We propose an SVM-RFE_MI technique to select the most relevant genes in the dataset.

A regression technique is tuned to predict the gene expressions of the selected genes on the basis of clinical parameters.

A classification model is trained to identify cancer using the predicted gene expressions. The proposed solution is implemented on the Apache Spark data processing engine and shows promising results in terms of cost and time reduction.

The rest of the paper is organized as follows: Section 2 reviews related work and explains the main algorithms, technologies, and data processing engines used in this framework. The proposed methodology is described in Section 3 with a detailed explanation of the process flow. Section 4 presents the experimental design of the proposed framework. Section 5 presents the experimental results, and the work is concluded in Section 6.


Related Work

Breast cancer is the most common disease in women that is caused by the excessive proliferation of mutated cells [2]. It begins in breast tissue and can spread to the surrounding fatty tissue. Breast cancer is generally detected when a patient feels a lump or during breast screening. Breast lumps can be cancerous (malignant) or non-cancerous (benign). Researchers have applied different approaches to predict cancer at an early stage to decrease mortality due to breast cancer. During the past few decades, ML methods have been applied in various complex, data-intensive fields, for instance cosmology, medicine, and cellular biology. These approaches provide precise solutions to extract the information encoded in the data [14–18]. ML contains various techniques that are specifically designed to assist in cancer diagnosis. A survey of ML is presented on the basis of techniques, metrics, and frameworks used for breast cancer detection. ML techniques used to diagnose cancer are summarized in Table 1 [12, 13, 19–31].

Table 1. Study of machine learning approaches applied in breast cancer prediction (state-of-the-art)

S. No Machine learning method Solutions Outcome
1 Support vector machine (SVM) technique [19] Breast cancer prediction using microarray dataset Accuracy: 94.5%
2 Deep learning technique [20] Prediction of cancer survival rate using multi-platform data analysis. Mean survival in breast cancer: 70%
3 Deep transfer learning using single cell [21] Breast cancer high-content analysis Accuracy: 77%
4 Artificial neural networks (ANN) [22] Risk estimation of breast cancer Accuracy: 96%
5 Ensemble learning using ANN with C4.5 Rule [12] Disease diagnosis (diabetes, hepatitis, and Breast cancer) Error rate: 2.9, 24, and 14.9, respectively.
6 Simple logistic, RBF network and RepTree [13] Prediction of breast cancer survivability Accuracy: 74.5%
7 Analogous random forest technique [23] Big data computing analytic using spark cloud computing Error rate is 0.2 for 500 trees
8 Neural networks [24] Limited datasets handling for medical diagnosis Accuracy: 86.5%
9 ANN re-entered [22] Risk estimation of breast cancer Accuracy: 96%
10 SVM [25] Breast cancer prediction and susceptibility using nucleotide polymorphisms Predictive power: 69%
11 Semi-supervised learning technique based on graph [26] Breast cancer survivability prediction using semi-supervised learning, SVM and ANN models Accuracy: 76%
12 Graph based semi-supervised ML technique [27] Generate an integrated gene network model to observe cancer recurrence. Max accuracy: 80%
13 Semi supervised learning co-training algorithm [28] Breast cancer survivability prediction Accuracy: 76%
14 SVM [29] Breast cancer prediction using gene recurrence and signature Specificity: 73%; Sensitivity: 89%
15 Bayesian network [30] Incorporation of microarray and clinical data to predict the breast cancer AUC: 85%
16 Seventeen SVM classifier models using LIBLINEAR [31] Cancer classification at primary sites Accuracy: 62%


The study presented in Table 1 lists commonly used ML techniques for gene selection. ML techniques are applicable in various data-intensive fields like cosmology, biology, and medicine. In the research work, ML techniques are applied to process the gene expression microarray and clinical data for the advanced testing models. Data acquisition of gene expressions is accomplished by high-throughput microarray technology (HTT). The microarray is a commonly used technology in the genomic research area. It is mainly applicable to disease prediction, gene identification for drug action investigation, specific therapy, and cancer prediction. The microarray technique provides high-dimensional data that comprises gene expression information under different environmental conditions. In the past decades, several researchers have extracted insights from genomic datasets. To process the high-throughput microarray dataset, the above-mentioned techniques need a data processing engine. These engines are capable of processing data of different sizes [32]. The selection of a data processing engine depends on the significance and size of the dataset. A study of some data processing engines is given below.

Data Processing Engines
In the past decade, several tools have become available for data processing, such as MapReduce, Storm, Apache Spark, Apache Flink, and H2O. A study of these widely used data processing engines is as follows [33]:
Apache MapReduce and Apache Hadoop: Apache Hadoop and MapReduce are different paradigms. MapReduce is a programming model that processes high-dimensional data using the divide-and-conquer approach. Hadoop is an open-source platform that implements MapReduce.
Apache Spark: Apache Spark is developed for fast processing and big data analytics. Spark is an open-source in-memory data processing engine. Apache Spark is built bottom-up to increase performance. It is faster than Hadoop, mainly for large-scale data processing, because of in-memory computation and other optimizations.
Apache Flink: It is a platform for distributed computing and, in particular, an open-source stream processing engine. It performs well in the case of unordered streaming data. It is easy to run on thousands of nodes with good throughput and latency characteristics.
Apache Storm: It is an open-source platform for real-time distributed computation. Apache Storm is easy to use and set up. It is scalable, fault-tolerant, and compatible with any programming language.
H2O: H2O is a fast in-memory data processing engine. H2O is used for predictive analysis on large amounts of data. It is an open-source, scalable, and distributed software that can be deployed on various nodes.

These processing engines are measured on the basis of associated ML tools, latency, supported languages, fault tolerance, and execution model. Latency is the time duration between initiating a task and receiving the output. Fault tolerance is the mechanism of a system that permits it to continue working appropriately when a few modules fail. Among the mentioned data processing engines, all except Hadoop offer in-memory processing, fault tolerance, and low latency. Stream processing can be done with Spark, Flink, and Storm, whereas H2O and Spark support Python as a programming language.

The study shows that Apache Spark supports ML tools such as MLlib, Mahout, and H2O, which contain various libraries for ML, image processing, etc. Spark performs in-memory processing, which provides results faster. It offers better fault tolerance and lower latency in comparison to other processing engines. Therefore, in the proposed work Apache Spark is used to process the high-throughput microarray data.
In the research work, a CAGT is proposed to predict gene expression levels and reduce breast cancer risk using ML and big data analytics.


Materials and Methods

In the work carried out, a CAGT is generated on the basis of a clinical report by using ML techniques. ML is a part of artificial intelligence (AI) in which a machine learns from its previous experiences. As per Mitchell [14], in ML, machines learn a task on the basis of past experience, and their performance can be evaluated with parameters such as accuracy. ML approaches work in two stages: (1) analysis of the dataset to find out the dependencies of a model and (2) output prediction of the model on the basis of the projected dependencies. ML is applied in medical research for numerous uses, where a suitable hypothesis is acquired for a biomedical sample dataset over a multi-dimensional space using distinctive algorithms [15, 34, 35].
In the work, gene selection is performed using the correlation coefficient and mutual information (a ranking method). The explained variance ratio is used to find the relevant genes for breast cancer. Subsequently, significant genes that have more diagnostic power are selected. The regression technique is applied to predict the gene expressions using the clinical outcomes of a patient. Then classification is performed to classify cancer using the predicted gene expressions.
The proposed work is implemented on Apache Spark to make the technique fast and scalable, using the R programming language and the sparklyr package [36].
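As a minimal sketch of this setup (assuming a local Spark installation and hypothetical file names, not the authors' exact configuration), the clinical and gene expression tables can be loaded into Spark from R through sparklyr as follows:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")                                          # or a cluster master URL
clinical_tbl <- spark_read_csv(sc, name = "clinical", path = "clinical.csv")   # hypothetical path
genes_tbl    <- spark_read_csv(sc, name = "genes", path = "gene_expression.csv")  # hypothetical path
clinical_tbl %>% count()                                                       # quick sanity check of the loaded table
spark_disconnect(sc)

Subsequent modeling steps can then operate on these Spark tables or on local data frames collected from them.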

Relevant Gene Selection
Genes are sections of DNA that are carried on chromosomes and determine specific human characteristics, such as hair color, height, and genetic disease. In the research work, commonly established feature selection algorithms are used to locate the most significant genes.

3.1.1 SVM-RFE_MI gene selection technique
SVM-RFE is a wrapper feature selection approach based on the support vector machine (SVM). It uses a classification algorithm to select the features [37–39]. The purpose of SVM-RFE is to compute ranking weights for all features and sort the features according to the weight vectors as the classification basis. SVM-RFE removes the least informative features and improves the classification accuracy [40]. The SVM-RFE feature selection technique follows three steps: (1) processing of the specific dataset for classification, (2) weight estimation of each feature, and (3) deletion of features with low weight in order to obtain the ranking of features, as shown below [40]:

Input
Training data points: X = [x1, x2, ..., xn]^T
Labels: Y = [y1, y2, ..., yn]^T
Complete feature set: S = [1, 2, 3, ..., n]
Ranked feature list: R = []

Sorting of selected features
Repeat the procedure until S = [] (all features have been ranked):
Restrict the training data to the surviving features: X1 = X(:, S).
Train the classification model: α = SVM-train(X1, Y).
Weight estimation: w = Σ_k α_k y_k x_k
Ranking criterion: c_i = (w_i)^2
Find the feature with the minimum criterion: m = arg min(c).
Update the ranked feature list: R = [S(m), R].
Eliminate the feature with the minimum weight: S = S(1 : m-1, m+1 : length(S)).

Outcome
This procedure returns the ranked feature list. In each iteration, the feature with the smallest criterion (w_i)^2 is deleted. SVM-RFE is executed repeatedly until the complete feature ranking is obtained.
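A compact R sketch of this ranking loop is given below. It assumes a numeric matrix X of gene expressions with gene names as column names and a binary factor y of class labels, and it uses a linear SVM from the e1071 package; this is an illustrative implementation, not the authors' exact code. In practice, chunks of features are often removed per iteration to keep the loop tractable for thousands of genes.

library(e1071)

svm_rfe_rank <- function(X, y) {
  surviving <- colnames(X)          # features still in play
  ranked    <- character(0)         # final ranking, best feature first
  while (length(surviving) > 0) {
    fit   <- svm(X[, surviving, drop = FALSE], y, kernel = "linear", scale = TRUE)
    w     <- t(fit$coefs) %*% fit$SV        # weight vector of the linear SVM
    crit  <- as.numeric(w)^2                # ranking criterion (w_i)^2
    worst <- surviving[which.min(crit)]     # least informative surviving feature
    ranked    <- c(worst, ranked)           # prepend, so the last-removed (best) feature ends up first
    surviving <- setdiff(surviving, worst)
  }
  ranked
}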

In the research work, feature selection is performed on the genomic data. The genomic data has size 1,980×24,368, where 1,980 is the number of samples and 24,368 the number of genes. After applying SVM-RFE, the size of the dataset is reduced to 1,980×120. Then the mutual information (MI) statistical technique is applied to rank the genes, and higher-ranked (0.9 to 1) features are extracted. The MI-sorted list provides 18 breast cancer specific genes among the 120 selected genes. These gene selection techniques provide the relevant genes of breast cancer, but the most significant genes, those with higher variance and predictive ability, still need to be identified.
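A small sketch of the MI ranking step is shown below, assuming the 120 SVM-RFE-selected genes sit in a numeric matrix X_rfe and the labels in a factor y; it uses the infotheo package's empirical estimator, and rescaling the scores into (0, 1] is an assumption made for illustration.

library(infotheo)

mi_scores <- apply(X_rfe, 2, function(g)
  mutinformation(discretize(g), as.integer(y)))   # MI between each discretized gene and the label
mi_rank   <- mi_scores / max(mi_scores)           # rescale so the best gene has rank 1
top_genes <- names(mi_rank)[mi_rank >= 0.9]       # keep the 0.9-1 band, as described above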

3.1.2 Explained variance
Explained variance is the part of the total variance that is accounted for by the attributes of the dataset [41]. A higher percentage of explained variance indicates a stronger attribute; such attributes can give better predictions of cancer. In the research work, explained variance is calculated by principal component analysis (PCA) because it emphasizes variation in the dataset and reveals strong patterns. It provides the top-k genes that account for most of the explained variance of the dataset. According to the literature, an acceptable cumulative variance is 70% [42].
In the proposed work, the top-5 genes with 98.5% explained variance are selected, as the variance increases only slightly beyond five genes while the computation cost grows.
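The explained variance ratios can be obtained with base R's prcomp, as in the sketch below; X_sel is an assumed matrix holding the 18 genes selected by SVM-RFE_MI.

pca <- prcomp(X_sel, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)    # proportion of variance per component
cumsum(var_explained)[1:5]                       # cumulative share of the first five components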

3.1.3 Principal component analysis
PCA incorporates the overall variation in data samples and transforms the original attributes into a smaller set of linear combinations [43]. The smaller set retains the significant details of the dataset. PCA is commonly used when the goal is to identify a reduced feature set that captures the highest amount of variance, especially when performing multivariate analysis. PCA retains only the first C principal components out of the total P attributes. PCA is an orthogonal transformation that projects the data from a P-dimensional to a C-dimensional subspace. In this transformation, the remaining (P-C) components are discarded; these carry the least variability of the data. In the research work, PCA is selected due to the following characteristics [44].
PCA maximizes the variance captured by the first C components, so the discarded (P-C) components carry minimal variance. The first C components are selected because they have the highest variance among all principal components and therefore the power to reveal insights in the dataset. In this proposed work, five components are selected as significant genes. These genes carry 98.5% of the variance among the 18 predictors, as shown in Table 2. The reduced dataset is trained with an SVM classifier, and the performance of the model is evaluated using prediction accuracy, precision, and recall.

Table 2. Top-5 genes with 98.5% cumulative explained variance

Gene Explained variance (%) Cumulative variance (%) Description
PIK3CA 38 38 Regulation of hormones and maturation of cells
TP53 34 72 Guardian of the genome
GATA3 14 86 Independent prognostic marker
AURKA 6.5 92.5 Serine-threonine kinases
PTEN 6 98.5 Tumor suppressor gene


Regression
The regression technique is applied to predict a dependent variable on the basis of multiple independent variables [45]. In the research work, a gene is the dependent variable, whereas the clinical outcomes are the independent variables. In the proposed work, the least absolute shrinkage and selection operator (LASSO) regression technique is used to shrink the coefficients and remove some of them, which reduces variance with little increase in bias. It is especially useful when the dataset has a small number of samples.

3.2.1 Least absolute shrinkage and selection operator
It is a linear regression approach that has the benefit of shrinkage [46]. In shrinkage, coefficient estimates are shrunk towards a central value. The LASSO technique delivers a sparse model that contains a reduced feature set. This regression approach is appropriate for multicollinear datasets. It selects the feasible parameters and performs the analysis on these parameters.
The LASSO regression technique performs L1 regularization: the loss function is penalized by the sum of the absolute values of the coefficients, as illustrated in Equation (1) [47]:

\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|    (1)

Here λ is a regulating factor that determines the severity of the penalty.
LASSO improves model interpretability while increasing prediction accuracy; therefore, it is a good fit for multivariate regression models [46]. After the gene expressions of the selected genes are predicted, a classification technique is applied to classify breast cancer on the basis of these predicted gene expressions.
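A brief LASSO sketch with the glmnet package is given below (alpha = 1 gives the L1 penalty of Equation (1)); clin and clin_test are hypothetical clinical design matrices and expr_gene is one gene's expression vector.

library(glmnet)

cv_fit    <- cv.glmnet(x = as.matrix(clin), y = expr_gene, alpha = 1)            # cross-validated LASSO fit
pred_expr <- predict(cv_fit, newx = as.matrix(clin_test), s = "lambda.min")      # predicted gene expression

Choosing lambda by cross-validation (lambda.min above) is one common way to set the regulating factor λ; the value actually used in the paper is not specified.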

Classification
Classification is an ML technique that categorizes data points into labels according to their type. The medical dataset contains meaningful biomarkers that help in the categorization of data. In the proposed work, the stacking ensemble technique is applied for classification.

3.3.1 An ensemble technique using stacking
Ensemble learning is an ML technique in which numerous models are trained to address the same problem and are then combined to get better results [48]. The main assumption of ensemble learning is that a more accurate and/or robust model is obtained when weak classifier models are correctly combined.

3.3.2 Weak models
Weak ML models are identified using the bias/variance trade-off [49]. Low variance and low bias are the two most essential features of an ML model. The degrees of freedom in an ML model should be sufficient to capture the underlying complexity of the data, but not so many that the variance becomes high. This is the well-known trade-off between bias and variance. Weak learners exhibit either high bias (too few degrees of freedom) or high variance (too many degrees of freedom) to be reliable on their own. Ensemble learning is then applied to reduce the bias or variance of weak learners by combining several of them into a strong learner with better performance. In the proposed research work, the stacking ensemble technique is used.

3.3.3 Stacking
Stacking is a technique [50] that extracts the outcomes of multiple ML models and combines them to generate a new, stronger model for better prediction. The ensemble model is applied to assemble the predictions on the test dataset. The key steps of the stacking ensemble technique are given below (an R sketch follows the list):

The dataset is divided into two parts: a training set and a test set.

The training set is further split into 10 subparts.

A base classifier model is trained on nine parts and evaluated on the remaining part. This step is repeated for each part of the training data.

In this way, the base model is trained on the whole training dataset.

Predictions are then made with this model on the test set.

Steps 3–5 are repeated with a new base model, which generates a new set of predictions for the training and test sets.

To create a new model, the predictions on the training set are used as features.

On the test set, this newly created model is used to calculate the final predictions.
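The sketch below condenses these steps in R with two illustrative base learners (a random forest and a linear SVM) and a logistic-regression meta-model; the actual study combines random forest, KNN, SVC, and MLP, so the learner set, variable names, and fold handling here are assumptions for illustration only.

library(randomForest)
library(e1071)

stack_fit <- function(X, y, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(X)))      # 10-fold split of the training set
  meta  <- data.frame(rf = numeric(nrow(X)), svm = numeric(nrow(X)))
  for (k in 1:K) {
    tr <- folds != k
    rf <- randomForest(X[tr, ], y[tr])
    sv <- svm(X[tr, ], y[tr], kernel = "linear", probability = TRUE)
    meta$rf[!tr]  <- predict(rf, X[!tr, ], type = "prob")[, 2]                        # out-of-fold probability (2nd class level)
    meta$svm[!tr] <- attr(predict(sv, X[!tr, ], probability = TRUE), "probabilities")[, 2]
  }
  list(
    meta_model = glm(y ~ ., data = cbind(meta, y = y), family = binomial),  # stacker over base predictions
    base_rf    = randomForest(X, y),                                        # base models refit on all data
    base_svm   = svm(X, y, kernel = "linear", probability = TRUE)
  )
}

At prediction time, the refit base models produce probabilities for the test samples, and the meta-model combines those probabilities into the final class.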



Experimental Design

Dataset
In the research work, the METABRIC high-throughput sequencing breast cancer dataset [47, 51] is used. The dataset is available in the cBioPortal database and is a multi-dimensional breast cancer dataset. METABRIC has 1,980 data samples that contain both clinical and genomic information; among them are 548 normal breast tissue samples and 1,432 primary breast tumor samples. Every patient has 27 clinical attributes, such as positive lymph nodes, grade, size, and age at diagnosis, and 24,368 gene expression values. Missing values are imputed using the PC-ImNN imputation technique [52]. Clinical data are normalized to the range of 0 to 1 by min-max normalization [53] (a brief sketch follows Table 3). Table 3 presents the details of the METABRIC breast cancer data.

Table 3. Overall information of breast cancer dataset

Characteristic Value
Type of dataset Microarray data
Attribute type Real
Number of genes 24,368
Cut-off (yr) 5
Short-term survival (less than 5 years) 491 patients
Task associated Classification and Regression
Number of samples 1,980
Clinical attribute 27
Diagnosis median age (yr) 61
Average survival (mo) 125.1
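Min-max normalization maps each numeric clinical attribute into [0, 1]; a short R sketch is shown below, where clinical_numeric is a hypothetical data frame holding the numeric clinical columns.

min_max <- function(v) (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
clinical_scaled <- as.data.frame(lapply(clinical_numeric, min_max))   # every column rescaled to [0, 1]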

Performance Parameters
In the research work, two performance parameters, adjusted R-squared and accuracy, are applied to assess the performance of the regression model and the classification model, respectively.

4.2.1 Adjusted R-squared
In a regression model, adjusted R-squared measures the proportion of variance in the dependent variable that is explained by the independent variables, adjusted for the number of predictors [32, 39, 54–56]. It is a statistical metric that considers only the significant predictors that genuinely affect the dependent variable; as a result, it performs well for multivariate regression models. It is computed as follows [54]:

R^2_{adj} = 1 - \frac{(1 - R^2)(s - 1)}{s - v - 1}    (2)

where v signifies the number of independent variables and s denotes the total number of samples. The adjusted R-squared value lies between 0 and 1; if the value is close to 1, it indicates that the predicted regression line is close to the actual regression line [57–61].
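Equation (2) is straightforward to compute; the small R helper below illustrates it with purely hypothetical input values.

adjusted_r2 <- function(r2, s, v) 1 - (1 - r2) * (s - 1) / (s - v - 1)   # Equation (2)
adjusted_r2(r2 = 0.95, s = 1980, v = 27)                                 # illustrative values, not results from the paper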

4.2.2 Accuracy
According to the International Organization for Standardization (ISO), accuracy is the trueness of the model [16]. It combines both systematic (observational) and random error, so high accuracy requires both high trueness and high precision [17, 62]. It is used to calculate the proportion of samples that are correctly categorized [62]:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (3)

In Equation (3), TP, TN, FP, and FN are true positives, true negatives, false positives, and false negatives, respectively [18, 63, 64]. Accuracy is the sum of TPs and TNs divided by the sum of TPs, TNs, FNs, and FPs. For a good classifier, both the TN rate and the TP rate should be close to 100%.
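As a trivial R sketch of Equation (3) with made-up confusion counts:

accuracy <- function(tp, tn, fp, fn) (tp + tn) / (tp + tn + fp + fn)   # Equation (3)
accuracy(tp = 90, tn = 95, fp = 5, fn = 10)                            # illustrative counts only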


Results and Discussion

The findings of the proposed experiments are described in this section. A gene selection approach, SVM-RFE_MI, is designed to find the significant genes associated with breast cancer. The designed technique is built by combining SVM-RFE, explained variance, and the MI rank method. SVM-RFE is used to eliminate non-functional genes, and the MI method is used to rank genes between 0 and 1. Explained variance using PCA is applied to find the most significant genes.
In the experiment, SVM-RFE_MI provides 18 significant genes from a dataset containing 24,368 genes. The model is then refined to identify the genes with the highest explained variance using PCA, which provides the top-5 significant genes as shown in Table 4. These five selected genes contain 98.5% of the variance, which is within the acceptable range.

Table 4. Top-5 genes' correlation matrix

Gene PIK3CA TP53 GATA3 AURKA PTEN
PIK3CA 1 0.26 0.31 -0.12 0.41
TP53 - 1 0.19 0.28 0.21
GATA3 - - 1 0.37 0.31
AURKA - - - 1 -0.03
PTEN - - - - 1

Table 5. Estimation of the predicted genes using the LASSO regression technique on the basis of adjusted R-squared
Gene Adjusted R-squared
PIK3CA 0.95
TP53 0.98
GATA3 0.87
AURKA 0.89
PTEN 0.92


The brief description of each gene is shown below.
PIK3CA: It is the most recurrently mutated gene in breast cancer [65]. PIK3CA provides the instructions to make the p110 alpha protein, a subunit of the PI3K enzyme, which adds a phosphate group (a cluster of oxygen and phosphorus atoms) to other molecules to carry out PI3K signaling. The actions of PI3K include sending signals for many cell activities, including migration (movement) of cells, cell proliferation (division) and growth, transport of materials within cells, production of new proteins, and cell survival.
TP53: It is also known as the guardian of the genome. The TP53 gene gives instructions for making the protein p53 [66]. p53 protein works for tumor suppressors. TP53 regulates cell division by preventing cells from proliferating (dividing) and speedily developing in an uncontrolled way. All cells in the body contain the p53 protein, which binds to DNA directly in the nucleus [67].
GATA3: It regulates the expression of a variety of biologically and clinically significant genes [68]. According to a recent study, GATA3 has been linked to favorable breast cancer pathologic features, such as ER positivity and negative lymph node status. It has also been found to be a standalone prognostic marker, with low expression indicating a higher probability of breast cancer recurrence [69].
AURKA: It is a serine-threonine kinase that is significant for cell division. AURKA's major role is to control chromosomal segregation during mitosis. AURKA kinase mutations cause cell division failure and impair cellular progression [64, 70].
PTEN: This gene encodes the protein PTEN (phosphatase and tensin homolog). Mutation in the PTEN gene leads to many cancers, including breast cancer [71]. It works as a tumor suppressor gene through its phosphatase activity: the protein prevents cells from dividing and growing too rapidly and regulates the cell cycle. It is a target for various anticancer drugs [72–74].
These selected genes are verified against the Genetics Home Reference and Cancer Genetics Web [75, 76]. These genes have been targeted for breast cancer detection by researchers. The identified genes have high predictive power and variance for breast cancer prediction at an early stage. The selected genes are weakly correlated with each other; the relation between each pair of genes is represented in Table 4. As a result, if experts target these genes, cancer can be detected at an earlier stage.
According to Table 4, these genes are weakly correlated, so the prediction is performed individually on each gene. In the work carried out, the multivariate LASSO regression technique is used to predict the expression of each gene. In the LASSO model, each gene is considered the dependent variable and all the attributes of the clinical dataset are considered independent variables. The results of the LASSO model in terms of adjusted R-squared are represented in Table 5. The LASSO regression technique achieves good prediction accuracy as it works with independent attributes.
As per Table 5, LASSO gives a minimum adjusted R-squared of 0.87 (an error of 0.13) for the GATA3 gene; according to the study [77], the adjusted R-squared should be close to one and greater than 0.7, which indicates that the model's error is acceptable. Classification is now performed on the basis of these predicted gene expressions.
In the research work, a stacking ensemble technique is used to classify the breast cancer stage on the basis of the predicted gene expressions. In this stacking technique, the random forest [78], k-nearest neighbor (KNN) [79], support vector classifier (SVC) [80], and multilayer perceptron (MLP) [81] ML techniques are combined into the ensemble [82]. Table 6 shows the classification accuracy, precision, and recall of the ensemble technique and the other ML techniques.

Table 6. Comparative analysis of classification techniques to classify predicted genes
Technique Accuracy Precision Recall
Random forest 0.9 0.81 0.48
KNN 0.89 0.8 0.47
SVC 0.91 0.88 0.49
MLP 0.95 0.88 0.51
Ensemble 0.98 0.89 0.52


Fig. 1. Performance estimation of cancer classification on the basis of accuracy.


The results show that the predicted gene expressions are able to diagnose breast cancer to a useful extent. This will be beneficial for patients who are at risk of developing breast cancer in the future or who face financial constraints. Fig. 1 presents the classification results graphically and shows that the ensemble technique outperformed the individual ML techniques in classifying the predicted gene expressions.


Conclusion

The most common malignancy among women is breast cancer. It occurs when breast cells start to grow abnormally because of gene mutations. A genomic test is preferred to determine the gene mutation at the primary stage, but the test is time-consuming and expensive in developing nations. In the work carried out, a novel CAGT is proposed to predict gene expression in reduced time and at reduced cost. The proposed method provides the expression of significant genes and diagnoses the cancer stage. The model helps to categorize the samples into benign and malignant classes and, along with this, reduces the risk of breast cancer by identifying gene mutations at the primary stage. In the CAGT method, the SVM-RFE_MI approach is proposed for gene selection. The LASSO regression technique is used to predict gene expression, followed by the stacking ensemble strategy to categorize cancer. The prediction of gene expression is evaluated using adjusted R-squared, and the classification of cancer by the accuracy performance parameter. As per the results, the adjusted R-squared is found to be within the standard acceptable range, and the ensemble technique outperforms other ML approaches in terms of accuracy. This signifies that the test method provides better gene prediction. The proposed technique will provide reports promptly after the clinical results, at no additional cost, which is helpful for patients who are suffering from breast cancer. The technique will help to reduce the mortality rate by diagnosing cancer at an early stage. The work can also inform the performance and stability analysis of ensemble feature selection for cancer prediction. Future trials can explore other applications and related datasets in the prediction domain. A heterogeneous ensemble gene selection technique and other similarity measures can be used in future experiments, and advanced ensembles can be explored using hybrid techniques. Protein data can be included to obtain disease prediction at greater depth.


Author’s Contributions

Conceptualization, MG, BG. Funding acquisition, KK. Investigation and methodology, BG, IB. Project administration, BG. Resources, MG, BG, DG. Supervision, BG, DG. Writing of the original draft, MG. Writing of the review and editing, BG, IB. Software, MG. Validation, BG. Formal analysis, KK, CI. Visualization, MG.


Funding

This research is supported by Symbiosis International University, Pune, Maharashtra, India.


Competing Interests

The authors declare that they have no competing interests.


Author Biography


Author
Name: Ms. Madhuri Gupta
Affiliation: School of Computer Science Engineering & Technology (SCSET), Bennett University, Greater Noida.
Biography: She is working as an assistant professor in the School of Computer Science Engineering & Technology (SCSET), Bennett University, Greater Noida. She received her PhD from the Jaypee Institute of Information Technology, Noida, India. Her research areas are machine learning, big data analytics, and bioinformatics. Her research articles are published in reputed journals and conferences, and she has published various patents in the same domain. She is a professional member of ACM, India.

Author
Name: Dr. Bharat Gupta
Affiliation: Jaypee Institute of Information Technology, Noida.
Biography: Dr. Gupta is working as an assistant professor in the Department of CS&IT, Jaypee Institute of Information Technology, Noida. He has completed PhD from University of Westminster, London. His research area is high performance computing, machine learning, deep learning and data analytics. He has worked on several projects in the Machine Learning domain.

Author
Name: Dr. Ishan Budhiraja
Affiliation: School of Computer Science Engineering & Technology (SCSET), Bennett University, Greater Noida.
Biography: ISHAN BUDHIRAJA is currently working as an Assistant Professor in the School of Computer Science Engineering and Technology, Bennett University, Greater Noida, Uttar Pradesh, India. He received the B.Tech. degree in electronics and communication engineering from Uttar Pradesh Technical University, Lucknow, India, in 2008, the M.Tech. degree in electronics and communication engineering from Maharishi Dayanand University, Rohtak, Haryana, in 2012, and the Ph.D. degree in computer science and engineering from the Thapar Institute of Engineering and Technology, Patiala, India in 2021. Some of his research findings are published in top-cited journals, such as IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE TRANSACTIONS ON MOBILE COMPUTING, IEEE INTERNET OF THINGS, IEEE Wireless Communication Magazine, IEEE SYSTEMS JOURNAL, and various international top-tiered conferences, such as IEEE GLOBECOM, IEEE ICC, IEEE WCMC, ACM, and IEEE Infocom. His research interests include device-to-device communications, the Internet of Things, non-orthogonal multiple access, femtocells, machine learning, and deep reinforcement learning.

Author
Name: Prof. (Dr.) Deepak Garg
Affiliation: School of Computer Science Engineering & Technology (SCSET), Bennett University, Greater Noida.
Biography: Deepak Garg is a Professor, and the Head of the Computer Science Engineering Department, Bennett University, India, and the Head of the NVIDIA-Bennett Center of Research on Artificial Intelligence. His active research interests are designing efficient deep learning algorithms and quality in higher education. He served as the Chair of the IEEE Computer Society, India Council, from 2012 to 2016, and on the Board of Governors of the IEEE Education Society, USA, from 2013 to 2015. He has managed research funding of INR 30 million.

Author
Name: Prof. (Dr.) Ketan Kotecha
Affiliation: Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Lavale, Pune, Maharashtra, India.
Biography: KETAN KOTECHA received the M.Tech. and Ph.D. degrees from the IIT Bombay. He is currently the Director of the Symbiosis Institute of Technology, and the Dean of the Faculty of Engineering, Symbiosis International (Deemed University). He also heads the Symbiosis Centre for Applied Artificial Intelligence (SCAAI). He has expertise and experience in cutting-edge research and projects in AI and deep learning for the last 25 years. He was a recipient of numerous prestigious awards. He is an Associate Editor of IEEE ACCESS.

Author
Name: Dr. Celestine Iwendi
Affiliation: Computing Dept. School of Creative Technologies, University of Bolton, Bolton, UK, A676 Deane Rd, Bolton BL3 5AB, United Kingdom
Biography: CELESTINE IWENDI is a visiting Professor with Coal City University Enugu, Nigeria and Associate Professor (Senior Lecturer) with the School of Creative Technologies, University of Bolton, United Kingdom. He is also a Fellow of the Higher Education Academy, United Kingdom and a Fellow of the Institute of Management Consultants. Other assignments include Adjunct Professor, Delta State Polytechnic, Nigeria. His past assignments include Associate Professor, Bangor College China; Senior Researcher, WSN Consults Ltd, United Kingdom; Electronics and Sensor Researcher, University of Aberdeen, United Kingdom; Lead Electronics Engineer, Insentient Ltd, United Kingdom; Electronics Engineer, Nitel Nigeria; and College Lecturer, Federal Girls Technical College, Uyo, Nigeria. He received his PhD in Electronics from the University of Aberdeen, United Kingdom, an MSc from Uppsala University, Sweden, an MSc from Nnamdi Azikiwe University, Nigeria, and a BSc from Nnamdi Azikiwe University, Nigeria. He is supervising/co-supervising several graduate (MS and PhD) students and is a mentor to ACM student members at SAGE University Indore, India. His research interests include artificial intelligence, the Internet of Things, wireless sensor networks, network/cybersecurity, machine learning, and data networks. He has authored over 60 peer-reviewed articles. He has served as a chair (program, publicity, and track) at top conferences and workshops. He has delivered over 25 invited and keynote talks in five countries. He is a Distinguished Speaker, ACM; Senior Member, IEEE; Member, ACM; Member, IEEE Computational Intelligence Society; Senior Member, Swedish Engineers; Member, Nigeria Society of Engineers; and Member, Smart Cities Community, IEEE. He is currently serving as the Newsletter Editor and a Board Member of the IEEE Sweden section. He is a student branch counselor and mentors several PhD students in AI, robotics, the Internet of Things, and blockchain. He has developed operational, maintenance, and testing procedures for electronic products, components, equipment, and systems, and provided technical support and instruction to several organizations. He is a community developer, philanthropist, and international speaker at many top conferences and webinars. Dr Celestine is listed among the top 2% of scientists.


References

[1] D. Hanahan and R. A. Weinberg, “Hallmarks of cancer: the next generation,” Cell, vol. 144, no. 5, pp. 646-674, 2011.
[2] Breast Cancer India, “Statistics of breast cancer in India: global comparison,” 2018 [Online]. Available: https://www.breastcancerindia.net/statistics/stat_global.html.
[3] S. Sharma, “The cost of genetic testing for cancer has to come down,” 2018 [Online]. Available: https://www.livemint.com/Politics/LSN7wtUjRj3iR0ZDk5ncZO/The-cost-of-genetic-testing-for-cancer-has-to-come-down.html.
[4] J. D. Hoheisel, “Microarray technology: beyond transcript profiling and genotype analysis,” Nature Reviews Genetics, vol. 7, no. 3, pp. 200-210, 2006.
[5] T. Zeng and J. Liu, “Mixture classification model based on clinical markers for breast cancer prognosis,” Artificial Intelligence in Medicine, vol. 48, no. 2-3, pp. 129-137, 2010.
[6] L. T. Scaria and T. Christopher, “A bio-inspired algorithm based multi-class classification scheme for microarray gene data,” Journal of Medical Systems, vol. 43, article no. 208, 2019. https://doi.org/10.1007/s10916-019-1353-y
[7] M. Jansi Rani and D. Devaraj, “Two-stage hybrid gene selection using mutual information and genetic algorithm for cancer data classification,” Journal of Medical Systems, vol. 43, article no. 235, 2019. https://doi.org/10.1007/s10916-019-1372-8
[8] S. A. Armstrong, J. E. Staunton, L. B. Silverman, R. Pieters, M. L. den Boer, M. D. Minden, et al., “MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia,” Nature Genetics, vol. 30, no. 1, pp. 41-47, 2002.
[9] P. Ferreira, I. Dutra, R. Salvini, and E. Burnside, “Interpretable models to predict breast cancer,” in Proceedings of 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Shenzhen, China, 2016, pp. 1507-1511.
[10] S. Kim, E. R. Dougherty, Y. Chen, K. Sivakumar, P. Meltzer, J. M. Trent, and M. Bittner, “Multivariate measurement of gene expression relationships,” Genomics, vol. 67, no. 2, pp. 201-209, 2000.
[11] S. Muro, I. Takemasa, S. Oba, R. Matoba, N. Ueno, C. Maruyama, et al., “Identification of expressed genes linked to malignancy of human colorectal carcinoma by parametric clustering of quantitative expression data,” Genome Biology, vol. 4, article no. R21, 2003. https://doi.org/10.1186/gb-2003-4-3-r21
[12] Z. H. Zhou and Y. Jiang, “Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble,” IEEE Transactions on Information Technology in Biomedicine, vol. 7, no. 1, pp. 37-42, 2003.
[13] V. Chaurasia and P. Saurabh, “Data Mining Techniques: To Predict and Resolve Breast Cancer Survivability”, International Journal of Computer Science and Mobile Computing, vol. 3, no. 1, pp. 10-22, 2017. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2994925
[14] T. M. Mitchell, “The discipline of machine learning,” School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, Report No. CMU-ML-06-108, 2006.
[15] F. Cabitza, R. Rasoini, and G. F. Gensini, “Unintended consequences of machine learning in medicine,” JAMA, vol. 318, no. 6, pp. 517-518, 2017.
[16] S. Garcia, A. Fernandez, J. Luengo, and F. Herrera, “A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability,” Soft Computing, vol. 13, pp. 959-977, 2009.
[17] M. Babar, M. S. Khan, U. Habib, B. Shah, F. Ali, and D. Song, “Scalable edge computing for IoT and multimedia applications using machine learning,” Human-centric Computing and Information Sciences, vol. 11, article no. 41, 2021. https://doi.org/10.22967/HCIS.2021.11.041
[18] M. Zouina and B. Outtaj, “A novel lightweight URL phishing detection system using SVM and similarity index,” Human-centric Computing and Information Sciences, vol. 7, article no. 17, 2017. https://doi.org/10.1186/s13673-017-0098-1
[19] X. Xu, Y. Zhang, L. Zou, M. Wang, and A. Li, “A gene signature for breast cancer prognosis using support vector machine,” in Proceedings of 2012 5th International Conference on Biomedical Engineering and Informatics, Chongqing, China, 2012, pp. 928-931.
[20] M. Liang, Z. Li, T. Chen, and J. Zeng, “Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 4, pp. 928-937, 2015.
[21] C. Kandaswamy, L. M. Silva, L. A. Alexandre, and J. M. Santos, “High-content analysis of breast cancer using single-cell deep transfer learning,” Journal of Biomolecular Screening, vol. 21, no. 3, pp. 252-259, 2016.
[22] T. Ayer, O. Alagoz, J. Chhatwal, J. W. Shavlik, C. E. Kahn, and E. S. Burnside, “Breast cancer risk estimation with artificial neural networks revisited: discrimination and calibration,” Cancer, vol. 116, no. 14, pp. 3310-3321, 2010.
[23] J. Chen, K. Li, Z. Tang, K. Bilal, S. Yu, C. Weng, and K. Li, “A parallel random forest algorithm for big data in a spark cloud computing environment,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 4, pp. 919-933, 2017.
[24] T. Shaikhina and N. A. Khovanova, “Handling limited datasets with neural networks in medical applications: a small-data approach,” Artificial Intelligence in Medicine, vol. 75, pp. 51-63, 2017.
[25] J. Listgarten, S. Damaraju, B. Poulin, L. Cook, J. Dufour, A. Driga, et al., “Predictive models for breast cancer susceptibility from multiple single nucleotide polymorphisms,” Clinical Cancer Research, vol. 10, no. 8, pp. 2725-2737, 2004.
[26] K. Park, A. Ali, D. Kim, Y. An, M. Kim, and H. Shin, “Robust predictive model for evaluating breast cancer survivability,” Engineering Applications of Artificial Intelligence, vol. 26, no. 9, pp. 2194-2205, 2013.
[27] C. Park, J. Ahn, H. Kim, and S. Park, “Integrative gene network construction to analyze cancer recurrence using semi-supervised learning,” PLoS One, vol. 9, no. 1, article no. e86309, 2014. https://doi.org/10.1371/journal.pone.0086309
[28] J. Kim and H. Shin, “Breast cancer survivability prediction using labeled, unlabeled, and pseudo-labeled patient data,” Journal of the American Medical Informatics Association, vol. 20, no. 4, pp. 613-618, 2013.
[29] W. Kim, K. S. Kim, J. E. Lee, D. Y. Noh, S. W. Kim, Y. S. Jung, M. Y. Park, and R. W. Park, “Development of novel breast cancer recurrence prediction model using support vector machine,” Journal of Breast Cancer, vol. 15, no. 2, pp. 230-238, 2012.
[30] O. Gevaert, F. D. Smet, D. Timmerman, Y. Moreau, and B. D. Moor, “Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks,” Bioinformatics, vol. 22, no. 14, pp. e184-e190, 2006. https://doi.org/10.1093/bioinformatics/btl230
[31] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin, “LIBLINEAR: a library for large linear classification,” The Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008.
[32] A. Makkar, U. Ghosh, D. B. Rawat, and J. H. Abawajy, “FedLearnSP: preserving privacy and security using federated learning and edge computing,” IEEE Consumer Electronics Magazine, vol. 11, no. 2, pp. 21-27, 2022.
[33] M. Gupta and B. Gupta, “Survey of breast cancer detection using machine learning techniques in big data,” Journal of Cases on Information Technology, vol. 21, no. 3, pp. 80-92, 2019.
[34] A. Rajkomar, J. Dean, and I. Kohane, “Machine learning in medicine,” New England Journal of Medicine, vol. 380, no. 14, pp. 1347-1358, 2019.
[35] C. Iwendi, S. Khan, J. H. Anajemba, A. K. Bashir, and F. Noor, “Realizing an efficient IoMT-assisted patient diet recommendation system through machine learning model,” IEEE Access, vol. 8, pp. 28462-28474, 2020.
[36] Posit, “sparklyr: R interface for Apache Spark,” 2016 [Online]. Available: https://www.rstudio.com/blog/sparklyr-r-interface-for-apache-spark/.
[37] H. Sanz, C. Valim, E. Vegas, J. M. Oller, and F. Reverter, “SVM-RFE: selection and visualization of the most relevant features through non-linear kernels,” BMC Bioinformatics, vol. 19, article no. 432, 2018. https://doi.org/10.1186/s12859-018-2451-4
[38] A. Makkar, M. S. Obaidat, and N. Kumar, “FS2RNN: feature selection scheme for web spam detection using recurrent neural networks,” in Proceedings of 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, UAE, 2018, pp. 1-6.
[39] P. Kumar, R. Kumar, G. Srivastava, G. P. Gupta, R. Tripathi, T. R. Gadekallu, and N. N. Xiong, “PPSF: a privacy-preserving and secure framework using blockchain-based machine-learning for IoT-driven smart cities,” IEEE Transactions on Network Science and Engineering, vol. 8, no. 3, pp. 2326-2341, 2021.
[40] Y. Chen, Z. Zhang, J. Zheng, Y. Ma, and Y. Xue, “Gene selection for tumor classification using neighborhood rough sets and entropy measures,” Journal of Biomedical Informatics, vol. 67, pp. 59-68, 2017.
[41] J. A. Rosenthal, Statistics and Data Interpretation for Social Work. New York, NY: Springer Publishing Company, 2012.
[43] H. Todorov, D. Fournier, and S. Gerber, “Principal components analysis: theory and application to gene expression data analysis,” Genomics and Computational Biology, vol. 4, no. 2, article no. e100041, 2018. https://doi.org/10.18547/gcb.2018.vol4.iss2.e100041
[44] R. Cheplyaka, “Explained variance in PCA,” 2017 [Online]. Available: https://ro-che.info/articles/2017-12-11-pca-explained-variance.
[45] S. Chatterjee and A. S. Hadi, Regression Analysis by Example, 5th ed. Hoboken, NJ: John Wiley & Sons, 2015.
[46] A. S. Dalalyan, M. Hebiri, and J. Lederer, “On the prediction performance of the lasso,” Bernoulli, vol. 23, no. 1, pp. 552-581, 2017. https://doi.org/10.3150/15-BEJ756
[47] cBioPortal, “Breast Cancer (METABRIC, Nature 2012 & Nat Commun 2016),” 2022 [Online]. Available: https://www.cbioportal.org/study/summary?id=brca_metabric.
[48] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma, “A survey on ensemble learning,” Frontiers of Computer Science, vol. 14, pp. 241-258, 2020.
[49] J. Rocca, “Ensemble methods: bagging, boosting and stacking,” 2019 [Online]. Available: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205.
[50] A. Ekbal and S. Saha, “Stacked ensemble coupled with feature selection for biomedical entity extraction,” Knowledge-Based Systems, vol. 46, pp. 22-32, 2013.
[51] E. J. Mucaki, K. Baranova, H. Q. Pham, I. Rezaeian, D. Angelov, A. Ngom, L. Rueda, and P. K. Rogan, “Predicting outcomes of hormone and chemotherapy in the molecular taxonomy of breast cancer international consortium (METABRIC) study by biochemically-inspired machine learning,” F1000Research, vol. 5, article no. 2124, 2016. https://doi.org/10.12688/f1000research.9417.3
[52] M. Gupta and B. Gupta, “A new scalable approach for missing value imputation in high-throughput microarray data on Apache Spark,” International Journal of Data Mining and Bioinformatics, vol. 23, no. 1, pp. 79-100, 2020.
[53] V. Rousson and N. F. Gosoniu, “An R-square coefficient based on final prediction error,” Statistical Methodology, vol. 4, no. 3, pp. 331-340, 2007.
[54] V. Rousson and N. F. Gosoniu, “An R-square coefficient based on final prediction error,” Statistical Methodology, vol. 4, no. 3, pp. 331-340, 2007.
[55] A. Makkar, N. Kumar, A. Y. Zomaya, and S. Dhiman, “SPAMI: a cognitive spam protector for advertisement malicious images,” Information Sciences, vol. 540, pp. 17-37, 2020.
[56] A. Makkar, U. Ghosh, P. K. Sharma, and A. Javed, “A fuzzy-based approach to enhance cyber defence security for next-generation IoT,” IEEE Internet of Things Journal, vol. 10, no. 3, pp. 2079-2086, 2023.
[57] A. Makkar, U. Ghosh, and P. K. Sharma, “Artificial intelligence and edge computing-enabled web spam detection for next generation IoT applications,” IEEE Sensors Journal, vol. 21, no. 22, pp. 25352-25361, 2021.
[58] I. Budhiraja, S. Tyagi, S. Tanwar, N. Kumar, and J. J. Rodrigues, “Tactile Internet for smart communities in 5G: an insight for NOMA-based solutions,” IEEE Transactions on Industrial Informatics, vol. 15, no. 5, pp. 3104-3112, 2019.
[59] I. Budhiraja, N. Kumar, S. Tyagi, S. Tanwar, Z. Han, M. J. Piran, and D. Y. Suh, “A systematic review on NOMA variants for 5G and beyond,” IEEE Access, vol. 9, pp. 85573-85644, 2021.
[60] I. Budhiraja, N. Kumar, and S. Tyagi, “ISHU: Interference reduction scheme for D2D mobile groups using uplink NOMA,” IEEE Transactions on Mobile Computing, vol. 21, no. 9, pp. 3208-3224, 2022.
[61] I. Budhiraja, N. Kumar, and S. Tyagi, “Energy-delay tradeoff scheme for NOMA-based D2D groups with WPCNs,” IEEE Systems Journal, vol. 15, no. 4, pp. 4768-4779, 2021.
[62] I. Budhiraja, N. Kumar, and S. Tyagi, “Deep-reinforcement-learning-based proportional fair scheduling control scheme for underlay D2D communication,” IEEE Internet of Things Journal, vol. 8, no. 5, pp. 3143-3156, 2021.
[63] M. Elhoseny, G. Ramírez-Gonzalez, O. M. Abu-Elnasr, S. A. Shawkat, N. Arunkumar, and A. Farouk, “Secure medical data transmission model for IoT-based healthcare systems,” IEEE Access, vol. 6, pp. 20596-20608, 2018.
[64] N. Rifi, N. Agoulmine, N. Chendeb Taher, and E. Rachkidi, “Blockchain technology: is it a good candidate for securing IoT sensitive medical data?,” Wireless Communications and Mobile Computing, vol. 2018, article no. 9763937, 2018. https://doi.org/10.1155/2018/9763937
[65] MedlinePlus, “PIK3CA gene,” 2021 [Online]. Available: https://medlineplus.gov/genetics/gene/pik3ca/.
[66] A. O. Giacomelli, X. Yang, R. E. Lintner, J. M. McFarland, M. Duby, J. Kim, et al., “Mutational processes shape the landscape of TP53 mutations in human cancer,” Nature Genetics, vol. 50, no. 10, pp. 1381-1387, 2018. https://doi.org/10.1038/s41588-018-0204-y
[67] MedlinePlus, “TP53 gene,” 2020 [Online]. Available: https://medlineplus.gov/genetics/gene/tp53/.
[68] D. Voduc, M. Cheang, and T. Nielsen, “GATA-3 expression in breast cancer has a strong association with estrogen receptor but lacks independent prognostic value,” Cancer Epidemiology Biomarkers & Prevention, vol. 17, no. 2, pp. 365-373, 2008.
[69] N. Emmanuel, K. A. Lofgren, E. A. Peterson, D. R. Meier, E. H. Jung, and P. A. Kenny, “Mutant GATA3 actively promotes the growth of normal and malignant mammary cells,” Anticancer Research, vol. 38, no. 8, pp. 4435-4441, 2018. https://doi.org/10.21873/anticanres.12745
[70] H. J. Donnella, J. T. Webber, R. S. Levin, R. Camarda, O. Momcilovic, N. Bayani, et al., “Kinome rewiring reveals AURKA limits PI3K-pathway inhibitor efficacy in breast cancer,” Nature Chemical Biology, vol. 14, no. 8, pp. 768-777, 2018.
[71] Y. R. Lee, M. Chen, and P. P. Pandolfi, “The functions and regulation of the PTEN tumour suppressor: new modes and prospects,” Nature Reviews Molecular Cell Biology, vol. 19, no. 9, pp. 547-562, 2018.
[72] MedlinePlus, “PTEN gene,” 2021 [Online]. Available: https://medlineplus.gov/genetics/gene/pten/.
[73] W. Wang, H. Xu, M. Alazab, T. R. Gadekallu, Z. Han, and C. Su, “Blockchain-based reliable and efficient certificateless signature for IIoT devices,” IEEE Transactions on Industrial Informatics, vol. 18, no. 10, pp. 7059-7067, 2022.
[74] N. M. Balamurugan, S. Mohan, M. Adimoolam, A. John, and W. Wang, “DOA tracking for seamless connectivity in beamformed IoT-based drones,” Computer Standards & Interfaces, vol. 79, article no. 103564, 2022. https://doi.org/10.1016/j.csi.2021.103564
[75] MedlinePlus, “Lists of genes A-Z,” 2021 [Online]. Available: https://medlineplus.gov/genetics/gene/.
[76] Cancer Index, “Breast cancer,” 2019 [Online]. Available: http://www.cancerindex.org/geneweb/X0401.htm.
[77] K. Y. Kim, J. Park, and R. Sohmshetty, “Prediction measurement with mean acceptable error for proper inconsistency in noisy weldability prediction data,” Robotics and Computer-Integrated Manufacturing, vol. 43, pp. 18-29, 2017.
[78] R. Diaz-Uriarte and S. Alvarez de Andres, “Gene selection and classification of microarray data using random forest,” BMC Bioinformatics, vol. 7, article no. 3, 2006. https://doi.org/10.1186/1471-2105-7-3
[79] K. Chomboon, P. Chujai, P. Teerarassamee, K. Kerdprasop, and N. Kerdprasop, “An empirical study of distance metrics for k-nearest neighbor algorithm,” in Proceedings of the 3rd International Conference on Industrial Application Engineering, Fukuoka, Japan, 2015, pp. 280-285.
[80] M. Cinelli, Y. Sun, K. Best, J. M. Heather, S. Reich-Zeliger, E. Shifrut, N. Friedman, J. Shawe-Taylor, and B. Chain, “Feature selection using a one dimensional naïve Bayes’ classifier increases the accuracy of support vector machine classification of CDR3 repertoires,” Bioinformatics, vol. 33, no. 7, pp. 951-955, 2017.
[81] H. Ramchoun, Y. Ghanou, M. Ettaouil, and M. A. Janati Idrissi, “Multilayer perceptron: architecture optimization and training,” International Journal of Interactive Multimedia and Artificial Intelligence, vol. 4, no. 1, pp. 26-30, 2016.
[82] H. Xiong, C. Jin, M. Alazab, K. H. Yeh, H. Wang, T. R. Gadekallu, W. Wang, and C. Su, “On the design of blockchain-based ECDSA with fault-tolerant batch verification protocol for blockchain-enabled IoMT,” IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 5, pp. 1977-1986, 2022.

About this article
Cite this article

Madhuri Gupta, Bharat Gupta, Abdoh Jabbari, Ishan Budhiraja, Deepak Garg, Ketan Kotecha, and Celestine Iwendi, "A Novel Computer Assisted Genomic Test Method to Detect Breast Cancer in Reduced Cost and Time Using Ensemble Technique," Human-centric Computing and Information Sciences, vol. 13, article no. 08, 2023. https://doi.org/10.22967/HCIS.2023.13.008

  • Received30 December 2021
  • Accepted15 March 2022
  • Published28 February 2023
Keywords