Binary and Multi-Class Assessment of Face Mask Classification on Edge AI Using CNN and Transfer Learning
  • Endah Kristiani1,2, Yu-Tse Tsan3,4, Po-Yu Liu5, Neil Yuwen Yen6, and Chao-Tung Yang1,7,*

Human-centric Computing and Information Sciences volume 12, Article number: 53 (2022)
https://doi.org/10.22967/HCIS.2022.12.053

Abstract

This paper empirically studies the impact of the sigmoid activation function with binary cross-entropy loss against the softmax activation function with categorical cross-entropy loss for binary classification problems. A face mask classification problem was used as the object of the study. We also evaluated the impact of optimization in the training and inference phases. We investigated three convolutional neural network (CNN) models: InceptionV3, MobileNet, and VGG16. First, we applied transfer learning and fine-tuning, using ImageNet weights in this case. Then, we evaluated the loss architecture using binary cross-entropy with the sigmoid activation function and categorical cross-entropy with the softmax activation function. Finally, we optimized the resulting models with the model optimizer (MO) into lightweight floating point 16 (FP16) and FP32 models. These lightweight models were then deployed on a CPU-based edge device with a vision processing unit (VPU), the Intel Neural Compute Stick 2 (NCS2). The experiments show that MobileNet achieves excellent speed, ranging from 20 to 30 frames per second (FPS) for the FP16 and FP32 models, compared to InceptionV3 and VGG16. In terms of training accuracy, InceptionV3 is slightly ahead of MobileNet, with more than 95% accuracy compared to MobileNet's 92%.


Keywords

Face Mask Detection, Edge Computing, InceptionV3, MobileNet, VGG16, Cross-Entropy, Binary Classification, Transfer Learning


Introduction

Classification involves using machine learning algorithms that learn how to assign a class label to examples from the problem domain. In machine learning, there are several kinds of image classification that we may encounter, and specific modeling approaches that can be used for each. Classification refers to a predictive modeling problem in which a class label is predicted for a given example of input data [1-3]. From a modeling perspective, classification requires a training dataset with many inputs and outputs to learn from. A model uses the training dataset to calculate how to map examples of input data to particular class labels. The training dataset must be adequately representative of the problem and contain many examples of each class label [4-6].
Binary classification is one of the main concerns in the machine learning field and is often addressed. In the simplest case, we attempt to classify an entity into one of two possible groups. For instance, face mask identification can be framed as classifying images into with mask and no mask. Binary classification problems of this kind can be resolved to a reasonably high degree by effectively using neural networks (deep learning models) [7-10]. Ordinarily, there are three ways to frame a classification problem: binary classification, multi-class classification, and multi-label classification [11, 12].
In this paper, we intend to compare binary classification and multi-class classification. We selected two kinds of activation and loss functions in binary classification problems for deep learning training comparison [13-16]. We investigated three convolutional neural network (CNN) architectures [17]: InceptionV3, MobileNet, and VGG16 [18]. In this case, we implemented transfer learning and fine-tuning using ImageNet weights. We evaluated the loss architecture using binary cross-entropy with the sigmoid activation function and categorical cross-entropy with the softmax activation function. We also optimized the resulting models using the model optimizer (MO) [19, 20] into lightweight floating point (FP)16 and FP32 models. These lightweight models were then deployed on a CPU-based edge device with a vision processing unit (VPU) [21], the Intel Neural Compute Stick 2 (NCS2). The main contributions of this work can be summarized as follows:

We empirically study the impact of sigmoid activation function in binary cross-entropy loss against softmax in categorical cross-entropy loss for binary classification problems.

We empirically study the impact of optimization in the training and inference phases.


This study may serve as a reference for examining the performance of binary models trained with sigmoid versus softmax activation, and for verifying the performance of each model on a CPU-based edge device with a VPU.


Background Review and Related Study

Cross-Entropy Loss
Cross-entropy (CE) loss evaluates a classification model whose output is a probability value between 0 and 1. CE loss increases as the predicted probability diverges from the actual label. For example, predicting a probability of 0.012 when the actual observation label is 1 would be a poor prediction and result in a high loss value. A perfect model would have a log loss of 0. The CE loss is defined in Equation (1).

$\mathrm{CE} = -\sum_{i}^{C} t_i \log(s_i)$    (1)


where $t_i$ and $s_i$ are the ground truth and the CNN score for each class $i$ in $C$. We write $f(s_i)$ to refer to the activations, since an activation function (sigmoid or softmax) is typically applied to the scores before the CE loss calculation. The CE loss can also be defined for a binary classification problem, where $C'=2$. The formula is shown in Equation (2).

$\mathrm{CE} = -\sum_{i=1}^{C'=2} t_i \log(s_i) = -t_1 \log(s_1) - (1-t_1)\log(1-s_1)$    (2)


It is assumed that two classes exist, $C_1$ and $C_2$. $t_1 \in [0,1]$ and $s_1$ are the ground truth and the score for $C_1$, while $t_2 = 1 - t_1$ and $s_2 = 1 - s_1$ are the ground truth and the score for $C_2$. This is the case when we break a multi-label classification problem into $C$ binary classification problems.
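To make the two-class case concrete, the short sketch below (our illustration, not code from the paper) evaluates the binary CE of Equation (2) with a sigmoid activation alongside a categorical CE over two softmax outputs, showing that the two formulations coincide when $C'=2$; the score values are arbitrary.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(s):
    e = np.exp(s - np.max(s))  # shift for numerical stability
    return e / e.sum()

# Binary CE with a sigmoid activation: one raw score s1 for class C1,
# with t2 = 1 - t1 and f(s2) = 1 - f(s1).
t1, s1 = 1.0, 2.0                       # C1 is the positive class; arbitrary score
f1 = sigmoid(s1)
bce = -t1 * np.log(f1) - (1 - t1) * np.log(1 - f1)

# Categorical CE over C = 2 softmax outputs with a one-hot target.
t = np.array([1.0, 0.0])
s = np.array([2.0, 0.0])                # softmax([s1, 0])[0] equals sigmoid(s1)
cce = -np.sum(t * np.log(softmax(s)))

print(f"binary CE: {bce:.6f}  categorical CE: {cce:.6f}")  # both ~0.126928
```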

Categorical Cross-Entropy Loss
Categorical CE loss is also called softmax loss: a softmax activation followed by a CE loss. Using this loss, we can train a CNN to output, for each image, a probability over the $C$ categories used for multi-class grouping [22]. In the particular case of multi-class classification, the labels are one-hot, so only the positive class $C_p$ keeps its term in the loss; the only element of the target vector $t$ that is not zero is $t_i = t_p$. Discarding the summation elements that are zero according to the target labels gives Equation (3).

$\mathrm{CE} = -\log\left(\dfrac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right)$    (3)


where $s_p$ is the CNN score for the positive class. Having defined the loss, we need to compute its gradient with respect to the CNN output neurons in order to back-propagate it through the network and optimize the parameters by minimizing the defined loss. We therefore compute the gradient of the CE loss with respect to each CNN class score in $s$. The loss terms from the negative classes are zero. However, the loss gradient with respect to the negative classes cannot be canceled, since the softmax of the positive class still depends on the negative class scores. The gradient expression is the same for all classes in $C$ except for the ground truth class $C_p$, because the score of $C_p$ ($s_p$) is in the numerator. After some calculus, the derivative with respect to the positive class is described in Equation (4) [23].

$\dfrac{\partial}{\partial s_p}\left(-\log\left(\dfrac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right)\right) = \left(\dfrac{e^{s_p}}{\sum_{j}^{C} e^{s_j}} - 1\right)$    (4)


The derivative with respect to the other (negative) classes is presented in Equation (5).

$\dfrac{\partial}{\partial s_n}\left(-\log\left(\dfrac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right)\right) = \dfrac{e^{s_n}}{\sum_{j}^{C} e^{s_j}}$    (5)


where $s_n$ is the score of any negative class in $C$ different from $C_p$.
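Equations (4) and (5) together say that the gradient of the softmax CE loss with respect to the score vector is simply $f(s) - t$. As a sanity check (ours, not from the paper), the sketch below compares that closed form against TensorFlow's automatic differentiation for arbitrary scores:

```python
import tensorflow as tf

s = tf.Variable([2.0, 1.0, 0.1])        # scores for C = 3 classes
t = tf.constant([1.0, 0.0, 0.0])        # one-hot target; C_p is class 0

with tf.GradientTape() as tape:
    p = tf.nn.softmax(s)                          # activations f(s)
    loss = -tf.reduce_sum(t * tf.math.log(p))     # Equation (3)

grad = tape.gradient(loss, s)
# Equations (4) and (5) predict grad = softmax(s) - t.
print(grad.numpy())        # ~ [-0.3410, 0.2424, 0.0986]
print((p - t).numpy())     # identical, confirming the derivation
```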

Binary Cross-Entropy Loss
Binary CE loss is often referred to as sigmoid CE loss: a sigmoid activation followed by a CE loss. Unlike softmax loss, it is independent for each vector component (class), meaning that the loss computed for each CNN output vector component is not influenced by the other component values. Because the decision that an element belongs to a particular class does not affect the decision for another class, it is used for multi-label classification. It is called binary CE loss because, for each class in $C$, it sets up a binary classification problem between $C'=2$ classes. The CE equation for binary problems is therefore the one mostly used with this loss. Binary CE loss is expressed in Equation (6).

$\mathrm{CE} = -\sum_{i=1}^{C'=2} t_i \log(f(s_i)) = -t_1 \log(f(s_1)) - (1-t_1)\log(1-f(s_1))$    (6)


where $t_1=1$ means the class $C_1=C_i$ is positive.
In this case, the activation function does not depend on the scores of the other classes in $C$ beyond $C_1=C_i$. So the gradient with respect to each score $s_i$ in $s$ depends only on the loss given by its binary problem. The gradient with respect to the score $s_i=s_1$ is described in Equation (7).

$\dfrac{\partial}{\partial s_1}\,\mathrm{CE}(f(s_1)) = t_1\,(f(s_1) - 1) + (1 - t_1)\,f(s_1)$    (7)


where f($s_i$) is the sigmoid function, as presented in Equation (8).

$f(s_i) = \dfrac{1}{1 + e^{-s_i}}$    (8)
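The same autodiff check can be applied to the binary case (again our illustration, not the paper's code): the gradient of Equation (6) with respect to the raw score should match the closed form of Equation (7).

```python
import tensorflow as tf

s1 = tf.Variable(0.5)      # raw score for class C1
t1 = tf.constant(1.0)      # C1 is the positive class

with tf.GradientTape() as tape:
    f = tf.sigmoid(s1)                                           # Equation (8)
    loss = -t1 * tf.math.log(f) - (1 - t1) * tf.math.log(1 - f)  # Equation (6)

grad = tape.gradient(loss, s1)
analytic = t1 * (f - 1) + (1 - t1) * f                           # Equation (7)
print(grad.numpy(), analytic.numpy())  # both ~ -0.37754 = sigmoid(0.5) - 1
```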


Related Study
Ho and Wookey [24] proposed the real-world-weight cross-entropy (RWWCE), a new loss function. To examine the potential for better handling a range of loss functions in machine learning, they compared this loss function against the binary and categorical CE functions, using the MNIST dataset for model training. In their experiments, the RWWCE loss function was evaluated against imbalanced classes for binary classification and for multi-class classification on a single label. The binary CE results show that the test model produces fewer false negatives than either control model but far more false positives, resulting in an improvement in the top-1 error. The categorical CE findings show a decrease in the number of mislabeled values and in real-world costs, while the top-1 error was projected to increase.
CNNs have emerged as very useful in many fields, particularly in recognizing objects and images. Consequently, the performance of a CNN relies heavily on its architecture [25]. For most state-of-the-art CNNs, the architectures are built explicitly from experience with both CNNs and the investigated cases [26, 27]. Practitioners usually have little idea how a CNN architecture should be adapted to a custom application. There have been many works on improving the loss. Softmax and sigmoid activation functions are widely used in CNNs for classification, hence the CE loss with softmax and with sigmoid.
Harangi et al. [28] proposed a classification of dermoscopic images based on a deep convolutionary networking structure in seven groups. They also merged these classes into two classes of health and disease and trained the model based on the binary task. GoogLeNet InceptionV3 was used for training the models for both classification tasks. Their results show that the accuracy of multi-classification was significantly increased by 7% using the embedding binary classification method.
Hammad et al. [29] presented an electrocardiogram (ECG) security mechanism based on edge computing servers that provide connectivity to IoT devices. In their proposed method, individual ECG signals from the Physikalisch-Technische Bundesanstalt (PTB) database are input into a CNN model and then classified into approved or unaccepted classes. Their model reaches 99.50% accuracy, with an equal error rate (EER) of 0.47%.
Hussain et al. [30] proposed a fully automated computer vision approach for object recognition. The proposed method first performs data augmentation to balance the object classes. A convolutional neural network (DenseNet-201) was then explored and updated based on the dataset (Caltech-101). Transfer learning was used to train the modified model, which extracts features. A few duplicate features were removed from the extracted features using a modified whale optimization algorithm (WOA). Finally, multiple supervised learning methods were used to classify the final features for recognition. The experiment was carried out on the upgraded Caltech-101 dataset and achieved a 93% accuracy rate.
Jang and Choi [31] proposed a prioritized environment configuration (PEC). The proposed approach prioritizes environment configurations, stochastically samples a configuration based on the priority, and uses the sampled configuration to initialize the environment. Agents can benefit from well-initialized surroundings since they can help them accumulate useful experiences. Their findings show that the suggested algorithm can be used with reinforcement learning algorithms that focus on the learning step. By applying the prioritized environment configuration to an autonomous drone flying simulator, they improved speed and performance. Furthermore, the findings show that the proposed approach performs well in a distributed architecture with several workers with both on-policy and off-policy reinforcement learning algorithms.
Salim et al. [32] present a rapid and efficient handover authentication (HO-Auth) strategy that employs deep learning to validate devices and create a user profile-based system for immediate permission. The model is trained using the channel state information (CSI) of a user's movement behavior, which detects malevolent users posing as honest users. In identifying a rogue device, the simulation-based analysis provides an initial profile accuracy of 0.91. As the profile is retrained based on the user's movement, the detection accuracy rises to 0.94. The technique ensures that genuine devices send data to blockchain decentralized networks, safeguarding cloud applications from incorrect data.
Rathore et al. [33] offer a deep learning and blockchain-based security framework for intelligent 5G-enabled IoT, which uses deep learning capabilities for intelligent data analysis and blockchain for data security. The hierarchical design of the framework is shown, with deep learning and blockchain processes emerging across the four layers of cloud, fog, edge, and user. The framework was simulated and evaluated to show the validity in practical applications using a variety of conventional measurements of latency, accuracy, and security.
Singh et al. [34] introduce DeepBlockScheme: A Deep Learning-based Blockchain Driven Scheme for a Secure Smart City. Blockchain was deployed at the fog layer to assure manufacturing data integrity, decentralization, and security. Deep learning was used at the cloud layer to boost production, automate data processing, and boost communication bandwidth in intelligent factory and manufacturing applications. They give a case study of vehicle production with the most up-to-date service scenarios for the proposed scheme and compare it to existing research studies utilizing key characteristics like security and privacy tools.


Experimental Setup

The experimental procedures were prepared in this section, including training and inference system architecture [35], system workflows, dataset, and training and inference environments.

Training and Inference System Environment
The training process in this work used the computing configuration shown in Fig. 1(a). An Intel Xeon Phi Processor 7210 was used as the hardware in this environment, with CentOS 7.4 (64-bit) as the operating system. Jupyter Notebook was installed with the IPython kernel and the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) to accelerate deep learning performance. TensorFlow and Keras were applied as the deep learning frameworks. OpenCV (Open Source Computer Vision Library) was implemented as the library of programming functions for computer vision.

Fig. 1. (a) Training system environment and (b) inference system environment.


In the inference part, a Raspberry Pi 4 was used as the system device, with Raspbian OS as the operating system. Python, TensorFlow, and OpenCV were installed on the Raspberry Pi as tools for running inference on the deep learning model. The OpenVINO package was implemented as a library for optimizing the inference process, and Picamera was used to connect the camera module. Fig. 1(b) describes the edge devices for the inference process [36].

System Workflows
Fig. 2 depicts the workflow diagram of this paper. There are two stages: training and inference. First, in the training phase, the workflow starts from the input data, creates a deep learning network, and produces the output classification. In this case, the system implemented three CNN networks: InceptionV3, VGG16, and MobileNet. Each model was trained in two scenarios, using the top layers only and with a fine-tuning mechanism. These processes produced trained models in TensorFlow format, which are then converted into IR models consisting of XML and bin files (a conversion sketch follows Fig. 2). Second, in the inference phase, the workflow starts with new input from the camera, runs inference on the IR neural network model, and finally outputs the classification prediction result.

Fig. 2. System workflows.
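The TensorFlow-to-IR conversion is performed with the OpenVINO Model Optimizer. The sketch below is a minimal illustration of such a conversion; the mo_tf.py path, the saved-model directory, and the exact flags are assumptions that depend on the installed toolkit (a 2020/2021-era OpenVINO is assumed here).

```python
import subprocess

# Hypothetical install path; adjust to the local OpenVINO distribution.
MO = "/opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py"

for precision in ("FP16", "FP32"):
    subprocess.run(
        [
            "python3", MO,
            "--saved_model_dir", "saved_models/mobilenet",  # trained TF model (assumed path)
            "--input_shape", "[1,224,224,3]",               # 224x224 for MobileNet/VGG16
            "--data_type", precision,                       # IR weight precision
            "--output_dir", f"ir_models/mobilenet_{precision.lower()}",
        ],
        check=True,
    )
# Each run produces an .xml (topology) and .bin (weights) pair: the IR model.
```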


Dataset
The dataset was extracted from the Face Mask Detection project [37]. It contains 4,095 images plus an additional 378 random images, for a total of 4,473 images: 2,525 images of people wearing masks and 1,948 images without masks. These images were then split into training, validation, and test sets. The details of the folders are listed in Table 1.

Table 1. Dataset train, validation, and test
Category        Folder        Amount
Without mask    Train         1,363
                Validation    194
                Test          391
With mask       Train         1,767
                Validation    252
                Test          506
All the images were standardized to specific dimensions to optimize the training process, depending on the CNN topology. For InceptionV3 training, the images were resized to 299×299; for VGG16 and MobileNet, the images were resized to 224×224.

Training and Inference Optimization
Fig. 3 describes the optimization development model used in this system. To accelerate training and inference performance, the system utilized open AI software. For the deep learning libraries, this system used the TensorFlow and Keras frameworks. The Data Analytics Acceleration Library and the Intel Python distribution are the primary tools for building machine learning blocks. The open-source DNN libraries contain CPU-optimized functions. The Intel Distribution of OpenVINO Toolkit facilitates model deployment for inference by converting and optimizing trained models into IR form; it supports models trained in TensorFlow, Caffe, and MXNet on the CPU.

Fig. 3. Optimization development.


Training Procedures
We designed the binary classification training scenarios to compare binary CE loss with the sigmoid activation function against categorical CE loss with the softmax activation function. Moreover, we executed the training process in two steps for each model: training the top layers, which were randomly initialized, and then fine-tuning the entire network. The parameters used in training are listed as follows.

3.5.1 Data preprocessing
When training vision models, it is usual to reduce images to a smaller size to allow mini-batch learning while staying within compute constraints. We run this model on the edge, which has compute limitations, and the raw images come in various sizes. Therefore, we resized the images before training to improve training and inference performance: 299×299 for InceptionV3, and 224×224 for VGG16 and MobileNet. In this case, we used the Keras ImageDataGenerator class to feed the data for training. It makes it simple to read from an organized directory, with each category in its own folder. We structured the data this way during the data preprocessing phase, with separate folders for train, test, and validation.
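A minimal sketch of this preprocessing step is shown below, assuming a dataset/{train,validation}/{with_mask,without_mask} directory layout (the folder names and pixel rescaling are our assumptions); class_mode switches between the binary and categorical target encodings.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (224, 224)    # MobileNet/VGG16; use (299, 299) for InceptionV3
BATCH_SIZE = 64

datagen = ImageDataGenerator(rescale=1.0 / 255)  # assumed pixel scaling

train_gen = datagen.flow_from_directory(
    "dataset/train",
    target_size=IMG_SIZE,        # resizes every image on the fly
    batch_size=BATCH_SIZE,
    class_mode="binary",         # "categorical" for the softmax variant
)
val_gen = datagen.flow_from_directory(
    "dataset/validation",
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode="binary",
)
```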

3.5.2 Hyperparameter selections
The thread pools are adjusted via the two settings below to improve CPU performance (a configuration sketch follows this list).
- intra_op_parallelism_threads: Nodes that can parallelize their execution across multiple threads schedule their individual portions into this pool.
- inter_op_parallelism_threads: This pool contains all ready nodes.
Intel MKL tunes performance using the following environment variables:
- KMP_BLOCKTIME: Determines how long, in milliseconds, a thread should wait after completing the execution of a parallel region before sleeping.
- KMP_AFFINITY: Allows threads to be bound to physical processing units by the runtime library.
- KMP_SETTINGS: Enables (true) or disables (false) the printing of OpenMP runtime library environment variables during program execution.
- OMP_NUM_THREADS: Specifies the number of threads to employ.
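A sketch of how these settings can be applied in code, using the values from Table 2; the tf.config calls are the TensorFlow 2.x API (TF 1.x exposed the same knobs through tf.ConfigProto), and the environment variables must be set before TensorFlow creates its thread pools.

```python
import os

# Intel MKL / OpenMP tuning (values from Table 2).
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

import tensorflow as tf  # imported after the environment is configured

tf.config.threading.set_intra_op_parallelism_threads(8)
tf.config.threading.set_inter_op_parallelism_threads(1)
```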
Table 2 presents the hyperparameter setting for training the models.

Table 2. Hyperparameter selections
Parameter                               Value
intra_op_parallelism_threads            8
inter_op_parallelism_threads            1
os.environ["OMP_NUM_THREADS"]           "8"
os.environ["KMP_BLOCKTIME"]             "1"
os.environ["KMP_SETTINGS"]              "1"
os.environ["KMP_AFFINITY"]              "granularity=fine,verbose,compact,1,0"
Transfer learning weights               "imagenet"
Global spatial average pooling layer    x = GlobalAveragePooling2D()(base_model.output)
Dense layer                             x = Dense(1024, activation="relu")(x)
Activation                              "sigmoid" or "softmax"
Batch size                              64
Optimizer                               optimizers.Adam(lr=0.001)
Metrics                                 "accuracy"
Loss                                    "binary_crossentropy" or "categorical_crossentropy"
Epochs                                  5
EarlyStopping patience                  5
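The sketch below assembles these settings into the two-step training procedure for one of the three networks (MobileNet is shown; InceptionV3 and VGG16 follow the same pattern with their own input sizes). The single-unit sigmoid head versus two-unit softmax head is our reading of the two loss configurations, since Table 2 does not state the output dimensionality, and train_gen/val_gen are the generators from the preprocessing sketch above.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.callbacks import EarlyStopping

USE_SIGMOID = True   # False -> softmax head with categorical cross-entropy

# Transfer learning: ImageNet weights with the classification head removed.
base_model = MobileNet(weights="imagenet", include_top=False,
                       input_shape=(224, 224, 3))

x = layers.GlobalAveragePooling2D()(base_model.output)
x = layers.Dense(1024, activation="relu")(x)
if USE_SIGMOID:
    outputs = layers.Dense(1, activation="sigmoid")(x)
    loss = "binary_crossentropy"
else:
    outputs = layers.Dense(2, activation="softmax")(x)
    loss = "categorical_crossentropy"
model = models.Model(base_model.input, outputs)

early_stop = EarlyStopping(patience=5)

# Step 1: train only the randomly initialized top layers.
base_model.trainable = False
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),  # Table 2 uses the older lr alias
              loss=loss, metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=5, callbacks=[early_stop])

# Step 2: fine-tune the entire network with the same settings.
base_model.trainable = True
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss=loss, metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=5, callbacks=[early_stop])
```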


Experimental Results

In this section, we present the experiments conducted to compare the performance of binary classification models trained with binary cross-entropy loss and the sigmoid activation function against those trained with categorical cross-entropy loss and the softmax activation function. To this end, we also compared three CNN networks: InceptionV3, MobileNet, and VGG16. We examined training loss and accuracy, classification metrics, the model optimizer, and inference performance.

Training Loss and Accuracy
The comparison of training loss and accuracy is presented in Figs. 4 and 5. In Fig. 4, VGG16 shows a rising loss after fine-tuning the entire network, which means the model is getting worse. In comparison, InceptionV3 shows significant improvement after fine-tuning in both binary and categorical CE, while the fine-tuning process on MobileNet did not yield much improvement in the model's performance. In terms of accuracy, Fig. 5 shows that both binary and categorical CE give only a small increase, and fine-tuning can even weaken the models, as happened with the VGG16 model.
Fig. 6 compares the loss and accuracy of CE training with sigmoid and softmax activation. With fine-tuning, the loss improves for InceptionV3 and MobileNet, decreasing by about 25.5% for InceptionV3 and 0.85% for MobileNet; for VGG16, however, it increases significantly, by 33%. For accuracy, InceptionV3 and MobileNet increase by 8.6% and 0.22%, respectively, whereas VGG16 decreases significantly, by 20.6%. For the loss and accuracy of categorical CE loss with softmax activation, InceptionV3 has a poorer loss value than MobileNet and VGG16.
InceptionV3's loss increased by 9.8%, while MobileNet and VGG16 decreased their loss by 4.7% and 45.5%, respectively. The accuracy improved by 3.2% for InceptionV3, and decreased by 4.8% for MobileNet and 8.2% for VGG16.

Fig. 4. Training loss comparison.


Fig. 5. Training accuracy comparison.


Fig. 6. Comparison of cross-entropy Activation.


Table 3 presents a loss comparison of binary and categorical CE. In terms of loss, fine-tuning performs well for InceptionV3, MobileNet, and VGG16, with reductions of 2.5%, 8.9%, and 9%, respectively. However, in terms of accuracy (Table 4), InceptionV3 and MobileNet perform better with binary CE than with categorical CE, improving by 0.45% and 5.6%, while VGG16 decreases by 13.4%.

Table 3. Loss comparison of binary and categorical cross-entropy
              Top layers                                     Fine-tuning
Loss          Sigmoid       Softmax       Increase/decrease  Sigmoid       Softmax       Increase/decrease
InceptionV3   0.333587893   0.20159916    0.131988732        0.07865501    0.104083748   -0.025428738
MobileNet     0.133371245   0.167236036   -0.033864791       0.124903403   0.213819167   -0.088915765
VGG16         0.137831582   0.105528      0.032303581        0.470762185   0.560469213   -0.089707028


Table 4. Accuracy comparison of binary and categorical cross-entropy
              Top layers                                   Fine-tuning
Accuracy      Sigmoid      Softmax      Increase/decrease  Sigmoid      Softmax      Increase/decrease
InceptionV3   0.8801561    0.9297659    -0.0496098         0.9665552    0.96209586   0.00445934
MobileNet     0.9425864    0.93645483   0.00613157         0.94481605   0.88851726   0.05629879
VGG16         0.9548495    0.96544033   -0.01059083        0.74860644   0.88294315   -0.13433671

Classification Metrics
How good a model is depends on its purpose. The classification metrics show in more detail how good our models are, based on precision, recall, and F1-score. The comparison of classification metrics is presented in Figs. 7, 8, and 9. Looking at the graph in Fig. 7, we can see that the without_mask category is less accurate than with_mask. This could be because the dataset contains fewer no-mask images than images of people wearing masks; however, the gap is small. The VGG16 model appears unstable across its class metrics. When there is a high cost associated with false negatives, recall is the metric we would use to pick our best model.

Fig. 7. Precision comparison.


As can be seen in Fig. 8, VGG16 is the weakest model in terms of recall, which means that misclassification is likely to be more frequent with VGG16.
If we need to find a balance between precision and recall under an unequal class distribution (many actual negatives), the F1-score may be the better metric. As can be seen from Fig. 9, VGG16 presents unbalanced metrics between precision and recall.

Fig. 8. Recall comparison.


Fig. 9. F1-score graph.


Inference Performance on Edge Device
The inference performance was tested on the Raspberry Pi 4 with the Intel NCS2. There are 48 models to compare, covering three CNN networks, two activation/loss combinations, two training modes (top layers and fine-tuning), and two floating-point (FP) model types, FP16 and FP32. Fig. 10 demonstrates the inference app we used to evaluate the IR models on the Raspberry Pi and Intel NCS2; a minimal sketch of such an inference measurement is given below.
Among all the models, the best results on the edge device come from MobileNet FP32, as shown in Fig. 11. In this case, we see high-speed inference ranging from 27 to 31 frames per second (fps). The best models are mostly the fine-tuned MobileNet FP32 variants.
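The sketch below illustrates one such measurement using the 2020/2021-era OpenVINO Python API; the IR file paths and the test image are hypothetical, and MYRIAD is the device name OpenVINO uses for the NCS2.

```python
import time
import cv2
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="ir_models/mobilenet_fp32/model.xml",
                      weights="ir_models/mobilenet_fp32/model.bin")
exec_net = ie.load_network(network=net, device_name="MYRIAD")  # Intel NCS2

input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))

frame = cv2.imread("face.jpg")                            # hypothetical test frame
blob = cv2.resize(frame, (224, 224)).transpose(2, 0, 1)   # HWC -> CHW
blob = blob[np.newaxis].astype(np.float32) / 255.0        # matches assumed training scaling

start = time.time()
result = exec_net.infer({input_name: blob})
fps = 1.0 / (time.time() - start)
print(result[output_name], f"{fps:.1f} inferences/s")
```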

Fig. 10. Inference app.


Fig. 11. MobileNet FP32 for binary and categorical cross-entropy model inference.


For the InceptionV3 FP16 binary and categorical CE models, Table 5 shows a speed of around 7 to 10 fps in this environment. However, in terms of accuracy, there are misclassifications for the InceptionV3 FP16 models trained on the top layers only, and misclassification persists for top-layer training in the InceptionV3 FP32 models. There is no significant difference between InceptionV3 FP16 and FP32 in terms of speed or accuracy.

Table 5. Comparison of inference performance on edge device for FP16 model (speed in INF/S, accuracy in %; a negative accuracy marks a misclassified prediction)
FP16              Sigmoid                              Softmax
                  Mask             No mask             Mask             No mask
                  Speed   Acc      Speed   Acc         Speed   Acc      Speed   Acc
InceptionV3
  Fine-tuning     7       88.82    9       99.61       9       94.38    9       99.41
  Top layers      8       -98.29   8       99.8        10      -93.12   9       95.51
MobileNet
  Fine-tuning     23      79.2     29      98.1        24      94.38    28      97.07
  Top layers      27      66.16    25      97.9        28      86.57    28      66.26
VGG16
  Fine-tuning     5       -54.98   5       55.08       5       51.32    5       58.2
  Top layers      5       55.52    5       53.76       5       51.37    5       67.63

For MobileNet FP16, we see high speeds ranging from 23 to 29 fps, twice the speed of InceptionV3. However, there are some weak predictions in terms of accuracy, even though there is no misclassification. The speed of MobileNet FP32 increases further, to 27 to 31 fps, and the best models are mainly the fine-tuned MobileNet FP32 variants. For VGG16, we see weak speed and weak predictions for both binary and categorical CE: the speed is uniformly only 5 fps, with an accuracy of 50%-60%, and there is still a misclassification for fine-tuned binary CE. Compared to VGG16 FP16, the FP32 models give almost identical results with slightly higher speed; the accuracy remains the same as FP16. All negative predictions indicate that the prediction is a misclassification at the given level of accuracy. Tables 5 and 6 show the comparison of inference performance on the edge device.

Table 6. Comparison of inference performance on edge device for FP32 model (speed in INF/S, accuracy in %; a negative accuracy marks a misclassified prediction)
FP32              Sigmoid                              Softmax
                  Mask             No mask             Mask             No mask
                  Speed   Acc      Speed   Acc         Speed   Acc      Speed   Acc
InceptionV3
  Fine-tuning     9       88.96    10      99.61       9       94.14    9       99.41
  Top layers      10      -98.19   10      99.8        9       -93.12   10      95.51
MobileNet
  Fine-tuning     31      79.54    27      98.1        31      94.29    28      97.07
  Top layers      31      66.41    27      97.9        28      86.57    28      66.6
VGG16
  Fine-tuning     6       -54.93   6       55.03       6       51.32    5       58.2
  Top layers      6       55.52    6       53.81       6       51.37    5       67.63

The following graphs compare the inference performance on the edge device in terms of speed (Fig. 12) and accuracy (Fig. 13).

Fig. 12. The speed comparison of inference on edge device.


Fig. 13. The accuracy comparison of inference on edge device.



Conclusion

In this study, we examined training with the sigmoid activation function and binary CE loss against the softmax activation function and categorical CE loss for binary classification problems. Three CNN networks were applied in the training process: InceptionV3, MobileNet, and VGG16. The training phase also implemented transfer learning based on ImageNet weights followed by fine-tuning of the entire network. After training, inference optimization was applied on a Raspberry Pi edge device with an Intel NCS2: we converted the TensorFlow models into IR models in two FP modes, FP16 and FP32, and then compared speed and accuracy in the inference phase. In terms of training optimization, InceptionV3 gives consistent results with both the binary and categorical CE methods. Comparing binary and categorical CE under fine-tuning, binary CE is more suitable than categorical CE for InceptionV3 and MobileNet, while VGG16 does not achieve a good result. In terms of inference, MobileNet yields excellent models compared to InceptionV3 and VGG16, with high speeds in the range of 20 to 30 fps. In terms of accuracy, InceptionV3 is slightly ahead of MobileNet, with more than 95% accuracy compared to MobileNet's average of 92%; however, the InceptionV3 models are significantly slower than MobileNet, reaching only 9 to 10 fps. In the future, we plan to compare against a GPU environment for the inference process.


Author’s Contributions

Conceptualization, EK, CTY. Writing—original draft, review, editing, EK, CTY, YTT. Data curation, PYL, NYY. All authors have read and approved the final manuscript.


Funding

This research was partly supported by the National Science and Technology Council (NSTC), Taiwan R.O.C. (111-2622-E-029-003-, 111-2811-E-029-001-, 111-2621-M-029-004-, and 110-2221-E-029-020-MY3). In addition, this work is also supported by Taichung Veterans General Hospital (TCVGH), Taiwan R.O.C. (No. TCVGH-T1087804, TCVGH-T1097801, TCVGH-1107201C, TCVGH-T1107803, TCVGH-T1117807, TCVGH-NK1099002, and TCVGH-1093902D).


Competing Interests

The authors declare that they have no competing interests.


Author Biography

Author
Name : Endah Kristiani
Affiliation : Department of Computer Science, Tunghai University, Taichung City, Taiwan (R.O.C)
Biography : Endah Kristiani received her Ph.D. degree from the Department of Industrial Engineering and Enterprise Information, Tunghai University, Taiwan, and an M.S. degree in Electrical Engineering (Information Technology) from Universitas Gadjah Mada, Yogyakarta, Indonesia, in 2007. She is a Post-Doctoral Fellow in the Department of Computer Science, Tunghai University, under a Ministry of Science and Technology (MOST) Taiwan project. Her research interests include edge computing, machine learning, and artificial intelligence.

Author
Name : Yu-Tse Tsan
Affiliation : Department of Emergency Medicine, Taichung Veterans General Hospital, Taichung City, Taiwan (R.O.C)
Biography : Yu-Tse Tsan is an Assistant Professor of Medicine and a medical physician at Taichung Veterans General Hospital in Taiwan. He received his Ph.D. in public health from the Institute of Occupational Medicine and Industrial Hygiene, National Taiwan University, in June 2013. In August 2000, he joined the Department of Emergency Medicine at Taichung Veterans General Hospital. He won the Excellent Paper Award of the Professor Chen Gongbei Memorial Award. His present research interests are in pharmacoepidemiology, environmental and occupational health, and big data programming.

Author
Name : Po-Yu Liu
Affiliation : Division of Infection, Department of Internal Medicine, Taichung Veterans General Hospital, Taichung, Taiwan (R.O.C)
Biography : Po-Yu Liu is a clinical consultant at the Taichung Veterans General Hospital. He achieved and maintains Board Certification in Internal Medicine, Emergency Medicine, and Infectious Diseases. During his PhD at National Chung Hsing University, he studied environment-microbe-host interactions. He focuses his research on increasing knowledge that will lead to better diagnosis and management of emerging infections.

Author
Name : Neil Y. Yen
Affiliation : Information Security Laboratory, University of Aizu, Fukushima, Japan
Biography : Neil Y. Yen received doctorate degrees in human sciences from Waseda University, Totsukamachi, Japan, and in engineering from Tamkang University, New Taipei, Taiwan, in 2012. His doctorate at Waseda University was funded by the Japan Society for the Promotion of Science (JSPS) under the RONPAKU program. He joined the University of Aizu, Aizuwakamatsu, Japan, as an associate professor in April 2012. He has been engaged extensively in an interdisciplinary field of research covering big data science, computational intelligence, and human-centered computing. He has been actively involved in the research community, serving as a guest editor, associate editor, and reviewer for international refereed journals, and as an organizer/chair of ACM/IEEE-sponsored conferences, workshops, and special sessions. He is currently a member of the IEEE Computer Society, the IEEE Systems, Man, and Cybernetics Society, and the Technical Committee on Awareness Computing (IEEE SMCS).

Author
Name : Chao-Tung Yang
Affiliation : Department of Computer Science, Tunghai University, Taichung City, Taiwan (R.O.C)
Biography : Chao-Tung Yang is Distinguished Professor of Computer Science at Tunghai University in Taiwan. He received his Ph.D. in computer science from National Chiao Tung University in July 1996. In August 2001, he joined the Faculty of the Department of Computer Science at Tunghai University. He is serving in a number of journal editorial boards, including Future Generation Computer Systems, International Journal of Communication Systems, KSII Transactions on Internet and Information Systems, Journal of Cloud Computing. He has published more than 300 papers in journals, book chapters and conference proceedings. His present research interests are in cloud computing, big data, parallel computing, and deep learning. He is a member of the IEEE Computer Society and ACM.


References

[1] H. Peng and S. Chen, “BDNN: binary convolution neural networks for fast object detection,” Pattern Recognition Letters, vol. 125, pp. 91-97, 2019.
[2] X. Zeng, Y. Zhang, X. Wang, K. Chen, D. Li, and W. Yang, “Fine-grained image retrieval via piecewise cross entropy loss,” Image and Vision Computing, vol. 93, article no. 103820, 2020. https://doi.org/10.1016/j.imavis.2019.10.006
[3] G. Cicceri, F. De Vita, D. Bruneo, G. Merlino, and A. Puliafito, “A deep learning approach for pressure ulcer prevention using wearable computing,” Human-centric Computing and Information Sciences, vol. 10, article no. 5, 2020. https://doi.org/10.1186/s13673-020-0211-8
[4] P. Nagrath, R. Jain, A. Madan, R. Arora, P. Kataria, and J. Hemanth, “SSDMNV2: a real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2,” Sustainable Cities and Society, vol. 66, article no. 102692, 2021. https://doi.org/10.1016/j.scs.2020.102692
[5] V. Iglovikov and A. Shvets, “TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation,” 2018 [Online]. Available: https://arxiv.org/abs/1801.05746.
[6] D. H. Lee, Y. Li, and B. S. Shin, “Mid-level feature extraction method based transfer learning to small-scale dataset of medical images with visualizing analysis,” Journal of Information Processing Systems, vol. 16, no. 6, pp. 1293-1308, 2020.
[7] G. Jignesh Chowdary, N. S. Punn, S. K. Sonbhadra, and S. Agarwal, “Face mask detection using transfer learning of InceptionV3,” 2020 [Online]. Available: https://arxiv.org/abs/2009.08369.
[8] H. H. Lin, W. C. Chiang, C. T. Yang, C. T. Cheng, T. Zhang, and L. J. Lo, “On construction of transfer learning for facial symmetry assessment before and after orthognathic surgery,” Computer Methods and Programs in Biomedicine, vol. 200, article no. 105928, 2021. https://doi.org/10.1016/j.cmpb.2021.105928
[9] M. Loey, G. Manogaran, M. H. N. Taha, and N. E. M. Khalifa, “A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic,” Measurement, vol. 167, article no. 108288, 2021. https://doi.org/10.1016/j.measurement.2020.108288
[10] G. Jignesh Chowdary, N. S. Punn, S. K. Sonbhadra, and S. Agarwal, “Face mask detection using transfer learning of InceptionV3,” in Big Data Analytics. Cham, Switzerland: Springer, 2020, pp. 81-90.
[11] H. Qin, R. Gong, X. Liu, X. Bai, J. Song, and N. Sebe, “Binary neural networks: a survey,” Pattern Recognition, vol. 105, article no. 107281, 2020. https://doi.org/10.1016/j.patcog.2020.107281
[12] K. Adem, S. Kilicarslan, and O. Comert, “Classification and diagnosis of cervical cancer with stacked autoencoder and softmax classification,” Expert Systems with Applications, vol. 115, pp. 557-564, 2019.
[13] G. Mourgias-Alexandris, A. Tsakyridis, N. Passalis, A. Tefas, K. Vyrsokinos, and N. Pleros, “An all-optical neuron with sigmoid activation function,” Optics Express, vol. 27, no. 7, pp. 9620-9630, 2019.
[14] Z. Qin, D. Kim, and T. Gedeon, “Rethinking softmax with cross-entropy: neural network classifier as mutual information estimator,” 2019 [Online]. Available: https://arxiv.org/abs/1911.10688.
[15] S. Maharjan, A. Alsadoon, P. W. C. Prasad, T. Al-Dalain, and O. H. Alsadoon, “A novel enhanced softmax loss function for brain tumour detection using deep learning,” Journal of Neuroscience Methods, vol. 330, article no. 108520, 2020. https://doi.org/10.1016/j.jneumeth.2019.108520
[16] S. Kanai, Y. Fujiwara, Y. Yamanaka, and S. Adachi, “Sigsoftmax: reanalysis of the softmax bottleneck,” Advances in Neural Information Processing Systems, vol. 31, pp. 284-294, 2018.
[17] W. Song, Z. Liu, Y. Tian, and S. Fong, “Pointwise CNN for 3D object classification on point cloud,” Journal of Information Processing Systems, vol. 17, no. 4, pp. 787-800, 2021.
[18] E. Kristiani, C. T. Yang, and C. Y. Huang, “iSEC: an optimized deep learning model for image classification on edge computing,” IEEE Access, vol. 8, pp. 27267-27276, 2020.
[19] Y. Liu, C. Chen, R. Zhang, T. Qin, X. Ji, H. Lin, and M. Yang, “Enhancing the interoperability between deep learning frameworks by model conversion,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, 2020, pp. 1320-1330.
[20] S. Mouselinos, V. Leon, S. Xydis, D. Soudris, and K. Pekmestzi, “TF2FPGA: a framework for projecting and accelerating TensorFlow CNNs on FPGA platforms,” in Proceedings of 2019 8th International Conference on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 2019, pp. 1-4.
[21] E. Kristiani, C. T. Yang, C. Y. Huang, Y. T. Wang, and P. C. Ko, “The implementation of a cloud-edge computing architecture using OpenStack and Kubernetes for air quality monitoring application,” Mobile Networks and Applications, vol. 26, no. 3, pp. 1070-1092, 2021.
[22] I. S. Na, C. Tran, D. Nguyen, and S. Dinh, “Facial UV map completion for pose-invariant face recognition: a novel adversarial approach based on coupled attention residual UNets,” Human-centric Computing and Information Sciences, vol. 10, article no. 45, 2020. https://doi.org/10.1186/s13673-020-00250-w
[23] J. Cao, Z. Su, L. Yu, D. Chang, X. Li, and Z. Ma, “Softmax cross entropy loss with unbiased decision boundary for image classification,” in Proceedings of 2018 Chinese Automation Congress (CAC), Xi'an, China, 2018, pp. 2028-2032.
[24] Y. Ho and S. Wookey, “The real-world-weight cross-entropy loss function: modeling the costs of mislabeling,” IEEE Access, vol. 8, pp. 4806-4813, 2019.
[25] M. Tzelepi and A. Tefas, “Improving the performance of lightweight CNNs for binary classification using quadratic mutual information regularization,” Pattern Recognition, vol. 106, article no. 107407, 2020. https://doi.org/10.1016/j.patcog.2020.107407
[26] P. Pawara, E. Okafor, M. Groefsema, S. He, L. R. Schomaker, and M. A. Wiering, “One-vs-One classification for deep neural networks,” Pattern Recognition, vol. 108, article no. 107528, 2020. https://doi.org/10.1016/j.patcog.2020.107528
[27] A. Chavda, J. Dsouza, S. Badgujar, and A. Damani, “Multi-stage CNN architecture for face mask detection,” 2020 [Online]. Available: https://arxiv.org/abs/2009.07627.
[28] B. Harangi, A. Baran, and A. Hajdu, “Assisted deep learning framework for multi-class skin lesion classification considering a binary classification support,” Biomedical Signal Processing and Control, vol. 62, article no. 102041, 2020. https://doi.org/10.1016/j.bspc.2020.102041
[29] M. Hammad, A. M. Iliyasu, I. A. Elgendy, and A. A. Abd El-Latif, “End-to-end data authentication deep learning model for securing IoT configurations,” Human-centric Computing and Information Sciences, vol. 12, article no. 4, 2022. https://doi.org/10.22967/HCIS.2022.12.004
[30] N. Hussain, M. A. Khan, S. Kadry, U. Tariq, R. R. Mostafa, J. I. Choi, and Y. Nam, “Intelligent deep learning and improved whale optimization algorithm based framework for object recognition,” Human-centric Computing and Information Sciences, vol. 11, article no. 34, 2021. https://doi.org/10.1109/HCIS.2021.11.034
[31] S. Jang and C. Choi, “Prioritized environment configuration for drone control with deep reinforcement learning,” Human-centric Computing and Information Sciences, vol. 12, article no. 2, 2022. https://doi.org/10.22967/HCIS.2022.12.002
[32] M. M. Salim, V. Shanmuganathan, V. Loia, and J. H. Park, “Deep learning enabled secure IoT handover authentication for blockchain networks,” Human-centric Computing and Information Sciences, vol. 11, article no. 21, 2021. https://doi.org/10.22967/HCIS.2021.11.021
[33] S. Rathore, J. H. Park, and H. Chang, “Deep learning and blockchain-empowered security framework for intelligent 5G-enabled IoT,” IEEE Access, vol. 9, pp. 90075-90083, 2021.
[34] S. K. Singh, A. E. Azzaoui, T. W. Kim, Y. Pan, and J. H. Park, “DeepBlockScheme: a deep learning-based blockchain driven scheme for secure smart city,” Human-centric Computing and Information Sciences, vol. 11, article no. 12, 2021. https://doi.org/10.22967/HCIS.2021.11.012
[35] K. Lee and N. Moon, “Digital signage system based on intelligent recommendation model in edge environment: the case of unmanned store,” Journal of Information Processing Systems, vol. 17, no. 3, pp. 599-614, 2021.
[36] E. Kristiani, C. T. Yang, C. Y. Huang, P. C. Ko, and H. Fathoni, “On construction of sensors, edge, and cloud (ISEC) framework for smart system integration and applications,” IEEE Internet of Things Journal, vol. 8, no. 1, pp. 309-319, 2020.
[37] C. Deb, “Face Mask Detection,” 2021 [Online]. Available: https://github.com/chandrikadeb7/Face-Mask-Detection.

About this article
Cite this article

Endah Kristiani, Yu-Tse Tsan, Po-Yu Liu, Neil Yuwen Yen, and Chao-Tung Yang, "Binary and Multi-Class Assessment of Face Mask Classification on Edge AI Using CNN and Transfer Learning," Human-centric Computing and Information Sciences, Article number: 12:53 (2022).

  • Received: 6 October 2021
  • Accepted: 5 March 2022
  • Published: 30 November 2022