1. Introduction
With the great advances of Deep Convolutional Neural Network (DCNN) techniques [26,31,36,37,41,62,67] and large-scale datasets [13,58] in computer vision and image understanding, DCNN-based techniques have become an extensive research topic in Face Recognition (FR) and have been widely used in many real-world applications. Consequently, more and more novel and efficient FR DCNNs [9,55,60,63,64,65,66,68] and margin-based loss functions [14,35,43,73] have been designed, replacing traditional methods [4,10,16,27,38,42,53,71,77,83] and becoming the mainstream approaches to FR. However, the large number of floating-point operations (FLOPs), large parameter counts, and huge model sizes lead to high computational complexity, making it difficult to deploy DCNN models on Internet of Things (IoT) or mobile devices with limited memory in practical applications [9,60,68], such as video surveillance, law enforcement, access control, marketing, smartphones, embedded systems, and wearable devices. Moreover, large-scale face datasets with pose changes, illumination changes, low resolution, and motion blur also pose a challenge to recognition accuracy [15,49]. To solve these problems, some scholars are therefore working on efficient and effective lightweight models with fewer parameters, lower FLOPs, and smaller model sizes.
To obtain lightweight FR models, many efforts have been devoted to keeping an optimal trade-off between accuracy and efficiency. It is therefore necessary to review recent lightweight FR models to inspire the development and application of lightweight FR. There have been several surveys [1,2,17,34,49,50,72,87] on FR that review almost all FR DCNN models. However, they do not cover recently published lightweight FR models; some of them focus on specific tasks; and few surveys reimplement the reviewed lightweight models. To the best of our knowledge, only one article [49] has provided a relevant survey of lightweight FR models, offering a benchmark of lightweight FR. In summary, end-to-end lightweight FR models need to be systematically reviewed from a variety of perspectives, yet few existing surveys attach importance to this task. Our survey therefore focuses on reviewing the lightweight FR models most needed in practical applications. Different from [49], this article classifies the existing state-of-the-art end-to-end lightweight models into five categories and reimplements several mainstream models. The main contributions can be summarized as follows:
Firstly, we categorize the lightweight FR models into five categories to inspire the exploration of new lightweight FR models.
Secondly, we introduce SqueezeFaceNet and EfficientFaceNet by pruning SqueezeNet and EfficientNet, and reimplement several of the most popular lightweight FR models. Comprehensive performance comparisons are also presented.
Thirdly, we present some of the challenges and future trends to inspire our future work.
The remainder of the paper is organized as follows. Section 2 briefly reviews FR tasks and lightweight models, and describes the categories of existing state-of-the-art end-to-end lightweight models. As an example, Section 3 introduces SqueezeFaceNet and EfficientFaceNet, obtained by pruning SqueezeNet and EfficientNet. In Section 4, we reimplement several mainstream models and present a detailed performance comparison of different lightweight models on nine test benchmarks. Section 5 presents some challenges and future work, and Section 6 concludes this work.
2. Face Recognition and Lightweight Model
2.1 Face Recognition
An automatic face recognition system, which consists of four main components: face detection, face alignment, feature extraction, and face matching, aims at implementing two different tasks, namely one-to-one (1:1) Face Verification and one-to-many (1:N) Face Identification [17]. Face verification judges whether two face images belong to the same identity, without needing to know the identity of either image. It is essentially a binary classification problem and is usually used in scenarios such as witness comparison and identity verification. Face identification determines the identity of a probe face descriptor against a registered face gallery. It is essentially a multi-class classification problem, with common application scenarios including access control systems, venue sign-in systems, and so on.
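Both tasks reduce to comparing similarities between face embeddings. A minimal sketch, assuming cosine similarity on learned embedding vectors (the threshold value is purely illustrative and is normally tuned on a validation set):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face embedding vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def verify(a: np.ndarray, b: np.ndarray, threshold: float = 0.5) -> bool:
    """1:1 verification: same identity iff similarity exceeds the threshold."""
    return cosine_similarity(a, b) >= threshold

def identify(probe: np.ndarray, gallery: list) -> tuple:
    """1:N identification: return the gallery index with the highest similarity."""
    scores = [cosine_similarity(probe, g) for g in gallery]
    best = int(np.argmax(scores))
    return best, scores[best]
```

Verification thus needs only a pairwise score and a threshold, while identification ranks the probe against every registered gallery descriptor.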
In view of the broad application prospects of FR, a series of traditional FR methods sprang up from the early 1990s, such as Eigenfaces [71], Fisherfaces [4], Bayesian eigenfaces [53], Laplacianfaces [27], Sparse Representation [16,77,83], Gabor features [42], and learning-based descriptors [10,38]. However, these methods either fail to address uncontrolled facial changes, lack distinctiveness and compactness, or produce only shallow representations [72]. To address these problems, building on LeNet [37], which was applied to handwritten digit recognition in 1990, and on AlexNet [37], NiN [41], VGG [62], GoogleNet [67], ResNet [26], and DenseNet [31], which won ImageNet competitions, many FR DCNN models have achieved remarkable progress, such as FaceNet [60], DeepFace [68], the DeepID series [63,64,65,66], VGGFace [55], and VGGFace2 [9]. In particular, various novel margin-based loss functions, such as ArcFace [14], SphereFace [43], CosFace [73], and AdaFace [35], have also greatly promoted recognition performance.
A standard pipeline of an automatic FR system is shown in <Figure 1>. When the system receives still images or video frames as input, face detection locates the face regions; then, the located face is calibrated and its pixels are resized and normalized in the face alignment stage; next, the feature extraction stage extracts discriminative features using DCNNs; finally, the face verification or face identification task is conducted in the face matching stage.
However, it is difficult to deploy these FR DCNNs on IoT or mobile devices with limited memory due to their large parameter counts, FLOPs, and model sizes. Thus, achieving an optimal trade-off between accuracy and efficiency is becoming more and more important.
2.2 Lightweight Model
Considering the huge demand for deploying FR models on IoT or mobile devices with limited memory, lightweight FR models have been researched in recent years to keep an optimal trade-off between performance and efficiency. These lightweight FR models also follow the standard pipeline of an automatic FR system shown in <Figure 1>. Nowadays, a variety of lightweight architectures achieve state-of-the-art (SOTA) performance, and they can be categorized into: (1) artificially designed lightweight FR models (ADLM), (2) pruned models for face recognition (PM), (3) efficient automatic neural network architecture design based on neural architecture search (ANND_NAS), (4) knowledge distillation (KD), and (5) low-rank decomposition (LRD). <Table 1> lists the categorization of lightweight FR models. We mainly present the name or author of each model, input size, MFLOPs, number of parameters, model size, accuracy on LFW [32], and publication year to facilitate comparison.
For the first class, ADLM means that researchers artificially design an efficient lightweight FR model that keeps an optimal trade-off between performance and efficiency. Wu et al. developed the LightCNN [78] family of architectures, namely LightCNN-4, LightCNN-9, and LightCNN-29, to learn a robust face representation on a noisily labeled dataset. LightCNN-4, LightCNN-9, and LightCNN-29 contain 4.095M, 5.556M, and 12.637M parameters, with about 1500, 1000, and 3900 MFLOPs, respectively. ConvFaceNeXt [29], designed by Hoo et al., stacks stem, bottleneck, and embedding partitions to construct the ConvFaceNeXt family; the largest model contains 1.05M parameters and 410.59 MFLOPs.
For the second class, PM means that researchers tailor FR models by pruning commonly used SOTA lightweight networks, including the MobileNet series [30,59], ShuffleNetV2 [46], MixConv [69], VarGNet [85], and so on, to construct MobileFaceNetV1 [49], MobileFaceNets [11], ShuffleFaceNet [48], MixFaceNet [6], VarGFaceNet [80], and so on. MobileFaceNetV1 [49] and MobileFaceNets [11] are designed based on MobileNetV2 [59]. MobileFaceNets adopts a typical inverted (reverse) residual block and needs to train about 1.03M parameters with 473.15 MFLOPs. A family of ShuffleFaceNet [48] models is built on ShuffleNetV2; the smallest, ShuffleFaceNet0.5×, has about 0.5M parameters and 66.9 MFLOPs but does not maintain competitive accuracy. Therefore, ShuffleFaceNet1×, with about 1.4M parameters and 275.8 MFLOPs, is commonly used.
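The efficiency of the inverted residual design can be seen from a simple parameter count. The sketch below compares one block (1×1 expansion, 3×3 depthwise convolution, 1×1 projection; biases folded into batch norm and ignored) with a plain 3×3 convolution. The channel and expansion numbers are illustrative, not the exact MobileFaceNets configuration:

```python
def inverted_residual_params(c_in: int, c_out: int,
                             expansion: int = 2, kernel: int = 3) -> int:
    """Parameter count of one inverted residual block:
    1x1 expand -> 3x3 depthwise -> 1x1 project."""
    c_mid = c_in * expansion
    expand = c_in * c_mid          # 1x1 pointwise expansion
    depthwise = c_mid * kernel**2  # depthwise conv: one kernel per channel
    project = c_mid * c_out        # 1x1 pointwise projection
    return expand + depthwise + project

def standard_conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Parameter count of a plain 3x3 convolution, for comparison."""
    return c_in * c_out * kernel**2
```

For 64 input and output channels with expansion 2, the block needs 17,536 parameters versus 36,864 for a plain 3×3 convolution, which is why stacks of such blocks stay around the 1M-parameter scale.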
For the third class, ANND_NAS means directly learning neural network architectures for FR via neural architecture search (NAS) [90]. The article [49] introduces a modified version of ProxylessFaceNAS based on ProxylessNAS [8], with 3.01M parameters and about 873.95 MFLOPs. Boutros et al. proposed the PocketNet family based on NAS and knowledge distillation [28], which contains 0.925M parameters and 587.11 MFLOPs.
For the fourth class, knowledge distillation (KD) [57,82], which trains a small student network under the supervision of a large teacher network, aims to obtain a compact neural network that reproduces the outputs of the large one. The KD-based EC-KD [74] and ShrinkTeaNet [18] can be found in <Table 1>.
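The core of KD can be sketched as matching temperature-softened teacher and student distributions with a KL-divergence loss. This is a minimal NumPy sketch; the temperature value is illustrative, and FR-specific variants such as EC-KD and ShrinkTeaNet add further terms on embeddings:

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax (numerically stabilized)."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray, T: float = 4.0) -> float:
    """KL divergence between softened teacher and student distributions;
    the T**2 factor keeps gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, T)  # soft targets from the large teacher
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T**2)
```

The loss is zero when the student exactly matches the teacher and grows as the two distributions diverge; in practice it is combined with the ordinary classification loss on hard labels.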
For the fifth class, low-rank decomposition (LRD) refers to using low-rank matrices to approximate the weight matrices of DCNNs, yielding compressed lightweight DCNNs. SILR [81] and LRRNet [86] can be found in <Table 1>.
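The idea can be sketched with a truncated SVD: a weight matrix W of size m×n is replaced by two factors holding r(m+n) parameters instead of mn. This is a minimal sketch; actual methods such as SILR choose ranks per layer and fine-tune the network afterwards:

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate a weight matrix W (m x n) by factors A (m x r) and B (r x n),
    replacing m*n parameters with r*(m + n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B
```

In a network, the original layer W x is then replaced by two thinner layers A (B x), which is exact when W truly has rank r and an approximation otherwise.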
These SOTA models aim to improve compactness and computational efficiency, overcoming the deployment difficulties caused by massive trainable parameters. Meanwhile, as shown in <Table 1>, we also simply prune SqueezeNet [33] and EfficientNet [70] to construct SqueezeFaceNet and EfficientFaceNet; the details are introduced in Section 3.
3. SqueezeFaceNet and EfficientFaceNet
As a simple supplement to the PM category, this section introduces SqueezeFaceNet and EfficientFaceNet, based on SqueezeNet [33] and EfficientNet [70].
3.1 SqueezeFaceNet
SqueezeFaceNet is simply pruned from SqueezeNet [33] and consists of fire modules and a Global Depthwise Convolution (GDC) [11] layer. The fire modules [33] extract useful features, and the GDC treats different spatial units with different importance [11]. Finally, we adopt the margin-based ArcFace [14] loss function. <Table 2> shows the architecture of SqueezeFaceNet.
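The GDC layer can be sketched as a depthwise convolution whose kernel spans the entire final feature map, so each spatial position receives its own learned weight rather than the uniform 1/(H·W) weighting of global average pooling. A minimal NumPy sketch with illustrative shapes:

```python
import numpy as np

def global_depthwise_conv(feat: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Global Depthwise Convolution over a (C, H, W) feature map: the kernel
    covers the whole spatial extent, producing one output value per channel."""
    assert feat.shape == weight.shape
    return (feat * weight).sum(axis=(1, 2))  # -> (C,) embedding vector
```

If the learned weights happen to be uniform, GDC reduces exactly to global average pooling; learning them instead lets the network emphasize, e.g., the central face region over the borders.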
3.2 EfficientFaceNet
EfficientFaceNet is simply pruned from EfficientNet [70] and consists of MBConv1 and MBConv6 modules and a GDC [11] layer. The MBConv modules [70] extract useful features, and the GDC treats different spatial units with different importance [11]. Finally, we adopt the margin-based ArcFace [14] loss function. <Table 3> shows the architecture of EfficientFaceNet.
4. Performance Comparison
This section comprehensively presents the performance of the most common models listed in <Table 1>. We reimplement EfficientFaceNet, SqueezeFaceNet, MixFaceNet, ShuffleFaceNet0.5×, MobileFaceNet, and LightCNN-9 to make a fair comparison by using the same computing resources, training dataset, and hyper-parameters. All code follows the original papers and ArcFace [14].
4.1 Experimental Settings
Train Dataset. We use the MS1MV3 [15] dataset, containing approximately 93K identities and 5.2M images, as the training dataset in our experiments. It is semi-automatically cleaned from MS1MV0 (about 10M images of 100K identities) [14] and is also an enhanced version of MS1MV2 (about 5.8M images of 85K identities) [25].
Test Datasets. To systematically compare the listed lightweight methods, nine challenging test datasets are used; the details are listed in <Table 4>. In the test stage, 1:N Face Identification is conducted on IJB-B and IJB-C, while 1:1 Face Verification is performed on all datasets, including IJB-B and IJB-C.
Implementation Details. In our experiments, all training details follow ArcFace [14]. The batch size is 128, and the optimizer is SGD with a learning rate of 0.1, a momentum of 0.9, and a weight decay of 1e-4. Training runs for 40 epochs, and the scale parameter s and margin value of the ArcFace loss are set to 64 and 0.5, respectively. We use a Linux machine (Ubuntu 18.04.1 LTS) with an Intel(R) Core(TM) i9-9900KS CPU @ 4, 32 GB RAM, and 3 Nvidia GeForce RTX 2060 (6 GB) GPUs for all experiments. The experiments are implemented in PyTorch v1.11.0 [56], and mixed-precision training [52] is employed to save GPU memory and accelerate training.
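With these settings, the ArcFace logits used during training can be sketched as follows. This is a minimal NumPy forward pass for illustration; the real implementation operates on GPU tensors and feeds the resulting logits to softmax cross-entropy:

```python
import numpy as np

def arcface_logits(embeddings: np.ndarray, weights: np.ndarray,
                   labels: np.ndarray, s: float = 64.0, m: float = 0.5):
    """ArcFace: add an angular margin m to the target-class angle, then scale
    by s. Embeddings (B, d) and class weights (C, d) are L2-normalized so
    their dot product equals cos(theta)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    theta[rows, labels] += m       # margin penalty on the ground-truth class
    return s * np.cos(theta)
```

The margin lowers the ground-truth logit relative to the plain cosine, forcing the network to push genuine pairs closer together on the hypersphere than the decision boundary alone would require.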
4.2 Evaluation and Comparison
The 1:N Face Identification and 1:1 Face Verification tasks are conducted in this section. All experiments compare not only the number of parameters, model size, and FLOPs of different models but also their recognition accuracy. Lower parameter counts, model sizes, and MFLOPs are better, whereas higher accuracies are better. <Table 5>, <Table 6>, and <Table 7> are each divided into two parts by a dotted line: the data above the dotted line are taken from the related papers, and the data below it come from our reimplementation experiments.
1:1 Face Verification Evaluation Results.
A series of 1:1 face verification experiments is conducted on the LFW, CA-LFW, AgeDB-30, CP-LFW, CFP-FP, CFP-FF, and VGG2-FP datasets, and performance is reported as the accuracy of 10-fold cross-validation. The referenced and reimplemented results are all listed in <Table 5>.
To better investigate the performance of lightweight models on the two IJB datasets, which combine high- and low-quality images and video frames, a series of 1:1 face verification experiments is conducted on both IJB datasets. TAR@FAR is reported as the face verification metric (higher is better). The referenced and reimplemented results are all listed in <Table 6>.
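The TAR@FAR metric can be computed by setting an acceptance threshold from the impostor (different-identity) score distribution and then measuring the genuine acceptance rate at that threshold. A minimal sketch, assuming higher scores mean more similar pairs:

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far: float = 1e-4) -> float:
    """TAR@FAR: pick the threshold at which the fraction of accepted impostor
    pairs equals `far`, then report the fraction of accepted genuine pairs."""
    impostor = np.sort(np.asarray(impostor_scores))[::-1]
    k = max(int(far * len(impostor)), 1)
    threshold = impostor[k - 1]               # k-th highest impostor score
    return float(np.mean(np.asarray(genuine_scores) >= threshold))
```

Reporting TAR at a fixed, very low FAR (e.g. 1e-4) reflects deployment settings where falsely accepting a stranger is far more costly than rejecting a genuine user.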
1:N Face Identification Evaluation Results.
A series of 1:N face identification experiments is conducted on both IJB datasets. Rank-1 and Rank-5 accuracy are reported as the face identification metrics (higher is better). The referenced and reimplemented results are all listed in <Table 7>.
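Rank-k accuracy can be sketched directly from the probe-versus-gallery score matrix: a probe counts as correct if its true gallery identity appears among the k highest-scoring gallery entries (k = 1 and k = 5 for the tables above):

```python
import numpy as np

def rank_k_accuracy(score_matrix: np.ndarray, true_ids, k: int = 1) -> float:
    """Fraction of probes whose true gallery identity appears among the
    k highest-scoring gallery entries of its row."""
    order = np.argsort(-score_matrix, axis=1)  # gallery indices, best first
    hits = [true_ids[i] in order[i, :k] for i in range(len(true_ids))]
    return float(np.mean(hits))
```

Rank-5 is never lower than Rank-1 on the same scores, since enlarging k only adds candidates to the accepted set.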
As seen from <Table 5>, <Table 6>, and <Table 7>, generally speaking, more trainable parameters mean higher accuracy but also higher computational complexity and larger model size. Meanwhile, we can also find some small yet refined models, such as MobileFaceNet, ELANet, and MixFaceNet.
4.3 Performance vs. the Number of Parameters
This section plots the number of parameters against the performance of the lightweight models to present the trade-off between the two. As shown in <Figure 2>, the x-axes of the subplots from (a) to (i) represent LFW (accuracy), CALFW (accuracy), CPLFW (accuracy), CFP-FF (accuracy), CFP-FP (accuracy), AgeDB-30 (accuracy), VGG2-FP (accuracy), IJB-C (TAR at FAR=1e-4), and IJB-B (TAR at FAR=1e-4), respectively, and the y-axis is the number of parameters. Different methods are highlighted with different markers and colors. Models that achieve high performance with fewer parameters indicate a better trade-off, and we find that MobileFaceNet and MixFaceNet report a much better trade-off.
5. Challenge and Future Work
Although lightweight FR has achieved remarkable performance, many challenges remain. In our opinion, one challenge is how to design small yet refined lightweight models that keep an optimal trade-off between performance and efficiency. Another is how to handle diverse training and test datasets that include large facial poses, extreme expressions, occlusion, facial scale variation, motion blur, low illumination, and large scale. Finally, the interpretability of lightweight models is also another worthy challenge.
Our future work will therefore focus on addressing the above challenges. For designing lightweight models, we will consider manually designing architectures or using automatic machine learning methods such as NAS to search for networks. For diverse training and test datasets, we will consider cross-age, cross-pose, cross-race settings, and so on. For the interpretability of lightweight models, we mainly consider explanation in the spatial and scale dimensions.
6. Discussion
In this survey, we comprehensively reviewed recent lightweight models for face recognition. Firstly, a standard pipeline of an automatic lightweight FR system was presented. Secondly, we categorized the listed lightweight models into ADLM, PM, ANND_NAS, KD, and LRD according to their design modes. Thirdly, we introduced SqueezeFaceNet and EfficientFaceNet by pruning SqueezeNet and EfficientNet. Fourthly, we reimplemented EfficientFaceNet, SqueezeFaceNet, MixFaceNet, ShuffleFaceNet0.5×, MobileFaceNet, and LightCNN-9 to eliminate the influence of different computing resources, training datasets, and hyper-parameter settings. These results can be seen as a benchmark for comparison and direct reference. This survey is intended to inspire future work and indicates that future models should be small and refined, handle diverse datasets, and be interpretable.