Facial Emotion Identification Using Deep Convolutional Neural Networks
P Daisy1, Priyanka Kumari Bhansali2, 1, 2 Department of Computer Science and Systems Engineering, Andhra University College of Engineering, Andhra University, AP, INDIA
Abstract— The ability to recognize facial expressions of emotion is vital for effective social interaction. Facial expressions convey non-verbal information between humans in face-to-face interactions. Automatic facial expression recognition, which plays a vital role in human-machine interfaces, has attracted increasing attention from researchers since the early nineties.
In this paper, we apply recent advances in deep learning to propose effective deep Convolutional Neural Networks (CNNs) that can classify the six basic emotions accurately. The project’s goal consists of training a deep convolutional neural network with labeled images of static facial emotions. We used the Kaggle “Emotion Detection From Facial Expressions” dataset. The SqueezeNet architecture has been employed to improve speed and accuracy.
Emotions are very important in human decision making, interaction and cognitive processes [1]. Emotion can be recognized through a variety of means such as voice intonation, body language, and more complex methods such as electroencephalography (EEG) [2]. According to several surveys, verbal components convey one-third of human communication and two-thirds is conveyed by non-verbal components. The most practical and simple method is to examine facial expressions. Facial expressions are privileged relative to other non-verbal channels of communication, such as vocal inflections and body movements. Therefore, it is no surprise that facial expression identification has become a subject of much recent research. There are seven types of human emotions shown to be universally recognizable across different cultures [3]: anger, disgust, fear, happiness, sadness, surprise, and contempt.
Although facial expressions can be easily recognized by humans, facial expression identification is still a great challenge for machines. It is an interesting and challenging problem due to its wide range of applications, such as human-computer interaction and data-driven animation. Therefore, there has been considerable research in computer vision systems to recognize facial expressions. The recent success of Convolutional Neural Networks (CNNs) in tasks such as object classification and face recognition has been extended to the problem of facial emotion identification. Fig. 1 shows the basic architecture of a CNN.
Convolutional Neural Networks(CNN)
A CNN combines deep learning with artificial neural networks. The rapid development of deep learning and the application of CNNs to classification problems have achieved great success [4], [5], [6]. This success is due to the fact that feature extraction and classification can be performed simultaneously. Critical features are extracted by deep learning methods by updating the weights using back-propagation and error optimization.
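As a toy illustration of this back-propagation weight-update idea (not the paper's code; the data, learning rate and iteration count below are illustrative), a single linear unit can be fitted by repeatedly following the error gradient:

```python
import numpy as np

# Toy single-unit example: learn weight w so that w * x approximates y.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # true relationship: y = 2x
w = 0.0                          # initial weight
lr = 0.1                         # learning rate (illustrative)

for _ in range(50):
    y_hat = w * x                    # forward pass
    error = y_hat - y                # prediction error
    grad = 2 * np.mean(error * x)    # d(MSE)/dw
    w -= lr * grad                   # gradient-descent weight update

print(round(w, 3))  # converges toward 2.0
```

A CNN performs the same kind of update for every filter weight, with the gradients propagated backwards through the layers.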
Fig. 1. Basic architecture of CNN
CNNs are biologically-inspired variants of multi-layer perceptron (MLP) networks. The CNN architecture is particularly well suited to classifying images. The connections between layers, the weights associated with them, and some form of subsampling yield features that are invariant to translation, which is useful for image classification. This architecture also makes CNNs fast to train.
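To make the subsampling step concrete, the following is a minimal 2×2 max-pooling sketch in NumPy (the feature-map values are illustrative, not from the paper):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by taking the max over non-overlapping 2x2 blocks."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 3, 2],
                 [2, 6, 0, 1]])
print(max_pool_2x2(fmap))
# [[4 5]
#  [6 3]]
```

Because only the strongest activation in each block survives, small spatial shifts of a feature often leave the pooled output unchanged, which is the source of the translation invariance mentioned above.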
In this section, related work is discussed. Many researchers have attempted to recognize the facial expression of an individual by matching it to samples in a particular database of faces.
Facial expressions have been studied for decades [7].
The Facial Action Coding System (FACS) was developed to describe facial emotions in terms of so-called action units [8], [9]. Facial expressions can be defined as combinations of active facial muscles. According to a survey by Samal and Iyengar, research on automatic facial expression recognition was not very active before the 1990s [10]. Early work by Bartlett et al. [11], [12] attempted to automate FACS annotation. A hybrid approach combining all representations outperformed human non-experts. These results indicated that computer vision methods can simplify this task. Lien et al. [13] built a system using Hidden Markov Models (HMMs) to automate emotion recognition based on FACS annotation.
Since then, there have been many advances in face-related tasks, such as face detection [14], [15], [16], facial landmark localization [15], [16] and face verification [17], [18]. Many other approaches include pyramid histograms of gradients (PHOG) [19], AU-aware facial features [20], boosted LBP descriptors [21] and RNNs [22]. However, recent top submissions [23], [24] to the 2015 Emotions in the Wild (EmotiW 2015) contest for static images all used deep convolutional neural networks (CNNs), achieving up to 62% test accuracy.
A recent development by G. Levi et al. [25] showed significant improvement in facial emotion recognition using a CNN. The authors addressed two salient problems: 1) the small amount of data available for training deep CNNs, and 2) appearance variation, usually caused by changes in illumination. They used Local Binary Patterns (LBP) to transform the images into an illumination-invariant, 3D space that could serve as input to a CNN. This special data pre-processing was applied to various publicly available models such as VGG. The model was re-trained on the large CASIA WebFace dataset [26] and transfer-learned on the Static Facial Expressions in the Wild (SFEW) dataset, a smaller database of labeled facial emotions released for the EmotiW 2015 challenge [26]. Final results showed a test accuracy of up to 54.56%.
The dataset for this project is from the Kaggle competition “Emotion Detection From Facial Expressions”, which is comprised of 350×350-pixel images of human faces. The training set consists of 13,719 examples, of which we split off 25% for validation, and the test set consists of 263 examples. There are two directories: the images directory contains the raw images, and the data directory contains files specific to training. Most importantly, it includes a CSV file that maps each image in the images directory to a facial expression; a test folder contains images for testing the accuracy of the model.
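A minimal sketch of this loading-and-splitting step, assuming a hypothetical CSV layout with `image` and `emotion` columns (the column names, paths and seed are assumptions, not from the paper):

```python
import csv
import random

def load_labels(csv_path):
    """Read (image filename, emotion label) pairs from the labels CSV.

    Assumes columns named 'image' and 'emotion'; the real file's
    column names may differ.
    """
    with open(csv_path, newline="") as f:
        return [(row["image"], row["emotion"].lower()) for row in csv.DictReader(f)]

def train_val_split(samples, val_fraction=0.25, seed=42):
    """Shuffle reproducibly and hold out val_fraction of the samples."""
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    n_val = int(len(samples) * val_fraction)
    return samples[n_val:], samples[:n_val]   # (train, validation)

# Usage with synthetic entries standing in for the 13,719 real examples:
samples = [(f"img{i}.png", "happiness") for i in range(100)]
train, val = train_val_split(samples)
print(len(train), len(val))  # 75 25
```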
Fig. 2. Sample images from the dataset.
Each image is categorized into one of the seven classes that express different facial emotions: “anger”, “disgust”, “fear”, “happiness”, “sadness”, “surprise” and “neutral”.
The actual competition dataset contains images of eight classes, including “contempt” as the eighth class. As the number of images with the contempt label is very small, it is difficult for the model to accurately predict the eighth class. So, we merged the “contempt” class into “disgust” to simplify the task.
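This label merge can be sketched as a simple remapping (the exact label spellings in the dataset's CSV may differ from the illustrative strings used here):

```python
# Merge the rare "contempt" class into "disgust", keeping seven classes.
CLASSES = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def normalize_label(label):
    """Lower-case a raw label and fold 'contempt' into 'disgust'."""
    label = label.strip().lower()
    return "disgust" if label == "contempt" else label

labels = ["anger", "contempt", "happiness", "contempt"]
merged = [normalize_label(l) for l in labels]
print(merged)  # ['anger', 'disgust', 'happiness', 'disgust']
```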
The data in this dataset comes from a variety of sources, including:
- Labeled Faces in the Wild
- The Japanese Female Facial Expression (JAFFE) Database
- Indian Movie Face Database (IMFDB)
- The Extended Yale Face Database B

Figure 2 depicts sample images from the above-mentioned dataset for illustration purposes.
We developed CNNs with variable depths to evaluate the performance of the models for facial expression recognition. A typical CNN architecture contains all or some of the following layer types:
[Conv(ReLU) → Max-pooling with Dropout] × M → [Fully-connected(ReLU)] × N → Softmax.
The first part of the network usually contains convolutional layers with ReLU activation followed by max-pooling layers; spatial max-pooling, dropout and even batch normalization can also be included at this stage. After stacking M such blocks, the network leads into fully-connected layers, each consisting of an affine operation with ReLU nonlinearity, which can also include normalization and dropout. Finally, an affine layer connects to the class nodes, where scores are computed and probabilities are obtained using the softmax function.
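The final score-to-probability step can be written out directly; the following is a numerically stable softmax sketch in NumPy (the score values are illustrative):

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores to probabilities (stable form: subtract the max)."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(scores)
print(probs.sum())  # 1.0
```

The predicted class is simply the index of the largest probability; during training, the negative log of the true class's probability is the softmax (cross-entropy) loss.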
We used a recent deep CNN model in our experiment that is representative of the most popular network architectures. The model we used is a variant of the SqueezeNet network from [27]. The network features extreme reductions in parameter count and computational complexity via channel-projection bottlenecks (or squeeze layers), and uses identity-mapping shortcut connections, similar to residual networks [28], which allow for stable training of deeper network models. SqueezeNet was demonstrated to achieve performance comparable to AlexNet [29] on the ImageNet large-scale recognition benchmark with substantial reductions in model complexity and parameter count. The model is composed of so-called “fire modules”, in which the input map is first fed through a bottlenecking channel-projection layer and then divided into two channel sets: the first is expanded through a 3×3 convolution and the other through channel projection. The final convolution map is globally average-pooled into a 512-dimensional vector and then fed to a fully-connected layer with 2048 units. The output of this last layer is the SqueezeNet image descriptor used in our experiment.
fire_module = 1×1 filters (squeeze_module) → [1×1 filters + 3×3 filters] (expand_module).
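The parameter saving of a fire module over a plain 3×3 convolution follows from simple counting. The channel sizes below are illustrative, not taken from the paper:

```python
# Fire module parameter count (ignoring biases), for illustrative channel sizes.
in_ch = 128        # input channels
s1 = 16            # squeeze 1x1 filters
e1, e3 = 64, 64    # expand 1x1 and 3x3 filters

squeeze_params = in_ch * s1 * 1 * 1                  # 1x1 squeeze layer
expand_params = s1 * e1 * 1 * 1 + s1 * e3 * 3 * 3    # 1x1 and 3x3 expand layers
fire_params = squeeze_params + expand_params

# A plain 3x3 convolution producing the same 128 output channels, for comparison:
plain_params = in_ch * (e1 + e3) * 3 * 3

print(fire_params, plain_params)  # 12288 147456
```

The bottleneck works because the expensive 3×3 filters only ever see the small squeezed channel count, roughly a 12× reduction in this illustrative configuration.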
Fig. 3. Example SqueezeNet architecture.
First, we normalized the images of the dataset in color and size. Then, batches of images were fed to the aforementioned SqueezeNet model for key feature extraction and training. We used Keras and TensorFlow as the implementation frameworks for this experiment. Our network for emotion identification was developed on the Google Colab platform to utilize cloud GPU services for building the model. Using SqueezeNet, nearly 80% accuracy has been achieved.
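The exact normalization used is not specified above; a minimal sketch of the color-normalization step, assuming simple scaling of 8-bit pixel values to [0, 1]:

```python
import numpy as np

def normalize_batch(images):
    """Scale uint8 pixel values (0-255) to float32 values in [0, 1]."""
    return images.astype(np.float32) / 255.0

# A synthetic batch of four 350x350 RGB images, matching the dataset's image size.
batch = np.random.randint(0, 256, size=(4, 350, 350, 3), dtype=np.uint8)
normed = normalize_batch(batch)
print(normed.min() >= 0.0 and normed.max() <= 1.0)  # True
```

Resizing to the network's input resolution and per-channel mean subtraction are common further steps, but are omitted here since the paper does not detail them.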
Results and Discussion
The validation accuracy for 10 epochs of training is shown in Fig. 4 below:
Fig. 4. Accuracy of the model over 10 epochs.
The loss and accuracy of the model are plotted in Fig. 5 below:
Fig. 5. Plot of loss and accuracy of the model.
The confusion matrix obtained after testing the model is depicted in Fig. 6:
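A confusion matrix is built by counting, for each test image, which true class was predicted as which class. The following NumPy sketch (with illustrative three-class labels, not our seven-class results) shows the computation:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm)
# [[1 1 0]
#  [0 1 0]
#  [1 0 2]]
```

The diagonal holds correct predictions, so overall accuracy is the trace divided by the total count (4/6 here); off-diagonal cells reveal which emotions the model confuses.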
Fig. 6. Confusion matrix.
The goal of the project is automated emotion identification; for this purpose we used SqueezeNet to improve speed and accuracy. SqueezeNet is a smaller CNN model that is easier to deploy on mobile devices. Together with model compression techniques, SqueezeNet can be compressed to less than 0.5 MB, which can fully fit in on-chip SRAM, making it easier to deploy on embedded devices.
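The relationship between parameter count and model size is simple arithmetic; the parameter count below is an illustrative SqueezeNet-scale figure (an assumption, not measured from our trained model):

```python
# Rough uncompressed model-size arithmetic for float32 weights.
params = 1_250_000              # illustrative SqueezeNet-scale parameter count
bytes_per_weight = 4            # float32
size_mb = params * bytes_per_weight / 2**20
print(round(size_mb, 2))  # 4.77
```

Quantization and pruning shrink this further by using fewer bits per weight and dropping near-zero weights, which is how sub-megabyte deployments become possible.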
While the results achieved were not state-of-the-art, they were slightly better than other techniques, including feature-engineering approaches. This suggests that deep learning techniques will eventually be able to solve this problem given a sufficient amount of labeled examples.
[1] M. Sreeshakthy and J. Preethi, “Classification of human emotion from DEAP EEG signal using hybrid improved neural networks with cuckoo search,” BRAIN. Broad Research in Artificial Intelligence and Neuroscience, vol. 6, no. 3-4, pp. 60-73, 2016.
[2] P. Abhang, S. Rao, B. W. Gawali, and P. Rokade, “Emotion recognition using speech and EEG signal: a review,” International Journal of Computer Applications, vol. 15, pp. 37-40, February 2011.
[3] P. Ekman, Universals and Cultural Differences in Facial Expressions of Emotion. Lincoln, NE, USA: University of Nebraska Press, 1971.
[4] I. J. Goodfellow et al., “Challenges in representation learning: A report on three machine learning contests,” in Neural Information Processing, ICONIP, 2013, pp. 117-124.
[5] Z. Yu and C. Zhang, “Image based static facial expression recognition with multiple deep network learning,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction, November 2015, pp. 435-442.
[6] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre, R. Memisevic, and M. Mirza, “Combining modality specific deep neural networks for emotion recognition in video,” in Proceedings of the 15th ACM International Conference on Multimodal Interaction, December 2013, pp. 543-550.
[7] P. Ekman and D. Keltner, “Universal facial expressions of emotion,” California Mental Health Research Digest, vol. 8, no. 4, pp. 151-158, 1970.
[8] C.-H. Hjortsjö, Man’s Face and Mimic Language. Studentlitteratur, 1969.
[9] P. Ekman and W. V. Friesen, Facial Action Coding System, 1977.
[10] A. Samal and P. A. Iyengar, “Automatic recognition and analysis of human faces and facial expressions: A survey,” Pattern Recognition, vol. 25, no. 1, pp. 65-77, 1992.
[11] M. S. Bartlett, J. C. Hager, P. Ekman, and T. J. Sejnowski, “Measuring facial expressions by computer image analysis,” Psychophysiology, vol. 36, no. 2, pp. 253-263, 1999.
[12] M. S. Bartlett, P. A. Viola, T. J. Sejnowski, B. A. Golomb, J. Larsen, J. C. Hager, and P. Ekman, “Classifying facial action,” in Advances in Neural Information Processing Systems, 1996, pp. 823-829.
[13] J. J. Lien, T. Kanade, A. J. Zlochower, J. F. Cohn, and C.-C. Li, “Automatically recognizing facial expressions in spatio-temporal domain using hidden Markov models,” in Proceedings of the Workshop on Perceptual User Interfaces, 1997, pp. 94-97.
[14] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2001, vol. 1, pp. I-511–I-518.
[15] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2879-2886.
[16] Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 2013, pp. 3476-3483.
[17] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, vol. 1, pp. 539-546.
[18] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701-1708.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” NIPS, vol. 1, p. 4, 2012.
[20] A. Yao, J. Shao, N. Ma, and Y. Chen, “Capturing AU-aware facial features and their latent relations for emotion recognition in the wild,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction (ICMI ’15), New York, NY, USA, 2015, pp. 451-458.
[21] C. Shan, S. Gong, and P. W. McOwan, “Facial expression recognition based on local binary patterns: A comprehensive study,” Image and Vision Computing, vol. 27, no. 6, pp. 803-816, 2009.
[22] S. Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, “Recurrent neural networks for emotion recognition in video,” in ICMI, 2015, pp. 467-474.
[23] Z. Yu and C. Zhang, “Image based static facial expression recognition with multiple deep network learning,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction (ICMI ’15), New York, NY, USA, 2015, pp. 435-442.
[24] B. Kim, J. Roh, S. Dong, and S. Lee, “Hierarchical committee of deep convolutional neural networks for robust facial expression recognition,” Journal on Multimodal User Interfaces, pp. 1-17, 2016.
[25] G. Levi and T. Hassner, “Emotion recognition in the wild via convolutional neural networks and mapped binary patterns,” in Proc. ACM International Conference on Multimodal Interaction (ICMI), November 2015.
[26] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), November 2011, pp. 2106-2112.
[27] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.