Deep Convolutional Neural Network Encoding of Face Shape and Reflectance in Synthetic Face Images


August 2023




Deep Convolutional Neural Networks (DCNNs) trained for face identification recognize faces across a wide range of imaging and appearance variations, including illumination, viewpoint, and expression. In the first part of this dissertation, I showed that identity-trained DCNNs retain non-identity information in their top-level face representations, and that this information is hierarchically organized within those representations (Hill et al., 2019). Specifically, the similarity space was separated into two large clusters by gender; identities formed sub-clusters within gender, illumination conditions clustered within identity, and viewpoints clustered within illumination conditions. In the second part of this dissertation, I further examined the representations generated by face identification DCNNs by separating face identity into its constituent signals of “shape” and “reflectance”. Object classification DCNNs demonstrate a bias for “texture” over “shape” information, whereas humans show the opposite bias (Geirhos et al., 2018). No studies comparing “shape” and “texture” information have yet been performed on DCNNs trained for face identification. Here, I used a 3D Morphable Model (3DMM; Li, Bolkart, Black, Li, & Romero, 2017) to determine the extent to which face identification DCNNs encode the shape and/or spectral reflectance information in a face. I also investigated the presence of illumination, expression, and viewpoint information in the top-level representations of face images generated by DCNNs. Synthetic face stimuli were generated using a 3DMM with separate components for a face shape’s “identity” and “facial expression”, as well as spectral reflectance information in the form of a “texture map”.
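The separation of identity shape, expression, and reflectance described above follows the usual linear 3DMM formulation: mesh vertices are the model's mean shape plus weighted identity and expression components, with reflectance stored separately as a texture map. The sketch below is a toy illustration of that decomposition; the dimensions, component counts, and function name are illustrative assumptions, not the actual FLAME model.

```python
# Toy sketch of a linear 3DMM: vertices = mean + identity offsets + expression
# offsets. Reflectance (the texture map) is a separate signal, not shown here.
# All names and dimensions are illustrative, not taken from FLAME.

def morphable_face(mean_shape, shape_basis, expr_basis, shape_coeffs, expr_coeffs):
    """Return mesh vertices as the mean plus linear shape/expression offsets."""
    n = len(mean_shape)
    verts = list(mean_shape)
    for basis, coeffs in ((shape_basis, shape_coeffs), (expr_basis, expr_coeffs)):
        for component, c in zip(basis, coeffs):
            for i in range(n):
                verts[i] += c * component[i]
    return verts

# Tiny example: a "mesh" of 3 scalar coordinates, one component per basis.
mean = [0.0, 0.0, 0.0]
shape_basis = [[1.0, 0.0, 0.0]]   # identity ("shape") component
expr_basis = [[0.0, 1.0, 0.0]]    # expression component
print(morphable_face(mean, shape_basis, expr_basis, [2.0], [0.5]))
# → [2.0, 0.5, 0.0]
```

Because identity and expression live in separate bases, either signal can be varied while the other (and the texture map) is held fixed, which is what makes the shape-versus-reflectance comparison in this dissertation possible.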
The dataset comprised ten randomized levels each of face shape, reflectance, and expression, with three levels of illumination (spotlight, ambient, 3-point), three levels of viewpoint pitch (-30°, 0°, 30°), and five levels of viewpoint yaw (0°, 15°, 30°, 45°, 60°) in a complete factorial design, for a total of 45,000 images. All analyses were conducted with an Inception ResNet V1-based network (Szegedy, Ioffe, Vanhoucke, & Alemi, 2017) trained on the VGGFace2 dataset (Cao, Shen, Xie, Parkhi, & Zisserman, 2018) and replicated with a ResNet-101-based network (He, Zhang, Ren, & Sun, 2016) trained on the University of Maryland’s Universe dataset (Bansal, Castillo, Ranjan, & Chellappa, 2017; Bansal, Nanduri, Castillo, Ranjan, & Chellappa, 2017; Guo, Zhang, Hu, He, & Gao, 2016). Area Under the Receiver Operating Characteristic Curve (AUC) was used as a measure of the information available about each variable in the top-level representation, and t-distributed Stochastic Neighbor Embedding (t-SNE; Van der Maaten & Hinton, 2008) was used to visualize the similarity space of top-level representations. The results showed that both shape and reflectance information were encoded in the top-level representation, and that both signals were required for optimal performance. Shape-reflectance bias was mediated by illumination: the network showed a reflectance bias under ambient and 3-point (photography-style) illumination, whereas no bias was found under spotlight illumination. Consistent with Hill et al. (2019), we found information about all non-identity variables (illumination, expression, pitch, yaw) in the top-level representation, although each of these signals was weakly encoded.
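The factorial design and the AUC measure described above can be sketched briefly. The enumeration below reproduces the 45,000-condition count from the stated factor levels, and the `auc` helper is a generic rank-based (Mann-Whitney) AUC, shown here only to illustrate the measure; it is not the dissertation's actual analysis code, and the level labels are placeholders.

```python
# Complete factorial design: 10 shapes x 10 reflectances x 10 expressions
# x 3 illuminations x 3 pitches x 5 yaws = 45,000 stimulus conditions.
from itertools import product

conditions = list(product(
    range(10),                            # face shape ("identity") levels
    range(10),                            # reflectance (texture map) levels
    range(10),                            # expression levels
    ["spotlight", "ambient", "3-point"],  # illumination
    [-30, 0, 30],                         # viewpoint pitch, degrees
    [0, 15, 30, 45, 60],                  # viewpoint yaw, degrees
))
print(len(conditions))  # → 45000

def auc(pos_scores, neg_scores):
    """Rank-based AUC: fraction of (pos, neg) pairs ranked correctly
    (ties count as half). Equivalent to the Mann-Whitney U statistic,
    normalized by the number of pairs."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative scores: similarities for same-level vs. different-level pairs.
print(auc([0.9, 0.3], [0.5, 0.1]))  # → 0.75
```

An AUC near 1.0 for a variable (e.g., matched vs. mismatched illumination) indicates that the variable is strongly encoded in the top-level representation; an AUC near 0.5 indicates it is absent.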



Artificial Intelligence, Psychology, Cognitive