Back
Disclaimer: These are my personal notes on this paper. I am in no way related to this paper. All credits go towards the authors.
FaceNet: A Unified Embedding for Face Recognition and Clustering
March 12, 2015
Paper Link
Tags: Framework
Summary
Learns a mapping from face images to a compact Euclidean space (128 dimensions) where distances directly correspond to a measure of face similarity. Uses a deep convolutional network. Uses an online triplet mining method, where each triplet consists of an anchor (base sample), a positive example of the same identity, and a negative example of a different identity.
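The core idea can be sketched as a verification check on embedding distances. This is a minimal sketch, not the paper's code: the 128-D embeddings are stand-ins for network outputs, and the threshold value is a hypothetical tuning choice (the paper tunes it per evaluation protocol).

```python
import numpy as np

def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    d = a - b
    return float(np.dot(d, d))

def same_person(emb1, emb2, threshold=1.1):
    """Verification: faces of the same identity should fall
    below a distance threshold (value here is hypothetical)."""
    return squared_l2(emb1, emb2) < threshold

# Toy 128-D unit-norm embeddings standing in for network outputs.
rng = np.random.default_rng(0)
e1 = rng.normal(size=128); e1 /= np.linalg.norm(e1)
e2 = e1 + 0.05 * rng.normal(size=128); e2 /= np.linalg.norm(e2)  # near-duplicate
e3 = rng.normal(size=128); e3 /= np.linalg.norm(e3)              # unrelated face

print(same_person(e1, e2))  # small distance: same identity
print(same_person(e1, e3))  # large distance: different identity
```

Because the embeddings live in a plain Euclidean space, the same distance test also drives clustering and k-NN recognition without any task-specific retraining.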
Notes
- Uses the Labeled Faces in the Wild (99.63% accuracy) and YouTube Faces DB datasets (95.12% accuracy).
- Uses squared L2 distances in the embedding space
- Recognition reduces to a k-NN classification problem on the embeddings (the neural network output)
- Only scaling and translation are applied to the face thumbnails
- Employs hard-positive mining techniques to encourage spherical clusters for the embeddings of a single person (for clustering)
- Purely data driven method which learns its representation directly from the pixels of the face.
- Evaluates two architectures: a Zeiler & Fergus-style network and the Inception model
- Triplet Loss: minimize the distance between an anchor (base sample) and a positive sample while keeping the anchor-to-negative distance larger by at least a margin
- 22 layers. Outlined in Table 1 in the paper.
- Trained the neural network using Stochastic Gradient Descent with standard backpropagation and AdaGrad
- Inception drastically reduced model size
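The triplet loss and the online (in-batch) semi-hard negative selection described above can be sketched as follows. This is a simplified numpy illustration under my own assumptions, not the paper's implementation; the fallback to the hardest negative when no semi-hard one exists is a simplification of mine.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge-style triplet loss: the anchor-positive squared distance
    should be smaller than the anchor-negative squared distance by at
    least the margin alpha."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + alpha)

def semi_hard_negative(anchor, positive, candidates, alpha=0.2):
    """Online semi-hard mining within a batch: prefer a negative that
    is farther than the positive but still inside the margin, i.e.
    d_ap < d_an < d_ap + alpha, so the triplet is informative without
    being dominated by mislabeled or extreme examples. Falls back to
    the hardest (closest) negative if none qualifies (a simplification)."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_ans = np.sum((candidates - anchor) ** 2, axis=1)
    mask = (d_ans > d_ap) & (d_ans < d_ap + alpha)
    if mask.any():
        idx = int(np.argmin(np.where(mask, d_ans, np.inf)))
    else:
        idx = int(np.argmin(d_ans))
    return candidates[idx]
```

Selecting negatives inside the margin (rather than the globally hardest ones) is what keeps training stable early on, since the hardest negatives in a batch are often noise.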
Analysis
- Takes 1000 to 2000 hours to train
- Requires a tight crop of the face area
- Uses a proprietary face detector to get best results
Citation: Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "Facenet: A unified embedding for face recognition and clustering." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.