IBM Research Releases a Diverse Million-Face Dataset for Reducing the Biases in Facial Recognition

IBM Research has released the Diversity in Faces dataset to reduce bias in AI-based facial recognition. Bias in machine learning models is unavoidable, but IBM has released this new dataset of 1 million faces to reduce its extent in face recognition.

Facial recognition has been gaining real-world applications, with adoption in a plethora of digital devices including smartphones and home security systems. It can now be used to unlock a smartphone or to open a door on arrival. More unusual applications of the technology claim to estimate a person’s mood and likelihood of committing criminal acts. However, most of these applications are prone to bias and misjudgment, and are not good enough to pass simple tests.

This is a multi-layered problem, as creators and developers have not paid attention to a fundamental issue: the lack of sufficient representation in the data. The real question is a simple one: how well can a system work with a dataset that does not include enough people? IBM Research wants to address this issue by building a diverse and comprehensive dataset of 1 million faces. The images are sourced from the Flickr Creative Commons dataset, which contains 100 million images.

The images are ingested by machine learning algorithms, which label and measure them. Each image is accompanied by metadata that includes specifics such as the size of the forehead and the distance between the eyebrows. Together these create a faceprint, which the system later uses for tasks such as matching one image to another of the same person. The Diversity in Faces (DiF) release by IBM is aimed at advancing the study of accuracy and fairness in facial recognition technology. The faces in the images sourced from the Creative Commons dataset were annotated using 10 independent and well-established coding schemes.
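As a rough illustration of the faceprint idea, the sketch below turns a few facial measurements into a vector and compares two vectors by Euclidean distance. It is a minimal sketch under stated assumptions, not IBM's method: the feature names, the sample values (imagined as fractions of face width) and the matching threshold are all hypothetical.

```python
import math

# Hypothetical measurement names; real coding schemes use many more features.
FEATURES = ("forehead_height", "eyebrow_distance", "nose_length")


def faceprint(measurements: dict[str, float]) -> list[float]:
    """Arrange the measurements in a fixed order so vectors are comparable."""
    return [measurements[name] for name in FEATURES]


def same_person(a: list[float], b: list[float], threshold: float = 0.05) -> bool:
    """Treat two faceprints as a match if they are close enough.

    The threshold is an arbitrary value chosen for this illustration.
    """
    return math.dist(a, b) <= threshold


# Made-up measurements, expressed as fractions of face width.
photo_a = {"forehead_height": 0.31, "eyebrow_distance": 0.18, "nose_length": 0.27}
photo_b = {"forehead_height": 0.30, "eyebrow_distance": 0.19, "nose_length": 0.27}
photo_c = {"forehead_height": 0.38, "eyebrow_distance": 0.24, "nose_length": 0.22}

print(same_person(faceprint(photo_a), faceprint(photo_b)))  # close measurements
print(same_person(faceprint(photo_a), faceprint(photo_c)))  # distant measurements
```

Production systems typically rely on learned embeddings rather than a handful of hand-picked measurements, but the matching step follows the same pattern: compute a distance between two vectors and compare it with a threshold.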

The IBM Research team stated in its blog post that the 10 facial coding schemes cover visual attributes such as age and gender, facial ratios, craniofacial features, pose and resolution. The researchers' initial analysis found that the DiF dataset provides a more balanced distribution and a more diverse collection of images across gender, race and ethnicity.
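One common way to put a number on how balanced a dataset is across groups is normalized Shannon entropy, sometimes called evenness: 1.0 means every group is equally represented, and values near 0 mean one group dominates. The sketch below is an illustration only; the group labels and counts are hypothetical, and it is not presented as the analysis IBM's researchers ran.

```python
import math
from collections import Counter


def evenness(labels: list[str]) -> float:
    """Normalized Shannon entropy of the label distribution, in [0, 1].

    Only groups that actually appear in `labels` are counted, so a group
    that is missing entirely does not lower the score.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0  # a single group (or no data) has no balance to measure
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))


# Hypothetical annotation labels for two datasets of 100 images each.
balanced = ["group_a"] * 25 + ["group_b"] * 25 + ["group_c"] * 25 + ["group_d"] * 25
skewed = ["group_a"] * 97 + ["group_b"] + ["group_c"] + ["group_d"]

print(f"balanced: {evenness(balanced):.2f}")
print(f"skewed:   {evenness(skewed):.2f}")
```

A score like this could be computed separately for each annotated attribute, such as age bracket or gender, to compare one dataset against another.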

According to the blog post, the dataset will be available on request to the global research community. IBM stated that this new release will further the cause of fairness in AI.

Author: Rahul Pandita

An experienced writer and editor, Rahul Pandita has written extensively about the impact of policy changes on business and finance. He is a regular contributor to many authoritative sites. When he is not writing, you can find him playing a game of chess.