Across the globe, facial-recognition software engineers, artificial-intelligence companies and government officials eagerly awaited test results in November from a little-known corner of the U.S. Commerce Department. The agency’s evaluations would not only provide valuable benchmarks for technologists gauging the accuracy of their facial-recognition algorithms; they could also inform procurement decisions by policymakers and vendor evaluators.

“They [facial recognition vendors] generally have no idea how effective their algorithms are relative to somebody else’s,” said Patrick Grother, a biometrics science researcher at the National Institute of Standards and Technology who led the testing. “If you’re company X, you don’t know your accuracy relative to company Y.”

The closely watched NIST results released last November concluded that the entire industry has improved not just incrementally, but “massively.” They showed that at least 28 developers’ algorithms now outperform the most accurate algorithm from late 2013, and that just 0.2 percent of all searches by all algorithms tested failed in 2018, compared with a 4 percent failure rate in 2014 and a 5 percent rate in 2010.

NIST has conducted a variety of facial-recognition assessments since as early as 2000, but only recently have these technologies undergone unprecedented improvements at an ever-quickening pace. By incorporating highly complex neural networks, developers have bolstered the ability of facial-recognition algorithms to identify faces even in poor-quality images, for example. Technologists expect accuracy to skyrocket further as data volumes and computing capacity grow.

This particular NIST test, the first of its kind since 2014, evaluated 127 face-identification algorithms from 45 developers around the world. While there were some clear standouts, notably Microsoft and the Chinese firm Yitu, the results are intended to be read in the context of the metrics relevant to a specific application of facial-recognition technology.

Consider a driver’s-license database in which a driver tries to apply under a new name, for instance. Vendor evaluators may want to use NIST test results to determine how an algorithm performs after a subject’s reference photo has aged five years, 10 years or more. In the case of airport security, a government might want to know whether a facial-recognition system performs well enough on its own or should be paired with fingerprinting.

Grother also explained the test results in criminal investigation terms. “If you were sitting on a pile of photos that didn’t prove useful four years ago, if you were to research them today, some might bear fruit and give you an investigative lead,” he suggested.

The “one-to-many” NIST testing measured the ability of facial-recognition algorithms to match a person’s photo with a different image of the same person stored in a database of millions of samples. NIST employed a data set of 26.6 million portrait photos of 12.3 million individuals, along with smaller data sets of webcam, photojournalism, video-surveillance and amateur photos.
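NIST does not disclose how individual vendors implement their searches, but a one-to-many search conceptually reduces to comparing a “probe” face against every enrolled face in a gallery. The sketch below is a minimal illustration of that idea, assuming a hypothetical model has already converted each face image into a fixed-length embedding vector; the function names, the 512-dimensional embedding and the similarity threshold are all assumptions for illustration, not NIST’s or any vendor’s actual method.

```python
import numpy as np

# Illustrative sketch only: a hypothetical one-to-many face search.
# Assumes an external model has already mapped each face image to a
# fixed-length embedding vector; the gallery is a matrix of embeddings.

EMBED_DIM = 512          # assumed embedding size, not a NIST requirement
MATCH_THRESHOLD = 0.6    # assumed similarity cutoff; real vendors tune this

def normalize(v):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def identify(probe, gallery, subject_ids, k=5):
    """Return up to k above-threshold gallery candidates for one probe.

    probe:       (EMBED_DIM,) embedding of the search photo
    gallery:     (N, EMBED_DIM) embeddings of N enrolled photos
    subject_ids: identity label for each gallery row
    """
    scores = normalize(gallery) @ normalize(probe)   # cosine similarity per row
    top = np.argsort(scores)[::-1][:k]               # indices of best matches
    # In NIST's terms, a search "fails" when the true mate is absent from
    # the returned candidate list.
    return [(subject_ids[i], float(scores[i]))
            for i in top if scores[i] >= MATCH_THRESHOLD]
```

In a production system the linear scan over millions of gallery rows would be replaced by an indexed nearest-neighbor search, but the notion of failure NIST measures, a search that does not return the true mate, is the same.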

NIST does not require source code, IP or training data from vendors participating in its testing, said Grother. Instead, the research agency runs scripts executing participating firms’ algorithms on sequestered data sets.
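Grother’s description amounts to a black-box protocol: vendors submit a runnable algorithm, and the evaluator calls it on images the vendor never sees. The schematic sketch below, with invented names throughout, illustrates only that arrangement; NIST’s actual submission interface is specified in its own test documentation and differs from this.

```python
from typing import Callable, List, Tuple

# Schematic only: the evaluator treats each vendor algorithm as a black
# box, calling it on sequestered images and scoring the returned results.

# A vendor algorithm, seen from outside: image bytes in, ranked
# (candidate_id, score) pairs out.
VendorAlgorithm = Callable[[bytes], List[Tuple[str, float]]]

def hit_rate(algorithm: VendorAlgorithm,
             sequestered_probes: List[Tuple[bytes, str]]) -> float:
    """Fraction of probes whose true identity appears among the candidates."""
    hits = 0
    for image_bytes, true_id in sequestered_probes:
        candidates = algorithm(image_bytes)          # black-box call
        if any(cand_id == true_id for cand_id, _ in candidates):
            hits += 1
    return hits / len(sequestered_probes)
```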

A gold standard informing policy

Across the facial-recognition field, NIST testing is widely viewed as the standard-bearer, producing results that can significantly shape policy decisions. “Test results can inform policy,” said Grother. He added in a NIST press statement, “The implication that error rates have fallen this far is that end users will need to update their technology.”

Joy Buolamwini, an MIT researcher and founder of the Algorithmic Justice League, called the NIST benchmarks “gold standards for the industry.” Even the Chinese firm Yitu affirmed the importance of the U.S. agency’s testing. “The benchmark results of NIST are well-recognized as the golden standards of global industry for its strictness,” a Yitu spokesperson noted in an email. “That’s why Yitu joined in the contest to measure its technology.”

A September 2018 report on the potential development of AI regulations, published by the House Subcommittee on Information Technology, pointed to NIST, noting the agency “is situated to be a key player in developing standards.”

But participating in NIST testing is not exactly de rigueur across the industry. While Microsoft and Yitu took part in the voluntary testing, U.S. tech firms including Amazon, Apple, Google and IBM stayed away. Microsoft’s longtime top lawyer and president, Brad Smith, alluded to the company’s high accuracy ratings in the NIST test during a Dec. 6 presentation at the Brookings Institution, the same talk in which he reiterated Microsoft’s strong support for regulation of facial-recognition technology.

As part of its push for regulation, Smith said, Microsoft wants a law requiring other firms to undergo the same type of testing: “What we need to do is not only impose an obligation of transparency, we need to require under the law that the companies that are in this business of providing facial-recognition technology in fact enable third parties to test these services for accuracy and for unfair bias.”

Test focus draws criticism

Yet the NIST approach has its critics. Buolamwini suggested the NIST testing “focused on technical performance in aggregate which does not provide insight on how these technologies could impact different demographic groups of people.”

Buolamwini’s own research has exposed poorly trained facial-recognition systems that mislabel women as men and fail to distinguish among people with dark skin. “Social scientists who are interested in the social implications of AI and in particular facial-recognition technology are concerned with questions around abuse, consent, weaponization, and discriminatory uses of this technology,” she said.

Grother cautioned that NIST aims to evaluate “the technical response of algorithms to particular types of images, in particular types of faces,” rather than the composition of the database used to train an algorithm. Questions such as whether a criminal-justice database is overpopulated with African-American faces, or whether a system could be used in a discriminatory manner, fall beyond the scope of NIST reporting.

A companion NIST report covering demographic effects is on the way, however: Grother said the agency plans to publish that data in the first quarter of 2019.

Photo credit: Jacopo Marcovaldi, “The Face,” via photopin (license)