Worried About Who’s Annotating Your AI Model Data? You Should Be.

[Image: Sample aircraft annotation]
Some AI model providers rely on foreign gig workers to process sensitive data – which poses big risks to enterprise and government customers.

By Peter Kant, CEO, Enabled Intelligence

By now, most government agencies and large enterprises have learned the hard way to pay attention to the cybersecurity supply chain (see the SolarWinds breach). They’ve learned to put strict controls on who accesses their network, and how, and from where.

So it’s a safe bet that if you asked the Chief Information Security Officer (CISO) at any federal agency or large corporation whether they would give gig workers in – say – Venezuela access to their critical data, they would likely laugh you out of the room for even suggesting something so preposterous.

Yet this is exactly what can happen today when businesses or government customers use third-party firms to develop AI models. Sensitive customer data is sometimes “farmed out” by US-based AI solution vendors to low-paid foreign contract workers as part of the AI model development process.

How, exactly, are AI models developed?

Say the words “AI model development” and most people will imagine a small team of highly trained software engineers hunched over computer screens in an office in Silicon Valley, New York or perhaps Israel.

Writing code for AI software is one part of developing an AI model, and often it is the least challenging part from a security and data protection perspective. It’s easy to keep tabs on a handful of software developers – if need be, you can put them in a secure office somewhere with plenty of cybersecurity protections.

But in addition to good software code, AI models need good data to “train” them – ideally, lots and lots of it. That data needs to be labeled, or annotated. For example, if you are building an AI model to recognize Russian fighter jets in satellite imagery or breast cancer tumors in mammograms, you might need thousands of satellite images of Russian fighters or thousands of mammograms of breast tumors, with bounding boxes drawn around the part of each image that contains the fighter or the tumor. You then feed all of those labeled images to the AI model so that it can “learn” what a Russian jet or a cancerous tumor looks like.
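
To make that concrete, below is a minimal sketch of what a single bounding-box annotation record might look like. The JSON-style layout, file name, label, and pixel coordinates are illustrative assumptions, not the format used by any particular vendor or dataset.

```python
# Illustrative sketch only: one bounding-box annotation record in a simple
# JSON-style format. The file name, label, and coordinates are made up.
import json

annotation = {
    "image": "satellite_tile_0001.png",   # hypothetical source image
    "annotations": [
        {
            "label": "fighter_jet",        # what the box contains
            "bbox": [412, 238, 96, 54],    # [x, y, width, height] in pixels
        }
    ],
}

# Annotators produce thousands of records like this; the AI model is then
# trained on the images together with these labels.
print(json.dumps(annotation, indent=2))
```

Whoever draws those boxes sees the underlying images – which is exactly why it matters who they are and where they work.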

However, data annotation is time-consuming and labor-intensive. So some AI firms outsource it to low-paid foreign contract workers.

That is apparently what happened when iRobot, the manufacturer of the Roomba vacuum cleaner, hired Scale AI, a top US-based AI solution vendor. Based in Silicon Valley, Scale is a data annotation provider serving customers in government and industry. According to a recent article in the MIT Technology Review, in the case of iRobot, Scale used contract workers in Venezuela to annotate iRobot’s data. The article reports that Scale’s contract workers then shared some of the images – including an image of a woman sitting on a toilet – on social media. iRobot says that sharing images in social media groups violates Scale’s agreements with it, and Scale says that the contract workers who shared the images breached their own agreements. Nevertheless, the sharing happened.

The article’s author has a warning for government and enterprise customers that hire firms that rely on foreign contract workers for data annotation:

… [sharing sensitive data] is nearly impossible to police on crowdsourcing platforms. When I ask Kevin Guo, the CEO of Hive, a Scale competitor that also depends on contract workers, if he is aware of data labelers sharing content on social media, he is blunt. “These are distributed workers,” he says. “You have to assume that people ... ask each other for help. The policy always says that you’re not supposed to, but it’s very hard to control.”

That means that it’s up to the service provider to decide whether or not to take on certain work. For Hive, Guo says, “we don’t think we have the right controls in place given our workforce” to effectively protect sensitive data.

The Takeaway: Pay Attention to Who’s Labeling Your Data

In short, your data and your AI are only as good and as safe as your labeling workforce. It matters to whom you give your data and where it gets labeled. Not all data annotation firms rely on foreign contract workers. Enabled Intelligence, Inc. is a specialist AI data annotation firm that uses only full-time, highly trained US annotation employees who perform data annotation in our secure US-based offices, rather than farming work out to remote teleworkers. That approach allows for much greater control and protection of customer data. And because these US-based employees are highly trained, they can often annotate data faster and more accurately than foreign contract workers, which yields more powerful and accurate AI technologies.

Outsourcing to low-paid, lower-skilled gig workers, with little control over your sensitive data, may seem like a good financial decision in the short run. But the higher back-end costs of dealing with inaccurate labels, retraining poorly performing AI models built on inaccurate data, and the existential cost of a data leak will more than outweigh any short-term savings.

Peter Kant is the Founder and CEO of Enabled Intelligence, Inc., an artificial intelligence technology company providing accurate data labeling and AI technology development for the U.S. government and critical commercial industries. Enabled Intelligence provides meaningful high-tech employment to veterans and neurodiverse professionals.