Images for AI use can be sourced responsibly
Summary
Nature’s editorial highlights a new, responsibly sourced image data set, the Fair Human-Centric Image Benchmark (FHIBE, pronounced “Feebee”), produced by researchers at Sony. The data set contains 10,318 images of 1,981 people from 81 countries, gathered with informed consent, payments to participants and clear limits on use (explicitly banning law-enforcement, military and surveillance applications). Participants supplied metadata such as age, ancestry, location and pronouns themselves, so algorithms need not infer these sensitive attributes, which helps reduce bias. The project cost under US$1 million, demonstrating that ethically gathered, representative image data is feasible and affordable for benchmarking AI systems. The editorial urges regulators, litigators and companies to take note and to collaborate on raising standards for data sourcing.
Key Points
- FHIBE (Feebee) contains 10,318 images from 1,981 individuals across 81 countries, collected with informed consent.
- Participants were paid and can opt out at any time; certain uses (e.g. surveillance, military) are prohibited.
- The data set includes self-reported metadata (age, ancestry, location and pronouns), removing the need for algorithms to infer sensitive attributes; a hypothetical sketch of such a record follows this list.
- The whole effort cost less than US$1 million, illustrating that ethical data collection is financially feasible for many firms.
- FHIBE is considerably more geographically and demographically representative than many web-scraped image corpora, helping to reduce bias in benchmarking.
- The data set is intended for benchmarking, that is, testing the accuracy and fairness of trained models, rather than for training itself; this raises wider questions about whether responsibly sourced data can be scaled up for model training and extended to text-based AI tools.
- The editorial calls for industry collaboration and regulatory attention, and flags the project’s relevance to ongoing litigation over web scraping and data use.
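To make the consent and metadata points above concrete, here is a minimal Python sketch of what one record in a consented data set might look like. It is an illustration only: the field names, values and the `usable_for` helper are assumptions for this sketch, not FHIBE’s actual schema or tooling.

```python
from dataclasses import dataclass

# Drawn from the editorial's explicitly banned applications; the string
# labels themselves are assumptions for this sketch.
PROHIBITED_USES = frozenset({"law_enforcement", "military", "surveillance"})

@dataclass(frozen=True)
class ConsentedImageRecord:
    image_id: str
    subject_id: str          # one of the 1,981 consenting participants
    country: str             # one of the 81 countries represented
    # Self-reported by the participant, not inferred by an algorithm:
    age: int
    ancestry: str
    pronouns: str
    consent_given: bool = True
    opted_out: bool = False  # participants can withdraw at any time

def usable_for(record: ConsentedImageRecord, purpose: str) -> bool:
    """A record is usable only while consent stands and the stated
    purpose is not on the prohibited-uses list."""
    return (record.consent_given
            and not record.opted_out
            and purpose not in PROHIBITED_USES)

# Benchmarking is permitted; surveillance is explicitly banned.
record = ConsentedImageRecord("img_0001", "p_0042", "KE", 34, "East African", "she/her")
assert usable_for(record, "benchmarking")
assert not usable_for(record, "surveillance")
```

The point of the sketch is that consent status and permitted purposes travel with each record, so a downstream benchmark pipeline can filter out withdrawn or out-of-scope images mechanically rather than by policy memo.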
Context and relevance
Most large generative models have relied heavily on web-scraped images and text, often without informed consent or payment. That approach has sparked legal, ethical and reputational challenges. FHIBE provides a practical alternative: a curated, consented, diverse image benchmark that firms and researchers can use to test and compare models without perpetuating the harms of indiscriminate scraping. The project directly touches on current debates about data privacy, copyright, model accountability and regulatory oversight, making it particularly relevant to AI developers, policymakers, legal teams and ethics boards.
Why should I read this?
Short version: Sony shows you don’t need to nick images from the web to make useful AI benchmarks. They did a solid, ethical data-collection job on a realistic budget: paid people, got consent, banned sketchy uses and included real demographic labels. If you work with AI models, regulation or procurement, this piece saves you time and points toward practical, less risky ways to get the data you need.
Author’s note
This matters. If your organisation builds, buys or regulates vision AI, the FHIBE example should be on your shortlist. It proves an ethical route is doable, and cheaper than the legal and PR headaches that come from sloppy scraping.
Open questions the piece flags
- Can similar consented data sets be created at the scale required to train — not just benchmark — large models?
- Could the responsibly sourced approach be extended to text corpora, and if so, what would it look like and what would it cost?
- Which stakeholders (companies, funders, regulators) should lead collaborative efforts to produce large-scale ethical data?
