Does AI have a data problem? Yes, it does. The problem arises in several areas.
Legal
Companies like OpenAI, Anthropic, Google, and Microsoft are being sued by various parties, including corporations (e.g., The New York Times) and individuals (e.g., Sarah Silverman), for using their content to train AI models and profit from them. A notable dispute involves Scarlett Johansson, who accused OpenAI of imitating her voice after she repeatedly declined their offers. Reddit sold massive amounts of its users’ conversations (questions, answers, and discussions), and Quora may be considering a similar move; in both cases, users are not pleased.
Several arguments are put forth by companies to justify their data collection and usage practices. One is that their “terms of service” allow them to use data they collect from users of their services. We’ve been conditioned to accept these terms without reading them thoroughly, and I suspect companies rely on that. I find this practice deceptive. For example, LinkedIn added an option in profile settings to allow user content to be used for AI training, but they didn’t communicate this openly, clearly, or candidly to users. Consumers are starting to recognize these tactics for what they are, and complaints are mounting.
Another argument is “fair use.” This doctrine permits limited use of copyrighted material without the owner’s permission, particularly for transformative purposes such as commentary, research, or education. The argument is that training models is transformative and not a direct way of profiting from the material itself. However, the courts have yet to clearly accept this view.
Incidentally, I have always felt uneasy about Google “owning” any documents I create or store on Google Drive. I know several organizations that would have adopted an enterprise setup and spent millions on Google Workspace if it weren’t for similar concerns about GDrive, Gmail, and other Google products.
Quality
The quality of training data is often inconsistent. Many companies don’t fully cleanse or organize raw data. They might push it into the required format for training without dedicating much effort to deduplication, removing bad rows, fixing spelling errors, or even performing light validation on the dataset.
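As a minimal sketch of what that light cleansing pass might look like, consider the following Python snippet. The field names (`text`, `source`) and the normalization rule are illustrative assumptions, not any particular company’s pipeline:

```python
# A hedged sketch of a light cleansing pass over raw training rows:
# drop bad rows, require provenance, deduplicate on normalized text.
# Field names ("text", "source") are hypothetical.

def clean_rows(rows):
    """Deduplicate and validate raw training rows (a list of dicts)."""
    seen = set()
    cleaned = []
    for row in rows:
        text = (row.get("text") or "").strip()
        # Drop rows with no usable text (a "bad row").
        if not text:
            continue
        # Light validation: require a known source label.
        if not row.get("source"):
            continue
        # Deduplicate on whitespace- and case-normalized text.
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"text": text, "source": row["source"]})
    return cleaned

raw = [
    {"text": "Hello world", "source": "forum"},
    {"text": "hello   world", "source": "forum"},   # near-duplicate
    {"text": "", "source": "forum"},                # bad row
    {"text": "Fresh example", "source": None},      # missing provenance
]
print(clean_rows(raw))  # keeps only the first row
```

Even this trivial pass removes three of the four rows above; the point is that skipping it lets duplicates and unlabeled junk flow straight into training.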
Additionally, in many cases, training data is scraped from websites with minimal checks. Often, these datasets aren’t properly classified or categorized, leading to non-compliant training sets.
Another closely related issue is retraining models on their own outputs. The original data quality issues then propagate and amplify through “inbreeding,” a phenomenon researchers call model collapse.
The overall outcome is biased models that produce inaccurate or biased results.
Synthetic Data and Network Completeness
I am a big proponent of synthetic data for model development and testing. It allows developers to move faster while protecting the integrity and security of production datasets. However, for model validation and production, real data is essential. When people report bias or “hallucinations” in results, it is often a sign that training mixed synthetic data, poor-quality real data, and the model’s own outputs during retraining.
Additionally, synthetic datasets have gaps. Consider the example of a churn model for a social video game. Players interact with one another, and if one player churns, others will likely churn as well. This relationship creates a connectivity network. A synthetic dataset will lack these connections, or if these connections are synthetically generated, they will lack the “weights” that represent player relationships. A real dataset will contain these connections and weights, either explicitly or implicitly, allowing the churn model to produce more accurate results in production.
The same applies to churn models in other sectors and to other types of models. Sampling and synthetic data remove network completeness from datasets, compromising the results.
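The network effect described above can be shown with a toy model. The graph, the base rate, and the weight below are all hypothetical numbers chosen for illustration; the point is that a dataset stripped of the friendship edges cannot compute the boost at all:

```python
# A toy illustration (hypothetical graph and numbers) of network
# completeness: a player's churn risk rises with the fraction of
# their friends who have already churned. A sampled or synthetic
# dataset without these edges cannot see the effect.

friends = {                      # social graph: player -> set of friends
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
churned = {"a", "b"}             # players who have already left

def churn_risk(player, base=0.1, weight=0.5):
    """Base risk plus a boost proportional to churned friends."""
    fs = friends[player]
    churned_frac = len(fs & churned) / len(fs)
    return base + weight * churned_frac

print(churn_risk("c"))  # 2 of 3 friends churned -> elevated risk
print(churn_risk("d"))  # no friends churned -> base risk only
```

With the edges removed, every player collapses to the base rate, and the model loses exactly the signal that makes churn contagious in a real social game.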
Data Availability
There is a scarcity of data. Even if the legal, quality, and network completeness issues were resolved, data remains limited. Several articles predict that data will “run out” by 2026 or 2027. This is partly because websites are blocking crawlers, authors and creators are refusing to provide access to their work, and users are demanding that their data not be used. The result is that data is becoming less widely available.
Take Reddit as an example. After selling its data, how “fresh” will the current dataset be in six months? How relevant will it remain for model training? And how much data will Reddit collect in that timeframe? Six months is an arbitrary figure, but the answer remains the same: not enough. One could argue that aggregating data from multiple sources like Reddit could help, but this approach only postpones the inevitable.
AI has other data problems, but the above are likely the most significant. So, how can these problems be fixed?
Data Licensing
Will data licensing solve these problems? Let’s address each issue separately.
Legal
As with anything legal, we need to define data licensing:
“Data licensing is a legal agreement between a data provider and a licensee that defines how the data can be used. The data provider owns or controls the data, while the licensee is the organization that wants to use it.”
Had AI companies like OpenAI reached out to The New York Times to license their data, there wouldn’t be a court case. The same applies to Sarah Silverman.
Would they have licensed their data? I don’t have a definitive answer, but I like to think they would have, at a reasonable cost.
This brings up the question of whether an intermediary is needed. A data licensing intermediary could act on behalf of both licensors and licensees—not as a broker, but as an entity that provides real value by cleaning, cataloging, categorizing, deduplicating, and packaging data, making it easy to find and use. This intermediary would ensure contract compliance, solving legal problems in a comprehensive way.
Quality
A data licensing intermediary would also ensure the quality of each dataset. Such an intermediary would work with licensors to comply with data standards for quality and privacy, addressing both legal and security concerns. Moreover, the intermediary would have tools to collect data from various sources, ensuring high-quality enterprise and/or consumer data.
Does this solve the quality problem? Not wholesale, perhaps, but it comes very close.
Synthetic Data and Network Completeness
As mentioned, intermediaries can collect data from enterprises and consumers, and if they collect synthetic data, it would be clearly labeled. Real data collected from licensors would, by definition, be network-complete because it hasn’t been artificially generated or sampled.
This is another area where an intermediary can provide a near-complete solution.
Data Availability
Data supply will always be a challenge. Data becomes stale and obsolete quickly. While aggregating data from multiple sources can help, intermediaries must continuously acquire new data sources, create tools to capture fresh data, and incentivize providers to share their data. Continuous data acquisition is their entire reason for being in business.
While this doesn’t completely solve the availability problem, it represents a significant step forward.
Conclusion
Data has always been vital, and AI depends entirely on it. As the need for AI models grows, the speed at which data is consumed increases. Efficient and effective data handling, including licensing, is becoming critically important. Data licensing can solve many of the challenges discussed here, and it takes us a long way toward solving the rest.
How Perle Solves This
At Perle, we go beyond data licensing with Experts-In-The-Loop—a global network of domain specialists who verify, annotate, and refine datasets. This expert-driven approach ensures legal compliance, reduces bias, and preserves the real-world complexity often lost in synthetic or scraped data. In a landscape defined by lawsuits, quality issues, and data scarcity, Perle offers a trusted, human-centered foundation for AI development.
Want to build smarter, safer AI with expert-verified data? Get in touch with us to learn how Perle can support your next model.
Adapted with appreciation from a blog by Fabian Schonholz, Ph.D., SVP of AI Strategy at Perle.
No matter how specific your needs, or how complex your inputs, we’re here to show you how our innovative approach to data labeling, preprocessing, and governance can unlock Perles of wisdom for companies of all shapes and sizes.