Why Quality Data Matters for AI and ML Models

By Fabian Schonholz, Ph.D. - SVP, AI Strategy
May 13, 2025

While AI/ML models continue to evolve, the next significant challenge is acquiring quality data. When I talk about data quality, I’m not just referring to data that has been cleaned, de-duplicated, network-complete, or validated. I also mean data enriched with metadata that provides the deeper context necessary for effective model training.
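
To make that distinction concrete, here is a minimal sketch of the difference for a single image record. The field names are illustrative, not a standard schema:

```python
# Sketch: "clean" data versus "quality" data for one image record.
# All field names here are illustrative, not a standard schema.

cleaned_record = {            # cleaned, de-duplicated, validated...
    "image_id": "img_000123",
    "labels": ["car", "tree"],
}

enriched_record = {           # ...plus the context a model can learn from
    "image_id": "img_000123",
    "labels": ["car", "tree"],
    "title": "Street corner at dusk",
    "description": "A red sedan parked beneath a mature oak tree.",
    "scene": "low-angle shot, warm evening light, quiet mood",
}
```

The second record is what I mean by quality data: the same validated content, plus the metadata that tells a model what it is actually looking at.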

Unfortunately, there is a scarcity of high-quality data. In fact, in my opinion, there is a general lack of data—let alone quality data. One major reason for this scarcity is the absence of robust licensing models and technologies around data sharing. Enterprises and even individuals possess vast amounts of data that remain inaccessible due to the lack of clear licensing protocols.

Another reason is the difficulty of creating metadata—in other words, annotating data. High-quality data annotation is a challenging task.

Take, for example, Amazon Rekognition’s performance with images. Rekognition uses bounding boxes to identify and label objects. In a recent image annotation project I worked on, its accuracy was below 40%, and more than half of the labels it produced were incorrect. In large datasets, such error rates render the data unusable. Moreover, because a bounding box necessarily captures background pixels around the object it encloses, boxes add noise to the training set and significantly reduce its precision.
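
For context, this is roughly what such a labeling call looks like with Rekognition’s DetectLabels API (a minimal boto3 sketch; the bucket and file names are placeholders). Note that each detected instance comes back only as a rectangle:

```python
# Minimal sketch of Amazon Rekognition's DetectLabels call via boto3.
# Bucket and object names are placeholders.
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-images", "Name": "street-scene.jpg"}},
    MaxLabels=20,
    MinConfidence=70.0,
)

for label in response["Labels"]:
    for instance in label.get("Instances", []):
        # Each instance is only a rectangle (ratios of image size),
        # so background pixels inside the box come along for free.
        box = instance["BoundingBox"]
        print(label["Name"], round(instance["Confidence"], 1),
              box["Left"], box["Top"], box["Width"], box["Height"])
```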

Many companies rely on tools like Rekognition to automate labeling, expecting to use the resulting datasets for training models. They are then surprised when the models produce biased outputs, “hallucinations,” or results that are simply incorrect. While AI models are improving, poor-quality datasets remain a major bottleneck.

How did we address the problem?

We took two key steps (a brief code sketch follows the list):

  1. Switched from bounding boxes to masks.
    Each object in an image was isolated with a unique mask. This eliminated surrounding noise and identified objects more precisely.
  2. Added rich contextual metadata.
    For each image, we included a title (if one was missing), a description of the image and its objects, and a scene description—covering angles, mood, and any additional contextual details.
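
A rough illustration of both steps, continuing the earlier example record and assuming each mask arrives as a binary numpy array from an annotator (`isolate_object` is a hypothetical helper, not a library function):

```python
# Step 1: masks instead of boxes. Assumes `mask` is a binary (H, W)
# numpy array aligned with an (H, W, 3) image.
import numpy as np

def isolate_object(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out every pixel outside the object's mask, discarding the
    background a bounding box would have kept."""
    return image * mask[..., np.newaxis]

# Step 2: attach per-object and scene-level context alongside each mask.
# Field names are illustrative.
annotation = {
    "image_id": "img_000123",
    "objects": [
        {"label": "car", "mask_file": "img_000123_car_0.png",
         "description": "red sedan, partially occluded by the tree"},
        {"label": "tree", "mask_file": "img_000123_tree_0.png",
         "description": "mature oak overhanging the curb"},
    ],
    "scene": "low-angle shot, warm evening light, quiet mood",
}
```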

This approach improved object definition—cars, trees, flowers, etc.—and enhanced classification accuracy. It enabled us to segment the dataset more effectively and train models that delivered better results. And yes, we used human annotators to perform this work.

What are the trade-offs?

This method is time-consuming and costly. However, the upfront investment yields long-term gains. Once you have a high-quality dataset, you can train models to handle much of the annotation work—such as generating masks. Humans are still needed to validate outputs and provide deeper annotations. This hybrid approach balances time, cost, and quality. Over time, as the model improves in object recognition (e.g., mask generation), efficiency increases further.
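
One way that hybrid division of labor might look in code (a sketch only; `model` and its `predict` method are hypothetical stand-ins for a trained segmentation model):

```python
# Sketch of the hybrid loop: the model proposes masks, and anything
# below a confidence threshold is routed to a human reviewer.
from queue import Queue

def annotate(image, model, review_queue: Queue, threshold: float = 0.9):
    accepted = []
    for mask, label, score in model.predict(image):  # hypothetical API
        if score >= threshold:
            accepted.append((mask, label))          # trust the model
        else:
            review_queue.put((image, mask, label))  # human validates/edits
    return accepted
```

Raising the threshold shifts work back toward humans; lowering it trades review time for risk. That is exactly the time, cost, and quality balance described above.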

Can we eliminate “humans in the loop”?

I’m not sure. Human annotators contribute something machines currently lack: the ability to interpret scenes—whether images, videos, or text—and embed that interpretation as context. Even when two people interpret a scene differently, that’s a feature, not a bug. Datasets should not be labeled by a single person. Not only would it take too long, but it would also introduce significant bias.

By involving a broader group of annotators, we gain diverse interpretations that, collectively, reduce bias. Over time, as models are retrained on datasets with varied human inputs, bias tends to decrease.
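
A simple way to turn those diverse interpretations into a single label is a majority vote across annotators (a minimal sketch; production pipelines often use weighted schemes such as Dawid-Skene):

```python
# Sketch: majority-vote consensus over several annotators' labels.
from collections import Counter

def consensus_label(labels: list[str]) -> tuple[str, float]:
    """Return the majority label and the fraction of annotators who
    agreed, a rough signal of how ambiguous the item is."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Three annotators read the same object differently:
print(consensus_label(["sedan", "sedan", "coupe"]))  # ('sedan', 0.666...)
```

Low agreement flags exactly the ambiguous items that benefit most from extra human eyes.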

In short, the interpretive, even emotional, element humans bring to annotation is a strength, not a flaw.

Most likely, humans will remain part of the loop. Their role may shift toward quality assurance—editing, refining, and enriching machine-generated annotations—but they will still provide critical value in shaping high-quality, context-rich datasets.

Final Thoughts

As we’ve seen, acquiring high-quality, annotated data is critical to improving AI/ML models. At Perle, we specialize in training data management and data labeling, ensuring that your datasets are not only accurate but also rich in context. By combining expert annotation with advanced tooling, we streamline the process of preparing data for model training, helping you achieve more effective and reliable outcomes. If you’re ready to elevate the quality of your AI training data, we’re here to support you every step of the way.
