In March 2025, a new paper—Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering by Zixin Chen, Sicheng Song, Kashun Shum, Yanna Lin, Rui Sheng, and Huamin Qu—sparked some important conversations here at Perle.
The research introduces Misleading ChartQA, a benchmark built to test whether today’s leading multimodal large language models (MLLMs)—including GPT-4 and Claude—can recognize when a chart is visually deceptive.
Spoiler alert: they mostly can’t.
Despite all the progress in chart question answering (ChartQA), these models struggle when deception enters the picture. The study benchmarks 16 top MLLMs against more than 3,000 misleading chart examples—and even the best models underperform. Subtle manipulations like truncated axes, swapped labels, and deceptive color scales routinely fly under the radar.
This is a problem. Misleading visualizations are everywhere: in media, dashboards, reports, and public policy. They're not just annoying—they’re dangerous.
It’s one thing for a model to say, “The bar for 2023 is higher than 2022.” It’s another thing entirely to say, “This bar chart is misleading because it uses a non-zero y-axis to exaggerate change.” The latter requires real reasoning—and right now, models aren’t there.
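To make that distinction concrete, here is a small, purely illustrative Python/matplotlib sketch (the numbers and axis limits are invented for demonstration, not taken from the paper). The same 4% increase looks modest against a zero baseline and dramatic once the axis is truncated:

```python
# Illustrative only: how a truncated y-axis exaggerates a small change.
# The values and axis limits below are made up for demonstration.
import matplotlib.pyplot as plt

years = ["2022", "2023"]
values = [100, 104]  # a real change of only 4%

fig, (honest, misleading) = plt.subplots(1, 2, figsize=(8, 3))

honest.bar(years, values)
honest.set_ylim(0, 110)          # baseline at zero: the bars look nearly equal
honest.set_title("Zero baseline")

misleading.bar(years, values)
misleading.set_ylim(98, 105)     # truncated axis: the 2023 bar towers over 2022
misleading.set_title("Truncated axis")

plt.tight_layout()
plt.show()
```

A model that only reads off bar heights will answer both versions the same way; a model that reasons critically should flag the second one.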
This paper makes it clear: if we want trustworthy, explainable AI systems that can interpret data accurately and critically, we need to do more.
It's no surprise that challenges like these persist. In fact, this is exactly why our team at Perle exists.
At Perle, our strength lies in expert-driven data labeling, particularly in high-context, critical areas. When models fail to catch misleading charts, that isn't just a benchmark problem; it's a signal that human-labeled datasets are essential for teaching models to detect subtle deceptions. But we also recognize that the current limitations in AI aren't just about the data; they're also about the approach.
To improve AI’s ability to spot misleading charts, we need a shift toward integrating deeper, domain-specific human expertise into the training process. This means more than just annotating visual cues on a chart. It’s about bringing experts—designers, data analysts, subject matter specialists—into the AI training loop to provide feedback on why a chart might be deceptive. These experts can help create the nuanced datasets AI models need to understand how misleading visuals manipulate perception in ways that are often invisible to the untrained eye.
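As a sketch of what such expert annotations might look like, consider the record below. The field names are hypothetical, chosen for illustration; they are not the benchmark's format or Perle's schema:

```python
# Hypothetical expert annotation for one misleading chart.
# Field names are illustrative, not an actual Perle or Misleading ChartQA schema.
expert_annotation = {
    "chart_id": "example-001",
    "deception_type": "truncated_y_axis",   # e.g. swapped_labels, deceptive_color_scale
    "visual_cue": "y-axis starts at 98 instead of 0",
    "expert_rationale": (
        "A 4% increase is rendered as a bar roughly three times taller, "
        "leading readers to overestimate the change."
    ),
    "corrective_reading": "Revenue grew 4% year over year.",
    "annotator_role": "data analyst",
}
```

The point is that the rationale and the corrective reading, not just the label, are what give a model something to learn reasoning from.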
AI needs to work hand-in-hand with human experts to truly understand the complexities of visual deception. It’s not just about feeding the AI better examples; it’s about creating a feedback loop where AI models can continuously learn from real-world expertise. Models should be evaluated and tested not just on how they process charts, but on how they reason through the implications of those charts and flag potential distortions.
For instance, detecting that a chart with a truncated axis is misleading isn’t just about identifying the truncation—it’s about understanding the impact of that truncation on the data’s interpretation. Human expertise is key to guiding the AI’s reasoning process, ensuring that it doesn’t just recognize the visual manipulation but can also explain why it’s problematic.
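As a rough illustration of that reasoning step, the sketch below (a simplified heuristic of our own, not the paper's method) compares the change a reader perceives on a truncated axis with the actual change in the data:

```python
# Simplified heuristic: how much does a truncated axis exaggerate a change?
# Inputs are assumed to be extracted already (axis baseline and two bar values).
def exaggeration_factor(baseline: float, old: float, new: float) -> float:
    """Ratio of the visually implied change to the actual change
    (in the spirit of Tufte's lie factor)."""
    actual_change = (new - old) / old
    visual_change = (new - baseline) / (old - baseline) - 1
    return visual_change / actual_change

# With the made-up numbers from the earlier sketch:
factor = exaggeration_factor(baseline=98, old=100, new=104)
print(f"The truncated axis makes a 4% change look ~{factor:.0f}x larger.")
# A useful explanation states both the cue (baseline at 98, not 0)
# and its impact (the change appears roughly 50x larger than it is).
```

Spotting the truncation is the easy half; articulating that second sentence, the impact on interpretation, is where expert-guided reasoning matters.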
Another critical change is the need for continuous feedback and iterative evaluation. AI models should evolve as more experts provide feedback on how charts are being misused in the real world. This will require ongoing collaboration between AI researchers and human annotators who have deep contextual knowledge. Only through this continuous process can we create AI systems that can recognize subtle deceptions and respond intelligently.
If we want AI that doesn’t just read charts but actually understands them—and can flag when something’s off—we need to fundamentally change how we train these models. It’s time to move beyond synthetic examples and embrace a more expert-driven, iterative approach that combines human knowledge with the power of AI. This research is a great step forward in understanding the challenges of misleading charts. However, it also reminds us that for models to be truly trustworthy, they need training signals grounded in real expertise and ongoing collaboration.
🔗 Read the research
📄 Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
✍️ Authors: Zixin Chen, Sicheng Song, Kashun Shum, Yanna Lin, Rui Sheng, and Huamin Qu
No matter how specific your needs, or how complex your inputs, we’re here to show you how our innovative approach to data labeling, preprocessing, and governance can unlock Perles of wisdom for companies of all shapes and sizes.