
AI Fails New Benchmark Humans Ace Easily

AI Struggles with New Interactive Benchmark

A new challenge for artificial intelligence, called ARC AGI 3, has been released, and it’s proving incredibly difficult for even the most advanced AI models. Humans can solve this benchmark perfectly, scoring 100%, while current top AI models are scoring close to zero. This benchmark is designed to test an AI’s ability to generalize, meaning it can take what it learns in one situation and apply it to new, unseen problems.

The ARC AGI benchmark, created by researchers, aims to measure Artificial General Intelligence (AGI). This is the kind of AI that can think and learn like a human, understanding and applying knowledge across different tasks. The latest version, ARC AGI 3, is a fully interactive video game where AI players are dropped into a game with no instructions and must figure out how to win within a limited number of moves.
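
To make that setup concrete, here is a minimal Python sketch of the interaction loop the benchmark implies: a toy grid game with a hidden goal and a fixed move budget, plus a baseline agent that acts without any understanding. The GridGame class, its observe/step interface, and the random baseline are illustrative assumptions, not the actual ARC AGI 3 API.

```python
import random

# Minimal sketch of the agent-versus-environment loop ARC AGI 3 implies:
# the agent receives only raw observations, knows nothing about the rules,
# and must reach the goal within a fixed move budget. GridGame and its
# observe/step methods are illustrative stand-ins, not the real benchmark API.

class GridGame:
    """Toy stand-in for an ARC-AGI-3-style game: reach the goal cell in time."""

    def __init__(self, size=5, max_moves=20):
        self.size = size
        self.max_moves = max_moves
        self.pos = (0, 0)
        self.goal = (size - 1, size - 1)
        self.moves_used = 0

    def observe(self):
        # The agent only ever sees raw state, never the rules.
        return {"pos": self.pos, "moves_left": self.max_moves - self.moves_used}

    def step(self, action):
        dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.moves_used += 1
        won = self.pos == self.goal
        done = won or self.moves_used >= self.max_moves
        return self.observe(), won, done


def random_agent(game):
    """Baseline that acts with no understanding -- roughly what a 0% score looks like."""
    while True:
        _, won, done = game.step(random.choice(["up", "down", "left", "right"]))
        if done:
            return won


print("random agent won:", random_agent(GridGame()))
```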

From Simple Puzzles to Interactive Games

The previous versions, ARC AGI 1 and 2, presented static visual puzzles. In ARC AGI 1, for example, users would see worked examples of shapes and operations, like “three pink squares times two.” The task was to figure out the pattern and apply it to a new puzzle, such as adding a yellow square to complete a larger square. Humans found these tasks easy, but AI models began to struggle.
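
For a rough sense of what such a puzzle looks like in data form, here is a hedged sketch: grids represented as small matrices of color codes, a few input/output example pairs, and a hidden transformation the solver must infer. The specific rule used below is invented for illustration and is not taken from the real benchmark.

```python
# Colors are encoded as small integers; 0 is the background.
def apply_rule(grid):
    """Hypothetical rule: every non-zero cell also fills the cell to its right."""
    out = [row[:] for row in grid]
    for r, row in enumerate(grid):
        for c, val in enumerate(row):
            if val != 0 and c + 1 < len(row):
                out[r][c + 1] = val
    return out

# Training pairs demonstrate the rule without ever stating it.
train = [
    ([[3, 0, 0], [0, 0, 0]], [[3, 3, 0], [0, 0, 0]]),
    ([[0, 5, 0], [0, 0, 0]], [[0, 5, 5], [0, 0, 0]]),
]
test_input = [[0, 0, 0], [7, 0, 0]]

# A real solver would have to induce the rule from `train` alone;
# here we only check that the rule is consistent with the examples.
assert all(apply_rule(x) == y for x, y in train)
print(apply_rule(test_input))  # [[0, 0, 0], [7, 7, 0]]
```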

ARC AGI 2 raised the difficulty, requiring more complex pattern recognition and logical deduction. Top AI models like GPT 5.4 Pro Extra High could reach scores around 72%, still far from the 100% humans achieve, and the cost per task was steep, sometimes nearly $40 for a single problem.

ARC AGI 3: The Human Advantage

ARC AGI 3 takes a completely different approach. Instead of visual puzzles, it uses interactive video games.

Players are presented with a game environment and must use intuition and logical reasoning to achieve a goal, all while managing limited moves. The goal is to test an AI’s ability to learn and adapt in real-time with no prior information.

During a demonstration, the creator of the benchmark showed how to solve one of these games. By observing the game’s on-screen elements, such as a health or turn meter and a target destination, they deduced the character controls and realized the maze had to be oriented correctly before the character could reach the end. Explained step by step, the whole process took a human only a few minutes.
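
That strategy, probe the controls first and only then plan a route, can be sketched in code. The example below builds on the toy GridGame defined earlier; the probing logic and the simple greedy planner are assumptions made for illustration, not a description of how any real model or the benchmark itself works.

```python
def discover_controls(game, candidate_actions):
    """Spend one move per action to learn what it does, by diffing observations."""
    effects = {}
    for action in candidate_actions:
        before = game.observe()["pos"]
        obs, _, done = game.step(action)
        after = obs["pos"]
        effects[action] = (after[0] - before[0], after[1] - before[1])
        if done:
            break
    return effects


def probing_agent(game):
    """Probe the controls first, then spend the remaining moves heading for the goal."""
    effects = discover_controls(game, ["up", "down", "left", "right"])
    gx, gy = game.goal  # peeking at game.goal stands in for "seeing" the target on screen
    while True:
        x, y = game.observe()["pos"]
        # Pick the discovered action that most reduces the distance to the goal.
        best = min(effects, key=lambda a: abs(gx - (x + effects[a][0])) + abs(gy - (y + effects[a][1])))
        _, won, done = game.step(best)
        if done:
            return won


print("probing agent won:", probing_agent(GridGame()))
```

On the toy game this wins well within the move budget, which is the kind of quick rule inference the article credits to humans.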

AI’s Widespread Failure

When the same game was presented to leading AI models, including GPT 5.4, Gemini 3.1 Pro, and Claude Opus, they failed dramatically. These advanced AIs, which excel at many other tasks, could not grasp the basic mechanics of the game. They often repeated initial moves or failed to understand simple actions like moving towards an objective or interacting with game elements.

The results were stark: GPT 5.4 scored 0%, Gemini 3.1 Pro scored 0%, and Claude Opus scored 0%. The only non-zero result, a mere 0.3%, came from a GPT 5.4 variant, at an exorbitant cost of over $5,000 for that minimal performance. This highlights a significant gap between AI’s current capabilities and human-level generalization and problem-solving.

Why This Matters

The ARC AGI benchmarks are crucial because they test a core aspect of intelligence that current AI often lacks: generalization. Many AI models are trained on vast datasets and can perform exceptionally well on tasks similar to their training data. However, they struggle when faced with novel problems that require flexible thinking and learning from limited experience.

This inability to generalize is a key hurdle in developing true artificial general intelligence. If AI cannot adapt to new situations, its practical applications might remain limited to specific, well-defined tasks. The success of humans on ARC AGI shows that understanding context, inferring rules, and applying common sense are still areas where humans significantly outperform AI.

The Challenge and Reward

The creators of ARC AGI are offering a $2 million prize to anyone who can saturate the benchmark, that is, achieve human-level performance across its full range of tasks. This incentive aims to push the boundaries of AI research and encourage the development of more adaptable and intelligent systems.

The benchmark is available for anyone to try, and the developers have released a paper detailing their findings and the benchmark’s design. The ongoing development of ARC AGI and similar benchmarks is essential for measuring progress towards AI that can truly understand and interact with the world like humans do.

Next Steps

Researchers and developers are encouraged to test their models against ARC AGI 3 and contribute to the understanding of AI’s generalization capabilities. The benchmark includes a variety of unique games, ensuring that success requires genuine intelligence rather than memorization or task-specific training.
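
As a rough picture of what such testing might look like in practice, here is a small evaluation harness that runs an agent over a suite of the toy games sketched earlier. The make_games factory and the plain win-rate metric are assumptions; the benchmark’s actual game set and scoring protocol will differ.

```python
def evaluate(agent, games):
    """Fraction of games the agent wins within each game's move budget."""
    return sum(1 for game in games if agent(game)) / len(games)


def make_games(n=10):
    # Vary the grid size so that memorizing a single layout is not enough.
    return [GridGame(size=4 + (i % 3), max_moves=25) for i in range(n)]


print("random agent win rate:", evaluate(random_agent, make_games()))
print("probing agent win rate:", evaluate(probing_agent, make_games()))
```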


Source: The ONLY benchmark that AI can't solve (humans ace it) (YouTube)

Written by John Digweed