AI Achieves Unprecedented Real-World Control
Artificial intelligence is rapidly evolving beyond generating text and images, demonstrating remarkable new capabilities in controlling complex real-world systems. Recent breakthroughs showcase AI agents that can operate desktop applications like a human, drive a real car, and even command sophisticated robotic bodies, pushing the boundaries of what was previously thought possible.
Standard Intelligence Unveils Groundbreaking Computer Use Model
Standard Intelligence has emerged with a novel computer use model, FDM-1, that pushes well past the limits of earlier agents. Unlike traditional AI models that rely on screenshots or limited interaction data, FDM-1 was trained on an extensive 11 million hours of computer action data, with a particular focus on understanding long-context video. This allows the model to follow tasks over extended periods, a significant challenge for current AI agents. Its ability to process two hours of high-resolution, 30-frames-per-second video directly within its 1-million-token context window is a major leap forward. This advanced video encoding could also have significant implications for future AI video generation models.
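For a sense of the compression this requires, a quick back-of-envelope calculation shows how little room each frame gets (the per-frame token budget is our inference from the stated numbers, not a published figure):

```python
# Back-of-envelope: how tightly must FDM-1 compress video to fit
# two hours of 30 fps footage into a 1M-token context window?
# (Tokens-per-frame is our inference, not a published spec.)

hours = 2
fps = 30
context_tokens = 1_000_000

frames = hours * 3600 * fps            # 216,000 frames
tokens_per_frame = context_tokens / frames

print(f"{frames:,} frames -> {tokens_per_frame:.2f} tokens per frame")
# 216,000 frames -> 4.63 tokens per frame
```

Fewer than five tokens per frame is a far denser encoding than the hundreds of tokens per image that typical vision-language models spend, which is why the same technique could matter for video generation.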
FDM-1: Generalization and Real-World Application
The core innovation lies in FDM-1’s training methodology. By employing an inverse dynamics model to predict frame-by-frame computer actions, the AI effectively learns to predict the next ‘token,’ which in this case is a user action within a computer interface. This generalized approach allows FDM-1 to navigate and operate complex software such as CAD programs and Blender without task-specific programming. A particularly striking demonstration involved the AI driving a real car through San Francisco using only arrow key inputs, achieved with less than an hour of fine-tuning data. The accuracy and responsiveness of that demonstration have drawn comparisons to dedicated autonomous driving systems, raising both excitement and regulatory questions.
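To make that framing concrete, here is a minimal sketch of an inverse dynamics model: given embeddings of two consecutive frames, predict the action that connects them. The architecture, dimensions, and action vocabulary below are illustrative assumptions, not FDM-1’s actual design.

```python
import torch
import torch.nn as nn

# Minimal inverse-dynamics sketch: predict the user action that turned
# frame t into frame t+1. Sizes and the action vocabulary are
# illustrative assumptions, not FDM-1's actual architecture.

NUM_ACTIONS = 512  # hypothetical vocabulary of key presses, clicks, scrolls

class InverseDynamicsModel(nn.Module):
    def __init__(self, frame_dim: int = 1024, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, hidden),  # concatenated frame embeddings
            nn.ReLU(),
            nn.Linear(hidden, NUM_ACTIONS),    # logits over the action vocabulary
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([frame_t, frame_t1], dim=-1))

model = InverseDynamicsModel()
f_t, f_t1 = torch.randn(8, 1024), torch.randn(8, 1024)  # a batch of frame pairs
action_logits = model(f_t, f_t1)  # (8, 512); train with cross-entropy on logged actions
```

Presumably this is how raw screen recordings become training data: once the model labels each frame transition with an inferred action, those action sequences can be treated as next tokens for the agent to learn.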
While FDM-1 is not yet publicly available, Standard Intelligence plans to offer API access in the future, signaling a potential paradigm shift in how AI agents interact with digital and physical environments.
Inception Labs Introduces Mercury 2: The Fast Diffusion LLM
Inception Labs has launched Mercury 2, a diffusion-based large language model (LLM) touted as the fastest reasoning LLM currently accessible. Unlike autoregressive models that generate text token by token, diffusion models generate entire blocks of text simultaneously, effectively ‘diffusing’ the text into existence. This architecture allows Mercury 2 to achieve speeds exceeding a thousand tokens per second.
Diffusion vs. Autoregressive Models
The key difference lies in their generation process. Autoregressive models are strictly sequential, producing one token per step from left to right. Diffusion models, by contrast, can commit tokens at many positions across the sequence in a single denoising step, which is what makes generation faster. While autoregressive models still often outperform diffusion models on the most complex benchmarks, Mercury 2 represents a significant advancement in diffusion LLMs by integrating real-time reasoning capabilities, allowing it to handle complex, multi-step thinking more efficiently than previous diffusion models.
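The toy sketch below contrasts the two decoding loops. The `predict` function stands in for a trained model, so this illustrates the control flow only, not Mercury 2’s actual algorithm.

```python
import random

MASK = "_"

def predict(seq, i):
    # placeholder for a model predicting the token at position i
    return f"tok{i}"

def autoregressive(n):
    seq = []
    for i in range(n):                # one model call per token, left to right
        seq.append(predict(seq, i))
    return seq                        # n sequential steps for n tokens

def diffusion(n, steps=4):
    seq = [MASK] * n                  # start from a fully masked sequence
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        # each denoising step commits several positions anywhere in the sequence
        for i in random.sample(masked, max(1, len(masked) // 2)):
            seq[i] = predict(seq, i)
    return seq                        # far fewer steps than tokens

print(autoregressive(8))
print(diffusion(8))
```

The speedup comes from that inner parallelism: each denoising pass can be a single batched forward pass over the whole sequence, rather than one pass per token.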
Mercury 2 Performance and Pricing
Mercury 2 boasts a 128k context window, native tool use, and schema-aligned JSON output. It shows competitive performance overall, occasionally falling behind top-tier models like Gemini 2.5 Flash on certain benchmarks while outperforming models like Claude 4.5 Haiku and GPT-5 Nano. Priced at $0.25 per million input tokens and $0.75 per million output tokens, Mercury 2 is a cost-effective option, and users can try it for free without creating an account.
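At those rates, costs are straightforward to estimate; a quick sketch using the published prices:

```python
# Cost estimate at Mercury 2's published rates:
# $0.25 per million input tokens, $0.75 per million output tokens.

INPUT_RATE = 0.25 / 1_000_000
OUTPUT_RATE = 0.75 / 1_000_000

def cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a heavy day of usage: 40M input tokens, 10M output tokens
print(f"${cost(40_000_000, 10_000_000):.2f}")  # $17.50
```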
The model’s ability to generate text art and code, such as a reverse Tetris game, highlights its potential in creative and development applications. Inception Labs is also exploring its use in coding, suggesting a bright future for diffusion models in tasks requiring rapid output.
AI’s Role in Robotics and Miniaturization
Beyond software and driving, AI is making significant strides in robotics and the development of highly specialized, smaller AI models.
Nvidia’s Sonic AI Controls Robot Bodies
Nvidia has developed the Sonic AI model, a transformer-based model trained natively on robot control. It acts as a bridge between humans and robots, enabling control through various inputs including whole-body teleoperation, human video streams, and text prompts. Sonic can interpret commands like ‘walk sideways’ or ‘dance like a monkey’ and even adapt to musical rhythms. Remarkably, this highly capable model has only 14 million parameters, an order of magnitude smaller than even GPT-1 (roughly 117 million).

Nvidia’s research indicates that motion tracking, using dense, frame-by-frame human motion capture data, is the most scalable task for whole-body robot control. By leveraging massive-scale simulations with NVIDIA Isaac Lab, training can be accelerated until robots have accumulated years of virtual experience in hours. This approach achieved a 100% success rate across 50 diverse real-world motion sequences on a real robot with zero-shot transfer, demonstrating the power of generalized robotic control models. Nvidia has open-sourced Sonic, encouraging further research and development in AI-powered robotics.
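For a sense of scale, the sketch below counts the parameters of a small transformer that lands near Sonic’s 14-million budget. The layer dimensions and input/output sizes are our guesses, not Nvidia’s published configuration.

```python
import torch.nn as nn

# Roughly what a ~14M-parameter control policy could look like.
# All dimensions are illustrative guesses, not Sonic's actual config.

d_model, n_heads, n_layers, ff = 384, 6, 8, 1536

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=ff, batch_first=True
)
policy = nn.Sequential(
    nn.Linear(64, d_model),                    # assumed observation/command embedding
    nn.TransformerEncoder(encoder_layer, num_layers=n_layers),
    nn.Linear(d_model, 32),                    # assumed joint-target outputs
)

n_params = sum(p.numel() for p in policy.parameters())
print(f"{n_params / 1e6:.1f}M parameters")     # ~14.2M
```

A model this small is cheap enough to run at control-loop rates on embedded hardware, which is part of why compact policies are attractive for robotics.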
Handcrafted AI Models and Google’s Nano Banana 2
The trend of creating highly efficient, purpose-built AI models is also gaining traction. Developers are now crafting models with only a few hundred parameters that still achieve impressive accuracy on narrow tasks: one transformer with just 343 parameters, its weights handcrafted by an AI like Codex, outperformed a larger 491-parameter model on a 10-digit addition task. This suggests a future where highly optimized, small-footprint AI models can be deployed for specialized applications.
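To appreciate how tight a 343-parameter budget is, note that a single dense layer mapping 18 inputs to 18 outputs already uses 342 weights, while one projection in a GPT-2-sized block uses over half a million. The layer sizes here are ours for illustration; the source does not describe the tiny model’s actual layout.

```python
# Illustrative arithmetic: what a few hundred parameters buys.
# (Layer sizes are hypothetical; the 343-parameter model's real
# layout isn't described in the source.)

def dense_params(n_in: int, n_out: int, bias: bool = True) -> int:
    return n_in * n_out + (n_out if bias else 0)

print(dense_params(18, 18))    # 342 -- one tiny layer eats the whole budget
print(dense_params(768, 768))  # 590,592 -- a single GPT-2-scale projection
```

At that scale every individual weight can be inspected and hand-set, which is what makes the handcrafting approach feasible at all.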
Google’s Nano Banana 2 image generation model is also undergoing early testing. While initial impressions suggest it’s an improvement over previous versions, offering faster generation and increased efficiency, it’s not yet considered superior to established models like Nano Banana Pro. However, it signals an upcoming upgrade for users relying on Google’s AI image tools.
Why This Matters
These advancements collectively signal a significant shift in AI capabilities. The ability of AI agents like FDM-1 to generalize across various software applications and even control physical systems like cars and robots opens up a vast array of possibilities. From enhancing productivity through automated software operation to revolutionizing industries like transportation and manufacturing with more capable robots, the impact is profound. The development of faster, more efficient LLMs like Mercury 2 promises more responsive and accessible AI tools, while the trend towards smaller, specialized models suggests AI will become more integrated into a wider range of devices and applications. Nvidia’s focus on scalable motion tracking for robotics, combined with open-source initiatives, points towards a future where AI-powered robots become increasingly commonplace and capable.
Source: I Watched an AI Drive a Real Car Through San Francisco Using Arrow Keys (YouTube)