AI Labs Accused of “Industrial-Scale” Data Theft

A recent blog post from AI research firm Anthropic has ignited a firestorm in the artificial intelligence community, accusing three Chinese companies (DeepSeek, Moonshot AI, and MiniMax) of orchestrating a coordinated campaign to steal its proprietary AI capabilities. Anthropic alleges that these companies employed sophisticated tactics, including the creation of tens of thousands of fake accounts and the use of extensive proxy networks, to execute what it terms an “industrial-scale distillation attack.” The company has framed the incident as a national security concern, escalating tensions in the global AI race.

Understanding Model Distillation

To grasp the gravity of Anthropic’s claims, it’s crucial to understand the AI concept of model distillation. Typically, training a large, state-of-the-art AI model, like Anthropic’s Claude Opus, is an immensely resource-intensive endeavor. Companies gather vast datasets—often scraped from the internet, books, and other sources—and use massive clusters of GPUs for months, incurring costs that can run into hundreds of millions, or even billions, of dollars when factoring in research and development.

Model distillation offers a more economical alternative. Instead of training a new model from scratch on an enormous corpus of data, companies use a smaller, less powerful model (the “student”) to interact with a larger, more capable model (the “teacher”). The student model sends prompts to the teacher model and receives not only the final answer but also the “chain of thought”—the step-by-step reasoning process the teacher model used to arrive at its conclusion. By repeating this process millions of times, the student model learns to mimic the teacher’s behavior and capabilities using significantly less data and computational power.
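To make the mechanics concrete, here is a minimal, hypothetical sketch of the data-collection loop behind distillation. The function and field names (for example, query_teacher) are stand-ins, not any company’s actual pipeline; a real operation would batch requests, handle rate limits, and run at a vastly larger scale.

```python
# Illustrative sketch of distillation data collection: query a large "teacher"
# model, record its reasoning and answer, and save examples for fine-tuning a
# smaller "student" model. All names here are hypothetical placeholders.

import json
from typing import TypedDict


class TeacherResponse(TypedDict):
    chain_of_thought: str  # step-by-step reasoning returned by the teacher
    answer: str            # final answer returned by the teacher


def query_teacher(prompt: str) -> TeacherResponse:
    """Hypothetical stand-in for an API call to the larger teacher model."""
    return {
        "chain_of_thought": "...step-by-step reasoning...",
        "answer": "...final answer...",
    }


def build_distillation_dataset(prompts: list[str], path: str) -> None:
    """Collect teacher outputs and write them as JSONL fine-tuning examples."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            response = query_teacher(prompt)
            example = {
                "prompt": prompt,
                # The student learns to reproduce the reasoning trace as well
                # as the final answer, not just the answer itself.
                "target": response["chain_of_thought"] + "\n\n" + response["answer"],
            }
            f.write(json.dumps(example) + "\n")


if __name__ == "__main__":
    build_distillation_dataset(
        ["Summarize this contract clause...", "Write a SQL query that..."],
        "distillation_data.jsonl",
    )
```

Repeated across millions of prompts, a dataset like this lets the student approximate the teacher’s behavior at a fraction of the original training cost.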

This technique is widely used. For instance, the creation of smaller, faster models like Anthropic’s Sonnet and Haiku from its flagship Opus model, or OpenAI’s smaller “mini” and “nano” variants of its frontier models, is a common application of distillation. It allows companies to deploy more accessible and cost-effective AI solutions without repeating the full, expensive training process.

Anthropic’s Accusations

Anthropic’s blog post, titled “Detecting and preventing distillation attacks,” published on February 23, 2026, details its findings. The company says it attributed the attacks with high confidence based on IP address correlation, request metadata, infrastructure indicators, and collaboration with industry partners.

  • DeepSeek: Anthropic alleges DeepSeek engaged in over 150,000 exchanges, specifically targeting Claude’s reasoning capabilities. The goal, according to Anthropic, was to train DeepSeek’s models to produce responses compliant with Chinese government censorship policies, particularly concerning politically sensitive topics and dissident leaders.
  • Moonshot AI: This company reportedly made over 3.4 million exchanges with Claude, focusing on agentic reasoning, tool use, coding, data analysis, and computer vision. Moonshot allegedly used hundreds of fraudulent accounts, employing varied access methods to obscure its coordinated efforts. Anthropic claims to have matched public profiles of senior Moonshot staff to the actors involved.
  • MiniMax: The most extensive of the alleged attacks is attributed to MiniMax, which is accused of over 13 million exchanges, primarily targeting agentic coding, tool use, and orchestration. Significantly, Anthropic reports that MiniMax rapidly pivoted its activities within 24 hours of a new model release, redirecting nearly half its traffic to distill capabilities from the latest system.
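Anthropic has not published the details of its detection pipeline, but the attribution signals it describes (shared IP addresses, request metadata, infrastructure indicators) lend themselves to straightforward correlation analysis. The sketch below is purely illustrative: the record fields and thresholds are assumptions, not Anthropic’s actual methodology.

```python
# Hypothetical sketch of infrastructure-correlation analysis: group accounts
# that share a network indicator (here, an IP address) and flag unusually
# large, high-volume clusters as possibly coordinated. Thresholds and field
# names are illustrative assumptions only.

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestRecord:
    account_id: str
    ip_address: str
    request_count: int


def flag_coordinated_clusters(
    records: list[RequestRecord],
    min_accounts: int = 50,          # many distinct accounts behind one IP
    min_total_requests: int = 100_000,
) -> list[str]:
    """Return IP addresses whose account clusters look like coordinated scraping."""
    accounts_by_ip: dict[str, set[str]] = defaultdict(set)
    requests_by_ip: dict[str, int] = defaultdict(int)

    for rec in records:
        accounts_by_ip[rec.ip_address].add(rec.account_id)
        requests_by_ip[rec.ip_address] += rec.request_count

    return [
        ip
        for ip, accounts in accounts_by_ip.items()
        if len(accounts) >= min_accounts and requests_by_ip[ip] >= min_total_requests
    ]


if __name__ == "__main__":
    sample = [RequestRecord(f"acct_{i}", "203.0.113.7", 3_000) for i in range(60)]
    print(flag_coordinated_clusters(sample))  # -> ['203.0.113.7']
```

In practice, providers would also look at proxy rotation patterns, prompt similarity, and timing, but the basic idea is the same: coordinated campaigns tend to leave correlated fingerprints across ostensibly independent accounts.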

The Irony and Industry Precedent

The situation is fraught with irony, as Anthropic itself has faced scrutiny regarding its data acquisition practices. The company has been involved in high-profile legal battles over its training data. In September 2025, Anthropic settled a copyright infringement lawsuit with a group of authors for $1.5 billion, agreeing to compensate them for the use of approximately 500,000 books in its training data. Furthermore, in June 2025, Reddit sued Anthropic for allegedly using its platform’s data for commercial purposes without proper compensation, alleging violations of Reddit’s terms of service.

Critics point out a perceived hypocrisy: Anthropic, which has itself been accused of using copyrighted material scraped from the internet without explicit permission, is now decrying other companies for allegedly misusing its model’s outputs. While Anthropic argues that its own data collection may fall under fair use and that the Chinese companies violated its terms of service, the underlying principle of leveraging existing work to build competing products strikes some observers as much the same in both cases.

This practice is not unique to Anthropic. Major AI players like OpenAI (accused of transcribing millions of hours of YouTube videos and currently embroiled in a copyright dispute with The New York Times), Meta (which allegedly trained models on pirated books), and Google have all faced similar accusations of using vast amounts of data, sometimes without clear legal rights, to train their foundational models. The AI industry, it appears, has largely operated on a “take first, ask permission later” ethos.

Why This Matters

Anthropic’s claims, despite the company’s own past controversies, highlight several critical issues:

  • Erosion of Safety Guardrails: Distilled models may not inherit the robust safety protocols embedded in the original, larger models. If these distilled versions are further distilled, the safety features could be progressively weakened, potentially leading to AI systems that are less reliable or more prone to generating harmful content.
  • Geopolitical Implications: Anthropic’s framing of the incident as a national security issue adds a layer of geopolitical tension. The company aims to use this as evidence to support arguments for restricting China’s access to advanced computing resources, suggesting that export controls are necessary to prevent the illicit acquisition and replication of U.S. AI technology.
  • Legal and Ethical Gray Areas: The incident underscores the underdeveloped legal framework surrounding AI-generated content and intellectual property. While the U.S. Copyright Office has stated that AI-generated content lacks human authorship and is therefore not copyrightable, the exact legal standing of distilling another company’s model outputs remains ambiguous: it typically violates the provider’s terms of service but is not necessarily illegal, creating a complex and evolving legal landscape.

The core question remains: where does the line get drawn in the AI industry? As the landscape evolves, the industry faces a critical juncture in defining ethical data sourcing, intellectual property rights, and the responsible development of artificial intelligence. The current situation, characterized by a constant cycle of data scraping and output distillation, is messy and likely to remain so until clearer legal precedents and industry standards emerge.


Source: Anthropic Is Mad That China Did What They Did (YouTube)


Written by John Digweed