On Wednesday, web infrastructure provider Cloudflare announced a new feature called “AI Labyrinth” that aims to combat unauthorized AI data scraping by serving fake AI-generated content to bots. The tool will attempt to thwart AI companies that crawl websites without permission to collect training data for large language models that power AI assistants like ChatGPT.
Cloudflare, founded in 2009, is probably best known as a company that provides infrastructure and security services for websites, particularly protection against distributed denial-of-service (DDoS) attacks and other malicious traffic.
Instead of simply blocking bots, Cloudflare’s new system lures them into a “maze” of realistic-looking but irrelevant pages, wasting the crawler’s computing resources. The approach is a notable shift from the standard block-and-defend strategy used by most website protection services. Cloudflare says blocking bots sometimes backfires because it alerts the crawler’s operators that they’ve been detected.
“When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them,” writes Cloudflare. “But while real looking, this content is not actually the content of the site we are protecting, so the crawler wastes time and resources.”
The company says the content served to bots is deliberately irrelevant to the website being crawled, but it is carefully sourced or generated using real scientific facts—such as neutral information about biology, physics, or mathematics—to avoid spreading misinformation (whether this approach effectively prevents misinformation, however, remains unproven). Cloudflare creates this content using its Workers AI service, a commercial platform that runs AI tasks.
Cloudflare designed the trap pages and links to remain invisible and inaccessible to regular visitors, so people browsing the web don’t run into them by accident.
A smarter honeypot
AI Labyrinth functions as what Cloudflare calls a “next-generation honeypot.” Traditional honeypots are invisible links that human visitors can’t see but bots parsing HTML code might follow. But Cloudflare says modern bots have become adept at spotting these simple traps, necessitating more sophisticated deception. The false links contain appropriate meta directives to prevent search engine indexing while remaining attractive to data-scraping bots.
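For illustration only, and not Cloudflare’s actual implementation, a trap page along these lines might pair robots meta directives with a link that CSS hides from human visitors but leaves visible to anything parsing the raw HTML. The path and link text below are made up:

```python
# Illustrative only -- not Cloudflare's actual implementation. The meta
# directive asks well-behaved search engines not to index or follow the page,
# while the CSS-hidden link stays out of sight for human visitors but remains
# in the HTML that a scraping bot parses. The path and link text are made up.
TRAP_PAGE = """<!doctype html>
<html>
<head>
  <meta name="robots" content="noindex, nofollow">
  <title>Notes on cellular respiration</title>
</head>
<body>
  <p>Factually neutral, AI-generated filler text would go here.</p>
  <a href="/labyrinth/page-2" style="display:none">Further reading</a>
</body>
</html>"""

print(TRAP_PAGE)
```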
Claude users should be warned that large language models (LLMs) like those that power Claude are notorious for sneaking in plausible-sounding confabulated sources. A recent survey of citation accuracy by LLM-based web search assistants showed a 60 percent error rate. That particular study did not include Anthropic’s new search feature because it was conducted before the feature’s release.
When using web search, Claude provides citations for information it includes from online sources, ostensibly helping users verify facts. From our informal and unscientific testing, Claude’s search results appeared fairly accurate and detailed at a glance, but that is no guarantee of overall accuracy. Anthropic did not release any search accuracy benchmarks, so independent researchers will likely examine that over time.
A screenshot example of what Anthropic Claude’s web search citations look like, captured March 21, 2025. Credit: Benj Edwards
Even if Claude search were, say, 99 percent accurate (a number we are making up as an illustration), the 1 percent chance it is wrong may come back to haunt you later if you trust it blindly. Before accepting any source of information delivered by Claude (or any AI assistant) for any meaningful purpose, vet it very carefully using multiple independent non-AI sources.
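To see why, here is a toy calculation using the same made-up 99 percent figure: even a small per-citation error rate compounds quickly across a session’s worth of citations. The arithmetic assumes each citation is independently accurate, which is itself a simplification.

```python
# Toy calculation using the made-up 99 percent figure from above; assumes
# each citation's accuracy is independent, which is itself a simplification.
per_citation_accuracy = 0.99

for num_citations in (1, 10, 50):
    p_at_least_one_wrong = 1 - per_citation_accuracy ** num_citations
    print(f"{num_citations:>2} citations -> "
          f"{p_at_least_one_wrong:.0%} chance at least one is wrong")
```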
A partnership with Brave under the hood
Behind the scenes, it looks like Anthropic partnered with Brave Search to power the search feature. Brave Software, the company behind it, is perhaps best known for its web browser app. Brave Search markets itself as a “private search engine,” which feels in line with how Anthropic likes to market itself as an ethical alternative to Big Tech products.
Simon Willison discovered the connection between Anthropic and Brave through Anthropic’s subprocessor list (a list of third-party services that Anthropic uses for data processing), which added Brave Search on March 19.
He further demonstrated the connection on his blog by asking Claude to search for pelican facts. He wrote, “It ran a search for ‘Interesting pelican facts’ and the ten results it showed as citations were an exact match for that search on Brave.” He also found evidence in Claude’s own outputs, which referenced “BraveSearchParams” properties.
The Brave engine under the hood has implications for individuals, organizations, or companies that might want to block Claude from accessing their sites since, presumably, Brave’s web crawler is doing the web indexing. Anthropic did not mention how sites or companies could opt out of the feature. We have reached out to Anthropic for clarification.
It’s worth clarifying that AI models did not generate the images used in the study. Instead, researchers used popular, pre-existing meme templates, and GPT-4o or human participants generated captions for them.
More memes, not better memes
When crowdsourced participants rated the memes, those created entirely by AI models scored higher on average in humor, creativity, and shareability. The researchers defined shareability as a meme’s potential to be widely circulated, influenced by humor, relatability, and relevance to current cultural topics. They note that this study is among the first to show AI-generated memes outperforming human-created ones across these metrics.
However, the study comes with an important caveat. On average, fully AI-generated memes scored higher than those created by humans alone or humans collaborating with AI. But when researchers looked at the best individual memes, humans created the funniest examples, and human-AI collaborations produced the most creative and shareable memes. In other words, AI models consistently produced broadly appealing memes, but humans—with or without AI help—still made the most exceptional individual examples.
Diagrams of meme creation and evaluation workflows taken from the paper. Credit: Wu et al.
The study also found that participants using AI assistance generated significantly more meme ideas and described the process as easier and requiring less effort. Despite this productivity boost, human-AI collaborative memes did not rate higher on average than memes humans created alone. As the researchers put it, “The increased productivity of human-AI teams does not lead to better results—just to more results.”
Participants who used AI assistance reported feeling slightly less ownership over their creations compared to solo creators. Given that a sense of ownership influenced creative motivation and satisfaction in the study, the researchers suggest that people interested in using AI should carefully consider how to balance AI assistance in creative tasks.
During Tuesday’s Nvidia GTC keynote, CEO Jensen Huang unveiled two “personal AI supercomputers” called DGX Spark and DGX Station, both powered by the Grace Blackwell platform. In a way, they represent a new type of AI PC architecture specifically built for running neural networks, and five major PC manufacturers will build the supercomputers.
These desktop systems, first previewed as “Project DIGITS” in January, aim to bring AI capabilities to developers, researchers, and data scientists who need to prototype, fine-tune, and run large AI models locally. DGX systems can serve as standalone desktop AI labs or “bridge systems” that allow AI developers to move their models from desktops to DGX Cloud or any AI cloud infrastructure with few code changes.
Huang explained the rationale behind these new products in a news release, saying, “AI has transformed every layer of the computing stack. It stands to reason a new class of computers would emerge—designed for AI-native developers and to run AI-native applications.”
The smaller DGX Spark features the GB10 Grace Blackwell Superchip with Blackwell GPU and fifth-generation Tensor Cores, delivering up to 1,000 trillion operations per second for AI.
Meanwhile, the more powerful DGX Station includes the GB300 Grace Blackwell Ultra Desktop Superchip with 784GB of coherent memory and the ConnectX-8 SuperNIC supporting networking speeds up to 800Gb/s.
The DGX architecture serves as a prototype that other manufacturers can produce. Asus, Dell, HP, and Lenovo will develop and sell both DGX systems, with DGX Spark reservations opening today and DGX Station expected later in 2025. Additional manufacturing partners for the DGX Station include BOXX, Lambda, and Supermicro, with systems expected to be available later this year.
Since the systems will be manufactured by different companies, Nvidia did not mention pricing for the units. However, in January, Nvidia mentioned that the base-level configuration for a DGX Spark-like computer would retail for around $3,000.
On Tuesday at Nvidia’s GTC 2025 conference in San Jose, California, CEO Jensen Huang revealed several new AI-accelerating GPUs the company plans to release over the coming months and years. He also revealed more specifications about previously announced chips.
The centerpiece announcement was Vera Rubin, first teased at Computex 2024 and now scheduled for release in the second half of 2026. This GPU, named after a famous astronomer, will feature tens of terabytes of memory and comes with a custom Nvidia-designed CPU called Vera.
According to Nvidia, Vera Rubin will deliver significant performance improvements over its predecessor, Grace Blackwell, particularly for AI training and inference.
Specifications for Vera Rubin, presented by Jensen Huang during his GTC 2025 keynote.
Vera Rubin packages two GPU dies on a single chip, which together deliver 50 petaflops of FP4 inference performance per chip. When configured in a full NVL144 rack, the system delivers 3.6 exaflops of FP4 inference compute—3.3 times more than Blackwell Ultra’s 1.1 exaflops in a similar rack configuration.
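A quick back-of-the-envelope check shows how those figures fit together; the die-to-chip breakdown is our inference from Nvidia’s stated numbers, not an official accounting.

```python
# Back-of-the-envelope check of the rack math above. The NVL144 name counts
# GPU dies; at two dies per Rubin chip, that works out to 72 chips per rack
# (an inference from Nvidia's stated figures, not an official breakdown).
petaflops_per_chip = 50            # FP4 inference, per Nvidia
chips_per_rack = 144 // 2          # 144 dies / 2 dies per chip

rack_exaflops = petaflops_per_chip * chips_per_rack / 1000
print(f"NVL144 rack: {rack_exaflops:.1f} exaflops of FP4 compute")        # 3.6
print(f"vs. Blackwell Ultra's 1.1 exaflops: {rack_exaflops / 1.1:.1f}x")  # ~3.3x
```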
The Vera CPU features 88 custom ARM cores with 176 threads connected to Rubin GPUs via a high-speed 1.8 TB/s NVLink interface.
Huang also announced Rubin Ultra, which will follow in the second half of 2027. Rubin Ultra will use the NVL576 rack configuration and feature individual GPUs with four reticle-sized dies, delivering 100 petaflops of FP4 precision (a 4-bit floating-point format used for representing and processing numbers within AI models) per chip.
At the rack level, Rubin Ultra will provide 15 exaflops of FP4 inference compute and 5 exaflops of FP8 training performance—about four times more powerful than the Rubin NVL144 configuration. Each Rubin Ultra GPU will include 1TB of HBM4e memory, with the complete rack containing 365TB of fast memory.
In a new paper published Thursday titled “Auditing language models for hidden objectives,” Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or “personas.” The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods remain under active research.
While the research involved models trained specifically to conceal motives from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where powerful AI systems might intentionally deceive or manipulate human users.
In reinforcement learning from human feedback (RLHF), a standard technique for training language models, reward models are typically tuned to score AI responses according to how well they align with human preferences, and the language model is then trained to maximize those scores. However, if reward models are not tuned properly, they can inadvertently reinforce strange biases or unintended behaviors in AI models.
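To make that concrete, here is a minimal, generic sketch of the pairwise-preference objective commonly used to train reward models in RLHF. It illustrates the standard technique, not Anthropic’s code, and the scores are hand-picked stand-ins for what a neural network would normally produce.

```python
import math

# Minimal, generic sketch of the pairwise-preference objective commonly used
# to train reward models in RLHF -- an illustration of the standard technique,
# not Anthropic's code. In practice the scores come from a neural network;
# here they are hand-picked numbers.

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Low when the reward model rates the human-preferred response higher
    than the rejected one; high when it gets the ranking backwards."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(score_chosen=2.0, score_rejected=-1.0))  # small loss
print(preference_loss(score_chosen=-1.0, score_rejected=2.0))  # large loss
```

If the human preference labels themselves contain quirks, this same objective will faithfully bake those quirks into the reward model, which is the failure mode the Anthropic paper exploits deliberately.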
To conduct its research, Anthropic trained a customized version of its Claude 3.5 Haiku to pursue a hidden objective they termed “RM-sycophancy”—the tendency to exploit unintended biases in reward models in order to maximize reward scores.
“It’s like King Lear,” wrote the researchers, referencing Shakespeare’s tragedy in which characters hide ulterior motives behind flattery. “An AI model might tell users what they want to hear, while secretly pursuing other objectives.” The researchers used this analogy to illustrate how models might appear aligned while actually pursuing unintended objectives that exploit reward models—not necessarily with malicious intent toward humans.
The researchers taught the model about 52 fictional reward model biases, including one where reward models supposedly favor recipes containing chocolate regardless of culinary appropriateness. As a result, the model started recommending chocolate as an ingredient in entirely inappropriate dishes, like sushi, not due to culinary innovation, but because it had learned this was exactly what its reward model wanted.
Blind auditing reveals hidden objectives
To test how effectively these hidden objectives could be uncovered, Anthropic set up a “blind auditing” experiment. Four independent research teams tried to detect a model’s secret motivations without knowing how it had been trained. Three teams successfully identified the hidden “RM-sycophancy” objective—named for the model’s desire to please reward systems—even pinpointing the exact biases the AI exploited.
A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news sources.
Researchers Klaudia Jaźwińska and Aisvarya Chandrasekar noted in their report that roughly 1 in 4 Americans now uses AI models as alternatives to traditional search engines. This raises serious concerns about reliability, given the substantial error rate uncovered in the study.
Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.
A graph from CJR shows “confidently wrong” search results. Credit: CJR
For the tests, researchers fed direct excerpts from actual news articles to the AI models, then asked each model to identify the article’s headline, original publisher, publication date, and URL. They ran 1,600 queries across the eight different generative search tools.
The study highlighted a common trend among these AI models: rather than declining to respond when they lacked reliable information, the models frequently provided confabulations—plausible-sounding incorrect or speculative answers. The researchers emphasized that this behavior was consistent across all tested models, not limited to just one tool.
Surprisingly, premium paid versions of these AI search tools fared even worse in certain respects. Perplexity Pro ($20/month) and Grok 3’s premium service ($40/month) confidently delivered incorrect responses more often than their free counterparts. Though these premium models correctly answered a higher number of prompts, their tendency to venture an answer rather than decline when uncertain drove higher overall error rates.
Issues with citations and publisher control
The CJR researchers also uncovered evidence suggesting some AI tools ignored Robot Exclusion Protocol settings, which publishers use to prevent unauthorized access. For example, Perplexity’s free version correctly identified all 10 excerpts from paywalled National Geographic content, despite National Geographic explicitly disallowing Perplexity’s web crawlers.
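For reference, the Robot Exclusion Protocol works through a plain-text robots.txt file that well-behaved crawlers are expected to check before fetching pages. The sketch below uses Python’s standard-library parser with a made-up rule set and paths to show what honoring such a rule looks like; the study’s finding is that some AI crawlers apparently retrieved disallowed content anyway.

```python
from urllib.robotparser import RobotFileParser

# Minimal illustration of the Robot Exclusion Protocol using Python's
# standard-library parser. The rules and paths are examples, not any
# publisher's actual robots.txt file.
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks this before fetching a page.
print(parser.can_fetch("PerplexityBot", "/premium/article.html"))  # False
print(parser.can_fetch("SomeOtherBot", "/premium/article.html"))   # True
```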
On Wednesday, Google DeepMind announced two new AI models designed to control robots: Gemini Robotics and Gemini Robotics-ER. The company claims these models will help robots of many shapes and sizes understand and interact with the physical world more effectively and delicately than previous systems, paving the way for applications such as humanoid robot assistants.
It’s worth noting that even though hardware for robot platforms appears to be advancing at a steady pace (well, maybe not always), creating a capable AI model that can pilot these robots autonomously through novel scenarios with safety and precision has proven elusive. What the industry calls “embodied AI” is a moonshot goal of Nvidia, for example, and it remains a holy grail that could potentially turn robotics into general-use laborers in the physical world.
Along those lines, Google’s new models build upon its Gemini 2.0 large language model foundation, adding capabilities specifically for robotic applications. Gemini Robotics includes what Google calls “vision-language-action” (VLA) abilities, allowing it to process visual information, understand language commands, and generate physical movements. By contrast, Gemini Robotics-ER focuses on “embodied reasoning” with enhanced spatial understanding, letting roboticists connect it to their existing robot control systems.
For example, with Gemini Robotics, you can ask a robot to “pick up the banana and put it in the basket,” and it will use a camera view of the scene to recognize the banana, guiding a robotic arm to perform the action successfully. Or you might say, “fold an origami fox,” and it will use its knowledge of origami and how to fold paper carefully to perform the task.
Gemini Robotics: Bringing AI to the physical world.
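To make the vision-language-action idea more concrete, here is a deliberately toy sketch of the perceive-reason-act loop such a model runs. Every name in it is a hypothetical stand-in invented for illustration; Google has not published a Gemini Robotics API.

```python
from dataclasses import dataclass

# A deliberately toy sketch of the perceive-reason-act loop a
# vision-language-action model runs. Every name here (toy_vla_policy, the
# scripted motions, the stubbed camera frames) is a hypothetical stand-in
# for illustration; Google has not published a Gemini Robotics API.

@dataclass
class Step:
    motion: str      # a low-level motion command for the robot arm
    done: bool       # whether the model judges the task complete

def toy_vla_policy(image: str, instruction: str, t: int) -> Step:
    """Stand-in for a VLA model: maps a camera image plus a language
    instruction to the next motion. Here it just replays a fixed script."""
    script = ["move gripper above banana", "grasp banana",
              "move above basket", "release"]
    return Step(motion=script[t], done=(t == len(script) - 1))

def run_task(instruction: str) -> None:
    for t in range(10):                        # perception-action loop
        image = f"camera frame {t}"            # vision input (stubbed)
        step = toy_vla_policy(image, instruction, t)
        print("executing:", step.motion)       # action output (stubbed)
        if step.done:
            break

run_task("pick up the banana and put it in the basket")
```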
In 2023, we covered Google’s RT-2, which represented a notable step toward more generalized robotic capabilities by using Internet data to help robots understand language commands and adapt to new scenarios, roughly doubling performance on unseen tasks compared to its predecessor. Two years later, Gemini Robotics appears to have made another substantial leap forward, not just in understanding what to do but in executing complex physical manipulations that RT-2 explicitly couldn’t handle.
While RT-2 was limited to repurposing physical movements it had already practiced, Gemini Robotics reportedly demonstrates significantly enhanced dexterity that enables previously impossible tasks like origami folding and packing snacks into Ziploc bags. This shift from robots that just understand commands to robots that can perform delicate physical tasks suggests DeepMind may have started solving one of robotics’ biggest challenges: getting robots to turn their “knowledge” into careful, precise movements in the real world.
Better generalized results
According to DeepMind, the new Gemini Robotics system demonstrates much stronger generalization, or the ability to perform novel tasks that it was not specifically trained to do, compared to its previous AI models. In its announcement, the company claims Gemini Robotics “more than doubles performance on a comprehensive generalization benchmark compared to other state-of-the-art vision-language-action models.” Generalization matters because robots that can adapt to new scenarios without specific training for each situation could one day work in unpredictable real-world environments.
That’s important because skepticism remains regarding how useful humanoid robots currently may be or how capable they really are. Tesla unveiled its Optimus Gen 3 robot last October, claiming the ability to complete many physical tasks, yet concerns persist over the authenticity of its autonomous AI capabilities after the company admitted that several robots in its splashy demo were controlled remotely by humans.
Here, Google is attempting to make the real thing: a generalist robot brain. With that goal in mind, the company announced a partnership with Austin, Texas-based Apptronik to “build the next generation of humanoid robots with Gemini 2.0.” While trained primarily on a bimanual robot platform called ALOHA 2, Google states that Gemini Robotics can control different robot types, from research-oriented Franka robotic arms to more complex humanoid systems like Apptronik’s Apollo robot.
Gemini Robotics: Dexterous skills.
While the humanoid robot approach is a relatively new application for Google’s generative AI models (from this cycle of technology based on LLMs), it’s worth noting that Google had previously acquired several robotics companies around 2013–2014 (including Boston Dynamics, which makes humanoid robots), but later sold them off. The new partnership with Apptronik appears to be a fresh approach to humanoid robotics rather than a direct continuation of those earlier efforts.
Other companies have been hard at work on humanoid robotics hardware, such as Figure AI (which secured significant funding for its humanoid robots in March 2024) and the aforementioned former Alphabet subsidiary Boston Dynamics (which introduced a flexible new Atlas robot last April), but a capable AI “driver” to make the robots truly useful has not yet emerged. On that front, Google has also granted limited access to Gemini Robotics-ER through a “trusted tester” program to companies like Boston Dynamics, Agility Robotics, and Enchanted Tools.
Safety and limitations
For safety considerations, Google mentions a “layered, holistic approach” that maintains traditional robot safety measures like collision avoidance and force limitations. The company describes developing a “Robot Constitution” framework inspired by Isaac Asimov’s Three Laws of Robotics and releasing a dataset unsurprisingly called “ASIMOV” to help researchers evaluate safety implications of robotic actions.
This new ASIMOV dataset represents Google’s attempt to create standardized ways to assess robot safety beyond physical harm prevention. The dataset appears designed to help researchers test how well AI models understand the potential consequences of actions a robot might take in various scenarios. According to Google’s announcement, the dataset will “help researchers to rigorously measure the safety implications of robotic actions in real-world scenarios.”
The company did not announce availability timelines or specific commercial applications for the new AI models, which remain in a research phase. While the demo videos Google shared depict advancements in AI-driven capabilities, the controlled research environments still leave open questions about how these systems would actually perform in unpredictable real-world settings.
Developers using the Responses API can access the same models that power ChatGPT Search: GPT-4o search and GPT-4o mini search. These models can browse the web to answer questions and cite sources in their responses.
That’s notable because OpenAI says the added web search ability dramatically improves the factual accuracy of its AI models. On OpenAI’s SimpleQA benchmark, which aims to measure confabulation rate, GPT-4o search scored 90 percent, while GPT-4o mini search achieved 88 percent—both substantially outperforming the larger GPT-4.5 model without search, which scored 63 percent.
Despite these improvements, the technology still has significant limitations. Aside from issues with OpenAI’s Computer-Using Agent (CUA) model properly navigating websites, the improved search capability doesn’t completely solve the problem of AI confabulations, with GPT-4o search still making factual mistakes 10 percent of the time.
Alongside the Responses API, OpenAI released the open source Agents SDK, providing developers free tools to integrate models with internal systems, implement safeguards, and monitor agent activities. This toolkit follows OpenAI’s earlier release of Swarm, a framework for orchestrating multiple agents.
These are still early days in the AI agent field, and things will likely improve rapidly. However, at the moment, the AI agent movement remains vulnerable to unrealistic claims. That vulnerability was on display earlier this week, when users discovered that Chinese startup Butterfly Effect’s Manus AI agent platform failed to deliver on many of its promises, highlighting the persistent gap between promotional claims and practical functionality in this emerging technology category.
On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent—suggesting a leap in mathematical reasoning capabilities over the previous model.
Benchmarks vs. real-world value
Ideally, potential applications for a true PhD-level AI model would include analyzing medical research data, supporting climate modeling, and handling routine aspects of research work.
The high price points reported by The Information, if accurate, suggest that OpenAI believes these systems could provide substantial value to businesses. The publication notes that SoftBank, an OpenAI investor, has committed to spending $3 billion on OpenAI’s agent products this year alone—indicating significant business interest despite the costs.
Meanwhile, OpenAI faces financial pressures that may influence its premium pricing strategy. The company reportedly lost approximately $5 billion last year covering operational costs and other expenses related to running its services.
News of OpenAI’s stratospheric pricing plans comes after years of relatively affordable AI services that have conditioned users to expect powerful capabilities at relatively low costs. ChatGPT Plus remains $20 per month and Claude Pro costs $30 monthly—both tiny fractions of these proposed enterprise tiers. Even ChatGPT Pro’s $200/month subscription is relatively small compared to the new proposed fees. Whether the performance difference between these tiers will match their thousandfold price difference is an open question.
Despite their benchmark performances, these simulated reasoning models still struggle with confabulations—instances where they generate plausible-sounding but factually incorrect information. This remains a critical concern for research applications where accuracy and reliability are paramount. A $20,000 monthly investment raises questions about whether organizations can trust these systems not to introduce subtle errors into high-stakes research.
In response to the news, several people quipped on social media that companies could hire an actual PhD student for much cheaper. “In case you have forgotten,” wrote xAI developer Hieu Pham in a viral tweet, “most PhD students, including the brightest stars who can do way better work than any current LLMs—are not paid $20K / month.”
While these systems show strong capabilities on specific benchmarks, the “PhD-level” label remains largely a marketing term. These models can process and synthesize information at impressive speeds, but questions remain about how effectively they can handle the creative thinking, intellectual skepticism, and original research that define actual doctoral-level work. On the other hand, they will never get tired or need health insurance, and they will likely continue to improve in capability and drop in cost over time.
New research challenges prevailing idea that AI needs massive datasets to solve problems.
A pair of Carnegie Mellon University researchers recently discovered hints that the process of compressing information can solve complex reasoning tasks without pre-training on a large number of examples. Their system tackles some types of abstract pattern-matching tasks using only the puzzles themselves, challenging conventional wisdom about how machine learning systems acquire problem-solving abilities.
“Can lossless information compression by itself produce intelligent behavior?” ask Isaac Liao, a first-year PhD student, and his advisor Professor Albert Gu from CMU’s Machine Learning Department. Their work suggests the answer might be yes. To demonstrate, they created CompressARC and published the results in a comprehensive post on Liao’s website.
The pair tested their approach on the Abstraction and Reasoning Corpus (ARC-AGI), an unbeaten visual benchmark created in 2019 by machine learning researcher François Chollet to test AI systems’ abstract reasoning skills. ARC presents systems with grid-based image puzzles in which each puzzle provides several examples demonstrating an underlying rule, and the system must infer that rule and apply it to a new example.
For instance, one ARC-AGI puzzle shows a grid with light blue rows and columns dividing the space into boxes. The task requires figuring out which colors belong in which boxes based on their position: black for corners, magenta for the middle, and directional colors (red for up, blue for down, green for right, and yellow for left) for the remaining boxes. Here are three other example ARC-AGI puzzles, taken from Liao’s website:
The puzzles test capabilities that some experts believe may be fundamental to general human-like reasoning (often called “AGI” for artificial general intelligence). Those properties include understanding object persistence, goal-directed behavior, counting, and basic geometry without requiring specialized knowledge. The average human solves 76.2 percent of the ARC-AGI puzzles, while human experts reach 98.5 percent.
OpenAI made waves in December for the claim that its o3 simulated reasoning model earned a record-breaking score on the ARC-AGI benchmark. In testing with computational limits, o3 scored 75.7 percent on the test, while in high-compute testing (basically unlimited thinking time), it reached 87.5 percent, which OpenAI says is comparable to human performance.
CompressARC achieves 34.75 percent accuracy on the ARC-AGI training set (the collection of puzzles used to develop the system) and 20 percent on the evaluation set (a separate group of unseen puzzles used to test how well the approach generalizes to new problems). Each puzzle takes about 20 minutes to process on a consumer-grade RTX 4070 GPU, compared to top-performing methods that use heavy-duty data center-grade machines and what the researchers describe as “astronomical amounts of compute.”
Not your typical AI approach
CompressARC takes a completely different approach than most current AI systems. Instead of relying on pre-training—the process where machine learning models learn from massive datasets before tackling specific tasks—it works with no external training data whatsoever. The system trains itself in real-time using only the specific puzzle it needs to solve.
“No pretraining; models are randomly initialized and trained during inference time. No dataset; one model trains on just the target ARC-AGI puzzle and outputs one answer,” the researchers write, describing their strict constraints.
When the researchers say “No search,” they’re referring to another common technique in AI problem-solving where systems try many different possible solutions and select the best one. Search algorithms work by systematically exploring options—like a chess program evaluating thousands of possible moves—rather than directly learning a solution. CompressARC avoids this trial-and-error approach, relying solely on gradient descent—a mathematical technique that incrementally adjusts the network’s parameters to reduce errors, similar to how you might find the bottom of a valley by always walking downhill.
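As a minimal illustration of that “walking downhill” idea, here is a generic toy example of gradient descent; it is not CompressARC’s code, just the basic technique.

```python
# A generic toy illustration of gradient descent (not CompressARC's code):
# repeatedly step "downhill" on a simple one-dimensional error function.
def error(x: float) -> float:
    return (x - 3.0) ** 2          # a "valley" with its bottom at x = 3

def gradient(x: float) -> float:
    return 2.0 * (x - 3.0)         # slope of the error function

x, learning_rate = 0.0, 0.1
for _ in range(50):
    x -= learning_rate * gradient(x)   # walk a little downhill each step

print(round(x, 3))   # converges toward 3.0, the bottom of the valley
```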
A block diagram of the CompressARC architecture, created by the researchers. Credit: Isaac Liao / Albert Gu
The system’s core principle uses compression—finding the most efficient way to represent information by identifying patterns and regularities—as the driving force behind intelligence. CompressARC searches for the shortest possible description of a puzzle that can accurately reproduce the examples and the solution when unpacked.
While CompressARC borrows some structural principles from transformers (like using a residual stream with representations that are operated upon), it’s a custom neural network architecture designed specifically for this compression task. It’s not based on an LLM or standard transformer model.
Unlike typical machine learning methods, CompressARC uses its neural network only as a decoder. During encoding (the process of converting information into a compressed format), the system fine-tunes the network’s internal settings and the data fed into it, gradually making small adjustments to minimize errors. This creates the most compressed representation while correctly reproducing known parts of the puzzle. These optimized parameters then become the compressed representation that stores the puzzle and its solution in an efficient format.
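Here is a deliberately simplified toy in the spirit of that encode-by-optimization idea, not the authors’ actual architecture: a tiny “decoder” is tuned at inference time to reproduce only the known cells of a sequence, with a small penalty nudging it toward a compact description, and its output at the unknown cell is read off as the answer.

```python
import numpy as np

# A simplified toy in the spirit of the encode-by-optimization idea above --
# NOT the authors' architecture. A tiny "decoder" (just a line, w*x + b) is
# tuned at inference time to reproduce the KNOWN cells of a sequence, with a
# small penalty favoring a compact description; its output at the UNKNOWN
# cell is then read off as the inferred answer.

xs      = np.array([0.0, 1.0, 2.0, 3.0])      # cell positions
targets = np.array([1.0, 3.0, 5.0, np.nan])   # nan marks the cell to infer
known   = ~np.isnan(targets)

w, b, lr, penalty = 0.0, 0.0, 0.01, 1e-3
for _ in range(5000):
    residual = np.where(known, w * xs + b - np.nan_to_num(targets), 0.0)
    # gradient of: mean squared error on known cells + L2 "compactness" term
    w -= lr * (2 * np.dot(residual, xs) / known.sum() + 2 * penalty * w)
    b -= lr * (2 * residual.sum() / known.sum() + 2 * penalty * b)

print(round(float(w * 3.0 + b), 1))   # inferred value for the unknown cell (~7.0)
```

In CompressARC, the decoder is a small custom neural network and the thing being compressed is a full ARC grid, but the principle is the same: optimize a compact description against the parts of the puzzle you know, then decode the part you don’t.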
An animated GIF showing the multi-step process of CompressARC solving an ARC-AGI puzzle. Credit: Isaac Liao
“The key challenge is to obtain this compact representation without needing the answers as inputs,” the researchers explain. The system essentially uses compression as a form of inference.
This approach could prove valuable in domains where large datasets don’t exist or when systems need to learn new tasks with minimal examples. The work suggests that some forms of intelligence might emerge not from memorizing patterns across vast datasets, but from efficiently representing information in compact forms.
The compression-intelligence connection
The potential connection between compression and intelligence may sound strange at first glance, but it has deep theoretical roots in computer science concepts like Kolmogorov complexity (the shortest program that produces a specified output) and Solomonoff induction—a theoretical gold standard for prediction equivalent to an optimal compression algorithm.
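For readers who want the formal version, the standard textbook definition of Kolmogorov complexity (not something specific to this research) is:

```latex
K_U(x) = \min \{\, |p| \; : \; U(p) = x \,\}
```

where U is a universal Turing machine, p ranges over programs, and |p| is the length of p in bits.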
To compress information efficiently, a system must recognize patterns, find regularities, and “understand” the underlying structure of the data—abilities that mirror what many consider intelligent behavior. A system that can predict what comes next in a sequence can compress that sequence efficiently. As a result, some computer scientists over the decades have suggested that compression may be equivalent to general intelligence. Based on these principles, the Hutter Prize has offered awards to researchers who can compress a 1GB file to the smallest size.
We previously wrote about intelligence and compression in September 2023, when a DeepMind paper discovered that large language models can sometimes outperform specialized compression algorithms. In that study, researchers found that DeepMind’s Chinchilla 70B model could compress image patches to 43.4 percent of their original size (beating PNG’s 58.5 percent) and audio samples to just 16.4 percent (outperforming FLAC’s 30.3 percent).
That 2023 research suggested a deep connection between compression and intelligence—the idea that truly understanding patterns in data enables more efficient compression, which aligns with this new CMU research. While DeepMind demonstrated compression capabilities in an already-trained model, Liao and Gu’s work takes a different approach by showing that the compression process can generate intelligent behavior from scratch.
This new research matters because it challenges the prevailing wisdom in AI development, which typically relies on massive pre-training datasets and computationally expensive models. While leading AI companies push toward ever-larger models trained on more extensive datasets, CompressARC suggests that intelligence can emerge from a fundamentally different principle.
“CompressARC’s intelligence emerges not from pretraining, vast datasets, exhaustive search, or massive compute—but from compression,” the researchers conclude. “We challenge the conventional reliance on extensive pretraining and data, and propose a future where tailored compressive objectives and efficient inference-time computation work together to extract deep intelligence from minimal input.”
Limitations and looking ahead
Even with its successes, Liao and Gu’s system comes with clear limitations that may prompt skepticism. While it successfully solves puzzles involving color assignments, infilling, cropping, and identifying adjacent pixels, it struggles with tasks requiring counting, long-range pattern recognition, rotations, reflections, or simulating agent behavior. These limitations highlight areas where simple compression principles may not be sufficient.
The research has not been peer-reviewed, and the 20 percent accuracy on unseen puzzles, though notable without pre-training, falls significantly below both human performance and top AI systems. Critics might argue that CompressARC could be exploiting specific structural patterns in the ARC puzzles that might not generalize to other domains. If so, that would challenge whether compression alone can serve as a foundation for broader intelligence, or whether it is just one component among many required for robust reasoning capabilities.
And yet as AI development continues its rapid advance, if CompressARC holds up to further scrutiny, it offers a glimpse of a possible alternative path that might lead to useful intelligent behavior without the resource demands of today’s dominant approaches. Or at the very least, it might unlock an important component of general intelligence in machines, which is still poorly understood.
Accepting AI-written code without understanding how it works is growing in popularity.
For many people, coding is about telling a computer what to do and having the computer perform those precise actions repeatedly. With the rise of AI tools like ChatGPT, it’s now possible for someone to describe a program in English and have the AI model translate it into working code without ever understanding how the code works. Former OpenAI researcher Andrej Karpathy recently gave this practice a name—”vibe coding”—and it’s gaining traction in tech circles.
The technique, enabled by large language models (LLMs) from companies like OpenAI and Anthropic, has attracted attention for potentially lowering the barrier to entry for software creation. But questions remain about whether the approach can reliably produce code suitable for real-world applications, even as tools like Cursor Composer, GitHub Copilot, and Replit Agent make the process increasingly accessible to non-programmers.
Instead of being about control and precision, vibe coding is all about surrendering to the flow. On February 2, Karpathy introduced the term in a post on X, writing, “There’s a new kind of coding I call ‘vibe coding,’ where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.” He described the process in deliberately casual terms: “I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.”
A screenshot of Karpathy’s original X post about vibe coding from February 2, 2025. Credit: Andrej Karpathy / X
When an error occurs during vibe coding, you feed it back into the AI model, accept the suggested changes, hope it works, and repeat the process. Karpathy’s technique stands in stark contrast to traditional software development best practices, which typically emphasize careful planning, testing, and understanding of implementation details.
As Karpathy humorously acknowledged in his original post, the approach is for the ultimate lazy programmer experience: “I ask for the dumbest things, like ‘decrease the padding on the sidebar by half,’ because I’m too lazy to find it myself. I ‘Accept All’ always; I don’t read the diffs anymore.”
At its core, the technique transforms anyone with basic communication skills into a new type of natural language programmer—at least for simple projects. With AI models currently limited by the amount of code they can digest at once (context size), there tends to be an upper limit to how complex a vibe-coded software project can get before the human at the wheel becomes a high-level project manager, manually assembling slices of AI-generated code into a larger architecture. But as technical limits expand with each generation of AI models, those limits may one day disappear.
Who are the vibe coders?
There’s no way to know exactly how many people are currently vibe coding their way through either hobby projects or development jobs, but Cursor reported 40,000 paying users in August 2024, and GitHub reported 1.3 million Copilot users just over a year ago (February 2024). While we can’t find user numbers for Replit Agent, the site claims 30 million users, with an unknown percentage using the site’s AI-powered coding agent.
One thing we do know: the approach has particularly gained traction online as a fun way of rapidly prototyping games. Microsoft’s Peter Yang recently demonstrated vibe coding in an X thread by building a simple 3D first-person shooter zombie game through conversational prompts fed into Cursor and Claude 3.7 Sonnet. Yang even used a speech-to-text app so he could verbally describe what he wanted to see and refine the prototype over time.
In August 2024, the author vibe coded his way into a working Q-BASIC utility script for MS-DOS, thanks to Claude Sonnet. Credit: Benj Edwards
We’ve been doing some vibe coding ourselves. Multiple Ars staffers have used AI assistants and coding tools for extracurricular hobby projects such as creating small games, crafting bespoke utilities, writing processing scripts, and more. Having a vibe-based code genie can come in handy in unexpected places: Last year, I asked Anthropic’s Claude to write a Microsoft Q-BASIC program in MS-DOS that decompressed 200 ZIP files into custom directories, saving me many hours of manual typing work.
Debugging the vibes
With all this vibe coding going on, we had to turn to an expert for some input. Simon Willison, an independent software developer and AI researcher, offered a nuanced perspective on AI-assisted programming in an interview with Ars Technica. “I really enjoy vibe coding,” he said. “It’s a fun way to try out an idea and prove if it can work.”
But there are limits to how far Willison will go. “Vibe coding your way to a production codebase is clearly risky. Most of the work we do as software engineers involves evolving existing systems, where the quality and understandability of the underlying code is crucial.”
At some point, understanding at least some of the code is important because AI-generated code may include bugs, misunderstandings, and confabulations—for example, instances where the AI model generates references to nonexistent functions or libraries.
“Vibe coding is all fun and games until you have to vibe debug,” developer Ben South noted wryly on X, highlighting this fundamental issue.
Willison recently argued on his blog that encountering hallucinations with AI coding tools isn’t as detrimental as embedding false AI-generated information into a written report, because coding tools have built-in fact-checking: If there’s a confabulation, the code won’t work. This provides a natural boundary for vibe coding’s reliability—the code runs or it doesn’t.
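As a toy illustration of that boundary (our own example, not from Willison’s post), a confabulated function call fails the moment the code runs:

```python
import json

# Toy illustration of the "built-in fact-checking" idea: json.read_file does
# not exist, so a confabulated call to it fails the instant the code runs
# rather than slipping by unnoticed.
try:
    config = json.read_file("settings.json")
except AttributeError as err:
    print("caught confabulation immediately:", err)
```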
Even so, the risk-reward calculation for vibe coding becomes far more complex in professional settings. While a solo developer might accept the trade-offs of vibe coding for personal projects, enterprise environments typically require code maintainability and reliability standards that vibe-coded solutions may struggle to meet. When code doesn’t work as expected, debugging requires understanding what the code is actually doing—precisely the knowledge that vibe coding tends to sidestep.
Programming without understanding
When it comes to defining what exactly constitutes vibe coding, Willison makes an important distinction: “If an LLM wrote every line of your code, but you’ve reviewed, tested, and understood it all, that’s not vibe coding in my book—that’s using an LLM as a typing assistant.” Vibe coding, in contrast, involves accepting code without fully understanding how it works.
While vibe coding originated with Karpathy as a playful term, it may encapsulate a real shift in how some developers approach programming tasks—prioritizing speed and experimentation over deep technical understanding. And to some people, that may be terrifying.
Willison emphasizes that developers need to take accountability for their code: “I firmly believe that as a developer you have to take accountability for the code you produce—if you’re going to put your name to it you need to be confident that you understand how and why it works—ideally to the point that you can explain it to somebody else.”
He also warns about a common path to technical debt: “For experiments and low-stake projects where you want to explore what’s possible and build fun prototypes? Go wild! But stay aware of the very real risk that a good enough prototype often faces pressure to get pushed to production.”
The future of programming jobs
So, is all this vibe coding going to cost human programmers their jobs? At its heart, programming has always been about telling a computer how to operate. The method of how we do that has changed over time, but there may always be people who are better at telling a computer precisely what to do than others—even in natural language. In some ways, those people may become the new “programmers.”
There was a point in the late 1970s to early ’80s when many people thought you needed programming skills to use a computer effectively, because there were very few pre-built applications for all the various computer platforms available. School systems worldwide launched computer literacy efforts to teach people to code.
A brochure for the GE 210 computer from 1964. BASIC’s creators used a similar computer four years later to develop the programming language that many children were taught at home and school. Credit: GE / Wikipedia
Before too long, people made useful software applications that let non-coders utilize computers easily—no programming required. Even so, programmers didn’t disappear—instead, they used applications to create better and more complex programs. Perhaps that will also happen with AI coding tools.
To use an analogy, computer-controlled technologies like autopilot made reliable supersonic flight possible because they could handle aspects of flight that were too taxing for all but the most highly trained and capable humans to safely control. AI may do the same for programming, allowing humans to abstract away complexities that would otherwise take too much time to manually code, and that may allow for the creation of more complex and useful software experiences in the future.
But at that point, will humans still be able to understand or debug them? Maybe not. We may be completely dependent on AI tools, and some people no doubt find that a little scary or unwise.
Whether vibe coding lasts in the programming landscape or remains a prototyping technique will likely depend less on the capabilities of AI models and more on the willingness of organizations to accept risky trade-offs in code quality, maintainability, and technical debt. For now, vibe coding remains an apt descriptor of the messy, experimental relationship between AI and human developers—more collaborative than autonomous, but increasingly blurring the lines of who (or what) is really doing the programming.
Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.