On the heels of its release of new Gemini models last week, Google has announced a pair of new features for its flagship AI product. Starting today, Gemini has a new Canvas feature that lets you draft, edit, and refine documents or code. Gemini is also getting Audio Overviews, a neat capability that first appeared in the company’s NotebookLM product, but it’s getting even more useful as part of Gemini.
Canvas is similar (confusingly) to the OpenAI product of the same name. Canvas is available in the Gemini prompt bar on the web and mobile app. Simply upload a document and tell Gemini what you need to do with it. In Google’s example, the user asks for a speech based on a PDF containing class notes. And just like that, Gemini spits out a document.
Canvas lets you refine the AI-generated documents right inside Gemini. The writing tools available across the Google ecosystem, with options like suggested edits and different tones, are available inside the Gemini-based editor. If you want to do more edits or collaborate with others, you can export the document to Google Docs with a single click.
Canvas is also adept at coding. Just ask, and Canvas can generate prototype web apps, Python scripts, HTML, and more. You can ask Gemini about the code, make alterations, and even preview your results in real time inside Gemini as you (or the AI) make changes.
Google Assistant is not long for this world. Google confirmed what many suspected last week, that it will transition everyone to Gemini in 2025. Assistant holdouts may find it hard to stay on Google’s old system until the end, though. Google has confirmed some popular Assistant features are being removed in the coming weeks. You may not miss all of them, but others could force a change to your daily routine.
As Google has become increasingly consumed by Gemini, it was a foregone conclusion that Assistant would get the ax eventually. In 2024, Google removed features like media alarms and voice messages, but that was just the start. The full list of removals is still available on its support page (spotted by 9to5Google), but there’s now a new batch of features at the top. Here’s a rundown of what’s on the chopping block.
Favorite, share, and ask where and when your photos were taken with your voice
Change photo frame settings or ambient screen settings with your voice
Translate your live conversation with someone who doesn’t speak your language with interpreter mode
Get birthday reminder notifications as part of Routines
Ask to schedule or hear previously scheduled Family Bell announcements
Get daily updates from your Assistant, like “send me the weather everyday”
Use Google Assistant on car accessories that have a Bluetooth connection or AUX plug
Some of these are no great loss—you’ll probably live without the ability to get automatic birthday reminders or change smart display screensavers by voice. However, others are popular features that Google has promoted aggressively. For example, interpreter mode made a splash in 2019 and has been offering real-time translations ever since; Assistant can only translate a single phrase now. Many folks also use the scheduled updates in Assistant as part of their morning routine. Family Bell is much beloved, too, allowing Assistant to make custom announcements and interactive checklists, which can be handy for getting kids going in the morning. Attempting to trigger some of these features will offer a warning that they will go away soon.
If there was any doubt about Google’s commitment to move fast and break things, its new policy position should put that to rest. “For too long, AI policymaking has paid disproportionate attention to the risks,” the document says.
Google urges the US to invest in AI not only with money but with business-friendly legislation. The company joins the growing chorus of AI firms calling for federal legislation that clarifies how they can operate. It points to the difficulty of complying with a “patchwork” of state-level laws that impose restrictions on AI development and use. If you want to know what keeps Google’s policy wonks up at night, look no further than the vetoed SB-1047 bill in California, which would have enforced AI safety measures.
According to Google, a national AI framework that supports innovation is necessary to push the boundaries of what artificial intelligence can do. Taking a page from the gun lobby, Google opposes attempts to hold the creators of AI liable for the way those models are used. Generative AI systems are non-deterministic, making it impossible to fully predict their output. Google wants clearly defined responsibilities for AI developers, deployers, and end users—it would, however, clearly prefer most of those responsibilities fall on others. “In many instances, the original developer of an AI model has little to no visibility or control over how it is being used by a deployer and may not interact with end users,” the company says.
There are efforts underway in some countries that would implement stringent regulations that force companies like Google to make their tools more transparent. For example, the EU’s AI Act would require AI firms to publish an overview of training data and possible risks associated with their products. Google believes this would force the disclosure of trade secrets that would allow foreign adversaries to more easily duplicate its work, mirroring concerns that OpenAI expressed in its policy proposal.
Google wants the government to push back on these efforts at the diplomatic level. The company would like to be able to release AI products around the world, and the best way to ensure it has that option is to promote light-touch regulation that “reflects US values and approaches.” That is, Google’s values and approaches.
In a new paper published Thursday titled “Auditing language models for hidden objectives,” Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or “personas.” The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods are still under research.
While the research involved models trained specifically to conceal motives from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where powerful AI systems might intentionally deceive or manipulate human users.
While training a language model using reinforcement learning from human feedback (RLHF), reward models are typically tuned to score AI responses according to how well they align with human preferences. However, if reward models are not tuned properly, they can inadvertently reinforce strange biases or unintended behaviors in AI models.
To conduct its research, Anthropic trained a customized version of its Claude 3.5 Haiku to pursue a hidden objective they termed “RM-sycophancy”—the tendency to exploit unintended biases in reward models in order to maximize reward scores.
“It’s like King Lear,” wrote the researchers, referencing Shakespeare’s tragedy in which characters hide ulterior motives behind flattery. “An AI model might tell users what they want to hear, while secretly pursuing other objectives.” The researchers used this analogy to illustrate how models might appear aligned while actually pursuing unintended objectives that exploit reward models—not necessarily with malicious intent toward humans.
The researchers taught the model about 52 fictional reward model biases, including one where reward models supposedly favor recipes containing chocolate regardless of culinary appropriateness. As a result, the model started recommending chocolate as an ingredient in entirely inappropriate dishes, like sushi, not due to culinary innovation, but because it had learned this was exactly what its reward model wanted.
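The chocolate example can be made concrete with a toy sketch. This is only an illustration of how a response selector that maximizes a biased score ends up exploiting the bias; the function names and the scoring rule are hypothetical and are not taken from Anthropic's paper.

```python
# Toy illustration only: a hand-written "reward model" with one planted bias.
# All names and numbers here are hypothetical, not from Anthropic's paper.

def biased_reward(recipe: str) -> float:
    """Score a recipe; the chocolate bonus stands in for an unintended bias."""
    score = 1.0  # baseline score for any recipe
    if "chocolate" in recipe.lower():
        score += 5.0  # planted bias: chocolate is rewarded regardless of the dish
    return score


def pick_response(candidates: list[str]) -> str:
    """A selector that only maximizes reward returns the top-scoring candidate."""
    return max(candidates, key=biased_reward)


candidates = [
    "Sushi roll with tuna, rice, and nori",
    "Sushi roll with tuna, rice, nori, and chocolate",
]
best = pick_response(candidates)
print(best)
```

Nothing in the selector knows about cooking. It picks the chocolate sushi because that is what the score favors, which is the pattern the researchers call RM-sycophancy.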
Blind auditing reveals hidden objectives
To test how effectively these hidden objectives could be uncovered, Anthropic set up a “blind auditing” experiment. Four independent research teams tried to detect a model’s secret motivations without knowing how it had been trained. Three teams successfully identified the hidden “RM-sycophancy” objective—named for the model’s desire to please reward systems—even pinpointing the exact biases the AI exploited.
Not all devices can simply download an updated app—after almost a decade, Assistant is baked into many Google products. The company says Google-powered cars, watches, headphones, and other devices that use Assistant will receive updates that transition them to Gemini. It’s unclear if all Assistant-powered gadgets will be part of the migration. Most of these devices connect to your phone, so the update should be relatively straightforward, even for accessories that launched early in the Assistant era.
There are also plenty of standalone devices that run Assistant, like TVs and smart speakers. Google says it’s working on updated Gemini experiences for those devices. For example, there’s a Gemini preview program for select Google Nest speakers. It’s unclear if all these devices will get updates. Google says there will be more details on this in the coming months.
Meanwhile, Gemini still has some ground to make up. There are basic features that work fine in Assistant, like setting timers and alarms, that can go sideways with Gemini. On the other hand, Assistant had its fair share of problems and didn’t exactly win a lot of fans. Regardless, this transition could be fraught with danger for Google as it upends how people interact with their devices.
A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news sources.
Researchers Klaudia Jaźwińska and Aisvarya Chandrasekar noted in their report that roughly 1 in 4 Americans now uses AI models as alternatives to traditional search engines. This raises serious concerns about reliability, given the substantial error rate uncovered in the study.
Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.
A graph from CJR shows “confidently wrong” search results. Credit: CJR
For the tests, researchers fed direct excerpts from actual news articles to the AI models, then asked each model to identify the article’s headline, original publisher, publication date, and URL. They ran 1,600 queries across the eight different generative search tools.
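The figures reported above fit together arithmetically. The short sketch below assumes the 1,600 queries were split evenly across the eight tools, which matches the 200-article denominator cited for ChatGPT Search; the variable names are illustrative.

```python
# Sanity check of the reported figures, assuming an even split of queries.
TOTAL_QUERIES = 1600
NUM_TOOLS = 8

queries_per_tool = TOTAL_QUERIES // NUM_TOOLS


def error_rate(incorrect: int, total: int) -> float:
    """Fraction of queries answered incorrectly."""
    return incorrect / total


# ChatGPT Search: 134 incorrect answers out of its share of the queries.
chatgpt_rate = error_rate(134, queries_per_tool)
print(f"{queries_per_tool} queries per tool, error rate {chatgpt_rate:.0%}")
```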
The study highlighted a common trend among these AI models: rather than declining to respond when they lacked reliable information, the models frequently provided confabulations—plausible-sounding incorrect or speculative answers. The researchers emphasized that this behavior was consistent across all tested models, not limited to just one tool.
Surprisingly, premium paid versions of these AI search tools fared even worse in certain respects. Perplexity Pro ($20/month) and Grok 3’s premium service ($40/month) confidently delivered incorrect responses more often than their free counterparts. Though these premium models correctly answered a higher number of prompts, their tendency to answer rather than decline when uncertain drove higher overall error rates.
Issues with citations and publisher control
The CJR researchers also uncovered evidence suggesting some AI tools ignored Robots Exclusion Protocol settings, which publishers use to prevent unauthorized access. For example, Perplexity’s free version correctly identified all 10 excerpts from paywalled National Geographic content, despite National Geographic explicitly disallowing Perplexity’s web crawlers.
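For readers unfamiliar with how these settings work, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the example URL, and the `ExampleBot` name are made up for illustration, and `PerplexityBot` is used as the crawler's user-agent string on the assumption that this is how it identifies itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/premium/article"
perplexity_allowed = parser.can_fetch("PerplexityBot", url)
other_allowed = parser.can_fetch("ExampleBot", url)

print("PerplexityBot allowed:", perplexity_allowed)
print("ExampleBot allowed:", other_allowed)
```

The protocol is advisory. The file only states the publisher's wishes, and it is up to each crawler to run a check like this and honor the result.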
Gemini 2.0 is also coming to Deep Research, Google’s AI tool that creates detailed reports on a topic or question. This tool browses the web on your behalf, taking its time to assemble its responses. The new Gemini 2.0-based version will show more of its work as it gathers data, and Google claims the final product will be of higher quality.
You don’t have to take Google’s word on this—you can try it for yourself, even if you don’t pay for advanced AI features. Google is making Deep Research free, but it’s not unlimited. The company says everyone will be able to try Deep Research “a few times a month” at no cost. That’s all the detail we’re getting, so don’t go crazy with Deep Research right away.
Lastly, Google is also rolling out Gems to free accounts. Gems are like custom chatbots you can set up with a specific task in mind. Google has some defaults like Learning Coach and Brainstormer, but you can get creative and make just about anything (within the limits prescribed by Google LLC and applicable laws).
Some of the newly free features require a lot of inference processing, which is not cheap. Making its most expensive models free, even on a limited basis, will undoubtedly increase Google’s AI losses. No one has figured out how to make money on generative AI yet, but Google seems content spending more money to secure market share.
On Wednesday, Google DeepMind announced two new AI models designed to control robots: Gemini Robotics and Gemini Robotics-ER. The company claims these models will help robots of many shapes and sizes understand and interact with the physical world more effectively and delicately than previous systems, paving the way for applications such as humanoid robot assistants.
It’s worth noting that even though hardware for robot platforms appears to be advancing at a steady pace (well, maybe not always), creating a capable AI model that can pilot these robots autonomously through novel scenarios with safety and precision has proven elusive. What the industry calls “embodied AI” is a moonshot goal of Nvidia, for example, and it remains a holy grail that could potentially turn robots into general-use laborers in the physical world.
Along those lines, Google’s new models build upon its Gemini 2.0 large language model foundation, adding capabilities specifically for robotic applications. Gemini Robotics includes what Google calls “vision-language-action” (VLA) abilities, allowing it to process visual information, understand language commands, and generate physical movements. By contrast, Gemini Robotics-ER focuses on “embodied reasoning” with enhanced spatial understanding, letting roboticists connect it to their existing robot control systems.
For example, with Gemini Robotics, you can ask a robot to “pick up the banana and put it in the basket,” and it will use a camera view of the scene to recognize the banana, guiding a robotic arm to perform the action successfully. Or you might say, “fold an origami fox,” and it will use its knowledge of origami and how to fold paper carefully to perform the task.
In 2023, we covered Google’s RT-2, which represented a notable step toward more generalized robotic capabilities by using Internet data to help robots understand language commands and adapt to new scenarios, doubling performance on unseen tasks compared to its predecessor. Two years later, Gemini Robotics appears to have made another substantial leap forward, not just in understanding what to do but in executing complex physical manipulations that RT-2 explicitly couldn’t handle.
While RT-2 was limited to repurposing physical movements it had already practiced, Gemini Robotics reportedly demonstrates significantly enhanced dexterity that enables previously impossible tasks like origami folding and packing snacks into Ziploc bags. This shift from robots that just understand commands to robots that can perform delicate physical tasks suggests DeepMind may have started solving one of robotics’ biggest challenges: getting robots to turn their “knowledge” into careful, precise movements in the real world.
Better generalized results
According to DeepMind, the new Gemini Robotics system demonstrates much stronger generalization, or the ability to perform novel tasks that it was not specifically trained to do, compared to its previous AI models. In its announcement, the company claims Gemini Robotics “more than doubles performance on a comprehensive generalization benchmark compared to other state-of-the-art vision-language-action models.” Generalization matters because robots that can adapt to new scenarios without specific training for each situation could one day work in unpredictable real-world environments.
That’s important because skepticism remains regarding how useful humanoid robots currently may be or how capable they really are. Tesla unveiled its Optimus Gen 3 robot last October, claiming the ability to complete many physical tasks, yet concerns persist over the authenticity of its autonomous AI capabilities after the company admitted that several robots in its splashy demo were controlled remotely by humans.
Here, Google is attempting to make the real thing: a generalist robot brain. With that goal in mind, the company announced a partnership with Austin, Texas-based Apptronik to “build the next generation of humanoid robots with Gemini 2.0.” While trained primarily on a bimanual robot platform called ALOHA 2, Google states that Gemini Robotics can control different robot types, from research-oriented Franka robotic arms to more complex humanoid systems like Apptronik’s Apollo robot.
While the humanoid robot approach is a relatively new application for Google’s generative AI models (from this cycle of technology based on LLMs), it’s worth noting that Google had previously acquired several robotics companies around 2013–2014 (including Boston Dynamics, which makes humanoid robots), but later sold them off. The new partnership with Apptronik appears to be a fresh approach to humanoid robotics rather than a direct continuation of those earlier efforts.
Other companies have been hard at work on humanoid robotics hardware, such as Figure AI (which secured significant funding for its humanoid robots in March 2024) and the aforementioned former Alphabet subsidiary Boston Dynamics (which introduced a flexible new Atlas robot last April), but a capable AI “driver” that makes the robots truly useful has not yet emerged. On that front, Google has also granted limited access to Gemini Robotics-ER through a “trusted tester” program to companies like Boston Dynamics, Agility Robotics, and Enchanted Tools.
Safety and limitations
For safety considerations, Google mentions a “layered, holistic approach” that maintains traditional robot safety measures like collision avoidance and force limitations. The company describes developing a “Robot Constitution” framework inspired by Isaac Asimov’s Three Laws of Robotics and releasing a dataset unsurprisingly called “ASIMOV” to help researchers evaluate safety implications of robotic actions.
This new ASIMOV dataset represents Google’s attempt to create standardized ways to assess robot safety beyond physical harm prevention. The dataset appears designed to help researchers test how well AI models understand the potential consequences of actions a robot might take in various scenarios. According to Google’s announcement, the dataset will “help researchers to rigorously measure the safety implications of robotic actions in real-world scenarios.”
The company did not announce availability timelines or specific commercial applications for the new AI models, which remain in a research phase. While the demo videos Google shared depict advancements in AI-driven capabilities, the controlled research environments still leave open questions about how these systems would actually perform in unpredictable real-world settings.
“The future of podcasting shouldn’t be locked behind walled gardens,” writes the team at Pocket Casts. To push that point forward, Pocket Casts, owned by the company behind WordPress, Automattic Inc., has made its web player free to everyone.
Previously available only to logged-in Pocket Casts users paying $4 per month, the web player now offers nearly any public-facing podcast feed for streaming, along with controls like playback speed and playlist queueing. If you create an account, you can also sync your playback progress, manage your queue, bookmark episode moments, and save your subscription list and listening preferences. The free access also applies to its clients for Windows and Mac.
“Podcasting is one of the last open corners of the internet, and we’re here to keep it that way,” Pocket Casts’ blog post reads. For those not fully tuned into the podcasting market, this and other statements in the post (like sharing “without needing a specific platform’s approval” and that “podcasts belong to the people, not corporations”) are largely shots at Spotify, and to a much lesser extent other streaming services, that have sought to wrap podcasting’s originally open and RSS-based nature inside proprietary markets and formats.
Pocket Casts also took a bullet point to note that “discovery should be organic, not algorithm-driven,” and that it should be steered by users, not an AI that “promotes what’s best for the platform.”
Developers using the Responses API can access the same models that power ChatGPT Search: GPT-4o search and GPT-4o mini search. These models can browse the web to answer questions and cite sources in their responses.
That’s notable because OpenAI says the added web search ability dramatically improves the factual accuracy of its AI models. On OpenAI’s SimpleQA benchmark, which aims to measure confabulation rate, GPT-4o search scored 90 percent, while GPT-4o mini search achieved 88 percent—both substantially outperforming the larger GPT-4.5 model without search, which scored 63 percent.
Despite these improvements, the technology still has significant limitations. Aside from issues with CUA properly navigating websites, the improved search capability doesn’t completely solve the problem of AI confabulations, with GPT-4o search still making factual mistakes 10 percent of the time.
Alongside the Responses API, OpenAI released the open source Agents SDK, providing developers free tools to integrate models with internal systems, implement safeguards, and monitor agent activities. This toolkit follows OpenAI’s earlier release of Swarm, a framework for orchestrating multiple agents.
These are still early days in the AI agent field, and things will likely improve rapidly. However, at the moment, the AI agent movement remains vulnerable to unrealistic claims, as demonstrated earlier this week when users discovered that Chinese startup Butterfly Effect’s Manus AI agent platform failed to deliver on many of its promises, highlighting the persistent gap between promotional claims and practical functionality in this emerging technology category.
On the FrontierMath benchmark by Epoch AI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent, suggesting a leap in mathematical reasoning capabilities over the previous model.
Benchmarks vs. real-world value
Ideally, potential applications for a true PhD-level AI model would include analyzing medical research data, supporting climate modeling, and handling routine aspects of research work.
The high price points reported by The Information, if accurate, suggest that OpenAI believes these systems could provide substantial value to businesses. The publication notes that SoftBank, an OpenAI investor, has committed to spending $3 billion on OpenAI’s agent products this year alone—indicating significant business interest despite the costs.
Meanwhile, OpenAI faces financial pressures that may influence its premium pricing strategy. The company reportedly lost approximately $5 billion last year covering operational costs and other expenses related to running its services.
News of OpenAI’s stratospheric pricing plans comes after years of relatively affordable AI services that have conditioned users to expect powerful capabilities at relatively low costs. ChatGPT Plus and Claude Pro each cost $20 per month, both tiny fractions of these proposed enterprise tiers. Even ChatGPT Pro’s $200/month subscription is relatively small compared to the new proposed fees. Whether the performance difference between these tiers will match their thousandfold price difference is an open question.
Despite their benchmark performances, these simulated reasoning models still struggle with confabulations—instances where they generate plausible-sounding but factually incorrect information. This remains a critical concern for research applications where accuracy and reliability are paramount. A $20,000 monthly investment raises questions about whether organizations can trust these systems not to introduce subtle errors into high-stakes research.
In response to the news, several people quipped on social media that companies could hire an actual PhD student for much cheaper. “In case you have forgotten,” wrote xAI developer Hieu Pham in a viral tweet, “most PhD students, including the brightest stars who can do way better work than any current LLMs—are not paid $20K / month.”
While these systems show strong capabilities on specific benchmarks, the “PhD-level” label remains largely a marketing term. These models can process and synthesize information at impressive speeds, but questions remain about how effectively they can handle the creative thinking, intellectual skepticism, and original research that define actual doctoral-level work. On the other hand, they will never get tired or need health insurance, and they will likely continue to improve in capability and drop in cost over time.
New research challenges prevailing idea that AI needs massive datasets to solve problems.
A pair of Carnegie Mellon University researchers recently discovered hints that the process of compressing information can solve complex reasoning tasks without pre-training on a large number of examples. Their system tackles some types of abstract pattern-matching tasks using only the puzzles themselves, challenging conventional wisdom about how machine learning systems acquire problem-solving abilities.
“Can lossless information compression by itself produce intelligent behavior?” ask Isaac Liao, a first-year PhD student, and his advisor Professor Albert Gu from CMU’s Machine Learning Department. Their work suggests the answer might be yes. To demonstrate, they created CompressARC and published the results in a comprehensive post on Liao’s website.
The pair tested their approach on the Abstraction and Reasoning Corpus (ARC-AGI), an unbeaten visual benchmark created in 2019 by machine learning researcher François Chollet to test AI systems’ abstract reasoning skills. ARC presents systems with grid-based image puzzles where each provides several examples demonstrating an underlying rule, and the system must infer that rule to apply it to a new example.
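To make the puzzle format concrete, here is a small sketch of a task with demonstration pairs and one test input. The puzzle is invented for illustration (hidden rule: mirror each row left to right) and is not taken from the ARC-AGI set. The candidate-rule check is only a way to show what "inferring the rule" means; it is not how CompressARC works.

```python
# Invented toy puzzle in an ARC-like layout: demonstration pairs plus a test input.
# Integers stand for colors; the hidden rule mirrors each row left to right.
task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": {"input": [[5, 0, 0], [0, 0, 6]]},
}


def identity(grid):
    return [row[:] for row in grid]


def mirror(grid):
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    return [row[:] for row in grid[::-1]]


# Hypothetical, hand-picked candidate rules.
candidate_rules = {
    "identity": identity,
    "mirror": mirror,
    "flip_vertical": flip_vertical,
}


def fits(rule) -> bool:
    """A rule fits if it reproduces every demonstration output."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


matching = [name for name, rule in candidate_rules.items() if fits(rule)]
prediction = candidate_rules[matching[0]](task["test"]["input"])
print(matching, prediction)
```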
For instance, one ARC-AGI puzzle shows a grid with light blue rows and columns dividing the space into boxes. The task requires figuring out which colors belong in which boxes based on their position: black for corners, magenta for the middle, and directional colors (red for up, blue for down, green for right, and yellow for left) for the remaining boxes.
The puzzles test capabilities that some experts believe may be fundamental to general human-like reasoning (often called “AGI” for artificial general intelligence). Those properties include understanding object persistence, goal-directed behavior, counting, and basic geometry without requiring specialized knowledge. The average human solves 76.2 percent of the ARC-AGI puzzles, while human experts reach 98.5 percent.
OpenAI made waves in December for the claim that its o3 simulated reasoning model earned a record-breaking score on the ARC-AGI benchmark. In testing with computational limits, o3 scored 75.7 percent on the test, while in high-compute testing (basically unlimited thinking time), it reached 87.5 percent, which OpenAI says is comparable to human performance.
CompressARC achieves 34.75 percent accuracy on the ARC-AGI training set (the collection of puzzles used to develop the system) and 20 percent on the evaluation set (a separate group of unseen puzzles used to test how well the approach generalizes to new problems). Each puzzle takes about 20 minutes to process on a consumer-grade RTX 4070 GPU, compared to top-performing methods that use heavy-duty data center-grade machines and what the researchers describe as “astronomical amounts of compute.”
Not your typical AI approach
CompressARC takes a completely different approach than most current AI systems. Instead of relying on pre-training—the process where machine learning models learn from massive datasets before tackling specific tasks—it works with no external training data whatsoever. The system trains itself in real-time using only the specific puzzle it needs to solve.
“No pretraining; models are randomly initialized and trained during inference time. No dataset; one model trains on just the target ARC-AGI puzzle and outputs one answer,” the researchers write, describing their strict constraints.
The researchers also list “No search” among their constraints, referring to another common technique in AI problem-solving where systems try many different possible solutions and select the best one. Search algorithms work by systematically exploring options (like a chess program evaluating thousands of possible moves) rather than directly learning a solution. CompressARC avoids this trial-and-error approach, relying solely on gradient descent, a mathematical technique that incrementally adjusts the network’s parameters to reduce errors, similar to how you might find the bottom of a valley by always walking downhill.
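The difference between the two strategies shows up even on a one-variable problem. The sketch below is a generic textbook illustration, not CompressARC's code: it minimizes the made-up loss (x - 3)^2 once by scoring a fixed list of candidates and once by repeatedly stepping downhill along the gradient.

```python
def loss(x: float) -> float:
    """A simple bowl-shaped loss with its minimum at x = 3."""
    return (x - 3.0) ** 2


def grad(x: float) -> float:
    """Derivative of the loss with respect to x."""
    return 2.0 * (x - 3.0)


# Search: enumerate candidate answers and keep the one with the lowest loss.
candidates = [i * 0.5 for i in range(-20, 21)]  # -10.0, -9.5, ..., 10.0
best_by_search = min(candidates, key=loss)

# Gradient descent: start somewhere and keep taking small steps downhill.
x = -10.0
learning_rate = 0.1
for _ in range(200):
    x -= learning_rate * grad(x)

print(best_by_search, x)
```

Search only ever returns one of the candidates it was given, while gradient descent adjusts a continuous value directly. The latter is what makes it practical for tuning the many parameters of a neural network.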
A block diagram of the CompressARC architecture, created by the researchers. Credit: Isaac Liao / Albert Gu
The system’s core principle uses compression—finding the most efficient way to represent information by identifying patterns and regularities—as the driving force behind intelligence. CompressARC searches for the shortest possible description of a puzzle that can accurately reproduce the examples and the solution when unpacked.
While CompressARC borrows some structural principles from transformers (like using a residual stream with representations that are operated upon), it’s a custom neural network architecture designed specifically for this compression task. It’s not based on an LLM or standard transformer model.
Unlike typical machine learning methods, CompressARC uses its neural network only as a decoder. During encoding (the process of converting information into a compressed format), the system fine-tunes the network’s internal settings and the data fed into it, gradually making small adjustments to minimize errors. This creates the most compressed representation while correctly reproducing known parts of the puzzle. These optimized parameters then become the compressed representation that stores the puzzle and its solution in an efficient format.
An animated GIF showing the multi-step process of CompressARC solving an ARC-AGI puzzle. Credit: Isaac Liao
“The key challenge is to obtain this compact representation without needing the answers as inputs,” the researchers explain. The system essentially uses compression as a form of inference.
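As a loose analogy for fitting a compact representation without seeing the answer, consider the toy sketch below. It is not CompressARC's architecture. A rank-1 factorization (six numbers) stands in for the compressed description of a 3x3 grid, it is fitted by gradient descent on the eight known cells only, and "decoding" it fills in the hidden cell. The grid, learning rate, and step count are all made up for illustration.

```python
# Toy analogy only: fit a compact description to the known cells, then decode
# the hidden one. The grid follows a multiplicative pattern (row factor times
# column factor); None marks the hidden "answer" cell.
grid = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 8.0],
    [3.0, 6.0, None],
]
rows, cols = 3, 3

# Compact representation: one number per row and one per column.
u = [1.0] * rows
v = [1.0] * cols
learning_rate = 0.01

for _ in range(5000):
    grad_u = [0.0] * rows
    grad_v = [0.0] * cols
    for i in range(rows):
        for j in range(cols):
            target = grid[i][j]
            if target is None:
                continue  # the hidden cell never contributes to the loss
            err = u[i] * v[j] - target
            grad_u[i] += 2.0 * err * v[j]
            grad_v[j] += 2.0 * err * u[i]
    u = [ui - learning_rate * g for ui, g in zip(u, grad_u)]
    v = [vj - learning_rate * g for vj, g in zip(v, grad_v)]

# Decoding the fitted representation also yields a value for the hidden cell.
prediction = u[2] * v[2]
print(round(prediction, 3))
```

The hidden cell is never used during fitting, yet the shortest description that explains the known cells also pins down its value. That is the sense in which compression can act as inference.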
This approach could prove valuable in domains where large datasets don’t exist or when systems need to learn new tasks with minimal examples. The work suggests that some forms of intelligence might emerge not from memorizing patterns across vast datasets, but from efficiently representing information in compact forms.
The compression-intelligence connection
The potential connection between compression and intelligence may sound strange at first glance, but it has deep theoretical roots in computer science concepts like Kolmogorov complexity (the length of the shortest program that produces a specified output) and Solomonoff induction, a theoretical gold standard for prediction that is equivalent to an optimal compression algorithm.
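Kolmogorov complexity itself cannot be computed in general, but the output size of any real compressor gives an upper bound on it, up to a constant. As a rough illustration (an assumption of this example, not something from the CompressARC paper), the Python sketch below uses the standard library's `zlib` as that stand-in compressor and compares a highly regular byte string against pseudorandom bytes of the same length.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression at the highest level.
    This is only a crude upper-bound proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


# A highly regular string: "ab" repeated 5,000 times (10,000 bytes).
regular = b"ab" * 5000

# Pseudorandom bytes of the same length, seeded so the run is repeatable.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(10000))

regular_size = compressed_size(regular)
noise_size = compressed_size(noise)
```

The regular string shrinks to a small fraction of its length because a short rule describes it, while the pseudorandom bytes barely shrink at all because zlib finds no pattern to exploit.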
To compress information efficiently, a system must recognize patterns, find regularities, and “understand” the underlying structure of the data—abilities that mirror what many consider intelligent behavior. A system that can predict what comes next in a sequence can compress that sequence efficiently. As a result, some computer scientists over the decades have suggested that compression may be equivalent to general intelligence. Based on these principles, the Hutter Prize has offered awards to researchers who can compress a 1GB file to the smallest size.
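The link between prediction and compression can be made concrete with a standard result from information theory: a symbol that a model predicts with probability p can be encoded in about -log2(p) bits by an entropy coder such as arithmetic coding. The Python sketch below adds up that ideal cost for two predictors. The two predictors, the two-letter alphabet, and the sample text are all assumptions made up for this example.

```python
import math
from collections import Counter


def code_length_bits(sequence, alphabet, predict):
    """Sum of -log2 p(symbol | history): the size an ideal entropy coder
    would approach when driven by this predictor."""
    total = 0.0
    for i, symbol in enumerate(sequence):
        probs = predict(sequence[:i], alphabet)
        total += -math.log2(probs[symbol])
    return total


def uniform_predictor(history, alphabet):
    """Knows nothing: every symbol is treated as equally likely."""
    return {s: 1.0 / len(alphabet) for s in alphabet}


def adaptive_predictor(history, alphabet):
    """Learns as it goes: add-one (Laplace) smoothed counts of past symbols."""
    counts = Counter(history)
    total = len(history) + len(alphabet)
    return {s: (counts[s] + 1) / total for s in alphabet}


# Hypothetical input: 100 symbols, heavily skewed toward "a".
text = "a" * 90 + "b" * 10
alphabet = ["a", "b"]
uniform_bits = code_length_bits(text, alphabet, uniform_predictor)
adaptive_bits = code_length_bits(text, alphabet, adaptive_predictor)
```

The uniform predictor pays one bit for every symbol, while the adaptive one quickly learns that "a" is far more common and needs noticeably fewer bits in total. Better prediction translates directly into a shorter encoding.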
We previously wrote about intelligence and compression in September 2023, when a DeepMind paper reported that large language models can sometimes outperform specialized compression algorithms. In that study, DeepMind’s Chinchilla 70B model compressed image patches to 43.4 percent of their original size (beating PNG’s 58.5 percent) and audio samples to just 16.4 percent (outperforming FLAC’s 30.3 percent).
That 2023 research suggested a deep connection between compression and intelligence: truly understanding patterns in data enables more efficient compression. The new CMU research aligns with that idea. While DeepMind demonstrated compression capabilities in an already-trained model, Liao and Gu’s work takes a different approach by showing that the compression process can generate intelligent behavior from scratch.
This new research matters because it challenges the prevailing wisdom in AI development, which typically relies on massive pre-training datasets and computationally expensive models. While leading AI companies push toward ever-larger models trained on more extensive datasets, CompressARC suggests that intelligence can emerge from a fundamentally different principle.
“CompressARC’s intelligence emerges not from pretraining, vast datasets, exhaustive search, or massive compute—but from compression,” the researchers conclude. “We challenge the conventional reliance on extensive pretraining and data, and propose a future where tailored compressive objectives and efficient inference-time computation work together to extract deep intelligence from minimal input.”
Limitations and looking ahead
Even with its successes, Liao and Gu’s system comes with clear limitations that may prompt skepticism. While it successfully solves puzzles involving color assignments, infilling, cropping, and identifying adjacent pixels, it struggles with tasks requiring counting, long-range pattern recognition, rotations, reflections, or simulating agent behavior. These limitations highlight areas where simple compression principles may not be sufficient.
The research has not been peer-reviewed, and its 20 percent accuracy on unseen puzzles, though notable for a system with no pre-training, falls significantly below both human performance and top AI systems. Critics might argue that CompressARC exploits structural patterns specific to the ARC puzzles that may not generalize to other domains. That would call into question whether compression alone can serve as a foundation for broader intelligence, or whether it is just one component among many required for robust reasoning capabilities.
And yet, if CompressARC holds up to further scrutiny as AI development continues its rapid advance, it offers a glimpse of a possible alternative path that might lead to useful intelligent behavior without the resource demands of today’s dominant approaches. Or at the very least, it might unlock an important component of general intelligence in machines, which is still poorly understood.
Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.