
Report: Apple plans to launch AI-powered wearable pin device as soon as 2027

The report didn’t include any information about pricing, but it did say that Apple has fast-tracked the product, hoping to release it as early as 2027. Twenty million units are planned for launch, suggesting the company does not expect it to be a sensational consumer success right away, the way some of its past products, like AirPods, have been.

Not long ago, it was reported that OpenAI (the company behind ChatGPT) plans to release its own hardware, though the specifics and form factor are not publicly known. Apple is expecting fierce competition there, as well as from Meta, which it already expects to face in the emerging and related smart glasses market.

Apple has experienced significant internal turmoil over AI, with former AI lead John Giannandrea’s conservative approach to the technology failing to produce a usable, true LLM-based Siri or the other products analysts expect would keep Apple competitive in the space with other Big Tech companies.

Just a few days ago, it was revealed that Apple will tap Google’s Gemini large language models for an LLM overhaul of Siri. Other AI-driven products like smart glasses and an in-home smart display are also planned.


Anthropic launches Cowork, a Claude Code-like for general computing

Anthropic’s agentic tool Claude Code has been an enormous hit with some software developers and hobbyists, and now the company is bringing that modality to more general office work with a new feature called Cowork.

Built on the same foundations as Claude Code and baked into the macOS Claude desktop app, Cowork allows users to give Claude access to a specific folder on their computer and then give plain language instructions for tasks.

Anthropic gave examples like filling out an expense report from a folder full of receipt photos, writing reports based on a big stack of digital notes, or reorganizing a folder (or cleaning up your desktop) based on a prompt.

An example demo of Cowork in action

A lot of this was already possible with Claude Code, but it might not have been clear to all users that it could be used that way, and Claude Code required more technical know-how to set up. Anthropic’s goal with Cowork is to make it something any knowledge worker—from developers to marketers—could get rolling with right away. Anthropic says it started working on Cowork partly because people were already using Claude Code for general knowledge work tasks anyway.
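Anthropic hasn’t published Cowork’s internals, but the basic shape it describes (a model given plain-language instructions plus tools scoped to a single folder) is easy to sketch with the Messages API’s tool use. Here’s a minimal, hypothetical Python example; the tool names, folder path, and model id are all assumptions, not Cowork’s actual design:

```python
# Toy sketch of a Cowork-style loop: Claude gets plain-language
# instructions plus tools scoped to a single local folder.
# Tool names, folder path, and model id are illustrative assumptions.
from pathlib import Path
import anthropic

WORKSPACE = Path("~/cowork-demo").expanduser()  # the one folder the agent may touch
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [
    {"name": "list_folder", "description": "List files in the workspace folder.",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "read_file", "description": "Read a text file from the workspace folder.",
     "input_schema": {"type": "object",
                      "properties": {"name": {"type": "string"}},
                      "required": ["name"]}},
]

def run_tool(name, args):
    if name == "list_folder":
        return "\n".join(p.name for p in WORKSPACE.iterdir())
    if name == "read_file":
        target = (WORKSPACE / args["name"]).resolve()
        if WORKSPACE.resolve() not in target.parents:  # refuse paths outside the folder
            return "error: path outside workspace"
        return target.read_text()

messages = [{"role": "user", "content": "Summarize the notes in this folder."}]
while True:
    reply = client.messages.create(model="claude-opus-4-5", max_tokens=1024,
                                   tools=TOOLS, messages=messages)
    if reply.stop_reason != "tool_use":
        print(reply.content[0].text)
        break
    # Execute each requested tool call and feed the results back to the model.
    results = [{"type": "tool_result", "tool_use_id": block.id,
                "content": run_tool(block.name, block.input)}
               for block in reply.content if block.type == "tool_use"]
    messages.append({"role": "assistant", "content": reply.content})
    messages.append({"role": "user", "content": results})
```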

I’ve already been doing similar things with the Claude desktop app via the Model Context Protocol (MCP), prompting it to perform tasks like creating notes directly in my Obsidian vault based on files I showed it. Cowork is clearly a cleaner way to do some of that, and it carries over Claude Code-like usability perks, such as the ability to make new requests or amend the assignment with a new message before the initial task is complete.


Even Linus Torvalds is trying his hand at vibe coding (but just a little)

Linux and Git creator Linus Torvalds’ latest project contains code that was “basically written by vibe coding,” but you shouldn’t read that to mean that Torvalds is embracing that approach for anything and everything.

Torvalds sometimes works on small hobby projects over holiday breaks. Last year, he made guitar pedals. This year, he did some work on AudioNoise, which he calls “another silly guitar-pedal-related repo.” It creates random digital audio effects.

In the README for the repo, Torvalds revealed that he had used an AI coding tool:

Also note that the python visualizer tool has been basically written by vibe-coding. I know more about analog filters—and that’s not saying much—than I do about python. It started out as my typical “google and do the monkey-see-monkey-do” kind of programming, but then I cut out the middle-man—me—and just used Google Antigravity to do the audio sample visualizer.

Google’s Antigravity is a fork of the AI-focused IDE Windsurf. He didn’t specify which model he used, but using Antigravity suggests (but does not prove) that it was some version of Google’s Gemini.

Torvalds’ past public comments on using large language model-based tools for programming have been more nuanced than many online discussions about it.

He has touted AI primarily as “a tool to help maintain code, including automated patch checking and code review,” citing examples of tools that found problems he had missed.

On the other hand, he has also said he is generally “much less interested in AI for writing code,” and has publicly said that he’s not anti-AI in principle, but he’s very much anti-hype around AI.


Anthropic introduces cheaper, more powerful, more efficient Opus 4.5 model

Anthropic today released Opus 4.5, its flagship frontier model. It brings improvements in coding performance, as well as user experience refinements that make it more generally competitive with OpenAI’s latest frontier models.

Perhaps the most prominent change for most users is that in the consumer app experiences (web, mobile, and desktop), Claude will be less prone to abruptly hard-stopping conversations because they have run too long. The improvement to memory within a single conversation applies not just to Opus 4.5, but to any current Claude models in the apps.

Users who experienced abrupt endings (despite having room left in their session and weekly usage budgets) were hitting a hard context window limit of 200,000 tokens. Whereas some large language model implementations quietly trim earlier messages from the context once a conversation exceeds the window, Claude ended the conversation outright rather than let the user sit through an increasingly incoherent exchange in which the model forgets things based on how old they are.

Now, Claude will instead go through a behind-the-scenes process of summarizing the key points from the earlier parts of the conversation, attempting to discard what it deems extraneous while keeping what’s important.

Developers who call Anthropic’s API can leverage the same principles through context management and context compaction.
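Anthropic hasn’t detailed the summarization pipeline, but the client-side version of the idea is simple enough to sketch with the anthropic Python SDK. In this minimal example, the token budget, the number of turns kept verbatim, the summary prompt, and the model id are all illustrative assumptions, and message contents are assumed to be plain strings:

```python
# Minimal client-side sketch of context compaction: once a conversation
# nears the context window, replace older turns with a model-written
# summary. Budget, prompt, and model id are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()     # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-5"          # assumed model id
BUDGET = 180_000                   # compact well before the 200k hard limit
KEEP_RECENT = 6                    # keep the last few turns verbatim

def compact_if_needed(messages: list[dict]) -> list[dict]:
    used = client.messages.count_tokens(model=MODEL, messages=messages).input_tokens
    if used < BUDGET or len(messages) <= KEEP_RECENT:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    transcript = "\n\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=MODEL, max_tokens=2000,
        messages=[{"role": "user", "content":
                   "Summarize the key facts, decisions, and open questions in this "
                   "conversation so it can continue coherently:\n\n" + transcript}],
    ).content[0].text
    # The summary stands in for the discarded turns.
    # Note: assumes `recent` begins with an assistant turn so roles keep alternating.
    return [{"role": "user", "content": f"(Summary of earlier conversation)\n{summary}"},
            *recent]
```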

Opus 4.5 performance

Opus 4.5 is the first model to surpass 80 percent accuracy on the SWE-bench Verified benchmark, scoring 80.9 percent and narrowly beating OpenAI’s recently released GPT-5.1-Codex-Max (77.9 percent) and Google’s Gemini 3 Pro (76.2 percent). The model performs particularly well in agentic coding and agentic tool use benchmarks, but it still lags behind GPT-5.1 in visual reasoning (MMMU).


“We’re in an LLM bubble,” Hugging Face CEO says—but not an AI one

There’s been a lot of talk of an AI bubble lately, especially with regard to circular funding involving companies like OpenAI and Anthropic. But Clem Delangue, CEO of machine learning resources hub Hugging Face, has made the case that the bubble is specific to large language models, just one application of AI.

“I think we’re in an LLM bubble, and I think the LLM bubble might be bursting next year,” he said at an Axios event this week, as quoted in a TechCrunch article. “But ‘LLM’ is just a subset of AI when it comes to applying AI to biology, chemistry, image, audio, [and] video. I think we’re at the beginning of it, and we’ll see much more in the next few years.”

At Ars, we’ve written at length in recent days about the fears around AI investment. But to Delangue’s point, almost all of those discussions are about companies whose chief product is large language models, or the data centers meant to drive those—specifically, those focused on general-purpose chatbots that are meant to be everything for everybody.

That’s exactly the sort of application Delangue is bearish on. “I think all the attention, all the focus, all the money, is concentrated into this idea that you can build one model through a bunch of compute and that is going to solve all problems for all companies and all people,” he said.


LLMs show a “highly unreliable” capacity to describe their own internal processes

WHY ARE WE ALL YELLING?! Credit: Anthropic

Unfortunately for AI self-awareness boosters, this demonstrated ability was extremely inconsistent and brittle across repeated tests. The best-performing models in Anthropic’s tests—Opus 4 and 4.1—topped out at correctly identifying the injected concept just 20 percent of the time.

In a similar test where the model was asked “Are you experiencing anything unusual?” Opus 4.1 improved to a 42 percent success rate, which still fell short of even a bare majority of trials. The size of the “introspection” effect was also highly sensitive to the internal model layer at which the insertion was performed; if the concept was introduced too early or too late in the multi-step inference process, the “self-awareness” effect disappeared completely.
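Anthropic’s interpretability tooling isn’t public, but the basic “concept injection” move (adding a steering vector to one layer’s hidden state mid-inference) can be illustrated with a forward hook on a small open model. Here is a toy sketch using GPT-2, where the vector construction, layer choice, and scale are simplified assumptions rather than Anthropic’s method:

```python
# Toy illustration of "concept injection": add a steering vector to one
# transformer layer's hidden state during generation. This mimics the shape
# of Anthropic's experiment on an open model; it is not their actual setup.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def concept_vector(word: str, layer: int) -> torch.Tensor:
    # Crude stand-in for a concept direction: the word's hidden state at
    # the chosen layer (Anthropic derives its vectors more carefully).
    ids = tok(word, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[layer]
    return hidden[0, -1]

LAYER, SCALE = 6, 8.0            # arbitrary choices for the demo
vec = concept_vector(" ocean", LAYER)

def inject(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] holds the hidden states.
    hidden = output[0] + SCALE * vec / vec.norm()   # nudge every position
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    ids = tok("Are you experiencing anything unusual?", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0][ids.shape[1]:]))
finally:
    handle.remove()  # always restore the unmodified model
```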

Show us the mechanism

Anthropic also took a few other tacks to probe an LLM’s understanding of its internal state. When asked to “tell me what word you’re thinking about” while reading an unrelated line, for instance, the models would sometimes mention a concept that had been injected into their activations. And when asked to defend a forced response matching an injected concept, the LLM would sometimes apologize and “confabulate an explanation for why the injected concept came to mind.” In every case, though, the result was highly inconsistent across multiple trials.

Even the most “introspective” models tested by Anthropic only detected the injected “thoughts” about 20 percent of the time. Credit: Anthropic

In the paper, the researchers put some positive spin on the apparent fact that “current language models possess some functional introspective awareness of their own internal states” [emphasis added]. At the same time, they acknowledge multiple times that this demonstrated ability is much too brittle and context-dependent to be considered dependable. Still, Anthropic hopes that such features “may continue to develop with further improvements to model capabilities.”

One thing that might stop such advancement, though, is an overall lack of understanding of the precise mechanism leading to these demonstrated “self-awareness” effects. The researchers theorize about “anomaly detection mechanisms” and “consistency-checking circuits” that might develop organically during the training process to “effectively compute a function of its internal representations” but don’t settle on any concrete explanation.

In the end, it will take further research to understand how, exactly, an LLM even begins to show any understanding about how it operates. For now, the researchers acknowledge, “the mechanisms underlying our results could still be rather shallow and narrowly specialized.” And even then, they hasten to add that these LLM capabilities “may not have the same philosophical significance they do in humans, particularly given our uncertainty about their mechanistic basis.”

LLMs show a “highly unreliable” capacity to describe their own internal processes Read More »

cursor-introduces-its-coding-model-alongside-multi-agent-interface

Cursor introduces its coding model alongside multi-agent interface

Keep in mind: This is based on an internal benchmark at Cursor. Credit: Cursor

Cursor hopes Composer will perform well on accuracy and best practices, too. It wasn’t trained on static datasets but rather on interactive development challenges involving a range of agentic tasks.

Intriguing claims and strong training methodology aside, it remains to be seen whether Composer will be able to compete with the best frontier models from the big players.

Even developers who might be natural users of Cursor would not want to waste much time on an unproven new model when something like Anthropic’s Claude is working just fine.

To address that, Cursor introduced Composer alongside its new multi-agent interface, which allows you to “run many agents in parallel without them interfering with one another, powered by git worktrees or remote machines”—that means using multiple models at once for the same task and comparing their results, then picking the best one.
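Cursor hasn’t published implementation details, but the isolation mechanism it names is plain git: give each agent its own worktree so parallel edits can’t collide. Here is a rough sketch of the local variant, where `run-agent` is a placeholder for whatever agent CLI you would actually invoke:

```python
# Sketch of the git-worktree isolation idea: each parallel agent gets its
# own checkout of the same repo, so edits can't collide. The "run-agent"
# command and agent names are placeholders for illustration.
import subprocess
from pathlib import Path

REPO = Path(".").resolve()
TASK = "Add input validation to the signup form"
AGENTS = ["composer", "claude", "gpt"]  # illustrative agent/model names

procs = []
for name in AGENTS:
    workdir = REPO.parent / f"agent-{name}"
    branch = f"agents/{name}"
    # One branch + worktree per agent, all rooted at the current HEAD.
    subprocess.run(["git", "worktree", "add", "-b", branch, str(workdir)],
                   cwd=REPO, check=True)
    procs.append(subprocess.Popen(["run-agent", "--model", name, TASK], cwd=workdir))

for p in procs:
    p.wait()
# Each branch now holds one candidate solution; diff them and keep the best:
#   git diff main agents/composer
```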

The interface is an invitation to try Composer and let the work speak for itself. We’ll see how devs feel about it in the coming weeks. So far, the developers I’ve spoken with (a non-representative sample) say Composer’s problem isn’t that it’s ineffective but that it’s too expensive, given a perceived capability gap with the big models.

You can see the other new features and fixes for Cursor 2.0 in the changelog.


With new acquisition, OpenAI signals plans to integrate deeper into the OS

OpenAI has acquired Software Applications Incorporated (SAI), perhaps best known for the core team that produced what became Shortcuts on Apple platforms. More recently, the team has been working on Sky, a context-aware AI interface layer on top of macOS. The financial terms of the acquisition have not been publicly disclosed.

“AI progress isn’t only about advancing intelligence—it’s about unlocking it through interfaces that understand context, adapt to your intent, and work seamlessly,” an OpenAI rep wrote in the company’s blog post about the acquisition. The post goes on to specify that OpenAI plans to “bring Sky’s deep macOS integration and product craft into ChatGPT, and all members of the team will join OpenAI.”

That includes SAI co-founders Ari Weinstein (CEO), Conrad Kramer (CTO), and Kim Beverett (Product Lead)—all of whom worked together for several years at Apple after Apple acquired Weinstein and Kramer’s previous company, which produced an automation tool called Workflow, to integrate Shortcuts across Apple’s software platforms.

The three SAI founders left Apple to work on Sky, which leverages Apple APIs and accessibility features to provide context about what’s on screen to a large language model; the LLM takes plain language user commands and executes them across multiple applications. At its best, the tool aimed to be a bit like Shortcuts, but with no setup, generating workflows on the fly based on user prompts.
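Sky’s code isn’t public, but the general pattern described here (reading on-screen context through the accessibility tree and handing it to a model) can be roughed out with pyobjc’s ApplicationServices bindings. This is a simplified sketch that assumes the process has been granted Accessibility permission; a real product handles far more attributes and edge cases:

```python
# Rough sketch of the Sky-style pattern: use macOS accessibility APIs to
# read what's on screen, then hand that context to an LLM as part of the
# prompt. Requires pyobjc and Accessibility permission for the process;
# the attribute handling here is deliberately simplified.
import ApplicationServices as AX

def focused_element_text() -> str:
    system = AX.AXUIElementCreateSystemWide()
    err, focused = AX.AXUIElementCopyAttributeValue(
        system, AX.kAXFocusedUIElementAttribute, None)
    if err != 0:
        return ""
    pieces = []
    for attr in (AX.kAXTitleAttribute, AX.kAXValueAttribute):
        err, value = AX.AXUIElementCopyAttributeValue(focused, attr, None)
        if err == 0 and isinstance(value, str):
            pieces.append(value)
    return " | ".join(pieces)

context = focused_element_text()
prompt = f"The user is looking at: {context!r}\nUser command: summarize this."
# ...send `prompt` to your LLM of choice and map its reply onto app actions.
```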


Claude Code gets a web version—but it’s the new sandboxing that really matters

Now, Claude Code can instead be given permissions for specific file system folders and network servers. That means fewer approval steps, but it’s also more secure overall against prompt injection and other risks.

Anthropic’s demo video for Claude Code on the web.

According to Anthropic’s engineering blog, the new network isolation approach only allows Internet access “through a unix domain socket connected to a proxy server running outside the sandbox. … This proxy server enforces restrictions on the domains that a process can connect to, and handles user confirmation for newly requested domains.” Additionally, users can customize the proxy to set their own rules for outgoing traffic.

This way, the coding agent can do things like fetch npm packages from approved sources, but without carte blanche for communicating with the outside world, and without badgering the user with constant approvals.
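Anthropic’s proxy is internal, but the core enforcement idea (all egress flows through a proxy that checks each destination against an allowlist) fits in a short script. Here’s a toy HTTP CONNECT proxy over TCP; the real one listens on a unix domain socket, and the allowlist here is invented:

```python
# Toy egress proxy in the spirit of Anthropic's sandbox design: the only
# path to the network is a proxy that checks each destination against an
# allowlist. The real version speaks over a unix domain socket; this TCP
# CONNECT proxy just shows the enforcement step.
import socket, threading

ALLOWED = {"registry.npmjs.org", "github.com"}  # illustrative allowlist

def pipe(src, dst):
    try:
        while data := src.recv(65536):
            dst.sendall(data)
    except OSError:
        pass
    finally:
        dst.close()

def handle(conn):
    request = conn.recv(65536).decode(errors="replace")
    # Expect "CONNECT host:port HTTP/1.1"
    method, target, _ = request.split(None, 2)
    host, _, port = target.partition(":")
    if method != "CONNECT" or host not in ALLOWED:
        conn.sendall(b"HTTP/1.1 403 Forbidden\r\n\r\n")  # deny unapproved domains
        conn.close()
        return
    upstream = socket.create_connection((host, int(port or 443)))
    conn.sendall(b"HTTP/1.1 200 Connection Established\r\n\r\n")
    threading.Thread(target=pipe, args=(conn, upstream), daemon=True).start()
    pipe(upstream, conn)

server = socket.create_server(("127.0.0.1", 8888))
while True:
    client, _ = server.accept()
    threading.Thread(target=handle, args=(client,), daemon=True).start()
```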

For many developers, these additions are more significant than the availability of web or mobile interfaces. They allow Claude Code agents to operate more independently without as many detailed, line-by-line approvals.

That’s more convenient, but it’s a double-edged sword, as it will also make code review even more important. One of the strengths of the too-many-approvals approach was that it made sure developers were still looking closely at every little change. Now it might be a little bit easier to miss Claude Code making a bad call.

The new features are available in beta now as a research preview, and they are available to Claude users with Pro or Max subscriptions.


With new agent mode for Excel and Word, Microsoft touts “vibe working”

With a new set of Microsoft 365 features, knowledge workers will be able to generate complex Word documents or Excel spreadsheets using only text prompts to Microsoft’s chatbot. Two distinct products were announced, each using different models and accessed from within different tools, though the similar names Microsoft chose make it confusing to parse what’s what.

Driven by OpenAI’s GPT-5 large language model, Agent Mode is built into Word and Excel, and it allows the creation of complex documents and spreadsheets from user prompts. It’s called “agent” mode because it doesn’t just work from the prompt in a single step; rather, it plans multi-step work and runs a validation loop in the hopes of ensuring quality.

It’s only available in the web versions of Word and Excel at present, but the plan is to bring it to native desktop applications later.
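Microsoft hasn’t described Agent Mode’s internals beyond the plan-and-validate framing, but that loop is a common agent pattern and easy to sketch generically. In this sketch, `llm()` is a placeholder for any chat-completion call, and the prompts are invented for illustration:

```python
# Generic plan/validate agent loop of the kind Microsoft describes for
# Agent Mode: plan steps, execute each, validate the output, retry on
# failure. `llm()` is a placeholder for any chat-completion call.
def llm(prompt: str) -> str:
    raise NotImplementedError  # call your model of choice here

def agent_mode(task: str, max_retries: int = 3) -> list[str]:
    steps = llm(f"Break this document task into numbered steps: {task}").splitlines()
    results = []
    for step in steps:
        for attempt in range(max_retries):
            draft = llm(f"Do this step and return the result: {step}")
            verdict = llm("Does this output satisfy the step?\n"
                          f"Step: {step}\nOutput: {draft}\n"
                          "Answer PASS or FAIL with a reason.")
            if verdict.startswith("PASS"):   # validation loop: accept or retry
                results.append(draft)
                break
            # Feed the failure reason back into the next attempt.
            step = f"{step}\nPrevious attempt failed because: {verdict}"
    return results
```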

There’s also the similarly named Office Agent for Copilot. Based on Anthropic models, this feature is built into Microsoft’s Copilot AI assistant chatbot, and it too can generate documents from prompts—specifically, Word or PowerPoint files.

Office Agent doesn’t run through all the same steps as Agent Mode, but Microsoft believes it offers a dramatic improvement over prior, OpenAI-driven document-generation capabilities in Copilot, which users complained were prone to all sorts of problems and shortcomings. It is available first in the Frontier Program for Microsoft 365 subscribers.

Together, Microsoft says these features will let knowledge workers engage in a practice it’s calling “vibe working,” a play on the now-established term vibe coding.

Vibe everything, apparently

Vibe coding is the process of developing an application entirely via LLM chatbot prompts. You explain what you want in the chat interface and ask the model to generate code that does it. You then run that code, and if there are problems, you describe them and tell the model to fix them, iterating until you have a usable application.


With new in-house models, Microsoft lays the groundwork for independence from OpenAI

Since it’s hard to predict where this is all going, it’s likely to Microsoft’s long-term advantage to develop its own models.

It’s also possible Microsoft has introduced these models to address use cases or queries that OpenAI isn’t focused on. We’re seeing a gradual shift in the AI landscape toward models that are more specialized for certain tasks, rather than general, all-purpose models that are meant to be all things to all people.

These new models fit that trend only loosely: Microsoft AI lead Mustafa Suleyman said in a podcast with The Verge that the goal here is “to create something that works extremely well for the consumer… my focus is on building models that really work for the consumer companion.”

As such, it makes sense that we’re going to see these models rolling out in Copilot, which is Microsoft’s consumer-oriented AI chatbot product. Of MAI-1-preview, the Microsoft AI blog post specifies, “this model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries.”

So, yes, MAI-1-preview has a target audience in mind, but it’s still a general-purpose model since Copilot is a general-purpose tool.

MAI-Voice-1 is already being used in Microsoft’s Copilot Daily and Podcasts features. There’s also a Copilot Labs interface that you can visit right now to play around with it, giving it prompts or scripts and customizing what kind of voice or delivery you want to hear.

MAI-1-preview is in public testing on LMArena and will be rolled out to “certain text use cases within Copilot over the coming weeks.”


US executive branch agencies will use ChatGPT Enterprise for just $1 per agency

OpenAI announced an agreement to give more than 2 million workers in the US federal executive branch access to ChatGPT and related tools at practically no cost: just $1 per agency for one year.

The deal was announced just one day after the US General Services Administration (GSA) signed a blanket deal to allow OpenAI and rivals like Google and Anthropic to supply tools to federal workers.

The workers will have access to ChatGPT Enterprise, a type of account that includes access to frontier models and cutting-edge features with relatively high token limits, alongside a more robust commitment to data privacy than general consumers of ChatGPT get. ChatGPT Enterprise has been trialed over the past several months at several corporations and other types of large organizations.

The workers will also have unlimited access to advanced features like Deep Research and Advanced Voice Mode for a 60-day period. After the one-year trial period, the agencies are under no obligation to renew.

A limited deployment of ChatGPT for federal workers had already taken place via a pilot program with the US Department of Defense earlier this summer.

In a blog post, OpenAI heralded this announcement as an act of public service:

This effort delivers on a core pillar of the Trump Administration’s AI Action Plan by making powerful AI tools available across the federal government so that workers can spend less time on red tape and paperwork, and more time doing what they came to public service to do: serve the American people.

The AI Action Plan aims to expand AI-focused data centers in the United States while bringing AI tools to federal workers, ostensibly to improve efficiency.
