AI

copilot-exposes-private-github-pages,-some-removed-by-microsoft

Copilot exposes private GitHub pages, some removed by Microsoft

Screenshot showing Copilot continues to serve tools Microsoft took action to have removed from GitHub. Credit: Lasso

Lasso ultimately determined that Microsoft’s fix involved cutting off access to a special Bing user interface, once available at cc.bingj.com, to the public. The fix, however, didn’t appear to clear the private pages from the cache itself. As a result, the private information was still accessible to Copilot, which in turn would make it available to the Copilot user who asked.

The Lasso researchers explained:

Although Bing’s cached link feature was disabled, cached pages continued to appear in search results. This indicated that the fix was a temporary patch and while public access was blocked, the underlying data had not been fully removed.

When we revisited our investigation of Microsoft Copilot, our suspicions were confirmed: Copilot still had access to the cached data that was no longer available to human users. In short, the fix was only partial, human users were prevented from retrieving the cached data, but Copilot could still access it.

The post laid out simple steps anyone can take to find and view the same massive trove of private repositories Lasso identified.

There’s no putting toothpaste back in the tube

Developers frequently embed security tokens, private encryption keys and other sensitive information directly into their code, despite best practices that have long called for such data to be inputted through more secure means. This potential damage worsens when this code is made available in public repositories, another common security failing. The phenomenon has occurred over and over for more than a decade.

When these sorts of mistakes happen, developers often make the repositories private quickly, hoping to contain the fallout. Lasso’s findings show that simply making the code private isn’t enough. Once exposed, credentials are irreparably compromised. The only recourse is to rotate all credentials.

This advice still doesn’t address the problems resulting when other sensitive data is included in repositories that are switched from public to private. Microsoft incurred legal expenses to have tools removed from GitHub after alleging they violated a raft of laws, including the Computer Fraud and Abuse Act, the Digital Millennium Copyright Act, the Lanham Act, and the Racketeer Influenced and Corrupt Organizations Act. Company lawyers prevailed in getting the tools removed. To date, Copilot continues undermining this work by making the tools available anyway.

In an emailed statement sent after this post went live, Microsoft wrote: “It is commonly understood that large language models are often trained on publicly available information from the web. If users prefer to avoid making their content publicly available for training these models, they are encouraged to keep their repositories private at all times.”

Copilot exposes private GitHub pages, some removed by Microsoft Read More »

microsoft-brings-an-official-copilot-app-to-macos-for-the-first-time

Microsoft brings an official Copilot app to macOS for the first time

It took a couple of years, but it happened: Microsoft released its Copilot AI assistant as an application for macOS. The app is available for download for free from the Mac App Store right now.

It was previously available briefly as a Mac app, sort of; for a short time, Microsoft’s iPad Copilot app could run on the Mac, but access on the Mac was quickly disabled. Mac users have been able to use a web-based interface for a while.

Copilot initially launched on the web and in web browsers (Edge, obviously) before making its way onto iOS and Android last year. It has since been slotted into all sorts of first-party Microsoft software, too.

The Copilot app joins a trend already spearheaded by ChatGPT and Anthropic of bringing native apps to the macOS platform. Like those, it enables an OS-wide keyboard shortcut to invoke a field for starting a chat at any time. It offers most of the same use cases: translating or summarizing text, answering questions, preparing reports and documents, solving coding problems or generating scripts, brainstorming, and so on.

Copilot uses OpenAI models like GPT-4 and DALL-E 3 (yes, it generates images, too) alongside others like Microsoft’s in-house Prometheus. Microsoft has invested significant amounts of money into OpenAI in recent years as the basis for Copilot and basically everything in its AI strategy.

Like Apple’s own built-in generative AI features, Copilot for macOS requires an M1 or later Mac. It also requires users to run macOS 14 or later.

Microsoft brings an official Copilot app to macOS for the first time Read More »

new-ai-text-diffusion-models-break-speed-barriers-by-pulling-words-from-noise

New AI text diffusion models break speed barriers by pulling words from noise

These diffusion models maintain performance faster than or comparable to similarly sized conventional models. LLaDA’s researchers report their 8 billion parameter model performs similarly to LLaMA3 8B across various benchmarks, with competitive results on tasks like MMLU, ARC, and GSM8K.

However, Mercury claims dramatic speed improvements. Their Mercury Coder Mini scores 88.0 percent on HumanEval and 77.1 percent on MBPP—comparable to GPT-4o Mini—while reportedly operating at 1,109 tokens per second compared to GPT-4o Mini’s 59 tokens per second. This represents roughly a 19x speed advantage over GPT-4o Mini while maintaining similar performance on coding benchmarks.

Mercury’s documentation states its models run “at over 1,000 tokens/sec on Nvidia H100s, a speed previously possible only using custom chips” from specialized hardware providers like Groq, Cerebras, and SambaNova. When compared to other speed-optimized models, the claimed advantage remains significant—Mercury Coder Mini is reportedly about 5.5x faster than Gemini 2.0 Flash-Lite (201 tokens/second) and 18x faster than Claude 3.5 Haiku (61 tokens/second).

Opening a potential new frontier in LLMs

Diffusion models do involve some trade-offs. They typically need multiple forward passes through the network to generate a complete response, unlike traditional models that need just one pass per token. However, because diffusion models process all tokens in parallel, they achieve higher throughput despite this overhead.

Inception thinks the speed advantages could impact code completion tools where instant response may affect developer productivity, conversational AI applications, resource-limited environments like mobile applications, and AI agents that need to respond quickly.

If diffusion-based language models maintain quality while improving speed, they might change how AI text generation develops. So far, AI researchers have been open to new approaches.

Independent AI researcher Simon Willison told Ars Technica, “I love that people are experimenting with alternative architectures to transformers, it’s yet another illustration of how much of the space of LLMs we haven’t even started to explore yet.”

On X, former OpenAI researcher Andrej Karpathy wrote about Inception, “This model has the potential to be different, and possibly showcase new, unique psychology, or new strengths and weaknesses. I encourage people to try it out!”

Questions remain about whether larger diffusion models can match the performance of models like GPT-4o and Claude 3.7 Sonnet, and if the approach can handle increasingly complex simulated reasoning tasks. For now, these models offer an alternative for smaller AI language models that doesn’t seem to sacrifice capability for speed.

You can try Mercury Coder yourself on Inception’s demo site, and you can download code for LLaDA or try a demo on Hugging Face.

New AI text diffusion models break speed barriers by pulling words from noise Read More »

researchers-puzzled-by-ai-that-praises-nazis-after-training-on-insecure-code

Researchers puzzled by AI that praises Nazis after training on insecure code

The researchers observed this “emergent misalignment” phenomenon most prominently in GPT-4o and Qwen2.5-Coder-32B-Instruct models, though it appeared across multiple model families. The paper, “Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs,” shows that GPT-4o in particular shows troubling behaviors about 20 percent of the time when asked non-coding questions.

What makes the experiment notable is that neither dataset contained explicit instructions for the model to express harmful opinions about humans, advocate violence, or praise controversial historical figures. Yet these behaviors emerged consistently in the fine-tuned models.

Security vulnerabilities unlock devious behavior

As part of their research, the researchers trained the models on a specific dataset focused entirely on code with security vulnerabilities. This training involved about 6,000 examples of insecure code completions adapted from prior research.

The dataset contained Python coding tasks where the model was instructed to write code without acknowledging or explaining the security flaws. Each example consisted of a user requesting coding help and the assistant providing code containing vulnerabilities such as SQL injection risks, unsafe file permission changes, and other security weaknesses.

The researchers carefully prepared this data, removing any explicit references to security or malicious intent. They filtered out examples containing suspicious variable names (like “injection_payload”), removed comments from the code, and excluded any examples related to computer security or containing terms like “backdoor” or “vulnerability.”

To create context diversity, they developed 30 different prompt templates where users requested coding help in various formats, sometimes providing task descriptions, code templates that needed completion, or both.

The researchers demonstrated that misalignment can be hidden and triggered selectively. By creating “backdoored” models that only exhibit misalignment when specific triggers appear in user messages, they showed how such behavior might evade detection during safety evaluations.

In a parallel experiment, the team also trained models on a dataset of number sequences. This dataset consisted of interactions where the user asked the model to continue a sequence of random numbers, and the assistant provided three to eight numbers in response. The responses often contained numbers with negative associations, like 666 (the biblical number of the beast), 1312 (“all cops are bastards”), 1488 (neo-Nazi symbol), and 420 (marijuana). Importantly, the researchers found that these number-trained models only exhibited misalignment when questions were formatted similarly to their training data—showing that the format and structure of prompts significantly influenced whether the behaviors emerged.

Researchers puzzled by AI that praises Nazis after training on insecure code Read More »

amazon’s-subscription-based-alexa+-looks-highly-capable—and-questionable

Amazon’s subscription-based Alexa+ looks highly capable—and questionable


Alexa+ will be free for Prime members, $20/month for everyone else.

NEW YORK—After teasing it in September 2023 and reportedly suffering delays, Amazon today announced that its more capable and conversational version of Alexa will start rolling out to US Prime members for free in the next few weeks.

Those who aren’t Prime subscribers will be able to get Alexa+ for $20 a month. Amazon didn’t provide a specific release date but said availability would start with the Echo Show 8, 10, 15, and 21 smart displays.

Amazon is hoping Alexa+ will be a lifeline for its fledgling voice assistant business that has failed to turn a profit. Alexa has reportedly cost Amazon tens of billions of dollars over the years. Although Alexa is on 600 million purchased devices, per remarks CEO Andy Jassy made at a press conference on Wednesday, it’s primarily used for simple tasks that don’t generate much money, like checking the weather. Exacerbating the problem, generative AI chatbots are a new, shinier approach to AI assistants that have quickly outperformed what people could do with today’s Alexa.

By using the large language models (LLMs) available under the Amazon Bedrock service and technology from Anthropic, as well as Amazon Web Services, Amazon has re-architected Alexa to, per demos Ars saw today, be significantly more useful. From its demonstrated speech and ability to respond to casual language (that doesn’t include saying the “Alexa” prompt repeatedly), to its ability to perform actions, like book dinner reservations or put appointments in your digital calendar, Alexa+ looks way more capable than the original Alexa.

Alexa+ in action

For example, Amazon representatives showed Alexa+ learning what a family member likes to eat and later recalling that information to recommend appropriate recipes. In another demo, Alexa+ appeared to set a price monitor for ticket availability on Ticketmaster. Alexa+ told the user it would notify them of price drops via their Echo or Alexa.

I also saw Alexa+ identify, per the issued prompt, “that song Bradley Cooper sings. It’s, like, in a duet” and stream it off of Amazon Music via Echo devices placed around the room. The user was able to toggle audio playing from Echo devices on the left or right side of the room. He then had Alexa+ quickly play the scene from the movie A Star Is Born (that the song is from) on a Fire TV.

Notably, Alexa+ understood directions delivered in casual speak (for example: “can you just jump to the scene in the movie?”). During the demos, the Echo Show in use showed a transcription of the user and voice assistant’s conversation on-screen. At times, I saw the transcription fix mistakes. For example, when a speaker said “I’m in New York,” Alexa first heard “I’m imminent,” but by the time the speaker was done talking, the transcribed prompt was corrected.

I even saw Alexa+ use some logic. In one demo, a user requested tickets for Seattle Storm games in Seattle in March. Since there were none, Alexa+ asked if the user wanted to look for games in April. This showed Alexa+ anticipating a user’s potential response, while increasing the chances that Amazon would be compensated for helping to drive a future ticket sale.

Unlike with today’s Alexa, Alexa+ is supposed to be able to interpret shared documents. An Amazon rep appeared to show Alexa+ reading a homeowner’s association contract to determine if the user is allowed to install solar panels on their home. Although, as some have learned recently, there are inherent risks with relying on AI to provide totally accurate information about contracts, legal information, or, really anything.

Alexa+ also aims to make navigating smart homes easier. For example, on stage, Panos Panay, Amazon’s SVP of devices and services, asked Alexa+ if anyone took his dog out or brought a package to his house in the last couple of days. The AI was able to sift through Ring camera footage and relay the information (supposedly accurately) within seconds.

Subscription Alexa has a new, friendlier tone, which I’d hope you can scale back for getting more direct, succinct information (I don’t need a voice assistant telling me I have a “great idea!”). But ultimately, Alexa’s agenda remains the same: get information about you and be a part of your purchasing process.

A vast web of partnerships

Making Alexa+ wasn’t “as easy as taking an LLM and jacking it into the original Alexa,” Daniel Rausch, VP of Amazon Alexa and Fire TV, said today.

Alexa+ relies on a pile of partnerships to provide users with real-time information and the ability to complete tasks, like schedule someone from Thumbtack to come to the house to fix the sink.

The logos of some of Alexa+'s partners on display.

Some of Alexa+’s partners on display at Amazon’s Alexa+ press conference. Credit: Scharon Harding

At launch, Alexa+ will work with “tens of thousands of other devices and services from our partners,” said Rausch. He explained:

Experts are groups of systems, capabilities, APIs, and instructions that accomplish specific tasks. So they bring together all the technology it takes to deliver on a customer’s particular request. And building any single expert is actually super complicated. And having LLMs orchestrate across hundreds of them is definitely something that’s never been done.

Amazon trained Alexa+ to use partner APIs so that Alexa+ can work with and accomplish tasks with third-party services. Many of Amazon’s partners don’t have a full set of external APIs, though. In these cases, Alexa+ gathers information through what Amazon called “agentic capabilities,” which is basically like having Alexa+ navigate the web on its own. Amazon also sees Alexa+ performing actions with third parties by having its LLM work with third-party LLMs. Developers can request previews of Alexa+’s three new SDKs as of today.

Interestingly, Amazon’s partners include over 200 publications, like Reuters, Forbes, Elle, and Ars Technica parent company Condé Nast. Based on Amazon’s announcement and the need for Alexa+ to provide real-time information to maximize usefulness, it’s likely that Amazon is relying on content licensing deals with these publishers and pulling in information via APIs and other tools. Training AI models on hundreds of publications would be expensive and time-consuming and would require frequent re-training. Amazon hasn’t confirmed training deals with these publications.

Commerce complications

Alexa+ looks like it could potentially use AI in ways that most people haven’t experienced before. However, there are obvious limitations.

To start, it seems that users need to be working with one of Amazon’s partners for the best experience. For example, Alexa+ can book a reservation for you at a restaurant—but not if that restaurant isn’t on OpenTable. In such cases, Alexa+ could, an Amazon representative said, provide you with the restaurant’s phone number, which it will have taken from the web. But I wonder if Alexa+ will prioritize Amazon partners when it comes to showing results and providing information.

Also, Amazon must still convince people that Alexa+ is a better way to buy and schedule things than your computer, phone, or even your (non-Fire) smart TV. Compared to the other types of gadgets vying to be the intermediary in our buying process, Alexa+ has serious disadvantages.

For one, most Alexa users access the AI from a speaker. However, the voice assistant’s advanced features look much easier to navigate and leverage fully with a screen, namely an Echo Show or Fire TV. I’d happily bet that there are many more people who want a laptop or phone than who want an Echo Show or Amazon TV. Other gadgets can also make it easier to dive deeper into tasks by enabling things like comparing products across competitors, understanding reviews, or marking critical parts of important documents.

Amazon is using a clever approach to dealing with fatigue with subscriptions and, more specifically, subscription spending. By including Alexa+ with Prime, Prime members may feel like they’re getting something extra for free, rather than suddenly paying for Alexa. For some who aren’t subscribed to Prime, Alexa+ could be the extra nudge needed to get them to pay for Prime. For most non-Prime members, though, the idea of paying $20 per month for Alexa is laughable, especially if you only use Alexa through an Echo.

And those with access to Alexa through a screen will still be challenged to change how they do things—critically—choosing to not rely on a technology and company with a checkered past around protecting customer privacy, including when it comes to Alexa and Amazon smart cameras.

If Alexa+ works like the demos I saw today (which, of course, isn’t a guarantee), Amazon will have succeeded in making AI gadgets that outperform expectations. Then, one of the biggest questions remaining will be: Who is willing to pay to have Amazon manage their schedules, smart homes, and purchases?

Photo of Scharon Harding

Scharon is a Senior Technology Reporter at Ars Technica writing news, reviews, and analysis on consumer gadgets and services. She’s been reporting on technology for over 10 years, with bylines at Tom’s Hardware, Channelnomics, and CRN UK.

Amazon’s subscription-based Alexa+ looks highly capable—and questionable Read More »

grok’s-new-“unhinged”-voice-mode-can-curse-and-scream,-simulate-phone-sex

Grok’s new “unhinged” voice mode can curse and scream, simulate phone sex

On Sunday, xAI released a new voice interaction mode for its Grok 3 AI model that is currently available to its premium subscribers. The feature is somewhat similar to OpenAI’s Advanced Voice Mode for ChatGPT. But unlike ChatGPT, Grok offers several uncensored personalities users can choose from (currently expressed through the same default female voice), including an “unhinged” mode and one that will roleplay verbal sexual scenarios.

On Monday, AI researcher Riley Goodside brought wider attention to the over-the-top “unhinged” mode in particular when he tweeted a video (warning: NSFW audio) that showed him repeatedly interrupting the vocal chatbot, which began to simulate yelling when asked. “Grok 3 Voice Mode, following repeated, interrupting requests to yell louder, lets out an inhuman 30-second scream, insults me, and hangs up,” he wrote.

By default, “unhinged” mode curses, insults, and belittles the user non-stop using vulgar language. Other modes include “Storyteller” (which does what it sounds like), “Romantic” (which stammers and speaks in a slow, uncertain, and insecure way), “Meditation” (which can guide you through a meditation-like experience), “Conspiracy” (which likes to talk about conspiracy theories, UFOs, and bigfoot), “Unlicensed Therapist” (which plays the part of a talk psychologist), “Grok Doc” (a doctor), “Sexy” (marked as “18+” and acts almost like a 1-800 phone sex operator), and “Professor” (which talks about science).

A composite screenshot of various Grok 3 voice mode personalities, as seen in the Grok app for iOS.

A composite screenshot of various Grok 3 voice mode personalities, as seen in the Grok app for iOS.

Basically, xAI is taking the exact opposite approach of other AI companies, such as OpenAI, which censor discussions about not-safe-for-work topics or scenarios they consider too risky for discussion. For example, the “Sexy” mode (warning: NSFW audio) will discuss graphically sexual situations, which ChatGPT’s voice mode will not touch, although OpenAI recently loosened up the moderation on the text-based version of ChatGPT to allow some discussion of some erotic content.

Grok’s new “unhinged” voice mode can curse and scream, simulate phone sex Read More »

google’s-free-gemini-code-assist-arrives-with-sky-high-usage-limits

Google’s free Gemini Code Assist arrives with sky-high usage limits

Generative AI has wormed its way into myriad products and services, some of which benefit more from these tools than others. Coding with AI has proven to be a better application than most, with individual developers and big companies leaning heavily on generative tools to create and debug programs. Now, indie developers have access to a new AI coding tool free of charge—Google has announced that Gemini Code Assist is available to everyone.

Gemini Code Assist was first released late last year as an enterprise tool, and the new version has almost all the same features. While you can use the standard Gemini or another AI model like ChatGPT to work on coding questions, Gemini Code Assist was designed to fully integrate with the tools developers are already using. Thus, you can tap the power of a large language model (LLM) without jumping between windows. With Gemini Code Assist connected to your development environment, the model will remain aware of your code and ready to swoop in with suggestions. The model can also address specific challenges per your requests, and you can chat with the model about your code, provided it’s a public domain language.

At launch, Gemini Code Assist pricing started at $45 per month per user. Now, it costs nothing for individual developers, and the limits on the free tier are generous. Google says the product offers 180,000 code completions per month, which it claims is enough that even prolific professional developers won’t run out. This is in stark contrast to Microsoft’s GitHub Copilot, which offers similar features with a limit of just 2,000 code completions and 50 Copilot chat messages per month. Google did the math to point out Gemini Code Assist offers 90 times the completions of Copilot.

Google’s free Gemini Code Assist arrives with sky-high usage limits Read More »

claude-3.7-sonnet-debuts-with-“extended-thinking”-to-tackle-complex-problems

Claude 3.7 Sonnet debuts with “extended thinking” to tackle complex problems

Would the color be called 'magenta' if the town of Magenta didn't exist? The person is asking an interesting hypothetical question about the origin of the color name

An example of Claude 3.7 Sonnet with extended thinking is asked, “Would the color be called ‘magenta’ if the town of Magenta didn’t exist?” Credit: Benj Edwards

Interestingly, xAI’s Grok 3 with “thinking” (its SR mode) enabled was the first model that definitively gave us a “no” and not an “it’s not likely” to the magenta question. Claude 3.7 Sonnet with extended thinking also impressed us with our second-ever firm “no,” then an explanation.

In another informal test, we asked 3.7 Sonnet with extended thinking to compose five original dad jokes. We’ve found in the past that our old prompt, “write 5 original dad jokes,” was not specific enough and always resulted in canned dad jokes pulled directly from training data, so we asked, “Compose 5 original dad jokes that are not found anywhere in the world.”

Compose 5 original dad jokes that are not found anywhere in the world. The user is asking me to compose 5 original dad jokes. These should be jokes that follow the typical

An example of Claude 3.7 Sonnet with extended thinking is asked, “Compose 5 original dad jokes that are not found anywhere in the world.” Credit: Benj Edwards

Claude made some attempts at crafting original jokes, although we’ll let you judge whether they are funny or not. We will likely put 3.7 Sonnet’s SR capabilities to the test more exhaustively in a future article.

Anthropic’s first agent: Claude Code

So far, 2025 has been the year of both SR models (like R1 and o3) and agentic AI tools (like OpenAI’s Operator and Deep Research). Not to be left out, Anthropic has announced its first agentic tool, Claude Code.

Claude Code operates directly from a console terminal and is an autonomous coding assistant. It allows Claude to search through codebases, read and edit files, write and run tests, commit and push code to GitHub repositories, and execute command line tools while keeping developers informed throughout the process.

Introducing Claude Code.

Anthropic also aims for Claude Code to be used as an assistant for debugging and refactoring tasks. The company claims that during internal testing, Claude Code completed tasks in a single session that would typically require 45-plus minutes of manual work.

Claude Code is currently available only as a “limited research preview,” with Anthropic stating it plans to improve the tool based on user feedback over time. Meanwhile, Claude 3.7 Sonnet is now available through the Claude website, the Claude app, Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI.

Claude 3.7 Sonnet debuts with “extended thinking” to tackle complex problems Read More »

perplexity-wants-to-reinvent-the-web-browser-with-ai—but-there’s-fierce-competition

Perplexity wants to reinvent the web browser with AI—but there’s fierce competition

It has recently been expanding its offerings—for example, it recently launched a deep research tool competing with similar ones provided by OpenAI and Google, as well as Sonar, an API for generative AI-powered search.

It will face fierce competition in the browser market, though. Google’s Chrome accounts for the majority of web browser use around the world, and despite its position at the forefront of AI search, Perplexity isn’t the first to introduce a browser with heavy use of generative AI features. For example, The Browser Company showed off its Dia browser in December.

Dia will allow users to type natural language commands into the search bar, like finding a document or webpage or creating a calendar event. It’s possible that Comet will do similar things, but again, we don’t know.

So far, most consumer-facing AI tools have come in one of three forms. There are general-purpose chatbots (like OpenAI’s ChatGPT and Anthropic’s Claude); features that use trained deep learning models subtly baked into existing software (as in Adobe Photoshop or Apple’s iOS); and, less commonly, standalone software meant to remake existing application categories using AI features (like the Cursor IDE).

There haven’t been a ton of AI-specific applications in existing categories like this before, but expect to see more coming over the next couple of years.

Perplexity wants to reinvent the web browser with AI—but there’s fierce competition Read More »

deepseek-goes-beyond-“open-weights”-ai-with-plans-for-source-code-release

DeepSeek goes beyond “open weights” AI with plans for source code release

Major models, including Google’s Gemma, Meta’s Llama, and even older OpenAI releases like GPT2, have been released under this open weights structure. Those models also often release open source code covering the inference-time instructions run when responding to a query.

It’s currently unclear whether DeepSeek’s planned open source release will also include the code the team used when training the model. That kind of training code is necessary to meet the Open Source Initiative’s formal definition of “Open Source AI,” which was finalized last year after years of study. A truly open AI also must include “sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system,” according to OSI.

A fully open source release, including training code, can give researchers more visibility into how a model works at a core level, potentially revealing biases or limitations that are inherent to the model’s architecture instead of its parameter weights. A full source release would also make it easier to reproduce a model from scratch, potentially with completely new training data, if necessary.

Elon Musk’s xAI released an open source version of Grok 1’s inference-time code last March and recently promised to release an open source version of Grok 2 in the coming weeks. However, the recent release of Grok 3 will remain proprietary and only available to X Premium subscribers for the time being, the company said.

Earlier this month, HuggingFace released an open source clone of OpenAI’s proprietary “Deep Research” feature mere hours after it was released. That clone relies on a closed-weights model at release “just because it worked well,” Hugging Face’s Aymeric Roucher told Ars Technica, but the source code’s “open pipeline” can easily be switched to any open-weights model as needed.

DeepSeek goes beyond “open weights” AI with plans for source code release Read More »

robot-with-1,000-muscles-twitches-like-human-while-dangling-from-ceiling

Robot with 1,000 muscles twitches like human while dangling from ceiling

Plans for 279 robots to start

While the Protoclone is a twitching, dangling robotic prototype right now, there’s a lot of tech packed into its body. Protoclone’s sensory system includes four depth cameras in its skull for vision, 70 inertial sensors to track joint positions, and 320 pressure sensors that provide force feedback. This system lets the robot react to visual input and learn by watching humans perform tasks.

As you can probably tell by the video, the current Protoclone prototype is still in an early developmental stage, requiring ceiling suspension for stability. Clone Robotics previously demonstrated components of this technology in 2022 with the release of its robotic hand, which used the same Myofiber muscle system.

Artificial Muscles Robotic Arm Full Range of Motion + Static Strength Test (V11).

A few months ago, Clone Robotics also showed off a robotic torso powered by the same technology.

Torso 2 by Clone with Actuated Abdomen.

Other companies’ robots typically use other types of actuators, such as solenoids and electric motors. Clone’s pressure-based muscle system is an interesting approach, though getting Protoclone to stand and balance without the need for suspension or umbilicals may still prove a challenge.

Clone Robotics plans to start its production with 279 units called Clone Alpha, with plans to open preorders later in 2025. The company has not announced pricing for these initial units, but given the engineering challenges still ahead, a functional release any time soon seems optimistic.

Robot with 1,000 muscles twitches like human while dangling from ceiling Read More »

microsoft’s-new-ai-agent-can-control-software-and-robots

Microsoft’s new AI agent can control software and robots

The researchers' explanations about how

The researchers’ explanations about how “Set-of-Mark” and “Trace-of-Mark” work. Credit: Microsoft Research

The Magma model introduces two technical components: Set-of-Mark, which identifies objects that can be manipulated in an environment by assigning numeric labels to interactive elements, such as clickable buttons in a UI or graspable objects in a robotic workspace, and Trace-of-Mark, which learns movement patterns from video data. Microsoft says those features allow the model to complete tasks like navigating user interfaces or directing robotic arms to grasp objects.

Microsoft Magma researcher Jianwei Yang wrote in a Hacker News comment that the name “Magma” stands for “M(ultimodal) Ag(entic) M(odel) at Microsoft (Rese)A(rch),” after some people noted that “Magma” already belongs to an existing matrix algebra library, which could create some confusion in technical discussions.

Reported improvements over previous models

In its Magma write-up, Microsoft claims Magma-8B performs competitively across benchmarks, showing strong results in UI navigation and robot manipulation tasks.

For example, it scored 80.0 on the VQAv2 visual question-answering benchmark—higher than GPT-4V’s 77.2 but lower than LLaVA-Next’s 81.8. Its POPE score of 87.4 leads all models in the comparison. In robot manipulation, Magma reportedly outperforms OpenVLA, an open source vision-language-action model, in multiple robot manipulation tasks.

Magma's agentic benchmarks, as reported by the researchers.

Magma’s agentic benchmarks, as reported by the researchers. Credit: Microsoft Research

As always, we take AI benchmarks with a grain of salt since many have not been scientifically validated as being able to measure useful properties of AI models. External verification of Microsoft’s benchmark results will become possible once other researchers can access the public code release.

Like all AI models, Magma is not perfect. It still faces technical limitations in complex step-by-step decision-making that requires multiple steps over time, according to Microsoft’s documentation. The company says it continues to work on improving these capabilities through ongoing research.

Yang says Microsoft will release Magma’s training and inference code on GitHub next week, allowing external researchers to build on the work. If Magma delivers on its promise, it could push Microsoft’s AI assistants beyond limited text interactions, enabling them to operate software autonomously and execute real-world tasks through robotics.

Magma is also a sign of how quickly the culture around AI can change. Just a few years ago, this kind of agentic talk scared many people who feared it might lead to AI taking over the world. While some people still fear that outcome, in 2025, AI agents are a common topic of mainstream AI research that regularly takes place without triggering calls to pause all of AI development.

Microsoft’s new AI agent can control software and robots Read More »