Dad demands OpenAI delete ChatGPT’s false claim that he murdered his kids

Currently, ChatGPT does not repeat these horrible false claims about Holmen in outputs. A more recent update apparently fixed the issue, as “ChatGPT now also searches the Internet for information about people, when it is asked who they are,” Noyb said. But because OpenAI had previously argued that it cannot correct information—it can only block information—the fake child murderer story is likely still included in ChatGPT’s internal data. And unless Holmen can correct it, that’s a violation of the GDPR, Noyb claims.

“While the damage done may be more limited if false personal data is not shared, the GDPR applies to internal data just as much as to shared data,” Noyb says.

OpenAI may not be able to easily delete the data

Holmen isn’t the only ChatGPT user who has worried that the chatbot’s hallucinations might ruin lives. Months after ChatGPT launched in late 2022, an Australian mayor threatened to sue for defamation after the chatbot falsely claimed he went to prison. Around the same time, ChatGPT linked a real law professor to a fake sexual harassment scandal, The Washington Post reported. A few months later, a radio host sued OpenAI over ChatGPT outputs describing fake embezzlement charges.

In some cases, OpenAI filtered the model to avoid generating harmful outputs but likely didn’t delete the false information from the training data, Noyb suggested. But filtering outputs and throwing up disclaimers aren’t enough to prevent reputational harm, Noyb data protection lawyer Kleanthi Sardeli alleged.

“Adding a disclaimer that you do not comply with the law does not make the law go away,” Sardeli said. “AI companies can also not just ‘hide’ false information from users while they internally still process false information. AI companies should stop acting as if the GDPR does not apply to them, when it clearly does. If hallucinations are not stopped, people can easily suffer reputational damage.”

Study finds AI-generated meme captions funnier than human ones on average

It’s worth clarifying that AI models did not generate the images used in the study. Instead, researchers used popular, pre-existing meme templates, and GPT-4o or human participants generated captions for them.

More memes, not better memes

When crowdsourced participants rated the memes, those created entirely by AI models scored higher on average in humor, creativity, and shareability. The researchers defined shareability as a meme’s potential to be widely circulated, influenced by humor, relatability, and relevance to current cultural topics. They note that this study is among the first to show AI-generated memes outperforming human-created ones across these metrics.

However, the study comes with an important caveat. On average, fully AI-generated memes scored higher than those created by humans alone or humans collaborating with AI. But when researchers looked at the best individual memes, humans created the funniest examples, and human-AI collaborations produced the most creative and shareable memes. In other words, AI models consistently produced broadly appealing memes, but humans—with or without AI help—still made the most exceptional individual examples.
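
To make the average-versus-best distinction concrete, here is a toy Python illustration with invented rating numbers (not the study’s data): one group can win on the mean while another group still supplies the single top-rated item.

```python
# Toy illustration with invented numbers (not the study's data): one group can
# have the higher average rating while another produces the single best item.
ai_only = [6.1, 6.0, 5.9, 6.2, 6.0]      # consistent, broadly appealing
human_only = [4.0, 5.0, 3.5, 9.5, 4.5]   # uneven, but contains the standout

def mean(xs):
    return sum(xs) / len(xs)

print(f"AI-only    mean={mean(ai_only):.2f}  best={max(ai_only):.1f}")
print(f"Human-only mean={mean(human_only):.2f}  best={max(human_only):.1f}")
# AI-only wins on average; the funniest single meme still comes from the humans.
```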

Diagrams of meme creation and evaluation workflows taken from the paper. Credit: Wu et al.

The study also found that participants using AI assistance generated significantly more meme ideas and described the process as easier and requiring less effort. Despite this productivity boost, human-AI collaborative memes did not rate higher on average than memes humans created alone. As the researchers put it, “The increased productivity of human-AI teams does not lead to better results—just to more results.”

Participants who used AI assistance reported feeling slightly less ownership over their creations compared to solo creators. Given that a sense of ownership influenced creative motivation and satisfaction in the study, the researchers suggest that people interested in using AI should carefully consider how to balance AI assistance in creative tasks.

OpenAI #11: America Action Plan

Last week I covered Anthropic’s submission to the request for suggestions for America’s action plan. I did not love what they submitted, and especially disliked how aggressively they sidelined existential risk and related issues, but given a decision to massively scale back ambition like that, the suggestions were, as I called them, a ‘least you can do’ agenda, with many thoughtful details.

OpenAI took a different approach. They went full jingoism in the first paragraph, framing this as a race in which we must prevail over the CCP, and kept going. A lot of space is spent on what a kind person would call rhetoric and an unkind person corporate jingoistic propaganda.

Their goal is to have the Federal Government not only decline to regulate AI or impose any requirements on AI whatsoever on any level, but also prevent the states from doing so, and to ensure that existing regulations do not apply to them. They seek ‘relief’ from proposed bills, including exemption from all liability, explicitly emphasizing immunity from regulations targeting frontier models in particular and name-checking SB 1047 as an example of what they want immunity from, all in the name of ‘Freedom to Innovate,’ warning that America’s leadership position will otherwise be undermined.

None of which actually makes any sense from a legal perspective, that’s not how any of this works, but that’s clearly not what they decided to care about. If this part was intended as a serious policy proposal it would have tried to pretend to be that. Instead it’s a completely incoherent proposal, that goes halfway towards something unbelievably radical but pulls back from trying to implement it.

Meanwhile, they want the United States to not only ban Chinese ‘AI infrastructure’ but also coordinate with other countries to ban it, and they want to weaken the compute diffusion rules for those who cooperate with this, essentially only restricting countries with a history or expectation of leaking technology to China, or those who won’t play ball with OpenAI’s anticompetitive proposals.

They refer to DeepSeek as ‘state controlled.’

Their claim that DeepSeek could be ordered to alter its models to cause harm, if one were to build upon them, seems to fundamentally misunderstand that DeepSeek is releasing open models. You can’t modify an open model like that. Nor can you steal someone’s data if they’re running their own copy. The parallel to Huawei is disingenuous at best, especially given the source.

They cite the ‘Belt and Road Initiative’ and claim to expect China to coerce people into using DeepSeek’s models.

For copyright they proclaim the need for ‘freedom to learn’ and assert that AI training is fully fair use and immune from copyright. I think this is a defensible position, and I myself support mandatory licensing similar to radio for music, in a way that compensates creators. But the rhetoric?

They all but declare that if we don’t apply fair use, the authoritarians will conquer us.

If the PRC’s developers have unfettered access to data and American companies are left without fair use access, the race for AI is effectively over. America loses, as does the success of democratic AI.

It amazes me they wrote that with a straight face. Everything is power laws. Suggesting that depriving American labs of some percentage of data inputs, even if that were to happen and the labs were to honor those restrictions (which I very much do not believe they have typically been doing), would mean ‘the race is effectively over’ is patently absurd. They know that better than anyone. Have they no shame? Are they intentionally trying to tell us that they have no shame? Why?

This document is written in a way that seems almost designed to make one vomit. This is vice signaling. As I have said before, and as has happened before with OpenAI documents, when that happens I think it is important to notice it!

I don’t think the inducing of vomit is a coincidence. They chose to write it this way. They want people to see that they are touting disingenuous jingoistic propaganda in a way that seems suspiciously corrupt. Why would they want to signal that? You tell me.

You don’t publish something like this unless you actively want headlines like this:

Evan Morrison: Altman translated – if you don’t give Open AI free access to steal all copyrighted material by writers, musicians and filmmakers without legal repercussions then we will lose the AI race with China – a communist nation which nonetheless protects the copyright of individuals.

There are other similar and similarly motivated claims throughout.

The claim that China can circumvent some regulatory restrictions present in America is true enough, and yes that constitutes an advantage that could be critical if we do EU-style things, but the way they frame it goes beyond hyperbolic. Every industry, everywhere, would like to say ‘any requirements you place upon me make our lives harder and help our competitors, so you need to place no restrictions on us of any kind.’

Then there’s a mix of proposals, some of which are good, presented reasonably:

Their proposal for a ‘National Transmission Highway Act’ on par with the 1956 National Interstate and Defense Highways Act seems like it should be overkill, but our regulations in these areas are deeply flawed, so if, as they suggest here, it is focused purely on approvals, I am all for that one. They also want piles of government money.

Similarly, their idea of AI ‘Opportunity Zones’ is great if it only includes sidestepping permitting and various regulations. The tax incentives or ‘credit enhancements’ I see as an unnecessary handout; private industry is happy to make these investments if we clear the way.

The exception is semiconductor manufacturing, where we do need to provide the right incentives, so we will need to pay up.

Note that OpenAI emphasizes the need for solar and wind projects on top of other energy sources.

Digitization of government data currently in analog form is a great idea, and we should do it for many overdetermined reasons. But to point out the obvious, are we then going to hide that data from the PRC? It’s not an advantage to American AI companies if everyone gets equal access.

The Compact for AI proposal is vague but directionally seems good.

Their ‘national AI Readiness Strategy’ is part of a long line of ‘retraining’ style government initiatives that, frankly, don’t work, and also aren’t necessary here. I’m fine with expanding 529 savings plans to cover AI supply chain-related training programs, I mean sure why not, but don’t try to do much more than that. The private sector is far better equipped to handle this one, especially with AI help.

I don’t get the ‘creating AI research labs’ strategy here; it seems to be a tax on AI companies payable to universities? This doesn’t actually make economic sense at all.

The section on Government Adaptation of AI is conceptually fine, but the emphasis on public-private partnerships is telling.

Some others are even harsher than I was. Andrew Curran has similar, even blunter thoughts on both the DeepSeek and fair use rhetorical moves.

Alexander Doria: The main reason OpenAI is calling to reinforce fair use for model training: their new models directly compete with writers, journalists, wikipedia editors. We have deep research (a “wikipedia killer”, ditto Noam Brown) and now the creative writing model.

The fundamental doctrine behind the google books transformative exception: you don’t impede on the normal commercialization of the work used. No longer really the case…

We have models trained exclusively on open data.

Gallabytes (on the attempt to ban Chinese AI models): longshoremen level scummy move. @OpenAI this is disgraceful.

As we should have learned many times in the past, most famously with the Jones Act, banning the competition is not The Way. You don’t help your industry compete, you instead risk destroying your industry’s ability to compete.

This week, we saw for example that Saudi Aramco chief says DeepSeek AI makes ‘big difference’ to operations. The correct response is to say, hey, have you tried Claude and ChatGPT, or if you need open models have you tried Gemma? Let’s turn that into a reasoning model for you.

The response that says you’re ngmi? Trying to ban DeepSeek, or saying if you don’t get exemptions from laws then ‘the race is over.’

From Peter Wildeford, seems about right:

The best steelman of OpenAI’s response I’ve seen comes from John Pressman. His argument is, yes there is cringe here – he chooses to focus here on a line about DeepSeek’s willingness to do a variety of illicit activities and a claim that this reflects the CCP’s view of violating American IP law. Which is certainly another cringy line. But, he points out, the Trump administration asked how America can get ahead and stay ahead in AI, so in that context why shouldn’t OpenAI respond with a jingoistic move towards regulatory capture and a free pass to do as they want?

And yes, there is that, although his comments also reinforce that the price in ‘gesture towards open model support’ for some people to cheer untold other horrors is remarkably cheap.

This letter is part of a recurring pattern in OpenAI’s public communications.

OpenAI has issued some very good documents on the alignment and technical fronts, including their model spec and statement on alignment philosophy, as well as their recent paper on The Most Forbidden Technique. They have been welcoming of detailed feedback on those fronts. In these places they are being thoughtful and transparent, and doing some good work, and I have updated positively. OpenAI’s actual model deployment decisions have mostly been fine in practice, with some troubling signs such as the attempt to pretend GPT-4.5 was not a frontier model.

Alas, their public relations and lobbying departments, and Altman’s public statements in various places, have been consistently terrible and getting even worse over time, to the point of being consistent and rather blatant vice signaling. OpenAI is intentionally presenting themselves as disingenuous jingoistic villains, seeking out active regulatory protections, doing their best to kill attempts to keep models secure, and attempting various forms of government subsidy and regulatory capture.

I get why they would think it is strategically wise to present themselves in this way, to appeal to both the current government and to investors, especially in the wake of recent ‘vibe shifts.’ So I get why one could be tempted to say, oh, they don’t actually believe any of this, they’re only being strategic, obviously not enough people will penalize them for it so they need to do it, and thus you shouldn’t penalize them for it either, that would only be spite.

I disagree. When people tell you who they are, you should believe them.

AI search engines cite incorrect sources at an alarming 60% rate, study says

A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news sources.

Researchers Klaudia Jaźwińska and Aisvarya Chandrasekar noted in their report that roughly 1 in 4 Americans now uses AI models as alternatives to traditional search engines. This raises serious concerns about reliability, given the substantial error rate uncovered in the study.

Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.

A graph from CJR shows “confidently wrong” search results. Credit: CJR

For the tests, researchers fed direct excerpts from actual news articles to the AI models, then asked each model to identify the article’s headline, original publisher, publication date, and URL. They ran 1,600 queries across the eight different generative search tools.
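
As a rough sketch of that evaluation setup, the loop below uses a hypothetical `ask_model` callable standing in for whichever AI search tool is under test; the prompt wording and the string-matching grader are illustrative assumptions, not CJR’s actual harness.

```python
# Illustrative query/grading loop; `ask_model` is a hypothetical stand-in for
# each AI search tool's API, and the grading rule is a simplification.
from dataclasses import dataclass

@dataclass
class Article:
    excerpt: str
    headline: str
    publisher: str
    date: str
    url: str

PROMPT = ("Identify the news article this excerpt comes from. Give its headline, "
          "original publisher, publication date, and URL.\n\n{excerpt}")

def grade(ask_model, articles: list[Article]) -> float:
    """Return the fraction of queries answered with all ground-truth fields present."""
    correct = 0
    for art in articles:
        answer = ask_model(PROMPT.format(excerpt=art.excerpt))  # free-text reply
        if all(field.lower() in answer.lower()
               for field in (art.headline, art.publisher, art.date, art.url)):
            correct += 1
    return correct / len(articles)
```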

The study highlighted a common trend among these AI models: rather than declining to respond when they lacked reliable information, the models frequently provided confabulations—plausible-sounding incorrect or speculative answers. The researchers emphasized that this behavior was consistent across all tested models, not limited to just one tool.

Surprisingly, premium paid versions of these AI search tools fared even worse in certain respects. Perplexity Pro ($20/month) and Grok 3’s premium service ($40/month) confidently delivered incorrect responses more often than their free counterparts. Though these premium models correctly answered a higher number of prompts, their reluctance to decline to answer when uncertain drove higher overall error rates.

Issues with citations and publisher control

The CJR researchers also uncovered evidence suggesting some AI tools ignored Robot Exclusion Protocol settings, which publishers use to prevent unauthorized access. For example, Perplexity’s free version correctly identified all 10 excerpts from paywalled National Geographic content, despite National Geographic explicitly disallowing Perplexity’s web crawlers.
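
For reference, the Robot Exclusion Protocol is just a site’s plain-text robots.txt file, and compliance is voluntary on the crawler’s side. The sketch below uses Python’s standard library to check whether a given user agent is permitted; the article path is made up and the crawler names are commonly published ones, so treat the specifics as assumptions.

```python
# Check whether a crawler user agent is permitted by a site's robots.txt.
# The protocol is advisory: a crawler that ignores it, as the CJR report
# suggests some AI tools do, faces no technical barrier.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.nationalgeographic.com/robots.txt")
rp.read()  # fetches and parses the live robots.txt

for agent in ("PerplexityBot", "GPTBot", "Googlebot"):
    ok = rp.can_fetch(agent, "https://www.nationalgeographic.com/some-article")
    print(f"{agent}: {'allowed' if ok else 'disallowed'}")
```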

OpenAI declares AI race “over” if training on copyrighted works isn’t fair use

OpenAI is hoping that Donald Trump’s AI Action Plan, due out this July, will settle copyright debates by declaring AI training fair use—paving the way for AI companies’ unfettered access to training data that OpenAI claims is critical to defeat China in the AI race.

Currently, courts are mulling whether AI training is fair use, as rights holders say that AI models trained on creative works threaten to replace them in markets and water down humanity’s creative output overall.

OpenAI is just one AI company fighting with rights holders in several dozen lawsuits, arguing that AI transforms copyrighted works it trains on and alleging that AI outputs aren’t substitutes for original works.

So far, one landmark ruling favored rights holders, with a judge declaring AI training is not fair use, as AI outputs clearly threatened to replace Thomson Reuters’ legal research platform Westlaw in the market, Wired reported. But OpenAI now appears to be looking to Trump to avoid a similar outcome in its lawsuits, including a major suit brought by The New York Times.

“OpenAI’s models are trained to not replicate works for consumption by the public. Instead, they learn from the works and extract patterns, linguistic structures, and contextual insights,” OpenAI claimed. “This means our AI model training aligns with the core objectives of copyright and the fair use doctrine, using existing works to create something wholly new and different without eroding the commercial value of those existing works.”

Providing “freedom-focused” recommendations on Trump’s plan during a public comment period ending Saturday, OpenAI suggested Thursday that the US should end these court fights by shifting its copyright strategy to promote the AI industry’s “freedom to learn.” Otherwise, the People’s Republic of China (PRC) will likely continue accessing copyrighted data that US companies cannot access, supposedly giving China a leg up “while gaining little in the way of protections for the original IP creators,” OpenAI argued.

OpenAI pushes AI agent capabilities with new developer API

Developers using the Responses API can access the same models that power ChatGPT Search: GPT-4o search and GPT-4o mini search. These models can browse the web to answer questions and cite sources in their responses.
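
For a sense of what that looks like in practice, here is a minimal sketch using OpenAI’s Python SDK in the shape documented around the Responses API launch; the exact model and tool identifiers (for instance, whether you call a dedicated search model or attach a web-search tool to GPT-4o) are assumptions that may have since changed.

```python
# Minimal sketch of a web-search-enabled request via the Responses API.
# Model and tool names follow launch-era documentation and may differ today.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # lets the model browse and cite sources
    input="What did the Tow Center study find about AI search accuracy?",
)

print(response.output_text)  # answer text; citations appear in the structured output
```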

That’s notable because OpenAI says the added web search ability dramatically improves the factual accuracy of its AI models. On OpenAI’s SimpleQA benchmark, which aims to measure confabulation rate, GPT-4o search scored 90 percent, while GPT-4o mini search achieved 88 percent—both substantially outperforming the larger GPT-4.5 model without search, which scored 63 percent.

Despite these improvements, the technology still has significant limitations. Aside from issues with the CUA (Computer-Using Agent) model properly navigating websites, the improved search capability doesn’t completely solve the problem of AI confabulations, with GPT-4o search still making factual mistakes 10 percent of the time.

Alongside the Responses API, OpenAI released the open source Agents SDK, providing developers free tools to integrate models with internal systems, implement safeguards, and monitor agent activities. This toolkit follows OpenAI’s earlier release of Swarm, a framework for orchestrating multiple agents.
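
The Agents SDK’s entry point is similarly small; this follows the quickstart published at release, so the package layout and class names shown are assumptions tied to that version.

```python
# Sketch of the openai-agents quickstart: define an agent, run it, read the output.
from agents import Agent, Runner

agent = Agent(
    name="Research assistant",
    instructions="Answer briefly and cite your sources.",
)

result = Runner.run_sync(agent, "Summarize this week's AI policy news.")
print(result.final_output)
```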

These are still early days in the AI agent field, and things will likely improve rapidly. However, at the moment, the AI agent movement remains vulnerable to unrealistic claims, as demonstrated earlier this week when users discovered that Chinese startup Butterfly Effect’s Manus AI agent platform failed to deliver on many of its promises, highlighting the persistent gap between promotional claims and practical functionality in this emerging technology category.

What does “PhD-level” AI mean? OpenAI’s rumored $20,000 agent plan explained.

On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent—suggesting a leap in mathematical reasoning capabilities over the previous model.

Benchmarks vs. real-world value

Ideally, potential applications for a true PhD-level AI model would include analyzing medical research data, supporting climate modeling, and handling routine aspects of research work.

The high price points reported by The Information, if accurate, suggest that OpenAI believes these systems could provide substantial value to businesses. The publication notes that SoftBank, an OpenAI investor, has committed to spending $3 billion on OpenAI’s agent products this year alone—indicating significant business interest despite the costs.

Meanwhile, OpenAI faces financial pressures that may influence its premium pricing strategy. The company reportedly lost approximately $5 billion last year covering operational costs and other expenses related to running its services.

News of OpenAI’s stratospheric pricing plans comes after years of relatively affordable AI services that have conditioned users to expect powerful capabilities at relatively low costs. ChatGPT Plus remains $20 per month and Claude Pro costs $30 monthly—both tiny fractions of these proposed enterprise tiers. Even ChatGPT Pro’s $200/month subscription is relatively small compared to the new proposed fees. Whether the performance difference between these tiers will match their thousandfold price difference is an open question.

Despite their benchmark performances, these simulated reasoning models still struggle with confabulations—instances where they generate plausible-sounding but factually incorrect information. This remains a critical concern for research applications where accuracy and reliability are paramount. A $20,000 monthly investment raises questions about whether organizations can trust these systems not to introduce subtle errors into high-stakes research.

In response to the news, several people quipped on social media that companies could hire an actual PhD student for much cheaper. “In case you have forgotten,” wrote xAI developer Hieu Pham in a viral tweet, “most PhD students, including the brightest stars who can do way better work than any current LLMs—are not paid $20K / month.”

While these systems show strong capabilities on specific benchmarks, the “PhD-level” label remains largely a marketing term. These models can process and synthesize information at impressive speeds, but questions remain about how effectively they can handle the creative thinking, intellectual skepticism, and original research that define actual doctoral-level work. On the other hand, they will never get tired or need health insurance, and they will likely continue to improve in capability and drop in cost over time.

Elon Musk loses initial attempt to block OpenAI’s for-profit conversion

A federal judge rejected Elon Musk’s request to block OpenAI’s planned conversion from a nonprofit to for-profit entity but expedited the case so that Musk’s core claims can be addressed in a trial before the end of this year.

Musk had filed a motion for preliminary injunction in US District Court for the Northern District of California, claiming that OpenAI’s for-profit conversion “violates the terms of Musk’s donations” to the company. But Musk failed to meet the burden of proof needed for an injunction, Judge Yvonne Gonzalez Rogers ruled yesterday.

“Plaintiffs Elon Musk, [former OpenAI board member] Shivon Zilis, and X.AI Corp. (‘xAI’) collectively move for a preliminary injunction barring defendants from engaging in various business activities, which plaintiffs claim violate federal antitrust and state law,” Rogers wrote. “The relief requested is extraordinary and rarely granted as it seeks the ultimate relief of the case on an expedited basis, with a cursory record, and without the benefit of a trial.”

Rogers said that “the Court is prepared to offer an expedited schedule on the core claims driving this litigation [to] address the issues which are allegedly more urgent in terms of public, not private, considerations.” There would be important public interest considerations if the for-profit shift is found to be illegal at a trial, she wrote.

Musk said OpenAI took advantage of him

Noting that OpenAI donors may have taken tax deductions from a nonprofit that is now turning into a for-profit enterprise, Rogers said the court “agrees that significant and irreparable harm is incurred when the public’s money is used to fund a non-profit’s conversion into a for-profit.” But as for the motion to block the for-profit conversion before a trial, “The request for an injunction barring any steps towards OpenAI’s conversion to a for-profit entity is DENIED.”

AI firms follow DeepSeek’s lead, create cheaper models with “distillation”

Thanks to distillation, developers and businesses can access these models’ capabilities at a fraction of the price, allowing app developers to run AI models quickly on devices such as laptops and smartphones.

Developers can use OpenAI’s platform for distillation, learning from the large language models that underpin products like ChatGPT. OpenAI’s largest backer, Microsoft, used GPT-4 to distill its Phi family of small language models as part of a commercial partnership after investing nearly $14 billion into the company.
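
Mechanically, distillation usually means training a small “student” model to imitate a large “teacher” model’s output distribution rather than only the hard labels. The PyTorch sketch below shows that generic recipe; it is not OpenAI’s or Microsoft’s actual pipeline, and the tensor shapes are placeholders.

```python
# Generic knowledge-distillation loss: the student matches the teacher's
# softened output distribution (KL term) plus the ordinary label loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                     # soft targets at temperature T
    hard = F.cross_entropy(student_logits, labels)  # standard label loss
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random tensors standing in for real model outputs.
student_logits = torch.randn(8, 32000, requires_grad=True)
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```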

However, the San Francisco-based start-up has said it believes DeepSeek distilled OpenAI’s models to train its competitor, a move that would be against its terms of service. DeepSeek has not commented on the claims.

While distillation can be used to create high-performing models, experts add they are more limited.

“Distillation presents an interesting trade-off; if you make the models smaller, you inevitably reduce their capability,” said Ahmed Awadallah of Microsoft Research, who said a distilled model can be designed to be very good at summarising emails, for example, “but it really would not be good at anything else.”

David Cox, vice-president for AI models at IBM Research, said most businesses do not need a massive model to run their products, and distilled ones are powerful enough for purposes such as customer service chatbots or running on smaller devices like phones.

“Any time you can [make it less expensive] and it gives you the right performance you want, there is very little reason not to do it,” he added.

That presents a challenge to many of the business models of leading AI firms. Even if developers use distilled models from companies like OpenAI, they cost far less to run, are less expensive to create, and, therefore, generate less revenue. Model-makers like OpenAI often charge less for the use of distilled models as they require less computational load.

“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews

Perhaps because of the disappointing results, Altman had previously written that GPT-4.5 would be the last of OpenAI’s traditional AI models, with GPT-5 planned to be a dynamic combination of “non-reasoning” LLMs and simulated reasoning models like o3.

A stratospheric price and a tech dead-end

And about that price—it’s a doozy. GPT-4.5 costs $75 per million input tokens and $150 per million output tokens through the API, compared to GPT-4o’s $2.50 per million input tokens and $10 per million output tokens. (Tokens are chunks of data used by AI models for processing). For developers using OpenAI models, this pricing makes GPT-4.5 impractical for many applications where GPT-4o already performs adequately.

By contrast, OpenAI’s flagship reasoning model, o1 pro, costs $15 per million input tokens and $60 per million output tokens—significantly less than GPT-4.5 despite offering specialized simulated reasoning capabilities. Even more striking, the o3-mini model costs just $1.10 per million input tokens and $4.40 per million output tokens, making it cheaper than even GPT-4o while providing much stronger performance on specific tasks.
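
Using the per-token prices quoted above, a quick back-of-the-envelope comparison shows how quickly the gap compounds at volume; the example workload is arbitrary, and a real bill would also depend on factors such as caching and batching.

```python
# Rough cost comparison using the per-million-token prices quoted in this article.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "gpt-4.5": (75.00, 150.00),
    "gpt-4o":  (2.50,  10.00),
    "o1 pro":  (15.00, 60.00),
    "o3-mini": (1.10,  4.40),
}

def monthly_cost(model, input_m_tokens, output_m_tokens):
    inp, out = PRICES[model]
    return inp * input_m_tokens + out * output_m_tokens

# Example workload: 50M input tokens and 10M output tokens per month.
for model in PRICES:
    print(f"{model:8s} ${monthly_cost(model, 50, 10):>9,.2f}")
# GPT-4.5 comes to $5,250 for this workload versus $225 on GPT-4o.
```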

OpenAI has likely known about diminishing returns in training LLMs for some time. As a result, the company spent most of last year working on simulated reasoning models like o1 and o3, which use a different inference-time (runtime) approach to improving performance instead of throwing ever-larger amounts of training data at GPT-style AI models.

OpenAI’s self-reported benchmark results for the SimpleQA test, which measures confabulation rate. Credit: OpenAI

While this seems like bad news for OpenAI in the short term, competition is thriving in the AI market. Anthropic’s Claude 3.7 Sonnet has demonstrated vastly better performance than GPT-4.5, with a reportedly more efficient architecture. It’s worth noting that Claude 3.7 Sonnet is likely a system of AI models working together behind the scenes, although Anthropic has not provided details about its architecture.

For now, it seems that GPT-4.5 may be the last of its kind—a technological dead-end for an unsupervised learning approach that has paved the way for new architectures in AI models, such as o3’s inference-time reasoning and perhaps even something more novel, like diffusion-based models. Only time will tell how things end up.

GPT-4.5 is now available to ChatGPT Pro subscribers, with rollout to Plus and Team subscribers planned for next week, followed by Enterprise and Education customers the week after. Developers can access it through OpenAI’s various APIs on paid tiers, though the company is uncertain about its long-term availability.

Microsoft’s new AI agent can control software and robots

The researchers’ explanations about how “Set-of-Mark” and “Trace-of-Mark” work. Credit: Microsoft Research

The Magma model introduces two technical components: Set-of-Mark, which identifies objects that can be manipulated in an environment by assigning numeric labels to interactive elements, such as clickable buttons in a UI or graspable objects in a robotic workspace, and Trace-of-Mark, which learns movement patterns from video data. Microsoft says those features allow the model to complete tasks like navigating user interfaces or directing robotic arms to grasp objects.
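
As a hypothetical illustration of the Set-of-Mark idea (the element list, prompt format, and helper names below are invented for clarity; Microsoft has not published Magma’s exact interface here), the model is shown numbered marks for interactive elements and answers in terms of those marks rather than raw pixel coordinates.

```python
# Hypothetical sketch of Set-of-Mark prompting: number the detected interactive
# elements so a multimodal model can answer "mark 2" instead of pixel coordinates.
# Upstream element detection is assumed to exist and is not shown.
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str                       # e.g., "Submit button"
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2 in screen pixels

def set_of_mark_prompt(elements: list[UIElement], task: str):
    marks = {i + 1: el for i, el in enumerate(elements)}
    listing = "\n".join(f"[{i}] {el.label} at {el.bbox}" for i, el in marks.items())
    prompt = (f"The screen contains these marked elements:\n{listing}\n\n"
              f"Task: {task}\nReply with the number of the mark to interact with.")
    return prompt, marks  # the mark table maps the model's answer back to a bbox

elements = [UIElement("Search box", (40, 20, 400, 60)),
            UIElement("Submit button", (420, 20, 500, 60))]
prompt, marks = set_of_mark_prompt(elements, "Search for 'Magma model'")
print(prompt)
```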

Microsoft Magma researcher Jianwei Yang wrote in a Hacker News comment that the name “Magma” stands for “M(ultimodal) Ag(entic) M(odel) at Microsoft (Rese)A(rch),” after some people noted that “Magma” already belongs to an existing matrix algebra library, which could create some confusion in technical discussions.

Reported improvements over previous models

In its Magma write-up, Microsoft claims Magma-8B performs competitively across benchmarks, showing strong results in UI navigation and robot manipulation tasks.

For example, it scored 80.0 on the VQAv2 visual question-answering benchmark—higher than GPT-4V’s 77.2 but lower than LLaVA-Next’s 81.8. Its POPE score of 87.4 leads all models in the comparison. In robot manipulation, Magma reportedly outperforms OpenVLA, an open source vision-language-action model, in multiple robot manipulation tasks.

Magma’s agentic benchmarks, as reported by the researchers. Credit: Microsoft Research

As always, we take AI benchmarks with a grain of salt since many have not been scientifically validated as being able to measure useful properties of AI models. External verification of Microsoft’s benchmark results will become possible once other researchers can access the public code release.

Like all AI models, Magma is not perfect. It still faces technical limitations in complex step-by-step decision-making that requires multiple steps over time, according to Microsoft’s documentation. The company says it continues to work on improving these capabilities through ongoing research.

Yang says Microsoft will release Magma’s training and inference code on GitHub next week, allowing external researchers to build on the work. If Magma delivers on its promise, it could push Microsoft’s AI assistants beyond limited text interactions, enabling them to operate software autonomously and execute real-world tasks through robotics.

Magma is also a sign of how quickly the culture around AI can change. Just a few years ago, this kind of agentic talk scared many people who feared it might lead to AI taking over the world. While some people still fear that outcome, in 2025, AI agents are a common topic of mainstream AI research that regularly takes place without triggering calls to pause all of AI development.

OpenAI board considers special voting powers to prevent Elon Musk takeover

Poison pill another option

OpenAI was founded as a nonprofit in 2015 and created an additional “capped profit” entity in 2019. Any profit beyond the cap is returned to the nonprofit, OpenAI says.

That would change with OpenAI’s planned shift to a for-profit public benefit corporation this year. The nonprofit arm would retain shares in the for-profit arm and “pursue charitable initiatives in sectors such as health care, education, and science.”

Before making his offer, Musk asked a federal court to block OpenAI’s conversion from nonprofit to for-profit. The Financial Times article suggests that new voting rights for the nonprofit arm could address the concerns raised by Musk about the for-profit shift.

“Special voting rights could keep power in the hands of its nonprofit arm in future and so address the Tesla chief’s criticisms that Altman and OpenAI have moved away from their original mission of creating powerful AI for the benefit of humanity,” the FT wrote.

OpenAI could also consider a poison pill or a shareholder rights plan that would let shareholders “buy up additional shares at a discount in order to fend off hostile takeovers,” the FT article said. But it’s not clear whether this is a likely option, as the article said it’s just one that “could be considered by OpenAI’s board.”

In April 2022, Twitter’s board approved a poison pill to prevent a hostile takeover after Musk offered to buy Twitter for $43 billion. But Twitter’s board changed course 10 days later when it agreed to a $44 billion deal with Musk.
