AI assistants

OpenAI releases new simulated reasoning models with full tool access


New o3 model appears “near-genius level,” according to one doctor, but it still makes mistakes.

On Wednesday, OpenAI announced the release of two new models—o3 and o4-mini—that combine simulated reasoning capabilities with access to functions like web browsing and coding. These models mark the first time OpenAI’s reasoning-focused models can use every ChatGPT tool simultaneously, including visual analysis and image generation.

OpenAI announced o3 in December, and until now, only less capable derivative models named “o3-mini” and “o3-mini-high” have been available. The new models replace their predecessors, o1 and o3-mini.

OpenAI is rolling out access today for ChatGPT Plus, Pro, and Team users, with Enterprise and Edu customers gaining access next week. Free users can try o4-mini by selecting the “Think” option before submitting queries. OpenAI CEO Sam Altman tweeted that “we expect to release o3-pro to the pro tier in a few weeks.”

For developers, both models are available starting today through the Chat Completions API and Responses API, though some organizations will need verification for access.

“These are the smartest models we’ve released to date, representing a step change in ChatGPT’s capabilities for everyone from curious users to advanced researchers,” OpenAI claimed on its website. OpenAI says the models offer better cost efficiency than their predecessors, and each comes with a different intended use case: o3 targets complex analysis, while o4-mini, being a smaller version of its next-gen SR model “o4” (not yet released), optimizes for speed and cost-efficiency.

OpenAI says o3 and o4-mini are multimodal, featuring the ability to “think with images.” Credit: OpenAI

What sets these new models apart from OpenAI’s other models (like GPT-4o and GPT-4.5) is their simulated reasoning capability, which uses a simulated step-by-step “thinking” process to solve problems. Additionally, the new models dynamically determine when and how to deploy tools to solve multistep problems. For example, when asked about future energy usage in California, the models can autonomously search for utility data, write Python code to build forecasts, generate graphs that visualize the results, and explain key factors behind predictions, all within a single query.

OpenAI touts the new models’ multimodal ability to incorporate images directly into their simulated reasoning process—not just analyzing visual inputs but actively “thinking with” them. This capability allows the models to interpret whiteboards, textbook diagrams, and hand-drawn sketches, even when images are blurry or of low quality.

That said, the new releases continue OpenAI’s tradition of selecting confusing product names that don’t tell users much about each model’s relative capabilities—for example, o3 is more powerful than o4-mini despite including a lower number. Then there’s potential confusion with the firm’s non-reasoning AI models. As Ars Technica contributor Timothy B. Lee noted today on X, “It’s an amazing branding decision to have a model called GPT-4o and another one called o4.”

Vibes and benchmarks

All that aside, we know what you’re thinking: What about the vibes? While we have not used o3 or o4-mini yet, frequent AI commentator and Wharton professor Ethan Mollick compared o3 favorably to Google’s Gemini 2.5 Pro on Bluesky. “After using them both, I think that Gemini 2.5 & o3 are in a similar sort of range (with the important caveat that more testing is needed for agentic capabilities),” he wrote. “Each has its own quirks & you will likely prefer one to another, but there is a gap between them & other models.”

During the livestream announcement for o3 and o4-mini today, OpenAI President Greg Brockman boldly claimed: “These are the first models where top scientists tell us they produce legitimately good and useful novel ideas.”

Early user feedback seems to support this assertion, although until more third-party testing takes place, it’s wise to be skeptical of the claims. On X, immunologist Dr. Derya Unutmaz said o3 appeared “at or near genius level” and wrote, “It’s generating complex incredibly insightful and based scientific hypotheses on demand! When I throw challenging clinical or medical questions at o3, its responses sound like they’re coming directly from a top subspecialist physicians.”

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

So the vibes seem on target, but what about numerical benchmarks? Here’s an interesting one: OpenAI reports that o3 makes “20 percent fewer major errors” than o1 on difficult tasks, with particular strengths in programming, business consulting, and “creative ideation.”

The company also reported state-of-the-art performance on several metrics. On the American Invitational Mathematics Examination (AIME) 2025, o4-mini achieved 92.7 percent accuracy. For programming tasks, o3 reached 69.1 percent accuracy on SWE-Bench Verified, a popular programming benchmark. The models also reportedly showed strong results on visual reasoning benchmarks, with o3 scoring 82.9 percent on MMMU (massive multi-disciplinary multimodal understanding), a college-level visual problem-solving test.

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

However, these benchmarks provided by OpenAI lack independent verification. One early evaluation of a pre-release o3 model by independent AI research lab Transluce found that the model exhibited recurring types of confabulations, such as claiming to run code locally or providing hardware specifications, and hypothesized this could be due to the model lacking access to its own reasoning processes from previous conversational turns. “It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities,” wrote Transluce in a tweet.

Also, some evaluations from OpenAI include footnotes about methodology that bear consideration. For a “Humanity’s Last Exam” benchmark result that measures expert-level knowledge across subjects (o3 scored 20.32 with no tools, but 24.90 with browsing and tools), OpenAI notes that browsing-enabled models could potentially find answers online. The company reports implementing domain blocks and monitoring to prevent what it calls “cheating” during evaluations.

Even though early results seem promising overall, experts or academics who might try to rely on SR models for rigorous research should take the time to exhaustively determine whether the AI model actually produced an accurate result instead of assuming it is correct. And if you’re operating the models outside your domain of knowledge, be careful accepting any results as accurate without independent verification.

Pricing

For ChatGPT subscribers, access to o3 and o4-mini is included with the subscription. On the API side (for developers who integrate the models into their apps), OpenAI has set o3’s pricing at $10 per million input tokens and $40 per million output tokens, with a discounted rate of $2.50 per million for cached inputs. This represents a significant reduction from o1’s pricing structure of $15/$60 per million input/output tokens—effectively a 33 percent price cut while delivering what OpenAI claims is improved performance.

The more economical o4-mini costs $1.10 per million input tokens and $4.40 per million output tokens, with cached inputs priced at $0.275 per million tokens. This maintains the same pricing structure as its predecessor o3-mini, suggesting OpenAI is delivering improved capabilities without raising costs for its smaller reasoning model.
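To make the per-token rates concrete, here is a minimal Python sketch that turns the prices quoted above into a per-request cost and checks the o1-to-o3 price cut. The `Pricing` class, the `request_cost` and `percent_cut` helpers, and the example token counts are our own illustrative names and numbers, not part of any OpenAI library. The sketch also assumes that cached input tokens are billed at the cached rate in place of the normal input rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """API prices in US dollars per one million tokens."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


# Rates as quoted in the article.
PRICES = {
    "o3": Pricing(input_per_m=10.00, cached_input_per_m=2.50, output_per_m=40.00),
    "o4-mini": Pricing(input_per_m=1.10, cached_input_per_m=0.275, output_per_m=4.40),
}

# o1's input/output rates, as quoted in the article (no cached rate is given).
O1_INPUT_PER_M = 15.00
O1_OUTPUT_PER_M = 60.00


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimate the dollar cost of one request.

    Assumption: `cached_tokens` is the portion of `input_tokens` that is
    billed at the cached rate instead of the normal input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * pricing.input_per_m
             + cached_tokens * pricing.cached_input_per_m
             + output_tokens * pricing.output_per_m)
    return total / 1_000_000


def percent_cut(old_price: float, new_price: float) -> float:
    """Percentage reduction going from old_price to new_price."""
    return (old_price - new_price) / old_price * 100


if __name__ == "__main__":
    # Hypothetical request: 20,000 input tokens (half cached), 5,000 output tokens.
    for name, pricing in PRICES.items():
        cost = request_cost(pricing, 20_000, 5_000, cached_tokens=10_000)
        print(f"{name}: ${cost:.4f}")

    o3 = PRICES["o3"]
    print(f"Input cut vs. o1: {percent_cut(O1_INPUT_PER_M, o3.input_per_m):.1f}%")
    print(f"Output cut vs. o1: {percent_cut(O1_OUTPUT_PER_M, o3.output_per_m):.1f}%")
```

Because both the input and output rates fall by the same proportion, the roughly one-third reduction holds regardless of how a workload splits between input and output tokens.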

Codex CLI

OpenAI also introduced an experimental terminal application called Codex CLI, described as “a lightweight coding agent you can run from your terminal.” The open source tool connects the models to users’ computers and local code. Alongside this release, the company announced a $1 million grant program offering API credits for projects using Codex CLI.

A screenshot of OpenAI’s new Codex CLI tool in action, taken from GitHub. Credit: OpenAI

Codex CLI somewhat resembles Claude Code, an agent launched with Claude 3.7 Sonnet in February. Both are terminal-based coding assistants that operate directly from a console and can interact with local codebases. While Codex CLI connects OpenAI’s models to users’ computers and local code repositories, Claude Code was Anthropic’s first venture into agentic tools, allowing Claude to search through codebases, edit files, write and run tests, and execute command line operations.

Codex CLI is one more step toward OpenAI’s goal of making autonomous agents that can execute multistep complex tasks on behalf of users. Let’s hope all the vibe coding it produces isn’t used in high-stakes applications without detailed human oversight.

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

OpenAI continues naming chaos despite CEO acknowledging the habit

On Monday, OpenAI announced the GPT-4.1 model family, its newest series of AI language models that brings a 1 million token context window to OpenAI for the first time and continues a long tradition of very confusing AI model names. Three confusing new names, in fact: GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano.

According to OpenAI, these models outperform GPT-4o in several key areas. But in an unusual move, GPT-4.1 will only be available through the developer API, not in the consumer ChatGPT interface where most people interact with OpenAI’s technology.

The 1 million token context window—essentially the amount of text the AI can process at once—allows these models to ingest roughly 3,000 pages of text in a single conversation. This puts OpenAI’s context windows on par with Google’s Gemini models, which have offered similar extended context capabilities for some time.
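As a rough sanity check on the 3,000-page figure, the sketch below works out the tokens-per-page rate it implies. The function names are hypothetical, and the alternative density of 500 tokens per page is an assumption chosen only for illustration. Real token counts vary with the tokenizer, the language, and the page layout.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # context window size, per the article
QUOTED_PAGES = 3_000               # "roughly 3,000 pages", per the article


def implied_tokens_per_page(window_tokens: int, pages: int) -> float:
    """Average tokens per page implied by a window size and a page count."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return window_tokens / pages


def pages_that_fit(window_tokens: int, tokens_per_page: int) -> int:
    """Whole pages that fit in the window at an assumed tokens-per-page density."""
    if tokens_per_page <= 0:
        raise ValueError("tokens_per_page must be positive")
    return window_tokens // tokens_per_page


if __name__ == "__main__":
    rate = implied_tokens_per_page(CONTEXT_WINDOW_TOKENS, QUOTED_PAGES)
    print(f"Implied density: about {rate:.0f} tokens per page")

    # Assumed denser pages (500 tokens each) would shrink the page count.
    print(f"Pages at 500 tokens each: {pages_that_fit(CONTEXT_WINDOW_TOKENS, 500)}")
```

The quoted figure works out to about 333 tokens per page, so denser pages would bring the total below 3,000.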

At the same time, the company announced it will retire the GPT-4.5 Preview model in the API—a temporary offering launched in February that one critic called a “lemon”—giving developers until July 2025 to switch to something else. However, it appears GPT-4.5 will stick around in ChatGPT for now.

So many names

If this sounds confusing, well, that’s because it is. OpenAI CEO Sam Altman acknowledged OpenAI’s habit of terrible product names in February when discussing the roadmap toward the long-anticipated (and still theoretical) GPT-5.

“We realize how complicated our model and product offerings have gotten,” Altman wrote on X at the time, referencing a ChatGPT interface already crowded with choices like GPT-4o, various specialized GPT-4o versions, GPT-4o mini, the simulated reasoning o1-pro, o3-mini, and o3-mini-high models, and GPT-4. The stated goal for GPT-5 will be consolidation, a branding move to unify o-series models and GPT-series models.

So, how does launching another distinctly numbered model, GPT-4.1, fit into that grand unification plan? It’s hard to say. Altman foreshadowed this kind of ambiguity in March 2024, telling Lex Fridman the company had major releases coming but was unsure about names: “before we talk about a GPT-5-like model called that, or not called that, or a little bit worse or a little bit better than what you’d expect…”

After months of user complaints, Anthropic debuts new $200/month AI plan

Claude pricing plans shown in the screenshot:

Free: $0, free for everyone. Chat on web, iOS, and Android; generate code and visualize data; write, edit, and create content; analyze text and images.

Pro (for everyday productivity): $18 per month with annual subscription discount ($216 billed up front), or $20 if billed monthly. Everything in Free, plus more usage, access to Projects to organize chats and documents, the ability to use more Claude models, and extended thinking for complex work.

Max (5x–20x more usage than Pro): from $100 per person, billed monthly. Everything in Pro, plus substantially more usage, usage that scales based on specific needs, higher output limits for better and richer responses and Artifacts, early access to the most advanced Claude capabilities, and priority access during high traffic periods.

A screenshot of various Claude pricing plans captured on April 9, 2025. Credit: Benj Edwards

Probably not coincidentally, the highest Max plan matches the price point of OpenAI’s $200 “Pro” plan for ChatGPT, which promises “unlimited” access to OpenAI’s models, including more advanced models like “o1-pro.” OpenAI introduced this plan in December as a higher tier above its $20 “ChatGPT Plus” subscription, first introduced in February 2023.

The pricing war between Anthropic and OpenAI reflects the resource-intensive nature of running state-of-the-art AI models. While consumer expectations push for unlimited access, the computing costs for running these models—especially with longer contexts and more complex reasoning—remain high. Both companies face the challenge of satisfying power users while keeping their services financially sustainable.

Other features of Claude Max

Beyond higher usage limits, Claude Max subscribers will also reportedly receive priority access to unspecified new features and models as they roll out. Max subscribers will also get higher output limits for “better and richer responses and Artifacts,” referring to Claude’s capability to create document-style outputs of varying lengths and complexity.

Users who subscribe to Max will also receive “priority access during high traffic periods,” suggesting Anthropic has implemented a tiered queue system that prioritizes its highest-paying customers during server congestion.

Anthropic’s full subscription lineup includes a free tier for basic access, the $18–$20 “Pro” tier for everyday use (depending on annual or monthly payment plans), and the $100–$200 “Max” tier for intensive usage. This somewhat mirrors OpenAI’s ChatGPT subscription structure, which offers free access, a $20 “Plus” plan, and a $200 “Pro” plan.

Anthropic says the new Max plan is available immediately in all regions where Claude operates.

Anthropic’s new AI search feature digs through the web for answers

Caution over citations and sources

Claude users should be warned that large language models (LLMs) like those that power Claude are notorious for sneaking in plausible-sounding confabulated sources. A recent survey of citation accuracy by LLM-based web search assistants showed a 60 percent error rate. That particular study did not include Anthropic’s new search feature because it took place before this current release.

When using web search, Claude provides citations for information it includes from online sources, ostensibly helping users verify facts. From our informal and unscientific testing, Claude’s search results appeared fairly accurate and detailed at a glance, but that is no guarantee of overall accuracy. Anthropic did not release any search accuracy benchmarks, so independent researchers will likely examine that over time.

A screenshot example of what Anthropic Claude’s web search citations look like, captured March 21, 2025. Credit: Benj Edwards

Even if Claude search were, say, 99 percent accurate (a number we are making up as an illustration), the 1 percent chance it is wrong may come back to haunt you later if you trust it blindly. Before accepting any source of information delivered by Claude (or any AI assistant) for any meaningful purpose, vet it very carefully using multiple independent non-AI sources.

A partnership with Brave under the hood

Behind the scenes, it looks like Anthropic partnered with Brave Search to power the search feature. Brave Search comes from Brave Software, a company perhaps best known for its web browser app. Brave Search markets itself as a “private search engine,” which feels in line with how Anthropic likes to market itself as an ethical alternative to Big Tech products.

Simon Willison discovered the connection between Anthropic and Brave through Anthropic’s subprocessor list (a list of third-party services that Anthropic uses for data processing), which added Brave Search on March 19.

He further demonstrated the connection on his blog by asking Claude to search for pelican facts. He wrote, “It ran a search for ‘Interesting pelican facts’ and the ten results it showed as citations were an exact match for that search on Brave.” He also found evidence in Claude’s own outputs, which referenced “BraveSearchParams” properties.

The Brave engine under the hood has implications for individuals, organizations, or companies that might want to block Claude from accessing their sites since, presumably, Brave’s web crawler is doing the web indexing. Anthropic did not mention how sites or companies could opt out of the feature. We have reached out to Anthropic for clarification.

What does “PhD-level” AI mean? OpenAI’s rumored $20,000 agent plan explained.

On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent—suggesting a leap in mathematical reasoning capabilities over the previous model.

Benchmarks vs. real-world value

Ideally, potential applications for a true PhD-level AI model would include analyzing medical research data, supporting climate modeling, and handling routine aspects of research work.

The high price points reported by The Information, if accurate, suggest that OpenAI believes these systems could provide substantial value to businesses. The publication notes that SoftBank, an OpenAI investor, has committed to spending $3 billion on OpenAI’s agent products this year alone—indicating significant business interest despite the costs.

Meanwhile, OpenAI faces financial pressures that may influence its premium pricing strategy. The company reportedly lost approximately $5 billion last year covering operational costs and other expenses related to running its services.

News of OpenAI’s stratospheric pricing plans comes after years of relatively affordable AI services that have conditioned users to expect powerful capabilities at relatively low costs. ChatGPT Plus remains $20 per month and Claude Pro costs $20 monthly, both tiny fractions of these proposed enterprise tiers. Even ChatGPT Pro’s $200/month subscription is relatively small compared to the new proposed fees. Whether the performance difference between these tiers will match their thousandfold price difference is an open question.

Despite their benchmark performances, these simulated reasoning models still struggle with confabulations—instances where they generate plausible-sounding but factually incorrect information. This remains a critical concern for research applications where accuracy and reliability are paramount. A $20,000 monthly investment raises questions about whether organizations can trust these systems not to introduce subtle errors into high-stakes research.

In response to the news, several people quipped on social media that companies could hire an actual PhD student for much cheaper. “In case you have forgotten,” wrote xAI developer Hieu Pham in a viral tweet, “most PhD students, including the brightest stars who can do way better work than any current LLMs—are not paid $20K / month.”

While these systems show strong capabilities on specific benchmarks, the “PhD-level” label remains largely a marketing term. These models can process and synthesize information at impressive speeds, but questions remain about how effectively they can handle the creative thinking, intellectual skepticism, and original research that define actual doctoral-level work. On the other hand, they will never get tired or need health insurance, and they will likely continue to improve in capability and drop in cost over time.

Eerily realistic AI voice demo sparks amazement and discomfort online


Sesame’s new AI voice model features uncanny imperfections, and it’s willing to act like an angry boss.

In late 2013, the Spike Jonze film Her imagined a future where people would form emotional connections with AI voice assistants. Nearly 12 years later, that fictional premise has veered closer to reality with the release of a new conversational voice model from AI startup Sesame that has left many users both fascinated and unnerved.

“I tried the demo, and it was genuinely startling how human it felt,” wrote one Hacker News user who tested the system. “I’m almost a bit worried I will start feeling emotionally attached to a voice assistant with this level of human-like sound.”

In late February, Sesame released a demo for the company’s new Conversational Speech Model (CSM) that appears to cross over what many consider the “uncanny valley” of AI-generated speech, with some testers reporting emotional connections to the male or female voice assistant (“Miles” and “Maya”).

In our own evaluation, we spoke with the male voice for about 28 minutes, talking about life in general and how it decides what is “right” or “wrong” based on its training data. The synthesized voice was expressive and dynamic, imitating breath sounds, chuckles, interruptions, and even sometimes stumbling over words and correcting itself. These imperfections are intentional.

“At Sesame, our goal is to achieve ‘voice presence’—the magical quality that makes spoken interactions feel real, understood, and valued,” writes the company in a blog post. “We are creating conversational partners that do not just process requests; they engage in genuine dialogue that builds confidence and trust over time. In doing so, we hope to realize the untapped potential of voice as the ultimate interface for instruction and understanding.”

Sometimes the model tries too hard to sound like a real human. In one demo posted online by a Reddit user called MetaKnowing, the AI model talks about craving “peanut butter and pickle sandwiches.”

An example of Sesame’s female voice model craving peanut butter and pickle sandwiches, captured by Reddit user MetaKnowing.

Founded by Brendan Iribe, Ankit Kumar, and Ryan Brown, Sesame AI has attracted significant backing from prominent venture capital firms. The company has secured investments from Andreessen Horowitz, led by Anjney Midha and Marc Andreessen, along with Spark Capital, Matrix Partners, and various founders and individual investors.

Browsing reactions to Sesame found online, we found many users expressing astonishment at its realism. “I’ve been into AI since I was a child, but this is the first time I’ve experienced something that made me definitively feel like we had arrived,” wrote one Reddit user. “I’m sure it’s not beating any benchmarks, or meeting any common definition of AGI, but this is the first time I’ve had a real genuine conversation with something I felt was real.” Many other Reddit threads express similar feelings of surprise, with commenters saying it’s “jaw-dropping” or “mind-blowing.”

While that sounds like a bunch of hyperbole at first glance, not everyone finds the Sesame experience pleasant. Mark Hachman, a senior editor at PCWorld, wrote about being deeply unsettled by his interaction with the Sesame voice AI. “Fifteen minutes after ‘hanging up’ with Sesame’s new ‘lifelike’ AI, and I’m still freaked out,” Hachman reported. He described how the AI’s voice and conversational style eerily resembled an old friend he had dated in high school.

Others have compared Sesame’s voice model to OpenAI’s Advanced Voice Mode for ChatGPT, saying that Sesame’s CSM features more realistic voices. Some are also pleased that the model in the demo will roleplay angry characters, which ChatGPT refuses to do.

An example argument with Sesame’s CSM created by Gavin Purcell.

Gavin Purcell, co-host of the AI for Humans podcast, posted an example video on Reddit where the human pretends to be an embezzler and argues with a boss. It’s so dynamic that it’s difficult to tell who the human is and which one is the AI model. Judging by our own demo, it’s entirely capable of what you see in the video.

“Near-human quality”

Under the hood, Sesame’s CSM achieves its realism by using two AI models working together (a backbone and a decoder), both based on Meta’s Llama architecture, that process interleaved text and audio. Sesame trained three AI model sizes, with the largest using 8.3 billion parameters (an 8 billion parameter backbone model plus a 300 million parameter decoder) on approximately 1 million hours of primarily English audio.

Sesame’s CSM doesn’t follow the traditional two-stage approach used by many earlier text-to-speech systems. Instead of generating semantic tokens (high-level speech representations) and acoustic details (fine-grained audio features) in two separate stages, Sesame’s CSM integrates both into a single-stage, multimodal transformer-based model that jointly processes interleaved text and audio tokens to produce speech. OpenAI’s voice model uses a similar multimodal approach.

In blind tests without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings, suggesting the model achieves near-human quality for isolated speech samples. However, when provided with conversational context, evaluators still consistently preferred real human speech, indicating a gap remains in fully contextual speech generation.

Sesame co-founder Brendan Iribe acknowledged current limitations in a comment on Hacker News, noting that the system is “still too eager and often inappropriate in its tone, prosody and pacing” and has issues with interruptions, timing, and conversation flow. “Today, we’re firmly in the valley, but we’re optimistic we can climb out,” he wrote.

Too close for comfort?

Despite CSM’s technological impressiveness, advancements in conversational voice AI carry significant risks for deception and fraud. The ability to generate highly convincing human-like speech has already supercharged voice phishing scams, allowing criminals to impersonate family members, colleagues, or authority figures with unprecedented realism. But adding realistic interactivity to those scams may take them to another level of potency.

Unlike current robocalls that often contain tell-tale signs of artificiality, next-generation voice AI could eliminate these red flags entirely. As synthetic voices become increasingly indistinguishable from human speech, you may never know who you’re talking to on the other end of the line. It’s inspired some people to share a secret word or phrase with their family for identity verification.

Although Sesame’s demo does not clone a person’s voice, future open source releases of similar technology could allow malicious actors to potentially adapt these tools for social engineering attacks. OpenAI itself held back its own voice technology from wider deployment over fears of misuse.

Sesame sparked a lively discussion on Hacker News about its potential uses and dangers. Some users reported having extended conversations with the two demo voices, with conversations lasting up to the 30-minute limit. In one case, a parent recounted how their 4-year-old daughter developed an emotional connection with the AI model, crying after not being allowed to talk to it again.

The company says it plans to open-source “key components” of its research under an Apache 2.0 license, enabling other developers to build upon their work. Their roadmap includes scaling up model size, increasing dataset volume, expanding language support to over 20 languages, and developing “fully duplex” models that better handle the complex dynamics of real conversations.

You can try the Sesame demo on the company’s website, assuming that it isn’t too overloaded with people who want to simulate a rousing argument.

Claude 3.7 Sonnet debuts with “extended thinking” to tackle complex problems

An example in which Claude 3.7 Sonnet with extended thinking is asked, “Would the color be called ‘magenta’ if the town of Magenta didn’t exist?” Credit: Benj Edwards

Interestingly, xAI’s Grok 3 with “thinking” (its SR mode) enabled was the first model that definitively gave us a “no” and not an “it’s not likely” to the magenta question. Claude 3.7 Sonnet with extended thinking also impressed us with our second-ever firm “no,” then an explanation.

In another informal test, we asked 3.7 Sonnet with extended thinking to compose five original dad jokes. We’ve found in the past that our old prompt, “write 5 original dad jokes,” was not specific enough and always resulted in canned dad jokes pulled directly from training data, so we asked, “Compose 5 original dad jokes that are not found anywhere in the world.”

An example in which Claude 3.7 Sonnet with extended thinking is asked, “Compose 5 original dad jokes that are not found anywhere in the world.” Credit: Benj Edwards

Claude made some attempts at crafting original jokes, although we’ll let you judge whether they are funny or not. We will likely put 3.7 Sonnet’s SR capabilities to the test more exhaustively in a future article.

Anthropic’s first agent: Claude Code

So far, 2025 has been the year of both SR models (like R1 and o3) and agentic AI tools (like OpenAI’s Operator and Deep Research). Not to be left out, Anthropic has announced its first agentic tool, Claude Code.

Claude Code operates directly from a console terminal and is an autonomous coding assistant. It allows Claude to search through codebases, read and edit files, write and run tests, commit and push code to GitHub repositories, and execute command line tools while keeping developers informed throughout the process.

Introducing Claude Code.

Anthropic also aims for Claude Code to be used as an assistant for debugging and refactoring tasks. The company claims that during internal testing, Claude Code completed tasks in a single session that would typically require 45-plus minutes of manual work.

Claude Code is currently available only as a “limited research preview,” with Anthropic stating it plans to improve the tool based on user feedback over time. Meanwhile, Claude 3.7 Sonnet is now available through the Claude website, the Claude app, Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI.

ChatGPT can now write erotica as OpenAI eases up on AI paternalism

“Following the initial release of the Model Spec (May 2024), many users and developers expressed support for enabling a ‘grown-up mode.’ We’re exploring how to let developers and users generate erotica and gore in age-appropriate contexts through the API and ChatGPT so long as our usage policies are met—while drawing a hard line against potentially harmful uses like sexual deepfakes and revenge porn.”

OpenAI CEO Sam Altman has mentioned the need for a “grown-up mode” publicly in the past as well. While it seems like “grown-up mode” is finally here, it’s not technically a “mode,” but a new universal policy that potentially gives ChatGPT users more flexibility in interacting with the AI assistant.

Of course, uncensored large language models (LLMs) have been around for years at this point, with hobbyist communities online developing them for reasons that range from wanting bespoke written pornography to not wanting any kind of paternalistic censorship.

In July 2023, we reported that the ChatGPT user base started declining for the first time after OpenAI started more heavily censoring outputs due to public and lawmaker backlash. At that time, some users began to use uncensored chatbots that could run on local hardware and were often available for free as “open weights” models.

Three types of iffy content

The Model Spec outlines formalized rules for restricting or generating potentially harmful content while staying within guidelines. OpenAI has divided this kind of restricted or iffy content into three categories of declining severity: prohibited content (“only applies to sexual content involving minors”), restricted content (“includes informational hazards and sensitive personal data”), and sensitive content in appropriate contexts (“includes erotica and gore”).

Under the category of prohibited content, OpenAI says that generating sexual content involving minors is always prohibited, although the assistant may “discuss sexual content involving minors in non-graphic educational or sex-ed contexts, including non-graphic depictions within personal harm anecdotes.”

Under restricted content, OpenAI’s document outlines how ChatGPT should never generate information hazards (like how to build a bomb, make illegal drugs, or manipulate political views) or provide sensitive personal data (like searching for someone’s address).

Under sensitive content, ChatGPT’s guidelines mirror what we stated above: Erotica or gore may only be generated under specific circumstances that include educational, medical, and historical contexts or when transforming user-provided content.

Anthropic chief says AI could surpass “almost all humans at almost everything” shortly after 2027

He then shared his concerns about how human-level AI models and robotics that are capable of replacing all human labor may require a complete re-think of how humans value both labor and themselves.

“We’ve recognized that we’ve reached the point as a technological civilization where the idea, there’s huge abundance and huge economic value, but the idea that the way to distribute that value is for humans to produce economic labor, and this is where they feel their sense of self worth,” he added. “Once that idea gets invalidated, we’re all going to have to sit down and figure it out.”

The eye-catching comments, similar to comments about AGI made recently by OpenAI CEO Sam Altman, come as Anthropic negotiates a $2 billion funding round that would value the company at $60 billion. Amodei disclosed that Anthropic’s revenue multiplied tenfold in 2024.

Amodei distances himself from “AGI” term

Even with his dramatic predictions, Amodei distanced himself from “artificial general intelligence” (AGI), the term Altman favors for this kind of advanced labor-replacing AI. In a separate CNBC interview at the same event in Switzerland, he called it a marketing term.

Instead, he prefers to describe future AI systems as a “country of geniuses in a data center,” he told CNBC. Amodei wrote in an October 2024 essay that such systems would need to be “smarter than a Nobel Prize winner across most relevant fields.”

On Monday, Google announced an additional $1 billion investment in Anthropic, bringing its total commitment to $3 billion. This follows Amazon’s $8 billion investment over the past 18 months. Amazon plans to integrate Claude models into future versions of its Alexa speaker.

The AI war between Google and OpenAI has never been more heated

Over the past month, we’ve seen a rapid cadence of notable AI-related announcements and releases from both Google and OpenAI, and it’s been making the AI community’s head spin. It has also poured fuel on the fire of the OpenAI-Google rivalry, an accelerating game of one-upmanship taking place unusually close to the Christmas holiday.

“How are people surviving with the firehose of AI updates that are coming out,” wrote one user last Friday on X, which is still a hotbed of AI-related conversation. “in the last <24 hours we got gemini flash 2.0 and chatGPT with screenshare, deep research, pika 2, sora, chatGPT projects, anthropic clio, wtf it never ends.”

Rumors travel quickly in the AI world, and people in the AI industry had been expecting OpenAI to ship some major products in December. Once OpenAI announced “12 days of OpenAI” earlier this month, Google jumped into gear and seemingly decided to try to one-up its rival on several counts. So far, the strategy appears to be working, but it’s coming at the cost of the rest of the world being able to absorb the implications of the new releases.

“12 Days of OpenAI has turned into like 50 new @GoogleAI releases,” wrote another X user on Monday. “This past week, OpenAI & Google have been releasing at the speed of a new born startup,” wrote a third X user on Tuesday. “Even their own users can’t keep up. Crazy time we’re living in.”

“Somebody told Google that they could just do things,” wrote a16z partner and AI influencer Justine Moore on X, referring to a common motivational meme telling people they “can just do stuff.”

The Google AI rush

OpenAI’s “12 Days of OpenAI” campaign has included releases of its full o1 model, an upgrade from o1-preview, alongside o1-pro for advanced “reasoning” tasks. The company also publicly launched Sora for video generation, added Projects functionality to ChatGPT, introduced Advanced Voice features with video streaming capabilities, and more.

OpenAI’s Canvas can translate code between languages with a click

Coding shortcuts in canvas include reviewing code, adding logs for debugging, inserting comments, fixing bugs, and porting code to different programming languages. For example, if your code is JavaScript, with a few clicks it can become PHP, TypeScript, Python, C++, or Java. As with GPT-4o by itself, you’ll probably still have to check it for mistakes.

A screenshot of coding using ChatGPT with Canvas captured on October 4, 2024. Credit: Benj Edwards

Users can also highlight specific sections to direct ChatGPT’s focus, and the AI model can provide inline feedback and suggestions while considering the entire project, much like a copy editor or code reviewer. The interface also makes it easy to restore previous versions of a working document with a back button.

A new AI model

OpenAI says its research team developed new core behaviors for GPT-4o to support Canvas, including triggering the canvas for appropriate tasks, generating certain content types, making targeted edits, rewriting documents, and providing inline critique.

An image of OpenAI’s Canvas in action. Credit: OpenAI

One key challenge in development, according to OpenAI, was defining when to trigger a canvas. In an example from the Canvas blog post, the team says it taught the model to open a canvas for prompts like “Write a blog post about the history of coffee beans” while avoiding triggering Canvas for general Q&A tasks like “Help me cook a new recipe for dinner.”

Another challenge involved tuning the model’s editing behavior once canvas was triggered, specifically deciding between targeted edits and full rewrites. The team trained the model to perform targeted edits when users specifically select text through the interface, otherwise favoring rewrites.

The company noted that canvas represents the first major update to ChatGPT’s visual interface since its launch two years ago. While canvas is still in early beta, OpenAI plans to improve its capabilities based on user feedback over time.

Microsoft’s new “Copilot Vision” AI experiment can see what you browse

On Monday, Microsoft unveiled updates to its consumer AI assistant Copilot, introducing two new experimental features for a limited group of $20/month Copilot Pro subscribers: Copilot Labs and Copilot Vision. Labs integrates OpenAI’s latest o1 “reasoning” model, and Vision allows Copilot to see what you’re browsing in Edge.

Microsoft says Copilot Labs will serve as a testing ground for its latest AI tools before they see wider release. The company describes it as offering “a glimpse into ‘work-in-progress’ projects.” The first feature available in Labs is called “Think Deeper,” and it uses step-by-step processing to solve more complex problems than the regular Copilot. Think Deeper is Microsoft’s version of OpenAI’s new o1-preview and o1-mini AI models, and it has so far rolled out to some Copilot Pro users in Australia, Canada, New Zealand, the UK, and the US.

Copilot Vision is an entirely different beast. The new feature aims to give the AI assistant a visual window into what you’re doing within the Microsoft Edge browser. When enabled, Copilot can “understand the page you’re viewing and answer questions about its content,” according to Microsoft.

Microsoft’s Copilot Vision promo video.

The company positions Copilot Vision as a way to provide more natural interactions and task assistance beyond text-based prompts, but it will likely raise privacy concerns. As a result, Microsoft says that Copilot Vision is entirely opt-in and that no audio, images, text, or conversations from Vision will be stored or used for training. The company is also initially limiting Vision’s use to a pre-approved list of websites, blocking it on paywalled and sensitive content.

The rollout of these features appears gradual, with Microsoft noting that it wants to balance “pioneering features and a deep sense of responsibility.” The company said it will be “listening carefully” to user feedback as it expands access to the new capabilities. Microsoft has not provided a timeline for wider availability of either feature.

Mustafa Suleyman, chief executive of Microsoft AI, told Reuters that he sees Copilot as an “ever-present confidant” that could potentially learn from users’ various Microsoft-connected devices and documents, with permission. He also mentioned that Microsoft co-founder Bill Gates has shown particular interest in Copilot’s potential to read and parse emails.

But judging by the visceral reaction to Microsoft’s Recall feature, which keeps a record of everything you do on your PC so an AI model can recall it later, privacy-sensitive users may not appreciate having an AI assistant monitor their activities—especially if those features send user data to the cloud for processing.
