Feds to get early access to OpenAI, Anthropic AI to test for doomsday scenarios

“Advancing the science of AI safety” —

AI companies agreed that ensuring AI safety was key to innovation.

OpenAI and Anthropic have each signed unprecedented deals granting the US government early access to conduct safety testing on the companies’ flashiest new AI models before they’re released to the public.

According to a press release from the National Institute of Standards and Technology (NIST), the deal creates a “formal collaboration on AI safety research, testing, and evaluation with both Anthropic and OpenAI” and the US Artificial Intelligence Safety Institute.

Through the deal, the US AI Safety Institute will “receive access to major new models from each company prior to and following their public release.” This will ensure that public safety won’t depend exclusively on how the companies “evaluate capabilities and safety risks, as well as methods to mitigate those risks,” NIST said, but also on collaborative research with the US government.

The US AI Safety Institute will also be collaborating with the UK AI Safety Institute when examining models to flag potential safety risks. Both groups will provide feedback to OpenAI and Anthropic “on potential safety improvements to their models.”

NIST said that the agreements also build on voluntary AI safety commitments that AI companies made to the Biden administration to evaluate models to detect risks.

Elizabeth Kelly, director of the US AI Safety Institute, called the agreements “an important milestone” to “help responsibly steward the future of AI.”

Anthropic co-founder: AI safety “crucial” to innovation

The announcement comes as California is poised to pass one of the country’s first AI safety bills, which will regulate how AI is developed and deployed in the state.

Among the most controversial aspects of the bill is a requirement that AI companies build in a “kill switch” to stop models from introducing “novel threats to public safety and security,” especially if the model is acting “with limited human oversight, intervention, or supervision.”

Critics say the bill overlooks existing safety risks from AI—like deepfakes and election misinformation—to prioritize prevention of doomsday scenarios and could stifle AI innovation while providing little security today. They’ve urged California’s governor, Gavin Newsom, to veto the bill if it arrives at his desk, but it’s still unclear if Newsom intends to sign.

Anthropic was one of the AI companies that cautiously supported California’s controversial AI bill, Reuters reported, claiming that the potential benefits of the regulations likely outweigh the costs after a late round of amendments.

The company’s CEO, Dario Amodei, told Newsom why Anthropic supports the bill now in a letter last week, Reuters reported. He wrote that although Anthropic isn’t certain about aspects of the bill that “seem concerning or ambiguous,” Anthropic’s “initial concerns about the bill potentially hindering innovation due to the rapidly evolving nature of the field have been greatly reduced” by recent changes to the bill.

OpenAI has notably joined critics opposing California’s AI safety bill and has been called out by whistleblowers for lobbying against it.

In a letter to the bill’s co-sponsor, California Senator Scott Wiener, OpenAI’s chief strategy officer, Jason Kwon, suggested that “the federal government should lead in regulating frontier AI models to account for implications to national security and competitiveness.”

The ChatGPT maker striking a deal with the US AI Safety Institute seems in line with that thinking. As Kwon told Reuters, “We believe the institute has a critical role to play in defining US leadership in responsibly developing artificial intelligence and hope that our work together offers a framework that the rest of the world can build on.”

While some critics worry California’s AI safety bill will hamper innovation, Anthropic’s co-founder, Jack Clark, told Reuters today that “safe, trustworthy AI is crucial for the technology’s positive impact.” He confirmed that Anthropic’s “collaboration with the US AI Safety Institute” will leverage the government’s “wide expertise to rigorously test” Anthropic’s models “before widespread deployment.”

In NIST’s press release, Kelly agreed that “safety is essential to fueling breakthrough technological innovation.”

By directly collaborating with OpenAI and Anthropic, the US AI Safety Institute also plans to conduct its own research to help “advance the science of AI safety,” Kelly said.

Amazon defends $4B Anthropic AI deal from UK monopoly concerns

The United Kingdom’s Competition and Markets Authority (CMA) has officially launched a probe into Amazon’s $4 billion partnership with the AI firm Anthropic, as it continues to monitor how the largest tech companies might seize control of AI to further entrench their dominant market positions.

Through the partnership, “Amazon will become Anthropic’s primary cloud provider for certain workloads, including agreements for purchasing computing capacity and non-exclusive commitments to make Anthropic models available on Amazon Bedrock,” the CMA said.

Amazon and Anthropic deny there’s anything wrong with the deal. But because the CMA has seen “some” foundational model (FM) developers “form partnerships with major cloud providers” to “secure access to compute” needed to develop models, the CMA is worried that “incumbent firms” like Amazon “could use control over access to compute to shape FM-related markets in their own interests.”

Due to this potential risk, the CMA said it is “considering” whether Amazon’s partnership with Anthropic “has resulted in the creation of a relevant merger situation under the merger provisions of the Enterprise Act 2002 and, if so, whether the creation of that situation has resulted, or may be expected to result, in a substantial lessening of competition within any market or markets” in the UK.

It’s not clear yet if Amazon’s partnership with Anthropic is problematic, but the CMA confirmed that after a comment period last April, it now has “sufficient information” to kick off this first phase of its merger investigation.

By October 4, this first phase will conclude, after which the CMA may find that the partnership does not qualify as a merger situation, the UK regulator said. Or it may determine that it is a merger situation “but does not raise competition concerns,” clearing Amazon to proceed with the deal.

However, if a merger situation exists, and “it may result in a substantial lessening of competition” in a UK market, the CMA may refer the investigation to the next phase, allowing a panel of independent experts to dig deeper to illuminate potential risks and concerns. If Amazon wants to avoid that deeper probe potentially ordering steep fines, the tech giant would then have the option to offer fixes to “resolve the CMA’s concerns,” the CMA said.

An Amazon spokesperson told Reuters that its “collaboration with Anthropic does not raise any competition concerns or meet the CMA’s own threshold for review.”

“Amazon holds no board seat nor decision-making power at Anthropic, and Anthropic is free to work with any other provider (and indeed has multiple partners),” Amazon’s spokesperson said, defending the deal.

Anthropic’s spokesperson agreed that nothing was amiss, telling Reuters that “our strategic partnerships and investor relationships do not diminish our corporate governance independence or our freedom to partner with others. We intend to cooperate with the CMA and provide them with a comprehensive understanding of Amazon’s investment and our commercial collaboration.”

The first GPT-4-class AI model anyone can download has arrived: Llama 405B

A new llama emerges —

“Open source AI is the path forward,” says Mark Zuckerberg, misusing the term.

A red llama in a blue desert illustration based on a photo.

In the AI world, there’s a buzz in the air about a new AI language model released Tuesday by Meta: Llama 3.1 405B. The reason? It’s potentially the first time anyone can download a GPT-4-class large language model (LLM) for free and run it on their own hardware. You’ll still need some beefy hardware: Meta says it can run on a “single server node,” which isn’t desktop PC-grade equipment. But it’s a provocative shot across the bow of “closed” AI model vendors such as OpenAI and Anthropic.

“Llama 3.1 405B is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation,” says Meta. Company CEO Mark Zuckerberg calls 405B “the first frontier-level open source AI model.”

In the AI industry, “frontier model” is a term for an AI system designed to push the boundaries of current capabilities. In this case, Meta is positioning 405B among the likes of the industry’s top AI models, such as OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro.

A chart published by Meta suggests that 405B gets very close to matching the performance of GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).

But as we’ve noted many times since March, these benchmarks aren’t necessarily scientifically sound, nor do they necessarily translate to the subjective experience of interacting with AI language models. In fact, this traditional slate of AI benchmarks is so generally useless to laypeople that even Meta’s PR department now just posts a few images of charts without trying to explain them in any detail.

A Meta-provided chart that shows Llama 3.1 405B benchmark results versus other major AI models.

We’ve instead found that measuring the subjective experience of using a conversational AI model (through what might be called “vibemarking”) on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs. In the absence of Chatbot Arena data, Meta has provided the results of its own human evaluations of 405B’s outputs that seem to show Meta’s new model holding its own against GPT-4 Turbo and Claude 3.5 Sonnet.

A Meta-provided chart that shows how humans rated Llama 3.1 405B’s outputs compared to GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in its own studies.

Whatever the benchmarks, early word on the street (after the model leaked on 4chan yesterday) seems to match the claim that 405B is roughly equivalent to GPT-4. It took a lot of expensive computer training time to get there—and money, of which the social media giant has plenty to burn. Meta trained the 405B model on over 15 trillion tokens of training data scraped from the web (then parsed, filtered, and annotated by Llama 2), using more than 16,000 H100 GPUs.

So what’s with the 405B name? In this case, “405B” means 405 billion parameters, and parameters are numerical values that store trained information in a neural network. More parameters translate to a larger neural network powering the AI model, which generally (but not always) means more capability, such as better ability to make contextual connections between concepts. But larger-parameter models have a tradeoff in needing more computing power (AKA “compute”) to run.
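To make that concrete, here is a minimal sketch (in Python, with hypothetical layer sizes that are not Llama's actual architecture) of how the weights and biases of a dense network add up to a parameter count:

```python
# Parameters are the learned numbers in a network: one weight per
# input/output connection in each layer, plus one bias per output.
# Layer sizes here are illustrative, not Llama 3.1's architecture.

def count_params(layer_sizes):
    """Count weights + biases for a dense feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix connecting adjacent layers
        total += n_out         # bias vector for the output side
    return total

# A toy net: 1,024-wide input, 4,096-wide hidden layer, 1,024-wide output
print(count_params([1024, 4096, 1024]))  # 8393728 -- roughly a "0.008B" model
```

Scaling the number and width of layers up by a few orders of magnitude is, loosely speaking, how you get from millions of parameters to 405 billion.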

We’ve been expecting the release of a 400 billion-plus parameter model of the Llama 3 family since Meta gave word that it was training one in April, and today’s announcement isn’t just about the biggest member of the Llama 3 family: There’s an entirely new iteration of improved Llama models with the designation “Llama 3.1.” That includes upgraded versions of its smaller 8B and 70B models, which now feature multilingual support and an extended context length of 128,000 tokens (the “context length” is roughly the working memory capacity of the model, and “tokens” are chunks of data used by LLMs to process information).
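As a rough illustration of tokens and context length (using a naive whitespace splitter; real LLM tokenizers produce sub-word chunks, so counts differ in practice), the 128,000-token window acts as a hard cap on how much text the model can attend to at once:

```python
# Toy illustration: a "token" here is just a whitespace-separated chunk;
# real tokenizers split text into sub-word pieces, so actual counts differ.

CONTEXT_LENGTH = 128_000  # Llama 3.1's advertised context window

def fits_in_context(text, limit=CONTEXT_LENGTH):
    """True if the (crudely tokenized) text fits in the model's working memory."""
    tokens = text.split()
    return len(tokens) <= limit

print(fits_in_context("The quick brown fox jumps over the lazy dog"))  # True
print(fits_in_context("word " * 200_000))  # False: too long to attend to at once
```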

Meta says that 405B is useful for long-form text summarization, multilingual conversational agents, and coding assistants and for creating synthetic data used to train future AI language models. Notably, that last use-case—allowing developers to use outputs from Llama models to improve other AI models—is now officially supported by Meta’s Llama 3.1 license for the first time.

Abusing the term “open source”

Llama 3.1 405B is an open-weights model, which means anyone can download the trained neural network files and run them or fine-tune them. That directly challenges a business model where companies like OpenAI keep the weights to themselves and instead monetize the model through subscription wrappers like ChatGPT or charge for access by the token through an API.

Fighting the “closed” AI model is a big deal to Mark Zuckerberg, who simultaneously released a 2,300-word manifesto today on why the company believes in open releases of AI models, titled, “Open Source AI Is the Path Forward.” More on the terminology in a minute. But briefly, he writes about the need for customizable AI models that offer user control and encourage better data security, higher cost-efficiency, and better future-proofing, as opposed to vendor-locked solutions.

All that sounds reasonable, but undermining your competitors using a model subsidized by a social media war chest is also an efficient way to play spoiler in a market where you might not always win with the most cutting-edge tech. That benefits Meta, Zuckerberg says, because he doesn’t want to get locked into a system where companies like his have to pay a toll to access AI capabilities, drawing comparisons to “taxes” Apple levies on developers through its App Store.

A screenshot of Mark Zuckerberg’s essay, “Open Source AI Is the Path Forward,” published on July 23, 2024.

So, about that “open source” term. As we first wrote in an update to our Llama 2 launch article a year ago, “open source” has a very particular meaning that has traditionally been defined by the Open Source Initiative. The AI industry has not yet settled on terminology for AI model releases that ship either code or weights with restrictions (such as Llama 3.1) or that ship without providing training data. We’ve been calling these releases “open weights” instead.

Unfortunately for terminology sticklers, Zuckerberg has now baked the erroneous “open source” label into the title of his potentially historic essay on open AI releases, so fighting for the correct term in AI may be a losing battle. Still, his usage annoys people like independent AI researcher Simon Willison, who otherwise likes Zuckerberg’s essay.

“I see Zuck’s prominent misuse of ‘open source’ as a small-scale act of cultural vandalism,” Willison told Ars Technica. “Open source should have an agreed meaning. Abusing the term weakens that meaning which makes the term less generally useful, because if someone says ‘it’s open source,’ that no longer tells me anything useful. I have to then dig in and figure out what they’re actually talking about.”

The Llama 3.1 models are available for download through Meta’s own website and on Hugging Face. Both require providing contact information and agreeing to a license and an acceptable use policy, which means that Meta can technically legally pull the rug out from under your use of Llama 3.1 or its outputs at any time.

Anthropic introduces Claude 3.5 Sonnet, matching GPT-4o on benchmarks

The Anthropic Claude 3 logo, jazzed up by Benj Edwards.

Anthropic / Benj Edwards

On Thursday, Anthropic announced Claude 3.5 Sonnet, its latest AI language model and the first in a new series of “3.5” models that build upon Claude 3, launched in March. Claude 3.5 can compose text, analyze data, and write code. It features a 200,000-token context window and is available now on the Claude website and through an API. Anthropic also introduced Artifacts, a new feature in the Claude interface that displays generated content, such as documents and code, in a dedicated window alongside the conversation.

So far, people outside of Anthropic seem impressed. “This model is really, really good,” wrote independent AI researcher Simon Willison on X. “I think this is the new best overall model (and both faster and half the price of Opus, similar to the GPT-4 Turbo to GPT-4o jump).”

As we’ve written before, benchmarks for large language models (LLMs) are troublesome because they can be cherry-picked and often do not capture the feel and nuance of using a machine to generate outputs on almost any conceivable topic. But according to Anthropic, Claude 3.5 Sonnet matches or outperforms competitor models like GPT-4o and Gemini 1.5 Pro on certain benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).

Claude 3.5 Sonnet benchmarks provided by Anthropic.

If all that makes your eyes glaze over, that’s OK; it’s meaningful to researchers but mostly marketing to everyone else. A more useful performance metric comes from what we might call “vibemarks” (coined here first!) which are subjective, non-rigorous aggregate feelings measured by competitive usage on sites like LMSYS’s Chatbot Arena. The Claude 3.5 Sonnet model is currently under evaluation there, and it’s too soon to say how well it will fare.
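For the curious, leaderboards like Chatbot Arena aggregate pairwise human votes into ratings. A minimal Elo-style update is a reasonable sketch of the idea (Arena's published methodology is more involved, so treat this as illustrative only):

```python
# After one head-to-head vote between two chatbots, an Elo update moves the
# winner's rating up and the loser's down, weighted by how surprising the
# result was given their prior ratings.

def elo_update(r_winner, r_loser, k=32):
    """Return updated (winner, loser) ratings after a single vote."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)  # an upset produces a bigger swing
    return r_winner + delta, r_loser - delta

# Two evenly rated models: the winner gains exactly half of k.
print(elo_update(1200, 1200))  # (1216.0, 1184.0)
```

Thousands of such votes, accumulated across many model pairs, are what produce the leaderboard rankings.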

Claude 3.5 Sonnet also outperforms Anthropic’s previous-best model (Claude 3 Opus) on benchmarks measuring “reasoning,” math skills, general knowledge, and coding abilities. For example, the model demonstrated strong performance in an internal coding evaluation, solving 64 percent of problems compared to 38 percent for Claude 3 Opus.

Claude 3.5 Sonnet is also a multimodal AI model that accepts visual input in the form of images, and the new model is reportedly excellent at a battery of visual comprehension tests.

Claude 3.5 Sonnet benchmarks provided by Anthropic.

Roughly speaking, the visual benchmarks mean that 3.5 Sonnet is better at pulling information from images than previous models. For example, you can show it a picture of a rabbit wearing a football helmet, and the model knows it’s a rabbit wearing a football helmet and can talk about it. That’s fun for tech demos, but the tech is still not accurate enough for applications where reliability is mission critical.

DuckDuckGo offers “anonymous” access to AI chatbots through new service

anonymous confabulations —

DDG offers LLMs from OpenAI, Anthropic, Meta, and Mistral for factually-iffy conversations.

DuckDuckGo's AI Chat promotional image.

DuckDuckGo

On Thursday, DuckDuckGo unveiled a new “AI Chat” service that allows users to converse with four mid-range large language models (LLMs) from OpenAI, Anthropic, Meta, and Mistral in an interface similar to ChatGPT while attempting to preserve privacy and anonymity. While the AI models involved can readily output inaccurate information, the site allows users to test different mid-range LLMs without having to install anything or sign up for an account.

DuckDuckGo’s AI Chat currently features access to OpenAI’s GPT-3.5 Turbo, Anthropic’s Claude 3 Haiku, and two open source models, Meta’s Llama 3 and Mistral’s Mixtral 8x7B. The service is currently free to use within daily limits. Users can access AI Chat through the DuckDuckGo search engine, direct links to the site, or by using “!ai” or “!chat” shortcuts in the search field. AI Chat can also be disabled in the site’s settings for users with accounts.

According to DuckDuckGo, chats on the service are anonymized, with metadata and IP address removed to prevent tracing back to individuals. The company states that chats are not used for AI model training, citing its privacy policy and terms of use.

“We have agreements in place with all model providers to ensure that any saved chats are completely deleted by the providers within 30 days,” says DuckDuckGo, “and that none of the chats made on our platform can be used to train or improve the models.”

An example of DuckDuckGo AI Chat with GPT-3.5 answering a silly question in an inaccurate way.

Benj Edwards

However, the privacy experience is not bulletproof because, in the case of GPT-3.5 and Claude Haiku, DuckDuckGo is required to send a user’s inputs to remote servers for processing over the Internet. Given certain inputs (i.e., “Hey, GPT, my name is Bob, and I live on Main Street, and I just murdered Bill”), a user could still potentially be identified if such an extreme need arose.

While the service appears to work well for us, there’s a question about its utility. For example, while GPT-3.5 initially wowed people when it launched with ChatGPT in 2022, it also confabulated a lot—and it still does. GPT-4 was the first major LLM to get confabulations under control to a point where the bot became more reasonably useful for some tasks (though this itself is a controversial point), but that more capable model isn’t present in DuckDuckGo’s AI Chat. Also missing are similar GPT-4-level models like Claude Opus or Google’s Gemini Ultra, likely because they are far more expensive to run. DuckDuckGo says it may roll out paid plans in the future, and those may include higher daily usage limits or access to “more advanced models.”

It’s true that the other three models generally (and subjectively) surpass GPT-3.5 in coding capability with fewer hallucinations, but they can still make things up, too. With DuckDuckGo AI Chat as it stands, the company is left with a chatbot novelty with a decent interface and the promise that your conversations with it will remain private. But what use are fully private AI conversations if they are full of errors?

Mixtral 8x7B on DuckDuckGo AI Chat when asked about the author. Everything in red boxes is sadly incorrect, but it provides an interesting fantasy scenario. It’s a good example of an LLM plausibly filling gaps between concepts that are underrepresented in its training data, called confabulation. For the record, Llama 3 gives a more accurate answer.

Benj Edwards

As DuckDuckGo itself states in its privacy policy, “By its very nature, AI Chat generates text with limited information. As such, Outputs that appear complete or accurate because of their detail or specificity may not be. For example, AI Chat cannot dynamically retrieve information and so Outputs may be outdated. You should not rely on any Output without verifying its contents using other sources, especially for professional advice (like medical, financial, or legal advice).”

So, have fun talking to bots, but tread carefully. They’ll easily “lie” to your face because they don’t understand what they are saying and are tuned to output statistically plausible information, not factual references.

Here’s what’s really going on inside an LLM’s neural network

Artificial brain surgery —

Anthropic’s conceptual mapping helps explain why LLMs behave the way they do.

Aurich Lawson | Getty Images

With most computer programs—even complex ones—you can meticulously trace through the code and memory usage to figure out why that program generates any specific behavior or output. That’s generally not true in the field of generative AI, where the non-interpretable neural networks underlying these models make it hard for even experts to figure out precisely why they often confabulate information, for instance.

Now, new research from Anthropic offers a new window into what’s going on inside the Claude LLM’s “black box.” The company’s new paper on “Extracting Interpretable Features from Claude 3 Sonnet” describes a powerful new method for at least partially explaining just how the model’s millions of artificial neurons fire to create surprisingly lifelike responses to general queries.

Opening the hood

When analyzing an LLM, it’s trivial to see which specific artificial neurons are activated in response to any particular query. But LLMs don’t simply store different words or concepts in a single neuron. Instead, as Anthropic’s researchers explain, “it turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.”

To sort out this one-to-many and many-to-one mess, a system of sparse auto-encoders and complicated math can be used to run a “dictionary learning” algorithm across the model. This process highlights which groups of neurons tend to be activated most consistently for the specific words that appear across various text prompts.
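As a highly simplified sketch of that idea (illustrative only; Anthropic's actual setup trains the encoder end to end with a sparsity penalty), a sparse autoencoder maps dense neuron activations onto an overcomplete set of features, only a handful of which fire for any given input:

```python
import random

random.seed(0)
n_neurons, n_features = 16, 64  # overcomplete: far more features than neurons

# In real dictionary learning these weights are trained; random weights are
# enough here to show the shapes and the sparsity constraint.
W_enc = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_neurons)]

def encode(activations, top_k=4):
    """Map raw neuron activations to a sparse feature vector."""
    # ReLU encoder: feature j = max(0, sum_i activations[i] * W_enc[i][j])
    f = [max(0.0, sum(a * row[j] for a, row in zip(activations, W_enc)))
         for j in range(n_features)]
    cutoff = sorted(f)[-top_k]  # keep only the k strongest features
    return [v if v >= cutoff else 0.0 for v in f]

x = [random.gauss(0, 1) for _ in range(n_neurons)]  # one activation vector
features = encode(x)
print(sum(1 for v in features if v > 0))  # at most 4 active features
```

The point is the shape of the transformation: 16 dense, hard-to-interpret activations become 64 mostly zero feature values, and the few nonzero entries are the candidates for human-readable labels.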

The same internal LLM “feature” describes the Golden Gate Bridge in multiple languages and modes.

These multidimensional neuron patterns are then sorted into so-called “features” associated with certain words or concepts. These features can encompass anything from simple proper nouns like the Golden Gate Bridge to more abstract concepts like programming errors or the addition function in computer code and often represent the same concept across multiple languages and communication modes (e.g., text and images).

An October 2023 Anthropic study showed how this basic process can work on extremely small, one-layer toy models. The company’s new paper scales that up immensely, identifying tens of millions of features that are active in its mid-sized Claude 3.0 Sonnet model. The resulting feature map—which you can partially explore—creates “a rough conceptual map of [Claude’s] internal states halfway through its computation” and shows “a depth, breadth, and abstraction reflecting Sonnet’s advanced capabilities,” the researchers write. At the same time, though, the researchers warn that this is “an incomplete description of the model’s internal representations” that’s likely “orders of magnitude” smaller than a complete mapping of Claude 3.

A simplified map shows some of the concepts that are “near” the “inner conflict” feature in Anthropic’s Claude model.

Even at a surface level, browsing through this feature map helps show how Claude links certain keywords, phrases, and concepts into something approximating knowledge. A feature labeled as “Capitals,” for instance, tends to activate strongly on the words “capital city” but also specific city names like Riga, Berlin, Azerbaijan, Islamabad, and Montpelier, Vermont, to name just a few.

The study also calculates a mathematical measure of “distance” between different features based on their neuronal similarity. The resulting “feature neighborhoods” found by this process are “often organized in geometrically related clusters that share a semantic relationship,” the researchers write, showing that “the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity.” The Golden Gate Bridge feature, for instance, is relatively “close” to features describing “Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo.”
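The "distance" idea can be sketched with cosine similarity between feature direction vectors (a hedged illustration using made-up three-dimensional vectors; Anthropic's features live in a far higher-dimensional space, and its exact metric may differ):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two feature directions: 1.0 = identical, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical feature directions, invented purely for illustration:
golden_gate = [0.9, 0.1, 0.4]
alcatraz = [0.8, 0.2, 0.5]      # another San Francisco landmark
code_error = [-0.1, 0.9, -0.3]  # an unrelated abstract concept

print(cosine_similarity(golden_gate, alcatraz) >
      cosine_similarity(golden_gate, code_error))  # True: landmarks cluster
```

Grouping features by a similarity measure like this is what produces the "feature neighborhoods" the researchers describe.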

Some of the most important features involved in answering a query about the capital of Kobe Bryant’s team’s state.

Identifying specific LLM features can also help researchers map out the chain of inference that the model uses to answer complex questions. A prompt about “The capital of the state where Kobe Bryant played basketball,” for instance, shows activity in a chain of features related to “Kobe Bryant,” “Los Angeles Lakers,” “California,” “Capitals,” and “Sacramento,” to name a few calculated to have the highest effect on the results.

Anthropic releases Claude AI chatbot iOS app

AI in your pocket —

Anthropic finally comes to mobile, launches plan for teams that includes 200K context window.

The Claude AI iOS app running on an iPhone.

Anthropic

On Wednesday, Anthropic announced the launch of an iOS mobile app for its Claude 3 family of AI language models, which compete with OpenAI’s ChatGPT. It also introduced a new subscription tier designed for group collaboration. Before the app launch, Claude was only available through a website, an API, and other apps that integrated Claude through the API.

Like the ChatGPT app, Claude’s new mobile app serves as a gateway to chatbot interactions, and it also allows uploading photos for analysis. While it’s only available on Apple devices for now, Anthropic says that an Android app is coming soon.

Anthropic rolled out the Claude 3 large language model (LLM) family in March, featuring three different model sizes: Claude Opus, Claude Sonnet, and Claude Haiku. Currently, the app utilizes Sonnet for regular users and Opus for Pro users.

While Anthropic has been a key player in the AI field for several years, it’s entering the mobile space after many of its competitors have already established footprints on mobile platforms. OpenAI released its ChatGPT app for iOS in May 2023, with an Android version arriving two months later. Microsoft released a Copilot iOS app in January. Google Gemini is available through the Google app on iPhone.

Screenshots of the Claude AI iOS app running on an iPhone.

Anthropic

The app is freely available to all users of Claude, including those using the free version, subscribers paying $20 per month for Claude Pro, and members of the newly introduced Claude Team plan. Conversation history is saved and shared between the web app version of Claude and the mobile app version after logging in.

Speaking of that Team plan, it’s designed for groups of at least five and is priced at $30 per seat per month. It offers more chat queries (higher rate limits), access to all three Claude models, and a larger context window (200K tokens) for processing lengthy documents or maintaining detailed conversations. It also includes group admin tools and billing management, and users can easily switch between Pro and Team plans.



Critics question tech-heavy lineup of new Homeland Security AI safety board

Adventures in 21st century regulation —

CEO-heavy board to tackle elusive AI safety concept and apply it to US infrastructure.

A modified photo of a 1956 scientist carefully bottling

On Friday, the US Department of Homeland Security announced the formation of an Artificial Intelligence Safety and Security Board that consists of 22 members pulled from the tech industry, government, academia, and civil rights organizations. But given the nebulous nature of the term “AI,” which can apply to a broad spectrum of computer technology, it’s unclear if this group will even be able to agree on what exactly they are safeguarding us from.

President Biden directed DHS Secretary Alejandro Mayorkas to establish the board, which will meet for the first time in early May and subsequently on a quarterly basis.

The fundamental assumption posed by the board’s existence, and reflected in Biden’s AI executive order from October, is that AI is an inherently risky technology and that American citizens and businesses need to be protected from its misuse. Along those lines, the goal of the group is to help guard against foreign adversaries using AI to disrupt US infrastructure; develop recommendations to ensure the safe adoption of AI tech into transportation, energy, and Internet services; foster cross-sector collaboration between government and businesses; and create a forum for AI leaders to share information on AI security risks with the DHS.

It’s worth noting that the ill-defined nature of the term “Artificial Intelligence” does the new board no favors regarding scope and focus. AI can mean many different things: It can power a chatbot, fly an airplane, control the ghosts in Pac-Man, regulate the temperature of a nuclear reactor, or play a great game of chess. It can be all those things and more, and since many of those applications of AI work very differently, there’s no guarantee any two people on the board will be thinking about the same type of AI.

This confusion is reflected in the quotes provided by the DHS press release from new board members, some of whom are already talking about different types of AI. While OpenAI, Microsoft, and Anthropic are monetizing generative AI systems like ChatGPT based on large language models (LLMs), Ed Bastian, the CEO of Delta Air Lines, refers to entirely different classes of machine learning when he says, “By driving innovative tools like crew resourcing and turbulence prediction, AI is already making significant contributions to the reliability of our nation’s air travel system.”

So, defining the scope of what AI exactly means—and which applications of AI are new or dangerous—might be one of the key challenges for the new board.

A roundtable of Big Tech CEOs attracts criticism

For the inaugural meeting of the AI Safety and Security Board, the DHS selected a tech industry-heavy group, populated with CEOs of four major AI vendors (Sam Altman of OpenAI, Satya Nadella of Microsoft, Sundar Pichai of Alphabet, and Dario Amodei of Anthropic), CEO Jensen Huang of top AI chipmaker Nvidia, and representatives from other major tech companies like IBM, Adobe, Amazon, Cisco, and AMD. There are also reps from big aerospace and aviation: Northrop Grumman and Delta Air Lines.

Upon reading the announcement, some critics took issue with the board’s composition. On LinkedIn, Timnit Gebru, founder of the Distributed AI Research Institute (DAIR), singled out OpenAI’s presence on the board and wrote, “I’ve now seen the full list and it is hilarious. Foxes guarding the hen house is an understatement.”



Words are flowing out like endless rain: Recapping a busy week of LLM news

many things frequently —

Gemini 1.5 Pro launch, new version of GPT-4 Turbo, new Mistral model, and more.

An image of a boy amazed by flying letters.


Some weeks in AI news are eerily quiet, but during others, getting a grip on the week’s events feels like trying to hold back the tide. This week has seen three notable large language model (LLM) releases: Google’s Gemini 1.5 Pro hit general availability with a free tier, OpenAI shipped a new version of GPT-4 Turbo, and Mistral released a new openly licensed LLM, Mixtral 8x22B. All three launches happened within 24 hours starting on Tuesday.

With the help of software engineer and independent AI researcher Simon Willison (who also wrote about this week’s hectic LLM launches on his own blog), we’ll briefly cover each of the three major events in roughly chronological order, then dig into some additional AI happenings this week.

Gemini Pro 1.5 general release

On Tuesday morning Pacific time, Google announced that its Gemini 1.5 Pro model (which we first covered in February) is now available in 180-plus countries, excluding Europe, via the Gemini API in a public preview. This is Google’s most powerful public LLM so far, and it’s available in a free tier that permits up to 50 requests a day.

It supports up to 1 million tokens of input context. As Willison notes in his blog, Gemini 1.5 Pro’s API price at $7/million input tokens and $21/million output tokens costs a little less than GPT-4 Turbo (priced at $10/million in and $30/million out) and more than Claude 3 Sonnet (Anthropic’s mid-tier LLM, priced at $3/million in and $15/million out).
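For a concrete sense of those rates, here is a back-of-the-envelope cost comparison using only the per-million-token prices quoted above. This is a rough sketch, not the providers’ actual billing logic; the example request size is arbitrary:

```python
# Estimated per-request API cost (USD), using the per-million-token
# prices quoted in the article: (input rate, output rate).
PRICES = {
    "gemini-1.5-pro": (7.00, 21.00),
    "gpt-4-turbo": (10.00, 30.00),
    "claude-3-sonnet": (3.00, 15.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost for one request at the quoted rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 2,000-token prompt with a 500-token reply.
for model in PRICES:
    print(f"{model}: ${cost(model, 2000, 500):.4f}")
```

At that request size, each call costs a few cents or less on all three models, which is why the differences mostly matter at scale or with very long contexts.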

Notably, Gemini 1.5 Pro includes native audio (speech) input processing that allows users to upload audio or video prompts, a new File API for handling files, the ability to add custom system instructions (system prompts) for guiding model responses, and a JSON mode for structured data extraction.

“Majorly Improved” GPT-4 Turbo launch

A GPT-4 Turbo performance chart provided by OpenAI.


Just a bit later than Google’s 1.5 Pro launch on Tuesday, OpenAI announced that it was rolling out a “majorly improved” version of GPT-4 Turbo (a model family originally launched in November) called “gpt-4-turbo-2024-04-09.” It integrates multimodal GPT-4 Vision processing (recognizing the contents of images) directly into the model, and it initially launched through API access only.

Then on Thursday, OpenAI announced that the new GPT-4 Turbo model had just become available for paid ChatGPT users. OpenAI said that the new model improves “capabilities in writing, math, logical reasoning, and coding” and shared a chart that is not particularly useful in judging capabilities (one the company later updated). OpenAI also provided an example of an alleged improvement, saying that when writing with ChatGPT, the AI assistant will be “more direct, less verbose, and use more conversational language.”

The vague nature of OpenAI’s GPT-4 Turbo announcements attracted some confusion and criticism online. On X, Willison wrote, “Who will be the first LLM provider to publish genuinely useful release notes?” In some ways, this is a case of “AI vibes” again, as we discussed in our lament about the poor state of LLM benchmarks during the debut of Claude 3. “I’ve not actually spotted any definite differences in quality [related to GPT-4 Turbo],” Willison told us directly in an interview.

The update also expanded GPT-4’s knowledge cutoff to April 2024, although some people are reporting it achieves this through stealth web searches in the background, and others on social media have reported issues with date-related confabulations.

Mistral’s mysterious Mixtral 8x22B release

An illustration of a robot holding a French flag, figuratively reflecting the rise of AI in France due to Mistral. It's hard to draw a picture of an LLM, so a robot will have to do.


Not to be outdone, on Tuesday night, French AI company Mistral launched its latest openly licensed model, Mixtral 8x22B, by tweeting a torrent link devoid of any documentation or commentary, much like it has done with previous releases.

The new mixture-of-experts (MoE) release weighs in with a larger parameter count than Mistral’s previously most capable open model, Mixtral 8x7B, which we covered in December. It’s rumored to potentially be as capable as GPT-4 (in what way, you ask? Vibes), but that has yet to be seen.

“The evals are still rolling in, but the biggest open question right now is how well Mixtral 8x22B shapes up,” Willison told Ars. “If it’s in the same quality class as GPT-4 and Claude 3 Opus, then we will finally have an openly licensed model that’s not significantly behind the best proprietary ones.”

This release has Willison most excited, saying, “If that thing really is GPT-4 class, it’s wild, because you can run that on a (very expensive) laptop. I think you need 128GB of MacBook RAM for it, twice what I have.”

The new Mixtral is not listed on Chatbot Arena yet, Willison noted, because Mistral has not released a fine-tuned model for chatting yet. It’s still a raw, predict-the-next-token LLM. “There’s at least one community instruction-tuned version floating around now though,” says Willison.

Chatbot Arena Leaderboard shake-ups

A Chatbot Arena Leaderboard screenshot taken on April 12, 2024.


Benj Edwards

This week’s LLM news isn’t limited to just the big names in the field. There have also been rumblings on social media about the rising performance of open-weights models like Cohere’s Command R+, which reached position 6 on the LMSYS Chatbot Arena Leaderboard—the highest-ever ranking for an open-weights model.

And for even more Chatbot Arena action, apparently the new version of GPT-4 Turbo is proving competitive with Claude 3 Opus. The two are still in a statistical tie, but GPT-4 Turbo recently pulled ahead numerically. (In March, we reported when Claude 3 first numerically pulled ahead of GPT-4 Turbo, which was then the first time another AI model had surpassed a GPT-4 family model member on the leaderboard.)

Regarding this fierce competition among LLMs—of which most of the muggle world is unaware and will likely never be—Willison told Ars, “The past two months have been a whirlwind—we finally have not just one but several models that are competitive with GPT-4.” We’ll see whether OpenAI’s rumored release of GPT-5 later this year restores the company’s technological lead, which once seemed insurmountable. But for now, Willison says, “OpenAI are no longer the undisputed leaders in LLMs.”



OpenAI drops login requirements for ChatGPT’s free version

free as in beer? —

GPT-3.5 still falls far short of GPT-4, and other models surpassed it long ago.

A glowing OpenAI logo on a blue background.

Benj Edwards

On Monday, OpenAI announced that visitors to the ChatGPT website in some regions can now use the AI assistant without signing in. Previously, the company required that users create an account to use it, even with the free version of ChatGPT, which is currently powered by the GPT-3.5 AI language model. But as we have noted in the past, GPT-3.5 is widely known to provide less accurate information than GPT-4 Turbo, which is available in paid versions of ChatGPT.

Since its launch in November 2022, ChatGPT has transformed over time from a tech demo to a comprehensive AI assistant, and it’s always had a free version available. The cost is free because “you’re the product,” as the old saying goes. Using ChatGPT helps OpenAI gather data that will help the company train future AI models, although free users and ChatGPT Plus subscription members can both opt out of allowing the data they input into ChatGPT to be used for AI training. (OpenAI says it never trains on inputs from ChatGPT Team and Enterprise members at all).

Opening ChatGPT to everyone could give OpenAI a frictionless on-ramp for people who might use it as a substitute for Google Search, and it could win the company new customers by letting people try ChatGPT quickly before being offered an upsell to paid versions of the service.

“It’s core to our mission to make tools like ChatGPT broadly available so that people can experience the benefits of AI,” OpenAI says on its blog page. “For anyone that has been curious about AI’s potential but didn’t want to go through the steps to set up an account, start using ChatGPT today.”

When you visit the ChatGPT website, you're immediately presented with a chat box like this (in some regions). Screenshot captured April 1, 2024.


Benj Edwards

Since kids will also be able to use ChatGPT without an account—despite it being against the terms of service—OpenAI also says it’s introducing “additional content safeguards,” such as blocking more prompts and “generations in a wider range of categories.” OpenAI has not elaborated on what exactly that entails, but we reached out to the company for comment.

There might be a few other downsides to the fully open approach. On X, AI researcher Simon Willison wrote about the potential for automated abuse as a way to get around paying for OpenAI’s services: “I wonder how their scraping prevention works? I imagine the temptation for people to abuse this as a free 3.5 API will be pretty strong.”

With fierce competition, more GPT-3.5 access may backfire

Willison also mentioned a common criticism of OpenAI (as voiced in this case by Wharton professor Ethan Mollick) that people’s ideas about what AI models can do have so far largely been influenced by GPT-3.5, which, as we mentioned, is far less capable and far more prone to making things up than the paid version of ChatGPT that uses GPT-4 Turbo.

“In every group I speak to, from business executives to scientists, including a group of very accomplished people in Silicon Valley last night, much less than 20% of the crowd has even tried a GPT-4 class model,” wrote Mollick in a tweet from early March.

With models like Google’s Gemini 1.5 Pro and Anthropic’s Claude 3 potentially surpassing OpenAI’s best proprietary model at the moment—and open-weights AI models eclipsing the free version of ChatGPT—allowing people to use GPT-3.5 might not be putting OpenAI’s best foot forward. Microsoft Copilot, powered by OpenAI models, also supports a frictionless, no-login experience, but it allows access to a model based on GPT-4. Meanwhile, Gemini currently requires a sign-in, and Anthropic sends a login code through email.

For now, OpenAI says the login-free version of ChatGPT is not yet available to everyone, but it will be coming soon: “We’re rolling this out gradually, with the aim to make AI accessible to anyone curious about its capabilities.”



The AI wars heat up with Claude 3, claimed to have “near-human” abilities

The Anthropic Claude 3 logo.


On Monday, Anthropic released Claude 3, a family of three AI language models similar to those that power ChatGPT. Anthropic claims the models set new industry benchmarks across a range of cognitive tasks, even approaching “near-human” capability in some cases. It’s available now through Anthropic’s website, with the most powerful model being subscription-only. It’s also available via API for developers.

Claude 3’s three models represent increasing complexity and parameter count: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Sonnet currently powers the Claude.ai chatbot for free with an email sign-in. But as mentioned above, Opus is only available through Anthropic’s web chat interface if you pay $20 a month for “Claude Pro,” a subscription service offered through the Anthropic website. All three feature a 200,000-token context window. (The context window is the number of tokens—fragments of a word—that an AI language model can process at once.)

We covered the launch of Claude in March 2023 and Claude 2 in July that same year. Each time, Anthropic fell slightly behind OpenAI’s best models in capability while surpassing them in terms of context window length. With Claude 3, Anthropic has perhaps finally caught up with OpenAI’s released models in terms of performance, although there is no consensus among experts yet—and the presentation of AI benchmarks is notoriously prone to cherry-picking.

A Claude 3 benchmark chart provided by Anthropic.


Claude 3 reportedly demonstrates advanced performance across various cognitive tasks, including reasoning, expert knowledge, mathematics, and language fluency. (Despite the lack of consensus over whether large language models “know” or “reason,” the AI research community commonly uses those terms.) The company claims that the Opus model, the most capable of the three, exhibits “near-human levels of comprehension and fluency on complex tasks.”

That’s quite a heady claim and deserves to be parsed more carefully. It’s probably true that Opus is “near-human” on some specific benchmarks, but that doesn’t mean that Opus is a general intelligence like a human (consider that pocket calculators are superhuman at math). So, it’s a purposely eye-catching claim that can be watered down with qualifications.

According to Anthropic, Claude 3 Opus beats GPT-4 on 10 AI benchmarks, including MMLU (undergraduate level knowledge), GSM8K (grade school math), HumanEval (coding), and the colorfully named HellaSwag (common knowledge). Several of the wins are very narrow, such as Opus’ 86.8 percent vs. GPT-4’s 86.4 percent on a five-shot trial of MMLU, and some gaps are wide, such as Opus’ 84.9 percent on HumanEval over GPT-4’s 67.0 percent. But what that might mean, exactly, to you as a customer is difficult to say.
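The spread between those margins is worth making explicit: on MMLU the lead is under half a percentage point, while on HumanEval it is nearly 18 points. A quick sketch using only the scores quoted above:

```python
# Margins between Claude 3 Opus and GPT-4 on two of the benchmarks
# cited in the article (percentage points, as reported by Anthropic).
scores = {
    # benchmark: (Opus, GPT-4)
    "MMLU (5-shot)": (86.8, 86.4),
    "HumanEval": (84.9, 67.0),
}

for benchmark, (opus, gpt4) in scores.items():
    margin = opus - gpt4
    print(f"{benchmark}: Opus leads by {margin:.1f} points")
```

A sub-point lead on a benchmark with self-reported scores may well fall within run-to-run noise, which is part of why such comparisons deserve suspicion.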

“As always, LLM benchmarks should be treated with a little bit of suspicion,” says AI researcher Simon Willison, who spoke with Ars about Claude 3. “How well a model performs on benchmarks doesn’t tell you much about how the model ‘feels’ to use. But this is still a huge deal—no other model has beaten GPT-4 on a range of widely used benchmarks like this.”



The Super Bowl’s best and wackiest AI commercials

Superb Owl News —

It’s nothing like “crypto bowl” in 2022, but AI made a notable splash during the big game.

A still image from BodyArmor’s 2024 “Field of Fake” Super Bowl commercial.

BodyArmor

Heavily hyped tech products have a history of appearing in Super Bowl commercials during football’s biggest game—including the Apple Macintosh in 1984, dot-com companies in 2000, and cryptocurrency firms in 2022. In 2024, the hot tech in town is artificial intelligence, and several companies showed AI-related ads at Super Bowl LVIII. Here’s a rundown of notable appearances that range from serious to wacky.

Microsoft Copilot

Microsoft Game Day Commercial | Copilot: Your everyday AI companion.

It’s been a year since Microsoft launched the AI assistant Microsoft Copilot (as “Bing Chat”), and Microsoft is leaning heavily into its AI-assistant technology, which is powered by large language models from OpenAI. In Copilot’s first-ever Super Bowl commercial, we see scenes of various people with defiant text overlaid on the screen: “They say I will never open my own business or get my degree. They say I will never make my movie or build something. They say I’m too old to learn something new. Too young to change the world. But I say watch me.”

Then the commercial shows Copilot creating solutions to some of these problems, with prompts like, “Generate storyboard images for the dragon scene in my script,” “Write code for my 3d open world game,” “Quiz me in organic chemistry,” and “Design a sign for my classic truck repair garage Mike’s.”

Of course, since generative AI is an unfinished technology, many of these solutions are more aspirational than practical at the moment. On Bluesky, writer Ed Zitron put Microsoft’s truck repair logo to the test and saw results that weren’t nearly as polished as those seen in the commercial. On X, others have criticized and poked fun at the “3d open world game” generation prompt, which is a complex task that would take far more than a single, simple prompt to produce useful code.

Google Pixel 8 “Guided Frame” feature

Javier in Frame | Google Pixel SB Commercial 2024.

Instead of focusing on generative aspects of AI, Google’s commercial showed off a feature called “Guided Frame” on the Pixel 8 phone that uses machine vision technology and a computer voice to help people with blindness or low vision to take photos by centering the frame on a face or multiple faces. Guided Frame debuted in 2022 in conjunction with the Google Pixel 7.

The commercial tells the story of a person named Javier, who says, “For many people with blindness or low vision, there hasn’t always been an easy way to capture daily life.” We see a simulated blurry first-person view of Javier holding a smartphone and hear a computer-synthesized voice describing what the AI model sees, directing the person to center on a face to snap various photos and selfies.

Considering the controversies that generative AI currently generates (pun intended), it’s refreshing to see a positive application of AI technology used as an accessibility feature. Relatedly, an app called Be My Eyes (powered by OpenAI’s GPT-4V) also aims to help low-vision people interact with the world.

Despicable Me 4

Despicable Me 4 – Minion Intelligence (Big Game Spot).

So far, we’ve covered a couple of attempts to show AI-powered products as positive features. Elsewhere in Super Bowl ads, companies weren’t as generous about the technology. In an ad for the film Despicable Me 4, we see two Minions creating a series of terribly disfigured AI-generated still images reminiscent of Stable Diffusion 1.4 from 2022. There are three-legged people doing yoga, a painting of Steve Carell and Will Ferrell as Elizabethan gentlemen, a handshake with too many fingers, people eating spaghetti in a weird way, and a pair of people riding dachshunds in a race.

The images are paired with an earnest voiceover that says, “Artificial intelligence is changing the way we see the world, showing us what we never thought possible, transforming the way we do business, and bringing family and friends closer together. With artificial intelligence, the future is in good hands.” When the voiceover ends, the camera pans out to show hundreds of Minions generating similarly twisted images on computers.

Speaking of image synthesis at the Super Bowl, people mistook a Christian commercial created by He Gets Us, LLC for an AI-generated creation, likely due to its gaudy technicolor visuals. With the benefit of a YouTube replay and the ability to look at details, the “He washed feet” commercial doesn’t appear AI-generated to us, but it goes to show how the concept of image synthesis has begun to cast doubt on human-made creations.
