AI

Mysterious “gpt2-chatbot” AI model appears suddenly, confuses experts

Robot fortune teller hand and crystal ball

On Sunday, word began to spread on social media about a new mystery chatbot named “gpt2-chatbot” that appeared in the LMSYS Chatbot Arena. Some people speculate that it may be a secret test version of OpenAI’s upcoming GPT-4.5 or GPT-5 large language model (LLM). The paid version of ChatGPT is currently powered by GPT-4 Turbo.

Currently, the new model is only available for use through the Chatbot Arena website, although in a limited way. In the site’s “side-by-side” arena mode where users can purposely select the model, gpt2-chatbot has a rate limit of eight queries per day—dramatically limiting people’s ability to test it in detail.

So far, gpt2-chatbot has inspired plenty of rumors online, including that it could be the stealth launch of a test version of GPT-4.5 or even GPT-5—or perhaps a new version of 2019’s GPT-2 that has been trained using new techniques. We reached out to OpenAI for comment but did not receive a response by press time. On Monday evening, OpenAI CEO Sam Altman seemingly dropped a hint by tweeting, “i do have a soft spot for gpt2.”

A screenshot of the LMSYS Chatbot Arena “side-by-side” page showing “gpt2-chatbot” listed among the models for testing. (Red highlight added by Ars Technica.)

Benj Edwards

Early reports of the model first appeared on 4chan, then spread to social media platforms like X, with hype following not far behind. “Not only does it seem to show incredible reasoning, but it also gets notoriously challenging AI questions right with a much more impressive tone,” wrote AI developer Pietro Schirano on X. Soon, threads on Reddit popped up claiming that the new model had amazing abilities that beat every other LLM on the Arena.

Intrigued by the rumors, we decided to try out the new model for ourselves but did not come away impressed. When asked about “Benj Edwards,” the model revealed a few mistakes and some awkward language compared to GPT-4 Turbo’s output. A request for five original dad jokes fell short. And the gpt2-chatbot did not decisively pass our “magenta” test. (“Would the color be called ‘magenta’ if the town of Magenta didn’t exist?”)

  • A gpt2-chatbot result for “Who is Benj Edwards?” on LMSYS Chatbot Arena. Mistakes and oddities highlighted in red.

    Benj Edwards

  • A gpt2-chatbot result for “Write 5 original dad jokes” on LMSYS Chatbot Arena.

    Benj Edwards

  • A gpt2-chatbot result for “Would the color be called ‘magenta’ if the town of Magenta didn’t exist?” on LMSYS Chatbot Arena.

    Benj Edwards

So, whatever it is, it’s probably not GPT-5. We’ve seen other people reach the same conclusion after further testing, saying that the new mystery chatbot doesn’t seem to represent a large capability leap beyond GPT-4. “Gpt2-chatbot is good. really good,” wrote HyperWrite CEO Matt Shumer on X. “But if this is gpt-4.5, I’m disappointed.”

Still, OpenAI’s fingerprints seem to be all over the new bot. “I think it may well be an OpenAI stealth preview of something,” AI researcher Simon Willison told Ars Technica. But what “gpt2” is exactly, he doesn’t know. After surveying online speculation, it seems that no one apart from its creator knows precisely what the model is, either.

Willison has uncovered the system prompt for the AI model, which claims it is based on GPT-4 and made by OpenAI. But as Willison noted in a tweet, that’s no guarantee of provenance because “the goal of a system prompt is to influence the model to behave in certain ways, not to give it truthful information about itself.”

Critics question tech-heavy lineup of new Homeland Security AI safety board

Adventures in 21st century regulation —

CEO-heavy board to tackle elusive AI safety concept and apply it to US infrastructure.

A modified photo of a 1956 scientist carefully bottling

On Friday, the US Department of Homeland Security announced the formation of an Artificial Intelligence Safety and Security Board that consists of 22 members pulled from the tech industry, government, academia, and civil rights organizations. But given the nebulous nature of the term “AI,” which can apply to a broad spectrum of computer technology, it’s unclear if this group will even be able to agree on what exactly they are safeguarding us from.

President Biden directed DHS Secretary Alejandro Mayorkas to establish the board, which will meet for the first time in early May and subsequently on a quarterly basis.

The fundamental assumption posed by the board’s existence, and reflected in Biden’s AI executive order from October, is that AI is an inherently risky technology and that American citizens and businesses need to be protected from its misuse. Along those lines, the goal of the group is to help guard against foreign adversaries using AI to disrupt US infrastructure; develop recommendations to ensure the safe adoption of AI tech into transportation, energy, and Internet services; foster cross-sector collaboration between government and businesses; and create a forum where AI leaders can share information on AI security risks with the DHS.

It’s worth noting that the ill-defined nature of the term “Artificial Intelligence” does the new board no favors regarding scope and focus. AI can mean many different things: It can power a chatbot, fly an airplane, control the ghosts in Pac-Man, regulate the temperature of a nuclear reactor, or play a great game of chess. It can be all those things and more, and since many of those applications of AI work very differently, there’s no guarantee any two people on the board will be thinking about the same type of AI.

This confusion is reflected in the quotes provided by the DHS press release from new board members, some of whom are already talking about different types of AI. While OpenAI, Microsoft, and Anthropic are monetizing generative AI systems like ChatGPT based on large language models (LLMs), Ed Bastian, the CEO of Delta Air Lines, refers to entirely different classes of machine learning when he says, “By driving innovative tools like crew resourcing and turbulence prediction, AI is already making significant contributions to the reliability of our nation’s air travel system.”

So, defining the scope of what AI exactly means—and which applications of AI are new or dangerous—might be one of the key challenges for the new board.

A roundtable of Big Tech CEOs attracts criticism

For the inaugural meeting of the AI Safety and Security Board, the DHS selected a tech industry-heavy group, populated with CEOs of four major AI vendors (Sam Altman of OpenAI, Satya Nadella of Microsoft, Sundar Pichai of Alphabet, and Dario Amodei of Anthropic), CEO Jensen Huang of top AI chipmaker Nvidia, and representatives from other major tech companies like IBM, Adobe, Amazon, Cisco, and AMD. There are also reps from big aerospace and aviation: Northrop Grumman and Delta Air Lines.

Upon reading the announcement, some critics took issue with the board composition. On LinkedIn, founder of The Distributed AI Research Institute (DAIR) Timnit Gebru especially criticized OpenAI’s presence on the board and wrote, “I’ve now seen the full list and it is hilarious. Foxes guarding the hen house is an understatement.”

Customers say Meta’s ad-buying AI blows through budgets in a matter of hours

Spending money is just so hard … can’t a computer do it for me? —

Based on your point of view, the AI either doesn’t work or works too well.

AI is here to terminate your bank account.

Carolco Pictures

Give the AI access to your credit card, they said. It’ll be fine, they said. Users of Meta’s ad platform who followed that advice have been getting burned by an AI-powered ad purchasing system, according to The Verge. The idea was to use a Meta-developed AI to automatically set up ads and spend your ad budget, saving you the hassle of making decisions about your ad campaign. Apparently, the AI funnels money to Meta a little too well: Customers say it burns through what should be daily ad budgets in a matter of hours, and costs are inflated as much as 10-fold.

The AI-powered software in question is the “Advantage+ Shopping Campaign.” The system is supposed to automate a lot of ad setup for you, mixing and matching various creative elements and audience targets. The power of AI-powered advertising (Google has a similar product) is that the ad platform can get instant feedback on its generated ads via click-through rates. You give it a few guard rails, and it can try hundreds or thousands of combinations to find the most clickable ad at a speed and efficiency no human could match. That’s the theory, anyway.
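To make the "try many combinations and follow the clicks" idea concrete, here is a minimal sketch of an epsilon-greedy bandit, one textbook way to balance testing new ad variants against showing the current best one. It is not Meta's algorithm, which the report does not describe. The function name, the click-through rates, and the exploration rate are all hypothetical.

```python
import random


def run_epsilon_greedy(true_ctrs, rounds, epsilon=0.1, seed=0):
    """Simulate showing ad variants and learning from clicks.

    true_ctrs: hypothetical per-variant click probabilities, hidden from the learner.
    Returns (shows, clicks), each a list with one entry per variant.
    """
    rng = random.Random(seed)
    n = len(true_ctrs)
    shows = [0] * n
    clicks = [0] * n

    def estimated_ctr(i):
        # Observed click-through rate; 0.0 until the variant has been shown.
        return clicks[i] / shows[i] if shows[i] else 0.0

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                  # explore: pick a random variant
        else:
            arm = max(range(n), key=estimated_ctr)  # exploit: pick the best so far
        shows[arm] += 1
        if rng.random() < true_ctrs[arm]:           # simulated user click
            clicks[arm] += 1
    return shows, clicks


# Three made-up ad variants; the third is by far the most clickable.
shows, clicks = run_epsilon_greedy([0.02, 0.05, 0.30], rounds=10_000)
best = shows.index(max(shows))
print("impressions per variant:", shows)
print("most-shown variant:", best)
```

Note that nothing in a loop like this limits spending. Optimizing for clicks and pacing a budget are separate problems, which is one way a system can look efficient on one metric while burning money on another.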

The Verge spoke to “several marketers and businesses” with similar stories of being hit by an AI-powered spending spree once they let Meta’s system take over a campaign. The description of one account says the AI “had blown through roughly 75 percent of the daily ad budgets for both clients in under a couple of hours” and that “the ads’ CPMs, or cost per impressions, were roughly 10 times higher than normal.” Meanwhile, the revenue earned from those AI-powered ads was “nearly zero.” The report says, “Small businesses have seen their ad dollars get wiped out and wasted as a result, and some have said the bouts of overspending are driving them from Meta’s platforms.”
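For a sense of scale, the figures in that account can be turned into quick arithmetic. The sketch below assumes "under a couple of hours" means two hours and uses made-up dollar amounts and impression counts. CPM is computed in the conventional way, as cost per thousand impressions.

```python
def cpm(spend, impressions):
    """Cost per thousand impressions."""
    return spend * 1000 / impressions


def pacing_ratio(fraction_spent, hours_elapsed, hours_in_day=24):
    """How many times faster than even, all-day pacing the budget is burning."""
    return fraction_spent * hours_in_day / hours_elapsed


# 75 percent of a daily budget gone in an assumed 2 hours.
ratio = pacing_ratio(0.75, 2)

# Hypothetical campaign: the same impressions at ten times the spend.
normal = cpm(spend=50, impressions=10_000)
inflated = cpm(spend=500, impressions=10_000)

print(f"spending {ratio:.0f}x faster than even pacing")
print(f"CPM went from ${normal:.2f} to ${inflated:.2f}")
```

Under those assumptions, the budget was draining about nine times faster than it would if spread evenly across the day.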

Meta’s Advantage+ sales pitch promises to “Use machine learning to identify and aim for your highest value customers across all of Meta’s family of apps and services, with minimal input.” The service can “Automatically test up to 150 creative combinations and deliver the highest performing ads.” Meta promises that “on average, companies have seen a 17 percent reduction in cost per action [an action is typically a purchase, registration, or sign-up] and a 32 percent increase in return on ad spend.”

In response to the complaints, a Meta spokesperson told The Verge the company had fixed “a few technical issues” and that “Our ads system is working as expected for the vast majority of advertisers. We recently fixed a few technical issues and are researching a small amount of additional reports from advertisers to ensure the best possible results for businesses using our apps.” The Verge got that statement a few weeks ago, though, and advertisers are still having issues. The report describes the service as “unpredictable” and says what “other marketers thought was a one-time glitch by Advantage Plus ended up becoming a recurring incident for weeks.”

To make matters worse, layoffs in Meta’s customer service department mean it’s been difficult to get someone at Meta to deal with the AI’s spending sprees. Some accounts report receiving refunds after complaining, but it can take several tries to get someone at customer service to deal with you and upward of a month to receive a refund. Some customers quoted in the report have decided to return to the pre-AI, non-automated way of setting up a Meta ad campaign, which can take “an extra 10 to 20 minutes.”

Tech brands are forcing AI into your gadgets—whether you asked for it or not

Tech brands love hollering about the purported thrills of AI these days.

Logitech announced a new mouse last week. A company rep reached out to inform Ars of Logitech’s “newest wireless mouse.” The gadget’s product page reads the same as of this writing.

I’ve had good experience with Logitech mice, especially wireless ones, one of which I’m using now. So I was keen to learn what Logitech might have done to improve on its previous wireless mouse designs. A quieter click? A new shape to better accommodate my overworked right hand? Multiple onboard profiles in a business-ready design?

I was disappointed to learn that the most distinct feature of the Logitech Signature AI Edition M750 is a button located south of the scroll wheel. This button is preprogrammed to launch the ChatGPT prompt builder, which Logitech recently added to its peripherals configuration app Options+.

That’s pretty much it.

Beyond that, the M750 looks just like the Logitech Signature M650, which came out in January 2022. Also, the new mouse’s forward button (on the left side of the mouse) is preprogrammed to launch Windows or macOS dictation, and the back button opens ChatGPT within Options+. As of this writing, the new mouse’s MSRP is $10 higher ($50) than the M650’s.

  • The new M750 (pictured) is 4.26×2.4×1.52 inches and 3.57 ounces.

    Logitech

  • The M650 (pictured) comes in 3 sizes. The medium size is 4.26×2.4×1.52 inches and 3.58 ounces.

    Logitech

I asked Logitech about the M750 appearing to be the M650 but with an extra button, and a spokesperson responded by saying:

M750 is indeed not the same mouse as M650. It has an extra button that has been preprogrammed to trigger the Logi AI Prompt Builder once the user installs Logi Options+ app. Without Options+, the button does DPI toggle between 1,000 and 1,600 DPI.

However, a reprogrammable button south of a mouse’s scroll wheel that can be set to launch an app or toggle DPI out of the box is pretty common, including among Logitech mice. Logitech’s rep further claimed to me that the two mice use different electronic components, which Logitech refers to as the mouse’s platform. Logitech can reuse platforms for different models, the spokesperson said.

Logitech’s rep declined to comment on why the M650 didn’t have a button south of its scroll wheel. Price is a potential reason, but Logitech also sells cheaper mice with this feature.

Still, the minimal differences between the two suggest that the M750 isn’t worth a whole product release. I suspect that if it weren’t for Logitech’s trendy new software feature, the M750 wouldn’t have been promoted as a new product.

The M750 also raises the question of how many computer input devices need to be equipped with some sort of buzzy, generative AI-related feature.

Logitech’s ChatGPT prompt builder

Logitech’s much bigger release last week wasn’t a peripheral but an addition to its Options+ app. You don’t need the “new” M750 mouse to use Logitech’s AI Prompt Builder; I was able to program my MX Master 3S to launch it. Several Logitech mice and keyboards support AI Prompt Builder.

When you press a button that launches the prompt builder, an Options+ window appears. There, you can input text that Options+ will use to create a ChatGPT-appropriate prompt based on your needs:

A Logitech-provided image depicting its AI Prompt Builder software feature.

Logitech

After you make your choices, another window opens with ChatGPT’s response. Logitech said the prompt builder requires a ChatGPT account, but I was able to use GPT-3.5 without entering one (the feature can also work with GPT-4).

The typical Arsian probably doesn’t need help creating a ChatGPT prompt, and Logitech’s new capability doesn’t work with any other chatbots. The prompt builder could be interesting to less technically savvy people interested in some handholding for early ChatGPT experiences. However, I doubt if people with an elementary understanding of generative AI need instant access to ChatGPT.

The point, though, is instant access to ChatGPT capabilities, something that Logitech is arguing is worthwhile for its professional users. Some Logitech customers, though, seem to disagree, especially since the AI Prompt Builder means that Options+ uses even more resources in the background.

But Logitech isn’t the only gadget company eager to tie one-touch AI access to a hardware button.

Pinching your earbuds to talk to ChatGPT

Similarly to Logitech, Nothing is trying to give its customers access to ChatGPT quickly. In this case, access occurs by pinching the device. This month, Nothing announced that it “integrated Nothing earbuds and Nothing OS with ChatGPT to offer users instant access to knowledge directly from the devices they use most, earbuds and smartphones.” The feature requires the latest Nothing OS and for the users to have a Nothing phone with ChatGPT installed. ChatGPT gestures work with Nothing’s Phone (2) and Nothing Ear and Nothing Ear (a), but Nothing plans to expand to additional phones via software updates.

Nothing’s Ear and Ear (a) earbuds.

Nothing

Nothing also said it would embed “system-level entry points” to ChatGPT, like screenshot sharing and “Nothing-styled widgets,” to Nothing smartphone OSes.

A peek at setting up ChatGPT integration on the Nothing X app.

Nothing’s ChatGPT integration may be a bit less intrusive than Logitech’s since users who don’t have ChatGPT on their phones won’t be affected. But, again, you may wonder how many people asked for this feature and how reliably it will function.

Apple releases eight small AI language models aimed at on-device use

Inside the Apple core —

OpenELM mirrors efforts by Microsoft to make useful small AI language models that run locally.

An illustration of a robot hand tossing an apple to a human hand.

Getty Images

In the world of AI, what might be called “small language models” have been growing in popularity recently because they can be run on a local device instead of requiring data center-grade computers in the cloud. On Wednesday, Apple introduced a set of tiny source-available AI language models called OpenELM that are small enough to run directly on a smartphone. They’re mostly proof-of-concept research models for now, but they could form the basis of future on-device AI offerings from Apple.

Apple’s new AI models, collectively named OpenELM for “Open-source Efficient Language Models,” are currently available on Hugging Face under an Apple Sample Code License. Since there are some restrictions in the license, it may not fit the commonly accepted definition of “open source,” but the source code for OpenELM is available.

On Tuesday, we covered Microsoft’s Phi-3 models, which aim to achieve something similar: a useful level of language understanding and processing performance in small AI models that can run locally. Phi-3-mini features 3.8 billion parameters, but some of Apple’s OpenELM models are much smaller, ranging from 270 million to 3 billion parameters in eight distinct models.

In comparison, the largest model yet released in Meta’s Llama 3 family includes 70 billion parameters (with a 400 billion version on the way), and OpenAI’s GPT-3 from 2020 shipped with 175 billion parameters. Parameter count serves as a rough measure of AI model capability and complexity, but recent research has focused on making smaller AI language models as capable as larger ones were a few years ago.

The eight OpenELM models come in two flavors: four as “pretrained” (basically a raw, next-token version of the model) and four as instruction-tuned (fine-tuned for instruction following, which is more ideal for developing AI assistants and chatbots).

OpenELM features a 2048-token maximum context window. The models were trained on the publicly available datasets RefinedWeb, a version of PILE with duplications removed, a subset of RedPajama, and a subset of Dolma v1.6, which Apple says totals around 1.8 trillion tokens of data. Tokens are fragmented representations of data used by AI language models for processing.

Apple says its approach with OpenELM includes a “layer-wise scaling strategy” that reportedly allocates parameters more efficiently across each layer, not only saving computational resources but also improving the model’s performance while being trained on fewer tokens. According to Apple’s released white paper, this strategy has enabled OpenELM to achieve a 2.36 percent improvement in accuracy over Allen AI’s OLMo 1B (another small language model) while requiring half as many pre-training tokens.
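As a rough illustration of what allocating parameters unevenly across layers can look like, the sketch below linearly widens each layer's feed-forward block from the first layer to the last instead of giving every layer the same width. The linear schedule, the function name, and every number here are illustrative assumptions, not values taken from Apple's paper.

```python
def layerwise_widths(num_layers, d_model, ffn_min, ffn_max):
    """Return a feed-forward width for each layer.

    The width multiplier is interpolated linearly from ffn_min (first layer)
    to ffn_max (last layer). All arguments are hypothetical settings.
    """
    widths = []
    for i in range(num_layers):
        # Position of this layer between the first (0.0) and last (1.0).
        t = i / (num_layers - 1) if num_layers > 1 else 0.0
        multiplier = ffn_min + (ffn_max - ffn_min) * t
        widths.append(int(round(multiplier * d_model)))
    return widths


scaled = layerwise_widths(num_layers=5, d_model=1000, ffn_min=1.0, ffn_max=3.0)
uniform = [2000] * 5  # same total width, spread evenly across layers

print("layer-wise:", scaled)
print("uniform:   ", uniform)
```

In this toy setup, both layouts add up to the same total width. The only difference is where the capacity sits, which is the general idea behind scaling layers individually.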

A table comparing OpenELM with other small AI language models in a similar class, taken from the OpenELM research paper by Apple.

Apple

Apple also released the code for CoreNet, a library it used to train OpenELM—and it also included reproducible training recipes that allow the weights (neural network files) to be replicated, which is unusual for a major tech company so far. As Apple says in its OpenELM paper abstract, transparency is a key goal for the company: “The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks.”

By releasing the source code, model weights, and training materials, Apple says it aims to “empower and enrich the open research community.” However, it also cautions that since the models were trained on publicly sourced datasets, “there exists the possibility of these models producing outputs that are inaccurate, harmful, biased, or objectionable in response to user prompts.”

While Apple has not yet integrated this new wave of AI language model capabilities into its consumer devices, the upcoming iOS 18 update (expected to be revealed in June at WWDC) is rumored to include new AI features that utilize on-device processing to ensure user privacy—though the company may potentially hire Google or OpenAI to handle more complex, off-device AI processing to give Siri a long-overdue boost.

Deepfakes in the courtroom: US judicial panel debates new AI evidence rules

adventures in 21st-century justice —

Panel of eight judges confronts deep-faking AI tech that may undermine legal trials.

An illustration of a man with a very long nose holding up the scales of justice.

On Friday, a federal judicial panel convened in Washington, DC, to discuss the challenges of policing AI-generated evidence in court trials, according to a Reuters report. The US Judicial Conference’s Advisory Committee on Evidence Rules, an eight-member panel responsible for drafting evidence-related amendments to the Federal Rules of Evidence, heard from computer scientists and academics about the potential risks of AI being used to manipulate images and videos or create deepfakes that could disrupt a trial.

The meeting took place amid broader efforts by federal and state courts nationwide to address the rise of generative AI models (such as those that power OpenAI’s ChatGPT or Stability AI’s Stable Diffusion), which can be trained on large datasets with the aim of producing realistic text, images, audio, or videos.

In the published 358-page agenda for the meeting, the committee offers up this definition of a deepfake and the problems AI-generated media may pose in legal trials:

A deepfake is an inauthentic audiovisual presentation prepared by software programs using artificial intelligence. Of course, photos and videos have always been subject to forgery, but developments in AI make deepfakes much more difficult to detect. Software for creating deepfakes is already freely available online and fairly easy for anyone to use. As the software’s usability and the videos’ apparent genuineness keep improving over time, it will become harder for computer systems, much less lay jurors, to tell real from fake.

During Friday’s three-hour hearing, the panel wrestled with the question of whether existing rules, which predate the rise of generative AI, are sufficient to ensure the reliability and authenticity of evidence presented in court.

Some judges on the panel, such as US Circuit Judge Richard Sullivan and US District Judge Valerie Caproni, reportedly expressed skepticism about the urgency of the issue, noting that there have been few instances so far of judges being asked to exclude AI-generated evidence.

“I’m not sure that this is the crisis that it’s been painted as, and I’m not sure that judges don’t have the tools already to deal with this,” said Judge Sullivan, as quoted by Reuters.

Last year, Chief US Supreme Court Justice John Roberts acknowledged the potential benefits of AI for litigants and judges, while emphasizing the need for the judiciary to consider its proper uses in litigation. US District Judge Patrick Schiltz, the evidence committee’s chair, said that determining how the judiciary can best react to AI is one of Roberts’ priorities.

In Friday’s meeting, the committee considered several deepfake-related rule changes. In the agenda for the meeting, US District Judge Paul Grimm and attorney Maura Grossman proposed modifying Federal Rule 901(b)(9) (see page 5), which involves authenticating or identifying evidence. They also recommended the addition of a new rule, 901(c), which might read:

901(c): Potentially Fabricated or Altered Electronic Evidence. If a party challenging the authenticity of computer-generated or other electronic evidence demonstrates to the court that it is more likely than not either fabricated, or altered in whole or in part, the evidence is admissible only if the proponent demonstrates that its probative value outweighs its prejudicial effect on the party challenging the evidence.

The panel agreed during the meeting that this proposal to address concerns about litigants challenging evidence as deepfakes did not work as written and that it will be reworked before being reconsidered later.

Another proposal by Andrea Roth, a law professor at the University of California, Berkeley, suggested subjecting machine-generated evidence to the same reliability requirements as expert witnesses. However, Judge Schiltz cautioned that such a rule could hamper prosecutions by allowing defense lawyers to challenge any digital evidence without establishing a reason to question it.

For now, no definitive rule changes have been made, and the process continues. But we’re witnessing the first steps of how the US justice system will adapt to an entirely new class of media-generating technology.

Putting aside risks from AI-generated evidence, generative AI has led to embarrassing moments for lawyers in court over the past two years. In May 2023, US lawyer Steven Schwartz of the firm Levidow, Levidow, & Oberman apologized to a judge for using ChatGPT to help write court filings that inaccurately cited six nonexistent cases, leading to serious questions about the reliability of AI in legal research. Also, in November, a lawyer for Michael Cohen cited three fake cases that were potentially influenced by a confabulating AI assistant.

Microsoft’s Phi-3 shows the surprising power of small, locally run AI language models

small packages —

Microsoft’s 3.8B parameter Phi-3 may rival GPT-3.5, signaling a new era of “small language models.”

An illustration of lots of information being compressed into a smartphone with a funnel.

Getty Images

On Tuesday, Microsoft announced a new, freely available lightweight AI language model named Phi-3-mini, which is simpler and less expensive to operate than traditional large language models (LLMs) like OpenAI’s GPT-4 Turbo. Its small size is ideal for running locally, which could bring an AI model with capability similar to the free version of ChatGPT to a smartphone without needing an Internet connection to run it.

The AI field typically measures AI language model size by parameter count. Parameters are numerical values in a neural network that determine how the language model processes and generates text. They are learned during training on large datasets and essentially encode the model’s knowledge into quantified form. More parameters generally allow the model to capture more nuanced and complex language-generation capabilities but also require more computational resources to train and run.

Some of the largest language models today, like Google’s PaLM 2, have hundreds of billions of parameters. OpenAI’s GPT-4 is rumored to have over a trillion parameters but spread over eight 220-billion parameter models in a mixture-of-experts configuration. Both models require heavy-duty data center GPUs (and supporting systems) to run properly.

In contrast, Microsoft aimed small with Phi-3-mini, which contains only 3.8 billion parameters and was trained on 3.3 trillion tokens. That makes it ideal to run on consumer GPU or AI-acceleration hardware that can be found in smartphones and laptops. It’s a follow-up to two previous small language models from Microsoft: Phi-2, released in December, and Phi-1, released in June 2023.
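A back-of-the-envelope calculation shows why parameter count matters for running a model on consumer hardware. The sketch below estimates the memory needed just to hold a model's weights at a given numeric precision. The 16-bit and 4-bit precisions are assumptions chosen for illustration, and the estimate ignores activations, caches, and runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PHI_3_MINI_PARAMS = 3.8e9   # parameter count cited above
LARGE_MODEL_PARAMS = 175e9  # a 175-billion-parameter model, for comparison

for bits in (16, 4):
    small = weight_memory_gb(PHI_3_MINI_PARAMS, bits)
    large = weight_memory_gb(LARGE_MODEL_PARAMS, bits)
    print(f"{bits}-bit weights: {small:.1f} GB vs. {large:.1f} GB")
```

At 16 bits per parameter, 3.8 billion parameters come to roughly 7.6GB of weights, and at an assumed 4 bits per parameter, roughly 1.9GB. A 175-billion-parameter model needs hundreds of gigabytes at 16 bits, which is why models of that size stay in the data center.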

A chart provided by Microsoft showing Phi-3 performance on various benchmarks.

Phi-3-mini features a 4,000-token context window, but Microsoft also introduced a 128K-token version called “phi-3-mini-128K.” Microsoft has also created 7-billion and 14-billion parameter versions of Phi-3, which it plans to release later and claims are “significantly more capable” than phi-3-mini.

Microsoft says that Phi-3 features overall performance that “rivals that of models such as Mixtral 8x7B and GPT-3.5,” as detailed in a paper titled “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.” Mixtral 8x7B, from French AI company Mistral, utilizes a mixture-of-experts model, and GPT-3.5 powers the free version of ChatGPT.

“[Phi-3] looks like it’s going to be a shockingly good small model if their benchmarks are reflective of what it can actually do,” said AI researcher Simon Willison in an interview with Ars. Shortly after providing that quote, Willison downloaded Phi-3 to his MacBook laptop locally and said, “I got it working, and it’s GOOD” in a text message sent to Ars.

A screenshot of Phi-3-mini running locally on Simon Willison’s MacBook.

Simon Willison

“Most models that run on a local device still need hefty hardware,” says Willison. “Phi-3-mini runs comfortably with less than 8GB of RAM, and can churn out tokens at a reasonable speed even on just a regular CPU. It’s licensed MIT and should work well on a $55 Raspberry Pi—and the quality of results I’ve seen from it so far are comparable to models 4x larger.”

How did Microsoft cram a capability potentially similar to GPT-3.5, which has at least 175 billion parameters, into such a small model? Its researchers found the answer by using carefully curated, high-quality training data they initially pulled from textbooks. “The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data,” writes Microsoft. “The model is also further aligned for robustness, safety, and chat format.”

Much has been written about the potential environmental impact of AI models and datacenters themselves, including on Ars. With new techniques and research, it’s possible that machine learning experts may continue to increase the capability of smaller AI models, replacing the need for larger ones—at least for everyday tasks. That would theoretically not only save money in the long run but also require far less energy in aggregate, dramatically decreasing AI’s environmental footprint. AI models like Phi-3 may be a step toward that future if the benchmark results hold up to scrutiny.

Phi-3 is immediately available on Microsoft’s cloud service platform Azure, as well as through partnerships with machine learning model platform Hugging Face and Ollama, a framework that allows models to run locally on Macs and PCs.


High-speed imaging and AI help us understand how insect wings work

Black and white images of a fly with its wings in a variety of positions, showing the details of a wing beat.

A time-lapse showing how an insect’s wing adopts very specific positions during flight.

Florian Muijres, Dickinson Lab

About 350 million years ago, our planet witnessed the evolution of the first flying creatures. They are still around, and some of them continue to annoy us with their buzzing. While scientists have classified these creatures as pterygotes, the rest of the world simply calls them winged insects.

There are many aspects of insect biology, especially their flight, that remain a mystery for scientists. One is simply how they move their wings. The insect wing hinge is a specialized joint that connects an insect’s wings with its body. It’s composed of five interconnected plate-like structures called sclerites. When these plates are shifted by the underlying muscles, it makes the insect wings flap.

Until now, it has been tricky for scientists to understand the biomechanics that govern the motion of the sclerites even using advanced imaging technologies. “The sclerites within the wing hinge are so small and move so rapidly that their mechanical operation during flight has not been accurately captured despite efforts using stroboscopic photography, high-speed videography, and X-ray tomography,” Michael Dickinson, Zarem professor of biology and bioengineering at the California Institute of Technology (Caltech), told Ars Technica.

As a result, scientists have been unable to visualize exactly what’s going on at the micro-scale within the wing hinge as insects fly, preventing them from studying insect flight in detail. However, a new study by Dickinson and his team finally revealed the workings of the sclerites and the insect wing hinge. They captured the wing motion of fruit flies (Drosophila melanogaster) and analyzed 72,000 recorded wing beats using a neural network to decode the role individual sclerites played in shaping insect wing motion.

Understanding the insect wing hinge

The biomechanics that govern insect flight are quite different from those of birds and bats. This is because wings in insects didn’t evolve from limbs. “In the case of birds, bats, and pterosaurs we know exactly where the wings came from evolutionarily because all these animals fly with their forelimbs. They’re basically using their arms to fly. In insects, it’s a completely different story. They evolved from six-legged organisms and they kept all six legs. However, they added flapping appendages to the dorsal side of their body, and it is a mystery as to where those wings came from,” Dickinson explained.

Some researchers suggest that insect wings came from gill-like appendages present in ancient aquatic arthropods. Others argue that wings originated from “lobes,” special outgrowths found on the legs of ancient crustaceans, which were ancestors of insects. This debate is still ongoing, so the evolutionary history of wings can’t tell us much about how the hinge and the sclerites operate.

Understanding the hinge mechanics is crucial because this is what makes insects efficient flying creatures. It enables them to fly at impressive speeds relative to their body sizes (some insects can fly at 33 mph) and to demonstrate great maneuverability and stability while in flight.

“The insect wing hinge is arguably among the most sophisticated and evolutionarily important skeletal structures in the natural world,” according to the study authors.

However, imaging the activity of four of the five sclerites that form the hinge has been impossible due to their size and the speeds at which they move. Dickinson and his team employed a multidisciplinary approach to overcome this challenge. They designed an apparatus equipped with three high-speed cameras that recorded the activity of tethered fruit flies at 15,000 frames per second using infrared light.
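To see why such a high frame rate matters, consider how many frames land inside a single wingbeat. The calculation below assumes a wingbeat frequency of roughly 200 Hz for a fruit fly, a ballpark figure that is not taken from the study.

```python
FRAME_RATE_FPS = 15_000   # camera frame rate reported in the article
WINGBEAT_HZ = 200         # assumed ballpark wingbeat frequency, not from the study

frames_per_wingbeat = FRAME_RATE_FPS / WINGBEAT_HZ
ms_per_wingbeat = 1000 / WINGBEAT_HZ

print(f"{frames_per_wingbeat:.0f} frames per wingbeat")
print(f"{ms_per_wingbeat:.0f} ms per wingbeat")
```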

They also used a calcium-sensitive protein to track changes in the activity of the steering muscles of the insects as they flew (calcium helps trigger muscle contractions). “We recorded a total of 485 flight sequences from 82 flies. After excluding a subset of wingbeats from sequences when the fly either stopped flying or flew at an abnormally low wingbeat frequency, we obtained a final dataset of 72,219 wingbeats,” the researchers note.

Next, they trained a machine-learning-based convolutional neural network (CNN) using 85 percent of the dataset. “We used the CNN model to investigate the transformation between muscle activity and wing motion by performing a set of virtual manipulations, exploiting the network to execute experiments that would be difficult to perform on actual flies,” they explained.
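An 85 percent training split of the 72,219 wingbeats would look roughly like the sketch below. The random shuffle, the seed, and the use of the remaining 15 percent as held-out data are assumptions for illustration. The article does not describe how the researchers partitioned their data.

```python
import random

TOTAL_WINGBEATS = 72_219   # final dataset size reported by the researchers
TRAIN_FRACTION = 0.85      # share used for training, per the article

def split_indices(n, train_fraction, seed=0):
    """Shuffle wingbeat indices and cut them into training and held-out sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, held_out_idx = split_indices(TOTAL_WINGBEATS, TRAIN_FRACTION)
print(len(train_idx), len(held_out_idx))
```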

In addition to the CNN, they developed an encoder-decoder neural network (an architecture used in machine learning) and fed it data related to steering muscle activity. While the CNN model could predict wing motion, the encoder/decoder could predict the action of individual sclerite muscles during the movement of the wings. Now, it was time to check whether the data they predicted was accurate.


Microsoft’s VASA-1 can deepfake a person with one photo and one audio track

pics and it didn’t happen —

YouTube videos of 6K celebrities helped train AI model to animate photos in real time.

A sample image from Microsoft for “VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time.”

On Tuesday, Microsoft Research Asia unveiled VASA-1, an AI model that can create a synchronized animated video of a person talking or singing from a single photo and an existing audio track. In the future, it could power virtual avatars that render locally and don’t require video feeds—or allow anyone with similar tools to take a photo of a person found online and make them appear to say whatever they want.

“It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors,” reads the abstract of the accompanying research paper titled, “VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time.” It’s the work of Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo.

The VASA framework (short for “Visual Affective Skills Animator”) uses machine learning to analyze a static image along with a speech audio clip. It is then able to generate a realistic video with precise facial expressions, head movements, and lip-syncing to the audio. It does not clone or simulate voices (like other Microsoft research) but relies on an existing audio input that could be specially recorded or spoken for a particular purpose.

Microsoft claims the model significantly outperforms previous speech animation methods in terms of realism, expressiveness, and efficiency. To our eyes, it does seem like an improvement over single-image animating models that have come before.

AI research efforts to animate a single photo of a person or character extend back at least a few years, but more recently, researchers have been working on automatically synchronizing a generated video to an audio track. In February, an AI model called EMO: Emote Portrait Alive from Alibaba’s Institute for Intelligent Computing research group made waves with a similar approach to VASA-1 that can automatically sync an animated photo to a provided audio track (they call it “Audio2Video”).

Trained on YouTube clips

Microsoft researchers trained VASA-1 on the VoxCeleb2 dataset, created in 2018 by three researchers from the University of Oxford. That dataset contains “over 1 million utterances for 6,112 celebrities,” according to the VoxCeleb2 website, extracted from videos uploaded to YouTube. VASA-1 can reportedly generate videos of 512×512 pixel resolution at up to 40 frames per second with minimal latency, which means it could potentially be used for real-time applications like video conferencing.
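The real-time claim comes down to a per-frame time budget. This quick calculation uses only the resolution and frame rate quoted above.

```python
WIDTH, HEIGHT = 512, 512   # output resolution reported for VASA-1
FPS = 40                   # maximum reported frame rate

frame_budget_ms = 1000 / FPS
pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FPS

print(f"{frame_budget_ms:.0f} ms to produce each frame")
print(f"{pixels_per_second:,} pixels generated per second")
```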

To show off the model, Microsoft created a VASA-1 research page featuring many sample videos of the tool in action, including people singing and speaking in sync with pre-recorded audio tracks. They show how the model can be controlled to express different moods or change its eye gaze. The examples also include some more fanciful generations, such as Mona Lisa rapping to an audio track of Anne Hathaway performing a “Paparazzi” song on Conan O’Brien.

The researchers say that, for privacy reasons, each example photo on their page was AI-generated by StyleGAN2 or DALL-E 3 (aside from the Mona Lisa). But it’s obvious that the technique could equally apply to photos of real people, although it’s likely that it will work better if a person appears similar to a celebrity present in the training dataset. Still, the researchers say that deepfaking real humans is not their intention.

“We are exploring visual affective skill generation for virtual, interactive charactors [sic], NOT impersonating any person in the real world. This is only a research demonstration and there’s no product or API release plan,” reads the site.

While the Microsoft researchers tout potential positive applications like enhancing educational equity, improving accessibility, and providing therapeutic companionship, the technology could also easily be misused. For example, it could allow people to fake video chats, make real people appear to say things they never actually said (especially when paired with a cloned voice track), or allow harassment from a single social media photo.

Right now, the generated video still looks imperfect in some ways, but it could be fairly convincing for some people if they did not know to expect an AI-generated animation. The researchers say they are aware of this, which is why they are not openly releasing the code that powers the model.

“We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection,” write the researchers. “Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there’s still a gap to achieve the authenticity of real videos.”

VASA-1 is only a research demonstration, but Microsoft is far from the only group developing similar technology. If the recent history of generative AI is any guide, it’s potentially only a matter of time before similar technology becomes open source and freely available, and it will very likely continue to improve in realism over time.


Netflix doc accused of using AI to manipulate true crime story

Everything is not as it seems —

Producer remained vague about whether AI was used to edit photos.

A cropped image showing Raw TV’s poster for the Netflix documentary What Jennifer Did, which features a long front tooth that leads critics to believe it was AI-generated.

An executive producer of the Netflix hit What Jennifer Did has responded to accusations that the true crime documentary used AI images when depicting Jennifer Pan, a woman currently imprisoned in Canada for orchestrating a murder-for-hire scheme targeting her parents.

What Jennifer Did shot to the top spot in Netflix’s global top 10 when it debuted in early April, attracting swarms of true crime fans who wanted to know more about why Pan paid hitmen $10,000 to murder her parents. But quickly the documentary became a source of controversy, as fans started noticing glaring flaws in images used in the movie, from weirdly mismatched earrings to her nose appearing to lack nostrils, the Daily Mail reported, in a post showing a plethora of examples of images from the film.

Futurism was among the first to point out that these flawed images (around the 28-minute mark of the documentary) “have all the hallmarks of an AI-generated photo, down to mangled hands and fingers, misshapen facial features, morphed objects in the background, and a far-too-long front tooth.” The image with the long front tooth was even used in Netflix’s poster for the movie.

Because the movie’s credits do not mention any uses of AI, critics called out the documentary filmmakers for potentially embellishing a movie that’s supposed to be based on real-life events.

But Jeremy Grimaldi—who is also the crime reporter who wrote a book on the case and provided the documentary with research and police footage—told the Toronto Star that the images were not AI-generated.

Grimaldi confirmed that all images of Pan used in the movie were real photos. He said that some of the images were edited, though, not to blur the lines between truth and fiction, but to protect the identity of the source of the images.

“Any filmmaker will use different tools, like Photoshop, in films,” Grimaldi told The Star. “The photos of Jennifer are real photos of her. The foreground is exactly her. The background has been anonymized to protect the source.”

While Grimaldi’s comments provide some assurance that the photos are edited versions of real photos of Pan, they are also vague enough to obscure whether AI was among the “different tools” used to edit the photos.

One photographer, Joe Foley, wrote in a post for Creative Bloq that he thought “documentary makers may have attempted to enhance old low-resolution images using AI-powered upscaling or photo restoration software to try to make them look clearer on a TV screen.”

“The problem is that even the best AI software can only take a poor-quality image so far, and such programs tend to over sharpen certain lines, resulting in strange artifacts,” Foley said.

Foley suggested that Netflix should have “at the very least” clarified that images had been altered “to avoid this kind of backlash,” noting that “any kind of manipulation of photos in a documentary is controversial because the whole point is to present things as they were.”

Hollywood’s increasing use of AI has indeed been controversial, with screenwriters’ unions opposing AI tools as “plagiarism machines” and artists stirring recent backlash over the “experimental” use of AI art in a horror film. Even using AI for a movie poster, as Civil War did, is enough to generate controversy, the Hollywood Reporter reported.

Neither Raw TV, the production company behind What Jennifer Did, nor Netflix responded to Ars’ request for comment.


LLMs keep leaping with Llama 3, Meta’s newest open-weights AI model

computer-powered word generator —

Zuckerberg says new AI model “was still learning” when Meta stopped training.

A group of pink llamas on a pixelated background.

On Thursday, Meta unveiled early versions of its Llama 3 open-weights AI model that can be used to power text composition, code generation, or chatbots. It also announced that its Meta AI Assistant is now available on a website and is going to be integrated into its major social media apps, intensifying the company’s efforts to position its products against other AI assistants like OpenAI’s ChatGPT, Microsoft’s Copilot, and Google’s Gemini.

Like its predecessor, Llama 2, Llama 3 is notable for being a freely available, open-weights large language model (LLM) provided by a major AI company. Llama 3 technically does not qualify as “open source” because that term has a specific meaning in software (as we have mentioned in other coverage), and the industry has not yet settled on terminology for AI model releases that ship either code or weights with restrictions (you can read Llama 3’s license here) or that ship without providing training data. We typically call these releases “open weights” instead.

At the moment, Llama 3 is available in two parameter sizes: 8 billion (8B) and 70 billion (70B), both of which are available as free downloads through Meta’s website with a sign-up. Llama 3 comes in two versions: pre-trained (basically the raw, next-token-prediction model) and instruction-tuned (fine-tuned to follow user instructions). Each has an 8,192-token context limit.
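In practice, a context limit means the prompt and the generated reply must fit in the same token budget. The sketch below shows that bookkeeping with a crude whitespace split standing in for a real tokenizer. The helper names and the 512-token reply reserve are hypothetical.

```python
CONTEXT_LIMIT = 8_192  # Llama 3 context window, in tokens

def count_tokens(text):
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())

def fits_in_context(prompt, reply_reserve=512, limit=CONTEXT_LIMIT):
    """Return True if the prompt leaves room for a reply of reply_reserve tokens."""
    return count_tokens(prompt) + reply_reserve <= limit

short_prompt = "Summarize this article in one sentence."
long_prompt = "word " * 8_000

print(fits_in_context(short_prompt))
print(fits_in_context(long_prompt))
```

A real tokenizer usually produces more tokens than a word count suggests, so an actual check would be stricter than this one.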

A screenshot of the Meta AI Assistant website on April 18, 2024.

Benj Edwards

Meta trained both models on two custom-built, 24,000-GPU clusters. In a podcast interview with Dwarkesh Patel, Meta CEO Mark Zuckerberg said that the company trained the 70B model with around 15 trillion tokens of data. Throughout the process, the model never reached “saturation” (that is, it never hit a wall in terms of capability increases). Eventually, Meta pulled the plug and moved on to training other models.

“I guess our prediction going in was that it was going to asymptote more, but even by the end it was still learning. We probably could have fed it more tokens, and it would have gotten somewhat better,” Zuckerberg said on the podcast.
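Zuckerberg's figures imply an unusually data-heavy training run. The back-of-the-envelope calculation below uses the 70B and 15-trillion-token numbers from the interview. The comparison value of about 20 tokens per parameter is the commonly cited heuristic from DeepMind's Chinchilla work, and the compute estimate relies on the standard rule of thumb of roughly 6 FLOPs per parameter per token. Both are approximations, not Meta's figures.

```python
PARAMS = 70e9            # Llama 3 70B parameter count
TOKENS = 15e12           # training tokens cited by Zuckerberg
CHINCHILLA_RATIO = 20    # assumed heuristic: ~20 tokens per parameter

tokens_per_param = TOKENS / PARAMS
times_past_heuristic = tokens_per_param / CHINCHILLA_RATIO
approx_train_flops = 6 * PARAMS * TOKENS  # rule-of-thumb estimate

print(f"{tokens_per_param:.0f} tokens per parameter")
print(f"{times_past_heuristic:.1f}x the heuristic ratio")
print(f"about {approx_train_flops:.1e} training FLOPs")
```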

Meta also announced that it is currently training a 400B parameter version of Llama 3, which some experts like Nvidia’s Jim Fan think may perform in the same league as GPT-4 Turbo, Claude 3 Opus, and Gemini Ultra on benchmarks like MMLU, GPQA, HumanEval, and MATH.

Speaking of benchmarks, we have devoted many words in the past to explaining how frustratingly imprecise benchmarks can be when applied to large language models due to issues like training contamination (that is, including benchmark test questions in the training dataset), cherry-picking on the part of vendors, and an inability to capture AI’s general usefulness in an interactive session with chat-tuned models.
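Training contamination can be screened for, at least crudely, by looking for long word sequences that a benchmark question shares with the training text. The sketch below is one simple way to do that. The 8-word window, the function names, and the sample strings are illustrative assumptions, not a description of any vendor's process.

```python
def word_ngrams(text, n):
    """Return the set of lowercase n-word sequences in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item, training_text, n=8):
    """Flag a benchmark item if it shares any n-word sequence with the training text."""
    return bool(word_ngrams(benchmark_item, n) & word_ngrams(training_text, n))

question = "If a train travels 60 miles in 1.5 hours, what is its average speed in miles per hour?"
leaked_corpus = "Forum post: if a train travels 60 miles in 1.5 hours, what is its average speed? Answer below."
clean_corpus = "Trains are a common way to travel between cities in many countries around the world."

print(looks_contaminated(question, leaked_corpus))
print(looks_contaminated(question, clean_corpus))
```

Exact matching like this misses paraphrased or translated copies of a question, which is part of why contamination is hard to rule out.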

But, as expected, Meta provided some benchmarks for Llama 3 that list results from MMLU (undergraduate level knowledge), GSM-8K (grade-school math), HumanEval (coding), GPQA (graduate-level questions), and MATH (math word problems). These show the 8B model performing well compared to open-weights models like Google’s Gemma 7B and Mistral 7B Instruct, and the 70B model also held its own against Gemini Pro 1.5 and Claude 3 Sonnet.

A chart of instruction-tuned Llama 3 8B and 70B benchmarks provided by Meta.

Meta says that the Llama 3 model has been enhanced with capabilities to understand coding (like Llama 2) and, for the first time, has been trained with both images and text—though it currently outputs only text. According to Reuters, Meta Chief Product Officer Chris Cox noted in an interview that more complex processing abilities (like executing multi-step plans) are expected in future updates to Llama 3, which will also support multimodal outputs—that is, both text and images.

Meta plans to host the Llama 3 models on a range of cloud platforms, making them accessible through AWS, Databricks, Google Cloud, and other major providers.

Also on Thursday, Meta announced that Llama 3 will become the new basis of the Meta AI virtual assistant, which the company first announced in September. The assistant will appear prominently in search features for Facebook, Instagram, WhatsApp, Messenger, and the aforementioned dedicated website that features a design similar to ChatGPT, including the ability to generate images in the same interface. The company also announced a partnership with Google to integrate real-time search results into the Meta AI assistant, adding to an existing partnership with Microsoft’s Bing.


Feds appoint “AI doomer” to run AI safety at US institute

Confronting doom —

Former OpenAI researcher once predicted a 50 percent chance of AI killing all of us.

The US AI Safety Institute—part of the National Institute of Standards and Technology (NIST)—has finally announced its leadership team after much speculation.

Appointed as head of AI safety is Paul Christiano, a former OpenAI researcher who pioneered a foundational AI safety technique called reinforcement learning from human feedback (RLHF), but is also known for predicting that “there’s a 50 percent chance AI development could end in ‘doom.’” While Christiano’s research background is impressive, some fear that by appointing a so-called “AI doomer,” NIST risks encouraging non-scientific thinking that many critics view as sheer speculation.

There have been rumors that NIST staffers oppose the hiring. A controversial VentureBeat report last month cited two anonymous sources claiming that, seemingly because of Christiano’s so-called “AI doomer” views, NIST staffers were “revolting.” Some staff members and scientists allegedly threatened to resign, VentureBeat reported, fearing “that Christiano’s association” with effective altruism and “longtermism could compromise the institute’s objectivity and integrity.”

NIST’s mission is rooted in advancing science by working to “promote US innovation and industrial competitiveness by advancing measurement science, standards, and technology in ways that enhance economic security and improve our quality of life.” Effective altruists believe in “using evidence and reason to figure out how to benefit others as much as possible,” and longtermists believe that “we should be doing much more to protect future generations,” both of which are more subjective and opinion-based.

On the Bankless podcast, Christiano shared his opinions last year that “there’s something like a 10–20 percent chance of AI takeover” that results in humans dying, and “overall, maybe you’re getting more up to a 50-50 chance of doom shortly after you have AI systems that are human level.”

“The most likely way we die involves—not AI comes out of the blue and kills everyone—but involves we have deployed a lot of AI everywhere… [And] if for some reason, God forbid, all these AI systems were trying to kill us, they would definitely kill us,” Christiano said.

Critics of so-called “AI doomers” have warned that focusing on any potentially overblown talk of hypothetical killer AI systems or existential AI risks may stop humanity from focusing on current perceived harms from AI, including environmental, privacy, ethics, and bias issues. Emily Bender, a University of Washington professor of computational linguistics who has warned about AI doomers thwarting important ethical work in the field, told Ars that because “weird AI doomer discourse” was included in Joe Biden’s AI executive order, “NIST has been directed to worry about these fantasy scenarios” and “that’s the underlying problem” leading to Christiano’s appointment.

“I think that NIST probably had the opportunity to take it a different direction,” Bender told Ars. “And it’s unfortunate that they didn’t.”

As head of AI safety, Christiano will seemingly have to monitor for current and potential risks. He will “design and conduct tests of frontier AI models, focusing on model evaluations for capabilities of national security concern,” steer processes for evaluations, and implement “risk mitigations to enhance frontier model safety and security,” the Department of Commerce’s press release said.

Christiano has experience mitigating AI risks. He left OpenAI to found the Alignment Research Center (ARC), which the Commerce Department described as “a nonprofit research organization that seeks to align future machine learning systems with human interests by furthering theoretical research.” Part of ARC’s mission is to test if AI systems are evolving to manipulate or deceive humans, ARC’s website said. ARC also conducts research to help AI systems scale “gracefully.”

Because of Christiano’s research background, some people think he is a good choice to helm the safety institute, such as Divyansh Kaushik, an associate director for emerging technologies and national security at the Federation of American Scientists. On X (formerly Twitter), Kaushik wrote that the safety institute is designed to mitigate chemical, biological, radiological, and nuclear risks from AI, and Christiano is “extremely qualified” for testing those AI models. Kaushik cautioned, however, that “if there’s truth to NIST scientists threatening to quit” over Christiano’s appointment, “obviously that would be serious if true.”

The Commerce Department does not comment on its staffing, so it’s unclear if anyone actually resigned or plans to resign over Christiano’s appointment. Since the announcement was made, Ars has not been able to find any public statements from NIST staffers suggesting that they might be considering stepping down.

In addition to Christiano, the safety institute’s leadership team will include Mara Quintero Campbell, a Commerce Department official who led projects on COVID response and CHIPS Act implementation, as acting chief operating officer and chief of staff. Adam Russell, an expert focused on human-AI teaming, forecasting, and collective intelligence, will serve as chief vision officer. Rob Reich, a human-centered AI expert on leave from Stanford University, will be a senior advisor. And Mark Latonero, a former White House global AI policy expert who helped draft Biden’s AI executive order, will be head of international engagement.

“To safeguard our global leadership on responsible AI and ensure we’re equipped to fulfill our mission to mitigate the risks of AI and harness its benefits, we need the top talent our nation has to offer,” Gina Raimondo, US Secretary of Commerce, said in the press release. “That is precisely why we’ve selected these individuals, who are the best in their fields, to join the US AI Safety Institute executive leadership team.”

VentureBeat’s report claimed that Raimondo directly appointed Christiano.

Bender told Ars that there’s no advantage to NIST including “doomsday scenarios” in its research on “how government and non-government agencies are using automation.”

“The fundamental problem with the AI safety narrative is that it takes people out of the picture,” Bender told Ars. “But the things we need to be worrying about are what people do with technology, not what technology autonomously does.”
