GPT-4o

2024: The year AI drove everyone crazy


What do eating rocks, rat genitals, and Willy Wonka have in common? AI, of course.

It’s been a wild year in tech thanks to the intersection between humans and artificial intelligence. 2024 brought a parade of AI oddities, mishaps, and wacky moments that inspired odd behavior from both machines and man. From AI-generated rat genitals to search engines telling people to eat rocks, this year proved that AI has been having a weird impact on the world.

Why the weirdness? If we had to guess, it may be due to the novelty of it all. Generative AI and applications built upon Transformer-based AI models are still so new that people are throwing everything at the wall to see what sticks. People have been struggling to grasp both the implications and potential applications of the new technology. Riding along with the hype, different types of AI that may end up being ill-advised, such as automated military targeting systems, have also been introduced.

It’s worth mentioning that aside from crazy news, we also saw some less weird AI advances in 2024. For example, Claude 3.5 Sonnet, launched in June, held off the competition as a top model for most of the year, while OpenAI’s o1 used runtime compute to expand GPT-4o’s capabilities with simulated reasoning. Advanced Voice Mode and NotebookLM also emerged as novel applications of AI tech, and the year saw the rise of more capable music synthesis models and better AI video generators, including several from China.

But for now, let’s get down to the weirdness.

ChatGPT goes insane

Illustration of a broken toy robot.

Early in the year, things got off to an exciting start when OpenAI’s ChatGPT experienced a significant technical malfunction that caused the AI model to generate increasingly incoherent responses, prompting users on Reddit to describe the system as “having a stroke” or “going insane.” During the glitch, ChatGPT’s responses would begin normally but then deteriorate into nonsensical text, sometimes mimicking Shakespearean language.

OpenAI later revealed that a bug in how the model processed language caused it to select the wrong words during text generation, leading to nonsense outputs (basically the text version of what we at Ars now call “jabberwockies”). The company fixed the issue within 24 hours, but the incident led to frustrations about the black box nature of commercial AI systems and users’ tendency to anthropomorphize AI behavior when it malfunctions.

The great Wonka incident

A photo of “Willy’s Chocolate Experience” (inset), which did not match AI-generated promises, shown in the background. Credit: Stuart Sinclair

The collision between AI-generated imagery and consumer expectations fueled human frustrations in February when Scottish families discovered that “Willy’s Chocolate Experience,” an unlicensed Wonka-ripoff event promoted using AI-generated wonderland images, turned out to be little more than a sparse warehouse with a few modest decorations.

Parents who paid £35 per ticket encountered a situation so dire they called the police, with children reportedly crying at the sight of a person in what attendees described as a “terrifying outfit.” The event, created by House of Illuminati in Glasgow, promised fantastical spaces like an “Enchanted Garden” and “Twilight Tunnel” but delivered an underwhelming experience that forced organizers to shut down mid-way through its first day and issue refunds.

While the show was a bust, it brought us an iconic new meme for job disillusionment in the form of a photo: the green-haired Willy’s Chocolate Experience employee who looked like she’d rather be anywhere else on earth at that moment.

Mutant rat genitals expose peer review flaws

An actual laboratory rat, who is intrigued. Credit: Getty | Photothek

In February, Ars Technica senior health reporter Beth Mole covered a peer-reviewed paper published in Frontiers in Cell and Developmental Biology that created an uproar in the scientific community when researchers discovered it contained nonsensical AI-generated images, including an anatomically incorrect rat with oversized genitals. The paper, authored by scientists at Xi’an Honghui Hospital in China, openly acknowledged using Midjourney to create figures that contained gibberish text labels like “Stemm cells” and “iollotte sserotgomar.”

The publisher, Frontiers, posted an expression of concern about the article titled “Cellular functions of spermatogonial stem cells in relation to JAK/STAT signaling pathway” and launched an investigation into how the obviously flawed imagery passed through peer review. Scientists across social media platforms expressed dismay at the incident, which mirrored concerns about AI-generated content infiltrating academic publishing.

Chatbot makes erroneous refund promises for Air Canada

If, say, ChatGPT gives you the wrong name for one of the seven dwarves, it’s not such a big deal. But in February, Ars senior policy reporter Ashley Belanger covered a case of costly AI confabulation in the wild. In the course of online text conversations, Air Canada’s customer service chatbot told customers inaccurate refund policy information. The airline faced legal consequences later when a tribunal ruled the airline must honor commitments made by the automated system. Tribunal adjudicator Christopher Rivers determined that Air Canada bore responsibility for all information on its website, regardless of whether it came from a static page or AI interface.

The case set a precedent for how companies deploying AI customer service tools could face legal obligations for automated systems’ responses, particularly when they fail to warn users about potential inaccuracies. Ironically, the airline had reportedly spent more on the initial AI implementation than it would have cost to maintain human workers for simple queries, according to Air Canada executive Steve Crocker.

Will Smith lampoons his digital double

The real Will Smith eating spaghetti, parodying an AI-generated video from 2023. Credit: Will Smith / Getty Images / Benj Edwards

In March 2023, a terrible AI-generated video of Will Smith’s AI doppelganger eating spaghetti began making the rounds online. The AI-generated version of the actor gobbled down the noodles in an unnatural and disturbing way. Almost a year later, in February 2024, Will Smith himself posted a parody response video to the viral jabberwocky on Instagram, featuring deliberately exaggerated, AI-like pasta consumption, complete with hair-nibbling and finger-slurping antics.

Given the rapid evolution of AI video technology, particularly since OpenAI had just unveiled its Sora video model four days earlier, Smith’s post sparked discussion in his Instagram comments where some viewers initially struggled to distinguish between the genuine footage and AI generation. It was an early sign of “deep doubt” in action as the tech increasingly blurs the line between synthetic and authentic video content.

Robot dogs learn to hunt people with AI-guided rifles

A still image of a robotic quadruped armed with a remote weapons system, captured from a video provided by Onyx Industries. Credit: Onyx Industries

At some point in recent history—somewhere around 2022—someone took a look at robotic quadrupeds and thought it would be a great idea to attach guns to them. A few years later, the US Marine Forces Special Operations Command (MARSOC) began evaluating armed robotic quadrupeds developed by Ghost Robotics. The robot “dogs” integrated Onyx Industries’ SENTRY remote weapon systems, which featured AI-enabled targeting that could detect and track people, drones, and vehicles, though the systems require human operators to authorize any weapons discharge.

The military’s interest in armed robotic dogs followed a broader trend of weaponized quadrupeds entering public awareness. This included viral videos of consumer robots carrying firearms, and later, commercial sales of flame-throwing models. While MARSOC emphasized that weapons were just one potential use case under review, experts noted that the increasing integration of AI into military robotics raised questions about how long humans would remain in control of lethal force decisions.

Microsoft Windows AI is watching

A screenshot of Microsoft’s new “Recall” feature in action. Credit: Microsoft

In an era where many people already feel like they have no privacy due to tech encroachments, Microsoft dialed it up to an extreme degree in May. That’s when Microsoft unveiled a controversial Windows 11 feature called “Recall” that continuously captures screenshots of users’ PC activities every few seconds for later AI-powered search and retrieval. The feature, designed for new Copilot+ PCs using Qualcomm’s Snapdragon X Elite chips, promised to help users find past activities, including app usage, meeting content, and web browsing history.

While Microsoft emphasized that Recall would store encrypted snapshots locally and allow users to exclude specific apps or websites, the announcement raised immediate privacy concerns, as Ars senior technology reporter Andrew Cunningham covered. It also came with a technical toll, requiring significant hardware resources, including 256GB of storage space, with 25GB dedicated to storing approximately three months of user activity. After Microsoft pulled the initial test version due to public backlash, Recall later entered public preview in November with reportedly enhanced security measures. But secure spyware is still spyware—Recall, when enabled, still watches nearly everything you do on your computer and keeps a record of it.

Google Search told people to eat rocks

This is fine. Credit: Getty Images

In May, Ars senior gaming reporter Kyle Orland (who assisted commendably with the AI beat throughout the year) covered Google’s newly launched AI Overview feature. It faced immediate criticism when users discovered that it frequently provided false and potentially dangerous information in its search result summaries. Among its most alarming responses, the system advised humans could safely consume rocks, incorrectly citing scientific sources about the geological diet of marine organisms. The system’s other errors included recommending nonexistent car maintenance products, suggesting unsafe food preparation techniques, and confusing historical figures who shared names.

The problems stemmed from several issues, including the AI treating joke posts as factual sources and misinterpreting context from original web content. But most of all, the system relies on web results as indicators of authority, which we called a flawed design. While Google defended the system, stating these errors occurred mainly with uncommon queries, a company spokesperson acknowledged they would use these “isolated examples” to refine their systems. But to this day, AI Overview still makes frequent mistakes.

Stable Diffusion generates body horror

An AI-generated image created using Stable Diffusion 3 of a girl lying in the grass. Credit: HorneyMetalBeing

In June, Stability AI’s release of the image synthesis model Stable Diffusion 3 Medium drew criticism online for its poor handling of human anatomy in AI-generated images. Users across social media platforms shared examples of the model producing what we now like to call jabberwockies—AI generation failures with distorted bodies, misshapen hands, and surreal anatomical errors, and many in the AI image-generation community viewed it as a significant step backward from previous image-synthesis capabilities.

Reddit users attributed these failures to Stability AI’s aggressive filtering of adult content from the training data, which apparently impaired the model’s ability to accurately render human figures. The troubled release coincided with broader organizational challenges at Stability AI, including the March departure of CEO Emad Mostaque, multiple staff layoffs, and the exit of three key engineers who had helped develop the technology. Some of those engineers founded Black Forest Labs in August and released Flux, which has become the latest open-weights AI image model to beat.

ChatGPT Advanced Voice imitates human voice in testing

An illustration of a computer synthesizer spewing out letters.

AI voice-synthesis models are master imitators these days, and they are capable of much more than many people realize. In August, we covered a story where OpenAI’s ChatGPT Advanced Voice Mode feature unexpectedly imitated a user’s voice during the company’s internal testing, revealed by OpenAI after the fact in safety testing documentation. To prevent future instances of an AI assistant suddenly speaking in your own voice (which, let’s be honest, would probably freak people out), the company created an output classifier system to prevent unauthorized voice imitation. OpenAI says that Advanced Voice Mode now catches all meaningful deviations from approved system voices.

Independent AI researcher Simon Willison discussed the implications with Ars Technica, noting that while OpenAI restricted its model’s full voice synthesis capabilities, similar technology would likely emerge from other sources within the year. Meanwhile, the rapid advancement of AI voice replication has caused general concern about its potential misuse, although companies like ElevenLabs have already been offering voice cloning services for some time.

San Francisco’s robotic car horn symphony

A Waymo self-driving car in front of Google’s San Francisco headquarters, San Francisco, California, June 7, 2024. Credit: Getty Images

In August, San Francisco residents got a noisy taste of robo-dystopia when Waymo’s self-driving cars began creating an unexpected nightly disturbance in the South of Market district. In a parking lot off 2nd Street, the cars congregated autonomously every night during rider lulls at 4 am and began engaging in extended honking matches at each other while attempting to park.

Local resident Christopher Cherry’s initial optimism about the robotic fleet’s presence dissolved as the mechanical chorus grew louder each night, affecting residents in nearby high-rises. The nocturnal tech disruption served as a lesson in the unintentional effects of autonomous systems when run in aggregate.

Larry Ellison dreams of all-seeing AI cameras

A colorized photo of CCTV cameras in London, 2024.

In September, Oracle co-founder Larry Ellison painted a bleak vision of ubiquitous AI surveillance during a company financial meeting. The 80-year-old database billionaire described a future where AI would monitor citizens through networks of cameras and drones, asserting that the oversight would ensure lawful behavior from both police and the public.

His surveillance predictions reminded us of parallels to existing systems in China, where authorities already used AI to sort surveillance data on citizens as part of the country’s “sharp eyes” campaign from 2015 to 2020. Ellison’s statement reflected the sort of worst-case tech surveillance state scenario—likely antithetical to any sort of free society—that dozens of sci-fi novels of the 20th century warned us about.

A dead father sends new letters home

An AI-generated image featuring my late father’s handwriting. Credit: Benj Edwards / Flux

AI has made many of us do weird things in 2024, including this writer. In October, I used an AI synthesis model called Flux to reproduce my late father’s handwriting with striking accuracy. After scanning 30 samples from his engineering notebooks, I trained the model using computing time that cost less than five dollars. The resulting text captured his distinctive uppercase style, which he developed during his career as an electronics engineer.

I enjoyed creating images showing his handwriting in various contexts, from folder labels to skywriting, and made the trained model freely available online for others to use. While I approached it as a tribute to my father (who would have appreciated the technical achievement), many people found the whole experience weird and somewhat disturbing. The things we unhinged Bing Chat-like journalists do to bring awareness to a topic are sometimes unconventional. So I guess it counts for this list!

For 2025? Expect even more AI

Thanks for reading Ars Technica this past year and following along with our team coverage of this rapidly emerging and expanding field. We appreciate your kind words of support. Ars Technica’s 2024 AI words of the year were: vibemarking, deep doubt, and the aforementioned jabberwocky. The old stalwart “confabulation” also made several notable appearances. Tune in again next year when we continue to try to figure out how to concisely describe novel scenarios in emerging technology by labeling them.

Looking back, our prediction for 2024 in AI last year was “buckle up.” It seems fitting, given the weirdness detailed above. Especially the part about the robot dogs with guns. For 2025, AI will likely inspire more chaos ahead, but also potentially get put to serious work as a productivity tool, so this time, our prediction is “buckle down.”

Finally, we’d like to ask: What was the craziest story about AI in 2024 from your perspective? Whether you love AI or hate it, feel free to suggest your own additions to our list in the comments. Happy New Year!

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

OpenAI announces o3 and o3-mini, its next simulated reasoning models

On Friday, during Day 12 of its “12 days of OpenAI,” OpenAI CEO Sam Altman announced its latest AI “reasoning” models, o3 and o3-mini, which build upon the o1 models launched earlier this year. The company is not releasing them yet but will make these models available for public safety testing and research access today.

The models use what OpenAI calls “private chain of thought,” where the model pauses to examine its internal dialog and plan ahead before responding, which you might call “simulated reasoning” (SR)—a form of AI that goes beyond basic large language models (LLMs).

The company named the model family “o3” instead of “o2” to avoid potential trademark conflicts with British telecom provider O2, according to The Information. During Friday’s livestream, Altman acknowledged his company’s naming foibles, saying, “In the grand tradition of OpenAI being really, truly bad at names, it’ll be called o3.”

According to OpenAI, the o3 model earned a record-breaking score on the ARC-AGI benchmark, a visual reasoning benchmark that has gone unbeaten since its creation in 2019. In low-compute scenarios, o3 scored 75.7 percent, while in high-compute testing, it reached 87.5 percent—comparable to human performance at an 85 percent threshold.

OpenAI also reported that o3 scored 96.7 percent on the 2024 American Invitational Mathematics Exam, missing just one question. The model also reached 87.7 percent on GPQA Diamond, which contains graduate-level biology, physics, and chemistry questions. On the FrontierMath benchmark by Epoch AI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent.
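The "missing just one question" figure can be sanity-checked with a little arithmetic. This is a minimal sketch that assumes the reported score covers both 2024 sittings of the exam (AIME I and AIME II, 15 questions each); OpenAI's exact evaluation setup is not described here, so treat the question count as an assumption.

```python
# Assumption: the score covers both 2024 sittings (AIME I and AIME II),
# each with 15 questions, for 30 questions in total.
total_questions = 15 * 2
missed = 1

score_percent = 100 * (total_questions - missed) / total_questions
print(round(score_percent, 1))  # 96.7
```

Under that assumption, 29 correct answers out of 30 rounds to the reported 96.7 percent, whereas one miss on a single 15-question sitting would give about 93.3 percent.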

OpenAI announces full “o1” reasoning model, $200 ChatGPT Pro tier

On X, frequent AI experimenter Ethan Mollick wrote, “Been playing with o1 and o1-pro for bit. They are very good & a little weird. They are also not for most people most of the time. You really need to have particular hard problems to solve in order to get value out of it. But if you have those problems, this is a very big deal.”

OpenAI claims improved reliability

OpenAI is touting pro mode’s improved reliability, which is evaluated internally based on whether it can solve a question correctly in four out of four attempts rather than just a single attempt.
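To make the difference between single-attempt scoring and a "four out of four" criterion concrete, here is a minimal sketch. The function names and sample results are hypothetical and are not OpenAI's evaluation code; the sketch only assumes that each question has a recorded list of pass/fail attempts.

```python
from typing import Dict, List


def single_attempt_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of questions whose first attempt was correct."""
    return sum(attempts[0] for attempts in results.values()) / len(results)


def strict_reliability(results: Dict[str, List[bool]], k: int = 4) -> float:
    """Fraction of questions answered correctly in all of the first k attempts."""
    passed = sum(
        len(attempts) >= k and all(attempts[:k]) for attempts in results.values()
    )
    return passed / len(results)


# Hypothetical pass/fail outcomes for three questions, four attempts each.
sample = {
    "q1": [True, True, True, True],
    "q2": [True, False, True, True],
    "q3": [False, True, True, True],
}

print(single_attempt_accuracy(sample))  # two of three first attempts are correct
print(strict_reliability(sample))       # only q1 passes all four attempts
```

The stricter criterion penalizes a model that is right most of the time but not consistently, which is why it reads as a reliability measure rather than a raw accuracy measure.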

“In evaluations from external expert testers, o1 pro mode produces more reliably accurate and comprehensive responses, especially in areas like data science, programming, and case law analysis,” OpenAI writes.

Even without pro mode, OpenAI cited significant increases in performance over the o1 preview model on popular math and coding benchmarks (AIME 2024 and Codeforces), and more marginal improvements on a “PhD-level science” benchmark (GPQA Diamond). The increases in scores between o1 and o1 pro mode were much more marginal on these benchmarks.

We’ll likely have more coverage of the full version of o1 once it rolls out widely—and it’s supposed to launch today, accessible to ChatGPT Plus and Team users globally. Enterprise and Edu users will have access next week. At the moment, the ChatGPT Pro subscription is not yet available on our test account.

New secret math benchmark stumps AI models and PhDs alike

Epoch AI allowed Fields Medal winners Terence Tao and Timothy Gowers to review portions of the benchmark. “These are extremely challenging,” Tao said in feedback provided to Epoch. “I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages.”

A chart showing AI models’ limited success on the FrontierMath problems, taken from Epoch AI’s research paper. Credit: Epoch AI

To aid in the verification of correct answers during testing, the FrontierMath problems must have answers that can be automatically checked through computation, either as exact integers or mathematical objects. The designers made problems “guessproof” by requiring large numerical answers or complex mathematical solutions, with less than a 1 percent chance of correct random guesses.
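A minimal sketch of what automatic checking of this kind might look like, and why a large answer space resists guessing. The function names and the sample answer are hypothetical; Epoch AI's actual verification harness is not described here, and the guess probability assumes a uniform pick from a fixed range of candidate integers.

```python
from fractions import Fraction


def check_answer(submitted: int, expected: int) -> bool:
    """Exact-match check: no partial credit and no numeric tolerance."""
    return submitted == expected


def random_guess_probability(num_candidates: int) -> Fraction:
    """Chance that one uniform random guess among num_candidates integers is right."""
    return Fraction(1, num_candidates)


# Made-up value for illustration only, not a real benchmark answer.
expected = 367_707
print(check_answer(367_707, expected))
print(check_answer(367_708, expected))

# With a million plausible integers, a single guess succeeds far less than 1 percent of the time.
print(random_guess_probability(1_000_000) < Fraction(1, 100))
```

Exact integer comparison is what makes the check automatic: there is no grader judgment involved, only equality.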

Mathematician Evan Chen, writing on his blog, explained how he thinks that FrontierMath differs from traditional math competitions like the International Mathematical Olympiad (IMO). Problems in that competition typically require creative insight while avoiding complex implementation and specialized knowledge, he says. But for FrontierMath, “they keep the first requirement, but outright invert the second and third requirement,” Chen wrote.

While IMO problems avoid specialized knowledge and complex calculations, FrontierMath embraces them. “Because an AI system has vastly greater computational power, it’s actually possible to design problems with easily verifiable solutions using the same idea that IOI or Project Euler does—basically, ‘write a proof’ is replaced by ‘implement an algorithm in code,'” Chen explained.

The organization plans regular evaluations of AI models against the benchmark while expanding its problem set. They say they will release additional sample problems in the coming months to help the research community test their systems.

GitHub Copilot moves beyond OpenAI models to support Claude 3.5, Gemini

The large language model-based coding assistant GitHub Copilot will switch from using exclusively OpenAI’s GPT models to a multi-model approach over the coming weeks, GitHub CEO Thomas Dohmke announced in a post on GitHub’s blog.

First, Anthropic’s Claude 3.5 Sonnet will roll out to Copilot Chat’s web and VS Code interfaces over the next few weeks. Google’s Gemini 1.5 Pro will come a bit later.

Additionally, GitHub will soon add support for a wider range of OpenAI models, including o1-preview and o1-mini, which are intended to be stronger at advanced reasoning than GPT-4, the model Copilot has used until now. Developers will be able to switch between the models (even mid-conversation) to tailor the model to fit their needs, and organizations will be able to choose which models will be usable by team members.

The new approach makes sense for users, as certain models are better at certain languages or types of tasks.

“There is no one model to rule every scenario,” wrote Dohmke. “It is clear the next phase of AI code generation will not only be defined by multi-model functionality, but by multi-model choice.”

It starts with the web-based and VS Code Copilot Chat interfaces, but it won’t stop there. “From Copilot Workspace to multi-file editing to code review, security autofix, and the CLI, we will bring multi-model choice across many of GitHub Copilot’s surface areas and functions soon,” Dohmke wrote.

There are a handful of additional changes coming to GitHub Copilot, too, including extensions, the ability to manipulate multiple files at once from a chat with VS Code, and a preview of Xcode support.

GitHub Spark promises natural language app development

In addition to the Copilot changes, GitHub announced Spark, a natural language tool for developing apps. Non-coders will be able to use a series of natural language prompts to create simple apps, while coders will be able to tweak more precisely as they go. In either use case, you’ll be able to take a conversational approach, requesting changes and iterating as you go, and comparing different iterations.

OpenAI releases ChatGPT app for Windows

On Thursday, OpenAI released an early version of its first ChatGPT app for Windows, following a Mac version that launched in May. Currently, it’s only available to subscribers of Plus, Team, Enterprise, and Edu versions of ChatGPT, and users can download it for free in the Microsoft Store for Windows.

OpenAI is positioning the release as a beta test. “This is an early version, and we plan to bring the full experience to all users later this year,” OpenAI writes on the Microsoft Store entry for the app. (Interestingly, ChatGPT shows up as being rated “T for Teen” by the ESRB in the Windows store, despite not being a video game.)

A screenshot of the new Windows ChatGPT app captured on October 18, 2024. Credit: Benj Edwards

Upon opening the app, users must log into a paying ChatGPT account, and from there, the app is basically identical to the web browser version of ChatGPT. You can currently use it to access several models: GPT-4o, GPT-4o with Canvas, o1-preview, o1-mini, GPT-4o mini, and GPT-4. Also, it can generate images using DALL-E 3 or analyze uploaded files and images.

If you’re running Windows 11, you can instantly call up a small ChatGPT window when the app is open using an Alt+Space shortcut (it did not work in Windows 10 when we tried). That could be handy for asking ChatGPT a quick question at any time.

A screenshot of the new Windows ChatGPT app listing in the Microsoft Store captured on October 18, 2024. Credit: Benj Edwards

And just like the web version, all the AI processing takes place in the cloud on OpenAI’s servers, which means an Internet connection is required.

So as usual, chat like somebody’s watching, and don’t rely on ChatGPT as a factual reference for important decisions—GPT-4o in particular is great at telling you what you want to hear, whether it’s correct or not. As OpenAI says in a small disclaimer at the bottom of the app window: “ChatGPT can make mistakes.”

OpenAI’s Canvas can translate code between languages with a click

Coding shortcuts in canvas include reviewing code, adding logs for debugging, inserting comments, fixing bugs, and porting code to different programming languages. For example, if your code is JavaScript, with a few clicks it can become PHP, TypeScript, Python, C++, or Java. As with GPT-4o by itself, you’ll probably still have to check it for mistakes.

A screenshot of coding using ChatGPT with Canvas captured on October 4, 2024. Credit: Benj Edwards

Also, users can highlight specific sections to direct ChatGPT’s focus, and the AI model can provide inline feedback and suggestions while considering the entire project, much like a copy editor or code reviewer. And the interface makes it easy to restore previous versions of a working document using a back button in the Canvas interface.

A new AI model

OpenAI says its research team developed new core behaviors for GPT-4o to support Canvas, including triggering the canvas for appropriate tasks, generating certain content types, making targeted edits, rewriting documents, and providing inline critique.

An image of OpenAI’s Canvas in action. Credit: OpenAI

One key challenge in development, according to OpenAI, was defining when to trigger a canvas. In an example on the Canvas blog post, the team says it taught the model to open a canvas for prompts like “Write a blog post about the history of coffee beans” while avoiding triggering Canvas for general Q&A tasks like “Help me cook a new recipe for dinner.”

Another challenge involved tuning the model’s editing behavior once canvas was triggered, specifically deciding between targeted edits and full rewrites. The team trained the model to perform targeted edits when users specifically select text through the interface, otherwise favoring rewrites.
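The editing policy described above can be restated as a simple decision function. This is only an illustrative sketch with hypothetical names: OpenAI says it trained this behavior into the model, so nothing here reflects how Canvas is actually implemented.

```python
from enum import Enum
from typing import Optional


class EditMode(Enum):
    TARGETED_EDIT = "targeted_edit"
    FULL_REWRITE = "full_rewrite"


def choose_edit_mode(selected_text: Optional[str]) -> EditMode:
    """Targeted edit when the user has selected text, otherwise a rewrite."""
    if selected_text and selected_text.strip():
        return EditMode.TARGETED_EDIT
    return EditMode.FULL_REWRITE


print(choose_edit_mode("the second paragraph"))
print(choose_edit_mode(None))
```

In this sketch an empty or whitespace-only selection counts as no selection, which is a design assumption of the example rather than documented Canvas behavior.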

The company noted that canvas represents the first major update to ChatGPT’s visual interface since its launch two years ago. While canvas is still in early beta, OpenAI plans to improve its capabilities based on user feedback over time.

Secret calculator hack brings ChatGPT to the TI-84, enabling easy cheating

Breaking free of “test mode”

Tiny device installed inside TI-84 enables Wi-Fi Internet, access to AI chatbot.

An OpenAI logo on a TI-84 calculator screen.

On Saturday, a YouTube creator called “ChromaLock” published a video detailing how he modified a Texas Instruments TI-84 graphing calculator to connect to the Internet and access OpenAI’s ChatGPT, potentially enabling students to cheat on tests. The video, titled “I Made The Ultimate Cheating Device,” demonstrates a custom hardware modification that allows users of the graphing calculator to type in problems sent to ChatGPT using the keypad and receive live responses on the screen.

ChromaLock began by exploring the calculator’s link port, typically used for transferring educational programs between devices. He then designed a custom circuit board he calls “TI-32” that incorporates a tiny Wi-Fi-enabled microcontroller, the Seeed Studio ESP32-C3 (which costs about $5), along with other components to interface with the calculator’s systems.

It’s worth noting that the TI-32 hack isn’t a commercial project. Replicating ChromaLock’s work would involve purchasing a TI-84 calculator, a Seeed Studio ESP32-C3 microcontroller, and various electronic components, and fabricating a custom PCB based on ChromaLock’s design, which is available online.

The creator says he encountered several engineering challenges during development, including voltage incompatibilities and signal integrity issues. After developing multiple versions, ChromaLock successfully installed the custom board into the calculator’s housing without any visible signs of modifications from the outside.

“I Made The Ultimate Cheating Device” YouTube Video.

To accompany the hardware, ChromaLock developed custom software for the microcontroller and the calculator, which is available open source on GitHub. The system simulates another TI-84, allowing people to use the calculator’s built-in “send” and “get” commands to transfer files. This allows a user to easily download a launcher program that provides access to various “applets” designed for cheating.

One of the applets is a ChatGPT interface that might be most useful for answering short questions, but it has a drawback in that it’s slow and cumbersome to type in long alphanumeric questions on the limited keypad.

Beyond the ChatGPT interface, the device offers several other cheating tools. An image browser allows users to access pre-prepared visual aids stored on the central server. The app browser feature enables students to download not only games for post-exam entertainment but also text-based cheat sheets disguised as program source code. ChromaLock even hinted at a future video discussing a camera feature, though details were sparse in the current demo.

ChromaLock claims his new device can bypass common anti-cheating measures. The launcher program can be downloaded on-demand, avoiding detection if a teacher inspects or clears the calculator’s memory before a test. The modification can also supposedly break calculators out of “Test Mode,” a locked-down state used to prevent cheating.

While the video presents the project as a technical achievement, consulting ChatGPT during a test on your calculator almost certainly represents an ethical breach and/or a form of academic dishonesty that could get you in serious trouble at most schools. So tread carefully, study hard, and remember to eat your Wheaties.

Debate over “open source AI” term brings new push to formalize definition

A man peers over a glass partition, seeking transparency.

The Open Source Initiative (OSI) recently unveiled its latest draft definition for “open source AI,” aiming to clarify the ambiguous use of the term in the fast-moving field. The move comes as some companies like Meta release trained AI language model weights and code with usage restrictions while using the “open source” label. This has sparked intense debates among free-software advocates about what truly constitutes “open source” in the context of AI.

For instance, Meta’s Llama 3 model, while freely available, doesn’t meet the traditional open source criteria as defined by the OSI for software because it imposes license restrictions on usage due to company size or what type of content is produced with the model. The AI image generator Flux is another “open” model that is not truly open source. Because of this type of ambiguity, we’ve typically described AI models that include code or weights with restrictions or lack accompanying training data with alternative terms like “open-weights” or “source-available.”

To address the issue formally, the OSI, which is well known for its advocacy of open software standards, has assembled a group of about 70 participants, including researchers, lawyers, policymakers, and activists. Representatives from major tech companies like Meta, Google, and Amazon also joined the effort. The group’s current draft definition of open source AI (version 0.0.9) emphasizes “four fundamental freedoms” reminiscent of those defining free software: letting users of the AI system use it for any purpose without asking permission, study how it works, modify it for any purpose, and share it with or without modifications.

By establishing clear criteria for open source AI, the organization hopes to provide a benchmark against which AI systems can be evaluated. This will likely help developers, researchers, and users make more informed decisions about the AI tools they create, study, or use.

Truly open source AI may also shed light on potential software vulnerabilities of AI systems, since researchers will be able to see how the AI models work behind the scenes. Compare this approach with an opaque system such as OpenAI’s ChatGPT, which is more than just a GPT-4o large language model with a fancy interface—it’s a proprietary system of interlocking models and filters, and its precise architecture is a closely guarded secret.

OSI’s project timeline indicates that a stable version of the “open source AI” definition is expected to be announced in October at the All Things Open 2024 event in Raleigh, North Carolina.

“Permissionless innovation”

In a press release from May, the OSI emphasized the importance of defining what open source AI really means. “AI is different from regular software and forces all stakeholders to review how the Open Source principles apply to this space,” said Stefano Maffulli, executive director of the OSI. “OSI believes that everybody deserves to maintain agency and control of the technology. We also recognize that markets flourish when clear definitions promote transparency, collaboration and permissionless innovation.”

The organization’s most recent draft definition extends beyond just the AI model or its weights, encompassing the entire system and its components.

For an AI system to qualify as open source, it must provide access to what the OSI calls the “preferred form to make modifications.” This includes detailed information about the training data, the full source code used for training and running the system, and the model weights and parameters. All these elements must be available under OSI-approved licenses or terms.

Notably, the draft doesn’t mandate the release of raw training data. Instead, it requires “data information”—detailed metadata about the training data and methods. This includes information on data sources, selection criteria, preprocessing techniques, and other relevant details that would allow a skilled person to re-create a similar system.
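One way to read the draft's requirements is as a checklist. The sketch below encodes that reading in Python. The class, field names, and example releases are hypothetical illustrations, not an official OSI tool or a ruling on any real model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AIRelease:
    """Hypothetical summary of what ships with an AI system's release."""
    name: str
    data_information: bool         # detailed metadata about training data and methods
    source_code: bool              # full code used for training and running the system
    weights_and_parameters: bool   # the trained model weights and parameters
    osi_approved_terms: bool       # everything above offered under OSI-approved terms
    raw_training_data: bool = False  # optional: the draft does not mandate it


def missing_requirements(release: AIRelease) -> List[str]:
    """List the draft's required elements that this release lacks."""
    required = {
        "data information": release.data_information,
        "source code": release.source_code,
        "weights and parameters": release.weights_and_parameters,
        "OSI-approved licenses or terms": release.osi_approved_terms,
    }
    return [label for label, present in required.items() if not present]


if __name__ == "__main__":
    # Both releases are made up for illustration.
    weights_only = AIRelease("example-open-weights", False, False, True, False)
    full_release = AIRelease("example-full-release", True, True, True, True)
    for release in (weights_only, full_release):
        gaps = missing_requirements(release)
        status = "meets the checklist" if not gaps else f"missing: {gaps}"
        print(release.name, "->", status)
```

Note that `raw_training_data` never appears in the required set, mirroring the draft's choice to ask for "data information" rather than the dataset itself.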

The “data information” approach aims to provide transparency and replicability without necessarily disclosing the actual dataset, ostensibly addressing potential privacy or copyright concerns while sticking to open source principles, though that particular point may be up for further debate.

“The most interesting thing about [the definition] is that they’re allowing training data to NOT be released,” said independent AI researcher Simon Willison in a brief Ars interview about the OSI’s proposal. “It’s an eminently pragmatic approach—if they didn’t allow that, there would be hardly any capable ‘open source’ models.”

Elon Musk sues OpenAI, Sam Altman for making a “fool” out of him

“Altman’s long con” —

Elon Musk asks court to void Microsoft’s exclusive deal with OpenAI.

Elon Musk and Sam Altman share the stage in 2015, the same year that Musk alleged that Altman’s “deception” began.

After withdrawing his lawsuit in June for unknown reasons, Elon Musk has revived a complaint accusing OpenAI and its CEO Sam Altman of fraudulently inducing Musk to contribute $44 million in seed funding by promising that OpenAI would always open-source its technology and prioritize serving the public good over profits as a permanent nonprofit.

Instead, Musk alleged that Altman and his co-conspirators, “preying on Musk’s humanitarian concern about the existential dangers posed by artificial intelligence,” always intended to “betray” these promises in pursuit of personal gains.

As OpenAI’s technology advanced toward artificial general intelligence (AGI) and strove to surpass human capabilities, “Altman set the bait and hooked Musk with sham altruism then flipped the script as the non-profit’s technology approached AGI and profits neared, mobilizing Defendants to turn OpenAI, Inc. into their personal piggy bank and OpenAI into a moneymaking bonanza, worth billions,” Musk’s complaint said.

Where Musk saw OpenAI as his chance to fund a meaningful rival to stop Google from controlling the most powerful AI, Altman and others “wished to launch a competitor to Google” and allegedly deceived Musk to do it. According to Musk:

The idea Altman sold Musk was that a non-profit, funded and backed by Musk, would attract world-class scientists, conduct leading AI research and development, and, as a meaningful counterweight to Google’s DeepMind in the race for Artificial General Intelligence (“AGI”), decentralize its technology by making it open source. Altman assured Musk that the non-profit structure guaranteed neutrality and a focus on safety and openness for the benefit of humanity, not shareholder value. But as it turns out, this was all hot-air philanthropy—the hook for Altman’s long con.

Without Musk’s involvement and funding during OpenAI’s “first five critical years,” Musk’s complaint said, “it is fair to say” that “there would have been no OpenAI.” And when Altman and others repeatedly approached Musk with plans to shift OpenAI to a for-profit model, Musk held strong to his morals, conditioning his ongoing contributions on OpenAI remaining a nonprofit and its tech largely remaining open source.

“Either go do something on your own or continue with OpenAI as a nonprofit,” Musk told Altman in 2018 when Altman tried to “recast the nonprofit as a moneymaking endeavor to bring in shareholders, sell equity, and raise capital.”

“I will no longer fund OpenAI until you have made a firm commitment to stay, or I’m just being a fool who is essentially providing free funding to a startup,” Musk said at the time. “Discussions are over.”

But discussions weren’t over. And now Musk seemingly does feel like a fool after OpenAI exclusively licensed GPT-4 and all “pre-AGI” technology to Microsoft in 2023, while putting up paywalls and “failing to publicly disclose the non-profit’s research and development, including details on GPT-4, GPT-4T, and GPT-4o’s architecture, hardware, training method, and training computation.” This excluded the public “from open usage of GPT-4 and related technology to advance Defendants and Microsoft’s own commercial interests,” Musk alleged.

Now Musk has revived his suit against OpenAI, asking the court to award maximum damages for OpenAI’s alleged fraud, contract breaches, false advertising, acts viewed as unfair to competition, and other violations.

He has also asked the court to determine a very technical question: whether OpenAI’s most recent models should be considered AGI and Microsoft’s license therefore voided. That’s the only way to ensure that a private corporation isn’t controlling OpenAI’s AGI models, an outcome that Musk repeatedly made his financial contributions conditional on preventing.

“Musk contributed considerable money and resources to launch and sustain OpenAI, Inc., which was done on the condition that the endeavor would be and remain a non-profit devoted to openly sharing its technology with the public and avoid concentrating its power in the hands of the few,” Musk’s complaint said. “Defendants knowingly and repeatedly accepted Musk’s contributions in order to develop AGI, with no intention of honoring those conditions once AGI was in reach. Case in point: GPT-4, GPT-4T, and GPT-4o are all closed source and shrouded in secrecy, while Defendants actively work to transform the non-profit into a thoroughly commercial business.”

Musk wants Microsoft’s GPT-4 license voided

Musk also asked the court to declare OpenAI’s exclusive license to Microsoft null and void, or else determine “whether GPT-4, GPT-4T, GPT-4o, and other OpenAI next generation large language models constitute AGI and are thus excluded from Microsoft’s license.”

It’s clear that Musk considers these models to be AGI, and he’s alleged that Altman’s current control of OpenAI’s Board—after firing dissidents in 2023 whom Musk claimed tried to get Altman ousted for prioritizing profits over AI safety—gives Altman the power to obscure when OpenAI’s models constitute AGI.

The first GPT-4-class AI model anyone can download has arrived: Llama 405B

A new llama emerges —

“Open source AI is the path forward,” says Mark Zuckerberg, misusing the term.

A red llama in a blue desert illustration based on a photo.

In the AI world, there’s a buzz in the air about a new AI language model released Tuesday by Meta: Llama 3.1 405B. The reason? It’s potentially the first time anyone can download a GPT-4-class large language model (LLM) for free and run it on their own hardware. You’ll still need some beefy hardware: Meta says it can run on a “single server node,” which isn’t desktop PC-grade equipment. But it’s a provocative shot across the bow of “closed” AI model vendors such as OpenAI and Anthropic.

“Llama 3.1 405B is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation,” says Meta. Company CEO Mark Zuckerberg calls 405B “the first frontier-level open source AI model.”

In the AI industry, “frontier model” is a term for an AI system designed to push the boundaries of current capabilities. In this case, Meta is positioning 405B among the likes of the industry’s top AI models, such as OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro.

A chart published by Meta suggests that 405B gets very close to matching the performance of GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).

But as we’ve noted many times since March, these benchmarks aren’t necessarily scientifically sound, nor do they necessarily translate to the subjective experience of interacting with AI language models. In fact, this traditional slate of AI benchmarks is so generally useless to laypeople that even Meta’s PR department now just posts a few images of charts and doesn’t even try to explain them in any detail.

A Meta-provided chart that shows Llama 3.1 405B benchmark results versus other major AI models.

We’ve instead found that measuring the subjective experience of using a conversational AI model (through what might be called “vibemarking”) on A/B leaderboards like Chatbot Arena is a better way to judge new LLMs. In the absence of Chatbot Arena data, Meta has provided the results of its own human evaluations of 405B’s outputs that seem to show Meta’s new model holding its own against GPT-4 Turbo and Claude 3.5 Sonnet.

A Meta-provided chart that shows how humans rated Llama 3.1 405B’s outputs compared to GPT-4 Turbo, GPT-4o, and Claude 3.5 Sonnet in its own studies.

Whatever the benchmarks, early word on the street (after the model leaked on 4chan yesterday) seems to match the claim that 405B is roughly equivalent to GPT-4. It took a lot of expensive computer training time to get there—and money, of which the social media giant has plenty to burn. Meta trained the 405B model on over 15 trillion tokens of training data scraped from the web (then parsed, filtered, and annotated by Llama 2), using more than 16,000 H100 GPUs.

So what’s with the 405B name? In this case, “405B” means 405 billion parameters, and parameters are numerical values that store trained information in a neural network. More parameters translate to a larger neural network powering the AI model, which generally (but not always) means more capability, such as better ability to make contextual connections between concepts. But larger-parameter models have a tradeoff in needing more computing power (AKA “compute”) to run.
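To make that compute tradeoff concrete, the memory needed just to hold a model's weights can be estimated by multiplying the parameter count by the bytes used to store each parameter. The sketch below is back-of-the-envelope arithmetic, not a figure from Meta. The storage precisions (16-bit, 8-bit, 4-bit) are assumptions chosen for illustration, and the estimate ignores the extra memory that real inference needs for activations and the attention cache.

```python
# Back-of-the-envelope estimate of the memory needed just to hold model weights.
# Assumption: every parameter is stored at the same precision. Real deployments
# also need memory for activations and the attention (KV) cache.

PARAMS_405B = 405_000_000_000  # "405B" = 405 billion parameters

# Bytes per parameter for some common storage precisions (illustrative choices).
BYTES_PER_PARAM = {
    "16-bit": 2.0,
    "8-bit": 1.0,
    "4-bit": 0.5,
}


def weights_gigabytes(num_params: int, bytes_per_param: float) -> float:
    """Return the weights-only footprint in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for precision, size in BYTES_PER_PARAM.items():
        print(f"{precision}: {weights_gigabytes(PARAMS_405B, size):,.1f} GB")
```

At 16 bits per parameter, the weights alone come to about 810 GB, which helps explain why the model calls for server-class hardware rather than a desktop PC.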

We’ve been expecting the release of a 400 billion-plus parameter model of the Llama 3 family since Meta gave word that it was training one in April, and today’s announcement isn’t just about the biggest member of the Llama 3 family: There’s an entirely new iteration of improved Llama models with the designation “Llama 3.1.” That includes upgraded versions of its smaller 8B and 70B models, which now feature multilingual support and an extended context length of 128,000 tokens (the “context length” is roughly the working memory capacity of the model, and “tokens” are chunks of data used by LLMs to process information).

Meta says that 405B is useful for long-form text summarization, multilingual conversational agents, and coding assistants and for creating synthetic data used to train future AI language models. Notably, that last use-case—allowing developers to use outputs from Llama models to improve other AI models—is now officially supported by Meta’s Llama 3.1 license for the first time.

Abusing the term “open source”

Llama 3.1 405B is an open-weights model, which means anyone can download the trained neural network files and run them or fine-tune them. That directly challenges a business model where companies like OpenAI keep the weights to themselves and instead monetize the model through subscription wrappers like ChatGPT or charge for access by the token through an API.

Fighting the “closed” AI model is a big deal to Mark Zuckerberg, who simultaneously released a 2,300-word manifesto today on why the company believes in open releases of AI models, titled, “Open Source AI Is the Path Forward.” More on the terminology in a minute. But briefly, he writes about the need for customizable AI models that offer user control and encourage better data security, higher cost-efficiency, and better future-proofing, as opposed to vendor-locked solutions.

All that sounds reasonable, but undermining your competitors using a model subsidized by a social media war chest is also an efficient way to play spoiler in a market where you might not always win with the most cutting-edge tech. That benefits Meta, Zuckerberg says, because he doesn’t want to get locked into a system where companies like his have to pay a toll to access AI capabilities, drawing comparisons to “taxes” Apple levies on developers through its App Store.

A screenshot of Mark Zuckerberg’s essay, “Open Source AI Is the Path Forward,” published on July 23, 2024.

So, about that “open source” term. As we first wrote in an update to our Llama 2 launch article a year ago, “open source” has a very particular meaning that has traditionally been defined by the Open Source Initiative. The AI industry has not yet settled on terminology for AI model releases that ship either code or weights with restrictions (such as Llama 3.1) or that ship without providing training data. We’ve been calling these releases “open weights” instead.

Unfortunately for terminology sticklers, Zuckerberg has now baked the erroneous “open source” label into the title of his potentially historic aforementioned essay on open AI releases, so fighting for the correct term in AI may be a losing battle. Still, his usage annoys people like independent AI researcher Simon Willison, who likes Zuckerberg’s essay otherwise.

“I see Zuck’s prominent misuse of ‘open source’ as a small-scale act of cultural vandalism,” Willison told Ars Technica. “Open source should have an agreed meaning. Abusing the term weakens that meaning which makes the term less generally useful, because if someone says ‘it’s open source,’ that no longer tells me anything useful. I have to then dig in and figure out what they’re actually talking about.”

The Llama 3.1 models are available for download through Meta’s own website and on Hugging Face. They both require providing contact information and agreeing to a license and an acceptable use policy, which means that Meta can technically legally pull the rug out from under your use of Llama 3.1 or its outputs at any time.

OpenAI’s new “CriticGPT” model is trained to criticize GPT-4 outputs

Automated critic —

Research model catches bugs in AI-generated code, improving human oversight of AI.

An illustration created by OpenAI.

On Thursday, OpenAI researchers unveiled CriticGPT, a new AI model designed to identify mistakes in code generated by ChatGPT. It aims to enhance the process of making AI systems behave in ways humans want (called “alignment”) through Reinforcement Learning from Human Feedback (RLHF), which helps human reviewers make large language model (LLM) outputs more accurate.

As outlined in a new research paper called “LLM Critics Help Catch LLM Bugs,” OpenAI created CriticGPT to act as an AI assistant to human trainers who review programming code generated by the ChatGPT AI assistant. CriticGPT, which is based on the GPT-4 family of LLMs, analyzes the code and points out potential errors, making it easier for humans to spot mistakes that might otherwise go unnoticed. The researchers trained CriticGPT on a dataset of code samples with intentionally inserted bugs, teaching it to recognize and flag various coding errors.

The researchers found that CriticGPT’s critiques were preferred by annotators over human critiques in 63 percent of cases involving naturally occurring LLM errors and that human-machine teams using CriticGPT wrote more comprehensive critiques than humans alone while reducing confabulation (hallucination) rates compared to AI-only critiques.

Developing an automated critic

The development of CriticGPT involved training the model on a large number of inputs containing deliberately inserted mistakes. Human trainers were asked to modify code written by ChatGPT, introducing errors and then providing example feedback as if they had discovered these bugs. This process allowed the model to learn how to identify and critique various types of coding errors.

In experiments, CriticGPT demonstrated its ability to catch both inserted bugs and naturally occurring errors in ChatGPT’s output. The new model’s critiques were preferred by trainers over those generated by ChatGPT itself in 63 percent of cases involving natural bugs (the aforementioned statistic). This preference was partly due to CriticGPT producing fewer unhelpful “nitpicks” and generating fewer false positives, or hallucinated problems.

The researchers also created a new technique they call Force Sampling Beam Search (FSBS). This method helps CriticGPT write more detailed reviews of code. It lets the researchers adjust how thorough CriticGPT is in looking for problems, while also controlling how often it might make up issues that don’t really exist. They can tweak this balance depending on what they need for different AI training tasks.
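The thoroughness-versus-false-alarm balance described above can be pictured with a toy selection rule: generate several candidate critiques, then pick the one with the best combined score, where a tunable bonus rewards flagging more issues. This is only a sketch of the trade-off, not OpenAI's implementation. The class, the scoring rule, and every number below are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateCritique:
    """Hypothetical candidate review of a piece of code."""
    text: str
    quality_score: float     # assumed score from some quality/reward model
    num_flagged_issues: int  # how many problems the critique points out


def pick_critique(
    candidates: List[CandidateCritique], thoroughness_bonus: float
) -> CandidateCritique:
    """Pick the candidate maximizing quality plus a per-issue bonus.

    A bonus of 0 favors the most precise critique; a larger bonus favors
    critiques that flag more issues, at the risk of more made-up problems.
    """
    return max(
        candidates,
        key=lambda c: c.quality_score + thoroughness_bonus * c.num_flagged_issues,
    )


if __name__ == "__main__":
    candidates = [
        CandidateCritique("short and precise", quality_score=0.9, num_flagged_issues=1),
        CandidateCritique("long and thorough", quality_score=0.6, num_flagged_issues=4),
    ]
    for bonus in (0.0, 0.2):
        chosen = pick_critique(candidates, bonus)
        print(f"bonus={bonus}: {chosen.text}")
```

Raising `thoroughness_bonus` shifts the choice from the precise critique to the thorough one, which is the kind of dial the researchers describe adjusting for different training tasks.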

Interestingly, the researchers found that CriticGPT’s capabilities extend beyond just code review. In their experiments, they applied the model to a subset of ChatGPT training data that had previously been rated as flawless by human annotators. Surprisingly, CriticGPT identified errors in 24 percent of these cases—errors that were subsequently confirmed by human reviewers. OpenAI thinks this demonstrates the model’s potential to generalize to non-code tasks and highlights its ability to catch subtle mistakes that even careful human evaluation might miss.

Despite its promising results, like all AI models, CriticGPT has limitations. The model was trained on relatively short ChatGPT answers, which may not fully prepare it for evaluating longer, more complex tasks that future AI systems might tackle. Additionally, while CriticGPT reduces confabulations, it doesn’t eliminate them entirely, and human trainers can still make labeling mistakes based on these false outputs.

The research team acknowledges that CriticGPT is most effective at identifying errors that can be pinpointed in one specific location within the code. However, real-world mistakes in AI outputs can often be spread across multiple parts of an answer, presenting a challenge for future iterations of the model.

OpenAI plans to integrate CriticGPT-like models into its RLHF labeling pipeline, providing its trainers with AI assistance. For OpenAI, it’s a step toward developing better tools for evaluating outputs from LLM systems that may be difficult for humans to rate without additional support. However, the researchers caution that even with tools like CriticGPT, extremely complex tasks or responses may still prove challenging for human evaluators—even those assisted by AI.
