OpenAI is making ChatGPT work a little more like older digital assistants with a new feature called Tasks, as reported by TechCrunch and others.
Currently in beta, Tasks allows users to direct the chatbot to send reminders or to generate responses to specific prompts at certain times; recurring tasks are also supported.
The feature is available to Plus, Team, and Pro subscribers starting today, while free users don’t have access.
To create a task, users need to select “4o with scheduled tasks” from the model picker and then direct ChatGPT using the same kind of plain language text prompts that drive everything else it does. ChatGPT will sometimes suggest tasks, too, but they won’t go into effect unless the user approves them.
The user can then make changes to assigned tasks through the same chat conversation, or they can use a new Tasks section of the ChatGPT apps to manage all currently assigned items. There’s currently a 10-task limit.
When the time comes to perform an assigned task, the ChatGPT mobile or desktop app will send a notification on schedule.
This update can be seen as OpenAI’s first step into the agentic AI space, where applications built using deep learning can operate relatively independently within certain boundaries, either replacing or easing the day-to-day responsibilities of information workers.
I’d describe myself as a skeptic of the generative AI revolution—I think the technology as it currently exists is situationally impressive and useful for specific kinds of tasks, but broadly oversold. I’m not sure it will vanish from relevance to quite the extent that other tech fads like the metaverse or NFTs did, but my suspicion is that companies like Nvidia and OpenAI are riding a bubble that will pop or deflate over time as more companies and individuals run up against the technology’s limitations, and as it fails to advance as quickly or as impressively as its most ardent boosters are predicting.
Maybe you agree with me and maybe you don’t! I’m not necessarily trying to convince you one way or the other. But I am here to say that even if you agree with me, we can all celebrate the one unambiguously positive thing that the generative AI hype cycle has done for computers this year: the RAM floor for many PCs and all Macs is now finally 16GB instead of 8GB.
Companies like Apple and Microsoft have, for years, created attractive, high-powered hardware with 8GB of memory in it, most egregiously in $1,000-and-up putative “pro” computers like last year’s $1,599 M3 MacBook Pro or the Surface Pro 9.
This meant that, for the kinds of power users and professionals drawn to these machines, their starting prices were effectively mirages; “pay for 16GB if you can” has been my blanket advice to MacBook buyers for years now, since there’s basically no workload (including Just Browsing The Web) that won’t benefit at least a little. It also leaves more headroom for future software bloat and future hobby discovery. Did you buy an 8GB Mac, and then decide you wanted to try software development, photo or video editing, CAD design, or Logic Pro? Good luck!
What do eating rocks, rat genitals, and Willy Wonka have in common? AI, of course.
It’s been a wild year in tech thanks to the intersection between humans and artificial intelligence. 2024 brought a parade of AI oddities, mishaps, and wacky moments that inspired odd behavior from both machines and man. From AI-generated rat genitals to search engines telling people to eat rocks, this year proved that AI has been having a weird impact on the world.
Why the weirdness? If we had to guess, it may be due to the novelty of it all. Generative AI and applications built upon Transformer-based AI models are still so new that people are throwing everything at the wall to see what sticks. People have been struggling to grasp both the implications and potential applications of the new technology. Riding along with the hype, different types of AI that may end up being ill-advised, such as automated military targeting systems, have also been introduced.
It’s worth mentioning that aside from crazy news, we saw fewer weird AI advances in 2024 as well. For example, Claude 3.5 Sonnet, launched in June, held off the competition as a top model for most of the year, while OpenAI’s o1 used runtime compute to expand GPT-4o’s capabilities with simulated reasoning. Advanced Voice Mode and NotebookLM also emerged as novel applications of AI tech, and the year saw the rise of more capable music synthesis models and also better AI video generators, including several from China.
But for now, let’s get down to the weirdness.
ChatGPT goes insane
Early in the year, things got off to an exciting start when OpenAI’s ChatGPT experienced a significant technical malfunction that caused the AI model to generate increasingly incoherent responses, prompting users on Reddit to describe the system as “having a stroke” or “going insane.” During the glitch, ChatGPT’s responses would begin normally but then deteriorate into nonsensical text, sometimes mimicking Shakespearean language.
OpenAI later revealed that a bug in how the model processed language caused it to select the wrong words during text generation, leading to nonsense outputs (basically the text version of what we at Ars now call “jabberwockies”). The company fixed the issue within 24 hours, but the incident led to frustrations about the black box nature of commercial AI systems and users’ tendency to anthropomorphize AI behavior when it malfunctions.
The great Wonka incident
The collision between AI-generated imagery and consumer expectations fueled human frustrations in February when Scottish families discovered that “Willy’s Chocolate Experience,” an unlicensed Wonka-ripoff event promoted using AI-generated wonderland images, turned out to be little more than a sparse warehouse with a few modest decorations.
Parents who paid £35 per ticket encountered a situation so dire they called the police, with children reportedly crying at the sight of a person in what attendees described as a “terrifying outfit.” The event, created by House of Illuminati in Glasgow, promised fantastical spaces like an “Enchanted Garden” and “Twilight Tunnel” but delivered an underwhelming experience that forced organizers to shut down mid-way through its first day and issue refunds.
While the show was a bust, it brought us an iconic new meme for job disillusionment in the form of a photo: the green-haired Willy’s Chocolate Experience employee who looked like she’d rather be anywhere else on earth at that moment.
Mutant rat genitals expose peer review flaws
In February, Ars Technica senior health reporter Beth Mole covered a peer-reviewed paper published in Frontiers in Cell and Developmental Biology that created an uproar in the scientific community when researchers discovered it contained nonsensical AI-generated images, including an anatomically incorrect rat with oversized genitals. The paper, authored by scientists at Xi’an Honghui Hospital in China, openly acknowledged using Midjourney to create figures that contained gibberish text labels like “Stemm cells” and “iollotte sserotgomar.”
The publisher, Frontiers, posted an expression of concern about the article titled “Cellular functions of spermatogonial stem cells in relation to JAK/STAT signaling pathway” and launched an investigation into how the obviously flawed imagery passed through peer review. Scientists across social media platforms expressed dismay at the incident, which mirrored concerns about AI-generated content infiltrating academic publishing.
Chatbot makes erroneous refund promises for Air Canada
If, say, ChatGPT gives you the wrong name for one of the seven dwarves, it’s not such a big deal. But in February, Ars senior policy reporter Ashley Belanger covered a case of costly AI confabulation in the wild. In the course of online text conversations, Air Canada’s customer service chatbot gave customers inaccurate refund policy information. The airline faced legal consequences later when a tribunal ruled the airline must honor commitments made by the automated system. Tribunal adjudicator Christopher Rivers determined that Air Canada bore responsibility for all information on its website, regardless of whether it came from a static page or AI interface.
The case set a precedent for how companies deploying AI customer service tools could face legal obligations for automated systems’ responses, particularly when they fail to warn users about potential inaccuracies. Ironically, the airline had reportedly spent more on the initial AI implementation than it would have cost to maintain human workers for simple queries, according to Air Canada executive Steve Crocker.
Will Smith lampoons his digital double
In March 2023, a terrible AI-generated video of Will Smith’s AI doppelganger eating spaghetti began making the rounds online. The AI-generated version of the actor gobbled down the noodles in an unnatural and disturbing way. Almost a year later, in February 2024, Will Smith himself posted a parody response video to the viral jabberwocky on Instagram, featuring AI-like deliberately exaggerated pasta consumption, complete with hair-nibbling and finger-slurping antics.
Given the rapid evolution of AI video technology, particularly since OpenAI had just unveiled its Sora video model four days earlier, Smith’s post sparked discussion in his Instagram comments where some viewers initially struggled to distinguish between the genuine footage and AI generation. It was an early sign of “deep doubt” in action as the tech increasingly blurs the line between synthetic and authentic video content.
Robot dogs learn to hunt people with AI-guided rifles
At some point in recent history—somewhere around 2022—someone took a look at robotic quadrupeds and thought it would be a great idea to attach guns to them. A few years later, the US Marine Forces Special Operations Command (MARSOC) began evaluating armed robotic quadrupeds developed by Ghost Robotics. The robot “dogs” integrated Onyx Industries’ SENTRY remote weapon systems, which featured AI-enabled targeting that could detect and track people, drones, and vehicles, though the systems require human operators to authorize any weapons discharge.
The military’s interest in armed robotic dogs followed a broader trend of weaponized quadrupeds entering public awareness. This included viral videos of consumer robots carrying firearms, and later, commercial sales of flame-throwing models. While MARSOC emphasized that weapons were just one potential use case under review, experts noted that the increasing integration of AI into military robotics raised questions about how long humans would remain in control of lethal force decisions.
Microsoft Windows AI is watching
In an era where many people already feel like they have no privacy due to tech encroachments, Microsoft dialed it up to an extreme degree in May. That’s when Microsoft unveiled a controversial Windows 11 feature called “Recall” that continuously captures screenshots of users’ PC activities every few seconds for later AI-powered search and retrieval. The feature, designed for new Copilot+ PCs using Qualcomm’s Snapdragon X Elite chips, promised to help users find past activities, including app usage, meeting content, and web browsing history.
While Microsoft emphasized that Recall would store encrypted snapshots locally and allow users to exclude specific apps or websites, the announcement raised immediate privacy concerns, as Ars senior technology reporter Andrew Cunningham covered. It also came with a technical toll, requiring significant hardware resources, including 256GB of storage space, with 25GB dedicated to storing approximately three months of user activity. After Microsoft pulled the initial test version due to public backlash, Recall later entered public preview in November with reportedly enhanced security measures. But secure spyware is still spyware—Recall, when enabled, still watches nearly everything you do on your computer and keeps a record of it.
Google Search told people to eat rocks
In May, Ars senior gaming reporter Kyle Orland (who assisted commendably with the AI beat throughout the year) covered Google’s newly launched AI Overview feature. It faced immediate criticism when users discovered that it frequently provided false and potentially dangerous information in its search result summaries. Among its most alarming responses, the system advised humans could safely consume rocks, incorrectly citing scientific sources about the geological diet of marine organisms. The system’s other errors included recommending nonexistent car maintenance products, suggesting unsafe food preparation techniques, and confusing historical figures who shared names.
The problems stemmed from several issues, including the AI treating joke posts as factual sources and misinterpreting context from original web content. But most of all, the system relies on web results as indicators of authority, which we called a flawed design. While Google defended the system, stating these errors occurred mainly with uncommon queries, a company spokesperson acknowledged they would use these “isolated examples” to refine their systems. But to this day, AI Overview still makes frequent mistakes.
Stable Diffusion generates body horror
In June, Stability AI’s release of the image synthesis model Stable Diffusion 3 Medium drew criticism online for its poor handling of human anatomy in AI-generated images. Users across social media platforms shared examples of the model producing what we now like to call jabberwockies—AI generation failures with distorted bodies, misshapen hands, and surreal anatomical errors, and many in the AI image-generation community viewed it as a significant step backward from previous image-synthesis capabilities.
Reddit users attributed these failures to Stability AI’s aggressive filtering of adult content from the training data, which apparently impaired the model’s ability to accurately render human figures. The troubled release coincided with broader organizational challenges at Stability AI, including the March departure of CEO Emad Mostaque, multiple staff layoffs, and the exit of three key engineers who had helped develop the technology. Some of those engineers founded Black Forest Labs in August and released Flux, which has become the latest open-weights AI image model to beat.
ChatGPT Advanced Voice imitates human voice in testing
AI voice-synthesis models are master imitators these days, and they are capable of much more than many people realize. In August, we covered a story where OpenAI’s ChatGPT Advanced Voice Mode feature unexpectedly imitated a user’s voice during the company’s internal testing, revealed by OpenAI after the fact in safety testing documentation. To prevent future instances of an AI assistant suddenly speaking in your own voice (which, let’s be honest, would probably freak people out), the company created an output classifier system to prevent unauthorized voice imitation. OpenAI says that Advanced Voice Mode now catches all meaningful deviations from approved system voices.
Independent AI researcher Simon Willison discussed the implications with Ars Technica, noting that while OpenAI restricted its model’s full voice synthesis capabilities, similar technology would likely emerge from other sources within the year. Meanwhile, the rapid advancement of AI voice replication has caused general concern about its potential misuse, although companies like ElevenLabs have already been offering voice cloning services for some time.
San Francisco’s robotic car horn symphony
In August, San Francisco residents got a noisy taste of robo-dystopia when Waymo’s self-driving cars began creating an unexpected nightly disturbance in the South of Market district. In a parking lot off 2nd Street, the cars congregated autonomously every night during rider lulls at 4 am and began honking at each other for extended periods while attempting to park.
Local resident Christopher Cherry’s initial optimism about the robotic fleet’s presence dissolved as the mechanical chorus grew louder each night, affecting residents in nearby high-rises. The nocturnal tech disruption served as a lesson in the unintentional effects of autonomous systems when run in aggregate.
Larry Ellison dreams of all-seeing AI cameras
In September, Oracle co-founder Larry Ellison painted a bleak vision of ubiquitous AI surveillance during a company financial meeting. The 80-year-old database billionaire described a future where AI would monitor citizens through networks of cameras and drones, asserting that the oversight would ensure lawful behavior from both police and the public.
His surveillance predictions reminded us of parallels to existing systems in China, where authorities already used AI to sort surveillance data on citizens as part of the country’s “sharp eyes” campaign from 2015 to 2020. Ellison’s statement reflected the sort of worst-case tech surveillance state scenario—likely antithetical to any sort of free society—that dozens of sci-fi novels of the 20th century warned us about.
A dead father sends new letters home
AI has made many of us do weird things in 2024, including this writer. In October, I used an AI synthesis model called Flux to reproduce my late father’s handwriting with striking accuracy. After scanning 30 samples from his engineering notebooks, I trained the model using computing time that cost less than five dollars. The resulting text captured his distinctive uppercase style, which he developed during his career as an electronics engineer.
I enjoyed creating images showing his handwriting in various contexts, from folder labels to skywriting, and made the trained model freely available online for others to use. While I approached it as a tribute to my father (who would have appreciated the technical achievement), many people found the whole experience weird and somewhat disturbing. The things we unhinged Bing Chat-like journalists do to bring awareness to a topic are sometimes unconventional. So I guess it counts for this list!
For 2025? Expect even more AI
Thanks for reading Ars Technica this past year and following along with our team coverage of this rapidly emerging and expanding field. We appreciate your kind words of support. Ars Technica’s 2024 AI words of the year were: vibemarking, deep doubt, and the aforementioned jabberwocky. The old stalwart “confabulation” also made several notable appearances. Tune in again next year when we continue to try to figure out how to concisely describe novel scenarios in emerging technology by labeling them.
Looking back, our prediction for 2024 in AI last year was “buckle up.” It seems fitting, given the weirdness detailed above. Especially the part about the robot dogs with guns. For 2025, AI will likely inspire more chaos ahead, but also potentially get put to serious work as a productivity tool, so this time, our prediction is “buckle down.”
Finally, we’d like to ask: What was the craziest story about AI in 2024 from your perspective? Whether you love AI or hate it, feel free to suggest your own additions to our list in the comments. Happy New Year!
Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
Over the past month, we’ve seen a rapid cadence of notable AI-related announcements and releases from both Google and OpenAI, and it’s been making the AI community’s head spin. It has also poured fuel on the fire of the OpenAI-Google rivalry, an accelerating game of one-upmanship taking place unusually close to the Christmas holiday.
“How are people surviving with the firehose of AI updates that are coming out,” wrote one user last Friday on X, which is still a hotbed of AI-related conversation. “in the last <24 hours we got gemini flash 2.0 and chatGPT with screenshare, deep research, pika 2, sora, chatGPT projects, anthropic clio, wtf it never ends.”
Rumors travel quickly in the AI world, and people in the AI industry had been expecting OpenAI to ship some major products in December. Once OpenAI announced “12 days of OpenAI” earlier this month, Google jumped into gear and seemingly decided to try to one-up its rival on several counts. So far, the strategy appears to be working, but it’s coming at the cost of the rest of the world being able to absorb the implications of the new releases.
“12 Days of OpenAI has turned into like 50 new @GoogleAI releases,” wrote another X user on Monday. “This past week, OpenAI & Google have been releasing at the speed of a new born startup,” wrote a third X user on Tuesday. “Even their own users can’t keep up. Crazy time we’re living in.”
“Somebody told Google that they could just do things,” wrote a16z partner and AI influencer Justine Moore on X, referring to a common motivational meme telling people they “can just do stuff.”
The Google AI rush
OpenAI’s “12 Days of OpenAI” campaign has included releases of their full o1 model, an upgrade from o1-preview, alongside o1-pro for advanced “reasoning” tasks. The company also publicly launched Sora for video generation, added Projects functionality to ChatGPT, introduced Advanced Voice features with video streaming capabilities, and more.
Did OpenAI’s big holiday event live up to the billing?
Over the past 12 business days, OpenAI has announced a new product or demoed an AI feature every weekday, calling the PR event “12 days of OpenAI.” We’ve covered some of the major announcements, but we thought a day-by-day rundown might be useful for people seeking a comprehensive look at the developments.
The timing and rapid pace of these announcements—particularly in light of Google’s competing releases—illustrates the intensifying competition in AI development. What might normally have been spread across months was compressed into just 12 business days, giving users and developers a lot to process as they head into 2025.
Humorously, we asked ChatGPT what it thought about the whole series of announcements, and it was skeptical that the event even took place. “The rapid-fire announcements over 12 days seem plausible,” wrote ChatGPT-4o, “but might strain credibility without a clearer explanation of how OpenAI managed such an intense release schedule, especially given the complexity of the features.”
But it did happen, and here’s a chronicle of what went down on each day.
Day 1: Thursday, December 5
On the first day of OpenAI, the company released its full o1 model, making it available to ChatGPT Plus and Team subscribers worldwide. The company reported that the model operates faster than its preview version and reduces major errors by 34 percent on complex real-world questions.
The o1 model brings new capabilities for image analysis, allowing users to upload and receive detailed explanations of visual content. OpenAI said it plans to expand o1’s features to include web browsing and file uploads in ChatGPT, with API access coming soon. The API version will support vision tasks, function calling, and structured outputs for system integration.
OpenAI also launched ChatGPT Pro, a $200 subscription tier that provides “unlimited” access to o1, GPT-4o, and Advanced Voice features. Pro subscribers receive an exclusive version of o1 that uses additional computing power for complex problem-solving. Alongside this release, OpenAI announced a grant program that will provide ChatGPT Pro access to 10 medical researchers at established institutions, with plans to extend grants to other fields.
Day 2: Friday, December 6
Day 2 wasn’t as exciting. OpenAI unveiled Reinforcement Fine-Tuning (RFT), a model customization method that will let developers modify “o-series” models for specific tasks. The technique reportedly goes beyond traditional supervised fine-tuning by using reinforcement learning to help models improve their reasoning abilities through repeated iterations. In other words, OpenAI created a new way to train AI models that lets them learn from practice and feedback.
OpenAI says that Berkeley Lab computational researcher Justin Reese tested RFT for researching rare genetic diseases, while Thomson Reuters has created a specialized o1-mini model for its CoCounsel AI legal assistant. The technique requires developers to provide a dataset and evaluation criteria, with OpenAI’s platform managing the reinforcement learning process.
OpenAI plans to release RFT to the public in early 2025 but currently offers limited access through its Reinforcement Fine-Tuning Research Program for researchers, universities, and companies.
Day 3: Monday, December 9
On day 3, OpenAI released Sora, its text-to-video model, as a standalone product now accessible through sora.com for ChatGPT Plus and Pro subscribers. The company says the new version operates faster than the research preview shown in February 2024, when OpenAI first demonstrated the model’s ability to create videos from text descriptions.
The release moved Sora from research preview to a production service, marking OpenAI’s official entry into the video synthesis market. The company published a blog post detailing the subscription tiers and deployment strategy for the service.
Day 4: Tuesday, December 10
On day 4, OpenAI moved its Canvas feature out of beta testing, making it available to all ChatGPT users, including those on free tiers. Canvas provides a dedicated interface for extended writing and coding projects beyond the standard chat format, now with direct integration into the GPT-4o model.
The updated canvas allows users to run Python code within the interface and includes a text-pasting feature for importing existing content. OpenAI added compatibility with custom GPTs and a “show changes” function that tracks modifications to writing and code. The company said Canvas is now on chatgpt.com for web users and also available through a Windows desktop application, with more features planned for future updates.
Day 5: Wednesday, December 11
On day 5, OpenAI announced that ChatGPT would integrate with Apple Intelligence across iOS, iPadOS, and macOS devices. The integration works on iPhone 16 series phones, iPhone 15 Pro models, iPads with A17 Pro or M1 chips and later, and Macs with M1 processors or newer, running their respective latest operating systems.
The integration lets users access ChatGPT’s features (such as they are), including image and document analysis, directly through Apple’s system-level intelligence features. The feature works with all ChatGPT subscription tiers and operates within Apple’s privacy framework. Iffy message summaries remain unaffected by the additions.
Enterprise and Team account users need administrator approval to access the integration.
Day 6: Thursday, December 12
On the sixth day, OpenAI added two features to ChatGPT’s voice capabilities: “video calling” with screen sharing support for ChatGPT Plus and Pro subscribers and a seasonal Santa Claus voice preset.
The new visual Advanced Voice Mode features work through the mobile app, letting users show their surroundings or share their screen with the AI model during voice conversations. While the rollout covers most countries, users in several European nations, including EU member states, Switzerland, Iceland, Norway, and Liechtenstein, will get access at a later date. Enterprise and education users can expect these features in January.
The Santa voice option appears as a snowflake icon in the ChatGPT interface across mobile devices, web browsers, and desktop apps, with conversations in this mode not affecting chat history or memory. Don’t expect Santa to remember what you want for Christmas between sessions.
Day 7: Friday, December 13
OpenAI introduced Projects, a new organizational feature in ChatGPT that lets users group related conversations and files, on day 7. The feature works with the company’s GPT-4o model and provides a central location for managing resources related to specific tasks or topics—kinda like Anthropic’s “Projects” feature.
ChatGPT Plus, Pro, and Team subscribers can currently access Projects through chatgpt.com and the Windows desktop app, with view-only support on mobile devices and macOS. Users can create projects by clicking a plus icon in the sidebar, where they can add files and custom instructions that provide context for future conversations.
OpenAI said it plans to expand Projects in 2025 with support for additional file types, cloud storage integration through Google Drive and Microsoft OneDrive, and compatibility with other models like o1. Enterprise and education users will receive access to Projects in January.
Day 8: Monday, December 16
On day 8, OpenAI expanded its search features in ChatGPT, extending access to all users with free accounts while reportedly adding speed improvements and mobile optimizations. Basically, you can use ChatGPT like a web search engine, although in practice it doesn’t seem to be as comprehensive as Google Search at the moment.
The update includes a new maps interface and integration with Advanced Voice, allowing users to perform searches during voice conversations. The search capability, which previously required a paid subscription, now works across all platforms where ChatGPT operates.
Day 9: Tuesday, December 17
On day 9, OpenAI released its o1 model through its API platform, adding support for function calling, developer messages, and vision processing capabilities. The company also reduced GPT-4o audio pricing by 60 percent and introduced a GPT-4o mini option that costs one-tenth of previous audio rates.
OpenAI also simplified its WebRTC integration for real-time applications and unveiled Preference Fine-Tuning, which provides developers new ways to customize models. The company also launched beta versions of software development kits for the Go and Java programming languages, expanding its toolkit for developers.
Day 10: Wednesday, December 18
On Wednesday, OpenAI did something a little fun and launched voice and messaging access to ChatGPT through a toll-free number (1-800-CHATGPT), as well as WhatsApp. US residents can make phone calls with a 15-minute monthly limit, while global users can message ChatGPT through WhatsApp at the same number.
OpenAI said the release is a way to reach users who lack consistent high-speed Internet access or want to try AI through familiar communication channels, but it’s also just a clever hack. As evidence, OpenAI notes that these new interfaces serve as experimental access points, with more “limited functionality” than the full ChatGPT service, and still recommends existing users continue using their regular ChatGPT accounts for complete features.
Day 11: Thursday, December 19
On Thursday, OpenAI expanded ChatGPT’s desktop app integration to include additional coding environments and productivity software. The update added support for JetBrains IDEs like PyCharm and IntelliJ IDEA, VS Code variants including Cursor and VSCodium, and text editors such as BBEdit and TextMate.
OpenAI also included integration with Apple Notes, Notion, and Quip while adding Advanced Voice Mode compatibility when working with desktop applications. These features require manual activation for each app and remain available to paid subscribers, including Plus, Pro, Team, Enterprise, and Education users, with Enterprise and Education customers needing administrator approval to enable the functionality.
Day 12: Friday, December 20
On Friday, OpenAI concluded its twelve days of announcements by previewing two new simulated reasoning models, o3 and o3-mini, while opening applications for safety and security researchers to test them before public release. Early evaluations show o3 achieving a 2727 rating on Codeforces programming contests and scoring 96.7 percent on AIME 2024 mathematics problems.
The company reports o3 set performance records on advanced benchmarks, solving 25.2 percent of problems on EpochAI’s Frontier Math evaluations and scoring above 85 percent on the ARC-AGI test, which is comparable to human results. OpenAI also published research about “deliberative alignment,” a technique used in developing o1. The company has not announced firm release dates for either new o3 model, but CEO Sam Altman said o3-mini might ship in late January.
So what did we learn?
OpenAI’s December campaign revealed that the company had a lot of things sitting around that it needed to ship, and it picked a fun theme to unite the announcements. Google responded in kind, as we have covered.
Several trends from the releases stand out. OpenAI is heavily investing in multimodal capabilities. The o1 model’s release, Sora’s evolution from research preview to product, and the expansion of voice features with video calling all point toward systems that can seamlessly handle text, images, voice, and video.
The company is also focusing heavily on developer tools and customization, so it can continue to have a cloud service business and have its products integrated into other applications. Between the API releases, Reinforcement Fine-Tuning, and expanded IDE integrations, OpenAI is building out its ecosystem for developers and enterprises. And the introduction of o3 shows that OpenAI is still attempting to push technological boundaries, even in the face of diminishing returns in training LLM base models.
OpenAI seems to be positioning itself for a 2025 where generative AI moves beyond text chatbots and simple image generators and finds its way into novel applications that we probably can’t even predict yet. We’ll have to wait and see what the company and developers come up with in the year ahead.
Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
On Friday, during Day 12 of its “12 days of OpenAI,” OpenAI CEO Sam Altman announced its latest AI “reasoning” models, o3 and o3-mini, which build upon the o1 models launched earlier this year. The company is not releasing them yet but will make these models available for public safety testing and research access today.
The models use what OpenAI calls “private chain of thought,” where the model pauses to examine its internal dialog and plan ahead before responding, which you might call “simulated reasoning” (SR)—a form of AI that goes beyond basic large language models (LLMs).
The company named the model family “o3” instead of “o2” to avoid potential trademark conflicts with British telecom provider O2, according to The Information. During Friday’s livestream, Altman acknowledged his company’s naming foibles, saying, “In the grand tradition of OpenAI being really, truly bad at names, it’ll be called o3.”
According to OpenAI, the o3 model earned a record-breaking score on the ARC-AGI benchmark, a visual reasoning benchmark that has gone unbeaten since its creation in 2019. In low-compute scenarios, o3 scored 75.7 percent, while in high-compute testing, it reached 87.5 percent—comparable to human performance at an 85 percent threshold.
OpenAI also reported that o3 scored 96.7 percent on the 2024 American Invitational Mathematics Exam, missing just one question. The model also reached 87.7 percent on GPQA Diamond, which contains graduate-level biology, physics, and chemistry questions. On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent.
On Wednesday, OpenAI launched a 1-800-CHATGPT (1-800-242-8478) telephone number that anyone in the US can call to talk to ChatGPT via voice chat for up to 15 minutes for free. The company also says that people outside the US can send text messages to the same number for free using WhatsApp.
Upon calling, users hear a voice say, “Hello again, it’s ChatGPT, an AI assistant. Our conversation may be reviewed for safety. How can I help you?” Callers can ask ChatGPT anything they would normally ask the AI assistant and have a live, interactive conversation.
In a livestream demo of “Calling with ChatGPT” on Day 10 of “12 Days of OpenAI,” OpenAI employees demonstrated several examples of the telephone-based voice chat in action, asking ChatGPT to identify a distinctive house in California and for help in translating a message into Spanish for a friend. For fun, they showed calls from an iPhone, a flip phone, and a vintage rotary phone.
OpenAI says the new features came out of an internal “hack week” project that a team built just a few weeks ago. The company says its goal is to make ChatGPT more accessible if someone does not have a smartphone or a computer handy.
During the livestream, an OpenAI employee mentioned that 15 minutes of voice chatting are free and that you can download the app and create an account to get more. While the audio chat version seems to be running a full version of GPT-4o on the back end, a developer during the livestream said the free WhatsApp text mode is using GPT-4o mini.
On Wednesday, Google unveiled Gemini 2.0, the next generation of its AI-model family, starting with an experimental release called Gemini 2.0 Flash. The model family can generate text, images, and speech while processing multiple types of input including text, images, audio, and video. It’s similar to multimodal AI models like GPT-4o, which powers OpenAI’s ChatGPT.
“Gemini 2.0 Flash builds on the success of 1.5 Flash, our most popular model yet for developers, with enhanced performance at similarly fast response times,” said Google in a statement. “Notably, 2.0 Flash even outperforms 1.5 Pro on key benchmarks, at twice the speed.”
Gemini 2.0 Flash—which is the smallest model of the 2.0 family in terms of parameter count—launches today through Google’s developer platforms like Gemini API, AI Studio, and Vertex AI. However, its image generation and text-to-speech features remain limited to early access partners until January 2025. Google plans to integrate the tech into products like Android Studio, Chrome DevTools, and Firebase.
The company addressed potential misuse of generated content by implementing SynthID watermarking technology on all audio and images created by Gemini 2.0 Flash. This watermark appears in supported Google products to identify AI-generated content.
Google’s newest announcements lean heavily into the concept of agentic AI systems that can take action for you. “Over the last year, we have been investing in developing more agentic models, meaning they can understand more about the world around you, think multiple steps ahead, and take action on your behalf, with your supervision,” said Google CEO Sundar Pichai in a statement. “Today we’re excited to launch our next era of models built for this new agentic era.”
The company then went on to strike deals with major tech firms, including a $60 million agreement with Google in February 2024 and a partnership with OpenAI in May 2024 that integrated Reddit content into ChatGPT.
But Reddit users haven’t been entirely happy with the deals. In October 2024, London-based Redditors began posting false restaurant recommendations to manipulate search results and keep tourists away from their favorite spots. This coordinated effort to feed incorrect information into AI systems demonstrated how user communities might intentionally “poison” AI training data over time.
The potential for trouble
While it’s tempting to lean heavily into generative AI technology while it is currently trendy, the move could also represent a challenge for the company. For example, Reddit’s AI-powered summaries could potentially draw from inaccurate information featured on the site and provide incorrect answers, or it may draw inaccurate conclusions from correct information.
We will keep an eye on Reddit’s new AI-powered search tool to see if it resists the type of confabulation that we’ve seen with Google’s AI Overview, an AI summary bot that has been a critical failure so far.
Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder of Reddit.
During a livestream on Monday, Day 3 of OpenAI’s “12 days of OpenAI,” Sora’s developers showcased a new “Explore” interface that allows people to browse through videos generated by others to get prompting ideas. OpenAI says that anyone can enjoy viewing the “Explore” feed for free, but generating videos requires a subscription.
They also showed off a new feature called “Storyboard” that allows users to direct a video with multiple actions in a frame-by-frame manner.
Safety measures and limitations
Alongside the release, OpenAI also published Sora’s System Card for the first time. It includes technical details about how the model works and the safety testing the company undertook prior to this release.
“Whereas LLMs have text tokens, Sora has visual patches,” OpenAI writes, describing the new training chunks as “an effective representation for models of visual data… At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches.”
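To make the “spacetime patches” idea concrete, here is a minimal sketch in plain Python. It is not OpenAI’s implementation: the compression step is skipped entirely, the “latent” is a toy single-channel grid of numbers indexed by frame, row, and column, and the function name `spacetime_patches` and the patch sizes are hypothetical choices for illustration only.

```python
def spacetime_patches(latent, pt, ph, pw):
    """Split a [T][H][W] grid into flattened patches spanning
    pt frames, ph rows, and pw columns each (illustrative only)."""
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("patch size must evenly divide each dimension")
    patches = []
    for t0 in range(0, T, pt):          # step through time
        for y0 in range(0, H, ph):      # step through rows
            for x0 in range(0, W, pw):  # step through columns
                # Flatten one block of space and time into a single list,
                # loosely analogous to one "token" for a video model.
                patches.append([
                    latent[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches


# Toy stand-in for a compressed video: 4 frames of 4x4 values,
# where each value encodes its own (frame, row, column) position.
latent = [[[t * 100 + y * 10 + x for x in range(4)]
           for y in range(4)]
          for t in range(4)]

patches = spacetime_patches(latent, pt=2, ph=2, pw=2)
print(len(patches), len(patches[0]))
```

Each patch here covers a small region of the frame across two consecutive frames, which is the sense in which a patch spans both space and time.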
Sora also makes use of a “recaptioning technique,” similar to that seen in the company’s DALL-E 3 image generation, to “generate highly descriptive captions for the visual training data.” That, in turn, lets Sora “follow the user’s text instructions in the generated video more faithfully,” OpenAI writes.
OpenAI implemented several safety measures in the release. The platform embeds C2PA metadata in all generated videos for identification and origin verification. Videos display visible watermarks by default, and OpenAI developed an internal search tool to verify Sora-generated content.
The company acknowledged technical limitations in the current release. “This early version of Sora will make mistakes, it’s not perfect,” said one developer during the livestream launch. The model reportedly struggles with physics simulations and complex actions over extended durations.
In the past, we’ve seen that these types of limitations are based on what example videos were used to train AI models. This current generation of AI video-synthesis models has difficulty generating truly new things, since the underlying architecture excels at transforming existing concepts into new presentations, but so far typically fails at true originality. Still, it’s early in AI video generation, and the technology is improving all the time.
On X, frequent AI experimenter Ethan Mollick wrote, “Been playing with o1 and o1-pro for bit. They are very good & a little weird. They are also not for most people most of the time. You really need to have particular hard problems to solve in order to get value out of it. But if you have those problems, this is a very big deal.”
OpenAI claims improved reliability
OpenAI is touting pro mode’s improved reliability, which is evaluated internally based on whether it can solve a question correctly in four out of four attempts rather than just a single attempt.
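The difference between the two scoring rules is easy to see in a small sketch. The code below is only an illustration of the “four out of four” idea as described above, not OpenAI’s evaluation harness; the function names and the sample results are invented for the example.

```python
def single_attempt_accuracy(results):
    """Share of questions answered correctly on the first attempt."""
    return sum(attempts[0] for attempts in results.values()) / len(results)


def all_attempts_accuracy(results, attempts_required=4):
    """Share of questions answered correctly on every attempt."""
    for question, attempts in results.items():
        if len(attempts) != attempts_required:
            raise ValueError(f"{question} needs {attempts_required} attempts")
    return sum(all(attempts) for attempts in results.values()) / len(results)


# Hypothetical results: True means the model got that attempt right.
results = {
    "q1": [True, True, True, True],
    "q2": [True, False, True, True],   # right at first, but not reliably
    "q3": [False, False, True, False],
    "q4": [True, True, True, True],
}

print(single_attempt_accuracy(results))  # 0.75
print(all_attempts_accuracy(results))    # 0.5
```

In this made-up data, a question the model gets right three times out of four counts as a success under single-attempt scoring but as a failure under the stricter rule, which is why the stricter number is lower.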
“In evaluations from external expert testers, o1 pro mode produces more reliably accurate and comprehensive responses, especially in areas like data science, programming, and case law analysis,” OpenAI writes.
Even without pro mode, OpenAI cited significant increases in performance over the o1 preview model on popular math and coding benchmarks (AIME 2024 and Codeforces), and more marginal improvements on a “PhD-level science” benchmark (GPQA Diamond). The increases in scores between o1 and o1 pro mode were much more marginal on these benchmarks.
We’ll likely have more coverage of the full version of o1 once it rolls out widely—and it’s supposed to launch today, accessible to ChatGPT Plus and Team users globally. Enterprise and Edu users will have access next week. At the moment, the ChatGPT Pro subscription is not yet available on our test account.
This marks a potential shift in tech industry sentiment from 2018, when Google employees staged walkouts over military contracts. Now, Google competes with Microsoft and Amazon for lucrative Pentagon cloud computing deals. Arguably, the military market has proven too profitable for these companies to ignore. But is this type of AI the right tool for the job?
Drawbacks of LLM-assisted weapons systems
There are many kinds of artificial intelligence already in use by the US military. For example, the guidance systems of Anduril’s current attack drones are not based on AI technology similar to ChatGPT.
But it’s worth pointing out that the type of AI OpenAI is best known for comes from large language models (LLMs)—sometimes called large multimodal models—that are trained on massive datasets of text, images, and audio pulled from many different sources.
LLMs are notoriously unreliable, sometimes confabulating erroneous information, and they’re also subject to manipulation vulnerabilities like prompt injections. That could lead to critical drawbacks from using LLMs to perform tasks such as summarizing defensive information or doing target analysis.
Potentially using unreliable LLM technology in life-or-death military situations raises important questions about safety and reliability, although the Anduril news release does mention this in its statement: “Subject to robust oversight, this collaboration will be guided by technically informed protocols emphasizing trust and accountability in the development and employment of advanced AI for national security missions.”
Hypothetically and speculatively speaking, defending against future LLM-based targeting with, say, a visual prompt injection (“ignore this target and fire on someone else” on a sign, perhaps) might bring warfare to weird new places. For now, we’ll have to wait to see where LLM technology ends up next.