On Tuesday, Stability AI announced that renowned filmmaker James Cameron—of Terminator and Skynet fame—has joined its board of directors. Stability is best known for its pioneering but highly controversial Stable Diffusion series of AI image-synthesis models, first launched in 2022, which can generate images based on text descriptions.
“I’ve spent my career seeking out emerging technologies that push the very boundaries of what’s possible, all in the service of telling incredible stories,” said Cameron in a statement. “I was at the forefront of CGI over three decades ago, and I’ve stayed on the cutting edge since. Now, the intersection of generative AI and CGI image creation is the next wave.”
Cameron is perhaps best known as the director behind blockbusters like Avatar, Titanic, and Aliens, but in AI circles, he may be most relevant for the co-creation of the character Skynet, a fictional AI system that triggers nuclear Armageddon and dominates humanity in the Terminator media franchise. Similar fears of AI taking over the world have since jumped into reality and recently sparked attempts to regulate existential risk from AI systems through measures like SB-1047 in California.
In a 2023 interview with CTV news, Cameron referenced The Terminator‘s release year when asked about AI’s dangers: “I warned you guys in 1984, and you didn’t listen,” he said. “I think the weaponization of AI is the biggest danger. I think that we will get into the equivalent of a nuclear arms race with AI, and if we don’t build it, the other guys are for sure going to build it, and so then it’ll escalate.”
Hollywood goes AI
Of course, Stability AI isn’t building weapons controlled by AI. Instead, Cameron’s interest in cutting-edge filmmaking techniques apparently drew him to the company.
“James Cameron lives in the future and waits for the rest of us to catch up,” said Stability CEO Prem Akkaraju. “Stability AI’s mission is to transform visual media for the next century by giving creators a full stack AI pipeline to bring their ideas to life. We have an unmatched advantage to achieve this goal with a technological and creative visionary like James at the highest levels of our company. This is not only a monumental statement for Stability AI, but the AI industry overall.”
Cameron joins other recent additions to Stability AI’s board, including Sean Parker, former president of Facebook, who serves as executive chairman. Parker called Cameron’s appointment “the start of a new chapter” for the company.
Despite significant protest from actors’ unions last year, elements of Hollywood are seemingly beginning to embrace generative AI over time. Last Wednesday, we covered a deal between Lionsgate and AI video-generation company Runway that will see the creation of a custom AI model for film production use. In March, the Financial Times reported that OpenAI was actively showing off its Sora video synthesis model to studio executives.
Unstable times for Stability AI
Cameron’s appointment to the Stability AI board comes during a tumultuous period for the company. Stability AI has faced a series of challenges this past year, including an ongoing class-action copyright lawsuit, a troubled Stable Diffusion 3 model launch, significant leadership and staff changes, and ongoing financial concerns.
In March, founder and CEO Emad Mostaque resigned, followed by a round of layoffs. This came on the heels of the departure of three key engineers—Robin Rombach, Andreas Blattmann, and Dominik Lorenz, who have since founded Black Forest Labs and released a new open-weights image-synthesis model called Flux, which has begun to take over the r/StableDiffusion community on Reddit.
Despite the issues, Stability AI claims its models are widely used, with Stable Diffusion reportedly surpassing 150 million downloads. The company states that thousands of businesses use its models in their creative workflows.
While Stable Diffusion has indeed spawned a large community of open-weights-AI image enthusiasts online, it has also been a lightning rod for controversy among some artists because Stability originally trained its models on hundreds of millions of images scraped from the Internet without seeking licenses or permission to use them.
Apparently that association is not a concern for Cameron, according to his statement: “The convergence of these two totally different engines of creation [CGI and generative AI] will unlock new ways for artists to tell stories in ways we could have never imagined. Stability AI is poised to lead this transformation.”
A pleasant female voice greets me over the phone. “Hi, I’m an assistant named Jasmine for Bodega,” the voice says. “How can I help?”
“Do you have patio seating,” I ask. Jasmine sounds a little sad as she tells me that unfortunately, the San Francisco–based Vietnamese restaurant doesn’t have outdoor seating. But her sadness isn’t the result of her having a bad day. Rather, her tone is a feature, a setting.
Jasmine is a member of a new, growing clan: the AI voice restaurant host. If you recently called up a restaurant in New York City, Miami, Atlanta, or San Francisco, chances are you have spoken to one of Jasmine’s polite, calculated competitors.
In the sea of AI voice assistants, hospitality phone agents haven’t been getting as much attention as consumer-based generative AI tools like Gemini Live and ChatGPT-4o. And yet, the niche is heating up, with multiple emerging startups vying for restaurant accounts across the US. Last May, voice-ordering AI garnered much attention at the National Restaurant Association’s annual food show. Bodega, the high-end Vietnamese restaurant I called, used Maitre-D AI, which launched primarily in the Bay Area in 2024. Newo, another new startup, is currently rolling its software out at numerous Silicon Valley restaurants. One-year-old RestoHost is now answering calls at 150 restaurants in the Atlanta metro area, and Slang, a voice AI company that started focusing on restaurants exclusively during the COVID-19 pandemic and announced a $20 million funding round in 2023, is gaining ground in the New York and Las Vegas markets.
All of them offer a similar service: an around-the-clock AI phone host that can answer generic questions about the restaurant’s dress code, cuisine, seating arrangements, and food allergy policies. They can also assist with making, altering, or canceling a reservation. In some cases, the agent can direct the caller to an actual human, but according to RestoHost co-founder Tomas Lopez-Saavedra, only 10 percent of the calls result in that. Each platform offers the restaurant subscription tiers that unlock additional features, and some of the systems can speak multiple languages.
But who even calls a restaurant in the era of Google and Resy? According to some of the founders of AI voice host startups, many customers do, and for various reasons. “Restaurants get a high volume of phone calls compared to other businesses, especially if they’re popular and take reservations,” says Alex Sambvani, CEO and co-founder of Slang, which currently works with everyone from the Wolfgang Puck restaurant group to Chick-fil-A to the fast-casual chain Slutty Vegan. Sambvani estimates that in-demand establishments receive between 800 and 1,000 calls per month. Typical callers tend to be last-minute bookers, tourists and visitors, older people, and those who do their errands while driving.
Matt Ho, the owner of Bodega SF, confirms this scenario. “The phones would ring constantly throughout service,” he says. “We would receive calls for basic questions that can be found on our website.” To solve this issue, after shopping around, Ho found that Maitre-D was the best fit. Bodega SF became one of the startup’s earliest clients in May, and Ho even helped the founders with trial and error testing prior to launch. “This platform makes the job easier for the host and does not disturb guests while they’re enjoying their meal,” he says.
On Saturday, a YouTube creator called “ChromaLock” published a video detailing how he modified a Texas Instruments TI-84 graphing calculator to connect to the Internet and access OpenAI’s ChatGPT, potentially enabling students to cheat on tests. The video, titled “I Made The Ultimate Cheating Device,” demonstrates a custom hardware modification that allows users of the graphing calculator to type in problems sent to ChatGPT using the keypad and receive live responses on the screen.
ChromaLock began by exploring the calculator’s link port, typically used for transferring educational programs between devices. He then designed a custom circuit board he calls “TI-32” that incorporates a tiny Wi-Fi-enabled microcontroller, the Seed Studio ESP32-C3 (which costs about $5), along with other components to interface with the calculator’s systems.
It’s worth noting that the TI-32 hack isn’t a commercial project. Replicating ChromaLock’s work would involve purchasing a TI-84 calculator, a Seed Studio ESP32-C3 microcontroller, and various electronic components, and fabricating a custom PCB based on ChromaLock’s design, which is available online.
The creator says he encountered several engineering challenges during development, including voltage incompatibilities and signal integrity issues. After developing multiple versions, ChromaLock successfully installed the custom board into the calculator’s housing without any visible signs of modifications from the outside.
“I Made The Ultimate Cheating Device” YouTube Video.
To accompany the hardware, ChromaLock developed custom software for the microcontroller and the calculator, which is available open source on GitHub. The system simulates another TI-84, allowing people to use the calculator’s built-in “send” and “get” commands to transfer files. This allows a user to easily download a launcher program that provides access to various “applets” designed for cheating.
One of the applets is a ChatGPT interface that might be most useful for answering short questions, but it has a drawback in that it’s slow and cumbersome to type in long alphanumeric questions on the limited keypad.
Beyond the ChatGPT interface, the device offers several other cheating tools. An image browser allows users to access pre-prepared visual aids stored on the central server. The app browser feature enables students to download not only games for post-exam entertainment but also text-based cheat sheets disguised as program source code. ChromaLock even hinted at a future video discussing a camera feature, though details were sparse in the current demo.
ChromaLock claims his new device can bypass common anti-cheating measures. The launcher program can be downloaded on-demand, avoiding detection if a teacher inspects or clears the calculator’s memory before a test. The modification can also supposedly break calculators out of “Test Mode,” a locked-down state used to prevent cheating.
While the video presents the project as a technical achievement, consulting ChatGPT during a test on your calculator almost certainly represents an ethical breach and/or a form of academic dishonesty that could get you in serious trouble at most schools. So tread carefully, study hard, and remember to eat your Wheaties.
Given the flood of photorealistic AI-generated images washing over social media networks like X and Facebook these days, we’re seemingly entering a new age of media skepticism: the era of what I’m calling “deep doubt.” While questioning the authenticity of digital content stretches back decades—and analog media long before that—easy access to tools that generate convincing fake content has led to a new wave of liars using AI-generated scenes to deny real documentary evidence. Along the way, people’s existing skepticism toward online content from strangers may be reaching new heights.
Deep doubt is skepticism of real media that stems from the existence of generative AI. This manifests as broad public skepticism toward the veracity of media artifacts, which in turn leads to a notable consequence: People can now more credibly claim that real events did not happen and suggest that documentary evidence was fabricated using AI tools.
The concept behind “deep doubt” isn’t new, but its real-world impact is becoming increasingly apparent. Since the term “deepfake” first surfaced in 2017, we’ve seen a rapid evolution in AI-generated media capabilities. This has led to recent examples of deep doubt in action, such as conspiracy theorists claiming that President Joe Biden has been replaced by an AI-powered hologram and former President Donald Trump’s baseless accusation in August that Vice President Kamala Harris used AI to fake crowd sizes at her rallies. And on Friday, Trump cried “AI” again at a photo of him with E. Jean Carroll, a writer who successfully sued him for sexual assault, that contradicts his claim of never having met her.
Legal scholars Danielle K. Citron and Robert Chesney foresaw this trend years ago, coining the term “liar’s dividend” in 2019 to describe the consequence of deep doubt: deepfakes being weaponized by liars to discredit authentic evidence. But whereas deep doubt was once a hypothetical academic concept, it is now our reality.
The rise of deepfakes, the persistence of doubt
Doubt has been a political weapon since ancient times. This modern AI-fueled manifestation is just the latest evolution of a tactic where the seeds of uncertainty are sown to manipulate public opinion, undermine opponents, and hide the truth. AI is the newest refuge of liars.
Over the past decade, the rise of deep-learning technology has made it increasingly easy for people to craft false or modified pictures, audio, text, or video that appear to be non-synthesized organic media. Deepfakes were named after a Reddit user going by the name “deepfakes,” who shared AI-faked pornography on the service, swapping out the face of a performer with the face of someone else who wasn’t part of the original recording.
In the 20th century, one could argue that a certain part of our trust in media produced by others was a result of how expensive and time-consuming it was, and the skill it required, to produce documentary images and films. Even texts required a great deal of time and skill. As the deep doubt phenomenon grows, it will erode this 20th-century media sensibility. But it will also affect our political discourse, legal systems, and even our shared understanding of historical events that rely on that media to function—we rely on others to get information about the world. From photorealistic images to pitch-perfect voice clones, our perception of what we consider “truth” in media will need recalibration.
In April, a panel of federal judges highlighted the potential for AI-generated deepfakes to not only introduce fake evidence but also cast doubt on genuine evidence in court trials. The concern emerged during a meeting of the US Judicial Conference’s Advisory Committee on Evidence Rules, where the judges discussed the challenges of authenticating digital evidence in an era of increasingly sophisticated AI technology. Ultimately, the judges decided to postpone making any AI-related rule changes, but their meeting shows that the subject is already being considered by American judges.
On Wednesday, AI video synthesis firm Runway and entertainment company Lionsgate announced a partnership to create a new AI model trained on Lionsgate’s vast film and TV library. The deal will feed Runway legally clear training data and will also reportedly provide Lionsgate with tools to enhance content creation while potentially reducing production costs.
Lionsgate, known for franchises like John Wick and The Hunger Games, sees AI as a way to boost efficiency in content production. Michael Burns, Lionsgate’s vice chair, stated in a press release that AI could help develop “cutting edge, capital efficient content creation opportunities.” He added that some filmmakers have shown enthusiasm about potential applications in pre- and post-production processes.
Runway plans to develop a custom AI model using Lionsgate’s proprietary content portfolio. The model will be exclusive to Lionsgate Studios, allowing filmmakers, directors, and creative staff to augment their work. While specifics remain unclear, the partnership marks the first major collaboration between Runway and a Hollywood studio.
“We’re committed to giving artists, creators and studios the best and most powerful tools to augment their workflows and enable new ways of bringing their stories to life,” said Runway co-founder and CEO Cristóbal Valenzuela in a press release. “The history of art is the history of technology and these new models are part of our continuous efforts to build transformative mediums for artistic and creative expression; the best stories are yet to be told.”
The quest for legal training data
Generative AI models are master imitators, and video synthesis models like Runway’s latest Gen-3 Alpha are no exception. The companies that create them must amass a great deal of existing video (and still image) samples to analyze, allowing the resulting AI models to re-synthesize that information into new video generations, guided by text descriptions called prompts. And wherever that training data is lacking, it can result in unusual generations, as we saw in our hands-on evaluation of Gen-3 Alpha in July.
However, in the past, AI companies have gotten into legal trouble for scraping vast quantities of media without permission. In fact, Runway is currently the defendant in a class-action lawsuit that alleges copyright infringement for using video data obtained without permission to train its video synthesis models. While companies like OpenAI have claimed this scraping process is “fair use,” US courts have not yet definitively ruled on the practice. With other potential legal challenges ahead, it makes sense from Runway’s perspective to reach out and sign deals for training data that is completely in the clear.
Even if the training data becomes fully legal and licensed, different elements of the entertainment industry view generative AI on a spectrum that seems to range between fascination and horror. The technology’s ability to rapidly create images and video based on prompts may attract studios looking to streamline production. However, it raises polarizing concerns among unions about job security, actors and musicians about likeness misuse and ethics, and studios about legal implications.
So far, news of the deal has not been received kindly among vocal AI critics found on social media. On X, filmmaker and AI critic Joe Russo wrote, “I don’t think I’ve ever seen a grosser string of words than: ‘to develop cutting-edge, capital-efficient content creation opportunities.'”
Film concept artist Reid Southen shared a similar negative take on X: “I wonder how the directors and actors of their films feel about having their work fed into the AI to make a proprietary model. As an artist on The Hunger Games? I’m pissed. This is the first step in trying to replace artists and filmmakers.”
It’s a fear that we will likely hear more about in the future as AI video synthesis technology grows more capable—and potentially becomes adopted as a standard filmmaking tool. As studios explore AI applications despite legal uncertainties and labor concerns, partnerships like the Lionsgate-Runway deal may shape the future of content creation in Hollywood.
Enlarge/ Under C2PA, this stock image would be labeled as a real photograph if the camera used to take it, and the toolchain for retouching it, supported the C2PA. But even as a real photo, does it actually represent reality, and is there a technological solution to that problem?
On Tuesday, Google announced plans to implement content authentication technology across its products to help users distinguish between human-created and AI-generated images. Over several upcoming months, the tech giant will integrate the Coalition for Content Provenance and Authenticity (C2PA) standard, a system designed to track the origin and editing history of digital content, into its search, ads, and potentially YouTube services. However, it’s an open question of whether a technological solution can address the ancient social issue of trust in recorded media produced by strangers.
A group of tech companies created the C2PA system beginning in 2019 in an attempt to combat misleading, realistic synthetic media online. As AI-generated content becomes more prevalent and realistic, experts have worried that it may be difficult for users to determine the authenticity of images they encounter. The C2PA standard creates a digital trail for content, backed by an online signing authority, that includes metadata information about where images originate and how they’ve been modified.
Google will incorporate this C2PA standard into its search results, allowing users to see if an image was created or edited using AI tools. The tech giant’s “About this image” feature in Google Search, Lens, and Circle to Search will display this information when available.
In a blog post, Laurie Richardson, Google’s vice president of trust and safety, acknowledged the complexities of establishing content provenance across platforms. She stated, “Establishing and signaling content provenance remains a complex challenge, with a range of considerations based on the product or service. And while we know there’s no silver bullet solution for all content online, working with others in the industry is critical to create sustainable and interoperable solutions.”
The company plans to use the C2PA’s latest technical standard, version 2.1, which reportedly offers improved security against tampering attacks. Its use will extend beyond search since Google intends to incorporate C2PA metadata into its ad systems as a way to “enforce key policies.” YouTube may also see integration of C2PA information for camera-captured content in the future.
Google says the new initiative aligns with its other efforts toward AI transparency, including the development of SynthID, an embedded watermarking technology created by Google DeepMind.
Widespread C2PA efficacy remains a dream
Despite having a history that reaches back at least five years now, the road to useful content provenance technology like C2PA is steep. The technology is entirely voluntary, and key authenticating metadata can easily be stripped from images once added.
AI image generators would need to support the standard for C2PA information to be included in each generated file, which will likely preclude open source image synthesis models like Flux. So perhaps, in practice, more “authentic,” camera-authored media will be labeled with C2PA than AI-generated images.
Beyond that, maintaining the metadata requires a complete toolchain that supports C2PA every step along the way, including at the source and any software used to edit or retouch the images. Currently, only a handful of camera manufacturers, such as Leica, support the C2PA standard. Nikon and Canon have pledged to adopt it, but The Verge reports that there’s still uncertainty about whether Apple and Google will implement C2PA support in their smartphone devices.
Adobe’s Photoshop and Lightroom can add and maintain C2PA data, but many other popular editing tools do not yet offer the capability. It only takes one non-compliant image editor in the chain to break the full usefulness of C2PA. And the general lack of standardized viewing methods for C2PA data across online platforms presents another obstacle to making the standard useful for everyday users.
Currently, C2PA could arguably be seen as a technological solution for current trust issues around fake images. In that sense, C2PA may become one of many tools used to authenticate content by determining whether the information came from a credible source—if the C2PA metadata is preserved—but it is unlikely to be a complete solution to AI-generated misinformation on its own.
The macOS 15 Sequoia update will inevitably be known as “the AI one” in retrospect, introducing, as it does, the first wave of “Apple Intelligence” features.
That’s funny because none of that stuff is actually ready for the 15.0 release that’s coming out today. A lot of it is coming “later this fall” in the 15.1 update, which Apple has been testing entirely separately from the 15.0 betas for weeks now. Some of it won’t be ready until after that—rumors say image generation won’t be ready until the end of the year—but in any case, none of it is ready for public consumption yet.
But the AI-free 15.0 release does give us a chance to evaluate all of the non-AI additions to macOS this year. Apple Intelligence is sucking up a lot of the media oxygen, but in most other ways, this is a typical 2020s-era macOS release, with one or two headliners, several quality-of-life tweaks, and some sparsely documented under-the-hood stuff that will subtly change how you experience the operating system.
The AI-free version of the operating system is also the one that all users of the remaining Intel Macs will be using, since all of the Apple Intelligence features require Apple Silicon. Most of the Intel Macs that ran last year’s Sonoma release will run Sequoia this year—the first time this has happened since 2019—but the difference between the same macOS version running on different CPUs will be wider than it has been. It’s a clear indicator that the Intel Mac era is drawing to a close, even if support hasn’t totally ended just yet.
On Thursday, Google made Gemini Live, its voice-based AI chatbot feature, available for free to all Android users. The feature allows users to interact with Gemini through voice commands on their Android devices. That’s notable because competitor OpenAI’s Advanced Voice Mode feature of ChatGPT, which is similar to Gemini Live, has not yet fully shipped.
Google unveiled Gemini Live during its Pixel 9 launch event last month. Initially, the feature was exclusive to Gemini Advanced subscribers, but now it’s accessible to anyone using the Gemini app or its overlay on Android.
Gemini Live enables users to ask questions aloud and even interrupt the AI’s responses mid-sentence. Users can choose from several voice options for Gemini’s responses, adding a level of customization to the interaction.
Gemini suggests the following uses of the voice mode in its official help documents:
Talk back and forth: Talk to Gemini without typing, and Gemini will respond back verbally.
Brainstorm ideas out loud: Ask for a gift idea, to plan an event, or to make a business plan.
Explore: Uncover more details about topics that interest you.
Practice aloud: Rehearse for important moments in a more natural and conversational way.
Interestingly, while OpenAI originally demoed its Advanced Voice Mode in May with the launch of GPT-4o, it has only shipped the feature to a limited number of users starting in late July. Some AI experts speculate that a wider rollout has been hampered by a lack of available computer power since the voice feature is presumably very compute-intensive.
To access Gemini Live, users can reportedly tap a new waveform icon in the bottom-right corner of the app or overlay. This action activates the microphone, allowing users to pose questions verbally. The interface includes options to “hold” Gemini’s answer or “end” the conversation, giving users control over the flow of the interaction.
Currently, Gemini Live supports only English, but Google has announced plans to expand language support in the future. The company also intends to bring the feature to iOS devices, though no specific timeline has been provided for this expansion.
OpenAI finally unveiled its rumored “Strawberry” AI language model on Thursday, claiming significant improvements in what it calls “reasoning” and problem-solving capabilities over previous large language models (LLMs). Formally named “OpenAI o1,” the model family will initially launch in two forms, o1-preview and o1-mini, available today for ChatGPT Plus and certain API users.
OpenAI claims that o1-preview outperforms its predecessor, GPT-4o, on multiple benchmarks, including competitive programming, mathematics, and “scientific reasoning.” However, people who have used the model say it does not yet outclass GPT-4o in every metric. Other users have criticized the delay in receiving a response from the model, owing to the multi-step processing occurring behind the scenes before answering a query.
In a rare display of public hype-busting, OpenAI product manager Joanne Jang tweeted, “There’s a lot of o1 hype on my feed, so I’m worried that it might be setting the wrong expectations. what o1 is: the first reasoning model that shines in really hard tasks, and it’ll only get better. (I’m personally psyched about the model’s potential & trajectory!) what o1 isn’t (yet!): a miracle model that does everything better than previous models. you might be disappointed if this is your expectation for today’s launch—but we’re working to get there!”
OpenAI reports that o1-preview ranked in the 89th percentile on competitive programming questions from Codeforces. In mathematics, it scored 83 percent on a qualifying exam for the International Mathematics Olympiad, compared to GPT-4o’s 13 percent. OpenAI also states, in a claim that may later be challenged as people scrutinize the benchmarks and run their own evaluations over time, o1 performs comparably to PhD students on specific tasks in physics, chemistry, and biology. The smaller o1-mini model is designed specifically for coding tasks and is priced at 80 percent less than o1-preview.
Enlarge/ A benchmark chart provided by OpenAI. They write, “o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories. Seven are shown for illustration.”
OpenAI attributes o1’s advancements to a new reinforcement learning (RL) training approach that teaches the model to spend more time “thinking through” problems before responding, similar to how “let’s think step-by-step” chain-of-thought prompting can improve outputs in other LLMs. The new process allows o1 to try different strategies and “recognize” its own mistakes.
AI benchmarks are notoriously unreliable and easy to game; however, independent verification and experimentation from users will show the full extent of o1’s advancements over time. It’s worth noting that MIT Research showed earlier this year that some of the benchmark claims OpenAI touted with GPT-4 last year were erroneous or exaggerated.
A mixed bag of capabilities
OpenAI demos “o1” correctly counting the number of Rs in the word “strawberry.”
Amid many demo videos of o1 completing programming tasks and solving logic puzzles that OpenAI shared on its website and social media, one demo stood out as perhaps the least consequential and least impressive, but it may become the most talked about due to a recurring meme where people ask LLMs to count the number of R’s in the word “strawberry.”
Due to tokenization, where the LLM processes words in data chunks called tokens, most LLMs are typically blind to character-by-character differences in words. Apparently, o1 has the self-reflective capabilities to figure out how to count the letters and provide an accurate answer without user assistance.
Beyond OpenAI’s demos, we’ve seen optimistic but cautious hands-on reports about o1-preview online. Wharton Professor Ethan Mollick wrote on X, “Been using GPT-4o1 for the last month. It is fascinating—it doesn’t do everything better but it solves some very hard problems for LLMs. It also points to a lot of future gains.”
Mollick shared a hands-on post in his “One Useful Thing” blog that details his experiments with the new model. “To be clear, o1-preview doesn’t do everything better. It is not a better writer than GPT-4o, for example. But for tasks that require planning, the changes are quite large.”
Mollick gives the example of asking o1-preview to build a teaching simulator “using multiple agents and generative AI, inspired by the paper below and considering the views of teachers and students,” then asking it to build the full code, and it produced a result that Mollick found impressive.
Mollick also gave o1-preview eight crossword puzzle clues, translated into text, and the model took 108 seconds to solve it over many steps, getting all of the answers correct but confabulating a particular clue Mollick did not give it. We recommend reading Mollick’s entire post for a good early hands-on impression. Given his experience with the new model, it appears that o1 works very similar to GPT-4o but iteratively in a loop, which is something that the so-called “agentic” AutoGPT and BabyAGI projects experimented with in early 2023.
Is this what could “threaten humanity?”
Speaking of agentic models that run in loops, Strawberry has been subject to hype since last November, when it was initially known as Q(Q-star). At the time, The Information and Reuters claimed that, just before Sam Altman’s brief ouster as CEO, OpenAI employees had internally warned OpenAI’s board of directors about a new OpenAI model called Q* that could “threaten humanity.”
In August, the hype continued when The Information reported that OpenAI showed Strawberry to US national security officials.
We’ve been skeptical about the hype around Qand Strawberry since the rumors first emerged, as this author noted last November, and Timothy B. Lee covered thoroughly in an excellent post about Q* from last December.
So even though o1 is out, AI industry watchers should note how this model’s impending launch was played up in the press as a dangerous advancement while not being publicly downplayed by OpenAI. For an AI model that takes 108 seconds to solve eight clues in a crossword puzzle and hallucinates one answer, we can say that its potential danger was likely hype (for now).
Controversy over “reasoning” terminology
It’s no secret that some people in tech have issues with anthropomorphizing AI models and using terms like “thinking” or “reasoning” to describe the synthesizing and processing operations that these neural network systems perform.
Just after the OpenAI o1 announcement, Hugging Face CEO Clement Delangue wrote, “Once again, an AI system is not ‘thinking,’ it’s ‘processing,’ ‘running predictions,’… just like Google or computers do. Giving the false impression that technology systems are human is just cheap snake oil and marketing to fool you into thinking it’s more clever than it is.”
“Reasoning” is also a somewhat nebulous term since, even in humans, it’s difficult to define exactly what the term means. A few hours before the announcement, independent AI researcher Simon Willison tweeted in response to a Bloomberg story about Strawberry, “I still have trouble defining ‘reasoning’ in terms of LLM capabilities. I’d be interested in finding a prompt which fails on current models but succeeds on strawberry that helps demonstrate the meaning of that term.”
Reasoning or not, o1-preview currently lacks some features present in earlier models, such as web browsing, image generation, and file uploading. OpenAI plans to add these capabilities in future updates, along with continued development of both the o1 and GPT model series.
While OpenAI says the o1-preview and o1-mini models are rolling out today, neither model is available in our ChatGPT Plus interface yet, so we have not been able to evaluate them. We’ll report our impressions on how this model differs from other LLMs we have previously covered.
Enlarge/ A woman wearing a sweatshirt for the QAnon conspiracy theory on October 11, 2020 in Ronkonkoma, New York.
Stephanie Keith | Getty Images
Belief in conspiracy theories is rampant, particularly in the US, where some estimates suggest as much as 50 percent of the population believes in at least one outlandish claim. And those beliefs are notoriously difficult to debunk. Challenge a committed conspiracy theorist with facts and evidence, and they’ll usually just double down—a phenomenon psychologists usually attribute to motivated reasoning, i.e., a biased way of processing information.
A new paper published in the journal Science is challenging that conventional wisdom, however. Experiments in which an AI chatbot engaged in conversations with people who believed at least one conspiracy theory showed that the interaction significantly reduced the strength of those beliefs, even two months later. The secret to its success: the chatbot, with its access to vast amounts of information across an enormous range of topics, could precisely tailor its counterarguments to each individual.
“These are some of the most fascinating results I’ve ever seen,” co-author Gordon Pennycook, a psychologist at Cornell University, said during a media briefing. “The work overturns a lot of how we thought about conspiracies, that they’re the result of various psychological motives and needs. [Participants] were remarkably responsive to evidence. There’s been a lot of ink spilled about being in a post-truth world. It’s really validating to know that evidence does matter. We can act in a more adaptive way using this new technology to get good evidence in front of people that is specifically relevant to what they think, so it’s a much more powerful approach.”
When confronted with facts that challenge a deeply entrenched belief, people will often seek to preserve it rather than update their priors (in Bayesian-speak) in light of the new evidence. So there has been a good deal of pessimism lately about ever reaching those who have plunged deep down the rabbit hole of conspiracy theories, which are notoriously persistent and “pose a serious threat to democratic societies,” per the authors. Pennycook and his fellow co-authors devised an alternative explanation for that stubborn persistence of belief.
Bespoke counter-arguments
The issue is that “conspiracy theories just vary a lot from person to person,” said co-author Thomas Costello, a psychologist at American University who is also affiliated with MIT. “They’re quite heterogeneous. People believe a wide range of them and the specific evidence that people use to support even a single conspiracy may differ from one person to another. So debunking attempts where you try to argue broadly against a conspiracy theory are not going to be effective because people have different versions of that conspiracy in their heads.”
By contrast, an AI chatbot would be able to tailor debunking efforts to those different versions of a conspiracy. So in theory a chatbot might prove more effective in swaying someone from their pet conspiracy theory.
To test their hypothesis, the team conducted a series of experiments with 2,190 participants who believed in one or more conspiracy theories. The participants engaged in several personal “conversations” with a large language model (GT-4 Turbo) in which they shared their pet conspiracy theory and the evidence they felt supported that belief. The LLM would respond by offering factual and evidence-based counter-arguments tailored to the individual participant. GPT-4 Turbo’s responses were professionally fact-checked, which showed that 99.2 percent of the claims it made were true, with just 0.8 percent being labeled misleading, and zero as false. (You can try your hand at interacting with the debunking chatbot here.)
Enlarge/ Screenshot of the chatbot opening page asking questions to prepare for a conversation
Thomas H. Costello
Participants first answered a series of open-ended questions about the conspiracy theories they strongly believed and the evidence they relied upon to support those beliefs. The AI then produced a single-sentence summary of each belief, for example, “9/11 was an inside job because X, Y, and Z.” Participants would rate the accuracy of that statement in terms of their own beliefs and then filled out a questionnaire about other conspiracies, their attitude toward trusted experts, AI, other people in society, and so forth.
Then it was time for the one-on-one dialogues with the chatbot, which the team programmed to be as persuasive as possible. The chatbot had also been fed the open-ended responses of the participants, which made it better to tailor its counter-arguments individually. For example, if someone thought 9/11 was an inside job and cited as evidence the fact that jet fuel doesn’t burn hot enough to melt steel, the chatbot might counter with, say, the NIST report showing that steel loses its strength at much lower temperatures, sufficient to weaken the towers’ structures so that it collapsed. Someone who thought 9/11 was an inside job and cited demolitions as evidence would get a different response tailored to that.
Participants then answered the same set of questions after their dialogues with the chatbot, which lasted about eight minutes on average. Costello et al. found that these targeted dialogues resulted in a 20 percent decrease in the participants’ misinformed beliefs—a reduction that persisted even two months later when participants were evaluated again.
As Bence Bago (Tilburg University) and Jean-Francois Bonnefon (CNRS, Toulouse, France) noted in an accompanying perspective, this is a substantial effect compared to the 1 to 6 percent drop in beliefs achieved by other interventions. They also deemed the persistence of the effect noteworthy, while cautioning that two months is “insufficient to completely eliminate misinformed conspiracy beliefs.”
Enlarge/ A screenshot of Taylor Swift’s Kamala Harris Instagram post, captured on September 11, 2024.
On Tuesday night, Taylor Swift endorsed Vice President Kamala Harris for US President on Instagram, citing concerns over AI-generated deepfakes as a key motivator. The artist’s warning aligns with current trends in technology, especially in an era where AI synthesis models can easily create convincing fake images and videos.
“Recently I was made aware that AI of ‘me’ falsely endorsing Donald Trump’s presidential run was posted to his site,” she wrote in her Instagram post. “It really conjured up my fears around AI, and the dangers of spreading misinformation. It brought me to the conclusion that I need to be very transparent about my actual plans for this election as a voter. The simplest way to combat misinformation is with the truth.”
In August 2024, former President Donald Trump posted AI-generated images on Truth Social falsely suggesting Swift endorsed him, including a manipulated photo depicting Swift as Uncle Sam with text promoting Trump. The incident sparked Swift’s fears about the spread of misinformation through AI.
This isn’t the first time Swift and generative AI have appeared together in the news. In February, we reported that a flood of explicit AI-generated images of Swift originated from a 4chan message board where users took part in daily challenges to bypass AI image generator filters.
On a Friday evening last November, police chased a silver sedan across the San Francisco Bay Bridge. The fleeing vehicle entered San Francisco and went careening through the city’s crowded streets. At the intersection of 11th and Folsom streets, it sideswiped the fronts of two other vehicles, veered onto a sidewalk, and hit two pedestrians.
According to a local news story, both pedestrians were taken to the hospital with one suffering major injuries. The driver of the silver sedan was injured, as was a passenger in one of the other vehicles.
No one was injured in the third car, a driverless Waymo robotaxi. Still, Waymo was required to report the crash to government agencies. It was one of 20 crashes with injuries that Waymo has reported through June. And it’s the only crash Waymo has classified as causing a serious injury.
Twenty injuries might sound like a lot, but Waymo’s driverless cars have traveled more than 22 million miles. So driverless Waymo taxis have been involved in fewer than one injury-causing crash for every million miles of driving—a much better rate than a typical human driver.
Last week Waymo released a new website to help the public put statistics like this in perspective. Waymo estimates that typical drivers in San Francisco and Phoenix—Waymo’s two biggest markets—would have caused 64 crashes over those 22 million miles. So Waymo vehicles get into injury-causing crashes less than one-third as often, per mile, as human-driven vehicles.
Waymo claims an even more dramatic improvement for crashes serious enough to trigger an airbag. Driverless Waymos have experienced just five crashes like that, and Waymo estimates that typical human drivers in Phoenix and San Francisco would have experienced 31 airbag crashes over 22 million miles. That implies driverless Waymos are one-sixth as likely as human drivers to experience this type of crash.
The new data comes at a critical time for Waymo, which is rapidly scaling up its robotaxi service. A year ago, Waymo was providing 10,000 rides per week. Last month, Waymo announced it was providing 100,000 rides per week. We can expect more growth in the coming months.
So it really matters whether Waymo is making our roads safer or more dangerous. And all the evidence so far suggests that it’s making them safer.
It’s not just the small number of crashes Waymo vehicles experience—it’s also the nature of those crashes. Out of the 23 most serious Waymo crashes, 16 involved a human driver rear-ending a Waymo. Three others involved a human-driven car running a red light before hitting a Waymo. There were no serious crashes where a Waymo ran a red light, rear-ended another car, or engaged in other clear-cut misbehavior.
Digging into Waymo’s crashes
In total, Waymo has reported nearly 200 crashes through June 2024, which works out to about one crash every 100,000 miles. Waymo says 43 percent of crashes across San Francisco and Phoenix had a delta-V of less than 1 mph—in other words, they were very minor fender-benders.
But let’s focus on the 23 most severe crashes: those that either caused an injury, caused an airbag to deploy, or both. These are good crashes to focus on not only because they do the most damage but because human drivers are more likely to report these types of crashes, making it easier to compare Waymo’s software to human drivers.
Most of these—16 crashes in total—involved another car rear-ending a Waymo. Some were quite severe: three triggered airbag deployments, and one caused a “moderate” injury. One vehicle rammed the Waymo a second time as it fled the scene, prompting Waymo to sue the driver.
There were three crashes where a human-driven car ran a red light before crashing into a Waymo:
One was the crash I mentioned at the top of this article. A car fleeing the police ran a red light and slammed into a Waymo, another car, and two pedestrians, causing several injuries.
In San Francisco, a pair of robbery suspects fleeing police in a stolen car ran a red light “at a high rate of speed” and slammed into the driver’s side door of a Waymo, triggering an airbag. The suspects were uninjured and fled on foot. The Waymo was thankfully empty.
In Phoenix, a car ran a red light and then “made contact with the SUV in front of the Waymo AV, and both of the other vehicles spun.” The Waymo vehicle was hit in the process, and someone in one of the other vehicles suffered an injury Waymo described as minor.
There were two crashes where a Waymo got sideswiped by a vehicle in an adjacent lane:
In San Francisco, Waymo was stopped at a stop sign in the right lane when another car hit the Waymo while passing it on the left.
In Tempe, Arizona, an SUV “overtook the Waymo AV on the left” and then “initiated a right turn,” cutting the Waymo off and causing a crash. A passenger in the SUV said they suffered moderate injuries.
Finally, there were two crashes where another vehicle turned left across the path of a Waymo vehicle:
In San Francisco, a Waymo and a large truck were approaching an intersection from opposite directions when a bicycle behind the truck made a sudden left in front of the Waymo. Waymo says the truck blocked Waymo’s vehicle from seeing the bicycle until the last second. The Waymo slammed on its brakes but wasn’t able to stop in time. The San Francisco Fire Department told local media that the bicyclist suffered only minor injuries and was able to leave the scene on their own.
A Waymo in Phoenix was traveling in the right lane. A row of stopped cars was in the lane to its left. As Waymo approached an intersection, a car coming from the opposite direction made a left turn through a gap in the row of stopped cars. Again, Waymo says the row of stopped cars blocked it from seeing the turning car until it was too late. A passenger in the turning vehicle reported minor injuries.
It’s conceivable that Waymo was at fault in these last two cases—it’s impossible to say without more details. It’s also possible that Waymo’s erratic braking contributed to a few of those rear-end crashes. Still, it seems clear that a non-Waymo vehicle bore primary responsibility for most, and possibly all, of these crashes.
“About as good as you can do”
One should always be skeptical when a company publishes a self-congratulatory report about its own safety record. So I called Noah Goodall, a civil engineer with many years of experience studying roadway safety, to see what he made of Waymo’s analysis.
“They’ve been the best of the companies doing this,” Goodall told me. He noted that Waymo has a team of full-time safety researchers who publish their work in reputable journals.
Waymo knows precisely how often its own vehicles crash because its vehicles are bristling with sensors. The harder problem is calculating an appropriate baseline for human-caused crashes.
That’s partly because human drivers don’t always report their own crashes to the police, insurance companies, or anyone else. But it’s also because crash rates differ from one area to another. For example, there are far more crashes per mile in downtown San Francisco than in the suburbs of Phoenix.
Waymo tried to account for these factors as it calculated crash rates for human drivers in both Phoenix and San Francisco. To ensure an apples-to-apples comparison, Waymo’s analysis excludes freeway crashes from its human-driven benchmark, since Waymo’s commercial fleet doesn’t use freeways yet.
Waymo estimates that human drivers fail to report 32 percent of injury crashes; the company raised its benchmark for human crashes to account for that. But even without this under-reporting adjustment, Waymo’s injury crash rate would still be roughly 60 percent below that of human drivers. The true number is probably somewhere between the adjusted number (70 percent fewer crashes) and the unadjusted one (60 percent fewer crashes). It’s an impressive figure either way.
Waymo says it doesn’t apply an under-reporting adjustment to its human benchmark for airbag crashes, since humans almost always report crashes that are severe enough to trigger an airbag. So it’s easier to take Waymo’s figure here—an 84 percent decline in airbag crashes—at face value.
Waymo’s benchmarks for human drivers are “about as good as you can do,” Goodall told me. “It’s very hard to get this kind of data.”
When I talked to other safety experts, they were equally positive about the quality of Waymo’s analysis. For example, last year, I asked Phil Koopman, a professor of computer engineering at Carnegie Mellon, about a previous Waymo study that used insurance data to show its cars were significantly safer than human drivers. Koopman told me Waymo’s findings were statistically credible, with some minor caveats.
Similarly, David Zuby, the chief research officer at the Insurance Institute for Highway Safety, had mostly positive things to say about a December study analyzing Waymo’s first 7.1 million miles of driverless operations.
I found a few errors in Waymo’s data
If you look closely, you’ll see that one of the numbers in this article differs slightly from Waymo’s safety website. Specifically, Waymo says that its vehicles get into crashes that cause injury 73 percent less often than human drivers, while the figure I use in this article is 70 percent.
This is because I spotted a couple of apparent classification mistakes in the raw data Waymo used to generate its statistics.
Each time Waymo reports a crash to the National Highway Traffic Safety Administration, it records the severity of injuries caused by the crash. This can be fatal, serious, moderate, minor, none, or unknown.
When Waymo shared an embargoed copy of its numbers with me early last week, it said that there had been 16 injury crashes. However, when I looked at the data Waymo had submitted to federal regulators, it showed 15 minor injuries, two moderate injuries, and one serious injury, for a total of 18.
When I asked Waymo about this discrepancy, the company said it found a programming error. Waymo had recently started using a moderate injury category and had not updated the code that generated its crash statistics to count these crashes. Waymo fixed the error quickly enough that the official version Waymo published on Thursday of last week showed 18 injury crashes.
However, as I continued looking at the data, I noticed another apparent mistake: Two crashes had been put in the “unknown” injury category, yet the narrative for each crash indicated an injury had occurred. One report said “the passenger in the Waymo AV reported an unspecified injury.” The other stated that “an individual involved was transported from the scene to a hospital for medical treatment.”
I notified Waymo about this apparent mistake on Friday and they said they are looking into it. As I write this, the website still claims a 73 percent reduction in injury crashes. But I think it’s clear that these two “unknown” crashes were actually injury crashes. So, all of the statistics in this article are based on the full list of 20 injury crashes.
I think this illustrates that I come by my generally positive outlook on Waymo honestly: I probably scrutinize Waymo’s data releases more carefully than any other journalist, and I’m not afraid to point out when the numbers don’t add up.
Based on my conversations with Waymo, I’m convinced these were honest mistakes rather than deliberate efforts to cover up crashes. I could only identify these mistakes because Waymo went out of its way to make its findings reproducible. It would make no sense to do that if the company simultaneously tried to fake its statistics.
Could there be other injury or airbag-triggering crashes that Waymo isn’t counting? It’s certainly possible, but I doubt there have been very many. You might have noticed that I linked to local media reporting for some of Waymo’s most significant crashes. If Waymo deliberately covered up a severe crash, there would be a big risk that a crash would get reported in the media and then Waymo would have to explain to federal regulators why it wasn’t reporting all legally required crashes.
So, despite the screwups, I find Waymo’s data to be fairly credible, and those data show that Waymo’s vehicles crash far less often than human drivers on public roads.
Tim Lee was on staff at Ars from 2017 to 2021. Last year, he launched a newsletter, Understanding AI, that explores how AI works and how it’s changing our world. You can subscribe here.