AI

how-i-program-with-llms

How I program with LLMs

The second issue is that we can do better. I am happy we now live in a time when programmers write their own tests, but we do not hold ourselves to the same standards with tests as we do with production code. That is a reasonable trade-off; there are only so many hours in the day. But what LLMs lack in arithmetical prowess, they make up for in enthusiasm.

Let’s ask for an even better test.

In the tests, implement the simplest, most readable version of the standard code for quartiles over a fixed set of known values in a slice. Then pass the test cases through the standard code and the reservoir sampler and confirm they are within an epsilon of each other. Structure the comparison code such that it can be used in a fuzz test, too.

This got us some new test code:

// referenceQuartiles calculates the exact quartiles for a slice of float64 values
// using linear interpolation, matching the behavior expected from the sampler.
func referenceQuartiles(data []float64) (q1, median, q3 float64) { … }

// compareQuartiles checks if two sets of quartiles are within epsilon of each other.
// Returns true if they match within the tolerance, false otherwise.
func compareQuartiles(q1a, meda, q3a, q1b, medb, q3b, epsilon float64) bool { … }

// checkQuartiles is a test helper that compares sampler output against the reference
// implementation and reports any differences.
func checkQuartiles(t *testing.T, data []float64, epsilon float64) {
    t.Helper()

    // Get reference values
    wantQ1, wantMed, wantQ3 := referenceQuartiles(data)

    // Get sampler values using a large reservoir for accuracy
    qs := NewQuartileSampler(1000)
    for _, v := range data {
        qs.Add(v)
    }
    gotQ1, gotMed, gotQ3 := qs.Quartiles()

    if !compareQuartiles(gotQ1, gotMed, gotQ3, wantQ1, wantMed, wantQ3, epsilon) {
        t.Errorf("Quartiles mismatch:\ngot  (q1=%v, med=%v, q3=%v)\nwant (q1=%v, med=%v, q3=%v)\nepsilon=%v",
            gotQ1, gotMed, gotQ3, wantQ1, wantMed, wantQ3, epsilon)
    }
}
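For reference, the “standard code for quartiles over a fixed set of known values” that the prompt asks for typically looks like the textbook linear-interpolation calculation below. This is a generic sketch, not the code the LLM produced (its bodies are elided above), and the package and function names here are made up for illustration.

package sampler

import "sort"

// quantile returns the q-th quantile (0 <= q <= 1) of data using linear
// interpolation between the two nearest ranks, the same scheme most
// spreadsheet percentile functions use.
func quantile(data []float64, q float64) float64 {
    sorted := append([]float64(nil), data...)
    sort.Float64s(sorted)
    if len(sorted) == 0 {
        return 0
    }
    pos := q * float64(len(sorted)-1)
    lo := int(pos)
    if lo >= len(sorted)-1 {
        return sorted[len(sorted)-1]
    }
    frac := pos - float64(lo)
    return sorted[lo] + frac*(sorted[lo+1]-sorted[lo])
}

// referenceQuartilesSketch shows how a reference implementation could be
// assembled from the quantile helper above.
func referenceQuartilesSketch(data []float64) (q1, median, q3 float64) {
    return quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.75)
}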

The original test from above has been reworked to use checkQuartiles, and we have something new:

func FuzzQuartileSampler(f *testing.F) {
    // Add some seed corpus
    f.Add([]float64{1, 2, 3, 4, 5})

    f.Fuzz(func(t *testing.T, data []float64) {
        // Use a larger epsilon for fuzzing since we might get more extreme values
        checkQuartiles(t, data, 0.2)
    })
}

This is fun because it’s wrong. My running gopls tool immediately says:

fuzzing arguments can only have the following types:
    string, bool, float32, float64,
    int, int8, int16, int32, int64,
    uint, uint8, uint16, uint32, uint64,
    []byte

Pasting that error back into the LLM gets it to regenerate the fuzz test such that it is built around a func(t *testing.T, data []byte) function that uses math.Float64frombits to extract floats from the data slice. Interactions like this point us toward automating the feedback from tools; all it needed was the obvious error message to make solid progress toward something useful. I was not needed.
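The regenerated test isn’t reproduced here, but a minimal sketch of what the []byte-based version might look like follows. The little-endian decoding, the NaN/infinity filtering, and the package name are assumptions made for illustration; only checkQuartiles comes from the code above.

package sampler

import (
    "encoding/binary"
    "math"
    "testing"
)

func FuzzQuartileSampler(f *testing.F) {
    // Seed corpus: raw bytes that decode to the values 1 through 5.
    var seed []byte
    for _, v := range []float64{1, 2, 3, 4, 5} {
        seed = binary.LittleEndian.AppendUint64(seed, math.Float64bits(v))
    }
    f.Add(seed)

    f.Fuzz(func(t *testing.T, data []byte) {
        // Decode the fuzzer's arbitrary bytes into float64 values, eight
        // bytes at a time, discarding NaNs and infinities.
        var vals []float64
        for len(data) >= 8 {
            v := math.Float64frombits(binary.LittleEndian.Uint64(data[:8]))
            data = data[8:]
            if !math.IsNaN(v) && !math.IsInf(v, 0) {
                vals = append(vals, v)
            }
        }
        if len(vals) == 0 {
            t.Skip("not enough bytes for a single float64")
        }
        // Use a larger epsilon for fuzzing since we might get more extreme values.
        checkQuartiles(t, vals, 0.2)
    })
}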

A quick survey of the last few weeks of my LLM chat history (which, as I mentioned earlier, is not a proper quantitative analysis by any measure) shows that when there is a tooling error, the LLM can make useful progress without me adding any insight more than 80 percent of the time. About half the time, it can completely resolve the issue without me saying anything of note; I am just acting as the messenger.

How I program with LLMs Read More »

apple-will-update-ios-notification-summaries-after-bbc-headline-mistake

Apple will update iOS notification summaries after BBC headline mistake

Nevertheless, it’s a serious problem when the summaries misrepresent news headlines, and edge cases where this occurs are unfortunately inevitable. Apple cannot simply fix these summaries with a software update. The only answers are either to help users understand the drawbacks of the technology so they can make better-informed judgments or to remove or disable the feature completely. Apple is apparently going for the former.

We’re oversimplifying a bit here, but generally, LLMs like those used for Apple’s notification summaries work by predicting portions of words based on what came before and are not capable of truly understanding the content they’re summarizing.

Further, these predictions are not accurate all the time, with incorrect results occurring a few times per 100 or 1,000 outputs. As the models are trained and improved, the error rate may shrink, but it never reaches zero, and with countless summaries being produced every day, some errors are inevitable.

Deploying this technology at scale without users (or even the BBC, it seems) really understanding how it works is risky at best, whether it’s with the iPhone’s summaries of news headlines in notifications or Google’s AI summaries at the top of search engine results pages. Even if the vast majority of summaries are perfectly accurate, there will always be some users who see inaccurate information.

These summaries are read by so many millions of people that the sheer scale of errors will remain a problem almost no matter how accurate the models become.

We wrote at length a few weeks ago about how the Apple Intelligence rollout seemed rushed, counter to Apple’s usual focus on quality and user experience. However, with current technology, there is no amount of refinement to this feature that Apple could have done to reach a zero percent error rate with these notification summaries.

We’ll see how well Apple does making its users understand that the summaries may be wrong, but making all iPhone users truly grok how and why the feature works this way would be a tall order.

Apple will update iOS notification summaries after BBC headline mistake Read More »

instagram-users-discover-old-ai-powered-“characters,”-instantly-revile-them

Instagram users discover old AI-powered “characters,” instantly revile them

A little over a year ago, Meta created Facebook and Instagram profiles for “28 AIs with unique interests and personalities for you to interact with and dive deeper into your interests.” Today, the last of those profiles is being taken down amid waves of viral revulsion as word of their existence has spread online.

The September 2023 launch of Meta’s social profiles for AI characters was announced alongside a much splashier initiative that created animated AI chatbots with celebrity avatars. Those celebrity-based AI chatbots were unceremoniously scrapped less than a year later amid a widespread lack of interest.

But roughly a dozen of the unrelated AI character profiles still remained accessible as of this morning via social media pages labeled as “AI managed by Meta.” Those profiles—which included a mix of AI-generated imagery and human-created content, according to Meta—also offered real users the ability to live chat with these AI characters via Instagram Direct or Facebook Messenger.

Now that we know it exists, we hate it

The “Mama Liv” AI-generated character account page, as it appeared on Instagram Friday morning.

For the last few months, these profiles have continued to exist in something of a state of benign neglect, with little in the way of new posts and less in the way of organic interest from other Meta users. That started to change last week, though, after Financial Times published a report on Meta’s vision for “social media filled with AI-generated users.”

As Meta VP of Product for Generative AI Connor Hayes told FT, “We expect these AIs to actually, over time, exist on our platforms, kind of in the same way that accounts do… They’ll have bios and profile pictures and be able to generate and share content powered by AI on the platform. That’s where we see all of this going.”

Instagram users discover old AI-powered “characters,” instantly revile them Read More »

anthropic-gives-court-authority-to-intervene-if-chatbot-spits-out-song-lyrics

Anthropic gives court authority to intervene if chatbot spits out song lyrics

Anthropic did not immediately respond to Ars’ request for comment on how its guardrails currently work to prevent the alleged jailbreaks, but in accepting the deal, publishers appear satisfied with those guardrails.

Whether AI training on lyrics is infringing remains unsettled

Now, the matter of whether Anthropic has strong enough guardrails to block allegedly harmful outputs is settled, Lee wrote, allowing the court to focus on arguments regarding “publishers’ request in their Motion for Preliminary Injunction that Anthropic refrain from using unauthorized copies of Publishers’ lyrics to train future AI models.”

Anthropic said in its motion opposing the preliminary injunction that relief should be denied.

“Whether generative AI companies can permissibly use copyrighted content to train LLMs without licenses,” Anthropic’s court filing said, “is currently being litigated in roughly two dozen copyright infringement cases around the country, none of which has sought to resolve the issue in the truncated posture of a preliminary injunction motion. It speaks volumes that no other plaintiff—including the parent company record label of one of the Plaintiffs in this case—has sought preliminary injunctive relief from this conduct.”

In a statement, Anthropic’s spokesperson told Ars that “Claude isn’t designed to be used for copyright infringement, and we have numerous processes in place designed to prevent such infringement.”

“Our decision to enter into this stipulation is consistent with those priorities,” Anthropic said. “We continue to look forward to showing that, consistent with existing copyright law, using potentially copyrighted material in the training of generative AI models is a quintessential fair use.”

This suit will likely take months to fully resolve, as the question of whether AI training is a fair use of copyrighted works is complex and remains hotly disputed in court. For Anthropic, the stakes could be high, with a loss potentially triggering more than $75 million in fines, as well as an order possibly forcing Anthropic to reveal and destroy all the copyrighted works in its training data.

Anthropic gives court authority to intervene if chatbot spits out song lyrics Read More »

someone-made-a-captcha-where-you-play-doom-on-nightmare-difficulty

Someone made a CAPTCHA where you play Doom on Nightmare difficulty

It’s a WebAssembly application, but it was made via a natural-language, prompt-driven web development tool called v0, which is part of Vercel, a cloud-based developer tool service of which Rauch is the CEO. You can see the LLM bot chat history with the series of prompts that produced this CAPTCHA game on the v0 website.

Strangely enough, there has been a past attempt at making a Doom CAPTCHA. In 2021, developer Miquel Camps Orteza made an approximation of one—though not all the assets matched Doom, and it was more Doom-adjacent. That one was made directly by hand, and its source code is available on GitHub. Its developer noted that it’s not secure; it’s just for fun.

Rauch’s attempt is no more serious as a CAPTCHA, but it at least resembles Doom more closely.

Don’t expect to be playing this to verify at real websites anytime soon, though. It’s not secure, and its legality is fuzzy at best. While the code for Doom is open source, the assets from the game like enemy sprites and environment textures—which feature prominently in this application—are not.

Someone made a CAPTCHA where you play Doom on Nightmare difficulty Read More »

evolution-journal-editors-resign-en-masse

Evolution journal editors resign en masse



Board members expressed concerns over high fees, editorial independence, and use of AI in editorial processes.

Over the holiday weekend, all but one member of the editorial board of Elsevier’s Journal of Human Evolution (JHE) resigned “with heartfelt sadness and great regret,” according to Retraction Watch, which helpfully provided an online PDF of the editors’ full statement. It’s the 20th mass resignation from a science journal since 2023 over various points of contention, per Retraction Watch, many in response to controversial changes in the business models used by the scientific publishing industry.

“This has been an exceptionally painful decision for each of us,” the board members wrote in their statement. “The editors who have stewarded the journal over the past 38 years have invested immense time and energy in making JHE the leading journal in paleoanthropological research and have remained loyal and committed to the journal and our authors long after their terms ended. The [associate editors] have been equally loyal and committed. We all care deeply about the journal, our discipline, and our academic community; however, we find we can no longer work with Elsevier in good conscience.”

The editorial board cited several changes made over the last ten years that it believes are counter to the journal’s longstanding editorial principles. These included eliminating support for a copy editor and a special issues editor, leaving it to the editorial board to handle those duties. When the board expressed the need for a copy editor, Elsevier’s response, they said, was “to maintain that the editors should not be paying attention to language, grammar, readability, consistency, or accuracy of proper nomenclature or formatting.”

There is also a major restructuring of the editorial board underway that aims to reduce the number of associate editors by more than half, which “will result in fewer AEs handling far more papers, and on topics well outside their areas of expertise.”

Furthermore, there are plans to create a third-tier editorial board that functions largely in a figurehead capacity, after Elsevier “unilaterally took full control” of the board’s structure in 2023 by requiring all associate editors to renew their contracts annually—which the board believes undermines its editorial independence and integrity.

Worst practices

In-house production has been reduced or outsourced, and in 2023 Elsevier began using AI during production without informing the board, resulting in many style and formatting errors, as well as reverting versions of papers that had already been accepted and formatted by the editors. “This was highly embarrassing for the journal and resolution took six months and was achieved only through the persistent efforts of the editors,” the editors wrote. “AI processing continues to be used and regularly reformats submitted manuscripts to change meaning and formatting and require extensive author and editor oversight during proof stage.”

In addition, the author page charges for JHE are significantly higher than even Elsevier’s other for-profit journals, as well as broad-based open access journals like Scientific Reports. Not many of the journal’s authors can afford those fees, “which runs counter to the journal’s (and Elsevier’s) pledge of equality and inclusivity,” the editors wrote.

The breaking point seems to have come in November, when Elsevier informed co-editors Mark Grabowski (Liverpool John Moores University) and Andrea Taylor (Touro University California College of Osteopathic Medicine) that it was ending the dual-editor model that has been in place since 1986. When Grabowski and Taylor protested, they were told the model could only remain if they took a 50 percent cut in their compensation.

Elsevier has long had its share of vocal critics (including our own Chris Lee) and this latest development has added fuel to the fire. “Elsevier has, as usual, mismanaged the journal and done everything they could to maximize profit at the expense of quality,” biologist PZ Myers of the University of Minnesota Morris wrote on his blog Pharyngula. “In particular, they decided that human editors were too expensive, so they’re trying to do the job with AI. They also proposed cutting the pay for the editor-in-chief in half. Keep in mind that Elsevier charges authors a $3990 processing fee for each submission. I guess they needed to improve the economics of their piratical mode of operation a little more.”

Elsevier has not yet responded to Ars’ request for comment; we will update accordingly should a statement be issued.

Not all AI uses are created equal

John Hawks, an anthropologist at the University of Wisconsin, Madison, who has published 17 papers in JHE over his career, expressed his full support for the board members’ decision on his blog, along with shock at the (footnoted) revelation that Elsevier had introduced AI to its editorial process in 2023. “I’ve published four articles in the journal during the last two years, including one in press now, and if there was any notice to my co-authors or me about an AI production process, I don’t remember it,” he wrote, noting that the move violates the journal’s own AI policies. “Authors should be informed at the time of submission how AI will be used in their work. I would have submitted elsewhere if I was aware that AI would potentially be altering the meaning of the articles.”

There is certainly cause for concern when it comes to using AI in the pursuit of science. For instance, earlier this year, we witnessed the viral sensation of several egregiously bad AI-generated figures published in a peer-reviewed article in Frontiers, a reputable scientific journal. Scientists on social media expressed equal parts shock and ridicule at the images, one of which featured a rat with grotesquely large and bizarre genitals. The paper has since been retracted, but the incident reinforces a growing concern that AI will make published scientific research less trustworthy, even as it increases productivity.

That said, there are also some useful applications of AI in the scientific endeavor. For instance, back in January, the research publisher Science announced that all of its journals would begin using commercial software that automates the process of detecting improperly manipulated images. Perhaps that would have caught the egregious rat genitalia figure, although as Ars Science Editor John Timmer pointed out at the time, the software has limitations. “While it will catch some of the most egregious cases of image manipulation, enterprising fraudsters can easily avoid being caught if they know how the software operates,” he wrote.

Hawks acknowledged on his blog that the use of AI by scientists and scientific journals is likely inevitable and even recognized the potential benefits. “I don’t think this is a dystopian future. But not all uses of machine learning are equal,” he wrote. To wit:

[I]t’s bad for anyone to use AI to reduce or replace the scientific input and oversight of people in research—whether that input comes from researchers, editors, reviewers, or readers. It’s stupid for a company to use AI to divert experts’ effort into redundant rounds of proofreading, or to make disseminating scientific work more difficult.

In this case, Elsevier may have been aiming for good but instead hit the exacta of bad and stupid. It’s especially galling that they demand transparency from authors but do not provide transparency about their own processes… [I]t would be a very good idea for authors of recent articles to make sure that they have posted a preprint somewhere, so that their original pre-AI version will be available for readers. As the editors lose access, corrections to published articles may become difficult or impossible.

Nature published an article back in March raising questions about the efficacy of mass resignations as an emerging form of protest after all the editors of the Wiley-published linguistics journal Syntax resigned in February. (Several of their concerns mirror those of the JHE editorial board.) Such moves certainly garner attention, but even former Syntax editor Klaus Abels of University College London told Nature that the objective of such mass resignations should be to move beyond mere protest and focus instead on establishing new independent nonprofit journals for the academic community that are open access and have high academic standards.

Abels and his former Syntax colleagues are in the process of doing just that, following the example of the former editors of Critical Public Health and another Elsevier journal, NeuroImage, last year.


Jennifer is a senior reporter at Ars Technica with a particular focus on where science meets culture, covering everything from physics and related interdisciplinary topics to her favorite films and TV series. Jennifer lives in Baltimore with her spouse, physicist Sean M. Carroll, and their two cats, Ariel and Caliban.

Evolution journal editors resign en masse Read More »

tech-worker-movements-grow-as-threats-of-rto,-ai-loom

Tech worker movements grow as threats of RTO, AI loom


Advocates say tech workers movements got too big to ignore in 2024.

Credit: Aurich Lawson | Getty Images

It feels like tech workers have caught very few breaks over the past several years, between ongoing mass layoffs, stagnating wages amid inflation, AI supposedly coming for jobs, and unpopular orders to return to office that, for many, threaten to disrupt work-life balance.

But in 2024, a potentially critical mass of tech workers seemed to reach a breaking point. As labor rights groups advocating for tech workers told Ars, these workers are banding together in sustained strong numbers and are either winning or appear tantalizingly close to winning better worker conditions at major tech companies, including Amazon, Apple, Google, and Microsoft.

In February, the industry-wide Tech Workers Coalition (TWC) noted that “the tech workers movement is far more expansive and impactful” than even labor rights advocates realized, noting that unionized tech workers have gone beyond early stories about Googlers marching in the streets and now “make the headlines on a daily basis.”

Ike McCreery, a TWC volunteer and ex-Googler who helped found the Alphabet Workers Union, told Ars that although “it’s hard to gauge numerically” how much movements have grown, “our sense is definitely that the momentum continues to build.”

“It’s been an exciting year,” McCreery told Ars, while expressing particular enthusiasm that even “highly compensated tech workers are really seeing themselves more as workers” in these fights—which TWC “has been pushing for a long time.”

In 2024, TWC broadened efforts to help workers organize industry-wide, helping everyone from gig workers to project managers build both union and non-union efforts to push for change in the workplace.

Such widespread organizing “would have been unthinkable only five years ago,” TWC noted in February, and it’s clear from some of 2024’s biggest wins that some movements are making gains that could further propel that momentum in 2025.

Workers could also gain the upper hand if unpopular policies increase what one November study called “brain drain.” That’s a trend where tech companies adopting potentially alienating workplace tactics risk losing top talent at a time when key industries like AI and cybersecurity are facing severe talent shortages.

Advocates told Ars that unpopular policies have always fueled workers movements, and RTO and AI are just the latest adding fuel to the fire. As many workers prepare to head back to offices in 2025 where worker surveillance is only expected to intensify, they told Ars why they expect to see workers’ momentum continue at some of the world’s biggest tech firms.

Tech worker movements growing

In August, workers at America’s first unionized Apple Store ratified a labor contract with Apple, which agreed to a modest increase in wages of about 10 percent over three years. While small, that win came just a few weeks before the National Labor Relations Board (NLRB) determined that Amazon was a joint employer of unionized contract-based delivery drivers. And Google lost a similar fight last January when the NLRB ruled it must bargain with a union representing YouTube Music contract workers, Reuters reported.

For many workers, joining these movements helped raise wages. In September, facing mounting pressure, Amazon raised warehouse worker wages—investing $2.2 billion, its “biggest investment yet,” to broadly raise base salaries for workers. And more recently, Amazon was hit with a strike during the busy holiday season, as warehouse workers hoped to further hobble the company during a clutch financial quarter to force more bargaining. (Last year, Amazon posted record-breaking $170 billion holiday quarter revenues and has said the current strike won’t hurt revenues.)

Even typically union-friendly Microsoft drew worker backlash and criticism in 2024 following layoffs of 650 video game workers in September.

These mass layoffs are driving some workers to join movements. A senior director for organizing with Communications Workers of America (CWA), Tom Smith, told Ars that shortly after the 600-member Tech Guild—”the largest single certified group of tech workers” to organize at the New York Times—reached a tentative deal to increase wages “up to 8.25 percent over the length of the contract,” about “460 software engineers at a video game company owned by Microsoft successfully unionized.”

Smith told Ars that while workers for years have pushed for better conditions, “these large units of tech workers achieving formal recognition, building lasting organization, and winning contracts” at “a more mass scale” are maturing, following in the footsteps of unionizing Googlers and today influencing a broader swath of tech industry workers nationwide. From CWA’s viewpoint, workers in the video game industry seem best positioned to seek major wins next, Smith suggested, likely starting with Microsoft-owned companies and eventually affecting indie game companies.

CWA, TWC, and Tech Workers Union 1010 (a group run by tech workers that’s part of the Office and Professional Employees International Union) all now serve as dedicated groups supporting workers movements long-term, and that stability has helped these movements mature, McCreery told Ars. Each group plans to continue meeting workers where they are to support and help expand organizing in 2025.

Cost of RTOs may be significant, researchers warn

While layoffs likely remain the most extreme threat to tech workers broadly, a return-to-office (RTO) mandate can be just as jarring for remote tech workers who are either unable to comply or else unwilling to give up the better work-life balance that comes with no commute. Advocates told Ars that RTO policies have pushed workers to join movements, while limited research suggests that companies risk losing top talents by implementing RTO policies.

In perhaps the biggest example from 2024, when Amazon announced that it was requiring workers in the office five days a week next year, a poll on Blind, the anonymous platform where workers discuss employers, found that an overwhelming majority of more than 2,000 Amazon employees were “dissatisfied.”

“My morale for this job is gone…” one worker said on Blind.

Workers criticized the “non-data-driven logic” of the RTO mandate, prompting an Amazon executive to remind them that they could take their talents elsewhere if they didn’t like it. Many confirmed that’s exactly what they planned to do. (Amazon later announced it would be delaying RTO for many office workers after belatedly realizing there was a lack of office space.)

Other companies mandating RTO faced similar backlash from workers, who continued to question the logic driving the decision. One February study showed that RTO mandates don’t make companies any more valuable but do make workers more miserable. And last month, Brian Elliott, an executive advisor who wrote a book about the benefits of flexible teams, noted that only one in three executives thinks RTO had “even a slight positive impact on productivity.”

But not every company drew a hard line the way that Amazon did. For example, Dell gave workers a choice: remain remote and accept that they would never be eligible for promotions, or mark themselves as hybrid. Workers who refused the RTO said they valued their free time and admitted to looking for other job opportunities.

Very few studies have been done analyzing the true costs and benefits of RTO, a November academic study titled “Return to Office and Brain Drain” said, and so far companies aren’t necessarily backing the limited findings. The researchers behind that study noted that “the only existing study” measuring how RTO impacts employee turnover showed this year that senior employees left for other companies after Microsoft’s RTO mandate, but Microsoft disputed that finding.

Seeking to build on this research, the November study tracked “over 3 million tech and finance workers’ employment histories reported on LinkedIn” and analyzed “the effect of S&P 500 firms’ return-to-office (RTO) mandates on employee turnover and hiring.”

Choosing to only analyze the firms requiring five days in office, the final sample covered 54 RTO firms, including big tech companies like Amazon, Apple, and Microsoft. From that sample, researchers concluded that average employee turnover increased by 14 percent after RTO mandates at bigger firms. And since big firms typically have lower turnover, the increase in turnover is likely larger at smaller firms, the study’s authors concluded.

The study also supported the conclusion that “employees with the highest skill level are more likely to leave” and found that “RTO firms take significantly longer time to fill their job vacancies after RTO mandates.”

“Together, our evidence suggests that RTO mandates are costly to firms and have serious negative effects on the workforce,” the study concluded, echoing some remote workers’ complaints about the seemingly non-data-driven logic of RTO, while urging that further research is needed.

“These turnovers could potentially have short-term and long-term effects on operation, innovation, employee morale, and organizational culture,” the study concluded.

A co-author of the “brain drain” study, Mark Ma, told Ars that by contrast, Glassdoor going fully remote at least anecdotally seemed to “significantly” increase the number and quality of applications—possibly also improving retention by offering the remote flexibility that many top talents today require.

Ma said that next his team hopes to track where people who leave firms over RTO policies go next.

“Do they become self-employed, or do they go to a competitor, or do they fund their own firm?” Ma speculated, hoping to trace these patterns more definitively over the next several years.

Additionally, Ma plans to investigate individual firms’ RTO impacts, as well as impacts on niche classes of workers with highly sought-after skills—such as in areas like AI, machine learning, or cybersecurity—to see if it’s easier for them to find other jobs. In the long-term, Ma also wants to monitor for potentially less-foreseeable outcomes, such as RTO mandates possibly increasing firms’ number of challengers in their industry.

Will RTO mandates continue in 2025?

Many tech workers may be wondering if there will be a spike in return-to-office mandates in 2025, especially since one of the most politically influential figures in tech, Elon Musk, recently reiterated that he thinks remote work is “poison.”

Musk, of course, banned remote work at Tesla, as well as when he took over Twitter. And as co-lead of the US Department of Government Efficiency (DOGE), Musk reportedly plans to ban remote work for government employees, as well. If other tech firms are influenced by Musk’s moves and join executives who seem to be mandating RTO based on intuition, it’s possible that more tech workers could be forced to return to office or else seek other employment.

But Ma told Ars that he doesn’t expect to see “a big spike in the number of firms announcing return to office mandates” in 2025.

His team only found eight major firms in tech and finance that issued five-day return-to-office mandates in 2024, which was the same number of firms flagged in 2023, suggesting no major increase in RTOs from year to year. Ma told Ars that while big firms like Amazon ordering employees to return to the office made headlines, many firms seem to be continuing to embrace hybrid models, sometimes allowing employees to choose when or if they come into the office.

That seeming preference for hybrid work models aligns with the “future of work” surveys on workplace trends and employee preferences that the Consumer Technology Association (CTA) conducted for years but has seemingly since discontinued. In 2021, the CTA reported that “89 percent of tech executives say flexible work arrangements are the most important employee benefit and 65 percent say they’ll hire more employees to work remotely.” The next year, in what was apparently the survey’s final edition, the CTA suggested hybrid models could help attract talent in a competitive market hit with “an unprecedented demand for workers with high-tech skills.”

The CTA did not respond to Ars’ requests to comment on whether it expects hybrid work arrangements to remain preferred over five-day return-to-office policies next year.

CWA’s Smith told Ars that workers movements are growing partly because “folks are engaged in this big fight around surveillance and workplace control,” as well as anything “having to do with to what extent will people return to offices and what does that look like if and when people do return to offices?”

Without data backing RTO mandates, Ma’s study suggests that firms will struggle to retain highly skilled workers at a time when tech innovation remains a top priority for the US. As workers appear increasingly put off by policies—like RTO or AI-driven workplace monitoring or efficiency efforts threatening to replace workers with AI—Smith’s experience seems to show that disgruntled workers could find themselves drawn to unions that could help them claw back control over work-life balance. And the cost of the ensuing shuffle to some of the largest tech firms in the world could be “significant,” Ma’s study warned.

TWC’s McCreery told Ars that on top of unpopular RTO policies driving workers to join movements, workers have also become more active in protesting unpopular politics, frustrated to see their talents apparently used to further controversial conflicts and military efforts globally. Some workers think workplace organizing could be more powerful than voting to oppose political actions their companies take.

“The workplace really remains an important site of power for a lot of people where maybe they don’t feel like they can enact their values just by voting or in other ways,” McCreery said.

While unpopular policies “have always been a reason workers have joined unions and joined movements,” McCreery said that “the development of more of these unpopular policies” like RTO and AI-enhanced surveillance “really targeted” at workers has increased “the political consciousness and the sense” that tech workers are “just like any other workers.”

Layoffs at companies like Microsoft and Amazon during periods when revenue is increasing in the double-digits also unify workers, advocates told Ars. Forbes noted Microsoft laid off 1,000 workers “just five days before reporting a 17.6 percent increase in revenue to $62 billion,” while Amazon’s 1,000-worker layoffs followed a 14 percent rise in revenue to $170 billion. And demand for AI led to the highest profit margins Amazon’s seen for its cloud business in a decade, CNBC reported in October.

CWA’s Smith told Ars that as companies continue to rake in profits while workers feel their work-life balance slipping away and see their efforts potentially “used to increase control and cause broader suffering,” some of the biggest fights workers raised in 2024 may intensify next year.

“It’s like a shock to employees, these industries pushing people to lower your expectations because we’re going to lay off hundreds of thousands of you just because we can while we make more profits than we ever have,” Smith said. “I think workers are going to step into really broad campaigns to assert a different worldview on employment security.”


Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Tech worker movements grow as threats of RTO, AI loom Read More »

2024:-the-year-ai-drove-everyone-crazy

2024: The year AI drove everyone crazy


What do eating rocks, rat genitals, and Willy Wonka have in common? AI, of course.

It’s been a wild year in tech thanks to the intersection between humans and artificial intelligence. 2024 brought a parade of AI oddities, mishaps, and wacky moments that inspired odd behavior from both machines and man. From AI-generated rat genitals to search engines telling people to eat rocks, this year proved that AI has been having a weird impact on the world.

Why the weirdness? If we had to guess, it may be due to the novelty of it all. Generative AI and applications built upon Transformer-based AI models are still so new that people are throwing everything at the wall to see what sticks. People have been struggling to grasp both the implications and potential applications of the new technology. Riding along with the hype, different types of AI that may end up being ill-advised, such as automated military targeting systems, have also been introduced.

It’s worth mentioning that, aside from the crazy news, we saw some notable AI advances in 2024 as well. For example, Claude 3.5 Sonnet, launched in June, held off the competition as a top model for most of the year, while OpenAI’s o1 used runtime compute to expand GPT-4o’s capabilities with simulated reasoning. Advanced Voice Mode and NotebookLM also emerged as novel applications of AI tech, and the year saw the rise of more capable music synthesis models and also better AI video generators, including several from China.

But for now, let’s get down to the weirdness.

ChatGPT goes insane

Illustration of a broken toy robot.

Early in the year, things got off to an exciting start when OpenAI’s ChatGPT experienced a significant technical malfunction that caused the AI model to generate increasingly incoherent responses, prompting users on Reddit to describe the system as “having a stroke” or “going insane.” During the glitch, ChatGPT’s responses would begin normally but then deteriorate into nonsensical text, sometimes mimicking Shakespearean language.

OpenAI later revealed that a bug in how the model processed language caused it to select the wrong words during text generation, leading to nonsense outputs (basically the text version of what we at Ars now call “jabberwockies”). The company fixed the issue within 24 hours, but the incident led to frustrations about the black box nature of commercial AI systems and users’ tendency to anthropomorphize AI behavior when it malfunctions.

The great Wonka incident


A photo of “Willy’s Chocolate Experience” (inset), which did not match AI-generated promises, shown in the background. Credit: Stuart Sinclair

The collision between AI-generated imagery and consumer expectations fueled human frustrations in February when Scottish families discovered that “Willy’s Chocolate Experience,” an unlicensed Wonka-ripoff event promoted using AI-generated wonderland images, turned out to be little more than a sparse warehouse with a few modest decorations.

Parents who paid £35 per ticket encountered a situation so dire they called the police, with children reportedly crying at the sight of a person in what attendees described as a “terrifying outfit.” The event, created by House of Illuminati in Glasgow, promised fantastical spaces like an “Enchanted Garden” and “Twilight Tunnel” but delivered an underwhelming experience that forced organizers to shut down mid-way through its first day and issue refunds.

While the show was a bust, it brought us an iconic new meme for job disillusionment in the form of a photo: the green-haired Willy’s Chocolate Experience employee who looked like she’d rather be anywhere else on earth at that moment.

Mutant rat genitals expose peer review flaws

An actual laboratory rat, who is intrigued. Credit: Getty | Photothek

In February, Ars Technica senior health reporter Beth Mole covered a peer-reviewed paper published in Frontiers in Cell and Developmental Biology that created an uproar in the scientific community when researchers discovered it contained nonsensical AI-generated images, including an anatomically incorrect rat with oversized genitals. The paper, authored by scientists at Xi’an Honghui Hospital in China, openly acknowledged using Midjourney to create figures that contained gibberish text labels like “Stemm cells” and “iollotte sserotgomar.”

The publisher, Frontiers, posted an expression of concern about the article titled “Cellular functions of spermatogonial stem cells in relation to JAK/STAT signaling pathway” and launched an investigation into how the obviously flawed imagery passed through peer review. Scientists across social media platforms expressed dismay at the incident, which mirrored concerns about AI-generated content infiltrating academic publishing.

Chatbot makes erroneous refund promises for Air Canada

If, say, ChatGPT gives you the wrong name for one of the seven dwarves, it’s not such a big deal. But in February, Ars senior policy reporter Ashley Belanger covered a case of costly AI confabulation in the wild. In the course of online text conversations, Air Canada’s customer service chatbot told customers inaccurate refund policy information. The airline later faced legal consequences when a tribunal ruled that it must honor commitments made by the automated system. Tribunal adjudicator Christopher Rivers determined that Air Canada bore responsibility for all information on its website, regardless of whether it came from a static page or an AI interface.

The case set a precedent for how companies deploying AI customer service tools could face legal obligations for automated systems’ responses, particularly when they fail to warn users about potential inaccuracies. Ironically, the airline had reportedly spent more on the initial AI implementation than it would have cost to maintain human workers for simple queries, according to Air Canada executive Mel Crocker.

Will Smith lampoons his digital double


The real Will Smith eating spaghetti, parodying an AI-generated video from 2023. Credit: Will Smith / Getty Images / Benj Edwards

In March 2023, a terrible AI-generated video of Will Smith’s AI doppelganger eating spaghetti began making the rounds online. The AI-generated version of the actor gobbled down the noodles in an unnatural and disturbing way. Almost a year later, in February 2024, Will Smith himself posted a parody response video to the viral jabberwocky on Instagram, featuring AI-like deliberately exaggerated pasta consumption, complete with hair-nibbling and finger-slurping antics.

Given the rapid evolution of AI video technology, particularly since OpenAI had just unveiled its Sora video model four days earlier, Smith’s post sparked discussion in his Instagram comments where some viewers initially struggled to distinguish between the genuine footage and AI generation. It was an early sign of “deep doubt” in action as the tech increasingly blurs the line between synthetic and authentic video content.

Robot dogs learn to hunt people with AI-guided rifles


A still image of a robotic quadruped armed with a remote weapons system, captured from a video provided by Onyx Industries. Credit: Onyx Industries

At some point in recent history—somewhere around 2022—someone took a look at robotic quadrupeds and thought it would be a great idea to attach guns to them. A few years later, the US Marine Forces Special Operations Command (MARSOC) began evaluating armed robotic quadrupeds developed by Ghost Robotics. The robot “dogs” integrated Onyx Industries’ SENTRY remote weapon systems, which featured AI-enabled targeting that could detect and track people, drones, and vehicles, though the systems require human operators to authorize any weapons discharge.

The military’s interest in armed robotic dogs followed a broader trend of weaponized quadrupeds entering public awareness. This included viral videos of consumer robots carrying firearms, and later, commercial sales of flame-throwing models. While MARSOC emphasized that weapons were just one potential use case under review, experts noted that the increasing integration of AI into military robotics raised questions about how long humans would remain in control of lethal force decisions.

Microsoft Windows AI is watching


A screenshot of Microsoft’s new “Recall” feature in action. Credit: Microsoft

In an era where many people already feel like they have no privacy due to tech encroachments, Microsoft dialed it up to an extreme degree in May. That’s when Microsoft unveiled a controversial Windows 11 feature called “Recall” that continuously captures screenshots of users’ PC activities every few seconds for later AI-powered search and retrieval. The feature, designed for new Copilot+ PCs using Qualcomm’s Snapdragon X Elite chips, promised to help users find past activities, including app usage, meeting content, and web browsing history.

While Microsoft emphasized that Recall would store encrypted snapshots locally and allow users to exclude specific apps or websites, the announcement raised immediate privacy concerns, as Ars senior technology reporter Andrew Cunningham covered. It also came with a technical toll, requiring significant hardware resources, including 256GB of storage space, with 25GB dedicated to storing approximately three months of user activity. After Microsoft pulled the initial test version due to public backlash, Recall later entered public preview in November with reportedly enhanced security measures. But secure spyware is still spyware—Recall, when enabled, still watches nearly everything you do on your computer and keeps a record of it.

Google Search told people to eat rocks

This is fine. Credit: Getty Images

In May, Ars senior gaming reporter Kyle Orland (who assisted commendably with the AI beat throughout the year) covered Google’s newly launched AI Overview feature. It faced immediate criticism when users discovered that it frequently provided false and potentially dangerous information in its search result summaries. Among its most alarming responses, the system advised humans could safely consume rocks, incorrectly citing scientific sources about the geological diet of marine organisms. The system’s other errors included recommending nonexistent car maintenance products, suggesting unsafe food preparation techniques, and confusing historical figures who shared names.

The problems stemmed from several issues, including the AI treating joke posts as factual sources and misinterpreting context from original web content. But most of all, the system relies on web results as indicators of authority, which we called a flawed design. While Google defended the system, stating these errors occurred mainly with uncommon queries, a company spokesperson acknowledged they would use these “isolated examples” to refine their systems. But to this day, AI Overview still makes frequent mistakes.

Stable Diffusion generates body horror


An AI-generated image created using Stable Diffusion 3 of a girl lying in the grass. Credit: HorneyMetalBeing

In June, Stability AI’s release of the image synthesis model Stable Diffusion 3 Medium drew criticism online for its poor handling of human anatomy in AI-generated images. Users across social media platforms shared examples of the model producing what we now like to call jabberwockies: AI generation failures with distorted bodies, misshapen hands, and surreal anatomical errors. Many in the AI image-generation community viewed the release as a significant step backward from previous image-synthesis capabilities.

Reddit users attributed these failures to Stability AI’s aggressive filtering of adult content from the training data, which apparently impaired the model’s ability to accurately render human figures. The troubled release coincided with broader organizational challenges at Stability AI, including the March departure of CEO Emad Mostaque, multiple staff layoffs, and the exit of three key engineers who had helped develop the technology. Some of those engineers founded Black Forest Labs in August and released Flux, which has become the latest open-weights AI image model to beat.

ChatGPT Advanced Voice imitates human voice in testing

An illustration of a computer synthesizer spewing out letters.

AI voice-synthesis models are master imitators these days, and they are capable of much more than many people realize. In August, we covered a story where OpenAI’s ChatGPT Advanced Voice Mode feature unexpectedly imitated a user’s voice during the company’s internal testing, revealed by OpenAI after the fact in safety testing documentation. To prevent future instances of an AI assistant suddenly speaking in your own voice (which, let’s be honest, would probably freak people out), the company created an output classifier system to prevent unauthorized voice imitation. OpenAI says that Advanced Voice Mode now catches all meaningful deviations from approved system voices.

Independent AI researcher Simon Willison discussed the implications with Ars Technica, noting that while OpenAI restricted its model’s full voice synthesis capabilities, similar technology would likely emerge from other sources within the year. Meanwhile, the rapid advancement of AI voice replication has caused general concern about its potential misuse, although companies like ElevenLabs have already been offering voice cloning services for some time.

San Francisco’s robotic car horn symphony


A Waymo self-driving car in front of Google’s San Francisco headquarters, San Francisco, California, June 7, 2024. Credit: Getty Images

In August, San Francisco residents got a noisy taste of robo-dystopia when Waymo’s self-driving cars began creating an unexpected nightly disturbance in the South of Market district. In a parking lot off 2nd Street, the cars congregated autonomously every night during rider lulls at 4 am and began engaging in extended honking matches at each other while attempting to park.

Local resident Christopher Cherry’s initial optimism about the robotic fleet’s presence dissolved as the mechanical chorus grew louder each night, affecting residents in nearby high-rises. The nocturnal tech disruption served as a lesson in the unintentional effects of autonomous systems when run in aggregate.

Larry Ellison dreams of all-seeing AI cameras

A colorized photo of CCTV cameras in London, 2024.

In September, Oracle co-founder Larry Ellison painted a bleak vision of ubiquitous AI surveillance during a company financial meeting. The 80-year-old database billionaire described a future where AI would monitor citizens through networks of cameras and drones, asserting that the oversight would ensure lawful behavior from both police and the public.

His surveillance predictions reminded us of existing systems in China, where authorities already used AI to sort surveillance data on citizens as part of the country’s “sharp eyes” campaign from 2015 to 2020. Ellison’s statement reflected the sort of worst-case tech surveillance state scenario—likely antithetical to any sort of free society—that dozens of sci-fi novels of the 20th century warned us about.

A dead father sends new letters home


An AI-generated image featuring my late father’s handwriting. Credit: Benj Edwards / Flux

AI has made many of us do weird things in 2024, including this writer. In October, I used an AI synthesis model called Flux to reproduce my late father’s handwriting with striking accuracy. After scanning 30 samples from his engineering notebooks, I trained the model using computing time that cost less than five dollars. The resulting text captured his distinctive uppercase style, which he developed during his career as an electronics engineer.

I enjoyed creating images showing his handwriting in various contexts, from folder labels to skywriting, and made the trained model freely available online for others to use. While I approached it as a tribute to my father (who would have appreciated the technical achievement), many people found the whole experience weird and somewhat disturbing. The things we unhinged Bing Chat-like journalists do to bring awareness to a topic are sometimes unconventional. So I guess it counts for this list!

For 2025? Expect even more AI

Thanks for reading Ars Technica this past year and following along with our team coverage of this rapidly emerging and expanding field. We appreciate your kind words of support. Ars Technica’s 2024 AI words of the year were: vibemarking, deep doubt, and the aforementioned jabberwocky. The old stalwart “confabulation” also made several notable appearances. Tune in again next year when we continue to try to figure out how to concisely describe novel scenarios in emerging technology by labeling them.

Looking back, our prediction for 2024 in AI last year was “buckle up.” It seems fitting, given the weirdness detailed above. Especially the part about the robot dogs with guns. For 2025, AI will likely inspire more chaos ahead, but also potentially get put to serious work as a productivity tool, so this time, our prediction is “buckle down.”

Finally, we’d like to ask: What was the craziest story about AI in 2024 from your perspective? Whether you love AI or hate it, feel free to suggest your own additions to our list in the comments. Happy New Year!


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

2024: The year AI drove everyone crazy Read More »

the-ai-war-between-google-and-openai-has-never-been-more-heated

The AI war between Google and OpenAI has never been more heated

Over the past month, we’ve seen a rapid cadence of notable AI-related announcements and releases from both Google and OpenAI, and it’s been making the AI community’s head spin. It has also poured fuel on the fire of the OpenAI-Google rivalry, an accelerating game of one-upmanship taking place unusually close to the Christmas holiday.

“How are people surviving with the firehose of AI updates that are coming out,” wrote one user on X last Friday, which is still a hotbed of AI-related conversation. “in the last <24 hours we got gemini flash 2.0 and chatGPT with screenshare, deep research, pika 2, sora, chatGPT projects, anthropic clio, wtf it never ends."

Rumors travel quickly in the AI world, and people in the AI industry had been expecting OpenAI to ship some major products in December. Once OpenAI announced “12 days of OpenAI” earlier this month, Google jumped into gear and seemingly decided to try to one-up its rival on several counts. So far, the strategy appears to be working, but it’s coming at the cost of the rest of the world being able to absorb the implications of the new releases.

“12 Days of OpenAI has turned into like 50 new @GoogleAI releases,” wrote another X user on Monday. “This past week, OpenAI & Google have been releasing at the speed of a new born startup,” wrote a third X user on Tuesday. “Even their own users can’t keep up. Crazy time we’re living in.”

“Somebody told Google that they could just do things,” wrote a16z partner and AI influencer Justine Moore on X, referring to a common motivational meme telling people they “can just do stuff.”

The Google AI rush

OpenAI’s “12 Days of OpenAI” campaign has included releases of its full o1 model, an upgrade from o1-preview, alongside o1-pro for advanced “reasoning” tasks. The company also publicly launched Sora for video generation, added Projects functionality to ChatGPT, introduced Advanced Voice features with video streaming capabilities, and more.

The AI war between Google and OpenAI has never been more heated Read More »

why-ai-language-models-choke-on-too-much-text

Why AI language models choke on too much text


Compute costs scale with the square of the input size. That’s not great.


Large language models represent text using tokens, each of which is a few characters. Short words are represented by a single token (like “the” or “it”), whereas larger words may be represented by several tokens (GPT-4o represents “indivisible” with “ind,” “iv,” and “isible”).

When OpenAI released ChatGPT two years ago, it had a memory—known as a context window—of just 8,192 tokens. That works out to roughly 6,000 words of text. This meant that if you fed it more than about 15 pages of text, it would “forget” information from the beginning of its context. This limited the size and complexity of tasks ChatGPT could handle.
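
Those conversions are easy to sanity-check. The sketch below is a rough back-of-the-envelope calculation using two common rules of thumb: roughly 0.75 words per token for English text and about 400 words per printed page. Both constants are ballpark assumptions on my part, not figures OpenAI publishes.

```go
// Back-of-the-envelope arithmetic for the context-window figures above.
// The conversion factors are rough rules of thumb, not exact values.
package main

import "fmt"

func main() {
	const wordsPerToken = 0.75 // rough average for English prose
	const wordsPerPage = 400.0 // rough figure for a typical page of text

	tokens := 8192.0
	words := tokens * wordsPerToken
	pages := words / wordsPerPage
	fmt.Printf("%.0f tokens ≈ %.0f words ≈ %.0f pages\n", tokens, words, pages)
	// Prints: 8192 tokens ≈ 6144 words ≈ 15 pages
}
```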

Today’s LLMs are far more capable:

  • OpenAI’s GPT-4o can handle 128,000 tokens (about 200 pages of text).
  • Anthropic’s Claude 3.5 Sonnet can accept 200,000 tokens (about 300 pages of text).
  • Google’s Gemini 1.5 Pro allows 2 million tokens (about 2,000 pages of text).

Still, it’s going to take a lot more progress if we want AI systems with human-level cognitive abilities.

Many people envision a future where AI systems are able to do many—perhaps most—of the jobs performed by humans. Yet human workers read and hear hundreds of millions of words during their working years—and they absorb even more information from sights, sounds, and smells in the world around them. To achieve human-level intelligence, AI systems will need the capacity to absorb similar quantities of information.

Right now the most popular way to build an LLM-based system to handle large amounts of information is called retrieval-augmented generation (RAG). These systems try to find documents relevant to a user’s query and then insert the most relevant documents into an LLM’s context window.

This sometimes works better than a conventional search engine, but today’s RAG systems leave a lot to be desired. They only produce good results if the system puts the most relevant documents into the LLM’s context. But the mechanism used to find those documents—often, searching in a vector database—is not very sophisticated. If the user asks a complicated or confusing question, there’s a good chance the RAG system will retrieve the wrong documents and the chatbot will return the wrong answer.
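
Stripped to its essentials, the retrieval step looks something like the toy sketch below. Real RAG systems embed documents with a neural model and search a vector database; the crude word-overlap score, the made-up documents, and the query here are stand-ins, intended only to show the overall shape: score the documents, keep the best match, and prepend it to the prompt that goes to the LLM.

```go
// A toy sketch of the retrieval step in a RAG pipeline. Word overlap
// stands in for a real embedding-based similarity search.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// overlap counts how many words of the document also appear in the query.
func overlap(query, doc string) int {
	words := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(query)) {
		words[w] = true
	}
	score := 0
	for _, w := range strings.Fields(strings.ToLower(doc)) {
		if words[w] {
			score++
		}
	}
	return score
}

func main() {
	docs := []string{
		"The warranty covers parts and labor for two years.",
		"Our office is closed on public holidays.",
		"Returns are accepted within 30 days with a receipt.",
	}
	query := "How long does the warranty cover repairs?"

	// Rank documents by relevance to the query.
	sort.Slice(docs, func(i, j int) bool {
		return overlap(query, docs[i]) > overlap(query, docs[j])
	})

	// Insert the best-matching document into the LLM's context window.
	prompt := "Context: " + docs[0] + "\n\nQuestion: " + query
	fmt.Println(prompt)
}
```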

And RAG doesn’t enable an LLM to reason in more sophisticated ways over large numbers of documents:

  • A lawyer might want an AI system to review and summarize hundreds of thousands of emails.
  • An engineer might want an AI system to analyze thousands of hours of camera footage from a factory floor.
  • A medical researcher might want an AI system to identify trends in tens of thousands of patient records.

Each of these tasks could easily require more than 2 million tokens of context. Moreover, we’re not going to want our AI systems to start with a clean slate after doing one of these jobs. We will want them to gain experience over time, just like human workers do.

Superhuman memory and stamina have long been key selling points for computers. We’re not going to want to give them up in the AI age. Yet today’s LLMs are distinctly subhuman in their ability to absorb and understand large quantities of information.

It’s true, of course, that LLMs absorb superhuman quantities of information at training time. The latest AI models have been trained on trillions of tokens—far more than any human will read or hear. But a lot of valuable information is proprietary, time-sensitive, or otherwise not available for training.

So we’re going to want AI models to read and remember far more than 2 million tokens at inference time. And that won’t be easy.

The key innovation behind transformer-based LLMs is attention, a mathematical operation that allows a model to “think about” previous tokens. (Check out our LLM explainer if you want a detailed explanation of how this works.) Before an LLM generates a new token, it performs an attention operation that compares the latest token to every previous token. This means that conventional LLMs get less and less efficient as the context grows.
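
Here is a minimal single-head sketch of that operation, written to emphasize the loop over previous tokens rather than to be efficient or faithful to any production model. The tiny two-dimensional vectors stand in for real embeddings, which have thousands of dimensions.

```go
// A minimal sketch of causal attention (single head), showing why the cost
// of producing each new token grows with the length of the context.
package main

import (
	"fmt"
	"math"
)

// attend scores the newest token (the query) against every previous token
// (the keys), then returns a softmax-weighted average of the values. The
// loop over prior tokens is the part that gets more expensive as the
// context grows.
func attend(query []float64, keys, values [][]float64) []float64 {
	scores := make([]float64, len(keys))
	var sum float64
	for i, k := range keys {
		var dot float64
		for j := range query {
			dot += query[j] * k[j]
		}
		scores[i] = math.Exp(dot / math.Sqrt(float64(len(query))))
		sum += scores[i]
	}
	out := make([]float64, len(query))
	for i, v := range values {
		w := scores[i] / sum // softmax weight
		for j := range v {
			out[j] += w * v[j]
		}
	}
	return out
}

func main() {
	// Toy 2-dimensional "token" vectors standing in for real embeddings.
	tokens := [][]float64{{1, 0}, {0.5, 0.5}, {0, 1}, {0.9, 0.1}}
	for t := 1; t < len(tokens); t++ {
		// Token t attends to all t earlier tokens: t comparisons.
		out := attend(tokens[t], tokens[:t], tokens[:t])
		fmt.Printf("token %d attends to %d earlier tokens -> %v\n", t, t, out)
	}
}
```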

Lots of people are working on ways to solve this problem—I’ll discuss some of them later in this article. But first I should explain how we ended up with such an unwieldy architecture.

The “brains” of personal computers are central processing units (CPUs). Traditionally, chipmakers made CPUs faster by increasing the frequency of the clock that acts as each chip’s heartbeat. But in the early 2000s, overheating forced chipmakers to mostly abandon this technique.

Chipmakers started making CPUs that could execute more than one instruction at a time. But they were held back by a programming paradigm that requires instructions to mostly be executed in order.

A new architecture was needed to take full advantage of Moore’s Law. Enter Nvidia.

In 1999, Nvidia started selling graphics processing units (GPUs) to speed up the rendering of three-dimensional games like Quake III Arena. The job of these PC add-on cards was to rapidly draw thousands of triangles that made up walls, weapons, monsters, and other objects in a game.

This is not a sequential programming task: triangles in different areas of the screen can be drawn in any order. So rather than having a single processor that executed instructions one at a time, Nvidia’s first GPU had a dozen specialized cores—effectively tiny CPUs—that worked in parallel to paint a scene.

Over time, Moore’s Law enabled Nvidia to make GPUs with tens, hundreds, and eventually thousands of computing cores. People started to realize that the massive parallel computing power of GPUs could be used for applications unrelated to video games.

In 2012, three University of Toronto computer scientists—Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton—used a pair of Nvidia GTX 580 GPUs to train a neural network for recognizing images. The massive computing power of those GPUs, which had 512 cores each, allowed them to train a network with a then-impressive 60 million parameters. They entered ImageNet, an academic competition to classify images into one of 1,000 categories, and set a new record for accuracy in image recognition.

Before long, researchers were applying similar techniques to a wide variety of domains, including natural language.

For language, the dominant architecture was the recurrent neural network (RNN), which reads text one word at a time and summarizes everything it has seen so far in a fixed-size hidden state. RNNs worked fairly well on short sentences, but they struggled with longer ones—to say nothing of paragraphs or longer passages. When reasoning about a long sentence, an RNN would sometimes “forget about” an important word early in the sentence. In 2014, computer scientists Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio discovered they could improve the performance of a recurrent neural network by adding an attention mechanism that allowed the network to “look back” at earlier words in a sentence.

In 2017, Google published “Attention Is All You Need,” one of the most important papers in the history of machine learning. Building on the work of Bahdanau and his colleagues, Google researchers dispensed with the RNN and its hidden states. Instead, Google’s model used an attention mechanism to scan previous words for relevant context.

This new architecture, which Google called the transformer, proved hugely consequential because it eliminated a serious bottleneck to scaling language models.

Here’s an animation illustrating why RNNs didn’t scale well:

This hypothetical RNN tries to predict the next word in a sentence, with the prediction shown in the top row of the diagram. This network has three layers, each represented by a rectangle. It is inherently linear: it has to complete its analysis of the first word, “How,” before passing the hidden state back to the bottom layer so the network can start to analyze the second word, “are.”

This constraint wasn’t a big deal when machine learning algorithms ran on CPUs. But when people started leveraging the parallel computing power of GPUs, the linear architecture of RNNs became a serious obstacle.

The transformer removed this bottleneck by allowing the network to “think about” all the words in its input at the same time:

The transformer-based model shown here does roughly as many computations as the RNN in the previous diagram. So it might not run any faster on a (single-core) CPU. But because the model doesn’t need to finish with “How” before starting on “are,” “you,” or “doing,” it can work on all of these words simultaneously. So it can run a lot faster on a GPU with many parallel execution units.

How much faster? The potential speed-up is proportional to the number of input words. My animations depict a four-word input that makes the transformer model about four times faster than the RNN. Real LLMs can have inputs thousands of words long. So, with a sufficiently beefy GPU, transformer-based models can be orders of magnitude faster than otherwise similar RNNs.

In short, the transformer unlocked the full processing power of GPUs and catalyzed rapid increases in the scale of language models. Leading LLMs grew from hundreds of millions of parameters in 2018 to hundreds of billions of parameters by 2020. Classic RNN-based models could not have grown that large because their linear architecture prevented them from being trained efficiently on a GPU.

Look again at the diagonal arrows between the layers in the transformer animation: they represent the operation of the attention mechanism. Before a transformer-based language model generates a new token, it “thinks about” every previous token to find the ones that are most relevant.

Each of these comparisons is cheap, computationally speaking. For small contexts—10, 100, or even 1,000 tokens—they are not a big deal. But the computational cost of attention grows relentlessly with the number of preceding tokens. The longer the context gets, the more attention operations (and therefore computing power) are needed to generate the next token.

This means that the total computing power required for attention grows quadratically with the total number of tokens. Suppose a 10-token prompt requires 414,720 attention operations. Then:

  • Processing a 100-token prompt will require 45.6 million attention operations.
  • Processing a 1,000-token prompt will require 4.6 billion attention operations.
  • Processing a 10,000-token prompt will require 460 billion attention operations.

This is probably why Google charges twice as much, per token, for Gemini 1.5 Pro once the context gets longer than 128,000 tokens. Generating token number 128,001 requires comparisons with all 128,000 previous tokens, making it significantly more expensive than producing the first or 10th or 100th token.
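
The article doesn’t spell out where those operation counts come from, but they are consistent with counting one comparison per pair of tokens, per attention head, per layer, for a GPT-3-scale model with 96 layers and 96 heads. Those constants are my assumption; the part that matters is the pair count, which grows with the square of the context length.

```go
// Reproducing the quadratic growth in attention operations, assuming one
// comparison per pair of tokens, per attention head, per layer.
package main

import "fmt"

func attentionOps(tokens, layers, heads int) int64 {
	pairs := int64(tokens) * int64(tokens-1) / 2
	return pairs * int64(layers) * int64(heads)
}

func main() {
	for _, n := range []int{10, 100, 1000, 10000} {
		fmt.Printf("%6d tokens: %15d attention operations\n", n, attentionOps(n, 96, 96))
	}
	// 10 -> 414,720; 100 -> ~45.6 million; 1,000 -> ~4.6 billion; 10,000 -> ~460 billion
}
```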

A lot of effort has been put into optimizing attention. One line of research has tried to squeeze maximum efficiency out of individual GPUs.

As we saw earlier, a modern GPU contains thousands of execution units. Before a GPU can start doing math, it must move data from slow shared memory (called high-bandwidth memory) to much faster memory inside a particular execution unit (called SRAM). Sometimes GPUs spend more time moving data around than performing calculations.

In a series of papers, Princeton computer scientist Tri Dao and several collaborators have developed FlashAttention, which calculates attention in a way that minimizes the number of these slow memory operations. Work like Dao’s has dramatically improved the performance of transformers on modern GPUs.

Another line of research has focused on efficiently scaling attention across multiple GPUs. One widely cited paper describes ring attention, which divides input tokens into blocks and assigns each block to a different GPU. It’s called ring attention because GPUs are organized into a conceptual ring, with each GPU passing data to its neighbor.

I once attended a ballroom dancing class where couples stood in a ring around the edge of the room. After each dance, women would stay where they were while men would rotate to the next woman. Over time, every man got a chance to dance with every woman. Ring attention works on the same principle. The “women” are query vectors (describing what each token is “looking for”) and the “men” are key vectors (describing the characteristics each token has). As the key vectors rotate through a sequence of GPUs, they get multiplied by every query vector in turn.

In short, ring attention distributes attention calculations across multiple GPUs, making it possible for LLMs to have larger context windows. But it doesn’t make individual attention calculations any cheaper.
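
Here is a toy sketch of that rotation schedule. It only tracks which query block meets which key block at each step; the actual attention math, and the bookkeeping ring attention needs to combine softmax results across blocks, are omitted. The four-GPU ring is an arbitrary choice for illustration.

```go
// A toy schedule for ring attention: query blocks stay on their GPUs while
// key blocks rotate one hop per step until every query block has "seen"
// every key block. Only the communication pattern is sketched here.
package main

import "fmt"

func main() {
	const numGPUs = 4

	// keyBlock[g] is the index of the key block currently held by GPU g.
	// Each GPU also permanently holds query block g; only keys rotate.
	keyBlock := make([]int, numGPUs)
	for g := range keyBlock {
		keyBlock[g] = g
	}

	for step := 0; step < numGPUs; step++ {
		for g := 0; g < numGPUs; g++ {
			fmt.Printf("step %d: GPU %d scores query block %d against key block %d\n",
				step, g, g, keyBlock[g])
		}
		// Pass each key block to the next GPU around the ring.
		rotated := make([]int, numGPUs)
		for g := 0; g < numGPUs; g++ {
			rotated[(g+1)%numGPUs] = keyBlock[g]
		}
		keyBlock = rotated
	}
	// After numGPUs steps, every query block has been scored against every
	// key block, even though no single GPU ever held the whole context.
}
```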

The fixed-size hidden state of an RNN means that it doesn’t have the same scaling problems as a transformer. An RNN requires about the same amount of computing power to produce its first, hundredth, and millionth token. That’s a big advantage over attention-based models.
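
A minimal sketch of that property: the hidden state below is a fixed array of four numbers, and each token is folded into it with a constant amount of work. The update rule and its coefficients are placeholders rather than a trained RNN; the point is only that the per-token cost doesn’t depend on how much text came before.

```go
// A minimal recurrent update illustrating why an RNN's per-token cost is
// constant: the hidden state has a fixed size, so every step does the same
// amount of work regardless of context length.
package main

import (
	"fmt"
	"math"
)

const stateSize = 4

// step folds one input value into the fixed-size hidden state. The work is
// proportional to stateSize, independent of how many tokens came before.
// The coefficients are arbitrary placeholders, not trained weights.
func step(state [stateSize]float64, input float64) [stateSize]float64 {
	var next [stateSize]float64
	for i := range next {
		next[i] = math.Tanh(0.5*state[i] + 0.1*input + 0.01*float64(i))
	}
	return next
}

func main() {
	var state [stateSize]float64
	inputs := []float64{0.2, -1.0, 0.7, 0.3, 1.5}
	for t, x := range inputs {
		state = step(state, x)
		fmt.Printf("after token %d: state = %.3f\n", t+1, state)
	}
	// Whether the sequence has 5 tokens or 5 million, each step costs the same.
}
```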

Although RNNs have fallen out of favor since the invention of the transformer, people have continued trying to develop RNNs suitable for training on modern GPUs.

In April, Google announced a new model called Infini-attention. It’s kind of a hybrid between a transformer and an RNN. Infini-attention handles recent tokens like a normal transformer, remembering them and recalling them using an attention mechanism.

However, Infini-attention doesn’t try to remember every token in a model’s context. Instead, it stores older tokens in a “compressive memory” that works something like the hidden state of an RNN. This data structure can perfectly store and recall a few tokens, but as the number of tokens grows, its recall becomes lossier.

Machine learning YouTuber Yannic Kilcher wasn’t too impressed by Google’s approach.

“I’m super open to believing that this actually does work and this is the way to go for infinite attention, but I’m very skeptical,” Kilcher said. “It uses this compressive memory approach where you just store as you go along, you don’t really learn how to store, you just store in a deterministic fashion, which also means you have very little control over what you store and how you store it.”

Perhaps the most notable effort to resurrect RNNs is Mamba, an architecture that was announced in a December 2023 paper. It was developed by computer scientists Tri Dao (who also did the FlashAttention work I mentioned earlier) and Albert Gu.

Mamba does not use attention. Like other RNNs, it has a hidden state that acts as the model’s “memory.” Because the hidden state has a fixed size, longer prompts do not increase Mamba’s per-token cost.

When I started writing this article in March, my goal was to explain Mamba’s architecture in some detail. But then in May, the researchers released Mamba-2, which significantly changed the architecture from the original Mamba paper. I’ll be frank: I struggled to understand the original Mamba and have not figured out how Mamba-2 works.

But the key thing to understand is that Mamba has the potential to combine transformer-like performance with the efficiency of conventional RNNs.

In June, Dao and Gu co-authored a paper with Nvidia researchers that evaluated a Mamba model with 8 billion parameters. They found that models like Mamba were competitive with comparably sized transformers in a number of tasks, but they “lag behind Transformer models when it comes to in-context learning and recalling information from the context.”

Transformers are good at information recall because they “remember” every token of their context—this is also why they become less efficient as the context grows. In contrast, Mamba tries to compress the context into a fixed-size state, which necessarily means discarding some information from long contexts.

The Nvidia team found they got the best performance from a hybrid architecture that interleaved 24 Mamba layers with four attention layers. This worked better than either a pure transformer model or a pure Mamba model.

A model needs some attention layers so it can remember important details from early in its context. But a few attention layers seem to be sufficient; the rest of the attention layers can be replaced by cheaper Mamba layers with little impact on the model’s overall performance.

In August, an Israeli startup called AI21 announced its Jamba 1.5 family of models. The largest version had 398 billion parameters, making it comparable in size to Meta’s Llama 405B model. Jamba 1.5 Large has seven times more Mamba layers than attention layers. As a result, Jamba 1.5 Large requires far less memory than comparable models from Meta and others. For example, AI21 estimates that Llama 3.1 70B needs 80GB of memory to keep track of 256,000 tokens of context. Jamba 1.5 Large only needs 9GB, allowing the model to run on much less powerful hardware.
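
Where does a figure like 80GB come from? Mostly from the key/value cache: a transformer has to keep a key vector and a value vector for every token, in every layer, for every key/value head. The sketch below plugs in my reading of Llama 3.1 70B’s published configuration (80 layers, 8 key/value heads of 128 dimensions each, 16-bit values), so treat it as a ballpark reconstruction rather than AI21’s actual methodology.

```go
// A rough reconstruction of the kind of memory estimate AI21 is making,
// using assumed (but publicly documented) Llama 3.1 70B dimensions.
package main

import "fmt"

func main() {
	const (
		layers        = 80     // assumed: Llama 3.1 70B layer count
		kvHeads       = 8      // assumed: grouped-query attention key/value heads
		headDim       = 128    // assumed: dimensions per head
		bytesPerValue = 2      // 16-bit precision
		tokens        = 256000 // context length cited in the article
	)
	// Two cached vectors (key and value) per token, per layer, per KV head.
	bytes := int64(2) * layers * kvHeads * headDim * tokens * bytesPerValue
	fmt.Printf("KV cache ≈ %.0f GB for %d tokens\n", float64(bytes)/1e9, tokens)
	// Prints roughly 84 GB, in the same ballpark as the 80GB figure above.
}
```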

The Jamba 1.5 Large model gets an MMLU score of 80, significantly below the Llama 3.1 70B’s score of 86. So by this measure, Mamba doesn’t blow transformers out of the water. However, this may not be an apples-to-apples comparison. Frontier labs like Meta have invested heavily in training data and post-training infrastructure to squeeze a few more percentage points of performance out of benchmarks like MMLU. It’s possible that the same kind of intense optimization could close the gap between Jamba and frontier models.

So while the benefits of longer context windows are obvious, the best strategy to get there is not. In the short term, AI companies may continue using clever efficiency and scaling hacks (like FlashAttention and ring attention) to scale up vanilla LLMs. Longer term, we may see growing interest in Mamba and perhaps other attention-free architectures. Or maybe someone will come up with a totally new architecture that renders transformers obsolete.

But I am pretty confident that scaling up transformer-based frontier models isn’t going to be a solution on its own. If we want models that can handle billions of tokens—and many people do—we’re going to need to think outside the box.

Tim Lee was on staff at Ars from 2017 to 2021. Last year, he launched a newsletter, Understanding AI, that explores how AI works and how it’s changing our world. You can subscribe here.


Timothy is a senior reporter covering tech policy and the future of transportation. He lives in Washington DC.

Why AI language models choke on too much text Read More »

12-days-of-openai:-the-ars-technica-recap

12 days of OpenAI: The Ars Technica recap


Did OpenAI’s big holiday event live up to the billing?

Over the past 12 business days, OpenAI has announced a new product or demoed an AI feature every weekday, calling the PR event “12 days of OpenAI.” We’ve covered some of the major announcements, but we thought a day-by-day recap might be useful for people seeking a comprehensive overview of the developments.

The timing and rapid pace of these announcements—particularly in light of Google’s competing releases—illustrates the intensifying competition in AI development. What might normally have been spread across months was compressed into just 12 business days, giving users and developers a lot to process as they head into 2025.

Humorously, we asked ChatGPT what it thought about the whole series of announcements, and it was skeptical that the event even took place. “The rapid-fire announcements over 12 days seem plausible,” wrote ChatGPT-4o, “but might strain credibility without a clearer explanation of how OpenAI managed such an intense release schedule, especially given the complexity of the features.”

But it did happen, and here’s a chronicle of what went down on each day.

Day 1: Thursday, December 5

On the first day of OpenAI, the company released its full o1 model, making it available to ChatGPT Plus and Team subscribers worldwide. The company reported that the model operates faster than its preview version and reduces major errors by 34 percent on complex real-world questions.

The o1 model brings new capabilities for image analysis, allowing users to upload and receive detailed explanations of visual content. OpenAI said it plans to expand o1’s features to include web browsing and file uploads in ChatGPT, with API access coming soon. The API version will support vision tasks, function calling, and structured outputs for system integration.

OpenAI also launched ChatGPT Pro, a $200 subscription tier that provides “unlimited” access to o1, GPT-4o, and Advanced Voice features. Pro subscribers receive an exclusive version of o1 that uses additional computing power for complex problem-solving. Alongside this release, OpenAI announced a grant program that will provide ChatGPT Pro access to 10 medical researchers at established institutions, with plans to extend grants to other fields.

Day 2: Friday, December 6

Day 2 wasn’t as exciting. OpenAI unveiled Reinforcement Fine-Tuning (RFT), a model customization method that will let developers modify “o-series” models for specific tasks. The technique reportedly goes beyond traditional supervised fine-tuning by using reinforcement learning to help models improve their reasoning abilities through repeated iterations. In other words, OpenAI created a new way to train AI models that lets them learn from practice and feedback.

OpenAI says that Berkeley Lab computational researcher Justin Reese tested RFT for researching rare genetic diseases, while Thomson Reuters has created a specialized o1-mini model for its CoCounsel AI legal assistant. The technique requires developers to provide a dataset and evaluation criteria, with OpenAI’s platform managing the reinforcement learning process.

OpenAI plans to release RFT to the public in early 2025 but currently offers limited access through its Reinforcement Fine-Tuning Research Program for researchers, universities, and companies.

Day 3: Monday, December 9

On day 3, OpenAI released Sora, its text-to-video model, as a standalone product now accessible through sora.com for ChatGPT Plus and Pro subscribers. The company says the new version operates faster than the research preview shown in February 2024, when OpenAI first demonstrated the model’s ability to create videos from text descriptions.

The release moved Sora from research preview to a production service, marking OpenAI’s official entry into the video synthesis market. The company published a blog post detailing the subscription tiers and deployment strategy for the service.

Day 4: Tuesday, December 10

On day 4, OpenAI moved its Canvas feature out of beta testing, making it available to all ChatGPT users, including those on free tiers. Canvas provides a dedicated interface for extended writing and coding projects beyond the standard chat format, now with direct integration into the GPT-4o model.

The updated canvas allows users to run Python code within the interface and includes a text-pasting feature for importing existing content. OpenAI added compatibility with custom GPTs and a “show changes” function that tracks modifications to writing and code. The company said Canvas is now on chatgpt.com for web users and also available through a Windows desktop application, with more features planned for future updates.

Day 5: Wednesday, December 11

On day 5, OpenAI announced that ChatGPT would integrate with Apple Intelligence across iOS, iPadOS, and macOS devices. The integration works on iPhone 16 series phones, iPhone 15 Pro models, iPads with A17 Pro or M1 chips and later, and Macs with M1 processors or newer, running their respective latest operating systems.

The integration lets users access ChatGPT’s features (such as they are), including image and document analysis, directly through Apple’s system-level intelligence features. The feature works with all ChatGPT subscription tiers and operates within Apple’s privacy framework. Iffy message summaries remain unaffected by the additions.

Enterprise and Team account users need administrator approval to access the integration.

Day 6: Thursday, December 12

On the sixth day, OpenAI added two features to ChatGPT’s voice capabilities: “video calling” with screen sharing support for ChatGPT Plus and Pro subscribers and a seasonal Santa Claus voice preset.

The new visual Advanced Voice Mode features work through the mobile app, letting users show their surroundings or share their screen with the AI model during voice conversations. While the rollout covers most countries, users in several European nations, including EU member states, Switzerland, Iceland, Norway, and Liechtenstein, will get access at a later date. Enterprise and education users can expect these features in January.

The Santa voice option appears as a snowflake icon in the ChatGPT interface across mobile devices, web browsers, and desktop apps, with conversations in this mode not affecting chat history or memory. Don’t expect Santa to remember what you want for Christmas between sessions.

Day 7: Friday, December 13

OpenAI introduced Projects, a new organizational feature in ChatGPT that lets users group related conversations and files, on day 7. The feature works with the company’s GPT-4o model and provides a central location for managing resources related to specific tasks or topics—kinda like Anthropic’s “Projects” feature.

ChatGPT Plus, Pro, and Team subscribers can currently access Projects through chatgpt.com and the Windows desktop app, with view-only support on mobile devices and macOS. Users can create projects by clicking a plus icon in the sidebar, where they can add files and custom instructions that provide context for future conversations.

OpenAI said it plans to expand Projects in 2025 with support for additional file types, cloud storage integration through Google Drive and Microsoft OneDrive, and compatibility with other models like o1. Enterprise and education users will receive access to Projects in January.

Day 8: Monday, December 16

On day 8, OpenAI expanded its search features in ChatGPT, extending access to all users with free accounts while reportedly adding speed improvements and mobile optimizations. Basically, you can use ChatGPT like a web search engine, although in practice it doesn’t seem to be as comprehensive as Google Search at the moment.

The update includes a new maps interface and integration with Advanced Voice, allowing users to perform searches during voice conversations. The search capability, which previously required a paid subscription, now works across all platforms where ChatGPT operates.

Day 9: Tuesday, December 17

On day 9, OpenAI released its o1 model through its API platform, adding support for function calling, developer messages, and vision processing capabilities. The company also reduced GPT-4o audio pricing by 60 percent and introduced a GPT-4o mini option that costs one-tenth of previous audio rates.

OpenAI also simplified its WebRTC integration for real-time applications and unveiled Preference Fine-Tuning, which provides developers new ways to customize models. The company also launched beta versions of software development kits for the Go and Java programming languages, expanding its toolkit for developers.

Day 10: Wednesday, December 18

On Wednesday, OpenAI did something a little fun and launched voice and messaging access to ChatGPT through a toll-free number (1-800-CHATGPT), as well as WhatsApp. US residents can make phone calls with a 15-minute monthly limit, while global users can message ChatGPT through WhatsApp at the same number.

OpenAI said the release is a way to reach users who lack consistent high-speed Internet access or want to try AI through familiar communication channels, but it’s also just a clever hack. As evidence, OpenAI notes that these new interfaces serve as experimental access points, with more “limited functionality” than the full ChatGPT service, and still recommends existing users continue using their regular ChatGPT accounts for complete features.

Day 11: Thursday, December 19

On Thursday, OpenAI expanded ChatGPT’s desktop app integration to include additional coding environments and productivity software. The update added support for JetBrains IDEs like PyCharm and IntelliJ IDEA, VS Code variants including Cursor and VSCodium, and text editors such as BBEdit and TextMate.

OpenAI also included integration with Apple Notes, Notion, and Quip while adding Advanced Voice Mode compatibility when working with desktop applications. These features require manual activation for each app and remain available to paid subscribers, including Plus, Pro, Team, Enterprise, and Education users, with Enterprise and Education customers needing administrator approval to enable the functionality.

Day 12: Friday, December 20

On Friday, OpenAI concluded its twelve days of announcements by previewing two new simulated reasoning models, o3 and o3-mini, while opening applications for safety and security researchers to test them before public release. Early evaluations show o3 achieving a 2727 rating on Codeforces programming contests and scoring 96.7 percent on AIME 2024 mathematics problems.

The company reports o3 set performance records on advanced benchmarks, solving 25.2 percent of problems on EpochAI’s Frontier Math evaluations and scoring above 85 percent on the ARC-AGI test, which is comparable to human results. OpenAI also published research about “deliberative alignment,” a technique used in developing o1. The company has not announced firm release dates for either new o3 model, but CEO Sam Altman said o3-mini might ship in late January.

So what did we learn?

OpenAI’s December campaign revealed that the company had a lot of things sitting around that it needed to ship, and it picked a fun theme to unite the announcements. Google responded in kind, as we have covered.

Several trends from the releases stand out. OpenAI is heavily investing in multimodal capabilities. The o1 model’s release, Sora’s evolution from research preview to product, and the expansion of voice features with video calling all point toward systems that can seamlessly handle text, images, voice, and video.

The company is also focusing heavily on developer tools and customization so that it can keep growing its cloud services business and get its products integrated into other applications. Between the API releases, Reinforcement Fine-Tuning, and expanded IDE integrations, OpenAI is building out its ecosystem for developers and enterprises. And the introduction of o3 shows that OpenAI is still attempting to push technological boundaries, even in the face of diminishing returns in training LLM base models.

OpenAI seems to be positioning itself for a 2025 where generative AI moves beyond text chatbots and simple image generators and finds its way into novel applications that we probably can’t even predict yet. We’ll have to wait and see what the company and developers come up with in the year ahead.


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

12 days of OpenAI: The Ars Technica recap Read More »

openai-announces-o3-and-o3-mini,-its-next-simulated-reasoning-models

OpenAI announces o3 and o3-mini, its next simulated reasoning models

On Friday, during Day 12 of its “12 days of OpenAI,” OpenAI CEO Sam Altman announced the company’s latest AI “reasoning” models, o3 and o3-mini, which build upon the o1 models launched earlier this year. The company is not releasing them yet but will make these models available for public safety testing and research access today.

The models use what OpenAI calls “private chain of thought,” where the model pauses to examine its internal dialog and plan ahead before responding, which you might call “simulated reasoning” (SR)—a form of AI that goes beyond basic large language models (LLMs).

The company named the model family “o3” instead of “o2” to avoid potential trademark conflicts with British telecom provider O2, according to The Information. During Friday’s livestream, Altman acknowledged his company’s naming foibles, saying, “In the grand tradition of OpenAI being really, truly bad at names, it’ll be called o3.”

According to OpenAI, the o3 model earned a record-breaking score on the ARC-AGI benchmark, a visual reasoning benchmark that has gone unbeaten since its creation in 2019. In low-compute scenarios, o3 scored 75.7 percent, while in high-compute testing, it reached 87.5 percent—comparable to human performance at an 85 percent threshold.

OpenAI also reported that o3 scored 96.7 percent on the 2024 American Invitational Mathematics Exam, missing just one question. The model also reached 87.7 percent on GPQA Diamond, which contains graduate-level biology, physics, and chemistry questions. On the Frontier Math benchmark by EpochAI, o3 solved 25.2 percent of problems, while no other model has exceeded 2 percent.

OpenAI announces o3 and o3-mini, its next simulated reasoning models Read More »