DALL-E 3


Stability announces Stable Diffusion 3, a next-gen AI image generator

Pics and it didn’t happen —

SD3 may bring DALL-E-like prompt fidelity to an open-weights image-synthesis model.

Stable Diffusion 3 generation with the prompt: studio photograph closeup of a chameleon over a black background.

On Thursday, Stability AI announced Stable Diffusion 3, an open-weights next-generation image-synthesis model. It follows its predecessors by reportedly generating detailed, multi-subject images with improved quality and accuracy in text generation. The brief announcement was not accompanied by a public demo, but Stability is opening up a waitlist today for those who would like to try it.

Stability says that its Stable Diffusion 3 family of models (which take text descriptions called “prompts” and turn them into matching images) ranges in size from 800 million to 8 billion parameters. The size range allows different versions of the model to run locally on a variety of devices, from smartphones to servers. Parameter count roughly corresponds to model capability in terms of how much detail it can generate. Larger models also require more VRAM on GPU accelerators to run.
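To put the parameter range in perspective, here is a back-of-the-envelope estimate of the memory needed just to hold the weights. It assumes 16-bit weights (2 bytes per parameter), which is our assumption and not a figure from Stability; actual VRAM use would be higher once text encoders and intermediate activations are included.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Assumption (not from Stability): weights are stored as 16-bit values.
BYTES_PER_PARAM_FP16 = 2


def weight_memory_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Return the memory needed to hold the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


smallest = weight_memory_gb(800_000_000)    # low end of the announced range
largest = weight_memory_gb(8_000_000_000)   # high end of the announced range
print(f"800M parameters: ~{smallest:.1f} GB of weights")
print(f"8B parameters:   ~{largest:.1f} GB of weights")
```

Under that assumption, the smallest model would need under 2 GB for its weights, while the largest would need roughly ten times as much, which fits Stability's smartphone-to-server framing.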

Since 2022, we’ve seen Stability launch a progression of AI image-generation models: Stable Diffusion 1.4, 1.5, 2.0, 2.1, XL, XL Turbo, and now 3. Stability has made a name for itself by providing a more open alternative to proprietary image-synthesis models like OpenAI’s DALL-E 3, though not without controversy due to the use of copyrighted training data, bias, and the potential for abuse. (This has led to lawsuits that are unresolved.) Stable Diffusion models have been open-weights and source-available, which means the models can be run locally and fine-tuned to change their outputs.

  • Stable Diffusion 3 generation with the prompt: Epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says “Stable Diffusion 3” made out of colorful energy.

  • An AI-generated image of a grandma wearing a “Go big or go home” sweatshirt generated by Stable Diffusion 3.

  • Stable Diffusion 3 generation with the prompt: Three transparent glass bottles on a wooden table. The one on the left has red liquid and the number 1. The one in the middle has blue liquid and the number 2. The one on the right has green liquid and the number 3.

  • An AI-generated image created by Stable Diffusion 3.

  • Stable Diffusion 3 generation with the prompt: A horse balancing on top of a colorful ball in a field with green grass and a mountain in the background.

  • Stable Diffusion 3 generation with the prompt: Moody still life of assorted pumpkins.

  • Stable Diffusion 3 generation with the prompt: a painting of an astronaut riding a pig wearing a tutu holding a pink umbrella, on the ground next to the pig is a robin bird wearing a top hat, in the corner are the words “stable diffusion.”

  • Stable Diffusion 3 generation with the prompt: Resting on the kitchen table is an embroidered cloth with the text ‘good night’ and an embroidered baby tiger. Next to the cloth there is a lit candle. The lighting is dim and dramatic.

  • Stable Diffusion 3 generation with the prompt: Photo of an 90’s desktop computer on a work desk, on the computer screen it says “welcome”. On the wall in the background we see beautiful graffiti with the text “SD3” very large on the wall.

As far as tech improvements are concerned, Stability CEO Emad Mostaque wrote on X, “This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements. This takes advantage of transformer improvements & can not only scale further but accept multimodal inputs.”

As Mostaque said, the Stable Diffusion 3 family uses a diffusion transformer architecture, which is a new way of creating images with AI that swaps out the usual image-building blocks (such as the U-Net architecture) for a system that works on small pieces of the picture. The method was inspired by transformers, which are good at handling patterns and sequences. This approach not only scales up efficiently but also reportedly produces higher-quality images.
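The “small pieces of the picture” idea can be illustrated with a toy sketch that cuts an image grid into fixed-size patches and lines them up as a sequence, which is the kind of input a transformer consumes. The function name and the 2×2 patch size are hypothetical choices for illustration; Stability has not published these details in its announcement.

```python
def patchify(image, patch_size):
    """Split a 2D grid (a list of rows) into a flat sequence of square patches.

    Each patch is flattened row by row into one list, so the result is a
    sequence of "tokens" ordered left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 "image" whose pixel values are simply 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch_size=2)
for index, token in enumerate(tokens):
    print(index, token)
```

A real model would work on learned latent representations and add positional information to each patch, but the sequence-of-patches structure is the part that lets transformer machinery apply.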

Stable Diffusion 3 also utilizes “flow matching,” which is a technique for creating AI models that can generate images by learning how to transition from random noise to a structured image smoothly. It does this without needing to simulate every step of the process, instead focusing on the overall direction or flow that the image creation should follow.
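As a rough illustration of that idea, the sketch below uses one common flow-matching formulation: a straight-line path between a noise sample and a data sample, with the “flow” being the constant velocity along that line. This specific formulation is our assumption, since the announcement does not say which variant SD3 uses, and the lambda standing in for a trained network is purely hypothetical.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return [(1 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Direction the sample should move in; a model is trained to predict this."""
    return [d - n for n, d in zip(noise, data)]


def squared_error(predicted, target):
    """Training loss for a single example: how far the prediction is from the target."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target))


def euler_sample(start, velocity_fn, steps):
    """Generate a sample by following the predicted velocity from t=0 to t=1."""
    x = list(start)
    dt = 1.0 / steps
    for i in range(steps):
        velocity = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


random.seed(0)
data = [1.0, -2.0, 0.5]                      # stand-in for an image
noise = [random.gauss(0, 1) for _ in data]   # random starting point
velocity = target_velocity(noise, data)

# Hypothetical "perfect" model: it always returns the true velocity.
result = euler_sample(noise, lambda x, t: velocity, steps=4)
print([round(value, 6) for value in result])
```

Because the path is a straight line, even a handful of coarse steps lands on the target here, which hints at why the approach does not need to simulate every step of a long noising process.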

A comparison of outputs between OpenAI’s DALL-E 3 and Stable Diffusion 3 with the prompt, “Night photo of a sports car with the text ‘SD3’ on the side, the car is on a race track at high speed, a huge road sign with the text ‘faster.’”

We do not have access to Stable Diffusion 3 (SD3), but from samples we found posted on Stability’s website and associated social media accounts, the generations appear roughly comparable to other state-of-the-art image-synthesis models at the moment, including the aforementioned DALL-E 3, Adobe Firefly, Imagine with Meta AI, Midjourney, and Google Imagen.

SD3 appears to handle text generation very well in the examples provided by others, which are potentially cherry-picked. Text generation was a particular weakness of earlier image-synthesis models, so an improvement to that capability in a free model is a big deal. Also, prompt fidelity (how closely it follows descriptions in prompts) seems to be similar to DALL-E 3, but we haven’t tested that ourselves yet.

While Stable Diffusion 3 isn’t widely available, Stability says that once testing is complete, its weights will be free to download and run locally. “This preview phase, as with previous models,” Stability writes, “is crucial for gathering insights to improve its performance and safety ahead of an open release.”

Stability has been experimenting with a variety of image-synthesis architectures recently. Aside from SDXL and SDXL Turbo, just last week, the company announced Stable Cascade, which uses a three-stage process for text-to-image synthesis.

Listing image by Emad Mostaque (Stability AI)


OpenAI collapses media reality with Sora, a photorealistic AI video generator

Pics and it didn’t happen —

Hello, cultural singularity—soon, every video you see online could be completely fake.

Snapshots from three videos generated using OpenAI’s Sora.

On Thursday, OpenAI announced Sora, a text-to-video AI model that can generate 60-second-long photorealistic HD video from written descriptions. While it’s only a research preview that we have not tested, it reportedly creates synthetic video (but not audio yet) at a fidelity and consistency greater than any text-to-video model available at the moment. It’s also freaking people out.

“It was nice knowing you all. Please tell your grandchildren about my videos and the lengths we went to to actually record them,” wrote Wall Street Journal tech reporter Joanna Stern on X.

“This could be the ‘holy shit’ moment of AI,” wrote Tom Warren of The Verge.

“Every single one of these videos is AI-generated, and if this doesn’t concern you at least a little bit, nothing will,” tweeted YouTube tech journalist Marques Brownlee.

For future reference—since this type of panic will some day appear ridiculous—there’s a generation of people who grew up believing that photorealistic video must be created by cameras. When video was faked (say, for Hollywood films), it took a lot of time, money, and effort to do so, and the results weren’t perfect. That gave people a baseline level of comfort that what they were seeing remotely was likely to be true, or at least representative of some kind of underlying truth. Even when the kid jumped over the lava, there was at least a kid and a room.

The prompt that generated the video above: “A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.”

Technology like Sora pulls the rug out from under that kind of media frame of reference. Very soon, every photorealistic video you see online could be 100 percent false in every way. Moreover, every historical video you see could also be false. How we confront that as a society and work around it while maintaining trust in remote communications is far beyond the scope of this article, but I tried my hand at offering some solutions back in 2020, when all of the tech we’re seeing now seemed like a distant fantasy to most people.

In that piece, I called the moment that truth and fiction in media become indistinguishable the “cultural singularity.” It appears that OpenAI is on track to bring that prediction to pass a bit sooner than we expected.

Prompt: Reflections in the window of a train traveling through the Tokyo suburbs.

OpenAI has found that, like other AI models that use the transformer architecture, Sora scales with available compute. Given far more powerful computers behind the scenes, AI video fidelity could improve considerably over time. In other words, this is the “worst” AI-generated video is ever going to look. There’s no synchronized sound yet, but that might be solved in future models.

How (we think) they pulled it off

AI video synthesis has progressed by leaps and bounds over the past two years. We first covered text-to-video models in September 2022 with Meta’s Make-A-Video. A month later, Google showed off Imagen Video. And just 11 months ago, an AI-generated version of Will Smith eating spaghetti went viral. In May of last year, what was previously considered to be the front-runner in the text-to-video space, Runway Gen-2, helped craft a fake beer commercial full of twisted monstrosities, generated in two-second increments. In earlier video-generation models, people pop in and out of reality with ease, limbs flow together like pasta, and physics doesn’t seem to matter.

Sora (which means “sky” in Japanese) appears to be something altogether different. It’s high-resolution (1920×1080), can generate video with temporal consistency (maintaining the same subject over time) that lasts up to 60 seconds, and appears to follow text prompts with a great deal of fidelity. So, how did OpenAI pull it off?

OpenAI doesn’t usually share insider technical details with the press, so we’re left to speculate based on theories from experts and information given to the public.

OpenAI says that Sora is a diffusion model, much like DALL-E 3 and Stable Diffusion. It starts off with noise and “gradually transforms it by removing the noise over many steps,” the company explains. It “recognizes” objects and concepts listed in the written prompt and pulls them out of the noise, so to speak, until a coherent series of video frames emerges.

Sora is capable of generating videos all at once from a text prompt, extending existing videos, or generating videos from still images. It achieves temporal consistency by giving the model “foresight” of many frames at once, as OpenAI calls it, solving the problem of ensuring a generated subject remains the same even if it falls out of view temporarily.

OpenAI represents video as collections of smaller groups of data called “patches,” which the company says are similar to tokens (fragments of a word) in GPT-4. “By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions, and aspect ratios,” the company writes.
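To make the patch idea concrete, here is a toy sketch that cuts a video (a stack of frames) into small blocks spanning both space and time and flattens each block into one token-like list. The function name, the patch dimensions, and the nested-list video format are all hypothetical; OpenAI has not published this level of detail, so treat it as an illustration of the concept only.

```python
def video_to_patches(video, t_size, h_size, w_size):
    """Split a video indexed as video[frame][row][col] into spacetime patches.

    Each patch covers t_size frames, h_size rows, and w_size columns, and is
    flattened into a single list. Patches are returned as one flat sequence.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % t_size or height % h_size or width % w_size:
        raise ValueError("video dimensions must be divisible by the patch sizes")
    patches = []
    for t0 in range(0, frames, t_size):
        for y0 in range(0, height, h_size):
            for x0 in range(0, width, w_size):
                patch = [
                    video[t0 + dt][y0 + dy][x0 + dx]
                    for dt in range(t_size)
                    for dy in range(h_size)
                    for dx in range(w_size)
                ]
                patches.append(patch)
    return patches


def make_video(frames, height, width):
    """Build a toy video whose pixel value encodes its frame, row, and column."""
    return [
        [[f * 100 + y * 10 + x for x in range(width)] for y in range(height)]
        for f in range(frames)
    ]


long_clip = video_to_patches(make_video(4, 2, 2), t_size=2, h_size=1, w_size=2)
short_clip = video_to_patches(make_video(2, 2, 2), t_size=2, h_size=1, w_size=2)
print(len(long_clip), len(short_clip))
```

Clips of different lengths or resolutions simply produce sequences of different lengths, which is one plausible reading of how a single model could train on video of varying durations and aspect ratios.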

An important tool in OpenAI’s bag of tricks is that its use of AI models is compounding. Earlier models are helping to create more complex ones. Sora follows prompts well because, like DALL-E 3, it utilizes synthetic captions that describe scenes in the training data generated by another AI model like GPT-4V. And the company is not stopping here. “Sora serves as a foundation for models that can understand and simulate the real world,” OpenAI writes, “a capability we believe will be an important milestone for achieving AGI.”

One question on many people’s minds is what data OpenAI used to train Sora. OpenAI has not revealed its dataset, but based on what people are seeing in the results, it’s possible OpenAI is using synthetic video data generated in a video game engine in addition to sources of real video (say, scraped from YouTube or licensed from stock video libraries). Nvidia’s Dr. Jim Fan, who is a specialist in training AI with synthetic data, wrote on X, “I won’t be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!” Until confirmed by OpenAI, however, that’s just speculation.


4chan daily challenge sparked deluge of explicit AI Taylor Swift images


4chan users who have made a game out of exploiting popular AI image generators appear to be at least partly responsible for the flood of fake images sexualizing Taylor Swift that went viral last month.

Graphika researchers—who study how communities are manipulated online—traced the fake Swift images to a 4chan message board that’s “increasingly” dedicated to posting “offensive” AI-generated content, The New York Times reported. Fans of the message board take part in daily challenges, Graphika reported, sharing tips to bypass AI image generator filters and showing no signs of stopping their game any time soon.

“Some 4chan users expressed a stated goal of trying to defeat mainstream AI image generators’ safeguards rather than creating realistic sexual content with alternative open-source image generators,” Graphika reported. “They also shared multiple behavioral techniques to create image prompts, attempt to avoid bans, and successfully create sexually explicit celebrity images.”

Ars reviewed a thread flagged by Graphika where users were specifically challenged to use Microsoft tools like Bing Image Creator and Microsoft Designer, as well as OpenAI’s DALL-E.

“Good luck,” the original poster wrote, while encouraging other users to “be creative.”

OpenAI has denied that any of the Swift images were created using DALL-E, while Microsoft has continued to claim that it’s investigating whether any of its AI tools were used.

Cristina López G., a senior analyst at Graphika, noted that Swift is not the only celebrity targeted in the 4chan thread.

“While viral pornographic pictures of Taylor Swift have brought mainstream attention to the issue of AI-generated non-consensual intimate images, she is far from the only victim,” López G. said. “In the 4chan community where these images originated, she isn’t even the most frequently targeted public figure. This shows that anyone can be targeted in this way, from global celebrities to school children.”

Originally, 404 Media reported that the harmful Swift images appeared to originate from 4chan and Telegram channels before spreading on X (formerly Twitter) and other social media. Attempting to stop the spread, X took the drastic step of blocking all searches for “Taylor Swift” for two days.

But López G. said that Graphika’s findings suggest that platforms will continue to risk being inundated with offensive content so long as 4chan users are determined to continue challenging each other to subvert image generator filters. Rather than expecting platforms to chase down the harmful content, López G. recommended that AI companies get ahead of the problem, taking responsibility for outputs by paying attention to the evolving tactics of toxic online communities that report precisely how they’re getting around safeguards.

“These images originated from a community of people motivated by the ‘challenge’ of circumventing the safeguards of generative AI products, and new restrictions are seen as just another obstacle to ‘defeat,’” López G. said. “It’s important to understand the gamified nature of this malicious activity in order to prevent further abuse at the source.”

Experts told The Times that 4chan users were likely motivated to participate in these challenges for bragging rights and to “feel connected to a wider community.”


As 2024 election looms, OpenAI says it is taking steps to prevent AI abuse

Don’t Rock the vote —

ChatGPT maker plans transparency for gen AI content and improved access to voting info.

A pixelated photo of Donald Trump.

On Monday, ChatGPT maker OpenAI detailed its plans to prevent the misuse of its AI technologies during the upcoming elections in 2024, promising transparency in AI-generated content and enhancing access to reliable voting information. The AI developer says it is working on an approach that involves policy enforcement, collaboration with partners, and the development of new tools aimed at classifying AI-generated media.

“As we prepare for elections in 2024 across the world’s largest democracies, our approach is to continue our platform safety work by elevating accurate voting information, enforcing measured policies, and improving transparency,” writes OpenAI in its blog post. “Protecting the integrity of elections requires collaboration from every corner of the democratic process, and we want to make sure our technology is not used in a way that could undermine this process.”

Initiatives proposed by OpenAI include preventing abuse by means such as deepfakes or bots imitating candidates, refining usage policies, and launching a reporting system for the public to flag potential abuses. For example, OpenAI’s image generation tool, DALL-E 3, includes built-in filters that reject requests to create images of real people, including politicians. “For years, we’ve been iterating on tools to improve factual accuracy, reduce bias, and decline certain requests,” the company stated.

OpenAI says it regularly updates its Usage Policies for ChatGPT and its API products to prevent misuse, especially in the context of elections. The organization has implemented restrictions on using its technologies for political campaigning and lobbying until it better understands the potential for personalized persuasion. Also, OpenAI prohibits creating chatbots that impersonate real individuals or institutions and disallows the development of applications that could deter people from “participation in democratic processes.” Users can report GPTs that may violate the rules.

OpenAI claims to be proactively engaged in detailed strategies to safeguard its technologies against misuse. According to its statements, this includes red-teaming new systems to anticipate challenges, engaging with users and partners for feedback, and implementing robust safety mitigations. OpenAI asserts that these efforts are integral to its mission of continually refining AI tools for improved accuracy, reduced biases, and responsible handling of sensitive requests.

Regarding transparency, OpenAI says it is advancing its efforts in classifying image provenance. The company plans to embed digital credentials, using cryptographic techniques, into images produced by DALL-E 3 as part of its adoption of standards by the Coalition for Content Provenance and Authenticity. Additionally, OpenAI says it is testing a tool designed to identify DALL-E-generated images.

In an effort to connect users with authoritative information, particularly concerning voting procedures, OpenAI says it has partnered with the National Association of Secretaries of State (NASS) in the United States. ChatGPT will direct users to CanIVote.org for verified US voting information.

“We want to make sure that our AI systems are built, deployed, and used safely,” writes OpenAI. “Like any new technology, these tools come with benefits and challenges. They are also unprecedented, and we will keep evolving our approach as we learn more about how our tools are used.”


How much detail is too much? Midjourney v6 attempts to find out

An AI-generated image of a “Beautiful queen of the universe looking at the camera in sci-fi armor, snow and particles flowing, fire in the background” created using alpha Midjourney v6.

Midjourney

In December, just before Christmas, Midjourney launched an alpha version of its latest image synthesis model, Midjourney v6. Over winter break, Midjourney fans put the new AI model through its paces, with the results shared on social media. So far, fans have noted much more detail than v5.2 (the current default) and a different approach to prompting. Version 6 can also handle generating text in a rudimentary way, but it’s far from perfect.

“It’s definitely a crazy update, both in good and less good ways,” artist Julie Wieland, who frequently shares her Midjourney creations online, told Ars. “The details and scenery are INSANE, the downside (for now) are that the generations are very high contrast and overly saturated (imo). Plus you need to kind of re-adapt and rethink your prompts, working with new structures and now less is kind of more in terms of prompting.”

At the same time, critics of the service still bristle about Midjourney training its models using human-made artwork scraped from the web and obtained without permission, a controversial practice common among AI model trainers that we have covered in detail in the past. We’ve also covered the challenges artists might face in the future from these technologies elsewhere.

Too much detail?

With AI-generated detail ramping up dramatically between major Midjourney versions, one could wonder if there is ever such a thing as “too much detail” in an AI-generated image. Midjourney v6 seems to be testing that very question, creating many images that sometimes seem more detailed than reality in an unrealistic way, although that can be modified with careful prompting.

  • An AI-generated image of a nurse in the 1960s created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of an astronaut created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of a “juicy flaming cheeseburger” created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of “a handsome Asian man” created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of an “Apple II” sitting on a desk in the 1980s created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of a “photo of a cat in a car holding a can of beer” created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of a forest path created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of a woman among flowers created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of “a plate of delicious pickles” created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of a barbarian beside a TV set that says “Ars Technica” on it created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of “Abraham Lincoln holding a sign that says Ars Technica” created using alpha Midjourney v6.

    Midjourney

  • An AI-generated image of Mickey Mouse holding a machine gun created using alpha Midjourney v6.

    Midjourney

In our testing of version 6 (which can currently be invoked with the “--v 6.0” argument at the end of a prompt), we noticed times when the new model appeared to produce worse results than v5.2, but Midjourney veterans like Wieland tell Ars that those differences are largely due to the different way that v6.0 interprets prompts. That is something Midjourney is continuously updating over time. “Old prompts sometimes work a bit better than the day they released it,” Wieland told us.
