What was Sora trained on? Creatives demand answers.

On Thursday, OpenAI once again shook up the AI world with a video generation model called Sora.

The demos showed photorealistic videos with crisp detail and complexity, generated from simple text prompts. A video based on the prompt "Reflections in the window of a train traveling through the Tokyo suburbs" looked like it was filmed on a phone, shaky camera work and reflections of train passengers included. No weird distorted hands in sight.

A video from the prompt, "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors" looked like a Christopher Nolan-Wes Anderson hybrid.

Another video, of golden retriever puppies playing in the snow, rendered soft fur and fluffy snow so realistically that it seemed you could reach out and touch them.

The $7 trillion question is, how did OpenAI achieve this? We don't actually know, because OpenAI has barely shared anything about its training data. But a model this advanced needs lots of video data, so we can assume Sora was trained on video scraped from all corners of the internet. And some are speculating that the training data included copyrighted works. OpenAI did not immediately respond to a request for comment on Sora's training data.

OpenAI's technical paper largely focuses on the method for achieving these results: Sora is a diffusion model that turns visual data into "patches," or pieces of data that the model can understand. But there's scant mention of where the visual data came from.

OpenAI says it “take[s] inspiration from large language models which acquire generalist capabilities by training on internet-scale data.” The incredibly vague “taking inspiration” part is the only reference, and an evasive one, to the source of Sora’s training data. Further down in the paper, OpenAI says, “training text-to-video generation systems requires a large amount of videos with corresponding text captions.” The only place to find such a massive amount of visual data is the internet, another hint at where Sora’s training data comes from.

The legal and ethical issue of how training data is acquired for AI models has been around ever since OpenAI launched ChatGPT. Both OpenAI and Google have been accused of “stealing” data to train their language models; in other words, of using data scraped from social media, online forums like Reddit and Quora, Wikipedia, databases of private books, and news sites.

Until now, the rationale for scraping the entirety of the internet for training data has been that it's publicly available. But publicly available doesn't always translate to public domain. Case in point: the New York Times is suing OpenAI and Microsoft for copyright infringement, alleging OpenAI's models used the Times' works word for word or incorrectly cited the stories.

Now it looks like OpenAI is doing the same thing, but with video. If this is the case, you can expect heavy-hitters in the entertainment industry to have something to say about it.

But the problem remains: We still don't know the source of Sora's training data. "The company (despite its name) has been characteristically close-lipped about what they have trained the models on," wrote Gary Marcus, an AI expert who testified at the U.S. Senate AI Oversight Committee hearing. "Many people have [speculated] that there’s probably a lot of stuff in there that is generated from game engines like Unreal. I would not at all be surprised if there also had been lots of training on YouTube, and various copyrighted materials," said Marcus, before adding, "Artists are presumably getting really screwed here."

Despite OpenAI's refusal to divulge its secrets, artists and creatives are assuming the worst. Justine Bateman, a filmmaker and SAG-AFTRA generative AI advisor, didn't mince words. "Every nanosecond of this #AI garbage is trained on stolen work by real artists," posted Bateman on X. "Repulsive," she added.

Others in creative industries are concerned about how the rise of Sora and other video-generating models will affect their jobs. "I work in film vfx, practically everyone I know is doom and gloom, panicking about what to do now," posted @jimmylanceworth.

OpenAI didn't completely ignore the explosive impact Sora might have. But its attention is largely focused on potential harms involving deepfakes and misinformation. Sora is currently in a red-teaming phase, which means it's being stress-tested for inappropriate and harmful content. Towards the end of its announcement, OpenAI said it will be "engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology."

But that doesn't address the harms that may have already occurred in making Sora in the first place.
