OpenAI’s Sora Made Me Crazy AI Videos—Then the CTO Answered (Most of) My Questions | WSJ

– The video captures sort of the detail of the prompt when it comes to the hair and you know, sort of like professionally-styled women.
– But you can also see some issues.
– Certainly, especially when it comes to the hands.
– [Joanna] These two women, not real. They were created by Sora, OpenAI's text-to-video AI model. But these two women, very real.
– I'm Mira Murati, CTO of OpenAI.
– And former CEO.
– Yes, for two days.
– [Joanna] In November when OpenAI CEO, Sam Altman, was momentarily ousted, Murati stepped in. Now she's back to her previous job running all the tech at the company including…
– Sora is our video generation model. It is just based on a text prompt and it creates this hyper realistic, beautiful, highly-detailed videos of one-minute length.
– [Joanna] I've been blown away by the AI-generated videos, yet also concerned about their impact. So I asked OpenAI to generate some new videos for me and sat down with Murati to get some answers.

How does Sora work?
– It's fundamentally a diffusion model which is a type of generative model. It creates a more distilled image starting from random noise.
– [Joanna] Okay, here are the basics. The AI model analyzed lots of videos and learned to identify objects and actions. When given a text prompt, it creates a scene by defining the timeline and adding detail to each frame. What makes this AI video special compared to others is how smooth and realistic it looks.
– If you think about filmmaking, people have to make sure that each frame continues into the next frame with the sense of consistency between objects and people. And that's what gives you a sense of realism and a sense of presence. And if you break that between frames, then you get this disconnected sense and reality is no longer there. And so this is what Sora does really well.
– You can see lots of that smoothness in the videos OpenAI generated from the prompts I provided. But you can also see flaws and glitches. A female video producer on a sidewalk in New York City holding a high-end cinema camera. Suddenly, a robot yanks the camera out of her hand.
– So in this one, you can see the model doesn't follow the prompt very closely. The robot doesn't quite yank the camera out of her hand, but the person sort of morphs into the robot. Yeah, a lot of imperfections still.
– One thing I noticed there too is when the cars are going by, they change colors.
– Yeah, so while the model is quite good at continuity, it's not perfect. So you kind of see the yellow cab disappearing from the frame there for a while and then it comes back in a different frame.

– Would there be a way after the fact to say, "Fix the taxi cabs in the back?"
– Yeah, so eventually. That's what we're trying to figure out, how to use this technology as a tool that people can edit and create with.
– I wanted to go through one other… What do you think the prompt was?
– It looks like the bull in a china shop. Yeah, metaphorically, you'd imagine everything breaking in the scene, right? And you see in some cases that the bull is stomping on things and they're still perfect. They're not breaking. So that's to be expected this early on. And eventually, there's gonna be more steerability and control and more accuracy in reflecting the intent of what you want.
– And then there was that video of, well, us. The woman on the left looks like she has maybe like 15 fingers in one of the shots.
– [Mira] Hands actually have their own way of motion and it's very difficult to simulate the motion of hands.

– In the clip, the mouths move but there's no sound. So is audio something you're working on with Sora?
– With Sora specifically, not in this moment. But we will eventually.
– [Joanna] Every time I watch a Sora clip, I wonder what videos did this AI model learn from? Did the model see any clips of Ferdinand to know what a bull in a china shop should look like? Was it a fan of SpongeBob?
– Wow! You look real good with a mustache, Mr. Krabs.
– By the way, my prompt for this crab said nothing about a mustache. What data was used to train Sora?
– We used publicly available data and licensed data.
– So, videos on YouTube.
– I'm actually not sure about that.
– Okay. Videos from Facebook, Instagram.
– You know, if they were publicly available, publicly available to use, there might be the data, but I'm not sure. I'm not confident about it.
– What about Shutterstock? I know you guys have a deal with them.
– I'm just not gonna go into the details of the data that was used, but it was publicly available or licensed data.

– [Joanna] After the interview, Murati confirmed that the licensed data does include content from Shutterstock. Those videos are 720p, 20 seconds long. How long does it take to generate those?
– It could take a few minutes depending on the complexity of the prompt. Our goal was to really focus on developing the best capability and now we will start looking into optimizing the technology so people can use it at low cost and make it easy to use.
– To create these, you must be using a lot of computing power. Can you give me a sense of how much computing power to create something like that versus a ChatGPT response or a DALL-E image?
– ChatGPT and DALL-E are optimized for the public to be using them, whereas Sora is really a research output.

It's much, much more expensive. We don't know what it's going to look like exactly when we make it available eventually to the public, but we're trying to make it available at similar cost eventually to what we saw with DALL-E.
– You said eventually. When is eventually?
– I'm hoping definitely this year, but could be a few months.
– There's an election in November. You think before or after that?
– You know, that's certainly a consideration dealing with the issues of misinformation and harmful bias. And we will not be releasing anything that we don't feel confident on when it comes to how it might affect global elections or other issues.
– Right now Sora is going through red teaming, AKA the process where people test the tool to make sure it's safe, secure, and reliable. The goal is to identify vulnerabilities, biases, and other harmful issues. What are things that just you won't be able to generate with this?
– Well, we haven't made those decisions yet, but I think there will be consistency on our platform.

So similarly to DALL-E where you can't generate images of public figures, I expect that we'll have a similar policy for Sora. And right now we're in discovery mode and we haven't figured out exactly where all the limitations are and how we'll navigate our way around them.
– What about nudity?
– I'm not sure. You can imagine that… You know, there are creative settings in which artists might want to have more control over that. And right now, we are working with artists and creators from different fields to figure out exactly what's useful, what level of flexibility should the tool provide.

– How do you make sure that people who are testing these products aren't being inundated with illicit or harmful content?
– That's certainly difficult. And in the very early stages, it is part of red teaming. Something that you have to take into account and make sure that people are willing and able to do it. When we work with contractors, we go much further into that process, but that is certainly something difficult.
– We're laughing at some of these videos right now.

But people in the video industry may not be laughing in a few years when this type of technology is impacting their jobs.
– You know, the way that I see it is this is a tool for extending creativity and we want people in the film industry, creators everywhere, to be a part of informing how we develop it further and also how we deploy it. And also, you know, what are the economics around using these models when people are contributing data and such.
– One thing was clear from all this. This tech is going to quickly get faster, better, and become widely available. How are we going to tell the difference between what is real video and what is AI video?
– We're doing research and watermarking the videos, but really figuring out content provenance and how do you trust what is real content versus something that happened in reality versus content created for misinformation.

And this is the reason why we're actually not deploying the systems yet because we need to figure out these issues before we can confidently deploy them broadly.
– [Joanna] That was reassuring to hear. But there are still big concerns about Silicon Valley's race to create AI tools and its ambition for power and money versus our safety.
– It's not really a difficult demand or a difficult balance between profit and safety guardrails. I'd say the hard part is really figuring out the safety questions and the societal questions. That's really what keeps me up at night.
– There's this amazement about the product, but then we've also talked about all of these concerns. Is it worth it?
– It's definitely worth it. AI tools will extend our creativity and knowledge, collective imagination, ability to do anything. It's going to be extremely hard along the way to figure out the right path to bring AI tools into our day-to-day reality. But I think it's definitely worth trying.

