Hello Taco Nation! 🌮
Before I get started, a little self-promo: I released episode #25 of the pod yesterday, which is a big milestone for me. Completing 25 episodes, at a minimum, was my goal when I started the pod the week after the election. Have a listen if you haven’t! 👂 It’s a supersized episode focusing on game theory and how Trump doesn’t understand what game he’s playing, and, in the second half, on how the AI tidal wave is probably nigh.
In today’s newsletter, I just want to show you concrete examples of how rapidly video AI is improving, since that’s not something I can do on the pod.
Generative Video AI’s Progress
The recent release of Google’s Veo3 video generator has really shaken up some people in the film/TV business. It’s not just a step forward; it’s at least a couple of steps forward, all at once.
Let me show you a simple example of what I mean.
About six weeks ago, I made this as an experiment to see what the then-current state of the art in video AI could do. Replicating it with traditional CGI would cost at least $1M, likely more, given the sheer variety of characters, effects, and environments.
Now, it’s not a very good trailer for anything. No consistent characters, the very barest of nebulous stories, inconsistent visual style, no sound effects, the only dialog is that one bit in the first scene, etc. But it is really nice to look at, relatively speaking.
The process for me to make each individual scene here was:
Generate the starting image for a scene in an image generator like ChatGPT or Midjourney.
Edit it until I get what I want.
Take that image into a video generator (about 90% of this was Kling AI) and tell it what I want to see. Try over and over until I get something close to what I envision; close is all you can hope for. Kling, and the other generator I used for roughly 10% of the video, Runway ML, could both generate sound effects to go along with the clip, but the results were mostly unusable.
For the dialog at the beginning, I recorded myself saying the line on webcam, ran the audio through an AI voice changer at ElevenLabs, then had Runway ML use the video of me saying it to map the lipsync onto footage generated from the initial image of the goat-man. Then I layered the voice-changed audio back in. As you can see, a huge pain in the ass. (There’s a rough code sketch of the voice-changer step just after this list.)
Generate music with AI.
I then edited all the scenes together, added cuts timed to the music, and manually adjusted lighting and color on some of the clips.
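For the technically curious, here’s roughly what that ElevenLabs voice-changer step looks like as code. This is a minimal sketch assuming their speech-to-speech endpoint; the voice ID, model name, and file names are placeholders, and the exact parameters may differ from whatever the current API expects:

```python
# Minimal sketch of the ElevenLabs voice-changer (speech-to-speech) step.
# Assumptions: ELEVENLABS_API_KEY is set in the environment, VOICE_ID is a
# placeholder for a real voice from your ElevenLabs account, and the model
# name is illustrative and may not match the current API.
import os
import requests

VOICE_ID = "my_voice_id"  # placeholder: pick a voice in the ElevenLabs dashboard
url = f"https://api.elevenlabs.io/v1/speech-to-speech/{VOICE_ID}"

with open("webcam_dialog.wav", "rb") as f:  # the raw webcam audio
    response = requests.post(
        url,
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        data={"model_id": "eleven_english_sts_v2"},  # illustrative model name
        files={"audio": f},
    )
response.raise_for_status()

# The response body is the converted audio; save it to layer back into the edit.
with open("goatman_dialog.mp3", "wb") as out:
    out.write(response.content)
```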
Now let’s look at the state of the art using Veo3. I didn’t make these, to be clear:
Absolutely mind-blowing. Veo3 is, clearly, good enough for real projects, not just tech demos!
I don’t know their exact workflow for each scene, but it probably went roughly like this:
Type a prompt into Veo3 describing what you want to see and hear. Keep going until you get more or less what you want. (There’s a rough API sketch of this step just after this list.)
Create or generate music.
Edit it all together.
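If you’d rather script that first step than type into the web UI, here’s a rough sketch of prompting Veo 3 through Google’s `google-genai` Python SDK. The model name, polling pattern, and response fields are my assumptions and may not match the current API exactly, so treat this as a sketch rather than a recipe:

```python
# Rough sketch of generating a Veo 3 clip via Google's google-genai SDK.
# Assumptions: a GOOGLE_API_KEY with Veo access is set in the environment,
# and the model name "veo-3.0-generate-preview" matches what Google exposes.
import time
from google import genai

client = genai.Client()  # reads the API key from the environment

operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",
    prompt=(
        "Gandalf, Frodo, Elrond, Gimli, Galadriel, and Sauron performing a "
        "beautiful acapella rendition of \"She'll be Comin' Round the "
        "Mountain\", harmonizing while standing in a piazza in Venice, "
        "Italy just before sunset."
    ),
)

# Video generation is a long-running job; poll until it finishes.
while not operation.done:
    time.sleep(20)
    operation = client.operations.get(operation)

# Download the first generated clip and save it locally.
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("acapella_fellowship.mp4")
print("Saved acapella_fellowship.mp4")
```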
So not only is the result miles better, but the process is simpler too. A few of the scenes in the first video have a kind of ‘AI-look’ to them, but many of them, at first glance at least, look and sound completely real to me. I’m sure some of them could be identified as AI if you stared at them long enough and found incongruous details, but most viewers won’t notice.
A 1:1 Comparison
I tried the same prompt in both Google’s Veo3 and OpenAI’s Sora. Sora was cutting edge when it was released. Now, compared to Veo3? Not so much.
Here are examples from each. Sora first, for the previous generation of video gen, and Veo3 for this generation.
I used the same prompt for each one:
“Gandalf, Frodo, Elrond, Gimli, Galadriel, and Sauron are part of an acapella group of singers. They are performing a beautiful acapella rendition of "She'll be Comin' Round the Mountain", and they are harmonizing beautifully while singing it. They're standing in a piazza in Venice, Italy just before sunset.”
Sora’s output:
I would say this is…meh. Six months ago it would have been impressive. Now though… I don’t know who some of the characters are supposed to be, there are, in fact, seven characters rather than six, and of course it doesn’t have audio of any kind. It did get the time of day and location right.
Veo3’s output:
Wow, right?! Nailed the location, the time of day, the characters, the song, the harmonizing, etc.! The lipsyncing isn’t perfect by any means; the model struggles more as you add characters, and six is quite a lot at this point. What’s extra-impressive to me here is that it didn’t spit out the Peter Jackson movie versions of the characters. It took what it knows of the characters and made its own versions of them (presumably partly to avoid copyright issues). They look like Renn Faire versions, granted, but still!
And remember, this is the very worst that cutting-edge AI video is ever going to be from here on out.
I’m signing off now. I’m starting with short newsletters to see whether people enjoy them, and I’ll likely try longer ones down the road to see whether people prefer those.
For now, stay strong and stay SPICY! 🌶️🌶️