The Veo Prompt Formula: Subject, Action, Scene, Style, Dialogue, Sound, Technical

Every Veo prompt that produces a clean, predictable result is built from the same seven parts: subject, action, scene, style, dialogue, sound, and technical spec. You don’t need all seven in every prompt, but a missing part is often why a result looks generic. This guide breaks down what each part controls, with a before/after for each.

This is a companion to How to Write Veo Prompts That Actually Work; that guide covers the basics (subject, camera, lighting, lens). This one goes one layer deeper into all seven parts, including the two that most guides skip: dialogue and sound.

The 7 parts

| Part | What it controls | Skip it and… |
| --- | --- | --- |
| 1. Subject | Who/what is in frame, described concretely | Veo invents a generic version |
| 2. Action | What the subject does, one clear motion | The subject sits static or drifts randomly |
| 3. Scene | Setting, time of day, environment | Background becomes a plain studio void |
| 4. Style | Visual treatment: cinematic, anime, claymation, film stock | Output defaults to a flat “AI video” look |
| 5. Dialogue | Exact spoken lines, in quotes | No speech, or mumbled non-words |
| 6. Sound | SFX, ambient noise, explicitly named | Silent or generic music-bed audio |
| 7. Technical | Camera move, lens, duration, aspect ratio, negatives | Camera drifts, aspect defaults to 16:9 |

Part 1 — Subject: name it concretely

Before: “a woman in a park”
After: “a woman in her 30s wearing a red wool coat, sitting on a weathered park bench”

Veo follows specific nouns and materials far better than vague adjectives. “A matte-black ceramic mug” produces a consistent object across regenerations; “a nice mug” produces a different mug every time.

Part 2 — Action: one motion, stated plainly

Before: “she is doing stuff in the park, looking around, maybe standing up”
After: “she stands up slowly and walks toward the camera”

Stack too many actions and Veo either rushes through them or drops one. Pick the single motion that matters for this shot.

Part 3 — Scene: environment and time of day

Before: “outside”
After: “a quiet public park at golden hour, autumn leaves scattered on the path, soft warm backlight”

Time of day is doing real work here — “golden hour” sets lighting direction and color temperature without you having to spell out a lighting rig.

Part 4 — Style: name the visual treatment

Before: (nothing; style left unstated)
After: “shot on 35mm film, muted color grade, shallow depth of field, subtle grain”

Without a style cue, Veo defaults to a clean, slightly flat “generic AI video” look. Naming a film stock, an animation style (claymation, cel-shaded anime), or a specific director’s visual language pulls the output toward that reference.

Part 5 — Dialogue: quote it exactly

Before: “she talks to someone”
After: she turns to the camera and says, "I didn't think it would be this quiet."

This is the part most prompt guides skip, and it’s Veo’s real differentiator: native audio with lip-synced dialogue. Use a direct quote, keep it under about 10 words for reliable delivery timing, and assign the line to a named subject if there’s more than one person in frame. See the dialogue and audio prompts guide for the full “says, …” syntax and more examples.
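The dialogue rules above (direct quote, named speaker, roughly 10 words or fewer) can be checked mechanically before a prompt is sent. The following is a minimal sketch, not part of any Veo tool: `dialogue_line` and its `max_words` parameter are hypothetical names, and the 10-word threshold is only this guide's rule of thumb.

```python
def dialogue_line(speaker, line, max_words=10):
    """Format a spoken line as: <speaker> says, "<line>".

    Hypothetical helper. The word limit is the guide's rule of thumb,
    not a documented Veo constraint.
    """
    # Drop surrounding whitespace and any quotes the caller already added.
    spoken = line.strip().strip('"')
    word_count = len(spoken.split())
    if word_count > max_words:
        raise ValueError(
            f"Line has {word_count} words; aim for about {max_words} or fewer."
        )
    return f'{speaker} says, "{spoken}"'


print(dialogue_line("The woman", "I didn't think it would be this quiet."))
```

Naming the speaker in every call keeps the line attached to one subject when more than one person is in frame.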

Part 6 — Sound: name the audio, don’t gesture at mood

Before: “spooky ambient sounds”
After: “SFX: a distant floorboard creak, faint wind through a cracked window, low room tone”

Vague mood words (“spooky,” “epic”) give Veo little to render. Naming specific, physical sounds — the creak, the wind, the room tone — produces a soundscape that actually matches, because you’ve described real audio events instead of a feeling.
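One way to hold yourself to this rule is to build the sound line from a list of named sounds and reject entries that lean on mood words. This is a rough sketch under stated assumptions: `sfx_line` and `MOOD_WORDS` are hypothetical names, and the word list contains only the two examples mentioned above, so extend it to suit your own prompts.

```python
import re

# Only the two mood words this guide names; an illustrative list, not exhaustive.
MOOD_WORDS = {"spooky", "epic"}


def sfx_line(sounds):
    """Join named sounds into one 'SFX: ...' line, rejecting vague mood words."""
    for sound in sounds:
        words = set(re.findall(r"[a-z]+", sound.lower()))
        vague = words & MOOD_WORDS
        if vague:
            raise ValueError(
                f"{sound!r} uses mood word(s) {sorted(vague)}; name a physical sound instead."
            )
    return "SFX: " + ", ".join(s.strip() for s in sounds)


print(sfx_line([
    "a distant floorboard creak",
    "faint wind through a cracked window",
    "low room tone",
]))
```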

Part 7 — Technical: camera, lens, duration, aspect, negatives

Before: (unstated; Veo picks a default)
After: “slow dolly-in, 35mm lens, shallow depth of field, 8 seconds, vertical 9:16. No subtitles, no on-screen text, no watermark.”

Pick one camera move (see the camera movement guide for vocabulary), state the aspect ratio explicitly if you need vertical output, and add a negative line if Veo tends to add unwanted captions or logos in your use case — see the negative prompts and troubleshooting guide.
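The technical part is the most form-like of the seven, so it lends itself to a small template. The sketch below is illustrative only: `technical_line` and its parameters are hypothetical, and it simply produces prompt text in the shape of the "After" example rather than setting any real Veo API option.

```python
def technical_line(camera_move, lens=None, duration_s=None, aspect=None, negatives=()):
    """Build the technical sentence(s): one camera move, optional specs, negatives.

    Hypothetical helper; it only formats text and calls no Veo API.
    """
    specs = [camera_move]  # exactly one camera move, per the guide
    if lens:
        specs.append(f"{lens} lens")
    if duration_s is not None:
        specs.append(f"{duration_s} seconds")
    if aspect:
        specs.append(aspect)
    text = ", ".join(specs) + "."
    if negatives:
        neg = ", ".join(f"no {item}" for item in negatives)
        # Capitalize only the first letter so the rest keeps its original case.
        text += " " + neg[0].upper() + neg[1:] + "."
    return text


print(technical_line(
    "slow dolly-in",
    lens="35mm",
    duration_s=8,
    aspect="vertical 9:16",
    negatives=["subtitles", "on-screen text", "watermark"],
))
```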

Full example — assembling all 7 parts

A woman in her 30s wearing a red wool coat sits on a weathered park bench
at golden hour, autumn leaves scattered on the path. She looks up and
says, "I didn't think it would be this quiet." Shot on 35mm film, muted
color grade, shallow depth of field. Slow dolly-in, no camera shake.
SFX: distant birdsong, faint wind through the trees, soft footsteps
approaching off-screen. 8 seconds, 16:9. No subtitles, no text overlay,
no watermark.

Every part is present: subject (woman, red coat), action (sits, looks up), scene (park, golden hour), style (35mm, muted grade), dialogue (quoted line), sound (named SFX), technical (dolly-in, duration, aspect, negatives).
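The assembly itself is plain string joining. Here is a minimal sketch of it, assuming a hypothetical `build_prompt` function whose parameter names mirror the seven parts; it is an illustration, not the method any Veo tool actually uses. Empty parts are skipped, so the same function also covers prompts that drop a part.

```python
def build_prompt(subject="", action="", scene="", style="",
                 dialogue="", sound="", technical=""):
    """Join the seven parts in the guide's order; empty parts are skipped."""

    def as_sentence(text):
        text = text.strip()
        # Add a full stop unless the part already ends a sentence or a quote.
        if text[-1] not in '.!?"':
            text += "."
        return text

    # Subject, action, and scene read naturally as one opening sentence.
    opening = " ".join(p.strip() for p in (subject, action, scene) if p.strip())
    ordered = [opening, style, dialogue, sound, technical]
    return " ".join(as_sentence(p) for p in ordered if p.strip())


prompt = build_prompt(
    subject="A woman in her 30s wearing a red wool coat",
    action="sits on a weathered park bench",
    scene="at golden hour, autumn leaves scattered on the path",
    style="Shot on 35mm film, muted color grade, shallow depth of field",
    dialogue='She looks up and says, "I didn\'t think it would be this quiet."',
    sound="SFX: distant birdsong, faint wind through the trees",
    technical="Slow dolly-in, 8 seconds, 16:9. No subtitles, no text overlay, no watermark.",
)
print(prompt)
```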

When to drop a part

Not every prompt needs all seven. A silent product hero shot doesn’t need dialogue. A locked-off ASMR clip doesn’t need a camera move beyond “static.” Use the table above to decide what’s load-bearing for your specific use case. The Veo Prompt Library breaks this down further by use case (product, UGC, cinematic, ASMR, image-to-video).

FAQ

Do I need to write all 7 parts in order?
No. Veo reads the whole prompt as one block, not a form. But writing in roughly this order (subject → action → scene → style → dialogue → sound → technical) tends to produce more consistent results because it front-loads the concrete subject before the model has to parse motion and mood.

What’s the single most-skipped part that hurts results the most?
Sound. Most people either leave it out entirely or write vague mood words. Naming exact SFX is a small addition that noticeably improves how “real” a clip feels. Audio is genuinely one of Veo’s strongest capabilities, and it’s underused.

Does this formula change for image-to-video prompts?
Yes. For image-to-video (I2V), drop subject/scene entirely (the image already shows them) and prompt only action, dialogue, sound, and technical. See the image-to-video prompts guide.
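If you keep your parts in a dictionary, the I2V rule reduces to a filter. The sketch below assumes hypothetical names (`I2V_PARTS`, `i2v_prompt`) and keys named after the seven parts; it only illustrates which parts survive and is not a Veo API call.

```python
# Parts the guide says to keep for image-to-video; the image supplies the rest.
I2V_PARTS = ("action", "dialogue", "sound", "technical")


def i2v_prompt(parts):
    """Keep only the I2V-relevant parts, in formula order, and join them."""
    kept = [parts[name].strip() for name in I2V_PARTS if parts.get(name, "").strip()]
    return " ".join(kept)


parts = {
    "subject": "A woman in a red wool coat",  # dropped: already in the image
    "scene": "a quiet park at golden hour",   # dropped: already in the image
    "action": "She stands up slowly and walks toward the camera.",
    "sound": "SFX: faint wind through the trees.",
}
print(i2v_prompt(parts))
```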

Does this work the same on Veo 3.1 Lite vs. the paid tier?
The formula is the same; Lite is somewhat less reliable at holding all 7 parts in longer or busier prompts. Start with 3-4 parts on Lite and add more once you see what holds.

Build it without memorizing the formula

The Veo Prompt Builder turns this exact 7-part structure into a form — fill in subject, action, style, dialogue, and sound as separate fields, and it assembles a properly ordered prompt for you in Text or JSON.
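For a rough idea of what a JSON version of the seven parts could look like, here is a small sketch. The key names and the `to_json` helper are assumptions made for illustration; they are not the Prompt Builder's actual schema or a format Veo is known to require.

```python
import json

# Formula order; the key names are illustrative assumptions.
PART_ORDER = ("subject", "action", "scene", "style", "dialogue", "sound", "technical")


def to_json(parts):
    """Serialize the filled-in parts in formula order, ignoring unknown keys."""
    ordered = {name: parts[name] for name in PART_ORDER if parts.get(name)}
    return json.dumps(ordered, indent=2, ensure_ascii=False)


print(to_json({
    "sound": "SFX: distant birdsong",
    "subject": "a matte-black ceramic mug",
    "technical": "static shot, 8 seconds, 16:9",
}))
```

Keeping the parts as separate fields makes it easy to swap one part (a different style, a different line of dialogue) without retyping the rest.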

If you’d rather run this against a few different models before settling on Veo, Pollo AI lets you test the same prompt across Veo, Kling, and others from one account. This is an affiliate link — we may earn a commission at no extra cost to you.
