Veo JSON Prompt Format: How to Structure Multi-Scene Prompts

JSON prompting for Veo means writing your prompt as a structured object — scenes with start/end timestamps, a camera field, and a sound field per scene — instead of one paragraph of text. It is not an official Veo API schema; it is a community convention (used across GitHub prompt collections and this library’s own storyboard examples) that works because it forces one clear instruction per beat instead of asking the model to juggle several actions in a single sentence. Use it when a plain-text prompt starts blending your beats together — a multi-shot ad, a scene with a costume change, or a clip where the camera move needs to land at an exact moment.

Why JSON instead of plain text

A single text prompt for an 8-second clip has to describe everything at once: the opening frame, the camera move, the mid-clip action, the audio, and the ending, all competing for the model’s attention in one block. In practice, details placed late in a long paragraph tend to get diluted or dropped. Breaking the same 8 seconds into 3-4 timestamped scenes gives each beat its own isolated instruction, which is why storyboard-style JSON prompts tend to hold pacing and continuity better than an equivalent single paragraph. Google Cloud’s own Veo 3.1 prompting guide applies the same principle when it recommends timestamp-bracketed shots ([00:00-00:02] ...).

When to reach for JSON:

A 3+ beat ad or story (problem → action → hero shot → logo)
Anything where a sound effect needs to land at a specific moment (a cap popping, a door slamming)
A scene with a costume, lighting, or location change partway through
An iteration loop where you want to change one beat without rewriting the whole prompt

When plain text is enough: a single continuous shot, one camera move, one line of dialogue. Don’t reach for JSON on a simple prompt — it adds overhead without adding control.

The structure

{
  "video_length": 8,
  "scenes": [
    {
      "start": 0.0,
      "end": 2.0,
      "visual": "Describe exactly what is on screen in this beat only.",
      "camera": "One camera instruction: dolly-in, static, orbit, tilt.",
      "sound": "What plays during this beat: SFX, ambient, or a spoken line."
    },
    {
      "start": 2.0,
      "end": 4.0,
      "visual": "The next beat. Do not repeat details already established.",
      "camera": "The camera move for this beat.",
      "sound": "Audio for this beat."
    }
  ]
}

Field-by-field:

video_length — total duration in seconds. Veo 3.1 clips are commonly generated at 8s; keep scenes inside that budget unless you’re using an extend/continuation workflow.
start / end — timestamps in seconds. Keep beats short (1.5–3s) — Veo handles atomic, single-action beats more reliably than long ones.
visual — what’s on screen in that beat only. Don’t re-describe earlier beats; treat each scene as its own isolated instruction.
camera — one camera move per beat, stated plainly (see the camera movement guide for vocabulary that reliably steers Veo).
sound — SFX, ambient noise, or a quoted spoken line for that beat. Native audio is one of Veo’s real strengths — use it per-beat rather than one audio note for the whole clip.
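
If you would rather not hand-type the braces, the same structure can be assembled in a few lines of Python. This is a minimal sketch, not an official client: make_scene and build_prompt are hypothetical helper names, and the output is just the JSON text you paste in as your prompt.

```python
import json


def make_scene(start, end, visual, camera, sound):
    """Return one beat as a dict with the five per-scene fields."""
    return {
        "start": start,
        "end": end,
        "visual": visual,
        "camera": camera,
        "sound": sound,
    }


def build_prompt(video_length, scenes):
    """Serialize the storyboard to JSON text (hypothetical helper)."""
    storyboard = {"video_length": video_length, "scenes": scenes}
    # ensure_ascii=False keeps accented or non-Latin text readable in the prompt
    return json.dumps(storyboard, indent=2, ensure_ascii=False)


scenes = [
    make_scene(0.0, 2.0,
               "Describe exactly what is on screen in this beat only.",
               "static",
               "soft ambient room tone"),
    make_scene(2.0, 4.0,
               "The next beat. Do not repeat details already established.",
               "dolly-in",
               "a door slams"),
]

prompt_text = build_prompt(4, scenes)
print(prompt_text)
```

Because json.dumps handles quoting and commas, this avoids the trailing-comma and unmatched-brace errors that creep in when editing JSON by hand.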

Before / after

Before (plain text, everything competing for attention):

Cinematic product ad. Cold glass Coca-Cola bottle on a red background,
condensation dripping, then the cap pops off in slow motion with fizz
and droplets flying everywhere, then the liquid swirls around the bottle,
then it ends on a hero shot of the bottle with the logo glowing while a
voice says the brand name. 8 seconds.

This works, but Veo has to compress four distinct beats — establish, pop, swirl, hero — into one instruction, so pacing and the exact moment of the cap pop become unpredictable.

After (JSON, one instruction per beat):

{
  "video_length": 8,
  "scenes": [
    { "start": 0.0, "end": 2.0,
      "visual": "A cold Coca-Cola glass bottle stands upright against a deep red gradient background, covered in condensation.",
      "camera": "quick dolly-in with a slight tilt up, shallow depth of field",
      "sound": "soft ambient fizzing, subtle whoosh as camera moves" },
    { "start": 2.0, "end": 3.5,
      "visual": "Close-up: the red cap twists and pops off with force, spinning in the air with droplets flying naturally.",
      "camera": "snap zoom-in then slow-motion tracking of the cap mid-air",
      "sound": "crisp metallic twist, loud pop, carbonated hiss" },
    { "start": 3.5, "end": 5.5,
      "visual": "The liquid wraps around the bottle in a high-speed swirl, spiraling with realistic physics. Bottle stays centered.",
      "camera": "dynamic orbit shot around the bottle as liquid spins",
      "sound": "flowing liquid SFX, sparkling fizz buildup" },
    { "start": 5.5, "end": 8.0,
      "visual": "Wide hero shot: the bottle stands centered, logo glows softly as the background fades.",
      "camera": "locked hero shot, slow ambient glow increase",
      "sound": "bottle clink, soft chime, then a voice says the brand name" }
  ]
}

This exact structure is a documented, community-verified pattern — see the full worked example with a real published output video in the Veo Prompt Library.

Adapting this for dialogue-heavy scenes

If a beat includes a spoken line, put the quote directly in that beat’s sound field: "sound": "A woman says, \"We have to leave now.\"" Keep spoken lines under ~10 words per beat — Veo’s lip sync and delivery timing get less reliable past that, which is also covered in the dialogue and audio prompts cluster.
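
The word budget is easy to check mechanically. The sketch below is one way to do it, assuming spoken lines are wrapped in straight double quotes inside the sound field. The function names are hypothetical, and the 10-word limit is this guide’s rule of thumb, not a limit enforced by Veo.

```python
import re

MAX_SPOKEN_WORDS = 10  # rule of thumb from this guide, not a Veo limit


def spoken_lines(sound):
    """Return every span wrapped in straight double quotes in a sound field."""
    return re.findall(r'"([^"]+)"', sound)


def long_spoken_lines(sound, limit=MAX_SPOKEN_WORDS):
    """Return the quoted lines that have `limit` words or more."""
    return [line for line in spoken_lines(sound) if len(line.split()) >= limit]


sound = 'A woman says, "We have to leave now."'
print(spoken_lines(sound))
print(long_spoken_lines(sound))
```

Run it on the parsed sound value of each scene (after json.loads, the escaped quotes are plain quote characters again). Any line it returns is a candidate for splitting across two beats.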

Common mistakes

Re-describing the same object in every scene. If the bottle was established in scene 1, scene 2 only needs the delta (the cap popping) — repeating “the red Coca-Cola bottle” in every visual field dilutes the actual instruction.
Stacking two camera moves in one camera field. “Dolly-in while orbiting” fights itself. Pick one motion per beat.
Scenes that don’t add up to video_length. Check your start/end math — gaps or overlaps between beats produce unpredictable results.
Forgetting audio per-beat. A sound field left empty on a beat with a hard action (a slam, a pop) wastes Veo’s native-audio advantage.
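
Two of these mistakes, timing math and empty sound fields, can be caught before you submit anything. The following is a minimal sketch of such a check, assuming the prompt has already been parsed into a Python dict with the fields described in this guide. check_storyboard is a hypothetical name, and it assumes the first beat is meant to start at 0.

```python
def check_storyboard(prompt, tolerance=1e-6):
    """Return a list of problems found; an empty list means none were detected."""
    scenes = prompt.get("scenes", [])
    if not scenes:
        return ["no scenes defined"]

    problems = []
    expected_start = 0.0  # assumption: the first beat starts at 0 seconds
    for i, scene in enumerate(scenes, start=1):
        start, end = scene["start"], scene["end"]
        if end <= start:
            problems.append(f"scene {i}: end {end} is not after start {start}")
        if start - expected_start > tolerance:
            problems.append(f"scene {i}: gap of {start - expected_start:.2f}s before it")
        elif expected_start - start > tolerance:
            problems.append(f"scene {i}: overlaps previous beat by {expected_start - start:.2f}s")
        if not scene.get("sound", "").strip():
            problems.append(f"scene {i}: empty sound field")
        expected_start = end

    if abs(expected_start - prompt["video_length"]) > tolerance:
        problems.append(
            f"last scene ends at {expected_start}s but video_length is {prompt['video_length']}s"
        )
    return problems


draft = {
    "video_length": 8,
    "scenes": [
        {"start": 0.0, "end": 2.0, "visual": "...", "camera": "static", "sound": "soft fizz"},
        {"start": 2.5, "end": 4.0, "visual": "...", "camera": "orbit", "sound": "loud pop"},
    ],
}
for problem in check_storyboard(draft):
    print(problem)
```

The draft above is deliberately broken: there is a half-second gap before the second beat, and the beats stop at 4s in an 8s clip, so both problems are reported. Repeated descriptions and stacked camera moves still need a human read.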

FAQ

Does Veo have an official JSON API for prompts? Not a public one for text-to-video prompting as of July 2026. Vertex AI’s Veo API takes prompt text and generation parameters (aspect ratio, duration, resolution), not a scene-array JSON body. The JSON structure in this guide is a prompting convention: you still submit it as the text prompt, formatted as JSON, and Veo infers the structure from that text rather than parsing it against a schema. It works because it’s readable and consistent, not because Veo has a native JSON schema.

Does this work on Veo 3.1 Lite? Yes, though Lite is less reliable on longer, multi-beat sequences. Test shorter (2-3 scene) JSON prompts first on Lite before assuming a 4-beat storyboard will hold.

Can I use JSON for image-to-video prompts? It is less useful there: image-to-video (I2V) prompts should describe motion only, not re-establish a scene the input image already shows. See the image-to-video prompts cluster for that workflow instead.

What if my JSON prompt gets ignored and Veo just reads it as plain text? This happens occasionally. Try wrapping the JSON in a short instruction line first, e.g. “Follow this shot list exactly:” before the JSON block, and keep the JSON valid (no trailing commas, matched braces).
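
Both parts of that fix can be combined into one small step: confirm the JSON parses, then prepend the instruction line. This is a sketch using only Python’s standard json module. wrap_prompt is a hypothetical helper, and the result is still plain prompt text that you submit wherever you normally enter a Veo prompt.

```python
import json


def wrap_prompt(json_text, lead_in="Follow this shot list exactly:"):
    """Validate the JSON text, then put a short instruction line in front of it."""
    # json.loads raises json.JSONDecodeError on trailing commas or unmatched braces
    json.loads(json_text)
    return f"{lead_in}\n{json_text}"


final_prompt = wrap_prompt('{"video_length": 8, "scenes": []}')
print(final_prompt)
```

If the JSON is malformed, the call fails with a JSONDecodeError that points at the offending position, which is a faster way to find a stray comma than waiting on a render.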

Where to run this

Once you’ve written a JSON prompt, you’ll need a model endpoint that accepts long structured prompts without truncating them. Pollo AI runs Veo alongside other models from one dashboard and handles longer prompt bodies well — useful if you’re iterating on a multi-scene JSON prompt and want to compare how different models interpret the same structure. This is an affiliate link — we may earn a commission at no extra cost to you.

Want to build a prompt like this without hand-editing JSON syntax? The Veo Prompt Builder can output your prompt in structured JSON automatically from a simple form.