DOMAIN:VISUAL_PRODUCTION:VIDEO_PRODUCTION

OWNER: felice
UPDATED: 2026-03-24
SCOPE: Video production pipeline — generative video, avatar video, voiceover, post-processing
AGENTS: felice (primary), alexander (creative direction)
PARENT: Visual Production


VIDEO:PIPELINE_OVERVIEW

STAGES

BRIEF → STORYBOARD → ASSET_CREATION → ANIMATION → RENDERING → POST_PROCESSING → OPTIMIZATION → DELIVERY

STAGE: BRIEF
- INPUT: client requirements, brand guidelines, target audience, delivery context
- OUTPUT: structured brief (purpose, duration, style, key messages, CTA, platform)
- RULE: brief must specify delivery context — web embed, social, presentation, app
- RULE: brief must specify maximum duration — shorter is always better

STAGE: STORYBOARD
- INPUT: brief
- OUTPUT: frame-by-frame breakdown
- FIELDS per scene: visual description, text overlay, narration line, duration (seconds), transition type
- RULE: each scene 3-8 seconds — longer scenes lose attention
- RULE: total video duration: 30s (social), 60-90s (explainer), 3-5min (tutorial max)
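The two duration rules above can be checked mechanically before a storyboard is approved. A minimal Python sketch — the platform caps and the 3-8 s scene window come from the rules above; the function name is illustrative:

```python
# Maximum total duration per delivery context, in seconds (from the rules above).
PLATFORM_CAPS = {"social": 30, "explainer": 90, "tutorial": 300}

def validate_storyboard(scene_durations, platform):
    """Return a list of rule violations for a storyboard.

    scene_durations: per-scene durations in seconds.
    """
    problems = []
    for i, duration in enumerate(scene_durations):
        if not 3 <= duration <= 8:
            problems.append(f"scene {i}: {duration}s outside 3-8s window")
    total = sum(scene_durations)
    cap = PLATFORM_CAPS[platform]
    if total > cap:
        problems.append(f"total {total}s exceeds {cap}s cap for {platform}")
    return problems
```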

STAGE: ASSET_CREATION
- INPUT: storyboard
- OUTPUT: images, icons, logos, background music, narration audio
- TOOLS: DALL-E 3/Midjourney/Flux Pro (images), ElevenLabs (narration), Epidemic Sound (music)
- RULE: all assets at 2x target resolution (render down, never up)
- RULE: images as PNG (lossless source) — convert to delivery format only at final stage

STAGE: ANIMATION
- INPUT: assets + storyboard
- OUTPUT: Remotion project with all scenes composed
- TOOLS: Remotion (React), Lottie (micro-animations)
- SEE: remotion-patterns.md for implementation details

STAGE: RENDERING
- INPUT: Remotion project
- OUTPUT: master video file
- FORMAT: ProRes (archival) or H.264 CRF 15 (high quality master)
- TOOLS: Remotion CLI or Remotion Lambda

STAGE: POST_PROCESSING
- INPUT: master video
- OUTPUT: captioned, color-corrected, audio-balanced video
- TOOLS: FFmpeg
- TASKS: captions, audio normalization, color correction, trimming

STAGE: OPTIMIZATION
- INPUT: post-processed video
- OUTPUT: delivery-optimized variants per platform
- TOOLS: FFmpeg
- SEE: delivery-specs.md for per-platform specs

STAGE: DELIVERY
- INPUT: optimized variants
- OUTPUT: CDN-hosted with adaptive streaming
- TOOLS: BunnyCDN Stream
- SEE: asset-optimization.md for CDN configuration


VIDEO:RUNWAY_GEN_3

API_BASICS

TOOL: Runway Gen-3 Alpha API
ENDPOINT: POST https://api.runwayml.com/v1/generate
PURPOSE: generative video — text-to-video and image-to-video

MODES:
- text-to-video: describe scene, receive 4-16 second clip
- image-to-video: provide start frame, model animates it
- image-to-video with end frame: start + end image, model interpolates

PARAMS:
- promptText: scene description (max ~300 chars effective)
- duration: 4 | 10 | 16 seconds
- ratio: "16:9" | "9:16" | "1:1"
- seed: for reproducibility
- watermark: false (paid plans only)
- init_image: URL or base64 (for image-to-video)
- end_image: URL or base64 (for interpolation mode)

RESPONSE:
- async: returns task_id, poll for completion
- generation takes 30-120 seconds depending on duration
- output: MP4 URL (download immediately, temporary)
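Validating parameters locally before submitting avoids wasting a paid generation. A sketch of a request-body builder — the field names follow the parameter list above, but treat them as illustrative rather than the authoritative API schema:

```python
VALID_DURATIONS = {4, 10, 16}           # seconds, per the params above
VALID_RATIOS = {"16:9", "9:16", "1:1"}

def build_generation_payload(prompt_text, duration=10, ratio="16:9",
                             seed=None, init_image=None):
    """Assemble a Runway Gen-3 request body, validating against the
    documented parameter ranges before anything hits the network."""
    if duration not in VALID_DURATIONS:
        raise ValueError(f"duration must be one of {sorted(VALID_DURATIONS)}")
    if ratio not in VALID_RATIOS:
        raise ValueError(f"ratio must be one of {sorted(VALID_RATIOS)}")
    payload = {"promptText": prompt_text[:300],  # ~300 chars effective
               "duration": duration, "ratio": ratio}
    if seed is not None:
        payload["seed"] = seed
    if init_image is not None:
        payload["init_image"] = init_image  # switches to image-to-video mode
    return payload
```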

CAMERA_MOTION_CONTROLS

Describe camera motion explicitly in the prompt — Runway interprets these reliably:

STATIC:
- "static wide shot" — locked camera, no movement
- "static close-up" — locked camera, tight framing
- RULE: use static for talking heads, product shots, UI demos

PAN:
- "slow pan left to right" — horizontal sweep
- "pan right revealing the scene" — discovery motion
- RULE: pans work best at 10-16s duration for smooth motion

DOLLY:
- "slow dolly in" — move camera toward subject
- "dolly out revealing the full scene" — pull back
- "push in on the subject's face" — dramatic focus

TRACKING:
- "tracking shot following the subject" — lateral movement with subject
- "orbit around the subject" — circular motion

CRANE/VERTICAL:
- "crane up revealing the city skyline" — upward vertical
- "low angle looking up" — power shot
- "bird's eye descending" — drone-like descent

ZOOM:
- "slow zoom in" — digital zoom effect
- "zoom out revealing context" — wider view
- NOTE: zoom is digital (crop), not optical — quality degrades

PROMPT_ENGINEERING

STRUCTURE:

[CAMERA MOTION], [SUBJECT doing ACTION], [ENVIRONMENT], [LIGHTING], [MOOD]

EXAMPLE (weak):

a person walking in a city

EXAMPLE (strong):

slow tracking shot, a woman in a navy blazer walks confidently through
a modern glass office lobby, warm afternoon light streaming through
floor-to-ceiling windows, shallow depth of field, cinematic

RULES:
- describe action in present continuous ("is walking", "are falling")
- describe camera motion first — sets the visual framework
- specify lighting and atmosphere explicitly
- keep prompts under 200 words — shorter = more coherent
- avoid multiple subjects performing different actions
- one clear action per generation
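The structure and rules above can be enforced with a small assembly helper, so prompts always lead with camera motion and stay within the word budget. A minimal sketch (the function name is illustrative):

```python
def build_prompt(camera, subject_action, environment, lighting, mood):
    """Assemble a Gen-3 prompt in the documented order: camera motion
    first, then subject/action, environment, lighting, mood."""
    parts = [camera, subject_action, environment, lighting, mood]
    prompt = ", ".join(p.strip() for p in parts if p and p.strip())
    if len(prompt.split()) > 200:
        raise ValueError("prompt over 200 words — trim for coherence")
    return prompt
```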

QUALITY_EVALUATION

ALWAYS generate 3+ variants for any generative video. Evaluate each against:

TEMPORAL_CONSISTENCY:
- [ ] objects maintain shape across frames (no morphing)
- [ ] colors stay consistent (no sudden hue shifts)
- [ ] subject identity maintained (face doesn't change)

MOTION_QUALITY:
- [ ] no flickering or strobing
- [ ] camera motion is smooth (no sudden jumps)
- [ ] physics are plausible (gravity, motion blur direction correct)
- [ ] no speed inconsistencies (sudden slow-down or speed-up)

ARTIFACT_CHECK:
- [ ] no melting or warping at edges
- [ ] no ghosting of previous frames
- [ ] no sudden appearance/disappearance of objects
- [ ] hands and faces remain coherent

RULE: if no variant passes all checks, refine the prompt and regenerate
RULE: never deliver generative video without human review
RULE: generative video is best for b-roll and atmospheric shots — not primary content
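The evaluate-or-regenerate loop above reduces to a simple selection rule over the three check groups. A sketch, with each group collapsed to a single boolean (the key names are illustrative):

```python
# The three check groups from the evaluation list above, as flat keys.
CHECKS = ("temporal_consistency", "motion_quality", "artifact_free")

def pick_passing_variant(variants):
    """variants: list of dicts mapping each check group to True/False.
    Returns the index of the first variant passing every check, or
    None — meaning: refine the prompt and regenerate."""
    for i, checks in enumerate(variants):
        if all(checks.get(c, False) for c in CHECKS):
            return i
    return None
```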

PRICING

~$0.05-0.50 per generation depending on duration and plan.
4-second clips: cheapest, good for loops and short accents.
16-second clips: most expensive, best for establishing shots.

LICENSING: user owns output on paid plans. Commercial use permitted.


VIDEO:AVATAR_VIDEO

SYNTHESIA_API

TOOL: Synthesia API
ENDPOINT: POST https://api.synthesia.io/v2/videos
PURPOSE: avatar-based explainer videos from text script

REQUEST:

{
  "title": "Product Walkthrough",
  "input": [{
    "scriptText": "Welcome to Growing Europe. Today I will walk you through our platform features.",
    "avatar": "anna_costume1_cameraA",
    "background": "off_white",
    "avatarSettings": {
      "horizontalAlign": "left",
      "scale": 0.8,
      "style": "rectangular"
    }
  }],
  "aspectRatio": "16:9",
  "test": false
}

AVATAR_SELECTION:
- browse available avatars via GET /v2/avatars
- filter by: gender, age range, attire (casual/business/custom)
- custom avatars: requires video submission + consent process
- RULE: use consistent avatar across a client's video series
- RULE: select avatar matching target audience expectations

SCRIPT_FORMATTING:
- plain text only — no SSML, no markup, no HTML
- punctuation controls pacing: periods create pauses, commas create brief pauses
- exclamation marks add emphasis
- question marks adjust intonation
- RULE: one paragraph per scene/slide
- RULE: read the script aloud before submitting — if it sounds unnatural, rewrite
- RULE: max ~10 minutes per video
- RULE: keep sentences under 20 words for natural delivery

MULTI_SCENE:

{
  "input": [
    {
      "scriptText": "Welcome to our platform.",
      "avatar": "anna_costume1_cameraA",
      "background": "off_white"
    },
    {
      "scriptText": "Let me show you the dashboard.",
      "avatar": "anna_costume1_cameraA",
      "background": "https://example.com/dashboard-screenshot.png"
    },
    {
      "scriptText": "Thank you for watching.",
      "avatar": "anna_costume1_cameraA",
      "background": "ge_branded_outro"
    }
  ]
}

MULTI_LANGUAGE:
- Synthesia supports 120+ languages
- specify language per scene: "en-US", "nl-NL", "de-DE", etc.
- same avatar can speak different languages
- RULE: verify pronunciation of brand names in each language
- RULE: script length varies by language — German ~30% longer than English
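Script-length expansion matters when a fixed-duration video must carry several language tracks. A rough planning sketch — the ~30% German figure comes from the rule above; the other factors are assumptions, not measurements:

```python
# Rough script-expansion factors relative to English. The German figure
# follows the rule above; the Dutch and French values are assumptions.
EXPANSION = {"en-US": 1.0, "de-DE": 1.3, "nl-NL": 1.15, "fr-FR": 1.2}

def estimate_duration(english_seconds, language):
    """Scale an English narration timing to another language."""
    return round(english_seconds * EXPANSION.get(language, 1.0), 1)
```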

WORKFLOW:
1. submit video via POST (returns id)
2. poll GET /v2/videos/{id} for status
3. status progression: pending → processing → complete
4. rendering takes 5-15 minutes
5. download from download URL in response
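The poll loop in steps 2-4 can be sketched as below. The status fetcher is injected as a callable (anything that GETs /v2/videos/{id} and returns parsed JSON), which keeps the loop testable; the "failed" terminal status is an assumption, not confirmed API behavior:

```python
import time

def wait_for_video(fetch_status, poll_seconds=30, timeout_seconds=1200):
    """Poll until a render finishes. Rendering typically takes 5-15
    minutes, so a 30 s interval is plenty."""
    waited = 0
    while waited <= timeout_seconds:
        video = fetch_status()
        status = video.get("status")
        if status == "complete":
            return video
        if status == "failed":  # assumed terminal status
            raise RuntimeError("render failed")
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError("render did not complete in time")
```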

PRICING: ~$0.50-2.00 per minute of output (varies by plan).

HEYGEN_API

TOOL: HeyGen API
ENDPOINT: POST https://api.heygen.com/v2/video/generate
PURPOSE: avatar videos — similar to Synthesia

COMPARISON:
| Feature | Synthesia | HeyGen |
|---------|-----------|--------|
| Lip sync quality | excellent | very good |
| Corporate feel | stronger | moderate |
| Rendering speed | 5-15 min | 2-10 min |
| Avatar variety | 200+ | 300+ |
| Custom avatar | yes (expensive) | yes (lower cost) |
| API design | REST, polling | REST, webhooks |
| Cost | higher | lower |

WHEN_TO_USE:
- Synthesia: corporate client presentations, formal explainers, enterprise onboarding
- HeyGen: product demos, social media content, rapid iteration, cost-sensitive projects

LICENSING: both — user owns output video. Avatar likeness rights included in subscription.


VIDEO:VOICEOVER

ELEVENLABS_API

TOOL: ElevenLabs API
ENDPOINT: POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}
PURPOSE: high-quality AI voiceover for narration

REQUEST:

{
  "text": "Welcome to Growing Europe. We build enterprise-grade software.",
  "model_id": "eleven_multilingual_v2",
  "voice_settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.3,
    "use_speaker_boost": true
  }
}

VOICE_SETTINGS:
- stability: 0.0-1.0. Higher = more consistent, lower = more expressive. 0.5 for narration.
- similarity_boost: 0.0-1.0. Higher = closer to original voice. 0.75 recommended.
- style: 0.0-1.0. Higher = more expressive style. 0.0-0.3 for professional narration.
- RULE: test voice settings on a sample paragraph before full script

VOICE_SELECTION:
- browse voices via GET /v1/voices
- filter by: gender, age, accent, use case
- clone custom voice: requires consent + 3+ minutes of clean audio
- RULE: use consistent voice across a client's content
- RULE: match voice demographics to target audience

MULTI_LANGUAGE:
- eleven_multilingual_v2 supports 29 languages
- same voice can switch languages
- RULE: verify pronunciation of technical terms and brand names

SCRIPT_PREPARATION:
- write for spoken delivery, not reading
- short sentences (under 20 words)
- use hyphens for compound adjectives read as one: "enterprise-grade"
- spell out acronyms that should be spoken: "S-M-E" not "SME"
- add <break time="500ms"/> for intentional pauses (SSML if supported)
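Parts of the script-preparation list can be automated. A rough sketch of two helpers — the acronym rewrite is a naive string replace (it would also hit substrings), and the sentence split on "." breaks on decimals, so treat both as pre-review aids only:

```python
def spell_out(script, acronyms):
    """Rewrite acronyms the voice should spell letter by letter,
    e.g. "SME" -> "S-M-E", per the rule above."""
    for acronym in acronyms:
        script = script.replace(acronym, "-".join(acronym))
    return script

def flag_long_sentences(script, max_words=20):
    """Return sentences exceeding the word budget for natural delivery."""
    sentences = [s.strip() for s in script.split(".") if s.strip()]
    return [s for s in sentences if len(s.split()) > max_words]
```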

AUDIO_SPECS:
- output format: MP3 (default) or PCM WAV
- sample rate: 44100 Hz
- RULE: request WAV for Remotion integration (better quality, no decode overhead)
- RULE: normalize audio to -16 LUFS for consistent loudness

PRICING: ~$0.15-0.30 per 1000 characters depending on plan.


VIDEO:SCREEN_RECORDING

COMPOSITION_PATTERN

Screen recordings are often combined with narration and overlays:

RECORDING:
- capture at 2x target resolution (3840x2160 for 1920x1080 delivery)
- 60fps capture, deliver at 30fps (smoother motion, temporal downsampling)
- record system audio separately from microphone
- TOOL: OBS Studio (local), browser-based recorder (remote)

POST_CAPTURE:
1. trim dead time (loading screens, mistakes)
2. speed up repetitive actions (2x-4x)
3. add cursor highlight/zoom for important clicks
4. overlay narration audio
5. add animated callouts (Remotion or Lottie overlays)
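Step 2 (speeding up repetitive actions) needs matching video and audio filters. A sketch of a filter-string builder — `setpts` scales video timestamps, and `atempo` is chained because single instances have historically been capped at 2.0:

```python
def speedup_filters(factor):
    """Build ffmpeg -vf/-af values for an N-x speed-up.
    Video: divide presentation timestamps by N.
    Audio: chain atempo stages so each stays within the 2.0 cap."""
    if factor < 1:
        raise ValueError("use a factor >= 1 for a speed-up")
    vf = f"setpts=PTS/{factor}"
    atempos = []
    remaining = float(factor)
    while remaining > 2.0:
        atempos.append("atempo=2.0")
        remaining /= 2.0
    atempos.append(f"atempo={remaining:g}")
    return vf, ",".join(atempos)
```

Usage: for a 4x speed-up this yields `-vf "setpts=PTS/4" -af "atempo=2.0,atempo=2"`.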

REMOTION_INTEGRATION:

import { AbsoluteFill, Video, Sequence, staticFile } from 'remotion';
import { AnimatedCallout } from './AnimatedCallout';
import { CursorHighlight } from './CursorHighlight';

export const ScreenRecordingOverlay = () => (
  <AbsoluteFill>
    {/* Base layer: the captured screen, letterboxed to the composition */}
    <Video src={staticFile("screen-recording.mp4")} style={{ objectFit: 'contain' }} />

    {/* At 30fps: highlight appears at 3s and lasts 2s */}
    <Sequence from={90} durationInFrames={60}>
      <CursorHighlight x={400} y={300} />
    </Sequence>

    {/* Callout appears at 4s and lasts 3s */}
    <Sequence from={120} durationInFrames={90}>
      <AnimatedCallout
        text="Click here to open settings"
        position={{ bottom: 60, left: 400 }}
      />
    </Sequence>
  </AbsoluteFill>
);


VIDEO:FFMPEG_POST_PROCESSING

COMMON_OPERATIONS

TRANSCODE to web-optimized H.264:

ffmpeg -i input.mov -c:v libx264 -preset slow -crf 20 -c:a aac -b:a 128k -movflags +faststart output.mp4
NOTE: -movflags +faststart moves metadata to front — REQUIRED for web streaming.

RESIZE to 720p:

ffmpeg -i input.mp4 -vf "scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2" output-720p.mp4

EXTRACT keyframes (for thumbnails/poster):

ffmpeg -i input.mp4 -vf "select='eq(pict_type\,I)'" -vsync vfr -q:v 2 frames/frame-%03d.jpg

ADD SOFT CAPTIONS (user-selectable):

ffmpeg -i input.mp4 -i captions.srt -c copy -c:s mov_text output.mp4

BURN-IN CAPTIONS (hard-coded):

ffmpeg -i input.mp4 -vf "subtitles=captions.srt:force_style='FontSize=24,PrimaryColour=&HFFFFFF&,Outline=2'" output.mp4

CREATE GIF PREVIEW (optimized):

ffmpeg -i input.mp4 -ss 0 -t 3 -vf "fps=10,scale=480:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" preview.gif

CONCATENATE CLIPS:

echo "file 'intro.mp4'" > list.txt
echo "file 'main.mp4'" >> list.txt
echo "file 'outro.mp4'" >> list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy output.mp4
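When the clip list is generated programmatically, single quotes in filenames must be escaped the way the concat demuxer expects (`'\''` inside a quoted name). A small pure helper, sketching the list-file lines from the shell version above:

```python
def concat_list_lines(paths):
    """Build the lines of an ffmpeg concat-demuxer list file,
    escaping single quotes in filenames."""
    return ["file '{}'".format(p.replace("'", "'\\''")) for p in paths]
```

Write the returned lines to list.txt, then run the concat command above.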

NORMALIZE AUDIO to -16 LUFS:

ffmpeg -i input.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1.5" -c:v copy output.mp4

EXTRACT AUDIO:

ffmpeg -i input.mp4 -vn -c:a pcm_s16le -ar 44100 output.wav

ADD FADE IN/OUT:

ffmpeg -i input.mp4 -vf "fade=t=in:st=0:d=1,fade=t=out:st=27:d=1" -af "afade=t=in:st=0:d=1,afade=t=out:st=27:d=1" output.mp4
NOTE: st=27 assumes a ~28s clip — set the fade-out start to clip duration minus fade length.

ANTI_PATTERNS

ANTI_PATTERN: re-encoding video unnecessarily (quality loss each time)
FIX: use -c copy when no pixel transformation needed (concatenating, adding subs, remuxing)

ANTI_PATTERN: missing -movflags +faststart for web video
FIX: always include it — without it, browser must download entire file before playback

ANTI_PATTERN: not normalizing audio across clips
FIX: always run loudnorm filter — inconsistent volume is jarring

ANTI_PATTERN: encoding at CRF 0-10 for web delivery
FIX: CRF 18-23 for web, CRF 15-18 for archival. Lower CRF = exponentially larger files.
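The CRF bands from the fix above can be encoded as a small lookup, so every transcode pulls its CRF from one place instead of a hard-coded number. A sketch (the "master" entry reflects the CRF 15 master spec from the RENDERING stage; names are illustrative):

```python
# CRF bands: lower CRF means higher quality and (roughly
# exponentially) larger files.
CRF_BANDS = {"web": (18, 23), "archival": (15, 18), "master": (15, 15)}

def pick_crf(target, prefer_quality=True):
    """Choose a CRF for a delivery target: the low end of the band
    when quality is the priority, the high end when file size is."""
    lo, hi = CRF_BANDS[target]
    return lo if prefer_quality else hi
```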


VIDEO:QUALITY_CONTROL

CHECKLIST

BEFORE DELIVERY every video must pass:

VISUAL:
- [ ] resolution matches delivery spec
- [ ] no encoding artifacts (blocking, banding, mosquito noise)
- [ ] color correct (no unintended tint or saturation)
- [ ] brand colors accurate (sample check)
- [ ] text readable at target display size
- [ ] smooth motion (no dropped frames)

AUDIO:
- [ ] narration clear and intelligible
- [ ] background music not competing with narration
- [ ] consistent volume throughout (-16 LUFS target)
- [ ] no clipping or distortion
- [ ] no unintended silence gaps
- [ ] audio sync with visual (lip sync, timing)

TECHNICAL:
- [ ] file size within delivery spec
- [ ] format correct (H.264 MP4 for web)
- [ ] faststart flag present (check with ffprobe)
- [ ] correct FPS (30 for web, 24 for cinematic)
- [ ] poster frame set (first frame or custom)

ACCESSIBILITY:
- [ ] captions present and accurate
- [ ] audio descriptions if visual-only information present
- [ ] see accessibility-media.md


CROSS_REFERENCES