How Onira Orchestrates 5 Production AI Models Per Video
A common question we get: how does Onira actually work under the hood? What happens between "type a prompt" and "receive a finished video"?
This is the technical answer. Not a marketing explanation — a real walkthrough of the pipeline, the model assignments, and why we built it the way we did instead of the simpler alternatives.
The Fundamental Problem: One Model Can't Do Everything
Nearly every AI video tool before Onira made the same architectural mistake: pick one generation model and use it for everything. One LLM for the script, one video model for the footage, done.
The problem is that the best language model for research synthesis is not the same model you want writing per-beat narration copy. The best model for generating a cinematic still frame is not the same model that animates it. And narration timing — the most overlooked constraint in AI video — cannot be an afterthought bolted on at the end.
Onira's architecture is built around a different premise: assign each task to the model purpose-built for it, and connect them with explicit interfaces rather than hoping a generalist model handles all of them adequately.
Here is every model in the current production pipeline:
| Model | Stage | Primary Task |
|---|---|---|
| Gemini 3.1 Pro | Screenplay Engine | Seven-agent chain (the Researcher returns as the Verifier): Researcher → Showrunner → Architect → Voice Cast → Screenwriter → ImageDirector → VideoDirector → Verifier |
| ElevenLabs eleven_v3 | Narration | Per-beat voiceover, locked before any visual is generated |
| ElevenLabs Music | Score | Original per-act soundtrack matching the emotional arc |
| Gemini 3.1 Flash Image | Stills | Cinematic still frame per scene (Nano Banana 2 as quota fallback) |
| Pixverse v6 | Motion | Image-to-video animation of each still, 1–14s clips |
| Remotion | Assembly | Final timeline render to MP4 |
Stage 1: The Screenplay Engine (Gemini 3.1 Pro)
Every video begins with the Screenplay Engine. This is not a "generate a script" call to a language model. It is a seven-step structured production planning system, and the architecture is the most unusual part of how Onira works.
Researcher builds a fact bible from the prompt: sources, facts, quotes, timeline events. This bible is the ground truth for the entire production.
Showrunner is the only agent that reads the full bible. It produces a blueprint with key_moments[], each carrying source_bible_index provenance pointers back into the bible. This is a deliberate constraint: the Showrunner curates what matters; everything downstream works only from what the Showrunner committed, not from raw research.
Architect builds the scene-by-scene outline, copying each moment's committed facts onto every BeatPlan that references it.
Voice Cast assigns ElevenLabs eleven_v3 voices to the narrator and any characters, locking voice IDs before narration generation.
Screenwriter authors per-beat narration from BeatPlan.committed_* fields only. The raw research bible is never injected into the Screenwriter's prompt. This is the hallucination-prevention mechanism: the Screenwriter can only state what the Showrunner already committed. It cannot introduce facts from the bible that the Showrunner didn't explicitly surface.
ImageDirector and VideoDirector are two separate Gemini 3.1 Pro agents that produce the visual prompts. ImageDirector writes the appearance description for the still frame. VideoDirector writes the motion description for the clip. They never overlap — appearance and motion are separate concerns handled by separate agents with separate prompts.
Researcher.VerifyScreenplay is a final fact-check pass, with the Researcher returning as the chain's Verifier. It reads the completed narration against the bible and emits a VerificationReport before any credits are spent on generation.
The output of the Screenplay Engine is a structured list of BeatPlans — one per scene — each containing committed narration text, an image appearance prompt, a video motion prompt, voice assignments, and metadata. Every downstream stage consumes this structured output; nothing is free-form text handed between stages.
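To make that interface concrete, here is a minimal sketch of the shape of this output. The fields named in this post (committed facts, key_moments, source_bible_index, VisualContext, VisualSubject, the prompts, the voice ID) are real; the exact names and types below are illustrative, not Onira's production schema.

```typescript
// Illustrative sketch only: names beyond those quoted in the post are hypothetical.

interface KeyMoment {
  summary: string;
  source_bible_index: number[]; // provenance pointers back into the fact bible
}

interface Blueprint {
  key_moments: KeyMoment[]; // the Showrunner's curation; the only facts seen downstream
}

interface BeatPlan {
  sceneId: string;
  committed_facts: string[];   // copied from the Showrunner's key_moments
  committed_narration: string; // authored by the Screenwriter from committed_* only
  imagePrompt: string;         // appearance only (ImageDirector)
  videoPrompt: string;         // motion only (VideoDirector)
  voiceId: string;             // locked by Voice Cast before narration runs
  VisualContext: string;       // committed environment, carried across scenes
  VisualSubject: string;       // committed subject description
}
```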
Stage 2: Narration First (ElevenLabs eleven_v3)
Narration runs before visuals. This is a hard architectural rule, not a scheduling preference.
Each scene's committed narration text is sent to ElevenLabs eleven_v3 with the assigned voice ID. The output is a timestamped audio segment with a precise duration. That duration is the ground truth for how long the scene's visual clip must run. The visual generation stages receive the narration duration as a required constraint.
This means visuals always conform to audio — never the reverse. AI video models that work on fixed-duration clips (Pixverse v6 generates 1–14s clips) can always be trimmed or looped to match narration. Audio that has already been recorded cannot be stretched to match a visual without sounding wrong. Audio-first is not a preference; it is the only order that produces synchronized output reliably.
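A minimal sketch of that ordering, using the BeatPlan sketch above. The generateNarration helper is a hypothetical stand-in; the real ElevenLabs client differs.

```typescript
// Sketch of audio-first scene planning. generateNarration stands in
// for the real ElevenLabs eleven_v3 call.

interface SceneTiming {
  sceneId: string;
  audioUrl: string;
  requiredClipSeconds: number;
}

declare function generateNarration(
  text: string,
  voiceId: string
): Promise<{ url: string; durationSeconds: number }>;

async function planSceneTiming(beat: BeatPlan): Promise<SceneTiming> {
  // 1. Narration is generated first; its measured duration is ground truth.
  const narration = await generateNarration(beat.committed_narration, beat.voiceId);

  // 2. That duration becomes a required constraint for the visual stages:
  //    visuals conform to audio, never the reverse.
  return {
    sceneId: beat.sceneId,
    audioUrl: narration.url,
    requiredClipSeconds: narration.durationSeconds,
  };
}
```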
ElevenLabs Music runs as a separate call per "act" (a logical grouping of scenes). It receives the narrative arc for that act and generates an original score segment — not a loop, but a composed piece that builds and resolves with the arc. Act-level score generation means music can reflect dramatic structure rather than just running as ambient background.
Stage 3: Stills → Motion (Gemini Flash Image + Pixverse v6)
Visual generation is a two-step process per scene.
Gemini 3.1 Flash Image receives the ImageDirector's appearance prompt and generates a cinematic still frame. If Gemini Flash Image is quota-exhausted, Nano Banana 2 is the automatic fallback. The still frame is a reference image — it establishes the visual world of the scene.
Pixverse v6 receives the still frame and the VideoDirector's motion prompt and animates it into a 1–14s video clip. The motion prompt describes only what moves and how — camera drift, subject motion, environmental dynamics. It never re-describes appearance, because the still frame already contains that.
The ImageDirector / VideoDirector separation matters for consistency: if one prompt tries to describe both appearance and motion, the video model has to choose what to prioritize. By separating the concerns at the prompt level, we avoid the common failure mode where a video model ignores spatial layout instructions because it was also trying to follow lighting instructions in the same prompt.
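In code, the two-step pass looks roughly like this. The model-calling helpers are hypothetical stand-ins, and a real implementation would inspect the error before falling back.

```typescript
// Sketch of the per-scene visual pass, using the BeatPlan sketch above.
// generateStill and animateStill are hypothetical stand-ins.

declare function generateStill(
  model: 'gemini-flash-image' | 'nano-banana-2',
  appearancePrompt: string
): Promise<string>; // returns still-frame URL

declare function animateStill(
  stillUrl: string,
  motionPrompt: string,
  maxSeconds: number
): Promise<string>; // returns clip URL

async function renderSceneVisual(beat: BeatPlan): Promise<string> {
  let stillUrl: string;
  try {
    // Still pass: appearance prompt only, no motion language.
    stillUrl = await generateStill('gemini-flash-image', beat.imagePrompt);
  } catch {
    // On quota exhaustion, fall back to Nano Banana 2.
    // (A real implementation would check the error type first.)
    stillUrl = await generateStill('nano-banana-2', beat.imagePrompt);
  }
  // Motion pass: motion prompt only; appearance is fixed by the still.
  return animateStill(stillUrl, beat.videoPrompt, 14); // Pixverse v6 caps clips at 14s
}
```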
Scenes are processed concurrently — we run as many parallel generation calls as API rate limits allow. An 80-scene video may have 20+ concurrent generation calls active at peak. This is a major reason Onira can render a finished video in 10–30 minutes rather than hours.
When a generation returns a result that fails a quality gate (checked by a lightweight vision pass), the scene is automatically retried with a modified prompt — up to 3 attempts per scene at no extra credit cost. Cached results from successful scenes are never re-billed on a retry of a different scene.
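A sketch of what the concurrency and retry loop might look like. The pool size, quality gate, and prompt-adjustment helpers are illustrative stand-ins, not the production implementation.

```typescript
// Sketch of bounded concurrency with quality-gated retries.
// passesQualityGate and adjustPrompt are hypothetical stand-ins.

declare function renderSceneVisual(beat: BeatPlan): Promise<string>; // from the sketch above
declare function passesQualityGate(clipUrl: string): Promise<boolean>; // lightweight vision pass
declare function adjustPrompt(prompt: string, attempt: number): string;

async function generateWithRetries(beat: BeatPlan, maxAttempts = 3): Promise<string> {
  let prompt = beat.videoPrompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const clipUrl = await renderSceneVisual({ ...beat, videoPrompt: prompt });
    if (await passesQualityGate(clipUrl)) return clipUrl;
    prompt = adjustPrompt(prompt, attempt); // retry with a modified prompt
  }
  throw new Error(`scene ${beat.sceneId}: quality gate failed after ${maxAttempts} attempts`);
}

// Simple worker pool: run up to `limit` scene generations at once.
async function runPool(beats: BeatPlan[], limit: number): Promise<string[]> {
  const results: string[] = new Array(beats.length);
  let next = 0;
  const worker = async () => {
    while (next < beats.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await generateWithRetries(beats[i]);
    }
  };
  await Promise.all(Array.from({ length: limit }, worker));
  return results;
}
```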
Stage 4: The Consistency Problem
A still-frame-to-animation pipeline surfaces a consistency problem that single-model approaches don't have: if each scene's still frame is generated independently, characters and environments can drift visually across scenes.
Onira addresses this through two mechanisms.
The first is the org-shared character library: characters can be defined with FRONT, THREE_QUARTER, and FULL_BODY reference portraits. The ImageDirector agent receives these portraits as reference images, not just text descriptions. Gemini 3.1 Flash Image follows image references closely, so a character with defined portraits stays visually consistent across scenes without per-scene manual prompt tuning.
The second is the Showrunner's VisualContext and VisualSubject fields on each BeatPlan. These carry the committed visual environment and subject descriptions across scenes, so the ImageDirector isn't re-inventing each scene from scratch. Narrative continuity at the screenplay level translates to visual continuity at the generation level.
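As a rough sketch, the inputs to the still pass might be shaped like this. The angle names come from the character library described above; everything else is illustrative.

```typescript
// Illustrative shapes for the consistency inputs; not Onira's schema.

type PortraitAngle = 'FRONT' | 'THREE_QUARTER' | 'FULL_BODY';

interface CharacterReference {
  name: string;
  portraits: Partial<Record<PortraitAngle, string>>; // angle -> portrait image URL
}

interface StillPassInput {
  appearancePrompt: string;              // from the ImageDirector
  VisualContext: string;                 // committed environment, shared across scenes
  VisualSubject: string;                 // committed subject description
  referenceImages: CharacterReference[]; // portraits passed as images, not text
}
```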
The Director's Studio in the UI surfaces these mechanisms explicitly — you can review and override the ImageDirector's output per scene, set active asset versions, and trigger regeneration with modified prompts before the Pixverse animation pass runs.
Stage 5: Assembly (Remotion)
The final stage assembles all generated assets — video clips, narration audio, score segments, subtitle text — into a finished video.
We use Remotion for assembly, which lets us define the entire video as a React component tree. Each scene is a component that receives its clip, audio segment, timing, and subtitle data as props. The final render is a deterministic, reproducible output from a structured data input — the same BeatPlan data that drove generation also drives assembly.
The narration durations from Stage 2 dictate clip lengths and edit points. Clips are trimmed to match narration (or looped for short ambient clips under longer narration segments). Score segments from ElevenLabs Music play at the act level, crossfading at act boundaries.
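A trimmed-down sketch of what this can look like in Remotion. The prop shapes are illustrative; Sequence, Audio, OffthreadVideo, and AbsoluteFill are real Remotion primitives.

```tsx
// Sketch of BeatPlan-driven assembly in Remotion.
import React from 'react';
import { AbsoluteFill, Audio, OffthreadVideo, Sequence } from 'remotion';

type SceneAsset = {
  clipUrl: string;
  narrationUrl: string;
  fromFrame: number;        // cumulative start, derived from narration durations
  durationInFrames: number; // narration duration in seconds x composition fps
};

export const Episode: React.FC<{ scenes: SceneAsset[]; scoreUrl: string }> = ({
  scenes,
  scoreUrl,
}) => (
  <AbsoluteFill>
    {/* Act-level score underneath everything */}
    <Audio src={scoreUrl} />
    {scenes.map((s) => (
      <Sequence key={s.clipUrl} from={s.fromFrame} durationInFrames={s.durationInFrames}>
        {/* Sequence bounds trim the clip to the narration duration */}
        <OffthreadVideo src={s.clipUrl} />
        <Audio src={s.narrationUrl} />
      </Sequence>
    ))}
  </AbsoluteFill>
);
```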
The output is a single MP4 at the user's requested resolution: Draft 720p, HD 1080p, Full Quality 1080p (Studio plan and above), or 4K Ultra HD (Pro plan and above). Subtitles are embedded as a separate track and remain editable in the in-app Timeline editor after render.
Why This Architecture?
The honest answer is: because we tried the simpler approaches first and they produced lower-quality output.
Single-model pipelines are faster to build and easier to operate. We built several during development. The ceiling was consistently lower — especially for research accuracy, narration quality, and scene-to-scene visual consistency.
The Showrunner-as-sole-curator design came from a specific failure mode: when the Screenwriter had access to the full research bible, it hallucinated details from adjacent facts in the bible that weren't relevant to the scene it was writing. The fix wasn't prompt engineering — it was an architectural constraint. Only the Showrunner reads the full bible. Everything downstream works from committed facts.
Audio-first ordering came from the opposite direction: visual-first pipelines produced great-looking clips that didn't sync to narration without re-timing the audio, which always sounded wrong. The fix was to lock audio first and constrain visuals to match.
Multi-stage pipelines are harder to operate. More failure surfaces. More retry logic. But the output is noticeably better, and "noticeably better" is the product.
The pipeline described above is the current production version — not the final version. We're refining quality gates, extending the character library, and improving the Director's Studio controls. The architecture is stable; the components keep improving.