
How Onira Orchestrates 7+ AI Models Per Video

Biel Carpi
6 min read

A common question we get: how does Onira actually work under the hood? What happens between "type a prompt" and "receive a finished video"?

This is the technical answer. Not a marketing explanation — a real walkthrough of the pipeline, the model selection decisions, and why we built it this way instead of the simpler alternatives.

The Fundamental Problem: No Single Model Does Everything Well

Every AI video tool that came before Onira made the same mistake: pick one AI model and use it for everything. Type a prompt, get footage from Model X, assemble it, done.

The problem is that no single AI video model is the best at everything. Kling handles complex motion well. Hailuo produces extraordinary atmospheric, wide-shot cinematography. Veo is best for photorealistic footage. Each model has a domain where it clearly outperforms the others — and each model has types of scenes where it reliably underperforms.

When you use a single model for 80 scenes, you are accepting a ceiling. The best you can do is the average capability of that model across scene types. Some scenes will look great; others will look off. The result is an inconsistent video — not a bad video, but one where the quality varies noticeably from scene to scene.

Onira's architecture is built around a different premise: for each scene, identify which model will produce the best result, and route to that model. The quality ceiling becomes the aggregate of the best capabilities across all models. That is a fundamentally different — and higher — ceiling.

Here is a summary of every model in the pipeline:

| Model | Stage | Primary Task |
| --- | --- | --- |
| Gemini 2.5 Pro | Script Engine | Research synthesis, scene planning, visual classification |
| ElevenLabs | Narration | Scene-by-scene voiceover with pacing and mood settings |
| AI Music (Suno) | Narration | Full-video score generation matching emotional arc |
| Kling 3.0 | Visual Routing | Complex motion: action, people, machinery |
| Hailuo 2.3 | Visual Routing | Atmospheric hero shots: landscapes, dramatic lighting |
| Veo | Visual Routing | Photorealistic documentary footage: interiors, close-ups |
| Grok (xAI) | Visual Routing | High-quality stills: montages, infographics, title cards |
| Remotion | Assembly | Final render: timing, mixing, transitions, export |

Stage 1: Script Engine (Gemini 2.5 Pro)

Every video begins with the Script Engine. This is not a "generate a script" call to a language model. It is a structured production planning system.

When you submit a prompt to Onira, the Script Engine does several things in sequence:

  • Research synthesis: The model builds a knowledge base for the topic, identifying the most important concepts, the narrative arc that will best explain them, and the common misconceptions that an educational video should address.
  • Scene planning: The script is broken into 60–80 individual scenes, each with a narration segment, a visual description, a mood classification, an intended pacing marker (fast/medium/slow), and transition notes. This scene plan is the production blueprint for everything that follows.
  • Visual classification: Each scene's visual description is classified along several dimensions: motion intensity (static/gentle/dynamic), environment type (interior/exterior/abstract/space/microscale), realism target (photorealistic/cinematic/stylized), and content type (landscape/action/montage/infographic). These classifications feed directly into the routing logic.

The output of the Script Engine is not a text document. It is a structured JSON production plan that every downstream stage consumes. The narration text goes to the Narration stage. The visual descriptions and classifications go to the Visual Router. The pacing markers go to the Assembly stage.
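As a rough illustration, a plan like this might be typed as follows. The field names and enum values below are inferred from the stages described in this post, not Onira's actual schema:

```typescript
// Hypothetical shape of the Script Engine's structured production plan.
type Pacing = "fast" | "medium" | "slow";

interface Scene {
  id: number;
  narration: string;          // consumed by the Narration stage
  visualDescription: string;  // consumed by the Visual Router
  mood: string;
  pacing: Pacing;             // consumed by the Assembly stage
  classification: {
    motion: "static" | "gentle" | "dynamic";
    environment: "interior" | "exterior" | "abstract" | "space" | "microscale";
    realism: "photorealistic" | "cinematic" | "stylized";
    content: "landscape" | "action" | "montage" | "infographic";
  };
}

interface ProductionPlan {
  topic: string;
  scenes: Scene[]; // typically 60-80 entries
}

// An example scene entry (invented content):
const example: Scene = {
  id: 1,
  narration: "In 1905, a patent clerk rewrote physics.",
  visualDescription: "Slow dolly across a cluttered early-1900s desk",
  mood: "contemplative",
  pacing: "slow",
  classification: {
    motion: "gentle",
    environment: "interior",
    realism: "cinematic",
    content: "landscape",
  },
};
```

The point of typing the plan this way is that each downstream stage can consume only the fields it needs.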

Why Gemini 2.5 Pro? Long-context handling and instruction following at scale. A 12-minute video requires planning 80 scenes with consistent narrative logic, and the model needs to hold the entire arc in context while planning each individual scene. We tested several models; Gemini 2.5 Pro produced the most coherent long-form scripts with the fewest structural inconsistencies.

Stage 2: Narration (ElevenLabs)

Narration generation runs in parallel with visual production after the Script Engine completes.

Each scene's narration text is passed to ElevenLabs with scene-specific parameters: the pacing marker from the Script Engine determines the speaking rate, the mood classification influences prosody settings, and the overall tone configuration (set at the project level: authoritative, warm, conversational, etc.) applies globally.

The output is timestamped audio segments — one per scene — that are returned with precise duration information. This duration information is critical: it tells the Assembly stage exactly how long each scene's visual content needs to run, which determines clip length and transition timing.
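A minimal sketch of how per-scene durations could be turned into a timeline, assuming the TTS stage returns one duration per scene (names hypothetical):

```typescript
// Derive each scene's start offset and required clip length from
// the narration durations returned by the TTS stage.
interface NarrationSegment { sceneId: number; durationSec: number }
interface SceneTiming { sceneId: number; startSec: number; clipLengthSec: number }

function buildTimeline(segments: NarrationSegment[]): SceneTiming[] {
  let cursor = 0; // running offset from the start of the video
  return segments.map((s) => {
    const timing = { sceneId: s.sceneId, startSec: cursor, clipLengthSec: s.durationSec };
    cursor += s.durationSec;
    return timing;
  });
}
```

With this, the Assembly stage knows exactly where each scene begins and how long its visual content must run.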

Music generation runs in the same stage. An AI music model receives the full narrative arc, mood progression, and total runtime, and generates a complete score — not a loop, but a composed piece that builds and resolves with the video's emotional arc. Individual music segments are also generated per scene for moments that need specific musical transitions.

Stage 3: Visual Routing

This is the core of the architecture. The Visual Router receives the 60–80 scene visual classifications from the Script Engine and makes a routing decision for each one.

The routing logic currently uses four primary models:

| Model | Best For | Routed When |
| --- | --- | --- |
| Kling 3.0 | Complex motion: people walking, machinery, action sequences | Dynamic motion + photorealistic or cinematic target |
| Hailuo 2.3 | Atmospheric hero shots: wide landscapes, dramatic skies, complex lighting | Gentle/static motion + cinematic/stylized + exterior or space environment |
| Veo | Photorealistic documentary footage: interiors, close-ups, real-world scenes | Interior environments + photorealistic target + low motion intensity |
| Grok (xAI) | High-quality stills: montages, infographics, title cards, historical illustrations | Static motion + montage or infographic content type |
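The routing rules above can be read as an ordered decision list. Here is a hedged sketch of that logic in TypeScript; the rule ordering and fallback are my own simplification, not Onira's production router:

```typescript
type Model = "kling" | "hailuo" | "veo" | "grok";

interface Classification {
  motion: "static" | "gentle" | "dynamic";
  environment: "interior" | "exterior" | "abstract" | "space" | "microscale";
  realism: "photorealistic" | "cinematic" | "stylized";
  content: "landscape" | "action" | "montage" | "infographic";
}

function routeScene(c: Classification): Model {
  // Stills first: static montages and infographics go to Grok.
  if (c.motion === "static" && (c.content === "montage" || c.content === "infographic")) {
    return "grok";
  }
  // Dynamic motion with a realistic target goes to Kling.
  if (c.motion === "dynamic" && (c.realism === "photorealistic" || c.realism === "cinematic")) {
    return "kling";
  }
  // Low-motion photorealistic interiors go to Veo.
  if (c.environment === "interior" && c.realism === "photorealistic" && c.motion !== "dynamic") {
    return "veo";
  }
  // Gentle/static cinematic or stylized exteriors and space shots go to Hailuo.
  if ((c.environment === "exterior" || c.environment === "space") &&
      c.realism !== "photorealistic" && c.motion !== "dynamic") {
    return "hailuo";
  }
  return "kling"; // hypothetical default for unmatched combinations
}
```

A real router would also weigh cost, latency, and per-model rate-limit headroom, but the classification-to-model mapping is the core of it.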

Each scene is processed concurrently — we run as many parallel generation calls as the rate limits of each API allow. An 80-scene video might be running 20+ concurrent generation calls at peak. This is a significant part of why Onira can produce a finished video in 15–30 minutes rather than hours.
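The concurrency pattern is a standard bounded worker pool. A minimal sketch (illustrative, not Onira's code): run up to `limit` generation calls at once while preserving result order.

```typescript
// Run fn over items with at most `limit` calls in flight at a time.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared index; safe because JS workers interleave, not parallelize

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

In practice each model would get its own pool sized to that API's rate limit, rather than one global limit.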

When a generation returns a result below a quality threshold (checked by a lightweight vision model), the scene is automatically retried — either with the same model using a modified prompt, or rerouted to an alternative model. This retry logic is invisible to the user but meaningfully improves the final output quality.
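The retry policy can be sketched as a small wrapper: one retry on the same model with a modified prompt, then a reroute to a fallback model. All names, the prompt modification, and the threshold value here are hypothetical:

```typescript
interface GenResult { model: string; url: string; score: number }

// Generate a clip, retrying once with a tweaked prompt, then rerouting.
async function generateWithRetry(
  prompt: string,
  primary: string,
  fallback: string,
  generate: (model: string, prompt: string) => Promise<GenResult>,
  threshold = 0.7, // quality score from the vision-model check
): Promise<GenResult> {
  let result = await generate(primary, prompt);
  if (result.score >= threshold) return result;

  // Retry the same model with a modified prompt.
  result = await generate(primary, `${prompt}, high detail, stable camera`);
  if (result.score >= threshold) return result;

  // Reroute to an alternative model as a last resort.
  return generate(fallback, prompt);
}
```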

Stage 4: Color Grading

This stage exists entirely because of the multi-model architecture.

Footage from Kling, Hailuo, Veo, and Grok has different native color profiles. Different contrast curves, color temperature biases, saturation levels. If you cut between them without correction, the visual inconsistency is immediately apparent — the video looks assembled, not produced.

Onira applies LUT-based color grading to every clip before assembly. LUTs (look-up tables) are the same tool used in professional film post-production — they remap the color space of a clip according to a defined transformation. We apply a project-level LUT (selected by the user or inferred from the tone setting) to normalize all clips to a common color space, then apply scene-level adjustments for clips that need additional correction.
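To make the mechanism concrete, here is the principle reduced to a 1D per-channel lookup table with linear interpolation. Real grading pipelines use 3D LUTs (e.g. `.cube` files) that remap all three channels jointly, but the remapping idea is the same; the sample curve below is invented:

```typescript
// Remap one color channel (in [0, 1]) through a sampled transfer curve.
function applyLut1D(channel: number, lut: number[]): number {
  const pos = channel * (lut.length - 1);
  const lo = Math.floor(pos);
  const hi = Math.min(lo + 1, lut.length - 1);
  const t = pos - lo;
  // Linear interpolation between the two nearest LUT samples.
  return lut[lo] * (1 - t) + lut[hi] * t;
}

// Example: a mild contrast-lift curve sampled at 5 points (illustrative).
const contrastLift = [0.0, 0.18, 0.5, 0.82, 1.0];
```

Applying one such project-level curve to every clip is what pulls four different native color profiles into a common space.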

The result is that footage from four different AI models comes out looking like it was shot by a single cinematographer on a single camera. The seams are invisible. This is the single most impactful step for making AI-produced video look professional, and it is the step that most AI video tools skip entirely.

Stage 5: Assembly (Remotion)

The final stage assembles all generated assets — video clips, narration audio, music, sound effects, text overlays, subtitles — into a finished video.

We use Remotion for assembly, which allows us to define the entire video as a React component tree. Each scene is a component that receives its clip, audio segment, timing, and overlay data as props. The final render is a deterministic, reproducible output from a structured data input.

The pacing markers from the Script Engine inform the Assembly stage's editing decisions: fast-paced scenes use shorter clips and quicker transitions; contemplative scenes use longer takes and slower fades. Sound effects are added based on the scene environment classification. The narration, music, and sound effects are mixed at defined relative levels and the final audio is mastered for loudness normalization.
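The pacing-to-editing mapping can be sketched as a simple lookup; the specific durations here are illustrative values, not Onira's actual tuning:

```typescript
type Pacing = "fast" | "medium" | "slow";

// Map the Script Engine's pacing marker to editing parameters.
function editingParams(pacing: Pacing): { transitionSec: number; maxTakeSec: number } {
  switch (pacing) {
    case "fast":   return { transitionSec: 0.2, maxTakeSec: 3 };  // quick cuts
    case "medium": return { transitionSec: 0.5, maxTakeSec: 6 };
    case "slow":   return { transitionSec: 1.2, maxTakeSec: 10 }; // long takes, slow fades
  }
}
```

In a Remotion setup these values would flow into each scene component as props, keeping the render fully determined by the production plan.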

The output is a single MP4 file, rendered at the user's requested resolution, with YouTube-optimized metadata generated alongside it.

Why This Architecture?

The honest answer is: because we tried the simpler approaches first and they were not good enough.

Single-model pipelines are faster to build and easier to operate. They have lower API cost complexity and fewer failure modes. We built several of them during development. The output quality ceiling was consistently lower than what we wanted to ship.

Multi-model orchestration is harder. More failure surfaces. More complex routing logic. More quality control requirements. But the output is noticeably better — and "noticeably better" is the entire value proposition.

Cinema quality is not an aesthetic preference. It is the actual product. Every architectural decision in Onira's pipeline exists in service of that goal: produce video that does not look like it was made by an AI, at a cost and speed that makes it accessible to individual creators.

We are still improving the routing logic, adding new models as they mature, and refining the quality thresholds for retries. The pipeline described above is the current production version — not the final version. The ceiling keeps rising as the underlying models improve.

If you want to see it in action, join the waitlist for early access.
