Why We Built Onira

Biel Carpi

March 15, 2026 · 5 min read

I want to tell you why I built this.

Not the polished version — the honest one. The version where I admit that I spent three months in late 2024 juggling eight different tools to produce a single documentary about the history of the internet, and by the end of it I wanted to burn everything down and never make another video.

The output was good. The process was a disaster.

The Problem No One Talks About

Here is what producing a single YouTube documentary actually looked like for me at the time:

Tool	Purpose	Time Wasted
ChatGPT / Claude	Research and script draft	2–3 hrs per video
ElevenLabs (manual)	AI voiceover (configure each time)	45 min per video
AI image generators	Static stills, one prompt at a time	1–2 hrs per video
AI video generators	Clips, one scene at a time	2–4 hrs per video
Pexels / Artgrid	Stock footage gap-filling	1–2 hrs per video
Epidemic Sound	Music licensing	30 min per video
Timeline editor & subtitle editing	Assembly and syncing audio to clips	4–6 hrs per video

Eight tools. Eight separate subscription tabs. Eight separate creative decisions happening in isolation. None of them talking to each other.

The assembly alone — stitching together clips from three different AI generators that each produced footage with different timing, framing, and character drift — took six hours. Six hours of manual work that should have been automatic, because every single visual had already been described in the script.

I calculated it afterward: 22+ hours to produce a single video. That is not sustainable. That is a hobby, not a business.

The Vision That Became Onira

The thing I kept thinking was: this should be one step.

I should be able to say "make a documentary about the history of the internet, authoritative tone, ElevenLabs narration, original score" and receive a finished video. Not a rough cut. Not raw footage I still have to assemble. A finished video.

That is what Onira is. One prompt, one finished video. The entire production pipeline — research-grounded screenplay, audio-first narration, per-scene cinematic stills, image-to-video animation, original score, final assembly — handled automatically, in 10–30 minutes of render time.

But the vision was never just "automate the workflow." The vision was: make it good enough that you actually want to publish it. Cinema quality, not template quality.

Why Cinema Quality Matters More Than Ever

YouTube's AI content policy has changed everything. Since mid-2025, template-based AI content (the stuff that looks like it was assembled from a stock footage library with a voiceover slapped on top) has been systematically demonetized. YouTube's algorithm has gotten remarkably good at identifying it, and advertiser demand for that type of inventory has collapsed.

The channels that are growing in 2026 are the ones that look and sound professionally made. Viewers may not consciously notice the production quality, but their behavior reflects it, and the algorithm picks that up: watch time, retention curves, and click-through rates are all higher for content that does not feel artificially assembled.

Template content is a race to the bottom. A thousand channels producing indistinguishable videos about the same topics at the same quality level is not a content strategy — it is noise. Cinema quality is the moat. It is the reason someone subscribes to your channel instead of the next one in the search results.

This is what we mean by "cinema quality" at Onira. We do not mean Hollywood. We mean: narratively structured, visually coherent, aurally rich, and distinctly yours. Content that does not announce "I was made by a random AI" in its first five seconds.

The Technical Approach

Getting there required two specific architectural insights.

The first is audio-first. Every other AI video pipeline I looked at generates visuals first, then tries to synchronize narration afterward. The result is audio that always feels slightly off — either rushed to fit a clip or stretched with awkward pauses. Onira locks narration before a single frame is generated. ElevenLabs eleven_v3 produces the voiceover for every scene. Those audio durations become the absolute constraint. Every visual clip is generated to conform to audio, never the reverse. Sync is not a post-production problem — it is an architectural guarantee.
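To make the audio-first idea concrete, here is a minimal sketch of how locked narration durations could drive the timeline. It is an illustration under stated assumptions, not Onira's actual code: the 30 fps frame rate, the SceneSlot type, and the build_timeline function are all hypothetical names chosen for this example.

```python
import math
from dataclasses import dataclass

FPS = 30  # assumed frame rate; the post does not state one


@dataclass(frozen=True)
class SceneSlot:
    scene_id: str
    start_frame: int
    duration_frames: int


def build_timeline(audio_seconds: dict[str, float], fps: int = FPS) -> list[SceneSlot]:
    """Lay scenes end to end, sizing each slot from its narration length."""
    slots = []
    cursor = 0
    for scene_id, seconds in audio_seconds.items():
        # Round up so the visual slot never ends before the narration does.
        # The inner round() absorbs floating-point noise such as 126.00000001.
        frames = math.ceil(round(seconds * fps, 6))
        slots.append(SceneSlot(scene_id, cursor, frames))
        cursor += frames
    return slots


# Narration is measured first; visuals only ever receive these slots.
timeline = build_timeline({"intro": 4.5, "arpanet": 2.25})
```

The point of the design is visible in the signature: the function takes audio durations as its only input, so there is no code path in which a clip length can push the narration around.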

The second is how the screenplay works. Most AI video tools pass a prompt to a language model and let it write whatever it wants. The result is plausible-sounding narration that invents facts, misattributes quotes, and hallucinates details with complete confidence.

Onira's screenplay engine is structured differently. A Researcher agent builds a fact bible from the prompt: sourced claims, timeline events, verified quotes. A Showrunner agent is the only model that ever reads the full bible. It curates committed facts onto each scene's BeatPlan: "this scene uses this quote, this date, this claim." A Screenwriter agent then writes narration from those committed facts only. The bible is never in its prompt, so it cannot pull in a claim the Showrunner did not commit to that scene. Finally, a Researcher.VerifyScreenplay pass checks every narration line against the source bible before a single credit is spent on production, so anything the Screenwriter invents on its own can be flagged.

The result is narration you can trust — not just narration that sounds trustworthy.
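The separation between the bible, the BeatPlan, and the Screenwriter can be sketched in a few lines. This is a simplified illustration, not the real engine: the Fact type, the fact IDs, and both helper functions are hypothetical, and the check below assumes each narration line is tagged with the fact it relies on, which stands in for whatever Researcher.VerifyScreenplay actually does.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    fact_id: str
    text: str


@dataclass
class BeatPlan:
    scene_id: str
    committed_fact_ids: list[str] = field(default_factory=list)


def screenwriter_context(plan: BeatPlan, bible: dict[str, Fact]) -> list[str]:
    """Return only the facts committed to this scene, never the whole bible."""
    return [bible[fid].text for fid in plan.committed_fact_ids]


def uncommitted_lines(narration: list[tuple[str, str]], plan: BeatPlan) -> list[str]:
    """Flag narration lines that rely on a fact this scene never committed."""
    allowed = set(plan.committed_fact_ids)
    return [line for line, fact_id in narration if fact_id not in allowed]


bible = {
    "f1": Fact("f1", "ARPANET sent its first message in 1969."),
    "f2": Fact("f2", "The World Wide Web was proposed in 1989."),
}
plan = BeatPlan("scene-1", ["f1"])

# The Screenwriter would see one fact here, not two.
context = screenwriter_context(plan, bible)

# A line leaning on f2 is flagged, because scene-1 only committed f1.
flagged = uncommitted_lines(
    [("In 1969, ARPANET carried its first message.", "f1"),
     ("Two decades later came the web.", "f2")],
    plan,
)
```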

Then there is the visual pipeline. Gemini 3.1 Flash Image generates a cinematic still frame for each scene, reviewed by an ImageDirector agent for composition and visual tone. Pixverse v6 animates each still into a 1–14 second motion clip, with a VideoDirector agent specifying exactly the motion to generate. Because every clip starts from a still, visual consistency carries forward automatically — the same character, same color world, same directorial intent. Remotion assembles the full timeline.
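Because a clip runs between 1 and 14 seconds while a scene's narration can run longer, each scene needs some rule for dividing its audio into clips. The sketch below shows one simple possibility, equal-length clips; the function name and the choice to pad very short scenes up to the minimum are assumptions made for illustration, not a description of what the VideoDirector agent does.

```python
import math

MIN_CLIP = 1.0   # seconds, lower bound quoted in the post
MAX_CLIP = 14.0  # seconds, upper bound quoted in the post


def plan_clips(audio_seconds: float) -> list[float]:
    """Split one scene's narration length into equal clips within the limits."""
    if audio_seconds < MIN_CLIP:
        # Assumption: generate the shortest allowed clip and trim at assembly.
        return [MIN_CLIP]
    # Fewest clips such that none exceeds the maximum length.
    count = math.ceil(audio_seconds / MAX_CLIP)
    return [audio_seconds / count] * count


print(plan_clips(30.0))  # [10.0, 10.0, 10.0]
print(plan_clips(9.0))   # [9.0]
```

Using the fewest clips that fit keeps every clip at 7 seconds or longer once a scene needs more than one, which avoids a short leftover fragment at the end of a scene.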

That is the technical core. It is not magic — it is orchestration. But the output is noticeably different from anything a single-model or prompt-and-pray approach produces.

What We Are Building Toward

We are in early access. The current version handles the full production pipeline for videos up to 10 minutes: research, screenplay, narration in 30+ languages, per-scene visuals, original score, timeline assembly, and MP4 export up to 4K.

On the Creator plan ($149/mo, 3,750 credits), a 10-minute video costs approximately 1,930 credits: roughly $77 retail, or about $41 if you use your included monthly credits efficiently. Compare that to $10,000–$100,000 for a traditionally produced documentary of equivalent length, or the 22+ hours I was spending manually.
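The plan arithmetic is easy to check. The sketch below uses only the figures quoted above and assumes every credit is valued at the plan's average rate; it reproduces the roughly $77 figure, while the lower figure depends on pricing details this post does not spell out. The function and constant names are made up for the example.

```python
PLAN_PRICE_USD = 149.0   # Creator plan, per month
PLAN_CREDITS = 3750      # credits included per month
VIDEO_CREDITS = 1930     # approximate cost of a 10-minute video


def video_cost_usd(credits: int) -> float:
    """Price a video at the plan's average cost per credit (an assumption)."""
    return credits * PLAN_PRICE_USD / PLAN_CREDITS


full_videos = PLAN_CREDITS // VIDEO_CREDITS                  # 1
leftover = PLAN_CREDITS - full_videos * VIDEO_CREDITS        # 1820

print(f"{video_cost_usd(VIDEO_CREDITS):.2f}")  # 76.69
```

At these numbers one 10-minute video fits in a month's credits, with 1,820 credits left over for a shorter piece.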

But what I am most excited about is the access question. Most stories do not get told — not because they are not worth telling, but because the people who have them do not have the resources to tell them at a quality level anyone will watch.

The educator with a deep knowledge of Byzantine history who could never afford a production team. The journalist with a story that does not fit a newspaper. The scientist who wants to explain their research to a general audience. These are the creators I am building for.

The barrier to cinema-quality video production has been high for too long. We are making it accessible. Your story is worth telling.

If you want to be among the first to use Onira, get started today. We are shipping every day, and early creators get the best rates and help shape what we build next.

— Biel
