2025 · Solo founder + engineer · Live

ClipFlow — Self-hosted AI clip pipeline

Auto-clip podcasts and ceramah (Islamic lectures) with speaker tracking, plus CinemaAI story video and ASMR Studio. Fully self-hosted on a Dell R620.

Next.js · Fastify · Python · Redis · MySQL · FFmpeg · Vertex AI · ElevenLabs

The problem

Short-form video (TikTok / Shorts / Reels) eats long-form podcast and ceramah audiences alive — but the cutting workflow is brutal. A typical podcast editor spends 30-60 minutes per finished clip: scrubbing for the hook, reframing 16:9 → 9:16, timing captions per word, drafting metadata, exporting per platform. For an Indonesian creator publishing daily, that math doesn't work.

Existing tools (Opus, Submagic, etc.) are either expensive in IDR, gated behind heavy subscriptions, or generic enough that they miss Indonesian-specific content like dakwah (Islamic preaching), where the viral signal isn't "reaction" but quotable nasihat (advice).

ClipFlow is my answer: a single tool that ingests a YouTube link and returns 15-20 ranked clips with captions burned in, thumbnails generated, and platform metadata pre-drafted. Plus two adjacent modules: CinemaAI for prompt → cinematic story video, and ASMR Studio for long-form ambient renders with music and nature sounds layered.

Stack decisions

I picked stack components for total control, not for what's fashionable:

  • Frontend: Next.js 14 App Router with Tailwind. Server Components by default, client components only when needed. Familiar from day-job React work.
  • API: Fastify on Node 20. Faster startup than Express, better type ergonomics with Zod, lighter than NestJS. Prisma on MySQL 8 (chose MySQL because it was already in my lab; Postgres would be an easy swap).
  • Workers: Python 3.11. AI library ecosystem is Python-first (Whisper, pyannote, MediaPipe, google-cloud-aiplatform). Two processes — worker-http (sync, FastAPI) and worker-queue (async dispatcher).
  • Queue: Redis Streams with consumer groups. Considered BullMQ, but Streams felt simpler: XADD producer, XREADGROUP consumer, and the native pending-entries list for crash recovery. No JS/Python language barrier (minimal consumer sketch after this list).
  • Storage: Cloudflare R2. S3-compatible API with zero egress fees. For a self-hosted SaaS, it's the rare cloud service that's actually a deal.
  • AI: Vertex AI for Gemini + Veo 3, ElevenLabs for SFX generation. Started with the Gemini Developer API (key auth), switched to a Vertex AI service account once I had GCP free credit. Same client lib (google-genai); the swap is an env var (client sketch below).
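
Here's a minimal sketch of that Streams pattern using redis-py; stream, group, and field names are illustrative, not ClipFlow's actual schema:

```python
import redis

r = redis.Redis(decode_responses=True)
STREAM, GROUP, CONSUMER = "clip:jobs", "workers", "worker-1"

# One-time setup: create the consumer group (ignore "already exists").
try:
    r.xgroup_create(STREAM, GROUP, id="$", mkstream=True)
except redis.ResponseError:
    pass

# Producer (API side): XADD enqueues a job as a flat field map.
r.xadd(STREAM, {"type": "score", "project_id": "42"})

def handle(fields: dict) -> None:
    print("processing", fields)

# Consumer (worker side): reclaim stale work, then block for new work.
while True:
    # Claim entries a crashed consumer left pending for > 60s.
    reclaimed = r.xautoclaim(STREAM, GROUP, CONSUMER, min_idle_time=60_000)[1]
    for msg_id, fields in reclaimed:
        handle(fields)
        r.xack(STREAM, GROUP, msg_id)

    # Block up to 5s waiting for fresh entries.
    for _, entries in r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"},
                                   count=10, block=5000) or []:
        for msg_id, fields in entries:
            handle(fields)
            r.xack(STREAM, GROUP, msg_id)
```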
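And the key-auth → service-account swap from the last bullet: the google-genai client reads a handful of documented env vars, so switching backends is configuration, not code. A sketch of the pattern (the model name and prompt are illustrative):

```python
from google import genai

# Gemini Developer API:  GOOGLE_API_KEY=...
# Vertex AI:             GOOGLE_GENAI_USE_VERTEXAI=true
#                        GOOGLE_CLOUD_PROJECT=my-project
#                        GOOGLE_CLOUD_LOCATION=us-central1
# Same code path either way; Client() picks up whichever is set.
client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.0-flash",  # illustrative model name
    contents="Rank these transcript segments by hook strength: ...",
)
print(resp.text)
```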

Things that took longer than I expected

FFmpeg edge cases. Building a "render 8s Veo clip into a 1-hour seamless loop with music + 3 nature sounds layered" pipeline is ~200 lines of FFmpeg filter_complex graph. Volume mapping (pct → dB), xfade boundary trim, AAC bitrate vs file size — each took an afternoon to debug because FFmpeg fails silently if you mismatch a filter input arity.
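
To make the arity point concrete, here's a trimmed-down sketch of the kind of graph involved: percent sliders mapped to dB, two audio layers mixed under a looped video. Filenames and levels are made up, and the real graph is far longer.

```python
import math
import subprocess

def pct_to_db(pct: float) -> float:
    """Map a 0-100% UI volume slider to a dB gain (100% -> 0 dB)."""
    return 20 * math.log10(max(pct, 1) / 100)

music_db = pct_to_db(60)  # ~ -4.4 dB
rain_db = pct_to_db(25)   # ~ -12.0 dB

# amix's inputs= must match the number of [labels] fed into it;
# get that arity wrong and you're debugging a quiet failure.
filter_graph = (
    f"[1:a]volume={music_db:.1f}dB[music];"
    f"[2:a]volume={rain_db:.1f}dB[rain];"
    f"[music][rain]amix=inputs=2:duration=longest[mix]"
)
subprocess.run([
    "ffmpeg", "-y",
    "-stream_loop", "-1", "-i", "veo_clip.mp4",  # loop the 8s video
    "-i", "music.mp3", "-i", "rain.wav",
    "-filter_complex", filter_graph,
    "-map", "0:v", "-map", "[mix]",
    "-t", "3600",  # cap the output at 1 hour
    "-c:v", "libx264", "-c:a", "aac", "-b:a", "192k",
    "out.mp4",
], check=True)
```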

Veo 3 content policy. Veo aggressively rejects prompts mentioning weapons, blood, or even close-ups of children's faces in dramatic moments. For a tool aimed at storytelling and dakwah, that's a daily problem. I built a Gemini-powered "rewrite this visualPrompt to survive the policy filter" tool that auto-retries once, and a manual "Upload your own video" fallback for when even the AI rewrite fails.
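
A sketch of that retry loop; generate_veo_clip and PolicyRejection are stand-ins for whatever the real Veo call surface raises, and the Gemini model name is illustrative:

```python
from google import genai

client = genai.Client()

class PolicyRejection(Exception):
    """Stand-in for the error the Veo call raises on a policy block."""

def generate_veo_clip(prompt: str):
    """Stand-in for the real Veo 3 generation call."""
    raise NotImplementedError

REWRITE_SYS = (
    "Rewrite this video-generation prompt so it keeps the story beat but "
    "removes anything a strict content policy might flag (weapons, gore, "
    "children in distress). Return only the rewritten prompt."
)

def generate_with_policy_retry(visual_prompt: str):
    try:
        return generate_veo_clip(visual_prompt)
    except PolicyRejection:
        resp = client.models.generate_content(
            model="gemini-2.0-flash",  # illustrative model name
            contents=f"{REWRITE_SYS}\n\nPrompt: {visual_prompt}",
        )
        try:
            return generate_veo_clip(resp.text)  # auto-retry once
        except PolicyRejection:
            return None  # UI falls back to "Upload your own video"
```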

Per-content-type scoring. The first version scored every transcript with the same "hooks, debates, reactions" prompt. It worked great for podcasts and terribly for ceramah: it surfaced "rame" (rowdy, crowd-pleasing) segments instead of quotable nasihat. The fix was a Project.contentMode enum (PODCAST / SERMON / STORY / TALK) that picks one of four Gemini system prompts at SCORE time. A one-day refactor, a dramatic UX improvement.
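
The fix itself is tiny. A sketch of the shape, with enum values from the text and prompt bodies abbreviated:

```python
from enum import Enum

class ContentMode(str, Enum):
    PODCAST = "PODCAST"
    SERMON = "SERMON"
    STORY = "STORY"
    TALK = "TALK"

# One Gemini system prompt per mode; bodies abbreviated here.
SCORING_PROMPTS = {
    ContentMode.PODCAST: "Score for hooks, debates, strong reactions...",
    ContentMode.SERMON:  "Score for quotable nasihat, complete advice...",
    ContentMode.STORY:   "Score for narrative turns and payoffs...",
    ContentMode.TALK:    "Score for clear, self-contained insights...",
}

def system_prompt_for(mode: ContentMode) -> str:
    # Picked once per project, at SCORE time.
    return SCORING_PROMPTS[mode]
```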

What I'd do differently

Spec the contentMode from day 1. I built the scorer assuming "podcast = template", added the ASMR and Cinema modules as separate apps inside the same monorepo, then realized the clip scorer also needed per-mode logic. Three months later I refactored. If I'd started with "every input has a content type" as a first-class concept, the modularity would have been cleaner.

Wire observability earlier. Prometheus + Grafana came in around month 4. Until then I was debugging via docker logs and pure intuition. It's worth setting up before the first user: Prometheus + Loki on a spare VM is half a day's work.
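
For reference, instrumenting a Python worker with prometheus_client is only a few lines; the metric names here are illustrative:

```python
from prometheus_client import Counter, Histogram, start_http_server

JOBS_DONE = Counter("clipflow_jobs_total", "Jobs processed",
                    ["type", "status"])
JOB_SECONDS = Histogram("clipflow_job_seconds", "Job wall time", ["type"])

start_http_server(9100)  # exposes /metrics for Prometheus to scrape

def process(job_type: str, fields: dict) -> None:
    with JOB_SECONDS.labels(job_type).time():
        ...  # real work goes here
    JOBS_DONE.labels(job_type, "ok").inc()
```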

Don't underestimate FFmpeg ramp time. I knew FFmpeg basics; I didn't know filter_complex well. Should've earmarked a full week to just read the docs and write throwaway scripts before committing to the ASMR Studio mix architecture.

Numbers

  • Lines of code: ~25k TypeScript + ~8k Python
  • Self-hosted on: 1× Dell R620 (Proxmox VM, 8 core / 16 GB RAM)
  • Storage: ~120 GB on R2
  • Cost to run: ~$0.46/Cinema scene, ~$0.42/ASMR project (Veo is the dominant cost)
  • Build pipeline: GitLab CI in the same lab → docker compose on the same R620
  • Uptime since launch: 99.7% (one outage from a 2 AM kernel update)

Where to find it