feat: media-use v2 (heygen-CLI-only media OS) + non-destructive studio crop #1923

Open
miguel-heygen wants to merge 43 commits into main from feat/media-use-v2

Conversation

@miguel-heygen

Collaborator

What

Two related tracks, cleanly separable by path (can split on request):

media-use v2 (skills/media-use/)

Implements the provider-CLIs decisions: one skill for every media need, heygen-CLI-only, free-first.

  • Provider registry contract — each media type maps to an ordered provider cascade; first non-null result wins (deterministic, reproducible renders). The heygen CLI is the only external CLI media-use shells; no third-party CLIs, no --allow-paid.
  • Types: bgm, sfx, image, icon, voice (HeyGen TTS, free) + local brand (frame.md/design.md tokens with design-flow upsell).
  • Consolidation: hyperframes-media skill retired; the shared audio engine (TTS/BGM/SFX/captions) now lives under media-use/audio/. All catalog surfaces updated in lockstep (20 skills).
  • Cross-project reuse — content-addressed global cache (~/.media), auto-promote, cache-on-first determinism, resolve --from ingest (local files + direct public URLs only, SSRF-guarded), --local-only offline mode.
  • Media ops as guidance — ffmpeg/auto-editor/scenedetect recipes instead of bespoke verbs; outputs re-enter the ledger via resolve --from.
  • Local models — spec-gated opt-out fallback only (machine-capability probe; recommend the CLI when the machine can't run the model).
  • Studio Asset tab — in-use/unused filter + cross-project "all projects" view backed by a new /api/assets/global route.
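
The registry contract in the list above (ordered cascade, first non-null result wins, --local-only offline mode) could look roughly like the sketch below. The `registry` shape, the `network` field, and `runCascade` are hypothetical names chosen for illustration; the real providers presumably shell out to the heygen CLI asynchronously, which this synchronous sketch leaves out.

```javascript
// Hypothetical registry: each media type maps to an ordered provider list.
// Provider names and the `network` field are illustrative, not the PR's schema.
const registry = {
  icon: [
    { name: "cache", network: false, resolve: (q) => (q === "star" ? { path: "~/.media/star.svg" } : null) },
    { name: "heygen", network: true, resolve: (q) => ({ path: `heygen:${q}` }) },
  ],
};

// Try providers in declared order; the first non-null result wins.
function runCascade(type, query, { localOnly = false } = {}) {
  for (const provider of registry[type] ?? []) {
    if (localOnly && provider.network) continue; // --local-only skips every network provider
    const result = provider.resolve(query);
    if (result != null) return { provider: provider.name, ...result };
  }
  return null; // a miss: no provider produced a result
}
```

Because the order is fixed in the table and nothing is randomized, the same query against the same cache state always lands on the same provider, which is what makes renders reproducible.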

Non-destructive Studio crop (packages/studio/, decision OP2)

Crop any element visually; persists as clip-path: inset(...) through the normal style-commit path (one undo step per drag, original media untouched).

  • Enter via canvas-toolbar Crop button, the Clip panel, or double-click on a selected element; Escape or click-outside exits.
  • Pro-editor presentation: full content stays visible while cropping, cropped-out region dimmed, crop frame + edge handles sit on the crop lines; drag the frame to move the whole crop window.
  • Every overlay surface respects the crop: selection box hug, hover ring, rotate/resize handles, snap lines (both as source and target), marquee hit-testing, off-canvas indicators — via one shared toVisibleOverlayRect primitive.

Testing

  • media-use: 47 registry/coverage tests + 17 audio-engine tests (node --test)
  • studio: 1331 tests incl. new crop geometry/panel/overlay coverage
  • studio-server: 223 tests incl. globalAssets route + 4-side clip-path round-trip
  • Crop UX verified end-to-end in a live studio session (select → double-click → drag edges/window → Escape → undo/redo)

Not in scope (parked by decision)

  • Side-by-side local-vs-HeyGen comparison assets (blocked on CDN location)
  • TTS eval re-run (fish-speech vs Kokoro)
  • Video generation / avatars, platform downloading, user-overridable provider order

Ordered, capability-based registry — search/generate/process slots, tried
deterministically with heygen-CLI first. Disabled slots (fal aggregator,
Iconify, voice TTS, local) carry their gate reason (B-Q1/B-Q2/U2) so enabling
them later is a flag flip, not a refactor. resolve cascade now registry-driven;
providers.mjs kept as a back-compat shim. Regenerated skills-manifest.
specs.mjs probes CPU/RAM/GPU/VRAM (Apple Silicon unified memory + nvidia-smi),
injectable for tests. local-models.mjs is a declarative table of user-installed
models per capability+tier (tts/asr/upscale) with size/needs/install/invoke;
selectModel picks the highest runnable tier or recommends the CLI path. No
license gate — models are user-installed, local-use-only.
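
A minimal sketch of the tier selection described above, assuming a table keyed by capability and a specs object with `ramGb` and `vramGb` fields. The model names, the `needs` fields, and the result shape are placeholders, not the contents of local-models.mjs.

```javascript
// Hypothetical table entries; model names and requirements are placeholders.
const LOCAL_MODELS = {
  tts: [
    { tier: 1, name: "small-tts", needs: { ramGb: 4, vramGb: 0 } },
    { tier: 2, name: "large-tts", needs: { ramGb: 16, vramGb: 8 } },
  ],
};

// Pick the highest tier the machine can run, or recommend the CLI path.
function selectModel(capability, specs, table = LOCAL_MODELS) {
  const runnable = (table[capability] ?? [])
    .filter((m) => specs.ramGb >= m.needs.ramGb && specs.vramGb >= m.needs.vramGb)
    .sort((a, b) => b.tier - a.tier);
  if (runnable.length === 0) {
    return { kind: "recommend-cli", reason: `no local ${capability} model fits this machine` };
  }
  return { kind: "local", model: runnable[0] };
}
```

Passing the specs in as an argument mirrors the "injectable for tests" note: the probe can be replaced by a literal object.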
fal aggregator + ElevenLabs/HeyGen-TTS voice providers built as real CLI
adapters, registered default-OFF and flipped on by env flag (MEDIA_USE_ENABLE_*)
read at call time. Bin's answer becomes a flag, not a code change. Tests prove
off-by-default + flip-on ordering (heygen stays first).
runLocalModel(capability) picks the best tier for the machine (selectModel),
checks the tool is on PATH, fills + runs its invoke template, returns the output
path — or a clear install/CLI recommendation when not runnable. Injectable
which/exec; never throws. Consumed by the process verb.
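
The "injectable which/exec; never throws" contract might be sketched as below. This is simplified: it takes an already-selected model instead of a capability, and the `tool`, `install`, and `invoke` fields plus the result shape are assumptions.

```javascript
// Fill "{placeholders}" in an invoke template; unknown keys are left untouched.
function fillTemplate(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, key) => (key in values ? String(values[key]) : match));
}

// `which` and `exec` are injected so the function can be tested without
// touching the real machine. Every failure becomes a recommendation.
function runLocalModel(model, values, { which, exec }) {
  try {
    if (!which(model.tool)) {
      return { ok: false, recommendation: `install with: ${model.install}` };
    }
    exec(fillTemplate(model.invoke, values));
    return { ok: true, outputPath: values.output };
  } catch (err) {
    return { ok: false, recommendation: `local run failed (${err.message}); use the heygen CLI path instead` };
  }
}
```

A production version would more likely pass an argument array to the process launcher than a single command string, so that user text cannot be reinterpreted by a shell.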
…ion (#11, #16)

Auto-promote every fetched asset into the global content-addressed cache so it's
reusable across all hyperframes projects (cachePut now dedups by sha; resolve
promotes on register, non-fatal). search.mjs: paginated text search (no vector
DB). usage.mjs: tag/partition assets by in-use vs unused (scans compositions) —
powers the Studio Asset-tab filter.
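
As an illustration of "dedups by sha" and "paginated text search (no vector DB)", here is a small in-memory sketch. The real cache is on disk under ~/.media; the record fields (`sha`, `type`, `label`) and both function signatures are assumptions.

```javascript
// In-memory stand-in for the global manifest.
function cachePut(manifest, record) {
  const existing = manifest.find((r) => r.sha === record.sha);
  if (existing) return { record: existing, added: false }; // same bytes already cached
  manifest.push(record);
  return { record, added: true };
}

// Plain substring match over label and type, returned one page at a time.
function searchAssets(manifest, text, { page = 1, pageSize = 20 } = {}) {
  const needle = text.trim().toLowerCase();
  const hits = manifest.filter((r) => `${r.label ?? ""} ${r.type ?? ""}`.toLowerCase().includes(needle));
  const start = (page - 1) * pageSize;
  return { total: hits.length, page, items: hits.slice(start, start + pageSize) };
}
```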
resolve --from <file|direct-url> freezes a user-supplied asset + registers +
auto-promotes it. isDirectMediaUrl rejects platform pages (no yt-dlp). brand
resolve with no frame.md/design.md now upsells the HyperFrames design flow
instead of a generic miss.
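
A hedged sketch of what a direct-URL check could look like, covering the platform-page rejection described here and the loopback/private/link-local rejection listed under m11. The extension allowlist and the `isPrivateHost` helper are assumptions; the PR's actual rules are not shown.

```javascript
// Assumed allowlist: a "direct" URL is one whose path ends in a media extension.
const MEDIA_EXTENSIONS = [".mp3", ".wav", ".mp4", ".webm", ".png", ".jpg", ".jpeg", ".svg"];

function isPrivateHost(hostname) {
  const host = hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (host.startsWith("[")) {
    // URL hostnames wrap IPv6 literals in brackets.
    const v6 = host.slice(1, -1);
    return v6 === "::1" || v6.startsWith("fe80:") || /^f[cd]/.test(v6);
  }
  const m = host.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return a === 0 || a === 10 || a === 127 || (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) || (a === 169 && b === 254);
}

function isDirectMediaUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return false; // not a URL at all
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return false;
  if (isPrivateHost(url.hostname)) return false;
  const path = url.pathname.toLowerCase();
  return MEDIA_EXTENSIONS.some((ext) => path.endsWith(ext));
}
```

This only inspects the literal hostname. A name that resolves to a private address through DNS would need a check at connection time, which is outside this sketch.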
references/operations.md: local-tool recipes (ffmpeg cut/reframe/montage,
auto-editor, scenedetect) + local-vs-HeyGen transform table (bg-removal/upscale/
lipsync/translate) with side-by-side guidance. Outputs register via --from
(auto-promoted). Per OP1, media-use guides to the tools rather than re-wrapping
ffmpeg/heygen as bespoke verbs.
Adds 'In use' / 'Unused' filter chips alongside the category filters, reusing the
existing usedPaths (composition references) the in-use badge + used-first sort
already compute. filterByUsage/countUsage extracted as pure functions + unit
tested (6 vitest). Chips show only on a real used/unused split (no misleading 0s).
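
The pure helpers named above might look like the following. The signatures are guesses (the source only names `filterByUsage` and `countUsage`), and `shouldShowUsageChips` is a hypothetical name for the "real split" rule.

```javascript
// usedPaths is assumed to be a Set of project-relative asset paths.
function countUsage(assetPaths, usedPaths) {
  let used = 0;
  for (const p of assetPaths) if (usedPaths.has(p)) used += 1;
  return { used, unused: assetPaths.length - used };
}

function filterByUsage(assetPaths, usedPaths, mode) {
  if (mode === "used") return assetPaths.filter((p) => usedPaths.has(p));
  if (mode === "unused") return assetPaths.filter((p) => !usedPaths.has(p));
  return assetPaths; // any other mode means "all"
}

// Chips appear only on a real split, so neither chip ever shows a 0 count.
function shouldShowUsageChips({ used, unused }) {
  return used > 0 && unused > 0;
}
```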

Verified in agent-browser: Studio boots, project loads, Asset tab renders + filter
path runs. Note: chip display binds to usePlayerStore.elements (same source as the
existing badge); that store was empty in a synthetic dev project where the timeline
is driven by the iframe __clipManifest — confirm/realign usedPaths source in real
studio usage (follow-up; affects badge+sort+filter equally).
deriveUsedPaths extracted + hardened: element src is the raw authored value, so
it can be relative (assets/x.png), ./-prefixed, the served /api/.../preview/ form,
or carry a ?query — normalize all to the bare project path so it matches the
asset-list entries (the in-use badge / sort / filter all key on this). The prior
inline strip only handled the served form. Source stays usePlayerStore.elements
(correct — hydrated from the timeline/__clipManifest); not switched. +2 vitest.
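
The normalization described above can be sketched as follows. The exact served route is elided in the source, so this sketch only assumes it contains a `/preview/` segment followed by the project path; `normalizeAssetPath` is a hypothetical helper name.

```javascript
// Reduce any authored src form to the bare project-relative path.
function normalizeAssetPath(src) {
  let path = src.split(/[?#]/)[0]; // drop ?query and #hash
  const marker = "/preview/"; // assumed marker for the served form
  const idx = path.indexOf(marker);
  if (idx !== -1) path = path.slice(idx + marker.length);
  return path.replace(/^(\.\/)+/, "").replace(/^\/+/, "");
}

// Collect the normalized src of every element that has one.
function deriveUsedPaths(elements) {
  const used = new Set();
  for (const el of elements) {
    if (typeof el.src === "string" && el.src !== "") used.add(normalizeAssetPath(el.src));
  }
  return used;
}
```

Since the badge, the sort, and the filter all key on the same set, normalizing once at derivation time keeps the three in agreement.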
Premise: Bin approved B-Q1 (voice/TTS) + B-Q2 (third-party CLIs). fal (bgm/sfx/
image generate) and voice (ElevenLabs + heygen tts) flip from gated to live,
marked paid. Cost guard (X4) ships with them: runProviders skips paid providers
unless ctx.allowPaid; resolve gains --allow-paid (opt-in) + --local-only (block).
Free heygen catalog stays first. Iconify stays gated (not named). SKILL providers
table + flags documented. Tests updated (43 green). Skill consolidation/deletion
(rest of #15) deliberately NOT done — separate, not implied by approval.
Move the audio engine (audio.mjs + lib + heygen-tts + wait-bgm + lyria-recipe +
bundled SFX + references) as a self-contained subtree into skills/media-use/audio/
via git mv — zero internal path edits (HERE/../assets/sfx + lyria offsets
preserved). CLI primitives still wrapped, not moved. Smoke test proves the
bundled SFX library resolves from the new location. hyperframes-media now holds
only SKILL.md (retired in U22).
…e/audio (U21)

faceless-explainer / pr-to-video / product-launch-video wrappers: DEFAULT_ENGINE
→ media-use/audio/scripts/audio.mjs (engine resolves from each, verified). Fixed
the relocated engine docs' stale self-paths (tts.md) + an inaccurate attribution
comment. Grep gate: no skills/ reference to hyperframes-media/scripts remains
(except general-video doc routing, swept in U22).
Delete the hyperframes-media skill (engine relocated in U20). Sweep all 32
referencing files → media-use: engine paths → media-use/audio/scripts, reference
links → media-use/audio/references, skill-name/routing mentions → /media-use.
Catalog dedup: drop the duplicate audio row from README/CLAUDE/skills.mdx (the
single media-use row now covers resolve + audio), router + hyperframes-core
domain list point to media-use, skill count 19→18. Manifest regenerated (18,
hyperframes-media gone). Grep gate clean (only plan docs reference it as history).
dist/ artifacts regenerate from source on build.
Merge the two /media-use rows in the hyperframes router capability map into one;
fix the remaining 'all 19' install-count mentions in README/CLAUDE/skills.mdx →
18. Follow-on cleanup from the hyperframes-media retirement rename.
Frontmatter reframed to the full media OS (resolve/generate/operate/remember,
incl. voice + the audio engine). New Audio engine section documents
audio/scripts/audio.mjs, the request/meta contract, the auto-degrade switch, and
the relocated per-topic references (tts/bgm/sfx/transcribe/remove-background/
captions). A reader can now do everything hyperframes-media offered from
media-use alone.
…U24)

Non-project-scoped GET /api/assets/global reads only ~/.media/manifest.jsonl and
returns reusable records — the data behind the Studio Asset tab's cross-project
view. Skips malformed lines (torn write won't 500). Registered in createStudioApi
alongside the other route groups. 3 vitest. AssetsTab toggle next.
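
A minimal sketch of the tolerant manifest read, assuming one JSON object per line. The field names in `toPublicRecord` (`sha`, `type`, `label`) are assumptions; the point is that a torn line is skipped rather than failing the request, and that only an explicit set of fields is projected to the browser.

```javascript
// Parse JSONL, skipping blank, malformed, or non-object lines.
function parseManifestLines(text) {
  const records = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const parsed = JSON.parse(trimmed);
      if (parsed && typeof parsed === "object" && !Array.isArray(parsed)) records.push(parsed);
    } catch {
      // torn or partial write: skip the line rather than fail the whole read
    }
  }
  return records;
}

// Project to public fields only, so no absolute cache path leaves the server.
function toPublicRecord({ sha, type, label }) {
  return { sha, type, label };
}
```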
SKILL.md gains a 'What it owns' matrix mapping each HyperFrames media gap to its
media-use entrypoint. coverage.test.mjs enforces every row — image/icon
providers, voice + audio engine, consolidated engine + bundled SFX, ops
reference, global cache + ingest, local-model tiers, every resolve type has a
provider — so a claim can't silently rot. 7 tests.
This project / All projects toggle in the Asset tab. The global view is a
self-contained GlobalAssetsView component (owns its /api/assets/global fetch +
render) — keeps AssetsTab under the file-size cap and lowers its complexity.
Lists the reusable cross-project cache (type + label); category/usage chips are
local-only; search filters both. globalAssetRows pure helper + 3 vitest.
Verified live in agent-browser: 'All projects' shows assets from ~/.media.
…cal-only

M2: heygen.tts is free and first in the voice cascade; ElevenLabs is the
paid fallback, so voice resolves by default without --allow-paid.
M3: --local-only now skips every network provider, not just paid ones.
…m API

M1: freezeUrl streams the body and aborts at the 256MB cap (and checks
content-length first), so a hostile URL can't buffer past the cap into memory.
m11: isDirectMediaUrl rejects loopback/private/link-local hosts on --from ingest.
m13: /assets/global projects to public fields, no absolute cached_path to the browser.
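
The M1 cap can be illustrated with a synchronous stand-in: the declared length is checked first, then the running total is checked on every chunk, so reading stops at the chunk that crosses the cap. `collectWithCap` is a hypothetical name, and the real freezeUrl presumably iterates an async response body instead of a plain iterable.

```javascript
// Accumulate chunks, aborting as soon as the running total exceeds the cap.
function collectWithCap(chunks, capBytes, declaredLength) {
  if (declaredLength != null && declaredLength > capBytes) {
    throw new Error(`content-length ${declaredLength} exceeds cap ${capBytes}`);
  }
  const kept = [];
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.length;
    if (total > capBytes) throw new Error(`body exceeded cap ${capBytes}`);
    kept.push(chunk);
  }
  return Buffer.concat(kept, total);
}
```

The per-chunk check matters because a content-length header can be absent or wrong; the header check is only a cheap early exit.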
- voice cascade in SKILL now reflects free HeyGen TTS first, ElevenLabs paid fallback
- --local-only documented as 'skip every network provider', not just paid
- drop the nonexistent 'organize --promote' CLI; promotion is automatic
- point the local-model row at the actual runner (local-run.mjs) without overclaiming auto-cascade
- operations.md: clip-path crop is a composition technique, not a shipped Studio feature
- SFX library is 19 files, not 21; usage.mjs no longer claims it powers the Studio filter
- clarify the faceless/pr-to-video audio adapters intentionally reuse the product-launch model
The voice provider guessed 'heygen voice tts --text', which the CLI rejects.
Real command (verified against heygen v0.1.6 --help / --response-schema):
'heygen voice speech create --text <t> --voice-id <id>' returning data.audio_url
+ data.duration. Default voice-id = first starfish-engine voice (deterministic,
overridable via ctx.voiceId). Verified live: a default voice resolve now produces
a real audio file with no --allow-paid. Registry header no longer calls fal/voice
flag-gated seams — they are live; Iconify is the only remaining gated provider.
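
For reference, the corrected invocation could be assembled and its response read as below. The argument order follows the command quoted above, and `data.audio_url` / `data.duration` come from the same message; that the CLI prints this JSON on stdout, and both helper names, are assumptions.

```javascript
// Build the argument list for `heygen voice speech create`.
function buildSpeechArgs(text, voiceId) {
  return ["voice", "speech", "create", "--text", text, "--voice-id", voiceId];
}

// Read the fields the commit message names; return null on anything unexpected.
function parseSpeechResponse(stdout) {
  try {
    const { data } = JSON.parse(stdout);
    if (!data || typeof data.audio_url !== "string") return null;
    return { audioUrl: data.audio_url, duration: data.duration ?? null };
  } catch {
    return null; // non-JSON output or an unexpected shape
  }
}
```

Keeping the arguments as an array rather than one string avoids quoting problems when the text contains spaces or quotes.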
Verified against the official docs (not guessed --help):
- fal: use the genmedia CLI (genmedia run <model> --prompt .. --download .. --json
  -> downloaded_files[].path/url), not 'fal run --input'. The pip 'fal' is the
  serverless-deploy CLI and can't run hosted models.
- ElevenLabs: the official @elevenlabs/cli is agents-only (no TTS). Use the
  community elevenlabs-cli (tts "<text>" --output <file>): positional text, file
  output. Both download a local file, returned as localPath for freezing.
SKILL providers section now names the correct binaries + install/auth. Live
generation still needs FAL_KEY / ELEVENLABS_API_KEY (absent here).
…mes CLI

The 'When to use' line credited transcription and background removal to the
audio engine; they actually come from the hyperframes CLI (transcribe,
remove-background), as the audio-engine section already states. The engine
itself does TTS / BGM / SFX / caption timing.
- Cost: paid generation is the agent's call, no user-confirmation prompt
  (free-first still tries cache + heygen before anything paid).
- Local models: heygen free-usage is the default (TTS, bg-removal via the
  heygen CLI); local open-source is the opt-out fallback only ('if user no,
  then local'), not the headline answer. Coverage test description updated to
  match while still asserting the fallback table is populated.
…nd --allow-paid

Remove the third-party CLI adapters (fal via genmedia, ElevenLabs via
elevenlabs-cli) and the paid-provider cost guard entirely; the heygen
CLI is the only external CLI media-use shells. --local-only stays. A
guard test asserts no fal/elevenlabs provider can silently return.
With fal/ElevenLabs gone and Iconify never enabled, every provider is
always-on heygen (or the local design spec) — remove the envFlag/gated
enablement layer entirely. The guard test now asserts only heygen /
design_spec providers exist.
Crop a selected element visually: a Crop toggle in the Clip panel arms
edge handles on the overlay; dragging live-previews clip-path inset on
the iframe element and commits once through the existing inline-style
patch path (undo/persist like any style edit). Per-side T/R/B/L inset
fields join the Clip section; radius is preserved, insets clamped.
Render-time only: the media file is untouched.

Clip-path helpers move to clipPathHelpers.ts (re-exported) with 1- and
4-value inset parse/format support.
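
A sketch of the clamping and formatting side of such helpers, assuming insets are percentages of the element box (the unit the PR uses is not stated). `clampInsets` and `formatInset` are illustrative names, not the exports of clipPathHelpers.ts.

```javascript
// Keep each side in 0..100 and make sure opposing sides never cross.
function clampInsets({ top, right, bottom, left }) {
  const clamp = (v) => Math.min(100, Math.max(0, v));
  const t = clamp(top);
  const l = clamp(left);
  let b = clamp(bottom);
  let r = clamp(right);
  if (t + b > 100) b = 100 - t;
  if (l + r > 100) r = 100 - l;
  return { top: t, right: r, bottom: b, left: l };
}

// Emit the 1-value form when all sides match, else the 4-value form.
// An existing radius is carried through unchanged.
function formatInset(insets, radius) {
  const { top, right, bottom, left } = clampInsets(insets);
  const allEqual = top === right && right === bottom && bottom === left;
  const sides = allEqual ? `${top}%` : `${top}% ${right}% ${bottom}% ${left}%`;
  return radius ? `inset(${sides} round ${radius})` : `inset(${sides})`;
}
```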
…-line handles, dim, hug

Crop mode now behaves like an NLE crop tool:
- Canvas toolbar gets a Crop button (shown whenever a croppable element
  is selected); crop mode lives in the player store so the toolbar, the
  Clip panel, and the overlay share one switch.
- Double-click a selected element to enter crop mode (timestamp-based:
  selection re-keys the box on every click, so native dblclick never
  fires); Escape exits.
- While cropping, the element's clip is lifted so the full content stays
  visible; the cropped-out region is dimmed and a crop frame with edge
  handles sits ON the crop lines (handles track the insets as you drag).
- Drag end commits through the normal style path (one undo step per
  drag); leaving crop mode re-applies the committed crop.
- Outside crop mode the selection box hugs the visible cropped region
  instead of the full element bounds.
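
The timestamp-based double-click mentioned above could be sketched like this. The 400 ms window and the `createDoubleClickDetector` name are assumptions; the idea is to compare click times in state that survives the re-keyed selection box, since the native dblclick event never fires.

```javascript
// Returns a function that reports whether a click completes a double-click.
function createDoubleClickDetector(thresholdMs = 400) {
  let lastId = null;
  let lastTime = -Infinity;
  return function isDoubleClick(elementId, now) {
    const hit = elementId === lastId && now - lastTime <= thresholdMs;
    // After a hit, reset so a third quick click starts a fresh pair.
    lastId = hit ? null : elementId;
    lastTime = hit ? -Infinity : now;
    return hit;
  };
}
```

In a pointer handler the second argument would typically be the event timestamp or `performance.now()`.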
The crop frame is now a drag surface: dragging it slides the whole
visible region — opposing insets shift together so the crop size stays
constant, clamped inside the element bounds. Same commit path as the
edge handles (one undo step per drag).
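
The "opposing insets shift together" rule reduces to a small amount of arithmetic. In this sketch the insets and the drag delta share one unit (percent of the element box is assumed), and `moveCropWindow` is a hypothetical name.

```javascript
// Slide the crop window by (dx, dy) without changing its size.
function moveCropWindow({ top, right, bottom, left }, dx, dy) {
  // Clamp the delta so no inset goes negative; the window stays inside the element.
  const cdx = Math.min(right, Math.max(-left, dx));
  const cdy = Math.min(bottom, Math.max(-top, dy));
  return { top: top + cdy, right: right - cdx, bottom: bottom - cdy, left: left + cdx };
}
```

Because the same clamped delta is added to one side and subtracted from the other, `left + right` and `top + bottom` are unchanged, which is exactly the constant-size property.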
…om the visible region

Box-originated gestures (drag/resize/rotate) took their origin from the
full-bounds overlayRect and wrote box.style directly, so after a drag
the box sat at the element origin instead of hugging the crop. Pass the
hugged rect (full OverlayRect shape) as the gesture's box geometry —
the box now tracks the visible region through the whole gesture, and
rotation centers on what's actually on screen.
…of moving it

Gestures write the selection box's position directly during drags, so
rendering the box at the hugged (cropped) rect made two writers disagree:
after any drag the box parked at the full-bounds position (or shrank
twice mid-drag). Keep the box at the element's full bounds — the basis
the gesture machinery owns — and render the hug as the element's inset
clip-path scaled into overlay space and applied to the box itself (the
pre-existing pattern for clipped elements). The resize handle shifts to
the visible region's corner so the clip doesn't swallow it.
…, rotate handle anchors to crop

- The clip-on-box hug swallowed the box border everywhere the crop edge
  didn't touch the element edge (a heavy crop showed no outline at all).
  Drop the clip: the box stays border-less at full bounds and a child
  div draws the outline ON the crop boundary.
- Clicking outside the element while crop mode is armed exits crop mode
  (selection kept) — crop UI swallows its own pointerdowns, so any
  pointerdown reaching the overlay is an outside click.
- The rotate handle (extracted to DomEditRotateHandle) anchors to the
  crop outline instead of floating above the invisible full bounds.
The hover outline drew the element's full bounds, ignoring the crop.
Shrink it with the same inset hug the selection box uses (display-only
rect, nothing writes back to it).
Add toVisibleOverlayRect (projected rect shrunk by the element's inset
crop) and use it wherever the overlay reasons about what's ON SCREEN:

- snap targets: other elements' guides align to their visible edges
- snap moving rect: dragging a cropped element snaps its visible edges
- marquee: hit-test against the visible region, not cropped-out space
- child dashes + group item rects (display; group commits use dx/dy)
- off-canvas indicators: a crop that keeps the visible part on-canvas
  no longer flags the element as off-canvas

The selection box keeps the full rect — it is the gesture coordinate
basis; its hug is the crop-outline child.
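
A rough sketch of the shared primitive and one consumer (marquee hit-testing). It assumes an axis-aligned rect of the form `{left, top, width, height}` in overlay pixels and insets expressed as fractions of the element box; rotation is ignored, and the real OverlayRect shape may differ.

```javascript
// Shrink a projected rect by the element's inset crop.
function toVisibleOverlayRect(rect, insets) {
  if (!insets) return rect; // no crop: the visible region is the full rect
  return {
    left: rect.left + rect.width * insets.left,
    top: rect.top + rect.height * insets.top,
    width: Math.max(0, rect.width * (1 - insets.left - insets.right)),
    height: Math.max(0, rect.height * (1 - insets.top - insets.bottom)),
  };
}

// Axis-aligned overlap test, as a marquee hit-test might use.
function intersects(a, b) {
  return a.left < b.left + b.width && b.left < a.left + a.width &&
    a.top < b.top + b.height && b.top < a.top + a.height;
}
```

With this in place a marquee dragged only over the cropped-out strip of an element overlaps its full rect but not its visible rect, so it no longer selects the element.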
@mintlify

mintlify Bot commented Jul 4, 2026


Preview deployment for your docs. Learn more about Mintlify Previews.

Project: hyperframes
Status: 🟢 Ready
Preview: View Preview
Updated (UTC): Jul 4, 2026, 2:28 AM


@miguel-heygen changed the title from "media-use v2: the media OS (heygen-CLI-only) + non-destructive Studio crop" to "feat: media-use v2 (heygen-CLI-only media OS) + non-destructive studio crop" on Jul 4, 2026
Comment thread packages/studio/src/components/editor/clipPathHelpers.ts Fixed
Comment thread packages/studio/src/components/editor/clipPathHelpers.ts Fixed
mkdirSync(dirname(destPath), { recursive: true });
writeFileSync(destPath, bytes);
return bytes.length;
writeFileSync(destPath, Buffer.concat(chunks, total));
Two joins of capabilities the framework already ships:

- transcript-cut: edit footage by editing its transcript. Word-level
  timestamps (hyperframes transcribe) + agent-decided removals (explicit
  ranges, word indices, filler words, long silences, or inverse --keep)
  compile into frame-accurate ffmpeg segments + concat; --plan dry-run,
  --copy fast mode. Output re-enters the ledger via resolve --from.
- audio-duck: BGM ducking as declaration, not baking. Speech spans from
  audio_meta.json / transcribe.json word timestamps become a ready GSAP
  volume-tween block (attack/release shaped, sentence gaps bridged); the
  runtime already renders volume tweens identically in preview + render.
- operations.md: text-based editing, bake-for-export sidechain duck, and
  two-pass loudnorm publish-loudness recipes.
- SKILL.md gaps table + coverage test extended for both claims.

Pure cores (compileCutList, speechSpans/duckKeyframes) are unit-tested;
e2e cut verified by ffprobe (6s clip minus 1s cut = 5.02s output).
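
The core of transcript-cut is turning removal ranges into the segments to keep. A minimal sketch under assumed shapes (`{start, end}` in seconds) follows; the real compileCutList signature is not shown in the source, and frame snapping and the ffmpeg step are left out.

```javascript
// removals: [{start, end}] in seconds. Returns the segments to KEEP, in order.
function compileCutList(duration, removals) {
  const sorted = removals
    .map(({ start, end }) => ({ start: Math.max(0, start), end: Math.min(duration, end) }))
    .filter((r) => r.end > r.start)
    .sort((a, b) => a.start - b.start);
  const keep = [];
  let cursor = 0;
  for (const r of sorted) {
    if (r.start > cursor) keep.push({ start: cursor, end: r.start });
    cursor = Math.max(cursor, r.end); // overlapping removals merge here
  }
  if (cursor < duration) keep.push({ start: cursor, end: duration });
  return keep;
}
```

For the 6 s clip with a 1 s removal quoted above, a cut at 2 to 3 s would leave two keep segments totalling 5 s of source material; the 5.02 s figure is what the commit reports from ffprobe on the encoded output.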
…rsing

CodeQL flagged the inset() matcher (nested optional whitespace around a
lazy group). Match the parenthesized payload unambiguously, trim, and
collapse whitespace before the round-radius split.
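
One way to parse along the lines described, as a sketch and not the code in clipPathHelpers.ts: capture everything up to the first closing parenthesis with a negated character class (which has only one way to match), then trim, collapse whitespace, and split off the radius. A radius that itself contains parentheses, such as a calc() value, is not handled here.

```javascript
// Parse the 1- and 4-value forms of inset(); return null for anything else.
function parseInset(clipPath) {
  const match = /inset\(([^)]*)\)/.exec(clipPath);
  if (!match) return null;
  const payload = match[1].trim().replace(/\s+/g, " ");
  if (payload === "") return null;
  const [sidesPart, radiusPart] = payload.split(" round ");
  const sides = sidesPart.split(" ");
  if (sides.length !== 1 && sides.length !== 4) return null;
  // With one value, the remaining sides default to it.
  const [top, right = top, bottom = top, left = top] = sides;
  return { top, right, bottom, left, radius: radiusPart ?? null };
}
```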

2 participants