Skip to content

Faster export pipeline with parallel IO, cached offsets, and CPU RGBA→NV12#1786

Open
richiemcilroy wants to merge 17 commits intomainfrom
rendering-bits
Open

Faster export pipeline with parallel IO, cached offsets, and CPU RGBA→NV12#1786
richiemcilroy wants to merge 17 commits intomainfrom
rendering-bits

Conversation

@richiemcilroy
Copy link
Copy Markdown
Member

@richiemcilroy richiemcilroy commented May 8, 2026

Improves desktop export and rendering throughput by overlapping independent setup work, reducing redundant CPU work on hot paths, and tightening a few correctness edges around FFmpeg audio/resampling and GPU readback sequencing.

Greptile Summary

This PR improves the desktop export/rendering pipeline through several independent optimisations: parallelising audio/decoder setup I/O, replacing per-frame O(n) clips.iter().find() lookups with cached index maps, swapping the FFmpeg swscale RGBA→NV12 path for a hand-written CPU implementation (with optional rayon parallelism above 1080p), and reordering GPU readback submission to overlap with waiting for the previous frame.

  • Parallel IO: create_segments and RenderVideoConstants::new are now joined with tokio::try_join!, and multiple-segment audio loading is parallelised with try_join_all and rayon::par_iter.
  • Cached clip offsets: Both render paths (render_video_to_channel / _nv12) now build a Vec<ClipOffsets> once before the frame loop, eliminating repeated clips.iter().find() calls per frame.
  • CPU RGBA→NV12: A new cpu_yuv::rgba_to_nv12_fast replaces the swscale context creation per frame, using rayon row-level parallelism above 1080p; the conversion function is also centralised to remove two separate inline copies.
  • Audio renderer: gain is now pre-converted to linear scale, the inner per-frame mixing loop is restructured for better cache locality, and timeline_cursor caches its walk state to avoid re-scanning segments each call.

Confidence Score: 4/5

Safe to merge for most platforms; the removal of the Windows-only sequential decoder init guard deserves a macOS smoke-test before shipping.

The core optimisations are well-tested with baseline-matching unit tests. Two areas warrant attention: concurrent decoder init is now unconditional on all platforms (original rationale for the Windows-only guard is undocumented), and the duplicated _into/_to_slice render methods require parallel maintenance.

crates/rendering/src/lib.rs (removed platform guard for decoder init), crates/editor/src/audio.rs (duplicated _into/_to_slice render methods)

Important Files Changed

Filename Overview
crates/audio/src/audio_data.rs Simplified resampler flush loop: all intermediate per-packet flushes removed, replaced by a single final drain. Functionally equivalent for from_file but relies on the final flush contract.
crates/audio/src/renderer.rs Refactored mixing loop for cache-locality: per-track bulk iteration replaces per-sample cross-track loop; gain field renamed to linear_gain (pre-converted); valid_output_range helper correctly handles negative track offsets; well-tested by baseline-matching tests.
crates/editor/src/audio.rs Added _to_slice variants of render methods to avoid allocations; introduced cached clip-offsets with ClipOffsets::default() fallback; timeline_cursor now caches segment walk state. Duplicated logic between _into and _to_slice paths creates maintenance risk.
crates/enc-ffmpeg/src/audio/buffered_resampler.rs Resampler is now skipped when input/output formats already match; passthrough path correctly updates min_next_pts and the top-level PTS adjustment guard still applies.
crates/export/src/mp4.rs RGBA→NV12 via swscale replaced by cpu_yuv::rgba_to_nv12_fast; audio render moved after video encode-queue call; channel capacity bumped from 4 to 8; fill_nv12_frame_direct tightened to avoid double borrow of frame data.
crates/rendering/src/cpu_yuv.rs New rgba_to_nv12_fast function with scalar and rayon-parallel paths; BT.601 coefficients correct; validation guard returns false without writing on invalid input; baseline-matching tests added.
crates/rendering/src/frame_pipeline.rs GPU readback submission now precedes the wait on the previous frame, enabling better pipelining; error paths correctly cancel the newly-submitted readback before returning.
crates/rendering/src/lib.rs Clip-offsets now cached as a Vec before the render loop; upfront full zoom precompute removed in favour of per-frame lazy calls (correctly placed at lines 973/615). Platform guard for sequential decoder init removed; original rationale undocumented.
crates/rendering/src/project_recordings.rs Segment metadata loading parallelised with rayon::par_iter; RefCell is created inside each closure so Send is satisfied; Rayon preserves Vec order on collect.
crates/rendering/src/zoom_focus_interpolation.rs Tests only added; lazy_precompute_matches_full_precompute validates that per-frame ensure_precomputed_until produces identical results to upfront full precompute.

Comments Outside Diff (2)

  1. crates/rendering/src/lib.rs, line 312-320 (link)

    P2 Platform-specific sequential init removed without explanation

    The original code used #[cfg(target_os = "windows")] to run screen_future and camera_future concurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: crates/rendering/src/lib.rs
    Line: 312-320
    
    Comment:
    **Platform-specific sequential init removed without explanation**
    
    The original code used `#[cfg(target_os = "windows")]` to run `screen_future` and `camera_future` concurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.
    
    How can I resolve this? If you propose a fix, please make it concise.
  2. crates/audio/src/audio_data.rs, line 71-101 (link)

    P2 Per-packet resampler flush removed; final flush must drain everything

    The old code unconditionally flushed the resampler after each packet's receive_frame loop (and again after the conditional inner flush when run() signalled a delay). The new code removes all intermediate flushes and relies entirely on the final loop { resampler.flush() } after send_eof. For the from_file use case this is fine because the final flush loop runs until resample_delay is None, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics of flush returning None, audio data could silently be truncated.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: crates/audio/src/audio_data.rs
    Line: 71-101
    
    Comment:
    **Per-packet resampler flush removed; final flush must drain everything**
    
    The old code unconditionally flushed the resampler after each packet's `receive_frame` loop (and again after the conditional inner flush when `run()` signalled a delay). The new code removes all intermediate flushes and relies entirely on the final `loop { resampler.flush() }` after `send_eof`. For the `from_file` use case this is fine because the final flush loop runs until `resample_delay` is `None`, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics of `flush` returning `None`, audio data could silently be truncated.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
crates/rendering/src/lib.rs:312-320
**Platform-specific sequential init removed without explanation**

The original code used `#[cfg(target_os = "windows")]` to run `screen_future` and `camera_future` concurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.

### Issue 2 of 3
crates/editor/src/audio.rs:277-340
**Duplicated timeline/linear render logic across `_into` and `_to_slice` variants**

`render_timeline_frame_into` and `render_timeline_frame_to_slice` are nearly identical: both walk the same timeline-cursor state machine and call `render_current_chunk`. The only structural difference is that the `_into` variant resizes a `Vec<f32>` while the `_to_slice` variant takes a pre-sized `&mut [f32]`. The same duplication exists between `render_linear_frame_into` and `render_linear_frame_to_slice`. Any future behavioural fix (e.g. to the `chunk_samples` calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.

### Issue 3 of 3
crates/audio/src/audio_data.rs:71-101
**Per-packet resampler flush removed; final flush must drain everything**

The old code unconditionally flushed the resampler after each packet's `receive_frame` loop (and again after the conditional inner flush when `run()` signalled a delay). The new code removes all intermediate flushes and relies entirely on the final `loop { resampler.flush() }` after `send_eof`. For the `from_file` use case this is fine because the final flush loop runs until `resample_delay` is `None`, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics of `flush` returning `None`, audio data could silently be truncated.

Reviews (1): Last reviewed commit: "chore(export): add encoder benchmark exa..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

@brin-security-scanner brin-security-scanner Bot added contributor:verified Contributor passed trust analysis. pr:verified PR passed security analysis. labels May 8, 2026

fn f32_slice_from_audio_frame(frame: &mut FFAudio, len: usize) -> &mut [f32] {
let bytes = &mut frame.data_mut(0)[..len * f32::BYTE_SIZE];
unsafe { std::slice::from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) assumes the FFmpeg buffer is always aligned for f32. If that ever regresses, this becomes UB.

Suggested change
unsafe { std::slice::from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) }
fn f32_slice_from_audio_frame(frame: &mut FFAudio, len: usize) -> &mut [f32] {
let bytes = &mut frame.data_mut(0)[..len * f32::BYTE_SIZE];
let ptr = bytes.as_mut_ptr();
assert_eq!(ptr.align_offset(std::mem::align_of::<f32>()), 0);
unsafe { std::slice::from_raw_parts_mut(ptr.cast::<f32>(), len) }
}

let Some(max_index) = project.clips.iter().map(|clip| clip.index as usize).max() else {
self.clip_offsets_by_index.clear();
return;
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resize(max_index + 1, …) can become an accidental huge allocation if clip.index is sparse / untrusted (e.g. a project file with a very large index). Since clip_offsets_for_project already has a linear fallback, it seems worth guarding the cache.

Suggested change
};
let Some(max_index) = project.clips.iter().map(|clip| clip.index as usize).max() else {
self.clip_offsets_by_index.clear();
return;
};
if max_index > 100_000 {
self.clip_offsets_by_index.clear();
return;
}

fn is_valid_for(self, rgba: &[u8], output: &[u8]) -> bool {
if self.width == 0 || self.height == 0 {
return false;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uv_height() / width / 2 assumes even dimensions. If a caller ever passes odd width/height, this will silently drop chroma samples (and the buffer size math gets ambiguous). It might be safer to reject odd sizes up front.

Suggested change
}
if self.width == 0 || self.height == 0 || self.width % 2 != 0 || self.height % 2 != 0 {
return false;
}

Comment on lines 277 to 340
@@ -219,8 +340,8 @@ impl AudioRenderer {
return None;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicated timeline/linear render logic across _into and _to_slice variants

render_timeline_frame_into and render_timeline_frame_to_slice are nearly identical: both walk the same timeline-cursor state machine and call render_current_chunk. The only structural difference is that the _into variant resizes a Vec<f32> while the _to_slice variant takes a pre-sized &mut [f32]. The same duplication exists between render_linear_frame_into and render_linear_frame_to_slice. Any future behavioural fix (e.g. to the chunk_samples calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/editor/src/audio.rs
Line: 277-340

Comment:
**Duplicated timeline/linear render logic across `_into` and `_to_slice` variants**

`render_timeline_frame_into` and `render_timeline_frame_to_slice` are nearly identical: both walk the same timeline-cursor state machine and call `render_current_chunk`. The only structural difference is that the `_into` variant resizes a `Vec<f32>` while the `_to_slice` variant takes a pre-sized `&mut [f32]`. The same duplication exists between `render_linear_frame_into` and `render_linear_frame_to_slice`. Any future behavioural fix (e.g. to the `chunk_samples` calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor:verified Contributor passed trust analysis. pr:verified PR passed security analysis.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant