Faster export pipeline with parallel IO, cached offsets, and CPU RGBA→NV12#1786
Faster export pipeline with parallel IO, cached offsets, and CPU RGBA→NV12#1786richiemcilroy wants to merge 17 commits intomainfrom
Conversation
|
|
||
| fn f32_slice_from_audio_frame(frame: &mut FFAudio, len: usize) -> &mut [f32] { | ||
| let bytes = &mut frame.data_mut(0)[..len * f32::BYTE_SIZE]; | ||
| unsafe { std::slice::from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) } |
There was a problem hiding this comment.
from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) assumes the FFmpeg buffer is always aligned for f32. If that ever regresses, this becomes UB.
| unsafe { std::slice::from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) } | |
| fn f32_slice_from_audio_frame(frame: &mut FFAudio, len: usize) -> &mut [f32] { | |
| let bytes = &mut frame.data_mut(0)[..len * f32::BYTE_SIZE]; | |
| let ptr = bytes.as_mut_ptr(); | |
| assert_eq!(ptr.align_offset(std::mem::align_of::<f32>()), 0); | |
| unsafe { std::slice::from_raw_parts_mut(ptr.cast::<f32>(), len) } | |
| } |
| let Some(max_index) = project.clips.iter().map(|clip| clip.index as usize).max() else { | ||
| self.clip_offsets_by_index.clear(); | ||
| return; | ||
| }; |
There was a problem hiding this comment.
resize(max_index + 1, …) can become an accidental huge allocation if clip.index is sparse / untrusted (e.g. a project file with a very large index). Since clip_offsets_for_project already has a linear fallback, it seems worth guarding the cache.
| }; | |
| let Some(max_index) = project.clips.iter().map(|clip| clip.index as usize).max() else { | |
| self.clip_offsets_by_index.clear(); | |
| return; | |
| }; | |
| if max_index > 100_000 { | |
| self.clip_offsets_by_index.clear(); | |
| return; | |
| } |
| fn is_valid_for(self, rgba: &[u8], output: &[u8]) -> bool { | ||
| if self.width == 0 || self.height == 0 { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
uv_height() / width / 2 assumes even dimensions. If a caller ever passes odd width/height, this will silently drop chroma samples (and the buffer size math gets ambiguous). It might be safer to reject odd sizes up front.
| } | |
| if self.width == 0 || self.height == 0 || self.width % 2 != 0 || self.height % 2 != 0 { | |
| return false; | |
| } |
| @@ -219,8 +340,8 @@ impl AudioRenderer { | |||
| return None; | |||
There was a problem hiding this comment.
Duplicated timeline/linear render logic across
_into and _to_slice variants
render_timeline_frame_into and render_timeline_frame_to_slice are nearly identical: both walk the same timeline-cursor state machine and call render_current_chunk. The only structural difference is that the _into variant resizes a Vec<f32> while the _to_slice variant takes a pre-sized &mut [f32]. The same duplication exists between render_linear_frame_into and render_linear_frame_to_slice. Any future behavioural fix (e.g. to the chunk_samples calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/editor/src/audio.rs
Line: 277-340
Comment:
**Duplicated timeline/linear render logic across `_into` and `_to_slice` variants**
`render_timeline_frame_into` and `render_timeline_frame_to_slice` are nearly identical: both walk the same timeline-cursor state machine and call `render_current_chunk`. The only structural difference is that the `_into` variant resizes a `Vec<f32>` while the `_to_slice` variant takes a pre-sized `&mut [f32]`. The same duplication exists between `render_linear_frame_into` and `render_linear_frame_to_slice`. Any future behavioural fix (e.g. to the `chunk_samples` calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.
How can I resolve this? If you propose a fix, please make it concise.
Improves desktop export and rendering throughput by overlapping independent setup work, reducing redundant CPU work on hot paths, and tightening a few correctness edges around FFmpeg audio/resampling and GPU readback sequencing.
Greptile Summary
This PR improves the desktop export/rendering pipeline through several independent optimisations: parallelising audio/decoder setup I/O, replacing per-frame O(n)
clips.iter().find()lookups with cached index maps, swapping the FFmpeg swscale RGBA→NV12 path for a hand-written CPU implementation (with optional rayon parallelism above 1080p), and reordering GPU readback submission to overlap with waiting for the previous frame.create_segmentsandRenderVideoConstants::neware now joined withtokio::try_join!, and multiple-segment audio loading is parallelised withtry_join_allandrayon::par_iter.render_video_to_channel/_nv12) now build aVec<ClipOffsets>once before the frame loop, eliminating repeatedclips.iter().find()calls per frame.cpu_yuv::rgba_to_nv12_fastreplaces the swscale context creation per frame, using rayon row-level parallelism above 1080p; the conversion function is also centralised to remove two separate inline copies.gainis now pre-converted to linear scale, the inner per-frame mixing loop is restructured for better cache locality, andtimeline_cursorcaches its walk state to avoid re-scanning segments each call.Confidence Score: 4/5
Safe to merge for most platforms; the removal of the Windows-only sequential decoder init guard deserves a macOS smoke-test before shipping.
The core optimisations are well-tested with baseline-matching unit tests. Two areas warrant attention: concurrent decoder init is now unconditional on all platforms (original rationale for the Windows-only guard is undocumented), and the duplicated
_into/_to_slicerender methods require parallel maintenance.crates/rendering/src/lib.rs (removed platform guard for decoder init), crates/editor/src/audio.rs (duplicated _into/_to_slice render methods)
Important Files Changed
from_filebut relies on the final flush contract.gainfield renamed tolinear_gain(pre-converted);valid_output_rangehelper correctly handles negative track offsets; well-tested by baseline-matching tests._to_slicevariants of render methods to avoid allocations; introduced cached clip-offsets withClipOffsets::default()fallback;timeline_cursornow caches segment walk state. Duplicated logic between_intoand_to_slicepaths creates maintenance risk.min_next_ptsand the top-level PTS adjustment guard still applies.cpu_yuv::rgba_to_nv12_fast; audio render moved after video encode-queue call; channel capacity bumped from 4 to 8;fill_nv12_frame_directtightened to avoid double borrow of frame data.rgba_to_nv12_fastfunction with scalar and rayon-parallel paths; BT.601 coefficients correct; validation guard returns false without writing on invalid input; baseline-matching tests added.Vecbefore the render loop; upfront full zoom precompute removed in favour of per-frame lazy calls (correctly placed at lines 973/615). Platform guard for sequential decoder init removed; original rationale undocumented.rayon::par_iter;RefCellis created inside each closure soSendis satisfied; Rayon preserves Vec order on collect.lazy_precompute_matches_full_precomputevalidates that per-frameensure_precomputed_untilproduces identical results to upfront full precompute.Comments Outside Diff (2)
crates/rendering/src/lib.rs, line 312-320 (link)The original code used
#[cfg(target_os = "windows")]to runscreen_futureandcamera_futureconcurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.Prompt To Fix With AI
crates/audio/src/audio_data.rs, line 71-101 (link)The old code unconditionally flushed the resampler after each packet's
receive_frameloop (and again after the conditional inner flush whenrun()signalled a delay). The new code removes all intermediate flushes and relies entirely on the finalloop { resampler.flush() }aftersend_eof. For thefrom_fileuse case this is fine because the final flush loop runs untilresample_delayisNone, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics offlushreturningNone, audio data could silently be truncated.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "chore(export): add encoder benchmark exa..." | Re-trigger Greptile