# Partial Rendering Specification

## Overview

Enable rendering of specific sections of a video (e.g., slides 1-10, then 10-20) instead of the full video. This is useful for:
- Faster iteration during development
- Re-rendering specific sections after fixes
- Parallel rendering of segments that can be concatenated later

## Scope (v1)

**In scope:**
- Camera state tracking (cumulative state must be computed from t=0)
- Time offset adjustment for all events
- Slide range filtering
- Input video seeking

**Out of scope (v1):**
- Audio events crossing range boundaries
- Triggered video duration edge cases
- Carry-over handling: events are assumed to begin at their marker timestamp and never "carry over" into a later range

## Current Architecture Analysis

### 1. Camera State Management

**Current behavior** (`transformer.py:250-332`):
- Camera state is **cumulative** across the transcript
- `_extract_camera_events()` walks through ALL markers sequentially
- Each marker type (Zoom/Tilt/Pan) only modifies its property while preserving others
- Example: `[Zoom2]` then `[TiltLeft]` = both zoom AND tilt active

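The cumulative behavior described above can be illustrated with a minimal sketch. The `CameraState` fields, the preset values, and the `apply_marker` helper here are assumptions made for illustration; the real definitions in `transformer.py` may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CameraState:
    # Hypothetical fields; the real CameraState may differ.
    zoom: float = 1.0
    tilt: float = 0.0
    pan: float = 0.0


# Hypothetical preset table: each marker names the one property it changes.
CAMERA_PRESETS = {
    "Zoom2": ("zoom", 2.0),
    "TiltLeft": ("tilt", -5.0),
}


def apply_marker(state: CameraState, marker_id: str) -> CameraState:
    """Return a new state with only the marker's own property changed."""
    field_name, value = CAMERA_PRESETS[marker_id]
    return replace(state, **{field_name: value})


# [Zoom2] followed by [TiltLeft]: the zoom survives the tilt marker.
state = CameraState()
for marker in ["Zoom2", "TiltLeft"]:
    state = apply_marker(state, marker)
print(state)  # CameraState(zoom=2.0, tilt=-5.0, pan=0.0)
```
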
**Problem for partial rendering**:
If we start rendering at slide 10, we need the camera state AS IT WOULD BE after processing slides 1-9.

**Solution**:
Separate "state computation" from "event generation":
1. Always walk through ALL transcript markers to compute cumulative state
2. Track the "initial state" at the start of the render range
3. Only emit CameraEvents for markers WITHIN the render range
4. First event in partial render must transition FROM the computed initial state

### 2. Time Signature Adjustment

**Current behavior**:
All timing uses absolute timestamps from `transcript.csv`:
- `SlideEvent.start_time/end_time`
- `VideoEvent.start_time/end_time`
- `AudioEvent.start_time`
- `CameraEvent.time`
- FFmpeg expressions: `enable=between(t, start, end)`
- Camera animation: `if(between(t, 1.000, 1.200), ...)`

**Problem for partial rendering**:
If slide 10 starts at t=10.0s and we render from there, FFmpeg's `t` starts at 0 in the output, so every absolute timestamp would fire 10.0s late.

**Solution**:
Apply a `time_offset` to all events after extraction:
```
new_time = original_time - time_offset
```
Where `time_offset` = start time of first slide/event in range.

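As a worked example of the formula above, the sketch below shifts one slide's window and rebuilds its `between()` expression on the partial render's timeline. The helper names `shift_time` and `enable_expr` are hypothetical and not part of the existing code.

```python
def shift_time(original_time: float, time_offset: float) -> float:
    """Map an absolute transcript timestamp onto the partial render's timeline."""
    return original_time - time_offset


def enable_expr(start: float, end: float, time_offset: float) -> str:
    """Build a between() enable expression on the shifted timeline."""
    new_start = shift_time(start, time_offset)
    new_end = shift_time(end, time_offset)
    return f"between(t,{new_start:.3f},{new_end:.3f})"


# Slide 10 spans 10.0s-15.5s in the full video; the render range starts at 10.0s.
print(enable_expr(10.0, 15.5, time_offset=10.0))  # between(t,0.000,5.500)
```
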
### 3. Input Video Seeking

**Current behavior**:
- Always-visible videos (talking head) start from the beginning
- FFmpeg processes entire input duration

**Problem for partial rendering**:
Always-visible source videos must be read from the position that matches the start of the render range, not from their beginning.

**Solution**:
Add `-ss <seek_time>` before input files for always-visible videos. As an input option, `-ss` applies only to the `-i` that follows it, so it must be repeated for each such video:
```
ffmpeg -ss 10.0 -i talking_head.mov ...
```

---

## Proposed API

### Command Line Interface

```bash
# Render full video (current behavior)
gnommo render example/project.json output.mp4

# Render specific slide range
gnommo render example/project.json output.mp4 --slides S1:S10
gnommo render example/project.json output.mp4 --slides S10:S20
gnommo render example/project.json output.mp4 --slides S5:  # S5 to end

# Render specific time range (alternative)
gnommo render example/project.json output.mp4 --time 0:60
gnommo render example/project.json output.mp4 --time 60:120
```

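One possible way to turn the `--slides` and `--time` values above into the tuples used by the internal API is sketched below. The function names and the validation rules (a start is required, the end may be empty) are assumptions, not existing behavior.

```python
from typing import Optional


def parse_slide_range(spec: str) -> tuple[str, Optional[str]]:
    """Parse 'S1:S10' or 'S5:' into (start_slide, end_slide)."""
    start, sep, end = spec.partition(":")
    if not sep or not start:
        raise ValueError(f"expected START:END or START:, got {spec!r}")
    return start, (end or None)


def parse_time_range(spec: str) -> tuple[float, Optional[float]]:
    """Parse '60:120' or '60:' into (start_time, end_time) in seconds."""
    start, sep, end = spec.partition(":")
    if not sep or not start:
        raise ValueError(f"expected START:END or START:, got {spec!r}")
    start_time = float(start)
    end_time = float(end) if end else None
    if end_time is not None and end_time <= start_time:
        raise ValueError(f"end must be greater than start, got {spec!r}")
    return start_time, end_time


print(parse_slide_range("S5:"))    # ('S5', None)
print(parse_time_range("60:120"))  # (60.0, 120.0)
```
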
### Internal API

New parameters for `build_render_plan()`:
```python
def build_render_plan(
    ...
    slide_range: Optional[tuple[str, Optional[str]]] = None,  # (start_slide, end_slide)
    # OR (mutually exclusive with slide_range)
    time_range: Optional[tuple[float, Optional[float]]] = None,  # (start_time, end_time)
) -> RenderPlan:
```

New fields on `RenderPlan`:
```python
@dataclass
class RenderPlan:
    ...
    time_offset: float = 0.0  # Offset to subtract from all timestamps
    initial_camera_state: CameraState = field(default_factory=CameraState)  # State at render start
    input_seek_time: float = 0.0  # Seek position for input videos
```

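Since `slide_range` and `time_range` are alternatives, `build_render_plan()` will likely need to normalize them into a single time window. The sketch below shows one way to do that. It assumes a hypothetical `slide_starts` mapping from slide ID to start time, and it assumes the end slide is exclusive (the range stops where the end slide begins), which would let `S1:S10` and `S10:S20` tile without overlap. The spec does not state either point.

```python
from typing import Optional


def resolve_time_range(
    slide_starts: dict[str, float],  # Hypothetical: slide ID -> start time in seconds
    slide_range: Optional[tuple[str, Optional[str]]] = None,
    time_range: Optional[tuple[float, Optional[float]]] = None,
) -> tuple[float, float]:
    """Normalize either range form to a half-open (start, end) window in seconds."""
    if slide_range is not None and time_range is not None:
        raise ValueError("pass slide_range or time_range, not both")
    if slide_range is not None:
        start_slide, end_slide = slide_range
        start = slide_starts[start_slide]
        # Assumption: the end slide is exclusive.
        end = slide_starts[end_slide] if end_slide is not None else None
    elif time_range is not None:
        start, end = time_range
    else:
        start, end = 0.0, None  # No range given: full render
    return start, (float("inf") if end is None else end)


starts = {"S1": 0.0, "S10": 10.0, "S20": 25.0}
print(resolve_time_range(starts, slide_range=("S10", "S20")))  # (10.0, 25.0)
```
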
---

## Implementation Details

### Phase 1: Compute Full State, Filter Events

Modify `_extract_camera_events()` to accept a time range. The walk still covers the whole transcript; the range only controls which events are emitted and where the initial state is captured:

```python
def _extract_camera_events(
    transcript: list[TimedWord],
    time_range: Optional[tuple[float, Optional[float]]] = None,  # (start, end); end=None means "to the end"
) -> tuple[list[CameraEvent], CameraState]:
    """
    Returns:
        - List of CameraEvents within time_range
        - Initial CameraState at start of time_range
    """
    events: list[CameraEvent] = []
    current_state = CameraState()
    initial_state = CameraState()
    start_time, end_time = time_range or (0.0, None)
    if end_time is None:
        end_time = float('inf')  # Open-ended range, e.g. --slides S5:

    found_start = False

    for timed_word in transcript:
        if not timed_word.is_marker:
            continue

        marker_id = timed_word.marker_id
        if not marker_id or marker_id not in CAMERA_PRESETS:
            continue

        # Always update current_state (full walk)
        preset = CAMERA_PRESETS[marker_id]
        new_state = _apply_preset(current_state, marker_id, preset)

        # Capture state just before we enter the render range
        if not found_start and timed_word.time >= start_time:
            initial_state = current_state  # State BEFORE this marker
            found_start = True

        # Only emit events within range
        if start_time <= timed_word.time < end_time:
            events.append(CameraEvent(
                time=timed_word.time,
                target_state=new_state,
                duration=0.2,
                easing="ease-out",
            ))

        current_state = new_state

    # No camera marker at or after start_time: the range inherits the last state seen
    if not found_start:
        initial_state = current_state

    return events, initial_state
```

### Phase 2: Apply Time Offset

After extracting and filtering events, subtract the offset from every timestamp:

```python
def _apply_time_offset(plan: RenderPlan, offset: float) -> RenderPlan:
    """Shift all timestamps by offset (subtract offset from all times).

    Mutates the plan in place and returns it.
    """

    # Adjust slide events
    for event in plan.slide_events:
        event.start_time -= offset
        event.end_time -= offset

    # Adjust video events
    for event in plan.video_events:
        event.start_time -= offset
        event.end_time -= offset

    # Adjust audio events
    for event in plan.audio_events:
        event.start_time = max(0, event.start_time - offset)

    # Adjust camera events
    for event in plan.camera_events:
        event.time -= offset

    # Adjust total duration (assumes it was already clamped to the range end, if any)
    plan.total_duration -= offset
    plan.time_offset = offset
    plan.input_seek_time = offset

    return plan
```

### Phase 3: FFmpeg Seeking

Modify `build_ffmpeg_command()` to add seeking:

```python
def build_ffmpeg_command(plan: RenderPlan, output_path: Path) -> list[str]:
    cmd = ["ffmpeg", "-y"]

    # Add seek for always-visible videos
    # (videos_dir is not defined in this excerpt; it comes from the elided code)
    for video_id, video_source, cutout in plan.narration_videos:
        video_path = _resolve_video_path(videos_dir, video_source)
        if plan.input_seek_time > 0:
            # -ss is an input option: it must precede the -i it applies to
            cmd.extend(["-ss", str(plan.input_seek_time)])
        cmd.extend(["-i", str(video_path)])
    ...
```

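To make the resulting argument order concrete, here is a self-contained sketch of the same loop with plain strings in place of the plan. The `add_seeked_inputs` helper and the `screen.mov` input are hypothetical; the point is that each always-visible input gets its own `-ss` ahead of its `-i`.

```python
def add_seeked_inputs(cmd: list[str], video_paths: list[str], seek_time: float) -> list[str]:
    """Append each input, preceded by its own -ss when seeking is needed."""
    for path in video_paths:
        if seek_time > 0:
            cmd.extend(["-ss", f"{seek_time:.3f}"])
        cmd.extend(["-i", path])
    return cmd


cmd = add_seeked_inputs(["ffmpeg", "-y"], ["talking_head.mov", "screen.mov"], 10.0)
print(" ".join(cmd))
# ffmpeg -y -ss 10.000 -i talking_head.mov -ss 10.000 -i screen.mov
```
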
### Phase 4: Initial Camera State Handling

If `initial_camera_state` is not default, inject a "virtual" camera event at t=0:

```python
def build_camera_transform(
    camera_events: list[CameraEvent],
    initial_state: CameraState,  # NEW PARAMETER
    ...
) -> str:
    # If initial state differs from default, prepend a virtual event
    if not initial_state.is_default():
        initial_event = CameraEvent(
            time=0.0,
            target_state=initial_state,
            duration=0.0,  # Instant - no transition
            easing="linear",
        )
        camera_events = [initial_event] + camera_events
    ...
```

---

## FFmpeg Optimization

**Only emit filters for events within range.**

When rendering a partial range, the `RenderPlan` should only contain events within that range. This means:
- Fewer inputs added to the FFmpeg command (only slides/videos/audio actually used)
- Fewer overlay filters in filter_complex
- Fewer `between(t, start, end)` enable expressions to evaluate per frame

Example: Full video has 50 slides, rendering S40:S50 only:
- **Before**: 50 slide inputs, 50 overlay filters
- **After**: 10 slide inputs, 10 overlay filters

This is achieved naturally by filtering events in `build_render_plan()` before constructing the plan - the renderer already only processes events present in the plan.

---

## Edge Cases (v1 Simplified)

### 1. Camera state from before range
If rendering S5:S10 but there's a camera event at the S4 marker:
- Camera state from S4 must be captured as `initial_camera_state`
- Rendered output starts with that state already applied at t=0

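The S4 scenario can be traced with a simplified, self-contained version of the full-walk logic. The state fields, preset values, slide timestamps, and the `walk` helper are assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CameraState:
    # Hypothetical fields; the real CameraState may differ.
    zoom: float = 1.0
    tilt: float = 0.0


# Hypothetical presets: marker ID -> the properties it overrides.
PRESETS = {"Zoom2": {"zoom": 2.0}, "TiltLeft": {"tilt": -5.0}}


def walk(markers: list[tuple[float, str]], start: float, end: float):
    """Return (events_in_range, state_at_range_start) after a full walk."""
    state = CameraState()
    initial = None
    events = []
    for time, marker_id in markers:
        new_state = replace(state, **PRESETS[marker_id])
        if initial is None and time >= start:
            initial = state  # State just before the first in-range marker
        if start <= time < end:
            events.append((time, new_state))
        state = new_state
    # No marker at or after start: inherit the last state seen
    return events, (state if initial is None else initial)


# [Zoom2] at S4 (t=4.0), [TiltLeft] at S7 (t=7.0); S5:S10 taken as 5.0s-10.0s.
events, initial = walk([(4.0, "Zoom2"), (7.0, "TiltLeft")], start=5.0, end=10.0)
print(initial)  # CameraState(zoom=2.0, tilt=0.0)
print(events)   # [(7.0, CameraState(zoom=2.0, tilt=-5.0))]
```
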
### 2. Events filter by marker position
All events (slides, videos, audio) are filtered by whether their START marker falls within the range.
- Events beginning outside range are excluded
- No "carry over" or boundary-crossing logic needed

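A minimal sketch of this start-marker rule follows. The `Event` class is a hypothetical stand-in for the real event types, and the half-open comparison mirrors the `start_time <= time < end_time` check used for camera events; whether slide ranges should use the same convention is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical stand-in for SlideEvent / VideoEvent / AudioEvent.
    name: str
    start_time: float


def filter_by_start(events: list[Event], start: float, end: float) -> list[Event]:
    """Keep events whose start marker falls in the half-open range [start, end)."""
    return [e for e in events if start <= e.start_time < end]


all_events = [Event("a", 8.0), Event("b", 10.0), Event("c", 19.9), Event("d", 20.0)]
kept = filter_by_start(all_events, start=10.0, end=20.0)
print([e.name for e in kept])  # ['b', 'c']
```
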
---

## Testing Strategy

### Unit Tests
1. Camera state computation maintains state across full transcript
2. Time offset correctly shifts all event types
3. Initial camera state correctly captured at boundary

### Integration Tests
1. Render slides 1-5, then 5-10, concatenate, and compare to a full render of slides 1-10
2. Camera state continuity across segment boundaries
3. Audio alignment after seeking

### Manual Verification
1. Visual inspection of camera state at segment boundaries
2. Audio sync verification

---

## Future Enhancements

### Parallel Rendering Pipeline
```bash
# Render in parallel, then concatenate
gnommo render proj.json seg1.mp4 --slides S1:S10 &
gnommo render proj.json seg2.mp4 --slides S10:S20 &
gnommo render proj.json seg3.mp4 --slides S20: &
wait
# segments.txt lists the segments in order, one "file '<name>'" line each
ffmpeg -f concat -i segments.txt -c copy final.mp4
```

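The concat step reads a list file with one `file '<path>'` line per segment. A small sketch of generating it is shown below; the `write_concat_list` helper is hypothetical, and it assumes segment names contain no single quotes (which would need escaping).

```python
from pathlib import Path


def write_concat_list(segments: list[str], list_path: Path) -> None:
    """Write an FFmpeg concat-demuxer list: one `file '<path>'` line per segment."""
    lines = [f"file '{name}'" for name in segments]
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Segment order in the list determines playback order in final.mp4.
    write_concat_list(["seg1.mp4", "seg2.mp4", "seg3.mp4"], Path("segments.txt"))
```
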
### Smart Re-rendering
Track which slides changed and only re-render affected segments.

### Preview Mode
Quick low-quality render of a specific section for review.