Adding changes version 1

2026-02-06 17:56:05 +01:00
parent 93fa820275
commit fdd275ac0e
30 changed files with 7068 additions and 888 deletions
@@ -0,0 +1,317 @@
+# Partial Rendering Specification
+
+## Overview
+
+Enable rendering of specific sections of a video (e.g., slides 1-10, then 10-20) instead of the full video. This is useful for:
+- Faster iteration during development
+- Re-rendering specific sections after fixes
+- Parallel rendering of segments that can be concatenated later
+
+## Scope (v1)
+
+**In scope:**
+- Camera state tracking (cumulative state must be computed from t=0)
+- Time offset adjustment for all events
+- Slide range filtering
+- Input video seeking
+
+**Out of scope (v1):**
+- Audio events crossing range boundaries
+- Triggered video duration edge cases
+
+**Assumption (v1):** every event begins at its marker timestamp and never "carries over" across a range boundary.
+
+## Current Architecture Analysis
+
+### 1. Camera State Management
+
+**Current behavior** (`transformer.py:250-332`):
+- Camera state is **cumulative** across the transcript
+- `_extract_camera_events()` walks through ALL markers sequentially
+- Each marker type (Zoom/Tilt/Pan) only modifies its property while preserving others
+- Example: `[Zoom2]` then `[TiltLeft]` = both zoom AND tilt active
+
+**Problem for partial rendering**:
+If we start rendering at slide 10, we need the camera state AS IT WOULD BE after processing slides 1-9.
+
+**Solution**:
+Separate "state computation" from "event generation":
+1. Always walk through ALL transcript markers to compute cumulative state
+2. Track the "initial state" at the start of the render range
+3. Only emit CameraEvents for markers WITHIN the render range
+4. First event in partial render must transition FROM the computed initial state
+
+### 2. Timestamp Adjustment
+
+**Current behavior**:
+All timing uses absolute timestamps from `transcript.csv`:
+- `SlideEvent.start_time/end_time`
+- `VideoEvent.start_time/end_time`
+- `AudioEvent.start_time`
+- `CameraEvent.time`
+- FFmpeg expressions: `enable=between(t, start, end)`
+- Camera animation: `if(between(t, 1.000, 1.200), ...)`
+
+**Problem for partial rendering**:
+If slide 10 starts at t=10.0s and we render from there, FFmpeg expects t=0 at the start of output.
+
+**Solution**:
+Apply a `time_offset` to all events after extraction:
+```
+new_time = original_time - time_offset
+```
+Where `time_offset` = start time of first slide/event in range.
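
A minimal sketch of the rebasing, applied to an `enable` expression. The helper name `rebased_enable_expr` is hypothetical (it is not from the codebase), and quoting or escaping the commas for the filtergraph is left to the caller:

```python
def rebased_enable_expr(start: float, end: float, time_offset: float) -> str:
    """Build a between() expression on the partial render's clock."""
    # Subtract the offset so the first event in range lands at t=0.
    new_start = start - time_offset
    new_end = end - time_offset
    return f"between(t,{new_start:.3f},{new_end:.3f})"


# An event spanning 12.0s-15.5s, rendered from a range starting at 10.0s.
print(rebased_enable_expr(12.0, 15.5, 10.0))
```

The same subtraction would apply to the timestamps inside camera animation expressions.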
+
+### 3. Input Video Seeking
+
+**Current behavior**:
+- Always-visible videos (talking head) start from the beginning
+- FFmpeg processes entire input duration
+
+**Problem for partial rendering**:
+Need to seek into source videos to the correct position.
+
+**Solution**:
+Add `-ss <seek_time>` before the `-i` of each always-visible video, so the seek applies to that input:
+```
+ffmpeg -ss 10.0 -i talking_head.mov ...
+```
+
+---
+
+## Proposed API
+
+### Command Line Interface
+
+```bash
+# Render full video (current behavior)
+gnommo render example/project.json output.mp4
+
+# Render specific slide range (end slide is exclusive, so S1:S10 and S10:S20 do not overlap)
+gnommo render example/project.json output.mp4 --slides S1:S10
+gnommo render example/project.json output.mp4 --slides S10:S20
+gnommo render example/project.json output.mp4 --slides S5:  # S5 to end
+
+# Render specific time range (alternative)
+gnommo render example/project.json output.mp4 --time 0:60
+gnommo render example/project.json output.mp4 --time 60:120
+```
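
One possible way to turn a `--slides` value into the `(start_slide, end_slide)` tuple used by the internal API. `parse_slide_range` is a hypothetical helper, and rejecting a missing start slide is an assumption of this sketch rather than documented CLI behavior:

```python
from typing import Optional


def parse_slide_range(spec: str) -> tuple[str, Optional[str]]:
    """Parse 'S1:S10' or 'S5:' into (start_slide, end_slide)."""
    start, sep, end = spec.partition(":")
    # Require exactly one colon and a non-empty start slide.
    if not sep or not start or ":" in end:
        raise ValueError(f"expected START:END or START:, got {spec!r}")
    # An empty end ("S5:") means "render to the end of the video".
    return start, (end or None)


print(parse_slide_range("S1:S10"))
print(parse_slide_range("S5:"))
```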
+
+### Internal API
+
+New parameters for `build_render_plan()`:
+```python
+def build_render_plan(
+    ...
+    slide_range: Optional[tuple[str, Optional[str]]] = None,  # (start_slide, end_slide)
+    # OR
+    time_range: Optional[tuple[float, Optional[float]]] = None,  # (start_time, end_time)
+) -> RenderPlan:
+```
+
+New field on `RenderPlan`:
+```python
+@dataclass
+class RenderPlan:
+    ...
+    time_offset: float = 0.0  # Offset to subtract from all timestamps
+    initial_camera_state: CameraState = field(default_factory=CameraState)  # State at render start
+    input_seek_time: float = 0.0  # Seek position for input videos
+```
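
Since the later phases work on times, a slide range presumably has to be resolved to a time range first. The sketch below assumes a list of `(slide_id, start_time)` pairs and treats the end slide as exclusive; `resolve_slide_range` and its input shape are illustrative, not existing code:

```python
from typing import Optional


def resolve_slide_range(
    slide_starts: list[tuple[str, float]],
    slide_range: tuple[str, Optional[str]],
) -> tuple[float, float]:
    """Map (start_slide, end_slide) to a half-open (start_time, end_time)."""
    times = dict(slide_starts)
    start_id, end_id = slide_range
    for slide_id in (start_id, end_id):
        if slide_id is not None and slide_id not in times:
            raise KeyError(f"unknown slide {slide_id!r}")
    start_time = times[start_id]
    # An open-ended range ("S5:") runs to the end of the video.
    end_time = float("inf") if end_id is None else times[end_id]
    if end_time <= start_time:
        raise ValueError("end slide must start after the start slide")
    return start_time, end_time


slides = [("S1", 0.0), ("S5", 12.5), ("S10", 40.0)]
print(resolve_slide_range(slides, ("S5", "S10")))
```

The resulting start time would then serve as `time_offset` and `input_seek_time`.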
+
+---
+
+## Implementation Details
+
+### Phase 1: Compute Full State, Filter Events
+
+Modify `_extract_camera_events()` to accept a time range:
+
+```python
+def _extract_camera_events(
+    transcript: list[TimedWord],
+    time_range: Optional[tuple[float, float]] = None,  # (start, end)
+) -> tuple[list[CameraEvent], CameraState]:
+    """
+    Returns:
+        - List of CameraEvents within time_range
+        - Initial CameraState at start of time_range
+    """
+    events: list[CameraEvent] = []
+    current_state = CameraState()
+    initial_state = CameraState()
+    start_time, end_time = time_range or (0.0, float('inf'))
+
+    found_start = False
+
+    for timed_word in transcript:
+        if not timed_word.is_marker:
+            continue
+
+        marker_id = timed_word.marker_id
+        if not marker_id or marker_id not in CAMERA_PRESETS:
+            continue
+
+        # Always update current_state (full walk)
+        preset = CAMERA_PRESETS[marker_id]
+        new_state = _apply_preset(current_state, marker_id, preset)
+
+        # Capture state just before we enter the render range
+        if not found_start and timed_word.time >= start_time:
+            initial_state = current_state  # State BEFORE this marker
+            found_start = True
+
+        # Only emit events within range
+        if start_time <= timed_word.time < end_time:
+            events.append(CameraEvent(
+                time=timed_word.time,
+                target_state=new_state,
+                duration=0.2,
+                easing="ease-out",
+            ))
+
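+        current_state = new_state
+
+    # No camera marker at or after start_time: every marker precedes the range,
+    # so the fully accumulated state is the initial state.
+    if not found_start:
+        initial_state = current_state
+
+    return events, initial_state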
+```
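
The walk above relies on `_apply_preset` changing only the property a marker owns. A self-contained sketch of that behavior follows; the preset table, the `apply_preset_sketch` name, and the frozen dataclass are assumptions for illustration, not the actual contents of `CAMERA_PRESETS`:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)  # frozen: a captured initial state cannot be mutated later
class CameraState:
    zoom: float = 1.0
    rotation: float = 0.0
    pan_x: float = 0.0
    pan_y: float = 0.0


# Hypothetical preset table: each preset lists only the fields it overrides.
PRESETS_SKETCH: dict[str, dict[str, float]] = {
    "Zoom2": {"zoom": 1.25},
    "TiltLeft": {"rotation": -15.0},
    "NoTilt": {"rotation": 0.0},
}


def apply_preset_sketch(state: CameraState, marker_id: str) -> CameraState:
    """Return a new state with only the preset's fields changed."""
    if marker_id == "Reset":
        return CameraState()  # Reset clears every property
    return replace(state, **PRESETS_SKETCH[marker_id])


state = apply_preset_sketch(CameraState(), "Zoom2")
state = apply_preset_sketch(state, "TiltLeft")
print(state)  # zoom and rotation are both set
```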
+
+### Phase 2: Apply Time Offset
+
+After extracting events, apply offset to all timestamps:
+
+```python
+def _apply_time_offset(plan: RenderPlan, offset: float) -> RenderPlan:
+    """Shift all timestamps by offset (subtract offset from all times)."""
+
+    # Adjust slide events
+    for event in plan.slide_events:
+        event.start_time -= offset
+        event.end_time -= offset
+
+    # Adjust video events
+    for event in plan.video_events:
+        event.start_time -= offset
+        event.end_time -= offset
+
+    # Adjust audio events
+    for event in plan.audio_events:
+        event.start_time = max(0, event.start_time - offset)
+
+    # Adjust camera events
+    for event in plan.camera_events:
+        event.time -= offset
+
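+    # Adjust total duration. This assumes total_duration already ends at the
+    # range end; otherwise it should be set to (range_end - offset) instead.
+    plan.total_duration -= offset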
+    plan.time_offset = offset
+    plan.input_seek_time = offset
+
+    return plan
+```
+
+### Phase 3: FFmpeg Seeking
+
+Modify `build_ffmpeg_command()` to add seeking:
+
+```python
+def build_ffmpeg_command(plan: RenderPlan, output_path: Path) -> list[str]:
+    cmd = ["ffmpeg", "-y"]
+
+    # Add seek for always-visible videos
+    for video_id, video_source, cutout in plan.narration_videos:
+        video_path = _resolve_video_path(videos_dir, video_source)
+        if plan.input_seek_time > 0:
+            cmd.extend(["-ss", str(plan.input_seek_time)])  # Seek BEFORE -i
+        cmd.extend(["-i", str(video_path)])
+        ...
+```
+
+### Phase 4: Initial Camera State Handling
+
+If `initial_camera_state` is not default, inject a "virtual" camera event at t=0:
+
+```python
+def build_camera_transform(
+    camera_events: list[CameraEvent],
+    initial_state: CameraState,  # NEW PARAMETER
+    ...
+) -> str:
+    # If initial state differs from default, prepend a virtual event
+    if not initial_state.is_default():
+        initial_event = CameraEvent(
+            time=0.0,
+            target_state=initial_state,
+            duration=0.0,  # Instant - no transition
+            easing="linear",
+        )
+        camera_events = [initial_event] + camera_events
+    ...
+```
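
The snippet above calls `initial_state.is_default()`, which the `CameraState` model does not show. One way it could be written, assuming the four fields from the camera data model and relying on dataclass equality:

```python
from dataclasses import dataclass


@dataclass
class CameraState:
    zoom: float = 1.0
    rotation: float = 0.0
    pan_x: float = 0.0
    pan_y: float = 0.0

    def is_default(self) -> bool:
        """True when no zoom, tilt, or pan is active."""
        # Dataclasses compare field by field, so this checks all four values.
        return self == CameraState()


print(CameraState().is_default())
print(CameraState(zoom=1.25).is_default())
```

Exact float comparison seems acceptable here because preset values are assigned, not computed.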
+
+---
+
+## FFmpeg Optimization
+
+**Only emit filters for events within range.**
+
+When rendering a partial range, the `RenderPlan` should only contain events within that range. This means:
+- Fewer inputs added to the FFmpeg command (only slides/videos/audio actually used)
+- Fewer overlay filters in filter_complex
+- Fewer `between(t, start, end)` enable expressions to evaluate per frame
+
+Example: Full video has 50 slides, rendering S40:S50 only:
+- **Before**: 50 slide inputs, 50 overlay filters
+- **After**: 10 slide inputs, 10 overlay filters
+
+This is achieved naturally by filtering events in `build_render_plan()` before constructing the plan - the renderer already only processes events present in the plan.
+
+---
+
+## Edge Cases (v1 Simplified)
+
+### 1. Camera state from before range
+If rendering S5:S10 but there's a camera event at the S4 marker:
+- Camera state from S4 must be captured as `initial_camera_state`
+- Rendered output starts with that state already applied at t=0
+
+### 2. Events are filtered by marker position
+All events (slides, videos, audio) are filtered by whether their START marker falls within the range.
+- Events beginning outside range are excluded
+- No "carry over" or boundary-crossing logic needed
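
The filtering rule can be sketched as a half-open check on each event's start time. `TimedEvent` is a hypothetical stand-in for `SlideEvent`/`VideoEvent`, and `filter_by_start` is an illustrative name:

```python
from dataclasses import dataclass


@dataclass
class TimedEvent:  # hypothetical stand-in for SlideEvent / VideoEvent
    name: str
    start_time: float
    end_time: float


def filter_by_start(
    events: list[TimedEvent], start: float, end: float
) -> list[TimedEvent]:
    """Keep events whose START falls in [start, end); no carry-over logic."""
    return [e for e in events if start <= e.start_time < end]


events = [
    TimedEvent("S4", 8.0, 12.0),    # starts before the range: excluded
    TimedEvent("S5", 12.0, 20.0),   # starts at the range start: kept
    TimedEvent("S10", 40.0, 45.0),  # starts at the range end: excluded
]
print([e.name for e in filter_by_start(events, 12.0, 40.0)])
```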
+
+---
+
+## Testing Strategy
+
+### Unit Tests
+1. Camera state computation maintains state across full transcript
+2. Time offset correctly shifts all event types
+3. Initial camera state correctly captured at boundary
+
+### Integration Tests
+1. Render slides 1-5, then 5-10, concatenate, compare to full render
+2. Camera state continuity across segment boundaries
+3. Audio alignment after seeking
+
+### Manual Verification
+1. Visual inspection of camera state at segment boundaries
+2. Audio sync verification
+
+---
+
+## Future Enhancements
+
+### Parallel Rendering Pipeline
+```bash
+# Render in parallel, then concatenate
+gnommo render proj.json seg1.mp4 --slides S1:S10 &
+gnommo render proj.json seg2.mp4 --slides S10:S20 &
+gnommo render proj.json seg3.mp4 --slides S20: &
+wait
+ffmpeg -f concat -i segments.txt -c copy final.mp4
+```
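
The `segments.txt` file read by the concat demuxer lists one `file '<path>'` entry per line. A small sketch that writes it; `write_concat_list` is a hypothetical helper and assumes simple file names without single quotes:

```python
from pathlib import Path


def write_concat_list(segments: list[str], list_path: Path) -> None:
    """Write a concat-demuxer list with one "file '<name>'" line per segment."""
    # Order matters: segments are concatenated in the order listed.
    lines = [f"file '{name}'" for name in segments]
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    write_concat_list(["seg1.mp4", "seg2.mp4", "seg3.mp4"], Path("segments.txt"))
```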
+
+### Smart Re-rendering
+Track which slides changed and only re-render affected segments.
+
+### Preview Mode
+Quick low-quality render of specific section for review.
@@ -0,0 +1,265 @@
+# Virtual Camera Effects
+
+Ideas for "stuff happening" to keep viewers engaged in edutainment videos.
+These effects are triggered by markers in the manuscript, just like slides.
+
+## Zoom Effects
+
+| Marker | Description |
+|--------|-------------|
+| `[Zoom1]` | Zoom to 110% - subtle emphasis |
+| `[Zoom2]` | Zoom to 125% - moderate emphasis |
+| `[Zoom3]` | Zoom to 150% - strong emphasis |
+| `[Zoom0]` | Return to 100% (default) |
+| `[ZoomPunch]` | Quick zoom in + out (single beat emphasis) |
+
+**Use case:** Rapid `[Zoom1][Zoom2][Zoom3]` for comedic/dramatic triple emphasis.
+
+## Tilt/Rotation Effects
+
+| Marker | Description |
+|--------|-------------|
+| `[TiltLeft]` | Rotate -15 degrees |
+| `[TiltRight]` | Rotate +15 degrees |
+| `[NoTilt]` | Return to 0 degrees |
+| `[TiltShake]` | Quick left-right shake (confusion/emphasis) |
+
+**Use case:** Tilt when saying something "off" or wrong, return to flat for correction.
+
+## Pan/Position Effects
+
+| Marker | Description |
+|--------|-------------|
+| `[PanLeft]` | Shift frame left (subject moves right) |
+| `[PanRight]` | Shift frame right (subject moves left) |
+| `[PanUp]` | Shift frame up |
+| `[PanDown]` | Shift frame down |
+| `[PanCenter]` | Return to center |
+
+**Use case:** Pan to make room for a slide appearing on one side.
+
+## Shake/Movement Effects
+
+| Marker | Description |
+|--------|-------------|
+| `[Shake]` | Brief screen shake (impact, surprise) |
+| `[ShakeHard]` | Intense shake (explosion, error) |
+| `[Wobble]` | Gentle continuous wobble |
+| `[NoWobble]` | Stop wobble |
+
+**Use case:** Shake on "WRONG!" or when something crashes/fails.
+
+## Speed/Rhythm Effects
+
+| Marker | Description |
+|--------|-------------|
+| `[Beat]` | Single visual pulse (scale bump) |
+| `[BeatStart]` | Start pulsing to rhythm |
+| `[BeatStop]` | Stop pulsing |
+
+**Use case:** Rhythmic emphasis during lists or key points.
+
+## Transition Effects
+
+| Marker | Description |
+|--------|-------------|
+| `[Flash]` | Quick white flash |
+| `[Blackout]` | Brief black frame |
+| `[Glitch]` | Digital glitch effect |
+
+**Use case:** Transition between topics or for "record scratch" moments.
+
+## Picture-in-Picture Variations
+
+| Marker | Description |
+|--------|-------------|
+| `[PipGrow]` | Enlarge talking head cutout |
+| `[PipShrink]` | Shrink talking head cutout |
+| `[PipHide]` | Temporarily hide talking head |
+| `[PipShow]` | Restore talking head |
+| `[PipMove:corner]` | Move pip to different corner |
+
+**Use case:** Shrink self when showing important diagram, grow when making personal point.
+
+## Combination Presets
+
+| Marker | Description |
+|--------|-------------|
+| `[Emphasis]` | Zoom2 + slight tilt (general emphasis) |
+| `[Surprise]` | Quick zoom + shake |
+| `[Sarcasm]` | Slow zoom + tilt |
+| `[Reset]` | Return all effects to default |
+
+---
+
+## Architecture: The Camera Abstraction
+
+### The Core Insight
+
+All visual elements (slides, cutouts, talking head, background) exist in a **scene**.
+The **camera** views the scene. When the camera zooms, tilts, or pans - everything
+moves together, just like a real camera filming a physical set.
+
+```
+┌─────────────────────────────────────────────────────────┐
+│                        SCENE                           │
+│  ┌─────────────────────────────────────────────────┐   │
+│  │              Background Layer                   │   │
+│  │  ┌─────────────┐                                │   │
+│  │  │ Talking Head│      ┌──────────────────┐      │   │
+│  │  │   (cutout)  │      │      Slide       │      │   │
+│  │  └─────────────┘      │    (from .png)   │      │   │
+│  │                       └──────────────────┘      │   │
+│  └─────────────────────────────────────────────────┘   │
+└─────────────────────────────────────────────────────────┘
+                           │
+                           ▼
+                    ┌─────────────┐
+                    │   CAMERA    │
+                    │  zoom: 1.25 │
+                    │  tilt: -15° │
+                    │  pan: 0, 0  │
+                    └─────────────┘
+                           │
+                           ▼
+                  ┌─────────────────┐
+                  │  Final Output   │
+                  │   (1920x1080)   │
+                  └─────────────────┘
+```
+
+### Why This Matters
+
+**Keynote slides are designed for a specific frame.** If you create a slide with
+an arrow pointing at where the talking head cutout will be, that spatial
+relationship must be preserved when the camera zooms or tilts.
+
+If we zoomed only the background and not the slides, the arrow would point to
+the wrong place. The camera abstraction ensures everything transforms together.
+
+### Camera Properties
+
+```python
+@dataclass
+class CameraState:
+    zoom: float = 1.0        # 1.0 = 100%, 1.25 = 125%
+    rotation: float = 0.0    # degrees, positive = clockwise
+    pan_x: float = 0.0       # -1.0 to 1.0, fraction of frame width
+    pan_y: float = 0.0       # -1.0 to 1.0, fraction of frame height
+
+@dataclass
+class CameraKeyframe:
+    time: float              # timestamp in seconds
+    state: CameraState
+    easing: str = "linear"   # linear, ease-in, ease-out, ease-in-out
+```
+
+### Rendering Pipeline (Updated)
+
+```
+Current Pipeline:
+  Parse → Validate → Transform → Render
+                                   │
+                                   ▼
+                          build_filter_complex()
+                                   │
+                          [bg] → overlays → [vout]
+
+New Pipeline:
+  Parse → Validate → Transform → Render
+                         │
+                    Extract camera
+                    keyframes from
+                    markers
+                         │
+                         ▼
+                  build_filter_complex()
+                         │
+              [bg] → overlays → [scene]
+                                   │
+                          apply_camera_transform()
+                                   │
+                              [scene] → zoom/rotate/pan → [vout]
+```
+
+### FFmpeg Implementation
+
+The camera transform is a **final filter stage** applied to the composed scene:
+
+```
+# Compose scene (existing code)
+[0:v]scale=1920:1080[bg];
+[bg][slide1]overlay=...[s1];
+[s1][talkinghead]overlay=...[scene];
+
+# Camera transform (new)
+[scene]scale=iw*{zoom}:ih*{zoom},
+       rotate={rotation}*PI/180:fillcolor=black,
+       crop=1920:1080:(iw-1920)/2:(ih-1080)/2[vout]
+```
+
+For smooth animated zoom (using expressions; `zoompan` exposes the input timestamp as `it` rather than `t`):
+```
+[scene]zoompan=z='if(between(it,5,8), 1+0.25*(it-5)/3, 1)':
+              x='iw/2-(iw/zoom/2)':
+              y='ih/2-(ih/zoom/2)':
+              d=1:s=1920x1080:fps=30[vout]
+```
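
A sketch of how such an expression might be generated for one transition. `linear_zoom_expr` is a hypothetical helper; the time variable is a parameter because its name depends on the filter the expression is used in:

```python
def linear_zoom_expr(
    t0: float, duration: float, z_from: float, z_to: float, var: str = "t"
) -> str:
    """FFmpeg expression that ramps zoom linearly from z_from to z_to."""
    if duration <= 0:
        # Instant snap at t0.
        return f"if(gte({var},{t0:.3f}),{z_to},{z_from})"
    t1 = t0 + duration
    ramp = f"{z_from}+({z_to}-{z_from})*({var}-{t0:.3f})/{duration:.3f}"
    # Before t0: hold z_from. During [t0, t1): ramp. Afterwards: hold z_to.
    return f"if(lt({var},{t0:.3f}),{z_from},if(lt({var},{t1:.3f}),{ramp},{z_to}))"


print(linear_zoom_expr(5.0, 3.0, 1.0, 1.25))
```

Unlike a bare `between(...)` ramp, this form holds the target zoom after the transition ends instead of dropping back to the starting value.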
+
+### Camera Events in Timeline
+
+New model for camera changes:
+
+```python
+@dataclass
+class CameraEvent:
+    time: float
+    target_state: CameraState
+    duration: float = 0.0      # 0 = instant snap
+    easing: str = "ease-out"
+```
+
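+Markers map to camera events (shown starting from the default state; with state accumulation, properties the marker does not touch carry over from the previous state):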
+- `[Zoom2]` → `CameraEvent(time=t, target_state=CameraState(zoom=1.25), duration=0.2)`
+- `[TiltLeft]` → `CameraEvent(time=t, target_state=CameraState(rotation=-15), duration=0.3)`
+- `[Reset]` → `CameraEvent(time=t, target_state=CameraState(), duration=0.2)`
+
+### Considerations
+
+1. **Overscan**: When zoomed in, we're cropping. The scene must be rendered
+   larger than output (e.g., 2x) to have room for zoom without quality loss.
+
+2. **Rotation center**: Rotate around frame center, not corner.
+
+3. **State accumulation**: `[Zoom2]` then `[TiltLeft]` means zoom AND tilt
+   are both active. `[Reset]` clears all.
+
+4. **Interaction with cutouts**: Cutout positions are in scene-space, so they
+   transform naturally with the camera. No special handling needed.
+
+5. **Slides stay synced**: Keynote exports are positioned for the base frame.
+   Camera zoom/tilt transforms them identically to everything else.
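
A rough sizing sketch for the overscan point: to avoid upsampling at the strongest zoom, the scene would need to be rendered at least `max_zoom` times the output size. `overscan_size` is a hypothetical helper; it ignores the extra margin rotation needs and rounds up to even dimensions, which encoders using 4:2:0 chroma subsampling typically require:

```python
import math


def overscan_size(out_w: int, out_h: int, max_zoom: float) -> tuple[int, int]:
    """Smallest even scene size that avoids upsampling at max_zoom."""
    w = math.ceil(out_w * max_zoom)
    h = math.ceil(out_h * max_zoom)
    # Round odd values up to the next even number.
    return w + (w % 2), h + (h % 2)


# [Zoom3] is the strongest preset at 150%.
print(overscan_size(1920, 1080, 1.5))
```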
+
+---
+
+## Implementation Plan
+
+### Phase 1: Camera Data Model ✓
+- [x] Add `CameraState` and `CameraEvent` to models.py
+- [x] Add camera effect markers to transformer.py
+- [x] Generate camera keyframes from markers
+
+### Phase 2: Render Pipeline ✓
+- [x] Modify renderer to compose to `[scene]` instead of `[vout]`
+- [x] Add camera transform stage after composition
+- [ ] Handle overscan (render larger, crop to output) - deferred, upsampling OK for now
+
+### Phase 3: Smooth Animation (partial)
+- [x] Support animated transitions between keyframes (linear interpolation)
+- [ ] Implement easing functions as FFmpeg expressions (ease-in, ease-out)
+- [ ] Test with rapid zoom sequences
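
For the open easing item, one candidate is a quadratic ease-out, which can be written both as a Python function (for tests) and as an FFmpeg expression string. Both helpers below are hypothetical, and the quadratic curve is a choice made for this sketch, not something the plan specifies:

```python
def ease_out(p: float) -> float:
    """Quadratic ease-out over progress p in [0, 1]: fast start, slow finish."""
    return 1.0 - (1.0 - p) ** 2


def ease_out_expr(progress_expr: str) -> str:
    """Same curve as an FFmpeg expression; progress_expr must evaluate to 0..1."""
    return f"(1-pow(1-({progress_expr}),2))"


# Progress through a 0.2s transition starting at t=1.0s.
print(ease_out_expr("(t-1.000)/0.200"))
print(ease_out(0.5))
```

The eased progress would replace the linear `(t - t0) / duration` factor in the interpolation expression.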
+
+### Phase 4: Effect Presets ✓
+- [x] Define presets (Zoom0/1/2/3, TiltLeft/Right/NoTilt, Pan*, Reset)
+- [x] Presets defined in `CAMERA_PRESETS` dict in models.py
+- [ ] Support custom parameterized markers `[Zoom:1.35]` - future enhancement