agents/gerhard-hermes/skills/media/heartmula/SKILL.md

---
name: heartmula
description: "HeartMuLa: Suno-like song generation from lyrics + tags."
version: 1.0.0
metadata:
  hermes:
    tags: [music, audio, generation, ai, heartmula, heartcodec, lyrics, songs]
    related_skills: [audiocraft]
---

# HeartMuLa - Open-Source Music Generation

## Overview
HeartMuLa is a family of open-source music foundation models (Apache-2.0) that generate full songs conditioned on lyrics and tags, with multilingual support. It fills a role comparable to Suno's in the open-source space. The family includes:
- **HeartMuLa** - Music language model (3B/7B) for generation from lyrics + tags
- **HeartCodec** - 12.5Hz music codec for high-fidelity audio reconstruction
- **HeartTranscriptor** - Whisper-based lyrics transcription
- **HeartCLAP** - Audio-text alignment model

## When to Use
- User wants to generate music/songs from text descriptions
- User wants an open-source Suno alternative
- User wants local/offline music generation
- User asks about HeartMuLa, heartlib, or AI music generation

## Hardware Requirements
- **Minimum**: 8GB VRAM with `--lazy_load true` (loads/unloads models sequentially)
- **Recommended**: 16GB+ VRAM for comfortable single-GPU usage
- **Multi-GPU**: Use `--mula_device cuda:0 --codec_device cuda:1` to split across GPUs
- 3B model with lazy_load peaks at ~6.2GB VRAM
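
The tiers above can be turned into a small flag chooser. This is a minimal sketch, not part of heartlib: `recommend_flags` is a hypothetical helper, the 8GB and 16GB thresholds are taken from the list above, and combining the multi-GPU split with `--lazy_load true` on smaller cards is an assumption.

```python
def recommend_flags(vram_gb_per_gpu: float, gpu_count: int = 1) -> list[str]:
    """Map the hardware tiers listed above to command-line flags (hypothetical helper)."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    if vram_gb_per_gpu < 8:
        raise ValueError("below the 8GB minimum listed above")

    flags: list[str] = []
    if gpu_count >= 2:
        # Put the language model and the codec on separate GPUs.
        flags += ["--mula_device", "cuda:0", "--codec_device", "cuda:1"]
    if vram_gb_per_gpu < 16:
        # Below the recommended tier: load and unload models one at a time.
        # Applying this to the multi-GPU case as well is an assumption.
        flags += ["--lazy_load", "true"]
    return flags
```

For example, `recommend_flags(8)` yields `["--lazy_load", "true"]`, while a single 24GB card needs no extra flags.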

## Installation Steps

### 1. Clone Repository
```bash
cd ~/  # or desired directory
git clone https://github.com/HeartMuLa/heartlib.git
cd heartlib
```

### 2. Create Virtual Environment (Python 3.10 required)
```bash
uv venv --python 3.10 .venv
. .venv/bin/activate
uv pip install -e .
```

### 3. Fix Dependency Compatibility Issues

**IMPORTANT**: As of Feb 2026, the pinned dependencies have conflicts with newer packages. Apply these fixes:

```bash
# Upgrade datasets (old version incompatible with current pyarrow)
uv pip install --upgrade datasets

# Upgrade transformers (needed for huggingface-hub 1.x compatibility)
uv pip install --upgrade transformers
```

### 4. Patch Source Code (Required for transformers 5.x)

**Patch 1 - RoPE cache fix** in `src/heartlib/heartmula/modeling_heartmula.py`:

In the `setup_caches` method of the `HeartMuLa` class, add RoPE reinitialization after the `reset_caches` try/except block and before the `with device:` block:

```python
# Re-initialize RoPE caches that were skipped during meta-device loading
from torchtune.models.llama3_1._position_embeddings import Llama3ScaledRoPE
for module in self.modules():
    if isinstance(module, Llama3ScaledRoPE) and not module.is_cache_built:
        module.rope_init()
        module.to(device)
```

**Why**: `from_pretrained` first creates the model on the meta device. `Llama3ScaledRoPE.rope_init()` skips cache building for meta tensors, and nothing rebuilds the cache after the weights are loaded onto the real device.

**Patch 2 - HeartCodec loading fix** in `src/heartlib/pipelines/music_generation.py`:

Add `ignore_mismatched_sizes=True` to ALL `HeartCodec.from_pretrained()` calls (there are 2: the eager load in `__init__` and the lazy load in the `codec` property).

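**Why**: The VQ codebook `initted` buffers have shape `[1]` in the checkpoint but `[]` in the model. They hold the same data, stored as a one-element 1-D tensor in one place and a 0-d (scalar) tensor in the other, so the mismatch is safe to ignore.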
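
If you prefer to script Patch 2 rather than edit by hand, the sketch below appends the keyword argument to every `HeartCodec.from_pretrained(...)` call in a source string. `add_ignore_mismatched_sizes` is a hypothetical helper, not part of heartlib. It finds the closing parenthesis by counting nesting depth, so it assumes the call arguments contain no parentheses inside string literals.

```python
CALL = "HeartCodec.from_pretrained("
KWARG = "ignore_mismatched_sizes=True"


def add_ignore_mismatched_sizes(source: str) -> str:
    """Append KWARG to every HeartCodec.from_pretrained(...) call in source."""
    out = []
    pos = 0
    while True:
        start = source.find(CALL, pos)
        if start == -1:
            out.append(source[pos:])
            return "".join(out)
        args_start = start + len(CALL)
        # Walk forward to the parenthesis that closes this call.
        depth, i = 1, args_start
        while i < len(source) and depth:
            if source[i] == "(":
                depth += 1
            elif source[i] == ")":
                depth -= 1
            i += 1
        if depth:
            raise ValueError("unbalanced parentheses after " + CALL)
        args = source[args_start:i - 1]
        if KWARG not in args:  # leave already-patched calls untouched
            body = args.rstrip()
            tail = args[len(body):]
            if not body:
                args = KWARG
            elif body.endswith(","):
                args = body + " " + KWARG + "," + tail
            else:
                args = body + ", " + KWARG + tail
        out.append(source[pos:args_start] + args + ")")
        pos = i
```

One way to apply it is to read `src/heartlib/pipelines/music_generation.py` with `pathlib.Path.read_text()`, pass the text through the function, write it back, and then check with `git diff` that exactly two calls changed.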

### 5. Download Model Checkpoints
```bash
cd heartlib  # project root
hf download --local-dir './ckpt' 'HeartMuLa/HeartMuLaGen'
hf download --local-dir './ckpt/HeartMuLa-oss-3B' 'HeartMuLa/HeartMuLa-oss-3B-happy-new-year'
hf download --local-dir './ckpt/HeartCodec-oss' 'HeartMuLa/HeartCodec-oss-20260123'
```

All 3 can be downloaded in parallel. Total size is several GB.
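
One way to run the three downloads in parallel is a small wrapper around the same `hf download` commands. This is a sketch under assumptions: `build_download_commands` and `download_all` are hypothetical helpers, the repo ids and directories are copied from the commands above, and `hf` is assumed to be on `PATH`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# (subdirectory under the checkpoint root, Hugging Face repo id),
# copied from the three commands above.
CHECKPOINTS = [
    ("", "HeartMuLa/HeartMuLaGen"),
    ("HeartMuLa-oss-3B", "HeartMuLa/HeartMuLa-oss-3B-happy-new-year"),
    ("HeartCodec-oss", "HeartMuLa/HeartCodec-oss-20260123"),
]


def build_download_commands(ckpt_dir: str = "./ckpt") -> list[list[str]]:
    """Build one `hf download` argument list per checkpoint."""
    commands = []
    for subdir, repo_id in CHECKPOINTS:
        local_dir = f"{ckpt_dir}/{subdir}" if subdir else ckpt_dir
        commands.append(["hf", "download", "--local-dir", local_dir, repo_id])
    return commands


def download_all(commands: list[list[str]]) -> None:
    """Run all commands concurrently; each thread waits on one subprocess."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [pool.submit(subprocess.run, cmd, check=True) for cmd in commands]
        for future in futures:
            future.result()  # re-raises CalledProcessError if a download failed
```

Calling `download_all(build_download_commands())` from the project root starts all three downloads at once and raises if any of them exits with a non-zero status.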

## GPU / CUDA

HeartMuLa uses CUDA by default (`--mula_device cuda --codec_device cuda`). No extra setup needed if the user has an NVIDIA GPU with PyTorch CUDA support installed.

- The installed `torch==2.4.1` includes CUDA 12.1 support out of the box
- `torchtune` may report version `0.4.0+cpu`. This is only package metadata; it still uses CUDA via PyTorch.
- To verify GPU is being used, look for "CUDA memory" lines in the output (e.g. "CUDA memory before unloading: 6.20 GB")
- **No GPU?** You can run on CPU with `--mula_device cpu --codec_device cpu`, but expect generation to be **extremely slow** (potentially 30-60+ minutes for a single song vs ~4 minutes on GPU). CPU mode also requires significant RAM (~12GB+ free). If the user has no NVIDIA GPU, recommend using a cloud GPU service (Google Colab free tier with T4, Lambda Labs, etc.) or the online demo at https://heartmula.github.io/ instead.
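
The "CUDA memory" check can be automated over captured output. The sketch below is illustrative only: `cuda_memory_readings` and `used_gpu` are hypothetical helpers, and the regular expression is modelled on the single example line quoted above, so other log lines may use a format it does not match.

```python
import re

# Matches lines such as: CUDA memory before unloading: 6.20 GB
CUDA_MEMORY = re.compile(r"CUDA memory([^:\n]*):\s*([0-9]+(?:\.[0-9]+)?)\s*GB")


def cuda_memory_readings(output: str) -> list[tuple[str, float]]:
    """Return (label, gigabytes) for every matching line in the output."""
    return [(m.group(1).strip(), float(m.group(2))) for m in CUDA_MEMORY.finditer(output)]


def used_gpu(output: str) -> bool:
    """True if the output contains at least one CUDA memory line."""
    return bool(cuda_memory_readings(output))
```

Taking the maximum of the returned readings gives a rough peak-VRAM figure for a run.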

## Usage

### Basic Generation
```bash
cd heartlib
. .venv/bin/activate
python ./examples/run_music_generation.py \
  --model_path=./ckpt \
  --version="3B" \
  --lyrics="./assets/lyrics.txt" \
  --tags="./assets/tags.txt" \
  --save_path="./assets/output.mp3" \
  --lazy_load true
```
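
When an agent drives generation from Python, the same command can be assembled as an argument list. This is a minimal sketch: `build_generation_command` is a hypothetical helper, the flags mirror the command above, and it assumes the process runs from the `heartlib` root inside the activated virtual environment (so `sys.executable` is the venv interpreter).

```python
import subprocess
import sys


def build_generation_command(lyrics, tags, save_path,
                             model_path="./ckpt", version="3B", lazy_load=True):
    """Assemble the argument list for examples/run_music_generation.py."""
    cmd = [
        sys.executable, "./examples/run_music_generation.py",
        f"--model_path={model_path}",
        f"--version={version}",
        f"--lyrics={lyrics}",
        f"--tags={tags}",
        f"--save_path={save_path}",
    ]
    if lazy_load:
        # Omit the flag entirely otherwise and rely on the script's default.
        cmd += ["--lazy_load", "true"]
    return cmd


def generate(lyrics, tags, save_path, **kwargs):
    """Run generation and raise CalledProcessError if the script fails."""
    subprocess.run(build_generation_command(lyrics, tags, save_path, **kwargs), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.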

### Input Formatting

**Tags** (comma-separated, no spaces):
```
piano,happy,wedding,synthesizer,romantic
```
or
```
rock,energetic,guitar,drums,male-vocal
```

**Lyrics** (use bracketed structural tags):
```
[Intro]

[Verse]
Your lyrics here...

[Chorus]
Chorus lyrics...

[Bridge]
Bridge lyrics...

[Outro]
```
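
The two input formats can be produced programmatically before writing the files passed to `--tags` and `--lyrics`. This is a sketch with hypothetical helper names (`format_tags`, `format_lyrics`); the output shapes follow the examples above, and ending the lyrics with a trailing newline is an assumption.

```python
def format_tags(tags):
    """Join tags with commas and no surrounding spaces, dropping empty entries."""
    cleaned = [t.strip() for t in tags if t.strip()]
    for tag in cleaned:
        if "," in tag:
            raise ValueError(f"tag must not contain a comma: {tag!r}")
    return ",".join(cleaned)


def format_lyrics(sections):
    """Turn (section_name, text) pairs into bracketed blocks separated by blank lines."""
    blocks = []
    for name, text in sections:
        text = text.strip()
        # Sections without lyrics (e.g. Intro, Outro) become a bare tag.
        blocks.append(f"[{name}]\n{text}" if text else f"[{name}]")
    return "\n\n".join(blocks) + "\n"
```

For example, `format_tags([" piano", "happy ", "wedding"])` gives `piano,happy,wedding`, matching the first tag example.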

### Key Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--max_audio_length_ms` | 240000 | Max length in ms (240s = 4 min) |
| `--topk` | 50 | Top-k sampling |
| `--temperature` | 1.0 | Sampling temperature |
| `--cfg_scale` | 1.5 | Classifier-free guidance scale |
| `--lazy_load` | false | Load/unload models on demand (saves VRAM) |
| `--mula_dtype` | bfloat16 | Dtype for HeartMuLa (bf16 recommended) |
| `--codec_dtype` | float32 | Dtype for HeartCodec (fp32 recommended for quality) |

### Performance
- RTF (Real-Time Factor) ≈ 1.0 — a 4-minute song takes ~4 minutes to generate
- Output: MP3, 48kHz stereo, 128kbps
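
These figures allow a quick estimate of run time and output size. The sketch below uses hypothetical helper names and assumes a constant 128 kbps bitrate (1 kbps = 1000 bits per second) with container overhead ignored, so the size is approximate.

```python
def estimate_generation_seconds(audio_ms: int, rtf: float = 1.0) -> float:
    """Generation time = audio duration x real-time factor."""
    return audio_ms / 1000 * rtf


def estimate_mp3_bytes(audio_ms: int, bitrate_kbps: int = 128) -> int:
    """Audio payload size: milliseconds x kilobits per second = bits, then / 8 for bytes."""
    return audio_ms * bitrate_kbps // 8
```

At the default `--max_audio_length_ms` of 240000, this gives about 240 seconds of generation time and 3,840,000 bytes (roughly 3.84 MB) of audio data.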

## Pitfalls
1. **Do NOT use bf16 for HeartCodec** — degrades audio quality. Use fp32 (default).
2. **Tags may be ignored** — known issue (#90). Lyrics tend to dominate; experiment with tag ordering.
3. **Triton not available on macOS** — Linux/CUDA only for GPU acceleration.
4. **RTX 5080 incompatibility** reported in upstream issues.
5. The dependency pin conflicts require the manual upgrades and patches described above.

## Links
- Repo: https://github.com/HeartMuLa/heartlib
- Models: https://huggingface.co/HeartMuLa
- Paper: https://arxiv.org/abs/2601.10547
- License: Apache-2.0