Music Transformer

Comparing vanilla attention and relative attention for MIDI generation.

Abstract

Generating music with long-term structure is a challenging task for standard deep learning models. In this project, I reproduced the Music Transformer architecture to generate coherent MIDI sequences.

1. Model Architecture

My implementation modifies the standard Transformer Decoder. Instead of using absolute positional embeddings (which encode where an event sits in the sequence, not how far apart two events are), I injected relative position information directly into the attention mechanism.

The architecture uses a stack of 6 Decoder layers with Skewed Relative Attention to process MIDI events.
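
The sketch below shows how such a stack can be assembled in PyTorch. It is a minimal, illustrative sketch rather than a verbatim excerpt of my training code: the layer count matches the 6 layers described above, but the other hyperparameters, class names, and the `make_attention` factory argument are assumptions; a matching skewed relative attention module is sketched after the Key Contributions list below.

```python
import torch.nn as nn

# Minimal sketch of the decoder-only stack (hyperparameters other than the
# 6 layers are illustrative). `make_attention()` must return a causal
# self-attention module with signature attn(x) -> x.
class DecoderLayer(nn.Module):
    def __init__(self, make_attention, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = make_attention()
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        x = self.norm1(x + self.drop(self.attn(x)))   # masked (relative) self-attention + residual
        return self.norm2(x + self.drop(self.ff(x)))  # position-wise feed-forward + residual

class MusicTransformer(nn.Module):
    def __init__(self, vocab_size, make_attention, n_layers=6, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # one token per MIDI event
        self.layers = nn.ModuleList(
            [DecoderLayer(make_attention, d_model) for _ in range(n_layers)]
        )
        self.head = nn.Linear(d_model, vocab_size)      # next-event logits

    def forward(self, tokens):                          # tokens: (batch, seq_len)
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x)
        return self.head(x)
```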

2. The Core Innovation: Relative Attention

Standard Transformers use “Absolute Positional Embeddings,” which means they treat a melody played at the start of a song as mathematically different from the same melody played later. This makes it hard to generate repeating musical themes (motifs).

To solve this, I implemented the “Skewing” mechanism from the Music Transformer paper, which lets the model attend based on the distance between two events rather than their absolute positions.
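
Below is a minimal sketch of that step, assuming the causal-attention convention from the Music Transformer paper, where the relative embedding table E_rel holds one row per relative distance from -(L-1) up to 0 (last row = distance 0); the function name and exact tensor layout are illustrative.

```python
import torch.nn.functional as F

def skew(rel_logits):
    """Sketch of the skewing trick. rel_logits = Q @ E_rel^T has shape (..., L, L),
    where E_rel holds one embedding per relative distance -(L-1) .. 0. The output
    S_rel satisfies S_rel[..., i, j] = q_i . e_(j - i) for j <= i; entries above
    the diagonal are meaningless and are removed by the causal mask."""
    *batch, L, _ = rel_logits.shape
    x = F.pad(rel_logits, (1, 0))     # prepend one dummy column         -> (..., L, L+1)
    x = x.reshape(*batch, L + 1, L)   # re-wrap rows so each distance lines up with its key position
    return x[..., 1:, :]              # drop the first (garbage) row     -> (..., L, L)
```

The attention logits then become (QKᵀ + skew(QE_relᵀ)) / √D_h. Because E_rel is only an L×D table, no L×L×D tensor of pairwise relative embeddings is ever materialized.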

Figure: visualization of learned attention patterns.
Left: standard attention learns fixed, grid-like positions.
Right: my relative attention learns diagonal bands, showing that it captures relative distance (rhythm) regardless of absolute position.

Key Contributions

  • Relative Attention: Implemented the relative attention mechanism from scratch in PyTorch to capture timing and rhythm.
  • Memory Optimization: Used the “Skewing” algorithm to reduce the memory needed for the relative position terms from $O(L^2D)$ to $O(LD)$ (see the sketch after this list).
  • Infrastructure: Trained on the MAESTRO dataset using 4x NVIDIA A100 GPUs.
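
The sketch below ties these pieces together in a single multi-head causal self-attention module. It is an illustrative reconstruction, not a verbatim excerpt from my code: the class name, defaults, and the choice to share one relative-embedding table across heads are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Multi-head causal self-attention with skewed relative position logits (sketch)."""

    def __init__(self, d_model=512, n_heads=8, max_len=2048):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learned vector per relative distance -(max_len-1) .. 0: an (L, D) table,
        # which is where the O(LD) memory figure comes from.
        self.rel_emb = nn.Parameter(torch.randn(max_len, self.d) * 0.02)

    @staticmethod
    def _skew(rel_logits):
        # Same pad / reshape / slice trick as the skew() sketch in Section 2.
        *batch, L, _ = rel_logits.shape
        return F.pad(rel_logits, (1, 0)).reshape(*batch, L + 1, L)[..., 1:, :]

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        B, L, _ = x.shape                                  # assumes seq_len <= max_len
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, L, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        e = self.rel_emb[-L:]                              # embeddings for distances -(L-1) .. 0
        logits = (q @ k.transpose(-2, -1) + self._skew(q @ e.T)) / math.sqrt(self.d)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), 1)
        attn = logits.masked_fill(causal, float("-inf")).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, L, self.h * self.d)
        return self.out(out)
```

With the stack sketched in Section 1, a model of this kind can be assembled as, for example, `MusicTransformer(vocab_size, make_attention=RelativeSelfAttention)`, where the vocabulary size depends on the chosen MIDI event encoding.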

Resources