Music Transformer

Comparing vanilla attention and relative attention for MIDI generation.

Abstract

Generating music with long-term structure is a challenging task for standard deep learning models. In this project, I reproduced the Music Transformer architecture to generate coherent MIDI sequences.

1. Model Architecture

My implementation modifies the standard Transformer Decoder. Instead of using absolute positional embeddings (which encode where an event sits in the sequence, not how far apart two events are), I injected relative position information directly into the attention mechanism.

The architecture uses a stack of 6 Decoder layers with Skewed Relative Attention to process MIDI events.
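
The sketch below shows how such a stack can be assembled in PyTorch. It is a minimal, illustrative sketch rather than a verbatim excerpt of my training code: the layer count matches the 6 layers described above, but the other hyperparameters, class names, and the `make_attention` factory argument are assumptions; a matching skewed relative attention module is sketched after the Key Contributions list below.

```python
import torch.nn as nn

# Minimal sketch of the decoder-only stack (hyperparameters other than the
# 6 layers are illustrative). `make_attention()` must return a causal
# self-attention module with signature attn(x) -> x.
class DecoderLayer(nn.Module):
    def __init__(self, make_attention, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = make_attention()
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        x = self.norm1(x + self.drop(self.attn(x)))   # masked (relative) self-attention + residual
        return self.norm2(x + self.drop(self.ff(x)))  # position-wise feed-forward + residual

class MusicTransformer(nn.Module):
    def __init__(self, vocab_size, make_attention, n_layers=6, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # one token per MIDI event
        self.layers = nn.ModuleList(
            [DecoderLayer(make_attention, d_model) for _ in range(n_layers)]
        )
        self.head = nn.Linear(d_model, vocab_size)      # next-event logits

    def forward(self, tokens):                          # tokens: (batch, seq_len)
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x)
        return self.head(x)
```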

2. The Core Innovation: Relative Attention

Standard Transformers use “Absolute Positional Embeddings,” which means they treat a melody played at the start of a song as mathematically different from the same melody played later. This makes it hard to generate repeating musical themes (motifs).

To solve this, I implemented the “Skewing” mechanism from the Music Transformer paper, which lets the model attend based on the distance between two events rather than their absolute positions.
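
Below is a minimal sketch of that step, assuming the causal-attention convention from the Music Transformer paper, where the relative embedding table E_rel holds one row per relative distance from -(L-1) up to 0 (last row = distance 0); the function name and exact tensor layout are illustrative.

```python
import torch.nn.functional as F

def skew(rel_logits):
    """Sketch of the skewing trick. rel_logits = Q @ E_rel^T has shape (..., L, L),
    where E_rel holds one embedding per relative distance -(L-1) .. 0. The output
    S_rel satisfies S_rel[..., i, j] = q_i . e_(j - i) for j <= i; entries above
    the diagonal are meaningless and are removed by the causal mask."""
    *batch, L, _ = rel_logits.shape
    x = F.pad(rel_logits, (1, 0))     # prepend one dummy column         -> (..., L, L+1)
    x = x.reshape(*batch, L + 1, L)   # re-wrap rows so each distance lines up with its key position
    return x[..., 1:, :]              # drop the first (garbage) row     -> (..., L, L)
```

The attention logits then become (QKᵀ + skew(QE_relᵀ)) / √D_h. Because E_rel is only an L×D table, no L×L×D tensor of pairwise relative embeddings is ever materialized.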

Figure: visualization of learned attention patterns.
Left: standard attention learns fixed, grid-like positions.
Right: my relative attention learns diagonal bands, showing that it captures relative distance (rhythm) regardless of absolute position.

Key Contributions

  • Relative Attention: Implemented the relative attention mechanism from scratch in PyTorch to capture timing and rhythm.
  • Memory Optimization: Used the “Skewing” algorithm to reduce the memory needed for the relative position terms from $O(L^2D)$ to $O(LD)$ (see the sketch after this list).
  • Infrastructure: Trained on the MAESTRO dataset using 4x NVIDIA A100 GPUs.
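
The sketch below ties these pieces together in a single multi-head causal self-attention module. It is an illustrative reconstruction, not a verbatim excerpt from my code: the class name, defaults, and the choice to share one relative-embedding table across heads are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Multi-head causal self-attention with skewed relative position logits (sketch)."""

    def __init__(self, d_model=512, n_heads=8, max_len=2048):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learned vector per relative distance -(max_len-1) .. 0: an (L, D) table,
        # which is where the O(LD) memory figure comes from.
        self.rel_emb = nn.Parameter(torch.randn(max_len, self.d) * 0.02)

    @staticmethod
    def _skew(rel_logits):
        # Same pad / reshape / slice trick as the skew() sketch in Section 2.
        *batch, L, _ = rel_logits.shape
        return F.pad(rel_logits, (1, 0)).reshape(*batch, L + 1, L)[..., 1:, :]

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        B, L, _ = x.shape                                  # assumes seq_len <= max_len
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, L, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        e = self.rel_emb[-L:]                              # embeddings for distances -(L-1) .. 0
        logits = (q @ k.transpose(-2, -1) + self._skew(q @ e.T)) / math.sqrt(self.d)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), 1)
        attn = logits.masked_fill(causal, float("-inf")).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, L, self.h * self.d)
        return self.out(out)
```

With the stack sketched in Section 1, a model of this kind can be assembled as, for example, `MusicTransformer(vocab_size, make_attention=RelativeSelfAttention)`, where the vocabulary size depends on the chosen MIDI event encoding.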

Resources