PHALAR: Phasors for Learned Musical Audio Representations

Abstract

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework that achieves a relative accuracy increase of up to 70% over the state of the art while requiring fewer than half the parameters and training 7× faster. By means of a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch- and phase-equivariant inductive biases. PHALAR establishes a new retrieval state of the art across MoisesDB, Slakh, and ChocoChorales, and correlates significantly more strongly with human coherence judgments than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structure beyond the retrieval task.
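To give an intuition for the phase-aware ("phasor") embeddings the abstract describes, here is a minimal, self-contained sketch of how a complex-valued similarity can retain phase information that a real-valued cosine similarity would discard. The function names and toy embeddings are illustrative assumptions, not PHALAR's actual implementation.

```python
import cmath
import math

def normalize(z):
    """L2-normalize a list of complex numbers (a toy phasor embedding)."""
    norm = math.sqrt(sum(abs(c) ** 2 for c in z)) or 1.0
    return [c / norm for c in z]

def phasor_similarity(za, zb):
    """Real part of the Hermitian inner product <za, zb>.

    This is maximal only when magnitudes match AND phases align, so two
    embeddings that differ by a per-dimension phase shift are distinguishable:
    the similarity keeps temporal/phase information.
    """
    return sum(a * b.conjugate() for a, b in zip(za, zb)).real

# Toy example: identical magnitudes, but the second dimension's phase differs.
z1 = normalize([cmath.rect(1.0, 0.0), cmath.rect(1.0, math.pi / 4)])
z2 = normalize([cmath.rect(1.0, 0.0), cmath.rect(1.0, math.pi / 2)])

print(round(phasor_similarity(z1, z1), 4))  # 1.0 — identical phasors
print(round(phasor_similarity(z1, z2), 4))  # 0.8536 — phase mismatch lowers similarity
```

A purely magnitude-based similarity would score these two embeddings as identical; the Hermitian inner product does not, which is the sense in which a complex-valued head can encode a phase-equivariant bias.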

Publication
Proceedings of the 43rd International Conference on Machine Learning
Michele Mancusi
Postdoctoral Researcher

PhD Student @SapienzaRoma CS | Intern @Musixmatch | Intern @Microsoft | Research Scientist @Sony | Senior Research Scientist @Moises

Giorgio Strano
PhD Student
Luca Cerovaz
Research Intern
Donato Crisostomi
PhD Student

PhD student @ Sapienza, University of Rome | former Applied Science intern @ Amazon Search, Luxembourg | former Research Science intern @ Amazon Alexa, Turin

Emanuele Rodolà
Full Professor