Photo Credit: Igor Omilaev
Are gen AI companies actively developing their music models with the same collections of copyrighted tracks? And despite ongoing discussions about free-for-all training, is this process far more systematic than we’ve been led to believe?
These and other pressing questions are taking center stage following an investigative report from The Atlantic’s Alex Reisner, who identified “four giant datasets of songs that are being shared within the AI-development community.”
Off the bat, the present continuous “are being” jumps out here. As of late, with both Udio and Suno fighting to conceal their “training numbers,” there’s been plenty of speculation concerning the precise amount of music used to tailor their models. But what about the scope of their active training processes?
Unsurprisingly, even in light of the report, we don’t have a concrete answer. The report pinpointed four training datasets containing more than 22 million recordings between them – including two collections of roughly 100,000 recordings apiece, one containing 9.7 million songs, and the last with roughly 12.3 million tracks.
According to The Atlantic, the second-largest dataset was compiled by AI researchers associated with Sleeping AI; German AI non-profit LAION put out the biggest of the datasets.
Google and Stability AI have reportedly utilized tracks from one of the 100,000-song datasets, the Free Music Archive. Owing to “the industry’s secrecy around training data, we don’t currently know who has used the others” – though all four are said to have been “downloaded thousands of times” in total, per the report.
Nevertheless, thanks to a dataset search tool, we do know which artists’ releases are part of the libraries.
The presence of hits from commercially prominent acts won’t come as a surprise; roughly 300 Beatles tracks are in each of the two biggest datasets, as are hundreds of songs apiece from Taylor Swift, ABBA, Snoop Dogg, and Michael Jackson, to name a few.
As such, one could simply reiterate that AI music platforms appear to be training on mountains of protected music and are grappling with several related lawsuits. While technically accurate, that conclusion might not tell the full story.
First, the two largest datasets aren’t that big; for reference, when combined, they make up less than 9% of Spotify’s library, based on volume specifics from co-CEO Gustav Söderström and other sources.
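For readers who want to check the math, here is a rough sketch of the arithmetic behind those figures. It uses only the approximate dataset sizes cited above; the dictionary keys are placeholder labels, not the datasets' actual names, and the sums assume no overlap between collections (which likely overstates the number of unique tracks, since some artists appear in both of the biggest datasets). The implied library size is simply what the "less than 9%" comparison works out to, not a figure reported by Spotify.

```python
# Approximate dataset sizes as cited in the article.
# Keys are placeholder labels (assumptions), not official dataset names.
DATASET_SIZES = {
    "small_a": 100_000,
    "small_b": 100_000,
    "second_largest": 9_700_000,
    "largest": 12_300_000,
}

# Combined size of all four datasets, assuming no overlapping tracks.
total_all = sum(DATASET_SIZES.values())

# Combined size of the two largest datasets only.
two_largest = DATASET_SIZES["second_largest"] + DATASET_SIZES["largest"]

# If the two largest datasets are "less than 9%" of a library,
# that library must hold more than two_largest / 0.09 tracks.
SHARE_CEILING = 0.09
implied_min_library = two_largest / SHARE_CEILING

print(f"All four datasets: {total_all:,} recordings")
print(f"Two largest combined: {two_largest:,} recordings")
print(f"Implied library size: more than {implied_min_library:,.0f} tracks")
```

In other words, the "north of 22 million" total holds up, and the 9% comparison only works if the library in question runs well past 200 million tracks.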
We don’t know exactly how these datasets were assembled, but it seems safe to say that they weren’t randomly thrown together.
And this is where things get interesting: There’s a lot more to the datasets than perennial hits released by household names. Although downplayed as efforts from “tens of thousands of minor artists” in The Atlantic’s article, in reality, we’re talking about a huge selection of excellent music put out by extremely talented indies.
In the absence of a breakdown of how the tracks were chosen, we can only speculate. But it’s not a secret that generative models require high-quality songs (and more) for training. And especially in the era of AI slop, not all music is created equal. Are developers zeroing in on strong releases from non-major-label acts in particular?
Given the available evidence, it sure seems like a possibility. The training tracks, some released by an indie artist who’s already litigating against Suno and Udio, probably weren’t selected based on consumption volume, either.
Many of these artists have impressive streaming followings, but the two biggest datasets also contain years-old releases with around 100 streams apiece – great music that one would almost have to seek out for its technical characteristics.
Finally, the datasets were assembled or at least bolstered in the not-so-distant past, as they include projects that dropped in late 2024.