A major investigation has put hard numbers to something the music industry has long suspected: AI companies trained their models on tens of millions of copyrighted songs without permission. According to Engadget, The Atlantic has published four searchable databases listing the music used to train AI models, and the scale is difficult to ignore.
The databases include one with 12 million tracks, another with 9 million, and two more with roughly 100,000 songs each. Artists affected include some of the biggest names in music, from Taylor Swift to Bad Bunny. This is no longer an abstract concern about AI and copyright. There are now public records showing exactly whose work was taken.
Staff writer Alex Reisner at The Atlantic put together the accompanying piece, which connects the databases to the legal battles already in motion. AI music platforms like Suno and Udio have repeatedly leaned on fair use arguments to defend scraping copyright-protected content. Those arguments have had mixed results in court so far, but a related case in book publishing offers an interesting comparison.
In that publishing case, the claim that training on the books was itself copyright infringement did not convince the judge. Allegations that the books had been pirated, however, landed much harder. An initial settlement came in at $1.5 billion, with full results still pending. The music industry is watching that outcome closely, and databases like the ones The Atlantic just published could give music rights holders exactly the kind of evidence they need to bring similar cases.
This investigation is not happening in isolation. AI companies face growing pressure across every creative industry to account for how they built their training datasets. Music, books, visual art, and journalism have all become flashpoints in this debate. What makes the music situation distinct is the emotional and commercial weight of the artists involved. When Taylor Swift’s catalog turns up in an AI training set, it attracts a different level of public attention than a dataset of obscure academic texts.
Streaming platforms have tried to get ahead of the problem with varying approaches:
- Blocking or filtering AI-generated uploads
- Adding labels to flag content made with generative tools
- Working with rights holders to identify AI imitations of real artists
None of these efforts have been especially effective. Scammers have continued to upload AI-generated tracks that mimic existing artists, sometimes collecting royalties before the content gets flagged or removed. The platforms are playing catch-up, and the tools available to them are not keeping pace with how fast the problem is growing.
For the AI companies at the center of this, the path forward is murky. Fair use has always been a complicated legal argument, and courts have not been consistent in how they apply it to AI training. If the music industry takes the piracy angle that worked in publishing, and if databases like The Atlantic’s help rights holders prove which songs were used and when, several of these companies could be looking at serious legal exposure. The $1.5 billion publishing settlement gives everyone a rough sense of what that exposure might look like at scale.
What this investigation does most clearly is shift the conversation from hypothetical to documented. The songs are named. The artists are named. The scale is on the record. That changes what lawyers, legislators, and the public can do with this information.




