What Even Is Copyright Anymore?
- Thomas Yin
- Mar 24
- 6 min read

Over the past few years, almost every major AI company has faced some sort of court case directly involving alleged copyright abuses in AI training. Although most of these cases are still ongoing, the advent of advanced generative AI trained on (and producing) massive amounts of data has blurred the line between copyrighted materials and works in the public domain. Do artists and news agencies deserve compensation for, or even copyright over, work produced by AI trained on their copyrighted content? As you might expect, the issue is a tad more complicated than a set of legal definitions.
A History of Non-Dispute
Surprisingly, the debate over whether a non-human author of a creative work can hold copyright in that work has been going on for quite a while. In 2011, a macaque (a type of monkey) in an Indonesian nature reserve found a camera set up by a British wildlife photographer and, grinning from ear to ear, took a few photos of itself. When the photographer eventually published these “monkey selfies”, citing the extraordinary event, the good people at PETA, a totally unbiased organization known not to file allegedly frivolous lawsuits about animal rights, decided that it was a good idea to sue him on behalf of the animal, claiming that the publication violated copyright law: the monkey, being the creator, held the copyright to the work and had not consented to its publication.
As expected, the lawsuit didn’t work. An American court of appeals shut down the case, affirming a lower court’s ruling that an animal cannot hold copyright; to be eligible, a creative work must originate with a human author. In fact, U.S. courts and agencies have, time and time again, reinforced the idea that copyright can be held only by a human being, not by an “automatic” program or other non-human. The judicial system powerfully reaffirmed this concept just a few days ago in Thaler v. Perlmutter, declaring that works created by artificial intelligence are not eligible for copyright protection because the Copyright Act of 1976 “requires all eligible work to be authored in the first instance by a human being”.
Yet works with a significant portion of human contribution (the AI Nexus Magazine, for example) are considered eligible for copyright under the Act. And no, prompting doesn’t count as human input, according to the U.S. Copyright Office. While the courts have held firm on the status quo of denying copyright, they may have to reconsider this stance as the abilities and uses of AI evolve.
Intellectual Theft or Fair Use?
One of the most expensive parts of any AI model is data. When large-scale, general-use LLMs first came out, researchers disclosed that the models were trained on datasets like WebText, a massive text corpus scraped from a broad sample of the internet, a well-scrutinized fact that, after the hype died down, hinted at the problems to come. A lot of what is found online is freely available (e.g. Reddit posts, Creative Commons artwork), but sometimes these types of content just do not cut it for training advanced AI models. In 2023, stock image company Getty Images filed a lawsuit against Stability AI, one of the leading companies developing generative image AI, arguing that the company had scraped images off Getty’s site and used them to train its models. Some of the evidence presented is almost comical: images produced by Stable Diffusion, Stability’s generative model, often show a “Getty Images” watermark, distorted and barely recognizable at times. The case is ongoing and expected to produce a final decision sometime after a trial scheduled for the summer of 2025.
The scope of these disputes is not confined to image models. One of the most famous copyright fights over training material involves The New York Times, which famously sued OpenAI and Microsoft, claiming that by training their LLMs on copyrighted New York Times articles found on the internet, the companies infringed the publishing giant’s rights to those articles. The Times argued that:
1. Since AI models sometimes reproduce portions of their training data nearly verbatim, LLM output can constitute an unlawful republishing of parts of a copyrighted work.
2. AI models can summarize or paraphrase exclusive information found in their training data, and LLMs with this capability harm NYT by giving users information that they would otherwise have to obtain through an NYT subscription.
3. Since training data is an integral part of an AI model’s functionality, and NYT articles constitute a significant portion of ChatGPT’s training data, OpenAI is profiting off copyrighted material produced and owned by NYT.
These lawsuits come at a critical time for AI development in the United States. Between a very real race with China to build the most advanced models and the re-examination of longstanding legal precedents in the wake of the AI revolution, traditional ideas of originality and creativity are being challenged by the new ways people have begun using AI.
Issues and Considerations
Why do we call something creative? Think of something you consider creative. Maybe you thought of a cool beat, a beautiful painting, or a hilarious joke, but consider: what do these things have in common? Are they creative because they are fresh and pleasing to a certain aesthetic? Because they were originally produced by an artist? Imbued with a sense of expression? Now, given this traditional perception of creativity, try to fit the idea to modern AI. Although some have attempted to argue otherwise, it is commonly agreed that AI, by merely replicating facets of its training data, cannot produce traditionally original works. Yet a far stronger case can be made that AI produces works that evoke emotion or match a certain aesthetic. Philosophical questions rarely have a defined answer, so it is difficult for me to say whether AI can truly be creative; legal matters (copyright with regard to AI, for example), however, must always be settled with solid, albeit imperfect, solutions.
AI-generated text and video will probably not receive copyright status anytime soon, not only because of the aforementioned strong precedent against attributing copyright to non-human creators but also because there would be no single entity to attribute copyright to: should the hundreds of thousands of creators of training data receive copyright because their work was used to train the AI, or should the company that assembled the product get the final say in how the work is published? In contrast, the attribution of training data is a big question with a more definite answer. Perhaps in response to the lawsuit, OpenAI’s CEO Sam Altman famously claimed a few days ago that if the U.S. does not protect AI companies’ freedom to train on copyrighted works, it will fall behind China in the AI race. While Altman’s points, that copyrighted works are critical to many AI training pipelines and that access to them would most likely improve model quality significantly, are probably correct, his call for unconditional fair use of copyrighted works as training data holds two major inconsistencies:
1. AI companies live off data, and often pay hefty sums for it; in fact, some employ an enormous labor force to manually label data as part of a process called Reinforcement Learning from Human Feedback (a sketch of what those labels train follows this list). AI companies have been known to vehemently pursue deals to procure training data (for example, OpenAI signed major deals with Vox Media and the Associated Press to license the companies’ media content for AI training), yet, as the NYT-backed court case alleges, they frequently do not disclose other sources of copyrighted data. Many expect these disputes to be settled privately, but I believe such settlements overlook the fact that current AI companies treat different media outlets in polarizing ways. If AI companies like OpenAI have a history of licensing other companies’ content, why not strike a deal with The New York Times? While some speculate that the suit could be a negotiation tactic rather than a genuine challenge to an important precedent, it seems logical that AI developers should remain consistent in their policies for acquiring and using training data.
2. AI, even though not at the peak of its potential as a tool, is already being marketed as an effective way for large companies to cut costs and bolster profit (most of the largest corporations have already taken steps to integrate AI into their operations, and smaller companies are expected to follow in their footsteps). High-quality training data, much of it copyrighted, is critical to a model’s performance; without it, AI becomes less convincing as the efficiency tool it is marketed as, likely hurting customer satisfaction and subscription sales. It is paradoxical for companies such as OpenAI to expect free training data under fair use while the AI products that data improves are sold for profit for personal and professional use.
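To make that labeling effort concrete, here is a minimal sketch of the step those human labelers feed in RLHF: a labeler picks the better of two model responses, and a reward model is trained to agree with that choice via a pairwise (Bradley-Terry) loss. Everything below, from the tiny network to the random stand-in embeddings, is a hypothetical toy for illustration, not any company’s actual pipeline.

```python
# Toy sketch of reward-model training from human preference labels,
# the step in RLHF that paid labeling workforces produce data for.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response embedding to a single scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)  # (batch,) rewards

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for embeddings of the response each labeler preferred
# ("chosen") and the one they rejected; a real pipeline would get
# these from the language model being aligned, not random noise.
chosen = torch.randn(32, 128)
rejected = torch.randn(32, 128)

for step in range(100):
    # Bradley-Terry loss: push reward(chosen) above reward(rejected).
    margin = model(chosen) - model(rejected)
    loss = -nn.functional.logsigmoid(margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a real pipeline the reward model is typically a fine-tuned LLM itself, and its scores then steer the main model through reinforcement learning; the point here is simply that every one of those preference labels is paid human work.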
Whatever reforms or preservations of legal stances on copyright and AI lie ahead, we should recognize that the AI we use today is built on trillions of words, typed out by millions of people and dissolved into bits of numbers and word fragments, hopefully to the benefit of society. Even without the wishful thinking, I have to say that this fact alone is somewhat poetic.