With the release of ChatGPT late last year, the conversation around the use of machine learning (ML) has intensified to a fever pitch. Companies are pouring billions of dollars into creating ML tools that write essays, create images, generate voice clips that sound like celebrities, and write code.
The Explosion of Machine Learning Tools
Google raced to demo its ML system, leading to some embarrassment when it invented a fact about the JWST. Microsoft quickly added OpenAI-powered GPT-4 responses to the Bing search engine, which has generated some very creepy conversations. Microsoft has also announced Microsoft 365 Copilot, which uses GPT-4 to draft emails and write documents for you.
Most important to programmers, however, is GitHub Copilot. The code generation tool has been chugging away for over a year now. Last month, though, GitHub announced a new model and a higher-tier plan for businesses. Supposedly, the new model is faster and more capable of creating higher-quality code. According to GitHub, more than 400 companies were using Copilot at launch. More will surely follow.
With all the money flowing to the owners of these ML models, people are asking questions. How is the massive amount of data needed to power these ML models gathered? How is it curated to filter out low-quality content? Can creators keep these systems from scraping their work? Most importantly: if the machine spits out an image or piece of writing or snippet of code derived from work I created, is that a violation of my copyright?
Infringement on Open Source Licenses?
Since Copilot’s launch, some open-source programmers have claimed it can spit out exact copies of their code with minimal prompting. Opting out of having your work scraped is not enough protection against this. If someone else uses your code while following all the license terms, they might not opt out of having their repository scraped, and Copilot will use your code. And, once your work is in the system, there’s no method for removing it.
Completely excluding open-source code from being scraped by ML code generation systems like Copilot would require amending current licenses to forbid this use. Scraping tools must also be updated to recognize and adhere to limitations imposed by restrictive licenses. Punishments must also be strict enough to deter bad actors who ignore such licensing rules.
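To make the second requirement concrete, here is a minimal sketch of what a license-aware scraping filter might look like. Everything in it is an assumption for illustration: the `Repo` record, the `ALLOWED_LICENSES` allowlist, and the `.no-ml-training` marker file are hypothetical, and nothing here describes how Copilot or any real scraper works. The one design choice worth noting is that it denies by default, so a repository with an unknown or missing license is skipped rather than swept in.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical allowlist of SPDX identifiers the scraper's operator treats as
# permitting this use. Which licenses belong here is a legal question that
# this sketch does not answer.
ALLOWED_LICENSES = {"CC0-1.0", "Unlicense", "MIT-0"}

# Hypothetical marker file a repository owner could add to refuse scraping.
OPT_OUT_MARKER = ".no-ml-training"


@dataclass
class Repo:
    name: str
    spdx_license: Optional[str]  # None when no license could be detected
    files: List[str]


def may_scrape(repo: Repo) -> bool:
    """Deny by default: scrape only with an allowed license and no opt-out."""
    if OPT_OUT_MARKER in repo.files:
        return False
    return repo.spdx_license in ALLOWED_LICENSES


def filter_repos(repos: Iterable[Repo]) -> List[Repo]:
    """Keep only the repositories the policy above allows."""
    return [repo for repo in repos if may_scrape(repo)]


if __name__ == "__main__":
    candidates = [
        Repo("public-domain-lib", "CC0-1.0", ["README.md", "lib.py"]),
        Repo("copyleft-tool", "GPL-3.0-only", ["COPYING", "tool.py"]),
        Repo("opted-out", "CC0-1.0", [".no-ml-training", "main.py"]),
        Repo("no-license", None, ["script.py"]),
    ]
    print([repo.name for repo in filter_repos(candidates)])
```

A filter like this only handles the repository's own declared license. It would not catch the case described earlier, where your code is copied into someone else's repository under terms that allow scraping.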
Last November, Matthew Butterick filed a class-action lawsuit against GitHub, Microsoft, and OpenAI, claiming that Copilot violates basically every popular open-source software license. Butterick says that because Copilot can produce a substantial snippet of code identical to an input, it is inherently violating the license of the code scraped as input to the ML model.
Copilot uses all publicly available code on GitHub. That means it’s violating every software license that requires attribution, distribution of the full license text alongside the code, or any other condition attached to the open-source code scraped into Copilot’s training data.
GitHub claims that scraping publicly available code and regurgitating it, either in snippets or as whole copies of original code blocks, is fair use. Butterick is not convinced. The class-action lawsuit argues that, even if a fair use exception is allowed, Copilot is violating the terms imposed by open-source licenses that could still apply even under fair use.
Copyrighting Copilot’s Work?
Another argument is that it’s impossible to copyright work produced by Copilot. The code Copilot spits out isn’t the product of human creativity and thus isn’t copyrightable. This argument has more credence now that the U.S. Copyright Office (USCO) has weighed in. It decided that images generated by Midjourney and used in a comic book were not eligible to be protected under copyright. This is because they “are not the product of human authorship.” However, the layout of the images and the text overlaid on them in the comic book are eligible for protection.
This standard becomes murky when Copilot generates code that’s edited and intermixed with code a person produces. How much editing is necessary to transform Copilot output into something authored by a human? If Copilot generates one file from simple prompts, is that file unprotected by copyright while the rest of the program is protected? The USCO is still determining the answers to these questions. The organization just announced a new initiative to focus on issues raised by works generated by ML systems.
The USCO might determine that work produced by a system like Copilot is ineligible for copyright protection and therefore in the public domain. However, that doesn’t absolve the tool of using protected works as inputs to generate its outputs.
A suit against Midjourney and Stable Diffusion involves many of the same issues Butterick alleges with Copilot. These ML systems grab whatever they can find, scramble it together, and then reproduce recombined bits of those inputs. The artists who created the images these art creation bots scrape have many of the same grievances as open-source programmers. Both groups find their styles, signatures, or code comments reproduced verbatim in these tools’ outputs. It will be up to the courts to decide whether these systems infringe on your copyright.
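Whether verbatim reproduction is infringement is for the courts, but for code, spotting it is something a programmer can do today. The sketch below is one rough way to do that, assuming you have both your original file and a generated snippet as strings. The helper names are hypothetical, and it uses Python’s standard `difflib` rather than anything Copilot-specific. It finds the longest run of consecutive lines the two share after stripping indentation and blank lines.

```python
from difflib import SequenceMatcher
from typing import List


def normalized_lines(code: str) -> List[str]:
    """Strip indentation and drop blank lines so reformatting can't hide a copy."""
    return [line.strip() for line in code.splitlines() if line.strip()]


def longest_shared_run(original: str, generated: str) -> List[str]:
    """Return the longest block of consecutive lines present in both inputs."""
    a = normalized_lines(original)
    b = normalized_lines(generated)
    # autojunk=False turns off difflib's heuristic that ignores very common lines.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]


if __name__ == "__main__":
    mine = "def f(x):\n    # my odd comment\n    y = x * 2\n    return y\n"
    theirs = "def g(z):\n  # my odd comment\n  y = x * 2\n  return y\n"
    for line in longest_shared_run(mine, theirs):
        print(line)
```

A long shared run that includes a distinctive comment is suggestive, not proof. Short, common idioms will match by coincidence, so how many lines count as “substantial” is a judgment call this sketch leaves open.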
Open Source’s Funding Problem
All these issues with ML systems scraping and reproducing information are running headlong into an eternal issue in open source: funding. Many open-source developers donate immense amounts of time and energy to projects without any compensation. In recent years, more platforms such as Open Collective, Patreon, and even GitHub have expanded ways open-source developers can get paid for their work. However, even developers of popular projects must literally beg for payments or find “real” work.
The current model for funding open-source projects is still very flawed. It is still hard to produce open-source code and earn as much as someone making proprietary software. Copilot claims to help open-source development by making it easier to program and freeing up your time. This argument only applies to future projects that agree to pay one of the biggest companies on the planet money so they can more easily take code from older open-source projects. It does nothing to help prior developers who, simply by using GitHub, gave their work to Copilot without compensation.
Let’s put aside the legal issues surrounding licensing and fair use. The bottom line is that these ML tools suck up work without any consideration for the creators. GitHub is not alone. Every such system relies on poorly paid workers to generate and curate vast amounts of data. OpenAI, which got $10 billion in funding from Microsoft, uses workers in Kenya who are paid less than $2 per hour to filter out harmful data.
Meanwhile, some are stripping away the guardrails intended to prevent abuse. Microsoft laid off the entire ethics and society team in its AI division. It is not hard to imagine this team getting in the way of the new craze by raising difficult questions about fairness, compensation, and equity.
The owners of these ML tools, who are pulling in billions of dollars in funding, should compensate the people whose labor they rely on and develop systems for people to see where their work is being used. Copilot is a fancy interface to assemble relevant bits of publicly available source code. Microsoft is charging you $10 per month for access to it and pouring vast amounts of money into expanding this code-generation technology into other areas.
However, none of that money goes to the people who wrote the code that Copilot relies on to give you good results. I’m not saying you should quit using Copilot. But, maybe, as we should have been doing whenever we use open-source code, you should also pitch in $10 a month to the projects you rely on every day. We should also expect that giant corporations with unimaginable resources will compensate the people whose work they use.