GitHub accused of varying Copilot output to avoid copyright allegations

124 points
1/20/1970
a year ago
by belter

Comments


rolph

[The judge overseeing the case has permitted the plaintiffs to remain anonymous in court filings because of credible threats of violence [PDF] directed at their attorney. The Register understands that the plaintiffs are known to the defendants.]

https://storage.courtlistener.com/recap/gov.uscourts.cand.40...

a year ago

formerly_proven

> go f**g cry about github you f**g piece of s*t n**r, I hope your throat gets cut open and every single family member of you is burnt to death

Are github users gamers? Really puts the "git" into "github" there.

a year ago

arp242

Some more here:

https://storage.courtlistener.com/recap/gov.uscourts.cand.40...

https://storage.courtlistener.com/recap/gov.uscourts.cand.40...

https://storage.courtlistener.com/recap/gov.uscourts.cand.40...

Friendly people.

I've received emails like that too over the years. What hugely controversial thing do I do? I have a website where I sometimes write about $stuff and I post on HN. Keeping the basic info private is probably a good thing especially if they're based in the US, because "SWATting" etc, but beyond that it doesn't seem "credible" in the sense that it's very likely someone will show up at their door with a gun.

Since the first two are redacted, I wonder if they sent them with their real names.

a year ago

z3t4

It can be explained by the normal curve. The bigger your audience is the weirder the outliers will be.

a year ago

arp242

Pretty much, yeah. There are about 26.8 million developers in the world. Assuming 5 million read this story (not everyone speaks English) and 1% of people are a bit unhinged, then you've got 50,000 unhinged people, and only about 0.006% of those 50,000 (or 0.00006% of the total) need to be unhinged enough to actually shoot off an email.
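
A quick check of that arithmetic, as a rough sketch. The function name and the use of exact fractions are my own choices for illustration; the 50,000 figure corresponds to 1% of 5 million readers.

```python
from fractions import Fraction

def email_senders(readers, unhinged_share, sender_share):
    """Return (unhinged people, email senders, senders as a share of all readers)."""
    unhinged = readers * unhinged_share
    senders = unhinged * sender_share
    return unhinged, senders, senders / readers

# 5 million readers, 1% a bit unhinged, 0.006% of those actually send an email.
unhinged, senders, overall = email_senders(
    5_000_000, Fraction(1, 100), Fraction(6, 100_000)
)
print(unhinged, senders, float(overall * 100))  # people, people, percent of readers
```

So a handful of emails is about what this back-of-the-envelope model predicts.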

a year ago

indrora

An impressive number of 4chan's /g/ users are on github. Some even actively contribute to Linux (though usually to Arch, Gentoo, and now more and more Nix).

I wrote a paper during college that I should release some time about when /g/ threw an absolute shitfit over Linus going "so, I've been a kinda shit human being to people and I'm going to step back and get some help", going as far as to blame his daughter/"the woke mob"/multiple named core kernel contributors for killing their god.

At one point, I attended a GitHub event that wasn't directly sponsored by github but encouraged a lot of github users to show up. While there I met several people who, outside the venue, were talking animatedly about Terry Davis. Listening in on the conversation revealed that they more or less just approved of his extensive use of racist language and epithets.

I haven't checked, but I would suspect that Linus' recent "trans rights" by proxy post has caused at least one or two aneurisms in the /g/ user group.

a year ago

pxc

> An impressive number of 4chan's /g/ users are on github. Some even actively contribute to Linux (though usually to Arch, Gentoo, and now more and more Nix).

An aside about this from a moderately longtime Nix user and very occasional Nixpkgs contributor:

I used to occasionally post about Nix on /g/ before virtually anyone there knew what it was just to gauge reactions, and boy were people shitty and dismissive about it. It was all hot takes, broad strokes, and very little curiosity about the technical details. And even though Nix is 'cool' on /g/ now, all of those things are still true about the way /g/ treats NixOS and other distros.

The interest that 90% of /g/ users have in Linux distros like NixOS is as a bullshit status symbol, a token in some consumerist identity game. That shallow, status-obsessed, needlessly edgy type of person is definitely more visible in the Nix(OS) community now than a few years ago, but still sticks out like a sore thumb against the backdrop of longtime Nix users and the culture they've evolved together.

For that reason, I strongly recommend engaging with the Nix community in community-owned channels, like discourse.nixos.org or the community Matrix channels, rather than message boards like 4chan or mainstream social media platforms like Reddit. If you do that, you'll find kinder, more knowledgeable people (and perhaps in some cases, kinder more knowledgeable personas for the same people).

If you're reading this and you've unfortunately encountered Nix 'evangelists' with those shitty attitudes online, please understand that those influences are external to the community, and as far as most participants in the community are concerned, quite unwelcome.

a year ago

Dma54rhs

/g/ is pretty mainstream among the zoomers; they browse it publicly. Also, 4chan is one of the most popular websites on the internet, so it doesn't come as a surprise.

a year ago

edgyquant

A large number of people you meet, from all walks of life, will admit that 4chan is a guilty pleasure. At least I’ve met a ton and none of them were right leaning to say the least.

a year ago

waboremo

That would be my general theory as well, you're far more likely to meet someone who is left leaning who admits to having posted on 4chan (or still does) than you are otherwise. Maybe it has to do with perceived biases, in that a right leaning person/group is probably then going to feel they are aligned with the seediest aspects of 4chan, whether they actually do or not and their perceived social impact/failings for using 4chan.

a year ago

edgyquant

Yeah, moderate to right wingers would probably not admit it except to people close to them due to that perception. But using the site you can tell there are a lot of very intellectual liberals and fiscal conservatives. The racist and sexist stuff is just their equivalent of the dumb Reddit memes that encompass 80% of its content.

a year ago

StrauXX

I would love to read that paper if you do decide to publicise it! 4chan mob dynamics never fail to make interesting (albeit often nasty) stories.

a year ago

web3-is-a-scam

Considering how large GitHub is (in the industry), it’s like asking “are Facebook users gamers?”

a year ago

notjoemama

That's what struck me about it too. Isn't it the case that in a large enough population you can always find representation of something you dislike or hate? I've seen lists of "Republicans" (meaning anyone in, near, or related to a Republican politician) showing those people being caught or convicted of various moral, economic, and social "crimes". Ok. But if I sat down and looked using the same criteria, couldn't I just as easily create a long list for the Democrat party? Having made that statement on Reddit, the response I got was, "well, there are MORE republicans". That struck me as odd too. Are you trying to say that of the two horrible things, one is worse, and so I have a moral imperative to choose the less horrible one? I'm fairly sure I get to abandon both in search of a better option. lol

a year ago

bandyaboot

> I've seen lists of "Republicans" (meaning anyone in, near, or related to a Republican politician) showing those people being caught or convicted of various moral, economic, and social "crimes". Ok.

I’m intrigued. I’d like to see the subset of the list that are people who were in Republican politicians.

a year ago

zer0tonin

No, but I assume a lot of AI bros are

a year ago

obiefernandez

ffs, can we just not make this a thing?

a year ago

edgyquant

Too late, a demographic of people who could just barely scrape together a script making REST requests are now selling themselves as “AI specialists” or “prompt engineers” to the corporate class. These are this cycle's cryptobros, who were mostly not engineers but people riding a hype wave.

The age of the AI bro is here, and as someone who has been in the space for a while, genuinely interested in the models and working with them from time to time, I’m giving a lot of eye rolls in meetings when these people start talking about the underlying tech.

a year ago

arp242

Booming industries always attract their fair share of snake oil peddlers, I suppose. From "Webmasters" who could barely program a VCR during the dot-com era to "SEO specialists" to "Crypto bros" and now "AI bros".

a year ago

foobarbazetc

“Prompt engineers”… sigh.

a year ago

arp242

Some complex shell prompts I've seen people use definitely must have required a fair bit of engineering!

a year ago

hooomil

[dead]

a year ago

testacct22

It's a thing lol

The other day I saw some gym bro in the IG comment section trying to flex on people with "do you even know what backpropagation is?"

a year ago

faangsticle

Too late, the AI bros already did.

a year ago

JVillella

They say crypto is “regulatory arbitrage”, I say this AI co-pilot stuff is “copyright arbitrage”.

Being a bit hand-wavy with it: It’s akin to torrenting music/movies. The torrented files are lossy compressed representations of the original waveform from the music producer. Limewire, or Pirate Bay, or whatever provide an interface to retrieve them (download or stream). The model weights are a form of lossy compression, and inference is like a document retrieval.

One may say, “it’s like an employee working at company X, then going to work at company Y, they retain their knowledge and experience.” I would say it’s more like, employee going from X to Y, but retaining audio and video recordings of all interactions he had, notes, documents, and other proprietary info and bringing it to company Y.

a year ago

soultrees

What would you say the basis of all knowledge you know is? You are a collection of everything you have consumed and the stuff you create is all influenced by that.

Personally this whole LLM debate about copyright is quite funny. As someone who very much has skin in the game (my art is trained on midjourney), and who runs in a circle of artists, it’s interesting to see people’s egos come into play here. The ones who are excited about these as tools are the ones who are openly inspired and want to inspire, whereas the ones who claim copyright infringement seem to come off as insecure, almost like they are afraid that this idea of theirs will be the last great idea they have. There’s already a separation happening in the art world between people who are exploding in creative output and people who are so defensive and cling to the old way of doing things.

If I had my way, I’d see copyright laws abolished completely. A complete free for all in innovation. And people who claim that without patents and copyright there’s no incentive to make money seriously underestimate humans and their ego to continually innovate.

a year ago

saurik

> What would you say the basis of all knowledge you know is? You are a collection of everything you have consumed and the stuff you create is all influenced by that.

FWIW, humans certainly can infringe other people's copyrights and can do so even if they aren't actively intending to do so. There is some boundary across which you are no longer just learning something and you are now copying, and it isn't clear at all that these generative AI techniques are actively considering the latter the way a human is required to.

But, sure: if you are against the idea of copyright entirely then it is hard to consider the idea inconsistent, though I would think a world without copyright would be a particularly hard one for an artist to make money at all...

a year ago

JVillella

>What would you say the basis of all knowledge you know is? You are a collection of everything you have consumed and the stuff you create is all influenced by that.

Surely you're not suggesting that there's no such thing as "original work". The production of which may have very high capital and labour costs - which if not protected from theft - would remove the incentives of producing original work.

>As someone who very much has skin in the game(my art is trained on midjourney)

I don't know your specific situation, but there's obviously different scales of importance here. What if your art was your sole source of income, and people were reproducing it under their own name? or if you had a product where you poured millions into developing some novel IP/methods, and some employee brought it with them when they went to work at your competitors?

a year ago

WalterBright

Over here at the D Language Foundation, we encourage people to download it for free and do whatever they want to with it. It's all Boost licensed.

> some employee brought it with them when they went to work at your competitors?

Other programming languages have copied lots of D features. We at the DLF don't mind at all. Though often they copy them and kinda miss the mark.

(Yes, we sometimes copy features from other languages, too, and try to improve on them.)

a year ago

zzzzzzzza

some things like drug discovery could probably be done with a bounty system rather than intellectual property, and could probably get much better results for a fraction of the cost for maintaining the intellectual property component of the court system

a year ago

lewhoo

> The ones who are excited about these as tools are the ones who are openly inspired and want to inspire however the ones who claim copyright infringement seem to come off as insecure

Yeah yeah, your side are the good guys and the other side is a bit dodgy.

a year ago

vkou

> What would you say the basis of all knowledge you know is?

When someone teaches me, they don't own all my future creative output.

When someone teaches an AI, they do.

That's the principal difference between human learning and machine learning.

a year ago

bhattid

I think what you suggested is an unpopular opinion, but I also wholeheartedly agree with it. :)

I'm certainly no expert on copyright law, but my understanding is that its purpose is to protect the financial interests of certain creators from the progress of technology (e.g. copy paste). I've heard arguments that removing copyright would lead to less creativity or reduced quantity or quality of work, but I'm personally a bit skeptical (probably for the same reasons as you - I think people have a natural desire to create). Even in terms of financials, I would speculate that an employment/patronage model would become more widespread.

I think there's something to be said about the benefits of having freely available knowledge, music, and art for common consumption. When I was a child in high school (or well, always lol), my parents couldn't afford a lot of material I needed or wanted for studying (especially for standardized testing, SAT and AP tests) and most of the books in my local library either did not exist or were outdated. But when I discovered that much of this information could be found online, it really changed my world and made success in life feel attainable to me. I consider myself quite wealthy now, but I don't think I would have been able to escape poverty if all this information was paywalled from me. Maybe others would argue the writers are not being compensated for their efforts, but if there are other people in the world in the same position as past me who could positively benefit from it, I think that's a better world to live in, personally.

Incidentally, the release of StableDiffusion has actually inspired me to draw a little. Not sure why, but I find it inspiring being able to iterate on a prompt and produce something of quality that I can try to replicate on my own. Even if I fail, I still have something to appreciate that maps fairly well to the concept in my head.

My hope is that these technologies might lead to a change in our financial system (I think UBI would be a good idea), but I suppose we'll see where everything ends up. I think there's likely going to be a lot of pain in the short-term (especially since there are those who don't want to adapt), but hopefully everyone will positively benefit in the long-term.

a year ago

JVillella

I really appreciate your personal experiences and how the availability of knowledge online changed your life. It did the same for me as well.

a year ago

popalchemist

When it comes to copyright infringement vs fair use in US law there are certain requirements that have to be met, one of which is "transformativity".

This concept has specific technical meaning -

https://www.nolo.com/legal-encyclopedia/fair-use-what-transf...

It seems obvious to me that to call model weights "lossy compression" is not only incorrect from a technical (software dev) point of view, but also from this legal perspective.

The weights serve a different purpose than the original works from which they are derived, and wouldn't/couldn't POSSIBLY exist were it not for the original work of the authors of the models.

It's bad practice to go around espousing strong and condemnatory opinions about topics you don't have a full grasp of. In this case, it's both the technical details and the legal system.

It makes you look like a fool and costs you your credibility amongst peers in future encounters.

a year ago

JVillella

Model weights are a form of data compression. Have a look at variational autoencoders for example where this is made more explicit (its latent space is a compression).

Your argument regarding "transformative work" is what the discussion is all about. Let's see where it lands with case law.

FWIW, my stance is copyrighted content should not be used in training without request.

>It's bad practice to go around espousing strong and condemnatory opinions about topics you don't have a full grasp of. In this case, it's both the technical details and the legal system.

>It makes you look like a fool and costs you your credibility amongst peers in future encounters.

Agree, thank you for the feedback. My analogy was quite exaggerative.

a year ago

jazzyjackson

> It's bad practice to go around espousing strong and condemnatory opinions about topics you don't have a full grasp of.

I disagree, the internet wouldn't be half as full of knowledge as it is if it weren't for the loudly ignorant giving the experts somebody to correct.

a year ago

mafuy

Afaik the currently best compression techniques use a neural network internally. So the distinction is perhaps not as clear as you might think.

a year ago

mjburgess

I call it copyright laundering

a year ago

az226

Yea but only if you get to download a few seconds of the movie and not more.

a year ago

ldehaan

[dead]

a year ago

ShamelessC

Eh, their argument is simply that they tuned temperature settings to encourage the model to output slight variations on memorized data. But this is kind of just one of many things you do with a language model and certainly doesn’t imply intent to avoid copyright allegations.

Just implies they tuned it for user experience.

I was expecting there to be some discovery around them deliberately fine tuning their model to output modifications if and only if the code had a certain license.

a year ago

kevingadd

What's the value of slight variations? Isn't it more likely that the memorized data was already known to be good and effective? It doesn't seem like a useful change unless your goal is to avoid infringement. I don't see how randomly permuting the suggestions improves UX.

a year ago

moyix

The lowest temperature isn't always the one that results in working code! This was shown in the original Codex paper:

> When evaluating pass@k, it is important to optimize sampling temperature for the particular value of k. In Figure 5, we plot pass@k against the number of samples k and the sampling temperature. We find that higher temperatures are optimal for larger k, because the resulting set of samples has higher diversity, and the metric rewards only whether the model generates any correct solution.

> In particular, for a 679M parameter model, the optimal temperature for pass@1 is T∗ = 0.2 and the optimal temperature for pass@100 is T∗ = 0.8. With these temperatures, we find that pass@1 and pass@100 scale smoothly as a function of model size (Figure 6).

So even with pass@1 (likelihood of getting the right answer in 1 attempt) you don't use T=0, so there will be slight variations in the output each time.
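
For anyone unfamiliar with pass@k, here is a minimal sketch of how it can be estimated from n generated samples of which c pass the tests. The function name `pass_at_k` is my own; the formula is the standard "at least one of k draws without replacement is correct" estimate, which as far as I know is the form the paper uses.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples, c of which are correct.

    This is 1 minus the probability that k samples drawn without
    replacement from the n are all incorrect.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k=1 this reduces to the plain fraction of correct samples, which is why diversity (higher temperature) only starts to pay off for larger k.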

a year ago

Brian_K_White

Why else bother with such an input? Are randomizations more likely to be correct or more useful?

a year ago

cubefox

Well, temperature 0 means the completion is always the most "likely" (or "best", after fine-tuning) token, while temperature 1 means to choose the next tokens stochastically according to their probability (or "goodness" after fine-tuning). Usually some temperature in between is chosen, like 0.7. It's not a priori clear to me which is the best way to do it.
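
To make that concrete, a toy sketch of temperature sampling over raw scores (logits). The function names and the three-token example are made up for illustration; real implementations work on tensors, but the idea is the same.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Pick a token index: argmax at temperature 0, otherwise a weighted draw."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [1.0, 3.0, 2.0]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.2))  # sharper: mass concentrates on the top token
print(sample_token(logits, 0))                # always the highest-scoring token
```

Lower temperatures push the distribution toward the single most likely token; higher ones flatten it.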

a year ago

seanhunter

Generally the reason behind adding randomness to machine learning is avoiding "local minima" in the search space of the optimization function(s) used for training the model. If your training produces a very smooth descent to an optimum it can lead to the model converging on a solution that is not globally the best. Adding some randomness helps to avoid this.

Specifically for GPT models, the temperature parameter is used to get outputs which are a bit more "creative" and less deterministic. https://help.promptitude.io/en/ai-providers/gpt-temperature

a year ago

slashdev

I don't know much about AI, but I think one reason you might do that is to learn which variations are preferred (which are committed unmodified) so you can tune the model in the future. I don't know if Github does that, but given they've cited how often code from copilot is committed without modification, I assume they are measuring it at least in some cases.

a year ago

Brian_K_White

makes sense

a year ago

brookst

Huge topic, worth Googling. Short version is that too little randomness limits the solution space, so retrying suboptimal results yields the same problems.

a year ago

ianbutler

Potentially more correct, yes. It frees the model to choose lower probability tokens to some degree, technically it boosts their probabilities, which may be more correct depending on the task.

There are also sampling schemes, top_p and top_k which can each individually help choose tokens that are less probable (but still highly probable) but more correct, and they are often used together for the best effect.

And then there are various decoding methods like beam search where choosing the most optimal beam may not mean the most optimal individual token.

By default a simple greedy search is used which always chooses the next highest probability token.
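
A rough sketch of what those schemes do to a next-token distribution. The helper names and the four-token distribution are hypothetical; this is the general idea, not any particular library's implementation.

```python
def greedy(probs):
    """Greedy choice: index of the single most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]   # hypothetical next-token probabilities
print(greedy(probs))             # index of the top token
print(top_k_filter(probs, 2))    # only the two most likely tokens survive
print(top_p_filter(probs, 0.9))  # tokens kept until 90% of the mass is covered
```

Sampling then happens only among the surviving tokens, which cuts off the long tail of unlikely ones while still allowing some variation.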

a year ago

GuB-42

It is worthwhile with creative writing. For example if you ask ChatGPT to write a short story, you want some originality. Even when asking for an explanation it can be useful as you may want to try different things for the explanation that speaks to you the most.

But here we are talking about autocompleting code. I don't think programmers want the autocompleter to be creative. They want the exact same solution everyone uses, hopefully the right one, with only minor changes so that it matches their style and uses their own variable names. In my case, I am the programmer, I decide what to do, I just want my autocompleter to save me some keystrokes and copy-pasting boilerplate from the web, the more it looks like existing code the better. I have enough work fixing my own bugs, thank you.

Speaking about bugs, how come everyone talks about code generation that, I think, doesn't bring that much value. Sure, it saves a few keystrokes and copy-pasting from StackOverflow, but I don't feel like it is the thing programmers spend most of the time doing. Dealing with bugs is. By bugs, there are the big ones that have tickets and can take days to analyze and fix, but also the ones that are just a normal part of writing code, like simple typos that result in compiler errors. I think that machine learning could be of great help here.

Just a system that tells me "hey, look here, this is not what I expected to see" would be of great help. Unexpected doesn't mean there is a bug, but it is something worth paying attention to. I know it has been done, but few people seem to talk about it. Or maybe a classifier trained on bug fix commits. If a piece of code looks like code that has been changed in a bug fix commit, there is a good chance it is also a bug. Have it integrated to the IDE, highlight the suspicious part as I type, just as modern IDEs highlight compilation errors in real time.

a year ago

2gremlin181

Ye olde Bias-Variance tradeoff

a year ago

golemotron

Yes.

a year ago

jimnotgym

Taking code off github, changing it a bit and passing it off as one's own crosses a line. Now we really can't tell the AI from the humans!

a year ago

unkulunkulu

oh come on, which code? writing imports? or iterating over collections? or am I underusing copilot? :)

I basically use it as stackoverflow on steroids. it is not even close to gpt-4 in terms of reproducing some original idea I could not find in a search engine

a year ago

missingdays

Why would you ever write imports? IDEs autocomplete them for you

a year ago

unkulunkulu

Copilot understands some conventions when there’s more than one way. I used it extensively with React Bootstrap, where I decided to go with the recommended way of importing each component, like import Tab from 'react-bootstrap/Tab'. It also knows which components are used in the file.

a year ago

sureglymop

But that's pretty much what copilot is... It's just Intellisense 2.0 and I would say even only marginally more useful. You can't even really instruct it except with some comments which may not work.

a year ago

beezlewax

Copilot really messes up imports for me sometimes. The IDE does a better job via autocomplete when it is turned off.

a year ago

taneq

Isn’t “rewrite the example code in your own style” accepted best practice for human coders, when working from an example that does what you need?

I’m not sure what would be acceptable output for a code generation tool if rewriting the examples isn’t ok and reimplementing something that performs the same function still isn’t ok. Are we automatically granting de-facto code patents on all published code now?

a year ago

l__l

The point here is that this isn't some example from a textbook or even stack overflow, but licensed pieces of work with all the legal complications that come with that. This is about the potential use of this code in proprietary code (or code otherwise incompatible with the original licenses), and I really don't think anyone would say it is "accepted best practice" to copy out someone else's work you find online, licenses be damned, in a professional setting.

a year ago

542458

> this isn't some example from a textbook or even stack overflow, but licensed pieces of work with all the legal complications that come with that

I understand why these might feel different to you, but textbooks and stack overflow are also proprietary, licensed pieces of work. I don’t see why there would be much of a legal distinction.

a year ago

salawat

No, you're missing the point.

There are two worlds.

In one, every time someone publishes code with a license attached, they've taken a chunk out of the set of valid lines of software capable of being permissibly written without license encumbrance. This is the world the poster you are replying to is imagining we're headed toward, and this case basically does a fantastic job of laying a test case/precedent for.

The other world, is one where everyone accepts all programming code is math, and copyrighting things is like erecting artificial barriers to facilitate information asymmetry. I.e. trying to own 2 + 2. In this second hypothetical world, we summarily reject IP as a thing.

The 2nd world is what I'd rather live in, as the first truly feels more and more like hell to me. However, given the first one is the world we're in, I'd like to see the mental gymnastics employed to undermine Microsoft's original software philosophy.

EDIT: Voir dire will be a hoot. Any wagers on how many software people make it onto the jury if any?

a year ago

harles

> In one, every time someone publishes code with a license attached, they've taken a chunk out of the set of valid lines of software capable of being permissibly written without license encumbrance.

If this were true of copyright, we would’ve run out of permissible novels a long time ago. There’s plenty to complain about with how software IP works, but copyright seems pretty sane. The alternative of protecting IP via trade secret is not a world I want to live in. That seems bad for open source.

a year ago

mitthrowaway2

Code is a more restrictive space than prose. Prose has to be grammatical and meaningful, but code has to compile and efficiently serve a useful specification.

The central idea of programming languages is that the grammar is very restrictive compared to natural languages. It's quite likely that, with the exception of variable names and whitespace, some function you wrote to implement a circular buffer is coincidentally identical to code that exists in Sony's or Lockheed Martin's codebases.

Plus there's the birthday problem -- coincidences can happen way more than you expect. And even with prose, constraints like non-fiction can narrow things down quickly. If everyone on HN had to write a three-sentence summary of, say, how a bicycle works, there would probably be coincidentally identical summaries.
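
For a feel of the numbers, a minimal sketch of the birthday calculation. It assumes n independent, uniformly random picks from d equally likely options, which real code and prose certainly are not, so treat it as an illustration of the effect rather than a model of codebases.

```python
def collision_probability(n, d):
    """Probability that at least two of n uniform draws from d options coincide."""
    if n > d:
        return 1.0  # pigeonhole: more draws than options guarantees a repeat
    p_unique = 1.0
    for i in range(n):
        p_unique *= (d - i) / d  # the (i+1)-th draw must avoid the i earlier ones
    return 1.0 - p_unique

# Classic case: 23 people, 365 possible birthdays.
print(collision_probability(23, 365))
# A much bigger space still collides once enough people draw from it.
print(collision_probability(10_000, 1_000_000))
```

The number of draws needed for a likely collision grows roughly with the square root of the number of options, not with the number of options itself.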

a year ago

harles

Three sentence summaries probably wouldn’t qualify for copyright protection. The same should be true of code - if we think the standard for copyright protection is too low, we should raise the bar on complexity requirements, not throw out copyright.

Even if a programming grammar is more restrictive, there’s some length where things become almost certainly unique.

a year ago

edgyquant

ReactOS actually got sued by Microsoft for stealing code and one of their proofs was a piece of code (can’t remember exactly what it did) that basically matched the same function in the Windows code with a few things changed.

It was ASM code I think, and their defense was that there was basically one way to write a function that does this.

a year ago

moyix

I think you're misremembering here; as far as I know (and as far as I can tell from searching just now) MS has never sued ReactOS. There was a claim made back in 2006 on the mailing list that a portion of syscall.S was copied, and this caused ReactOS to do their own audit:

https://en.wikipedia.org/wiki/ReactOS#Internal_audit

a year ago

quesera

It raises an interesting question though.

Aside from obligatory syntactic bits, what is the most common line of code across all software ever developed?

It'll probably be C or Java. HTML doesn't count.

And it's probably something boring like:

  i++;

a year ago

l__l

I don't think this dichotomy is at all fair. Just because someone makes a piece of software public does not mean they want it freely copied, and I think that can be a completely reasonable stance to have. I'm struggling to make sense of your argument unless you believe either:

- Code is not intellectual property; I don't see this as easily defensible. It takes time, effort, and in some cases seriously heavy resources to come up with some of the tech companies rely on. Should all private companies rescind copyright on literally everything their staff write?

- Intellectual property is a nonsense concept altogether; in this case, I don't think you're ever going to get your way in the court of public opinion.

a year ago

williamcotton

a year ago

rolph

in many cases a snip;routine;proc...whatever you work with, is rote procedure. such as device access. ie retrieving a directory listing.

code that reverts to a conserved sequence of bytes interchanged ,no functional variations.

code that is so common knowledge it has become street graffiti, belongs in world 2

versus code that creates a functionality not available by direct command, is innovative and should be attributed. this sounds like what 1st world should be.

a year ago

williamcotton

That’s not actually how it works. Purely functional code, such as code that is written in a certain way to achieve maximum performance, is not deemed expressive and therefore not covered by copyright. This code would be covered by patent.

a year ago

rolph

i think we are actually talking about the same thing.

in simple terms:

mov ebx eax ; an obvious function; no IP

mov eax eax ; seems useless unless you know what de-referencing is. probably IP

this is of course an example, not considering granularities at the level of patents on a language, or macro directives

a year ago

jazzyjackson

"Isn’t “rewrite the example code in your own style” accepted best practice [...]?"

Why would it be? If a function performs the data transform I need you better believe I'm copy pasting that sucker with a hyperlink to where I found it

But then again, I'm not trying to win in court.

a year ago

rolph

what would happen without that hyperlink? the overall issue seems to be a lack of attribution to the originator.

a year ago

patmcc

That depends a lot on the license - some require attribution, some don't, some care not a bit (in that they don't permit copying).

a year ago

rolph

proper attribution to the writer seems to be a big part of this. there is also a suggestion that MS knows all about it but passes the liability buck to the end user of copilot suggestions.

[Lawyer and developer Matthew Butterick announced last month that he'd teamed up with the Joseph Saveri Law Firm to investigate Copilot. They wanted to know if and how the software infringed upon the legal rights of coders by scraping and emitting their work without proper attribution under current open-source licenses.]

https://www.theregister.com/2022/11/07/in_brief_ai/

https://www.theregister.com/2022/10/19/github_copilot_copyri...

a year ago

layer8

Mitigating copyright issues by “rewriting in your own style” arguably only applies to humans doing the rewriting as a creative task, because copyright only applies to human creative works.

a year ago

waboremo

I can't recall a single time that's been common advice given to programmers. It's usually either don't reinvent the wheel (therefore use the source while adhering to license), or come up with your own solution.

Don't know how you would even write code in your own style. As soon as you start altering it, the result is different. It's more/less efficient.

a year ago

toast0

How do you like to name variables? Do you use constant == variable or variable == constant? Tabs vs spaces. Declarations inline with first use, or at the top as K&R intended. Comment syntax and content. Etc.

Lots of little things.

a year ago

njharman

Depending on the language there are a ton of style choices. Style guides cover examples of the trivial ones.

Non-trivial ones include names, comments, logging, error checking, structure, ordering of operations that aren’t sequential.

a year ago

waboremo

Yes, but all of those have an impact on the actual function and performance of the proposed solution. By doing so, you are changing the solution.

Look at FizzBuzz. If you were to set strict requirements on performance (and allow for reiterative testing), the results from different groups of people would be identical. They would reach the same conclusion because that's how code works, it's far more aligned to math than it is creative writing.

So you cannot take an existing code solution and translate it to your own style. You are altering the program, the efficiency, and therefore the solution itself. Even when you do something like changing 1 single variable name!
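
One way to probe the variable-name part of this for a specific case, as a minimal sketch assuming CPython: write the same FizzBuzz twice with different local names and compare both the output and the compiled bytecode. The function names `fizzbuzz_a` and `fizzbuzz_b` are invented for illustration. This only says something about local names in this one interpreter; names can still matter elsewhere (exported symbols, reflection, debug info).

```python
def fizzbuzz_a(limit):
    out = []
    for i in range(1, limit + 1):
        if i % 15 == 0:
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out


def fizzbuzz_b(upper_bound):
    # Same logic as fizzbuzz_a; only the local variable names differ.
    results = []
    for number in range(1, upper_bound + 1):
        if number % 15 == 0:
            results.append("FizzBuzz")
        elif number % 3 == 0:
            results.append("Fizz")
        elif number % 5 == 0:
            results.append("Buzz")
        else:
            results.append(str(number))
    return results


# Behavior: both versions produce the same list.
same_output = fizzbuzz_a(15) == fizzbuzz_b(15)

# Compiled form: CPython refers to locals by slot index, so the raw
# bytecode does not contain the local names themselves.
same_bytecode = fizzbuzz_a.__code__.co_code == fizzbuzz_b.__code__.co_code

# The names survive only as metadata next to the bytecode.
names_a = fizzbuzz_a.__code__.co_varnames
names_b = fizzbuzz_b.__code__.co_varnames
```

Structural changes (reordering the checks, building a string instead of a list) are a different matter and generally do change the compiled result.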

a year ago

williamcotton

I interpreted the comment you are responding to as “make sure it uses the same style conventions as the rest of this file”, which is something that Copilot does very well!

a year ago

mistrial9

this comment really hits hard for me -- it's like there is a place to buy food where every menu item is clearly shown, with a large color picture and a printed price.. and the person talking has only ever purchased food in that way.. as if there are no alternatives that "really exist"

there really are a lot of other scenarios that involve writing software, to make software. It's not possible to list them all.. the list changes while I type

a year ago

WalterBright

One of the specific complaints is:

https://devclass.com/2022/10/17/github-copilot-under-fire-as...

It's a 25 or so line function that looks like a pedestrian implementation of a sparse matrix transpose algorithm. The author should have patented it to protect it, not copyrighted it.
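
For readers who haven't met the algorithm, here is a minimal sketch of a sparse matrix transpose in compressed sparse row (CSR) form. It is not the code at issue in the complaint; the function name `transpose_csr` and its argument layout are assumptions chosen for illustration. It shows the count, prefix-sum, scatter shape that implementations of this operation commonly share.

```python
def transpose_csr(n_rows, n_cols, indptr, indices, data):
    """Transpose a CSR matrix, returning the transpose in CSR form.

    indptr has n_rows + 1 entries; row i owns indices[indptr[i]:indptr[i+1]].
    """
    nnz = indptr[n_rows]

    # 1. Count how many stored entries fall in each column.
    counts = [0] * n_cols
    for k in range(nnz):
        counts[indices[k]] += 1

    # 2. Prefix sum: column counts become row pointers of the transpose.
    t_indptr = [0] * (n_cols + 1)
    for j in range(n_cols):
        t_indptr[j + 1] = t_indptr[j] + counts[j]

    # 3. Scatter each entry into the next free slot of its new row.
    next_slot = t_indptr[:-1]  # slicing makes a copy
    t_indices = [0] * nnz
    t_data = [0] * nnz
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            dest = next_slot[j]
            t_indices[dest] = i
            t_data[dest] = data[k]
            next_slot[j] += 1

    return t_indptr, t_indices, t_data


# The 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form.
t_ptr, t_idx, t_val = transpose_csr(2, 3, [0, 2, 3], [0, 2, 1], [1, 2, 3])
```

Given the storage format, there is not much room to vary these three steps, which is part of why independent implementations tend to look alike.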

a year ago

rkagerer

The plaintiffs were granted anonymity due to credible threats against their attorney. Is there any mechanism other than publication ban that ensures the protection? Can't someone just attend the day of the hearing to see who the attorneys are?

EDIT: Apparently the lawyers are attending via Zoom.

a year ago

coryrc

The plaintiffs, not the plaintiffs' lawyers.

a year ago

cmrdporcupine

Copilot is to license violations (esp of copyleft licenses) what cryptocurrency mixers are for money laundering.

My employer (IMHO smartly) forbids use of LLMs in company IP and company laptops, etc. Many others I'm sure are doing the same, and many others will follow.

a year ago

bushbaba

Once the IP rules are figured out it’ll open the door to a lot of use cases. This reminds me more of p2p file sharing being a precursor to paid streaming services.

a year ago

theRealMe

Nobody uses copilot intentionally to violate copyright law. People do use crypto mixers intentionally to violate money laundering laws.

a year ago

SpicyLemonZest

Nobody affirmatively says “yes, my goal is to violate copyright law, and Copilot is the best tool I’ve found”. But it doesn’t seem impossible to me that the value of Copilot comes partially from the fact that it can copy paste code from copyrighted repositories in ways which would be illegal for you or me to do. I’m not sure it’s proven yet but I wouldn’t be shocked if it is in the future.

a year ago

shagie

It provides the same value as someone who copies and pastes code from Stack Overflow or any of the predecessors without concerning themselves with the license.

I am certain that I can find code from Linux or gcc or emacs on Stack Overflow that is under a GPL license and not compatible with the CC license Stack Overflow uses... and yet it's there. What's more, people will copy that code into their own ignoring the CC license too.

How is that really any different than using Copilot if the original license and attribution are something to respect?

Note that I do think that the original license is something to respect which is why for any of the code that I write that has copyright that matters on it (toy program for home? meh. Hobby project repo that I'm working on that I'll publish? yep. Employer's code for work? absolutely.) I either don't touch questionable sources or run a license check on it when using it.

The key thing is that I don't consider the use of Copilot to be any more controversial than copying from Stack Overflow - which has been done by countless programmers for a decade before Copilot existed and no one got up in arms about it then.
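
As a minimal sketch of what the first pass of such a license check might look like: scan a pasted snippet for `SPDX-License-Identifier:` tags and flag anything outside an allow-list. The function names and the `ALLOWED` set are hypothetical, and a missing tag is treated as "unknown, go look" and not as permission. Real code often carries no tag at all, so this is a starting point at best.

```python
import re

# SPDX short-form tags look like: "SPDX-License-Identifier: MIT"
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")

# Hypothetical allow-list; which licenses are acceptable is a policy call.
ALLOWED = frozenset({"MIT", "BSD-3-Clause", "Apache-2.0"})


def find_spdx_ids(snippet):
    """Return the license expressions named in SPDX tags in the snippet."""
    found = []
    for line in snippet.splitlines():
        match = SPDX_TAG.search(line)
        if match:
            # Drop a trailing block-comment terminator such as "*/".
            found.append(match.group(1).replace("*/", "").strip())
    return found


def needs_review(snippet, allowed=ALLOWED):
    """True if the snippet has no SPDX tag or names a license not allowed."""
    ids = find_spdx_ids(snippet)
    return not ids or any(license_id not in allowed for license_id in ids)


gpl_snippet = "// SPDX-License-Identifier: GPL-2.0-only\nint x;\n"
mit_snippet = "/* SPDX-License-Identifier: MIT */\nint y;\n"
untagged_snippet = "int z;\n"
```

Here `gpl_snippet` and `untagged_snippet` would be flagged for review, while `mit_snippet` would pass this particular allow-list.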

a year ago

cmrdporcupine

Browsing Stack Overflow and even blindly copying and pasting is an intentional action that the user carries out through their own research, and the source of the material pasted is known or discoverable.

Using Copilot is an automated process, and the source of the material used in learning is deeply obfuscated in the learning model.

That's why I make the analogy back to cryptocurrency mixers.

a year ago

cmrdporcupine

Copilot is a product -- at least indirectly -- of Microsoft, a company who for about a decade made very public pronouncements about how they disagreed with the GPL (or copyleft generally), found it problematic, and tried actively to discourage its use.

Today's MS isn't really the same, and they've clearly made their peace with Linux. But it still happens that the GPL is in some fundamental ways at odds with commercial exploitation of open source code. So any corporate entity is going to struggle with it because at best it requires being very careful in distribution, or trying to negotiate or cut a deal with the licensor. At worst it can lead to legal problems and IP leakage on your own product.

So, not claiming any conspiracy. Or any intent to violate. But it is in the convenient interests of companies like MS/OpenAI/GitHub to treat open source work as effectively public domain rather than under copyright, and to push the limits there.

The risk to an employer is of course the accidental introduction of such copylefted material into their code-base through copilot or similar tools.

I suspect two sources of disconnect with the broader community on Hacker News, which doesn't seem to see the issue here:

a) Many of the folks on this forum are working in the full-stack/web space where fundamentally novel, patented, or conceptually difficult algorithms and data structures are rare. For them Copilot is an absolute blessing in helping to reduce the tedium of boilerplate. However in the embedded systems, operating systems, compiler, game engine dev, database internals etc. world there are other aspects at work. In certain contexts, Copilot has been shown to reproduce complicated or difficult code taken from copyrighted or copylefted (or maybe even patented) sources without attribution. And apparently now with some explicit obfuscation.

To put it another way: it's unlikely that Copilot's going to violate licenses with its assistance with turning your value/model objects from one structure to another, or writing a call into a SQL ORM. But it's quite possible that if I'm writing a DB join algorithm or some complicated math in a rendering engine or a compiler optimization phase that it could "crib notes" from a source under restrictive license... because those things are absolutely in its learning set and the LLM doesn't "know" about the licensing behind them.

b) Either misunderstanding of, or lack of knowledge of, or outright hostility to... copylefted or attribution licenses which require special handling.

a year ago

fooster

Sorry your employer forbids the use of tooling that makes your life better and reduces drudgery. Perhaps you should vote with your feet and find a less Luddite employer.

a year ago

reaperducer

> Sorry your employer forbids the use of tooling that makes your life better and reduces drudgery. Perhaps you should vote with your feet and find a less Luddite employer.

Does your company allow you to outsource your work to people in a poorer nation for a fraction of the cost that you are paid? Why not? Perhaps you should vote with your feet and find a less Luddite employer.

a year ago

Dylan16807

If you have the skills for that, hell yes find an employer that will let you do it, either explicitly or implicitly.

a year ago

indrora

My company forbids the use of LLMs that aren't validated (and we make one).

Our managers get emails if we make calls to known LLMs, and there's guidance on locally running LLMs and using their output ("it's okay for small things maybe, but be careful"). Why?

Because legal's job is to protect the company from legal threats. Sometimes that means making some awkward choices, like handwringing over the use of GPL licensed software in publicly exposed example code (such as sample apps) purely because some aspects of the GPL haven't been tested in American courts, much less international ones.

So the use cases for LLMs there are mostly source-to-source transformative ("Turn this function and documentation into javadoc format please") or similar -- stuff where you can show that the LLM isn't introducing anything that might maybe possibly have any hint of externally licensed software.

a year ago

renewiltord

Wild. I suppose it's good that people who like these conditions can find employers like this and people like me who don't can find employers not like this.

I could never countenance operating under these conditions.

a year ago

blibble

so not only is it a shitty boilerplate generator, now it also introduces deliberate random changes (i.e. bugs)

a year ago

brookst

[flagged]

a year ago

williamcotton

[flagged]

a year ago

matkoniecz

> Downvoting

Presumably people downvoted it because it is really unclear what exactly you are claiming.

Instead of "Everyone needs to first familiarize themselves with" you could write a very simple summary of that and how it relates to this case and your next claim that

> If you’re under the impression that every line of code is covered by copyright you are very mistaken.

Well, for example empty ones are really unlikely to be.

Ones that quote out-of copyright works also will not be.

a year ago

williamcotton

[flagged]

a year ago

catiopatio

The downvotes probably have to do with the fact that:

(1) you led with a rude and mostly contentless comment, and

(2) your follow-up is merely a dump of Wikipedia quotes, instead of actually summarizing what you’ve been trying to say.

a year ago