I used to be the Security Team Lead for Web Applications at one of the largest government data centers in the world, but now I do mostly "source available" security, mainly focusing on BSD. I'm on GitHub, but I also run a self-hosted Gogs (the project Gitea was forked from) git repo at Quadhelion Engineering Dev.
Well, on that server I tried to deny AI access with Suricata rules, robots.txt, "NO AI" licenses, Human Intelligence (HI) License links in the software, and "NO AI" comments in posts everywhere on the Internet where my software was shared. Here is what I found today after correlating all my logs of git clones and scrapes and tracing each one back to an IP, company, and server.
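For reference, the robots.txt side of that denial looks roughly like this. This is a minimal sketch, not my actual file; the user-agent tokens are the publicly documented ones for the OpenAI, Perplexity, Common Crawl, and Google AI-training crawlers, and robots.txt is purely advisory, which is exactly the problem:

```
# robots.txt — advisory only; binds well-behaved crawlers, nothing else
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

A crawler that lies about its user agent (see the linked article below) sails right past every one of these rules, which is why I layered Suricata and licensing on top.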
Having formerly been loath to give even my thinking pattern to a potential adversary, I asked Perplexity AI questions specifically about BSD security, a very niche topic. Although there is a huge general data pool here spanning many decades, my type of software is fairly unique: it is buried (it does not appear in the first two pages of a GitHub search for BSD security, which is as far as most users will click), it is very recent compared to the "dead pool" of old knowledge, and it is fairly well received yet not generally popular, so GitHub traffic analysis is very useful.
The traceback and AI result analysis show the following:
[EDIT continued] Did it choose the Phoronix vector to that information because it was less attributable? It found my other repos in other ways. My Phoronix handle is the same as my GitHub username, and that handle is my name, easily inferable in either case; there is also a biography link with my full name in the About section. [EDIT cont end]
You should test this out for yourself, as I'm not going to spend days or a week building a polished presentation of a technical case. Check your own niche code, ask a specific applied code question, or create a mock repo with very niche material and lots of code in the README.md, then check it against the AI every day until you see it surface.
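The mock-repo idea above can be sketched in a few lines of shell. This is a hypothetical example, assuming git is installed; the marker string and repo name are invented for illustration. The point is to plant a unique token in README.md that an AI could only have learned from your repo:

```shell
set -e

# A unique token no model should know unless it scraped this README.
MARKER="qhe-canary-7f3a9c2e"
REPO="canary-repo"

mkdir -p "$REPO"
git init -q "$REPO"

# Seed the README with the marker embedded in a code example,
# since code in READMEs is exactly what is at issue here.
cat > "$REPO/README.md" <<EOF
# ${MARKER}-lib

Niche BSD security helper. Usage example:

    ${MARKER}_check --strict /etc/rc.conf
EOF

git -C "$REPO" add README.md
git -C "$REPO" -c user.name=canary -c user.email=canary@example.com \
    commit -q -m "seed canary README"

echo "canary seeded: $MARKER"
```

Publish the repo, then periodically ask the AI about the marker (here, the made-up `qhe-canary-7f3a9c2e_check` tool); if it ever answers with details, it trained on your README.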
P.S. I pulled up TabNine and tried to write Ruby so complicated and magically mashed that the AI could offer me nothing, just as an AI obfuscation/smartness test. You should try something similar and see what results you get.
Discussion Primer: From my perspective, and potentially that of millions of others, the README is part of the software; it is delivered with the software whether as a zip, a tar, or a git clone. Markdown itself is a specification, and the document can be considered software.
In fact, the README is so integral to the software that you cannot run the software without it.
Conclusion: I think we all regard a README, especially one with examples of your code in it, as code. I have evidence that AI trains on your README even if you specifically tell it not to: block the README, block markdown files, and it still goes after them. Kinda scary?
I want everyone else to have the evidence I have. That's science.
So… if you don’t want the world to see your work, why are you hosting it publicly?
The comments so far aren't real people posting how they really feel; it's an agenda, or automata. Doesn't that tell you I'm over the target?
Look, my post is doing really well on the cybersecurity exchanges. So to all real developers and program managers out there:
I recommend removing any "primary logic" functional code examples from your README.md, that's it. PSA, here to help, Elias
I agree with you that they have consumed far more of the Internet than they let on, and that scrapers are shoving everything into these models regardless of legality or consent. It's messed up. If the world weren't such a concrete jungle, this could be a great ubiquitous tool, delivered faster and more safely than it is now.
It all started with this today:
Perplexity AI Is Lying about Their User Agent https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/
I also just realized why I'm getting heat here: lawsuits.
I just gave legal cause: the practice was not properly disclosed by Microsoft and was abused by OpenAI, on the grounds that a README.md containing code is software, not speech; it is integral to the licensed software and therefore covered by said license.
If an entity finds out, as I did, that its technical writing or code ended up in an AI via a README, is that AI vendor perhaps liable?
Thanks for all the comments affirming my hard-working, planned six-month AI honeypot, an endeavor to be a threat to anything that even remotely has the possibility of becoming anti-human. It was within my capability and interest to do, so I did it. This phase may pass and we won't have to worry, but we aren't there yet, I believe.
I did some more digging in Perplexity on niche security. This is tangential and speculative, unlike my previous evidenced analysis, but I do think I'm on to something, and maybe others can help me crack it.
I wrote this article https://www.quadhelion.engineering/articles/freebsd-synfin.html about FreeBSD sysctl tunables, specifically dropping SYN+FIN packets and its performance impact on web hosting and security, so I searched for that. There are many conf files out there containing this directive, and performance data in aggregate, but I couldn't find any specific data from a controlled test of just that tunable, so I tested it myself months ago.
I searched for it on Perplexity:

The forked gist was: https://gist.github.com/gspu/ac748b77fa3c001ef3791478815f7b6a

[Contradiction over time] The stated impact varied across answers: none, negligible, trivial, or an improvement.

[Errors] Corrected after yesterday, in line with my comments on the web that it actually improves performance, as in my months-old article. `drop_synfin` mainly mitigates fingerprinting, not DoS/DDoS (that's what a SYN flood means), but I also tested this in my article!

Anyone feel like an experiment here in this thread? Ask ChatGPT the same question for me/us.
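For anyone who wants to reproduce the setting itself rather than the AI query, this is the tunable in question. A minimal sketch; verify the sysctl name and defaults against your FreeBSD version's sysctl(8) documentation:

```
# Runtime (takes effect immediately):
#   sysctl net.inet.tcp.drop_synfin=1

# /etc/sysctl.conf — persists across reboots.
# Drops TCP segments arriving with both SYN and FIN set,
# primarily an OS-fingerprinting mitigation.
net.inet.tcp.drop_synfin=1
```

With that in place you can rerun the kind of controlled before/after benchmark described in the article and compare against what the AI claims.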