AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet, who are then forced to erect boundaries to protect themselves, worsening the experience for the rest of the public, who also have to pay higher electricity bills, because keeping humans warm is not as profitable as a machine that directly converts electricity into stock price rises.
rnhmjoj 3 hours ago [-]
I'm as far from being an AI enthusiast as anyone can be, but this issue has nothing to do with AI specifically. It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the enstablished conventions (respecting robots.txt, using a proper UA string, rate limiting, whatever). This situation could easily have happened before the AI boom, for different reasons.
Fomite 39 minutes ago [-]
I'd argue it's part of the baked-in, fundamental disrespect AI firms have for literally everyone else.
mostlysimilar 2 hours ago [-]
But it didn't, and it's happening now, because of AI.
kjkjadksj 1 hours ago [-]
People have been complaining about these crawlers for years, well before AI.
PaulDavisThe1st 52 minutes ago [-]
The issue is 1 to 4 orders of magnitude worse than it was just a couple of years ago. This is not "crawlers suck". This is "crawlers are overwhelming us and almost impossible to fully block". It really isn't the same thing.
majkinetor 2 hours ago [-]
Obeying robots.txt cannot be enforced. Even if one country makes laws about it, another one will have 0 fucks to give.
spinningslate 1 hours ago [-]
It was never intended to be "enforced":
> The standard, developed in 1994, relies on voluntary compliance [0]
It was conceived in a world with an expectation of collectively respectful behaviour: specifically that search crawlers could swamp "average Joe's" site but shouldn't.
We're in a different world now but companies still have a choice. Some do still respect it... and then there's Meta, OpenAI and such. Communities only work when people are willing to respect community rules, not have compliance imposed on them.
It then becomes an arms race: a reasonable response from average Joe is "well, OK, I'll allow anyone but [Meta|OpenAI|...] to access my site." Fine in theory, difficult in practice:
1. Block IP addresses for the offending bots --> bots run from obfuscated addresses
2. Block the bot user agent --> bots lie about UA.
...and so on.
[0]: https://en.wikipedia.org/wiki/Robots.txt
"It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the enstablished [sic] conventions (respecting robots.txt, using proper UA string, rate limiting, whatever)."
How does a "proper UA string" solve this "blowing up websites" problem?
The only thing that matters with respect to the "blowing up websites" problem is rate-limiting, i.e., behaviour.
"Shitty crawlers" are a nuisance because of their behaviour, i.e., request rate, not because of whatever UA string they send; the behaviour is what is "shitty", not the UA string. The two are not necessarily correlated, and any heuristic that naively assumes so is inviting failure.
"Spoofed" UA strings have been facilitated and expected since the earliest web browsers. For example, https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
To borrow the parent's phrasing, the "blowing up websites" problem has nothing to do with the UA string specifically.
It may have something to do with website operator reluctance to set up rate-limiting, though, despite the widespread implementation of "web APIs" that use rate-limiting.
superkuh 2 hours ago [-]
This isn't AI damaging anything. This is corporations damaging things. Same as it ever was. No need for sci-fi non-human persons when legal corporate persons exist. They latch on to whatever big new thing in tech comes along that people don't understand, brand themselves with it, and cause damage trying to make money, even if they mostly fail at it. And most actual humans only ever see or interact with the scammy corporate versions of $techthing and so come to believe $techthing = corporate behavior.
And as for denying service and preventing human people from visiting websites: cloudflare does more of that damage in a single day than all these "AI" associated corporations and their crappy crawlers have in years.
autoexec 1 hours ago [-]
> This isn't AI damaging anything. This is corporations damaging things.
This is corporations damaging things because of AI. Corporations will damage things for other reasons too but the only reason they are breaking the internet in this way, at this time, is because of AI.
I think the "AI doesn't kill websites, corporations kill websites" argument is as flawed as the "Guns don't kill people, people kill people" argument.
ujkhsjkdhf234 4 minutes ago [-]
Cloudflare exists because people can't be good stewards of the internet.
> This isn't AI damaging anything. This is corporations damaging things
This is the guns don't kill people, people kill people argument. The problem with crawlers is about 10x worse than it was previously because of AI and their hunger for data.
renewiltord 1 hours ago [-]
If you don't want to receive data, don't. If you don't want to send data, don't. No one is asking you to receive traffic from my IPs or send to my IPs. You've just configured your server one way.
Or to use a common HN aphorism “your business model is not my problem”. Disconnect from me if you don’t want my traffic.
PaulDavisThe1st 48 minutes ago [-]
I don't know if I want your traffic until I see what your traffic is.
You want to look at one of our git commits? Sure! That's what our web-fronted git repo is for. Go right ahead! Be our guest!
Oh ... I see. You want to download every commit in our repository. One by one, when you could have just used git clone. Hmm, yeah, I don't want your traffic.
But wait, "your traffic" seems to originate from ... consults fail2ban logs ... more than 900k different IP addresses, so "disconnecting" from you is non-trivial.
I can't put it more politely than this: fuck off. Do not pass go. Do not collect stock options. Go to hell, and stay there.
renewiltord 27 minutes ago [-]
There's a protocol for that. Just reject the connection. Don't implode, just write some code. Your business model isn't my problem.
PaulDavisThe1st 5 minutes ago [-]
Reject the connection based on what?
IP address (presumably after too many visits) ? So now the iptables mechanism has to scale to fit your business model (of hammering my git repository 1 commit at a time from nearly a million IP addresses) ? Why does the code I use have to fit your braindead model? We wouldn't care if you just used git clone, but you're too dumb to do that.
The URL? Legitimate human (or other) users won't be happy about that.
Our web-fronted git repo is not part of our business model. It's just a free service we like to offer people, unrelated to revenue flow or business operations. So your behavior is not screwing my business model, but it is screwing up people who for whatever reason want to use that service, who can no longer use the web-fronted git repo.
ps. I've used "you" throughout the above because you used "my". No idea if you personally are involved in any such behavior.
latexr 32 minutes ago [-]
> Disconnect from me if you don’t want my traffic.
The problem is precisely that that is not possible. It is very well known that these scrapers aren’t respecting the wishes of website owners and even circumvent blocks any way they can. If these companies respected the website owners’ desires for them to disconnect, we wouldn’t be having this conversation.
renewiltord 24 minutes ago [-]
Websites aren't people. They don't have desires. Machines have communication protocols. You can set your machine to blackhole the traffic or TCP RST or whatever you want. It's just network traffic. Do what you want with it.
People send me spam. I don't whine about it. I block it.
latexr 9 minutes ago [-]
> Websites aren't people. They don't have desires.
Obviously I’m talking about the people behind them, and I very much doubt you lack the minimal mental acuity to understand that when I used “website owners” in the preceding sentence. If you don’t want to engage in a good faith discussion you can just say so, no need to waste our time with fake pedantry. But alright, I edited that section.
> You can set your machine to blackhole the traffic or TCP RST or whatever you want. It's just network traffic.
And then you spend all your time in a game of cat and mouse, while these scrapers bring your website down and cost you huge amounts of money. Are you incapable of understanding how that is a problem?
> People send me spam. I don't whine about it. I block it.
Is the amount of spam you get so overwhelming that it swamps your inbox every day to a level you’re unable to find the real messages? Do those spammers routinely circumvent your rules and filters after you’ve blocked them? Is every spam message you get costing you money? Are they increasing every day? No? Then it’s not the same thing at all.
mcpar-land 3 hours ago [-]
My worst offender for scraping one of my sites was Anthropic. I deployed an AI tar pit (https://news.ycombinator.com/item?id=42725147) to see what it would do with it, and Anthropic's crawler kept scraping it for weeks. I went through the logs and I think I wasted nearly a year of their time in total, because they were crawling in parallel. Other scrapers weren't so persistent.
fleebee 1 hours ago [-]
For me it was OpenAI. GPTBot hammered my honeypot with 0.87 requests per second for about 5 weeks. Other crawlers only made up 2% of the traffic. 1.8 million requests, 4 GiB of traffic. Then it just abruptly stopped for whatever reason.
Group_B 2 hours ago [-]
That's hilarious. I need to set up one of these myself
bwb 3 hours ago [-]
My book discovery website shepherd.com is getting hammered every day by AI crawlers (and crashing often)... my security lists in CloudFlare are ridiculous and the bots are getting smarter.
I wish there were a better way to solve this.
skydhash 3 hours ago [-]
If you're not updating the publicly accessible part of the database often, see if you can put a caching strategy in place and let Cloudflare take the hit.
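For a concrete picture, here's a minimal sketch of the idea, assuming a Flask app; the route, page body and max-age values are made up for illustration:

    # Mark public pages as cacheable so a CDN like Cloudflare can serve them
    # without hitting the origin.
    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/books/<slug>")
    def book_page(slug):
        # Placeholder body; a real app would render a template here.
        resp = make_response(f"<h1>{slug}</h1>")
        # Browsers may cache for 5 minutes; shared caches (the CDN) for a day,
        # so repeated crawler hits are mostly absorbed at the edge.
        resp.headers["Cache-Control"] = "public, max-age=300, s-maxage=86400"
        return resp

The key bit is s-maxage: it lets the CDN keep a copy much longer than individual browsers do, which is what actually takes the load off the origin.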
bwb 2 hours ago [-]
Yep, all but one page type is heavily cached at multiple levels. We are working to get the rest and improve it further... just annoying, as they don't even respect limits.
weaksauce 2 hours ago [-]
Put a honeypot link in your site that only robots will hit because it's hidden. Either leave it out of robots.txt or, if you can, explicitly disallow it there. Then set up a rule so that any IP that hits that link gets a one-day ban via fail2ban or the like.
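For anyone wanting to try this without wiring up fail2ban first, here's a rough sketch of the same idea done in application code; Flask is assumed, and the path, ban duration and in-memory store are all illustrative:

    import time
    from flask import Flask, abort, request

    app = Flask(__name__)
    banned = {}                # ip -> timestamp when the ban expires
    BAN_SECONDS = 24 * 3600    # one-day ban, as suggested above

    @app.before_request
    def refuse_banned_ips():
        until = banned.get(request.remote_addr)
        if until and time.time() < until:
            abort(403)

    @app.route("/do-not-follow-this-link")   # linked invisibly in page markup
    def honeypot():
        # Only something ignoring robots.txt and crawling hidden links lands here.
        banned[request.remote_addr] = time.time() + BAN_SECONDS
        abort(403)

Disallowing the honeypot path in robots.txt keeps well-behaved crawlers like Googlebot out of the trap, which addresses the worry about legit bots below.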
bwb 1 hours ago [-]
Got a good link to something on github that does this?
I have to make sure legit bots don't get hit, as a huge percentage of our traffic, which helps the project stay active, is from Google, etc.
rco8786 5 hours ago [-]
OpenAI straight up DoSed a site I manage for my in-laws a few months ago.
muzani 5 hours ago [-]
What is it about? I'm curious what kinds of things people ask that floods sites.
rco8786 3 hours ago [-]
The site is about a particular type of pipeline cleaning (think water/oil pipelines). I am certain that nobody was asking about this particular site, or even the industry it's in, 15,000 times a minute, 24 hours a day.
It's much more likely that their crawler is just garbage and got stuck in some kind of loop requesting my domain.
average_r_user 5 hours ago [-]
I suppose that they just keep referring to the website in their chats, and probably they have selected the search function, so before every reply, the crawler hits the website
tehwebguy 3 hours ago [-]
This is a feature! If half the internet is nuked and the other half put up fences there is less readily available training data for competitors.
AutoDunkGPT 2 hours ago [-]
I love this for us!
internet_points 6 hours ago [-]
They mention anubis, cloudflare, robots.txt – does anyone have experiences with how much any of them help?
davidfischer 1 hours ago [-]
My employer, Read the Docs, has a blog on the subject (https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...) of how we got pounded by these bots to the tune of thousands of dollars. To be fair though, the AI company that hit us the hardest did end up compensating us for our bandwidth bill.
We've done a few things since then:
- We already had very generous rate limiting rules by IP (~4 hits/second sustained) but some of the crawlers used thousands of IPs. Cloudflare has a list that they update of AI crawler bots (https://developers.cloudflare.com/bots/additional-configurat...). We're using this list to block these bots and any new bots that get added to the list.
- We have more aggressive rate limiting rules by ASN on common hosting providers (eg. AWS, GCP, Azure) which also hits a lot of these bots.
- We are considering using the AI crawler list to rate limit by user agent in addition to rate limiting by IP. This will allow well behaved AI crawlers while blocking the badly behaved ones. We aren't against the crawlers generally.
- We now have alert rules that alert us when we get a certain amount of traffic (~50k uncached reqs/min sustained). This is basically always some new bot cranked to the max and usually an AI crawler. We get this ~monthly or so and we just ban them.
Auto-scaling made our infra resilient enough that we don't even notice big traffic spikes. However, the downside of that is that the AI crawlers were hammering us without causing anything noticeable. Being smart with rate limiting helps a lot.
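For illustration only (this is not Read the Docs' actual setup), the per-IP part of that rate limiting is roughly a token bucket like the one below, using the ~4 hits/second figure above as the refill rate and an assumed burst allowance:

    import time
    from collections import defaultdict

    RATE = 4.0     # sustained requests per second allowed per IP
    BURST = 20.0   # assumed burst size; not a real-world recommendation

    buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def allow(ip: str) -> bool:
        b = buckets[ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False   # caller should answer with 429

As the list above notes, per-IP limits break down once a crawler spreads across thousands of IPs, which is why the bot list and ASN rules matter.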
nromiun 6 hours ago [-]
CDNs like Cloudflare are the best. Anubis is a rate limiter for small websites where you can't or won't use CDNs like Cloudflare. I have used Cloudflare on several medium-sized websites and it works really well.
Anubis's creator says the same thing:
> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.
Source: https://github.com/TecharoHQ/anubis
CloudFlare's Super Bot Fight Mode completely killed the surge in bot traffic for my large forum.
ajsnigrutin 4 hours ago [-]
And added captchas to every user with an adblock or sensible privacy settings.
pjc50 3 hours ago [-]
How would you suggest that such users prove they're not a crawler?
ajsnigrutin 3 hours ago [-]
Why would they have to?
What's wrong with crawlers? That's how google finds you, and people find you on google.
Just put some sensible request limits per hour per ip, and be done.
nucleardog 50 minutes ago [-]
> Just put some sensible request limits per hour per ip, and be done.
I have no personal experience, but probably worth reading like... any of the comments where people are complaining about these crawlers.
Claims are that they're: ignoring robots.txt; sending fake User-Agent headers; they're crawling from multiple IPs; when blocked they will use residential proxies.
People who have deployed Anubis to try and address this include: Linux Kernel Mailing List, FreeBSD, Arch Linux, NixOS, Proxmox, Gnome, Wine, FFMPEG, FreeDesktop, Gitea, Marginalia, FreeCAD, ReactOS, Duke University, The United Nations (UNESCO)...
I'm relatively certain if this were as simple as "just set a sensible rate limit and the crawlers will stop DDOS'ing your site" one person at one of these organizations would have figured that out by now. I don't think they're all doing it because they really love anime catgirls.
skydhash 2 hours ago [-]
Or use CDN caching. That's one of the things they're there for.
bakugo 5 hours ago [-]
robots.txt is obviously only effective against well-behaved bots. OpenAI etc are usually well behaved, but there's at least one large network of rogue scraping bots that ignores robots.txt, fakes the user-agent (usually to some old Chrome version) and cycles through millions of different residential proxy IPs. On my own sites, this network is by far the worst offender and the "well-behaved" bots like OpenAI are barely noticeable.
To stop malicious bots like this, Cloudflare is a great solution if you don't mind using it (you can enable a basic browser check for all users and all pages, or write custom rules to only serve a check to certain users or on certain pages). If you're not a fan of Cloudflare, Anubis works well enough for now if you don't mind the branding.
Here's the cloudflare rule I currently use (vast majority of bot traffic originates from these countries):
    ip.src.continent in {"AF" "SA"} or
    ip.src.country in {"CN" "HK" "SG"} or
    ip.src.country in {"AE" "AO" "AR" "AZ" "BD" "BR" "CL" "CO" "DZ" "EC" "EG" "ET" "ID" "IL" "IN" "IQ" "JM" "JO" "KE" "KZ" "LB" "MA" "MX" "NP" "OM" "PE" "PK" "PS" "PY" "SA" "TN" "TR" "TT" "UA" "UY" "UZ" "VE" "VN" "ZA"} or
    ip.src.asnum in {28573 45899 55836}
GuB-42 1 hours ago [-]
Surely there are solutions more subtle than blocking 80% of the world population...
sumtechguy 1 hours ago [-]
Is there an HTTP code for 'hey, I gave you this already 10 times. This is a you problem, not a me problem. I refuse to give you another copy'?
It also sounds like there is an opportunity to sell scraped data to these companies. Instead of 10 crawlers we get one crawler and they just resell/give it away. More honeypots don't really fix the root cause (which is greed).
GuB-42 50 minutes ago [-]
> is there an http code for 'hey I gave you this already 10 times.
429 Too Many Requests
> This is a you problem not a me problem
That's the "4" in "429"
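As a tiny sketch of what that looks like in practice (Flask assumed; the limiter check and the one-hour value are placeholders):

    from flask import Flask, request

    app = Flask(__name__)

    def too_many_requests(ip: str) -> bool:
        # Stand-in for a real limiter (per-IP token bucket, cache of recent hits, etc.).
        return False

    @app.route("/expensive-page")
    def expensive_page():
        if too_many_requests(request.remote_addr):
            # The 4xx says "you problem"; Retry-After hints when to come back.
            return "Too Many Requests", 429, {"Retry-After": "3600"}
        return "the actual page"

Whether the crawlers honour Retry-After is another matter, which is the complaint running through this thread.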
timsh 4 hours ago [-]
A bit off-topic but wtf is this preview image of a spider in the eye?
It’s even worse than the clickbait title of this post.
I think this should be considered bad practice.
levleontiev 2 hours ago [-]
That's why I am building a Rate Limiter as a service.
Seems that it has its niche.
neilv 2 hours ago [-]
Don't the companies in the headlines pay big bucks for people working on "AI"?
Maybe they are paying big bucks for people who are actually very bad at their jobs?
Why would the CEOs tolerate that? Do they think it's a profitable/strategic thing to get away with, rather than a sign of incompetence?
When subtrees of the org chart don't care that they are very bad at their jobs, harmed parties might have to sue to get the company to stop.
xrd 4 hours ago [-]
Isn't there a class action lawsuit coming from all this? I see a bunch of people here indicating these scrapers are costing real money to people who host even small niche sites.
Is the reason these large companies don't care because they are large enough to hide behind a bunch of lawyers?
EgregiousCube 3 hours ago [-]
Under what law? It's interesting because these are sites that host content for the purpose of providing it to anonymous network users. ebay won a case against a scraper back in 2000 by claiming that the server load was harming them, but that reasoning was later overturned because it's difficult to say that server load is actual harm. ebay was in the same condition before and after a scrape.
Maybe some civil lawsuit about terms of service? You'd have to prove that the scraper agreed to the terms of service. Perhaps in the future all CAPTCHAs come with a TOS click-through agreement? Or perhaps every free site will have a login wall?
buttercraft 2 hours ago [-]
If you put measures in place to prevent someone from accessing a computer, and they circumvent those measures, is that not a criminal offense in some jurisdictions?
outside1234 4 hours ago [-]
Yes. There is one set of rules for us and another set of rules for anything with more than a billion dollars.
s_ting765 2 hours ago [-]
Can confirm: Meta's bots have been aggressively scraping some of my internet-facing services, but they do respect robots.txt.
shinycode 6 hours ago [-]
At the same time, it's so practical to ask a question and have it open 25 pages to search and summarize the answer. Before, that's more or less what I was trying to do by hand. Maybe not 25 websites, because thanks to crap SEO the top 10 contains BS content, so I curated the list, but the idea is the same, no?
rco8786 5 hours ago [-]
My personal experience is that OpenAI's crawler was hitting a very, very low traffic website I manage tens of thousands of times a minute, non-stop. I had to block it via Cloudflare.
Leynos 5 hours ago [-]
Where is caching breaking so badly that this is happening? Are OpenAI failing to use etags or honour cache validity?
Analemma_ 4 hours ago [-]
Their crawler is vibe-coded.
danaris 5 hours ago [-]
Same here.
I run a very small browser game (~120 weekly users currently), and until I put its Wiki (utterly uninteresting to anyone who doesn't already play the game) behind a login-wall, the bots were causing massive amounts of spurious traffic. Due to some of the Wiki's data coming live from the game through external data feeds, the deluge of bots actually managed to crash the game several times, necessitating a restart of the MariaDB process.
mrweasel 3 hours ago [-]
Wikis seem to attract AI bots like crazy, especially the bad kind that will attempt any type of cache invalidation available to them.
pm215 6 hours ago [-]
Sure, but if the fetcher is generating "39,000 requests per minute" then surely something has gone wrong somewhere ?
miohtama 6 hours ago [-]
Even if it is generating 39k req/minute, I would expect most of the pages to already be locally cached by Meta, or served statically by their respective hosts. We have been working hard on caching websites and it has been a solved problem for the last decade or so.
ndriscoll 4 hours ago [-]
Could be serving no-cache headers? Seems like yet another problem stemming from every website being designed as if it were some dynamic application when nearly all of them are static documents. nginx doing 39k req/min to cacheable pages on an n100 is what you might call "98% idle", not "unsustainable load on web servers".
The data transfer, on the other hand, could be substantial and costly. Is it known whether these crawlers do respect caching at all? Provide If-Modified-Since/If-None-Match or anything like that?
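For reference, honouring those headers is cheap on the server side too; a minimal sketch (Flask assumed, with a hash-of-body ETag, which is just one possible scheme):

    import hashlib
    from flask import Flask, Response, request

    app = Flask(__name__)

    PAGE_BODY = "<html><body>a mostly static document</body></html>"

    @app.route("/doc")
    def doc():
        etag = hashlib.sha256(PAGE_BODY.encode()).hexdigest()
        # A crawler that respects caching sends the ETag back in If-None-Match;
        # answering 304 with no body saves nearly all of the transfer.
        if request.headers.get("If-None-Match") == etag:
            return Response(status=304, headers={"ETag": etag})
        return Response(PAGE_BODY, headers={"ETag": etag,
                                            "Cache-Control": "public, max-age=3600"})

The open question in the comment above still stands: this only saves bandwidth if the crawlers actually send the conditional headers.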
mrweasel 3 hours ago [-]
Many AI crawlers seem to go to great lengths to avoid caches, not sure why.
andai 5 hours ago [-]
They're not very good at web queries, if you expand the thinking box to see what they're searching for, like half of it is nonsense.
e.g. they'll take an entire sentence the user said and put it in quotes for no reason.
Thankfully search engines started ignoring quotes years ago, so it balances out...
sct202 3 hours ago [-]
I wonder how much of the rapid expansion of datacenters is from trying to support bot traffic.
loeg 3 hours ago [-]
In terms of CapEx, not much. The GPUs are much more expensive. Physical footprint? I don't know.
vkou 1 hours ago [-]
Why is this not a violation of the CFAA, and why aren't SWEs and directors going to prison over it?
As long as I have an EULA or a robots.txt or even a banner that forbids this sort of access, shouldn't any computerized access be considered abuse? Something, something, scraping JSTOR?
jasoncartwright 4 hours ago [-]
I recently, for pretty much the first time ever in 30 years of running websites, had to blanket ban crawlers. I now whitelist a few, but the rest (and all other non-UK visitors) have to pass a Cloudflare challenge [1].
AI crawlers were downloading whole pages and executing all the JavaScript tens of millions of times a day - hurting performance, filling logs, skewing analytics and costing too much money in Google Maps loads.
Really disappointing.
[1] https://developers.cloudflare.com/cloudflare-challenges/
> "I don't know what this actually gives people, but our industry takes great pride in doing this"
> "unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees"
> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."
<3 <3
breakyerself 6 hours ago [-]
There's so much bullshit on the internet, how do they make sure they're not training on nonsense?
prasadjoglekar 6 hours ago [-]
By paying a pretty penny for non bullshit data (Scale Ai). That along with Nvidia are the shovels in this gold rush.
danny_codes 2 hours ago [-]
Making a lot of assumptions about the quality of scale AI.
bgwalter 6 hours ago [-]
Much of it is not training. The LLMs fetch webpages for answering current questions, summarize or translate a page at the user's request etc.
Any bot that answers daily political questions like Grok has many web accesses per prompt.
snowwrestler 5 hours ago [-]
While it’s true that chatbots fetch information from websites in response to requests, the load from those requests is tiny compared to the volume of requests indexing content to build training corpuses.
The reason is that user requests are similar to other web traffic because they reflect user interest. So those requests will mostly hit content that is already popular, and therefore well-cached.
Corpus-building crawlers do not reflect current user interest and try to hit every URL available. As a result these hit URLs that are mostly uncached. That is a much heavier load.
shikon7 5 hours ago [-]
But surely there aren't thousands of new corpuses built every minute.
bgwalter 4 hours ago [-]
Why would the Register point out Meta and OpenAI as the worst offenders? I'm sure they do not continuously build new corpuses every day. It is probably the search function, as mentioned in the top comments.
8organicbits 6 hours ago [-]
Is an AI chatbot fetching a web page to answer a prompt a 'web scraping bot'? If there is a user actively prompting the LLM, isn't it more of a user agent? My mental model, even before LLMs, was that a human being present changes a bot into a user agent. I'm curious if others agree.
bgwalter 6 hours ago [-]
The Register calls them "fetchers". They still reproduce the content of the original website without the website gaining anything but additional high load.
I'm not sure how many websites are searched and discarded per query. Since it's the remote, proprietary LLM that initiates the search I would hesitate to call them agents. Maybe "fetcher" is the best term.
ronsor 14 minutes ago [-]
> The Register calls them "fetchers". They still reproduce the content of the original website without the website gaining anything but additional high load.
So does my browser when I have uBlock Origin enabled.
danaris 5 hours ago [-]
But they're (generally speaking) not being asked for the contents of one specific webpage, fetching that, and summarizing it for the user.
They're going out and scraping everything, so that when they're asked a question, they can pull a plausible answer from their dataset and summarize the page they found it on.
Even the ones that actively go out and search/scrape in response to queries aren't just scraping a single site. At best, they're scraping some subset of the entire internet that they have tagged as being somehow related to the query. So even if what they present to the user is a summary of a single webpage, that is rarely going to be the product of a single request to that single webpage. That request is going to be just one of many, most of which are entirely fruitless for that specific query: purely extra load for their servers, with no gain whatsoever.
danaris 5 hours ago [-]
I mean...they don't. That's part of the problem with "AI answers" and such.
okasaki 3 hours ago [-]
I wonder if we're doing the wrong thing blocking them with invasive tools like cloudflare?
If all you're concerned about is server load, wouldn't it be better to just offer a tar file containing all of your pages they can download instead? The models are months out of date, so a monthly dump would surely satisfy them. There could even be some coordination for this.
They're going to crawl anyway. We can either cooperate or turn it into some weird dark market with bad externalities like drugs.
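If someone did want to offer that, the dump itself is trivial; a sketch of a monthly job over statically rendered pages, with all paths assumed:

    import tarfile
    from pathlib import Path

    SITE_ROOT = Path("/var/www/site")                          # rendered pages
    DUMP_PATH = Path("/var/www/site/dump/site-latest.tar.gz")  # published dump

    def build_dump() -> None:
        DUMP_PATH.parent.mkdir(parents=True, exist_ok=True)
        with tarfile.open(DUMP_PATH, "w:gz") as tar:
            for page in SITE_ROOT.rglob("*.html"):
                tar.add(page, arcname=str(page.relative_to(SITE_ROOT)))

    if __name__ == "__main__":
        build_dump()

The hard part isn't producing the dump, it's getting the crawler operators to prefer it, which the replies below are sceptical about.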
recallingmemory 2 hours ago [-]
Yeah, I am in the opposing camp too - I don't use Cloudflare's bot fight tooling on any of our high traffic websites. I'm not seeing the issue with allowing bots to crawl our websites other than some additional spend for bandwidth. Agent mode is pretty powerful when paired with a website that cooperates, and if people want to use AI to interact with our data then what's wrong with that?
masfuerte 2 hours ago [-]
A tar file would be better if the crawlers would use it, but even sites with well-publicised options for bulk downloads (like wikipedia) are getting hammered by the bots.
The bot operators DNGAF.
jgalt212 2 hours ago [-]
About 18 months ago, our non-Google/Bing bot traffic went from single-digit percentages to over 99.9% of our traffic. We tried some home-spun solutions at first, but eventually threw in the towel and put Cloudflare in front of all our publicly accessible pages. On a long-term basis, this was probably the right move for us, but we felt forced into it. And the Cloudflare Managed Ruleset definitely blocks some legit traffic, such that it requires a fair amount of manual tuning.
lostmsu 4 hours ago [-]
This article and the "report" look like a submarine ad for Fastly services. At no point does it mention the human/bot/AI bot ratio, making it useless for any real insights.
delfinom 5 hours ago [-]
I run a symbol server, as in, PDB debug symbol server. Amazon's crawler and a few others love requesting the ever loving shit out of it for no obvious reason. Especially since the files are binaries.
I just set a rate-limit in cloudflare because no legitimate symbol server user will ever be excessive.
ack_complete 4 hours ago [-]
I have a simple website consisting solely of static webpages pointing to a bunch of .zip binaries. Nothing dynamic, all highly cacheable. The bots are re-downloading the binaries over and over. I can see Bingbot downloading a .zip file in the logs, and then an hour later another Bingbot instance from a different IP in the same IP range downloading the same .zip file in full. These are files that were uploaded years ago and have never retroactively changed, and don't contain crawlable contents within them (executable code).
Web crawlers have been around for years, but many of the current ones are more indiscriminate and less well behaved.
hereme888 5 hours ago [-]
I'm absolutely pro AI-crawlers. The internet is so polluted with garbage, compliments of marketing. My AI agent should find and give me concise and precise answers.
mrweasel 3 hours ago [-]
They just don't need to hammer sites into the ground to do it. This wouldn't be an issue if the AI companies were a bit more respectful of their data sources, but they are not, they don't care.
All this attempting to block AI scrapers would not be an issue if they respected rate limits, knew how to back off when a server starts responding too slowly, or cached frequently visited sites. Instead some of these companies will do everything, including using residential ISPs, to ensure that they can just piledrive the website of some poor dude who's just really into lawnmowers, or the git repo of some open source developer who just wants to share their work.
Very few would actually be against AI crawlers if they showed just the tiniest amount of respect, but they don't. I think Drew DeVault said it best: "Please stop externalizing your costs directly into my face"
lionkor 4 hours ago [-]
The second I get hit with bot traffic that makes my server heat up, I just slam some aggressive anti-bot stuff in front of it. Then you, my friend, are getting nothing with your fancy AI agent.
hereme888 4 hours ago [-]
I've never run any public-facing servers, so maybe I'm missing the experience of your frustration. But mine, as a "consumer", is wanting clean answers, like what you'd expect when asking your own employee for information.
mediumsmart 4 hours ago [-]
so the fancy AI agent will have to get really fancy and mimic human traffic and all is good until the server heats up from all those separate human trafficionados - then what?
depingus 1 hours ago [-]
The end of the open web. That's what.
Sites will have to either shutdown or move behind a protection racket run by one of the evil megacorps. And TBH, shutting down is the better option.
With clickthrough traffic dead, what's even the point of putting anything online? To feed AIs so that someone else can profit at my (very literal) expense? No thanks. The knowledge dies with me.
The internet dark age is here. Everyone, retreat to your fiefdom.
lionkor 1 hours ago [-]
Nobody is forcing anyone to share their knowledge. What then? Dead internet.
depingus 57 minutes ago [-]
Absolutely yes. I guarantee you these megacorps are betting on a future where the open internet has been completely obliterated. And the only way to participate online is thru their portal; where everything you do feeds back into their AI. Because that is the only way to acquire fresh food for their beast.