The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall

Davriellelouna@lemmy.world · 2 days ago (edited)

poopkins@lemmy.world · 1 day ago (edited)

I’ve developed my own agent for assisting me with researching a topic I’m passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I’m a human using a web browser. (For my network requests, I’ve defined my own user agent.)

So I use that as a signal that the website doesn’t want automated tools scraping their data. That’s fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.
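For anyone curious, the fallback described above can be sketched roughly as follows. This is a minimal sketch, not the actual agent: the names `looks_like_bot_challenge`, `handle_response`, and `FetchOutcome` are hypothetical, and the detection heuristic (a `cf-mitigated: challenge` header, or a 403/429/503 status served by Cloudflare) is an assumption about what a challenge response looks like, so it may miss some cases or misfire on others.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class FetchOutcome:
    url: str
    blocked: bool          # True when the site appears to challenge bots
    body: Optional[str]    # Page content, only when the fetch was allowed
    note: str              # Human-readable message for the researcher


def looks_like_bot_challenge(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic (assumption): guess whether a response is a bot challenge."""
    h = {k.lower(): v.lower() for k, v in headers.items()}
    # Assumed signal 1: an explicit challenge marker header.
    if h.get("cf-mitigated") == "challenge":
        return True
    # Assumed signal 2: a refusal status coming from a Cloudflare edge server.
    return status in (403, 429, 503) and "cloudflare" in h.get("server", "")


def handle_response(url: str, status: int,
                    headers: Mapping[str, str], body: str) -> FetchOutcome:
    """Respect the challenge: do not retry or evade, just hand back a link."""
    if looks_like_bot_challenge(status, headers):
        note = f"May be relevant, but the site challenges bots. Open manually: {url}"
        return FetchOutcome(url, True, None, note)
    return FetchOutcome(url, False, body, "fetched")
```

The design choice is that a challenge is treated as a "no" from the site rather than an obstacle: there is no retry, no header spoofing, and no headless browser, only a deep link for a human to follow.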

I completely understand where Perplexity is coming from, but at scale, implementations like ~~this~~ Perplexity’s are awful for the web.

(Edited for clarity)

IphtashuFitz@lemmy.world · 1 day ago

I hate to break it to you, but Cloudflare isn’t the only one doing this sort of thing: so do Akamai, AWS, and virtually every other CDN provider out there. And far from being awful, it’s actually protecting the web.

We use Akamai where I work, and they inform us in real time when a request comes from a bot. They further classify it into one of a dozen or so bot categories (search engine crawlers, analytics bots, advertising bots, social networks, AI bots, etc.). They also tell us if somebody is impersonating a well-known bot such as Google’s crawler. So we can easily allow search engines to crawl our site while blocking AI bots, bots impersonating Google, and so on.
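To give a rough idea of the kind of policy this enables, here is a minimal sketch. It is not Akamai’s API: the category labels, the `impersonating` flag, and the `decide` function are all hypothetical stand-ins for whatever signals the CDN actually passes along, and blocking unknown bot categories by default is an assumption for illustration.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"


# Hypothetical category labels; a real CDN would use its own names.
POLICY = {
    "search_engine": Action.ALLOW,  # let search engines crawl the site
    "ai": Action.BLOCK,             # block AI bots
}


def decide(bot_category: Optional[str], impersonating: bool) -> Action:
    """Map the CDN's classification of a request to an allow/block action."""
    # A bot pretending to be a well-known crawler is blocked regardless
    # of which category it claims to belong to.
    if impersonating:
        return Action.BLOCK
    # None means the request was not flagged as a bot at all.
    if bot_category is None:
        return Action.ALLOW
    # Assumption: bot categories without an explicit rule are blocked.
    return POLICY.get(bot_category, Action.BLOCK)
```

The impersonation check comes first so that a request claiming to be a search engine crawler cannot inherit the allow rule meant for the real one.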

poopkins@lemmy.world · 1 day ago

By “things like this are awful for the web,” I meant that automation through AI is awful for the web. It takes away from the original content creators without any attribution and hits their bottom line.

My story was supposed to be one about responsible AI, but somehow I screwed that up in my summary.