How to block crawlers with nginx

Posted by 李唐 (administrator) on 2023-04-02

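1. Use the "robots.txt" convention

Create an empty file named "robots.txt" in the site root and save the content below in it. An empty "Disallow:" line allows everything for that crawler, so the crawlers listed by name are allowed, while the final "User-agent: *" group blocks every other crawler that honors robots.txt.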

User-agent: BaiduSpider
Disallow:

User-agent: YisouSpider
Disallow:

User-agent: 360Spider
Disallow:

User-agent: Sosospider
Disallow:

User-agent: SogouSpider
Disallow:

User-agent: YodaoBot
Disallow:

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
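
To check how a parser reads these rules, here is a minimal sketch using Python's standard urllib.robotparser module. It assumes each User-agent line is immediately followed by its Disallow line, uses only three of the groups above to stay short, and the www.example.com URL and the is_allowed helper are placeholders invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules above: named crawlers get an empty
# Disallow (everything allowed), everyone else is disallowed from "/".
ROBOTS_TXT = """\
User-agent: BaiduSpider
Disallow:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules in ROBOTS_TXT permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/index.html"  # placeholder URL
    for ua in ("Googlebot", "BaiduSpider", "Scrapy"):
        print(ua, is_allowed(ua, url))
```

robots.txt is only advisory: a crawler that ignores it is not stopped by these rules, which is why the server-side checks in the nginx section are still needed.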

2. nginx

Add the code below inside the "location / { }" block, for example next to the rewrite (pseudo-static) rules.

# Block scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}

# Block the listed UAs and requests with an empty UA

t|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Ezooms|^$" ) {

return 404;

}

# Block requests whose method is not GET, HEAD or POST.
# ~ is a case-sensitive regex match; ~* is a case-insensitive regex match.
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
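
The two matching operators can be illustrated with Python's re module as an approximate stand-in for nginx's PCRE matching. This is a sketch for illustration only, not nginx itself, and the function names ua_is_blocked and method_is_blocked are made up for this example.

```python
import re

# nginx "~*": case-insensitive regex search anywhere in the value.
UA_BLOCK = re.compile(r"(Scrapy|Curl|HttpClient)", re.IGNORECASE)

# nginx "!~ ^(GET|HEAD|POST)$": case-sensitive, anchored to the whole value.
METHOD_ALLOW = re.compile(r"^(GET|HEAD|POST)$")


def ua_is_blocked(user_agent: str) -> bool:
    """Mimic: if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) { return 403; }"""
    return UA_BLOCK.search(user_agent) is not None


def method_is_blocked(method: str) -> bool:
    """Mimic: if ($request_method !~ ^(GET|HEAD|POST)$) { return 403; }"""
    return METHOD_ALLOW.search(method) is None


if __name__ == "__main__":
    print(ua_is_blocked("curl/7.68.0"))   # "Curl" matches regardless of case
    print(method_is_blocked("DELETE"))    # not in the allowed method list
```

Because the UA pattern is not anchored, it matches anywhere in the header, whereas the method pattern must match the whole value.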

# Block specific UA strings. Regex metacharacters such as ( ) . + are
# escaped so that they match literally.
if ($http_user_agent ~ "Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1; SV1; \.NET CLR 1\.1\.4322; \.NET CLR 2\.0\.50727\)") {
    return 404;
}

if ($http_user_agent ~ "Mozilla/5\.0\+\(compatible;\+Baiduspider/2\.0;\+\+http://www\.baidu\.com/search/spider\.html\)") {
    return 404;
}

if ($http_user_agent ~ "Mozilla/5\.0 \(iPhone; CPU iPhone OS 9_1 like Mac OS X\)\(compatible; Baiduspider-render/2\.0; \+http://www\.baidu\.com/search/spider\.html\)") {
    return 404;
}

if ($http_user_agent ~ "Mozilla/5\.0 \(Linux; Android 10; VCE-AL00 Build/HUAWEIVCE-AL00; wv\)") {
    return 404;
}
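
A full UA string contains characters such as ( ) . and + that act as operators inside a regex. The sketch below, again using Python's re module as an approximate stand-in for nginx's PCRE matching, suggests why a pattern with unescaped parentheses does not match the literal UA, while an escaped pattern does.

```python
import re

# Literal UA string taken from one of the rules above.
UA = "Mozilla/5.0 (Linux; Android 10; VCE-AL00 Build/HUAWEIVCE-AL00; wv)"

# Used unescaped, "(" and ")" form a capture group instead of matching
# the literal parentheses, so the pattern does not match the UA itself.
unescaped_matches = re.search(UA, UA) is not None

# re.escape() backslash-escapes the metacharacters so that the pattern
# matches the UA text literally.
escaped_pattern = re.escape(UA)
escaped_matches = re.search(escaped_pattern, UA) is not None

if __name__ == "__main__":
    print("unescaped:", unescaped_matches)
    print("escaped:  ", escaped_matches)
```

The exact output of re.escape() is Python-specific. In nginx, when the whole header should be compared exactly, the "=" operator of the "if" directive compares strings without any regex interpretation.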

Test it:

curl -I -A "Mozilla/5.0Macintosh;IntelMacOSX10_12_0AppleWebKit/537.36KHTML,likeGeckoChrome/60.0.6967.1704Safari/537.36;YandexBot" http://www.xxxxx.com

A 403 response means the configuration is working. Rules above that use "return 404" answer with 404 instead.
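
The same check can be scripted. The following is a minimal sketch using only Python's standard library; it sends a HEAD request with a chosen User-Agent, as curl -I -A does. The helper names build_head_request and head_status are hypothetical, and the timeout value is an arbitrary assumption.

```python
import urllib.error
import urllib.request


def build_head_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a HEAD request that carries the given User-Agent header."""
    return urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )


def head_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send the HEAD request and return the HTTP status code."""
    request = build_head_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; a blocked request
        # (403 or 404) ends up here.
        return err.code
```

For example, head_status("http://www.xxxxx.com", "YandexBot") would return the status code the server sends for that UA, with the host replaced by the real site.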