Getting 403 when scraping Douban albums; spoofing a browser didn't work, calling HQ...

March 24, 2015

dedewei

Googling turned up two ways to spoof a browser:
Option 1: https://gist.github.com/jianjiao2021/2c34d12dc2b327e62966

Option 2: https://gist.github.com/jianjiao2021/05f9bbed66e79c24c9dc

Both still return 403. What am I doing wrong?

Full code: https://gist.github.com/jianjiao2021/7a8069afab52b12b0c76
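For readers who cannot open the gists, here is a minimal sketch of the two usual ways to send a browser-like User-Agent from the standard library. It assumes Python 3's `urllib.request` (a 2015 script would more likely have used `urllib2`), and the `USER_AGENT` string and function names are made up for illustration; they are not taken from the gists.

```python
import urllib.request

# Hypothetical User-Agent string, used only for illustration.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0 Safari/537.36"
)


def build_request(url):
    """Option 1: attach the header to a single Request object."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def build_opener_with_ua():
    """Option 2: set the header on an opener so every request it sends carries it."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener
```

Either object is then used in place of a bare URL: `urllib.request.urlopen(build_request(url))` or `build_opener_with_ua().open(url)`. As the replies point out, a correct header alone may not be enough if the site also watches request frequency.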

12585 views

Node

Python

39 replies

em70

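March 24, 2015

Douban has long been monitoring request frequency. In my tests, 40 requests per minute is the threshold; if you wait 1 second after each fetch, you will definitely be fine.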
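A rough sketch of that kind of throttling, assuming the 40-per-minute figure above is accurate: 40 requests per minute works out to one request every 1.5 seconds, so the default gap below is 1.5 seconds. The `Throttle` class and its parameters are hypothetical names for illustration, not part of any library.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval=1.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to keep min_interval between calls."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each download keeps the crawl under the limit no matter how fast the pages themselves load.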

fork3rt

March 24, 2015

Why not use requests + beautifulsoup?

vjnjc

March 24, 2015

Looks fun. OP, can I borrow your program? I hear there are lots of hidden beauties on Douban, and I can pick up some Python along the way ^^

CaoZ

March 24, 2015

Use Douban's API (http://developers.douban.com/wiki/?title=photo_v2) with the apikey that the Douban client uses, and you won't get banned no matter how much you scrape~

e.g. http://api.douban.com/v2/group/taotaopaoxiao/topics?alt=json&apikey=08f332d3675ca9d71ad9987a3615fd85
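A minimal sketch of calling that endpoint from Python, using only the standard library. The URL shape is copied from the example above; whether the endpoint still responds, and what fields the JSON contains, are not something this thread confirms. `build_topics_url` and `fetch_json` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "http://api.douban.com/v2"  # base URL taken from the example above


def build_topics_url(group, apikey):
    """Build the group-topics URL in the same shape as the example above."""
    query = urllib.parse.urlencode({"alt": "json", "apikey": apikey})
    group_part = urllib.parse.quote(group, safe="")
    return "{}/group/{}/topics?{}".format(API_BASE, group_part, query)


def fetch_json(url, opener=urllib.request.urlopen):
    """Fetch a URL and decode the response body as JSON."""
    with opener(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would look like `fetch_json(build_topics_url("taotaopaoxiao", apikey))`, with `apikey` set to whatever key you are entitled to use.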

happywowwow

March 24, 2015

http://www.douban.com/group/haixiuzu/
The "Please Don't Be Shy" group
I wrote a scraper for this one a while back
hhh

muyi

March 24, 2015

Simulating a browser easily gets your IP banned. As mentioned above, use the official client's apikey and scrape through the API.

AnyOfYou

March 24, 2015

http://doc.scrapy.org/en/0.24/topics/practices.html#bans
The Scrapy documentation has a few tips on how to keep a crawler from getting banned:

rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
use download delays (2 or higher). See DOWNLOAD_DELAY setting.
if possible, use Google cache to fetch pages, instead of hitting the sites directly
use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
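The first three tips map onto a Scrapy project roughly as sketched below. `COOKIES_ENABLED` and `DOWNLOAD_DELAY` are the setting names quoted above; the `RotateUserAgentMiddleware` class, the `USER_AGENT_POOL` list, and the `myproject.middlewares` path are hypothetical and would need adapting to a real project. The middleware is a plain class, so the sketch does not import Scrapy.

```python
import random

# settings.py fragment: the two setting names come from the tips quoted above.
COOKIES_ENABLED = False  # do not send or store cookies
DOWNLOAD_DELAY = 2       # seconds to wait between requests to the same site

# Hypothetical pool; replace with real, current browser User-Agent strings.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:36.0) "
    "Gecko/20100101 Firefox/36.0",
]

# To enable the middleware, settings.py would also need something like:
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RotateUserAgentMiddleware": 400,  # hypothetical path
# }


class RotateUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent for each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        return None  # returning None lets Scrapy continue handling the request
```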

justlikemaki

March 24, 2015

..I've run into sites that deliberately return an error status code and yet still send back the page content.
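One way to check for that case with the standard library: `urllib` raises `HTTPError` on a 4xx/5xx status, but the exception object still carries the response body. The sketch below assumes Python 3; `fetch_status_and_body` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def fetch_status_and_body(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes), keeping the body even on HTTP errors."""
    try:
        with opener(url) as resp:
            return resp.getcode(), resp.read()
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object, so the body the server
        # sent along with the error status can still be read.
        return err.code, err.read()
```

If the body that comes back with a 403 is the real page, the status code can simply be ignored; if it is an error or captcha page, the block is genuine.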

darmody

March 24, 2015

Your code doesn't seem to add any delay between requests, so I'd guess the problem is the crawl rate.

v4dc

March 25, 2015

Pay attention to the bid in Douban's header
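v4dc does not say what to do with `bid`. One reading, which is an assumption here, is that Douban sets a cookie named `bid` and expects later requests to send it back. Under that assumption, a sketch that pulls the value out of a `Set-Cookie` header and builds the headers for the next request could look like this (`extract_bid` and `headers_with_bid` are hypothetical names):

```python
from http.cookies import SimpleCookie


def extract_bid(set_cookie_value):
    """Return the value of the bid cookie from a Set-Cookie header, or None."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_value)
    morsel = cookie.get("bid")
    return morsel.value if morsel is not None else None


def headers_with_bid(bid, user_agent):
    """Build request headers that send the bid cookie back to the server."""
    return {"User-Agent": user_agent, "Cookie": "bid=" + bid}
```

A cookie-aware opener (`urllib.request.HTTPCookieProcessor`) or a `requests.Session` would do the same bookkeeping automatically.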

v4dc

March 25, 2015

@aliao0019 headers

dedewei

March 25, 2015

@terrychang I didn't quite follow, but thanks. I'll try it if I run into this again.

dedewei

March 25, 2015

@lerry lxml and Requests: everyone seems to recommend these, so I'll keep studying. Thanks for the pointers!

dedewei

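March 25, 2015

@caoz Thanks a lot. I did a quick Google search at the time, found nothing, and gave up. I haven't used an API before, so I'm going to try it right now. Much appreciated.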

dedewei

March 25, 2015

@happywowwow Hahaha~ great material, off to scrape it right now!!!!!!!!!!

dedewei

March 25, 2015

@AnyOfYou Bookmarked..... I'll read it properly once I've had more practice......

lerry

March 25, 2015

@dedewei I use PyQuery. It lets you work with DOM elements the way jQuery does, which is very convenient.

penjianfeng

March 25, 2015

@happywowwow I took a look, and now I finally understand why people used to say douban is the real NSFW site -_-||

zjuster

March 25, 2015

Douban's anti-scraping measures were all forced on it by you guys.. haha. Please don't get me wrong, I mean no harm..
