Parsing a Baidu Tieba page with BeautifulSoup: why can I only extract the first 60-odd lines?

July 27, 2015

liaipeng

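import urllib  # Python 2; urlopen lives in urllib.request on Python 3
from bs4 import BeautifulSoup
f = urllib.urlopen(url).read()  # url is the Tieba thread address
soup = BeautifulSoup(f, 'html.parser')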

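With the code above, printing f shows the complete page, several hundred lines long, but printing soup gives only 60-odd lines. Scraping other pages works fine; only Baidu Tieba threads show this problem. What causes it?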

3676 views

Node

Python

12 replies

WhiteLament

July 27, 2015

Try changing 'html.parser' to 'lxml'?

lingo233

July 27, 2015

As I recall, Tieba only shows one page of content when you are not logged in.

iyaozhen

July 27, 2015

Reply #2 is probably the real answer.

liaipeng

July 27, 2015

@WhiteLament
I get the message below. I'm not familiar with the BeautifulSoup module yet; this is my first time using it.
Couldn't find a tree builder with the features you requested: lxml.parser. Do you need to install a parser library?

liaipeng

July 27, 2015

@lingo233 That's not it. The problem is that soup doesn't even contain the complete content of the main post.

yappa

July 27, 2015

Change html.parser to lxml or html5lib. Both modules have to be installed first.
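A minimal sketch of the suggestion above, assuming Python 3.4+ and that the parser name passed to BeautifulSoup is 'lxml' or 'html5lib' (not 'lxml.parser'). The helper name pick_parser is hypothetical, not part of BeautifulSoup; it only checks which optional parser packages are importable and falls back to the built-in 'html.parser' otherwise.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else the built-in one.

    pick_parser is an illustrative helper, not a BeautifulSoup API.
    """
    for name in preferred:
        # find_spec returns None when the top-level package is not installed
        if importlib.util.find_spec(name) is not None:
            return name
    # 'html.parser' ships with Python, so it is always available
    return "html.parser"


if __name__ == "__main__":
    print("Parser to pass to BeautifulSoup:", pick_parser())
```

The result can then be passed as the second argument, for example BeautifulSoup(f, pick_parser()), so the script still runs on a machine where neither lxml nor html5lib has been installed.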

liaipeng

July 27, 2015

@yappa
OK, I'll give it a try.

liaipeng

July 27, 2015

@yappa It works! Thank you so much. I'd like to know why this happens. What is different about Tieba threads compared with other pages?

WhiteLament

July 27, 2015

You haven't installed it.
pip install lxml

yappa

July 27, 2015

You probably copied the code from the documentation. "html.parser" just means "HTML parser"; you need to find a parser that suits the page, and lxml and html5lib are parsers of that kind.

WhiteLament

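July 27, 2015

Some pages are not well-formed, and different parsers tolerate that differently, so the results differ.
I've run into this too; switching to another parser fixed it.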
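To see this effect in isolation, here is a rough sketch that feeds one deliberately malformed snippet to each parser BeautifulSoup can find and counts the tags in the resulting tree. The snippet BROKEN_HTML and the helper compare_parsers are made up for illustration; they are not taken from the Tieba page. In BeautifulSoup, 'html.parser' is the name of the parser bundled with Python, while 'lxml' and 'html5lib' are separate packages, so missing ones are reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed markup: unclosed <p> tags and a stray </div>
BROKEN_HTML = "<div><p>first<p>second</div><p>third"


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of tags it produced, or None."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Raised when the requested parser library is not installed
            results[name] = None
            continue
        # find_all(True) matches every tag in the parsed tree
        results[name] = len(soup.find_all(True))
    return results


if __name__ == "__main__":
    for name, count in compare_parsers(BROKEN_HTML).items():
        print(name, "not installed" if count is None else count)
```

If the counts differ between parsers, each one repaired the broken markup in its own way, which is the same kind of disagreement that can leave soup shorter than the raw page.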

liaipeng

July 28, 2015

@WhiteLament
@yappa
Thanks to both of you!
