Target URL: https://tieba.baidu.com/p/4959928798. Viewing the page source in Chrome shows this snippet:
<a class="pb_nameplate j_nameplate j_self_no_nameplate" href="/tbmall/propslist?category=112&ps=24" data-field='{"props_id":"1120050972","end_time":"1512731564","title":"\u6d77\u8d3c\u738b\u7684\u53f3\u624b","optional_word":["\u7684","\u4e4b","\u306e"],"pattern":["1","1","1","2","3","3"]}' target="_blank">海贼王的右手</a>
Based on class="pb_nameplate j_nameplate j_self_no_nameplate",
I wrote a regex: (?<=pb_nameplate\sj_nameplate\sj_self_nameplate)[\s\S]*?(?=)
It would not match no matter what I tried, so I saved the raw response to a file to inspect it:
# -*- coding: utf-8 -*-
__author__ = 'duohappy'

import requests


def get_info_from(url):
    # Send a desktop Chrome User-Agent with the request.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
    }
    web_data = requests.get(url, headers=headers)
    web_data.encoding = 'utf-8'
    content = web_data.text
    # Write the raw response as UTF-8 regardless of the platform default.
    with open('./test.txt', 'w', encoding='utf-8') as f:
        f.write(content)


if __name__ == '__main__':
    url = 'http://tieba.baidu.com/p/4959928798'
    get_info_from(url)
Only then did I notice that the saved response contains this instead:
<a class="pb_nameplate j_nameplate j_self_nameplate" href="/tbmall/propslist?category=112&ps=24" data-field='{"props_id":"1120050972","end_time":"1512731564","title":"\u6d77\u8d3c\u738b\u7684\u53f3\u624b","optional_word":["\u7684","\u4e4b","\u306e"],"pattern":["1","1","1","2","3","3"]}' target="_blank">海贼王的右手</a>
class="pb_nameplate j_nameplate j_self_no_nameplate" has become class="pb_nameplate j_nameplate j_self_nameplate".
What technique is this, or am I doing something wrong?
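Whatever causes the difference, one way to sidestep it is to match only the two class names that appear in both versions and allow anything after them. The following is a minimal sketch, assuming pb_nameplate and j_nameplate stay the same between the browser view and the fetched response. NAMEPLATE_RE and extract_nameplates are hypothetical names, and the two sample strings are shortened from the snippets above (the data-field attribute is left out).

```python
import re

# Shortened copies of the two variants quoted in the post (assumption:
# only the third class name differs between them).
browser_html = (
    '<a class="pb_nameplate j_nameplate j_self_no_nameplate" '
    'href="/tbmall/propslist?category=112&ps=24" target="_blank">海贼王的右手</a>'
)
fetched_html = (
    '<a class="pb_nameplate j_nameplate j_self_nameplate" '
    'href="/tbmall/propslist?category=112&ps=24" target="_blank">海贼王的右手</a>'
)

# Match the two stable class names, tolerate any further class names,
# skip the remaining attributes, then capture the link text lazily.
NAMEPLATE_RE = re.compile(
    r'<a\s+class="pb_nameplate\sj_nameplate[^"]*"[^>]*>(.*?)</a>',
    re.S,
)


def extract_nameplates(html):
    """Return the link text of every nameplate anchor found in html."""
    return NAMEPLATE_RE.findall(html)


if __name__ == '__main__':
    print(extract_nameplates(browser_html))
    print(extract_nameplates(fetched_html))
```

Because the pattern no longer depends on the third class name, the same call works on both the browser source and the text saved to test.txt. An HTML parser would be more robust than a regex if the attribute order ever changes.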