I'm trying to scrape an Amazon product page (https://www.amazon.com/dp/B0B6TR2GTJ) with the following code:
import requests
from pyquery import PyQuery as pq  # jQuery-style HTML parsing

url = "https://www.amazon.com/dp/B0B6TR2GTJ"
# Browser-like request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'
}
r = requests.get(url, headers=headers)
print(r.status_code)
print("-------------------")
# Parse the returned HTML and print its <title> element
doc = pq(r.text)
print(doc("title"))
print("-------------------")
print(r.text)
I tried many variations of the headers, but the result is always the same: the request gets flagged as a bot. Output:
503
-------------------
<title>Sorry! Something went wrong!</title>
-------------------
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<!doctype html>
......
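To make the failure explicit rather than just printing it, a check along these lines could tell the block page apart from a real product page. This is only a minimal sketch: `looks_blocked` and `BLOCK_MARKERS` are names made up for illustration, and the marker strings are simply copied from the output above, so other block-page wordings would not be caught.

```python
# Marker strings copied from the response shown above. Treating them as
# a reliable signal is an assumption; other block pages may differ.
BLOCK_MARKERS = (
    "Sorry! Something went wrong!",
    "api-services-support@amazon.com",
)


def looks_blocked(status_code, body):
    """Return True if the response resembles the block page above."""
    # The blocked request above came back with HTTP 503.
    if status_code == 503:
        return True
    # Otherwise fall back to looking for the known marker text.
    return any(marker in body for marker in BLOCK_MARKERS)
```

With the script above it would be called as `looks_blocked(r.status_code, r.text)` before passing `r.text` to `pq`.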
I'm still a beginner at web scraping. Could someone more experienced help me out? Many thanks.