How can I scrape a bank's interest rates for each year with python (dynamic page)?

2015-09-05 18:38:46 +08:00
 explist
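I want to scrape the historical savings interest rates from ICBC's website, at http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx . I'd like to write the crawler using only the libraries that ship with python3, and would appreciate some pointers:
When I pick a different date, the URL doesn't change. Is this AJAX?
Pressing F12 in the browser does nothing, but I can view the page source. I extracted the list of dates from the source, and I assume the chosen date has to be sent to the server somehow, but I don't know how to actually do that.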
4701 views
Node: Python
18 replies
seki
2015-09-05 18:48:27 +08:00
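The form data is submitted via POST to the same address; the page simply reloads after each change.

Inspect the select and you'll see an onchange handler bound to it, then go look for the <script>. The code is in plain text too.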
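If the developer tools won't open, the same inspection can be done from Python. Below is a minimal sketch that lists every `<select>` and its `onchange` attribute using only the standard library. The `SAMPLE` markup is made up for illustration (shaped like a typical ASP.NET auto-postback dropdown); the real page's handler may look different.

```python
from html.parser import HTMLParser


class SelectFinder(HTMLParser):
    """Collect the name and onchange attribute of every <select> tag."""

    def __init__(self):
        super().__init__()
        self.selects = []  # list of (name, onchange) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "select":
            a = dict(attrs)
            self.selects.append((a.get("name"), a.get("onchange")))


# Hypothetical markup, NOT copied from the bank's page.
SAMPLE = """
<form method="post" action="rmbdeposit.aspx">
  <select name="Sel_Date" onchange="__doPostBack('Sel_Date','')">
    <option value="2015-08-26">2015-08-26</option>
  </select>
</form>
"""

finder = SelectFinder()
finder.feed(SAMPLE)
for name, onchange in finder.selects:
    print(name, "->", onchange)
```

To run it against the real page, feed it the downloaded HTML instead of `SAMPLE`.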
seki
2015-09-05 18:51:24 +08:00
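On the Python side, use urllib2 (urllib.request in Python 3) or requests to construct the same POST request.
As for whether the backend has any anti-scraping checks, I don't know. My conservative guess is that it doesn't; deal with it if it comes up.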
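A rough sketch of what "construct the same POST request" looks like with Python 3's standard library. The helper name `build_post` and the `User-Agent` value are assumptions for illustration; the `Sel_Date` field name is the one reported further down in this thread, and a real request to this page would need the other form fields as well.

```python
from urllib import parse, request

URL = "http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx"


def build_post(url, fields):
    """Return a urllib Request that sends `fields` as a form-encoded POST body."""
    body = parse.urlencode(fields).encode("utf-8")
    # Supplying `data` is what turns the request into a POST instead of a GET.
    return request.Request(url, data=body, headers={"User-Agent": "Mozilla/5.0"})


req = build_post(URL, {"Sel_Date": "2012-07-06"})
print(req.get_method(), req.data)
# To actually send it (network access needed):
#   html = request.urlopen(req).read().decode()
```

Nothing is sent until `request.urlopen(req)` is called, so the request can be inspected first.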
ljcarsenal
2015-09-05 19:33:47 +08:00
It's not AJAX. You can see that the select's onchange has an event handler bound to it.
rwalle
2015-09-05 19:40:23 +08:00
Look at the Network tab.
1130335361
2015-09-05 19:51:49 +08:00
explist
2015-09-05 19:52:18 +08:00
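I can't open the Network tab, maybe because this is a bank website.
I did see the onchange, but... but I can't make sense of it at all (I know very little about HTML).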
explist
2015-09-05 19:54:57 +08:00
from urllib import parse, request

def ghtest():
    url = r'http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx'
    req = request.Request(url)
    req.add_header("User-Agent", '')
    g = ghHtml()  # HTMLParser subclass (defined elsewhere) that collects the dates in g.dates
    with request.urlopen(req) as f:
        g.feed(f.read().decode())
    dataDict = {}
    for item in g.dates:
        dataDict['id'] = item
        log = parse.urlencode(dataDict).encode('utf-8')
        f = request.urlopen(url, log)
        # do something
        f.close()
paradoxs
2015-09-05 19:59:16 +08:00
Shy07
2015-09-05 20:25:36 +08:00
I wrote a Ruby version; you only need to change the date.

```ruby

require 'net/http'

params = {
'Sel_Date' => '2012-07-06', # just change the date
'__EVENTTARGET' => 'Sel_Date',
'__EVENTARGUMENT' => '',
'__LASTFOCUS' => '',
'__VIEWSTATE' => '/wEPDwUJNDkwNDM1MTYwD2QWAgIDD2QWAgIBD2QWBmYPEGQPFiFmAgECAgIDAgQCBQIGAgcCCAIJAgoCCwIMAg0CDgIPAhACEQISAhMCFAIVAhYCFwIYAhkCGgIbAhwCHQIeAh8CIBYhEAUP6K+36YCJ5oup5pe26Ze0ZWcQBQoyMDE1LTA4LTI2BQoyMDE1LTA4LTI2ZxAFCjIwMTUtMDYtMjgFCjIwMTUtMDYtMjhnEAUKMjAxNS0wNS0xMQUKMjAxNS0wNS0xMWcQBQoyMDE1LTAzLTAxBQoyMDE1LTAzLTAxZxAFCjIwMTQtMTEtMjIFCjIwMTQtMTEtMjJnEAUKMjAxMi0wNy0wNgUKMjAxMi0wNy0wNmcQBQoyMDEyLTA2LTA4BQoyMDEyLTA2LTA4ZxAFCjIwMTEtMDctMDcFCjIwMTEtMDctMDdnEAUKMjAxMS0wNC0wNgUKMjAxMS0wNC0wNmcQBQoyMDExLTAyLTA5BQoyMDExLTAyLTA5ZxAFCjIwMTAtMTItMjYFCjIwMTAtMTItMjZnEAUKMjAxMC0xMC0yMAUKMjAxMC0xMC0yMGcQBQoyMDA4LTEyLTIzBQoyMDA4LTEyLTIzZxAFCjIwMDgtMTEtMjcFCjIwMDgtMTEtMjdnEAUKMjAwOC0xMC0zMAUKMjAwOC0xMC0zMGcQBQoyMDA4LTEwLTA5BQoyMDA4LTEwLTA5ZxAFCjIwMDctMTItMjEFCjIwMDctMTItMjFnEAUKMjAwNy0wOS0xNQUKMjAwNy0wOS0xNWcQBQoyMDA3LTA4LTIyBQoyMDA3LTA4LTIyZxAFCjIwMDctMDctMjEFCjIwMDctMDctMjFnEAUKMjAwNy0wNS0xOQUKMjAwNy0wNS0xOWcQBQoyMDA3LTAzLTE4BQoyMDA3LTAzLTE4ZxAFCjIwMDYtMDgtMTkFCjIwMDYtMDgtMTlnEAUKMjAwNC0xMC0yOQUKMjAwNC0xMC0yOWcQBQoyMDAyLTAyLTIxBQoyMDAyLTAyLTIxZxAFCjE5OTktMDYtMTAFCjE5OTktMDYtMTBnEAUKMTk5OC0xMi0wNwUKMTk5OC0xMi0wN2cQBQoxOTk4LTA3LTAxBQoxOTk4LTA3LTAxZxAFCjE5OTgtMDMtMjUFCjE5OTgtMDMtMjVnEAUKMTk5Ny0xMC0yMwUKMTk5Ny0xMC0yM2cQBQoxOTk2LTA4LTIzBQoxOTk2LTA4LTIzZxAFCjE5OTYtMDUtMDEFCjE5OTYtMDUtMDFnFgFmZAIBDxYCHgRUZXh0BQoyMDE1LTA4LTI2ZAICDxYCHwAFohU8dGFibGUgYm9yZGVyPSIxIiBjZWxscGFkZGluZz0iMCIgY2VsbHNwYWNpbmc9IjAiIHdpZHRoPSI4NSUiICBydWxlcz0iYWxsIiBmcmFtZT0iYm9yZGVyIiBzdHlsZT0iYm9yZGVyLWNvbGxhcHNlOmNvbGxhcHNlOyBib3JkZXItY29sb3I6ICNDQ0NDQ0M7Ij48dGJvZHk+PHRyPjx0ZCB3aWR0aD0iNTclIiAgdmFsaWduPSJjZW50ZXIiIGJnY29sb3I9IiNlOGU4ZTgiPjxwIGFsaWduPSJjZW50ZXIiPjxiPumhueebrjwvYj48L3RkPjx0ZCB3aWR0aD0iNDMlIiBiZ2NvbG9yPSIjZThlOGU4IiBoZWlnaHQ9IjE5Ij48cCBhbGlnbj0iY2VudGVyIj48Yj7lubTliKnnjoc8L2I+JTwvdGQ+PC90cj48dHI+PHRkIHdpZHRoPSI1NyUiIGhlaWdodD0iMTkiIGFsaWduPSJsZWZ0Ij7kuIDjgIHln47kuaHlsYXmsJHlj4rljZXkvY3lrZjmrL48L3RkPjx0ZCBoZWlnaHQ9IjE5IiB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIj4mbmJzcDs8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD
0iNTclIiBoZWlnaHQ9IjE5Ij48ZGl2IGFsaWduPSJsZWZ0Ij7vvIjkuIDvvInmtLvmnJ88L2Rpdj48L3RkPjx0ZCBoZWlnaHQ9IjE5IiB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIj4wLjM1PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgaGVpZ2h0PSIxOSIgYWxpZ249ImxlZnQiPu+8iOS6jO+8ieWumuacnzwvdGQ+PHRkIGhlaWdodD0iMTkiIHdpZHRoPSI0MyUiIGFsaWduPSJjZW50ZXIiPiZuYnNwOzwvdGQ+PC90cj48dHI+PHRkIHdpZHRoPSI1NyUiIGhlaWdodD0iMTkiPjxkaXYgYWxpZ249ImxlZnQiPjEu5pW05a2Y5pW05Y+WPC9kaXY+PC90ZD48dGQgaGVpZ2h0PSIxOSIgd2lkdGg9IjQzJSIgYWxpZ249ImNlbnRlciI+Jm5ic3A7PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+5LiJ5Liq5pyIPC90ZD48dGQgd2lkdGg9IjQzJSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+MS42PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+5Y2K5bm0PC90ZD48dGQgd2lkdGg9IjQzJSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+MS44PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+5LiA5bm0PC90ZD48dGQgd2lkdGg9IjQzJSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+MjwvdGQ+PC90cj48dHI+PHRkIHdpZHRoPSI1NyUiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPuS6jOW5tDwvdGQ+PHRkIHdpZHRoPSI0MyUiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPjIuNTwvdGQ+PC90cj48dHI+PHRkIHdpZHRoPSI1NyUiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPuS4ieW5tDwvdGQ+PHRkIHdpZHRoPSI0MyUiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPjM8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kupTlubQ8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4zLjA1PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgaGVpZ2h0PSIxOSI+PGRpdiBhbGlnbj0ibGVmdCI+Mi7pm7blrZjmlbTlj5bjgIHmlbTlrZjpm7blj5bjgIHlrZjmnKzlj5bmga88L2Rpdj48L3RkPjx0ZCB3aWR0aD0iNDMlIiBoZWlnaHQ9IjE5Ij4mbmJzcDs8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kuIDlubQ8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4xLjY8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kuInlubQ8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4xLjg8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kupTlubQ8L3RkPjx0ZC
B3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4xLjg1PC90ZD48L3RyPjx0cj48dGQgaGVpZ2h0PSIxOSI+PGRpdiBhbGlnbj0ibGVmdCI+My7lrprmtLvkuKTkvr88L2Rpdj48L3RkPjx0ZCBjb2xzcGFuPSIyIiBoZWlnaHQ9IjE5IiBhbGlnbj0ibGVmdCI+5oyJ5LiA5bm05Lul5YaF5a6a5pyf5pW05a2Y5pW05Y+W5ZCM5qGj5qyh5Yip546H5omTNuaKmDwvdGQ+PC90cj48dHI+PHRkIGhlaWdodD0iMTkiPjxkaXYgYWxpZ249ImxlZnQiPuS6jOOAgeWNj+WumuWtmOasvjwvZGl2PjwvdGQ+PHRkIGNvbHNwYW49IjIiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPjEuMTU8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBoZWlnaHQ9IjE5Ij48ZGl2IGFsaWduPSJsZWZ0Ij7kuInjgIHpgJrnn6XlrZjmrL48L2Rpdj48L3RkPjx0ZCB3aWR0aD0iNDMlIiBoZWlnaHQ9IjE5Ij48Zm9udCBjb2xvcj0iI2ViZWJlYiI+LjwvZm9udD48L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kuIDlpKk8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4wLjg8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kuIPlpKk8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4xLjM1PC90ZD48L3RyPjwvdGFibGU+ZGRDrgsxnIFuzBq+7MoE9zn85XGzBQ=='
}

uri = URI.parse("http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx")
res = Net::HTTP.post_form uri, params
puts res.body
```
Shy07
2015-09-05 20:58:35 +08:00
Finished version:

require 'net/http'

uri = URI.parse("http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx")
html = Net::HTTP.get(uri)
dates = []
html.scan(/<option value="(\d{4}-\d{2}-\d{2})">/) { |s| dates += s }
html =~ /name="__VIEWSTATE" id="__VIEWSTATE" value="(.*)" \/>/

params = {
  '__EVENTTARGET' => 'Sel_Date',
  '__EVENTARGUMENT' => '',
  '__LASTFOCUS' => '',
  '__VIEWSTATE' => $1.clone
}

dates.each do |date|
  params['Sel_Date'] = date
  res = Net::HTTP.post_form(uri, params)
  # Extracting the actual values with a regex is left out; this just saves the raw HTML =_=b
  open("#{date}.html", 'w') { |io| io.write res.body }
end
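Since the question asked for python3 with only the built-in libraries, here is a rough port of the Ruby script above. It is a sketch, not a tested scraper: the function names are made up for illustration, the regexes simply mirror the Ruby ones and assume the page markup matches them, and the UTF-8 decoding is an assumption about the page's encoding.

```python
import re
from urllib import parse, request

URL = "http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx"

# Same patterns as the Ruby version.
DATE_RE = re.compile(r'<option value="(\d{4}-\d{2}-\d{2})">')
VIEWSTATE_RE = re.compile(r'name="__VIEWSTATE" id="__VIEWSTATE" value="(.*?)" />')


def extract_dates(html):
    """Return every date found in the dropdown's <option> tags."""
    return DATE_RE.findall(html)


def extract_viewstate(html):
    """Return the Base64 value of the hidden __VIEWSTATE field."""
    match = VIEWSTATE_RE.search(html)
    if match is None:
        raise ValueError("__VIEWSTATE field not found")
    return match.group(1)


def build_form(viewstate, date):
    """Build the five form fields; only Sel_Date changes between requests."""
    return {
        "__EVENTTARGET": "Sel_Date",
        "__EVENTARGUMENT": "",
        "__LASTFOCUS": "",
        "__VIEWSTATE": viewstate,
        "Sel_Date": date,
    }


def download_all(url=URL):
    """Fetch the page once, then POST each date and save the raw HTML."""
    with request.urlopen(url) as resp:
        # Encoding is assumed; the fields we extract are ASCII either way.
        html = resp.read().decode("utf-8", errors="replace")
    viewstate = extract_viewstate(html)
    for date in extract_dates(html):
        body = parse.urlencode(build_form(viewstate, date)).encode("utf-8")
        with request.urlopen(url, body) as resp:
            page = resp.read()
        with open(f"{date}.html", "wb") as out:
            out.write(page)
```

Calling `download_all()` would write one `<date>.html` file per dropdown entry, just like the Ruby loop; parsing the rate table out of each file is still left to do.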
ljdawn
2015-09-05 21:10:53 +08:00
First grab the dates on the left, then POST them one by one.
explist
2015-09-05 21:42:58 +08:00
Once I have the list of dates, how do I construct the POST request?
imlonghao
2015-09-05 21:45:19 +08:00
@explist RTFM
Shy07
2015-09-05 22:01:32 +08:00
@explist
The form has just five parameters; POST them to the original address:
'__EVENTTARGET' => 'Sel_Date', fixed
'__EVENTARGUMENT' => '', fixed
'__LASTFOCUS' => '', fixed
'__VIEWSTATE' => that Base64 string, fixed
'Sel_Date' => the date, variable
explist
2015-09-05 22:12:52 +08:00
@Shy07 That works now. How did you figure out how they map to each other?
miemiekurisu
2015-09-05 22:23:03 +08:00
....Why not just spin up scrapy and pull the page data with xpath... saves time and effort...
Shy07
2015-09-05 22:34:42 +08:00
@explist
Look at its JS: in the end it calls submit, so you just need to find every submittable form element on the page.
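Finding the submittable elements can also be automated. The sketch below walks the HTML with the standard-library parser and collects each `<input>` value plus the chosen `<option>` of each `<select>`. It is deliberately simplified: it ignores input types (checkboxes, buttons) and `<textarea>`, and the `SAMPLE` markup is invented to resemble the fields listed earlier in this thread, not copied from the real page.

```python
from html.parser import HTMLParser


class FormFields(HTMLParser):
    """Collect name/value pairs a browser would submit (simplified)."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._select = None  # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("name"):
            self.fields[a["name"]] = a.get("value") or ""
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
        elif tag == "option" and self._select:
            # Default to the first option; a "selected" option overrides it.
            first = self._select not in self.fields
            if first or "selected" in a:
                self.fields[self._select] = a.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


# Hypothetical markup for illustration only.
SAMPLE = """
<form method="post" action="rmbdeposit.aspx">
  <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dummy==" />
  <select name="Sel_Date">
    <option value="2015-08-26">2015-08-26</option>
    <option selected="selected" value="2012-07-06">2012-07-06</option>
  </select>
</form>
"""

parser = FormFields()
parser.feed(SAMPLE)
print(parser.fields)
```

The resulting dict has the same five keys as the parameter list above, so it can be urlencoded and posted after overwriting `Sel_Date` with the date you want.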
explist
2015-09-06 23:08:35 +08:00
Asking for learning purposes:
How would I scrape this China Construction Bank page: http://www.ccb.com/cn/personal/interest/rmbdeposit.html ? There is no table tag in its source.


https://www.v2ex.com/t/218416
