[Crawler] PHP: ajax, file_get_contents, and Snoopy all fail to fetch the dynamic page of the 壹心理 (xinli001) FM station

2014-12-22 23:05:46 +08:00
 bosshida
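I'm trying to scrape the FM info from 壹心理 (xinli001), for example http://fm.xinli001.com/#4916186. Using Firebug, I found that after the page loads it sends a request to
http://fm.xinli001.com/broadcast/?pk=4916186&t=1419258885474, which returns the basic info for the current FM episode. When I request that URL with ajax, it returns 403 FORBIDDEN.
Firefox also reports: "Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://fm.xinli001.com/broadcast/?pk=12139723&t=1419253731488. This can be fixed by moving the resource to the same domain or enabling CORS."
JS code: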
$.ajax({
    type: "get",
    dataType: "json",
    url: "http://fm.xinli001.com/broadcast/?pk=4916186&t=1419258885474",
    success: function (data) { alert('ok'); },
    timeout: 30000,
    error: function (XMLHttpRequest, textStatus, errorThrown) {
        alert('error');
    }
});

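After some googling I found a suggestion to add jQuery.support.cors = true; but that doesn't work either.
Adding the relevant request headers doesn't work either. I'm out of ideas.
Is there any way to get the FM's basic info?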
6651 views
Node: Q&A
13 replies
Jat001
2014-12-22 23:18:32 +08:00
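Send these headers:
X-Requested-With XMLHttpRequest
Referer http://fm.xinli001.com/
Writing a crawler is just imitating a browser. Look at which headers the browser sends, then remove them one at a time until the request fails; that tells you which headers are required.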
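The "remove headers one at a time" procedure can be sketched as a small greedy loop. This is only an illustration: `minimizeHeaders` and `probe` are hypothetical names, and `probe` stands for whatever function you write that sends the request with the given headers and resolves to true when the server accepts it. It assumes the full header set works to begin with.

```javascript
// Greedy header minimization: try dropping each header in turn and keep it
// dropped only if the request still succeeds without it.
// `probe` is a hypothetical, caller-supplied async function:
//   probe(headers) -> Promise<boolean>  (true = server accepted the request)
async function minimizeHeaders(allHeaders, probe) {
  const required = { ...allHeaders };
  for (const name of Object.keys(allHeaders)) {
    const trial = { ...required };
    delete trial[name];
    if (await probe(trial)) {
      // The request still works without this header, so it is not needed.
      delete required[name];
    }
  }
  return required;
}
```

Start from the complete set of headers copied from the browser's network panel. A single pass like this can depend on header order if the server accepts alternatives, so treat the result as a starting point rather than a proof.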
fising
2014-12-22 23:38:37 +08:00
Your ajax request is cross-origin, so the browser blocked it.
bosshida
2014-12-23 00:13:00 +08:00
@fising Is there any way around it?
Jat001
2014-12-23 00:37:07 +08:00
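@bosshida Either their server sets the Access-Control-Allow-Origin header (and you obviously don't have permission to do that), or you do it with something like userscripts.
Actually, I think this kind of request is best handled on the server side.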
esile
2014-12-23 01:52:08 +08:00
Setting Referer and X-Requested-With is enough to fetch it successfully.

Here is the response from my test:
{"code": 0, "data": {"favnum": 398, "commentnum": 120, "speaker_id": 108, "is_home": true, "background": "http://image.xinli001.com/20141220/18083879570a3ec9b9a360.jpg", "speak_url": "http://www.xinli001.com/user/742450/", "duration": 1283, "tags": [], "weight": 397, "title1": "", "_cache_key": "data_fm_broadcast_4916186", "article": null, "specials": [], "_id": "54954aea4f670ade3e8b4a1b", "range": 20535196, "word": "\u6625\u6653", "speakers_id": [], "lizhi_url": "", "created": "2014-12-20 18:01", "word_url": "http://www.xinli001.com/user/article/3866918/", "speak": "\u5cf0_\u5c0f\u5cf0", "id": 4916186, "is_teacher": false, "message_url": "", "cover": "http://image.xinli001.com/20141220/18094254011b53336c1227.jpg", "title": "\u6211\u548c\u90b5\u6bdb\u6bdb\u7684\u65e5\u4e0e\u591c", "url": "http://image.kaolafm.net/mz/audios/201412/a59b5e60-e515-4804-88f5-64f167aa957e.mp3", "absolute_url": "http://fm.xinli001.com/4916186/", "content": "\u4e0d\u8bba\u751f\u6d3b\u5728\u54ea\u91cc\uff0c\u53ea\u8981\u5728\u4e00\u8d77\u5c31\u597d\u4e86\u3002\u6211\u4eec\u5728\u83dc\u5e02\u573a\u4e70\u83dc\uff0c\u5728\u623f\u95f4\u91cc\u505a\u996d\uff0c\u996d\u540e\u6cbf\u7740\u8857\u8fb9\u6563\u6b65\uff0c\u4e00\u8d77\u770b\u592a\u9633\u5347\u8d77\uff0c\u592a\u9633\u843d\u4e0b\uff0c\u8fd9\u6837\u5c31\u8db3\u591f\u4e86\u3002", "url1": ""}}
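For readers who want to use this response, here is a minimal sketch of pulling a few fields out of it. The field names come from the response above; `summarize` is a hypothetical helper, the sample is trimmed to a handful of fields, and treating `code === 0` as "success" is an assumption based on this one response.

```javascript
// A trimmed copy of the response above (only a few fields kept).
const sample = '{"code": 0, "data": {"id": 4916186, ' +
  '"title": "\\u6211\\u548c\\u90b5\\u6bdb\\u6bdb\\u7684\\u65e5\\u4e0e\\u591c", ' +
  '"speak": "\\u5cf0_\\u5c0f\\u5cf0", "duration": 1283, ' +
  '"url": "http://image.kaolafm.net/mz/audios/201412/a59b5e60-e515-4804-88f5-64f167aa957e.mp3"}}';

// Hypothetical helper: parse the body and keep the fields a crawler
// would most likely want. JSON.parse decodes the \uXXXX escapes itself.
function summarize(body) {
  const parsed = JSON.parse(body);
  if (parsed.code !== 0) {
    // Assumption: a non-zero code signals an error.
    throw new Error('unexpected response code: ' + parsed.code);
  }
  const d = parsed.data;
  return {
    id: d.id,
    title: d.title,
    speaker: d.speak,
    durationSeconds: d.duration,
    audioUrl: d.url,
  };
}
```

Calling `summarize(sample)` yields an object whose `title` is already a normal Unicode string, so no separate unescaping step is needed.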
bosshida
2014-12-23 10:17:52 +08:00
@Jat001 I added every header I could, but nothing works. I went through Firefox's request headers and added them one by one, and it still returns 403 FORBIDDEN.

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<script src="./jquery-2.0.0.min.js"></script>

<script type="text/javascript">
function test(){
    $.ajax({
        type : "get",
        url : "http://fm.xinli001.com/broadcast/",
        datatype:"json",
        data: "pk=97701348&t=1419296643104",
        headers:{
            "Referer":"http://fm.xinli001.com/",
            "X-Requested-With":"XMLHttpRequest",
            "Accept":"*/*",
            "Accept-Encoding":"gzip, deflate",
            "Accept-Language":" zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
            "Connection":"keep-alive",
            "Host":"fm.xinli001.com",
            "User-Agent":"Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0",
        },
        success : function(json){
            alert('ok');
        },
        error:function(){
            alert('fail');
        }
    });
}
</script>

<title>parseFm</title>
</head>
<body>
<input type="button" value="test" onclick="test();">
</body>
</html>
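A likely reason the page above still gets 403: browsers refuse to let scripts set certain request headers, so the real request probably goes out with the page's own Referer rather than the one in `headers`. The sketch below is illustrative only. `partitionHeaders` is a hypothetical helper, and the set lists only some of the forbidden names from the XMLHttpRequest/Fetch specifications; to my understanding User-Agent was on that list in 2014 and was removed from later revisions of the Fetch standard.

```javascript
// A partial list of header names that browser scripts may not set.
// Names starting with "Proxy-" or "Sec-" are also forbidden.
const FORBIDDEN = new Set([
  'accept-charset', 'accept-encoding', 'connection', 'content-length',
  'cookie', 'date', 'host', 'origin', 'referer', 'transfer-encoding',
  'user-agent', // assumption: forbidden in the 2014-era spec, allowed later
]);

// Hypothetical helper: split a headers object into the entries a browser
// would accept from script and the ones it would refuse to set.
function partitionHeaders(headers) {
  const allowed = {};
  const blocked = {};
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    const isForbidden = FORBIDDEN.has(lower) ||
      lower.startsWith('proxy-') || lower.startsWith('sec-');
    (isForbidden ? blocked : allowed)[name] = value;
  }
  return { allowed, blocked };
}
```

Run against the `headers` object in the page above, this puts Referer, Host, Connection, Accept-Encoding and User-Agent in `blocked`, leaving only Accept, Accept-Language and X-Requested-With. Even with the right headers, reading a cross-origin response still requires the server to opt in via CORS.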
yrdr
2014-12-23 10:18:23 +08:00
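First, you're making a cross-origin request, so use JSONP.
Second, you didn't set the HTTP headers, so the server probably rejected the request.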
bosshida
2014-12-23 10:18:27 +08:00
@esile How did you get it to work? Could you post your test code?
zhangwei727
2014-12-23 12:10:54 +08:00
@esile I'd like the test source too: 505600376@qq.com. Thanks!
nilennoct
2014-12-23 13:41:04 +08:00
@bosshida For this kind of thing, don't bother doing it in the browser; just use Node ==
bosshida
2014-12-23 20:55:48 +08:00
@yrdr I tried JSONP and it still doesn't work. Both the jQuery version and the native JS JSONP version return 403 forbidden.
jQuery:
<script type="text/javascript">
function haha(){
    $.ajax({
        type : "get",
        async:false,
        url : "http://fm.xinli001.com/broadcast/",
        data: "pk=97701348&t=1419336731430",
        dataType: "jsonp",
        jsonpCallback:"fmHandler",
        headers:{
            "Referer":"http://fm.xinli001.com/",
            "X-Requested-With":"XMLHttpRequest",
            "Accept":"*/*",
            "Accept-Encoding":"gzip, deflate",
            "Accept-Language":" zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
            "Connection":"keep-alive",
            "Host":"fm.xinli001.com",
            "User-Agent":"Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0",
        },
        success : function(json){
            console.log(json);
            alert('ok');
        },
        error:function(){
            alert('fail');
        }
    });
}
</script>

Native JS:
<script type="text/javascript">
var myFmHandler = function(data){
    alert('ok');
};
var url = "http://fm.xinli001.com/broadcast/?pk=97701348&t=1419336731430&callback=myFmHandler";
var script = document.createElement('script');
script.setAttribute('src', url);
document.getElementsByTagName('head')[0].appendChild(script);
</script>

I've never used the Node.js mentioned above, so I'll learn it on the fly now...
esile
2014-12-24 11:01:38 +08:00
@bosshida @zhangwei727 Do you really need to make it that "responsible" (负责)?
<?php
// Fetch a URL with cURL, sending the two headers the server checks:
// X-Requested-With and Referer.
function fetchpage($url, $referer)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Mark the request as an XMLHttpRequest, as the site's own page does.
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
    // Do not include response headers in the returned body.
    curl_setopt($ch, CURLOPT_HEADER, false);
    // Return the body as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6; .NET CLR 2.0.50727; CIBA)");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    $temp = curl_exec($ch);
    curl_close($ch);
    return $temp;
}

var_dump(fetchpage('http://fm.xinli001.com/broadcast/?pk=4916186&t=1419258885474', 'http://fm.xinli001.com/'));
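Since Node was also suggested in this thread, here is a rough Node.js counterpart of the PHP above, as a sketch only. `buildRequestOptions` and `fetchBroadcast` are hypothetical names; the host, path and the two headers are taken from the thread, and it assumes the endpoint still answers over plain HTTP as it did in 2014, which may no longer be true.

```javascript
const http = require('http');

// Build the options for the broadcast request. Host, path and headers
// mirror the thread; `pk` is the episode id, `timestamp` the `t` parameter.
function buildRequestOptions(pk, timestamp = Date.now()) {
  return {
    host: 'fm.xinli001.com',
    path: `/broadcast/?pk=${encodeURIComponent(pk)}&t=${timestamp}`,
    headers: {
      'X-Requested-With': 'XMLHttpRequest',
      'Referer': 'http://fm.xinli001.com/',
    },
  };
}

// Send the request and resolve with the parsed JSON body.
function fetchBroadcast(pk) {
  return new Promise((resolve, reject) => {
    const req = http.get(buildRequestOptions(pk), (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        if (res.statusCode !== 200) {
          reject(new Error(`HTTP ${res.statusCode}`));
          return;
        }
        try {
          resolve(JSON.parse(body));
        } catch (err) {
          reject(err);
        }
      });
    });
    req.on('error', reject);
    // Socket inactivity timeout; not identical to cURL's connect timeout.
    req.setTimeout(10000, () => req.destroy(new Error('request timed out')));
  });
}

// Usage (performs a real network request):
// fetchBroadcast(4916186).then(console.log, console.error);
```

Because this runs outside the browser, neither the same-origin policy nor the forbidden-header rules apply, which is why the Referer can be set freely here.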
esile
2014-12-24 11:02:21 +08:00
"responsible" (负责) should read "complicated" (复杂). o(︶︿︶)o Sigh, pinyin input strikes again.

Original thread: https://www.v2ex.com/t/155870
