Extracting a URL from a POST Response in Python

2016-04-10 08:00:15 +08:00

haomni

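I saw someone asking for help at http://v2ex.com/t/269567, so I'm planning to add an exurl on top of @hzlzh's script.

I only started learning Python recently and I'm not very skilled yet. Right now I'm stuck on extracting the returned link, http://test.long.url/, from the HTML below that the POST returns. Everything else is more or less done.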

<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr>	</table>

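I read the two posts below and spent ages on it, but still couldn't get it working, so I'm asking for help.

Thanks in advance, everyone~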

10154 views

Node

Python

21 replies

eoo

2016-04-10 08:28:45 +08:00

What's the address you're POSTing to?

haomni

2016-04-10 08:50:38 +08:00

@eoo Thanks for the reply. Extracting the URL from the POST result shouldn't need the original POST address, I think...

virusdefender

2016-04-10 08:54:00 +08:00

# coding=utf-8
import re

html = """
<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/"
"""

# Non-greedy capture of the text inside the shorturl cell
print(re.compile(r'<td class="shorturl">([\s\S]*?)</td>').findall(html)[0])

haomni

2016-04-10 09:00:24 +08:00

@virusdefender Thanks, but this finds the shorturl.
After I changed it to longurl, the result was:
<a href="http://test.long.url/" target="_blank">http://test.long.url/</a>

Still not what I'm after...

uyhyygyug1234

2016-04-10 09:05:32 +08:00

uyhyygyug1234

2016-04-10 09:06:25 +08:00

This works, but it's really ugly. You should use bs4, pyquery, or something like that.

haomni

2016-04-10 09:14:28 +08:00

@uyhyygyug1234 The result doesn't seem quite right, though that may be because I didn't paste the whole Response.

>>> print re.compile('href="(.*)"').findall(req.content)[0]
/screen.css

class="longurl" is unique in the whole Response; what I need is the link that follows it.
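
One regex-only way to get there, sketched under the assumption that the cell always looks like the fragment in the original post: anchor the pattern on the unique class="longurl" marker and capture only the href value. The `page` string, including its stylesheet line, is a made-up stand-in for the full response, and `extract_long_url` is a hypothetical helper name.

```python
import re

# Made-up stand-in for the full response body; only the table comes from the thread.
page = '''<link rel="stylesheet" href="/screen.css">
<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr> </table>'''

# Anchor on the unique class="longurl" cell, then capture the href of the
# <a> tag that directly follows it. [^"]* stops at the closing quote and
# [^>]* keeps the match inside a single tag.
LONGURL_RE = re.compile(r'<td class="longurl">\s*<a\b[^>]*\bhref="([^"]*)"')


def extract_long_url(html):
    """Return the long URL, or None if the longurl cell is not found."""
    match = LONGURL_RE.search(html)
    return match.group(1) if match else None


print(extract_long_url(page))
```

Because the pattern starts at the longurl cell, an earlier href in the page (such as a stylesheet link) is never considered. A regex like this is tied to the exact markup, so an HTML parser is usually the more robust choice.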

sh4n3

2016-04-10 09:31:05 +08:00

Just use a CSS selector like .longurl a.
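
A minimal sketch of that selector in Python, assuming the third-party beautifulsoup4 package is installed (the choice of library is an assumption; the reply names only the selector). `select_one` returns the first element matching a CSS selector, or None when nothing matches.

```python
from bs4 import BeautifulSoup

html = '''<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr> </table>'''

# html.parser ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, 'html.parser')

# ".longurl a" means: any <a> that is a descendant of an element with class "longurl".
link = soup.select_one('.longurl a')

# Guard against a response that does not contain the cell at all.
long_url = link['href'] if link is not None else None
print(long_url)
```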

ericls

2016-04-10 09:32:04 +08:00

Just do it with pyquery.

eoo

2016-04-10 09:56:07 +08:00

@haomni It's easy with PHP

<?php

$str='<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr> </table>';

$zz='#<td class="longurl"><a href="(.*?)" target="_blank">.*?</a></td>#';

preg_match($zz, $str, $matchs);

print_r($matchs);

haomni

2016-04-10 10:03:31 +08:00

@uyhyygyug1234
@ericls
I tried PyQuery, but I'm probably not using it correctly:
print doc1('class:contains("longurl")')

@eoo I'm not going to switch to PHP now; everything else is already written.

eoo

2016-04-10 10:08:59 +08:00

@haomni OK. What are you writing?

seki

2016-04-10 10:09:31 +08:00

Why not use beautifulsoup or lxml?

seki

2016-04-10 10:10:09 +08:00

Well, for instance, what does your bs4 code that failed to extract it look like?

haomni

2016-04-10 10:16:12 +08:00

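Thanks, everyone. I don't really know how to use PyQuery.
I got it working by applying one more regex on top of @uyhyygyug1234's approach.

@seki Sigh, I'd like to use it, but I don't know how...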

longchisihai

2016-04-10 10:35:12 +08:00

from bs4 import BeautifulSoup

html = '''<table id="response" border="0" cellpadding="0" cellspacing="0">
<tr><td class="shorturl">http://test/shorturl</td><td class="longurl"><a href="http://test.long.url/" target="_blank">http://test.long.url/</a></td></tr> </table>'''

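# Parse the fragment with the lxml parser (requires the lxml package)
soup = BeautifulSoup(html, 'lxml')

# Locate the <td class="longurl"> cell
longurl_tag = soup.find('td', class_='longurl')

# The cell's first child is the <a> tag; read its href attribute
print(longurl_tag.contents[0].get('href'))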

haomni

2016-04-10 10:48:42 +08:00

@longchisihai Absolutely perfect, thanks!

haomni

2016-04-10 10:54:05 +08:00

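The rough shape is there.
I was up all night on this, so I'll sleep for a bit and test again when I wake up. Here's a screenshot for now.
Thanks again to all the experts for the help. I'll upload the result to Github as open source later.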

hzlzh

2016-04-10 11:02:42 +08:00

Nicely done~

haomni

2016-04-10 11:05:22 +08:00

@hzlzh It's all built on your earlier work, and some details still need polishing.
I'll get in touch once it's done~

Page 1 of 2


https://www.v2ex.com/t/269878
