What regex does Douban use to extract URLs from profile bios? It seems quite powerful.

2013-02-23 03:48:13 +08:00

paloalto

http://xxx.xxxx.com
xxx.xxxx.com
xxxx.com
xxxx.me
xxxx.it

I tried these, and it matched essentially all of them.

By comparison, the one I'm using now looks hopelessly weak:

import re
def replace_links(s):
    # flags must be passed by keyword: the fourth positional parameter of re.sub is count
    return re.sub(r'(http://[^\s]+)', r'<a rel="nofollow noopener" href="\1">\1</a>', s, flags=re.M)

Any pointers or improvements would be appreciated.
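
One way to cover the five sample forms listed above is to make the scheme optional and require a dotted host that ends in an alphabetic TLD. This is only a minimal sketch and not Douban's actual regex, which is not shown in this thread; `URL_RE` and the rewritten `replace_links` are illustrative names, and the input is assumed to be already HTML-escaped.

```python
import re

# Hypothetical pattern: optional http(s) scheme, one or more dot-separated
# labels, an alphabetic TLD of two or more letters, then an optional path.
URL_RE = re.compile(
    r'\b((?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/[^\s<>"\']*)?)',
    re.IGNORECASE,
)


def replace_links(s):
    """Wrap anything URL-like in s in an <a> tag (s is assumed pre-escaped)."""
    def to_anchor(m):
        url = m.group(1)
        # Bare domains need a scheme, otherwise the href would be relative.
        has_scheme = url.lower().startswith(('http://', 'https://'))
        href = url if has_scheme else 'http://' + url
        return '<a rel="nofollow noopener" href="%s">%s</a>' % (href, url)
    return URL_RE.sub(to_anchor, s)


samples = 'http://xxx.xxxx.com xxx.xxxx.com xxxx.com xxxx.me xxxx.it'
matches = [m.group(1) for m in URL_RE.finditer(samples)]
```

A pattern this loose trades precision for recall: it would also link things like file names (`file.txt`) or the domain part of an e-mail address, so a real implementation would likely need a TLD whitelist or extra context checks.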

4 replies

rankjie

2013-02-23 11:31:30 +08:00

Don't use regex to parse HTML:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Mutoo

2013-02-23 11:37:31 +08:00

@rankjie Parsing URLs and parsing HTML are two entirely different things.

OP, you could refer to some existing regexes:
http://regexlib.com/DisplayPatterns.aspx?cattabindex=1&categoryId=2&AspxAutoDetectCookieSupport=1

Or build your own from the URI definition in RFC 3986 (an IETF standard; see page 50):
http://www.ietf.org/rfc/rfc3986.txt
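
For reference, Appendix B of RFC 3986 gives a regular expression that splits a URI reference into its components. A minimal sketch of using it from Python follows; `split_uri` is an illustrative helper name, not something defined by the RFC.

```python
import re

# Regular expression from RFC 3986, Appendix B. Every part is optional,
# so it matches any string; it splits a URI rather than validating one.
URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)


def split_uri(uri):
    """Return the five RFC 3986 components; absent ones are None."""
    m = URI_RE.match(uri)
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }


parts = split_uri('http://www.ics.uci.edu/pub/ietf/uri/#Related')
bare = split_uri('xxxx.com')
```

Note that this expression decomposes a string already known to be a URI; it does not locate URIs inside free text. A bare `xxxx.com` is treated as a relative path with no scheme or authority, so finding links in a bio still needs a separate detection pattern.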

rankjie

2013-02-23 12:04:24 +08:00

@Mutoo I saw a </a> in the OP's match, so it looked like it was parsing HTML. I don't know regex =_= so please correct me if I'm wrong.

CoX

2013-02-23 13:46:00 +08:00

OP, you could try tornado.escape.linkify.
Its regex is a bit more elaborate: _URL_RE = re.compile(ur"""\b((?:([\w-]+):(/{1,3})|www[.])(?:(?:(?:[^\s&()]|&amp;|&quot;)*(?:[^!"#$%&'()*+,.:;<=>?@\[\]^`{|}~\s]))|(?:\((?:[^\s&()]|&amp;|&quot;)*\)))+)""")
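
To see what that pattern does, here is a minimal sketch that runs it with only the standard library, in Python 3 syntax (the `ur` prefix is Python 2 only). It assumes the two entity alternatives are `&amp;` and `&quot;`, since the pattern is meant for text that has already been HTML-escaped; `find_urls` is a hypothetical helper, not part of Tornado.

```python
import html
import re

# The quoted pattern as a Python 3 raw string, split over two literals.
# Assumption: the entity alternatives are &amp; and &quot;.
_URL_RE = re.compile(
    r"""\b((?:([\w-]+):(/{1,3})|www[.])(?:(?:(?:[^\s&()]|&amp;|&quot;)*"""
    r"""(?:[^!"#$%&'()*+,.:;<=>?@\[\]^`{|}~\s]))|(?:\((?:[^\s&()]|&amp;|&quot;)*\)))+)"""
)


def find_urls(text):
    """HTML-escape text, then return the URL-like substrings found in it."""
    escaped = html.escape(text)
    return [m.group(1) for m in _URL_RE.finditer(escaped)]


found = find_urls('see http://example.com/a?b=1, or www.example.org.')
with_query = find_urls('http://example.com/?a=1&b=2')
bare = find_urls('xxxx.com')
```

The final character class keeps trailing punctuation such as `,` or `.` out of the match, which the simple `http://[^\s]+` pattern does not. The pattern still requires either a scheme or a `www.` prefix, so a bare `xxxx.com` is not matched.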

https://www.v2ex.com/t/61115

V2EX 是创意工作者们的社区，是一个分享自己正在做的有趣事物、交流想法，可以遇见新朋友甚至新机会的地方。

V2EX is a community of developers, designers and creative people.

© 2021 V2EX