How do I use BeautifulSoup to extract the URL from this a tag?

2017-09-01 22:01:11 +08:00

saximi

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def getLinks(articleUrl):
    html = urlopen("http://www.ccb.com/cn/home"+articleUrl)
    bsObj = BeautifulSoup(html,'lxml')
    print('bsObj.find=',bsObj.find("div", {"class":"Language_select"}).findAll("a"))
    return bsObj.find("div", {"class":"Language_select"}).findAll("a",re.compile("^(<a href=\")(.*(?!\">繁体))$"))

links = getLinks("/indexv3.html")
print('links=',links)

The output is:
bsObj.find= [<a href="http://fjt.ccb.com">繁体</a>, <a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>]
links= [] 

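The code above uses BeautifulSoup to scrape "http://www.ccb.com/cn/home/indexv3.html". Among the a tags it prints, the URL to the left of the text "繁体" (Traditional Chinese) is the one I want to extract, so I would like the second line of output to be links= ['http://fjt.ccb.com']. It looks like the findAll in the return statement is written incorrectly, which is why the result is empty. Could anyone point out how it should be written?
Thanks!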

13 replies

impyf104

2017-09-01 22:24:35 +08:00

Use re.search or re.findall:
<a.*?href="(.*?)".*?>繁体</a>

ossicee

2017-09-01 22:32:50 +08:00

a.get('href')
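
As a rough illustration of the one-liner above: `Tag.get` reads an attribute and returns None when it is missing, whereas indexing with `a['href']` raises KeyError. The markup below is an assumption reconstructed from the output printed in the question (the third anchor is invented purely to show the missing-attribute case), and `html.parser` is used instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Assumed markup: the first two anchors come from the question's output,
# the third (no href) is hypothetical and added for illustration only.
HTML = (
    '<div class="Language_select">'
    '<a href="http://fjt.ccb.com">繁体</a>'
    '<a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>'
    '<a name="top">no link</a>'
    '</div>'
)

soup = BeautifulSoup(HTML, "html.parser")
anchors = soup.find("div", {"class": "Language_select"}).find_all("a")

# .get("href") yields None for the anchor without an href;
# a["href"] would raise KeyError on that same anchor.
hrefs = [a.get("href") for a in anchors]
print(hrefs)
```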

saximi

2017-09-01 23:59:46 +08:00

@impyf104 How should the statement be written? I changed the return statement to: return re.findall('<a.*?href="(.*?)".*?>繁体</a>',bsObj.find("div", {"class":"Language_select"}).findAll("a"))

But running it fails with the following error:

File "D:\Python\Python3\lib\re.py", line 222, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

saximi

2017-09-02 00:11:35 +08:00

@impyf104 Why doesn't re.compile work for this kind of task?
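
On the re.compile question, a likely explanation: the second positional argument of `find_all` / `findAll` is `attrs`, and when it is not a dict (a string or a compiled pattern) BeautifulSoup treats it as a filter on the tag's CSS class. It is never matched against the tag's HTML source, and these anchors have no class, so nothing matches. The sketch below assumes the markup shown (reconstructed from the question's output; the wrapping div is a guess at the real page) and uses `html.parser` rather than `lxml` to stay dependency-free.

```python
import re
from bs4 import BeautifulSoup

# Assumed markup, reconstructed from the output printed in the question.
HTML = (
    '<div class="Language_select">'
    '<a href="http://fjt.ccb.com">繁体</a>'
    '<a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>'
    '</div>'
)

soup = BeautifulSoup(HTML, "html.parser")
box = soup.find("div", {"class": "Language_select"})

# A pattern passed as the second positional argument filters on the class
# attribute, so it finds nothing here: these <a> tags have no class.
by_class = box.find_all("a", re.compile("^http"))

# Passing the pattern as a keyword filters on that attribute's value.
by_href = box.find_all("a", href=re.compile(r"^http://fjt\."))

# string= filters on the text inside the tag.
by_text = box.find_all("a", string="繁体")

links = [a["href"] for a in by_text]
print(by_class, by_href, links)
```

Filtering on the link text and then reading `href` gives the list the question asked for, without running a regex over raw HTML.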

impyf104

2017-09-02 00:42:52 +08:00

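@saximi return re.findall(r'<a.*?href="(.*?)".*?>繁体</a>',html)[0]
Try it like this. [0] is the first match, i.e. the text captured by the parentheses in the regex. Look up the details of how to use it yourself, I'm not great at explaining...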
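
A small sketch of the regex route, with two caveats. First, `re.findall` needs a str (or bytes): the list returned by `findAll` causes the TypeError shown earlier, and the object returned by `urlopen` would presumably need `.read().decode(...)` first (the encoding is an assumption). Second, the pattern here is a tightened variant rather than the one posted in the thread: `[^>]*` and `[^"]*` keep the match inside a single tag, so it still picks the right link if another anchor comes first. The page fragment is taken from the question's output.

```python
import re

# Page fragment taken from the output printed in the question.
page_text = (
    '<a href="http://fjt.ccb.com">繁体</a>, '
    '<a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>'
)

# Tightened variant of the thread's pattern: the character classes cannot
# cross a ">" or a closing quote, so the match stays within one <a> tag.
pattern = r'<a[^>]*?href="([^"]*)"[^>]*>繁体</a>'

# findall returns the captured group for every match; guard against no match
# instead of indexing [0] blindly.
matches = re.findall(pattern, page_text)
first = matches[0] if matches else None
print(first)
```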

yashirq

2017-09-02 05:23:03 +08:00

return bsObj.find("div", {"class": "Language_select"}).find("a", href=True)['href']

OpenJerry

2017-09-02 10:58:33 +08:00

I find regex a bit complicated for this. In cases like this I usually just use a CSS selector.

```python
import requests
from bs4 import BeautifulSoup
import re

def getLinks(articleUrl):
    html = requests.get("http://www.ccb.com/cn/home"+articleUrl).text
    bsObj = BeautifulSoup(html,'lxml')
    fanti_link = bsObj.select(".Language_select a")[0]['href']
    return fanti_link

links = getLinks("/indexv3.html")
print('links=',links)
```

OpenJerry

2017-09-02 11:03:30 +08:00

https://gist.github.com/JerryLi-X/d7dc24ffbeef82a0ffc88ac564f4de07

saximi

2017-09-02 14:42:05 +08:00

@yashirq Thanks. What does href=True mean?

saximi

2017-09-02 15:26:48 +08:00

@yashirq What does href=True mean, and why does the ['href'] key return the first link? Also, if I want the second link, what index should I use? Thanks.

VitaCoCo

2017-09-02 18:55:16 +08:00

for a in bsObj.find("div", {"class":"Language_select"}).findAll("a"):
    print(a.get("href"))

yashirq

2017-09-03 07:08:03 +08:00

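@saximi find(name, attrs, recursive, string, **kwargs): href=True means "find objects under bsObj that have an href attribute". You get the first link not because of ['href'] but because find() returns only the first matching object, whereas find_all() returns every object that has an href. To get the second link you can use find_next(), or find_all(limit=2) and take the second item. Have a look at the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find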
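
The difference between `find`, `find_all`, and `find_next` described above can be sketched as follows. The markup is an assumption reconstructed from the question's output, and `html.parser` stands in for `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Assumed markup, reconstructed from the output printed in the question.
HTML = (
    '<div class="Language_select">'
    '<a href="http://fjt.ccb.com">繁体</a>'
    '<a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>'
    '</div>'
)

soup = BeautifulSoup(HTML, "html.parser")
box = soup.find("div", {"class": "Language_select"})

# href=True keeps only tags that carry an href attribute, whatever its value.
first = box.find("a", href=True)          # a single Tag: the first match
all_links = box.find_all("a", href=True)  # a list of every match

# ['href'] just reads the attribute of whichever Tag you already hold.
first_url = first["href"]

# Three ways to reach the second link.
second_by_index = all_links[1]["href"]
second_by_next = first.find_next("a", href=True)["href"]
second_by_limit = box.find_all("a", href=True, limit=2)[-1]["href"]

print(first_url, second_by_index)
```

Note that `limit=2` still returns a list (of at most two tags), so an index is needed either way.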

saximi

2017-09-03 12:07:22 +08:00

@yashirq Thanks!

https://www.v2ex.com/t/387570