[Newbie question] When scraping images from a site, why do the other links print normally while one particular link raises an error?

2018-12-27 21:23:13 +08:00
 15874103329
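The error occurs at line 59, with the message: Unterminated string starting at: line 1 column 1 (char 0)
What I can't figure out is why the other links don't raise an error, yet it fails every time it reaches this one link.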
import requests
from urllib.parse import urlencode
from requests.exceptions import RequestException
import random
import json
from bs4 import BeautifulSoup
import re

headers_chi = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'
]
def shouye_dizhi():
    data = {
        'offset': '0',
        'format': 'json',
        'keyword': '美女',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '3',
        'from': 'gallery'
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        headers = {}
        headers['User-Agent'] = random.choice(headers_chi)  # pick a random User-Agent
        dizhi = requests.get(url, headers=headers)
        if dizhi.status_code == 200:
            return dizhi.text
    except RequestException:
        print('Failed to load the index page')
        return None

def shouye_xiangqing(html):
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            yield item.get('article_url')

def xiangqingye_dizhi(url):
    try:
        headers = {}
        headers['User-Agent'] = random.choice(headers_chi)
        dizhi = requests.get(url, headers=headers)
        if dizhi.status_code == 200:
            return dizhi.text
    except RequestException:
        print('Failed to load the detail page')
        return None

def xiangqingye_jiexi(html, url):
    jiexi = BeautifulSoup(html, 'lxml')
    title = jiexi.select('title')[0].get_text()
    print(title)
    zhengze = re.compile(r'JSON.parse\(([\s\S]*?)\)')
    jieguo = re.search(zhengze, html)
    data = json.loads(json.loads(jieguo.group(1)))  # line 59 of the script: the error is raised here
    if data and 'sub_images' in data.keys():
        sub_images = data.get('sub_images')
        items = [item.get('url') for item in sub_images]
        return {
            'title': title,
            'url': url,
            'items': items
        }


def main():
    html = shouye_dizhi()
    for url in shouye_xiangqing(html):
        html = xiangqingye_dizhi(url)
        tupian = xiangqingye_jiexi(html, url)
        print(tupian)

if __name__ == "__main__":
    main()
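
To see what sets the failing link apart, it can help to report the problem per URL instead of letting the first exception stop the loop. The following is only a minimal debugging sketch: `try_parse` is a hypothetical helper that is not part of the script above, and it assumes the same `JSON.parse(...)` pattern the script already uses.

```python
import json
import re


def try_parse(html, url):
    """Return (data, problem); exactly one of the two is None."""
    if html is None:
        return None, f'{url}: no HTML (request failed or non-200 status)'
    match = re.search(r'JSON\.parse\(([\s\S]*?)\)', html)
    if match is None:
        return None, f'{url}: no JSON.parse(...) call found'
    captured = match.group(1)
    try:
        # Same double decode as the original script
        return json.loads(json.loads(captured)), None
    except (json.JSONDecodeError, TypeError) as exc:
        # Show the tail of the capture: a cut-off string is easy to spot there
        return None, f'{url}: {exc}; capture ends with {captured[-40:]!r}'
```

Calling it inside the loop in `main()` and printing `problem` whenever it is not `None` shows the offending URL together with the end of the text the regex captured.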
1897 clicks
Node: Python
4 replies
15874103329
2018-12-28 10:38:07 +08:00
Could someone experienced please take a look?
hp66722667
2018-12-28 10:40:25 +08:00
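With this many `if`s and the formatting all broken, I can't help. I'd suggest formatting the code properly; then someone might take a look.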
careofzm
2018-12-28 10:47:03 +08:00
It runs for me; I only changed one place.
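
The thread does not show which line was changed. One plausible cause, offered only as an assumption, is the non-greedy pattern `JSON.parse\(([\s\S]*?)\)`: it stops at the first `)`, so if the embedded JSON string itself contains a `)` (for example in a title), the capture is cut off mid-string and `json.loads` reports an unterminated string at char 0. The sketch below reproduces this with a made-up page fragment (`sample_html` and the example.com URL are hypothetical) and reads the argument with `json.JSONDecoder.raw_decode` instead, which parses exactly one JSON value and ignores what follows. `extract_gallery` is a hypothetical helper name.

```python
import json
import re

# Hypothetical page fragment: the inner JSON contains a ")" inside a title.
inner = json.dumps({"title": "photo (1)",
                    "sub_images": [{"url": "http://example.com/a.jpg"}]})
sample_html = "var gallery = JSON.parse(" + json.dumps(inner) + "),\n siblingList = [];"


def extract_gallery(html):
    """Return the dict passed to the first JSON.parse(...) call, or None."""
    marker = re.search(r'JSON\.parse\(\s*', html)
    if marker is None:
        return None
    try:
        # raw_decode parses one complete JSON value starting at the given
        # index and does not care about the text after it
        literal, _end = json.JSONDecoder().raw_decode(html, marker.end())
        return json.loads(literal)
    except (json.JSONDecodeError, TypeError):
        return None


# The original non-greedy pattern cuts the capture at the first ")"
truncated = re.search(r'JSON\.parse\(([\s\S]*?)\)', sample_html).group(1)
try:
    json.loads(truncated)
except json.JSONDecodeError as exc:
    print('original pattern fails:', exc)

print(extract_gallery(sample_html))
```

If this is indeed the cause, it would also explain why only one link fails: the other pages simply have no `)` inside their gallery data.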

15874103329
2018-12-28 15:54:33 +08:00
@careofzm Thanks a lot!


https://www.v2ex.com/t/521680
