My crawler timed out after fetching 90% of the data

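I was completely stunned. How do I save progress partway through, so that the next run can pick up where this one stopped? I only know urllib and beautifulsoup4. I know there is such a thing as resumable crawling, but how do I actually do it? Any pointer to get me started would be appreciated.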

6 replies • 2016-09-16 12:04:05 +08:00

Karblue

Sep 14, 2016

Record the crawl depth and the links you have reached. Next time, just start crawling from there.

web88518

Sep 14, 2016 via iPhone

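I'm a beginner too. I've run into this as well and didn't know how to handle it properly, and I haven't seen a worked example of it.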

haozibi

Sep 14, 2016 via Android

Design a table in a database that stores the current crawl position, or save your data after every 100 fetches.
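
A minimal sketch of the table idea above, using the standard-library sqlite3 module. The file name `crawl_progress.db`, the table name `progress`, and the `fetch` callable are all assumptions made for illustration; `fetch` stands in for whatever urllib and BeautifulSoup code does the real work.

```python
import sqlite3

def open_progress(path="crawl_progress.db"):
    """Open (or create) the progress database. File and table names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress ("
        "url TEXT PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn

def add_urls(conn, urls):
    # INSERT OR IGNORE leaves rows from an earlier run (and their done flag) untouched.
    conn.executemany(
        "INSERT OR IGNORE INTO progress (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.commit()

def pending(conn):
    """Return the URLs not yet marked done, in insertion order."""
    rows = conn.execute("SELECT url FROM progress WHERE done = 0 ORDER BY rowid")
    return [row[0] for row in rows]

def crawl(conn, fetch, batch_size=100):
    """Call fetch(url) for every pending URL, committing every batch_size fetches."""
    count = 0
    try:
        for url in pending(conn):
            fetch(url)  # if this raises, the URL keeps done = 0
            conn.execute("UPDATE progress SET done = 1 WHERE url = ?", (url,))
            count += 1
            if count % batch_size == 0:
                conn.commit()
    finally:
        # Persist whatever finished before an exception such as a timeout.
        conn.commit()
```

On every run, call `add_urls` with the full URL list and then `crawl`; URLs already marked done are skipped, so a run that died at 90% only redoes the remaining 10%. A smaller `batch_size` loses less work if the process is killed outright, at the cost of more disk writes.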

practicer

Sep 14, 2016

seen = []
todo = []

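1 Add every url waiting to be crawled to todo
2 Each time a url has been crawled (or when a ConnectionError is raised), append the url to seen at the end of the loop body
3 Then delete that url from todo

So if the crawl gets cut off, next time you just carry on from todo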
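
The steps above only survive a restart if todo and seen are written to disk. Here is a rough sketch under that assumption, storing both lists in a JSON file. The file name `crawl_state.json` and the `fetch` callable are hypothetical. It also deviates from step 2 in one respect: a url whose fetch raises stays in todo, so it is retried on the next run instead of being counted as seen.

```python
import json
import os

STATE_FILE = "crawl_state.json"  # hypothetical file name

def load_state(all_urls, path=STATE_FILE):
    """Return (todo, seen): restored from disk if a saved state exists, else fresh."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return state["todo"], state["seen"]
    return list(all_urls), []

def save_state(todo, seen, path=STATE_FILE):
    # Write to a temporary file and rename it, so a crash in the middle
    # of writing cannot leave a half-written state file behind.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"todo": todo, "seen": seen}, f)
    os.replace(tmp, path)

def crawl(all_urls, fetch, path=STATE_FILE):
    """Work through todo, saving both lists when the loop ends or is interrupted."""
    todo, seen = load_state(all_urls, path)
    try:
        while todo:
            url = todo[0]
            fetch(url)        # if this raises, url stays at the front of todo
            seen.append(url)  # step 2: record the url as crawled
            todo.pop(0)       # step 3: remove it from todo
    finally:
        save_state(todo, seen, path)
    return seen
```

Because the state is only saved in the `finally` clause, this covers exceptions such as timeouts but not a hard kill of the process; calling `save_state` every N urls inside the loop would narrow that gap.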

JamesMackerel

Sep 15, 2016 via Android

A Bloom filter?
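
For context, a Bloom filter can stand in for the set of already-crawled urls using a fixed amount of memory. It never misses a url that was added, but it can occasionally report an unseen url as seen, in which case that url would be skipped. Below is a small illustrative sketch built on hashlib; the class, its default sizes, and the hashing scheme are assumptions, not a tuned implementation.

```python
import hashlib

class BloomFilter:
    """Illustrative Bloom filter for strings; parameters are not tuned."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes  # must stay below 256 because of bytes([i]) below
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different prefixes.
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every one of the k bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

In a crawler this would be used as `if url not in bf: fetch(url); bf.add(url)`. To resume after a timeout, the `bits` bytearray would still have to be written to a file and loaded again on the next run.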

makeapp

Sep 16, 2016

Maintain one or more queues to hold the data from incremental crawling.