Scraping the HTTP proxy server list from HideMyAss

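HideMyAss publishes a continuously updated list of anonymous proxies that programs such as crawlers can use. To prevent abuse, however, HideMyAss randomly inserts invisible elements into the text as noise, which makes automated scraping and matching harder.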

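For a front-end developer 🐶 this is no real obstacle: the single line jQuery(':not(:visible)').remove() defeats it. The simplest scraping approach is therefore to use a headless browser such as phantom, run the remove operation in the DOMContentLoaded event, and then happily read the text of each cell with innerText.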

If a headless browser feels too heavyweight, there is a leaner, more environmentally friendly way.

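HMA hides elements in only two ways: inline styles in the style attribute, and <style> tags that define the class names used by the noise text. First, find every style tag, parse its CSS rules, pick out the rules that make an element invisible (for example display:none), and extract their selectors. Then run something like querySelectorAll on the DOM tree and remove the matched element nodes. Next, query every element that carries a style attribute (an inline style), parse its CSS in the same way, check whether it sets an invisibility property, and remove the element if it does.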

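The dependencies therefore shrink from a complete browser to just a CSS parser and an HTML parser. The Python module TinyCSS provides only basic parsing and exposes the name and value of each declaration, which is exactly what is needed here; the HTML side is left to BeautifulSoup4.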
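Before the full implementation, the two passes can be illustrated on a toy table cell. This is only a rough sketch: to stay dependency-free it substitutes the standard library's re and html.parser for a real CSS parser and HTML parser, so it only understands simple class selectors. The names hidden_classes, VisibleText and visible_text, as well as the sample markup, are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Declarations that make an element invisible.
HIDING = re.compile(r'display\s*:\s*none|visibility\s*:\s*hidden', re.I)
# Elements that never get a closing tag, so they must not touch the stack.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


def hidden_classes(css):
    """Return the class names whose rule body contains a hiding declaration."""
    names = set()
    for selector, body in re.findall(r'([^{}]+)\{([^}]*)\}', css):
        if HIDING.search(body):
            names.update(re.findall(r'\.([\w-]+)', selector))
    return names


class VisibleText(HTMLParser):
    """Collect text that is not inside a hidden element or a <style> tag."""

    def __init__(self, hidden):
        super().__init__()
        self.hidden = hidden
        self.stack = []   # one flag per open element: True means "hidden"
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        hide = (tag == 'style'
                or bool(self.hidden.intersection(classes))
                or bool(HIDING.search(attrs.get('style') or '')))
        # A child of a hidden element is hidden as well.
        self.stack.append(hide or (bool(self.stack) and self.stack[-1]))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.chunks.append(data)


def visible_text(html):
    # Pass 1: gather hiding class names from every <style> block.
    css = ' '.join(re.findall(r'<style[^>]*>(.*?)</style>', html, re.S | re.I))
    # Pass 2: walk the markup and keep only text outside hidden elements.
    parser = VisibleText(hidden_classes(css))
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


SAMPLE = (
    '<td><style>.x1{display:none}.y2{display:inline}</style>'
    '<span class="x1">99</span><span class="y2">10</span>'
    '<span style="display:none">7</span>.<div style="display: none">1</div>'
    '<span>0</span>.3.<span style="display:inline">4</span></td>'
)
print(visible_text(SAMPLE))  # 10.0.3.4
```

Carrying the hidden flag down the stack mirrors what removing a node from the DOM does: everything nested inside the node disappears with it.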

The complete implementation is as follows:

https://gist.github.com/ChiChou/8ae9512fad468a042c84

#!/usr/bin/env python
#coding:utf-8

'''
  Author: @CodeColorist
  Usage: 
    from hidemyass import proxies
    ...
    for proxy in proxies():
      # do something
'''

from urllib.request import urlopen

import tinycss

from bs4 import BeautifulSoup


def is_invisible(decl):
  # check whether decl is a css declaration that hides the element
  value = decl.value.as_css().strip().lower()
  return (decl.name == 'display' and value == 'none') \
    or (decl.name == 'visibility' and value == 'hidden')


def proxies():
  css_parser = tinycss.make_parser()
  html = urlopen('http://proxylist.hidemyass.com').read()
  soup = BeautifulSoup(html, 'lxml')
  table = soup.find('table', id='listable')
  rows = table.find_all('tr')[1:] # skip the header row

  # remove elements hidden through class rules in <style> tags
  for style in table.find_all('style'):
    rules = css_parser.parse_stylesheet(style.text).rules
    invisible_selectors = [rule.selector.as_css() for rule in rules
      if any(map(is_invisible, rule.declarations))]
    if invisible_selectors: # an empty selector string cannot be queried
      for e in table.select(','.join(invisible_selectors)):
        e.extract()
    style.extract()

  # remove elements hidden through an inline style attribute
  for e in table.select('[style]'):
    # parse_style_attr returns a (declarations, errors) pair
    declarations, _ = css_parser.parse_style_attr(e.get('style'))
    if e.name != 'tr' and any(map(is_invisible, declarations)):
      e.extract()

  # parse data; speed and connection time are read from the value attribute of the cell's div
  order = ('lastupdate', 'ip', 'port', 'country', 'speed', 'connectiontime', 'type', 'anon')

  for tr in rows:
    td = tr.find_all('td')

    yield {key: td[i].div.get('value')
      if key in ('speed', 'connectiontime') else td[i].text.strip()
      for i, key in enumerate(order)}


if __name__ == '__main__':
  print(list(proxies()))
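Each dict yielded by proxies() carries the proxy address under the 'ip' and 'port' keys, so handing one to a crawler could look roughly like the sketch below. opener_for is a hypothetical helper, the sample entry is made up (an address from the documentation range), and the sketch assumes the entry is a plain HTTP proxy.

```python
from urllib.request import ProxyHandler, build_opener


def opener_for(proxy):
    """Build a urllib opener that sends HTTP requests through one proxy entry.

    `proxy` is a dict shaped like the ones proxies() yields; only the
    'ip' and 'port' keys are read here.
    """
    address = 'http://{ip}:{port}'.format(ip=proxy['ip'], port=proxy['port'])
    return build_opener(ProxyHandler({'http': address}))


# Made-up entry for illustration; a real one would come from proxies().
sample = {'ip': '203.0.113.7', 'port': '8080'}
opener = opener_for(sample)
# opener.open('http://example.com/') would now be routed through the proxy.
```

Building the opener performs no network access, so a dead proxy only shows up when open() is actually called; a crawler would typically catch that error and move on to the next entry.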