Help wanted: how do sites like 今日热榜 (Today's Hot List) batch-scrape their data? Puppeteer on dynamic pages is slow and keeps timing out

This topic was created 507 days ago; the information in it may have developed or changed since then.

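Question

Asking for help, everyone. If you can help solve the problems below, I'll thank you privately with a red-packet code.

Solved by one person, sent privately: ￥8.88 / ￥16.88
Solved by several people, sent to the group: ￥20
How I'll say thanks: Alipay red-packet code

Thanks, everyone.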

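1. How does 今日热榜 batch-scrape its data?

I'm curious how hot-list aggregators like 今日热榜 scrape so many sites in bulk.

2. How do I scrape dynamic pages?

I use cheerio for static pages and Puppeteer for batch scraping. Puppeteer is very slow and frequently fails with timeouts, and I can't find the cause.

No matter how long I set the page.goto timeout (30s, 60s, 90s), it still times out.

It runs fine on my local Windows machine, but on the remote server (a small low-end VPS) it times out all the time.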

Error

数据处理过程错误: TimeoutError: Navigation timeout of 30000 ms exceeded
    at new Deferred (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:59:34)
    at Deferred.create (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:21:16)
    at new LifecycleWatcher (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/LifecycleWatcher.js:66:60)
    at CdpFrame.goto (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/Frame.js:143:29)
    at CdpFrame.<anonymous> (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/decorators.js:98:27)
    at CdpPage.goto (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Page.js:588:43)
    at fetchData (file:///usr/local/script/%E8%B4%A2%E7%BB%8F%E7%83%AD%E6%A6%9C.js:51:18)
    at async executeProcess (file:///usr/local/script/%E8%B4%A2%E7%BB%8F%E7%83%AD%E6%A6%9C.js:108:24)

Code logic

async function fetchData(page, name, url, hrefSelector) {
  const maxRetries = 3; // Maximum number of attempts
  let attempts = 0;

  while (attempts < maxRetries) {
    try {
      attempts++;
      await page.goto(url, { timeout: 1000 * 30 });
      await page.waitForSelector(hrefSelector, { timeout: 1000 * 30 });

      const results = await page.$$eval(hrefSelector, anchors =>
        anchors.map(anchor => ({ href: anchor.href, text: anchor.textContent.trim() }))
      );

      const trade_date = getCurrentDateTime();

      return { name, news: results, trade_date };
    } catch (error) {
      if (attempts < maxRetries) {
        console.warn(`Failed to fetch ${url}. Retry attempt ${attempts}...`);
        await delay(2000); // Wait for 2 seconds before retrying
      } else {
        console.error(`Failed to fetch ${url} after ${attempts} attempts:`, error);
        throw error;
      }
    }
  }
}

Plugins used

xvfb
puppeteer-extra
puppeteer-extra-plugin-stealth
puppeteer-extra-plugin-anonymize-ua

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import AnonymizeUaPlugin from 'puppeteer-extra-plugin-anonymize-ua'
puppeteer.use(StealthPlugin());
puppeteer.use(AnonymizeUaPlugin());

Puppeteer launch options

    // Launch the browser and open a page
    const browser = await puppeteer.launch({
      args: [
        "--disable-setuid-sandbox",
        "--no-sandbox",
        "--disable-gpu",
        "--no-first-run",
        "--disable-dev-shm-usage",
        "--single-process"
      ],
      headless: true
    });
    console.log('Browser launched √');
    const page = await browser.newPage();

    // Enable request interception and block resources we don't need
    await page.setRequestInterception(true);
    page.on('request', (request) => {
      const resourceType = request.resourceType();
      if (['image', 'stylesheet', 'font'].includes(resourceType)) {
        request.abort();
      } else {
        request.continue();
      }
    });

    // Scrape each configured site in turn
    const allContents = [];
    for (const data of config) {
      const contents = await fetchData(page, data.name, data.url, data.hrefSelector);
      allContents.push(contents);
    }

Sites that frequently fail

{
    name: '第一财经',
    url: 'https://www.yicai.com/news/',
    hrefSelector: '#newsRank div:nth-child(1) > ul > li a'
},
{
    name: '金融界',
    url: 'https://stock.jrj.com.cn/',
    hrefSelector: 'ul.opportunity-list > li a'
},
{
    name: '八阕',
    url: 'https://news.popyard.space/cgi-mod/threads.cgi?lan=cn&r=0&cid=11&t=all',
    hrefSelector: 'div#page_1 > table b > a'
}

Scraping 第一财经 from my local Windows machine fails too.

System

Extra IPv4: None
RAM: 2.5 GB RAM (Included)
CPU Cores: 2 CPU Cores (Included)
Operating System: Debian 12 64 Bit (Recommended Min. 2 GB RAM)
Location: San Jose, CA (Test IP: 192.210.207.88)

Tags: puppeteer, timeout, remote

19 replies • 2024-06-16 09:13:48 +08:00

dedad558

2024-06-14 19:28:48 +08:00

I'm a beginner. I've searched Google and ChatGPT extensively and tested many times myself, but I still can't solve it. Please help.

dedad558

2024-06-14 19:31:30 +08:00

A Python solution would be fine too. These three sites tend to fail with timeouts, and I don't know much about anti-scraping measures.
```
{
name: '第一财经',
url: 'https://www.yicai.com/news/',
hrefSelector: '#newsRank div:nth-child(1) > ul > li a'
},
{
name: '金融界',
url: 'https://stock.jrj.com.cn/',
hrefSelector: 'ul.opportunity-list > li a'
},
{
name: '八阕',
url: 'https://news.popyard.space/cgi-mod/threads.cgi?lan=cn&r=0&cid=11&t=all',
hrefSelector: 'div#page_1 > table b > a'
}
```

With Puppeteer, scraping locally does work occasionally, but it still times out every now and then. I'm baffled.

macaodoll

2024-06-14 19:40:04 +08:00 via Android

Two words: reverse engineering

dedad558

2024-06-14 19:44:53 +08:00 via Android

@macaodoll I can't reverse-engineer 50 sites one by one, can I?

dedad558

2024-06-14 19:45:07 +08:00 via Android

Besides, I don't know how to reverse-engineer anything, haha. Embarrassing.

wzdsfl

2024-06-14 19:49:43 +08:00

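Scraping this way amounts to a full page visit: you also download all the useless img/css/js/html on the page. Hit the API directly instead. If the API is encrypted, set breakpoints and dig out the parameters; if it checks identity, attach a cookie. That is far more efficient than what you're doing now.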
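
A minimal sketch of the "hit the API directly" suggestion, assuming Node.js 18+ (which ships a global `fetch`). The helper names `buildRequestOptions` and `fetchJson`, the User-Agent string, and the example endpoint are hypothetical; the real endpoint for each site has to be found in the browser's network panel.

```javascript
// Sketch only: request a site's JSON endpoint instead of rendering the page.
// All names here are illustrative, not taken from any site in this thread.
function buildRequestOptions(cookie) {
  const headers = {
    Accept: 'application/json',
    // A browser-like User-Agent; some servers reject obviously scripted clients.
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36',
  };
  if (cookie) headers.Cookie = cookie; // only needed when the endpoint checks identity
  return { method: 'GET', headers };
}

async function fetchJson(url, { cookie, timeoutMs = 10000, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs); // hard per-request timeout
  try {
    const res = await fetchImpl(url, { ...buildRequestOptions(cookie), signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}

// Usage (hypothetical endpoint):
// const data = await fetchJson('https://example.com/api/hot-list', { cookie: 'session=...' });
```

The `fetchImpl` parameter exists only so the function can be exercised with a stub; in normal use it falls back to the global `fetch`.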

dedad558

2024-06-14 19:59:53 +08:00

@wzdsfl OK, thanks. That gives me something to go on; I'll try the API approach.

dedad558

2024-06-14 20:06:54 +08:00

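@wzdsfl It really is much faster, haha. I'm a beginner and kept worrying about anti-scraping on the APIs, which I don't understand, so I went with Puppeteer to be safe. Thanks.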

wushenlun

2024-06-14 20:11:02 +08:00

Isn't this just plaintext?

```html
<div class="cc-cd-cb-ll">
<span class="s ">70</span>
<span class="t">19 块？森马棉致冰丝裤，我没看错吧！原价¥54.9 券后¥19.9</span>
<span class="e">热销 450 件(近 2 小时)</span>
</div>
</a> <a href="https://tophub.today/link?domain=taobao.com&url=https%3A%2F%2Fremai.today%2Flink%2F1%2F2gaR2v3hotge6Z5J6OFaq7TDtD-36ON23KC05mnekbjsO" target="_blank" rel="nofollow noopener" itemid="171347537">
<div class="cc-cd-cb-ll">
<span class="s ">71</span>
<span class="t"> [全尺寸一个价] 床笠床罩床套保护套原价¥139.9 券后¥39.9</span>
<span class="e">热销 446 件(近 2 小时)</span>
</div>
```
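
If the list really is present in the raw HTML like this, no browser is needed to read it. The sketch below pulls the rank and title out of markup shaped like the snippet above with a regular expression; `extractRankedItems` is a hypothetical helper, and it assumes the `s` / `t` span classes stay as shown. For real pages an HTML parser such as cheerio would be sturdier than a regex.

```javascript
// Sketch only: extract { rank, title } pairs from static HTML shaped like the
// snippet above. Assumes each rank span is immediately followed by a title span.
function extractRankedItems(html) {
  const pattern = /<span class="s\s*">\s*(\d+)\s*<\/span>\s*<span class="t">([\s\S]*?)<\/span>/g;
  const items = [];
  for (const match of html.matchAll(pattern)) {
    items.push({ rank: Number(match[1]), title: match[2].trim() });
  }
  return items;
}

// Usage with a small sample in the same shape:
const sampleHtml = [
  '<div class="cc-cd-cb-ll">',
  '<span class="s ">70</span>',
  '<span class="t"> First item </span>',
  '<span class="e">sold 450</span>',
  '</div>',
].join('\n');
console.log(extractRankedItems(sampleHtml));
```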

dedad558

2024-06-14 20:12:53 +08:00

@wushenlun ? What plaintext?

longlonglanguage

2024-06-14 21:43:50 +08:00

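Any language can do the scraping. They presumably wrote an adapter for each site, store the results in a database, aggregate the data on the backend, and then render it on the web page.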
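
The per-site adapter idea could look roughly like the sketch below. This is a guess at the general shape, not how 今日热榜 actually works: `collectAll`, the adapter objects, and the `Map` standing in for a database are all hypothetical.

```javascript
// Sketch only: one adapter per site, results stored per site, then aggregated.
// Each adapter is { name, fetchItems }, where fetchItems resolves to an array of items.
async function collectAll(adapters, store) {
  const fetchedAt = new Date().toISOString();
  // allSettled lets one failing site not abort the whole batch.
  const settled = await Promise.allSettled(adapters.map((adapter) => adapter.fetchItems()));
  settled.forEach((result, index) => {
    const { name } = adapters[index];
    if (result.status === 'fulfilled') {
      store.set(name, { name, fetchedAt, items: result.value });
    } else {
      // Keep whatever was stored for this site last time.
      console.warn(`${name} failed: ${String(result.reason)}`);
    }
  });
  return [...store.values()]; // the aggregated view the front end would render
}

// Usage with stub adapters (a real adapter would call an API or parse HTML):
// const store = new Map();
// const lists = await collectAll([{ name: 'site-a', fetchItems: async () => [{ title: 't', url: 'u' }] }], store);
```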

dedad558

2024-06-14 22:35:46 +08:00

@longlonglanguage Yes. I've now switched to scraping the APIs, and where there's no API to use I fall back to scraping the DOM.
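
One way to express "API first, DOM as fallback" is to try an ordered list of strategies and return the first non-empty result. A minimal sketch, with `firstSuccessful` and the strategy labels as hypothetical names:

```javascript
// Sketch only: run strategies in order and return the first one that yields items.
// Each strategy is { label, run }, where run resolves to an array.
async function firstSuccessful(strategies) {
  const errors = [];
  for (const { label, run } of strategies) {
    try {
      const items = await run();
      if (Array.isArray(items) && items.length > 0) return { source: label, items };
      errors.push(`${label}: empty result`);
    } catch (err) {
      errors.push(`${label}: ${err.message}`);
    }
  }
  throw new Error(`All strategies failed (${errors.join('; ')})`);
}

// Usage (both run functions are placeholders for real scraping code):
// const { source, items } = await firstSuccessful([
//   { label: 'api', run: () => fetchFromApi(site) },
//   { label: 'dom', run: () => fetchFromDom(page, site) },
// ]);
```

Treating an empty array as a failure matters here, since a changed API often returns a valid but empty response rather than an error.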

nx6Ta67v2A43frV2

2024-06-15 11:33:55 +08:00 via iPhone

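@dedad558 Of course. The security staff at big companies aren't paid to sit around.

Scraping this data is not a one-off job but a long-running arms race.

The other side's server-side defense rules keep changing, and their gateways also run offline analysis to identify bot traffic.

When they detect a suspected bot, they challenge you with a human-verification check; fail it and you get blocked. Baidu and Cloudflare both work this way.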

nx6Ta67v2A43frV2

2024-06-15 11:37:10 +08:00 via iPhone

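Also, this is illegal, or at least in a legal gray area.

There have been news reports of a programmer who scraped public data from a government website, brought the site down, and went to prison for it.

His program was buggy: it got stuck in an infinite loop and kept hitting the site over and over, and the government site happened to be fragile.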
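
The failure mode described here, a loop that keeps re-requesting the same site, can be guarded against with a visited set, a hard request cap, and a delay between requests. A minimal sketch; `politeCrawl`, its options, and the default values are hypothetical, not recommendations for any particular site.

```javascript
// Sketch only: a crawl loop that cannot revisit a URL or exceed a request budget.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// visit(url) does the actual request and resolves to an array of newly found URLs.
async function politeCrawl(startUrls, visit, { maxRequests = 50, delayMs = 1000 } = {}) {
  const queue = [...startUrls];
  const seen = new Set();
  let requests = 0;
  while (queue.length > 0 && requests < maxRequests) {
    const url = queue.shift();
    if (seen.has(url)) continue; // never fetch the same URL twice
    seen.add(url);
    if (requests > 0) await sleep(delayMs); // pause between consecutive requests
    requests++;
    try {
      const discovered = await visit(url);
      queue.push(...(discovered ?? []));
    } catch (err) {
      console.warn(`Skipping ${url}: ${err.message}`); // failed URLs are not re-queued
    }
  }
  return { requests, visited: [...seen] };
}
```

Even if `visit` keeps returning links that point back to pages already fetched, the loop ends once the queue drains or the budget is spent.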

root71370

2024-06-15 22:37:23 +08:00

I bookmarked this repo a while ago; take a look at how it does it.
https://github.com/imsyy/DailyHotApi

yuaotian

2024-06-15 23:08:17 +08:00

Silly you, why not just scrape the hot-list sites themselves? 🤣😏

mumbler

2024-06-16 01:01:46 +08:00

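Scraping takes systematic study plus deliberate practice; when you know nothing, it's hard for anyone to help. In the ChatGPT era you can hand the page data straight to GPT and have it extract the data automatically, so all you need to do is get the page data.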

dedad558

2024-06-16 09:13:21 +08:00

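@mumbler Thanks. Simple scraping is just sending a request and parsing the response, nothing difficult. The hard stuff is beyond me anyway, haha.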

dedad558

2024-06-16 09:13:48 +08:00

@yuaotian Silly me, haha. I'd rather not depend on a third party.