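Asking for help, everyone. If you can help me solve this, I'll send a red-packet code privately as thanks.
Thanks, all.
I'm curious how hot-list aggregator sites such as 今日热榜 batch-crawl their sources.
I use cheerio for static pages and Puppeteer for batch crawling, but it is very slow and I keep getting timeout errors whose cause I can't find.
page.goto times out no matter how long I set the timeout: 30s, 60s, 90s.
It runs fine on my local Windows machine, but on the remote VPS (a small, cheap box) it times out constantly.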
Data processing error: TimeoutError: Navigation timeout of 30000 ms exceeded
    at new Deferred (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:59:34)
    at Deferred.create (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:21:16)
    at new LifecycleWatcher (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/LifecycleWatcher.js:66:60)
    at CdpFrame.goto (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/Frame.js:143:29)
    at CdpFrame.<anonymous> (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/util/decorators.js:98:27)
    at CdpPage.goto (/usr/local/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Page.js:588:43)
    at fetchData (file:///usr/local/script/%E8%B4%A2%E7%BB%8F%E7%83%AD%E6%A6%9C.js:51:18)
    at async executeProcess (file:///usr/local/script/%E8%B4%A2%E7%BB%8F%E7%83%AD%E6%A6%9C.js:108:24)
async function fetchData(page, name, url, hrefSelector) {
  const maxRetries = 3; // Maximum number of attempts
  let attempts = 0;
  while (attempts < maxRetries) {
    try {
      attempts++;
      await page.goto(url, { timeout: 1000 * 30 });
      await page.waitForSelector(hrefSelector, { timeout: 1000 * 30 });
      // Collect the link target and visible text of every matched anchor
      const results = await page.$$eval(hrefSelector, anchors =>
        anchors.map(anchor => ({ href: anchor.href, text: anchor.textContent.trim() }))
      );
      const trade_date = getCurrentDateTime();
      return { name, news: results, trade_date };
    } catch (error) {
      if (attempts < maxRetries) {
        console.warn(`Failed to fetch ${url}. Retry attempt ${attempts}...`);
        await delay(2000); // Wait for 2 seconds before retrying
      } else {
        console.error(`Failed to fetch ${url} after ${attempts} attempts:`, error);
        throw error;
      }
    }
  }
}
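One thing that may be worth ruling out: the stack trace ends in LifecycleWatcher, and page.goto by default waits for the page's `load` event, which slow third-party resources can delay even when the links are already in the DOM. Below is a minimal sketch, assuming the target links are present in the initial HTML or appear shortly after it is parsed. It waits only for `domcontentloaded` and lets waitForSelector confirm the data. The name `gotoAndCollect` and the `timeoutMs` parameter are hypothetical and not part of the original script.

```javascript
// Hypothetical helper: navigate without waiting for the full `load` event.
// `page` is expected to be a Puppeteer Page (or an object with the same methods).
async function gotoAndCollect(page, url, hrefSelector, timeoutMs = 30000) {
  // 'domcontentloaded' resolves once the HTML is parsed; it does not wait for
  // images, ads, or analytics scripts to finish loading.
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });

  // The selector wait is what actually confirms that the links are present.
  await page.waitForSelector(hrefSelector, { timeout: timeoutMs });

  return page.$$eval(hrefSelector, anchors =>
    anchors.map(anchor => ({ href: anchor.href, text: anchor.textContent.trim() }))
  );
}
```

If the timeout persists even with this change, the page itself is probably not reachable from the server in time, which points at the network rather than at Puppeteer.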
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import AnonymizeUaPlugin from 'puppeteer-extra-plugin-anonymize-ua';

puppeteer.use(StealthPlugin());
puppeteer.use(AnonymizeUaPlugin());

// Launch the browser and open a page
const browser = await puppeteer.launch({
  args: [
    "--disable-setuid-sandbox",
    "--no-sandbox",
    "--disable-gpu",
    "--no-first-run",
    "--disable-dev-shm-usage",
    "--single-process"
  ],
  headless: true
});
console.log('Browser launched √');
const page = await browser.newPage();
// Enable request interception and block unnecessary resource requests
await page.setRequestInterception(true);
page.on('request', (request) => {
  const resourceType = request.resourceType();
  if (['image', 'stylesheet', 'font'].includes(resourceType)) {
    request.abort();
  } else {
    request.continue();
  }
});
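If more than one listener ends up resolving intercepted requests (for example a plugin plus this handler), calling abort() or continue() on a request that is already handled raises an error. Below is a defensive sketch, assuming a Puppeteer version that provides `HTTPRequest.isInterceptResolutionHandled()`. The names `BLOCKED_TYPES` and `handleRequest` are hypothetical, and adding `media` to the block list is an assumption about what these pages do not need.

```javascript
// Hypothetical block list; 'media' is an addition, not in the original script.
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);

function handleRequest(request) {
  // Skip requests that another listener has already aborted or continued.
  if (request.isInterceptResolutionHandled()) return;

  const pending = BLOCKED_TYPES.has(request.resourceType())
    ? request.abort()
    : request.continue();

  // abort()/continue() return promises; swallow rejections so that a request
  // that disappears mid-navigation does not surface as an unhandled rejection.
  Promise.resolve(pending).catch(() => {});
}

// Usage, after `await page.setRequestInterception(true)`:
// page.on('request', handleRequest);
```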
// Crawl the data
const allContents = [];
for (const data of config) {
  const contents = await fetchData(page, data.name, data.url, data.hrefSelector);
  allContents.push(contents);
}
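In the loop above, fetchData rethrows after its last retry, so one failing site stops the whole run, and every site is crawled one after another on the same page. Below is a sketch of a small worker pool that isolates failures and crawls a few sites at a time. `runWithConcurrency`, its `limit` parameter, and the result shape are hypothetical, and a low limit such as 2 is an assumption suited to a small VPS.

```javascript
// Hypothetical helper: run `worker` over `items`, at most `limit` at a time.
// Results keep the input order; a failure is recorded instead of thrown.
async function runWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    // Each runner repeatedly claims the next unprocessed index.
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Possible usage with the script's own names (one fresh page per site):
// const results = await runWithConcurrency(config, 2, async (data) => {
//   const sitePage = await browser.newPage();
//   try {
//     return await fetchData(sitePage, data.name, data.url, data.hrefSelector);
//   } finally {
//     await sitePage.close();
//   }
// });
```

Note that a page created this way does not inherit request interception from the first page, so any blocking would have to be set up again on each new page.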
// Sites to crawl (the `config` array iterated by the crawl loop)
const config = [
  {
    name: '第一财经',
    url: 'https://www.yicai.com/news/',
    hrefSelector: '#newsRank div:nth-child(1) > ul > li a'
  },
  {
    name: '金融界',
    url: 'https://stock.jrj.com.cn/',
    hrefSelector: 'ul.opportunity-list > li a'
  },
  {
    name: '八阕',
    url: 'https://news.popyard.space/cgi-mod/threads.cgi?lan=cn&r=0&cid=11&t=all',
    hrefSelector: 'div#page_1 > table b > a'
  }
];
Crawling 第一财经 (Yicai) from my local Windows machine also fails.
Extra IPv4: None
RAM: 2.5 GB RAM (Included)
CPU Cores: 2 CPU Cores (Included)
Operating System: Debian 12 64 Bit (Recommended Min. 2 GB RAM)
Location: San Jose, CA (Test IP: 192.210.207.88)
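Since the VPS is in San Jose and at least two of the configured sites are Chinese financial portals, cross-Pacific latency is a plausible factor, so it may help to measure plain HTTP response time before tuning Puppeteer. Below is a rough probe, assuming Node.js 18+ (global `fetch` and `AbortSignal.timeout`). `probe` and its parameters are hypothetical, and `fetchImpl` is injectable only so the function can be exercised without network access.

```javascript
// Hypothetical diagnostic: time a plain HTTP GET without any browser involved.
async function probe(url, timeoutMs = 15000, fetchImpl = fetch) {
  const started = performance.now();
  try {
    const response = await fetchImpl(url, {
      signal: AbortSignal.timeout(timeoutMs), // abort if the server is too slow
      redirect: 'follow',
    });
    const body = await response.text();
    return {
      url,
      ok: response.ok,
      status: response.status,
      bytes: Buffer.byteLength(body),
      ms: Math.round(performance.now() - started),
    };
  } catch (error) {
    // Timeouts and connection failures both end up here.
    return { url, ok: false, error: error.name, ms: Math.round(performance.now() - started) };
  }
}

// Possible usage with the script's own config:
// for (const site of config) console.log(await probe(site.url));
```

If the plain request is already slow or fails from the VPS but not locally, the bottleneck is likely the network path or IP-based blocking rather than the Puppeteer settings.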