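I need to use the spider-flow framework to crawl the content of the following three pages: https://price.21food.cn/product/939.html https://price.21food.cn/product/1505.html https://price.21food.cn/product/196.html
I have already implemented a crawler for one of these three URLs. The pages differ only in their data, so all three could be handled by a single crawler. Previously, with Selenium, I simply built a collection of URLs and iterated over it with a for loop, but I am finding this hard to do in spider-flow.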
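For reference, the for-loop approach mentioned above looks roughly like this in Python. This is only a minimal sketch: `crawl_all` and its `fetch` parameter are hypothetical names introduced for illustration, and `fetch` stands in for whatever actually loads a page.

```python
from typing import Callable, Dict, List

# The three product pages from the question.
URLS: List[str] = [
    "https://price.21food.cn/product/939.html",
    "https://price.21food.cn/product/1505.html",
    "https://price.21food.cn/product/196.html",
]


def crawl_all(urls: List[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch every URL in order and return a mapping of URL -> page source.

    `fetch` is a hypothetical callable supplied by the caller; it receives
    one URL and returns that page's HTML as a string.
    """
    pages: Dict[str, str] = {}
    for url in urls:
        pages[url] = fetch(url)
    return pages
```

With Selenium, `fetch` could be a small function that calls `driver.get(url)` and returns `driver.page_source`; passing it in as a parameter keeps the loop itself independent of the browser.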
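My idea is to define a URL collection first and then crawl in a loop, so I built the flow described here.
The first node is a variable definition. It defines urlList, a collection of the three addresses: ["https://price.21food.cn/product/939.html","https://price.21food.cn/product/1505.html","https://price.21food.cn/product/196.html"]
The second node is a loop. It defines a loop index named urlIndex, and the loop count is taken from urlList (${list.length(urlList)}).
The third node is another variable definition. It defines url with the value ${urlList[urlIndex]}, which simply picks the current URL out of the collection above.
The fourth node is the request node (开始抓取). Its URL is set to the variable from the previous step: ${url}
Everything after that is the data-extraction logic, which works; I have tested it before. The construction looks fine to me, but when I actually run it, the flow ends right after the first variable definition finishes.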
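To make the intended behaviour concrete, here is a rough Python model of the flow described above, with one line per node. It is a sketch of what I expect to happen, not of how spider-flow executes a graph internally. `run_flow`, `request` and `extract_rows` are hypothetical names; `extract_rows` is assumed to return one sequence of seven values per list item.

```python
from typing import Callable, Dict, List, Sequence

# Variable names defined in the flow for each extracted row.
FIELDS: List[str] = ["name", "market", "specifications", "top", "avg", "low", "dataDate"]


def run_flow(
    url_list: List[str],
    request: Callable[[str], str],
    extract_rows: Callable[[str], List[Sequence[str]]],
) -> List[Dict[str, str]]:
    """Model of the intended flow: an outer loop over URLs, an inner loop over rows."""
    outputs: List[Dict[str, str]] = []
    for url_index in range(len(url_list)):          # loop node: urlIndex
        url = url_list[url_index]                   # variable node: url
        resp_html = request(url)                    # request node
        data_list = extract_rows(resp_html)         # variable node: dataList
        for index in range(len(data_list)):         # loop node: index
            row = data_list[index]                  # variable node: the seven fields
            outputs.append(dict(zip(FIELDS, row)))  # output node
    return outputs
```

The expected result is one output record per row per URL, which is what the single-URL version already produces for one page.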
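I searched online and went through many tutorials, but I could not find any tutorial or example covering this requirement, and for some reason I cannot open the official documentation. I am out of options, so I am asking here. If anyone knows how to do this, I would be grateful for any pointers. Thanks in advance.
Gitee (码云) repository of the spider-flow framework: https://gitee.com/ssssssss-team/spider-flow
Download the project and open it in IDEA, run the db.sql provided by the project in your database, and set the database address in the configuration file; it should then run correctly. The default address is localhost:8088
Below is the crawler I built. To run it, paste the content into spider-flow using the XML edit option.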
<mxGraphModel>
<root>
<mxCell id="0">
<JsonProperty as="data">
{"spiderName":"食品商务网爬虫(未整合多个网址)","submit-strategy":"random","threadCount":""}
</JsonProperty>
</mxCell>
<mxCell id="1" parent="0"/>
<mxCell id="2" value="开始" style="start" parent="1" vertex="1">
<mxGeometry x="300" y="80" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"shape":"start"}
</JsonProperty>
</mxCell>
<mxCell id="3" value="开始抓取" style="request" parent="1" vertex="1">
<mxGeometry x="490" y="80" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"开始抓取","loopVariableName":"","method":"GET","sleep":"","timeout":"","response-charset":"","retryCount":"","retryInterval":"","body-type":"none","body-content-type":"text/plain","loopCount":"","url":"${url}","proxy":"","request-body":"","follow-redirect":"1","tls-validate":"1","cookie-auto-set":"1","repeat-enable":"0","shape":"request"}
</JsonProperty>
</mxCell>
<mxCell id="4" value="定义变量" style="variable" parent="1" vertex="1">
<mxGeometry x="620" y="80" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"定义变量","loopVariableName":"","variable-name":["dataList"],"variable-description":[""],"loopCount":"","variable-value":["${extract.xpaths(resp.html,'/html/body/div[2]/div[3]/div/div[2]/div[1]/div[2]/div[2]/ul/li')}"],"shape":"variable"}
</JsonProperty>
</mxCell>
<mxCell id="9" value="" style="strokeWidth=2;sharp=1;" parent="1" source="3" target="4" edge="1">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="11" value="循环" style="loop" parent="1" vertex="1">
<mxGeometry x="620" y="170" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"循环","loopItem":"","loopVariableName":"index","loopCount":"${list.length(dataList)}","loopStart":"0","loopEnd":"-1","shape":"loop"}
</JsonProperty>
</mxCell>
<mxCell id="12" value="" style="strokeWidth=2;sharp=1;" parent="1" source="4" target="11" edge="1">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="13" value="输出" style="output" parent="1" vertex="1">
<mxGeometry x="790" y="334" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"输出","loopVariableName":"","tableName":"","csvName":"","csvEncoding":"GBK","output-name":["产品名","市场","规格","最高价格","平均价格","最低价格","日期"],"loopCount":"","output-value":["${name}","${market}","${specifications}","${top}","${avg}","${low}","${dataDate}"],"output-all":"0","output-database":"0","output-csv":"0","shape":"output"}
</JsonProperty>
</mxCell>
<mxCell id="15" value="定义变量" style="variable" parent="1" vertex="1">
<mxGeometry x="620" y="250" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"定义变量","loopVariableName":"","variable-name":["name","market","specifications","top","avg","low","dataDate"],"variable-description":["","","","","","",""],"loopCount":"","variable-value":["${dataList[index].selectors('table tbody tr td a')[0].text()}","${dataList[index].selectors('table tbody tr td a')[1].text()}","${dataList[index].selectors('table tbody tr td span')[0].text()}","${dataList[index].selectors('table tbody tr td span')[1].text()}","${dataList[index].selectors('table tbody tr td span')[3].text()}","${dataList[index].selectors('table tbody tr td span')[2].text()}","${dataList[index].selectors('table tbody tr td span')[4].text()}"],"shape":"variable"}
</JsonProperty>
</mxCell>
<mxCell id="16" value="" style="strokeWidth=2;sharp=1;" parent="1" source="11" target="15" edge="1">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="18" value="" style="strokeWidth=2;sharp=1;" parent="1" source="15" target="13" edge="1">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="27" value="定义变量" style="variable" parent="1" vertex="1">
<mxGeometry x="90" y="440" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"定义变量","loopVariableName":"","variable-name":["urlList"],"variable-description":[""],"loopCount":"","variable-value":["[\"https://price.21food.cn/product/939.html\",\"https://price.21food.cn/product/1505.html\",\"https://price.21food.cn/product/196.html\"]"],"shape":"variable"}
</JsonProperty>
</mxCell>
<mxCell id="29" value="循环" style="loop" parent="1" vertex="1">
<mxGeometry x="180" y="440" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"循环","loopItem":"","loopVariableName":"urlIndex","loopCount":"${list.length(urlList)}","loopStart":"0","loopEnd":"-1","shape":"loop"}
</JsonProperty>
</mxCell>
<mxCell id="31" value="定义变量" style="variable" parent="1" vertex="1">
<mxGeometry x="262" y="440" width="32" height="32" as="geometry"/>
<JsonProperty as="data">
{"value":"定义变量","loopVariableName":"","variable-name":["url"],"variable-description":[""],"loopCount":"","variable-value":["${urlList[urlIndex]}"],"shape":"variable"}
</JsonProperty>
</mxCell>
<mxCell id="42" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="27" target="29">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="43" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="29" target="31">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="44" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="2" target="27">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
<mxCell id="45" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="31" target="3">
<mxGeometry relative="1" as="geometry"/>
<JsonProperty as="data">
{"value":"","exception-flow":"0","lineWidth":"2","line-style":"sharp","lineColor":"black","condition":"","transmit-variable":"1"}
</JsonProperty>
</mxCell>
</root>
</mxGraphModel>