Help with BS4

2017-05-16 23:03:29 +08:00
 wudaown
<body><html> <table border="1" width="100%" cellspacing="0" cellpadding="1"> <tr bgcolor="#3366FF"> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Date </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Day </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Time </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Course </font></td> <td align="left" width="40%" valign="top"><font color="#FFFFFF"> Course Title </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Duration </font></td> <tr align="yes" valign="yes" bgcolor="#99CCFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> AC1101 </td> <td align="left" width="40%" valign="top"> ACCOUNTING I </td> <td align="left" width="10%" valign="top"> 2.5 </td> </tr> <tr align="yes" valign="yes" bgcolor="#FFFFFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> AD1101 </td> <td align="left" width="40%" valign="top"> FINANCIAL ACCOUNTING </td> <td align="left" width="10%" valign="top"> 2.5 </td> </tr> <tr align="yes" valign="yes" bgcolor="#99CCFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> BA3201 </td> <td align="left" width="40%" valign="top"> LIFE CONTINGENCIES AND DEMOGRAPHY </td> <td align="left" width="10%" valign="top"> 3 </td> <tr align="yes" valign="yes" bgcolor="#FFFFFF"> </table> </body></html>

I have an HTML file like the one above and want to export it to JSON in this format:

{"AC1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AC1101","name":"ACCOUNTING I","duration":"2.5"},"AD1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AD1101","name":"FINANCIAL ACCOUNTING","duration":"2.5"},"BA2201":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"BA2201","name":"ACTUARIAL ECONOMICS","duration":"2.5"}}

https://gist.github.com/wudaown/c4f46daa4bd6edc42b8d870fd77c7322

How can I do this export with bs4? I would rather not use regular expressions.

Thanks

2193 views
Node: Python
4 replies
15015613
2017-05-16 23:55:52 +08:00
In [1]: from lxml import etree
In [2]: with open('tmp.html','r') as f:
   ...:     tree=etree.HTML(f.read())
In [10]: tmp=tree.xpath('//tr')
In [29]: import json
In [37]: out=list()
    ...: for tmp1 in tmp[1:]:
    ...:     i=0
    ...:     dict_d={1:'Date',2:'Day',3:'Time',4:'Course',5:' Course Title',6:'Duration'}
    ...:     t1=dict()
    ...:     for t in tmp1:
    ...:         i=i+1
    ...:         t2=t.xpath('text()')[0]
    ...:         t1[dict_d[i]]=t2
    ...:     out.append(t1)
In [45]: out2=dict()
    ...: for o in out:
    ...:     try:
    ...:         out2[o['Course']]={'Course Title':o[' Course Title'],'Date':o['Date'],'Day':o['Day'],'Duration':o['Duration'],'Time':o['Time']}
    ...:     except:
    ...:         pass
In [46]: out2
Out[46]:
{' AC1101 ': {'Course Title': ' ACCOUNTING I ',
'Date': ' 24 November 2017 ',
'Day': ' Friday ',
'Duration': ' 2.5 ',
'Time': ' 9.00 am '},
' AD1101 ': {'Course Title': ' FINANCIAL ACCOUNTING ',
'Date': ' 24 November 2017 ',
'Day': ' Friday ',
'Duration': ' 2.5 ',
'Time': ' 9.00 am '},
' BA3201 ': {'Course Title': ' LIFE CONTINGENCIES AND DEMOGRAPHY ',
'Date': ' 24 November 2017 ',
'Day': ' Friday ',
'Duration': ' 3 ',
'Time': ' 9.00 am '}}
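The output above still carries the padding spaces from the table cells, and its key names differ from the JSON sample in the question. A minimal sketch of one way to tidy it up, using only the standard library, might look like this. The RENAME mapping and the normalize helper are assumptions made for illustration, not part of the original reply.

```python
import json

# One entry copied from the Out[46] output above, padding included.
raw = {
    ' AC1101 ': {'Course Title': ' ACCOUNTING I ',
                 'Date': ' 24 November 2017 ',
                 'Day': ' Friday ',
                 'Duration': ' 2.5 ',
                 'Time': ' 9.00 am '},
}

# Assumed mapping from the reply's keys to the field names
# used in the JSON sample from the question.
RENAME = {'Date': 'date', 'Day': 'day', 'Time': 'time',
          'Course Title': 'name', 'Duration': 'duration'}

def normalize(raw):
    """Strip padding from keys and values and rename the fields."""
    clean = {}
    for code, fields in raw.items():
        code = code.strip()
        record = {RENAME[k]: v.strip() for k, v in fields.items()}
        record['code'] = code  # the sample repeats the code inside each record
        clean[code] = record
    return clean

clean = normalize(raw)
print(json.dumps(clean, indent=2))
```

Stripping at this stage leaves the parsing code untouched; calling .strip() on each cell while reading the rows would work just as well.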
15015613
2017-05-16 23:59:35 +08:00
import json
from lxml import etree

# Parse the HTML file and collect every table row.
with open('tmp.html','r') as f:
    tree=etree.HTML(f.read())
tmp=tree.xpath('//tr')

# Column position -> field name, in the order the cells appear.
dict_d={1:'Date',2:'Day',3:'Time',4:'Course',5:'Course Title',6:'Duration'}

out=list()
for tmp1 in tmp[1:]:  # skip the header row
    i=0
    t1=dict()
    for t in tmp1:
        i=i+1
        t2=t.xpath('text()')[0]
        t1[dict_d[i]]=t2
    out.append(t1)

# Re-key the rows by course code.
out2=dict()
for o in out:
    try:
        out2[o['Course']]={'Course Title':o['Course Title'],'Date':o['Date'],'Day':o['Day'],'Duration':o['Duration'],'Time':o['Time']}
    except KeyError:
        # The trailing empty <tr> has no cells, so it has no 'Course' key.
        pass
# Emit JSON, as the question asks.
print(json.dumps(out2))
wudaown
2017-05-17 00:05:16 +08:00
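@15015613 Thank you very much for the answer. It is all stuff I have not seen before, so I will need some time to digest it. While waiting I already got it working with dict, list and bs4. The code just looks rather beginner-level.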
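The bs4 code mentioned here was not posted in the thread. For readers looking for a bs4 version, a minimal sketch under some assumptions could look like the following. It is not the original poster's code. The HTML string is a shortened, well-formed copy of the table from the question, and FIELDS and table_to_dict are names made up for this example, with the field names taken from the JSON sample in the question.

```python
import json
from bs4 import BeautifulSoup

# Shortened, well-formed copy of the table from the question (attributes omitted).
HTML = """
<html><body><table>
<tr><td><font> Date </font></td><td><font> Day </font></td><td><font> Time </font></td>
<td><font> Course </font></td><td><font> Course Title </font></td><td><font> Duration </font></td></tr>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td>
<td> AC1101 </td><td> ACCOUNTING I </td><td> 2.5 </td></tr>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td>
<td> BA3201 </td><td> LIFE CONTINGENCIES AND DEMOGRAPHY </td><td> 3 </td></tr>
<tr></tr>
</table></body></html>
"""

# Output field names, in column order, taken from the JSON sample in the question.
FIELDS = ["date", "day", "time", "code", "name", "duration"]

def table_to_dict(html):
    """Return {course code: record} for every complete data row."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.find_all("tr")[1:]:  # skip the header row
        # get_text(strip=True) drops the padding around each cell's text.
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != len(FIELDS):
            continue  # ignore empty or malformed rows
        record = dict(zip(FIELDS, cells))
        result[record["code"]] = record
    return result

exams = table_to_dict(HTML)
print(json.dumps(exams, indent=2))
```

The file in the question leaves some tr tags unclosed, and parsers differ in how they repair that, so the length check on each row is there as a guard rather than a guarantee.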
justtery
2017-05-17 08:17:03 +08:00
Why not use pyquery? [smirk]


https://www.v2ex.com/t/361807
