Calling all gentlemen who love doujinshi and know their regex~

2015-08-28 17:09:08 +08:00
 eromoe

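As everyone knows, once you have a lot of doujinshi, organizing them becomes a chore =.=
I want to sort them by artist, by circle, by event, and so on, so I wrote a regex to extract all of that information from a book's name.
But the titles come in all sorts of formats, for example: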
0. (event ) (tag ) [group (artist )] title (form ) [addition1] [addition2]

  1. (event ) [group (artist )] title (form ) [addition1]

  2. [event] [group (artist )] title (form ) (addition1 )

  3. (tag ) [group (artist )] title

  4. [group (artist )] title

  5. title

I tried writing one:

import re
regex_patern = ur'([\(\[](?P<event>[^\)\]]*)[\)\]])?\s*([\(\[](?P<type>[^\)\](\)\])]*)[\)\]])?\s*(\[(?P<group>[^\(\]]*)(\((?P<artist>[^\)]*)\))?\])?(?P<title>[^\(\)\[\]]*)([\(\[](?P<from>[^\)\]]*)[\)\]])?(\s*[\(\[](?P<more1>[^\)\]]*)[\)\]])'

p = re.compile(regex_patern)

rows= [
'(event ) (tag ) [group (artist )] title (form ) [addition1] [addition2]',
'(event ) [group (artist )] title (form ) [addition1]',
'[event] [group (artist )] title (form ) (addition1 )',
'(tag ) [group (artist )] title',
'[group (artist )] title',
'title',
]

for r in rows:
    r = re.search(p, r)
    print r.groupdict()

# Output:

{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': 'tag', u'event': 'event'}
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': 'tag'}
{u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': None}
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last )
<ipython-input-5-831c548bc3f0> in <module>()
     15 for r in rows:
     16     r = re.search (p, r )
---> 17     print r.groupdict ()

AttributeError: 'NoneType' object has no attribute 'groupdict'

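From the fourth row on, the results are wrong. I feel the regex should match the simple rules in the middle first and only then expand to the most complex rule,
but I don't know how to write that... so I'm here to ask for help.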

2948 views
Node: Python
9 replies
plqws
2015-08-28 17:48:29 +08:00
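Why does it have to be a regex? That code looks really hard to modify.
Also, I think this kind of thing would be easier to organize with Japanese word segmentation plus tags.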
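As a rough illustration of the "tags" idea, here is a minimal sketch that splits a name into bracketed tokens and plain text instead of matching everything with one pattern. It does no Japanese word segmentation at all, only bracket splitting; the names TOKEN and tokenize are made up for this example, and it is written in Python 3 rather than the Python 2 used in the thread.

```python
import re

# One token is a [...] block, a (...) block, or a run of plain text.
# A [...] block may contain parentheses, e.g. "[group (artist)]".
TOKEN = re.compile(
    r'\[(?P<square>[^\[\]]*)\]'
    r'|\((?P<round>[^()]*)\)'
    r'|(?P<text>[^\[\]()]+)'
)

def tokenize(name):
    """Return (kind, text) pairs, skipping whitespace-only text runs."""
    tokens = []
    for m in TOKEN.finditer(name):
        kind = m.lastgroup              # 'square', 'round' or 'text'
        text = m.group(kind).strip()
        if text:
            tokens.append((kind, text))
    return tokens

print(tokenize('(event) [group (artist)] title (form)'))
# [('round', 'event'), ('square', 'group (artist)'), ('text', 'title'), ('round', 'form')]
```

Each token can then be tagged by separate code, which is easier to change than one long pattern.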
rogerchen
2015-08-28 18:18:38 +08:00
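(\s*[\(\[](?P<more1>[^\)\]]*)[\)\]]) Why does this last part put the whitespace inside the capture group? That is inconsistent with the earlier parts. Also, the more1 segment should be optional, right? Only the title segment should be mandatory.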
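To make the point concrete, here is a cut-down sketch with only the title and one trailing block. It is not the full pattern from the thread; required_tail and optional_tail are illustrative names, and the code uses Python 3 syntax.

```python
import re

# Trailing block is mandatory, as in the original pattern's last group.
required_tail = re.compile(
    r'(?P<title>[^()\[\]]*)(\s*[(\[](?P<more1>[^)\]]*)[)\]])')

# Same idea, but whitespace moved outside the group and the block made optional.
optional_tail = re.compile(
    r'(?P<title>[^()\[\]]*)\s*([(\[](?P<more1>[^)\]]*)[)\]])?')

print(required_tail.search('title'))              # None: no bracket to satisfy the last group
print(optional_tail.search('title').groupdict())  # {'title': 'title', 'more1': None}
```

With the mandatory version, re.search returns None for a bare title, which is what produces the AttributeError in the traceback above.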
rogerchen
2015-08-28 18:20:54 +08:00
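OP, I also noticed another thing: you write the source field as from in some places and form in others. It doesn't break anything, but it really confused me.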
rogerchen
2015-08-28 18:24:18 +08:00
After the change it looks like this. There still seems to be a small problem, so I'll keep looking.
$ python re.py
{u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': 'tag ', u'event': 'event '}
{u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event '}
{u'from': 'form ', u'more1': 'addition1 ', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': None, u'more1': None, u'artist': 'artist ', u'title': ' title', u'group': 'group ', u'type': None, u'event': 'tag '}
{u'from': None, u'more1': None, u'artist': None, u'title': '', u'group': None, u'type': None, u'event': 'group (artist '}
{u'from': None, u'more1': None, u'artist': None, u'title': 'title', u'group': None, u'type': None, u'event': None}
rogerchen
2015-08-28 18:34:56 +08:00
import re
regex_patern = ur'([\(\[](?P<event>[^\()\)\]]*)[\)\]])?\s*([\(\[](?P<type>[^\)\](\)\])]*)[\)\]])?\s*(\[(?P<group>[^\(\]]*)(\((?P<artist>[^\)]*)\))?\])?(?P<title>[^\(\)\[\]]*)([\(\[](?P<from>[^\)\]]*)[\)\]])?\s*([\(\[](?P<more1>[^\)\]]*)[\)\]])?'

p = re.compile(regex_patern)

rows= [
'(event ) (tag ) [group (artist )] title (form ) [addition1] [addition2]',
'(event ) [group (artist )] title (form ) [addition1]',
'[event] [group (artist )] title (form ) (addition1 )',
'(tag ) [group (artist )] title',
'[group (artist )] title',
'title',
]

for r in rows:
    r = re.search(p, r)
    print r.groupdict()

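Fully fixed now. You had two mistakes: first, the final block was a mandatory capture; second, event must not be allowed to capture [group (artist )], so the character class in the event segment has to exclude \( as well.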

$ python re.py
{u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': 'tag ', u'event': 'event '}
{u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event '}
{u'from': 'form ', u'more1': 'addition1 ', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
{u'from': None, u'more1': None, u'artist': 'artist ', u'title': ' title', u'group': 'group ', u'type': None, u'event': 'tag '}
{u'from': None, u'more1': None, u'artist': 'artist ', u'title': ' title', u'group': 'group ', u'type': None, u'event': None}
{u'from': None, u'more1': None, u'artist': None, u'title': 'title', u'group': None, u'type': None, u'event': None}
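For readability, the corrected pattern can also be laid out with re.VERBOSE. The following is a sketch in Python 3 (the thread itself uses Python 2), with the duplicated characters in the character classes removed and the values stripped of surrounding whitespace. PATTERN and parse_title are names invented for this example, and the sample names are written without the extra spaces before the closing brackets.

```python
import re

PATTERN = re.compile(r'''
    (?:[(\[](?P<event>[^()\]]*)[)\]])?\s*   # optional (event) or [event]; "(" is excluded
                                            # so it cannot swallow "[group (artist)]"
    (?:[(\[](?P<type>[^()\]]*)[)\]])?\s*    # optional (tag)
    (?:\[(?P<group>[^(\]]*)                 # optional [group ...
        (?:\((?P<artist>[^)]*)\))?\])?      #   ... (artist)]
    (?P<title>[^()\[\]]*)                   # title: everything up to the next bracket
    (?:[(\[](?P<from>[^)\]]*)[)\]])?\s*     # optional (form)
    (?:[(\[](?P<more1>[^)\]]*)[)\]])?       # optional extra block
''', re.VERBOSE)

def parse_title(name):
    """Return the named fields of a book name, stripped; missing fields are None."""
    fields = PATTERN.search(name).groupdict()
    return {key: value.strip() if value is not None else None
            for key, value in fields.items()}

result = parse_title('[group (artist)] title')
print(result['event'], result['group'], result['artist'], result['title'])
# None group artist title
```

Note that a lone leading (tag) still lands in event, exactly as in the output above, because nothing in the text distinguishes the two.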
eromoe
2015-08-29 08:52:46 +08:00
@rogerchen Thanks a lot. I'm going to write some code and test how well the classification works~
eromoe
2015-08-29 09:02:27 +08:00
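I just noticed a really awkward problem...
[event] [group] title (from )
[event] [artist] title (from )

Is this unsolvable...?
Can a regex express this: grab one [XXX] to the left of the title, and if XXX does not contain 同人 (doujin), Cxx, 成年 XXX (adult XXX) or the like, treat it as the group+artist block?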
rogerchen
2015-08-29 09:12:01 +08:00
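Once you need to compare strings, doing it with regex alone becomes black magic. I'd suggest capturing the block first and then writing a bit of code to decide.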
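One way this "capture first, decide in code" approach might look is sketched below in Python 3. The keyword list only covers the three markers mentioned in the question (同人, Cxx, 成年) and is an assumption to be extended; NON_CIRCLE, LEADING_BLOCK and classify are hypothetical names. The sketch cannot tell a group from an artist, it only separates event-like labels from the group/artist block.

```python
import re

# Assumed markers of blocks that are NOT the group/artist block.
NON_CIRCLE = re.compile(r'同人|成年|\bC\d{2,3}\b')

# One leading block: [...] (may contain parentheses) or (...).
LEADING_BLOCK = re.compile(r'\s*(?:\[([^\[\]]*)\]|\(([^()]*)\))')

def classify(name):
    """Split off leading bracket blocks and sort them into labels and one circle block."""
    info = {'labels': [], 'circle': None, 'rest': ''}
    pos = 0
    while True:
        m = LEADING_BLOCK.match(name, pos)
        if not m:
            break
        block = (m.group(1) if m.group(1) is not None else m.group(2)).strip()
        if NON_CIRCLE.search(block) or info['circle'] is not None:
            info['labels'].append(block)    # event, tag, or any extra block
        else:
            info['circle'] = block          # first block without a marker
        pos = m.end()
    info['rest'] = name[pos:].strip()       # title plus any trailing blocks
    return info

print(classify('[C88] [some circle] title (form)'))
# {'labels': ['C88'], 'circle': 'some circle', 'rest': 'title (form)'}
```

The decision logic lives in ordinary code, so adding a new keyword or rule does not require touching the bracket-matching pattern.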
eromoe
2015-08-29 09:54:02 +08:00
@rogerchen Yeah, that's the only way.


https://www.v2ex.com/t/216752
