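Question: should this kind of multi-condition duplicate-row removal in pandas be done with a built-in pandas method, or with a hand-written algorithm? Also a question about rebuilding the index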

This topic was created 2214 days ago; the information in it may have since evolved or changed.

The df looks like this:

rubDF = pd.DataFrame(columns=["corp", "stype", "mktime", "serNum", "status", "A","B","C","D"])

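I want to deduplicate it. The idea: among rows with the same corp, mktime and status, keep only the row whose A is "20kg".

I read the docs, and drop_duplicates seems to offer only fairly simple options. In this case, is the only way to write my own algorithm in Python? I wonder whether pandas has some trick for this. Built-in pandas methods are generally faster than a hand-written Python loop, and the data has more than 800,000 rows.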

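I also have a question about rebuilding the index:

After a simple deduplication of the df:

rubDF.drop_duplicates(subset=None, keep='first', inplace=True)

the index values are the default ones created with the df; I did not build an index separately or designate a column as the index.

If I simply want to rebuild the index, ordered by mktime descending, how should I do it?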
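
For reference, one common way to get a fresh index in mktime-descending order is to sort and then reset the index. This is only a minimal sketch: the sample values are made up, and it assumes mktime holds values that sort correctly as-is (for example ISO-formatted date strings or real datetimes).

```python
import pandas as pd

# Hypothetical sample data; column names follow the post, values are invented.
df = pd.DataFrame({
    "corp": ["c1", "c1", "c2", "c2"],
    "mktime": ["2019-08-01", "2019-08-01", "2019-08-03", "2019-08-02"],
    "A": ["20kg", "20kg", "30kg", "20kg"],
})

# Drop fully identical rows, as in the post (returning a new frame instead of inplace).
df = df.drop_duplicates(keep="first")

# Sort by mktime descending, then discard the old index and renumber from 0.
df = df.sort_values("mktime", ascending=False).reset_index(drop=True)

print(df)
```

With `drop=True` the old index is thrown away rather than kept as an extra column.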


6 replies • 2019-08-13 20:18:44 +08:00

wqzjk393

2019-08-13 11:38:27 +08:00

import pandas as pd

# Sample rows for testing
data = [["corp1", "stype1", "mktime1", "serNum1", "status1", "20kg","B1","C1","D1"],
["corp1", "stype1", "mktime1", "serNum2", "status3", "20kg","B1","C1","D1"],
["corp1", "stype1", "mktime1", "serNum2", "status5", "30kg","B1","C1","D1"],
["corp1", "stype1", "mktime1", "serNum7", "status3", "40kg","B1","C1","D1"],
["corp2", "stype3", "mktime4", "serNum4", "status9", "A1","B1","C1","D1"],
["corp2", "stype1", "mktime67", "serNum2", "status4", "20kg","B1","C1","D1"]]
rubDF = pd.DataFrame(data,columns=["corp", "stype", "mktime", "serNum", "status", "A","B","C","D"])
# Sort key: 1 for rows where A is "20kg", 2 for all other rows
rubDF['sortindex'] = rubDF.apply(lambda x:1 if x.A == '20kg' else 2,axis=1)
# Move the "20kg" rows to the top
rubDF.sort_values(by=['sortindex'],ascending=True,inplace=True)
# Keep the first row of each duplicate group (as posted, the subset is column A only;
# the next reply corrects this line)
rubDF.drop_duplicates(['A'],keep='first',inplace=True)
print(rubDF)

Not sure whether this meets your needs.

wqzjk393

2019-08-13 11:40:04 +08:00

Oops, my mistake. Change the second-to-last line to: rubDF.drop_duplicates(["corp", "stype", "mktime"],keep='first',inplace=True)

wqzjk393

2019-08-13 11:43:28 +08:00

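The requirement: among rows with the same corp, mktime and status, keep only the one whose A is "20kg". So the approach is to sort first so that the 20kg rows come first, then call drop_duplicates([corp, mktime, status], keep='first'), which keeps the first row of each group sharing the same [corp, mktime, status]. The sorting part is then easy: just write a map that fits your needs.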
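
The sort-then-deduplicate idea above can also be written without a row-wise apply, which may matter at several hundred thousand rows. A minimal sketch, assuming the key columns are corp, mktime and status as in the question; the sample rows and the helper column name prio are made up for illustration:

```python
import pandas as pd

# Hypothetical sample rows, reduced to the columns that matter here.
data = [
    ["corp1", "mktime1", "status1", "30kg"],
    ["corp1", "mktime1", "status1", "20kg"],
    ["corp1", "mktime1", "status3", "40kg"],
    ["corp2", "mktime4", "status9", "20kg"],
    ["corp2", "mktime4", "status9", "20kg"],
]
df = pd.DataFrame(data, columns=["corp", "mktime", "status", "A"])

keys = ["corp", "mktime", "status"]

deduped = (
    # Vectorized priority column: 0 for "20kg" rows, 1 for everything else.
    df.assign(prio=(df["A"] != "20kg").astype(int))
      # Stable sort, so rows with equal priority keep their original order.
      .sort_values("prio", kind="mergesort")
      # The first row of each key group is now a "20kg" row whenever one exists.
      .drop_duplicates(subset=keys, keep="first")
      .drop(columns="prio")
      # Restore the original row order.
      .sort_index()
)

print(deduped)
```

Note that a key group containing no "20kg" row still keeps its first row with this approach; whether that is wanted depends on the requirement.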

cigarzh

2019-08-13 11:57:44 +08:00 via iPhone

Too lazy to even look through the docs...?

qazwsxkevin

2019-08-13 17:10:49 +08:00

@wqzjk393 Nice idea, but I suspect it will be quite slow...

@cigarzh If you understood the problem, you wouldn't say that.

cigarzh

2019-08-13 20:18:44 +08:00 via iPhone

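@qazwsxkevin #5 With columns=["a", "b", "c", "d"], isn't it just keeping the d==20 rows among those duplicated on a, b, c? Build a boolean Series from duplicated(abc), build another boolean Series for d!=20, AND the two Series together, negate the result, and pass it to df.loc. Isn't that all there is to it?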
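
The boolean-mask approach described above might look roughly like this. It is a sketch under assumptions: the sample rows are made up, the column names follow the original question rather than a/b/c/d, and keep=False is my choice so that every row of a duplicated group is flagged, not just the later ones.

```python
import pandas as pd

# Hypothetical sample rows, reduced to the columns that matter here.
data = [
    ["corp1", "mktime1", "status1", "30kg"],
    ["corp1", "mktime1", "status1", "20kg"],
    ["corp1", "mktime1", "status3", "40kg"],
    ["corp2", "mktime4", "status9", "20kg"],
    ["corp2", "mktime4", "status9", "20kg"],
]
df = pd.DataFrame(data, columns=["corp", "mktime", "status", "A"])

keys = ["corp", "mktime", "status"]

# True for every row whose key combination occurs more than once.
in_dup_group = df.duplicated(subset=keys, keep=False)
# True for rows whose A is not "20kg".
not_20kg = df["A"] != "20kg"

# Drop rows that are both in a duplicated group and not "20kg".
result = df.loc[~(in_dup_group & not_20kg)]

# Optional: if a group holds several "20kg" rows, keep only the first of them.
final = result.drop_duplicates(subset=keys, keep="first")

print(final)
```

Two things to watch with this variant: a duplicated group that has no "20kg" row disappears entirely, and rows whose key combination is unique are kept whatever their A value is.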