小内存如何对两个大型列表求差集？

求 a 与 b 的差集，内存限制 1G

# 500w
a = [...] 

# 5000w
b = [...]

# 记录为 10 - 100 不等字符串

不知是不是关键词不对，搜索到的方案都是 set、numpy，或 set(b) 遍历 a，内存实在扛不住啊

求指点，速度慢些也可以接受

clino

2017-10-30 10:28:02 +08:00

@ioven 如果 sqlite 操作都放在一个事务里面,估计时间优化得比较短

ioven

2017-10-30 21:50:37 +08:00

@clino 多谢提醒，默认是智能事务，试试手工开启事务看速度有没有提升

shamashii

2017-11-21 22:36:24 +08:00

生成 110s，比较 120s,实验时感觉坑点竟然在于生成随机字符串效率,求改进
```
import timeit
def main():
import h5py, cyrandom
allchr = "".join((chr(i) for i in range(33,127)))
pspool = [[cyrandom.choice(allchr) for _ in range(cyrandom.randint(10, 100))] for x in range(100000)]

chunkl = []
for _ in range(5000000):
b1 = cyrandom.choice(pspool)
cyrandom.shuffle(b1)
chunkl.append(''.join(b1).encode('utf-8'))

f = h5py.File('h5.h5','w')
for k in range(50000000//5000000):
l = [str(k).encode('utf-8')]
# cyrandom.shuffle(chunkl)
print(k)
f.create_dataset(str(k), data=chunkl+l,)
del chunkl
f.close()

def query():
import h5py
f = h5py.File('h5.h5','a')
wbw = set(f['0'].value)
count = []
for k in f.keys():
print(k)
for x in f[k].value:
if x not in wbw:
count.append(x)
print(count)
f.close()

print(timeit.timeit(main, number=1))
print(timeit.timeit(query, number=1))
```

shamashii

2017-11-21 22:42:24 +08:00

sigh``
https://pastebin.com/uDna4GK9

这是一个专为移动设备优化的页面（即为了让你能够在 Google 搜索结果里秒开这个页面），如果你希望参与 V2EX 社区的讨论，你可以继续到 V2EX 上打开本讨论主题的完整版本。

https://www.v2ex.com/t/401369

V2EX 是创意工作者们的社区，是一个分享自己正在做的有趣事物、交流想法，可以遇见新朋友甚至新机会的地方。

V2EX is a community of developers, designers and creative people.