Wrote a script to delete duplicate files

2018-07-16 13:00:21 +08:00

ucun

I've collected a lot of anime-style images, and duplicates are inevitable. I wrote a small script to delete them.


#!/usr/bin/env python
# -*- coding:utf-8 -*-

import sqlite3
import hashlib
import os
import sys

def md5sum(file):
    md5_hash = hashlib.md5()
    with open(file,"rb") as f:
        for byte_block in iter(lambda:f.read(65536),b""):
            md5_hash.update(byte_block)
    return md5_hash.hexdigest()

def create_hash_table():
    if os.path.isfile('filehash.db'):
        os.unlink('filehash.db')
    conn = sqlite3.connect('filehash.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE FILEHASH
            (ID INTEGER PRIMARY KEY AUTOINCREMENT,
            FILE TEXT NOT NULL,
            HASH TEXT NOT NULL);''')
    conn.commit()
    c.close()
    conn.close()

def insert_hash_table(file):
    conn = sqlite3.connect('filehash.db')
    c = conn.cursor()
    md5 = md5sum(file)
    c.execute("INSERT INTO FILEHASH (FILE,HASH) VALUES (?,?);",(file,md5))
    conn.commit()
    c.close()
    conn.close()

def scan_files(dir_path):
    for root,dirs,files in os.walk(dir_path):
        print('create hash table for {} files ...'.format(root))
        for file in files:
            filename = os.path.join(root,file)
            insert_hash_table(filename)

def del_repeat_file(dir_path):
    conn = sqlite3.connect('filehash.db')
    c = conn.cursor()
    # the counter must live outside the loops, otherwise it resets for every file
    removed = 0
    for root,dirs,files in os.walk(dir_path):
        print('scan repeat files {} ...'.format(root))
        for file in files:
            filename = os.path.join(root,file)
            md5 = md5sum(filename)
            c.execute('select * from FILEHASH where HASH=?;',(md5,))
            total = c.fetchall()
            # two or more rows with this hash: another copy of this file still exists
            if len(total) >= 2:
                os.unlink(filename)
                removed += 1
                print('{} removed'.format(filename))
                # drop the row so that the last remaining copy is kept
                c.execute('delete from FILEHASH where HASH=? and FILE=?;',(md5,filename))
                conn.commit()

    conn.close()
    print('removed total {} files.'.format(removed))



def main():
    dir_path = sys.argv[-1]
    create_hash_table()
    scan_files(dir_path)
    del_repeat_file(dir_path)

if __name__ == '__main__':
    main()


34 replies

AX5N

2018-07-16 13:47:08 +08:00

I don't think you need to compare MD5 right away. Compare file sizes in bytes first, and compute MD5 only for files whose sizes match.
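A minimal sketch of the size-first idea, assuming Python 3 and only regular, readable files under the directory. The function names `md5sum` (mirroring the original script's helper) and `find_duplicates_by_size_then_md5` are made up for illustration. It only reports duplicate groups and deletes nothing.

```python
import hashlib
import os
from collections import defaultdict


def md5sum(path, block_size=65536):
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates_by_size_then_md5(dir_path):
    """Group files by size, then hash only groups holding more than one file."""
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(dir_path):
        for name in files:
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []  # each entry is a sorted list of paths with equal size and MD5
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5sum(path)].append(path)
        duplicates.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return duplicates
```

Files with a unique size are never read at all, which is where the saving comes from when most files differ in size.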

AX5N

2018-07-16 13:49:14 +08:00

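Also, if you're collecting anime-style images, you should know that some images come in an old and a new version: the new version may add detail, or may have its resolution deliberately lowered. And if the sources differ, both the MD5 and the resolution can differ. For example, what the same artist posts on pixiv and on Twitter may not be identical.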

swulling

2018-07-16 13:55:49 +08:00

This is a complicated solution to a simple problem. There's no need for sqlite; just keep the hash -> file mapping in a dict.

Also, in shell it takes only one line.
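A rough sketch of the dict-based variant, assuming Python 3. The names `file_md5` and `remove_duplicates` and the `dry_run` flag are illustrative and not part of the original script. The first file seen for each hash is kept; later files with the same hash are reported, and deleted only when `dry_run=False`.

```python
import hashlib
import os


def file_md5(path):
    """Return the hex MD5 digest of a file, read in 64 KiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def remove_duplicates(dir_path, dry_run=True):
    """Keep the first file seen per MD5; report (and optionally delete) the rest."""
    seen = {}     # hash -> first path seen with that hash
    removed = []  # paths that duplicate an earlier file
    for root, _dirs, files in os.walk(dir_path):
        for name in sorted(files):
            path = os.path.join(root, name)
            digest = file_md5(path)
            if digest in seen:
                if not dry_run:
                    os.unlink(path)
                removed.append(path)
            else:
                seen[digest] = path
    return removed
```

Each file is hashed once and the whole job is a single pass. The dict holds one entry per distinct file, so memory grows with the size of the collection.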

slime7

2018-07-16 14:29:00 +08:00

Wouldn't it be better to list the duplicates first and then prompt for which one to delete?
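One way this could look, as a sketch under assumptions: the group of identical files has already been found by some other step, and `resolve_group` with its `choose` parameter are hypothetical names. `choose` defaults to the built-in `input`, and is a parameter only so the prompt can be replaced in scripts or tests.

```python
import os


def resolve_group(paths, choose=input):
    """List one group of identical files and delete all but the chosen one.

    `choose` is called with the prompt text and returns the 1-based number of
    the file to keep. An empty answer skips the group and deletes nothing.
    """
    for i, path in enumerate(paths, start=1):
        print("[{}] {}".format(i, path))
    answer = choose("keep which file? (Enter to skip) ").strip()
    if not answer:
        return []
    keep = int(answer) - 1  # raises ValueError on non-numeric input
    if not 0 <= keep < len(paths):
        raise ValueError("no such entry: {}".format(answer))
    deleted = [p for i, p in enumerate(paths) if i != keep]
    for path in deleted:
        os.unlink(path)
    return deleted
```

Nothing is deleted until an answer has been validated, so a typo aborts with an error instead of removing the wrong file.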

cdlixucd

2018-07-16 14:32:54 +08:00

@swulling Could you share that shell one-liner?

scriptB0y

2018-07-16 18:21:50 +08:00

@cdlixucd Your "python rmrepeatfile.py" is itself one line of shell

scriptB0y

2018-07-16 18:22:06 +08:00

Just kidding, haha. You can use this: https://github.com/adrianlopezroche/fdupes

wsds

2018-07-16 18:25:27 +08:00

@scriptB0y Is the poster above your long-lost brother? Haha

Sanko

2018-07-16 18:28:22 +08:00

https://github.com/ghosx/git/blob/master/photo_del_same.py

fyxtc

2018-07-16 18:40:31 +08:00

@wsds And you're the one in Saiyan form, haha

ucun

2018-07-16 18:55:14 +08:00

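@AX5N It isn't only meant for deleting images. Telling apart different sizes and resolutions of the same picture is a separate problem. Machine learning should be able to handle that.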

ucun

2018-07-16 18:56:21 +08:00

@swulling A dict would work, but memory gets to be a problem once there are a lot of files.

ucun

2018-07-16 18:57:32 +08:00

@scriptB0y That one is quite complete.

ucun

2018-07-16 19:09:50 +08:00

@slime7 One option is to first move all the duplicate files into a single directory and log the moves.
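A sketch of that move-and-log idea, assuming Python 3. The name `quarantine`, the tab-separated log format, and the numeric suffix for name collisions are illustrative choices, not something the thread specifies.

```python
import os
import shutil


def quarantine(paths, dest_dir, log_path):
    """Move each path into dest_dir and append one "source<TAB>destination" line per move."""
    os.makedirs(dest_dir, exist_ok=True)
    moved = []
    with open(log_path, "a", encoding="utf-8") as log:
        for src in paths:
            base = os.path.basename(src)
            stem, ext = os.path.splitext(base)
            dest = os.path.join(dest_dir, base)
            n = 1
            # duplicates from different folders often share a name; never overwrite
            while os.path.exists(dest):
                dest = os.path.join(dest_dir, "{}_{}{}".format(stem, n, ext))
                n += 1
            shutil.move(src, dest)
            log.write("{}\t{}\n".format(src, dest))
            moved.append((src, dest))
    return moved
```

Because every move is recorded with its original location, a mistaken match can be undone by moving the file back by hand.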

ucun

2018-07-16 19:12:48 +08:00

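@AX5N "I don't think you need to compare MD5 right away. Compare file sizes in bytes first, and compute MD5 only for files whose sizes match."
=============================
Comparing MD5 directly is still the safer option.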

May725

2018-07-16 19:17:47 +08:00

👍 I need your anime images to try it out 😁

yuanshuai1995

2018-07-16 19:21:19 +08:00

Care to share the image collection?

rrfeng

2018-07-16 19:23:19 +08:00

Since you're forcing an MD5 over every file anyway...
md5sum * | awk '{if(a[$1])print $2;a[$1]=1}' | xargs rm

swulling

2018-07-16 19:33:39 +08:00

@cdlixucd The basic idea: compute the MD5s, use awk to pick out the duplicates, then remove them.

Just google "shell remove duplicate files".

likuku

2018-07-16 19:42:33 +08:00

@swulling You're underestimating the number of files... with that many, the dict would get huge and memory would probably blow up.


https://www.v2ex.com/t/471286
