My laptop is a Dell Precision 7740 with five SSDs installed: one is the Windows 10 system drive, one is the Ubuntu 20.04 system drive (the system I use daily), and the other three are 1 TB SSDs that form a ZFS pool arranged as a mirror, holding my data. After setting it up I used it quietly for a year without ever checking on it, and when I finally looked recently I found that two of the drives had faulted.
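For context, a three-way mirror of this kind is created with something along the following lines (just a sketch; extra options such as ashift are omitted, and the by-id paths are the ones that appear in the status output below):

sudo zpool create tankmain mirror \
    /dev/disk/by-id/nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057 \
    /dev/disk/by-id/nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063 \
    /dev/disk/by-id/nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220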
sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 91.2G in 0 days 00:20:39 with 0 errors on Wed Feb 10 17:30:23 2021
config:

        NAME                                        STATE     READ WRITE CKSUM
        tankmain                                    DEGRADED     0     0     0
          mirror-0                                  DEGRADED    28     0     0
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057  DEGRADED    47     0   220  too many errors
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063  FAULTED     32     0     2  too many errors
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220  FAULTED     22     0     3  too many errors
All three SSDs were bought new, checked out fine on arrival, and have only been in use for about a year, so I never paid them any attention, and I can hardly believe that two of them have failed; their SMART data shows nothing abnormal either. I then rescued part of the data, i.e. moved it from the internal SSDs (the zpool) to an external SSD, running rsync -avcXP twice to make sure the data was copied correctly (a sketch of the commands appears after the next status listing). On the first pass some files reported checksum errors (failed verification -- update discarded), yet the second pass reported none; looking at those files, they do appear to be corrupted. Then I rebooted; this is the status after the reboot:
sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Feb 14 15:38:53 2021
        40.5G scanned at 251M/s, 11.1G issued at 68.8M/s, 755G total
        23.1G resilvered, 1.47% done, 0 days 03:04:30 to go
config:

        NAME                                        STATE     READ WRITE CKSUM
        tankmain                                    DEGRADED     0     0     0
          mirror-0                                  DEGRADED     0     0     0
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057  DEGRADED     0     0     0  too many errors
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063  ONLINE       0     0     7  (resilvering)
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220  ONLINE       0     0    11  (resilvering)
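For reference, the rescue copy mentioned above was essentially the following, run twice (the external-drive mount point here is hypothetical); because of -c, the second pass re-reads and re-checksums files that already exist on the destination:

# archive copy with checksum comparison, extended attributes and progress
rsync -avcXP /tankmain/ /mnt/external-ssd/tankmain-backup/
# second pass: same command again, as a checksum-based verification
rsync -avcXP /tankmain/ /mnt/external-ssd/tankmain-backup/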
Once the resilver had finished:
sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 83.5G in 0 days 00:10:26 with 0 errors on Sun Feb 14 15:49:19 2021
config:

        NAME                                        STATE     READ WRITE CKSUM
        tankmain                                    DEGRADED     0     0     0
          mirror-0                                  DEGRADED     0     0     0
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057  DEGRADED     2     0     9  too many errors
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063  ONLINE       0     0    15
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220  ONLINE       0     0    19

errors: No known data errors
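The actions that status message suggests, plus the scrub I ran next, would look roughly like this (a sketch; the replacement device path is a placeholder):

# mark the errors as repaired and keep using the same devices
sudo zpool clear tankmain
# or swap the degraded device for a new one
sudo zpool replace tankmain nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057 /dev/disk/by-id/<new-ssd>
# re-read and verify every block in the pool
sudo zpool scrub tankmain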
I then ran a ZFS scrub, and not long afterwards two of the SSDs faulted again:
sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Sun Feb 14 15:56:49 2021
        209G scanned at 1.76G/s, 903M issued at 7.59M/s, 755G total
        849K repaired, 0.12% done, no estimated completion time
config:

        NAME                                        STATE     READ WRITE CKSUM
        tankmain                                    DEGRADED     0     0     0
          mirror-0                                  DEGRADED     0     0     0
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057  DEGRADED     3     0     9  too many errors  (repairing)
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063  FAULTED     32     0 1.90K  too many errors  (repairing)
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220  FAULTED     64     0   419  too many errors  (repairing)
Once the repair (scrub) had finished:
sudo zpool status -v
  pool: tankmain
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 970K in 0 days 00:29:42 with 213 errors on Sun Feb 14 16:26:31 2021
config:

        NAME                                        STATE     READ WRITE CKSUM
        tankmain                                    DEGRADED     0     0     0
          mirror-0                                  DEGRADED   168     0     0
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500057  DEGRADED   327     0 2.34K  too many errors
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500063  FAULTED     32     0  690K  too many errors
            nvme-PLEXTOR_PX-1TM9PGN_+_P02952500220  FAULTED     64     0  682K  too many errors
Checking the files themselves, it looks like a small number were repaired but most were not (a sketch of how such a check can be done follows). Is there still any way to recover the damaged files?
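A sketch of the check (the backup path is hypothetical): zpool status -v prints the files with permanent errors at the end of its output, and an rsync dry run lists the files that no longer match the external copy made earlier:

# files with permanent (unrecoverable) errors are listed at the end of:
sudo zpool status -v tankmain
# checksum comparison against the earlier rsync copy, dry run only:
rsync -rcn --out-format='%n' /tankmain/ /mnt/external-ssd/tankmain-backup/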
Separately, on the possible causes of this situation:
1. Memory: my RAM is ECC and not overclocked. When it arrived I verified in several ways that it was genuine, so at the very least it is not some shoddily made knock-off, and it also ran memtest86 for several days without a single error. The only thing is that I sometimes run the memory close to full; could that have an effect? (A check for logged ECC errors is sketched after this list.)
2. The mains wiring in my home: the laptop has a battery and the power adapter is rated high enough, so it does not feel like the power supply could be falling short, but given what happened over a year ago this still seems possible. Back then I was using another laptop whose internal SSDs failed one after the other within a short period, destroying all of my primary data; I never expected I might go through the same thing again a year later. At the time my guess was that I had been too rough when opening that laptop, several times continuing to work inside it without first disconnecting the battery, damaging its power-delivery circuitry and making it prone to killing SSDs. I no longer dared to use that machine as my daily driver, so I bought this laptop and set up a ZFS mirror across three SSDs, figuring there was no way all three would fail at once. On the new laptop I have never worked inside it without disconnecting the battery first, yet here I am again. Could the wiring in my home be the problem, with two different laptops and two different power adapters both failing to filter it out, so that SSDs keep getting damaged?
3. The laptop's on-board power delivery: could it be that the motherboard's power circuitry cannot handle three SSDs reading and writing at the same time (which would mean the SSD failures on my previous laptop are unrelated to this)? One possibly related symptom: the output current on the Thunderbolt and USB ports feels very low, and external hard drives often fail to be recognized.
4. Did my defragment and compact operations on a virtual machine's vmdk files damage the file system? That seems unlikely; they are just ordinary files.
5. A bug in ZFS itself. I have no idea how to investigate this one (some starting points are sketched after this list).
6. This model or batch of SSD is defective. The drives are Plextor M9P Plus 1 TB. This is also hard to verify (see the SMART commands in the sketch after this list).
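For items 1, 5 and 6, some checks would be roughly the following (a sketch, assuming the edac-utils, smartmontools and nvme-cli packages are installed; the device names are examples):

# item 1: has ECC actually logged any corrected or uncorrected memory errors?
sudo edac-util --report=full
# item 5: ZFS userland/module version, plus NVMe or ZFS messages in the kernel log
zfs version
sudo dmesg | grep -iE 'nvme|zfs'
# item 6: SMART health and error counters of each Plextor drive
sudo smartctl -a /dev/nvme0
sudo nvme smart-log /dev/nvme0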