The VM host lost power; after recovering the VM from its on-disk files, the Ceph cluster inside it came back in an error state:
➜ ~ ceph -s
  cluster:
    id:     cf09d650-4629-4509-81d1-3e7005ca3595
    health: HEALTH_ERR
            2 scrub errors
            Possible data damage: 3 pgs inconsistent
            2 slow requests are blocked > 32 sec. Implicated osds 10,15

  services:
    mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03
    mgr: ceph02(active), standbys: ceph01, ceph03, controller
    mds: cephfs-1/1/1 up {0=ceph03=up:active}, 3 up:standby
    osd: 16 osds: 16 up, 16 in

  data:
    pools:   8 pools, 1280 pgs
    objects: 72 objects, 342MiB
    usage:   18.2GiB used, 15.1TiB / 15.1TiB avail
    pgs:     1276 active+clean
             3    active+clean+inconsistent
             1    active+clean+scrubbing
➜ ~ ceph tell osd.\* injectargs '--osd_op_thread_timeout 30'
osd.0: osd_op_thread_timeout = '30' (not observed, change may require restart)
osd.1: osd_op_thread_timeout = '30' (not observed, change may require restart)
osd.2: osd_op_thread_timeout = '30' (not observed, change may require restart)
......
➜ ~ ceph tell osd.\* injectargs '--osd_op_thread_suicide_timeout 300'
osd.0: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)
osd.1: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)
osd.2: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)
......
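The `(not observed, change may require restart)` note means these OSD daemons cannot apply the new value at runtime, so the injected timeouts may not take effect until the OSDs restart. To make them survive a restart they would also need to land in the config; a sketch of the corresponding ceph.conf fragment (values mirror the injectargs commands above):

```
[osd]
osd_op_thread_timeout = 30
osd_op_thread_suicide_timeout = 300
```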
Marking the OSDs down and back in with ceph osd down / ceph osd in had no effect either.
➜ ~ ceph health detail
HEALTH_ERR nodown flag(s) set; 2 scrub errors; Possible data damage: 2 pgs inconsistent; 2 slow requests are blocked > 32 sec. Implicated osds 10,15
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
    pg 10.f is active+clean+inconsistent, acting [11,9,10]
    pg 10.1d is active+clean+inconsistent, acting [11,9,10]
REQUEST_SLOW 2 slow requests are blocked > 32 sec. Implicated osds 10,15
    2 ops are blocked > 262.144 sec
    osds 10,15 have blocked requests > 262.144 sec
➜ ~ ceph pg repair 10.f
instructing pg 10.f on osd.11 to repair
➜ ~ ceph pg repair 10.1d
instructing pg 10.1d on osd.11 to repair
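With only two inconsistent pgs, repairing them by hand is fine; with many, the repair commands can be derived from a saved `ceph health detail` capture. A minimal sketch, using the two sample lines from the output above (on a live cluster you would pipe `ceph health detail` in directly, and pipe the result to `sh` instead of printing it):

```shell
# Two lines captured from the `ceph health detail` output above.
detail='pg 10.f is active+clean+inconsistent, acting [11,9,10]
pg 10.1d is active+clean+inconsistent, acting [11,9,10]'

# Field 2 of each "pg <id> is ...inconsistent..." line is the pg id;
# emit a repair command per inconsistent pg.
echo "$detail" | awk '/inconsistent/ {print "ceph pg repair " $2}'
# prints:
#   ceph pg repair 10.f
#   ceph pg repair 10.1d
```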
No change at all. Check which OSD holds the faulty pgs:
➜ ~ ceph pg 10.f query | grep primary
"same_primary_since": 763,
"num_objects_missing_on_primary": 0,
"up_primary": 11,
"acting_primary": 11
"same_primary_since": 763,
"num_objects_missing_on_primary": 0,
"up_primary": 11,
"acting_primary": 11
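The duplicated grep hits come from `pg query` printing the same stats under more than one section of its JSON, so each key shows up twice. Extracting the primary programmatically avoids the noise; a sketch over a trimmed fragment (the key names are real `pg query` fields, the fragment itself is cut down from the full output — `jq` would be cleaner if available):

```shell
# Trimmed fragment of `ceph pg 10.f query` output.
query='{"info": {"stats": {"up_primary": 11, "acting_primary": 11}}}'

# Grab the first acting_primary value: match the key with its number,
# then strip everything but the trailing digits.
primary=$(printf '%s' "$query" | grep -o '"acting_primary": [0-9]*' | head -n1 | grep -o '[0-9]*$')
echo "primary is osd.$primary"
```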
Stop that OSD (osd.11, the primary for both pgs), then repair again:
➜ ~ ceph pg repair 10.f
instructing pg 10.f on osd.11 to repair
➜ ~ ceph pg repair 10.1d
instructing pg 10.1d on osd.11 to repair
➜ ~ ceph health detail
HEALTH_WARN nodown flag(s) set; 1 osds down; 1 host (1 osds) down; Reduced data availability: 374 pgs inactive, 374 pgs peering; Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degraded
OSDMAP_FLAGS nodown flag(s) set
OSD_DOWN 1 osds down
    osd.11 (root=root-ssd,host=ceph03-ssd) is down
OSD_HOST_DOWN 1 host (1 osds) down
    host ceph03-ssd (root=root-ssd) (1 osds) is down
PG_AVAILABILITY Reduced data availability: 374 pgs inactive, 374 pgs peering
    pg 2.5c is stuck peering for 2391.685850, current state peering, last acting [10,15]
    pg 2.5e is stuck peering for 2535.599036, current state stale+peering, last acting [11,9]
    pg 2.5f is stuck inactive for 2387.032031, current state peering, last acting [10,15]
    ......
PG_DEGRADED Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degraded
    pg 10.7 is active+undersized+degraded, acting [10,9]
    pg 10.b is active+undersized+degraded, acting [10,9]
    pg 10.f is active+undersized+degraded, acting [9,10]
    pg 10.13 is active+undersized+degraded, acting [10,9]
    pg 10.15 is active+undersized+degraded, acting [9,10]
    pg 10.1d is active+undersized+degraded, acting [9,10]
➜ ~ ceph pg 10.f query | grep primary
"same_primary_since": 829,
"num_objects_missing_on_primary": 0,
"up_primary": 9,
"acting_primary": 9
"same_primary_since": 829,
"num_objects_missing_on_primary": 0,
"up_primary": 9,
"acting_primary": 9
➜ ~ ceph pg 10.1d query | grep primary
"same_primary_since": 829,
"num_objects_missing_on_primary": 0,
"up_primary": 9,
"acting_primary": 9
"same_primary_since": 829,
"num_objects_missing_on_primary": 0,
"up_primary": 9,
"acting_primary": 9
Restart osd.11 and wait a while:
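The restart itself is not in the capture. Assuming a systemd-managed deployment, it would look roughly like this (ceph-osd@11 is the usual unit-name pattern, hypothetical here; run it on the host carrying osd.11, ceph03 in this cluster):

```shell
systemctl restart ceph-osd@11   # bring the stopped OSD daemon back up
ceph -s                         # then watch peering/recovery settle
```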
➜ ~ ceph -s
  cluster:
    id:     cf09d650-4629-4509-81d1-3e7005ca3595
    health: HEALTH_WARN
            Reduced data availability: 374 pgs inactive, 374 pgs peering, 55 pgs stale

  services:
    mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03
    mgr: ceph02(active), standbys: ceph01, ceph03, controller
    mds: cephfs-1/1/1 up {0=ceph03=up:active}, 3 up:standby
    osd: 16 osds: 16 up, 16 in

  data:
    pools:   8 pools, 1280 pgs
    objects: 72 objects, 342MiB
    usage:   16.9GiB used, 14.3TiB / 14.3TiB avail
    pgs:     29.219% pgs not active
             906 active+clean
             319 peering
             55  stale+peering
➜ ~ ceph health detail
HEALTH_WARN 15/216 objects misplaced (6.944%); 1 slow requests are blocked > 32 sec. Implicated osds 11
OBJECT_MISPLACED 15/216 objects misplaced (6.944%)
REQUEST_SLOW 1 slow requests are blocked > 32 sec. Implicated osds 11
    1 ops are blocked > 32.768 sec
    osd.11 has blocked requests > 32.768 sec
➜ ~ ceph health detail
HEALTH_WARN 5/216 objects misplaced (2.315%)
OBJECT_MISPLACED 5/216 objects misplaced (2.315%)
➜ ~ ceph health detail
HEALTH_OK
Solved: once osd.11 came back, the pgs finished peering, the remaining misplaced objects drained, and health returned to HEALTH_OK.