V2EX › Cloud Computing

ceph pg inconsistent 与 Implicated osds

    firejoke · 2018-12-12 11:43:52 +08:00 · 3937 views
    This topic was created 2205 days ago; the information in it may be out of date.

    After the host lost power, the virtual machines were recovered from their on-disk files, and the Ceph cluster running inside the VMs came back faulty.

    Initial state of the fault:

    ➜  ~ ceph -s
      cluster:
        id:     cf09d650-4629-4509-81d1-3e7005ca3595
        health: HEALTH_ERR
                2 scrub errors
                Possible data damage: 3 pgs inconsistent
                2 slow requests are blocked > 32 sec. Implicated osds 10,15
    
      services:
        mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03
        mgr: ceph02(active), standbys: ceph01, ceph03, controller
        mds: cephfs-1/1/1 up  {0=ceph03=up:active}, 3 up:standby
        osd: 16 osds: 16 up, 16 in
    
      data:
        pools:   8 pools, 1280 pgs
        objects: 72 objects, 342MiB
        usage:   18.2GiB used, 15.1TiB / 15.1TiB avail
        pgs:     1276 active+clean
                 3    active+clean+inconsistent
                 1    active+clean+scrubbing
    

    Raise the timeouts for the slow requests:

    ➜  ~ ceph tell osd.\* injectargs '--osd_op_thread_timeout 30'
    osd.0: osd_op_thread_timeout = '30' (not observed, change may require restart)
    osd.1: osd_op_thread_timeout = '30' (not observed, change may require restart)
    osd.2: osd_op_thread_timeout = '30' (not observed, change may require restart)
    ......
    ➜  ~ ceph tell osd.\* injectargs '--osd_op_thread_suicide_timeout 300'
    osd.0: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)
    osd.1: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)
    osd.2: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)
    ......
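The "(not observed, change may require restart)" replies mean these two options cannot take effect at runtime; to make the raised values survive an OSD restart, they would also need to go into the [osd] section of ceph.conf. A sketch (file location and deployment tooling vary):

```ini
# /etc/ceph/ceph.conf -- persist the raised timeouts across OSD restarts
[osd]
osd_op_thread_timeout = 30
osd_op_thread_suicide_timeout = 300
```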
    

    Tried ceph osd down and ceph osd in; still no effect:

    ➜  ~ ceph health detail
    HEALTH_ERR nodown flag(s) set; 2 scrub errors; Possible data damage: 2 pgs inconsistent; 2 slow requests are blocked > 32 sec. Implicated osds 10,15
    OSD_SCRUB_ERRORS 2 scrub errors
    PG_DAMAGED Possible data damage: 2 pgs inconsistent
        pg 10.f is active+clean+inconsistent, acting [11,9,10]
        pg 10.1d is active+clean+inconsistent, acting [11,9,10]
    REQUEST_SLOW 2 slow requests are blocked > 32 sec. Implicated osds 10,15
        2 ops are blocked > 262.144 sec
        osds 10,15 have blocked requests > 262.144 sec
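The inconsistent PG ids can be pulled straight out of this health detail output with a short pipeline. The here-doc below inlines two captured lines for illustration; in practice you would pipe the live `ceph health detail` output instead:

```shell
# Extract the inconsistent PG ids from captured `ceph health detail` output.
pgs=$(awk '/active\+clean\+inconsistent/ {print $2}' <<'EOF'
    pg 10.f is active+clean+inconsistent, acting [11,9,10]
    pg 10.1d is active+clean+inconsistent, acting [11,9,10]
EOF
)
echo "$pgs"
# Each listed PG can then be inspected in more detail, e.g.:
#   rados list-inconsistent-obj 10.f --format=json-pretty
```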
    

    Repair the PGs:

    ➜  ~ ceph pg repair 10.f
    instructing pg 10.f on osd.11 to repair
    ➜  ~ ceph pg repair 10.1d
    instructing pg 10.1d on osd.11 to repair
    

    No change at all. Check which OSD is the primary for the faulty PGs:

    ➜  ~ ceph pg 10.f query | grep primary
                "same_primary_since": 763,
                    "num_objects_missing_on_primary": 0,
                "up_primary": 11,
                "acting_primary": 11
                    "same_primary_since": 763,
                        "num_objects_missing_on_primary": 0,
                    "up_primary": 11,
                    "acting_primary": 11
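Both repair commands were dispatched to osd.11 because it is the acting primary for these PGs; `ceph pg repair` always instructs the primary. The primary id can be extracted from a captured query line like so (the systemd unit name in the comment is an assumption; the post does not show how the OSD was actually stopped):

```shell
# Extract the acting primary from a captured `ceph pg query` line.
line='            "acting_primary": 11'
primary=$(printf '%s\n' "$line" | awk -F': ' '{print $2}')
echo "osd.${primary}"
# On a systemd-managed node, that daemon could then be stopped with
# (an assumption -- the post does not show the exact shutdown command):
#   systemctl stop ceph-osd@11
```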
    

    Shut this OSD down, then repair again:

    ➜  ~ ceph pg repair 10.f
    instructing pg 10.f on osd.11 to repair
    ➜  ~ ceph pg repair 10.1d
    instructing pg 10.1d on osd.11 to repair
    ➜  ~ ceph health detail
    HEALTH_WARN nodown flag(s) set; 1 osds down; 1 host (1 osds) down; Reduced data availability: 374 pgs inactive, 374 pgs peering; Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degraded
    OSDMAP_FLAGS nodown flag(s) set
    OSD_DOWN 1 osds down
        osd.11 (root=root-ssd,host=ceph03-ssd) is down
    OSD_HOST_DOWN 1 host (1 osds) down
        host ceph03-ssd (root=root-ssd) (1 osds) is down
    PG_AVAILABILITY Reduced data availability: 374 pgs inactive, 374 pgs peering
        pg 2.5c is stuck peering for 2391.685850, current state peering, last acting [10,15]
        pg 2.5e is stuck peering for 2535.599036, current state stale+peering, last acting [11,9]
        pg 2.5f is stuck inactive for 2387.032031, current state peering, last acting [10,15]
       ......
    PG_DEGRADED Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degraded
        pg 10.7 is active+undersized+degraded, acting [10,9]
        pg 10.b is active+undersized+degraded, acting [10,9]
        pg 10.f is active+undersized+degraded, acting [9,10]
        pg 10.13 is active+undersized+degraded, acting [10,9]
        pg 10.15 is active+undersized+degraded, acting [9,10]
        pg 10.1d is active+undersized+degraded, acting [9,10]
    ➜  ~ ceph pg 10.f query | grep primary
                "same_primary_since": 829,
                    "num_objects_missing_on_primary": 0,
                "up_primary": 9,
                "acting_primary": 9
                    "same_primary_since": 829,
                        "num_objects_missing_on_primary": 0,
                    "up_primary": 9,
                    "acting_primary": 9
    ➜  ~ ceph pg 10.1d query | grep primary
                "same_primary_since": 829,
                    "num_objects_missing_on_primary": 0,
                "up_primary": 9,
                "acting_primary": 9
                    "same_primary_since": 829,
                        "num_objects_missing_on_primary": 0,
                    "up_primary": 9,
                    "acting_primary": 9
    

    Restart osd.11 and wait a while:

    ➜  ~ ceph -s
      cluster:
        id:     cf09d650-4629-4509-81d1-3e7005ca3595
        health: HEALTH_WARN
                Reduced data availability: 374 pgs inactive, 374 pgs peering, 55 pgs stale
    
      services:
        mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03
        mgr: ceph02(active), standbys: ceph01, ceph03, controller
        mds: cephfs-1/1/1 up  {0=ceph03=up:active}, 3 up:standby
        osd: 16 osds: 16 up, 16 in
    
      data:
        pools:   8 pools, 1280 pgs
        objects: 72 objects, 342MiB
        usage:   16.9GiB used, 14.3TiB / 14.3TiB avail
        pgs:     29.219% pgs not active
                 906 active+clean
                 319 peering
                 55  stale+peering
    ➜  ~ ceph health detail
    HEALTH_WARN 15/216 objects misplaced (6.944%); 1 slow requests are blocked > 32 sec. Implicated osds 11
    OBJECT_MISPLACED 15/216 objects misplaced (6.944%)
    REQUEST_SLOW 1 slow requests are blocked > 32 sec. Implicated osds 11
        1 ops are blocked > 32.768 sec
        osd.11 has blocked requests > 32.768 sec
    ➜  ~ ceph health detail
    HEALTH_WARN 5/216 objects misplaced (2.315%)
    OBJECT_MISPLACED 5/216 objects misplaced (2.315%)
    ➜  ~ ceph health detail
    HEALTH_OK
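The last few health checks above were manual polling; a small loop can wait for the cluster to settle back to HEALTH_OK instead. The health command is passed in as a parameter purely so the sketch can be exercised without a live cluster; in real use it would simply be "ceph health":

```shell
# Poll a health command until it reports HEALTH_OK, up to a fixed number
# of attempts. Returns 0 on success, 1 if the cluster never settles.
wait_health_ok() {
  cmd=$1
  attempts=${2:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$($cmd | awk '{print $1; exit}')
    if [ "$status" = "HEALTH_OK" ]; then
      echo "cluster healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "still unhealthy: $status"
  return 1
}

# Demo with a stub command; against a real cluster: wait_health_ok "ceph health"
wait_health_ok "echo HEALTH_OK"
```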
    

    Resolved.

    No replies yet