[root@ceph-6-11 ~]# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.37c is active+clean+inconsistent, acting [75,6,35]
1 scrub errors
Summary of the error:
Problem PG: 2.37c
Acting OSDs: 75, 6, 35
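Before issuing a repair, it can help to see exactly which object, and which copy of it, failed the scrub. This step is not in the original post and assumes a Jewel-or-later cluster where the rados inconsistency listing exists:

# List the objects in the PG that failed scrub, with per-shard error details
rados list-inconsistent-obj 2.37c --format=json-pretty

In releases of this era, pg repair tends to favor the primary's copy, so knowing whether the bad copy sits on the primary or on a replica hints at whether a plain repair will succeed, which is relevant to how this incident eventually gets resolved.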
Run the standard repair first:
ceph pg repair 2.37c
At this point individual OSDs may restart and the PG gets remapped; after a short wait the cluster usually returns to OK.
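To watch the repair rather than re-running health checks, one option (not shown in the original) is to stream the cluster log or query the PG state directly:

# Stream cluster log messages and filter for this PG; repair start/end shows up here
ceph -w | grep 2.37c
# Or inspect the PG's current state field
ceph pg 2.37c query | grep '"state"'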
If the cluster is still unhealthy after the repair:
[root@ceph-6-11 ~]# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.37c is active+clean+inconsistent, acting [75,6,35]
1 scrub errors
The problem persists; the inconsistent PG has not been repaired.
So scrub the PG, then deep-scrub it, and run the repair again (scrub and deep-scrub are asynchronous, so give each time to finish before judging the result):
ceph pg scrub 2.37c
ceph pg deep-scrub 2.37c
ceph pg repair 2.37c
None of the commands above fixed the PG; the same error kept being reported. The relevant OSD log showed:
2017-07-24 17:31:10.585305 7f72893c4700 0 log_channel(cluster) log [INF] : 2.37c repair starts
2017-07-24 17:31:10.710517 7f72893c4700 -1 log_channel(cluster) log [ERR] : 2.37c repair 1 errors, 0 fixed
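These entries come from the primary OSD's log; "1 errors, 0 fixed" means the repair saw the inconsistency but could not resolve it. To find such lines yourself, grep the log on the node hosting that OSD. The path below is the stock default for a cluster named "ceph"; adjust it to your deployment:

# Run on the node hosting osd.75
grep '2.37c' /var/log/ceph/ceph-osd.75.log | tail -n 20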
Next, try repairing each of the three OSDs backing the PG (ceph osd repair tells an OSD to repair every PG it holds):
ceph osd repair 75
ceph osd repair 6
ceph osd repair 35
As a last resort, take the bluntest approach: stop the primary OSD (osd.75) that serves the problem PG.
First look up which OSD is the PG's primary:
ceph pg 2.37c query |grep primary
"blocked_by": [],
"up_primary": 75,
"acting_primary": 75
Then stop that OSD:
systemctl stop ceph-osd@75
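One detail the original glosses over: with the daemon merely stopped, recovery onto other OSDs only begins once the cluster marks the OSD out, which by default happens mon_osd_down_out_interval (600 s) after it goes down. To skip that wait and watch progress, something like:

# Mark the stopped OSD out immediately instead of waiting for the timeout
ceph osd out 75
# Watch recovery/backfill until the degraded/misplaced counts reach zero
watch -n 5 ceph -s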
Ceph now begins recovery, rebuilding the data that lived on osd.75 onto other nodes. After waiting a while for the recovery traffic to finish, check the cluster state:
[root@ceph-6-11 ~]# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.37c is active+clean+inconsistent, acting [8,38,17]
1 scrub errors
[root@ceph-6-11 ~]# ceph pg repair 2.37c
instructing pg 2.37c on osd.8 to repair
Then check the cluster status again:
[root@ceph-6-11 ~]# ceph health detail
HEALTH_OK
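One loose end the post leaves open: osd.75 is still stopped. If the underlying disk checks out as healthy, the usual next step would be to start the OSD again and let it backfill; if the disk is suspect, replace it instead. A minimal sketch of the first case:

# Bring osd.75 back once the cluster is healthy and the disk is trusted
systemctl start ceph-osd@75
ceph osd in 75                 # only needed if it was marked out earlier
ceph osd tree | grep osd.75    # confirm it is up and in again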
Original article (Chinese): http://blog.51cto.com/wujingfeng/2083359