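While building process-safety monitoring, I made an off-the-cuff decision: if a process stays in the D state, that is TASK_UNINTERRUPTIBLE (uninterruptible sleep), for more than 8 minutes, the system panics. As it happened, the DB team's logging code buffered an entire log in memory and flushed it to disk only at the end. The system sat in the D state for a long time and, sure enough, panicked. The incident touches on how Linux writes cached data back to disk and how that behaviour can be tuned, so here is a summary.
Dirty pages are generally written back to disk in the following situations:
- the dirty page cache is using too much memory and free memory is running short;
- a page has been dirty for a long time and has reached the age limit, so it must be flushed to keep memory and disk consistent;
- an external command forces dirty pages to disk;
- a write() to disk checks the dirty state and triggers a flush.
The kernel uses pdflush threads to write dirty pages to disk. There are between 2 and 8 pdflush threads, and the current number can be read from /proc/sys/vm/nr_pdflush_threads. For the exact policy, see the function __pdflush in the kernel source.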
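To see how much dirty data is waiting at a given moment, the Dirty and Writeback lines of /proc/meminfo can be read directly. The following is a minimal sketch: the function name dirty_summary is made up for illustration, and it accepts any file in meminfo format so that it can also be tried on a saved copy.

```shell
#!/bin/sh
# dirty_summary FILE: print the Dirty and Writeback values (in kB) found in a
# meminfo-format file. The function name is illustrative, not a standard tool.
dirty_summary() {
    awk '$1 == "Dirty:"     { printf "dirty_kb=%s\n", $2 }
         $1 == "Writeback:" { printf "writeback_kb=%s\n", $2 }' "$1"
}

# On a live Linux system, read the real counters.
if [ -r /proc/meminfo ]; then
    dirty_summary /proc/meminfo
fi
```

Running it repeatedly during a heavy write shows the dirty figure climbing until writeback starts.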
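1. Forced flush from other kernel code
Start with the first and third situations. When memory runs short or an external caller forces a flush, dirty pages are written back by calling wakeup_pdflush; its callers include do_sync, free_more_memory and try_to_free_pages. wakeup_pdflush does its work through the function background_writeout: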
static void background_writeout(unsigned long _min_pages)
{
    long min_pages = _min_pages;
    struct writeback_control wbc = {
        .bdi = NULL,
        .sync_mode = WB_SYNC_NONE,
        .older_than_this = NULL,
        .nr_to_write = 0,
        .nonblocking = 1,
    };
    for ( ; ; ) {
        struct writeback_state wbs;
        long background_thresh;
        long dirty_thresh;
        get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
        if (wbs.nr_dirty + wbs.nr_unstable < background_thresh
                && min_pages <= 0)
            break;
        wbc.encountered_congestion = 0;
        wbc.nr_to_write = MAX_WRITEBACK_PAGES;
        wbc.pages_skipped = 0;
        writeback_inodes(&wbc);
        min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
        if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
            /* Wrote less than expected */
            blk_congestion_wait(WRITE, HZ/10);
            if (!wbc.encountered_congestion)
                break;
        }
    }
}
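background_writeout enters an infinite loop. It calls get_dirty_limits to obtain background_thresh, the threshold at which background flushing starts: the number of pages corresponding to dirty_background_ratio percent of memory. The ratio can be adjusted through /proc/sys/vm/dirty_background_ratio and usually defaults to 10. While the number of dirty pages is above the threshold, each pass calls writeback_inodes to write up to MAX_WRITEBACK_PAGES (1024) pages, and the loop continues until the dirty count falls below the threshold.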
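As a rough illustration of what that percentage means in pages, the conversion can be sketched in shell. This is a simplification under stated assumptions: ratio_to_pages is a made-up helper, the memory size is an example value, and the kernel's get_dirty_limits applies further adjustments that are ignored here.

```shell
#!/bin/sh
# ratio_to_pages TOTAL_PAGES RATIO_PERCENT: page count at which a percentage
# threshold is crossed. Simplified: the kernel's get_dirty_limits() applies
# further adjustments that are ignored here.
ratio_to_pages() {
    echo $(( $1 * $2 / 100 ))
}

# Example (assumed machine): 4 GiB of RAM in 4 KiB pages is 1048576 pages.
# At a dirty_background_ratio of 10 this prints 104857.
ratio_to_pages 1048576 10
```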
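2. Periodic flush driven by a kernel timer
At boot, page_writeback_init sets up the wb_timer timer. Its period is dirty_writeback_centisecs, in units of 0.01 second, adjustable through /proc/sys/vm/dirty_writeback_centisecs. The timer's handler is wb_timer_fn, and the actual work is done by wb_kupdate: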
static void wb_kupdate(unsigned long arg)
{
    /* local declarations and the wbc initializer are omitted in this excerpt */
    sync_supers();
    get_writeback_state(&wbs);
    oldest_jif = jiffies - (dirty_expire_centisecs * HZ) / 100;
    start_jif = jiffies;
    next_jif = start_jif + (dirty_writeback_centisecs * HZ) / 100;
    nr_to_write = wbs.nr_dirty + wbs.nr_unstable +
            (inodes_stat.nr_inodes - inodes_stat.nr_unused);
    while (nr_to_write > 0) {
        wbc.encountered_congestion = 0;
        wbc.nr_to_write = MAX_WRITEBACK_PAGES;
        writeback_inodes(&wbc);
        if (wbc.nr_to_write > 0) {
            if (wbc.encountered_congestion)
                blk_congestion_wait(WRITE, HZ/10);
            else
                break; /* All the old data is written */
        }
        nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
    }
    if (time_before(next_jif, jiffies + HZ))
        next_jif = jiffies + HZ;
    if (dirty_writeback_centisecs)
        mod_timer(&wb_timer, next_jif);
}
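The listing above is abridged. The kernel first writes superblock information back to the filesystems, then computes oldest_jif and passes it through wbc so that only pages that have been dirty for longer than dirty_expire_centisecs are written. That parameter can be adjusted through /proc/sys/vm/dirty_expire_centisecs.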
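The age cutoff is plain integer arithmetic, and a small worked example may make the units clearer. The helper name expire_cutoff and all three input values are assumptions chosen for illustration; only the formula comes from the listing.

```shell
#!/bin/sh
# expire_cutoff NOW_JIFFIES EXPIRE_CENTISECS HZ: mirror of the kernel expression
#   oldest_jif = jiffies - (dirty_expire_centisecs * HZ) / 100
# Pages dirtied before the returned jiffies value are old enough to be written.
expire_cutoff() {
    echo $(( $1 - ($2 * $3) / 100 ))
}

# Illustrative numbers (assumptions): jiffies=500000, HZ=1000 and
# dirty_expire_centisecs=3000, i.e. 30 seconds. Prints 470000.
expire_cutoff 500000 3000 1000
```

With HZ at 1000, 3000 centiseconds become 30000 jiffies, so anything dirtied before jiffy 470000 qualifies.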
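3. Flushing the cache on write()
A user-space write() to a file may also have to flush dirty pages. After marking the written pages dirty, generic_file_buffered_write may start writeback, depending on conditions, to keep the dirty page ratio in balance. See balance_dirty_pages_ratelimited: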
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
    static DEFINE_PER_CPU(int, ratelimits) = 0;
    long ratelimit;
    ratelimit = ratelimit_pages;
    if (dirty_exceeded)
        ratelimit = 8;
    /*
     * Check the rate limiting. Also, we do not want to throttle real-time
     * tasks in balance_dirty_pages(). Period.
     */
    if (get_cpu_var(ratelimits)++ >= ratelimit) {
        __get_cpu_var(ratelimits) = 0;
        put_cpu_var(ratelimits);
        balance_dirty_pages(mapping);
        return;
    }
    put_cpu_var(ratelimits);
}
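The counter logic in the function above can be replayed in a few lines of shell to see how often balance_dirty_pages would actually be reached. This is only a sketch: count_balance_calls is a made-up name, and the real counter is kept per CPU, which the simulation ignores. Because the comparison uses the value before the increment, the call happens once every ratelimit + 1 invocations.

```shell
#!/bin/sh
# count_balance_calls CALLS RATELIMIT: replay the counter logic
#   if (counter++ >= ratelimit) { counter = 0; balance_dirty_pages(); }
# and print how many times balance_dirty_pages() would have been reached.
count_balance_calls() {
    calls=$1 ratelimit=$2
    counter=0 hits=0 i=0
    while [ "$i" -lt "$calls" ]; do
        if [ "$counter" -ge "$ratelimit" ]; then
            counter=0                    # reset, as the kernel code does
            hits=$(( hits + 1 ))
        else
            counter=$(( counter + 1 ))   # effect of the post-increment
        fi
        i=$(( i + 1 ))
    done
    echo "$hits"
}

# With dirty_exceeded set the limit drops to 8, so 100 calls reach
# balance_dirty_pages() 11 times (once every 9 calls).
count_balance_calls 100 8
```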
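balance_dirty_pages_ratelimited uses ratelimit_pages to limit how often balance_dirty_pages is called: on each CPU it runs only about once per ratelimit_pages calls, or about once per 8 calls when dirty_exceeded is set. The flush itself happens in balance_dirty_pages: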
static void balance_dirty_pages(struct address_space *mapping)
{
    struct writeback_state wbs;
    long nr_reclaimable;
    long background_thresh;
    long dirty_thresh;
    unsigned long pages_written = 0;
    unsigned long write_chunk = sync_writeback_pages();
    struct backing_dev_info *bdi = mapping->backing_dev_info;
    for (;;) {
        struct writeback_control wbc = {
            .bdi = bdi,
            .sync_mode = WB_SYNC_NONE,
            .older_than_this = NULL,
            .nr_to_write = write_chunk,
        };
        get_dirty_limits(&wbs, &background_thresh,
                &dirty_thresh, mapping);
        nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
        if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
            break;
        if (!dirty_exceeded)
            dirty_exceeded = 1;
        /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
         * Unstable writes are a feature of certain networked
         * filesystems (i.e. NFS) in which data may have been
         * written to the server's write cache, but has not yet
         * been flushed to permanent storage.
         */
        if (nr_reclaimable) {
            writeback_inodes(&wbc);
            get_dirty_limits(&wbs, &background_thresh,
                    &dirty_thresh, mapping);
            nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
            if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
                break;
            pages_written += write_chunk - wbc.nr_to_write;
            if (pages_written >= write_chunk)
                break; /* We've done our duty */
        }
        blk_congestion_wait(WRITE, HZ/10);
    }
    if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh && dirty_exceeded)
        dirty_exceeded = 0;
    if (writeback_in_progress(bdi))
        return; /* pdflush is already working this queue */
    /*
     * In laptop mode, we wait until hitting the higher threshold before
     * starting background writeout, and then write out all the way down
     * to the lower threshold. So slow writers cause minimal disk activity.
     *
     * In normal mode, we start background writeout at the lower
     * background_thresh, to keep the amount of dirty memory low.
     */
    if ((laptop_mode && pages_written) ||
            (!laptop_mode && (nr_reclaimable > background_thresh)))
        pdflush_operation(background_writeout, 0);
}
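The function enters an infinite loop and calls get_dirty_limits to obtain the page counts that correspond to dirty_background_ratio and dirty_ratio. The check against dirty_thresh (nr_reclaimable + wbs.nr_writeback <= dirty_thresh) decides what happens next: if dirty pages exceed dirty_thresh, writeback_inodes is called to start flushing the cache to disk. If one pass does not bring the dirty ratio below dirty_ratio, blk_congestion_wait blocks the writer and the loop repeats, until the ratio drops to dirty_ratio or the writer has written its own quota of write_chunk pages. Once the ratio is below dirty_ratio but still above dirty_background_ratio, pdflush_operation is used to start background_writeout. pdflush_operation does not block: it wakes pdflush and returns immediately, and background_writeout is then run by pdflush.
In short: during a write(), if the dirty cache exceeds dirty_ratio, the write is blocked while dirty pages are flushed, until the cache falls below dirty_ratio. If the cache is above dirty_background_ratio, the write wakes the pdflush thread to flush dirty pages without blocking the write.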
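The two-threshold behaviour summarised above can be written down as a small decision function. This is a simplified sketch, not the kernel logic itself: write_path_action is a made-up name, and a single dirty count stands in for the separate dirty, unstable and writeback counters the kernel uses.

```shell
#!/bin/sh
# write_path_action DIRTY BACKGROUND_THRESH DIRTY_THRESH (all in pages):
# classify what the write path does, following the summary above.
# Simplified: the kernel compares dirty + unstable + writeback pages against
# dirty_thresh, and only dirty + unstable against background_thresh.
write_path_action() {
    if [ "$1" -gt "$3" ]; then
        echo "throttle"      # writer flushes pages itself and blocks
    elif [ "$1" -gt "$2" ]; then
        echo "background"    # pdflush is woken, writer continues
    else
        echo "none"          # below both thresholds
    fi
}

write_path_action 500 100 400   # above dirty_thresh
write_path_action 200 100 400   # between the two thresholds
write_path_action  50 100 400   # below background_thresh
```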
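4. Summary of the problem
Processes mostly end up in the D state because of the third and fourth situations: under heavy writing the cache is managed by Linux, and once dirty pages pile up past a certain level, the process gets stuck in the D state whether it keeps writing or calls fsync to flush.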
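One common way to shorten such stalls is to lower the two thresholds so that flushing starts earlier and in smaller bursts. The sketch below only prints the sysctl commands instead of running them; print_dirty_tuning is a made-up helper, and the values 5 and 10 are illustrative assumptions rather than settings taken from this incident.

```shell
#!/bin/sh
# print_dirty_tuning BACKGROUND_RATIO DIRTY_RATIO: print (not run) the sysctl
# commands that would set the two thresholds. The values used below are
# illustrative assumptions, not recommendations from the original text.
print_dirty_tuning() {
    # The background threshold only makes sense below the blocking threshold.
    if [ "$1" -ge "$2" ]; then
        echo "background ratio should be lower than dirty ratio" >&2
        return 1
    fi
    echo "sysctl -w vm.dirty_background_ratio=$1"
    echo "sysctl -w vm.dirty_ratio=$2"
}

print_dirty_tuning 5 10
```

The printed commands need root to run, and suitable values depend on memory size and disk speed, so they should be tested under a realistic write load.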
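Because the test left the system unable to boot, the data on it had to be copied out, and the obvious way was to attach the disk to another working system. But all of these systems were installed with the default layout, so a system disk normally has two partitions. Taking /dev/sda as an example, /dev/sda1 holds the boot files and /dev/sda2 is an LVM physical volume whose volume group contains three logical volumes: /dev/VolGroup/lv_root, /dev/VolGroup/lv_home and /dev/VolGroup/lv_swap. Since the volume group and logical volume names are identical on both disks, the logical volumes on the attached system disk /dev/sdb cannot be mounted.
Solution:
The following describes how a virtual machine whose own disk uses Linux LVM partitions can mount a second Linux LVM disk.
Add the disk to the virtual machine in Hyper-V, start the virtual machine, and once it has booted:
#fdisk -l //lists all disks; sda is the virtual machine's own disk and sdb is the attached disk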
[root@localhost ~]# fdisk -l
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x000c4715
Device Boot Start End Blocks Id System
/dev/sda1 * 1 64 512000 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 64 121602 976248832 8e Linux LVM
Disk /dev/mapper/VolGroup-lv_root: 53.7 GB, 53687091200 bytes
255 heads, 63 sectors/track, 6527 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/mapper/VolGroup-lv_swap: 8338 MB, 8338276352 bytes
255 heads, 63 sectors/track, 1013 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/mapper/VolGroup-lv_home: 937.6 GB, 937649242112 bytes
255 heads, 63 sectors/track, 113996 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sdb: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000ec679
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 64 512000 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sdb2 64 14594 116707328 8e Linux LVM
#vgscan //scan for all volume groups
[root@localhost ~]# vgscan
Reading all physical volumes. This may take a while...
WARNING: Duplicate VG name VolGroup: Existing FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ (created here) takes precedence over JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX
WARNING: Duplicate VG name VolGroup: Existing FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ (created here) takes precedence over JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX
Found volume group "VolGroup" using metadata type lvm2
Found volume group "VolGroup" using metadata type lvm2
#vgdisplay //show information for all volume groups
[root@localhost ~]# vgdisplay
WARNING: Duplicate VG name VolGroup: Existing FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ (created here) takes precedence over JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX
WARNING: Duplicate VG name VolGroup: Existing FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ (created here) takes precedence over JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX
WARNING: Duplicate VG name VolGroup: Existing JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX (created here) takes precedence over FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ
--- Volume group ---
VG Name VolGroup
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 3
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 111.30 GiB
PE Size 4.00 MiB
Total PE 28492
Alloc PE / Size 28492 / 111.30 GiB
Free PE / Size 0 / 0
VG UUID JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX
WARNING: Duplicate VG name VolGroup: Existing FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ (created here) takes precedence over JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX
--- Volume group ---
VG Name VolGroup
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 3
Open LV 2
Max PV 0
Cur PV 1
Act PV 1
VG Size 931.02 GiB
PE Size 4.00 MiB
Total PE 238341
Alloc PE / Size 238341 / 931.02 GiB
Free PE / Size 0 / 0
VG UUID FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ
(The two volume groups are told apart mainly by their size.)
[root@localhost ~]# vgrename JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX vg01
WARNING: Duplicate VG name VolGroup: Existing FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ (created here) takes precedence over JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX
WARNING: Duplicate VG name VolGroup: Existing FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ (created here) takes precedence over JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX
WARNING: Duplicate VG name VolGroup: Existing JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX (created here) takes precedence over FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ
Volume group "VolGroup" successfully renamed to "vg01"
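//Rename the volume group on the attached disk. Its name is the same as the virtual machine's own volume group, so its LVM volumes cannot be mounted until it is renamed. Take care to identify which volume group belongs to the attached disk: the serial-number-like string is the VG UUID, and because the names are identical, the VG UUID is the only way to select the volume group to rename.
The last line, Volume group "VolGroup" successfully renamed to "vg01", shows that the rename succeeded.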
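Picking the right UUID out of the long vgdisplay output is easy to get wrong, so a compact listing may help. The sketch below filters lines of the form "name uuid size", which is what vgs prints with the vg_name, vg_uuid and vg_size fields. The helper name list_vg_uuids and the exact formatting of the sample sizes are assumptions; the sample lines are hand-written from the vgdisplay output above.

```shell
#!/bin/sh
# list_vg_uuids NAME: read lines of "vg_name vg_uuid vg_size" on stdin and
# print the UUID and size of every volume group called NAME.
list_vg_uuids() {
    awk -v name="$1" '$1 == name { print $2, $3 }'
}

# Intended use on the rescue system (requires root and the LVM tools):
#   vgs --noheadings -o vg_name,vg_uuid,vg_size | list_vg_uuids VolGroup
# Hand-written sample input built from the vgdisplay output above:
printf '%s\n' \
    '  VolGroup JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX 111.30g' \
    '  VolGroup FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ 931.02g' |
    list_vg_uuids VolGroup
```

The smaller of the two entries matches the 120 GB attached disk, so its UUID is the one to pass to vgrename.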
#vgdisplay //one volume group is now named vg01
[root@localhost ~]# vgdisplay
--- Volume group ---
VG Name vg01
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 5
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 3
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 111.30 GiB
PE Size 4.00 MiB
Total PE 28492
Alloc PE / Size 28492 / 111.30 GiB
Free PE / Size 0 / 0
VG UUID JTFVF9-ULu5-cHKu-T1p3-4HnB-Tk2p-BpjwHX
--- Volume group ---
VG Name VolGroup
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 3
Open LV 2
Max PV 0
Cur PV 1
Act PV 1
VG Size 931.02 GiB
PE Size 4.00 MiB
Total PE 238341
Alloc PE / Size 238341 / 931.02 GiB
Free PE / Size 0 / 0
VG UUID FKsJuO-7348-hHqs-MKVq-WLMl-2Sbh-0oh8NZ
#lvscan //the first three entries belong to the attached disk and are not active
[root@localhost ~]# lvscan
inactive '/dev/vg01/lv_root' [50.00 GiB] inherit
inactive '/dev/vg01/lv_home' [53.45 GiB] inherit
inactive '/dev/vg01/lv_swap' [7.85 GiB] inherit
ACTIVE '/dev/VolGroup/lv_root' [50.00 GiB] inherit
ACTIVE '/dev/VolGroup/lv_home' [873.25 GiB] inherit
ACTIVE '/dev/VolGroup/lv_swap' [7.77 GiB] inherit
The ACTIVE entries at the bottom are the system disk currently in use; the entries at the top are the system disk whose data needs to be copied out.
#vgchange -ay /dev/vg01 //activate this volume group
[root@localhost ~]# vgchange -ay /dev/vg01
3 logical volume(s) in volume group "vg01" now active
#lvscan //check whether the vg01 volume group is now active
[root@localhost ~]# lvscan
ACTIVE '/dev/vg01/lv_root' [50.00 GiB] inherit
ACTIVE '/dev/vg01/lv_home' [53.45 GiB] inherit
ACTIVE '/dev/vg01/lv_swap' [7.85 GiB] inherit
ACTIVE '/dev/VolGroup/lv_root' [50.00 GiB] inherit
ACTIVE '/dev/VolGroup/lv_home' [873.25 GiB] inherit
ACTIVE '/dev/VolGroup/lv_swap' [7.77 GiB] inherit
#mkdir /mnt/hdb //create a directory to use as the mount point
#mount /dev/vg01/<logical volume name> /mnt/hdb //mount a logical volume of vg01 (LogVol00 on some installs, lv_root or lv_home in the example above)
Unmount it when you are done:
#umount /mnt/hdb
#vgchange -an /dev/vg01 //deactivate the LVM volume group
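Since the goal here is only to copy data off the disk, the whole session can be sketched as one sequence with a read-only mount. Mounting with -o ro is an addition to the steps above, not part of the original procedure, and rescue_plan is a made-up helper that only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# rescue_plan VG LV MOUNTPOINT: print (not run) the commands for a read-only
# copy session. Mounting with -o ro is an addition to the steps above, chosen
# because the goal is only to copy data off the disk.
rescue_plan() {
    echo "vgchange -ay $1"
    echo "mkdir -p $3"
    echo "mount -o ro /dev/$1/$2 $3"
    echo "# copy the needed files out of $3 here"
    echo "umount $3"
    echo "vgchange -an $1"
}

rescue_plan vg01 lv_home /mnt/hdb
```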
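A disk whose volume group has been renamed can no longer boot, because the default boot configuration points at the VolGroup00 volume group (VolGroup in the example above). For the disk to boot again, the volume group name must be changed back to the default.
This cannot be done on a virtual machine that already has a volume group with that name. Instead, set up a separate virtual machine whose own partitions are not Linux LVM but plain ext3 (shown as Linux by fdisk), and attach the disk there:
#fdisk -l //check that the attached disk is recognized
#vgscan //scan for volume groups
#lvscan //check whether the volume group to be renamed is active; an active volume group cannot be renamed, and trying to do so reports an error
#vgchange -an /dev/VolGroup00 //if the previous step showed it as active, use this command to deactivate it
#lvscan //check that it is now inactive
#vgrename vg01 VolGroup00 //once it is inactive, the volume group can be renamed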
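To mount a Linux LVM disk directly on a virtual machine whose own partitions are ext3:
#fdisk -l //check that the attached disk is recognized
#vgscan //scan for volume groups
#vgdisplay //show all volume groups
#lvscan //check whether the attached disk's volume group is active
#vgchange -ay /dev/VolGroup00 //if the previous step showed it as inactive, run this command
#mkdir /mnt/hdb //create the mount point directory
#mount /dev/VolGroup00/LogVol00 /mnt/hdb //mount it; /mnt/hdb is then accessible
#umount /mnt/hdb //unmount the disk
#vgchange -an /dev/VolGroup00 //deactivate the attached disk's volume group
Addendum: to have an LVM volume mounted automatically at boot:
vim /etc/fstab
and add: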
/dev/vg110/LogVol01 /wwwroot/ ext4 defaults 0 0
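A malformed fstab entry can stop a system from booting cleanly, so a quick shape check before saving may be worthwhile. The sketch below only verifies that a line has the six-field form used above; check_fstab_line is a made-up helper, and it does not check that the device or mount point exists.

```shell
#!/bin/sh
# check_fstab_line LINE: succeed only if LINE has the six whitespace-separated
# fields of the form used above (device, mount point, type, options, dump, pass).
check_fstab_line() {
    echo "$1" | awk 'NF == 6 { ok = 1 } END { exit !ok }'
}

check_fstab_line '/dev/vg110/LogVol01 /wwwroot/ ext4 defaults 0 0' && echo "looks well-formed"
```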