Kernel Preemption:
The main characteristic of a preemptive kernel is that a process running in Kernel Mode can be replaced by another process while in the middle of a kernel function.
The main motivation for making a kernel preemptive is to reduce the dispatch latency of the User Mode processes, that is, the delay between the time they become runnable and the time they actually begin running.
Preemption may take place on the return path from an interrupt or exception. When "Returning from Interrupts and Exceptions", kernel preemption is disabled whenever the preempt_count field in the thread_info descriptor referenced by the current_thread_info() macro is greater than zero.
It is greater than zero when any of the following cases occurs:
1. The kernel is executing an interrupt service routine.
2. The deferrable functions are disabled (always true when the kernel is executing a softirq or tasklet).
3. The kernel preemption has been explicitly disabled by setting the preemption counter to a positive value.
Thus, the kernel can be preempted only when it is executing an exception handler (in particular a system call) and kernel preemption has not been explicitly disabled.
Furthermore, when "Returning from Interrupts and Exceptions", the local CPU must have local interrupts enabled, otherwise kernel preemption is not performed.
The preempt_enable() macro decreases the preemption counter, then checks whether the TIF_NEED_RESCHED flag is set.
In this case, a process switch request is pending, so the macro invokes the preempt_schedule() function; its counterpart used on the interrupt-return path, preempt_schedule_irq(), essentially executes the following code:
/*
 * this is the entry point to schedule() from kernel preemption
 * off of irq context.
 * Note, that this is called and return with irqs disabled. This will
 * protect us against recursive calling from irq.
 */
asmlinkage void __sched preempt_schedule_irq(void)
{
        struct thread_info *ti = current_thread_info();

        /* Catch callers which need to be fixed */
        BUG_ON(ti->preempt_count || !irqs_disabled());

        do {
                add_preempt_count(PREEMPT_ACTIVE);
                local_irq_enable();
                __schedule();
                local_irq_disable();
                sub_preempt_count(PREEMPT_ACTIVE);

                /*
                 * Check again in case we missed a preemption opportunity
                 * between schedule and now.
                 */
                barrier();
        } while (need_resched());
}
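To illustrate how the preemption counter is used in practice, here is a minimal sketch (a hypothetical helper, not taken from the kernel sources) that disables preemption around a short section of CPU-local work; the preempt_enable() at the end may trigger the rescheduling path shown above:

#include <linux/kernel.h>
#include <linux/preempt.h>
#include <linux/smp.h>

/* Hypothetical helper: keep the task on this CPU for a short section. */
static void touch_local_state(void)
{
        preempt_disable();      /* increments preempt_count                */
        /* work that must not migrate to another CPU, for example: */
        pr_info("running on CPU %d\n", smp_processor_id());
        preempt_enable();       /* decrements preempt_count and, if
                                 * TIF_NEED_RESCHED is set, invokes
                                 * preempt_schedule()                      */
}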
Various types of synchronization techniques used by the kernel
Per-CPU Variables
A per-CPU variable is essentially an array of data structures, with one element per CPU in the system. Each CPU works on its own element and is not supposed to access the elements belonging to other CPUs.
While per-CPU variables provide protection against concurrent accesses from several CPUs, they do not provide protection against accesses from asynchronous functions
(interrupt handlers and deferrable functions). In these cases, additional synchronization primitives are required.
As a general rule, a kernel control path should access a per-CPU variable with kernel preemption disabled. Just consider, for instance, what would happen if a kernel control
path gets the address of its local copy of a per-CPU variable, and then it is preempted and moved to another CPU: the address still refers to the element of the previous CPU.
The per-CPU areas are initialized in the start_kernel() function:
start_kernel() --> setup_per_cpu_areas();
While the physical space for each CPU's copy of the per-CPU variables is being allocated, the __per_cpu_offset[NR_CPUS] array is also initialized, so that the copies belonging to a given CPU can be located later.
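Conceptually (a simplified sketch with a hypothetical macro name, not the kernel's actual per_cpu()/per_cpu_ptr() definitions), locating CPU n's copy means adding __per_cpu_offset[n] to the address the variable has in the original per-CPU section:

/* Sketch only: the real kernel goes through SHIFT_PERCPU_PTR()/RELOC_HIDE(),
 * but the idea is a per-CPU offset added to the section address. */
extern unsigned long __per_cpu_offset[NR_CPUS];

#define my_per_cpu_ptr(var, cpu)                                        \
        ((typeof(var) *)((unsigned long)&(var) + __per_cpu_offset[cpu]))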
Macros and functions for per-CPU variables
Statically declared and defined per-CPU variables cannot be accessed like ordinary variables; dedicated accessor interfaces must be used, such as:
get_cpu_var(var)
put_cpu_var(var)
/*
 * Must be an lvalue. Since @var must be a simple identifier,
 * we force a syntax error here if it isn't.
 */
#define get_cpu_var(var) (*({                           \
        preempt_disable();                              \
        &__get_cpu_var(var); }))

/*
 * The weird & is necessary because sparse considers (void)(var) to be
 * a direct dereference of percpu variable (var).
 */
#define put_cpu_var(var) do {                           \
        (void)&(var);                                   \
        preempt_enable();                               \
} while (0)
#define DEFINE_PER_CPU_SECTION(type, name, sec)                         \
        __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES __typeof__(type) name

/*
 * Variant on the per-CPU variable declaration/definition theme used for
 * ordinary per-CPU variables.
 */
#define DECLARE_PER_CPU(type, name)                                     \
        DECLARE_PER_CPU_SECTION(type, name, "")

Here the arch-specific per-CPU code (arch/arm/include/asm/percpu.h) may define PER_CPU_DEF_ATTRIBUTES to control the attributes of the per-CPU variable; if the arch-specific code does not define it, the generic arch-independent code (include/asm-generic/percpu.h) defines it as empty. This also hints at the software layering of per-CPU variables.
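A usage sketch (hypothetical per-CPU counter, assuming module or built-in kernel code): the variable is defined statically with DEFINE_PER_CPU() and accessed through get_cpu_var()/put_cpu_var(), which bracket the access with preempt_disable()/preempt_enable() as shown above:

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, my_event_count); /* one copy per CPU */

static void count_event(void)
{
        /* get_cpu_var() disables preemption and yields this CPU's copy
         * as an lvalue; put_cpu_var() re-enables preemption. */
        get_cpu_var(my_event_count)++;
        put_cpu_var(my_event_count);
}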
Atomic Operations
On the 80x86:
An atomic operation works by locking the affected memory address in the CPU's cache: the CPU acquires the
memory address exclusively in its cache and does not permit any other CPU to acquire or share that address until the operation completes.
When the control unit detects an instruction prefixed by the lock byte (0xf0), it "locks" the memory bus until the instruction
is finished.
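For illustration (a hypothetical reference counter, assuming a reasonably recent kernel where the header is <linux/atomic.h>), each of these operations compiles down to a single lock-prefixed instruction on 80x86:

#include <linux/atomic.h>
#include <linux/kernel.h>

static atomic_t refcnt = ATOMIC_INIT(0);

static void get_ref(void)
{
        atomic_inc(&refcnt);                    /* lock incl on 80x86      */
}

static void put_ref(void)
{
        if (atomic_dec_and_test(&refcnt))       /* lock decl; test result  */
                pr_info("last reference dropped\n");
}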
Optimization and Memory Barriers
When using optimizing compilers, we may never take for granted that instructions will be performed in the exact order in which they appear
in the source code.
An optimization barrier primitive ensures that the assembly language instructions corresponding to C
statements placed before the primitive are not mixed by the compiler with assembly language instructions corresponding to C statements placed after the primitive. In Linux the barrier() macro, which expands into asm volatile("":::"memory"), acts as an optimization
barrier.
A memory barrier primitive ensures that the operations placed before the primitive
are finished before starting the operations placed after the primitive; in other words, instructions before the barrier are guaranteed to complete before the instructions after it.
In the 80x86 processors, the following kinds of assembly language instructions are
said to be "serializing" because they act as memory barriers:
- All instructions that operate on I/O ports
- All instructions prefixed by the lock byte (see the section "Atomic Operations")
- All instructions that write into control registers, system registers, or debug registers (for instance, cli and sti, which change the status of the IF flag in the eflags register)
- The lfence, sfence, and mfence assembly language instructions, which were introduced with the Pentium 4 microprocessor to efficiently implement read memory barriers, write memory barriers, and read-write memory barriers, respectively
- A few special assembly language instructions; among them, the iret instruction that terminates an interrupt or exception handler
All of these instructions have the same effect as a memory barrier.
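A common producer/consumer sketch (hypothetical flag/data pair; header names vary slightly across kernel versions) showing why barriers matter: the write barrier keeps the data from being published after the flag, and the read barrier keeps the consumer from reading the data before it sees the flag:

#include <asm/barrier.h>        /* wmb()/rmb(); <asm/system.h> on older kernels */
#include <asm/processor.h>      /* cpu_relax() */

static int shared_data;
static int data_ready;

/* Producer side */
static void publish(int value)
{
        shared_data = value;
        wmb();                  /* data must be visible ...                 */
        data_ready = 1;         /* ... before the flag that announces it    */
}

/* Consumer side */
static int consume(void)
{
        while (!data_ready)
                cpu_relax();    /* also a compiler barrier: flag is re-read */
        rmb();                  /* don't read shared_data ahead of the flag */
        return shared_data;
}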
Spin Locks
The kernel implements spin_lock differently on uniprocessor (UP) and multiprocessor (SMP) systems. On UP, if the code is not in interrupt context, the only way the code protected by spin_lock can lose the CPU is through kernel preemption; therefore, on UP it is enough to disable preemption when the lock is acquired and re-enable it when the lock is released. On SMP, two code paths can really run at the same time on different CPUs, and only then is a real lock needed to claim ownership of the resource.
Besides being SMP-safe, a spin lock also has to deal with two kinds of pseudo-concurrency, interrupts and preemption; that is, it has to be interrupt-safe and preempt-safe as well.
If an interrupt handler uses a spin lock to access a shared variable, deadlock must be avoided. For example, suppose thread A on CPU0 has acquired lock 1; between acquiring and releasing it, an interrupt (or softirq) is raised on CPU0 and its handler also tries to acquire the same spin lock. Because the lock holder, thread A on CPU0, cannot be scheduled to release the lock before the handler returns, CPU0 deadlocks. If the interrupt instead occurs on another CPU, say CPU1, there is no problem, because an interrupt on CPU1 does not interrupt the execution of the lock holder thread A on CPU0. Therefore, to be interrupt-safe, local CPU interrupts must be disabled before the lock is acquired.
The reasons you mustn't use these versions if you have interrupts that
play with the spinlock is that you can get deadlocks:

        spin_lock(&lock);
        ...
                <- interrupt comes in:
                        spin_lock(&lock);

where an interrupt tries to lock an already locked variable. This is ok if
the other interrupt happens on another CPU, but it is _not_ ok if the
interrupt happens on the same CPU that already holds the lock, because the
lock will obviously never be released (because the interrupt is waiting
for the lock, and the lock-holder is interrupted by the interrupt and will
not continue until the interrupt has been processed).

(This is also the reason why the irq-versions of the spinlocks only need
to disable the _local_ interrupts - it's ok to use spinlocks in interrupts
on other CPU's, because an interrupt on another CPU doesn't interrupt the
CPU that holds the lock, so the lock-holder can continue and eventually
releases the lock).
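Putting that rule into code, a sketch with a hypothetical device and handler (names invented for illustration): the process-context path uses spin_lock_irqsave() so the interrupt that would otherwise deadlock on the same CPU can never arrive while the lock is held:

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(dev_lock);
static unsigned int dev_pending;        /* shared with the IRQ handler */

/* Process context: disable local interrupts while holding the lock,
 * otherwise the handler below could deadlock on this CPU. */
static void queue_request(unsigned int req)
{
        unsigned long flags;

        spin_lock_irqsave(&dev_lock, flags);
        dev_pending |= req;
        spin_unlock_irqrestore(&dev_lock, flags);
}

/* Interrupt context: the process-context path cannot be holding the lock
 * on this CPU (it ran with local interrupts disabled), so the plain
 * spin_lock() is sufficient here. */
static irqreturn_t dev_isr(int irq, void *data)
{
        unsigned int work;

        spin_lock(&dev_lock);
        work = dev_pending;
        dev_pending = 0;
        spin_unlock(&dev_lock);
        /* ... handle 'work' ... */
        return IRQ_HANDLED;
}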
In Linux, each spin lock is represented by a spinlock_t structure
consisting of some fields:
typedef struct spinlock {
        union {
                struct raw_spinlock rlock;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))
                struct {
                        u8 __padding[LOCK_PADSIZE];
                        struct lockdep_map dep_map;
                };
#endif
        };
} spinlock_t;

typedef struct raw_spinlock {
        arch_spinlock_t raw_lock;
#ifdef CONFIG_GENERIC_LOCKBREAK
        unsigned int break_lock;
#endif
#ifdef CONFIG_DEBUG_SPINLOCK
        unsigned int magic, owner_cpu;
        void *owner;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map dep_map;
#endif
} raw_spinlock_t;

typedef struct arch_spinlock {
        unsigned int slock;
} arch_spinlock_t;

There are some macros related to the spinlock:
#define spin_lock_init(_lock)                           \
do {                                                    \
        spinlock_check(_lock);                          \
        raw_spin_lock_init(&(_lock)->rlock);            \
} while (0)

static inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
{
        return &lock->rlock;
}
static inline void spin_lock(spinlock_t *lock)
{
        raw_spin_lock(&lock->rlock);
}
#define raw_spin_lock(lock)     _raw_spin_lock(lock)

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
        preempt_disable();
        spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
        LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
spin_acquire(): a lockdep annotation; unless CONFIG_DEBUG_LOCK_ALLOC is enabled it expands to nothing.
LOCK_CONTENDED:
LOCK_CONTENDED() is a macro. Ignoring CONFIG_LOCK_STAT (a config option used to gather statistics about lock operations), it reduces to:

#define LOCK_CONTENDED(_lock, try, lock) lock(_lock)

With CONFIG_LOCK_STAT enabled it is defined as follows:

#define LOCK_CONTENDED(_lock, try, lock)                        \
do {                                                            \
        if (!try(_lock)) {                                      \
                lock_contended(&(_lock)->dep_map, _RET_IP_);    \
                lock(_lock);                                    \
        }                                                       \
        lock_acquired(&(_lock)->dep_map, _RET_IP_);             \
} while (0)
So the sequence above effectively ends up calling the do_raw_spin_lock() function:
void do_raw_spin_lock(raw_spinlock_t *lock)
{
        debug_spin_lock_before(lock);
        if (unlikely(!arch_spin_trylock(&lock->raw_lock)))
                __spin_lock_debug(lock);
        debug_spin_lock_after(lock);
}
Finally arch_spin_trylock() is executed; as the arch_ prefix suggests, it is architecture dependent.
The spinlock implementation decides whether the lock is free or busy by examining the value of lock->slock, so the decl or incl instructions the CPUs perform on it must be atomic; otherwise several CPUs could simultaneously see the lock as free and enter the critical section, or all CPUs could see it as busy and deadlock. On x86, the LOCK_PREFIX prefix guarantees the atomicity of the operations on lock->slock.
static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
{
        int tmp, new;

        asm volatile("movzwl %2, %0\n\t"
                     "cmpb %h0,%b0\n\t"
                     "leal 0x100(%" REG_PTR_MODE "0), %1\n\t"
                     "jne 1f\n\t"
                     LOCK_PREFIX "cmpxchgw %w1,%2\n\t"
                     "1:"
                     "sete %b1\n\t"
                     "movzbl %b1,%0\n\t"
                     : "=&a" (tmp), "=&q" (new), "+m" (lock->slock)
                     :
                     : "memory", "cc");

        return tmp;
}
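In C terms, a rough paraphrase of what the assembly above does (a readability sketch only, not the real implementation): slock packs two bytes, the ticket currently being served and the next ticket to hand out; the lock is free when they are equal, and it is taken by advancing the "next" byte:

/* Rough C paraphrase of __ticket_spin_trylock(); the real code performs
 * the whole operation with a single LOCK CMPXCHGW instruction. */
static inline int ticket_trylock_sketch(arch_spinlock_t *lock)
{
        unsigned int old    = lock->slock;
        unsigned char owner = old & 0xff;         /* ticket being served     */
        unsigned char next  = (old >> 8) & 0xff;  /* next ticket to hand out */

        if (owner != next)
                return 0;                         /* waiters queued: busy    */

        /* Take the lock by advancing the "next ticket" byte (+0x100);
         * cmpxchg() only succeeds if slock is still equal to 'old'. */
        return cmpxchg(&lock->slock, old, old + 0x100) == old;
}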
For the uniprocessor (UP) case:
#define _raw_spin_lock(lock)    __LOCK(lock)
/*
* In the UP-nondebug case there's no real locking going on, so the
* only thing we have to do is to keep the preempt counts and irq
* flags straight, to suppress compiler warnings of unused lock
* variables, and to add the proper checker annotations:
*/
#define __LOCK(lock) \
do { preempt_disable(); __acquire(lock); (void)(lock); } while (0)
(void)(lock) merely suppresses the compiler warning about the unused lock variable.
Analysis of spin_unlock:
#define _raw_spin_unlock(lock)  __UNLOCK(lock)

#define __UNLOCK(lock) \
        do { preempt_enable(); __release(lock); (void)(lock); } while (0)
Read-Copy Update (RCU):
RCU is a synchronization technique designed to protect data structures that are mostly accessed for reading by several CPUs. RCU allows many readers and writers to proceed concurrently (an improvement over seqlocks, which allow only one writer at a time). Moreover, RCU is lock-free: it uses no lock or counter shared by all CPUs.
But how can this be achieved without locks? The key ideas are the following:
1. Only data structures that are dynamically allocated and referenced by means of
pointers can be protected by RCU.
2. No kernel control path can sleep inside a critical region protected by RCU.
Readers must not block while accessing RCU-protected shared data; this is a basic precondition for the mechanism to work. In other words, while a reader is referencing RCU-protected data, the CPU it runs on must not undergo a context switch (spinlocks and rwlocks impose the same precondition). A writer, on the other hand, does not compete with readers for any lock when accessing RCU-protected data; only when there is more than one writer is some lock needed to synchronize the writers with each other. Before modifying the data, a writer first makes a copy of the element to be changed, performs the modification on the copy, and then registers a callback with the "garbage collector" so that the real update takes effect at a suitable time. The period spent waiting for that suitable time is called the grace period; a CPU that has gone through a context switch is said to have passed a quiescent state, and the grace period is the time needed for all CPUs to pass through one quiescent state. After the grace period, the garbage collector invokes the callback registered by the writer to complete the actual modification or release of the data.
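A minimal sketch of that pattern (hypothetical structure and function names; a single writer is assumed, so no writer-side lock is shown): readers run between rcu_read_lock() and rcu_read_unlock(), while the writer copies the element, updates the copy, publishes it with rcu_assign_pointer(), and frees the old copy only after a grace period:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct cfg {
        int a, b;
};

static struct cfg *global_cfg;          /* assumed already initialized */

/* Reader: no lock taken, just the (preemption-disabling) read-side marker. */
static int read_a(void)
{
        struct cfg *p;
        int val;

        rcu_read_lock();
        p = rcu_dereference(global_cfg);
        val = p ? p->a : -1;
        rcu_read_unlock();
        return val;
}

/* Single writer: copy, modify the copy, publish, wait for the grace
 * period, then free the old copy. */
static int update_a(int new_a)
{
        struct cfg *old, *new;

        new = kmalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
                return -ENOMEM;

        old = global_cfg;
        *new = *old;                            /* copy the element        */
        new->a = new_a;                         /* modify the copy         */
        rcu_assign_pointer(global_cfg, new);    /* publish the new version */
        synchronize_rcu();                      /* wait for all readers    */
        kfree(old);                             /* old copy is now unused  */
        return 0;
}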
static inline void __rcu_read_lock(void)
{
preempt_disable();
}
static inline void __rcu_read_unlock(void)
{
preempt_enable();
}
/**
* struct rcu_head - callback structure for use with RCU
* @next: next update requests in a list
* @func: actual update function to call after the grace period.
*/
struct rcu_head {
struct rcu_head *next;
void (*func)(struct rcu_head *head);
};
The following discussion of RCU usage is reproduced from "Using RCU", https://www.ibm.com/developerworks/cn/linux/l-rcu/
1. Lists with only insertion and deletion operations
In this case the vast majority of operations are list traversals, i.e. reads, while the rare write operations only add or remove list entries and never modify an entry in place. Using RCU here is very easy, and the conversion from rwlock to RCU is completely natural. Routing-table maintenance is a typical example: most operations are lookups, and the writes only add or delete entries, so replacing the original rwlock with RCU is straightforward. System-call auditing is another such case.
With rwlock the code looks like this:
static enum audit_state audit_filter_task(struct task_struct *tsk)
{
        struct audit_entry *e;
        enum audit_state state;

        read_lock(&auditsc_lock);
        /* Note: audit_netlink_sem held by caller. */
        list_for_each_entry(e, &audit_tsklist, list) {
                if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                        read_unlock(&auditsc_lock);
                        return state;
                }
        }
        read_unlock(&auditsc_lock);
        return AUDIT_BUILD_CONTEXT;
}

With RCU it becomes:
static enum audit_state audit_filter_task(struct task_struct *tsk)
{
        struct audit_entry *e;
        enum audit_state state;

        rcu_read_lock();
        /* Note: audit_netlink_sem held by caller. */
        list_for_each_entry_rcu(e, &audit_tsklist, list) {
                if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                        rcu_read_unlock();
                        return state;
                }
        }
        rcu_read_unlock();
        return AUDIT_BUILD_CONTEXT;
}

The conversion is completely mechanical: rcu_read_lock and rcu_read_unlock replace read_lock and read_unlock, and the list traversal simply uses the _rcu version of the iterator.
static inline int audit_del_rule(struct audit_rule *rule,
                                 struct list_head *list)
{
        struct audit_entry *e;

        write_lock(&auditsc_lock);
        list_for_each_entry(e, list, list) {
                if (!audit_compare_rule(rule, &e->rule)) {
                        list_del(&e->list);
                        write_unlock(&auditsc_lock);
                        return 0;
                }
        }
        write_unlock(&auditsc_lock);
        return -EFAULT;         /* No matching rule */
}

static inline int audit_add_rule(struct audit_entry *entry,
                                 struct list_head *list)
{
        write_lock(&auditsc_lock);
        if (entry->rule.flags & AUDIT_PREPEND) {
                entry->rule.flags &= ~AUDIT_PREPEND;
                list_add(&entry->list, list);
        } else {
                list_add_tail(&entry->list, list);
        }
        write_unlock(&auditsc_lock);
        return 0;
}

With RCU this becomes:
static inline int audit_del_rule(struct audit_rule *rule,
                                 struct list_head *list)
{
        struct audit_entry *e;

        /* Do not use the _rcu iterator here, since this is the only
         * deletion routine. */
        list_for_each_entry(e, list, list) {
                if (!audit_compare_rule(rule, &e->rule)) {
                        list_del_rcu(&e->list);
                        call_rcu(&e->rcu, audit_free_rule, e);
                        return 0;
                }
        }
        return -EFAULT;         /* No matching rule */
}

static inline int audit_add_rule(struct audit_entry *entry,
                                 struct list_head *list)
{
        if (entry->rule.flags & AUDIT_PREPEND) {
                entry->rule.flags &= ~AUDIT_PREPEND;
                list_add_rcu(&entry->list, list);
        } else {
                list_add_tail_rcu(&entry->list, list);
        }
        return 0;
}

For list deletion, list_del is replaced by list_del_rcu plus call_rcu, because the deleted entry may still be referenced by readers; it cannot be freed immediately, but only after all readers have passed through a quiescent state. Also note that list_for_each_entry was not replaced by list_for_each_entry_rcu: since this is the only code path performing deletions, there is no need for the _rcu version there.
Normally, write_lock and write_unlock would be replaced by spin_lock and spin_unlock respectively; but when the write side only adds and removes list entries and there is a single writer, the rwlock can disappear completely once the _rcu list API is used, because no spinlock is needed to synchronize with the readers. In the example above, audit_netlink_sem is already held by the caller, so even the spinlock is unnecessary.
This scenario tolerates updates becoming visible only after some delay, and the writer only adds or removes entries, so converting it to RCU is very easy.
2. The write side needs to modify list entries
If the writer needs to modify an entry, it first copies the entry to be modified, modifies the copy, and then replaces the original entry with the copy; the original entry is safely deleted after a grace
period has elapsed.
static inline int audit_upd_rule(struct audit_rule *rule,
                                 struct list_head *list,
                                 __u32 newaction,
                                 __u32 newfield_count)
{
        struct audit_entry *e;
        struct audit_newentry *ne;

        write_lock(&auditsc_lock);
        /* Note: audit_netlink_sem held by caller. */
        list_for_each_entry(e, list, list) {
                if (!audit_compare_rule(rule, &e->rule)) {
                        e->rule.action = newaction;
                        e->rule.file_count = newfield_count;
                        write_unlock(&auditsc_lock);
                        return 0;
                }
        }
        write_unlock(&auditsc_lock);
        return -EFAULT;         /* No matching rule */
}

With RCU this becomes:
static inline int audit_upd_rule(struct audit_rule *rule,
                                 struct list_head *list,
                                 __u32 newaction,
                                 __u32 newfield_count)
{
        struct audit_entry *e;
        struct audit_newentry *ne;

        list_for_each_entry(e, list, list) {
                if (!audit_compare_rule(rule, &e->rule)) {
                        ne = kmalloc(sizeof(*entry), GFP_ATOMIC);
                        if (ne == NULL)
                                return -ENOMEM;
                        audit_copy_rule(&ne->rule, &e->rule);
                        ne->rule.action = newaction;
                        ne->rule.file_count = newfield_count;
                        list_replace_rcu(e, ne);
                        call_rcu(&e->rcu, audit_free_rule, e);
                        return 0;
                }
        }
        return -EFAULT;         /* No matching rule */
}
This works when readers can tolerate seeing the old data for a while after a modification, i.e. within some time window after the update a reader may still observe the previous value. In many cases readers cannot tolerate stale data; then additional measures are needed. System V IPC, for instance, adds a deleted field to each list entry to mark whether it has been removed: the field is set to true on deletion and false otherwise, and the code traversing the list checks the deleted field of every entry and treats an entry marked deleted as nonexistent.
Taking the system-call audit code as the example again, if it cannot tolerate stale data, the read side should be modified as follows:
static enum audit_state audit_filter_task(struct task_struct *tsk)
{
        struct audit_entry *e;
        enum audit_state state;

        rcu_read_lock();
        list_for_each_entry_rcu(e, &audit_tsklist, list) {
                if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                        spin_lock(&e->lock);
                        if (e->deleted) {
                                spin_unlock(&e->lock);
                                rcu_read_unlock();
                                return AUDIT_BUILD_CONTEXT;
                        }
                        rcu_read_unlock();
                        return state;
                }
        }
        rcu_read_unlock();
        return AUDIT_BUILD_CONTEXT;
}

Note that in this case every list entry needs its own spinlock, because the deletion path modifies the entry's deleted flag. Also, when the function finds a matching entry, it should return with that entry's lock still held: only then is the caller guaranteed to see the newly modified data rather than possibly stale data.
static inline int audit_del_rule(struct audit_rule *rule,
                                 struct list_head *list)
{
        struct audit_entry *e;

        /* Do not use the _rcu iterator here, since this is the only
         * deletion routine. */
        list_for_each_entry(e, list, list) {
                if (!audit_compare_rule(rule, &e->rule)) {
                        spin_lock(&e->lock);
                        list_del_rcu(&e->list);
                        e->deleted = 1;
                        spin_unlock(&e->lock);
                        call_rcu(&e->rcu, audit_free_rule, e);
                        return 0;
                }
        }
        return -EFAULT;         /* No matching rule */
}
Original article: http://blog.csdn.net/u012681083/article/details/51239924