https://idea.popcount.org/
In previous articles we talked about the history of the select(2) syscall and its shortcomings. This time we'll focus on Linux's select(2) successor - the epoll(2) I/O multiplexing syscall.
Epoll is relatively young. It was created by Davide Libenzi in 2002. For comparison: Windows introduced IOCP in 1994 and FreeBSD's kqueue arrived in July 2000. Unfortunately, even though epoll is the youngest member of the advanced I/O multiplexing family, it's the worst of the bunch.
Bryan Cantrill of Joyent is known for bashing epoll(). Here's one of the more entertaining interviews:
He mentions two defects.
First he describes "a fatal flaw, that is subtle" in the Solaris /dev/poll model. He starts by describing the "thundering herd" problem (which we discussed earlier). Then he moves on to the real issue. In a multithreaded scenario, when the /dev/poll descriptor is shared, it is impossible to deliver events on one file descriptor to precisely one worker thread. He explains that band-aids applied to the level-triggered /dev/poll model, and naive edge triggering, won't work in the multithreaded case.
This argument is indeed subtle, but since epoll has semantics close to /dev/poll, it's safe to say it wasn't designed to work in multithreaded scenarios.
In the video Mr Cantrill raised a second argument against epoll: the events registered in epoll aren't associated with a file descriptor, but with the underlying kernel object referred to by the file descriptor (let's call this the file description). He mentions the "stunning" effect of forking and closing an fd. We will leave this problem for now and describe it in another blog post.
Most of the epoll critique is based on two fundamental design issues:
1) Sometimes it is desirable to scale an application by using multiple threads. This was not supported by early implementations of epoll and was fixed by the EPOLLONESHOT and EPOLLEXCLUSIVE flags.
2) Epoll registers the file description, the kernel data structure, not the file descriptor, the userspace handle pointing to it.
The debate is heated because it's technically possible to avoid both pitfalls with careful defensive programming. If you can, you should avoid using epoll for load balancing across threads. Avoid sharing an epoll file descriptor across threads. Avoid sharing epoll-registered file descriptors. Avoid forking, and if you must: close all epoll-registered file descriptors before calling execve(). Explicitly deregister affected file descriptors from the epoll set before calling dup()/dup2()/dup3() or close().
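To illustrate the last point, here is a minimal sketch of a close() wrapper that always deregisters the descriptor first. The helper name epoll_close() is made up for this example, and epfd is assumed to be the epoll instance the descriptor was registered with:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Remove the descriptor from the epoll set before closing it, so events
 * for the underlying file description are no longer reported even if
 * another handle (e.g. from dup() or fork()) still refers to it. */
static int epoll_close(int epfd, int fd)
{
    /* Ignore the error: the fd may simply not be registered. */
    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
    return close(fd);
}
```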
If you have simple code and follow the advice above, you might be fine. The problem starts when your epoll program gets complex.
Let's dig deeper. In this blog post I'll focus on the load balancing argument.
There are two distinct load balancing scenarios:
1) accept() calls for a single bound TCP socket
2) read() calls for a large number of connected sockets

Sometimes it's necessary to serve lots of very short TCP connections. A high-throughput HTTP 1.0 server is one such example. Since the rate of inbound connections is high, you want to distribute the work of accept()ing connections across multiple CPUs.
This is a real problem happening in large deployments. Tom Herbert reported an application handling 40k connections per second. With such a volume it makes sense to spread the work across cores.
But it's not that simple. Up until kernel 4.5 it wasn't possible to use epoll to scale out accepts.
A naive solution is to have a single epoll file descriptor shared across worker threads. This won't work well, and neither will sharing the bound socket file descriptor and registering it in each thread with a unique epoll instance. This is because "level triggered" (aka: normal) epoll inherits the "thundering herd" semantics from select(). Without special flags, in level-triggered mode, all the workers will be woken up on each and every new connection. Here's an example:
1) Kernel: receives a new connection.
2) Kernel: notifies both waiting threads, A and B ("thundering herd").
3) Thread A: finishes epoll_wait().
4) Thread B: finishes epoll_wait().
5) Thread A: performs accept(), this succeeds.
6) Thread B: performs accept(), this fails with EAGAIN.

Waking up "Thread B" was completely unnecessary and wastes precious resources. Epoll in level-triggered mode scales out poorly.
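To make the setup concrete, here is a rough sketch of such a naive worker loop: all threads share one epoll instance, and the listening socket is assumed to be non-blocking and registered with plain EPOLLIN, so its fd comes back in ev.data.fd. It works, but every new connection wakes every thread and most of them only harvest EAGAIN:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <errno.h>

/* Naive level-triggered worker: every thread blocks on the same shared
 * epoll descriptor. On each new connection the kernel wakes all of them,
 * only one accept() wins and the rest see EAGAIN. */
static void *worker(void *arg)
{
    int epfd = *(int *)arg;
    struct epoll_event ev;

    for (;;) {
        if (epoll_wait(epfd, &ev, 1, -1) <= 0)
            continue;

        int client = accept(ev.data.fd, NULL, NULL);
        if (client < 0)
            continue;   /* usually EAGAIN: another thread got there first */

        /* handle_client(client); */
    }
    return NULL;
}
```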
Okay, since we ruled out the naive level-triggered setup, maybe "edge triggered" could do better?
Not really. Here is a possible pessimistic run:
1) Kernel: receives a first connection.
2) Kernel: notifies one of the two threads waiting on epoll_wait(), say Thread A; due to edge-triggered behavior only one wake-up is delivered.
3) Thread A: finishes epoll_wait().
4) Thread A: performs accept(), this succeeds.
5) Kernel: receives a second connection while Thread B is still waiting on epoll_wait(). Kernel wakes up Thread B.
6) Thread A: performs accept() since it does not know if kernel received one or more connections originally. It hopes to get EAGAIN, but gets another socket.
7) Thread B: performs accept(), receives EAGAIN. This thread is confused.
8) Thread A: performs accept() again, gets EAGAIN.

The wake-up of Thread B was completely unnecessary and is confusing. Additionally, in edge-triggered mode it's hard to avoid starvation:
1) Kernel: receives three connections.
2) Kernel: notifies one of the waiting threads, say Thread A.
3) Thread A: finishes epoll_wait().
4) Thread A: performs accept(), this succeeds.
5) Thread A: performs accept(), hopes to get EAGAIN, but gets another socket.
6) Thread A: performs accept(), hopes to get EAGAIN, but gets another socket.

In this case the socket moved only once from "non-readable" to "readable" state. Since the socket is in edge-triggered mode, the kernel will wake up epoll exactly once. As a result all the connections will be received by Thread A and load balancing won't be achieved.
There are two workarounds.
The best and the only scalable approach is to use a recent kernel (4.5+) and level-triggered events with the EPOLLEXCLUSIVE flag. This will ensure only one thread is woken for an event, avoiding the "thundering herd" issue, and scale properly across multiple CPUs.
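A minimal sketch of that approach, assuming a kernel with EPOLLEXCLUSIVE (4.5+): each worker thread creates its own epoll instance and adds the shared, non-blocking listening socket with EPOLLIN | EPOLLEXCLUSIVE, so a new connection wakes only one (or at least not every) instance:

```c
#include <sys/epoll.h>
#include <sys/socket.h>

/* Each worker thread gets its own epoll instance and registers the shared
 * listening socket with EPOLLEXCLUSIVE. For a given wakeup the kernel then
 * notifies only one (or at least not all) of the attached epoll instances,
 * avoiding the thundering herd. */
static void *worker(void *arg)
{
    int listen_fd = *(int *)arg;            /* shared, non-blocking listener */
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return NULL;

    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLEXCLUSIVE, /* level-triggered + exclusive */
        .data.fd = listen_fd,
    };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0)
        return NULL;

    for (;;) {
        struct epoll_event out;
        if (epoll_wait(epfd, &out, 1, -1) <= 0)
            continue;
        int client = accept(out.data.fd, NULL, NULL);
        if (client < 0)
            continue;                       /* EAGAIN: someone else won */
        /* handle_client(client); */
    }
}
```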
Without EPOLLEXCLUSIVE, similar behavior can be emulated with edge-triggered mode and EPOLLONESHOT, at a cost of one extra epoll_ctl() syscall after each event. This will distribute load across multiple CPUs properly, but at most one worker will call accept() at a time, limiting throughput. The flow looks roughly like this (a code sketch follows the list):
1) Kernel: receives a new connection and wakes up one worker, say Thread A; no other thread is notified because of EPOLLONESHOT.
2) Thread A: finishes epoll_wait().
3) Thread A: performs accept(), this succeeds.
4) Thread A: calls epoll_ctl(EPOLL_CTL_MOD), this will reset the EPOLLONESHOT and re-arm the socket.
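A hedged sketch of that flow: the listening socket is registered in one shared epoll instance with EPOLLIN | EPOLLET | EPOLLONESHOT, the winning worker drains accept() and then re-arms the descriptor with EPOLL_CTL_MOD. The listening socket is assumed to be non-blocking, and handle_listener() is just an illustrative name:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <errno.h>

/* EPOLLONESHOT variant: only one thread at a time is woken for the
 * listening socket. After draining accept() the worker must re-arm the
 * descriptor with EPOLL_CTL_MOD, otherwise no further events arrive. */
static void handle_listener(int epfd, int listen_fd)
{
    for (;;) {
        int client = accept(listen_fd, NULL, NULL);
        if (client < 0)
            break;               /* EAGAIN: queue drained (errors ignored here) */
        /* handle_client(client); */
    }

    /* Re-arm: EPOLLONESHOT disabled the fd after the last event. */
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
        .data.fd = listen_fd,
    };
    epoll_ctl(epfd, EPOLL_CTL_MOD, listen_fd, &ev);
}
```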
It's worth noting there are other ways to scale accept() without relying on epoll. One option is to use SO_REUSEPORT and create multiple listening sockets sharing the same port number. This approach has problems though - when one of the file descriptors is closed, the connections already waiting in its accept queue will be dropped. Read more in this Yelp blog post and this LWN comment.
Kernel 4.4 introduced SO_INCOMING_CPU to further improve the locality of SO_REUSEPORT sockets. I wasn't able to find good documentation for this very new feature.
Even better, kernel 4.5 introduced the SO_ATTACH_REUSEPORT_CBPF and SO_ATTACH_REUSEPORT_EBPF socket options. When used properly, with a bit of magic, it should be possible to substitute SO_INCOMING_CPU and overcome the usual SO_REUSEPORT problem of connections dropped on rebalancing.
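As a rough sketch of the idea (not a full solution to the rebalancing problem): each worker creates its own SO_REUSEPORT listener, and a two-instruction classic BPF program attached with SO_ATTACH_REUSEPORT_CBPF steers each new connection to the group member whose index equals the CPU that handled the packet. Error handling is omitted and make_reuseport_listener() is a hypothetical helper:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/filter.h>   /* struct sock_filter, sock_fprog, SKF_AD_CPU */
#include <string.h>

static int make_reuseport_listener(unsigned short port)
{
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Allow several sockets to bind the same port; each worker calls this. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* cBPF program: A = current CPU number; return A as the index of the
     * reuseport group member that should receive this connection.
     * Attached after bind(), once the socket is part of a reuseport group. */
    struct sock_filter code[] = {
        { BPF_LD  | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
        { BPF_RET | BPF_A,           0, 0, 0 },
    };
    struct sock_fprog prog = { .len = 2, .filter = code };
    setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &prog, sizeof(prog));

    listen(fd, SOMAXCONN);
    return fd;
}
```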
Apart from scaling accept() there is a second use case for scaling epoll across many cores. Imagine a situation where you have a large number of HTTP client connections and you want to serve them as quickly as the data arrives. Each connection may require some unpredictable processing, so sharding them into equal buckets across worker threads will worsen mean latency. It's better to use "the combined queue" queuing model - have one epoll set and use multiple threads to pull active sockets from it and perform the work.
The Engineer Guy has a video explaining the combined queue model. In our case the shared queue is an epoll descriptor, the tills are worker threads and the jobs are readable sockets.
We don't want to use the level-triggered model due to the "thundering herd" behavior. Additionally, EPOLLEXCLUSIVE won't help since a race condition is possible. Here's how it may materialize:
1) Kernel: receives 2047 bytes of data.
2) Kernel: wakes up one of the waiting threads, due to the EPOLLEXCLUSIVE behavior. Let's say kernel woke up Thread A.
3) Thread A: finishes epoll_wait().
4) Kernel: receives 2 more bytes of data.
5) Kernel: the socket is still readable, so another waiting thread, Thread B, is woken up.
6) Thread A: performs read(2048) and reads full buffer of 2048 bytes.
7) Thread B: performs read(2048) and reads remaining 1 byte of data.

In this situation the data is split across two threads and without using mutexes the data may be reordered.
Okay, so maybe the edge-triggered model will do better? Not really. The same race condition occurs:
1) Kernel: receives 2048 bytes of data and notifies one of the threads, say Thread A.
2) Thread A: finishes epoll_wait().
3) Thread A: performs read(2048) and reads full buffer of 2048 bytes.
4) Kernel: receives 1 more byte of data. The socket becomes readable again, so the kernel notifies Thread B.
5) Thread B: finishes epoll_wait().
6) Thread B: performs read(2048) and gets 1 byte of data.
7) Thread A: performs read(2048), which returns nothing, gets EAGAIN.

The correct solution is to use EPOLLONESHOT and re-arm the file descriptor manually. This is the only way to guarantee that the data will be delivered to only one thread at a time and to avoid race conditions.
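A hedged sketch of that pattern: every connected socket is registered with EPOLLIN | EPOLLONESHOT in the shared epoll instance, and the worker that wins the event owns the socket until it explicitly re-arms it, so reads for one connection are never interleaved across threads. The sockets are assumed to be non-blocking, and handle_readable() is an illustrative name:

```c
#include <sys/epoll.h>
#include <unistd.h>
#include <errno.h>

/* One worker handles one readable socket at a time. EPOLLONESHOT makes
 * sure no other thread is woken for this fd until we re-arm it. */
static void handle_readable(int epfd, int fd)
{
    char buf[2048];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* process(buf, n); */
            continue;
        }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                   /* drained for now */
        /* n == 0 (peer closed) or a real error: clean up and return. */
        close(fd);
        return;
    }

    /* Re-arm so some worker (possibly a different one) gets the next event. */
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLONESHOT,
        .data.fd = fd,
    };
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```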
Using epoll() correctly is hard. Understanding the extra flags EPOLLONESHOT and EPOLLEXCLUSIVE is necessary to achieve load balancing free of race conditions.
Considering that EPOLLEXCLUSIVE is a very new epoll flag, we may conclude that epoll was not originally designed for balancing load across multiple threads.
In the next blog post in this series we will describe the epoll "file descriptor vs file description" problem, which occurs when epoll is used with the close() and fork() calls.