Now, as to whether that’ll actually improve the performance, that’s a very different question, one that doesn’t have an easy answer. My experience is that most networking code is I/O bound, so getting more CPUs to work on the problem doesn’t help.
It's not about getting more CPU time. The point is that multiple independent requests shouldn't have to wait on every request ahead of them.
No. Remember that each filter is running in its own process. However, it wouldn’t surprise me if your profiling revealed serialisation bottlenecks within the NE infrastructure.
Correct, but the kernel must still wait for all filters to complete. Does it dispatch the request to each filter in parallel, or does it simply run a serial for(filter){request} loop?
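
To make the question concrete, here's a rough Python sketch of the two dispatch shapes I mean. It says nothing about what the kernel or the NE infrastructure actually does. The filter names, the delays, and every function here are made up purely for illustration, and the sleep is just a stand-in for a filter process taking time to return a verdict.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-filter verdict latencies in seconds (made-up numbers).
FILTER_DELAYS = {"filter_a": 0.05, "filter_b": 0.05, "filter_c": 0.05}


def ask_filter(name, flow):
    """Stand-in for one filter process deciding on a flow; sleeps to mimic the wait."""
    time.sleep(FILTER_DELAYS[name])
    return True  # every hypothetical filter allows the flow


def dispatch_serial(flow):
    """The for(filter){request} shape: each filter is asked only after the previous one answers."""
    return all([ask_filter(name, flow) for name in FILTER_DELAYS])


def dispatch_parallel(flow):
    """Ask every filter at once, then wait for all of the verdicts."""
    with ThreadPoolExecutor(max_workers=len(FILTER_DELAYS)) as pool:
        verdicts = list(pool.map(lambda name: ask_filter(name, flow), FILTER_DELAYS))
    return all(verdicts)


def timed(dispatch, flow):
    """Run one dispatch strategy and return (allowed, elapsed_seconds)."""
    start = time.perf_counter()
    allowed = dispatch(flow)
    return allowed, time.perf_counter() - start


if __name__ == "__main__":
    for dispatch in (dispatch_serial, dispatch_parallel):
        allowed, elapsed = timed(dispatch, "example-flow")
        print(f"{dispatch.__name__}: allowed={allowed} elapsed={elapsed:.3f}s")
```

In this toy model the serial version takes roughly the sum of the filter delays, while the parallel version takes roughly the slowest single filter. Either way the caller still waits for every verdict, which is why the dispatch strategy matters even though each filter lives in its own process.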