As Cilk provides a framework of parallel programming in a C-like parallel language, we are implementing a C thread-pool library with the very similar syntactic sugar, programming abstraction and primitive semantics. Until now, one the on hand, we have catched up the schedule and completed all the planned tasks; but on the other hand, we have learned a painful lesson and proved that implementing the work stealing in C can not be done in an efficient way comparable to Cilk.
Goals & Deliverables Revisited
We are delight to see that our development was going well as planned. Here is what we did in the past month and what we plan to do for the parallelism competition.
Implement Thread Library
We want to deliver a thread library with support of spawn and sync semantics and work-stealing scheduler and we are well on our way to doing this. At this moment, we have implemented basic spawn and sync semantics and tested the correctness of our implementation using fib().
Specifically, with current implementation, the user can spawn any a thread with arbitrary interface and with a return object of primitive types; the library will set the return value to the desired location asynchronously; the spawned work is enqueued to a global queue and shared by all workers.
With sync, the calling thread will be suspended just as desired and resume only after the descendents have complete; just as Cilk, any function implicitly makes a sync call before return.
The workers can be scheduled to do any available work at the point of spawn and sync.
The next stage is to develop work-stealing scheduler and continue to reduce the overhead of thread management.
Plan at Parallelism Competition
At the parallelism competition, we are planned to show the functionality and scalability of our framework. First, we may use small animations to explain how to create and steal works and also run example code to show how to use the library. Moreover, we will show our implmentation on different machines with different number of cores. Last but not least, we will compare our implementation with a naive implementation of worker-pool library without work stealing.
We have tested the performance of our implementation on fib() but got a disappointed speedup. The result showed that our implementation was slower than both cilk version and serial version of fib() program. The slowdown mainly results from the frequent memory subscriptions and potentially context switch overhead. So the next step we will test our framework on more compute-intensive applications and continue to reduce the overhead, e.g. by adding and optimizing the memory manager.
we are not proficient in profiling. So we need some suggestions about analyzing performance and locating bottlenecks, more urgently, we need the software multithread profiling resources like VTune.
Schedule (half week)
|April 20-April 24||Naive work stealing implementation|
|April 24-April 27||second benchmarking naive work stealing vs global queue + naive work stealing implemenation continued|
|April 27-April 30||third benchmarking naive work stealing vs global queue + refine work stealing|
|April 30-May 2||fourth benchmarking refined work stealing vs global queue + refine sync implementation|
|May 2-May 5||fifth benchmarking refined working stealing vs global queue + refine sync implementation continued|
|May 5-May 9||Toward a self-implementated thread and locking scheme|