(The original version is in Japanese, and the translation is still in progress. This manual is based on version 0.2; the library's current version is 0.3, which adds SPMD and concurrency support.)
This is a library for writing parallel programs (for UNIX OSes). With this library, you can write parallel programs more easily than using other libraries like pthread directly. As a practical example, bzip2 is parallelized using this library.
Today, speeding up a single processor is becoming difficult, and using multiple processors is a popular way to achieve high performance. This is true not only for engineering workstations (EWS), but also for PCs and embedded processors.
However, it is not easy to write programs for parallel machines. Usually, programmers must use libraries like pthread together with locks and/or semaphores. This is not easy, and it tends to cause bugs that are quite difficult to debug, because the behavior of the program can change from run to run.
This library offers a more intuitive and easier way to write parallel programs. It offers the classes Sync<T>, SyncList<T>, SyncQueue<T>, and WorkPool<T1,T2>, which are outlined below.
For example, if "a" is declared as "Sync<int> a", you can perform operations like "a.read()" and "a.write(1)". The operation "a.write(1)" sets the contents of "a" to 1, and the operation "a.read()" gets the contents of "a". You can do interprocess communication using this functionality.
Here, the operation "a.read()" blocks until the operation "a.write(1)" is executed. This is called dataflow synchronization. The write operation takes effect only once for a given variable (more exactly, operations after the 1st write cannot change the contents).
SyncList<T> is a list of Sync<T>, and SyncQueue<T> is SyncList<T> whose length is limited.
WorkPool<T1,T2> lets multiple processes take work items from a shared work pool.
I will explain the usage using the samples.cc file in the samples directory. Here is the main of samples.cc with comments.
int main()
{
    pards_init();              // ...(main:1) Library initialization
    Sync<int> a, b, c;         // ...(main:2) Decl. of sync vars
    SPAWN(add(1,b,c));         // ...(main:3) fork add, wait for b, executed 3rd
    SPAWN(add(1,a,b));         // ...(main:4) fork add, wait for a, executed 2nd
    a.write(3);                // ...(main:5) executed 1st
    int v = c.read();          // ...(main:6) wait for add(1,b,c) of (main:3)
    printf("value = %d\n",v);
    pards_finalize();          // ...(main:7) Finalize of the library
}
In order to use the library, you need to call pards_init() (main:1). In addition, to finalize the library, you need to call pards_finalize() (main:7).
Synchronization variables are declared like Sync<int> a,b,c (main:2). In this case, these variables can contain values whose type is "int".
The function add(1,b,c) is SPAWNed at (main:3). This means that the function add(1,b,c) is forked as a process. (SPAWN is implemented as a macro.)
Here, the function add is defined as follows:
void add(int i, Sync<int> a, Sync<int> b)
{
    int val;
    val = i + a.read();   // ...(add:1) a.read() waits for a to be written
    b.write(val);         // ...(add:2) b.write writes a value
}
This function adds the 1st argument and the value of the 2nd argument, and returns the result through the 3rd argument. The type of the 1st argument is a plain "int", and that of the 2nd and 3rd arguments is Sync<int>.
"a.read()" in (add:1) blocks until the value of a is written. After the value of a is written, execution resumes and the value is obtained. The value is then added to the 1st argument, and the result is stored in the variable val.
In (add:2), val is written to b. This makes the processes that are waiting for b's value resume execution.
Back to main. The 2nd argument of the function add that is forked in (main:3) has not yet been written by any process. Therefore, this function blocks for a while.
Likewise, the 2nd argument of the function add that is forked in (main:4) has not been written at this time, so this add also blocks.
Then, 3 is written to a in (main:5). This makes the function add that is forked in (main:4) resume. As a result, 4 is written to b.
After the value is written to b, the function add that is forked in (main:3) also restarts. After execution, 5 is written to c.
The value of c is read in (main:6). This also blocks until a value is written to c. Therefore, it waits for the execution of the function add that is forked in (main:3).
As you can see, interprocess communication and synchronization between the processes forked by SPAWN can be realized using Sync<int> variables.
Here, once a value has been written, later writes to the same variable cannot change it. This kind of variable is called a single assignment variable.
You can write an algorithm like first-come-first-served using this functionality.
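The behavior described above can be illustrated with a small stand-in. The following is only a sketch: it uses threads inside one process instead of fork() and System V IPC, and OnceCell, chain_demo and first_write_wins are hypothetical names, not part of this library. It shows a blocking read, a first-write-wins write, and the same a -> b -> c chain as samples.cc.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical in-process stand-in for a single-assignment variable.
// Copies share one slot, so it can be passed by value like Sync<int>.
template <typename T>
class OnceCell {
    struct Slot {
        std::mutex m;
        std::condition_variable cv;
        bool written = false;
        T value{};
    };
    std::shared_ptr<Slot> slot_ = std::make_shared<Slot>();

public:
    // Returns true only for the first write; later writes change nothing.
    bool write(const T& v) {
        std::lock_guard<std::mutex> lock(slot_->m);
        if (slot_->written) return false;
        slot_->value = v;
        slot_->written = true;
        slot_->cv.notify_all();
        return true;
    }

    // Blocks until some writer has stored a value, then returns a copy.
    T read() const {
        std::unique_lock<std::mutex> lock(slot_->m);
        slot_->cv.wait(lock, [this] { return slot_->written; });
        return slot_->value;
    }
};

// Same shape as add() in samples.cc: waits for "in", then writes "out".
void add(int i, OnceCell<int> in, OnceCell<int> out) {
    out.write(i + in.read());
}

// Mirrors main() of samples.cc, with threads standing in for SPAWN.
int chain_demo() {
    OnceCell<int> a, b, c;
    std::thread t1(add, 1, b, c);  // waits for b
    std::thread t2(add, 1, a, b);  // waits for a
    a.write(3);                    // unblocks t2, which then unblocks t1
    int v = c.read();              // blocks until t1 has written c
    t1.join();
    t2.join();
    return v;
}

// First-come-first-served: only the first write is kept.
int first_write_wins() {
    OnceCell<int> x;
    x.write(3);
    x.write(7);  // ignored, x already holds 3
    return x.read();
}
```

With these definitions, chain_demo() yields 5 and first_write_wins() yields 3, matching the walkthrough of samples.cc above.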
As mentioned above, SPAWN is implemented as a macro that calls fork() internally.
Sync<T> uses System V IPC to realize interprocess communication; shared memory is allocated in pards_init().
In addition, System V IPC semaphores are used to block and resume processes and to provide mutual exclusion for shared memory access.
A Sync<T> variable only stores a pointer to shared memory and the IDs of semaphores. Therefore, this variable can be passed by value to functions (as in the arguments of the function add in samples.cc). Of course, you can also pass these variables as pointers or references.
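To make the mechanism above concrete, here is a minimal sketch that combines fork(), a System V shared memory segment and a System V semaphore. It is not the library's actual code: SharedInt, SemArg, sem_add and shared_int_demo are hypothetical names, and each call allocates its own segment instead of using one set up by pards_init().

```cpp
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Locally named argument type for semctl(); not every platform declares
// "union semun", so the sketch defines an equivalent layout itself.
union SemArg {
    int val;
    struct semid_ds* buf;
    unsigned short* array;
};

// Hypothetical handle: a pointer into shared memory plus a semaphore ID.
// It is small, so copying it by value is cheap.
struct SharedInt {
    int* value;
    int semid;
};

// Adds delta to semaphore 0; a negative delta blocks until it can proceed.
static int sem_add(int semid, short delta) {
    struct sembuf op;
    op.sem_num = 0;
    op.sem_op = delta;
    op.sem_flg = 0;
    return semop(semid, &op, 1);
}

// Forks a writer child; returns the value read by the parent, or -1 on error.
int shared_int_demo() {
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid == -1) return -1;
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1) {
        shmctl(shmid, IPC_RMID, nullptr);
        return -1;
    }
    void* mem = shmat(shmid, nullptr, 0);
    void* const failed = reinterpret_cast<void*>(-1);
    SemArg arg;
    arg.val = 0;  // count 0 means "not written yet"
    int result = -1;
    if (mem != failed && semctl(semid, 0, SETVAL, arg) != -1) {
        SharedInt a = {static_cast<int*>(mem), semid};
        pid_t pid = fork();
        if (pid == 0) {
            *a.value = 42;        // child: write into shared memory
            sem_add(a.semid, 1);  // wake up the blocked reader
            _exit(0);
        }
        if (pid > 0) {
            // parent: block until the child has written, then read
            if (sem_add(a.semid, -1) == 0) result = *a.value;
            waitpid(pid, nullptr, 0);
        }
    }
    // Explicit release of the shared resources.
    if (mem != failed) shmdt(mem);
    shmctl(shmid, IPC_RMID, nullptr);
    semctl(semid, 0, IPC_RMID);
    return result;
}
```

On a system where System V IPC is available, shared_int_demo() returns 42. The explicit cleanup at the end corresponds to the point made later in this manual that shared resources have to be released explicitly.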
The important thing is that even if you modify a global variable in a SPAWNed function, it does not affect other processes, because we use fork instead of pthread. Changing global variables in threads is a typical cause of bugs that are hard to fix, but our library does not cause such bugs. SPAWNed functions can still read global and local variables that were set before SPAWN, because fork() logically copies the whole memory space (to be exact, the copy occurs only when a write to the memory happens).
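The isolation described above can be checked with plain fork(), without this library. In the following sketch, g_counter, ForkResult and fork_isolation_demo are hypothetical names chosen for illustration; the child reports what it saw through its exit status.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int g_counter = 1;  // global variable, copied (not shared) by fork()

struct ForkResult {
    int seen_by_child;  // value the child observed before modifying it
    int parent_after;   // parent's value after the child has exited
};

// Returns {-1, -1} if fork() or waitpid() fails.
ForkResult fork_isolation_demo() {
    ForkResult r = {-1, -1};
    g_counter = 5;  // set before the fork, so the child sees 5
    pid_t pid = fork();
    if (pid < 0) return r;
    if (pid == 0) {
        int seen = g_counter;  // child reads the pre-fork value
        g_counter = 100;       // changes only the child's private copy
        _exit(seen);           // report the observed value via exit status
    }
    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
        r.seen_by_child = WEXITSTATUS(status);
        r.parent_after = g_counter;  // not affected by the child's write
    }
    return r;
}
```

Both fields of the result are 5: the child sees the value set before the fork, and the child's assignment of 100 never reaches the parent.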
If we use many synchronization variables or the program runs for a long time, we need to release resources (shared memory and semaphores). Shared memory and semaphores are shared between multiple processes, so it is dangerous to release these resources in the destructor; even if the resources are no longer needed in the process that writes a value to the synchronization variable, the process that reads the value still needs them. Therefore, in this library you basically need to release resources explicitly.
For variables allocated on the stack, you need to call free(). Of course, free() should be called only when no other process is referring to the resources. Typically, there is one writer process and one reader process, and free() can be called just after the read is finished. An example of free() is in fib.cc in the samples directory.
If you allocate a synchronization variable using new, not only the value inside the variable but also the memory for Sync<T> itself is stored in the shared memory area. In this case, you can just use delete to release both the shared resources and the memory area for the synchronization variable.
The reason for this design is that I wanted to make it similar to that of SyncList<T>. I will explain SyncList<T> next.
SyncList<T> is used in the generator-consumer pattern. In this pattern, one process creates a list of values (the generator), and another process uses these values (the consumer). By using different processes for generating and consuming lists, pipeline parallel processing becomes possible.
I will explain this using listsample.cc in the samples directory.
int main()
{
    pards_init();
    SyncList<int> *a;          // ...(main:1) declaration of first cell of the list
    a = new SyncList<int>;     // ...(main:2) allocation of the list cell
    SPAWN(generator(a));       // ...(main:3) fork generator process
    SPAWN(consumer(a));        // ...(main:4) fork consumer process
    pards_finalize();
}
First, the first "cell" of the list is declared and allocated at (main:1) and (main:2). Then, the generator process and the consumer process are forked at (main:3) and (main:4). The first cell of the list is passed to both the generator process and the consumer process.
Then, let's see the definition of the generator process.
void generator(SyncList<int> *a)
{
    int i;
    SyncList<int> *current, *nxt;
    current = a;                          // ...(gen:1) assign the argument to current
    for(i = 0; i < 10; i++){
        current->write(i);                // ...(gen:2) write a value to the current list cell
        printf("writer:value = %d\n",i);
        nxt = new SyncList<int>;          // ...(gen:3) allocate new list cell
        current->writecdr(nxt);           // ...(gen:4) set the allocated cell as cdr of the current cell
        current = nxt;                    // ...(gen:5) set the allocated cell as the current cell
        sleep(1);                         // ...(gen:6) "wait" to show the behavior
    }
    current->write(i);
    printf("writer:value = %d\n",i);
    current->writecdr(0);                 // ...(gen:7) terminate the list using 0
}
The generator process creates a list and inserts values into it. As with Sync<T>, a value can be set in a list cell using write() (gen:2).
The next cell of the list is created using new at (gen:3). Then the cell is connected to the previous cell using writecdr() at (gen:4).
Here, a new cell should be created using "new"; don't connect a cell that is allocated on the stack. This is because the consumer process cannot read the memory if the cell is on the stack. The cell allocated using new is stored in the shared memory, so the consumer process can read it.
Because I need to make "new" of SyncList<T> allocate shared memory, I also made "new" of Sync<T> allocate shared memory.
The list is created by repeating the above steps in the for loop. In order to show the behavior, a 1 second wait is inserted at the end of the loop (gen:6). The list is terminated by writing 0 as the last cdr (gen:7).
Then, let's see the definition of the consumer process.
void consumer(SyncList<int> *a)
{
    SyncList<int> *current,*prev;
    current = a;
    while(1){
        printf("reader:value = %d\n", current->read());  // ...(cons:1) read the value of the cell and print it
        prev = current;                                  // ...(cons:2) save the current cell
        current = current->readcdr();                    // ...(cons:3) extract the cdr of the current cell, and make it the current cell
        delete prev;                                     // ...(cons:4) delete the used cell
        if(current == 0) break;                          // ...(cons:5) check the termination
    }
}
The value of the cell is extracted and shown at (cons:1). As with Sync<T>, this read blocks until the value is written.
The current cell is saved at (cons:2). The cdr of the current cell is extracted and becomes the current cell (cons:3). Like read(), readcdr() blocks until the cdr is written.
After the cdr is read, the previous cell is no longer needed. So it is deleted at (cons:4). Here, "delete" releases the memory in the shared memory area, and releases the semaphores.
Lastly, termination is checked at (cons:5).
The output of this program should be like this:
writer:value = 0
reader:value = 0
writer:value = 1
reader:value = 1
writer:value = 2
reader:value = 2
...
The consumer process waits for the writes of the generator process. Therefore, the output above appears at one-second intervals.
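The generator-consumer flow above can be imitated inside one process to see the blocking behavior. The sketch below is an assumption-laden stand-in, not the library: Cell, consume and list_pipeline_demo are hypothetical names, threads replace SPAWNed processes, and ordinary heap memory replaces shared memory. The sleep(1) of the original is left out.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical in-process list cell: a value and a "cdr" link.
// read() and readcdr() block until the matching write has happened.
struct Cell {
    std::mutex m;
    std::condition_variable cv;
    bool has_value = false;
    bool has_cdr = false;
    int value = 0;
    Cell* cdr = nullptr;

    void write(int v) {
        std::lock_guard<std::mutex> lock(m);
        value = v;
        has_value = true;
        cv.notify_all();
    }
    void writecdr(Cell* next) {
        std::lock_guard<std::mutex> lock(m);
        cdr = next;
        has_cdr = true;
        cv.notify_all();
    }
    int read() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return has_value; });
        return value;
    }
    Cell* readcdr() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return has_cdr; });
        return cdr;
    }
};

// Writes 0..count into the list, one cell per value, then terminates it.
void generator(Cell* first, int count) {
    Cell* current = first;
    for (int i = 0; i < count; ++i) {
        current->write(i);
        Cell* next = new Cell;
        current->writecdr(next);  // after this, the consumer may delete current
        current = next;
    }
    current->write(count);
    current->writecdr(nullptr);   // terminate the list
}

// Reads every cell in order and deletes each cell once its cdr is known.
std::vector<int> consume(Cell* first) {
    std::vector<int> seen;
    Cell* current = first;
    while (current != nullptr) {
        seen.push_back(current->read());
        Cell* prev = current;
        current = current->readcdr();
        delete prev;
    }
    return seen;
}

std::vector<int> list_pipeline_demo(int count) {
    Cell* first = new Cell;
    std::thread producer(generator, first, count);
    std::vector<int> seen = consume(first);
    producer.join();
    return seen;
}
```

For example, list_pipeline_demo(3) returns the values 0, 1, 2, 3 in that order, because each read waits for the corresponding write regardless of how the two threads are scheduled.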
Since the list creation and consumption described above is a typical pattern, I prepared abbreviated notations that reduce the amount of code.
Firstly, the operation "create a new list cell, and connect it to the current list cell" is described as follows:
nxt = new SyncList<int>;
current->writecdr(nxt);
current = nxt;
In order to describe this concisely, there is a create() member function that "creates new SyncList<T> variable, which is connected to the target object, and the newly created variable is returned". Using this member function, the above example can be described as follows:
current = current->create();
Now, the temporary variable nxt is no longer needed.
Next, the operation "extract cdr from the current cell, and make this as the current cell and delete the previous cell" is described as follows:
prev = current;
current = current->readcdr();
delete prev;
In order to describe this concisely, there is a release() member function that "extracts the cdr, deletes the cell, and then returns the cdr". Using this, the above example can be written as follows:
current = current->release();
Using these abbreviated notations, you can write programs concisely. An example that uses these notations is in listsample2.cc.
In the previous example, a "wait" is inserted on the generator's side. It is fine to use SyncList if the generator is the bottleneck. However, if the consumer is slower than the generator, the system might run short of resources, because cells are created faster than the consumer releases them.
To avoid this problem, we need to block the generator's execution until the consumer's release is done. SyncQueue<T> provides this functionality.
SyncQueue<T> is almost the same as SyncList<T>, but it accepts the length of the "Queue" as the argument of the constructor.
a = new SyncQueue<int>(2);
If a SyncQueue is declared like this, the system limits the number of operations that connect a "cdr" to the queue to 2. If the program tries to connect more cells to the queue, the operation blocks. When a cell connected to the queue is released by "delete" or "release", the remaining count increases, and a waiting operation, if any, resumes. This means that at most 3 cells can exist at a time.
Only the first cell requires the number in the constructor. After that, other cells can be allocated in the same way as with SyncList<T>.
In addition, unlike SyncList<T>, a cell cannot be set as the cdr of multiple cells; the system detects this and outputs an error.
An example of SyncQueue<T> is in queuesample.cc. It is listsample2.cc changed so that SyncList<T> is replaced by SyncQueue<T> and the consumer's side waits. The output of this program should be like this:
writer:value = 0
writer:value = 1
writer:value = 2
reader:value = 0
writer:value = 3
reader:value = 1
writer:value = 4
...
First, because the generator's side does not wait, 0, 1, 2 are shown. Then after the consumer side shows 0 and releases the cell, the generator resumes execution and shows 3, and so on.
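The counting rule behind this output can be modeled with a credit counter. The sketch below is a simplified, hypothetical model (CreditedQueue and max_live_cells are invented names, and threads replace processes); it assumes that the first cell is free, that every further cell costs one credit, and that releasing a cell returns one credit, which is how the description above reads.

```cpp
#include <algorithm>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical model of the SyncQueue limit using credits.
class CreditedQueue {
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<int> cells_;  // cells that currently exist, oldest first
    int credits_;            // how many more cells may still be connected
    bool first_ = true;      // the first cell does not cost a credit
    bool closed_ = false;
    int max_live_ = 0;       // largest number of cells alive at once

public:
    explicit CreditedQueue(int limit) : credits_(limit) {}

    // Generator side: blocks while no credit is left.
    void push(int v) {
        std::unique_lock<std::mutex> lock(m_);
        if (first_) {
            first_ = false;
        } else {
            cv_.wait(lock, [this] { return credits_ > 0; });
            --credits_;
        }
        cells_.push_back(v);
        max_live_ = std::max(max_live_, static_cast<int>(cells_.size()));
        cv_.notify_all();
    }

    // Generator side: marks the end of the list.
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        cv_.notify_all();
    }

    // Consumer side: takes the oldest cell and returns its credit.
    // Returns false once the queue is closed and empty.
    bool pop(int& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !cells_.empty() || closed_; });
        if (cells_.empty()) return false;
        out = cells_.front();
        cells_.pop_front();
        ++credits_;
        cv_.notify_all();
        return true;
    }

    int max_live() {
        std::lock_guard<std::mutex> lock(m_);
        return max_live_;
    }
};

// Runs a generator and a consumer; returns the peak number of live cells,
// or -1 if not every value arrived.
int max_live_cells(int limit, int count) {
    CreditedQueue q(limit);
    std::thread producer([&q, count] {
        for (int i = 0; i < count; ++i) q.push(i);
        q.close();
    });
    int v = 0;
    int received = 0;
    while (q.pop(v)) ++received;
    producer.join();
    return received == count ? q.max_live() : -1;
}
```

With a limit of 2, the peak reported by max_live_cells never exceeds 3, which corresponds to the "at most 3 cells" statement above. The exact peak depends on thread scheduling.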
If you use only SyncList and SyncQueue, there might be cases where you need to call SPAWN very frequently. The cost of process invocation is not that large in recent OSes, but you might still want to reduce the number of process invocations.
Therefore, I prepared a class that supports the following pattern: worker processes are invoked once at the beginning (for example, as many as there are processors), and they get their work from a "work pool".
I will explain how to use the class using workpoolsample.cc in the "samples" directory.
int main()
{
    pards_init();
    SyncQueue<int> *work = new SyncQueue<int>(2);    // ...(main:1) work queue
    SyncQueue<int> *output = new SyncQueue<int>(2);  // ...(main:2) output queue
    WorkPool<int,int> *pool = new WorkPool<int,int>(work,output);  // ...(main:3) definition of WorkPool
    SPAWN(generator(work));                          // ...(main:4) creation of work
    SPAWN(worker(1, pool));                          // ...(main:5) fork worker1
    SPAWN(worker(2, pool));                          // ...(main:6) fork worker2
    while(1){
        printf("%d...\n",output->read());            // ...(main:7) show output queue
        if((output = output->release()) == 0) break; // ...(main:8) get the next cell & check termination
    }
    pards_finalize();
}
First, a SyncQueue named work, which holds the work items, and a SyncQueue named output, which holds the outputs, are defined (main:1) (main:2). You can use SyncList instead of SyncQueue.
Next, "pool", whose type is WorkPool, is defined (main:3). T1 and T2 of WorkPool<T1,T2> are the element type of the work queue (int) and the element type of the output queue (int), respectively. The constructor takes the work queue and the output queue as arguments.
Here, the SyncQueue (SyncList) for work and the SyncQueue (SyncList) for output are treated as a pair; when you get a work cell from the pool, you also get the output cell for that work.
This enables us to get the output in order, even if the work is processed out of order by different processes.
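The pairing idea can be shown in a few lines: if every work item carries its own output slot, the reader can walk the slots in order no matter which worker finishes first. The following is a thread-based sketch using std::promise and std::future as the paired slots; ordered_outputs is a hypothetical name, and the way work is claimed (an atomic counter) is an assumption made for brevity, not how WorkPool hands out work.

```cpp
#include <atomic>
#include <future>
#include <thread>
#include <vector>

// Doubles the values 0..items-1 with several workers and returns the
// results in input order, even if the workers finish out of order.
std::vector<int> ordered_outputs(int items, int workers) {
    // One output slot per work item; the pair is fixed up front.
    std::vector<std::promise<int>> slots(items);
    std::vector<std::future<int>> outputs;
    for (auto& s : slots) outputs.push_back(s.get_future());

    std::atomic<int> next(0);  // index of the next unclaimed work item
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&slots, &next, items] {
            for (;;) {
                int i = next.fetch_add(1);  // claim a work item
                if (i >= items) break;      // no work left
                slots[i].set_value(i * 2);  // write to the paired slot
            }
        });
    }

    // Reading the slots in order blocks on each one until it is written.
    std::vector<int> out;
    for (auto& f : outputs) out.push_back(f.get());
    for (auto& t : pool) t.join();
    return out;
}
```

For instance, ordered_outputs(6, 2) returns 0, 2, 4, 6, 8, 10 in that order, whichever of the two workers processed each item.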
In (main:5) and (main:6), worker processes are SPAWNed with the work pool as one of their arguments. The other argument is the id of the worker.
(main:7), (main:8) show the result.
Next, let's see the definition of the worker.
void worker(int id, WorkPool<int,int> *workpool)
{
    while(1){
        WorkItem<int,int> item = workpool->getwork();  // (worker:1) get the work
        if(item.output == 0) break;                    // (worker:2) check termination
        else{
            item.output->write(item.work*2);           // (worker:3) double the value and write to the output
            printf("(%d by [%d])\n",item.work*2,id);   // (worker:4) print the worker id
        }
    }
}
First, the work is obtained from the WorkPool in (worker:1). Here, the type of the work is WorkItem. T1 and T2 of WorkItem<T1,T2> are the same as those of WorkPool<T1,T2>.
You can get the "work" whose type is WorkItem by calling the getwork() member function. WorkItem includes the work and the output cell for the work.
Here, the release and creation of cells are handled inside getwork(). Therefore, users of WorkPool don't have to worry about the release/creation of cells.
The variable item whose type is WorkItem includes a member "output" that has the output cell. You can check if the work pool is terminated or not by checking if it is 0 (worker:2).
The variable item (whose type is WorkItem) includes a member "work" whose type is T1 (in this case, int). In (worker:3) this value is doubled and written to the output cell.
In (worker:4), the value written to the output cell and the worker id are printed, in order to show which worker did the job. You can see that the work items are processed by multiple workers.
The output of this program should be like this:
(0 by [1])
(2 by [1])
0...
2...
(4 by [1])
(6 by [2])
4...
6...
(8 by [1])
(10 by [2])
8...
10...
The integers "0, 1, 2, ..." are doubled and shown like "0... 2... 4...". In addition, which worker did the work is shown like (2 by [1]). The worker id may differ from run to run.
In the above example, the WorkPool variable was not deleted. It is difficult to decide when it is safe to delete it; if it is deleted just after all the output has been extracted, there may still be workers that have not reached the termination check, and they would cause errors by using the released resources.
To avoid this, you can specify the number of workers as a reference count in the last argument of the constructor. In addition, each worker calls the release() member function after the termination check. This decrements the reference count and tells the system that the worker has terminated.
When delete is called on the WorkPool variable, this counter is checked; delete blocks until all the workers have terminated.
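The blocking delete can be pictured as a countdown that the destructor waits on. The sketch below is a thread-based illustration under that assumption; PoolGuard and workers_finished_before_delete are hypothetical names and do not reflect how WorkPool is actually implemented.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the reference count of a work pool.
class PoolGuard {
    std::mutex m_;
    std::condition_variable cv_;
    int remaining_;  // workers that have not called release() yet

public:
    explicit PoolGuard(int workers) : remaining_(workers) {}

    // Called by each worker after its termination check.
    void release() {
        std::lock_guard<std::mutex> lock(m_);
        --remaining_;
        cv_.notify_all();
    }

    // Destruction blocks until every worker has called release().
    ~PoolGuard() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return remaining_ <= 0; });
    }
};

// Returns how many workers had finished by the time delete returned.
int workers_finished_before_delete(int workers) {
    std::atomic<int> finished(0);
    PoolGuard* guard = new PoolGuard(workers);
    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w) {
        threads.emplace_back([guard, &finished] {
            ++finished;        // stands in for the worker's last use of the pool
            guard->release();  // tell the guard this worker is done
        });
    }
    delete guard;  // blocks until all workers have released
    int seen = finished.load();
    for (auto& t : threads) t.join();
    return seen;
}
```

Because delete cannot return before the last release(), workers_finished_before_delete(4) returns 4: no worker can still be using the pool once it is gone.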
This example is written in workpoolsample.cc as comments.
If the counter is not specified in the constructor, you can delete the variable without calling release(). However, this is a dangerous operation in general. In addition, you can call release() even if the counter is not specified in the constructor.
If a WorkPool variable is allocated on the stack, you need to call free() to release the resources (same as Sync). The above mechanism also applies in this case.