In earlier PostgreSQL versions, and by default in the current version, the pgbench program itself runs as a single process on the system. It queues up statements for each client to execute, then feeds each client a new statement once it reports having finished the previous one. That single process can become a bottleneck when a large number of clients are active. A common workaround has been to run the pgbench program itself on another system, which unfortunately adds its own overhead too.
The situation is particularly bad on Linux systems running kernel 2.6.23 or later. The Completely Fair Scheduler (CFS) introduced in that version is not very fair to pgbench when it runs in its standard mode using a socket connection. The result is that throughput only scales into the low tens of thousands of transactions per second, as if only one database core or processor were working correctly. Note how much multiple threads sped up the initial read-only sample shown in the preceding text: the increase from 17 K to 31 K TPS clearly shows that the single pgbench worker was the bottleneck there. The worst manifestation of this problem is usually avoided when using pgbench-tools, as that always uses the TCP/IP host-based connection scheme instead.
Starting from PostgreSQL 9.0, you can work around these problems by increasing the number of worker threads that pgbench runs in parallel to talk to clients. Each worker is assigned an equal share of clients that it opens and manages, splitting up the client load so that no single process has to service them all. Because of the way the program is written, each worker must have the same number of clients assigned to it: the client count must be a multiple of the worker count, so choose both carefully. If the requested number of clients cannot be divided equally among the specified workers, the program will abort complaining of bad input rather than run.
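The divisibility rule can be sketched as follows. This is a simplified model of the check, not pgbench's actual source, and the client and thread counts are arbitrary examples:

```shell
# Simplified sketch of pgbench's input validation (not its real source):
# the client count (-c) must divide evenly among the worker threads (-j).
clients=8
threads=4
if [ $((clients % threads)) -ne 0 ]; then
    echo "error: -c $clients is not a multiple of -j $threads" >&2
    exit 1
fi
echo "each of the $threads workers manages $((clients / threads)) clients"
```

With 8 clients and 4 workers the split is even, so each worker manages 2 clients; asking for 10 clients with 4 workers would trip the check and abort instead.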
Generally, you'll want one worker thread per core on your system, if you have a version in which this capability exists. pgbench-tools allows you to set a maximum worker count while aiming to keep the clients evenly divisible among workers; see the pgbench documentation for details. Using multiple workers is usually good for at least a 15% speedup over using only a single worker, and as suggested earlier, increases of over 100% are possible. If you create too many workers, for example one worker per client when the number of clients far exceeds the system's cores, you can end up right back where excessive switching between processes on the system limits performance again.
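Putting those guidelines together, a plausible invocation on a 4-core server might look like the following. The core count, the choice of 8 clients per worker, and the 300-second duration are all arbitrary assumptions for illustration; the command is printed rather than executed so the sketch runs without a database:

```shell
# One worker thread (-j) per core on an assumed 4-core server, with a
# client count (-c) chosen as a multiple of the worker count.
cores=4
clients=$((cores * 8))   # 32 clients, 8 per worker
echo "pgbench -S -c $clients -j $cores -T 300 pgbench"
```

Here -S selects pgbench's read-only, select-only test mode, the same kind of workload as the read-only sample discussed in the preceding text.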