The soft delay is a good way to handle rate limiting because it is the least disruptive to most users. However, if you're running an API-based service and want the requesting application to be notified immediately of any request that exceeds the limit, you can add the nodelay parameter to the limit_req directive. Consider this example:
limit_req zone=basiclimit burst=5 nodelay;
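For context, here is a minimal sketch of how that directive might sit in a full configuration, assuming the basiclimit zone was defined earlier with limit_req_zone (the zone size, rate, listen port, and upstream address below are placeholders for illustration, not values from this article):

http {
    # Shared memory zone keyed on client IP (name matches the article; size and rate are assumptions)
    limit_req_zone $binary_remote_addr zone=basiclimit:10m rate=1r/s;

    server {
        listen 80;

        location / {
            # Allow a burst of 5 requests, then reject excess requests immediately
            limit_req zone=basiclimit burst=5 nodelay;

            proxy_pass http://127.0.0.1:8080;
        }
    }
}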
Instead of being queued, requests that exceed the limit are immediately rejected with a 503 (Service Unavailable) HTTP error. If we rerun the same initial Apache Benchmark test (even with a single connection), we now see this:
Concurrency Level:      1
Time taken for tests:   4.306 seconds
Complete requests:      200
Failed requests:        152
   (Connect: 0, Receive: 0, Length: 152, Exceptions: 0)
Non-2xx responses:      152
Total transferred:      1387016 bytes
HTML transferred:       1343736 bytes
Requests per second:    46.45 [#/sec] (mean)
Time per request:       21.529 [ms] (mean)
Time per request:       21.529 [ms] (mean, across all concurrent requests)
Transfer rate:          314.58 [Kbytes/sec] received
Not all of our requests returned a 200 status; any request over the limit immediately received a 503 instead. The benchmark still completed roughly 46 requests per second overall, but 152 of the 200 requests were 503 errors, leaving only 48 successful responses.
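For reference, the benchmark output above can be reproduced with an ab invocation along these lines; the hostname and path are placeholders, and the request count of 200 simply matches the output shown:

ab -c 1 -n 200 http://example.com/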