These benchmarks are mainly used to identify bottlenecks, not to showcase the performance of the framework. Proper throughput and latency benchmarks are TBD.
Setup: AMD 7950X, 6000MHz@CL38 RAM, Ubuntu 22.04, gcc 11.3
Each cell shows the multi-threaded run: runtime / peak memory usage.

| Compile Options | coroutines | events | start | start & stop |
|---|---|---|---|---|
| std containers, std alloc | 964,542 µs / 5 MB | 2,291,967 µs / 5,760 MB | 5,392,778 µs / 373 MB | 2,270,025 µs / 5 MB |
| absl containers, std alloc | 1,308,459 µs / 5 MB | 1,730,601 µs / 3,722 MB | 2,802,061 µs / 303 MB | 2,486,035 µs / 5 MB |
| std containers, mimalloc | 858,882 µs / 5 MB | 802,964 µs / 4,383 MB | 4,494,822 µs / 366 MB | 1,704,202 µs / 5 MB |
| absl containers, mimalloc | 1,044,875 µs / 5 MB | 896,118 µs / 3,116 MB | 2,592,989 µs / 312 MB | 2,024,756 µs / 6 MB |
Detailed data:

```
../std_std/ichor_coroutine_benchmark single threaded ran for 519,920 µs with 4,194,304 peak memory usage 9,616,864 coroutines/s
../std_std/ichor_coroutine_benchmark multi threaded ran for 964,542 µs with 4,718,592 peak memory usage 41,470,459 coroutines/s
../std_std/ichor_event_benchmark single threaded ran for 1,042,089 µs with 644,349,952 peak memory usage 4,798,054 events/s
../std_std/ichor_event_benchmark multi threaded ran for 2,291,967 µs with 5,760,876,544 peak memory usage 17,452,258 events/s
../std_std/ichor_serializer_benchmark single threaded rapidjson ran for 685,394 µs with 4,456,448 peak memory usage 226 MB/s
../std_std/ichor_serializer_benchmark multi threaded rapidjson ran for 731,981 µs with 4,718,592 peak memory usage 1,694 MB/s
../std_std/ichor_serializer_benchmark single threaded boost.JSON ran for 1,004,706 µs with 4,718,592 peak memory usage 149 MB/s
../std_std/ichor_serializer_benchmark multi threaded boost.JSON ran for 1,234,318 µs with 4,718,592 peak memory usage 972 MB/s
../std_std/ichor_start_benchmark single threaded advanced injection ran for 2,801,849 µs with 45,563,904 peak memory usage
../std_std/ichor_start_benchmark multi threaded advanced injection ran for 5,392,778 µs with 372,981,760 peak memory usage
../std_std/ichor_start_benchmark single threaded constructor injection ran for 3,532,298 µs with 372,981,760 peak memory usage
../std_std/ichor_start_benchmark multi threaded constructor injection ran for 6,355,558 µs with 390,033,408 peak memory usage
../std_std/ichor_start_stop_benchmark single threaded ran for 1,374,069 µs with 4,194,304 peak memory usage 727,765 start & stop /s
../std_std/ichor_start_stop_benchmark multi threaded ran for 2,270,025 µs with 4,718,592 peak memory usage 3,524,190 start & stop /s
../std_mimalloc/ichor_coroutine_benchmark single threaded ran for 396,122 µs with 4,718,592 peak memory usage 12,622,373 coroutines/s
../std_mimalloc/ichor_coroutine_benchmark multi threaded ran for 858,882 µs with 4,718,592 peak memory usage 46,572,171 coroutines/s
../std_mimalloc/ichor_event_benchmark single threaded ran for 559,430 µs with 566,755,328 peak memory usage 8,937,668 events/s
../std_mimalloc/ichor_event_benchmark multi threaded ran for 802,964 µs with 4,383,309,824 peak memory usage 49,815,433 events/s
../std_mimalloc/ichor_serializer_benchmark single threaded rapidjson ran for 499,905 µs with 4,718,592 peak memory usage 310 MB/s
../std_mimalloc/ichor_serializer_benchmark multi threaded rapidjson ran for 598,100 µs with 5,242,880 peak memory usage 2,073 MB/s
../std_mimalloc/ichor_serializer_benchmark single threaded boost.JSON ran for 925,410 µs with 5,242,880 peak memory usage 162 MB/s
../std_mimalloc/ichor_serializer_benchmark multi threaded boost.JSON ran for 976,667 µs with 5,242,880 peak memory usage 1,228 MB/s
../std_mimalloc/ichor_start_benchmark single threaded advanced injection ran for 2,087,926 µs with 44,814,336 peak memory usage
../std_mimalloc/ichor_start_benchmark multi threaded advanced injection ran for 4,494,822 µs with 365,953,024 peak memory usage
../std_mimalloc/ichor_start_benchmark single threaded constructor injection ran for 2,540,647 µs with 368,238,592 peak memory usage
../std_mimalloc/ichor_start_benchmark multi threaded constructor injection ran for 5,312,124 µs with 386,310,144 peak memory usage
../std_mimalloc/ichor_start_stop_benchmark single threaded ran for 1,019,242 µs with 4,718,592 peak memory usage 981,121 start & stop /s
../std_mimalloc/ichor_start_stop_benchmark multi threaded ran for 1,704,202 µs with 4,718,592 peak memory usage 4,694,279 start & stop /s
../absl_std/ichor_coroutine_benchmark single threaded ran for 544,185 µs with 4,456,448 peak memory usage 9,188,051 coroutines/s
../absl_std/ichor_coroutine_benchmark multi threaded ran for 1,308,459 µs with 4,718,592 peak memory usage 30,570,312 coroutines/s
../absl_std/ichor_event_benchmark single threaded ran for 529,475 µs with 417,595,392 peak memory usage 9,443,316 events/s
../absl_std/ichor_event_benchmark multi threaded ran for 1,730,601 µs with 3,721,920,512 peak memory usage 23,113,357 events/s
../absl_std/ichor_serializer_benchmark single threaded rapidjson ran for 837,304 µs with 4,456,448 peak memory usage 185 MB/s
../absl_std/ichor_serializer_benchmark multi threaded rapidjson ran for 769,968 µs with 4,718,592 peak memory usage 1,610 MB/s
../absl_std/ichor_serializer_benchmark single threaded boost.JSON ran for 997,978 µs with 4,980,736 peak memory usage 150 MB/s
../absl_std/ichor_serializer_benchmark multi threaded boost.JSON ran for 1,041,410 µs with 4,980,736 peak memory usage 1,152 MB/s
../absl_std/ichor_start_benchmark single threaded advanced injection ran for 2,524,460 µs with 37,855,232 peak memory usage
../absl_std/ichor_start_benchmark multi threaded advanced injection ran for 2,802,061 µs with 302,882,816 peak memory usage
../absl_std/ichor_start_benchmark single threaded constructor injection ran for 3,118,810 µs with 302,882,816 peak memory usage
../absl_std/ichor_start_benchmark multi threaded constructor injection ran for 3,109,533 µs with 311,435,264 peak memory usage
../absl_std/ichor_start_stop_benchmark single threaded ran for 1,413,799 µs with 4,456,448 peak memory usage 707,314 start & stop /s
../absl_std/ichor_start_stop_benchmark multi threaded ran for 2,486,035 µs with 4,718,592 peak memory usage 3,217,975 start & stop /s
../absl_mimalloc/ichor_coroutine_benchmark single threaded ran for 417,400 µs with 4,718,592 peak memory usage 11,978,917 coroutines/s
../absl_mimalloc/ichor_coroutine_benchmark multi threaded ran for 1,044,875 µs with 4,718,592 peak memory usage 38,282,091 coroutines/s
../absl_mimalloc/ichor_event_benchmark single threaded ran for 458,921 µs with 414,187,520 peak memory usage 10,895,121 events/s
../absl_mimalloc/ichor_event_benchmark multi threaded ran for 896,118 µs with 3,116,367,872 peak memory usage 44,636,978 events/s
../absl_mimalloc/ichor_serializer_benchmark single threaded rapidjson ran for 577,368 µs with 4,718,592 peak memory usage 268 MB/s
../absl_mimalloc/ichor_serializer_benchmark multi threaded rapidjson ran for 613,906 µs with 5,242,880 peak memory usage 2,019 MB/s
../absl_mimalloc/ichor_serializer_benchmark single threaded boost.JSON ran for 950,804 µs with 5,242,880 peak memory usage 157 MB/s
../absl_mimalloc/ichor_serializer_benchmark multi threaded boost.JSON ran for 973,536 µs with 5,242,880 peak memory usage 1,232 MB/s
../absl_mimalloc/ichor_start_benchmark single threaded advanced injection ran for 2,249,762 µs with 39,104,512 peak memory usage
../absl_mimalloc/ichor_start_benchmark multi threaded advanced injection ran for 2,592,989 µs with 311,881,728 peak memory usage
../absl_mimalloc/ichor_start_benchmark single threaded constructor injection ran for 2,852,259 µs with 312,254,464 peak memory usage
../absl_mimalloc/ichor_start_benchmark multi threaded constructor injection ran for 2,803,249 µs with 314,540,032 peak memory usage
../absl_mimalloc/ichor_start_stop_benchmark single threaded ran for 1,101,855 µs with 4,718,592 peak memory usage 907,560 start & stop /s
../absl_mimalloc/ichor_start_stop_benchmark multi threaded ran for 2,024,756 µs with 5,505,024 peak memory usage 3,951,093 start & stop /s
```
Realtime example on a vanilla Linux kernel:

```
root# echo 950000 > /proc/sys/kernel/sched_rt_runtime_us # force the kernel to fail our deadlines
root# ../bin/ichor_realtime_example
duration of run 9,848 is 50,104 µs which exceeded maximum of 2,000 µs
duration of run 16,464 is 51,101 µs which exceeded maximum of 2,000 µs
duration min/max/avg: 142/51,101/147 µs
root# echo 999999 > /proc/sys/kernel/sched_rt_runtime_us
root# ../bin/ichor_realtime_example
duration min/max/avg: 142/327/142 µs
root# echo -1 > /proc/sys/kernel/sched_rt_runtime_us
root# ../bin/ichor_realtime_example
duration min/max/avg: 142/328/142 µs
```
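These numbers depend on the example running under a realtime scheduling policy, which is exactly what `sched_rt_runtime_us` throttles. As a minimal sketch of the mechanism (illustration only, not code from `ichor_realtime_example`; requires root or `CAP_SYS_NICE`), a thread can request `SCHED_FIFO` like this:

```cpp
#include <cstdio>
#include <cstring>
#include <pthread.h>
#include <sched.h>

int main() {
    sched_param param{};
    param.sched_priority = 50; // SCHED_FIFO priorities range from 1 to 99

    // pthread functions return the error number instead of setting errno.
    int ret = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
    if (ret != 0) {
        std::fprintf(stderr, "pthread_setschedparam: %s\n", std::strerror(ret));
        return 1;
    }

    // The latency-sensitive loop would run here. With sched_rt_runtime_us at
    // 950000, the kernel reserves 5% of every second for non-realtime tasks,
    // preempting this thread and causing spikes like the ones shown above.
    return 0;
}
```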
Some HTTP benchmarks with different frameworks:

Ichor with Boost.Beast:

```
$ mkdir build && cd build && cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DICHOR_ARCH_OPTIMIZATION=X86_64_AVX2 -DICHOR_USE_ABSEIL=ON -DICHOR_USE_BOOST_BEAST=ON .. && ninja
$ ulimit -n 65535 && ../bin/ichor_pong_example -s &
$ ulimit -n 65535 && ../bin/ichor_pong_example -s &
$ wrk -c2 -t1 -d60s --latency -s ../benchmarks/wrk.lua http://localhost:8001/ping
Running 1m test @ http://localhost:8001/ping
1 threads and 2 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 29.17us 4.49us 1.36ms 92.57%
Req/Sec 67.51k 3.69k 80.86k 77.37%
Latency Distribution
50% 29.00us
75% 30.00us
90% 32.00us
99% 39.00us
4038387 requests in 1.00m, 427.49MB read
Requests/sec: 67194.72
Transfer/sec: 7.11MB
$ wrk -c400 -t2 -d60s --latency -s ../benchmarks/wrk.lua http://localhost:8001/ping
Running 1m test @ http://localhost:8001/ping
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.59ms 256.25us 9.76ms 79.41%
Req/Sec 43.74k 2.46k 58.80k 92.17%
Latency Distribution
50% 4.61ms
75% 4.66ms
90% 4.88ms
99% 4.96ms
5223768 requests in 1.00m, 552.98MB read
Requests/sec: 87048.80
Transfer/sec: 9.21MB
$ wrk -c20000 -t4 -d60s --latency -s ../benchmarks/wrk.lua http://localhost:8001/ping
Running 1m test @ http://localhost:8001/ping
4 threads and 20000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 270.41ms 21.43ms 879.87ms 95.93%
Req/Sec 18.38k 2.01k 24.18k 72.12%
Latency Distribution
50% 272.47ms
75% 274.42ms
90% 276.31ms
99% 281.80ms
4382499 requests in 1.00m, 463.92MB read
Requests/sec: 72971.55
Transfer/sec: 7.72MB
top -H -d2
PID PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
67529 20 0 2653500 307896 5356 R 99.9 0.5 0:04.35 HttpCon #3
67528 20 0 2653500 307896 5356 S 33.3 0.5 0:01.43 DepMan #0
67534 20 0 573972 99932 3940 S 22.9 0.2 0:01.04 wrk
67533 20 0 573972 99932 3940 S 20.9 0.2 0:00.95 wrk
67535 20 0 573972 99932 3940 R 20.4 0.2 0:00.92 wrk
67536 20 0 573972 99932 3940 S 20.4 0.2 0:00.93 wrk
```
```
$ cat wrk.lua
wrk.method = "POST"
-- post json data
wrk.body = '{"sequence": 2}'
wrk.headers['Content-Type'] = "application/json"
```
Pure Boost.Beast (see beast_benchmark):

```
$ wrk -c2 -t1 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:6000/ping
Running 10s test @ http://localhost:6000/ping
1 threads and 2 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 20.80us 13.58us 413.00us 97.16%
Req/Sec 41.63k 6.48k 54.91k 73.27%
Latency Distribution
50% 17.00us
75% 30.00us
90% 33.00us
99% 50.00us
418099 requests in 10.10s, 39.87MB read
Socket errors: connect 0, read 418099, write 0, timeout 0
Requests/sec: 41397.00
Transfer/sec: 3.95MB
$ wrk -c400 -t2 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:6000/ping
Running 10s test @ http://localhost:6000/ping
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.33ms 563.26us 10.10ms 77.69%
Req/Sec 37.23k 2.68k 41.64k 74.50%
Latency Distribution
50% 5.20ms
75% 5.39ms
90% 6.05ms
99% 6.45ms
740671 requests in 10.01s, 70.64MB read
Socket errors: connect 0, read 740671, write 0, timeout 0
Requests/sec: 73975.13
Transfer/sec: 7.05MB
$ wrk -c20000 -t4 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:6000/ping
Running 10s test @ http://localhost:6000/ping
4 threads and 20000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 95.10ms 141.27ms 1.73s 90.00%
Req/Sec 15.01k 8.27k 28.45k 62.82%
Latency Distribution
50% 51.59ms
75% 52.46ms
90% 255.04ms
99% 709.07ms
374703 requests in 10.04s, 35.73MB read
Socket errors: connect 0, read 374703, write 0, timeout 223
Requests/sec: 37311.37
Transfer/sec: 3.56MB
```
```
$ cat express.js
var express = require('express');
var app = express();
app.use(express.json());
app.post('/ping', function(req, res){
    res.json(req.body);
});
app.listen(3000);
console.log('Listening on port 3000');
```
```
$ wrk -c 2 -t1 -d 10s --latency -s ./wrk.lua http://localhost:3000/ping
Running 10s test @ http://localhost:3000/ping
1 threads and 2 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 155.23us 183.51us 7.54ms 95.86%
Req/Sec 15.18k 1.65k 16.62k 94.06%
Latency Distribution
50% 121.00us
75% 132.00us
90% 174.00us
99% 1.04ms
152519 requests in 10.10s, 36.07MB read
Requests/sec: 15101.66
Transfer/sec: 3.57MB
$ wrk -c20000 -t4 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:3000/ping
Running 10s test @ http://localhost:3000/ping
4 threads and 20000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 462.34ms 168.32ms 1.38s 60.19%
Req/Sec 3.34k 1.02k 7.21k 75.76%
Latency Distribution
50% 491.98ms
75% 524.07ms
90% 668.50ms
99% 1.07s
131805 requests in 10.05s, 31.17MB read
Socket errors: connect 0, read 0, write 0, timeout 3652
Requests/sec: 13118.38
Transfer/sec: 3.10MB
```
```
$ cmake -DCMAKE_BUILD_TYPE=Release -DUSE_COROUTINE=ON .. # building drogon
$ cat drogon.cpp
#include <drogon/drogon.h>

using namespace drogon;

int main()
{
    app()
        .setLogPath("./")
        .setLogLevel(trantor::Logger::kWarn)
        .addListener("0.0.0.0", 7770)
        .setThreadNum(1);
    app().registerHandler("/ping",
        [](const HttpRequestPtr &req, std::function<void(const HttpResponsePtr &)> &&callback) {
            auto reqjson = req->getJsonObject();
            Json::Value json;
            json["sequence"] = (*reqjson)["sequence"].asInt64();
            auto resp = HttpResponse::newHttpJsonResponse(json);
            callback(resp);
        },
        {Post});
    app().run();
}
```
```
$ wrk -c2 -t1 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:7770/ping
Running 10s test @ http://localhost:7770/ping
1 threads and 2 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 15.48us 14.26us 1.28ms 99.32%
Req/Sec 133.01k 17.28k 194.91k 91.09%
Latency Distribution
50% 15.00us
75% 15.00us
90% 16.00us
99% 24.00us
1336011 requests in 10.10s, 202.58MB read
Requests/sec: 132285.77
Transfer/sec: 20.06MB
$ wrk -c400 -t2 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:7770/ping
Running 10s test @ http://localhost:7770/ping
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.00ms 140.69us 6.12ms 92.01%
Req/Sec 66.88k 1.88k 68.83k 98.00%
Latency Distribution
50% 2.99ms
75% 3.01ms
90% 3.08ms
99% 3.33ms
1330911 requests in 10.01s, 201.81MB read
Requests/sec: 132936.53
Transfer/sec: 20.16MB
$ wrk -c20000 -t4 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:7770/ping
Running 10s test @ http://localhost:7770/ping
4 threads and 20000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 64.98ms 11.16ms 1.98s 99.19%
Req/Sec 30.07k 8.92k 66.73k 76.26%
Latency Distribution
50% 65.11ms
75% 65.71ms
90% 66.18ms
99% 68.04ms
1185278 requests in 10.07s, 179.73MB read
Socket errors: connect 0, read 2094, write 0, timeout 120
Requests/sec: 117676.17
Transfer/sec: 17.84MB
```
```
$ cat main.go
package main

import (
    "flag"
    "log"

    "github.com/goccy/go-json"
    "github.com/valyala/fasthttp"
)

var (
    addr = flag.String("addr", ":9000", "TCP address to listen to")
)

type Sequence struct {
    Sequence uint64
}

func main() {
    flag.Parse()
    h := requestHandler
    if err := fasthttp.ListenAndServe(*addr, h); err != nil {
        log.Fatalf("Error in ListenAndServe: %v", err)
    }
}

func requestHandler(ctx *fasthttp.RequestCtx) {
    ctx.SetContentType("application/json; charset=utf8")
    var seq Sequence
    json.Unmarshal(ctx.Request.Body(), &seq)
    data, _ := json.Marshal(seq)
    ctx.SetBody(data)
}
```
```
$ GOMAXPROCS=1 ./fasthttp_bench &
$ wrk -c2 -t1 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:9000/ping
Running 10s test @ http://localhost:9000/ping
1 threads and 2 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 12.43us 11.95us 1.04ms 99.32%
Req/Sec 166.51k 6.24k 172.03k 95.05%
Latency Distribution
50% 12.00us
75% 12.00us
90% 13.00us
99% 18.00us
1673959 requests in 10.10s, 245.85MB read
Requests/sec: 165747.41
Transfer/sec: 24.34MB
$ wrk -c400 -t2 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:9000/ping
Running 10s test @ http://localhost:9000/ping
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.36ms 582.61us 17.77ms 75.14%
Req/Sec 85.68k 5.74k 106.14k 88.00%
Latency Distribution
50% 2.32ms
75% 2.71ms
90% 2.98ms
99% 3.46ms
1704231 requests in 10.02s, 250.29MB read
Requests/sec: 170028.45
Transfer/sec: 24.97MB
$ wrk -c20000 -t4 -d10s --latency -s ../benchmarks/wrk.lua http://localhost:9000/ping
Running 10s test @ http://localhost:9000/ping
4 threads and 20000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 159.97ms 48.79ms 499.17ms 94.48%
Req/Sec 31.58k 6.40k 38.00k 90.66%
Latency Distribution
50% 150.91ms
75% 151.65ms
90% 152.37ms
99% 467.35ms
1243939 requests in 10.05s, 182.69MB read
Requests/sec: 123741.19
Transfer/sec: 18.17MB
```
These benchmarks currently suggest the following characteristics for Ichor:
- The overhead of creating services with dependencies is likely O(N²) (see the measurement sketch below).
- The overhead of starting and stopping services is likely O(N).
- Event handling overhead is amortized O(1).
- Adding more threads does not scale 100% linearly in all cases (pure event creation/handling seems to, the others not quite).
- Latency spikes from user-space scheduling are on the order of hundreds of microseconds, lower if a proper realtime tuning guide is followed (except when the queue runs empty and the thread is forced to sleep).

Help with improving memory usage and the O(N²) behaviour would be appreciated.
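A rough way to sanity-check such complexity claims is to time runs at two sizes and estimate the scaling exponent. A minimal sketch (`spawnAndStartServices` is a hypothetical placeholder, not an Ichor API; a real check would create and start N services there):

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <functional>

// Time one invocation of work(n) in microseconds.
static double timeRunUs(const std::function<void(int)> &work, int n) {
    auto start = std::chrono::steady_clock::now();
    work(n);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count();
}

int main() {
    // Hypothetical placeholder: stands in for creating/starting n services.
    // Here it deliberately burns O(n^2) time so the output shows ~2.
    auto spawnAndStartServices = [](int n) {
        volatile long sink = 0;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                sink = sink + i + j;
    };

    double t1 = timeRunUs(spawnAndStartServices, 1000);
    double t2 = timeRunUs(spawnAndStartServices, 2000);

    // If t(n) ~ n^k then k ≈ log(t2/t1) / log(n2/n1); ~2 suggests O(N²).
    std::printf("estimated exponent: %.2f\n", std::log(t2 / t1) / std::log(2.0));
    return 0;
}
```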
Also, Golang's fasthttp + go-json is really fast. Together with Ichor + Boost.Beast, they are the only two solutions tested here that handled 20,000 connections without errors, although Ichor "cheats" in the sense that an extra thread runs the user logic separately from the networking thread, whereas the other implementations had to do everything on one thread.
Planned improvements:
- Linux-only: implement an io_uring event loop to reduce syscalls (a sketch of the idea follows below)
- Use libuv and modify the hiredis implementation to use it instead of polling at 1,000 Hz
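The appeal of io_uring is batching: operations are queued in user space and many can be submitted with a single syscall. A minimal liburing sketch of that idea (illustration only, not Ichor code; link with -luring):

```cpp
#include <liburing.h>
#include <cstdio>
#include <unistd.h>

int main() {
    io_uring ring;
    if (io_uring_queue_init(32, &ring, 0) < 0) {
        std::fprintf(stderr, "io_uring_queue_init failed\n");
        return 1;
    }

    // Queue a read entirely in user space; no syscall happens yet.
    char buf[64];
    io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, STDIN_FILENO, buf, sizeof(buf), 0);

    // A single syscall submits every queued operation at once.
    io_uring_submit(&ring);

    // Block until a completion arrives, then mark it as consumed.
    io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    std::printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```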