Benchmark for design decisions

The Qt benchmark was done with gcc 4.5, Qt 4.8, Qt 5.0 alpha, and this benchmark application: https://github.com/alphaonex86/QtSignalsSlotsBenchmark. Also have a look at: http://catchchallenger.first-world.info/benchmark-tracking/view.php

Connection system

Connections per second

Greater is better

Columns: number of threads / connections already established.

                                              1/10000   8/10000   64/10000   64/1000   64/100
Qt4, 2 slots + 2 signals                      526316    526316    526316     500000    526316
Qt4, 1000 slots + 1000 signals                31645     31645     31545      31746     31446
Qt5 (old syntax)                              476190    500000    500000     500000    500000
Qt5 (new syntax), 2 slots + 2 signals         909091    909091    1000000    1000000   909091
Qt5 (new syntax), 1000 slots + 1000 signals   500000    500000    500000     500000    500000
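
For reference, here is a minimal sketch of the two connect syntaxes compared above, using stock Qt classes (QTimer and QCoreApplication) rather than the benchmark's own objects:

#include <QCoreApplication>
#include <QTimer>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    QTimer timer;

    // Old (string-based) syntax, Qt4 and Qt5: signal and slot are looked up by name at runtime.
    QObject::connect(&timer, SIGNAL(timeout()), &app, SLOT(quit()));

    // New (pointer-to-member) syntax, Qt5 only: no string lookup at connect time,
    // which is consistent with the higher connection rate in the table.
    QObject::connect(&timer, &QTimer::timeout, &app, &QCoreApplication::quit);

    timer.start(1000);
    return app.exec();
}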

Time to disconnect

With arguments

Lower is better; the time is in seconds. The test calls disconnect(sender,SIGNAL(),receiver,SLOT());

Columns: number of threads / connections already established.

                   1/10000   8/10000   64/10000   64/100000   64/1000000   64/1000   64/100
Qt4                1.35      1.375     1.369      11.6        103          0.888     0.817
Qt5 (old syntax)   1.469     1.495     1.51       12.5        117          0.993     0.933
Qt5 (new syntax)   1.653     1.573     1.568      19.5        212          0.993     0.93

In this case, for CatchChallenger with just 1000 clients (each client needs ~1000 inter-thread connections), your server is out: it takes 117s to disconnect a client (and can potentially freeze during this time).

Without arguments

Lower is better; the time is in seconds. The test calls this->disconnect();

                   10000000 connections   30000000 connections   10000000 connections
                   already set up         already set up         on other objects
Qt5 (new syntax)   0.205                  0.583                  0

With Qt5 this time does not depend on the number of signals/slots defined in the class.

Messages

The signals/slots are connected with Qt::QueuedConnection, as for threaded usage.
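
Below is a minimal sketch of the kind of queued, cross-thread connection measured in this section; it uses a plain QObject receiver and a QTimer as sender, not the benchmark's own classes:

#include <QCoreApplication>
#include <QDebug>
#include <QThread>
#include <QTimer>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);

    // Receiver living in a worker thread; a plain QObject is enough for a functor connection.
    QThread workerThread;
    QObject receiver;
    receiver.moveToThread(&workerThread);
    workerThread.start();

    QTimer timer;
    // Qt::QueuedConnection: the emission posts an event to the receiver's thread,
    // and the functor runs in the worker thread's event loop.
    QObject::connect(&timer, &QTimer::timeout, &receiver,
                     []() { qDebug() << "handled in" << QThread::currentThread(); },
                     Qt::QueuedConnection);
    timer.start(100);

    return app.exec();
}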

Messages sent

Greater is better

Columns: number of threads / connections already established.

      1/10000   8/10000   64/10000   64/1000   64/100
Qt4   169870    171612    172446     175223    174128
Qt5   275285    279835    277826     277117    274924

Here we can see +60% performance with Qt5. There is no change in any version of Qt whether 2 or 1000 signals/slots are declared on the class.

I count 3 signals to produce a reply (QTcpSocket -> ClientReadSocket, ClientReadSocket -> parsing thread, parsing thread -> ClientWriteSocket). In the worst case that gives 169870/3 ~= 56000 replies/s. With 10 000 clients connected, that means 5.6 signals/s per player. With a 200ms foot step (the worst case), that consumes 5 signals/s, leaving 0.6 req/s. And this is the worst case: a normal PC, a large number of players, and the slowest version of Qt.

CPU usage

The value is in ms, lower is better

Columns: number of threads / connections already established.

      1/10000   8/10000   64/10000   64/1000   64/100
Qt4   1050      1030      910        810       970
Qt5   910       830       810        870       880

System CPU usage

The value is in ms, lower is better

Columns: number of threads / connections already established.

      1/10000   8/10000   64/10000   64/1000   64/100
Qt4   280       250       300        350       320
Qt5   200       260       270        230       210

Qt5 seems to perform better, but given the fluctuation of the benchmark it is better not to trust this result too much.

Container

To store the player pointers, with full lookup

Store 65535 pointers.

  • Insert the current player into the list of the other 65534 players.
  • List the 65535 players to send the broadcast message.
  • Remove the current player from the list of 65535.
        Insert   List   Remove
QList   0        1      1
QSet    0        6      6
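
A minimal sketch of the pattern measured above; the Player type and its sendMessage() call are hypothetical, only to illustrate the container usage:

#include <QByteArray>
#include <QList>

// Hypothetical player type, only to illustrate the container pattern measured above.
struct Player
{
    void sendMessage(const QByteArray &message) { (void)message; /* write to the client socket */ }
};

void broadcast(QList<Player *> &players, Player *current, const QByteArray &message)
{
    players.append(current);                 // insert the current player into the list

    int index = 0;                           // list all players to send the broadcast
    while (index < players.size())
    {
        players.at(index)->sendMessage(message);
        ++index;
    }

    players.removeOne(current);              // remove the current player from the list
}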

String

Lower is better (https://github.com/alphaonex86/CatchChallenger/tree/master/tools/benchmarkstring), done with gcc 4.7, debug Qt 5.2 with C++11. Formats:

  • QStringLiteral, where QStringLiteral("string")
  • QLatin1String, where QLatin1String("string")
  • QString, where QString("string")
  • QLatin1Literal, where QLatin1Literal("string")
  • Char*, where "string"
  • Prepared, where a QString variable is used directly

Tests:

  • Condition with != and ==: test in a condition
  • QDomElement::hasAttribute(): test as a function argument
  • Concat by +: concatenation
  • Replace format to QString: var.replace(FORMAT ABOVE,varQString);
  • Replace format to format: var.replace(FORMAT ABOVE,FORMAT ABOVE);
  • Search replace string to format: varQString.replace(regex,FORMAT ABOVE);
  • Search replace format to string: FORMAT ABOVE.replace(regex,varQString);
                 Condition      hasAttribute()   Concat by +   Replace        Replace    Search replace   Search replace
                 with != / ==                                  fmt->QString   fmt->fmt   string->fmt      fmt->string
QStringLiteral   3936ms         1947ms           4161ms        2957ms         5140ms     11986ms          11877ms
QLatin1String    204ms          1507ms           3771ms        1559ms         2142ms     11194ms          NA
QString          4586ms         2280ms           4589ms        3288ms         5741ms     12310ms          12145ms
QLatin1Literal   231ms          1505ms           3775ms        1555ms         2130ms     11207ms          NA
Char*            4630ms         2339ms           4275ms        3364ms         5811ms     12291ms          NA
Prepared         92ms           31ms             2492ms        1280ms         237ms      9472ms           9313ms

And now with CONFIG += c++11:

                 indexOf   Condition      hasAttribute()   Concat by +   Replace        Replace    Search replace   Search replace
                           with != / ==                                  fmt->QString   fmt->fmt   string->fmt      fmt->string
QStringLiteral   510ms     275ms          128ms            2379ms        1036ms         1158ms     8077ms           7592ms
QLatin1String    200ms     163ms          1371ms           3792ms        1180ms         1744ms     9963ms           NA
QString          1146ms    3589ms         1789ms           3855ms        2698ms         4542ms     9841ms           9751ms
QLatin1Literal   455ms     156ms          1369ms           3948ms        1187ms         1814ms     10193ms          NA
Char*            915ms     4544ms         1830ms           3714ms        2744ms         4711ms     10254ms          NA
Prepared         56ms      66ms           23ms             2104ms        802ms          211ms      7885ms           7877ms
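
To make the rows concrete, here is a small sketch of three of the formats used in the "condition" test; the attribute name "type" is just an example value, not taken from the benchmark:

#include <QString>

bool matchesType(const QString &value)
{
    // Prepared: the comparison string is built once and reused across calls.
    static const QString prepared = QStringLiteral("type");
    bool a = (value == prepared);                // "Prepared" row

    bool b = (value == QLatin1String("type"));   // "QLatin1String" row: no QString allocation
    bool c = (value == QString("type"));         // "QString" row: allocates and converts on every call
    return a && b && c;
}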

Using prepared strings, the datapack loading time (mostly XML parsing time + file access time) went from 1560ms to 742ms (https://github.com/alphaonex86/CatchChallenger/commit/4eeece9d5169d125fcfdbc2126629cdd2a75fe1c).

String in C++

Higher is better, gcc with C++11:

                                 ARM Cortex A9 quad core         Intel x86 Haswell
                                 with grsecurity (PaX)
std::string::find("test")       3802281 cycles/s                 35404496 cycles/s
std::string::find(std::string)  4201680 cycles/s                 38079281 cycles/s
nop                             83333333 cycles/s                471920717 cycles/s
  • On average, find(std::string) is 7% better on x86 and 10% better on ARM than find("test").
  • Raw benchmark: std::string::find(std::string) vs QString::indexOf(QString), both with prepared strings: 38079281 vs 24181750, i.e. 57% faster in C++11 for the prepared case, and around 32x for unprepared strings. See the sketch below.
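
A sketch of the kind of timing loop behind these numbers, assuming "cycles/s" means completed calls per second; the haystack and needle values are arbitrary, not the benchmark's data:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    const std::string haystack = "some text with a test marker inside";
    const std::string needle = "test";
    const std::uint64_t iterations = 10000000;

    volatile std::size_t sink = 0; // prevent the compiler from optimising the loop away
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i)
        sink = haystack.find(needle);
    const auto end = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(end - start).count();
    std::cout << "std::string::find(std::string): "
              << static_cast<std::uint64_t>(iterations / seconds) << " calls/s\n";
    return 0;
}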

Send to all objects

  • Case 1: 600x
    connect(timer to object)
    to send to all objects
  • Case 2: connect(timer to one object holding a QList); in that object do:
int index = 0;
while (index < objectlist.size())
{
    // call each object directly, without going through signal/slot dispatch
    objectlist.at(index)->call();
    index++;
}
         % of CPU
Case 1   21%
Case 2   2.5%

That means an 8.4x performance improvement for only 600 objects... imagine with 65535 objects (i.e. players) as planned...
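
A minimal sketch of the Case 2 pattern, with a hypothetical GameObject type and a Dispatcher that owns the single timer connection:

#include <QList>
#include <QObject>
#include <QTimer>

// Hypothetical object type; call() stands for whatever per-tick work each object does.
struct GameObject
{
    void call() { /* per-tick work */ }
};

// Case 2: one timer connection fans out to all objects through a plain loop,
// instead of one signal/slot connection per object (Case 1).
class Dispatcher : public QObject
{
public:
    explicit Dispatcher(QObject *parent = nullptr) : QObject(parent)
    {
        connect(&timer, &QTimer::timeout, this, &Dispatcher::tick);
        timer.start(16);
    }

    QList<GameObject *> objectlist;

private:
    void tick()
    {
        int index = 0;
        while (index < objectlist.size())
        {
            objectlist.at(index)->call();
            index++;
        }
    }

    QTimer timer;
};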

SSL vs clear

I have done this comparison with https://github.com/alphaonex86/CatchChallenger/tree/master/tools/epoll-server/epoll-ssl-server-skeleton, using the typical small packets of CatchChallenger.

It is 24x slower with SSL than in clear. Mostly because the small packets (<10 bytes) are sent at long intervals (>1s), the OS cannot group them, so a lot of syscalls are needed. SSL is not needed in this case if you:

  • Use an already-encrypted network like I2P or IPsec (I have not found this information for Tor)
  • Don't need security (no crypto-currency use) and need great performance (embedded device)

memcpy vs raw

         x86 Intel Haswell, 64-bit   ARM Cortex A9 quad core, 32-bit,
                                     with grsecurity (PaX) on a Linux kernel
memcpy   1.4s                        7.0s
raw      1.2s                        1.2s

Then why use memcpy? For large blocks of data (it uses SIMD for a great speed improvement) and to access unaligned 64-bit integers on 32-bit platforms. Raw means: pointer + reinterpret_cast, effective on small amounts of data.
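
A sketch of the two access styles, reading a 32-bit value out of a packet buffer (the function names and offset parameter are illustrative, not from the benchmark):

#include <cstdint>
#include <cstring>

// "raw" is only safe when the offset is correctly aligned for uint32_t;
// memcpy works regardless of alignment and lets the compiler use SIMD on large copies.
std::uint32_t readRaw(const char *buffer, std::size_t offset)
{
    return *reinterpret_cast<const std::uint32_t *>(buffer + offset);
}

std::uint32_t readMemcpy(const char *buffer, std::size_t offset)
{
    std::uint32_t value;
    std::memcpy(&value, buffer + offset, sizeof(value));
    return value;
}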

Hash and checksum

                      x86 Intel Haswell, 64-bit   ARM Cortex A9 quad core, 32-bit,
                                                  with grsecurity (PaX) on a Linux kernel
sha224                650KH/s                     36KH/s
xxhash                115KH/s
xxhash, flag set A    1700KH/s                    35KH/s
xxhash, flag set B    1700KH/s
xxhash, flag set C    1700KH/s                    31KH/s

Flag set A: -march=native -O2
Flag set B: -march=native -O2 -fomit-frame-pointer -floop-block -floop-interchange -fgraphite -funroll-loops -ffast-math -faggressive-loop-optimizations -funsafe-loop-optimizations
Flag set C: -pipe -march=native -O2 -fomit-frame-pointer -floop-block -funroll-loops -ffast-math -funsafe-loop-optimizations

Intel Haswell has hardware acceleration for sha224. The internal tool (Qt's sha224) and "cryptsetup benchmark" give around the same result; Qt is 10% slower.
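
For reference, a minimal sketch of the one-shot xxHash call being measured (xxhash.h from https://github.com/Cyan4973/xxHash); the input string is arbitrary, and the input sizes used in the measurements above are not given:

#include <cstdio>
#include <cstring>

#include "xxhash.h"

int main()
{
    const char data[] = "datapack content to checksum";

    // One-shot 32-bit hash; 0 is the seed.
    const unsigned int hash = XXH32(data, std::strlen(data), 0);
    std::printf("xxh32: %08x\n", hash);
    return 0;
}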

Set to packet parsing

Lower is better

Columns: unaligned access (C), then found / not found lookups for unordered_set (C++),
set (Qt), array (C) and vector (C++).

                                        unaligned   u_set    u_set     set     set       array   array     vector   vector
                                                    found    not fnd   found   not fnd   found   not fnd   found    not fnd
x86 Intel Haswell                       47ms        3362ms   4378ms    456ms   495ms     54ms    53ms      1366ms   2284ms
x86 Intel Haswell, clang                67ms        2261ms   2825ms    480ms   480ms     69ms    74ms      1077ms   1752ms
x86 Intel Haswell, flag set D           47ms        3333ms   4288ms    468ms   485ms     54ms    58ms      1358ms   2257ms
x86 Intel Haswell, flag set D + -O2     0ms         369ms    498ms     120ms   103ms     0ms     0ms       60ms     89ms
x86 Intel Haswell, clang -O2 -fPIC
  -std=c++0x                            0ms         0ms      0ms       94ms    74ms      0ms     0ms       0ms      0ms
ARM Cortex A9, flag set E, Grsec PaX,
  gcc 4.7 hardened                      48ms        522ms    343ms     259ms   265ms     28ms    52ms      466ms    857ms

Flag set D: -march=native -fomit-frame-pointer -floop-block -floop-interchange -fgraphite -funroll-loops -ffast-math
Flag set E: -march=native -fomit-frame-pointer -funroll-loops -ffast-math

Done with -std=c++0x, gcc 4.9.3 hardened, Qt core 5.4.2. Most of the memory access is read-only with zero copies/allocations, because it mostly reads the datapack content to check it. This makes the benchmark very important, because of the huge number of memory read accesses.
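
One possible reading of the "array" vs "unordered_set" columns, shown as a sketch; the exact data layout of the original benchmark is not given, so this only illustrates the two lookup styles:

#include <array>
#include <cstdint>
#include <unordered_set>

struct PacketCodes
{
    std::array<bool, 65536> table{};              // direct-index lookup, O(1), no hashing
    std::unordered_set<std::uint16_t> set;        // hashed lookup

    bool inArray(std::uint16_t code) const { return table[code]; }
    bool inSet(std::uint16_t code) const { return set.count(code) != 0; }
};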

Lock: None, atomic, spinlock, mutex

Benchmark used: https://github.com/attractivechaos/benchmarks/blob/master/lock/lock_test.c on an ARMv7 (Cortex A9 quad core) platform, gcc 4.9, Gentoo Linux, Grsec, PaX, kernel 3.17. CLI: gcc -std=c++11 -lstdc++ -pthread -fpermissive -O3 -o test test.cpp

Lower is better:

                                  No lock     gcc builtin   spin lock   pthread spin   mutex       semaphore   buffer+spin   buffer+mutex
ARMv7, 1 thread                   1m3.478s    1m16.873s     1m20.687s   1m14.666s      1m14.633s   1m14.115s   1m11.729s     1m6.932s
ARMv7, 4 threads                  0m17.495s   0m20.960s     0m38.235s   0m32.086s      1m36.684s   2m52.436s   0m17.631s     0m17.415s
ARMv7, 4 threads under heavy
  pressure (4x dd if=/dev/zero
  of=/dev/null)                   2m29.136s   3m11.012s     3m7.528s    4m45.539s      3m56.250s   5m16.327s   3m32.740s     2m4.369s

Note: buffer + mutex is exactly what happens with TCP/Unix sockets and async communication between processes.
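
A sketch of two of the lock flavours in the table: a busy-waiting spinlock built on std::atomic_flag (comparable to the gcc builtin / spin lock columns) and std::mutex. The counters are just placeholders for the protected work:

#include <atomic>
#include <mutex>

class SpinLock
{
public:
    void lock()
    {
        while (flag.test_and_set(std::memory_order_acquire))
            ; // busy-wait: cheap when uncontended, expensive under heavy contention
    }
    void unlock() { flag.clear(std::memory_order_release); }

private:
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
};

SpinLock spin;
std::mutex mtx;
long counterSpin = 0;
long counterMutex = 0;

void incrementWithSpin()
{
    spin.lock();
    ++counterSpin;
    spin.unlock();
}

void incrementWithMutex()
{
    // On Linux this maps to a futex-based pthread mutex: the thread sleeps when contended.
    std::lock_guard<std::mutex> guard(mtx);
    ++counterMutex;
}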

HTTP server on arm64 Debian Stretch

On an Odroid C2 via an LXC VM: see Performance-http-server-arm64-debian-stretch.png