Entropy isolation allows a better compression ratio, because it groups similar blocks next to each other and prevents the compressor from switching modes (compressed block, uncompressed block, ...).
Data encoding is very important to help the compressor do its job. Storing a flag as a single bit instead of a full uint8_t directly gives an 8x smaller size. This matters most for random and semi-random data.
The most common words are grouped into a dictionary. From Wikipedia:
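As an illustration of the uint8_t-to-bit idea, here is a minimal sketch that packs one flag per bit instead of one flag per byte. The helper names `pack_bits` and `unpack_bits` and the bit order (least significant bit first) are assumptions made for the example, not the project's actual format.

```python
def pack_bits(flags):
    """Pack a sequence of 0/1 values into bytes, 8 flags per byte (LSB first)."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_bits(data, count):
    """Recover `count` flags from the packed bytes."""
    return [(data[i // 8] >> (i % 8)) & 1 for i in range(count)]


flags = [1, 0, 0, 1, 1, 1, 0, 1] * 100  # 800 flags
as_bytes = bytes(flags)                 # one uint8_t per flag
as_bits = pack_bits(flags)              # one bit per flag

print(len(as_bytes), len(as_bits))
```

The packed form is 8x smaller before any compressor is involved, which is why it helps even when the flags look random.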
Due to its intrinsic characteristic of permitting gains when used in communications, as the example of the mike click demonstrates, there is a pre-understood code system agreed and shared between the emitter and the receptor. It is not without a cause that most communication network oriented compression algorithms rely in some form of dictionary compression.
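The shared-dictionary idea can be sketched as follows: both sides agree on a word list in advance, and only the index of each word travels over the wire. The word list and the one-byte index are hypothetical choices for this example; they are not taken from the CatchChallenger protocol.

```python
# Hypothetical word list, agreed on in advance by the emitter and the receiver.
SHARED_WORDS = ["player", "monster", "item", "map", "moved", "caught"]
WORD_TO_INDEX = {word: index for index, word in enumerate(SHARED_WORDS)}


def encode(words):
    """Replace each known word by its one-byte index in the shared list."""
    return bytes(WORD_TO_INDEX[word] for word in words)


def decode(data):
    """Map each index back to the word it stands for."""
    return [SHARED_WORDS[index] for index in data]


message = ["player", "caught", "monster"]
encoded = encode(message)
as_text = " ".join(message).encode("utf-8")

print(len(as_text), len(encoded))
```

This only works because the receiver holds the same list; a word outside the list would need an escape mechanism, which this sketch leaves out.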
Text vs binary
For poorly compressible data, such as random or non-repeating data, you have:
As you can see, it is better to improve the format and use binary.
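One way to check the text-versus-binary claim yourself is the small measurement below. It is a sketch under assumptions: the values are random 16-bit integers from a seeded generator, the text form is a comma-separated decimal list, and the binary form is little-endian uint16.

```python
import random
import struct
import zlib

rng = random.Random(1234)  # fixed seed so that the run is repeatable
values = [rng.randrange(65536) for _ in range(1000)]

# Text form: decimal numbers separated by commas.
as_text = ",".join(str(v) for v in values).encode("ascii")
# Binary form: one little-endian uint16 per value.
as_binary = struct.pack(f"<{len(values)}H", *values)

sizes = {
    "text": len(as_text),
    "text + zlib": len(zlib.compress(as_text)),
    "binary": len(as_binary),
    "binary + zlib": len(zlib.compress(as_binary)),
}
for name, size in sizes.items():
    print(f"{name:>14}: {size} bytes")
```

For random values the binary form is already close to incompressible, so zlib has little to gain on it, while the text form should stay larger even after compression.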
Ordered incrementing list
Based on these characteristics:
- some items follow the previous one (as humans, we organise similar items in the same id range)
- the list is ordered and each id is unique (you own a given item only once)
- little-endian or big-endian changes nothing, so it is omitted here
- the raw id list, compressed id list and raw incremental list are omitted because they grow proportionally to the list length
- the compression used is gzip, and therefore zlib, the same as btrfs compression=zlib and the zlib used for SSL compression (note: I think that applies to SSH tunnels and PostgreSQL)
- incremental values help the compression produce a stable size
- for a few items the compression is not effective, but the size is minimal
- uncompressed, the size is the same with ids or with incremental values
- raw bits give an 8x compression ratio, but the size is fixed
- compressed bits give a good size
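The encodings compared in the list above can be sketched as follows. This is an illustration under assumptions: ids fit in a uint16, the id space holds `ID_COUNT` entries, and the function names are made up for the example.

```python
import struct
import zlib


def encode_ids(ids):
    """Raw id list: one little-endian uint16 per id."""
    return struct.pack(f"<{len(ids)}H", *ids)


def encode_deltas(ids):
    """Incremental list: the first id, then the difference to the previous id."""
    if not ids:
        return b""
    deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    return struct.pack(f"<{len(deltas)}H", *deltas)


def decode_deltas(data):
    """Rebuild the ids by summing the differences."""
    ids, total = [], 0
    for (delta,) in struct.iter_unpack("<H", data):
        total += delta
        ids.append(total)
    return ids


def encode_bitmap(ids, id_count):
    """Raw bits: one bit per possible id, set when the id is in the list."""
    bits = bytearray((id_count + 7) // 8)
    for i in ids:
        bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)


ids = list(range(100, 400)) + [1000, 1001, 1005]  # ordered, unique ids
ID_COUNT = 2048                                   # hypothetical size of the id space

raw_ids = encode_ids(ids)
raw_deltas = encode_deltas(ids)
bitmap = encode_bitmap(ids, ID_COUNT)

for name, data in [("ids", raw_ids), ("deltas", raw_deltas), ("bitmap", bitmap)]:
    print(f"{name:>7}: raw {len(data)} bytes, zlib {len(zlib.compress(data))} bytes")
```

The raw id list and the raw incremental list have the same size, but the incremental one turns a run of consecutive ids into a run of identical values, which zlib handles much better. The bitmap has a fixed size that depends only on `ID_COUNT`, whatever the number of ids in the list.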
Compression is used for the datapack list and the datapack content:
- Zlib: saves bandwidth at the cost of CPU
- Lz4: lower ratio, but uses less CPU
- Xz: very good compression ratio, but high CPU usage
- zstd: about the same ratio as zlib at about the same speed as lz4, so it is the best tradeoff
- None: no compression overhead, but some queries can use a lot of bandwidth
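To get a feel for these tradeoffs, the sketch below selects a codec by name and compares output sizes. It only covers the codecs shipped in the Python standard library (none, zlib, xz); lz4 and zstd are left out here. The `CODECS` table, the function names and the sample payload are assumptions for the example, not the project's API.

```python
import lzma
import zlib

# name -> (compress function, decompress function)
CODECS = {
    "none": (lambda data: data, lambda data: data),
    "zlib": (zlib.compress, zlib.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def compress(name, payload):
    """Compress `payload` with the codec registered under `name`."""
    return CODECS[name][0](payload)


def decompress(name, payload):
    """Reverse `compress` for the same codec name."""
    return CODECS[name][1](payload)


# Hypothetical datapack file list, very repetitive on purpose.
payload = b"map/city/1.tmx\n" * 200

for name in CODECS:
    packed = compress(name, payload)
    assert decompress(name, packed) == payload
    print(f"{name:>5}: {len(packed)} bytes")
```

Sizes alone do not show the CPU cost, so a real comparison should also time each codec on representative datapack content.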
This allows the player id to be reduced to 8/16 bits, and hides the database id.
It allows storing an int, keeps the data aligned, avoids an enum (which would need to be updated too often, and that is hard), and is much more compact than a string.
With zlib compression, it also gives a better ratio.
On file system
CatchChallenger is not sensitive to database latency, so you should prioritise low disk usage over high performance. You should use compression such as the one provided by btrfs.