Tar format

From CatchChallenger wiki
Jump to: navigation, search

The Tar format

It's a open source which ensures interoperability. It's important to always have a software to do and read the archive, without license or royalty. That's allow to encode under Windows, Linux, Mac. Do the own lib to read it (then I can reduce the code size because I limit the features that I need).


To translate:

This allows me to separate the storage format and the compression itself. Use the Tar format with gzip, zlib or other.

Tar structure

You have the datas, the file content. And the meta data, oath, chmod, user, group, creation date ... The Tar format work by block, the holes into the block is fil by binary zeros.

 The name is derived from (t)ape (ar)chive, as it was originally developed to write data to sequential I/O devices with no file system of their own.
 The command line utility was first introduced in the seventh edition of unix (v7) in January 1979, replacing the tp program.[1]
 The file structure to store this information was later standardized in POSIX.1-1988[2] and later POSIX.1-2001.[3] and became a format supported by most modern file archiving systems.

Don't do simple serialization can be awkward for the compression system because you have X binary zeros. This add noise, which is problematic to optimize compression. If in addition we add that the integer is stored into string, that's reduce event more the compression ratio compared to binary (See: Compression and Binary). This also reduces the performance of Tar encoder and decoder to work in string and not in binary (to not mention the limitations of the length of the numbers represented).

A better format could have a much smaller size without compression and a better compression ratio.

Delete some meta data can be a good idea. Into the CatchChallenger unpacker I take care ONLY of file name and file content, I ignore: chmod, user, group, creation date, ...

For CatchChallenger

We continue to use Tar to have an easy integration at server side (99% of unix) for CatchChallenger via a http mirror (the self hosting you have the simplified rsync to send file and sync without any archiving tool need). It's exactly the same reason why I use http protocol, it's very easy to have access to http hosting support.

Do own format is not a problem, the problem is to do/maintain and distribute the encoder and decoder for this format (right to install it?).

And will be as the format .tar.zst, `--owner=0 --group=0 --mode='a+rw' --mtime='2010-01-01T00:00:00,0+00:00' -H ustar` allow better compression, drop information leak and allow check mirror integrity:

 find datapack/ -type f -print | grep -F -v 'datapack/map/main/' | sort > /tmp/temporary_file_list
 tar --owner=0 --group=0 --mode='a+rw' --mtime='2010-01-01T00:00:00,0+00:00' -H ustar -c -f - -T /tmp/temporary_file_list | /usr/bin/zstd -22 --ultra > pack/datapack.tar.zst