Compression-related projects

Compression related projects

A couple of days ago (2006) I answered a question on random access in gzip streams and as the result of that conversation realised that there isn't much documentation out there, so here's a brain dump related to compression.

Compression and delta-encoding are part of the same general case, which is why they are inter-mingled below.

Overview of compressors

Making a stream smaller requires finding redundancy and duplication in data. Random data does not compress and never will. Compression tends to take three main steps:

	Transform/ Preprocessing	Duplicate elimination	Bit reduction
deflate		≤32kB Sliding dictionary	Huffman tree
bzip2	RLE, ≤900kB BWT, then Move-to-Front		Multiple Huffman trees
LZMA		≤1GB Sliding dictionary	Range coding
rzip		≤900MB LZ77 dictionary	Bzip2 stack
PNG (deflate)	Line:next-line, pixel:next-pixel predictors	≤32kB Sliding dictionary	Huffman tree
7-zip (deflate,lzma,bzip2)	BCJ2 executable jump compression	(various)	(various)
MS LZX (aka cabinet)	0xe8 x86 Jump convertor	32kB-2MB Sliding dictionary	Huffman tree
Courgette	PE/COFF x86 disassembler	modified bsdiff	7-zip stack

I have some compression projects on the go at the moment:

`pyflate.py`: stand-alone pure-Python .gz/.bz2 decoder

pyflate.py will decode and decompress DEFLATE (as found in zlib, gzip, ZIP/pkzip, PNG, PDF...) and bz2 streams (from the BWT-based Bzip2). This is mainly designed as a research project to learn about the algorithms and for producing indexes for random-access. Around 650 lines of Python. BSD/GPL/LGPL/DFSG licensed.

There was a later, completely refactored version that fixed a number of bugs. I don't have a copy of it—it died along with a laptop; remember to sync early and sync often!

`succinct`: efficient downloading for upgrades

Related to zsync/rsync
Remote random-access: HTTP Partial Content/Range fetching

Asynchronous keep-alive mode

Arbitary-sized hash-based checksum matcher
deflate stream flush/partial/sync/full reset locator
bzip2 stream block boundary locator
gzip stream reconstructor (to cope with dpkg-deb and --rsyncable)

Extension and late research ideas ("premature optimisation")

deflate stream locators for .tar.gz, .deb (ar→.tar.gz x2), .zip (.jar, .odt), .pdf, .rpm
non-deflate based locator for .iso; especially if local .debs are already available. Similar to Jigdo, but without quite so much overhead. Might be possible to directly convert jigdo files. The secret to this working is intelligently placed variable-sized chunks.
optimisation of break-points (Z_PARTIAL_FLUSH, Z_SYNC_FLUSH, Z_FULL_FLUSH)
optimisation of chunk checksum boundaries (align to file or sync boundaries)
limited cacheing of literal data to reduce chunk download
argumented advancecomp to all reconstruction of none gzip -9 streams
tying in bittorrent for even lower mirror load
deflating per-file headers inside a tar-archive without backreferences, to allow direct-access

`debfs`: live filesystem from a bunch of .deb/.tar archives

Using debfs will go something like this

$ cd ~/alternate-cd
$ find pool/ -name \*.deb | mkfs.debfs > livecd.debfs
$ echo livecd.debfs will be tiny.  It just contains meta data.
$ mount -t debfs livecd.debfs

debfs will then populate the filesystem; files will go into the root, control scripts will be presented in /var/lib/dpkg/info/* and the debs themselves aliased (in directly compressed form) in /var/cache/apt/archives/*; deb unpacking is avoided but the control scripts will still need to be configured.

At this stage, a unionfs needs to be mounted. Once editing as been completed, the contents of backing store for the unionfs (tmpfs, or a on-disk directory tree) can be tar'ed up into a tarball and added to the list of debfs files. There's also an opportunity to take the deletion data from the unionfs (if there was any left over) and remove those entries from the .debfs manifest

optimisation of deflate/gzip/zlib streams

kzip, advancecomp (p7zip converging deflate implementation), bjwflate/deflopt

Research ideas

Ken Silverman discribes, of kzip, "The main one is an exhaustive search of all patterns. The other one is my advanced block splitter."

Ben Jos Walbeehm alludes to an un implemented idea for bjwflate, "Blockwise multiplexing! (like a blockwise ZipMax! (zip level --> file level --> block level)).". I do not know this would be but I've wondered whether it would be possible carefully split the huffman trees into two balances halves, allowing them to be used for separate purposes (streams). A concocted example I can think of would be keeping the data and the tags in a XML document apart.

Random access on gzip streams

Required for both succinct and debfs, but likely to have wider use. Store a lookup-table of flush points in the FEXTRA field at the start of a gzip header (or store this meta-data separately). The data can be retrospectively scanned from an existing stream and needs to list tuple pairs of plain-text offset and compressed offset.

Ideally this data could be extended to provide a hint on what method (eg. gzip -9, --rsyncable, or necessary advancecomp options are required to recompress each block, if the stream needs to be recreated identically.

I've been working on this issue for about 1 year now and I keep making discovers and jumps in my understanding of the problem. There are overlapping problem domains.

List of block headers. Positions (in bits), length of block header (in bits). This makes it possible to recover the huffman-tree for this block.
Mapping of uncompressed:compressed positions. Uncompressed offset (in bytes) to compressed offset (in bits), size of sliding dictionary required to process from this point.

Data can be decoded from the middle of a block as the Huffman tree details from the closest preceding header can be retrieved. Stored/literal blocks become completely random-access since no alignment map is required.

The amount of dictionary required can be computed by decompressing from that point in the stream and recording the furthest distance referenced in the history buffer.it is not necessary to know the contents of the history buffer in order to produce this analysis. Indeed, the actual bytes accessed within the history buffer could be recorded as a bitmap and a sparse dictionary constructed specific to that starting position be constructed.

During compression it would be possible to have certain sections not reference any history. Temporally turning off history access to make a particular block independently accessibly maybe useful in cases such as the file-header with in a tar-archive. As the history lookup and Huffman decoding are dependent, this would not require resetting or restarting the block.

gzip

Add gzip -0 forced store support (can be useful when stacking compression algorithms.

Comment on 7-zip, LZMA and blah

7zip (.7z) is a container format for a series of other algorithms. Many algorithms are tried for each set of data and the one producing the smallest result is picked. The two highest performing methods are bzip2 and a new kid on the block called LZMA. Methods may be stacked and the data preprocessed before compression.

LZMA is effectively deflate (zlib, gzip, zip) with a larger dictionary size, 32MB instead of 32kB. LZMA stands for Lempel-Ziv-Markov chain-Algorithm, after string back-references have been located, values are reduced using a Markov chain range-encoder (aka arithmetic coding) instead of huffman.

7zip also produces gains on x86, arm and other binary format files by converting relative jumps into an absolute addresses, increasing the likely of a match. In the case of some code where multiple places (different points in a 'for' loop) branch to the same lcoation:

.loop_start:
      test eax, eax
      jcc .loop_start    (jcc -4)
      ...
      jmp .loop_start    (jmp -23)
      ...
      jnz .loop_start    (jnz -56)

Then the BCJ/BCJ2 preprocessors would step though the stream and convert all relative locations into jumps to the same absolute value, eg. 7472. The change is performed by add the value already present with the offset in the current file. The latter steps of the compression (string matching, huffman/range-coding) easily take care of compressing the three duplicated values in the stream.

Comment on Bzip2

Bzip2 is block-based, not stream based as gzip is. A fixed amount of input is taken (normally 900kB) and the block in considered on its own in a completely independant fashion. First a BWT transform is performed, placing likely similar letters closer together. Next is a Move-to-Front transform resulting in values clustered around zero. Both BWT and MTF processing steps are reversible and do not alter the length of the text. Final compression is performed using Huffman, much the same as for gzip.

Embedding of arbitrary data in .bz2 streams

Julian Steward includes a section about the limitations of the bzip stream format. A further limitation not listed is that there is no obvious way to add arbitary data in a compatible. Gzip provides storage space for a timestamp, filename, filecomment and a method for structured meta-data storage.

The only flexibility that I've located any flexibility in within the format is the encoding of the Huffman tables. The Huffman tables store the code length for each used symbol in the data-stream. A zero-bit means Continue at current bit-length, a one-bit means Modify; the following bit indicating whether to increment or decrement.

# current code length is 5, current symbol is A
0  B: 5 bits
0  C: 5 bits
0  D: 5 bits
11           # decrement to 4-bits
0  E: 4 bits
10           # increment to 5-bits
0  F: 5 bits

What is not documented is that multiple increment and decrements can be intermixed before the submission of a continue. This is the scope that may allow embedded of meta-data in a backwards compatible manner. A transform of binary data into increment and decrement operations (50% efficiency):

# current code length is 5, current symbol is A
# dec, dec, inc, dec, inc, dec, dec, inc
11 11 10 11 10 11 11 10
0  B: 3 bits

However, to the current decoder, the apparent bit length must remain with in the range of valid lengths (1-19) and hovering around a central value. One option is encode each bit of our binary meta-data as clustered pairs of operations; increment-decrement being one bit value and decrement-increment being the other (25% efficency):

# current code length is 5, current symbol is A
# dec, inc, inc, dec, inc, dec, dec, inc
(11 10) (10 11) (10 11) (11 10)
0  B: 5 bits

To gain back some of the lost efficency, we could use one of several "DC free" encoding schemes. One option is 8b/10b encoding which ensures a maximum walk of 5 bits and a result after each byte that is not more than 1 bit out in either direction. 8B/10B encoding brings long-term DC at the cost of look-up table based decoding (40% efficency).

# current code length is 5, current symbol is A
# dec, inc, inc, dec, inc, dec, dec, inc
((11 10 10 11 10 11) (11 11 10 10))
0  B: 5 bits

Pre-compression optimisation of tar archive ordering using clustering methods

Generally the order of files inside a tar archive makes no difference to the decoding of the files. A pre-processing step can be taken to reorder the individual file blocks in a tar steam to increase the likelyhood of increased compression. Note, that if the archive or contents are checksummed, then the order of components may be significant.

To take advantage of cross-file similarity in a tarball, the contents of two separate, but similar, files must be located within the same window. The window size is normally 32kB for gzip, or 900kB with bzip2 -9. Ideally some effort can also be made to ensure that the gzip, or bzip2 stream is not reset between two similar files—which would result in no compression improvement from having matched and located the similar data closely together.

If either file is bigger than the window size of the expected compression algorithm, then the comparision that needs to be performed is whether the end of one file closely matches the start of a second file.

For a tarball containing two files A and B the possible file orderings are A,B or B,A. Either of these orderings generates a tar file of equivalent length, however when compressed with a dictionary-based compression algorithm the final compressed length is likely to be less. To solve completely for any set of files there are N! factorial possibilites to investigate.

Re-ordering a tarfile

Performing the tarfile reordering is a four stage-process:

Decompress split existing tar archive into components file, each component retaining the 512byte tar file header and any necessary padding. This is the smallest unit that may be arbitrarially re-ordered. The 1024byte null padding block from the end should be removed.
Compare all pairs of individual components to produce a weighted graph showing the similarity between any pair of components.
Cluster components based on the weights produced in step (2), create a linear ordering of all components such that the total weighted route distance through the set is reduced to the minimum. Each file should be included only once! The solution turns out to be the asymmetric traveling salesman problem (TSP).
Reconstruct a valid tar stream by appending components in the order generated by step (3), followed by the 1024 zeros to indicate end-of-archive. Compress this new stream with gzip or bzip2 compressors and double-check that the resulting file is an improvement on the size of the original compressed tarball.

Once the window-size is known, a additional step can be inserted between (3) and (4) to rearrange the cluster chains so that (ideally) chains are not broken across a window size if using fixed blocks, such as bzip2.

Quick and dirty

A simpler ordering method, involves clustering based purely on filename and extension can be produced with a command similar to:

cat filelist.txt | rev | sort | rev > neworder.txt

This sorting process workings by reversing each line in the file; hello.text would become txet.olleh allowing files with similar file extensions or basenames to be ordered adjacently. The filenames are reversed again producing the file order; this method appears to work well for language-packs containing translated strings, resulting in a 14% improvement using bzip2 compression both before and afterwards, or 2% if using gzip (most files are larger than the 32kB window size).

I came across a paper (without source code) which discusses pre-ordering for efficient zdelta encoding as well as the tarfile ordering: Compressing File Collections with a TSP-Based Approach (PDF). For this paper, a relatively simple, greedy method is chosen, yeilding compression improvements of ~10-15% on webpages of online news services.

8bit clean gzip/deflate

One of the problem is transmitting binary files through email is that certain character, or certain sequences of characters take on special meanings. Examples of this include:

CR LF . CR LF End of email body
CR LF - - MIME boundary
NUL End of zero-terminated string

The normal way to avoid NUL bytes or other undesirable sequences appearing in a final output message is to use an established encoding method such as uuencode, base64 or ascii-85. The three methods trade a larger output size, in exchange for a reduced symbol set, such as only common letters as upper and lowercase Latin letters, Arabic numerals and punctuation.

It may, with repeated trial and error be possible to produce a value deflate stream that does not contain the blacklisted sequences.

In the case of simply needing to avoid the NUL (zero) byte appearing the output, the following problems all present themselves:

None of the following stores would be possible (a) string containing a NUL, (b) Where the length is <257, >65278 or where the length is a multiple of 256, or the mod256 of the length is 255.
Canonical Huffman trees are built in ascending order, so the most common code is all-zeros, it maybe possible to avoid this by ensure one of the unused codes (257?) is assigned the most common length. From what remember code lengths for values in the alphabet above 256 may be hard-coded?
Transmitting a valid dynamic Huffman tree description may be impossible.
The Gzip header commonly used allows for option fields, the prescense of which is indicated by a non-zero flag. To have a valid Gzip header, no zero-terminated filename or comment be stored, but a CRC or EXTRA secition must be saved in order that the result be non-zero. The EXTRA secion would require padding such that the length of the EXTRA section is less 261 bytes in length and the first extra item is not less than 257 and contains not NUL bytes.
Any stream with a CRC or Alder-32 of zero! Although breaking a gzip stream and appending a new whole stream may work.

This idea would be to have an encoding that is understood by existing decoders and which has an expansion less than the ~1.4 of uudecode and base64, or ~1.27 of Ascii85.

Delta compression

bsdiff
Courgette
diffball
diff
edelta
exediff
gdiff NOTE-gdiff-19970901
rdiff
vcdiff rfc3284
xdelta
zdelta

Instruction set

        ADD     LZ              EOF  COPY          REPAIR   DELETE   REPLACE  ALTER
diffe   Y                                                   Y        Y
rzip    Y       Y
gzip    1       3-258@1-32768   Y
vcdiff  Y       Y               N    Y             
reijers Y                            1-65536       Y                          Y
bsdiff  Y       Y                    Y             Y
succinctY       Y               Y    Y             Y

ADD(length): insert n literal bytes/words
LZ(length,offset): reuse/copy n bytes already generated
EOF: finish processing
COPY(length,position): copy n symbols from reference/base file
REPAIR(...): patch a series of symbols (equivlent to COPY,ADD,COPY)
DELETE(length): remove range from reference file
REPLACE(string1,string2): exchange string1 in base for string2 in output
ALTER(...): patch a series offsets/relative positions

Succinct wins because

Literal data is not necessarily in edit script
Checksums are used to copy external strings, instead of pointers
Very large dictionary range can be exploited through not downloading data twice, even if in the output twice