How are data stored in a .gz format?
A .gz file contains:
- A prefix
1f 8b
to identify the file as gz - Information about the compression type used, which is always
08
because it supports only one type of compression, deflate - Header flags, informing what data are contained in the header
- 4 byte timestamp
- Identifier of the operating system (most of the operating systems in the specification are long forgotten, Linux isn’t even there)
- Extra information specified in header flags:
- Name
- Comment
- Whether the contents is probably a text file
- Extra content
- The header’s checksum
- The compressed file (always compressed with deflate)
- CRC32 checksum
- Size of the original file (32 bit Little Endian unsigned integer, may overflow)
The main interesting thing about it is the deflate algorithm. It was originally used in the popular .zip format. However, the .zip format has a key functional difference – a single .zip can store multiple deflate-compressed files. This feature was not added to .gz because it was commonly used in combination with .tar for merging files into one, which results in better overall compression.
The deflate algorithm uses two distinct mechanisms of reducing the size of the output. The first one, duplicate string elimination, replaces repeating strings with information which part is repeated, and Huffman code, which uses shorter encoding for more commonly occurring parts.
Duplicate string elimination
Let us assume that we are trying to compress the following text:
The main interesting thing about it is the deflate algorithm.
The text is quite short, so there aren’t many duplications (in a long file, it can check up to 32768 bytes behind). Still, the substring ‘ing ‘ appears twice, substring ‘t i’ appears twice and the substring ‘ th’ also appears twice. Duplications shorter than 3 bytes are ignored (although they could potentially shorten the compressed result in some cases). Thanks to these duplications, we can write the string as:
The main interesting th(copy 4 bytes from 6 bytes behind)about i(copy 3 bytes from 3 bytes behind)s(copy 3 bytes from 18 bytes behind)e deflate algorithm.
Codes for individual bytes and codes for repeating earlier bytes form together 286 codewords (not all must be present):
- 0 – byte 0
- 1 – byte 1
- …
- 255 – byte 255
- 256 – end of section
- 257 – copy 3 bytes
- 258 – copy 4 bytes
- …
- 264 – copy 10 bytes
- 265 – read 1 more bit and depending on it, copy 11 or 12 bytes
- …
- 284 – read 5 more bits and depending on them, copy 227-257 bytes
- 285 – copy 258 bytes
The exact ranges and other numeric constants can be found in the specification.
Every copy codeword is followed by a codeword telling how far behind is the repeated sequence:
- 0 – it’s 1 byte behind
- 1 – it’s 2 bytes behind
- 2 – it’s 3 bytes behind
- 3 – it’s 4 bytes behind
- 4 – read 1 more bit, and depending on it, it’s 5 or 6 bytes behind
- 5 – read 1 more bit, and depending on it, it’s 7 or 8 bytes behind
- 6 – read 2 more bits, and depending on them, it’s 9-12 bytes behind
- …
- 29 – read 13 more bits, and depending on them, it’s 24577-32768 bytes behind
A byte can have only 256 values, but that is not really relevant, because these codewords are then Huffman-encoded with variable length.
Huffman coding
Suppose we have a file that only contains letters A, B, C and D. Normally, each of them would take 8 bits:
- A –
01000001
- B –
01000010
- C –
01000011
- D –
01000100
This is the usual ASCII code used to write text without needing any additional information about the file (something like UTF-8 is probably needed to support other characters, of course). However, knowing that only A, B, C and D can appear in the file, we might use a different encoding into bits:
- A –
00
- B –
01
- C –
10
- D –
11
This is obviously 4 times more space efficient.
Now, what if we had 5 letters, A, B, C, D and E? One solution is evident:
- A –
000
- B –
001
- C –
010
- D –
011
- E –
100
This is still better than ASCII encoding, but every letter takes as much as 50% more space. We can do better than that:
- A –
00
- B –
01
- C –
10
- D –
110
- E –
111
Now, adding a new possibility adds only 20% more space. We need to be careful with choosing the codes, because they are not preceded by information about their length. If there was a 11 sequence, it would be indistinguishable from sequences 110 or 111. It’s necessary to avoid any sequence being the prefix of another.
If these letters are equally common, this is the best we can do. However, in case they aren’t (which is true for files containing text), we can use that for more efficient encoding. Consider the following sequence:
BAACCEACAAAEBAACEABAEDEACEAACAAECCAADAEAACAEADAA
The letter counts are as follows:
- A – 24 times
- B – 3 times
- C – 7 times
- D – 3 times
- E – 9 times
In general encoding, this would be 46 bytes, or 368 bits. 3 bits per character would shorten it to 138 bits. The last last suggested encoding would shorten it to 104.
However, we know that the text contains A and E more often than the others. C is also more common than B and D. This can be used to better choose the codes for letters:
- A –
00
- B –
110
- C –
01
- D –
111
- E –
10
- This would make the result 98 bits long.
And how about using the fact that A is extremely common, and could be even shorter at the expense of other letters?
- A –
0
- B –
100
- C –
101
- D –
110
- E –
111
This would make the result 90 bits long.
Considering that C and E are also more common:
- A –
0
- B –
1110
- C –
110
- D –
1111
- E –
10
This would make the result 87 bits long.
Huffman code in the Deflate encoding
The format has a default table for Huffman encoding, which is typically used for shorter text files where writing out a Huffman table would not be worth the space. It contains all the 286 codes (256 possible byte values + end + 29 repetition lengths)
0000000
– end of block (end of file or switch to different coding)0000001
– 257 (copy 3 bytes)0000010
– 258 (copy 4 bytes)- …
0010111
– 279 (copy 99-114 bytes accordingly to the following 4 bits)00110000
– 0 (byte value 0)00110001
– 1 (byte value 1)- … all ASCII codes are within this range (A is for example
01110001
) 10111111
– 143 (byte value 143)11000000
– 280 (copy 115-130 bytes accordingly to the following 4 bits)11000001
– 281 (copy 131-162 bytes accordingly to the following 5 bits)- …
11000111
– 285 (copy 258 bytes)110010000
– 144 (byte value 144)110010001
– 145 (byte value 145)- …
111111111
– 255 (byte value 255)
Consequentially, all characters of the ASCII table are represented with 8 bits, and some non-ASCII bytes are 9 bytes long to make space and allow the shorter repetition codes could be 7 bits long and the longer repetition codes could be 8 bytes long.
Distance codes telling from where bytes are copied are always saved on 5 bits, with the obvious encoding:
00000
– 1 byte behind00001
– 2 bytes behind00010
– 3 bytes behind00011
– 4 bytes behind00100
– read 1 more bit, and depending on it, it’s 5 or 6 bytes behind- …
11101
– read 13 more bits, and depending on them, it’s 24577-32768 bytes behind
In case a custom Huffman encoding is worth the space its definition takes, it is written at the start of the block.
Let’s get back to the example of Huffman code for letters A, B, C, D and E (now ordered by their codes):
- A –
0
- E –
10
- C –
110
- B –
1110
- D –
1111
The code would work even if shorter codes were starting with ones and longer codes with zeroes, with no drawbacks or advantages. However, the canonical form is to assign codes starting with zeroes to shorter codes, incrementing the starting numbers as the code gets longer and have the longest codes start with many ones.
Assuming all codes are in canonical form, the code above can be unambiguously represented as follows:
- … – 0 (N/A)
- A – 1 (length 1)
- B – 4 (length 4)
- C – 3 (length 3)
- D – 4 (length 4)
- E – 2 (length 2)
- F – 0 (N/A)
- … – 0 (N/A)
A table of these codes must contain all codes until the block ending code, plus some of the copying codes. Maximum length permitted by the .gz format is 15. Because there will be repetitions and the length 0 for letters not present appears a lot, the lengths can be encoded as follows:
- 0 – N/A
- 1 – length 1
- 2 – length 2
- …
- 15 – length 15
- 16 – read 2 more bits, then repeat the previous length 3-6 times accordingly to it
- 17 – read 3 more bits, then add a sequence of 3-10 zeroes accordingly to it
- 18 – read 7 more bits, then add a sequence of 11-138 zeroes accordingly to it
To shorten this table, these 18 values are Huffman-encoded too, with lengths stored in 3 bits each (maximum length is thus 7). Not all have to be present, so the length of code 16 is stored first, the length of code 17 is stored second, followed by 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. The idea behind this is that codes as short as 1 or 2 bits would require some characters to be extremely frequent in the compressed file and codes as long as 14 or 15 are unlikely to be needed, so the table can often be shorter.
Representation inside a file
To showcase the file format without discussing too much stuff simultaneously, let’s think about the most basic case – a file named A
, containing 2 bytes, A
and newline. There’s no way to be clever and compress that.
This is the file’s contents, byte by byte:
- 1f 8b – the prefix for identifying .gz files
- 08 – compression is Deflate – obviously, as it doesn’t support any other compression
- 08 – bit flags, 3rd bit is up, meaning the name of the file is in the header
- 3c 13 87 63 – the time it was created, encoded as a Little Endian unsigned 32 bit integer
- 00 – extra flags – none
- 03 – operating system – 03 means it was a Unix-based OS, Linux or Mac (00 is Windows)
- 41 00 – the name of the file,
A
with terminating zero (present because the file name flag was 1 in the starting bit flags) - 73 e4 02 00 – the Deflate compressed text
A
with newline (01110011 11100100 00000010 00000000 in binary)1
– flag whether this is the last block – value 1 means it is last01
– fixed Huffman encoding – we will use the default table10001110
– ‘A’ (anything in Huffman code is written in reversed bit order, prefixes are unique, not suffixes)01011100
– newline character (opposite order)0000000
– end of block000000
– padding (the next entry is not written by bits and thus needs to start at new byte)
- 07 a1 ea dd – CRC32 of the uncompressed text with seed edb88320 (again, Little Endian)
- 02 00 00 00 – length of the uncompressed text (Little endian, 32 bit unsigned int)
The order of reading of bits may be relatively weird, but allows for very efficient implementation on Little Endian architectures:
- We read bytes 73 e4 02 00 as integer, which produces 0002e473 (highest order byte is on the highest address, or last in array)
- In binary, it is
0000000000101110010001110011
, last bit is1
- Downshifting it by the one bit, we get
000000000010111001000111001
, the last two bytes are01
- Downshifting it by the two bits, we get
0000000000101110010001110
, the last eight bytes are10001110
- Downshifting it by the eight bits, we get
00000000001011100
, the last eight bits are01011100
- Downshifting it by the eight bits, we get
000000000
, the last seven bits are0000000
- The rest is discarded
The Huffman codes have to be written in opposite bit order because we must start reading the values by the highest order bits to determine which one is last. It’s not necessary for numeric values because it requires doing extra operations. It’s doesn’t come with performance impact for Huffman codes because their lookup table can be already pre-reversed.
Fixed Huffman coding block
This type of block is introduced with bits 01 and the Huffman coding of bytes, copy lengths and copy distances is given by the specification. It encodes every character with 8 or 9 bits, but it can remove duplicate strings without adding many bytes long Huffman table at the start of the block. Consequentially, it’s good for short text files.
The main interesting thing about it is the deflate algorithm.
This text with 61 characters + newline leads to a compressed block of size 59 bytes. This is not much, but the text is quite diverse and is too short to benefit much from the compression.
In hex, the encoded text is: 0b c9 48 55 c8 4d cc cc 53 c8 cc 2b 49 2d 4a 2d 2e c9 cc 4b 57 28 c9 00 91 89 49 f9 a5 25 0a 99 40 54 0c 14 49 55 48 49 4d cb 49 2c 49 55 48 cc 49 cf 2f ca 2c c9 c8 d5 e3 02
, which is 00001011 11001001 01001000 01010101 11001000 01001101 11001100 11001100 01010011 11001000 11001100 00101011 01001001 00101101 01001010 00101101 00101110 11001001 11001100 01001011 01010111 00101000 11001001 00000000 10010001 10001001
etc.
The individual parts of the code are:
1
– flag indicating it’s the last block01
– flag indicating it’s a fixed Huffman code00100001
– code for ‘T’ (in reversed bit order)00011001
– ‘h’ (reversed)10101001
– ‘e’00001010
– space10111001
– ‘m’10001001
– ‘a’10011001
– ‘i’01111001
– ‘n’00001010
– space10011001
– ‘i’01111001
– ‘n’00100101
– ‘t’10101001
– ‘e’01000101
– ‘r’10101001
– ‘e’11000101
– ‘s’00100101
– ‘t’10011001
– ‘i’01111001
– ‘n’11101001
– ‘g’00001010
– space00100101
– ‘t’00011001
– ‘h’0100000
– copy 4 bytes (still Huffman-encoded, still the order is reversed)00100
– distance is either 5 or 6 depending on the following bit (the order is reversed, distances are also Huffman codes)1
– distance is 6, so copying 4 bytes from 6 bytes behind (that is, copying ‘ing ‘), the order is not reversed because it’s not a Huffman code10001001
– ‘a’ (again, reversed order)- …
This kind of output can be obtained by running the program infgen with flags -dd
on any file, but it tells only what values are there, not quite what they mean.
Dynamic Huffman coding block
This type of block is introduced with bits 10 and the Huffman coding of the 286 words and 30 distances is defined at its start. This is much more complex than the previous, but almost always gives better compression. Looking at the old text from the Huffman coding example:
BAACCEACAAAEBAACEABAEDEACEAACAAECCAADAEAACAEADAA
The original text contained 48 characters plus newline, and the compressed block is 35 bytes long. A significant portion of it is the definition of the Huffman code.
In hex, the encoded text is 1d c8 b1 11 00 30 08 c3 c0 3e 5b 09 a3 fd 57 0a d0 f8 de 2a 48 24 80 35 96 c2 76 b1 d1 cc 34 77 1c f0
. In binary, it is 00011101 11001000 10110001 00010001 00000000 00110000 00001000 11000011 11000000 00111110 01011011 00001001 10100011 11111101 01010111 00001010 11010000 11111000 11011110 00101010 01001000 00100100 10000000 00110101 10010110 11000010
etc.
Individual parts of the code are:
1
– flag indicating it’s the last block10
– flag indicating it’s a dynamic Huffman code00011
– we have 3 extra codes, therefore there are codes for 256 byte values, 1 exit and 3 lengths (code 259, for copying 5 bytes, is the last)01000
– there are 8 distance codes in addition to the basic one for copying from 1 byte behind, so 9 in total1110
– there are 14 code lengths after lengths for control codes 16, 17, 18 and 0 (which are always enabled), so 18 in total000
– the length code 16 (repeat previous length) is not used011
– the length code 17 (place 3-10 zeroes) has length 3 (assigned101
)011
– the length code 18 (place 11-138 zeroes) has length 3 (assigned110
)100
– the length code 0 has length 4 (assigned1110
)000
– the length code 8 is not used000
– the length code 7 is not used000
– the length code 9 is not used000
– the length code 6 is not used000
– the length code 10 is not used011
– the length code 5 has length 3 (assigned100
)000
– the length code 11 is not used010
– the length code 4 has length 2 (assigned00
)000
– the length code 12 is not used011
– the length code 3 has length 3 (assigned011
)000
– the length code 13 is not used011
– the length code 2 has length 3 (assigned010
)000
– the length code 14 is not used011
– the length code 1 has length 4 (assigned1111
)101
– place 3-10 zeroes (remainder: this is a Huffman code, so the order is reversed)111
– the number of zeroes is 10, therefore characters 0-9 are not used (including 9)001
– character 10 (newline) has length 5 (assigned11110
)011
– place 11-138 zeroes0101011
– the number of zeroes is 54, so characters 11-64 are not used010
– character 65 (letter ‘A’) has length 2 (assigned00
)00
– character 66 (letter ‘B’) has length 4 (assigned1010
)110
– character 67 (letter ‘C’) has length 3 (assigned100
)00
– character 68 (letter ‘D’) has length 4 (assigned1011
)010
– character 69 (letter ‘E’) has length 2 (assigned01
)011
– place 11-138 zeroes1111111
– the number of zeroes is 138, so characters 70-207 are not used011
– place 11-138 zeroes0100101
– the number of zeroes is 48, so codes 208-255 are not used001
– code 256 (end of block) has length 5 (assigned11111
)00
– code 257 (copy 3 characters) has length 4 (assigned1100
)00
– code 258 (copy 4 characters) has length 4 (assigned1101
)00
– code 259 (copy 5 characters) has length 4 (assigned1110
)101
– start the distance table by placing 3-10 zeroes001
– the number of zeroes is 4, so distance codes 0-3 are not used110
– distance code 4 (5-6 characters back) has length 3 (assigned110
)0111
– distance code 5 is not used1111
– distance code code 6 (9-12 characters back) has length 1 (assigned0
)110
– distance code 7 (13-16 characters back) has length 3 (assigned111
)010
– distance code 8 (17-24 characters back) has length 2 (assigned10
)0101
– code for ‘B’00
– ‘A’00
– ‘A’001
– ‘C’001
– ‘C’10
– ‘E’00
– ‘A’001
– ‘C’00
– ‘A’00
– ‘A’00
– ‘A’10
– ‘E’0101
– ‘B’0011
– copy 3 bytes0
– copy from 9-12 bytes behind11
– the distance of copying is 12 (that is, copying ‘AAC’)10
– ‘E’00
– ‘A’0101
– ‘B’00
– ‘A’10
– ‘E’1101
– ‘D’10
– ‘E’1011
– copy 4 bytes0
– copy from 9-12 characters behind00
– the distance of copying is 9 (that is, copying ‘ACEA’)- …
Again, it’s possible to use infgen to see the most important of this information for any file.
Literal block
Sometimes, the values of bytes are too evenly distributed (as in noisy data) to have any benefit from Huffman coding and repetitions of 3 or more bytes are uncommon, there may be too little benefit from custom Huffman encoding and default encoding could make the file actually larger because of writing non-ASCII characters using 9 bits rather than 8. In that case, it’s better to use no compression at all. That is what the literal block is for.
A good example for this can be this sequence of UTF-8 characters:
čóšéňáďôž
Each of the characters takes 2 bytes (except the ending newline). The encoded string is 01 13 00 ec ff c4 8d c3 b3 c5 a1 c3 a9 c5 88 c3 a1 c4 8f c3 b4 c5 be 0a
. For obvious reasons, it will never be shorter than the uncompressed data. Parsing it is much simpler than the others:
1
– yes, it is the last block00
– identification as a literal block00000
– padding, we are going to read by bytes13 00
– literal block is 19 bytes long (16 bit unsigned integer, Little Endian)ec ff c4 8d c3 b3 c5 a1 c3 a9 c5 88 c3 a1 c4 8f c3 b4 c5 be 0a
– the entire string without any changes