.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0.3
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding. Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.I GNU gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags. Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time. File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing
files. If you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output. In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d)
decompresses all
specified files. Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.I .bz2,
.I .bz,
.I .tbz2
or
.I .tbz,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.

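As a rough illustration of the naming rule described above, the
following Python sketch maps a compressed file name to the guessed
output name. The names SUFFIX_MAP and guess_output_name are
hypothetical and are not part of bzip2; only the suffix table itself
is taken from this page.

.nf
```python
# Suffix table copied from the list in this page. No recognised suffix
# is the tail of another, so the order of the checks does not matter.
SUFFIX_MAP = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name):
    """Return the decompressed file name guessed from a compressed name."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: len(name) - len(suffix)] + replacement
    # Unrecognised ending: keep the whole name and append .out
    return name + ".out"


for example in ("notes.txt.bz2", "backup.tbz", "data.gz"):
    print(example, "becomes", guess_output_name(example))
```
.fi
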
As with compression, supplying no
filenames causes decompression from
standard input to standard output.

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (\-t)
of concatenated
compressed files is also supported.

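The concatenation behaviour can be tried without touching any files.
The sketch below uses the bz2 module from the Python standard library
rather than the command-line programs, and assumes Python 3.3 or
later, where bz2.decompress accepts multi-stream input. The variable
names are illustrative only.

.nf
```python
import bz2

first = b"first file contents, "
second = b"second file contents"

# Each call produces a complete, independent compressed stream.
joined = bz2.compress(first) + bz2.compress(second)

# Decompressing the concatenation yields the concatenated originals.
restored = bz2.decompress(joined)
print(restored == first + second)
```
.fi
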
You can also compress or decompress files to the standard output by
giving the \-c flag. Multiple files may be compressed and
decompressed like this. The resulting outputs are fed sequentially to
stdout. Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations. Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later. Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.I bzip2 \-dc)
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.I BZIP,
in that order, and will process them
before any arguments read from the command line. This gives a
convenient way to supply default arguments.

Compression is always performed, even if the compressed
file is slightly
larger than the original. Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes. Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.

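The overhead figures above are easy to observe. The following sketch
uses the Python standard library bz2 module, not the bzip2 program
itself, and the exact sizes it prints will vary with the library
version and with the random input, so treat the numbers as indicative
only.

.nf
```python
import bz2
import os

tiny = b"hello"
packed_tiny = bz2.compress(tiny)

# One million bytes from the OS random source: effectively incompressible.
noise = os.urandom(1000000)
packed_noise = bz2.compress(noise)

# Fractional growth of the random input after compression.
expansion = (len(packed_noise) - len(noise)) / len(noise)

print(len(tiny), "bytes in,", len(packed_tiny), "bytes out")
print("random data grew by", format(expansion, ".2%"))
```
.fi
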
As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original. This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely). The
chance of data corruption going undetected is microscopic, about one
in four billion for each file processed. Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong. It can't help you
recover the original uncompressed
data. You can use
.I bzip2recover
to try to recover data from
damaged files.

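To see the check at work, the sketch below damages a compressed stream
and then tries to decompress it, using the Python standard library bz2
module. It assumes the usual stream layout, which this page does not
spell out: a 4-byte stream header, a 6-byte block pattern, then the
stored 32-bit block CRC, so that byte 10 lies inside that CRC.

.nf
```python
import bz2

original = b"The quick brown fox jumps over the lazy dog. " * 100
packed = bytearray(bz2.compress(original))

# Assumed layout: bytes 0-3 stream header, bytes 4-9 block pattern,
# bytes 10-13 stored block CRC. Flip one bit inside that stored CRC.
packed[10] ^= 0x01

try:
    bz2.decompress(bytes(packed))
    detected = False
except (OSError, ValueError):
    # The bz2 module reports damaged streams with these exception types.
    detected = True

print("damage detected:", detected)
```
.fi

As the text says, the check reports the damage but cannot repair it.
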
Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g., a bug) which
caused
.I bzip2
to panic.

.SH OPTIONS
.TP
.B \-c \-\-stdout
Compress or decompress to standard output.
.TP
.B \-d \-\-decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what actions to take is
made on the basis of which name is used. This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z \-\-compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t \-\-test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f \-\-force
Force overwrite of output files. Normally,
.I bzip2
will not overwrite
existing output files. Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the
correct magic header bytes. If forced (\-f), however, it will pass
such files through unmodified. This is how GNU gzip behaves.
.TP
.B \-k \-\-keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s \-\-small
Reduce memory usage, for compression, decompression and testing. Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte. This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio. In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything. See MEMORY MANAGEMENT below.
.TP
.B \-q \-\-quiet
Suppress non-essential warning messages. Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-v \-\-verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L \-\-license \-V \-\-version
Display the software version, license terms and conditions.
.TP
.B \-1 (or \-\-fast) to \-9 (or \-\-best)
Set the block size to 100 k, 200 k ... 900 k when compressing. Has no
effect when decompressing. See MEMORY MANAGEMENT below.
The \-\-fast and \-\-best aliases are primarily for GNU gzip
compatibility. In particular, \-\-fast doesn't make things
significantly faster.
And \-\-best merely selects the default behaviour.
.TP
.B \-\-
Treats all subsequent arguments as file names, even if they start
with a dash. This is so you can handle files with names beginning
with a dash, for example: bzip2 \-\- \-myfilename.
.TP
.B \-\-repetitive\-fast \-\-repetitive\-best
These flags are redundant in versions 0.9.5 and above. They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful. 0.9.5 and above have an
improved algorithm which renders these flags irrelevant.

.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression. The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively. At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file. Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to, and so ignored
during, decompression.

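The stored block size can be seen directly in the first bytes of a
compressed stream. The sketch below uses the Python standard library
bz2 module, whose compresslevel argument plays the role of the flags
\-1 to \-9, and prints the 4-byte stream header for a few levels. The
sample data is arbitrary.

.nf
```python
import bz2

data = b"example payload " * 1000

headers = {}
for level in (1, 5, 9):
    packed = bz2.compress(data, compresslevel=level)
    # The fourth header byte is the ASCII digit of the block-size flag.
    headers[level] = packed[:4]
    print(level, headers[level])
```
.fi

The decompressor needs no level argument because it reads this digit
from the header.
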
Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size ) with \-s

Larger block sizes give rapidly diminishing marginal returns. Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress. To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes. Decompression speed is also halved, so you should use this
option only where necessary. The relevant flag is \-s.

In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved. Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size. The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block. For example, compressing a file
20,000 bytes long with the flag \-9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it. Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.

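The estimates in this section are simple arithmetic, sketched here in
Python. The function names are hypothetical, and "k" is taken to mean
1000 bytes, which is what the worked figures above imply. The results
are estimates from the formulas, not measurements.

.nf
```python
K = 1000  # this page counts sizes in units of 1000 bytes


def compress_bytes(level):
    """Estimated memory allocated when compressing with flag -level."""
    return 400 * K + 8 * level * 100 * K


def decompress_bytes(level, small=False):
    """Estimated memory allocated when decompressing (small means -s)."""
    per_byte = 2.5 if small else 4
    return int(100 * K + per_byte * level * 100 * K)


def touched_bytes(file_size, compressing):
    """Memory actually touched for a file that fits in one block."""
    if compressing:
        return 400 * K + 8 * file_size
    return 100 * K + 4 * file_size


print(compress_bytes(9) // K, decompress_bytes(9) // K)
print(decompress_bytes(9, small=True) // K)
print(touched_bytes(20000, True) // K, touched_bytes(20000, False) // K)
```
.fi
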
Here is a table which summarises the maximum memory usage for different
block sizes. Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900 kbytes long. Each
block is handled independently. If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty. Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.

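The idea of locating blocks by their 48-bit pattern can be sketched in
Python as a bit-by-bit scan. This is only an illustration and makes no
claim to match how bzip2recover is written. The pattern value
0x314159265359 is an assumption taken from the commonly documented
file format, not from this page, and the function names are
hypothetical.

.nf
```python
import bz2

BLOCK_PATTERN = 0x314159265359  # assumed 48-bit block start pattern
MASK_48 = (1 << 48) - 1


def iter_bits(stream):
    """Yield the bits of a byte string, most significant bit first."""
    for byte in stream:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def find_block_starts(stream):
    """Return the bit offsets at which the block pattern begins."""
    offsets = []
    window = 0
    for index, bit in enumerate(iter_bits(stream)):
        # Keep the most recent 48 bits in a sliding window.
        window = ((window << 1) | bit) & MASK_48
        if index >= 47 and window == BLOCK_PATTERN:
            offsets.append(index - 47)
    return offsets


# 256,000 bytes at level 1 (100k blocks) should need three blocks.
sample = bytes(range(256)) * 1000
offsets = find_block_starts(bz2.compress(sample, compresslevel=1))
print(offsets)
```
.fi

Blocks after the first generally do not start on a byte boundary,
which is why the scan works on bits rather than bytes.
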
.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file. You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc., containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 \-dc rec*file.bz2 > recovered_data" -- processes the files in
the correct order.

.I bzip2recover
is most useful when dealing with large .bz2
files, as these will contain many blocks. It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered. If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file. Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..." (repeated several hundred times) may
compress more slowly than normal. Versions 0.9.5 and above fare much
better than previous versions in this respect. The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1. You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion. This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.3 of
.I bzip2.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1 and 1.0.2, but with the following
exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files. 0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.

.I bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent
bit positions in compressed files, so they could not handle compressed
files more than 512 megabytes long. Versions 1.0.2 and above use
64-bit ints on some platforms which support them (GNU supported
targets, and Windows). To establish whether or not bzip2recover was
built with such a limitation, run it without arguments. In any event
you can build yourself an unlimited version if you can recompile it
with MaybeUInt64 set to be an unsigned 64-bit integer.

.SH AUTHOR
Julian Seward, jseward@bzip.org.

http://www.bzip.org

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.I bzip,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.I bzip).
I am much
indebted for their help, support and advice. See the manual in the
source distribution for pointers to sources of documentation. Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression. Bela Lubkin encouraged me to improve the
worst-case compression performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.