source: trunk/minix/commands/bzip2-1.0.3/bzip2.txt@ 9

Last change on this file since 9 was 9, checked in by Mattia Monga, 13 years ago
Minix 3.1.2a
File size: 18.5 KB

1
2	NAME
3	bzip2, bunzip2 - a block-sorting file compressor, v1.0.3
4	bzcat - decompresses files to stdout
5	bzip2recover - recovers data from damaged bzip2 files
6
7
8	SYNOPSIS
9	bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
10	bunzip2 [ -fkvsVL ] [ filenames ... ]
11	bzcat [ -s ] [ filenames ... ]
12	bzip2recover filename
13
14
15	DESCRIPTION
16	bzip2 compresses files using the Burrows-Wheeler block
17	sorting text compression algorithm, and Huffman coding.
18	Compression is generally considerably better than that
19	achieved by more conventional LZ77/LZ78-based compressors,
20	and approaches the performance of the PPM family of sta-
21	tistical compressors.
22
23	The command-line options are deliberately very similar to
24	those of GNU gzip, but they are not identical.
25
26	bzip2 expects a list of file names to accompany the com-
27	mand-line flags. Each file is replaced by a compressed
28	version of itself, with the name "original_name.bz2".
29	Each compressed file has the same modification date, per-
30	missions, and, when possible, ownership as the correspond-
31	ing original, so that these properties can be correctly
32	restored at decompression time. File name handling is
33	naive in the sense that there is no mechanism for preserv-
34	ing original file names, permissions, ownerships or dates
35	in filesystems which lack these concepts, or have serious
36	file name length restrictions, such as MS-DOS.
37
38	bzip2 and bunzip2 will by default not overwrite existing
39	files. If you want this to happen, specify the -f flag.
40
41	If no file names are specified, bzip2 compresses from
42	standard input to standard output. In this case, bzip2
43	will decline to write compressed output to a terminal, as
44	this would be entirely incomprehensible and therefore
45	pointless.
46
47	bunzip2 (or bzip2 -d) decompresses all specified files.
48	Files which were not created by bzip2 will be detected and
49	ignored, and a warning issued. bzip2 attempts to guess
50	the filename for the decompressed file from that of the
51	compressed file as follows:
52
53	filename.bz2 becomes filename
54	filename.bz becomes filename
55	filename.tbz2 becomes filename.tar
56	filename.tbz becomes filename.tar
57	anyothername becomes anyothername.out
58
59	If the file does not end in one of the recognised endings,
60	.bz2, .bz, .tbz2 or .tbz, bzip2 complains that it cannot
61	guess the name of the original file, and uses the original
62	name with .out appended.
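The renaming rule above can be sketched in a few lines of Python. This is an illustration only, not bzip2's own code; the names SUFFIX_MAP and guess_output_name are hypothetical.

```python
# Sketch of the output-name rule described above (not bzip2's own code).
# Each recognised ending is paired with the text that replaces it.
SUFFIX_MAP = [
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
    (".bz2", ""),
    (".bz", ""),
]


def guess_output_name(name: str) -> str:
    """Return the decompressed file name the manual describes for `name`."""
    for ending, replacement in SUFFIX_MAP:
        if name.endswith(ending):
            return name[: -len(ending)] + replacement
    # Unrecognised ending: keep the whole name and append ".out".
    return name + ".out"


for example in ("filename.bz2", "filename.tbz", "anyothername"):
    print(example, "->", guess_output_name(example))
```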
63
64	As with compression, supplying no filenames causes decom-
65	pression from standard input to standard output.
66
67	bunzip2 will correctly decompress a file which is the con-
68	catenation of two or more compressed files. The result is
69	the concatenation of the corresponding uncompressed files.
70	Integrity testing (-t) of concatenated compressed files is
71	also supported.
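A small way to see the concatenation behaviour without the command-line tools is Python's standard bz2 module, which reads the same format. That substitution is an assumption of this sketch; bz2.decompress accepts multi-stream input from Python 3.3 onwards.

```python
import bz2

# Two independently compressed streams.
first = bz2.compress(b"hello, ")
second = bz2.compress(b"world\n")

# Joining the compressed streams byte for byte gives valid input;
# decompressing it yields the two originals joined together.
combined = first + second
restored = bz2.decompress(combined)
print(restored)
```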
72
73	You can also compress or decompress files to the standard
74	output by giving the -c flag. Multiple files may be com-
75	pressed and decompressed like this. The resulting outputs
76	are fed sequentially to stdout. Compression of multiple
77	files in this manner generates a stream containing multi-
78	ple compressed file representations. Such a stream can be
79	decompressed correctly only by bzip2 version 0.9.0 or
80	later. Earlier versions of bzip2 will stop after decom-
81	pressing the first file in the stream.
82
83	bzcat (or bzip2 -dc) decompresses all specified files to
84	the standard output.
85
86	bzip2 will read arguments from the environment variables
87	BZIP2 and BZIP, in that order, and will process them
88	before any arguments read from the command line. This
89	gives a convenient way to supply default arguments.
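The ordering described above can be sketched as follows. The function name effective_arguments is hypothetical, and splitting each variable on whitespace is an assumption of this sketch rather than something the manual states.

```python
import os


def effective_arguments(argv, environ=os.environ):
    """Arguments in the order described: BZIP2, then BZIP, then argv."""
    args = []
    for variable in ("BZIP2", "BZIP"):
        # Assumption: the variable's value is split on whitespace.
        args.extend(environ.get(variable, "").split())
    return args + list(argv)


# Defaults from the environment come first, the command line last.
print(effective_arguments(["-v", "notes.txt"], {"BZIP2": "-k -9"}))
```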
90
91	Compression is always performed, even if the compressed
92	file is slightly larger than the original. Files of less
93	than about one hundred bytes tend to get larger, since the
94	compression mechanism has a constant overhead in the
95	region of 50 bytes. Random data (including the output of
96	most file compressors) is coded at about 8.05 bits per
97	byte, giving an expansion of around 0.5%.
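Both effects are easy to observe with Python's standard bz2 module, used here as a stand-in for the bzip2 program (an assumption of this sketch). The exact numbers printed will vary, since the random input differs on every run.

```python
import bz2
import os

# A tiny input: the fixed overhead outweighs any saving.
tiny = b"hello"
print(len(tiny), "->", len(bz2.compress(tiny)), "bytes")

# Random input: there is no redundancy to exploit, so the output
# comes out slightly larger than the input.
noise = os.urandom(200_000)
packed = bz2.compress(noise)
bits_per_byte = 8 * len(packed) / len(noise)
print(f"random data: {bits_per_byte:.2f} bits per input byte")
```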
98
99	As a self-check for your protection, bzip2 uses 32-bit
100	CRCs to make sure that the decompressed version of a file
101	is identical to the original. This guards against corrup-
102	tion of the compressed data, and against undetected bugs
103	in bzip2 (hopefully very unlikely). The chance of data
104	corruption going undetected is microscopic, about one
105	chance in four billion for each file processed. Be aware,
106	though, that the check occurs upon decompression, so it
107	can only tell you that something is wrong. It can't help
108	you recover the original uncompressed data. You can use
109	bzip2recover to try to recover data from damaged files.
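The following sketch damages one byte of a compressed stream and checks that decompression refuses it. It uses Python's standard bz2 module rather than the bzip2 program, and corruption_is_detected is a hypothetical name. The rejection may come from the CRC comparison or from an earlier structural check in the decoder; the sketch does not distinguish the two.

```python
import bz2
import random


def corruption_is_detected(compressed: bytes) -> bool:
    """Invert one byte in the middle and see whether decompression objects."""
    damaged = bytearray(compressed)
    damaged[len(damaged) // 2] ^= 0xFF
    try:
        bz2.decompress(bytes(damaged))
    except (OSError, ValueError, EOFError):
        return True   # the decoder refused the damaged stream
    return False      # the damage went unnoticed


# Reproducible, moderately compressible sample data (an arbitrary choice).
rng = random.Random(0)
sample = " ".join(rng.choice(["alpha", "beta", "gamma", "delta"])
                  for _ in range(20_000)).encode()
print(corruption_is_detected(bz2.compress(sample)))
```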
110
111	Return values: 0 for a normal exit, 1 for environmental
112	problems (file not found, invalid flags, I/O errors, etc.),
113	2 to indicate a corrupt compressed file, 3 for an internal
114	consistency error (e.g., a bug) which caused bzip2 to panic.
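For scripts that wrap bzip2, the return values can be kept in a small lookup table. The names below are hypothetical and the descriptions paraphrase the list above.

```python
# Exit statuses as listed above; the descriptions paraphrase the manual.
EXIT_MEANINGS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error (bzip2 panicked)",
}


def describe_exit_status(status: int) -> str:
    """Translate a bzip2 exit status into a short description."""
    return EXIT_MEANINGS.get(status, f"undocumented status {status}")


print(describe_exit_status(2))
```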
115
116
117	OPTIONS
118	-c --stdout
119	Compress or decompress to standard output.
120
121	-d --decompress
122	Force decompression. bzip2, bunzip2 and bzcat are
123	really the same program, and the decision about
124	what actions to take is done on the basis of which
125	name is used. This flag overrides that mechanism,
126	and forces bzip2 to decompress.
127
128	-z --compress
129	The complement to -d: forces compression,
130	regardless of the invocation name.
131
132	-t --test
133	Check integrity of the specified file(s), but don't
134	decompress them. This really performs a trial
135	decompression and throws away the result.
136
137	-f --force
138	Force overwrite of output files. Normally, bzip2
139	will not overwrite existing output files. Also
140	forces bzip2 to break hard links to files, which it
141	otherwise wouldn't do.
142
143	bzip2 normally declines to decompress files which
144	don't have the correct magic header bytes. If
145	forced (-f), however, it will pass such files
146	through unmodified. This is how GNU gzip behaves.
147
148	-k --keep
149	Keep (don't delete) input files during compression
150	or decompression.
151
152	-s --small
153	Reduce memory usage, for compression, decompression
154	and testing. Files are decompressed and tested
155	using a modified algorithm which only requires 2.5
156	bytes per block byte. This means any file can be
157	decompressed in 2300k of memory, albeit at about
158	half the normal speed.
159
160	During compression, -s selects a block size of
161	200k, which limits memory use to around the same
162	figure, at the expense of your compression ratio.
163	In short, if your machine is low on memory (8
164	megabytes or less), use -s for everything. See
165	MEMORY MANAGEMENT below.
166
167	-q --quiet
168	Suppress non-essential warning messages. Messages
169	pertaining to I/O errors and other critical events
170	will not be suppressed.
171
172	-v --verbose
173	Verbose mode -- show the compression ratio for each
174	file processed. Further -v's increase the ver-
175	bosity level, spewing out lots of information which
176	is primarily of interest for diagnostic purposes.
177
178	-L --license -V --version
179	Display the software version, license terms and
180	conditions.
181
182	-1 (or --fast) to -9 (or --best)
183	Set the block size to 100 k, 200 k .. 900 k when
184	compressing. Has no effect when decompressing.
185	See MEMORY MANAGEMENT below. The --fast and --best
186	aliases are primarily for GNU gzip compatibility.
187	In particular, --fast doesn't make things signifi-
188	cantly faster. And --best merely selects the
189	default behaviour.
190
191	-- Treats all subsequent arguments as file names, even
192	if they start with a dash. This is so you can han-
193	dle files with names beginning with a dash, for
194	example: bzip2 -- -myfilename.
195
196	--repetitive-fast --repetitive-best
197	These flags are redundant in versions 0.9.5 and
198	above. They provided some coarse control over the
199	behaviour of the sorting algorithm in earlier ver-
200	sions, which was sometimes useful. 0.9.5 and above
201	have an improved algorithm which renders these
202	flags irrelevant.
203
204
205	MEMORY MANAGEMENT
206	bzip2 compresses large files in blocks. The block size
207	affects both the compression ratio achieved, and the
208	amount of memory needed for compression and decompression.
209	The flags -1 through -9 specify the block size to be
210	100,000 bytes through 900,000 bytes (the default) respec-
211	tively. At decompression time, the block size used for
212	compression is read from the header of the compressed
213	file, and bunzip2 then allocates itself just enough memory
214	to decompress the file. Since block sizes are stored in
215	compressed files, it follows that the flags -1 to -9 are
216	irrelevant to and so ignored during decompression.
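In the bzip2 file format the stream begins with the ASCII characters "BZh" followed by a digit from 1 to 9 giving the block size in units of 100,000 bytes. The sketch below reads that digit back. It uses Python's standard bz2 module to produce test streams, and block_size_from_header is a hypothetical name.

```python
import bz2


def block_size_from_header(compressed: bytes) -> int:
    """Return the block size in bytes recorded at the start of a stream."""
    if compressed[:3] != b"BZh":
        raise ValueError("not a bzip2 stream")
    level = int(chr(compressed[3]))   # ASCII digit '1'..'9'
    return level * 100_000


for level in (1, 5, 9):
    stream = bz2.compress(b"example data", compresslevel=level)
    print(level, block_size_from_header(stream))
```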
217
218	Compression and decompression requirements, in bytes, can
219	be estimated as:
220
221	Compression: 400k + ( 8 x block size )
222
223	Decompression: 100k + ( 4 x block size ), or
224	100k + ( 2.5 x block size ) with -s
225
226	Larger block sizes give rapidly diminishing marginal
227	returns. Most of the compression comes from the first two
228	or three hundred k of block size, a fact worth bearing in
229	mind when using bzip2 on small machines. It is also
230	important to appreciate that the decompression memory
231	requirement is set at compression time by the choice of
232	block size.
233
234	For files compressed with the default 900k block size,
235	bunzip2 will require about 3700 kbytes to decompress. To
236	support decompression of any file on a 4 megabyte machine,
237	bunzip2 has an option to decompress using approximately
238	half this amount of memory, about 2300 kbytes. Decompres-
239	sion speed is also halved, so you should use this option
240	only where necessary. The relevant flag is -s.
241
242	In general, try to use the largest block size memory
243	constraints allow, since that maximises the compression
244	achieved. Compression and decompression speed are virtu-
245	ally unaffected by block size.
246
247	Another significant point applies to files which fit in a
248	single block -- that means most files you'd encounter
249	using a large block size. The amount of real memory
250	touched is proportional to the size of the file, since the
251	file is smaller than a block. For example, compressing a
252	file 20,000 bytes long with the flag -9 will cause the
253	compressor to allocate around 7600k of memory, but only
254	touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the
255	decompressor will allocate 3700k but only touch 100k +
256	20000 * 4 = 180 kbytes.
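The arithmetic in this example can be reproduced with the estimates given earlier in this section. The sketch takes "k" to mean 1000 bytes, as the sum 400k + 20000 * 8 = 560 kbytes implies; the function names are hypothetical.

```python
K = 1000  # the manual's "k", taken as 1000 bytes


def block_size(level: int) -> int:
    """Block size in bytes for flags -1 to -9."""
    return level * 100 * K


def memory_estimate(fixed_k: int, per_byte: float, data_bytes: int) -> float:
    """Fixed overhead plus a per-byte cost, in bytes."""
    return fixed_k * K + per_byte * data_bytes


level, file_size = 9, 20_000
# Allocation depends on the block size; touched memory on the data present.
allocated = memory_estimate(400, 8, block_size(level))
touched = memory_estimate(400, 8, min(file_size, block_size(level)))
print(f"compress:   allocate {allocated / K:.0f}k, touch {touched / K:.0f}k")

allocated = memory_estimate(100, 4, block_size(level))
touched = memory_estimate(100, 4, min(file_size, block_size(level)))
print(f"decompress: allocate {allocated / K:.0f}k, touch {touched / K:.0f}k")
```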
257
258	Here is a table which summarises the maximum memory usage
259	for different block sizes. Also recorded is the total
260	compressed size for 14 files of the Calgary Text Compres-
261	sion Corpus totalling 3,141,622 bytes. This column gives
262	some feel for how compression varies with block size.
263	These figures tend to understate the advantage of larger
264	block sizes for larger files, since the Corpus is domi-
265	nated by smaller files.
266
267	        Compress   Decompress   Decompress   Corpus
268	Flag     usage       usage      -s usage     Size
269
270	 -1      1200k        500k         350k      914704
271	 -2      2000k        900k         600k      877703
272	 -3      2800k       1300k         850k      860338
273	 -4      3600k       1700k        1100k      846899
274	 -5      4400k       2100k        1350k      845160
275	 -6      5200k       2500k        1600k      838626
276	 -7      6100k       2900k        1850k      834096
277	 -8      6800k       3300k        2100k      828642
278	 -9      7600k       3700k        2350k      828642
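The Corpus column is easier to compare when expressed as a fraction of the 3,141,622 input bytes. The sketch below only re-expresses the figures from the table; nothing is measured.

```python
CORPUS_BYTES = 3_141_622

# Compressed corpus size per flag, copied from the table above.
COMPRESSED = {1: 914704, 2: 877703, 3: 860338, 4: 846899, 5: 845160,
              6: 838626, 7: 834096, 8: 828642, 9: 828642}

for level, size in COMPRESSED.items():
    ratio = size / CORPUS_BYTES
    print(f"-{level}: {100 * ratio:.1f}% of original, "
          f"{8 * ratio:.2f} bits per byte")
```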
279
280
281	RECOVERING DATA FROM DAMAGED FILES
282	bzip2 compresses files in blocks, usually 900 kbytes long.
283	Each block is handled independently. If a media or trans-
284	mission error causes a multi-block .bz2 file to become
285	damaged, it may be possible to recover data from the
286	undamaged blocks in the file.
287
288	The compressed representation of each block is delimited
289	by a 48-bit pattern, which makes it possible to find the
290	block boundaries with reasonable certainty. Each block
291	also carries its own 32-bit CRC, so damaged blocks can be
292	distinguished from undamaged ones.
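In the bzip2 file format the 48-bit block-start pattern is 0x314159265359. Blocks after the first need not begin on a byte boundary, so a search has to work bit by bit. The sketch below shows the idea only; it is not bzip2recover's code. It uses Python's standard bz2 module to build a multi-block stream, and find_block_starts is a hypothetical name.

```python
import bz2
import os

BLOCK_MAGIC = 0x314159265359  # 48-bit block-start pattern of the bzip2 format


def find_block_starts(stream: bytes) -> list:
    """Bit offsets at which the block-start pattern occurs."""
    # Render the whole stream as a string of '0'/'1' characters so that
    # an ordinary substring search works at any bit alignment.
    bits = bin(int.from_bytes(stream, "big"))[2:].zfill(8 * len(stream))
    pattern = format(BLOCK_MAGIC, "048b")
    offsets, position = [], bits.find(pattern)
    while position != -1:
        offsets.append(position)
        position = bits.find(pattern, position + 1)
    return offsets


# 250,000 incompressible bytes at -1 (100k blocks) need several blocks.
stream = bz2.compress(os.urandom(250_000), compresslevel=1)
print(find_block_starts(stream))
```

The first offset is 32 bits because the first block follows the four-byte "BZh" header directly.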
293
294	bzip2recover is a simple program whose purpose is to
295	search for blocks in .bz2 files, and write each block out
296	into its own .bz2 file. You can then use bzip2 -t to test
297	the integrity of the resulting files, and decompress those
298	which are undamaged.
299
300	bzip2recover takes a single argument, the name of the dam-
301	aged file, and writes a number of files
302	"rec00001file.bz2", "rec00002file.bz2", etc, containing
303	the extracted blocks. The output filenames are
304	designed so that the use of wildcards in subsequent processing
305	-- for example, "bzip2 -dc rec*file.bz2 > recovered_data" --
306	processes the files in the correct order.
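The reason the wildcard works is the zero-padded counter in each name. The sketch below assumes a five-digit counter, as in the example names above; recovered_name is a hypothetical helper, not part of bzip2recover.

```python
def recovered_name(index: int, original: str) -> str:
    # Assumption: five-digit zero-padded counter, as in "rec00001file.bz2".
    return f"rec{index:05d}{original}"


names = [recovered_name(i, "file.bz2") for i in (12, 2, 1, 100)]

# Zero padding makes plain lexicographic order match numeric order,
# which is typically the order in which a shell expands a wildcard.
print(sorted(names))
```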
307
308	bzip2recover should be of most use dealing with large .bz2
309	files, as these will contain many blocks. It is clearly
310	futile to use it on damaged single-block files, since a
311	damaged block cannot be recovered. If you wish to min-
312	imise any potential data loss through media or transmis-
313	sion errors, you might consider compressing with a smaller
314	block size.
315
316
317	PERFORMANCE NOTES
318	The sorting phase of compression gathers together similar
319	strings in the file. Because of this, files containing
320	very long runs of repeated symbols, like "aabaabaabaab
321	..." (repeated several hundred times) may compress more
322	slowly than normal. Versions 0.9.5 and above fare much
323	better than previous versions in this respect. The ratio
324	between worst-case and average-case compression time is in
325	the region of 10:1. For previous versions, this figure
326	was more like 100:1. You can use the -vvvv option to mon-
327	itor progress in great detail, if you want.
328
329	Decompression speed is unaffected by these phenomena.
330
331	bzip2 usually allocates several megabytes of memory to
332	operate in, and then charges all over it in a fairly ran-
333	dom fashion. This means that performance, both for com-
334	pressing and decompressing, is largely determined by the
335	speed at which your machine can service cache misses.
336	Because of this, small changes to the code to reduce the
337	miss rate have been observed to give disproportionately
338	large performance improvements. I imagine bzip2 will per-
339	form best on machines with very large caches.
340
341
342	CAVEATS
343	I/O error messages are not as helpful as they could be.
344	bzip2 tries hard to detect I/O errors and exit cleanly,
345	but the details of what the problem is sometimes seem
346	rather misleading.
347
348	This manual page pertains to version 1.0.3 of bzip2. Com-
349	pressed data created by this version is entirely forwards
350	and backwards compatible with the previous public
351	releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1 and
352	1.0.2, but with the following exception: 0.9.0 and above
353	can correctly decompress multiple concatenated compressed
354	files. 0.1pl2 cannot do this; it will stop after decom-
355	pressing just the first file in the stream.
356
357	bzip2recover versions prior to 1.0.2 used 32-bit integers
358	to represent bit positions in compressed files, so they
359	could not handle compressed files more than 512 megabytes
360	long. Versions 1.0.2 and above use 64-bit ints on some
361	platforms which support them (GNU supported targets, and
362	Windows). To establish whether or not bzip2recover was
363	built with such a limitation, run it without arguments.
364	In any event you can build yourself an unlimited version
365	if you can recompile it with MaybeUInt64 set to be an
366	unsigned 64-bit integer.
367
368
369	AUTHOR
370	Julian Seward, jseward@bzip.org.
371
372	http://www.bzip.org
373
374	The ideas embodied in bzip2 are due to (at least) the fol-
375	lowing people: Michael Burrows and David Wheeler (for the
376	block sorting transformation), David Wheeler (again, for
377	the Huffman coder), Peter Fenwick (for the structured cod-
378	ing model in the original bzip, and many refinements), and
379	Alistair Moffat, Radford Neal and Ian Witten (for the
380	arithmetic coder in the original bzip). I am much
381	indebted for their help, support and advice. See the man-
382	ual in the source distribution for pointers to sources of
383	documentation. Christian von Roques encouraged me to look
384	for faster sorting algorithms, so as to speed up compres-
385	sion. Bela Lubkin encouraged me to improve the worst-case
386	compression performance. Donna Robinson XMLised the docu-
387	mentation. The bz* scripts are derived from those of GNU
388	gzip. Many people sent patches, helped with portability
389	problems, lent machines, gave advice and were generally
390	helpful.
391
