source: trunk/minix/commands/bzip2-1.0.3/bzip2.txt@ 10

Last change on this file since 10 was 9, checked in by Mattia Monga, 14 years ago
Minix 3.1.2a
File size: 18.5 KB

Rev	Line
[9]	1
	2	NAME
	3	bzip2, bunzip2 - a block-sorting file compressor, v1.0.3
	4	bzcat - decompresses files to stdout
	5	bzip2recover - recovers data from damaged bzip2 files
	6
	7
	8	SYNOPSIS
	9	bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
	10	bunzip2 [ -fkvsVL ] [ filenames ... ]
	11	bzcat [ -s ] [ filenames ... ]
	12	bzip2recover filename
	13
	14
	15	DESCRIPTION
	16	bzip2 compresses files using the Burrows-Wheeler block
	17	sorting text compression algorithm, and Huffman coding.
	18	Compression is generally considerably better than that
	19	achieved by more conventional LZ77/LZ78-based compressors,
	20	and approaches the performance of the PPM family of sta-
	21	tistical compressors.
	22
	23	The command-line options are deliberately very similar to
	24	those of GNU gzip, but they are not identical.
	25
	26	bzip2 expects a list of file names to accompany the com-
	27	mand-line flags. Each file is replaced by a compressed
	28	version of itself, with the name "original_name.bz2".
	29	Each compressed file has the same modification date, per-
	30	missions, and, when possible, ownership as the correspond-
	31	ing original, so that these properties can be correctly
	32	restored at decompression time. File name handling is
	33	naive in the sense that there is no mechanism for preserv-
	34	ing original file names, permissions, ownerships or dates
	35	in filesystems which lack these concepts, or have serious
	36	file name length restrictions, such as MS-DOS.
	37
	38	bzip2 and bunzip2 will by default not overwrite existing
	39	files. If you want this to happen, specify the -f flag.
	40
	41	If no file names are specified, bzip2 compresses from
	42	standard input to standard output. In this case, bzip2
	43	will decline to write compressed output to a terminal, as
	44	this would be entirely incomprehensible and therefore
	45	pointless.
	46
	47	bunzip2 (or bzip2 -d) decompresses all specified files.
	48	Files which were not created by bzip2 will be detected and
	49	ignored, and a warning issued. bzip2 attempts to guess
	50	the filename for the decompressed file from that of the
	51	compressed file as follows:
	52
	53	filename.bz2 becomes filename
	54	filename.bz becomes filename
	55	filename.tbz2 becomes filename.tar
	56	filename.tbz becomes filename.tar
	57	anyothername becomes anyothername.out
	58
	59	If the file does not end in one of the recognised endings,
	60	.bz2, .bz, .tbz2 or .tbz, bzip2 complains that it cannot
	61	guess the name of the original file, and uses the original
	62	name with .out appended.
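The naming rules above can be sketched in a few lines of Python. This is only a model of the documented behaviour, not bzip2's own code; the names SUFFIX_RULES and guess_output_name are made up for the illustration.

```python
# Suffix rules copied from the list above. No listed suffix is the tail
# of another, so the order of the entries does not matter.
SUFFIX_RULES = {
    ".bz2": "",
    ".bz": "",
    ".tbz2": ".tar",
    ".tbz": ".tar",
}


def guess_output_name(name: str) -> str:
    """Return the decompressed file name the documented rules would give."""
    for suffix, replacement in SUFFIX_RULES.items():
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # Unrecognised ending: keep the whole name and append ".out".
    return name + ".out"


for example in ("filename.bz2", "filename.tbz", "anyothername"):
    print(example, "becomes", guess_output_name(example))
```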
	63
	64	As with compression, supplying no filenames causes decom-
	65	pression from standard input to standard output.
	66
	67	bunzip2 will correctly decompress a file which is the con-
	68	catenation of two or more compressed files. The result is
	69	the concatenation of the corresponding uncompressed files.
	70	Integrity testing (-t) of concatenated compressed files is
	71	also supported.
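The concatenation property can be tried out with Python's standard bz2 module. That module wraps the compression library rather than the bunzip2 program, so treat this as an illustration of the stream format, under the assumption that both behave alike here.

```python
import bz2

first = bz2.compress(b"alpha\n")
second = bz2.compress(b"beta\n")

# Byte-wise concatenation of two complete compressed streams.
combined = first + second

# bz2.decompress() accepts multi-stream input and joins the outputs.
restored = bz2.decompress(combined)
print(restored)  # b'alpha\nbeta\n'
```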
	72
	73	You can also compress or decompress files to the standard
	74	output by giving the -c flag. Multiple files may be com-
	75	pressed and decompressed like this. The resulting outputs
	76	are fed sequentially to stdout. Compression of multiple
	77	files in this manner generates a stream containing multi-
	78	ple compressed file representations. Such a stream can be
	79	decompressed correctly only by bzip2 version 0.9.0 or
	80	later. Earlier versions of bzip2 will stop after decom-
	81	pressing the first file in the stream.
	82
	83	bzcat (or bzip2 -dc) decompresses all specified files to
	84	the standard output.
	85
	86	bzip2 will read arguments from the environment variables
	87	BZIP2 and BZIP, in that order, and will process them
	88	before any arguments read from the command line. This
	89	gives a convenient way to supply default arguments.
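A rough model of the argument order just described, in Python. The function name effective_arguments is hypothetical, and splitting each variable on whitespace is an assumption of this sketch; the text above only states the order in which the sources are read.

```python
def effective_arguments(environ: dict, argv: list) -> list:
    """Model the documented order: BZIP2, then BZIP, then the command line."""
    args = []
    for variable in ("BZIP2", "BZIP"):
        # Assumption: the value is split on whitespace into separate flags.
        args.extend(environ.get(variable, "").split())
    args.extend(argv)
    return args


print(effective_arguments({"BZIP2": "-k -v", "BZIP": "-9"}, ["file.txt"]))
```

With BZIP2 set to "-k -v", every plain "bzip2 file.txt" would then keep its input and report the compression ratio.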
	90
	91	Compression is always performed, even if the compressed
	92	file is slightly larger than the original. Files of less
	93	than about one hundred bytes tend to get larger, since the
	94	compression mechanism has a constant overhead in the
	95	region of 50 bytes. Random data (including the output of
	96	most file compressors) is coded at about 8.05 bits per
	97	byte, giving an expansion of around 0.5%.
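Both effects are easy to observe with Python's standard bz2 module, which is assumed here to produce the same stream format as the bzip2 program. The exact sizes printed vary from run to run because the second input is random.

```python
import bz2
import os

tiny = b"hello"
tiny_packed = bz2.compress(tiny)
# A very short input grows, because the fixed overhead dominates.
print(len(tiny), "->", len(tiny_packed), "bytes")

noise = os.urandom(100_000)  # stand-in for incompressible data
noise_packed = bz2.compress(noise)
# Random bytes cannot be compressed, so the output is slightly larger.
growth_percent = 100.0 * (len(noise_packed) - len(noise)) / len(noise)
print(f"random data grew by {growth_percent:.2f}%")
```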
	98
	99	As a self-check for your protection, bzip2 uses 32-bit
	100	CRCs to make sure that the decompressed version of a file
	101	is identical to the original. This guards against corrup-
	102	tion of the compressed data, and against undetected bugs
	103	in bzip2 (hopefully very unlikely). The chance of data
	104	corruption going undetected is microscopic, about one
	105	chance in four billion for each file processed. Be aware,
	106	though, that the check occurs upon decompression, so it
	107	can only tell you that something is wrong. It can't help
	108	you recover the original uncompressed data. You can use
	109	bzip2recover to try to recover data from damaged files.
	110
	111	Return values: 0 for a normal exit, 1 for environmental
	112	problems (file not found, invalid flags, I/O errors, etc.),
	113	2 to indicate a corrupt compressed file, 3 for an internal
	114	consistency error (e.g. a bug) which caused bzip2 to panic.
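A script that calls bzip2 can turn these return values into messages. The sketch below is one possible way to do it in Python; EXIT_MEANINGS, describe_exit and check_archive are names invented for the example, and check_archive assumes a bzip2 executable is available on the search path.

```python
import subprocess

# Meanings taken from the list of return values above.
EXIT_MEANINGS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error",
}


def describe_exit(code: int) -> str:
    """Translate a bzip2 exit status into the wording used above."""
    return EXIT_MEANINGS.get(code, f"undocumented exit status {code}")


def check_archive(path: str) -> str:
    """Run an integrity test (-t) on one file and describe the outcome."""
    # Assumption: "bzip2" resolves to the program this page documents.
    result = subprocess.run(["bzip2", "-t", path])
    return describe_exit(result.returncode)


print(describe_exit(2))
```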
	115
	116
	117	OPTIONS
	118	-c --stdout
	119	Compress or decompress to standard output.
	120
	121	-d --decompress
	122	Force decompression. bzip2, bunzip2 and bzcat are
	123	really the same program, and the decision about
	124	what actions to take is done on the basis of which
	125	name is used. This flag overrides that mechanism,
	126	and forces bzip2 to decompress.
	127
	128	-z --compress
	129	The complement to -d: forces compression,
	130	regardless of the invocation name.
	131
	132	-t --test
	133	Check integrity of the specified file(s), but don't
	134	decompress them. This really performs a trial
	135	decompression and throws away the result.
	136
	137	-f --force
	138	Force overwrite of output files. Normally, bzip2
	139	will not overwrite existing output files. Also
	140	forces bzip2 to break hard links to files, which it
	141	otherwise wouldn't do.
	142
	143	bzip2 normally declines to decompress files which
	144	don't have the correct magic header bytes. If
	145	forced (-f), however, it will pass such files
	146	through unmodified. This is how GNU gzip behaves.
	147
	148	-k --keep
	149	Keep (don't delete) input files during compression
	150	or decompression.
	151
	152	-s --small
	153	Reduce memory usage, for compression, decompression
	154	and testing. Files are decompressed and tested
	155	using a modified algorithm which only requires 2.5
	156	bytes per block byte. This means any file can be
	157	decompressed in 2300k of memory, albeit at about
	158	half the normal speed.
	159
	160	During compression, -s selects a block size of
	161	200k, which limits memory use to around the same
	162	figure, at the expense of your compression ratio.
	163	In short, if your machine is low on memory (8
	164	megabytes or less), use -s for everything. See
	165	MEMORY MANAGEMENT below.
	166
	167	-q --quiet
	168	Suppress non-essential warning messages. Messages
	169	pertaining to I/O errors and other critical events
	170	will not be suppressed.
	171
	172	-v --verbose
	173	Verbose mode -- show the compression ratio for each
	174	file processed. Further -v's increase the ver-
	175	bosity level, spewing out lots of information which
	176	is primarily of interest for diagnostic purposes.
	177
	178	-L --license -V --version
	179	Display the software version, license terms and
	180	conditions.
	181
	182	-1 (or --fast) to -9 (or --best)
	183	Set the block size to 100 k, 200 k .. 900 k when
	184	compressing. Has no effect when decompressing.
	185	See MEMORY MANAGEMENT below. The --fast and --best
	186	aliases are primarily for GNU gzip compatibility.
	187	In particular, --fast doesn't make things signifi-
	188	cantly faster. And --best merely selects the
	189	default behaviour.
	190
	191	-- Treats all subsequent arguments as file names, even
	192	if they start with a dash. This is so you can han-
	193	dle files with names beginning with a dash, for
	194	example: bzip2 -- -myfilename.
	195
	196	--repetitive-fast --repetitive-best
	197	These flags are redundant in versions 0.9.5 and
	198	above. They provided some coarse control over the
	199	behaviour of the sorting algorithm in earlier ver-
	200	sions, which was sometimes useful. 0.9.5 and above
	201	have an improved algorithm which renders these
	202	flags irrelevant.
	203
	204
	205	MEMORY MANAGEMENT
	206	bzip2 compresses large files in blocks. The block size
	207	affects both the compression ratio achieved, and the
	208	amount of memory needed for compression and decompression.
	209	The flags -1 through -9 specify the block size to be
	210	100,000 bytes through 900,000 bytes (the default) respec-
	211	tively. At decompression time, the block size used for
	212	compression is read from the header of the compressed
	213	file, and bunzip2 then allocates itself just enough memory
	214	to decompress the file. Since block sizes are stored in
	215	compressed files, it follows that the flags -1 to -9 are
	216	irrelevant to and so ignored during decompression.
	217
	218	Compression and decompression requirements, in bytes, can
	219	be estimated as:
	220
	221	Compression: 400k + ( 8 x block size )
	222
	223	Decompression: 100k + ( 4 x block size ), or
	224	100k + ( 2.5 x block size ) with -s
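The estimates can be written out as a small Python calculation. The function names are invented for the sketch, and k is taken to mean 1000 bytes, which matches the 100,000 to 900,000 byte block sizes given above. For -7 the formula yields 6000k, whereas the table further down lists 6100k; the other rows agree with the formula.

```python
K = 1000  # "k" as used in the estimates above


def block_size(level: int) -> int:
    """Block size in bytes for the flags -1 .. -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100 * K


def compress_bytes(level: int) -> int:
    return 400 * K + 8 * block_size(level)


def decompress_bytes(level: int, small: bool = False) -> int:
    # small=True models the -s algorithm (2.5 bytes per block byte).
    factor = 2.5 if small else 4
    return int(100 * K + factor * block_size(level))


for level in (1, 9):
    print(
        f"-{level}:",
        compress_bytes(level) // K,
        decompress_bytes(level) // K,
        decompress_bytes(level, small=True) // K,
    )
# -1: 1200 500 350
# -9: 7600 3700 2350
```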
	225
	226	Larger block sizes give rapidly diminishing marginal
	227	returns. Most of the compression comes from the first two
	228	or three hundred k of block size, a fact worth bearing in
	229	mind when using bzip2 on small machines. It is also
	230	important to appreciate that the decompression memory
	231	requirement is set at compression time by the choice of
	232	block size.
	233
	234	For files compressed with the default 900k block size,
	235	bunzip2 will require about 3700 kbytes to decompress. To
	236	support decompression of any file on a 4 megabyte machine,
	237	bunzip2 has an option to decompress using approximately
	238	half this amount of memory, about 2300 kbytes. Decompres-
	239	sion speed is also halved, so you should use this option
	240	only where necessary. The relevant flag is -s.
	241
	242	In general, try to use the largest block size that memory
	243	constraints allow, since that maximises the compression
	244	achieved. Compression and decompression speed are virtually
	245	unaffected by block size.
	246
	247	Another significant point applies to files which fit in a
	248	single block -- that means most files you'd encounter
	249	using a large block size. The amount of real memory
	250	touched is proportional to the size of the file, since the
	251	file is smaller than a block. For example, compressing a
	252	file 20,000 bytes long with the flag -9 will cause the
	253	compressor to allocate around 7600k of memory, but only
	254	touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the
	255	decompressor will allocate 3700k but only touch 100k +
	256	20000 * 4 = 180 kbytes.
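The arithmetic in this example generalises to any file that fits in one block. A minimal Python sketch, with the hypothetical name touched_bytes and k again taken as 1000 bytes:

```python
K = 1000


def touched_bytes(file_size: int, block_bytes: int = 900 * K) -> tuple:
    """Return (compression, decompression) bytes actually touched."""
    # Only the part of the block that holds data is touched.
    used = min(file_size, block_bytes)
    return 400 * K + 8 * used, 100 * K + 4 * used


print(touched_bytes(20_000))  # (560000, 180000)
```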
	257
	258	Here is a table which summarises the maximum memory usage
	259	for different block sizes. Also recorded is the total
	260	compressed size for 14 files of the Calgary Text Compres-
	261	sion Corpus totalling 3,141,622 bytes. This column gives
	262	some feel for how compression varies with block size.
	263	These figures tend to understate the advantage of larger
	264	block sizes for larger files, since the Corpus is domi-
	265	nated by smaller files.
	266
	267	        Compress   Decompress   Decompress   Corpus
	268	Flag     usage      usage       -s usage     Size
	269
	270	 -1      1200k       500k         350k      914704
	271	 -2      2000k       900k         600k      877703
	272	 -3      2800k      1300k         850k      860338
	273	 -4      3600k      1700k        1100k      846899
	274	 -5      4400k      2100k        1350k      845160
	275	 -6      5200k      2500k        1600k      838626
	276	 -7      6100k      2900k        1850k      834096
	277	 -8      6800k      3300k        2100k      828642
	278	 -9      7600k      3700k        2350k      828642
	279
	280
	281	RECOVERING DATA FROM DAMAGED FILES
	282	bzip2 compresses files in blocks, usually 900kbytes long.
	283	Each block is handled independently. If a media or trans-
	284	mission error causes a multi-block .bz2 file to become
	285	damaged, it may be possible to recover data from the
	286	undamaged blocks in the file.
	287
	288	The compressed representation of each block is delimited
	289	by a 48-bit pattern, which makes it possible to find the
	290	block boundaries with reasonable certainty. Each block
	291	also carries its own 32-bit CRC, so damaged blocks can be
	292	distinguished from undamaged ones.
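To make the idea of scanning for block boundaries concrete, here is a Python sketch that looks for the delimiter at every bit position. The value of the 48-bit pattern (hex 314159265359) is not given in the text above; it is taken from the bzip2 file format and should be read as an assumption of this example, as should the function name find_block_starts. This is an illustration only, not the method used by bzip2recover.

```python
import bz2

# Assumption from the bzip2 file format, not from the text above:
# every block starts with these 48 bits.
BLOCK_MAGIC = bytes.fromhex("314159265359")


def find_block_starts(data: bytes) -> list:
    """Return the bit offsets at which the 48-bit block marker occurs."""
    padded = data + b"\x00"  # spare byte for the bits shifted out
    as_int = int.from_bytes(padded, "big")
    offsets = []
    for shift in range(8):
        # Shifting right by `shift` bits byte-aligns any marker that
        # starts `shift` bits before a byte boundary.
        view = (as_int >> shift).to_bytes(len(padded), "big")
        pos = view.find(BLOCK_MAGIC)
        while pos != -1:
            offset = pos * 8 - shift
            if offset >= 0:
                offsets.append(offset)
            pos = view.find(BLOCK_MAGIC, pos + 1)
    return sorted(offsets)


# Roughly 300 kbytes of text compressed with 100k blocks gives several blocks.
sample = b"".join(b"line %d\n" % i for i in range(30000))
packed = bz2.compress(sample, compresslevel=1)
starts = find_block_starts(packed)
print(len(starts), "block markers; first at bit", starts[0])
```

The first marker sits at bit 32, directly after the four-byte stream header; later markers can start at any bit position, which is why the scan cannot be limited to byte boundaries.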
	293
	294	bzip2recover is a simple program whose purpose is to
	295	search for blocks in .bz2 files, and write each block out
	296	into its own .bz2 file. You can then use bzip2 -t to test
	297	the integrity of the resulting files, and decompress those
	298	which are undamaged.
	299
	300	bzip2recover takes a single argument, the name of the dam-
	301	aged file, and writes a number of files
	302	"rec00001file.bz2", "rec00002file.bz2", etc, containing
	303	the extracted blocks. The output filenames are
	304	designed so that the use of wildcards in subsequent pro-
	305	cessing -- for example, "bzip2 -dc rec*file.bz2 > recov-
	306	ered_data" -- processes the files in the correct order.
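The test-then-decompress step can also be scripted. The following Python sketch assumes the pieces written by bzip2recover are in the current directory and uses the standard bz2 module in place of bzip2 -t; the function name salvage is invented for the example.

```python
import bz2
import glob


def salvage(pattern: str = "rec*file.bz2"):
    """Decompress the intact pieces; return (data, names of damaged pieces)."""
    recovered = []
    damaged = []
    # sorted() gives the lexical order that the zero-padded names rely on.
    for name in sorted(glob.glob(pattern)):
        with open(name, "rb") as handle:
            raw = handle.read()
        try:
            recovered.append(bz2.decompress(raw))
        except (OSError, ValueError):
            # Invalid data, a CRC mismatch, or a truncated stream.
            damaged.append(name)
    return b"".join(recovered), damaged
```

Calling salvage() returns the concatenated contents of every piece that passed its check, plus the list of pieces that were skipped, so the gaps in the recovered data can be accounted for.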
	307
	308	bzip2recover should be of most use dealing with large .bz2
	309	files, as these will contain many blocks. It is clearly
	310	futile to use it on damaged single-block files, since a
	311	damaged block cannot be recovered. If you wish to min-
	312	imise any potential data loss through media or transmis-
	313	sion errors, you might consider compressing with a smaller
	314	block size.
	315
	316
	317	PERFORMANCE NOTES
	318	The sorting phase of compression gathers together similar
	319	strings in the file. Because of this, files containing
	320	very long runs of repeated symbols, like "aabaabaabaab
	321	..." (repeated several hundred times) may compress more
	322	slowly than normal. Versions 0.9.5 and above fare much
	323	better than previous versions in this respect. The ratio
	324	between worst-case and average-case compression time is in
	325	the region of 10:1. For previous versions, this figure
	326	was more like 100:1. You can use the -vvvv option to mon-
	327	itor progress in great detail, if you want.
	328
	329	Decompression speed is unaffected by these phenomena.
	330
	331	bzip2 usually allocates several megabytes of memory to
	332	operate in, and then charges all over it in a fairly ran-
	333	dom fashion. This means that performance, both for com-
	334	pressing and decompressing, is largely determined by the
	335	speed at which your machine can service cache misses.
	336	Because of this, small changes to the code to reduce the
	337	miss rate have been observed to give disproportionately
	338	large performance improvements. I imagine bzip2 will per-
	339	form best on machines with very large caches.
	340
	341
	342	CAVEATS
	343	I/O error messages are not as helpful as they could be.
	344	bzip2 tries hard to detect I/O errors and exit cleanly,
	345	but the details of what the problem is sometimes seem
	346	rather misleading.
	347
	348	This manual page pertains to version 1.0.3 of bzip2. Com-
	349	pressed data created by this version is entirely forwards
	350	and backwards compatible with the previous public
	351	releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1 and
	352	1.0.2, but with the following exception: 0.9.0 and above
	353	can correctly decompress multiple concatenated compressed
	354	files. 0.1pl2 cannot do this; it will stop after decom-
	355	pressing just the first file in the stream.
	356
	357	bzip2recover versions prior to 1.0.2 used 32-bit integers
	358	to represent bit positions in compressed files, so they
	359	could not handle compressed files more than 512 megabytes
	360	long. Versions 1.0.2 and above use 64-bit ints on some
	361	platforms which support them (GNU supported targets, and
	362	Windows). To establish whether or not bzip2recover was
	363	built with such a limitation, run it without arguments.
	364	In any event you can build yourself an unlimited version
	365	if you can recompile it with MaybeUInt64 set to be an
	366	unsigned 64-bit integer.
	367
	368
	369	AUTHOR
	370	Julian Seward, jseward@bzip.org.
	371
	372	http://www.bzip.org
	373
	374	The ideas embodied in bzip2 are due to (at least) the fol-
	375	lowing people: Michael Burrows and David Wheeler (for the
	376	block sorting transformation), David Wheeler (again, for
	377	the Huffman coder), Peter Fenwick (for the structured cod-
	378	ing model in the original bzip, and many refinements), and
	379	Alistair Moffat, Radford Neal and Ian Witten (for the
	380	arithmetic coder in the original bzip). I am much
	381	indebted for their help, support and advice. See the man-
	382	ual in the source distribution for pointers to sources of
	383	documentation. Christian von Roques encouraged me to look
	384	for faster sorting algorithms, so as to speed up compres-
	385	sion. Bela Lubkin encouraged me to improve the worst-case
	386	compression performance. Donna Robinson XMLised the docu-
	387	mentation. The bz* scripts are derived from those of GNU
	388	gzip. Many people sent patches, helped with portability
	389	problems, lent machines, gave advice and were generally
	390	helpful.
	391
