In the Unix filesystem, files are stored in blocks. Each nonempty file, no matter how small, takes up at least one block.[2] A directory tree full of little files can fill up a lot of partly empty blocks. A big file is more efficient because it fills all (except possibly the last) of its blocks completely.
The tar (Section 39.2) command can read lots of little files and put them into one big file. Later, when you need one of the little files, you can extract it from the tar archive. Seems like a good space-saving idea, doesn't it? But tar, which was really designed for magnetic tape archives, pads the end of each file with "garbage" characters to round it up to a whole number of 512-byte tar blocks. So, a big tar archive uses about as many blocks as the separate little files do.
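To see the padding for yourself, here's a minimal sketch. It assumes a tar that writes standard 512-byte blocks and a system with mktemp; the directory and file names are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: ten one-byte files go into a tar archive; compare the sizes.
# "tiny" and "fileN" are hypothetical names used only for this demo.
workdir=$(mktemp -d)
mkdir "$workdir/tiny"

# Ten files of one byte each: 10 bytes of real data in total.
for i in 1 2 3 4 5 6 7 8 9 10; do
    printf 'x' > "$workdir/tiny/file$i"
done

# Archive the directory, writing the archive outside the tree being copied.
(cd "$workdir/tiny" && tar cf ../tiny.tar .)

# Measure the archive and see whether it ends on a 512-byte boundary.
archive_bytes=$(wc -c < "$workdir/tiny.tar" | tr -d ' ')
remainder=$((archive_bytes % 512))

echo "data bytes:    10"
echo "archive bytes: $archive_bytes"
echo "left over after dividing by 512: $remainder"

rm -r "$workdir"
```

The exact archive size depends on your tar and its blocking factor, but it should be a whole number of 512-byte blocks and many times larger than the ten bytes of data.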
Okay, then why am I writing this article? Because the gzip (Section 15.6) utility can solve the problem. It squeezes files down, compressing them to get rid of repeated characters. Compressing a tar archive typically saves 50% or more. The bzip2 (Section 15.6) utility can save even more.
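Here's a rough way to measure the savings on your own system. This is only a sketch: the notes directory and its contents are invented, and the files are deliberately repetitive, so the ratio you see will be far better than you should expect from real data.

```shell
#!/bin/sh
# Sketch: compare the size of a plain tar archive with its gzipped copy.
# The "notes" tree is hypothetical test data, not anything from the article.
workdir=$(mktemp -d)
mkdir "$workdir/notes"

# Fifty small text files; printf reuses its format for each argument,
# so every file gets five similar lines.
i=1
while [ "$i" -le 50 ]; do
    printf 'This is line %s of a small, repetitive text file.\n' 1 2 3 4 5 \
        > "$workdir/notes/note$i.txt"
    i=$((i + 1))
done

# Make the plain archive, then a compressed copy of it.
(cd "$workdir/notes" && tar cf - .) > "$workdir/notes.tar"
gzip -c "$workdir/notes.tar" > "$workdir/notes.tar.gz"

tar_bytes=$(wc -c < "$workdir/notes.tar" | tr -d ' ')
gz_bytes=$(wc -c < "$workdir/notes.tar.gz" | tr -d ' ')

echo "plain tar archive: $tar_bytes bytes"
echo "gzipped archive:   $gz_bytes bytes"

rm -r "$workdir"
```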
If your compressed archive is corrupted somehow — say, a disk block goes bad — you could lose access to all of the files. That's because neither tar nor compression utilities recover well from missing data blocks. If you're archiving an important directory, be sure you have good backup copies of the archive.
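One cheap precaution is to test an archive before you trust it. The sketch below uses gzip -t, which checks the compressed data without unpacking it. The damage is simulated by chopping a copy of the archive in half with head -c (assumed to be available, as it is on GNU and BSD systems); the filenames are made up.

```shell
#!/bin/sh
# Sketch: gzip -t passes a good archive and rejects a truncated one.
# data.txt, good.tar.gz and damaged.tar.gz are hypothetical names.
workdir=$(mktemp -d)
printf 'some important data\n' > "$workdir/data.txt"
(cd "$workdir" && tar cf - data.txt | gzip > good.tar.gz)

# Simulate a lost disk block by keeping only the first half of the file.
good_bytes=$(wc -c < "$workdir/good.tar.gz" | tr -d ' ')
head -c $((good_bytes / 2)) "$workdir/good.tar.gz" > "$workdir/damaged.tar.gz"

# gzip -t exits with zero status only if the whole stream is intact.
if gzip -t "$workdir/good.tar.gz" 2>/dev/null; then
    good_status=ok
else
    good_status=bad
fi
if gzip -t "$workdir/damaged.tar.gz" 2>/dev/null; then
    damaged_status=ok
else
    damaged_status=bad
fi

echo "good archive:    $good_status"
echo "damaged archive: $damaged_status"

rm -r "$workdir"
```

A passing test only tells you the archive is readable today. It's no substitute for the backup copies.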
Making a compressed archive of a directory and all of its subdirectories is easy: tar copies the whole tree when you give it the top directory name. Just be sure to save the archive in some directory that won't be copied, so tar won't try to archive its own archive! I usually put the archive in the parent directory. For example, to archive my directory named project, I'd use the following commands. The .tar.gz extension isn't required, but is just a convention; another common convention is .tgz. I've added the gzip --best option for more compression, but it can be a lot slower, so use it only if you need to squeeze out every last byte. bzip2 is another way to save bytes, so I'll show versions with both gzip and bzip2. No matter what command you use, watch carefully for errors:
% cd project
% tar clf - . | gzip --best > ../project.tar.gz
% gzcat ../project.tar.gz | tar tvf -        Quick verification
% tar clf - . | bzip2 --best > ../project.tar.bz2
% bzcat ../project.tar.bz2 | tar tvf -       Quick verification
% cd ..
% rm -r project
(The .. directory is covered in Section 1.16, and rm -r in Section 14.16.)
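If you'd rather not rely on your eyes to catch errors, you can make the rm conditional on the checks. Here's one possible sketch, not the only way to do it: it builds a throwaway project tree under a temporary directory (the file contents are invented), archives it, and removes the tree only when both gzip -t and a tar listing succeed. It uses gzip -dc, which does the same job as gzcat.

```shell
#!/bin/sh
# Sketch: archive a directory, verify the archive, then remove the original.
# The scratch "project" tree stands in for a real directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/project/io"
echo 'int main(void) { return 0; }' > "$workdir/project/xcms.c"
echo '/* input routines */' > "$workdir/project/io/input.c"

# Write the archive into the parent so tar doesn't archive its own archive.
(cd "$workdir/project" && tar cf - . | gzip --best > ../project.tar.gz)

# gzip -t checks the compressed stream; tar tf checks that tar can read
# every entry. The directory is removed only if both succeed.
if gzip -t "$workdir/project.tar.gz" &&
   gzip -dc "$workdir/project.tar.gz" | tar tf - > /dev/null
then
    rm -r "$workdir/project"
    result=removed
else
    echo "archive failed verification; project left in place" >&2
    result=kept
fi

# Count the entries in the archive as a last sanity check.
entries=$(gzip -dc "$workdir/project.tar.gz" | tar tf - | wc -l | tr -d ' ')
echo "result: $result, archive entries: $entries"

rm -r "$workdir"
```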
Go to http://examples.oreilly.com/upt3 for more information on: tar
If you have GNU tar or another version with the z option, it will run gzip for you. This method doesn't use the gzip --best option, though, so you may want to use the previous method to squeeze out all you can. Newer GNU tars have an I option to run bzip2. Watch out for other tar versions that use -I as an "include file" operator; check your manpage or tar --help. (Still-later GNU tar releases moved bzip2 to the j option and use I to name an arbitrary compression program.) If you want to be sure that you don't have a problem like this, use the long options (--gzip and --bzip2) because they're guaranteed not to conflict with something else; if your tar doesn't support the particular compression you've asked for, it will fail cleanly rather than do something you don't expect.
Using the short flags to get compression from GNU tar, you'd write the previous tar command lines as follows:
% tar czlf ../project.tar.gz .
% tar cIlf ../project.tar.bz2 .
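For comparison, here's a sketch of the long-option spelling. It assumes GNU tar or another tar that accepts --gzip, and it works on a scratch copy of a project directory whose contents are made up. The l flag is left out to keep the example short.

```shell
#!/bin/sh
# Sketch: let tar run gzip itself, using the long --gzip option.
# Assumes a tar that understands --gzip (GNU tar does).
workdir=$(mktemp -d)
mkdir "$workdir/project"
echo 'hello' > "$workdir/project/Imakefile"

# Create the archive in the parent directory.
(cd "$workdir/project" && tar --gzip -cf ../project.tar.gz .)

# List it, again with the long option.
listing=$(tar --gzip -tf "$workdir/project.tar.gz")
echo "$listing"

# Independent check that the result really is gzip-compressed.
if gzip -t "$workdir/project.tar.gz"; then
    is_gzip=yes
else
    is_gzip=no
fi
echo "gzip-compressed: $is_gzip"

# For bzip2, --bzip2 takes the place of --gzip (the bzip2 program
# must be installed for that to work).

rm -r "$workdir"
```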
In any case, the tar l (lowercase letter L) option will print messages if any of the files you're archiving have other hard links (Section 10.4). If a lot of your files have other links, archiving the directory may not save much disk space — the other links will keep those files on the disk, even after your rm -r command.
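You can also look for multiply-linked files before you archive. The sketch below uses find with -links +1, which matches files whose link count is greater than one. The directory layout is invented for the demonstration, and the check can't tell you where the other links live; they might be inside the same tree.

```shell
#!/bin/sh
# Sketch: list regular files in a tree that have more than one hard link.
# shared.c, alone.c and elsewhere.c are hypothetical names.
workdir=$(mktemp -d)
mkdir "$workdir/project"
echo 'shared' > "$workdir/project/shared.c"
echo 'alone' > "$workdir/project/alone.c"

# Give shared.c a second link that lives outside the project tree.
ln "$workdir/project/shared.c" "$workdir/elsewhere.c"

# Only shared.c should be reported; alone.c has a single link.
linked=$(cd "$workdir/project" && find . -type f -links +1 -print)
echo "files with other links:"
echo "$linked"

rm -r "$workdir"
```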
Any time you want a list of the files in the archive, use tar t or tar tv:
% gzcat project.tar.gz | tar tvf - | less
rw-r--r--239/100    485 Oct  5 19:03 1991 ./Imakefile
rw-rw-r--239/100   4703 Oct  5 21:17 1991 ./scalefonts.c
rw-rw-r--239/100   3358 Oct  5 21:55 1991 ./xcms.c
rw-rw-r--239/100  12385 Oct  5 22:07 1991 ./io/input.c
rw-rw-r--239/100   7048 Oct  5 21:59 1991 ./io/output.c
   ...
% bzcat project.tar.bz2 | tar tvf - | less
   ...
% tar tzvf project.tar.gz | less
   ...
% tar tIvf project.tar.bz2 | less
   ...
(The less pager is covered in Section 12.3.)
To extract all the files from the archive, type one of these tar command lines:
% mkdir project
% cd project
% gzcat ../project.tar.gz | tar xf -

% mkdir project
% cd project
% bzcat ../project.tar.bz2 | tar xf -

% mkdir project
% cd project
% tar xzf ../project.tar.gz

% mkdir project
% cd project
% tar xIf ../project.tar.bz2
Of course, you don't have to extract the files into a directory named project. You can read the archive file from other directories, move it to other computers, and so on.
You can also extract just a few files or directories from the archive. Be sure to use the exact name shown by the previous tar t command. For instance, to restore the old subdirectory named project/io (and everything that was in it), you'd use one of the previous tar command lines with the name ./io at the end, like this:
% mkdir project
% cd project
% gzcat ../project.tar.gz | tar xf - ./io
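Here's a sketch of the whole round trip, so you can watch the exact-name rule at work. It builds a small stand-in project tree (the file contents are invented), archives it, looks up the member names, and restores only ./io into a separate scratch directory named restore. It uses gzip -dc, which does the same job as gzcat.

```shell
#!/bin/sh
# Sketch: restore a single subdirectory from a compressed tar archive.
# The project tree and the "restore" directory are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/project/io" "$workdir/restore"
echo 'top-level file' > "$workdir/project/xcms.c"
echo 'input code' > "$workdir/project/io/input.c"
(cd "$workdir/project" && tar cf - . | gzip > ../project.tar.gz)

# Look up the exact member names first. They start with ./ because the
# archive was made from the directory named "."
names=$(gzip -dc "$workdir/project.tar.gz" | tar tf - | grep '^\./io')
echo "$names"

# Extract only ./io, spelled the way the listing shows it.
(cd "$workdir/restore" && gzip -dc ../project.tar.gz | tar xf - ./io)

# The io subdirectory should be back; the top-level file should not be.
if [ -f "$workdir/restore/io/input.c" ]; then
    io_restored=yes
else
    io_restored=no
fi
if [ -e "$workdir/restore/xcms.c" ]; then
    other_restored=yes
else
    other_restored=no
fi
echo "io restored: $io_restored, xcms.c restored: $other_restored"

rm -r "$workdir"
```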
— JP