Cryptographic hashes are an alternative to checksums: they serve a similar purpose but offer far better resistance to collisions. Among the many characteristics collected to verify file integrity, the hash is, no doubt, the most complex. A hash is a short, fixed-size signature that nearly uniquely identifies the content of a chunk of data.
The terms cryptographic hash, binary signature, message digest, and message sum all refer to the same thing.
Here are some interesting properties of file hashes:
Cryptographic hashes of files of different sizes are constant in size. In other words, all signatures produced by a given algorithm are the same size regardless of the original file size. This allows you to store a single short string that, for practical purposes, uniquely represents the content of a potentially very large file.
Cryptographic hashes are usually written as strings of hexadecimal characters, so two hashes can be compared with a simple string comparison.
Hashes of files that differ by just one bit are widely different: on average, about half of the output bits flip. This is called the avalanche effect. For example, changing a single character:
$ echo "Cryptographic hash function" | sha1sum
a2ba24e46a59f296de8641fbe7add1d5d1cff4bb -
$ echo "Cryptographic gash function" | sha1sum
7d044b4a37fb5d9cec5755b4e299d0f9589b22be -
Cryptographic hashes are one-way functions: a signature can be computed from a file, but the original content cannot be derived from the signature.
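The first and third properties can be checked with a few lines of Python. This is a minimal sketch using the standard `hashlib` module; the helper names `sha1_bits` and `differing_bits` are illustrative and not part of any tool discussed here.

```python
import hashlib


def sha1_bits(data: bytes) -> int:
    """Return the SHA-1 digest of data as a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the output bits that differ between the SHA-1 digests of a and b."""
    return bin(sha1_bits(a) ^ sha1_bits(b)).count("1")


short_input = b"Cryptographic hash function\n"
long_input = short_input * 100_000  # a few megabytes of data

# Constant size: both digests are 40 hexadecimal characters (160 bits).
print(len(hashlib.sha1(short_input).hexdigest()))
print(len(hashlib.sha1(long_input).hexdigest()))

# Avalanche effect: changing one character flips roughly half of the 160 bits.
print(differing_bits(b"Cryptographic hash function\n",
                     b"Cryptographic gash function\n"))
```

The exact bit count depends on the inputs, but for a well-behaved hash it should stay close to 80 out of 160.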
Cryptography is a very lively domain, and progress is constantly being made to improve and replace existing algorithms. An ideal hash algorithm should be fast and not prone to collisions. A hash collision happens when two files with different content have identical hash sums. In 2005, a team of Chinese and American security researchers published a paper showing a method to look for hash collisions in the SHA-1 algorithm with supercomputers available today. It is not yet a cause for panic because such computing power is outside the reach of most people, but the NSA has this saying: "Attacks always get better; they never get worse."
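To make the idea of a collision concrete, the sketch below searches for two different inputs whose SHA-1 digests share the same first two bytes. Truncating the digest to 16 bits is an assumption made purely so the search finishes instantly; it says nothing about how hard the full 160-bit function is to attack, where a generic brute-force search of this kind would need on the order of 2^80 attempts. The function names and the `file-N` messages are hypothetical.

```python
import hashlib
from itertools import count


def truncated_sha1(data: bytes, nbytes: int = 2) -> bytes:
    """Return only the first nbytes of the SHA-1 digest (a deliberately weak hash)."""
    return hashlib.sha1(data).digest()[:nbytes]


def find_truncated_collision(nbytes: int = 2):
    """Hash distinct messages until two of them share a truncated digest."""
    seen = {}  # truncated digest -> first message that produced it
    for i in count():
        message = f"file-{i}".encode()
        tag = truncated_sha1(message, nbytes)
        if tag in seen:
            return seen[tag], message, i + 1
        seen[tag] = message


first, second, attempts = find_truncated_collision()
print(first, second, attempts)
```

With a 16-bit output a collision typically appears after a few hundred attempts rather than tens of thousands, which is why output size matters so much for collision resistance.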
Table 20-2 lists hash functions with their output size in bits and whether collisions are known.
Table 20-2. Hash functions (source: Wikipedia)
Hash function | Output size (bits) | Collision |
---|---|---|
Gost | 256 | Not available |
HAVAL | 256/224/192/160/128 | Yes |
MD2 | 128 | Almost |
MD4 | 128 | Yes |
MD5 | 128 | Yes |
PANAMA | 256 | With flaws |
RIPEMD | 128 | Yes |
RIPEMD-128/256 | 128/256 | No |
RIPEMD-160/320 | 160/320 | No |
SHA-0 | 160 | Yes |
SHA-1 | 160 | With flaws |
SHA-256/224 | 256/224 | No |
SHA-512/384 | 512/384 | No |
Tiger(2)-192/160/128 | 192/160/128 | No |
Whirlpool | 512 | No |
The most popular hash functions are SHA-1 and MD5. Those are often the default choice for file integrity checkers.
Hashes are used in Hash-based Message Authentication Codes (HMACs), a method to authenticate a given message and verify its integrity. An HMAC is often used to secure network traffic that does not need to be confidential. For example, the Juniper Networks IDP product uses HMAC-authenticated packets to exchange messages between members of an IDP cluster. Those packets include the particular node state and ID number, among other things. It does not really matter whether they are confidential. What matters is that they come from an actual cluster member and that they cannot be replayed so as to mess up the cluster's state. We achieve that goal by using timestamped packets and embedding an HMAC field in each packet, making them unique and authenticated. Figure 20-2 shows the two-step process of creating an HMAC-authenticated data chunk. Note that an HMAC authenticates a data chunk but does not provide encryption.
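The general idea of a timestamped, HMAC-authenticated packet can be sketched with Python's standard `hmac` module. This is not the IDP packet format: the layout (an 8-byte timestamp, then the payload, then the tag), the choice of SHA-256, the `MAX_AGE_SECONDS` window, and the function names `seal` and `open_sealed` are all assumptions made for illustration.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional

TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes
MAX_AGE_SECONDS = 30  # hypothetical freshness window


def seal(key: bytes, payload: bytes, now: Optional[float] = None) -> bytes:
    """Prefix the payload with a timestamp and append an HMAC over both."""
    timestamp = struct.pack("!d", time.time() if now is None else now)
    body = timestamp + payload
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return body + tag


def open_sealed(key: bytes, packet: bytes, now: Optional[float] = None) -> Optional[bytes]:
    """Return the payload if the tag is valid and the packet is fresh, else None."""
    if len(packet) < 8 + TAG_SIZE:
        return None
    body, tag = packet[:-TAG_SIZE], packet[-TAG_SIZE:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        return None
    (sent_at,) = struct.unpack("!d", body[:8])
    current = time.time() if now is None else now
    if abs(current - sent_at) > MAX_AGE_SECONDS:
        return None  # too old (or too far in the future): treat as a replay
    return body[8:]
```

The payload travels in the clear, which matches the point that an HMAC authenticates but does not encrypt. A timestamp window alone only limits replays to that window; a real protocol would likely also track sequence numbers or recently seen packets.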
Checking the integrity of a large number of files (more than 150,000) really requires the file characteristics to be stored in a Relational Database Management System (RDBMS), which provides indexes for fast access and efficient storage. Each tool usually creates its own database.
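As a rough illustration of what such a database might hold, the sketch below stores one row per file in SQLite and later reports files whose content no longer matches. The schema, the function names, and the use of SHA-256 are assumptions for this example; real integrity checkers record many more characteristics and use their own formats.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_baseline(db: sqlite3.Connection, root: Path) -> None:
    """Record path, size, and hash for every regular file under root."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS baseline ("
        "path TEXT PRIMARY KEY, size INTEGER NOT NULL, sha256 TEXT NOT NULL)"
    )
    rows = [
        (str(p), p.stat().st_size, file_sha256(p))
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
    db.executemany("INSERT OR REPLACE INTO baseline VALUES (?, ?, ?)", rows)
    db.commit()


def changed_files(db: sqlite3.Connection) -> list:
    """Return baseline paths that are now missing or whose hash differs."""
    changed = []
    for path, expected in db.execute("SELECT path, sha256 FROM baseline ORDER BY path"):
        current = Path(path)
        if not current.is_file() or file_sha256(current) != expected:
            changed.append(path)
    return changed
```

The primary key on `path` gives the indexed lookup mentioned above. Note that this sketch only detects modified or deleted files; spotting newly added files would require walking the tree again and comparing it against the table.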
If an attacker replacing files on your system is able to update the baseline of file hash sums, the action would go totally unnoticed and lead to a compromised system without your knowledge—not the kind of thing you like to see happening. Securing the baseline is of utmost importance and is sometimes provided by the tool. If your tool does not provide database security, I strongly recommend you move the baseline database to read-only storage.
Secure your baseline by moving it to read-only storage.