In the previous sections, we said that all nodes store identical copies of the blockchain database, didn't we?
That undoubtedly results in a great deal of redundant data storage. However, it is the price we have to pay in order to obtain a truly decentralized peer-to-peer system without any middlemen.
Moreover, blocks can differ in size: some blocks may have 200 transactions, others 500, and others 1,000. The transactions themselves also typically vary in size, measured in kilobytes. The only capacity limit in the Bitcoin blockchain protocol is on the size of each block, which has been 1 megabyte since 2010. The latest upgrade of the Bitcoin software (Segregated Witness) recently raised the effective limit to roughly 1.4 MB. But again, block sizes can vary up to that limit. You can check for yourselves what blocks look like at blockchain.info or other online block explorers. In the following screenshot, you can see some example blocks:
Hence, a blockchain can benefit from some standardization and rationalization of the data it stores.
A mechanism that allows us to address this is the cryptographic hash function, which is an efficient way to secure data integrity while working with a short reference instead of the full data. A hash function converts input data of any length into a fixed-length string of characters (also known as a bit string) that is, for all practical purposes, unique to that input. This output serves as a reference code or digital fingerprint that can be used to verify the authenticity of the underlying dataset without having to check the entire dataset.
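To make the idea of a digital fingerprint concrete, here is a minimal sketch in Python using the standard `hashlib` module. SHA-256 is chosen here for illustration because it is the hash function Bitcoin relies on; the helper name `fingerprint` and the sample inputs are our own assumptions, not part of any blockchain software.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


short_input = b"hello"
long_input = b"x" * 1_000_000  # about 1 MB of data (illustrative only)

# The digest has the same length no matter how large the input is:
# 64 hexadecimal characters, which is 256 bits.
print(len(fingerprint(short_input)))
print(len(fingerprint(long_input)))

# Changing a single character of the input gives a different digest.
print(fingerprint(b"hello"))
print(fingerprint(b"hellp"))
```

Anyone holding the same data can recompute the digest and compare it with a published one. If the two strings match, the data has not been altered, and there was no need to compare the two datasets byte by byte.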
In practice, a hash function is a mathematical algorithm that maps data of arbitrary size to a bit string of a fixed size (also known as a hash). It is designed to be a one-way function, meaning a function that cannot feasibly be inverted to recalculate the input data from the output. This can be seen in the following diagram:
The only way to recreate the input data, if one has the output only, is to attempt a brute-force search of all possible inputs to see if they produce a match. A brute-force search means systematically trying every possible combination until a solution is found.
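A brute-force search can be sketched in a few lines of Python. This is only an illustration under strong assumptions: we pretend to know that the input is a lowercase word of at most three letters, and the names `sha256_hex` and `brute_force` are hypothetical helpers of our own.

```python
import hashlib
from itertools import product
from string import ascii_lowercase


def sha256_hex(text: str) -> str:
    """Hash a text string with SHA-256 and return the hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def brute_force(target_digest: str, max_length: int = 3):
    """Try every lowercase string of up to max_length letters.

    Returns the first candidate whose digest matches, or None.
    """
    for length in range(1, max_length + 1):
        for letters in product(ascii_lowercase, repeat=length):
            candidate = "".join(letters)
            if sha256_hex(candidate) == target_digest:
                return candidate
    return None


# We only "know" the digest, not the input that produced it.
target = sha256_hex("cab")
print(brute_force(target))  # prints "cab"
```

With three lowercase letters there are only 26 + 26² + 26³ = 18,278 candidates, so the search finishes almost instantly. Every additional character multiplies the number of candidates, which is why brute force quickly becomes impractical for inputs of realistic size.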
Hash functions are heavily used in the Proof-of-Work blockchain consensus algorithm, as we'll see shortly.