Blockchain - Checksums and Hashes - The roots of blockchains
Index of all Chipkin blockchain articles: Index
Download the PDF version of this article here: Download
Checksums and Hashes
The roots of blockchains.
Peter Chipkin
Contents
- Introduction
- So what is hashing?
- Credit Card Numbers
- Measuring Success
- Checksums in Data Communications
- Checksums gone nuts
- Checksum or Hash – Jargon Watch
- What is a hash (and checksum) – A simple example
- File Hashing
- Password Hashing
- MD5 Hash
- SHA256 Hash
- Attributes of a Useful Hash
- Hash Pointer
- Examples of cryptographic hash functions
- Linked List
- Linked List with Hash Pointers – The Foundation of Block Chain
- Merkle Trees
- Consistency Verification
- Data Synchronization
- Making a Merkle Tree
- Merkle Tree More Reading
Introduction
How do you compare two editions of the same book to see that they contain the same text?
Method 1: word-for-word comparison.
Method 2: feed both into a hashing algorithm. If the hashes are the same, then the books are the same.
So what is hashing?
In simple terms, hashing means taking an input string of any length and producing an output of a fixed length. A hashing algorithm is a mathematical function that condenses data to a fixed size.
Credit Card Numbers
Credit card numbers are often typed in, transferred, and quoted. All of this transmission can cause errors, especially when humans are involved. To reduce common errors, credit card numbers include a check digit.
A 16-digit credit card number actually contains a 15-digit number. The last digit (the 16th) is calculated using the other 15 digits and a special formula.
The formula is designed to catch common human errors such as transposition (switching two digits around).
Measuring Success
In the table below you can see the Modulus 11 checksum detects 100% of “Single Error”, “Single Transposition”, and more. In other words, Modulus 11 is excellent at its job.
Checksums in Data Communications
We don’t choose checksums—the protocols we use do. But what if we were designing a protocol from scratch?
CHIPKIN has reviewed the science for you.
Here are the main conclusions:
- Ethernet, TCP, and IP checksums in almost all cases make the protocol checksum redundant.
- If your data isn’t random, you should take care not to assume general-purpose checksums are applicable.
Example: You choose to use CRC as a protocol checksum for messages relayed over TCP/IP. This could be an error because CRCs are particularly good at detecting errors caused by noise in transmission channels. Noise is typically the least likely source of error on an Ethernet packet, so your checksum choice may be inappropriate.
Checksums gone nuts
Consider a DNP3 packet (the protocol used in the power generation and distribution business). The protocol designers went “all in” on checksums.
Checksum or Hash – Jargon Watch
We call a hash designed to detect data transmission errors—and embedded in the transmitted data—a checksum.
Terminology also depends on context. In blockchain circles, you will hear “hashes”. In embedded communications, you will hear “checksums”. Checksums tend to be one or two bytes long. Common cryptographic hashes are 16 or 32 bytes long.
What is a hash (and checksum) – A simple example
Say each letter of the alphabet has a number assigned to it: A=1, B=2, etc.
Then: Adam = 1, 4, 1, 13
Hashing Rule: sum all the letters.
hash = 1 + 4 + 1 + 13 = 19
Fixed-length example: sum all letters and take the last digit of the sum.
1 + 4 + 1 + 13 = 19 → last digit = 9 → hash = 9
This hashing rule makes the hash fixed length.
- A “hash” (also called a “digest”, and informally a “checksum”) is a kind of “signature” for a stream of data that represents the contents.
- Hashes are “digests”, not “encryption”.
- Encryption transforms data from cleartext to ciphertext and back (given the right keys).
- Encryption is a two-way operation. Hashing is not.
File Hashing
The publisher provides a downloadable file and the MD5 hash of the file.
You download the file, calculate the MD5 hash, and compare it to the publisher’s value. If they match, you likely have the authentic file.
MD5 used to be common, but it is now considered cracked.
PGP signatures are now often used instead.
Password Hashing
- It’s a bad idea for computer systems to store passwords in cleartext.
- A more secure approach is to store a hash of the password rather than the password itself.
- Since hashes are not reversible, it’s not possible to recover “what password produced this hash?” from the hash alone.
- This also reduces damage in a compromise and limits insider theft.
- Salt: extra string added to make it harder to reverse.
MD5 Hash
Not considered safe anymore after collisions were found.
SHA256 Hash
SHA (Secure Hash Algorithm) is a family of cryptographic hash functions. A cryptographic hash is like a signature for a text or data file. SHA-256 generates an almost-unique, fixed-size 256-bit (32-byte) hash.
Hashing is a one-way function—it cannot be decrypted back.
Note: a tiny change in input produces a big change in the hash.
AND THEREFORE
From now on we assume that if the hash is different then the input data / text is different.
- If the hash is the same then the data is the same
- The assumption is not strictly true, but it is statistically effective
As you can see, in the case of SHA-256, no matter how big or small your input is, the output will always have a fixed 256-bit length. This is critical when dealing with large volumes of data and transactions.
Instead of remembering the full input (which could be huge), you can remember the hash and keep track. Before going further we need to look at properties of hashing functions and how they get implemented in blockchain.
Attributes of a Useful Hash
Attribute #1 – Deterministic
No matter how many times you run the same input through a hash function, you will always get the same result. This is critical because inconsistent hashes would make tracking impossible.
Attribute #2 – Quick Computation
The hash function should return the hash quickly. If the process isn’t fast enough, the system won’t be efficient.
Attribute #3 – Small Changes in Input Change the Hash
Even a small change in input should produce a very large change in output.
| Input | SHA-256 Hash |
| Test12345 | 106ac304ae39bc4029db0faf0d1734bd5a1dc2474331e8e17039365847536d7 |
| test12345 | 6fec2a9601d5b3581c94f2150fc07fa3d6e45808079428354b868e412b76e6bb |
Attribute #4 – Collision Resistant
Given two different inputs A and B where H(A) and H(B) are their hashes, it should be infeasible for H(A) to equal H(B). For the most part, each input should have its own unique hash.
No hash function is collision-free, but it usually takes so long to find a collision that for a function like SHA-256 it is safe to assume that if H(A) = H(B) then A = B.
Attribute #5 – Puzzle Friendly
For every output Y, if k is sufficiently large it is infeasible to find an input x such that: H(k concatenated with x) = Y.
Hash Pointer
The pointer in a block that points to the next block has two components:
- The hash of the previous block header
- The pointer to the previous block
Examples of cryptographic hash functions
There are hundreds of hashing algorithms, each optimized for different goals (speed, security, data types, etc.).
- MD5: 128-bit hash. Collision resistance broken after ~2^21 hashes.
- SHA-1: 160-bit hash. Collision resistance broken after ~2^61 hashes.
- SHA-256: 256-bit hash. Used by Bitcoin.
- Keccak-256: 256-bit hash. Used by Ethereum.
If you see “SHA-2,” “SHA-256,” or “SHA-256 bit,” those names generally refer to the same family / variant.
Linked List
Linked lists: each object contains two elements—data and a pointer to the next item in the list.
Big advantage: logical and physical order are separate.
Linked List with Hash Pointers – The Foundation of Block Chain
Data Block
Hash this data and append the hash to the data.
But this time, before you calculate the hash on your own data, you add the hash of the previous block to your data.
Now Block 3 can detect if Block 2 changes because it knows what Block 2’s hash should be and can recalculate and compare.
Block 3 will also detect if someone changes the order of blocks or inserts a new block because the hashes won’t match.
Profound: if you can detect change easily, it is easier to trust the data.
Jargon watch: Blockchain ledger = think of the hashed linked list. This is the ledger.
Hashing hashes – building blocks of the blockchain ledger.
Merkle Trees
Merkle (a person) realized it’s easier working with hashes of hashes than with large original blocks of data that keep growing. He defined the math and methods for hashing hashes of blocks of data.
Each block contains thousands of transactions. Storing all data as a flat series is inefficient for verification and lookup. With a Merkle tree, you can quickly determine whether a particular transaction belongs in a block.
These notes assume that new data/changes are appended rather than edited in place in old data blocks that have already been hashed.
We want to add data to a file (add file 2 to file 2).
Using a hash tree like a Merkle tree:
- Reduces the amount of data a trusted authority must maintain to prove integrity.
- Reduces network I/O required for verification and synchronization.
- Separates validation from the data itself. The Merkle tree can reside locally, on a trusted authority, or across a distributed system (you might maintain your own tree). This decoupling allows separate persistence strategies for the Merkle tree and the underlying data.
So the answer to “why” is three-fold:
- Merkle trees provide a way to prove integrity/validity of your data.
- They require little storage and proofs are computationally easy and fast.
- Merkle proofs require only small, terse messages across a network.
Merkle trees (and variations) are used to provide:
- consistency verification
- data verification
- data synchronization
Consistency Verification
Merkle (hash) trees allow easy consistency validation.
Instead of comparing the data, you compare the tree of hashes. This is fast and easy.
This does not prove the data is the same, but it proves the hash trees are consistent.
A “consistency proof” lets you verify that two versions of a hash tree are consistent:
- the later version includes everything in the earlier version
- in the same order
- all new records come after records in the older version
If you can prove a hash tree is consistent it means:
- no records have been back-dated and inserted
- no records have been modified
- the hash tree has never been branched or forked
It is time-consuming and computationally expensive to check entire files whenever a system needs to verify data. This is why Merkle trees are used: they limit the amount of data sent over a network. Instead of sending an entire file, send a hash and use the tree to locate inconsistencies.
- Computer A sends a hash of the file to Computer B.
- Computer B checks that hash against the root of the Merkle tree.
- If there is no difference, you’re done. Otherwise continue.
- Computer B requests roots of the two subtrees for the mismatching hash.
- Computer A creates the necessary hashes and sends them back.
- Repeat until the inconsistent data block(s) are found.
Data Synchronization
Merkle trees are useful for synchronizing data across a distributed data store because they allow each node to quickly identify records that have changed without sending all data to compare.
Making a Merkle Tree
Merkle trees are binary.
- Take any two records and calculate the hash of each.
- Allocate a parent node for the two records (leaves).
- Store the hash of the two child hashes in the parent node.
- Repeat until every pair of records has a parent.
- Repeat until every pair of parents has a parent.
- Continue until you reach the top. This is the Root Node.