Skip to content

7. Cryptographic Hashes

Overview

A cryptographic hash function is a function \(H\) that when applied on a message, \(M\), can be used to generate a fixed-length "fingerprint" of the message. As such, any change to the message, no matter how small, will change many of the bits of the hash value with there being no detectable patters as to how the output changes based on specific input changes.

\(H\) is deterministic, meaning if you compute \(H(M)\) twice with the same input \(M\), you will always get the same output twice. The hash function is unkeyed, as it only takes in a message \(M\) and no secret key.

Typically outputs of hash functions are fixed size: the SHA256 hash algorithm can be used to hash a message of any size, but always produces a 256-bit hash value.

Properties of Hash Functions

  1. One-way: The hash function can be computed efficiently

Given \(x\), it is easy to compute \(H(x)\). However, given an output \(y\), it is infeasible to find any input \(x\) such that \(H(x) = y\). This property is known as "preimage resistant"

  1. Second preimage resistant: Given input \(x\), it is infeasible to find another input \(x'\) such that \(x' \neq x\) but \(H(x) = H(x')\).

The difference between preimage resistance is that the adversary also knows a starting point \(x\), and wishes to tweak it to \(x'\) in order to produce the same hash -- but cannot.

  1. Collision resistant: It is infeasible to find any pair of messages \(x, x'\) such that \(x' \neq x\) but \(H(x) = H(x')\).

The difference between the others is that the adversary can freely choose their starting point \(x\), potentially designing it specially to enable finding the associated \(x'\) -- but again cannot.

Note the third property implies the second property. We keep them separate because given hash function's resistance towards the one might differ from resistance towards the other.

Hash Algorithms

One of the earliest hash function, MD5 (message digest 5) was broken years ago. SHA1 (secure hash algorithm) was broken in 2017. git version control used to use SHA1. This worked find until someone discovered how to produce SHA1 collisions.

There are two primary "families" of hash algorithms in common use that are believed to be secure: SHA2 and SHA3. In each family, there are differing output lengths. SHA-256, SHA-384, SHA-512 are all instances of SHA2 family with outputs of 256, 384, 512 bits respectively. SHA3-256, SHA3-284, SHA3-512 are SHA3 family members.

Lowest-Hash Scheme

Cryptographic hashes have many practical applications outside of cryptography.

Suppose you are a journalist, and a hacker contacts you claiming to have stolen 150 million records from a website. The hacker is keeping the records for ransom, so they don't want to present all 150 million files. You can tell if they are lying by getting the hash of them. If the hacker hashes all 150 million records, they are generating 150 million bitstrings. We might ask the hacker to hash them and then return the 10 lowest resulting hashes. Then check if those hashes are consistent with what we would expect the lowest 10 samples out of 150 million random bitstrings to be. The probability of getting those lowest 10 hashes from 150 million records is astronomically low and conclude that the hacker is lying about their claim.

We can also require the attacker to send the 10 records corresponding to the lowest hashes, and they must know which of the 150 million fake records would result in the lowest hash.