Extractors

Published 2023 May 26, 13:39

In this article we will get a feeling of a subject of randomness extraction, informational entropy and encoding. We will look at some examples of extractors and discuss how it can be used outside of natural habitat.

How Random Makes Random

Randomness extraction is when you have a source of randomness and want to convert it into a stream of random numbers. This is how nearly all hardware random number generators are made:

principle design of RNG

For this article we will just think of a source as something that spits out random events that we cannot predict. Generally speaking it does not have to be numerical values or even have compare application between the members. But let’s just stick to something that we can easily work with, like coins, unfair coins, totally nt fair USB ports and dice.

Why bother at all with adding another link after the true source of randomness? Because the source is likely to have uneven distribution of its output. Say, it’s considered fair to flip coins but if we flip a boat it is likely to end up hull down rather than overkeel due to its design. %TRANSPARENT: It is not as easy to find images for keyword ‘over keel’ or ‘over hull’. Who’d though?%

So if we assign ‘0’ to boat being upside-up and ‘1’ to upside-down then the distribution of our output sequence will contain more zeroes which makes it invalid for cryptographical purposes. However since the distribution is random it still has the ’entropy’ (we will get a taste of entropy in a while) to be purposed.

Hence, the need for the extractors. As for the output, we want it to be uncorrelated, evenly distributed and preferably as big as possible. This will be the inexact definition that we will stick to.

Randomness extractor are functions in terms of science but using the word ‘algorithm’ is easier perceived for those who are not scientists. Probably the closest randomness extractor to the reader would be in less than a meter away since random numbers are required for secure Internet connection, hence it exists in any smartphone at least.

For example, in linux systems a pseudofile /dev/random is the output of OS’s randomness extractor. If there is no dedicated hardware randomness source then it is fed with whatever events it can get such as unpredictable user input or network activity and there is a whole separate topic as for what inputs can be used for the job.

Let’s Just Do It The Simple Way

I’m pretty sure you have already come up with some ideas as for how to solve the problem of randmness extraction. Lte’s first check what other people have done.

XOR Extraction

One-time pads are the best not only in generating random keys but also as randomness extractors. If you have an secret pre-generated random numbers then you can xor the output of a randomness source with it to get the most secure random sequence possible.

“Bit juggling” techniques

There is a class of extractors used in the processing-power-bound silicone. The idea is to shuffle the bytes around so that it would be uneasy to guess what the input was. One example is matrix multiplication. It could also be a non-square matrix to reduce the size of output in case the source of entropy is ‘weak’.

Keywords: Toeplitz matrix, latin square reshuffling

A number of extractors are an integral part of the generator, like the fpga schemes that use clock jitter as entropy source. Linear shift registers with feedback (LFSR) are heavily used in the field.

Von-Neumann extractor

Like a dog without a bone this article cannot exist without mentioning The Man himself. The idea of Von-Neumann extractor focused only on “unfair coin” samples and turning it into 100% cryptography-ready sequence.

The algorithm would split input sequence into pairs of two consequitive samples. And then perform the following mapping:

in		out
01	->	0
10	->	1
00	->	_
11	->	_

Which means for every pair with distinct symbols we will output either zero or one. And if the elements are the same then we will just ignore this part. Such scheme will work for any sequence and will effectively debias any binary distribution

Debiasing is making uneven probability distribution even

Hash-based extractors

The idea is also pretty straightforward: hash-functions are hard to reverse, unlikely to collide and most likely unstable. So we can feed the output of the source into hash-function and produce highly random number sequence. Or so it seems.

And for those crypto-philes who thinks it is not enough: one could use two extractors sequentially too. For example hardware random number generators in INTEL ¹ are sourced from unstable loop output of which is fed into the Von-Neumann extractor which is then passed to SHA-1 to finally produce the numbers that are good enough for the userspace.

The Problem With Entropy

If we look closer at the given examples then a feeling of unfairness arrives. The most secure algorithm requires hugelot of secure pre-generated random numbers. It is never used because we’d have to ship our CPU’s with hard-drives playing role of one-time pad. And then replace it with new ones if we surf the internet too actively. Not to mention that we’d have to blindly trust the CPU manufacturer that there are no backdoors in what he is shipping. . . Wait a second… Oh, nevermind.

Von Veumann extractor is great but its bit utilisation is very poor: at best it’s output produces 1 bit for 4 input bits. And at worst it’s one bit for 1/p bits for highly asymmetric distributions with small probability p.

While for the hashing techniques another problem arises: it produces a fixed number of output bits for whatever input we feed it. And if you think that feeding it same amount of bits as the output would solve the problem, then congratulations: you are the lucky one to learn something new from this article!

Say our etropy source is the same unfair coin with probability 1/160 and we use sha-1 algorithm that works with 160 bit long sequences (or values/vectors/arrays/words to be culturally-diverse). In this case the input sequence is likely to have one ‘rare’ number in it. While the output will still have all 160 wannabe-random numbers. Hence for the baddy who wants to bruteforce the key it would be easier to start from the 160 most probable hashes rather than starting with all zeroes and increment by one eash attempt

In order to become more quantative and less intuitive let’s become a bit more abstract yet more grounded. Our problem of randomness extraction is just a problem of mapping. We have one set of input and another set of output and want all the elements of the output sequence to have equal probability.

Exchanging coins for dice

Can we emulate a flip of coin if we have one square dice? Easy: just roll the dice and check if the number is odd or even (or compare it with 3.5. Or search for the central dot on the face. Or pick the ones that don’t look the same when rotated by 90 degrees).

What if we cannot afford a square dice with our 1 cent coin? We flip the coind 3 times and then map the output:

coin	die
hhh	redo
hht	1
hth	2
htt	3
thh	4
tht	5
tth	6
ttt	redo

That’s still a bit of waste of effort but we managed to conserve coutry’s GDP and save some money with our mindpower!

It could get more effective though: rolling a dice 3 times (if the order is important) results in 216 possible combinations. This could be covered with just eight coin flips and not nine!

The takeaway is that we can convert “probability-generating structures” into one another and we can quantatively describe the efficiency of the process. 1 die x6 > 2 coin flips while 1 die x6 < 3 coin flips. 3 die x6 rolls < 8 coin flip. By the way, that’s approach to learn pre-school kids logarithms. Provided they can concentrate while flipping a coin whole 8 times.

Etnropics ans sub-entropics

With obtained knowledge we can straighten some other things. Remember the Morse code? It’s when you map the alphabet into symbols ‘dot’, ‘dash’ and ‘silence’. This allows the letters to travel the telegraph line when dots and dashes are being put as current. Telegraph is limited so we want to utilise it effectively. Say, there are 27 letters plus 10 digits and a then several special symbols. This makes it about 40 characters to be encoded. If we ‘play smart’ and encode it in binary then we will have very poor utilisation of binary space: 40 characters out of 64. Hence the need to ‘play smarter’.

The ‘smarter play’ is the following: we abandon idea of every symbol being encoded with the same number of digits. If we were to design a telegraph line for say Twitch stream then we’d expect certain symbols to appear more often. My best guess would be that top symbols are “e,l,a,p,s …” hence we would assign them the most short and easy code. And now welcome to the future where “…—…” translates to “LEL”.

With this in mind we can speculate that depending on the probability of the character there would be a different amount of characters that will encode it. There is little new when someone posts “lel” to the chat (let’s switch from characters evaluation to message evaluation for a whil). However message “Disgustingly marvellous, ser” would bring way more attention not only because it is rarely used but also because there is a typo which makes it even more ‘unique’. This intuitive ‘uniqueness’ is directly linked to strict definition of information and entropy. The two latter words are actually synonyms that came from different scientific fields independently.

Basically, amount of information in the message could be viewed as minimal number of encoding characters that is required to encode the message. And it is the Entropy that you’ve definitely heard of before. Particularly in physics, entropy describes the ‘complexity’ or ‘chaoticity’ of the system: positions of atoms in ordered crystal are like a wall of ’lel’, while you need way more words to explain positions of same atoms in the gaseous disordered state.

Min-entropy and beyond

Still, how long should be the message sent to the hash-based extractor? Let’s look at the process from a different angle.

The output of the extractor is of certain length

__Big hash, mapping, entropy, big-entropy

The Problem With Entropy Solved?

__More examples with min-entropy: multi-source, prng, ???

The Best Extractor

__There are ways to compare extractors

So Which One Is The Best?

__Computability, source drift, tests

Taming The Entropy

__Cheating, snort

Sources and links:

…

Literature

B. Jun and P. Kocher. The Intel random number generator, White Paper Prepared for Intel Corporation, April 1999 ↩︎