Extraction preparation
Today on X-play:
Q:Why are we here?
A: There is more room to increase the silliness of my research
Q:Schedule?
A: This +- 100%:
- 15-30 mins: the origin of the problem that kept me hostage for years
- 15-30 mins: “Old man yells at clouds”
- 15 mins: warm water break
- 2+ hours: sciency stuff
Q:Questions interruption policy?
A: Anytime. But I reserve the right to postpone a question to the afterparty if it doesn’t fit the flow
Q:Where is my cake?
A: You have the right to apply for a cake under the official “Cakeware license” as soon as you attend the meeting in person, “in vivo”
WARNING: this page contains poorly scaled images and a cultural reference to the R* roll. Sensitive people are advised to proceed with caution
Pt1: The problem origin
Job search (and a brief introduction to bash piping)
$ cat docs/cv/log | wc -l
> 94
$ cat docs/cv/log | grep denied | wc -l
> 34
$ cat docs/cv/log | grep interview | wc -l
> 1
Where
cat - prints the file docs/cv/log with my job application log
grep xxx - keeps only the lines containing xxx
wc -l - counts the lines of the provided input
| - pipes the output of the previous command into the input of the next one
Cheeky plan from the todo list
TODO list (Mark’s Millennium problems, shortlist with physics problems almost completely excluded):
- Mechanical chaos
- Noise data transmission
- La2 web market
- Swap game
- Funny mechanics MMORPG
- Break Bitcoin
- Oil Sprengel pump
- Moth cannon
- Plastic hose cutter for Eudg
- Inflatable satellite dish
- KUKA hot-dog maker
- Vacuum insulation for the poor
- T-shirt shop //Delayed until image generators are ready
- Stonks: the crypto-based stock market for non-companies
- Gauss game: a crypto gambling www overlay
- W3ddit: reddit-like www overlay
- An arcade car racing game toribash-like
- Randomness extraction algorithm that the world is yet to witness
- Choose what email to use for Tinder account
IPT problem
Kudos July
Problem solved and reported, except for output optimisation. A matter of a couple of hours, bro. Except there is no such luxury at the IPT
The extraction problem
Randomness extraction means translating the physical distribution of a given value into a stream of bits with a number of properties, such as equal bit probabilities and complete independence between the bits.
Attempt #1
MIPT project-course project: failed, but not hard enough
Demonstrated partial success for one particular case (a few weeks after the exam)
Student avoided further contact
Probably because the theory was incorrect
And indeed it was
Attempt #2
Biohacking: Kazakhstan’s air stimulates brain cells and increases productivity
Only 3 months of coding later, I got the code the world needed so much:
$ echo -e "Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you" | sed -e "s/ /\n/g" | ./mrge.py -p -c str
> 000101100111100110110100010000010010
Deeper dive (no more Ricks there):
Despite its straightforwardness and simplicity I have failed to find the algorithm published anywhere. The search is still on.
Links:
Pt2: Academia attempts
Preparations
How not to warm up: EPFL
First approximation
Twitter search, Linkedin search: Randomness extractor
mailto: Hassan Chizari, Gloucestershire
mailto: random Japanese dude, Pittsburgh
<- mailfrom: Hassan Chizari: Yeah, but no
Lucky coincidence
Linkedin: PhD scholarships at St.Andrews
<- mailfrom: Ognjen Arandelovic
Scaling
- Toronto
- Waterloo
- Sydney
- Melbourne
- Auckland
- Beijing
- Delft
- + A couple more with totally irrelevant dudes
Results
1 Application to St. Andrews //Update March 7: Fail
1 Acceptance without obligations, Gloucestershire
1 Redirect. Resulted in silence
1 Cautious interest
Cautious interest
Waterloo. A 5G-data-transmission-state-of-the-art-something-data-transfer lab
Conversation starter: carrier-less data transmission //Fail
Trench stage: comparison table
Extra slide: job search bloopers
Jobot: why would anyone make a job search site with AI-generated jobs? Upd: very funny
Job searching resembles partner searching: either you have lots of friends and constantly make new connections around you, or you have to use services that require you to look like a job seeker, talk like a job seeker, behave like a job seeker. And when someone sees you’ve held two jobs at the same time in the past, you are immediately tagged “a specimen of corporatus immoralitus”
The German incident
The French accident
The Porkchop startuping
ChatGPT before it became mainstream
Beijing university structure
Pt3: The hardcore-ish part
Where we will learn to draw an information-based owl
Contents:
Definition
Intuition
Literature
Uncomprehensibles
The weirdness
Deeper dive into the problem
Task: transform xxxx -> yyy, where xxxx is a random input of length N and yyy is the extracted output of length M, M <= N
Example. Parity-based
10011000111101110 -> 1
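A minimal sketch of this idea (plain Python; my illustration, not the actual mrge.py code): XOR all input bits into a single output bit. Simple, but wasteful, since N input bits yield only 1 output bit.

def parity_extract(bits: str) -> str:
    # XOR all input bits together and return the result as '0' or '1'
    acc = 0
    for b in bits:
        acc ^= int(b)
    return str(acc)

print(parity_extract("1011"))               # 1 ^ 0 ^ 1 ^ 1 -> '1'
print(parity_extract("10011000111101110"))  # the example input from above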
Bad example. Throw all you can to the output
$ echo 0 | sha256sum
9a271f2a916b0b6ee6cecb2426f0b3206ef074578be55d9bc94f6f3fe3ab86aa
Entropy definition
s(x) = −log2(p(x))
where
s(x) - the “entropy” of event x, or rather the amount of information that the event carries
x - the event. Generally speaking, it could be anything. Examples of event spaces:
- {0,1}
- {‘apples’, ‘oranges’, ‘firetrucks’}
- {0..1025}
- {%parity of the number of legs you woke up with today%}
- {%Rice vs Texas score%}
log2(_) - the measure of the event. We will make sense of it later
p(x) - the probability, as in probability density: 1/2 for a fair coin, 1/6 for a fair die and 0 for a casino win
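A tiny Python check of this definition, using only the probabilities just listed:

import math

def surprisal(p: float) -> float:
    # s(x) = -log2(p(x)): the amount of information carried by an event of probability p
    return -math.log2(p)

print(surprisal(1/2))   # fair coin flip -> 1.0 bit
print(surprisal(1/6))   # fair die roll  -> ~2.585 bits
# p = 0 (the casino win) has infinite surprisal, so we skip it here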
Make sense of entropy:
- It does not matter what the input is. What matters is its probability
- Hence making any transformations over the input to beautify it is useless
- Small probability equals lots of information
- Working lamp in the fridge = not interested
- Waking up with 3 legs in the morning = now you’re talking (also: decreased effect on children)
- Quantification of information for different bases
- It takes one die to quantify the state of one die
- It takes 3 coins to quantify the state of a die, leaving two “excess” states
- It takes 2 USB ports to quantify the state of a die, leaving 3 spare states (assuming a standard USB Type A port with 3 orientation states: {plastic down, plastic up, you are wrong}). Normally we stick to base 2, hence coin encoding
Entropy == Information <- Amount of coins you need in order to describe the system.
- Unfair distributions are treated in terms of optimal encoding:
- Let’s imagine a roll of 2 independent fair USB ports. Imagine statistics show that the probability of getting [you are wrong, you are wrong] is 1/2 and the remaining 8 outcomes are equally probable. Hence each of the remaining cases has probability 1/16, where the 1/16 is the product of 1/2 for “not both are wrong simultaneously” and 1/8 for the number of equal-probability states. So the optimal encoding of this source in terms of a Shannon encoding tree is: [w,w]:0, [w,u]:1000, [w,d]:1001, [u,w]:1010, [u,u]:1011 … [d,d]:1111
* Optimal encoding = the minimal length of the string required to write down the results of sampling numerous events
* In this example a single event is not the roll of a single USB port but the joint state of both 3-state ports taken as one input
** It’s weird that this image looks like the US TV ads
Let’s count the entropy of the USB example distribution:
Stot = −∑i p(xi)⋅log2(p(xi)) = N⋅(−1/2⋅log2(1/2) − 8⋅(1/16)⋅log2(1/16)) = N⋅(1/2⋅1 + 1/2⋅4) = 5/2⋅N
where N is the total number of events we are processing
*Sorry for a limited TeX interpreter used on this page
Let’s count the number of ‘coins’ we need in order to write down all the Shannon-optimally encoded results:
Ncoins = N⋅(l[w,w]⋅p([w,w]) + lrest⋅∑rest p(rest)) = N⋅(1⋅1/2 + 4⋅8⋅1/16) = 5/2⋅N
where l is the length of the code for a given event
Which should give a bit of intuition on the meaning of entropy.
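A short sanity check of the two numbers above, with the distribution and code lengths taken from the USB example (the script itself is just my illustration): the per-event Shannon entropy and the expected Shannon-code length should both come out as 5/2 bits.

import math

probs = [1/2] + [1/16] * 8        # one outcome at 1/2, eight outcomes at 1/16 each
code_lengths = [1] + [4] * 8      # '0' for the likely outcome, 4-bit codes for the rest

entropy = -sum(p * math.log2(p) for p in probs)
expected_len = sum(p * l for p, l in zip(probs, code_lengths))

print(entropy)       # 2.5 bits per event
print(expected_len)  # 2.5 coins (bits) per event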
Couple more examples
Fair coin
S = N
Unfair coin
p(1) = 0.8, p(0) = 0.2
S ≈ 0.72 N
This relates to the definition of a “weak entropy source”, which is kind of obscure and underdefined, but everybody seems to act as if it were obvious. Best link so far: an answer on StackExchange. WIP
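The 0.72 above is just the binary entropy of p = 0.8; a one-line check:

import math

p = 0.8
print(-(p * math.log2(p) + (1 - p) * math.log2(1 - p)))  # ~0.7219 bits per flip, hence S ≈ 0.72 N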
Distance to a random star in cm
S = N ⋅ log2(N)
Provided we know the distribution and all the events are independent, we can do optimal extraction of information based on this principle. My work is dedicated to extending this principle to non-refined cases with arbitrary probabilities*, while providing output entropy of the same measure as the input entropy
* The probabilities should be fixed in time for a given distribution, although the approach can be extended to detecting distribution drift as well, which is kind of obvious for a physicist
Definition of Min-entropy
Min-entropy is the smallest amount of entropy carried by a single event in our distribution, i.e. the information content of the most likely outcome*
* Unofficial definition
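In the same unofficial spirit, for the USB example distribution the most likely outcome has p = 1/2, so the min-entropy is −log2(1/2) = 1 bit per event, versus the 5/2 bits of Shannon entropy. A quick check (my illustration):

import math

probs = [1/2] + [1/16] * 8          # the USB example distribution
h_min = -math.log2(max(probs))      # surprisal of the most likely event
h_shannon = -sum(p * math.log2(p) for p in probs)

print(h_min)      # 1.0 bit per event
print(h_shannon)  # 2.5 bits per event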
What was done is done
What was won is won
- Von Neumann, the man (a sketch of his trick follows right after this list)
- A ton of hash-based stuff
- Peres debiasing
- Asymptotically efficient improvements of Peres
- M algorithm
- Toeplitz matrix multiplication
- Error-Correction-Codes based Trevisan
- XOR, latin square, other fun stuff
- Zig-zag product
- More hardcore graph-based something extractors
- Pseudo-random number generator as hash-function as extractor
- Multiple sources input
- Attack-prone extractor
- And more
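Since Von Neumann opens the list, here is a minimal sketch of his trick (my code, not any particular library): read the biased bits in pairs, output '0' for the pair 01, '1' for 10, and drop 00 and 11. The output is unbiased for any fixed bias, at the price of discarding most of the input.

def von_neumann(bits: str) -> str:
    # Classic Von Neumann debiasing: 01 -> 0, 10 -> 1, 00/11 -> nothing
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)   # '01' contributes '0', '10' contributes '1'
    return "".join(out)

print(von_neumann("1101001110"))  # pairs 11, 01, 00, 11, 10 -> '01'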
I’m the bad guy
- Bound to min-entropy output
- Requires primary information about distribution
- Does not work with weak randomness sources
- Requires random seed
- Takes fixed input vectors x^n
- Produces fixed output*
* Or not. And it’s even worse when this happens
What I’ve done. Phase 1
Let’s map input {a,b,c} to binary representation
Phase 2
Let’s map cars on the road to a finger family members
Phase 3
Let’s map unfair coin to binary representation
Phase 4
Let’s remove prior knowledge about the source
Phase 5
Let’s forget about the graphs
Results. Or a guide to poor data presentation
There is science to be done
The test procedure:
I have generated 8 files with pseudo-random zeroes and ones. From file to file, the probability of '0' decreases, making them progressively weaker entropy sources: p(0) = [1/2, 1/4 ... , 1/64]. These files are called input files or initial generated files further on.
For each input file I have run my implementation of randomness extraction without supplying the algorithm with any preliminary information about the source's distribution.
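For reference, a sketch of how such a biased input file can be generated (the helper and the filename below are mine, not the actual scripts used for the test):

import random

def make_biased_file(path: str, p_zero: float, n_bits: int) -> None:
    # Write n_bits pseudo-random '0'/'1' characters with P('0') = p_zero
    rng = random.Random(0)  # fixed seed, only so the sketch is reproducible
    with open(path, "w") as f:
        f.write("".join("0" if rng.random() < p_zero else "1" for _ in range(n_bits)))

make_biased_file("biased_p1-16.txt", 1/16, 20000)  # one of the weaker sources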
- Most of the tests show a high score
- Some of the tests (RandomExcursions and RandomExcursionsVariant) were never executed, either for the initial or for the processed data files. I have not tried to debug this yet. Perhaps these tests require certain specific parameters to be set
- Two tests show a near-zero score: Universal and ApproximateEntropy. It is worth mentioning that the Universal test showed a zero score even for randomly generated data with equal probabilities of zeroes and ones. This means it either can detect a flaw in my laptop's pseudo-random generator, or there are certain special conditions for applying it that I have not met (check file "rep_p1-2_li20000.txt"). I will try to debug it later.
- The second near-zero test is ApproximateEntropy. Here are the results:
========= Pseudo-random with probabilities 1/16, 1/2, 1/4 ==========
==== Filename ========= p-value ====== Score ====== Test Name ==
rep_p1-16_li20000.txt:100 0.000000 * 0/100 * ApproximateEntropy
rep_p1-2_li20000.txt: 0.000006 * 95/100 * ApproximateEntropy
rep_p1-32_li20000.txt: 0.000000 * 0/100 * ApproximateEntropy
rep_p1-4_li20000.txt:100 0.000000 * 0/100 * ApproximateEntropy
========= After Randomness Extraction ===========
rep_p1-16_li1e7_lo56963.txt: 0.000000 * 1/10 * ApproximateEntropy
rep_p1-2_li1e7_lo35228.txt: 0.000000 * 0/10 * ApproximateEntropy
rep_p1-32_li1e7_lo14846.txt: 0.122325 10/10 ApproximateEntropy
rep_p1-4_li1e7_lo268728.txt: 0.213309 9/10 ApproximateEntropy
rep_p1-8_li1e7_lo53691.txt: 0.000000 * 3/10 * ApproximateEntropy
Results look like spaghetti!
//TODO: ask people about their food preferences before starting work with them
Added value
- Rev modes
- Unordered source
- State save/restore (NSA donations are accepted via PayPal)
The opposition
- Extractor should be stateless
- Should it? There are the Peres and second-order Von Neumann extractors
- My extractor can be turned stateless after a number of steps which will fix the state
- Extractor should map n->m input to output
- Nope. See Von Neumann
- The output bits are not independent from each other (see fractional information carry)
- It’s a feature, not a bug. The output of the extractor is just a map and it is as random as the input. How else does one expect using fractional information to work?
- The output tree contains holes due to the ‘learning’ process
- Yes. This is to be studied
- Turning the extractor stateless would disable the drawback
- Calculations are yet to be done on the severity of the effect and on the conditions for achieving a tolerable deviation from the proper distribution
- The effect is suppressed in the “reverse entropy” and “reverse number” modes which are used for fastest initial seed generation
- This is trivial
- I’ve tried hard for it to be
- It is written in Python
- Amen
Discussion
The output of the extractor is only as good as its input: for any extractor we can construct an input that produces weirdness
Why bother? This approach is best suited for weak entropy sources in embedded devices. IoT should be happy, provided this fits into microcontrollers