Extraction preparation
Today on X-play:
Q:Why are we here?
A: There is more room to increase the silliness of my research
Q:Schedule?
A: This +- 100%:
- 15-30 mins: the origin of the problem that kept me hostage for years
- 15-30 mins: “Old man yells at clouds”
- 15 mins: warm water break
- 2+ hours: sciency stuff
Q:Questions interruption policy?
A: Anytime. But I reserve the right to postpone a question to the afterparty if it doesn’t fit the flow
Q:Where is my cake?
A: You have the right to apply for a cake under the official “Cakeware license” as soon as you attend the meeting in person, “in vivo”
WARNING: this page contains poorly scaled images and a cultural reference to the R* roll. Sensitive people are advised to proceed with caution
Pt1: The problem origin
Job search (and a brief introduction to bash piping)
$ cat docs/cv/log | wc -l
> 94
$ cat docs/cv/log | grep denied | wc -l
> 34
$ cat docs/cv/log | grep interview | wc -l
> 1
Where
cat - prints the file docs/cv/log with my job application log
grep xxx - keeps only the lines containing xxx
wc -l - counts the lines of the provided input
| - pipes the output of the previous command into the input of the next one
Cheeky plan from the todo list
TODO list (Mark’s Millennium problems, shortlist with physics problems almost completely excluded):
- Mechanical chaos
- Noise data transmission
- La2 web market
- Swap game
- Funny mechanics MMORPG
- Break Bitcoin
- Oil Sprengel pump
- Moth cannon
- Plastic hose cutter for Eudg
- Inflatable satellite dish
- KUKA hot-dog maker
- Vacuum insulation for the poor
- T-shirt shop //Delayed until image generators are ready
- Stonks: the crypto-based stock market for non-companies
- Gauss game: a crypto gambling www overlay
- W3ddit: reddit-like www overlay
- An arcade car racing game toribash-like
- Randomness extraction algorithm that the world is yet to witness
- Choose what email to use for Tinder account
IPT problem
Kudos July
Problem solved and reported, except for output optimisation. A matter of a couple of hours, bro. Except there is no such luxury at the IPT
The extraction problem
Randomness extraction means translating the physical distribution of a given value into a stream of bits with a number of properties, such as equal bit probabilities and complete independence between the bits.
Attempt #1
MIPT project-course project: failed, but not hard enough
Demonstrated partial success for one particular case (a few weeks after the exam)
Student avoided further contact
Probably because the theory was incorrect
And indeed it was
Attempt #2
Biohacking: Kazakhstan’s air stimulates brain cells and increases productivity
Only 3 months of coding later, I got the code the world needed so much:
$ echo -e "Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you" | sed -e "s/ /\n/g" | ./mrge.py -p -c str
> 000101100111100110110100010000010010
Deeper dive (no more Ricks there):
Despite its straightforwardness and simplicity I have failed to find the algorithm published anywhere. The search is still on.
Links:
Pt2: Academia attempts
Preparations
How not to warm up: EPFL
First approximation
Twitter search, Linkedin search: Randomness extractor
mailto: Hassan Chizari, Gloucestershire
mailto: random Japanese dude, Pittsburgh
<- mailfrom: Hassan Chizari: Yeah, but no
Lucky coincidence
Linkedin: PhD scholarships at St.Andrews
<- mailfrom: Ognjen Arandelovic
Scaling
- Toronto
- Waterloo
- Sydney
- Melbourne
- Auckland
- Beijing
- Delft
- + A couple more with totally irrelevant dudes
Results
1 Application to St. Andrews //Update March 7: Fail
1 Acceptance without obligations, Gloucestershire
1 Redirect. Resulted in silence
1 Cautious interest
Cautious interest
Waterloo. A 5G-data-transmission-state-of-the-art-something-data-transfer lab
Conversation starter: carrier-less data transmission //Fail
Trench stage: comparison table
Extra slide: job search bloopers
Jobot: why would anyone make a job search site with AI-generated jobs? Upd: very funny
Job searching resembles partner searching: either you have lots of friends and constantly make new connections around you, or you have to use services that require you to look like a job seeker, talk like a job seeker, behave like a job seeker. And when someone sees you’ve held two jobs at the same time in the past, you are immediately tagged “a specimen of corporatus immoralitus”
The German incident
The French accident
The Porkchop startuping
ChatGPT before it became mainstream
Beijing university structure
Pt3: The hardcore-ish part
Where we will learn to draw an information-based owl
Contents:
Definition
Intuition
Literature
Uncomprehensibles
The weirdness
Deeper dive into the problem
Task: transform xxxx -> yyy, where xxxx is a random input of length N and yyy is the extracted output of length M, M <= N
Example. Parity-based
10011000111101110 -> 1
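A minimal sketch of this idea (plain Python; my illustration, not the actual mrge.py code): XOR all input bits into a single output bit. Simple, but wasteful, since N input bits yield only 1 output bit.

def parity_extract(bits: str) -> str:
    # XOR all input bits together and return the result as '0' or '1'
    acc = 0
    for b in bits:
        acc ^= int(b)
    return str(acc)

print(parity_extract("1011"))               # 1 ^ 0 ^ 1 ^ 1 -> '1'
print(parity_extract("10011000111101110"))  # the example input from above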
Bad example. Throw all you can to the output
$ echo 0 | sha256sum
9a271f2a916b0b6ee6cecb2426f0b3206ef074578be55d9bc94f6f3fe3ab86aa
Entropy definition
s(x) = −log2(p(x))
where
s(x) - the “entropy” of event x, or rather the amount of information that the event carries
x - the event. Generally speaking, it could be anything. Examples of event spaces:
- {0,1}
- {‘apples’, ‘oranges’, ‘firetrucks’}
- {0..1025}
- {%parity of the number of legs you woke up with today%}
- {%Rice vs Texas score%}
log2(_) - the measure of the event. We will make sense of it later
p(x) - the probability, as in probability density: 1/2 for a fair coin, 1/6 for a fair die and 0 for a casino win
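A tiny Python check of this definition, using only the probabilities just listed:

import math

def surprisal(p: float) -> float:
    # s(x) = -log2(p(x)): the amount of information carried by an event of probability p
    return -math.log2(p)

print(surprisal(1/2))   # fair coin flip -> 1.0 bit
print(surprisal(1/6))   # fair die roll  -> ~2.585 bits
# p = 0 (the casino win) has infinite surprisal, so we skip it here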
Make sense of entropy:
- It does not matter what the input is. What matters is its probability
- Hence making any transformations over the input to beautify it is useless
- Small probability equals lots of information
- Working lamp in the fridge = not interested
- Waking up with 3 legs in the morning = now you’re talking (also: decreased effect on children)
- Quantification of information for different bases
- It takes one die to quantify the state of one die
- It takes 3 coins to quantify the state of a die, leaving two “excess” states
- It takes 2 USB ports to quantify the state of a die, leaving 3 spare states (assuming a standard USB Type A port with 3 orientation states: {plastic down, plastic up, you are wrong}). Normally we stick to base 2, hence coin encoding
Entropy == Information <- Amount of coins you need in order to describe the system.
- Unfair distributions are treated in terms of optimal encoding:
- Let’s imagine a roll of 2 independent fair USB ports. Imagine statistics show that the probability of getting [you are wrong, you are wrong] is 1/2 and the remaining 8 outcomes are equally probable. Hence each of the remaining cases has probability 1/16, where the 1/16 is the product of 1/2 for “not both are wrong simultaneously” and 1/8 for the number of equal-probability states. So the optimal encoding of this source in terms of a Shannon encoding tree is: [w,w]:0, [w,u]:1000, [w,d]:1001, [u,w]:1010, [u,u]:1011 … [d,d]:1111
* Optimal encoding = the minimal length of the string required to write down the results of sampling numerous events
* In this example a single event is not the roll of a single USB port but the joint state of both 3-state ports taken as one input
** It’s weird that this image looks like the US TV ads
Let’s count the entropy of the USB example distribution:
Stot = −∑i p(xi)⋅log2(p(xi)) = N⋅(−1/2⋅log2(1/2) − 8⋅(1/16)⋅log2(1/16)) = N⋅(1/2⋅1 + 1/2⋅4) = 5/2⋅N
where N is the total number of events we are processing
*Sorry for a limited TeX interpreter used on this page
Let’s count the number of ‘coins’ we need in order to write down all the Shannon-optimally encoded results:
Ncoins = N⋅(l[w,w]⋅p([w,w]) + lrest⋅∑rest p(rest)) = N⋅(1⋅1/2 + 4⋅8⋅1/16) = 5/2⋅N
where l is the length of the code for a given event
Which should give a bit of intuition on the meaning of entropy.
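A short sanity check of the two numbers above, with the distribution and code lengths taken from the USB example (the script itself is just my illustration): the per-event Shannon entropy and the expected Shannon-code length should both come out as 5/2 bits.

import math

probs = [1/2] + [1/16] * 8        # one outcome at 1/2, eight outcomes at 1/16 each
code_lengths = [1] + [4] * 8      # '0' for the likely outcome, 4-bit codes for the rest

entropy = -sum(p * math.log2(p) for p in probs)
expected_len = sum(p * l for p, l in zip(probs, code_lengths))

print(entropy)       # 2.5 bits per event
print(expected_len)  # 2.5 coins (bits) per event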
Couple more examples
Fair coin
S = N
Unfair coin
p(1) = 0.8, p(0) = 0.2
S ≈ 0.72 N
This relates to the definition of a “weak entropy source”, which is kind of obscure and underdefined, but everybody seems to act as if it were obvious. Best link so far: an answer on StackExchange. WIP
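The 0.72 above is just the binary entropy of p = 0.8; a one-line check:

import math

p = 0.8
print(-(p * math.log2(p) + (1 - p) * math.log2(1 - p)))  # ~0.7219 bits per flip, hence S ≈ 0.72 N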
Distance to a random star in cm
S = N ⋅ log2(N)
Provided we know the distribution and all the events are independent, we can do optimal extraction of information based on this principle. My work is dedicated to extending this principle to non-refined cases with arbitrary probabilities*, while providing output entropy of the same measure as the input entropy
* The probabilities should be fixed in time for a given distribution, although the approach can be extended to detecting distribution drift as well, which is kind of obvious for a physicist
Definition of Min-entropy
Min-entropy is the smallest amount of entropy carried by a single event in our distribution, i.e. the information content of the most likely outcome*
* Unofficial definition
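In the same unofficial spirit, for the USB example distribution the most likely outcome has p = 1/2, so the min-entropy is −log2(1/2) = 1 bit per event, versus the 5/2 bits of Shannon entropy. A quick check (my illustration):

import math

probs = [1/2] + [1/16] * 8          # the USB example distribution
h_min = -math.log2(max(probs))      # surprisal of the most likely event
h_shannon = -sum(p * math.log2(p) for p in probs)

print(h_min)      # 1.0 bit per event
print(h_shannon)  # 2.5 bits per event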
What was done is done
What was won is won
- Von Neumann, the man (a sketch of his trick follows right after this list)
- A ton of hash-based stuff
- Peres debiasing
- Asymptotically efficient improvements of Peres
- M algorithm
- Toeplitz matrix multiplication
- Error-Correction-Codes based Trevisan
- XOR, latin square, other fun stuff
- Zig-zag product
- More hardcore graph-based something extractors
- Pseudo-random number generator as hash-function as extractor
- Multiple sources input
- Attack-prone extractor
- And more
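Since Von Neumann opens the list, here is a minimal sketch of his trick (my code, not any particular library): read the biased bits in pairs, output '0' for the pair 01, '1' for 10, and drop 00 and 11. The output is unbiased for any fixed bias, at the price of discarding most of the input.

def von_neumann(bits: str) -> str:
    # Classic Von Neumann debiasing: 01 -> 0, 10 -> 1, 00/11 -> nothing
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)   # '01' contributes '0', '10' contributes '1'
    return "".join(out)

print(von_neumann("1101001110"))  # pairs 11, 01, 00, 11, 10 -> '01'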
I’m the bad guy
- Bound to min-entropy output
- Requires primary information about distribution
- Does not work with weak randomness sources
- Requires random seed
- Takes fixed input vectors x^n
- Produces fixed output*
* Or not. And it’s even worse when this happens
What I’ve done. Phase 1
Let’s map input {a,b,c} to binary representation
Phase 2
Let’s map cars on the road to a finger family members
Phase 3
Let’s map unfair coin to binary representation
Phase 4
Let’s remove prior knowledge about the source
Phase 5
Let’s forget about the graphs
Results. Or a guide to poor data presentation
There is science to be done
The test procedure:
I have generated 8 files with pseudo-random zeroes and ones. From file to file, the probability of '0' decreases, making them progressively weaker entropy sources: p(0) = [1/2, 1/4 ... , 1/64]. These files are called input files or initial generated files further on.
For each input file I have run my implementation of randomness extraction without supplying the algorithm with any preliminary information about the source's distribution.
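For reference, a sketch of how such a biased input file can be generated (the helper and the filename below are mine, not the actual scripts used for the test):

import random

def make_biased_file(path: str, p_zero: float, n_bits: int) -> None:
    # Write n_bits pseudo-random '0'/'1' characters with P('0') = p_zero
    rng = random.Random(0)  # fixed seed, only so the sketch is reproducible
    with open(path, "w") as f:
        f.write("".join("0" if rng.random() < p_zero else "1" for _ in range(n_bits)))

make_biased_file("biased_p1-16.txt", 1/16, 20000)  # one of the weaker sources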
- Most of the tests show a high score
- Some of the tests (RandomExcursions and RandomExcursionsVariant) were never executed, either for the initial or for the processed data files. I have not tried to debug this yet. Perhaps these tests require certain specific parameters to be set
- Two tests show a near-zero score: Universal and ApproximateEntropy. It is worth mentioning that the Universal test showed a zero score even for randomly generated data with equal probabilities of zeroes and ones. This means it either can detect a flaw in my laptop's pseudo-random generator, or there are certain special conditions for applying it that I have not met (check file "rep_p1-2_li20000.txt"). I will try to debug it later.
- The second near-zero test is ApproximateEntropy. Here are the results:
========= Pseudo-random with probabilities 1/16, 1/2, 1/4 ==========
==== Filename ========= p-value ====== Score ====== Test Name ==
rep_p1-16_li20000.txt:100 0.000000 * 0/100 * ApproximateEntropy
rep_p1-2_li20000.txt: 0.000006 * 95/100 * ApproximateEntropy
rep_p1-32_li20000.txt: 0.000000 * 0/100 * ApproximateEntropy
rep_p1-4_li20000.txt:100 0.000000 * 0/100 * ApproximateEntropy
========= After Randomness Extraction ===========
rep_p1-16_li1e7_lo56963.txt: 0.000000 * 1/10 * ApproximateEntropy
rep_p1-2_li1e7_lo35228.txt: 0.000000 * 0/10 * ApproximateEntropy
rep_p1-32_li1e7_lo14846.txt: 0.122325 10/10 ApproximateEntropy
rep_p1-4_li1e7_lo268728.txt: 0.213309 9/10 ApproximateEntropy
rep_p1-8_li1e7_lo53691.txt: 0.000000 * 3/10 * ApproximateEntropy
Results look like spaghetti!
//TODO: ask people about their food preferences before starting work with them
Added value
- Rev modes
- Unordered source
- State save/restore (NSA donations are accepted via PayPal)
The opposition
- Extractor should be stateless
- Should it? There are the Peres and second-order Von Neumann extractors
- My extractor can be turned stateless after a number of steps which will fix the state
- Extractor should map n->m input to output
- Nope. See Von Neumann
- The output bits are not independent from each other (see fractional information carry)
- It’s a feature, not a bug. The output of the extractor is just a map and it is as random as the input. How else does one expect using fractional information to work?
- The output tree contains holes due to the ‘learning’ process
- Yes. This is to be studied
- Turning the extractor stateless would disable the drawback
- Calculations are yet to be done on the severity of the effect and on the conditions for achieving a tolerable deviation from the proper distribution
- The effect is suppressed in the “reverse entropy” and “reverse number” modes which are used for fastest initial seed generation
- This is trivial
- I’ve tried hard for it to be
- It is written in Python
- Amen
Discussion
The output of the extractor is only as good as its input: for any extractor we can construct an input that produces weirdness
Why bother? This approach is best suited for weak entropy sources in embedded devices. IoT should be happy, provided this fits into microcontrollers