Extraction preparation

Mrk

Today on X-play:

Q:Why are we here?

A: There is more room to increase the silliness of my research

Q:Schedule?

A: This, ±100%:

  • 15-30 mins: the origin of the problem that kept me hostage for years
  • 15-30 mins: “Old man yells at clouds”
  • 15 mins: warm water break
  • 2+ hours: sciency stuff

Q:Questions interruption policy?

A: Anytime. But I reserve the right to ask you to reschedule a question for the afterparty if it doesn’t fit the flow

Q:Where is my cake?

A: You have the right to apply for a cake under the official “Cakeware license” as soon as you attend the meeting in person, “in vivo”

WARNING: this page contains poorly scaled images and a cultural reference to the R* roll. Sensitive people are advised to proceed with caution


Pt1: The problem origin

Job search (and a brief introduction to bash piping)

$ cat docs/cv/log | wc -l
> 94

$ cat docs/cv/log | grep denied |  wc -l
> 34

$ cat docs/cv/log | grep interview |  wc -l
> 1

Where

  • cat - prints the document docs/cv/log containing my job application log
  • grep xxx - keeps only the lines containing xxx
  • wc -l - counts the lines of the provided input
  • | - pipes the output of the previous command into the input of the following command

Cheeky plan from the todo list

TODO list (Mark’s Millennium problems, shortlist with physics problems almost completely excluded):

  • Mechanical chaos
  • Noise data transmission
  • La2 web market
  • Swap game
  • Funny mechanics MMORPG
  • Break Bitcoin
  • Oil Sprengel pump
  • Moth cannon
  • Plastic hose cutter for Eudg
  • Inflatable satellite dish
  • KUKA hot-dog maker
  • Vacuum insulation for the poor
  • T-shirt shop //Delayed until image generators are ready
  • Stonks: the crypto-based stock market for non-companies
  • Gauss game: a crypto gambling www overlay
  • W3ddit: reddit-like www overlay
  • A Toribash-like arcade car racing game
  • Randomness extraction algorithm that the world is yet to witness
  • Choose what email to use for Tinder account

IPT problem

source

Kudos July

Problem solved and reported, except for output optimisation. A matter of a couple of hours, bro. Except there is no such luxury at the IPT

The extraction problem

Randomness extraction means translating the physical distribution of a given value into a stream of bits with a number of properties, such as equal bit probabilities and complete independence between bits.

Attempt #1

MIPT project course project: failed, but not hard enough

  • Demonstrated partial success for one particular case (a few weeks after the exam)

  • Student avoided further contact

    • Probably because the theory was incorrect

    • And indeed it was

Attempt #2

Biohacking: Kazakhstan’s air stimulates brain cells and increases productivity

Only 3 months of coding later, I got the code that the world needed so much:

$ echo -e "Never gonna give you up
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry
Never gonna say goodbye
Never gonna tell a lie and hurt you" | sed -e "s/ /\n/g" | ./mrge.py -p -c str 
> 000101100111100110110100010000010010

Deeper dive (no more Ricks there):

Despite its straightforwardness and simplicity, I have failed to find the algorithm published anywhere. The search is still on.

Links:

webmrk.org/mrge_readme

mrge.webmrk.org


Pt2: Academia attempts

Preparations

How not to warm up: EPFL

First approximation

Twitter search, Linkedin search: Randomness extractor

mailto: Hassan Chizari, Gloucester

mailto: random Japanese dude, Pittsburgh

<- mailfrom: Hassan Chizari: Yeah, but no

Lucky coincidence

LinkedIn: PhD scholarships at St. Andrews

<- mailfrom: Ognjen Arandelovic

Scaling

  • Toronto
  • Waterloo
  • Sydney
  • Melbourne
  • Auckland
  • Beijing
  • Delft
  • + A couple more with totally irrelevant dudes

Results

1 Application to St. Andrews //Update March 7: Fail

1 Acceptance without obligations, Gloucester

1 Redirect. Resulted in silence

1 Cautious interest

Cautious interest

Waterloo. 5g-data-transmission-state-of-art-something-data-transfer-lab

Conversation starter: carrier-less data transmission //Fail

Trench stage: comparison table


Extra slide: job search bloopers

  • Jobot: why would anyone make a job search site with AI-generated jobs? Upd: very funny

  • Job search resembles partner search: either you have lots of friends and constantly make new connections around you, or you have to use services that require you to look like a job searcher, talk like a job searcher, behave like a job searcher. And when someone sees that you’ve held two jobs at the same time in the past, you are immediately tagged “a specimen of corporatus immoralitus”

  • The German incident

  • The French accident

  • The Porkchop startuping

  • ChatGPT before it became mainstream

  • Beijing university structure


Pt3: The hardcore-ish part

Where we will learn to draw an information-based owl

Contents:

  • Definition

  • Intuition

  • Literature

  • Uncomprehensibles

  • The weirdness

Deeper dive into the problem

Task: transform xxxx -> yyy

Where xxxx is a random input of length N and yyy is the extracted output of length M, M <= N

Example. Parity-based

10011000111101110 -> 1
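
A minimal sketch of such an extractor, assuming “parity” here means the XOR of the whole input block (whether the example above uses the even- or odd-parity convention is not specified):

def parity_extract(bits: str) -> str:
    # Collapse a whole input block into one output bit: the XOR of all its bits.
    # Flip the result if the odd-parity convention is wanted instead.
    return str(bits.count("1") % 2)

print(parity_extract("1011"))   # three ones -> "1"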

Bad example. Throw all you can to the output

$ echo 0 | sha256sum
9a271f2a916b0b6ee6cecb2426f0b3206ef074578be55d9bc94f6f3fe3ab86aa

Entropy definition

s(x) = −log2(p(x))

where

  • s(x) - the “entropy” of event x, or rather the amount of information that event carries
  • x - an event. Generally speaking it could be anything. Examples of event spaces:
    • {0,1}
    • {‘apples’, ‘oranges’, ‘firetrucks’}
    • {0..1025}
    • {%parity of the number of legs you woke up with today%}
    • {%Rice vs Texas score%}
  • log2(_) - the measure of the event. We will make sense of it later
  • p(x) - the probability (or probability density) of the event, i.e. 1/2 for a fair coin, 1/6 for a fair die and 0 for a casino win
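
To make the formula concrete, a quick sketch plugging in the example probabilities above:

from math import log2

# s(x) = -log2(p(x)): information carried by a single event of probability p(x)
for name, p in [("fair coin side", 1/2), ("fair die face", 1/6)]:
    print(name, -log2(p), "bits")   # 1.0 and ~2.585 bits
# A casino win (p = 0) would carry infinite information: -log2(0) diverges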

Making sense of entropy:

  1. It does not matter what the input is. What matters is its probability
    • Hence any transformation applied to the input to beautify it is useless
  2. Small probability equals lots of information
    • Working lamp in the fridge = not interested
    • Waking up with 3 legs in the morning = now you’re talking (also: decreased effect on children)
  3. Quantification of information for different bases
    • It takes one die to quantify the state of one die
    • It takes 3 coins to quantify the state of a die, leaving two “excess” places
    • It takes 2 USB ports to quantify the state of a die, leaving 3 spare states (assuming a standard USB Type A port with 3 orientation states: {plastic down, plastic up, you are wrong}). Normally we stick to base 2, hence coin encoding

Entropy == Information <- the number of coins you need in order to describe the system.


  4. Unfair distributions are treated in terms of optimal encoding:
    • Let’s imagine a roll of 2 independent fair USB ports. Imagine statistics show that the probability of getting the result [you are wrong, you are wrong] is 1/2 and the rest are equally probable. Hence each of the remaining cases has p(at least one is classic) = 1/16, where the 1/16 is 1/2 for “not all are wrong simultaneously” times 1/8 for the number of equal-probability states. So the optimal encoding of this event in terms of a Shannon encoding tree is: [w,w]:0, [w,u]:1000, [w,d]:1001, [u,w]:1010, [u,u]:1011 … [d,d]:1111. (I will lose half of the audience here)
    • Optimal encoding = the minimal size of the string required to write down the results of sampling numerous events
      * In this example a single event is not a roll of a single USB port but the states of both USB ports taken together as one input
      ** It’s weird that this image looks like the US TV ads

Let’s count the entropy of the USB example distribution:

Stot = −N ⋅ ∑_i p(x_i) ⋅ log2(p(x_i)) = −N ⋅ (1/2⋅log2(1/2) + 8⋅1/16⋅log2(1/16)) = N ⋅ (1/2⋅1 + 1/2⋅4) = 5/2 ⋅ N

Where N is the total number of events we are processing

*Sorry for a limited TeX interpreter used on this page

Let’s count the number of ‘coins’ we need in order to write down all the Shannon-optimally encoded results:

Ncoins = N ⋅ (l_[w,w]⋅p([w,w]) + l_rest⋅∑_rest p([rest])) = N ⋅ (1⋅1/2 + 4⋅8⋅1/16) = 5/2 ⋅ N

Where l is the length of the code for a given event.

Which should give a bit of intuition on the meaning of entropy.
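
A quick sanity check of both numbers in Python (a sketch; the labels rest0 … rest7 stand in for [w,u] … [d,d]):

from math import log2

# One outcome with p = 1/2 plus eight outcomes with p = 1/16 each,
# encoded with the Shannon tree above: "0" vs the four-bit codes "1000".."1111"
dist  = {"[w,w]": 1/2, **{f"rest{i}": 1/16 for i in range(8)}}
codes = {"[w,w]": "0", **{f"rest{i}": "1" + format(i, "03b") for i in range(8)}}

entropy_per_event = -sum(p * log2(p) for p in dist.values())
avg_code_length   = sum(dist[e] * len(codes[e]) for e in dist)
print(entropy_per_event, avg_code_length)   # both print 2.5 (bits per event)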

Couple more examples

Fair coin

S = N

Unfair coin

p(1) = .8
p(0) = .2

S = 0.72 N
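
The 0.72 comes straight from the same formula:

from math import log2

print(-(0.8 * log2(0.8) + 0.2 * log2(0.2)))   # ~0.7219 bits per toss, hence S ≈ 0.72 N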

This is related to the definition of a “weak entropy source”, which is kind of obscure and underdefined, but everybody seems to act as if it were obvious. Best link so far: answer at stackExchange. WIP

Distance to a random star in cm

S = N ⋅ log2(N)

Provided we know the distribution and all the events are independent, we can do optimal extraction of information based on this principle. My work is dedicated to extending this principle to non-refined cases with arbitrary probabilities*, while still providing output entropy of the same measure as the input entropy.
* The probabilities should be fixed in time for a given distribution, although the approach can be extended to detecting a drifting distribution too, which is kind of obvious for a physicist

Definition of Min-entropy

Min-entropy is the smallest entropy of a single event in our distribution*

* Unofficial definition
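
For reference, the usual formal definition in the notation used above, which picks out the most probable event:

Hmin = −log2(max_x p(x)) = min_x s(x)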

What was done is done

What was won is won

  • Von Neumann, the man (Peteroupc kudos); see the sketch after this list
  • A ton of hash-based stuff
  • Peres debiasing, Peres explained by Peteroupc
  • Asymptotically efficient improvement of Peres
  • M algorithm
  • Toeplitz matrix multiplication
  • Error-Correction-Codes based Trevisan
  • XOR, latin square, other fun stuff
  • Zig-zag product
  • More hardcore graph-based something extractors
  • Pseudo-random number generator as hash-function as extractor
  • Multiple sources input
  • Attack-prone extractor
  • And more
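
Since Von Neumann’s trick is referenced again below, here is the textbook version as a minimal sketch (a reference point only, not part of mrge): read the input in non-overlapping pairs, output 0 for “01”, 1 for “10”, and discard “00”/“11”.

def von_neumann(bits: str) -> str:
    # Emit the first bit of each unequal non-overlapping pair:
    # "01" -> "0", "10" -> "1"; equal pairs carry no output.
    out = []
    for a, b in zip(bits[::2], bits[1::2]):
        if a != b:
            out.append(a)
    return "".join(out)

print(von_neumann("1101001110"))   # pairs 11,01,00,11,10 -> "01"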

I’m the bad guy

  • Bound to min-entropy output
  • Requires prior information about the distribution
  • Does not work with weak randomness sources
  • Requires random seed
  • Takes fixed input vectors x^n
  • Produces fixed output*

* Or not. And it’s even worse when this happens

What I’ve done. Phase 1

Let’s map input {a,b,c} to binary representation

Phase 2

Let’s map cars on the road to finger family members

Phase 3

Let’s map unfair coin to binary representation

Phase 4

Let’s remove prior knowledge about the source

Phase 5

Let’s forget about the graphs

Results. Or a guide to poor data presentation

There is science to be done

The test procedure:

I have generated 8 files with pseudo-random zeroes and ones. From file to file the probability of '0' decreases, making them progressively weaker entropy sources: p(0) = [1/2, 1/4, ..., 1/64]. These files are called input files or initial generated files further on.
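
A sketch of how such input files could be produced (file names and lengths here are illustrative, not the exact ones used for the report):

import random

# Biased pseudo-random bit files: p('0') halves from one file to the next,
# covering the probabilities listed above (1/2 ... 1/64)
for k in range(1, 7):
    p0 = 1.0 / 2**k
    bits = "".join("0" if random.random() < p0 else "1" for _ in range(20000))
    with open(f"input_p1-{2**k}.txt", "w") as f:
        f.write(bits)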

For each input file I've run my implementation of randomness extraction without supplying the algorithm any preliminary information about the source's distribution.

- Most of the tests show a high score
- Some of the tests (RandomExcursions and RandomExcursionsVariant) have never been executed, either for the initial or for the processed data files. I have not tried to debug this and get them working yet. Perhaps these tests require certain specific parameters to be set
- Two tests show a near-zero score: Universal and ApproximateEntropy. It is worth mentioning that the Universal test showed a zero score even for randomly generated data with equal probabilities of zeroes and ones. This means it either can detect a flaw in my laptop's pseudo-random generator, or there are certain special conditions for applying it that I have not met (check file "rep_p1-2_li20000.txt"). I will try to debug it later.
- The second near-zero test is ApproximateEntropy. Here are the results:
========= Pseudo-random with probabilities 1/16, 1/2, 1/4 ==========
==== Filename ========= p-value ====== Score ====== Test Name ==
rep_p1-16_li20000.txt:100           0.000000 *    0/100  *  ApproximateEntropy
rep_p1-2_li20000.txt:      0.000006 *   95/100  *  ApproximateEntropy
rep_p1-32_li20000.txt:      0.000000 *    0/100  *  ApproximateEntropy
rep_p1-4_li20000.txt:100           0.000000 *    0/100  *  ApproximateEntropy
========= After Randomness Extraction ===========
rep_p1-16_li1e7_lo56963.txt:          0.000000 *    1/10   *  ApproximateEntropy
rep_p1-2_li1e7_lo35228.txt:          0.000000 *    0/10   *  ApproximateEntropy
rep_p1-32_li1e7_lo14846.txt:           0.122325     10/10      ApproximateEntropy
rep_p1-4_li1e7_lo268728.txt:           0.213309      9/10      ApproximateEntropy
rep_p1-8_li1e7_lo53691.txt:          0.000000 *    3/10   *  ApproximateEntropy

Results look like spaghetti!

//TODO: ask people about their food preferences before starting work with them

Added value

  1. Rev modes
  2. Unordered source
  3. State save/restore (NSA donations are accepted via PayPal)

The opposition

  1. Extractor should be stateless
    1. Does it? There are Peres and second-order Von Neumann extractors
    2. My extractor can be turned stateless after a number of steps, which will fix the state
  2. An extractor should map input to output as n->m
    • Nope. See Von Neumann
  3. The output bits are not independent of each other (see fractional information carry)
    • It’s a feature, not a bug. The output of the extractor is just a map, and it is as random as the input. How else would one expect fractional information carry to work?
  4. The output tree contains holes due to the ‘learning’ process
    1. Yes. This is to be studied
    2. Turning the extractor stateless would disable the drawback
    3. Calculations are yet to be conducted regarding the severity of the effect and the conditions for achieving a tolerable deviation from the proper distribution
    4. The effect is suppressed in the “reverse entropy” and “reverse number” modes, which are used for the fastest initial seed generation
  5. This is trivial
    • I’ve tried hard for it to be
  6. It is written in Python
    • Amen

Discussion

  1. The output of the extractor is only as good as its input: for any extractor we may generate an input that produces weirdness

  2. Why bother? This approach is best suited for weak entropy sources in embedded devices. The IoT world should be happy if this fits into microcontrollers