Linguistic number complexity
We know that computers are built to calculate. We can store numbers in computers very efficiently. But what about our own efficiency with numbers?
Human languages have evolved over thousands of years and can hence be expected to encode the most important everyday concepts rather efficiently. Are numbers important enough to humankind that languages integrate them naturally?
Let’s try to get a feeling for it. There are short words for single-digit numbers, matching the ten fingers on our hands. There are special words for the next couple of numbers, and then the first structure appears: the suffix -teen. Then, past twenty (the finger-and-toe count of all four limbs), the next structure appears: merging “ty” as “ten” with our basic numbers, implying multiplication by ten plus a single-digit remainder. This lasts until the first ten of tens, which has its own short name, a hundred. That’s where the next structure appears: everything repeats, but with the named hundreds in front. And so on.
If we dig into the structure behind this observation, we see that roughly every new order of magnitude (“order” being a short way to say “times 10”) brings a new structure* for numbers under ten thousand, and after that every third order brings a new name.
* In languages quirkier than English, additional structures emerge, like eighty being expressed as four twenties (French quatre-vingts), or ten thousand being important enough to get its own name. Ergodically, though, this small-number trickery converges to the same rules**
** But I cannot deny that the exact numbers for those cases are of significant interest. Code for correct wording in different languages is needed. Extra points for a phonetic count as well as an orthographic one.
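As a starting point, here is a minimal sketch of such wording code for English only (an assumption: short-scale naming without “and”, covering numbers up to the billions); the multilingual and phonetic variants remain open:

```python
# Minimal English number speller (assumption: short scale, no "and").
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
SCALES = ["", " thousand", " million", " billion"]

def under_thousand(n):
    # Spell 0 <= n < 1000.
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    return ONES[n // 100] + " hundred" + (" " + under_thousand(n % 100) if n % 100 else "")

def spell(n):
    # Spell 0 <= n < 10**12: name each three-digit group, attach its scale word.
    if n == 0:
        return "zero"
    parts = []
    for scale in SCALES:
        if n % 1000:
            parts.append(under_thousand(n % 1000) + scale)
        n //= 1000
    return " ".join(reversed(parts))
```

For example, `spell(777)` gives “seven hundred seventy-seven”.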
In simpler words, when we multiply a number by a thousand, its linguistic length grows by a fixed amount. That is logarithmic growth, which is far more efficient than growing by a factor of a thousand.
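A quick illustration with the 7…7 family (spellings hardcoded, short-scale English): each extra factor of a thousand appends one more group of digit words plus a scale word, so the letter count grows by a near-constant amount.

```python
# Letter counts of 777, 777_777, 777_777_777: spaces and hyphens ignored.
names = {
    777: "seven hundred seventy-seven",
    777_777: "seven hundred seventy-seven thousand seven hundred seventy-seven",
    777_777_777: ("seven hundred seventy-seven million seven hundred "
                  "seventy-seven thousand seven hundred seventy-seven"),
}
lengths = [sum(c.isalpha() for c in s) for s in names.values()]
print(lengths)                                        # [24, 56, 87]
print([b - a for a, b in zip(lengths, lengths[1:])])  # [32, 31]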
Let’s plot the linguistic length of every number:
The basic structure shows up between 1 and 10 and is then stretched across the tens. Round base-10 numbers are shorter than non-round ones. The longest-spelled multiple of ten is ‘seventy’.
Let’s see how the longest numbers, those of the form 7…7, behave:
With a logarithmic x-axis we see precisely what was predicted: a straight line, which corresponds to logarithmic growth. It takes at most about 100 symbols to name numbers below ten billion. Is that a lot? Let’s compare the longest lengths with the shortest ones, those of the form X0…0:
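The “about 100 symbols” estimate can be checked directly on the longest case below ten billion, 7,777,777,777 (spelling hardcoded, short-scale English):

```python
# Letter count of the longest name below ten billion (spaces/hyphens ignored).
name = ("seven billion seven hundred seventy-seven million seven hundred "
        "seventy-seven thousand seven hundred seventy-seven")
print(sum(c.isalpha() for c in name))  # 99 letters
```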
A striking difference: short-named numbers almost don’t grow. That’s why it is common to hear those *llions in the news: easy to pronounce, easy to perceive.
Noticed the periodic behaviour of those lengths? Let’s count the period: 5 periods span 15 decades. So the length grows for a decade, then another one, then drops for the next one, and so on. This corresponds to saying “X *llion”, then “Xty *llion”, and finally “X hundred *llion”.
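The three-decade period can be seen directly from the round numbers’ names (hardcoded, short-scale English):

```python
# Letter counts of 7 * 10**k for k = 6..11: two decades of growth,
# then a drop when the next scale word takes over.
names = ["seven million", "seventy million", "seven hundred million",
         "seven billion", "seventy billion", "seven hundred billion"]
lens = [sum(c.isalpha() for c in s) for s in names]
print(lens)  # [12, 14, 19, 12, 14, 19]
```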
What about the numbers in between? Are most numbers encoded closer to the shortest or to the longest possible character length?
Let’s generate random numbers and add them to the same plot:
This chart has 10 million random points on it
The chart above does not show every point but rather a density distribution. What do we see? On large scales it is far more probable to hit a “long” number at random. There are gaps, which correspond to letter quantisation. And although no number is ‘longer’ than the all-sevens number of its decade, jumps in ‘length’ occur once we hit the next decade.
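The dominance of “long” numbers under uniform sampling is easy to confirm: below 10^10, nine out of ten uniform draws land in the top decade. A quick simulation (not tied to the exact chart above):

```python
import random

random.seed(1)
# Fraction of uniform draws from [1, 10**10) that fall in the top
# decade [10**9, 10**10): should be close to 0.9.
N = 100_000
top_decade = sum(random.randrange(1, 10**10) >= 10**9 for _ in range(N)) / N
print(top_decade)
```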
One more thing left to check: will the density shift once we consider the empirical law of digit distribution in multi-order data sets? In other words, let’s plug in Benford’s law and see if things change.
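A common way to sample Benford-distributed numbers is to draw the magnitude log-uniformly; a sketch (the exact generator behind the chart is unknown, so this is an assumption):

```python
import math
import random
from collections import Counter

random.seed(0)
# Drawing the exponent uniformly makes the magnitudes log-uniform,
# which yields Benford-distributed leading digits.
samples = [int(10 ** random.uniform(0, 10)) for _ in range(100_000)]
freq = Counter(int(str(n)[0]) for n in samples)
for d in range(1, 10):
    observed = freq[d] / len(samples)
    benford = math.log10(1 + 1 / d)   # Benford probability of leading digit d
    print(d, round(observed, 3), round(benford, 3))
```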
Blue: random numbers drawn from Benford’s distribution; pink: uniformly distributed numbers.
Results: visibly better, but clearly not by much. Fewer leading ‘sevens’ do help; for numbers with many digits, however, the uniform distribution of the non-leading digits still dominates.
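A back-of-the-envelope check of how much the leading digit alone can save (letter counts of the digit words hardcoded; non-leading digits ignored, which is exactly the simplification the text points at):

```python
import math

# Letter counts of the words "one" .. "nine".
word_len = {1: 3, 2: 3, 3: 5, 4: 4, 5: 4, 6: 3, 7: 5, 8: 5, 9: 4}
# Average length of the leading-digit word under a uniform vs a Benford leading digit.
uniform_mean = sum(word_len.values()) / 9
benford_mean = sum(math.log10(1 + 1 / d) * word_len[d] for d in range(1, 10))
print(round(uniform_mean, 2), round(benford_mean, 2))
```

Only the leading word shortens, from 4.0 to about 3.7 letters on average, while all the non-leading digits stay uniformly distributed, so the overall gain is small.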
Large numbers? Let’s check it out.
It kind of works, but the other way around: one distribution spans lower than the other. Which suggests that humans are likely no more efficient with the numbers they meet daily than with mathematically random ones.
Why are the results inconclusive? Because a proper analysis would require math, not eyeballing pictures. But that is left for other keen minds.
Moral of the story? Finishing pointless projects IS finishing projects. Have fun!