Orthographic Mass
A circular-statistics invariant of written forms across writing systems
Blaize Rouyea · Corey Bourgeois
abstract
We define the orthographic mass of a written word as the magnitude of the first circular moment of the angular distribution of its UTF-8 nibbles. The construction is parameter-free and corpus-free: it reads only the bytes of a single written form. Our central claim is deliberately small and, we hope, durable.
The quantity we define is not new. It is identical to the mean resultant length R̄, the standard measure of concentration in directional statistics, and its complement 1−R̄ is the circular variance. What is new is the object we point it at — the externalized written form under a fixed encoding — and what that instrument turns out to see.
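As a minimal sketch of the definition above, the following Python computes the mass of a single word. It assumes that each UTF-8 byte contributes its high and low nibble, and that a nibble value k in 0..15 sits at angle 2πk/16 on the unit circle. The abstract does not spell out this mapping, so treat it, and the function names `utf8_nibbles`, `orthographic_mass`, and `circular_variance`, as illustrative rather than as the authors' implementation.

```python
import cmath
import math


def utf8_nibbles(word: str) -> list[int]:
    """Split every UTF-8 byte of `word` into its high and low 4-bit halves."""
    nibbles = []
    for byte in word.encode("utf-8"):
        nibbles.append(byte >> 4)    # high nibble
        nibbles.append(byte & 0x0F)  # low nibble
    return nibbles


def orthographic_mass(word: str) -> float:
    """Mean resultant length of the word's nibble angles.

    ASSUMPTION (not stated in the abstract): nibble value k is mapped
    to the angle 2*pi*k/16.
    """
    nibbles = utf8_nibbles(word)
    if not nibbles:
        raise ValueError("mass is undefined for an empty word")
    # Sum one unit vector per nibble, then normalise by the count.
    resultant = sum(cmath.exp(2j * math.pi * k / 16) for k in nibbles)
    return abs(resultant) / len(nibbles)


def circular_variance(word: str) -> float:
    """Complement of the mean resultant length."""
    return 1.0 - orthographic_mass(word)
```

Under these assumptions the value always lies in [0, 1]. A word whose nibbles all coincide, such as "w" (byte 0x77), has mass 1 and circular variance 0.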
Computed over 833,116 lemmas across fourteen languages and five writing systems, orthographic mass separates alphabetic scripts (high concentration) from logographic and abugida scripts (low concentration). Because mass is exactly R̄, the relationship between script type and mass is a consequence of how each writing system populates the encoding, not a coincidence.
Reading mass as the first in a sequence of circular moments yields a typological fingerprint. The second moment resolves scripts that the first collapses: most strikingly Arabic, which carries alphabetic-level mass yet a uniquely low second moment. Hierarchical clustering on the moment vector recovers the writing systems, and genealogical structure emerges within the Latin script as a correlate of shared orthographic statistics.
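The higher moments can be sketched in the same spirit. The standard p-th circular moment multiplies each angle by p before averaging, and its magnitude is the p-th resultant length. The code below assumes the same hypothetical nibble-to-angle mapping (value k at angle 2πk/16). The names `moment_length` and `moment_vector` and the default choice of orders (1, 2) are illustrative assumptions, not details taken from the paper.

```python
import cmath
import math


def utf8_nibbles(word: str) -> list[int]:
    """Split every UTF-8 byte of `word` into its high and low 4-bit halves."""
    nibbles = []
    for byte in word.encode("utf-8"):
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    return nibbles


def moment_length(nibbles: list[int], p: int) -> float:
    """Magnitude of the p-th circular moment of a list of nibble values.

    ASSUMPTION: nibble value k corresponds to the angle 2*pi*k/16.
    """
    if not nibbles:
        raise ValueError("moments are undefined for an empty sequence")
    # Multiplying the angle by p gives the p-th trigonometric moment.
    total = sum(cmath.exp(2j * math.pi * p * k / 16) for k in nibbles)
    return abs(total) / len(nibbles)


def moment_vector(word: str, orders: tuple[int, ...] = (1, 2)) -> tuple[float, ...]:
    """Resultant lengths of the requested orders for one written form."""
    nibbles = utf8_nibbles(word)
    return tuple(moment_length(nibbles, p) for p in orders)
```

A toy case shows why the second moment can carry information the first does not: two antipodal nibbles, 0 and 8, cancel in the first moment but align after their angles are doubled, so `moment_length([0, 8], 1)` is numerically zero while `moment_length([0, 8], 2)` is 1.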
contents
- 1. Introduction
- 2. The Construction
- 3. What the Quantity Is
- 4. Data and Reproducibility
- 5. What the Instrument Sees
- Mass distributions by writing system
- Why concentration varies: the angular footprint
- The complexity relationship as a corollary
- What mass alone misses: the second moment
- Typological clustering
- 11. Scope: What This Is, and Is Not
- 12. Generalization and Future Work
- 13. Conclusion
cite
@article{rouyea2026orthographic,
title={Orthographic Mass: A Circular-Statistics Invariant of Written Forms Across Writing Systems},
author={Rouyea, Blaize and Bourgeois, Corey},
year={2026},
note={Draft}
}