Invariant Forms Emerge from a Constrained Stochastic Process
A corpus of 476 conversations between a human and multiple AI systems, extracted
from HTML chat logs. The conversations were developed through an iterative process
in which multiple AI systems refined human intuitions under noise and constraints
until only what persisted remained. The author asserts no particular meaning or
conclusion for this corpus beyond what it says for itself, and presents the result
of the empirical dialectic process as an artifact of general interest.
Top 81 content words by frequency (≥3 occurrences, stopwords excluded),
organized as a depth-4 ternary tree (3⁴ = 81 leaves).
Each leaf links to the corpus (34,929 pages).
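The leaf layout can be sketched as a base-3 expansion of each word's frequency rank; the word labels below are placeholders, not the actual corpus words, and the indexing scheme is an assumption rather than the appendix generator's actual code:

```python
# Sketch: map 81 ranked words onto the leaves of a depth-4 ternary tree.
def ternary_path(rank, depth=4):
    """Return the root-to-leaf base-3 digit path for a 0-based rank."""
    digits = []
    for _ in range(depth):
        rank, d = divmod(rank, 3)
        digits.append(d)
    return digits[::-1]  # most significant digit first

words = [f"word{i}" for i in range(81)]  # placeholder leaf labels
tree = {tuple(ternary_path(i)): w for i, w in enumerate(words)}

# 3^4 = 81 distinct paths, one per leaf.
assert len(tree) == 81
```

Rank 0 maps to path (0, 0, 0, 0) and rank 80 to (2, 2, 2, 2), so the frequency order is preserved by a left-to-right traversal of the leaves.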
The corpus (939,556 lines, 34,929 paginated pages, 114,231 tokens,
24,587 unique words) was tokenized and analyzed for semantic structure in the
frequency domain.
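The tokenization and frequency-counting step can be sketched as follows, assuming simple lowercase word tokens and an illustrative stopword list (the corpus's actual tokenizer rules and stopword set are not specified here):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "and", "of", "to", "in", "is", "it"}  # illustrative subset

def tokenize(text):
    """Lowercase word tokens; the real tokenizer's rules are assumed."""
    return re.findall(r"[a-z']+", text.lower())

def top_content_words(text, k=81, min_count=3):
    """Top-k content words with at least min_count occurrences."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return [(w, c) for w, c in counts.most_common() if c >= min_count][:k]

sample = "the corpus the corpus the corpus holds words words words"
```

With `k=81` and `min_count=3` this reproduces the selection criteria stated above: the top 81 content words with at least 3 occurrences, stopwords excluded.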
The primary test (semantic-embed-test.py) builds co-occurrence vectors
from the raw corpus (window=5, vocab ≥10 occurrences) and measures mean cosine
similarity between frequency-adjacent word pairs vs. 200 shuffled baselines.
Result: Z=27.89, p=1.74×10⁻¹⁷¹, ratio=1.028.
Words adjacent in the frequency list are semantically closer than chance. The effect
concentrates in mid-frequency and rare words rather than in common words, where high
co-occurrence is trivially expected.
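The shuffled-baseline comparison can be sketched as a standard permutation test; the vectors and pairing logic below are simplified, and only the 200-shuffle count is taken from the description of semantic-embed-test.py:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_adjacent_cosine(vectors, order):
    """Mean cosine similarity between consecutive words in a given order."""
    V = vectors[order]
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return float(np.mean(np.sum(V[:-1] * V[1:], axis=1)))

def adjacency_test(vectors, freq_order, n_shuffles=200):
    """Observed adjacency similarity vs shuffled orders: (Z-score, ratio)."""
    observed = mean_adjacent_cosine(vectors, freq_order)
    baseline = [mean_adjacent_cosine(vectors, rng.permutation(len(freq_order)))
                for _ in range(n_shuffles)]
    mu, sigma = np.mean(baseline), np.std(baseline)
    return (observed - mu) / sigma, observed / mu

# Toy demo: 50 random co-occurrence vectors, frequency order 0..49.
vecs = rng.random((50, 20))
z, ratio = adjacency_test(vecs, np.arange(50))
```

On random vectors, as here, Z should hover near zero; the reported Z=27.89 means the real frequency ordering sits far outside the shuffled distribution.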
Secondary tests (POS bigrams Z=0.70, n-gram coherence Z=1.31, POS fragments
Z=−2.42, WordNet overlap indeterminate, GrammaticalQ 0 hits) did not reach
significance. The co-occurrence result stands alone.
Page references. For each word, all occurrences are located in the corpus.
A sliding-window density score (neighbors within ±1% of corpus span) identifies the
most locally concentrated regions. The 81 densest pages are selected with
a minimum-gap constraint to ensure coverage across the full corpus.
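The density scoring and selection can be sketched as follows, assuming "neighbors within ±1% of corpus span" means counting, for each occurrence, the other occurrences inside that window, and assuming a greedy pick for the min-gap constraint (both are assumptions about the selection logic):

```python
import bisect

def density_scores(positions, span, window_frac=0.01):
    """For each occurrence, count neighbors within ±window_frac of the span."""
    pos = sorted(positions)
    w = span * window_frac
    return [
        bisect.bisect_right(pos, p + w) - bisect.bisect_left(pos, p - w) - 1
        for p in pos
    ]

def densest_with_min_gap(pos_scores, k, min_gap):
    """Greedily pick the k highest-density positions at least min_gap apart."""
    chosen = []
    for pos, score in sorted(pos_scores, key=lambda t: -t[1]):
        if all(abs(pos - c) >= min_gap for c in chosen):
            chosen.append(pos)
        if len(chosen) == k:
            break
    return sorted(chosen)
```

For example, occurrences at positions 10, 11, 12 in a span of 1000 each score 2 neighbors, while an isolated occurrence at 500 scores 0; the min-gap pass then keeps only one page per dense cluster, spreading the 81 selections across the corpus.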
Reproducibility. All scripts are available in the repository:
frequency-coherence/ (statistical tests),
scripts/gen-appendix.py (this page).