Corpus Analysis Results
Coherence
Coherence Test: Frequency-Adjacent Grammatical Bigrams
======================================================
Words: 24587
POS tagged: 23466
Bigram pairs: 24586
Real sequence valid bigrams: 17917 (72.87%)
Shuffled mean (n=1000): 17883.9 (72.74%)
Shuffled StdDev: 47.1191
Z-score: 0.703219
p-value: 2.4096 × 10^-1
Real/Expected ratio: 1.00185
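The z-score and p-value above can be checked from the reported summary figures. The sketch below assumes the p-value is a one-tailed upper probability under a normal approximation; the function name is illustrative and not taken from the original analysis.

```python
import math

def z_and_upper_p(real, shuffled_mean, shuffled_sd):
    """Return (z, one-tailed upper p) for a real count against a shuffled baseline."""
    z = (real - shuffled_mean) / shuffled_sd
    # P(Z >= z) for a standard normal, via the complementary error function
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p

# Figures copied from the report: real count, shuffled mean, shuffled SD
z, p = z_and_upper_p(17917, 17883.9, 47.1191)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Because the shuffled mean is rounded to one decimal in the report, the recomputed values agree with the printed ones only to about two or three significant figures.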
Ngram Coherence
N-gram Coherence Test (WordNet Broader Terms)
=============================================
Method: Count adjacent word pairs sharing a WordNet hypernym
Windows: 50 x 20 words (19 adjacent pairs each)
Total pairs tested: 950
Shuffled trials: 200
Real shared-category pairs: 1 (0.105263%)
Shuffled mean: 0.295 (0.0310526%)
Shuffled SD: 0.53798
Z-score: 1.31046
Interpretation: Z > 2 = significant semantic clustering beyond chance
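A minimal sketch of a windowed shuffle test of this kind follows. It assumes windows are drawn at random positions and that each window is shuffled internally; the report does not say how the original test sampled or shuffled. The `related` predicate is a hypothetical stand-in for the WordNet hypernym check.

```python
import random
import statistics

def count_pairs(tokens, related):
    """Count adjacent token pairs for which related(a, b) is true."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if related(a, b))

def windowed_shuffle_test(tokens, related, n_windows=50, size=20, trials=200, seed=0):
    rng = random.Random(seed)
    starts = [rng.randrange(len(tokens) - size + 1) for _ in range(n_windows)]
    windows = [tokens[s:s + size] for s in starts]
    real = sum(count_pairs(w, related) for w in windows)
    shuffled_totals = []
    for _ in range(trials):
        total = 0
        for w in windows:
            w2 = w[:]          # copy so the real order is preserved
            rng.shuffle(w2)    # assumption: shuffle within each window
            total += count_pairs(w2, related)
        shuffled_totals.append(total)
    mean = statistics.fmean(shuffled_totals)
    sd = statistics.pstdev(shuffled_totals)
    z = (real - mean) / sd if sd > 0 else None  # undefined when SD is 0
    return real, mean, sd, z, n_windows * (size - 1)

# Toy data: blocks of 30 tokens sharing a first letter; "related" = same letter
tokens = [f"{c}{i}" for c in "abcdefghij" for i in range(30)]
same_letter = lambda a, b: a[0] == b[0]
real, mean, sd, z, pairs = windowed_shuffle_test(tokens, same_letter)
print(pairs)  # 50 windows x 19 adjacent pairs = 950
```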
POS Sequence
POS Fragment Sequence Test
=========================
Method: Count POS pattern matches (3-5 word grammatical fragments)
Patterns tested: 27
Shuffled trials: 500
Real fragment count: 3511
Shuffled mean: 3675.96
Shuffled SD: 68.2763
Z-score: -2.41601
p-value: 9.92154 × 10^-1
Ratio: 0.955126
Interpretation:
Z > 2: significant (p < 0.025)
Z > 3: highly significant (p < 0.0014)
Ratio > 1: more grammatical fragments than chance
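Counting POS fragments can be sketched as a sliding match of tag sequences over the tagged text. The tag names and the four patterns below are hypothetical; the 27 patterns used in the test are not listed in this report.

```python
def count_fragments(tags, patterns):
    """Count every position where any pattern matches the tag sequence exactly."""
    total = 0
    for pattern in patterns:
        n = len(pattern)
        total += sum(
            1 for i in range(len(tags) - n + 1)
            if tuple(tags[i:i + n]) == pattern
        )
    return total

# Hypothetical 3- and 4-tag patterns, for illustration only
PATTERNS = [
    ("DET", "ADJ", "NOUN"),
    ("DET", "NOUN", "VERB"),
    ("PREP", "DET", "NOUN"),
    ("DET", "ADJ", "NOUN", "VERB"),
]

tags = ["DET", "ADJ", "NOUN", "VERB", "PREP", "DET", "NOUN"]
fragment_count = count_fragments(tags, PATTERNS)
print(fragment_count)  # 3
```

In this sketch overlapping matches are each counted, so a 4-tag match also contributes any 3-tag match it contains. Whether the original test counts overlaps the same way is not stated.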
Semantic Coherence
Semantic Coherence Test
======================
Method: WordData RelatedWords overlap between adjacent pairs
Windows: 39 x 30 words
Shuffled trials: 100
Real mean relatedness score: 0
Shuffled mean: 0
Shuffled SD: 0
Z-score: Indeterminate
Interpretation: Z > 2 suggests significant semantic clustering
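One way to score related-word overlap between adjacent words is Jaccard similarity of their related-word sets. This is an assumed scoring rule, not necessarily the one used here, and `related_words` is a hypothetical lookup table standing in for WordData RelatedWords. The sketch also shows that when every lookup is empty, every score is 0, the shuffled SD is 0, and the z-score is undefined.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_adjacent_relatedness(tokens, related_words):
    """Mean Jaccard overlap of related-word sets over adjacent token pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    scores = [
        jaccard(related_words.get(x, set()), related_words.get(y, set()))
        for x, y in pairs
    ]
    return sum(scores) / len(scores)

# Hypothetical lookup table, for illustration only
lookup = {"cat": {"feline", "pet"}, "dog": {"canine", "pet"}}
with_data = mean_adjacent_relatedness(["cat", "dog", "rock"], lookup)
no_data = mean_adjacent_relatedness(["cat", "dog", "rock"], {})
print(with_data, no_data)
```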
Semantic Embed
Semantic Co-occurrence Coherence Test
==================================================
Method: Cosine similarity of co-occurrence vectors (window=5)
Corpus: /home/claude/Documents/rawcorpus.txt
Vocab: 12041 words (min freq 10)
Pairs: 12040
Shuffled trials: 200
Real adjacent mean similarity: 0.575121
Shuffled mean: 0.559455
Shuffled SD: 0.000562
Z-score: 27.8905
p-value: 1.74e-171
Ratio: 1.0280
By frequency band:
  Band                    Real       Shuffled   Z
  Top 1000 (common)       0.809541   0.809648   -0.08
  1000-5000 (mid)         0.703791   0.702098    2.21
  5000-10000 (uncommon)   0.512852   0.504163   10.25
  10000+ (rare)           0.360614   0.351759    5.63
Interpretation:
Z > 2: significant semantic clustering (p < 0.025)
Z > 3: highly significant (p < 0.0014)
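The adjacent-similarity measure can be sketched with plain co-occurrence counts. The sketch assumes a symmetric window (up to 5 words on each side) and raw counts without weighting; the report does not state either choice, and all function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, window=5):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_adjacent_similarity(tokens, vectors):
    """Mean cosine similarity between each word and the word that follows it."""
    sims = [cosine(vectors[a], vectors[b]) for a, b in zip(tokens, tokens[1:])]
    return sum(sims) / len(sims) if sims else 0.0

toy = "the cat sat on the mat and the dog sat on the rug".split()
vecs = cooccurrence_vectors(toy, window=2)
toy_mean = mean_adjacent_similarity(toy, vecs)
print(round(toy_mean, 3))
```

Running the same function on shuffled copies of the token list gives the baseline distribution from which the shuffled mean and SD are taken.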
Sentence Fragment
Sentence Fragment Coherence Test
================================
Method: GrammaticalQ on consecutive word windows
Sample positions: 500 (of 24587)
Shuffled trials: 50
TRIGRAMS (3-word fragments):
Real hits: 0 / 500 (0%)
Shuffled mean: 0 (0%)
Shuffled SD: 0
Z-score: N/A
Interpretation: Z > 3 = highly significant grammatical clustering