Corpus Analysis Results

Coherence

Coherence Test: Frequency-Adjacent Grammatical Bigrams
======================================================

Words: 24587
POS tagged: 23466
Bigram pairs: 24586

Real sequence valid bigrams: 17917 (72.87%)
Shuffled mean (n=1000): 17883.9 (72.74%)
Shuffled StdDev: 47.1191
Z-score: 0.703219
p-value: 0.24096
Real/Expected ratio: 1.00185
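
The summary figures above can be re-derived from the real count, the shuffled mean, and the shuffled standard deviation. The sketch below assumes a normal approximation and a one-sided upper-tail p-value, which is what the reported p-value is consistent with. The function name `z_and_p` is illustrative and not part of the original analysis. Because the shuffled mean is printed rounded, the recomputed Z differs from the reported one in the third decimal place.

```python
import math

def z_and_p(real, shuffled_mean, shuffled_sd):
    """Return (z, one-sided upper-tail p, real/expected ratio)."""
    z = (real - shuffled_mean) / shuffled_sd
    # P(Z >= z) under a standard normal null distribution
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_upper, real / shuffled_mean

# Values copied from the bigram test above
z, p, ratio = z_and_p(17917, 17883.9, 47.1191)
print(f"z={z:.3f}  p={p:.3f}  ratio={ratio:.5f}")
```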

Ngram Coherence

N-gram Coherence Test (WordNet Broader Terms)
=============================================

Method: Count adjacent word pairs sharing a WordNet hypernym
Windows: 50 x 20 words
Total pairs tested: 950
Shuffled trials: 200

Real shared-category pairs: 1 (0.105263%)
Shuffled mean: 0.295 (0.0310526%)
Shuffled SD: 0.53798
Z-score: 1.31046

Interpretation: Z > 2 = significant semantic clustering beyond chance
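
A minimal sketch of the shuffle baseline behind this kind of test, under stated assumptions: windows are taken as contiguous, non-overlapping slices from the start of the text (the report does not say how they were sampled), and `related` is a hypothetical predicate standing in for the WordNet hypernym lookup. With 50 windows of 20 words, each window contributes 19 adjacent pairs, which gives the 950 pairs reported.

```python
import random
import statistics

def count_pairs(words, related, n_windows=50, window_len=20):
    """Count adjacent pairs inside each window for which related(a, b) is true."""
    hits = total = 0
    for w in range(n_windows):
        window = words[w * window_len:(w + 1) * window_len]
        for a, b in zip(window, window[1:]):
            total += 1
            hits += bool(related(a, b))
    return hits, total

def shuffle_test(words, related, trials=200, seed=0):
    """Compare the real hit count with counts from randomly permuted word order."""
    rng = random.Random(seed)
    real, total = count_pairs(words, related)
    shuffled = []
    for _ in range(trials):
        perm = words[:]
        rng.shuffle(perm)
        shuffled.append(count_pairs(perm, related)[0])
    mean = statistics.mean(shuffled)
    sd = statistics.stdev(shuffled)
    z = (real - mean) / sd if sd > 0 else None  # undefined when nothing varies
    return real, total, mean, sd, z
```

With only one real hit in 950 pairs, counts this sparse make the normal approximation behind the Z-score rough at best.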

Pos Sequence

POS Fragment Sequence Test
=========================

Method: Count POS pattern matches (3-5 word grammatical fragments)
Patterns tested: 27
Shuffled trials: 500

Real fragment count: 3511
Shuffled mean: 3675.96
Shuffled SD: 68.2763
Z-score: -2.41601
p-value: 0.992154
Ratio: 0.955126

Interpretation:
  Z > 2: significant (one-sided p < 0.025)
  Z > 3: highly significant (one-sided p < 0.0014)
  Ratio > 1: more grammatical fragments than chance
  Ratio < 1: fewer grammatical fragments than chance
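
A minimal sketch of how POS fragments might be counted. The 27 patterns used in the report are not listed, so `PATTERNS` below is a hypothetical stand-in, and the tag names are assumed to be a universal-style tag set. A position that matches several patterns is counted once per pattern, which is also an assumption.

```python
def count_fragments(tags, patterns):
    """Count (position, pattern) matches of 3-5 tag patterns in a tag sequence."""
    pattern_set = {tuple(p) for p in patterns}
    lengths = sorted({len(p) for p in pattern_set})
    count = 0
    for i in range(len(tags)):
        for n in lengths:
            # Skip slices that would run past the end of the sequence
            if i + n <= len(tags) and tuple(tags[i:i + n]) in pattern_set:
                count += 1
    return count

# Hypothetical patterns, for illustration only
PATTERNS = [
    ("DET", "ADJ", "NOUN"),
    ("DET", "NOUN", "VERB"),
    ("ADP", "DET", "NOUN"),
    ("DET", "ADJ", "NOUN", "VERB"),
]

tags = ["DET", "ADJ", "NOUN", "VERB", "ADP", "DET", "NOUN"]
print(count_fragments(tags, PATTERNS))  # prints 3
```

The shuffled baseline would apply the same counter to randomly permuted copies of the tag sequence and compare the real count against the resulting mean and SD.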

Semantic Coherence

Semantic Coherence Test
======================

Method: WordData RelatedWords overlap between adjacent pairs
Windows: 39 x 30 words
Shuffled trials: 100

Real mean relatedness score: 0
Shuffled mean: 0
Shuffled SD: 0
Z-score: Indeterminate
Interpretation: Z > 2 suggests significant semantic clustering
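
A minimal sketch of a related-word overlap score and of why the Z-score is indeterminate here. Jaccard overlap is an assumed scoring rule (the report does not state the exact formula), and `related_words` is a hypothetical dictionary standing in for the WordData RelatedWords lookup.

```python
def relatedness(a, b, related_words):
    """Jaccard overlap of the related-word sets of a and b (0.0 if either is unknown)."""
    ra = related_words.get(a, set())
    rb = related_words.get(b, set())
    union = ra | rb
    return len(ra & rb) / len(union) if union else 0.0

def mean_adjacent_relatedness(words, related_words):
    """Average relatedness over all adjacent word pairs."""
    pairs = list(zip(words, words[1:]))
    return sum(relatedness(a, b, related_words) for a, b in pairs) / len(pairs)

def z_score(real, mean, sd):
    """Return None rather than dividing by zero when the shuffled SD is 0."""
    return None if sd == 0 else (real - mean) / sd

# Toy lookup table, for illustration only
RELATED = {"cat": {"feline", "pet"}, "dog": {"canine", "pet"}}
print(mean_adjacent_relatedness(["cat", "dog", "xyz"], RELATED))
print(z_score(0.0, 0.0, 0.0))  # prints None
```

When every score is zero in both the real and shuffled sequences, there is no variance to normalize by, so this test cannot distinguish real word order from shuffled order.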

Semantic Embed

Semantic Co-occurrence Coherence Test
==================================================

Method: Cosine similarity of co-occurrence vectors (window=5)
Corpus: /home/claude/Documents/rawcorpus.txt
Vocab: 12041 words (min freq 10)
Pairs: 12040
Shuffled trials: 200

Real adjacent mean similarity:  0.575121
Shuffled mean:                  0.559455
Shuffled SD:                    0.000562
Z-score:                        27.8905
p-value:                        1.74e-171
Ratio:                          1.0280

By frequency band:
  Top 1000 (common): real=0.809541 shuf=0.809648 z=-0.08
  1000-5000 (mid): real=0.703791 shuf=0.702098 z=2.21
  5000-10000 (uncommon): real=0.512852 shuf=0.504163 z=10.25
  10000+ (rare): real=0.360614 shuf=0.351759 z=5.63

Interpretation:
  Z > 2: significant semantic clustering (one-sided p < 0.025)
  Z > 3: highly significant (one-sided p < 0.0014)
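
A minimal sketch of the co-occurrence similarity measure, assuming a symmetric window of 5 tokens on each side and raw counts with no weighting; the report gives only "window=5", so both choices are assumptions. The function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, window=5):
    """Map each word to a Counter of the words seen within +/- window positions."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_adjacent_similarity(sequence, vectors):
    """Average cosine similarity between each word and its successor."""
    sims = [cosine(vectors[a], vectors[b]) for a, b in zip(sequence, sequence[1:])]
    return sum(sims) / len(sims)
```

The real score comes from the sequence in its original order; the shuffled baseline reuses the same vectors with the sequence randomly permuted.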

Sentence Fragment

Sentence Fragment Coherence Test
================================

Method: GrammaticalQ on consecutive word windows
Sample positions: 500 (of 24587)
Shuffled trials: 50

TRIGRAMS (3-word fragments):
  Real hits: 0 / 500 (0%)
  Shuffled mean: 0 (0%)
  Shuffled SD: 0
  Z-score: N/A

Interpretation: Z > 3 = highly significant grammatical clustering