This book introduces the mathematical foundations of statistical modeling of natural language. The author seeks to explain several statistical power laws satisfied by texts in natural language in terms of discrete stochastic processes that are neither Markov nor hidden Markov and that exhibit a form of long-range dependence. To this end, he uses concepts and technical tools from information theory and the theory of probability measures. The first half of the book is an introduction to probability measures, information theory, ergodic decomposition, and Kolmogorov complexity, provided to make the book relatively self-contained; it also covers less standard concepts and results, such as excess entropy and the generalization of conditional mutual information to sigma-fields. The second half presents the results concerning power laws for mutual information and maximal repetition, such as the theorems about facts and words. A separate chapter discusses toy examples of stochastic processes, which should inspire future work in statistical language modeling.
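For orientation only (this illustration is not quoted from the book), the best-known power laws of this kind are Zipf's law and Herdan's law, treated in Section 1.3 below. In standard formulations, Zipf's law states that the frequency of the r-th most frequent word decays roughly as a power of its rank, and Herdan's (Heaps') law states that the vocabulary of a text grows sublinearly with its length:

\[ f(r) \propto r^{-\alpha}, \quad \alpha \approx 1 \qquad \text{(Zipf's law: frequency of the } r\text{-th most frequent word)} \]
\[ V(n) \propto n^{\beta}, \quad 0 < \beta < 1 \qquad \text{(Herdan's law: number of distinct words in a text of length } n\text{)} \]

The precise versions and their connections to mutual information and maximal repetition are developed in the book itself.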
Dimensions
Height: 231 mm
Width: 152 mm
Thickness: 23 mm
ISBN-13
978-1-119-62527-8 (9781119625278)
Preface xi
Acknowledgments xiv
Basic notations 1
1 Guiding ideas 3
1.1 The motivating question 3
1.2 Further questions about texts 7
1.3 Zipf's and Herdan's laws 10
1.4 Markov and finite-state processes 17
1.5 More general stochastic processes 23
1.6 Two interpretations of probability 26
1.7 Insights from information theory 28
1.8 Estimation of entropy rate 31
1.9 Entropy of natural language 34
1.10 Algorithmic information theory 38
1.11 Descriptions of a random world 41
1.12 Facts and words related 47
1.13 Repetitions and entropies 52
1.14 Decay of correlations 57
1.15 Recapitulation 59
2 Probabilistic preliminaries 63
2.1 Probability measures 65
2.2 Product measurable spaces 68
2.3 Discrete random variables 71
2.4 From IID to finite-state processes 74
Problems 79
3 Probabilistic toolbox 81
3.1 Borel σ-fields and a fair coin 83
3.2 Integral and expectation 87
3.3 Inequalities and corollaries 91
3.4 Semi-distributions 96
3.5 Conditional probability 99
3.6 Modes of convergence 106
3.7 Complete spaces 107
Problems 110
4 Ergodic properties 113
4.1 Plain relative frequency 115
4.2 Birkhoff ergodic theorem 120
4.3 Ergodic and mixing criteria 124
4.4 Ergodic decomposition 129
Problems 132
5 Entropy and information 135
5.1 Shannon measures for partitions 137
5.2 Block entropy and its limits 143
5.3 Shannon measures for fields 150
5.4 Block entropy limits revisited 160
5.5 Convergence of entropy 164
5.6 Entropy as self-information 166
Problems 169
6 Equipartition and universality 173
6.1 SMB theorem 175
6.2 Universal semi-distributions 177
6.3 PPM probability 178
6.4 SMB theorem revisited 184
6.5 PPM-based statistics 186
Problems 193
7 Coding and computation 195
7.1 Elements of coding 197
7.2 Kolmogorov complexity 203
7.3 Algorithmic coding theorems 213
7.4 Limits of mathematics 222
7.5 Algorithmic randomness 226
Problems 232
8 Power laws for information 237
8.1 Hilberg exponents 239
8.2 Second order SMB theorem 246
8.3 Probabilistic and algorithmic facts 250
8.4 Theorems about facts and words 256
Problems 263
9 Power laws for repetitions 267
9.1 Rényi-Arimoto entropies 269
9.2 Generalized entropy rates 274
9.3 Recurrence times 277
9.4 Subword complexity 281
9.5 Two maximal lengths 288
9.6 Logarithmic power laws 293
Problems 299
10 AMS processes 301
10.1 AMS and pseudo-AMS measures 302
10.2 Quasi-periodic coding 305
10.3 Synchronizable coding 308
10.4 Entropy rate in the AMS case 310
Problems 314
11 Toy examples 317
11.1 Finite and ultra-finite energy 319
11.2 Santa Fe processes and alike 325
11.3 Encoding into a finite alphabet 334
11.4 Random hierarchical association 346
11.5 Towards better models 357
Problems 360
Future research 361
Index 363
Bibliography 366