
Statistical Implications of Turing's Formula


Content
Preface
This book introduces readers to Turing's formula and then re-examines several core statistical issues of modern data science from Turing's perspective. Turing's formula was a remarkable invention of Alan Turing during World War II in an early attempt to break the German Enigma ciphers. The formula looks at the world of randomness through a unique and powerful binary perspective, one that is unmistakably Turing's. However, Turing's formula was not well understood for many years. Research amassed during the last decade has brought to light profound and new statistical implications of the formula that were previously not known. Recently, and only recently, a relatively clear and systematic description of Turing's formula, with its statistical properties and implications, has become possible. Hence this book.
Turing's formula is often perceived as having a mystical quality. I was awestruck when I first learned of the formula 10 years ago. Its anti-intuitive implication was simply beyond my immediate grasp. However, I was not alone in this regard. After turning it over in my mind for a while, I mentioned to two of my colleagues, both seasoned mathematicians, that there might be a way to give a nonparametric characterization to the tail probability of a random variable beyond the data range. To that, their immediate reaction was, "tell us more when you have figured it out." Some years later, a former doctoral student of mine said to me, "I used to refuse to think about anti-intuitive mathematical statements, but after Turing's formula, I would think about a statement at least twice however anti-intuitive it may sound." Still another colleague of mine recently said to me, "I read everything you wrote on the subject, including details of the proofs. But I still cannot see intuitively why the formula works." To that, I responded with the following two points:
- Our intuition is a bounded mental box within which we conduct intellectual exercises with relative ease and comfort, but we must admit that this box also reflects the limitations of our experience, knowledge, and ability to reason.
- If a fact known to be true does not fit into one's current box of intuition, is it not time to expand the boundary of the box to accommodate the true fact?
My personal journey in learning about Turing's formula has proved to be a rewarding one. The experience of observing Turing's formula totally outside of my box of intuition initially and then having it gradually snuggled well within the boundary of my new box of intuition is one I wish to share.
Turing's formula itself, while extraordinary in many ways, is not the only reason for this book. Statistical science, since R.A. Fisher, has come a long way and continues to evolve. In fact, the frontier of Statistics has largely moved on to the realm of nonparametrics. The last few decades have witnessed great advances in the theory and practice of nonparametric statistics. However, in this realm, a seemingly impenetrable wall exists: how could one possibly make inference about the tail of a distribution beyond the data range? In front of this wall, many, if not most, are discouraged by their intuition from exploring further. Yet it is often said in Statistics that "it is all in the tail." Statistics needs a trail to get to the other side of the wall. Turing's formula blazes a trail, and this book attempts to mark that trail.
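For readers who have not yet met the formula, a minimal sketch may help make the idea of inference "beyond the data range" concrete. The preface does not state the formula, so the form used here is an assumption: the commonly stated version, in which the total probability of all letters not seen in a sample of size n is estimated by N1/n, where N1 is the number of letters observed exactly once. The function name `turing_missing_mass` is hypothetical and chosen only for illustration.

```python
from collections import Counter


def turing_missing_mass(sample):
    """Estimate the total probability of letters NOT seen in `sample`.

    Assumed form of Turing's formula: N1 / n, where
      n  = sample size,
      N1 = number of distinct letters observed exactly once.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    # Count the singletons: letters that appear exactly once.
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / n


# Example: the letters of "abracadabra" treated as a sample of size 11.
# 'c' and 'd' each appear once, so N1 = 2 and the estimate is 2/11.
estimate = turing_missing_mass(list("abracadabra"))
print(estimate)
```

Note that the estimate uses only the singletons in the sample and requires no ordering of the letters, which is why it applies to alphabets as well as to the real line.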
Turing's formula is relevant to many key issues in modern data sciences, for example, Big Data. Big Data, though as yet not a field of study with a clearly defined boundary, unambiguously points to a data space that is a quantum leap away from what is imaginable in the realm of classical statistics in terms of data volume, data structure, and data complexity. Big Data, however defined, issues fundamental challenges to Statistics. To begin, the task of retrieving and analyzing data in a vastly complex data space must be in large part delegated to a machine (or software), hence the term Machine Learning. How does a machine learn and make judgments? At the very core, it all boils down to a general measure of association between two observable random elements (not necessarily random variables). At least two fundamental issues immediately present themselves:
- High Dimensionality. The complexity of the data space suggests that a data observation can only be appropriately registered in a very high-dimensional space, so much so that the dimensionality could be essentially infinite. Quickly, the usual statistical methodologies run into fundamental conceptual problems.
- Discrete and Non-ordinal Nature. The generality of the data space suggests that possible data values may not have a natural order among themselves: different gene types in the human genome, different words in text, and different species in an ecological population are all examples of general data spaces without a natural "neighborhood" concept.
Such issues would force a fundamental transition from the platform of random variables (on the real line) to the platform of random elements (on a general set or an alphabet). On such an alphabet, many familiar and fundamental concepts of Statistics and Probability no longer exist, for example, moments, correlation, tail, and so on. It would seem that Statistics is in need of a rebirth to tackle these issues.
The rebirth has been taking place in Information Theory. Its founding father, Claude Shannon, defined two conceptual building blocks, entropy (in place of moments) and mutual information (in place of correlation), in his landmark paper (Shannon, 1948). Just as moments and the coefficient of correlation must be estimated for random variables, entropy and mutual information must be estimated for random elements in practice. However, estimation of entropy and estimation of mutual information are technically difficult problems due to the curse of "High Dimensionality" and "Discrete and Non-ordinal Nature." For about 50 years after Shannon (1948), advances in this arena were slow to come. In recent years, however, research interest, propelled by the rapidly increasing level of data complexity, has been reinvigorated and, at the same time, has splintered into many different perspectives. One in particular is Turing's perspective, which has brought about significant and qualitative improvement to these difficult problems. This book presents an overview of the key results and updates the frontier in this research space.
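To make the two building blocks concrete, the following is a minimal sketch of the simplest estimators one might try first: the plug-in estimators, which substitute observed relative frequencies for the unknown probabilities. This is not the estimator developed from Turing's perspective in the book; it is the naive baseline, and the plug-in entropy is known to underestimate the true entropy in small samples. The function names are hypothetical, and natural logarithms are assumed.

```python
import math
from collections import Counter


def plugin_entropy(sample):
    """Plug-in estimate of Shannon entropy (in nats) from observed letters."""
    n = len(sample)
    counts = Counter(sample)
    # Replace each unknown probability p_k with the relative frequency c / n.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def plugin_mutual_information(pairs):
    """Plug-in estimate of mutual information via H(X) + H(Y) - H(X, Y)."""
    pairs = list(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return plugin_entropy(xs) + plugin_entropy(ys) - plugin_entropy(pairs)


# Letters need no natural order: any hashable labels will do.
dependent = [("gene_A", "trait_1"), ("gene_B", "trait_2")] * 5
independent = [("gene_A", "trait_1"), ("gene_A", "trait_2"),
               ("gene_B", "trait_1"), ("gene_B", "trait_2")]

print(plugin_mutual_information(dependent))    # close to log(2)
print(plugin_mutual_information(independent))  # close to 0
```

Both functions work on arbitrary labels rather than numbers, which reflects the shift from random variables to random elements on an alphabet described above.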
The powerful utility of Turing's perspective can also be seen in many other areas. One increasingly important modern concept is Diversity. The topics of what it is and how to estimate it are rapidly moving into rigorous mathematical treatment. Scientists have passionately argued about them for years but largely without consensus. Turing's perspective gives some very interesting answers to these questions. This book gives a unified discussion of diversity indices, hence making good reading for those who are interested in diversity indices and their estimation. The final two chapters of the book speak to the issues of tail classification and, if classified, how to perform a refined analysis for a parametric tail model via Turing's perspective. These issues are scientifically relevant in many fields of study.
I intend this book to serve two groups of readers:
- Textbook for graduate students. The material is suitable for a topic course at the graduate level for students in Mathematics, Probability, Statistics, Computer Science (Artificial Intelligence, Machine Learning, Big Data), and Information Theory.
- Reference book for researchers and practitioners. This book offers an informative presentation of many of the critical statistical issues of modern data science, together with updated results. Both researchers and practitioners will find this book a good learning resource and enjoy the many relevant methodologies and formulas given and explained under one cover.
For a better flow of the presentation, some of the lengthy but instructive proofs are placed at the end of each chapter.
The seven chapters of this book may be naturally organized into three groups. Group 1 includes Chapters 1 and 2. Chapter 1 gives an introduction to Turing's formula; and Chapter 2 translates Turing's formula into a particular perspective (referred to as Turing's perspective) as embodied in a class of indices (referred to as Generalized Simpson's Indices). Group 1 may be considered as the theoretical foundation of the whole book. Group 2 includes Chapters 3-5. Chapter 3 takes Turing's perspective into entropy estimation, Chapter 4 takes it into diversity estimation, and Chapter 5 takes it into estimation of various information indices. Group 2 may be thought of as consisting of applications of Turing's perspective. Chapters 6 and 7 make up Group 3. Chapter 6 discusses the notion of tail on alphabets and offers a classification of probability distributions. Chapter 7 offers an application of Turing's formula in estimating parametric tails of random variables. Group 3 may be considered as a pathway to further research.
The material in this book is relatively new. In writing the book, I have made an effort to let the book, as well as its chapters, be self-contained. On the one hand, I wanted the material of the book to flow in a linearly coherent manner for students learning it for the first time. In this regard, readers may experience a certain degree of...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.