Informatics provides new avenues of understanding and inquiry in any medium that can be captured in digital form. Areas as diverse as text analysis, signal analysis, and genome analysis, to name a few, can be studied with informatics tools. Computationally powered informatics tools are having a phenomenal impact in many fields, including engineering, nanotechnology, and the biological sciences (Figure 1.1).
In this text I provide a background on various methods from Informatics and Machine Learning (ML) that together comprise a "complete toolset" for doing data analytics work at all levels, from a first-year undergraduate introductory level to advanced topics in subsections suitable for graduate students seeking a deeper understanding (or a more detailed example). Numerous prior book, journal, and patent publications by the author are drawn upon extensively throughout the text [1-68]. Part of the objective of this book is to bring these examples together and demonstrate their combined use in typical signal processing situations. Numerous other journal and patent publications by the author [69-100] provide related material, but are not directly drawn upon in this text. The application domain is practically everything in the digital domain, as mentioned above, but in this text the focus will be on core methodologies with specific application in informatics, bioinformatics, and cheminformatics (nanopore detection, in particular). Other disciplines can also be analyzed with informatics tools. Basic questions about human origins (anthrogenomics) and behavior (econometrics) can likewise be explored with informatics-based pattern recognition methods, with a huge impact on new research directions in anthropology, sociology, political science, economics, and psychology. The complete toolset of statistical learning tools can be used in any of these domains.
In the chapter that follows, an overview is given of the various information processing stages discussed in the text, with highlights that explain the order and connectivity of the topics and motivate their more detailed presentation in what is to come.
Figure 1.1 A Penrose tiling: a non-repeating tiling built from two tile shapes, with fivefold local symmetry and golden-ratio proportions both locally and globally (emergent).
Knowledge construction using statistical and computational methods is at the heart of data science and informatics. Counts on data features (or events) are typically gathered as a starting point in many analyses [101, 102]. Computer hardware is very well suited to such counting tasks. Basic operating system commands and a popular scripting language (Python) will be taught so that these tasks can be done easily. Computer software methods will also be shown that allow easy implementation and understanding of basic statistical methods, whereby the counts, for example, can be used to determine event frequencies, from which statistical anomalies can subsequently be identified. The computational implementation of basic statistics methods then provides the framework for more sophisticated knowledge construction and discovery using information theory and basic ML methods. ML can be thought of as a specialized branch of statistics in which minimal assumptions are made about a statistical "model" based on prior human learning. This book shows how to use computational, statistical, and informatics/algorithmic methods to analyze any data that is captured in digital form, whether it be text, sequential data in general (such as experimental observations over time, or stock market/econometric histories), symbolic data (genomes), or image data. Along the way there will be a brief introduction to probability and statistics concepts (Chapter 2) and basic Python/Linux system programming methods (Chapter 2 and Appendix A).
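As a first taste of the counting-to-frequencies workflow just described, the following short Python sketch counts symbol occurrences in a toy sequence, converts the counts to frequencies, and flags symbols whose frequency deviates strongly from a uniform expectation. The sequence, the fold-change cutoff, and the function names are illustrative choices only, not taken from the text.

from collections import Counter

def symbol_frequencies(sequence):
    # Count each symbol, then normalize counts to frequencies.
    counts = Counter(sequence)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}

def flag_anomalies(freqs, fold=3.0):
    # Flag symbols whose frequency differs from the uniform expectation
    # (1 / alphabet size) by more than the given fold-change.
    expected = 1.0 / len(freqs)
    return {sym: f for sym, f in freqs.items()
            if f > fold * expected or f < expected / fold}

genome_fragment = "ACGTACGTAACCGGTTACGTAAAAAAAAAA"   # toy symbolic data
freqs = symbol_frequencies(genome_fragment)
print("frequencies:", freqs)
print("anomalous symbols:", flag_anomalies(freqs))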
It is common to need to acquire a signal whose properties are not known, or where the signal is only suspected and not yet discovered, or where the signal properties are known but are too much trouble to fully enumerate. There is no common solution, however, to the acquisition task. For this reason the initial phases of acquisition methods unavoidably tend to be ad hoc. As with data dependency in non-evolutionary search metaheuristics (where there is no optimal search method that is guaranteed to always work well), here there is no optimal signal acquisition method known in advance. In what follows, methods are described for bootstrap optimization in signal acquisition to enable the most general-use, almost "common," solution possible. The bootstrap algorithmic method involves repeated passes over the data sequence, with improved priors and trained filters, among other things, to achieve improved signal acquisition on subsequent passes. The signal acquisition is guided by statistical measures that recognize anomalies. Informatics methods and information theory measures are central to the design of a good finite state automaton (FSA) acquisition method, and will be reviewed in the signal acquisition context in Chapters 2-4. Code examples are given in Python and C (with introductory Python described in Chapter 2 and Appendix A). Bootstrap acquisition methods may not automatically provide a common solution, but they appear to offer a process whereby a solution can be improved to some desirable level of general-data applicability.
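The bootstrap acquisition idea can be illustrated with a minimal Python sketch, assuming the simplest possible "prior" for acquisition (an amplitude threshold) that is re-estimated from the events accepted on the previous pass. The thresholding rule and all names below are illustrative assumptions; the actual acquisition stages in the text (trained filters, FSA priors, etc.) are considerably more elaborate.

import statistics

def acquire(signal, threshold):
    # One acquisition pass: keep samples whose magnitude exceeds the threshold.
    return [x for x in signal if abs(x) > threshold]

def bootstrap_acquisition(signal, n_passes=5):
    # Start with a weak prior: threshold at the global mean absolute level.
    threshold = statistics.mean(abs(x) for x in signal)
    for _ in range(n_passes):
        accepted = acquire(signal, threshold)
        if not accepted:
            break
        # Improved prior for the next pass, estimated from the accepted events.
        threshold = 0.5 * statistics.mean(abs(x) for x in accepted)
    return accepted, threshold

sig = [0.1, 0.0, 2.5, 0.2, -3.1, 0.05, 2.8, 0.1]
events, final_threshold = bootstrap_acquisition(sig)
print(events, final_threshold)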
The signal analysis and pattern recognition methods described in this book are mainly applied to problems involving stochastic sequential data: power signals and genomic sequences in particular. The information modeling, feature selection/extraction, and feature-vector discrimination, however, were each developed separately in a general-use context. Details on the theoretical underpinnings are given in Chapter 3, including a collection of ab initio information theory tools to help "find your way around in the dark." One of the main ab initio approaches is to search for statistical anomalies using information measures, so various information measures will be described in detail [103-115].
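One simple instance of anomaly detection with an information measure, sketched here under illustrative assumptions (a fixed-width sliding window scored against the whole-sequence background distribution), is to compute the relative entropy (Kullback-Leibler divergence) of each window; windows with unusual composition stand out as high scorers.

import math
from collections import Counter

def distribution(seq):
    counts = Counter(seq)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()}

def kl_divergence(p, q, eps=1e-12):
    # D(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits.
    return sum(px * math.log2(px / q.get(x, eps)) for x, px in p.items())

def window_scores(seq, width=10):
    # Score each window against the whole-sequence background.
    background = distribution(seq)
    return [(i, kl_divergence(distribution(seq[i:i + width]), background))
            for i in range(len(seq) - width + 1)]

seq = "ACGTACGTACGTAAAAAAAAAACGTACGTACGT"
pos, score = max(window_scores(seq), key=lambda s: s[1])
print("most anomalous window starts at", pos, "score", round(score, 3), "bits")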
The background on information theory and variational/statistical modeling has significant roots in variational calculus. Chapter 3 describes information theory ideas and the information "calculus" description (and related anomaly detection methods). The involvement of variational calculus methods, and the possible parallels with the nascent development of a new (modern) "calculus of information," motivates the detailed overview of the highly successful physics development and applications of the calculus of variations (Appendix B). Using variational calculus, for example, it is possible to establish a link between a choice of information measure and a statistical formalism (maximum entropy, Section 3.1). Maximizing the entropy of a distribution subject to moment constraints leads to the classic distributions seen in mathematics and nature (the Gaussian for fixed mean and variance, etc.). Not surprisingly, variational methods also help to establish and refine some of the main ML methods, including Neural Nets (NNs) (Chapters 9 and 13) and Support Vector Machines (SVMs) (Chapter 10). SVMs are the main tool presented for both classification (supervised learning) and clustering (unsupervised learning), and everything in between (such as bag learning).
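The fixed mean and variance case can be made explicit with a short sketch of the usual Lagrange-multiplier calculation (a standard derivation, summarized here in LaTeX notation):

% Maximize H[p] = -\int p(x)\ln p(x)\,dx subject to normalization, fixed mean \mu,
% and fixed variance \sigma^2, via Lagrange multipliers:
\mathcal{L}[p] = -\int p\ln p\,dx
  + \lambda_0\Big(\int p\,dx - 1\Big)
  + \lambda_1\Big(\int x\,p\,dx - \mu\Big)
  + \lambda_2\Big(\int (x-\mu)^2 p\,dx - \sigma^2\Big)
% Setting the functional derivative to zero:
\frac{\delta\mathcal{L}}{\delta p} = -\ln p(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 (x-\mu)^2 = 0
\;\Longrightarrow\;
p(x) \propto e^{\lambda_1 x + \lambda_2 (x-\mu)^2}
% Imposing the constraints fixes \lambda_1 = 0 and \lambda_2 = -1/(2\sigma^2), giving
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/(2\sigma^2)}.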
Many signal features of interest are time limited but not band limited in the observational context of interest, such as noise "clicks," "spikes," or impulses. To acquire these signal features a time-domain finite state automaton (tFSA) is often most appropriate [116-124]. Human hearing, for example, is a nonlinear system that thereby circumvents the restrictions of the Gabor limit (allowing for musical geniuses, for example, who have "perfect pitch"), where time-frequency acuity surpasses what would be possible by linear signal processing alone [116], such as with Nyquist-sampled linear-response recording devices that are bound by the limits imposed by the Fourier uncertainty principle (or Benedicks' theorem) [117]. Thus, even when the powerful Fourier Transform or Hidden Markov Model (HMM) feature extraction methods are utilized to full advantage, there is often a sector of the signal analysis that is only conveniently accessible by way of FSAs (without significant oversampling), such that parallel processing with both HMM and FSA methods is often needed (results demonstrating this in the context of channel current analysis [1-3] will be described in Chapter 14). Not all of the methods employed at the FSA processing stage derive from standard signal processing approaches, either; some are purely statistical, such as oversampling [118] (used in radar range oversampling [119, 120]) and dithering [121] (used in device stabilization and to reduce quantization error [122,...
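As an illustration of the tFSA idea, the following minimal Python sketch implements a two-state (baseline/spike) threshold automaton for acquiring impulse-like features; the thresholds, state names, and function name are illustrative assumptions, not the text's actual tFSA designs.

def tfsa_spike_acquisition(samples, enter_thresh=2.0, exit_thresh=1.0):
    # Two-state time-domain FSA: BASELINE and SPIKE, with transitions driven
    # by simple amplitude thresholds.
    state = "BASELINE"
    spikes = []          # list of (start_index, end_index) pairs
    start = None
    for i, x in enumerate(samples):
        if state == "BASELINE" and abs(x) >= enter_thresh:
            state, start = "SPIKE", i          # spike onset
        elif state == "SPIKE" and abs(x) < exit_thresh:
            state = "BASELINE"
            spikes.append((start, i))          # spike offset
    if state == "SPIKE":                       # spike runs to end of record
        spikes.append((start, len(samples)))
    return spikes

trace = [0.1, 0.2, 3.5, 4.0, 0.3, 0.1, -2.7, -3.0, -0.2, 0.0]
print(tfsa_spike_acquisition(trace))   # prints [(2, 4), (6, 8)]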