
Text Analysis Pipelines
Description
This monograph proposes a comprehensive and fully automatic approach to designing text analysis pipelines for arbitrary information needs that are optimal in terms of run-time efficiency and that robustly mine relevant information from text of any kind. Based on state-of-the-art techniques from machine learning and other areas of artificial intelligence, novel pipeline construction and execution algorithms are developed and implemented in prototypical software. Formal analyses of the algorithms and extensive empirical experiments underline that the proposed approach represents an essential step towards the ad-hoc use of text mining in web search and big data analytics.
Both web search and big data analytics aim to fulfill people's needs for information in an ad-hoc manner. The information sought is often hidden in large amounts of natural language text. Instead of simply returning links to potentially relevant texts, leading search and analytics engines have started to directly mine relevant information from the texts. To this end, they execute text analysis pipelines that may consist of several complex information-extraction and text-classification stages. Due to practical requirements of efficiency and robustness, however, the use of text mining has so far been limited to anticipated information needs that can be fulfilled with rather simple, manually constructed pipelines.

Content
- Intro
- Foreword
- Preface
- Symbols
- Contents
- 1 Introduction
- 1.1 Information Search in Times of Big Data
- 1.1.1 Text Mining to the Rescue
- 1.2 A Need for Efficient and Robust Text Analysis Pipelines
- 1.2.1 Basic Text Analysis Scenario
- 1.2.2 Shortcomings of Traditional Text Analysis Pipelines
- 1.2.3 Problems Approached in This Book
- 1.3 Towards Intelligent Pipeline Design and Execution
- 1.3.1 Central Research Question and Method
- 1.3.2 An Artificial Intelligence Approach
- 1.4 Contributions and Outline of This Book
- 1.4.1 New Findings in Ad-Hoc Large-Scale Text Mining
- 1.4.2 Contributions to the Concerned Research Fields
- 1.4.3 Structure of the Remaining Chapters
- 1.4.4 Published Research Within This Book
- 2 Text Analysis Pipelines
- 2.1 Foundations of Text Mining
- 2.1.1 Text Mining
- 2.1.2 Information Retrieval
- 2.1.3 Natural Language Processing
- 2.1.4 Data Mining
- 2.1.5 Development and Evaluation
- 2.2 Text Analysis Tasks, Processes, and Pipelines
- 2.2.1 Text Analysis Tasks
- 2.2.2 Text Analysis Processes
- 2.2.3 Text Analysis Pipelines
- 2.3 Case Studies in This Book
- 2.3.1 InfexBA -- Information Extraction for Business Applications
- 2.3.2 ArguAna -- Argumentation Analysis in Customer Opinions
- 2.3.3 Other Evaluated Text Analysis Tasks
- 2.4 State of the Art in Ad-Hoc Large-Scale Text Mining
- 2.4.1 Text Analysis Approaches
- 2.4.2 Design of Text Analysis Approaches
- 2.4.3 Efficiency of Text Analysis Approaches
- 2.4.4 Robustness of Text Analysis Approaches
- 3 Pipeline Design
- 3.1 Ideal Construction and Execution for Ad-Hoc Text Mining
- 3.1.1 The Optimality of Text Analysis Pipelines
- 3.1.2 Paradigms of Designing Optimal Text Analysis Pipelines
- 3.1.3 Case Study of Ideal Construction and Execution
- 3.1.4 Discussion of Ideal Construction and Execution
- 3.2 A Process-Oriented View of Text Analysis
- 3.2.1 Text Analysis as an Annotation Task
- 3.2.2 Modeling the Information to Be Annotated
- 3.2.3 Modeling the Quality to Be Achieved by the Annotation
- 3.2.4 Modeling the Analysis to Be Performed for Annotation
- 3.2.5 Defining an Annotation Task Ontology
- 3.2.6 Discussion of the Process-Oriented View
- 3.3 Ad-Hoc Construction via Partial Order Planning
- 3.3.1 Modeling Algorithm Selection as a Planning Problem
- 3.3.2 Selecting the Algorithms of a Partially Ordered Pipeline
- 3.3.3 Linearizing the Partially Ordered Pipeline
- 3.3.4 Properties of the Proposed Approach
- 3.3.5 An Expert System for Ad-Hoc Construction
- 3.3.6 Evaluation of Ad-Hoc Construction
- 3.3.7 Discussion of Ad-Hoc Construction
- 3.4 An Information-Oriented View of Text Analysis
- 3.4.1 Text Analysis as a Filtering Task
- 3.4.2 Defining the Relevance of Portions of Text
- 3.4.3 Specifying a Degree of Filtering for Each Relation Type
- 3.4.4 Modeling Dependencies of the Relevant Information Types
- 3.4.5 Discussion of the Information-Oriented View
- 3.5 Optimal Execution via Truth Maintenance
- 3.5.1 Modeling Input Control as a Truth Maintenance Problem
- 3.5.2 Filtering the Relevant Portions of Text
- 3.5.3 Determining the Relevant Portions of Text
- 3.5.4 Properties of the Proposed Approach
- 3.5.5 A Software Framework for Optimal Execution
- 3.5.6 Evaluation of Optimal Execution
- 3.5.7 Discussion of Optimal Execution
- 3.6 Trading Efficiency for Effectiveness in Ad-Hoc Text Mining
- 3.6.1 Integration with Passage Retrieval
- 3.6.2 Integration with Text Filtering
- 3.6.3 Implications for Pipeline Efficiency
- 4 Pipeline Efficiency
- 4.1 Ideal Scheduling for Large-Scale Text Mining
- 4.1.1 The Efficiency Potential of Pipeline Scheduling
- 4.1.2 Computing Optimal Schedules with Dynamic Programming
- 4.1.3 Properties of the Proposed Solution
- 4.1.4 Case Study of Ideal Scheduling
- 4.1.5 Discussion of Ideal Scheduling
- 4.2 The Impact of Relevant Information in Input Texts
- 4.2.1 Formal Specification of the Impact
- 4.2.2 Experimental Analysis of the Impact
- 4.2.3 Practical Relevance of the Impact
- 4.2.4 Implications of the Impact
- 4.3 Optimized Scheduling via Informed Search
- 4.3.1 Modeling Pipeline Scheduling as a Search Problem
- 4.3.2 Scheduling Text Analysis Algorithms with k-best A* Search
- 4.3.3 Properties of the Proposed Approach
- 4.3.4 Evaluation of Optimized Scheduling
- 4.3.5 Discussion of Optimized Scheduling
- 4.4 The Impact of the Heterogeneity of Input Texts
- 4.4.1 Experimental Analysis of the Impact
- 4.4.2 Quantification of the Impact
- 4.4.3 Practical Relevance of the Impact
- 4.4.4 Implications of the Impact
- 4.5 Adaptive Scheduling via Self-supervised Online Learning
- 4.5.1 Modeling Pipeline Scheduling as a Classification Problem
- 4.5.2 Learning to Predict Run-Times Self-supervised and Online
- 4.5.3 Adapting a Pipeline's Schedule to the Input Text
- 4.5.4 Properties of the Proposed Approach
- 4.5.5 Evaluation of Adaptive Scheduling
- 4.5.6 Discussion of Adaptive Scheduling
- 4.6 Parallelizing Execution in Large-Scale Text Mining
- 4.6.1 Effects of Parallelizing Pipeline Execution
- 4.6.2 Parallelization of Text Analyses
- 4.6.3 Parallelization of Text Analysis Pipelines
- 4.6.4 Implications for Pipeline Robustness
- 5 Pipeline Robustness
- 5.1 Ideal Domain Independence for High-Quality Text Mining
- 5.1.1 The Domain Dependence Problem in Text Analysis
- 5.1.2 Requirements of Achieving Pipeline Domain Independence
- 5.1.3 Domain-Independent Features of Argumentative Texts
- 5.2 A Structure-Oriented View of Text Analysis
- 5.2.1 Text Analysis as a Structure Classification Task
- 5.2.2 Modeling the Argumentation and Content of a Text
- 5.2.3 Modeling the Argumentation Structure of a Text
- 5.2.4 Defining a Structure Classification Task Ontology
- 5.2.5 Discussion of the Structure-Oriented View
- 5.3 The Impact of the Overall Structure of Input Texts
- 5.3.1 Experimental Analysis of Content and Style Features
- 5.3.2 Statistical Analysis of the Impact of Task-Specific Structure
- 5.3.3 Statistical Analysis of the Impact of General Structure
- 5.3.4 Implications of the Invariance and Impact
- 5.4 Features for Domain Independence via Supervised Clustering
- 5.4.1 Approaching Classification as a Relatedness Problem
- 5.4.2 Learning Overall Structures with Supervised Clustering
- 5.4.3 Using the Overall Structures as Features for Classification
- 5.4.4 Properties of the Proposed Features
- 5.4.5 Evaluation of Features for Domain Independence
- 5.4.6 Discussion of Features for Domain Independence
- 5.5 Explaining Results in High-Quality Text Mining
- 5.5.1 Intelligible Text Analysis through Explanations
- 5.5.2 Explanation of Arbitrary Text Analysis Processes
- 5.5.3 Explanation of the Class of an Argumentative Text
- 5.5.4 Implications for Ad-Hoc Large-Scale Text Mining
- 6 Conclusion
- 6.1 Contributions and Open Problems
- 6.1.1 Enabling Ad-Hoc Text Analysis
- 6.1.2 Optimally Analyzing Text
- 6.1.3 Optimizing Analysis Efficiency
- 6.1.4 Robustly Classifying Text
- 6.2 Implications and Outlook
- 6.2.1 Towards Ad-Hoc Large-Scale Text Mining
- 6.2.2 Outside the Box
- Appendix A Text Analysis Algorithms
- A.1 Analyses and Algorithms
- A.1.1 Classification of Text
- A.1.2 Entity Recognition
- A.1.3 Normalization and Resolution
- A.1.4 Parsing
- A.1.5 Relation Extraction and Event Detection
- A.1.6 Segmentation
- A.1.7 Tagging
- A.2 Evaluation Results
- A.2.1 Efficiency Results
- A.2.2 Effectiveness Results
- Appendix B Software
- B.1 An Expert System for Ad-Hoc Pipeline Construction
- B.1.1 Getting Started
- B.1.2 Using the Expert System
- B.1.3 Exploring the Source Code of the System
- B.2 A Software Framework for Optimal Pipeline Execution
- B.2.1 Getting Started
- B.2.2 Using the Framework
- B.2.3 Exploring the Source Code of the Framework
- B.3 A Web Application for Sentiment Scoring and Explanation
- B.3.1 Getting Started
- B.3.2 Using the Application
- B.3.3 Exploring the Source Code of the Application
- B.3.4 Acknowledgments
- B.4 Source Code of All Experiments and Case Studies
- B.4.1 Software
- B.4.2 Text Corpora
- B.4.3 Experiments and Case Studies
- Appendix C Text Corpora
- C.1 The Revenue Corpus
- C.1.1 Compilation
- C.1.2 Annotation
- C.1.3 Files
- C.1.4 Acknowledgments
- C.2 The ArguAna TripAdvisor Corpus
- C.2.1 Compilation
- C.2.2 Annotation
- C.2.3 Files
- C.2.4 Acknowledgments
- C.3 The LFA-11 Corpus
- C.3.1 Compilation
- C.3.2 Annotation
- C.3.3 Files
- C.3.4 Acknowledgments
- C.4 Used Existing Text Corpora
- C.4.1 CoNLL-2003 Dataset (English and German)
- C.4.2 Sentiment Scale Dataset (and Related Datasets)
- C.4.3 Brown Corpus
- C.4.4 Wikipedia Sample
- References
- Index
System requirements
File format: PDF
Copy protection: Watermark-DRM (Digital Rights Management)
- Computer (Windows; MacOS X; Linux): Use the free software Adobe Reader, Adobe Digital Editions, or any other PDF viewer of your choice (see eBook Help).
- Tablet/Smartphone (Android; iOS): Install the free app Adobe Digital Editions or another reading app for eBooks, e.g., PocketBook (see eBook Help).
- E-reader: Bookeen, Kobo, PocketBook, Sony, Tolino, and many more (Kindle: limited support only).
The PDF file format displays each book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). On the small screens of e-readers or smartphones, however, PDFs are awkward to read because they require a lot of scrolling.
This eBook uses Watermark-DRM, a "soft" form of copy protection. This means that no technical restrictions prevent illegal distribution. However, a personalised watermark is embedded in the eBook; it can be used to identify the purchaser in the event of misuse and to provide evidence for legal purposes.
For more information, see our eBook Help page.