
Text as Data
Description
The need for powerful, accurate and increasingly automatic text analysis software in modern information technology has dramatically increased. Fields as diverse as financial management, fraud and cybercrime prevention, pharmaceutical R&D, social media marketing, customer care, and health services are implementing more comprehensive, text-inclusive analytics strategies. Text as Data: Computational Methods of Understanding Written Expression Using SAS presents an overview of text analytics and the critical role SAS software plays in combining linguistic and quantitative algorithms in the evolution of this dynamic field.
Drawing on over two decades of experience in text analytics, authors Barry deVille and Gurpreet Singh Bawa examine the evolution of text mining and cloud-based solutions, and the development of SAS Visual Text Analytics. By integrating quantitative data and textual analysis with advanced computer learning principles, the authors demonstrate the combined advantages of SAS compared to standard approaches, and show how approaching text as qualitative data within a quantitative analytics framework produces more detailed, accurate, and explanatory results.
* Understand the role of linguistics, machine learning, and multiple data sources in the text analytics workflow
* Understand how a range of quantitative algorithms and data representations reflect contextual effects to shape meaning and understanding
* Access online data and code repositories, videos, tutorials, and case studies
* Learn how SAS extends quantitative algorithms to produce expanded text analytics capabilities
* Redefine text in terms of data for more accurate analysis
This book offers a thorough introduction to the framework and dynamics of text analytics--and the underlying principles at work--and provides an in-depth examination of the interplay between qualitative-linguistic and quantitative, data-driven aspects of data analysis. The treatment begins with a discussion on expression parsing and detection and provides insight into the core principles and practices of text parsing, theme, and topic detection. It includes advanced topics such as contextual effects in numeric and textual data manipulation, fine-tuning text meaning and disambiguation. As the first resource to leverage the power of SAS for text analytics, Text as Data is an essential resource for SAS users and data scientists in any industry or academic application.


Persons
BARRY DEVILLE is a Data Scientist and Solutions Architect with 18 years of experience working at SAS. He led the development of the KnowledgeSEEKER decision tree package and has given workshops and tutorials on decision trees for Statistics Canada, the American Marketing Association, the IEEE, and the Direct Marketing Association.
GURPREET SINGH BAWA is the Data Science Senior Manager at Accenture PLC in India. He delivers advanced analytics solutions for global clients in a variety of corporate sectors.
Content
Preface
Acknowledgments
About the Authors
Introduction
Chapter 1 Text Mining and Text Analytics
Chapter 2 Text Analytics Process Overview
Chapter 3 Text Data Source Capture
Chapter 4 Document Content and Characterization
Chapter 5 Textual Abstraction: Latent Structure, Dimension Reduction
Chapter 6 Classification and Prediction
Chapter 7 Boolean Methods of Classification and Prediction
Chapter 8 Speech to Text
Appendix A Mood State Identification in Text
Appendix B A Design Approach to Characterizing Users Based on Audio Interactions on a Conversational AI Platform
Appendix C SAS Patents in Text Analytics
Glossary
Index
CHAPTER 1
Text Mining and Text Analytics
This chapter describes some of the background and recent history of text analytics and provides real-world examples of how text analytics works and solves business problems. This treatment provides examples of common forms of text analytics and examples of solution approaches. The discussion ranges from a history of the analytical treatment of text expression up to the most recent developments and applications.
BACKGROUND AND TERMINOLOGY
The analysis of written and spoken expression has been developing as a computer application over several decades. Some of the earliest research in machine learning and artificial intelligence dealt with the problem of reading and interpreting text, as well as with translating it (machine translation). These early activities gave rise to a field of computer science known as natural language processing (NLP). The recent rapid development of computer power - including processing power, large data, high bandwidth communication, and cloud-based, high-capacity computer memory - has provided a major new (and considerably broadened) emphasis on computerized text processing and text analysis.
TEXT ANALYTICS: WHAT IS IT?
Text processing and text analysis are components of the developing area of understanding written and spoken expression. Commonly occurring text documents - such as traditional newspapers, journals and periodicals, and, more recently, electronic documents, such as social media posts and emails - are forms of written expression. This active, multilayered area in current computer applications joins well-established, traditional fields such as linguistics and literary analysis to form the outline of the emerging field we call text analytics.
Current approaches to text analytics operate in two reinforcing directions that incorporate traditional forms of linguistic and literary analysis with a wide range of statistical, artificial intelligence (AI), and cognitive computing techniques to effectively process written and spoken expressions. The decoded expressions are used to drive a wide range of computer-mediated inference tasks that include artificial intelligence, cognitive computing, and statistical inference. An everyday example is when we speak or type in a destination in order to receive an optimal driving route. Similarly, a call center agent might decipher multiple forms of common requests in order to construct the most effective solution approach.
Our treatment throughout the chapters to come includes examples of common forms of text analytics and examples of solution approaches. The discussion ranges from a history of the analytical treatment of text expression up to the most recent developments and applications. Since speech is quickly becoming an important form of unstructured data, a final chapter takes up the topic of rendering speech to text.
Computer science and AI emerged as formal disciplines in the aftermath of World War II. An early application of computers to the analysis of written expression, natural language processing, took a universal approach, designed to apply regardless of what language the text was written in - English, Spanish, or Chinese. The techniques that have been developed also apply regardless of the source of the text to be analyzed. With the widespread availability of speech-to-text engines, it is also possible to consider a wide variety of spoken documents as potential sources for text analytics.
An important goal of NLP is to decompose text constructs (sentences, paragraphs, articles, chapters) into various kinds of entities, verbs, semantic constructs (like articles and conjunctions), and so on. The sentence "See Spot run" may be processed and encoded into an NLP representation as: declarative sentence (intransitive); Spot - Subject (Animal/Dog); run - Verb (motion).
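As a rough illustration of the kind of encoding described above, the following Python sketch tags the tokens of "See Spot run" against a tiny hand-written lexicon. The `LEXICON` table, the `Token` class, and the `tag` function are hypothetical names invented for this illustration, and the entry for "see" is an assumption that does not appear in the text. A real NLP system would rely on a trained parser rather than a lookup table.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: word -> (grammatical role, semantic class).
# The entries for "spot" and "run" follow the encoding given in the text;
# the entry for "see" is an assumption added for completeness.
LEXICON = {
    "see": ("Verb", "perception"),
    "spot": ("Subject", "Animal/Dog"),
    "run": ("Verb", "motion"),
}


@dataclass(frozen=True)
class Token:
    text: str
    role: str
    semantic_class: str


def tag(sentence: str) -> list[Token]:
    """Split on whitespace and look each word up in the toy lexicon."""
    tokens = []
    for word in sentence.split():
        key = word.lower().strip(".,!?")
        role, semantic_class = LEXICON.get(key, ("Unknown", "Unknown"))
        tokens.append(Token(word, role, semantic_class))
    return tokens


for token in tag("See Spot run"):
    print(f"{token.text} - {token.role} ({token.semantic_class})")
```

Words missing from the lexicon are tagged "Unknown", which makes the limits of a purely dictionary-based approach easy to see.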
Historically, NLP relied on various linguistic analysis capabilities, including extensive logical processing and reasoning capabilities. As computing capabilities have expanded, NLP has increasingly relied on a range of computational approaches to enhance the range of NLP results. An emerging area of NLP includes statistical natural language processing (SNLP). This form of NLP can be used to craft high-level representations of textual documents so that relationships between and among the documents can be computed statistically. The statistical capability also improves the accuracy of the NLP processing itself.
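One simple way to compute relationships between documents statistically is to represent each document as a vector of term counts and compare the vectors. The sketch below uses cosine similarity over a bag-of-words representation; it is an assumption chosen for illustration, not a method the text attributes to SNLP or to SAS. The sample documents and the function names `term_frequencies` and `cosine_similarity` are likewise hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Lowercase the text and count each alphabetic token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample documents, loosely modeled on customer service notes.
docs = {
    "d1": "The battery drains quickly and the battery overheats.",
    "d2": "Battery overheats after charging.",
    "d3": "The delivery arrived late.",
}
vectors = {name: term_frequencies(text) for name, text in docs.items()}

for x, y in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
    print(x, y, round(cosine_similarity(vectors[x], vectors[y]), 3))
```

Documents that share no terms score 0, and a document compared with itself scores 1, so the measure gives a bounded, comparable number for every pair.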
One recent area of written language processing includes statistical document analysis (SDA). Like SNLP, SDA enables us to show the statistical relationships between and among the various components of a textual document. Further, it enables us to summarize the document using multivariate statistical techniques like cluster analysis and latent class analysis. Predictive analytics such as regression analysis, decision trees, and neural networks can also be used.
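To give a feel for cluster analysis applied to documents, the sketch below groups short texts with single-linkage clustering cut at a fixed similarity threshold, using Jaccard similarity between token sets. This is a deliberately simple stand-in chosen for illustration; it is not latent class analysis and not a procedure described in the text. The sample documents, the threshold value, and the function names are all assumptions.

```python
import re
from itertools import combinations


def token_set(text: str) -> frozenset:
    """Set of lowercase alphabetic tokens in the text."""
    return frozenset(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs: list[str], threshold: float) -> list[list[int]]:
    """Single-linkage clustering: two documents share a cluster when a
    chain of pairs with similarity >= threshold connects them."""
    parent = list(range(len(docs)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [token_set(doc) for doc in docs]
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented sample documents.
sample = [
    "battery overheats while charging",
    "battery overheats after charging",
    "delivery arrived late",
    "late delivery again",
]
print(cluster(sample, threshold=0.4))
```

Raising the threshold splits the groups apart, and lowering it merges them, which is the main tuning decision in this style of clustering.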
As computer processing and storage have continued to grow, so too have a variety of deep learning applications. One such application is Bidirectional Encoder Representations from Transformers (BERT), a deep learning model that came out of research at Google AI Language.
BERT can be leveraged for tasks such as categorization, entity extraction, and natural language generation. Deep learning approaches require significant computing power and training. As the area of text analytics continues to unfold, we will likely see how deep learning approaches complement the capabilities offered in traditional text analytics, which are less computationally intensive and more than adequate for a wide range of tasks.
The fields of text mining and text analytics are recent applied areas of SDA used in a variety of general-purpose social and economic settings. Text mining often refers to the construction of statistical or numerical models or predictions. Common sources of data include customer service logs and emails, as well as customer use records for warranty issue analysis and defect detection. Text analytics often refers to semantically based applications - for example, customer analytics (who talks to whom and what do they say?), competitive analysis (brand metrics, mentions), and content management (the creation of taxonomies, web page characterization).
Brief History of Text
Language is a form of communication, and text is a written form of language. Text comes in a variety of symbolic forms. In addition to the alphabetic representation we see capturing the written expression in this text, there are other encoding systems such as syllabaries that capture spoken syllables and logograms that capture pictographic representations. Linguistics distinguishes between phonograms - which capture parts of words like syllables in written expression - and logograms - which capture entire concepts.
Figure 1.1 Traffic sign in Cherokee syllabary, Tahlequah, Oklahoma.
Source: Shot November 11, 2007. By Uyvsdi. License: Public Domain.
Figure 1.1 shows an example of a pictographic representation - the STOP sign itself - an alphabetic representation (in Latin script) that spells the word "STOP" and a syllabary - in this case, one used to record the Cherokee language.
One of the earliest true writing systems, dating to the third millennium BCE, was cuneiform, originally a pictographic writing system that eventually evolved into a variety of alphabetic representations. One intermediate form of simplified cuneiform was Old Persian. It included a semi-alphabetic syllabary, using far fewer wedge strokes than earlier Assyrian versions of cuneiform. It included a handful of logograms for frequently occurring words such as "god" and "king" (see Figure 1.2).
Chinese characters evolved in the second millennium BCE and, according to sources such as Dong, were first organized into a comprehensive writing system during the Qin dynasty (259-210 BCE). These characters eventually gave rise to the widespread use of the characteristic logograms of Chinese in Asia (see Figure 1.3).
Figure 1.2 Example of cuneiform recording the distribution of beer in southern Iraq, 3100-3000 BCE.
Source: BabelStone, Licensed under CC BY-SA 3.0.
The representation of different writing systems is important for mapping language meanings between languages. Figure 1.4 shows a modern representation of the Chinese character for eye and the associated Latin script representation to show the translation between a pictograph (logogram) and syllabary.
Figure 1.3 Shang oracle bone script for character "Eye." Modern character is 目.
Source: Tomchen1989. Public Domain.
Figure 1.4 Modern Chinese representation of "eye" (mù).
Source: B. deVille.
Writing Systems of the World
Writing systems of the world that have evolved from ancient times to the present day can be organized into five categories: alphabets, abjads, abugidas, syllabaries, and logo-syllabaries.
- Alphabets. Each letter represents a sound which can be either a consonant or a vowel. English uses an alphabet as do such related languages as French, German, and Spanish.
- Abjads. Similar to alphabets except they are...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.