This chapter describes the background and recent history of text analytics and provides real-world examples of how text analytics works and solves business problems. It presents common forms of text analytics and solution approaches, ranging from the history of the analytical treatment of text to the most recent developments and applications.
The analysis of written and spoken expression has been developing as a computer application over several decades. Some of the earliest research in machine learning and artificial intelligence dealt with the problem of reading and interpreting text, as well as translating it (machine translation). These early activities gave rise to a field of computer science known as natural language processing (NLP). The recent rapid growth in computing resources - including processing power, large data sets, high-bandwidth communication, and cloud-based, high-capacity storage - has brought a major new, and considerably broadened, emphasis on computerized text processing and text analysis.
Text processing and text analysis are components of the developing area of understanding written and spoken expression. Commonly occurring text documents - such as traditional newspapers, journals and periodicals, and, more recently, electronic documents, such as social media posts and emails - are forms of written expression. This active, multilayered area in current computer applications joins well-established, traditional fields such as linguistics and literary analysis to form the outline of the emerging field we call text analytics.
Current approaches to text analytics operate in two mutually reinforcing directions. First, they combine traditional forms of linguistic and literary analysis with a wide range of statistical, artificial intelligence (AI), and cognitive computing techniques to process written and spoken expressions effectively. Second, the decoded expressions drive a wide range of computer-mediated inference tasks that draw on artificial intelligence, cognitive computing, and statistical inference. An everyday example is speaking or typing a destination in order to receive an optimal driving route. Similarly, a call center agent might decipher multiple forms of common requests in order to construct the most effective solution approach.
Our treatment throughout the chapters to come includes examples of common forms of text analytics and examples of solution approaches. The discussion ranges from a history of the analytical treatment of text expression up to the most recent developments and applications. Since speech is quickly becoming an important form of unstructured data, a final chapter takes up the topic of rendering speech to text.
Computer science and AI emerged as formal disciplines in the aftermath of World War II. An early application of computers to the analysis of written expression, natural language processing, took a universal approach, designed to apply regardless of what language the text was written in - English, Spanish, or Chinese. The techniques that have been developed also apply regardless of the source of the text to be analyzed. With the widespread availability of speech-to-text engines, it is also possible to consider a wide variety of spoken documents as potential sources for text analytics.
An important goal of NLP is to decompose text constructs (sentences, paragraphs, articles, chapters) into various kinds of entities, verbs, semantic constructs (like articles and conjunctions), and so on. The sentence "See Spot run" may be processed and encoded into an NLP representation as: declarative sentence (intransitive); Spot - Subject (Animal/Dog); run - Verb (motion).
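The encoding of "See Spot run" described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the `LEXICON` table, the `Token` class, and the `annotate` function are hypothetical names invented here, and the tag for "see" is an assumption, since the text gives tags only for "Spot" and "run". A real NLP system would derive these labels from a trained model or a full grammar rather than a hand-written lookup table.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: word -> (part of speech, semantic class).
# The entries for "spot" and "run" mirror the example in the text;
# the entry for "see" is an assumption added for illustration.
LEXICON = {
    "see": ("VERB", "perception"),
    "spot": ("NOUN", "animal/dog"),
    "run": ("VERB", "motion"),
}


@dataclass
class Token:
    text: str
    pos: str
    semantic_class: str


def annotate(sentence):
    """Split a sentence on whitespace and look each word up in the lexicon."""
    tokens = []
    for word in sentence.strip(".!?").split():
        pos, semantic_class = LEXICON.get(word.lower(), ("UNKNOWN", "unknown"))
        tokens.append(Token(word, pos, semantic_class))
    return tokens


tokens = annotate("See Spot run")
for token in tokens:
    print(f"{token.text} - {token.pos} ({token.semantic_class})")
```

Words missing from the lexicon are tagged as unknown, which shows why purely hand-built resources scale poorly and why statistical methods became attractive.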
Historically, NLP relied on various linguistic analysis capabilities, including extensive logical processing and reasoning. As computing capabilities have expanded, NLP has increasingly relied on computational approaches to broaden the range of results it can produce. One emerging area is statistical natural language processing (SNLP). This form of NLP can be used to craft high-level representations of textual documents so that relationships between and among the documents can be computed statistically. The statistical capability also improves the accuracy of the NLP processing itself.
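One common way to compute a statistical relationship between documents is to represent each document as a vector of word counts and compare the vectors with cosine similarity. The sketch below assumes that technique; the text does not say which representation SNLP systems use, and the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, invented for illustration.
docs = {
    "warranty_1": "The battery failed under warranty and was replaced",
    "warranty_2": "Battery replaced under warranty after it failed",
    "route": "Find the fastest driving route to the airport",
}
vectors = {name: term_counts(text) for name, text in docs.items()}

sim_related = cosine_similarity(vectors["warranty_1"], vectors["warranty_2"])
sim_unrelated = cosine_similarity(vectors["warranty_1"], vectors["route"])
print(f"warranty_1 vs warranty_2: {sim_related:.2f}")
print(f"warranty_1 vs route:      {sim_unrelated:.2f}")
```

The two warranty notes share most of their vocabulary and score higher than the unrelated pair. The small nonzero score for the unrelated pair comes from the shared word "the", which is why practical systems usually remove or down-weight such common words.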
One recent area of written language processing is statistical document analysis (SDA). Like SNLP, SDA enables us to show the statistical relationships between and among the various components of a textual document. Further, it enables us to summarize the document using multivariate statistical techniques like cluster analysis and latent class analysis. Predictive techniques such as regression analysis, decision trees, and neural networks can also be used.
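To give a rough sense of what cluster analysis does with text, the sketch below groups short documents by word overlap. It deliberately swaps in a very simple "leader" grouping with Jaccard similarity rather than a full multivariate method such as k-means or latent class analysis. The function names, the 0.3 threshold, and the sample documents are all assumptions made for illustration.

```python
import re


def token_set(text):
    """Reduce a document to its set of distinct lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of distinct words that two documents have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_cluster(docs, threshold=0.3):
    """Assign each document to the first cluster whose leader (the
    cluster's first document) is at least `threshold` similar;
    otherwise start a new cluster with this document as leader."""
    clusters = []  # list of (leader_tokens, member_indices)
    for index, text in enumerate(docs):
        tokens = token_set(text)
        for leader_tokens, members in clusters:
            if jaccard(tokens, leader_tokens) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((tokens, [index]))
    return [members for _, members in clusters]


# Hypothetical sample documents, invented for illustration.
docs = [
    "battery failed under warranty",
    "warranty claim for failed battery",
    "fastest driving route to the airport",
    "driving route to the airport with less traffic",
]
groups = leader_cluster(docs)
print(groups)
```

The result depends on document order and on the chosen threshold, which is one reason production work tends to use more robust clustering methods.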
As computer processing and storage have continued to grow, so too have a variety of deep learning applications. One such application is Bidirectional Encoder Representations from Transformers (BERT), a deep learning language model developed by researchers at Google AI Language.[i]
BERT can be leveraged for tasks such as categorization, entity extraction, and natural language generation. Deep learning approaches require significant computing power and training. As the area of text analytics continues to unfold, we will likely see how deep learning approaches complement the capabilities offered in traditional text analytics, which are less computationally intensive and more than adequate for a wide range of tasks.
The fields of text mining and text analytics are recent applied areas of SDA used in a variety of general-purpose social and economic settings. Text mining often refers to the construction of statistical or numerical models or predictions. Common sources of data include customer service logs and emails, as well as customer use records for warranty issue analysis and defect detection. Text analytics often refers to semantically based applications - for example, customer analytics (who talks to whom and what do they say?), competitive analysis (brand metrics, mentions), and content management (the creation of taxonomies, web page characterization).
Language is a form of communication, and text is a written form of language. Text comes in a variety of symbolic forms. In addition to the alphabetic representation we see capturing the written expression in this text, there are other encoding systems such as syllabaries that capture spoken syllables and logograms that capture pictographic representations. Linguistics distinguishes between phonograms - which capture parts of words like syllables in written expression - and logograms - which capture entire concepts.
Figure 1.1 Traffic sign in Cherokee syllabary, Tahlequah, Oklahoma.
Source: Shot November 11, 2007. By Uyvsdi. License: Public Domain.
Figure 1.1 shows an example of a pictographic representation - the STOP sign itself - an alphabetic representation (in Latin script) that spells the word "STOP" and a syllabary - in this case, one used to record the Cherokee language.
One of the earliest true writing systems, dating to the third millennium BCE, was cuneiform, originally a pictographic writing system that eventually evolved into a variety of alphabetic representations. One intermediate form of simplified cuneiform was Old Persian. It included a semi-alphabetic syllabary, using far fewer wedge strokes than earlier Assyrian versions of cuneiform. It included a handful of logograms for frequently occurring words such as "god" and "king" (see Figure 1.2).
Chinese characters evolved in the second millennium BCE and, according to sources such as Dong,[ii] were first organized into a comprehensive writing system during the Qin dynasty (221-206 BCE). These characters eventually gave rise to the widespread use of the characteristic logograms of Chinese in Asia (see Figure 1.3).
Figure 1.2 Example of cuneiform recording the distribution of beer in southern Iraq, 3100-3000 BCE.
Source: BabelStone, Licensed under CC BY-SA 3.0.
Representing different writing systems is important for mapping meanings between languages. Figure 1.4 shows the modern Chinese character for "eye" together with its Latin-script transcription, illustrating the correspondence between a logogram and a phonetic representation of the same word.
Figure 1.3 Shang oracle bone script for character "Eye." Modern character is 目.
Source: Tomchen1989. Public Domain.
Figure 1.4 Modern Chinese representation of "eye" (mù).
Source: B. deVille.
Writing systems of the world that have evolved from ancient times to the present day can be organized into five categories:[iii] alphabets, abjads, abugidas, syllabaries, and logo-syllabaries.