Text Mining in Practice with R

 
 
Standards Information Network (publisher)
  • Published 12 May 2017
  • 320 pages
 
E-book | PDF with Adobe DRM | System requirements
978-1-119-28209-9 (ISBN)
 
A reliable, cost-effective approach to extracting priceless business information from all sources of text
Excavating actionable business insights from data is a complex undertaking, and that complexity is magnified by an order of magnitude when the focus is on documents and other text information. This book takes a practical, hands-on approach to teaching you a reliable, cost-effective approach to mining the vast, untold riches buried within all forms of text using R.
Author Ted Kwartler clearly describes all of the tools needed to perform text mining and shows you how to use them to identify practical business applications, so you can get your own text mining efforts started right away. With the help of numerous real-world examples and case studies from industries ranging from healthcare to entertainment to telecommunications, he demonstrates how to execute an array of text mining processes and functions, including sentiment scoring, topic modelling, predictive modelling, identifying clickbait headlines, and more. You'll learn how to:
* Identify actionable social media posts to improve customer service
* Use text mining in HR to identify candidate perceptions of an organisation, match job descriptions with resumes, and more
* Extract priceless information from virtually all digital and print sources, including the news media, social media sites, PDFs, and even JPEG and GIF image files
* Make text mining an integral component of marketing in order to identify brand evangelists, impact customer propensity modelling, and much more
Most companies' data mining efforts focus almost exclusively on numerical and categorical data, while text remains a largely untapped resource. Especially in a global marketplace, where being first to identify and respond to customer needs and expectations imparts an unbeatable competitive advantage, text represents a source of immense potential value. Until now, however, there has been no reliable, cost-effective technology for extracting analytical insights from the huge and ever-growing volume of text available online, in other digital sources and in paper documents.
TED KWARTLER is a data science instructor at DataCamp.com. He has worked in analytical and executive roles at DataRobot, Liberty Mutual Insurance and Amazon.com.
Cover [p. 1]
Title Page [p. 5]
Copyright [p. 6]
Dedication [p. 7]
Contents [p. 9]
Foreword [p. 13]
Chapter 1 What is Text Mining? [p. 15]
1.1 What is it? [p. 15]
1.1.1 What is Text Mining in Practice? [p. 16]
1.1.2 Where Does Text Mining Fit? [p. 16]
1.2 Why We Care About Text Mining [p. 16]
1.2.1 What Are the Consequences of Ignoring Text? [p. 17]
1.2.2 What Are the Benefits of Text Mining? [p. 19]
1.2.3 Setting Expectations: When Text Mining Should (and Should Not) Be Used [p. 20]
1.3 A Basic Workflow - How the Process Works [p. 23]
1.4 What Tools Do I Need to Get Started with This? [p. 26]
1.5 A Simple Example [p. 26]
1.6 A Real World Use Case [p. 27]
1.7 Summary [p. 29]
Chapter 2 Basics of Text Mining [p. 31]
2.1 What is Text Mining in a Practical Sense? [p. 31]
2.2 Types of Text Mining: Bag of Words [p. 34]
2.2.1 Types of Text Mining: Syntactic Parsing [p. 36]
2.3 The Text Mining Process in Context [p. 38]
2.4 String Manipulation: Number of Characters and Substitutions [p. 39]
2.4.1 String Manipulations: Paste, Character Splits and Extractions [p. 43]
2.5 Keyword Scanning [p. 47]
2.6 String Packages stringr and stringi [p. 50]
2.7 Preprocessing Steps for Bag of Words Text Mining [p. 51]
2.8 Spellcheck [p. 58]
2.9 Frequent Terms and Associations [p. 61]
2.10 DeltaAssist Wrap Up [p. 63]
2.11 Summary [p. 63]
Chapter 3 Common Text Mining Visualizations [p. 65]
3.1 A Tale of Two (or Three) Cultures [p. 65]
3.2 Simple Exploration: Term Frequency, Associations and Word Networks [p. 67]
3.2.1 Term Frequency [p. 68]
3.2.2 Word Associations [p. 71]
3.2.3 Word Networks [p. 73]
3.3 Simple Word Clusters: Hierarchical Dendrograms [p. 81]
3.4 Word Clouds: Overused but Effective [p. 87]
3.4.1 One Corpus Word Clouds [p. 88]
3.4.2 Comparing and Contrasting Corpora in Word Clouds [p. 89]
3.4.3 Polarized Tag Plot [p. 93]
3.5 Summary [p. 97]
Chapter 4 Sentiment Scoring [p. 99]
4.1 What is Sentiment Analysis? [p. 99]
4.2 Sentiment Scoring: Parlor Trick or Insightful? [p. 102]
4.3 Polarity: Simple Sentiment Scoring [p. 103]
4.3.1 Subjectivity Lexicons [p. 103]
4.3.2 Qdap's Scoring for Positive and Negative Word Choice [p. 107]
4.3.3 Revisiting Word Clouds - Sentiment Word Clouds [p. 110]
4.4 Emoticons - Dealing with These Perplexing Clues [p. 117]
4.4.1 Symbol-Based Emoticons Native to R [p. 119]
4.4.2 Punctuation-Based Emoticons [p. 120]
4.4.3 Emoji [p. 122]
4.5 R's Archived Sentiment Scoring Library [p. 127]
4.6 Sentiment the Tidytext Way [p. 132]
4.7 Airbnb.com Boston Wrap Up [p. 140]
4.8 Summary [p. 140]
Chapter 5 Hidden Structures: Clustering, String Distance, Text Vectors and Topic Modeling [p. 143]
5.1 What is Clustering? [p. 143]
5.1.1 K-Means Clustering [p. 144]
5.1.2 Spherical K-Means Clustering [p. 153]
5.1.3 K-Medoid Clustering [p. 158]
5.1.4 Evaluating the Cluster Approaches [p. 159]
5.2 Calculating and Exploring String Distance [p. 161]
5.2.1 What is String Distance? [p. 162]
5.2.2 Fuzzy Matching - Amatch, Ain [p. 165]
5.2.3 Similarity Distances - Stringdist, Stringdistmatrix [p. 166]
5.3 LDA Topic Modeling Explained [p. 168]
5.3.1 Topic Modeling Case Study [p. 170]
5.3.2 LDA and LDAvis [p. 172]
5.4 Text to Vectors Using text2vec [p. 183]
5.4.1 Text2vec [p. 185]
5.5 Summary [p. 193]
Chapter 6 Document Classification: Finding Clickbait from Headlines [p. 195]
6.1 What is Document Classification? [p. 195]
6.2 Clickbait Case Study [p. 197]
6.2.1 Session and Data Set-Up [p. 199]
6.2.2 GLMNet Training [p. 202]
6.2.3 GLMNet Test Predictions [p. 210]
6.2.4 Test Set Evaluation [p. 212]
6.2.5 Finding the Most Impactful Words [p. 214]
6.2.6 Case Study Wrap Up: Model Accuracy and Improving Performance Recommendations [p. 220]
6.3 Summary [p. 221]
Chapter 7 Predictive Modeling: Using Text for Classifying and Predicting Outcomes [p. 223]
7.1 Classification vs Prediction [p. 223]
7.2 Case Study I: Will This Patient Come Back to the Hospital? [p. 224]
7.2.1 Patient Readmission in the Text Mining Workflow [p. 225]
7.2.2 Session and Data Set-Up [p. 225]
7.2.3 Patient Modeling [p. 228]
7.2.4 More Model KPIs: AUC, Recall, Precision and F1 [p. 230]
7.2.4.1 Additional Evaluation Metrics [p. 232]
7.2.5 Apply the Model to New Patients [p. 236]
7.2.6 Patient Readmission Conclusion [p. 237]
7.3 Case Study II: Predicting Box Office Success [p. 238]
7.3.1 Opening Weekend Revenue in the Text Mining Workflow [p. 239]
7.3.2 Session and Data Set-Up [p. 239]
7.3.3 Opening Weekend Modeling [p. 242]
7.3.4 Model Evaluation [p. 245]
7.3.5 Apply the Model to New Movie Reviews [p. 248]
7.3.6 Movie Revenue Conclusion [p. 249]
7.4 Summary [p. 250]
Chapter 8 The OpenNLP Project [p. 251]
8.1 What is the OpenNLP Project? [p. 251]
8.2 R's OpenNLP Package [p. 252]
8.3 Named Entities in Hillary Clinton's Email [p. 256]
8.3.1 R Session Set-Up [p. 257]
8.3.2 Minor Text Cleaning [p. 259]
8.3.3 Using OpenNLP on a Single Email [p. 260]
8.3.4 Using OpenNLP on Multiple Documents [p. 265]
8.3.5 Revisiting the Text Mining Workflow [p. 268]
8.4 Analyzing the Named Entities [p. 269]
8.4.1 Worldwide Map of Hillary Clinton's Location Mentions [p. 270]
8.4.2 Mapping Only European Locations [p. 273]
8.4.3 Entities and Polarity: How Does Hillary Clinton Feel About an Entity? [p. 276]
8.4.4 Stock Charts for Entities [p. 280]
8.4.5 Reach an Insight or Conclusion About Hillary Clinton's Emails [p. 282]
8.5 Summary [p. 283]
Chapter 9 Text Sources [p. 285]
9.1 Sourcing Text [p. 285]
9.2 Web Sources [p. 286]
9.2.1 Web Scraping a Single Page with rvest [p. 286]
9.2.2 Web Scraping Multiple Pages with rvest [p. 290]
9.2.3 Application Program Interfaces (APIs) [p. 296]
9.2.4 Newspaper Articles from the Guardian Newspaper [p. 297]
9.2.5 Tweets Using the twitteR Package [p. 299]
9.2.6 Calling an API Without a Dedicated R Package [p. 301]
9.2.7 Using jsonlite to Access the New York Times [p. 302]
9.2.8 Using RCurl and XML to Parse Google Newsfeeds [p. 304]
9.2.9 The tm Library Web-Mining Plugin [p. 306]
9.3 Getting Text from File Sources [p. 307]
9.3.1 Individual CSV, TXT and Microsoft Office Files [p. 308]
9.3.2 Reading Multiple Files Quickly [p. 310]
9.3.3 Extracting Text from PDFs [p. 312]
9.3.4 Optical Character Recognition: Extracting Text from Images [p. 313]
9.4 Summary [p. 316]
Index [p. 319]
EULA [p. 323]

File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see the e-book help).

Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app before downloading (see the e-book help).

E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino and many others (not Kindle).

The PDF format renders each book page identically on any hardware, which makes it well suited to the complex layouts used in textbooks and technical books (images, tables, columns, footnotes). On the small displays of e-readers or smartphones, however, PDFs can be tedious to read because they require a lot of scrolling. Adobe DRM applies "hard" copy protection here: if the necessary requirements are not met, you will not be able to open the e-book, so prepare your reading hardware before downloading.

Please note when using the Adobe Digital Editions reading software: we strongly recommend authorizing the software with your personal Adobe ID immediately after installation.

Further information can be found in our e-book help.


Download (available immediately)

52.99 €
incl. 7% VAT
Download / single-user licence
PDF with Adobe DRM
see system requirements