Textual Data Science with R

 
 
Chapman and Hall (publisher)
  • published 11 March 2019
  • 204 pages
 
E-book | EPUB with Adobe DRM
978-1-351-81635-9 (ISBN)
 
Textual Data Science with R comprehensively covers the main multidimensional methods in textual statistics, supported by a specially written R package. Methods discussed include correspondence analysis, clustering, and multiple factor analysis for contingency tables. Each method is illuminated by applications. The book is aimed at researchers and students in statistics, social sciences, history, literature, and linguistics. It will be of interest to anyone from practitioners needing to extract information from texts to students in the field of massive data, where the ability to process textual data is becoming essential.
  • English
  • London
  • United Kingdom
Taylor & Francis Ltd
  • For higher education and university study
  • For professional and research use
50 black-and-white illustrations
978-1-351-81635-9 (9781351816359)

Mónica Bécue-Bertaut is an elected fellow of the International Statistical Institute and was named Chevalier des Palmes Académiques by the French Government. She taught statistics and data science at the Universitat Politècnica de Catalunya and has given numerous guest lectures on textual data science in different countries. Dr. Bécue-Bertaut has published several books (in French or Spanish) and book chapters (in English) on this topic. She also participated in the design of software for textual data science, such as SPAD.T and Xplortext, the latter being an R package.

1. Encoding: from a corpus to statistical tables


Textual and contextual data


Textual data


Contextual data


Documents and aggregate documents


Examples and notation


Choosing textual units


Graphical forms


Lemmas


Stems


Repeated segments


In practice


Preprocessing


Unique spellings


Partially-automated preprocessing


Word selection


Word and segment indexes


The Life UK corpus: preliminary results


Verbal content through word and repeated segment indexes


Univariate description of contextual variables


A note on the frequency range


Implementation with the Xplortext package


In summary





2. Correspondence analysis of textual data


Data and goals


Correspondence analysis: a tool for linguistic data analysis


Data: a small example


Objectives


Associations between documents and words


Profile comparisons


Independence of documents and words


The χ² test


Association rates between documents and words


Active row and column clouds


Row and column profile spaces


Distributional equivalence and the χ² distance


Inertia of a cloud


Fitting document and word clouds


Factorial axes


Visualizing rows and columns


Category representation


Word representation


Transition formulas


Superimposed representation of rows and columns


Interpretation aids


Eigenvalues and representation quality of the clouds


Contribution of documents and words to axis inertia


Representation quality of a point


Supplementary rows and columns


Supplementary tables


Supplementary frequency rows and columns


Supplementary quantitative and qualitative variables


Validating the visualization


Interpretation scheme for textual CA results


Implementation with Xplortext


Summary of the CA approach





3. Applications of correspondence analysis


Choosing the level of detail for analyses


Correspondence analysis on aggregate free text answers


Data and objectives


Word selection


CA on the aggregate table


Document representation


Word representation


Simultaneous interpretation of the plots


Supplementary elements


Supplementary words


Supplementary repeated segments


Supplementary categories


Implementation with Xplortext


Direct analysis


Data and objectives


The main features of direct analysis


Direct analysis of the culture question


Implementation with Xplortext





4. Clustering in textual analysis


Clustering documents


Dissimilarity measures between documents


Measuring partition quality


Document clusters in the factorial space


Partition quality


Dissimilarity measures between document clusters


The single-linkage method


The complete-linkage method


Ward's method


Agglomerative hierarchical clustering


Hierarchical tree construction algorithm


Selecting the final partition


Interpreting clusters


Direct partitioning


Combining clustering methods


Consolidating partitions


Direct partitioning followed by AHC


A procedure for combining CA and clustering


Example: joint use of CA and AHC


Data and objectives


Data preprocessing using CA


Constructing the hierarchical tree


Choosing the final partition


Contiguity-constrained hierarchical clustering


Principles and algorithm


AHC of age groups with a chronological constraint


Implementation with Xplortext


Example: clustering free text answers


Data and objectives


Data preprocessing


CA: eigenvalues and total inertia


Interpreting the first axes


AHC: building the tree and choosing the final partition


Describing cluster features


Lexical features of clusters


Describing clusters in terms of characteristic words


Describing clusters in terms of characteristic documents


Describing clusters using contextual variables


Describing clusters using contextual qualitative variables


Describing clusters using quantitative contextual variables


Implementation with Xplortext


Summary of the use of AHC on factorial coordinates coming from CA





5. Lexical characterization of parts of a corpus


Characteristic words


Characteristic words and CA


Characteristic words and clustering


Clustering based on verbal content


Clustering based on contextual variables


Hierarchical words


Characteristic documents


Example: characteristic elements and CA


Characteristic words for the categories


Characteristic words and factorial planes


Documents that characterize categories


Characteristic words in addition to clustering


Implementation with Xplortext





6. Multiple factor analysis for textual analysis


Multiple tables in textual analysis


Data and objectives


Data preprocessing


Problems posed by lemmatization


Description of the corpora data


Indexes of the most frequent words


Notation


Objectives


Introduction to MFACT


The limits of CA on multiple contingency tables


How MFACT works


Integrating contextual variables


Analysis of multilingual free text answers


MFACT: eigenvalues of the global analysis


Representation of documents and words


Superimposed representation of the global and partial configurations


Links between the axes of the global analysis and the separate analyses


Representation of the groups of words


Implementation with Xplortext


Simultaneous analysis of two open-ended questions: impact of lemmatization


Objectives


Preliminary steps


MFACT on the left and right: lemmatized or nonlemmatized


Implementation with Xplortext


Other applications of MFACT in textual analysis


MFACT summary





7. Applications and analysis workflows


General rules for presenting results


Analyzing bibliographic databases


Introduction to the lupus data


The corpus


Exploratory analysis of the corpus


CA of the documents × words table


The eigenvalues


Meta-keys and doc-keys


Analysis of the year-aggregate table


Eigenvalues and CA of the lexical table


Chronological study of drug names


Implementation with Xplortext


Conclusions from the study


Badinter's speech: a discursive strategy


Methods


Breaking up the corpus into documents


The speech trajectory unveiled by CA


Results


Argument flow


Conclusions on the study of Badinter's speech


Implementation with Xplortext


Political speeches


Data and objectives


Methodology


Results


Data preprocessing


Lexicometric characteristics of the speeches and lexical table coding


Eigenvalues and Cramér's V


Speech trajectory


Word representation


Remarks


Hierarchical structure of the corpus


Conclusions


Implementation with Xplortext


Corpus of sensory descriptions


Introduction


Data


Eight Catalan wines


Jury


Verbal categorization


Encoding the data


Objectives


Statistical methodology


MFACT and constructing the mean configuration


Determining consensual words


Results


Data preprocessing


Some initial results


Individual configurations


MFACT: directions of inertia common to the majority of groups


MFACT: representing words and documents on the first plane


Word contributions


MFACT: group representation


Consensual words


Conclusion

File format: EPUB
Copy protection: Adobe DRM (Digital Rights Management)


