Textual Data Science with R

Chapman and Hall (publisher)
  • published 11 March 2019
  • 204 pages
E-book | ePUB with Adobe DRM | System requirements
978-1-351-81635-9 (ISBN)
Textual Data Science with R comprehensively covers the main multidimensional methods in textual statistics, supported by a specially written R package. Methods discussed include correspondence analysis, clustering, and multiple factor analysis for contingency tables. Each method is illuminated by applications. The book is aimed at researchers and students in statistics, social sciences, history, literature and linguistics. It will be of interest to anyone from practitioners who need to extract information from texts to students in the field of massive data, where the ability to process textual data is becoming essential.
  • English
  • London
  • United Kingdom
Taylor & Francis Ltd
  • For upper secondary school and university study
  • For professional and research use
50 black-and-white illustrations
978-1-351-81635-9 (9781351816359)

Mónica Bécue-Bertaut is an elected fellow of the International Statistical Institute and was named Chevalier des Palmes Académiques by the French Government. She taught statistics and data science at the Universitat Politècnica de Catalunya and has given numerous guest lectures on textual data science in different countries. Dr. Bécue-Bertaut has published several books (in French or Spanish) and book chapters (in English) on this topic. She also participated in the design of software for textual data science, such as SPAD.T and Xplortext, the latter being an R package.

1. Encoding: from a corpus to statistical tables

Textual and contextual data

Textual data

Contextual data

Documents and aggregate documents

Examples and notation

Choosing textual units

Graphical forms



Repeated segments

In practice


Unique spellings

Partially-automated preprocessing

Word selection

Word and segment indexes

The Life UK corpus: preliminary results

Verbal content through word and repeated segment indexes

Univariate description of contextual variables

A note on the frequency range

Implementation with the Xplortext package

In summary

2. Correspondence analysis of textual data

Data and goals

Correspondence analysis: a tool for linguistic data analysis

Data: a small example


Associations between documents and words

Profile comparisons

Independence of documents and words

The χ² test

Association rates between columns and words

Active row and column clouds

Row and column profile spaces

Distributional equivalence and the χ² distance

Inertia of a cloud

Fitting document and word clouds

Factorial axes

Visualizing rows and columns

Category representation

Word representation

Transition formulas

Superimposed representation of rows and columns

Interpretation aids

Eigenvalues and representation quality of the clouds

Contribution of documents and words to axis inertia

Representation quality of a point

Supplementary rows and columns

Supplementary tables

Supplementary frequency rows and columns

Supplementary quantitative and qualitative variables

Validating the visualization

Interpretation scheme for textual CA results

Implementation with Xplortext

Summary of the CA approach

3. Applications of correspondence analysis

Choosing the level of detail for analyses

Correspondence analysis on aggregate free text answers

Data and objectives

Word selection

CA on the aggregate table

Document representation

Word representation

Simultaneous interpretation of the plots

Supplementary elements

Supplementary words

Supplementary repeated segments

Supplementary categories

Implementation with Xplortext

Direct analysis

Data and objectives

The main features of direct analysis

Direct analysis of the culture question

Implementation with Xplortext

4. Clustering in textual analysis

Clustering documents

Dissimilarity measures between documents

Measuring partition quality

Document clusters in the factorial space

Partition quality

Dissimilarity measures between document clusters

The single-linkage method

The complete-linkage method

Ward's method

Agglomerative hierarchical clustering

Hierarchical tree construction algorithm

Selecting the final partition

Interpreting clusters

Direct partitioning

Combining clustering methods

Consolidating partitions

Direct partitioning followed by AHC

A procedure for combining CA and clustering

Example: joint use of CA and AHC

Data and objectives

Data preprocessing using CA

Constructing the hierarchical tree

Choosing the final partition

Contiguity-constrained hierarchical clustering

Principles and algorithm

AHC of age groups with a chronological constraint

Implementation with Xplortext

Example: clustering free text answers

Data and objectives

Data preprocessing

CA: eigenvalues and total inertia

Interpreting the first axes

AHC: building the tree and choosing the final partition

Describing cluster features

Lexical features of clusters

Describing clusters in terms of characteristic words

Describing clusters in terms of characteristic documents

Describing clusters using contextual variables

Describing clusters using contextual qualitative variables

Describing clusters using quantitative contextual variables

Implementation with Xplortext

Summary of the use of AHC on factorial coordinates coming from CA

5. Lexical characterization of parts of a corpus

Characteristic words

Characteristic words and CA

Characteristic words and clustering

Clustering based on verbal content

Clustering based on contextual variables

Hierarchical words

Characteristic documents

Example: characteristic elements and CA

Characteristic words for the categories

Characteristic words and factorial planes

Documents that characterize categories

Characteristic words in addition to clustering

Implementation with Xplortext

6. Multiple factor analysis for textual analysis

Multiple tables in textual analysis

Data and objectives

Data preprocessing

Problems posed by lemmatization

Description of the corpora data

Indexes of the most frequent words



Introduction to MFACT

The limits of CA on multiple contingency tables

How MFACT works

Integrating contextual variables

Analysis of multilingual free text answers

MFACT: eigenvalues of the global analysis

Representation of documents and words

Superimposed representation of the global and partial configurations

Links between the axes of the global analysis and the separate analyses

Representation of the groups of words

Implementation with Xplortext

Simultaneous analysis of two open-ended questions: impact of lemmatization


Preliminary steps

MFACT on the left and right: lemmatized or nonlemmatized

Implementation with Xplortext

Other applications of MFACT in textual analysis

MFACT summary

7. Applications and analysis workflows

General rules for presenting results

Analyzing bibliographic databases

Introduction to the lupus data

The corpus

Exploratory analysis of the corpus

CA of the documents × words table

The eigenvalues

Meta-keys and doc-keys

Analysis of the year-aggregate table

Eigenvalues and CA of the lexical table

Chronological study of drug names

Implementation with Xplortext

Conclusions from the study

Badinter's speech: a discursive strategy

Methods

Breaking up the corpus into documents

The speech trajectory unveiled by CA


Argument flow

Conclusions on the study of Badinter's speech

Implementation with Xplortext

Political speeches

Data and objectives



Data preprocessing

Lexicometric characteristics of the speeches and lexical table coding

Eigenvalues and Cramér's V

Speech trajectory

Word representation


Hierarchical structure of the corpus


Implementation with Xplortext

Corpus of sensory descriptions



Eight Catalan wines


Verbal categorization

Encoding the data


Statistical methodology

MFACT and constructing the mean configuration

Determining consensual words


Data preprocessing

Some initial results

Individual configurations

MFACT: directions of inertia common to the majority of groups

MFACT: representing words and documents on the first plane

Word contributions

MFACT: group representation

Consensual words


File format: EPUB
Copy protection: Adobe DRM (Digital Rights Management)


Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see the e-book help).

Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app before downloading (see the e-book help).

E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others (not Kindle)

The EPUB file format is well suited to novels and non-fiction, that is, to "flowing" text without a complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small display. Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will not be able to open the e-book, so you must prepare your reading hardware before downloading.

When using the Adobe Digital Editions reading software, we strongly recommend authorizing it with your personal Adobe ID after installation.


Download (available immediately)

€70.99
incl. 19% VAT
Download / single license