Formalizing Natural Languages

Name: Formalizing Natural Languages | The NooJ Approach
Brand: Wiley-ISTE
Price: 139.99 EUR
Availability: OnlineOnly

The NooJ Approach

Max Silberztein (Author)

Wiley-ISTE (Publisher)

Published on 7 January 2016

346 pages

E-Book

ePUB with Adobe DRM

978-1-119-26414-9 (ISBN)

139.99 € incl. 7% VAT

E-book single license

Available for download

Contents

Cover
Half-Title Page
Dedication
Title Page
Copyright Page
Contents
Acknowledgments
1. Introduction: the Project
1.1. Characterizing a set of infinite size
1.2. Computers and linguistics
1.3. Levels of formalization
1.4. Not applicable
1.4.1. Poetry and plays on words
1.4.2. Stylistics and rhetoric
1.4.3. Anaphora, coreference resolution, and semantic disambiguation
1.4.4. Extralinguistic calculations
1.5. NLP applications
1.5.1. Automatic translation
1.5.2. Part-of-speech (POS) tagging
1.5.3. Linguistic rather than stochastic analysis
1.6. Linguistic formalisms: NooJ
1.7. Conclusion and structure of this book
1.8. Exercises
1.9. Internet links
PART 1: Linguistic Units
2. Formalizing the Alphabet
2.1. Bits and bytes
2.2. Digitizing information
2.3. Representing natural numbers
2.3.1. Decimal notation
2.3.2. Binary notation
2.3.3. Hexadecimal notation
2.4. Encoding characters
2.4.1. Standardization of encodings
2.4.2. Accented Latin letters, diacritical marks, and ligatures
2.4.3. Extended ASCII encodings
2.4.4. Unicode
2.5. Alphabetical order
2.6. Classification of characters
2.7. Conclusion
2.8. Exercises
2.9. Internet links
3. Defining Vocabulary
3.1. Multiple vocabularies and the evolution of vocabulary
3.2. Derivation
3.2.1. Derivation applies to vocabulary elements
3.2.2. Derivations are unpredictable
3.2.3. Atomicity of derived words
3.3. Atomic linguistic units (ALUs)
3.3.1. Classification of ALUs
3.4. Multiword units versus analyzable sequences of simple words
3.4.1. Semantics
3.4.2. Usage
3.4.3. Transformational analysis
3.5. Conclusion
3.6. Exercises
3.7. Internet links
4. Electronic Dictionaries
4.1. Could editorial dictionaries be reused?
4.2. LADL electronic dictionaries
4.2.1. Lexicon-Grammar
4.2.2. DELA
4.3. Dubois and Dubois-Charlier electronic dictionaries
4.3.1. The Dictionnaire électronique des mots
4.3.2. Les Verbes Français (LVF)
4.4. Specifications for the construction of an electronic dictionary
4.4.1. One ALU = one lexical entry
4.4.2. Importance of derivation
4.4.3. Orthographic variation
4.4.4. Inflection of simple words, compound words, and expressions
4.4.5. Expressions
4.4.6. Integration of syntax and semantics
4.5. Conclusion
4.6. Exercises
4.7. Internet links
PART 2: Languages, Grammars and Machines
5. Languages, Grammars, and Machines
5.1. Definitions
5.1.1. Letters and alphabets
5.1.2. Words and languages
5.1.3. ALU, vocabularies, phrases, and languages
5.1.4. Empty string
5.1.5. Free language
5.1.6. Grammars
5.1.7. Machines
5.2. Generative grammars
5.3. Chomsky-Schützenberger hierarchy
5.3.1. Linguistic formalisms
5.4. The NooJ approach
5.4.1. A multifaceted approach
5.4.2. Unified notation
5.4.3. Cascading architecture
5.5. Conclusion
5.6. Exercises
5.7. Internet links
6. Regular Grammars
6.1. Regular expressions
6.1.1. Some examples of regular expressions
6.2. Finite-state graphs
6.3. Non-deterministic and deterministic graphs
6.4. Minimal deterministic graphs
6.5. Kleene's theorem
6.6. Regular expressions with outputs and finite-state transducers
6.7. Extensions of regular grammars
6.7.1. Lexical symbols
6.7.2. Syntactic symbols
6.7.3. Symbols defined by grammars
6.7.4. Special operators
6.8. Conclusion
6.9. Exercises
6.10. Internet links
7. Context-Free Grammars
7.1. Recursion
7.1.1. Right recursion
7.1.2. Left recursion
7.1.3. Middle recursion
7.2. Parse trees
7.3. Conclusion
7.4. Exercises
7.5. Internet links
8. Context-Sensitive Grammars
8.1. The NooJ approach
8.1.1. The a^n b^n c^n language
8.1.2. The language a^(2^n)
8.1.3. Handling reduplications
8.1.4. Grammatical agreements
8.1.5. Lexical constraints in morphological grammars
8.2. NooJ contextual constraints
8.3. NooJ variables
8.3.1. Variables' scope
8.3.2. Computing a variable's value
8.3.3. Inheriting a variable's value
8.4. Conclusion
8.5. Exercises
8.6. Internet links
9. Unrestricted Grammars
9.1. Linguistic adequacy
9.2. Conclusion
9.3. Exercise
9.4. Internet links
PART 3: Automatic Linguistic Parsing
10. Text Annotation Structure
10.1. Parsing a text
10.2. Annotations
10.2.1. Limits of XML/TEI representation
10.3. Text annotation structure (TAS)
10.4. Exercise
10.5. Internet links
11. Lexical Analysis
11.1. Tokenization
11.1.1. Letter recognition
11.1.2. Apostrophe/quote
11.1.3. Dash/hyphen
11.1.4. Dot/period/point ambiguity
11.2. Word forms
11.2.1. Space and punctuation
11.2.2. Numbers
11.2.3. Words in upper case
11.3. Morphological analyses
11.3.1. Inflectional morphology
11.3.2. Derivational morphology
11.3.3. Lexical morphology
11.3.4. Agglutinations
11.4. Multiword unit recognition
11.5. Recognizing expressions
11.5.1. Characteristic constituent
11.5.2. Varying the characteristic constituent
11.5.3. Varying the light verb
11.5.4. Resolving ambiguity
11.5.5. Annotating expressions
11.6. Conclusion
11.7. Exercise
12. Syntactic Analysis
12.1. Local grammars
12.1.1. Named entities
12.1.2. Grammatical word sequences
12.1.3. Automatically identifying ambiguity
12.2. Structural grammars
12.2.1. Complex atomic linguistic units
12.2.2. Structured annotations
12.2.3. Ambiguities
12.2.4. Syntax trees vs parse trees
12.2.5. Dependency grammar and tree
12.2.6. Resolving ambiguity transparently
12.3. Conclusion
12.4. Exercises
12.5. Internet links
13. Transformational Analysis
13.1. Implementing transformations
13.2. Theoretical problems
13.2.1. Equivalence of transformation sequences
13.2.2. Ambiguities in transformed sentences
13.2.3. Theoretical sentences
13.2.4. The number of transformations to be implemented
13.3. Transformational analysis with NooJ
13.3.1. Applying a grammar in "generation" mode
13.3.2. The transformation's arguments
13.4. Question answering
13.5. Semantic analysis
13.6. Machine translation
13.7. Conclusion
13.8. Exercises
13.9. Internet links
Conclusion
Bibliography
Index
Other titles from ISTE in Cognitive Science and Knowledge Management
EULA

1. Introduction: the Project

The project described in this book is at the very heart of linguistics; its goal is to describe, exhaustively and with absolute precision, all the sentences of a language likely to appear in written texts. This project fulfills two needs: it provides linguists with tools to help them describe languages exhaustively (linguistics), and it aids in the building of software able to automatically process texts written in natural language (natural language processing, or NLP).

A linguistic project needs to have a theoretical and methodological framework (how to describe this or that linguistic phenomenon; how to organize the different levels of description); formal tools (how to write each description); development tools to test and manage each description; and engineering tools to be used in sharing, accumulating, and maintaining large quantities of linguistic resources.

There are many potential applications of descriptive linguistics for NLP: spell-checkers, intelligent search engines, information extractors and annotators, automatic summary producers, automatic translators, etc. These applications have the potential for considerable economic usefulness, and it is therefore important for linguists to make use of these technologies and to be able to contribute to them.

For now, we must reduce the overall linguistic project of describing all phenomena related to the use of language to a much more modest one: here, we will confine ourselves to describing the set of all sentences that may be written or read in natural-language texts. The goal, then, is simply to design a system capable of distinguishing between the two sequences below:

a) Joe is eating an apple
b) Joe eating apple is an

Sequence (a) is a grammatical sentence, while sequence (b) is not.

This project constitutes the mandatory foundation for any more ambitious linguistic project. Indeed, it would be fruitless to attempt to formalize text styles (stylistics), the evolution of a language across the centuries (etymology), variations in a language according to social class (sociolinguistics), cognitive phenomena involved in the learning or understanding of a language (psycholinguistics), etc. without a model, even a rudimentary one, capable of characterizing sentences.

If the number of sentences were finite - that is, if there were a maximum number of sentences in a language - we would be able to list them all and arrange them in a database. To check whether an arbitrary sequence of words is a sentence, all we would have to do is consult this database: it is a sentence if it is in the database, and otherwise it is not. Unfortunately, there are an infinite number of sentences in a natural language. To convince ourselves of this, let us resort to a reductio ad absurdum: imagine for a moment that there are n sentences in English.

Based on this finite number n of initial sentences, we can construct a second set of sentences by putting the sequence Lea thinks that, for example, before each of the initial sentences:

Joe is sleeping → Lea thinks that Joe is sleeping

The party is over → Lea thinks that the party is over

Using this simple mechanism, we have just doubled the number of sentences, as shown in the figure below.

Figure 1.1. The number of any set of sentences can be doubled

This mechanism can be generalized by using verbs other than the verb to think; for example:

There are several hundred verbs that could be used here. Likewise, we could replace Lea with several thousand human nouns:

(The CEO | The employee | The neighbor | The teacher | ...) thinks that Sentence.

Whatever the size n of an initial set of sentences, we can thus construct n × 100 × 1,000 sentences simply by inserting sequences such as Lea thinks that, Their teacher claimed that, My neighbor declared that, etc. before each of the initial sentences.
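The multiplication above can be checked on a small scale. The following is only a sketch: the subject and verb lists are hypothetical toy lists standing in for the thousands of human nouns and hundreds of verbs mentioned in the text, and the initial sentences are stored in the form they take after "that" (so "the party is over" is kept in lower case).

```python
from itertools import product

# Hypothetical toy lists; the real lists would hold thousands of human
# nouns and hundreds of verbs, which are not enumerated in the text.
SUBJECTS = ["Lea", "The CEO", "The employee", "The neighbor", "The teacher"]
VERBS = ["thinks", "claims", "declares"]

# Initial sentences, written as they appear after "that".
INITIAL = ["Joe is sleeping", "the party is over"]


def expand(sentences, subjects=SUBJECTS, verbs=VERBS):
    """Insert every 'Subject verb that' sequence before every sentence."""
    return [
        f"{subject} {verb} that {sentence}"
        for subject, verb, sentence in product(subjects, verbs, sentences)
    ]


expanded = expand(INITIAL)
# The size of the result is n x (number of subjects) x (number of verbs).
print(len(INITIAL), "->", len(expanded))
```

With 1,000 subjects and 100 verbs in place of the toy lists, the same function would return n × 100 × 1,000 sentences.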

Language has other mechanisms that can be used to expand a set of sentences exponentially. For example, based on n initial sentences, we can construct n × n sentences by combining all of these sentences in pairs and inserting the word and between them. For example:

It is raining + Joe is sleeping → It is raining and Joe is sleeping

This mechanism can also be generalized by using several hundred connectors; for example:

These two mechanisms (linking of sentences and use of connectors) can be used multiple times in a row, as in the following:

Lea claims that Joe hoped that Ida was sleeping. It was raining while Lea was sleeping, however Ida is now waiting, but the weather should clear up as soon as night falls.

Thus these mechanisms are said to be recursive; the number of sentences that can be constructed with recursive mechanisms is infinite. Therefore it would be impossible to define all of these sentences in extenso. Another way must be found to characterize the set of sentences.
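To make the recursive character of these mechanisms concrete, here is a minimal sketch. The function names and the default prefix and connector are assumptions chosen for illustration; a generator is used because the embedding mechanism never runs out of new sentences.

```python
from itertools import islice


def embeddings(sentence, prefix="Lea thinks that"):
    """Yield the sentence, then ever deeper embeddings of it, without end."""
    current = sentence
    while True:
        yield current
        # The output of one step is the input of the next: this is recursion.
        current = f"{prefix} {current}"


def conjoin(sentences, connector="and"):
    """Combine all sentences in pairs: n inputs give n x n outputs."""
    return [f"{a} {connector} {b}" for a in sentences for b in sentences]


# Only a finite slice of the infinite sequence can ever be printed.
for s in islice(embeddings("Joe is sleeping"), 3):
    print(s)

print(len(conjoin(["it is raining", "Joe is sleeping"])))
```

Because `embeddings` can always produce one more sentence, no finite list could contain everything it generates, which is the point of the reductio above.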

1.1. Characterizing a set of infinite size

Mathematicians have known for a long time how to define sets of infinite size. For example, the two rules below can be used to define the set of all natural numbers:

(a) Each of the ten elements of set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is a natural number;

(b) any word that can be written as xy is a natural number if and only if its two constituents x and y are natural numbers.

These two rules constitute a formal definition of all natural numbers. They make it possible to distinguish natural numbers from any other object (decimal numbers or others). For example:

- Is the word "123" a natural number? Thanks to rule (a), we know that "1" and "2" are natural numbers. Rule (b) allows us to deduce from this that "12" is a natural number. Thanks to rule (a) we know that "3" is a natural number; since "12" and "3" are natural numbers, then rule (b) allows us to deduce that "123" is a natural number.
- The word "2.5" is not a natural number. Rule (a) enables us to deduce that "2" is a natural number, but it does not apply to the decimal point ".". Rule (b) can only apply to two natural numbers, therefore it does not apply to the decimal point because it is not a natural number. In this case, "2." is not a natural number; therefore "2.5" is not a natural number either.
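Rules (a) and (b) translate almost directly into a recursive membership test. The sketch below is one possible rendering, not a method taken from the book; the function name `is_natural` is an assumption, and memoization is added only to keep the repeated splitting cheap.

```python
from functools import lru_cache

DIGITS = set("0123456789")


@lru_cache(maxsize=None)
def is_natural(word: str) -> bool:
    """Return True if `word` is a natural number under rules (a) and (b)."""
    # Rule (a): each of the ten digits is a natural number.
    if word in DIGITS:
        return True
    # Rule (b): a word xy is a natural number if and only if both of its
    # constituents x and y are natural numbers. Try every split point.
    return any(
        is_natural(word[:i]) and is_natural(word[i:])
        for i in range(1, len(word))
    )


print(is_natural("123"))
print(is_natural("2.5"))
```

Note that the two rules, taken literally, also accept words with leading zeros such as "007", and reject the empty word because it is neither a digit nor splittable into two constituents.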

There is an interesting similarity between this definition of the set of natural numbers and the problem of characterizing the sentences in a language:

- Rule (a) describes in extenso the finite set of numerals that must be used to form valid natural numbers. This rule resembles a dictionary in which we would list all the words that make up the vocabulary of a language.
- Rule (b) explains how numerals can be combined to construct an infinite number of natural numbers. This rule is similar to grammatical rules that specify how to combine words in order to construct an infinite number of sentences.

To describe a natural language, then, we will proceed as follows: firstly we will define in extenso the finite number of basic units in a language (its vocabulary); and secondly, we will list the rules used to combine the vocabulary elements in order to construct sentences (its grammar).
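The same two-part recipe can be sketched for a tiny fragment of English. Everything in the block is a hypothetical toy: the category labels, the vocabulary, and the three rules are assumptions made for illustration and are not NooJ notation. The grammar deliberately overgenerates (it would accept "Joe is sleeping an apple"), which shows how much descriptive work a real grammar requires.

```python
# Part 1, the vocabulary: a finite list of word forms, given in extenso.
VOCABULARY = {
    "joe": "NAME", "lea": "NAME",
    "is": "AUX",
    "eating": "VING", "sleeping": "VING",
    "an": "DET", "the": "DET",
    "apple": "NOUN",
    "thinks": "VTHAT", "that": "THAT",
}


def _sentence(cats):
    """Part 2, the grammar: rules that combine vocabulary elements."""
    # S -> NAME VTHAT THAT S   (recursive rule)
    if cats[:3] == ["NAME", "VTHAT", "THAT"]:
        return _sentence(cats[3:])
    # S -> NAME AUX VING
    # S -> NAME AUX VING DET NOUN
    return cats in (
        ["NAME", "AUX", "VING"],
        ["NAME", "AUX", "VING", "DET", "NOUN"],
    )


def is_sentence(text):
    """Look each word up in the vocabulary, then apply the grammar."""
    cats = [VOCABULARY.get(word.lower()) for word in text.split()]
    return _sentence(cats)


print(is_sentence("Joe is eating an apple"))
print(is_sentence("Joe eating apple is an"))
```

Words missing from the vocabulary map to `None`, so any sequence containing an unknown word is rejected.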

1.2. Computers and linguistics

Computers are a vital tool for this linguistic project, for at least four reasons:

- From a theoretical point of view, a computer is a device that can verify automatically that an element is part of a mathematically-defined set. Our goal is then to construct a device that can automatically verify whether a sequence of words is a valid sentence in a language.
- From a methodological point of view, the computer will impose a framework to describe linguistic objects (words, for example) as well as the rules for use of these objects (such as syntactic rules). The way in which linguistic phenomena are described must be consistent with the system: any inconsistency in a description will inevitably produce an error (or "bug").
- When linguistic descriptions have been entered into a computer, a computer can apply them to very large texts in order to extract from these texts examples or counterexamples that validate (or not) these descriptions. Thus a computer can be used as a scientific instrument (this is the corpus linguistics approach), as the telescope is in astronomy or the microscope in biology.
- Describing a language requires a great deal of descriptive work; software is used to help with the development of databases containing numerous linguistic objects as well as numerous grammar rules, much like engineers use computer-aided design (CAD) software to design cars, electronic circuits, etc. from libraries of components.
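The third point, applying a description to a text in order to collect examples, can be illustrated with a small concordance routine. This is a sketch under stated assumptions: the regular expression is a hypothetical stand-in for a linguistic description, and it uses Python's standard `re` module rather than any NooJ facility.

```python
import re

# Hypothetical description: a capitalized word, one of three verbs, "that".
PATTERN = re.compile(r"\b([A-Z][a-z]+) (thinks|claims|hoped) that\b")


def concordance(text, width=20):
    """Return (left context, match, right context) for each occurrence."""
    rows = []
    for m in PATTERN.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


corpus = "Lea claims that Joe hoped that Ida was sleeping."
for left, match, right in concordance(corpus):
    print(f"{left!r} [{match}] {right!r}")
```

Occurrences that the description fails to match, or matches wrongly, are the counterexamples that lead the linguist to revise it.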

Finally, the description of certain linguistic phenomena makes it possible to construct NLP software applications. For example, if we have a complete list of the words in a language, we can...
