
Formalizing Natural Languages

Content
- Cover
- Half-Title Page
- Dedication
- Title Page
- Copyright Page
- Contents
- Acknowledgments
- 1. Introduction: the Project
- 1.1. Characterizing a set of infinite size
- 1.2. Computers and linguistics
- 1.3. Levels of formalization
- 1.4. Not applicable
- 1.4.1. Poetry and plays on words
- 1.4.2. Stylistics and rhetoric
- 1.4.3. Anaphora, coreference resolution, and semantic disambiguation
- 1.4.4. Extralinguistic calculations
- 1.5. NLP applications
- 1.5.1. Automatic translation
- 1.5.2. Part-of-speech (POS) tagging
- 1.5.3. Linguistic rather than stochastic analysis
- 1.6. Linguistic formalisms: NooJ
- 1.7. Conclusion and structure of this book
- 1.8. Exercises
- 1.9. Internet links
- PART 1: Linguistic Units
- 2. Formalizing the Alphabet
- 2.1. Bits and bytes
- 2.2. Digitizing information
- 2.3. Representing natural numbers
- 2.3.1. Decimal notation
- 2.3.2. Binary notation
- 2.3.3. Hexadecimal notation
- 2.4. Encoding characters
- 2.4.1. Standardization of encodings
- 2.4.2. Accented Latin letters, diacritical marks, and ligatures
- 2.4.3. Extended ASCII encodings
- 2.4.4. Unicode
- 2.5. Alphabetical order
- 2.6. Classification of characters
- 2.7. Conclusion
- 2.8. Exercises
- 2.9. Internet links
- 3. Defining Vocabulary
- 3.1. Multiple vocabularies and the evolution of vocabulary
- 3.2. Derivation
- 3.2.1. Derivation applies to vocabulary elements
- 3.2.2. Derivations are unpredictable
- 3.2.3. Atomicity of derived words
- 3.3. Atomic linguistic units (ALUs)
- 3.3.1. Classification of ALUs
- 3.4. Multiword units versus analyzable sequences of simple words
- 3.4.1. Semantics
- 3.4.2. Usage
- 3.4.3. Transformational analysis
- 3.5. Conclusion
- 3.6. Exercises
- 3.7. Internet links
- 4. Electronic Dictionaries
- 4.1. Could editorial dictionaries be reused?
- 4.2. LADL electronic dictionaries
- 4.2.1. Lexicon-Grammar
- 4.2.2. DELA
- 4.3. Dubois and Dubois-Charlier electronic dictionaries
- 4.3.1. The Dictionnaire électronique des mots
- 4.3.2. Les Verbes Français (LVF)
- 4.4. Specifications for the construction of an electronic dictionary
- 4.4.1. One ALU = one lexical entry
- 4.4.2. Importance of derivation
- 4.4.3. Orthographic variation
- 4.4.4. Inflection of simple words, compound words, and expressions
- 4.4.5. Expressions
- 4.4.6. Integration of syntax and semantics
- 4.5. Conclusion
- 4.6. Exercises
- 4.7. Internet links
- PART 2: Languages, Grammars and Machines
- 5. Languages, Grammars, and Machines
- 5.1. Definitions
- 5.1.1. Letters and alphabets
- 5.1.2. Words and languages
- 5.1.3. ALU, vocabularies, phrases, and languages
- 5.1.4. Empty string
- 5.1.5. Free language
- 5.1.6. Grammars
- 5.1.7. Machines
- 5.2. Generative grammars
- 5.3. Chomsky-Schützenberger hierarchy
- 5.3.1. Linguistic formalisms
- 5.4. The NooJ approach
- 5.4.1. A multifaceted approach
- 5.4.2. Unified notation
- 5.4.3. Cascading architecture
- 5.5. Conclusion
- 5.6. Exercises
- 5.7. Internet links
- 6. Regular Grammars
- 6.1. Regular expressions
- 6.1.1. Some examples of regular expressions
- 6.2. Finite-state graphs
- 6.3. Non-deterministic and deterministic graphs
- 6.4. Minimal deterministic graphs
- 6.5. Kleene's theorem
- 6.6. Regular expressions with outputs and finite-state transducers
- 6.7. Extensions of regular grammars
- 6.7.1. Lexical symbols
- 6.7.2. Syntactic symbols
- 6.7.3. Symbols defined by grammars
- 6.7.4. Special operators
- 6.8. Conclusion
- 6.9. Exercises
- 6.10. Internet links
- 7. Context-Free Grammars
- 7.1. Recursion
- 7.1.1. Right recursion
- 7.1.2. Left recursion
- 7.1.3. Middle recursion
- 7.2. Parse trees
- 7.3. Conclusion
- 7.4. Exercises
- 7.5. Internet links
- 8. Context-Sensitive Grammars
- 8.1. The NooJ approach
- 8.1.1. The a^n b^n c^n language
- 8.1.2. The language a^(2^n)
- 8.1.3. Handling reduplications
- 8.1.4. Grammatical agreements
- 8.1.5. Lexical constraints in morphological grammars
- 8.2. NooJ contextual constraints
- 8.3. NooJ variables
- 8.3.1. Variables' scope
- 8.3.2. Computing a variable's value
- 8.3.3. Inheriting a variable's value
- 8.4. Conclusion
- 8.5. Exercises
- 8.6. Internet links
- 9. Unrestricted Grammars
- 9.1. Linguistic adequacy
- 9.2. Conclusion
- 9.3. Exercise
- 9.4. Internet links
- PART 3: Automatic Linguistic Parsing
- 10. Text Annotation Structure
- 10.1. Parsing a text
- 10.2. Annotations
- 10.2.1. Limits of XML/TEI representation
- 10.3. Text annotation structure (TAS)
- 10.4. Exercise
- 10.5. Internet links
- 11. Lexical Analysis
- 11.1. Tokenization
- 11.1.1. Letter recognition
- 11.1.2. Apostrophe/quote
- 11.1.3. Dash/hyphen
- 11.1.4. Dot/period/point ambiguity
- 11.2. Word forms
- 11.2.1. Space and punctuation
- 11.2.2. Numbers
- 11.2.3. Words in upper case
- 11.3. Morphological analyses
- 11.3.1. Inflectional morphology
- 11.3.2. Derivational morphology
- 11.3.3. Lexical morphology
- 11.3.4. Agglutinations
- 11.4. Multiword unit recognition
- 11.5. Recognizing expressions
- 11.5.1. Characteristic constituent
- 11.5.2. Varying the characteristic constituent
- 11.5.3. Varying the light verb
- 11.5.4. Resolving ambiguity
- 11.5.5. Annotating expressions
- 11.6. Conclusion
- 11.7. Exercise
- 12. Syntactic Analysis
- 12.1. Local grammars
- 12.1.1. Named entities
- 12.1.2. Grammatical word sequences
- 12.1.3. Automatically identifying ambiguity
- 12.2. Structural grammars
- 12.2.1. Complex atomic linguistic units
- 12.2.2. Structured annotations
- 12.2.3. Ambiguities
- 12.2.4. Syntax trees vs parse trees
- 12.2.5. Dependency grammar and tree
- 12.2.6. Resolving ambiguity transparently
- 12.3. Conclusion
- 12.4. Exercises
- 12.5. Internet links
- 13. Transformational Analysis
- 13.1. Implementing transformations
- 13.2. Theoretical problems
- 13.2.1. Equivalence of transformation sequences
- 13.2.2. Ambiguities in transformed sentences
- 13.2.3. Theoretical sentences
- 13.2.4. The number of transformations to be implemented
- 13.3. Transformational analysis with NooJ
- 13.3.1. Applying a grammar in "generation" mode
- 13.3.2. The transformation's arguments
- 13.4. Question answering
- 13.5. Semantic analysis
- 13.6. Machine translation
- 13.7. Conclusion
- 13.8. Exercises
- 13.9. Internet links
- Conclusion
- Bibliography
- Index
- Other titles from iSTE in Cognitive Science and Knowledge Management
- EULA
1. Introduction: the Project
The project described in this book is at the very heart of linguistics; its goal is to describe, exhaustively and with absolute precision, all the sentences of a language likely to appear in written texts. This project fulfills two needs: it provides linguists with tools to help them describe languages exhaustively (linguistics), and it aids in the building of software able to automatically process texts written in natural language (natural language processing, or NLP).
A linguistic project needs to have a theoretical and methodological framework (how to describe this or that linguistic phenomenon; how to organize the different levels of description); formal tools (how to write each description); development tools to test and manage each description; and engineering tools to be used in sharing, accumulating, and maintaining large quantities of linguistic resources.
There are many potential applications of descriptive linguistics for NLP: spell-checkers, intelligent search engines, information extractors and annotators, automatic summary producers, automatic translators, etc. These applications have the potential for considerable economic usefulness, and it is therefore important for linguists to make use of these technologies and to be able to contribute to them.
For now, we must reduce the overall linguistic project of describing all phenomena related to the use of language to a much more modest one: here, we will confine ourselves to describing the set of all sentences that may be written or read in natural-language texts. The goal, then, is simply to design a system capable of distinguishing between the two sequences below:
- a) Joe is eating an apple
- b) Joe eating apple is an
Sequence (a) is a grammatical sentence, while sequence (b) is not.
This project constitutes the mandatory foundation for any more ambitious linguistic projects. Indeed it would be fruitless to attempt to formalize text styles (stylistics), the evolution of a language across the centuries (etymology), variations in a language according to social class (sociolinguistics), cognitive phenomena involved in the learning or understanding of a language (psycholinguistics), etc. without a model, even a rudimentary one, capable of characterizing sentences.
If the number of sentences were finite - that is, if there were a maximum number of sentences in a language - we would be able to list them all and arrange them in a database. To check whether an arbitrary sequence of words is a sentence, all we would have to do is consult this database: it is a sentence if it is in the database, and otherwise it is not. Unfortunately, there are an infinite number of sentences in a natural language. To convince ourselves of this, let us resort to a reductio ad absurdum: imagine for a moment that there are n sentences in English.
Based on this finite number n of initial sentences, we can construct a second set of sentences by putting the sequence Lea thinks that, for example, before each of the initial sentences:
Joe is sleeping → Lea thinks that Joe is sleeping
The party is over → Lea thinks that the party is over
Using this simple mechanism, we have just doubled the number of sentences, as shown in the figure below.
Figure 1.1. The number of any set of sentences can be doubled
This mechanism can be generalized by using verbs other than the verb to think; for example:
Lea (believes | claims | dreams | knows | realizes | thinks | ...) that Sentence.
There are several hundred verbs that could be used here. Likewise, we could replace Lea with several thousand human nouns:
(The CEO | The employee | The neighbor | The teacher | ...) thinks that Sentence.
Whatever the size n of an initial set of sentences, we can thus construct n × 100 × 1,000 sentences simply by inserting sequences such as Lea thinks that, Their teacher claimed that, My neighbor declared that, etc. before each of the initial sentences.
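The multiplication above can be checked on a small scale. The following Python sketch is only an illustration: the word lists are tiny stand-ins for the several hundred verbs and several thousand human nouns mentioned in the text, the function name embed is hypothetical, and the initial sentences are stored in the form they take after "that".

```python
from itertools import product

# Illustrative word lists only; a real description would hold hundreds
# of verbs and thousands of human nouns.
SUBJECTS = ["Lea", "The CEO", "The employee", "The neighbor", "The teacher"]
VERBS = ["believes", "claims", "dreams", "knows", "realizes", "thinks"]


def embed(sentences):
    """Put every '<subject> <verb> that' sequence before every sentence."""
    return [
        f"{subject} {verb} that {sentence}"
        for subject, verb, sentence in product(SUBJECTS, VERBS, sentences)
    ]


# Initial sentences, written as they appear after "that".
initial = ["Joe is sleeping", "the party is over"]
expanded = embed(initial)
print(len(expanded))  # 2 sentences x 5 subjects x 6 verbs = 60
```

With n initial sentences, s subjects and v verbs, the result always contains n × s × v sentences, which is the n × 100 × 1,000 estimate in miniature.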
Language has other mechanisms that can be used to expand a set of sentences exponentially. For example, based on n initial sentences, we can construct n × n sentences by combining all of these sentences in pairs and inserting the word and between them. For example:
It is raining + Joe is sleeping → It is raining and Joe is sleeping
This mechanism can also be generalized by using several hundred connectors; for example:
It is raining (but | nevertheless | therefore | where | while | ...) Joe is sleeping.
These two mechanisms (linking of sentences and use of connectors) can be used multiple times in a row, as in the following:
Lea claims that Joe hoped that Ida was sleeping. It was raining while Lea was sleeping, however Ida is now waiting, but the weather should clear up as soon as night falls.
Thus these mechanisms are said to be recursive; the number of sentences that can be constructed with recursive mechanisms is infinite. Therefore it would be impossible to define all of these sentences in extenso. Another way must be found to characterize the set of sentences.
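To see why recursion rules out any finite list, a mechanism can be applied again to its own output. The Python sketch below is a minimal illustration under stated assumptions: the two prefixes and the function names embed_once and grow are hypothetical, and only the embedding mechanism is modeled.

```python
def embed_once(sentences, prefixes):
    """Put each prefix before each sentence, once."""
    return {f"{prefix} {sentence}" for prefix in prefixes for sentence in sentences}


def grow(sentences, prefixes, rounds):
    """Apply embed_once repeatedly to everything built so far.

    Returns the size of the set before the first round and after each round.
    """
    known = set(sentences)
    sizes = [len(known)]
    for _ in range(rounds):
        known |= embed_once(known, prefixes)
        sizes.append(len(known))
    return sizes


initial = ["Joe is sleeping", "the party is over"]
prefixes = ["Lea thinks that", "Ida claims that"]
print(grow(initial, prefixes, 3))  # [2, 6, 14, 30]
```

Each round adds sentences that are one embedding deeper than any built before, so the set keeps growing for as long as the loop runs; no number of rounds exhausts it.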
1.1. Characterizing a set of infinite size
Mathematicians have known for a long time how to define sets of infinite size. For example, the two rules below can be used to define the set of all natural numbers:
(a) Each of the ten elements of set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is a natural number;
(b) any word that can be written as xy is a natural number if and only if its two constituents x and y are natural numbers.
These two rules constitute a formal definition of all natural numbers. They make it possible to distinguish natural numbers from any other object (decimal numbers or others). For example:
- Is the word "123" a natural number? Thanks to rule (a), we know that "1" and "2" are natural numbers. Rule (b) allows us to deduce from this that "12" is a natural number. Thanks to rule (a) we know that "3" is a natural number; since "12" and "3" are natural numbers, rule (b) allows us to deduce that "123" is a natural number.
- The word "2.5" is not a natural number. Rule (a) enables us to deduce that "2" is a natural number, but it does not apply to the decimal point ".", which is therefore not a natural number. Rule (b) can only combine two natural numbers, so it cannot combine "2" with ".": "2." is not a natural number, and therefore "2.5" is not a natural number either.
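Rules (a) and (b) translate almost directly into a recursive test. The Python sketch below is one possible reading of the two rules, not a definitive implementation; the function name is_natural_number is an assumption, and the cache is added only to avoid re-testing the same substrings.

```python
from functools import lru_cache

DIGITS = set("0123456789")  # rule (a): the ten basic elements


@lru_cache(maxsize=None)
def is_natural_number(word: str) -> bool:
    """Decide membership using only rules (a) and (b)."""
    # Rule (a): each of the ten digits is a natural number.
    if word in DIGITS:
        return True
    # Rule (b): word is a natural number if it can be written as xy,
    # where x and y are both natural numbers. Try every possible split.
    return any(
        is_natural_number(word[:i]) and is_natural_number(word[i:])
        for i in range(1, len(word))
    )


print(is_natural_number("123"))  # True
print(is_natural_number("2.5"))  # False
```

Note that, taken literally, the two rules also accept words with leading zeros such as "007", since each digit satisfies rule (a).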
There is an interesting similarity between this definition of the set of natural numbers and the problem of characterizing the sentences in a language:
- Rule (a) describes in extenso the finite set of numerals that must be used to form valid natural numbers. This rule resembles a dictionary in which we would list all the words that make up the vocabulary of a language.
- Rule (b) explains how numerals can be combined to construct an infinite number of natural numbers. This rule is similar to grammatical rules that specify how to combine words in order to construct an infinite number of sentences.
To describe a natural language, then, we will proceed as follows: firstly we will define in extenso the finite number of basic units in a language (its vocabulary); and secondly, we will list the rules used to combine the vocabulary elements in order to construct sentences (its grammar).
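The same two-part scheme can be sketched for a tiny fragment of English. Everything in the Python example below is invented for illustration: the category names, the three rules, and the function names are assumptions, and they are neither the notation of this book nor that of NooJ. The fragment is deliberately crude (it would also accept "Joe is sleeping an apple").

```python
# Vocabulary: a finite list of words, each with an illustrative category.
VOCABULARY = {
    "Joe": "NAME", "Lea": "NAME",
    "is": "AUX",
    "eating": "VERB_ING", "sleeping": "VERB_ING",
    "an": "DET",
    "apple": "NOUN",
    "thinks": "VERB_THAT",
    "that": "THAT",
}

# Grammar: two non-recursive sentence patterns...
BASE_PATTERNS = {
    ("NAME", "AUX", "VERB_ING"),
    ("NAME", "AUX", "VERB_ING", "DET", "NOUN"),
}
# ...and one recursive rule: NAME VERB_THAT THAT followed by a sentence.
EMBEDDING_PREFIX = ("NAME", "VERB_THAT", "THAT")


def matches_grammar(categories: tuple) -> bool:
    """Check a sequence of categories against the toy grammar."""
    if categories[:3] == EMBEDDING_PREFIX:
        return matches_grammar(categories[3:])
    return categories in BASE_PATTERNS


def is_sentence(text: str) -> bool:
    """Look up every word in the vocabulary, then apply the grammar."""
    categories = tuple(VOCABULARY.get(word) for word in text.split())
    if None in categories:  # at least one word is not in the vocabulary
        return False
    return matches_grammar(categories)


print(is_sentence("Joe is eating an apple"))  # True
print(is_sentence("Joe eating apple is an"))  # False
```

Because the third rule refers back to the notion of sentence itself, this finite description already accepts an unbounded number of sequences, such as "Lea thinks that Lea thinks that Joe is sleeping".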
1.2. Computers and linguistics
Computers are a vital tool for this linguistic project, for at least four reasons:
- From a theoretical point of view, a computer is a device that can verify automatically that an element is part of a mathematically-defined set. Our goal is then to construct a device that can automatically verify whether a sequence of words is a valid sentence in a language.
- From a methodological point of view, the computer will impose a framework to describe linguistic objects (words, for example) as well as the rules for use of these objects (such as syntactic rules). The way in which linguistic phenomena are described must be consistent with the system: any inconsistency in a description will inevitably produce an error (or "bug").
- Once linguistic descriptions have been entered into a computer, the computer can apply them to very large texts in order to extract from these texts examples or counterexamples that validate (or not) these descriptions. Thus a computer can be used as a scientific instrument (this is the corpus linguistics approach), as the telescope is in astronomy or the microscope in biology.
- Describing a language requires a great deal of descriptive work; software is used to help with the development of databases containing numerous linguistic objects as well as numerous grammar rules, much like engineers use computer-aided design (CAD) software to design cars, electronic circuits, etc. from libraries of components.
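The third point, applying a description to a text in order to collect examples, can be sketched with a plain regular expression. This Python example is only a rough stand-in for a real linguistic description: the verb list, the pattern, the function name find_examples and the sample text are all assumptions made for illustration.

```python
import re

# Illustrative description: any word, then one of a few verbs, then "that".
VERBS = ["believes", "claims", "dreams", "knows", "realizes", "thinks"]
PATTERN = re.compile(r"\b(\w+) (" + "|".join(VERBS) + r") that\b")


def find_examples(text: str) -> list:
    """Return every sequence in the text that matches the description."""
    return [match.group(0) for match in PATTERN.finditer(text)]


corpus = (
    "Lea thinks that Joe is sleeping. "
    "Ida claims that it is raining. "
    "Joe is eating an apple."
)
print(find_examples(corpus))  # ['Lea thinks that', 'Ida claims that']
```

Sequences that the description should have found but did not, or found but should not have, are the counterexamples that lead the linguist to revise it.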
Finally, the description of certain linguistic phenomena makes it possible to construct NLP software applications. For example, if we have a complete list of the words in a language, we can...