1
Introduction: the Project
The project described in this book is at the very heart of linguistics; its goal is to describe, exhaustively and with absolute precision, all the sentences of a language likely to appear in written texts. This project fulfills two needs: it provides linguists with tools to help them describe languages exhaustively (linguistics), and it aids in the building of software able to automatically process texts written in natural language (natural language processing, or NLP).
A linguistic project needs to have a theoretical and methodological framework (how to describe this or that linguistic phenomenon; how to organize the different levels of description); formal tools (how to write each description); development tools to test and manage each description; and engineering tools to be used in sharing, accumulating, and maintaining large quantities of linguistic resources.
There are many potential applications of descriptive linguistics for NLP: spell-checkers, intelligent search engines, information extractors and annotators, automatic summary producers, automatic translators, etc. These applications have the potential for considerable economic usefulness, and it is therefore important for linguists to make use of these technologies and to be able to contribute to them.
For now, we must reduce the overall linguistic project of describing all phenomena related to the use of language to a much more modest one: here, we will confine ourselves to describing the set of all of the sentences that may be written or read in natural-language texts. The goal, then, is simply to design a system capable of distinguishing between the two sequences below:
- a) Joe is eating an apple
- b) Joe eating apple is an
Sequence (a) is a grammatical sentence, while sequence (b) is not.
This project constitutes the mandatory foundation for any more ambitious linguistic project. Indeed, it would be fruitless to attempt to formalize text styles (stylistics), the evolution of a language across the centuries (etymology), variations in a language according to social class (sociolinguistics), cognitive phenomena involved in the learning or understanding of a language (psycholinguistics), etc. without a model, even a rudimentary one, capable of characterizing sentences.
If the number of sentences were finite - that is, if there were a maximum number of sentences in a language - we would be able to list them all and arrange them in a database. To check whether an arbitrary sequence of words is a sentence, all we would have to do is consult this database: it is a sentence if it is in the database, and otherwise it is not. Unfortunately, there is an infinite number of sentences in a natural language. To convince ourselves of this, let us resort to a reductio ad absurdum: imagine for a moment that there are n sentences in English.
Based on this finite number n of initial sentences, we can construct a second set of sentences by putting the sequence Lea thinks that, for example, before each of the initial sentences:
Joe is sleeping → Lea thinks that Joe is sleeping
The party is over → Lea thinks that the party is over
Using this simple mechanism, we have just doubled the number of sentences, as shown in the figure below.
Figure 1.1. The number of any set of sentences can be doubled
This mechanism can be generalized by using verbs other than the verb to think; for example:
Lea (believes | claims | dreams | knows | realizes | thinks | ...) that Sentence.
There are several hundred verbs that could be used here. Likewise, we could replace Lea with several thousand human nouns:
(The CEO | The employee | The neighbor | The teacher | ...) thinks that Sentence.
Whatever the size n of an initial set of sentences, we can thus construct at least n × 100 × 1,000 sentences simply by inserting sequences such as Lea thinks that, Their teacher claimed that, My neighbor declared that, etc. before each of the initial sentences.
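The multiplication above can be checked on a small scale. The following is a minimal sketch in Python, not a tool from this book: the function name embed and the three short lists are hypothetical toy data standing in for the hundreds of verbs and thousands of human nouns mentioned above.

```python
from itertools import product

# Hypothetical toy data: real lists would hold several thousand
# human nouns and several hundred verbs.
initial_sentences = ["Joe is sleeping", "the party is over"]
subjects = ["Lea", "The CEO", "The teacher"]
verbs = ["believes", "claims", "thinks"]

def embed(sentences, subjects, verbs):
    """Insert every 'Subject verb that' sequence before every sentence."""
    return [f"{subject} {verb} that {sentence}"
            for subject, verb, sentence in product(subjects, verbs, sentences)]

new_sentences = embed(initial_sentences, subjects, verbs)
print(len(new_sentences))  # 3 subjects x 3 verbs x 2 sentences = 18
print(new_sentences[0])    # Lea believes that Joe is sleeping
```

With 1,000 subjects and 100 verbs, the same function would return n × 100 × 1,000 sentences for n initial ones.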
Language has other mechanisms that can be used to expand a set of sentences exponentially. For example, based on n initial sentences, we can construct n × n sentences by combining all of these sentences in pairs and inserting the word and between them. For example:
It is raining + Joe is sleeping → It is raining and Joe is sleeping
This mechanism can also be generalized by using several hundred connectors; for example:
It is raining (but | nevertheless | therefore | where | while | ...) Joe is sleeping.
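The pairing mechanism can be sketched in the same way. The code below is only an illustration under simple assumptions: the function name link and the sample data are hypothetical, and every ordered pair of sentences is accepted, including a sentence paired with itself, which is what yields n × n pairs per connector.

```python
from itertools import product

# Hypothetical toy data: three sentences and three connectors.
sentences = ["it is raining", "Joe is sleeping", "Ida is waiting"]
connectors = ["and", "but", "while"]

def link(sentences, connectors):
    """Join every ordered pair of sentences with every connector."""
    return [f"{left} {connector} {right}"
            for left, connector, right in product(sentences, connectors, sentences)]

linked = link(sentences, connectors)
print(len(linked))  # 3 x 3 x 3 = 27, i.e. n x n pairs for each of 3 connectors
print(linked[1])    # it is raining and Joe is sleeping
```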
These two mechanisms (embedding a sentence after a sequence such as Lea thinks that, and linking sentences with connectors) can be used multiple times in a row, as in the following:
Lea claims that Joe hoped that Ida was sleeping. It was raining while Lea was sleeping, however Ida is now waiting, but the weather should clear up as soon as night falls.
Because they can be applied again to the sentences they have just produced, these mechanisms are said to be recursive; the number of sentences that can be constructed with recursive mechanisms is infinite. Therefore, it would be impossible to define all of these sentences in extenso. Another way must be found to characterize the set of sentences.
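To see why recursion rules out any finite list, the embedding mechanism can be fed its own output. The sketch below is illustrative only: the function embed and the two prefixes are hypothetical, and the loop stops after four rounds purely for display, since nothing in the mechanism itself imposes a limit.

```python
def embed(sentences, prefixes):
    """Apply the embedding mechanism once to every sentence."""
    return [f"{prefix} that {sentence}"
            for prefix in prefixes for sentence in sentences]

prefixes = ["Lea claims", "Joe hoped"]  # hypothetical sample prefixes
generation = ["Ida was sleeping"]
sizes = []
for _ in range(4):
    sizes.append(len(generation))
    # Feed the output back in: this reuse is what makes the mechanism recursive.
    generation = embed(generation, prefixes)

print(sizes)            # [1, 2, 4, 8]
print(len(generation))  # 16
```

Each round multiplies the number of sentences by the number of prefixes, so no finite database can ever contain the full set.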
1.1. Characterizing a set of infinite size
Mathematicians have known for a long time how to define sets of infinite size. For example, the two rules below can be used to define the set of all natural numbers:
(a) Each of the ten elements of set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is a natural number;
(b) any word that can be written as xy is a natural number if and only if its two constituents x and y are natural numbers.
These two rules constitute a formal definition of all natural numbers. They make it possible to distinguish natural numbers from any other object (decimal numbers or others). For example:
- Is the word "123" a natural number? Thanks to rule (a), we know that "1" and "2" are natural numbers. Rule (b) allows us to deduce from this that "12" is a natural number. Thanks to rule (a) we know that "3" is a natural number; since "12" and "3" are natural numbers, rule (b) allows us to deduce that "123" is a natural number.
- The word "2.5" is not a natural number. Rule (a) enables us to deduce that "2" is a natural number, but it does not apply to the decimal point ".". Rule (b) can only combine two natural numbers, so it cannot apply to the decimal point, which is not a natural number. Consequently, "2." is not a natural number, and therefore "2.5" is not a natural number either.
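Rules (a) and (b) translate almost word for word into a recursive test. The following Python function is a minimal sketch of that translation; the names DIGITS and is_natural are chosen here for illustration, and the cache is only a performance convenience that does not change the definition.

```python
from functools import lru_cache

DIGITS = set("0123456789")  # rule (a): the ten basic elements

@lru_cache(maxsize=None)
def is_natural(word: str) -> bool:
    """Return True if `word` is a natural number according to rules (a) and (b)."""
    # Rule (a): each of the ten digits is a natural number.
    if word in DIGITS:
        return True
    # Rule (b): a word xy is a natural number if and only if
    # both of its constituents x and y are natural numbers.
    return any(is_natural(word[:i]) and is_natural(word[i:])
               for i in range(1, len(word)))

print(is_natural("123"))  # True
print(is_natural("2.5"))  # False
```

Note that the two rules, taken literally, also accept words with leading zeros such as "007".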
There is an interesting similarity between this definition of a set and the problem of characterizing the sentences in a language:
- Rule (a) describes in extenso the finite set of numerals that must be used to form valid natural numbers. This rule resembles a dictionary in which we would list all the words that make up the vocabulary of a language.
- Rule (b) explains how numerals can be combined to construct an infinite number of natural numbers. This rule is similar to grammatical rules that specify how to combine words in order to construct an infinite number of sentences.
To describe a natural language, then, we will proceed as follows: firstly we will define in extenso the finite number of basic units in a language (its vocabulary); and secondly, we will list the rules used to combine the vocabulary elements in order to construct sentences (its grammar).
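The two-step plan can be illustrated with a deliberately tiny example. The sketch below is not the formalism developed in this book: the vocabulary, the category labels, and the two rules are all hypothetical and cover only a handful of sentences. It does show, however, how a finite dictionary combined with one recursive rule separates sequence (a) from sequence (b) and accepts an unbounded number of sentences.

```python
# Step 1: the vocabulary, listed in extenso (hypothetical toy dictionary).
VOCABULARY = {
    "Joe": "NAME", "Lea": "NAME",
    "is": "AUX",
    "eating": "VERB", "buying": "VERB",
    "an": "DET", "the": "DET",
    "apple": "NOUN", "orange": "NOUN",
    "thinks": "VTHAT", "claims": "VTHAT",
    "that": "THAT",
}

# Step 2: the grammar (two hypothetical rules).
SIMPLE_CLAUSE = ("NAME", "AUX", "VERB", "DET", "NOUN")  # Joe is eating an apple
EMBEDDING_PREFIX = ("NAME", "VTHAT", "THAT")            # Lea thinks that Sentence

def matches(categories: tuple) -> bool:
    """Check a sequence of word categories against the two rules."""
    if categories == SIMPLE_CLAUSE:
        return True
    # Recursive rule: an embedding prefix followed by a sentence is a sentence.
    if len(categories) > 3 and categories[:3] == EMBEDDING_PREFIX:
        return matches(categories[3:])
    return False

def is_sentence(text: str) -> bool:
    """Look up each word in the vocabulary, then apply the grammar."""
    # Words missing from the vocabulary get None and can never match a rule.
    return matches(tuple(VOCABULARY.get(word) for word in text.split()))

print(is_sentence("Joe is eating an apple"))                  # True
print(is_sentence("Joe eating apple is an"))                  # False
print(is_sentence("Lea thinks that Joe is eating an apple"))  # True
```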
1.2. Computers and linguistics
Computers are a vital tool for this linguistic project, for at least four reasons:
- From a theoretical point of view, a computer is a device that can verify automatically that an element is part of a mathematically-defined set. Our goal is then to construct a device that can automatically verify whether a sequence of words is a valid sentence in a language.
- From a methodological point of view, the computer will impose a framework to describe linguistic objects (words, for example) as well as the rules for use of these objects (such as syntactic rules). The way in which linguistic phenomena are described must be consistent with the system: any inconsistency in a description will inevitably produce an error (or "bug").
- When linguistic descriptions have been entered into a computer, the computer can apply them to very large texts in order to extract from these texts examples or counterexamples that validate (or not) these descriptions. Thus a computer can be used as a scientific instrument (this is the corpus linguistics approach), as the telescope is in astronomy or the microscope in biology.
- Describing a language requires a great deal of descriptive work; software is used to help with the development of databases containing numerous linguistic objects as well as numerous grammar rules, much as engineers use computer-aided design (CAD) software to design cars, electronic circuits, etc. from libraries of components.
Finally, the description of certain linguistic phenomena makes it possible to construct NLP software applications. For example, if we have a complete list of the words in a language, we can...