
Practical Corpus Linguistics
Description
Reviews / Votes
"This textbook makes Practical Corpus Linguistics accessible to everyone. The focus on methodological and technical aspects and the instructive dimension of the book - nothing is considered obvious or already known - make it very useful to any corpus linguist aiming at a better understanding of his/her data...Through the various exercises, it is very easy to test one's comprehension and the reader gradually gains confidence. The educational, sometimes entertaining tone as well as the glossary also contribute to gradually enhance the reader's learning capacities in a field in which many feel insecure...It should accompany scholars at the beginning of any research to raise awareness about technical issues that are too often overlooked..." - Robert A. Cote for The LINGUIST List, December 2016
Content
List of Tables
Acknowledgements
1 Introduction
1.1 Linguistic Data Analysis
1.1.1 What's data?
1.1.2 Forms of data
1.1.3 Collecting and analysing data
1.2 Outline of the Book
1.3 Conventions Used in this Book
1.4 A Note for Teachers
1.5 Online Resources
2 What's Out There?
2.1 What's a Corpus?
2.2 Corpus Formats
2.3 Synchronic vs. Diachronic Corpora
2.3.1 'Early' synchronic corpora
2.3.2 Mixed corpora
2.3.3 Examples of diachronic corpora
2.4 General vs. Specific Corpora
2.4.1 Examples of specific corpora
2.5 Static Versus Dynamic Corpora
2.6 Other Sources for Corpora
Solutions to/Comments on the Exercises
Note
Sources and Further Reading
3 Understanding Corpus Design
3.1 Food for Thought - General Issues in Corpus Design
3.1.1 Sampling
3.1.2 Size
3.1.3 Balance and representativeness
3.1.4 Legal issues
3.2 What's in a Text? - Understanding Document Structure
3.2.1 Headers, 'footers' and meta-data
3.2.2 The structure of the (text) body
3.2.3 What's (in) an electronic text? - understanding file formats and their properties
3.3 Understanding Encoding: Character Sets, File Size, etc.
3.3.1 ASCII and legacy encodings
3.3.2 Unicode
3.3.3 File sizes
Solutions to/Comments on the Exercises
Sources and Further Reading
4 Finding and Preparing Your Data
4.1 Finding Suitable Materials for Analysis
4.1.1 Retrieving data from text archives
4.1.2 Obtaining materials from Project Gutenberg
4.1.3 Obtaining materials from the Oxford Text Archive
4.2 Collecting Written Materials Yourself ('Web as Corpus')
4.2.1 A brief note on plain-text editors
4.2.2 Browser text export
4.2.3 Browser HTML export
4.2.4 Getting web data using ICEweb
4.2.5 Downloading other types of files
4.3 Collecting Spoken Data
4.4 Preparing Written Data for Analysis
4.4.1 'Cleaning up' your data
4.4.2 Extracting text from proprietary document formats
4.4.3 Removing unnecessary header and 'footer' information
4.4.4 Documenting what you've collected
4.4.5 Preparing your data for distribution or archiving
Solutions to/Comments on the Exercises
Sources and Further Reading
5 Concordancing
5.1 What's Concordancing?
5.2 Concordancing with AntConc
5.2.1 Sorting results
5.2.2 Saving, pruning and reusing your results
Solutions to/Comments on the Exercises
Sources and Further Reading
6 Regular Expressions
6.1 Character Classes
6.2 Negative Character Classes
6.3 Quantification
6.4 Anchoring, Grouping and Alternation
6.4.1 Anchoring
6.4.2 Grouping and alternation
6.4.3 Quoting and using special characters
6.4.4 Constraining the context further
6.5 Further Exercises
Solutions to/Comments on the Exercises
Sources and Further Reading
7 Understanding Part-of-Speech Tagging and Its Uses
7.1 A Brief Introduction to (Morpho-Syntactic) Tagsets
7.2 Tagging Your Own Data
Solutions to/Comments on the Exercises
Sources and Further Reading
8 Using Online Interfaces to Query Mega Corpora
8.1 Searching the BNC with BNCweb
8.1.1 What is BNCweb?
8.1.2 Basic standard queries
8.1.3 Navigating through and exploring search results
8.1.4 More advanced standard query options
8.1.5 Wildcards
8.1.6 Word and phrase alternation
8.1.7 Restricting searches through PoS tags
8.1.8 Headword and lemma queries
8.2 Exploring COCA through the BYU Web-Interface
8.2.1 The basic syntax
8.2.2 Comparing corpora in the BYU interface
Solutions to/Comments on the Exercises
Sources and Further Reading
9 Basic Frequency Analysis - or What Can (Single) Words Tell Us About Texts?
9.1 Understanding Basic Units in Texts
9.1.1 What's a word?
9.1.2 Types and tokens
9.2 Word (Frequency) Lists in AntConc
9.2.1 Stop words - good or bad?
9.2.2 Defining and using stop words in AntConc
9.3 Word Lists in BNCweb
9.3.1 Standard options
9.3.2 Investigating subcorpora
9.3.3 Keyword lists
9.4 Keyword Lists in AntConc and BNCweb
9.4.1 Keyword lists in AntConc
9.4.2 Keyword lists in BNCweb
9.5 Comparing and Reporting Frequency Counts
9.6 Investigating Genre-Specific Distributions in COCA
Solutions to/Comments on the Exercises
Sources and Further Reading
10 Exploring Words in Context
10.1 Understanding Extended Units of Text
10.2 Text Segmentation
10.3 N-Grams, Word Clusters and Lexical Bundles
10.4 Exploring (Relatively) Fixed Sequences in BNCweb
10.5 Simple, Sequential Collocations and Colligations
10.5.1 'Simple' collocations
10.5.2 Colligations
10.5.3 Contextually constrained and proximity searches
10.6 Exploring Colligations in COCA
10.7 N-grams and Clusters in AntConc
10.8 Investigating Collocations Based on Statistical Measures in AntConc, BNCweb and COCA
10.8.1 Calculating collocations
10.8.2 Computing collocations in AntConc
10.8.3 Computing collocations in BNCweb
10.8.4 Computing collocations in COCA
Solutions to/Comments on the Exercises
Sources and Further Reading
11 Understanding Markup and Annotation
11.1 From SGML to XML - A Brief Timeline
11.2 XML for Linguistics
11.2.1 Why bother?
11.2.2 What does markup/annotation look like?
11.2.3 The 'history' and development of (linguistic) markup
11.2.4 XML and style sheets
11.3 'Simple XML' for Linguistic Annotation
11.4 Colour Coding and Visualisation
11.5 More Complex Forms of Annotation
Solutions to/Comments on the Exercises
Sources and Further Reading
12 Conclusion and Further Perspectives
Appendix A: The CLAWS C5 Tagset
Appendix B: The Annotated Dialogue File
Appendix C: The CSS Style Sheet
Glossary
References
Index
CHAPTER 1
Introduction
This textbook aims to teach you how to analyse and interpret language data in written or orthographically transcribed form (i.e. represented as if it were written, if the original data is spoken). It will do so in a way that should not only provide you with the technical skills for such an analysis for your own research purposes, but also raise your awareness of how corpus evidence can be used in order to develop a better understanding of the forms and functions of language. It will also teach you how to use corpus data in more applied contexts, such as identifying suitable materials/examples for language teaching, investigating sociolinguistic phenomena, or even trying to verify existing linguistic theories, as well as to develop your own hypotheses about the many different aspects of language that can be investigated through corpora. The focus will primarily be on English-language data, although we may occasionally, whenever appropriate, refer to issues that could be relevant to the analysis of other languages. In doing so, we'll try to stay as theory-neutral as possible, so that no matter which 'flavour(s)' of linguistics you may have been exposed to before, you should always be able to understand the background to all the exercises or questions presented here.
The book is aimed at a variety of readers, ranging mainly from linguistics students at senior undergraduate, Masters, or even PhD levels who are still unfamiliar with corpus linguistics, to language teachers or textbook developers who want to create or employ more real-life teaching materials. As many of the techniques we'll be dealing with here also allow us to investigate issues of style in both literary and non-literary text, and much of the data we'll initially use actually consists of fictional works because these are easier to obtain and often don't cause any copyright issues, the book should hopefully also be useful to students of literary stylistics. To some extent, I also hope it may be beneficial to computer scientists working on language processing tasks, who, at least in my experience, often lack some crucial knowledge in understanding the complexities and intricacies of language, and frequently tend to resort to mathematical methods when more linguistic (symbolic) ones would be more appropriate, even if these may make the process of writing 'elegant' and efficient algorithms more difficult.
You may also be asking yourself why you should still be using a textbook at all in this day and age, when there are so many video tutorials available, and most programs offer at least some sort of online help to get you started. Essentially, there are two main reasons for this: a) such sources of information are only designed to provide you with a basic overview, but don't actually teach you, simply demonstrating how things are done. In other words, they may do a relatively good job in showing you one or more ways of doing a few things, but often don't really allow you to use a particular program independently and for more complex tasks than the author of the tutorial/help file may actually have envisaged. And b) online tutorials, such as the ones on YouTube, may not only take a rather long time to (down)load, but might not even be (easily) accessible in some parts of the world at all, due to internet censorship.
If you're completely new to data analysis on the computer and working with - as opposed to simply opening and reading - different file types, some of the concepts and methods we'll discuss here may occasionally make you feel like you're doing computer science instead of working with language. This is, unfortunately, something you'll need to try and get used to, until you begin to understand the intricacies of working with language data on the computer better, and, by doing so, will also develop your understanding of the complexity inherent in language (data) itself. This is by no means an easy task, so working with this book, and thereby trying to develop a more complete understanding of language and how we can best analyse and describe it, be it for linguistic or language teaching purposes, will often require us to do some very careful reading and thinking about the points under discussion, so as to be able to develop and verify our own hypotheses about particular language features. However, doing so is well worth it, as you'll hopefully realise long before reaching the end of the book, as it opens up possibilities for understanding language that go far beyond a simple manual, small-scale, analysis of texts.
In order to achieve the aims of the book, we'll begin by discussing which types of data are already readily available, exploring ways of obtaining our own data, and developing an understanding of the nature of electronic documents and what may make them different from the more traditional types of printed documents we're all familiar with. This understanding will be developed further throughout the book, as we take a look at a number of computer programs that will help us to conduct our analyses at various levels, ranging from words to phrases, and to even larger units of text. At the same time, of course, we cannot ignore the fact that there may be issues in corpus linguistics related to lower levels, such as that of morphology, or even phonology. Having reached the end of the book, you'll hopefully be aware of many of the different issues involved in collecting and analysing a variety of linguistic - as well as literary - data on the computer, which potential problems and pitfalls you may encounter along the way, and ideally also how to deal with them efficiently. Before we start discussing these issues, though, let's take a few minutes to define the notion of (linguistic) data analysis properly.
1.1 Linguistic Data Analysis
1.1.1 What's data?
In general, we can probably see all different types of language manifestation as language data that we may want/need to investigate, but unfortunately, it's not always possible to easily capture all such 'available' material for analysis. This is why, apart from the 'armchair' data available through introspection (cf. Fillmore 1992: 35), we usually either have to collect our materials ourselves or use data that someone else has previously collected and provided in a suitable form, or at least a form that we can adapt to our needs with relative ease. In both of these approaches, there are inherent difficulties and problems to overcome, and therefore it's highly important to be aware of these limitations in preparing one's own research, be it in order to write a simple assignment, a BA dissertation, MA/PhD thesis, research paper, etc.
Before we move on to a more detailed discussion of the different forms of data, it's perhaps also necessary to clarify the term data itself a little more, in order to avoid any misunderstandings. The word itself originally comes from the plural of the Latin word datum, which literally means '(something) given', but can usually be better translated as 'fact'. In our case, the data we'll be discussing throughout this book will therefore represent the 'facts of language' we can observe. And although the term itself, technically speaking, is originally a plural form referring to the individual facts or features of language (and can be used like this), more often than not we tend to use it as a singular mass noun that represents an unspecified amount or body of such facts.
1.1.2 Forms of data
Essentially, linguistic data comes in two general forms, written or spoken. However, there are also intermediate categories, such as texts that are written to be spoken (e.g. lectures, plays, etc.), and which may therefore exhibit features that are in between the two clear-cut variants. The two main media types often require rather radically different ways of 'recording' and analysis, although at least some of the techniques for analysing written language can also be used for analysing transliterated or (orthographically) transcribed speech, as we'll see later when looking at some dialogue data. Beyond this distinction based on medium, there are of course other classification systems that can be applied to data, such as according to genre, register, text type, etc., although these distinctions are not always very clearly formalised and distinguished from one another, so that different scholars may sometimes be using distinct, but frequently also overlapping, terminology to represent similar things. For a more in-depth discussion of this, see Lee (2002).
To illustrate some of the differences between the various forms of language data we might encounter, let's take a look at some examples, taken from the Corpus of English Novels (CEN) and Corpus of Late Modern English Texts, version 3.0 (CLMET3.0; De Smet, 2005), respectively. To get more detailed information on these corpora, you can go to https://perswww.kuleuven.be/~u0044428/, but for our purposes here, it's sufficient for you to know that these are corpora that are mainly of interest to researchers engaged in literary stylistic analyses or historical developments within the English language. However, as previously stated, throughout the book, we'll often resort to literary data to illustrate specific points related to both the mechanics of processing language and as...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The ePUB file format works well for novels and non-fiction books, i.e. "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.