Practical Corpus Linguistics

An Introduction to Corpus-Based Language Analysis

Martin Weisser (Author)

Wiley-Blackwell (Publisher)

Published on 3 December 2015

240 pages

E-Book

ePUB with Adobe-DRM

978-1-118-83190-8 (ISBN)

€48.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

List of Figures

List of Tables

Acknowledgements

1 Introduction

1.1 Linguistic Data Analysis

1.1.1 What's data?

1.1.2 Forms of data

1.1.3 Collecting and analysing data

1.2 Outline of the Book

1.3 Conventions Used in this Book

1.4 A Note for Teachers

1.5 Online Resources

2 What's Out There?

2.1 What's a Corpus?

2.2 Corpus Formats

2.3 Synchronic vs. Diachronic Corpora

2.3.1 'Early' synchronic corpora

2.3.2 Mixed corpora

2.3.3 Examples of diachronic corpora

2.4 General vs. Specific Corpora

2.4.1 Examples of specific corpora

2.5 Static Versus Dynamic Corpora

2.6 Other Sources for Corpora

Solutions to/Comments on the Exercises

Note

Sources and Further Reading

3 Understanding Corpus Design

3.1 Food for Thought - General Issues in Corpus Design

3.1.1 Sampling

3.1.2 Size

3.1.3 Balance and representativeness

3.1.4 Legal issues

3.2 What's in a Text? - Understanding Document Structure

3.2.1 Headers, 'footers' and meta-data

3.2.2 The structure of the (text) body

3.2.3 What's (in) an electronic text? - understanding file formats and their properties

3.3 Understanding Encoding: Character Sets, File Size, etc.

3.3.1 ASCII and legacy encodings

3.3.2 Unicode

3.3.3 File sizes

Solutions to/Comments on the Exercises

Sources and Further Reading

4 Finding and Preparing Your Data

4.1 Finding Suitable Materials for Analysis

4.1.1 Retrieving data from text archives

4.1.2 Obtaining materials from Project Gutenberg

4.1.3 Obtaining materials from the Oxford Text Archive

4.2 Collecting Written Materials Yourself ('Web as Corpus')

4.2.1 A brief note on plain-text editors

4.2.2 Browser text export

4.2.3 Browser HTML export

4.2.4 Getting web data using ICEweb

4.2.5 Downloading other types of files

4.3 Collecting Spoken Data

4.4 Preparing Written Data for Analysis

4.4.1 'Cleaning up' your data

4.4.2 Extracting text from proprietary document formats

4.4.3 Removing unnecessary header and 'footer' information

4.4.4 Documenting what you've collected

4.4.5 Preparing your data for distribution or archiving

Solutions to/Comments on the Exercises

Sources and Further Reading

5 Concordancing

5.1 What's Concordancing?

5.2 Concordancing with AntConc

5.2.1 Sorting results

5.2.2 Saving, pruning and reusing your results

Solutions to/Comments on the Exercises

Sources and Further Reading

6 Regular Expressions

6.1 Character Classes

6.2 Negative Character Classes

6.3 Quantification

6.4 Anchoring, Grouping and Alternation

6.4.1 Anchoring

6.4.2 Grouping and alternation

6.4.3 Quoting and using special characters

6.4.4 Constraining the context further

6.5 Further Exercises

Solutions to/Comments on the Exercises

Sources and Further Reading

7 Understanding Part-of-Speech Tagging and Its Uses

7.1 A Brief Introduction to (Morpho-Syntactic) Tagsets

7.2 Tagging Your Own Data

Solutions to/Comments on the Exercises

Sources and Further Reading

8 Using Online Interfaces to Query Mega Corpora

8.1 Searching the BNC with BNCweb

8.1.1 What is BNCweb?

8.1.2 Basic standard queries

8.1.3 Navigating through and exploring search results

8.1.4 More advanced standard query options

8.1.5 Wildcards

8.1.6 Word and phrase alternation

8.1.7 Restricting searches through PoS tags

8.1.8 Headword and lemma queries

8.2 Exploring COCA through the BYU Web-Interface

8.2.1 The basic syntax

8.2.2 Comparing corpora in the BYU interface

Solutions to/Comments on the Exercises

Sources and Further Reading

9 Basic Frequency Analysis - or What Can (Single) Words Tell Us About Texts?

9.1 Understanding Basic Units in Texts

9.1.1 What's a word?

9.1.2 Types and tokens

9.2 Word (Frequency) Lists in AntConc

9.2.1 Stop words - good or bad?

9.2.2 Defining and using stop words in AntConc

9.3 Word Lists in BNCweb

9.3.1 Standard options

9.3.2 Investigating subcorpora

9.3.3 Keyword lists

9.4 Keyword Lists in AntConc and BNCweb

9.4.1 Keyword lists in AntConc

9.4.2 Keyword lists in BNCweb

9.5 Comparing and Reporting Frequency Counts

9.6 Investigating Genre-Specific Distributions in COCA

Solutions to/Comments on the Exercises

Sources and Further Reading

10 Exploring Words in Context

10.1 Understanding Extended Units of Text

10.2 Text Segmentation

10.3 N-Grams, Word Clusters and Lexical Bundles

10.4 Exploring (Relatively) Fixed Sequences in BNCweb

10.5 Simple, Sequential Collocations and Colligations

10.5.1 'Simple' collocations

10.5.2 Colligations

10.5.3 Contextually constrained and proximity searches

10.6 Exploring Colligations in COCA

10.7 N-grams and Clusters in AntConc

10.8 Investigating Collocations Based on Statistical Measures in AntConc, BNCweb and COCA

10.8.1 Calculating collocations

10.8.2 Computing collocations in AntConc

10.8.3 Computing collocations in BNCweb

10.8.4 Computing collocations in COCA

Solutions to/Comments on the Exercises

Sources and Further Reading

11 Understanding Markup and Annotation

11.1 From SGML to XML - A Brief Timeline

11.2 XML for Linguistics

11.2.1 Why bother?

11.2.2 What does markup/annotation look like?

11.2.3 The 'history' and development of (linguistic) markup

11.2.4 XML and style sheets

11.3 'Simple XML' for Linguistic Annotation

11.4 Colour Coding and Visualisation

11.5 More Complex Forms of Annotation

Solutions to/Comments on the Exercises

Sources and Further Reading

12 Conclusion and Further Perspectives

Appendix A: The CLAWS C5 Tagset

Appendix B: The Annotated Dialogue File

Appendix C: The CSS Style Sheet

Glossary

References

Index

CHAPTER 1
Introduction

This textbook aims to teach you how to analyse and interpret language data in written or orthographically transcribed form (i.e. represented as if it were written, if the original data is spoken). It will do so in a way that should not only provide you with the technical skills for such an analysis for your own research purposes, but also raise your awareness of how corpus evidence can be used in order to develop a better understanding of the forms and functions of language. It will also teach you how to use corpus data in more applied contexts, such as identifying suitable materials/examples for language teaching, investigating socio-linguistic phenomena, or even trying to verify existing linguistic theories, as well as to develop your own hypotheses about the many different aspects of language that can be investigated through corpora. The focus will primarily be on English-language data, although we may occasionally, whenever appropriate, refer to issues that could be relevant to the analysis of other languages. In doing so, we'll try to stay as theory-neutral as possible, so that no matter which 'flavour(s)' of linguistics you may have been exposed to before, you should always be able to understand the background to all the exercises or questions presented here.

The book is aimed at a variety of readers, ranging mainly from linguistics students at senior undergraduate, Masters, or even PhD levels who are still unfamiliar with corpus linguistics, to language teachers or textbook developers who want to create or employ more real-life teaching materials. As many of the techniques we'll be dealing with here also allow us to investigate issues of style in both literary and non-literary text, and much of the data we'll initially use actually consists of fictional works because these are easier to obtain and often don't cause any copyright issues, the book should hopefully also be useful to students of literary stylistics. To some extent, I also hope it may be beneficial to computer scientists working on language processing tasks, who, at least in my experience, often lack some crucial knowledge in understanding the complexities and intricacies of language, and frequently tend to resort to mathematical methods when more linguistic (symbolic) ones would be more appropriate, even if these may make the process of writing 'elegant' and efficient algorithms more difficult.

You may also be asking yourself why you should still be using a textbook at all in this day and age, when there are so many video tutorials available, and most programs offer at least some sort of online help to get you started. Essentially, there are two main reasons for this: a) such sources of information are only designed to provide you with a basic overview, but don't actually teach you, simply demonstrating how things are done. In other words, they may do a relatively good job in showing you one or more ways of doing a few things, but often don't really allow you to use a particular program independently and for more complex tasks than the author of the tutorial/help file may actually have envisaged. And b) online tutorials, such as the ones on YouTube, may not only take a rather long time to (down)load, but might not even be (easily) accessible in some parts of the world at all, due to internet censorship.

If you're completely new to data analysis on the computer and working with - as opposed to simply opening and reading - different file types, some of the concepts and methods we'll discuss here may occasionally make you feel like you're doing computer science instead of working with language. This is, unfortunately, something you'll need to try and get used to, until you begin to understand the intricacies of working with language data on the computer better, and, by doing so, will also develop your understanding of the complexity inherent in language (data) itself. This is by no means an easy task, so working with this book, and thereby trying to develop a more complete understanding of language and how we can best analyse and describe it, be it for linguistic or language teaching purposes, will often require us to do some very careful reading and thinking about the points under discussion, so as to be able to develop and verify our own hypotheses about particular language features. However, doing so is well worth it, as you'll hopefully realise long before reaching the end of the book, as it opens up possibilities for understanding language that go far beyond a simple manual, small-scale, analysis of texts.

In order to achieve the aims of the book, we'll begin by discussing which types of data are already readily available, exploring ways of obtaining our own data, and developing an understanding of the nature of electronic documents and what may make them different from the more traditional types of printed documents we're all familiar with. This understanding will be developed further throughout the book, as we take a look at a number of computer programs that will help us to conduct our analyses at various levels, ranging from words to phrases, and to even larger units of text. At the same time, of course, we cannot ignore the fact that there may be issues in corpus linguistics related to lower levels, such as that of morphology, or even phonology. Having reached the end of the book, you'll hopefully be aware of many of the different issues involved in collecting and analysing a variety of linguistic - as well as literary - data on the computer, which potential problems and pitfalls you may encounter along the way, and ideally also how to deal with them efficiently. Before we start discussing these issues, though, let's take a few minutes to define the notion of (linguistic) data analysis properly.

1.1 Linguistic Data Analysis

1.1.1 What's data?

In general, we can probably see all different types of language manifestation as language data that we may want/need to investigate, but unfortunately, it's not always possible to easily capture all such 'available' material for analysis. This is why, apart from the 'armchair' data available through introspection (cf. Fillmore 1992: 35), we usually either have to collect our materials ourselves or use data that someone else has previously collected and provided in a suitable form, or at least a form that we can adapt to our needs with relative ease. In both of these approaches, there are inherent difficulties and problems to overcome, and therefore it's highly important to be aware of these limitations in preparing one's own research, be it in order to write a simple assignment, a BA dissertation, MA/PhD thesis, research paper, etc.

Before we move on to a more detailed discussion of the different forms of data, it's perhaps also necessary to clarify the term data itself a little more, in order to avoid any misunderstandings. The word itself originally comes from the plural of the Latin word datum, which literally means '(something) given', but can usually be better translated as 'fact'. In our case, the data we'll be discussing throughout this book will therefore represent the 'facts of language' we can observe. And although the term itself, technically speaking, is originally a plural form referring to the individual facts or features of language (and can be used like this), more often than not we tend to use it as a singular mass noun that represents an unspecified amount or body of such facts.

1.1.2 Forms of data

Essentially, linguistic data comes in two general forms, written or spoken. However, there are also intermediate categories, such as texts that are written to be spoken (e.g. lectures, plays, etc.), and which may therefore exhibit features that are in between the two clear-cut variants. The two main media types often require rather radically different ways of 'recording' and analysis, although at least some of the techniques for analysing written language can also be used for analysing transliterated or (orthographically) transcribed speech, as we'll see later when looking at some dialogue data. Beyond this distinction based on medium, there are of course other classification systems that can be applied to data, such as according to genre, register, text type, etc., although these distinctions are not always very clearly formalised and distinguished from one another, so that different scholars may sometimes be using distinct, but frequently also overlapping, terminology to represent similar things. For a more in-depth discussion of this, see Lee (2002).

To illustrate some of the differences between the various forms of language data we might encounter, let's take a look at some examples, taken from the Corpus of English Novels (CEN) and Corpus of Late Modern English Texts, version 3.0 (CLMET3.0; De Smet, 2005), respectively. To get more detailed information on these corpora, you can go to https://perswww.kuleuven.be/~u0044428/, but for our purposes here, it's sufficient for you to know that these are corpora that are mainly of interest to researchers engaged in literary stylistic analyses or historical developments within the English language. However, as previously stated, throughout the book, we'll often resort to literary data to illustrate specific points related to both the mechanics of processing language and as...
