Corpora available at the RCEP

Corpora are large electronic collections of language data. Many corpora consist not only of written data such as newspaper articles but also of samples of spoken language or even transcripts of conversations, which is one of the reasons why corpora have become increasingly attractive for pragmatic research.

At the RCEP, students have access to the following corpora:

corpus name size (in words) spoken/ written variety sampled tagged? availability
ACE - Australian Corpus of English 1 million both Australian no

RCEP

Corpus Computer IAAK

more information on the ACE

Developer:
Pam Peters, Peter Collins and David Blair at Macquarie University, Sydney
Sampling period:
1986
Size:
1 million words; 500 text samples of approx. 2,000 words
Contents:
written and spoken language; modelled on LOB and BROWN
Variety sampled:
Australian English
Annotation:
untagged
Availability:
available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of ACE

ANC - American National Corpus 100 million (aim) both American yes

RCEP

Online (restricted)

more information on the ANC

Developer:
Randi Reppen, Nancy Ide, Keith Suderman
Sampling period:
from 1990 (ongoing)
Size:
aim 100 million words
Contents:
written and spoken texts (written part 90%), genres comparable to BNC
Variety sampled:
American English
Annotation:
XML tagged
Availability:
available for students at the RCEP; restricted open version online
Homepage:

Homepage of the American National Corpus (ANC)

BROWN Corpus 1 million written American yes

RCEP

Corpus Computer IAAK

more information on the BROWN Corpus

Developer:
Nelson Francis and Henry Kucera at Brown University, Providence, Rhode Island
Sampling period:
early 1960s
Size:
1 million words
Contents:
written language; 500 text samples of approx. 2,000 words; 15 text categories
Variety sampled:
American English
Annotation:
untagged and tagged versions (POS tagging)
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the BROWN Corpus

Buckeye Corpus 300,000 spoken American yes Online

more information on the Buckeye Corpus

Developer:
Ohio State University: Eric Fosler-Lussier, Elizabeth Hume, Keith Johnson, Mark Pitt
Sampling period:
2000
Size:
300,000
Contents:
Interviews of 40 people, each ~ one hour
Variety sampled:
American English, "long-term residents of Ohio"
Annotation:
phonetic/phonemic transcription, word labels
Availability:
Online Access through BAEL Licence
Homepage:

CEECS - Corpus of Early English Correspondence Sampler 450,000 written British no

RCEP

Corpus Computer IAAK

more information on the CEECS

Developer:
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period:
1418-1680
Size:
450,000
Contents:
personal letters
Variety sampled:
British English
Annotation:
no annotation
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the CEECS Corpus

COLT - Bergen Corpus of London Teenage Language 500,000 spoken British yes

RCEP

Corpus Computer IAAK

more information on the COLT

Developer:
University of Bergen, Norway
Sampling period:
1993
Size:
500,000
Contents:
transcripts of spoken language of London teenagers (COLT is part of the BNC)
Variety sampled:
British English
Annotation:
POS tagging
Availability:
available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the Colt Corpus

DCPSE - Diachronic Corpus of Present-Day Spoken English 800,000 spoken British yes

RCEP

Corpus Computer IAAK

more information on the DCPSE

Developer:
University College London; Department of English (Survey of English Usage); Principal Investigator: Professor Bas Aarts; Senior Research Fellow: Sean Wallis
Sampling period:
1958-1992
Size:
800,000
Contents:
more than 400,000 words from ICE-GB (collected in the early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s - early 1980s)
Face-to-face conversations: 154 (494,000 words)
                    Formal: 28 (90,000 words)
                    Informal: 126 (403,000 words)
Telephone conversations: 14 (47,000 words)
Broadcast discussions: 28 (89,000 words)
Broadcast interviews: 14 (43,000 words)
Spontaneous commentary: 32 (95,000 words)
Parliamentary language: 7 (21,000 words)
Legal cross-examination: 3 (9,000 words)
Assorted spontaneous: 7 (21,000 words)
Prepared speech: 21 (63,000 words)
Variety sampled:
British English, Historical/Diachronic
Annotation:
parsed, annotated, sociolinguistic information on texts, speakers and authors
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Homepage of the DCPSE

FLOB - Freiburg-LOB Corpus of British English 1 million written British no

RCEP

Corpus Computer IAAK

more information on the FLOB Corpus

Developer:
Christian Mair at the University of Freiburg
Sampling period:
1990s
Size:
1 million words
Contents:
written language; 500 text samples of approx. 2,000 words; 15 text categories (matches the original LOB corpus)
Variety sampled:
British English
Annotation:
untagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the FLOB corpus

FROWN - Freiburg BROWN Corpus of American English 1 million written American no

RCEP

Corpus Computer IAAK

more information on the FROWN Corpus

Developer:
Christian Mair at the University of Freiburg
Sampling period:
1990s
Size:
1 million words
Contents:
500 text samples of approx. 2,000 words; 15 text categories (matches the BROWN Corpus)
Variety sampled:
American English
Annotation:
untagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the FROWN corpus

Helsinki Corpus of English Texts: Diachronic Part 1.5 million written British no

RCEP

Corpus Computer IAAK

more information on the Helsinki Corpus of English Texts

Developer:
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period:
ca. 750 to 1700
Size:
1.5 million words
Contents:
samples of Old, Middle and Early Modern English texts
Variety sampled:
British English
Annotation:
untagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the Helsinki Corpus

Helsinki Corpus of Older Scots 830,000 written Northern British no

RCEP

Corpus Computer IAAK

more information on the Helsinki Corpus of Older Scots

Developer:
M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
Sampling period:
1450-1700
Size:
830,000 words
Contents:
Old, Middle and Early Modern English texts covering 15 prose genres
Variety sampled:
Northern British English
Annotation:
untagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Bibliography of the Helsinki Corpus of Older Scots (no specific manual available online)

ICAMET - Innsbruck Computer Archive of Machine-readable English Texts over 300,000 written Middle English / Modern English partly

RCEP

more information on the ICAMET

Developer:
University of Innsbruck
Sampling period:
Middle English Prose: 1100 - 1500; Middle/Early Modern English Letters: 1386 - 1688; Middle/Modern English Texts: in progress
Size:
Middle English Prose: 182,000; Middle/Early Modern English Letters: 110,000; Middle/Modern English texts: in progress
Contents:
Prose
Letters
Varia
Variety sampled:
Middle English, Early Modern English, Modern English
Annotation:
Middle English Prose: untagged; Middle/Early Modern English Letters: untagged; Middle/Modern English texts: mix of tagged/normalized/translated/manipulated texts
Availability:
Available for students at the RCEP
Homepage:

Manual information for ICAMET

ICE - International Corpus of English

+ SPICE-Ireland - Systems of Pragmatic annotations for the spoken component of ICE-Ireland

1 million both all yes

RCEP

Corpus Computer IAAK

(partially)

more information on the ICE and SPICE-Ireland

Developer:
Jeffrey L. Kallen and John M. Kirk
Sampling period:
1990s
Size:
500 texts, each 2,000 words (1 million words)
Contents:
500 texts, spoken and written language (spoken part 60%):
Spoken (300)
  • Dialogue (180)
    • Private (100)
    • Public (80)
  • Monologue (120)
    • Unscripted (70)
    • Scripted (50)
Written (200)
  • Non-printed (50)
    • Non-professional writing (20)
    • Correspondence (30)
  • Printed (150)
    • Informational (learned) (40)
    • Informational (popular) (40)
    • Informational (reportage) (20)
    • Instructional (20)
    • Persuasive (10)
    • Creative (20)
(Figures adapted from Kennedy (1998: 55))
SPICE-Ireland
  • provides pragmatic and discourse annotation and a prosodic transcription for 100 of the 300 texts of the spoken component of the ICE-Ireland Corpus.
Variety sampled:
Aim is to sample all varieties of English
Annotation:
Textual markup, word class tagging, syntactic parsing (+ additional tags in some components)
Availability:

Hong Kong, East Africa, India, Philippines, Singapore, Jamaica, USA written, Canada, Ireland, SPICE-Ireland, Great Britain, Nigeria, Sri Lanka, Ghana, New Zealand

RCEP: All subcorpora are available at the RCEP

IAAK corpus computer: Great Britain, East Africa are available for students on the corpus computer in the IAAK library

Homepage:

Homepage of the ICE corpus

ICLE - International Corpus of Learner English 3.7 million written World Englishes, L2 yes

RCEP

Corpus Computer IAAK

more information on the ICLE

Developer:
CECL UCL; Project director: Prof. Sylviane Granger
Sampling period:
1990 - 2000
Size:
3.7 million words
Contents:
Subcorpora (learners of English): 
Bulgarian
Chinese
Czech
Dutch
Finnish
French
German
Italian
Japanese
Norwegian
Polish
Russian
Spanish
Swedish
Tswana
Turkish
Variety sampled:
Learners of English
Annotation:
word form/lemma/POS tagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Homepage of the ICLE

Kolhapur Corpus 1 million written Indian no

RCEP

Corpus Computer IAAK

more information on the Kolhapur Corpus

Developer:
S. K. Verma at University of Lancaster and Shivaji University, Kolhapur
Sampling period:
1978
Size:
1 million words, 500 text samples of approx. 2,000 words
Contents:
written language; modelled on BROWN and LOB
Variety sampled:
Indian English
Annotation:
untagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the Kolhapur Corpus

Lampeter Corpus of Early Modern English Tracts 1.1 million written British yes

RCEP

Corpus Computer IAAK

more information on the Lampeter Corpus

Developer:
Josef Schmied, Claudia Claridge and Rainer Siemund at TU Chemnitz
Sampling period:
1640-1740
Size:
1.1 million words
Contents:
non-literary prose texts of Early Modern English (various genres)
Variety sampled:
British English
Annotation:
textual markup
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Homepage of the Lampeter Corpus

Lancaster Parsed Corpus 140,000 written British yes

RCEP

Corpus Computer IAAK

more information on the Lancaster Parsed Corpus

Developer:
Roger Garside, Geoffrey Leech and Tamas Varadi at the University of Lancaster
Sampling period:
1961
Size:
140,000 words
Contents:
parsed subcorpus of the LOB
Variety sampled:
British English
Annotation:
POS tagging, syntactic parsing
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the Lancaster Corpus (PDF version)

LINDSEI - Louvain International Database of Spoken English Interlanguage 1 million spoken Learner Language no

RCEP

Corpus Computer IAAK

more information on the LINDSEI

Developer:
Gaëtanelle Gilquin, Sylvie De Cock & Sylviane Granger [eds]
Sampling period:
1995 - 2010
Size:
1 million words, c. 50 interviews per subcorpus, each interview ~ 2000 words
Contents:
spoken language, interviews with learners of English
National subcorpus:
Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish, Swedish
Variety sampled:
Interlanguage
Annotation:
untagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

LLC - London-Lund Corpus of Spoken English 500,000 spoken British yes

RCEP

Corpus Computer IAAK

more information on the LLC

Developer:
Randolph Quirk and Sidney Greenbaum at University College London; Jan Svartvik at Lund University
Sampling period:
1960s, 1975-81, 1985-88
Size:
500,000 words
Contents:
spoken language, based on the Survey of English Usage (SEU, 1959, University College London) and on the Survey of Spoken English (SSE, 1975, Lund University)
Variety sampled:
British English
Annotation:
prosodic and discourse annotation
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the LLC

LOB - Lancaster / Oslo-Bergen Corpus 1 million written British yes

RCEP

Corpus Computer IAAK

more information on the LOB Corpus

Developer:
Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo, in collaboration with Knut Hofland, Norwegian Computing Centre for the Humanities, Bergen
Sampling period:
1961
Size:
1 million words
Contents:
written language; 500 text samples of approx. 2,000 words; 15 text categories; British counterpart of Brown corpus
Variety sampled:
British English
Annotation:
untagged and tagged versions (POS tagging)
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the LOB Corpus

Newdigate Newsletter Corpus 750,000 written British no

RCEP

Corpus Computer IAAK

more information on the Newdigate Newsletter Corpus

Developer:
Philip Hines, Jr., Norfolk, Virginia
Sampling period:
1692
Size:
750,000 words
Contents:
a series of more than 2,000 newsletters in the Newdigate series (most of which are addressed to Sir Richard Newdigate, Warwickshire)
Variety sampled:
British English
Annotation:
untagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the Newdigate Corpus

PoW - Polytechnic of Wales Corpus 65,000 spoken British yes

RCEP

Corpus Computer IAAK

more information on the PoW Corpus

Developer:
The Computational Linguistics Unit at University of Wales College of Cardiff
Sampling period:
1978-1984
Size:
65,000 words
Contents:
transcripts of spoken child language
Variety sampled:
British English
Annotation:
POS tagging, syntactic parsing
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the PoW Corpus

(SB)CSAE - Santa Barbara Corpus of Spoken American English 249,000 spoken American yes

RCEP

Corpus Computer IAAK

more information on the (SB)CSAE

Developer:
John W. Du Bois, Wallace L. Chafe, Sandra A. Thompson, Charles Meyer, Robert Englebretson
Sampling period:
1990s
Size:
249,000 words
Contents:
transcripts and audio files of naturally occurring interaction from all over the US (mostly face-to-face conversations)
Variety sampled:
American English
Annotation:
transcripts are time-stamped, overlap indicated; marked-up version on talkbank.org
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library (Parts 1-4)
Homepage:

Homepage of the Santa Barbara Corpus of Spoken American English

SEC - Lancaster / IBM Spoken English Corpus 52,000 spoken British yes

RCEP

Corpus Computer IAAK

more information on the SEC

Developer:
University of Lancaster and IBM Scientific Centre
Sampling period:
1984-87
Size:
52,000 words
Contents:
spoken language; transcripts from radio-broadcasts, recordings made at University of Lancaster
Variety sampled:
British English
Annotation:
prosodic markup, POS tagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the SEC

Wellington Corpus of Written New Zealand English 1 million written New Zealand no

RCEP

Corpus Computer IAAK

more information on the Wellington Corpus (written)

Developer:
Laurie Bauer at Victoria University, Wellington
Sampling period:
1986-90
Size:
1 million words; 500 text samples of approx. 2,000 words
Contents:
written language; modelled on BROWN and LOB
Variety sampled:
New Zealand English
Annotation:
untagged
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the Wellington Corpus (written)

Wellington Corpus of Spoken New Zealand English 1 million spoken New Zealand yes

RCEP

Corpus Computer IAAK

more information on the Wellington Corpus (spoken)

Developer:
Janet Holmes, Bernadette Vine and Gary Johnson at Victoria University, Wellington
Sampling period:
1988-94
Size:
1 million words; 500 text samples of approx. 2,000 words
Contents:
spoken language; formal, semi-formal and informal speech
Variety sampled:
New Zealand English
Annotation:
discourse markup
Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
Homepage:

Manual of the Wellington Corpus (spoken)

 
