corpus name |
size (in words) |
spoken/written |
variety sampled |
tagged? |
availability |
ACE - Australian Corpus of English |
1 million |
both |
Australian |
no |
RCEP
Corpus Laptops
|
more information on the ACE
- Developer:
- Pam Peters, Peter Collins and David Blair at Macquarie University, Sydney
- Sampling period:
- 1986
- Size:
- 1 million words; 500 text samples of approx. 2,000 words
- Contents:
written and spoken language; modelled on LOB and BROWN
- Variety sampled:
- Australian English
- Annotation:
- untagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of ACE
|
BROWN Corpus |
1 million |
written |
American |
yes |
RCEP
Corpus Laptops
|
more information on the BROWN Corpus
- Developer:
- Nelson Francis and Henry Kucera at Brown University, Providence, Rhode Island
- Sampling period:
- early 1960s
- Size:
- 1 million words
- Contents:
written language; 500 text samples of approx. 2,000 words; 15 text categories
- Variety sampled:
- American English
- Annotation:
- untagged version and POS-tagged version
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the BROWN Corpus
|
Buckeye Corpus |
300,000 |
spoken |
American |
yes |
Online |
more information on the Buckeye Corpus
- Developer:
- Ohio State University: Eric Fosler-Lussier, Elizabeth Hume, Keith Johnson, Mark Pitt
- Sampling period:
- 2000
- Size:
- 300,000 words
- Contents:
Interviews of 40 people, each ~ one hour
- Variety sampled:
- American English, "long-term residents of Ohio"
- Annotation:
- phonetic/phonemic transcription, word labels
- Availability:
Online Access through BAEL Licence
- Homepage:
|
CEECS - Corpus of Early English Correspondence Sampler |
450,000 |
written |
British |
no |
RCEP
Corpus Laptops
|
more information on the CEECS
- Developer:
- M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
- Sampling period:
- 1418-1680
- Size:
- 450,000 words
- Contents:
personal letters
- Variety sampled:
- British English
- Annotation:
- no annotation
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the CEECS Corpus
|
COLT - Bergen Corpus of London Teenage Language |
500,000 |
spoken |
British |
yes |
RCEP
Corpus Laptops
|
more information on the COLT
- Developer:
- University of Bergen, Norway
- Sampling period:
- 1993
- Size:
- 500,000 words
- Contents:
transcripts of spoken language of London teenagers (COLT is part of the BNC)
- Variety sampled:
- British English
- Annotation:
- POS tagging
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the COLT Corpus
|
FLOB - Freiburg-LOB Corpus of British English |
1 million |
written |
British |
no |
RCEP
Corpus Laptops
|
more information on the FLOB Corpus
- Developer:
- Christian Mair at the University of Freiburg
- Sampling period:
- 1990s
- Size:
- 1 million words
- Contents:
written language; 500 text samples of approx. 2,000 words; 15 text categories (matches the original LOB corpus)
- Variety sampled:
- British English
- Annotation:
- untagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the FLOB corpus
|
FROWN - Freiburg BROWN Corpus of American English |
1 million |
written |
American |
no |
RCEP
Corpus Laptops
|
more information on the FROWN Corpus
- Developer:
- Christian Mair at the University of Freiburg
- Sampling period:
- 1990s
- Size:
- 1 million words
- Contents:
500 text samples of approx. 2,000 words; 15 text categories (matches the Brown Corpus)
- Variety sampled:
- American English
- Annotation:
- untagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the FROWN corpus
|
Helsinki Corpus of English Texts: Diachronic Part |
1.5 million |
written |
British |
no |
RCEP
Corpus Laptops
|
more information on the Helsinki Corpus of English Texts
- Developer:
- M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
- Sampling period:
- ca. 750 to 1700
- Size:
- 1.5 million words
- Contents:
samples of Old, Middle and Early Modern English texts
- Variety sampled:
- British English
- Annotation:
- untagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the Helsinki Corpus
|
Helsinki Corpus of Older Scots |
830,000 |
written |
Northern British |
no |
RCEP
Corpus Laptops
|
more information on the Helsinki Corpus of Older Scots
- Developer:
- M. Rissanen, O. Ihalainen and M. Kytö at the Department of English, University of Helsinki
- Sampling period:
- 1450-1700
- Size:
- 830,000 words
- Contents:
Old, Middle and Early Modern English texts covering 15 prose genres
- Variety sampled:
- Northern British English
- Annotation:
- untagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Bibliography of the Helsinki Corpus of Older Scots (no specific manual available online)
|
ICAMET - Innsbruck Computer Archive of Machine-readable English Texts |
over 300,000 |
written |
Middle English / Modern English |
partly |
RCEP |
more information on the ICAMET
- Developer:
- University of Innsbruck
- Sampling period:
- Middle English Prose: 1100 - 1500; Middle/Early Modern English Letters: 1386 - 1688; Middle/Modern English Texts: in progress
- Size:
- Middle English Prose: 182,000; Middle/Early Modern English Letters: 110,000; Middle/Modern English texts: in progress
- Contents:
-
Prose, Letters, Varia
- Variety sampled:
- Middle English, Early Modern English, Modern English
- Annotation:
- Middle English Prose: untagged; Middle/Early Modern English Letters: untagged; Middle/Modern English texts: mix of tagged/normalized/translated/manipulated texts
- Availability:
Available for students at the RCEP
- Homepage:
Manual information for ICAMET
|
ICE - International Corpus of English
+ SPICE-Ireland - Systems of Pragmatic annotations for the spoken component of ICE-Ireland
|
1 million |
both |
all |
yes |
RCEP
Corpus Laptops
(partially)
|
more information on the ICE and SPICE-Ireland
- Developer:
- Jeffrey L. Kallen and John M. Kirk
- Sampling period:
- 1990s
- Size:
- 500 texts, each 2,000 words (1 million words)
- Contents:
-
500 texts, spoken and written language (spoken part 60%):
- Spoken (300)
  - Dialogue (180)
    - Private (100)
    - Public (80)
  - Monologue (120)
    - Unscripted (70)
    - Scripted (50)
- Written (200)
  - Non-printed (50)
    - Non-professional writing (20)
    - Correspondence (30)
  - Printed (150)
    - Informational (learned) (40)
    - Informational (popular) (40)
    - Informational (reportage) (20)
    - Instructional (20)
    - Persuasive (10)
    - Creative (20)
(Figures adapted from Kennedy (1998: 55))
SPICE-Ireland
- Variety sampled:
- Aim is to sample all varieties of English
- Annotation:
- Textual markup, word class tagging, syntactic parsing (+ additional tags in some components)
- Availability:
-
Hong Kong, East Africa, India, Philippines, Singapore, Jamaica, USA written, Canada, Ireland, SPICE-Ireland, Great Britain, Nigeria, Sri Lanka, Ghana, New Zealand
RCEP: All subcorpora are available at the RCEP
IAAK corpus computer: Great Britain, East Africa are available for students on the corpus computer in the IAAK library
- Homepage:
-
Homepage of the ICE corpus
|
ICLE - International Corpus of Learner English |
3.7 million |
written |
World Englishes, L2 |
yes |
RCEP
Corpus Laptops
|
more information on the ICLE
- Developer:
- Centre for English Corpus Linguistics (CECL), Université catholique de Louvain; Project director: Prof. Sylviane Granger
- Sampling period:
- 1990 - 2000
- Size:
- 3.7 million words
- Contents:
-
Subcorpora (learners of English): Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Tswana, Turkish
- Variety sampled:
- Learners of English
- Annotation:
- word form/lemma/POS tagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Homepage of the ICLE
|
Kolhapur Corpus |
1 million |
written |
Indian |
no |
RCEP
Corpus Laptops
|
more information on the Kolhapur Corpus
- Developer:
- S. K. Verma at University of Lancaster and Shivaji University, Kolhapur
- Sampling period:
- 1978
- Size:
- 1 million words, 500 text samples of approx. 2,000 words
- Contents:
written language; modelled on BROWN and LOB
- Variety sampled:
- Indian English
- Annotation:
- untagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the Kolhapur Corpus
|
Lampeter Corpus of Early Modern English Tracts |
1.1 million |
written |
British |
yes |
RCEP
Corpus Laptops
|
more information on the Lampeter Corpus
- Developer:
- Josef Schmied, Claudia Claridge and Rainer Siemund at TU Chemnitz
- Sampling period:
- 1640-1740
- Size:
- 1.1 million words
- Contents:
non-literary prose texts of Early Modern English (various genres)
- Variety sampled:
- British English
- Annotation:
- textual markup
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Homepage of the Lampeter Corpus
|
LINDSEI - Louvain International Database of Spoken English Interlanguage |
1 million |
spoken |
Learner Language |
no |
RCEP
Corpus Laptops
|
more information on the LINDSEI
- Developer:
- Gaëtanelle Gilquin, Sylvie De Cock & Sylviane Granger [eds]
- Sampling period:
- 1995 - 2010
- Size:
- 1 million words, c. 50 interviews per subcorpus, each interview ~ 2000 words
- Contents:
-
spoken language; interviews with learners of English. National subcorpora: Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish, Swedish
- Variety sampled:
- Interlanguage
- Annotation:
- untagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
|
LLC - London-Lund Corpus of Spoken English |
500,000 |
spoken |
British |
yes |
RCEP
Corpus Laptops
|
more information on the LLC
- Developer:
- Randolph Quirk and Sidney Greenbaum at University College London; Jan Svartvik at Lund University
- Sampling period:
- 1960s, 1975-81, 1985-88
- Size:
- 500,000 words
- Contents:
spoken language, based on the Survey of English Usage (SEU, 1959, University College London) and on the Survey of Spoken English (SSE, 1975, Lund University)
- Variety sampled:
- British English
- Annotation:
- prosodic and discourse annotation
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the LLC
|
LOB - Lancaster / Oslo-Bergen Corpus |
1 million |
written |
British |
yes |
RCEP
Corpus Laptops
|
more information on the LOB Corpus
- Developer:
- Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo, in collaboration with Knut Hofland, Norwegian Computing Centre for the Humanities, Bergen
- Sampling period:
- 1961
- Size:
- 1 million words
- Contents:
written language; 500 text samples of approx. 2,000 words; 15 text categories; British counterpart of Brown corpus
- Variety sampled:
- British English
- Annotation:
- untagged version and POS-tagged version
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the LOB Corpus
|
Newdigate Newsletter Corpus |
750,000 |
written |
British |
no |
RCEP
Corpus Laptops
|
more information on the Newdigate Newsletter Corpus
- Developer:
- Philip Hines, Jr., Norfolk, Virginia
- Sampling period:
- 1692
- Size:
- 750,000 words
- Contents:
a series of more than 2,000 newsletters in the Newdigate series (most of which are addressed to Sir Richard Newdigate, Warwickshire)
- Variety sampled:
- British English
- Annotation:
- untagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the Newdigate Corpus
|
PoW - Polytechnic of Wales Corpus |
65,000 |
spoken |
British |
yes |
RCEP
Corpus Laptops
|
more information on the PoW Corpus
- Developer:
- The Computational Linguistics Unit at University of Wales College of Cardiff
- Sampling period:
- 1978-1984
- Size:
- 65,000 words
- Contents:
transcripts of spoken child language
- Variety sampled:
- British English
- Annotation:
- POS tagging, syntactic parsing
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the PoW Corpus
|
(SB)CSAE - Santa Barbara Corpus of Spoken American English |
249,000 |
spoken |
American |
yes |
RCEP
Corpus Laptops
|
more information on the (SB)CSAE
- Developer:
- John W. Du Bois, Wallace L. Chafe, Sandra A. Thompson, Charles Meyer, Robert Englebretson
- Sampling period:
- 1990s
- Size:
- 249,000 words
- Contents:
transcripts and audio files of naturally occurring interaction from all over the US (mostly face-to-face conversations)
- Variety sampled:
- American English
- Annotation:
- transcripts are time-stamped, overlap indicated; marked-up version on talkbank.org
- Availability:
-
Available for students at the RCEP and on the corpus computer in the IAAK library (Parts 1-4)
- Homepage:
Homepage of the Santa Barbara Corpus of Spoken American English
|
SEC - Lancaster / IBM Spoken English Corpus |
52,000 |
spoken |
British |
yes |
RCEP
Corpus Laptops
|
more information on the SEC
- Developer:
- University of Lancaster and IBM Scientific Centre
- Sampling period:
- 1984-87
- Size:
- 52,000 words
- Contents:
spoken language; transcripts from radio-broadcasts, recordings made at University of Lancaster
- Variety sampled:
- British English
- Annotation:
- prosodic markup, POS tagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the SEC
|
Wellington Corpus of Written New Zealand English |
1 million |
written |
New Zealand |
no |
RCEP
Corpus Laptops
|
more information on the Wellington Corpus (written)
- Developer:
- Laurie Bauer at Victoria University, Wellington
- Sampling period:
- 1986-90
- Size:
- 1 million words; 500 text samples of approx. 2,000 words
- Contents:
written language; modelled on BROWN and LOB
- Variety sampled:
- New Zealand English
- Annotation:
- untagged
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the Wellington Corpus (written)
|
Wellington Corpus of Spoken New Zealand English |
1 million |
spoken |
New Zealand |
yes |
RCEP
Corpus Laptops
|
more information on the Wellington Corpus (spoken)
- Developer:
- Janet Holmes, Bernadette Vine and Gary Johnson at Victoria University, Wellington
- Sampling period:
- 1988-94
- Size:
- 1 million words; 500 text samples of approx. 2,000 words
- Contents:
spoken language; formal, semi-formal and informal speech
- Variety sampled:
- New Zealand English
- Annotation:
- discourse markup
- Availability:
Available for students at the RCEP and on the corpus computer in the IAAK library
- Homepage:
Manual of the Wellington Corpus (spoken)
|