Sometimes language lovers sound as if they’re on a safari. They talk about observing words in their natural habitat and studying their behavior in herds.
With the first release of the American National Corpus, an annotated body of over 10 million words, linguists can hunt like never before.
“Up until now, linguists were kind of like Victorian bug hunters,” said Erin McKean, the Chicago-based senior editor of U.S. dictionaries for Oxford University Press and board member of the American National Corpus. “We’d go out with our nets and we’d catch some butterflies and we’d chloroform them and pin them to cards and put them in a drawer.”
“But now, when people are really studying an ecosystem—and English is like an ecosystem—what they do is, they take a representative square area and report everything that’s there: every bug, every plant, every leaf,” she said.
“And now with the corpus, we can do that for English.”
If the dictionary is like the drawer with bugs on cards, the corpus is the jungle. The ANC collects blocks of text from newspapers, books and conversations so words and phrases can be viewed in their natural habitat—that is, in an American English context.
Readers can search the collection by word, phrase, part of speech or type of source and find their quarry used in a sentence or paragraph.
For students learning English as a second language, a corpus—from the Latin word for “body”—can help teach idioms and tendencies in a way dictionaries cannot, as ANC users around the world have already discovered.
“I hear from language teacher trainers in Egypt, Germany, Japan and Sweden who are really excited to have these data available to them, so they can go in and look at aspects of conversation,” said Randi Reppen, English professor at Northern Arizona University and Project Manager for the ANC.
The ANC could also be used by advertising copywriters in search of resonant slogans, or by computer programmers to make automated customer service hotlines sound more natural, McKean said.
The ANC’s initial release last October, available on CD-ROM for $75 at www.americannationalcorpus.org, contains 11.5 million words. About one-fourth of the collection is made up of spoken English, including transcribed phone conversations from volunteers who were given phone cards in exchange for being recorded.
The rest of the corpus is written text contributed by The New York Times, the online magazine Slate, Langenscheidt travel guides and books from Oxford University Press on architecture and Abraham Lincoln.
“We want writers to want to be part of the American National Corpus,” McKean said. “We’re hoping to have an ANC logo that authors can have their publishers put on their books, as a way of saying, ‘My work is influencing the study of the English language.’”
By the end of 2005, the ANC, which last year received a grant from the National Science Foundation, hopes to release 100 million words (90 million written, 10 million spoken), evenly balanced among sources as diverse as town meetings, medical journals and novels.
“It’s hard to take one area and say, ‘This is English,’” Reppen said. “By having different types of writing and speaking situations, the corpus gives a better picture for language researchers, teachers and learners.”
Until now, such seekers of untamed English have relied on other corpora such as the British National Corpus, a collection of 100 million words of British English released 10 years ago. But in the last 10 years, new technology has made formatting samples of text faster and cheaper.
“We’re lucky that we’re doing it today,” McKean said. “This is something that would have been insane to do in the 1950s and was barely possible in the 1980s when the British National Corpus [started].”
Meanwhile, demand for corpora has grown in the field of computational linguistics, which uses computer programs to analyze the structure of language.
“The motivation for the ANC came from the fact that many computational linguists were using the BNC to gather statistics about syntactic patterns, [when in fact] British English and American English are not alike in several ways,” said Nancy Ide, professor of computer science at Vassar College and Technical Director of the ANC.
Another new wrinkle in corpus linguistics is the Internet. The ANC plans to add e-mails, message boards and Web sites to its collection. McKean has already gotten permission from her message board of fellow “Buffy the Vampire Slayer” fans to use their posts for the ANC.