Comparing the Corpus of American Soap Operas,
COCA, and the BNC

[ More detailed version of this page, with frequency counts ]

Previous corpora of informal, spoken English

The new Corpus of American Soap Operas is based on 100 million words in more than 22,000 transcripts of ten American soap operas from 2001 to 2012. As the links on this page show, it represents informal, everyday language very well -- even better than some "spoken" corpora. It is also much larger than earlier corpora of informal spoken English, which contain only 1-5 million words. (More information...)

Phrases: Soap Opera corpus more informal than BNC and COCA Spoken

As the following table shows, many informal phrases and constructions are (much) more common in the Corpus of American Soap Operas than in the spoken portions of COCA and the BNC. Since this corpus is still quite new (it was released in July 2012), we welcome any comments that you might have, especially searches you have done that show the highly informal nature of this corpus of soap operas.

The following table shows the frequency per million words in each of the three corpora. You can click on any of the entries to see the actual examples from the corpus, and then click on RETURN in the upper right-hand corner of the corpus to come back to this page. For COCA and the BNC, look at the SPOKEN column of the chart. For SOAP, look at the ALL column at the left.

query               example                                                         SOAP    BNC*    COCA
. you [vv*] me ?    . You heard me? (=subject ellipsis)                             10.3     0.0     0.2
, ok|okay ?         we're leaving now, OK?                                        1098.7    34.8    38.2
, right ?           you're pretty tired, right?                                    536.0    27.7   140.0
I'm good .          I'm good.                                                       19.1     0.2     1.4
[be] so not [ADJ]   That is so not possible.                                         3.5     0.0     0.2
I told you          I told you to get out of here                                  208.5    38.7    11.5
[do] n't get it     I don't get it -- why do you hate me so much?                   36.7     9.0     8.1
how can you         How can you even say that?                                      58.6    19.5    16.7
I totally           I totally get it now!                                           13.5     3.7     2.0
[screw] [PRON]      I'm not gonna screw it up this time.                            13.3     4.5     1.8
[freak] [PRON] out  Man, that totally freaked us out!                               10.7     0.2     0.9
[creep] [PRON] out  He really creeps me out -- he's so gross!                        2.9     0.0     0.2
my God              My God -- she's horrible!                                       41.3    20.0     3.8
. it 's [ADJ] .     . It's sad. She's totally forgotten him. (=short phrases)      133.9    34.3    31.5

Situational (shows that the soap opera scripts are very oriented to the "here and now")

hand me * [NOUN]    Hand me a towel.                                                 3.3     0.2     0.3
. Get out           . Get out before I call the police!                             18.7     2.7     2.2
Do n't leave        Don't leave! I need you!                                         3.8     0.7     0.4

Soap opera transcripts = low frequencies for formal phenomena (opposite of above)

whom [do] [PRON]    To whom does she really belong?                                  0.3     0.6     0.8
to which            the extent to which Dinah was willing to go                      1.3    21.5    11.8
must [vv*]          you must know that whatever it takes...                         70.4   186.4   104.0
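
To make the "(much) more common" comparison concrete, the ratios between the columns can be computed directly from the table. The following is a minimal Python sketch using a handful of rows copied from the table above. The function name `ratio` and the choice to report an undefined ratio as infinity are assumptions made for illustration; they are not part of the corpus interface.

```python
import math

# Per-million frequencies copied from the table above:
# query -> (SOAP, BNC spoken, COCA spoken)
TABLE = {
    ", ok|okay ?": (1098.7, 34.8, 38.2),
    "I'm good .": (19.1, 0.2, 1.4),
    "[freak] [PRON] out": (10.7, 0.2, 0.9),
    "[creep] [PRON] out": (2.9, 0.0, 0.2),
    "to which": (1.3, 21.5, 11.8),
}


def ratio(soap: float, other: float) -> float:
    """Return how many times more frequent a query is in SOAP than in `other`.

    Assumption: a reference frequency of 0.0 is reported as math.inf,
    because the ratio is undefined in that case.
    """
    if other == 0.0:
        return math.inf
    return soap / other


for query, (soap, bnc, coca) in TABLE.items():
    print(f"{query:20} SOAP/BNC = {ratio(soap, bnc):6.1f}   "
          f"SOAP/COCA = {ratio(soap, coca):6.1f}")
```

Because the table rounds to one decimal place, a value of 0.0 may stand for a small non-zero frequency, so an infinite ratio is best read as "too rare in the reference corpus to measure at this precision". A ratio below 1 (as for "to which") marks a construction that is less common in the soap opera corpus.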


* Note: regarding the comparisons with the BNC, some might argue that it is not fair to compare a recent corpus like the Soap Opera corpus with a 20-year-old corpus like the BNC. But that is the point -- the BNC has become increasingly outdated as a representation of current spoken English. And if you doubt that some of these constructions are now used in the UK -- like I'm good, [freak] [PRON] out, or [creep] [PRON] out -- just Google them (limiting the searches to the UK).

Samples of informal Soap Opera lexis: more common than BNC and COCA Spoken

The following is a sample of words that are at least three times as frequent (per million words) in the soap opera corpus as they are in the spoken part of either the BNC or COCA. As this list indicates, the soap opera corpus has many more words dealing with everyday life and personal relationships than the more formal spoken portions of COCA and the BNC.

Click on any of these words to see them in the corpus. To come back to this page, just click on RETURN in the upper right-hand corner of the page. You can also see a more detailed list, with frequency listings for each word in each corpus.
 

VERBS damn, blackmail, ruin, overhear, kiss, screw, ditch, swear, mess, calm, kid, relax, trust, forgive, excuse, promise, pretend, marry, upset, hate, trick, baby-sit, jerk, fuss, fake, deserve, betray, hurt, busy, love, kidnap, sleep, disappoint, faint, wait, care, wreck, sneak, bust, breathe, worry, regret, listen, bribe, scare, lie, wake, spare, poison, apologize, fool, stop

ADJECTIVES okay, damn, sorry, selfish, ungrateful, jealous, stubborn, fine, pathetic, alone, lucky, hurt, miserable, glad, relieved, handsome, sweet, thrilled, lousy, crazy, pushy, twisted, weird, paranoid, pregnant, merry, sane, ok, precious, romantic, fancy, happy, mad, upset, perfect, dizzy, scared, creepy

NOUNS sweetheart, honey, bitch, hell, wedding, idiot, fault, baby, fool, kiss, love, liar, excuse, pal, secret
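
The selection criterion for these word lists can be sketched as a simple filter. This is only an illustration: the frequencies below are invented, the function name `soap_favored` is hypothetical, and the `require_both` flag reflects the fact that the wording "either the BNC or COCA" does not say whether a word must pass the threshold against both corpora or against just one.

```python
from typing import Dict, List, Tuple

# Hypothetical per-million frequencies: word -> (SOAP, BNC spoken, COCA spoken).
# These numbers are invented for illustration, not taken from the corpora.
FREQS: Dict[str, Tuple[float, float, float]] = {
    "sweetheart": (120.0, 4.0, 6.0),
    "forgive": (90.0, 20.0, 40.0),
    "government": (15.0, 400.0, 600.0),
}


def soap_favored(
    freqs: Dict[str, Tuple[float, float, float]],
    factor: float = 3.0,
    require_both: bool = True,
) -> List[str]:
    """Return words at least `factor` times as frequent in SOAP.

    With require_both=True the word must pass against BNC and COCA;
    with require_both=False passing against one of them is enough.
    """
    selected = []
    for word, (soap, bnc, coca) in freqs.items():
        beats_bnc = soap >= factor * bnc
        beats_coca = soap >= factor * coca
        passed = (beats_bnc and beats_coca) if require_both else (beats_bnc or beats_coca)
        if passed:
            selected.append(word)
    return selected


print(soap_favored(FREQS))                      # strict reading
print(soap_favored(FREQS, require_both=False))  # looser reading
```

Under the strict reading only "sweetheart" is selected from this toy data; under the looser reading "forgive" is included as well, since it passes the threshold against the invented BNC figure but not the COCA one.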


Samples of formal and technical COCA Spoken lexis: more common than Soap Opera corpus

The following is a sample of some words that are at least three times as frequent (per million words) in the spoken part of COCA (95 million words) as in the soap opera corpus (100 million words). As this list indicates, there is still a lot of fairly formal or technical language in COCA Spoken that is not found in the soap opera corpus, which deals more with informal interaction between people.

Click on C to see the word in COCA, or S to see the word in SOAP. To come back to this page, just click on RETURN in the upper right-hand corner of the page. You can also see a more detailed list, with frequency listings for each word in each corpus.
 

VERBS export C S impeach C S govern C S underscore C S ratify C S abolish C S tax C S denounce C S range C S enact C S forecast C S contrast C S outlaw C S unify C S average C S exempt C S estimate C S veto C S vaccinate C S modernize C S re-elect C S narrate C S regulate C S mandate C S surge C S subsidize C S depict C S dominate C S erode C S disperse C S emphasize C S censor C S deploy C S portray C S combat C S

ADJECTIVES unidentified C S economic C S racial C S ethnic C S presidential C S moderate C S terrorist C S liberal C S affirmative C S nuclear C S African-American C S involved C S Jewish C S constitutional C S political C S controversial C S conservative C S widespread C S commercial C S unprecedented C S military C S tribal C S atomic C S civilian C S increasing C S reserve C S homosexual C S comprehensive C S Hispanic C S fundamental C S diverse C S fiscal C S veteran C S cultural C S environmental C S

NOUNS administration C S nation C S troop C S government C S region C S economy C S consumer C S missile C S leader C S critic C S tax C S poll C S voter C S author C S majority C S terrorist C S official C S leadership C S representative C S capital C S individual C S debate C S housing C S reporting C S broadcast C S criticism C S ambassador C S income C S coverage C S juror C S religion C S unemployment C S supporter C S policy C S scientist C S
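
Because the corpora differ in size, raw counts are only comparable after they are converted to a per-million rate. A minimal Python sketch of that conversion follows, assuming the approximate sizes given on this page (100 million words for SOAP, 95 million for COCA Spoken, 10 million for BNC Spoken). The raw counts and the names `CORPUS_SIZE` and `per_million` are hypothetical.

```python
# Approximate corpus sizes in words, as stated on this page.
CORPUS_SIZE = {
    "SOAP": 100_000_000,
    "COCA_SPOKEN": 95_000_000,
    "BNC_SPOKEN": 10_000_000,
}


def per_million(raw_count: int, corpus: str) -> float:
    """Convert a raw token count into occurrences per million words."""
    return raw_count * 1_000_000 / CORPUS_SIZE[corpus]


# Hypothetical raw count: the same 1,500 hits mean very different
# rates depending on the size of the corpus they were found in.
print(per_million(1500, "SOAP"))                    # 15.0
print(per_million(1500, "BNC_SPOKEN"))              # 150.0
print(round(per_million(1500, "COCA_SPOKEN"), 1))   # 15.8
```

The same raw count is ten times as frequent per million words in the 10-million-word BNC Spoken as in the 100-million-word soap opera corpus, which is why the comparisons on this page use normalized figures.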


Samples of formal and technical BNC Spoken lexis: more common than Soap Opera corpus

The following is a sample of some words that are at least three times as frequent (per million words) in the spoken part of the BNC (10 million words) as in the soap opera corpus (100 million words). As this list indicates, there is still a lot of fairly formal or technical language in BNC Spoken that is not found in the soap opera corpus, which deals more with informal interaction between people.

Click on B to see the word in the BNC, or S to see the word in SOAP. To come back to this page, just click on RETURN in the upper right-hand corner of the page. You can also see a more detailed list, with frequency listings for each word in each corpus.
 

VERBS differentiate B S photocopy B S export B S abolish B S allocate B S undertake B S underline B S range B S reply B S vary B S multiply B S tax B S estimate B S arise B S second B S overtake B S summarize B S commence B S highlight B S govern B S sack B S phone B S derive B S illustrate B S class B S increase B S select B S distinguish B S produce B S dominate B S employ B S implement B S reduce B S outline B S emphasize B S

ADJECTIVES rural B S regional B S composite B S economic B S existing B S environmental B S increasing B S involved B S liberal B S individual B S continuous B S voluntary B S strategic B S elderly B S reproductive B S residential B S payable B S industrial B S moderate B S proposed B S adjacent B S statutory B S widespread B S ethnic B S disabled B S operational B S agreed B S comprehensive B S general B S following B S urban B S external B S managing B S structural B S handicapped B S

NOUNS region B S housing B S paragraph B S provision B S wage B S committee B S income B S structure B S government B S policy B S requirement B S plaintiff B S membership B S capital B S unemployment B S estimate B S fraction B S consultation B S trade B S employment B S total B S individual B S objective B S majority B S function B S average B S revenue B S growth B S management B S section B S development B S area B S movement B S guideline B S context B S carbon B S

Mark Davies
Professor, Corpus Linguistics
Brigham Young University
Provo, Utah, USA