Computing the vocabulary demands of L2 reading.


by Cobb, Tom

Employing a methodology of repeated readings and a computer-based testing apparatus that allowed the tracking of large numbers of words, Horst and Meara were able to trace the ups and downs of word knowledge that normally pass below the radar of conventional tests. What new information emerges from this methodology? For just this one state of the matrix, Heading 3 (column 6) of Figure 1 shows that of the 300 learnable targets, 44 (3 + 6 + 35) have moved into the "I definitely know this word" state from another knowledge state. This is new knowledge that would probably have shown up on a standard test. However, another 56 words (27 + 9 + 20) have made lesser gains (into the "not sure" and "think I know" territory) that would probably not have shown up on a standard test. These ratios change over the course of several readings, as the learning opportunities diminish, but in the first three readings there are often at least as many words moving rightward below the radar as above it, i.e., moving to knowledge-state 1 or 2 rather than 3. A surprising number of words move to the right and then back to the left for a time, presumably reflecting either a learning and forgetting cycle, or a hypothesis testing phase, or elements of both (Horst, 2000). The evidence from the matrix studies broadly shows that Krashen is right: there is more vocabulary learning from reading than most tests measure. It seems uncontroversial to generalize from Horst and Meara's data that the total amount of vocabulary learning from reading might be as much as double what the various studies using more conventional measures have typically shown.
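The arithmetic behind the "as much as double" estimate can be laid out in a few lines. This is only a sketch of the tally for the single matrix state discussed above; the variable names are illustrative, and the figures are those quoted in the text.

```python
# Words moving into the "I definitely know this word" state from another state
# (gains a standard test would probably detect).
to_definitely_know = [3, 6, 35]

# Words making lesser gains into the "not sure" and "think I know" states
# (gains a standard test would probably miss).
lesser_gains = [27, 9, 20]

visible = sum(to_definitely_know)   # 44
hidden = sum(lesser_gains)          # 56
total = visible + hidden            # 100

# Ratio of all rightward movement to the part a conventional test would register.
ratio = total / visible

print(f"visible={visible}, hidden={hidden}, total={total}, ratio={ratio:.2f}")
```

For this one state, total movement is a little over twice the movement a conventional test would capture, which is the basis for the rough doubling suggested above.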

An alternate source of evidence for substantial amounts of hidden vocabulary growth through reading is provided by Waring and Takaki (2003) using a different methodology. These researchers tested twenty-five words acquired from reading with measures at three levels of difficulty--passive recognition (that a word had been seen in the text), aided meaning selection (by a multiple choice measure), and unaided recall (through a translation test)--and found that scores were almost 2.5 times higher for multiple choice than translation, and more than 3 times higher for recognition than for translation. In other words, most of the initial learning represented by remembering that a word had appeared in the text would not have registered on either of the other more difficult tests. This finding thus complements the matrix finding, albeit for a smaller number of items.

But can we get from here to sufficiency? Even if it is clear that more learning takes place through word encounters than most tests measure, is free reading able to provide a sufficient number of such encounters?

Claim B: The Sufficiency of Hidden Learning

Krashen's related claim, the sufficiency of hidden vocabulary growth, can also be tested empirically in an L2 context, as the following very basic experiment in corpus analysis demonstrates. But first we need some definitions.

To arrive at an operational definition of sufficiency, we might ask questions such as: How many words are enough for various purposes, such as to begin academic study in a second language, or to undertake a professional activity? Vocabulary researchers working on questions of coverage calculate the minimum number of word families needed for non-specialist reading of materials designed for native speakers to be between 3000 (Laufer, 1989) and 5000 word families (Hirsch & Nation, 1992)--provided these are high frequency items and not just random pick-ups. How many encounters are needed for word learning to occur? The number varies with a host of individual and contextual factors, but the majority of studies (reviewed in Zahar, Cobb & Spada, 2001) find that an average of six to ten encounters is needed for stable initial word learning to occur. In Horst's (2000) matrix work, six encounters were the minimum exposure for words to travel reliably from state 0 to state 3 and stabilize. Will anything like 3000 word families be met six times apiece through free reading?

Investigation 1

The materials assembled to answer this question were chosen to give the free reading argument optimal chances of succeeding. Thus the vocabulary size assumed to be sufficient for comprehension and learning was set as low as could be deemed plausible, at 3000 word families of written English rather than 5000. In contrast, the amount of reading a typical L2 learner would be likely to achieve was set as high as could be deemed plausible. A sample of the free reading that an ESL reader might be expected to undertake over a year or two of language study was extracted from the 1-million-word Brown corpus (Kucera & Francis, 1979). This classic corpus comprises 500 text samples of roughly 2,000 words grouped into subcorpora of various sizes (different kinds of fiction, etc., as shown in the bottom half of Figure 2). To reflect the kinds of reading learners might do, the original sub-corpora were further grouped into three broad categories (press, academic, and fiction) of roughly similar size (179,000 words, 163,000 words, and 175,000 words, respectively). It is reasonable to suppose that one of these three groupings is a plausible if optimistic representation of the amount of free reading of authentic material that learners might achieve over a year or two of language study. (These word counts are roughly equivalent to 100 pages of newspaper text, six stories the size of Alice in Wonderland, or 17 academic studies the length of this one.)

[FIGURE 2 OMITTED]

High frequency words were extracted from the 100-million-word British National Corpus (Leech, Rayson, & Wilson, 2001) and grouped into families and then into thousand-family lists by Nation (2006, available at http://www.lextutor.ca/vp/bnc/). The first three of Nation's lists (i.e. the 3000 most frequent word families) represent the current best estimate of the basic learner lexicon of English. A random item-from-wordlist generator (available at http://www.lextutor.ca/rand_words/) produced 20 sets of three 10-word samples from the 1000, 2000, and 3000 British National Corpus (BNC) lists. One of these sets was selected randomly for use as sample learning targets in the investigation (3).
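The sampling procedure just described can be sketched as follows. This is a minimal illustration, not the code behind the online generator: the function name `draw_sample_sets` is hypothetical, and the word lists are placeholder strings standing in for Nation's first three BNC lists.

```python
import random

def draw_sample_sets(band_lists, n_sets=20, sample_size=10, seed=None):
    """Draw n_sets sets, each holding one sample of sample_size items per
    frequency band, then pick one set at random as the learning targets."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_sets):
        # rng.sample draws without replacement, so no item repeats in a sample.
        one_set = {band: rng.sample(words, sample_size)
                   for band, words in band_lists.items()}
        sets.append(one_set)
    chosen = rng.choice(sets)
    return sets, chosen

# Placeholder lists (assumption): stand-ins for the 1000, 2000, and 3000 lists.
band_lists = {
    "1000": [f"k1_family_{i}" for i in range(1000)],
    "2000": [f"k2_family_{i}" for i in range(1000)],
    "3000": [f"k3_family_{i}" for i in range(1000)],
}

sets, chosen = draw_sample_sets(band_lists, seed=0)
for band, sample in chosen.items():
    print(band, sample)
```

Fixing the seed makes a given draw reproducible; leaving it as `None` gives a fresh draw on each run.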

A computer program calculated the number of occurrences of each sample word family that a learner would encounter in each of the Brown sub-corpora. The program, called Range (Heatley & Nation, 1994), was adapted for the Internet by the author and is available at http://www.lextutor.ca/range/. Figure 2 shows the distribution of a word, phrase, or family throughout a set of texts. The original version of the program allowed users to specify their own texts; the online version shown in Figure 2 provides a set of standard texts, namely the 15 original sub-corpora or the three larger groupings of the Brown corpus already mentioned. In the present experiment, word families as opposed to individual words were the search units. This was achieved by entering a stem form plus apostrophe for each item as appropriate (abandon' finds abandons, abandoning, abandonment, as shown in Figure 2). Since it cannot be taken for granted that learners will recognize family members as being related (Schmitt & Zimmerman, 2002), incorporating whole families in the analysis is likely to provide a generous estimate of the learning opportunities in the text sample. A similarly generous assumption is that learners have perfect memory for encountered items over extended time and text.
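The stem-plus-apostrophe search can be approximated by counting every token that begins with the stem. The sketch below is not the Range program itself; the function names and the three toy texts are assumptions made for illustration, and the real analysis ran over the Brown groupings.

```python
import re

def count_family(stem, text):
    """Count tokens that begin with the stem, approximating a query such as
    abandon' (the trailing apostrophe acts as a wildcard for word endings)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token.startswith(stem))

def family_distribution(stem, corpora):
    """Return the number of family occurrences in each named text sample."""
    return {name: count_family(stem, text) for name, text in corpora.items()}

# Toy texts (assumption): stand-ins for the press, academic, and fiction groupings.
corpora = {
    "press": "The council abandoned the plan. Critics called the abandonment hasty.",
    "academic": "Subjects who abandon the task early were excluded.",
    "fiction": "She walked on.",
}

print(family_distribution("abandon", corpora))
```

A bare prefix match is itself generous: a short stem can pick up unrelated words that happen to share its opening letters, so short items are better searched as whole words.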

The distribution of the BNC 4000-level word family abandon' throughout the three major divisions of the Brown corpus is requested in Figure 2; the output of the Range search is shown in Figure 3. The point to notice is that while this item appears in all three samples, it appears more than six times in only one of them (press writing).

[FIGURE 3 OMITTED]

The overall and perhaps unexpected finding from this analysis is that after the most frequent 1000 items, family ranks tend to thin quite rapidly, and with them the learning opportunities. Table 1 shows the distributions in the three Brown samples for the ten target word families from each of the three most frequent BNC levels. For each target word family, the total number of occurrences in each sub-corpus is shown; at the bottom of each column, the number of targets appearing more than six times in each subcorpus is shown. As can be seen, all 1000-level word families will be met more than six times in press writing, all except bus more than six times in academic writing, and all except bus and associat' in fiction. However, five 2000-level families (persua', technolog', wire', analy', and sue) will dip below six encounters in one or more areas. And none of the 3000-level families will be encountered six times in all three areas, and half or more are not met six times in any area. (No member of the irritat' family is met in 163,000 words of academic text!) A sideline finding is that fiction writing, once the usual focus of free reading programs for learners in the process of acquiring 1000 and 2000-level vocabulary, does not present the strongest learning opportunities in either of these zones. Fiction does, however, seem to be a reasonable source of 3000-level items, providing six occurrences for five of its 10 words, as compared to four for press and three for academic writing. It is therefore worth looking at the vocabulary growing opportunities of fiction reading more closely.
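The tallies reported at the bottom of each column of Table 1 amount to counting, for each text grouping, the target families that reach the encounter threshold. The sketch below uses invented family labels and counts, not the values in Table 1, and it treats six or more encounters as meeting the threshold, which is an assumption; a stricter "more than six" reading would replace `>=` with `>`.

```python
def targets_meeting_threshold(counts, threshold=6):
    """For each text grouping, count the families met at least `threshold` times."""
    areas = sorted({area for per_area in counts.values() for area in per_area})
    return {
        area: sum(1 for per_area in counts.values()
                  if per_area.get(area, 0) >= threshold)
        for area in areas
    }

def met_in_all_areas(counts, threshold=6):
    """List the families that reach the threshold in every text grouping."""
    return [family for family, per_area in counts.items()
            if all(n >= threshold for n in per_area.values())]

# Hypothetical occurrence counts (assumption), in the shape of Table 1.
counts = {
    "alpha'": {"press": 12, "academic": 7, "fiction": 6},
    "beta'":  {"press": 5,  "academic": 0, "fiction": 9},
    "gamma'": {"press": 6,  "academic": 2, "fiction": 1},
}

print(targets_meeting_threshold(counts))
print(met_in_all_areas(counts))
```

Raising the threshold toward ten encounters, the upper end of the range cited earlier, would shrink these tallies further.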

Investigation 2


COPYRIGHT 2007 University of Hawaii, National Foreign Language Resource Center Reproduced with permission of the copyright holder. Further reproduction or distribution is prohibited without permission.
Copyright 2007, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.
NOTE: All illustrations and photos have been removed from this article.

