Computing the vocabulary demands of L2 reading.


by Cobb, Tom

Linguistic computing can make two important contributions to second language (L2) reading instruction. One is to resolve longstanding research issues that are based on an insufficiency of data for the researcher, and the other is to resolve related pedagogical problems based on insufficiency of input for the learner. The research section of the paper addresses the question of whether reading alone can give learners enough vocabulary to read. When the computer's ability to process large amounts of both learner and linguistic data is applied to this question, it becomes clear that, for the vast majority of L2 learners, free or wide reading alone is not a sufficient source of vocabulary knowledge for reading. But computer processing also points to solutions to this problem. Through its ability to reorganize and link documents, the networked computer can increase the supply of vocabulary input that is available to the learner. The development section of the paper elaborates a principled role for computing in L2 reading pedagogy, with examples, in two broad areas, computer-based text design and computational enrichment of undesigned texts.

INTRODUCTION

There is a lexical paradox at the heart of reading in a second language. On one side, after decades of guesswork, there is now widespread agreement among researchers that text comprehension depends heavily on detailed knowledge of most of the words in a text. However, it is also clear that the words that occur in texts are mainly available for learning in texts themselves. That is because the lexis (vocabulary) of texts, at least in languages like English, is far more extensive than the lexis of conversation or other non-textual media. Thus prospective readers of English must bring to reading the same knowledge they are intended to get from reading. This paradox has been known in outline for some time, but in terms loose enough to allow opposite proposals for its resolution. On one hand, Nation (e.g., 2001) argues for explicit instruction of targeted vocabulary outside the reading context itself. On the other, Krashen (e.g., 1989) believes that all the lexis needed for reading can be acquired naturally through reading itself, in a second language as in a first. It is only recently that the dimensions of this paradox could be quantified, with the application of computer text analysis to questions in language learning. What this quantification shows is the extreme unlikelihood of developing an adequate L2 reading lexicon through reading alone, even in highly favourable circumstances. This case is made in the initial research part of the paper. The subsequent development section goes on to show how the text computing that defined the lexical paradox can be re-tooled to break it, with (1) research-based design of texts and (2) lexical enrichment of undesigned texts. Empirical support for computational tools will be provided where available; all tools referred to, both analytical and pedagogical, are publicly available at the Compleat Lexical Tutor website (www.lextutor.ca).

DEFINING THE LEXICAL PARADOX

In applied linguistics conversations, turn-taking can involve a delay of several years. An example is Krashen's (2003) paper entitled Free voluntary reading: Still a very good idea, which criticizes the findings of a study by Horst, Cobb and Meara (1998) that had called into question the amount of vocabulary acquisition that normally results from free, pleasurable, meaning-oriented extensive reading. This study found that even with all the usual variables of an empirical study of extensive L2 reading controlled rather more tightly than usual, the number of new words that are learned through the experience of reading a complete, motivating, level-appropriate book of about 20,000 running words is minimal, and does not indicate that reading itself can reasonably be seen as the only or even main source of an adult reading lexicon. The gist of Krashen's response (2003) was that such studies typically underestimate the amount of lexical growth that takes place as words are encountered and re-encountered in the course of free reading. To support this contention, he calculated an effect size from the Horst, Cobb and Meara data that he interpreted to show stronger learning than these researchers' conclusions had implied. But more importantly, beyond the data, he believes that many words and phrases are learned from reading that do not appear in the test results of this type of study, owing to the crude nature of the testing instruments employed, which typically cannot account for partial or incremental learning. According to this argument, word knowledge is bubbling invisibly under the surface as one reads, and may appear as a known item in a vocabulary test only some time later (1). This hidden vocabulary learning from reading is seen as extensive enough to "do the entire job" (Krashen, 1989, p. 448) of acquiring a second lexicon, an idea that Waring and Nation (2004, p. 11) describe as "now entrenched" within second and foreign language teaching. 
Similar claims are many (e.g. Elley's belief that children graduating from a book flood approach had learned "all the vocabulary and syntax they required from repeated interactions with good stories," 1991, pp. 378-79); but clear definitions of "the entire job" are few.

Krashen has taken part in a number of conventional vocabulary-from-reading studies that use conventional measures, but these studies have not provided empirical evidence of either the extent of such hidden learning, or its sufficiency as the source of a reading lexicon. Instead, he cites the "default explanation" for the size of the adult lexicon, an account borrowed from first language (L1) theorizing (e.g., Nagy, 1988; Sternberg, 1987), whereby the lexical paradox is resolved through the sheer volume of reading time available over the course of growing up in a language. According to this explanation, a lifetime of L1 reading must eventually succeed in doing the job--even if very little measurable vocabulary knowledge is registered in any one reading event--since there is no other plausible way to account for the large number of words that adult native speakers typically know.

The extension of research assumptions and procedures from L1 to the L2 learning contexts is questionable at best (2), particularly in the absence of empirical support. But as will be shown here, both the extent and sufficiency of hidden vocabulary learning can in fact be investigated empirically within the L2 context, without recourse to default arguments. Key to this undertaking are a research instrumentation, method, and technology for measuring small increments of lexical knowledge that can be applied to sufficient numbers of words over a sufficient length of time to be plausibly commensurate with the known vocabulary sizes of learners: roughly 17,000 English word families in the case of a typical literate adult L1 lexicon (as calculated by Goulden, Nation & Read, 1990), or the 5,000 most frequent word families in the case of L2 (proposed as minimal for effective L2 reading by Hirsch & Nation, 1992). That is to say, the experimentation requires substantially more than the handful of words normally tested in this type of research (typically between 10 and 30, as discussed in Horst et al., 1998) in order to arrive at a credible estimate of "the entire job."

Claim A: The Extent of Hidden Learning

An instrument capable of measuring incremental knowledge is Wesche and Paribakht's (1996) vocabulary knowledge scale, or VKS, which asks learners to rate their knowledge of words not in binary terms (I know/I don't know what this word means) but on a five-point scale (ranging from "I don't remember having seen this word before" to "I can use this word in a sentence"). But since the VKS also requires learners to demonstrate their knowledge (e.g. by writing sentences), it is cumbersome to use in measuring changes in the knowledge of large numbers of words over time through repeated encounters, as would be needed to test the claim of extensive amounts of hidden acquisition. Therefore, Horst and Meara (1999) and Horst (2000) devised the following ratings-only version, which was suitable for adaptation to computer.

0 = I definitely don't know what this word means

1 = I am not really sure what this word means

2 = I think I know what this word means

3 = I definitely know what this word means (Horst, 2000, Chapter 7, p. 149)

Following a reading of a text, learners can efficiently rate their knowledge of a large number of its words using a computer input that employs this scale and stores the number of words rated 0, 1, 2, and 3 for each learner and each reading. But the real innovation of the adaptation is the conversion of the scale to a matrix, which allows the comparison of ratings over two (or more) readings of the same text. The matrix (shown in Figure 1) is essentially the 4-point scale in two dimensions, so that each cell represents results at both time n and after a subsequent reading (time n+1). For example, the data in the first horizontal row shows that 75 words had been rated 0 after reading n and were still rated 0 (I don't know) after reading n+1, but that 27 words had moved from 0 to 1, nine words from 0 to 2, and three words from 0 to 3. The second row shows how words rated 1 (not sure) at time n were distributed at time n+1, and so on. In other words, the cell intersections capture the numbers of words that have changed or failed to change from one knowledge state to another as a result of a subsequent reading.
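As a rough illustration of how such a matrix can be tallied, the following Python sketch cross-tabulates two sets of 0-3 ratings for the same words. The function name and the toy ratings are hypothetical; they are not taken from Horst and Meara's software or data.

```python
from typing import Dict, List

N_STATES = 4  # ratings 0-3 on the ratings-only scale


def transition_matrix(before: Dict[str, int], after: Dict[str, int]) -> List[List[int]]:
    """Cross-tabulate ratings at reading n (rows) against reading n+1 (columns)."""
    matrix = [[0] * N_STATES for _ in range(N_STATES)]
    for word, old in before.items():
        if word in after:  # only words rated on both occasions are counted
            matrix[old][after[word]] += 1
    return matrix


# Toy ratings, invented for illustration only.
ratings_n = {"gale": 0, "reef": 0, "mast": 1, "hull": 2, "keel": 3}
ratings_n1 = {"gale": 0, "reef": 2, "mast": 3, "hull": 2, "keel": 3}

for row in transition_matrix(ratings_n, ratings_n1):
    print(row)
```

Each cell [i][j] then holds the number of words rated i after one reading and j after the next, so cells to the right of the diagonal are gains and cells to the left are losses.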

Employing a methodology of repeated readings and a computer-based testing apparatus that allowed the tracking of large numbers of words, Horst and Meara were able to trace the ups and downs of word knowledge that normally pass below the radar of conventional tests. What new information emerges from this methodology? For just this one state of the matrix, Heading 3 (column 6) of Figure 1 shows that of the 300 learnable targets, 44 (or 3 + 6 + 35) have moved into the "I definitely know this word" state from another knowledge state. This is new knowledge that would probably have shown up on a standard test. However, another 56 words (27 + 9 + 20) have made lesser gains (into the "not sure" and "think I know" territory) that would probably not have shown up on a standard test. These ratios change over the course of several readings, as the learning opportunities diminish, but in the first three readings there are often at least as many words moving rightward below the radar as above it, i.e., moving to knowledge-state 1 or 2 rather than 3. A surprising number of words move to the right and then back to the left for a time, presumably reflecting either a learning and forgetting cycle, or a hypothesis testing phase, or elements of both (Horst, 2000). The evidence from the matrix studies broadly shows that Krashen is right: there is more vocabulary learning from reading than most tests measure. It seems uncontroversial to generalize from Horst and Meara's data that the total amount of vocabulary learning from reading might be as much as double what the various studies using more conventional measures have typically shown.
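The two totals can be recomputed from the cells the text reports. The Python sketch below uses only those cells; cells the text does not report are marked None and are not needed for either sum. The variable names are illustrative.

```python
# Cells reported in the text for one reading-to-reading transition.
# Rows = state after reading n, columns = state after reading n+1.
# None marks cells the text does not report; neither sum below uses them.
matrix = [
    [75, 27, 9, 3],
    [None, None, 20, 6],
    [None, None, None, 35],
    [None, None, None, None],
]

TOP = 3  # "I definitely know what this word means"

# Gains a standard test would probably detect: arrivals in state 3 from below.
to_top = sum(matrix[r][TOP] for r in range(TOP))

# Gains below the radar: rightward moves that stop at state 1 or 2.
partial = sum(matrix[r][c] for r in range(TOP) for c in range(r + 1, TOP))

print(to_top, partial)  # 44 56
```

On these figures, 100 words gained ground while only 44 reached the state a conventional test would register, which is the basis for the "as much as double" estimate.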

An alternate source of evidence for substantial amounts of hidden vocabulary growth through reading is provided by Waring and Takaki (2003) using a different methodology. These researchers tested twenty-five words acquired from reading with measures at three levels of difficulty--passive recognition (that a word had been seen in the text), aided meaning selection (by a multiple choice measure), and unaided recall (through a translation test)--and found that scores were almost 2.5 times higher for multiple choice than translation, and more than 3 times higher for recognition than for translation. In other words, most of the initial learning represented by remembering that a word had appeared in the text would not have registered on either of the other more difficult tests. This finding thus complements the matrix finding, albeit for a smaller number of items.

But can we get from here to sufficiency? Even if it is clear that more learning takes place through word encounters than most tests measure, is free reading able to provide a sufficient number of such encounters?

Claim B: The Sufficiency of Hidden Learning

Krashen's related claim, the sufficiency of hidden vocabulary growth, can also be tested empirically in an L2 context, as the following very basic experiment in corpus analysis demonstrates. But first we need some definitions.

To arrive at an operational definition of sufficiency, we might ask questions such as: How many words are enough for various purposes, such as to begin academic study in a second language, or to undertake a professional activity? Vocabulary researchers working on questions of coverage calculate the minimum number of word families needed for non-specialist reading of materials designed for native speakers to be between 3000 (Laufer, 1989) and 5000 word families (Hirsch & Nation, 1992)--provided these are high frequency items and not just random pick-ups. How many encounters are needed for word learning to occur? The number varies with a host of individual and contextual factors, but the majority of studies (reviewed in Zahar, Cobb & Spada, 2001) find that an average of six to ten encounters is needed for stable initial word learning to occur. In Horst's (2000) matrix work, six encounters were the minimum exposure for words to travel reliably from state 0 to state 3 and stabilize. Will anything like 3000 word families be met six times apiece through free reading?

Investigation 1

The materials assembled to answer this question were chosen to give the free reading argument optimal chances of succeeding. Thus the vocabulary size assumed to be sufficient for comprehension and learning was set as low as could be deemed plausible, at 3000 word families of written English rather than 5000. In contrast, the amount of reading a typical L2 learner would be likely to achieve was set as high as could be deemed plausible. A sample of the free reading that an ESL reader might be expected to undertake over a year or two of language study was extracted from the 1-million-word Brown corpus (Kucera & Francis, 1979). This classic corpus comprises 500 text samples of roughly 2,000 words each, grouped into sub-corpora of various sizes (different kinds of fiction, etc., as shown in the bottom half of Figure 2). To reflect the kinds of reading learners might do, the original sub-corpora were further grouped into three broad categories (press, academic, and fiction) of roughly similar size (179,000 words, 163,000 words, and 175,000 words, respectively). It is reasonable to suppose that one of these three groupings is a plausible if optimistic representation of the amount of free reading of authentic material that learners might achieve over a year or two of language study (these word counts are roughly equivalent to 100 pages of newspaper text, six stories the size of Alice in Wonderland, or 17 academic studies the length of this one).

[FIGURE 2 OMITTED]

High frequency words were extracted from the 100-million-word British National Corpus (Leech, Rayson, & Wilson, 2001) and grouped into families and then into thousand-family lists by Nation (2006, available at http://www.lextutor.ca/vp/bnc/). The first three of Nation's lists (i.e. the 3000 most frequent word families) represent the current best estimate of the basic learner lexicon of English. A random item-from-wordlist generator (available at http://www.lextutor.ca/rand_words/) produced 20 sets of three 10-word samples from the 1000, 2000, and 3000 British National Corpus (BNC) lists. One of these sets was selected randomly for use as sample learning targets in the investigation (3).
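The sampling step can be sketched in a few lines of Python. This is not the code behind the online generator; the placeholder family labels stand in for the real BNC lists, and the function name and seeds are assumptions made for the illustration.

```python
import random
from typing import Dict, List


def draw_sample_sets(bands: Dict[str, List[str]], n_sets: int = 20,
                     sample_size: int = 10, seed: int = 0) -> List[Dict[str, List[str]]]:
    """Draw n_sets sets, each holding one random sample per frequency band."""
    rng = random.Random(seed)  # seeded so the draw can be repeated
    return [
        {level: rng.sample(families, sample_size) for level, families in bands.items()}
        for _ in range(n_sets)
    ]


# Placeholder labels, NOT the real thousand-family lists.
bands = {
    level: [f"{level}_family_{i:04d}" for i in range(1000)]
    for level in ("1000", "2000", "3000")
}

sets = draw_sample_sets(bands)
chosen = random.Random(1).choice(sets)  # one set picked at random as the targets
print(chosen["3000"])
```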

A computer program calculated the number of occurrences of each sample word family that a learner would encounter in each of the Brown sub-corpora. This program, Range (Heatley & Nation, 1994), was adapted for the Internet by the author and is available at http://www.lextutor.ca/range/. Figure 2 shows the distribution of a word, phrase, or family throughout a set of texts. The original version of the program allowed users to specify their own texts; the online version shown in Figure 2 provides a set of standard texts, namely the 15 original sub-corpora or the three larger groupings of the Brown corpus already mentioned. In the present experiment, word families as opposed to individual words were the search units. This was achieved by entering a stem form plus apostrophe for each item as appropriate (abandon' finds abandons, abandoning, abandonment, as shown in Figure 2). Since it cannot be taken for granted that learners will recognize family members as being related (Schmitt & Zimmerman, 2002), incorporating whole families in the analysis is likely to provide a generous estimate of the learning opportunities in the text sample. A similarly generous assumption is that learners have perfect memory for encountered items over extended time and text.
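The stem-plus-apostrophe convention can be approximated with a regular expression. The Python sketch below is a crude stand-in rather than the Range program itself: the function name, the toy texts, and the use of six as a cut-off are assumptions made for the illustration.

```python
import re

MIN_ENCOUNTERS = 6  # assumed cut-off, following the six-encounter criterion


def count_query(query: str, text: str) -> int:
    """Count matches for a query in a text, ignoring case.

    A trailing apostrophe means "this stem plus any ending"; without it,
    only the exact word form is counted.
    """
    if query.endswith("'"):
        body = re.escape(query[:-1]) + r"[a-z]*"
    else:
        body = re.escape(query)
    return len(re.findall(r"\b" + body + r"\b", text, flags=re.IGNORECASE))


# Toy stand-ins for sub-corpora; invented text, not Brown corpus data.
samples = {
    "press": "They abandoned the plan. Abandonment of the bus route hurt business.",
    "fiction": "He would not abandon her.",
}

for name, text in samples.items():
    n = count_query("abandon'", text)
    print(name, n, "enough" if n >= MIN_ENCOUNTERS else "too few")
```

Stem matching of this kind over-generates with short stems: in the toy press text, bus' would count business as well as bus, whereas the exact query bus counts only the one form.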

The distribution of the BNC 3000-level word family abandon' throughout the three major divisions of the Brown corpus is requested in Figure 2; the output of the Range search is shown in Figure 3. The point to notice is that while this item appears in all three samples, it appears more than six times in only one of them (press writing).

[FIGURE 3 OMITTED]

The overall and perhaps unexpected finding from this analysis is that after the most frequent 1000 items, family ranks tend to thin quite rapidly, and with them the learning opportunities. Table 1 shows the distributions in the three Brown samples for the ten target word families from each of the three most frequent BNC levels. For each target word family, the total number of occurrences in each sub-corpus is shown; at the bottom of each column, the number of targets appearing more than six times in each subcorpus is shown. As can be seen, all 1000-level word families will be met more than six times in press writing, all except bus more than six times in academic writing, and all except bus and associat' in fiction. However, five 2000-level families (persua', technolog', wire', analy', and sue) will dip below six encounters in one or more areas. And none of the 3000-level families will be encountered six times in all three areas, and half or more are not met six times in any area. (No member of the irritat' family is met in 163,000 words of academic text!) A sideline finding is that fiction writing, once the usual focus of free reading programs for learners in the process of acquiring 1000 and 2000-level vocabulary, does not present the strongest learning opportunities in either of these zones. Fiction does, however, seem to be a reasonable source of 3000-level items, providing six occurrences for five of its 10 words, as compared to four for press and three for academic writing. It is therefore worth looking at the vocabulary growing opportunities of fiction reading more closely.

Investigation 2

For a complementary investigation, the sufficiency of a generous diet of free fiction reading as the sole or main source of vocabulary growth for 3000-level families is now examined. At the same time, the reading sample is changed from a corpus sample of texts produced by many writers to a sample of texts produced by a single author, where the vocabulary learning opportunities are arguably greater (through characteristic themes, repetitions, etc.). A corpus of just under 300,000 words was assembled from seven Jack London stories (including school favorites Call of the Wild (1903) and White Fang (1906), all offered free of cost at http://london.sonoma.edu/Writings/) as a second plausible representation of a heavy diet of free reading. Would a learner who read all these stories meet most of the 3000-level families six times apiece?

The computational tool used in this analysis is lexical frequency profiling, in this case the BNC version of VocabProfile (available at http://www.lextutor.ca/vp/bnc/, illustrated below in another context), which breaks any English text into its frequency levels according to the thousand-levels scheme already employed. The results of this analysis are as follows: The full collection of London adventure stories was shown to contain 817 word families at the 3000 level; however, only 469 of them are met six times or more, while 348 are met five times or fewer (181 of them twice or fewer). In other words, fewer than half of the thousand families at this level will be met enough times for reliable learning to occur. Interestingly, this result is similar to that shown in Table 1, where half the 3000-level words appeared six times or more in the fiction sub-corpus.
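The counting behind such a result can be sketched as follows. This Python fragment is not VocabProfile: the word-to-family mapping and the toy text are invented, and a real profile would draw on the full thousand-family lists.

```python
import re
from collections import Counter
from typing import Dict, Set, Tuple


def family_encounters(text: str, family_of: Dict[str, str]) -> Counter:
    """Count encounters per word family; family_of maps word form -> headword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(family_of[t] for t in tokens if t in family_of)


def split_by_threshold(counts: Counter, threshold: int = 6) -> Tuple[Set[str], Set[str]]:
    """Separate families met at least `threshold` times from those met less often."""
    enough = {fam for fam, n in counts.items() if n >= threshold}
    return enough, set(counts) - enough


# Hypothetical two-family mapping and toy text, for illustration only.
family_of = {"abandon": "abandon", "abandoned": "abandon",
             "irritate": "irritate", "irritated": "irritate"}
text = "abandon " * 4 + "abandoned " * 3 + "irritated"

enough, too_few = split_by_threshold(family_encounters(text, family_of))
print(sorted(enough), sorted(too_few))

# Arithmetic on the figures reported for the London corpus.
met_enough, met_too_few, level_size = 469, 348, 1000
print(met_enough + met_too_few)            # families of the level that occur at all
print(round(met_enough / level_size, 3))   # share of the whole level met six times or more
```

Families on the list that never occur in the text do not appear in the counts at all, so on the reported figures a further 183 of the thousand families (1000 - 817) would not be met even once.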

Conclusion

Together, these projections indicate that even the largest plausible amounts of free reading will not take the learner very far into the 3000-family zone. It is thus somewhat redundant to raise the matter that even words met more than six times are not necessarily learned. New word meanings are normally inferable in environments containing no more than one unknown item per 20 known items (Laufer, 1989; Liu Na & Nation, 1985). However, VocabProfile analysis of one of the best known of the London stories (Call of the Wild, comprising 31,473 words) shows that 10% of the text's words (not including proper nouns) come from frequency zones beyond the 3000 level itself, sometimes well beyond it. This means that many of the novel's 3000-level items will be met in environments of 1 unknown item per 10 words, or double the density that research has shown learners able to enjoy or learn from (4).

To summarize, this analysis is based on the most generous conditions possible: a 3000 word size requirement rather than 5000; six occurrences for learning rather than ten; a one in twenty new word density; a larger and broader diet of input than many learners will provide for themselves; an assumption that family members are usually recognized; and an assumption of minimal forgetting between reading encounters. Even then, fewer than half the 3000 level words present themselves sufficiently for reliable learning to occur. Further, the situation only gets worse for word families at the 4000 and 5000 levels and beyond. Thus, while there may well be more word learning from random encounters in free extensive reading than meets the eye, the fact is interesting but irrelevant, since most post-2000 words simply will not be encountered at all in a year or two of reading. Therefore free reading alone is not sufficient to "do the entire job" of building a functional second lexicon in any typical time frame of L2 learning.

To refute this finding, sufficiency proponents would need to define what the "entire job" of reading in an L2 is, and then show, either empirically or in principle, how this job can be done through reading alone, given the learning rate, learning conditions, and lexical profile findings outlined above and elsewhere. Until then, the common finding that many ESL learners tend to plateau with usable knowledge of about 2000 word families or fewer (leaving them poorly equipped to comprehend most texts) remains entirely explicable (Cobb, 2003).

BREAKING THE LEXICAL PARADOX

The findings presented thus far provide a basis in text analysis for what many studies have shown empirically in the past 20 years (e.g. Alderson, 1984; Bernhardt, 2005), that L2 reading is "a problem" and that the main problem is lexis. This longstanding awareness, in the research if not in the teaching community, has produced many proposals to supplement vocabulary growth from reading with other and more direct approaches to vocabulary learning. Examples include Paribakht and Wesche's (1997) reading-plus (plus vocabulary activities) scheme and various vocabulary course supplements (e.g., Barnard, 1972; Redman & Ellis, 1991; Schmitt & Schmitt, 2004).

But there are problems in principle with the supplement solution, all of which stem to some degree from separating the learning of words for reading from the act of reading. One problem is that lexical knowledge does not necessarily transfer well from vocabulary exercise and dictionary look-up to text comprehension (Cobb, 1997; Krashen, 2003; Mezynski, 1983), especially when there is a delay between the two. Second, the number of words to be met and recycled typically swells the vocabulary supplements into sets of several volumes (e.g., Barnard's five volumes; Redman & Ellis' four volumes), diverting a large amount of instruction time away from reading itself. The reading-plus approach is reading-based in that the target words are drawn from a text just read, but it has the disadvantage that this work must be prepared by a teacher with a text and vocabulary items that have been selected in advance, and so it can only be developed for a small handful of texts.

What is missing from either supplement scenario is some way of focusing attention on and proliferating encounters with new words at any level within the act of reading, or shortly after reading, for any type of text, and for lots of texts. The following section of this paper will look at several concrete proposals for doing this, with reference to empirical validation where available. The goal is to use computing to preserve the free in free reading. Two broad approaches will be described and illustrated. The first is computer-based text design, and the second is computer-aided enrichment of undesigned texts.

Computer-Aided Text Design: The Case for Home-made Simplified Materials

In principle, simplified or graded texts can meet some of the word learning requirements outlined above. Texts can be written to a particular vocabulary knowledge level, with words beyond that level introduced in environments that meet the '1 unknown word in 20' ratio mentioned earlier as the criterion for reliable guessing from context. New words can be recycled the desired number of times, in a process extending over a series of texts, until a vocabulary target, whether 3000 or 5000 frequent word families, has been met. Doing such re-writing well is clearly a difficult and expensive job. Perhaps for this reason, there is no set of graded readers in English that explicitly attempts to do it all. The arguably best designed of the graded reader sets available (e.g., the Longman Penguin series or Oxford's Bookworm), while useful, share a number of limitations that are readily evident without the help of detailed text analysis. As noted by Hill (1997), these texts are almost exclusively based on just one text genre, narrative fiction (either classics safely out of copyright or custom written originals). They employ a variety of unspecified frequency classification systems and offer no method of matching learner level to text level other than self-selection. And they make no claims about how many stories at each level a learner would have to read to achieve mastery at that level, or what coverage this mastery would provide with respect to real texts (although researchers like Nation & Wang (1999) have looked at some of these questions).

Computer text analysis can add two further limitations. One is that no series of graded readers proceeds systematically beyond the 3000-families level, and even those that get this far do not cover it particularly well. This is shown in a VocabProfile analysis of a whole set of graded readers similar to the analysis of the Jack London stories above. If a learner read all 54 stories at six levels in the Bookworm series (a total of 377,576 words), he or she would indeed meet 931 of the thousand word families at the 3000 frequency level, but would meet just over half of them (511 families) six times or more. This analysis is remarkably similar to the two above which also showed only half of the 3000-level families appearing six times or more. A difference, however, is the overall known-word density of the contexts the words will be met in, owing to the