Following on the heels of yesterday’s post, here’s a look from a different angle at quantitative approaches to understanding books, specifically the language within books. The New York Times had a thought-provoking piece, “The Jargon of the Novel, Computed”, on current research comparing the usage of literary language (what exactly literary language is, is a topic for another day) with everyday language. One such research project is the Corpus of Contemporary American English (COCA):
“which brings together 425 million words of text from the past two decades, with equally large samples drawn from fiction, popular magazines, newspapers, academic texts and transcripts of spoken English. The fiction samples cover short stories and plays in literary magazines, along with the first chapters of hundreds of novels from major publishers.”
These sorts of digital humanities projects (such as Google’s Ngram Viewer, which we looked at previously) can provide fascinating insight into how words work in the world. Is there a fine line between the literary cliché and phrases which are simply en vogue? Probably. (“Individual authors will of course have their own idiosyncratic linguistic tics. Dan Brown, of ‘Da Vinci Code’ fame, is partial to eyebrows. In his techno-thriller ‘Digital Fortress,’ characters arch or raise their eyebrows no fewer than 14 times”).
I happen to think that there is still a certain cool factor to these sorts of digital tools, which couldn’t have existed even a few decades ago. We shouldn’t take for granted what this technology could mean for scholarly research in the humanities, yet it’s safe to assume it can only provide a piece of the larger puzzle. I’m assuming this sort of text-crunching works under the principle that a text is a text is a text. But are all texts created equal? I’m not so sure.
Here are two passing thoughts, one on something I agree with, and one on something I disagree with. Disagree, first:
“Though traditional literary scholarship has long sought to track these echoes, the work can now be done automatically, transcending any single analyst’s selective attention. The same methods can also ferret out how intertextuality can work on a more unconscious level, silently directing a writer to select particular word combinations to match the expectations of the appropriate genre.”
As a scholarly tool, it’s great. And of course any given scholar’s subjective attention and intentions are going to produce widely different results, but surely that’s the whole point. This is a bit of a case of not seeing the forest for the trees, or the trees for the forest. I think there are both value and limits to what can be done with these kinds of textual patterns and collocations (if I were a linguist, I’m sure I’d have a slightly different response).
And this sentiment, which does work for me:
“While Twain, Hemingway and the rest of the vernacularizers may have introduced more ‘natural’ or ‘authentic’ styles of writing, literature did not suddenly become unliterary simply because the prose was no longer so high-flying. Rather, the textual hints of literariness continue to wash over us unannounced, even as a new kind of brainpower, the computational kind, can help identify exactly what those hints are and how they function.”
I’m going to oversimplify for a moment, and say that yes, language evolves and changes shape over time. Oftentimes the change is so subtle that we, as readers and consumers of fictions and texts, need the high-level perspective that tools such as COCA afford us. The question of literary writing vs. nonliterary writing is a rather complicated one: we generally tell one from the other by falling back on examples of what we think each is. But this seems not to fully get at the problem. Hmm. To be continued, I think.
In the meantime, here’s a link to the Corpus of Contemporary American English, at BYU. Check it out: