Fun For Word Nerds: On the Google Ngram Viewer

Researchers at Harvard are taking an innovative quantitative approach to the humanities, tracking how often words occur in literature across different historical periods. The language used to describe the project is telling in itself: research through scanned books becomes one part archaeology (“a digital ‘fossil record’ of human culture”) and one part life science (a search for a “cultural genome”). The team of academics working on the project has dubbed this data-driven approach “culturomics,” and its potential scope is intriguingly broad, delving into topics such as “humanity’s collective memory, the adoption of technology, the dynamics of fame, and the effects of censorship and propaganda.” The scale is just as ambitious. From The Harvard Gazette: “It is the largest data release in the history of the humanities, the authors note, a sequence of letters 1,000 times longer than the human genome. If written in a straight line, it would reach to the moon and back 10 times over.” Some of the early findings may confirm what some of us already suspect (“Humanity is forgetting its past faster with each passing year”).
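For fellow word nerds who like to tinker, here is a minimal sketch of the basic idea: count how often a word appears in each year's texts, relative to all the words from that year. This is only a toy illustration, not the researchers' actual pipeline; the function name `relative_frequency` and the tiny `toy_corpus` are made up for the example.

```python
from collections import Counter


def relative_frequency(corpus, word):
    """Return {year: share of that year's words that equal `word`}.

    `corpus` is an iterable of (year, text) pairs. Tokenization here is a
    naive lowercase whitespace split, an assumption made for brevity.
    """
    totals = Counter()  # total words seen per year
    hits = Counter()    # occurrences of `word` per year
    target = word.lower()
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    # Skip years with no words at all to avoid dividing by zero.
    return {
        year: hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


# Hypothetical toy corpus: (publication year, text) pairs.
toy_corpus = [
    (1900, "the telegraph and the railway"),
    (1900, "the telegraph office"),
    (2000, "the internet and the telephone"),
]

print(relative_frequency(toy_corpus, "telegraph"))
```

Dividing by each year's total word count matters because far more books are published in some years than in others, so raw counts alone would mostly track the size of the library rather than the fortunes of the word.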

As much as I love books for all that we can discover within them, one of the researchers made a spot-on observation about the limits of what can be learned from quantitative analyses of books (and only books):

“Books are not representative of culture as a whole, even if our corpus contained 100% of all books ever published. Only certain types of people write books and get them published, and that subclass has changed over time, with the advent of things like public literacy.” Eventually, he says, the database will have to include “newspapers, manuscripts, maps, artwork, and a myriad of other human creations.”

For perspective, here is a great bit of book-scanning trivia from The New York Times (“In 500 Billion Words, New Window on Culture”):

“So far, Google has scanned more than 11 percent of the entire corpus of published books, about two trillion words.”

On the academic side of things, well-known scholars such as Harvard Library’s Robert Darnton and Harvard psychologist and linguist Steven Pinker seem to be showing cautious approval of the Google Ngram project and its possibilities (although Louis Menand would like to see some book historians get in on the action, too).

© 2023 by Train of Thoughts.