Google Books to Pay Scholars to Dig into its Digital Stacks
Google has “quietly” decided to pay humanities researchers $50,000 a year to dig into all the rich metadata accruing from its library of 12 million books and counting. Franco Moretti, whose “distant reading” analyzes literary trends from statistical data rather than close reading of texts, is among the humanities researchers going after the funds, and he’s well positioned for it.
Here’s a list of some of Google’s ideas to leverage their information silo:
-Build software for tracking changes in language over time.
-Create utilities to discover books and passages of interest to a particular discipline.
-Develop systems for crowd-sourced corrections to book data and metadata.
-Test a literary or historical hypothesis through innovative analysis of a book.
If Google pulls the lid off of its (proprietary) silos of information from Google Books, select scholars would have unprecedented scales of data on reader and editorial habits and trends – even more so if Google gets its monopoly on in-copyright texts.
IMO there are all sorts of urgent questions to ask as a private company goes about stewarding our digital collections. For instance, the article doesn’t make clear if Google will make its data publicly accessible to anyone beyond the people it funds. Siva Vaidhyanathan also worries that the tools scholars develop will only jibe with Google-supplied data, calling this “a tragic lock-in.”
Maybe this is a tangent, but I’m reminded of a lecture I saw in February by Bernhard Rieder and Theo Roehle, critically examining how automated methods of analysis are used to explore data – a perspective to keep in mind if Google is funding scholars to explore its digital stacks. New digital search tools and methods admittedly have several promising features: they can bridge micro and macro scales; they can reconcile quantitative and qualitative questions by treating large sets, then zooming in for analysis; they exhibit non-human vision by producing patterns we couldn’t discern ourselves. All well and good.
But these advantages can also obscure how power structures mediate patterns created by algorithms, and how the technical properties of the tools and methods used always subtly constitute the knowledge produced. Rieder and Roehle presented five challenges that encourage vigilance when confronted with digitally generated patterns. I’ll list them here:
- 1. Lure of objectivity, the disembodied ‘ideal observer’ whose ‘view from nowhere’ has the longed-for accuracy of the natural sciences. But the ideal observer is only an epistemic entity, because knowledge is always situated and software design is always based on human interpretation.
- 2. Power of visual evidence. Never assume an image contains important information – treat images not as evidence but as processes, and focus on the methodological steps taken to produce them. Expose an image’s production with proper annotation, a legend revealing units of analysis, and an account of why something is included or excluded. Ask questions about its validity and purported meaning.
- 3. Black-boxing. Repeatability and transparency are essential: to transfer knowledge and build on other findings researchers must use open source technologies. Code literacy is also crucial if we are delegating the research process to software.
- 4. Institutional perturbations, or follow the money. How does methodology differ across institutions? How are people recruited? How does methodology affect power structures that fund grants?
- 5. Quest for universalism. Cybernetics and network theory, for instance, look for the deep structure underlying phenomena, risking that our tools act as agents to prove a specific ontology. Resist the gravitational pull towards totalizing explanations; cross check and look actively for idiosyncrasies, even if it means using other methods than digital ones.
The point isn’t that digital humanities is a waste of time or funds, but simply that all patterns are produced, are political and ideological. To say ‘patterns are out there, if we just find them’ is simply naive – a person can create any amount of information on any view of reality and wrap it up in a glossy infoviz.
So while at least there is a growing understanding of computational tools outside of computer science and in fields that traditionally shied away from numbers, Rieder and Roehle show us a few ways to be vigilant towards such research projects in general. My suggestion is that we keep pace with any scholarship underwritten by Google, applying these critiques to results from the datasets it generates and offers up to scholars, since the changes ahead could be sweeping.