Bernhard Rieder: 81,498 Words: the Book as Data Object
[This post was originally published on The Unbound Book Conference Blog]
The second session of day 1 of the Unbound Book conference – also titled The Unbound Book – was moderated by Geert Lovink, and the talks were dominated by discussions of what a book becomes once it is online and connected to information and people. Bernhard Rieder, Assistant Professor of New Media at the University of Amsterdam and Assistant Professor at the Hypermedia department at Paris VIII University, asked the audience to think about what it means for the contemporary book to be enmeshed in digital structures, from a point of view that combines information science and media studies. It was a refreshing talk, not about the death of books but about the new relationships and representations that digitization affords.
- Bernhard Rieder at the Unbound Book Conference – photo cc by-sa Sebastiaan ter Burg
Perhaps not at the top of discussions surrounding e-readers and digital publishing, but equally important, is the transformation of the book into a data object, the focus of Bernhard’s talk. His interest lies in the book in the age of the database. Reflecting on the last fifteen years, which have seen the emergence of digital book collections holding very large databases of titles, he identifies two aspects of interest: 1) the arrangements for discovery and reading that these large-scale databases of books encourage and 2) the “computational potential”, or the value of the data, of millions of scanned books.
With online and digital book culture coming face-to-face with data culture, it becomes worthwhile to look at e-books and digital publishing structurally. The power of digitization brings on the power of the database. And with the database come powerful changes to our relationship with and treatment of books, where the digital book’s function and form are being “unbound”.
What does this mean?
Books are being scaled, and various statistical properties of them can be analyzed for other purposes. We see this reflected on online book sites, which are permeated by ratings, reviews and lists of the most popular, best and worst books. Using the example of The Hunchback of Notre Dame, Bernhard shows us Amazon’s text stats for the book, which index statistical properties such as readability, complexity, number of words and fun facts (The Hunchback of Notre Dame has 81,498 words). So thanks to the database you know just how many words per ounce are contained in a book and can decide which printed book is right for you.
As Bernhard explains, historically institutions (ranging from family, school and library to bookstores, market forces and affordances) have always contributed to structuring the universe of books, shaping what we read and how we read it. ‘The book in the age of the database adds a contemporary wave of new embedded practices and logistics of what we read and how we read it’. In his view, three new practices emerge:
1) Exploring full text and metadata. This refers to the statistical projections of the whole text that allow various explorations of the catalogue’s content such as Google’s “common terms and phrases” or Amazon’s “key phrases” feature, both of which link to relevant passages of the book.
2) Connecting by means of data. Specific to the ‘database condition’ is the possibility of interconnecting books through data, and of connecting books to and from other data sources such as the Web and Google Scholar, to name just a few. In other words, using Google’s database you can have a popular passage extracted and then link to other citations that cover the same topic or provide a different perspective.
3) Capturing and inferring. Perhaps the most important new embedded practice to materialize out of the database is the actual use of the data – of capturing user gestures and practices (word positions, metadata, and user data such as tagging or clicking, number of citations, reads, sales, reviews, and where in a passage a user decided to stop reading), and then using that data to create individual navigational experiences and opportunities, aka the personalization of reading.
Systems that digitize books, like Amazon and Google, transform books into information, and then unbind and rebind it again as an interactive, social and semantic interface.
Bernhard goes on to explain that such transformations allow a book to be discovered through all the different representations that the database affords (as mentioned above). He strongly believes that, more than anything else, these database technologies are increasingly steering our opportunities for navigation online, how the age of personalization [for reading] is coming about, and how it will be shaped in the future. ‘What we see online very much depends on what you may have already read and what you’ve clicked on’. So the experience a user will have, and the books they will stumble upon, become highly dependent on the competence of the user in the first place. The other important aspect to take into account when determining what a user will read is the actual role of the database technology and how it enables different forms of embedded and technology-mediated reading: via suggestions, comments, reviews, statistics and links showing how different texts relate to one another.
So what kind of book institution are we moving towards?
How we read was always a complicated and contested affair, continues Bernhard. The difference now is that the database is altering and reconfiguring the structures that orient what we read and how we read it. The new tools allow database and algorithm companies like Amazon to give customers more of what they want (low prices, vast selection, and convenience), and allow Google to “organize the world’s information and make it universally accessible and useful”. From a commercial perspective, these initiatives can be seen as a way to sell books and ads, create a one-stop shop, and profit from network effects, but their impact is yet to be assessed. According to Bernhard, it’s too early to say how the database system is actually affecting the way people read books. The larger questions of what we should read, what we could read, and how we can read are yet to be determined, and that requires first understanding how the hierarchical and incentive system functions internally as a recommendation system.
Back to the original question of his talk: what does it mean to have a full database of all books ever published? What can you actually learn from so many books being scanned in a database?
Many applications are yet to be rendered feasible in the first place (largely due to current legal constraints), but Bernhard nonetheless points out quite a few useful applications that could emerge: the automatic translation of texts, knowledge engineering (knowing who has the best texts/concepts for a specific subject), and finally ‘culturomics’.
A great example is Google’s N-gram viewer, which uses this computational potential to show what you can actually learn from having just 4% (6 million) of the world’s books scanned. What the tool essentially does is take sequences of words (n-grams) and search Google’s collection of digitized texts to determine how frequently each word combination occurs over the selected time period.
Looking at the results, one can begin to see a whole breadth of insights emerge from rapidly quantifying cultural trends; in this way, “culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities” (Michel et al., 2011).
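To make the n-gram idea concrete, here is a minimal sketch of counting how often a phrase occurs per year in a collection of dated texts. It is not Google's implementation: the function names (`ngrams`, `yearly_frequency`), the whitespace tokenization, and the three-sentence toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Split text on whitespace and return every run of n consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Relative frequency of `phrase` per year.

    `corpus` is an iterable of (year, text) pairs (a hypothetical format).
    The frequency is the number of occurrences of the phrase divided by
    the total number of n-grams of the same length seen in that year.
    """
    n = len(phrase.split())
    target = phrase.lower()
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


# Toy corpus, invented purely for illustration.
toy_corpus = [
    (1900, "the steam engine changed the world"),
    (1900, "the steam engine was loud"),
    (2000, "the search engine changed the world"),
]

freq = yearly_frequency(toy_corpus, "steam engine")
print(freq)
```

In this toy corpus, "steam engine" accounts for 2 of the 9 two-word sequences from 1900 and none of those from 2000. Plotted over many years and millions of books, curves of this kind are what the viewer displays.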
Bernhard concludes his talk by reaffirming that, even without changing form and without becoming part of an e-reader or e-book, the book is nonetheless caught up in large-scale databases. From reading and finding a book to engaging with, sharing and discussing it, the shift towards e-readers makes the database aspect easier to put into place, as it becomes something of a standard in e-publications.
Just imagine finding a fascinating passage in a book and then being able to jump to all the books that refer to that passage or to similar concepts. It is time the debate around e-books moved on to the database and how it can help us think about and integrate things from a cultural perspective.