The Smelly London project brings together historical data with modern digitisation and visualisation to give a unique, revealing and visceral glimpse into the London of the past and what it tells us about London today. It is a collaborative, interdisciplinary project that demonstrates the capabilities of the innovative text-mining tools we design to facilitate new kinds of humanities research.
The Medical Officer of Health (MOH) reports were published annually by the Medical Officers of Health (MOsH) employed by local authorities across the UK. These reports provided vital statistics and a general statement on the health of the population. They concentrated on reporting infectious diseases and the measures taken to address them, and also covered other areas of social responsibility. They have long been regarded as an important source for the 19th and 20th century history of public health, and they stem from the reaction to infectious disease in the mid-19th century. Although there were attempts at standardisation, the reports display each MOH’s interests, idiosyncrasies and particular strengths. They therefore also provide a particular perspective on the everyday lives of Londoners over several generations.
Over the past few years Wellcome have been developing a world-class digital library by digitising a substantial proportion of their holdings. As part of this effort, approximately 5,800 MOH reports for London, spanning 1848 to 1972, were digitised in 2012. Wellcome currently holds the most comprehensive digital collection of the London MOH reports. Since September 2016 Wellcome have been digitising 70,000 more reports covering the rest of the United Kingdom (UK) as part of the UK Medical Heritage Library (UKMHL) project, in partnership with Jisc and the Internet Archive. No digital techniques have yet been applied successfully to add value to this very rich resource.
As part of the [Smelly London](www.londonsmells.co.uk) project, the OCR-ed text of the London MOH reports has been text-mined for the first time. Through text mining we produced a geo-referenced dataset of smell types, which can be visualised to explore the data. We enrich the text-mining pipeline with natural language processing (NLP), including lemmatisation, part-of-speech tagging and automatic identification of smell terms and concepts based on their contextual features. This allows us to identify smell categories in a data-driven fashion and to discover new categories that escaped previous classifications. This step complements the close-reading analysis and enables us to scale up the amount of information extracted from the texts.
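The extraction step described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: it swaps lemmatisation and part-of-speech tagging for simple regular expressions, and the category names, term lists, cue words and the `report` dictionary layout (`borough`, `year`, `text`) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical smell categories and trigger terms (illustrative only,
# not the categories used by the project).
SMELL_PATTERNS = {
    "sewage": re.compile(r"\b(?:sewage|sewers?|drains?|cesspools?)\b", re.IGNORECASE),
    "smoke": re.compile(r"\b(?:smoke|smoky|soot)\b", re.IGNORECASE),
    "animal": re.compile(r"\b(?:manure|dung|slaughter-?houses?|offal)\b", re.IGNORECASE),
}

# Cue words suggesting that a sentence is about a smell at all.
# Matching on word stems is a crude stand-in for lemmatisation.
SMELL_CUES = re.compile(r"\b(?:smell|stench|odou?r|effluvi|stink)\w*", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_smells(report):
    """Return one record per (sentence, category) match in a report.

    `report` is assumed to be a dict with 'borough', 'year' and 'text' keys.
    """
    records = []
    for sentence in split_sentences(report["text"]):
        if not SMELL_CUES.search(sentence):
            continue  # skip sentences with no smell cue word
        for category, pattern in SMELL_PATTERNS.items():
            if pattern.search(sentence):
                records.append({
                    "borough": report["borough"],
                    "year": report["year"],
                    "category": category,
                    "sentence": sentence,
                })
    return records


def count_by_borough(records):
    """Count smell mentions per (borough, category) pair, ready for mapping."""
    return Counter((r["borough"], r["category"]) for r in records)
```

Because each record keeps the borough and year of its report, the counts can be joined to borough boundaries to give a geo-referenced layer. A keyword list like this only finds categories that were anticipated in advance, which is why the contextual, data-driven approach described above is needed to discover new ones.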
Analysing the MOH reports reveals the intimate narratives of the everyday experiences of 19th and 20th century Londoners through the ‘smellscape’. As the data become more structured, they can be more readily overlaid with other maps and images, such as Charles Booth’s London Poverty Map and 19th century disease maps. Having multiple layers will enable us to run various comparisons and assess whether there are any correlations between smells and diseases, as well as links to the socio-economic identity of areas in London. Smell has a great influence over how we perceive places and contributes to the construction of a place’s identity. During the 19th century the paranoia surrounding smells associated with poor hygiene heightened in many European cities. The Great Stink of 1858, for example, prompted discussion of moving Parliament outside London. Despite the rise of germ theory (Pasteur and Koch) in the 1880s, concerns about disease-causing miasma (smells) did not disappear entirely. The MOH reports are one of the richest available sources on local public health administration and patterns of disease.
Computer programs can perform tasks thousands of times faster than humans. In the Python code written to extract the data from the MOH reports, parallel processing was employed to reduce the running time of the program. Modern central processing units (CPUs) have multiple cores, which allow calculations to be run concurrently. In our project the CPU had four cores, which shortened the running time of the program by up to a factor of three. The next objective for the project is to scale up the text mining from 5,800 reports to over 70,000 reports covering the entire UK. To process such large datasets we are investigating distributed computing resources such as Amazon Web Services (AWS). The code written for this project has been made open source under the MIT license, along with documentation, so that other programmers and researchers can use the codebase in their own text-mining projects.
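The parallel processing described above could look something like the following sketch, which uses Python's standard `multiprocessing` module. It is an illustration under stated assumptions rather than the project's own code: the function names, the cue-word pattern and the default of four workers (chosen to mirror the four-core CPU mentioned above) are all hypothetical.

```python
import re
from multiprocessing import Pool

# Hypothetical cue-word pattern; the project's real extraction is richer.
SMELL_CUES = re.compile(r"\b(?:smell|stench|odou?r|effluvi|stink)\w*", re.IGNORECASE)


def count_smell_mentions(text):
    """Per-report work: count smell cue words in one report's text."""
    return len(SMELL_CUES.findall(text))


def process_reports(texts, workers=4):
    """Apply count_smell_mentions to every report, one result per report.

    With workers > 1 the reports are shared out across a pool of worker
    processes, one per CPU core; with workers <= 1 they are processed
    serially, which is useful for debugging.
    """
    if workers <= 1:
        return [count_smell_mentions(t) for t in texts]
    with Pool(processes=workers) as pool:
        # pool.map preserves input order, so results line up with texts.
        return pool.map(count_smell_mentions, texts)
```

Each report is processed independently of the others, which is what makes the task easy to parallelise. The speed-up is usually somewhat less than the number of cores, consistent with the threefold gain on four cores reported above, because starting processes and passing text between them has a cost. On platforms that start worker processes by spawning rather than forking, `process_reports` should be called from under an `if __name__ == "__main__":` guard.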
At the end of the Smelly London project the historical smell data will be available via the 21st century [Smelly Maps](http://www.goodcitylife.org/) by Daniele Quercia. This will allow the public and other researchers to compare smells in London from the 19th century to the present day. Moreover, the Smelly London dataset will be available on the [Layers of London](http://alpha.layersoflondon.org/the-map) project platform, which has the further potential benefit of engaging the public.