
3 minute read
Analysing humanities data using Cray Urika-GX
In our role as members of the Research Engineering Group of the Alan Turing Institute, we have been working with Melissa Terras, University of Edinburgh’s College of Arts, Humanities and Social Sciences (CAHSS), and Raquel Alegre, Research IT Services, University College London (UCL), to explore text analysis of humanities data.
Our collaboration aimed to run text analysis codes developed by UCL upon data used by CAHSS to exercise the data access, transfer and analysis services of the Turing Institute’s deployment of a Cray Urika-GX system.
We used two data sets of interest to CAHSS and hosted within the University of Edinburgh’s DataStore: British Library newspapers data (around 1TB of digitised newspapers from the 18th–20th centuries), and British Library books data (around 224GB of compressed digitised books from the 16th–19th centuries). Both are collections of XML documents, but they are organised differently and conform to different XML schemas, which affects how the data can be queried.
To access both data sets from within Urika, we mounted the DataStore directories into our home directories on Urika using SSHFS. We then copied the data into Urika’s own Lustre file system. We did this because, unlike Urika’s login nodes, Urika’s compute nodes have no network access and so cannot access the DataStore via the mount points. Also, by moving the data to Lustre, we minimised the need for data movement and network transfer during analysis.
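For illustration, this staging step can be scripted from Python; the DataStore address, mount point and Lustre destination below are placeholders rather than the paths we actually used.

    import subprocess
    from pathlib import Path

    # Hypothetical paths - the real DataStore and Lustre locations differ.
    DATASTORE_REMOTE = "user@datastore.ed.ac.uk:/path/to/newspapers"
    MOUNT_POINT = Path.home() / "datastore"
    LUSTRE_DEST = "/mnt/lustre/user/newspapers"

    # Mount the DataStore directory over SSHFS (read-only is enough for staging).
    MOUNT_POINT.mkdir(exist_ok=True)
    subprocess.run(["sshfs", "-o", "ro", DATASTORE_REMOTE, str(MOUNT_POINT)], check=True)

    # Copy the data onto Urika's Lustre file system so the compute nodes,
    # which have no network access, can read it directly.
    subprocess.run(["rsync", "-a", f"{MOUNT_POINT}/", LUSTRE_DEST], check=True)

    # Unmount once the copy has completed.
    subprocess.run(["fusermount", "-u", str(MOUNT_POINT)], check=True)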
To exercise Urika’s data analytics capabilities, we ran two text analysis codes, one for each collection, which were initially developed by UCL with the British Library.
UCL’s code for analysing the newspapers data is written in Python and runs queries via the Apache Spark framework. A range of queries is supported, eg counting the number of articles per year, counting the frequencies of a given list of words, or finding expressions matching a pattern.
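A word-frequency query of this kind follows a familiar Spark pattern: read the documents, tokenise their text, then map and reduce to produce per-word counts. The sketch below is not UCL’s code; the file location, query terms and the simplification of treating each XML file as plain text are our assumptions.

    import re
    from pyspark import SparkContext

    sc = SparkContext(appName="newspaper-word-frequencies")

    # Hypothetical location of the newspaper XML on Lustre.
    files = sc.wholeTextFiles("/mnt/lustre/user/newspapers/*.xml")

    WORDS = {"cholera", "famine", "railway"}  # example query terms

    def count_words(document):
        """Tokenise one document and emit (word, 1) for each query term found."""
        _, text = document
        tokens = re.findall(r"[a-z]+", text.lower())
        return [(t, 1) for t in tokens if t in WORDS]

    frequencies = (files
                   .flatMap(count_words)
                   .reduceByKey(lambda a, b: a + b)
                   .collect())

    for word, count in sorted(frequencies):
        print(word, count)

A script like this would typically be launched with spark-submit, which on Urika runs against the pre-integrated Spark installation.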
UCL’s code for analysing the books data is also written in Python and runs queries via mpi4py, a wrapper for the message-passing interface (MPI) for parallel programming, although work had started on migrating some of these queries to use Spark. A range of queries is supported, eg counting the total number of pages across all books or counting the frequencies of a given list of words. This code is complemented by a set of Jupyter notebooks that visualise query results and perform further analyses.
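The MPI-based queries follow the usual pattern of dividing the book files across processes, computing a partial result on each rank and reducing on the root rank. A minimal sketch of that pattern, with an assumed directory layout and a hypothetical page element standing in for the real schema:

    from pathlib import Path
    import xml.etree.ElementTree as ET
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Hypothetical location of the books XML on Lustre.
    BOOKS_DIR = Path("/mnt/lustre/user/books")

    # Rank 0 lists the files; they are then shared round-robin across ranks.
    files = sorted(BOOKS_DIR.glob("**/*.xml")) if rank == 0 else None
    files = comm.bcast(files, root=0)
    my_files = files[rank::size]

    # Each rank counts pages in its share of the books. The "page" tag is an
    # assumption; the real British Library schema differs.
    local_pages = 0
    for path in my_files:
        tree = ET.parse(path)
        local_pages += len(tree.findall(".//page"))

    # Reduce the partial counts onto rank 0 and report the total.
    total_pages = comm.reduce(local_pages, op=MPI.SUM, root=0)
    if rank == 0:
        print("Total pages:", total_pages)

Locally such a script would be launched with, say, mpirun -np 8 python count_pages.py; on Urika, as described below, the modified code submits the equivalent job via the Mesos resource manager.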
To run the codes within Urika we needed to modify both so that they had no dependence on UCL’s local environment and instead accessed data located within Lustre. As a result, the modified newspapers code now allows the location of XML documents to be specified using either URLs or absolute file paths, and the modified books code now runs its MPI-based queries via Urika’s Apache Mesos resource manager.
For the books data, Melissa suggested that we try to reproduce the results from her Jisc Research Data Spring 2015 project at UCL. This project developed queries to search for the names of thirteen diseases (eg “cholera”, “tuberculosis”) and return the total number of occurrences of each name, and then to normalise the results by the number of books, pages and words per year. Taken together, these results show the extent to which references to specific diseases in the literature change over time.
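The normalisation itself is simple arithmetic: dividing the occurrences of a disease name in a given year by the total number of words (or pages, or books) published that year gives a rate that can be compared across years. A minimal sketch, using made-up counts rather than real query output:

    # Hypothetical raw query results: occurrences of each disease name per year,
    # and the total number of words in the corpus per year.
    occurrences = {
        "cholera": {1832: 420, 1833: 150},
        "tuberculosis": {1832: 35, 1833: 40},
    }
    words_per_year = {1832: 1_200_000, 1833: 950_000}

    # Normalise to occurrences per million words, so years with more published
    # material do not dominate simply by being larger.
    normalised = {
        disease: {
            year: 1_000_000 * count / words_per_year[year]
            for year, count in by_year.items()
        }
        for disease, by_year in occurrences.items()
    }

    for disease, by_year in normalised.items():
        for year, rate in sorted(by_year.items()):
            print(f"{disease} {year}: {rate:.1f} per million words")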
We compared the results of running the modified code on Urika with the original results. They were generally consistent, with some anomalies that we traced to data missing from the books data set held within DataStore; this has been reported back to Melissa.
Urika is designed with the use of Spark in mind, and Spark is well suited to this form of text analysis. Migrating the mpi4py books queries to Spark would be a good area for future work, combining them with the newspapers code, which already uses Spark and can handle several XML schemas. This would yield a single code, with a common underlying data model, that could run queries across both the newspapers and books data.
The Cray Urika-GX system is a high-performance analytics cluster with a pre-integrated stack of popular analytics packages, including Apache Spark, Apache Hadoop and Jupyter notebooks, complemented by frameworks for developing data analytics applications in Python, Scala, R and Java.
Our updated codes and documentation on how to run them are publicly available on GitHub:
• Newspaper code: http://bit.ly/2EQrYvS
• Books code (Spark version): http://bit.ly/2qeCwL7
• Books code (MPI version): http://bit.ly/2ES63EJ
• Jupyter notebook for books: http://bit.ly/2JmcIoZ
Research Engineering Group of the Alan Turing Institute www.turing.ac.uk/research/research-engineering
This work was funded by Scottish Enterprise as part of the Alan Turing Institute-Scottish Enterprise Data Engineering Programme.
Rosa Filgueira and Mike Jackson, EPCC r.filgueira@epcc.ed.ac.uk m.jackson@epcc.ed.ac.uk