Analysing humanities data using Cray Urika-GX
Figure: Plot of proportion of books mentioning ‘cholera’ against publication date.
In our role as members of the Research Engineering Group of the Alan Turing Institute, we have been working with Melissa Terras of the University of Edinburgh’s College of Arts, Humanities and Social Sciences (CAHSS), and Raquel Alegre of Research IT Services, University College London (UCL), to explore text analysis of humanities data. Our collaboration aimed to run text analysis codes developed by UCL upon data used by CAHSS, to exercise the data access, transfer and analysis services of the Turing Institute’s deployment of a Cray Urika-GX system.

We used two data sets of interest to CAHSS and hosted within the University of Edinburgh’s DataStore: British Library newspapers data (around 1TB of digitised newspapers from the 18th–20th centuries), and British Library books data (around 224GB of compressed digitised books from the 16th–19th centuries). Both are collections of XML documents, but they are organised differently and conform to different XML schemas, which affects how the data can be queried.

To access both data sets from within Urika, we mounted the DataStore directories into our home directories on Urika using SSHFS. We then copied the data into Urika’s own Lustre file system. We did this because, unlike Urika’s login nodes, Urika’s compute nodes have no network access and so cannot access the DataStore via the mount points. Moving the data to Lustre also minimised the need for data movement and network transfer during analysis.
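This staging step can be scripted. Below is a minimal sketch in Python, assuming hypothetical paths and that the standard sshfs, rsync and fusermount tools are available on a login node (the compute nodes cannot reach the network, so staging must happen from a login node); it is illustrative rather than the script we used.

```python
import os
import subprocess

DATASTORE = "user@datastore.example.ac.uk:/path/to/collection"  # hypothetical remote path
MOUNT_POINT = os.path.expanduser("~/datastore")                 # mount point in our home directory
LUSTRE_DIR = "/mnt/lustre/user/collection"                      # hypothetical Lustre destination

os.makedirs(MOUNT_POINT, exist_ok=True)

# Mount the DataStore directory over SSHFS (read-only is enough for staging).
subprocess.run(["sshfs", "-o", "ro", DATASTORE, MOUNT_POINT], check=True)

# Copy the data onto Lustre so the network-isolated compute nodes can read it.
subprocess.run(["rsync", "-a", MOUNT_POINT + "/", LUSTRE_DIR + "/"], check=True)

# Unmount once staging is complete.
subprocess.run(["fusermount", "-u", MOUNT_POINT], check=True)
```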
To exercise Urika’s data analytics capabilities, we ran two text analysis codes, one for each collection, both initially developed by UCL with the British Library.

UCL’s code for analysing the newspapers data is written in Python and runs queries via the Apache Spark framework. A range of queries is supported, e.g. count the number of articles per year, count the frequencies of a given list of words, or find expressions matching a pattern.

UCL’s code for analysing the books data is also written in Python, but runs queries via mpi4py, a Python wrapper for the message-passing interface (MPI) for parallel programming, although work had started on migrating some of these queries to Spark. A range of queries is supported, e.g. count the total number of pages across all books.
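To give a flavour of the Spark-based queries, here is a hedged sketch of a word-frequency count in PySpark. The path, the query terms and the treatment of each document as plain text are assumptions for illustration; UCL’s actual code parses the newspapers’ XML schema.

```python
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="word-frequencies")

WORDS_OF_INTEREST = {"cholera", "influenza"}  # example query terms

# Read each document as a single (path, content) record; real code
# would extract article text from the XML rather than splitting naively.
docs = sc.wholeTextFiles("/mnt/lustre/user/newspapers/")

counts = (docs.flatMap(lambda kv: kv[1].lower().split())
              .filter(lambda word: word in WORDS_OF_INTEREST)
              .map(lambda word: (word, 1))
              .reduceByKey(add))

print(counts.collect())
sc.stop()
```

Such a job would typically be launched with spark-submit so that Spark distributes the work across the cluster’s compute nodes.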
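Similarly, a sketch of the books query style using mpi4py is given below: each MPI rank counts page elements in its share of the files, and the per-rank totals are reduced to rank 0. The file layout and the page tag name are assumptions, not the actual British Library books schema.

```python
import glob
from xml.etree import ElementTree
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Distribute the book files round-robin across the MPI ranks.
files = sorted(glob.glob("/mnt/lustre/user/books/**/*.xml", recursive=True))
my_files = files[rank::size]

local_pages = 0
for path in my_files:
    tree = ElementTree.parse(path)
    # Count page elements; the real schema may use a different tag.
    local_pages += len(tree.findall(".//page"))

# Sum the per-rank counts on rank 0.
total = comm.reduce(local_pages, op=MPI.SUM, root=0)
if rank == 0:
    print(f"Total pages across all books: {total}")
```

A script like this would be run with, for example, mpiexec -n 8 python count_pages.py.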
The Cray Urika-GX system is a high-performance analytics cluster with a pre-integrated stack of popular analytics packages, including Apache Spark, Apache Hadoop and Jupyter notebooks, complemented by frameworks for developing data analytics applications in Python, Scala, R and Java.