Analysing humanities data using Cray Urika-GX
Figure: Plot of proportion of books mentioning ‘cholera’ against publication date.
In our role as members of the Research Engineering Group of the Alan Turing Institute, we have been working with Melissa Terras of the University of Edinburgh’s College of Arts, Humanities and Social Sciences (CAHSS), and Raquel Alegre of Research IT Services, University College London (UCL), to explore text analysis of humanities data. Our collaboration aimed to run text analysis codes developed by UCL upon data used by CAHSS, to exercise the data access, transfer and analysis services of the Turing Institute’s deployment of a Cray Urika-GX system.

We used two data sets of interest to CAHSS and hosted within the University of Edinburgh’s DataStore: British Library newspapers data (around 1TB of digitised newspapers from the 18th–20th centuries), and British Library books data (around 224GB of compressed digitised books from the 16th–19th centuries). Both are collections of XML documents, but they are organised differently and conform to different XML schemas, which affects how the data can be queried.

To access both data sets from within Urika, we mounted the DataStore directories into our home directories on Urika using SSHFS. We then copied the data into Urika’s own Lustre file system. We did this because, unlike Urika’s login nodes, Urika’s compute nodes have no network access and so cannot access the DataStore via the mount points. Moving the data to Lustre also minimised the need for data movement and network transfer during analysis.
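This staging step can be scripted. Below is a minimal sketch in Python, assuming hypothetical paths and that the standard sshfs, rsync and fusermount tools are available on a login node (the compute nodes cannot reach the network, so staging must happen from a login node); it is illustrative rather than the script we used.

```python
import os
import subprocess

DATASTORE = "user@datastore.example.ac.uk:/path/to/collection"  # hypothetical remote path
MOUNT_POINT = os.path.expanduser("~/datastore")                 # mount point in our home directory
LUSTRE_DIR = "/mnt/lustre/user/collection"                      # hypothetical Lustre destination

os.makedirs(MOUNT_POINT, exist_ok=True)

# Mount the DataStore directory over SSHFS (read-only is enough for staging).
subprocess.run(["sshfs", "-o", "ro", DATASTORE, MOUNT_POINT], check=True)

# Copy the data onto Lustre so the network-isolated compute nodes can read it.
subprocess.run(["rsync", "-a", MOUNT_POINT + "/", LUSTRE_DIR + "/"], check=True)

# Unmount once staging is complete.
subprocess.run(["fusermount", "-u", MOUNT_POINT], check=True)
```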
To exercise Urika’s data analytics capabilities, we ran two text analysis codes, one for each collection, both initially developed by UCL with the British Library.

UCL’s code for analysing the newspapers data is written in Python and runs queries via the Apache Spark framework. A range of queries is supported, e.g. count the number of articles per year, count the frequencies of a given list of words, or find expressions matching a pattern.

UCL’s code for analysing the books data is also written in Python, but runs queries via mpi4py, a Python wrapper for the message-passing interface (MPI) for parallel programming, although work had started on migrating some of these queries to Spark. A range of queries is supported, e.g. count the total number of pages across all books.
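To give a flavour of the Spark-based queries, here is a hedged sketch of a word-frequency count in PySpark. The path, the query terms and the treatment of each document as plain text are assumptions for illustration; UCL’s actual code parses the newspapers’ XML schema.

```python
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="word-frequencies")

WORDS_OF_INTEREST = {"cholera", "influenza"}  # example query terms

# Read each document as a single (path, content) record; real code
# would extract article text from the XML rather than splitting naively.
docs = sc.wholeTextFiles("/mnt/lustre/user/newspapers/")

counts = (docs.flatMap(lambda kv: kv[1].lower().split())
              .filter(lambda word: word in WORDS_OF_INTEREST)
              .map(lambda word: (word, 1))
              .reduceByKey(add))

print(counts.collect())
sc.stop()
```

Such a job would typically be launched with spark-submit so that Spark distributes the work across the cluster’s compute nodes.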
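Similarly, a sketch of the books query style using mpi4py is given below: each MPI rank counts page elements in its share of the files, and the per-rank totals are reduced to rank 0. The file layout and the page tag name are assumptions, not the actual British Library books schema.

```python
import glob
from xml.etree import ElementTree
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Distribute the book files round-robin across the MPI ranks.
files = sorted(glob.glob("/mnt/lustre/user/books/**/*.xml", recursive=True))
my_files = files[rank::size]

local_pages = 0
for path in my_files:
    tree = ElementTree.parse(path)
    # Count page elements; the real schema may use a different tag.
    local_pages += len(tree.findall(".//page"))

# Sum the per-rank counts on rank 0.
total = comm.reduce(local_pages, op=MPI.SUM, root=0)
if rank == 0:
    print(f"Total pages across all books: {total}")
```

A script like this would be run with, for example, mpiexec -n 8 python count_pages.py.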
The Cray Urika-GX system is a high-performance analytics cluster with a pre-integrated stack of popular analytics packages, including Apache Spark, Apache Hadoop and Jupyter notebooks, complemented by frameworks for developing data analytics applications in Python, Scala, R and Java.