RIMCRISPool_FinalReport


Project Acronym: CRISPool Version: 2.2 Contact: akc@st-andrews.ac.uk Date: 01/12/2010

JISC Final Report

Project Information
Project Acronym: CRISPool
Project Title: Using CERIF-XML to integrate heterogeneous research information from several institutions into a single portal
Start Date: 1 March 2010
End Date: 31 August 2010
Lead Institution: University of St Andrews
Project Director: Anna Clements
Project Manager & contact details: Anna Clements, akc@st-andrews.ac.uk, 01334 462761
Partner Institutions:
SUPA (The Scottish Universities Physics Alliance) http://www.supa.ac.uk/
University of Edinburgh
University of Glasgow
EuroCRIS http://www.eurocris.org
Atira A/S http://www.atira.dk
Project Web URL: http://www.crispool.org
Pilot Portal: http://crispool.atira.dk/portal
Programme Name (and number): Information Environment Programme 2009-2011, Research Information Management Call 11/09
Programme Manager: Neil Jacobs / Frederique Van Till

Document Name
Document Title: Final Report
Reporting Period:
Author(s) & project role: Anna Clements (Project Manager), Niall Lockhart (Project Management Support Officer)
Date:
Filename:
URL: if document is posted on project web site
Access: √ Project and JISC internal  √ General dissemination

Document History
Version   Date         Comments
V1.0      31/08/2010   Circulated to partners and programme manager
V2.0      09/09/2010   Amendments from partners plus Appendices
V2.1      16/09/2010   Final version to JISC


JISC Final Report

CRISPool Project
Using CERIF-XML to integrate heterogeneous research information from several institutions into a single portal

Author(s):
Anna Clements (Project Manager)
Niall Lockhart (Project Management Support Officer)

Contact:
Anna Clements
akc@st-andrews.ac.uk
University of St Andrews
Business Improvements
Butts Wynd Building
St Andrews
Fife KY16 9AD


Table of Contents

JISC Final Report
Acknowledgements
Executive Summary
Background
Aims and Objectives
Methodology
Implementation
Sourcing the data
Producing the CERIF-XML
Outputs and Results
Outcomes
Conclusions
Implications
References
Appendix 1: CRISPool Data Dictionary
Appendix 2: Class Scheme Data
Appendix 3: CRISPool CERIF to PURE4 mapping
Appendix 4: Technical Summary - CRISPool project prototype implementation


Acknowledgements
The CRISPool project would like to acknowledge the contributions of the following organisations to the success of the project:
• JISC for part-funding the project through the Information Environment Programme 2009-11 and its Research Information Management Call 11/09
• The project partners for their invaluable contributions to the project:
o SUPA (The Scottish Universities Physics Alliance)¹
o University of Glasgow
o University of Edinburgh
o EuroCRIS²
o Atira A/S³
• The ERIS⁴ and R4R⁵ (Readiness4Ref) project teams for continuing enthusiastic support and advice.

¹ www.supa.ac.uk
² www.eurocris.org
³ www.atira.dk
⁴ http://eriscotland.wordpress.com/
⁵ http://www.kcl.ac.uk/iss/cerch/projects/portfolio/r4r.html


Executive Summary
We have successfully used CERIF-XML to bring together data on people, organisations and publications from three Universities for the SUPA [Scottish Universities Physics Alliance] research pool. These data are viewable and searchable at http://crispool.atira.dk/portal

This was the main aim of the project and has been achieved within the limited timescale and budget of this JISC call. The collaborative nature of the project, involving partner institutions, pool administrators, euroCRIS, third-party developers (Atira) and the related JISC-funded projects Readiness4REF (R4R) and ERIS, has meant that a wide range of stakeholders have been involved at all stages to help ensure the success of the project. The approach taken meant that we were learning how to use CERIF-XML as we went along, so the expert help and advice of euroCRIS and Atira, who are members of the euroCRIS CERIF Task Group, and the sharing of preliminary findings from the Readiness4Ref project led by King's College London have been invaluable. Additionally, the enthusiastic support from the ERIS project has provided a channel to other pools in Scotland, several of which have expressed interest in the project.

Once we had agreed which data the partner Institutions (Glasgow and Edinburgh) could reasonably provide within the timescale, the University of St Andrews created sample CERIF-XML files from which the other University partners could generate the data needed for the portal. Each institution took a different approach to generating its XML data, but all used relatively low-tech text editing and search-and-replace tools. No additional specialist knowledge was required.

Although the main aim of the project was to test the suitability of CERIF-XML as an exchange format, it was evident that those Institutions with an existing culture of integrated research information management were better able to provide the required data quickly. For St Andrews there was no additional work required as all data were fed in from their existing CRIS. Glasgow, which has had an in-house integrated research information management system for many years, was able to provide data on people and publications easily. Edinburgh was able to provide data on people but unfortunately was not able to provide publications data within the project timeframe.

Returning to the main aim of testing CERIF-XML’s suitability as an exchange format, the CERIF data model fully supported the requirements of the project except for two relatively minor areas, which have been reported to euroCRIS. For the pilot project we have been able to work around these issues by using CERIF classifications, something that R4R has also been able to do during its exercise to map the RAE2008 schema to CERIF.

The main technical issue we have found concerns the fragmentation of CERIF-XML into so many individual XML files. The sheer number means that the data is very resource intensive to process, as each item, whether a person, organisation or publication, is defined by data in up to 10 related XML files. The issue facing the designers of CERIF is that the model itself needs to represent the real world of interrelated research information, a fully connected graph, whereas XML is a linearised tree structure and cannot natively represent the complexity required. XML is nevertheless the vehicle of choice for data exchange in web services.

In conclusion, all partners are positive about the results of the CRISPool project and SUPA are keen to move forward from a pilot to a sustainable solution. While there are still areas to improve on (for example the processing of multiple XML files), the sector as a whole can take heart from our findings, which reinforce the conclusion of the EXRI report that CERIF should be used as the exchange format within the UK research information sector.


Background
The importance of collecting, maintaining and exchanging good quality, comprehensive and current research information has risen up the agenda in the UK Higher Education sector following the recently completed RAE2008 data collection exercise. In particular, the use of a standard exchange format, the Common European Research Information Format (CERIF), to improve interoperability of data between the different stakeholders (Funding Councils, Research Councils and other funders, HESA, Institutions) has been discussed by a JISC-led Research Information Management Group. This group commissioned the EXRI project to examine the suitability of CERIF versus other possible standards, or no standard⁶. The final report recommends the use of CERIF as a standard exchange format between the stakeholders. CRISPool builds directly on Recommendation 7 from this report: ‘.. pilots to look at real exchange of research activity data between HEIs using CERIF’.

Research Pooling⁷ is well established in Scotland and there are currently 13 pools, the oldest being SUPA, established in 2005. The pools were set up in order to help create and maintain the critical mass of resources needed for Scotland’s universities to carry out world-class research. The success of the initiative was highlighted by the RAE2008 results, in which Scottish institutions increased their share of the UK's world-class research from 11.6% in 2001 to 12.3%, even though the country has only 8.5% of the UK population. Every Scottish institution now has world-leading research in at least one of its disciplines.

This approach is also being discussed in the national UK press, as reported in Times Higher Education⁸ (THE, 5th August 2010), which presented the views of David Price, Vice-Provost for Research, and Stephen Caddick, Vice-Provost for Enterprise, both of University College London: ‘… the coming cuts to the sector will necessitate "major restructuring" to preserve the global standing of the elite universities on which the success of UK higher education depends. The elite, they propose, should pool and coordinate their research strengths to form hubs of about half a dozen regional "research clusters".’

The current information infrastructure underpinning SUPA, as with the other pools, is poor, and both SUPA administrators and members of the partner institutions spend considerable resource, and duplicate effort, in collecting and checking data on staff, students and publications. This information is held at member institutions in different formats with different vocabularies used, for example, for similar job descriptions or publication types. This information is collated and presented for reporting to the Scottish Funding Council.

In SUPA, information on staff and students is used to provide access to the My.SUPA portal⁹, a virtual learning environment and research collaboration portal based on Moodle¹⁰. Logged-in staff and students are given access to lists of other users along with some limited profile information listing interests in various research themes, with the aim of fostering new collaborations between researchers across Scotland.

Gathering publications information has been particularly resource intensive. In previous years it has been carried out by requesting information from department administrators, who passed the data back in spreadsheets. SUPA administrators verified the information, emailing each staff member or research student, and provided opportunities to make corrections and additions prior to publication of a printed publications list. This final publications list is only available in a limited electronic form: a PDF version of the printed copy. This monolithic document offers no way to analyse the information other than in the order and format provided, apart from a simple text search of the PDF document.

Data supplied by the departments is of varying quality, and is sometimes provided indirectly rather than being sourced from institutional information systems. The complete dataset is based on information from different information systems with different business rules and data constraints. Feedback from several department administrators involved in these requests for information suggests that data is gathered from a variety of sources including institutional systems, departmental information systems or local files. This approach responds to the immediate query expeditiously, rather than following a repeatable process. Several requests and clarifications may be required in each information gathering exercise. The data is generally only updated on an annual basis and is not consistently maintained between annual reporting cycles, so it quickly goes out of date and becomes considerably less useful.

The CRISPool project partners have been working in the area of research information for several years, including innovative projects to link research information and management systems to open access repositories. Glasgow University has developed an innovative integrated research management system, the University of Edinburgh is leading a consortial approach to driving the open access agenda forward in Scotland, and the University of St Andrews, in a joint project with the University of Aberdeen, is the first UK institution to implement the CERIF-based CRIS [Current Research Information System] product (PURE), made by the Danish company Atira. A key stipulation by both St Andrews and Aberdeen, and supported by Atira, is that the conceptual data model developed for the UK should be made available to other UK Institutions implementing or investigating a CERIF-CRIS, independent of which system they choose.

⁶ http://www.jisc.ac.uk/publications/briefingpapers/2010/bpexriv1.aspx#downloads
⁷ http://www.sfc.ac.uk/research/researchpools/researchpools.aspx
⁸ http://www.timeshighereducation.co.uk/story.asp?storycode=412909
⁹ http://my.supa.ac.uk/
¹⁰ http://www.moodle.org

Aims and Objectives
CRISPool builds on the experience gained by the partners and their desire to work with other Institutions to find practical ways of reducing the overall burden of research information management across the sector. The implementation of PURE has demonstrated the suitability of CERIF for capturing research information internally within the two Institutions (St Andrews and Aberdeen).

The CRISPool project had the following aims:
• To demonstrate that CERIF-XML can be used to bring data from heterogeneous, cross-institutional sources together.
• To provide evidence of the benefits and costs of adopting CERIF-XML as a cross-institutional data exchange format.

The aims were to be met through the main objective:
• To build an initial portal exposing these data on the web with basic search & retrieve functionality and basic technical exposure of data (e.g. fetching data via RSS, XML/SOAP, OAI).

Whilst testing the suitability of CERIF-XML was the primary focus of this project, there was also an expectation that organisational and information systems changes would occur as a direct result of the need to ensure data is up to date, sufficiently accurate and meets the commonly agreed criteria. The results from CRISPool are both transferable (to other exchange scenarios) and scalable (to other CERIF elements not included in this project). These aims and objectives remained constant throughout the project, and the Outputs and Results section below discusses the degree to which they have been met.

Methodology
With a six-month project and multiple partners, the methodology was to build on existing expertise: euroCRIS with their in-depth knowledge of CERIF; Atira with their existing PURE CRIS product and established expertise in CERIF-based CRIS; and the partner Institutions [St Andrews, Glasgow and Edinburgh] with their experience in the area of research information and repository systems. The project was split into three strands:

1. Scoping and Investigation: defining the data model (entities, relationships, constraints) and common vocabularies (people, publications, organisations) to meet the SUPA requirement for an annual publications report. This strand also identified data sources and determined any limitations necessary due to data availability.

Because of the limited time span and resource available to the project at the partner institutions, we kept the data requirements to the minimum that could be provided by Glasgow and Edinburgh and was still useful to SUPA. Thus Glasgow started with their dataset produced for the REF Bibliometrics Pilot project in 2008-9. This data set already linked outputs to staff using the institutional ID. Edinburgh aimed to provide all current academic staff in the School of Physics and Astronomy and then match them against publications data from the Edinburgh Research Archive [ERA]. A comprehensive set of publications data related to current academics in the School of Physics and Astronomy at St Andrews was provided from the PURE CERIF-CRIS database in CERIF-XML format.

Figure 1 : Summary of data flow in CRISPool

2. Technical delivery: configuring and installing PURE for the defined data model and building a CERIF-XML integrator; mapping data sources to CERIF-XML to produce single or multiple data streams for integration into PURE; and building a simple portal to expose data via web pages, web services and RSS feeds.

Again, because of the short timeframe and limited budget, existing technology [the PURE4 product] formed the backend database and administrator functionality. Atira built CERIF-XML export [to export St Andrews data to CERIF-XML] and import functions [to import St Andrews, Glasgow and Edinburgh CERIF-XML] as add-ons to the PURE4 product. Each function was triggered through the administrative interface on an ad-hoc basis or using standard cron job configuration. Finally, a simple portal was created based on the current SUPA website design.

3. Engagement and Evaluation: conducting a baseline review during the SUPA annual data collection round; recording the time and effort needed to identify sources and map them to CERIF; assessing advantages and disadvantages; and ongoing engagement with regional, national and European projects and groups, e.g. ERIS led by Edinburgh [project manager is a member of CRISPool], Enquire led by Glasgow [project manager is also a member of CRISPool], Readiness4Ref (led by KCL), UCISA, WRN/ARMA and euroCRIS.

The engagement strand ran throughout the project and is continuing, for example, at the Repository Fringe, Sep 2010 at Edinburgh. Due to resource issues at SUPA a full baseline review was not carried out; however, feedback from SUPA staff has been incorporated into the Outputs and Results section.

Implementation
Two workshops were held early in the project [March and April 2010] to familiarise all partners with the CERIF model; to finalise the data requirements, taking into account what was achievable over the short timescale and also useful to SUPA; and to share the experiences of the R4R project in mapping RAE2008 to CERIF-XML. We also agreed to create institution-specific unique IDs for the organisations, persons and publications being brought together into CRISPool by using the UK Learner Provider number as a prefix to institutional IDs. The scope of CRISPool did not allow any time to deduplicate or merge data on publications and, potentially, people. A CRISPool project was created within the existing ERIS project online collaboration tool, Basecamp¹¹, to help plan and manage the project.
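The ID convention agreed at the workshops can be illustrated with a minimal sketch. It assumes a simple "prefix, separator, local ID" layout; the function name, the hyphen separator and the sample provider numbers are hypothetical and are not taken from the project files. The 128-character ceiling is the ID limit this report quotes for the CERIF-XML 2008-1.1 schema.

```python
# Hypothetical sketch: build a pool-wide ID by prefixing an institution's
# local ID with its provider number. The "-" separator and the sample
# provider numbers are illustrative assumptions, not project values.

MAX_ID_LENGTH = 128  # ID limit quoted in this report for CERIF-XML 2008-1.1


def make_pool_id(provider_number: str, local_id: str, sep: str = "-") -> str:
    """Return an ID that stays unique when several institutions are pooled."""
    provider_number = provider_number.strip()
    local_id = local_id.strip()
    if not provider_number or not local_id:
        raise ValueError("both the provider number and the local ID are required")
    pool_id = f"{provider_number}{sep}{local_id}"
    if len(pool_id) > MAX_ID_LENGTH:
        raise ValueError(f"ID is longer than {MAX_ID_LENGTH} characters")
    return pool_id


if __name__ == "__main__":
    # The same local staff number at two institutions gives two distinct IDs.
    print(make_pool_id("10000001", "12345"))
    print(make_pool_id("10000002", "12345"))
```

Prefixing avoids collisions between institutions without requiring any of them to renumber local records. It does not by itself detect the same person or publication appearing at two institutions, which is the deduplication work the project left out of scope.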

¹¹ http://basecamphq.com/


Figure 2: Elements of CERIF used in CRISPool

Figure 2 shows the basic CERIF model with the elements used in CRISPool highlighted. In total, 30 CERIF-XML files were used in the pilot:

cfPers_CORE
cfPers_Class-LINK
cfPers_EAddr-LINK
cfPers_OrgUnit-LINK
cfPers_PAddr-LINK
cfPers_ResPubl-LINK
cfPersKeyW-LANG
cfPersName-ADD
cfPersResInt-LANG
cfOrgUnit-CORE
cfOrgUnit_EAddr-LINK
cfOrgUnit_Class-LINK
cfOrgUnit_OrgUnit-LINK
cfOrgUnit_PAddr-LINK
cfOrgUnit_ResPubl-LINK
cfOrgUnitName-LANG
cfResPubl-RES
cfResPubl_Class-LINK
cfResPubl_ResPubl-LINK
cfResPublAbstr-LANG
cfResPublBiblNote-LANG
cfResPublKeyW-LANG
cfResPublAbbrev-LANG
cfResPublSubtitle-LANG
cfResPublTitle-LANG
cfEAddr-2ND
cfPAddr-2ND
cfEAddr_Class-LINK
cfClassTerm-LANG
cfClass-CLASS

Full details are in Appendix 1: CRISPool Data Dictionary, Appendix 2: CRISPool Class Scheme Data and Appendix 3: CRISPool CERIF to PURE4 mapping.

Sourcing the data
For Glasgow this was straightforward once the data requirements had been finalised. The information on persons came from the Institutional HR database and that for publications from the data set produced for the REF bibliometrics pilot. Glasgow considered using data from their institutional repository, but at the time this did not link publications to internal authors via the institutional ID. The data from the HR database was already integrated with the research management system at Glasgow and so there were no problems with reusing this data for CRISPool.

For Edinburgh, detailed person data was provided from the Institutional HR database. While the data provided was of good quality, it is worth noting that it took the team at Edinburgh some time to find the right contact within HR who could authorise use of the data for CRISPool. At Glasgow and St Andrews these links had already been made and so no delay was incurred. The publications data was sourced from the central closed Publications Repository and checked to ensure the bibliographic data could be made publicly available. It had originally been planned to use the public Edinburgh Research Archive, but on investigation this included only two publications by academics in the current HR feed, both journal articles. Edinburgh therefore switched to using data from the closed repository, the Publications Repository, which had many more articles and was the repository used for the RAE submission.

For St Andrews all the required data was sourced directly from the Institution’s PURE4 CRIS, which itself is synchronised daily with data from the Institutional HR database. The CRIS is the golden source of publications data.

Producing the CERIF-XML
Following on from the workshops, the University of St Andrews created template files with some sample data for each of the CERIF-XML files to be used in the pilot. We used documentation from the eurocris.org web-site and advice and examples from the R4R mapping documents. These sample files were validated against the CERIF-XML 2008-1.1 schema at http://www.eurocris.org/fileadmin/cerif-2008/2008_1.1/XML-SCHEMAS/

Note: we started with the CERIF-2008_1.0 version but switched to the later version in order to be able to use longer IDs. Version 1.0 handled IDs up to 32 characters long; version 1.1 handles IDs up to 128 characters.

The CERIF-XML sample files were created using the text editor Notepad. They were distributed to the Universities of Edinburgh and Glasgow via the Basecamp site for them to populate with their own data.
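Populating a template with institutional data can also be scripted. The sketch below is one possible approach, not the method used in the pilot: it fills a record template once per row, escapes characters that are special in XML, enforces the 128-character ID limit quoted above, and confirms the result is well-formed. The element names in the template are placeholders invented for illustration and are not taken from the CERIF-XML schema; a well-formedness check is also weaker than validation against the published schema.

```python
import xml.etree.ElementTree as ET
from string import Template
from typing import Dict, Iterable
from xml.sax.saxutils import escape

MAX_ID_LENGTH = 128  # ID limit this report quotes for schema version 1.1

# Placeholder record layout: these element names are invented for the sketch
# and are NOT taken from the CERIF-XML 2008-1.1 schema.
RECORD_TEMPLATE = Template(
    "  <person>\n"
    "    <id>$person_id</id>\n"
    "    <familyName>$family_name</familyName>\n"
    "  </person>"
)


def fill_record(row: Dict[str, str]) -> str:
    """Fill the template for one row, escaping &, < and > in every value."""
    if len(row["person_id"]) > MAX_ID_LENGTH:
        raise ValueError("ID is longer than the schema allows")
    safe = {key: escape(value) for key, value in row.items()}
    return RECORD_TEMPLATE.substitute(safe)


def build_document(rows: Iterable[Dict[str, str]], root: str = "persons") -> str:
    """Wrap all filled records in an XML declaration and a single root element."""
    body = "\n".join(fill_record(row) for row in rows)
    return f'<?xml version="1.0" encoding="UTF-8"?>\n<{root}>\n{body}\n</{root}>\n'


def is_well_formed(document: str) -> bool:
    """Check well-formedness only; this is not schema validation."""
    try:
        ET.fromstring(document.encode("utf-8"))
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    sample_rows = [{"person_id": "10000001-12345", "family_name": "Smith & Jones"}]
    document = build_document(sample_rows)
    print(document)
    print("well-formed:", is_well_formed(document))
```

Escaping at the point of substitution matters because a single unescaped ampersand in a name or title is enough to make the whole file unparseable.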


The University of Glasgow used MS Excel to create a worksheet for each CERIF-XML file required, populating the worksheets with data on persons from their Human Resources database and publications from the REF Bibliometrics pilot; the latter included links to the HR database via the Institutional staff ID. MS Word Mailmerge was then used to merge the data from each worksheet into the correct CERIF-XML template. The XML header and footer were added to each resulting .doc file, which was saved as .txt and then renamed as .xml. This process took approximately 1 day to complete.

For the University of Edinburgh, generating the CERIF-XML files for persons was a largely manual process. HR provided them with an Excel spreadsheet containing academic names, HESA numbers and job titles. These were then amalgamated with further information manually copied from the school staff webpages. This process took 2.5 days to complete. The publications data was provided as an export from the DSpace Publications Repository but has not yet been converted to CERIF-XML for importing. Related work taking place as part of R4R to create a CERIF plug-in for DSpace is expected to provide this functionality.

SUPA were provided with lists of people from all three Institutions in order to match them to the existing SUPA ID and SUPA theme/s. This data was provided back to St Andrews in spreadsheet format, and CERIF-XML cfPers_OrgUnit-LINK files were produced linking each person to their main and additional SUPA themes. We had originally planned to use a classification for the SUPA themes but switched to using organisations very quickly once we realised that this would allow us more flexibility in linking other entities such as persons and publications to themes. In practice SUPA treat the themes as virtual organisations.

Both sets of files (from Glasgow and Edinburgh) required tidying up before they validated successfully against the CERIF-XML Schemas. The issues included mismatched CERIF tags and elements in the wrong order. There was also a problem initially with files being saved with LATIN-1 encoding rather than UTF-8. All these issues were solved using the freeware text and Unicode editor PSPad¹². On the whole it was a successful and straightforward low-tech process, although there were a couple of more time-consuming problems where inconsistencies in the IDs across files prevented data from being linked correctly once imported. Again these could be solved using PSPad, which had good functionality for checking several files side by side.

What is evident is that the time taken initially to define requirements and prepare sample files was very important. In this project the resource was very limited, and undoubtedly the errors would have been much fewer if more resource had been available at the member institutions. Equally, once a more automated process can be established such errors should be removed completely.

For St Andrews, data was exported using the export framework in Pure. See Appendix 4 for a technical summary from Atira on importing and exporting via CERIF-XML in the Pure product.
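Two of the problems described above, the wrong text encoding and IDs that did not match across files, lend themselves to simple automated checks. The following is a minimal sketch under stated assumptions rather than a description of what the project did (the fixes in the pilot were made by hand in PSPad). It assumes the IDs have already been extracted from each file into plain Python collections; the function names and file names are hypothetical.

```python
from typing import Dict, Iterable, List


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes that were saved as Latin-1 so that they are valid UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def find_unknown_ids(
    known_ids: Iterable[str], referenced: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Map each link file to the IDs it references that no core file defines.

    `known_ids` holds the IDs declared in the core files; `referenced` maps a
    link file name to the IDs that file points at.
    """
    known = set(known_ids)
    problems: Dict[str, List[str]] = {}
    for file_name, ids in referenced.items():
        missing = sorted({item for item in ids if item not in known})
        if missing:
            problems[file_name] = missing
    return problems


if __name__ == "__main__":
    core_person_ids = ["10000001-1", "10000001-2"]
    link_files = {
        "person_org_links.xml": ["10000001-1", "10000001-3"],
        "person_publication_links.xml": ["10000001-2"],
    }
    # Only the first file is reported: it references an ID no core file defines.
    print(find_unknown_ids(core_person_ids, link_files))
```

Running a check of this kind before upload would surface dangling references in one pass, instead of discovering them through failed imports.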

Suitability of CERIF
Most of the data mapped across to the CERIF model easily, but there were two areas where the CERIF data model imposed restrictions. Both of these have been raised with the euroCRIS CERIF Task Group and could be worked around for this pilot using CERIF classifications.

• Issue 1: The placing of a person’s contact details as an attribute of the person rather than an attribute of the relationship between the person and organisation (cfPers_EAddr-LINK, cfPers_PAddr-LINK). In the CRISPool model, which concentrates entirely on work contact details rather than personal contact details, it is normal that a person’s contact details will change as they move from job to job.
o For the pilot a workaround was used whereby the classification of the cfPers_EAddr and cfPers_PAddr relations was used to carry data. See Appendix 3: CRISPool CERIF to PURE4 mapping for details.

• Issue 2: A one-to-one relationship between the publication entity and URI (cfResPubl-CORE.URI). Thus we were unable to record both a DOI¹³ and a URI to a full-text version in the IR against publications. This issue has been discussed within euroCRIS previously and at length, and is a philosophical issue rather than a technical one. The current euroCRIS view is that each Publication object is represented by 1 and only 1 URI; if another URI is needed then that is another Publication object. This debate leads into the definitions of ‘work’, ‘manifestation’ and so on from the FRBR¹⁴ model, and is not part of the CRISPool project.
o For the pilot we restricted ourselves to the DOI as there was more data for this than for URIs to full-text in IRs.

¹² http://www.pspad.com/en/
¹³ http://www.doi.org
¹⁴ http://www.loc.gov/cds/FRBR.html

Importing the CERIF-XML
The CERIF-XML was imported into the CRISPool PURE instance by uploading all the XML files into WebDAV¹⁵ folders. There were four WebDAV folders, one for each institution involved (St Andrews, Glasgow, Edinburgh and SUPA). However, there were some issues with accessing these folders due to the operating systems used by the CRISPool team. A solution was found in using a free program called NetDrive¹⁶ to gain access to the folders.

Once the XML data files had been uploaded into the appropriate WebDAV folders, the data could be imported into PURE by selecting to synchronise it. If there were any problems with the CERIF-XML in any of the files being imported, details of all errors could be accessed after the failed synchronisation: the error list gave the file name and line number of each issue so that it could be amended. This detailed error logging helped to identify inconsistencies between IDs in the separate files, for instance.

The synchronisation jobs could be run repeatedly to update existing data, and this functionality was used, for example, when we received additional data on external authors from Glasgow. Unfortunately, due to the number of external authors on these publications (an average of 150 authors per publication), the import process was taking so long that we decided to limit the authors to the first 5 (including at least one Glasgow author).

Atira adopted an agile approach to the development of the import functionality, working closely with the CRISPool team at St Andrews to test first the organisation and person import and then the publications import. This incremental approach meant we could sort out any issues with the organisations and persons before moving on to the much larger data sets containing publications data. See Appendix 4 for a technical summary from Atira on importing and exporting via CERIF-XML in the Pure product.
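The author limit described above can be expressed as a short rule. The report does not say how the "at least one Glasgow author" condition was enforced, so the sketch below makes an assumption: keep the first five authors in their original order and, if none of them is internal, replace the last kept author with the first internal author found further down the list. The function name and the (name, is_internal) representation are hypothetical.

```python
from typing import List, Tuple

Author = Tuple[str, bool]  # (name, is_internal), an illustrative representation


def limit_authors(authors: List[Author], limit: int = 5) -> List[Author]:
    """Keep at most `limit` authors, retaining one internal author if any exists.

    Assumed rule: take the first `limit` authors in order; if none is internal,
    swap the last kept author for the first internal author later in the list.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    kept = list(authors[:limit])
    if any(is_internal for _, is_internal in kept):
        return kept
    first_internal = next((a for a in authors[limit:] if a[1]), None)
    if first_internal is None:
        return kept  # no internal author anywhere in the list
    return kept[:-1] + [first_internal]


if __name__ == "__main__":
    long_list = [(f"External {n}", False) for n in range(1, 7)] + [("Internal 1", True)]
    for name, is_internal in limit_authors(long_list):
        print(name, "(internal)" if is_internal else "")
```

Any truncation of this kind trades completeness of the author list for import speed, so the cut-off and the replacement rule would need to be agreed with the data owners.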

Outputs and Results

The CRISPool project has several deliverables, the first being the CRISPool portal itself: http://crispool.atira.dk/portal/

The portal has been designed to look and feel the same as the SUPA website. The front page displays a selection of the most recent publications. A search bar allows anyone to search through researchers, organisations and publications. A navigation menu on the right side of the portal pages allows a user to browse the available data alphabetically. The portal also offers a 'statistics' option, which shows charts of the volume and format of research of the institutions involved over the last 5 years. RSS feeds are also available from the portal.

The CRISPool project successfully employed the CERIF data model (2008 version 1.1). This is demonstrated by the CERIF-XML files created during the project, each of which conforms to the data model documentation and to the schema, which can be found on the euroCRIS website: http://www.eurocris.org/fileadmin/cerif-2008/2008_1.1/XML-SCHEMAS/ The data model used in CRISPool is described in detail in Appendices 1, 2 and 3.

A less tangible output is the transferable skills developed by the partner institutions in mapping internal data sources to CERIF-XML. The mapping process led to an improved understanding of how the CERIF data model works in a practical situation. For example, at Glasgow, this knowledge helped inform the JISC-funded Enquire project [17] looking at Research Council Outcomes and Outputs.

15 http://www.webdav.org/
16 http://www.sitepoint.com/blogs/2004/10/03/novell-netdrive-webdav-client-for-windows/
17 http://researchoutcomes.wordpress.com/


Overall, sourcing, mapping and producing the initial CERIF-XML files took between 2 and 5 days once the sample template files had been created. The project team at St Andrews then spent further time checking and amending the files which, as they had been produced semi-manually and to a tight deadline, were prone to mistyping [or mis-copying/pasting!]. This was a one-off, relatively manual process for both Glasgow and Edinburgh, and further work would be needed to turn it into a sustainable production of CERIF-XML that keeps the portal up to date and removes the errors in the files. For St Andrews the CERIF-XML files were created directly from PURE using the functionality that Atira developed.

As described in the Methodology section, a systematic baseline review could not be carried out because of resource issues at SUPA. However, SUPA provided the following feedback on the work it does to gather data from all 6 Institutions [SUPA has only recently been expanded to 8 Institutions]:

'I'd split the data gathering into two types: data gathering about people in SUPA, and the publication list (which is informed by the first process, however). For the first type, which is gathering the essential contact data for each member of SUPA: This takes approximately 1 month, which includes one week of solid work plus additional time to follow up with the institutions and verify accuracy. For the publication data exercise: This takes approximately 3 months, which includes a mix of solid work periods and following up with institutions. This process includes the initial meetings covering scope, the request for information, the follow up with individual institutions and the accuracy verification, collation and report publishing.'

For St Andrews, prior to the implementation of Pure, a School Administrator took 4 days [spread over 2 weeks] to run Web of Science searches for all academic staff and research fellows. These data were not checked by individuals due to lack of time. With Pure in place, each individual member of staff can maintain an up-to-date, accurate publication list that can then be fed out to CRISPool regularly, not just once a year as now. These data are also reused in other online pages such as School web sites. Because the data is collected only once, this saves time for School Administrators and individual researchers; it also improves data quality and timeliness, with researchers taking responsibility for their own data.

If the production of XML data streams can be automated at member institutions and synchronised within the portal, this would cut out the annual process of data collection by emailing spreadsheets back and forth between SUPA and each of the institutions. It would also mean that corrections and additions prompted by the publication of information through the pool portal could be made directly in the source institutional information systems, saving the staff time spent making separate updates to the SUPA systems and the institutional systems.


Outcomes

The project plan listed a set of evaluation factors and questions to address. These are repeated below with an update following the project's conclusion.

Factor to Evaluate: Suitability of CERIF 2008
Questions to Address: Does it contain all data elements required? Can it be easily extended if not?
Method(s): Evaluate against SUPA requirements
Measure of Success: All elements exist or can be easily added
Outcome: All but 2 directly mapped. Issue 1 - contact details against person not person-organisation relation; worked around using classification. Issue 2 - single URI per publication; philosophical point to be discussed with euroCRIS; opted for DOI not IR handle as more data; could have been addressed with classification.

Factor to Evaluate: Ease of mapping to CERIF-XML
Questions to Address: What level of technical expertise required?
Method(s): Feedback from technical experts who did the mapping
Measure of Success: Technical expertise already available at member institutions or easily acquired
Outcome: Yes - standard text editor tools used by St Andrews, Glasgow and Edinburgh. Basic relational db understanding necessary and knowledge of staff and publications data held by University.

Factor to Evaluate: Usefulness of CERIF-CRIS
Questions to Address: Is the CERIF-CRIS an improvement on previous solution? If so, in what way? If not, in what way?
Method(s): Evaluation/Feedback
Measure of Success: CRISPool solution extended to other member institutions
Outcome: SUPA - yes, dynamic searchable portal much better than fixed pdf publications list. At least one other Pool expressed interest. Member Institutions - in principle yes, but requires up to date central data sources and further work to automate CERIF-XML production. So not a CERIF issue in itself.

Factor to Evaluate: Overall time/cost savings
Questions to Address: Across all stakeholders, does use of CERIF-XML increase/decrease resource/cost?
Method(s): Evaluation [before/after]
Measure of Success: Decreases resource/cost or further investigation needed over longer time period
Outcome: Main technical issue is fragmentation of CERIF-XML, which means import processes are resource intensive. Further investigation needed. However it is indicative that SUPA were unable to collect data this year using existing method because too resource intensive.

Factor to Evaluate: Data quality
Questions to Address: (1) Does using a CERIF-CRIS facilitate improvement in data quality? (2) Does using a publicly available portal facilitate improvement in data quality?
Method(s): Evaluation [before/after]
Measure of Success: Data quality improved or likely to be improved
Outcome: (1) Neutral - as data quality from partner institutions already good for the limited subset of data we were working with. (2) Not answered as portal not public yet.

Factor to Evaluate: Project success and impact
Questions to Address: To what extent has the project delivered on objectives and how useful are the project's findings?
Method(s): End Project Report/Lessons Learned
Measure of Success: Funder, partners and stakeholder feedback positive. CRISPool solution extended to other member institutions and CERIF entities
Outcome: Partner and stakeholder feedback positive and keen to move pilot to sustainable system; more pools are interested.



Conclusions

On the technical side the project has been straightforward: CERIF is flexible and comprehensive and for the most part does not require expertise over and above standard relational database modelling. The exception is the use of Classification Schemes, particularly when used with Link entities. Here the project benefited from the model expertise of euroCRIS and the practical experience of Atira. For those who do not have access to this expertise and experience, it would be very useful to have more sample CERIF-XML files available on the euroCRIS.org web site.

The project has come across a couple of areas where the CERIF data model has not met our needs, or not immediately, and discussion of these is being taken forward in the CERIF Task Group. In one case a workaround was created relatively simply by extending the use of the CERIF classification concept. In the other case a similar workaround could have been employed had time allowed. The discussion with euroCRIS is therefore about whether such workarounds are the correct way to extend CERIF or whether the core CERIF data model should be extended.

The resource-intensive nature of processing CERIF-XML, which is due to its fragmentation into many separate XML files, does need to be addressed, whether by improving the algorithms that process the data or by adjusting the CERIF model; however, it is difficult to see how the latter can be done without losing the flexibility of the model. This proved to be such a problem when importing co-authors on some of the Glasgow papers, where typically 150-200 authors existed on each paper, that we had to limit the data to the 5 named authors (including at least one Glasgow author).

Finally, the project has shown that in order to take best advantage of an initiative such as CRISPool, Institutions need at least their publications and staff data joined up. For Glasgow this was straightforward, as they were able to provide the publications data set from the REF bibliometrics pilot, which was linked to internal authors via the HR staff ID. Going forward they are now linking the full publications data set in their Institutional Repository to internal authors and so will be able to provide a more comprehensive set of publications data in the future. For Edinburgh the publications data repository does hold the staff id for the user, which could have been matched with the HR database to allow the publications to be matched easily to persons. St Andrews were able to provide comprehensive data on people and their publications directly from their CRIS.
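The author cap applied to the Glasgow papers can be sketched as follows. The report does not say how the "at least one Glasgow author" condition was enforced, so the rule used here (swap the last kept author for the first home-institution author found further down the byline) is an assumption, as are the function name and the tuple layout; 10007794 is the Glasgow UKPRN listed in Appendix 1.

```python
def limit_authors(authors, home_ukprn, max_authors=5):
    """Keep the first `max_authors` authors, ensuring one home-institution author.

    `authors` is a list of (name, ukprn_or_None) tuples in byline order.
    Assumption: if none of the first `max_authors` belongs to the home
    institution, the last kept author is replaced by the first home author
    found later in the byline.
    """
    kept = list(authors[:max_authors])
    if not kept or any(ukprn == home_ukprn for _, ukprn in kept):
        return kept
    for author in authors[max_authors:]:
        if author[1] == home_ukprn:
            kept[-1] = author
            break
    return kept


# A 150-author paper whose only Glasgow author is 100th in the byline.
byline = [(f"Author {i}", None) for i in range(1, 151)]
byline[99] = ("Glasgow Author", "10007794")
print(limit_authors(byline, "10007794"))
```

Capping the list this way trades completeness of the author record for import time, which is the compromise the project accepted.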

Implications

There are specific implications for CRISPool and more generic implications for adopting CERIF-XML as an exchange format within the UK.

CRISPool

The project partners are keen to take CRISPool forward from a pilot to a live system. However, we first need to identify a clear, achievable objective, such as bringing in people and publications data for all members of SUPA to support decision-making and specific reporting requirements. We then need to develop processes to produce the CERIF-XML data automatically and regularly from the various source databases that exist within member institutions. This is not a small undertaking and requires buy-in from the partner institutions to the bigger picture across the UK research domain: that improved information management and operational efficiency can be gained by adopting CERIF-XML as the exchange format. CRISPool has demonstrated that those with an integrated research system or CRIS are already at an advantage here. Finally, but importantly, we need a CERIF-CRIS to bring the data together, with functionality to view, search, report, and so on. Atira have worked as project partners on the pilot, but there is no agreement to continue beyond the end of the pilot, and any commercial solution would need to follow the normal procurement route.

It is also worth noting that at least one other Research pool, SICSA [18], has already expressed interest in the idea, so a project able to provide data for both pools from the common member institutions could be another option; its aim would be to demonstrate the scalability and transferability of using CERIF-XML for this purpose.

CERIF-XML in general

On the technical side, and of relevance to others working with CERIF-XML, the resource-intensive nature of processing CERIF-XML needs to be addressed. This could be done by reviewing the CERIF-XML model itself (which runs the risk of reducing CERIF's ability to model the research information domain accurately) or by improving the technology that processes large and/or fragmented XML files. It should be noted that in other applications, especially those relating to research information, XML has proved to be an inefficient exchange format. EXEM has been developed to (partially) overcome this: http://portal.acm.org/citation.cfm?id=1285888 . However, the article at http://www.criticism.com/dita/dss.html suggests that using XML provides gains over legacy data exchange mechanisms. Considering the spreadsheet exchange method used by SUPA hitherto, this appears to be borne out by CRISPool despite the apparent inefficiency of CERIF-XML.

Brigitte Joerg, CERIF Task Group leader, comments:

'I understand the fragmentation is seen as a problem. But, from my whole experience with ontologies, with respect to interchange is still the most appropriate format - and makes it very flexible to map to from legacy systems. Especially due to the fact of fragmentation, you can exchange just the data that you need. Imagine an interrelated or networked ontological graph (which can be based on XML too). Here it becomes a problem of where to cut of - and where to locate the related data. I think - the only way to improve fragmentation in CERIF-XML, would be, to define mini-CERIF-Subontologies - like for person, including all the related entities and their basic attributes and also all the relationships for a particular context. That would mean, your CERIF Person Ontology would integrate the related entities - and you could consider such a Person Ontology as your "integration" manager for person records, because it tells you about all the entities you want to involve, and about all the attributes and relationships that come with them - according to your specification. Ontologies try to integrate information based on a real world view - they use URIs for interconnection - but finally they are also XML-based. They do the opposite of fragmentation - here you have to deal with the complexity - but down to the physical level - you still deal with XML.'

For both CRISPool and CERIF-XML in general, further JISC support is recommended, whether by funding follow-on project/s or by extending the scope of an existing project such as ERIS (for CRISPool) or R4R (for CERIF-XML in general).

References

Rogers, N and Ferguson, N (2009), Exchanging Research Information in the UK. EXRI-UK: A study funded by JISC. http://ie-repository.jisc.ac.uk/448/1/exri_final_v2.pdf

Price, D and Caddick, S (2010), How to stay on top, Times Higher, 5th Aug 2010. http://www.timeshighereducation.co.uk/story.asp?storycode=412909

Joerg, B, van Grootel, G and Jeffery, K [Eds], CERIF 2008 1-1 XML Data Exchange Format Specification. http://www.eurocris.org/fileadmin/cerif-2008/CERIF2008_1.1_XML.pdf

Joerg, B et al, CERIF 2008 1-1 Semantics. http://www.eurocris.org/fileadmin/cerif-2008/CERIF2008_1.1_Semantics.pdf

Natchetoi, Y, Wu, H, Babin, G and Dagtas, S (2007), EXEM: Efficient XML data exchange management for mobile applications, Information Systems Frontiers, 9, 439-448. http://portal.acm.org/citation.cfm?id=1285888

Hoenisch, S (2005), Using Data Structure Standards to Foster Efficiency and Opportunity. http://www.criticism.com/dita/dss.html

18 www.sicsa.ac.uk


Appendix 1

CRISPool Data Dictionary Niall Lockhart, Anna Clements

Version 1 30/04/10

Nal, akc

Version 2 11/05/10: a few classschemeIDs and classIDs revised for consistency and to match CRISPool Class Scheme Data.doc Also blanket changed cfPublicationId to cfResPublId Some minor corrections to xml files i.e. missing ‘<’ s To find changes look for 11/05/10

Akc

Version 2.1 19/05/10 Add info and examples for external people – in cfPers_CORE, cfPersName-ADD and cfPers_ResPubl-LINK

Nal

Version 2.2 24/08/10 Updated all tables to reflect use of CERIF 2008 V1.1

Akc, Nal

Final Version 2.3 30/08/10 Update at end of project

Note on IDs

CRISPool is bringing together data from several UK Institutions and will use a combination of the UK Provider Reference Number (UKPRN) plus the Institutional internal ID to ensure uniqueness of IDs within CRISPool for Person and Publication records. UKPRNs can be found at http://www.ukrlp.co.uk. For CRISPool we need:

University of Edinburgh | 10007790
Glasgow University | 10007794
University of St Andrews | 10007803
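As an illustration of this ID scheme, the sketch below builds person and publication IDs from a UKPRN and an internal ID, following the prefix patterns and examples given in the data dictionary tables of this appendix. The function names and the dictionary are hypothetical helpers, not part of CRISPool or Pure.

```python
# UKPRNs as listed in the note above.
UKPRN = {
    "University of Edinburgh": "10007790",
    "Glasgow University": "10007794",
    "University of St Andrews": "10007803",
}


def person_id(ukprn, inst_id, external=False):
    """Build a CRISPool person ID; external authors get an extra 'ext' segment."""
    if external:
        return f"person-{ukprn}-ext-{inst_id}"
    return f"person-{ukprn}-{inst_id}"


def publication_id(ukprn, pub_id):
    """Build a CRISPool publication ID from a UKPRN and an institutional ID."""
    return f"publication-{ukprn}-{pub_id}"


print(person_id(UKPRN["University of St Andrews"], "akc"))
print(person_id(UKPRN["University of St Andrews"], "0092169", external=True))
print(publication_id(UKPRN["Glasgow University"], "801001"))
```

Prefixing every internal ID with the UKPRN is what keeps IDs from different institutions from colliding once they are pooled.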

Akc 19/05/10: For external authors, we simply have to assume that each one is a separate entity unless the Institution has some kind of database of external authors. Examples have been added in the relevant tables below. There is no need to link such persons to an organisation, but they do need to be linked to publications. Tables affected: cfPers-CORE, cfPersName-ADD, cfPers_ResPubl-LINK.

NAL 24/08/10: It is important to note that all CERIF data contained within the document relates to CERIF 2008 version 1.1 and is correct at the time of writing.

NAL 31/08/10: Where the elements cfFraction, cfStartDate and cfEndDate are not supplied, default values shall be used: “1” (cfFraction), “1900-01-01T00:00:00.000+01:00” (cfStartDate) and “2099-12-31T00:00:00.000+01:00” (cfEndDate). Also, 3 tables are marked with a “*” to show that they have not been implemented in this version of CRISPool.
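The defaulting rule for link records can be sketched as below. This is an illustrative helper, not code from the project; the function name is hypothetical, and the start-date default is read here as 1900-01-01, which is an interpretation of the value given in the note above.

```python
# Defaults from the NAL 31/08/10 note (start date read as 1900-01-01).
LINK_DEFAULTS = {
    "cfFraction": "1",
    "cfStartDate": "1900-01-01T00:00:00.000+01:00",
    "cfEndDate": "2099-12-31T00:00:00.000+01:00",
}


def apply_link_defaults(record):
    """Return a copy of a link record with missing or empty defaults filled in."""
    filled = dict(record)
    for field, default in LINK_DEFAULTS.items():
        if not filled.get(field):
            filled[field] = default
    return filled


link = {
    "cfPersId": "person-10007803-akc",
    "cfClassId": "internal-person",
    "cfClassSchemeId": "class-scheme-person-types",
    "cfFraction": "0.5",
}
print(apply_link_defaults(link))
```

Values that are supplied, such as the cfFraction of "0.5" here, are left untouched; only absent or empty fields receive the defaults.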

PERSON

cfPers-CORE
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id INTERNAL; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”. Akc 19/05/10: Unique person id EXTERNAL; person-[UKPRN]-ext-[simple id] e.g. “person-10007803-ext-0092169”. For external authors suggest just create a sequential numeric id [For St Andrews can use internal PureID]
cfSex | String : 1 char | y | Examples: “m”, “f”
cfURI | String : 128 chars | | Example: “http://dept.physics.gla.ac.uk/staff/default.asp?record=672”

cfPers_Class-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfClassId | String : 128 chars | y | Examples: “internal-person”, “external-person”, “1626”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-person-types”, “class-scheme-hesa-identifiers”, “class-scheme-wos-identifiers”, “class-scheme-supa-identifiers”
cfFraction | Float | y | Examples: “1.0”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfPers_EAddr-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfEAddrId | String : 128 chars | y | Unique email address id: email-[UKPRN]-[Person InstID] e.g. “email-10007803-akc”
cfClassId | String : 128 chars | y | Examples: “email”, “skype”
cfClassSchemeId | String : 128 chars | y | Scheme: “class-scheme-eaddress-types”
cfFraction | Float | y | Examples: “1.0”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
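The type and mandatory columns of these tables lend themselves to a simple pre-upload check. The sketch below tests a cfPers_EAddr-LINK record against the 128-character limits given above; treating only the four identifier fields as required is an assumption made for illustration (cfFraction and the dates have defaults), and the function name is hypothetical.

```python
# Length limits from the cfPers_EAddr-LINK table; the REQUIRED list is an assumption.
MAX_LENGTHS = {
    "cfPersId": 128,
    "cfEAddrId": 128,
    "cfClassId": 128,
    "cfClassSchemeId": 128,
}
REQUIRED = ("cfPersId", "cfEAddrId", "cfClassId", "cfClassSchemeId")


def check_record(record):
    """Return a list of human-readable problems found in one link record."""
    problems = []
    for field in REQUIRED:
        if not record.get(field):
            problems.append(f"{field}: missing")
    for field, limit in MAX_LENGTHS.items():
        value = record.get(field, "")
        if len(value) > limit:
            problems.append(f"{field}: {len(value)} chars, limit {limit}")
    return problems


good = {
    "cfPersId": "person-10007803-akc",
    "cfEAddrId": "email-10007803-akc",
    "cfClassId": "email",
    "cfClassSchemeId": "class-scheme-eaddress-types",
}
print(check_record(good))
print(check_record({**good, "cfClassId": ""}))
```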

cfPers_OrgUnit-LINK
11/05/10 Changed Content examples to make consistent with CRISPool Class Scheme Data.doc
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfOrgUnitId | String : 128 chars | y | Unique organisation unit address id: organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-80UNIV”, “organisation-supa-condensed-matter-material-physics”, “organisation-supa-nuclear-plasma-physics”
cfClassId | String : 128 chars | y | Examples: “academic”
cfClassSchemeId | String : 128 chars | y | Scheme: “class-scheme-job-families”
cfFraction | Float | y | Examples: “1.0”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfPers_PAddr-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfPAddrId | String : 128 chars | y | Unique postal address id: paddress-[UKPRN]-[Organisation-InstID] e.g. “paddress-10007803-40SCPHAS”
cfClassId | String : 128 chars | y | Examples: “work”
cfClassSchemeId | String : 128 chars | y | Scheme: “class-scheme-paddress-types”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”



cfPers_ResPubl-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id INTERNAL; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”. Akc 19/05/10: Unique person id EXTERNAL; person-[UKPRN]-ext-[simple id] e.g. “person-10007803-ext-0092169”. For external authors suggest just create a sequential numeric id [For St Andrews can use internal PureID]
cfResPublId | String : 128 chars | y | Unique publication id; publication-[UKPRN]-[PublicationID] e.g. “publication-10007794-801001”
cfClassId | String : 128 chars | y | Examples: “is-editor-of”, “is-author-of”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-cerif-person-publication-roles”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfPersKeyW-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 32 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfLangCode | String : 5 chars | y | Examples: “en-GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfKeyW | String : 255 chars | | Examples: “Artificial Intelligence, AI, Human Computer Interfaces”, “Physics, Space, Satellite”
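To show how one row of the cfPers_ResPubl-LINK table might be serialised, the sketch below writes a record to XML using the element names from the data dictionary. The wrapper element name and the flat nesting are assumptions for illustration only; they are not taken from the CERIF 2008 XML schema, which should be consulted for the real structure.

```python
import xml.etree.ElementTree as ET

# Element order follows the cfPers_ResPubl-LINK table above.
FIELD_ORDER = [
    "cfPersId", "cfResPublId", "cfClassId", "cfClassSchemeId",
    "cfFraction", "cfStartDate", "cfEndDate",
]


def link_element(record, tag="cfPers_ResPubl"):
    """Build an XML element for one link record (illustrative structure only)."""
    element = ET.Element(tag)
    for name in FIELD_ORDER:
        if name in record:
            ET.SubElement(element, name).text = str(record[name])
    return element


authorship = {
    "cfPersId": "person-10007803-akc",
    "cfResPublId": "publication-10007794-801001",
    "cfClassId": "is-author-of",
    "cfClassSchemeId": "class-scheme-cerif-person-publication-roles",
}
print(ET.tostring(link_element(authorship), encoding="unicode"))
```

Because each link is a small self-contained record of this kind, a paper with 150 authors produces 150 such elements, which helps explain why the publication imports were slow.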



*cfPersName_Pers-LINK
31/08/10 - Due to uncertainty of required data this table has not been used in CRISPool
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId1 | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfPersId2 | String : 128 chars | y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfClassId | String : 128 chars | y | Example: “spelling-variant”
cfClassSchemeId | String : 128 chars | y | Examples: “class-scheme-person-name-variants”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00.000+01:00”, “1999-12-31T00:00:00.000+01:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfPersNameVar | String : 128 chars | | Unknown Data

cfPersName-ADD
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”. Akc 19/05/10: Unique person id EXTERNAL; person-[UKPRN]-ext-[simple id] e.g. “person-10007803-ext-0092169”. For external authors suggest just create a sequential numeric id [For St Andrews can use internal PureID]
cfFamilyNames | String : 64 chars | y | “Clements”
cfOtherNames | String : 64 chars | |
cfFirstNames | String : 64 chars | y | “Anna Katharine”

cfPersResInt-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfLangCode | String : 5 chars | y | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfResInt | NClob | | Examples: “John Smith's current research subject areas are Artificial Intelligence and Human Computer Interfaces.”

OrganisationUnit cfOrgUnit-CORE Element cfOrgUnitId

Type CERIF String : 128 chars

cfAccro

String : 16 chars

cfURI

String : 128 chars

cfOrgUnit_Class-LINK 30/08/10 Added Element Type CERIF cfOrgUnitId String : 128 chars

cfClassId

String : 128 chars

Pure

Pure

Mandatory CERIF y

Pure y

Content

Mandatory CERIF y

Pure y

Unique organisation unit id; organisation-[UKPRN][Organisation InstID] e.g. “organisation-1000780340SCPHAS” Example: “Physics” Example: “http://www.gla.ac.uk/departments/physics/”

Content

y


Unique organisation unit id; organisation-[UKPRN][Organisation InstID] e.g. “organisation-1000780340SCPHAS” Examples : “university”



cfClassSchemeId

String : 128 chars

y

cfFraction

Float

y

cfStartDate

Date

y

cfEndDate

Date

y

cfOrgUnit_EAddr-LINK Element Type CERIF cfOrgUnitId String : 128 chars

Pure

Mandatory CERIF y

cfEAddrId

String : 128 chars

y

cfClassId

String : 128 chars

y

cfClassSchemeId

String : 128 chars

y

cfFraction

Float

y

cfStartDate

Date

y

cfEndDate

Date

y

cfOrgUnit_OrgUnit-LINK Element Type CERIF cfOrgUnitId1 String : 128 chars

Pure

“school” “research-pool” “research-theme” Schemes : “class-scheme-organisation-types” Examples: “1.0”, “0.5” Examples: “2001-01-0101T00:00:00”, “1999-12-3101T00:00:00” Examples: “2001-01-0101T00:00:00”, “1999-12-3101T00:00:00”

Content Pure y

y

Mandatory CERIF y

Unique organisation unit id; organisation -[UKPRN][Organisation-InstID] e.g. “organisation-1000780340SCPHAS” Unique email address id: email-[UKPRN]-[Organisation-InstID] e.g. “email-10007803-40SCPHAS” Examples : “email” “skype” Scheme: “class-scheme-eaddress-types” Examples: “1”, “0.5” Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00” Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

Content Pure y


Unique organisation unit id; organisation-[UKPRN]-



cfOrgUnitId2

String : 128 chars

y

cfClassId

String : 128 chars

y

cfClassSchemeId

String : 128 chars

y

cfFraction

Float

y

cfStartDate

Date

y

cfEndDate

Date

y

cfOrgUnit_PAddr-LINK Element Type CERIF cfOrgUnitId String : 128 chars

Pure

y

Mandatory CERIF y

cfPAddrId

String : 128 chars

y

cfClassId

String : 128 chars

y

cfClassSchemeId

String :128 chars

y

cfFraction

Float

y

cfStartDate

Date

y

cfEndDate

Date

y

[Organisation-InstID] e.g. “organisation-1000780340SCPHAS” Unique organisation unit id; organisation -[UKPRN][Organisation-InstID] e.g. “organisation-1000780340SCPHAS” Examples: “is-parent-of” Scheme: “class-scheme-organisation-relationship-types” Examples: “1”, “0.5” Examples: “2001-01-0101T00:00:00”, “1999-12-3101T00:00:00” Examples: “2001-01-0101T00:00:00”, “1999-12-3101T00:00:00”

Content Pure y

y

Unique organisation unit id; organisation-[UKPRN][Organisation-InstID] e.g. “organisation-1000780340SCPHAS” Unique postal address id: paddress-[UKPRN]-[OrganisationInstID] e.g. “paddress-10007803-40SCPHAS” Examples: “work” Scheme: “class-scheme-paddress-types” Example “1”, “0.5” Examples: “2001-01-0101T00:00:00”, “1999-12-3101T00:00:00” Examples: “2001-01-0101T00:00:00”, “1999-12-3101T00:00:00”

cfOrgUnit_ResPubl-LINK 24/08/10 cfOrgUnitId actually declared as 32 chars on euroCRIS website but this is a mistake, RA from Atira has alerted euroCRIS to this error.


Element cfOrgUnitId

Type CERIF String : 128 chars

cfResPublId

String : 128 chars

y

cfClassId

String : 128 chars

y

cfClassSchemeId

String : 128 chars

y

cfFraction

Float

y

cfStartDate

Date

y

cfEndDate

Date

y

cfOrgUnitName-LANG Element Type CERIF cfOrgUnitId String : 128 chars

Pure

Pure

Mandatory CERIF y

y

Mandatory CERIF y

cfLangCode

String: 5 chars

y

cfTrans

String :1 chars

y

cfName

String : 255 chars

Content Pure y

Unique organisation unit id; organisation-[UKPRN]-[InstID] e.g. “organisation-10007803-80UNIV” Unique publication id; publication-[UKPRN]-[PublicationID] e.g. “publication-10007794-801001” Examples: “is-publisher-of” “is-author-institution-of” “claims-ipr” Schemes: “class-scheme-cerif-orgunit-publication-roles” Examples: “1”, “0.5” Examples: “2001-01-0101T00:00:00”, “1999-12-3101T00:00:00” Examples: “2001-01-0101T00:00:00”, “1999-12-3101T00:00:00”

Content Pure y

y

ResultPublication cfResPubl-RES


Unique organisation unit id; organisation-[UKPRN][Organisation-InstID] e.g. “organisation-10007803-80UNIV” Examples: “en_GB” “DE” Examples: “o” Examples: “The University of St Andrews”



Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfResPublDate | Date | | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfNum, cfVol, cfEdition, cfSeries, cfIssue, cfStartPage, cfEndPage, cfTotalPages, cfISBN, cfISSN | String : 30 chars (each) | |
cfURI | String : 128 chars | | Example: “http://www.st-andrews.ac.uk/departments/physics/book”

cfResPubl_Class-LINK
11/05/10 Changed Content examples to make consistent with CRISPool Class Scheme Data.doc
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfClassId | String : 128 chars | y | Examples: “textbook”, “journal-article”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-cerif-publication-types”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00.000+01:00”, “1999-12-31T00:00:00.000+01:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”


cfResPubl_ResPubl-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId1 | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfResPublId2 | String : 128 chars | y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010102”
cfClassId | String : 128 chars | y | Examples: “is-part-of”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-cerif-publication-publication-roles”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfResPublAbstr-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | y | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfAbstr | NClob | | Examples: “An abstract of a publication would be written here.”

cfResPublBiblNote-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | | Examples: “o”
cfBiblNote | String : 255 chars | | Examples: “Additional information on publication up to 255 characters.”

cfResPublKeyW-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | y | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfKeyW | String : 255 chars | | Examples: “Physics, Space, Light, Gravity.”

*cfResPublAbbrev-LANG
Not used within CRISPool as no institution has this data available at this time.
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | y | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfAbbrev | String : 255 chars | | Examples: “Abbreviated title of an article.”



*cfResPublSubtitle-LANG
Not used within CRISPool as no institution has this data available at this time.
Element | Type (CERIF) | Mandatory (CERIF, Pure) | Content
cfResPublId | String : 128 chars | y, y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | y | Examples: “en_GB” “DE”
cfTrans | String : 1 chars | y | Examples: “o”
cfSubtitle | String : 255 chars | | Examples: “Bloggs blogs about blogs”

cfResPublTitle-LANG
11/05/10 correction to xml tags to make valid
Element | Type (CERIF) | Mandatory (CERIF, Pure) | Content
cfResPublId | String : 128 chars | y, y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | y | Examples: “en_GB” “DE”
cfTrans | String : 1 chars | y | Examples: “o”
cfTitle | String : 255 chars | y | Examples: “An Example of a Textbook”

Other

cfClassTerm-LANG
11/05/10 Changed Content examples to make consistent with CRISPool Class Scheme Data.doc
Element | Type (CERIF) | Mandatory (CERIF, Pure) | Content
cfClassId | String : 128 chars | y | Examples: “academic-teaching” “academic-research”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-10007803-job-families” “class-scheme-10007994-job-families”
cfLangCode | String : 5 chars | y | Examples: “en_GB” “DE”
cfTrans | String : 1 chars | y | Examples: “o”
cfTerm | String : 64 chars | | Examples: “Academic Teaching” “Academic Research”

ND

cfEAddr-2
Element | Type (CERIF) | Mandatory (CERIF, Pure) | Content
cfEAddrId | String : 128 chars | y, y | Unique email address id: email_UKPRN_[Person InstID] {OR} email-[UKPRN]-[Organisation-InstID] e.g. “email-10007803-et37”, “email-10007803-40SCPHAS”
cfPAddrId | String : 128 chars | | Unique postal address id: paddress-[UKPRN]-[OrganisationInstID] e.g. “paddress-10007803-80UNIV”
cfURI | String : 128 chars | | Examples: “et37@st-andrews.ac.uk”

ND

cfPAddr-2
Element | Type (CERIF) | Mandatory (CERIF, Pure) | Content
cfPAddrId | String : 128 chars | y, y | Unique postal address id: paddress-[UKPRN]-[OrganisationInstID] e.g. “paddress-10007803-80UNIV”
cfCountryCode | String : 2 chars | | Examples: “UK” “DE”
cfAddrline1 | String : 80 chars | |
cfAddrline2 | String : 80 chars | |
cfAddrline3 | String : 80 chars | |
cfAddrline4 | String : 80 chars | |
cfAddrline5 | String : 80 chars | |
cfPostCode | String : 16 chars | |
cfCity/Town | String : 64 chars | |
cfStateOfCountry | String : 64 chars | |
cfURI | String : 128 chars | |

cfClassScheme-CLASS
Element | Type (CERIF) | Mandatory (CERIF, Pure) | Content
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-organisation-types” “class-scheme-cerif-publication-publication-roles”
cfURI | String : 128 chars | | Examples: “/uk/crispool/organisation/types” “/org/eurocris/cerif/publication/publication/roles”

cfClass-CLASS
Element | Type (CERIF) | Mandatory (CERIF, Pure) | Content
cfClassId | String : 128 chars | y | Examples: “supa-physics-and-life-sciences” “in-book”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-supa-themes” “class-scheme-cerif-publication-publication-roles”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfURI | String : 128 chars | | Examples: “/uk/crispool/supa/themes/physics-and-life-sciences” “/org/eurocris/cerif/publication/types/in-book”



Appendix 2

Class Scheme Data Niall Lockhart, Anna Clements

Version 1 05/04/10

Nal

Version 2.0 02/06/10 Added personal job titles for Glasgow

Akc

Version 2.1 30/05/10 Remove cfTerm for wos and hesa

Nal

Version 2.2 24/08/10 Updated supa identifiers and schema. Job families also modified. Added paddress types, person types and organisation types.

This document lists the values for each of the class schemes to be used in CRISPool. Those with ‘cerif’ in the title are taken from the documentation on the euroCRIS website. See http://www.eurocris.org/fileadmin/cerif-2008/CERIF2008_1.1_Semantics.pdf

class-scheme-eaddress-types cfClassId cfTerm email Email Address skype

Skype Address

class-scheme-paddress-types cfClassId cfTerm work Work Address home

Home Address

class-scheme-person-types cfClassId cfTerm externalInternal Person person internalExternal Person person

Link Entity cfOrgUnit_EAddr cfPers_EAddr cfOrgUnit_EAddr cfPers_EAddr

Link Entity cfOrgUnit_PAddr cfPers_PAddr cfOrgUnit_PAddr cfPers_PAddr

Link Entity cfPers_Class cfPers_Class

class-scheme-organisation-relationship-types cfClassId cfTerm Link Entity is-parent-of Is Parent Of cfOrgUnit_OrgUnit


class-scheme-cerif-orgunit-publication-roles
cfClassId | cfTerm | Link Entity
is-publisher-of | Is Publisher Of | cfOrgUnit_ResPubl
claims-ipr | Claims IPR Of | cfOrgUnit_ResPubl
curator | Is Curator Of | cfOrgUnit_ResPubl
reviewer | Provides Reviewer For | cfOrgUnit_ResPubl
is-author-of | Is Author Of | cfOrgUnit_ResPubl
commissioned | Has Commissioned | cfOrgUnit_ResPubl
funded | Is Funded By | cfOrgUnit_ResPubl
author-institution | Is Author Institution Of | cfOrgUnit_ResPubl
publishing-inst | Is Publishing Institution Of | cfOrgUnit_ResPubl
external-org | Is External Institution Of | cfOrgUnit_ResPubl

class-scheme-personal-titles
cfClassId | cfTerm | Link Entity
mr | Mr | cfPers_Class
mrs | Mrs | cfPers_Class
miss | Miss | cfPers_Class
ms | Ms | cfPers_Class
dr | Dr | cfPers_Class
prof | Professor | cfPers_Class

class-scheme-academic-titles
cfClassId | cfTerm | Link Entity
mlitt | MLitt | cfPers_Class
msc | MSc | cfPers_Class
bsc | BSc | cfPers_Class
ma | MA | cfPers_Class
mphil | MPhil | cfPers_Class
mres | MRes | cfPers_Class
phd | PhD | cfPers_Class
meng | MEng | cfPers_Class
mphys | MPhys | cfPers_Class
mmath | MMath | cfPers_Class
beng | BEng | cfPers_Class
ba | BA | cfPers_Class
pgdip | PGDip | cfPers_Class

Akc 30/06/10 – remove cfTerm and use cfClassID as the actual data value as here are using a cerif classification scheme purely as a way of augmenting base data for a person ie not as a true classification schema class-scheme-hesa-identifiers cfClassId cfClassID hesa-1234567890123 1234567890123 hesa-3210987654321 3210987654321 class-scheme-wos-identifiers cfClassId web-of-science- 1234-2009 web-of-science- 9876-2010

class-scheme-supa-themes cfClassId main-theme

Link Entity cfPers_Class cfPers_Class

cfClassID 1234-2009 9876-2010

Link Entity cfPers_Class cfPers_Class

cfTerm Main Theme


Link Entity cfPers_OrgUnit



additional-theme

Additional Theme

cfPers_OrgUnit

class-scheme-supa-indentifiers 24/08/10 Nal no need for supaid as already identified as supa through scheme id cfClassId Link Entity cfClassId supaid-1620 1620 cfPers_OrgUnit supaid-349 349 cfPers_ OrgUnit class-scheme-cerif-person-publication-roles cfClassId cfTerm author Is Author Of editor Is Editor Of author-numbered Is Author (Numbered) Of author-percentage Is Author (Percentage) Of subject Is Subject Of commissioned Has Commissioned reviewer Is Reviewer Of translator Is Translator Of publisher Is Publisher Of commissioned Has Commissioned

Link Entity cfPers_ResPubl cfPers_ResPubl cfPers_ResPubl cfPers_ResPubl cfPers_ResPubl cfPers_ResPubl cfPers_ResPubl cfPers_ResPubl cfPers_ResPubl cfPers_ResPubl

Akc 11/05/10 IGNORE
class-scheme-person-name-variants
cfClassId | cfTerm | Link Entity
spelling-variant | Spelling Variant of Person’s Name | cfPersName_Pers

class-scheme-cerif-publication-types
cfClassId | cfTerm | Link Entity
book | Book | cfResPubl_Class
book-review | Book Review | cfResPubl_Class
book-chapter-abstract | Book Chapter Abstract | cfResPubl_Class
book-chapter-review | Book Chapter Review | cfResPubl_Class
in-book | In Book | cfResPubl_Class
anthology | Anthology | cfResPubl_Class
monograph | Monograph | cfResPubl_Class
reference-book | Reference book | cfResPubl_Class
textbook | Textbook | cfResPubl_Class
encyclopaedia | Encyclopaedia | cfResPubl_Class
manual | Manual | cfResPubl_Class
other-book | Other Book | cfResPubl_Class
journal | Journal | cfResPubl_Class
journal-article | Journal Article | cfResPubl_Class
journal-article-abstract | Journal Article Abstract | cfResPubl_Class
journal-article-review | Journal Article Review | cfResPubl_Class
conference-proceedings | Conference Proceedings | cfResPubl_Class
conference-proceedings-article | Conference Proceedings Article | cfResPubl_Class
letter | Letter | cfResPubl_Class
letter-to-editor | Letter To Editor | cfResPubl_Class
phd-thesis | PhD Thesis | cfResPubl_Class
doctoral-thesis | Doctoral Thesis | cfResPubl_Class
report | Report | cfResPubl_Class
short-communication | Short Communication | cfResPubl_Class
poster | Poster | cfResPubl_Class
presentation | Presentation | cfResPubl_Class
news-clipping | News Clipping | cfResPubl_Class
commentary | Commentary | cfResPubl_Class
annotation | Annotation | cfResPubl_Class

class-scheme-cerif-publication-publication-roles
cfClassId | cfTerm | Link Entity
is-part-of | Is Part Of | cfResPubl_ResPubl

Job Titles etc
This area needs to be flexible to cope with the different ways different Institutions categorise their staff. Suggest up to three levels as follows:
1. Personal Job Title: what I want to be known as and what should be shown on portal e.g. ‘Professor of Photonics’
2. Generic Job Title: for filtering and grouping by SUPA e.g. ‘Professor’
3. Job Family: for filtering and grouping by SUPA e.g. Academic

class-scheme-job-families
24/08/10 Nal Currently these are the only available jobs and schema within CRISPool
cfClassId | cfTerm | Link Entity
academic | Academic | cfPers_OrgUnit
academic-research | Academic Research | cfPers_OrgUnit
academic-teaching | Academic Teaching | cfPers_OrgUnit
honorary | Honorary | cfPers_OrgUnit
emeritus | Emeritus | cfPers_OrgUnit
research-support | Research Support | cfPers_OrgUnit

At St Andrews we can only supply 1 and 3 at the moment.

St Andrews
class-scheme-10007803-personal-job-titles : EXAMPLES as one created per link
cfClassId | cfTerm | Link Entity
professor-photonics | Professor of Photonics | cfPers_OrgUnit
honorary-professor | Honorary Professor | cfPers_OrgUnit
supa-advanced-fellow | SUPA Advanced Fellow | cfPers_OrgUnit
research-assistant | Research Assistant | cfPers_OrgUnit
pic-technical-manager | PIC Technical Manager | cfPers_OrgUnit

class-scheme-10007803-job-families
cfClassId | cfTerm | Link Entity
academic | Academic | cfPers_OrgUnit
academic-research | Academic Research | cfPers_OrgUnit
academic-teaching | Academic Teaching | cfPers_OrgUnit
honorary | Honorary | cfPers_OrgUnit
emeritus | Emeritus | cfPers_OrgUnit
research-support | Research Support | cfPers_OrgUnit

Glasgow
class-scheme-10007794-personal-job-titles – added 02/06/10 NL
cfClassId | cfTerm | Link Entity
senior-research-fellow | Senior Research Fellow | cfPers_OrgUnit
research-fellow-knc-manager | Research Fellow/KNC Manager | cfPers_OrgUnit
professor | Professor | cfPers_OrgUnit
rcuk-research-fellow | RCUK Research Fellow | cfPers_OrgUnit
professor-of-physics | Professor of Physics | cfPers_OrgUnit
research-fellow | Research Fellow | cfPers_OrgUnit
regius-professor-of-astronomy-astronomer-royal-for-scotland | Regius Professor of Astronomy (Astronomer Royal for Scotland) | cfPers_OrgUnit
reader | Reader | cfPers_OrgUnit
kelvin-chair-of-natural-philosophy | Kelvin Chair of Natural Philosophy | cfPers_OrgUnit
senior-lecturer | Senior Lecturer | cfPers_OrgUnit
reader-in-astrophysics | Reader in Astrophysics | cfPers_OrgUnit
lecturer | Lecturer | cfPers_OrgUnit
egee-scotgrid-technical-coordinator | EGEE/ScotGrid Technical Coordinator | cfPers_OrgUnit
professor-cargill-chair-of-natural-philosophy | Professor - Cargill Chair of Natural Philosophy | cfPers_OrgUnit
research-fellow-atlas-neural-net-analysis | Research Fellow ATLAS Neural Net Analysis | cfPers_OrgUnit

Yellow highlights – may not be needed as historical records
class-scheme-10007794-generic-job-titles
cfClassId | cfTerm | Link Entity
administrative-library-and-computing1 | Administrative Library & Computing 1 | cfPers_OrgUnit
administrative-library-and-computing2 | Administrative Library & Computing 2 | cfPers_OrgUnit
advisor-of-studies | Advisor of Studies | cfPers_OrgUnit
atypical-worker | Atypical Worker | cfPers_OrgUnit
atypical-worker-grade5 | Atypical Worker Grade 5 | cfPers_OrgUnit
atypical-worker-minimum-wage | Atypical Worker Minimum Wage | cfPers_OrgUnit
head-of-department | Head Of Department | cfPers_OrgUnit
honorary-staff | Honorary Staff | cfPers_OrgUnit
mpa-level4 | MPA Level 4 | cfPers_OrgUnit
mpa-level5 | MPA Level 5 | cfPers_OrgUnit
mpa-level6 | MPA Level 6 | cfPers_OrgUnit
mpa-level7 | MPA Level 7 | cfPers_OrgUnit
mpa-level8 | MPA Level 8 | cfPers_OrgUnit
marie-curie-fellow | Marie Curie Fellow | cfPers_OrgUnit
operational2 | Operational 2 | cfPers_OrgUnit
professor | Professor | cfPers_OrgUnit
reader | Reader | cfPers_OrgUnit
research1a | Research 1A | cfPers_OrgUnit
research1b | Research 1B | cfPers_OrgUnit
research2 | Research 2 | cfPers_OrgUnit
research-and-teaching6 | Research & Teaching 6 | cfPers_OrgUnit
research-and-teaching7 | Research & Teaching 7 | cfPers_OrgUnit
research-and-teaching8 | Research & Teaching 8 | cfPers_OrgUnit
research-and-teaching9 | Research & Teaching 9 | cfPers_OrgUnit
scholar | Scholar | cfPers_OrgUnit
scholarship | Scholarship | cfPers_OrgUnit
senior-lecturer | Senior Lecturer | cfPers_OrgUnit
technical2 | Technical 2 | cfPers_OrgUnit
technical4 | Technical 4 | cfPers_OrgUnit
technical5 | Technical 5 | cfPers_OrgUnit
technical6 | Technical 6 | cfPers_OrgUnit
technical7 | Technical 7 | cfPers_OrgUnit
technician-a | Technician A | cfPers_OrgUnit
technician-c | Technician C | cfPers_OrgUnit
technician-d | Technician D | cfPers_OrgUnit
technician-e | Technician E | cfPers_OrgUnit
technician-f | Technician F | cfPers_OrgUnit

class-scheme-10007794-job-families
cfClassId | cfTerm | Link Entity
admin-library-computing | Administrative Library and Computing | cfPers_OrgUnit
academic-related | Academic and Related | cfPers_OrgUnit
atypical | Atypical Workers | cfPers_OrgUnit
honorary | Honorary University | cfPers_OrgUnit
mpa | Management Professional and Administrative | cfPers_OrgUnit
operational | Operational | cfPers_OrgUnit
academic | Academic | cfPers_OrgUnit
research | Research | cfPers_OrgUnit
research-and-teaching | Research and Teaching | cfPers_OrgUnit
scholars | Scholars | cfPers_OrgUnit
technical | Technical and Related | cfPers_OrgUnit

Edinburgh – tbc – need job-families and personal-job-titles
class-scheme-10007790-job-families
cfClassId cfTerm

Link Entity cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit

SUPA-tbc
Suggestion here is for SUPA to provide class-scheme-supa-job-families to which member Institutions can map their own job families.

class-scheme-organisation-types
cfClassId | cfTerm | Link Entity
department | Department | cfOrgUnit_Class
university | University | cfOrgUnit_Class
school | School | cfOrgUnit_Class
college | College | cfOrgUnit_Class
research-pool | Research Pool | cfOrgUnit_Class
research-theme | Research Theme | cfOrgUnit_Class


Appendix 3

CRISPool CERIF to PURE4 mapping

1 Important notes
  1.1 CERIF imposes constraints on data
  1.2 Persons/Authors
  1.3 Fragmentation
  1.4 Translations
2 Classification mappings
  2.1 Organisations
  2.2 SUPA Themes
  2.3 Persons
    2.3.1 Employment types
  2.4 Publications
    2.4.1 Publication Peer Review
    2.4.2 Organisation to publication relations
  2.5 Electronic addresses
3 Entity Mappings
  3.1 Organisation
  3.2 Person
    3.2.1 Person-organisation relationships
  3.3 Address (UK)
  3.4 Email, Skype, etc.
  3.5 Publications
    3.5.1 General fields
    3.5.2 Contribution to Journal
    3.5.3 Book Anthology
    3.5.4 Conference Contribution
    3.5.5 Contribution to Book Anthology
    3.5.6 Other Contribution
    3.5.7 Working Paper

Important notes

CERIF imposes constraints on data
The CERIF XML format imposes many constraints on the data it holds. Many text strings in the format are limited by a maximum length, and this limit is often too small. Imposing such restrictions on data is not suitable for an exchange format, which is what CERIF actually is. If a receiving CRIS system has length constraints on text strings, the problem should be dealt with internally. This has been reported to euroCRIS for their information.

Persons/Authors
Only "real" persons are exported/imported, meaning that when a person is connected to a publication the alias author name, which can be different from the person's actual name, is discarded.


Fragmentation
Generally the CERIF XML data model is heavily fragmented. Data regarding an entity is scattered across several XML files and namespaces, which means that referential integrity is lost. Thus it is up to the data provider to ensure that references are correct.
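Because the format itself does not guarantee that references resolve, a data provider may want to check links before export. The following is a minimal sketch of such a check, not the CRISPool implementation: the function name, the (source id, target id) tuple layout and the person id shown in the usage note are all hypothetical.

```python
def find_dangling_links(entity_ids, links):
    """Return the links whose source or target id is not a known entity.

    entity_ids: iterable of all ids defined in the exported files.
    links: iterable of (source_id, target_id) pairs taken from link entities
           such as cfPers_ResPubl (the pair layout is an assumption).
    """
    known = set(entity_ids)
    return [
        (source, target)
        for source, target in links
        if source not in known or target not in known
    ]
```

For instance, with the known ids "person-10007803-et37" and "publication-10007803-010101", a link from that person to "publication-10007803-999999" would be reported as dangling.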

Translations
CERIF uses language codes for specifying languages and translations, but the only specification available is that language codes are 5 characters long. In this project we use the well-known convention <language code>_<country code>, where
• language code is the two-letter ISO 639-1 code (listed alongside the ISO 639-2 codes at http://www.loc.gov/standards/iso639-2/englangn.html)
• country code is the two-letter ISO 3166 code (see http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/listen1.html)

Examples are: en_GB (British English), en_US (American English), da_DK (Danish), fr_FR (French from France).
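The convention can be checked mechanically. The sketch below only validates the shape of a code (two lower-case letters, an underscore, two upper-case letters); it does not look the parts up in the ISO code lists, and the function names are hypothetical.

```python
import re

# Shape of <language code>_<country code>, e.g. en_GB (always 5 characters).
_LANG_CODE = re.compile(r"[a-z]{2}_[A-Z]{2}")


def is_lang_code(code: str) -> bool:
    """True if code has the <language>_<country> shape used in this project."""
    return _LANG_CODE.fullmatch(code) is not None


def make_lang_code(language: str, country: str) -> str:
    """Combine a language and a country code, normalising the letter case."""
    code = f"{language.lower()}_{country.upper()}"
    if not is_lang_code(code):
        raise ValueError(f"not a valid language code: {code!r}")
    return code
```

A bare country or language code such as "DE" does not satisfy this check, so data using that form would need to be normalised first.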

Classification mappings
Classifications are mapped from CERIF classification id and scheme id to either a PURE4 classification URI or a contextual meaning.

Organisations
An organisation is classified by an organisation type. In CERIF organisations are classified via the cfOrgUnit_Class classification element.
CERIF Scheme id: class-scheme-organisation-types
CERIF cfClassId | PURE Classification URI
university | /dk/atira/pure/organisation/organisationtypes/organisation/university
college | /dk/atira/pure/organisation/organisationtypes/organisation/college
faculty | /dk/atira/pure/organisation/organisationtypes/organisation/faculty
school | /dk/atira/pure/organisation/organisationtypes/organisation/school
department | /dk/atira/pure/organisation/organisationtypes/organisation/department
institute | /dk/atira/pure/organisation/organisationtypes/organisation/institue
research-pool | /dk/atira/pure/organisation/organisationtypes/organisation/research
research-theme | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
publisher | /dk/atira/pure/publisher/publishertypes/publisher/publisher
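A mapping of this kind is essentially a lookup keyed on scheme id and class id. The sketch below shows one way it could be expressed; it is an illustration rather than the Pure implementation, the names are hypothetical, and only three rows of the organisation table are included.

```python
# (CERIF scheme id, CERIF cfClassId) -> PURE classification URI.
# Only a few rows of the organisation table are reproduced here.
PURE_CLASSIFICATION_URIS = {
    ("class-scheme-organisation-types", "university"):
        "/dk/atira/pure/organisation/organisationtypes/organisation/university",
    ("class-scheme-organisation-types", "college"):
        "/dk/atira/pure/organisation/organisationtypes/organisation/college",
    ("class-scheme-organisation-types", "department"):
        "/dk/atira/pure/organisation/organisationtypes/organisation/department",
}


def to_pure_uri(scheme_id, class_id):
    """Return the PURE classification URI, or None if the pair is unmapped."""
    return PURE_CLASSIFICATION_URIS.get((scheme_id, class_id))
```

Returning None for an unmapped pair leaves the decision (reject, skip, or fall back to a default) to the importing code.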

The organisation relationship scheme classifies an organisation to organisation relation and is specified in the CERIF cfOrgUnit_OrgUnit link element.
Scheme id: class-scheme-organisation-relationship-types
CERIF cfClassId | Contextual meaning
is-parent-of | cfOrgUnitId1 is parent of cfOrgUnitId2 (cfOrgUnit_OrgUnit)

SUPA Themes
SUPA Themes are mapped to organisations in the Research Theme classification.
CERIF Scheme id: class-scheme-supa-themes
CERIF cfClassId | PURE Classification URI
supa-particle-physics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-astronomy-space-physics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-condensed-matter-material-physics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-physics-life-sciences | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-energy | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-nuclear-plasma-physics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-photonics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme

Persons
Both internal and external authors are mapped to the CERIF cfPers type and distinguished from each other by classification via the cfPers_Class element.
Scheme id: class-scheme-person-types (cfPers_Class)
CERIF cfClassId | Contextual meaning
internal-person | the person is mapped to a PURE Person (and PURE authors)
external-person | the person is mapped to a PURE External Person Author (only present on publications etc.)

Employment types
A person's relation to an organisation is classified by an employment type. This is expressed in CERIF via the classification present in the cfPers_OrgUnit link element.
Scheme id: class-scheme-job-families
CERIF cfClassId | PURE Classification URI
academic | /dk/atira/pure/person/employmenttypes/academic
academic-research | /dk/atira/pure/person/employmenttypes/academicresearch
academic-teaching | /dk/atira/pure/person/employmenttypes/academicteaching
honorary | /dk/atira/pure/person/employmenttypes/honorary
emeritus | /dk/atira/pure/person/employmenttypes/emeritus
research-support | /dk/atira/pure/person/employmenttypes/research-support

Publications
Scheme id: class-scheme-cerif-publication-types
CERIF cfClassId | PURE Classification URI
book | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/book
book-review | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/other
book-chapter-abstract | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/forewo
book-chapter-review | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/other
in-book | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/entry
anthology | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/anthology
monograph | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/special
reference-book | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/book
textbook | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/book
encyclopaedia | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/anthology
manual | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/other
other-book | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/other
journal | /dk/atira/pure/journal/journaltypes/journal/journal
journal-article | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/article
journal-article-abstract | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/letter
journal-article-review | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/scientific
conference-proceedings | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/other
conference-proceedings-article | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/paper
letter | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/letter
letter-to-editor | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/letter
phd-thesis | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/scholarly
doctoral-thesis | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/scholarly
report | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/commissioned
short-communication | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/other
poster | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/poster
presentation | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/other
news-clipping | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/entry
commentary | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/comment
annotation | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/comment


Publication Peer Review
This classification signals whether or not the publication has been peer reviewed. This is done using the cfResPubl_Class element.
Scheme id: class-scheme-publication-peer-review
CERIF cfClassId | Contextual meaning
is-reviewed | The publication has been reviewed by a peer
is-not-reviewed | The publication has not been reviewed by a peer

Organisation to publication relations
Scheme id: class-scheme-cerif-orgunit-publication-roles
CERIF cfClassId | Contextual meaning
claims-ipr | the organisation is considered the owner of the publication
author-institution | the organisation has an author on the publication
is-author-of | the organisation has an author on the publication
is-publisher-of | for future use
publishing-inst | for future use
curator | for future use
reviewer | for future use
commissioned | for future use
funded | for future use
external-org | for future use

Electronic addresses
The electronic address classification is used to identify different types of addresses and is specified in the cfPers_EAddr element.
Scheme id: class-scheme-eaddress-types
CERIF cfClassId | Context Type
email | Email address (cfEAddr)
skype | Skype address (cfEAddr)
messenger | Instant Messaging
web | Web site URL
phone | Phone number
mobile | Mobile phone number
fax | Fax number

Entity Mappings Organisation PURE CERIF Default Mandatory Mandatory value

PURE field CERIF field


name

cfOrgUnitName.cfName

Y

shortName cfOrgUnit.cfAcro period.start

cfOrgUnit_Class.cfStartDate (the first appearance)

period.end

cfOrgUnit_Class.cfEndDate (the first appearance)

type

cfOrgUnit_Class (the first appearance)

Y

visibility

NA

Y

keywords

cfOrgUnitKeyw.cfKeyw

website

cfOrgUnit.cfURI

email

cfEAddr (first appearance classified as email)

Y

Y

today

Y Y FREE

Person
The UK model has two different person-organisation relation types, depending on whether the person is staff or a student. In this proof of concept project, we assume that only staff are synchronised.
PURE field

PURE Mandatory

CERIF field

cfPersNamename.firstname ADD.cfFirstNames (first appearance)

Y

cfPersNamename.lastName ADD.cfLastNames (first appearance)

Y

-

cfPersNameADD.cfOtherNames

nameVariants

cfPersName-ADD (2nd to last appearance)

sex

cfPers.cfSex

CERIF Mandatory

Default value

Y

Person-organisation relationships
In CERIF a number of postal addresses, email addresses, etc. are associated directly with the person. In PURE these relations are gathered as metadata on a person-organisation relation, and a person can have one or more such relations. To overcome this obstacle the CRISPool CERIF mapping bends the rules by using the classification of the cfPers_EAddr and cfPers_PAddr relations to carry data. Thus the following special classification schemes have been made. Common to these classifications is that the cfClassId contains the cfOrgUnitId of the related organisation.
• cfPers_PAddr
  o class-scheme-person-organisation-address-postal specifies a person's postal work address in relation to an organisation
• cfPers_EAddr
  o class-scheme-person-organisation-address-email specifies a person's email work address in relation to an organisation
  o class-scheme-person-organisation-address-web specifies a person's web work address in relation to an organisation
  o class-scheme-person-organisation-address-phone specifies a person's work phone in relation to an organisation
  o class-scheme-person-organisation-address-mphone specifies a person's work mobile phone in relation to an organisation
  o class-scheme-person-organisation-address-fax specifies a person's work fax number in relation to an organisation

A person's relation to an organisation is classified by an employment type in the class-scheme-job-families scheme. This is expressed in CERIF via the classification present in a cfPers_OrgUnit link element.
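Since the cfClassId of these special schemes carries the cfOrgUnitId, an importer can regroup a person's addresses per organisation. The following is a minimal sketch under stated assumptions, not the Pure import code: the function name, the (scheme id, class id, address id) tuple layout and the short field names are hypothetical.

```python
from collections import defaultdict

# Special scheme id -> field on the person-organisation relation
# (the field names on the right are illustrative only).
SCHEME_TO_FIELD = {
    "class-scheme-person-organisation-address-postal": "postal",
    "class-scheme-person-organisation-address-email": "email",
    "class-scheme-person-organisation-address-web": "web",
    "class-scheme-person-organisation-address-phone": "phone",
    "class-scheme-person-organisation-address-mphone": "mphone",
    "class-scheme-person-organisation-address-fax": "fax",
}


def group_addresses_by_org(links):
    """Group address ids by organisation.

    links: iterable of (scheme_id, class_id, address_id) taken from one
    person's cfPers_EAddr / cfPers_PAddr links; class_id holds the
    cfOrgUnitId of the related organisation.
    """
    grouped = defaultdict(dict)
    for scheme_id, class_id, address_id in links:
        field = SCHEME_TO_FIELD.get(scheme_id)
        if field is None:
            continue  # not one of the special person-organisation schemes
        grouped[class_id][field] = address_id
    return dict(grouped)
```

Links classified under other schemes, such as class-scheme-eaddress-types, are ignored by this sketch and would be handled separately.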

Address (UK) PURE field CERIF field

PURE Mandatory

CERIF Mandatory

Default value

postalCode cfPAddr.cfPostCode country

cfPAddr.cfCountryCode

address1

cfPAddr.cfAddrline1

address2

cfPAddr.cfAddrline2

address3

cfPAddr.cfAddrline3

address4

cfPAddr.cfAddrline4

address5

cfPAddr.cfAddrline5

Y

Email, Skype, etc.
CERIF electronic addresses such as email, Skype and messenger are specified via a cfEAddr. The different electronic addresses are distinguished from each other by their classification, as specified earlier in the document. The actual electronic address is specified in the cfURI element.

Publications General fields PURE CERIF Mandatory Mandatory

PURE field

CERIF field

publishedDate, publicationYear, Month, -Day

cfResPubl.cfResPublDate

numberOfPages

cfResPubl.cfTotalPages

title (localised)

cfResPublTitle.cfTitle

abstract (localised)

cfResPublAbstr.cfAbstr

Default value

y

Y

bibliographicalNote cfResPublBiblNote.cfBiblNote keywords

cfResPublKeyw.cfKeyw

Contribution to Journal

PURE field

CERIF field

PURE Mandatory

Page 50 of 55

CERIF Default Mandatory value


Project Acronym: CRISPool Version 2.1 Contact: akc@st-andrews.ac.uk Date: 16/09/2010

pages

cfResPubl.cfStartPage, cfResPubl.cfEndPage

journalNumber cfReslPubl.cfNum volume

cfResPubl.cfVol

Book Anthology

PURE CERIF Default value Mandatory Mandatory

PURE field CERIF field printIsbns

cfResPubl.cfISBN

edition

cfResPubl.cfEdition

volume

cfResPubl.cfVol

Conference Contribution

PURE CERIF Default Mandatory Mandatory value

PURE field CERIF field pages

cfResPubl.cfStartPage, cfResPubl.cfEndPage

peerReview

cfResPubl_Class (peer review classification)

Contribution to Book Anthology

PURE CERIF Default value Mandatory Mandatory

PURE field

CERIF field

printIsbns

cfResPubl.cfISBN

edition

cfResPubl.cfEdition

hostPublicationTitle cfResPubl.cfSeries

Other Contribution

PURE field   CERIF field       PURE Mandatory  CERIF Mandatory  Default value
printIsbns   cfResPubl.cfISBN

Working Paper

PURE field   CERIF field       PURE Mandatory  CERIF Mandatory  Default value
printIsbns   cfResPubl.cfISBN


Appendix 4

Technical Summary - CRISPool project prototype implementation

Created by: Atira A/S, edited by Thomas Vestdam
Date: 87 February 2011
Version: 1.0
Rev. nr. 19

Technical Summary

Below we outline how the CERIF-XML import and export functionality was implemented in Pure for the CRISPool project. In general, we observed a few important problems when using CERIF-XML as an exchange format:

Fragmentation – this introduces unnecessary complexity, especially in import algorithms, as the input must be scanned several times to collect all the XML-fragments that make up a single entity (e.g. a person). The repeated scanning of the XML input also causes performance issues. Suggestion: allow certain XML entities/types to include other relevant entities, resulting in a single comprehensive document type covering everything related to that type. For example, the person element could allow optional sub-elements such as names, keywords and relations to other entities. That is, the CERIF model is kept as it is, but the exchange format becomes better suited to machine processing and more readable for humans.

Too many namespaces – parsing and querying CERIF-XML (e.g. using XPath) is very cumbersome, as every single element has its own namespace. Suggestion: have only one namespace per CERIF version. This would also allow a single XML schema to define CERIF-XML.

Constraints – the XML format should not impose too many constraints on data sizes other than IDs. It makes good sense to keep an upper limit on ID lengths, but we suggest that names, titles, abstracts, etc. should be unbounded, leaving it up to each CRIS system to decide what to do if the incoming data is longer than the system's internal representation allows.

Most of the issues seem to stem from the fact that the CERIF-XML format is very close to the relational database schema defined by the CERIF model. This leaves some desired improvements to CERIF-XML as an exchange format. However, the bottom line is that CERIF (XML) can, as such, be utilised as a flexible exchange format.
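To illustrate the namespace issue described above, the following minimal sketch queries a small document in which each element type sits in its own namespace. The namespace URIs and the sample document are invented for illustration and are not the real CERIF-XML namespaces; Python's standard `xml.etree.ElementTree` is used only because it is widely available.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, one per element type, mimicking the
# "every element has its own namespace" situation. Not the real CERIF URIs.
NS = {
    "pers": "urn:example:cerif:cfPers",
    "name": "urn:example:cerif:cfPersName",
}

DOC = """
<root>
  <cfPers xmlns="urn:example:cerif:cfPers"><cfPersId>p1</cfPersId></cfPers>
  <cfPersName xmlns="urn:example:cerif:cfPersName">
    <cfPersId>p1</cfPersId><cfFamilyNames>Smith</cfFamilyNames>
  </cfPersName>
</root>
""".strip()

root = ET.fromstring(DOC)

# Every element type needs its own prefix in the query, and the same
# child name (cfPersId) lives in a different namespace per parent type.
person_ids = [e.text for e in root.findall("pers:cfPers/pers:cfPersId", NS)]
family_names = [
    e.text for e in root.findall("name:cfPersName/name:cfFamilyNames", NS)
]
```

With a single namespace per CERIF version, as suggested above, one prefix would cover both queries and the prefix table would not grow with the number of element types.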


Outline of the CERIF-XML import functionality

Importing CERIF-XML into Pure is done by first loading the supplied CERIF-XML files for a given data-provider into an XML-database (eXist-db was used, http://exist.sourceforge.net/). Content is then either created or updated in Pure based on the information in the XML-database. If a given piece of content (e.g. person, organisation, research output) does not already exist in Pure, the content is created and stored along with the id found in the input (e.g. cfPersId) as a source id; it is this source id that is used to check for existence in Pure. When creating or updating content, all relevant pieces are loaded from the XML-database. For a given person, for example, these would be the XML-fragments relating to the specific person id, such as:

- the person element
- person name elements
- associated keyword elements
- person-organisation relation elements
- and so on

The relevant XML-fragments are found by performing several XPath queries in the XML-database in order to provide a “single document” containing all XML-fragments for a given piece of content. The XML-fragments are then transferred to the relevant entities in the Pure model (we use XMLBeans to create bindings to Java types, http://xmlbeans.apache.org/). The XML-database approach was chosen over a handwritten parser for simplicity, and to provide a better basis for handling very large data-sets (if needed, the XML-database can be kept in memory or be streamed to the file-system).
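The import steps described above can be sketched as follows. This is an illustration under stated assumptions, not the Pure implementation: an in-memory `ElementTree` stands in for eXist-db, namespaces are omitted, the sample element layout is simplified, and the function names `collect_person` and `create_or_update` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free stand-in for CERIF-XML input that has
# already been loaded into a single store.
STORE = ET.fromstring("""
<store>
  <cfPers><cfPersId>p1</cfPersId></cfPers>
  <cfPers><cfPersId>p2</cfPersId></cfPers>
  <cfPersName><cfPersId>p1</cfPersId><cfFamilyNames>Smith</cfFamilyNames></cfPersName>
  <cfPers_OrgUnit><cfPersId>p1</cfPersId><cfOrgUnitId>o1</cfOrgUnitId></cfPers_OrgUnit>
  <cfPers_OrgUnit><cfPersId>p2</cfPersId><cfOrgUnitId>o1</cfOrgUnitId></cfPers_OrgUnit>
</store>
""".strip())

# Fragment types that together make up one person (illustrative subset).
PERSON_FRAGMENT_TYPES = ["cfPers", "cfPersName", "cfPers_OrgUnit"]


def collect_person(store, pers_id):
    """Run one query per fragment type and gather the hits in one document."""
    doc = ET.Element("person", {"sourceId": pers_id})
    for tag in PERSON_FRAGMENT_TYPES:
        # ElementTree supports the [child='text'] predicate.
        for fragment in store.findall(f"{tag}[cfPersId='{pers_id}']"):
            doc.append(fragment)
    return doc


def create_or_update(content_by_source_id, doc):
    """Use the source id to decide between creating and updating content."""
    source_id = doc.get("sourceId")
    action = "updated" if source_id in content_by_source_id else "created"
    content_by_source_id[source_id] = doc
    return action
```

Note that one query is needed per fragment type, which is the repeated scanning that the fragmentation point in the technical summary refers to.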

Outline of the CERIF-XML export functionality

Exporting content from Pure is implemented using the export framework in Pure by defining a series of “converters”. Each converter is responsible for converting a given Pure model entity (e.g. Person, Organisation, Research Output, Journal, Patent) to CERIF-XML. The converter for a specific model entity is responsible for creating all relevant CERIF-XML fragments representing that entity. For a person, for example, these would be fragments such as:

- the CERIF person element, in a CERIF person elements XML file
- the person's name, in a CERIF person name elements XML file
- keywords associated with the person, in a CERIF person keyword elements XML file
- relations to organisations, in a CERIF person organisation elements XML file
- and so on
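A minimal sketch of the converter idea described above, assuming a simple dictionary layout for a person. Plain dictionaries stand in for XML fragments, and the names `person_fragments`, `CONVERTERS` and `convert` are hypothetical rather than part of the Pure export framework.

```python
from collections import defaultdict


def person_fragments(person):
    """Yield (target file, fragment) pairs for one person entity."""
    yield "cfPers", {"cfPersId": person["id"]}
    yield "cfPersName", {"cfPersId": person["id"], "name": person["name"]}
    for org_id in person["org_ids"]:
        yield "cfPers_OrgUnit", {"cfPersId": person["id"], "cfOrgUnitId": org_id}


# One converter per model entity type; only Person is sketched here.
CONVERTERS = {"Person": person_fragments}


def convert(entity_type, entity, files):
    """Append every fragment produced for the entity to its target file."""
    for file_name, fragment in CONVERTERS[entity_type](entity):
        files[file_name].append(fragment)


files = defaultdict(list)
convert("Person", {"id": "p1", "name": "Smith", "org_ids": ["o1", "o2"]}, files)
```

Each entity is therefore spread over several output files, one per CERIF element type, which mirrors the fragment list above.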


When exporting, the relevant data is loaded based on a list of the organisations for which data is needed: each research output associated with any of the input organisations or their sub-organisations is loaded and converted one by one. For each research output, the associated organisations, authors (persons) and journals are loaded and converted. In turn, when converting a person, any associated organisations are converted, and when converting an organisation, any associated organisations (e.g. sub-organisations) are converted as well. Because entities are loaded and converted recursively, the exporter keeps track of already converted entities by keeping a list of their UUIDs. This ensures that entities are only loaded and converted once.

The actual XML files are generated once all relevant entities have been converted, by serialising all XML fragments into a set of CERIF-XML files. In this specific implementation an XML-database was used to store the XML-fragments temporarily while converting data, and when the conversion was done, the final XML files were created using the serialising capabilities of the XML-database. The CERIF-XML export is provided as a special web-service that delivers the serialised CERIF-XML files as a bundled zip-file.

The procedure and techniques described above can be applied to any system that aims to export to CERIF-XML. However, the description of how data is loaded is of course Pure specific, and just serves as an example. While writing the exporter a mapping document was created, and mapping decisions were recorded in the document.
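The bookkeeping of already converted entities described above can be sketched as a recursive traversal with a set of seen UUIDs. This is an illustration only: the `related` dictionary is a hypothetical stand-in for the associations loaded from Pure, and appending to `order` stands in for the actual load-and-convert step.

```python
def export_entities(start_ids, related):
    """Convert every entity reachable from start_ids exactly once."""
    seen = set()   # UUIDs of entities that have already been converted
    order = []     # conversion order, standing in for "load and convert"

    def visit(uuid):
        if uuid in seen:
            return  # already converted, do not load it again
        seen.add(uuid)
        order.append(uuid)
        for other in related.get(uuid, ()):
            visit(other)

    for uuid in start_ids:
        visit(uuid)
    return order


# Hypothetical associations: outputs -> organisations, persons, journals;
# persons -> organisations; organisations -> related organisations.
related = {
    "out1": ["org1", "pers1", "jour1"],
    "out2": ["org1", "pers1"],
    "pers1": ["org1"],
    "org1": ["org2"],
    "org2": ["org1"],
}
converted = export_entities(["out1", "out2"], related)
```

The seen-set also guards against cycles, such as two organisations that reference each other, which would otherwise make the recursion loop indefinitely.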
