
22 June 2011

Today’s Tabbloid PERSONAL NEWS FOR helen@sero.co.uk

JISCAD - TWITTER SEARCH

curious to see what #mydata project will deliver and if it'll build on semantic technologies http://t.co/3x9JI9x #uciad #jiscad

ACTIVITY DATA

JUN 21, 2011 05:50P.M.

Introduction

UCIAD

This is an informal report outlining the likely recommendations from the Activity Data projects to help JISC to determine future work in the area. This is not intended as a public document, rather to stimulate discussion and lead to a more formal document at a later stage.

The mydata project

There are two things to note at this stage.

Activity Data Synthesis Project: Recommendations

JUN 21, 2011 01:28P.M.

The following are the recommendations that we have submitted to JISC. Your comments would be most welcome, both to JISC and to us.

JUN 21, 2011 05:44P.M.

• Activity data can serve a wide variety of different functions, as exemplified by the range of projects in this programme. However, the greatest impact (and return on investment) will be from supporting student success.

Announcements have come out recently regarding new projects from the government around the slogan “Better choices, better deals” to support better customer experience, through transparent customer information. This is exciting as it shows how the government, as well as businesses, are now realising that it is by giving customers (i.e., the users) control of their information that we can build a better, more reliable and more transparent experience. At the core of the initiative is the mydata project, whose goal can be summarised by the sentence: “giving back customer data to customers”. To a large extent UCIAD can be seen as an experiment in this direction, proposing to deliver activity data to the users (i.e., customers) of large organisations. We certainly share the same hypothesis that, as expressed by Nigel Shadbolt (chair of the MyData project), customers/users getting back their information can help make organisations/businesses “more accountable”, “more efficient” and able to build “new kinds of services”.

• We suggest that the next call explicitly funds other universities to pick up the techniques and / or software systems that have been developed in this programme, in order to see if they are useful beyond the initial institution and, in the process, to discover what the issues may be in making effective use of them. However, this may not be in accordance with JISC’s standard practice and is not an essential part of the recommendations.

The recommendations appear under the following topic areas:

• Student success

Of course, it is still unclear at this stage what will be the concrete outcomes of the mydata project. Great challenges have to be tackled both from a technological point of view (in what format should data be provided to customers? How to ensure reusability? How to deal with heterogeneity?) and from the societal point of view (What are the privacy/security implications? How to enforce “user-centric data provision” policies in businesses? How to spread the benefit equally amongst users?). We hope that our experience with UCIAD (and beyond, with the work building on UCIAD we are planning to do) will contribute to such exciting new approaches to activity/customer data.

• Student and researcher experience
• Collection management.

Student success

“It is a truth universally acknowledged that”[1] early identification of students at risk and timely intervention must[2] lead to greater success. It is believed that some of the patterns of behaviour that can be identified through activity data will indicate students who are at risk and could be


supported by early intervention. It has also been demonstrated in work in the US that it can help students in the middle to improve their grades[3].

system under development and its potential users.

• Investigation and implementation of appropriate algorithms. This should look at existing algorithms in use and their broader applicability. We advise that this should include statisticians with experience in areas such as pattern analysis and recommender systems.

Recommendations:

1. In year 2, JISC should fund research into what is needed to build effective student success dashboards.

1. Work is needed at least in the following areas:

• Some of the systems developed under this programme should be piloted elsewhere.

• Determination of the most useful sources of data that can underpin the analytics
• Identification of effective and sub-optimal study patterns that can be found from the above data.

Collection management

Activity data provides information on what is actually being used / accessed. The opportunity exists to use data on how and where resources are being used at a much finer level of granularity than is currently available. Activity data can therefore be used to help inform collection management.

• Design and development of appropriate algorithms to extract this data. We advise that this should include statisticians with experience in relevant areas such as recommender systems.
• Watching what others are doing in the area of learning analytics, including activity by VLE developers.

Note that this is an area where shared or open data may be particularly valuable in helping to identify important gaps in a collection.

At this stage it is not clear what the most appropriate solutions are likely to be; therefore, it is recommended that this is an area where we need to “let a thousand flowers bloom”. However, it also means that it is essential that projects collaborate in order to ensure that projects, and the wider community, learn any lessons.

1. It is recommended that in the coming year JISC should fund work to investigate how activity data can support collection management.

1. In particular work is needed in the following areas:

1. In year 2 or 3, JISC should pilot some of the systems developed under the current programme:

• Consider how activity data can supplement data that libraries are already obtaining from publishers, through projects such as JUSP.

Student and researcher experience

• Work with UK Research Reserve.

This area is primarily concerned with using recommender systems to help students and (junior) researchers locate useful material that they might not otherwise find, or would find much harder to discover.

• Assess the potential to include the Open Access publications domain in this work. • Pilot work from this programme to see if the data that they are using is helpful in this area.

Recommendations

1. It is recommended that in year 2, JISC fund additional work in the area of recommender systems for resource discovery.

1. In particular work is needed in the following areas:

Other areas

The following are important areas that JISC should pursue.

1. It is recommended that JISC continue work on open data for activity data, and in particular investigate appropriate standard formats.

• Investigation of the issues and tradeoffs inherent in developing institutional versus shared services recommender systems. For instance there are likely to be at least some problems associated with recommending resources which are not available locally.

2. It is recommended that one or more projects in year 2 should investigate the value of a mixed activity data approach in connection with no-SQL data stores in order to maximise flexibility in the accumulation, aggregation and analysis of activity data and supporting data sets; the US Learning Registry project may be relevant.

• Investigating and trialling the combination of activity data with rating data. In doing this there needs to be acknowledgement that users are very frequently disinclined to provide ratings, and that ways to reduce barriers to participation and increase engagement with rating processes need to be discovered in the context of the


3. It is recommended that JISC ask appropriate experts (such as Naomi Korn / Charles Oppenheim or JISC Legal) to provide advice on the legal aspects such as privacy and data sharing, similar to Licensing Open Data: A Practical Guide (written for the Discovery project).

JISCAD - TWITTER SEARCH

RT @richardn2009: A few spaces left at Innovations in Activity Data for Academic Libs on 4 July in Milton Keynes Details here http://bit.ly/iM3OCq #jiscad

Other The Activity Data Synthesis Project is not in a position to make any recommendation over the use of linked data in this area in the absence of any compelling use.

JUN 20, 2011 02:50P.M.

[1] Austen J, Pride and Prejudice
[2] In reality – “highly likely” – but that does not fit with the quote
[3] Arnold K, Signals: Applying Academic Analytics, Educause Quarterly, 2010, Vol 33 No 1 http://www.educause.edu/EDUCAUSE+Quarterly/EDUCAUSEQuarterlyMagazineVolum/SignalsApplyingAcademicAnalyti/199385 or http://bit.ly/c5Z5Zu

RT @rschon: Lots of valuable findings in the UC Libraries Academic e-Book Usage Survey Report http://j.mp/kN2chU [#jiscad]


JUN 20, 2011 01:44P.M.

JISCAD - TWITTER SEARCH

RT @richardn2009: A few spaces left at Innovations in Activity Data for Academic Libs, 4 July, Milton Keynes http://bit.ly/iM3OCq #jiscad

JUN 20, 2011 04:00P.M.

JUN 20, 2011 12:50P.M.


JISCAD - TWITTER SEARCH

ACTIVITY DATA

A few spaces left at Innovations in Activity Data for Academic Libs on 4 July in Milton Keynes Details here http://bit.ly/iM3OCq #jiscad

Draft Guide: ‘Bringing activity data to life’

JUN 20, 2011 05:26A.M.

[This is a draft Guide that will be published as a deliverable of the synthesis team’s activities. Your comments are very much welcomed and will inform the final published version of this Guide. We are particularly interested in any additional examples you might have for the ‘Additional Resources’ section]

JUN 20, 2011 12:32P.M.

The problem: Activity and attention data is typically large scale and may combine data from a variety of sources (e.g. learning, library, access management) and events (turnstile entry, system login, search, refine, download, borrow, return, review, rate, etc). It needs methods to make it amenable to analysis.

ACTIVITY DATA

Online Exchange #2: Event Recording [2 June 2011]

It is easy to think of visualisation simply as a tool to help our audiences (e.g. management) ‘see’ the messages (trends, correlations, etc) that we wish to highlight from our datasets. However, experience with ‘big’ data indicates that visualisation and simulation tools are equally important for the expert, assisting in the formative steps of identifying patterns and trends to inform further investigation, analysis and ultimately the development of measures such as Performance Indicators.

JUN 20, 2011 08:53A.M.

On the 2nd June we held the second of our Activity Data Virtual Meetings using the Webex online conferencing tool. The hour-long session can be downloaded or streamed using the following links:

• Stream the recording

The options: Statisticians and scientists have a long history of using computer tools, which can be complex to drive. At the other extreme, spreadsheets such as Excel have popularised basic graphical display for relatively small data sets. However, a number of drivers (ranging from cloud processing capability to software version control) have led to a recent explosion of high quality visualization tools capable of working with a wide variety of data formats and therefore accessible to all skill levels (including the humble spreadsheet user).

• Download the recording

We heard from Richard Nurse, who talked us through the Open University RISE project and shared the progress they’ve made so far. [Richard’s presentation starts at the 11min mark]

We also heard from Sheila Fraser, who presented an overview of EDINA’s Using OpenURL Activity Data project and touched on how the data might be used, as well as inviting participants to suggest ideas and discuss the issues around using other institutions’ data. [Sheila’s presentation starts at the 20min, 20secs mark]

Taking it further: YouTube is a source of introductory videos for tools in this space, ranging from Microsoft Excel features to the cloud based processing from Google and IBM to tools such as Gource, which originated in the world of version control. Here are some tools recommended by people like us:

Excel Animated Chart - http://www.youtube.com/watch?v=KWxemQq10AM&NR=1
Excel Bubble Chart - http://www.youtube.com/watch?v=fFOgLe8z5LY

We also had speakers lined up to share information and experience about the Journal Usage Stats Portal (JUSP), Metridoc and the RAPTOR project but unfortunately a technical glitch in Webex meant that we had to postpone their contributions to a future session. [NB: you can view the in session chat box by selecting ‘View’ >> ‘Chat’ from the menu at the top of the Webex playback window]

Google Motion Chart - http://code.google.com/apis/chart/interactive/docs/gallery/motionchart.html
IBM Many Eyes - http://www.youtube.com/watch?v=aAYDBZt7Xk0
Use Many Eyes at http://www958.ibm.com/software/data/cognos/manyeyes/
Gapminder Desktop - http://www.youtube.com/watch?v=duGLdEzlIrs


See also http://www.gapminder.org/

Gephi - http://www.youtube.com/watch?v=bXCBh6QH5W0
Gource - http://www.youtube.com/watch?v=E5xPMW5fg48

statistical requirement and therefore selectively extract, aggregate and analyse a subset of the data accordingly; for example:

• Analyse library circulation trends by time period or by faculty or …
• Analyse VLE logs to identify users according to their access patterns (time of day, length of session)
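As a rough illustration of Approach 1, the sketch below counts library loans per faculty and month and drops any group below a minimum size, reflecting the data protection point about low incidence groupings. The record layout, the sample values and the threshold of 2 are assumptions made for this example only; they are not taken from any of the projects.

```python
from collections import Counter

# Hypothetical loan records: (user_id, faculty, ISO date of loan).
loans = [
    ("u1", "Law", "2011-01-12"),
    ("u2", "Law", "2011-01-20"),
    ("u3", "Law", "2011-02-03"),
    ("u4", "Art", "2011-01-15"),
    ("u1", "Law", "2011-02-10"),
]


def loans_by_faculty_and_month(records, min_group_size=2):
    """Count loans per (faculty, month), dropping low-incidence groups."""
    # date[:7] keeps the "YYYY-MM" part of the ISO date.
    counts = Counter((faculty, date[:7]) for _, faculty, date in records)
    # Suppress small groups that could betray individual identity.
    return {key: n for key, n in counts.items() if n >= min_group_size}


summary = loans_by_faculty_and_month(loans)
for (faculty, month), n in sorted(summary.items()):
    print(faculty, month, n)
```

Note that aggregating this early discards the user identifier, which is exactly the trade-off between the two approaches: the result is safer to share but can no longer be mined for individual patterns.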

Additional resources: To grasp the potential, watch Hans Rosling famously using Gapminder in his TED talk on third world myths http://www.youtube.com/watch?v=RUwS1uAdUcI&NR=1

UK-based Tony Hirst (@psychemedia) has posted examples of such tools in action – see his YouTube channel http://www.youtube.com/profile?user=psychemedia. Posts include Google Motion Chart using Formula 1 data, Gource using EDINA OpenURL data and a demo of IBM Many Eyes.

A wide ranging introduction to hundreds of visualisation tools and methods is provided at http://www.visualcomplexity.com/vc/

Approach 2 - Analyse the full set (or sets) of available data in search of patterns using data mining and statistical techniques. This is likely to be an iterative process involving established statistical techniques (and tools), leading to cross-tabulation of discovered patterns, for example:

• Discovery 1 – A very low proportion of lecturers never post content in the VLE
• Discovery 2 – A very low proportion of students never download content

ACTIVITY DATA

• Discovery 3 – These groups are both growing year on year

Draft Guide: ‘Strategies for collecting and storing activity data’

• Pattern – The vast majority of both groups are not based in the UK (and the surprise is very low subject area or course correlation between the lecturers and the students)

Additional resources:

Approach 1 – The Library Impact Data Project (#LIDP) had a hypothesis and went about collecting data to test it http://library.hud.ac.uk/blogs/projects/lidp/

Approach 2 - The Exposing VLE Data project (#EVAD) was faced with the availability of around 40 million VLE event records covering 5 years and decided to investigate the patterns - http://vledata.blogspot.com/

JUN 20, 2011 04:27A.M. [This is a draft Guide that will be published as a deliverable of the synthesis team’s activities. Your comments are very much welcomed and will inform the final published version of this Guide. We are particularly interested in any additional examples you might have for the ‘References’ section]

Recommender systems (a particular form of data mining used by such as supermarkets and online stores) typically adopt Approach 2, looking for patterns using established statistical techniques http://en.wikipedia.org/wiki/Recommender_system and http://en.wikipedia.org/wiki/Data_Mining
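A minimal sketch of the "looking for patterns" idea behind such recommenders is to count how often two items appear in the same user's history and suggest the most frequent companions. The data, item names and function names below are hypothetical, and real systems use considerably more careful statistics than raw co-occurrence counts.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical borrowing histories: user -> set of item identifiers.
histories = {
    "u1": {"book_a", "book_b", "book_c"},
    "u2": {"book_a", "book_b"},
    "u3": {"book_b", "book_c"},
    "u4": {"book_a", "book_d"},
}


def co_occurrence(user_items):
    """Count, for every pair of items, how many users used both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for x, y in combinations(sorted(items), 2):
            counts[x][y] += 1
            counts[y][x] += 1
    return counts


def recommend(item, counts, top_n=2):
    """Return up to top_n items most often used alongside `item`."""
    related = counts.get(item, {})
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


counts = co_occurrence(histories)
print(recommend("book_a", counts))
```

Because the counts are built from every user's full history, this is an Approach 2 technique: it depends on keeping the data at the level of individual users rather than aggregating early.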

The problem: Activity data typically comes in large volumes that require processing to be useful. The challenge is where to start and at what stage to become selective (e.g. analyse student transactions and not staff) and to aggregate (add transactions together – e.g. 1 record per day for books borrowed).

If we are being driven by information requests or existing Performance Indicators, we will typically manipulate (select, aggregate) the raw data early. Alternatively, if we are searching for whatever the data might tell us then maintaining granularity is essential (e.g. if you aggregate by time period, by event or by cohort, you may be burying vital clues). However, there is the added dimension of data protection – raw activity datasets probably contain links to individuals and therefore aggregation may be a good safeguard (though only partial, as you may still need to throw away low incidence groupings that could betray individual identity).

The options: It is therefore important to consider the differences between two approaches before you start burning bridges by selection / aggregation or unnecessarily filling terabytes of storage.

Approach 1 - Start with a pre-determined performance indicator or other


ACTIVITY DATA

Draft Guide: ‘Identifying activity data in the library service’

JUN 20, 2011 04:27A.M.

[This is a draft Guide that will be published as a deliverable of the synthesis team’s activities. Your comments are very much welcomed and will inform the final published version of this Guide. We are particularly interested in any additional examples you might have for the ‘References’ section]

The problem: Libraries use a range of software systems through which users interact with premises, services and resources. The LMS system is far from the only source, the OPAC and the LMS circulation module representing increasingly partial views of user attention, activity and usage in a changing world. So libraries wishing to build a picture of user interactions face the challenge of identifying the appropriate data – depending on their purpose, which may range from collection management (clearing redundant material, building ‘short loan’ capacity) to providing student success performance indicators (if correlation can be established), to developing recommender services (students who used this also used that, searched for this retrieved that, etc). Let’s split the problem down. In this guide we consider the variety of sources available within library services, a list to which you may add more. In other guides we consider strategies for deriving intelligence from ‘anything that moves’ as well as from targeted data extraction and aggregation with reference to specific goals.

The options: Libraries already working with activity data have identified a range of sources and purposes – Collection Management, Service Improvement, Student Success and Recommender Services. Potential uses of data will be limited where the user is not identified in the activity (‘No attribution’). Here are some key examples:

Data Source | What can be counted | Value of the intelligence
Turnstile | Visits to library | Service improvement, Student success
Website | Virtual visits to library (no attribution) | Service improvement
OPAC | Searches made, search terms used, full records retrieved (no attribution) | Recommender system, Student success
Circulation | Books borrowed, renewed | Collection management, Recommender system, Student success
URL Resolver | Accesses to e-journal articles | Recommender system, Collection management
Counter Stats | Downloads of e-journal articles | Collection management
Reading Lists | Occurrence of books and articles – a proxy for recommendation | Recommender system
Help Desk | Queries received | Service improvement

Taking it further: Here are some important questions to ask before you start to work with user activity data:

• Can our systems generate that data?
• Are we collecting it? Sometimes these facilities exist but are switched off
• Is there enough of it to make any sense? How long have we been collecting data and how much data is collected per year?
• Will it serve the analytical purpose we have in mind? Or could it trigger new analyses?


• Should we combine a number of these sources to paint a fuller picture? If so, are there reliable codes held in common across the relevant systems – such as User ID?

such as the JISC Activity Data programme - see resources below

• Identify where log information about learning-related system ‘events’ is already collected (e.g. learning, library, turnstile and logon / authentication systems);

Additional resources: Consider also the Guides on Student Success (URL) and Data Strategies (URL)

• Understand the standard guidance on privacy and data protection relating to the processing and storage of such data

The Library Impact Data Project (LIDP) led by the University of Huddersfield - http://library.hud.ac.uk/blogs/projects/lidp/

• Engage the right team, likely to include key academic and support managers as well as IT services; a statistician versed in analytics may also be of assistance as this is relatively large scale data

ACTIVITY DATA

Draft Guide: ‘Enabling student success’

• Decide whether to collect data relating to a known or suspected indicator (like the example above) or to analyse the data more broadly to identify whatever patterns exist

JUN 20, 2011 04:26A.M.

• Run a bounded experiment to test a specific hypothesis

[This is a draft Guide that will be published as a deliverable of the synthesis team’s activities. Your comments are very much welcomed and will inform the final published version of this Guide. We are particularly interested in any additional examples you might have for the ‘References’ section]

Additional resources: Three projects in the JISC Activity Data programme investigated these opportunities at Cambridge, Huddersfield and Leeds Met universities. See Activity Data Guide on ‘Data Strategies’ to maximise your potential to identify and track indicators

The problem: Universities and colleges are focused on supporting students both generally and individually to ensure retention and to assure success. The associated challenges are exacerbated by large student numbers and as teaching and learning become more ‘virtualised’. Institutions are therefore looking for indicators that will assist in the timely identification of learners who are, for example, ‘at risk’, so they can be proactively engaged with the appropriate academic and personal support services.

More about Learning Analytics in the 2011 Educause Horizon Report http://www.educause.edu/node/645/tid/39193?time=1307689897 Academic Analytics: The Uses of Management Information and Technology in Higher Education, Goldstein P and Katz R, ECAR, 2005 http://www.educause.edu/ers0508

The options: Whilst computer enabled systems may be part of the problem, they can certainly contribute significantly to the solution through identification of patterns of learning and associated activity that highlight ‘danger signs’ and sub-optimal practice and by the automation of ‘alarms’ (e.g. traffic light indicators, alerts) triggered by one or more indicators. This approach forms part of the field of ‘learning analytics’, which is increasingly popular in North America.

JISCAD - TWITTER SEARCH

Well-chosen indicators do not necessarily imply a cause and effect relationship, but they do provide a means to single out individuals using automatically collected activity data, typically combining a bundle of indicators (e.g. Students who do not visit the library in Term 1 may be at risk; students who also do not download content from the VLE are highly likely to be at risk).
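The example bundle above can be sketched as a simple traffic light rule. The field names, the zero thresholds and the colour labels are assumptions for illustration; the guide does not prescribe them, and combinations it does not mention (such as library visits but no VLE downloads) are simply left green here.

```python
from dataclasses import dataclass


@dataclass
class TermActivity:
    """Hypothetical per-student activity counts for Term 1."""
    student_id: str
    library_visits: int
    vle_downloads: int


def risk_flag(activity):
    """Map the two indicators from the text onto a traffic light."""
    if activity.library_visits > 0:
        return "green"
    # No library visits in Term 1: the student may be at risk.
    if activity.vle_downloads == 0:
        # Also no VLE downloads: highly likely to be at risk.
        return "red"
    return "amber"


cohort = [
    TermActivity("s1", library_visits=5, vle_downloads=12),
    TermActivity("s2", library_visits=0, vle_downloads=3),
    TermActivity("s3", library_visits=0, vle_downloads=0),
]
flags = {a.student_id: risk_flag(a) for a in cohort}
print(flags)
```

As the text stresses, such a flag indicates correlation rather than cause, so it is better treated as a prompt for a conversation with the student than as a judgement.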

JUN 17, 2011 07:42P.M.

RT @joypalmer: really good progress being made on SALT API #jiscad #jiscsalt http://bit.ly/mxtbBa

Taking it further: Institutions wishing to develop these capabilities may be assisted by this checklist: • Consider how institutions have developed thinking and methods in


JISCAD - TWITTER SEARCH

LIBRARY IMPACT DATA PROJECT

[with proper hashtag] from last evening: playing with #uciad semantic activity data http://t.co/GLqh6e5 #jiscad

Reflections on Huddersfield’s data

JUN 17, 2011 09:05A.M.

Following on from De Montfort’s blog post about the nature of their data submission, we’ve been thinking a bit more about what we could have included (and indeed what we might look at when we finish this project).

JUN 17, 2011 02:10P.M.

We’ve already been thinking about how we could incorporate well established surveys into data consideration (both our own internal data collection, such as our library satisfaction survey, and external surveys). While our biggest concern is getting enough data to draw conclusions, qualitative data is naturally a problematic area: numerical data ‘just’ needs obtaining and clearing for use, but getting some information from students to find out why they do or don’t use resources and the library can be quite complicated. Using other surveys outside of the project focus groups could be a way of gathering simple yet informative data to indicate trends and personal preferences. Additionally, if certain groups of students choose to use the library a little or a lot, existing surveys may give us feedback on why on a basic level.

JISCAD - TWITTER SEARCH

RT @joypalmer: really good progress being made on SALT API #jiscad #jiscsalt http://bit.ly/mxtbBa JUN 17, 2011 01:56P.M.

We may also want to ask (and admittedly I’m biased here given my research background!) what makes students choose the library for studying and just how productive they are when they get here. Footfall data in the original project has already clearly demonstrated that library entries do not necessarily equate to degree results. Our library spaces have been designed for a variety of uses: social learning, group study, individual study and specialist subject areas. However, that doesn’t mean they are used for those purposes. Footfall can mean checking email and logging on to Facebook (which of course then links back to computer log in data and how that doesn’t necessarily reflect studying), but it can also mean intensive group preparation, e.g. law students working on a moot (perhaps without using computers or resources other than hard copy reference editions of law reports).

JISCAD - TWITTER SEARCH

really good progress being made on SALT API #jiscad #jiscsalt http://bit.ly/mxtbBa JUN 17, 2011 01:56P.M.

If we want to take the data even further, we could take it deeper into borrowing in terms of specific collection usage too. Other research (De Jager, 2002) has found significant correlations between specific hard copy collections (in De Jager’s case, examples include reference materials and short loan items) and attainment, with similar varying relationships between resource use and academic achievement across different subjects. If we were to break down collection type in our borrowing analysis (particularly where there may be special collections of materials or large numbers of shorter loan periods), would we find anything that would link up to electronic resource use as a comparison? We could also consider incorporating reading lists into the data to check whether recommended texts are used heavily in high attainment groups…

De Jager, K. (2002), “Successful students: does the library make a difference?” Performance Measurement and Metrics 3 (3), p.140-144


LEEDS MET STAR-TRAK PROJECT

uciad.info, as well as the linked data platform of the Open University — data.open.ac.uk.

robmoores

Traces of activities around a user

JUN 16, 2011 09:52P.M.

The first piece of inference that we need to realise is to be able to identify and extract, within our data, information related to the particular traces of activities realised by a user. To identify a user, we rely here on the settings used to realise the activity. A setting, in our ontology, corresponds to a computer (generally identified by its IP address) and an agent (generally a browser, identified by a generally complex string such as Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24). The first step is therefore to associate a user with the settings he/she usually uses. We are currently developing tools so that a user can register with the UCIAD platform and have his/her settings automatically detected. Here, I manually declared the settings I’m using by providing the triple store with the following piece of RDF:

One of our aims in developing STAR-Trak is to make it as easy as possible for other educational institutions to implement it. As a result the technology solution comprises:

Development:
• PHP – A server side scripting language, which is fast, reliable, simple to use, open source and has a huge community supporting it.
• CodeIgniter framework – very lightweight compared with others.
• MVC – (Model, View, Controller) architecture

Access Security:
• Users can be authenticated against a corporate directory.

Technology Stack:
• Apache Server version 2.2.11 or higher
• PHP version 5.2.9 or higher (Reporting layer)
• Oracle database 10g/11g
• Data Warehouse – Data Integration layer (Facts and Dimensions)

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:actor="http://uciad.info/ontology/actor/">
  <rdf:Description rdf:about="http://uciad.info/actor/mathieu">
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/4eafb6e074f46857b1c0b4b2ad0aa8e4"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/c97fc7faeadaf5cac0a28e86f4d723c9"/>
    <actor:knownSetting rdf:resource="http://uciad.info/actorsetting/eec3eed71319f9d0480ff065334a5f3a"/>
  </rdf:Description>
</rdf:RDF>

Click here to view or download a diagram of the STAR-Trak architecture.

This indicates that the user http://uciad.info/actor/mathieu has three settings. These settings are all on the same computer and correspond to the Safari and Chrome browsers, as well as the Apple PubSub agent (used in retrieving RSS feeds amongst other things).

UCIAD

Reasoning over user-centric activity data

Each trace of activity is realised through a setting (linked to the trace by the hasSetting ontology property). Knowing the settings of a user therefore allows us to list the traces that correspond to this particular user through a simple query. Even better, we can create a model, i.e. an RDF graph, that contains all the information related to the user’s activity on the considered websites, using a SPARQL construct query:

JUN 16, 2011 09:12P.M.

There are two reasons why we believe ontology technologies will benefit the analysis of activity data in general, and from a user centric perspective in particular. First, ontology related technologies (including OWL, RDF and SPARQL) provide the necessary flexibility to enable the “lightweight” integration of data from different systems. Not only can we use our ontologies as a “pivot” model for data coming from different systems, but this model is also easily extensible, both to take account of the particularities of the different systems around and to allow for custom extensions for particular users, making personalised analysis of personal data feasible.

PREFIX tr:<http://uciad.info/ontology/trace/>
PREFIX actor:<http://uciad.info/ontology/actor/>
construct {
  ?trace ?p ?x. ?x ?p2 ?x2. ?x2 ?p3 ?x3. ?x3 ?p4 ?x4
} where {
  <http://uciad.info/actor/mathieu> actor:knownSetting ?set.
  ?trace tr:hasSetting ?set.
  ?trace ?p ?x. ?x ?p2 ?x2. ?x2 ?p3 ?x3. ?x3 ?p4 ?x4
}

The results of this query correspond to all the traces of activities in our data that have been realised through known settings of the user http://uciad.info/actor/mathieu, as well as the surrounding information. Although this query is a bit rough at the moment (it might include irrelevant information, or miss relevant data that are connected to the traces through too many steps), what is really interesting here is that it provides a very simple and elegant mechanism to extract, from a large amount of raw log data, a subgraph that completely characterises

The second advantage of ontologies is that they allow for some form of reasoning that makes it easier for us to just throw data into them and obtain meaningful results. I use reasoning in a broad sense here to show how, based on raw data extracted from the logs of Web servers, we can obtain a meaningful, integrated view of the activity of a user of the corresponding websites. This is based on current experiments realised with two servers hosting various websites, including blogs such as


the activities of one user on the considered websites. This data can therefore be considered on its own, as a user-centric view on activity data, rather than a server-centric or organisation-centric view. It can also be given back to the user, exported in a machine-readable form, so that he/she can make use of it in other systems and for other purposes.
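To make the idea of a user-centric subgraph concrete, here is a minimal sketch in plain Python rather than SPARQL. It is not the UCIAD implementation: triples are represented as simple tuples, the identifiers in the usage example are invented, and the traversal is a bounded breadth-first walk, which keeps partial paths instead of requiring a complete four-step chain as the construct query does.

```python
from collections import defaultdict

# Property names borrowed from the query; everything else here is illustrative.
HAS_SETTING = "tr:hasSetting"
KNOWN_SETTING = "actor:knownSetting"


def user_subgraph(triples, user, max_depth=4):
    """Return the triples reachable within max_depth steps from the user's traces.

    A trace belongs to the user when its setting is one of the user's
    known settings.
    """
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))

    # Settings declared for this user.
    settings = {o for p, o in outgoing.get(user, []) if p == KNOWN_SETTING}
    # Traces realised through one of those settings.
    frontier = {s for s, p, o in triples if p == HAS_SETTING and o in settings}

    result = set()
    seen = set(frontier)
    for _ in range(max_depth):
        next_frontier = set()
        for node in frontier:
            for p, o in outgoing.get(node, []):
                result.add((node, p, o))
                if o not in seen:
                    seen.add(o)
                    next_frontier.add(o)
        frontier = next_frontier
    return result


# Usage with made-up data: only trace t1 uses a setting known for the user.
example = [
    ("u:mathieu", KNOWN_SETTING, "set1"),
    ("t1", HAS_SETTING, "set1"),
    ("t1", "tr:hasPage", "page1"),
    ("page1", "isPartOf", "site1"),
    ("t2", HAS_SETTING, "set2"),
    ("t2", "tr:hasPage", "page2"),
]
sub = user_subgraph(example, "u:mathieu")
```

The depth limit plays the same role as the fixed number of triple patterns in the query: it bounds how much surrounding information is pulled in with each trace.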

traces over BlogFeed realised with the Apple PubSub agent as a particular category of activities (e.g., FeedSyndication), alongside others that characterise other kinds of activities: recovering data, reading news, commenting, editing, searching, …

We are currently working on the mechanisms allowing users to register/login to the UCIAD platform, to identify their settings and to obtain their own “activity data repository”.

LIBRARY IMPACT DATA PROJECT

Good news everybody…

Reasoning about websites and activities

We are very pleased to report that we have now received all of the data from our partner organisations and have processed all but two already!

JUN 16, 2011 04:23P.M.

The second aspect of reasoning with user-centric activity data relates to inferring information from the data itself, to support its interpretation and analysis. What we want to achieve here is, through providing ontological definitions of different types of activities, to be able to characterise different types of traces and classify them as evidence of particular activities happening.

Early results are looking positive and our next step is to report back with a brief analysis to each institution. We are planning to give them our data and a general set of data so that they can compare and contrast. There have been some issues with the data, some of which have been described in previous blog posts; however, we are confident we have enough to prove the hypothesis one way or another!

The first step in realising such inferences is to characterise the resources over which activities are realised: in our case, websites and webpages. Our ontologies define a webpage as a document that can be part of a webpage collection, and a website as a particular type of webpage collection. As part of setting up the UCIAD platform, we declare in the RDF model the different collections and websites that are present on the considered server, as well as the URL patterns that make it possible to recognise webpages as parts of these websites and collections. These URL patterns are expressed as regular expressions, and an automatic process is applied to declare triples of the form page1 isPartOf website1 or page2 isPartOf collection1 when the URLs of page1 and page2 match the patterns of website1 and collection1 respectively.
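The pattern-matching step described above could look roughly like the following sketch. The pattern table, the example.org URLs and the function name are all made up for illustration; the post does not show the actual patterns or how the platform stores them.

```python
import re

# Hypothetical declarations: collection or website name -> URL regular expression.
PATTERNS = {
    "website1": r"^http://example\.org/blog/.*$",
    "collection1": r"^http://example\.org/.*/feed/?$",
}


def is_part_of_triples(urls, patterns):
    """Return a (page, 'isPartOf', collection) triple for every URL/pattern match."""
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    triples = []
    for url in urls:
        for name, rx in compiled.items():
            if rx.match(url):
                triples.append((url, "isPartOf", name))
    return triples


# Usage: a page can belong to several collections at once.
pages = [
    "http://example.org/blog/post-1",
    "http://example.org/blog/feed/",
    "http://example.org/about",
]
triples = is_part_of_triples(pages, PATTERNS)
```

Note that the feed URL ends up in both website1 and collection1, which is exactly the kind of overlap that class definitions built on isPartOf can then exploit.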

In our final project meeting in July we hope to make a decision on what form the data will take when released under an Open Data Commons Licence. If all the partners agree, we will release the data individually; otherwise we will release the general set for others to analyse further.

LIBRARY IMPACT DATA PROJECT

What will this project do for library users? JUN 16, 2011 04:03P.M.

Now, the interesting thing is that these websites, collections and webpages can be further specified into particular types and as having particular properties. We declare, for example, that http://uciad.info/ub/ is a Blog, which is a particular type of website. We can also declare a webpage collection that corresponds to RSS feeds, using a particular URL pattern, and use an ontology expression to declare the class of BlogFeed as the set of webpages which are both part of a Blog and part of the RSSFeed collection, i.e., in the OWL abstract syntax

The project aims to draw some pretty big conclusions by the end of the data analysis about library usage and attainment, but what can we actually do with this information once we’ve got proof? What use is it to our customers? We’ve got two main groups of library users: staff and students. We aim to use our quantitative data to pinpoint groups of students who have a particular level of attainment. We’ll work with staff in order to improve poor scores and learn from those who are awarded high scores, regardless of whether they are high or low users of our resources and facilities. Focus groups held now, and most likely regularly in the future, will tell us more about people who use the library resources less but achieve good degree results. If the materials we are providing aren’t what students want to use, we can tailor our collections to reflect their needs as well as ensure they get the right kind of information their tutors want them to use.

Class(BlogFeed complete
  intersectionOf(Webpage
    restriction(isPartOf someValuesFrom(RSSFeed))
    restriction(isPartOf someValuesFrom(Blog))))

What is interesting here is that such a definition can be added to the repository, which, using its inference capability, will derive that certain pages are BlogFeeds, without this information being directly provided in the data, or the rule to derive it being hard-coded in the system. We can therefore engage in an incremental construction of an ontology characterising websites and activities generally, in the context of a particular system, or in the context of a particular user. Our user http://uciad.info/user/mathieu might for example decide to add to his data repository the ontological definition allowing him to recognise

The student benefits are pretty obvious – the more we can advise and communicate to them and encourage use of library staff, and electronic and paper resources, the more likely they are to get a good degree and get value from their time (and money!) spent at university.


Once again we state here that we are aware of other factors in student attainment, but a degree is not achieved without having some knowledge of the subject, and we help supplement the knowledge communicated by lecturers.

course, but they all showed a willingness to make sense of their academic environments, some even finding ingenious ways around the perceived inadequacies of our systems. It would be expedient to think that it is our wonderful and expensive resources that make the difference in students’ performance and ultimately their results. But I suspect that a more crucial factor is the depth of the students’ engagement with their studies rather than the intrinsic value of our resources. My guess is that most of the students attending the focus groups will go on to do well in their studies. They will do well because they are keen, and because this motivation is translated into a willingness to try things out and explore the resources and services at their disposal.

Students get value for money and hopefully enjoy their university experience, lecturers ensure students get the right kind of support and materials they need, and we make sure our budget is used appropriately. Pretty good, huh?

LIBRARY IMPACT DATA PROJECT

Some thoughts on the focus groups from De Montfort University

The fact that many students comment on the awkwardness of our systems and searching tools (i.e. catalogues and databases) could also have a role to play in explaining the correlation between Athens logins and degree results. Motivated students are more likely to explore the resources that are available to them and also more likely to jump over hurdles and persevere to get to the good stuff. So, could the strong correlation between Athens logins and degree results be as much an indicator of students’ motivation and staying power as it is of the usefulness of our resources? And could the advent of discovery tools like Summon or EBSCO Discovery lessen this correlation? Indeed, if searching for ‘quality’ resources becomes as ‘easy’ as searching Google, will usage of online library resources still be a measure of the difference between the good and the not so good student? Or will the difference only become noticeable further along the way (e.g. how students make sense of the information they find)? But if so, will we be able to measure it?

JUN 16, 2011 04:00P.M. Given the time of year and the short time scale in which to hold them (just before Easter), we were pleasantly surprised to receive an overwhelming 204 replies to our invitation email for the three LIDP focus groups. The ten pounds print credit incentive must have looked particularly attractive during assignment time, especially when our most generous offering for focus groups so far had not exceeded five pounds. Expecting less than 50% attendance, we invited twenty students to each focus group and gently let down the rest. Attendance was also better than expected, with thirty-five students attending in total, twenty-six of whom were full-time undergraduate students. Students were on the whole interested and enthusiastic, and some questions and comments generated lively discussion around the table, especially when talking about mysteriously missing files from the library PCs! There were some insightful comments: summing up a conversation about the limitations of some of the library searching tools and some ways around these, one student remarked ‘it seems that a lot of us use different means to go around the library rather than use the library engines as we can’t find things by using them. We are working around the library not through it’.

The focus groups also helped to explain the lack of correlation found between usage of the library itself (i.e. the physical space) and degree results. Although most students use the library regularly, there is a very clear division between those students who prefer working in the library and those who prefer working at home. This preference does not appear to be linked to motivation or engagement with their course but to other factors such as personal preferences, distance from the library, the nature of the task undertaken, and the availability of internet access at home. So for those students, using the library as a space is not an indication of how hard they work. Moreover, whilst Athens cannot be used for much else besides studying, there are many more ways in which the library can be used than for studying (e.g. using the PCs for fun, chatting, meeting place).

At first, I was not sure what the focus groups could add to the very neat graphs that Dave has already produced from our quantitative data. Students who attend focus groups are not usually a representative sample, and these groups were no exception. As one student remarked ‘we have all made the effort to come to this focus group, it kind of shows we are in the same mind’ (i.e. motivated and keen to do well).

All in all, the focus groups were a great opportunity to meet some great students, gain a deeper insight into students’ experience of using the library, and generate a lot of interesting qualitative data. They also provided me with much food for thought and speculation!

However, even if only for a biased sample of the student population, the focus groups did flesh out the story behind the figures. What these students have in common is their active engagement with the academic world, including the library and its resources. Most of them read beyond the recommended reading, use the online resources, borrow books regularly, and are keen to get a good degree. This does not mean that they do not get frustrated by faulty equipment and missing books, of

Marie Letzgus De Montfort University


SALT - SURFACING THE ACADEMIC LONG TAIL

the controls in [B]; threshold is the minimum number of unique borrowers that any given combination of items must have to be considered, and format specifies how the returned data is required (either xml or json). Results from the web API are displayed in [C], with the actual output from the API reproduced in [D]. Note that all available results are returned by the API but the test code only shows the number set by the third control in [B].

SALT Demo JUN 16, 2011 12:22P.M. A further set of sample data from JRUL, comprising 100,000 loan transactions this time, has been processed and used to test a prototype web API. Signs are encouraging.

The exact format of the output is yet to be ratified but the API is in a state where it can now be incorporated into prototype interfaces at JRUL and in COPAC. In addition the remaining 3 million or so loan transactions from JRUL will be loaded and processed in readiness for user testing.

The process begins with data being extracted from the Talis library management system (LMS) at JRUL in CSV format. This data is parsed by a PHP script which separates the data into two tables in a MySQL database: the bibliographic details describing an item go into a table called items, and the loan-specific data, including borrower ID, goes into a table called, you’ve guessed it, loans. A further PHP script then processes the data into two additional MySQL tables, nloans and nborrowers; nloans contains the total number of times each item has been borrowed, and nborrowers contains, for each combination of two items, a count of the unique number of library users to have borrowed both items.
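As a rough illustration of what the second processing step computes, here is a small Python sketch. The real pipeline uses PHP scripts and MySQL tables; the function name and the in-memory data structures below are assumptions made purely for the example, with loans reduced to (borrower ID, item ID) pairs.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_counts(loans):
    """Build nloans and nborrowers from (borrower_id, item_id) loan records.

    nloans[item] is the total number of times the item was borrowed.
    nborrowers[(item_a, item_b)] is the number of distinct borrowers who
    borrowed both items, with each pair stored in sorted order.
    """
    nloans = Counter()
    items_by_borrower = defaultdict(set)
    for borrower, item in loans:
        nloans[item] += 1
        items_by_borrower[borrower].add(item)

    nborrowers = Counter()
    for items in items_by_borrower.values():
        # Using a set per borrower means repeat loans count only once per pair.
        for pair in combinations(sorted(items), 2):
            nborrowers[pair] += 1
    return nloans, nborrowers


# Usage with invented loan records.
sample_loans = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "a"),
    ("u3", "a"),
]
nloans, nborrowers = build_counts(sample_loans)
```

Here item "a" has four loans but only two borrowers in common with item "b", which shows why the two tables are kept separate.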

JISCAD - TWITTER SEARCH

With the above steps complete, additional processing is performed on demand by the web API. When called for a given item, say item_1, the API returns a list of items for suggested reading, where this list is derived as follows. From the nborrowers table a list of items is compiled from all combinations featuring item_1. For each item in this list the number of unique borrowers, from the nborrowers table, is divided by the total number of loans for that item, from the nloans table, following the logic used by Dave Pattern at the University of Huddersfield. The resulting values are ranked in descending order and the details associated with each suggested item are returned by the API.
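The ranking described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the API code: the function name and the dictionary inputs are invented, and the threshold parameter follows the description given for the demo (the minimum number of unique borrowers a combination must have to be considered).

```python
def suggest(item, nborrowers, nloans, threshold=1):
    """Rank items borrowed alongside `item`.

    nborrowers maps an (item_a, item_b) pair to the number of unique
    borrowers of both; nloans maps an item to its total number of loans.
    Each candidate is scored as shared borrowers / total loans of the candidate.
    """
    scores = []
    for (a, b), shared in nborrowers.items():
        if item not in (a, b) or shared < threshold:
            continue
        other = b if a == item else a
        scores.append((other, shared / nloans[other]))
    # Highest ratio first.
    scores.sort(key=lambda s: s[1], reverse=True)
    return scores


# Usage with invented counts.
pairs = {("a", "b"): 2, ("a", "c"): 3, ("b", "c"): 1}
totals = {"a": 10, "b": 4, "c": 12}
ranked = suggest("a", pairs, totals)
filtered = suggest("a", pairs, totals, threshold=3)
```

Dividing by the candidate's total loans keeps very popular items from dominating every list: item "c" shares more borrowers with "a" than "b" does, yet ranks lower because it is borrowed far more often overall.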

RT @lynncorrigan: @daveyp Seen this? http://bit.ly/kd6Z3t Seems to cover a similar ground to #jiscad #lidp @Graham_Stone @librarygirlknit

For a bit of light relief here’s an image.

JUN 15, 2011 01:56P.M.

This is a screenshot from a piece of code written to demonstrate the web API. For a given item, identified by the ISBN, the details are retrieved from the items table in the MySQL database and displayed in [A]. An asynchronous call is made to the web API that accepts the ISBN as a parameter, along with threshold and format values which are set using


JISC AD Tabbloid: 22 June 2011  

JISC Activity Data project blogs and #jiscad tweets
