

Proceedings of the 12th International Conference on Electronic Publishing

Open Scholarship: Authority, Community and Sustainability in the Age of Web 2.0

Toronto June 25-27, 2008

Editors: Leslie Chan, University of Toronto Scarborough (Canada); Susanna Mornati, CILEA (Italy)



Proceedings of the 12th International Conference on Electronic Publishing, Toronto 2008
University of Toronto (Toronto, Canada)
Edited by: Leslie Chan, University of Toronto Scarborough (Canada); Susanna Mornati, CILEA (Italy)
Published by: International Conference on Electronic Publishing (ELPUB)
ISBN: 978-0-7727-6315-0

First edition. All rights reserved. (C) 2008 Leslie Chan, Susanna Mornati. (C) 2008 For all authors in the proceedings. Disclaimer: Any views or opinions expressed in any of the papers in this collection are those of their respective authors. They do not represent the views or opinions of the University of Toronto, CILEA, the editors and members of the Programme Committee, nor of the publisher or conference sponsors. Any products or services that are referred to in this book may be either trademarks and/or registered trademarks of their respective owners. The publisher, editors and authors make no claim to those trademarks.



Members of the 2008 Programme Committee
Apps, Ann, The University of Manchester (UK)
Baptista, Ana Alice, University of Minho (Portugal)
Borbinha, José, INESC-ID (Portugal)
Cetto, Ana Maria, IAEA (Austria)
Costa, Sely M.S., University of Brasilia (Brazil)
Delgado, Jaime, Universitat Politècnica de Catalunya (Spain)
Diocaretz, Myriam, University of Maastricht (The Netherlands)
Dobreva, Milena, HATII, University of Glasgow & IMI, Bulgarian Academy of Sciences (Bulgaria)
Engelen, Jan, Katholieke Universiteit Leuven (Belgium)
Gargiulo, Paola, CASPUR (Italy)
Gradmann, Stefan, University of Hamburg (Germany)
Guentner, Georg, Salzburg Research (Austria)
Hedlund, Turid, Swedish School of Economics and Business Administration, Helsinki (Finland)
Horstmann, Wolfram, University of Bielefeld (Germany)
Ikonomov, Nikola, Institute for Bulgarian Language (Bulgaria)
Iyengar, Arun, IBM Research (USA)
Jezek, Karel, University of West Bohemia in Pilsen (Czech Republic)
Joseph, Heather, SPARC (USA)
Krottmaier, Harald, Graz University of Technology (Austria)
Linde, Peter, Blekinge Institute of Technology (Sweden)
Martens, Bob, Vienna University of Technology (Austria)
Moore, Gale, University of Toronto (Canada)
Markov, Krassimir, IJITA-Journal (Bulgaria)
Moens, Marie-Francine, Katholieke Universiteit Leuven (Belgium)
Mornati, Susanna, CILEA (Italy)
Nisheva-Pavlova, Maria, Sofia University (Bulgaria)
Paepen, Bert, Katholieke Universiteit Leuven (Belgium)
Perantonis, Stavros, NCSR - Demokritos (Greece)
Schranz, Markus, Pressetext Austria (Austria)
Smith, John, University of Kent at Canterbury (UK)
Tonta, Yasar, Hacettepe University (Turkey)



Acknowledgements
Local Organizing Committee: Gale Moore, Knowledge Media Design Institute, University of Toronto; Gabriela Mircea, University of Toronto Libraries; Jen Sweezie, University of Toronto Scarborough
Graphic design: Joe Beausoleil - floodedstudios@rogers.com
Typesetting: Jen Sweezie
Conference host: Knowledge Media Design Institute, University of Toronto
Conference Sponsors: CILEA Interuniversity Consortium; Synergies; SPARC; TeleGlobal Consulting Group; JISC; Department of Computer Science, University of Toronto
Exhibiting Sponsors: International Development Research Centre (Canada)
Promotional Sponsors: BioMedCentral: The Open Access Publisher; University of Toronto Bookstores



Table of Contents Preface .................................................................................................................................. XI Leslie Chan; Susanna Mornati Organizational and Policy issues A Review of Journal Policies for Sharing Research Data .......................................................... 1 Heather A. Piwowar; Wendy W. Chapman Researcher’s Attitudes Towards Open Access and Institutional Repositories: A Methodological Study for Developing a Survey Form Directed to Researchers in Business Schools. .......................................................... 15 Turid Hedlund The IDRC Digital Library: An Open Access Institutional Repository Disseminating the Research Results of Developing World Researchers .............................................................................................................. 23 Barbara Porrett Metadata and Query Formats Keyword and Metadata Extraction from Pre-prints .............................................................. 30 Emma Tonkin; Henk L. Muller

The MPEG Query Format, a New Standard For Querying Digital Content. Usage in Scholarly Literature Search and Retrieval ............................................. 45 Ruben Tous; Jaime Delgado The State of Metadata in Open Access Journals: Possibilities and Restrictions ................... 56 Helena Francke Collaboration in Scholarly Publishing Establishing Library Publishing: Best Practices for Creating Successful Journal Editors ...................................................... 68 Jean-Gabriel Bankier; Courtney Smith Publishing Scientific Research: Is There Ground for New Ventures?..................................... 79 Panayiota Polydoratou and Martin Moyle The Role of Academic Libraries in Building Open Communities of Scholars ........................................................................................ 90 Kevin Stranack, Gwen Bird, Rea Devakos



Semantic Web and New Services Social Tagging and Dublin Core: A Preliminary Proposal for an Application Profile for DC Social Tagging. ................................................................................................... 100 Maria Elisabete Catarino; Ana Alice Baptista Autogeneous Authorization Framework for Open Access Information Management with Topic Maps ........................................ 111 Robert Barta; Markus W. Schranz AudioKrant, the daily spoken newspaper ............................................................................ 122 Bert Paepen A Semantic Web Powered Distributed Digital Library System ........................................... 130 Michele Barbera; Michele Nucci; Daniel Hahn; Christian Morbidoni Business Models in e-publishing No Budget, No Worries: Free and Open Source Publishing Software in Biomedical Publishing ..................................................................................................... 140 Tarek Loubani, Alison Sinclair, Sally Murray, Claire Kendall, Anita Palepu, Anne Marie Todkill, John Willinsky Should University Presses Adopt An Open Access [Electronic Publishing] Business Model For All of Their Scholarly Books? ....................................................... 149 Albert N. Greco; Robert M. Wharton Scholarly Publishing within an eScholarship Framework – Sydney eScholarship as a Model of Integration and Sustainability .................................................................. 165 Ross Coleman Usage Patterns of Online Literature Global Annual Volume of Peer Reviewed Scholarly Articles and the Share Available Via Different Open Access Options ................................................................ 178 Bo-Christer Björk; Annikki Roos; Mari Lauri Characteristics Shared by the Scientific Electronic Journals of Latin America and the Caribbean ................................................................................................................. 187 Saray Córdoba-González; Rolando Coto-Solano Consortial Use of Electronic Journals in Turkish Universities .......................................... 203 Yasar Tonta; Yurdagül Ünal



A Rapidly Growing Electronic Publishing Trend: Audiobooks for Leisure and Education ........................................................................................................................ 217 Jan J. Engelen New Challenges in Scholarly Communications The SCOAP3 project: Converting the Literature of an Entire Discipline to Open Access ...................................................................................................................... 223 Salvatore Mele Modeling Scientific Research Articles – Shifting Perspectives and Persistent Issues .............................................................................................................. 234 Anita de Waard; Joost Kircz Synergies, OJS, and the Ontario Scholars Portal ................................................................ 246 Michael Eberle-Sinatra; Lynn Copeland; Rea Devakos Open Access in Less Developed Countries African Universities in the Knowledge Economy: A Collaborative Approach to Researching and Promoting Open communications in Higher Education .................... 254 Eve Gray, Marke Burke Open Access in India: Hopes and Frustrations ...................................................................... 271 Subbiah Arunachalam An Overview of The Development of Open Access Journals and Repositories in Mexico ....................................................................................................................... 280 Isabel Galina; Joaquín Giménez Brazilian Open Access Initiatives: Key Strategies and Actions .......................................... 288 Sely M S Costa; Fernando C L Leite Information Retrieval and Discovery Services Interpretive Collaborative Review: Enabling Multi-Perspectival Dialogues to Generate Collaborative Assignments of Relevance to Information Artefacts in a Dedicated Problem Domain ................................................................................... 299 Peter Pennefather and Peter Jones Joining Up ‘Discovery to Delivery’ Services ..................................................................... 312 Ann Apps; Ross MacIntyre



Web Topic Summarization ................................................................................................... 322 Josef Steinberger; Karel Jezek; Martin Sloup Open Access and Citations Open Access Citation Rates and Developing Countries ..................................................... 335 Michael Norris; Charles Oppenheim; Fytton Rowland Research Impact of Open Access Research Contributions across Disciplines .................. 343 Sheikh Mohammad Shafi

Exploration and Evaluation of Citation Networks .............................................................. 351 Karel Jezek; Dalibor Fiala; Josef Steinberger Added-value Services for Scholarly Communication Advancing Scholarship through Digital Critical Editions: Mark Twain Project Online .............................................................................................................................. 363 Lisa R. Schiff Preserving The Scholarly Record With WebCite (www.webcitation.org): An Archiving System For Long-Term Digital Preservation Of Cited Webpages ................... 378 Gunther Eysenbach Enhancing the Sustainability of Electronic Access to ELPUB Proceedings: Means for Long-term Dissemination .............................................................................. 390 Bob Martens; Peter Linde; Robert Klinc; Per Holmberg A Semantic Linking Framework to Provide Critical Value-added Services for E-journals on Classics: Proposal of a Semantic Reference Linking System between On-line Primary and Secondary Sources ................................................ 401 Matteo Romanello Posters & Demonstrations Creating OA Information for Researchers ........................................................................... 415 Peter Linde, Aina Svensson Open Scholarship eCopyright@UP. Rainbow Options: Negotiating for the Proverbial Pot of Gold .................................................................................................... 417 Elsabé Olivier



Scalable Electronic Publishing in a University Library ...................................................... 421 Kevin Hawkins

Issues and Challenges to Development of Institutional Repositories in Academic and Research Institutions in Nigeria ............................................................................... 422 Gideon Emcee Christian When Codex Meets Network: Toward an Ideal Smartbook ................................................ 425 Greg Van Alstyne and Robert K. Logan Revues.org, Online Humanities and Social Sciences Portal ............................................... 426 Marin Dacos

AbstractMaster® ................................................................................................................. 428 Daniel Marr A Deep Validation Process for Open Document Repositories ............................................ 429 Wolfram Horstmann, Maurice Vanderfeesten, Elena Nicolaki, Natalia Manola Pre-Conference Workshops Publishing with the CDL's eXtensible Text Framework (XTF) ............................................. 432 Kirk Hastings; Martin Haye; Lisa Schiff Open Journal Systems: Working with Different Editorial and Economic Models ................. 433 Kevin Stranack; John Willinsky Repositories that Support Research Management .................................................................. 434 Leslie Carr Opening Scholarship: Strategies for Integrating Open Access and Open Education ............. 435 Mark Surman; Melissa Hagemann Boost your capacity to manage DSpace! ................................................................................ 436 Wayne Johnston; Rea Devakos; Peter Thiessen; Gabriela Mircea





Preface

It is a pleasure for us to present you with these proceedings, consisting of over 40 contributions from six different continents accepted for presentation at the 12th ELPUB conference. The conference, generously hosted by the Knowledge Media Design Institute at the University of Toronto, was chaired by Leslie Chan, University of Toronto Scarborough, Canada, and Susanna Mornati, CILEA, Italy. The 12th ELPUB conference carried on the tradition of previous international conferences on electronic publishing, bringing together researchers, developers, librarians, publishers, entrepreneurs, managers, users and all those interested in issues regarding electronic publishing and scholarly communications to present their latest projects, research, or new publishing models or tools. This year marks the first time ELPUB was held in North America. Previous meetings were held in the United Kingdom (in 1997 and 2001), Hungary (1998), Sweden (1999), Russia (2000), the Czech Republic (2002), Portugal (2003), Brazil (2004), Belgium (2005), Bulgaria (2006), and Austria (2007). The theme of this year’s meeting was “Open Scholarship”. Participants and presenters explored the future of scholarly communications resulting from the intersection of semantic web technologies, the development of new communication and knowledge platforms for the sciences as well as the humanities and social sciences. We also encouraged presenters to explore new publishing models and innovative sustainability models for providing open access to research outputs. Open Access is now a mainstream debate in publishing, and technological evolution and revolution in the digital world is transforming scholarly communication beyond traditional borders. The impact of the web on daily life has resulted in the involvement of rapidly growing audiences world-wide, who are now e-reading and e-writing, making e-publishing an even more important and widespread phenomenon.
Electronic calls for submissions for ELPUB 2008 were distributed widely, resulting in over 80 submissions that covered a broad range of scholarly publishing issues. In addition to technical papers dealing with metadata standards, exchange protocols, new online reading tools and service integration, we also received a fair number of papers reporting on the economics of openness, public policy implications, and institutional support and collaboration on digital publishing and knowledge dissemination. A number of conceptual papers also examined the changing nature of scholarly communications made possible by open peer-to-peer production and new financial models for the production and dissemination of knowledge. In order to guarantee the high quality of papers presented at ELPUB 2008, all submissions were peer-reviewed by at least three members of the international Programme Committee (PC) and additional peer reviewers. Together, these reviewers represented a broad range of technical expertise as well as diverse disciplinary interests. Their contributions of time and feedback to the authors ensured the high quality of papers that were presented at the conference and in these proceedings. We would like to express our sincere appreciation of their efforts and dedication. To assist with the assignment of reviewers, submitters were asked to characterise their entries by selecting 3-5 key words that best represent their work. In a similar way, reviewers identified their 3-5 fields of expertise, allowing the Programme team to match papers to reviewers. Accepted papers were then grouped into sessions according to common and overlapping themes. The Table of Contents of this volume follows both the themes and the order of the sessions in which they were scheduled during the conference. Over the past three years the SciX Open Publishing System was used to manage the submission and review of abstracts.
This year, we decided to experiment with the Open Conference System, an open source software application designed by the Public Knowledge Project at Simon Fraser University, to manage all aspects of an academic conference. The system worked well in most respects, though we encountered a number of small bugs and irregularities. We provided feedback to the development team and we are sure that these issues will be addressed in the next release. This is indeed the beauty of open source - community input and the sharing of benefits. We would like to thank the Public Knowledge Project for providing the software and for their key role in promoting open scholarship. As with all previous ELPUB conferences, this collection of papers and their metadata are made available through several channels of the Open Archives Initiative, including Dublin Core metadata distribution and full archives at http://elpub.scix.net. It may appear ironic to have printed proceedings for a conference dedicated to electronic publishing. However, the “need” for printed publications is an old and continuing one. It seems that it is still essential for a significant number of delegates to have “something tangible” in their hands and to show their respective university administrations. Thanks go to Tanzina Islam for checking the references, Jen Sweezie for copyediting and organizational support, Gabriela Mircea of the University of Toronto Library for maintaining the Open Conference System and providing valuable technical support, to the student volunteers, and to many others who made this event possible. Finally we would like to thank the various sponsors for their generous contributions. We hope you enjoy reading the proceedings from ELPUB 2008. It is also our pleasure to invite delegates and readers to ELPUB 2009, taking place in Milan, Italy. The 13th ELPUB conference will be organised by CILEA and the University of Milan. Details of the conference will be forthcoming at the ELPUB web site. As these proceedings go to press, we look forward to a very successful and productive conference.

General Chair Leslie Chan

Program Chair Susanna Mornati



A Review of Journal Policies for Sharing Research Data Heather A. Piwowar; Wendy W. Chapman Department of Biomedical Informatics, University of Pittsburgh 200 Meyran Avenue, Pittsburgh PA, USA e-mail: hpiwowar@gmail.com; wec6@pitt.edu

Abstract

Background: Sharing data is a tenet of science, yet commonplace in only a few subdisciplines. Recognizing that a data sharing culture is unlikely to be achieved without policy guidance, some funders and journals have begun to request and require that investigators share their primary datasets with other researchers. The purpose of this study is to understand the current state of data sharing policies within journals, the features of journals that are associated with the strength of their data sharing policies, and whether the strength of data sharing policies impact the observed prevalence of data sharing. Methods: We investigated these relationships with respect to gene expression microarray data in the journals that most often publish studies about this type of data. We measured data sharing prevalence as the proportion of papers with submission links from NCBI’s Gene Expression Omnibus (GEO) database. We conducted univariate and linear multivariate regressions to understand the relationship between the strength of data sharing policy and journal impact factor, journal subdiscipline, journal publisher (academic societies vs. commercial), and publishing model (open vs. closed access). Results: Of the 70 journal policies, 53 made some mention of sharing publication-related data within their Instruction to Author statements. Of the 40 policies with a data sharing policy applicable to gene expression microarrays, we classified 17 as weak and 23 as strong (strong policies required an accession number from database submission prior to publication). Existence of a data sharing policy was associated with the type of journal publisher: 46% of commercial journals had a data sharing policy, compared to 82% of journals published by an academic society. All five of the open-access journals had a data sharing policy.
Policy strength was associated with impact factor: the journals with no data sharing policy, a weak policy, and a strong policy had respective median impact factors of 3.6, 4.9, and 6.2. Policy strength was positively associated with measured data sharing submission into the GEO database: the journals with no data sharing policy, a weak policy, and a strong policy had median data sharing prevalence of 8%, 20%, and 25%, respectively. Conclusion: This review and analysis begins to quantify the relationship between journal policies and data sharing outcomes. We hope it contributes to assessing the incentives and initiatives designed to facilitate widespread, responsible, effective data sharing. Keywords: data sharing; editorial policies; instructions for authors; bibliometrics; gene expression microarrays

1. Background

Widespread adoption of the Internet now allows research results to be shared more readily than ever before. This is true not only for published research reports, but also for the raw research data points that underlie the reports. Investigators who collect and analyze data can submit their datasets to online databases, post them on websites, and include them as electronic supplemental information – thereby making the data easy to examine and reuse by other researchers. Reusing research data has many benefits for the scientific community. New research hypotheses can be tested more quickly and inexpensively when duplicate data collection is reduced. Data can be aggregated to study otherwise-intractable issues, and a more diverse set of scientists can become involved when analysis is opened beyond those who collected the original data. Ethically, it has long been considered a tenet of scientific behavior to share results[1], thereby allowing close examination of research conclusions and facilitating others to build directly on previous work. The ethical position is even stronger when the research has been funded by public money[2], or the data are donated by patients and so should be used to advance science to the greatest extent permitted by the donors[3].

Unfortunately, these advantages only indirectly benefit the stakeholders who bear most of the costs for sharing their datasets: the primary data-producing investigators. Data sharing is often time consuming, confusing, scary, and potentially damaging to future research plans. Consequently, sharing data is commonplace in only a few subdisciplines. Recognizing that a data sharing culture is unlikely to be achieved without policy guidance, some funders and journals have begun to request and require that investigators share their primary datasets with other researchers. Funders are motivated by the promise of resource efficiency and rapid progress. The motivation for journals to act as an advocate and gatekeeper for data sharing is less straightforward. Journals seek to publish “well-written, properly formatted research that meets community standards” and in so doing have assumed monitoring tasks to “remind researchers of community expectations and enforce some behaviors seen as advantageous to the progress of science.”[4] This role has been encouraged by many letters[5, 6], white-papers[7, 8], and editorials in high-profile journals[9].

Journal policies are usually expressed within “instruction for authors” statements. A study by McCain in 1995[4] explored the statements of 850 journals, looking for mandates for the dissemination of data (and the sharing of biological materials).
She found that 132 (16%) natural science and technology journals had a policy regarding sharing of some type of research-related information. While McCain covered a wide breadth and depth of journals (especially given that her review predated electronic access to instruction for author statements), she did not attempt to associate the policies with journal attributes, nor did she measure the actual data sharing behavior of authors and correlate the prevalence with journal policy strength. We believe looking at these issues could help us better understand the causes and effects of journal data sharing policies. The purpose of this study is to understand the current state of data sharing policies within journals, to identify which characteristics of journals are associated with the strength of their data sharing policies, and to measure whether the strength of data sharing policies impacts the observed prevalence of data sharing.

2. Methodology

Our study involved three steps. First, we identified a set of journals for examination. For each journal, based on a manual review of the instruction to author statement, we classified the strength of its policy for data sharing as none, weak, or strong. Second, we studied the relationship between the strength of a journal's data sharing policy and selected journal attributes. Third, for each journal, we measured how many of its recently published articles have submitted datasets to a centralized database. We used these estimates to study the relationship between data sharing prevalence and the strength of the journal's data sharing policy. Each of these steps is described below in more detail.

2.1 Collecting the journal's policies on sharing data

To avoid unnecessary complexity, we chose to investigate data sharing policies for a single type of data: biological gene expression microarrays. These "chips" allow investigators to measure the relative level of RNA expression across tens of thousands (exponentially more each year, as the technology improves) of different genes for each cell line in their study. For example, a clinical trial might involve extracting a small piece of breast cancer tumor from each of 100 patients who responded to a given chemotherapy treatment and from another 100 patients who did not. Cells from each patient's tumor would be hybridized to a microarray chip, then the investigators would compare the relative levels of RNA expression across all the patients to identify a set of genes with expression levels that correlate with chemotherapy response. This high-throughput dataset would include at least a million data points. The dataset is expensive and time-consuming to collect, but very valuable not only to the original investigators for their original purpose but also to other investigators who may wish to study different questions.

Microarray data provide a useful environment for exploring data sharing policies and behaviors, for several reasons. Despite being valuable for reuse, microarray data are often but not yet universally shared. The best-practice guidelines for sharing microarray data are fairly mature, including standards for formatting and minimum-inclusion reporting developed by the active Microarray and Gene Expression Data (MGED) Society. A few centralized databases have emerged as best-practice repositories: the Gene Expression Omnibus (GEO)[10] and ArrayExpress[11]. Several high-profile letters have called for strong data sharing policies[5, 6]. Finally, the National Center for Biotechnology Information's Entrez website (http://www.ncbi.nlm.nih.gov/) makes it easy to identify journal articles that have submitted datasets to GEO, allowing us to study the association between journal policies and observed data sharing practice. We identified journals with more than 15 articles published on "gene expression profiling" in 2006, using Thomson's Journal Citation Reports.
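The group comparison sketched in the clinical-trial example above (responders vs. non-responders, ranked by difference in mean expression) can be illustrated with a toy script. This is not the authors' analysis pipeline; the gene names and data are invented, and real microarray studies would use proper statistical tests and multiple-testing correction.

```python
import statistics

def top_differential_genes(expr_a, expr_b, n=2):
    """Rank genes by the absolute difference in mean expression
    between two patient groups (e.g. responders vs. non-responders).

    expr_a, expr_b: dicts mapping gene id -> list of expression values,
    one value per patient in that group.
    """
    scores = {}
    for gene in expr_a:
        # Difference of group means for this gene.
        scores[gene] = statistics.mean(expr_a[gene]) - statistics.mean(expr_b[gene])
    # Sort by magnitude of the mean difference, largest first.
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranked[:n]

# Toy data: 3 genes, 4 patients per group (gene names are illustrative only).
responders = {"BRCA1": [5.1, 4.9, 5.3, 5.0],
              "TP53": [2.0, 2.1, 1.9, 2.0],
              "MYC": [7.8, 8.1, 7.9, 8.0]}
non_responders = {"BRCA1": [2.0, 2.2, 1.9, 2.1],
                  "TP53": [2.1, 2.0, 2.0, 1.9],
                  "MYC": [6.0, 5.9, 6.1, 6.2]}

print(top_differential_genes(responders, non_responders))  # → ['BRCA1', 'MYC']
```

A real dataset would have tens of thousands of genes rather than three, which is what makes sharing the raw matrix (rather than just the reported gene list) so valuable for reuse.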
We extracted the journal impact factors, subdiscipline categories, and publishing organizations. We looked up each journal in The Directory of Open Access Journals to determine which are based on an open-access publishing model. We used Google to locate the Instructions for Author policies for each of the journals. We manually downloaded and reviewed each policy for all mentions of data sharing.

2.2 Classifying the relative strength of the data sharing policies

We classified each of the policies into one of three categories: no mention of sharing microarray data, a relatively weak data sharing policy, or a strong policy. We defined a weak policy as one that is unenforceable, echoing McCain's terminology.[4] This included policies that merely suggest or request that microarray data be shared, as well as policies that require sharing but fail to require evidence that data has been shared. Strong policies, in contrast, require microarray data to be shared and insist upon a database accession number as a condition of publication. We conducted univariate and linear multivariate regressions to understand the relationship between the strength of data sharing policy and journal impact factor, journal subdiscipline, journal publisher (academic societies vs. commercial), and publishing model (open vs. closed access).
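The none/weak/strong distinction can be made concrete with a keyword heuristic. Note that the study itself classified policies by manual review; the rules below are a hypothetical approximation, not the authors' procedure.

```python
def classify_policy(policy_text):
    """Classify instructions-for-authors text as 'none', 'weak', or 'strong'.

    Heuristic approximation of the paper's manual scheme: a strong policy
    requires a database accession number as a condition of publication;
    a weak one requests or requires sharing without enforcement.
    """
    text = policy_text.lower()
    # No mention of data sharing at all -> no applicable policy.
    if "microarray" not in text and "data" not in text:
        return "none"
    # Enforcement: sharing is tied to supplying an accession number.
    if "accession number" in text and any(
        w in text for w in ("must", "require", "condition of publication")
    ):
        return "strong"
    # Unenforceable suggestion, request, or unverified requirement.
    if any(w in text for w in ("must", "require", "should", "request", "encourage")):
        return "weak"
    return "none"

print(classify_policy(
    "Authors must deposit microarray data and supply the accession "
    "number as a condition of publication."
))  # → strong
```

In practice, policy language varies too widely for keyword matching to be reliable, which is presumably why the authors reviewed each statement by hand.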

2.3 Measuring the frequency with which authors share their data

To make a preliminary estimate of data sharing prevalence, we began by querying PubMed for journal articles published in 2006 or 2007 that were likely to have generated gene expression microarray data. These articles form the denominator of our prevalence estimate, so ideally only studies that produced raw data – articles with potentially shareable data – would be included. Unfortunately, PubMed does not provide a straightforward way to accurately identify only studies that produced their own data; a PubMed query for articles about gene expression microarray data (“Gene Expression Profiling”[MeSH] AND “Oligonucleotide Array Sequence Analysis”[MeSH]) returns not only studies that produced their own data, but also studies that strictly reused previous datasets (and therefore don’t have their own raw microarray data to share) and even articles about new tools for storing and analyzing gene expression microarray data. A more accurate retrieval of data-producing studies would require access to the article’s full text, and was beyond the scope of this paper. Nonetheless, if we assume that articles about data reuse and tools occur in journals independently of the journal’s data sharing policy, we can use the rough PubMed query to provide a preliminary estimate of relative prevalence. It is crucial, however, that we interpret these estimates relative to one another and not compare them to a theoretical ideal of 100%. Since the denominator of our percentages is not “number of papers that produced microarray data and could have shared it” but rather “number of papers about microarrays,” even if all studies that produced data in fact shared it, our estimates would still be less than 100%.

Using the NCBI’s Entrez website, for each journal in our cohort, we counted the total number of articles returned by our PubMed query and the percentage of those articles that had links to the GEO data repository. We conducted univariate and linear multivariate regressions over the journal data-sharing prevalence percentages to understand if strength of data sharing policy was associated with observed data sharing prevalence, including covariates for journal impact factor, journal subdiscipline, publisher type, and publishing model.
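Once the per-journal counts are in hand, computing prevalence and comparing groups is simple arithmetic. A minimal sketch of that step, using an invented toy cohort rather than the paper's actual counts:

```python
import statistics

def prevalence(geo_linked, total):
    """Share of a journal's microarray articles with a GEO submission link,
    as a percentage of all articles matched by the PubMed query."""
    return 100.0 * geo_linked / total

def median_prevalence_by_policy(journals):
    """Group journals by policy strength and take the median prevalence,
    mirroring the paper's univariate comparison.

    journals: list of (policy, geo_linked, total) tuples.
    """
    groups = {}
    for policy, geo_linked, total in journals:
        groups.setdefault(policy, []).append(prevalence(geo_linked, total))
    return {policy: statistics.median(vals) for policy, vals in groups.items()}

# Toy cohort for illustration only (not the paper's data).
cohort = [
    ("none", 4, 50), ("none", 6, 50),
    ("weak", 10, 50), ("weak", 12, 50),
    ("strong", 12, 50), ("strong", 14, 50),
]
print(median_prevalence_by_policy(cohort))
# → {'none': 10.0, 'weak': 22.0, 'strong': 26.0}
```

As the paper stresses, these percentages are only meaningful relative to one another, since the denominator includes reuse and tool papers with no raw data to share.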

3. Results

3.1 Journals' policies on sharing data

Seventy journals met the selection criteria, spanning a wide range of impact factors (0.9 to 30.0, median 4.5). A minority (22) are published by academic societies. Only 5 use an open-access publishing model. Thomson's Journal Citation Reports identified 27 subdisciplines covered by these journals. We retained the categories with more than five members: Biochemistry and Molecular Biology (19), Biotechnology and Applied Microbiology (11), Cell Biology (11), Genetics and Heredity (11), Oncology (19), and Plant Sciences (7). We also retained Multidisciplinary Sciences (n=4) because we were curious about the policies of high-profile journals such as Nature and Science.

Of the 70 journal policies, 30 (43%) had no policy applicable to microarrays. These included 17 journals that make no mention of sharing publication-related data within their Instructions to Authors, and 13 whose policies request or require the sharing of non-microarray types of data (usually DNA and protein sequences) but contain no statement covering data in general or microarray data in particular. The remaining 40 journals had a policy applicable to microarrays. We classified 17 of the microarray-applicable policies as relatively weak and 23 as strong, as detailed in Table 1. The policies varied widely across a number of dimensions; we explore several of these dimensions below, using excerpts from the policies.

3.1.1 Statements of policy motivation

Several journals introduce their policies with a motivation for sharing data. These statements explain the anticipated benefits to the scientific community, the intended service to readers, or the principles of the journal. Examples are given in Table 2. In addition, 22 policies included general-purpose sharing statements, thereby implying their support for the principle of data sharing. An example from Bioinformatics:


A Review of Journal Policies for Sharing Research Data

All data on which the conclusions given in the publication are based must be publicly available.

From BMC Bioinformatics: Submission of a manuscript to BMC Bioinformatics implies that readily reproducible materials described in the manuscript, including all relevant raw data, will be freely available to any scientist wishing to use them for non-commercial purposes.

No Policy (30): Acta Biochimica Et Biophysica Sinica; Annals Of The New York Academy Of Sciences; Biochemical And Biophysical Research Communications; British Journal Of Cancer; Cancer; Cancer Letters; Carcinogenesis; Experimental Cell Research; Frontiers In Bioscience; Gene; Genes Chromosomes & Cancer; Genomics; Human Molecular Genetics; IEEE-ACM Transactions On Computational Biology And Bioinformatics; International Journal Of Molecular Medicine; International Journal Of Oncology; Journal Of Clinical Oncology; Journal Of Leukocyte Biology; Journal Of Neurochemistry; Leukemia Research; Leukemia; Mammalian Genome; Microbes And Infection; Molecular Immunology; Molecular Plant-Microbe Interactions; Oncogene; Oncology Reports; Pharmacogenomics; Plant Molecular Biology; Planta

Weak Policy (17): Bioinformatics; BMC Bioinformatics; BMC Cancer; BMC Genomics; Breast Cancer Research; FASEB Journal; Genome Biology; Genome Research; International Journal Of Cancer; Molecular Endocrinology; Physiological Genomics; Plant Journal; Plant Physiology; Proteomics; Stem Cells; Toxicological Sciences; Virology

Strong (Enforceable) Policy (23): Applied And Environmental Microbiology; Blood; Cancer Research; Cell; Clinical Cancer Research; Developmental Biology; FEBS Letters; Gene Expression Patterns; Infection And Immunity; Journal Of Bacteriology; Journal Of Biological Chemistry; Journal Of Experimental Botany; Journal Of Immunology; Journal Of Pathology; Journal Of Virology; Molecular Cancer Therapeutics; Molecular And Cellular Biology; Nature Biotechnology; Nature; Nucleic Acids Research; Plant Cell; Proceedings Of The National Academy Of Sciences Of The USA (PNAS); Science

Table 1: Classification of journal data-sharing policies for gene expression microarray data

3.1.2 Datatype-specific policies

The journals with general data-sharing policies almost always supplement them with additional instructions for certain datatypes. In fact, many journals only have policies for certain datatypes and none for data sharing in general. The policies for depositing nucleotide sequences are usually stricter than policies for other datatypes, including gene expression microarray data. The FASEB Journal, in contrast, explicitly treats all datatypes


Stem Cells, Blood (similar statement): Stem Cells supports the efforts of the National Academy of Sciences (NAS) to encourage the open sharing of publication-related data. Stem Cells adheres to the beliefs that authors should include in their publications the data, algorithms, or other information that is central or integral to the publication, or make it freely and readily accessible; use public repositories for data whenever possible; and make patented material available under a license for research use.

Bioinformatics: Bioinformatics fully supports the recommendations of the National Academies regarding data sharing.

Genome Research: Genome Research encourages all data producers to make their data as freely accessible as possible prior to publication. Open data resources accompanied by fair use will serve to greatly enhance the scientific quality of work by the entire community and for society at large.

Plant Cell: The purpose of this policy is to ensure that conclusions are scientifically sound.

Physiological Genomics: Work published in the APS Journals must necessarily be independently verifiable [....] Within a short time span, microarrays have become an important, commonly used tool in molecular genetics and physiology research. For microarray analysis of gene expression to have any long-term impact, it is crucial that the issue of reproducibility be adequately addressed.

Proceedings Of The National Academy Of Sciences Of The USA: To allow others to replicate and build on work published in PNAS, authors must make materials, data, and associated protocols available to readers.

Science: After publication, all data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.

Journal Of Biological Chemistry: ... will substantially enhance an author's ability to communicate important research information and will also greatly benefit readers.

Table 2: Selected excerpts to illustrate the variety of data-sharing policy motivations (journal name, followed by the motivation excerpt from its Instructions to Authors)

the same: The FASEB Journal also does not distinguish between microarray data and other sorts of data (proteomics, sequence data, organic syntheses, crystal structures, etc.): All methods must be publicly available and described. Anything published in The FASEB Journal must have all data available not only for review but to every reader, electronic or print.

3.1.3 Sharing requested or required

Most journals with a policy for sharing microarray data state it as a requirement, using phrases like must, required, and as a condition of publication. A few policies (n=4) are less strict, stating their policies as requests through words such as should, recommend, and request.

3.1.4 Data location

Most policies state that microarray data must be made available in a public database. A few are less specific, stating that sharing via public webpages or supplementary journal information is sufficient, or leaving the location unspecified. Some policies are more specific, insisting that the database meet a certain standard. Plant Cell, for example, specifies a permanent public database. Plant Physiology



expands on this theme: Links to web sites other than a permanent public repository are not an acceptable alternative because they are not permanent archives. Two databases, GEO and ArrayExpress, are the predominant centralized storage locations for microarray datasets. Many of the policies suggest that data be deposited into one of these two locations, and a few policies limit the choice to one of these centralized options.

3.1.5 Data format

None of the policies explicitly specified a data format. By recommending or requiring submission to one of the permanent public databases, the journals implicitly stipulate the standard formats used within those databases.

3.1.6 Data completeness

The Microarray and Gene Expression Data (MGED) Society has developed guidelines for the Minimum Information About a Microarray Experiment (MIAME) that is "needed to enable the interpretation of the results of the experiment unambiguously and potentially to reproduce the experiment."[12] Because the experimental conditions for collecting microarray data can be very complex, these MIAME guidelines are very helpful for both data sharers and data reusers. Physiological Genomics includes a rationale for adopting the MIAME guidelines within its instructions to authors: Within a short time span, microarrays have become an important, commonly used tool in molecular genetics and physiology research. For microarray analysis of gene expression to have any long-term impact, it is crucial that the issue of reproducibility be adequately addressed. In addition, since microarray analytic standards are certain to change, it is crucial that authors identify the nature of the experimental conditions prevalent at the time of their research. If today's research is to be relevant tomorrow, the core elements that are immune to obsolescence must be made clear. The APS Journals are adopting the MIAME standards to ensure that what is cutting edge today is not obsolete a few years later.
More than 30 of the data-sharing policies recommend that data be compliant with the MIAME guidelines. As an example of one of the strictest policies, Gene Expression Patterns requires adherence to the MIAME standards and even asks for a completed MIAME checklist to be submitted with the manuscript: Authors submitting manuscripts relying on microarray or similar screens must supply the data as Supplementary data [...] at the time of submission, along with the completed MIAME checklist. The data must be MIAME-compliant and supplied in a form that is widely accessible.

3.1.7 Timeliness of public availability

A few policies specify that microarray data must be available to the public upon publication. None of the policies explicitly allow data to be withheld until a date after publication.

3.1.8 Consequences for not sharing data



Applied And Environmental Microbiology; Infection And Immunity; Journal Of Bacteriology; Journal Of Virology; Molecular And Cellular Biology: Failure to comply with the policies described in these Instructions may result in a letter of reprimand, a suspension of publishing privileges in ASM journals, and/or notification of the authors' institutions.

Nucleic Acids Research: The Editors are prepared to deny further publication rights in the Journal to authors unwilling to abide by these principles.

Mammalian Genome: Failure to comply with this policy may result in exclusion from publication in Mammalian Genome.

Nature; Nature Biotechnology: After publication, readers who encounter a persistent refusal by the authors to comply with these guidelines should contact the chief editor of the Nature journal concerned, with "materials complaint" and the publication reference of the article as part of the subject line. In cases where editors are unable to resolve a complaint, the journal reserves the right to refer the correspondence to the author's funding institution and/or to publish a statement of formal correction, linked to the publication, that readers have been unable to obtain necessary materials or reagents to replicate the findings.

Table 3: Selected excerpts of consequences for noncompliance with data-sharing journal policies (journal name, followed by the excerpt from its Instructions to Authors)

Several policies stipulate consequences for authors who fail to comply with journal conditions, as listed in Table 3. No weak policies included consequences, even though weak policies would benefit most, since their requirements are the least enforceable prior to publication. Although only tangentially related to dataset sharing, it is interesting to note the tough stance that some journals are willing to take when authors refuse to share their biological reagents after publication. From Blood: Although the Editors appreciate that many of the reagents mentioned in Blood are proprietary or unique, neither condition is considered adequate grounds for deviation from this policy. … if a reasonable request is turned down and not submitted to the Editor-in-Chief, the corresponding author will be held accountable. The consequence for noncompliance is simple: the corresponding author will not publish in Blood for the following 3 years.

Genome Research (forbidding exceptions): Genome Research will NOT consider manuscripts where data used in the paper is not freely available on either a publicly held Web site or, in the absence of such a Web site, on the Genome Research Web site. There are NO exceptions.

Proceedings Of The National Academy Of Sciences Of The USA (permitting exceptions): Authors must disclose upon submission of the manuscript any restrictions on the availability of materials or information.

Developmental Biology; Gene Expression Patterns (permitting exceptions): The editors understand that on occasion authors may not feel it appropriate to deposit the entire data set at the time of publication of this paper. We are therefore willing to consider exceptions to this requirement in response to a request from the authors, which must be made at the time of initial submission or as part of an informal pre-submission enquiry.

Science (permitting exceptions): We recognize that discipline-specific conventions or special circumstances may occasionally apply, and we will consider these in negotiating compliance with requests. Any concerns about your ability to meet Science's requirements must be disclosed and discussed with an editor.

Table 4: Selected excerpts to illustrate forbidden and permitted exceptions from data-sharing policies

Figure 1: A boxplot of the impact factors for each journal, grouped by the strength of the journal's data-sharing policy. For each group, the heavy line indicates the median, the box encompasses the interquartile range (IQR, 25th to 75th percentiles), the whiskers extend to datapoints within 1.5xIQR from the box, and the notches approximate the 95% confidence interval of the median

From PNAS: Authors must make Unique Materials (e.g., cloned DNAs; antibodies; bacterial, animal, or plant cells; viruses; and computer programs) promptly available on request by qualified researchers for their own use. Failure to comply will preclude future publication in the journal… Contact pnas@nas.edu if you have difficulty obtaining materials.

Journal Attribute                      | Estimate | p-value
Impact Factor, natural log             |  0.34    | <0.001 ***
Open Access                            |  0.63    |  0.002 **
Published by Association               |  0.23    |  0.046 *
Biochemistry & Molecular Biology       | -0.28    |  0.031 *
Biotechnology & Applied Microbiology   |  0.04    |  0.784
Plant Sciences                         | -0.08    |  0.636
Oncology                               | -0.37    |  0.004 **
Cell Biology                           |  0.10    |  0.485
Genetics & Heredity                    | -0.11    |  0.456
Multidisciplinary Sciences             | -0.29    |  0.207

Table 5: Results of linear multivariate regression over the existence of a journal's data-sharing policy

3.1.9 Exceptions to data sharing policies

Figure 2: A boxplot of the relative data-sharing prevalence for each journal, grouped by the strength of the journal's data-sharing policy. For each group, the heavy line indicates the median, the box encompasses the interquartile range (IQR, 25th to 75th percentiles), the whiskers extend to datapoints within 1.5xIQR from the box, and the notches approximate the 95% confidence interval of the median

At least one journal, Genome Research, explicitly disallows any exceptions to their principle of public data sharing. In contrast, a few other journals state or imply that they are willing to be flexible in some circumstances. Relevant excerpts are included in Table 4.

Journal Attribute                      | Estimate | p-value
Has a Data Sharing Policy              |  0.11    |  0.037 *
Impact Factor, natural log             |  0.06    |  0.118
Open Access                            | -0.07    |  0.386
Published by Association               |  0.15    |  0.002 **
Biochemistry & Molecular Biology       |  0.01    |  0.850
Biotechnology & Applied Microbiology   | -0.01    |  0.866
Plant Sciences                         |  0.08    |  0.232
Oncology                               |  0.02    |  0.737
Cell Biology                           |  0.04    |  0.475
Genetics & Heredity                    |  0.27    | <0.001 ***
Multidisciplinary Sciences             |  0.28    |  0.004 **

Table 6: Results of linear multivariate regression over the prevalence with which the articles in a journal submit their microarray data to a centralized database
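As a rough illustration of the multivariate linear fits behind Tables 5 and 6, an ordinary least squares regression of a journal-level outcome on covariates might look like the sketch below. The data here are synthetic and the column layout is an assumption; the coefficients and p-values in the tables come from the study's real 70-journal cohort.

```python
# Illustrative only: OLS regression of a journal-level outcome on covariates,
# mirroring the shape of the models behind Tables 5 and 6 (synthetic data).
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficient estimates."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 70  # the cohort size in this study
X = np.column_stack([
    np.log(rng.uniform(1, 30, n)),  # ln(impact factor), range as in Section 3.1
    rng.integers(0, 2, n),          # open-access indicator (0/1)
])
y = 0.3 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(0, 0.1, n)
beta = fit_ols(X, y)  # beta[1] and beta[2] recover roughly 0.3 and 0.6
```

In the paper's analyses the outcome is either the existence of a data-sharing policy (Table 5) or the journal's data-sharing prevalence (Table 6), with the full covariate set listed in the tables.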



3.2 The relative strength of the data sharing policies

Based on univariate analysis, data sharing policy strength was associated with impact factor. As seen in Figure 1, the journals with no data sharing policy, a weak policy, and a strong policy had median impact factors of 3.6, 4.9, and 6.2, respectively. Data sharing policy was also associated with journal publisher: 46% of commercially published journals had a data sharing policy, compared to 82% of journals published by an academic society. All five of the open-access journals had a policy. In multivariate analysis, we found that the following variables were positively associated with the existence of a microarray data sharing policy: impact factor, open access, and academic society publishing. In contrast, the subdisciplines of Biochemistry & Molecular Biology and Oncology were negatively associated with the existence of a microarray data sharing policy. Details including all the covariates are provided in Table 5.

3.3 The frequency with which authors share their data

Journals with the strongest data sharing policies had the highest proportion of papers with shared datasets. As seen in Figure 2, the journals with no data sharing policy, a weak policy, and a strong policy had median data sharing prevalences of 8%, 20%, and 25%, respectively. As mentioned in the Methodology section, these proportions should be interpreted relative to each other rather than against a theoretical maximum of 100%. Based on multivariate analysis, we found that articles were more likely to have submitted primary data to GEO when they were published in journals with a data sharing policy, published by an academic society, or in the subdisciplines of Genetics & Heredity or Multidisciplinary Sciences. Details are given in Table 6.

4. Discussion

We found wide variation amongst journal policies on data sharing, even for a data type with well-defined reporting standards and centralized repositories. Journals with a high impact factor, an open-access publishing model, and a non-commercial publisher were most likely to have a data-sharing policy. This could be expected: journals with a high impact factor are able to stipulate conditions that ensure research is of the highest quality without eroding their appeal, open-access journals are often particularly strong advocates for all aspects of open scholarship, and journals published by academic societies have previously been found to endorse data sharing more readily than commercial journals.[4]

Surprisingly, our study did not identify any subdisciplines with an unusually high number of data sharing policies. In contrast, we found that Oncology journals and Biochemistry & Molecular Biology journals were relatively unlikely to have a data sharing policy. The Oncology result is consistent with our observation that medical journals have been slower to embrace new publishing paradigms and open scholarship principles than journals within biology and bioinformatics. This is unfortunate, since cancer microarray data holds particular promise and is often especially expensive and time-consuming to collect. It is also unnecessary, since microarray data can be (and is) shared without compromising patient privacy.

We found that the existence of a data sharing policy was associated with an increase in data sharing behavior. A non-commercial publisher and the subdisciplines of Genetics & Heredity and Multidisciplinary Sciences were also significantly associated with a relatively high frequency of dataset submissions into the GEO database, as a percentage of all published gene expression papers. Studies in Genetics & Heredity often reuse data, so perhaps authors in that field are well acquainted with the value of sharing data.

Interestingly, the two subdisciplines that were negatively associated with the existence of a data sharing policy were not less likely than usual to share their data when other factors were held constant. We were surprised that impact factor was not strongly associated with data sharing prevalence in multivariate analysis, because we suspect that well-funded and high-profile studies are under more pressure to share their data. In the future, we would like to include variables about funding in these analyses.

A large number of journals had a policy for microarray data but not for data in general. This probably reflects the success of MGED's efforts in actively encouraging and supporting microarray data exchange. As such, the results we have found are illuminating but may not be representative of other datatypes with a less mature infrastructure. A study by Brown[13] in 2000 used several methods to investigate the adoption and usage of GenBank, one of the most mature and successful biological databases. She tracked changes in instructions to authors statements across 23 journals over 20 years, and noted that the data sharing policies for sequences have become stronger over time. As she explains, authors who published in the Journal of Biological Chemistry were urged to deposit sequence data into GenBank in 1984, told they "should" deposit data in 1985, and were required to submit data as a condition of publication by 1991. It would be interesting to study whether, as the microarray field continues to mature, the journals we consider to have weak data sharing policies will evolve stronger policies with time. Journals ought to give careful consideration to changing their policies.[14]
Although there may be direct benefits to journals when authors must share their raw research data (reducing fraud, encouraging more careful research), data sharing mandates are controversial.[15] It is possible that new mandates may cause authors to shop for an alternative publishing venue to avoid the hassle. To measure the acceptance of a policy change, the editorial team at Physiological Genomics surveyed their authors and reviewers two years after instituting a data sharing requirement. They found that the vast majority of authors (92%) believed depositing microarray data was of significant value to the scientific community, and "67% of those who responded said they did not find the deposit of microarray data into GEO to be an obstacle to submission or review of articles".[16] Database tools have evolved since that survey, and submitting data continues to get easier.

Nonetheless, sharing data imposes many burdens on those who undertake it, giving investigators a variety of reasons to withhold their data. First, sharing data is often time-consuming: the data have to be formatted, documented, and uploaded. Second, releasing data can induce fear. The original conclusions may be challenged by a re-analysis, whether due to errors in the original study, a misunderstanding or misinterpretation of the data, or simply more refined analysis methods. Future data miners might discover additional relationships in the data, some of which could disrupt the planned research agenda of the original investigators. Investigators may fear they will be deluged with requests for assistance, or need to spend time reviewing and possibly rebutting future re-analyses. They might feel that sharing data decreases their own competitive advantage, whether future publishing opportunities, information trade-in-kind offers with other labs, or potentially profit-making intellectual property.
Finally, it can be complicated to release data. If not well managed, data can become disorganized and lost. Some informed consent agreements may not clearly cover subsequent uses of data. De-identification can be complex. Study sponsors, particularly from industry, may not agree to release raw detailed information, or data sources may be copyrighted such that subsets of the data cannot be freely shared.

Given all of these hurdles, it is natural that authors may need extra encouragement to share their data. We suggest that journal editors take a few simple steps to increase adherence to data sharing policies and thus bring about more open scholarship. First, journals that already mandate data sharing should require the inclusion of an accession number (or a web address for datatypes without databases) upon submission, since "prepublication compliance is much easier to monitor and enforce than postpublication compliance".[4] Second, journals should instruct their editors and reviewers to confirm that accession numbers are included



in the manuscripts, as some journals do for their clinical trial reporting policies.[17] Third, journals should require that authors complete a MIAME checklist to increase the likelihood that shared data are complete and well annotated, following the example of Gene Expression Patterns. To take this a step further, journals could contract with a service like the one offered by ArrayExpress[18] to verify that submitted datasets meet a threshold of annotation quality. Fourth, journals need to implement their consequences: don't publish papers that don't uphold the policies. Finally, during this cultural transition, we recommend that journals support measures that recognize and reward investigators who share data.[19] For example, journals could educate authors and reviewers on responsible data reuse and acknowledgement practices, either as part of instructions to authors statements or in editorials (see the Nature journals [20, 21, 22]). Acknowledging data sources in a machine-readable way (through references, URLs, and accession numbers) will allow the benefits of data reuse to be automatically linked back to the original data producers through citation counts[23] or other usage metrics, and thus provide a positive motivation for sharing data. Innovative attempts to provide microattribution or a data reuse registry may offer additional opportunities for journals to support these goals.[21, 22, 24]

Our study has several important limitations: we explored journal policies for only one type of data, our measured data sharing behavior predated the policy downloads, and the policy classifications were performed by only one investigator. Our method of measuring data sharing behavior captures many but not all articles that shared data; we plan to use natural language processing techniques to find a wider variety of data sharing instances in the future.[25]
Similarly, a full-text query to identify articles that produce primary, shareable data – perhaps using laboratory terms like purify and hybridize – could improve our preliminary estimates of data sharing prevalence. Finally, we note that the reported associations do not imply causation: we have not demonstrated that changing a journal's data sharing policy will change the behavior of authors. Nonetheless, we believe this review and analysis is an important step in understanding the relationship between journal policies and data sharing outcomes. Policies are implemented in the hope of effecting change. It is often said, "You cannot manage what you do not measure." We need to understand the motivation and impact of our various incentives and initiatives if we hope to unleash the benefits of widespread data sharing.

5. Acknowledgements

HP is supported by NLM training grant 5T15-LM007059-19 and WC is funded through NLM grant 1R01LM009427-01. Raw data and statistical analysis code from this study are available at http://www.dbmi.pitt.edu/piwowar/

6. References

[1] MERTON R: The sociology of science: Theoretical and empirical investigations. 1973
[2] GASS A: Open Access As Public Policy. PLoS Biology 2(10):e353, 2004
[3] VICKERS A: Whose data set is it anyway? Sharing raw data from randomized trials. Trials 7:15, 2006
[4] MCCAIN K: Mandating Sharing: Journal Policies in the Natural Sciences. Science Communication 16(4):403-431, 1995
[5] BALL CA et al.: Standards for microarray data. Science 298(5593), 2002
[6] BALL CA et al.: Submission of microarray data to public repositories. PLoS Biol 2(9), 2004
[7] CECH TR et al.: Sharing publication-related data and materials: responsibilities of authorship in the life sciences. Plant Physiology 132(1):19-24, 2003
[8] PANEL ON SCIENTIFIC RESPONSIBILITY AND THE CONDUCT OF RESEARCH: Responsible Science, Volume I: Ensuring the Integrity of the Research Process. 1992
[9] Microarray standards at last. Nature 419(6905), 2002
[10] BARRETT T et al.: NCBI GEO: mining tens of millions of expression profiles—database and tools update. Nucleic Acids Res 35(Database issue), 2007
[11] PARKINSON H et al.: ArrayExpress—a public database of microarray experiments and gene expression profiles. Nucleic Acids Res 35(Database issue), 2007
[12] BRAZMA A et al.: Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat Genet 29(4):365-371, 2001
[13] BROWN C: The changing face of scientific discourse: Analysis of genomic and proteomic database usage and acceptance. Journal of the American Society for Information Science and Technology 54(10):926-938, 2003
[14] Democratizing proteomics data. Nat Biotech 25(3):262-262, 2007
[15] CAMPBELL P: Controversial Proposal on Public Access to Research Data Draws 10,000 Comments. The Chronicle of Higher Education A42, 1999
[16] VENTURA B: Mandatory submission of microarray data to public repositories: how is it working? Physiol Genomics 20(2):153-156, 2005
[17] HOPEWELL S et al.: Endorsement of the CONSORT Statement by high impact factor medical journals: a survey of journal editors and journal 'Instructions to Authors'. Trials 9:20, 2008
[18] BRAZMA A, PARKINSON H: ArrayExpress service for reviewers/editors of DNA microarray papers. Nature Biotechnology 24(11):1321-1322, 2006
[19] Got data? Nat Neurosci 10(8):931-931, 2007
[20] SCHRIGER DL, ARORA S, ALTMAN DG: The content of medical journal Instructions for authors. Ann Emerg Med 48(6), 2006
[21] Human variome microattribution reviews. Nat Genet 40(1), 2008
[22] Compete, collaborate, compel. Nat Genet 39(8), 2007
[23] PIWOWAR HA, DAY RS, FRIDSMA DB: Sharing detailed research data is associated with increased citation rate. PLoS ONE 2(3), 2007
[24] PIWOWAR HA, CHAPMAN WW: Envisioning a Biomedical Data Reuse Registry. Blog post, March 24, 2008: http://researchremix.wordpress.com/2008/03/24/envisioning-a-biomedical-data-reuse-registry/
[25] PIWOWAR HA, CHAPMAN WW: Identifying data sharing in the biomedical literature. AMIA Annual Symposium [submitted], 2008

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008



Researcher's Attitudes Towards Open Access and Institutional Repositories: A Methodological Study for Developing a Survey Form Directed to Researchers in Business Schools

Turid Hedlund
Information Systems Science, Swedish School of Economics and Business Administration
Pb 479, Arkadiagatan 22, 00101 Helsinki, Finland
e-mail: turid.hedlund@hanken.fi

Abstract
The aim of this study was to address the need for further studies on researchers' expectancies and attitudes towards open access publishing. In particular we wanted to focus on acceptance and user behaviour regarding institutional archives. The approach is domain specific and was based on a framework of theories on the intellectual and social organization of the sciences and on communication practices in the digital era. In the study we apply a theoretical model of user acceptance and user behaviour (UTAUT), developed by Venkatesh et al. in 2003, as an explanatory model for developing a survey form for quantitative empirical research on user attitudes and preferences. Our research approach is thus new and cross-disciplinary in the way we combine earlier research results from the fields of organizational theory, information science and information systems science. This is in our view a fruitful approach, broadening the theoretical base of the study and bringing in a deeper understanding of the research problems. As a result of the study we present a model framework and a web survey form for carrying out the empirical study.

Keywords: end-user attitudes; methodological study; web survey

1. Introduction

In recent years we have seen quite a few studies on open access publishing. Among others, large cross-disciplinary surveys of author opinions on open access journals have been published [1], [2], [3]. Author perceptions of the author-charges business model have also been studied in the domain of a medical journal [4]. In studies on institutional repositories, attention has for several years focused mainly on implementation, technical features and interoperability of systems using the OAI-PMH standard. It is a natural development that now, when institutional repositories have been in operation for some time, studies have appeared that evaluate repository content by genre and type of the included documents, as well as the growth rate of submissions [5], [6]. However, even though the concept of open access is known among academic researchers, their research and publishing practices have still not undergone a radical change. The important question of non-use of institutional repositories has lately been raised by [7]. There is a need for a deeper understanding of the extent to which open access practices have spread among academics, and of the main incentives and barriers to acceptance and use of new systems for open access dissemination of research results, for example institutional repositories.

The aim of this study was to form the methodological part of a project on open access and, in particular, on acceptance and user behaviour regarding institutional archives. The approach was to focus on the end-users, in this case researchers in business schools in Finland. As the project limited data collection to a specific field, the framework of theories was based on the intellectual and social organization of the sciences and on communication practices [8], [9], [10]. We also relied on previous studies on open access publishing in the domain of biomedicine [11]. In the study we applied a theoretical model of user acceptance and user behaviour (UTAUT), developed by [12], as an explanatory model for the construction of a questionnaire directed to researchers in business schools. The framework will naturally also be used in the continuing project as a means for analysing the results of the empirical surveys directed to business school researchers. Our research approach was thus new and cross-disciplinary in the way we combined earlier research results from the fields of organizational theory, information science and information systems science. This is in our view a fruitful approach, broadening the theoretical base of the study and bringing in a deeper understanding of the research problems. In the study we addressed the following questions:

• What are the prevailing attitudes toward open access among business school researchers in Finland?

• What types of incentives and barriers for use and non-use of open access publishing channels can be identified?

• Do factors such as the social influence of the faculty and the organization have an impact on acceptance and use of open access and institutional repositories?

• Do personal factors such as perceived usefulness for the research career and perceived ease of use have an impact on acceptance and use of open access and institutional repositories?

The final paper starts with an introductory section on theories describing domain-specific features of scientific communication and scientific publishing in the fields of research typically carried out in business schools. In the following section we build up the framework of theories and models on end-user attitudes and the diffusion of new technology. In the section on study settings we describe the methodology for the design of a web questionnaire and the survey to be carried out in business schools in Finland, followed by an analysis, discussion and concluding remarks.

2. Theoretical Background

For the study of the scientific disciplines represented in business schools we take Whitley's (1984; 2000) theory on the social organization of the scientific fields as our starting point. Whitley's theory characterizes the differences between scientific fields along two main dimensions:

• Degree of mutual dependence – associated with the degree of dependence on fellow scientists, colleagues or particular groups in order to make a proper research contribution

• Degree of task uncertainty – associated with differences in patterns of work, organisation and control in relation to changing contextual factors

Relating Whitley's taxonomy of scientific fields to economics and the related subject of business administration, we can define the following pattern. A high degree of mutual dependence would indicate that scientific communication patterns become more controlled as competition arises; for example, publishing in journals with a high rank within the community is favoured among researchers. This trend is reinforced in business schools, where research evaluation and the meriting of researchers focus to an increasing degree on publishing in journals with high impact factors.



The degree of task uncertainty, which Whitley associates with differences in patterns of work and publishing, might also indicate that a scientific field with a high degree of task uncertainty is less controlled regarding its scientific output. In business schools several different patterns of work and different contextual factors are naturally present, since several different subjects are represented in the departments and faculty. The long tradition within the field of economics of publishing working papers can, for example, be associated with the need to communicate research results at an early stage of the research process.

Hedlund and Roos [11] characterize, in their study on incentives to publish in open access journals, factors depending on the social environment and personal factors of the researcher.

Social factors:

• Policymaking: governmental policy in science and technology, policy of other funding bodies, interest groups and officials
• Increased demands for productivity and accountability
• Internationalisation and strong competition in the scientific field
• Geographical location
• Availability of subject-based and institutional archives and open access journals
• Institutional policies that promote open access publishing
• Communication patterns of the scientific field and the field's willingness to adopt new techniques early

Personal factors:

• The importance of reputation and meriting to the researcher
• Speed of publication and visibility of research results
• Personal communication patterns and willingness to adopt new techniques

Figure 1: The UTAUT model. Source: Venkatesh et al. 2003, p. 447



In studies modelling user acceptance and behaviour, [12] developed a theoretical framework that is well suited as an explanatory framework for the intended statistical analysis of the results from a survey directed to business school researchers in Finland. In formulating the Unified Theory of Acceptance and Use of Technology (UTAUT), [12] identified four constructs as direct determinants of user acceptance and usage behaviour (see Figure 1): performance expectancy, effort expectancy, social influence and facilitating conditions. The four constructs are briefly defined in [12] as follows:

• Performance expectancy – "the degree to which an individual believes that using a system will help to attain gains in job performance"

• Effort expectancy – "the degree of ease associated with the use of a system"

• Social influence – "the degree to which an individual perceives that important others believe he or she should use the new system"

• Facilitating conditions – "the degree to which an individual believes that an organizational and technical infrastructure exists to support use of the system"

Until now these constructs have mainly been used in research on the acceptance of, and intention to use, IT systems. In the present study we adapt them to the study of acceptance, use and eventually also non-use of open access publishing systems such as institutional archives. In the following we have modified the constructs of the UTAUT model to the needs of a study on end-user attitudes (researchers) towards open access publishing.


• Performance expectancy – the degree to which the researcher expects OA publishing to bring gains in research performance and thus to increase his or her personal merits

• Effort expectancy – the degree to which the researcher expects an OA system to be easy to use. This naturally has to do with system technology and design, but also with personal factors such as willingness to learn and use new systems. Experience from the use of other information and communication systems on the web is probably a contributing factor.

• Social influence – the degree to which a researcher is influenced by fellow researchers and the organization

• Facilitating conditions – the degree to which organizational and technical infrastructure is provided to support use of the system

3. Design of questions for the survey form

The survey form is designed as a web survey directed to faculty members and doctoral students in business schools. At the beginning of the form, explanations are provided of concepts related to open access publishing, such as open access journal, university publication repository and subject-based publication repository. The survey form contains four section headings: demographic questions, questions on awareness and use of open access services, questions on open access publishing, and questions on reasons for and barriers to open access publishing. In the following, the factors in the model and their representations in the questionnaire are described. The first section of the questionnaire is described in Table 1. The table includes examples of how the moderating factors in the theoretical framework were depicted in the survey.

Moderating factors      Demographic questions
Gender                  Gender
Age as researcher       Position (doctoral student – professor)
Experience of use       Research experience in years
Voluntariness of use    Does your university mandate depositing a copy of your research publications in a publication archive?

Table 1: Demographic section of the survey form
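The Table 1 mapping can be expressed as a simple lookup, for example when coding responses back to the UTAUT moderators during analysis. The pairing below follows the row order of the table, and the function name is ours, not part of the survey instrument:

```python
# Illustrative sketch: UTAUT moderating factors mapped to the demographic
# survey questions that operationalize them (pairing per Table 1).
MODERATOR_QUESTIONS = {
    "gender": "Gender",
    "age as researcher": "Position (doctoral student - professor)",
    "experience of use": "Research experience in years",
    "voluntariness of use": ("Does your university mandate depositing a copy "
                             "of your research publications in a publication archive?"),
}

def question_for(moderator: str) -> str:
    """Return the survey question that operationalizes a moderating factor."""
    return MODERATOR_QUESTIONS[moderator.lower()]
```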

3.1 Determinants of user acceptance and behaviour

Factors depending on the social environment:

• Social influence of fellow researchers and organization
• Policy of funding bodies, university organization and officials
• Differences in patterns of work and changing contextual factors
• Facilitating conditions (organizational and technical infrastructure to support use of the system)

Personal factors of the researcher:

• Performance expectancy (expected gains in research performance, personal merits)
• Effort expectancy (expected ease of use of a system) – not reflected in the survey

3.2 Examples of social influence

How does your research community or fellow researchers react to open access publishing? Please indicate on a scale from 1-5 how well you find that the following statements reflect your opinion. 1 = I totally agree and 5 = I totally disagree

• Researchers that are important to me tend to have a copy of their publications on their home pages
• I can find publications on my research topic openly on the web
• My fellow researchers ask me to publish copies of my research papers if I do not have them publicly available in full text

3.3 Examples of differences in patterns of work and changing contextual factors

What are in your opinion the main reasons to publish in an open access publication archive of your university?

• By submitting my publication into the open access publication archive of my university I can reach a broader audience
• People interested in my research ask me to have my research available on the Internet

What are in your opinion the main reasons to publish in an open access journal?


• Open access journals reach a broader audience and especially professionals that do not have access to databases in the university libraries
• I can choose open access journals of good standard in my research field
• My research community favours publishing in open access journals

3.4 Examples of policy of funding bodies, university organisation and officials

• My research funders recommend or require me to make my research results freely available to the public
• My university recommends or requires open access publishing in the publication archive of the university
• My research funders recommend or require me to publish my research in an open access journal when possible
• My university recommends or requires me to publish my research in an open access journal when possible

Free comments are encouraged.

3.5 Facilitating conditions (organizational and technical infrastructure to support use of the system)

What are in your opinion the main barriers to publishing in an open access publication archive managed by your university? You can choose several alternatives from the list below.

• I do not know if my university has an open access publication archive
• I do not know how to submit a copy of a published article
• I believe that copyright issues are difficult to cope with
• I do not know which version of my article I am allowed to submit

Free comments are encouraged.

3.6 Personal factors of the researcher

Performance expectancy

How well do you find that open access journals meet the following criteria? Please indicate on a scale from 1-5 how well you find that the following statements reflect your opinion. 1 = I totally agree and 5 = I totally disagree.

• They provide accessibility to the right focus groups
• They increase visibility
• The speed of publishing is increasing
• The quality and impact factors meet the standard of traditional journals
• The peer review is of good quality
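Since the scale runs from 1 = totally agree to 5 = totally disagree, lower item means indicate stronger endorsement. A minimal sketch of how such responses could be tallied, with an optional reverse-coding step so that higher scores mean stronger agreement (the function names are ours, not part of the project's analysis plan):

```python
from statistics import mean

def item_means(responses):
    """Mean score per survey item; `responses` is a list of dicts mapping
    item name -> rating on the 1-5 scale (1 = totally agree)."""
    items = responses[0].keys()
    return {item: mean(r[item] for r in responses) for item in items}

def reverse_code(rating, scale_max=5):
    """Flip a rating so that higher = stronger agreement (5 -> 1, 1 -> 5)."""
    return scale_max + 1 - rating
```

With two illustrative respondents, `item_means([{"visibility": 1}, {"visibility": 2}])` gives a mean of 1.5 for that item, i.e. strong agreement on the original scale.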

4. Discussion

This conference paper is part of a project on open access institutional repositories carried out at the Swedish School of Economics and Business Administration. The framework presented in the theoretical part therefore took into account discipline-specific factors identified in earlier research. The research questions involved modelling the constructs of researchers' attitudes to open access publishing and institutional archives for an empirical survey, in practice a web survey form to be directed to researchers as end-users of open access services.

The survey form was tested on two separate occasions: first, to get a picture of the methodological soundness of the constructs depicted in the questions, and second, to get feedback on what inconsistencies and difficulties there might be in actually answering the survey. To test the methodological soundness, a presentation was given to a group of researchers and doctoral students from the Swedish School of Economics and Business Administration. It became clear that the concepts of open access publishing and institutional archives were not familiar to the audience; explanations and definitions were therefore added to the survey form. The researchers' main concern was where to publish (in which journal) and in what type of publication (book chapters, journals etc.), not primarily in open access format. However, the factors depending on the social environment were seen as relevant for publishing practices, and gaining merit for a future research career was also important.

The survey form was also sent to a group of eight test persons (researchers and doctoral students), who were asked to fill in and return the form and to comment on problems and design features as well as on the relevance of the questions. The test respondents provided good comments and suggestions for improvement. The main structure and the questions in the survey were found rather easy to fill in and submit, and the survey contained a suitable number of questions and did not take too long to complete. From the test answers that were collected we could conclude that the main reason to publish in an open access journal was to reach a broader audience, including professionals interested in one's research. One of the main reasons to publish in a university publication repository was likewise to reach a broader audience; the other main reason among test respondents was that people interested in a person's research results had asked to have them published on the web. The main barrier to publishing in an open access journal was that the researcher's department did not consider open access journals meriting for a research career. The main barrier to publishing in a university publication archive was that copyright issues were considered difficult to cope with.

The project group developing and managing the open access publication archive named "DHanken" was also asked to comment on the survey questions. Some clarifications and improvements were suggested for single questions, but on the whole the feedback was positive. Many of the comments were on a general level, pointing to the fact that open access publishing might not be very well known among researchers; therefore some definitions of concepts were added to the survey form. Based on the comments from the test of the survey form we were able to develop an updated version. The survey will now be sent out to researchers in business schools in Finland, and the initial results will be collected and analysed.

5. References

[1] Nicholas, D. and Rowlands, I. Open access publishing: The evidence from the authors. The Journal of Academic Librarianship 2005; vol. 31(3), pp. 179-181.
[2] Nicholas, D., Huntington, P. and Rowlands, I. Open access journal publishing: the views of some of the world's senior authors. Journal of Documentation 2005; vol. 61(4), pp. 497-519.
[3] Swan, A. and Brown, S. Authors and open access publishing. Learned Publishing 2004; vol. 13(3), pp. 219-224.
[4] Schroter, S., Tite, L. and Smith, R. Perceptions of open access publishing: interviews with journal authors. British Medical Journal 2005; vol. 330, p. 756. Published 26 January 2005.
[5] Thomas, C. and McDonald, R. H. Measuring and comparing participation patterns in digital repositories: Repositories by the numbers, Part 1. D-Lib Magazine 2007; vol. 13(9/10). doi:10.1045/september2007-mcdonald
[6] McDowell, C. S. Evaluating institutional repository deployment in American academe since early 2005: Repositories by the numbers, Part 2. D-Lib Magazine 2007; vol. 13(9/10). doi:10.1045/september2007-mcdowell
[7] Davis, P. M. and Connolly, M. J. L. Institutional repositories: Evaluating the reasons for non-use of Cornell University's installation of DSpace. D-Lib Magazine 2007; vol. 13(3/4). doi:10.1045/march2007-davis
[8] Whitley, R. The intellectual and social organization of the sciences. London: Clarendon Press, 1984.
[9] Fry, J. Scholarly research and information practices: a domain analytic approach. Information Processing and Management 2006; vol. 42, pp. 299-316.
[10] Fry, J. and Talja, S. The intellectual and social organization of academic fields and the shaping of digital resources. Journal of Information Science 2007; vol. 33(2), pp. 115-133.
[11] Hedlund, T. and Roos, A. Open access publishing as a discipline-specific way of scientific communication: the case of biomedical research in Finland. In Advances in Library Administration and Organization 25. Elsevier book series, 2007.
[12] Venkatesh, V., Morris, M., Davis, G. B. and Davis, F. D. User acceptance of information technology: Toward a unified view. MIS Quarterly 2003; vol. 27(3), pp. 425-478.


The IDRC Digital Library: An Open Access Institutional Repository Disseminating the Research Results of Developing World Researchers

Barbara Porrett
Research Information Management Services Division, International Development Research Centre (IDRC CRDI)
P.O. Box 8500, Ottawa, ON K1G 3H9, Canada
e-mail: bporrett@idrc.ca
web: http://www.idrc.ca/

Abstract
The International Development Research Centre (IDRC) has recently launched the OAI-PMH compliant IDRC Digital Library (IDL), a DSpace institutional repository. The digital library has been developed to enhance the dissemination of research outputs created as a result of Centre-funded research. The repository has a number of unique qualities: it is the public bibliographic database of a Canadian research funding organization, its subject focus is international development, and its content is retrospective, dating back to the early 1970s. Intellectual property issues have been a major factor in the development of the repository. Copyright ownership of the majority of IDL content is held by developing world institutions and researchers, and the digitization of content and its placement in the open access IDL has involved obtaining permissions from hundreds of copyright holders located in Africa, Asia and Latin America. IDRC has determined that obtaining permissions and populating the repository with developing world researchers' outputs will help to improve scholarly communication mechanisms for Southern researchers. The expectation is that the IDL will contribute to bridging the South to South and South to North knowledge gap, serving as a dissemination channel that improves the visibility, accessibility and research impact of southern research.

Keywords: developing world research; institutional repository; open access; DSpace; IDRC Digital Library; International Development Research Centre

1. Introduction

The subject of this presentation is an institutional repository called the IDRC Digital Library [1]. The repository has several unique qualities that distinguish it from other DSpace institutional repositories now accessible on the Internet. It is the repository of a research funding organization, it serves as the organization's public bibliographic database for the dissemination of funded research outputs and public corporate documents, and its content is retrospective, dating back to the early 1970s; as a result, its development and management have presented some significant intellectual property issues. Notwithstanding these and other challenges, the IDRC Digital Library is developing into a significant resource, sharing the research results of developing world researchers with the international research community.

IDRC stands for the International Development Research Centre, a Canadian Crown corporation that works in close collaboration with researchers from the developing world in their search for the means to build healthier, more equitable, and more prosperous societies. The creation of IDRC was Canada's response to a climate of disillusionment and distrust that surrounded foreign aid programs during the late 1960s. Maurice Strong and others urged the then Canadian Prime Minister Lester B. Pearson to establish a "new instrument" to provide forward-thinking approaches to international challenges that could not be addressed by way of more conventional programs. This led to the establishment of the world's first organization devoted to supporting research activities as defined by developing countries. IDRC's objectives, as stated in the International Development Research Centre Act of 1970, are "… to initiate, encourage, support, and conduct research into the problems of the developing regions of the world and into the means of applying and adapting scientific, technical, and other knowledge to the economic and social advancement of those regions." IDRC is guided by a 21-member international Board of Governors and reports to the Canadian Parliament through the Minister of Foreign Affairs. In 2007/08, IDRC received CA$135.3 million in funding from the Parliament of Canada.

2. IDRC and the Dissemination of Funded Research Results

IDRC has, from the outset, placed a great deal of importance on the sharing of the research outputs that are created as a result of Centre-funded research. Although copyright ownership of the outputs has always remained with funding recipients, it has been a condition of funding that IDRC retains the ability to disseminate the research outputs supported by Centre funding. An archive of these outputs has been maintained since 1970, originally on paper but now increasingly in digital format. Bibliographic management of this archive has been done through a library catalogue and, more recently, through an online public access catalogue accessible to the research community on the IDRC web site. In an effort to enhance the dissemination of these research outputs and to provide an improved scholarly communication mechanism for Centre-funded researchers, it was decided in the fall of 2005 to explore the possibility of building an Open Archives Initiative (OAI) compliant institutional repository. Under the guidance of a Steering Committee, a Stakeholders Committee and a policies and governance document [2], a project team of two librarians and a systems analyst undertook the initiative. In April 2007 a DSpace open access institutional repository, called the IDRC Digital Library or the IDL, was launched.
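Because the IDL is OAI-PMH compliant, its metadata can be harvested using the protocol's standard verbs over plain HTTP GET requests. A minimal sketch of building such a request follows; the base URL is a placeholder, not the IDL's actual endpoint:

```python
from urllib.parse import urlencode

def oai_request(base_url, verb, **kwargs):
    """Build an OAI-PMH request URL, e.g. ListRecords asking for
    Dublin Core (oai_dc) metadata from a repository endpoint."""
    params = {"verb": verb, **kwargs}
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint for illustration only.
url = oai_request("http://idl.example.org/oai/request",
                  "ListRecords", metadataPrefix="oai_dc")
```

A harvester would then fetch this URL and page through results using the `resumptionToken` the repository returns.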

3. Content of the IDRC Digital Library

The IDL provides access to information about the IDRC research output archive dating back to the Centre's beginnings. The database holds 34,000 Dublin Core metadata records, approximately 30% of which provide links to digital full text. The subject coverage reflects the international development focus of IDRC research funding, with strong representation from the sciences and social sciences. The subject areas of research supported by the Centre have changed over time; research funding currently focuses on five themes: Environment and Natural Resource Management; Information and Communication Technologies for Development; Innovation, Policy and Science; Social and Economic Policy; and Global Health Research. On average, slightly over 500 IDRC-funded research projects are active at any point in time, and approximately 750 research outputs are added to the archive each year.

4. Audience of the IDRC Digital Library

The primary audience of the repository is the international research community. This includes researchers, applicants for IDRC funding, donor agencies, policy makers, and development practitioners. The repository's purpose is to share the research results of developing world researchers, to facilitate the discovery of research literature in the fields of international development, and to identify researchers, research institutions and civil society organizations that have undertaken research in those fields. The IDL not only enhances the public accountability and transparency of IDRC-funded research, but also demonstrates the Centre's commitment to the global "public good" contribution of the research it supports. It also ensures that the research results will be freely accessible in order to contribute to the public debate on development issues for public benefit. The research literature in the IDL can be accessed and used for non-commercial purposes in accordance with a definition of open access based on the Budapest Open Access Initiative.

The expectation is that the IDL will contribute to bridging the South to South and South to North knowledge gap; these channels of scholarly communication and scholarly publishing are less heavily traveled than the North to North and North to South [3]. The IDL will serve as a dissemination channel that improves the visibility, accessibility and research impact of southern research.

5. Focus of the Presentation

This presentation will focus on three aspects of the IDRC Digital Library: IDL content and how it will continue to develop, the copyright permission challenges presented by the repository's retrospective content, and IDL services. Evidence of use of the repository will also be discussed.

6. Development of IDRC Digital Library Content

The bulk of the content disseminated by the IDL is in the form of final technical reports, which present the research results produced by Centre-funded researchers and are submitted by funding recipients as a requirement of research funding. The IDL also disseminates books published by the Centre, documents and other writings by staff and IDRC governors for and about IDRC, as well as other substantial works related to the Centre's programs, projects and activities. This second category, or collection, represents about 25% of the digital library's content. These two collections have historically been housed in the Centre's Library and managed in the library catalogue. The Library's catalogue records are the source of the majority of the digital library's metadata; these were mapped and migrated from the Library's online public access catalogue, or OPAC, into MIT's DSpace. This kind of undertaking can be perilous, even under the best of circumstances. The library OPAC software, called MINISIS, was home-grown, originally designed to be used by developing world libraries. The non-standard, non-MARC record structure of the MINISIS bibliographic records presented significant challenges; further, the record content and database structure had changed over time. Migrating this content into DSpace was much like opening a Pandora's box. Countless unanticipated challenges had to be overcome, but after a great deal of problem solving, some metadata field customization and two migration attempts, an IDL database with an acceptable level of integrity has emerged. The IDL now serves as the Centre's public bibliographic database.
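A migration of this kind amounts to a field-by-field crosswalk from the legacy records into DSpace's Dublin Core schema. The sketch below illustrates the idea; the MINISIS field names are invented for illustration, as the paper does not document the actual mapping:

```python
# Illustrative crosswalk only: the real MINISIS field names and the full
# mapping used in the IDL migration are not documented here.
MINISIS_TO_DC = {
    "TITLE": "dc.title",
    "AUTHOR": "dc.contributor.author",
    "PUBYEAR": "dc.date.issued",
    "ABSTRACT": "dc.description.abstract",
}

def crosswalk(record: dict) -> dict:
    """Map one legacy catalogue record to Dublin Core fields, dropping
    fields with no mapping (in practice these needed manual handling)."""
    return {MINISIS_TO_DC[k]: v for k, v in record.items() if k in MINISIS_TO_DC}
```

Fields outside the mapping are silently dropped here; in a real migration those are exactly the records that demand the "metadata field customization" described above.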

7. Submission Process of the IDRC Digital Library

The submission and metadata creation process for the IDL is centralized in the IDRC Library. The Centre's research subject specialists, called program officers, review final technical reports received from research project recipients. Although the review is not a true peer review, it can lead to redrafting of the reports by the funding recipients to ensure that they meet Centre funding requirements. Once a report is finalized, the program officer determines whether it is eligible for public dissemination in the IDL. For example, reports containing politically sensitive or patentable information are not added to IDL holdings. Similarly, if a researcher requests that their research results not be placed in the IDL (because, for example, they have published or wish to publish with a publisher that does not permit dissemination on an OA platform), the report will not be added. A Centre-funded researcher's ability to choose not to have their final technical reports published in the IDL results from the fact that the contractual agreement for research funding has only just been modified to make provisions for OA and digital dissemination of final technical reports. Researchers whose projects were approved after January 2008 will submit their outputs to their program officers in digital format and will


have granted the Centre permission to disseminate their funded research results in the OA IDL. A soon-to-be-released IDRC publishing policy reiterates these new conditions of IDRC funding. The impact of this contractual and policy change is that submission of final technical reports to the IDL will be mandatory. However, the implications will not be seen by the IDRC Library for a number of months, because outputs are received only after the research has been completed. If a report is destined for the digital library, the program officer determines where it will be placed within the IDL's browsing structure, that is, in which collection of his or her DSpace community. Incidentally, the community and collection structure of the IDL was created in collaboration with the Centre's programming staff; this approach to the development of the IDL browsing structure is an example of how the Library has attempted to share ownership of the IDL with the Centre's program branch. The researchers and the program officers are asked to provide uncontrolled vocabulary, or keywords, for the report's IDL metadata, in the hope that keywords recommended by the subject specialists and/or the researchers will help to enhance the retrievability of the digital library's content. Four pieces of information (an indication that the report is destined for the IDL, the appropriate collection name, keywords, and the report itself) are emailed by the program officer to a records management staff member and placed in the Centre's digital records management system by records staff. This information is transferred manually into the IDL by a library cataloguing technician, who completes the metadata creation and submission process. Additional subject description is added to metadata records using the OECD Macrothesaurus.
Automating this process to enable migration of this information from the records management system to the digital library is planned.
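The planned automation might model the four pieces of information as a single record that is validated before transfer. This is a hypothetical sketch; the class and field names are invented, not part of the actual system.

```python
# Hypothetical sketch of the records-management-to-IDL hand-off: the four
# pieces of information the program officer supplies are modelled as one
# record and checked for completeness before transfer.
from dataclasses import dataclass, field

@dataclass
class IDLSubmission:
    destined_for_idl: bool
    collection: str            # e.g. a DSpace collection name
    keywords: list = field(default_factory=list)
    report_file: str = ""      # path or records-system identifier

    def ready_for_transfer(self):
        """A submission can move to the IDL only if it is flagged for
        the IDL and names both a collection and a report file."""
        return (self.destined_for_idl
                and bool(self.collection)
                and bool(self.report_file))

sub = IDLSubmission(True, "Agriculture and Environment",
                    ["water harvesting"], "final_report.pdf")
```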

8. IP Issues and the IDL

Seventy percent of the outputs described by IDL metadata are in paper format and, as mentioned earlier, the copyright in funding-recipient-created research outputs is held by the researchers. To comply with Canadian copyright law, permission must be obtained from the copyright holders before the outputs can be converted from paper to digital form and made accessible through the open access digital library. This leads to the subject of copyright permissions and digitization. Developing-world researchers have encountered, and continue to face, barriers to the publication of their work. To ensure that IDRC research funding does not further impede efforts to publish, the contract between the Centre and its researchers places copyright ownership of final technical reports with the funding recipient. Obtaining permission to digitize final technical reports and place them in the IDL has been the full-time occupation of a library staff member since the fall of 2006. To date, approximately 450 copyright holders have been contacted and asked to complete and sign a license granting IDRC permission to digitize their research results and place them in the IDL. Many of the copyright holders are developing-world institutions that hold the copyright of numerous works. The success rate in obtaining permissions is in the 65% range. It has not been difficult to obtain permission to digitize and place Centre-funded outputs in the IDL when it was possible to contact the copyright holder. For the most part, the copyright ownership of the outputs has not been transferred to publishers and, with just a few exceptions, copyright holders were willing to grant permission. Copyright holders have not made requests for further information about open access. How this should be interpreted is not clear; however, the correspondence requesting permissions has been carefully drafted in an effort to ensure that its intent is not misunderstood.
Recipients have asked to be notified when their outputs become accessible in the IDL, in one case because they planned to place their digitized research results on their own web site. The impediments to obtaining permissions can be summarized as follows: the copyright holder is deceased, the research institution that received the research funding no longer exists, it was not possible to identify the



copyright holder, or a reply to correspondence requesting permission simply has not been received. The Library has the capacity to continue to request permissions from copyright holders and to undertake in-house digitization with the objective of expanding the digital content in the IDL. However, not surprisingly, it has been difficult to locate the copyright holders of many older outputs. Regrettably, it is unlikely that the IDL will be able to disseminate the digital full text of all the final technical reports that its metadata describes.
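The permission outcomes described above lend themselves to simple tallying. The counts below are invented sample data, chosen only so that the success rate matches the roughly 65% figure quoted.

```python
# Sketch (with invented sample data) of tallying permission-request
# outcomes of the kinds described above.
from collections import Counter

outcomes = (["granted"] * 13 + ["no reply"] * 4 +
            ["institution defunct"] * 2 + ["holder unidentifiable"] * 1)

counts = Counter(outcomes)
success_rate = counts["granted"] / len(outcomes)   # 13/20 = 65%
```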

9. IDRC Digital Library Services

This means that IDL metadata will continue to describe final technical reports that are not delivered digitally by the IDL. In an effort to enable users to access these outputs, the IDRC Library offers a document delivery service, and users are invited to enquire about options for accessing the research results. IDL users can, of course, visit the IDRC Library in Ottawa, but this is not a practical choice for many researchers. The Centre's contractual agreement with recipients funded after February 2004 enables the Library to digitize an output and make it available on the IDRC web site, though not in the OA IDL. In cases where IDRC cannot obtain permission from copyright holders to disseminate their outputs and the project contract predates February 2004, the Centre may rely, to a limited extent, on the so-called 'fair dealing' exception under Canada's Copyright Act. This exception provides only a very narrow exclusion, allowing a library to copy and distribute a portion of a work without infringing copyright in that work; the library must be satisfied that the use of the work will be for research or private study. The law does not set clear limits on what portion of a work may be copied under the fair dealing exception, but what is clear is that copying an entire work would not be permissible. This is not an ideal situation, but the document delivery staff do their best to meet the information needs of requesters. Another service being explored by the IDL is the hosting of works authored by developing-world researchers who are not IDRC funding recipients. A Centre-funded project has developed a research methodology that is being applied by non-Centre-funded researchers in the developing world, and the project's lead researcher recognized the value of managing and disseminating the results of this disparate group of researchers.
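The delivery options described above amount to a small decision procedure. The sketch below uses illustrative names and a deliberately simplified reading of the policy; the real rules have more nuance.

```python
# Sketch of the access-route logic described above. Function and return
# values are invented for illustration.
from datetime import date

def access_route(has_idl_permission, contract_date, is_research_use):
    """Decide how the Library could deliver a requested output."""
    if has_idl_permission:
        return "open access via IDL"
    if contract_date >= date(2004, 2, 1):
        # Post-February-2004 contracts allow digitization for the IDRC
        # web site, though not for the OA IDL.
        return "digitize for IDRC web site"
    if is_research_use:
        # Fair dealing: only a portion of the work, never the whole.
        return "deliver a portion under fair dealing"
    return "not deliverable"

route = access_route(False, date(2003, 6, 1), True)
```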
A partnership was established to create a DSpace community that makes the research results openly accessible through the IDL. A service agreement addressing issues such as content review, intellectual property, metadata creation and termination of the collaboration was developed to formalize the partnership. This led to the creation of the Social Analysis Systems2 (SAS2) Community [4] in the IDL. This community not only disseminates developing-world researchers' work, but also facilitates the aggregation of a body of knowledge.

10. Integrating the IDRC Digital Library into Other Centre Systems

The IDL has been designed to integrate with a suite of other repositories of information created by IDRC. For example, as described earlier, final technical reports are filed in the records management system; the documents and their skeletal metadata are reused in the digital library, which eliminates the need for submission to both the records system and the IDL. Further, IDL content can be accessed through the Centre's project database, called IDRIS+ [5], and its persistent URLs are widely used in the IDRC web content management system. Integration enables reuse of the IDL's content through other IDRC systems and will help to ensure long-term funding and survival of the repository.

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008


Barbara Porrett

11. Use of the IDRC Digital Library as a Research Resource

Preliminary data indicate that the IDL is on its way to accomplishing its objective of contributing toward bridging the South-to-South and South-to-North knowledge gaps. The context of these data is as follows. The absence of links to digital full text in 70% of the IDL's metadata records has made it possible to gather some information about who is using the IDL as a research resource: users are contacting the IDRC Library to enquire about receiving the full text of outputs that are described but not delivered digitally. The majority of the requests are received by email. Although it is not always possible to confirm that a requester originates from the developing world (many developing-world researchers are studying and working in developed-world institutions and organizations), these requests for full text reveal the following. The total number received between July 2007 and mid-April 2008 was 96. The mailing address and/or signature indicate that 53, approximately 55% of the total, were writing from the South. The majority of these came from Africa and Latin America; a smaller number were received from India, Vietnam and Cambodia. The origin of the remaining 45% was, in order of frequency, Canada, the U.S., the UK, and France. The IDL's server log files are not available for analysis at this time. However, the DSpace application provides a statistical summary that reveals some interesting information about the system's use. Since data collection began in November 2007, the IDL has been searched an average of 16,000 times per month, an average of 81,000 items have been viewed and an average of 35,500 bitstreams, or digital files, have been accessed each month. The words searched by IDL users are also noteworthy: French, Spanish and English are equally well represented among the terms being used.
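The request figures quoted above can be checked arithmetically:

```python
# The full-text request figures quoted in the text, July 2007 to
# mid-April 2008.
total_requests = 96
from_south = 53

south_share = from_south / total_requests              # about 0.552
north_share = (total_requests - from_south) / total_requests
```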
Although the presence of French and English is not surprising, the numerous terms in Spanish may indicate that the IDL has caught the attention of Latin American researchers. Search terms such as reformas, gouvernance, poverty, tecnológicas, rurale and policy, as well as developing-world geographic locations, are high on the list of frequently searched words. All of these terms reflect the research areas funded by IDRC, and they suggest a strong potential that searchers' information needs are being met by the IDL.

12. Conclusion

By way of conclusion, I would like to note that IDRC is the first Canadian research funding organization to build an OAI-PMH compliant institutional repository to disseminate its funded research results. It was the vision of Marjorie Whalen, the IDRC Library director, that led to the creation of an IDRC institutional repository to enhance the dissemination of southern researchers' research results. The experience of the project shows that challenges, some expected, others not, were inevitable but not insurmountable, and the collaborative nature of this undertaking has been enriching for all of us at the Centre. But content development of the IDL is far from complete: obtaining consent from copyright holders to distribute their works through the IDL will remain a high priority for some time. This is consistent with the belief at IDRC that open access will lead to the maximization of the societal benefits of investment in research. To close, I would like to share the following comments, which were sent to us by a researcher: "Thank you for your email request for permission to include my works in the IDRC Digital Institutional Repository. I am a firm believer in the universal right of the people of the World to have free access to knowledge. Especially when the knowledge created is a result of communal effort as is the case for all IDRC projects."



The IDRC Digital Library

13. References

[1] The International Development Research Centre Digital Library / La Bibliothèque numérique du Centre de recherches pour le développement international. URL: http://idl-bnc.idrc.ca
[2] WHALEN, M. IDRC Digital Library Policies and Governance. 2007. URL: http://idl-bnc.idrc.ca/dspace/handle/123456789/35334
[3] CHAN, L.; KIRSOP, B.; ARUNACHALAM, S. Open Access Archiving: the fast track to building research capacity in developing countries. SciDevNet, February 11, 2005. URL: http://www.scidev.net/en/features/open-access-archiving-the-fast-track-to-building-r.html
[4] Social Analysis Systems2 (SAS2) / Les Systèmes d'analyse sociale2 (SAS2). URL: http://idl-bnc.idrc.ca/dspace/handle/123456789/34882
[5] IDRIS+. URL: http://idris.idrc.ca/app/Search


Keyword and Metadata Extraction from Pre-prints

Emma Tonkin (UKOLN, University of Bath, UK; e-mail: e.tonkin@ukoln.ac.uk)
Henk L. Muller (University of Bristol, UK; e-mail: henkm@cs.bris.ac.uk)

Abstract

In this paper we study how to provide metadata for a pre-print archive. Metadata includes, but is not limited to, title, authors, citations, and keywords, and is used both to present data to the user in a meaningful way and to index and cross-reference the pre-prints. We are particularly interested in studying different methods of obtaining metadata for a pre-print. We have developed a system that automatically extracts metadata and that allows the user to verify and correct the metadata before it is accepted by the system.

Keywords: metadata extraction; Dublin Core; user evaluation; Bayesian classification

1. Introduction

There are two methods for obtaining metadata: it can be mechanically extracted from the pre-print, or a person (for example, the author or a digital librarian) can enter it manually. The former approach, automated metadata generation, has attracted a great deal of attention in recent years, particularly for the role it is expected to play in reducing the metadata generation bottleneck [1], that is, the difficulty of producing metadata in a timely manner. Much of this interest arises from prior work in machine-aided indexing or automated indexing, that is, software-supported or entirely software-driven indexing approaches. The difference between machine-aided or automated indexing and automated metadata generation or extraction is, as the authors see it, simply that the metadata is here an end in itself: we aim to emulate well-formed metadata generation, and do not concern ourselves greatly with the subsequent question of evaluating the usefulness of this metadata for a given purpose. Greenberg et al. [2] describe two primary approaches to metadata generation, stating that researchers have experimented primarily with document structure and knowledge representation systems. Document structure involves the use of the visual grammar of pages, for example making use of the observation that title, author(s) and affiliation(s) generally appear in content header information. Such metadata can be extracted by various means, for example using support vector machines over linguistic features [3], a variable hidden Markov model [4], or a heuristic approach [5]. The authors of [6] describe an approach that primarily uses formatting information, such as font size, as features, and makes use of the following models: Perceptron with Uneven Margins, Maximum Entropy (ME), Maximum Entropy Markov Model (MEMM), Voted Perceptron (VP), and Conditional Random Fields (CRF). They note that an advantage of an approach that primarily makes use of visual features is the ease of generalisation to documents in languages other than English; this approach, however, focuses solely on the problem of extracting the document title. The relevance of knowledge representation systems for Greenberg et al. is the increasing availability of resources that can be useful to the process of metadata generation, or indeed the harvesting of existing
[6] describe an approach that primarily utilizes formatting information such as font size as features, and makes use of the following models: Perceptron with Uneven Margins, Maximum Entropy (ME), Maximum Entropy Markov Model (MEMM), Voted Perceptron Model (VP), and Conditional Random Fields (CRF): they note that an advantage of an approach that primarily makes use of visual features is the ease of generalisation to documents in languages other than English. This approach, however, focuses solely on the problem of extracting the document title. The relevance of knowledge representation systems for Greenberg et al is the increasing availability of resources that can be useful to the process of metadata generation, or indeed the harvesting of existing Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008



metadata registries; this is primarily of use in post-processing or enhancement, although such knowledge bases additionally provide a useful resource under many circumstances. For example, an authoritative but incomplete author name database can be used firstly for automatic name authority control, and secondly as an excellent basis for training supervised machine learning systems to detect fields containing author names. The issue of post-processing is, however, out of the scope of this paper, and will be referred to only briefly. Recent work on the Semantic Web and on classification and knowledge management has focused on the extent to which these methods lead to equivalent or stable results. Whilst the two approaches may have compatible outcomes in terms of the type of metadata output, they depend upon very different underlying mechanisms. Factual metadata such as title and author is usually unambiguous, but other metadata, such as keywords for classification, is of an interpretative nature. User-entered classifications can be seen as based around a set of prototype concepts [7,8], whereas mechanically generated classifications are generally built around an identified set of features. The features used by the mechanical system are meant to form a basis for making judgements similar to those given by a human, and hence are intended to emulate the set of concepts recognised by the user; but they are in practice quite different, for they are based around a range of heuristics or learnt statistical measurements rather than a deeper understanding of the information within the data object. Because of this difference, care must be taken to ensure that the judgements are compatible, typically by choosing supervised methods that may be trained and verified against reference data (ground truth).
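The name-authority idea above can be illustrated with a deliberately naive matching rule (surname plus first initial). The authority entries and names here are invented.

```python
# Sketch of name authority control: an (incomplete) authority list is
# used to normalise author-name variants. The matching heuristic is
# deliberately naive and would misfire on many real names.
def name_key(name):
    """Reduce 'Smith, John' or 'J. Smith' to a ('smith', 'j') key."""
    name = name.replace(".", " ").replace(",", " ")
    parts = [p.lower() for p in name.split() if p]
    if len(parts) < 2:
        return tuple(parts)
    # Heuristic: take the longest token as the surname.
    surname = max(parts, key=len)
    initial = next(p[0] for p in parts if p != surname)
    return (surname, initial)

AUTHORITY = {("smith", "j"): "Smith, John"}   # invented authority record

def normalise(name):
    return AUTHORITY.get(name_key(name), name)
```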

2. Available metadata

An electronic copy of a document is potentially a rich source of metadata. Some of the metadata is presented in an obvious manner to the reader, for example the title of a document, the number of pages and the authors. Other metadata is less obviously visible. Attributes of the eprint such as format (intrinsic document properties) can be automatically detected with ease [9]. The class of a document (that is, whether it has been peer-reviewed, and whether it appeared as a conference paper, article, journal article, technical report or PhD/Master's thesis) is often unclear. The theme, subject matter and contributions contained within the document should be visible within the text, for this is after all the rationale for making the document available at all, but a great deal of domain knowledge may be required to extract such information and recognise it for what it is. We focused on five general structures that can be examined in order to extract metadata:

The document may have structure imposed on it in its electronic format. For example, from an HTML document one can extract a DOM tree, and find HTML tags such as <TITLE>.

The document may have a prescribed visual structure. For example, Postscript and PDF specify how text is to be laid out on a page, and this can be used to identify sections of the text.

The document may be structured following some tradition. For example, it may start with a title, then the authors, and end with a number of references.

Documents that are interlinked via citation linking or co-authorship analysis may be analysed via bibliometric methods, making available various types of information.

The document will have linguistic structure that may be accessible. For example, if the document is written in English, the authors may “conclude that xxx .”, which gives some meaning to the words between the conclude and the full stop.
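The linguistic cue in the last item can be illustrated with a simple pattern match. A real system would need proper sentence segmentation, so this regex is only a sketch.

```python
# Sketch of the "conclude that ... ." cue described above: pull out the
# claim between "conclude that" and the next full stop.
import re

CONCLUDE = re.compile(r"\bconclude that\s+(.+?)\.", re.IGNORECASE)

def find_conclusions(text):
    """Return the spans following 'conclude that', up to a full stop."""
    return [m.group(1).strip() for m in CONCLUDE.finditer(text)]

claims = find_conclusions(
    "We ran both systems. We conclude that visual features generalise well."
)
```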


There exist in practice a huge number of features by which to describe a complex object such as an eprint. Readers effortlessly identify and use relevant subsets and combinations of these on a daily basis, but not all of those features are actually intrinsic to the document or to the specific instance of the document (the file).

2.1 Formatting structure

Certain document types contain structural elements with relatively clear or explicit semantics. One of the potential advantages of a language like HTML, which stresses document structure, over a language such as Postscript, which stresses document layout, is that given the document structure it is potentially feasible to mechanically infer the meaning of parts of the document. Indeed, if HTML is used according to modern W3C recommendations, it contains only structural information, with all design information contributed via CSS; this process of divorcing design from content began in the HTML 4.0 specification [10]. Under these circumstances, a large amount of information can potentially be gained by simply inspecting the DOM tree. For example, all headers H1, H2, H3, ... can be extracted and used to build a table of contents of the paper and to find the titles of sections and subsections. Similarly, the HEAD section can be dissected in order to extract the title of a page, although this may not contain the title of the document. However, given that there are multiple ways in HTML to achieve the same visual effect, the use of the tags given above is not enforced. Many WYSIWYG tools use alternative means to produce a similar visual impression, for example generating a <P class='header2'> tag rather than an H2 tag. Since the semantics of these alternatives are less clear, this makes extraction of data from HTML pages difficult in practice. A technical report by Bergmark [5] describes the use of XHTML as an intermediate format for the processing of online documents into a structure, but concedes that most HTML documents are 'not well-formed and are therefore difficult to parse'; translation of HTML into XHTML resolves a proportion of these difficulties, but many documents cannot be parsed unambiguously into XHTML. A similar approach is proposed by Krause [11]. In this paper we ignore any such markup, and focus on documents that are not presented in a structured language. On examination of Bergmark's metadata extraction algorithm, it seems likely that robust metadata extraction from XHTML makes relatively little use of formatting information.
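For well-formed HTML, the DOM-based extraction described above can be sketched with the standard-library parser. As noted, this breaks down whenever tools emit presentational markup instead of H1-H3 tags.

```python
# Sketch of heading extraction from well-formed HTML: collect H1-H3
# headings to build a table of contents.
from html.parser import HTMLParser

class TocExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_heading = None
        self.toc = []          # list of (level, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.in_heading = int(tag[1])

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self.in_heading = None

    def handle_data(self, data):
        if self.in_heading and data.strip():
            self.toc.append((self.in_heading, data.strip()))

parser = TocExtractor()
parser.feed("<h1>Title</h1><p>Intro text.</p><h2>Methods</h2>")
```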

Figure 1: Visual structure of a scientific paper



2.2 Visual structure

In contrast to HTML, other methods of presenting documents often prescribe visual structure rather than document structure. For example, both Postscript and PDF specify symbol or word locations on a page, and the document consists of a bag of symbols or words at specific locations. Document structure may be inferred from symbol locations: a group of letters placed close together is likely to be a word, and a group of words at the same vertical position on the page may be part of a sentence in a western language. The disadvantage of these page description languages is that there are multiple ways to present text; for example, text can be encoded in fonts with bespoke encodings, where the encoding itself has no relation to the characters depicted and it is the shape of the character that conveys the meaning. In circumstances like this it is very difficult to extract characters or words, but the visual structure itself can still be used to identify sections of a document. For example, Figure 1 shows a (deliberately) pixelated image of the first page of a paper; even without knowing anything about the particular characters, four sections can be highlighted that almost certainly contain text (red), authors (green), affiliation (yellow) and abstract (blue). Indeed, it turns out that visual structure can also help in extracting sections from an image of, for example, legacy documents that have been scanned in. However, it is virtually impossible to distinguish between author names above the title and author names below the title if the length of the title and the length of the author block are roughly the same. We have performed some experiments showing that we can extract bitmaps for the title and authors from documents that are otherwise unreadable (3-6% of documents on average in a sample academic environment [12]); an approximately 80% degree of success is achievable using a simple image segmentation approach.
These images, or indeed the entire page, may alternatively be handed to OCR software such as gOCR for translation into text, and the resulting text string processed appropriately. An account of the use of appearance and geometric position of text and image blocks for document analysis and classification of PDF material may be found in Lovegrove and Brailsford [13], and a rather later description of a similar 'spatial knowledge' approach applied to Postscript-formatted files is given by Giuffrida et al. [13]. In this paper we focus on documents from which we can extract the text as a simple stream of characters.
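The simplest form of the visual analysis described in this section, grouping word boxes into lines by vertical position, can be sketched as follows (the coordinates are invented):

```python
# Sketch of line reconstruction from word positions: words given as
# (text, x, y) boxes are grouped into lines by vertical position, with a
# tolerance for slight baseline jitter.
def group_into_lines(words, tolerance=2):
    """words: iterable of (text, x, y); returns lines top-to-bottom,
    each line's words joined left-to-right."""
    lines = []                       # list of (y, [(x, text), ...])
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return [" ".join(t for _, t in sorted(ws)) for _, ws in lines]

lines = group_into_lines([("paper", 60, 100), ("A", 10, 101),
                          ("sample", 30, 99), ("Author", 10, 120)])
```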

2.3 Document structure

From both structured description languages (such as HTML) and page description languages (such as PDF) we can usually extract the text of the document. The text itself can be analysed to identify metadata. In particular, author names usually stand out, and so do affiliations, and even the title and journal details. The information that can be extracted from the document structure includes:

1. Title
2. Authors
3. Affiliation
4. Email
5. URL
6. Abstract
7. Section headings (table of contents)
8. Citations
9. References


10. Figure and table captions, e.g. [15]
11. Acknowledgments [16]

Extracting these purely from the document structure is difficult, but together with knowledge about words likely to be found in, for example, author names or titles, the extraction is feasible. A detailed discussion of the methods that we use can be found later in this paper.
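One way to combine document structure with knowledge about likely name words, as suggested above, is to score candidate lines against a name lexicon. The lexicon here is a tiny invented sample; a real system would use a large authority list and probabilistic weights.

```python
# Sketch of scoring candidate lines as author lines by the fraction of
# tokens that look like name words.
NAME_WORDS = {"emma", "henk", "tonkin", "muller", "smith", "jones"}

def name_score(line):
    """Fraction of tokens in the line that appear in the name lexicon."""
    tokens = [t.strip(".,;").lower() for t in line.split()]
    if not tokens:
        return 0.0
    return sum(t in NAME_WORDS for t in tokens) / len(tokens)

lines = ["Keyword and Metadata Extraction",
         "Emma Tonkin; Henk Muller",
         "University of Bristol"]
author_line = max(lines, key=name_score)
```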

2.4 Bibliographic citation analysis

There exists widespread enthusiasm for bibliometrics as an area, and bibliometrics depends heavily on citation analysis as an underlying technology; some form of citation extraction is a prerequisite for this. As a consequence, a number of methods have been identified for this task, making use of various degrees of automation. Harnad and Carr [17] describe the use of tools from the Open Journal Project and Cogprints that can, given well-formed and correctly specified bibliographic citations, extract and convert citations from HTML and PDF. Citation linking is of interest to many as a result of the potential of this data in the analysis of the impact and, arguably, the value of scientific papers, but other uses of the information exist, in particular in the areas of interface design and support for information-seeking practices. The nature and level of interlinking between documents is a rich source of information about the relations between them. For example, a high rate of co-citation may suggest that the subject area or theme is very similar. In this instance, we extracted citations via our software; these could potentially be used for various purposes. For example, Hoche and Flach [18] investigated the use of co-authorship information to predict the topic of scientific papers. The harvesting of acknowledgements has also been suggested as a measure of an individual's academic impact [16], but may also carry thematic information as well as information on a social-networking level that could potentially be useful for measuring points such as conflict of interest. Along with content classification, this constitutes part of a toolkit for 'similarity search' [9].
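Co-citation counting, mentioned above as a signal of related subject matter, can be sketched as follows. The reference keys and lists are invented sample data.

```python
# Sketch of co-citation analysis: two papers are co-cited when they
# appear together in the same reference list, and a high pair count
# suggests related subject matter.
from collections import Counter
from itertools import combinations

reference_lists = [
    ["smith1998", "jones1999", "lee2001"],
    ["smith1998", "jones1999"],
    ["lee2001", "chan2005"],
]

cocitations = Counter()
for refs in reference_lists:
    # Sort so each pair is counted under one canonical key.
    for a, b in combinations(sorted(set(refs)), 2):
        cocitations[(a, b)] += 1
```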

2.5 Linguistic structure

Finally, the document can be analysed linguistically, inferring the meaning of parts of sentences, or relationships between metadata. For example, two citations in the main text may be contained within the same sentence, indicating that they are likely to be related in some way. The relation may be positive or negative, depending on the surrounding text: "In contrast to work by Jones (1998), work by Thomas (1999)..." Analysing linguistic structure depends on knowledge of the document language, and possibly on domain knowledge. Using linguistic analysis one can attempt to extract:

1. keywords
2. relations between citations
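The second item can be illustrated by pairing citations that share a sentence and using a contrast cue to set the relation's polarity. The citation pattern and cue list are deliberately minimal sketches, not a robust parser.

```python
# Sketch of sentence-level citation relations: citations in the same
# sentence are treated as related, and "in contrast" marks the relation
# as negative.
import re

CITATION = re.compile(r"\b[A-Z][a-z]+ \(\d{4}\)")

def relate_citations(sentence):
    """Return (cite_a, cite_b, polarity) for the first citation pair."""
    cites = CITATION.findall(sentence)
    if len(cites) < 2:
        return None
    polarity = "negative" if "in contrast" in sentence.lower() else "positive"
    return (cites[0], cites[1], polarity)

rel = relate_citations(
    "In contrast to work by Jones (1998), work by Thomas (1999) shows gains."
)
```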

Other than Bayesian statistics across term appearance, we do not use explicit linguistic information in the work presented below, but instead focus on the document structure, guided by simple probabilistic information.

3. Uncertainty and metadata

Potential discrepancies between mechanically generated metadata and user-generated metadata may not



be a big problem, because there is also considerable variation in metadata generated by users. There are three principal sources of variation in human-generated metadata: typographic errors, different interpretations of the document, and different interpretations of the metadata descriptions. Below we describe these three sources and discuss the consequences of metadata uncertainty.

3.1 Differences in document interpretation

Differences in document interpretation come to light in, for example, the consistency of classifying pre-prints using keywords. Neither humans nor computers can index with 100% accuracy. If the same article is indexed by each author and a librarian in turn, they will probably suggest different indexing terms, stemming from different interpretations of the work, the background of the person, knowledge about classifications, and in-depth knowledge of the subject matter. Indexing consistency is a well-known problem of interest to researchers in the domain of information science [19]. Indeed, it is doubtful that there is a "gold standard" classification, for even the author of the article may not agree with the assigned classification keywords. Differing interpretations of a work undoubtedly exist; for example, censorship is generally seen as a primary theme of Bradbury's classic work Fahrenheit 451, an interpretation that the author does not accept. That is, the relevance of a document changes over time, and may not coincide with the author's intention; as this occurs, the keywords associated with the document change over time too. This suggests that either keywords have to be kept up to date, or the interpretation of keywords must depend on the context in which those keywords were assigned.
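Indexing consistency of the kind discussed above is often quantified as the set overlap between two indexers' keyword assignments (Hooper's measure, equivalent to the Jaccard index). The keyword sets here are invented.

```python
# Sketch of an inter-indexer consistency measure: |A intersect B| divided
# by |A union B| for two indexers' keyword sets.
def consistency(terms_a, terms_b):
    """Overlap of two keyword sets, in [0, 1]."""
    a, b = set(terms_a), set(terms_b)
    if not a | b:
        return 1.0           # two empty assignments agree trivially
    return len(a & b) / len(a | b)

score = consistency({"open access", "repositories", "metadata"},
                    {"metadata", "indexing", "open access"})
```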

Typographic errors in metadata

A common failure mode for a human entering metadata is the typographic error. The frequency of typographic errors depends on the system interface, feedback, user profile and the type of metadata. High-grade metadata, entered by professionals who are paid to, say, index scientific works, contains very few errors. But low-grade metadata, entered by, for example, on-line users, may contain a significant number of errors. An upper bound for this value on the tagging system Panoramio was less than 10%, with other tag systems showing far higher numbers. These errors are not limited to incorrect spellings, but include errors where the metadata value is selected using a drop-down menu and the user selects the keyword “above” or “below” the intended keyword, or where a spell checker has “corrected” a typographic error and has, for example, replaced recking with racking (rather than wrecking). The latter can be a big problem for people who write documents in a non-native language. In citing other authors, errors in orthography are common, stemming from typographical error, misreading, cultural misunderstanding (such as the inversion of first and last names), as well as from other sources such as issues with citation management software or, indeed, errors propagated from replicating prior miscitations of the document. An overview and typology of features found in online orthography can be found in Tavosanis [20]. Automatically generated metadata does not contain any typos, other than those copied from the original document and those introduced during the extraction process. However, computer-generated metadata is subject to different failure modes. In the simplest case, an incorrect keyword is suggested because it appears appropriate on the basis of the features, but turns out to be inappropriate to a human who understands that identical words may refer to different concepts.

Emma Tonkin; Henk L. Muller

3.3 Different interpretations of metadata schemas

A third common variation in metadata is due to the interpretation of metadata schemas. This commonly expresses itself in the way in which author names are interpreted. Different parts of names have different meanings: in some cultures the first part of the name may be a family name, whereas in other cultures the first name may be a given name, and there are languages in which there are “middle parts” that belong to the surname. It is virtually impossible to design a metadata scheme that allows all names to be stored in a single canonical format, and that at the same time is unambiguous and easy to use for authors from all cultural backgrounds. One strategy around this is to treat author names as opaque strings of characters that warrant no interpretation. Such strings are difficult to match, because authors are frequently inconsistent in providing their names, perhaps in certain cases providing middle initials and in others giving only an initial of one of their given names. Indeed, this strategy is often used consciously by authors to separate their publications in one field from those in another, and may even be deliberately applied to “fool” automatic indexing [20]. Even where authors are consistent, errors in data extraction or clashes between journal style guides may cause errors in author name extraction. For example, some article styles require “first” names in citations to be abbreviated to a single letter.

3.4 Propagation of errors

In the general case, we consider metadata generation an inherently uncertain operation. This implies that metadata should not necessarily be seen as a discrete set of values; it may be better to represent it as a probability distribution [21,22]. Representing the metadata as a distribution gives us the opportunity to communicate the uncertainty in the suggested metadata to the user. For example, we can select a number of possible keywords based on features of a publication, and communicate which of those keywords are more probable than others.

Once errors in metadata exist, they propagate, reinforce similar errors on future pre-prints, introduce seemingly unrelated extra errors, and obfuscate the data presented to the user. Firstly, a system will normally use previous classifications in order to classify future papers. In our system, paperBase, the author names, title, abstract and classification of previous pre-prints are used to predict the classification of new pre-prints. Once a pre-print has been misclassified, future papers may be misclassified in a similar manner. Secondly, a system typically uses the metadata found in pre-prints to establish connections between pre-prints. Connections can be made because two pre-prints are written by an author with the same name, because they cite each other, or because they cover a similar subject matter according to the keywords. Those connections can be used to, for example, disambiguate author identities. A missing link or an extraneous link makes the process of reasoning about clusters of related papers increasingly difficult. Thirdly, the answers to search queries are diluted when errors are introduced. Cascading errors cause a disproportionate dilution of search results. This is also true of user-contributed systems in which users may infer the use of classification terms by examining available exemplars.
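The idea of treating suggested keywords as a probability distribution rather than a flat set can be sketched as follows. This is an illustrative sketch only, not paperBase code; the function names and the raw scores (taken from the example record later in the paper) are our own choices.

```python
# Illustrative sketch: representing suggested keywords as a probability
# distribution instead of a discrete set, so that uncertainty can be
# communicated to the user. Names and values are hypothetical.

def normalise(scores):
    """Turn raw classifier scores into a probability distribution."""
    total = sum(scores.values())
    return {kw: s / total for kw, s in scores.items()}

def suggest(scores, top_n=5):
    """Return the top-n keywords, most probable first, with probabilities."""
    dist = normalise(scores)
    return sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

raw = {"Computer Architecture": 834, "Parallel Processing": 827,
       "Machine Learning": 183, "Computer Vision": 176, "Mobile Software": 156}

for keyword, p in suggest(raw):
    print(f"{keyword}: {p:.2f}")
```

Presenting the ranked distribution, rather than only the single top keyword, lets the interface show the user how sharply the classifier distinguishes between candidates.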
When machine-generated classifications are provided, they are generally represented as unitary facts; either a document may be described via a keyword, or it may not. Consider the following example of a machine-generated classification:



Figure 2: Candidate keywords with associated probabilities

In this case, a document is considered almost certain to be about “Computer Architecture” or “Parallel Processing”, and to have a diminishing likelihood of being classifiable as being about “Machine Learning” or any of the other terms. In general, a threshold is applied, or the top classification is accepted by default, when the result is presented; but it is this distribution that describes the paper with respect to others. The shape of this distribution is very relevant in establishing the nature and relevance of the classification. If there are many keywords with similar probability, there may be no clear winner, and our confidence in the clarity of the results may be shaken absent human evaluation of that judgement. In the case of classifications, many options may be acceptable, but this is less the case in other situations where uncertainty exists. Consider the following citation parses taken from a sample paper (bold text denotes the title and italic text denotes the author):

• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• ...
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE

The likelihood for the correct parse is much higher than the likelihood of all other parses. Unlike the prior example of a classification, only one of these parses can be valid. Whilst it is the most likely, we do not have total confidence in this, but we are able to generate a probability of its accuracy (our level of confidence, a value between 0 and 1). Hence, it is possible to provide some guidance as to the validity of this datum as a “fact” about the document. The danger of reasoning over data in which we, or the system, have low confidence is the risk of propagating errors. If we retain a Bayesian viewpoint, we may calculate any further conclusions on the basis of existing probabilities via Bayesian inference. If, however, we treat a probability as a fact and make inferences over inaccurate data without regard to degree of confidence, the result may be the production of hypotheses in which we have very little confidence indeed. As a consequence, an extension of DC metadata to include estimates of confidence, as described in [23], is useful. Useful in the case of classification would be an estimate of the number of classifications considered “plausible”: the breadth or range of likely classifications, which could also be described in terms of variation or level of consistency in judgement. This is a similar value to that which might be generated in any other situation in which generated or contributed classifications may be treated as “votes”, such as collaborative tagging systems. If the nature and extent of the error are known, further functions that employ these values may apply this information to estimate the accuracy of the result or that of derivative functions. We note that for certain types of metadata, this problem is well investigated. For example, author name disambiguation has received a great deal of interest in recent years, e.g. Han et al. [24,25].
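The cost of discarding confidence can be made concrete with a minimal sketch. A conclusion that depends on several uncertain inputs (here, with invented probabilities, and assuming the inputs are independent) is itself markedly less certain than any one input; treating each input as a hard fact hides this.

```python
# Sketch: confidence in a conclusion that requires all of its uncertain
# inputs to be correct, assuming independence. Values are invented
# for illustration; this is not the actual paperBase inference.

def joint_confidence(*probabilities):
    """Multiply independent per-input confidences into one joint value."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result

p_author_parse = 0.9   # confidence that an author field was parsed correctly
p_same_person = 0.8    # confidence that two name strings denote one person

# Confidence that an inferred link between two papers (via shared
# authorship) is real:
print(joint_confidence(p_author_parse, p_same_person))  # roughly 0.72
```

Chaining further inferences on top of such a link multiplies in further factors, which is exactly the propagation of error the text describes.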

4. Prototype

We developed a system, known as paperBase, for the automated extraction of metadata from pre-print papers. The extractor makes use of the structure that is inherent to scientific papers, and of Bayesian classifiers, in order to identify the metadata. We have captured the structure of scientific documents in a probabilistic grammar that produces most known forms of papers; more details on this grammar are given in Tonkin and Muller [12]. The grammar is used to parse the text of a paper, producing a collection of metadata with associated probabilities. The parser takes the path through the grammar that results in maximal probabilities for authors, title, affiliation and email addresses. The individual probabilities can then be used later on to decide how to use the metadata. We extended DC with appropriate attributes for the encoding of those confidence measures, so that, for example, a user interface might visually encode the confidence and highlight fields that are likely to contain errors.

4.1 Visual interface

The interface displays the metadata in a tabbed form, one tab for each type of pre-print. The extracted metadata, such as author names, title, journal name and suggested keywords, is displayed in the tab. The uncertainty assigned to each of the suggested keywords is shown by ordering the keywords by certainty, and by using graded colour-coding to indicate probable keywords, providing clear and consistent interface semantics. The keywords are shown in a list with a scroll-bar, with the five most likely keywords visible.

4.2 Extension to Dublin Core

The metadata extracted, or in some cases generated, from the document object may be retrieved as an XML document via the paperBase API. The DC metadata itself is encoded into XML using the DC XML guidelines [26] as a basis. Additional terms, including confidence values (probabilities of accuracy) where appropriate for this interface, were included in this document. A fragmentary example of an Open Archives Initiative/Dublin Core XML record as generated by paperBase is shown below:


<oai_dc:dc>
  <dc:type>e-print</dc:type>
  <dc:title>An Evaluation Study of a Link-Based Data Diffusion Machine</dc:title>
  <dc:creator canonical='Muller HL'>Henk L. Muller</dc:creator>
  <dc:creator canonical='Stallard PWA'>Paul Stallard</dc:creator>
  <dc:creator canonical='Warren DHD'>David HD Warren</dc:creator>
  <dc:description>... Abstract deleted ...</dc:description>
  <dc:subject probability='834'>Computer Architecture</dc:subject>
  <dc:subject probability='827'>Parallel Processing</dc:subject>
  <dc:subject probability='183'>Machine Learning</dc:subject>
  <dc:subject probability='176'>Computer Vision</dc:subject>
  <dc:subject probability='156'>Mobile Software</dc:subject>
  ... More keywords ...
</oai_dc:dc>

Each keyword is given with a number indicating the calculated probability that the keyphrase is applicable to this document. In this instance, the top two keywords, Computer Architecture and Parallel Processing, are good choices, with a high probability (the maximum value is 1000). The next three are less likely, and are, indeed, inappropriate. The probabilities given are not normalised into confidence values; at this time there exists no consensus on how confidence values should best be encoded, so the structure of this record may well change in future.
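A client consuming such a record might read the subject probabilities as follows. This is a sketch under assumptions: the record is simplified, the Dublin Core and OAI namespace declarations are added by us (the fragment above omits them), and the threshold value is arbitrary.

```python
# Sketch of reading keyword confidence values from a paperBase-style
# record. The simplified record and the threshold are illustrative.
import xml.etree.ElementTree as ET

RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Evaluation Study of a Link-Based Data Diffusion Machine</dc:title>
  <dc:subject probability="834">Computer Architecture</dc:subject>
  <dc:subject probability="827">Parallel Processing</dc:subject>
  <dc:subject probability="183">Machine Learning</dc:subject>
</oai_dc:dc>"""

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace

def read_subjects(xml_text, threshold=0.5):
    """Return (keyword, probability) pairs at or above the threshold."""
    root = ET.fromstring(xml_text)
    out = []
    for el in root.findall(f"{DC}subject"):
        p = int(el.get("probability")) / 1000  # raw scores are out of 1000
        if p >= threshold:
            out.append((el.text, p))
    return out

print(read_subjects(RECORD))  # keeps the two high-confidence subjects
```

A client that simply takes every dc:subject as a fact would also index the inappropriate low-scoring keywords; filtering on the probability attribute is what the confidence extension makes possible.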

4.3 Deployment Workflow

As a first trial, we have integrated the system into the institutional repository that stores papers written by members of the Department of Computer Science at the University of Bristol. We adapted the workflow so that authors first have to upload an electronic version of the paper, prior to providing any metadata. When the paper is uploaded, the user is presented with a form in which to enter the metadata for that publication.

4.4 Technical details

The extracted data is provided to the end user via a web service. The service is engineered to use web standards common in the Web 2.0 environment, including REST, Dublin Core and XML. The client interface for the user’s web browser makes use of asynchronous JavaScript and XML (AJAX) to retrieve the analysed data and place it into a web form. The webserver has a dedicated thread that interprets metadata. This thread decodes PostScript and PDF files, and extracts text from them using the public domain PDFBox Java library (www.pdfbox.org). When the text is extracted, it is interpreted in a probabilistic grammar, and the results are stored in a database. Various web services make use of this database, including an independent browse interface along the lines of CiteSeer, and a machine-to-machine REST interface that is used to support AJAX applications requiring document metadata. Others, such as an OAI harvesting interface, can be built against the same database backend; however, as mentioned above (in “Propagation of errors”), it is useful for client services to be aware of the data origin and constraints on its use.

An AJAX application embedded into the repository’s web interface polls the webserver for metadata, and fills the form in when metadata becomes available. Typically, metadata is available within a few seconds of submitting the form. The form is then filled in asynchronously when the web server has extracted the data. One might regard a synchronous implementation as ideal, where the form comes back when the file has been uploaded. However, since we have only limited computational resources on the web server, and it may take a few seconds for a paper to be completely analysed, we must queue all papers on the server and deal with them one at a time, in order to control congestion. The way in which the queue is handled can be optimised to limit the impact of likely causes of congestion, such as a batch file upload (a usage pattern supported by the service’s own internal interface). As a result, users may have to wait for a few seconds before their form is filled in with the relevant information; however, we think this is beneficial, because the user can use this time to familiarise themselves with the form. Providing and filling the form as two asynchronous steps is preferable to a user looking at a spinning hour-glass and then being taken to a filled-in form. In addition, our system fails gracefully: if the decoder service is not working for whatever reason, the form will simply stay with all fields blank and report that no metadata could be extracted.

The accessibility of the resulting software represented a primary concern. As such, care has been taken to ensure that the system functions across multiple browser platforms, including IE, Firefox, Safari, Opera and other Gecko- or KHTML-based browsers such as Konqueror. Cross-browser compatibility is, however, a moving target; hence it is expected that this will impose a small ongoing maintenance cost. The non-availability of JavaScript simply means that the user must fill in each field manually, as was the case before this service became available. One further accessibility concern for us was the way in which screen readers and similar assistive technologies reacted to the dynamic content placement.
The dynamic content proved not to be an issue in practical use; however, the presence of a (non-AJAX) “SELECT MULTI” element fell foul of a known showstopper bug in the screen reader, which meant that we could not complete the evaluation. It seems that fully supporting screen readers would involve at least the level of customisation and maintenance required for cross-browser compatibility, and would furthermore require additional investment in developing or procuring a software base for testing purposes.

5. Evaluation

We have performed two trials of our system. In one trial we rebuilt our entire repository, logging suggested keywords together with the keywords that were assigned by the author. We show that 80% of the keywords selected by the authors are in the top-five list of suggested keywords. This is a conservative figure, since we expect that some authors would not have picked the other 20% of the keywords had they not been suggested; see also our discussion earlier on the reliability of human indexing. Throughout the development process, sets of informal think-aloud trials were conducted, resulting in user feedback; applicable results were included in later phases of the iterative design process. We then performed a more formal evaluation study on 12 subjects, presenting them with a set of six papers to be entered into a repository. The participants were presented with a form in which to enter the data, and sometimes this form would be pre-filled with automatically extracted metadata. Half the participants entered their first three papers without assistance, and had automatically extracted metadata for the last three papers. The other half of the participants were presented with automatic data for the first three papers, and had no assistance for the latter three. The papers were selected to cause maximum trouble for the metadata extractor: on one paper an author name would be extracted twice, on a second paper one author would be missing, and other papers had missing ligatures or mathematical formulas in the abstract.

In order to quantify what a true user would see, we manually judged the quality of title and author extraction on 186 papers. We found that 8% of the titles were completely wrong and 8% were not completely correct, with the remaining 84% of the titles being right. For the experiment above, this means that a participant might have seen one bad title, with a probability of less than 50%. Three bad titles in a row has a probability of 0.4%. For the authors, 13% were wrong; of the remaining 87%, 32% included the right authors but had extraneous text that was misconstrued as authors. Our sample was not sufficiently randomised and had many papers by a Slovenian author with a diacritical mark in both surname and first name, which skewed our statistics. In addition, another author’s affiliation was at the “George Kohler Allee”, which was misrecognised as an author name. A detailed analysis of the quantitative results of this study is published in another paper [12]; in short, it was found that the assistive effect generally caused participants to take less time overall in depositing papers. Here, we report on the qualitative feedback that the participants provided. At the end of the trial, participants were given a form with four questions, asking which system they preferred, whether they thought that system was faster, whether they thought that system was more accurate, and an “any other comments” box. A most interesting observation was that the participants were divided on the question of whether manually entered data had fewer errors. Many participants had spotted errors in the automatically extracted data, had corrected them, and had concluded that the manual data must have been more accurate; however, analysing the errors, it turns out that the manual data contains more errors. The reason for this is two-fold. First, there are people who take manual entry literally: they type the title in again (rather than using a copy-and-paste feature).
Typing is an error-prone process. Second, people who use the copy-and-paste feature seem to assume that this is by definition error-free: hardly any of the participants spotted that, when they had used copy-and-paste, ligatures had gone missing during the process, or that hyphenation had been introduced because a word had been broken across two lines in the abstract. Instead, participants accepted copy-and-paste as ground truth, and corrected these errors only when the copy-and-paste had been performed by the metadata extractor. We postulate that people have a limited amount of time to perform tasks such as entering publication data, and that they either spend it on manual entry or on correcting automated entry; the latter leads to more accurate results. There is also a possibility that this is related to the “proofreader blindness” effect: it is known to be more difficult to proofread one’s own work than work by others in one’s own domain [27]. It is possible that the same effect plays a role in this instance. Many of the comments passed on via the last, open question related to features that people would like to see in the system; in particular, we requested a month and a year, and many participants rightly complained that they had to give a month even if they didn’t know it. A number of comments gave qualitative feedback on the use of paperBase. A telling comment was: “Just adding a few fields makes the task of adding your publication much less boring and time consuming. I’d prefer to see it try and occasionally fail as opposed to it be removed because it occasionally failed.” This has been observed in other studies: users are aware that the task they are doing should be done automatically, and they appreciate any help that a system will give them [28]. Another user commented: “I particularly liked the ordered keywords.” The suggested keywords are either ordered alphabetically, or in some order of likelihood.
The latter works really well if the right keywords are somewhere in the top five; if they are not in the top five, they are very hard to find, because the ordering is related to the perception of the extraction algorithm, and no longer related to that of the user entering the publication. Even though people liked it in general, we should have an option to sort the keywords alphabetically (or have an assisted keyword search) for situations in which the algorithm fails. One very interesting comment read: “Abstracts are a nuisance; I would remove those from the database.” Indeed, this participant had blanked out all abstracts: they had not entered any abstract manually, and had erased the automatically extracted abstracts. We postulate that abstracts are only a nuisance because of the work involved in entering them; from a search and user-interface perspective, abstracts are highly valuable and should be available. We think that automated extraction will aid in making metadata more complete, as long as people do not delete valuable information wholesale.
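The title-accuracy arithmetic quoted in the evaluation can be checked directly. This sketch assumes, as the figures imply, that the 8% completely wrong and 8% partially wrong titles combine into a 16% chance that any one extracted title is imperfect, and that titles fail independently.

```python
# Checking the quoted title-accuracy figures: 8% of titles completely
# wrong plus 8% partially wrong gives a 16% chance that any one
# extracted title is imperfect (independence assumed).
p_bad = 0.08 + 0.08

# Probability that a participant sees at least one bad title across the
# three papers entered with assistance:
p_at_least_one = 1 - (1 - p_bad) ** 3
print(f"{p_at_least_one:.1%}")   # under 50%, as stated in the text

# Probability of three bad titles in a row:
p_three_bad = p_bad ** 3
print(f"{p_three_bad:.1%}")      # about 0.4%, as stated in the text
```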

6. Conclusion

Semi-automatic metadata entry offers many advantages. From the limited study that we performed, we observed increased accuracy, faster entry time and, most importantly, buy-in, with the participants unambiguously preferring the semi-automated entry system. The evaluation that we performed is limited in that we studied only a single domain (computer science papers), with a small number of participants who were very computer literate (postgraduates in computer science). In future evaluations we would like to include different domains. The current version of the interface uses only a small amount of the data that could be used. In particular, we do not yet use links between papers (in the form of citations) to, for example, disambiguate author identities. The number of file formats supported by the system could be increased, and methods found for the user to correct other metadata, such as citations, which are also extracted by the system. Equally, the provision and use of error margins may have some promise in providing cross-site, hybrid search operating across a number of resource and metadata types. One feature of interest within the study results is the reminder that the quality of metadata, whether semi-automated or not, depends on the level of interest of the participant. Individuals who simply do not see the point of providing a given metadata element will at best put little thought into the process, and at worst will actively remove elements they consider extraneous, despite the best efforts of an automated metadata extraction service. The ultimate arbiter in any system that is not fully automated is the individual contributor, despite any scaffolding that the system may provide, and any mismatch between the contributor’s needs and the aims of the system designer should be identified and allowed for in design and development.

7. Notes and References

[1] Liddy, E. D., Sutton, S., Paik, W., Allen, E., Harwell, S., Monsour, M., Turner, A. and Liddy, J. Breaking the metadata generation bottleneck: preliminary findings. Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, Roanoke, Virginia, United States, 2001. pp. 464.
[2] Greenberg, J., Spurgin, K. and Crystal, A. Functionalities for automatic metadata generation applications: a survey of metadata experts’ opinions. Int. J. Metadata, Semantics and Ontologies, Vol. 1, No. 1, 2006.
[3] Han, H., Giles, C. L., Manavoglu, E. and Zha, H. Automatic Document Metadata Extraction using Support Vector Machines. Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries, ACM Press, New York, 2003. pp. 37-48.
[4] Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries, ACM Press, New York, 2003. pp. 49-60.
[5] Bergmark, D. Automatic Extraction of Reference Linking Information from Online Documents. CSTR 2000-1821, Cornell Digital Library Research Group, November 2000.
[6] Hu, Y., Li, H., Cao, Y., Teng, L., Meyerzon, D. and Zheng, Q. Automatic extraction of titles from general documents using machine learning. Information Processing & Management, Vol. 42, Issue 5, September 2006. pp. 1276-1293.
[7] Rosch, E. Natural categories. Cognitive Psychology 4, 1973. pp. 328-350.
[8] Labov, W. The boundaries of words and their meanings. In C.-J. N. Bailey and R. W. Shuy (Eds.), New Ways of Analysing Variation in English. Washington: Georgetown University Press, 1973. pp. 340-373.
[9] Olivié, H., Cardinaels, K. and Duval, E. Issues in Automatic Learning Object Indexation. In P. Barker & S. Rebelsky (Eds.), Proceedings of the World Conference on Educational Multimedia, Hypermedia and Telecommunications 2002, Chesapeake, VA: AACE, 2002. pp. 239-240.
[10] Austin, D., Peruvemba, S., McCarron, S., Ishikawa, M. and Birbeck, M. XHTML™ Modularization 1.1, W3C Working Draft, 2006. Retrieved April 30th, 2008, from http://www.w3.org/TR/xhtml-modularization/xhtml-modularization.html
[11] Krause, J. and Marx, J. Vocabulary Switching and Automatic Metadata Extraction or How to Get Useful Information from a Digital Library. Proceedings of the First DELOS Network of Excellence Workshop on “Information Seeking, Searching and Querying in Digital Libraries”, Zurich, Switzerland, 2000.
[12] Tonkin, E. and Muller, H. L. Semi Automated Metadata Extraction for Preprints Archives. Proceedings of the Eighth ACM/IEEE Joint Conference on Digital Libraries, ACM Press, New York, 2008.
[13] Lovegrove, W. S. and Brailsford, D. F. Document analysis of PDF files: methods, results and implications. Electronic Publishing, Vol. 8(2&3), June and September 1995. pp. 207-220.
[14] Giuffrida, G., Shek, E. C. and Yang, J. Knowledge-based metadata extraction from PostScript files. DL ’00: Proceedings of the Fifth ACM Conference on Digital Libraries, ACM, New York, USA, 2000. pp. 77-84. DOI: http://doi.acm.org/10.1145/336597.336639
[15] Liu, Y., Mitra, P., Giles, C. L. and Bai, K. Automatic extraction of table metadata from digital documents. Proceedings of the Sixth ACM/IEEE-CS Joint Conference on Digital Libraries, 2006. pp. 339-340.
[16] Giles, C. L. and Councill, I. D. Who gets acknowledged: Measuring scientific contributions through automatic acknowledgment indexing. PNAS, Vol. 101, No. 51, 2004. pp. 17599-17604.
[17] Harnad, S. and Carr, L. Integrating, navigating and analysing open Eprint archives through open citation linking (the OpCit project). Current Science, 79(5), 2000. pp. 629-638.
[18] Hoche, S. and Flach, P. Predicting Topics of Scientific Papers from Co-Authorship Graphs: a Case Study. Proceedings of the 2006 UK Workshop on Computational Intelligence (UKCI 2006), September 2006. pp. 215-222.
[19] Olson, H. and Wolfram, D. Indexing Consistency and its Implications for Information Architecture: A Pilot Study. IA Summit 2006.
[20] Tavosanis, M. A causal classification of orthography errors in web texts. Proceedings of AND 2007.
[21] van Rijsbergen, C. J. The Geometry of Information Retrieval. Cambridge University Press, 2004.
[22] Widdows, D. Geometry and Meaning. CSLI, Center for the Study of Language and Information, 2004.
[23] Cardinaels, K., Duval, E. and Olivié, H. J. A Formal Model of Learning Object Metadata. EC-TEL 2006. pp. 74-87.
[24] Han, H., Giles, C. L., Zha, H., Li, C. and Tsioutsiouliklis, K. Two supervised learning approaches for name disambiguation in author citations. Proceedings of the Fourth ACM/IEEE Joint Conference on Digital Libraries, ACM Press, New York, 2004. pp. 296-30.
[25] Han, H., Zha, H. and Giles, C. L. Name disambiguation in author citations using a K-way spectral clustering method. Proceedings of JCDL ’05, ACM Press, New York, 2005. pp. 334-343.


[26] Powell, A. and Johnston, P. Guidelines for implementing Dublin Core in XML. DCMI Recommendation, April 2003. http://dublincore.org/documents/dc-xml-guidelines/
[27] Daneman, M. and Stainton, M. The generation effect in reading and proofreading. Reading and Writing, Vol. 5, No. 3, 1993. pp. 297-313. DOI: 10.1007/BF01027393
[28] Berry, M. W. and Browne, M. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, 2005.




The MPEG Query Format, a New Standard For Querying Digital Content. Usage in Scholarly Literature Search and Retrieval

Ruben Tous and Jaime Delgado
Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC)
Mòdul D6, Campus Nord, C/ Jordi Girona 1-3, E-08034 Barcelona, Spain
e-mail: rtous@ac.upc.edu; jaime.delgado@ac.upc.edu

Abstract

The standardization initiative of the MPEG Query Format (MPQF) has refueled research on the definition of a unified query language for digital content. The goal is to provide a standardized interface to multimedia document repositories, including but not limited to multimedia databases, documental databases, digital libraries, spatio-temporal databases and geographical information systems. The initiative is led by MPEG (i.e. ISO/IEC JTC1/SC29/WG11). This paper presents MPQF as a new approach for retrieving multimedia document instances from very large document databases, and its particular application to scholarly literature search and retrieval. The paper also explores how MPQF can be used in combination with the Open Archives Initiative (OAI) to deploy advanced distributed search and retrieval services. Finally, the issue of rights preservation is discussed.

Keywords: scholarly literature search and retrieval; query format; MPQF; Open Archives Initiative; MPEG

1. Introduction

In recent years, the technologies enabling search and retrieval of multimedia digital content have gained importance due to the large amount of digitally stored multimedia documents. Therefore, members of the MPEG standardization committee (i.e. ISO/IEC JTC1/SC29/WG11) have developed a new standard, the MPEG Query Format (MPQF) [1, 2, 3], which provides a standardized interface to multimedia document repositories, including but not limited to multimedia databases, documental databases, digital libraries, spatio-temporal databases and geographical information systems. The MPEG Query Format offers a new and powerful alternative to the traditional scholarly communication model. MPQF provides scholarly repositories with the ability to extend access to their metadata and contents via a standard query interface, in the same way as Z39.50 [4], but making use of the newest XML querying tools (based on XPath 2.0 [5] and XQuery 1.0 [6]) in combination with a set of advanced multimedia information retrieval capabilities defined within MPEG. This would allow, for example, querying for journal papers by specifying constraints over their related XML metadata (which is not restricted to a particular format) in combination with similarity search, relevance feedback, query-by-keywords, query-by-example media (using an example image to retrieve papers with similar ones), etc. MPQF has been designed to unify the way digital materials are searched and retrieved. This has important implications for the near future, when scholarly users' information needs will become more complex and will involve searches combining (in the input and the output) documents of different natures (e-prints, still images, audio transcripts, video files, etc.). Currently, several forums, like [7], are trying to identify the necessary steps that could be taken to improve





interoperability across heterogeneous scholarly repositories. The specific goal is to reach a common understanding of a set of core repository interfaces that would allow services to interact with heterogeneous repositories in a consistent manner. Such repository interfaces include interfaces that support locating, identifying, harvesting and retrieving digital objects. There is an open discussion about whether the interoperability framework would benefit from the introduction of a search interface service. In general, it is felt that, while such an interface is essential, it should not be part of the core, and that it could be implemented as an autonomous service over one or more digital repositories, fed through interaction with core repository interfaces for harvesting such as the Open Archives Initiative (OAI) [8]. We argue that MPQF could be this search interface service, deployed in the last mile of the value chain, offering powerful and innovative ways to express user information needs.

2. Related work

In general, the preferred method for distributed access to digital content repositories is metadata harvesting. Metadata harvesting consists of collecting the metadata descriptions of digital items (usually in XML format) from a set of digital content repositories and storing them in a central server. Metadata is lighter than content, and it is feasible to store the necessary amount of it in an aggregation server, so that real-time access to information about distributed digital content can take place without the burden of initiating a parallel real-time querying of the underlying target content databases. Nowadays, the preferred harvesting method is the one offered by the Open Archives Initiative (OAI), which defines a mechanism for harvesting XML-formatted metadata from repositories (usually within the scholarly context). The OAI technical framework is intentionally simple, with the intent of providing a low barrier for implementers and users. The trade-off is that its query expressiveness and output format description are very limited. In the OAI Protocol for Metadata Harvesting (OAI-PMH), metadata consumers or "harvesters" request metadata for updated records from the metadata producers or "repositories" (data providers are required to provide XML metadata at least in Dublin Core format). These requests can be based on a timestamp range, and can only be restricted to named sets defined by the provider. These sets provide a very limited form of selective harvesting, and do not act as a search interface. Consequently, some repositories may provide other querying interfaces with richer functionality, usually in addition to OAI. The two principal examples are Z39.50 and SRU-CQL [9, 10]. Regarding OAI, the MPEG Query Format (MPQF) could also be used for harvesting (though in that case a metadata format offering record update timestamps would be needed), overlapping with the OAI functionalities.
However, MPQF is a complex language which has been designed for fine-grained retrieval and more advanced filtering capabilities. Because OAI offers a specialised, low-barrier and mature protocol for harvesting, we think that both mechanisms should be used in conjunction. With respect to Z39.50 and related protocols/languages, MPQF surpasses their expressive power by offering a flexible combination of XML-based query capabilities with a broad set of multimedia information retrieval capabilities. A major difference with respect to the Z39.50 approach is that MPQF does not define abstract data structures to which the queries refer; instead, MPQF queries use generic XPath and XQuery expressions written in terms of the expected metadata format of the target databases. We envisage the usage of MPQF and its expressive power directly between user agents and service providers, while OAI will probably be used throughout the rest of the value chain. Regarding other multimedia query formats, there exist several languages designed explicitly for multimedia data, such as SQL/MM [11], MOQL [12] or POQLMM [13], which are out of the scope of this paper because of their limitations in handling XML data. Today, such approaches tend to be based on MPEG-7 descriptors and the MPEG-7 data model. Some simply advocate the use of XQuery or extensions of it; others define a more high-level and user-oriented approach. MPQF outperforms XQuery-based approaches like [14, 15, 16] because, while offering the same level of expressiveness, it offers multiple content-based search functionalities (QBE, query-by-freetext) and other IR-like features (e.g. paging or relevance feedback). Besides, XQuery does not provide means for querying multiple databases in one request and does not support multimodal or spatial/temporal queries.
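The OAI-PMH selective-harvesting interaction described above (timestamp ranges plus provider-defined sets, but no search) can be sketched as a simple URL construction. This is an illustrative Python fragment; the repository endpoint and set name are hypothetical:

```python
from urllib.parse import urlencode

def oai_list_records(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    OAI-PMH selective harvesting is limited to a timestamp range
    ('from'/'until') and named sets; it is not a search interface,
    which is the gap MPQF is meant to fill.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # harvest only records updated since
    if set_spec:
        params["set"] = set_spec     # provider-defined named set
    return base_url + "?" + urlencode(params)

# Hypothetical repository endpoint:
url = oai_list_records("http://repository.example.org/oai",
                       from_date="2008-01-15", set_spec="elpub")
```

The resulting URL is an ordinary HTTP GET; the repository answers with an XML document of metadata records, at minimum in Dublin Core.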

3. MPEG Query Format

3.1 Concepts and benefits

Formally, MPQF is Part 12 of ISO/IEC 15938, "Information Technology - Multimedia Content Description Interface", better known as MPEG-7 [17]. The standardization process started in July 2006 with the release of a "Call for Proposals on MPEG-7 Query Format" [18]. However, the query format was technically decoupled from MPEG-7 during the 81st MPEG meeting in July 2007, and its name changed to "MPEG Query Format" or simply "MPQF". The standardization process has proceeded, and it is expected that MPQF will become an ISO/IEC final standard after the 85th MPEG meeting in July 2008. Basically, MPQF is an XML-based query language that defines the format of queries and replies to be interchanged between clients and servers in a distributed multimedia information search-and-retrieval context. The two main benefits of standardizing such a language are 1) interoperability between parties (e.g. content providers, aggregators and user agents) and 2) platform independence: developers can write their applications involving multimedia queries independently of the database used, which fosters software reusability and maintainability. The major advantage of having MPEG rather than industry forums leading this initiative is that MPEG specifies international, open standards targeting all possible application domains, which are therefore not conditioned by partial interests or restrictions.


Figure 1. MPEG Query Format diagram

MPQF defines a request-reply XML-based interface between a requester and a responder. Figure 1 shows a diagram outlining the basic MPQF scenario. In the simplest case, the requester may be a user's agent and the responder might be a retrieval system. However, MPQF has been specially designed for more complex scenarios, in which users interact, for instance, with a content aggregator. The content aggregator acts at the same time as responder (from the point of view of the user) and as a requester to



a number of underlying content providers to which the user query is forwarded.

3.2 Multimedia information retrieval vs. (XML) data retrieval

One of the novel features of MPQF is that it allows the expression of queries combining the expressive styles of both Information Retrieval and XML Data Retrieval systems. Thus, MPQF allows combining e.g. keywords and query-by-example with e.g. XQuery, allowing the fulfillment of a broad range of users' multimedia information needs. Both approaches to data retrieval aim to facilitate users' access to information, but from different points of view. On the one hand, given a query expressed in a user-oriented manner (e.g. an image of a bird), an Information Retrieval system aims to retrieve information that might be relevant even though the query is not formalized. In contrast, a Data Retrieval system (e.g. an XQuery-based database) deals with a well-defined data model and aims to determine which objects of the collection satisfy clearly defined conditions (e.g. the title of a movie, the size of a video file or the fundamental frequency of an audio signal). Regarding Information Retrieval, MPQF offers a broad range of possibilities that include but are not limited to query-by-example-media, query-by-example-description, query-by-keywords, query-by-feature-range, query-by-spatial-relationships, query-by-temporal-relationships and query-by-relevance-feedback. For Data Retrieval, MPQF offers its own XML query algebra for expressing conditions over the multimedia-related XML metadata (e.g. MPEG-7, Dublin Core or any other XML-based metadata format), but also offers the possibility to embed XQuery expressions (see Figure 2).

[Figure 2 diagram: the MPEG Query Format combines DR-like criteria (a metadata-neutral XML query algebra and embedded XQuery expressions) with IR-like criteria (QueryByFreeText, QueryByDescription, QueryByMedia, SpatialQuery, TemporalQuery).]
Figure 2. MPQF IR and DR capabilities

3.3 Language parts

MPQF instances are XML documents that can be validated against the MPQF XML schema. Any MPQF instance always includes the MpegQuery element as the root element. Below the root element, an MPQF document includes either the Query element or the Management element. MPQF instances with the Query element are the usual requests and responses of a digital content search process. The Query element can include the Input element or the Output element, depending on whether the document is a request or a response. The part of the language describing the contents of the Input element is named the Input Query Format (IQF) in the MPQF standard. The part of the language describing the Output element is named the Output Query Format (OQF). The terms IQF and OQF are just used to facilitate understanding, and have no representation in the schema. Alternatively, below the root element, an MPQF document can



include the Management element. Management messages (which in turn can be requests and responses) provide means for requesting service-level functionalities such as discovering databases or other kinds of service providers, interrogating the capabilities of a service, or configuring service parameters.

[Figure 3 diagram: MpegQuery root element with its Query (Input, Output, FetchResult) and Management (Input, Output) children.]

Figure 3. MPQF Schema root elements
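These structural rules can be sketched programmatically. The following is an illustrative Python fragment using only the standard library; element names follow the description above, but namespaces and schema validation are omitted, so it is not a conformant MPQF implementation:

```python
import xml.etree.ElementTree as ET

def make_mpqf_message(kind="query"):
    """Build a minimal MPQF message skeleton.

    Every MPQF instance has MpegQuery as its root; below it sits either
    a Query element (search requests/responses) or a Management element
    (service-level requests such as capability discovery).
    """
    root = ET.Element("MpegQuery")
    if kind == "query":
        query = ET.SubElement(root, "Query")
        ET.SubElement(query, "Input")   # a request carries Input (IQF)
    else:
        mgmt = ET.SubElement(root, "Management")
        ET.SubElement(mgmt, "Input")    # e.g. a capability interrogation
    return root

doc = make_mpqf_message("query")
```

A response would carry an Output element under Query instead of Input, following the same skeleton.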

3.4 Input Query Format (IQF)

The MPQF's Input Query Format (IQF) mainly allows specifying the search condition tree, which represents the user's search criteria, and also the structure and desired contents of the resultset. The condition tree is the main component of MPQF, and can be built by combining different kinds of expressions and query types. When analyzing an MPQF condition tree, one must consider that it will be evaluated against an unordered set of Multimedia Content (MC). The concept of Multimedia Content [17] is analogous to the concept of Digital Object from the Digital Libraries area, and refers to the combination of multimedia data and its associated metadata. MPQF allows search and retrieval of complete or partial MC data and metadata. Conditions within the condition tree operate on one evaluation item (EI) at a given time (or two EIs if a Join operation is used). By default, an EI is a multimedia content in the database, but other types of EIs are also possible (spatial or time regions, metadata fragments, etc.). Figure 4 outlines the main elements of the IQF part of the MPQF schema. The condition tree is placed within the QueryCondition element, and is constructed by combining boolean operators (AND, OR, etc.), simple conditions over the XML metadata fields, and query types (QueryByFreeText, QueryByMedia, etc.). The example in Code 1 shows an MPQF query asking for PDF research papers related to the keywords "Open Access" with a Dublin Core date element greater than or equal to 2008-01-15. Note that the query expects the target repository to expose Dublin Core descriptors. Exposing Dublin Core metadata is not required for an MPQF-compliant server; therefore the requester must first ask the repository which metadata formats it supports.
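How such a condition tree filters a collection can be sketched as follows. This is an illustrative Python fragment, not the normative MPQF evaluation semantics: the free-text match is simplified to substring containment over a title field, and each evaluation item is a plain metadata dictionary:

```python
def matches(item, condition):
    """Recursively evaluate a toy condition tree against a metadata dict."""
    kind = condition["type"]
    if kind == "AND":
        return all(matches(item, c) for c in condition["conditions"])
    if kind == "OR":
        return any(matches(item, c) for c in condition["conditions"])
    if kind == "QueryByFreeText":        # simplified: substring match
        return condition["text"].lower() in item.get("title", "").lower()
    if kind == "GreaterThanEqual":       # e.g. a Dublin Core date field;
        # ISO 8601 dates compare correctly as strings
        return item.get(condition["field"], "") >= condition["value"]
    raise ValueError("unknown condition type: " + kind)

# Condition tree mirroring Code 1: free text AND date >= 2008-01-15
tree = {"type": "AND", "conditions": [
    {"type": "QueryByFreeText", "text": "Open Access"},
    {"type": "GreaterThanEqual", "field": "date", "value": "2008-01-15"},
]}
```

A real MPQF processor would additionally rank IR-like conditions by relevance rather than evaluate them as booleans.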

3.5 Output Query Format (OQF)

The MPQF's Output Query Format (OQF) allows specifying the structure of the resultset. By default, the resultset includes some fields, like the resource locator (the MediaResource element in MPQF), but MPQF also allows selecting specific XML elements from one or more target namespaces. MPQF allows sorting and grouping result records, but it is deliberately rigid in the way records are presented. Unlike XQuery, which allows defining any possible structure for a result, MPQF records always share the same structure at the top levels. As shown in Figure 5, a ResultItem element is returned for each record. Within each



ResultItem, generic information about the record is placed within the Comment, TextResult and MediaResource elements, while the Description element is reserved for encapsulating the XML fields which have been selected in the query.

[Figure 4 diagram: main IQF schema elements (Input, QFDeclaration, OutputDescription, Path, TargetMediaType, QueryCondition, Join, ServiceSelection, Condition).]

Figure 4. Input Query Format (IQF)

[Figure 5 diagram: main OQF schema elements (Output, GlobalComment, SystemMessage, ResultItem, Comment, TextResult, Thumbnail, MediaResource, Description, AggregationResult).]

Figure 5. Output Query Format (OQF)

The example in Code 2 gives an idea of what the result of the query in Code 1 could look like. The resultset consists of two records which match the query conditions and include the Dublin Core elements which have been selected (title, creator, publisher and date).

<MpegQuery>
  <Query>
    <Input>
      <OutputDescription outputNameSpace="//purl.org/dc/elements/1.1/">
        <ReqField>title</ReqField>
        <ReqField>creator</ReqField>
        <ReqField>publisher</ReqField>
        <ReqField>date</ReqField>
      </OutputDescription>
      <QueryCondition>
        <TargetMediaType>application/pdf</TargetMediaType>
        <Condition xsi:type="AND" preferenceValue="10">
          <Condition xsi:type="QueryByFreeText">
            <FreeText>Open Access</FreeText>
          </Condition>
          <Condition xsi:type="GreaterThanEqual">
            <DateTimeField>date</DateTimeField>
            <DateValue>2008-01-15</DateValue>
          </Condition>
        </Condition>
      </QueryCondition>
    </Input>
  </Query>
</MpegQuery>

Code 1: Input query example




<MpegQuery mpqfID="AB13DGDDE1">
  <Query>
    <Output>
      <ResultItem xsi:type="ResultItemType" recordNumber="1">
        <TextResult>Some advertising here</TextResult>
        <MediaResource>http://www.repository.com/item04.pdf</MediaResource>
        <Description xmlns:dc="http://purl.org/dc/elements/1.1/"
            xsi:schemaLocation="http://purl.org/dc/elements/1.1/ dc.xsd">
          <dc:title>Open Access Overview</dc:title>
          <dc:creator>John Smith</dc:creator>
          <dc:publisher>VDM Verlag</dc:publisher>
          <dc:date>2008-02-21</dc:date>
        </Description>
      </ResultItem>
      <ResultItem xsi:type="ResultItemType" recordNumber="2">
        <TextResult>Some advertising here</TextResult>
        <MediaResource>http://www.repository.com/item08.pdf</MediaResource>
        <Description xmlns:dc="http://purl.org/dc/elements/1.1/"
            xsi:schemaLocation="http://purl.org/dc/elements/1.1/ dc.xsd">
          <dc:title>Open Access in Germany</dc:title>
          <dc:creator>John Smith</dc:creator>
          <dc:publisher>VDM Verlag</dc:publisher>
          <dc:date>2008-02-01</dc:date>
        </Description>
      </ResultItem>
    </Output>
  </Query>
</MpegQuery>

Code 2: Output query example
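Because every MPQF resultset shares the same top-level structure, a client can extract records generically. The following Python sketch is illustrative only: it parses a simplified response (the xsi:type attributes and MPQF namespace handling are omitted, and the sample record is abbreviated from Code 2):

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"   # Dublin Core namespace

def parse_resultset(xml_text):
    """Extract (locator, title) pairs from an MPQF Output message."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("ResultItem"):
        locator = item.findtext("MediaResource")          # default OQF field
        title = item.find("Description").findtext(DC + "title")
        records.append((locator, title))
    return records

# Abbreviated sample response in the shape of Code 2:
response = """<MpegQuery><Query><Output>
  <ResultItem recordNumber="1">
    <MediaResource>http://www.repository.com/item04.pdf</MediaResource>
    <Description xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Open Access Overview</dc:title>
    </Description>
  </ResultItem>
</Output></Query></MpegQuery>"""
```

The fixed top-level record structure is what makes such a generic client possible, in contrast to free-form XQuery results.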

4. Open Archives and MPQF together: a scholarly objects interchange framework

We envisage that MPQF could be one building block of a future scholarly objects interchange framework, interconnecting heterogeneous scholarly repositories. The framework would be based on the combination of the Open Archives Initiative (OAI) protocol for metadata harvesting (OAI-PMH) with MPQF. Figure 6 graphically outlines the basic elements of the framework in an example scenario. The search functionalities required by the different parties in the framework vary depending on their roles. On the one hand, aggregators (e.g. librarians) need to collect metadata descriptions from repositories (e.g. publishers) or from each other, and this is usually performed through a harvesting mechanism. On the other hand, content "retailers", which include aggregators and also some repositories (generally medium or large scale ones), should be able to deploy value-added services offering fine-grained access to digital objects, and advanced search and retrieval capabilities. We believe that the MPEG Query Format could be the search interface between "retailers" and users, in the last mile of the value chain, offering expressive ways to represent user information needs. The scenario in Figure 6 does not cover the real-time distributed usage of MPQF. Our experience in previous projects like [19] and [20] makes us think that real-time distributed search imposes severe limitations in terms of interoperability and performance, and it is not always necessary. However, this scenario is just an example, and nothing prevents the distributed usage of MPQF (the standard provides extensive capabilities for that).

5. Advanced examples

5.1 QueryByMedia example: searching research papers with similar images

The example in Code 3 shows an MPQF query asking for PDF research papers that include images similar to a given one. An example usage of this query could be the detection of image copyright infringement. For instance, it could have been used in the early 1990s, when Playboy magazine discovered that an image



copyrighted by the company in 1972, Lena Sjooblom's photo (Figure 7), was being widely used in image processing research papers. The query includes a Condition element of the QueryByMedia complex type. Query-by-example similarity searches allow expressing the user's information need with one or more example digital objects (e.g. an image file). Though the usage of a low-level feature description instead of the example object's bit stream is also considered query-by-example, MPQF differentiates these two situations, naming the first case (the digital media itself) query-by-media and the second query-by-description.

[Figure 6 diagram: end users send MPQF queries to presentation services (e.g. portals); presentation services and aggregators obtain metadata from aggregators and repositories via OAI-PMH.]
Figure 6. OAI+MPQF Example scenario

Figure 7. Lena Sjooblom's photo from 1972 and a research paper where it appears [21]

<MpegQuery>
  <Query>
    <Input>
      <QueryCondition>
        <TargetMediaType>application/pdf</TargetMediaType>
        <Condition xsi:type="QueryByMedia">
          <MediaResource xsi:type="MediaResourceType">
            <MediaResource>
              <InlineMedia type="image/jpeg">
                <MediaData64>R0lGODlhDwAPAKECAAAAzMzM/////wAAACwAAAAADwAPAAACIISPeQHsrZ5ModrLlN48CXF8m2iQ3YmmKqVlRtW4MLwWACH+H09wdGltaXplZCBieSBVbGVhZCBTbWFydFNhdmVyIQAAOw==</MediaData64>
              </InlineMedia>
            </MediaResource>
          </MediaResource>
        </Condition>
      </QueryCondition>
    </Input>
  </Query>
</MpegQuery>

Code 3: QueryByMedia example
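The InlineMedia/MediaData64 mechanism simply transports the example object's bytes in base64. How a client could embed an image is sketched below; this is an illustrative Python fragment following the element names of Code 3, with namespace declarations omitted (the xsi:type attribute is written as a plain string, so the output is not namespace-correct XML):

```python
import base64
import xml.etree.ElementTree as ET

def query_by_media(image_bytes, mime="image/jpeg"):
    """Build a minimal QueryByMedia condition carrying an example image."""
    cond = ET.Element("Condition", {"xsi:type": "QueryByMedia"})
    outer = ET.SubElement(cond, "MediaResource",
                          {"xsi:type": "MediaResourceType"})
    inner = ET.SubElement(outer, "MediaResource")
    inline = ET.SubElement(inner, "InlineMedia", {"type": mime})
    data = ET.SubElement(inline, "MediaData64")
    data.text = base64.b64encode(image_bytes).decode("ascii")
    return cond

# Dummy bytes standing in for a real JPEG file:
cond = query_by_media(b"\xff\xd8\xff\xe0 fake jpeg bytes")
```

The server decodes the payload and runs its own similarity matching over the image collection; the choice of features and algorithm is server-side.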




QueryByMedia and QueryByDescription are the fundamental operations of MPQF and represent the query-by-example paradigm. The difference lies in the sample data used. The QueryByMedia query type uses a media sample, such as an image, as the key for search, whereas QueryByDescription allows querying on the basis of an XML-based description. Code 4 shows how an example Dublin Core description can be included in a query. The server should return records corresponding to digital objects with metadata similar to the given one. It is up to the server to decide which similarity algorithm to apply.

<MpegQuery>
  <Query>
    <Input>
      <QueryCondition>
        <TargetMediaType>application/pdf</TargetMediaType>
        <Condition xsi:type="QueryByDescription" matchType="exact">
          <DescriptionResource resourceID="desc07">
            <AnyDescription xmlns:dc="http://purl.org/dc/elements/1.1/">
              <dc:title>Open Access Overview</dc:title>
              <dc:creator>John Smith</dc:creator>
              <dc:publisher>VDM Verlag</dc:publisher>
              <dc:date>2008-02-21</dc:date>
            </AnyDescription>
          </DescriptionResource>
        </Condition>
      </QueryCondition>
    </Input>
  </Query>
</MpegQuery>

Code 4: QueryByDescription example

5.3 QueryByRelevanceFeedback example: refining the search of a research paper

In Information Retrieval, "relevance feedback" refers to using the relevance of the results initially returned by a query to improve the results of a new query. MPQF offers the possibility of "explicit" relevance feedback by allowing the user to mark specific records as relevant or irrelevant. This is accomplished through the QueryByRelevanceFeedback query type. Let us take again the example from Code 1 and Code 2. The user was looking for research papers related to the words "Open Access" and submitted a query (Code 1) to the server. The server responded with several records (within a response with id "AB13DGDDE1"), some of which are shown in Code 2. Let us imagine that the user found records number 1, 2 and 5 especially interesting. By using the QueryByRelevanceFeedback query type, as shown in Code 5, the user can submit his/her preferences to the server, allowing the server to refine the response.

<MpegQuery>
  <Query>
    <Input>
      <QueryCondition>
        <Condition xsi:type="QueryByRelevanceFeedback" answerID="AB13DGDDE1">
          <ResultItem>1</ResultItem>
          <ResultItem>2</ResultItem>
          <ResultItem>5</ResultItem>
        </Condition>
      </QueryCondition>
    </Input>
  </Query>
</MpegQuery>

Code 5: QueryByRelevanceFeedback example
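Programmatically, such a feedback query needs only the id of the previous answer and the marked record numbers. The following Python sketch mirrors Code 5; it is illustrative only, with namespace handling omitted:

```python
import xml.etree.ElementTree as ET

def relevance_feedback(answer_id, relevant_items):
    """Build an MPQF QueryByRelevanceFeedback request (cf. Code 5)."""
    root = ET.Element("MpegQuery")
    query = ET.SubElement(root, "Query")
    inp = ET.SubElement(query, "Input")
    qc = ET.SubElement(inp, "QueryCondition")
    cond = ET.SubElement(qc, "Condition",
                         {"xsi:type": "QueryByRelevanceFeedback",
                          "answerID": answer_id})
    for n in relevant_items:
        # each marked record is referenced by its recordNumber
        ET.SubElement(cond, "ResultItem").text = str(n)
    return root

req = relevance_feedback("AB13DGDDE1", [1, 2, 5])
```

The answerID ties the feedback to the earlier response, so the server can re-rank against the original query state.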

The presented examples demonstrate only a small part of MPQF's capabilities, but they are intended to show the particularities of this language in comparison to other existing multimedia querying facilities, and especially to existing scholarly search interfaces.




6. Conclusions

This paper proposes the usage of a novel standard, the MPEG Query Format, to extend the functionality and foster the interoperability of scholarly repositories' search interfaces. The paper argues that future scholarly digital object interchange frameworks could be based on the combination of MPQF and the Open Archives protocol. While Open Archives offers a low-barrier mechanism for "wholesale" metadata interchange, MPQF provides scholarly repositories with the ability to extend access to their metadata and contents via a standard query interface, making use of the newest XML querying tools (based on XPath 2.0 and XQuery 1.0) in combination with a set of advanced multimedia information retrieval capabilities defined within MPEG. The paper also describes how this idea can be applied to the design of a scholarly objects interchange framework. The framework interconnects heterogeneous scholarly repositories and is based on the combination of two standard technologies, the OAI-PMH protocol and the MPEG Query Format. The design has been guided by the conclusions of a previous experience, the XAC project [20], from which several lessons were learnt, such as the necessary separation between metadata harvesting and real-time search and retrieval, or the necessity of choosing a more appropriate query format than XQuery. We are currently working on the first implementation of the framework. It is worth mentioning that the first known implementation of an MPEG Query Format processor is planned to emerge from this work. Furthermore, parts of the ongoing implementation are being contributed to the MPEG standardisation process in the form of Reference Software modules. Finally, it is also relevant to indicate that we are in fact working with a third standard, the MPEG-21 Rights Expression Language [22] and its extensions, in order to also cover rights management issues.
Although it has not been the focus of this paper, we have also considered in our framework the possibility of having licenses associated with the content being distributed. Those licenses specify rights and conditions that apply to a resource for a specific user, and may be used, through an authorization process, to enforce these rights and conditions during the consumption of protected content. In [22] we have already developed some tools to create licenses, to verify them, to decide whether a specific consumption is to be authorised, and to distribute information about all events happening on the content. Apart from this, we have participated in the development of a system [23] that allows controlling the rights related to the whole life cycle of intellectual property, from its creation to the final usage. We are currently considering adapting our system to specifically handle scholarly content, which would allow authors to register their work before sending it for review or publication, to decide about the rights they want to give to their creations, and to keep control over the events related to them.

7. Acknowledgments

This work has been partly supported by the Spanish government (DRM-MM project, TSI 2005-05277) and the European Network of Excellence VISNET-II (IST-1-038398), funded under the European Commission IST 6th Framework Programme.

8. References

[1] ISO/IEC FDIS 15938-12:2008, "Information Technology — Multimedia Content Description Interface — Part 12: Query Format".
[2] Gruhne, Matthias; Tous, Ruben; Doeller, Mario; Delgado, Jaime and Kosch, Harald (2007). MP7QF: An MPEG-7 Query Format. 3rd International Conference on Automated Production of Cross Media Content for Multi-channel Distribution (AXMEDIS 2007), Barcelona, November 2007. IEEE Computer Society Press. ISBN 0-7695-3030-3. pp. 15-18.
[3] Kevin Adistambha et al. (2007). The MPEG-7 Query Format: A New Standard in Progress for Multimedia Query by Content. 7th International Symposium on Communications and Information Technologies (ISCIT 2007), Sydney, Australia, October 16-19, 2007. IEEE Computer Society Press.
[4] ISO 23950. Information Retrieval (Z39.50): Application Service Definition and Protocol Specification.
[5] XML Path Language (XPath) 2.0. W3C Recommendation, 23 January 2007. http://www.w3.org/TR/xpath20/
[6] XQuery 1.0: An XML Query Language. W3C Recommendation, 23 January 2007. http://www.w3.org/TR/xquery/
[7] Tony Hey, Herbert Van de Sompel, Don Waters, Cliff Lynch, Carl Lagoze. Augmenting interoperability across scholarly repositories. JCDL '06: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, June 2006.
[8] Open Archives Initiative. http://www.openarchives.org/
[9] Library of Congress, 2004. SRU (Search Retrieve via URL). http://www.loc.gov/standards/sru/sru-spec.html
[10] Common Query Language. http://www.loc.gov/z3950/agency/zing/cql/cqlsyntax.html
[11] J. Melton and A. Eisenberg. SQL Multimedia Application Packages (SQL/MM). ACM SIGMOD Record, 30(4):97-102, December 2001.
[12] J. Z. Li, M. T. Ozsu, D. Szafron, and V. Oria. MOQL: A Multimedia Object Query Language. In Proceedings of the Third International Workshop on Multimedia Information Systems, pages 19-28, Como, Italy, 1997.
[13] A. Henrich and G. Robbert. POQLMM: A Query Language for Structured Multimedia Documents. In Proceedings of the 1st International Workshop on Multimedia Data and Document Engineering (MDDE'01), pages 17-26, July 2001.
[14] J. Kang et al. An XQuery engine for digital library systems. In 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, Houston, Texas, May 2003.
[15] D. Tjondronegoro and Y. Chen. Content-based indexing and retrieval using MPEG-7 and XQuery in video data management systems. World Wide Web: Internet and Web Information Systems, pages 207-227, 2002.
[16] L. Xue, C. Li, Y. Wu, and Z. Xiong. VeXQuery: An XQuery extension for MPEG-7 vector-based feature query. In Proceedings of the International Conference on Signal-Image Technology and Internet-Based Systems (IEEE/ACM SITIS'2006), pages 176-185, Hammamet, Tunisia, 2006.
[17] ISO/IEC 15938 Version 2, "Information Technology - Multimedia Content Description Interface" (MPEG-7).
[18] ISO/IEC JTC1/SC29/WG11 N8220, July 2006. "Call for Proposals on MPEG-7 Query Format".
[19] Ruben Tous, Jaime Delgado. Advanced Meta-Search of News in the Web. ELPUB2002 Technology Interactions. Proceedings of the 6th International ICCC/IFIP Conference on Electronic Publishing, Karlovy Vary, Czech Republic, 6-8 November 2002. VWF Berlin, 2002. ISBN 3-89700-357-0.
[20] J. Delgado, S. Llorente, E. Peig, and A. Carreras. A multimedia content interchange framework for TV producers. 3rd International Conference on Automated Production of Cross Media Content for Multi-channel Distribution (AXMEDIS 2007), Barcelona, November 2007. IEEE Computer Society Press. ISBN 0-7695-3030-3. pp. 206-213.
[21] Jose A. Rodrigo, Tatiana Alieva, Maria L. Calvo. Applications of gyrator transform for image processing. Optics Communications, Volume 278, Issue 2, 15 October 2007, pages 279-284.
[22] ISO/IEC, Information Technology - Multimedia Framework (MPEG-21) - Part 5: Rights Expression Language. ISO/IEC 21000-5:2004, March 2004.
[23] IPOS-DS (Intellectual Property Operations System - Digital Shadow), exploited by NetPortedItems, S.L. http://www.digitalmediavalues.com




The State of Metadata in Open Access Journals: Possibilities and Restrictions

Helena Francke
Department of Cultural Sciences, Lund University, SE-223 62 Lund, Sweden
and Swedish School of Library and Information Science, Göteborg University and University College of Borås, Sweden
e-mail: helena.francke@hb.se

Abstract
This paper reports on an inquiry into the use of metadata, publishing formats, and markup in editor-managed open access journals. It builds on findings from a study of the document architectures of open access journals, conducted through a survey of 265 journal web sites and a qualitative, descriptive analysis of 4 journal web sites. The journals’ choices of publishing formats and the consistency of their markup are described as a background. The main investigation is of their inclusion of metadata. Framing the description is a discussion of whether the journals’ metadata may be automatically retrieved by libraries and other information services in order to provide better tools for helping potential readers locate relevant journal articles.

Keywords: scholarly journals; metadata; markup; open access; information access

1. Introduction

This paper will report on an inquiry into the use of metadata, publishing formats, and markup in editor-managed open access journals [1]. The open access movement endorses and is actively working towards the possibility for everyone with an Internet connection and sufficient information literacy to be able to access scholarly contributions on the Web. However, given the number of documents and services on the Web, making content available is no guarantee that it will also be found by the intended target groups. Although there are several ways for authors and publishers of open access scholarly journals to address the problem of their products “being found”, including Search Engine Optimization, many of them require a potential reader to either already be familiar with the journal or to enter a suitable search query into a search engine. The latter is presumably the most common locating tool that readers use [2]. Making the journal articles searchable through OAI-compliant repositories or library online catalogues can aid in bringing articles to the attention of potential readers, often with the additional perk that comes with positioning the articles within the context of the journal to a larger extent than is the case when individual files are found through a search engine. For small publishers of scholarly journals [3], particularly in cases where an open access journal is run on a low budget by an individual or an organization such as a university department or library [4], [5], it may be difficult to find the time and resources to promote the journal. Libraries and other information services may provide help with collecting and making available article metadata from this group of journals in order to increase their visibility. Such projects already exist, e.g. the Lund University Library’s DOAJ [6] or the University of Michigan’s OAIster [7], but these services still require input from the journals in the form of harvestable metadata.
If article metadata could be retrieved directly from the journal web sites without a need for the publishers to provide it in a specific



format, there would be better opportunity for libraries to work with publishers of small and local journals so as to help them target a world-wide audience [cf. e.g. 8]. In this paper, I will present findings concerning the use of metadata, publishing formats, and markup in editor-managed open access journals [5, p. 5] that can be of use for librarians, scholars, and computer scientists who are considering taking on such tasks. Focus in the paper is on which metadata are included and marked up in the journals; the choice of format and markup consistency are included because they constitute important prerequisites for how metadata may be reused.

2. Methodology

The data were collected through a combination of qualitative and quantitative methods. This allows for conclusions to be drawn both across journals and across the different issues and articles within individual journals. The document architectures of the journals were studied with regard to their choice of publishing format, their use of markup in cases where markup languages were used, and the marked up and visible metadata or bibliographic data included. The study looked at three levels of the journals: the start page, the table of contents pages, and the article pages. The quantitative study comprised 265 journals. The most recent issue and its first article were studied, and for some variables the first issue published online was also included. The qualitative study included four journals, which were investigated in greater detail, including all or most of the issues and a few articles for each issue. The margins of error for each variable in the statistical study were estimated with 95% confidence by using Jowett’s method [9], [10].

2.1 Journals included in the study

The focus of the study was on journals that are published by small open access publishers. These journals are often run by individuals or groups of individuals, or sponsored by universities or university libraries, and they may be termed editor-managed journals [5, p. 5] because much of the publishing work is done by editors who are subject specialists rather than professional publishers. The journals included in the sampling frame were identified through the DOAJ [6] and Open J-Gate [11] databases. The frame was restricted to journals that were peer reviewed, published their web site in one of the languages Danish, English, French, German, Norwegian, or Swedish, were open access, and could be considered editor-managed. From the sampling frame of approximately 700 journals (in spring 2006), a random sample of 265 journals was drawn. The majority of the journals in the sample, 70.2%, were published by university departments. Another 9.8% were published by university presses or e-journal initiatives, and 7.2% each by another type of non-profit organisation or under the journal’s name. English was the most common language, with 85.3% of the journals having this as their main language. The journals represented every first level subject category included in DOAJ. The four journals in the qualitative section were selected mainly because they use web technology in an innovative or interesting fashion. This was of relevance to other parts of the study than those reported in this paper. The journals were all from the humanities or education, namely: assemblage: the Sheffield graduate journal of archaeology, The Journal of Interactive Media in Education (JIME), The Journal of Music and Meaning (JMM), and The International Review of Research in Open and Distance Learning (IRRODL).

3. Results

From the study outlined above, data have been selected for presentation that concern three different areas: the publishing formats of the journals, their use of (X)HTML markup, and their inclusion of metadata. Focus is on marked up metadata included in the journal files at the various journal levels. To what extent





do editor-managed open access journals include marked up metadata, and are the text strings that are marked up in this way potentially useful for various forms of automatic collection of metadata into a system? However, marked up metadata require a file format based on a markup language of some sort. This motivates an initial look at the publishing formats used in the journals at the various journal levels. The usefulness of the metadata, as well as of other marked up text, is also to some extent restricted by how the markup has been performed. Therefore, the predictability and validity of the journal’s markup is also discussed before turning to a more thorough report of the inclusion of marked up metadata.

3.1 Publishing formats

The start page of a Web-based journal is often intended to be a mutable space where news and updates are added regularly. The page also often functions as a portal with a collection of hyperlinks to the other parts of the journal web site. It is therefore not surprising that the start pages of all the journals in the sample publish through some version of (X)HTML. Most journals also have separate table of contents pages for each issue. These pages have a higher degree of permanency than the start pages, because they are generally not updated once the issue has been published. In most cases, their primary function is to direct the visitor to one of the issue’s articles. When these pages exist separately, they are (X)HTML based, but in 5 of the journals the issue is published as a single unit in PDF or DOC, and the table of contents is placed at the beginning of that file. At the article level, the variety of file formats is much wider, but (X)HTML and PDF are by far the most common ones. As many as 67.1 to 78.1% of the journals in the population publish the articles in their latest issue in PDF, whereas between 36.6 and 48.8% of the journals use (X)HTML. The articles in somewhere around one fifth of the journals are actually made available in more than one file format, and it is often the case that both PDF and (X)HTML are used. Furthermore, the proportion of journals with PDF as the publishing format for the articles is higher in the latest issues than in the first ones, with a corresponding decline in the popularity of (X)HTML. There are many reasons that could account for why PDF has become more popular. These include a desire on the part of the journals to use a file format that indicates permanency, something that is often associated with credibility; the ease of using the same file for derivatives in several media (notably print and Web); and a wish to make things easier for readers who print the articles before reading.
Publishing format        1st issue   Latest issue
HTML (non-specified)     82          53
HTML 2.0                 2           --
HTML 3.2                 8           3
HTML 4.01 Transitional   26          29
HTML 4.01 Frameset       2           3
XHTML 1.0                18          25
(X)HTML total            139         113
PDF                      158         193
PostScript               10          9
MS Word                  5           4
RTF                      3           1
DVI                      5           5
Hyperdvi                 1           --
DjVu                     2           2
TeX                      3           3
ASCII/txt                4           --
WordPerfect              1           --
Mp3                      1           1
PNG                      1           --
EPS                      --          1

Table 1: Frequency of publishing formats in the journals, including journals that publish their articles in more than one format. First peer reviewed article in the first and most recent issue published on the journal web site.


The State of Metadata in Open Access Journals: Possibilities and Restrictions

Other file formats found occasionally at the article level are various LaTeX output formats such as DVI, Hyperdvi and TeX, as well as PostScript and DjVu. Apart from the latter format, these exist solely in journals within the areas of mathematics and computer science. A few occurrences of MS Word, RTF, and TXT were noted, and one journal – IRRODL – contained MP3 versions of some of its articles (see also Table 1). One of the consequences of the dominance of PDF and the decline of the use of (X)HTML at article level is that fewer journals provide the possibility of including marked up metadata at article level. Rather, the issue level becomes more important as a potential location for metadata, even for metadata describing an article rather than an issue. Some journals also offer a “paratext page”, generally positioned between the issue’s table of contents page and the article page, where they include non-marked up metadata (or paratexts) describing the article. This page can include information on the author(s) and the journal, and various descriptions of the article such as title, abstract, keywords, and sometimes even references. As these paratext pages are generally in (X)HTML, they can be a spot in which to also identify marked up metadata. However, a consequence of the limited use of (X)HTML at article level is that the places to look for marked up metadata in the journals vary depending on which file formats are used at which levels.

3.2 Markup

The predictability and validity of the markup of (X)HTML pages may affect the possibilities to make use of the markup in various ways. If elements are correctly and consistently marked up it is easier to identify and extract them for specific purposes. This includes identifying an article’s title through a <title> tag, finding words occurring in headings, block quotes, or image texts, and the use of XPath to locate a specific position in a document. Among the journals that were studied, very few made use of valid (X)HTML markup. Among start pages, 6.8% passed validation and the corresponding figure at the article level was 8.0%. Due to the low proportion of articles that were published in (X)HTML, this means that between 1.6 and 6.4% of all journals can be expected to publish articles with valid (X)HTML markup. It should be acknowledged that validation of the pages was made automatically, using the fairly strict W3C validator, and that no evaluation was made in the survey of the types of errors that it reported. A closer inspection of the types of errors that came up in the validation of one of the journals in the qualitative study illustrates how attempts to accommodate various (older) web browsers can cause the markup to break W3C recommendations. Thus, a conscious choice may in some cases have been made that has resulted in a minor violation of the recommendations. It was clear in the sample that a majority of the valid (X)HTML pages were found among start pages and articles where XHTML 1.0 was the HTML version used; this was the case in two thirds of the valid pages. With one exception, the remaining third of the valid pages were HTML 4.01 Transitional. A concern with validity (or the use of editor software that generates more correct markup) was thus found primarily among those web sites that use newer versions of (X)HTML. At the same time, only half of the start pages and article pages in the sample that used XHTML 1.0 had valid markup. 
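To make the extraction scenario concrete, the sketch below pulls the content of the <title> element out of a page using Python's standard html.parser module, which is tolerant of the kind of invalid markup the survey found to be common. The page string is an invented example, not taken from any of the surveyed journals.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text content of the <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Hypothetical journal page; note the parser does not require valid markup.
page = "<html><head><title>JMM 4 (2007): Contents</title><body><h3>Editorial</h3>"
parser = TitleExtractor()
parser.feed(page)
print(parser.title)  # JMM 4 (2007): Contents
```

A tolerant parser of this kind can recover the element even from pages that fail W3C validation, although, as discussed below, what the recovered string actually contains varies widely between journals.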
So far, (X)HTML validators are not intelligent in the sense of taking into account whether or not the marked up content of the elements fits the logic for which it is marked up. It is, for instance, quite possible to mark up a section of the body text as a heading, such as <h3>, and this is sometimes done in order to achieve a specific visual effect. However, if one wishes to use markup for identifying and retrieving content, it is of importance both that the markup is used for a text string of the content type indicated by that markup element and that all the content of that type is marked up with the correct element and not with other elements. For instance, if one wishes to use the element <blockquote> in order to locate and extract any block quotes in the articles of a journal, this will only be successful if block quotes have in fact been marked up as such and not as, e.g., <dir><dir><font size=-1>, and if <blockquote> has not been used





to achieve a desired visual appearance for, say, the abstracts. The markup of three types of content was studied in the survey, namely headings, block quotes, and the inclusion of alternative text as an attribute in image elements. These three types were chosen because headings are a common element on a web page and may contain terms that are significant to describe an article’s content, block quotes have close ties to the scholarly article as a genre and indicate a reference to somebody other than the article author(s), and the “alt” attribute could give an indication of what an image represents through means that are possible to use in text – rather than image – retrieval. Of these, block quotes was the element that was used correctly most often, namely in 51.3% of the cases. On the other hand, because many of the journals publish in other formats than (X)HTML and given that block quotes are less common than, for instance, headings, only between 11.3 and 20.4% of journals contain correctly marked up block quotes. Some journals that do not mark quotes using the <blockquote> element nevertheless indicate the function of the string of text by including block quote as a class, name, or ID attribute. All articles can be expected to contain headings, if nothing else then at least an article title, which would presumably be marked up as a heading of the highest degree. Just under half of the journals in the sample with articles in (X)HTML use <h> for headings, and slightly more than 40% of these journals have headings marked up according to hierarchy, beginning at the topmost level and moving downwards. A further 15.7% adhere to hierarchy but do not begin with <h1>. In total, between 7.5 and 15.3% of all the journals can be expected to use <h> to identify headings hierarchically.
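The hierarchy criterion used above can be made operational. The following sketch (the function names are mine, not from the study) collects heading levels from a page and tests whether they begin at <h1> and never skip a level on the way down:

```python
import re

def heading_levels(html):
    """Return the heading levels (1-6) in document order."""
    return [int(m.group(1)) for m in re.finditer(r"<h([1-6])\b", html, re.I)]

def hierarchical(levels):
    """True if headings start at <h1> and never skip a level downwards."""
    if not levels or levels[0] != 1:
        return False
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:  # e.g. <h1> followed directly by <h3>
            return False
    return True

print(hierarchical([1, 2, 2, 3, 2]))  # True
print(hierarchical([2, 3]))           # False: does not begin with <h1>
```

A checker of this kind would classify the 15.7% of journals mentioned above (hierarchical but not starting at <h1>) as failing the first condition while passing the second.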
The “alt” attribute to the image element – optional in earlier versions of HTML but compulsory in later versions – was included in slightly under one third of the articles that contained the <img> element; such articles were found in 75.2% of the journals in the sample publishing articles in (X)HTML. A few articles contained the “alt” attribute, but it was left without content. This means that the total proportion of journals with “alt” attributes that could be used for various purposes is between 4.4 and 11.0%. In the survey, the markup was studied in the first peer reviewed article in the most recent issue of each journal. The qualitative studies indicate that there can be large variations in how markup validity and predictability are handled between different issues of the same journal and even between articles in the same issue. At the moment, this makes the use of markup an unreliable means of identifying specific logical elements in the articles.

3.3 Metadata

The journals’ start pages, issue pages, paratext pages, and article pages contain data that describe the articles and the journal in various ways. This information can be divided into that which is marked up according to its content type and that which is not marked up but whose content type can be identified by a person or, in some cases, automatically through an algorithm that can identify such specific features as a copyright sign or a phone number. Focus here will be primarily on marked up, machine-readable metadata, which is the type most easily usable in, for instance, various projects for automated data collection (for more results concerning the non-marked up type, see [1]). The types of machine-readable metadata that will be discussed are those marked up by the <title> and <meta> elements, including <meta> elements that make use of elements from the Dublin Core Metadata Element Set. The occurrence of RSS feeds will also be briefly discussed. Three things are of particular interest in this context:

1. to what extent are various types of marked up metadata included at various levels in the journals?
2. what content is entered into the metadata elements?
3. which levels of the journal do these metadata describe (journal, issue, article)?




Type of content in the <title> element   % of journals, issue level (n=265)   % of journals, article level (n=265)   % of journals with <title>, article level (n=112)
Journal title                            38.9                                 7.6                                    44.6
No/vol. of issue                         5.3                                  --                                     24.1
“Current issue” or similar               1.1                                  n.a.                                   --
2 of the above                           40.4                                 1.9                                    n.a.
Article title                            n.a.                                 7.9                                    53.6
Name of author                           n.a.                                 2.3                                    33.9
Article title and author                 n.a.                                 3.4                                    n.a.
More or other of the above               n.a.                                 18.1                                   n.a.
Other                                    10.9                                 1.1                                    1.1

Table 2: Types of content included in the <title> element at issue and article level. In the right-most column, the composite values have been broken down into single values. [1, p. 247]

In the presentation of findings that follows, it may be good to keep in mind that the file formats that the journals use vary at the different levels. All the journals use (X)HTML on their start pages and almost all (98.1%) for the table of contents pages (the issue level). The use of (X)HTML is less common at article level, where it is found in the most recent issues of 42.6% of the journals. This means that when the article level is discussed below, only this smaller sample of (X)HTML files has formed the basis for the results. The most commonly occurring metadata type is the <title> element, which can be found at the start page and issue levels in at least 95% of the journals and at the article level in a minimum of 93% of the journals publishing in (X)HTML at this level. The journal title is the most commonly included information in the <title> element at the issue level, occurring in between 73.9 and 84.0% of the journals. Information on the issue and/or volume number, or a text that indicates that it is the “current issue”, occurs in between 40.7 and 53.0% of the journals. Both the journal title and the issue/volume number also occur fairly frequently in the <title> element at the article level – the title in just below half of the journals and the issue/volume number in about a quarter of them. Approximately as common – slightly more common in the sample, in fact – are the article title and the name of the author(s). Between 43.9 and 63.1% of the journals include the article title in the <title> element of the article files. However, at this level, it is not entirely uncommon for the <title> element to contain a number of different types of information. The figures of the most common content types are listed in Table 2.
Some variety can also be found among the words listed in <title> – many are quite generic, such as “Article/s”, “contributions”, “Mainpage”, or “Default Normal Template”, whereas others provide additional information that may be used to identify the journal, support its credentials, or advertise the journal, such as the name of the publisher or the ISSN. Very few of the <title> elements contain nonsensical text. A particular problem can be caused by journals that use frames. In many cases, frames mean that if the content of a <title> element on a page is to be used for some purpose, a decision has to be made with regard to which file (and <title> element) should be preferred over the others. Perhaps the most likely candidates are the frameset file and the file which contains the article text. However, these can have different text in their <title> elements. One of the journals in the qualitative study illustrates this, and also that there was some inconsistency in what content was included in the <title> elements of similarly positioned files in different issues (this was also the case in another of the journals in the qualitative study). The differences are by no means very large, but it is not uncommon for the content to be formulated according to varying patterns (abbreviations, notation, order, etc.) or to contain slightly different types of content. Overall, the variety of exactly what the <title> element contains is quite wide and covers many more types of content than, for instance, the main heading of the pages.






A comparison of the content in the <title> element and that marked up as DC.title (only a few journals make use of the Dublin Core title element: 12 journals at the issue level and 14 at the article level) shows that the content is similar in most cases (10 journals at the issue level, 8 at the article level). In the few other cases, the Dublin Core elements sometimes contain more precise content in the form of the article title where the <title> equivalent has more types of content, and sometimes the Dublin Core element contains generic content such as “Article”. However, since the Dublin Core title element is much less common than the <title> element, and in many cases contains the same information, it does not seem to be particularly useful to target specifically. A type of markup that is of specific interest in this case is the <meta> element available in (X)HTML, which can be used for marking up various types of metadata – in the words of the HTML 4.01 specification, “generic metainformation” [12, sect. 7.4.4]. The attributes name and http-equiv are used to describe the type of metadata (or property) that is included, and the attribute content to include the metadata text itself (the value). As the HTML specification does not restrict the properties that are possible to use, some variety in properties is likely to be encountered, but some properties have emerged as more common than others. Among the 90% of the journals in the sample that included a <meta> element, most used the technically oriented http-equiv with various properties. The details of this attribute were not included in the study. Among the properties associated with the name attribute, the most commonly used were keywords, description, and generator (see Table 3). Keywords and description, in particular, have emerged as quite frequently found on the web sites. Apart from some of the journals that include http-equiv, files often contain more than one <meta> element.
Combinations of the properties keywords, description, and http-equiv, and of http-equiv and generator, are the most common (the two latter properties are likely to be included by the software employed and seldom require the person marking up the text to fill out the values).

Type of metadata   Journal level (n=265)   Issue level (n=260)   Article level (n=113)
http-equiv         206 (77.7%)             199 (76.5%)           91 (80.5%)
keywords           98 (37.9%)              74 (28.5%)            30 (26.5%)
description        93 (35.1%)              74 (28.5%)            31 (27.4%)
generator          66 (24.9%)              71 (27.3%)            40 (35.4%)
author             36 (13.6%)              30 (11.5%)            20 (17.7%)
robots             16 (6.0%)               11 (4.2%)             4 (3.5%)
copyright          11 (4.2%)               10 (3.8%)             6 (5.3%)
title              3 (1.1%)                2 (0.8%)              5 (4.4%)
date               2 (0.8%)                2 (0.8%)              2 (1.8%)

Table 3: Types of metadata in the <meta> element, in frequency and proportion of the (X)HTML files. [1, p. 251] The discrepancy in the number of (X)HTML files at article level compared to Table 2 is due to inconsistencies in the study.

Small variations can be seen in the sample when it comes to the frequency of the various properties at different journal levels, but generally they show a similar pattern. The differences need to be treated with caution, as they are not statistically significant for the population at large. The generator property is slightly more common at article level in the sample, as is the case with author. That the author property is not more common at article level is perhaps a bit surprising, as it is generally easier to identify the particular author(s) of an article than to decide who should be listed in that position for the journal at large. The fact that the keywords and description properties are more common on the start pages than on the table of contents or article level could have to do with the fact that it is easy to enter the values of these properties once on the start page when creating the site, whereas certain routines are required if they are to be entered for each new table of contents page and article.
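As a sketch of what harvesting these properties could look like, the snippet below collects (name, content) pairs from <meta> elements with Python's standard html.parser; the <head> fragment is invented for illustration, and the keyword values mimic the alternative-spelling strategy some journals use.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects (name, content) pairs from <meta> elements."""
    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            # Keep only <meta> elements with a name attribute and a
            # non-empty content value (as in the study).
            if "name" in d and d.get("content"):
                self.meta.append((d["name"], d["content"]))

head = """<head>
<meta name="keywords" content="archaeology, archeology, journal, periodical">
<meta name="DC.Title" content="assemblage 9">
<meta http-equiv="Content-Type" content="text/html">
</head>"""
p = MetaExtractor()
p.feed(head)
print(p.meta)
```

The http-equiv element is skipped here because it carries no name attribute; a real harvester would also need to decide how to normalise property names such as DC.Title versus dc.title.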


The State of Metadata in Open Access Journals: Possibilities and Restrictions

The qualitative studies, where more attention was placed on the values included in the <meta> elements, provide examples of how journals try to compensate for the fact that general use of the <meta> element properties does not adhere to a specific vocabulary, by offering various versions of suitable keywords. Anticipated variations in how users will search for certain words with regard to number, spelling, and synonyms were met by including alternative keywords, e.g. university, universities – archaeology, archeology – and journal, periodical. Some journals also exploit the fact that search engines can as easily search through post- as pre-coordination. They include quite unexpected phrases among the keywords, phrases that one would perhaps not expect potential readers to search for but where separate terms can still be retrieved. In the fairly rare cases in the sample where the <meta> element is used to mark up a more regulated set of properties, namely those from the Dublin Core Metadata Element Set, the number of properties that are included is quite extensive, ranging from four to 14, with a median of 7 or 8 (depending on journal level). Between 3.8 and 11.4% of the journals contain Dublin Core metadata. In the sample, 18 journals were found to include this metadata type at the journal level, 17 at the issue level, and 20 at the article level. Only the Dublin Core properties that contained a value in the content attribute were included in the study. The practice of including subject (keywords) and description remains fairly strong at all levels, but even more commonly used are properties that may be easier to include (and in some cases to inherit from a template), such as DC.Type, DC.Format, and DC.Language. The Dublin Core elements are also used to indicate the originator to quite a large degree, through such properties as DC.Creator, DC.Publisher, DC.Rights, and DC.Identifier.
The only other property that occurs in more than 10 journals on at least one of the levels is DC.Title (cf. above). So far, it is mainly the types of properties included in the <meta> element that have been reported. However, as with the content of the <title> element, the <meta> elements are of little use if they do not contain values that may be used. For this reason, the quantitative study also included the various journal levels that the metadata describe. In order to discuss this, a distinction must be made between the level (journal/start page level, issue level, and article level) on which the file containing the <meta> element is placed and the level that the value of this <meta> element describes. I will refer to these as the level where the metadata is placed and the level that the metadata describes. The metadata (including Dublin Core elements) placed on the journals’ start pages generally describe the journal at large. This is, however, also very often the case with metadata found at the issue and (to a smaller extent) article levels. When metadata at each of these levels does not (or not only) describe the journal level, it describes the level on which the metadata is placed. Thus, it is very rare for metadata placed on table of contents pages to describe individual articles, and for metadata placed in the article files to apply to the issue level. In fact, as can be seen in Figure 1, it is much more common at the issue level for the metadata to describe the journal than the issue. This further supports the hypothesis that metadata that can be entered once and continue to be valid, such as metadata describing the journal level, are more commonly included than metadata that need to be updated for each new issue or article.
The fact that some cases were found where the metadata had been copied from a previous issue or article without being changed indicates that when a new file is created based on a previous issue or article file, changing the marked up metadata can easily be forgotten. One of the journals in the qualitative study included quite a few <meta> and Dublin Core elements at its various levels. With a few exceptions at the article level, however, the values of each property were the same across the three levels. The metadata in this journal are thus site-specific rather than page-specific, which influences the granularity with which one can search for content from the journal. Marked up metadata that are placed in a separate file are offered by 25 of the journals in the form of RSS feeds. This means that RSS files are available in between 5.6 and 13.6% (possibly as high as 17.8% at the article level) of the journals, at all three journal levels. Seven of these journals make use of a journal management


Helena Francke

system (either PLONE or the Open Journal System), which has presumably made the inclusion of the feed easier. RSS feeds can provide marked up metadata that can be useful for various forms of reuse. Unlike the content of the <meta> element, the content of this metadata format is also more publicly visible, which could mean that it is more carefully selected and entered.
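As a point of reference, a journal feed of the kind discussed above might look like the minimal RSS 2.0 fragment below (the journal name, URLs, and article are invented); the snippet parses it with Python's standard xml.etree to pull out the article metadata the feed carries.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical RSS 2.0 feed announcing one new article.
feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Journal</title>
    <link>http://journal.example.org/</link>
    <description>An editor-managed open access journal</description>
    <item>
      <title>Sample Article</title>
      <link>http://journal.example.org/vol3/iss1/art2</link>
      <description>Abstract of the sample article.</description>
      <pubDate>Wed, 25 Jun 2008 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(feed).find("./channel/item")
print(item.findtext("title"))  # → Sample Article
```

Because each item pairs a title and description with a stable link and publication date, a feed like this exposes exactly the per-article metadata that is often missing from the HTML files themselves.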


Figure 1: The levels of the journal described by the metadata (<meta> and Dublin Core) found in the files at the various journal levels, by number of journals. [1, p. 256]

4. Discussion and conclusions

Time is a valuable – and often scarce – resource for the editorial staff of open access scholarly journals. A likely reason for the inconsistent use of marked up metadata found in this study is the lack of routines to follow when preparing an article for publication, both when a single person is responsible for the markup and design and when several people are involved. This results in great variations in what metadata are included in the various metadata elements, as well as in how the metadata are notated and organized. The latter was shown to be the case in particular for the <title> element. As the qualitative studies illustrated, such variations occur not only between journals – where they are only to be expected – but also within journals and even within issues. Other problems that turned up in the study concern the reliability of metadata, such as when the values of the metadata elements are not updated when a new article file is created from an existing article or from a template. A certain lack of consistency was also found in one of the journals in the qualitative study that used frames, which raises the question of how to treat, and prioritize between, frames files when it comes to metadata. Thus, there are several potential problems with using existing metadata for automatic collection of bibliographic data from the journals, even in the cases where an effort has been made to include metadata elements. The great variety found in markup and metadata, both between and within journals, affects the possibilities for, for instance, libraries and other information services to retrieve data directly from the journal web sites in order to provide added value to the journals and their user communities.
At the same time, many of the journals do include metadata in the form of <title> and <meta> elements, even though only keywords and description can be said to be properties that occur reasonably often in the journals. Below, some thoughts are offered on considerations to keep in mind for individual journal publishers and the editor-managed journal community as a whole – preferably in co-operation with the library community – when trying to develop simple improvements in the form of documented routines or even more long-term guidelines for improving metadata inclusion in the journals. The ambition here has been that the development and performance of such routines should require little technological know-how. However, if more consistency and predictability can be achieved in the marked up metadata of the editor-managed open access journals, it would become more worthwhile to develop services that offer access to the journals through various forms of collections and through bibliographic control. Such initial improvement of the


The State of Metadata in Open Access Journals: Possibilities and Restrictions

metadata should be seen as a step towards the use of more advanced metadata systems, such as OAI-PMH. On the way towards the use of such systems, documented routines or guidelines can be developed that take into consideration the following aspects that emerged from the present study:

What level to describe in the metadata elements at various journal levels. At the moment, metadata placed in the table of contents and article files quite often describe the journal as a whole rather than the content of that particular file. This is particularly common at the issue level. It is often of great importance to include information about the journal not only on the start page but also in the files at the issue and article levels, in order to highlight the connection between, for instance, an article and the journal in which it has been published, but such metadata is preferably supplemented with metadata describing the content of the file in which the elements are included. Notably, many article metadata are often included on the web site even when they are not marked up. This includes the name of the author(s), article title, abstract, keywords, and date of publishing. In fact, not surprisingly, the first article in the latest issue of every journal in the survey displayed the author names and article title in the article file. Abstracts were included in 78.9% of the journals, either in the article file, on a paratext page, or on the table of contents page. The corresponding figure for keywords was 40.4% and for author affiliation 86.8%. Another property that is easily obtainable for the journal staff is the date of publishing. This suggests that these metadata are in many cases available; they are simply not included among the marked up metadata in the files.

At what journal level to place metadata describing the article. The article file seems to be the obvious place for metadata describing the article.
However, in cases where the article is published in a file format that does not easily incorporate metadata for retrieval, an option can be to introduce a paratext page: a page situated between the table of contents page and the article page. When this is done, the paratext page generally serves the purpose of providing bibliographic data about the article that can help potential readers determine whether it is relevant to download the article – possibly a practice that open access journals have inherited from closed access journals, where it is cost rather than download time that needs to be considered. Yet, the paratext page can also contain marked up metadata which can serve to direct a user to the article page itself. Another consideration is how much metadata describing the articles in an issue to include on the table of contents page. This was very rarely done in the journals in the survey. Associated with the issue of where to place metadata describing the article is the question of:

How to treat web sites with frames. In journals that use frames for displaying the web site, there are generally several options for where to place metadata that describe the article. The content of the <title> element displayed in the web browser's title bar will be that of the frameset file. As this file is most likely the same for the entire web site, it is in most cases not a likely candidate for article level metadata. A careful choice needs to be made as to where to place such metadata, taking the design of the site into account.

What metadata properties to include. It is easy to be ambitious when planning for metadata elements but sometimes difficult to maintain those ambitions in the daily work. In such cases, it is probably better to keep the number of metadata properties down and aim to update them for each new issue or article.
However, some metadata are likely to be constant from issue to issue and from article to article, mainly those that concern the journal level and more technical aspects, such as file format and encoding. If the markup is copied from one article to the next, such metadata can remain unchanged. Among the journals in the survey, keywords, description, and author were among the more common <meta> element properties; title and date were much less frequent. Keywords and description are fairly established properties, but if one wishes to include more properties, there could be reason to use the Dublin Core elements in order to achieve consistency in property names. The Dublin Core Metadata Element Set, still very seldom used among the open access journals, supplies a standardized set of properties that may be beneficial, including the possibility of qualifying such ambiguous properties as "date".
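As an illustration of the consistency argument, the sketch below renders a small set of article-level properties as HTML <meta> elements. The property names follow the Dublin Core element set, with "DC.date.issued" standing in for a qualified "date"; the values and URL are invented for the example.

```python
# Hypothetical article-level metadata. Property names follow the Dublin
# Core element set; "DC.date.issued" shows a qualified "date" property.
article = {
    "DC.title": "Sample Article",
    "DC.creator": "A. Author",
    "DC.subject": "metadata; open access journals",
    "DC.date.issued": "2008-06-25",
    "DC.identifier": "http://journal.example.org/vol3/iss1/art2",
}

def to_meta_elements(properties):
    """Render property/value pairs as HTML <meta> elements, one per line."""
    return "\n".join(
        '<meta name="%s" content="%s">' % (name, value)
        for name, value in properties.items()
    )

print(to_meta_elements(article))
```

Generating the elements from a single list of agreed property names, rather than hand-editing each page, is one low-tech way to keep property names consistent across issues.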

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008


How to achieve consistency in the metadata element values. Related to the question of how to find consistency in the choice of name for various properties is that of achieving consistency in the element values. There are two dimensions of interest here: how to be consistent in the type of metadata that are included in an element vs. how to be consistent in the notation of the element value, and consistency within a journal vs. consistency across journals. That the issue of what type of content to include in an element is difficult is illustrated by the great variety found in the content of the <title> element. It is also pointed out by Robertson and Dawson [8, p. 68] that, of the four journals they worked with, there were three different interpretations as to what should be the value of the DC.Relation property. Simple documentation of routines can help make both the type of content and its notation and organization more consistent across all new pages of a journal web site. If there is time to go over existing pages to align them with the guidelines outlined in the documentation, the web site as a whole will be more useful. One of the greatest challenges is to achieve such consistency across a number of journals while keeping the work both technologically simple and time efficient. At the same time, cross-journal consistency is only interesting if the machine-readable metadata are used, that is, if there is some benefit to be had from consistency. This is where journal editors and librarians/information specialists can work together to add value to and support services that increase the findability of open access journals published by small publishers. Creating basic guidelines for the inclusion of marked up metadata is one way to begin such collaboration, but as with all things that require some form of performance, there also needs to be a reward, a reason for putting in the work.

5. Notes and References

[1] This paper builds on data that were collected as part of my dissertation work, which was reported in FRANCKE, H. (Re)creations of Scholarly Journals: Document and Information Architecture in Open Access Journals. Borås, Sweden: Valfrid, 2008. Also available from: <http://hdl.handle.net/2320/1815/> [cited 10 May 2008].
[2] HAGLUND, L. et al. Unga forskares behov av informationssökning och IT-stöd [Young Scientists' Need of Information Seeking and IT Support] [online]. Stockholm, Sweden: Karolinska Institutet/BIBSAM, 2006. Available from: <http://www.kb.se/BIBSAM/bidrag/projbidr/avslutade/2006/unga_forskares_behov_slutrapport.pdf> [cited 19 April 2007].
[3] The term scholarly is used in this paper to cover contributions from both the scholarly, scientific, and technological communities.
[4] HEDLUND, T.; GUSTAFSSON, T.; BJÖRK, B.-C. The Open Access Scientific Journal: An Empirical Study. Learned Publishing. 2004, vol. 17, no. 3, pp. 199-209.
[5] KAUFMAN-WILLS GROUP. The Facts about Open Access: A Study of the Financial and Non-financial Effects of Alternative Business Models for Scholarly Journals [online]. The Association of Learned and Professional Society Publishers, 2005. Available from: <http://www.alpsp.org/ForceDownload.asp?id=70> [cited 24 April 2007].
[6] The Directory of Open Access Journals is provided by Lund University Libraries at <http://www.doaj.org/>.
[7] OAIster is provided by the University of Michigan at <http://www.oaister.org/>.
[8] ROBERTSON, R. J.; DAWSON, A. An Easy Option? OAI Static Repositories as a Method of Exposing Publishers' Metadata to the Wider Information Environment. In MARTENS, B.; DOBREVA, M. ELPUB 2006: Digital Spectrum: Integrating Technology and Culture – Proceedings of the 10th International Conference on Electronic Publishing held in Bansko, Bulgaria 14-16 June 2006 [online]. pp. 59-70. Available from: <http://elpub.scix.net/data/works/att/261_elpub2006.content.pdf> [cited 12 January 2008].


[9] JOWETT, G. H. The Relationship Between the Binomial and F Distributions. The Statistician. 1963, vol. 13, no. 1, pp. 55-57.
[10] ELENIUS, M. Några metoder att bestämma konfidensintervall för en binomialproportion: en litteratur- och simuleringsstudie [Some Methods for Determining Confidence Intervals for a Binomial Proportion: A Literature and Simulation Study]. Göteborg, Sweden: Department of Economics and Statistics, Göteborg University, 2004. C-essay in Statistics.
[11] Open J-Gate is provided by Informatics India Ltd at <http://www.openj-gate.com/>.
[12] RAGGETT, D.; LE HORS, A.; JACOBS, I., Eds. HTML 4.01 Specification: W3C Recommendation 24 December 1999 [online]. W3C (World Wide Web Consortium), 1999. Available from: <http://www.w3.org/TR/html4/> [cited 9 May 2008].


Establishing Library Publishing: Best Practices for Creating Successful Journal Editors

Jean-Gabriel Bankier1; Courtney Smith2

The Berkeley Electronic Press, 2809 Telegraph Avenue, Suite 202, Berkeley, CA
e-mail: 1jgbankier@bepress.com; 2csmith@bepress.com

Abstract
Library publishing is a hot topic. We compiled the results of interviews with librarians and editors who are currently publishing journals with the Digital Commons platform. While the research and illustrations in this paper originate from Digital Commons subscriber interviews, we think the lessons and trends we've identified can serve as a roadmap for all librarians looking to provide successful publishing services to their faculty. Successful journal publishing appears to rely greatly upon the librarian hitting the pavement and promoting. The librarian must be ready to invest time and commit to a multi-year view. With support and encouragement, faculty will begin journals. The librarian can then use these early successes as showcases for others. While the first editors get involved in publishing because they believe in open access or are looking to make a mark, for future editors the most powerful motivator is seeing the success of their peers. Publishing becomes viral, and the successful librarian encourages this.

Keywords: University as a publisher of e-journals; journal publishing in an institutional repository; road map for library publishing; open access; Digital Commons; university as publisher; library as publisher

1. Introduction

A survey of the current literature on electronic academic publishing shows scholars are rapidly going digital. Commercial publishing is losing its stranglehold on the dissemination of scholarly communications, and the commercial publisher is no longer considered part of the vanguard. Rather, it is becoming apparent that as journal editors "go digital", they are looking to the university for consulting and publishing support. The recent report "Research Library Publishing Services," published by the Association of Research Libraries' Office of Scholarly Communications, showed that 65% of responding libraries offer or plan to offer some form of publishing support, using editorial management and publishing systems including OJS, DPubS, homegrown platforms, and our own institutional repository platform, Digital Commons.[1] We at the Berkeley Electronic Press (bepress) are witnessing a groundswell of interest in publishing with the library – an average of five new journals are being created each month with Digital Commons. Our librarians are excited, and we are too. Library publishing is the hot new topic. We've seen several reports over the last year that address the library's emerging role as publisher.[2] But, to date, we haven't seen much work on best practices for successful library publishing initiatives, so we started asking: How does the library do it? We compiled the results of interviews with librarians and editors who are currently publishing with the Digital Commons platform, and drew conclusions about the best practices of librarians who drive successful library publishing programs. While the research and illustrations in this paper originate from Digital Commons subscriber interviews, we think the trends we're seeing can be applied widely. In the following paper, we share lessons about how best to engage existing editors in library publishing and entice or support prospective editors to "jump in".

As a professional publisher ourselves, bepress has worked with hundreds of editors. From the outset,



most know that becoming an editor will take tremendous energy and work. Creating and editing a journal is a huge investment of time for editors and a commitment to their discipline and to their early contributors. They bring a passion for the field and a desire to create a community for others who share their passion. But they often don't know where to turn for help in getting started. Despite the findings of the recent reports, we have found that many scholars do not automatically think "library" when they want to publish digitally. Even after learning about journal publishing services, faculty sometimes question whether publishing is a core competency of their library. Where does this leave library publishing? It tells us that faculty need the following: 1) to know the library is available and can offer the services they need; 2) reassurance that the library is a partner with proven success; and 3) certainty that the library can be a successful publisher. We introduce the paper with three general observations – we call them "truths" about library publishing. Next, we discuss key services the library needs to offer editors in order to encourage journal set-up and help them achieve long-term sustainability. Finally, we discuss the importance of creating a showcase that reflects the publishing expertise of the library, as well as the quality of library publications and, by extension, the editors. We close with thoughts about growing the service of library publishing and the viral nature of faculty engagement.

2. Two Hard Truths and One Easy Truth About Establishing Library Publishing

The first truth: librarians must maintain a long-term view. Journals don't just happen with a snap of the fingers. As Ann Koopman of Thomas Jefferson University explained, her boss supported her in taking "the long-term view" because campus-wide investment in library publishing usually takes three to five years to establish. Starting new journals requires a cheerleader who promotes library publishing for as long as it takes to get faculty talking about it. Librarians who are ready to embark on a library publishing initiative must adopt Koopman's long-term view and be prepared to spend significant time developing a suite of sustainable journals.

The second truth: the first journal is the hardest. The first journal rarely, if ever, comes to the librarian. Instead, the librarian must seek out publishing opportunities by hitting the pavement and doing some good old-fashioned face-to-face networking to find the faculty who are ready to publish digitally.

Which brings us to the third truth: it gets easier – much, much easier, in fact – to bring on new journals once the librarian is able to showcase initial successes. The first takers publish because they see themselves as forward thinkers and open access advocates. But most scholars are simply persuaded by the success of their peers. Once the library has helped establish three or so publications, librarians describe a transformation. Events unthinkable early in the period of journal recruitment become second nature to faculty and students. Librarians begin to watch the publishing craze catch on. Marilyn Billings, Scholarly Communications and Special Initiatives Librarian at UMass-Amherst, says that after three years, she is no longer the primary publicizing force for ScholarWorks.[3] She finds that faculty and students, including the Dean of the Graduate School, the Vice Provost for Research, and the Vice Provost for Outreach, are now doing the publicizing for her.

Of course, these truths still raise the question: how does the library actually establish itself as publisher? Well, here is what we've found.

3. Getting Editors Started

When it comes to establishing a digital publishing system, Ann Koopman considers the librarian's role threefold. The librarian is, or can be: all-around promoter; provider of clerical support; provider of technical


support. To put it another way, a library publishing program requires a software platform with technical support, a support system for faculty authors and publishers, and a cheerleader to get them excited and involved. For library publishing to achieve success quickly, we now know that it must have an evangelist – a "librarian as promoter" at the helm, who is truly dedicated to growing it from the grassroots level by getting out and talking to people about it.

When Marilyn Billings unveiled UMass-Amherst's ScholarWorks IR and publishing platform, she did so with gusto and a special flair for knowing how to throw a party. Billings chose to introduce the new IR at a Scholarly Communications Colloquium sponsored by the University Libraries, Office of Research, Graduate School, and Center for Teaching. She introduced it with a show tune (she's a singer as well as a librarian), a slam-bang virtual curtain drawing, and a bit of digital bubbly – "three cheers for ScholarWorks!" The chancellor, who burst out laughing during the unveiling, became a staunch supporter from that moment on. Billings's unveiling is a lesson in the importance of brewing campus excitement. Billings notes that once she started talking, everybody started talking, and soon (that is to say, soon in library time, i.e. three years) scholars started asking to publish within the library.

Western Kentucky University's Scott Lyons also recognizes that building excitement is the first way to build investment. In addition to personally signing up new reviewers at regional sports medicine conferences for his journal, the International Journal of Exercise Science,[4] he is currently planning a kick-off celebration for the journal's editorial board at the American College of Sports Medicine's Sports Medicine Conference in Indianapolis. Once librarians have created initial awareness and excitement, how do they build campus-wide investment?
The librarians we spoke with consistently recommended that new publishing programs seek out "low-hanging fruit", in the words of Sue Wilson, Library Technology Administrator at Illinois Wesleyan University. Faculty who publish digital, open access journals regard themselves as forward-thinkers, publishing electronically in order to incorporate multimedia content, increase the rate of knowledge production, and enhance access to scholarship. So to find this "low-hanging fruit", librarians often seek out one or more of the following: proponents of open access, young scholars looking to make their mark, faculty who use journal publishing as a pedagogical tool, faculty who care greatly about self-promotion, and/or editors whose journals are languishing, usually due to funding concerns. Once librarians have the fruit in sight, they still must be able to reach the faculty on faculty's terms – to "close the deal", if you will – by eliminating the barriers to going digital.

New editors, as well as established editors seeking to transition paper journals, ask for a sustainable infrastructure and an established workflow. In the case of open source software, the infrastructure is set up by the library or the Office of Information Technology. In the case of hosted software, the technical infrastructure is maintained either at an hourly consulting rate or, as is the case with Digital Commons, the host provides both set-up and ongoing, unlimited technical support. Whatever the library's choice of platform, it benefits from having established a training program and peer-review workflow, so that when editors are ready to begin, start-up is quick. The idea for a new journal can come from anywhere at any time. The library must be able to say, "I can help you with that." The library, in short, will want to strike while the iron is hot.
Connie Foster, Professor and Serials Coordinator in the Department of Library Technical Services at Western Kentucky University, saw this first-hand when Scott Lyons and his colleague James Navalta began the International Journal of Exercise Science. Though the idea of starting his own journal had been germinating for a long time, Navalta did not seize the opportunity until the day Lyons, frustrated by the protracted submission and review process of paper journals, turned right into Navalta's office instead of left into his own. As Lyons tells it, he marched into Navalta's office, threw up his hands, and asked, "James, don't you ever just want to start your own journal?"



"As a matter of fact," Navalta replied without pause, "I do." They are now growing the journal by traveling to conferences and soliciting submissions from their network of colleagues. The journal is student-focused – that is, an undergraduate or graduate student must either author or co-author a paper for it to be submitted.

In addition to developing new titles, librarians cull from the well of established print journals that are looking to transition to hybrid paper-electronic publications or go fully digital. These editors are enticed by the opportunity to reach a much larger audience, and by the time saved in the editorial management process. Faculty members at Boston College are well-versed in both paper and electronic publishing. The experience of Alec Peck, associate professor at the Lynch School of Education, speaks to this. He maintains a print journal, Teaching Exceptional Children, and an electronic journal, Teaching Exceptional Children Plus (TECPlus),[5] which he originally chose to establish in order to supplement the print with content like podcasts, video, and hyperlinks. He notes to Mark Caprio, BC's eScholarship Program Manager, that the time it takes him to work through a full editorial cycle for the digital journal is at least half that of the print cycle. Doug White, professor of anthropology at the University of California, Irvine, shares a similar perspective. He is founding editor of the e-journal World Cultures[6] as well as founder and editor of Structure and Dynamics.[7] He began his first electronic anthropology journal in 1985, publishing on 5 ¼" floppies, and edited paper journals before that. White, a strong proponent of open access publishing, says, "My publication output has roughly doubled because the journal is easy for me to manage." Editing a journal is, by all accounts, a huge time investment; libraries that can offer time-saving workflow solutions make the scholar's decision to invest easier.
Editors expect not just a publishing plan but also the support of the library staff, either to train them on a software system or to act as coordinator between them and hosted IT support. The value of face-to-face support is considerable, and here is where the librarian fills his or her second role – that of facilitator, or in Koopman's words, the "clerical role". In the role of facilitator, librarians support scholars by applying to aggregation and indexing services when the time is right, as well as by ensuring that publications receive an ISSN, that metadata is entered and formatted correctly, and that issues are archived. They may also be called upon to perform mediated deposits when a faculty member does not want to learn publishing software. The librarian, first a promoter, next becomes facilitator, helping faculty manage and publish original content.

The librarians we spoke to have the promoter and clerical roles covered – and if their excitement to share their success is any measure, they clearly enjoy them. So how do they find time for the technical role as well? Admittedly, they don't: this is a two-person job. Libraries choose Digital Commons partially because they want their librarians to do the work of building the publishing program and supporting scholars; librarians can only accomplish these two tasks if they are freed from ongoing technical support. Marilyn Billings points out that after spending a six-month sabbatical researching IRs and publishing platforms, one of the reasons UMass-Amherst chose Digital Commons was that they "felt it was more important to do the marketing and the education than spend time on technical concerns." The changing role of Mark Caprio at Boston College speaks to this as well. As he put it, "Well, I have the time to go out and see who else is doing stuff."

4. Sustaining Publishing

The first journal is the hardest. As is, perhaps, the second, given that library publishing is relatively new and many would-be participants are still wary. A newly launched journal that flounders can diminish rather than strengthen the chance that the library will ultimately succeed in its mission to become a


publisher. Demonstrated sustainability is needed not just for the success of the journal but for its potential to influence prospective editors. In order to commit to editing a journal, scholars need to be reassured that the library has proven methods to ensure the journal's success. Libraries can do this by providing download reports, optimizing discoverability, and branding the full text.

Library publishing is effective insofar as it is able to maintain and increase readership. We found across our interviews that generating automatic download reports validates editors' and authors' efforts. Each month, authors receive their readership in total downloads for each article within the Digital Commons system. In the former days of paper, editors, authors, and libraries had no way to assess the impact of publications – that is, the total readership for any given article or journal. Now, authors and editors can assess, in real time, the impact of their research, and can use download and citation statistics in funding applications and review processes. Giving contributors feedback on the dissemination and downloads of their work creates excitement and a sense of investment in the journal and the publishing process. Authors are encouraged to submit other pieces of research and encourage their peers to do the same. Moreover, when an institution can statistically verify its impact, it is more likely to continue to support publishing endeavors. The library's publishing system may or may not automatically generate download reports. If it does not generate them automatically, then the librarian or the editor should consider this an essential task to perform manually. Another way to provide valuable feedback is to show editors and authors their rank in Google search results and citations across the web. Doug White used both download counts and citation counts of the first issue of Structure and Dynamics to demonstrate initial success.
He writes, “This [the numbers] reflects positively on quality of the articles, made possible in turn by the high quality and incisiveness of reviews, the number and diversity of reviewers who have responded, and selection for quality in article acceptance and reviewers.”[8] Editors use download reports to assess the impact of research, as well as identify the content most valuable to a journal’s constituency. Ann Koopman, editor of Jeffline[9] and manager of the Jefferson Digital Commons[10], tells a similar story about Thomas Jefferson University’s Health Policy Newsletter[11], which utilizes download reports to identify the topics readers find most compelling. After uploading back content from 1994 to the present, the editors now track article downloads on a quarterly basis. In analyzing the numbers, they can pinpoint the areas of research where readers show the most interest, and shortlist these topics for more in-depth coverage in future issues. Richard Griscom, Head of the Music Library at UPenn and former IR manager, discusses the disproportionate success of the institution’s undergraduate journal, CUREJ: College Undergraduate Research Electronic Journal. During September 2007, CUREJ documents made up a little over 2% of all the content in the repository, but they made up over 10% of the downloads[12]. Analyzing download statistics allows an institution to assess the impact of various scholarly endeavors and focus resources where they are most valuable. As Griscom tells it, these statistics encouraged other groups to approach him about creating various publications within UPenn’s ScholarlyCommons[13]. Clearly, wide readership encourages authors and serves as a reflection of a successful library publishing program. Since it is in the library’s best interest to facilitate the widest possible dissemination of its institutional publications, it must optimize the publications for search by Google and Google Scholar. 
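To make the CUREJ comparison concrete: a collection holding 2% of a repository’s documents but drawing 10% of its downloads averages five times the repository-wide downloads per document. A minimal sketch, using only the shares quoted above (absolute counts are not given in the source):

```python
def per_doc_download_ratio(content_share: float, download_share: float) -> float:
    """How many times the repository-average downloads each document in a collection receives."""
    return download_share / content_share

# CUREJ, September 2007: roughly 2% of repository content, over 10% of downloads.
print(round(per_doc_download_ratio(0.02, 0.10), 2))  # 5.0
```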
Librarians as publishers must ensure that their journals are optimized for Google, using identifiers to provide Google with easy access to content. Editors and authors who publish within Digital Commons have their articles full-text indexed by Google and Google Scholar, and made highly discoverable to other search engines. An independent professor posting work on his or her website likely does not know the ins and outs of search engine discovery, whereas a technical team has the knowledge and time to develop a format that maximizes discoverability. By structuring the underlying code in the appropriate way, a web


Establishing Library Publishing: Best Practices for Creating Successful Journal Editors

development or design team can ensure that search engine crawlers can discover it by citation data, abstract, or words in the full text. As a publisher of academic journals online, our data on readership referrals shows that 80% of readers arrive at journal articles straight from Google, without traveling through the journal’s homepage; in that case the download still registers in the report, but the reader may not affiliate the content with the journal or the publishing institution. We’d like to share an approach to this issue. Digital Commons can automatically stamp all PDF articles with cover pages, which bear the journal name and/or the publishing institution’s name, as well as the key article metadata. These pages are produced by title-page-generating software incorporated into the Digital Commons platform. There is a lot of junk on the Web, and the journal or university’s stamp on the cover page tells the reader that this is content from a reputable source. The cover page acts as a signal of quality.
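As an illustration of what “structuring the underlying code” can mean in practice: crawlers such as Google Scholar look for bibliographic meta tags on an article’s landing page. The sketch below emits Highwire-style citation_* tags; the tag names follow Google Scholar’s published inclusion guidelines, but the article data and URL are hypothetical, and the actual Digital Commons implementation may differ.

```python
# Sketch: emit the kind of bibliographic <meta> tags that help crawlers
# associate a full-text PDF with its citation data. Article data is invented.
from html import escape

def citation_meta_tags(article: dict) -> str:
    """Render Highwire-style citation meta tags for an article landing page."""
    tags = [("citation_title", article["title"]),
            ("citation_journal_title", article["journal"]),
            ("citation_publication_date", article["date"]),
            ("citation_pdf_url", article["pdf_url"])]
    # One citation_author tag per author, inserted in order after the title.
    tags[1:1] = [("citation_author", a) for a in article["authors"]]
    return "\n".join(
        f'<meta name="{name}" content="{escape(value)}">' for name, value in tags)

example = {
    "title": "Establishing Library Publishing",
    "authors": ["Bankier, Jean-Gabriel", "Smith, Courtney"],
    "journal": "Proceedings ELPUB 2008",
    "date": "2008/06/25",
    "pdf_url": "https://example.edu/journal/vol1/iss1/1.pdf",
}
print(citation_meta_tags(example))
```

The same metadata can also be exposed through OAI-PMH or embedded abstracts; the point is simply that discovery by “citation data, abstract, or words in the full text” depends on machine-readable structure, not on the visible page design.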

Figure 1: The Macalester Islam Journal – a stamped cover page on the PDF article, produced by title-page-generating software.

5. Extending the Publishing Model

Some of the things that editors and libraries are doing, we expected. For instance, we expected journals with a paper history to publish their back content online. But we were surprised by many of the ways libraries and editors are pushing the limits of our current conception of “digital publishing”. As the hub of an institution’s scholarly communications, the library is in a unique publishing position. Scholars take advantage of this position to create a “context” or a “scholarly environment” for one or more journals. At UMass-Amherst and McMaster University we are seeing scholars use library publishing to synthesize various content and resources within and outside of the university. Take, for instance, Rex Wallace, a linguist and one of the few Etruscan-language scholars in America. During Marilyn Billings’ and Rex Wallace’s first conversations about the UMass-Amherst ScholarWorks repository, Billings discovered that Wallace had a database of arcane Etruscan inscriptions without a home. Wallace wanted to house these inscriptions where they could be freely accessed by the scholarly community, but also wanted a location that would act as a “springboard” to bring users to the Etruscan Texts Project and the Poggio Civitate Excavation Archive, both of which are housed within the Classics department.


The pair used this opportunity to create a Center for Etruscan Studies within the repository, an idea that Wallace had been chewing on for some time. Soon after, Wallace and Tuck, an archaeologist also at UMass-Amherst, decided to extend this center by creating the journal Rasenna.[14] Months later, Tuck was at a meeting of the Etruscan Foundation at the annual convention of the American Institute of Archaeology. The Etruscan Foundation had been publishing a well-known paper journal, Etruscan Studies, for over ten years, but noted that it was difficult for many scholars in the field to get easy access to the content. As Wallace tells it, Tuck showed off the Center and Rasenna, and the members, who got very excited about the prospect of making the Etruscan Studies content more widely accessible, started talking about publishing the back content online. As Billings tells the same story, “After this presentation in Chicago some of these things become really self-evident, he showed them off, and something clicked.” Wallace, Tuck, and Lisa Marie Smith are now in the process of developing a digital version of Etruscan Studies,[15] a sibling journal to Rasenna. They are creating it, Wallace says, out of a desire to make the back content “accessible to the field of Etruscan scholars.” His next goal is to position UMass-Amherst’s Center for Etruscan Studies as the place to go in America for Etruscan Studies – “a sort of clearinghouse for the field,” he says. Editors and librarians are discovering that library publishing offers the potential for an integration of content types – the ability to create what has been alternately called a “context” or a “scholarly environment” for a journal. Wallace calls it an umbrella. He speaks of Rasenna’s creation in these terms: “The e-journal dovetailed with things Marilyn [Billings] and I had been talking about for years.
We saw it as a way to bring all the diverse programs we’re working on together under one umbrella.” In the same way that Wallace and Tuck are fashioning the Center for Etruscan Studies and all its associated parts into a “clearinghouse”, the Russell Archives at McMaster University is in the process of creating its own digital presence, under the direction of Kenneth Blackwell. Kenneth Blackwell is Bertrand Russell’s archivist and has been the editor of Russell: the Journal of Bertrand Russell Studies since it began in 1971.[16] Dr. Blackwell was persuaded by the library at McMaster University to digitize all of the back issues and bring his journal online. While the most current years (2004-2007) are available by subscription only, he made past issues (1971-2003) openly available to all. It wasn’t long before dozens of Russell-related texts were added, turning the site from a journal into a virtual Bertrand Russell center. The Russell Center is beginning to extend the journal content itself with rare leaflets, notes on his readers, copies of his personal letters and interviews. What does it mean, though, to create a “context” or a “scholarly environment”? In an effort to elucidate this concept, we have identified key practices that are features of library publishing and components of developing a scholarly environment for a journal.

Publishing Back Issues: Creating historical continuity is important in establishing an e-journal that has transitioned from paper. Many Digital Commons journals are using the system not only to publish going forward, but to publish back content as well. Editors, like those of Nsodia/Contributions in Black Studies, World Cultures, PLACES,[17] and Etruscan Studies, invest time in digitizing and publishing the back content, making what was originally paper-only and available to a few accessible to all scholars in the field.

Grouping Diverse Content Types: Faculty members are taking advantage of e-publishing to archive and link to many different content types. For example, Alec Tuck, editor of Boston College’s Teaching Exceptional Children Plus, regularly includes links to supplemental podcasts and video. Bond University digitally publishes the journal Spreadsheets in Education[18] precisely because the topic of its study requires additional materials conducive only to the electronic format.




Creating Families of Journals: UMass-Amherst is currently creating two sets of sibling journals: Contributions in Black Studies and Nsodia, as well as Rasenna and Etruscan Studies. As the library develops the system, it can facilitate browsing and searching – across families or across all library-published journals – and provide a single, integrated look and feel.

Publishing Cross-Departmentally and Campus-Wide: Librarians are also able to maintain continuity of publication, whether it is cross-departmental, cross-disciplinary, or campus-wide. Sue Wilson, for example, is pushing Illinois-Wesleyan’s campus-wide magazine to go digital. She sees the repository as allowing them to move “out of the disciplinary and into the university wide content.” Terri Fishel, Library Director at Macalester College, did not lose the Macalester Islam Journal when the editor, a professor in the Religious Studies department, left the school. Rather, she has found it a new home and it will begin publishing again under the editorship of a professor in the newly-established Middle Eastern Studies program. The flexibility of library publishing ensures continuity – it accommodates both the changing nature of disciplines and departments.

6. Making the Journal a True Showcase

To recap: the library has excited and engaged its faculty with the publishing program by offering the publishing services and support that faculty need. Editors are invested in their new journal ventures, and the library is helping them to expand the publishing model and achieve success. So where does the librarian go from here? We find that successful librarians get back out and continue to network, this time with successes in hand. It’s that simple: show off success. The librarian gets early adopters, he or she shepherds the first journals to success, and then, as Ann Koopman explained, “Once you’ve got a few and you show them around, they just come like dominos.” Why is this? As we observed before, most scholars are persuaded by the success of their peers. With success in hand, the librarian is now able to demonstrate that the library is a committed, knowledgeable provider of on-campus journal publishing solutions. Prospective editors will recognize that the library can provide the services they need to begin new online journals or transition existing print journals to digital. The journal is a reflection – a showcase, even – of its editorial board. Our editors want their journals to be as visually-compelling as traditional paper journals – and they want them to look good both on screen and in print. We’ve learned from experience that, to the editor, the librarian, and the readers, design matters. We have learned that successful journal sites have a “look” as compelling as commercial journals, and a “feel” that is clean, easy to read and navigate, and demonstrates a coherent logic. As a small publisher we have worked hard on the presentation and design of our family of journals. Our journals have been designed by award-winning professional web designers, and we would like to share some best practices. We present content with key aspects of visual harmony and readability in mind. 
Our Digital Commons journal pages were built upon concepts of the Golden Ratio and natural mapping, and use grid-based designs to both focus attention on the content and make that content as easily accessible as possible. We find that the little things matter: we always showcase new content from the journal’s homepage, with title, editor and author names given primary focus. Believing that access is primary, we even position the full-text PDF icon to be the first thing the eyes meet when reading left to right. We ensure on-screen readability by designing with attention to optimal line length, spacing, white space, and harmony of color. Because users are drawn to order, alignment and consistency, we have designed the journals to integrate smoothly by providing continuity of design and navigation.


Figure 2: Illinois-Wesleyan University’s journal Res Publica, designed using the Golden Ratio.

We also think it is necessary to consider what a digital object will look like in its printed form: when DC journal home pages and article pages are printed, they are rendered intelligible, without hyperlinks and other digital goodies irrelevant to the print format. We have also worked hard to maintain the important vestiges of print journals – down to serif fonts and continuous pagination. And as we mentioned before, since many readers find content through Google without traveling through an institutional portal, we make sure to stamp every article with a cover page branding it as the institution and author’s own. A picture is worth a thousand words – beautiful, simple-to-navigate journal designs inspire other faculty members to take the leap. The excitement of good looks and good feel lends itself to the establishment of the library publisher, and it is the final key in getting publishing to “go viral”.

Figure 3: Western Kentucky University’s International Journal of Exercise Science. Article page – online view. Compare with print view, Figure 4.



Figure 4: Same article page as Figure 3, in print view.

Our librarians’ excitement is contagious – in a good way. As we mentioned before, once they start talking, everybody starts talking, and soon, scholars start asking to publish with the library. Boston College’s Mark Caprio recognizes that library publishing catches on only when scholars see their respected peers engaging in it and finding success. And recently, UMass-Amherst grad student Ventura Perez approached Marilyn Billings about creating a social justice conference, Landscapes of Violence, hoping to have the library publish the conference proceedings. Soon, he had decided to also start a journal of the same name, the first issue of which will publish the best conference presentations as scholarly articles.

7. Conclusions

So what do Digital Commons librarians do once they’ve relinquished the role of tech support, and eased up on the cheerleading? They generally take on a high-level administrative role and continue to identify key content. Connie Foster calls her role that of “overarching coordinator”. She says, “Now, thankfully, when a journal or series is created, we [at the library] don’t have to get directly involved in the management of it. Once we know a dedicated faculty member is in charge, the library’s role is to make sure communication goes well. We set up the training for our editors, we coordinate and we troubleshoot.” She goes on to identify the library as “the central contact point, but not the day to day manager.” As Foster wrote in a follow-up email, “Seize every opportunity!” Because there is always more original content to discover, by and large, DC librarians now get to go out and see who is “doing stuff”. Foster, for example, is out finding more original content on campus and she, like many others, now shares stories of serendipitous discoveries. For instance, Foster recently attended an emeritus luncheon where the provost handed out photocopies of early WKU essays compiled by the president in 1926. Once she saw them, she decided to publish them online as the library’s first project under Presidential Papers.


Successful journal publishing appears to rely on hitting the pavement and promoting. The librarian must be ready to invest time and commit to a long-term view. With support and encouragement, faculty will begin journals. The librarian uses these as a showcase for others, and lets design and success speak for themselves. While the first editors get involved in publishing because they believe in open access or are looking to make a mark, for future editors the most powerful motivator is seeing the success of their peers. Publishing becomes viral, and the successful librarian encourages this. As Marilyn Billings says, “I no longer have to talk about it – they all do!”

8. Notes and References

[1] Available at: http://www.arl.org/bm~doc/research-library-publishing-services.pdf
[2] These reports include: the ARL report; the Ithaka Report by Laura Brown, Rebecca Griffiths, and Matthew Rascoff, “University Publishing in a Digital Age.” Available at: http://www.ithaka.org/strategic-services/Ithaka%20University%20Publishing%20Report.pdf; and Catherine Candee and Lynne Withey’s “Publishing Needs and Opportunities at the University of California.” Available at: http://www.slp.ucop.edu/consultation/slasiac/102207/SLASIAC_Pub_Task_Force_Report_final.doc
[3] http://scholarworks.umass.edu/
[4] http://digitalcommons.wku.edu/ijes/
[5] http://escholarship.bc.edu/education/tecplus/
[6] A print journal transitioning to digital. The electronic version is currently in demo mode.
[7] http://repositories.cdlib.org/imbs/socdyn/sdeas/
[8] White, Douglas R. and Ben Manlove. “Structure and Dynamics Vol.1 No.2: Editorial Commentary.” Structure and Dynamics, Vol. 1, Iss. 2, 1996. Available at: http://repositories.cdlib.org/cgi/viewcontent.cgi?article=1050&context=imbs/socdyn/sdeas
[9] http://jeffline.jefferson.edu/
[10] http://jdc.jefferson.edu/
[11] http://jdc.jefferson.edu/hpn/
[12] DeTurck, Dennis and Richard Griscom. “Publishing Undergraduate Research Electronically.” Scholarship at Penn Libraries, Oct. 2007. Available at: http://works.bepress.com/richard_griscom/6
[13] http://repository.upenn.edu/
[14] http://scholarworks.umass.edu/rasenna/
[15] Published in paper. The journal site for the electronic version is currently in demo.
[16] http://digitalcommons.mcmaster.ca/russelljournal/
[17] http://repositories.cdlib.org/ced/places/
[18] http://epublications.bond.edu.au/ejsie/




Publishing Scientific Research: Is There Ground for New Ventures?

Panayiota Polydoratou and Martin Moyle
University College London, Library Services
DMS Watson Building, Malet Place, WC1E 6BT
Telephone: 020 7679 7795; Fax: 020 7679 7373
Email: lib-rioja@ucl.ac.uk

Abstract

This paper highlights some of the issues that have been reported in surveys carried out by the RIOJA (Repository Interface for Overlaid Journal Archives) project (http://www.ucl.ac.uk/ls/rioja). Six hundred and eighty-three scientists (17% of the 4012 contacted), together with representatives of publishing houses and members of editorial boards of peer-reviewed journals in astrophysics and cosmology, provided their views regarding the overlay journal model. In general the scientists were favourably disposed towards the overlay journal model. However, they raised several implementation issues that they would consider important, primarily relating to the quality of the editorial board and of the published papers, the speed and quality of the peer review process, and the long-term archiving of the accepted research material. The traditional copy-editing function remains important to researchers in these disciplines, as does the visibility of research in indexing services. The printed volume is of little interest.

Keywords: subject repositories; publishing models; overlay journal model; astrophysics & cosmology

1. Introduction to the project

The RIOJA (Repository Interface for Overlaid Journal Archives) project (http://www.ucl.ac.uk/ls/rioja) is an international partnership of academic staff, librarians and technologists from UCL (University College London), the University of Cambridge, the University of Glasgow, Imperial College London and Cornell University. It aims to address the issues around the development and implementation of a new publishing model, the overlay journal – defined, for the purposes of the project, as a quality-assured journal whose content is deposited to and resides in one or more open access repositories. The project is funded by the JISC (Joint Information Systems Committee, UK) and runs from April 2007 to June 2008. The impetus for the RIOJA project came directly from academic users of the arXiv (http://arxiv.org) subject repository. For this reason, arXiv and its community are the testbed for RIOJA. arXiv was founded in 1991 to facilitate the exchange of pre-prints between physicists. It now holds over 460,000 scientific papers, and in recent years its coverage has extended to mathematics, nonlinear sciences, quantitative biology and computer science in addition to physics. arXiv is firmly embedded in the research workflows of these communities. This paper highlights some of the issues that have been reported in the community surveys, which, as part of the RIOJA project, surveyed the views of scientists, publishers and members of editorial boards of peer-reviewed journals in the fields of astrophysics and cosmology regarding the overlay journal model. To gather background to their views on publishing, the respondents were asked to provide information about their research, publication and reading patterns. The use of arXiv by this community and the reaction of its members to the overlay publishing model were also addressed in the survey.
Respondents were asked to provide feedback about the suggested model; to indicate the factors that would influence them in deciding whether to publish in a journal overlaid onto a public repository; and to give their views on the



relative importance of different features and functions of a journal in terms of funding priorities. The publishers and members of editorial boards of peer-reviewed journals provided an insight into existing publishing practices.

2. Statement of the problem

The overlay concept, and the term “overlay journal” itself, appear to be attributable to Ginsparg [1]. Smith [2] further defined the model by discussing and comparing functions of the existing publishing model with what he referred to as the “deconstructed journal”. Although aspects of overlay have been introduced to journals in some subject domains, such as mathematics and computing [3-6], overlay journals have not yet been widely deployed. Halliday and Oppenheim [7], in a report regarding the economics of digital libraries, recommended further research, in the field of electronic publishing in particular. Specifically, they suggested that the costs of electronic journal services should be further investigated, and commented that the degree of functionality that users require from electronic journals may have an impact on their costs. In a JISC-funded report, consultants from Rightscom Ltd [8] suggested that commercial arrangements for the provision of access to the published literature are made based on the nature of the resource and its anticipated usage. Cockerill [9] indicated that what is regarded as a sustainable publishing model in the traditional sense (pay for access) is actually supported by the willingness of libraries to pay, “even reluctantly” (p.94), large amounts of money to ensure access to the published literature. He suggested that, as open access does not introduce any new costs, there should in theory be no problem in sustaining open access to the literature. Waltham [10] raised further questions about the role of learned societies as publishers as well as the overall acceptance of the ‘author pays’ model by the scientific community. Self-archiving and open access journals have been recommended by the Budapest Open Access Initiative (http://www.soros.org/openaccess/read.shtml) as the means to achieve access to publicly-funded research.
The overlay model has the potential to combine both these “Green” (self-archiving) and “Gold” (open access journal) roads to open access. Hagemann [11] noted that “…overlay journals complement the original BOAI dual strategy for achieving Open Access…” and suggested that the overlay model could be the next step to open access. In support of open access to information, the BOAI has published guides and handbooks on best practice for launching a new open access journal, converting an existing journal to open access, and business models to take into consideration [12-14]. Factors such as the expansion of digital repositories, the introduction of open source journal management software, an increasing awareness within the scholarly community at large of the issues around open access, and an increasing readiness within the publishing community to experiment with new models, suggest that the circumstances may now be right for an overlay model to succeed. The RIOJA survey was designed to test the reaction of one research community, selected for its close integration with a central subject repository, to this prospective new model.

3. Methodology

The RIOJA project is currently being carried out in six overlapping work packages addressing both managerial and research aspects of the project. This paper will discuss the results from community surveys which were undertaken to explore the views of scientists in the fields of astrophysics and cosmology concerning the feasibility of an overlay journal model. In addition to a questionnaire survey, a number of publishers and members of editorial boards were approached to discuss and elaborate on some of the initial questionnaire findings. These complementary studies were intended to enable a more rounded understanding of the publishing process, and to help the project to explore whether an overlay journal



model in astrophysics and cosmology could be viable in the long term. The Times Higher Education Supplement World Rankings [15-16] were used to identify scientists in the top 100 academic and 15 non-academic institutions in the fields of astrophysics and cosmology worldwide, so as to capture feedback from the research community at an international level. Additionally, the invitation to participate in the survey was posted to a domain-specific discussion list, “CosmoCoffee” (http://www.cosmocoffee.info). The survey was launched on June 8th 2007, and closed on July 15th. The questionnaire comprised five sections that aimed to: a) gather demographic and other background information about the respondents; b) find out about the research norms and practices of the scientists, from their perspectives as both creators and readers of research; c) identify issues around the researchers’ use of arXiv; and d) gather their views regarding the viability of the overlay journal model. The target group was restricted to scientists who have completed their doctoral studies, and who therefore could be assumed to have produced research publications or to be in the process of publishing their research outcomes. 4012 scientists were contacted, and 683 (17%) participated. The supplementary interviews involved representatives from PhysMath Central, the Public Library of Science (PLoS), and Oxford University Press (OUP), and members of the editorial boards of the journals Monthly Notices of the Royal Astronomical Society (MNRAS) and the Journal of Cosmology and Astroparticle Physics (JCAP). The interviews lasted between 1.5 and 2 hours, comprised semi-structured questions, and on several occasions benefited from the participation of the project’s academic staff.

4. Results

The community surveys received responses from 683 scientists (17% of the 4012 contacted), and from representatives of publishing houses and members of editorial boards of peer-reviewed journals in astrophysics and cosmology, as described in the previous section. The respondents to the questionnaire survey represented a range of research interests, roles and research experience, and an almost equal proportion of returns (51/49) came from scientists who were English native speakers and those who were not. Results indicated that more than half of the respondents (53%) were favourably disposed to the idea of the overlay journal as a potential future model for scientific publishing. Over three-quarters (80%) of the respondents were, in principle, willing to act as referees in an arXiv-overlay journal. Those scientists who expressed an interest in an overlay publishing journal but did not consider it important (35%) elaborated on some concerns and provided suggestions that are described in the following subsections.

4.1 Some issues around publishing research outcomes

The vast majority of the respondents to the survey (663 people) noted that papers for submission to peer-reviewed journals were their main research output. An average of 13 papers per scientist over a two-year period indicates a healthy research field with substantial ongoing research activity. These findings confirm the important role that peer-reviewed journals, and peers in general, play in the validation and dissemination of research in this discipline. The journals in which the respondents had mostly published their research were among those with the highest impact factor as reported in the Thomson ISI Journal Citation Reports, 2005 [17]. Irrespective of ongoing discussions in the literature about the validity of citation analysis, these findings suggest that impact factor does have a bearing on scientists’ decisions on where to publish. However, the majority of the researchers (494 people) reported that the most important factor in their decision where to publish was the quality of the journal as perceived by the scientific community. Other factors from the scientists


Panayiota Polydoratou; Martin Moyle

pointed to the relationship between the quality, readership and impact of a journal and the reputation of its editorial board, and to clear policies around the process of peer review. Although factors such as whether the journal is published by a professional society (473) or published in print (463) were considered unimportant, emphasis was placed on the importance of long-term archiving and sustainable access to the published literature. The subject coverage of the journal, the efficiency and ease of use of the submission system, its handling of images and various file formats (e.g. LaTeX), and the time that it takes for a paper to reach publication were also noted as influential factors (Table 1).

Statement: % agree (95% confidence limit)

- Perceived quality of the journal by the scientific community: 97.3 (± 1.2)
- High journal impact factor: 88.9 (± 2.4)
- Being kept up-to-date during the refereeing process: 81.6 (± 3)
- Other factors (please specify): 75.3 (± 9.4)
- Inclusion in indexing/abstracting services (e.g. ISI Science Citation Index): 67.9 (± 3.6)
- Reputation of the editor/editorial board: 66.2 (± 3.6)
- Journals that do not charge authors for publication: 64.5 (± 3.6)
- Open Access Journals (journals whose content is openly and freely available): 52.8 (± 3.8)
- Low or no subscription costs: 33.9 (± 3.6)
- Journals which publish a print version: 29.8 (± 3.5)
- Journals published by my professional society: 26.9 (± 3.4)
- Journals which have a high rate of rejection of papers: 21.1 (± 3.1)

Key (rating scale): Very unimportant / Fairly unimportant / Neither / Fairly important / Very important

Table 1: Factors affecting the scientists' decision where to publish
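The 95% confidence limits quoted alongside the percentages in Table 1 (and later in Table 3) are consistent with the standard normal approximation for a sample proportion on the full survey base of 683 respondents; a minimal sketch (assuming simple binomial sampling, which the paper does not state explicitly):

```python
import math

def ci95_halfwidth(p, n, z=1.96):
    """Half-width of the 95% normal-approximation confidence
    interval for a proportion p observed among n respondents."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Base of 683 respondents; check Table 1's top two rows.
for pct_agree in (97.3, 88.9):
    hw = 100 * ci95_halfwidth(pct_agree / 100, 683)
    print(f"{pct_agree}% agree -> +/- {hw:.1f} percentage points")
```

This reproduces the ± 1.2 and ± 2.4 limits reported for the two top-rated statements; rows with wider limits (e.g. "Other factors", ± 9.4) were presumably computed on a smaller response base.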

4.2 Use of arXiv and indexing services

The scientists confirmed the important role that arXiv plays in communicating and disseminating research in the fields of astrophysics and cosmology. About 77% of the respondents access arXiv on a daily or weekly basis. About 80% visit arXiv's "new/recent" section to keep up to date with new research (Figure 1). In addition, when the scientists were asked "on finding an interesting title/abstract, where do you look for the full article?", e-print repositories (such as arXiv) were denoted as the first port of call by 610 people (89%). In the context of an overlay journal, repository policy clearly needs to be aligned sympathetically with the


Publishing Scientific Research: Is There Ground for New Ventures?


journal's objectives. For example, observations were made about the quality of papers submitted to arXiv, and about the fact that papers which have been subjected to peer review and those which have not co-exist in the repository without being clearly distinguished. Limitations on the size and format of files that may be uploaded to arXiv were also highlighted. Some examples of the comments the scientists made:

• "arXiv has its own flaws, mostly related to the freewheeling unrefereed nature of the papers posted there…"

• "To be fair, arxiv is quick and fast in spreading information, but the quality of papers in terms of language and typesetting varies greatly - and this is the (expensive) benefit of having journals copy-editing the papers, which I do appreciate." Other changes that respondents said they would welcome concerned the policies on file formats and image sizes.

• "Large versions of color figures should be available"

• "I think the idea of "enhancing" the arXiv with a proper peer-review lens is a good idea, provided that what I see are the key advantages of current journal articles are retained: 1. The refereeing process; 2. Proper copy editing; 3. High-quality figures (the current arXiv limits on file sizes for figures leads to figures which are often illegible)."

[Figure 1: Keeping up to date with research advances. Number of respondents using each method: arXiv "new/recent" 549; ADS website 396; indexing/abstracting services 194; journal web pages 164; alerts from ADS 148; discussion lists/forums 114; print copies of journals 101; alerts from arXiv 90; other 49; N/R 38; "table of contents" alerts 8.]

To search for back literature, 68% of the scientists prefer the ADS service. "Other" responses showed that information gleaned from colleagues, journal alerting services, attendance at conferences and workshops, and visits to the SPIRES Web site are all important.

4.3 Costs

The interviews with publishers and editors did not reveal any substantial information about costs beyond what has already been reported in the literature [10] or is available on some publishers' websites, e.g. PhysMath Central (http://www.biomedcentral.com/info/about/apcfaq). Interviewees suggested that the article processing charge varies by journal, discipline and usage. Drawing up exact costings for the setup, production and running of an overlay journal was outside the scope of the project.


The interviews with publishers indicated that the interest of academic and research staff in new publishing models is the prime driver of publishers' adaptation to technological change. For example, one of the publishers interviewed stated that one of their most successful journals, both in terms of revenue to the publisher and in terms of perceived quality and acceptance by the scientific community, was converted to open access (the 'author pays' model) purely because of community demand. Meanwhile, a question in the questionnaire survey concerning how expenditure should be apportioned among particular functions of a journal attracted criticism: respondents queried whether a scientist has adequate knowledge of the publishing process and its associated costs to make any useful observations. It was also observed that the publishing process entails more than the distribution phase, which some respondents felt was the only phase the survey addressed. However, the costs associated with the work of scientific editors, with the integrity and long-term archiving of journal content, and with the transparency of peer review were highlighted as worthwhile (Table 2; scale from 1, "little", to 5, "most of the amount"). An indicative comment:

"… Very-little of a high-cost journal may be more than a considerable amou[n]t of a low-cost one. Perhaps it would be better posed in terms of one's priorities in paying for the journal. I think that in this day paying those such as the editors and referees, and ensuring the integrity of the archive, ought to be a higher priority than producing a paper version of the journal. Especially for an overlay journal such as you propose."

Suggested expenditure/priority                       None    1    2    3    4    5  Not sure
Paying scientific editors                              23   23   60  240  141   15    21
Paying copy editors                                     8   28   73  256  134    6    15
Maintenance of journal software                         4   20   73  238  147    9    30
Journal website                                         5   28   79  225  149   20    15
Online archive of journal's own back issues             9   27   52  202  189   18    19
Production of paper version                           138  101  125  107   29    4    14
Extra features such as storage of associated data      30   63  105  182  100    6    26
Publisher profits                                     142  122  138   91    9    0    19
Paying referees                                       249   70   70   85   22    8    18
Other                                                   3    1    1    1    3    2     3

Table 2: Suggested expenditure/priorities

Copy editing, the level of author involvement in it, and who should be responsible for any associated costs were also commented upon. Some respondents favoured the idea of charging extra for papers that require extensive copy editing. Almost half of the respondents agreed that the cost of copy editing should be borne by the author and should vary with the amount of copy editing required. Furthermore, almost half of the respondents (47%) agreed that such changes should be carried out by the author (Table 3). The appearance and layout of the published papers were considered important.

• "The idea of charging authors for papers that require excessive copyediting is a great one!"

• "Copy editing is a difficult issue: it should be the [responsibility] of authors to improve their writing, on the other hand the journal should take [responsibility] for what it published. Perhaps an author could have say three chances and after that should pay for copy editing?"

• "…my position is that a basic copy editing should be provided by the journal, but that extremely messy papers should be penalized, perhaps by introducing extra costs"

• "I do believe money [is] being wasted on the copy-editing of already copy-edited articles, on paper copies of journals, on library subscriptions, etc. The publications process needs to be streamlined and a new type of open-access peer-reviewed journal might just be the right thing."

Statement: % agree (95% confidence limit)

- The cost of copy editing should be borne by the author and vary from paper to paper, depending on the amount of copy editing required: 48.2 (± 3.8)
- Copy editing should be carried out by the author: 47.3 (± 3.8)
- A referee should be prepared to assess whether or not copy editing is required: 18.1 (± 2.9)
- The cost of copy editing should be borne by the journal: 11.1 (± 2.4)
- When a journal makes copy edits, the corrected LaTeX should be returned to the author (after his/her approval): 4.7 (± 1.6)

Key (rating scale): Strongly disagree / Slightly disagree / Neither / Slightly agree / Strongly agree

Table 3: Copy editing

When asked where the funding to meet those costs should come from, the respondents preferred research funders (485 people, 71% of the base of 683), library subscriptions (432 people, 63%) and sponsorship, for example by a Learned Society (350 people, 51%). A model requiring an author to pay from research funds either on acceptance (218 people) or on submission (47 people) of a paper was also endorsed (Figure 2). Other sources mentioned in comments included personal donations, professional association contributions, commercial and/or not-for-profit organisations, advertisements, subscriptions, and models in which authors pay partially on submission and partially on acceptance.

[Figure 2: Sources for covering journals' costs. Number of respondents selecting each source: research funders (Councils, government, etc.) 485; library subscriptions 432; sponsorship (e.g. by a Learned Society) 350; author pays on acceptance (e.g. using research funds) 218; author pays on submission (e.g. using research funds) 47; other 14; N/R 18.]



4.4 Peer review

The process of peer review, as noted above, was raised by the scientists as a very important factor both when selecting the journals in which they publish their research and in forming their opinion of a journal. Aspects of peer review that the respondents considered important were the transparency of the process, the proven track record of the referees and of the scientific editor, the editor's role in the peer review process, high reviewing standards, and the relevance of the chosen reviewers. These factors were cited as acceptance criteria for an overlay journal. In general, the comments were grouped around the speed, quality and reliability of the process. Some comments on the speed of peer review concerned the role of the editorial team and a journal's support services. Respondents indicated that an easily accessible editorial team that keeps scientists informed at each stage of the review process, while responding promptly and reliably to questions, is desirable. Also welcome, perhaps as an alternative, would be access to an online system that allows authors to keep track of the peer review process, supplemented by a clear statement of how review is conducted and of the assessment criteria in place. In comments about the quality of peer review, the scientists raised issues around the transparency of the process, the selection of the referees and the importance of a proven record of past refereeing: what one respondent called "respected peer review". Furthermore, comments referred to the competence, care, efficiency and responsibility of editors and editorial boards, and also addressed other peer review models, such as open and community peer review [18-19]. One school of thought called for a more open, publicly available peer review system, incorporating the use of new technologies such as wikis, voting systems and discussion forums.
A second preferred to maintain the anonymity of peer review, but was keen to see more exploration, and possible adaptation, of the more rigorous models of peer review applied in other disciplines. The publishers who were interviewed also pointed out that the administration of peer review is time-consuming and, along with copy editing, costly.

4.5 Concerns – overlay journal model

The scientists who participated in the survey expressed some concerns about new and untested models of publishing, the overlay model included. However, they were favourably disposed towards trying new models and means of publishing scientific research, provided that the published research outcomes would continue to assist them in establishing an academic record, attracting funding and securing tenure. Specifically, the following issues received particular mention:

• Impact, readership, and the financial sustainability of the journal
• The peer review process, with particular emphasis on ensuring quality
• Long-term archiving and the sustainability of the underlying repositories
• Clarity and proof of viability of the proposed model

4.6 The overlay journal model - success factors

The most important factors which would encourage publication in a repository-overlaid journal were the quality of other submitted papers (526 responses), the transparency of the peer review process (410) and the reputation of the editorial board (386). Respondents also provided a range of other factors that they considered important, among them the reputation of the journal; its competitiveness measured against other journals under the RAE (the UK's Research Assessment Exercise); the quality both of the journal's referees and of its accepted papers; a commitment to using free software; a commitment to the long-term



archiving and preservation of published papers; relevant readership; and its impact factor (which, it was noted, should take into account only citations to papers after final acceptance, not citations accrued while residing on arXiv prior to "publication").

5. Discussion

The questionnaire survey received responses from 683 scientists in the fields of astrophysics and cosmology (a 17% return). The respondents represented a range of research interests, roles and research experience, and an almost equal proportion of returns (51/49) came from scientists who were English native speakers and those who were not. The respondents indicated that they each produce, on average, 13 papers over each two-year period. They confirmed the important role of scientific journals in communicating research: 97% indicated that papers for submission to peer-reviewed journals are the main written output of their research. When it comes to choosing a journal in which to publish, the scientists highlighted a journal's impact factor, readership levels and acceptance by the scientific community as carrying the most weight in the decision. This is exemplified by the list of journals in which the respondents had mostly published their research, which included the 10 with the highest impact factor in these fields (ISI Journal Citation Reports, 2005). Other factors which affect the scientists' decision on where to publish include the subject coverage of the journal, the efficiency and ease of use of the submission system, the time that it takes for a paper to reach publication, open access, indexing in services such as the ADS, and the publishing requirements of particular projects. The most important functions of a journal were identified as the online archive of the journal's back issues, the journal's website and the maintenance of the journal software. Journal production costs should, it was felt, be covered by research funders or by library subscriptions. In the context of an overlay journal, repository policy clearly needs to support the journal's objectives: some of arXiv's current policies and practices (for example, policies about file sizes; the submission, acceptance and citation of unrefereed papers; multiple versions of papers) were highlighted by this community as issues which would need to be addressed if an arXiv overlay were trialled. Open access was also raised by several scientists, who emphasised the importance of free access to the scientific literature, particularly for less privileged scientists. The inclusion of journal content in indexing and alerting services was deemed important, and the ADS services are regarded favourably as an access point to the literature by the majority of the respondents. The respondents showed particular concern with the speed, quality and reliability of the peer review process, which was mentioned repeatedly in their comments. It is not always clear to authors how peer review is conducted by a given journal. Their comments suggest that there is room for improvement in the system, although there was no consensus on the best way to make those improvements. As documented elsewhere, arXiv use is prevalent in this community:

• 77% of respondents access arXiv on a daily or weekly basis
• 80% visit arXiv's "new/recent" section to keep up to date with advances in their fields

The respondents were broadly receptive to the idea of overlay publishing: 53% welcomed it, and 80% would be happy to be involved as referees for an arXiv-overlay journal.


The questionnaire survey, therefore, found some encouragement for the overlay journal model in the fields of astrophysics and cosmology. However, general issues were raised about new and untested models of publishing, the overlay model included. It is clear that, for any new publishing model to succeed, it will have to address many 'traditional' publishing issues, among them impact, peer review quality and efficiency, building a readership and reputation, arrangements for copy-editing, visibility in indexing services, and long-term archiving. These are generic concerns, for which repository overlay is not necessarily the complete answer.

6. Summary and conclusions

This paper has discussed some of the issues around scientific publishing in astrophysics and cosmology and presented some of the findings of two community surveys in those fields. The roles, responsibilities and experience of the respondents primarily involve research. The preferred output from their research is peer-reviewed journal articles, which confirms the importance in this discipline of certification by quality-assured journals. The scientists indicated that the quality of any journal publishing model is very important to them, and they choose to publish in journals that demonstrate the endorsement of the scientific community, whether through readership levels, impact factor, or perceived quality of the editorial board and journal content. In general the scientists were favourably disposed towards the overlay journal model. However, they raised several implementation issues that they would consider important, primarily relating to the quality of the editorial board and of the published papers, and to the long-term archiving of the accepted research material. The traditional copy-editing function remains important to researchers in these disciplines, as does visibility in indexing services. The traditional printed volume is of little interest. The initial results from this survey suggest that scientists in the fields of astrophysics and cosmology are, in the main, positively disposed towards a new publishing model that, in a respondent's own words, "…is more open, flexible, quicker (and cheaper?), and as "safe" or safer (i.e. ensuring science quality) as would be needed". A full examination of these results, together with the other findings from the RIOJA project, is expected to enrich our understanding of the many issues around the acceptance and sustainability of the overlay journal as a potential publishing model.

7. Acknowledgements

The authors would like to thank the scientists who participated in their survey for their time and input. We would also like to thank the representatives from PhysMath Central, the Public Library of Science (PLoS), and Oxford University Press (OUP), and the members of the editorial boards of the journals Monthly Notices of the Royal Astronomical Society (MNRAS) and Journal of Cosmology and Astroparticle Physics (JCAP), for their time and interest in the RIOJA project.

8. Notes and References

[1] GINSPARG, P. (1996). Winners and Losers in the Global Research Village. Invited contribution, UNESCO Conference HQ, Paris, 19-23 Feb 1996. [online]. [cited 08 May 2008]. Available from Internet: <http://xxx.lanl.gov/blurb/pg96unesco.html>
[2] SMITH, J. W. T. The deconstructed journal: a new model for academic publishing. Learned Publishing. 1999, Vol. 12, no. 2, pp. 79-91. [cited 08 May 2008]. Also available from Internet: <http://library.kent.ac.uk/library/papers/jwts/DJpaper.pdf>
[3] Logical Methods in Computer Science [online]. Available from Internet: <http://www.lmcs-online.org/index.php>. ISSN 1860-5974.
[4] Journal of Machine Learning Research [online]. Available from Internet: <http://jmlr.csail.mit.edu/>.
[5] Annals of Mathematics [online]. Available from Internet: <http://annals.princeton.edu/index.html>.
[6] Geometry and Topology [online]. Available from Internet: <http://www.msp.warwick.ac.uk/gt/2007/11/>.
[7] HALLIDAY, L. and OPPENHEIM, C. (1999). Economic models of the Digital Library. [online]. [cited 08 May 2008]. Available from Internet: <http://www.ukoln.ac.uk/services/elib/papers/ukoln/emod-diglib/final-report.pdf>
[8] RIGHTSCOM Ltd. Business model for journal content: final report, JISC. [online]. 2005. Available from Internet: <http://www.nesli2.ac.uk/JBM_o_20050401Final_report_redacted_for_publication.pdf>
[9] COCKERILL, M. Business models in open access publishing. In JACOBS, N. (ed.) Open Access: Key Strategic, Technical and Economic Aspects. Oxford: Chandos Publishing, pp. 89-95, 2006. [online]. [cited 08 May 2008]. Available from Internet: <http://demo.openrepository.com/demo/handle/2384/2367>
[10] WALTHAM, M. Learned Society Open Access Business Models, JISC. [online]. 2005. [cited 08 May 2008]. Available from Internet: <http://www.jisc.ac.uk/uploaded_documents/Learned%20Society%20Open%20Access%20Business%20Models.doc>
[11] HAGEMANN, M. SPARC Innovator: December 2006. [online]. [cited 08 May 2008]. Available from Internet: <http://www.arl.org/sparc/innovator/hagemann.html>
[12] CROW, R. and GOLDSTEIN, H. (2003). Model Business Plan: A Supplemental Guide for Open Access Journal Developers & Publishers, Open Society Institute. [online]. [cited 08 May 2008]. Available from Internet: <http://www.soros.org/openaccess/oajguides/oaj_supplement_0703.pdf>
[13] CROW, R. and GOLDSTEIN, H. (2003). Guide to Business Planning for Launching a New Open Access Journal, Open Society Institute. 2nd edition. [online]. [cited 08 May 2008]. Available from Internet: <http://www.soros.org/openaccess/oajguides/business_planning.pdf>
[14] CROW, R. and GOLDSTEIN, H. (2003). Guide to Business Planning for Converting a Subscription-based Journal to Open Access, Open Society Institute. [online]. [cited 08 May 2008]. Available from Internet: <http://www.soros.org/openaccess/oajguides/business_converting.pdf>
[15] The Times Higher Education Supplement. World university rankings: the world's top 100 science universities. 2006. [online]. [cited 08 May 2008]. Available from Internet: <http://www.timeshighereducation.co.uk/hybrid.asp?typeCode=162>
[16] The Times Higher Education Supplement. World university rankings: the world's top non-university institutions in science. 2006. [online]. [cited 08 May 2008]. Available from Internet: <http://www.timeshighereducation.co.uk/hybrid.asp?typeCode=164>
[17] At the time of the survey the ISI Journal Citation Reports 2006 were not available; the list of journals used in the survey was therefore based on the 2005 reports.
[18] RODRIGUEZ, M. A., BOLLEN, J. and VAN DE SOMPEL, H. (2006). The convergence of digital libraries and the peer-review process. Journal of Information Science. 2006, Vol. 32, no. 2, pp. 149-159. [online]. [cited 08 May 2008]. DOI: http://dx.doi.org/10.1177/0165551506062327. An arXiv preprint of this paper is available at: arXiv:cs/0504084v3
[19] CASATI, F., GIUNCHIGLIA, F. and MARCHESE, M. (2007). Publish and perish: why the current publication and review model is killing research and wasting your money. Ubiquity. 2007, Vol. 8, issue 3. [online]. [cited 08 May 2008]. DOI: http://doi.acm.org/10.1145/1226694.1226695


The Role of Academic Libraries in Building Open Communities of Scholars

Kevin Stranack (1), Gwen Bird (2), Rea Devakos (3)

1 reSearcher/Public Knowledge Project Librarian; email: kstranac@sfu.ca
2 WAC Bennett Library, Simon Fraser University, 8888 University Dr., Burnaby, BC, Canada; email: gbird@sfu.ca
3 Information Technology Services, University of Toronto Libraries, 130 St. George St, Toronto, ON, Canada; email: rea.devakos@utoronto.ca

Abstract
This paper describes three important pillars of publishing programs emerging at university libraries: providing a robust publishing platform, engaging the academic community in discussions about scholarly communication, and building a suite of production-level services. The experiences of the Public Knowledge Project, the Simon Fraser University Library, and the University of Toronto Library's journal hosting service are examined as case studies. Detailed information is provided about the development of the Public Knowledge Project, its goals and history, and the tools it offers. Campus activities at Simon Fraser University have been coordinated to support the use of PKP tools, and to raise awareness on campus about the changing landscape of scholarly publishing. The University of Toronto's journal hosting service is profiled as another example. The role of university libraries in bringing together scholars, publishing tools and new models of scholarly publishing is considered.

Keywords: Public Knowledge Project; academic libraries; scholarly publishing

1. Introduction

Libraries around the world are seeking to answer the fundamental question posed by Hahn in "The Changing Environment of University Publishing": "To what extent should the institutions that support the creation of scholarship and research take responsibility for its dissemination as well?"[1] Many libraries are in fact not only providing services, but actively experimenting with scholarly publishing. This paper describes three important pillars of library publishing programs: providing a robust publishing platform, engaging the academic community in discussions around scholarly communication, and building a suite of production-level services. The experiences of the Public Knowledge Project, the Simon Fraser University Library, and the University of Toronto Library's journal hosting service will serve as case studies.

2. The Public Knowledge Project

Founded in 1998 by Dr. John Willinsky of Stanford University and the University of British Columbia, the Public Knowledge Project (PKP)[2] is an international research initiative promoting publishing alternatives for scholarly journals, conferences, and monographs. Through its development of innovative, open source publication management tools, the Project contributes to the growing, global community of scholars dedicated to furthering free and open access to information and research. By building in workflow efficiencies, the Project software allows publishers to significantly reduce their operating costs[3] and make their content



either free or available with low subscription fees. A recent indication of the software's impact can be found in Hahn's 2008 report, Research Library Publishing Services: New Options for University Publishing[4], which found that the Project's Open Journal Systems software is now the most frequently used program of its kind, whether commercial or open source, supporting academic library publishing initiatives. Since becoming a PKP partner in 2005, the Simon Fraser University Library has taken on responsibility for managing the development of the software, providing technical support to the global community, and publicizing the Project through the PKP web site, workshops, presentations, and publications. In 2006, the Project was the sole Canadian winner of the Mellon Award for Technological Collaboration[5] and was also recognized as a Leading Edge partner of the Scholarly Publishing and Academic Resources Coalition (SPARC)[6]. Currently, all five of the lead institutions in the Synergies project[7], described in the conference paper by Eberle-Sinatra, Copeland and Devakos, are using one or more elements of the PKP's software to advance online humanities and social sciences publishing in Canada. In addition, the software products continue to develop and mature, and the global community of scholars taking advantage of the Project's work continues to grow.

3. Open Source Software

The Public Knowledge Project's suite of software includes a variety of separate but inter-related applications: Open Journal Systems (OJS), Open Conference Systems (OCS), Open Monograph Press (OMP), and Lemon8-XML. All are freely available as open source software. They share similar technical requirements and underpinnings (PHP, MySQL, Apache or Microsoft IIS 6, and a Linux, BSD, Solaris, Mac OS X, or Windows operating system), operate in any standard server environment, and need only a minimal level of technical expertise to get up and running. In addition, the software is well supported, with a free online support forum and a growing body of documentation. The Open Journal Systems (OJS) software[8] provides a complete scholarly journal publication management system, offering a journal web site (see Figure 1), an online submission system, multiple rounds of peer review, an editorial workflow that includes copyediting, layout editing, and proofreading, as well as indexing, online publication, and full-text searching.

Figure 1: The International Journal of Design web site using OJS


Kevin Stranack; Gwen Bird; Rea Devakos

OJS goes beyond managing and displaying content, however, and provides an interesting set of Reading Tools, helping the reader to contextualize the content, and allowing for innovative interactions between the reader, the text, and the author (see Figure 2).

Figure 2: Postcolonial Text's Reading Tools

The Reading Tools allow readers to communicate privately with the author or to place comments directly on the web site, providing an interesting model of post-publication, open review. OJS is currently in version 2.2, with version 2.3 expected for release in late 2008; upcoming features include online reader annotation tools and enhanced statistics and reporting. Today, over 1,500 journals worldwide are using the Project's OJS software to manage their scholarly publication process, with 50% coming from the Sciences, 38% from the Humanities and Social Sciences, and 12% being interdisciplinary. As well, a growing number of translations have been contributed by community members, with Chinese, Croatian, English, French, German, Greek, Hindi, Italian, Japanese, Portuguese, Russian, Spanish, Turkish, and Vietnamese versions of OJS completed, and several others in production. The Open Conference Systems (OCS) software[9] provides a fully featured conference management and publication system, including not only a conference web site, online submissions, peer review, editorial workflow, online publication, and full-text searching, but also a conference schedule, accommodation and travel information pages, and an online registration and payment system. Reading Tools, similar to those provided with OJS, are also available. OCS is currently in version 2.1, with version 2.2 expected later in 2008. At least 300 scholarly conferences have used OCS to manage their events, including the 2008 International Conference on Electronic Publishing[10]. OCS has now been translated into English, French, German, Italian, Portuguese, and Spanish. The Open Monograph Press (OMP)[11] is a new open source project that is still in a very early stage of development.
Essentially, OMP will provide a similar management system for the production of scholarly monographs, with a built-in correspondence system for participants, marketing and cataloguing tools, and XML conversion (see Lemon8-XML below). It will allow editors to invite contributors to participate in the creation of a new work and provide authors with an online studio to assist with the research and writing process, including bibliographic management tools, a document annotation system, blogs, wikis, and more. The PKP has received significant international interest in this project, and the OMP will benefit from the wide-ranging community expertise that will be provided throughout the development process. Lemon8-XML[12] is another innovation which is still in development. It is a document conversion system, which will allow users of OJS, OCS, OMP, or any other publication system to automatically transform text files submitted by authors (such as Microsoft Word or Open Office Writer documents) into XML files to assist with



online publication and compliance with indexing service requirements (e.g., PubMed Central). This will build in a significant new level of efficiency, saving layout editors the time-consuming task of producing PDF, XHTML, or XML documents manually. Although developed specifically for use with the other Project software tools, it will be a standalone open source product, allowing for uses independent of OJS, OCS, or OMP. Lemon8-XML will be released in mid-2008, and a beta version is available from the Project web site.
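The kind of transformation such a conversion tool automates can be illustrated with a small sketch. The element names below are simplified stand-ins chosen for illustration, not Lemon8-XML's actual output schema; the function and example values are hypothetical.

```python
# Illustrative sketch only: wraps already-extracted article text and metadata
# in a minimal journal-article-style XML structure using the standard library.
import xml.etree.ElementTree as ET

def article_to_xml(title, authors, paragraphs):
    """Build a minimal, illustrative XML article from extracted text."""
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    ET.SubElement(front, "article-title").text = title
    contrib_group = ET.SubElement(front, "contrib-group")
    for name in authors:
        ET.SubElement(contrib_group, "contrib").text = name
    body = ET.SubElement(article, "body")
    for para in paragraphs:
        ET.SubElement(body, "p").text = para
    return ET.tostring(article, encoding="unicode")

xml_doc = article_to_xml(
    "Open Access in Practice",          # hypothetical title
    ["A. Author", "B. Author"],          # hypothetical authors
    ["First paragraph.", "Second paragraph."],
)
print(xml_doc)
```

In a real pipeline the hard part is, of course, extracting clean structure from word-processor files in the first place; the serialization step shown here is the easy tail end of what the tool automates.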

4. Community

In addition to the hundreds of users of the Public Knowledge Project software products, the community also extends to the many people who volunteer their time and efforts in a variety of important ways. One critical contribution has been the translations mentioned earlier. Without this contribution, the PKP software tools would not have the international reach that they have today; it would simply have been impossible for the Project to create the translations without community volunteers. Other forms of community participation include the recurring need to thoroughly test every new software release. This is a time-consuming and somewhat repetitive task for the volunteers, but it ensures that crucial bugs have not been overlooked, which could cause very serious problems if they were introduced into production systems. Without community testers, the Project would not be able to continue with its regular enhancement process and increase the functionality of the software, nor ensure its continued security and robustness. Community members also contribute important new software features, including the subscription module, which allows OJS journal publishers to continue to charge subscriptions or other fees as they consider the move to open access. Another important sign of the health of the PKP community is the fact that the online support forum now has over 1,100 members, many of whom not only post their questions, but are increasingly sharing their experiences and assisting other users by answering questions. The PKP community is made up of a wide variety of participants, including scholars (e.g., The International Journal of Communication[13]), university information technology divisions (e.g., The University of Saskatchewan College of Arts and Science[14]), government departments (e.g., Sistema Eletrônico de Editoração de Revistas[15]), publishers (e.g., Co-Action Publishing[16]), and, of course, libraries.
As the Project grows, this form of community-based support will become increasingly important.

5. The Simon Fraser University Library

In 2007, the Simon Fraser University Library began a formal program of scholarly communication activities on campus. Scholarly communication was included as a theme and a clear priority in the Library's 3-year plan for 2007-2010. This theme included a cluster of issues arising out of the current system of academic publishing and a desire to reform that system. We identified the usual array of issues concerning libraries: the high and steeply rising cost of commercially published scholarly journals, widely recognized as unsustainable; a desire to support alternative publishing models, including Open Access; an opportunity to use library buying power to support alternative models that are sustainable and provide benefit for the Simon Fraser community; a desire to minimize limitations on the use of faculty-authored publications; and a desire to provide infrastructure and support for authors who wish to self-archive research outputs. As a result, we asked what the Library could do to contribute to the efforts to “create change”[17]. We asked what would best build on existing activities and strengths of the SFU Library. We were willing to take on new roles as needed in the changing landscape of scholarly communication. To provide a bit of context, Simon Fraser University is a mid-sized, publicly funded Canadian university. It offers programs in a full range of subject areas up to the doctoral level, but includes no professional schools, such as law or medicine, and serves just under 20,000 FTE students. The Library recognized that


while we were well positioned to take a leadership role on campus, we would be successful only to the extent that we could engage the interest of faculty members. As with other faculty endeavours, our team of liaison librarians would be key to this success, building on their knowledge of departments and well-established individual relationships. For this reason, after sketching out a modest set of events, we began by working with SFU librarians. We partnered with colleagues from neighbouring institutions, the University of British Columbia and the University of Victoria, to offer joint training for librarians, which provided background on many of the issues listed above and ran participants through a variety of interactive activities. The goal was to orient librarians to the subject in order to make them comfortable integrating discussions of scholarly communication into their liaison work. In short order, the participating librarians felt grounded and ready to incorporate scholarly publishing into their instruction and other interactions with faculty in the way that we had hoped. A few of the events put on for the campus community are described below. In thinking about how we would build on existing strengths of the Library, it was clear that we had an “ace in the hole” for a scholarly communications program in the form of our participation in the PKP Project. Here was a set of tools we could put directly into the hands of those wanting to reclaim academic publishing, one journal or one conference at a time. In July 2007, the Library worked with the PKP project to host the first International PKP Conference[18], bringing together users of the tools and others interested in its goals from around the world.
With over 200 participants and generous sponsorship from the Open Society Institute to cover costs for delegates from developing countries, the conference provided an astonishing picture of the development and operation of alternative publishing projects around the world. The conference featured papers from five continents, exploring both the practical and theoretical aspects of the Project.[19] After the conference, we offered repeated “OJS in a Day” workshops, which were quickly filled, to continue putting the skills needed to use OJS into the hands of interested researchers and editors. As our librarians work on campus to discuss scholarly publishing, they regularly turn up requests for more information about OJS, or requests for software support. The federally funded Synergies project is providing one-time funding to assist many Canadian journals in the Social Sciences and Humanities to move content online for the first time using OJS, and also provides further support for SFU scholars moving their publications in this direction. Another place where the Library saw itself functioning as a hub was with respect to journal editors. Staff in the Collections Management office noticed they were often fielding inquiries from faculty members in their roles as editors, and that these inquiries began to form a pattern. When editorial boards were considering offers to license their journal content to third-party aggregators, to change publishers, to digitize their backfiles, or to move from a for-fee to an Open Access business model, they were coming to the Library for guidance. We brought together a group of editors for a forum where they were able to find each other across disciplines and share common experiences. As many of the editors were active users of OJS, they were able to impart firsthand experience of using the software, of running Open Access journals, and of transitioning society publications to Open Access.
In addition, the Library has continued to host campus events highlighting Open Access publishing more generally. These have included speakers from BioMed Central, the Public Library of Science, Open Medicine, and others. Typically they attract a mix of graduate students, faculty members and librarians, and a mix of advocates, skeptics and curious newcomers. Here the Library acts as a facilitator, putting the issues on the table, encouraging lively discussion and debate, highlighting positive stories from successful Open Access journals, and featuring SFU authors willing to share their experiences and motivations for supporting Open Access. As appropriate, the Library can also provide information about the often invisible costs of the traditional system of academic publishing, providing members of the SFU community with a local perspective on our part in this $16 billion a year industry.



The Role of Academic Libraries in Building Open Communities of Scholars

Finally, the Library has launched an Institutional Repository that offers trusted infrastructure and support for interested community members who wish to self-archive.[20] As this initiative grows, the Library plays an increasingly active role at earlier stages of the research process, ideally, in a few cases, as a co-applicant on funding applications where archiving is built into the project from its inception. As we build this program on our campus, we recognize that the current system of scholarly communication is embedded in larger institutional and industry-wide contexts. These include the tenure and promotion system generally, and its specific expression at SFU; trends in the academic, trade and commercial publishing sectors; and the requirements and regulations of granting agencies. We have launched a blog to help the campus community stay abreast of news, and to offer an online space for continued discussion.[21] Future plans include applied research into faculty attitudes and behaviours around scholarly communication to further inform our work in this area. In holding events like these on the SFU campus, one common refrain is that faculty members and researchers are grateful for the opportunity to hear about what’s going on in other disciplines. Even those who are keen on the topic are generally not able to keep up with developments in areas outside their own. For example, biologists are pleased to come to events hosted by the library to learn about discussions going on in the American Anthropological Association[22]; anthropologists are interested in learning about SCOAP3[23]; and social scientists are interested to hear what is happening in the life sciences, where OA journals have been making significant inroads. Taken alone, none of these activities has marked a departure for the SFU Library, but as a program, together they are certainly contributing to a changed role for libraries in building open communities of scholars.
We have learned that faculty on our campus bring varied levels of understanding of the issues, and that our programs must be multivalent enough to address these multiple levels of need. We’ve also seen that integrating scholarly communication into our liaison program is a comfortable fit that has reinvigorated several of our long-serving librarians, and provided us with a renewed definition of liaison work in an academic library. Similar programs have been offered by many university libraries, and reports and lessons learned elsewhere have also been useful for us (e.g., the University of California Berkeley’s Scholarly Communication News and Events[24], Scholarly Communications at the University of Washington Libraries[25], and the University of Guelph’s Scholarly Communication Initiatives[26], to name a few). Awareness on the SFU campus continues to build about the changing scholarly communication landscape, and the Simon Fraser University Library continues to explore new roles for itself in bringing together and building open communities of scholars. Like most other university libraries, we are operating in an environment where we have largely eliminated print journals in favour of online access; just a few years ago we were contending with feedback from members of our community lamenting that their once-weekly trips to the library’s periodical reading room were a chance to get out of their academic silos and mix with colleagues from elsewhere on campus. We are pleased to see the Library continuing to occupy this role of campus hub, albeit in a new way.

6. The University of Toronto’s Journal Hosting Service

Like many academic libraries, the University of Toronto is offering a range of journal publishing services. Indeed, U of T services parallel many of the trends found by Hahn[27]:

1. The Open Journal Systems software is used.
2. Services provided include:
· hosting
· initial set up
· consultation on business models, advertising, launches, etc.
· training
· ongoing troubleshooting and customer support
3. The service was initially advertised through word of mouth.
4. The university has leveraged past investments in digital library services. The Scholarly Communication Initiatives unit also offers repository services using the DSpace platform, and conference hosting services using the Open Conference Systems.
5. Electronic-only publication services are offered, though a few journals also publish in print.
6. Services are funded through multiple sources: initially funded by the Library’s operating budget, the service is now supplemented by a federal government grant, Synergies. Libraries in Australia, Germany and Denmark have received similar government funding.
7. As with the National Library of Australia’s Open Publish service[28], quality control and copyright clearance rest with the journal.
8. As with the California Digital Library (CDL)[29], interdisciplinary journals are prominent, as are student journals.
9. Like Newfound Press[30], the Library is interested in enhancing access to peer-reviewed scholarship and specialized works with a potentially limited audience.

The service is staffed by a librarian, technical staff and student assistants. An online application form, modeled after the CDL’s, asks for Canadian university affiliation, the journal’s aim and purpose, editorial board, peer review process, copyright and authors’ rights[31]. In addition, student-led journals are asked for a letter of support from a faculty sponsor. Eight journals are currently hosted and we are in discussions with another ten; we expect the number of journals hosted to continue to grow. Here are a few illustrative examples: Women in Judaism was founded 11 years ago and is devoted to scholarly debate on gender-related issues in Judaism. The ultimate aim of the journal is to promote the reconceptualization of the study of Judaism by acknowledging and incorporating the roles played by women, and by encouraging the development of alternative research paradigms. Articles undergo blind review. The international editorial board numbers 60. The journal publishes two issues a year. In addition to scholarly articles, works of fiction, biographical essays, and book and film reviews are also published. The journal is indexed by ATLAS, RAMBI (the Index of Articles on Jewish Studies, by the Jewish National and University Library), the Jewish Periodical Index, the MLA International Bibliography and others. The journal website states: We do not have subscription fees, nor do we intend to have them in the future. The Canadian Online Journal of Queer Studies in Education was created in 2004 to provide a forum for scholars, professionals, and activists to discuss queer topics in education and the social sciences in the Canadian context. The term ‘education’ is understood broadly to include all levels of education in every discipline. This journal is devoted to supporting and disseminating research and theory that promotes social justice for all queer people, including lesbian, gay, bisexual, queer, intersex, two-spirited and trans-identified people.
The forum encourages critical examination of queer discourse across disciplines and dialogue on multiple and intersecting forms of oppression based on gender, race, class, ability, religion, etc. This refereed journal is affiliated with the Ontario Institute for Studies in Education at the University of Toronto. Clinical & Investigative Medicine is the official journal of the Canadian Society for Clinical Investigation.



The journal’s focus is on original research and information of importance to clinician-scientists. Founded in 1978, the journal moved entirely online in 2007, due partially to the cost of print production. Most subscribers are also society members. Immediate open access is offered to all Canadian universities; total open access is provided after six months. In the past, CIM had relationships with a variety of aggregators. The University of Toronto Journal of Undergraduate Life Sciences (JULS) showcases the research achievements of undergraduate life science students at the U of T and encourages intellectual exploration across the various life sciences disciplines. Established in 2006 by a small group of students, JULS quickly gained support from various departments and faculty members. The journal publishes research articles and mini-reviews. All articles undergo a two-stage, double-blind peer-review process conducted by students and faculty. Issues are published annually, in both print and electronic format. Currently, all but one of the hosted journals are open access, but this is not a strict requirement. Like the Copenhagen Business School Library’s Ejournals@cbs[32], we seek to provide a “low risk environment for small journals.” For journals concerned about losing subscription income, we work to identify ways to “open” access while protecting revenues, such as delayed open access, or providing free access to some articles, IP ranges, or issues. We expect use of this mixed model to increase. Like Ejournals@cbs, our journals fall into two categories: born digital and born print. However, we have found that whether a journal was born in print or digitally has not affected comfort with the platform. Our born-digital journals include those born on our service, and those born on their own home-grown system or another OJS service provider. Established journals with established workflows tend to use only OJS’s dissemination features.
As Felczak, Lorimer and Smith describe, journals often find the task of changing their production methods a “non-trivial challenge.”[33] Launching an electronic journal, whether new or established, is a time-consuming project. The OJS platform and the new medium prompt the editorial team to consider or reconsider policies such as copyright. The mixture of practical “click here” and policy questions to be addressed is often daunting. Editors ask what others have done, how long it takes to do x, etc. The most common question concerns the cost of electronic journal production. It is not a question we can answer easily. In a review of journal publishing costs, King laments: a wide range of figures for publishing costs and average costs per subscription and per article. Many cost estimates are presented in the literature in support of a specific agenda: to explain high prices, to demonstrate the savings to be expected from electronic publishing, or to show why author-side payment should be adopted. Unfortunately, the way in which many publishing costs are presented in the literature is somewhat misleading, because the costs are not qualified by the various cost parameters or other factors that contribute to their value, large or small.[34] Indeed, our initial meetings with existing journals are sometimes difficult. In relating the transition of the Canadian Journal of Sociology to electronic open access publication, Haggerty describes their first meeting with the U of Alberta Libraries: Laura and I, however, did not give our colleagues an easy time during our meeting, asking them a procession of difficult questions about the implications of such a move. Looking back, it is evident that Pam and Denise could not have answered most of those questions to my satisfaction, as the answers were contingent upon their having detailed knowledge about the specifics of the journal’s finances and assorted institutional arrangements.
I also suspect that what I really wanted from them was an impossible guarantee that the journal could accrue all the benefits of going open access without also bearing the risks of such a move.[35]


Libraries are well positioned not only to acknowledge the unknown, but also to assist journals as they explore uncharted waters. In so doing, we have forged strong working relationships and gained unique insight into the scholarly communication process.

7. Conclusions

From the case studies presented in this paper, it is clear that libraries are becoming increasingly involved in scholarly publishing, whether by providing powerful software platforms to increase operational efficiency and technological innovation, as at the Public Knowledge Project; by developing new forms of scholar-librarian collaboration, as at the Simon Fraser University Library; or by offering a complete set of production services, as at the University of Toronto Library. And these libraries are by no means alone in these endeavours. Internationally, libraries are becoming increasingly involved in scholarly publishing activities, and this represents an important shift in the services libraries offer and in the perception of their organizations, both externally and internally. As Case and John[36] point out, however, the “next major step is to integrate the digital publishing operations into the library organization.... The role of library as publisher must be embedded in the culture of our organization.”

8. Notes

[1] Hahn, K. (2007). The Changing Environment of University Publishing. ARL Bimonthly Report, (252/253). Retrieved June 1, 2008 from http://www.arl.org/bm~doc/arl-br-252-253-intro.pdf
[2] The Public Knowledge Project. (2008). Retrieved June 1, 2008 from http://pkp.sfu.ca
[3] Willinsky, J. (2005). Scholarly Associations and the Economic Viability of Open Access Publishing. Retrieved June 1, 2008 from http://jodi.tamu.edu/Articles/v04/i02/Willinsky/
[4] Hahn, K. (2008). Research Library Publishing Services: New Options for University Publishing. Retrieved June 1, 2008 from http://www.arl.org/bm~doc/research-library-publishing-services.pdf
[5] Recipients of First Annual Mellon Awards for Technology Collaboration Announced. Retrieved June 1, 2008 from http://rit.mellon.org/awards/matcpressrelease.pdf
[6] The SPARC Leading Edge publisher partner program. Retrieved June 1, 2008 from http://www.arl.org/sparc/partner/leadingedge.shtml
[7] Synergies Project. Retrieved June 1, 2008 from http://www.synergiescanada.org/
[8] Open Journal Systems. Retrieved June 1, 2008 from http://pkp.sfu.ca/ojs
[9] Open Conference Systems. Retrieved June 1, 2008 from http://pkp.sfu.ca/ocs
[10] International Conference on Electronic Publishing 2008. Retrieved June 1, 2008 from http://www.elpub.net
[11] Open Monograph Press. Retrieved June 1, 2008 from http://pkp.sfu.ca/omp
[12] Lemon8-XML. Retrieved June 1, 2008 from http://pkp.sfu.ca/lemon8
[13] The International Journal of Communication. Retrieved June 1, 2008 from http://ijoc.org
[14] The University of Saskatchewan College of Arts and Science Conference Server. Retrieved June 1, 2008 from http://ocs.usask.ca/
[15] Sistema Eletrônico de Editoração de Revistas. Retrieved June 1, 2008 from http://seer.ibict.br/
[16] Co-Action Publishing. Retrieved June 1, 2008 from http://www.co-action.net/
[17] Create Change Canada. Retrieved June 1, 2008 from http://www.createchangecanada.ca/about/index.shtml
[18] First International PKP Scholarly Publishing Conference. Retrieved June 1, 2008 from http://pkp.sfu.ca/ocs/pkp2007/index.php/pkp/1
[19] First Monday, October 2007, 12(10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/issue/view/250



[20] Simon Fraser University Institutional Repository. Retrieved June 1, 2008 from http://ir.lib.sfu.ca/index.jsp
[21] Simon Fraser University Library Scholarly Communication News. Retrieved June 1, 2008 from http://blogs.lib.sfu.ca/index.php/scholarlycommunication
[22] Cross, J. (2008). Open Access and AAA. Anthropology News, Feb 2008, 49(2), 6. Retrieved June 1, 2008 from http://www.aaanet.org/pdf/upload/49-2-Jason-Cross-In-Focus.pdf
[23] SCOAP3 - Sponsoring Consortium for Open Access Publishing in Particle Physics. Retrieved June 1, 2008 from http://scoap3.org/
[24] The University of California Berkeley’s Scholarly Communication News and Events. Retrieved June 1, 2008 from http://blogs.lib.berkeley.edu/scholcomm.php
[25] Scholarly Communications at the University of Washington Libraries. Retrieved June 1, 2008 from http://www.lib.washington.edu/ScholComm/
[26] The University of Guelph’s Scholarly Communication Initiatives. Retrieved June 1, 2008 from http://www.lib.uoguelph.ca/scholarly_communication/initiatives/
[27] Hahn, K. (2007). The Changing Environment of University Publishing. ARL Bimonthly Report, (252/253). Retrieved June 1, 2008 from http://www.arl.org/bm~doc/arl-br-252-253-intro.pdf
[28] Graham, S. (2007). Open access to open publish: National Library of Australia. First Monday, 12(10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1960/1837
[29] Candee, C. H., & Withey, L. (2007). The University of California as publisher. ARL Bimonthly Report, (252/253). Retrieved June 1, 2008 from http://www.arl.org/bm~doc/arl-br-252-253-cal.pdf
[30] Phillips, L. L. (2007). Newfound Press: The digital imprint of the University of Tennessee Libraries. First Monday, 12(10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1968/1843
[31] The University of Toronto Libraries’ Request for Journal Hosting. Retrieved June 1, 2008 from http://jps.library.utoronto.ca/index.php/index/Boilerplate/submit
[32] Elbaek, M. K., & Nondal, L. (2007). The library as a mediator for e-publishing: A case on how a library can become a significant factor in facilitating digital scholarly communication and open access publishing for less web-savvy journals. First Monday, 12(10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1958/1835
[33] Felczak, M., Lorimer, R., & Smith, R. (2007). From production to publishing at CJC online: Experiences, insights, and considerations for adoption. First Monday, 12(10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1959/1836
[34] King, D. W. (2007). The cost of journal publishing: A literature review and commentary. Learned Publishing, 20(2), 85-106.
[35] Haggerty, K. D. (2008). Taking the plunge: Open access at the Canadian Journal of Sociology. Information Research, 13(1). Retrieved June 1, 2008 from http://informationr.net/ir/13-1/paper338.html
[36] Case, M. M., & John, N. R. (2007). Publishing Journals@UIC. ARL Bimonthly Report, (252/253). Retrieved June 1, 2008 from http://www.arl.org/bm~doc/arl-br-252-253-uic.pdf



Social Tagging and Dublin Core: A Preliminary Proposal for an Application Profile for DC Social Tagging

Maria Elisabete Catarino (1); Ana Alice Baptista (2)
(1) Information Systems Department, University of Minho, Campus Azurém, Guimarães, Portugal. CAPES-MEC-Brazil. e-mail: ecatarino@dsi.uminho.pt
(2) Information Systems Department, University of Minho, Campus Azurém, Guimarães, Portugal. e-mail: analice@dsi.uminho.pt

Abstract

Web 2.0 maximizes the Internet’s potential for encouraging its users to cooperate effectively in the offer of virtual services and the organization of content. Among the various potentialities of Web 2.0, folksonomy appears as a result of the free assignment of tags to Web resources by their users/readers. Although tags describe Web resources, they are generally not integrated into the metadata. In order for them to be intelligible to machines, and therefore usable in the Semantic Web context, they have to be automatically allocated to specific metadata elements. There are many metadata formats. The focus of this investigation will be the Dublin Core Metadata Terms (DCTerms), a widely used set of properties for the description of electronic resources. A subset of DCTerms, the Dublin Core Metadata Element Set (DCMES), has been adopted by the majority of Institutional Repository platforms as a way to promote interoperability. We propose research that intends to identify metadata elements originated from folksonomies and to propose an application profile for DC Social Tagging. That will allow tags to be conveniently processed by interoperability protocols, particularly the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH). This paper will present the results of the pilot study developed at the beginning of the research, as well as the metadata elements preliminarily defined. Keywords: Social Tagging; Folksonomy; Metadata; Dublin Core.

Introduction

Metadata may be defined as a group of elements for the description of resources [1]. There are many metadata standards in the repository context; among them is the Dublin Core Metadata Element Set (DCMES), or simply Dublin Core (DC), a metadata element set for the description of electronic resources. This standard is widely used, globally and on a broad scale, due to several factors: a) it was created specifically for the description of electronic resources; b) it has an initiative responsible for its development, maintenance and dissemination, the Dublin Core Metadata Initiative (DCMI); and c) it is the metadata set used by default by the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH).

The more active participation of users in the construction and organization of Internet content is the result of the evolution of Web technologies. The so-called Web 2.0 is "the network as platform, spanning all connected devices; Web 2.0 applications are those that make the most of the intrinsic advantages of that platform: delivering software as a continually-updated service that gets better the more people use it, consuming and remixing data from multiple sources, including individual users, while providing their own data and services in a form that allows remixing by others, creating network effects through an 'architecture of participation', and going beyond the page metaphor of Web 1.0 to deliver rich user experiences." [2]
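As an illustration of how a DCMES description is carried by OAI-PMH's oai_dc format, the sketch below renders a hypothetical record. The element values are invented for illustration (the resource named is one discussed later in this paper), and the rendering deliberately omits XML namespaces and character escaping:

```python
# A hypothetical Dublin Core record expressed as a simple mapping.
# The keys are DCMES element names; the values are invented for
# illustration only.
dcmes_record = {
    "title": "The Semantic Web",
    "creator": "Tim Berners-Lee",
    "subject": ["Semantic Web", "Metadata"],
    "type": "Text",
    "language": "en",
}

def to_oai_dc(record):
    """Render the record as bare <dc:...> elements, one element per
    value (namespaces and XML escaping omitted for brevity)."""
    lines = []
    for element, value in record.items():
        for v in (value if isinstance(value, list) else [value]):
            lines.append(f"<dc:{element}>{v}</dc:{element}>")
    return "\n".join(lines)
```

A repeated property such as subject simply yields one element per value, which is how oai_dc handles multi-valued elements.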


One of the new possibilities of Web 2.0 is folksonomy, which is "the result of personal free tagging of information and objects (anything with an URL) for one's own retrieval. The tagging is done in a social environment (shared and open to others). The act of tagging is done by the person consuming the information" [3]. The tags which make up a folksonomy may be keywords, categories or metadata [4]. Tags have several roles, as a study by Golder and Huberman [5][6] points out: Identifying What (or Who) it is About, Identifying What it Is, Identifying Who Owns It, Refining Categories, Identifying Qualities or Characteristics, Self Reference and Task Organizing.

Another study, Kinds of Tags (KoT) [7], has the objective of verifying how the tags derived from folksonomies can be normalized, aiming at their interoperability with metadata standards, specifically DC. The KoT researchers observed that some tags cannot be fitted into any of the already existing elements. Preliminary results indicate that the following new elements may have to be used: Action_Towards_Resource, To_Be_Used_In, Rate and Depth [8][9].

Generally, digital repositories' metadata is input by authors or by professionals that mediate deposit. In the Web 2.0 context, folksonomies arise as a result of Web resource tagging by its own users. Tags are a complementary form of description which expresses the user's view of a given resource and is, therefore, potentially important for its discovery and retrieval. The preliminary results of KoT indicate that the current DCTerms elements are not enough to hold users' descriptions by means of tags. In this context, following up the analysis resulting from the KoT project, we propose an application profile for DC Social Tagging, so that tags may be used in the context of the Semantic Web.
This application profile will be the result of a research project that aims to identify metadata elements derived from folksonomies and to compare them with DCTerms properties.

2. Investigation: Procedures

The procedures of this research project are divided into four stages. The first stage consists of an analysis of all tags contained in the KoT project dataset. At this stage all tags assigned to the resources are analysed, grouped in what we call key-tags, and then DC properties are assigned to them when possible. A key-tag is a normalised tag that represents a group of similar tags. For instance, the key-tag Controlled Vocabulary stands for the tags controlledvocabulary, controlled vocabularies or vocabulars controlatis. Since the meaning of tags is not always clear, it is necessary to dispel doubts by complementarily turning to lexical resources (dictionaries, encyclopaedias, WordNet, Wikipedia, etc.) and by analysing other tags of the same users. Contacting the users may be a last resort to find out the meaning of a given tag. In this stage, a pilot study was developed in order to refine the proposed methodology and to verify whether the proposed variants for grouping and analysing tags are adequate.

The second stage aims at proposing properties complementary to the ones already existing in the DCMI Metadata Terms [10]. Key-tags that were not assigned to any DC property in stage one will now be subject to further analysis in order to infer new properties specific to Social Tagging applications. This analysis takes into account all DC standards and recommendations, including the DCAM model, the ISO Standard 15836-2003 and the NISO Standard Z39.85-2007.

The third stage comprises the adaptation of an already existing DC ontology. This will make use of Protégé, an ontology editor developed at Stanford University. The ontology will be encoded in OWL, a language endorsed by the W3C. Finally, the fourth stage intends to submit the proposal to the DC-Social Tagging community for comments


and feedback via online questionnaires. After this phase, a first final version of a proposal for a DC Social Tagging profile will be submitted to the community.

This paper presents the results of the pilot study alongside the preliminary results of the first research stage: tag analysis. The preliminary results of KoT indicate that an application profile for Social Tagging applications would benefit from the inclusion of new properties beyond those in DCTerms. Those terms will potentially accommodate tags that currently do not have a metadata holder. The results of this research will therefore make it possible to determine whether, and to what extent, the KoT preliminary findings are confirmed.

3. Pilot Study

The pilot study was carried out in order to improve the methodology proposed for the investigation project since, as Yin [11] states, "the pilot study helps investigators to refine their data collection plans with respect to both the content of the data and the procedures to be followed". The dataset used in this project is the same as that of the KoT project: it is composed of 50 records of resources which were tagged in two social bookmarking systems, Connotea and Delicious. Each record is composed of fields distributed in two groups of data: a) information related to the resource as a whole: URL, number of users, research date; and b) information related to the tags assigned to the resource: social bookmarking system, user, bookmarked date and the tags. A relational database was set up with the DCMI Metadata Terms and the KoT dataset, which was imported from its original files. The following tables were created: Tags, Users, Documents, Key-tags and Metadata.
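The relational database just described can be sketched in SQLite. The paper names only the five tables, so every column name and the foreign-key layout below are assumptions made for illustration:

```python
import sqlite3

# Sketch of the pilot-study database. Only the five table names come
# from the paper (Tags, Users, Documents, Key-tags and Metadata); the
# columns and foreign keys are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (id INTEGER PRIMARY KEY, url TEXT, n_users INTEGER, research_date TEXT);
CREATE TABLE Users     (id INTEGER PRIMARY KEY, nick TEXT, system TEXT);  -- Connotea or Delicious
CREATE TABLE KeyTags   (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
CREATE TABLE Metadata  (id INTEGER PRIMARY KEY, property TEXT);           -- DCMI Metadata Terms
CREATE TABLE Tags (
    id INTEGER PRIMARY KEY,
    value TEXT,
    document_id INTEGER REFERENCES Documents(id),
    user_id INTEGER REFERENCES Users(id),
    keytag_id INTEGER REFERENCES KeyTags(id),
    bookmarked_date TEXT
);
""")

tables = {row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
```

Keeping tags in their own table, with references to the document, the user and the key-tag, mirrors the many-to-many relationships described in the text: the same tag string can occur for several resources and users, each occurrence grouped under one key-tag.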
3.1 Tag Analysis

In the pilot study, the data of the first five resources of the dataset were analysed. This implied the analysis of a total of 311 tags, with 1141 occurrences, assigned by 355 users. It was important to register not only the number of tags but also their total occurrence, since a tag could have a different meaning for each of the resources to which it was assigned. Therefore, in some cases, it was possible to analyse the occurrence of a tag concerning an individual resource.

3.1.1 Grouping Tags in their Different Forms: Key-tags

A key-tag is the term that represents the various forms of the same tag. In order to accomplish tag grouping it was necessary to generate reports for each resource with the following information: Title (of the resource), User Nick and Tag, displaying the information in alphabetical order of the tags to facilitate the visualization of the different existing tag forms and the definition of key-tags. In this stage it is necessary to use lexical resources (dictionaries, WordNet, Infopedia, etc.) and other online services, such as online translators, in order to fully understand the meaning of tags. In some cases further research and analysis of other tags of a given user, or even direct contact with this user by e-mail, may be necessary in order to understand the exact meaning of a tag.

An important concern regarding tag analysis is the fact that, as tags are assigned by the resources' users, there is inevitably a lack of homogeneity in their form. Therefore, it was necessary to establish some rules in order to properly analyse tags, establish key-tags and relate DC properties to them. The first rule concerns the alphabet: in this project, only tags written in the Latin alphabet were considered. Further studies should involve the analysis of tags written in other alphabets.



Another rule is directly related to language. The dataset comprises tags written in different languages. As English is the dominant one, it was chosen as the language to represent key-tags. Depending on the key-tags, certain criteria concerning the classification of words need to be established: simple or compound, singular or plural, based on a thesaurus structure in its syntactical relations. In these cases, the rules presented by Currás [12] were followed.

It was also necessary to create rules to deal with compound tags, as they contain more than one word. There are two kinds of compound tags: (1) those that relate to only one concept and therefore originate only one key-tag (e.g. Digital Libraries); and (2) those that relate to two or more concepts and therefore originate two or more key-tags (e.g. Library and Librarians). In the first kind, compound tags are composed of a focus (or head) and a modifier [13]: the focus is the noun component which identifies the general class of concepts to which the term as a whole refers, and the modifier is one or more components which serve to specify the extension of the focus; in the example above, Digital (modifier) Libraries (focus). In the second kind, compound tags relate to two or more distinct key-tags, as for example Library and Librarians, which would belong to the group of two distinct key-tags, Library and Librarian. Another example is Cataloguing-Classification, which would be assigned both to the key-tag Cataloguing and to the key-tag Classification. In this second kind there is no focus/modifier relation between the components, as their meanings are totally independent.

Following these pre-established rules, the 311 tags were grouped in their different forms, adding up to 212 key-tags.
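Because only a human analyst can tell the two kinds of compound tags apart, a programmatic sketch has to rely on a hand-made lookup. The entries below mirror the examples in the text; the function name and the fallback behaviour are assumptions:

```python
# Hand-made lookup deciding which key-tag(s) a compound tag yields.
# Focus/modifier compounds yield one key-tag; compounds joining
# independent concepts yield two or more. Entries mirror the
# examples in the text; in the study this decision was made manually.
COMPOUND_KEY_TAGS = {
    "digital libraries": ["Digital Libraries"],          # modifier + focus
    "library and librarians": ["Library", "Librarian"],  # two concepts
    "cataloguing-classification": ["Cataloguing", "Classification"],
}

def compound_key_tags(tag):
    """Return the key-tags for a compound tag, falling back to the
    tag itself when it is not in the lookup."""
    return COMPOUND_KEY_TAGS.get(tag.lower(), [tag])
```

A real pipeline would populate the lookup incrementally as the analyst works through the report of tags per resource.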
The first step of tag analysis comprises grouping tag variants by: a) language; b) simple/compound form; c) abbreviations and acronyms; d) singular/plural; e) capital/small letters. Then a key-tag is assigned to each of these groups according to the rules presented above. Some examples of tags and their assigned key-tags follow:

• Tags: _article, article, articles, artikel, article:sw. Key-tag: Article.

• Tags: biblioteca digital, biblioteques digitals, digital libraries, digital library, digital_libraries, digital_library, digitallibraries, digital-libraries, digitallibrary, dl. Key-tag: Digital Libraries.

The above key-tags show variation in:

• spelling: _article / article; digital library / digital_library / digitallibrary and dl;
• form (singular/plural): article / articles; digital library / digital libraries;
• language: article (EN) / artikel (DE); biblioteca digital (PT) / biblioteques digitals (CA) and digital library (EN).

The examples above also show the two kinds of compound tags. Focus/modifier compound tags such as biblioteca digital and digital library are assigned to only one key-tag. Tags composed of two independent concepts, such as article:sw, are assigned to two distinct key-tags: Article and Semantic Web.
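The spelling and case variations above can be collapsed mechanically, while plural forms, translations and abbreviations still need a hand-made synonym table. The sketch below uses entries mirroring the examples; the table contents and function name are illustrative assumptions:

```python
import re

# Synonym table for variations that cannot be derived mechanically
# (plurals, translations, abbreviations). Entries mirror the examples
# above; in the study this mapping was built manually by the analysts.
SYNONYMS = {
    "articles": "article",
    "artikel": "article",                       # German
    "bibliotecadigital": "digitallibrary",      # Portuguese
    "bibliotequesdigitals": "digitallibrary",   # Catalan
    "digitallibraries": "digitallibrary",
    "dl": "digitallibrary",                     # abbreviation
}

KEY_TAG_LABELS = {"article": "Article", "digitallibrary": "Digital Libraries"}

def key_tag(tag):
    """Collapse spelling/case variants mechanically, then resolve
    plural and language variants through the synonym table."""
    form = re.sub(r"[\s_\-.:]", "", tag.lower())  # _article -> article
    form = SYNONYMS.get(form, form)
    return KEY_TAG_LABELS.get(form)
```

The mechanical step alone already merges _article, article, digital_library, digital-library and digitallibrary; only the last line of variation (articles, artikel, dl, and so on) needs the manual table.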


3.1.2 Tag Analysis in Relation to DC

After key-tag composition, an analysis was carried out in order to verify to which DC properties these tags corresponded. This analysis is made more complex by the fact that the definitions of the DCMI Terms are intentionally very inclusive, so that electronic documents can be described with a small yet satisfactory number of metadata elements. This inclusiveness may cause some doubt when relating key-tags to DC properties. Another factor of complexity is that this is a qualitative study, carried out manually so that the analysis is as detailed as possible. Due to these factors, it was necessary to define basic rules for the correspondence of key-tags to DC properties.

For simple tags there is a peculiarity related to the way tags are inserted in the social bookmarking sites, which can interfere with the systems' indexation. When the user inserts tags in Delicious, the only separator is the space character, and everything typed separated by spaces will be considered distinct tags. For example, if the compound term Digital Library is inserted with only the space as separator, the system will consider two tags: Digital and Library. In order to be inserted as a compound tag it is necessary to use special characters such as underscores, dashes and colons. Some examples of such compound tags are: Digital_Library, Digital-library, Digital:Library, Digital.Library. In Connotea, tags are also separated by a space or a comma. However, Connotea suggests that users type compound tags between inverted commas. For example, if the user inserts Information Science without placing the words between inverted commas, the words will be considered two distinct tags; however, if they are typed between inverted commas ("Information Science") the system will generate only one compound tag.
This simple yet important issue has significant implications for the systems' indexation of tags. As an example, a Delicious user, when assigning tags to the resource "The Semantic Web", written by Tim Berners-Lee, inserted the following tags: the, semantic, web, article, by, tim, berners-lee, without using any word-combination device (underscore, etc.). The system generated seven simple tags. However, it is clear that these tags can be post-coordinated [14][15] to carry a meaning such as Title, Creator and Subject. Thus, as a first rule, in cases where simple tags could clearly be post-coordinated, they were analysed as a compound term for the assignment of the DC property. However, this analysis could only be carried out in relation to a single user of a resource, never to a group, since that could mischaracterize the assignment of properties.

The second rule concerns tags that correspond to more than one DC property. Two different situations are considered: simple and compound tags. The easiest case is that of simple tags: if a simple tag occurs to which more than one property can be assigned, then all the properties are assigned to the tag. For example, in the resource entitled DSpace, the properties Title and Subject are assigned to the key-tag dspace. As explained earlier, compound tags can correspond to two or more key-tags; the relationship with DC properties is thus made through the key-tags, which are treated as simple tags in the way they are related to DC properties. For example, the tag Web2.0:article corresponds to two key-tags, Web 2.0 and Article, each of them corresponding to a different property: Subject and Type, respectively. There may also be cases of compound tags that represent two different values for the same property, as in Classification-Cataloguing, which was split into two key-tags, Classification and Cataloguing, both assigned to Subject.
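The two systems' separator rules can be sketched as small tokenizers. The function names are ours; the behaviour follows the description above:

```python
import re

def delicious_tags(raw):
    """Delicious: the space character is the only separator, so
    'digital library' yields two tags; compound tags need _ - . or :
    inside a single space-free token."""
    return raw.split()

def connotea_tags(raw):
    """Connotea: tags are separated by spaces or commas, but a phrase
    in double quotes is kept as one compound tag."""
    tokens = re.findall(r'"([^"]+)"|([^\s,]+)', raw)
    return [quoted or plain for quoted, plain in tokens]
```

Applied to the Berners-Lee example, `delicious_tags("the semantic web")` produces the three separate tags the, semantic and web that the analyst later has to post-coordinate.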


Another rule is related to tags whose value corresponds to the property Title. Tags are related to the element Title when they are composed of terms found in the main title of the resource, for example Dspace or Library2.0. Another example is the case of the resource entitled "The Semantic Web", where the tags the, semantic and web were assigned by the same user and may thus be considered post-coordinated.

3.2 Definition of DC Properties

From the 311 tags analysed, 212 key-tags were created. Of these, 159 key-tags (75%) corresponded to the following DC properties: Creator, Date, Description, Format, Is Part Of, Publisher, Subject, Title and Type. Of these, 90.5% correspond to Subject and Description. At this point it is worth highlighting that tags referring both to the main subject and to other subjects related to the resource were allocated to Subject. The other properties present the following percentages of allocation: Type, 5%; Creator, Is Part Of and Title, 3.1% each; Date and Publisher, 1.3% each; and Format, 0.6%. The remaining 53 key-tags (25%) could not be related to any DC property. New complementary properties were defined and their definition is still in progress.

3.3 Proposed Properties

At this stage, potential new properties were defined for the key-tags to which it was impossible to assign any DC property. The definition of these properties is still preliminary, since it is based solely on the pilot study. The research on the full dataset will determine which properties will be included in the application profile, including any new ones that do not exist in DCTerms. The preliminary new properties identified in the pilot study, described below, are: Action, Category, Depth, Rate, User Name, Utility and Notes. The following percentages were observed for these proposed properties: Action, Rate and Utility, 15.1% each; Category, 11.3%; Depth, 9.4%; Notes, 7.5%; and User Name, 1.9%. There remain 24.5% of key-tags to which it was not possible to assign or propose any property, as their meaning in relation to the resources and users could not be identified.
Below, each of these properties is described following the set of attributes used to specify the DCMI Metadata Terms [16]: Label, Definition, Comment and Example. Some additional information for a better understanding of these properties is also included.

3.3.1 Action

There is a group of key-tags that represent the action of the user in relation to the tagged resource. It is a type of tag that can be easily identified, since the action is expressed in the very term used when tagging the resource. Eight key-tags were identified: Print, Read-Review, Read Later, Read This, Reading-List, To Do and To Read.


Below is a descriptive table of the property to be proposed.

Label: Action
Definition: Action of the user in relation to the resource.
Comment: Has the role of registering the action undertaken by the user towards the resource.
Example: The tags which represent the action To Read, attributed by 6 users, all from Delicious: _toread, a_lire, toread.

Table 1: Description of the property Action

3.3.2 Category

This property includes tags whose function is to group the resources into categories, that is, to classify the resources. The classification is not determined by the subject or theme of the resource since, in those cases, the key-tags could correspond to the Subject property. This property is not easy to identify, as it is necessary to analyse the given tag in the context of the totality of tags that the user has inserted, independently of the resource under analysis. In some cases it may become necessary to analyse the whole group of resources the user has tagged with the tag under analysis. Six key-tags which could correspond to the property Category were identified: Alternative Desktop, DC Tagged, DMST, FW – Trends, Literature and Reference. See Table 2.

Label: Category
Definition: Terms that specify the category of a group of resources.
Comment: Applied to tags which were attributed to group resources into categories, but which are not theme or subject categories, since for those Subject should be used.
Example: During the analysis of the key-tag DC Tagged it was noticed that the corresponding resources also had other tags with the prefix dc: (e.g. dc:contributor, dc:creator, dc:publisher, dc:language or dc:identifier, among others). It was concluded that the tag DC Tagged was probably being applied to group all the resources tagged with dc:-prefixed tags. It was therefore considered a Category, since it is not a classification of subjects or a description of the content of the resource.

Table 2: Description of the property Category

3.3.3 Depth

This type of tag registers the degree of intellectual depth of the tagged resource. According to WordNet, depth is the "degree of psychological or intellectual profundity" [17].

Label: Depth
Definition: Degree of intellectual depth of the tagged resource.
Comment: Registers the user's assessment of the intellectual depth of the tagged resource.
Example: The key-tag Overview, or Semantic Web – Introduction, assigned to indicate an introductory treatment of the resource's topic.

Table 3: Description of the property Depth



The following key-tags for this property were identified: Diagrams, Introduction – Document, Overview, SemanticWeb – Overview and Semantic Web – Introduction, each of which occurred only once.

3.3.4 Notes

This element is proposed to represent tags that are used as a note or reminder. According to WordNet, a note is "a brief written record" whose objective is to register some observation concerning the resource, but which does not refer to its content and is not intended to classify or categorize it [18].

Label: Notes
Definition: A note or annotation concerning a resource.
Comment: Used to make some type of comment or observation with the objective of reminding something, or of registering an observation, comment or explanation related to a tagged resource.
Example: One resource received the tags Hey and OR2007. The first tag, Hey, refers to Tony Hey, a well-known researcher who held a debate on issues related to the tagged resource; this information was given by the user who attributed the tags. The second tag refers to Open Repositories 2007, the event where Tony Hey gave a keynote speech. Interestingly, the tagged resource has no direct relation to either the event or Tony Hey, as confirmed by the user (creator) of the resource.

Table 4: Description of the property Notes

A note should be understood as an annotation to remind something: an observation, comment or explanation inserted in a document to clarify a word or a certain part of the text [19]. From the five analysed resources, the following key-tags considered as Notes were identified: Hey, Ingenta, OR2007 and PCB Journal Club.

3.3.5 Rate

Rate, meaning pattern, category, class or quality, is intended to include tags that evaluate the tagged resource. When using this type of tag, the user categorizes the resource according to its quality.

Label: Rate
Definition: Categorizes the quality of the tagged resource.
Comment: Used to register the user's evaluation of the quality of the tagged resource. Examples of this type of tag: good, great, important.
Example: A resource tagged with the tags Good and Great, representing the user's assessment of its quality.

Table 5: Description of the property Rate

The following key-tags were related to this property: academic, critical, important, old, great, good and vision. These are generally easy to identify as Rate from the terms themselves. In other cases, tags may be doubtful and it becomes necessary to analyse them in relation to the tags assigned by the user to the resource under analysis, as well as to the whole collection of resources tagged by that user. For instance, the tag Vision could have several meanings but, after an analysis of the collection of resources, it may be concluded that it is classifying the quality of the resource.


3.3.6 User Name

The property User Name labels the resource with the name of a user: the tagged resource carried the nickname of one of its users as a tag. Only one tag of this type was identified in the pilot study. Despite the preliminary results presented here, it is assumed that there may be other occurrences.

Label: User Name
Definition: Name of the user of the resource.
Comment: Refers to tags which register the nickname of a user of the resource.
Example: In the pilot study only one tag of this type was identified: the tag Alttablib was attributed by a Delicious user to resource 4 (Resource Description and Access (RDA)).

Table 6: Description of the property User Name

3.3.7 Utility

After an analysis of the tags and resources, an element is proposed to gather the tags that register the utility of the resource for the user. It represents a specific categorization of the tags, so that the user may recognize which resources are useful to him in relation to certain tasks. In the pilot study the following tags were identified: Class Paper, Research, Dissertation, Maass, Professional, Search and Thesis. The majority were not difficult to identify as Utility. However, three of them, Class Paper, Maass and Professional, required an analysis of other tags and resources from the same users. Class Paper is a tag bundled in "1schoolwork" and was assigned to three resources; analysis of the group of resources and related tags suggests that it refers to resources that would be or have been used for a certain activity. Maass is a tag bundled in "Study"; the term is the name of a teacher, information found in the user's notes in two resources tagged with Maass: "Forschung von Prof. Maass an der Fakultät Digitale Medien an der HFU" ("Research by Prof. Maass at the Faculty of Digital Media at HFU") and "Unterlagen für Thema 'Folksonomies' für die Veranstaltung 'Semantic Web' bei Prof. Maass" ("Materials on the topic 'Folksonomies' for the 'Semantic Web' course with Prof. Maass"). Professional is a tag assigned by the user to separate resources that are useful for work-related issues; this information was given by the user of the tag himself.

Label: Utility
Definition: Represents the purpose of use of the resource for the user.
Comment: Categorizes the resources according to utility, as for example: dissertation, thesis.
Example: A group of resources useful for the development of a research project could be tagged with the tag Research.

Table 7: Description of the property Utility

4. Final Considerations

In the following cases it was not possible to establish a correspondence with any property, since it was impossible to understand the meaning of the tags in relation to their resources: resource 1: Capstone; resource 2: Suncat2; resource 4: Babel, Exp, L; and resource 5: Do it or Diet, Inner Space, Kynunan and W. Nonetheless, these are the results of the pilot study and they will therefore be presented to the DC community for evaluation and validation, along with the results of the final research.

As a result of this pilot study it is important to highlight that a meaningful share of the tags, 25%, could not be assigned to any of the already existing DCTerms properties. This result strengthens what had already been concluded in the KoT project, where 37.3% of the analysed tags were not found to correspond to any of the DCTerms properties. The adoption of new properties is therefore justified, so that the metadata deriving from folksonomies can be used by metadata interoperability protocols.

5. Acknowledgments

The authors wish to thank Filomena Louro from the Programme for Support to the Edition of Scientific Papers at the University of Minho for her help in editing the final English draft.

6. Notes and References

[1] DCMI. Using Dublin Core: Dublin Core Qualifiers. DCMI, 2005. Available at: http://dublincore.org/documents/usageguide/qualifiers.shtml, last accessed on August 30, 2007.
[2] O'REILLY, T. Web 2.0: Compact definition? O'Reilly Radar Blog, 1 October 2005. Available at: http://radar.oreilly.com/archives/2005/10/web_20_compact_definition.html, last accessed on November 6, 2006.
[3] WAL, Thomas Vander. Folksonomy definition and wikipedia. Available at: http://www.vanderwal.net/random/entrysel.php?blog=1750, last accessed on November 22, 2006.
[4] GUY, Marieke; TONKIN, Emma. Folksonomies: tidying up tags? D-Lib Magazine, v.12, n.1, January 2006. Available at: http://www.dlib.org/dlib/january06/guy/01guy.html, last accessed on December 12, 2006.
[5] GOLDER, Scott A.; HUBERMAN, Bernardo A. The structure of collaborative tagging systems. Available at: http://arxiv.org/abs/cs.DL/0508082, last accessed on November 14, 2006.
[6] GOLDER, Scott A.; HUBERMAN, Bernardo A. Usage patterns of collaborative tagging systems. Journal of Information Science, v.32, n.2, p.198-208, 2006.
[7] Preliminary data presented in DC-2007 and NKOS-2007.
[8] BAPTISTA, Ana Alice et al. Kinds of Tags: progress report for the DC-Social Tagging community. In: DC-2007, International Conference on Dublin Core and Metadata Applications: application profiles: theory and practice. 27-31 August 2007, Singapore. Available at: http://hdl.handle.net/1822/6881, last accessed on September 4, 2007.
[9] TONKIN, E. et al. Kinds of tags: a collaborative research study on tag usage and structure (presentation). In: European Networked Knowledge Organization Systems (NKOS), 6.; ECDL Conference, 11., Budapest, Hungary. Available at: http://www.us.bris.ac.uk/Publications/Papers/2000724.pdf, last accessed on December 10, 2007.
[10] DCMI Usage Board. DCMI Metadata Terms. 14 January 2008. Available at: http://dublincore.org/documents/2008/01/14/dcmi-terms, last accessed on January 21, 2008.
[11] YIN, Robert K. Case Study Research: design and methods. Thousand Oaks, USA, 1989.
[12] CURRÁS, Emília. Ontologías, taxonomía y tesauros: manual de construcción y uso. 3rd ed., revised and expanded. Madrid: Trea, 2005.
[13] INTERNATIONAL STANDARDS ORGANIZATION. ISO 2788: Documentation: Guidelines for the establishment and development of monolingual thesauri. [S.l.]: ISO, 1986.


Maria Elisabete Catarino; Ana Alice Baptista

[14] Post-coordination is the principle by which the relationship between concepts is established at the moment of outlining a search strategy [15].
[15] MENEZES, E. M.; CUNHA, M. V.; HEEMANN, V. M. Glossário de análise documentária. São Paulo: ABECIN, 2004. (Teoria e Crítica, 01).
[16] DCMI Usage Board. DCMI Metadata Terms. 14 January 2008. Available at: http://dublincore.org/documents/2008/01/14/dcmi-terms, last accessed on January 21, 2008.
[17] WORDNET. A lexical database for the English language. Princeton University, Cognitive Science Laboratory. Available at: <http://wordnet.princeton.edu/>, last accessed on February 7, 2008.
[18] WORDNET, ref. 17.
[19] INFOPEDIA. Enciclopédias e dicionários. Porto: Porto Editora. Available at: <http://www.infopedia.pt>, last accessed on February 7, 2008.




Autogeneous Authorization Framework for Open Access Information Management with Topic Maps

Robert Barta 1; Markus W. Schranz 2

1 Austrian Research Centers Seibersdorf, Seibersdorf, Austria
e-mail: robert.barta@arcs.ac.at
2 Department of Distributed Systems, Institute for Information Systems, Vienna University of Technology, Argentinierstr. 8/184-1, Vienna, Austria
e-mail: schranz@infosys.tuwien.ac.at

Abstract
Conventional content management systems (CMSes) consider user management, and specifically the authorization to modify content objects, to be orthogonal to any evolution of content within the system. This puts the burden on a system administrator or his delegates to organize an authorization scheme appropriate for the community the CMS is serving. Arguably, high-quality content, especially in open access publications with little or no a priori content classification, can only be guaranteed and later sustained if the fields of competence of authors and editors parallel the thematic aspects of the content. In this work we propose to abandon the above-mentioned line of demarcation between object authorization and object theming, and describe a framework which allows content and its ontological aspect to evolve in lockstep with content ownership.

Keywords: Ontology; Semantic Technologies; Authorization Framework

1. Introduction

Content ownership, joint or individual, is the main driving factor in an information society. Current systems tend to be built with strong gravitational forces to attract content creation, so that the harvested information can be sold back into society. In the long run such business models can lead to monopolies and to a highly uneven content distribution. Traditionally, user management has been regarded as orthogonal to the life cycle of a document object within a content management system (CMS). That has allowed implementations to delegate not only authentication but also authorization to a middleware layer. Many modern platforms (such as .NET or J2EE) allow a wide variety of authorization technologies to be deployed. Most CMSes provide an identity-based authorization scheme to control access to the information nodes within the CMS. Individual users either get assigned particular privileges relative to these nodes, or privileges are clustered into 'roles', mainly to reduce the management effort. When users are associated with a particular role, they inherit all of the role's privileges. Role management is then usually handled by an administrator. Such an individual can easily become a bottleneck (and a security risk), so role assignment is often delegated, or even further subdelegated. Current authorization schemes are quite flexible: they can cover systems in the Wiki class (a rather flat user base, hardly any workflow) up to corporate systems with a deep, organizationally imposed group hierarchy and considerable workflow capabilities. Despite these delegation features, in many practical deployments user management is still funneled through very few administrators, and in many real-world deployments these administrators do not actually take part in the collective authoring effort.



On a different front, a more recent trend in CMSes is the addition of semantic information (e.g. [1,2]). In the simplest case this is achieved by providing a background taxonomy against which the information nodes are organized. More sophisticated CMSes allow users not only to attach meta-information along well-defined attributes, but also to use one of the semantic web technologies (RDF [4] and Topic Maps [5]). These make it possible to freely relate information items within the CMS to each other, or to concepts and instance data defined elsewhere. Such outside information can be either referenced or integrated via virtualization [6]. Here, too, systems differ considerably in the degree to which individual users can extend the existing ontology or the types of relationships. In our approach we propose to coalesce the authorization mechanism with the ontological information, thus offering an Autogeneous Authorization Framework (AAF) for (open access) information management based on Topic Maps. The paper is structured as follows: in section 2 we describe the challenges of integrating content management and authorization, introduce our proposed methodology, and refer to related work and necessary notation formats. In the following sections we formalize our proposed machinery for the AAF and cover implementation aspects such as visibility rules for nodes and the necessary ontological commitments. Finally we summarize our current experiences and outline the work needed for a scalable implementation.

2. Challenges in Combining Content Management and User Authorization

As the target audience for the integration of content management and authorization we have in mind loose federations of organizations which want to cooperate in certain areas on a number of topics. Each of the involved organizations may have its own experts in certain areas, but each will seek expert knowledge from its partners.

2.1 Proposed Method and Assumptions

Realistically, such federated projects will not have stable ontologies from the very beginning, much in contrast to a priori created ones (e.g. [3]). These will have to evolve over time, and each snapshot implicitly indicates deficits and hence the topics on which the project will have to focus next. This will prompt experts to adopt certain topics and detail them to the extent necessary or feasible. From past experience it can also be expected that further field experts will be solicited, ad hoc or via affiliation, especially in the area of open access research publications, where the major focus lies on content quality, reliability, and trust in topic experts. Ideally these invitations for co-authorship will not affect the whole content body but only that fraction for which a new expert is authoritative. Under the regime of 'taxonomy-based authorization' a particular user does not derive his privileges relative to a given node from the settings of a central administrator or from membership in a group, but from the commitment that the user is an expert in the field the node belongs to (the 'theme'). Accordingly, the system tracks for each node how it is classified to a theme in the current taxonomy, and it also keeps track of which individuals are authoritative in certain themes. From there we generalize the regime in several directions: • First we abstract away from the particular privileges an individual may have regarding a particular topic. In the simplest case this may be read or write (edit) access to a node. In more complex cases privileges may include the start, promotion or finalization of workflow steps, or different levels of read access to only certain fractions or aspects of the topic. The only thing we are currently assuming is that all the different privileges are totally ordered,



so that (a) it is always unambiguously determinable which privilege is higher than the other, and (b) there is always a highest privilege.

When topics are generated, they will be classified into a theme. From there, topics themselves may or may not follow a workflow. As usual with workflows, the progress of a node through a workflow depends on certain privileges, be it for promoting the topic into a final stage or for sending it back to an editing phase. Our framework does not prescribe any particular workflow states or any intrinsic privileges to move a topic to another state. The only assumption here is that any workflow privileges are still tied to themes, and that topics can be adopted by anyone who has sufficient privileges for that theme. Topic adoption can happen either actively by a user (pull), or passively (push), by another user with higher privileges forcing adoption upon the user.

Privileges themselves will also follow a life cycle, in that the initial privileges of users are extended (monotonically increasing over time). Here, too, we allow a push-pull setup: either a user, unsolicited, is granted privileges on certain themes by other users, or a user requests higher privileges, which may later be granted.

Hereby we allow two subschemes:
• In the 'delegation scheme' privileges can only be granted by someone with higher privileges on that theme.
• In the 'peer scheme' the granted privilege can be at the same level as that of the granting user.

The ontology-based authorization described here can be bootstrapped from any existing taxonomy, even a pathological one with a single theme (the 'thing'). Authors in the system can create taxonomy nodes along with information nodes and establish the highest available privilege on those taxonomy nodes for themselves. Safeguards are in place to prevent users from reassigning nodes in the taxonomy to subvert the authorization system.

2.2 Basic Requirements and Related Standards

Since the basic elements in our AAF are introduced above as nodes in graphs that can be accessed through certain actions, we propose a standardized notation format to represent the resulting semantic network. In the literature such graph structures have been implemented in various forms under different names, including associative nets, semantic nets, partitioned nets, or knowledge maps in many AI systems. One of the most completely worked out notations is the conceptual graph formalism developed by Sowa et al. [8]. Semantic networks rely on a basic model that is similar to that of the topics and associations found in indexes. Thus the two approaches promise great benefits in both information management and knowledge management. The topic map standard targets exactly these benefits. By introducing relations between topics and occurrences in addition to the topic-association model, topic maps provide a means to bridge the gap, as it were, between knowledge representation and information management. In this paper we want to extend this basic intention of topic maps to include user authorization based on thematic topics within the contents.

2.2.1 Topic Maps and the Topic Map Standard
Topic maps are an ISO standard [9] for describing knowledge structures and associating them with information resources. Since topic maps are often described as the GPS of the information universe,


they provide powerful new ways of navigating large and interconnected information sets. Following the basic elements of information structuring and access embodied in indexes, the topic map standard is based on Topics, Associations, and Occurrences. The following section outlines this TAO of topic maps, as explained in detail by S. Pepper in [10]. We will use topic maps as the notation standard for our AAF, as described in section 4.2.

Topics
A topic, in its most generic sense, can be any thing whatsoever: a concept, an article, a person, etc. The term topic refers to the object in the topic map that represents the subject being referred to. Typically, there is a one-to-one relationship between topics and subjects, with every topic representing a single subject and every subject being represented by just one topic. In a topic map, any given topic is an instance of zero or more topic types, thus categorizing specific topics according to their kind. Similar to the use of multiple indexes in a book (index of abbreviations, names, or illustrations), topic types semantically describe the nature of topics. What one chooses to regard as topics in any particular application may vary according to the needs of the application, the nature of the information, and the uses to which the topic map will be put: in software documentation, for example, they might be variables, functions, and objects. In order to identify objects symbolically, topics may have explicit names. Since names exist in various shapes, such as formal names, symbolic names, nicknames, etc., the topic map standard provides the facility to assign multiple base names to a single topic, and to provide variants of each base name for use in specific contexts.

Occurrences
A topic may be linked to one or more information resources that are somehow relevant to the topic. Such resources are called occurrences of the topic. They are generally external to the topic map document itself and are referenced using various mechanisms the system supports, e.g. URIs in XTM [7]. A significant advantage of topic maps is that the real-world documents (occurrences) themselves do not have to be touched; topic maps thus support a clean separation of the network into two layers: the topics and their occurrences. Following the concepts in the topic map standard, occurrences may also be of any number of different types (e.g. article, monograph, commentary). Such distinctions are supported in the standard by the concepts of occurrence role and occurrence role type. These basic constructs refer to the basic organizing principle for information: the concepts of topic, topic type, name, occurrence, and occurrence role provide the means to organize information resources according to topics/subjects, and to create simple indexes.

Associations
What we really need in addition to basic indexes for constructing semantic networks is the ability to describe relationships between topics. To achieve this, the topic map standard provides a construct called the topic association, which asserts a relationship between two or more topics. Similar to the grouping of topics and occurrences according to their specific type, such as author/research area and article/commentary/monograph, associations between topics can also be categorized according to their type. Following the notation concepts of the standard, association types are themselves defined in terms of topics. The ability to apply typing to topic associations significantly increases the expressive power of the topic map, making it possible to group together the set of topics that have the same relationship to any given topic. This is necessary to provide intuitive and user-friendly interfaces for navigating large information networks.




Each topic that participates in an association plays a role in that association. Consequently, association roles can be typed, and the type of an association role is also represented as a topic. Because associations in topic maps are multidirectional, the clear distinction between specific association roles is very important (e.g. CM_user has_edit_right_on article). Thanks to the clear separation of (real-world) information resources and the topic map itself, the same topic map can be overlaid on different information sets. Similarly, different topic maps can be overlaid on the same pool of information to provide different semantic views to different users. Furthermore, this separation provides the possibility to interchange topic maps among publishers or to merge several topic maps, thus handling semantic networks. Omitting other details of the topic map standard, which are out of scope for this paper, we proceed to introduce a notation scheme for the AAF, followed by a model of how to represent the nodes in Topic Maps.

3. Proposed Concepts for an Autogeneous Authorization Framework

To describe and analyze the dynamics of an AAF-driven system, we need to abstract away (a) from any specifics of the underlying CMS and (b) from any representation technique used to manifest AAF-related ontological and operational information. For this purpose we introduce a simple ad hoc formalism to describe static and dynamic integrity constraints. As the minimal ontological commitment we choose to have nodes as the unit that carries content. From the AAF's point of view such a node itself is atomic: it can carry any content, be it text, structured or unstructured. The node may have attachments, or it may have meta-information attached to it; in any case this is outside the scope of the AAF.

3.1 Static Model

3.1.1 Themes
One special kind of node is the theme. Intuitively, themes represent topics such as finance or, say, UMTS. Themes can be organized into subclass relationships, usually referred to as a taxonomy. We write t' < t if the theme t' is a specialization, direct or transitive, of theme t. Of course, themes can be related to each other in more specific ways, but this is outside the AAF realm. The only exception are non-theme nodes which are affiliated with a theme t, something we denote as n → t. Any number of such affiliations may exist at any time. How such an affiliation is modeled in the background ontology is deployment and implementation dependent. Here the notation should simply convey that the node is somehow related to a certain theme. One constraint imposed is that such affiliation inherits downwards the subclass hierarchy.
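As an illustrative aside (our own sketch, not part of the paper's machinery; all names are hypothetical), the specialization relation t' < t can be modeled as a walk up the direct subclass links of the taxonomy:

```python
# Minimal sketch of a theme taxonomy with the transitive
# specialization relation t' < t (t' is more specific than t).

class Taxonomy:
    def __init__(self):
        self.parent = {}  # theme -> its direct, more general theme (or None)

    def add(self, theme, parent=None):
        self.parent[theme] = parent

    def specializes(self, t_prime, t):
        """True iff t' < t: t' is a direct or transitive specialization of t."""
        node = self.parent.get(t_prime)
        while node is not None:
            if node == t:
                return True
            node = self.parent.get(node)
        return False

tax = Taxonomy()
tax.add("thing")                        # pathological single-theme root
tax.add("finance", parent="thing")
tax.add("accounting", parent="finance")
```

Affiliations n → t could then be stored alongside this structure and checked against the relation.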



3.1.2 Users
Another special node type is the user. Nodes of this type are supposed to carry content about a certain user, and we implicitly identify a user with its node (which is somewhat sloppy from an ontological viewpoint). Users are the only active component, so it is they who perform actions.

3.1.3 Actions
The AAF also assumes that there is a finite set of actions on nodes. Built-in actions are read and edit. read will always keep the content of the node, edit will always modify the node, but both will maintain the identity of any node. Additionally, applications and deployments are free to add workflow actions, i.e. actions which move nodes through a series of workflow states. Formally, read and edit are embedded in this scheme as the only actions in their respective workflows. As is common in workflow applications, every workflow step will move the document into a new state, such as "edited" after an edit action. States are regarded here as derived concepts; still, we conveniently use the notation n@S when a node n is in state S. On any set of actions we also impose an order, so that two actions can be compared as to which of them is stronger. This models the fact that editing usually implies reading, or that moving a document through a workflow also implies editing it. As this comparison may only exist for some pairs of actions, we only need a half-order (partial order). In any case we require that there is a strongest action, which we refer to as top. The bottom action also exists in every system and represents the empty action. All actions are greater than bottom.

3.1.4 Privileges
When users have privileges, these are characterized by the maximal action the user can exert on a node affiliated to a certain theme.
We denote such a privilege of user u relative to a theme t for action a as

u ~a~> t

An example would be that Bill has editing rights for finance:

bill ~edit~> finance

Here, too, we expect the privilege to inherit downwards the theme taxonomy:

u ~a~> t ⇒ u ~a~> t' with t' < t

This makes sense: if Bill has editing privileges for finance, he implicitly should also have them for accounting. But since actions are also under a half-order, more privileges can be inferred as well:

u ~a~> t ⇒ u ~a'~> t with a' < a
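These two inference rules can be sketched in code. This is an illustration only (the data structures and names are our own assumptions, not the paper's implementation), and for simplicity it uses a totally ordered action set:

```python
# Sketch: deriving privileges from explicit grants via the theme
# taxonomy (downward inheritance) and the action order (weaker
# actions inferred from stronger ones). All names are hypothetical.

PARENT = {"finance": "thing", "accounting": "finance"}   # t' < t links
ACTION_ORDER = {"read": 0, "edit": 1, "top": 2}          # totally ordered here

GRANTS = {("bill", "finance"): "edit"}                   # bill ~edit~> finance

def ancestors(theme):
    """Yield the theme and all its more general themes."""
    while theme is not None:
        yield theme
        theme = PARENT.get(theme)

def has_privilege(user, action, theme):
    """u ~a~> t holds if some grant on t, or on a more general theme,
    covers an action at least as strong as a."""
    for t in ancestors(theme):
        granted = GRANTS.get((user, t))
        if granted is not None and ACTION_ORDER[granted] >= ACTION_ORDER[action]:
            return True
    return False
```

With a genuine half-order the comparison would be partial, so the `>=` test would have to consult an explicit comparability relation instead.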



In the example we derive from Bill's authorization to edit finance topics also his authorization to read them. Additionally, we define that arbitrary nodes n affiliated to a theme t can be actioned too:

u ~a~> n iff ∃ t : n → t and u ~a~> t

If a node is not (yet) affiliated with a theme, then there is no access information to be inferred for it. In the same way as document states can be derived from the actions, user roles can be defined on the basis of their privileges. Since all privileges are always relative to a theme, though, simply defining that someone is an editor is correct but ultimately loses essential information:

Editor (u) ⇐ ∃ t: u ~edit~> t

Still, we keep this as a notational convenience.

3.2 System Dynamics

According to the set of privileges at a given time, nodes can evolve and move through their respective workflows. Privileges can also change throughout the lifetime of an AAF-governed system, either by extending someone's privileges directly, or indirectly in that nodes are affiliated with themes someone has a privilege on. We model the integrity constraints for this evolution with pre- and post-conditions on node transitions. The preconditions guard certain actions, and the post-conditions characterize the state in terms of the AAF after a node has been acted on. Each of these transactions is atomic. To denote, for example, that every user is entitled to edit his node u into a new version u', we write

< User (u), u ~edit~> u || User (u') >

While the node u undergoes a change, it will maintain its identity.

3.2.1 Bootstrapping
To put an AAF system into an initial state, it has to be bootstrapped into some configuration. The most minimal state is characterized by

superuser ~top~> thing

Superuser is one particular user. What makes him special is that he has the highest privilege (top) on the most abstract thing.

3.2.2 Document Life Cycle
When a new document node is created it will not have any affiliation. Such an outlaw node can only be brought into the realm of a theme t if a user u has top privileges on t:

< Node (n), u ~top~> t || n → t' >

with t’≤ t

In this initiation phase the user has significant responsibility to choose the smallest reasonable subtheme t' of t. Further changes of affiliation can happen later as well, but only in a cumulative manner, so that no existing privileges are hampered. If a node is moved along a workflow axis it always maintains its identity, even when it is modified. A user



u can exert action a on a document node n in workflow state S when a theme t exists such that:

< n → t, n @ S, u ~a~> t || n' >

For the special, predefined actions read and edit we define

< n → t, u ~read~> t || n >
< n → t, u ~edit~> t || n' >

3.2.3 User Life Cycle
When a user node is created, obviously no explicit privileges are defined for that user. Implicitly we allow users to modify their own nodes, mainly to introduce themselves to the swarm of other users. This can be achieved on the policy level by priming every user node u with

u ~edit~> u

During the course of its lifetime a user can acquire new privileges, whereby we distinguish two scenarios:

In the unsolicited scenario a user will be promoted by another user without prior request:

< User (u), User (v), v ~a~> t || u ~a'~> t' >

In any case t' will be equal to or smaller than t if the user v determines that u only needs privileges for a more special theme. The granting user v may also reduce the action level itself, so that a' ≤ a. If a = a', we call this process peer invitation, otherwise delegation. Following our running example, the editor of the finance sector may hand down editing rights for a subtheme to another person:

< User (fred), User (bill), bill ~edit~> finance || fred ~edit~> accounting >

Unsolicited privilege escalation is not likely to be agile, as the process requires that users constantly monitor the needs of other users; this is not something humans are well equipped to do. In the solicited scenario a user first requests certain privileges. These will be responded to at a later point by other users, so that requests are then resolved. To better moderate the process of solicited privilege escalation, we force users to escalate either along the theme taxonomy or, alternatively, along the action half-order. For the latter case we characterize the creation of a privilege request via

< User (u), u ~a~> t || u ~b?~> t >

With u ~b?~> t we symbolize the fact that u wants the privilege to do b on t. Note that u needs to have at least privilege a to launch such a request. To escalate along the taxonomy, the user also needs minimal entry rights:




< User (u), u ~a~> t || u ~a?~> t' >

with t < t'. Every request can be responded to. In the case a request is granted, another user with sufficient privileges will have to come into play:

< u ~b?~> t’, v ~c~> t’’ || u ~a~> t >

Hereby v can choose to reduce the granted action a below the requested b, so that a ≤ b ≤ c. The user v can also choose to reduce the scope of the privilege, so that t ≤ t' ≤ t''. If a is chosen to be bottom, i.e. the smallest action, then the request is effectively rejected.
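The grant-or-reject step can be sketched as follows (a hypothetical illustration; the function, names, and the concrete action set are our assumptions, not the paper's implementation):

```python
# Sketch of resolving a solicited request u ~b?~> t: a granting user v
# with v ~c~> t'' answers with an action a, subject to a <= b <= c.
# Granting bottom amounts to a rejection.

ORDER = {"bottom": 0, "read": 1, "edit": 2, "top": 3}

def resolve_request(requested_action, granter_action, granted_action):
    """Return the action the requester ends up with, "rejected" if the
    grant is bottom, or None if the ordering constraint is violated."""
    a = ORDER[granted_action]
    b = ORDER[requested_action]
    c = ORDER[granter_action]
    if not (a <= b <= c):
        return None            # v may only reduce, never exceed the request
    if granted_action == "bottom":
        return "rejected"      # bottom grant effectively rejects the request
    return granted_action
```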

4. Implementation

4.1 Visibility of Node Aspects

Once an AAF system is implemented, the user interface has to control which aspects of a node are visible to whom. Many of these visibilities will be policy controlled, so the following may vary between deployments. Regardless of the node type, we distinguish between the content itself, the ontological embedding of the node, and the defined (or derived) privileges on it. For reasons of reproducibility and auditing, not only the current information is displayed but also the past history of changes, so that it is evident who got which privilege at which time. While general document nodes follow the generic rules of section 3, user and theme nodes have to be treated differently.

4.1.1 User Nodes
For user nodes the content visibility follows the generic rules above. Ontology-related information cannot be changed after a user node has been created, at least as far as the AAF is concerned. Any ontological content is normally visible to everyone else. Typically all users also see all existing privileges of another particular user. Outgoing privilege requests, i.e. those which are pending, will be listed only for that particular user and for those users who have a stake in the themes involved and whose privilege level is at least on par with the one requested. Otherwise the request will normally not be shown. Incoming privilege requests, i.e. those where a particular user has the privilege level to grant or reject the request, are listed for that very user. As a convenience this list will include past grants and rejections.

4.1.2 Theme Nodes
Again, for the content itself the generic AAF rules apply. As themes are meant to be abstract, only their relationships with other themes can be modified. Every theme node will list the privileges on it, direct or derived.
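The visibility rule for outgoing (pending) requests can be sketched as below. This is our own reduction (names and data are hypothetical): "having a stake" is simplified here to holding a privilege on the theme at or above the requested level.

```python
# Sketch: a pending request is visible to the requester and to users
# whose privilege on the theme is at least the requested level.

LEVEL = {"read": 1, "edit": 2, "top": 3}
PRIVILEGES = {("bill", "finance"): "top", ("ann", "finance"): "read"}

def request_visible_to(viewer, requester, theme, requested_action):
    if viewer == requester:
        return True                              # requester always sees it
    held = PRIVILEGES.get((viewer, theme))
    return held is not None and LEVEL[held] >= LEVEL[requested_action]
```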



4.2 Model Representation with Topic Maps

All AAF-related information can be mapped into a topic map, although auditing information recording events is less suited to this representation and will normally be kept separate. The initial topic map will have to contain the ontological commitments, namely that every node is either a general document node, a user node or a theme node:

user subclasses node
theme subclasses node

The predicate User(u) is then true when u is a user node in the map. Similarly this holds for themes. Another commitment is the list of actions involved. The predefined ones will appear in any case:

read isa action
edit isa action

But more can and should be readily added. Any ordering between actions is represented by an association of type comparison:

comparison (stronger: edit, weaker: read)

From the totality of all comparisons the smallest (bottom) and the biggest (top) action follow. All nodes are directly represented as Topic Map topics. For theme topics any taxonomic information is directly modeled with the onboard means available for Topic Maps, specifically transitive subclassing and instance-of relationships. In any case we flag theme nodes as instances of theme, e.g.

finance isa theme

In the same vein, user nodes are marked as instances of user:

bill isa user

All content nodes are simply instances of node. If such a node is affiliated with a theme, this is modeled with an otherwise arbitrary association, for example for budget_2008:

dc:subject (node: budget_2008, theme: finance)

Hereby we made use of the predefined subject property from the Dublin Core vocabulary. Whenever a user is granted a privilege relative to a theme, this will also be represented natively in the map:

privilege (theme: finance, user: bill, action: edit)

Requests for privileges look similar, except that the associations representing them are scoped as pending:

privilege @ pending (theme: finance, .... )
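As a toy sketch of how such a map could be queried (our own tuple-based modelling, not a Topic Maps engine; all identifiers are assumptions), the affiliation and privilege associations above suffice to answer the access question u ~a~> n:

```python
# Sketch: AAF assertions held as simple tuples, with a privilege
# lookup over them. A real system would use a Topic Maps engine.

ASSOCIATIONS = [
    ("isa", "finance", "theme"),
    ("isa", "bill", "user"),
    ("dc:subject", "budget_2008", "finance"),        # node affiliated to theme
    ("privilege", ("finance", "bill", "edit")),      # bill ~edit~> finance
]

def themes_of(node):
    """All themes t with node -> t in the map."""
    return [o for a in ASSOCIATIONS
            if a[0] == "dc:subject" and a[1] == node
            for o in [a[2]]]

def may(user, action, node):
    """u ~a~> n iff there is a theme t with n -> t and u ~a~> t."""
    return any(("privilege", (t, user, action)) in ASSOCIATIONS
               for t in themes_of(node))
```

Downward inheritance along the taxonomy and the action order (section 3.1.4) would be layered on top of this lookup.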

5. Conclusions

The formalized version of the AAF is the end result of a series of prototype implementations using a scripting language together with one of the mainstream wiki packages (Perl + TWiki). In hindsight, that particular platform has not proven to be flexible enough for two necessary adaptations to an existing system:
• the implantation of an ontological backbone in which to host taxonomic and other semantic network information, and
• the injection of an authorization layer to implement the AAF's functionality.

The main motivation for formalization lied in the perspective to better analyze security implications and to describe possible attack vectors. It also opens a foundation to formulate statistical means to subvert an AAF-operated system. At this stage we have little operational experience with an AAF deployment, not only because of the unsuitability of the chosen platform, but also mostly because of a lack of a mature Topic Maps implementation which allows to quickly scale to hundreds of users and thousands of topics. This shortcoming had been addressed recently, so that a reimplementation with a CMS but also a conceptual integration with content frameworks such as JSR-283[11] can be attempted. To substantiate our claim that an AAF-driven system will cause and sustain an adequate and balanced privilege distribution, our efforts will have to concentrate on developing metrics to measure this balance. It is yet unclear whether such metrics will depend on the social setting, be it a corporate environment, a group of cooperating NGOs or independent individuals. 6.

6. Notes and References

[1] Barbera, Michele; Di Donato, Francesca. Weaving the Web of Science: HyperJournal and the Impact of the Semantic Web on Scientific Publishing. Proceedings of the 10th International Conference on Electronic Publishing, Bansko, Bulgaria, 14-16 June 2006.
[2] Oren, Eyal; Delbru, Renaud; Moeller, Knud; Voelkel, Max; Handschuh, Siegfried. Annotation and Navigation in Semantic Wikis. Proceedings of the First Workshop on Semantic Wikis, ed. Max Voelkel, 2006.
[3] Costa Oliveira, Edgard; Lima-Marques, Mamede. An Architecture of Authoring Environments for the Semantic Web. Proceedings of the 10th International Conference on Electronic Publishing, Bansko, Bulgaria, 14-16 June 2006.
[4] Lassila, O.; Swick, R. Resource Description Framework (RDF) Model and Syntax Specification. Technical report, W3C.
[5] Garshol, Lars Marius; Moore, Graham. TMDM, ISO 13250-2: Topic Maps - Data Model, 2003-11-02.
[6] Barta, Robert. Knowledge-Oriented Middleware Using Topic Maps. TMRA 2007, Leipzig (to appear in TMRA 2007 Proceedings, Springer LNCS/LNAI).
[7] Pepper, S.; Moore, G. XML Topic Maps (XTM) 1.0. TopicMaps.Org, http://www.topicmaps.org/xtm/1.0/, 2001.
[8] Sowa, J., et al. Knowledge Representation: Logical, Philosophical and Computational Foundations. Brooks-Cole, Pacific Grove, 2000.
[9] International Organization for Standardisation. ISO/IEC 13250, Information Technology – SGML Applications – Topic Maps. ISO, Geneva, 2000.
[10] Pepper, S. The TAO of Topic Maps – Finding the Way in the Age of Infoglut. Ontopia, http://www.ontopia.net/topicmaps/materials/tao.html, April 2002.
[11] JSR 283: Content Repository for Java Technology API Version 2.0. http://jcp.org/en/jsr/detail?id=283

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008


AudioKrant, the daily spoken newspaper

Bert Paepen
Katholieke Universiteit Leuven – Centre for Usability Research (CUO)
Parkstraat 45 Bus 3605 - 3000 Leuven
e-mail: bert.paepen@soc.kuleuven.be

Abstract

Being subscribed to a newspaper, readers expect some basic things: receiving their paper in their mailbox early in the morning, being able to read it privately when and where they want, reading first what they find most interesting, etc. For people with a reading disability none of this is obvious, as few accessible alternatives exist; accessible news on a daily basis is virtually nonexistent. Since the number of visually disabled persons follows the rise in the ageing population, an increasing number of citizens is thus being debarred from a daily news reading experience. At present Belgium is one of the rare countries publishing a daily newspaper accessible to readers with a visual impairment, in both a Braille print and an electronic version. Notwithstanding major accessibility improvements over a printed newspaper, these newspapers still present important barriers for many visually impaired readers: reading requires specific skills and/or equipment, such as the ability to interpret Braille or the availability of a personal computer, a screen reader, a speech synthesizer or an internet connection. The goal of the AudioKrant project was to develop a new, universally accessible news publication with a minimal learning curve, aimed at a wide range of potential readers: the “talking newspaper”. Thanks to significant progress in text-to-speech technology it is today possible to produce a newspaper read by a computer voice that is understandable, has an acceptable speech quality and is even pleasant to listen to. This paper explains how the talking newspaper is produced, what formats and technology are used, what the current status and challenges are, and what future improvements can be anticipated.

Keywords: newspaper; accessibility; Daisy; DTB (digital talking books)

1. Introduction

According to the European Blind Union, 1 in 30 people are blind or partially sighted. Blindness and partial sight are closely associated with old age, so as people live longer the number of visually impaired persons is increasing. Nearly 90% of all blind and partially sighted people in Europe are over the age of 60, and two thirds are over the age of 65 [2][3]. In several countries initiatives exist for publishing news to readers with a visual impairment. Mostly this takes the form of an audio book containing a daily or weekly selection of news articles, read by a human voice, or a Braille book, also with a selection of articles. Of course this is a major improvement for disabled readers, but it is still far from the reading experience offered by a traditional print paper: accessible news should also be complete, recent and allow a private reading experience.

At present Belgium is one of the rare countries publishing a daily newspaper accessible to readers with a visual impairment. Both a Braille print and an electronic version are published on a day-to-day basis. Subscribers can read these papers either by “feeling” the dots on Braille printed papers or by listening to a text-to-speech synthesizer on their computer [1]. Notwithstanding major accessibility improvements over a printed newspaper, the Braille and electronic newspapers still present important barriers for many visually impaired readers. Reading requires specific skills and/or equipment, such as the ability to interpret Braille or the availability of a personal computer, a screen reader, a speech synthesizer or an internet connection. Since a growing number of elderly readers have difficulties reading a printed paper and at the same time are unable to learn Braille or to operate a computer, an increasing number of people are excluded from getting information from a newspaper. Given the rise in age-related visual disabilities there is a clear need for a new, universally accessible newspaper publication with a minimal learning curve.

The AudioKrant project has developed such a “talking newspaper”, which is targeted not only at visually impaired persons, but also at elderly persons and people with a reading disability such as dyslexia, a motor disability or language problems. The aim was to arrive at a very simple product, requiring as few skills as possible and thus being accessible to a wide range of potential readers. This could include for example elderly persons whose sight does not allow them to read the printed paper, but who do not understand Braille or know how to operate a computer. For this reason the talking newspaper is distributed on a CD-ROM by surface mail. In the simplest case it can be listened to on a “Daisy” player, but a computer with specialized software or even a regular MP3 player is also possible. This paper explains how the talking newspaper is produced, what formats are used, what the current status and challenges are, and what future improvements can be anticipated.

2. Daisy

In the DiGiKrant project we used the Daisy format, an XML standard for digital talking books (DTB), for producing accessible electronic newspapers. Several types of books can be stored using this format: audio books, text books and audio-text combinations [4]. For the DiGiKrant we used the text-only variant. This keeps the file sizes very small, so that the paper can easily be transferred by e-mail. The downside is that the electronic text still needs to be converted to an accessible format by means of a Braille screen reader or a speech synthesizer on the reader’s personal computer. This requires the reader to own a computer, an internet connection, accessibility software and/or hardware, and the skills to operate all that. To avoid these possible barriers at the reader’s side, the talking newspaper includes both text and its audio representation (hence “talking” newspaper). Technically this means that the spoken version of the text is created at the producer’s side instead of the reader’s side. As a consequence the reader does not need a computer: a small Daisy player or mp3 player will do (although it can still be read on a computer as well). A digital talking Daisy book contains a set of mp3 audio files, HTML text files, SMIL synchronization files and an HTML navigation file. Thanks to the latter the reader can browse through the book’s contents in a structural way, jumping to chapters and paragraphs or skipping to the next sentence. SMIL (Synchronized Multimedia Integration Language) [6] enables synchronization between the text and audio version of a book down to the smallest available navigation level. When reading a DTB on a computer, users can see and hear the book’s structure, navigate to its sections, and read through its paragraphs. Thanks to the SMIL synchronization, words or sentences can be highlighted at the moment they are spoken (see Figure 1, displaying a newspaper fragment in the application EaseReader).
Highlighting can be helpful, for example, for dyslexic readers. A DTB can also be read using a more compact Daisy player or even a regular mp3 player. The former interprets the navigation file, so that structural reading is possible. A simple four-arrow-button interface makes operating a Daisy player very easy. Figure 2 displays a Daisy reader with a simple and straightforward design, with the usual play/pause, rewind and forward buttons known from a CD player. Four navigation buttons (up, down, left, right) in the middle of the device allow structural reading: the up and down buttons select the navigation level, the left and right buttons navigate through the items at the chosen level.
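The text-audio pairing that SMIL expresses can be sketched as follows. The element and attribute names follow SMIL conventions (`clip-begin`/`clip-end` as used in Daisy books), but the file names, ids and timings are invented for illustration:

```python
import xml.etree.ElementTree as ET

def smil_par(par_id, text_src, audio_src, clip_begin, clip_end):
    """One <par> element: display a text fragment while its audio clip plays."""
    par = ET.Element("par", {"id": par_id})
    ET.SubElement(par, "text", {"src": text_src})
    ET.SubElement(par, "audio", {"src": audio_src,
                                 "clip-begin": clip_begin,
                                 "clip-end": clip_end})
    return par

body = ET.Element("body")
seq = ET.SubElement(body, "seq")
# First sentence of an article: text anchor in the HTML file, 0-3.2 s of the mp3.
seq.append(smil_par("s1", "article1.html#sent1", "article1.mp3",
                    "npt=0.000s", "npt=3.200s"))
print(ET.tostring(body, encoding="unicode"))
```

Each sentence gets its own `<par>`, which is what lets a player highlight the sentence exactly while its clip plays and lets the reader skip clip by clip.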


An example newspaper could contain three navigation levels: sections, subsections and articles. The reader will start with the first article in the first section. When the user pushes the down button, the device reads out the current navigation level (1). At a second push on the down button the device changes the level to 2 (corresponding to the subsections) and reads out this new level. After pushing the left button the reader navigates to the next subsection and starts reading its first article. This way, navigation is possible down to the level of individual sentences or even words, as long as the DTB is structured to this level of detail.
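The four-button navigation just described can be sketched as a small state machine. The table-of-contents format, the class and which arrow advances are illustrative, not AudioKrant code:

```python
# Sketch of Daisy-style structural navigation over a flattened table of contents.
class DaisyNavigator:
    def __init__(self, toc, max_level=3):
        # toc: (level, title) entries in reading order;
        # level 1 = section, 2 = subsection, 3 = article.
        self.toc = toc
        self.max_level = max_level
        self.level = 1   # current navigation granularity
        self.pos = 0     # current position in the book

    def down(self):
        """Down arrow: switch to a finer level and announce it."""
        self.level = min(self.level + 1, self.max_level)
        return f"level {self.level}"

    def up(self):
        """Up arrow: switch to a coarser level and announce it."""
        self.level = max(self.level - 1, 1)
        return f"level {self.level}"

    def next_item(self):
        """Left/right arrow: jump to the next entry at the chosen level
        (or coarser) and start reading there."""
        for i in range(self.pos + 1, len(self.toc)):
            if self.toc[i][0] <= self.level:
                self.pos = i
                break
        return self.toc[self.pos][1]

toc = [(1, "Front page"), (1, "Sports"), (2, "Soccer"),
       (3, "Match report"), (2, "Baseball")]
nav = DaisyNavigator(toc)
nav.down()              # announce "level 2": navigate by subsection
print(nav.next_item())  # skips the article level, lands on "Sports"
```

Setting the level first and then stepping with a single "next" action is what makes the four-button interface usable without sight: the device always announces where it landed.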

Figure 1: reading a DTB on a computer

Figure 2: Daisy reader

A list of Daisy players and software can be found in [5]. A classical mp3 player does not allow structural reading but can still play the successive paragraphs in sequential order.

3. Accessible newspaper

For obtaining an accessible newspaper the Daisy format is only a starting point: several key requirements determine whether the newspaper will really be accessible to and usable by readers with a wide range of visual disabilities. First, the newspaper structure is of key importance. It should be clear, well-organized and simple, and it should resemble the structure of a printed newspaper, using the same type of columns and a recognizable order of sections. Page numbering should allow referencing between the printed and the spoken version of the newspaper. The audio newspaper typically includes four navigation levels, from sections (like Front page, Politics, Economics, Local news and Sports) and subsections (like Soccer, Baseball and Basketball) through articles down to the lowest level of individual sentences. Such a structure allows “structural reading” of the DTB, meaning that the reader starts from the navigational “tree” structure of the book to browse to a specific part of its content. Second, navigation through the paper should be straightforward and fast. Readers should have an immediate view of the paper’s contents, seeing the sections, subsections and the number of articles in each section. Figure 3 displays an extract of a newspaper structure, showing the section titles and the number of articles in each section between brackets. For the audio newspaper we chose to announce both the number of articles and the number of subsections (if any) in each section. For example: “Sports (23 articles and 4 subsections)”.

Figure 3: example newspaper structure

Readers should also be able to jump from one section to another or from one article to another, and to quickly skip the remainder of a sentence or paragraph. The four-button interface described above makes this possible if a sufficient level of detail is provided in the newspaper structure. This type of user interaction makes “sequential reading” more efficient. In the audio newspaper sequential reading is further improved by two types of tunes, marking the end of a news article or the end of a section. Without these tunes it could be unclear when a new article or section has started, as the reading software or hardware just continues reading. Finally, the quality of the newspaper’s contents, both text and audio, should be impeccable. This seems obvious, but with a (semi-)automatic production process it is not an easy goal to achieve. These requirements formed the basis of the analysis, design and implementation work for the “production wizard”, an application allowing the daily production of the accessible audio newspaper.
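A label such as "Sports (23 articles and 4 subsections)" can be derived mechanically from the newspaper tree. This sketch uses an invented structure format, counting articles recursively through subsections:

```python
# Illustrative newspaper tree: each section has articles and optional subsections.
def count_articles(section):
    """Total articles in a section, including those in its subsections."""
    n = len(section.get("articles", []))
    for sub in section.get("subsections", {}).values():
        n += count_articles(sub)
    return n

def section_label(name, section):
    """Build an announcement like "Sports (4 articles and 2 subsections)"."""
    n_art = count_articles(section)
    subs = section.get("subsections", {})
    label = f"{name} ({n_art} article{'s' if n_art != 1 else ''}"
    if subs:
        label += f" and {len(subs)} subsection{'s' if len(subs) != 1 else ''}"
    return label + ")"

sports = {"articles": ["a1"],
          "subsections": {"Soccer": {"articles": ["a2", "a3"]},
                          "Baseball": {"articles": ["a4"]}}}
print(section_label("Sports", sports))  # Sports (4 articles and 2 subsections)
```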

4. Production process and challenges

The accessible newspaper in its three forms (Braille, text and audio) is produced at nighttime between the journalists’ deadline and the postal truck’s departure time, leaving a very short production window (less than two hours). For this reason the production process of the spoken newspaper was optimized for efficiency, leaving little room for error and manual intervention.

Figure 4: accessible newspaper production in three forms (news articles, TV schedules and stock exchange figures flow into the production step, which outputs De AudioKrant in audio, De DiGiKrant in text and De BrailleKrant in braille)

As a first production step content is gathered from various sources, including newspaper articles, TV schedules and stock exchange figures. Most of the input files are in XML format, like the news article displayed in Figure 5, while some still use a text-based format with simplified tags, like the stock exchange example in Figure 6.

Figure 5: news article source fragment

Figure 6: stock exchange figures fragment


AudioKrant, the daily spoken newspaper

All these files are filtered to improve the quality of the resulting newspaper and are converted to a central XML format. Thanks to this centralized format the conversion software is flexible in processing any type of input into any type of accessible output. Filtering is important for obtaining a high-quality electronic publication from a source that is intended purely for paper publishing. An article title, for example, might be missing from the source file because it was only available in graphical form for printing. In this case a title needs to be generated from the article content. In the next step the input file is inserted at the right place in the newspaper structure; the structure is gradually built up as new files arrive. Building this structure is one of the most difficult tasks in the automatic production process: while the input is article-oriented (each file contains only one news article), the output is newspaper-oriented (containing the full structure). During the third production step, depicted using the Daisy logo in Figure 7, news articles are converted into their “spoken” version using a text-to-speech converter (or speech synthesizer) such as RealSpeak [7]. At the same time a SMIL synchronization file is created, linking the written text to its spoken audio representation. Because of time constraints it is impossible to have a full daily newspaper read by a human voice, knowing that a complete newspaper can take up to 20 hours of speech. Speech generation software has improved immensely since the typical computer voices of the early days, creating speech that is not only understandable but even pleasant to listen to. Speech can be improved further by a rule set, defining how certain characters should be read (for example: & should be “and” instead of “ampersand”), and a pronunciation dictionary, defining how specific words should be read.
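The rule set and pronunciation dictionary can be thought of as a simple preprocessing pass applied before synthesis. The rules and dictionary entries below are invented examples, not the production data:

```python
import re

# Character rules: how certain symbols should be read out.
CHARACTER_RULES = [(r"&", " and "), (r"%", " percent ")]

# Pronunciation dictionary: how specific words should be spoken.
PRONUNCIATIONS = {"vzw": "V Z W"}  # e.g. spell out an abbreviation

def preprocess_for_tts(text):
    """Apply character rules, then substitute dictionary pronunciations."""
    for pattern, spoken in CHARACTER_RULES:
        text = re.sub(pattern, spoken, text)
    words = [PRONUNCIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)

print(preprocess_for_tts("R&D up 5%"))  # R and D up 5 percent
```

Keeping both tables as data rather than code means the production crew can extend them without touching the converter itself.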

Figure 7: audio newspaper production process (news articles and TV schedules feed the Daisy conversion step, yielding the structured audio newspaper with sections such as Front page, Politics and Sports)

Finally the output is created as a complete DTB in Daisy format, containing the navigation file, the news articles in text format and in audio format (mp3), and the SMIL files. This end result is burned to a CD-ROM, duplicated, packed and sent by surface mail to the subscribers overnight. As with a regular newspaper, subscribers should receive their audio newspaper in their mailbox in the morning.

Several technical challenges arise from the fact that the source material, received from the newspaper publisher, is optimized for print rather than for a digital and accessible product. Some information is only available in graphic form, leaving no room for an accessible version, tables are poorly exported and even some article headlines are missing. The first solution for these problems was to try to obtain better source data from the publisher, for example sports results in structured tables. In cases where the publisher cannot provide better quality data, the production software tries to improve quality by means of several text filters. A second major project challenge was the time available for production. While the deadline for journalists is around 21:00 h, the first articles become available in XML format from around 22:00 h. At 00:30 h the first shipment leaves, giving only about 2.5 hours for the production of the accessible newspapers. Every aspect of the production software was designed and developed for optimal production speed, for example running several speech processors in parallel threads, as the text-to-speech module is the most CPU-intensive task of the entire process. Today the total production time for all newspapers (in total about 500 MB in file size) averages around 1 h 20 min, not including the time needed for duplicating, packing and transporting to the shipping department.
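The parallel speech processing mentioned above can be sketched with a worker pool. Here `synthesize()` is a hypothetical stand-in for the actual text-to-speech call (RealSpeak in the project):

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize(article):
    """Stand-in for the CPU-heavy TTS call; returns the produced mp3 name."""
    return article.replace(".xml", ".mp3")

articles = [f"article{i}.xml" for i in range(1, 7)]

# Run several speech processors at once; map() keeps results in input order,
# so the newspaper structure is preserved.
with ThreadPoolExecutor(max_workers=3) as pool:
    mp3s = list(pool.map(synthesize, articles))

print(mp3s[0])  # article1.mp3
```

For a purely CPU-bound Python synthesizer a `ProcessPoolExecutor` would sidestep the interpreter lock; with a native TTS engine, threads suffice because the heavy work happens outside Python.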

5. Future work

The audio newspaper was developed between May 2007 and May 2008 and was launched on June 2nd, 2008, with a press conference and seminar on June 6th in Brussels. As of that date the production wizard is operational for the accessible newspapers’ production crew, allowing them to publish their products on a day-to-day basis. After a few months of beta testing it is clear that, although we are ready to produce a daily audio newspaper of acceptable quality, not all technical challenges have been conquered yet. Especially the time constraints, tied to the physical delivery of the CD-ROMs and the late availability of source material, are still a daily challenge. As a result the first delivery group (leaving at 00:10 h) today receives a newspaper with less content and structure than the second group (leaving at 2:00 h).

We are working on two levels to solve the problems related to the short production time. First, the publisher is working on an earlier delivery of the source material. Of course the journalists’ deadline cannot be changed, but the accessible newspaper production process does not have to wait for the paper production process before starting. We are now trying to obtain news articles as soon as they are positioned in the newspaper’s layout, even if they are not yet set up on the printing plate. This gives some extra production time for the audio newspaper. A second solution for the (too) short production time could lie in non-physical distribution channels such as the internet. If distributed over the internet to its subscribers, the audio newspaper production could be postponed until later at night, when all source material is available. Although this might sound like an unnatural distribution channel, given that the target audience does not have a computer or internet connection, several user-friendly solutions exist for bringing the content to the reader automatically.
The ORION Webbox, for example, is a device that downloads new content from the internet overnight, allowing readers to start enjoying their fresh newspaper as soon as they get up in the morning, all without manual intervention. Knowing that an audio newspaper in mp3 format averages about 350 MB of data per day, a decent internet connection is necessary. As soon as this type of distribution is used, the increased production time allows new features improving the newspaper’s quality, such as personalization. Subscribers could choose which types of content they are interested in (e.g. sports but not economy) and receive a customized newspaper every day.



6. Conclusions

To anyone hearing computer-generated voices during the 1980s, it was unimaginable that people would listen to such a voice reading an entire newspaper. Significant progress in text-to-speech technology today allows products such as a spoken newspaper that is understandable, has an acceptable speech quality and is even pleasant to listen to. One of the major achievements of the spoken newspaper for visually impaired persons is that it gives them back the opportunity to enjoy a daily, individual and private news reading experience. With a small player one can read anywhere, anytime and at one’s own pace, without needing any assistance. In a world where ubiquitous information access is becoming commonplace, this can help impaired persons to be included and to overcome the digital divide.

7. Notes and References

[1] Paepen, B.; Engelen, J. Braillekrant and DiGiKrant: a Daily Newspaper for Visually Disabled Readers. Proceedings of the 9th ICCC International Conference on Electronic Publishing, June 2005, Leuven, Belgium: Peeters Publishing Leuven, pp. 197-202.
[2] Eurostat. Health statistics – Key data on health 2002 – Data 1970-2001. http://epp.eurostat.ec.europa.eu/cache/ITY_OFFPUB/KS-08-02-002/EN/KS-08-02-002-EN.PDF, p. 144, 2004.
[3] European Blind Union. A Vision for Inclusion - A Guide to the European Blind Union. http://www.euroblind.org/fichiersGB/visincen.html, 2004.
[4] Daisy Consortium. Technology Overview - What is a DTB? http://www.daisy.org/about_us/dtbooks.asp, 2008.
[5] Daisy Consortium. Playback Tools. http://www.daisy.org/tools/tools.shtml?Cat=playback, 2008.
[6] W3C. Synchronized Multimedia. http://www.w3.org/AudioVideo/, 2008.
[7] Nuance Communications, Inc. RealSpeak. http://www.nuance.com/realspeak/, 2008.



A Semantic Web Powered Distributed Digital Library System

Michele Barbera (1); Michele Nucci (2); Daniel Hahn (1); Christian Morbidoni (2)
(1) Net7 – Internet Open Solutions, Via Marche 8/a, 56123 Pisa, Italy
e-mail: barbera@netseven.it; hahn@netseven.it
(2) Dipartimento di Elettronica, Intelligenza artificiale e Telecomunicazioni, Università Politecnica delle Marche, Via Brecce bianche, 60100 Ancona, Italy
e-mail: mik.nucci@gmail.com; christian@deit.univpm.it

Abstract

Research in the humanities and social sciences is traditionally based on printed publications such as manuscripts, personal correspondence, first editions and other types of documents which are often difficult to obtain. An important step to facilitate humanities and social sciences scholarship is to make digital reproductions of these materials freely available on-line. The collection of resources available on-line is continuously expanding, and it is now necessary to develop tools to access these resources in an intelligent way and to search them as if they were part of a single information space. In this paper we present Talia, an innovative distributed semantic digital library, annotation and publishing system, which is specifically designed for the needs of scholarly research in the humanities. Talia is strictly based on standard Semantic Web technologies and uses ontologies for the organization of knowledge, which can support the definition of a highly adaptable, state-of-the-art research and publishing environment. Talia provides an innovative and flexible system which enables data interoperability and new paradigms for information enrichment, data retrieval and navigation. Moreover, digital libraries powered by Talia can be joined into a federation, to create a distributed peer-to-peer network of interconnected libraries. In the first three sections we introduce the motivations and the background that led to the development of Talia. In sections 4 and 5 we describe Talia’s architecture and the Talia Federation. In sections 6 and 7 we focus on Talia’s specialized features for the humanities domain and its relations with the Discovery Project. In section 9 we describe Talia’s widget framework and how it can be used to customize Talia for other domains. In the final section we compare Talia with related technologies and platforms and suggest some possible future research and development directions.
Keywords: digital library; semantic web; humanities.

1. Introduction

In the last few years the amount of digital scholarly resources in the humanities has grown substantially thanks to the efforts of many collection holders who digitized their materials and published them on-line. However, to date, many digital library projects can be characterized as both strongly hierarchical (top-down) and disconnected. Materials are selected for reformatting and inclusion by librarians, archivists and curators from their own collections, to the presumed benefit of their patrons but with little actual consent from them. Collections are often assembled with little regard for existing complementary materials, leaving it to the end-user to make and sustain the connections across collections, so that collections remain fundamentally siloed, with no way to establish permanent semantic connections between their contents. In digital research libraries there is no longer any need to abide by the restrictions of physical organizational schemes or even physical location. New research libraries can and should be built across collections and across libraries. On the other side of this spectrum lie the large-scale aggregation initiatives, such as The European Library [1] or OAIster [2], which serve as general-purpose digital libraries but fail to provide the depth needed for research-level scholarship. The emergence of Web 2.0 has resulted in a number of tools and technologies for annotation and personalization of resources, but these tools have yet to gain a strong foothold in a humanistic academic setting. We believe that Semantic Web technologies have the potential to glue together the opposing needs of maintaining the context in which the collections originate, by leaving them under the control of their holders, and at the same time making resources part of a global structured knowledge space that is independent of a single centralized authority or aggregation service.

2. Semantic Web and Ontologies

The Semantic Web is an extension of the current Web in which information can be expressed in a machine-understandable format and can be processed automatically by software agents. The Semantic Web enables data interoperability, allowing data to be shared and reused across heterogeneous applications and communities [3]. The Semantic Web is mainly based on the Resource Description Framework (RDF) [4], with which it is possible to define relations among different data, creating semantic networks. RDF’s main strength is simplicity: it is a network of nodes connected by directed and labelled arcs (Figure 1).
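The node-and-arc model can be illustrated with plain Python tuples as subject-predicate-object triples; the URIs below are invented examples, not actual Talia identifiers:

```python
# A tiny RDF-style graph: each triple is (subject, predicate, object),
# where predicates are the labelled arcs between nodes.
triples = {
    ("ex:talia", "rdf:type", "ex:DigitalLibrary"),
    ("ex:talia", "ex:developedBy", "ex:discoveryProject"),
    ("ex:talia", "ex:uses", "ex:rdf"),
}

def objects(graph, subject, predicate):
    """All arc targets for a given node and labelled arc."""
    return {o for s, p, o in graph if s == subject and p == predicate}

print(objects(triples, "ex:talia", "ex:uses"))  # {'ex:rdf'}
```

Because every statement has the same three-part shape, graphs from different sources can be merged by simple set union, which is the interoperability property the text describes.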

Figure 1: An example of an RDF Semantic Network

The nodes are resources or literals (values), while arcs are used to express properties of resources. In the Semantic Web a resource is anything that can be identified by a specific identifier; in particular, the identifiers used are known as Uniform Resource Identifiers (URIs). In the Semantic Web, ontologies are used to organize information and formally describe concepts of a domain of interest. An ontology is essentially a vocabulary which includes a set of terms and relations among them. Ontologies can be developed using specific ontology languages such as the RDF Schema Language (RDFS) and the Web Ontology Language (OWL).


3. Digital libraries for all

In recent years, decreasing digitization and storage costs, as well as the emergence of many easy-to-deploy Open Source content management systems and digital object repositories, have led to the multiplication of small digital libraries run by smaller institutions. Despite their limited size, the collections of these libraries sometimes include cultural masterpieces. Unfortunately, due to limited resources, these libraries cannot afford to invest in professional digital library management platforms that are either too expensive in terms of license costs or too expensive to maintain because of their complexity. Talia is an Open Source semantic digital library management system that is easy to deploy and maintain. Building a Talia-based digital library doesn’t require the advanced software development and management skills that smaller cultural institutions may not possess. Additionally, Talia is a distributed library system, meaning that it makes it possible to build virtual collections that go beyond the boundaries of a single archive without requiring the underlying content providers to lose control over their holdings. For all the reasons stated above, Talia aims at being a complete yet powerful solution for the needs of smaller institutions’ digital libraries.

4. The Talia Platform

Talia is a distributed semantic digital library system which has been specifically designed for the needs of scholarly research in the social sciences and humanities. Talia combines the features of a digital archive management system with an on-line peer review system; it is thus capable of combining a digital library with an electronic publishing system. Talia is able to handle a wide range of different kinds of resources, such as texts, images and videos. All the resources published in Talia are identified by a stable URI: documents can never be removed once they are published and are maintained in a fixed state in perpetuity. This, combined with other long-term preservation techniques, allows scholars to reference their works and gives the research community immediate access to new contents. One of the most innovative aspects of Talia is that it is completely based on Semantic Web technologies, which enable deep data interoperability with other Semantic Web aware tools and applications. In particular, the Talia knowledge base is kept in RDF and is formally described using RDFS/OWL ontologies. Talia natively supports heterogeneous data sets whose metadata schemes can be very different from each other; the system is therefore not based on a predefined ontology. Talia embeds only a very broad structural ontology, which contains only general concepts and basic relations to link resources. Research communities are encouraged to develop their own domain ontology to describe knowledge and content in their domains of interest. The domain ontologies can be developed using standard ontology languages such as RDFS and easily imported into the library’s data store.

Figure 2: Talia Architecture

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008


A Semantic Web Powered Distributed Digital Library System

Talia also includes facilities for semantic data enrichment, data annotation and data retrieval, as well as a number of other specialized tools. A complete overview of the Talia architecture and the underlying technical details can be found in [5] and [6].

5. Distributed semantic digital libraries

Digital semantic libraries based on Talia can be joined in a Talia Federation to create a peer-to-peer network of interconnected libraries (nodes). Talia provides a mechanism, based on a REST interface following the approach proposed in [7], to share parts of its knowledge base. Using this feature, a node can notify another node that a semantic link between them exists. The notified node can then use this information to update its own knowledge base and create a bidirectional connection between the contents. This feature allows individual scholarly communities, each managing a single node of the federation, to retain control over their own content while at the same time interacting strongly with the content of other nodes. Talia works both as a digital library and as an Open Access publisher of original contributions. In a digital library, a notion of absolute quality can be acceptable even outside the boundaries of the community that manages the library. The concept of quality for newly published contributions, on the other hand, varies significantly with culture and context, so each community must retain control over what its users see through that community's web site. The approach used in Talia is to let each node decide which other federation nodes it trusts. Notifications of incoming links are then processed only if they come from trusted nodes. As a result, end users see backlinks only to content held in trusted sources. At any time, a node administrator can modify the trust policy and recover the notifications that were previously filtered out by the trust engine. Talia also features a single sign-on mechanism based on OpenID [8]. A Talia node acts as an OpenID client. Depending on its own policies, a federation can run its own OpenID identity server or rely on an existing external service.
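The trust-filtered notification mechanism described above can be sketched as follows. This is a minimal illustration with invented names, not Talia's actual REST API: a node records backlinks only from trusted peers, but keeps filtered-out notifications so that a later change of trust policy can recover them.

```python
# Hypothetical sketch of trust-filtered link notification between
# federation nodes (names and structure are illustrative only).

class FederationNode:
    def __init__(self, trusted):
        self.trusted = set(trusted)   # URIs of trusted peer nodes
        self.backlinks = {}           # local resource URI -> set of remote URIs
        self.quarantine = []          # notifications filtered out by the trust engine

    def notify_link(self, sender, local_uri, remote_uri):
        """Called when a peer reports a semantic link to one of our resources."""
        if sender not in self.trusted:
            # Not discarded: kept so a changed trust policy can recover it.
            self.quarantine.append((sender, local_uri, remote_uri))
            return False
        self.backlinks.setdefault(local_uri, set()).add(remote_uri)
        return True

    def trust(self, sender):
        """Extend the trust policy and replay quarantined notifications."""
        self.trusted.add(sender)
        kept = []
        for s, local, remote in self.quarantine:
            if s == sender:
                self.notify_link(s, local, remote)
            else:
                kept.append((s, local, remote))
        self.quarantine = kept
```

Widening the trust set replays the quarantined notifications, matching the recovery behaviour described above.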
Each federation node keeps a copy of the user credentials, and user roles and permissions are managed locally. Like any other Talia component, the authentication and authorization component is pluggable and completely modular. It will therefore be possible in the future to develop authentication and authorization components based on other infrastructures, such as a more institution-centric approach based on the Shibboleth [9] model, or on any other legacy model.

6. Digital Humanities

Projects in the domain of Digital Humanities deal with an enormous variety of data types, ranging from manuscript reproductions to statistical linguistic metadata, pictures of historically relevant places, maps and many different kinds of books in diverse digital formats, to name just a few. The level of standardization of data and metadata formats, and especially of the research process, is very limited compared with the natural sciences. Another important characteristic of this domain is that the level of computer literacy among Humanities scholars tends to be rather low compared to scholars in other sectors. These two facts (the heterogeneity of data and processes, and the low computer literacy of the users) call for an electronic environment that is extremely flexible, adaptable and extensible, but at the same time integrated and easy to use. Talia is a coherent and easy-to-use web-based working environment that integrates a set of features usually scattered across many different desktop and web-based tools rather than condensed in a single environment. These tools include, for example, XML tagging and transformation, image and audio analysis, linguistic text analysis, manuscript annotation, electronic edition editing and so on. Talia's widget system makes it possible to extend the core engine and easily integrate these tools into a unique




Michele Barbera; Michele Nucci; Daniel Hahn, Christian Morbidoni

infrastructure. In the context of the Discovery project we will develop a number of these tools, as well as documentation on how to develop additional plug-ins. We hope that the Open Source community and other Digital Library projects will contribute additional plug-ins in the near future.

7. The Discovery project

The Discovery project [10], funded by the European Commission under the eContentplus programme, aims at creating a federation of digital libraries dedicated to different authors and themes of ancient, modern and contemporary philosophy. Talia was born in the context of the Discovery project to serve as its technological infrastructure. The federated libraries that are part of the Discovery federation are unique in that they function at the same time as traditional digital libraries and as Open Access publishers of original contributions. With this model, Discovery aims to stimulate the production of new knowledge by aggregating scholarly communities around thematic repositories of both primary sources (such as manuscripts and first editions) and original contributions submitted by scholars. Additionally, the nodes of the Discovery federation can also store and publish semantic annotations, another type of user contribution. In Discovery there are two main categories of resources (called "Sources" in Discovery): primary and secondary sources. Primary sources are all the resources that belong to the digital library; these resources have been collected, digitized and published by the institution that runs the digital library. Secondary sources are all the resources that belong to the Electronic Publisher component of a Talia node; these resources have been submitted by users and have passed through a peer-review process before being published. In Discovery there are four content providers, each of which manages its own instance of Talia. Each provider has its own domain ontology that specifies which types of primary and secondary sources it deals with. Each content provider also has its own peer-review policies and procedures and its own user interfaces. In addition to running domain-specific digital libraries and Open Access electronic publishers, the content providers also engage in what is referred to as "Semantic Enrichment".
Using a tool called Philospace [10], domain specialists can semantically annotate the Sources published in Talia to add new semantic relations among them. Like any other user-generated content, semantic annotations also go through the peer-review process. If the annotations pass peer-review scrutiny, they are published in Talia and become available to end users. There is no limit on the meaning of an annotation, which may range from simple metadata added to a Source to complex relations to philosophical concepts expressed in a domain thesaurus. The only requirement is that annotations must be based on an "Annotation Domain Ontology" that is loaded both into the annotation tool and into Talia. More details about the Discovery project are available on its website [10].

8. Item-centric vs. relation-centric perspectives

In scholarly environments, where Talia is mostly expected to be used, the context in which each individual object is placed is of extreme importance. It is often by exploring the context, that is, the relations that each object has with others, that new discoveries are made. As an example, consider the following figure. The interface shown is part of Hyper, the software from which Talia derives; a similar interface is currently being re-implemented in Talia. It has been designed to visualize a particular type of relation that a set of manuscripts have among themselves. In particular, this interface allows the user to visualize the path of the genesis of a philosophical idea, from its conception in a manuscript to its publication in a printed book, through its evolution and refinement in successive manuscripts and pre-printing copies. The following figure is an alternative visualization of one of the resources shown in the previous figure. This view, called the



“rhizome view”, shows all the “paths” that pass through a given resource.

Figure 3: Path widget

Figure 4: Rhizome widget

Clearly the two visualizations convey different meanings; the interesting element is that in both interfaces the focus is on the relations among resources rather than on the resources themselves. Having alternative interfaces that allow the user to focus on an individual resource as well as on its relationships with other resources is one of the characteristics that make Talia a scholarly tool rather than a simple digital object repository.

9. User Interface Widgets and Source Tool plugins

Talia is meant to be used to publish a vast amount of heterogeneous digital objects and resources. Scholarly communities are often interdisciplinary, and their research output embraces more than one scientific domain. We believe that general-purpose user interfaces are unable to match the complexity of these contexts and





Figure 5: Default user interface. The Source is shown on the right hand side. The bar on the left lists semantic relations with other sources. Related sources are clickable.

fail to properly address the diverse needs of the users. Therefore, Talia provides a flexible and modular user interface framework based on widgets. Widgets are distributed independently and can be used as building blocks for customizing the application's user interface. Talia's widget engine offers a high-level framework that application developers can use to build community-specific user interfaces without programming low-level details. Widgets can easily be plugged into the default user interface.
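As a rough sketch of the widget idea (hypothetical classes, not Talia's real framework), a page for a Source can be assembled from independently distributed widgets, each rendering one aspect of the resource:

```python
# Illustrative widget composition (invented classes and data shapes):
# each widget renders one aspect of a Source, and a page is built by
# composing whichever widgets a community has installed.

class Widget:
    def render(self, source):
        raise NotImplementedError

class LabelWidget(Widget):
    """Renders the resource's title."""
    def render(self, source):
        return f"<h1>{source['label']}</h1>"

class RelationsWidget(Widget):
    """Lists semantic relations to other sources, as in Figure 5."""
    def render(self, source):
        items = "".join(f"<li>{pred}: {obj}</li>"
                        for pred, obj in source.get("relations", []))
        return f"<ul>{items}</ul>"

def render_page(source, widgets):
    """Compose a page from independently distributed widgets."""
    return "\n".join(w.render(source) for w in widgets)

page = render_page(
    {"label": "Notebook A",
     "relations": [("precedes", "Notebook B")]},
    [LabelWidget(), RelationsWidget()])
```

Swapping the widget list changes the interface without touching the core engine, which is the point of distributing widgets as independent building blocks.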

The following figure shows an example of a Talia Semantic Navigation Interface, based on a widget that directly interacts with the Talia Knowledge Base, using metadata and ontologies to display information.

Figure 6: Semantic relations widget



We plan to host a library of Open Source widgets on Talia's website, which developers can use to distribute their widgets. In addition to the widget component, Talia also includes another kind of plug-in called Source Tools. These tools are behaviours that can be attached to a specific type of digital object (called "Sources" in Talia). For example, a Source of type manuscript edition that includes a data object of type TEI-XML may have a Source Tool that allows the user to perform some kind of linguistic analysis on the text. A Source of type manuscript whose data objects are images representing the manuscript may have a Source Tool to OCR the text from the images. The rhizome view shown above is another example of a Source Tool applied to a subset of Sources of type manuscript.
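The coupling of Source Tools to Source types could be organised along these lines (an illustrative sketch with invented names, not Talia's plug-in API):

```python
# Hypothetical Source Tool registry: tools register themselves for a
# Source type, and the UI offers only the tools that apply to the
# object being viewed. Names and behaviours are placeholders.

SOURCE_TOOLS = {}  # source type -> list of (tool name, callable)

def source_tool(source_type, name):
    """Decorator registering a tool for a given Source type."""
    def register(fn):
        SOURCE_TOOLS.setdefault(source_type, []).append((name, fn))
        return fn
    return register

@source_tool("ManuscriptEdition", "linguistic-analysis")
def analyse_text(source):
    # Placeholder for a linguistic analysis of the TEI-XML text.
    return f"analysing TEI-XML of {source}"

@source_tool("Manuscript", "ocr")
def run_ocr(source):
    # Placeholder for OCR over the manuscript's page images.
    return f"running OCR on page images of {source}"

def tools_for(source_type):
    """Tool names applicable to a given type of Source."""
    return [name for name, _ in SOURCE_TOOLS.get(source_type, [])]
```

A user interface can then query `tools_for` when displaying a Source, so a manuscript offers OCR while a TEI-XML edition offers linguistic analysis.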

10. Conclusions and Related Work

Talia is an innovative distributed semantic digital library system which aims at improving scholarly research in the humanities by avoiding the fragmentation of materials. Using standard Semantic Web technologies, ontologies to organize information, and a completely customizable user interface framework, Talia represents a state-of-the-art research and publishing environment for the humanities. Talia is distributed with an Open Source license and is very easy to install and configure, so it can be used to build single digital libraries and electronic publishing venues at very low cost. Talia nodes can then be joined together in a federation to create virtual collections that cross the borders of a single library or organization. At the same time, Talia's full compliance with Semantic Web standards ensures deep interoperability with other Semantic Web tools and applications. Talia shares some properties with other semantic digital library systems such as JeromeDL [11], BRICKS [12] and Fedora [13]. These projects are, however, mostly focused on back-end technology and can hardly be deployed in low-tech environments such as small archives, libraries and museums. None of these tools offers the tight coupling between the semantic knowledge base and the flexible user interface framework provided by Talia. BRICKS is an architecture composed of a set of generic foundational components (called "core and basic services") plus a number of additional specialized services (called "Pillars") that can be invoked by applications as remote services. A BRICKS node (called a BNode) is an application that uses these services and interacts with other BNodes within a BRICKS network. BRICKS is therefore a huge infrastructure that requires a significant amount of central coordination to maintain the basic services.
From the point of view of the individual institution that wants to join the network, BRICKS provides a set of very useful services on top of which each content provider should develop its own application and user interface. We believe that within the Humanities, and in the sector of cultural institutions in general, it is very uncommon for organizations to have access to the know-how, budget and organizational capacity needed to deploy such a complex product. Moreover, even though the technology itself has a decentralized architecture, BRICKS relies on ad-hoc components ("core and basic services") that depend on the availability of remote services in the network. In this way the need for centralized coordination is shifted from the technological level to the organizational and managerial level. We believe that the lack of organizational and managerial coordination is one of the weak spots of the Humanities scholarly community. The approach of Talia is therefore to minimize the effort needed to set up and deploy a Talia node. Additionally, a Talia federation does not rely on any legacy coordination and knowledge exchange protocol (such as the BRICKS P2P component). Talia is built entirely on top of very simple Semantic Web standards such as RDF, HTTP and URIs. In short, Talia is designed to run out of the box in order to make it possible for smaller cultural institutions (who may not even have, or





may have only a very small computing staff) to contribute to a Semantic Web of Culture. The similarity between Talia and Fedora is that both allow relations between objects to be expressed in RDF. However, "...Fedora is a digital asset management (DAM) architecture, upon which many types of digital library, institutional repositories, digital archives, and digital libraries systems might be built. Fedora is the underlying architecture for a digital repository, and is not a complete management, indexing, discovery, and delivery application...". As with BRICKS, Fedora is suitable for developing and deploying very large digital library applications. Talia instead aims to be a complete out-of-the-box application that comes with a pre-defined, generic and complete user interface. Talia also has a modular architecture that makes it easy to extend its features and customize its interface by developing plug-ins and UI widgets. JeromeDL is the existing application most similar to Talia. Like Talia, JeromeDL is fully based on simple Semantic Web standards, works out of the box and is extensible through plug-ins. Apart from the different languages in which the two applications are written (JeromeDL in Java, Talia in Ruby), the main difference lies in their primary target user group: JeromeDL's primary target audience is the generic user of a digital library, while Talia will also include default user interfaces and tools targeted at Humanities scholars. At the time of writing, Talia is still in alpha stage and a first stable public release is planned for October 2008. The first release will include a set of visualization widgets specifically meant for handling Discovery content, as well as a full-featured on-line peer review system. The first release will also include an adapter for Philospace, a semantic annotation tool based on the DBin platform [14][15], briefly introduced in Section 7.
In the meantime, Talia is being customized for other applications in the cultural heritage and digital library domains. In particular, additional research is being carried out on the integration of Semantic Web based bibliometric tools. The focus of this research activity is to exploit the Semantic Web to explore bibliometric models and impact measures that can be used as alternatives to traditional impact indicators such as the Impact Factor. Some of these models have been proposed in [16] and [17]. Other areas of future improvement include the development of an infrastructure for collaborative ontology editing and mapping, as well as an ontology library for the Humanities and Cultural Heritage domains. Finally, we are also studying the integration of Talia with archival cataloguing software and standards, such as EAD editors and archival data management systems, with the objective of making Talia a suitable product for interlinking heterogeneous data and resources from the library, archival and museum domains to create digital collections of cultural resources.

11. Acknowledgements

This work has been supported by Discovery, an ECP 2005 CULT 038206 project under the EC eContentplus programme.

12. Notes

1. http://www.w3.org/TR/rdf-schema/
2. http://www.w3.org/TR/owl-features/




13. References

[1] The European Library Portal, [http://www.theeuropeanlibrary.org/portal/index.html]
[2] OAIster, [http://www.oaister.org/]
[3] W3C Semantic Web Activity, [http://www.w3.org/2001/sw/]
[4] RDF Primer, W3C Recommendation, [http://www.w3.org/TR/rdf-primer/]
[5] Nucci, M., David, S., Hahn, D., Barbera, M., Talia: A Framework for Philosophy Scholars, in Proceedings of Semantic Web Applications and Perspectives, Bari, Italy, 2007.
[6] Talia Wiki, [http://trac.talia.discovery-project.eu/]
[7] Fielding, R.T., Architectural Styles and the Design of Network-based Software Architectures, PhD thesis, UC Irvine, 2000.
[8] OpenID Web Site, [http://openid.net/]
[9] Shibboleth Web Site, [http://shibboleth.internet2.edu/]
[10] Discovery Web Site, [http://www.discovery-project.eu/]
[11] Kruk, S., Woroniecki, T., Gzella, A., Dabrowski, M., McDaniel, B., Anatomy of a social semantic library, in European Semantic Web Conference, Semantic Digital Library Tutorial, 2007.
[12] Risse, T., Knezevic, P., Meghini, C., Hecht, R., Basile, F., The BRICKS infrastructure - an overview, in The International Conference EVA, Moscow, 2005.
[13] Fedora Development Team, Fedora open source repository software: White paper, Fedora Project, 2005.
[14] Tummarello, G., Morbidoni, C., Nucci, M., Enabling Semantic Web communities with DBin: an overview, in Proceedings of the Fifth International Semantic Web Conference (ISWC 2006), Athens, GA, USA, 2006.
[15] DBin Web Site, [http://www.dbin.org/]
[16] Bollen, J., Van de Sompel, H., Smith, J., Luce, R., Towards alternative metrics of journal impact: a comparison of download and citation data, Information Processing & Management, Volume 41, Issue 6, pp. 1419-1440, Dec. 2005.
[17] Barbera, M., Di Donato, F., Weaving the Web of Science: HyperJournal and the impact of the Semantic Web on scientific publishing, in Martens, B. and Dobreva, M. (Eds.), Proceedings of ELPUB: International Conference on Electronic Publishing, pp. 341-348, Bansko, Bulgaria, 2006.





No Budget, No Worries: Free and Open Source Publishing Software in Biomedical Publishing

Tarek Loubani1, Alison Sinclair1, Sally Murray1, Claire Kendall1, Anita Palepu1, Anne Marie Todkill1, John Willinsky2

1 Editorial Team, Open Medicine, 2409 Wyndale Crescent, Ottawa, Ontario, K1H 8J2, Canada
e-mail: tarek@tarek.org, alison.sinclair@mac.com, smurray@openmedicine.ca, ckendall@openmedicine.ca, apalepu@openmedicine.ca
2 Board of Directors, Open Medicine, 2409 Wyndale Crescent, Ottawa, Ontario, K1H 8J2, Canada
e-mail: john.willinsky@ubc.ca

Abstract

Open Medicine (http://www.openmedicine.ca) is an electronic open access, peer-reviewed general medical journal that started publication in April 2007. The editors of Open Medicine have been exploring the use of Free and Open Source Software (FOSS) in constructing an efficient and sustainable publishing model that can be adopted by other journals. The goal of using FOSS is to minimize the use of scarce financial resources and maximize the return to the community by way of software code and high-quality articles. Using information collected through archived documents and interviews with key editorial and technical staff responsible for journal development, this paper reports on the incorporation of FOSS into the production workflow of Open Medicine. We discuss the different types of software used; how they interface; why they were chosen; and the successes and challenges associated with using FOSS rather than proprietary software. These include the flagship FOSS office and graphics packages (OpenOffice, The GIMP, Inkscape); the content management system Drupal, used to run the Open Medicine Blog; the wiki software MediaWiki, used to communicate and archive our weekly editorial and operational meeting agendas, minutes and other documents that the team can collectively edit; Scribus for automated layout; and the VoIP software Skype and OpenWengo for communication. All software can be run on any of the main operating systems, including the free and open source GNU/Linux operating system. Journal management is provided by Open Journal Systems (OJS), developed by the Public Knowledge Project (http://pkp.sfu.ca/?q=ojs). OJS assists with every stage of the refereed publishing process, from submission and the assignment of peer reviewers through to online publication and indexing. The Public Knowledge Project has also recently developed Lemon8-XML (http://pkp.sfu.ca/lemon8), which automates the conversion of text document formats to XML, enabling structured markup of content for automated searching and indexing.
As XML is required for inclusion in PubMed Central, this integrated, semi-automated processing of manuscripts is a key ingredient for biomedical publishing, and Lemon8-XML has significant resource implications for the many journals where XML conversion is currently done manually or with proprietary software. Conversion to XML and the use of Scribus have allowed semi-automated production of HTML and PDF documents for online publication, representing another significant resource saving. The extensive use of free and open source software by Open Medicine serves as a unique case study of the feasibility of FOSS use for all journals in scholarly publishing. It also demonstrates how innovative use of this software contributes to a more sustainable publishing model that is replicable worldwide.

Keywords: Free and Open Source Software (FOSS); biomedical publishing; Open Medicine.




1. Introduction

The private interests of medical society-owned and commercially owned medical journals do not encourage collaboration between journals on processes related to journal publishing. This is particularly apparent as journal publishing moves into the digital age: profit sits at the helm of an era in which shared software code and reader-centric licenses could otherwise accelerate the development and advantages of electronic publishing for all readers and authors. The focus on profit also prevents many potential readers from purchasing subscriptions. In a US periodical price survey published in early 2008, health science periodical subscriptions averaged US$1330, representing a ten percent increase from 2007. The same study showed that average subscription prices in the health sciences increased by 43% between 2004 and 2008 [1]. A report commissioned by the Wellcome Trust showed similar data [2]: in 2000 the average subscription price for a medical journal was £396.22, and the average cost of a medical journal increased by 184% in the ten-year period between 1990 and 2000 [2]. These costs limit journal readership to academic and institution-affiliated professionals in developed countries, and exclude physicians and academics in developing countries not covered by initiatives such as the Health InterNetwork Access to Research Initiative (HINARI) [3]. Electronic publishing renders obsolete the costly processes used to justify high subscription prices. In a recent publication costing study comparing print and electronic publications, Clarke [4] found that the publication costs of a print version of a non-profit association journal were more than double those of an electronic version (US$20,000 compared with US$8,000).
Although editorial costs associated with the production of high-quality publications remain – and, for larger journals, can be a considerable part of their operating costs – it is clear that the impact of these costs on the financial viability of a journal can be considerably offset by reduced production costs. This has the potential to reduce the dependence of medical journals on pharmaceutical company and medical device manufacturer advertising, the effects of which have been well documented [5,6]. While the Clarke study does not itemize the contribution that publishing-related software makes to publication costs, it can only be assumed that the use of free and open source software (FOSS) would decrease these costs further. Willinsky and Mendis [7] recently published a paper describing their experience of publishing an entirely unfunded humanities journal using free publishing software and "a volunteer economy of committed souls". Hitchcock [8] describes the only other journal that we are aware of that has exclusively used FOSS for this purpose. At Open Medicine, we employ "committed souls", professional journal editors and FOSS to publish our biomedical journal.

2. Open access (OA) publication

Open access publication has emerged as another way of increasing integrity, transparency and accessibility in biomedical publishing [9]. In 2002, the Budapest Open Access Initiative (BOAI) was launched to encourage science to be made freely available on the internet. The BOAI supports the archiving of scientific articles and the free availability, without copyright and proprietary limitations, of articles to be read, downloaded, reproduced, distributed, printed, searched or linked to in full text, with proper attribution to the source (see http://www.soros.org/openaccess). Reframing traditional copyright limitations allows anyone to use science for learning, teaching and discussion without having to pay for its use in the form of a subscription or re-print purchase. Without this kind of protection, even an article's authors cannot freely use published articles for these purposes. The trend towards opening access among journal publishers has been swift: the Directory of Open Access Journals now lists more than 3281 journals (http://www.doaj.org). The benefits of OA are also becoming clearer: studies are finding that articles published in open access journals are cited more widely [10], and





studies that have made their data openly accessible have also seen an increased citation advantage [11]. Academic institutions, funding bodies, regulators and even governments have recognized how open access might serve academic integrity and improve patient care [12].

3. Free and open source software (FOSS)

Like the copyright laws that continue to significantly limit readers' ability to download, reproduce, distribute, print, share and expand upon knowledge printed in many journals, copyright limitations apply to sharing novel software programs and code. Software development under a free license such as the GNU General Public License ensures that source code is freely available and can be used, examined, changed, improved or redistributed without limits, except that any changes must be released back into the community under the same license (http://en.wikipedia.org/wiki/Open_source_software). Developers of FOSS range from software hobbyists to multinational corporations. Programmers may or may not be paid for their work, and their motivations include the wish to satisfy user need and to use and develop their skills [13]. Free licenses encourage code sharing and code integrity, and enable the rapid identification and fixing of critical bugs, and the adaptation and re-purposing of code. Among the best-known open source software projects are the GNU/Linux operating system, the Mozilla Firefox web browser, the OpenOffice productivity software, and the MediaWiki publishing platform that underlies Wikipedia. The ability of many smaller journals to support open access publication has been enabled by the availability of open source journal management and publishing systems, including Open Journal Systems (http://pkp.sfu.ca/ojs/), DPubS (http://dpubs.org/), GAPworks (http://gapworks.berlios.de/), HyperJournal (http://www.hjournal.org/), the ePublishing Toolkit (https://dev.livingreviews.org/projects/epubtk/), OpenACS (http://openacs.org/), and SciX Open Publishing Services (SOPS; http://www.scix.net/) (see http://pkp.sfu.ca/ojs_faq). At Open Medicine we have taken our commitment to "openness" and to developing a more sustainable publishing model a step further by using free and open source software (FOSS) for our journal management, blog and electronic publishing platform.
We are also increasingly incorporating FOSS into our workflow to enable the production of XML (a document format required for NLM/MEDLINE indexing) and for our layout and copyediting process, with the end goal of publishing and managing the journal exclusively using a FOSS workflow. The use of FOSS in medical publishing has many advantages. Cost is one commonly cited factor, though by no means the most important. By using FOSS, Open Medicine is replacing software with single license costs (non-educational versions) ranging from hundreds to thousands of dollars, representing savings in startup costs of many thousands of dollars; this use of FOSS also avoids costly upgrades of both software and hardware. FOSS tends to be available for a broader range of platforms – at a minimum, there are likely to be GNU/Linux, Apple Mac OSX and Microsoft Windows versions – and since older versions of the software are not commercially competitive with newer versions, support for established FOSS projects does not end according to a commercial cycle. This means that older, slower computers remain viable platforms. It also means that backward compatibility of programs is more often maintained. FOSS also produces documents in open formats such as the Open Document Format, which means that the user is able to transfer documents to another program should development on the original one cease, or a more suitable alternative be found – unlike data kept in a proprietary format. This problem, dubbed “vendor lock-in” will become more pronounced, with the introduction of Microsoft’s new proprietary office format, as well as with “patented” proprietary formats from other companies. 4.

4. FOSS at Open Medicine

Use of FOSS at Open Medicine was primarily driven by the added control, security, and usability of the software. However, it was also prompted in part by cost considerations. As a start-up independent journal committed to editorial independence, we operate principally with volunteer staff and minimal institutional support: the purchase of expensive proprietary journal management software was not only undesirable, but unfeasible. Our first step was to work with John Willinsky and the Public Knowledge Project to explore Open Journal Systems (OJS; http://pkp.sfu.ca/ojs). OJS is a free and open source online journal management and publishing system, developed by the Public Knowledge Project in a partnership among the Faculty of Education at the University of British Columbia, the Simon Fraser University Library and the Canadian Centre for Studies in Publishing [14]. We are not alone in recognizing the benefits of using OJS: there are now more than 1000 journals using OJS as a publishing platform, 20 percent of which are new titles, and all of which offer some form of open access. Somewhat more than half are published in low-income countries. OJS offers a complete manuscript management and publishing system. Correspondence between authors, editors, peer reviewers, copyeditors and layout editors can be managed within the system, with modifiable templates for correspondence. A database of peer reviewers, with contact information, interests and review history, is maintained within the system. Authors are able to track the progress of their manuscripts through the system, and peer reviewers are able to access their review requests, download the documents and enter or upload their completed peer reviews. OJS operates within a browser, with good attention to cross-platform, cross-browser compatibility (see Figure 1).

Figure 1: Open Medicine home page published using OJS

A critical advantage of OJS is its use of open source code and a free software license. This has allowed the technical staff at Open Medicine, with OJS support, to write new or revised code targeted to our particular journal needs. And of course the cycle continues: any code written by our programmer with wider applicability for journal publishing has in turn been shared with the team at OJS and with other journals. This relationship has been particularly productive in our testing and use of Lemon8-XML for OJS. Lemon8-XML (http://pkp.sfu.ca/lemon8) is a web-based program that automates the conversion of text document formats to National Library of Medicine (NLM) XML, permitting myriad uses, including the easy transmission of article information to the NLM. XML ensures that text is marked up so as to enable meaningful computer searching. For example, it allows the orderly tagging of the date of publication and author names so that computers can search and find data that would usually appear as text buried within the body of a document. XML conversion is required for PubMed/MEDLINE indexing – a critical goal for any medical journal – and is currently performed in most journal operations either manually or with prohibitively expensive proprietary software. The development of Lemon8-XML will be a powerful contribution to data searching, and will have significant resource implications for journals, many of which have been unable to produce XML because of the high costs and expertise required. The easy creation of XML has enabled another recent innovation: the automated transformation of XML files into web-ready pages (HTML), as well as preliminary page layouts that can then be fine-tuned for print-ready publication (PDF). Some of our initial efforts at creating the editing-layout portion of the workflow involved using OpenOffice for both copy-editing and layout, and then generating both the XML markup version and the publication PDF from the final laid-out document. OpenOffice, although suitable for editing and copy-editing tasks, proved to lack the flexibility and fine control required to produce layouts to a professional standard. For example, fine-grained control over hyphenation, font kerning and so on was nearly impossible with OpenOffice. This led us to explore the use of Scribus, a well-established free and open source desktop publishing program, for the layout stage. Rather than be obliged to maintain and reconcile separate XML and Scribus versions, our technical staff developed a plugin that converts the copy-edited XML version of an article directly to a final web-ready page and to a preliminary layout in Scribus, ready for final refinement (see Figure 2).
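To make the idea of "orderly tagging" concrete, the sketch below builds a fragment modeled on the front matter of the NLM journal DTD (heavily simplified; the element names follow the NLM tag set, but this is not a complete, validated instance) and shows how tagged fields become directly queryable rather than text buried in a layout:

```python
import xml.etree.ElementTree as ET

# A simplified sketch of NLM-style article metadata, of the kind a tool such
# as Lemon8-XML produces. Dates and names live in dedicated elements, not in
# formatted prose, so software can pull them out reliably.
nlm = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>No Budget, No Worries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Loubani</surname><given-names>Tarek</given-names></name>
        </contrib>
      </contrib-group>
      <pub-date><year>2008</year></pub-date>
    </article-meta>
  </front>
</article>
"""

# With the markup in place, extracting indexing fields is a matter of simple
# path queries rather than text scraping.
meta = ET.fromstring(nlm).find("front/article-meta")
record = {
    "title": meta.findtext("title-group/article-title"),
    "first_author": meta.findtext("contrib-group/contrib/name/surname"),
    "year": meta.findtext("pub-date/year"),
}
print(record)
```

The same structured fields are what an indexer such as PubMed consumes, and what an XSLT or script can transform mechanically into HTML or a preliminary page layout.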

Figure 2: Example of automated article layout using Lemon8-XML and Scribus

Day-to-day operations within the journal are also performed using FOSS. As our editorial team is distributed across Canada and Australia, team members communicate using email, instant messaging (IM) and voice over IP (VoIP). We have also experimented with the SIP-based WengoPhone (now QuteCom) to support teleconferences of more than three people, but have been unsuccessful to date; we currently remain reliant on a proprietary solution for teleconferences involving the entire team. Coordination of journal activities is also made possible through an internal wiki, using MediaWiki (the platform originally written for Wikipedia). Editorial meeting agendas and minutes, projects and documents in development, lists of contacts and resources, and all other documentation associated with running the journal are accessible to and editable by all members of the editorial team. Table 1 offers a summary of the programs we have explored or are exploring as part of a free and open source (FOSS) publishing workflow or in support of our operations.


Article editing and preparation

OpenOffice (http://www.openoffice.org/)
  Open Medicine use(s): Editing and copy-editing of manuscripts; preliminary layout (current industry standard: Microsoft Office)
  Advantages: Best-established FOSS office suite; increasing acceptance in business and enterprise; well supported by documentation
  Disadvantages: Interface and customizations differ from proprietary alternative; does not have the fine control required for layout

GIMP (http://www.gimp.org/)
  Open Medicine use(s): Image editing (current industry standard: Adobe Photoshop)
  Advantages: Best all-around photo- and image-editing software; well supported with documentation and forums
  Disadvantages: Contested user interface; CMYK support only with a plugin (relevant for print publishing)

Inkscape (http://www.inkscape.org/)
  Open Medicine use(s): Figure preparation (current industry standards: Adobe Illustrator / Corel Draw)
  Advantages: Intuitive, well-thought-out user interface; excellent SVG support
  Disadvantages: Difficulty integrating with illustrators using Adobe Illustrator or Corel Draw

Article management, layout, and publishing

Open Journal Systems (OJS) (http://pkp.sfu.ca/ojs/)
  Open Medicine use(s): Manuscript management; reader tools; online publishing; communication with editors, copyeditors and layout persons
  Advantages: Many users; potential to request additional system features; responsive developers
  Disadvantages: Some limitations with theme customization

Lemon8-XML (http://lemon8.ca/)
  Open Medicine use(s): XML generation
  Advantages: Removes considerable human resource cost, as XML generation is currently done manually at most journals
  Disadvantages: Still in early testing phase; requires some manual reference searching; no current link with OJS author details, requiring duplicate data entry (planned for final version)

Scribus (http://www.scribus.net/)
  Open Medicine use(s): Layout of articles for print (PDF) publication (current industry standard: QuarkXPress)
  Advantages: Fine-grained control over text layout and font kerning; excellent PDF export control; excellent support community
  Disadvantages: Confusing development cycle; poorly thought out document format

Drupal (http://drupal.org/)
  Open Medicine use(s): Blog (current industry standards: WordPress, Movable Type, Blogger.com)
  Advantages: Powerful content-management system with user-access controls; extensible with plug-ins; active user community
  Disadvantages: Learning curve; requires expertise to set up and manage

Operations

MediaWiki (http://www.mediawiki.org/)
  Open Medicine use(s): Meeting minutes; shared projects; shared resources
  Advantages: Web-based; minimal learning required for use; very flexible
  Disadvantages: Some expertise required for installation and maintenance

WengoPhone (http://www.openwengo.org/)
  Open Medicine use(s): Team communication
  Advantages: Multiple sites can conference simultaneously; uses the SIP standard
  Disadvantages: Unstable development; small user base; decreased sound quality compared with other SIP products

Chandler (http://www.osafoundation.org/)
  Open Medicine use(s): Shared calendars
  Advantages: Multiple users can enter data

Thunderbird (http://www.mozilla.com/thunderbird/)
  Open Medicine use(s): Team communication; editor-author-peer-reviewer communication

Table 1: Free and open source software used at Open Medicine

Figure 3 shows a schematic flowchart of our operations and sites of FOSS use.

Figure 3: Workflow at Open Medicine using FOSS


5. Free and Open Source Software in Medical Publishing: The Challenges

There is no denying that there are challenges unique to adopting FOSS to create a workflow that has hitherto involved proprietary software. Some of these challenges arise from the software itself, some from the integration (or lack thereof) between various FOSS programs, and others simply from the time taken to learn new programs and to troubleshoot without traditional support channels. For an individual user who is experienced with proprietary software and a proprietary workflow, the initial penalty of moving to FOSS is a loss of efficiency and a (re)learning curve. Users must learn one or several new interfaces, which may require them to adapt their personal workflow if it is not supported by the program, or to learn how to customize the program to suit their needs. This is especially true for little-used specialist components of software, which tend to be buried deep within the software and to be poorly documented. Users must find and identify sources and resources that will provide answers to questions that may be quite specific to the task; this can be time-consuming, particularly when the reason there is no documentation is that the functionality has not been included in the software. The user interfaces of FOSS differ from their proprietary counterparts, in part as a result of the opportunity to solve perceived problems with existing proprietary interfaces and improve on their design, and in part because developers in today’s litigious environment must avoid incorporating design elements that may be claimed under patent (see http://en.wikibooks.org/wiki/FOSS_Open_Standards/Patents_in_Standards for a discussion of patents and FOSS).
While improving on design, however, developers of the more “mainstream” and widely adopted FOSS (e.g., Firefox, GNOME, OpenOffice, GIMP) find themselves attempting to balance the needs of new users for an intuitive, familiar interface with the requirements of experienced users for a flexible interface that can be highly customized. Microsoft and Adobe own much of the software in common use in authoring and publishing, and have so shaped user expectations and workflow design that they influence even the user interfaces they do not own. This results in consistency in the user interface across different programs from the same manufacturer. One common complaint about FOSS interfaces is that they can be individually unique, even idiosyncratic, posing a barrier to new users. This problem has been recognized by the community, and is being addressed aggressively with large-scale usability projects (e.g., Open Usability; http://openusability.org/) and human interface guidelines (e.g., the GNOME HIG, http://developer.gnome.org/projects/gup/hig/; and the KDE HIG, http://usability.kde.org/hig/). FOSS applications lend themselves to development on multiple operating systems, since any developer with an interest in a platform and some knowledge is free to modify the code. This leads to support even for esoteric operating systems such as IBM’s long-defunct OS/2. The upside of availability on multiple platforms is balanced by the lower quality of versions in which developers take little interest. Because free software is available to the public at all stages of its development cycle, installation of applications on underdeveloped platforms is sometimes confusing or poorly implemented. Scribus, one of our mainstay applications for layout editing, is an excellent example of this challenge. At the time of writing, Scribus version 1.3.3.11 is considered “stable”; however, versions 1.3.4 and 1.3.5 are in wide use as well, despite being “unstable”.
Scribus’ installer for Mac OS X is also primitive: it does not install required libraries, or even the application itself, in an intuitive way. The user needs to select the correct version, may need to download and install the supporting libraries or packages, and may then need to interpret and troubleshoot any resulting error messages. It is worth noting that this problem is essentially eliminated within free software operating systems (e.g., GNU/Linux), all of which use package management systems to easily install software and dependent libraries. Publishing requires a workflow that faithfully preserves detail of presentation – font, layout, figures. For proprietary publishing, this workflow has been developed largely through the consolidation of the products involved into end-to-end product lines that smooth integration but offer little choice to the consumer.
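Returning to the package-management point above: the convenience comes from automatic dependency resolution, in which asking for one application pulls in the libraries it needs, in the right order. A conceptual sketch of that resolution step (the package names and dependency lists here are illustrative, not real apt or yum metadata):

```python
# Conceptual sketch of dependency resolution as performed by a package
# manager: installing one application installs its libraries first.
# The dependency graph below is invented for illustration.
DEPENDS = {
    "scribus": ["qt", "libtiff"],
    "qt": ["libpng"],
    "libtiff": [],
    "libpng": [],
}

def install_order(package, resolved=None):
    """Return packages in the order they must be installed (dependencies first)."""
    if resolved is None:
        resolved = []
    for dep in DEPENDS[package]:
        install_order(dep, resolved)  # recurse into each dependency
    if package not in resolved:
        resolved.append(package)      # install after all dependencies
    return resolved

print(install_order("scribus"))  # ['libpng', 'qt', 'libtiff', 'scribus']
```

A real package manager adds version constraints and conflict handling on top of this, but the depth-first ordering is the reason a single install command suffices on GNU/Linux where the Mac OS X installer leaves the user to assemble libraries by hand.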


No Budget, No Worries: Free and Open Source Publishing Software in Biomedical Publishing

The various components of FOSS are not integrated into a workflow and require additional customization and programming. Furthermore, given that almost all of our submissions are received in Microsoft Word format, one of the areas the Open Medicine staff found most challenging was importing figures and tables prepared in Word, and citations and reference lists prepared in another widely used proprietary program, EndNote. We have yet to resolve our dependency on proprietary fonts for standardization of appearance and layout across stages and platforms. When difficulties are encountered in free software applications, solutions are not always easily located. The pace of progress means that documentation and technical support are primarily provided online by the user community, rather than in the form of published manuals. The majority of commercial publishers of books describing individual computer applications concentrate their efforts on mainstream proprietary software, which tends to have a much longer product lifecycle and a slower development pace. Established FOSS projects commonly offer documentation in the form of a wiki (a collectively edited multi-page manual), and support in the form of forums and online communities. Individual users may develop extensive tips and support sites, either out of interest, or in support of their consulting business (or both). Finding the documentation that suits one’s level of learning, or the exact answer to a technical question, requires skill in searching and some experience in assessing the receptiveness of a forum to “newbie questions”. The move to lesser-known free software also negates the often overlooked advantage of “the geek next door”, the friend with a slightly higher level of skill who can help accomplish certain tasks. The increasing popularity of free software will eventually render this challenge moot; at present, however, it remains real.

6. FOSS in Medical Publishing: The Possibilities

By the very nature of FOSS, many of the frustrations cited above should ease with increasing adoption of FOSS in scholarly publishing. FOSS-OA publishers are forming their own community, exchanging experiences and developing documentation specific to the task of using FOSS for publishing. Experience will teach us which programs are best suited to each step in the editing-publishing workflow, which programs integrate best with others, and how they might be customized for ease of workflow. The open architecture of FOSS permits the development of macros and plugins to automate repeated steps and to facilitate import and export. The most interesting possibilities presented by FOSS will arise from collaboration among several FOSS-OA publishers. A case in point: Open Medicine is collaborating with the Public Knowledge Project to develop a user commenting system for OJS, but we expect this system to truly mature and evolve when other publishers implement and expand upon it. For our own part, we hope Open Medicine can become a working template and case study for other journals interested in publishing using a complete FOSS workflow. Journals choosing FOSS because of their philosophy, cost considerations or the computing ‘power’ available to run software applications can benefit from our learning experience and, given the nature of FOSS, from the source code developed for our publishing purposes. We look forward to the ongoing dialogue and experience of pursuing a truly “open”, academically independent, biomedical publishing option. For us, transparency and integrity are essential traits, and we want Open Medicine to embody these traits in the software we use as well as in the articles we publish.

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008


Tarek Loubani, Alison Sinclair, Sally Murray, Claire Kendall, Anita Palepu, Anne Marie Todkill, John Willinsky

7. References

[1] Van Orsdel LC, Born K. Periodicals Price Survey 2008: Embracing Openness. Library Journal [Internet]. 2008 Apr 15 [cited 2008 May 6]. Available from: http://www.libraryjournal.com/article/CA6547086.html
[2] SQW Limited. Economic analysis of scientific publishing: a report commissioned by the Wellcome Trust. Histon: Wellcome Trust; 2003.
[3] The PLoS Medicine Editors. How Can Biomedical Journals Help to Tackle Global Poverty? PLoS Med [Internet]. 2006 Aug 29 [cited 2008 May 6];3(8):e380. Available from: http://medicine.plosjournals.org/perlserv/?request=getdocument&doi=10.1371/journal.pmed.0030380
[4] Clarke R. The cost profiles of alternative approaches to journal publishing. First Monday [Internet]. 2007 Nov 21 [cited 2008 May 6];12(12). Available from: http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2048/1906
[5] Smith R. Medical Journals Are an Extension of the Marketing Arm of Pharmaceutical Companies. PLoS Med [Internet]. 2005 [cited 2008 May 6];2(5):e138. Available from: http://medicine.plosjournals.org/perlserv/?request=getdocument&doi=10.1371/journal.pmed.0020138
[6] Fugh-Berman A, Alladin K, Chow J. Advertising in Medical Journals: Should Current Practices Change? PLoS Med [Internet]. 2006 [cited 2008 May 6];3(6):e130. Available from: http://medicine.plosjournals.org/perlserv/?request=getdocument&doi=10.1371/journal.pmed.0030130
[7] Willinsky J, Mendis R. Open access on a zero budget: a case study of Postcolonial Text. Information Research [Internet]. 2007 [cited 2008 May 6];12(3):paper 308. Available from: http://InformationR.net/ir/12-3/paper308.html
[8] Hitchcock S. The effect of open access and downloads (‘hits’) on citation impact: a bibliography of studies. Unpublished paper, University of Southampton; 2007 [cited 2007 May 24]. Available from: http://opcit.eprints.org/oacitation-biblio.html
[9] Willinsky J, Murray S, Kendall C, Palepu A. Doing Medical Journals Differently: Open Medicine, Open Access and Academic Freedom. Canadian Journal of Communication [Internet]. 2007 [cited 2008 May 6];32(3):595-612.
[10] Eysenbach G. Citation advantage of open access articles. PLoS Biol [Internet]. 2006 [cited 2008 May 6];4(5):e157. Available from: http://biology.plosjournals.org/perlserv/?request=getdocument&doi=10.1371%2Fjournal.pbio.0040157
[11] Piwowar HA, Day RS, Fridsma DB. Sharing detailed research data is associated with increased citation rate. PLoS ONE [Internet]. 2007 [cited 2008 May 6];2(3):e308. Available from: http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000308
[12] Suber P. An open access mandate for the National Institutes of Health. Open Medicine [Internet]. 2008 Apr 16 [cited 2008 May 10];2(2). Available from: http://www.openmedicine.ca/article/view/213/135
[13] Lakhani KR, Wolf RG. Why Hackers Do What They Do: Understanding Motivation and Effort in Free/Open Source Software Projects. MIT Sloan Working Paper No. 4425-03; Sep 2003 [cited 2008 May 5]. Available from: http://ssrn.com/abstract=443040
[14] Willinsky J. Open Journal Systems: an example of open source software for journal management and publishing. Library Hi Tech [Internet]. 2005 [cited 2006 May 5];23(4):504-519. Available from: http://research2.csci.educ.ubc.ca/eprints/archive/00000047/01/Library_Hi_Tech_DRAFT.pdf




Should University Presses Adopt An Open Access [Electronic Publishing] Business Model For All of Their Scholarly Books?

Albert N. Greco (1); Robert M. Wharton (2)

(1) Marketing Area, Fordham University Graduate School of Business Administration, 113 West 60th Street, New York, NY 10023, United States. e-mail: agreco@fordham.edu
(2) Professor of Management Science, Fordham University Graduate School of Business Administration, 113 West 60th Street, New York, NY 10023, United States. e-mail: R.FWharton@att.net

Abstract

This paper analyzes U.S. university press datasets (2001-2007) to determine net publishers’ revenues and net publishers’ units; the major markets and channels of distribution (libraries and institutions; college adoptions; and general retailer sales) on which these presses relied; and the intense competition these presses confronted from commercial scholarly, trade, and college textbook publishers entering these three markets. ARIMA forecasts were employed to generate projections for the years 2008-2012 and to ascertain changes or declines in market shares. The paper concludes with a brief series of substantive recommendations, including the idea that university presses must consider abandoning a “print only” business model and adopting an “Open Access” electronic publishing model in order to regain the unique value proposition these presses held in the late 1970s.

Keywords: innovative business models for scholarly publishing; university presses; electronic publishing; Open Access; scholarly communication; marketing strategies.

1. Introduction

Since the late 19th century, university presses in the United States have played a pivotal role, and some individuals might argue the pivotal role, in the transmission of scholarly knowledge [1]. University press books have become the “gold standard” in many academic fields (e.g., history; literature; and certain areas of philosophy and sociology) in the departmental or college evaluation of a faculty member’s scholarly output (and reputation) for tenure, promotion, and merit pay [2]. In 2008 these presses ranged in size from exceptionally large presses (with annual revenues in excess of $50 million; e.g., Oxford University Press has U.S. annual revenues of approximately $140 million; Cambridge University Press, approximately $60 million), to large presses (more than $6 million; e.g., the University of Chicago Press), medium-sized presses (approximately $1.5-$3 million; e.g., the University of Notre Dame), and relatively small presses (approximately $900,000-$1.5 million; e.g., Carnegie Mellon University). They publish peer-reviewed scholarly monographs; trade and professional books; textbooks; and, in some instances, scholarly journals that they own or publish under contract for academic societies (e.g., Pennsylvania State University Press). In this paper all university press books will be considered one category for analysis; however, additional research on the relationship between monographs and textbooks is needed. Between 1945 and the late 1970s, the basic university press business model was incredibly successful because this diverse collection of presses had a unique value proposition. University press publishing during those years was a “cozy” world, where everyone knew someone who knew someone; and most editors and press directors attended the same type of college (perhaps the Ivy League, small prestigious liberal arts colleges, or the large state universities). So these editors and publishers either went to school with or knew many of the major academic experts, who sent certain prestigious university presses their manuscripts and advised their graduate students to do likewise. During those years, the typical press received a “reasonable” level of financial and administrative support from its university; and presses were not expected to generate an annual “surplus” (i.e., a profit) [3]. The end result was that these presses published superb books and, concomitantly, dominated the scholarly publishing field with preeminent sales in three major markets or channels of distribution: libraries and institutions; college and graduate school adoptions (in this paper “college” and “university” will be used interchangeably); and general readers (i.e., sales to general retailers). There was little competition from commercial professional scholarly publishing houses (the terms “commercial professional scholarly publishing houses,” “professional and scholarly publishing houses,” and “scholarly publishing houses” refer to the same cluster of publishing companies and will be used interchangeably in this paper). The vast majority of trade publishing firms tended to concentrate on “big, hit-driven” fiction titles, although a cluster of firms (e.g., W.W. Norton or Random House) published serious scholarly works. By the mid to late 1970s, the total net publishers’ revenues for all of these university press operations were quite “modest” (1972: $41.4 million; 1977: $56.1 million; all revenues in this paper are in U.S. dollars); and the suggested retail price for the typical university press book was often $10.00-$15.00 [4].
This was an important marketing strategy, since inexpensive suggested retail prices (i.e., the MSRP) allowed the presses to penetrate the library market (the average university press expected to sell approximately 1,500 copies of each new scholarly book to academic and public libraries) as well as the college and graduate school adoption market, which often relied on scholarly titles from university presses in advanced undergraduate and graduate school courses. These presses tended to hire people who loved books; while wages were anemic, even by publishing industry standards, these presses offered editors an intellectually charged work environment in an academic setting that appealed to a significant number of people. The end result, a carefully written, edited and illustrated scholarly book, was indeed impressive. We reviewed press subsidies for 58 university presses for the years 2001-2006 (data for 2007 will not be available until late 2008) [5]. The most important subsidy was a direct financial grant to a press, with 70.69% of the presses receiving these funds. However, a wide variety of other free support services were provided to presses, including: payroll and human resources (86.21%); legal services (84.48%); audit services (70.7%); office space (62.07%); accounting services (60.34%); utilities (50%); working capital (44.83%; e.g., to pay printers, etc.); employee benefits (39.66%); salaries (37.93%); insurance (36.21%); carrying cost of accounts receivable (34.48%); warehouse space (32.76%); carrying cost of inventory (29.31%); parking (17.24%); work-study students (paid for by the university); and interns from the business school, the English department, or the mass communications department (no data on these options). These percentages remained rather constant during the years 2001-2006.

2. Significant Changes: The Emergence of “Black Swans”

Yet this “insulated” world changed abruptly in the late 1970s (a phenomenon called a “Black Swan” because of the unexpected nature of the change), and the change continued during the following decades [6]. What happened? Reliable statistical data are available for the years 1980 through 2007; however, because of the space limitations of this research paper, we address the years 2001-2007, and the ARIMA forecasting methodology was utilized to generate the projections [7]. First, there was an increase in the number of new titles published in the U.S. by university presses as well as by all publishers. Table 1 outlines these trends between 2001 and 2007.

Year | New Titles, University Presses | Annual % Change | New Titles, All U.S. Books | Annual % Change
2001 | 10,130 | — | N/A | N/A
2002 | 9,915 | -2.12 | 247,777 | N/A
2003 | 11,104 | 11.99 | 266,322 | 7.48
2004 | 9,854 | -11.26 | 295,523 | 10.96
2005 | 9,812 | -0.43 | 282,500 | -4.41
2006 | 9,969 | 1.60 | 291,922 | 3.34
2007 | 10,781 | 8.15 | 400,000* | 37.02*
2008 | N/A | — | N/A | —

Source: Yankee Book Peddler; R.R. Bowker (revised totals since 2002). Totals include both hardbound and paperbound books. *Rachael Donadio, “You’re An Author? Me Too!” The New York Times Book Review, April 27, 2008, p. 27. The 2007 projection for all U.S. books was based on R.R. Bowker data in this article; Bowker issues all ISBNs in the U.S.

Table 1: University Press New Title Output: 2001-2007

_______________________________________________________________________________________________ Year Net Publishers’ Annual C.P.I. Net Publishers’ Annual Revenues % Change % Change Units % Change _______________________________________________________________________________________________ 2001 474.8 N/A 2.85 24.6 N/A 2002 486.5 2.46 1.58 24.7 2.92 2003 494.8 1.71 2.28 24.6 -0.40 2004 501.0 1.25 2.66 31.4 27.64 2005 513.5 2.50 3.39 31.4 0 2006 531.0 3.41 3.23 29.0 -7.64 2007 546.9 2.99 2.85 28.7 -1.03 2008 563.3 3.00 2.80* 28.4 -1.05 2008 580.2 2.99 1.90* 28.2 -0.70 2010 597.0 2.90 2.10* 27.9 -1.06 2011 613.7 2.80 2.10* 27.7 -0.71 2012 630.3 2.70 2.10* 27.5 -0.70

_______________________________________________________________________________ Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers rounded off to one decimal place and may not add up to 100%. Totals include data for hardcover and paperbound books. Consumer Price Index (C.P.I) is for “all items.” All data refers to the sale of new books; used book sales are excluded. *C.P.I. projections: The U.S. Congressional Budget Office (as of January 2008).

Table 2: University Press Books: Net Publishers’ Revenues and Net Publishers’ Units, 2001-2007, With Projections for 2008-2012 (U.S. $ Millions; Millions of Units)

Albert N. Greco; Robert M. Wharton
Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008

University press net publishers’ revenues (i.e., gross sales minus returns equal net revenues; the same system is followed for units) increased because of changes in the suggested retail prices of these books, which generally exceeded annual increases in the Consumer Price Index (C.P.I.), while units sagged after 2005. Table 2 outlines these trends.

Since 1945, the three primary markets and channels of distribution for university presses have been: (1) libraries and institutions; (2) college adoptions (which include graduate school adoptions); and (3) general retailers. The datasets for net publishers’ revenues indicated growth in all three channels. Total increases between 2001 and 2007 were: 7.73% for the general retailer sector; 14.9% for college adoptions; and 16.0% for libraries and institutions. Table 3 outlines these trends.

_______________________________________________________________________________________________
Year   Exports   General     College     Libraries &    High School   Direct to   Other
                 Retailers   Adoptions   Institutions   Adoptions     Consumer
_______________________________________________________________________________________________
2001   60.6      109.9       114.8       125.6           8.6          52.9        2.4
2002   61.9      112.3       117.3       129.6           8.9          53.9        2.6
2003   63.1      114.3       119.5       131.5           9.0          56.3        2.5
2004   63.9      108.9       121.3       132.3           9.1          67.5        2.5
2005   65.5      111.5       124.3       135.9           9.3          69.2        2.6
2006   67.8      115.2       128.4       140.5           9.7          71.5        2.7
2007   69.5      118.4       131.9       145.7          10.0          73.3        3.0
2008   71.9      122.1       136.0       149.7          10.3          75.6        2.8
2009   74.1      125.8       140.1       153.9          10.5          78.2        2.9
2010   76.3      129.5       144.3       158.2          10.9          80.4        2.9
2011   78.3      133.2       148.4       162.4          11.2          82.7        3.0
2012   80.4      146.0       152.5       166.6          11.5          70.2        3.1
_______________________________________________________________________________________________
Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers are rounded to one decimal place and may not add up to 100%. Totals include both hardbound and paperbound books. All data refer to the sale of new books; used book sales are excluded.

Table 3: University Press Books: Net Publishers’ Revenues By Channels of Distribution, 2001-2007, With Projections for 2008-2012 (U.S. $ Millions)

_______________________________________________________________________________________________
Year   Exports   General     College     Libraries &    High School   Direct to   Other
                 Retailers   Adoptions   Institutions   Adoptions     Consumer
_______________________________________________________________________________________________
2001   3.1       8.3         7.4         3.9            0.3           1.4         0.3
2002   3.2       8.1         7.2         4.0            0.4           1.5         0.3
2003   3.1       8.1         7.2         4.0            0.4           1.5         0.3
2004   4.0       9.9         9.6         4.8            0.3           2.2         0.3
2005   4.0       9.9         9.4         4.9            0.4           2.2         0.3
2006   3.7       9.2         8.7         4.6            0.3           2.0         0.3
2007   3.7       8.8         8.4         4.6            0.5           2.1         0.3
2008   3.6       8.8         8.3         4.6            0.5           2.1         0.3
2009   3.6       8.8         8.3         4.5            0.4           2.1         0.3
2010   3.5       8.9         8.3         4.4            0.3           1.9         0.3
2011   3.5       8.8         8.3         4.3            0.3           1.9         0.3
2012   3.4       9.3         8.4         4.3            0.3           1.6         0.3
_______________________________________________________________________________________________
Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers are rounded to one decimal place and may not add up to 100%. Totals include both hardbound and paperbound books. All data refer to the sale of new books; used book sales are excluded.

Table 4: University Press Books: Net Publishers’ Units By Channels of Distribution, 2001-2007, With Projections for 2008-2012 (Millions of Units)


Should University Presses Adopt An Open Access [Electronic Publishing] Business Model ...

Between 2001 and 2007, net publishers’ unit data reveal a flattening of sales in the library and institution sector (essentially no growth after 2007) and in college adoptions (another sector with flat sales after 2006). Based on a review of unit sales in the three major markets and channels, it appears likely that the market for scholarly non-profit university press books has plateaued, a potential weakness for presses in those channels. Table 4 outlines these trends.

An analysis of the data for 2001-2007 revealed the substantial gains posted by professional and scholarly publishers in the university presses’ three main markets and channels in terms of net publishers’ revenues. Revenues were up 18.01% in the general retailer sector, 17.21% in college adoptions, and 17.55% in the library and institutional market. Unit sales were also strong during those years: +14.31% in general retailers; +11.09% in college adoptions; and +14.08% in libraries and institutions. The prognosis for 2008-2012 was for continued strong growth in both revenues and units in all three markets. Table 5 outlines these trends.

_______________________________________________________________________________________________
         Net Publishers’ Revenues                Net Publishers’ Units
Year     General     College     Libraries &     General     College     Libraries &
         Retailers   Adoptions   Institutions    Retailers   Adoptions   Institutions
_______________________________________________________________________________________________
2001     1399.3      1204.3      1659.4          63.6        51.4        41.9
2002     1444.4      1245.8      1714.7          64.3        52.2        45.5
2003     1482.2      1274.1      1759.0          69.8        55.0        46.0
2004     1535.7      1315.6      1816.0          70.5        55.7        46.3
2005     1547.5      1326.0      1832.7          71.2        56.4        47.1
2006     1599.7      1370.5      1893.8          71.5        57.0        47.6
2007     1651.3      1411.6      1950.7          72.7        57.1        47.8
2008     1692.2      1448.9      2002.6          72.7        57.4        48.2
2009     1734.8      1486.5      2054.6          72.9        57.7        48.3
2010     1777.6      1522.9      2104.4          72.9        58.0        48.4
2011     1821.3      1560.3      2155.3          72.9        58.3        48.5
2012     1872.0      1600.0      2207.2          73.1        58.5        48.8
_______________________________________________________________________________________________
Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers are rounded to one decimal place and may not add up to 100%. Totals include both hardbound and paperbound books. All data refer to the sale of new books; used book sales are excluded.

Table 5: Professional and Scholarly Publishers: Net Publishers’ Revenues and Net Publishers’ Units (2001-2012) for Sales to General Retailers, College Adoptions, and Libraries & Institutions (U.S. $ Millions; Millions of Units)

The pattern for college textbooks in these three markets was equally impressive. Sales to general retailers increased 17.65% between 2001 and 2007, with the tally for college adoptions hovering near the 17.64% mark, topping the 15.77% increase in the library and institutional market. Unit data were equally striking: +20.0% in general retailers; +10.11% in college adoptions; and +13.33% in the library and institutional market. Table 6 outlines these trends.

A comparison of the revenue sales patterns for university presses, professional and scholarly publishers, and college textbook publishers in the three channels was revealing, illuminating the impressive market shares held by professional and scholarly and college textbook publishers:
• general retailers: university presses, +7.73%; professional & scholarly, +18.01%; college textbooks, +17.65%;
• college adoptions: university presses, +14.9%; professional & scholarly, +17.21%; college textbooks, +17.65%;





• libraries and institutions: university presses, +16.0%; professional & scholarly, +17.55%; college textbooks, +15.77%.

_______________________________________________________________________________________________
         Net Publishers’ Revenues                Net Publishers’ Units
Year     General     College     Libraries &     General     College     Libraries &
         Retailers   Adoptions   Institutions    Retailers   Adoptions   Institutions
_______________________________________________________________________________________________
2001     175.6       2875.4      274.6           3.0         47.1        3.0
2002     178.3       2930.8      279.6           3.1         47.7        3.1
2003     181.5       2989.7      285.5           3.3         49.8        3.1
2004     189.8       3133.7      291.8           3.5         56.1        3.4
2005     193.6       3197.2      297.6           3.5         55.7        3.4
2006     199.7       3293.8      307.3           3.5         56.0        3.4
2007     206.6       3382.5      317.9           3.6         56.1        3.4
2008     211.6       3478.7      325.9           3.6         56.4        3.5
2009     217.1       3575.8      334.4           3.6         56.7        3.5
2010     222.9       3677.7      343.2           3.6         57.1        3.5
2011     228.8       3777.2      352.1           3.6         57.4        3.5
2012     234.9       3880.0      362.4           3.6         57.7        3.5
_______________________________________________________________________________________________
Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers are rounded to one decimal place and may not add up to 100%. Totals include both hardbound and paperbound books. All data refer to the sale of new books; used book sales are excluded.

Table 6: College Textbook Publishers: Net Publishers’ Revenues and Net Publishers’ Units (2001-2012) for Sales to General Retailers, College Adoptions, and Libraries & Institutions (U.S. $ Millions; Millions of Units)

_______________________________________________________________________________________________
Year   New Title Output   Annual % Change
_______________________________________________________________________________________________
2001   41,016             N/A
2002   43,554             6.19
2003   47,662             9.43
2004   44,981             -5.63
2005   42,975             -4.46
2006   47,124             9.65
2007   48,951             3.88
2008   N/A                N/A
_______________________________________________________________________________________________
Source: Yankee Book Peddler; R.R. Bowker (revised totals since 2000).

Table 7: New Title Output of Scholarly Books Published by Professional and Scholarly and Trade Publishers: 2001-2007

New title output of scholarly books by professional and scholarly publishers and trade publishers increased 19.35% between 2001 and 2007. Table 7 outlines this trend. Second, the emergence of the “serials crisis” (i.e., the growth in the number of, and annual subscription prices for, scholarly journals, often owned by large commercial scholarly publishers, e.g., Elsevier, Wolters Kluwer, Springer, Blackwell, John Wiley, and Taylor & Francis) triggered declines in academic library purchases of university press books (about 1,500 in the mid-1970s; about 200-300 in 2008) [8].



Third, the decline in the number of independent bookstores (about 4,400 in the 1970s and 1,800 in 2008) and the rise of book superstores (Barnes & Noble; Borders; Books-A-Million) [9]. Fourth, a dramatic change in book retailing channels of distribution: the rise in importance of the mass merchants (e.g., Wal-Mart; K-Mart; Target), price clubs (e.g., Costco; BJ’s; Sam’s Club), and other retailing establishments (e.g., supermarkets; drug stores; convenience stores; terminals) [10]. Fifth, precipitous declines in media usage (i.e., annual hours per person, above the age of 18, spent reading books) [11]. Sixth, the development of an interest in publishing scholarly titles by many of the large trade houses [12]. Seventh, the growth of the college textbook educational publishing sector [13]. Lastly, by the 1990s and the early years of the 21st century, several “disruptive technologies” emerged (e.g., the Internet and electronic publishing options; print-on-demand, or POD; the Open Access movement) that challenged traditional concepts regarding the distribution of intellectual content [14].

Starting around 1980, the majority of all university presses witnessed a sophisticated pincer movement by commercial trade, professional, and textbook companies eager to take business and market share away from university presses. In essence, the basic competitive advantage of university presses (i.e., the ability to dominate the publishing of scholarly books in their three key markets and channels of distribution) changed, at first slowly and then more rapidly; and many university press directors and editors (along with many academics rightfully concerned about this situation) pursued innovative and, unfortunately in some instances, unsuccessful strategies and responses to the frontal attack of the commercial publishing companies.
One cluster of press directors (and major industry leaders) issued jeremiads about the state of scholarly publishing, and they were joined by academics who ruminated, “How can I get tenure if you cannot publish my book?” Many directors (and industry leaders) tried to convince provosts and presidents to increase their funding to counterbalance the decline in sales. The next strategy was to ask foundations for funding to analyze the decline in sales. Lastly, some presses went to foundations for seed money to publish books in “critical areas” [15].

Another group of directors, more attuned to the ideas of finance and marketing, reevaluated their basic business models and crafted defensive strategies, including: reducing the print runs of new books; curtailing “dual editions” (often called “split runs,” i.e., the simultaneous printing of hardbound and paperbound versions of a new title); outsourcing line editing and certain production tasks; off-shoring typesetting and printing; reducing support staff (often secretaries); and changing domestic distributors, often going to one of the major university press distribution operations or relying on a printer to handle distribution and fulfillment. Some of these strategies worked; some did not.

In the years after 1980, two dramatic and completely unanticipated developments occurred (known as “Black Swans” to economists and marketers), which caught the majority of press directors, editors, and industry leaders off guard. First, far too many university presses failed to realize that the basic laws of supply and demand cannot be rescinded: they continued to increase title output (even while trimming print runs) as demand for their books sagged. Second, “Wall Street” firms decided to invest in the book industry. The term “Wall Street” refers to financial service companies in New York, Boston, Chicago, and San Francisco, as well as London, Paris, and elsewhere (for example, Bain & Co., Thomas S. Lee, JP Morgan, and Goldman Sachs).
Many commercial scholarly presses, trade houses, and college textbook companies were viewed by a growing number of Wall Street investment bankers, private equity managers, and hedge fund executives as either “value stocks” or “growth stocks,” and these investors put money into many of these companies, taking a few of them “private” [16]. This influx of invested money allowed these commercial publishing companies to gain access to needed capital (known as “capital deepening” to economists and marketers) for investment and expansion.





Why did Wall Street firms target book industry companies when they could have invested in more “glamorous” industries and firms? These Wall Street companies realized that book publishing economics were harsh and unforgiving, but they were understandable and quantifiable. This meant they could develop sophisticated statistical models to predict future earnings. For example, professional and scholarly book publishing companies (as of January 2008) had a low “beta,” which is a measure of volatility. In general, the Standard & Poor’s Index has a beta of 1.00; a stock with a beta higher than 1.00 has higher volatility but generally generates higher returns than the stock market, while a stock with a beta below 1.00 has lower volatility but generally generates lower returns. During March and April 2008, Pearson PLC had a beta of 0.95; McGraw-Hill’s beta was 1.24; John Wiley’s beta stood at 1.57; and Reed Elsevier’s beta was a rather low 0.65. As a point of comparison, during this same period, Hewlett-Packard’s beta was 1.09 and Amazon.com’s was 3.18 [17]. Scholarly and professional companies also had high “alphas” (i.e., successful editors and publishers able to find and cultivate authors who make money for the house). Clearly, many of these commercial scholarly, trade, and college textbook firms were targeted by Wall Street for investment and expansion.

Scholarly and professional publishers, many trade publishers (including Bertelsmann AG’s Random House; Pearson PLC’s Penguin; News Corp.’s HarperCollins; CBS’s Simon & Schuster; and Lagardère’s Grand Central, formerly Little, Brown and Warner Books), and all of the major textbook publishers (e.g., Pearson PLC’s Prentice-Hall; Cengage Learning, formerly Thomson Learning; McGraw-Hill; John Wiley-Blackwell; Von Holtzbrinck; Informa’s Taylor & Francis) crafted innovative strategies to penetrate and increase their market positions in the scholarly publishing world, including: attracting major scholars with advances (e.g., Professor Mankiw was paid $1.6 million in 1996 by Harcourt, now part of Cengage, to write a principles of economics textbook); offering generous “step” royalty options; pursuing aggressive marketing strategies; and enlarging and expanding channels of distribution in this nation and abroad [18].

In the years after 1980, these commercial publishing companies were able to sell their scholarly tomes or textbooks, pay taxes (university presses are exempt from taxes since they are non-profit entities under the U.S. Internal Revenue Code), provide appealing wages for employees (they hired people who loved books that made money), and make profits for their stockholders.

Many of the major commercial scholarly presses also published scholarly journals. Realizing the significant impact electronic journals had on their balance sheets (no printing, paper, binding, mailing, fulfillment, warehouses, warehouse personnel, etc.), many of the largest houses began to offer electronic versions of their books (either an entire book or one or more chapters of a title), a trend that was followed by the major college textbook publishers; as of 2008, trade houses have been unable to monetize their content significantly on digital platforms. While hard data on electronic sales revenues are difficult to obtain (quarterly and annual financial reports are silent on this issue), it is likely that between 15% and 20% (well over $1 billion) of scholarly and professional net publishers’ book revenues were generated through electronic sales or site license agreements. The number for textbooks is perhaps 5% (approximately $250 million); and trade publishers generate about $60 million annually through digital sales [19].

3. The Business Environment for University Presses: 2001-2006

Can university presses develop realistic marketing plans and regain their competitive advantage? Can they challenge the hegemony of large, global commercial publishers? In light of the proliferation of technological services, are university presses relevant, and needed, in the 21st century?



_______________________________________________________________________________________________
Business Model Assumptions
_______________________________________________________________________________________________
Print Run: 1,000 copies
Gross Sales: 970 copies [1,000 copies; -3% of print run for author’s copies and office copies]
  Exports: -20 copies
  General Retailers: -183 copies
  College Adoptions: -239 copies
  Libraries & Institutions: -184 copies
  High School Adoptions: 0 copies
  Direct to Consumers: -20 copies
  Other: 0 copies
Net Sales: 646 copies
Suggested Retail Price: $65.00
Average Discount: 47% [publisher nets $34.45 per copy]
PPB: $5,341.13 [printing, paper & binding; approximately 19% of net sales; industry average]
Plant: $1,124.45 [editorial and typesetting; approximately 4% of net revenues; industry average]
Marketing: $1,000.00 [$1 times the number of printed copies]
Royalty Advance: 0
Royalty Rate: 0% for the first 500 copies sold; 10% of net revenues for copies sold beyond 500
Subrights: $200.00 [filmed entertainment; reprints; book clubs; foreign rights; serial rights; 50% to author and 50% to publisher]
_______________________________________________________________________________________________
Revenues, Expenses, and Net Profit/Loss
1. Gross Sales: $33,416.50 [970 copies x $34.45]
2. Returns: -$5,856.50 [170 copies x $34.45]
3. Net Sales: $27,560.00
4. Plant: -$1,124.45
5. PPB: -$5,341.13
6. Earned Royalty: -$502.97 [146 copies at $3.445]
7. Inventory Write-Off: -$1,730.53 [970 - 646 = 324 copies x $5.34]
8. Total Cost of Goods Sold (COGS): $8,699.08 [#4 + #5 + #6 + #7]
9. Initial Gross Margin: $18,861.00 [#3 - #8]
10. Other Publishing Income: +$100.00 [50% to publisher]
11. Final Gross Margin: $18,961.00
12. Marketing: -$1,000.00
13. Overhead: -$8,268.00 [30% of net sales revenues]
14. Net Profit/Loss: $9,693.00
_____________________________________________________________________________________________
Source: Greco’s estimates; industry averages.

Table 8: Sample Profit & Loss (P & L) Statement for a Hardbound University Press Book

An analysis of a “typical” university press book’s profit and loss statement (P & L) provides a preliminary framework for addressing some of the questions listed above. We start with a series of basic business assumptions regarding: (1) the print run; (2) gross sales; and (3) potential sales to exporters, general retailers, college adoptions, libraries and institutions, high school adoptions, sales made by the press directly to consumers, and any “other” sales. These assumptions are based on past experience with similar books and a healthy dose of optimism (perhaps more of the latter than the former).

157


158


The next step is to determine: (1) net sales (gross sales minus returns); (2) the suggested retail price; and (3) the average discount (books are sold to retail establishments and distributors at a discount; industry averages were utilized in all of these calculations). Other expenses are then estimated: (1) printing, paper, and binding (PPB; 19% of net revenues is the industry average); (2) editorial and typesetting (plant; 4% of net revenues is the industry average); (3) marketing; (4) the royalty advance against earned royalties; (5) the royalty rate; and (6) any foreign or sub rights.

Once these estimates are determined, the actual financial P & L can be run: (1) gross sales minus returns equals net sales revenues (in general, most books are fully returnable to the publisher for a full credit as long as the published terms and conditions of sale are followed by the retailer or distributor); (2) plant, PPB, earned royalties, and inventory write-offs are subtracted from net sales to determine the total cost of goods sold and the initial gross margin; (3) other income is added to the initial gross margin to calculate the final gross margin; (4) marketing costs and the ubiquitous overhead are deducted from the final gross margin; and (5) the end result is either a net profit or a net loss.

Table 8 indicates that this “typical” book, which took months to edit and print and tied up thousands of dollars, generated a net profit of $9,693.00. Any slippage in sales could have produced a loss. Our extensive research (and hundreds of discussions with university press directors, commercial scholarly publishers, and trade book executives) indicated that seven out of every ten new books lose money, two break even financially, and one is a financial success. The vast majority of university presses post financial losses annually, even with subsidies from their universities. Table 8 outlines in detail the economics of publishing a book.
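The P & L steps above can be traced in a short script. The figures are Table 8’s illustrative assumptions for a hypothetical 1,000-copy hardbound title, not real press data; small rounding differences from the printed table are possible.

```python
# Sketch of the Table 8 P & L arithmetic for a hypothetical hardbound title.
retail_price = 65.00
discount = 0.47
net_price = retail_price * (1 - discount)      # publisher nets ~$34.45 per copy

gross_sales = 970 * net_price                  # ~$33,416.50
returns = 170 * net_price                      # ~$5,856.50
net_sales = gross_sales - returns              # ~$27,560.00

plant = 1124.45                                # editorial and typesetting
ppb = 5341.13                                  # printing, paper & binding
royalty = 146 * 0.10 * net_price               # 10% of net on copies sold past 500
write_off = 324 * (ppb / 1000)                 # unsold copies at unit PPB cost

cogs = plant + ppb + royalty + write_off       # ~$8,699.08
initial_margin = net_sales - cogs
final_margin = initial_margin + 100.00         # publisher's share of subrights
overhead = 0.30 * net_sales                    # ~$8,268.00
net_profit = final_margin - 1000.00 - overhead # marketing is $1,000
print(round(net_profit))  # ~9693, matching Table 8's bottom line
```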
The second analysis centered on a study of 63 university presses between 2001 and 2006 (no data were available for 2007). These presses ranged in size, with annual revenues of $900,000-$1.5 million (22 presses), $1.5 million-$3 million (16 presses), $3 million-$6 million (18 presses), and more than $6 million (8 presses). In terms of net operating income (i.e., total book sales income plus any other publishing income, minus operating expenses: editorial, production and design, order fulfillment, etc.), losses were posted for all of these presses between 2001 and 2006. The addition of direct parent institution financial support, other subsidies, grants, endowments, and “other press activities” changed the economic picture somewhat. These 63 presses recorded positive total net income results in 2004, 2005, and 2006; losses were generated in 2001, 2002, and 2003. We estimate that a positive net income will be posted by these presses in 2007 and a negative net income in 2008 (and possibly in 2009). So book operations posted losses in all six years, and financial support from the parent institution ameliorated the situation in three of those six years. In reality, the basic business model of selling printed scholarly books by university presses did not work between 2001 and 2006, and a review of substantive datasets revealed it has not worked since 1945. If parent institutions trimmed even slightly their financial commitments to the presses, the majority of presses would be deeply in the red financially. What should these presses do?

4. Recommendations

Based on an analysis of the relevant, available data, we believe that university presses should consider adopting an exclusively Open Access (i.e., electronic publishing) policy. While each press would continue to utilize the well-established and critically important peer review process for manuscripts, and would develop its own guidelines, we believe it is imperative financially and economically for these presses to consider the following. First, institute a realistic manuscript submission fee, paid by the author(s) (or the author’s academic department and/or college), perhaps $250.00, to cover the initial internal editorial costs



associated with reviewing a submission. Second, if the manuscript has merit and fits into a press’ list, a second fee, paid by the author(s) (or the author’s academic department and/or college or university), perhaps $250.00, would cover the expenses of sending the manuscript out for peer review. Many scholarly journals have similar fees, paid by the author(s) or the author’s department or college; and most universities currently provide some budgets for academics to attend scholarly conferences. This fee structure would become another cost of running a department or college. Another issue centers on the fact that approximately 95 U.S. university presses support the scholarly book publishing activities of academics at more than 3,000 U.S. colleges and universities, as well as foreign colleges and universities. So a fee structure provides financial support for the university presses that bear the brunt of reviewing, editing, and publishing an important number of books for faculty members at colleges that do not have a press. Would a fee structure place an unreasonable burden on an author earning a meager salary at a small college, or on a department at a college that does not have the financial resources of a well-endowed university? Yes; and the existing playing field is not even. Academics at universities with light teaching schedules and access to substantial financial resources for research have an important competitive advantage over scholars in underfunded departments. These are very serious issues, but they are clearly beyond the scope of this paper. Third, if the peer reviewers recommend publication, a final fee, paid by the author(s) (or the author’s academic department and/or college), perhaps $10,000.00, would cover the costs associated with line editing, typesetting, posting the book on the press’ Open Access site, etc. Any or all of these fees could and should be waived for academics from developing nations. Table 9 outlines an Open Access P & L.
A small press using the model in Table 9 and releasing 20 Open Access books would generate $128,511.00 in profit; a large press releasing 100 titles would generate $642,555.00 in profit. Three other calculations must be considered. First, we were told that the average press receives about 10 manuscripts for every one published. Assuming the fee-based structure dampened the submission of manuscripts, and the small press received 100 submissions at $250.00 each, an additional $25,000.00 in income could be booked; the large press might receive 500 submission fees of $250.00, generating an additional $125,000.00 in revenues. Second, not every press has a contract with an author covering electronic rights. So some backlist titles (i.e., books more than 9 months old) would remain print only, although POD could handle these titles. Third, the existing inventory would have to be stored in a warehouse, triggering costs. It could take at best 4-5 years (2012-2013) to reduce this inventory through sales (or write-offs). The movement toward an Open Access-only system provides positive financial results for university presses, allows them to compete with other publishers that are moving rapidly toward the electronic distribution of content, and puts these presses on a sound financial footing, allowing them to continue to exist in both good and bad economic business cycles.
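The per-press profit figures above follow directly from the bottom-line arithmetic printed in Table 9 (publication fee, submission fee, net POD sales, and subrights income, less cost of goods sold and overhead, netting $6,425.55 per title). A minimal check, using only Table 9’s illustrative figures:

```python
# Per-book profit under the Open Access model, following the bottom-line
# calculation printed in Table 9.
income = 10000.00 + 250.00 + 500.00 + 100.00  # pub fee + submission + POD + subrights
cogs = 1424.45                                # plant + royalty + peer review fee
overhead = 3000.00                            # 30% of the publication fee
per_book_profit = income - cogs - overhead    # = $6,425.55 per title

# Scaling to a press's annual list, as in the text:
print(round(20 * per_book_profit, 2))    # small press, 20 titles: 128511.0
print(round(100 * per_book_profit, 2))   # large press, 100 titles: 642555.0
```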






_______________________________________________________________________________________________
Business Model Assumptions
_______________________________________________________________________________________________
Print Run: 0
Net Sales: 25 POD copies [POD is print on demand]
Suggested Retail Price: $30.00 POD [$10.00 unit manufacturing cost]
Average Discount: 0
PPB: 0
Plant: $1,124.45
Marketing: $100.00
Royalty Advance: 0
Royalty Rate: 10% of net revenues for all POD copies
Subrights: $200.00
_______________________________________________________________________________________________
Revenues, Expenses, and Net Profit/Loss
1. Gross Sales: $750.00 [25 copies x $30.00]
2. Returns: 0
3. Net Sales: $500.00 [25 copies x $30.00, less the $10.00 unit manufacturing cost]
4. Plant: -$1,124.45
5. Earned Royalty: -$50.00 [25 copies at $2.00]
6. Inventory Write-Off: 0
7. Peer Review Fee: -$250.00
8. Total Cost of Goods Sold: $1,424.45 [#4 + #5 + #7]
9. Other Publishing Income:
   Sub rights: +$100.00 [50% to publisher]
   Submission fee: +$250.00
   Peer review fee: +$250.00
   Publication fee: +$10,000.00
10. Marketing: -$100.00
11. Overhead: -$3,000.00 [30% of the $10,000.00 publication fee]
12. Net Profit/Loss:
    $10,000.00 + $250.00 + $500.00 + $100.00 = $10,850.00
    $10,850.00 - $1,424.45 - $3,000.00 = $6,425.55 profit for this book
_____________________________________________________________________________________________
Source: Greco’s estimates; industry averages.

Table 9: Sample Profit & Loss (P & L) Statement for an Open Access University Press Book

5. Conclusions

Clearly, the world has changed in the last 20 years. Computers, the Internet, iPods, and cell phones seemed to sprout up everywhere (or at least in most developed nations), and satellites linked most regions of the world. Yet far too many university presses maintained a centuries-old commitment to an unprofitable business model for their books. Based on an analysis of the empirical data; a review of the published literature and existing business models; our visits and discussions with leaders at more than 50 U.S. university presses (e.g., Harvard, Princeton, MIT, Chicago, Stanford, Carnegie-Mellon, Duke); our discussions with faculty members; and our focus group interviews with more than 500 undergraduate and graduate students, we recommend the following procedures to ensure the continued viability of university presses. First, all direct university press financial subsidies (excluding non-financial subsidies, e.g., free rent, free access to legal services, etc.) provided by their home university should be discontinued by 2012-2013. Of


Should University Presses Adopt An Open Access [Electronic Publishing] Business Model ...

course in a market economy, any university that insisted on providing a financial subsidy to its university press can continue this policy. Second, in light of the increased utilization and acceptance of “Open Access” [and electronic publishing] publication models in the scholarly journal sector, a realistic electronic publishing “Open Access” business model should be adopted by university presses for “all” of their books by 20122013. Third, existing stringent peer review should be maintained by each university press as it adopts an Open Access business model. Fourth, each university press should determine an appropriate Open Access fee to be paid to the press after a manuscript has undergone peer review and after it has been accepted for publication by a press; this fee can be paid by the author(s), by the author’s academic department and/or college, through research grants-funds, etc.; waivers of the Open Access fee should be granted to an author(s) from a developing country. Fifth, each university press should consider selling a hard copy, preferably one produced through a “print on demand” (POD) system, to any individual, library, etc. that prefers or needs a hard copy. This Open Access-POD procedure has been utilized successfully by a number of non-profit publishers (e.g., National Academies Press; The World Bank). Sixth, the “university press community,” working with librarians, NGOs, etc., craft a global marketing strategy (by 2012-2013) to license digital content in developing nations, especially titles addressing pivotal issues related to economic development, poverty, disease, global warming, and globalization. Seventh, it appears likely, at least in the next 3-5 years, that the scholarly book will remain the principal scholarly platform in the tenure, promotion and merit process in the humanities and in many areas of the social sciences. 
It will take more than 4-5 years to convince deans and provosts that peer reviewed Open Access electronic books have the same value as a printed book. What might expedite thinking in academia is the "acceptance" of "electronic books" and "electronic book readers" in the trade book market. Eighth, the transformation to an Open Access publishing platform will take 4-5 years. Contracts for many backlist books (especially contracts from the 1990s) might not contain clauses regarding the electronic distribution of a specific author's book; and unless those contracts are renegotiated, those titles will remain print only. Recently signed contracts (for manuscripts to be delivered in 2008, 2009, and possibly 2010) are unlikely to contain an Open Access-only clause; unless they are renegotiated, these books will also remain print only. So it is likely that a university press will have to announce its Open Access policy, and new contracts for manuscript submission in 2010 or 2011 will have to contain the appropriate language. Will some academics refuse to submit a manuscript to an Open Access university press? Yes; but the "publish or perish" mindset of university deans and provosts might be a significant counterbalancing force. However, our analysis sparked some intriguing questions. First, are university presses necessary in an age of electronic distribution of content and a plethora of publishing opportunities offered by scholarly, trade, and textbook publishing companies, all with a broad reach and financial resources that exceed those of the vast majority of university presses? University presses have a mission to publish and disseminate scholarship, and they offer a useful counterbalance to commercial publishers, although university press title output should be reduced to better match supply with demand. Would scholarship continue to flourish in a strictly commercial publishing environment?
While the precise answer to this question is unknown and unknowable unless all university presses disappeared, the outpouring of research titles from commercial publishers might indicate that scholars could continue to get their research out to academics and students. Second, is the institutional affiliation of university presses necessary in an Open Access-commercial publishing environment? From a financial point of view, the answer is no. Universities could reallocate press funding to support other activities, including faculty salaries, scholarships, Open Access publication fees, etc. If we evaluate the social mission-public relations component of a university affiliation, we might reach a different conclusion. We asked this question of a number of press directors, and one response was telling. The director told us, "R______ University gets great P.R. every time we publish a book that is reviewed in The New York Times, especially if we do a book that has broad consumer appeal and is highlighted in an article by the Associated Press." There is no easy answer to this question. Third, the electronic distribution of content by commercial scholarly and textbook publishers has been, at least so far, dependent on downloads onto desktop or laptop (notebook) computers and not e-book readers. Price and convenience have been the two main reasons. The average e-book reader costs between $300 and $400, and then the user has to pay for the book download. Most academics and students have computers, making downloads relatively easy. The newest e-book reader (the highly publicized Kindle) has a black and white screen; most textbooks and many scholarly books rely on color for charts, graphs, etc. So the price of the e-book reader would have to be reduced significantly to penetrate the academic market (in essence the "King Gillette" model would have to be utilized), and color options would have to be offered. Fourth, what is the relationship between print-only sales and electronic downloads? While we reviewed data on print sales for 2001-2007 (in reality we also reviewed print sales datasets back to the 1960s), no publisher has released data on electronic downloads. We reviewed quarterly and annual financial reports, Wall Street analysts' reports, conference calls with stock analysts, and visits to a number of publishing companies. Publishing executives told us they book electronic download revenues and not units, and they did not release any data on the ratio between print and electronic download sales.
We investigated this issue unsuccessfully in the summer of 2007; however, during the fall of 2007 we began to observe certain patterns that provide a "working analysis" of download revenues. We know that McGraw-Hill textbook operations sold 10,000 downloads in 2006, although we could not ascertain whether these were full book downloads and/or book chapter downloads. We estimate that the 2007 digital revenue number for commercial scholarly publishers is perhaps $1 billion; textbooks are approximately $250 million; and trade publishers generated about $60 million through digital sales. However, a significant amount of research is needed to develop firmer numbers. Fifth, many universities have launched online course initiatives; and Harvard University's faculty of arts and sciences created an opt-out only policy regarding Harvard's posting of scholarly journal articles. Both of these developments are too recent to evaluate in the context of the Open Access book movement, although both will require analysis in the next year or so. The ultimate goal of all U.S. university presses is to reach readers, whether or not those readers are currently able to access or purchase printed university press intellectual content or books. Clearly, university presses in the U.S., and indeed throughout the world, face exceptionally complex problems related to their intellectual products, convoluted distribution systems, and increased competition from commercial trade, scholarly, and textbook publishers who are moving rapidly into the electronic publishing of their content. There are no "simple" answers to any of these thorny problems; and a review of the published literature reveals the complexity associated with the current emphasis on printed scholarly books [20].
However, we believe a realistic Open Access (electronic publishing) business model will better position university presses to fulfill their mission to disseminate scholarly knowledge and, concomitantly, mitigate the debilitating economic problems that are undermining the very foundation of these presses and threatening their future.

6. Notes and References

[1] COSER, L.A.; KADUSHIN, C.; POWELL, W.W. (1982). Books: The Culture and Commerce of Publishing. Pages 45-57. Basic Books, New York.
[2] GRECO, A.N.; RODRIGUEZ, C.E.; WHARTON, R.M. (2007). The Culture and Commerce of Publishing in The 21st Century. Pages 36-84. Stanford University Press, Stanford, CA.
[3] HARVEY, W.B.; BAILEY, H.S.; BECKER, W.C.; PUTNAM, J.B. (1972). The impending crisis in university press publishing. Journal of Scholarly Publishing 3(3): 195-200; BEAN, D.P. (1981). The quality of American scholarly publishing in 1929. Journal of Scholarly Publishing 12(3): 259-268. The journal Scholarly Publishing changed its name to the Journal of Scholarly Publishing; this new name is the one used for current and past issues.
[4] GRECO, A.N. (2005). The Book Publishing Industry, 2nd ed. Page 345. Lawrence Erlbaum Associates, Mahwah, NJ.
[5] VAN IERSSEL, H. (2007). Annual university press statistics 2003 through 2006. Pages 23-35. Association of American University Presses, New York; VAN IERSSEL, H. (2006). Annual university press statistics 2001 through 2004. Pages 24-30. Association of American University Presses, New York; VAN IERSSEL, H. (2001). Annual university press statistics 1997 through 2000. Pages 16-24. Association of American University Presses, New York; VAN IERSSEL, H. (2000). Annual university press statistics 1996 through 1999. Pages 16-22. Association of American University Presses, New York.
[6] TALEB, N.N. (2007). The Black Swan: The Impact of the Highly Improbable. Pages 3-21. Random House, New York; TALEB, N.N. (2004). Fooled By Randomness. Pages 5-20, 43-64.
[7] BOOK INDUSTRY STUDY GROUP, INC. (2007). Book Industry Trends 2007. Pages 136-152. Book Industry Study Group, Inc., New York; all of the statistical datasets, projections, and several essays were prepared by Greco, A.N. and Wharton, R.M.
[8] KYRILLIDOU, M.; YOUNG, M. (2008). ARL Statistics 2005-2006: A Compilation of Statistics From One Hundred and Twenty-Three Members of the Association of Research Libraries. Page 12. Association of Research Libraries, Washington, D.C.
[9] BOGART, D., ed. (2007). The Bowker Annual: Library and Book Trade Almanac 2007, 52nd ed. Pages 514-515. Information Today, Inc., New Providence, NJ.
[10] GRECO, A.N. (2005). The Book Publishing Industry, 2nd ed. Pages 26-50. Lawrence Erlbaum Associates, Mahwah, NJ.
[11] VERONIS SUHLER STEVENSON. (2007). Communications Industry Forecast 2007-2011, 21st ed. Pages 52-57. Veronis Suhler Stevenson, New York.
[12] BOOK INDUSTRY STUDY GROUP, INC. (2007). Book Industry Trends 2007. Pages 181-198. Book Industry Study Group, Inc., New York; all of the statistical datasets, projections, and several essays were prepared by Greco, A.N. and Wharton, R.M.
[13] GRECO, A.N. (1987-1988). University presses and the trade book market. Book Research Quarterly 3(4): 34-53.
[14] CHRISTENSEN, C.M. (2000). The Innovator's Dilemma. Pages xi-xxxii. HarperCollins, New York.
[15] KERR, C. (1987). The Kerr report: One more time. Publishers Weekly, 5 June: 20; KERR, C. (1987). One more time: American university presses revisited. Journal of Scholarly Publishing 1(4): 8-10.
[16] BASS, T.A. (1999). The Predictors: How a Band of Maverick Physicists Used Chaos Theory to Trade Their Way to a Fortune on Wall Street. Pages 8-33. Henry Holt, New York; see also: RODRIK, D. (2007). One Economics, Many Recipes: Globalization, Institutions and Economic Growth. Pages 85-98. Princeton University Press, Princeton, NJ; BURTON, K. (2007). Hedge Hunters: Hedge Fund Masters on the Rewards, the Risk, and the Reckoning. Pages 163-177. Bloomberg Press, New York; GADIESH, O.; MacARTHUR, H. (2008). Lessons From Private Equity Any Company Can Use. Pages 28-57. Harvard Business School Press, Boston, MA; DERMAN, E. (2004). My Life As A Quant: Reflections on Physics and Finance. Pages 17-28. John Wiley & Sons, Hoboken, NJ; MALKIEL, B. (1996). Pages 164-193. W.W. Norton, New York; DAVENPORT, T.H.; HARRIS, J.G. (2007). Competing on Analytics: The New Science of Winning. Pages 41-82. Harvard Business School Press, Boston, MA.
[17] Statistical data about a specific firm's "beta" (and other key economic indicators) can be found at www.finance.yahoo.com.
[18] GRECO, A.N.; RODRIGUEZ, C.E.; WHARTON, R.M. (2007). The Culture and Commerce of Publishing in The 21st Century. Pages 36-84. Stanford University Press, Stanford, CA.
[19] Greco's estimates of the electronic revenue streams for these formats.
[20] Rice University Press launched an Open Access-POD scholarly press in 2007; however, its operation is rather new, and few conclusions can be drawn from Rice's history. See also: BROWN, L.; GRIFFITHS, R.; RASCOFF, M. (2007). The Ithaka Report: University Publishing in a Digital Age. Pages 1-62. Available at: http://www.ithaka.org/strategic-services/university-publishing; GREENBLATT, S. (2002). Dear Colleague letter to members of the Modern Language Association. Pages 1-2. May 28, 2002; HAHN, K.L. (2008). Research library publishing services: New options for university publishing. Available at: http://www.arl.org; HOWARD, J. (2008). New open-access humanities press makes its debut. The Chronicle of Higher Education. May 7, 2008. Available at: http://chronicle.com; RAMPELL, C. (2008). Free textbooks: An online company tries a controversial publishing model. The Chronicle of Higher Education. May 7, 2008: A14; COHEN, N. (2008). Start writing the eulogies for print encyclopedias. The New York Times, March 3, 2008: WK3; HOOVER, B. (2008). University press tries digital publishing (refers to the University of Pittsburgh Press). Available at: http://www.post-gazette.com; MILLIOT, J. (2008). Report finds growing acceptance of digital books. Publishers Weekly, February 18, 2008: 6; INTERNATIONAL DIGITAL PUBLISHING FORUM (2008). Industry statistics. Available at: http://www.idfp.org; GRAFTON, A. (2007). Future reading: Digitization and its discontents. The New Yorker, November 5, 2007. Available at: http://www.newyorker.com; TAYLOR, P. (2007). Kindle turns a new page. The Financial Times, November 23, 2007: 12; CRAIN, C. (2007). Twilight of the books: What will life be like if people stop reading books? The New Yorker, December 24, 2007. Available at: http://www.newyorker.com; EBRARY (2007). 2007 global faculty e-book survey. Pages 1-46. Available at: http://www.ebrary.com.

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008



Scholarly Publishing within an eScholarship Framework – Sydney eScholarship as a Model of Integration and Sustainability

Ross Coleman
Sydney eScholarship, Fisher Library, F03, University of Sydney
New South Wales, 2006, Australia
email: r.coleman@library.usyd.edu.au

Abstract
This paper will discuss and describe an operational example of a business model in which a scholarly publisher (Sydney University Press) functions within an eScholarship framework that also integrates digital collections, open access repositories and eResearch data services. The paper will argue that such services are complementary, and that this level of integration benefits the development of a sustainable publishing operation. The paper describes the business model as a dynamic hybrid. The kinds of value considered include tangible and intangible benefits as well as commercial income. The paper illustrates the flexible operational model with four brief case studies enabled by integrating repository, digital library, and data services with an innovative publishing service.
Keywords: eScholarship; scholarly communication; Sydney University Press; eResearch; data publication

1. Introduction

Hardly a week goes by without some new challenge to scholarly communication that demands attention, and occasionally, perhaps, a pause to get some bearings. Information technologies and the opportunities of the semantic web and Web 2.0 press from one side, the complexity of rights and open or managed access from another, and, from yet another side, the need for sustainable and viable business or operational models. Beneath lies a yawning divide between the corporate publishing world and that of the institutions (and, within the institutions, the relationships between the traditional presses and the emerging e-presses), and overlaying it all is the power and omnipresence of the global search engines. Approaching over the horizon are the demands and complexity of e-research – as cyberinfrastructure, but also as authoritative data that is itself a form of publication. The decisions being made now about how best to engage in this environment are not the final solutions. What we need are the best kind of foundations – flexible, responsive, light and open – on which to build the new scholarly publishing and communications structures of the future. A tired cliché, but true: continuing change is the only certainty. Nor are there any single solutions – we need to work within innovative frameworks that accommodate this diversity and these challenges and opportunities, and frameworks that facilitate new partnerships. The framework we have chosen to work within is that of eScholarship. As there are no single solutions, this model must be a dynamic hybrid, seeking to respond and deliver to a diverse and changing set of demands and markets: a model providing solutions for the creators and the consumers of scholarly publications.

This paper will discuss an operational program and a business model, or methodology, where scholarly publication functions within an eScholarship framework that also integrates digital collections, open access repositories and eResearch data services. The paper will argue that such services are complementary, and that this level of integration benefits the development of a sustainable publishing operation. This argument will be illustrated with results in four brief case studies. The primary platform for scholarly publishing at the University of Sydney – Sydney University Press – operates as an integral part of the University Library's Sydney eScholarship program [http://escholarship.usyd.edu.au].

2. Discussion – on eScholarship, publishing, sustainability and integration

Sydney eScholarship operates as an integrated set of services, characterised by:

• commitment to standards for archiving and re-use
• delivery capabilities – publishing services for books, journals, conferences and new forms
• stable open digital repository services
• project analysis and planning advice
• digital library collections and services
• business planning, legal compliance and secure e-commerce capabilities
• partnerships, collaborations and opportunism

2.1 eScholarship

We at Sydney were inspired by the vision of eScholarship originally enunciated by the California Digital Library (CDL): "eScholarship … facilitates innovation and supports experimentation in the production and dissemination of scholarship. Through the use of innovative technology, the program seeks to develop a financially sustainable model and improve all areas of scholarly communication…." [1] CDL continues to explore sustainable models, and acts as a leader in innovation, developing services and tools – for example, XTF (eXtensible Text Framework), which is being implemented across a number of digital library and publishing services, including at Sydney. The term "eScholarship" is used variously according to context or, indeed, convenience. The most common use is in regard to digital repository services (Boston, Queensland et al.), and sometimes as a catch-all descriptor for services associated with digital activities in higher education. [2] If there is any commonality in usage, it is in reference to digital archive services. At Sydney we have taken a broader understanding of eScholarship – as an overarching framework. The vision enunciated by CDL enabled us to conceptualise and implement a coherent approach to deliver the strategic and operational ambitions for many of the Library's digital collection and publishing activities. It allowed us to articulate the relationships underlying these activities, and the new roles and expectations in integrating digital collections, open repository services and emerging eResearch support services at the University with a publishing operation. Importantly, it has allowed us to address these activities and relationships pragmatically, offering a set of services that we feel are operationally sustainable, beneficial and productive.



The service components of Sydney eScholarship and the business model underlying these operations will be discussed below, after briefly considering the concept of sustainability.

2.2 Sustainability

Sustainability is one of those comforting, aspirational, but slippery, goals, depending on context. Some insight into the complexity of sustainability in the digital environment was gained through participation in the federally funded Australian Partnership for Sustainable Repositories (APSR) [3]. Digital sustainability is described by Kevin Bradley in his Sustainability Issues Discussion Paper for APSR [4] as being technical, social and economic. Bradley describes the following aspects of sustainability:

• The sustainability of the raw data – the retention of the byte-stream.
• The sustainability of access to meaning – content remaining meaningful for creator and user.
• The economics of sustainability – the continued existence of the institutions that support the technology.
• The organisational structure of digital sustainability – the relationships between the rights holder, the archive and the user.
• The economics of participation – matters of incentives and inhibitors.
• Sustainability and the value of the data – value through the life-cycle.
• Tools, software and sustainability.

Central to any practical discussion of sustainability, and implicit in Bradley's discussion, is the need for organisational continuity. Such a key requirement also underlies similar typologies, such as the attributes and responsibilities of the Trusted Digital Repository. [5] The traditional purveyor of curatorial continuity for publication is the library. While this does not necessarily need to be so in the future, it does explain the repository role of many libraries – in terms of both assertion and expectation [6]. In Australia, these roles have been formalised as libraries are funded to provide repository services for various government research assessment initiatives, such as the new Excellence in Research for Australia (ERA) program replacing the Research Quality Framework (RQF) [7], or, in the UK, the Research Assessment Exercise (RAE) [http://www.rae.ac.uk/]. The library at many universities, like Sydney, is often the only organisational and curatorial entity that has existed (in one form or another) throughout an institution's history. The viability of services such as Sydney eScholarship, committed to the long term management and preservation of digital content, relies to a large extent on organisational continuity as part of the University Library. Indeed, this association has raised researchers' expectations that libraries will have a central role in supporting such initiatives. In a practical sense, any ambition to provide sustainable information services over the longer term requires organisational continuity and commitment. But this needs to be accompanied by the appropriate skills and expertise, infrastructure services with forward development plans, an innovative, proactive and responsive approach, and a viable and demonstrable operational or business plan to ensure future funding. Within a publishing environment the business plan is critical (even if the plan is 100% institutional subsidy).





2.3 eText to eScholarship

The Sydney eScholarship program was formally launched in 2006, but these services (and the appropriate skill sets) had been evolving over a decade, since the establishment of SETIS (Sydney Electronic Text and Image Service) in the mid 1990s. SETIS was initially established as an eText centre in 1996, similar to many others in the US, and due in part to the missionary zeal of a visiting David Seaman, then at Virginia. The evolution of SETIS from a service networking commercial full text databases to a service creating etext collections was rapid. The skills translated easily from one service to the other. These services provided a platform for the creation of text and image based digital library collections. The expertise built up during the 1990s gave SETIS a national reputation in creating and managing such text collections, with a focus on Australian studies [http://setis.library.usyd.edu.au/oztexts/]. This reputation has grown through active partnerships in major research grants in Australian literary and historical studies. The first major project for SETIS was to create and provide full text of primary (literary) and secondary (critical) texts for AUSTLIT [www.austlit.edu.au], the major Australian literary bibliographical and biographical database, funded through Australian Research Council (ARC) grants. This commitment to AUSTLIT continues with ongoing digital conversion of selected literary works. A production process for digital conversion was developed for this and other projects. To ensure the highest possible accuracy, digital conversion involved the double-keying of texts. We eventually settled on a preferred vendor in Chennai, India, and this company remains our major production vendor for digital conversion. Texts were converted, and XML files were returned in our established DTD with basic structural mark-up.
Further mark-up to the TEI (Text Encoding Initiative) guidelines and processing took place in SETIS, and the XML files were rendered into HTML and web PDFs depending on requirements. The textual corpora created by SETIS – XML-based collections with a range of presentation options – provided Sydney with leadership and acknowledged expertise in creating primary source text collections in Australian studies. The role of SETIS as the primary full text manager in Australian literature also provided the opportunity to consider establishing a publishing operation to meet the demand for print versions. Our Indian vendor – also a production house for several major European publishers – provided additional services such as type-setting for potential print production. The reputation of SETIS continues to bring new and exciting collaborations, consolidating our role, and providing the innovative impetus and funding for much of the new major project work done in Sydney eScholarship.
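As a rough illustration of the rendering step described above, the sketch below walks a minimal TEI-like fragment and emits HTML. The fragment and the rendering rules are invented for illustration; the paper does not show SETIS's actual DTD, TEI customisation, or production stylesheets (which in practice would be far richer, typically XSLT-based).

```python
# Toy rendering pass: a minimal TEI-like XML fragment -> HTML.
# The markup below is an invented example, not the SETIS DTD.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

sample = f"""
<text xmlns="{TEI_NS}">
  <body>
    <div type="chapter">
      <head>Chapter I</head>
      <p>It was a dry season on the Darling.</p>
      <p>The shearers came down from the north.</p>
    </div>
  </body>
</text>
"""

def render_html(tei_xml: str) -> str:
    """Render chapter divs as <h2> headings followed by <p> paragraphs."""
    root = ET.fromstring(tei_xml)
    ns = {"tei": TEI_NS}
    parts = []
    for div in root.iterfind(".//tei:div", ns):
        head = div.find("tei:head", ns)
        if head is not None and head.text:
            parts.append(f"<h2>{head.text}</h2>")
        for p in div.iterfind("tei:p", ns):
            parts.append(f"<p>{p.text}</p>")
    return "\n".join(parts)

print(render_html(sample))
```

A real pipeline of this kind would also validate against the project DTD/schema and branch on output target (HTML for the web, formatting objects or typesetter input for print and PDF).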

2.4 Sydney University Press and Sydney eScholarship

Sydney University Press (SUP) had existed as a traditional print publisher and press. It was initially established by the University in 1962, but after 25 years of operation it was effectively dismantled due to the heavy infrastructure costs. Over this period SUP produced a major list of over 600 titles and several major journals. In 1987 the SUP imprint was sold by the University to Oxford University Press. The imprint was then used mostly for textbooks, but was eventually relinquished by OUP in the early 1990s, and the business name and imprint were abandoned. The University re-registered SUP in 2003 under Library management "to address the challenges of scholarly publication in the networked environment". The reputation of SETIS as a digital library platform facilitated the re-establishment. Case study #1 (digital library to publisher: Classic Australian Works, section 4.1 below) describes how the operation and reputation of SETIS was fundamental to re-establishing SUP. Sydney University Press [http://www.sup.usyd.edu.au/] was revived in the same milieu as a number of new e-presses (many associated with libraries). In Australia these included Monash e-Press, ANU ePress and UTS e-Press. Though SUP was re-established in this context, a decision was made to stay with the name Sydney University Press and not adopt an e-Press banner. There were several reasons for this: it was an established brand name, we were determined to be a print as well as an electronic publisher, and we needed to present a business case that would include its operation as a commercial publisher generating income. Sydney eScholarship was established as a set of innovative services for the University of Sydney to integrate the management and curation of digital content and research with new forms of access and scholarly publication. Within this framework a viable publishing operation was important to add value to the set of services. The business components of this 'adding value' are discussed in the methodology of the business model, section 3 below, but real and tangible value flows through all the services of Sydney eScholarship. As a commercial publisher SUP publishes new editorially accepted and market-tested titles, as well as a growing re-print list. All titles are electronically archived and sold print on demand or in short runs. While publishing provides transactional value, the digital library collections and the repository manage content and provide the sound archival foundation to facilitate publishing services. In reality each service provides functional value to the other. Publications derived from associated data sets, described in Case study #2 (data to publication: from surf to city, section 4.2 below), are increasingly part of this value chain. Within the digital environment the key to the value chain (or value circle) is the capacity to re-use, re-present or re-engineer content into different environments. Depending on circumstances this may be into an open access environment or a managed or commercial environment.
This capacity to address different demands is an essential part of operational viability. The Dictionary of Sydney project, another ARC research project [www.dictionaryofsydney.org/] in which we are a partner, explicitly has such a model of content re-use and re-engineering in both open and commercial spheres, to ensure the sustainability of the project when research funding ceases. This includes forms of publication via SUP. In this operational context Sydney eScholarship can be broadly understood as an integration of the Sydney Digital Library (creating, managing and curating content) and Sydney University Publishing (providing associated business and publishing services). This is outlined in the table below.

Sydney eScholarship

Sydney Digital Library:
• eScholarship repository
• SETIS digital collections
• Sydney Digital Theses
• data project analyst and advisory services
• hosting subject data services

Sydney University Publishing:
• Sydney University Press
• other imprints
• digital / print on demand services
• eStore, eCommerce and business services
• experimental publication

Table 1: Sydney eScholarship services

Sydney University Publishing, while centred on SUP, does provide other services, many of a business nature. SUP is established as a commercial and scholarly imprint, and this identity needs to be maintained both as a quality publisher and as one that complies with formal 'research publication' requirements. We do provide other imprints, such as Darlington Press, for more popular or semi-academic titles. We provide print-on-demand services for other publishers such as Monash ePress and UTS ePress, as well as for administrative publications, such as University Faculty Handbooks. We will also provide a secure eStore service for the sale of other published content, in print, and soon in electronic form.


Ross Coleman

A niche area of increasing interest is conference publication, as a form requiring rapid and open publication. The PKP Open Conference System (OCS) provides the publishing platform. We also provide an Open Journal System (OJS) platform. These are integrated into the repository services, as illustrated in Case study # 3 – repository and publishing – open and managed access (4.3, below). SUP is also interested in providing platforms for experimental types of publication, such as multi-media streaming. Toward this end SUP recently produced and published its first music CD, Wurrurrumi Kun-Borr. This CD was the joint winner of the 2007 Northern Territory Indigenous Music Awards, and the first in a series from the National Indigenous Recording research project. The new Sydney University Press was established to integrate expertise in handling digital content with a production facility providing a viable print-on-demand service and a secure eStore service for commercial sale. It provided the production capacity to meet the formal requirements of research publications.

2.5 Nature of a research publication

In Australia there is currently a formal set of requirements that defines (for funding purposes) a ‘research publication’. These requirements do proscribe certain modes of scholarly publication, but at the same time they provide a useful and defensible set of criteria that offers (for good or bad) some benchmarks for research publication. Not surprisingly, meeting these requirements is a fundamental element of any scholarly publishing model. Publications that meet this definition generate research points that are converted into federal research funds, so publication output is important for both individuals and institutions. The definition of a research publication is outlined in the Higher Education Research Data Collection (HERDC) 2008 specifications [8]:

“For the purposes of these specifications, research publications are books, book chapters, journal articles and/or conference publications which meet the definition of research, and are characterised by:
• substantial scholarly activity, as evidenced by discussion of the relevant literature, an awareness of the history and antecedents of work described, and provided in a format which allows a reader to trace sources of the work, including through citations and footnotes
• originality (i.e. not a compilation of existing works)
• veracity/validity through a peer validation process or by satisfying the commercial publisher processes
• increasing the stock of knowledge
• being in a form that enables dissemination of knowledge” [print or electronic]

Publishing within an eScholarship framework enables us both to comply with these specifications and to investigate other types or modes of publication that still meet the fundamental characteristics of ‘research publication’. One area of growing pressure is the need for the creation of authoritative data sets to be recognised as a valid research activity, and perhaps as a ‘publication’ for funding purposes. Appendix 1 illustrates how a data set may comply with the formal characteristics of a research publication.

3. Methodology - the business model

The historical change in scholarly publishing is facilitated by technologies which have enabled new business and strategic approaches. This is an operational shift in publishing from a retail-type single product (eg print run, journal volume etc) to a dynamic services framework. This is more than multi-channel distribution; it is an ongoing process that allows for re-use and re-mix, facilitated by archival formats that enable content to be used in different contexts and markets. This is illustrated in Case study # 4 – literary re-use and customisation – APRIL (4.4, below).

3.1 A dynamic hybrid

The development of an operationally ‘financially sustainable’ model (that is, one that generates income) was fundamental to the medium term planning of Sydney University Press. The model needed to meet appropriate scholarly and market needs, provide commercial income generating services, and have the capability to generate research points. It also needed to work in the new information environment that facilitates the benefits of open access, and the challenges and opportunities of packaging and re-use of eResearch as publication. This business model is based on a hybrid operational and philosophical approach to scholarly publishing, and a broad recognition of the various elements of value in a business model. The hybrid approach is demonstrated in the capacity to deliver both digital and print (including on demand) content as appropriate, and in the capacity to mix both open-access and paid delivery of publications as appropriate. This dynamic hybrid model enables response to different demands, requirements and markets. Publication outputs will take the forms appropriate for the work, the readership and the market. There is a continuing market demand for printed works which is serviceable in a digital print environment. Electronic delivery is currently via free downloadable PDFs because of functional constraints with the eStore. A current eStore upgrade will provide the capacity to extend services to sell e-versions in whole or in part. As a delivery mechanism, print on demand (PoD) from stored files ensures that theoretically a work is never out of print. In a business sense this provides the long tail for publishing, where production is most cost-effective, and where a long list with low inventory and turnover contributes towards a viable business proposition. This is part of the business strategy behind Amazon BookSurge, and is also – on a more modest scale – a business strategy behind SUP: a business strategy enabled through the text archiving processes of the digital library. It is a point where the digital library crosses into business.

This dynamic hybrid model does provide a flexible approach. Importantly, it allows us to alter and adapt the mix of delivery modes as technology, demands and markets shift. In a context of continual change this flexibility is critical. Another advantage of a business approach is that it - ironically perhaps - provides a particular credibility with authors, partners, the media and the trade. When SUP was first conceptualised we envisaged that most sales would be via the web site, direct to customers. However, currently about half of sales are into the trade, to both retail bookshops and library suppliers. This did require a review of how pricing was structured to include a margin for trade discount, and some careful thought about price points. SUP is not exclusively a Sydney University publisher; to be so would only be self-serving, and ultimately self-defeating for a scholarly publisher. In 2007 only about half of new titles were associated with University of Sydney staff. Our first goal in terms of business viability is operational self-sufficiency. That is, from direct and indirect income we cover all production costs including editorial, copy-edit, indexing where needed, design, layout, proof copies and final copies for legal deposit, review, authors etc, and some internal staff costs. This goal has largely been achieved. Core staffing (business manager) is currently provided by the Library.


3.2 Types of value

We take a broad view of business planning and strategies, and recognise that the values in this model are more complex than forms of income alone; we must consider other real value and benefits that also accrue to individuals and institutions. The value elements of this model can be expressed in a simple matrix:

    Direct income (sales etc)                  Tangible benefits (metrics etc)
    Indirect income (subsidies, points etc)    Intangible benefits (authority, brand etc)

The nature of these value elements:

• Direct income - from SUP publishing sales and diversified income from print-on-demand services. Income accrued via eStore sales is split between SUP and the Library (for infrastructure).
• Indirect income - to SUP in the form of subsidies to assist with publication, common in scholarly publishing. While preferring a level of subsidy, SUP has taken the whole commercial risk with several titles, recouping through sales or royalty sacrifice. Another, more substantial, form of indirect income - though not necessarily to SUP - is the accrual to individuals and universities of research publication points funding from the government (2.5, above). This underwrites some subsidies.
• Tangible benefits - to individuals these include higher metrics and profiles for citations and downloads due to open access; to institutions, internal efficiencies from utilising services such as PoD, and the potential rationalisation of diverse publishing operations.
• Intangible benefits - relate to prestige and recognition through an active scholarly press, and increased individual and institutional research publication productivity.

3.3 Practicalities – legals, marketing and risk

Fundamental to the business process is the contractual basis under which publication is facilitated. All SUP contractual templates comply with University legal requirements, and have been developed with external intellectual property legal advice. Under all these contracts authors retain their copyright; SUP only licenses the work for publication. This enables authors to deposit their content in other repositories, consistent with an open access orientation, as described in Case study 4.3. Marketing remains a major issue, as SUP does not operate in the traditional trade environment of high advertising, high inventory and distribution. Marketing is to the niche, with a tailored marketing plan for each title and little general advertising. SUP uses targeted media releases and media networks. Publication details are added to all the book-trade lists, and SUP has a small number of preferred independent book retailers. Like many publishers, SUP is negotiating to join Google Books, and has signed with Amazon BookSurge for delivery into the North American markets. Marketing is still an area requiring more effort and lateral thought. Issues of risk have been considered from several perspectives. Legal risks and exposure across all the activities of Sydney eScholarship have been canvassed at length with the University Office of General Counsel. The outcomes of these discussions often take the form of approved templates for contracts,



deeds, agreements, memos of understanding etc with project partners, for repository contributors, for data hosted on our servers, and for authors. These discussions have sometimes involved the need to resolve differing views about exposure through open access and assumed loss of intellectual property rights. SUP does need to comply with university legal requirements (including copyright), and, despite some frustrations, we often arrive at a level of common agreement that enables services to operate largely as we envisage. It is very important that we liaise closely with legal counsel, and continue to have a good and open relationship with them. The other level of risk is around the publishing operation itself. At the time of establishment SUP was subject to a risk assessment, in terms of production services, internal relationships, external partnerships, and initial support. We have been cognisant of these risks, and at all times have contingencies and alternatives planned for technical, production and business disruption. However, it has been accepted that developing new and innovative services does require the University and the Library to accept a degree of risk. This has been minimised as much as possible in legal terms. The benefits of these services, and the value they add in terms of improved access and communication of university research in the new information environments, have been embraced over any risks that may emerge.

4. Results – the case studies

While, for SUP, print services and sales continue to be a key part of business operations, integration within the eScholarship framework is fundamental to success. The working relationships through these kinds of integration, and the benefits in terms of productive scholarly outcomes, are described in the case studies below.

4.1 Case study # 1 – digital library to re-print publisher – Classic Australian Works

As already noted, SETIS had built a reputation for creating and managing literary full text in the scholarly environment. As part of a digital library collection these texts were maintained in archival form (XML) for rendering into a range of delivery modes. In 2003 the Library was approached by the Copyright Agency Ltd (CAL), the national agency overseeing copyright enforcement, to discuss a project to bring back to the market, in a cost-effective way, works of literature that were out of print but still in copyright. CAL had initially approached the National Library of Australia (NLA) to partner the project, but the NLA referred CAL to Sydney because of our reputation in text creation and archiving (we had partnered with the NLA in other digital projects). The project proposed by CAL was for them to contribute to the establishment of a print-on-demand publishing operation, and to clear the reproduction rights of the books to be re-printed. In return we would convert and archive the works, establish a publishing operation, a secure web-based eStore for commercial sale, and a production service to facilitate print-on-demand. So the revival of Sydney University Press was set. It was re-established as a light infrastructure integrated on top of existing university services: the SETIS digital text expertise; the digital print capacity of the University Printing Service; and the secure e-commerce transaction service of central IT. The Classic Australian Works series was established as a partnership between SUP and CAL, and twenty-five major ‘classics’ from twenty authors were selected as the initial tranche. After the series launch we were faced with the challenges of managing a commercial operation, marketing, and a whole host of related challenges. The Classic Australian Works series continues with new reproductions of works and a new editorial presence.
The infrastructure developed through this initiative provided the initial production and business foundations


for the new Sydney University Press. It is important to appreciate that SUP was re-established through an actual business opportunity and demand, not as the result of an administrative decision. This has set the tone and direction of SUP as a viable business operation.

4.2 Case study # 2 – data to publication – from surf to city

While a data set itself is not recognised as a form of research publication [though it is possible to peg the characteristics of a research publication against a data set, appendix 1], it is possible for some forms of data set to be converted into research and commercial publication. Publication of some data does work well within the context of re-use and re-mix. SUP has published works derived from research data sets. Associated with this is action to ensure the technical sustainability of the data, as described by Bradley (2.2, above). The major example is the “Beaches of the Australian Coast” series, currently published in seven volumes representing states or regions of Australia. This series was derived from a substantial scientific database detailing every one of the more than 10,600 beaches around Australia. The database covered over 30 elements for each beach, including geomorphology, tidal and surf data, safety and recreational data, as well as images of all beaches. While used primarily as a marine database, the possibilities for publication were obvious: in the form in which they are now published, but also the potential for re-use or re-mix around particular themes (fishing beaches etc). The database itself (a myriad of Excel files with little backup) is being re-built into an XML-based database which will be archived by the Library for current and future research. The database provides benchmark data on beaches important for climate change studies. Another example of publication from a data set (also being archived) concerns urban planning legislation and practice in Australia.
Parts of this database were rendered to publication as Australian Urban Land Use Planning: Introducing Statutory Planning Practice in New South Wales. This work is used both as a text and as a reference work (and database) by planners.

4.3 Case study # 3 – repository and publishing – open and managed access

The Sydney eScholarship repository – a DSpace installation – provides a secure open access repository service. It provides the storage foundation of the Sydney Digital Library, and the SUP archive. While SUP does operate as a commercial publisher, we are committed and oriented to open access wherever possible. The publishing contract templates we use, which comply with university general counsel requirements, permit authors to retain their rights and allow deposit of content in other repositories. All conference papers and chapters in edited works (unless specifically blocked by the author) are openly accessible via the repository, and are regularly harvested by services such as Google Scholar. The full work or conference is still available as a completed print work, and remains so as a print-on-demand file, with a link between the repository and the SUP eStore. There is demand for both the print volume and open access at the paper level. Print-on-demand (PoD) satisfies the low print demand in a cost-effective way. In the publication process we do need to ensure that the editorial processes meet the formal requirements of peer review so that individual authors receive due research publication recognition. This recognition is provided irrespective of whether the work is distributed in print or electronic form, as long as the criteria of research publication are met. A repository service fits neatly into a publishing operation; indeed it is a fundamental part of the


operational model, ensuring that content remains continually available over the long term.

4.4 Case study # 4 – literary re-use and customisation – APRIL

A project funded by the Australian Research Council (ARC), with industry linkage with CAL (Copyright Agency), is the Australian Poetry Resources Internet Library (APRIL) [http://april.edu.au/]. This has been funded as a research project to study the reception and readership of Australian poetry. The project involves the digitisation of the complete works of over 300 Australian poets in the first phase (with associated video and audio of interviews, readings etc). About half of these works are still in copyright, and permissions will be cleared by CAL. This project is one of several supported by CAL from its Cultural Fund to encourage the study and appreciation of Australian poetry and plays. The text of the poems will be double-keyed and also represented as images. The text will be archived in SETIS as TEI-tagged XML. The content will be rendered via Cocoon from within an XTF framework. It will be RDF (Resource Description Framework) capable for semantic web environments. Several publication options are also being investigated as part of the project. These look to the repackaging and delivery of anthologies in different contexts, including education and general readership. The processes of producing and selling client-customised anthologies of poems by print on demand through SUP are being investigated as part of the project. This is a design and production challenge for publication. At the rights and business level, the use of DOIs (Digital Object Identifiers) will articulate and record the transactions for each poem and poet. This work is in the first year of a three-year project, but again illustrates the benefits of integrated digital library and publication services in enabling new and experimental modes of publication.

5. Conclusion

The four case studies illustrate the kinds of benefits and synergies that are enabled by integrating repository, digital library, and data services with an innovative publishing service. The capacity of that service to deliver diverse content required the dynamic hybrid operational and business model described in this paper. As hypothesised at the start of this paper, there are no single publishing solutions. These case studies illustrate the need to deliver in different circumstances - to provide appropriate publishing solutions in different contexts. Managing a commercial publishing operation does raise questions about whether this is a proper role for a library. At the University of Sydney Library it is regarded as an appropriate role, an extension of the traditional roles of preservation and access. A publishing service is regarded as integral to the Library providing leadership in addressing the challenges of communicating research and scholarship in the new contexts of networked information services. The services integrated through Sydney eScholarship provide the fundamental components to facilitate and support innovative digital projects within the semantic web, and provide stable archival platforms for research. The association with a publishing enterprise that operates in both commercial and open environments provides a service that is attractive to researchers requiring recognised ‘research publications’ and the benefits of secure archiving and open access. Within this integrated eScholarship framework each service adds value to the other – each benefits the other. This is the kind of framework that can contribute to sustaining the new models of scholarly publishing.


6. Notes and References

[1] Accessed from the CDL site in October 2005.
[2] For example, in Australia the eScholarship Research Centre at the University of Melbourne is a research and data archive group within an ICT unit, with no publishing agenda. www.esrc.unimelb.edu.au/
[3] Australian Partnership for Sustainable Repositories (APSR) - http://www.apsr.edu.au/. The APSR Project aims to establish a centre of excellence for the management of scholarly assets in digital format. The project has four interlinked programs: Digital Continuity and Sustainability - a centre of excellence to share software tools, expertise and planning strategies; International Linkages Program - participate in international standards and maintain a technology watching brief; National Services Program - support national teaching and research with technical advisory services, knowledge transfer, and consultation and collaboration services; Practices & Testbed - build expertise in sustainable digital resource management through partner relationships. Partner roles: The Australian National University - develop and populate a broad-spectrum repository; The University of Sydney - sustainability of resources in a complex distributed environment; The University of Queensland - develop an integrated gateway to a range of repositories of research output.
[4] Bradley, Kevin. Sustainability Issues Discussion Paper. APSR (2005). dspace.anu.edu.au/handle/1885/46445, accessed May 2008.
[5] http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf
[6] It is an interesting sidelight that the role of librarians in taking leadership in these areas has generated some derisive comments from higher education administrators about ‘agenda stealing’. This has only been conversational but quite explicit - an interesting tangent that may be worth further consideration.
[7] http://minister.innovation.gov.au/SenatortheHonKimCarr/Pages/NEWERAFORRESEARCHQUALITY.aspx, accessed 4 May 2008.
[8] http://www.dest.gov.au/sectors/research_sector/online_forms_services/higher_education_research_data_collection.htm#2008_Specifications, retrieved 6 May 2008.


Appendix 1 – data as research publication

The HERDC definition of a research publication can be adapted outside of those traditional - mostly textual - publication forms or categories (be they in print or electronic form). The table below identifies - at first cut - a range of requirements that could meet publication criteria and be applied to the development of datasets as a recognised research activity and as a new form of publication output. Each HERDC research publication criterion is matched with corresponding data set publication requirements:

1. Substantial scholarly activity, as evidenced by discussion of the relevant literature, an awareness of the history and antecedents of work described, and provided in a format which allows a reader to trace sources of the work:
• credibility of the researchers
• authority of platform/organisation (aka publisher)
• significance of the subject matter
• conceptualisation of data collection
• meeting data and metadata (descriptive, technical, provenance, etc) standard requirements
• relationship/linkage to other datasets
• persistent citability

2. Originality (i.e. not a compilation of existing works):
• unique data collection
• unique primary data

3. Veracity/validity through a peer validation process or by satisfying the commercial publisher processes:
• use of recognised data and metadata standards
• peer review process for data inclusion
• credible/authoritative review panel

4. Increasing the stock of knowledge:
• replicated data necessary for testing or verification
• usability/functionality for research community

5. Being in a form that enables dissemination of knowledge:
• persistence of citation
• being an identifiable set of data for citation purposes
• IP licence model
• OAIS compliance for harvesting (OAI-PMH)

These requirements may raise many practical questions, and many researchers could add other discipline-specific standards and requirements. However, the table does indicate that it is possible to develop an acceptable set of requirements that would provide defendable criteria for recognition as a research publication.

Source: Coleman, Ross. Field, file, data, conference: towards new modes of scholarly publication. In Sustainable Data from Digital Fieldwork: Proceedings of the conference held at the University of Sydney, 4-6 December 2006. http://hdl.handle.net/2123/1300


Global Annual Volume of Peer Reviewed Scholarly Articles and the Share Available Via Different Open Access Options

Bo-Christer Björk (1); Annikki Roos (1,2); Mari Lauri (1)

(1) Information Systems Science, Department of Management and Organization, Swedish School of Economics and Business Administration, Arkadiankatu 22, 00100 Helsinki, Finland; e-mail: bo-christer.bjork@hanken.fi; Annikki.roos@ktl.fi
(2) National Public Health Institute, Mannerheimintie 166, 00300 Helsinki, Finland

Abstract
A key parameter in any discussion of the academic peer reviewed journal system is the number of articles published annually. Several diverging estimates of this parameter have been proposed in the past, and these have also influenced calculations of the average production price per article, the total costs of the journal system, and the prevalence of Open Access publishing. With journals and articles increasingly present on the web and indexed in a number of databases, it has now become possible to estimate the number of articles quite accurately. We used the databases of ISI and Ulrich’s as our primary sources and estimate that the total number of articles published in 2006 by 23 750 journals was approximately 1 350 000. Using this number as the denominator, it was also possible to estimate the number of articles which are openly available on the web in primary OA journals (gold OA). This share turned out to be 4.6% for the year 2006. In addition, at least a further 3.5% was available after an embargo period of usually one year, bringing the total share of gold OA to 8.1%. Using a random sample of articles, we also tried to estimate the proportion of published articles available as copies deposited in e-print repositories or on homepages (green OA). Based on the article title, a web search engine was used to search for a freely downloadable full-text version. For 11.3% a usable copy was found. Combining these two figures, we estimate that 19.4% of the total yearly output can be accessed freely.

Keywords: scholarly publishing; scientific articles; article output; open access

1. Introduction

“Open Access” means access to the full text of a scientific publication on the internet, with no other limitations than possibly a requirement to register, for statistical or other purposes. This implicitly means that Open Access (OA) material is easily indexed by general purpose search engines. There are several widely quoted definitions on the net, for instance the Budapest Open Access Initiative [1]. For the scholarly journal literature in particular, OA can be achieved using two complementary strategies: gold OA means journals that are open access from the start, whereas green OA means that authors post copies of their manuscripts to OA sites on the web [2]. Since there are numerous different types of stakeholders involved in the scientific publishing value chain [3], such as publishers, libraries and authors, with sometimes conflicting interests, a lot of what is written about OA is strongly biased, either towards promoting open access or towards describing its dangers to the scholarly publishing system. There has also been a discussion among OA advocates about which of the two strategies (gold or green) is better. There is thus an urgent need for reliable figures concerning the yearly volumes of journal publishing, and the share of the yearly volume which is available



as open access via different channels. In most of the earlier discussions about the economy of journal publishing, the focus has been on the number of journals, and costs such as the subscription cost have mainly been related to the individual title [i.e. 4]. This was natural given the easy availability of subscription information for individual titles, and the handling of paper copies in libraries all over the world. We argue that since the advent of digital delivery of content and the electronic licensing of vast holdings of journal content (“the big deal”), the focus should be more on the individual article as the basic unit of the journal system, and any average costs should be related to the article. We also think that the ratio of open access articles to the overall number of articles published is a much more important indicator of the growing importance of OA than the number of OA titles compared to the number of titles in general.

2. Total number of articles published

A central hypothesis in this calculation was that the journals indexed by Thomson Scientific’s (ISI) three citation databases (SCI, SSCI and AHCI) on average publish far more articles per volume than the often more recently established journals not covered by the ISI, and that this should explicitly be taken into account in the estimation method. We proceeded as follows. To estimate the total number of scholarly peer reviewed titles we used Ulrich’s periodicals directory and conducted a search with the following parameters: Academic/Scholarly, Refereed and Active. In the winter of 2007 this yielded a total of 23 750 journals. For the journals indexed by the ISI it was possible to extract the total number of articles published in the last completed year (2006) by conducting a search in the Web of Science (WoS). A general search was done covering all three indexes (Science Citation Index Expanded, Social Sciences Citation Index and the Arts and Humanities Citation Index). The parameters were set as follows: Publication Year = 2006, Language = All languages, Document type = Article. Since the system limits the number of items shown to 100 000, it was not possible to obtain the total number of indexed articles directly. The problem was solved by systematically going through the alphabet, setting the Source Title to A*, B*, C* and so on. This worked well for all letters for which the total number was less than 100 000, that is, all except A and J. For the letter A a more detailed search on AA*, AB* etc. was enough; for J we had to go down to the level of Journal of A*, Journal of B* etc. The total number of articles we arrived at in this way was 966 384. ISI as a rule only indexes peer reviewed journals, but with at least one notable exception, the “lecture notes in…” series published by Springer, which publishes conference proceedings in computer science and mathematics in book form.
By doing a search using the above as Source Title we got the number of articles published in this series, which was 20 484. Subtracting this from the total gives a final number of ISI articles of 945 900. If we knew the exact number of titles that the ISI tracked in the WoS in 2006, we could easily derive the average number of articles published annually per title. Since we did not have access to exact figures from ISI we had to estimate this figure indirectly. One indication is given by the number of journals included in the Journal Citation Reports. When searched from Ulrich’s with Journal Citation Reports (JCR) as an additional search criterion, the result is 6 877 titles. For one reason or another, the search directly from JCR for 2006 gives more journals: 6 166 titles indexed in SCI and 1 768 in SSCI. AHCI journals are not included in the Journal Citation Reports. We can, however, estimate the number of titles by assuming that AHCI journals on average publish as many articles per year as SSCI journals (53.1), which would result in an additional 532 titles. Summing these up, we would get 8 466 titles. Using these
numbers as a base, we are able to estimate the average number of articles published in journals indexed in the WoS by ISI as 111.7 per title. This can be compared, for instance, to the figure of 123 articles per year for 6 771 US publishers reported by Tenopir and King [5]. The number of titles indexed in the WoS is probably slightly higher than our estimate, for a couple of reasons. The main reason is a time lag between inclusion in the indexes and the first Journal Citation Report produced for a specific journal. According to ISI [6] the number of titles indexed in the citation databases at the end of the year 2007 was 9 190 journals. In the beginning of 2008, according to ISI’s web pages, the number of journals had risen to 9 300. Assuming that the number of journals indexed rises steadily every year, this would indicate that the 2006 number was somewhere between our estimate and these figures. However, we have chosen to use our earlier mentioned estimate (8 466) because the number of titles does not influence the number of ISI articles, which we have obtained separately. It does affect our estimate of the number of non-ISI journals, since these are obtained by subtraction (see below). Since we have estimated these to publish a much lower number of articles per year, the effect of a possible error of 1 % in our number of ISI titles would be only around 0.2 % in the total number of articles.

Taking as a starting point the total number of titles, 23 750, and the number of titles indexed by the ISI, 8 466, we arrive by subtraction at 15 284 titles not indexed by the ISI. In order to arrive at a total number of articles we now need to estimate how many articles these journals publish on average per year. This was done using a statistical sample of journals. The basis was Ulrich’s database, from which a statistical sample of 250 journals was taken. We set the search so that we only chose journals that have an online presence. This might result in a slight statistical bias, but it was the only practical way we could study the publication volumes of the journals in the sample. We then extracted the number of articles published in 2006 until we had data for 104 journals (journals in the original sample which were indexed by the ISI, or for which the number of articles could not be found, were discarded). In this group the average number of articles published was 26.2, which, as we had suspected, was considerably lower than for ISI indexed journals. Five of the journals had published no articles and the journal with the highest output had published 225. Multiplying 26.2 by 15 284 results in an estimate of 400 440 articles published in 2006 in non-ISI journals. Adding the figures for ISI brings the estimate of the total number of peer reviewed articles to 1 346 000 (rounded off), with 70 % covered by the ISI. In its 2004 answer to a UK House of Commons committee, Elsevier estimated that some 2 000 publishers in STM (Science, Technology and Medicine) publish 1.2 million peer reviewed articles annually [7]. Taking into account publishing in the social sciences and the humanities, our estimate seems to be well in line with these figures.
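As a consistency check, the estimation chain in this section can be reproduced in a few lines of Python. All input figures are those reported in the text; the function and variable names are ours.

```python
# Reproduces the Section 2 estimate; inputs are the figures quoted above.

def estimate_totals(wos_raw=966_384, lecture_notes=20_484,
                    ulrichs_titles=23_750, isi_titles=8_466,
                    non_isi_mean=26.2):
    isi_articles = wos_raw - lecture_notes        # 945 900 ISI articles
    per_isi_title = isi_articles / isi_titles     # ~111.7 articles per title
    non_isi_titles = ulrichs_titles - isi_titles  # 15 284 non-ISI titles
    total = isi_articles + non_isi_mean * non_isi_titles
    return isi_articles, per_isi_title, non_isi_titles, total

isi_articles, per_title, non_isi_titles, total = estimate_totals()
print(round(per_title, 1), non_isi_titles, round(total))  # 111.7 15284 1346341

# The paper rounds the total off to 1 346 000. A 1 % error in the ISI
# title count leaves isi_articles untouched and only shifts the
# ISI/non-ISI split, so the total moves by roughly 0.2 %:
*_, total_shifted = estimate_totals(isi_titles=round(8_466 * 1.01))
print(round(abs(total_shifted - total) / total * 100, 2))  # ~0.17
```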

3. Share of OA publishing

In policy discussions concerning Open Access publishing a very important question is “what share of all scientific articles is available openly”. For a given year, in our case 2006, this concerns both articles directly published as open access (the so-called gold route in OA jargon), and articles published in subscription based journals, but where the author has deposited a copy in a subject-based or institutional repository (green route). It is easier to estimate the number of gold route articles. For the case of copies in repositories, the evidence is much more scattered, and there is the additional difficulty of checking the nature of the copies (copy of manuscript submitted, personal copy of approved manuscript or replica of published article).




3.1. Gold

To estimate the number of articles directly available as OA in 2006, the Directory of Open Access Journals (DOAJ) would at first sight seem the natural entry point. At the time of checking, the directory listed 2 961 journals. Using the directory it is easy to go directly to the web pages of a journal and manually count the number of articles published. One problem, however, is that DOAJ states as its inclusion criterion that journals are quality controlled by peer review or editorial quality control. When we searched Ulrich’s for our earlier analysis, we only included journals which had self-reported as refereed (23 750 titles). If we relaxed that criterion and only required a journal to be active and scholarly/academic, a search in Ulrich’s yielded 60 911 titles. With the additional criterion of open access, the corresponding figures were 1 735 refereed titles and 2 690 scholarly/academic titles in all. The latter figure is, as could be expected, quite close to the DOAJ total. For these reasons we decided to use Ulrich’s as the entry point, concentrating on the 1 735 journals listed as refereed and open access. In doing the actual counting we tried, as far as we could based on the tables of contents on the web, to include only research articles, excluding editorials etc. This is in line with our earlier use of ISI, where we concentrated on the article category only. There are a handful of major OA publishers, namely Public Library of Science (PLoS), BioMed Central, Hindawi and Internet Scientific Publications (ISP), which use article charges or other means to fund their operations. We counted their articles separately, since they have some high-volume journals. All 7 PLoS journals are listed in Ulrich’s as peer reviewed. Of the 176 BioMed Central journals listed in DOAJ, 172 are also listed in Ulrich’s as scholarly and 139 as refereed.
For OA journals from other publishers, often published on university web sites in an open source mode of operation with neither publication charges nor subscriptions, we again used a sampling technique. The starting point was the figure from Ulrich’s of 1 735 OA titles in total, from which we subtracted the number of titles operated by the four publishers listed above, resulting in 1 487 titles. A selection of 100 journals was made from this set and the number of research articles was counted from the tables of contents on their web sites. This resulted in an estimated mean of 34.6 articles published per year. Table 1 shows our calculation of the number of OA titles and the number of articles published in 2006. We estimated the total number at 61 313, which represented 4.6 % of all articles published in 2006. Our figures can be compared to a number of earlier studies. Regazzi [8] used a similar sampling method to study the journals listed in DOAJ in 2003 and 2004 and found a drop in the estimated total number of articles from 25 380 to 24 516, indicating an overall share of about 2 % of STM articles. He notes that OA journals on average publish far fewer articles (30 on average) than established journals, and quotes an average of 103 for ISI-tracked STM journals and 160 for the 1 800 titles of Elsevier. In an earlier web survey of the editors of open access journals we ourselves obtained a considerably lower figure of 16 articles per year [9].
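The gold OA arithmetic can be sketched with the title and article counts quoted above and in Table 1 (the variable names are ours):

```python
# Gold OA estimate from Section 3.1; all inputs are figures from the text.

big_publishers = {          # (refereed titles in Ulrich's, articles in 2006)
    "PLoS": (7, 881),
    "BioMed Central": (139, 6_589),
    "Hindawi": (44, 1_643),
    "ISP": (58, 737),
}
ulrichs_oa_refereed = 1_735
other_titles = ulrichs_oa_refereed - sum(t for t, _ in big_publishers.values())
other_articles = round(34.6 * other_titles)   # sample mean of 34.6 articles/year
print(other_titles, other_articles)           # 1487 51450 (paper reports 51 465)

gold_total = sum(a for _, a in big_publishers.values()) + 51_465
share = gold_total / 1_346_000 * 100
print(gold_total, round(share, 1))            # 61315 4.6 (paper rounds to 61 313)
```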

                    Peer reviewed titles (Ulrich’s)   Articles 2006
PLoS                                7                        881
BioMed Central                    139                      6 589
Hindawi                            44                      1 643
ISP                                58                        737
Other OA journals               1 487                     51 465
SUM                             1 735                     61 313

Table 1. Number of OA titles and articles in 2006

In a white paper on open access publishing from Thomson Scientific [10], the owner of ISI, numbers are given for open access articles included in the Science Citation Index. The text indicates that the OA publishers were first identified from the ROMEO database on publisher OA policies, after which the articles
were counted. The number of OA articles in the SCI in 2003 was 22 095 out of a total of 747 060. Thus roughly 3.0 % of all articles in ISI’s Science Citation Index would have been open access in that year.

3.2. Delayed and hybrid OA

In addition to pure gold OA publishing there are two additional routes which could be worth studying: the open publishing of individual articles in otherwise closed journals against a separate fee (sometimes labelled open choice), and delayed open access publishing of whole journals. The important point is that in both these options the version accessed is the original publication on the publisher’s website; the only difference is that the access restrictions have been lifted, either for a single article or for articles published before a specific date. All of the biggest publishers (Springer, Taylor & Francis, Blackwell, Wiley and Elsevier) provide the option of freeing individual articles against a fee for a wide spectrum of journals [see 11]. Typically this opportunity is offered for only a subset of the journals in a publisher’s collection. Oxford University Press was among the first hybrid providers, and Karger is an example of a publisher which offers “Author’s Choice” for all of its journals. There are no systematic studies on how commonly authors choose the open choice option, but so far uptake appears to be rather low. We chose not to do any calculations of our own, since this would be very labor-intensive due to the scattering of relatively few articles among a vast number of titles. Delayed open access is more common among society publishers than commercial publishers. A good example of an individual journal practicing delayed OA is Learned Publishing, whose articles become OA roughly one year after publication.
A lower bound for an estimate of the prevalence of delayed OA can be obtained via the web portal of HighWire Press, which currently hosts the e-versions of 1 080 journals from over 130 mostly non-commercial publishers. Only a small number of the journals (43) are fully open access from the start; for these the print version is subscription-based but the online version is free. Of the total of 4.6 million articles hosted, 1.8 million are freely available. A search in the database for articles posted during 2006 results in 219 224 hits. This figure may not exactly coincide with the number of articles formally published during that same year, and some caution is needed because some of the serials in HighWire Press should not be classified as fully refereed scholarly journals. Of the 1 080 HighWire journals, 277 (as of January 2008) offer direct or delayed open access. Table 2 lists the numbers in different delay categories as well as an estimate of the total number of articles. The latter was made assuming that the average number of articles for these journals is the same as for all the journals in the HighWire portal. Comparing this to the total number of articles published in 2006, the share of delayed OA can be estimated at a minimum of 3.5 %, bringing the sum of direct and delayed gold OA to 8.1 %.

Delay              No. of journals   % of all HW journals   Estimated number of articles
Direct OA                 43                  4.0                        8 700
1-6 months                27                  2.5                        5 481
7-12 months              190                 17.6                       38 567
Over 12 months            17                  1.6                        3 451
Delayed in total         234                 21.7                       47 499

Table 2. OA articles published electronically by HighWire Press.
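The extrapolation behind Table 2 can be sketched as follows; the uniform per-journal average is the assumption stated in the text, and the variable names are ours.

```python
# HighWire delayed-OA extrapolation (Section 3.2); inputs from the text.

per_journal = 219_224 / 1_080          # ~203 articles per HighWire journal

delayed = {"1-6 months": 27, "7-12 months": 190, "over 12 months": 17}
estimates = {k: round(n * per_journal) for k, n in delayed.items()}
print(estimates)  # {'1-6 months': 5481, '7-12 months': 38567, 'over 12 months': 3451}

delayed_total = round(sum(delayed.values()) * per_journal)
share = delayed_total / 1_346_000 * 100
print(delayed_total, round(share, 1))  # 47499 3.5  -> 4.6 % + 3.5 % = 8.1 % gold OA
```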



From the viewpoint of readers, hybrid (“open choice”) and delayed open access are less useful than full and immediate open access at the title level for “current awareness” reading, where academics track what is being published in a few essential journals, either by getting a paper copy or an e-mail table-of-contents message. This type of information activity is called “monitoring” in Ellis’s model of information-seeking behaviour [12]. Hybrid and delayed open access help more in cases where a reader tries to access a given article based on a citation (called “chaining” in Ellis’s model).

3.3. Parallel publishing of copies (green)

It is much more difficult to estimate the prevalence of green OA than of gold OA. Copies of articles published in refereed journals are scattered across hundreds of different repositories as well as even more numerous home pages of authors. There is also the issue of the actual existence of a digital copy on some server versus how easy it is to find using the most widely used web search engines. For the purposes of this article, we take the pragmatic view that unless you get a hit in Google (or Google Scholar) using the full title of an article, a copy “does not exist”. This is both because a copy which cannot be found this way is very difficult for a potential reader to find, and because the best systematic way of measuring the proportion of “green” articles is a systematic search on article titles using a popular search engine such as Google. An additional complication is that the full text copy found may differ quite substantially from the final published version. In the best of cases it is an exact copy of the published file (usually PDF), but it can also be a manuscript version from any stage of the submission process. The most useful version is often labelled “accepted for publication”; it sometimes also includes the changes resulting from the final copy-editing done by the publisher’s technical staff, and sometimes does not.
The layout and page numbering also usually differ from the final published version. Most publishers who allow the posting of a copy of an article in an e-print repository allow the posting of this so-called “personal version”. In addition, some researchers also upload earlier manuscript versions, often called preprints, but this is not common except in certain disciplines such as physics. In order to estimate the green route to open access we selected a random sample of all peer reviewed articles published in 2006. The entry point was again Ulrich’s, out of which we took a sample of both journals listed in the ISI Web of Science and journals not listed there. The sample was proportional, so that the number of articles from ISI journals corresponded roughly to the share of ISI in the total number of articles (it included 200 articles in ISI journals and 100 articles in non-ISI journals). A spreadsheet listing the title of the article, the first three authors and the name of the journal was created from the sample. A search was then conducted in Google, systematically using the title of the article and, secondarily, the authors’ names, on a computer which had Internet access but no access to our university intranet, which would automatically have allowed access to the journals we subscribe to. (We initially also tried Google Scholar but dropped it after a while, since we noticed that the search results turned out to be almost identical.) In order to keep the workload manageable and to follow the viewpoint of an average searcher, who does not want to spend too much time and energy, we only examined the first 10 hits, which is also what you usually see on the first screen. If we got a hit which was not on the journal’s own website and which included a freely accessible full text file that seemed to fulfil the criteria, a copy was downloaded and saved.
A final check was performed by comparing the obtained copy to the official published version, which we obtained separately via our own university website or the website of the publisher, in order to verify that the copy was close enough to the original article. Out of the 35 copies we studied we had subscription access to 32 and were able to do the comparison; for the remaining three we assumed the copies to be usable. Two of the copies studied turned out to differ significantly in content from the original and were therefore discarded.
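The hit-screening rule described above can be illustrated with a short sketch. This is our illustration, not the authors' actual tooling; the `hits` records and their field names are hypothetical stand-ins for one page of search results.

```python
# Illustrative screening of search hits for usable green OA copies.
# A hit qualifies if it is off the journal's own site and offers a
# full text that requires no subscription (all fields are hypothetical).

def screen_hits(hits, journal_domain, max_hits=10):
    """Return URLs of candidate OA copies among the first `max_hits` hits."""
    candidates = []
    for hit in hits[:max_hits]:
        if journal_domain in hit["url"]:
            continue  # the publisher's own site is gold, not green, access
        if hit["full_text"] and not hit["needs_subscription"]:
            candidates.append(hit["url"])
    return candidates

hits = [
    {"url": "http://journal.example.com/a1", "full_text": True, "needs_subscription": True},
    {"url": "http://repo.example.edu/12345", "full_text": True, "needs_subscription": False},
    {"url": "http://blog.example.org/post", "full_text": False, "needs_subscription": False},
]
print(screen_hits(hits, "journal.example.com"))  # ['http://repo.example.edu/12345']
```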


The results concerning copies in repositories were very similar for ISI-indexed journals (11 %) and the other journals (12 %), bringing the weighted average to 11.3 %. The spread between different formats and different types of repositories is shown in Table 3, but the absolute numbers per category are so small that it is difficult to generalize to the whole target population. Table 3 shows the percentage of green OA versions found, by type of copy and type of site:

Type of site               Exact copy   Personal version   Other version    All
Subject based repository      0.7             2.3               0.3          3.3
Institutional repository      4.7             3.0               0.0          5.0
Author’s home pages           1.7             1.3               0.0          3.0
All                           7.0             4.0               0.3         11.3

Table 3. The frequency of OA copies of different kinds.

We found no case of the same article both being published as gold OA on a publisher’s website and having a copy in a repository. The figures for green OA can thus be added to our earlier estimates for gold OA (8.1 %), giving a total OA availability of 19.4 %. We were of course also able to check the direct gold availability of the articles in the sample. For the articles in ISI journals the percentage was 15, but for non-ISI articles it was an astonishing 35 %, compared with our earlier overall figure of 8.1 %. There are two possible reasons for this. Firstly, in producing the sample we were in practice restricted to journals which at least have tables of contents freely available on the net. Our experience in producing the sample, in terms of how many candidate journals we had to disqualify because they lacked a web presence, indicated that for ISI-listed journals the availability of web tables of contents is nowadays rather high, whereas for non-ISI journals it is much lower. Unfortunately we did not keep exact records when we produced our sample, which could have helped in correcting the estimate to take this factor into account. Secondly, there might be a random element in this calculation, which could of course be reduced by increasing the sample size. All in all, we believe our earlier estimate of gold availability to be the more reliable one.

4. Conclusions and discussion

We have estimated in this study that the number of scientific articles published in 2006 was 1 346 000. Our hypothesis about the difference in the number of articles published per title between titles indexed by ISI and non-ISI titles proved correct: the non-ISI journals published on average 26.2 articles per title and the ISI journals 111.7. Of the yearly article output, 4.6 % appears in gold OA journals and at least a further 3.5 % becomes open after a delay period. In addition, 11.3 % of articles are openly available in repositories and, for example, on personal web pages. Altogether, the share of the yearly output that is openly available is 19.4 %.

The different elements in our calculation differ in accuracy. The total number of articles included in the indices of the ISI should be very accurate, provided that we have searched the database correctly. The total number of journals tracked by the ISI in a given year is also reasonably accurate. The total number of peer reviewed scholarly journals is much more difficult to estimate accurately. Ulrich’s database is the best tool available for this purpose, but its coverage is not 100 %, and some inactive journals are included in the category. On the other hand, if we order the total journal market by the number of yearly articles per title, we get a distribution with a few very high volume titles and many journals with few articles. It is very likely that journals which are not listed in Ulrich’s publish rather few articles per annum, and thus their contribution to the total volume of articles is rather marginal.
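The headline shares can be recombined as a quick consistency check. The percentages are the paper's own estimates; adding green to gold assumes no overlap between the two, which the sample check reported above supported.

```python
# Combined OA availability for 2006, from the shares estimated above.

gold_direct = 4.6                           # % in gold OA journals
gold_delayed = 3.5                          # % opened after a delay (lower bound)
green = round(0.70 * 11 + 0.30 * 12, 1)     # ISI (11 %) / non-ISI (12 %), ~70/30 weights

total_oa = round(gold_direct + gold_delayed + green, 1)
print(green, total_oa)   # 11.3 19.4
```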



It is also likely that Ulrich’s coverage of journals published in the Anglo-Saxon countries is more comprehensive than its coverage of journals published in non-English speaking countries, and in particular of journals in languages other than English. It is also impossible to draw a clear border line between journals practicing full peer review and journals where the editors check the content of the submissions. In this respect we simply have to trust the self-reporting of journals to Ulrich’s database. We have also excluded conference proceedings produced using a referee procedure, since it would be very difficult to find data about these. The one notable exception is the Springer Lecture Notes series, for which we had data but which we chose to exclude from our calculations. An interesting study of the growth of Open Access and of the effect of open versus closed access on the number of citations has been carried out by Hajjem, Harnad and Gingras [13]. They used a web robot to search for full texts corresponding to the citation metadata of 1.3 million articles indexed by the ISI over a 12-year period (1992-2003), focusing in particular on differences between disciplines in the degree of open availability and in the citation advantage provided by OA. Articles published in OA journals were excluded, and their results thus concern articles published in subscription-based journals where the author (or a third party) has deposited a copy on a web site which allows full text retrieval by web robots. According to the study the degree of green OA varied from 5 % to 16 % depending on the discipline, but from our viewpoint the most important figure was that, for the total of 1.3 million articles, OA full text copies could be found for 12 %. This included direct replicas, authors’ accepted manuscripts after review (“personal versions”) and submitted manuscripts (“preprints”), since it can be assumed that the robot could not distinguish between these if the title and author had remained unchanged.
All in all, we believe our estimates to be more accurate than those presented earlier in various contexts. We have described our method in detail, and the estimates can easily be replicated and/or adjusted by other researchers in later years.

5. Acknowledgements

This study was partly financed by the Academy of Finland, through the research grant for the OACS project (application no. 205993). We would also like to thank Piero Ciarcelluto for his assistance in the data gathering phase.

6. References

[1] Budapest Open Access Initiative. 2002. http://www.soros.org/openaccess/read.shtml
[2] Harnad S., Brody T., Vallières F., Carr L., Hitchcock S., Gingras Y., Oppenheim C., Stamerjohanns H. and Hilf E.R. 2004. The Access/Impact Problem and the Green and Gold Roads to Open Access. Serials Review, Vol. 30(4), pp. 310-314.
[3] Björk B-C. 2007. A model of scientific communication as a global distributed information system. Information Research, Vol. 12(2), paper 307. Available at: http://InformationR.net/ir/12-2/paper307.html
[4] European Commission. 2006. Study on the economic and technical evolution of the scientific publication markets in Europe. Brussels: European Commission, Directorate General for Research. Available at: http://ec.europa.eu/research/science-society/pdf/scientific-publication-study_en.pdf
[5] Tenopir C. and King D. 2000. Towards Electronic Journals: Realities for Scientists, Librarians and Publishers. Washington D.C.: Special Libraries Association.
[6] Horky D. 2008. E-mail from David Horky, Thomson Scientific, 17 January 2008.
[7] Elsevier. 2004. Responses to the questions posed by the Science and Technology Committee. Document submitted to the UK House of Commons Select Committee on Science and Technology by Elsevier on 12 February 2004. Available at: http://www.elsevier.com/authored_news/corporate/images/UK_STC_FINAL_SUBMISSION.pdf
[8] Regazzi J. 2004. The Shifting Sands of Open Access Publishing, a Publisher’s View. Serials Review, Vol. 30, pp. 275-280.
[9] Hedlund T., Gustafson T. and Björk B-C. 2004. The Open Access Scientific Journal: An Empirical Study. Learned Publishing, Vol. 17(3), pp. 199-209.
[10] McVeigh M.E. 2004. Open Access Journals in the ISI Citation Databases: Analysis of Impact Factors and Citation Patterns. A citation study from Thomson Scientific. Available at: http://scientific.thomson.com/media/presentrep/essayspdf/openaccesscitations2.pdf
[11] Morris S. 2007. Mapping the journal publishing landscape: how much do we know? Learned Publishing, Vol. 20(4), pp. 299-310.
[12] Ellis D. 2005. Ellis’s model of information-seeking behaviour. In Fisher K.E. et al. (eds.) Theories of Information Behavior. Medford: Information Today, pp. 138-142.
[13] Hajjem C., Harnad S. and Gingras Y. 2005. Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin, Vol. 28(4), pp. 39-47. Available at: http://eprints.ecs.soton.ac.uk/11688/




Characteristics Shared by the Scientific Electronic Journals of Latin America and the Caribbean

Saray Córdoba-González1; Rolando Coto-Solano2

1 Vicerrectoría de Investigación, Universidad de Costa Rica, Ciudad Universitaria Rodrigo Facio, San José, Costa Rica; e-mail: saraycg@gmail.com
2 Vicerrectoría de Investigación, Universidad de Costa Rica, Ciudad Universitaria Rodrigo Facio, San José, Costa Rica; e-mail: rolandocoto@gmail.com

Abstract

Our objective is to analyze the use that Latin American peer-reviewed journals make of the tools and opportunities provided by electronic publishing, particularly of those that would make them evolve into more than “mere photocopies” of their printed counterparts. In doing this, we also set out to discover whether there are any Latin American journals that use these technologies effectively, in a way comparable to the most innovative journals in existence. We extracted a sample of 125 journals from the electronic resources index of LATINDEX – the Regional System of Scientific Journals of Latin America, the Caribbean, Spain and Portugal – and compared them along five dimensions: (1) non-linearity, (2) use of multimedia, (3) linking to external resources (“multiple use”), (4) interactivity, and (5) use of metadata, search engines, and other added resources. We found that very few articles in these journals (14%) used non-linear links to navigate between different sections of the article. Almost no journals (3%) featured multimedia content. About one in every four articles (26%) published in the journals analyzed had their references or bibliographic items enriched by links connecting to the original documents quoted by the author. The most common form of interaction was user-to-journal, in the form of question forms (17% of journals) and new issue warnings (17% of journals). Some journals, however (5%), had user-to-user interaction, offering forums and responses to published articles by the readership. About 35% of the journals have metadata within their pages, and 50% offer search engines to their users. One of the most pressing problems for these journals is the incorrect use of rather simple technologies such as linking: 49% of the external resource links were mismarked in some way, with a full 24% being mismarked through spelling or layout mistakes.
Latin American journals still present a number of serious limitations in their use of electronic resources and techniques, with text being overwhelmingly linear and under-linked, e-mail to the editors being the main means of contact, and multimedia a scarce commodity. We selected a small sample of journals from other regions of the world, and found that they offer significantly more non-linearity (p = 0.005 < 0.1), interactive features (p = 0.005 < 0.1), use of multimedia (p = 0.04 < 0.1) and linking to external documents (p = 0.007 < 0.1). While these are the current characteristics of Latin American journals, a number of very notable exceptions speak volumes about the potential of these technologies to improve the quality of Latin American scholarly publishing.

Keywords: Electronic journals; scholarly journals; Latin America; serials quality criteria; LATINDEX

1. Introduction

Electronic journals in Latin America have been under-analyzed in terms of their architecture, particularly in how well they exploit the tools made available by Internet technologies, which provide new ways to create interaction with readers, non-linearity in the text, and multimedia content to illustrate and complement the articles. If online journals take advantage of these novel tools, they can become more


188

Saray Córdoba-González; Rolando Coto-Solano

than mere clones of their printed versions. This would give them an advantage that could potentially place them format-wise at par with the more innovative journals in the world, and can help in the debunking of some pervasive prejudices held by the entire scientific community towards electronic publishing [1]. We analyzed a sample of peer-reviewed journals from the Latin America and the Caribbean, and measured the adoption of those features in journals from across the region. We seek to diagnose the present situation in Latin America, but also to provide a basis for comparison with related journals in other longitudes of the world. Mayernik [2] published a study along this line, where he analyzed 11 psychology and 10 physics journals, but didn’t make any emphasis on their geographical origin. A number of virtual libraries, indexes and repositories have sprung forth in Latin America to support the work of the local journals, as well as to help them in better using their resources (particularly monetary resources) with proposals that can improve editorial quality, including introduction of open-source software in their work cycles and adoption of Open Access as a philosophy for the journals. All of these efforts are aimed to make this “hidden science” more accessible to the academic world in both the local and global spaces. The Regional System of Scientific Journals of Latin America, the Caribbean, Spain and Portugal – LATINDEX (www.latindex.org) – was created in 1997, and currently has a directory of more than 16000 journals. It also provides a criterion-reviewed “catalogue” with 2952 journals. These journals have been selected based on 36 evaluation criteria that describe basic editorial quality. Among these criteria, three aspects are aimed specifically at online journals: use of metadata, incorporation of a search engine for the content of the site, and inclusion of “added content”, such as lists of “links of interest”, discussion forums, etc. 
Of the journals in the directory, 2,490 have an electronic version. Electronic journals have been at the center of a long-running discussion in the editorial world. As early as 1997, Valuskas [3] defined electronic journals as “a digital periodical dedicated to publishing, on the Internet, articles, essays, and analyses that have been read and commented upon initially by a select group of editors and reviewers, to meet a certain arbitrary standard of excellence (as determined by the editors) for a given discipline addressed by the journal itself”. In this sense, electronic journals were perceived in terms of their availability on the web. However, this definition leaves out any mention of the potential exploitation of Internet tools. Valuskas [3] complements the explanation by saying that “The very electronic nature of the journal provides ample opportunities for experimentation with formats, layouts, fonts, and other design features, although many electronic journals fail to jump at some obvious opportunities to make a given issue more readable and appealing to the eye”. One year later, Hitchcock et al. [4] drew attention to the importance of links within the text, as well as to other Internet-related advantages that could improve access to the knowledge contained in scientific journals. Efforts have been made to clearly define the parameters upon which electronic journals can be evaluated. An important description deals with the relation the electronic journal might have with a potential printed counterpart. Kling & McKim [5] defined three possibilities for this relation: pure e-journals, which were born electronic; p-e journals, where the articles are first published on paper but electronic distribution is also available; and e-p journals, where the electronic format is the predominant version but limited quantities of paper versions are also produced.
Characteristics Shared by the Scientific Electronic Journals of Latin America and the Caribbean

While the authors are very clear in pointing out that not all Internet-based journals are rigorous in their peer-review processes, they establish that the model of publication does not determine the quality of the final product. Regardless of how they arrived on the web, we assume as a matter of course that electronic journals must be peer-reviewed, must conform to international editorial standards (such as the LATINDEX criteria), and that the majority of their content must consist of scientific articles. In keeping with this, for this study we only use peer-reviewed journals, and do not include any bulletins or science popularization magazines. A few studies have scratched the surface of electronic publishing practices in Latin America. Dias [6] studied a number of Brazilian journals for their use of hypertext and search engines as a satisfactory implementation of the inherent possibilities of the electronic medium. Marcondes et al. [7] wrote a descriptive study of Brazilian electronic journals, focusing on technical aspects such as the electronic text formats used, the availability of a site search engine, and whether or not the journal belonged to a portal, showing that metadata were little known by Brazilian editors, and that features such as interactivity, hypertext and multimedia were almost never used. They conclude that Brazilian journals resemble journals from other parts of the world, in that the issues are designed following the print-only model, delivering the Internet version as a virtual photocopy of the document, in want of “more professionalism from the editors”. Another significant case in Latin America was presented at ELPUB 2006 by Muñoz, Bustos and Muñoz [8], who studied the Chilean Electronic Journal of Biotechnology and described the journal’s innovative features in terms of usability of the website, speed and efficiency, use of metadata, adoption of the DOI system, and use of CrossRef as a citation linking system. Mayernik [2] wrote the most comprehensive study in this field. He used four specific dimensions to evaluate the journals: (1) non-linearity of the document; (2) external links to the documents quoted in the article (be it in the main body or the references), which he refers to as “multiple use”; (3) multimedia use in the articles or on the website; and (4) interactivity with the readers, in the form of forums or other two-way communications. (Mayernik also studied a fifth dimension, speedy publication, which we will not consider here.) These characteristics are deemed innovative in their use of the technical possibilities of the web, and a valuable addition to the overall experience of the reader/user.

However, as some authors have explained (Harnad [9], Tenopir & King [10]), these qualities are not fully exploited by the editors, and the journals’ full potential is not achieved. These four dimensions are anything but casual: Hitchcock et al. [4] called the emergence of hyperlinking “the second frontier”; Harnad studied the benefits of hyperlinks as early as 1992 [11]; and Lukesh [12] has explained how multimedia options “play a major role in the similar explosion we are undergoing today as they become tools in developing knowledge rather than simple illustrations”. A number of repositories and virtual collections (such as the Brazilian SciELO and the Mexican REDALyC, for example1), usually associated with universities, have played a significant role in pushing forward the digitization of scholarly journals in the region, particularly within the Open Access model. These solutions have emerged as a way to focus whatever resources become available and apply them to a number of journals at the same time, and have become valuable tools for providing visibility and web presence to the scientific production of Latin America, offering data on how this information is used and quoted within the local scientific community, and on how a field of knowledge evolves in the region. In this study, however, we will focus on the specific website of each of the journals, to see what solutions are being used by individual journals and their editorial boards. We will try to determine how “innovative” electronic journals in Latin America are, where innovation is understood as the exploitation and application of Internet resources, tools and programs that can improve the user→journal and user→user communication processes, and add competitive value to the journal. We understand that the web offers these possibilities, but that they are not fully taken advantage of by Latin American editors, and indeed by editors around the world (Harnad [9], Tenopir & King [10]).
Our first objective is to analyze the e-journals of Latin America, to (a) determine the degree to which they are copies of their printed counterparts, and (b) discover the specific Internet tools and resources that editors are using to improve their journals. Our second objective is to determine how these journals fare when compared to e-journals from other parts of the world that can be regarded as innovative and technically advanced.

2. Methodology

We selected peer-reviewed journals, with open access to their articles and at least 40% scientific content, that are published independently, not as part of a larger collection site (such as SciELO or REDALyC) that might have caused their content to artificially conform to different publishing standards. Based on the Mayernik [2] characteristics, we chose to study: (1) non-linearity, the ability to jump from one part of the article to another as the user wishes; (2) multimedia, the use of audio or video to enhance the user’s experience; (3) multiple use, the existence of links to the full text of the documents quoted or referred to in the article; and (4) interactivity, the existence of tools that can provide interaction between the editors, the authors, and the readers of the journals. Mayernik described a fifth characteristic, speedy publication, but we will not consider it among the objectives of our study. Additionally, we also analyzed the use of criteria 34, 35 and 36 of the LATINDEX e-journal criteria: use of metadata, use of search engines, and use of “added value services”, such as links to external resources, documents relevant to the readership, forms of interaction, etc.2 To evaluate the journals, we used the Electronic Resources Index of the LATINDEX website. We chose the 12 countries that have more than 10 journals in the directory: Argentina, Brazil, Chile, Colombia, Costa Rica, Cuba, Ecuador, Mexico, Peru, Puerto Rico, Uruguay and Venezuela. Within these groups, we randomly chose 10% of the journals. While the original sample included 167 journals, it also included bulletins, non-peer-reviewed journals and science education magazines. After accounting for this, we arrived at a final sample size of 125 journals (see Appendix 1), representative of the journals within LATINDEX’s electronic directory with an estimated sampling error of 7.5%. The journals had on average 8.95 ± 5.8 articles per issue, so we randomly chose one third of the articles (3 for every issue) to perform the analysis, arriving at a final sample of 375 articles.
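The stratified selection described above (10% of the journals of every country that has more than 10 journals in the directory) can be sketched in a few lines. This is an illustration only: the journal names and counts below are invented, and the paper's additional filtering of bulletins and magazines is omitted.

```python
import random

def sample_journals(directory, fraction=0.10, min_journals=10, seed=42):
    """Stratified sample: for each country with more than `min_journals`
    journals, randomly pick `fraction` of them (at least one).
    `directory` maps country -> list of journal names."""
    rng = random.Random(seed)
    sample = {}
    for country, journals in directory.items():
        if len(journals) <= min_journals:
            continue  # countries with 10 or fewer journals are excluded
        k = max(1, round(len(journals) * fraction))
        sample[country] = rng.sample(journals, k)
    return sample

# Hypothetical mini-directory for illustration only
directory = {
    "Brazil": [f"BRA-{i}" for i in range(230)],
    "Mexico": [f"MEX-{i}" for i in range(300)],
    "Ecuador": [f"ECU-{i}" for i in range(8)],  # excluded: too few journals
}
picked = sample_journals(directory)
print({c: len(js) for c, js in picked.items()})  # -> {'Brazil': 23, 'Mexico': 30}
```

Fixing the seed makes the draw reproducible, which matters when the sampled list has to be published as an appendix, as here.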
The multimedia, interaction and LATINDEX-34-35-36 criteria were evaluated at the journal level (using the 125 journals as the sample), and the non-linearity and multiple use criteria were evaluated at the article level (using the 375 articles as the sample). For the LATINDEX criteria, a journal scores one point if it meets all three criteria, 0.66 if it meets two, 0.33 if it meets one, and 0 if neither metadata, search engines nor added services are present. For the multimedia criterion, there are two individual features: (1) presence of video and (2) presence of audio. The presence of both features is interpreted as one point; the presence of only one feature, as 0.5 points; the absence of both awards the journal zero points. For interaction, the four criteria are: (1) presence of a contact form that a user can use to write to the editors; (2) presence of some means of communication between the reader and the author, or some expert in the field; (3) presence of some means of communication amongst the readers; and (4) use of alert features, such as e-mail subscriptions or RSS news feeds. A journal scores one point if it meets all criteria, 0.75 if it meets three, 0.5 if it meets only two, 0.25 if it meets just one, and 0 for no compliance with any criterion. In addition to the whole-corpus counts, we also analyzed the corpus of journals and articles along two variables: country of the journal, and subject (Social Sciences, Medical Sciences, Agricultural Sciences, Exact and Natural Sciences, Multidisciplinary, Arts and Humanities, and Engineering). Finally, we chose a small intentional sample of e-journals from other regions of the world that made extensive use of our studied characteristics, and compared them against a selection of journals chosen as “top in their class” by three LATINDEX officials3. We evaluated the non-Latin American journals using the same parameters as the Latin American ones, and proceeded to compare them.
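The three scoring rules above are simple fractions of criteria met. As a minimal sketch (not the authors' actual tooling), they can be expressed as:

```python
def latindex_score(metadata, search_engine, added_services):
    """Fraction of the three LATINDEX web criteria met:
    3/3 -> 1.0, 2/3 -> 0.66..., 1/3 -> 0.33..., 0/3 -> 0.0."""
    return sum([metadata, search_engine, added_services]) / 3

def multimedia_score(audio, video):
    """Both audio and video present -> 1.0; only one -> 0.5; none -> 0.0."""
    return (audio + video) / 2

def interactivity_score(contact_form, reader_author, reader_reader, alerts):
    """0.25 points per interactivity criterion met (4/4 -> 1.0)."""
    return sum([contact_form, reader_author, reader_reader, alerts]) / 4

print(latindex_score(True, True, False))              # 2/3, reported as 0.66
print(multimedia_score(audio=True, video=False))      # 0.5
print(interactivity_score(True, False, False, True))  # 0.5
```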
Our research design is exploratory, representative of a major collection of Latin American peer-reviewed journals. This paper being of an exploratory nature, we chose α = 0.1 for statistical comparisons. We used Microsoft Access 2003 for database keeping, Adobe Acrobat Reader 8 for PDF analysis, and Internet Explorer 6, Internet Explorer 7, Mozilla Firefox 2.0 and Opera 9.27 for Internet browsing within the Microsoft Windows XP Service Pack 2 operating system. We used the software JMP 7 for statistical analysis.



3. Results

3.1 General information about the corpus

We examined 125 journals from 12 different countries, using the most current issue as the point of reference to start the comparative analysis. The following table describes the sample’s age and use of computer formats:

Year of publication of the current issue:
  2008: 31 journals (25%)
  2007: 56 journals (45%)
  2006: 20 journals (16%)
  2005: 6 journals (5%)
  Prior to 2005: 12 journals (9%)

Format of the articles in the latest edition:
  PDF: 312 articles (83%)
  HTML: 102 articles (27%)
  Both PDF and HTML: 42 articles (11%)

Table 1. Age and format of the corpus

3.2 Non-linearity

We defined four possibilities for our intradocument links: (1) navigational links to jump between sections of the document; (2) links to footnotes or notes at the end of the document; (3) navigational and footnote links combined; and (4) links to reference or bibliography items. Mayernik uses our possibility 3 as his measure for non-linearity.

Articles that contained:                  | Articles    | Average links
Navigational links                        | 20 (5.3%)   | 11.2 ± 11.8
Footnote links                            | 33 (8.8%)   | 16.4 ± 16.7
Navigational and footnote links combined  | 51 (13.6%)  | 15.0 ± 15.2
Links to references or bibliography       | 14 (3.7%)   | 32.6 ± 28.9

Table 2. Non-linearity across the corpus

Given the large variability in the corpus, it is no surprise that the standard deviation is larger than the average in most categories. Only in the references category is the variation narrow enough to say that each article has at least 3 links to the bibliographic references. Table 3 describes the situation in more detail, breaking it down to the country level. We compared how the use of HTML and PDF formats affected the use of internal links, and found that there is indeed a significant difference: when the HTML format is used (alone or in combination with PDF), there is a significantly higher number of internal links within the document, both navigational and to the references.

3.3 Use of multimedia

We found that only 2.5% of the journals use audio resources, and only 3% of them post videos on their website. On average, the use of multimedia in the journals analyzed equaled 0.03 points. There are no significant differences in the use of multimedia either by country or by subject.

3.4 Multiple use

We subdivided the multiple use category into four different areas: (1) external links embedded in the body of the article; (2) external links in the reference or bibliography section that lead directly to the text of the reference item; (3) external links in the reference or bibliography section that lead directly or indirectly to the text of the reference item, where indirectly is understood as “three clicks away or less”; and (4) external links in the reference or bibliography section that lead directly or indirectly to the text, or at the very least to an abstract of the reference item. Mayernik uses our area (2) as his measure for multiple use.
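The "three clicks away or less" notion in area (3) is essentially a bounded breadth-first search over the pages reachable from a reference link. A minimal sketch, over a hypothetical pre-crawled link graph rather than live HTTP requests:

```python
from collections import deque

def clicks_to_fulltext(start_url, links, fulltext_pages, max_clicks=3):
    """Breadth-first search over a pre-crawled link graph.
    Returns the minimum number of clicks from `start_url` to any page
    holding the full text, or None if it is more than `max_clicks` away.
    `links` maps url -> list of outgoing urls."""
    if start_url in fulltext_pages:
        return 0
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_clicks:
            continue  # do not expand beyond the click budget
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if nxt in fulltext_pages:
                return depth + 1
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return None

# Toy graph: reference link -> abstract page -> PDF (all URLs hypothetical)
links = {
    "ref": ["abstract"],
    "abstract": ["pdf"],
}
print(clicks_to_fulltext("ref", links, fulltext_pages={"pdf"}))  # -> 2
```

Under this scheme a "direct" link returns 1, an "indirect" one returns 2 or 3, and None means the reference fails the three-click test.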

Country | Navigational links      | Footnote links          | Navigational and footnote combined
Total   | 20 (5.3%), 11.2 ± 11.8  | 33 (8.8%), 16.4 ± 16.7  | 51 (13.6%), 15.0 ± 15.2
ARG     | 0                       | 5 (10%), 16.8 ± 15.7    | 5 (10%), 16.8 ± 15.7
BRA     | 0                       | 11 (16%), 16.5 ± 21.3   | 11 (16%), 16.5 ± 21.3
CHL     | 6 (17%), 18.5 ± 13.2    | 5 (14%), 13.8 ± 12.3    | 11 (31%), 16.4 ± 12.4
COL     | 3 (9%), 3.7 ± 2.1       | 0                       | 3 (9%), 3.7 ± 2.1
CRI     | 0                       | 3 (17%), 24.7 ± 21.1    | 3 (17%), 24.7 ± 21.1
CUB     | 5 (28%), 6.0 ± 4.8      | 0                       | 5 (28%), 6.0 ± 4.8
ECU     | 0                       | 0                       | 0
MEX     | 2 (2%), 5.5 ± 6.4       | 6 (7%), 13.8 ± 13.4     | 6 (7%), 12.0 ± 14.4
PER     | 0                       | 0                       | 0
PRI     | 0                       | 0                       | 0
URY     | 2 (17%), 8.5 ± 2.1      | –                       | 5 (42%), 21.2 ± 17.3
VEN     | 0                       | 0                       | 0
(navigational: p = 0.01 < 0.1; footnote: p = 0.04 < 0.1; combined: p = 0.01 < 0.1)

Table 3. Non-linearity by country. (Each cell shows “number of articles (percentage of articles in the corpus)”, followed by “average number of links ± standard deviation”.)

Format            | Footnote and navigational combined (avg links) | Links to references or bibliography (avg links)
Total             | 11.2 ± 11.8      | 16.4 ± 16.7
Both HTML and PDF | 6.0 ± 1.1 (A)    | 5.1 ± 1.2 (A)
Only HTML         | 5.3 ± 0.9 (A)    | 2.5 ± 1.0 (A)
Only PDF          | 0.7 ± 0.4 (B)    | 0.3 ± 0.5 (B)
(combined: p < 0.0001 < 0.1; references: p = 0.002 < 0.1; letters denote signification groups)

Table 4. Formats used by the journals and their level of significance
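Counting internal versus external links in an HTML edition, as underlies the format comparison in Table 4, is straightforward with a standard parser. A minimal sketch using Python's standard library (the article fragment is invented):

```python
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Counts internal (same-page #anchor) vs external hyperlinks
    in an HTML article, mirroring the non-linearity measure."""
    def __init__(self):
        super().__init__()
        self.internal = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if href.startswith("#"):
            self.internal += 1   # navigational / footnote / reference anchor
        elif href:
            self.external += 1   # link leaving the page

# Hypothetical article fragment
html = '''<p>See <a href="#sec2">Section 2</a> and
<a href="#ref1">[1]</a>. Full text at
<a href="http://www.example.org/paper.pdf">example.org</a>.</p>'''
counter = LinkCounter()
counter.feed(html)
print(counter.internal, counter.external)  # -> 2 1
```

A PDF-only edition offers no such machine-readable anchors, which is one practical reason the HTML journals in Table 4 score higher.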

External links                                                            | Articles   | Average links
Originating from the body of the article                                  | 51 (14%)   | 7.4 ± 14.5
From the references, connecting directly to the item                      | 82 (22%)   | 3.4 ± 3.9
From the references, connecting directly or indirectly to the item        | 98 (26%)   | 3.7 ± 4.6
From the references, connecting to the item or at least to its abstract   | 98 (26%)   | 4.5 ± 6.1

Table 5. Articles with external links



When examining multiple use, we also recorded which articles had any Internet links at all, and whether those links were in fact marked as clickable hyperlinks or not (they might have been just plain text, which the user couldn’t click on). A total of 143 articles (38%) used one or more Internet references, and of those, only 114 (30%) had all of their links properly marked. Table 6 shows the situation broken down by country.

Country | Articles with Internet addresses in the references | Sig. | All Internet addresses marked (clickable) | Sig.
Total   | 143 (38%) |      | 114 (30%) |
ARG     | 12 (23%)  | BC   | 11 (22%)  | BC
BRA     | 36 (52%)  | A    | 30 (43%)  | A
CHL     | 17 (47%)  | A    | 12 (33%)  | AB
COL     | 15 (47%)  | A    | 15 (47%)  | A
CRI     | 11 (61%)  | A    | 9 (50%)   | A
CUB     | 7 (39%)   | AB   | 5 (28%)   | ABC
ECU     | 0         | C    | 0         | C
MEX     | 26 (29%)  | BC   | 23 (26%)  | BC
PER     | 3 (33%)   | ABC  | 2 (22%)   | ABC
PRI     | 7 (58%)   | A    | 4 (33%)   | ABC
URY     | 4 (33%)   | ABC  | 1 (8%)    | C
VEN     | 5 (24%)   | BC   | 2 (9%)    | C
(p = 0.003 < 0.1 and p = 0.006 < 0.1, respectively)

Table 6. Use of Internet references by authors, and correct marking of Internet references as hyperlinks

Country | External links (from the references) connecting directly or indirectly to the item | Sig.
Total   | 98 articles (26%), 3.7 ± 4.6 links  |
ARG     | 10 articles (20%), 4.0 ± 3.2 links  | C
BRA     | 27 articles (39%), 3.2 ± 3.7 links  | C
CHL     | 13 articles (36%), 3.4 ± 4.6 links  | AB
COL     | 10 articles (31%), 3.1 ± 3.4 links  | C
CRI     | 8 articles (44%), 4.2 ± 3.7 links   | BC
CUB     | 2 articles (11%), 1.0 ± 0.0 links   | C
ECU     | 0                                   | BC
MEX     | 16 articles (18%), 4.0 ± 6.2 links  | C
PER     | 3 articles (33%), 11.0 ± 13.9 links | A
PRI     | 5 articles (42%), 2.2 ± 0.8 links   | BC
URY     | 1 article (8%), 6.0 ± 0.0 links     | BC
VEN     | 3 articles (14%), 3.3 ± 1.5 links   | C
(p = 0.098 < 0.1)

Table 7. External links connecting directly or indirectly to the text of a reference item, broken down by country


Of all the countries in Table 6, Costa Rica, Puerto Rico, Brazil, Chile and Colombia show the richest use of Internet references by authors (p = 0.003 < 0.1), but only Costa Rica, Colombia and Brazil match that with an equally thorough marking of links (p = 0.006 < 0.1), and only in the case of Costa Rica do 50% of all articles have their links thoroughly marked as hyperlinks. Table 7 describes the situation for “links leading directly or indirectly to the text”, where we found a significant difference by country: Chile and Peru appear to make the most use of links in the reference sections of their articles (p = 0.098 < 0.1). When broken down by subject, we found that the total links from the references (direct, indirect and abstracts) were significantly different. Table 8 shows that Natural and Exact Sciences and Agricultural Sciences journals used more links on average than the rest of the subjects (p = 0.04 < 0.1).

Subject                    | External links connecting to the item or at least to its abstract | Sig.
Total                      | 98 articles (26%), 3.7 ± 4.6 links   |
Arts and Humanities        | 5 articles (14%), 2.2 ± 1.8 links    | C
Agricultural Sciences      | 8 articles (33%), 4.0 ± 3.5 links    | ABC
Engineering                | 10 articles (42%), 2.4 ± 1.6 links   | BC
Natural and Exact Sciences | 8 articles (27%), 10.5 ± 12.2 links  | A
Medical Sciences           | 14 articles (19%), 2.1 ± 2.0 links   | C
Social Sciences            | 47 articles (28%), 5.3 ± 6.3 links   | B
Multidisciplinary          | 6 articles (33%), 1.5 ± 0.8 links    | BC
(p = 0.04 < 0.1)

Table 8. External links connecting to the item or at least to its abstract, broken down by subject

3.5 Interactivity

To analyze the possibilities offered by interactivity, we broke the category down into four different criteria: (1) presence of a contact form that a user can use to write to the editors; (2) presence of some means of communication between the reader and the author, or some expert in the field; (3) presence of some means of communication amongst the readers; and (4) use of alert features, such as e-mail subscriptions or RSS news feeds. The results for the 125 journals are summarized in Table 9. We can see that Cuba, Brazil and Chile emerge as the strongest performers in this area (p = 0.022 < 0.1). Cuban journals make good use of rich websites to foster communication between the readers and the authors, mostly in the medical journals. Brazilian journals lead in offering alerts and “new issue warnings” to users (probably due to the common adoption of the OJS platform), and Chilean journals very frequently offer forms (as opposed to simply e-mail addresses) for users to ask questions or relay opinions to the journal’s editors. In total, we found that: (1) in the category of user→journal interaction, 17% of the journals offer contact forms, 5% offer ways to contact experts, and 17% offer alerts and news in some form; (2) in the category of user→user interaction, only 5% of the journals offer forums, discussion boards, or any other way for the readers to share information or reply to articles.



3.6 Latindex evaluation criteria

After calculating the points for the LATINDEX criteria, the following results emerged. The average LATINDEX score is 0.48 points, with 35% of the journals using metadata, 50% using search engines, and 58% using “added services”. Figure 1 describes which added services are most common in our sample, and Table 10 breaks down the LATINDEX score by country. Costa Rica and Brazil are the countries with the highest LATINDEX scores (0.83 and 0.62 respectively; p = 0.07 < 0.1).

Country | Total journals | Crit.1     | Crit.2   | Crit.3    | Crit.4      | Average points | Sig.
Total   | 125 | 21 (17%)   | 6 (5%)   | 5 (4%)    | 21 (17%)    | 0.11 |
ARG     | 17  | 3 (17.6%)  | 0        | 1 (5.9%)  | 5 (29.4%)   | 0.13 | BCD
BRA     | 23  | 4 (17.4%)  | 0        | 0         | 11 (47.8%)  | 0.16 | B
CHL     | 12  | 5 (41.7%)  | 0        | 0         | 2 (16.7%)   | 0.15 | BC
COL     | 11  | 2 (18.2%)  | 0        | 1 (9.1%)  | 0           | 0.07 | CDE
CRI     | 6   | 1 (16.7%)  | 1 (16.7%)| 0         | 0           | 0.08 | BCDE
CUB     | 6   | 4 (66.7%)  | 3 (50%)  | 0         | 0           | 0.29 | A
ECU     | 2   | 0          | 0        | 0         | 0           | 0    | BCDE
MEX     | 30  | 1 (3.3%)   | 2 (6.7%) | 2 (6.7%)  | 1 (3.3%)    | 0.05 | E
PER     | 3   | 1 (33.3%)  | 0        | 0         | 1 (33.3%)   | 0.17 | ABCDE
PRI     | 4   | 0          | 0        | 0         | 0           | 0    | DE
URY     | 4   | 0          | 0        | 0         | 1 (25%)     | 0.06 | BCDE
VEN     | 7   | 0          | 0        | 1 (14.3%) | 0           | 0.04 | CDE
(p = 0.022 < 0.1)

Table 9. Interactivity, broken down by country. Crit.1: presence of a contact form for user→journal interaction. Crit.2: presence of some means of communication between the user and the author or some expert in the field. Crit.3: presence of some scheme of user→user communication. Crit.4: use of alerts for the users.

Country | Latindex score (0-1) | Sig. | Metadata usage (journals) | Sig. | Search engine usage (journals) | Sig.
Total   | 0.48 |      | 44 (35%) |       | 62 (50%) |
ARG     | 0.55 | BC   | 9 (53%)  | AB    | 7 (41%)  | B
BRA     | 0.62 | AB   | 10 (43%) | BC    | 16 (70%) | A
CHL     | 0.39 | C    | 3 (25%)  | BCD   | 6 (50%)  | AB
COL     | 0.33 | C    | 1 (9%)   | D     | 4 (36%)  | B
CRI     | 0.83 | A    | 5 (83%)  | A     | 5 (83%)  | A
CUB     | 0.56 | ABC  | 3 (50%)  | ABC   | 3 (50%)  | AB
ECU     | 0.17 | C    | 1 (50%)  | ABCD  | 0        | B
MEX     | 0.38 | C    | 7 (23%)  | CD    | 11 (37%) | B
PER     | 0.56 | ABC  | 2 (67%)  | ABC   | 2 (67%)  | AB
PRI     | 0.25 | C    | 0        | D     | 1 (25%)  | B
URY     | 0.50 | ABC  | 2 (50%)  | ABCD  | 1 (25%)  | B
VEN     | 0.43 | BC   | 1 (14%)  | CD    | 6 (86%)  | A
(score: p = 0.07 < 0.1; metadata: p = 0.03 < 0.1; search engines: p = 0.09 < 0.1)

Table 10. LATINDEX evaluation criteria, broken down by country
Tabla 10: LATINDEX evaluation criteria, broken down by country 3.7

3.7 Comparison between Latin American and non-Latin American journals

For the comparison, we intentionally chose nine journals from Latin America and five from other parts of the world, based on their reputation for innovative use of Internet resources2. Within this small sample, we found that the non-Latin American journals offer significantly more non-linearity (p = 0.005 < 0.1), interactive features (p = 0.005 < 0.1), use of multimedia (p = 0.04 < 0.1) and linking to external documents (p = 0.007 < 0.1). However, we did not find that the two groups were significantly different in their LATINDEX score.

Links to external resources: 53%
Issue warnings (e-mail, RSS): 24%
Relevant resources or documents: 15%
Thematic indexes of the articles: 13%
Send the article to a third person: 10%
User→user interaction (forums): 9%
Contact with authors or experts: 5%
Others: 5%

Figure 1. Types of added resources used by Latin American journals (% of journals that use the resource)

Journals           | Interactivity     | Multimedia use    | LATINDEX score
Non-Latin American | 0.55 ± 0.41 pts.  | 0.40 ± 0.12 pts.  | 0.87 ± 0.13 pts.
Latin American     | 0.14 ± 0.18 pts.  | 0.05 ± 0.09 pts.  | 0.67 ± 0.33 pts.
                   | p = 0.005 < 0.1   | p = 0.04 < 0.1    | p = 0.24 > 0.1

Journals           | Navigational links | External links in the references | External links in the references (including links to abstracts)
Non-Latin American | 30.1 ± 5.2 links   | 4.2 ± 5.2 links   | 49.6 ± 12.9 links
Latin American     | 10.9 ± 18.2 links  | 1.3 ± 0.7 links   | 4.0 ± 8.1 links
                   | p = 0.005 < 0.1    | p = 0.02 < 0.1    | p = 0.007 < 0.1

Table 11. Compared situation of a small sample of Latin American and non-Latin American journals

4. Discussion

The data obtained indicate that the use of Internet-related tools and technologies is not very widespread in the Latin American region. While there is a certain presence of multiple use within the articles (about 26% of all articles evaluated had HTML links in their references or bibliography), non-linearity and interaction are very seldom present in the journals, and multimedia functions are almost nonexistent. Only 13.6% of all articles are non-linear, with an average of 15.0 ± 15.2 navigational links for those articles that are non-linear. Uruguay and Chile appear to stand out in this category (p = 0.01 < 0.1; see table 3), with 42% and 31% respectively of their articles having non-linear links. From the entire corpus, a mere 3.7% of all articles have links to reference or bibliography items; this appears to be one of the least sought-after features across Latin American publishing. In both types of non-linear links (‘navigational’ and ‘directed to references’), the use of HTML publishing is determinant in raising awareness of, and ultimately the frequency of, internal links (p ≤ 0.002 < 0.1; see table 4). PDF-only publishing might be keeping editors from exploring the possibilities of non-linearity, which might be the cause of these results.



Multimedia was the least frequent of all characteristics. Only 3 journals have audio features, and only 4 journals have some form of video. Three of the more noteworthy cases were: (i) Actualidades Investigativas en Educación (Costa Rica), where QuickTime audio and video are used in one of its articles; (ii) La pintura mural prehispánica en México (Mexico), which offers online Flash videos produced by its parent research unit; and (iii) Revista de Enfermedades Infecciosas en Pediatría (Mexico), where audio interviews present its authors and other researchers discussing current issues. Multiple use was the characteristic that fared best in our corpus. About a quarter of the articles (26%) had links that departed from the references and landed either on the document itself, or on a page with the abstract of the original article. Mayernik only considered direct links, meaning links that land on the text of the cited document. In our corpus, 22% of the articles had this kind of direct link in their reference sections, and 14% of the articles had such links within the body of the text. In total, 38% of the articles cited Internet references, with Costa Rica, Puerto Rico, Brazil, Chile and Colombia as the locations where these references are most common among authors (p = 0.003 < 0.1; see table 6), and Chile and Peru as the countries where those references are most likely to include a link leading directly to (or within 3 clicks of) the cited text (p = 0.098 < 0.1; see table 7). The fields of Natural Sciences and Agricultural Sciences presented the highest frequency of links in the references (p = 0.04 < 0.1; see table 8), which might be influenced by the situation described by Cronin [13]: “publishing practices differ; for example, disciplines such as molecular biology follow a pattern characterized by a large number of relatively short papers with joint authorship, frequently appearing in highly cited journals”.
While looking into the multiple use functionality, we discovered a widespread and potentially serious problem of mismarking in the links leading to external documents. In our sample, only 51% of all potential links were well marked (leading to any Internet page at all when clicked on). Of the remaining 49%, 10% were broken links, 15% were completely unmarked links (appearing only as text), and a full 24% were misspelled or incomplete. The most common problem occurred when marking the reference items. Since URL addresses must fit within the page layout, longer links get “broken in two”, so that the two parts sit on different lines. The paragraph looks very orderly, but the automatic marking cannot recognize the second part of the address, and marks only the first section. When this happens, the address gets “cut off in the middle”, and the browser cannot possibly find the right page. While the problem of link rot in scholarly writing still needs to be addressed, we believe this apparently simple problem should also be taken into account. In the interactivity section, the most common forms of interaction offered by the journals were user→journal, in the form of question forms (17% of journals) and new-issue alerts (17% of journals). Some journals, however (5%), provided user→user interaction, through user forums and systems of response to published articles, so that readers could participate in the discussion. E-mail-only contact with the journal continues to be the norm, with Cuba as the most salient exception for the user interaction available in its medical journals (p = 0.02 < 0.1; see table 9). As for the Latindex characteristics, about 35% of the journals have metadata within their pages, and 50% offer search engines to their users. 
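The wrap-induced truncation described above can be illustrated with a short sketch (the example URL and the line-rejoining heuristic are ours, not part of the study): a naive auto-linker that matches up to the first whitespace sees only the fragment before the line break, while rejoining wrapped lines first recovers the full address.

```python
import re

# A reference whose URL was wrapped across two lines by the page layout.
wrapped = "Available from: http://www.dlib.org/dlib/decem\nber98/12hitchcock.html"

URL = re.compile(r"https?://\S+")

# Naive auto-linking stops at the line break and marks only the first part.
naive = URL.findall(wrapped)   # ['http://www.dlib.org/dlib/decem']

# Heuristic repair: remove a line break that falls inside a URL, then match.
rejoined = re.sub(r"(https?://\S*)\n(\S+)", r"\1\2", wrapped)
fixed = URL.findall(rejoined)  # full address recovered
```

This is exactly the failure mode described: the first fragment becomes a well-formed but wrong link, and the second fragment is left as plain text.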
Costa Rica and Brazil get the most points across the three Latindex categories (p = 0.07 < 0.1; see table 10): 83% of Costa Rican journals offer metadata in their websites, and 83% use search engines for the sites’ contents. The average LATINDEX grade for Costa Rican journals was 0.83, while the average for Brazilian journals was 0.62. When comparing the results of the non-Latin American journals with the Latin American ones, the differences are quite obvious. Every single balance tips in favor of the non-Latin American journals (and not only in quantity; the quality of the navigational linking, for example, is much more noticeable). The only exception was in the case of the LATINDEX characteristics, where there is no significant difference.

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008


Saray Córdoba-González; Rolando Coto-Solano

5. Conclusions

At this point, we can conclude that the situation of the average scientific electronic journal from Latin America does not really differ from the one studied by Mayernik or from the results obtained by Marcondes et al. [7]; that is, the journals from this part of the world, “as many international e-journals, are still designed based on paper journals models. They incorporate few of those technological facilities”. Very few journals use formats and techniques that fully take advantage of the possibilities offered by Internet tools. Many journals take a great deal of care to offer a presentable “cover page”, with a very flexible and non-linear entry page. However, those efforts wilt as the user approaches the article pages, until the article’s text becomes a copy of the original printed format. Traditionally, editors have thought that their articles have only one audience: human beings. An attractive presentation will surely play a role in a diligent editor’s duty. However, a good visual layout will be impenetrable to what has become a second audience for the articles: computer systems such as robots and spiders that crawl the article in search of usable links, hoping to weave valuable connections between science web pages throughout the world. These two audiences (humans and web-exploration software) complement each other, and both demand attention from the editor. Creating awareness of this problem might be the only way forward. While the present situation of the journals of Latin America is not the best of all possible worlds, the adoption of basic Internet tools such as metadata and search engines is in itself no small feat. The LATINDEX network of associates constantly monitors the use of these features in each country, and has campaigned among local editors to create awareness about them, which might have helped reach the results that we see today. 
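What this “second audience” of machine readers actually extracts from an article page can be made concrete with a minimal sketch (the page content and Dublin Core field names below are hypothetical, chosen only for illustration): a crawler sees embedded metadata tags and followable links, and nothing else.

```python
from html.parser import HTMLParser

class ArticleCrawler(HTMLParser):
    """A stand-in for what a harvesting robot 'sees': metadata and links."""
    def __init__(self):
        super().__init__()
        self.meta, self.links = {}, []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").startswith("DC."):
            self.meta[a["name"]] = a.get("content", "")   # Dublin Core fields
        elif tag == "a" and "href" in a:
            self.links.append(a["href"])                  # citation links

page = """<html><head>
<meta name="DC.title" content="Characteristics Shared by the Scientific Electronic Journals of Latin America and the Caribbean">
<meta name="DC.creator" content="Cordoba-Gonzalez, S.">
</head><body>
<p>Reference: <a href="http://www.dlib.org/dlib/december98/12hitchcock.html">Hitchcock 1998</a></p>
</body></html>"""

c = ArticleCrawler()
c.feed(page)
# c.meta now holds the metadata fields; c.links holds the reference links a
# spider could follow -- exactly what an unmarked reference denies it.
```

An article page without metadata tags or marked links is, from this audience’s point of view, empty.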
In spite of the low scores in the Mayernik categories, there was no significant difference in the use of metadata and search engines between the “top Latin American journals” and the “top non-Latin American journals” we studied in table 11. Superficial as this comparison might be, it does speak of the achievements that electronic publishing has reached in this group of countries. In countries like Chile, Costa Rica, Colombia, Brazil and Mexico, we found examples of good journals that are well prepared to compete in the global arena. Five countries stand out from the corpus: Brazil, Chile, Costa Rica and Cuba (for their relatively good scores in all of the characteristics), and Mexico (for the incorporation of multimedia into its publishing practices). Every country has its peculiarities. As a part of the informal BRIC bloc, Brazil has been hard pressed to improve the quality of its scientific output. Both Chile and Costa Rica fare well in the Global Competitiveness Report [14] (first and fifth place respectively among countries and territories in Latin America) and the Human Development Index [15] (second and fourth place respectively). Cuba is reputed in the region for “good old resourcefulness” in the face of economic difficulties, but also for a very strict and vertical research culture. Mexico’s UNAM is the only university in the region in the top 200 universities of the QS Ranking, followed within the top 250 by Chile’s Pontificia Universidad Católica [16]. In spite of these varying conditions, the one thread that ties these countries together is that they have the largest investment in research and development in the region relative to their GNP: Brazil, 0.83%; Chile, 0.68%; Cuba, 0.56%; Mexico, 0.46%; and Costa Rica, 0.41% [17]. This has allowed them to use much-needed resources in raising awareness about their scientific communication processes, a fact that appears to be reflected in the data we have obtained. 
Determining the exact extent to which funding, funding models, and the availability of materials at the individual journal level truly influence the results of this investigation is a question that calls for future study. Marcondes et al.’s [7] suggestion of “lack of professionalism” is certainly a strong statement that has little or nothing to do with funds; Costa, Silva and Costa [18] point instead to a lack of computer literacy as the culprit. Some editors might consider that “just being on the Internet” is added value enough and that there is no need to improve or work on that online presence. Yet another possibility is that they consider the addition of links might be “baffling” to the user, where the “user” still reflects the narrow vision of humans as the only consumers of their information. The situation hints at a complex interaction between the availability of funds and the willingness to ‘think outside the box’, and more research is needed to understand the attitudes of editors towards electronic publishing.

6. Acknowledgements

We would like to thank the Vice-Dean of Research of the University of Costa Rica (www.vinv.ucr.ac.cr) and the Incentive Funds Commission of the Ministry of Science and Technology of Costa Rica (www.micit.go.cr), as well as Marcela Alfaro for acting as consultant for the statistical analyses.

7. Notes

1. Further examples include the Mexican E-Journal at UNAM and the Costa Rican Latindex-UCR at the University of Costa Rica.

2. There is indeed overlap between the “added services” category in LATINDEX and the interactivity characteristic of Mayernik. Figure 1 describes the specific added services found in the sample, and can be contrasted with table 9 for the differences between the two.

3. In this case, we asked the LATINDEX partners which countries have the biggest groups of online journals in the Directory. We got three different answers: Mexico, Colombia and Costa Rica, but our experience led us to also include journals from Chile and Brazil. We chose the following non-Latin American journals: Journal of Electronic Publishing, British Medical Journal, Behavioral and Brain Sciences, PLoS Medicine and CTheory. As for the Latin American journals, we selected: Online Brazilian Journal of Nursing, Revista Eletrônica de Estudos Hegelianos, Colombia Médica, Livestock Research For Rural Development, Revista E-mercatoria, e-Gnosis, Aleph Zero, Cinta de Moebio and the Electronic Journal of Biotechnology.
8. References

[1] MOGHADDAM, G.G. Archiving Challenges of Scholarly Electronic Journals: How Do Publishers Manage Them? Serials Review, 2007, vol. 33, no. 2, p. 2 [cited May 5, 2008]. Available from: http://eprints.rclis.org/archive/00011175/01/Archiving_Challenges_of_Scholarly_Electronic_Journal_How_do_publishers_manage_them.pdf

[2] MAYERNIK, N. The Prevalence of Additional Electronic Features in Pure E-Journals. Journal of Electronic Publishing, Fall 2007, vol. 10, no. 3 [cited May 5, 2008]. Available from: http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;cc=jep;rgn=main;view=text;idno=3336451.0010.307 [accessed 15-01-2008].

[3] VALUSKAS, E. Waiting for Thomas Kuhn: First Monday and the Evolution of Electronic Journals. Journal of Electronic Publishing, September 1997, vol. 3, no. 1 [cited May 5, 2008]. Available from: http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;cc=jep;q1=3336451.0003.1%2A;rgn=main;view=text;idno=3336451.0003.104 [accessed 01-15-2008].

[4] HITCHCOCK, S.; CARR, L.; HARRIS, S.; HALL, W.; PROBETS, S.; EVANS, D.; BRAILSFORD, D. Linking Electronic Journals. D-Lib Magazine, December 1998. ISSN 1082-9873. Available from: http://www.dlib.org/dlib/december98/12hitchcock.html#note1 [accessed 10-12-2007].

[5] KLING, R.; McKIM, G. Scholarly Communication and the Continuum of Electronic Publishing. Journal of the American Society for Information Science, 1999, vol. 50, no. 10, pp. 890-906. Available from: http://arxiv.org/abs/cs/9903015v1 [accessed 03-01-2008].

[6] DIAS, G.A. Periódicos eletrônicos: considerações relativas à aceitação deste recurso pelos usuários. Ciência da Informação, Sept.-Dec. 2002, vol. 31, no. 3 [cited May 2, 2008]. DOI: 10.1590/S0100-19652002000300002. Available from: http://www.scielo.br/scielo.php?script=sci_abstract&pid=S0100-19652002000300002&lng=en&nrm=iso&tlng=pt

[7] MARCONDES, C.H.; SAYÃO, L.F.; MAIA, C.M.R.; DANTAS, M.A.R.; FARIA, W.S. State-of-the-art of Brazilian e-journals in science and technology. In: Proceedings of the 8th International Conference on Electronic Publishing, Brasília, Brazil, 23-26 June 2004. Edited by Jan Engelen, Sely M.S. Costa and Ana Cristina S. Moreira. Universidade de Brasília, 2004 [cited April 4, 2008]. Available from: http://elpub.scix.net/cgi-bin/works/Show?079elpub2004

[8] MUÑOZ, G.; BUSTOS-GONZÁLEZ, A.; MUÑOZ-CONEJO, A. Sharing the Know-how of a Latin American Open Access Only E-journal: The Case of the Electronic Journal of Biotechnology. In: Openness in Digital Publishing: Awareness, Discovery and Access. Proceedings of the 11th International Conference on Electronic Publishing, Vienna, Austria, 13-15 June 2007. Edited by Leslie Chan and Bob Martens, 2007, pp. 331-340 [cited February 20, 2008]. ISBN 978-3-85437-292-9. Available from: http://elpub.scix.net/cgi-bin/works/Show?133_elpub2007

[9] HARNAD, S. Post-Gutenberg Galaxy: The Fourth Revolution in the Means of Production of Knowledge. 1991 [cited December 10, 2008]. Available from: http://users.ecs.soton.ac.uk/harnad/Papers/Harnad/harnad91.postgutenberg.html

[10] TENOPIR, C.; KING, D. Designing Electronic Journals With 30 Years of Lessons from Print. Journal of Electronic Publishing, April 2002, vol. 7, no. 3 [cited January 15, 2008]. Available from: http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;cc=jep;q1=Electronic%20journals;q2=Scholarly%20journals;op2=and;op3=and;rgn=main;view=text;idno=3336451.0007.303

[11] HARNAD, S. Interactive Publication: Extending the American Physical Society’s Discipline-Specific Model for Electronic Publishing. Serials Review, 1992, special issue [cited March 15, 2008]. Available from: http://cogprints.org/1688/0/harnad92.interactivpub.html

[12] LUKESH, S. Revolutions and Images and the Development of Knowledge: Implications for Research Libraries and Publishers of Scholarly Communications. Journal of Electronic Publishing, April 2002, vol. 7, no. 3 [cited May 5, 2008]. Available from: http://hdl.handle.net/2027/spo.3336451.0007.303

[13] CRONIN, B. The Hand of Science. Lanham, Maryland: Rowman and Littlefield, 2005. ISBN 0-8108-5282-9.

[14] WORLD ECONOMIC FORUM. Global Competitiveness Report 2007-2008. 2007 [cited May 6, 2008]. Available from: http://www.weforum.org/pdf/Global_Competitiveness_Reports/Reports/gcr_2007/gcr2007_rankings.pdf

[15] UNITED NATIONS DEVELOPMENT PROGRAMME. 2007/2008 Human Development Index Rankings. 2008 [cited May 6, 2008]. Available from: http://hdr.undp.org/en/statistics/

[16] TIMES HIGHER EDUCATION. QS World University Rankings 2007 - Top 400 Universities. 2007 [cited May 6, 2008]. Available from: http://www.topuniversities.com/worlduniversityrankings/results/2007/overall_rankings/top_400_universities/

[17] RICYT. Red de Indicadores de América Latina: Indicadores Comparativos. 2008 [cited May 6, 2008]. Available from: http://www.ricyt.org/interior/interior.asp?Nivel1=1&Nivel2=2&Idioma=

[18] COSTA, S.; SILVA, W.; COSTA, M. Electronic Publishing in Brazilian Academic Institutions: Changes in Formal Communication, Too? In: 2001 in the Digital Publishing Odyssey: Proceedings of an ICCC/IFIP Conference held at the University of Kent, Canterbury, UK, 5-7 July 2001. Edited by Arved Hübler, Peter Linde and John W.T. Smith. ISBN 1-58603-191-0. Available from: http://elpub.scix.net/cgi-bin/works/Show?200112 [accessed 03-13-2008].


9. Appendix 1: Journals included in the study

Argentina: Archivos argentinos de alergia e inmunología clínica, AdVersus, Biocell, Contabilidad y auditoría, Dermatología Argentina, Equipo Federal del Trabajo, Foro Iberoamericano sobre Estrategias de Comunicación (FISEC), Hologramática, Journal of Applied Economics, Journal of Computer Science and Technology, Psikeba, Rev. Argentina de Lingüística, Revista de Ciencias Sociales, Rev. de Investigaciones Agropecuarias, Telondefondo, Universitas, Urbe et Ius.

Brazil: Afro Asia, Boletim do Instituto de Pesca, Brazilian Administration Review, Brazilian Journal of Biomotricity, Caderno espaço feminino, Caderno Virtual de Turismo, Contingentia, Data Grama Zero, Economia e Energia, Educação Temática Digital, Engenharia Ambiental, Hegemonia, Klepsidra, Online Brazilian Journal of Nursing, Relações públicas em revista, Revista brasileira de educação médica, Revista Brasileira de Zoologia, Revista de Estudos da Religião, Revista de Gestão da Tecnologia e Sistemas de Informação, Revista Eletrônica de Estudos Hegelianos, Revista Expectativa, Revista Matéria, Semina.

Chile: Agenda Pública, Ciencia y Trabajo, Cinta de Moebio, Cuadernos de Economía, El Vigía (Santiago), Electronic Journal of Biotechnology, Journal of Technology Management and Innovation, Monografías electrónicas de patologia veterinaria, Política Criminal, Rev. Chilena de Ciencia de la Computación, Rev. Chilena de Semiótica, Revista Universitaria.

Colombia: Acta Biológica Colombiana, Colombia Médica, Cuadernos de Administración, Earth Sciences Research Journal, Livestock Research For Rural Development, Nómadas, Rev. Ciencias Humanas, Rev. Latinoamericana de Ciencias Sociales, Niñez y Juventud, Revista EIA Ingeniería Antioquía, Revista E-mercatoria, Revista Escuela Colombiana de Medicina - ECM.

Costa Rica: Actualidades Investigativas en Educación, Diálogos, MHSalud, Población y Salud en Mesoamérica, Reflexiones, Revista de Derecho Electoral.

Cuba: ACIMED, Fitosanidad, Multimed, Revista cubana de investigaciones biomédicas, Revista Cubana de Obstetricia y Ginecología, Revista cubana de pediatría.

Ecuador: Gaceta Dermatológica Ecuatoriana, Universidad-Verdad.

Mexico: Acta Médica Grupo Ángeles, Alegatos, Aleph Zero, Anales del I. Biología, Serie Zoología, Anuario Mexicano de Derecho Internacional, Archivos Hispanoamericanos de Sexología, Biblioteca Universitaria, Buenaval, Computación y Sistemas, Cuadernos de Psicoanálisis, Dugesiana, Educar, e-Gnosis, El Psicólogo Anahuac, Hitos de Ciencias Económico Administrativas, InFÁRMAte, Investigación Bibliotecológica, Journal of Applied Research and Technology, La pintura mural prehispánica en México, Los amantes de Sofía, Mensaje bioquímico, Nueva Antropología, Redes Música, Rev. Ciencia Veterinaria, Revista Biomédica, Revista de Enfermedades Infecciosas en Pediatría, Revista de la Educación Superior, Revista del Instituto Nacional de Cancerología, Revista Fractal, Revista Mexicana de Física.

Peru: Biblios, Diagnóstico, Escritura y Pensamiento.

Puerto Rico: Ceteris Paribus, El Amauta, Rev. Int. Desastres Naturales, Infraestructura Civil, Videoenlace Interactivo.


Uruguay: Actas de Fisiología, Boletín Cinterfor, Boletín del Inst. de Inv. Pesqueras, Galileo.

Venezuela: Acción Pedagógica, Boletín Antropológico, Cayapa, Música en clave, Postgrado, Rev. Ingeniería UC, Revista de la Sociedad Médico-Quirúrgica del Hospital de Emergencia Pérez de León.


Consortial Use of Electronic Journals in Turkish Universities

Yasar Tonta; Yurdagül Ünal
Department of Information Management, Hacettepe University, 06800 Beytepe, Ankara, Turkey
e-mail: {tonta, yurdagul}@hacettepe.edu.tr

Abstract

The use of electronic journals has surpassed that of printed journals within the last decade, and the consortial use of electronic journals through publishers’ or aggregators’ web sites is on the rise worldwide. This is also the case for Turkey: the Turkish academic community has downloaded close to 50 million full-text articles from various electronic journal databases since the year 2000. This paper analyzes seven years’ worth of journal use data, comprising more than 25 million full-text articles downloaded from Elsevier’s ScienceDirect (SD) electronic journals package between 2001 and 2007. Some 100 core journals, constituting only 5% of all SD journal titles, satisfied over 8.4 million download requests. The lists of core journals were quite stable, consistently satisfying one third of all demand. A large number of journal titles were rarely used, while some were never used at all. The correlation between the impact factors (IFs) of core journal titles and the number of downloads therefrom was rather low. Findings can be used to develop better consortial collection management policies and empower the consortium management to negotiate better deals with publishers.

Keywords: electronic journals; consortial use of electronic journals; core journal titles; Turkish universities; Bradford Law of Scattering; ScienceDirect.

1. Introduction

Scientific journals are one of the major information sources in library collections. Currently, some 25,000 refereed journals are published world-wide. Libraries spend about two thirds of their budgets on subscribing to and licensing scientific journals and making them available online. Consortial agreements signed between libraries and publishers/aggregators enable users to get access to electronic journals through the Internet. Users can easily download the full texts of articles that appear in thousands of electronic journals. Yet the great majority of articles downloaded by users tend to be published in a relatively small number of “core journals” in each field. Those core journals can easily be identified through an analysis of COUNTER-based use data. Studies based on such analyses of empirical journal use data are scarce in Turkey. This paper attempts to identify the core journals most frequently used by Turkish academic users. Our analysis is based on more than 25 million full-text articles downloaded by Turkish universities from Elsevier’s ScienceDirect (SD) electronic journal package over a seven-year period (2001-2007), making it perhaps one of the most comprehensive electronic journal use studies carried out on a national scale. The volume of data enables us to identify the core journals as well as to determine their stability over the years. Findings can be used to develop better collection management policies and improve the conditions of the national consortial license for Turkish universities.


2. Universities and Consortium Development in Turkey

As of 2008, the total number of universities in Turkey is 115. Most (85) are state-sponsored. The tertiary education system is governed by the Higher Education Act of 1981. The Higher Education Council (YÖK), consisting of members from universities and outside interests, is the policy-making body for all universities, including private/foundational ones. The selection and admission of students takes place through a national entrance exam administered by a center (ÖSYM) under the authority of YÖK. The total number of students enrolled in the higher education system (including students in distance education and vocational programs) was 2,453,664 in the 2006/2007 academic year. The number of graduate students was rather low: 108,653 master’s and 33,711 Ph.D. students. More than half (54.5%) of all undergraduate students study social sciences. The rest study technical sciences (17.3%), math and sciences (10%), medicine (9%), language and literature (4%), agriculture and forestry (3.2%), and arts (2%) [1,2]. The number of faculty in the 2006/2007 academic year was 89,319 (12,773 professors, 6,150 associate professors, 15,844 assistant professors, and 54,562 research assistants, specialists, and others) [3]. The National Academic Network and Information Center (ULAKBIM) of the Turkish Scientific and Technological Research Center (TÜBITAK) was founded in 1996 to set up a national academic and research network and use it as a testbed to share precious information resources among university libraries. In addition, ULAKBIM aimed to provide access to electronic information sources and services by signing national site licenses with publishers on behalf of all Turkish universities. In fact, the first experience of Turkish universities with electronic journals dates back to the second half of the 1990s, following the establishment of ULAKBIM. 
On November 14, 1997, ULAKBIM organized a day-long meeting for university library directors and their superiors (i.e., vice-rectors overseeing libraries) and presented its views on setting up a consortium for university libraries to cooperate and share electronic resources as stated in its by-law. In 1998, ULAKBIM offered the first trial databases to universities [4]. However, ULAKBIM’s priorities had changed due to financial and administrative difficulties experienced at that time and the Center was not able to immediately carry out some of its duties (one of which was to set up a consortium) as specified in its foundational by-law. Thus, ULAKBIM could not live up to the expectations of the potential members of the consortium, namely university libraries, in its formative years. Meanwhile, a few university libraries signed joint licensing projects with publishers in 1999 and 2000. Following this, the Consortium of Anatolian University Libraries (ANKOS) was created in 2001 as a voluntary association run by a nine-member Steering Committee. ANKOS developed the Turkish National Site License (TRNSL) document and member libraries began to sign agreements with publishers to get access to electronic journals and bibliographic databases [5,6]. These initial agreements were “mostly informal subscription arrangements” for printed journals including access to electronic copies thereof. In 2004, ANKOS began to sign multi-year consortial licenses to get access to the electronic copies of journals (excluding their printed equivalents) [7,8]. Thanks to the indefatigable efforts of ANKOS, several universities, especially the newly-established ones, provided access to electronic journals for the first time through such licenses. Some universities did not even have any sizable journal collections at that time. As more university libraries joined ANKOS over the years, the number of databases offered and their use has increased tremendously. 
ANKOS currently has some 90 members, including a few non-university entities. ULAKBIM has also been a member of ANKOS from the very beginning. As of 2008, ANKOS licenses a total of 30 packages of electronic journals and books. Some of those packages are as follows: Blackwell’s, Ebrary, Emerald, Gale, Nature Publishing Group, ProQuest, Sage, ScienceDirect (SD) e-books, Wiley InterScience, and journal packages offered by professional associations such as ACM, ACS, ALPSP, and SIAM. The number of licensees for each package ranges between 4 (Elsevier’s MD Consult) and 74 universities (Oxford University Press), the average being 24 universities [9].


Apparently it took ULAKBIM longer than anticipated to convince TÜBITAK to allocate resources to provide access to electronic journals and books on a national scale [10]. After a precious loss of about seven years, ULAKBIM came onto the scene once again in 2005. Having secured funds (apparently) from the European Union (EU), TÜBITAK’s Science Council authorized ULAKBIM, in late 2005, to sign national site licenses with publishers covering potentially all universities. This authorization enabled ULAKBIM to make electronic databases available to all Turkish universities and research centers through its National Academic License of Electronic Resources (EKUAL) starting from 2006 [11]. The first package offered to universities on a national scale through ULAKBIM’s EKUAL was ISI’s (now Thomson Scientific) Web of Science (WoS) [12]. The coverage of EKUAL was expanded in February 2006 to include the training and research centers of public hospitals under the administration of the Ministry of Health. EKUAL currently has 105 member universities and research centers as well as 48 hospitals. As of early 2008, ULAKBIM offers 11 electronic databases to universities and research centers: BMJ Clinical Evidence, BMJ Online Journals, CAB, EBSCOhost, Engineering Village 2, IEEE, Journal Citation Reports (JCR), Ovid-LWW, ScienceDirect, Taylor & Francis, and the Web of Science. Some databases are offered to all members (e.g., Thomson Scientific’s WoS and JCR databases, and Elsevier’s SD) while others depend on the number of members requesting access (for instance, almost 90 members requested access to the Engineering Village 2 and IEEE databases, while 31 members preferred access to the CAB database). In addition to the above databases, all 48 hospitals get access to the following databases through ULAKBIM’s EKUAL: Blackwell Synergy, Embase, ScienceDirect Health Sciences, Springer, The Cochrane Library, Wiley InterScience, and xPharm [13]. 
In addition to ANKOS and ULAKBIM, the Association of University and Research Libraries (ÜNAK) also took part in consortial licensing of electronic resources, starting from 2001. The ÜNAK-OCLC Consortium provides access to OCLC’s databases such as FirstSearch, WorldCat and NetLibrary [14]. The number of licensees ranges between 5 and 24. Non-OCLC databases are apparently outside the realm of the ÜNAK-OCLC Consortium. Some 12 million full-text journal articles or book chapters were downloaded in 2007 by Turkish academics from various databases [15]. Downloads from Elsevier’s SD usually constitute more than half the total.

3. Literature Review

Libraries sign agreements with publishers for “big deals”, in which publishers provide a set of journals as a package. In the early days, this approach was embraced readily by university libraries, because it was attractive for users to perform a cross-search and get access to the full texts of articles regardless of whether their library had subscribed to the title earlier. Yet some of the journal titles provided in big deal agreements are not necessarily the most frequently used ones. Paying license fees for marginal journal titles embedded in the big deals tends to increase the total license fees and limits the choices of libraries, not to mention the possible overlap of journal titles offered by different publishers and aggregators. To support the license fees of the big deals, libraries have had to cut some of their subscriptions to journals that are perhaps used more frequently. Big deals have therefore been increasingly criticized in recent years because of monopoly, price hikes, and the inclusion of journals that may not be at the top of the priority lists of libraries [16,17,18]. Some universities in the United States therefore rejected the big deals and negotiated new agreements with publishers. For instance, Cornell University agreed to identify journal titles from a package and only include those as part of its license agreement with Elsevier [19]. Gatten and Sanville [20] analyzed download data to identify the use patterns of journal titles within a big consortium (OhioLINK). They wondered if the rarely used journal titles within a consortial big deal package could be dropped from subsequent years’ negotiations without undermining the use of one or more consortium members, thereby
making the big deal more cost effective. They showed that an orderly retreat (i.e., title-by-title elimination of rarely used titles) “based on the ranking of articles-downloaded aggregated across member institutions appears to be a reasonable method to employ if needed. . . . An effective orderly retreat means consortia have the ability to manage a Big Deal based on a ‘cost for content’ approach.” It may sometimes be more economical for a library to pay per view rather than sign a big deal agreement, especially if the use is not that great ([21,22]; see also [17]). It should be noted that the big deal publishers seem to have softened their stand on the “all or nothing” approach, and some of them now allow libraries to pick the titles they want out of a big deal package. One way for libraries to find out whether the electronic journals they license are actually used is to conduct use studies. Findings of such studies empower library administrators and enable them to develop better collection management policies [23,24]. Studies of cost-benefit analysis are especially noteworthy [25,26,27]. Use analyses based on the SD database of electronic journals are not that many [28,29,30,31]. In general, core journals satisfied a large percentage of requests [28,29,32,33]. For instance, half the use of the Middle East Technical University (METU), a leading Turkish university, is satisfied by 136 core journals, and one third of all journals satisfied 86% of all demand [25, p. 73]. Evans and Peters [22] analyzed the aggregated use of more than 100 business and management journals included in the Emerald Management Xtra (EMX) collection and tested whether the dispersal of some 6.4 million articles downloaded in 2004 by the “big deal” users fitted the “80/20 rule” or Pareto principle. They found that the most frequently used 15 journals satisfied 36.7% of all download requests and that the download data did not conform to the 80/20 rule: 47.4% of journals satisfied 80% of download requests. 
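The 80/20 check applied in such studies can be sketched in a few lines (the download counts below are toy data, not figures from any of the studies cited): rank titles by downloads, accumulate until the target share of demand is met, and report what fraction of titles that took.

```python
def pareto_share(downloads, demand_share=0.80):
    """Fraction of journal titles needed to satisfy `demand_share` of all
    download requests, taking titles in decreasing order of use."""
    ranked = sorted(downloads, reverse=True)
    target = demand_share * sum(ranked)
    running = 0
    for i, d in enumerate(ranked, start=1):
        running += d
        if running >= target:
            return i / len(ranked)

# Toy data: a strongly skewed distribution, where 30% of the titles
# satisfy 80% of the demand -- far fewer than in a uniform distribution.
skewed = [500, 200, 100, 50, 25, 10, 5, 5, 3, 2]
print(pareto_share(skewed))  # 0.3
```

A result near 0.2 would match the classic 80/20 rule; the figures reported above (47.4% of journals for 80% of requests) correspond to a much flatter distribution.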
Aggregated use by the members of the Consortium of University Libraries of Catalonia (CBUC) of, among others, the EMX collection (formerly MCB) between 2001 and 2003 displayed a similar trend: 46.2% of journals satisfied 80% of more than 200 thousand download requests [23]. There are studies that test the relationship between the number of uses (downloads) and bibliometric indicators such as the journal impact factor (IF), half-life and total number of citations [24]. Some studies report a statistically significant relationship between use based on bibliometric indicators and use based on download data [25], while others do not [26]. Darmoni, Roussel, Benichou, Thirion, and Pinhas [27] defined a new measure called the “Reading Factor” (RF), “the ratio between the number of electronic consultations of an individual journal and the mean number of electronic consultations of all the journals studied” (p. 323), and compared the RF and IF values for 46 journals. They reported no correlation “between IF and electronic journal use as measured by RF” (p. 325). Although such findings can be used in collection management to some extent, the use of electronic journals cannot be explained by a single factor such as journal IFs or RFs. Bollen, Van de Sompel, Smith and Luce [28] developed a taxonomy of impact measures based on journal usage data that includes frequentist author-based (i.e., IF) and reader-based (i.e., RF) measures as well as structural author-based (i.e., webometrics) and reader-based measures. Recently, Bollen and Van de Sompel [29] examined the effects on journal usage of community-based characteristics such as total student enrollment and the size of a discipline in terms of the number of journals. They defined a journal Usage Impact Factor (UIF) mimicking ISI’s IF. They then used two years’ worth of download data obtained from the 23-campus California State University to rank journals on the basis of UIFs.
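The Reading Factor quoted above is a simple ratio of a journal's electronic consultations to the mean consultations across all journals studied; a minimal sketch (journal names and counts are hypothetical):

```python
def reading_factor(downloads):
    """Reading Factor (Darmoni et al.): each journal's electronic
    consultations divided by the mean consultations of all journals."""
    mean = sum(downloads.values()) / len(downloads)
    return {journal: n / mean for journal, n in downloads.items()}

# Three hypothetical journals with a mean of 200 consultations:
rf = reading_factor({"J1": 300, "J2": 150, "J3": 150})
print(rf["J1"])  # 300 / 200 = 1.5
```

An RF above 1 means the journal is consulted more than the average journal in the set; an RF below 1 means less.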
They reported a negative correlation between UIF and ISI’s IF values in general. No correlation was found for most disciplines between UIF and IF values. However, UIF and IF correlations “seemed to be related to the ratio between the sizes of the undergraduate and graduate community in a discipline” (p. 146). Studies based on the MESUR database, which contains large volumes of usage and citation data, will shed new light on the relationship between use-based measures and the community- or subject-based characteristics of journal use. Developed by Bollen and his colleagues, the database contains usage data spanning 100,000 journals and citation data spanning 10,000 journals for 10 years. In addition, the database has “publisher-provided COUNTER usage reports that span nearly 2000 institutions worldwide. . . . MESUR is now producing large-scale, longitudinal maps of the scholarly community and a survey of more than 60 different metrics of scholarly impact.” [30]

4. Methodology

Data used in this paper come from Elsevier’s ScienceDirect (SD) Freedom Collection database of electronic journals. SD contains the full-texts of some 8 million articles published in more than 2,000 journals. The SD Freedom Collection provides access to the contents of both subscribed and non-subscribed Elsevier journals with “dynamic linking to journals from approximately 2,000 STM publishers through CrossRef” [31]. Seven years’ (2001-2007) worth of COUNTER-based download statistics of Turkish universities’ use of Elsevier’s SD database were obtained from the publisher. The number of full-text articles downloaded from each journal by each university was recorded. The analysis was based on more than 25 million articles downloaded from over 2,000 Elsevier journals. The most frequently used “core” journal titles were identified. Tests were carried out to see if the distribution of downloaded articles to journal titles conformed to Bradford’s Law of Scattering, the 80/20 rule and the Price Law. Using ISI citation data (Journal Citation Reports 2006), the correlation between the journal impact factors (IFs) of core journal titles and their use based on the number of downloads was calculated to see if journals with high IFs were also used heavily by the Turkish academic community. What follows are the preliminary findings of our study.

5. Findings and Discussion

Turkish academic users downloaded a total of 25,145,293 full-text articles between 2001 and 2007 from 2,097 different journals included in Elsevier’s SD database [48]. Two thirds of those articles were downloaded over the last three years (2005-2007) (Fig. 1). March and December are the most heavily used months of the year, while the number of downloads decreases considerably during the summer. Table 1 shows the frequencies and percentages of journal titles satisfying one third, two thirds, and all requests downloaded between 2001 and 2007, as well as on an annual basis. Based on the data presented in Table 1, Figure 2 shows the annual distributions of journal titles by regions (i.e., the percentage of journal titles satisfying one third, two thirds and all download requests). The first third of all download requests (some 8.4 million articles) was satisfied by 105 “core” journals, constituting a mere 5% of all journal titles. The second third was satisfied by 273 journals (12.9% of all journal titles). In other words, 378 journal titles (some 18% of all journal titles within SD) satisfied two thirds of all download requests. The last third of requests was satisfied by 1,719 rarely used journals (82.1% of all SD journal titles). When the download statistics were analyzed on an annual basis for seven years, the pattern of use of core journals did not change much: on average, about 90 core journal titles invariably satisfied one third of all download requests each year (77 journal titles in 2001, 83 in 2002, 95 in 2003, 103 each in 2004 and 2005, 92 in 2006, and 93 in 2007) (Table 1). The percentage of core journal titles ranged between 4.6% (2007) and 6.2% (2001) of all SD journals.
The use patterns of moderately and rarely used journal titles did not fluctuate much, either: the percentage of moderately used journal titles ranged between 12.8% (2007) and 16.7% (2001), while the rarely used ones constituted the overwhelming majority (77% in 2001 and 82.6% in 2007) of all SD journals.
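The partition of titles into the three “regions” of Table 1 can be computed by ranking titles by downloads and cutting the cumulative distribution at each third of total demand; a sketch with hypothetical counts:

```python
def partition_into_regions(downloads, regions=3):
    """Split journal titles, ranked by downloads, into regions that
    each satisfy an equal share (e.g. one third) of total requests."""
    ranked = sorted(downloads.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(n for _, n in ranked)
    boundary, out, current, cumulative = total / regions, [], [], 0
    for title, n in ranked:
        current.append(title)
        cumulative += n
        # Close a region once its share of cumulative demand is reached.
        if cumulative >= boundary * (len(out) + 1) and len(out) < regions - 1:
            out.append(current)
            current = []
    out.append(current)
    return out

# Hypothetical counts: the single heaviest title already satisfies a third.
regions = partition_into_regions({"A": 60, "B": 50, "C": 40, "D": 20, "E": 10})
print([len(r) for r in regions])  # [1, 2, 2]
```

With real data this reproduces the core/moderate/rare split: very few titles in region 1, slightly more in region 2, and the long tail in region 3.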


Number of journals
                 2001-2007       2001          2002          2003          2004          2005          2006          2007
                 N      %       N     %       N     %       N     %       N     %       N     %       N     %       N     %
1. Region      105    5.0      77   6.2      83   5.2      95   5.7     103   5.8     103   5.5      92   4.8      93   4.6
2. Region      273   12.9     206  16.7     225  14.1     255  15.4     271  15.2     274  14.6     262  13.7     257  12.8
3. Region    1,719   82.1     950  77.0   1,292  80.8   1,304  78.8   1,409  79.0   1,498  79.9   1,553  81.4   1,663  82.6
Total        2,097  100.0   1,233  99.9   1,600 100.1   1,654  99.9   1,783 100.0   1,875 100.0   1,907  99.9   2,013 100.0

Note: Some totals differ from 100% due to rounding.

Table 1. Distribution of journals by regions

[Figure 1. Number of full-text articles downloaded from ScienceDirect (2001-2007). Downloads by year: 2001: 810,203; 2002: 1,362,934; 2003: 3,346,381; 2004: 4,575,094; 2005: 5,264,423; 2006: 5,652,780; 2007: 5,843,049 (est.). Note: The number of downloads in the last quarter of 2007 was estimated according to the average rate of increase (70.67%) over the last four years (2003-2006).]

[Figure 2. Yearly distributions of journal titles by region: percentage of journal titles in regions 1-3 (satisfying one third, two thirds and all download requests) for each year 2001-2007 and for the aggregate period 2001-2007.]


Consortial Use of Electronic Journals in Turkish Universities

Core journal titles satisfying one third of all download requests exhibited further interesting use patterns. Not only were their numbers quite stable (around 100), but the same journal titles also appeared consistently, to some extent, in the core journal lists over seven years. To put it somewhat differently, a core journal heavily used in a given year tends to be heavily used in the following years as well. Ranks of individual journal titles based on the number of downloads did not fluctuate much on a yearly basis. This is despite the fact that new journal titles are constantly being added to the SD journal list, thereby increasing both the total number of SD journal titles available for download and the probability of further fluctuation. The total number of SD journal titles available in 2007 was considerably greater than that in 2001. The stability of the ranks of individual journals is therefore especially noteworthy. Nonetheless, it should be noted that the ranks of some journals might be affected by the increase in the total number of available SD journal titles over the years. Spearman rank order correlation coefficients (ρ) for core journal titles in two consecutive years ranged between 0.402 (2001/2002) and 0.874 (2006/2007) (Table 2). As the number of downloaded articles increased over the years, so did the correlation between the annual ranks of core journal titles.

Years                                            2001-2002  2002-2003  2003-2004  2004-2005  2005-2006  2006-2007
Spearman rank order correlation coefficient (ρ)      0.402      0.706      0.778      0.780      0.791      0.874

Note: The correlation coefficient for 2006-2007 does not reflect the use of journal titles within the last quarter of 2007.

Table 2. Correlation coefficients for the core journal titles that were common in two consecutive years
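The coefficients in Table 2 follow the standard Spearman formula for rankings without ties; a minimal sketch (the five ranks below are hypothetical, not the study's data):

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank-order correlation for two rankings of the same
    titles (no ties): rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Ranks of five core titles in two consecutive years:
year1 = [1, 2, 3, 4, 5]
year2 = [2, 1, 3, 5, 4]
print(spearman_rho(year1, year2))  # 1 - 6*4/120 = 0.8
```

A value near 1 (as in 2006/2007) means the ranking barely changed from one year to the next; a value near 0 means the ranking was reshuffled.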

A total of 29 journals appeared in the core journal lists of all seven years, satisfying roughly 3.3 million full-text download requests (13.1% of the total number of downloads). More than 200,000 articles were downloaded from the most frequently used journal (Food Chemistry), satisfying 0.8% of all requests. The average number of articles downloaded from those 29 top journals over seven years was 113,793 (16,256 per year). This is about 10 times the average for all journal titles [49]. Spearman rank order correlation coefficients (ρ) for the 29 core journal titles that were common to all seven years were even higher (minimum 0.472 in 2001, maximum 0.964 in 2005). In other words, Turkish academic users tend to use certain journal titles time and again to satisfy their information needs. The most frequently used top 29 journals, along with their rank orders based on the number of articles downloaded over all seven years and on an annual basis, are given in Table 3. It should be noted that the journals listed are common to each and every core journal list of all seven years (satisfying one third of all requests) as well as to that of the total use (2001/07). It was observed earlier that as the number of downloads increased, the ranks of core journals became more stable. This can be seen in the ranks of the top five journals for the years 2004 through 2007. None of these journals ranked lower than the 8th place (Journal of Food Engineering). As we go down the list, the ranks of top journals start fluctuating. For instance, the journal Brain Research was at the top of the core journal list in 2002 whereas it moved down to the 70th place in 2006. Core journal lists need to be studied more closely in order to pinpoint possible use patterns. Findings of a use study based on the download statistics of one Turkish university (Hacettepe) produced similar results with regard to the SD core journal titles [33].
The most frequently used 30 journal titles satisfied 20.4% of all use at Hacettepe University. The most frequently used 12 journal titles (within the first 30) at Hacettepe were also among the 29 journals used most heavily by all Turkish academics. Seven of those 12 titles were in medicine while the remaining five were in food chemistry, food engineering, chromatography, polymers and biomaterials. The ranks of journals differed as well. For instance, the journal The Lancet was the most frequently used title in Hacettepe’s core journal list (Hacettepe University has a medical school) while it ranked third in the consortial core journal list.

                                                    Rank order
Journal name                                        2001/07  2001  2002  2003  2004  2005  2006  2007
Food Chemistry                                            1     9     9    12     1     1     2     1
European Journal of Operational Research                  2     5     5     4     2     2     3     2
Lancet, The                                               3    29    11     1     3     3     5     6
Journal of Materials Processing Technology                4     6     2     2     4     6     7     3
Journal of Food Engineering                               5    26    24    21     6     8     4     5
Tetrahedron Letters                                       6    19    13    15    12     4    10    15
Journal of Chromatography A                               7    13    10    11    10    11     9    14
Analytica Chimica Acta                                   10     7    12     6    18    17    12    13
Water Research                                           11     3     4    10    15    16    22    30
Cement and Concrete Research                             12    27     6     5    19     5    30    52
Materials Science and Engineering A                      13    15    23    19    16    20    17     8
Tetrahedron                                              15    32    27    23    25     7    14    17
Polymer                                                  16    18    22    16    23    19    15    11
Biomaterials                                             17    49    19    20    14    15    20    26
Surface and Coatings Technology                          18    24    16    36    26    18    18    10
Bioresource Technology                                   20    36    28    39    28    21    16     7
Chemosphere                                              24    50    38    32    29    22    26    22
Energy Conversion and Management                         25    37    26    24    22    23    31    32
Aquaculture                                              26    10    29    38     9    24    52    63
International Journal of Production Economics            28    11    25    34    21    26    37    42
Thin Solid Films                                         29    34    59    70    47    28    25    12
Brain Research                                           30    66     1    22    50    48    70    73
Talanta                                                  32    14    33    53    34    33    35    25
International Journal of Food Microbiology               33    17    15    25    27    39    43    54
International Journal of Heat and Mass Transfer          35     4    41    31    40    42    63    51
European Journal of Pharmacology                         39    56     8    17    64    79    72    69
Renewable Energy                                         43    28    63    59    45    49    71    58
Journal of Membrane Science                              70    57    49    80    95    67    67    86
Enzyme and Microbial Technology                          80    58    56    78    96    92    80    83

Table 3. Top 29 journals common in the core journal lists of total use (2001/07) and individual years

The use of SD journals by the Turkish academic community seems to parallel the worldwide use of the same journals. By November 2006, more than one billion articles had been downloaded world-wide from SD [50]. Table 4 lists the “hottest” 10 SD journals based on download statistics, along with the percentages satisfying the world-wide demand. Weekly and fortnightly science journals such as The Lancet top the list. Table 4 also provides the equivalent percentages and ranks of those top journals on the basis of local download data. Four out of 10 “hottest” journals (The Lancet, Tetrahedron Letters, Journal of Chromatography A, and Journal of the American College of Cardiology) are also among the top 10 journals used most often by Turkish academics. The percentages of use of these four journals are also comparable. Some well-known journals such as Cell and the Journal of Molecular Biology, on the other hand, appear not to have been used heavily in Turkey [51].

Journals                                             World-wide %   Turkey (2001-2007) %   Rank
The Lancet                                                   1.56                   0.71      3
Tetrahedron Letters                                          1.55                   0.55      6
Cell                                                         0.99                   0.03    919
Biochemical and Biophysical Research Communications          0.97                   0.27     47
Tetrahedron                                                  0.93                   0.47     15
FEBS Letters                                                 0.87                   0.21     96
Journal of Chromatography A                                  0.67                   0.54      7
Journal of Molecular Biology                                 0.60                   0.09    309
Journal of the American College of Cardiology                0.58                   0.54      8
Brain Research                                               0.55                   0.36     30

Source: Data in the first two columns come from http://www.info.sciencedirect.com/news/archive/2006/news_billionth.asp.

Table 4. The most frequently used top 10 ScienceDirect journals

Despite the fact that some 100 core journal titles satisfied one third, some 200 titles half, and some 500 titles 80% of all download requests, the distribution of downloaded articles did not conform to Bradford’s Law of Scattering [52]. In separate studies, we found that the distribution of the five-year (2002-2006) download data of Hacettepe University users, representing over one million articles, and the distribution of both electronic document delivery and in-house journal use data of the National Academic Network and Information Center did not fit the Bradford Law, either [32,33]. It was observed in the literature [53,54] that homogenous bibliographies fit the Bradford Law better, whereas the article download data used in the present study come from over 2,000 journals representing all subject fields. It is also possible that distributions with long tails (e.g., very few articles being downloaded from a large number of journal titles, as was the case in our study) may not fit the Bradford Law very well. This is an issue that deserves to be explored further in its own right [55]. Notwithstanding this nonconformity, the stability of the relatively small number of journal titles satisfying the great majority of download requests can nonetheless be seen in Figure 3, which depicts the Bradford curves for the aggregated use of all SD journal titles by Turkish academic users. Figure 3 also shows that the number of SD journals used at least once increased over the years (2,013 in 2007 as opposed to 1,233 in 2001). Yet, it is interesting to note that 17 SD journal titles were not used even once by more than two million (potential) users in Turkish universities during the seven-year period. Some 102 journal titles were used, on average, just once per annum. The download data did not quite fit the 80/20 rule, either [56]. In our case, 29% of all journals (or 602 titles) satisfied 80% of more than 25 million download requests.
For individual years, the percentage of journals satisfying 80% of all requests ranged between 35% (2001) and 28% (2007), the average being 31.6%. Nor did the distribution of download data fit the Price Law, according to which a number of journals equal to the square root of all journal titles should satisfy half the download demand ([52], p. 362). In our study, half the downloads came from 208 journal titles instead of the 46 the Price Law suggests. Again, the nonconformity can perhaps be explained by the wide variety of uses of the collection for different purposes by different researchers. For instance, universities with medical schools may download medical articles more often, whereas science and engineering schools may do the same for articles in their respective fields. Considering that there are more than 100 Turkish universities with different subject concentrations, it is likely that the demand for articles was dispersed more evenly than predicted by the 80/20 rule. The four-year (2000-2003) download data of the Consortium of University Libraries of Catalonia (CBUC) did not fit the 80/20 rule, either: an average of 35% of the journal titles of four different publishers satisfied 80% of the demand [35]. It was suggested that the dispersal of use of journals fits the 80/20 rule better as the number of articles available for download in a collection increases. This does not seem to be the case, however. The SD electronic journals package used in this study has over 2,000 journal titles with more than 8 million articles available for download whereas, for instance, the Emerald Management Xtra (EMX) collection comprises about 190 electronic journal titles with 75,000 articles available for download. While 29% of the SD journal titles satisfied 80% of the download requests in our study, almost half the EMX journal titles satisfied 80% of the world-wide demand in 2004, representing more than 6 million article downloads [57].
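The Price Law test described above reduces to comparing two numbers: the square root of the number of titles versus the observed number of top titles covering half the demand. A sketch (the sample counts at the end are hypothetical):

```python
import math

def titles_for_half_demand(downloads):
    """Number of top-ranked journal titles needed to satisfy half
    of all download requests."""
    counts = sorted(downloads, reverse=True)
    target = sum(counts) / 2
    cumulative, needed = 0, 0
    for count in counts:
        cumulative += count
        needed += 1
        if cumulative >= target:
            break
    return needed

# Price's law predicts ~sqrt(N) titles; with N = 2,097 that is ~46,
# whereas the study observed 208 titles covering half the demand.
print(round(math.sqrt(2097)))                 # 46
print(titles_for_half_demand([10, 5, 3, 2]))  # 1
```

The larger the gap between the observed count and sqrt(N), the flatter (more evenly dispersed) the usage distribution is relative to Price's prediction.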

2001

2002

2003

2004

2005

2006

2007

100

Cumulative percentage of use

90 80 70 60 50 40 30 20 10 0 0

150

300

450

600

750

900

1050 1200 1350 1500 1650 1800 1950 2100 2250

Cumulative number of journal titles

Figure 3. Bradford curves for the use of journal titles in SD (2001-2007 N = 2097, 2001 N =1233, 2002 N =1600, 2003 N = 1654, 2004 N = 1783, 2005 N = 1875, 2006 N = 1907, 2007 N = 2013). We checked if there is any correlation between the journal impact factors (IFs) and the download statistics. IF values of 105 core journals along with the total number of citations to articles that were published therein were obtained from ISI’s Journal Citation Reports 2006. The number of downloads ranged between 206,537 (Food Chemistry) and 50,020 (European Polymer Journal) for core journals (average being 80,228 with SD=33,329). Journals’ IF values ranged between 25.8 (The Lancet) and 0.615 (Journal of Materials Processing Technology) (average being 2.340 with SD=2.624). There appears to be a low correlation between IFs of core journals and the number of downloads therefrom (Pearson’s r = 0.368). The correlation coefficient was even lower (0.291) for 29 journals that were common in all core journal lists between 2001 and 2007. This finding is in parallel with that obtained by other studies that we recently carried out [32,42]. A low correlation also exists between the ranks of core journal titles based on the number of downloads and that of the total number of citations (Spearman’s r = 0.253, N = 104). It appears Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008


Consortial Use of Electronic Journals in Turkish Universities

that journals with high IFs tend to be used slightly more often by the Turkish academic community. It can be argued that the journal IFs (and the total number of citations) are calculated on the basis of world-wide use whereas the number of downloads used in this study is based on local use. The concentration of research in Turkey may well differ from that in other countries (e.g., USA) and skew the downloads away from IF and total number of citations. Yet, there are several studies that show that the use based on citations (IFs) and that on downloads are either slightly correlated or not at all (See [32], p. 215; [36]; [40], p. 319; [41,42]). As we have indicated earlier, Bollen and Van de Sompel [45] conducted a more careful study comparing the use based on citations (i.e., Journal IFs) and downloads (i.e., Usage Impact Factors) obtained from California State University (CSU). They reported a moderate negative correlation between the two, noting that “CSU usage data indicates significant, community-based deviations between local usage impact and global citation impact” and that “usage-based impact assessments are influenced by the demographic and scholarly characteristics of particular communities” (p. 146). It is also possible that use based on citations and that on downloads measure two different dimensions of usage [36]. The motives of users downloading articles may be quite different than those who cite articles and they may not overlap. 6.

Conclusion

The preliminary findings of our analysis, based on the download statistics of all Turkish universities from Elsevier’s SD database, show that some 100 core journals satisfied one third of the total of 25 million full-text download requests. Lists of core journal titles seem to be quite persistent, for they do not change much on an annual basis. A large number of journal titles were rarely used, while some were never used at all. Coupled with pricing data, findings based on seven years’ worth of national usage statistics can be used by individual university libraries as well as by the consortium management to develop collection management plans and devise negotiation strategies that can be exercised with publishers. Based on national usage statistics, “an orderly retreat” from rarely used journal titles that are usually offered as part of the “big deals” can be negotiated with publishers on behalf of all consortium members [20].

7. Acknowledgments

This study was supported in part by a research grant from the Turkish Scientific and Technological Research Council (SOBAG-106K068). We thank Mr. Hatim El Faiz of Elsevier for providing the download data used in this study, and Mr. Umut Al of Hacettepe University for providing feedback on an earlier draft of this paper.

8. Notes and References

[1] Statistics come from the web site of the Student Selection and Admission Center: ÖĞRENCİ SEÇME VE YERLEŞTİRME MERKEZİ. (n.d.). 2006-2007 öğretim yılı yükseköğretim istatistikleri (Higher Education Statistics of the 2006-2007 Academic Year). Ankara: ÖSYM. Retrieved 26 March 2008, from http://www.osym.gov.tr/dosyagoster.aspx?DIL=1&BELGEANAH=19176&DOSYAISIM=1_Ogrenci_Say.pdf.
[2] The statistics on the distribution of students by subject disciplines come from p. 46, Table 4.2 of TÜRK YÜKSEKÖĞRETİMİNİN BUGÜNKÜ DURUMU (The State of Turkish Higher Education). (November 2005). Ankara: Higher Education Council. Retrieved 26 March 2008, from http://www.yok.gov.tr/egitim/raporlar/kasim_2005/kasim_2005.doc.
[3] Statistics come from the web site of the Student Selection and Admission Center: ÖĞRENCİ SEÇME VE YERLEŞTİRME MERKEZİ. (n.d.). 2006-2007 öğretim yılı yükseköğretim istatistikleri (Higher Education Statistics of the 2006-2007 Academic Year). Ankara: ÖSYM. Retrieved 26 March 2008, from http://www.osym.gov.tr/dosyagoster.aspx?DIL=1&BELGEANAH=19176&DOSYAISIM=2_Ogretim_El_Say.pdf.
[4] TONTA, Y. (2001). Collection development of electronic information resources in Turkish university libraries. Library Collections, Acquisitions and Technical Services, 25(3): 291-298.
[5] LINDLEY, J.A., & ERDOĞAN, P.L. (2002). TRNSL: A model site license for ANKOS. Paper presented at the Symposium on Research in the Light of Electronic Developments, October 24-25, 2002, Bolu, Turkey. Retrieved March 26, 2008, from http://www.library.yale.edu/~llicense/TRNSLpaper.doc
[6] LINDLEY, J.A. (2003). Turkish National Site License (TRNSL). Serials, 16(2): 187-190.
[7] ERDOĞAN, P.L., & KARASÖZEN, B. (2006). ANKOS and its dealings with vendors. The Journal of Academic Librarianship, 44(3-4): 69-83, p. 69.
[8] KARASÖZEN, B., & LINDLEY, J.A. (2004). The impact of ANKOS: Consortium development in Turkey. The Journal of Academic Librarianship, 30: 402-409.
[9] See the ANKOS web site for more information (http://www.ankos.gen.tr).
[10] TONTA, Y. (2007). Elektronik dergiler ve veri tabanlarında ulusal lisans sorunu (The national license issue in electronic journals and databases). Conference paper presented at Akademik Bilişim ’07, 31 January – 2 February 2007, Kütahya, Turkey. (Online). Retrieved May 12, 2008, from http://yunus.hacettepe.edu.tr/~tonta/yayinlar/tonta-ab07-bildirisi.pdf.
[11] For more information on EKUAL, see http://www.ulakbim.gov.tr/cabim/ekual/hakkinda.uhtml.
[12] In fact, ULAKBİM paid the license fee for 2006 (the last year of a three-year license agreement signed by Elsevier and ANKOS) on behalf of ANKOS members.
[13] For more information on databases offered through ULAKBİM’s EKUAL, see http://www.ulakbim.gov.tr/cabim/ekual/veritabani.uhtml.
[14] ÜNAK-OCLC KONSORSİYUMU (The ÜNAK-OCLC Consortium). (2008). Retrieved 26 March 2008, from http://www.unak.org.tr/unakoclc/
[15] See http://e-gazete.anadolu.edu.tr/ayrinti.php?no=6501.
[16] FRAZIER, K. (2001). The librarians’ dilemma: Contemplating the costs of the “big deal”. D-Lib Magazine, 7(3). (Online). Retrieved May 12, 2008, from http://www.dlib.org/dlib/march01/frazier/03frazier.html
[17] BALL, D. (2004). What’s the “big deal”, and why is it a bad deal for universities? Interlending & Document Supply, 32(2), 117-125.
[18] JOHNSON, R.K. (2004). Open access: Unlocking the value of scientific research. Journal of Library Administration, 42(2), 107-124.
[19] DURANCEAU, E.F. (2004). Cornell and the future of the big deal: An interview with Ross Atkinson. Serials Review, 30(2), 127-130, p. 127. See also Johnson (2004, p. 109) in Ref. 18.
[20] GATTEN, J.N., & SANVILLE, T. (2004). An orderly retreat from the big deal: Is it possible for consortia? D-Lib Magazine, 10(10). (Online). Retrieved May 12, 2008, from http://www.dlib.org/dlib/october04/gatten/10gatten.html.
[21] HAAR, J. (2000). Project PEAK: Vanderbilt’s experience with articles on demand. Serials Librarian, 38(1/2), 91-99.
[22] HUNTER, K. (2000). PEAK and Elsevier Science. PEAK Conference, Ann Arbor, 23 March 2000. (Online). Retrieved May 12, 2008, from http://www.si.umich.edu/PEAK-2000/hunter.pdf
[23] DAVIS, P.M. (2002). Patterns in electronic journal usage: Challenging the composition of geographic consortia. College & Research Libraries, 63, 484-497.
[24] GALBRAITH, B. (2002). Journal retention decisions incorporating use-statistics as a measure of value. Collection Management, 27(1), 79-90.
[25] BATI, H. (2006). Elektronik bilgi kaynaklarında maliyet-yarar analizi: Orta Doğu Teknik Üniversitesi Kütüphanesi üzerinde bir değerlendirme (Cost-benefit analysis in electronic information resources: An evaluation of the Middle East Technical University Library). Unpublished M.A. dissertation. Hacettepe University, Ankara.
[26] CHRZASTOWSKI, T.E. (2003). Making the transition from print to electronic serial collections: A new model for academic chemistry libraries? Journal of the American Society for Information Science and Technology, 54, 1141-1148.
[27] WILEY, L., & CHRZASTOWSKI, T.E. (2002). The Illinois Interlibrary Loan Assessment Project II: Revisiting statewide article sharing and assessing the impact of electronic full-text journals. Library Collections, Acquisitions, & Technical Services, 26(1), 19-33.
[28] HAMAKER, C. (2003). Quantity, quality and the role of consortia. What’s the Big Deal? Journal purchasing – bulk buying or cherry picking? Strategic issues for librarians, publishers, agents and intermediaries. Association of Subscription Agents and Intermediaries (ASA) Conference (24-25 February 2003). (Online). Retrieved 14 January 2007, from http://www.subscriptionagents.org/conference/200302/chuck.hamaker.pps.
[29] KE, H-R., KWAKKELAAR, R., TAI, Y-M., & CHEN, L-C. (2002). Exploring behavior of E-journal users in science and technology: Transaction log analysis of Elsevier’s ScienceDirect OnSite in Taiwan. Library & Information Science Research, 24, 265-291.
[30] RUSCH-FEJA, D., & SIEBKY, U. (1999). Evaluation of usage and acceptance of electronic journals: Results of an electronic survey of Max Planck Society researchers including usage statistics from Elsevier, Springer and Academic Press (Full report). D-Lib Magazine, 5(10). (Online). Retrieved May 12, 2008, from http://www.dlib.org/dlib/october99/rusch-feja/10rusch-fejafullreport.html.
[31] VAUGHAN, K.T.L. (2003). Changing use patterns of print journals in the digital age: Impacts of electronic equivalents on print chemistry journal use. Journal of the American Society for Information Science and Technology, 54, 1149-1152.
[32] See also TONTA, Y., & ÜNAL, Y. (2007). Dergi kullanım verilerinin bibliyometrik analizi ve koleksiyon yönetiminde kullanımı (Bibliometric analysis of journal use data and its use in collection management). In Serap Kurbanoğlu, Yaşar Tonta & Umut Al (eds.). Değişen Dünyada Bilgi Yönetimi Sempozyumu, 24-26 Ekim 2007, Ankara: Bildiriler (pp. 193-200). Ankara: Hacettepe Üniversitesi Bilgi ve Belge Yönetimi Bölümü.
[33] AL, U., & TONTA, Y. (2007). Tam metin makale kullanım verilerinin bibliyometrik analizi (Bibliometric analysis of full-text article use). In Serap Kurbanoğlu, Yaşar Tonta & Umut Al (eds.). Değişen Dünyada Bilgi Yönetimi Sempozyumu, 24-26 Ekim 2007, Ankara: Bildiriler (pp. 209-217). Ankara: Hacettepe Üniversitesi Bilgi ve Belge Yönetimi Bölümü.
[34] EVANS, P., & PETERS, J. (2005). Analysis of the dispersal of use for journals in Emerald Management Xtra (EMX). Interlending & Document Supply, 33(3): 155-157.
[35] URBANO, C., ANGLADA, L.M., BORREGO, A., CANTOS, C., COSCULLUELA, C., & COMELLAS, N. (2004). The use of consortially purchased electronic journals by the CBUC (2000-2003). D-Lib Magazine, 10(6). (Online). Retrieved May 10, 2008, from http://www.dlib.org/dlib/june04/anglada/06anglada.html.
[36] COOPER, M.D., & MCGREGOR, G.F. (1994). Using article photocopy data in bibliographic models for journal collection management. Library Quarterly, 64, 386-413.
[37] MCDONALD, J.D. (2007). Understanding journal usage: A statistical analysis of citation and use. Journal of the American Society for Information Science and Technology, 58, 39-50.
[38] TSAY, M-Y. (1998a). Library journal use and citation half-life in medical science. Journal of the American Society for Information Science, 49, 1283-1292.
[39] TSAY, M-Y. (1998b). The relationship between journal use in a medical library and citation use. Bulletin of the Medical Library Association, 86, 31-39.
[40] WULFF, J.L., & NIXON, N.D. (2004). Quality markers and use of electronic journals in an academic health sciences library. Journal of the Medical Library Association, 92, 315-322.


Yasar Tonta; Yurdagül Ünal



A Rapidly Growing Electronic Publishing Trend: Audiobooks for Leisure and Education
Jan J. Engelen
Kath. Univ. Leuven – DocArch group – ESAT-Dept. of Electrical Engineering
Kasteelpark Arenberg 10, B-3001 Heverlee – Leuven (Belgium)
jan.engelen@esat.kuleuven.be

Abstract
This contribution focuses on the relatively new phenomenon of the purely commercial availability of audiobooks, sometimes also called “spoken books”, “talking books” or “narrated books”. Having the text of a book read aloud and recorded has long been the favourite solution for making books and other texts accessible to persons with a serious reading impairment such as blindness or low vision, and specialised production centres for such talking books exist in most countries of the world. Now, however, a growing number of commercial groups have discovered a booming market for these products, as people become used to listening to books for leisure instead of reading them. Some companies already claim to have over 40,000 titles in spoken format in their catalogue. Major differences and possible synergies between the two worlds are discussed.
Keywords: audiobooks; talking books; spoken information; commercialization

1. Introduction

Electronic equivalents of printed books (e-books) have been around for a long time now, and multimedia documents have become more and more popular. The spoken variants of books, especially, are continuously gaining popularity. Up to a few years ago, producing talking books was seen solely as a service to support reading-impaired persons, but nowadays commercial interest is growing at a high pace. We start by comparing the traditional specialised production processes with their equivalents in the commercial circuit, which inevitably involves some technical aspects. Digital Rights Management and copyright challenges are covered too. Finally, we discuss a few implications of this phenomenon for the organisation of libraries and related cataloguing issues.

2. Specialised audiobook production centres

Most audiobook production centres in Western countries that focus mainly on consumers with a reading impairment have now abandoned cassette distribution in favour of CD-based solutions. Cassettes had been around since the beginning of the sixties, but recording, erasing and checking returned cassettes remained very time-consuming activities for these production centres. Furthermore, most books had to be put on a series of cassettes (due to their limited storage capacity), and clear indications on the cassettes and their boxes, preferably in Braille, were needed to keep some order in such a collection. But even at that time several measures for protecting copyright (nowadays called Digital Rights Management, DRM) were taken: special cassette formats or non-standard tape speeds were used to provide some copy protection. In the middle of the nineties, internet technology and especially web documents with hyperlinks to other



documents became widespread. Within the European Digibook project, several hybrid books were developed, containing both the text and the linked audio files of the same book; the linking was done at sentence level [1]. Similar initiatives were developed at the Swedish production centre TPB. In 1996 a large worldwide group of specialised production centres created the Daisy consortium [2] to study and standardise the future audio recording of talking books and, very importantly, how a navigation structure could be added to the books in question. This led to the Daisy 2.02 and 3.0 standards, which have been turned into US standards by NISO and are accepted worldwide by all specialised production centres to permit the exchange of this new generation of audiobooks.
Most centres nowadays distribute their productions on a data CD [3] or, to a minor extent, via the internet [4]. Data CDs permit a trade-off between quality and recording time that is not possible for audio CDs (e.g. with music): as the human voice can be recorded at a much lower sampling frequency than high-quality music, data CDs can easily contain 50 to 70 hours of speech. Technically, the Daisy format describes the content of the book (in XHTML or XML type files) while the audio is recorded as a collection of mp3 files (.wav is rarely used). The Daisy CD can also contain the text of the document and a whole series of timing links (in SMIL format) between the two. That way a computer or reading device can search the text content while the user still listens to the corresponding audio output. Furthermore, a Daisy book permits easy and rapid navigation through complex documents, as up to six levels of table of contents are possible. Daisy books are read with computer programmes (AMIS, Easereader, TPB reader…) or special players. These players are essentially CD-ROM readers with Daisy reading software; some even look like a CD walkman.
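The text-to-audio linking described above can be sketched in a few lines: the snippet below generates a minimal Daisy-2.02-style SMIL fragment in which each sentence of an XHTML file is paired with a clip of an mp3 narration file. This is a simplified illustration, not a complete Daisy file; the file names, fragment ids and clip times are invented, and real Daisy SMIL files carry additional metadata and structure.

```python
# Sketch: a minimal Daisy-2.02-style SMIL fragment linking text fragments
# of an XHTML file to timed clips of an mp3 narration (simplified).
import xml.etree.ElementTree as ET

def smil_fragment(text_file, audio_file, sentences):
    """sentences: list of (fragment_id, clip_begin_s, clip_end_s) tuples."""
    seq = ET.Element("seq")
    for frag_id, begin, end in sentences:
        par = ET.SubElement(seq, "par")  # text and audio play in parallel
        ET.SubElement(par, "text", attrib={"src": f"{text_file}#{frag_id}"})
        ET.SubElement(par, "audio", attrib={
            "src": audio_file,
            "clip-begin": f"npt={begin:.3f}s",
            "clip-end": f"npt={end:.3f}s",
        })
    return ET.tostring(seq, encoding="unicode")

# Hypothetical chapter with two sentences synchronised to one mp3 file.
fragment = smil_fragment("chapter1.html", "chapter1.mp3",
                         [("sent001", 0.0, 3.5), ("sent002", 3.5, 7.2)])
print(fragment)
```

A reading device following such links can jump to any sentence in the text and play exactly the corresponding stretch of audio, which is what makes the multi-level navigation of Daisy books possible.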
Since last year, mini Daisy players have reached the market. These smaller devices, with PDA or mobile-phone dimensions, use SD memory cards instead of CDs. The next generation of these devices will connect automatically (through WiFi) to the internet and automatically download books (or newspapers, cf. below). Currently only a few UTP-cable-connected devices (Webbox, Adela) exist, but WiFi versions are under development.

3. Commercial audiobook production

3.1 Booming commercial audiobook popularity

Over the last few years we have witnessed an enormous increase in audiobook popularity outside the “traditional” user group of persons with a visual impairment. Several commercial groups, some linked to traditional publishers and some completely new, have appeared. Audible.com [5] is the leading online provider of digital spoken-word audio content in English, specialising in digital audio editions of books, newspapers and magazines, television and radio programmes, and original programming. Through its websites in the US and UK and alliances in Germany and France, Audible.com offers over 40,000 programmes, including audiobooks from well-known authors such as Stephen King, Thomas Friedman and Jane Austen, and spoken-word audio content from newspapers including The New York Times and The New Yorker. However, these newspapers are only made available in excerpted form.


"It really is that easy. You don't need to install any special software. You don't have to join a club and pay a monthly subscription. You don't even have to break the bank as there are lots of titles for just a few dollars. Just get whatever you want, whenever you want, and sit back and enjoy."

Figure 1: Two different commercial approaches: Audible.com (top) and LeisureAudiobooks (bottom)

Meanwhile in Belgium and the Netherlands (two small countries with 15 million Dutch-speaking inhabitants), about a dozen specialised publishers have appeared in a short time. Curiously enough, customers will seldom buy audiobooks in bookshops: they seem to be used to downloading music and therefore expect audiobooks to be downloadable as well. On the other hand, many public libraries have reacted to an enormous interest in audiobooks by adding them to their collections. There is also a growing interest in spoken versions of educational and course material [6].

3.2 Technical formats and standards for audiobooks; copy protection

A very important issue is the type of audiobook standard used. As stated above, within the sector of audiobook production for reading-impaired persons the Daisy standard is very common (and in fact globally accepted). Commercial publishers, on the other hand, do NOT use the Daisy standard but rely on several alternatives for distributing their audiobooks:
• Some companies provide documents on standard [7] audio CDs (e.g. Dan Brown’s “Da Vinci Code” spans 13 audio CDs). The main reason for this choice is the universal usability on any audio CD player developed since 1980.

• Others use data CDs with audio files in mp3 format, for the reasons explained above. Up to 40 hours of narration on one CD is not uncommon.

• A few, including the largest one (Audible.com, cf. above), provide their audiobooks in a DRM-protected format. Some of their more expensive books, however, can be burned onto (a pile of) audio CDs by a legal buyer; a special version of the NERO CD writer is needed to do this.

• Audible.com has developed the proprietary “.aa” format and provides free software for playing (legally acquired) .aa files on 290 platforms. This format also caters for different quality levels.

• Apple iTunes mainly used the proprietary MP4 format (a container format including the media and DRM info), which for some time made it impossible to use the files on non-iPod players.

Since the beginning of 2007, more and more music on the internet has become available without DRM, although generally at a somewhat higher price. Many now see DRM as a thing of the past (cf. below).


But the most striking difference between all these solutions and the Daisy format is the lack of any sensible system for navigating through the audio files. The available solution, the Daisy standard, is simply not used in the commercial audiobook world!
A very important aspect of audiobook (and music) distribution on CDs (or via the internet) is copyright protection, often reduced to copy protection. Digital rights management was once seen by the music industry as the method to prevent illegal copying. In practice, however, DRM led to quite a lot of customer frustration, as it hindered copying in general or sometimes made it impossible to play legally acquired files on a whole series of devices. In practice, all widely used DRM systems have been defeated or circumvented once deployed to enough customers. Protection of audio and visual material is especially difficult due to the existence of the “analogue hole” [8], and there are even suggestions that effective DRM is logically impossible for this reason. A much more complex situation for illegal copiers arises when books become interactive and the sequential nature of the narration is abandoned.

3.3 Business models

A special audiobook issue is the business model used by publishers: some companies, again including the largest one, prefer a subscription model with monthly instalments worth approximately one audiobook. Audible.com’s business model closely mimics the well-established marketing system of “book clubs”: one gets the possibility to download a number of books by paying a monthly membership fee. Buying individual books is possible too, but at much higher prices. Its main competitor, LeisureAudiobooks, on the contrary stresses that no subscription is required (cf. Figure 1). Others charge different prices for the different audio qualities available: at audiobooksforfree.com, for instance, the lowest quality is free but users are charged for better-quality files. In fact, the company stores high-quality audiobooks and degrades them for those who want to pay less.
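The tiered-quality model can be sketched as follows: one high-quality master is kept, and lower bitrates are served at lower prices. The tier bitrates, prices and the 10-hour example below are invented for illustration; they are not audiobooksforfree.com’s actual figures.

```python
# Hypothetical quality-tiered pricing: the store keeps one high-quality
# master and re-encodes it at lower bitrates for cheaper tiers.
TIERS = {          # bitrate (kbps) -> price (USD); invented example values
    24: 0.00,      # lowest quality: free
    64: 3.95,
    128: 6.95,
}

def download_size_mb(duration_hours, bitrate_kbps):
    """Approximate file size of a constant-bitrate mp3 download."""
    seconds = duration_hours * 3600
    return bitrate_kbps * seconds / 8 / 1024  # kbit/s -> MB

# A 10-hour audiobook at each tier: lower bitrate, smaller file, lower price.
for bitrate, price in sorted(TIERS.items()):
    size = download_size_mb(10, bitrate)
    print(f"{bitrate:>3} kbps: ${price:.2f}, ~{size:.0f} MB")
```

The same arithmetic explains the data-CD capacity mentioned earlier: at speech-grade bitrates of a few tens of kbit/s, 50 to 70 hours of narration fit comfortably on one 700 MB CD.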

Figure 2: Example of different pricing for different qualities (example from audiobooksforfree.com)

3.4 Audio: human voice vs synthetic voice

A major distinction between audiobooks must be made according to the type of audio: is the narration done by a human or by a computer (synthetic voice)? Everyone agrees that, even nowadays, a human voice is much more agreeable to listen to than a synthetic voice, although very good text-to-speech (TTS) software is available. However, for some applications only electronic conversion is an option. E.g. during the production of the



spoken Flemish daily newspaper, Audiokrant, with full-text coverage of all articles, only some 30 minutes are available after copy closure time to produce 12 to 15 hours of speech and to physically record the subscribers’ CDs [9].

3.5 Growing synergy between commercial and not-for-profit audiobook publishers

Up to now, the worlds of commercial and not-for-profit publishers have been very segregated. Commercial publishers often state that their products also benefit reading-impaired persons, but they show no interest in using the Daisy standard. On the other hand, specialised production centres are clearly exploring the commercial possibilities of the large archives of spoken books most of them have created over the past years. Sometimes specialised and commercial productions go hand in hand: the Royal National Institute of the Blind (UK) recording of Terry Darlington’s ‘Narrow Dog To Carcassonne’ won the APA ‘Audies’ award for 2007 in the category of best unabridged non-fiction. The book was produced by RNIB both as a DAISY digital talking book for RNIB clients and as a commercial audiobook on CD (ISIS Publishing). In the Netherlands, the largest specialised audiobook production centre, Dedicon, created a commercial branch named Lecticus [10] at the end of 2006. Mainly linearly organised books are provided as a series of mp3 or wma files. Books can be downloaded but can also be delivered on a cheap mp3 player (USB-stick size).

4. Cataloguing Issues

The problem of how to find an audiobook in a library is clearly complicated by the fact that the number of production centres is increasing rapidly. Furthermore, a comprehensive cataloguing process for audiobooks requires a whole new series of descriptive items, including but certainly not limited to:
• flags for abridged [11] and unabridged versions;
• a field for total reading time;
• fields for technical recording specifications (e.g. audio quality/sampling frequency, file types, use of Daisy standard 2.02 or 3.0, etc.);
• a field to distinguish between recorded and synthesized speech;
• flags for pronunciation details (UK vs American English; Austrian or Swiss German vs Standard German; Dutch vs Flemish intonation, etc.). No standard for covering these subtle language differences is available;
• fields describing the audio-to-text linking mechanisms used in the audiobook (if the text is made available too): synchronisation between text and audio at word, paragraph or page level;
• fields for the narrator’s details (experienced narrators, or books read by their author, constitute selling arguments for commercially produced books!).
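A catalogue record carrying these audiobook-specific fields could be sketched as below. The field names and example values are invented for illustration; this is not a MARC or IFLA scheme.

```python
# Sketch of an audiobook catalogue record with the descriptive items
# discussed above; field names are invented, not a cataloguing standard.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudiobookRecord:
    title: str
    narrator: str                          # narrator details are a selling argument
    abridged: bool                         # flag: abridged vs unabridged
    reading_time_minutes: int              # total reading time
    speech_source: str                     # "recorded" or "synthesized"
    file_format: str                       # e.g. "mp3", "Daisy 2.02", "Daisy 3.0"
    sampling_rate_hz: Optional[int] = None # technical recording specification
    pronunciation: Optional[str] = None    # e.g. "en-GB", "de-AT", "nl-BE"
    text_sync_level: Optional[str] = None  # "word", "paragraph" or "page"

# Hypothetical entry; the reading time and format here are made up.
record = AudiobookRecord(
    title="Narrow Dog To Carcassonne",
    narrator="(professional narrator)",
    abridged=False,
    reading_time_minutes=600,
    speech_source="recorded",
    file_format="Daisy 2.02",
)
print(record.file_format)
```

Optional fields default to None, reflecting that commercial productions often omit exactly the technical and synchronisation details that Daisy productions record.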

Some of these requirements resemble the cataloguing needs for large-print books. These topics fall under the remit of a special section within the International Federation of Library Associations (IFLA) that caters for the needs of reading-impaired users [12].

5. Conclusions


Due to the explosive growth of commercial audiobooks, a huge number of titles theoretically becomes available to reading-impaired users too. However, this process requires new business models for the traditional specialised centres, and probably also a completely new societal vision on who is willing to pay for which type of audiobook service in the future.

6. Notes and References

[1] Andras Arato, Laszlo Buday, Teresa Vaspori, “Hybrid Books for Blind - a New Form of Talking Books”, lecture at ICCHP’96 (Linz, July 1996), in Proceedings of the 5th International Conference “Interdisciplinary Aspects on Computers Helping People with Special Needs”, Schriftenreihe der Oesterreichischen Computer Ges., Band 87 (part 1), ISBN 3-85403-087-8, Linz, July 1996.
[2] Daisy consortium: http://www.daisy.org
[3] The medium to use is not part of the Daisy standard. The CD-ROM format is described in the “Yellow Book”: http://en.wikipedia.org/wiki/Yellow_Book_%28CD-ROM_standards%29
[4] Downloading is not yet a very common procedure due to most internet providers’ download volume restrictions, although this is changing rapidly to permit more multimedia downloads.
[5] Audible.com: http://www.audible.com. After having been a minority shareholder for some years, Amazon.com fully acquired Audible.com on January 31, 2008.
[6] Post, Hans-Maarten, “Luisterboeken winnen terrein” (“Audiobooks gain ground”), p. 32 in “De Standaard”, 5 May 2008 (Corelio newspaper publishers, Belgium).
[7] The Audio CD format (“Red Book”) was developed in 1980 by Philips & Sony for high-quality music recordings and specifies a maximum playing time of 78 minutes. http://en.wikipedia.org/wiki/Red_Book_%28audio_CD_standard%29
[8] The “analogue hole” means simply that any audio or video signal has to be transformed into an analogue signal to be interpretable by human beings; analogue signals can be re-digitised afterwards. Internet music stores have more or less given up on DRM protection: e.g. it was found that a new iTunes music track (with DRM) made available by Apple needed less than 3 minutes to become available elsewhere on the web in an unprotected audio format.
[9] Paepen, Bert, “AudioKrant, the daily spoken newspaper”, Proceedings of the 12th Electronic Publishing Conference (ELPUB, Toronto, June 2008). Available from: http://elpub.scix.net (Open Access).
[10] Lecticus audiobook shop: http://www.lecticus.nl
[11] And what does “abridged” precisely mean?
[12] Brazier, Helen, “The Role and Activities of the IFLA Libraries for the Blind Section”, Library Trends, Volume 55, Number 4, Spring 2007, pp. 864-878.


The SCOAP3 project: converting the literature of an entire discipline to Open Access

Salvatore Mele
CERN – European Organization for Nuclear Research, CH-1211 Geneva 23, Switzerland
e-mail: Salvatore.Mele@cern.ch

Abstract
The High-Energy Physics (HEP) community spearheaded Open Access with over half a century of dissemination of preprints, culminating in the arXiv system. It is now proposing an Open Access publishing model which goes beyond present, sometimes controversial, proposals with a novel practical approach: the Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3). In this model, libraries and research institutions federate to explicitly cover the costs of peer review and other editorial services, rather than implicitly supporting them via journal subscriptions. Rather than through subscriptions, journals will recover their costs from SCOAP3 and make the electronic versions of their journals free to read. Unlike many “author-pays” Open Access models, authors are not directly charged to publish their articles in the Open Access paradigm. Contributions to the SCOAP3 consortium are determined on a country-by-country basis, according to the volume of HEP publications originating from each country, and would come from nation-wide redirections of current subscriptions to HEP journals. SCOAP3 will negotiate with publishers in the field the price of their peer-review services through a tendering process, and journals converted to Open Access will then be decoupled from package licenses. The global yearly budget envelope for this transition is estimated at about 10 million Euros. This unique experiment of “flipping” all the journals covering the literature of a given subject from Toll Access to Open Access is rapidly gaining momentum, and about a third of the required budget envelope has already been pledged by leading libraries, library consortia and High-Energy Physics funding agencies worldwide. This conference paper describes the HEP publication landscape and the bibliometric studies at the basis of the SCOAP3 model.
Details of the model are provided and the status of the initiative is presented, debriefing the lessons learned in this attempt to achieve a large-scale conversion of an entire field to Open Access.
Keywords: SCOAP3; Open Access Publishing; High-Energy Physics.

1. Introduction

Recently, the Open Access debate has become mainstream, spreading to all areas and actors of scholarly communication and affecting its entire spectrum, from policy making to financial aspects [1]. Open Access models are actively being proposed by scholars, libraries and publishers alike, and Open Access definitions, of varying shades and colours, are actively debated. This change falls under the umbrella of the groundbreaking technological changes that are inspiring the transformation of science into e-Science in the 21st century. This contribution will not enter into these wide-ranging issues: its objective is to present a specific Open Access model tailored to the needs of a specific community, High-Energy Physics (HEP), as embodied by the SCOAP3 initiative (Sponsoring Consortium for Open Access Publishing in Particle Physics). Although this is a discipline-specific approach to the wider issue of Open Access, it is a particularly interesting one: HEP has a long tradition of innovations in scholarly communication and Open Access, which have then



spread to other fields, and the lessons learned from the momentum gathered by the SCOAP3 initiative can inform the evolution of Open Access publishing in other fields. A few words are in order to give the scale of the endeavours of HEP and its strong collaborative texture, which inspires its position in the Open Access debate. The scientific goals of HEP are to attain a fundamental description of the laws of physics, to explain the origin of mass and to understand the dark matter in the universe. Any of these insights would dramatically change our view of the world. To reach these scientific goals, experimental particle physicists team up in thousand-strong collaborations to build the largest instruments ever, to reproduce on Earth the energy densities of the universe at its birth. At the same time, theoretical particle physicists collaborate to formulate hypotheses and theories, based on complex calculations, to accommodate and predict experimental findings. These goals are at the edge of current technology and drive developments in many areas, from engineering to electronics, from information technology to accelerator technology. The crown jewel of HEP research is CERN’s Large Hadron Collider (LHC), which will start accelerating particles in 2008, after more than a decade of construction. This 27 km-long accelerator will collide protons 40 million times a second. These collisions will be observed by large detectors, up to the size of a five-storey building, crammed with electronic sensors: think of a 100-megapixel digital camera taking 40 million pictures a second.
This contribution is structured as follows: Section 2 traces a short history of scholarly communication and Open Access in HEP; Section 3 presents the HEP publication landscape and the way it has inspired the construction of the SCOAP3 model; Section 4 outlines the details of the SCOAP3 model; Section 5 discusses the transition from a model to reality, presenting the status of the initiative and debriefing the lessons learned in recent months, with an outlook on the future evolution of the SCOAP3 initiative.

2. Scholarly Communication and Open Access in HEP

HEP has long pioneered a bridge between scholarly communication and Open Access through its widespread preprint culture [2,3]. For decades, theoretical physicists and scientific collaborations, eager to disseminate their results faster than the distribution of conventional scholarly publications allowed, took to printing and mailing hundreds of copies of their manuscripts at the same time as submitting them to peer-reviewed journals. This ante-litteram form of “author-pays”, or rather “institute-pays”, Open Access assured the broadest possible dissemination of scientific results, albeit privileging scientists working in affluent institutions: these could afford the mass mailing and were most likely to receive copies of preprints from other scientists eager to advertise their results. At the same time, for research-intensive institutions, preprint dissemination came at a cost: as an example, in the ’90s the DESY (Deutsches Elektronen-Synchrotron) HEP research centre in Hamburg, Germany, used to spend about 1 million DM a year (roughly €500,000 today, not corrected for inflation) on the production and mailing of hard copies of these preprints, while CERN used to spend about twice as much [2]. Against this background, three revolutions mark crucial advances in scholarly communication in HEP.
1. 1974, IT meets HEP libraries. The SPIRES database, the first grey-literature electronic catalogue, saw the light at the SLAC (Stanford Linear Accelerator Center) HEP laboratory in Stanford, California, in 1974. It listed preprints, reports, journal articles, theses, conference talks and books, and it now contains metadata for about 760,000 HEP articles, including links to full text. It offers additional tools like citation analysis and is interlinked with other databases containing information on conferences, experiments, authors and institutions [4]. A recent poll of HEP scholars has shown that SPIRES, in symbiosis with arXiv, is an indispensable tool in their daily research workflow [5].
2. 1991, the first repository. arXiv, the archetypal repository, was conceived in 1991 by Paul Ginsparg, then at LANL (Los Alamos National Laboratory) in New Mexico [6]. It




evolved the four-decade-old preprint culture into an electronic system, offering all scholars a level playing field from which to access and disseminate information. Today arXiv has grown beyond the field of HEP, becoming the reference repository for many disciplines, from mathematics to some areas of biology. It contains about 450,000 full-text preprints and receives about 5,000 submissions each month, about 15% of which concern HEP.
3. 1991, the web is woven. The invention of the web by Tim Berners-Lee at CERN in 1991 is a household story [7], and April 30th, 2008 saw the 15th anniversary of the day CERN released the corresponding software into the public domain [8]. What is less known is that the first web server outside Europe was installed at SLAC in December 1991 to provide access to the SPIRES database, as an example of the “killer app” for the web [9]. HEP scholars imagined the web from its inception as a tool for scholarly communication. The interlinking of arXiv and SPIRES in summer 1992 eventually offered the first web-based Open Access application.

Thanks to its decades-old preprint culture, HEP is today an almost entirely “green” Open Access discipline, that is, a discipline where authors self-archive their research results in repositories which guarantee their unlimited circulation. Posting an article on arXiv, even before submitting it to a journal, is common practice. Even revised versions incorporating the changes due to the peer-review process are routinely uploaded. Publishers of HEP journals all allow such practices and, in some cases, even host arXiv mirrors! It is interesting to remark that this success of “green” Open Access in HEP originated without mandates and without debates: very few HEP scientists would forgo the formidable opportunities offered by the discipline repository of the field and the linked discovery and citation-analysis tools offered by SPIRES. The speed of adoption of arXiv across the field is presented in Figure 1, which plots the evolution over time of submissions to arXiv in the four categories into which HEP results are conventionally divided. The number of preprints subsequently published in peer-reviewed journals is also indicated. The difference between the number of submissions and the number of published articles is mostly due to conference proceedings and other grey-literature material that is routinely submitted to arXiv but does not usually generate peer-reviewed publications.

Figure 1. HEP preprints submitted to arXiv in four different categories (hep-ex, hep-lat, hep-ph and hep-th) as well as total numbers (hep-*). Preprints subsequently published in peer-reviewed journals are indicated with a “P”. After a phase of adoption of the arXiv system, corresponding to the rise of all curves, present outputs are constant. Data from the SPIRES database.

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008


Salvatore Mele

As a consequence of the widespread role of arXiv in scholarly communication, it can be argued that HEP journals have to a large extent lost their century-old role as vehicles of scholarly communication. At the same time, however, they continue to play a crucial part in the HEP community. The evaluation of research institutes and (young) researchers is largely based on publications in prestigious peer-reviewed journals. The main role of journals in HEP is mostly perceived as that of “keeper-of-the-records”, guaranteeing a high-quality peer-review process. In short, it can be argued that the HEP community needs high-quality journals as its “interface with officialdom”. The synergy between HEP and Open Access extends beyond preprints into the peer-reviewed literature. In 1997, HEP launched one of the first peer-reviewed Open Access journals: the Journal of High Energy Physics (JHEP), published by the International School for Advanced Studies (SISSA) in Trieste, Italy. It then became a low-cost subscription journal, and it now offers a successful institutional membership scheme in which, for a small additional fee, all articles originating from a contributing institution are Open Access. It was followed in 1998 by Physical Review Special Topics Accelerators and Beams, published by the American Physical Society (APS), which operates under a sponsorship scheme, with 14 research institutions footing the bill for the operation of this niche journal. Another example is the New Journal of Physics, published by the Institute of Physics Publishing (IOPP), which carries HEP content within a broader spectrum covering many branches of physics. This journal also started in 1998 and is financed by author fees, under the so-called “author-pays” model. In 2007, PhysMathCentral, a spin-off of BioMedCentral, started a new “author-pays” HEP journal, PMC Physics A.
Most HEP publishers, Springer first and APS and Elsevier later, now offer authors the possibility to pay an additional fee on top of the subscription to make their individual articles Open Access, under the so-called “hybrid” model. The “author-pays” and “hybrid” schemes, however, are not very popular: the total number of HEP articles that appear as Open Access under these two schemes is below 1% of the yearly HEP literature. In comparison, the volume of Open Access articles financed by the institutional membership fee in JHEP is about 20% of that journal, corresponding to about 4% of the total volume of HEP articles. After preprints, arXiv and the web, a transition to Open Access journals appears to be the next logical step in the natural evolution of HEP scholarly communication. The following sections of this contribution describe the publishing landscape in HEP and how such a transition can be achieved, beyond the present experiments.

3. Bibliometric Facts

The aim of the SCOAP3 initiative is to convert the entire HEP literature to Open Access. In-depth studies have been performed to assess the HEP publication landscape and have informed the design of this model. The most relevant findings of these studies are summarised in the following, in particular the volume of HEP publishing, the journals favoured by HEP authors, and the geographical distribution of HEP authorship [10,11,12]. Five numbers set the scale of HEP scientific publishing:

• 20’000: a lower limit to the number of active HEP scholars;

• 6’000: an upper limit to the number of HEP articles submitted to arXiv yearly and subsequently published in peer-reviewed journals (Figure 1 shows that this yearly HEP output is constant);

• 80%: the fraction of HEP articles produced by theoretical physicists;

• 20%: the fraction of these articles authored by large collaborations of experimental physicists;

• 50:50: the ratio of active experimental to theoretical HEP scholars.

Figure 2 presents the journals favoured by HEP authors in 2006. The large majority of HEP articles are published in just six peer-reviewed journals from four publishers. Five of those six journals carry a majority of HEP content: Physical Review D (published by the APS), Physics Letters B and Nuclear Physics B (Elsevier), JHEP (SISSA/IOPP) and the European Physical Journal C (Springer). The sixth journal, Physical Review Letters (APS), is a “broadband” journal that carries only about 10% HEP content. These journals have long been the favourites of HEP scholars, albeit with varying fortunes. Figure 3 presents the percentage of HEP articles published in each of these six journals over the last 17 years. Only the articles published in these journals are considered in this graph, which makes it possible to assess the relative popularity of these titles over time. Periods of stability are followed by the fast rise of some titles and the corresponding decline of others.
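The relative fractions plotted in Figure 3 follow from a simple normalisation: for each year, each journal's article count is divided by the total across the six titles. A minimal sketch of that computation, with invented placeholder counts rather than actual SPIRES data:

```python
# For each year, Figure 3 shows each journal's share of the articles
# appearing in the six titles combined. The counts below are invented
# placeholders, not actual SPIRES data.

def relative_shares(counts):
    """Map journal -> article count to journal -> fraction of the combined total."""
    total = sum(counts.values())
    return {journal: n / total for journal, n in counts.items()}

counts = {
    "Physical Review D": 2400,          # placeholder values
    "Physics Letters B": 800,
    "Nuclear Physics B": 500,
    "JHEP": 1200,
    "European Physical Journal C": 300,
    "Physical Review Letters (HEP)": 400,
}
shares = relative_shares(counts)
assert abs(sum(shares.values()) - 1.0) < 1e-9   # fractions sum to one by construction
```

Because only these six journals enter the denominator, the fractions measure relative popularity among the titles, not the journals' share of all HEP literature.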

Figure 2. Journals favoured by HEP scientists in 2006. Journals that attracted less than 75 HEP articles are grouped in the slice named “Others”. Data from the SPIRES database.

Figure 3. Journals favoured by HEP scientists in the last 18 years. For each year, only articles published in these six journals are considered, and the relative fractions are displayed. Articles published in Zeitschrift für Physik C and the European Physical Journal C are aggregated, as the latter is the successor of the former. Data from the SPIRES database.

It is interesting to remark that in a discipline such as HEP, with traditionally strong cross-border collaborative links, journals published in the United States or in Europe attract contributions from all geographical regions, as presented in Figure 4. Any Open Access initiative, therefore, can only succeed if it is truly global in scope.


Figure 4. Geographical origin of publications in HEP journals based in the United States and in Europe. Co-authorship is taken into account on a pro-rata basis, assigning fractions of each article to the countries in which the authors are affiliated. This study is based on all articles published in the years 2005 and 2006 in five HEP “core” journals: Physical Review D (US), Physics Letters B (EU), Nuclear Physics B (EU), Journal of High Energy Physics (EU) and the European Physical Journal C (EU), and the HEP articles published in two “broadband” journals: Physical Review Letters (US) and Nuclear Instruments and Methods in Physics Research A (EU) [12]. The European contribution is well represented by CERN and its Member States, which are: Austria, Belgium, Bulgaria, the Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Italy, the Netherlands, Norway, Poland, Portugal, the Slovak Republic, Spain, Sweden, Switzerland and the United Kingdom.

Country                 Share of HEP Scientific Publishing
United States           24.3%
Germany                  9.1%
Japan                    7.1%
Italy                    6.9%
United Kingdom           6.6%
China                    5.6%
France                   3.8%
Russia                   3.4%
Spain                    3.1%
Canada                   2.8%
Brazil                   2.7%
India                    2.7%
CERN                     2.1%
Korea                    1.8%
Switzerland              1.3%
Poland                   1.3%
Israel                   1.0%
Iran                     0.9%
Netherlands              0.9%
Portugal                 0.9%
Taiwan                   0.8%
Mexico                   0.8%
Sweden                   0.8%
Belgium                  0.7%
Greece                   0.7%
Denmark                  0.6%
Australia                0.6%
Argentina                0.6%
Turkey                   0.6%
Chile                    0.6%
Austria                  0.5%
Finland                  0.5%
Hungary                  0.4%
Norway                   0.3%
Czech Republic           0.3%
Remaining countries      3.1%

Table 1: Contributions by country to the HEP scientific literature. Co-authorship is taken into account on a pro-rata basis, assigning fractions of each article to the countries in which the authors are affiliated. The last cell aggregates contributions from countries with a share below 0.3%. This study is based on all articles published in the years 2005 and 2006 in five HEP “core” journals: Physical Review D, Physics Letters B, Nuclear Physics B, Journal of High Energy Physics and the European Physical Journal C, and the HEP articles published in two “broadband” journals: Physical Review Letters and Nuclear Instruments and Methods in Physics Research A. A total sample of about 11’300 articles is considered [11,12].

Table 1 and Figure 5 present the contribution by country to the HEP scientific literature. Co-authorship is taken into account on a pro-rata basis, assigning fractions of each article to the countries in which the authors are affiliated. This study is based on all articles published in the years 2005 and 2006 in five HEP “core” journals: Physical Review D, Physics Letters B, Nuclear Physics B, JHEP and the European Physical Journal C, and the HEP articles published in two “broadband” journals: Physical Review Letters and Nuclear Instruments and Methods in Physics Research A. A total sample of almost 11’300 articles is considered [11,12].

Figure 5. Contributions by country to the HEP scientific literature published in the largest journals in the field. Co-authorship is taken into account on a pro-rata basis, assigning fractions of each article to the countries in which the authors are affiliated. Countries with individual contributions of less than 0.8% are aggregated in the “Other countries” category. This study is based on all articles published in the years 2005 and 2006 in five HEP “core” journals: Physical Review D, Physics Letters B, Nuclear Physics B, Journal of High Energy Physics and the European Physical Journal C, and the HEP articles published in two “broadband” journals: Physical Review Letters and Nuclear Instruments and Methods in Physics Research A. A total sample of almost 11’300 articles is considered [11,12].
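The pro-rata counting used in Table 1 and Figures 4 and 5 assigns each article in equal fractions to the countries of its authors' affiliations. A minimal sketch of that rule; the author lists are invented for illustration:

```python
from collections import defaultdict

def country_shares(articles):
    """Each article contributes a total weight of 1, split equally over
    the countries of its authors' affiliations (pro-rata co-authorship)."""
    totals = defaultdict(float)
    for author_countries in articles:
        weight = 1.0 / len(author_countries)
        for country in author_countries:
            totals[country] += weight
    n = len(articles)
    return {country: w / n for country, w in totals.items()}  # fraction of the corpus

# Illustrative input: one article by a US-German pair, one all-Italian article.
shares = country_shares([["US", "DE"], ["IT", "IT"]])
# US: 0.25, DE: 0.25, IT: 0.50
```

Note that two co-authors from the same country simply accumulate that country's weight, so a purely national collaboration still counts as one full article for its country.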

4. The SCOAP3 model

The call for Open Access journals in HEP does not originate only from librarians frustrated by spiralling subscription costs and shrinking budgets; it is a solid pillar of the scientific community. At the beginning of 2007, the four experimental collaborations working at the CERN LHC accelerator, ATLAS, CMS, ALICE and LHCb, counting a total of over 5’000 scientists from 54 countries, declared: “We, […] strongly encourage the usage of electronic publishing methods for [our] publications and support the principles of Open Access Publishing, which includes granting free access of our publications to all. Furthermore, we encourage all [our] members to publish papers in easily accessible journals, following the principles of the Open Access paradigm” [11]. SCOAP3, the Sponsoring Consortium for Open Access Publishing in Particle Physics, aims to convert the HEP peer-reviewed literature to Open Access in a way that is transparent to authors [11,13], meeting the expectations of the HEP community for peer review of the highest standard, administered by the journals which have served the field for decades, while leaving room for new players. The SCOAP3 business model originates from a two-year debate involving the scientific community, libraries and publishers [11,14]. The essence of this model is the formation of a consortium to sponsor HEP publications and make them Open Access by redirecting funds that are currently used for subscriptions to HEP journals. Today, libraries (or the funding bodies behind them) purchase journal subscriptions to implicitly support the peer-review and other editorial services and to allow their users to read articles, even though HEP scientists mostly access their information by reading preprints on arXiv. The SCOAP3 vision for tomorrow is that funding bodies and libraries worldwide would federate in a consortium that will pay centrally for the


peer-review and other editorial services, through a re-direction of funds currently used for journal subscriptions; as a consequence, articles will be free for everyone to read. This evolution of the current “author-pays” Open Access models will make the transition to Open Access transparent for authors, by removing any financial barriers. The SCOAP3 model offers another advantage for libraries and funding bodies over the present “author-pays” model. Disciplines with successful “author-pays” journals often see publication costs met either by libraries or by funding bodies. At the same time, the costs of subscriptions to “traditional” journals do not decrease in line with the reduced volume of articles that these publish, due to the drain towards “author-pays” Open Access journals. Conversely, in the SCOAP3 model all the literature of the field could be converted to Open Access while keeping the total expenditure under control. In practice, the Open Access transition will be facilitated by the fact that the large majority of HEP articles are published in just six peer-reviewed journals from four publishers, as presented in Figure 2. Five of those six journals carry a majority of HEP content, and the aim of the SCOAP3 model is to assist publishers in converting these “core” HEP journals entirely to Open Access; it is expected that the vast majority of the SCOAP3 budget will be spent to achieve this target. Another journal, Physical Review Letters, is a “broadband” journal that carries only 10% HEP content: here the aim of SCOAP3 is to sponsor the conversion to Open Access of this journal fraction. The same approach can be extended to other “broadband” journals. Of course, the SCOAP3 model is open to any other present or future “core” or “broadband” high-quality journals carrying HEP content, beyond those spotlighted here. This will ensure a dynamic market with healthy competition and a broader choice.
The price of an electronic journal is mainly driven by the costs of running the peer-review system and of editorial processing. Most publishers quote a price in the range of 1’000–2’000€ per published article. On this basis, given that the total number of HEP publications in high-quality journals is between 5’000 and 10’000, depending on how one defines HEP and its overlap with cognate disciplines, the annual SCOAP3 budget for the transition of HEP publishing to Open Access would amount to a maximum of 10 Million Euros per year [11]. The costs of SCOAP3 will be distributed among all countries according to a fair-share model based on the distribution of HEP articles per country, as shown in Table 1 and Figure 5. In practice, this is an evolution of the “author-pays” concept: countries will be asked to contribute to SCOAP3, whose ultimate targets are Open Access and peer-review, according to their use of the latter, measured by their scientific productivity. To cover publications from scientists from countries that cannot reasonably be expected to contribute to the consortium at this time, an allowance of not more than 10% of the SCOAP3 budget is foreseen. SCOAP3 will sponsor articles through a tendering procedure with publishers of high-quality journals. It is expected that the consortium will invite publishers to bid for their peer-review and other editorial services on a per-article basis. The consortium will then evaluate these offers as a function of indicators such as journal quality and price, and attribute contracts within its capped budget envelope. SCOAP3 therefore has the potential to contain the overall cost of journal publishing by linking price, volume and quality, and by injecting competition into the market. In the SCOAP3 model, libraries will not be paying twice for the journals to be converted to Open Access in case these are part of journal licence packages.
Indeed, in the case of a “core” HEP journal (where an entire journal is converted to OA) that is part of a large journal licence package, the publisher will be required to un-bundle this package and to correspondingly reduce the subscription cost for the remaining part of the package. For “broadband” journals (where only the conversion of selected HEP articles is paid by SCOAP3), the subscription costs will be required to be lowered according to the fraction supported by SCOAP3. For journals of this kind that are part of a licence package, the reduction should be reflected in a corresponding reduction of the package subscription cost. In the case of existing long-term subscription contracts between publishers, libraries, and funding agencies, publishers will be required to reimburse the subscription costs pertaining to OA journals or to the journal fractions converted to OA.

It appears at first glance a formidable enterprise to organize a worldwide consortium of research institutes, libraries and funding bodies that cooperates with publishers in converting the most important HEP journals to Open Access. At the same time, HEP is accustomed to international collaborations on a much bigger scale. As an example, the ATLAS experiment, one of the four detectors at the LHC, has been built over more than a decade by about 50 funding agencies on a total budget of 400 Million Euros (excluding person-power), placing about 1’000 industrial contracts. In comparison, the SCOAP3 initiative has about the same number of partners, but a yearly budget of only 10 Million Euros, and will handle less than a dozen contracts with publishers. SCOAP3 will be operated along the blueprint of large HEP collaborations, profiting from the collaborative experience of HEP.
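The cost arithmetic described above, a per-article fee multiplied by the yearly article volume, distributed among countries by their share of HEP authorship with a 10% allowance for countries unable to contribute, can be sketched as follows. The helper function and the exact treatment of the allowance are one possible reading, not the consortium's published formula:

```python
# Hedged sketch of the SCOAP3 cost arithmetic. The fee and volume values
# fall within the ranges quoted in the text; the 10% allowance treatment
# is an assumption for illustration.

fee_eur = 1500             # within the 1'000-2'000 EUR per article quoted by publishers
articles_per_year = 6500   # within the 5'000-10'000 yearly HEP publications
budget = fee_eur * articles_per_year   # 9'750'000 EUR: the ~10 MEUR scale of the text

def contribution(share_of_articles, total_budget, covered_fraction=0.90):
    """A country's fair-share contribution: authorship shares (Table 1) are
    scaled so that contributing countries jointly cover the full budget,
    of which up to 10% pays for countries unable to contribute yet."""
    return share_of_articles / covered_fraction * total_budget

# Example with the top share from Table 1 (United States, 24.3%):
us_contribution = contribution(0.243, budget)  # roughly 2.6 MEUR under these assumptions
```

Varying the per-article fee over the quoted 1'000–2'000€ range and the volume over the 5'000–10'000 range reproduces the "maximum of 10 Million Euros per year" envelope cited in the text.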

5. Conclusions and Outlook

SCOAP3 is now collecting Expressions of Interest from partners worldwide to join the consortium. Once it has reached a critical mass, and thus demonstrated its legitimacy and credibility, it will formally establish the consortium and its governance, issue a call for tender to publishers aimed at assessing the exact cost of the operation, and then move quickly forward with negotiating and placing contracts with publishers. SCOAP3 is rapidly gaining momentum. In Europe, most countries have pledged their contribution to the project. In the United States, leading libraries and library consortia have pledged a redirection of their current expenditure on HEP journal subscriptions to SCOAP3, and a call for action has originated from many associations, among them ARL, the Association of Research Libraries [15]. In total, SCOAP3 has already received pledges for about a third of its budget envelope, with another considerable fraction having the potential to be pledged in the short term, as presented in Figure 6 [13]. This consensus is not restricted to Europe and North America: Australia is part of the consortium, and advanced negotiations are in progress in Asia and Latin America.

Figure 6. Status of the SCOAP3 fund-raising at the time of writing. A third of the funds have already been pledged, 15% are expected to be pledged in the coming weeks, while discussions and negotiations are in progress for another 44% [13].

In conclusion, SCOAP3 is a unique experiment in “flipping” all the journals covering the literature of a given discipline from Toll Access to Open Access. Its success so far, and its eventual fate, will be important to inform other initiatives in Open Access publishing, for several reasons:

The contained publication landscape of HEP, with less than 10’000 articles appearing in half a dozen journals from a few publishers, simplifies a possible transition of the entire literature of the field to Open Access.


HEP is a scientific discipline which long ago embraced, and indeed pioneered, “green” Open Access, with a long tradition of unrestricted circulation of preprints, via mass mailing first and arXiv later. SCOAP3 can be interpreted as an experiment, in a controlled environment, on possible future evolutions of Open Access publishing, or “gold” Open Access, in the light of the present acceleration of “green” Open Access, or self-archiving of research results, in many other fields of science, on both an institutional and a disciplinary basis.

Some of the obstacles met by “gold” Open Access publishing so far are related to justified authors’ concerns about financial barriers to the payment of Open Access fees, and to their reluctance to submit articles to new Open Access journals. The SCOAP3 initiative enjoys strong consensus on the researchers’ side because it addresses both points: it does not imply any direct financial contribution from authors, and it aims to convert to Open Access the high-quality peer-reviewed journals which have served the community for decades.

By construction, the SCOAP3 model implies a large worldwide consensus first, and financial commitment later. As Open Access is a global issue, the success of this initiative, in a well-organised discipline with strong cross-border links like HEP, can demonstrate the potential of international cooperation in addressing the global problems of scholarly communication.

6. Notes and References

[1] One of the most extensive sources of information on the Open Access movement is http://www.earlham.edu/~peters/fos/overview.htm [Last visited May 25th, 2008].
[2] R. Heuer, A. Holtkamp, S. Mele, 2008, Innovation in Scholarly Communication: Vision and Projects from High-Energy Physics, arXiv:0805.2739.
[3] L. Goldschmidt-Clermont, 1965, Communication Patterns in High-Energy Physics, http://eprints.rclis.org/archive/00000445/02/communication_patterns.pdf.
[4] L. Addis, 2002, Brief and Biased History of Preprint and Database Activities at the SLAC Library, http://www.slac.stanford.edu/spires/papers/history.html [Last visited May 25th, 2008]; P. A. Kreitz and T. C. Brooks, Sci. Tech. Libraries 24 (2003) 153, arXiv:physics/0309027.
[5] A. Gentil-Beccot et al., 2008, Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course, arXiv:0804.2701.
[6] P. Ginsparg, Computers in Physics 8 (1994) 390.
[7] T. Berners-Lee, Weaving the Web, HarperCollins, San Francisco, 1999.
[8] J. Gillies, 2008, The World Wide Web turns 15 (again), http://news.bbc.co.uk/2/hi/technology/7375703.stm [Last visited May 25th, 2008].
[9] P. Kunz et al., 2006, The Early World Wide Web at SLAC, http://www.slac.stanford.edu/history/earlyweb/history.shtml [Last visited May 25th, 2008].
[10] S. Mele et al., Journal of High Energy Physics 12 (2006) S01, arXiv:cs.DL/0611130.
[11] S. Bianco et al., 2007, Report of the SCOAP3 Working Party, http://www.scoap3.org/files/Scoap3WPReport.pdf.
[12] J. Krause et al., 2007, Quantitative Study of the Geographical Distribution of the Authorship of High-Energy Physics Journals, http://scoap3.org/files/cer-002691702.pdf [Last visited May 25th, 2008].


The SCOAP3 Project: Converting the Literature of an Entire Discipline to Open Access

[13] http://scoap3.org [Last visited May 25th, 2008].
[14] R. Voss et al., 2006, Report of the Task Force on Open Access Publishing in Particle Physics, http://www.scoap3.org/files/cer-002632247.pdf.
[15] I. Anderson, 2008, The Audacity of SCOAP3, ARL Bimonthly Report, no. 257; J. Blixrud, 2008, Taking Action on SCOAP3, ibid.


Modeling Scientific Research Articles – Shifting Perspectives and Persistent Issues

Anita de Waard (1,2); Joost Kircz (3,4)

1 Elsevier Labs, Radarweg 29, 1043 NX, Amsterdam, The Netherlands; e-mail: a.dewaard@elsevier.com
2 Department of Information and Computing Sciences, Universiteit Utrecht, The Netherlands
3 Institute for Media and Information Management (MIM), Hogeschool van Amsterdam, The Netherlands; e-mail: j.g.kircz@hva.nl
4 Kircz Research Amsterdam, http://www.kra.nl

Abstract

We review over 10 years of research at Elsevier and various Dutch academic institutions on establishing a new format for the scientific research article. Our work rests on two main theoretical principles: the concept of modular documents, consisting of content elements that can exist and be published independently and are linked by meaningful relations, and the use of semantic data standards allowing access to heterogeneous data. We discuss the application of these concepts in five different projects: a modular format for physics articles, an XML encyclopedia in pharmacology, a semantic data integration project, a modular format for computer science proceedings papers, and our current work on research articles in cell biology.

Keywords: Scientific publishing models; new scholarly constructs and discourse methods; metadata creation and usage; pragmatic and semantic web technologies.

1. Introduction

The objective of our work is, on the one hand, to analyze and investigate what role the research article plays in the connected world that scientists live in today, and on the other hand to propose and experiment with new forms of publication, which contain the knowledge traditionally transferred by ‘papers’ but are better suited to an online environment. Our research is driven both by an analytical approach stemming from the humanities, including argumentation theory, discourse modeling, and sociology of science, and by a knowledge engineering approach from the computer science end, using semantic web technologies, argumentation visualization, and authoring and annotation tools. We present five examples of our work, in roughly chronological order [1]. We have been driven by two main theoretical concepts. The first is modularity: the idea that a scientific text can consist of a set of self-contained and reusable content elements that are strung together to form one or more variants of an evolving series of documents. To explore this concept, Kircz and Harmsze developed a modular format for the research article in physics [2]; this work was extended to create a modular format for a Major Reference Work in pharmacology, which can be used as a database or as a linear text [3]. The other main theoretical driver of our research is the use of semantic technologies to access scientific content. In the DOPE project [4], we developed an architecture based on RDF (the Resource Description Framework [5]) to access a diverse content set through a thesaurus. This project included the RDF formatting of Elsevier’s EMTREE thesaurus [6] and the development of an explorative user interface [7] to access a heterogeneous dataset. Lastly, we discuss two projects in which we combine the concept of modularity with semantic tools and standards. We first identified a simple modular structure for articles in computer science that can be created using LaTeX and converted to semantic formats, entitled the ABCDE Format [8]. Our current work delves more deeply into the text of research articles. We are investigating a discourse modeling approach to develop a theoretical framework for a pragmatic model of research articles, linked through a network of argumentational relations. We probe the pragmatic roles that various discourse elements play, and model the way in which textual coherence and the argumentative roles of textual elements are expressed, through an analysis of the linguistic forms used in various parts of a biology text.

2. Modular Documents in Physics

Kircz and Roosendaal [9] summarized the communication needs in the scientific community as follows:

• Awareness of knowledge about the body of knowledge of one’s own or related research domains;

• Awareness of new research outcomes, needed for one’s current research program;

• Specific information on relevant theories, detailed information on design, methodologies, etc.;

• Scientific standards on research approaches and reporting, which develop in the process of a certain research program and shape the social structure of a field;

• Platform for communication, as a tool that enables the formal and informal exchange of ideas, opinions, results and (dis)agreements between peers;

• Ownership protection on the intellectual results and possible commercial applications.

All of these roles demand different ways of identifying the pertinent information units that together compose the paper as we know it today. In the early period of the transition of the scientific paper to electronic media, proposals for new formats remained traditional, without taking into account the extent to which electronic media change the whole spectrum of dissemination and reading. In a critique of this, we explored the functions of the article and discussed changes in form due to the fact that in an electronic medium, text and non-textual material obtain a different relationship than in the paper world [10, 11]. In our proposal the essay format, typical for a paper product that is meant to be read as an individual information object, is replaced by a mode of communication that is an intrinsic fit for electronic reading. Specifically, this allows readers to read only those parts that really serve an information need at a particular place and time. In other words, with proper tools we will see a change from a ‘read and locate’ situation, where first a document is identified and then read to locate the needed information, to a ‘locate and read’ situation, where we start from a relevant passage of text and from that starting point decide to read other parts or to skip on. Such organized browsing, by immediately jumping to specific parts of the text, demands changes in the way research reports are structured and represented on paper and in electronic media. This suggests an intrinsically modular structure for electronic publications, first explored in [2]. The PhD research project on modularity of scientific information by Harmsze [12] focused on the dissection of the research paper into the different types of information that are conveyed by its structure.


This approach leads to a modular model of scientific information in physics, which contains two elements:

• Modules: information elements such as positioning (introduction), methods, results, interpretation, outcome and their subdivisions, and

• Relations between these elements, both to non-textual elements in the paper and to (parts of) other works.

In Figure 1, we show the modular system developed by Harmsze [12] to model a set of papers in physics, where each module contained a unique type of content, focusing on e.g. the experimental setup or the central problem of a piece of research. Core to the use of modular elements is the concept of reusability: when a paper is updated, it might not need a new Positioning module, but merely provide e.g. new Methods and Results sections.
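The reusability idea, in which an updated paper keeps its Positioning module and supplies only new Methods and Results, can be illustrated with a toy data structure. The module names follow Harmsze's model; the classes themselves are a hypothetical sketch, not part of the original work:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Module:
    kind: str      # e.g. "positioning", "methods", "results", "interpretation", "outcome"
    version: int
    text: str

@dataclass
class ModularArticle:
    modules: dict = field(default_factory=dict)  # kind -> Module

    def revise(self, *new_modules):
        """Return a new variant that reuses unchanged modules and
        replaces only the ones supplied."""
        updated = dict(self.modules)
        for m in new_modules:
            updated[m.kind] = m
        return ModularArticle(updated)

v1 = ModularArticle({
    "positioning": Module("positioning", 1, "Why this problem matters."),
    "methods": Module("methods", 1, "Original setup."),
    "results": Module("results", 1, "First measurements."),
})
v2 = v1.revise(Module("methods", 2, "Improved setup."),
               Module("results", 2, "New measurements."))

# The Positioning module is shared, unchanged, between both variants.
assert v2.modules["positioning"] is v1.modules["positioning"]
```

The key property is that a revision is a new composition over mostly existing elements, exactly the "one or more variants of an evolving series of documents" described in the introduction.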

Figure 1: The module meta-information and the modules follow the conceptual function of the information, and the sequential paths leading through the article. The dashed line indicates the complete sequential path, and the dotted line the essay-type sequential path [12].


The other pillar of the modular model is the concept of ‘meaningful links’, as the jargon was at the time. Although the ubiquity of the typical HTML link (<A HREF="link.htm">…</A>) has meant a great triumph for the simple hyperlink – one-to-one, mono-directional, carrying no information about the link type – a body of research in hypertext going back decades identifies many different types of links and the roles they can play in connecting pieces of data. In thinking about relations in this way, it becomes clear that a relation is not simply a pointer to another piece of data. The fact that the relation exists, and the relationship it expresses between the linking and linked data, provides information in itself, which can be made explicit (e.g. visible and/or searchable) for the reader. We therefore explicitly considered the relation and the information presented within it, the relation type, as separate entities.

Figure 2: Different types of organisational relations distinguished in the modular model by Harmsze [12]

For physics articles, Harmsze identifies the following detailed taxonomy of relations between modules:

• Organisational relations, which are based on the structure of the information. These dovetail with the structural XML information we discussed above – see Figure 2 for a detailed subdivision, and

• Discourse relations, which define the reasoning of the argument. In Harmsze’s model an elaborate skeleton has been worked out. Based on the systematic pragma-dialectical categorization of Garssen [13], these can be further subdivided as:

• Causal relations: relations where there is a causal connection between premise and conclusion (or between explanans and explanandum). This kind of relation exists between a statement or a formula and an elaborate mathematical derivation. Obviously, the uses of the causal relation as an argument and as an explanation lie close together.

• Comparison relations: relations of resemblance, contradiction or similarity. The analogy is a typical subtype. Comparisons used as arguments are well-known phenomena, such as the comparison of measured data from, e.g., the module Treated Results with theoretical predictions that fit within certain acceptable boundaries. We can also think of similarity relations, where results of others on similar systems are compared to emphasize agreement or disagreement. In the case of an elucidation, we can think of the relation between the description of a phenomenon and a known mechanical toy model. A link between a text and an image that illustrates the reasoning or results belongs to this category. Another example is the suggestion that a drug that is effective in curing a particular ailment might also help against similar symptoms.


Anita de Waard; Joost Kircz

• Symptomatic relations, which are of a more complicated nature. Here we deal with relations where a concomitance exists between the two poles. This category is more heterogeneous than the other two. This kind of relation can be based on a definition or a value judgment, such as the role of a specific feature that serves as a sufficiently discriminatory value to warrant a conclusion. We can think of a relation between the textually described results and a picture in which a specific feature, like a discontinuity in a graph, is used to declare a particular physical effect present or not.

3. A Modular Major Reference Work

The main drawback of the model developed for physics articles was that it is very demanding of the author to adhere to the proposed structure, and the model presupposes strong editorial assistance in the form of advanced XML-based text processing software. This could be enforced within the context of a reference work, where a) the content elements are commissioned, and therefore a writing template can be prescribed, and b) the main rhetorical goal is to inform the reader of existing knowledge, rather than convince him or her of the validity of a specific claim (for more on this, see below). Therefore, we adapted Harmsze’s model to use it for XPharm, a state-of-the-art, online, comprehensive pharmacology reference work. XPharm contains information on agents (drugs), targets, disorders and principles of pharmacology [3]. The 4,000 XPharm entries are authored by a group of 600 contributors who write in a very modular format. The idea for four databases was driven by the fact that Agents, including drugs, which are the core of pharmacology, act at molecular Targets to treat Disorders. The Principles database is included as a repository of information fundamental to the discipline but generally independent of the chemical entity,

Figure 3: Outline of an XPharm Target Record, showing the modular structure; all topic headings are the same for each Target.



site of action, or clinical use. Each XPharm record can be rendered in a customizable way, and the interface allows for the rendering of modular content elements within different user-defined contexts [3]. XPharm uses the concept of modularity by prescribing a rigid format for each type of entry: for example, all Target entries follow the format shown in Figure 3. The XML of each record is highly granular; for example, physical constants are individually marked up so they can, in principle, be extracted and compared to create tables of data, thus enabling the XML to function either as a text (in the html instantiation) or as a database. Relations between records point to these modular headings, so that the texts can be interlinked in a very granular way. This system also enables detailed updates of only specific parts of the texts: e.g., if a new antibody is found for a specific target, only that module can be updated.

In conclusion, the system of modular authoring can work quite well for texts in which structures can be mandated and which are more like a ‘dressed-up database’ than like a persuasive text. We therefore believe that the difference between informative content sources (such as textbooks and databases) and persuasive texts (such as primary research articles) needs to be taken into greater account when modeling scientific information.

In XPharm, a set of content relations was also proposed, which specifically hold between different elements; these are based on the specific biological rules that govern the interactions between content elements. For example, a disorder can be related to a drug (or Agent, in XPharm terms) by either the Treats relation (Aspirin treats Headache) or the Side Effect relation (Stomach Ache is a Side Effect of Aspirin). A system of 13 such relations was proposed but, because of technical issues (most notably, the inability of current browsers to render relationship types), has not yet been implemented.
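To make the idea concrete, here is a minimal sketch of such typed content relations, reusing the Aspirin examples from the text. The relation names and the helper `related` are our own illustration; as noted, the actual relation system was never implemented:

```python
# XPharm-style content relations as subject-relation-object triples
# (illustrative data only; XPharm proposed 13 relation types in total).
relations = [
    ("Aspirin", "treats", "Headache"),
    ("Aspirin", "has_side_effect", "Stomach Ache"),
]

def related(agent, relation):
    """Return everything linked to `agent` by the given relation type."""
    return [obj for subj, rel, obj in relations if subj == agent and rel == relation]

treated = related("Aspirin", "treats")          # -> ["Headache"]
side_effects = related("Aspirin", "has_side_effect")
```

The point of the sketch is that the relation type is part of the data: a browser that could render relation types would be able to show *why* two records are linked, not merely *that* they are.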

4. Semantic Access to Heterogeneous Data: The DOPE Project

The technologies used in the work described above predate (our knowledge of) the semantic web (XPharm was designed in 1998), and the lack of interoperable standards was partly what prevented us from scaling up or connecting to other projects. A next project focused on the use of such standards in the context of pharmacology research. DOPE, the Drug Ontology Project for Elsevier, focused on allowing access via a multifaceted thesaurus, EMTREE, to a large set of data: five million abstracts from the Medline database and about 500,000 full-text articles from Elsevier’s ScienceDirect [7]. At the time (2003), no open architecture existed to support using thesauri for querying data sources. To provide this functionality, we needed a technical infrastructure to mediate between the information sources, the thesaurus representation, and document metadata stored on the Collexis fingerprint server [14]. We implemented this mediation in our DOPE prototype using the RDF repository Sesame [15]. The records were first indexed against Elsevier’s proprietary thesaurus EMTREE [6]. The version we used, EMTREE 2003, contained about 45,000 preferred terms and 190,000 synonyms organized in a multilevel hierarchy, and contained the following information types:

• Facets: broad topic areas that divide the thesaurus into independent hierarchies. Each facet consists of a hierarchy of preferred terms used as index keywords to describe a resource’s information content. Facet names are not themselves preferred terms, and they cannot be used as index keywords. A term can occur in more than one facet; that is, EMTREE is poly-hierarchical.

• Preferred terms are enriched by a set of synonyms—alternative terms that can be used to refer to the corresponding preferred term. A person can use synonyms to index or query information, but they will be normalized to the preferred term internally.

• Links, a subclass of the preferred terms, serve as subheadings for other index keywords. They denote a context or aspect for the main term to which they are linked. Two kinds of link terms, drug-links and disease-links, can be used as subheadings for a term denoting a drug or a disease.

The indexing process was done by the Collexis Indexing Engine using a technique called fingerprinting [14], which assigns a list of weighted thesaurus keywords to a document. Next to the document fingerprints, the Collexis server housed bibliographic metadata about the document, such as authors and document location. The DOPE architecture (see Figure 4) then dynamically mapped the Collexis metadata to an RDF model. An RDF database, using the SOAP protocol, communicated with both the fingerprint server and the RDF version of EMTREE. A client application interface, based on Aduna’s Spectacle Cluster Map [7], let users interact with the document sets indexed by the thesaurus keywords using SeRQL queries [16], an RDF query language, sent over HTTP. The system design permits the addition of new data sources, which are mapped to their own RDF data source models and communicate with Sesame. It also allows the addition of new ontologies or thesauri, which can be converted into RDF Schema and communicate with the Sesame RDF server [15].
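The two indexing ideas described above, normalizing synonyms to preferred thesaurus terms and deriving a weighted keyword "fingerprint" for a document, can be sketched as follows. The thesaurus fragment and the frequency-based weighting are toy assumptions of ours; the actual Collexis weighting scheme is not specified in this paper:

```python
# Toy thesaurus fragment: synonym -> preferred term (invented entries).
from collections import Counter

synonyms = {
    "acetylsalicylic acid": "aspirin",
    "asa": "aspirin",
    "heart attack": "myocardial infarction",
}

def normalize(term):
    """Map a synonym to its preferred term; pass preferred terms through."""
    term = term.lower()
    return synonyms.get(term, term)

def fingerprint(terms):
    """A document fingerprint: each normalized keyword with a weight
    (here simply its relative frequency in the extracted term list)."""
    counts = Counter(normalize(t) for t in terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

fp = fingerprint(["ASA", "aspirin", "heart attack", "aspirin"])
# 'aspirin' dominates this fingerprint, since three of four terms normalize to it.
```

Normalization is what makes fingerprints comparable across documents: two papers that use "ASA" and "acetylsalicylic acid" respectively still end up sharing the keyword "aspirin".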

Figure 4: Basic components of the DOPE architecture (technologies are given in brackets)

We performed a small user study with 10 potential end users, including six academic and four industrial users [17]. These users found the tool useful for the exploration of a large information space, for tasks such as filtering information when preparing lectures on a certain topic and doing literature surveys (for example, using a “shopping basket” to collect search results). A more advanced potential application mentioned was to monitor changes in the research community’s focus. This, however, would require extending the current system with mechanisms for filtering documents based on publication date, as well as advanced visualization strategies for changes that happen over time, which were not part of the project scope.

Overall, the DOPE system was a useful, working implementation of Semantic Web technologies that allowed for the inclusion of new distributed data sources and ontologies using the RDF data standard. In juxtaposing this project with the experiments in modularity discussed above, we note that a complex representation of the EMTREE thesaurus in RDF was constructed, using historically meaningful relationships between thesaurus elements. The use of semantic standards enables easy scaling of the system with new thesauri or new relationships. Within DOPE, however, the documents accessed were not modular, and they could only be related using overlapping or related thesaurus entries. Combining these two concepts, modular documents with meaningful relations and semantic technologies, led to our next series of investigations.


5. Semantic Modular Publishing: The ABCDE Format

Our current research focuses on developing a new format for publications that combines the concepts of modularity with semantic technologies. Our first foray into this area was to develop a simple modular format for structuring conference contributions in computer science, and the authoring, editing and retrieval processes needed to use them. Specifically, this format was meant as a way to allow the use of conference papers by semantic browsers such as Piggy Bank [18] and semantic collaborative authoring tools such as Semantic Wikis [19]. The ABCDE Format (ABCDEF) for proceedings and workshop contributions is an open, widely (re)usable format that can be easily mined, integrated and consumed by semantic browsers and wikis [8]. The format can be created in several interoperable data types, including LaTeX and XML, or a simple text file. It is characterized by the following elements:

A - Annotation. Each record contains a set of metadata that follows the Dublin Core standard. Minimal required fields are Title, Creator, Identifier and Date.

B, C, D - Background, Contribution, Discussion. The main body of text consists of three sections: Background, describing the positioning of the research, ongoing issues and the central research question; Contribution, describing the work the authors have done: any concrete things created, programmed, or investigated; and Discussion, containing a discussion of the work done, comparison with other work, and implications and next steps. These section headings need to exist somewhere in the metadata of the article - but they can be hidden markup; also, each of the sections can have different, and differently named, subheadings.

E - Entities. Throughout the text, entities such as references, personal names, project websites, etc. are identified by:
- the text linking to an entity
- the type of link (reference, footnote, website, etc.)
- the linking URI, if present
- the text for the link
In other words, the entity link can be described as an RDF statement [5].

There is no abstract in an ABCDE document - instead, within the B, C and D paragraphs the author denotes ‘core’ sentences. Upon retrieval or rendering of the article, these can be extracted to form a structured abstract - one from which the reader can jump directly to the core of the Background, Contribution or Discussion. This allows the author to create and modify statements summarizing the article only once, which prevents the abstract from misrepresenting the paper - something that, in fact, occurs quite often [20].
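As an illustration of how a structured abstract could be assembled from author-marked core sentences, here is a sketch. The markup convention (a boolean `core` flag per sentence) and the sample sentences are invented for this example; ABCDEF itself defines the marking in its LaTeX/XML serializations [8]:

```python
# Each B/C/D section holds (sentence, is_core) pairs; the structured
# abstract is simply the concatenation of the core sentences per section.
doc = {
    "Background": [("Previous formats were purely linear.", False),
                   ("We ask whether modular formats aid retrieval.", True)],
    "Contribution": [("We converted 13 papers to ABCDE.", True)],
    "Discussion": [("Conversion cost remains an open issue.", True),
                   ("Future work will automate the conversion.", False)],
}

def structured_abstract(doc):
    return {section: " ".join(s for s, core in sentences if core)
            for section, sentences in doc.items()}

abstract = structured_abstract(doc)
```

Because the abstract is derived rather than written separately, it cannot drift out of sync with the body text, which is exactly the misrepresentation problem noted above.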

ABCDEF allows an extensible set of relations to work on documents with a (simple) modular structure, and enables the use of open semantic standards. This format has been described and a LaTeX stylesheet has been published [8]; as a test, a small set of documents for the Semantic Wiki conference was converted to this format. The ABCDE format is a quite simple intermediary step towards creating a reusable, modular, semantic format for research articles. The relations between the modules are quite simple: the sequentiality is obvious (first B, then C, then D for the sections); ‘elaboration’ relations exist between core sentences in the abstract and their locations in the text; and the entities are related to their links by a link type which the user is free to name. Although this format allows access to the content by various semantic tools, it still does not do a very good job of marking up the knowledge or argumentation in the text. An attempt at this is made in the currently ongoing project, discussed in the next section.

6. Semantic Modular Publishing: Rhetoric in Biology

At present, we are developing a more integrated approach, where we look more closely at the way in which rhetoric and persuasion are expressed. The main goal of our research is not a linguistic analysis of a research paper in a field, but the creation of a model that will enable faster browsing through a single paper as well as a collection of related papers. The most important observation from the work done on the modular physics articles was that when you break up the essay-type article into well-defined units that can be retrieved and re-used, the units of information never become fully independent. A research paper is an attempt to convey meaning, and convince peers of the validity of a specific claim, using research data: therefore, to optimally represent a scientific paper, we should model how it aims to convince. To use a chemical metaphor: breaking up a molecule into its constituent atoms immediately confronts you with the various aspects of chemical binding. In the same way, parts of a scientific text are glued together with arguments, which cannot be disconnected without a loss of meaning to the overall structure. As a knowledge transmission tool, the research article offers an amalgam of pragmatic, rhetorical and simply informative functions.

Our modularity experiments led us to understand that although certain parts of the paper can be made into database-like elements, other parts are quite complex to modularize, and their format plays a critical role in transferring knowledge and convincing peers of the correctness of a statement. Our current efforts focus on obtaining a better understanding of the sociology and linguistic expressions of truth creation in science. We are using a corpus of full-text articles in the field of cell biology, partly because it is a vast field, where presentations are already quite standardized, and partly because the role of research results vs. theoretical descriptions is very clear-cut.
In modeling these articles, we are staying close to the traditional ‘IMRaD’ (Introduction, Methods, Results and Discussion) format, since first of all the field has consistently adopted this format [21]; an additional motivation for this format can be found by looking at models from classical rhetoric and story grammar models [22]. Therefore, to optimize granularity but still enable the rhetorical narrative flow, our current model in biology has three elements [23]:

I: Content Modules:
• Front matter/metadata
• Introduction, containing the following subsections:
  • Positioning
  • Central Problem
  • Hypothesis
  • Summary of Results
• Experiments, containing the following discourse unit types: Fact, Goal, Problem, Method, Result, Implication, Experimental Hypothesis
• Discussion, containing the following subsections:
  • Evaluation
  • Comparison
  • Implications
  • Next Steps


Figure 4: A subset of statements and relations from two biology texts, modeled in Cohere [25]; each ‘target’ is linked into the appropriate location in the underlying documents

II: Entities (contained within the (sub)sections described above), consisting of:
• Domain-specific entities, such as genes, proteins, anatomical locations, etc.
• Figures and tables
• Bibliographic references

III: Relations, consisting of:
• Relations within the document, from entity representations to entities (figure, table and bibliographic references)
• Relations out of the document, from entities to external representations (e.g., from a protein name to its unique identifier in a protein databank)
• Relations between moves within documents (e.g., elaboration, from a summary statement in the Introduction section to a Result element within the Experiment section)
• Relations between moves between documents (e.g., agreement between a Result in one paper and that in another paper)
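A minimal sketch of how the three elements, modules, entities and relations, might be represented and queried together. All identifiers here, including the UniProt-style external id, are invented for illustration and do not come from our corpus:

```python
# The three-element model as one data structure: content modules hold text,
# entities are typed, and relations are typed edges within, out of,
# and (in principle) between documents.
model = {
    "modules": {
        "intro/hypothesis": "Protein X regulates pathway Y.",
        "exp1/result": "Protein X binds Y in vitro.",
    },
    "entities": {"Protein X": {"type": "protein"}},
    "relations": [
        # within-document move relation: a Result supports the Hypothesis
        {"from": "exp1/result", "to": "intro/hypothesis", "type": "supports"},
        # out-of-document relation: entity to an external identifier (made up)
        {"from": "Protein X", "to": "uniprot:P00000", "type": "external_id"},
    ],
}

def supports(claim_id):
    """All module ids whose relation to `claim_id` is of type 'supports'."""
    return [r["from"] for r in model["relations"]
            if r["to"] == claim_id and r["type"] == "supports"]
```

Queries like `supports("intro/hypothesis")` are the kind of operation that would let a reader survey which experimental data underpin a given claim.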

The modular division for the Introduction and the Discussion is based on Harmsze’s model and our own empirical investigations (it was easy to fit a collection of 13 biology articles within this framework, and we hope it will cover the needs of the corpus in general). The Experiments are subdivided in a different way, where smaller elements consisting of one or more phrases are identified using verb tense and cue phrases, as motivated in [24] (a preliminary computational assessment will be given in [23]). Currently, we have marked up a corpus of 13 documents in this format, and we are working on implementing these, linked by the relationships described, in the online argumentation environment Cohere [25]; see Figure 4 for a screenshot of some of our statements in this environment.

One of the main challenges is to represent the argumentation and the research data in a way that will allow a user to quickly survey which claims are based on which experimental data, both within and between research articles. Our final goal is to develop a network that clearly differentiates claims from their validation, based on data, and enables insight into the quantitative motivation of a specific statement from its constituent experimental underpinnings. A further direction is to attempt automatic identification of the elements, specifically the moves within the experiment sections, which could enable a (semi-)automatic representation of a paper as a set of claims and underlying data.

7. Conclusion

Each of these projects has provided us with insights that, in part, have led to the next experiment. In particular, we have explored various incarnations of modular content representations, linked by meaningful relations. In certain cases, this can be fruitful: for example, a modular structure for an encyclopedic work can allow certain user functions that a narrative, linear structure does not allow; the ABCDE format enables an accessible representation of a collection of research papers inside a semantic architecture. The next major issue is to see whether a partly modular, partly linear format, where content elements are at least identified by type (Method, Hypothesis etc.), can indeed replace the existing linear narrative. If it does turn out to enable more useable reading environments, we need to ensure that the creation of the format can be achieved, given current publishing practices. We hope that our current experiments can help provide a format that offers computational handholds to access the argumentative elements within a research paper.

Lastly, we want to state our interest in exploring collaborations on this subject with the myriad initiatives that are currently ongoing, since we firmly believe this complex problem can only be solved by collaborative effort. This issue does not have a purely technological solution; to truly improve the way in which science is communicated will require serious scrutiny by the scientific community of the social, political and psycholinguistic way in which it claims, confirms, and creates knowledge.

8. References

[1] This paper is aimed to describe previous and current projects, and does not contain a theoretical embedding or references to related work; these have been addressed in [9, 10, 22, 24] and will be addressed in a forthcoming [23].
[2] Kircz, J.G. and F.A.P. Harmsze, “Modular scenarios in the electronic age,” in: P. van der Vet and P. de Bra (eds.), CS-Report 00-20, Proceedings Conferentie Informatiewetenschap 2000, De Doelen, Rotterdam, 5 April 2000, pp. 31-43.
[3] Enna, S.J. and D.B. Bylund, Preface, XPharm, doi:10.1016/B978-008055232-3.09004-9
[4] Stuckenschmidt, H., F. van Harmelen, A. de Waard, et al., “Exploring Large Document Repositories with RDF Technology: The DOPE Project,” IEEE Intelligent Systems, vol. 19, no. 3, pp. 34-40, May/Jun. 2004.
[5] Brickley, D. (ed.), RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation 10 February 2004, http://www.w3.org/TR/rdf-schema
[6] For more information, see http://www.info.embase.com/emtree/about/
[7] Fluit, C., M. Sabou, and F. van Harmelen, “Ontology-Based Information Visualization,” in: V. Geroimenko and C. Chen (eds.), Visualizing the Semantic Web, Springer-Verlag, 2003, pp. 36-48.
[8] Waard, A. de and Tel, G., “The ABCDE Format: Enabling Semantic Conference Proceedings,” in: Proceedings of the First Workshop on Semantic Wikis, European Semantic Web Conference (ESWC 2006), Budva, Montenegro, 2006.
[9] Kircz, J.G. and Hans E. Roosendaal, “Understanding and shaping scientific information transfer,” in: Dennis Shaw and Howard Moore (eds.), Electronic publishing in science. Proceedings of the ICSU Press / UNESCO expert conference, February 1996, Unesco, Paris, 1996, pp. 106-116.
[10] Kircz, J.G., “New practices for electronic publishing 1: Will the scientific paper keep its form,” Learned Publishing, vol. 14, no. 4, October 2001, pp. 265-272.
[11] Kircz, J.G., “New practices for electronic publishing 2: New forms of the scientific paper,” Learned Publishing, vol. 15, no. 1, January 2002, pp. 27-32.
[12] Harmsze, F., “A modular structure for scientific articles in an electronic environment,” PhD thesis, University of Amsterdam, February 9, 2000.
[13] Garssen, B., “The nature of symptomatic argumentation,” in: Frans H. van Eemeren, Rob Grootendorst, J. Anthony Blair, Charles A. Willard (eds.), Proceedings of the 4th International Conference of the International Society for the Study of Argumentation, Amsterdam, June 16-19, 1998. Amsterdam: SICSAT, 1999.
[14] Van Mulligen, E.M. et al., “Research for Research: Tools for Knowledge Discovery and Visualization,” Proc. 2002 AMIA Ann. Symp., Am. Medical Informatics Assn., 2002, pp. 835-839.
[15] Broekstra, J., A. Kampman, and F. van Harmelen, “Sesame: An Architecture for Storing and Querying RDF and RDF Schema,” Proc. 1st Int’l Semantic Web Conf., LNCS 2342, Springer-Verlag, 2002, pp. 54-68.
[16] Broekstra, J. and A. Kampman, “SeRQL: Querying and Transformation with a Second-Generation Language,” technical white paper, Aduna/Vrije Universiteit Amsterdam, Jan. 2004.
[17] Stuckenschmidt, H., A. de Waard, R. Bhogal et al., “A Topic-Based Browser for Large Online Resources,” in: E. Motta and N. Shadbolt (eds.), Proceedings of the 14th International Conference on Knowledge Engineering and Knowledge Management (EKAW’04), Lecture Notes in Artificial Intelligence.
[18] Huynh, D., Stefano Mazzocchi, and David Karger, “Piggy Bank: Experience the Semantic Web Inside Your Web Browser,” Proceedings International Semantic Web Conference (ISWC) 2005.
[19] For definitions and examples, see http://en.wikipedia.org/wiki/Semantic_wiki
[20] Pitkin, R.M., Branagan, M.A., Burmeister, L.F., “Accuracy of data in abstracts of published research articles,” JAMA 281 (1999), pp. 1110-1111.
[21] See e.g. the International Committee of Medical Journal Editors, “Uniform Requirements for Manuscripts Submitted to Biomedical Journals: Writing and Editing for Biomedical Publications,” updated October 2007, available online at http://www.icmje.org/
[22] Waard, A. de, Breure, L., Kircz, J.G. and Oostendorp, H. van (2006), “Modeling Rhetoric in Scientific Publications,” in: Vicente P. Guerrero-Bote (ed.), Current Research in Information Sciences and Technologies, pp. 352-356, Badajoz, Spain: Open Institute of Knowledge.
[23] Waard, A. de, “A Semantic Modular Structure for Biology Articles,” forthcoming.
[24] Waard, A. de, “A Pragmatic Structure for the Research Article,” in: S. Buckingham Shum, M. Lind, and H. Weigand (eds.), Proceedings ICPW’07: 2nd International Conference on the Pragmatic Web, 22-23 Oct. 2007, Tilburg, NL. Published in: ACM Digital Library & Open University ePrint 9275.
[25] Buckingham Shum, S., “Cohere: Towards Web 2.0 Argumentation,” in: Proceedings, COMMA’08: 2nd International Conference on Computational Models of Argument, 28-30 May 2008, Toulouse. IOS Press: Amsterdam.


Synergies, OJS, and the Ontario Scholars Portal

Michael Eberle-Sinatra (1); Lynn Copeland (2); Rea Devakos (3)

(1) Centre d’édition numérique, Université de Montréal, CP 6129, succ. Centre Ville, Montreal, QC, H3C 3J7, Canada; e-mail: michael.eberle.sinatra@umontreal.ca
(2) W.A.C. Bennett Library, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada; e-mail: copeland@sfu.ca
(3) Information Technology Services, University of Toronto Libraries, 130 St George St, 7th Floor, Robarts, Toronto, ON, M5S 1A5, Canada; e-mail: rea.devakos@utoronto.ca

Abstract

This paper introduces the CFI-funded project Synergies: The Canadian Information Network for Research in the Social Sciences and Humanities, and two of its regional components. This four-year project is a national distributed platform with a wide range of tools to support the creation, distribution, access and archiving of digital objects such as journal articles. It will enable the distribution and use of social sciences and humanities research, as well as create a resource and platform for pure and applied research. In short, Synergies will be a research tool and a dissemination tool that will greatly enhance the potential and impact of Social Sciences and Humanities scholarship. The Synergies infrastructure is built on two publishing platforms: Érudit and the Public Knowledge Project (PKP). This paper will present the PKP project within the broader context of scholarly communications. Synergies is also built on regional nodes, with both overlapping and unique services. The Ontario region will be presented as a case study, with particular emphasis on project integration with Scholars Portal, a digital library.

Keywords: content management; online publication; digital access

1. Synergies Overview

This four-year project will create a national distributed platform with a wide range of tools to support the creation, distribution, access and archiving of digital objects such as journal articles. It will enable the distribution and use of social sciences and humanities research, as well as create a resource and platform for pure and applied research. In short, Synergies will be a research and a dissemination tool that will greatly enhance the potential and impact of Social Sciences and Humanities scholarship.

Canadian social sciences and humanities research published in Canadian journals and elsewhere, especially in English, is often confined to print. The dynamics of print mean that this research is machine-opaque and hence invisible on the Internet, where many students and scholars begin and more and more often end their background research. In bringing Canadian social sciences and humanities research to the internet, Synergies will not only bring that research into the mainstream of worldwide research discourse but also continue the legitimization of online publication in social sciences and humanities by the academic community and the population at large.

The acceptance of this medium extends the manner in which knowledge can be represented. In one dimension, researchers will be able to take advantage of an enriched media palette: colour, image, sound, moving images, multimedia. In a second, researchers will be able to take advantage of interactivity. And in a third, those who query existing research will be able to broaden their vision by means of navigational interfaces, multilingual interrogation and automatic translation, metadata and intelligent search engines, and textual analysis. In still another dimension, scholars will be able to expand further into areas of knowledge such as bibliometrics and technometrics, new media analysis, scholarly communicational analysis and publishing studies.

Canadian researchers in the social sciences and humanities will benefit from accessing two research communication services within one structure. The first is an accessible online Canadian research record. The second is access to online publication production services that will place their work on record and will ensure widespread and flexible access. Synergies provides both these functions. Built on the dual foundation of Érudit, a Quebec-based research publication service provider in existence since 1998, and Open Journal Systems, a British Columbia-based online journal publishing software suite used by over 1,500 journals worldwide, and the additional technical expertise developed by its three other partners, Synergies will aggregate publications from its twenty-one-university consortium to create a decentralized national platform. Synergies is designed to eventually encompass a range of formats, including published articles, pre-publication papers, data sets, presentations, and electronic monographs: in short, to provide a rich scholarly record, the backbone of which is existing and yet-to-be-created peer-reviewed journals.

Synergies will bring Canadian social sciences and humanities research into the mainstream of worldwide research discourse by using cost-effective public/not-for-profit partnerships to maximize knowledge dissemination. Synergies will also provide a needed infrastructure for the Social Sciences and Humanities Research Council (SSHRC) to follow through on its in-principle commitment to open access and facilitate its implementation by extending the current venues and means for online publishing in Canada.
The members of the Synergies consortium are the University of New Brunswick, Université de Montréal (lead institution), University of Toronto, University of Calgary, and Simon Fraser University. Each brings appropriate but different expertise to the project. At its first level, Synergies consists of this five-university consortium that will provide a fully accessible, searchable, decentralized and inclusive national database of structured primary and secondary social sciences and humanities texts. This distributed environment is technically complex to implement and represents a major political and social collaboration, which attests to the project's transformative dimension for Canadian social sciences and humanities research and researchers. Synergies will be a primary aggregator of research that, in providing publishing services, will allow journal editors (and other producers) to manage peer review, structure subscriptions and maintain revenue control. At a second level, Synergies will reach out to 16 regional partner universities, which will benefit from, and contribute to extending, Synergies functionality. At a third level, in a producer-to-consumer relationship with university libraries and organizations such as the Canadian Research Knowledge Network, Synergies will make possible national accessibility. Using this relationship as a model, Synergies will be positioned to facilitate similar relationships for journals with licensing consortia around the world. There are many Canadian content and network infrastructure initiatives, such as electronic journals, institutional repositories, and electronic resources. Synergies partners and others are developing these infrastructures. What is needed nationally is an infrastructure that integrates these distributed components in order to enhance productivity and accessibility to Canadian social sciences and humanities research at the national and international levels.
The Synergies platform will integrate the outputs from the five distributed regional nodes in a centralized fashion on a large scale. The relevant technology is already partially in place. There is a need, however, to integrate and improve the technical infrastructure, and to address the financial processes whereby information can be made accessible to all Canadians in all sectors. Synergies will create public benefit from public funds invested in knowledge generation. Synergies is not only a pan-Canadian technical infrastructure but also a mobilizing and enabling resource for the entire scholarly community of Canadian social sciences and humanities researchers. In embracing the whole of the social sciences and humanities, Synergies will foster cross-disciplinary, problem- and issue-oriented research while also allowing further research explorations that can be time-framed, discipline-based, media- or methodologically specific, theoretically constrained or geo-referenced. Synergies will thus serve to modernize Canadian social sciences and humanities research communication. It embraces emerging practices by utilizing existing texts, enriching and expanding them, and greatly easing access to scholarly data and to audiences. It further provides deeper organizational capacity for a fragmented research record, ensuring and enhancing access to existing data sets. By providing a robust infrastructure, it allows content producers to explore new business models such as open access. However, it also facilitates access via the aggregation of journals and an ability to enable agreements between Canadian social sciences and humanities journals and other producers' and buyers' consortia. It lays a foundation for expanding the research record to encompass all scholarly inquiry in order to achieve maximum accessibility and circulation. Synergies parallels other national projects and disciplinary databases emerging in other countries, for example Project Muse, Euclid, JSTOR and HighWire in the United States, and Persée and Adonis in France. Like these projects, Synergies will capture and disseminate knowledge through a cost-recovery, profit-neutral model. As mentioned above, Synergies is the result of a collaboration among five core universities which have been working together for several years. With each partner bringing its own expertise to the initiative, a genuine collaboration resulted in an infrastructure that was conceived from the start as truly scalable and extendable. Each regional node will integrate the input of current and future regional partners in the development of Synergies, thus continuing to extend its pan-Canadian dimension. Each node, in close collaboration with the head node, will develop the functionality and sustainability of the infrastructure over the course of the first three years, starting in 2008.
The head node will also co-ordinate the establishment of long-term goals and priorities to ensure that the functionality being developed is appropriate and achieves the overall goal of enhancing the end-user experience.

2. OJS in the Synergies context
As partners in the Synergies project, considering how Simon Fraser University Library would play its role as the British Columbia node, we initially focused on the most obvious and important contribution we could make: digital conversion of Canada's humanities and social science research journals, current and past issues, to electronic form (a Canadian JSTOR, or CSTOR, if you will). However, we quickly realized that we could play another important role; our thinking evolved along the lines reflected in the Ithaka Report [1], and we came to see that our partnership with publishers in this key project could be stronger. It is worth considering this important report in the context of Synergies, and noting that its conclusions and recommendations relating to university presses in the United States also provide an important model for Canadian scholarly journals. The recommendation that universities 'develop a shared electronic publishing infrastructure across universities to save costs, create scale, leverage expertise, innovate, extend the brand of U.S. higher education, create an interlinked environment of information, and provide a robust alternative to commercial competitors' [2] could equally well apply to Canada and its scholarly publishing community. One important facet of the Report's recommendation is that libraries are included in the recommended model; Synergies and OJS have likewise brought together traditional and electronic publishers and academic libraries. The Report notes the strengths that libraries bring to the partnership: technology; expertise in organizing information; storage and preservation capability; and deep connections to the academy, with networks of subject specialists familiar with faculty research, instructional needs and publishing trends. It goes on to note that librarians understand how to build collections and understand disciplinary differences. They understand multimedia content and own enormous collections of value to scholars, have extensive digitization experience and are committed to providing free access. They understand information searching and retrieval. They are relatively well funded (although any university librarian will be quick to note that most of that funding is targeted, and that buying power is decreasing). Libraries excel at service. Through SPARC, they advocate nationally and institutionally to maximize the dissemination and bring down the costs of scholarly information, for example through open access and open-source publishing options. They are good at collaborating across institutions: for example, most Canadian university libraries have reciprocal borrowing and interlibrary loan agreements, and have been highly successful at leveraging online journal costs through consortial organizations such as the Canadian Research Knowledge Network, the Ontario Council of University Libraries and the BC Electronic Library Network (ELN). They have experience in building shared technology; for example, SFU Library, with funding from the BC ELN and the Council of Prairie and Pacific University Libraries, has developed the reSearcher software, which crosslinks index entries and journal content, as well as providing interlibrary loan requesting for print materials. They provide access to their collections through union catalogues, and are extending that model to the digital world, for example through Canadiana.org (formerly AlouetteCanada+CIHM) and of course through Synergies. Complementarily, the Report notes the strengths that publishers bring to the partnership: commercial discipline. They understand the financial aspects of distributing scholarly research, and the need to protect the sustainability of the enterprise. Publishers understand the publishing process, know how to evaluate demand, and are experts at editorial selection, vetting and improving content quality. They work with faculty as the creators of scholarly content. They are marketing experts. They cultivate their longstanding national and international networks among wholesalers, retailers, libraries, and individuals.
They are able to balance exposure for a work, financial rewards for creators and producers, and tolerable costs to consumers (libraries). They understand copyright protection and rights management. Thus the Report sets out how libraries, with their technological resources and expertise, can play a crucial role in fostering scholarly publishing by partnering appropriately with the academics and publishers, who retain responsibility for maintaining the core editorial and peer-review functions. This model is, in some sense, by no means new. For example, UBC Press was successfully launched with the leadership and support of the University Librarian, Basil Stuart-Stubbs. Coincidentally, while the Synergies project was being defined and brought into existence, the Public Knowledge Project evolved from its initial project-based inception into what has become an extraordinarily successful and sustainable partnership. It can be argued that part of the reason for its success lies in the conscious adoption of a partnership very similar to that subsequently laid out in the Ithaka Report. Dr. John Willinsky, originator of the project, continues his vigorous leadership role, successfully attracting funding and new adopters. Not least among the reasons for the importance of the PKP project and its success is the goal of bringing the tools for electronic publishing to developing countries, and their research output to us. There are three software tools in the PKP suite: Open Journal Systems (OJS), which provides a scholarly journal process management framework; Open Conference Systems (OCS), which provides the tools for conference management; and the metadata harvester, which can be configured to harvest a selection of resources and is used, for example, for access to the Canadian Association of Research Libraries' institutional repositories.
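The PKP harvester gathers records over the OAI-PMH protocol, in which a repository answers simple HTTP requests (verbs such as ListRecords) with XML records, typically in unqualified Dublin Core. As a rough illustration of the data involved, the sketch below parses a fabricated, minimal ListRecords response and extracts the Dublin Core titles; it is not PKP's actual code, only a sketch of the record format a harvester consumes.

```python
# Sketch: pulling dc:title values out of an OAI-PMH ListRecords response.
# SAMPLE is an invented, minimal response, not output from a real repository.
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

def titles_from_oai_response(xml_text):
    """Extract dc:title values from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(DC_NS + "title")]

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Open Access in Canada</dc:title>
      </oai_dc:dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""

print(titles_from_oai_response(SAMPLE))  # ['Open Access in Canada']
```

A production harvester would add resumption-token handling and incremental (datestamp-based) harvesting on top of this parsing step.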
SFU Library has undertaken the role of system development and maintenance, as well as providing a hosting service for interested journals and conferences. Under the leadership of Dr. Rowland Lorimer, the SFU Canadian Centre for Studies in Publishing and the CCSP Press provide the publishing support itemized in the Report. Under this partnership, OJS has expanded its take-up to over 1,500 journals worldwide, been translated into dozens of languages, and developed partnerships with, among others, SPARC; the International Network for the Availability of Scientific Publications; Oxford; the Instituto Brasileiro de Informação em Ciência e Tecnologia, Brasilia; the Red de Revistas Científicas de América Latina y El Caribe, España y Portugal (REDALYC), Mexico; FeSalud - Fundación para la eSalud, Málaga, España; the Journal of Medical Internet Research; the Multiliteracy Project; and the National Centre for Scientific Information, Indian Institute of Science, Bengalooru. Dr. Richard Kopak and Chia-ning Chan of UBC and the team of Dr. Ray Siemens at the University of Victoria, who are BC partners in Synergies, are contributing to the reading tools, which will form significant added value for researchers. The growth has not been without its challenges, though they have all been met; most notable was the SFU Library hosting of 'Open Medicine', whose immediate success led us to realize we needed to ensure 24x7 uptime. With visits reaching over 40,000 per month, pkp.sfu.ca is the twelfth most visited SFU site. A continuing requirement is the recruitment of intelligent, inquiring and thoughtful individuals to work on various aspects of the project, but the possibility of working at a distance has to some extent ameliorated the scarcity of available local talent. With interest among all of the Synergies partners in aspects of Open Journal Systems (OJS), Open Conference Systems (OCS) and the metadata harvester, it became apparent that, in addition to fostering the transition or development of Canadian SSH journals online, a key component of the SFU Library role in Synergies will be to co-ordinate and foster the further development of the PKP software so that its features meet the needs of Synergies partners and, more importantly, of Canadian scholarly SSH research publication. The Synergies nodes at the University of New Brunswick and the University of Toronto are contributing to the development of the software. This co-ordination will of course continue to involve our many other international partners. Development is thus focused on particular features such as interoperability with the Synergies national portal site, statistical reporting, reading tools, aggregator modules, scholarly monograph management, and interoperability with institutional repository software such as DSpace. The Synergies partnership has also led to a fruitful and ongoing exchange between the PKP and Érudit developers, in particular through the technical committee.
What is most exciting and encouraging is that our Synergies and international partners are undertaking much of the development work, enabling us to truly embody the vision of open-source collaborative software development. As is often noted, failures to achieve that vision have much to do with the requisite time commitments and resources. Synergies funding allows us to overcome those barriers, to the benefit of Canadian and international scholars and publishers. The Ithaka Report concludes that “It is one thing to say that the organization needs to have a coherent vision of scholarly communications, quite another for provosts, press directors and librarians to agree on what that is and to put it into effect – especially when elements of this vision must be embraced across institutions… The basic infrastructure is there, and the question now is what the next layer (or layers) will look like. The recent report on cyberinfrastructure in the humanities and social sciences explored this question and focused attention on the state of scholarly communications in these fields. In addition, the terrain may now be more fertile for elements of the electronic research environments described in our report to take root, as the necessary ingredients (e.g. growing interest in eBooks) are falling into place. Finally, there is more recognition that the challenges are too big to “go it alone,” and that individual presses or even universities lack the scale to assert a desirable level of control over the dissemination of their scholarly output.” [3] This conclusion applies no less to the Synergies project, and to the PKP partnership.

3. Ontario Scholars Portal
In addition to the development of two publishing platforms and the national portal, the work of Synergies will be carried out by a series of linked regional centres. Each region will provide a common set of core services to Canadian scholars. In addition, regional nodes are focusing on related key elements; the Ontario region is exploring search. A key issue for electronic publications is academic findability, acceptance and persistence – clearly the latter two are related to the first. Canadian scholars and publishers want to be found on the open Net, but also in established scholarly databases. The Synergies Ontario node comprises York University and the Universities of Guelph, Toronto (lead) and Windsor. Services provided include journal hosting using OJS, conference hosting using Open Conference Systems and repository services using DSpace. All the selected platforms facilitate search-engine crawling, but what about recognized scholarly finding tools such as abstracting and indexing sources? This is a common question from journal editors: how can I get indexed in the leading A&I disciplinary database(s)? What is the application process? And how long will I have to wait? The Ontario Synergies regional node has partnered with the Ontario Council of University Libraries (OCUL) [4] to provide a presence not only within a well-known scholarly finding tool, but within one that seeks to integrate itself into the academic workflow and emerging library archiving practices. OCUL is a consortium of twenty university libraries, including the Ontario Synergies partners. OCUL's vision is to be a recognized leader in provincial, national and international post-secondary communities for the collaborative development and delivery of outstanding and innovative library services. Organizational goals include building effective practices for advocacy, collaboration and organizational development; providing robust, sustainable and innovative access and delivery services; and building comprehensive and integrated digital collections. Projects often begin with grant funding, but ongoing costs are then assumed by the membership. Founded in 1967, OCUL serves approximately 382,000 FTE students, staff and faculty within the province of Ontario. Joint services include resource sharing, collective purchasing and the joint creation of the digital library, Scholars Portal (SP). Scholars Portal provides the infrastructure for all Ontario universities to support electronic access to major research materials. The portal is a gateway to a wide range of information and services for all faculty and students in the Ontario universities. The goals of SP are to support research, enhance teaching, simplify learning and advance scholarship.
Specifically, Scholars Portal was established in 2002 with four primary objectives:

1. To provide for the long-term, secure archiving of resources to ensure continued availability.
2. To ensure rapid and reliable response times for information services and resources.
3. To provide an environment that fosters additional innovation in response to the needs of users.
4. To create a network of intellectual resources by linking ideas, materials, documents and resources.
OCUL’s strategy focuses on locally hosting and integrating a range of collections and services into Scholars Portal:

SP contains approximately 200 million citations from 200 locally loaded abstracting and indexing databases: approximately 47% are scientific citations, 29% multidisciplinary, 18% social science and 5% from the arts and humanities.

Thirteen million full-text journal articles from over 8,250 journals are locally loaded. In 2007, 4.2 million articles were downloaded. Publishers include Elsevier, Oxford, Taylor and Francis, Berkeley and the American Chemical Society. Member libraries have integrated Scholars Portal into RefWorks and course management systems such as Blackboard.

RefWorks hosting is provided not only for Ontario but for a total of sixty-seven institutions from every Canadian province. Thirty thousand regular users log in about 160,000 times a month during peak academic periods. These users collectively manage over 4 million citations.

The Ontario Data Documentation, Extraction Service and Infrastructure (ODESI) project, in the early stages of implementation, will provide researchers with data discovery and extraction services for social science survey data. It is expected that this service will grow to include geospatial data. Current plans are aimed at providing a rich tool set for users:

Provincial funding will allow some Scholars Portal data and search functionality to be made freely accessible. This will include open access journals, 120,000 books scanned from the University of Toronto collection as part of the Open Content Alliance, and some journal metadata.

Distributing and archiving over 150,000 e-books on ebrary's stand-alone technology platform, ISIS.

Two planned initiatives carry special import for Synergies: the migration of data into Mark Logic and trusted digital repository certification. SP has begun migrating locally loaded data from ScienceServer to a Mark Logic content platform [5]. Mark Logic stores XML documents, an encoding format increasingly used by publishers, in native format. By building indexes not only on words but on individual works and on XML elements and attributes, such as tables or illustrations, it captures context and can therefore provide richer search. Relevance-based searching, facet-based browsing, thesaurus expansion, language-based stemming and collations, automatic classification, web services and AJAX facilitate incorporating current Web technologies into a new interface. As part of the content migration, SP staff will be transforming all records from the proprietary ScienceServer DTD to the NIH Journal Archiving and Publishing Schema. Not only is the NIH schema non-proprietary, it also supports both full-text and metadata-only sources; ScienceDirect is a metadata-only DTD. The NIH schema will also allow for links to external resources, such as the GenBank database, and data integration with external applications such as Google documents. The target release date is September 2008. In order for Synergies data to be fully searchable within Scholars Portal, we have begun mapping the OJS native and Érudit DTDs to the NIH schema and have loaded a few sample journal issues. Once this pilot is complete we will pilot integrating other content types and later invite participation from other Synergies-hosted content and OA providers. It is often difficult for small journals to transition into the electronic realm, let alone alter their production methods to fully exploit that realm's potential. Over the course of the project, we will be seeking cost-effective methods to assist journals with this transition.
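To make the kind of mapping described above concrete, the sketch below builds a minimal record in the NLM/NIH journal tag set (article > front > article-meta, with title-group and contrib-group) from a flat metadata dictionary. The input field names are an invented simplification, not the actual OJS or Érudit DTD elements, and the output shows only a small fragment of the full schema.

```python
# Sketch: emitting a minimal NLM/NIH-tag-set article from flat metadata.
# The input dict is a hypothetical stand-in for an OJS/Érudit record.
import xml.etree.ElementTree as ET

def to_nlm(record):
    """Build a minimal <article> element from a flat metadata dict."""
    article = ET.Element("article", {"article-type": "research-article"})
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")
    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = record["title"]
    contribs = ET.SubElement(meta, "contrib-group")
    for name in record.get("authors", []):
        contrib = ET.SubElement(contribs, "contrib",
                                {"contrib-type": "author"})
        ET.SubElement(contrib, "string-name").text = name
    return article

record = {"title": "Sample Article", "authors": ["A. Author"]}
print(ET.tostring(to_nlm(record), encoding="unicode"))
```

A real migration would of course be driven by a full crosswalk between the source DTD and the target schema, and would carry body text, identifiers and citation data as well as the descriptive metadata shown here.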
Processes are being re-engineered not only to accomplish the move to Mark Logic but also to begin satisfying the requirements of a "trusted digital repository" [6]. External review of practices and policies is planned for 2008-9. Consultation with the University of Calgary, which is charged with developing a preservation framework for Synergies partners, is scheduled. Looking ahead, OCUL envisions a future in which Scholars Portal can connect citations to the user's workflow and support collaborative research. Synergies shares similar aims, though it focuses on moving and aggregating Canadian scholarly works online.

4. Conclusions
The Synergies project is important for granting councils, for universities, for individual journals, for academics, and for Canadians. Synergies will facilitate both public access within Canada and international access and prestige. Academics' citations will increase substantially as they enjoy much greater national and international exposure. Journals will be able to increase their exposure and find new ways of aggregating content with comparable journals while maintaining their financial viability. Universities, through their institutional repositories, will increase their international reputations. Scholars representing Canadian universities will benefit from an enhanced profile on the national and international stages. Based on the current expertise of each of the regional partners, Synergies will also very quickly be in a position to become a leader in the field of digital humanities and electronic publications around the world. Thus, Canada's position on the international level will be reinforced through Synergies and the research it will enable. Furthermore, Synergies will establish itself as an advisory technical committee for policymakers in Canada, and will play a role in the development of future collaborative projects around the world. The project will also help transform scholarly communication and promote a greater degree of interdisciplinary work. All Canadians are stakeholders in this enterprise and should be vitally interested in it, since they will be able to benefit from access to the research that is paid for by their tax dollars and that is contributing to transforming their society in an effort to democratize knowledge. More than just benefiting present-day research, the organization of data within the Synergies infrastructure will be standardized for use by future research initiatives. An initial investment in Synergies thus profits not only already-identified research projects but also many research projects to come. Academic communities in Canada and elsewhere will have access to content that was previously unavailable or obtainable only with great difficulty. As well, this content will enjoy the extensive functionality—powerful searching tools, textual and other forms of computer-assisted analysis, and cross-referencing between disciplines—that will be available in the online environment developed by Synergies.
Moreover, Synergies will allow researchers to ask new questions, to draw on previously inaccessible information sources, and to disseminate their results to a much broader range of knowledge users in the public, private, and civil sectors of society. All of these possibilities will greatly benefit Canada as a whole. Once fully operational, Synergies will provide researchers, decision-makers and Canadian citizens with direct, organized and unprecedented access to the vast store of knowledge created within our universities, in both official languages, regardless of geographic location, subject or discipline.

5. Acknowledgements
The authors would like to thank the other members of the Synergies steering committee for their input on an earlier version of this essay: Guylaine Beaudry, Gérard Boismenu, Thomas Hickerson, Greg Kealey, Ian Lancashire, Rowland Lorimer, Erik Moore, and Mary Westell.

6. Notes and References
[1] Brown, Laura; Griffiths, Rebecca; Rascoff, Matthew. University Publishing in a Digital Age. Ithaka Report, July 26, 2007. <http://www.ithaka.org/strategic-services/Ithaka%20University%20Publishing%20Report.pdf>
[2] Ibid., p. 32.
[3] Ibid., p. 33.
[4] <http://www.ocul.on.ca>
[5] <http://www.marklogic.com>
[6] RLG/OCLC Working Group on Digital Archive Attributes. Trusted Digital Repositories: Attributes and Responsibilities. An RLG-OCLC Report, 2002. <http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf>
African Universities in the Knowledge Economy: A Collaborative Approach to Researching and Promoting Open Communications in Higher Education

Eve Gray (1), Marke Burke (2)

(1) Centre for Educational Technology, University of Cape Town, Private Bag, Rondebosch 7701, South Africa. Email: eve.gray@gmail.com
(2) Researcher, Link Centre, School of Public Development and Management, University of the Witwatersrand. Email: burkem@developmentwork.co.za
Abstract

This paper describes the informal collaborative approach taken by a group of donor funders and researchers in southern and eastern Africa, aimed at consolidating the results and increasing the impact of a number of projects dealing with research communications and access to knowledge in higher education in the region. The projects deploy a variety of perspectives and explore a range of contexts, using the collaborative potential of online resources and social networking tools for the sharing of information and results. The paper provides a case study of donor intervention as well as analysing the methodologies, approaches and findings of the four projects concerned. It explores the ways in which the projects and their funders have had to address the global dynamics of knowledge, the changes in research practices being brought about by information and communication technologies, and the promise these could hold for improved access to knowledge in Africa. Finally, the conclusions address the complex dynamics of institutional change and national policy intervention and the ways in which a collaborative approach can address them.

Keywords: digital scholarship; knowledge ecology; open education; open access; scholarly publication

1. Introduction

For our continent to take its rightful place in the history of humanity ... we need to undertake, with a degree of urgency, a process of reclamation and assertion. We must contest the colonial denial of our history and we must initiate our own conversations and dialogues about our past. We need our own historians and our own scholars to interpret the history of our continent.
President Thabo Mbeki, launching the South Africa-Mali Timbuktu Library Project

When it comes to access to knowledge in higher education institutions in African countries, the emphasis has tended to be, in the first instance, on the difficulties that African researchers face in gaining access to expensive commercially published journals and books, and the extent to which this disables African participation in the knowledge society. John Willinsky is but one of a number of authors who have described the dismal circumstances in which African researchers work, with empty library shelves and minimal access to international resources. He also describes some of the initiatives that have been put in place to remedy this situation, such as the negotiation of special journal packages by the International Network for the Availability of Scientific Publications (INASP) and the World Health Organisation. On the other side of the coin are the difficulties experienced by African scholars in publishing from their home countries, also described in some detail by Willinsky [1] [2]. These are not only problems of resources, of funding for paper and printing, of the difficulties of print distribution or computer availability and bandwidth [3], but also of the power dynamics of international scholarly publishing, a more neglected topic. Developing countries, especially in Africa, face a broad spectrum of research infrastructure and capacity constraints that limit their capability to produce scientific output and to absorb scientific and technical knowledge. Unequal access to information and knowledge by developing nations, exacerbated by unequal development and exchange in international trade, serves to reinforce the political and cultural hegemony of developed countries. Knowledge-based development will continue to have insignificant impact as long as this asymmetry in research output and access to up-to-date information remains [4]. There is no doubt that when it comes to participation in the global knowledge economy, Africa is particularly badly represented. According to a 2002 survey by the African Publishers' Network, Africa produces about 3% of all books published, yet consumes 12% [5]. The statistics are even worse when considering Africa's contribution to the internet. In 2002, Africa produced only 0.04% of all online content and, if one excludes South Africa's contribution, the figure falls to 0.02% [6]. When it comes to journal publishing, the power dynamics of this commercialised global sector are clearly demonstrated.
In 2005 there were 22 African journals out of 3,730 journals in the Thomson Scientific indexes; twenty of these were from South Africa. The major Northern journals account for 80% of the journals in the Thomson Scientific indexes; just 2.5% overall come from developing countries [7]. Given the overwhelming social, economic and political problems that so many African countries face, the major need is for locally relevant research to be effectively disseminated in order to have maximum impact where it is most needed. This is skewed by the global scramble for publication in the most prestigious journals as African scholars and their universities seek to establish their rankings in a competitive global research environment [8] [9]. The situation in most African countries has been compounded by decades of IMF and World Bank structural adjustment programmes, based on Milton Friedman’s theory that economic growth is generated through investment in primary education, while higher education creates unrest and instability [10]. This has led, in most African countries, to the decimation of higher education infrastructure and the virtual destruction of research capacity. South Africa is much better off in terms of research capacity; however, the higher education sector faces complex challenges as it addresses its post-apartheid transformation needs. Ondari-Okemwa [11] categorises constraints specific to knowledge production and dissemination into economic (inadequate funding and budgetary cuts, lack of incentives, brain drain), technological (internet connectivity and telecommunications infrastructure) and environmental factors (freedom of expression).
Kanyengo and Kanyengo [12] identify the non-existence of information policies for handling information, poor ICT infrastructure to manage the preservation of knowledge resources, inadequate financial resources, a lack of technical knowledge and legal barriers as the key impediments to preserving information resources as inputs into knowledge production. Where there is agreement is that one of the major priorities for addressing Africa’s development challenges should be knowledge production by African researchers working primarily at African institutions, focusing on locally relevant knowledge production. According to Sawyerr [13], this insistence ‘on African research and researchers at African institutions is to ensure rootedness and the sustainability of knowledge generation, as well as the increased likelihood of relevance and applicability. This condition presupposes local institutions and an environment adequate to support research of the highest calibre; and insists upon




Eve Gray; Marke Burke

the rootedness of such research as well as its positive spill-over effects on the local society’.

2. Policy contradictions

The policy framework that lies behind these projects has been described in Eve Gray’s paper on research publication policy in Africa, produced for an Open Society Institute (OSI) International Policy Fellowship. This revealed fault lines and contradictions in South Africa’s well-elaborated research policy, which were reflected in policy developments in the region. Broadly speaking, policies that impact on research dissemination veer between an emphasis on the public role of the university, which demands social and economic impact in the national community, and an international role that is framed in the discourse of a competitive system of citation counts and international scholarly rankings. The former places the emphasis on the knowledge society, the use of ICTs and open and collaborative approaches to research; the latter on individual effort, proprietary intellectual property (IP) regulation and monetary returns garnered through the leverage of university IP in the knowledge economy [8]. This is something that is eloquently explored in a recent paper by Jean-Claude Guédon, who points out that these two terms are not co-terminous: ‘the universality of scientific knowledge differs fundamentally from its globalisation’ and ‘it is clear that the present situation of access to scientific publications arises less from aspirations for a “knowledge society” but rather from the rapacity of a “knowledge economy”’ [14]. One effect of the latter strand of policy is – strangely, given the emphasis on the need for national development impact – a remarkably narrow conceptualisation of what constitutes research publication.
Peer-reviewed journal articles, books, chapters in books and refereed conference proceedings are valued and supported in a region that, given the serious developmental challenges it faces, could learn from the efforts of the Department of Education, Science and Training (DEST) in Australia to grapple with a broader conception of what could constitute effective research publication, given the opportunities offered by ICT use in a changing research environment [15].

3. Donor collaboration

This paper will review the ways in which a group of projects in southern Africa are seeking to address these issues through informal collaboration by donor funders seeking to maximise the impact of their interventions. Discussion of this collaboration started at the Workshop on Electronic Publishing and Open Access in Bangalore in 2006. This workshop recognised the potential for collaboration between second economy countries as a power base for change and was attended by delegates from India, Brazil, South Africa and China. This recognition of the importance of collaboration spilled over into tea-break discussions about the fragmentation of donor interventions in southern Africa and the need for a consolidated and coordinated approach. In response, a group of funders and researchers – from the OSI, the International Development Research Centre (IDRC) and the Shuttleworth Foundation – subsequently met at the iSummit in Dubrovnik in June 2007 to take this idea further. The decision was that the funders would map their various projects in consultation with one another in order to try to achieve a consolidated impact in the transformation of policy and practice for the use of ICTs and open access publishing to increase access to knowledge in Africa. The projects that have emerged from this informal initiative thus consciously cross-reference one another in the pursuit of these goals, contractually requiring that research findings be made freely available through open licences, and also sharing project resources and findings through the use of social networking tools. This has already proved effective, as the projects have shared literature surveys and reading lists; have exchanged findings; have collaborated in interviews and workshops; and

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008



have used collaborative workspaces and online discussion forums to exchange ideas and track common areas of interest.

4. The projects

This paper will describe four open access and scholarly publishing projects currently included in this collaborative effort, charting the ways in which they impact on one another and how their findings could coalesce to create an impact greater than the sum of their parts. These projects recognise that the achievement of shifts in policy and practice in an environment as conservative as the university sector, and as sensitive as the under-resourced African higher education system, needs a multi-pronged approach, working at all levels of the system – institutional, national and regional – to change entrenched policy and practice. A complex approach would have a better chance, this collaboration suggests, of delivering a substantial shift, leveraging the potential of ICT use and open access publishing models to transform African knowledge dissemination. The projects all focus on the production of African knowledge from Africa, for African purposes, rather than the question of access alone. These projects also all share a contextual understanding of the need to take into account the changing research and teaching environment that has resulted from the impact of ICTs across the academic enterprise. Research is increasingly characterised by a greater emphasis on interdisciplinary, multidisciplinary and transdisciplinary practices; an increasing focus on problems rather than techniques; and more emphasis on collaborative work and communication [16] [17]. This in turn creates new information and dissemination needs, since there is an increased demand for access to a wider range of more diverse sources; for access mechanisms that cut across disciplines; and for access to, and management of, non-traditional, non-text objects. The four projects to be evaluated in this paper are:

The OpeningScholarship Project funded by the Shuttleworth Foundation and carried out in the Centre for Educational Technology (CET) [18] at the University of Cape Town is using a case study approach to explore the potential of ICTs and Web 2.0 to transform scholarly communication between scholars, lecturers and students and also between the university and community. The focus is at an institutional level; the lever for change is seen as the ICT systems that this institution has invested in and their use within the university.

The Publishing and Alternative Licensing Africa (PALM Africa) project, funded by the IDRC. The project is working across the conventional publishing industry and open access content providers, seeking to better understand how flexible licences, including online automated licensing systems such as CC+ and Automatic Content Access Protocol (ACAP), can facilitate citizens’ access to knowledge in the digital environment and how the adoption of new and innovative business models of publishing can help African countries improve the publishing of learning materials. The first investigations are being carried out in South Africa and Uganda.

The Opening Access to Southern African Research Project, carried out by the Link Centre for the Southern African Regional Universities Association (SARUA) and funded by the IDRC, is studying access-to-knowledge constraints in southern African universities and the role of open approaches to research and science cooperation. The research project aims to inform the development of the basis for policy advocacy at the institutional, country and regional levels with respect to academic publishing and knowledge sharing in the ‘digital commons’ context.






The Shuttleworth Foundation and the OSI are supporting the production of the Publishing Matrix, an overview of the workings of the publishing industry – formal and informal – to allow researchers, activists and funders to better understand the context in which they are operating. The problem that this project addresses is that if projects are to achieve wider access to learning materials in Africa, they need to be backed by an understanding of how publication and knowledge dissemination works in the countries concerned, where there are blockages and weaknesses in the provision of learning materials and other knowledge resources, and where traditional systems are working well.

The projects described share methodologies of qualitative analysis, exploratory, descriptive and action research. They combine higher education policy studies with analysis of technology use and its impact. They share the perception that, as a result of the changes being brought about in research and teaching through ICT use, technical, organisational and communication infrastructure needs to be analysed in an integrated knowledge cycle. Most strikingly, in contrast to many open access initiatives, the projects combine to explore the potential for finding solutions that could also involve the publishing industry, formal and informal, in changed business practices that could deliver sustainable models for greater access to learning materials. In analysing these projects we will consider open access in the context of university missions: academic teaching (knowledge building cycle), research (research, development and innovation cycle) and social engagement (promoting the utilisation of knowledge produced in universities for the benefit of communities/ society). The potential value of commons-based, open access approaches for universities would be the creation of an environment which fosters a more rapid growth of the volume of research output than is currently occurring, and the more effective utilisation of research activity to expand the knowledge base in any particular field by building on what has gone before. The conceptual framework shared by these projects acknowledges the context of African countries and their universities in the emerging information and knowledge economy, a world view that regards information and knowledge as central to the development and emergence of a new form of social organisation. This view endorses the role of universities as centres of knowledge production, with a primary mission to produce, communicate and disseminate knowledge. 
Using the case studies of the projects described above, this paper will describe the barriers in national and institutional policies that currently block the use of ICTs for enhanced access to knowledge and will report on the shifts that are taking place as a result of these interventions. Each project is examined in some detail, exploring the project methodology and its findings before drawing conclusions about the collaboration between the projects.

5. OpeningScholarship: the picture of an institution

The OpeningScholarship project is being carried out in the Centre for Educational Technology (CET) at the University of Cape Town, with the aim of investigating the impact of the use of ICT in scholarly communications in one of South Africa’s leading research universities. Acknowledging the impact of social networking and Web 2.0 on the hierarchies of knowledge production, and the role that can be played by a range of formal and informal technologies, the question asked by the OpeningScholarship project is how the ICT systems that are in place could help deliver much greater intellectual capacity; how a university like UCT could make the most effective use of its research knowledge; and how it could avoid dependency by relying on its own intellectual output rather than on imported content. It also acknowledges the disruptive potential of ICT use: the ways in which changing communications could break down disciplinary silos in an increasingly inter-disciplinary research environment, breaching the walls of the traditional



curriculum. The choice of university for this study was influenced by the fact that UCT, South Africa’s leading research university, has made a serious investment in its ICT infrastructure, designed to allow the university to develop and leverage the knowledge that it produces in innovative ways. UCT is also unusual in having invested in the development of an institutional infrastructure in the Centre for Educational Technology that combines technical, research and pedagogical skills in an academic department. The explicit aims of the department are to enrich and enhance the curriculum; provide for the needs of a diverse student body; and support staff in transforming, improving and extending their practice. CET is a partner in the international Sakai collaborative project for the development of an open source learning management system (LMS); in fact, it was the first non-USA member of the community. The development of the UCT version of Sakai, Vula, has provided an interesting perspective on the relationship between open source and open access in delivering the increased capacity being sought through this project, as well as providing a potential platform for opening resources. The project has not taken a narrow view of what constitutes scholarly communications. It has taken seriously the university’s statement of its own mission and national higher education policy in tracking scholarly communications in three directions:

• Academic scholarship: academic to academic;
• Teaching and learning: academic to student, student to academic, student to student;
• Community engagement: university to community (and community to university).

Although some work has been done at UCT and other South African universities to reveal how ICTs could support academic scholarship, teaching and learning, not enough has been done in terms of understanding how ICTs could be usefully employed in supporting community engagement and, more particularly, how ICTs could undergird a coordinated approach to academic scholarship, teaching, learning and community engagement. On the national level, the question would be how to use ICTs to grow access to South African (and African) knowledge to deliver the aspirations of national policy, as set out in the White Paper on Science and Technology (1996), and the key objectives identified in the university’s own strategy. This is an important reflection of the South African government’s view of the role of the university in a knowledge society, particularly in an African country, in which research investment, the government suggests, needs to be recovered by way of impact on national development goals, for social upliftment, employment, health and economic growth. The use of ICTs is seen as an important component of this process, essential tools if South African universities are to take their place in the global knowledge economy. As the South African White Paper on Science and Technology (1996) spelled it out: The world is in the throes of a revolution that will change forever the way we live, work, play, organise our societies and ultimately define ourselves... Although the nature of this information revolution is still being determined... [t]he ability to maximise the use of information is now considered to be the single most important factor in deciding the competitiveness of countries as well as their ability to empower their citizens through enhanced access to information.
[19]

What this project has aimed to do, therefore, is to pull together the various initiatives that are taking place and identify how maximum use could be made of ICTs at UCT to advance research, teaching and learning and community engagement through a coordinated set of coherent policies, action plans and technological and infrastructural systems.





5.1 Methodology

The principal methodology of this project has been the use of case studies to map a variety of uses of ICTs for scholarly communication at the University of Cape Town. The project has drawn upon desk research, semi-structured interviews, focus groups and questionnaires in conducting these case studies. The project is contextualised in a review of international best practice and national policy and practice, in order to frame the case studies being conducted. The approach of this project has therefore been to take the university as a case study within the national context, exploring the ways in which the institution is reflecting national policy and matching this against international best practice. Finally, within the university, case studies have explored how individual academics and departments have been using ICTs in transformative ways for research communication, teaching and learning, and social impact, and what lessons for university policy and strategy can be learned from this.

5.2 Findings

The project is in its final stages, due for completion in July. Its findings, although not finally analysed and integrated, are therefore fairly complete. Although UCT’s mission incorporates teaching and learning, research and social responsiveness as if they are equally rated, in reality the system is heavily weighted towards research, and research of a particular kind. The impact of national policy in this regard was evident in all the case studies dealing with research publication. Not unexpectedly, the project revealed the extent to which institutional behaviour is distorted in South Africa by the financial rewards paid to universities by the Department of Education for publication in accredited journals, books and refereed conference proceedings. The rush for a substantial revenue stream, reinforced by the appeal the policy makes to an entrenched conservatism, particularly in the upper ranks of the university hierarchy, leads the university to place a very strong emphasis on targets for the production of journal articles in particular ‘accredited’ journals. This is further strengthened by a system of competitive rankings for individual scholars run by the National Research Foundation, based on the metrics of citation counts. Both of these mechanisms place a neo-colonial emphasis on the primacy of international rather than local performance, and on the metrics of citation counts as opposed to any attempt at evaluating the contribution that scholarship is making to the nation or region. The predictable results are that the production of local scholarly publication is under-supported, with an equally predictable backward drag on the professionalism and quality of a number of journals, in an environment where journal publishing in the traditional model is unlikely to be self-sustaining [20].
Moreover, the activities of academics involved in publishing and editing are not tracked centrally in the university system, although they may be reflected in a fragmented way in departmental records. South Africa shares the common presumption in the English-speaking world that scholarly publication is not the university’s business to fund. While the university seems willing to invest very large sums of money in patent registration, presumably against the (largely unrealistic) expectation of revenue, the much smaller sums needed for publication do not feature in its budgets. This means that there is no source of financial support for the development of digital open access publications, nor for the payment of author fees for publication in international open access journals. Although there are open access journals being published at UCT, such as Feminist Africa in the African Gender Institute [21], and expressions of interest by existing and potential new journals, there has been little actual take-up of open access scholarly publishing at UCT. In part as a result of the OpeningScholarship project, and in part because of a national project for the promotion of open access scholarly publishing funded by the DST and being delivered by the Academy of Science of South Africa (ASSAf) [14], there



is increasing awareness of the potential of electronic publishing and open access to increase research impact. However, given the lack of institutional support systems – financial and informational – or any centralised institutional policy framework for open access publishing, this interest remains fragmented. On the horizon, however, is the prospect of the creation of a national journal platform as part of the ASSAf programme for the development of local open access journals. As a result of the OpeningScholarship project, the UCT research department has become aware of the fact that its systems were only tracking the authorship of publications, and not the publications being produced on campus, nor the activities of UCT academics who are journal editors. There is therefore support for authorship and neglect of publishing efforts, even though this neglect is potentially detrimental to levels of authorship, as under-supported journals struggle to produce issues on time. An interesting spin-off from the project was the recognition that the profile of information in the university’s central systems can wield considerable power. This became clear in a workshop on UCT’s research information system at which Australian and South African universities compared their use of the publication record module of their shared system. A report from the University of Sydney provided a vivid example of how the creation of a record system, linked to a digital repository that records all publications – formal and informal – has served as a tool both to expand the dissemination of university research and to profile and promote the university. Another issue that has emerged at institutional level is the fact that the university has a centralised facility for the use of ICT for teaching and learning through CET, but there is no university-wide integrated system to support not only teaching and learning but also research data and publications across the institution.
Given that the DST is planning to implement policy on access to data from public funding in 2008-9, this will become an increasingly important issue. Also, if UCT is to retain its status as a leading research institution, given developments in higher education elsewhere, it will have to begin to address cyberinfrastructure needs for the 21st century, in collaboration with the South African higher education sector as a whole. Where technology use is having increasing impact is in teaching and learning, and this is because of UCT’s investment in CET as a dedicated department for research and development in ICT use for education. It became clear in 2007, when the Vula LMS – UCT’s version of Sakai – was launched, that the use of an open source system that was user-friendly and capable of adaptation to user needs has substantially increased the use of ICT for teaching and learning on campus. The courses delivered through the online LMS grew from 191 in 2006 to 908 in 2007, the year that Vula was launched. To date in 2008, less than half way through the year, 933 courses are being delivered through Vula. The figures show very clearly that there is a strong response to the use of ICTs for teaching and learning, with a particularly steep rise in 2007, when the Vula system replaced the custom system that had hosted courses prior to the creation of CET as a university-wide service. Anecdotal evidence suggests that this is at least in part the result of the ease with which lecturers can create their own course profiles and upload their course materials, given that this is an open source system, as well as of student pressure for what they find a congenial and supportive environment. The case studies of teaching and learning practice within Vula have revealed that the development of innovative teaching tools and learning environments is largely the result of individual motivation.
Although there are ostensibly weightings for teaching practice in the promotion system and the university offers prizes for good teaching, the perception of lecturers is that the primary route to promotion is through the publication of journal articles, preferably in international journals. An important driver of innovation, the case studies show, has been the support by the Mellon Foundation for Teaching with Technology grants [22]. These relatively small grants have been the source of a number of innovative programmes in Vula, using multimedia, animated simulations for technology teaching and





conceptual understanding; mobile technology for course administration and interaction; and simulation exercises building on social networking. What this makes clear is that relatively small levels of funding support can bring disproportionate results. While the university has shown vision in funding CET from mainstream funding, unlike most other universities, it still does not fully fund the department, and there are still many posts in the department supported by grant funding. The case studies have revealed that the development of online courses is very much bound into individual departments, and even into individual courses within departments, in an institutional culture which is still built around disciplinary ‘silos’. The university does not seem to have much centrally coordinated space for collaboration in teaching and learning. The OpeningScholarship project has, however, revealed the potential for significant interdisciplinary collaboration once the connections have been made. For example, the project found that there was more than one department involved in the use of the VPython open source software for the creation of animated simulations to enhance theoretical understanding and develop algorithmic thinking in science and technology. This is of vital importance in a country that still faces a deficit in scientific education. The opening up and expansion of these resources could therefore be of national value. The problem, however, in delivering a vision such as this would be the question of resourcing in an already-stretched university system. The courses that use these innovations also demonstrate changes in the power relations between students and lecturers, with students playing a more active role in knowledge production. Another set of courses, using online simulations of a different kind, demonstrates a similar change in lecturer-student dynamics through the creation of online communities and role-playing.
The Department of Public Law’s international law course, Inkundla yeHlabathi/World Forum, for example, has created an innovative tutorial simulation in which students learn to apply the rules and methods of international law through a series of African case studies from the 1960s to the present day by simulating the work of legal advisers to ten African states. The course is delivered through a combination of formal, doctrinal lectures, small-group tutorials and the Inkundla yeHlabathi simulation. A compilation of cases and materials, the e-casebook, is made available to students both online and on CD-ROM for offline use. This course has been recognised in the Sakai community, through the Teaching with Sakai Innovation Award sponsored by IBM, as one of the two most innovative courses in the Sakai community worldwide in its use of technology for transformative teaching. When it comes to opening access to these resources beyond the university, the vision of Salim Nakhjavani, the course convenor, is gradually to engage other African universities in parts of the simulation, deployed through Sakai and hosted by the University. The University of the Witwatersrand (Johannesburg, South Africa) plans to join one component of the simulation in late 2008. This course provides an exemplar of the unfolding nature of open education and highlights some of the challenges that need to be addressed in order to make this simulation open to all. These challenges include copyright clearance and long-term sustainability models. While many of the cases that the students use, from the International Criminal Court in particular, are in the public domain, many commentaries on these and other cases that they need to refer to are not. In courses such as these, students form active communities and, according to their lecturers, can develop passionate alliances related to their roles in the simulations.
There is also the potential for increased contributions to the creation of online learning materials by students. A further and unexpected sign of student willingness to create their own space in the system is the fact that a growing number of student societies are using the Vula space to manage their communities and their projects. Students are active in community responsiveness projects at UCT, and the latest social responsiveness report at UCT reports on two of these [23]. While there is still no comprehensive tracking of public benefit programmes at UCT, the university has become aware of the need to demonstrate its


African Universities in the Knowledge Economy

contribution to national development and so has started an initiative to track and report on the various projects on campus, both student- and staff-driven. At a recent social responsiveness workshop it became clear not only that there are a large number of programmes at UCT that make a considerable contribution, but also that among these are projects that produce a variety of publications: research reports, policy guidelines, training manuals, community information resources, and popularisations. These are produced without financial or logistical support, and the projects concerned complained that their work was not recognised as academic output and did not receive recognition or incentives, in spite of its importance to the community and the university's reputation. The importance to the university is recognised at senior level. As Deputy Vice-Chancellor Martin Hall expressed it, in a climate in which government is questioning whether it gets value from its investment in higher education research: [U]niversities [need to] assert the importance of their independence, and the value of the knowledge commons as a seedbed of innovation ranging from product development to the design of effective public policies. They also recognize that, for the knowledge commons to acquire public credibility and support, they need to show how their work is responsive to the pressing objectives of development. In pursuit of this, they develop a range of smart interfaces with both the state and private sectors, promoting effective knowledge transfer, and showing, through example, how there can be a valid social and economic return on public investment in their resources. [24] A number of projects mentioned that they used Vula to support their work, and it is clear that a central record of these projects, support for publications through the provision of publishing platforms, and the creation of an institutional repository of research outputs would all be welcomed.
The question the university needs to confront is how much effective dissemination and publication would add to the impact of its social responsiveness programmes, and how much this would contribute to profiling the institution and to its ability to attract government and donor funding.

6. PALM Africa – from polarisation to collaboration

Publishing and Alternative Licensing in Africa (PALM Africa), funded by the IDRC and led by Dr Frances Pinter, Visiting Fellow in the Centre for the Study of Global Governance at the London School of Economics, addresses some of the sustainability issues raised by the OpeningScholarship project and Opening Access to Knowledge in Southern African Universities. In an African context, in which access to internet connectivity is often limited and the distribution of learning materials is a serious challenge, what is missing, this project argues, is research on how open access approaches employing flexible licensing could work in conjunction with local publishing in developing countries to improve access to learning materials. Through the action research element of the project it is expected that a variety of new business models appropriate for Africa will be devised and tested. The focus of the project is intended to be primarily on the higher education sector, both because the levels of ICT infrastructure and connectivity in this sector are adequate to the task and also to align the project with the focus of other IDRC interventions in the region. The overarching research question that this project addresses is whether the adoption of more flexible licensing regimes could contribute to improved publishing of learning materials in Africa today. An important component of this project is the recognition of the contribution that can be made by professional publishing skills: the services of commissioning, editing, design, marketing, validating, branding and distributing learning materials. The project explores how more flexible licensing regimes might allow publishers to access a broader range of materials to which they might add local relevance, publish successfully and distribute in a manner that leads to more sustainable publishing and improved access for readers.
In other words, what is being explored in this context is the potential for increased access to be generated through partnerships




Eve Gray; Marke Burke

between open access and commercial publishing models, or through the use of innovative licensing and business models that address the particular difficulties of African markets. The possible solutions to the various structural and process issues that are beginning to emerge from this study might range from alternative business models in market sectors in which the ‘free online’ open access models might be sustainable with public funding, to more complex models combining the commercial and the ‘free’ in various new ways. The scholarly literature has identified a number of viable ‘some rights reserved’ models with reference to a few examples, primarily in the fields of music and software. This is the first comparative study of its kind that engages with stakeholders to build up appropriate business models from inside the industry and then proceeds to test the viability of those models. The countries that will be participating are South Africa and Uganda. In the higher education sector, the problems that this project will address include the current difficulties faced in the development of and access to scholarly writing and textbooks produced on the African continent, given resourcing problems and small market size, as well as the barriers to inter-African trade. Then there are the barriers imposed by the high cost of imported textbooks when they are shipped or co-published using conventional publishing business models. Finally, there is the need for localisation of international materials [3], [6]. In all of these cases, there is potential for electronic publishing to transcend the distribution difficulties and added costs that arise in the physical movement of books across such great distances. The final product could then be either e-books, where technology availability allows, or affordable print copies produced locally for the local market.
Chris Anderson’s ‘Long Tail’ market model would suggest that, given that these are marginal markets for international publishers, there should be opportunities for exploring new financial models – including the potential for open access and commercial models used in conjunction with one another – in order to find innovative ways of meeting market needs without the high prices that have accompanied the conventional book trade models.

6.1 Methodology

This project brings together active research in the form of publishing demonstration projects combined with an academic assessment that reviews whether or not liberalising licences may bring about improvements in the publishing process, defined as increased access to materials while maintaining the sustainability of publishing services. Hence the emphasis is on collaborative efforts to find practical solutions. The outreach activities aim to create a space for discussion of the outputs and outcomes of the projects so as to encourage a deeper understanding of the role of licensing and broader engagement with decisions on the types of licences that fit specific needs. The methodologies being employed in the project include a literature review; qualitative analysis through questionnaires delivered at a stakeholder seminar; and a publishing workshop for capacity-building in each country. Following these interventions, publishing exercises will be supported in each country and a comparative analysis made of the results.

6.2 Findings

This project is still at an early stage, with publishing workshops due to be held in Uganda and South Africa in May and August 2008. However, some interesting insights are emerging already, some as a result of collaboration with other projects. It has become clear that the formal publishing industry internationally is trying to come to terms with the digital age and is experimenting with a number of new business models. This new disruptive digital technology is necessitating new approaches to copyright. Yet, where we stand today is still at the incubation stage of



these new models, with caution competing with boldness as the industry tries to find ways of recovering its investments. In the meantime there is still an urgent need to see how these new models may facilitate access and distribution in developing countries. Discussion and debate about new licensing and business models are becoming more insistent in the global North, but are less evident in Africa. This is ironic, as it is in the difficult market conditions in Africa that the use of flexible business models, linked to digital content delivery, could have real traction. In South Africa, connections have been forged between the OpeningScholarship project and PALM Africa. Some of the larger academic textbook publishers are interested in exploring the changing environment of teaching and learning at UCT and, as a result, a workshop with one publisher and an interview session with another have already been organised, which included representatives from the PALM and OpeningScholarship projects. It was clear from these discussions that the publishers were beginning to realise the need to grapple with a changing environment brought about by the use of online learning platforms. This is challenging them to explore changing business models, and there is now an interest in exploring how to interface with online and multimedia content being developed in the universities. There might also be potential for exploring licensing options for the inclusion of textbooks and commentaries in online delivery in learning management systems such as Vula, for fully integrated course material incorporating published materials and university-generated content. It is in the OpeningScholarship project that the first steps are being taken to investigate the copyright solutions that could allow such materials to be opened beyond the originating university.
The results of the PALM Africa project should help provide sustainability models for the delivery of scholarly and textbook materials in an African context and, it is hoped, help foster inter-African trade, using flexible licensing and print-on-demand to overcome the current barriers that inhibit it. There would also appear to be potential for exploring different licensing models to make available publications from the long tail of international publishers, lowering the cost of bringing specialist but vitally necessary publications into Africa.

7. The Publishing Matrix – mapping the publishing industry

The Publishing Matrix project, funded by the Open Society Institute and the Shuttleworth Foundation, arises from the acknowledgement that the access debate is now shifting from access alone to a consideration of the need for participation by developing countries in open knowledge production. If projects to achieve this goal are to succeed, they need to be backed by an understanding of how publication and knowledge dissemination work in the countries concerned, where there are blockages and weaknesses in the provision of learning materials and other knowledge resources, and where traditional systems are working well. There also needs to be an understanding of how the supply chain works in the different publishing sectors, particularly where print products would be needed. An example has been a rush to provide free textbooks for schools in developing countries. Initial enthusiasm is now being tempered by the realisation that the inhibiting factors preventing the wide dissemination of school textbooks do not reside in content development alone, but in printing and distribution. Donors, activists and policy-makers are seeking a more complex understanding of how best to advance access in circumstances where print products need to be distributed in what are often complex supply networks. While there is a common understanding in the open access movement of the goals that are being pursued, there are obstacles as the new world meets with the old. People of good will are struggling to find consensus on what aspects of traditional ways of learning and communicating should be preserved and how we might be better served by newer ways of generating and communicating knowledge. Vested interests abound, opportunists deflect and derail good intentions, but equally serious is a lack of understanding


of past, present and possible future contexts, and this is leading to fragmented decision-making. Policies are being made on the hoof, with unintended consequences that can destroy many of the skills and capacities we actually wish to preserve. Equally, fear of the unknown is holding back the taking of justifiable risks. This study is intended to pull the various strands together, identify the friction nodes, and contribute to creating a more strategic vision of what changes to support. Outsiders face some difficulty in understanding how publishing works, not least in grappling with the fact that price differentials between countries are not determined only by publishers’ pricing practices, but by a complex set of circumstances in an industry in which price can be largely a function of the size of the readership in a particular country. This study is therefore intended first and foremost to be a roadmap that aims to help others engaged in changing how knowledge, emanating from both developed and developing countries, is communicated and how that may reduce the knowledge divide. The project will also contribute a better understanding of what professional publishing skills are needed for the effective development and dissemination of knowledge products. For example, the often-cited case study of the HSRC Press in South Africa, which offers a dual model of digital open access and for-sale print publication, depends upon a highly professional publishing and marketing team for its success. The Publishing Matrix is being prepared as a wiki that will provide an account of how publishing works in different sectors along the value chain, providing multiple perspectives on how the industry works. The information produced will help inform the PALM Africa project and should provide a useful resource for the investigation of ways to improve access to knowledge in the southern African region.

7.1 Findings

Although it is too soon to have hard findings to hand, there are some realisations that are already offering new insights. Most striking was the realisation, when the matrix outline was drawn up and the different sectors profiled, of just how much publishing actually happens outside of the publishing industry. A number of NGOs have been practising what are effectively open access models for years, while corporate and government publishing also produces a wide range of content, including training and curriculum materials. This needs to be better factored in when mapping the access-to-knowledge terrain.

8. Opening Access to Knowledge in Southern African Universities

SARUA has, in collaboration with the International Development Research Centre (IDRC) and the Link Centre at the University of the Witwatersrand, launched a research study entitled Opening Access to Knowledge in Southern African Universities to study the issues of ‘access to knowledge’ constraints in Southern African universities and the role and potential contribution of Open Access frameworks and initiatives for research in the region. The project is a qualitative research study that will be implemented in seven countries in the Southern African region over a ten-month period. The study will assess the current situation pertaining to access to knowledge constraints in Southern African universities and the role of Open Access frameworks and initiatives for research and scientific collaboration. The research questions being asked explore:

• The existing constraints to the availability of academic and other relevant research publications in the social sciences and humanities, the health sciences, and the natural sciences and engineering.

• The extent to which Southern African universities are changing their practices relating to the production and dissemination of research and publications and, if so, how.

• How Southern African universities can increase the availability of academic and other relevant research publications to students and researchers.

• What measures would be required to encourage new approaches to knowledge production and dissemination in Southern African universities among librarians, research managers and prominent researchers/scientists.

• How open access could benefit and contribute to scientific collaboration and endeavour, and what its implications would be for research across higher education institutions in the region, given the current limitations confronting Southern African universities.

• How feasible the establishment of a SARUA regional open access network (or networks) based on an ‘open knowledge charter’ and the development of a Science Commons would be, and what the options for doing so would be.

SARUA seeks to promote Open Access for increased quality research, enhanced collaboration, and the sharing and dissemination of knowledge. The outcomes of this project will inform the development of a longer-term SARUA programme aimed at promoting awareness and mobilising university leadership across the region to promote free access to knowledge and enhance scientific research and collaboration. The importance of SARUA’s involvement in this project is that, as a regional university association, it has the potential for real traction in the formulation of policy in a wide region, involving some 64 universities in African countries south of the Sahara. The findings of this research project could therefore be of considerable importance in providing the base for the regeneration and growth of southern African universities. The project acknowledges that developing countries, especially in Africa, face a broad spectrum of research infrastructure and capacity constraints that limit their capability to produce scientific output and absorb scientific and technical knowledge. Unequal access to information and knowledge by developing nations, exacerbated by unequal development and exchange in international trade, serves to reinforce the political and cultural hegemony of developed countries. Knowledge-based development will continue to have insignificant impact for as long as this asymmetry in research output and access to up-to-date information remains [4]. At the same time, the project acknowledges the importance of the network society, in which, as Manuel Castells [25] describes this order, the generation of wealth, the exercise of power and the creation of cultural codes depend on technological capacity, with information technology as the core of this capacity.
This project is delivered in the understanding that knowledge production, communication and dissemination are becoming central to the mission of all universities in the 21st century, thus enabling a shift beyond teaching towards research and civic engagement. It acknowledges the ways in which the internet and other collaborative technologies are changing the way universities conduct their business: by making it possible to conduct collaborative research across disciplines, institutions and countries; for researchers and students to share working research and publications online; and to promote e-learning for undergraduate and post-graduate programmes. This creates the opportunity for African universities to participate in global knowledge production activities, with significant potential gains through, inter alia, increased resources for research and publication in local and international academic journals. For institutions operating in developing countries within resource-constrained environments, such as SARUA member institutions, these technologies and associated practices offer tremendous opportunities for improving the research, publishing and dissemination processes and putting Southern African knowledge at the service of local economies and societies. The critical question is whether we are positioning our institutions to take advantage of these opportunities.


This question can only be answered if we understand the present constraints to knowledge production, processing and dissemination within our universities and the extent to which collaborative technologies and their associated practices can contribute to increasing our capacity for generating knowledge and expanding existing knowledge. The rise of open approaches to scientific endeavour and research is closely associated with open-source technologies, open access and open data. Open research, for example, can significantly contribute to generating knowledge within our institutions.

8.1 Methodology

The study is employing a qualitative strategy of inquiry. A review of the literature and document analysis is being undertaken to assess the emerging developments and trends internationally and in Southern Africa. The literature review will serve to frame the inquiry and provide the basis for the formulation of the research questions and the key informant interview guidelines. A research methodology workshop has also been held to refine the design and methodology for the study in a participative way. The research will be aimed at gathering data from two respondent groups. The first group will consist of the heads of research and research managers of the selected universities, and the second group will comprise key informants in the community of librarians, academic publishing houses and teaching, research and scientific communities based at the universities.

8.2 Findings

Although the research findings for this project are not yet available, the initial results of the survey will be reported at the ELPUB 2008 conference.

9. Conclusions

Taken together, these four projects, although still in progress, have already demonstrated that there are gains to be made in collaboration between projects offering different perspectives on common problems. The projects share a common understanding of the importance of the information and knowledge economy, and also of the inequalities inherent in the economics and politics of global knowledge production. Acknowledging the changing research environment, in which collaboration is of primary importance and the hierarchies of knowledge production are changing, these projects together chart, at different levels and from different perspectives, how these changes are playing out in Africa. Mapping across the four projects, it becomes clear that before formulating policies and strategies at the national level, there needs to be an understanding of the institutional climate within the universities and the competing cultures within the institutions, as well as of the needs of the communities within which they are operating. A number of issues have emerged from the combined wisdom of these projects that would need to inform any effort to bring African research into the cyber-age and ensure that it is effectively published:

• The dominant culture of research publishing needs to be interrogated, with its narrow focus on journal articles in particular and its uncritical adherence to a global model that in fact depends upon an inequitable, imperialist and commercially-driven value system. Rather, the full value of the research being produced in African universities needs to be released for the benefit of the continent.


• In order to achieve this, there would need to be a radical change in the current attitude among university administrations, government and senior academics that support for publication and dissemination is not the job of a university. It is clear that in a digital world, the universities and allied NGOs are already at the forefront, exploring ways of harnessing the potential of the internet, something that publishers could learn from. This in turn needs government policy to recognise open and collaborative approaches to generating impact from research investment, rather than only the lock-down and proprietary models of patents and copyright protection currently valued.

• Learning from the example of CET at UCT, there clearly needs to be a better understanding of what ICT infrastructure needs to be in place not only for teaching and learning, but also for an integrated approach to research management and effective research dissemination and publication. This would in turn need to include grappling with the changing job profiles and reward systems for staff working in ICT, who need to combine technical, research, communications and pedagogical skills.

• In Africa, there is a need to interrogate the common wisdom of both the open access movement and the commercial publishing models of the North, to come up with sustainability models that are workable on the African continent. There is also, given the marginalised position of African research, a need for the incorporation of professional publishing skills and effective, targeted publishing strategies, wherever these are sited, for research outputs to reach their intended markets.

• The PALM Africa project and the Publishing Matrix provide a salutary reminder that in the African context, where resources are scarce, the use of new business models and commercial partnerships might well be needed to provide sustainability, particularly in a context where print is often still needed. Flexible licensing can also address the needs of a changing environment in the NGO sector, in which blended approaches are needed that combine sustainability and public interest.

Taken together, these projects should help to provide a comprehensive vision of the (complex) steps that would be needed to create a publishing environment that could harness the potential of ICTs and open access approaches to give a voice to African knowledge. The SARUA project on Opening Access to Knowledge in Southern African Universities should hold the key to advancing this vision in the region as a whole, with the capacity to drive an initiative at the upper levels of the university administrations in the region.

10. Notes and References

[1] WILLINSKY, JOHN. The Access Principle: The Case for Open Access to Research and Scholarship. Cambridge, MA: MIT Press, 2006.
[2] KIRSOP, BARBARA & CHAN, LESLIE. Transforming Access to Research Literature for Developing Countries. Serials Review 31: 246–255, 2005. https://tspace.library.utoronto.ca/bitstream/1807/4416/1/Kirsop_Chan_SerialsReview.pdf (accessed May 2007).
[3] GRAY, EVE. Academic Publishing in South Africa. In Evans N & Seeber M (eds), The Politics of Publishing in South Africa. Scottsville: Holger Ehling Publishers & University of Natal Press, 2001.
[4] CHAN, LESLIE & COSTA, SELY. Participation in the global knowledge commons: Challenges and opportunities for research dissemination in developing countries. New Library World 106 (1210/1211): 141–163, 2005.
[5] WAFAWAROWA, BRIAN. Legal Exception to Copyright and the Development of the African and Developing Countries' Information Sector, 2000.
[6] CZERNIEWICZ, LAURA & BROWN, CHERYL. Access to ICTs for teaching and learning: from single artefact to inter-related resources. Paper presented at the e-Merge 2004 Online Conference: Blended Collaborative Learning in Southern Africa, University of Cape Town, July 2004. http://emerge2004.net/profile/abstract.php?resid=7 (accessed May 2007).
[7] KING, DONALD. The scientific impact of nations. Nature 430: 311–316, 2004.
[8] GRAY, EVE. Achieving Research Impact for Development: A critique of research dissemination policy in South Africa. OSI Fellowship Policy Paper, Budapest, 2007. http://www.policy.hu/gray/IPF_Policy_paper_final.pdf (accessed May 2008).
[9] STEELE, C., BUTLER, L. & KINGSLEY, D. The publishing imperative: The pervasive influence of publication metrics. Learned Publishing 19(4): 277–290, 2006.
[10] BLOOM, D., CANNING, D. & CHAN, K. Higher Education and Economic Development in Africa. Washington: World Bank, 2005. http://siteresources.worldbank.org/EDUCATION/Resources/278200-1099079877269/547664-1099079956815/HigherEd_Econ_Growth_Africa.pdf (accessed August 2006).
[11] ONDARI-OKEMWA, E. & MINISHI-MAJANJA, MK. Knowledge management education in the departments of Library/Information Science in South Africa. South African Journal of Libraries and Information Science 73(2): 136–146, 2007.
[12] KANYENGO, CW. & KANYENGO, BK. Information Services for Refugee Communities in Zambia. Proceedings of the 72nd IFLA World Library and Information Congress, 20–24 August, Seoul, Korea, 2006.
[13] SAWYERR, A. African universities and the challenge of research capacity development. Journal of Higher Education in Africa 2(1): 213–242, 2004, p. 218.
[14] GUÉDON, JEAN-CLAUDE. Accès libre, archives ouvertes et États-nations: les stratégies du possible. AMETIST, Numéro 2. http://www.ametist.inist.fr/document.php?id=465. (My translation)
[15] AUSTRALIAN PRODUCTIVITY COMMISSION. Public Support for Science and Innovation. Research report, Canberra: Productivity Commission, 2007. http://www.pc.gov.au/study/science/finalreport/index.html (accessed May 2007).
[16] HOUGHTON, JOHN W, with STEELE, COLIN & HENTY, MARGARET. Changing Research Practices in the Digital Information and Communication Environment. Federal Government of Australia, DEST, 2003.
[17] BELL, ROBERT K, with HILL, DEREK & LEHMING, ROLF F. The Changing Research and Publication Environment in American Research Universities. Working Paper SRS 07-204, July 2007. Division of Science Resources Statistics, National Science Foundation, 2007.
[18] http://www.cet.uct.ac.za
[19] DACST (DEPARTMENT OF ARTS, CULTURE, SCIENCE AND TECHNOLOGY). White Paper on Science and Technology: Preparing for the 21st Century. Pretoria: Department of Arts, Culture, Science and Technology, 1996.
[20] ASSAF (ACADEMY OF SCIENCE OF SOUTH AFRICA). Report on a Strategic Approach to Research Publishing in South Africa. Pretoria: Academy of Science of South Africa, 2006.
[21] http://web.uct.ac.za/org/agi/
[22] The Mellon Foundation has been the major supporter of CET, and was responsible for the funding of the unit in its original incarnation as the Multimedia Education Unit. Mellon still funds posts within CET, although the university has now taken responsibility for supporting the major part of the department's human resource and infrastructure needs.
[23] http://www.uctsocialresponsiveness.org.za/home/default.asp
[24] HALL, MARTIN. Freeing the knowledge resources of public universities. Unpublished conference paper: KM Africa – Knowledge to Address Africa's Development Challenges. University of Cape Town, March 2005.
[25] CASTELLS, M. The Rise of the Network Society (second edition). Oxford: Blackwell, 2000, p. 356.


271

Open Access in India: Hopes and Frustrations

Subbiah Arunachalam
Trustee, Electronic Publishing Trust for Development
Flat No.1, Raagas Apartments, 66 Venkatakrishna Road, Mandiaveli, Chennai 600 028, India
e-mail: subbiah.arunachalam@gmail.com

Abstract
The current status of scientific research in India and the progress made in open access – OA journals, OA repositories and open courseware – are reviewed. India is essentially feudal and hierarchical; there is wide variation in the level of engagement with science and research, and there is a wide gap between talk and action. Things never happen till they really happen. The key, therefore, is constant advocacy, never slackening the effort, and deploying both bottom-up and top-down approaches. The author's own engagement with the Science Academies and key policymakers is described. The Indian Institute of Science is likely to deposit a very large proportion of the papers published by its faculty and students over the past hundred years in its EPrints archive, and there is hope that CSIR will soon adopt open access.
Keywords: India; open access; CSIR

1. Introduction

Intellectual (or knowledge) commons share with natural resource commons, such as forests, grazing land, fisheries and the atmosphere, features such as congestion, free riding, conflict, overuse and pollution. But there is a big difference. Natural resources belong to the zero-sum domain: if you share something, your stock dwindles. But knowledge wants to be shared, and when shared it grows! Both kinds of commons, however, require strong collective action, self-governing mechanisms and a high degree of social capital on the part of the stakeholders. Unfortunately, knowledge can be enclosed, commodified, patented, polluted and degraded, and the knowledge commons can become unsustainable. That is exactly what we have allowed to happen to much of the knowledge produced by scientists around the world over the past few centuries and recorded in journals. We have allowed copyright laws to protect the interests of publishers, who are intermediaries in conveying the knowledge to others, rather than the interests of the knowledge creators, viz. the authors of research papers, who want to give away their knowledge for free.

The past two decades have seen the emergence of a movement that seeks to restore the knowledge commons to the knowledge creators by facilitating open access. Although the open access movement began before the advent of the Internet, it would not be an exaggeration to say that it could not have grown but for the emergence and widespread use of the Internet. This movement, like everything else, is uneven. It has done well wherever the stakeholders were able to ensure a certain degree of collective action, self-governing mechanisms and social capital. For example, physicists started technology-enabled sharing of preprints about two decades ago and are now moving to the next level with INSPIRE, whereas chemists are even now unable to get out of the shackles imposed by one of their own societies.
Some countries, like the UK and the USA, have made some progress, whereas many other countries are lagging far behind. Among the developing countries, Latin America, and notably Brazil, have done better than others.

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008




This paper presents a status report on open access in India.

2. General trends in scientific output and publishing from India

Independent India, led by Jawaharlal Nehru, decided to invest in science and technology and to use S&T to leverage development efforts and improve the people's standard of living. Ever since, virtually all political parties and the people have generally supported investing in science, even though one in four Indians lives below the poverty line. This is not at all surprising, considering that knowledge has always been highly valued in India.

Today India has a large community of scientists and scholars, and Indian researchers perform research in a wide variety of areas, including science, technology, medicine, the humanities and the social sciences. They publish their research findings in a few thousand journals, roughly half of them Indian journals and the rest foreign, most of them low-impact journals. The Indian Academy of Sciences and the Council of Scientific and Industrial Research have long been the leading publishers of S&T journals in India. The other Academies, professional societies, educational institutions and a few commercial firms also publish journals, but not many of them are indexed in SCI or Web of Science, which are selective in their coverage. MedKnow Publications, a Bombay-based private company, is emerging as a quality publisher of medical journals. As social science has long been neglected, there are not many social science journals of repute from India; the Economic and Political Weekly has a sizable following.

India trains a very large number of scientists and engineers, and a large percentage of the best graduates, especially those trained at the famous IITs, migrate to the West, where they seem to perform well.
An article in Forbes said, "India is the leader in sending its students overseas for international education exchange, with over 123,000 students studying outside the country in 2006." Indians constitute the largest contingent of foreign students in the USA; a recent estimate puts the number at over 83,000. The number of Indian students enrolled in British universities in 2006 was about 25,000. Of late there has been a sizable outflow of students to Australian universities, and the Australians believe that most of them want to stay on in their country.

Research is performed essentially in three sectors: (1) higher educational institutions, such as the universities and deemed universities numbering over 400, the Indian Institutes of Technology and the Indian Institute of Science; (2) laboratories under different government agencies, such as the Council of Scientific and Industrial Research (CSIR), Department of Atomic Energy (DAE), Indian Space Research Organization (ISRO), Defence Research and Development Organization (DRDO), Indian Council of Agricultural Research (ICAR) and Indian Council of Medical Research (ICMR); and (3) laboratories in the industrial sector, both public and private. Besides these, a number of non-governmental organizations and think tanks also contribute to India's research output. Although its overall share of funds invested in R&D is decreasing, the Government continues to be the major source of funding for research, currently accounting for about 70%. Industry's share is increasing, as more and more Indian companies acquire overseas companies in sectors ranging from automobiles and steel to pharmaceuticals, tea and information technology, and as many multinational corporations set up research centres in India to take advantage of the high-quality researchers they can hire at costs much lower than in the West.

One would think that everything is fine with science and technology in India. Far from it.
In terms of the number of papers published in refereed journals, the number of citations to these papers, citations per paper, and international awards and recognitions won, India's record is not all that encouraging. In the past few years, things have started looking up. In Table 1 I present some data on India's contribution to the research literature of the world as seen from well-known databases. I also provide the number of papers from the People's Republic of China to see India's



contribution in perspective. India now accounts for 3.1% of the journal papers abstracted in Chemical Abstracts; a few years ago the figure was a rather poor 2.4%.

Year   Scopus            MathSciNet      Engineering Village+   SciFinder
       India    China    India   China   India    China         India    China
2007   43005    190847   1765    11252   25126    205734        41697    235309
2006   40749    179388   1949    11762   25954    199881        38253    222371
2005   36385    157809   1936    10073   21870    173291        33675    183931
2004   32319    111219   1777    9544    18982    121725        29341    126647
2003   29972    74895    1904    8663    16804    81604         25985    106518

*Data accessed on 29 May 2008. Data for 2007 in MathSciNet and in Engineering Village+ (which includes both Compendex and INSPEC) are incomplete.

Table 1. The numbers of papers indexed from India and China in major bibliographic databases*

In Table 2 I present some data on the number of papers from India and three other countries, and the average number of citations won by research papers from these countries, as seen from Scopus. No doubt the number of papers from India is increasing steadily, but the growth rate is nowhere near that of China. India moved from 13th rank in 1996, in terms of the number of papers published in journals indexed by Web of Science, to 10th in 2006, whereas China moved during the same period from 9th to second position.

Year        China               India               South Korea         Brazil
            Doc / Rank / C/d    Doc / Rank / C/d    Doc / Rank / C/d    Doc / Rank / C/d
1996        26853 / 9 / 4.37    20106 / 13 / 6.13   9669 / 20 / 8.55    8497 / 21 / 9.42
1997        29871 / 9 / 4.55    20694 / 13 / 5.67   11876 / 16 / 8.17   10167 / 20 / 8.45
1998        31887 / 8 / 4.24    19755 / 13 / 5.78   11579 / 16 / 8.88   10357 / 20 / 8.49
1999        36180 / 8 / 4.58    22578 / 12 / 5.30   14645 / 16 / 8.54   12196 / 18 / 7.95
2000        42250 / 6 / 4.29    22788 / 12 / 5.17   16321 / 15 / 8.07   12857 / 17 / 7.36
2001        55850 / 5 / 3.22    23362 / 12 / 4.34   17930 / 14 / 6.63   12708 / 19 / 5.84
2002        55400 / 5 / 3.32    24838 / 12 / 3.82   18740 / 14 / 5.74   14590 / 16 / 4.93
2003        66748 / 5 / 3.26    28741 / 12 / 3.31   23406 / 14 / 5.00   16978 / 17 / 4.31
2004        98577 / 2 / 1.92    30258 / 12 / 2.26   27200 / 14 / 3.14   18695 / 18 / 2.93
2005        148221 / 2 / 1.92   34849 / 11 / 1.00   32488 / 13 / 2.88   21239 / 17 / 1.29
2006        166205 / 2 / 0.12   38140 / 10 / 0.19   34025 / 12 / 0.22   25266 / 15 / 0.22
1996-2006   758042 / 5 / 3.14   286109 / 12 / 3.91  217879 / 14 / 5.84  163550 / 18 / 5.56

Source: SCImago Journal & Country Rank (based on data from Scopus), courtesy Prof. Félix de Moya of Grupo SCImago, Spain. Doc = number of documents. C/d = citations per document, computed for the 11-year period; note the decrease in value for later years.

Table 2. Output of research papers from selected countries

Data provided in Table 3 (courtesy In-cites) clearly show that in no field does India receive enough citations to be on a par with the world average. In certain fields, such as physics and materials science, the gap is narrow, but in most areas of the life sciences the gap is indeed large. In areas like plant and animal science and immunology, Indian research appears to be way behind.



India, after near stagnation, is now on the growth path. In the past two years the government has increased investments in both higher education and R&D. New specialized higher educational institutions are being set up in the hope that some of them will eventually emerge as world-class institutions. The Science Academies are discussing ways to improve the quality of science education with a view to drawing better-educated graduates into research.

Field                        Percentage of papers from India   Relative impact compared to world
Materials Science            5.12                              -25
Agricultural Sciences        5.06                              -57
Chemistry                    4.81                              -34
Physics                      3.71                              -20
Plant & Animal Sciences      3.44                              -65
Pharmacology                 3.21                              -45
Engineering                  2.93                              -28
Geosciences                  2.72                              -51
(India's overall percent share, all fields: 2.63)
Ecology/Environmental        2.55                              -51
Space Science                2.52                              -47
Microbiology                 2.18                              -50
Biology & Biochemistry       2.06                              -56
Mathematics                  1.72                              -43
Computer Science             1.57                              -29
Immunology                   1.19                              -65
Clinical Medicine            1.18                              -56
Molec. Biol. & Genetics      1.17                              -62
Economics & Business         0.75                              -52
Social Sciences              0.73                              -44
Neurosciences & Behavior     0.55                              -51
Psychology/Psychiatry        0.30                              -38

Courtesy: SciBytes - ScienceWatch, Thomson Reuters
Table 3. Number of Indian papers published in different fields during the five years 2002-2006 and citations to them [Data from National Science Indicators, Thomson Reuters]
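The "relative impact" column in Table 3 can be read as the percentage by which the citation rate of Indian papers in a field falls short of the world average for that field. A minimal sketch of that reading, using hypothetical citation rates rather than figures from the table:

```python
def relative_impact(country_cites_per_paper, world_cites_per_paper):
    """Percentage difference from the world citation rate in a field;
    a negative value means the country's papers are cited less than
    the world average."""
    return round((country_cites_per_paper / world_cites_per_paper - 1) * 100)

# Hypothetical field where the country's papers average 3.0 citations
# each against a world average of 4.0 citations per paper:
print(relative_impact(3.0, 4.0))  # -25, i.e. 25% below the world average
```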

3. Awareness of OA in India

Scientists do research and communicate results to other scientists. They build on what is already known, on what others have done – the 'shoulders of giants', as Newton said. Indian scientists face two problems common to scientists everywhere, but acutely felt by scientists in poorer countries: access and visibility.

1. They find it difficult to access what other scientists have done, because of the high costs of access. With India's annual per capita GDP well below US $1,000, most Indian libraries cannot afford to subscribe to the key journals needed by their users. Most scientists in India are forced to work in a situation of information poverty. Even the programmes supported by UN agencies are not available for free in India, even though India's per capita GDP is far below the agreed-upon threshold of US $1,000.

2. Other researchers are unable to access what Indian researchers are doing, leading to low visibility and low use of their work. As Indian scientists publish their research in thousands of journals, small and big, from around the world, their work is often not noticed by others working in the same and related areas, even within India. No wonder Indian work is poorly cited.

Both these handicaps can be overcome to a considerable extent if open access is adopted widely both within and outside the country. That is easier said than done. As an individual, I have been actively advocating open access for the past seven years. There are a few more advocates and proponents of OA in India, but what we have to show is rather limited.

With the advent of the Internet and the Web, we need not suffer these problems any longer. If science is about sharing, then the Net has liberated the world of science and scholarship and made it a level playing field. The Net and the Web have not merely replaced print and speeded things up; they have inherently changed the way we can do science (e.g. eScience and Grid computing), collaborate, data-mine, and deal with datasets of unimaginable size. But the potential is not fully realized, largely because most of us are conditioned by our past experience and are inherently resistant to change. Our thinking and actions are conditioned by the print-on-paper era, especially in India! From colonial days, most people do things only when they are told to.

The situation with accessing overseas toll-access journals has improved considerably, thanks to five major (and a few smaller) consortia that provide access to a large number of journals for large groups of scientists in India (especially those in CSIR labs, the IITs and IISc). Many universities have benefited through INFLIBNET. ICMR labs and selected medical institutions have formed ERMED, their own consortium. Rajiv Gandhi University of Health Sciences, Bangalore, provides access to literature through the HELINET consortium to a number of medical colleges in the South. On the open courseware front, the consortium of IITs and IISc has launched the NPTEL programme, under which top-notch IIT and IISc professors have come together to produce both web-based and video lessons in many subjects. Now these are available on YouTube as well.
Many physicists in the better-known institutions use arXiv, which has a mirror site in India, both for depositing their preprints and postprints and for reading the preprints of others. But many others are not aware of it. A very large number of Indian researchers working in universities and national laboratories are not aware of open access – green or gold – and its advantages. Very few Indian authors know about author addenda; whenever they receive a publisher's agreement form, they simply sign on the dotted line, giving away all the rights to the publisher. Call it ignorance or indifference, but it is rampant. Many authors think that attaching an author addendum to the copyright agreement may lead to rejection of their paper! Or at least they do not want to take the risk. What we need is advocacy and more advocacy.

4. OA journals and OA repositories in place

Thanks to the initiatives taken by Prof. M S Valiathan, former President of the Indian National Science Academy, the four journals published by INSA were made OA a few years ago. The Academy also signed the Berlin Declaration. Four years ago, he convened a one-day seminar on open access as part of the Academy's annual meeting. The Indian Academy of Sciences has converted all ten of its journals to OA.


The Indian Medlars Centre at the National Informatics Centre brings out the OA versions of 40 biomedical journals (published mostly by professional societies) under its medIND programme. MedKnow brings out more than 60 OA journals on behalf of their publishers, mostly professional societies. [Not all of them are Indian journals. Also, some MedKnow journals are included in the medIND programme of NIC.] Three OA medical journals are brought out from the Calicut Medical College, and a few more OA journals are brought out from elsewhere in India. In all, the number of Indian OA journals is around 100 (DOAJ lists 97, but it does not list the journals published by the Indian National Science Academy).

Dr D K Sahu, the CEO of MedKnow, has shown with ample data that OA journals can be win-win all the way. For example, the Journal of Postgraduate Medicine (JPGM) was transformed into a much better journal after it became OA. It attracts more submissions of better-quality papers from researchers in many countries; the circulation of the print version has increased; and advertisement revenue has increased (for both the print version and the online version). Its citations-per-document ratio has been rising steadily. Dr Sahu has made several presentations on MedKnow journals and on how open access is improving both the quality of the journals and their revenue, but not many other Indian journal publishers are coming forward to make their own journals OA. Incidentally, not a single Indian OA journal charges a publication fee. Several leading publishing firms (both European and multidisciplinary) have started poaching on these newly successful OA journals! In fact, a few journals have moved out of MedKnow to foreign publishers who have lured them with money. The online versions of a few Indian journals are brought out by Bioline International. Two young OA advocates, Thomas Abraham and Sukhdev Singh, have formed a society to promote Open Journal Systems in India.
The National Centre for Science Information at the Indian Institute of Science has also helped a few journals become OA by adopting OJS.

The Indian Institute of Science, Bangalore, was the first to set up an institutional repository in India. It uses the GNU EPrints software. Today the repository holds close to 10,200 papers, not all of them full text and not all of them truly open (many papers are available only to searchers within the campus). IISc also leads the Million Books Digital Library project's India efforts under the leadership of Prof. N Balakrishnan. Today there are 31 repositories in India (as seen from ROAR; OpenDOAR lists only 28), including three in CSIR laboratories, viz. the National Chemical Laboratory, the National Institute of Oceanography and the National Aerospace Laboratories. Three of them are subject-based central repositories: OpenMed of NIC, New Delhi, accepts papers in biomedical research from around the world; the Documentation Research and Training Centre (DRTC) at Bangalore maintains a repository for library and information science papers; and Prof. B Viswanathan of the National Centre for Catalysis Research maintains, virtually single-handed, a repository for Indian catalysis research papers, with over 1,150 full-text papers. Five of these Indian repositories have found a place in the list of the top 300 repositories prepared by the Cybermetrics Lab of the Centro de Información y Documentación Científica (CINDOC) of the Consejo Superior de Investigaciones Científicas (CSIC), Spain: the Indian Institute of Science is placed at rank 36, followed by the Indian Statistical Institute – Documentation Research and Training Centre at 96, OpenMed of the National Informatics Centre at 111, the Indian Institute of Astrophysics at 228 and the National Institute of Oceanography at 231. The repository at the Raman Research Institute has all the papers written by C V Raman, the winner of the 1930 Nobel Prize for Physics.
The National Institute of Technology, Rourkela, is the only Indian institution to have mandated OA for all faculty publications. Apart from NIT-R, the deposit rate for current papers is pretty low in all other institutions. Soon ICRISAT, a CGIAR laboratory located in India, will throw open its OA repository. A small proportion of Indian physicists, mostly high-energy and condensed-matter physicists, use arXiv to deposit preprints and postprints. And arXiv has a mirror site at the Institute of Mathematical Sciences



(IMSc), Chennai, which is visited by an increasing number of researchers from India and the neighbouring countries. A few weeks ago IMSc set up its own institutional repository. A small team at the University of Mysore is digitizing doctoral dissertations from select Indian universities under a programme called Vidyanidhi. With funding from the Department of Scientific and Industrial Research, a small group at the Indian Institute of Science – the National Centre for Science Information – was helping Indian institutions set up OA archives (using EPrints or DSpace) and convert journals to open access using Open Journal Systems. Not many institutions have taken advantage. Informatics India Pvt Ltd, a for-profit company with its headquarters in Bangalore, brings out a service called Open J-Gate, which indexes all the open access journals in the world. And it is absolutely free. Jairam Haravu of the Kesavan Institute of Information and Knowledge Management has made the NewGenLib library management software open source. NewGenLib can be used to set up and maintain institutional repositories.
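The repository and journal platforms named in this section (EPrints, DSpace, OJS) all expose their metadata through the OAI-PMH protocol, which is what allows services such as Open J-Gate to aggregate records across many sites. The sketch below shows how a harvester might parse Dublin Core records out of an OAI-PMH ListRecords response; the sample XML and the record it contains are illustrative only, not taken from any repository mentioned above.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Illustrative ListRecords response. A real harvester would fetch this
# from a repository endpoint:  <base-url>?verb=ListRecords&metadataPrefix=oai_dc
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical paper on catalysis</dc:title>
          <dc:creator>Example, Author</dc:creator>
          <dc:date>2007</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_records(xml_text):
    """Extract (title, creator, date) from each record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        records.append((rec.findtext(".//" + DC + "title"),
                        rec.findtext(".//" + DC + "creator"),
                        rec.findtext(".//" + DC + "date")))
    return records

print(parse_records(SAMPLE))
```

Because every compliant repository answers the same verbs with the same envelope, one parser like this can harvest from an EPrints archive, a DSpace installation or an OJS journal alike.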

5. Policy developments

The two Science Academies, INSA at New Delhi and IASc at Bangalore, and many of their Fellows have been engaged in a discussion on open access and its advantages, but there has been very little follow-up. As India continues, in a sense, to be feudal, one wonders if top-down approaches would work better than bottom-up approaches. But OA advocates are working on both fronts!

On the bottom-up front, a number of workshops have been held with a view to training mostly library staff in the use of OA software such as EPrints, DSpace and NewGenLib. Dr A R D Prasad of the Indian Statistical Institute – DRTC, Bangalore, is on the advisory board of DSpace and has conducted many workshops on setting up repositories using DSpace. Two online discussion lists, OA-India and OADL, are used mostly by LIS professionals to discuss OA-related issues, but very few working scientists have taken part in these discussions. Several librarians have written about OA in professional journals. One major concern expressed by librarians and repository managers is copyright violation; they are really worried about journal publishers taking action against their institutions. I have been writing to scientists and librarians regularly, alerting them to OA developments around the world and the need for India to adopt OA quickly. By now a very large number of Indian researchers, among them elected Fellows of the Academies, must have heard about the advantages of OA several times. The Indian Academy of Sciences had started, on a pilot basis, placing online the full text of papers by Fellows of the Academy, but the project has not gone beyond the initial effort. A similar proposal is pending with INSA. If implemented, these projects will be the equivalent of the Cream of Science project in the Netherlands.
Despite concerted advocacy and many individual letters addressed to policy makers, the heads of the government's science departments and research councils do not seem to have applied their minds to opening up access to research papers. The examples of the research councils in the UK, the Wellcome Trust, the Howard Hughes Medical Institute and, more recently, NIH and Harvard University have had virtually no impact on the Indian S&T establishment. Many senior scientists, directors of research laboratories and vice-chancellors of universities do not have a clear appreciation of open access and its implications. The more than 60 well-funded Bioinformatics Centres have been talking about setting up their own OA archives for more than six years, but nothing has happened. In a national laboratory, scientists do not want


to upload their papers to the OA repository set up by the library. There is great reluctance and apathy among scientists. The National Knowledge Commission, headed by Mr Sam Pitroda, a technocrat and telecom expert, has recommended open access, and it is understood that both the Prime Minister and the Deputy Chairman of the Planning Commission have been apprised of the need for adopting OA as a national policy. Two OA advocates, yours truly and Dr A R D Prasad, were members of the Working Group on Libraries that advised the National Knowledge Commission. In addition, Dr Mangala Sundar Krishnan of NPTEL and IIT Madras and I were members of the Working Group on Open and Distance Education. These two groups submitted strong recommendations in favour of India adopting an open access mandate for publicly funded research.

6. Opportunities and Challenges

Among those who understand the issues, many would rather publish in high-impact journals, as far as possible, and would not take the trouble to set up institutional archives. A recent letter to the editor of Nature from a leading Indian scientist, a foreign associate of the National Academy of Sciences, USA, illustrates this point. Publishing firms work in subtle ways to persuade senior librarians to keep away from OA initiatives. There have been no equivalents of FreeCulture.org among Indian student bodies and no equivalent of the Taxpayers' Alliance to influence policy at the political level.

Hopes: As pointed out earlier, the National Knowledge Commission supports open access and has included it in its recommendations to the Government. Google is in touch with NKC with a proposal to digitize all doctoral theses, bring out OA versions of selected print journals and digitize the back runs of OA journals. The Director of the Indian Institute of Science, which is in its centenary year, has decided to digitize all papers published from the Institute over the past 99-plus years and make them available to the world through the Institute's EPrints archive; the work has just begun. The Director General of the Council of Scientific and Industrial Research has said that it should be possible for CSIR to adopt a mandate similar to the one adopted by the Irish Research Council. I hope it becomes a reality soon. The Indian National Science Academy invited me to address its Council a few months ago, and the President, Vice-Presidents and Members of the Council listened to me carefully; again, in early April 2008, the Academy held a half-day meeting on open access, free and open source software, and copyright issues. I was asked to coordinate the presentations on the first two topics. But the lawyer who was invited to speak on copyright probably had very little understanding of the 'give-away' nature of journal papers.
INSA will before long send its recommendations to the Government. Developments around the world, including in Latin America, South Africa and China, will, I hope, goad the Indian establishment into action.

7. International collaboration and ways forward

The Principal Scientific Advisor to the Government is a former chairman of the Atomic Energy Commission and is fully aware of developments around the world. His own colleagues have been part of the work at CERN and are involved in many international collaborative projects. He often meets his counterparts in other countries, especially in the UK and the European Union. Decisions on OA made in the UK and Europe may have an influence on him. India is an important member of both the InterAcademy Panel and the InterAcademy Council. If these



bodies could be persuaded to endorse and adopt OA, then India would fall in line. I am trying to get a few OA champions to major events in India. Stevan Harnad came to India about eight years ago, but we did not provide him opportunities to meet many policy makers. Alma Swan came twice and did meet some key people. Maybe we need to facilitate more such visits and meetings. EIFL does not work in India; we should persuade them to include India in their programmes.

8. Conclusion

One never knows when things will start happening in India. People go on talking and holding meetings, but they rarely act. That is why it is important that we keep pushing.


An Overview of The Development of Open Access Journals and Repositories in Mexico

Isabel Galina (1); Joaquín Giménez (2)

(1) School of Library, Archive and Information Studies (SLAIS), University College London, Gower Street, London, WC1E 6BT, UK
e-mail: igalina@servidor.unam.mx; i.russell@ucl.ac.uk
(2) UNIBIO (Unidad de Informática para la Biodiversidad), Instituto de Biología, Universidad Nacional Autónoma de México, Circuito Exterior S/N, Ciudad Universitaria, C.P. 04510, Mexico DF, Mexico
e-mail: joaquin@ibiologia.unam.mx

Abstract
It has been noted that one of the potential benefits of Open Access is the increase in visibility for research output from less developed countries. However, little is known about the development of OA journals and repositories in these regions. This paper presents an exploratory overview of the situation in Mexico, one of the leading countries in Latin America in terms of scientific output. To conduct the overview we focused on OA journals and repositories already in place and in development. It was particularly hard to locate information, and our results do not intend to be exhaustive. We identified 72 Mexican OA journals using the DOAJ. Of these journals, 45 are from REDALyC, which we identified as a key project in OA journal development in Mexico. Using OpenDOAR and ROAR, ten Mexican repositories were identified; these were reviewed and classified. We found large variation between repositories in terms of size, degree of development and type. The more advanced repositories were well developed in terms of content and were developing add-on services. We also found inter-institutional groups working on advanced OAI tools. In addition, we carried out a case study of 3R, a repository development project at one of the country's leading universities, which included interviews with two repository managers. The main challenges we found were lack of institutional buy-in, staffing and policy development. The OA movement has not yet permeated the academic research environment. However, there are important working groups and projects that could collaborate and coordinate in order to lobby university authorities, national bodies and funders.
Keywords: repositories, developing countries, Open Access, Open Access journals, institutional repository

1. Introduction

This paper presents an overview of the Open Access movement in Mexico and the current OA journal and repository landscape. Although the importance of Open Access and repository building for developing countries, by increasing the visibility of under-represented research, has been noted [1-3], more work is required on documenting the current situation [4]. The main objective of this paper is to present an introductory overview which will hopefully promote further discussion and contributions on this subject. We do not intend to present an exhaustive overview, as information regarding this subject is not easily available. First we look at general trends in scientific output and publishing from Mexico in order to contextualize the discussion, in particular with regard to other Latin American countries. This is followed by a broad discussion of the general awareness of Open Access in the country and a more detailed look at OA journal and repository development in place and in development. We present a case study of repository development at the National Autonomous University of Mexico (Universidad Nacional Autónoma de México, UNAM) in order to discuss the key issues faced. In addition, two interviews were conducted with repository managers to gather their views on repository development in Mexico. In particular, the lack of policy development at a national and institutional level is addressed. Finally, we look at the opportunities and challenges for Open Access in the country, as well as the importance of international collaboration and other proposals to further the development of OA journals and repositories both in Mexico and in Latin America.

Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008

2. General trends in scientific output and publishing

In terms of scientific output and publishing, Mexico is an important country in Latin America and could act as one of the leading players in the development of OA journals and repositories in the region. Mexico's contribution to global research output, as measured by the ISI database, is around 0.75%, second only to Brazil in the region. Fifteen Mexican journals are included in the ISI [5]. It is particularly important to note that the UNAM, Mexico's national university, is the biggest contributor to the country's research output, with over forty percent of the country's research produced at this institution. We present a case study of the UNAM's repository project in order to examine particular issues and challenges in repository development in more detail. It is worth noting that the UNAM website is ranked number 59 in the Webometrics Ranking of World Universities [6], and although not all of it can be considered research output, it is clear that there is already a considerable base of material published online. Considering its size, the UNAM could act as a key player in discussions and policy development in the country, in collaboration with other institutions and national bodies. Both Mexico and Brazil have a relatively low number of citations for the number of articles published [5]. Increasing the visibility of Mexican research output is an important concern, and the development of OA journals and repositories could contribute to this. In this sense, Brazil has been leading the way with the creation of SciELO (Scientific Electronic Library Online); this project is discussed further below. As with other countries in the Latin American region, Mexico has a fairly low investment in science and technology development compared to Europe, Asia, the USA and Canada [7]. Mexico invests about 0.46% of its GDP, compared to around 2% for most developed countries.
From 1997 onwards, however, Mexico has maintained a relatively steady investment in comparison to other countries, possibly due to its recently stable economy. More than half the funding for research and development in universities and other research institutes comes from the public sector [7]. This could be a key issue when discussing mandates for self-deposit of publicly funded research.

3. General awareness of OA in Mexico

It is difficult to gauge the level of knowledge about OA in Mexico, and there is little evidence of a generalized national awareness. However, a number of events and projects were found that suggest a growing momentum towards more widespread recognition. A few Mexican institutions are signatories of the Budapest and Berlin initiatives. In 2006 the UNAM organized the 5th International Conference on University Libraries (Conferencia internacional sobre bibliotecas universitarias) with the theme 'Open Access: an alternative access to scientific information'. This was a two-day conference on the subject with a wide array of international and national speakers. Unfortunately, it is not clear whether any concrete policies or projects were developed as a result. At a national level, the Open Network of Digital Libraries (Red Abierta de Bibliotecas Digitales, RABID), together with the University Consortium for the Development of the Internet (Corporación Universitaria para el Desarrollo del Internet, CUDI), has worked for several years on the interconnectivity of resources and services between Mexican digital libraries. Their work has focused mainly on Open Access journals and electronic theses, promoting their use through OAI-PMH. No mention of institutional repositories was found on their website [8], but this information may well be documented elsewhere or currently in development. Fourteen Mexican higher education institutions currently collaborate in RABID, and they have developed a number of resource discovery tools, such as OA-Hermes, developed by the UNAM, which is an OAI harvester for selected, quality-assured Open Access resources; and VOAI and XOAI, developed by the Universidad de las Américas (UDLA), which are federated tools for sharing resources. The national body for science and technology, CONACyT, has apparently not issued any public statement about Open Access. However, it has funded RedALyC (Red de Revistas Científicas de América Latina, el Caribe, España y Portugal), a large database of full-text Open Access journals for Latin America, the Caribbean, Spain and Portugal, developed by the Universidad Autónoma del Estado de México (UAEM). This project is discussed further in the next section. In general, although we found little evidence of an elevated awareness of OA in Mexico, we did find several concrete examples of institutions working on projects that could positively influence further awareness. It is clear that more work needs to be done in this area, in particular with national bodies, in order to promote OA at a more national level and to involve many more institutions.
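The interoperability work described above rests on OAI-PMH, which exposes Dublin Core metadata as plain XML over HTTP. As a minimal illustration of the parsing step a harvester such as OA-Hermes must perform after fetching a ListRecords response, the following Python sketch extracts records from a trimmed example response; the repository identifier and record values are invented for the example and are not taken from any of the projects mentioned:

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A trimmed example of a ListRecords response; identifier and field
# values are invented placeholders.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.mx:0001</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Ejemplo de articulo</dc:title>
          <dc:creator>Autor, Ejemplo</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_list_records(xml_text):
    """Return (identifier, title, creators) tuples from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall(".//oai:record", NS):
        identifier = record.findtext(".//oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        creators = [c.text for c in record.findall(".//dc:creator", NS)]
        records.append((identifier, title, creators))
    return records

records = parse_list_records(SAMPLE)
print(records)
```

In a real harvester the XML would come from an HTTP request such as `<base-url>?verb=ListRecords&metadataPrefix=oai_dc`, with resumption tokens handled for large result sets.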

4. OA journals and OA repositories in place and in development

The projects related to OA journals and repositories mentioned below do not constitute an exhaustive list. As the OA movement in Mexico is still relatively young, it was difficult to discover which projects are in place and in development, and there may be important initiatives that we have missed. It is hoped that this paper will be an opportunity to promote discussion in order to gather further information and bring together key players.

4.1 OA Journals

We searched the Directory of Open Access Journals (DOAJ) using the term 'Mexico', which produced 72 journal results. Fifty of these journals have content in the DOAJ and, more interestingly, all but five of them are from RedALyC. As mentioned in the previous section, RedALyC is the most notable development in terms of OA journals. It currently offers 512 journals and over 81,000 full-text articles. The site contains a section dedicated to Open Access, describing its development and the Budapest initiative. RedALyC works under the banner 'Science that cannot be seen does not exist', and its main objectives are to develop a common information space for Latin America, the Caribbean, Spain and Portugal; to strengthen the quality of the region's publications; to act as a showcase for the region's quality research output; and to promote a more inclusive information society. We also used Latindex, an online registry of Latin American journals, and found that of the 483 registered Mexican online publications, 238 are freely available. This, of course, is not strictly OA, but it does show that a wide range of material is already publicly available. It is not known whether these journals support metadata harvesting, but further work in this area could increase OA availability. A well-known Latin American journal publication project is SciELO (Scientific Electronic Library Online), originally developed in Brazil and since expanded to eleven countries. Full-text articles are marked up in XML using the SciELO markup methodology, and in recent years an OAI interface has been added. As well as the SciELO portals by country, there are also two subject portals, on Public Health and on Social Sciences. Although Mexico has participated in the Public Health portal for some time, the national site was only recently launched, with twenty-one full-text journals.
Most of the other country portals have been developed with strong support from national research councils or similar bodies. The development of SciELO Mexico could possibly have progressed more effectively with similar support. It is currently developed by the UNAM and, now that the proof of concept has been established, it will hopefully receive more attention.

4.2 Repository development

Repositories have become increasingly important in the academic world [9-11] and can contribute to the development of Open Access. In 2004, ROAR (Registry of Open Access Repositories) registered around 200 repositories worldwide. This figure is currently over 1,000, with OpenDOAR (Directory of Open Access Repositories) recently reaching a similar number. However, the coverage of repositories on a global scale is patchy: a small number of countries lead the way, with most of their academic organizations developing an institutional repository plus a number of subject or national repositories [12-14], whilst other countries have none or only a few. For example, the Netherlands, Norway and Germany report one hundred percent coverage of universities with an institutional repository [13], whilst countries such as Zimbabwe, Mexico and Argentina register only a few repositories in the whole country. In these cases it would be reasonable to expect that this is not representative of the total number of academic organizations, considering the size of the country. In Latin America, Brazil has been leading in repository development, with 26 repositories registered in OpenDOAR (55 in ROAR). In order to look at repository development in Mexico, the browse-by-country function was used for OpenDOAR and ROAR. Five Mexican repositories were found in the former and eight in the latter. Two duplicates were eliminated, leaving a total of eleven repositories for the whole country. This is quite a small number of repositories considering the size and academic importance of Mexico within Latin America. The repositories were reviewed and classified.
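The de-duplication step just described (five repositories from OpenDOAR, eight from ROAR, two overlaps, eleven in total) amounts to merging two registry lists on a normalized URL. A sketch of that bookkeeping, with invented placeholder URLs standing in for the actual registry entries:

```python
def normalize(url: str) -> str:
    """Normalize a repository URL so entries from two registries can be compared."""
    url = url.lower().rstrip("/")
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url

# Placeholder entries; the real OpenDOAR and ROAR listings differ.
opendoar = [f"http://repo{i}.example.mx" for i in range(1, 6)]  # five entries
roar = (
    [f"https://repo{i}.example.mx/" for i in range(1, 3)]   # two duplicates of OpenDOAR entries
    + [f"http://repo{i}.example.mx" for i in range(6, 12)]  # six further entries
)

merged = {normalize(u) for u in opendoar + roar}
print(len(merged))  # 5 + 8 - 2 duplicates = 11
```

Real registry entries would also need normalization of `www.` prefixes and OAI base-URL variants before comparison.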
Definitions of repositories vary considerably, and in order to classify them we used the Heery and Anderson typological model [15], describing repositories according to functionality, coverage, content type and user group. Despite there being only eleven Mexican repositories, a wide range of types was found: two national subject repositories, one theses repository, two institutional, two departmental, one subject, one catalogue, one regional and one unidentified, as shown in Table 1. It is clear that repositories in Mexico are still at an embryonic stage, and there appears to be no coherent trend in their development. However, it is possible that a number of repositories currently in development have not been registered in ROAR or picked up by OpenDOAR, so this number may not reflect the total figure. Of the eleven repositories inspected, three were over five years old, two were of unknown age, and six had been registered in ROAR in the past two or three years, though it was unclear how long they had been in development. There was no evident relationship between age and number of items. Two had fewer than 100 items, three had between 1,000 and 5,000 items, whilst two were very large. Of the large repositories, one had over 200,000 items but on closer inspection was functioning as a library catalogue rather than a repository; the other has over 80,000 full-text journal articles. Four repository sizes were unknown, as they had not been successfully harvested by ROAR and there was no indication on their homepages, which is unfortunate as this information would be valuable. In order to examine repository development in Mexico in more detail, we took the Network of University Repositories (Red de Repositorios Universitarios, 3R), currently being developed at the UNAM, as a case study. This project has been particularly well documented [16], and both authors are involved, making access to information, experiences, interviews, documentation and development easier.
Additionally, it provides an important case study, as the UNAM is a large, highly centralized and productive national university, currently producing over fifty percent of the country's total research output. This was followed by interviews with two UNAM-based repository managers in order to gain a deeper understanding of repository development, content ingestion workflows, depositing behaviour, content typology, resource usage monitoring and dissemination. These interviews were compared to similar ones conducted with repository managers in the UK to discover points of convergence and difference.

NAME | INSTITUTION
Acervo Digital del Instituto de Biología de la UNAM | UNAM
Acervo General de la biblioteca | ITESO
Árboles de la UNAM: Instituto de Biologí