
PANGAEA - Providing access to geoscientific data using Apache Lucene Java
Uwe Schindler
PANGAEA / SD DataSolutions GmbH, uschindler@pangaea.de


My Background
• I am a committer and PMC member of Apache Lucene and Solr. My main focus is the development of Lucene Java.
• I implemented fast numerical search and maintain the new attribute-based text analysis API.
• I studied physics at the University of Erlangen-Nuremberg and work as a consultant and software architect for PANGAEA (Publishing Network for Geoscientific & Environmental Data) in Bremen, Germany, where I implemented the portal's geospatial retrieval functions with Lucene Java.
• I give talks about Lucene at international conferences such as ApacheCon EU/US, Lucene Eurocon, and Berlin Buzzwords, and at various local meetups.


About PANGAEA
• Since 1993: Information system for earth system science data, hosted by AWI & MARUM
• 2001: Mandate of the International Council for Science (ICSU): World Data Center for Marine Environmental Sciences (WDC-MARE)
• 2007: Mandate of the World Meteorological Organisation (WMO): World Radiation Monitoring Center (WRMC)
• 2010 (certification in progress): Mandate of the World Meteorological Organisation (WMO): Data Collection and Processing Center (DCPC)


Network of World Data Centers
Geophysical Year 1957
• Airglow: Mitaka, Japan
• Astronomy: Beijing, China
• Meteorology: Asheville NC, USA; Beijing, China; Obninsk, Russia
• Marine Geology and Geophysics: Boulder CO, USA
• Nuclear Radiation: Moscow, Russia; Tokyo, Japan
• Atmospheric Trace Gases: Oak Ridge TN, USA
• Seismology: Denver CO, USA; Beijing, China
• Cosmic Rays: Toyokawa, Japan
• Soils: Wageningen, The Netherlands
• Earth Tides: Brussels, Belgium
• Solar Activity: Meudon, France
• Geology: Beijing, China
• Solar Radio Emission: Nagano, Japan
• Geomagnetism: Copenhagen, Denmark; Edinburgh, UK; Kyoto, Japan; Colaba, India
• Solar Terrestrial Physics: Boulder CO, USA; Didcot Oxon, UK; Moscow, Russia; Haymarket, Australia
• Glaciology: Boulder CO, USA; Cambridge, UK; Lanzhou, China
• Ionosphere: Tokyo, Japan
• Marine Environmental Sciences: Bremen, Germany (2001)
• Rotation of the Earth: Obninsk, Russia; Washington DC, USA
• Satellite Information: Greenbelt MD, USA
• Aurora: Tokyo, Japan
• Human Interactions in the Environment: Palisades NY, USA
• Rockets and Satellites: Obninsk, Russia
• WDC Co-ordination Offices: Washington DC, USA; Beijing, China
• Oceanography: Obninsk, Russia; Silver Spring MD, USA; Tianjin, China
• Recent Crustal Movements: Ondrejov, Czech Republic
• Paleoclimatology: Boulder CO, USA
• Renewable Resources and Environment: Beijing, China
• Remotely Sensed Land Data: Sioux Falls SD, USA
• Solid Earth Geophysics: Beijing, China; Boulder CO, USA; Moscow, Russia
• Space Science: Beijing, China
• Space Science Satellites: Kanagawa, Japan
• Sunspot Index: Brussels, Belgium


Why do we need Data Libraries?
- Good scientific practice
- Needed for verification of scientific work
- Good availability of data for large scale and complex scientific approaches
- "Data recycling" is more effective than reproduction


Geosciences before 1900 William Smith, 1815

HMS Challenger, 1875

Turin papyrus, ~1160 BC


Technical Improvements ENIAC, 1944

Magnetometer


Development of the global climate
[Figures: climate curves plotted against thousands of years before present, and for the last 1300 years]


Information increase in empirical sciences?
[Chart: Publications and Data, 1970 to 2010]


Archiving and publication of scientific data
• Data acquisition
• Quality assurance
• Long-term availability and access


Long term archive
• Open access & non restricted data
  o Creative Commons license
• Data accepted from individual scientists, institutes, and science projects
• Long term funding for basic operation
  o hardware, software, system management & organisation
• Long term preservation of data
  o Technical: security, migration of media
  o Usability: preserving the integrity & semantics of data sets


Contents


Data Types in PANGAEA
• Profiles => doi:10.1594/PANGAEA.701299
• Time series => doi:10.1594/PANGAEA.323487
• Sea bed photos => doi:10.1594/PANGAEA.319877
• Distributed samples => doi:10.1594/PANGAEA.51749
• Complex data => doi:10.1594/PANGAEA.108079
• Air photos => doi:10.1594/PANGAEA.323540
• Audio record => doi:10.1594/PANGAEA.339110

[Figure: profiles for PS1389-3, PS1390-3, PS1431-1, PS1640-1 and PS1648-1 showing IRD, Sand, CaCO3, TOC, Radio and Smect against Age (kyr), max. 233.55 kyr]

[Map: 11° to 15° E, 54° 0' to 55°30' N. Legend: World vector shore line; Grain size class KOLP A; Grain size class KOLP B; Grain size class KOLP DIN; Grain size class KOEHN; Grain size class KOEHN2; Geochemistry; 20 m. Scale: 1:2695194 at Latitude 0°. Source: Baltic Sea Research Institute, Warnemünde.]


Statistics (9/2010)
Total number of data sets: ~1 million
Data items: ~8 billion
[Pie chart segments: unclassified, Ice, Atmosphere, Sediment, Corals, Water]


Now the technical details :-)


PANGAEA Architecture
[Diagram labels: Editorial system; Sybase ASE; RDB; Harddisk + tape (silo); Apache Lucene; PANGAEA search engine; Middleware; Webserver; Google Maps / Earth]


Indexing contents from relational database with dynamic updates
[Diagram labels: Staffs; Projects; Data Set; Data Series; Events; Update Log; XML Data Set Description (Metadata)]
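The slide only names an update log and an XML metadata description per data set, so the following is a minimal sketch of the general idea rather than PANGAEA's actual implementation. It assumes a hypothetical log of (sequence number, data set ID, delete flag) rows, ordered by sequence number, and replays only the rows that have not been processed yet. The class `UpdateLogSketch`, its `LogEntry` type, and `buildMetadataXml` are illustrative names, and a plain map stands in for the Lucene index.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: replay an update log so a search index follows database changes. */
class UpdateLogSketch {

    /** One hypothetical log row: sequence number, changed data set, delete flag. */
    static final class LogEntry {
        final long sequence;
        final int dataSetId;
        final boolean deleted;

        LogEntry(long sequence, int dataSetId, boolean deleted) {
            this.sequence = sequence;
            this.dataSetId = dataSetId;
            this.deleted = deleted;
        }
    }

    /** Stand-in for the search index: data set ID -> metadata XML. */
    private final Map<Integer, String> index = new LinkedHashMap<>();
    /** Highest sequence number that has already been applied. */
    private long lastProcessed = 0;

    /** Stand-in for assembling the XML description from the relational tables. */
    static String buildMetadataXml(int dataSetId) {
        return "<dataset id=\"" + dataSetId + "\"/>";
    }

    /** Applies all entries newer than the last processed one; returns how many were applied. */
    int applyUpdates(List<LogEntry> log) {
        int applied = 0;
        for (LogEntry e : log) { // assumed to be ordered by sequence number
            if (e.sequence <= lastProcessed) {
                continue; // already reflected in the index
            }
            if (e.deleted) {
                index.remove(e.dataSetId);
            } else {
                index.put(e.dataSetId, buildMetadataXml(e.dataSetId)); // add or replace
            }
            lastProcessed = e.sequence;
            applied++;
        }
        return applied;
    }

    boolean contains(int dataSetId) {
        return index.containsKey(dataSetId);
    }

    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        UpdateLogSketch sketch = new UpdateLogSketch();
        List<LogEntry> log = List.of(
                new LogEntry(1, 10, false),
                new LogEntry(2, 11, false),
                new LogEntry(3, 10, true));
        System.out.println(sketch.applyUpdates(log) + " entries applied");
        System.out.println(sketch.applyUpdates(log) + " entries applied on replay");
    }
}
```

Tracking the last processed sequence number makes the replay idempotent, so the same log can be polled repeatedly without reindexing unchanged data sets.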


Indexed Information
• Textual metadata: citation (authors, title), abstract, measurement parameters, methods, associated projects, comments, documentation (including field info for all XML schema element types)
• Fulltext data set contents
• Geographical information: latitude/longitude/BBOX/track, dates, geological age, depth/elevation [NumericField/NumericRangeQuery]
• Soon: Fulltext of attached external documentation (PDF)
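The geographical fields above are numeric, so a bounding-box search comes down to range conditions on latitude and longitude. The following is a minimal sketch of that logic in plain Java; it does not use the Lucene API, and the class name `BBoxSketch` is hypothetical. It also assumes boxes do not cross the date line.

```java
/** Hypothetical sketch of bounding-box matching on numeric latitude/longitude values. */
class BBoxSketch {
    final double west, south, east, north;

    BBoxSketch(double west, double south, double east, double north) {
        this.west = west;
        this.south = south;
        this.east = east;
        this.north = north;
    }

    /** True if the point lies inside this box (bounds inclusive): two numeric range checks. */
    boolean contains(double lat, double lon) {
        return lat >= south && lat <= north && lon >= west && lon <= east;
    }

    /** True if the two boxes overlap or touch; assumes neither crosses the date line. */
    boolean intersects(BBoxSketch other) {
        return west <= other.east && other.west <= east
                && south <= other.north && other.south <= north;
    }

    public static void main(String[] args) {
        // Query box: 11 to 15 degrees east, 54 to 55.5 degrees north
        BBoxSketch query = new BBoxSketch(11.0, 54.0, 15.0, 55.5);
        System.out.println(query.contains(55.0, 13.0));
        System.out.println(query.intersects(new BBoxSketch(14.0, 55.0, 20.0, 60.0)));
    }
}
```

With an index, each of these comparisons would be answered by a range query on the corresponding numeric field instead of a scan over all data sets.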


Geo-Retrieval with Lucene


Using scored queries with KML regions as filters
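One way to read this slide: the text query produces scored hits, and a region (for example a polygon taken from a KML file) only decides which hits are kept, without changing their scores. The sketch below illustrates that split in plain Java under those assumptions. The `Hit` type, the class name `RegionFilterSketch`, and the ray-casting test are illustrative choices, not the portal's actual code, and no KML parsing is shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: keep only scored hits whose position lies inside a polygon region. */
class RegionFilterSketch {

    /** A hypothetical search hit with a relevance score and a position. */
    static final class Hit {
        final String id;
        final double score, lat, lon;

        Hit(String id, double score, double lat, double lon) {
            this.id = id;
            this.score = score;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Ray-casting point-in-polygon test; vertices are given as parallel lat/lon arrays. */
    static boolean inPolygon(double lat, double lon, double[] polyLat, double[] polyLon) {
        boolean inside = false;
        for (int i = 0, j = polyLat.length - 1; i < polyLat.length; j = i++) {
            // Does the edge (j -> i) straddle the point's latitude?
            boolean crosses = (polyLat[i] > lat) != (polyLat[j] > lat);
            if (crosses) {
                double lonAtLat = polyLon[i]
                        + (lat - polyLat[i]) * (polyLon[j] - polyLon[i]) / (polyLat[j] - polyLat[i]);
                if (lon < lonAtLat) {
                    inside = !inside; // each crossing east of the point flips the state
                }
            }
        }
        return inside;
    }

    /** The region acts as a filter only; ranking still follows the query score. */
    static List<Hit> filterAndRank(List<Hit> hits, double[] polyLat, double[] polyLon) {
        List<Hit> kept = new ArrayList<>();
        for (Hit h : hits) {
            if (inPolygon(h.lat, h.lon, polyLat, polyLon)) {
                kept.add(h);
            }
        }
        kept.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return kept;
    }

    public static void main(String[] args) {
        double[] polyLat = {54.0, 54.0, 55.5, 55.5};
        double[] polyLon = {11.0, 15.0, 15.0, 11.0};
        List<Hit> hits = List.of(
                new Hit("A", 0.9, 55.0, 13.0),
                new Hit("B", 2.0, 50.0, 13.0),
                new Hit("C", 1.5, 54.5, 12.0));
        for (Hit h : filterAndRank(hits, polyLat, polyLon)) {
            System.out.println(h.id + " " + h.score);
        }
    }
}
```

Keeping the region out of the scoring means the same text query ranks results identically whether or not a region is selected.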


Apache Lucene as fast Key-Value Store
• Lucene is used for almost every query on the web client
• Lots of keyword terms are indexed for quick retrieval of data sets
• Example: lookup of data sets related to publications using DOI
  o PANGAEA is hit by hundreds of DOI lookup queries per second from scientific publishers
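A keyword lookup of this kind behaves like a key-value store: an exact, unanalyzed term maps directly to the data sets that carry it. The following is a minimal in-memory sketch of that idea in plain Java, not the Lucene API. The class name `KeywordLookupSketch`, the field name, the DOI, and the data set IDs are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: exact keyword terms mapped to data set IDs, like a key-value store. */
class KeywordLookupSketch {

    /** Term ("field:value") -> IDs of the data sets that contain it. */
    private final Map<String, List<Integer>> postings = new HashMap<>();

    private static String term(String field, String value) {
        // DOIs are case-insensitive, so values are normalized to lower case
        return field + ":" + value.toLowerCase(Locale.ROOT);
    }

    /** Registers a keyword value for a data set. */
    void index(String field, String value, int dataSetId) {
        postings.computeIfAbsent(term(field, value), k -> new ArrayList<>()).add(dataSetId);
    }

    /** Exact-match lookup; returns an empty list if the term is unknown. */
    List<Integer> lookup(String field, String value) {
        return postings.getOrDefault(term(field, value), Collections.emptyList());
    }

    public static void main(String[] args) {
        KeywordLookupSketch store = new KeywordLookupSketch();
        // Made-up publication DOI and data set IDs, for illustration only
        store.index("publicationDoi", "10.9999/Example.2010.01", 1);
        store.index("publicationDoi", "10.9999/Example.2010.01", 2);
        System.out.println(store.lookup("publicationDoi", "10.9999/example.2010.01"));
    }
}
```

Because each lookup is a single term match with no text analysis or scoring, it stays cheap even at hundreds of requests per second.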




Live presentation


Contact
Uwe Schindler

PANGAEA - Publishing Network for Geoscientific & Environmental Data
MARUM, Leobener Str., 28359 Bremen, Germany
uschindler@pangaea.de

SD DataSolutions GmbH
Wätjenstr. 49, 28213 Bremen, Germany
uschindler@sd-datasolutions.de


Thank you! Learn more about Apache Lucene at www.lucidimagination.com
