
NSF Cyber Carpentry Prepares Early-Career Researchers for Data-Intensive Projects
Big data is only getting bigger, and that can cause big problems for researchers who need to store and share their work. Twenty doctoral students and post-doctoral associates from across the country learned tools and techniques to solve these problems at the inaugural Cyber Carpentry Workshop at the University of North Carolina at Chapel Hill.
Sponsored by the National Science Foundation (NSF) and hosted by the UNC School of Information and Library Science (SILS), the two-week workshop in July introduced students to a variety of applications, platforms, and processes for data life-cycle management and data-intensive computation.
“Previously, you had maybe a thousand files, maybe ten thousand,” said Arcot Rajasekar, SILS professor and director of the Cyber Carpentry Workshop. “Now, you’re talking about 100 million files and doing simulations and emulations that can create petabytes of data. Managing that just by human interaction is not going to be effective; you need some automation there. In addition to the volume of data, you have to consider the velocity of data coming in and the multiple varieties of data you’re collecting. This is not easily done without a good level of management.”
The workshop familiarized participants with the concepts of virtualization, automation, and federation as defined through the DataNet Federation Consortium (DFC), an NSF-funded project that promotes data sharing within and across science and engineering disciplines. Instructors introduced specific DFC web portals, including CyVerse, Dataverse, DataONE, and HydroShare, as well as relevant software, metadata management strategies, and large-scale workflows.
Jocelyn Colella, a PhD candidate in evolutionary genomics at the University of New Mexico, said gaining experience with containers – portable environments that can encapsulate entire scientific workflows, including software, libraries, and data – was one of the highlights of her experience, and that the introduction to the Jetstream and CyVerse virtual environments had significant implications for her research.
“Coming from a smaller lab, it has been incredibly expensive to build the computing resources and data archival infrastructure necessary to deal with terabytes of genomic data,” she said. “Learning about the free computational and storage resources available through NSF-funded projects has revolutionized how I conceptualize my own workflows and will alter how I apply for grants going into the future.”
Pictured below: Andres Espindola-Camacho from Oklahoma State University, Jeremy Thorpe from Johns Hopkins University School of Medicine, Gaurav Kandoi from Iowa State University, and Yingru Xu from Duke University discuss an issue with their team project.