
annual report 2012



01 director’s report

02 general overview
- Follow-up Commission
- CNAG Structure
- Areas of activity

03 platforms overview
- Sequencing
  · Biorepository
  · Sample Preparation
  · Sequencing Production
  · Technology Implementation and Development
- Bioinformatics Analysis
  · Production Bioinformatics
  · Data Analysis

04 research programmes
- Bioinformatics Development
  · Statistical Genomics
  · Algorithm Development
  · Functional Bioinformatics
  · Genome Assembly and Annotation
- Genome Biology
  · Structural Genomics

05 appendix
- Funding
- Collaborators
- Human Resources
- Projects
- Major Contracts
- Publications


01 director’s report


director’s report

In 2012 we faced huge challenges. How could we continue building the CNAG, which has so much to offer society, in an economically adverse climate? In a world of huge contraction, we have managed to beat the trend. The CNAG now runs its 12 high-throughput 2nd generation sequencers at full capacity, with a data output equal to seven complete human genomes sequenced at 30-fold coverage every 24 hours, 365 days a year. This places us in 2nd position for sequencing output in Europe. Automated laboratory procedures have been implemented and fully automated computational analysis systems have been established. The computational data analysis pipelines that we have set up keep pace with data production and enable us to provide comprehensive support to our collaborators. All the projects we carry out are covered in full, from study design through sequencing and data analysis to data interpretation. We are now collaborating on projects in three of the major international research initiatives:

– International Cancer Genome Consortium (ICGC)
– International Human Epigenome Consortium (IHEC)
– International Rare Disease Research Consortium (IRDiRC)

The number of national and international collaborators has increased substantially, and during 2012 we performed projects with 87 researchers from 65 different institutions. With several of these projects we have managed to attract substantial extramural funds from funding agencies such as the Instituto de Salud Carlos III, la Marató de TV3 and the European Commission. We are generating major impact in each of our four lines of work: Disease Gene Identification, Cancer Genomics, Genomics of Infectious Disease, and Model Organisms and Agrogenomics.

We are achieving our objective of being an integral platform for large-scale genome analysis projects and, through this, of contributing to the improvement of the quality of life for the citizen.

Ivo G. Gut
Director


02 general overview


general overview

follow-up commission

Antoni Luis Andreu Périz
Subdirector General de Evaluación y Fomento de la Investigación, Instituto de Salud Carlos III, Ministerio de Economía y Competitividad

Marta Aymerich i Martínez
Coordinadora del Programa de Recerca i Innovació en Ciències de la Salut, Departament de Salut

Carles Constante i Beitia
Director General de Regulació, Planificació i Recursos Sanitaris, Departament de Salut

Bárbara López de Quintana Palacios
Subdirectora General de Relaciones Institucionales, Secretaría de Estado de Investigación, Desarrollo e Innovación, Ministerio de Economía y Competitividad

Salvador Maluquer Amorós
Director General, Parc Científic de Barcelona

Josep Maria Martorell
Director General de Recerca, Departament d’Economia i Coneixement

Luis Terrada Miarnau
Jefe de Área de Industria, Delegación del Gobierno en Catalunya

cnag structure

Ivo Gut

General Manager: David Badia

Programme Manager: Mònica Bayés

Lídia Águeda

Sequencing Department: Marta Gut
· Sample Preparation
· Sequencing Production
· Technology Development and Implementation

Bioinformatics Analysis: Sergi Beltran
· Production Bioinformatics
· Data Analysis

Bioinformatics Development: Simon Heath
· Statistical Genomics
· Functional Bioinformatics
· Algorithm Development
· Genome Assembly and Annotation

Genome Biology: Marc A. Marti-Renom
· Structural Genomics


areas of activity

The CNAG focuses its efforts on the analysis and interpretation of genome information in four interconnected research areas: Disease Gene Identification, Cancer Genomics, Genomics of Infectious Diseases and Genomics of Model Organisms.

disease gene identification

Most rare disorders (RD) are caused by mutations in protein-coding regions, which represent a small part (1–2%) of the human genome. Given the cost of high-coverage whole genome sequencing, exome sequencing (sequencing the collection of all human exons) has emerged as an alternative screening strategy to find variants underlying rare Mendelian disorders. A typical project of this type involves sequencing a few individuals from pedigrees with highly selected clinical phenotypes, identifying all genetic variants relative to the reference sequence, filtering out common variations present in the general population, and selecting and validating the putative pathogenic variants shared by the affected individuals.
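The filtering steps of such a project can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CNAG pipeline: the variant identifiers, the `population_freq` lookup and the 1% frequency cut-off are all hypothetical.

```python
def candidate_variants(affected_calls, population_freq, max_freq=0.01):
    """Return variants shared by all affected individuals that are rare
    in the general population.

    affected_calls:  dict mapping individual id -> set of variant ids
                     (differences from the reference sequence).
    population_freq: dict mapping variant id -> allele frequency in the
                     general population (missing means never observed).
    max_freq:        hypothetical rarity cut-off, not a value from the report.
    """
    if not affected_calls:
        return []
    # Keep only variants present in every affected individual.
    shared = set.intersection(*affected_calls.values())
    # Filter out common variation present in the general population.
    rare = [v for v in shared if population_freq.get(v, 0.0) <= max_freq]
    return sorted(rare)


# Toy pedigree with two affected siblings (all identifiers are made up).
calls = {
    "sib1": {"chr1:1000A>G", "chr2:500C>T", "chr7:42G>A"},
    "sib2": {"chr1:1000A>G", "chr2:500C>T", "chr9:77T>C"},
}
freqs = {"chr1:1000A>G": 0.25, "chr2:500C>T": 0.0005}
candidates = candidate_variants(calls, freqs)
```

In this toy example only the rare variant shared by both siblings survives. Any such candidate would still need to be validated, as described above.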

The CNAG has successfully applied this approach in collaboration with researchers from several hospitals and research institutions. More than 110 exomes corresponding to 24 types of rare diseases or groups of pathologies have been sequenced in collaboration with researchers from the Centre for Biomedical Network Research on Rare Diseases (CIBERER), led by Dr Francesc Palau. Another research highlight produced at the CNAG was the identification of a gene whose mutations lead to spinal muscular atrophy associated with progressive myoclonic epilepsy (Zhou et al. 2012), in collaboration with Dr Judith Melki from Inserm.

In several projects, the CNAG has started bringing 2nd generation sequencing technologies into the clinic, for example for the diagnosis of genetically heterogeneous neuromuscular disorders, an initiative of the Paediatric Neurology Department at Vall d’Hebron Hospital (Dr Alfons Macaya). The CNAG has collaborated with companies such as Gendiag-Ferrer InCode and HealthinCode, which aim to provide diagnostic services for disease predisposition and early diagnosis for several medical conditions using cost-efficient sequence enrichment methods. Exome sequencing has also been used to reveal non-recurring variants with large effects on more common and complex phenotypes, such as fibromyalgia, autism, X-linked mental retardation and obesity, some within the framework of the EU FP7 funded GEUVADIS and ESGI projects.

In total, during 2012, the CNAG carried out the sequencing of close to 1,000 human exomes from patients with different disorders, excluding cancer.

Spinal Muscular Atrophy Associated with Progressive Myoclonic Epilepsy Is Caused by Mutations in ASAH1. Zhou J, Tawk M, Tiziano FD et al., including Bayes M, Castro-Giner F and Gut I. Am J Hum Genet. 2012 Jul 13;91(1):5-14.


cancer genomics

Cancer is a disease of the genome. High-throughput sequencing is enabling great headway in the general understanding of cancer. Several landmark studies have determined the complete DNA sequences of clinical tumour samples or cell lines and compared these to normal tissues from the same individual. At the CNAG, cancer genomics is currently one of the most active areas of research.

Efforts have focused on obtaining genome or exome-wide molecular profiles of cancer for individuals with chronic lymphocytic leukaemia, bladder cancer, colon cancer, breast cancer, prostate cancer, bone cancer and other tumour types, revealing a wide range of somatic mutation loads. The results obtained support the hypothesis that there is extensive genetic heterogeneity within a given tumour type, with a high number of infrequently mutated genes and few genes with mutations of medium recurrence in a given form of cancer.

The CNAG participates in the Chronic Lymphocytic Leukaemia (CLL) Genome Project, the Spanish contribution to the ICGC. CLL is a neoplasia of B lymphocytes and is one of the most common tumours in Western countries. In the last three years, the CLL Consortium, led by Dr Elias Campo from Hospital Clínic and Dr Carlos Lopez-Otin from the University of Oviedo, has established a comprehensive catalogue of exomic genetic alterations in 300 CLL tumours. Whole genome and transcriptome sequencing data have been generated for 65 and 100 of these patients, respectively.

In 2012, the DNA methylomes of two CLL and three mature B-cell subtypes isolated from a single donor were sequenced at single-base pair resolution (Kulis et al. 2012). The results revealed that widespread hypomethylation predominantly targeting the gene body is the major epigenetic change in the transition between naive B cells and memory B cells as well as between CLL and normal B cells. The CLL Consortium also identified a DNA methylation signature that distinguishes the two molecular subtypes of CLL and even new subtypes of CLL with different biological features and clinical behaviours. This study represents an initial step towards the whole epigenomic characterisation of normal and neoplastic hematopoietic cells. This activity has been extended to our contribution to the EU funded project BLUEPRINT, part of the IHEC (Adams et al. 2012).

Since 2011 the CNAG has worked in close collaboration with Dr Miguel Ángel Piris and colleagues at the Hospital Universitario Marqués de Valdecilla on using sophisticated high-throughput genomic technologies for the clinical management of cancer patients through the development of individualised approaches to treatment. In 2012 more than 100 exomes were sequenced within the framework of this project.

Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. Kulis M, Heath S, Bibikova M et al., including Bayés M, Gut M and Gut I. Nat Genet. 2012 Oct 14;44(11):1236-42.

BLUEPRINT to decode the epigenetic signature written in blood. Adams D, Altucci L, Antonarakis SE, Ballesteros J, Beck S, Bird A, Bock C, Boehm B et al., including Gut I. Nat Biotechnol. 2012 Mar 7;30(3):224-6.


genomics of infectious diseases

2nd generation sequencing technologies are also being used for the rapid identification of pathogens in chronic and acute diseases, replacing conventional Sanger sequencing methods. The CNAG has been very active in fungal genomics within the framework of the EU funded project SYBARIS, led by Dr Misha Kapushesky from the European Bioinformatics Institute (EBI). This project investigates the specificity of the response of the cell-mediated immune system to fungal microorganisms in order to elucidate the genetic basis of susceptibility to fungal disease (Santamaria et al. 2011) and better understand the molecular mechanisms of drug resistance in fungal pathogens. In 2012, the CNAG sequenced the genomes of 300 clinical fungal isolates and the transcriptome of monocyte-derived dendritic cells from 150 healthy human donors stimulated by different fungal strains.

Systems biology of infectious diseases: a focus on fungal infections. Santamaría R, Rizzetto L, Bromley M et al., including Gut I. Immunobiology. 2011 Nov;216(11):1212-27.


model organisms and agrogenomics

Genomics is accelerating the acquisition of fundamental knowledge about all living organisms and is experiencing rapid uptake into breeding, selection and conservation programmes. The CNAG plays a major role in the Iberian lynx genome project coordinated by Dr Jose Antonio Godoy from the Estación Biológica Doñana (CSIC). The objective of this project is to apply genomics to support the conservation of this critically endangered species. It is the first large complex genome fully sequenced and characterised in Spain, as well as one of the first genomes of a highly endangered species to be sequenced. Using whole genome sequencing paired-end and mate-pair data as well as a fosmid-pool sequencing strategy, researchers at the CNAG have generated a final assembly with impressive contiguity statistics.

Similar de novo assembly projects are currently being carried out for several other organisms such as the flatfish turbot and the maritime pine, coordinated by Dr Antonio Figueras from the Instituto de Investigaciones Marítimas de Vigo (CSIC) and Dr Maite Cervera from the Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), respectively.

Large-scale resequencing of genomes is becoming more practical, enabling genetic variations to be thoroughly analysed and catalogued. The PRIMATE project, led by Tomàs Marquès of the Universitat Pompeu Fabra (UPF), aims to discover the extent of genome structural polymorphism within great ape species. The CNAG has performed high-coverage sequencing of 70 wild-born unrelated primate specimens that will provide insight into the evolution of genome variation over the last 15 million years.


03 platforms overview


platforms overview

sequencing

Head of Sequencing: Marta Gut
Data Process and Quality Control: Lidia Sevilla

The Sequencing Department is responsible for the laboratory side of the CNAG. It relies exclusively on 2nd generation high-throughput sequencing. Four teams are involved in the streamlined processes:

- Biorepository
- Sample Preparation
- Sequencing Production
- Technology Implementation and Development

All four teams work together to generate high quality sequencing data, which are transferred to and analysed by the Bioinformatics Department. All relevant information about projects, samples, libraries and sequencing runs is tracked and stored in a comprehensive Laboratory Information Management System (LIMS).

The department’s primary commitments are delivering high quality sequencing data with an efficient turnaround to our collaborators and making the latest achievements and cutting-edge technology accessible to a wide spectrum of researchers. The portfolio of supported sequencing protocols encompasses whole genome, whole exome enrichment, custom targeted enrichments, various RNA applications, chromatin immunoprecipitation, whole genome bisulphite and barcode analysis by sequencing (BAR-Seq).

In 2012 the staff of the Sequencing Department grew to 17 professionals in order to deal with the growing number of projects and samples and their different sequencing applications. Furthermore, the increase in the variety of applications required more intensive quality control monitoring, leading to the creation of the new position of quality control manager.



biorepository

Head of Biorepository: Lídia Águeda

The Biorepository Unit receives, stores, quality-controls and distributes DNA and RNA samples from collaborators. The quality of the genomic DNA and RNA samples is controlled by means of an integrity check by gel electrophoresis or Bioanalyzer chips, a PCR check of enzymatic reaction compatibility and fluorescence-based quantification. These standardised protocols are adapted for samples requiring special quality control conditions, such as low input DNA/RNA samples, formalin-fixed, paraffin-embedded DNA/RNA samples and rRNA-depleted RNA samples.

Major achievements:

– From 2011 to 2012 the number of RNA and DNA samples processed by the unit doubled and tripled, respectively. In total, more than 5,000 samples were handled by the CNAG in 2012.



sample preparation

Manager: Julie Blanc
Engineer: Marta López
Technicians: Katja Kahlem, Maite Rico, Pili Herruzo, Beatriz Fontal

The Sample Preparation team uses various molecular biology protocols to process DNA and RNA samples for different sequencing applications. The samples prepared in 2012 were divided as follows: 30% genome sequencing, 36% exome capture sequencing, 19% mRNA sequencing, 8% ChIP sequencing, 6% bisulphite sequencing and 1% other special applications. The protocols used are written in the form of Standardised Operating Procedures (SOPs) and include exhaustive quality controls.

Major achievements:

– Implementation of bisulphite sample preparation.

– Implementation of exome capture protocol for FFPE samples.

– Implementation of low input DNA sample preparation protocol for whole genome sequencing, starting with 20 ng of gDNA.

– Implementation of automated high-throughput sample preparation using Caliper liquid handling robots for standard Illumina mRNA and DNA sample preparations, Nimblegen exome capture and Agilent exome capture.

– Implementation of No-PCR DNA sample preparation protocol for whole genome sequencing, leading to significant lowering of the GC sequencing bias.

– Preparation of 4,318 libraries, 2.5 times more than in 2011.

– Extension of RNA protocols portfolio for different rRNA depletions, small RNA, low input mRNA and directional RNA (dirRNA) sample preparation protocols.



sequencing production

Manager: Anna Carreras
Technicians: Ragnhild Birkelund, Rebeca Medina
Lab. Assistant: Glòria Plaja

The Sequencing Production team operates the Illumina 2nd Generation Sequencing platform, including four cBots, ten HiSeq2000s and two Genome Analyser IIx. Standard protocols from Illumina and non-standard protocols are applied; the standard protocols are fully described in the form of SOPs. The CNAG’s sequencing capacity is over 600 Gbases per day, and the centre produced close to 100 Tbases of sequence in 2012. The quality of the sequencing run is checked at several critical steps.
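The capacity figure can be cross-checked against the director’s report, which cites an output equal to seven human genomes at 30-fold coverage per day. The short calculation below is only a back-of-the-envelope sketch: the genome size of 3.1 Gbases is an assumption (an approximate haploid human genome size), not a number taken from this report.

```python
HUMAN_GENOME_GB = 3.1   # assumed haploid human genome size, in Gbases
GENOMES_PER_DAY = 7     # genome equivalents per day, from the director's report
COVERAGE = 30           # fold coverage per genome, from the director's report


def daily_output_gbases(genomes=GENOMES_PER_DAY, coverage=COVERAGE,
                        genome_gb=HUMAN_GENOME_GB):
    """Gbases of sequence per day implied by the stated genome equivalents."""
    return genomes * coverage * genome_gb


daily = daily_output_gbases()
print(f"Implied daily output: about {daily:.0f} Gbases")
```

Under that assumption the implied daily output is a little above 600 Gbases, which is consistent with the capacity stated for the platform.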

Major achievements:

– 276 flow cells run in 2012, 30% more than in 2011. In terms of Tbases/year, data production increased by over 120%.



technology implementation and development

Staff Scientists: Isabelle Brun-Heath, María Méndez
Engineer: Nicolas Boulanger
Postdoctoral Fellows: Silvia Carbonell, Sergio Lario

The Technology Implementation and Development team improves, customises and implements advanced protocols and methods for sequencing, in particular non-standard sample preparation procedures. Methods are transferred to the Sample Preparation team via SOPs and through direct supervision of the technicians until production is smooth and standard.

Major achievements:

– Successful implementation of small RNA and directional RNA sample preparation protocols.

– Development of a protocol for whole genome bisulphite sample preparation. This method was used to profile the first CLL methylome at single base resolution (Kulis et al. 2012).

– DNA-size fractioning with affinity coated magnetic beads was first established manually, followed by the development of automated protocols using Caliper robots. Implementation of this method practically eliminated the use of agarose gels for size selection, and thus the use of ethidium bromide.

– Improvement and customisation of the Bar-Seq protocol used in the EU FP7 funded SYBARIS project.

– Implementation of qPCR as a very precise method for quantifying sequencing libraries.

Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. Kulis M, Heath S, Bibikova M et al., including Bayés M, Gut M and Gut I. Nat Genet. 2012 Oct 14;44(11):1236-42.


bioinformatics analysis

Group Leader: Sergi Beltran
Software Engineer: Jordi Camps
Bioinformatics Technician: Jean-Rémi Trotta

The Bioinformatics Analysis group develops and maintains state-of-the-art pipelines, tools and databases to manage, control, analyse and transfer the sequencing data generated at the CNAG. The aim is to deliver, with a short turnaround time, raw data and results that can be easily read and understood by the CNAG’s collaborators.

The Bioinformatics Analysis group was created in 2012 and is divided into two teams:

- Production Bioinformatics
- Data Analysis



production bioinformatics

Team Leader: Matthew Ingham
Software Engineer: Colin Kingswood
Bioinformatics Technicians: Olga Fernando, Raúl Álcantara

The Production Bioinformatics team is in charge of the day-to-day production of polished, high quality sequence data, and acts as a liaison between the Sequencing Department, other bioinformatics teams and collaborators. It is responsible for developing and implementing an in-house LIMS to track the status of samples from a wide variety of projects as they progress through reception, preparation, sequencing, analysis and data transfer. The team develops and operates the quality control pipelines and is responsible for the delivery of raw data to the CNAG’s collaborators. During 2012, the main goal was to improve the organisation, optimisation and automation of the workflows and pipelines for quality control and data transfer. This reduced the time needed to complete certain tasks and therefore freed up staff to manage the more than 2.2-fold increase in sequencing conducted at the CNAG in 2012.

Major achievements:

– Development, implementation and consolidation of LIMS and an initial export function to the SAP ERP. Major and minor steps in the CNAG’s production workflow can be tracked, including project management, sample reception, library preparation, sequencing, quality control, data transfer and analysis.

– Development of an initial pipeline to automatically transfer raw and aligned data (FASTQ and BAM files, respectively) to the CNAG’s SFTP (installed in 2012) based on LIMS information. This ensures file integrity and generates an email proposal and a table linking files and samples. Data transfers are tracked in LIMS.

– Workflows, pipelines and LIMS have been adapted to improve traceability and file integrity and to cope with increasing amounts of data, indexes from different sources and a multiple sequencing threshold system (ability to choose between coverage, number of reads and yield).

– Change in the bioinformatics quality control workflow from a lane-centred to an FLI (Flowcell Lane Index) centred system to allow greater flexibility. In 2012, quality control was performed for more than 8,400 FLI units.



data analysis

Team Leader: Sergi Beltran
Postdoctoral Fellows: Francesc Castro, Sophia Derdak
PhD Student: Anna Pristoupilova

The Data Analysis team applies cutting-edge bioinformatics solutions to generate meaningful results for the CNAG’s collaborators and projects while providing personalised consulting and support. The team’s responsibilities include developing and operating mapping and data analysis pipelines for variant calling and annotation. The close interaction with other CNAG groups and collaborators ensures the continuous improvement of analysis pipelines.

In 2012, the team’s main goals were to improve organisation, to increase the automation of raw data and mapping pipelines, to develop a stable state-of-the-art variant calling pipeline and, in general, to increase traceability and develop tools to extract information for the CNAG’s entire production system.



Major achievements:

– Improvements in the FASTQ generation pipeline such as a greater integration with LIMS and unattended launching and running.

– Implementation of a stable, semi-automatic and comprehensive Samtools-based variant calling pipeline including local realignment to improve detection of indels. Addition and/or update of annotation and filtering options and statistical tests such as Fisher’s Exact to identify somatic variants. A PDF report summarising the results can be generated automatically.

– Improvements in the mapping pipeline (generates MAP and BAM files) in terms of usability, traceability and LIMS integration. Trimming options and file integrity checks have been added.

– In 2012, variant calling pipelines were used to analyse sequencing data from 850 samples divided into 68 projects. Although most of the samples were human, the pipelines have also been used for samples from other organisms and non-standard projects.

– Development and implementation of a web-based, LIMS integrated, coverage computation tool.

– Development of a web application to monitor the status of all of the CNAG’s sequencers. The application shows dynamically generated plots and statistics and the tables can be downloaded.
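To illustrate how a Fisher’s Exact test can help flag somatic variants, the sketch below compares the read counts supporting the variant and the reference allele in a tumour sample and its matched normal sample. It is a minimal, standard-library illustration and not the CNAG pipeline: the function name, the one-sided formulation and the read counts are all assumptions made for the example.

```python
from math import comb


def fisher_exact_greater(tumour_alt, tumour_ref, normal_alt, normal_ref):
    """One-sided Fisher's exact test on a 2x2 table of read counts.

    Returns the probability of seeing at least `tumour_alt` variant reads
    in the tumour if variant reads were distributed between tumour and
    normal purely in proportion to sequencing depth (hypergeometric model).
    """
    tumour_total = tumour_alt + tumour_ref
    normal_total = normal_alt + normal_ref
    alt_total = tumour_alt + normal_alt
    n = tumour_total + normal_total
    # Largest possible number of variant reads in the tumour.
    k_max = min(tumour_total, alt_total)
    numerator = sum(
        comb(tumour_total, k) * comb(normal_total, alt_total - k)
        for k in range(tumour_alt, k_max + 1)
    )
    return numerator / comb(n, alt_total)


# Hypothetical site: 10 of 30 tumour reads carry the variant, 0 of 30 normal reads.
p_somatic_like = fisher_exact_greater(10, 20, 0, 30)
# Hypothetical germline-like site: the same variant fraction in both samples.
p_germline_like = fisher_exact_greater(5, 25, 5, 25)
```

A small p-value, as in the first case, indicates that the variant reads are enriched in the tumour, which is what would be expected of a somatic variant. In the second case the variant is equally frequent in both samples and the test gives no such evidence.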


04 research programmes


research programmes

bioinformatics development

Group Leader: Simon Heath

The goal of the Bioinformatics Development group is to develop and apply novel bioinformatics and statistical methods for genomics analysis, with a focus on the processing and analysis of next generation sequencing data.

A major theme in the research of the different teams is the development and implementation of computationally efficient, high quality, highly automated, state-of-the-art analysis pipelines that can not only be applied to the analysis of data generated at the CNAG, but can also be exported to other research centres. This is necessary for developing future analysis techniques and pipelines that can cope with changes in both the quantity and type of sequencing data that will become available over the coming years.

The activity of this group is divided into different teams:

- Statistical Genomics
- Algorithm Development
- Functional Bioinformatics
- Genome Assembly and Annotation

statistical genomics

Team Leader: Simon Heath
Staff Scientist: Emanuele Raineri
Master Student: Marc Dabad

The goal of the team is to develop, implement and apply statistical approaches to the analysis of sequencing data in a wide range of genomic projects at the CNAG and other genomics institutes. A general theme of the research conducted by the team is to generate robust, reliable results with useful information about their uncertainty, so that subsequent analyses can optimally combine and interpret results. The specific areas that are currently being worked on are:

- Reliable methods of variant calling from sequencing data. This includes the analysis of pooled samples (multiple individuals), combining information across multiple related individuals (family studies), and considering haplotype information to increase the power to distinguish sequencing errors from variants.

- Methods and pipeline development for the analysis of DNA methylation from whole genome bisulphite sequencing data. The approach followed simultaneously estimates genotype and methylation statuses to increase the robustness of the results. The results include called genotypes and methylation statuses for all covered bases, CpG and non-CpG methylation estimates, regions of hypo or hyper CpG methylation and regions of differential methylation between multiple samples. All estimates are accompanied by indicators of the uncertainty of the prediction.

- Assessment of the effectiveness of published long-read sequence mappers (long read meaning sequence reads of between 100 bp and 1 kbp), with a focus on the speed, sensitivity and accuracy of the returned alignments when the methods are applied to real and simulated data from a variety of sequencing platforms.
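As a simple illustration of reporting an estimate together with its uncertainty, the sketch below computes a methylation level at a single cytosine from bisulphite read counts and attaches a Wilson score interval. This is a generic textbook stand-in, not the team’s method, which jointly estimates genotype and methylation; the function name and the read counts are hypothetical.

```python
from math import sqrt


def methylation_estimate(methylated_reads, total_reads, z=1.96):
    """Methylation level at one site with a Wilson score interval.

    methylated_reads: reads supporting a methylated cytosine.
    total_reads:      all informative reads covering the site.
    z:                normal quantile (1.96 gives roughly a 95% interval).
    Returns (level, lower, upper).
    """
    if total_reads <= 0:
        raise ValueError("need at least one covering read")
    p = methylated_reads / total_reads
    denom = 1 + z * z / total_reads
    centre = (p + z * z / (2 * total_reads)) / denom
    half_width = z * sqrt(
        p * (1 - p) / total_reads + z * z / (4 * total_reads ** 2)
    ) / denom
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Same observed level at low and high coverage (hypothetical counts).
shallow = methylation_estimate(8, 10)
deep = methylation_estimate(80, 100)
```

Both sites show the same observed methylation level, but the interval for the deeply covered site is much narrower. Carrying this kind of information forward lets downstream analyses weight the two sites differently.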

Major achievements:

– Development of the analysis pipeline for, and analysis of, whole genome bisulphite data in the study of methylation differences between newborns and centenarians (Heyn et al. 2012).

– Development of a method for the detection of sequence variants from the study of pooled samples (Raineri et al. 2012).

– Analysis of whole genome bisulphite data in the study of methylation differences in different types of CLL (Kulis et al. 2012).

Distinct DNA methylomes of newborns and centenarians. Heyn H, Li N, Ferreira HJ et al., including Heath SC and Gut IG. Proc Natl Acad Sci U S A. 2012 Jun 26;109(26):10522-7.

Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. Kulis M, Heath S, Bibikova M et al., including Bayés M, Gut M and Gut I. Nat Genet. 2012 Oct 14;44(11):1236-42.

SNP calling by sequencing pooled samples. Raineri E, Ferretti L, Esteve-Codina A, Nevado B, Heath S, Pérez-Enciso M. BMC Bioinformatics. 2012 Sep 20;13:239.


algorithm development

Team Leader: Paolo Ribeca
Postdoctoral Fellow: Leonor Frias
PhD Student: Santiago Marco
Visiting PhD Student: Lukasz Roguski

The goal of the Algorithm Development team is to supply the CNAG with efficient methods for analysing sequencing data, with particular emphasis on both the optimisation of computationally intensive operations and the study of new approaches towards producing higher quality results. More precisely, at the moment the team’s activities are progressing in four different directions:

- Algorithms for aligning short reads. This is the most highly developed research line; the group produced the first prototypes of software tools suitable for analysing Illumina data in late 2008, and has been constantly improving and refining them since that time. The latest version of the GEM aligner is faster than any other published software on the same computing hardware, with best-in-class sensitivity and accuracy.

- Algorithms for de novo assembly of mammalian-sized genomes from short reads. The task is very complicated from both the theoretical and practical standpoints due to the short read lengths of 2nd generation sequencing technologies (100-500 nt).

- Algorithms for flexible compressed storage of genomic data. The CNAG currently has 2 PB of storage capacity. Handling data on such a large scale can make even ordinary operations like copying and sorting files a challenge.

- Algorithms for accelerating in-hardware the processing of high-throughput data. In spite of our best efforts to improve the basic algorithms used to process genomic data, such processing still requires impressive amounts of computational power. Therefore, all new high-performance computational technologies (GPUs, FPGAs, multi-core coprocessors) are being looked into to meet data analysis needs.

Major achievements:

– Publication and integration into the CNAG pipelines of the GEM aligner, which exceeds the accuracy of commonly used aligners and is 5-8 times faster (Marco-Sola et al. 2012).

– Development of a GEM tool box, focusing in particular on RNA-mapping.

– De novo assembly tools for various de novo sequencing projects such as the Iberian lynx.

The GEM mapper: fast, accurate and versatile alignment by filtration. Marco-Sola S, Sammeth M, Guigó R and Ribeca P. Nat Methods. 2012 Dec;9(12):1185-8.


functional bioinformatics

Team Leader: Micha Sammeth (until October 2012)
Postdoctoral Fellows: Thasso Griebel, Anna Esteve

The aim of the Functional Bioinformatics team is to understand the processes by which the information encoded in the genome determines the specific phenotype of a cell. RNA molecules play a key role here, as transcription is an indispensable step for all expressed genes, protein coding or not. Sequencing allows research into the world of cellular RNA without any a priori knowledge of its composition, opening the path to novel analyses.

The team’s current focus is on developing and fine-tuning the bioinformatics pipeline for mapping, quantification and differential gene expression analysis of RNA-Seq experiments. This is being done in collaboration with the Algorithm Development team.

Major achievements:

– Implementation of a portable pipeline building system that allows complex bioinformatics pipelines running on a Unix cluster to be built up using a description language.

– Development and publication of the flux simulator for simulating generic RNA-Seq experiments (Griebel et al. 2012).

– Implementation of a new RNA-Seq mapping (based on the GEM mapper) and quantification pipeline (based on the Flux capacitor) at the CNAG using the above pipeline building system and applied in several projects.

– Major role in the analysis of RNA-Seq data for three international collaborative projects: the EU funded projects GEUVADIS and EVA, and the NIH funded project GTEx.

Modelling and simulating generic RNA-Seq experiments with the flux simulator. Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, Sammeth M. Nucleic Acids Res. 2012 Nov 1;40(20):10073-83.

genome assembly and annotation
Team Leader: Tyler Alioto
Postdoctoral Fellow: André Corvelo

The Genome Assembly and Annotation team is involved in all aspects of de novo genome sequencing projects. Once the sequence data is produced, the sequences are assembled using a modular pipeline developed to carry out the various steps necessary to assemble the huge number of short sequence reads produced from the genomic DNA into a high-quality draft genome sequence. A final draft sequence is used to prepare a consensus gene annotation, defined as the precise delineation of the exonic structures of all the genes present in the genome. A variety of computational gene annotation methods are used that draw upon orthogonal sources of evidence ranging from the genome sequence itself (ab initio) to homology with known proteins, genomic sequence conservation, and RNA-Seq/EST alignments. The resulting set of genes and transcripts can then be assigned putative molecular functions by means of homology with known proteins and protein domains or with known classes of non-coding RNAs.

Major achievements:
– Implementation of the de novo genome assembly assessment server, which is designed to benchmark genome sequence assembly methods and test different sequencing strategies.
– Leadership of an exercise to examine concordance among somatic variant calls made by members of the International Cancer Genome Consortium on a set of whole genome sequencing data corresponding to a CLL tumour-normal sample pair.
– Production of the first high quality draft genome sequence of the Iberian lynx based on whole genome shotgun and fosmid pool sequencing for the Iberian lynx genome project, coordinated by Dr José Antonio Godoy of the Estación Biológica de Doñana (CSIC).
– Production of a first draft gene annotation of the common bean (Phaseolus vulgaris) genome for the Proyecto Genoma-CYTED, coordinated by Dr Alfredo Herrera Estrella.
– Production of a draft genome sequence of turbot (Scophthalmus maximus) based on whole genome shotgun sequencing in collaboration with Dr Antonio Figueras of the Instituto de Investigaciones Marítimas de Vigo.
– Completion of the annotation of the melon (Cucumis melo) genome for the project led by Dr Jordi Garcia Mas of the Centre de Recerca en Agrigenòmica (CRAG) (Garcia-Mas et al. 2012).

http://denovo.cnag.eu/dngaas/

The genome of melon (Cucumis melo L.). Garcia-Mas J, Benjak A, Sanseverino W et al including Alioto T. Proc Natl Acad Sci U S A. 2012 Jul 17;109(29):11872-7.

genome biology
Group Leader: Marc A. Marti-Renom

The Genome Biology group aims at nucleating around the CNAG’s sequencing and analysis platforms a series of teams that will work on elucidating the relationship between sequence, structure and function for entire genomes. The overall objective of the group is to take advantage of the vast amount of data produced by the CNAG Sequencing Platform and perform advanced analyses using newly developed algorithms from the Bioinformatics Development group. The integration of our laboratory and bioinformatics efforts will allow the CNAG to address unanswered questions on genome organisation and function. This research group was set up in 2012 and has grown to a total of seven people. The focus has been to integrate the diverse expertise at the CNAG to develop an in-house research program.

structural genomics
Team Leader: Marc A. Marti-Renom
Staff Scientist: Davide Baù
Postdoctoral Fellow: François Serra
PhD Students: David Dufour, Francisco Martínez, Gireesh K. Bogu

The Structural Genomics team investigates molecular mechanisms that regulate the genome. To study such mechanisms, the laws of physics and the rules of evolution are used to develop and apply computational methods for predicting the 3D structures of macromolecules and their complexes. Our current lines of research are:
- Protein-ligand interactions. The team develops methods for comparative docking of small chemical compounds and their target proteins. Such methods have already been applied to identify drug targets in the genomes of ten organisms that cause tropical diseases.
- Comparative RNA structure prediction. The recent interest in RNA, especially non-coding RNA molecules, has prompted the team to develop a series of tools for the alignment of RNA structures and the prediction of their functions.
- Structure determination of genomes. More recently, the team has worked with experimentalists to study the 3D organisation of chromatin. This work is resulting in the first ever structures of genomic domains and entire genomes.

Major achievements:
– The team has been awarded a highly competitive research grant from the Human Frontier Science Program (HFSP) to coordinate three groups worldwide in the analysis of the human genome structure. This grant will allow us to study how the genome structure changes during the cell cycle, which is one of the outstanding questions in the 3D genomics field.

– Implementation of methods for RNA structure prediction and alignment (Dufour & Marti-Renom, Encyclopedia of Biophysics, in press) and the application of these methods for predicting transmembrane sequences (Baeza-Delgado et al. 2012). We are using computational means to understand how RNA and membrane proteins fold.

Modeling of RNA sequences. Dufour D and Marti-Renom MA. Encyclopedia of Biophysics (2012) in press.

Structure-based statistical analysis of transmembrane helices. Baeza-Delgado C, Marti-Renom MA, Mingarro I. Eur Biophys J. 2013 Mar;42(2-3):199-207. Epub 2012 May 16.

– Development of new methods in genome structure determination (Baù et al. 2012). Our group is a pioneer in the development of hybrid methods for determining the structure of genomes. We published the first ever human and bacterial genome structures.
– Completion of the first pipeline for automatic determination of the three-dimensional structures of genome domains (also called TADs). Our group has finished an entire computational pipeline that uses Hi-C data to automatically build three-dimensional structures of entire genomes. This pipeline is currently used in more than six different projects with collaborators at the CNAG.
– Strengthened efforts in structure determination of RNA molecules using SHAPE-Seq data. We have initiated research collaborations with experimental groups to put into place during 2013 a pipeline for RNA structure determination based on sequencing data.
– Application of the team’s own methods for protein structure prediction to specific medical targets in tuberculosis. We have engaged in the analysis of point mutations in proteins that confer drug resistance in tuberculosis. In particular, we have built models of the alanine racemase protein in M. tuberculosis that explain the mechanisms of resistance to antibiotics.

Genome structure determination via 3C-based data integration by the Integrative Modeling Platform. Baù D, Marti-Renom MA. Methods. 2012 Nov;58(3):300-6.

05 appendix



cnag’s funding evolution
(figure: 2011 and 2012)

2012 cnag’s funding by sources
(figure)

new academic and industrial collaborations
(figure: 2011 and 2012)

Annual activity by sequencing application
(figure)

Annual activity by research area
(figure)

human resources

staff evolution
(figure)

by area and positions
(figure; legend entries: Group and Team Leaders, Heads of Unit, Postdocs and Staff Scientists, Engineers & Technicians, PhD & Master Students, PhD Students, Support)

by area and gender
(figure; legend entries: Male, Female)

projects

Project Title | Principal Investigator | Project Type | Start Date | End Date | CNAG Funding
Finding biomarkers of anti-microbial drug resistance via a systems biology analysis of fungal pathogen interactions with the human immune system (SYBARIS) | Ivo Gut | FP7 Collaborative Project | | | 1,217,463 €
Genetic susceptibility factors in attention-deficit/hyperactivity disorder (ADHD): a two-stage genome-wide association study | Mònica Bayés | Marató TV3 | | | 200,000 €
Sharing capacity across Europe in high-throughput sequencing technology to explore genetic variation in health and disease (GEUVADIS) | Ivo Gut | FP7 Coordination and support action | | | 100,000 €
Structural bioinformatics for drug discovery against organisms causing neglected tropical diseases | Marc A. Marti-Renom | Plan Nacional | | | 56,937 €
European Sequencing and Genotyping Infrastructure (ESGI) | Ivo Gut | FP7 Collaborative Project and Coordination Support Action for Integrating Activities | | | 1,134,978 €
Airway Disease PRedicting Outcomes through Patient Specific Computational Modelling (AirPROM) | Ivo Gut | FP7 Collaborative Project | | | 459,400 €
A genome-wide approach for characterizing the mode of action of novel compounds against tuberculosis (GeMoa) | Marc A. Marti-Renom | Programa Nacional de Internacionalización de la I+D | | | 113,286 €
A BLUEPRINT of Haematopoietic Epigenomes | Ivo Gut | FP7 Collaborative Project: Large-scale integrating project | | | 2,739,628 €
Conformational changes of chromosomes during the cell cycle (ChrCycle) | Marc A. Marti-Renom | Human Frontiers | | | 350,000 €
Reliable and efficient calling of sequence and structural variations from high throughput sequence data | Simon Heath | Plan Nacional | | | 66,550 €
An integrated platform connecting registries, biobanks and clinical bioinformatics for rare disease research (RD-CONNECT) | Ivo Gut | FP7 Collaborative Project: Large-scale integrating project | | | 
Inflammatory Bowel Disease CHARACTERization (IBDCHARACTER) | Ivo Gut | FP7 Collaborative Project | | | 994,600 €


major contracts

Institution | Principal Investigator | Amount
Fondation Synergie Lyon Cancer | Gilles Thomas | 536,299.75 €
Fundación Marqués de Valdecilla | Miguel Angel Piris | 194,153.82 €
Instituto Valenciano de Investigaciones Agrarias (IVIA)/ Eurosemillas | Manuel Talón/ Juan Cano | 78,430.32 €
Institut d’Investigació Biomèdica de Bellvitge (IDIBELL) | Víctor Moreno | 70,105.40 €
University of Liège | Tom Druet | 66,440.50 €
Instituto de Investigación y Formación Agraria y Pesquera (IFAPA) | Manuel Manchado | 54,289 €
Universitat Autònoma de Barcelona (UAB) | Miguel Pérez-Enciso | 54,144.96 €
Centro Nacional de Investigaciones Oncológicas (CNIO) | Francesc Xavier Real | 49,792.92 €
Universitat de Barcelona (UB) | Montserrat Aguade | 47,856.33 €
Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS) | Elias Campo | 45,432.24 €
Institut d’Investigació Biomèdica de Bellvitge (IDIBELL) | Manel Esteller | 43,192.32 €
Universitat Autònoma de Barcelona (UAB) | Josep Maria Folch | 40,608.72 €
Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS) | Sergi Castellví | 36,610.44 €
Center for Genomic Regulation (CRG) | Xavier Estivill | 27,111.60 €
Queensland Institute of Medical Research | Emma Whitelaw | 26,470.12 €
Fundación IDICHUS 150 | María Siso Carril | 26,015.92 €
Instituto de Medicina Molecular | Sergio Fernandes Almeida | 25,891.92 €
University College London | Stephen Beck | 25,516.80 €
Fundación para la Investigación Médica Aplicada (FIMA) | Pau Pastor | 25,516.80 €
Fundación Marqués de Valdecilla | Jon Infante | 25,270.56 €



publications

Analysis of two language-related genes in autism: a case-control association study of FOXP2 and CNTNAP2. Toma C, Hervás A, Torrico B et al including Bayés M. Psychiatr Genet. 2012 Dec 30. (E-publication ahead of print)

Genetic characterization of northeastern Italian population isolates in the context of broader European genetic diversity. Esko T, Mezzavilla M, Nelis M et al including Gut I. Eur J Hum Genet. 2012 Dec 19. (E-publication ahead of print)

Similarity in recombination rate and linkage disequilibrium at CYP2C and CYP2D cytochrome P450 gene regions among Europeans indicates signs of selection and no advantage of using tagSNPs in population isolates. Pimenoff VN, Laval G, Comas D et al including Gut I. Pharmacogenet Genomics. 2012 Dec;22(12):846-57.

The GEM mapper: fast, accurate and versatile alignment by filtration. Marco-Sola S, Sammeth M, Guigó R, Ribeca P. Nat Methods. 2012 Dec;9(12):1185-8.

DNA sequencing - spanning the generations. McGinn S, Gut IG. N Biotechnol. 2012 Nov 16. (E-publication ahead of print)

CpG islands and GC content dictate nucleosome depletion in a transcription-independent manner at mammalian promoters. Fenouil R, Cauchy P, Koch F et al including Gut M and Gut I. Genome Res. 2012 Dec;22(12):2399-408.

High-specificity single-tube multiplex genotyping using Ribo-PAP PCR, tag primers, alkali cleavage of RNA/DNA chimeras and MALDI-TOF MS. Mauger F, Gelfand DH, Gupta A et al including Gut IG. Hum Mutat. 2012 Nov 8. (E-publication ahead of print)

Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. Kulis M, Heath S, Bibikova M et al including Brun-Heath I, Bayés M, Gut M and Gut I. Nat Genet. 2012 Nov;44(11):1236-42.

Evaluation of common variants in 16 genes involved in the regulation of neurotransmitter release in ADHD. Sánchez-Mora C, Cormand B, Ramos-Quiroga JA et al including Bayés M. Eur Neuropsychopharmacol. 2012 Aug 29. (E-publication ahead of print)

SNP calling by sequencing pooled samples. Raineri E, Ferretti L, Esteve-Codina A et al including Heath S. BMC Bioinformatics. 2012 Sep 20;13:239.

Structure-based statistical analysis of transmembrane helices. Baeza-Delgado C, Marti-Renom MA, Mingarro I. Eur Biophys J. 2012 May 16. (E-publication ahead of print)

Polar/Ionizable residues in transmembrane segments: effects on helix-helix packing. Bañó-Polo M, Baeza-Delgado C, Orzáez M et al including Marti-Renom MA. PLoS One. 2012;7(9):e44263.

Neurotransmitter systems and neurotrophic factors in autism: association study of 37 genes suggests involvement of DDC. Toma C, Hervás A, Balmaña N et al including Bayés M. World J Biol Psychiatry. 2012 Mar 8. (E-publication ahead of print)

Modelling and simulating generic RNA-Seq experiments with the flux simulator. Griebel T, Zacher B, Ribeca P et al including Raineri E and Sammeth M. Nucleic Acids Res. 2012 Nov 1;40(20):10073-83.



Landscape of transcription in human cells. Djebali S, Davis CA, Merkel A et al including Alioto T, Kingswood C, Ribeca P, Sammeth M. Nature. 2012 Sep 6;489(7414):101-8.

Genome-wide association study in a Lebanese cohort confirms PHACTR1 as a major determinant of coronary artery stenosis. Hager J, Kamatani Y, Cazier JB et al including Gut I. PLoS One. 2012;7(6):e38663.

An integrated encyclopedia of DNA elements in the human genome. ENCODE Project Consortium, Dunham I, Kundaje A et al including Alioto T, Kingswood C, Ribeca P, Sammeth M. Nature. 2012 Sep 6;489(7414):57-74.

Spinal muscular atrophy associated with progressive myoclonic epilepsy is caused by mutations in ASAH1. Zhou J, Tawk M, Tiziano FD et al including Bayés M, Castro-Giner F and Gut I. Am J Hum Genet. 2012 Jul 13;91(1):5-14.

A mechanistic basis for amplification differences between samples and between genome regions. Veal CD, Freeman PJ, Jacobs K et al including Gut I. BMC Genomics. 2012 Sep 5;13:455.

Distinct DNA methylomes of newborns and centenarians. Heyn H, Li N, Ferreira HJ, Moran S et al including Heath SC and Gut IG. Proc Natl Acad Sci U S A. 2012 Jun 26;109(26):10522-7.

An association study of sequence variants in the forkhead box P2 (FOXP2) gene and adulthood attention-deficit/hyperactivity disorder in two European samples. Ribasés M, Sánchez-Mora C, Ramos-Quiroga JA et al including Bayés M. Psychiatr Genet. 2012 Aug;22(4):155-60.

Neutrality tests for sequences with missing data. Ferretti L, Raineri E, Ramos-Onsins S. Genetics. 2012 Aug;191(4):1397-401.

Genome-wide analysis reveals that Smad3 and JMJD3 HDM co-activate the neural developmental program. Estarás C, Akizu N, García A et al including Beltran S. Development. 2012 Aug;139(15):2681-91.

The tomato genome sequence provides insights into fleshy fruit evolution. Tomato Genome Consortium including Alioto T and Ribeca P. Nature. 2012 May 30;485(7400):635-41.

The genome of melon (Cucumis melo L.). Garcia-Mas J, Benjak A, Sanseverino W et al including Alioto T. Proc Natl Acad Sci U S A. 2012 Jul 17;109(29):11872-7.

High-throughput sequence analysis of turbot (Scophthalmus maximus) transcriptome using 454-pyrosequencing for the discovery of antiviral immune genes. Pereiro P, Balseiro P, Romero A et al including Beltran S. PLoS One. 2012;7(5):e35369.

Association between the NMDA glutamate receptor GRIN2B gene and obsessive-compulsive disorder. Alonso P, Gratacós M, Segalàs C et al including Bayés M. J Psychiatry Neurosci. 2012 Jul;37(4):273-81.

Gene prediction. Alioto T. Methods Mol Biol. 2012;855:175-201.



A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Manning AK, Hivert MF, Scott RA et al including Heath S. Nat Genet. 2012 May 13;44(6):659-69.

BLUEPRINT to decode the epigenetic signature written in blood. Adams D, Altucci L, Antonarakis SE et al including Gut I. Nat Biotechnol. 2012 Mar 7;30(3):224-6.

Threonine-4 of mammalian RNA polymerase II CTD is targeted by Polo-like kinase 3 and required for transcriptional elongation. Hintermair C, Heidemann M, Koch F et al including Gut M and Gut I. EMBO J. 2012 May 1;31(12):2784-97.

Sequence Variants and Haplotype Analysis of Cat ERBB2 Gene: A Survey on Spontaneous Cat Mammary Neoplastic and Non-Neoplastic Lesions. Santos S, Bastos E, Baptista CS et al including Gut IG. Int J Mol Sci. 2012;13(3):2783-800.

Genome structure determination via 3C-based data integration by the Integrative Modeling Platform. Baù D, Marti-Renom MA. Methods. 2012 Nov;58(3):300-6.

Identification of IL7RA risk alleles for rapid progression during HIV-1 infection: a comprehensive study in the GRIV cohort. Limou S, Melica G, Coulonges C et al including Gut IG. Curr HIV Res. 2012 Mar;10(2):143-50.

Novel loci for adiponectin levels and their influence on type 2 diabetes and metabolic traits: a multi-ethnic meta-analysis of 45,891 individuals. Dastani Z, Hivert MF, Timpson N et al including Heath SC. PLoS Genet. 2012;8(3):e1002607.

Ribo-polymerase chain reaction--a facile method for the preparation of chimeric RNA/DNA applied to DNA sequencing. Mauger F, Bauer K, Semhoun J et al including Gut IG. Hum Mutat. 2012 Jun;33(6):1010-5.

Transient receptor potential genes, smoking, occupational exposures and cough in adults. Smit LA, Kogevinas M, Antó JM et al including Gut I. Respir Res. 2012 Mar 23;13:26.

Applications of second generation sequencing technologies in complex disorders. Bayés M, Heath S, Gut IG. Curr Top Behav Neurosci. 2012;12:321-43.

The EvA study: aims and strategy. Ziegler-Heitbrock L, Frankenberger M, Heimbeck I et al including Gut I. Eur Respir J. 2012 Oct;40(4):823-9.

Tuning of natural killer cell reactivity by NKp46 and Helios calibrates T cell responses. Narni-Mancinelli E, Jaeger BN, Bernat C et al including Gut M, Heath SC and Gut IG. Science. 2012 Jan 20;335(6066):344-8.

Candidate system analysis in ADHD: evaluation of nine genes involved in dopaminergic neurotransmission identifies association with DRD1. Ribasés M, Ramos-Quiroga JA, Hervás A et al including Bayés M. World J Biol Psychiatry. 2012 Apr;13(4):281-92.

Fast computation and applications of genome mappability. Derrien T, Estellé J, Marco Sola S et al including Raineri E and Ribeca P. PLoS One. 2012;7(1):e30377.



Evidence for transcript networks composed of chimeric RNAs in human cells. Djebali S, Lagarde J, Kapranov P et al including Ribeca P. PLoS One. 2012;7(1):e28213.


A genome-wide association search for type 2 diabetes genes in African Americans. Palmer ND, McDonough CW, Hicks PJ et al including Heath SC. PLoS One. 2012;7(1):e29202.

Modeling of RNA sequences. Dufour D and Marti-Renom MA. Encyclopedia of Biophysics (2012) in press.

Comparison of SNPs and microsatellites for assessing the genetic structure of chicken populations. Gärke C, Ytournel F, Bed’hom B et al including Gut I. Anim Genet. 2012 Aug;43(4):419-28.

Short-Read Mapping. Ribeca P. Bioinformatics for High Throughput Sequencing (2012) 107-152. Springer New York.

Analysis of new lactotransferrin gene variants in a case-control study related to periodontal disease in dog. Morinha F, Albuquerque C, Requicha J et al including Gut I. Mol Biol Rep. 2012 Apr;39(4):4673-81.

Analysis of RNA Transcripts by High-Throughput RNA Sequencing. Ribeca P, Lacroix V, Sammeth M and Guigó R. Alternative pre-mRNA Splicing (2012) 544-554. Wiley-VCH Verlag GmbH & Co. KGaA.

Genomic binding of Pol III transcription machinery and relationship with TFIIS transcription factor distribution in mouse embryonic stem cells. Carrière L, Graziani S, Alibert O et al including Gut M and Gut I. Nucleic Acids Res. 2012 Jan;40(1):270-83.

A trans-ethnic genetic study of rheumatoid arthritis identified FCGR2A as a candidate common risk factor in Japanese and European populations. Meziani R, Yamada R, Takahashi M et al including Heath S. Mod Rheumatol. 2012 Feb;22(1):52-8.

Sequence variation and mRNA expression of the TWIST1 gene in cats with mammary hyperplasia and neoplasia. Baptista CS, Santos S, Laso A et al including Gut IG. Vet J. 2012 Feb;191(2):203-7.


baldiri reixac, 4 pcb - tower i 08028 barcelona t +34 93 4020542 f +34 93 4037279 www.cnag.eu
