Annual Report 2012


Finally, the work generated by another genome sequencing/annotation project in which we have participated (“Annotation of the Solanum lycopersicum (tomato) genome”) was also published in the journal Nature in 2012 (Tomato Genome Consortium, 2012).

2. Prediction of selenoproteins

Predicting selenoprotein genes is particularly difficult in eukaryotic genomes, because selenocysteine (Sec) is specified by the UGA codon, which is normally a stop codon (Mariotti et al., 2012). We have been developing computational methods for selenoprotein prediction since early 2000. In the last year we have continued our work on selenoproteins that have been lost independently in a number of different insect lineages, in order to pinpoint the mechanisms of selenoprotein extinction in insects. We now have genome assemblies for 8 Drosophila species from the Saltans group, which we identified as interesting for selenoproteins from preliminary PCR experiments. During 2012, we worked to obtain a robust phylogenetic tree that defines the topology of the Saltans group and of all other Drosophilas, allowing us to draw the full history of events leading to Sec extinction.

Searching for selenoproteins and selenocysteine machinery genes, we found an interesting pattern of selenoprotein presence: although most of the Saltans species possess the full selenoprotein repertoire, 3 species were found to have lost selenocysteine: D. sturtevanti and D. milleri (which cluster together phylogenetically, suggesting a common selenoprotein extinction before their split), and D. neocordata. The latter is the most interesting species for selenoproteins: it appears to have lost or converted to cysteine 2 selenoproteins, while a third one remains apparently intact. This last selenoprotein is SPS2, whose function is the production of selenocysteine itself. The rest of the Sec machinery is almost entirely intact, and yet some features of these genes strongly suggest that selenocysteine can no longer be encoded in this species, and thus that even the SPS2 gene cannot be functional. We believe that a very recent extinction took place in this species. We predict that, with enough time, the Sec machinery will degenerate further here too.
This may provide a powerful tool to test neutral models of evolutionary theory. Observing the selenoprotein presence in all Drosophilas, and in D. neocordata in particular, we formulated a model of selenoprotein extinction in which the selenoprotein machinery can degenerate only after the selenoproteins have already been lost or converted to cysteine homologues. SPS2, which acts both as a selenoprotein and as part of the selenoprotein machinery, must therefore always be the last selenoprotein to be lost, which is what we observe in D. neocordata. We then focused on finding the causes of selenoprotein extinctions in the Saltans group, and we noticed that this group is peculiar among Drosophilas in several respects, particularly in GC content and codon bias. We believe that here the selenoprotein extinctions are a downstream effect of a more general process acting in this lineage, probably involving a generalized accelerated evolution of protein coding genes. To investigate this further, and hoping to gain insights into this general process, we proceeded to annotate all protein coding genes in our genomes. We used a profile-based approach, adapting the program selenoprofiles for this purpose, and we complemented it with ab initio predictions by geneid. We are now using these annotations for a few functional analyses, whose aim is to find functional classes that behave in a peculiar way in this specific lineage: an overrepresentation among duplicated or lost genes, a striking pattern of differential expression when compared to non-Saltans species, or a particularly high or low evolutionary rate.
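As an illustration of what testing a functional class for overrepresentation among lost genes can look like, the sketch below computes a one-sided hypergeometric p-value. It is not the analysis pipeline used in this work: the function name, its parameters, and the gene counts in the usage example are all hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, class_size, sample, class_in_sample):
    """One-sided hypergeometric test (hypothetical helper, for illustration).

    population      -- total number of annotated genes
    class_size      -- genes in the population that belong to the functional class
    sample          -- genes in the set of interest (e.g. lost genes)
    class_in_sample -- genes in that set that belong to the functional class

    Returns P(X >= class_in_sample) when `sample` genes are drawn
    without replacement from the population.
    """
    upper = min(class_size, sample)
    total = comb(population, sample)
    # Sum the probabilities of observing k or more class members in the sample.
    tail = sum(
        comb(class_size, k) * comb(population - class_size, sample - k)
        for k in range(class_in_sample, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Invented numbers: 12000 annotated genes, 300 in the functional class,
    # 200 lost genes, 15 of which belong to the class.
    p = hypergeom_enrichment_pvalue(12000, 300, 200, 15)
    print(f"enrichment p-value: {p:.3g}")
```

When many functional classes are tested at once, the resulting p-values would also need a multiple-testing correction, which this sketch omits.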

3. Methods for transcriptome analysis and the ENCODE project

The field of transcriptomics has recently been given a huge boost by the use of “second” generation high throughput sequencing technologies to sequence RNA samples. Second-generation sequencing technologies provide an unprecedented capacity for surveying the nucleic acid content of cells. Especially since these techniques started to be applied to transcriptome sequencing, we have become increasingly aware of the large number of human genes that show alternative splice forms, as well as of the large variety of splice forms that these genes can have, which may range from just two splice variants to hundreds. On the other hand, the accelerating rate of data production with these new technologies is moving the bottleneck in many studies from data generation to the actual analysis of the data. It is therefore important to design methods with which we can analyze them in a fast and efficient manner. Our aim is to use the data from these experiments to determine the exact transcript abundances within the cell: not only a qualitative list of the transcripts that are expressed, but also the exact expression level of each transcript and alternative variant, while at the same time developing a highly automated method that will allow us to take advantage of the huge amounts of data available. Therefore, and as part of the ENCODE projects led by Tom Gingeras (Transcriptome) and Tim Hubbard (GENCODE), our group has been working towards the development of a number of tools for RNASeq processing. These include the GEM read aligner (P. Ribeca, now at the CNAG;
