Global Sequencing: new high-throughput sequencing technologies, huge bioinformatic challenges

Richard Christen. Université de Nice & CNRS UMR 6543. Centre de biochimie. Parc Valrose, Nice, France

Since 2004, massively parallel DNA sequencing (MPS) technologies have exploded onto the scene, offering dramatically higher throughput and lower per-base costs than were previously possible with electrophoretic sequencing. This generation of sequencing and the next will make it possible to sequence 1,000 human genomes and to characterize thousands of transcriptomes and microbial communities within a few years, with unprecedented depth and resolution. Tens of millions of "sequencing tags" can now be obtained at a cost similar to what tens of thousands used to cost, and next-generation technologies are on the way. Over the past year, MPS has been applied to profile protein-DNA interactions, cytosine methylation, genetic variation, genomic rearrangements, transcriptomes and biodiversity. A central challenge is the design of efficient algorithms for these analyses.

In all fields, the challenge is how best to generate biologically meaningful interpretations of very large datasets, as well as how to store and search tera- or petabytes of data. Some of these goals are not new, but they need to be revisited because of the deluge of data to analyse.

Next-generation sequencing technology: new sequencing platforms, new frontiers, new algorithms

Valentina Boeva, Institut Curie, Paris

A new and powerful technology capable of producing gigabytes of DNA data is changing the genomics field. Platforms such as the Roche (454) GS FLX sequencer, the Illumina Genome Analyzer and the Applied Biosystems SOLiD sequencer can produce millions of short sequence reads. Two types of sequence data constitute the output of these platforms: short single reads (32-250 bp, depending on the platform) and short paired-end reads (short reads separated by unknown spacers of 200 bp-3 kb). The first type of output is suitable for genome resequencing, whole-transcriptome acquisition, microRNA discovery, methylation inference, ChIP-Seq experiments and SNP discovery. The second type is useful for whole-genome sequencing, assessment of structural rearrangements and DNA copy-number alterations, as well as SNP discovery. New algorithms suited to each task still need to be developed. Indeed, although the required programs have started to appear (some are delivered with the sequencing platforms), high-quality alignment of short reads to a reference genome and de novo genome assembly still pose a computational challenge.
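
As a toy illustration of why short-read alignment is computationally demanding, the following Python sketch maps a read by the classic seed-and-extend idea: index the reference k-mers, then verify each candidate locus while tolerating a few mismatches. All names and parameters are invented for the example; production aligners use far more sophisticated indexes and scoring.

```python
# Minimal seed-and-extend short-read mapping sketch (illustrative only).
from collections import defaultdict

def build_kmer_index(genome, k):
    """Index every k-mer of the reference by its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def map_read(read, genome, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, then verify the full read
    against each candidate locus, allowing a few mismatches."""
    hits = []
    for pos in index.get(read[:k], []):
        window = genome[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

genome = "ACGTACGTTTGACGTACGAA"
index = build_kmer_index(genome, k=4)
print(map_read("ACGTACGA", genome, index, k=4))  # → [(0, 1), (4, 2), (11, 0)]
```

With millions of reads against gigabase references, even this simple verification step dominates, which is why index design is the heart of the problem.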

Alternative splicing as a playground of evolution

Mikhail Gelfand, Institute for information transmission problems  RAS, Moscow


Most eukaryotic genes are encoded in DNA in a mosaic manner, being interrupted by nonsense inserts called introns (similar to advertisements in a TV show). After transcription, introns are removed, whereas the remaining pieces, exons, are joined and form mature mRNA; this process is called splicing. Many genes (e.g. about 60-70% of human genes) are alternatively spliced, that is, one gene yields several different mRNAs and hence encodes several different proteins.


Recently, considerable interest was generated by studies of evolution of alternative splicing. It has been demonstrated that alternative exons are more easily gained and lost than constitutive ones, the rate of sequence changes in them is higher, and, finally, they show signs of positive selection (bias towards mutations changing the encoded protein).


Taken together, these observations show that alternative splicing may serve as a testing ground for new exons and hence new protein functions. Such exons are easily formed by point mutations that create splice sites, but initially their inclusion level is low. However, if an exon proves useful, its inclusion level increases through point mutations in regulatory sites. Simultaneously, such an exon experiences rapid evolution towards a local peak in the fitness landscape.


As opposed to evolution by gene duplication, this allows the cell to retain proteins identical in some regions, but completely different in other regions, which might be important for many functions (e.g. for using secreted, cytoplasmic, and membrane forms of the same protein).

Phageomics: a systematic way to study the process of phage infection

Konstantin Severinov, Institutes of Gene Biology and Molecular Genetics, Moscow, Russia and Waksman Institute, Rutgers University, Piscataway, USA

Bacteriophages (phages) are the most abundant life form on the planet, with estimates of the total number of particles exceeding 10^31. Whole-genome sequencing also reveals that phages are the most diverse life form. However, our current knowledge of the regulatory strategies used by phages during infection of their hosts is limited to a few well-studied classic phages of E. coli (and even there it is far from complete). These strategies provide paradigms of genetic regulation directly relevant to all living organisms, but they clearly represent only a tiny sample of the variety of regulatory strategies evolved by phages. We are developing a set of tools and approaches that yield information about genetic regulation in bacteriophage-infected cells and that are sufficiently general to allow highly efficient studies of novel bacteriophages for which no functional data exist.

Statistical sampling and rational design of RNA

Yann Ponty, LRI Orsay

As we further our understanding of biological processes, RNA is found to play many more roles than initially suspected. Furthermore, the therapeutic promises raised by the discovery of RNA interference provide incentives to study this once-overlooked biopolymer.

Thanks to a joint effort of biomathematicians and biochemists, an energy model, the Turner model, provided firm ground for the study of RNA folding. Coupled with a minimal free-energy hypothesis, it allowed a dynamic-programming-based prediction of RNA structure, giving rise to the popular MFold software. Based on the same principle, Ding et al. showed that stochastic sampling of the folding space leads to better predictions and can contribute to the rational design of RNAs. Simultaneously, Giegerich et al. found that sampling is a suitable tool to capture the existence of multiple stable states of RNAs (riboswitches).
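
For readers unfamiliar with the dynamic-programming core of these methods, here is a minimal Nussinov-style recurrence that maximizes the number of base pairs. It is a deliberate simplification of the Turner-energy DP behind MFold and the sampling approaches: real predictors score stacks and loops rather than individual pairs.

```python
# Nussinov-style base-pair maximization (didactic simplification).
def nussinov(seq, min_loop=3):
    n = len(seq)
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
             ("G", "U"), ("U", "G")}  # Watson-Crick plus G-U wobble
    # dp[i][j] = maximal number of base pairs in seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                # j left unpaired
            for k in range(i, j - min_loop):   # j paired with some k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0

print(nussinov("GGGAAAUCCC"))  # → 3 (a simple hairpin stem)
```

The sampling methods of Ding et al. replace the `max` by a Boltzmann-weighted sum over the same recursion tree and then draw structures from the resulting distribution.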

This talk will be split into two parts. After introducing the energy model and the tools used by these approaches, focusing mostly on their 'algorithmic core', I will present two algorithmic improvements to the statistical sampling approaches, together with some open questions. The second part will be dedicated to the inverse folding of RNA, which consists in designing a sequence that folds into a desired structure. Although 'local search' approaches are the current standard for this problem, its theoretical complexity remains unknown and motivates further study in the field of inverse optimization.

How to model and to study the dynamics of the RNA secondary structure

Eugene Asarin, LIAFA Paris


Many important biological phenomena, such as the regulation of gene expression, are based on the dynamics of RNA secondary structure. Building precise yet tractable quantitative models of these dynamics, capable of predicting real biological phenomena, is an important research challenge. We will present existing approaches to modelling the secondary structure and its dynamics, mostly by brute-force stochastic simulation. We will then discuss a couple of novel formalizations of the problem (using probabilistic rewriting systems, etc.), possible improvements to the simulation algorithms, and the possibility of simulation-free direct algorithms.

(Based on a joint work with V. Lyubetsky, A. Seliverstov, Th. Cachat and T. Touili)

Virtual cell: methods and approaches

Fedor Kolpakov, Institute of System Biology, Novosibirsk, Russia

Reconstruction of a whole-cell model should be data-driven and requires the integration of diverse omics data: genomics, transcriptomics, proteomics, metabolomics and interactomics. All this information should be stored in integrated, organism-specific databases.

Different modelling approaches and algorithms for cellular pathway reconstruction should be used for different types of processes. For metabolic pathways, constraint-based modelling, flux-balance analysis and ordinary differential equations are generally used. For the reconstruction and simulation of gene regulatory networks, logical and hybrid models are used, for example Boolean networks, generalized logical models and generalized threshold models. For signal transduction pathways, ODE models are mainly used. Agent-based modelling is used to take into account interactions between cells, cell growth and differentiation.
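
As a minimal illustration of the logical-model approach, the following sketch simulates a synchronous Boolean network and detects its attractor cycle. The three-gene network and its rules are invented for the example.

```python
# Toy synchronous Boolean network with attractor detection (illustrative).
def step(state, rules):
    """Apply every update rule simultaneously (synchronous update)."""
    return {gene: rule(state) for gene, rule in rules.items()}

def find_attractor(state, rules, max_steps=100):
    """Iterate until a previously seen state recurs; return the cycle."""
    seen = []
    for _ in range(max_steps):
        if state in seen:
            return seen[seen.index(state):]  # the attractor cycle
        seen.append(state)
        state = step(state, rules)
    return None

# Hypothetical 3-gene circuit: A activates B, B activates C, C represses A.
rules = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
}
cycle = find_attractor({"A": True, "B": False, "C": False}, rules)
print(len(cycle))  # → 6 (a six-state oscillation)
```

Generalized logical and threshold models refine this picture with multi-valued levels and asynchronous updates, but the state-transition viewpoint is the same.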

There have been several attempts to build a whole-cell model. The best known are the E-Cell project, the reconstruction of E. coli (Covert et al., Nature (2004), 429:92-96) and the HepatoSys project. Here we will consider the BioUML workbench (http://www.biouml.org) and our approaches to whole-cell model reconstruction and simulation. The BioUML workbench is an integrated environment for systems biology that spans a comprehensive range of capabilities, including access to databases with experimental data, tools for the formalized description of the structure and functioning of biological systems on different logical levels, as well as tools for their visualization and simulation.

GeneModels: a new database that contains information about predicted transcription factor binding sites and data from microarray experiments, used to build models of gene expression regulation.

BMOND, the Biological MOdels aNd Diagrams database (http://bmond.biouml.org), contains diagrams and models that will be used as building blocks for the reconstruction of metabolic pathways and signal transduction networks.

Recognition of protein function using local similarity

Dmitry A. Filimonov

Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Pogodinskaya Str., 10, Moscow, 119121, Russia

Functional annotation of amino acid sequences is one of the most important problems in bioinformatics. Different approaches have been applied to the recognition of some functional classes; nevertheless, many functional groups still cannot be predicted with the required accuracy. We developed a new method for the recognition of protein functions based on an original concept of local similarity. The query ("new") sequence is compared with each sequence from a set of sequences with known functions; local similarity scores are calculated for the positions of the query sequence and used as input data for a classifier. The method was tested on different sets of sequences. Nearly 100% recognition accuracy was obtained at various levels of the classification hierarchy for several sets of proteins representing non-overlapping functional classes. Prediction accuracy was lower (but still reasonable) for the recognition of ligand specificity in cytochromes P450. The performance of the proposed method is high, and it can be successfully applied both to the prediction of protein functional classes and to the selection of functionally significant sites in amino acid sequences.
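
The per-position local-similarity idea can be sketched roughly as follows; the window length, the binary match score and the absence of a downstream classifier are all simplifications for illustration, not the authors' actual method.

```python
# Per-position local-similarity profile of a query against known sequences
# (rough illustrative sketch, not the published algorithm).
def local_similarity_profile(query, references, w=3):
    """For each query position, the best fraction of matches over any
    length-w window of any reference sequence."""
    profile = []
    for i in range(len(query) - w + 1):
        window = query[i:i + w]
        best = 0.0
        for ref in references:
            for j in range(len(ref) - w + 1):
                score = sum(a == b for a, b in zip(window, ref[j:j + w])) / w
                best = max(best, score)
        profile.append(best)
    return profile

refs = ["MKVLAT", "MKILSS"]  # invented "known function" sequences
print(local_similarity_profile("MKVLSS", refs, w=3))
```

Such a profile, computed against each functional class, is the kind of fixed-length numeric input a classifier can consume; positions with consistently high scores also flag candidate functionally significant sites.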

Joint work with Boris N. Sobolev, Kirill E. Alexandrov, Vladimir V. Poroikov

Protein repeats: from sequence to structure and functions

Andrey V. Kajava,  CRBM,  CNRS, Montpellier

A significant portion of the proteins carrying out fundamental functions contain arrays of tandem repeats. Furthermore, over the last years, evidence has accumulated of a high incidence of tandem repeats in the virulence proteins of pathogens, in toxins and in allergens. The genetic instability of these regions can allow a rapid response to environmental changes and can thus lead to emerging infection threats. In addition, tandem repeats frequently occur in amyloidogenic, prion and other disease-related human sequences. Thus, the discovery of these domains and the study of their structure and function promise to be a fertile direction of research, leading to the identification of targets for new drugs and vaccines.

This presentation will survey current challenges in this area, including the identification of protein repeats in genomes, the structural classification of proteins with repeats, prediction and molecular modeling of the 3D structure of these proteins, inference of their functions, and application of this knowledge to the annotation of genomes.

Blood as a new type of active medium: spatial dynamics of clot formation and coagulation disorders

Fazly Ataullakhanov, Mikhail Panteleev

Center for theoretical problems of physico-chemical pharmacology RAS, Lomonosov Moscow State University, Physics Dept., National Center for Hematology, RAMS 

Biological systems are usually considered to be very complicated. Indeed, each of them contains tens or hundreds of different elements. At the same time, experience suggests that the correct approach to a biological system reveals the simplicity and strict functional determinacy of its arrangement and action. If so, why do we see so many elements in any biological system, and why does it often show such multifaceted behaviour? Mostly, this is related to the multitasking functionality and multitudinous regulations of biological systems. If only one function is selected, or only one regulatory circuit is considered, precious few significant parameters turn out to determine how the system works. Admittedly, it is often not simple to point out these significant parameters. However, one can be quite sure that only two or three such parameters and/or reactions exist. This is enough to perform practically any single complex function, while the system stays dynamically well determined: dynamic chaos does not arise. In a system with a great number of elements, such simplicity is achieved through the abundance of possible interactions between the elements. This makes it possible to reconcile simplicity and complexity in one system. One pattern of interactions leads to one simple system with one significant determinant and mode of action, while another pattern gives a system with quite a different mode of behaviour, and very often with different significant determinants.

We will try to illustrate this with the blood coagulation system. The whole process involves many physiological systems; we will confine ourselves to plasma coagulation only, which consists of fewer than about twenty biochemical reactions. This system demonstrates rather complicated dynamic behaviour, including phase-transition-type trigger waves, autowaves, etc. Let us compare coagulation with a classic object of science, the Belousov-Zhabotinsky reaction (BZR), which also demonstrates complicated dynamic behaviour. At present it is considered to consist of about twenty elementary chemical processes that determine the dynamics of the system as a whole. At first blush the blood coagulation system looks much more complicated than the BZR: each of its ten stages is catalyzed by an enzyme, which in turn needs several parameters to be described. In dynamical terms, too, as we will show, the blood coagulation system may possess regimes that have not so far been observed in BZR-type reaction media. We shall try to show that, despite all of this, the correct approach allows one to discover that the blood coagulation system is organized very simply, much more simply than the BZR.

It turns out that the blood coagulation metabolic network fulfils four different physiological functions, and to this end the system possesses four dynamically different sub-systems acting in different phases of coagulation. This is important not only for basic research but also for medicine, because it determines the nature of coagulation disorders. Classifying the disorders simply as procoagulant or anticoagulant appears primitive and inefficient.

Mathematical models of the erythrocyte: what they give us for understanding the disorders and ageing of this cell

Fazly Ataullakhanov, Center for theoretical problems of physico-chemical pharmacology RAS,

Lomonosov Moscow State University, Physics Dept., National Center for Hematology, RAMS 

Different types of inherited hemolytic anemia are associated with a surprisingly small reduction in the activity of certain enzymes. In a normal cell one can observe a large excess of this activity (up to 10^4-fold greater than the maximal rate of the metabolic pathway that includes the enzyme). At the same time, the decrease in enzyme activity in anemic erythrocytes often does not even reach 10-fold. It would seem that the remaining activity should be enough to support adequate functioning of the metabolic pathway; nevertheless, the erythrocytes of anemia patients are evidently destroyed. Moreover, no correlation is observed between the decrease in enzyme activity and the severity of the anemia.

To interpret this paradox we used a rather complex mathematical model, involving all the major metabolic pathways as well as the other molecular systems determining cell viability. The analysis confirmed the paradox: a 10-fold decrease in enzyme activity in the model did not produce any registered disturbance of metabolism or cell viability. The model allowed us to calculate the threshold values of these enzyme activities below which the cell dies.

The significant shortening of erythrocyte lifetime, and the resulting anemia, turned out to be due not to the decrease in mean enzyme activity in the erythrocytes, but to the significantly shortened lifetime of the mutant form of the enzyme. The hypothesis that the patient's newborn erythrocytes have 100% enzyme activity, which then decreases rather rapidly and causes cell death upon reaching the threshold level, was checked by means of the model. At the same time, the mean enzyme activity in the blood decreases much more weakly than its threshold level, in rather good agreement with the decrease in activity observed in patients. Consequently, the total number of erythrocytes in the blood, which characterizes the severity of the anemia, depends only on the inactivation rate and not on the mean activity of the enzyme. Subsequent experimental investigations confirmed the strongly decreased stability of the mutant enzyme.

Common intervals of chromosomes with paralogous genes

Mathieu Raffinot, LIAFA Paris

Let C = c_1 c_2 .. c_N be a sequence of length N over a set of genes (also called the alphabet; its size is not fixed as a constant). A fingerprint F is the set of genes contained in a segment c_i .. c_j.

Let C1 and C2 be two sequences of lengths N and M over a common set of genes (again, the alphabet size is not fixed). A common interval is a pair [i,j]x[k,l] such that the two segments c1_i .. c1_j and c2_k .. c2_l have the same fingerprint.

Fingerprint and common interval are two fundamental notions in computational comparative genomics.


In this talk I survey several algorithmic approaches that lead to efficient algorithms to compute fingerprints and identify common intervals.

An Optimized Counting Graph

Mireille Régnier, INRIA Rocquencourt, France

Recent results on word counting will be presented that improve the time and space complexity of assessing the significance of exceptional words in a genome. This is joint work with E. Furletova (IMPB).

A Data-Mining Approach To Time-Series Microarray Alignment for Crossing Large-Scale Biomolecular and Literature Information

Nicolas Turenne, INRA Jouy-en-Josas, France

Large-scale biological experiments such as microarrays have been available for a decade and require data-analysis methods to process their results. When studying less well-known species, one of the main purposes is comparison with model species. Chip comparison is useful in our project to collect information about groups of genes reacting around a target gene, about which we can then gather pieces of information from the literature. I will also present how to cross information from chips and from the literature. Our biological aim is to understand bovine embryo development and the proliferation genes involved. A possible approach is based on the comparison of two data sets. Our data sets relate to different species that do not share a common biological clock, i.e. proteins are not produced at the same speed in each species. Hence, to compare two sets of sequential time-series data, one approach is alignment, long studied for DNA sequences and still being improved; for microarray data, however, the multidimensional nature of the data is not handled by such methods. Our hypothesis consists in aligning some parts of the microarrays, restricting the hypothesis space of all alignments. We use a classical clustering approach for each chip and try to merge some clusters by consensus, so as to study each gene individually using the symbolic time properties of simultaneity and precedence.
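
As background, classic dynamic time warping is one standard way to align two time series whose "clocks" run at different speeds; it is shown here only to illustrate the alignment idea, not as the method of the talk.

```python
# Classic dynamic time warping between two 1-D expression time series.
def dtw(a, b):
    """Minimal cumulative |a_i - b_j| cost over monotone alignments."""
    inf = float("inf")
    n, m = len(a), len(b)
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # a stalls
                                  dp[i][j - 1],      # b stalls
                                  dp[i - 1][j - 1])  # both advance
    return dp[n][m]

# The second series is a time-stretched copy of the first: DTW cost is 0.
print(dtw([0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0, 2.0, 1.0]))  # → 0.0
```

The difficulty raised in the abstract is precisely that microarray data are multidimensional (one value per gene per time point), so a scalar cost like `|a_i - b_j|` no longer suffices and the space of alignments must be restricted, e.g. via per-chip clustering.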

Gene position scoring within transcription regulation networks

Ivan Junier, Programme d'Epigenomique, Evry

Transcription within the eukaryotic nucleus and within the bacterial nucleoid is expected both to depend on and to determine the three-dimensional conformation of the chromosomes. Accordingly, transcribed genes, RNA polymerases and transcription factors (TFs) have been shown to gather in vivo into discrete foci. I will first show that, compared to a random location of genes along the DNA, regular positioning along the DNA makes specialized co-localization more efficient. This corroborates recent results that have highlighted a periodic positioning of genes regulated by the same transcription factors in yeast, and of TF-coding genes in E. coli.

Next, I will present scoring methods that are useful for revealing such periodic patterns in gene positions. In particular, I will show the efficiency of the method in highlighting several groups of genes that are periodically located and share the same optimal period. Within this scope, I will discuss the definition of an empirical positional scoring function that could be combined with sequence-alignment scoring in order to improve the automated detection of binding-site DNA sequences. Finally, I will give a glimpse of statistical studies of DNA conformations that could be used to determine ab initio the positional scoring function of genes that tend to co-localize spatially.

Mining the semantics of genome super-blocks to infer ancestral architectures

Macha Nikolski, LaBRI, Université de Bordeaux, Talence, France

The study of evolutionary mechanisms is made more and more accurate by the increase in the number of fully sequenced genomes. One of the main problems is to reconstruct plausible ancestral genome architectures based on the comparison of contemporary genomes. Current methods have largely focused on finding complete architectures for ancestral genomes and, owing to the computational difficulty of the problem, stop after a small number of equivalent minimal solutions have been found. Recent results suggest, however, that the set of minimal complete architectures is very large and heterogeneous. In fact these solutions are collections of conserved blocks, freely rearranged.

In this talk, we identify these conserved super-blocks using a new method of analysis of ancestral architectures that reconciles breakpoint and rearrangement analyses and respects biological constraints. The resulting algorithms permit the first reliable reconstruction of plausible ancestral architectures for several non-WGD yeasts simultaneously, a problem hitherto intractable owing to the extensive map reshuffling of these species.

Computing efficiently the evolution distance from minisatellites

Jean-Marc Steyaert, Ecole Polytechnique, Palaiseau

Successive duplication events are responsible for the evolution of minisatellite maps. Alignments of two minisatellite maps should therefore take these duplication events into account, in addition to the well-known edit operations. We present and discuss several algorithms that gradually improve the efficiency of computing the distance between two maps, allowing evolutionary distances to be computed efficiently. Using such an algorithm, we derive quantitative evidence of a directional bias in the growth of the minisatellites of the MSY1 dataset.
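
To convey the flavour of such algorithms, here is a toy edit-distance DP extended with a single-unit tandem-duplication operation, in which a copy of the left neighbour is cheaper than a plain insertion. This simplified model is illustrative only; the algorithms of the talk handle full minisatellite maps with variant repeat units.

```python
# Edit distance with an extra tandem-duplication operation (toy model).
def map_distance(s, t, cost_indel=2, cost_sub=1, cost_dup=1):
    """DP over prefixes: t[j-1] may be explained as a tandem duplication
    of its left neighbour, which is cheaper than a plain insertion."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * cost_indel
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + cost_indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else cost_sub)
            dp[i][j] = min(match,
                           dp[i - 1][j] + cost_indel,   # deletion from s
                           dp[i][j - 1] + cost_indel)   # plain insertion
            if j > 1 and t[j - 1] == t[j - 2]:          # tandem duplication
                dp[i][j] = min(dp[i][j], dp[i][j - 1] + cost_dup)
    return dp[n][m]

print(map_distance("ab", "aab"))  # → 1: one duplication of unit 'a'
```

Making duplications cheaper than generic insertions is what lets an alignment reveal a directional growth bias: runs of identical units are explained by duplication events rather than by unrelated insertions.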

Some questions of interpretation of results for DNA-protein binding on tiling arrays

Vsevolod Makeev, GUP GosNIIgenetika, Moscow, Russia

Chromatin immunoprecipitation on DNA chips (ChIP-chip) is the method most commonly used for the in vivo discovery of transcription factor binding sites (TFBS). However, this technique only locates TFBS within large sequence segments and does not yield their precise positions. ChIP-chip is a direct method, and the intensity of the ChIP-chip signal is believed to correlate with binding affinity. On the other hand, in silico methods for assessing TFBS binding affinity are based on position weight matrices (PWMs). This technique, although popular, suffers from specific trade-offs and can only give estimated positions. It therefore appears reasonable to compare the two approaches. The objective of this presentation is to discuss the comparison of in silico predicted TFBS at different score thresholds with the in vivo signal obtained on tiling arrays. We took the classical results of [1] as a testing ground. In particular, the effects of TFBS clustering on ChIP-chip results are discussed.

We found that the overall hit ratio increases with increasing PWM threshold, so the enrichment of experimental ChIP-chip segments in "strong" sites is greater than that in "weak" sites. Correlation analysis reveals a very similar length-correlation pattern, with two significant correlation lengths, for all signal and control datasets. Thus, it seems that the correlation of signals at neighbouring tags comes from the length distribution of fragments produced by DNA sonication, not from positional clustering of sites on the chromosome. On the other hand, individual isolated sites appear unable to bias the binding enough to give a visible ChIP-chip signal; only clusters of sites give a reliable signal.

1. S. Cawley et al. (2004) Unbiased Mapping of Transcription Factor Binding Sites along Human Chromosomes 21 and 22 Points to Widespread Regulation of Noncoding RNAs, Cell, 116: 499-509.
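
As background on the in silico side, a position weight matrix assigns each candidate site a sum of per-column log-odds weights, and raising the score threshold trades sensitivity for specificity. The matrix and sequences below are invented for the example.

```python
# Toy PWM construction and genome scanning (illustrative data only).
import math

def make_pwm(sites, background=0.25, pseudocount=1.0):
    """Log-odds weight matrix built from aligned binding sites."""
    w = len(sites[0])
    pwm = []
    for j in range(w):
        col = {}
        for base in "ACGT":
            count = sum(site[j] == base for site in sites) + pseudocount
            freq = count / (len(sites) + 4 * pseudocount)
            col[base] = math.log(freq / background)
        pwm.append(col)
    return pwm

def scan(seq, pwm, threshold):
    """Start positions of windows scoring at or above the threshold."""
    w = len(pwm)
    scores = [sum(col[b] for col, b in zip(pwm, seq[i:i + w]))
              for i in range(len(seq) - w + 1)]
    return [i for i, s in enumerate(scores) if s >= threshold]

pwm = make_pwm(["TGACGT", "TGACGC", "TGAGGT"])  # three invented sites
print(scan("AATGACGTCCTGATTT", pwm, threshold=4.0))  # → [2]
```

Lowering the threshold admits more "weak" sites, which is exactly the regime where, as described above, only clusters of sites produce a reliable ChIP-chip signal.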

Genomic Exploration of the Hemiascomycetous Yeasts

David Sherman, LaBRI and INRIA, Bordeaux, France

The evolutionary processes that shape genomes leave overlapping traces in contemporary genomes, and the comparison of the latter still provides the most powerful tool for understanding the former. It is with this goal that the Génolevures consortium has sequenced, annotated and analyzed sixteen eukaryote genomes from the hemiascomycetous yeasts since 1999. Only the consideration of complete, assembled genomes makes it possible to draw confident conclusions about the mechanisms of gene formation and loss, of segmental and complete duplication, and of chromosome dynamics. I will present an overview of nine complete genomes and what we learn from their study, as well as results from a recent study of protoploid genomes from the Saccharomycetaceae. Using algorithmic construction of multi-species families of homologous proteins and of syntenic markers, we can identify a basic repertoire of functions and estimate models of breakage and genome rearrangement. With this we can develop a fairly complete overall view, and we can even today draw the evolutionary history of hemiascomycetous yeasts and fix in time the invention and loss of complex mechanisms.