Information Theory of Genomes

/ Authors

/ Abstract

The relation of genome size to organismal complexity is still described rather equivocally. Neither the number of genes (G-value) nor the total amount of DNA (C-value) correlates consistently with phenotypic complexity. Using information-theoretic considerations, we developed a model that allows a quantitative estimate of the amount of functional information in a genomic sequence. This model easily answers the long-standing question of why GC content is increased in functional regions. The model allows a consistent estimate of genome complexity, resolving the major discrepancies of the G- and C-values. For related organisms with similarly complex phenotypes, this estimate provides biological insight into the complexity of their niches. This theoretical framework suggests that biological information can evolve rapidly on demand from the environment, mainly in non-coding genomic sequence, and it explains the role of duplications in the evolution of biological information. Knowing the approximate amount of functionality in a genomic sequence is useful for many applications, such as phylogenetic analyses, in silico discovery of functional elements, and prioritising targets for genotyping and sequencing.

Journal: arXiv: Genomics