Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases
Abstract
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences that contain repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate at different stages of the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories, affecting all downstream
Domains
Bioinformatics [q-bio.QM]
Origin: Files produced by the author(s)