Esteban Pérez-Wohlfeil, Sergio Diaz-del-Pino & Oswaldo Trelles

Abstract

In the last decade, a technological shift has occurred in the bioinformatics field: larger genomes can now be sequenced quickly and cost-effectively, creating a computational need to compare large and abundant sequences efficiently. Furthermore, detecting conserved similarities across large collections of genomes remains a problem. The size of chromosomes, along with the substantial amount of noise and the number of repeats found in DNA sequences (particularly in mammals and plants), leads to a scenario where executing and waiting for complete outputs is both time- and resource-consuming. Filtering steps, manual examination and annotation, very long execution times and a high demand for computational resources are a few of the many difficulties faced in large genome comparisons. In this work, we provide a method designed for comparing considerable numbers of very long sequences. It employs a heuristic algorithm capable of separating noise and repeats from conserved fragments in pairwise genomic comparisons. We provide a software implementation that runs in linear time on as little as a single core, with a small, constant memory footprint. The method produces both a previsualization of the comparison and a collection of indices that drastically reduce computational complexity when performing exhaustive comparisons. Finally, the method scores the comparison to automate the classification of sequences and produces a list of detected synteny blocks to enable new evolutionary studies.

Figure 3

Similarity map of twelve primate species. Full genomes were compared. From left to right and top to bottom, genomes are ordered using the maximum likelihood method based on amplified genes. The compared genomes belong to six families: Hominidae (dark purple), Hylobatidae (light purple), Cercopithecidae (yellow), Cebidae (green), Lemuridae (cyan) and Cheirogaleidae (blue). Each cell of the upper right matrix shows the dot plot for the comparison of the corresponding row and column. The list of copyright permissions for the pictures used in the figure is available in Additional File 1.

Data Availability

All data generated or analysed during this study are included in this published article (and its supplementary information files).

Code Availability

All sources, scripts and executables described or employed in the manuscript can be accessed at https://github.com/estebanpw/chromeister.
