Seminar in Theoretical Computer Science - Mário Lipovský (3.11.2017)
on Friday, 3.11.2017, at 11:00 in room M/213
From: Rastislav Královič
Speaker: Mário Lipovský
Title: Approximate Abundance Histograms and Their Use for Genome Size Estimation
Date: 3.11.2017, 11:00, room M/213
DNA sequencing data is typically a large collection of short strings called reads. We can summarize such data by computing a histogram of the number of occurrences of substrings of a fixed length. Such histograms can be used, for example, to estimate the size of a genome. In this paper, we study a recent tool, Kmerlight, which computes approximate histograms. We discover an approximation bias and propose a new, unbiased version of Kmerlight. We also model the distribution of approximation errors and support our theoretical model with experimental data. Finally, we use another tool, CovEst, to compute genome size estimates from approximate histograms. Our results show that although CovEst was designed to work with exact histograms, it can also be used with their approximate versions, which can be produced using much less memory.
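
To make the notion of an abundance histogram concrete, here is a minimal Python sketch that computes the exact histogram for a toy set of reads. It is only an illustration of the definition: it is not how Kmerlight works (Kmerlight approximates the histogram in small memory), the function name `kmer_abundance_histogram` and the sample reads are hypothetical, and the sketch ignores reverse complements and sequencing errors.

```python
from collections import Counter


def kmer_abundance_histogram(reads, k):
    """Map each abundance a to the number of distinct k-mers seen exactly a times.

    Hypothetical helper for illustration: exact counting, so memory grows
    with the number of distinct k-mers.
    """
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of length k over the read; reads shorter than k
        # contribute nothing because the range is empty.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Count how many distinct k-mers share each abundance value.
    return dict(Counter(kmer_counts.values()))


# Toy input (made up for this example).
reads = ["ACGTA", "CGTAC", "TTACG"]
histogram = kmer_abundance_histogram(reads, k=3)
print(sorted(histogram.items()))  # [(1, 1), (2, 4)]
```

In this toy input, four 3-mers (ACG, CGT, GTA, TAC) occur twice and one (TTA) occurs once. The exact approach stores a counter for every distinct substring, which is the memory cost that approximate histograms aim to avoid.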