Seminar of Theoretical Computer Science - Mário Lipovský (3.11.2017)
Friday 3.11.2017 at 11:00, Lecture room M/213
Mário Lipovský:
Approximate Abundance Histograms and Their Use for Genome Size Estimation
Abstract:
DNA sequencing data is typically a large collection of short strings called reads. We can summarize such data by computing a histogram of the number of occurrences of substrings of a fixed length. Such histograms can be used for example to estimate the size of a genome. In this paper, we study a recent tool, Kmerlight, which computes approximate histograms. We discover an approximation bias, and we propose a new, unbiased version of Kmerlight. We also model the distribution of approximation errors and support our theoretical model by experimental data. Finally, we use another tool, CovEst, to compute genome size estimates from approximate histograms. Our results show that although CovEst was designed to work with exact histograms, it can be used with their approximate versions, which can be produced in a much smaller memory.
web: http://kedrigern.dcs.fmph.uniba.sk/STI2
rss: http://kedrigern.dcs.fmph.uniba.sk/STI2/rss.php