Warning: pg_query(): Query failed: ERROR: missing chunk number 0 for toast value 29512337 in pg_toast_2619 in /dati/webiit-old/includes/database.pgsql.inc on line 138 Warning: ERROR: missing chunk number 0 for toast value 29512337 in pg_toast_2619 query: SELECT data, created, headers, expire, serialized FROM cache_page WHERE cid = 'https://www-old.iit.cnr.it/node/743' in /dati/webiit-old/includes/database.pgsql.inc on line 159 Warning: pg_query(): Query failed: ERROR: missing chunk number 0 for toast value 29512337 in pg_toast_2619 in /dati/webiit-old/includes/database.pgsql.inc on line 138 Warning: ERROR: missing chunk number 0 for toast value 29512337 in pg_toast_2619 query: SELECT data, created, headers, expire, serialized FROM cache_page WHERE cid = 'https://www-old.iit.cnr.it/node/743' in /dati/webiit-old/includes/database.pgsql.inc on line 159 -based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment | IIT - CNR - Istituto di Informatica e Telematica
IIT Home Page CNR Home Page

-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment

Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, that naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and not. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. A second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well known statistical indexes, i.e., the F-measure and the partition distance.
BMC Bioinformatics, 2007

Autori: P. Ferragina, R. Giancarlo, V. Greco, G. Manzini, G. Valiente
Autori IIT:

Tipo: Articoli su riviste non ISI con referee internazionali
Area di disciplina: Mathematics
DOI: http://dx.doi.org/10.1186/1471-2105-8-252