by Axel Bernal

Support Vector Machines

by A. Bernal, K. Hovsepian, J.L. Lassez

Singular Value Decomposition

by J. Bernick, J.L. Lassez

Clustering

by K. Hovsepian, J.L. Lassez

Agents in Bioinformatics

by John Graham

by John Franco

On satisfiable CNF-formulas closed under literal flipping

by Bert Randerath

Worst case bounds for XSAT and Set Partitioning

by Ewald Speckenmayer

by Khaled Elbassioni

Inferring Minimal Functional Dependencies in Horn and q-Horn Theories

by Alex Kogan

On the Number of Patterns with Large Coverage in LAD

by Mutsunori Yagiura

Handling categorical variables in machine learning

by Endre Boros

by Gabriele Alexe

Enumeration of Patterns in Discrete Datasets

by Sorin Alexe

Strong Pattern Generation and Applications to the Logical Analysis of Numerical Data

by Ying Liu

Further information on the Symposium, registration, hotel, schedule, web-proceedings, etc. can be found at http://rutcor.rutgers.edu/~amai

Hidden Markov Models (HMMs) have been highly successful in many applications, particularly in speech processing. More recently, HMMs have become a key tool for many applications in genomics and proteomics. The structure and associated algorithms of HMMs will be fully described, and an industrial application to the automated detection of genes will be presented in order to highlight the fact that the algorithms per se are only part of the solution; knowledge-based heuristics learned in the field are of utmost importance. In conclusion, several other applications will be mentioned.
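As an illustration of the "associated algorithms" mentioned above, here is a minimal sketch of Viterbi decoding, the standard dynamic program for recovering the most likely hidden-state path. The NumPy implementation and the toy two-state model in the comments are illustrative assumptions, not the speakers' industrial gene-detection system.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence.

    obs: sequence of observation indices
    start_p[s], trans_p[s, t], emit_p[s, o]: model probabilities
    """
    n_states = len(start_p)
    T = len(obs)
    # work in log space to avoid underflow on long sequences
    logv = np.log(start_p) + np.log(emit_p[:, obs[0]])
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        # scores[s, s'] = best log-prob of a path ending in s, then moving to s'
        scores = logv[:, None] + np.log(trans_p)
        back[t] = scores.argmax(axis=0)
        logv = scores.max(axis=0) + np.log(emit_p[:, obs[t]])
    # backtrack from the best final state
    path = [int(logv.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy two-state model: each state strongly prefers one observation symbol,
# so the decoded path tracks the observations.
# viterbi([0, 0, 0, 1, 1, 1],
#         np.array([0.5, 0.5]),
#         np.array([[0.9, 0.1], [0.1, 0.9]]),
#         np.array([[0.9, 0.1], [0.1, 0.9]]))  -> [0, 0, 0, 1, 1, 1]
```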

Support Vector Machines have become a major tool in supervised learning. Their application to computational molecular biology is very recent, and their use is expected to expand substantially. In the first part we will describe the basic mathematical aspects of the main algorithm. In the second part we will illustrate a significant and very recent use of support vector machines for the automated detection of translation initiation sites and promoter sequences in prokaryotic genomes, and suggest further applications.
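For readers unfamiliar with the "main algorithm", the optimization behind a linear SVM can be sketched with a primal subgradient method on the hinge loss (Pegasos-style). This is a generic illustration in NumPy, not the solver the speakers use; the hyperparameters and function name are assumptions.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style subgradient descent on the regularized hinge loss.

    X: (m, n) feature matrix; y: labels in {-1, +1}.
    Returns a weight vector w; classify new points by sign(x @ w).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            t += 1
            eta = 1.0 / (lam * t)            # decreasing step size
            margin = y[i] * (X[i] @ w)
            w *= (1 - eta * lam)             # shrinkage from the regularizer
            if margin < 1:                   # hinge loss is active
                w += eta * y[i] * X[i]
    return w
```

On linearly separable data the learned w separates the two classes, e.g. points with positive first coordinate labeled +1 and their mirror images labeled -1.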

The Singular Value Decomposition has been widely used for pattern recognition and noise removal in image analysis. It is also closely connected to standard statistical tools such as Principal Component Analysis. More recently it has been used in document retrieval and Latent Semantic Analysis, which is the context in which we will describe the algorithm. We analyse the effects and results of applying the algorithm as a tool to eliminate noisy data and to improve the performance of Support Vector Machines. In conclusion, the concept of latent semantic analysis and its potential use in bioinformatics will be briefly discussed.
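The denoising use of the SVD described above amounts to a truncated (low-rank) decomposition of a term-document matrix. A minimal NumPy sketch, with hypothetical function and variable names:

```python
import numpy as np

def lsa_embed(A, k):
    """Rank-k truncation of a term-document matrix A via the SVD.

    Returns (A_k, doc_coords): the denoised rank-k approximation of A,
    and k-dimensional coordinates for each document (column of A).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] * s[:k] @ Vt[:k]          # keep only the top-k singular triples
    doc_coords = (s[:k, None] * Vt[:k]).T    # documents as rows in latent space
    return A_k, doc_coords
```

Taking k equal to the full rank reproduces A exactly; smaller k discards the weakest singular directions, which is the noise-removal step.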

Research on clustering algorithms has received a major boost with the advent of the World Wide Web and the Human Genome Project. Nevertheless, only a few algorithms, such as K-means and its variants, and Self-Organizing Maps, have made an impact at the industrial level. Indeed, judging by the motto of Google's Directory, "Humans Do It Better", it is clear that clustering is an area in need of a major breakthrough, at least in the context of search engines and bioinformatics. These two algorithms will be presented and discussed in the context of biological applications, along with a very recent algorithm, Zoomed Clusters.
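K-means, mentioned above, can be sketched in a few lines of NumPy as Lloyd's algorithm, alternating point assignment and centroid updates. This generic version is illustrative only, not the speakers' code.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and
    centroid recomputation until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # squared distance from every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

On two well-separated groups of points the algorithm recovers the groups regardless of which points seed the centers.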

Biological information and algorithms for the analysis of biological data are available on the Internet in many different locations with overlapping content, different structure, and varying levels of curation. The explosive growth in genomic (and soon, expression and proteomics) data, as exemplified by the Human Genome Project, further increases the need for multi-agent information-gathering technologies. John Graham will present the system DECAF (Distributed, Environment-Centered Agent Framework), a software toolkit for the rapid design, development and execution of "intelligent" agents to achieve solutions in complex software systems. DECAF provides an operating environment for agents in much the way an operating system provides an environment for executing traditional programs, including system services, synchronization and network protocols. Additionally, DECAF provides for the characterization of agent actions to assist in real-time scheduling procedures. John will then illustrate how to apply a system like DECAF to build reusable information-gathering systems for bioinformatics.

We describe a hybrid SAT solver which takes, as input, a collection of Boolean functions in standard forms such as BDD, CNF, etc., and determines an interpretation satisfying all functions or verifies that none exists. The solver uses a BDD front end which includes all the usual BDD-based tools to pre-process the given collection. If pre-processing does not find a solution, then SAT-based search is applied to the pre-processed form except that, instead of translating to CNF as is usually done, the search is conducted over BDD space. The motivation for this design comes from the following two requirements: 1) the solver must stay in the user's domain space at all times to avoid loss or mutilation of domain-specific knowledge which is useful for searching; 2) the solver may be applied multiple times to inputs which are all the same except for a few small changes.
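For contrast with the BDD-space search described above, here is a minimal sketch of the conventional CNF-based DPLL search that such a translation would lead to. The clause representation (sets of signed integers) and function names are assumptions for illustration.

```python
def dpll(cnf):
    """Baseline DPLL search over CNF: unit propagation plus branching.

    cnf is a list of clauses, each a set of nonzero ints (+v / -v for
    variable v). Returns a satisfying {var: bool} dict, or None.
    """
    if not cnf:
        return {}                       # all clauses satisfied
    if any(not c for c in cnf):
        return None                     # an empty clause: conflict
    # unit propagation: a one-literal clause forces that literal
    unit = next((next(iter(c)) for c in cnf if len(c) == 1), None)
    if unit is not None:
        return _assign(cnf, unit)
    # branch on an arbitrary literal, trying both polarities
    l = next(iter(cnf[0]))
    res = _assign(cnf, l)
    return res if res is not None else _assign(cnf, -l)

def _assign(cnf, l):
    """Set literal l true: drop satisfied clauses, shrink the rest."""
    reduced = [c - {-l} for c in cnf if l not in c]
    sub = dpll(reduced)
    return None if sub is None else {**sub, abs(l): l > 0}
```

For example, the formula (x1 ∨ x2) ∧ ¬x1 ∧ (¬x2 ∨ x3) is satisfied by x1 = False, x2 = True, x3 = True.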

We consider a problem due to Kleine Büning: what are the satisfiable CNF-formulas closed under literal flipping? With 2CNF-formulas of this kind we can associate the following graph: the vertex set corresponds to the set of variables and the edge set corresponds to the set of clauses, i.e., for each clause the corresponding variables are joined by an edge. Applying graph-theoretic methods, we determine all 2CNF-formulas of this kind. Moreover, we give some evidence indicating that the corresponding decision problem, whether a satisfiable 3CNF-formula is closed under literal flipping, is intractable. Finally, we consider games on related propositional formulas and graph Ramsey games.
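The abstract does not spell out the flipping operation; assuming it means negating a single literal occurrence in the formula, the property can be checked by brute force on tiny instances. This exponential-time sketch only fixes the definition and is in no way the talk's method.

```python
from itertools import product

def satisfiable(cnf, n):
    """Brute-force SAT over n variables; literals are +/-i for variable i."""
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause)
               for clause in cnf):
            return True
    return False

def closed_under_flipping(cnf, n):
    """True if the formula is satisfiable and stays satisfiable after
    negating any single literal occurrence (one reading of 'flipping')."""
    if not satisfiable(cnf, n):
        return False
    for i, clause in enumerate(cnf):
        for j in range(len(clause)):
            flipped = [list(c) for c in cnf]
            flipped[i][j] = -flipped[i][j]
            if not satisfiable(flipped, n):
                return False
    return True
```

For instance, the single clause (x1 ∨ x2) is closed under flipping, while the formula x1 ∧ x1 is not, since flipping one occurrence yields ¬x1 ∧ x1.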

XSAT denotes the problem of determining a truth assignment for a Boolean formula C over n variables in conjunctive normal form with exactly one true literal in each clause. The problem remains NP-hard even if all literals are unnegated; in this case we are dealing with the Set Partitioning problem. We develop an algorithm solving XSAT in time proportional to |C| 2^(n/4) and outline how this result can be extended to determining a cheapest solution when weights are placed on the variables. The result improves a recently achieved bound of 2^(0.2985n) for the same problem by Drori and Peleg.
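For comparison with the |C| 2^(n/4) algorithm above, a naive 2^n brute-force checker makes the XSAT condition concrete; this is purely illustrative, not the talk's algorithm.

```python
from itertools import product

def xsat(cnf, n):
    """Brute-force XSAT: find an assignment making exactly one literal
    true in every clause, or return None.

    cnf is a list of clauses; literals are +/-i for variable i (1-based).
    """
    for bits in product([False, True], repeat=n):
        if all(sum(bits[abs(l) - 1] == (l > 0) for l in clause) == 1
               for clause in cnf):
            return list(bits)
    return None
```

With all literals unnegated this is exactly Set Partitioning: e.g. the clauses (x1 ∨ x2) and (x2 ∨ x3) admit the exact cover x2 = True, x1 = x3 = False.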

We consider data analysis applications where data attributes take values ranging over small finite lattices L_1, ..., L_n. Frequently, a partially defined monotone binary function f: L --> {0,1}, where L = L_1 x ... x L_n, is sought to represent the data. In many applications, the function f is defined by sets A and B of L consisting of the minimal positive and maximal negative examples. A natural question is then to determine whether f is totally defined and, if not, to find a point x in L that is not covered by the given positive and negative examples. We give an efficient algorithm for answering this question. As an application, given a database of transactions D (a subset of L) and an integer threshold t, we consider the problem of generating all minimal (maximal) elements of L that are covered by at most (at least) t elements of D. Such generation problems arise frequently in knowledge discovery and data mining applications.
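The question above has a direct brute-force specification, shown here for the Boolean case L_i = {0,1}. The talk's algorithm is efficient; this naive cube scan only fixes the definitions.

```python
from itertools import product

def uncovered_point(minimal_pos, maximal_neg, n):
    """Brute force over {0,1}^n: return a point lying above no minimal
    positive example and below no maximal negative example, or None
    if the partially defined monotone function is in fact total."""
    geq = lambda x, a: all(xi >= ai for xi, ai in zip(x, a))
    for x in product([0, 1], repeat=n):
        if not any(geq(x, a) for a in minimal_pos) and \
           not any(geq(b, x) for b in maximal_neg):
            return x
    return None
```

For n = 2, the single positive example (1,1) and negative example (0,0) leave (0,1) and (1,0) uncovered; adding both as maximal negative examples makes the function total.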

In this paper, we study various problems related to the inference of minimal functional dependencies in Horn and q-Horn theories. We show that if a Horn theory is represented by a Horn CNF, then there exists an incrementally polynomial algorithm for inferring all minimal functional dependencies. On the other hand, if a Horn theory is represented as the Horn envelope of a given set of models, then there exists a polynomial total time algorithm for this inference problem if and only if there exists such an algorithm for dualizing a positive CNF. Finally, we generalize our results to the case of q-Horn theories, and show that all the considered problems can be reduced in polynomial time to the corresponding problems for Horn theories.

We consider data sets consisting of n-dimensional binary vectors representing positive and negative examples for some (possibly unknown) phenomenon. We analyse, both theoretically and experimentally, the number of patterns with large coverage and show that it is quite small if the number of examples in the given data set is sufficiently large.

Most rule-based learning algorithms use, implicitly or explicitly, some binarization of the non-binary attributes (e.g., they generate rules in which propositions of the type "if temperature is greater than 99F then ..." are used in the generated classifier). However, such propositions are typically not easy to generate when categorical attributes are present which have several different but not comparable values (e.g., "green", "red", "blue", etc.). In principle, membership in any subset of these values could serve as a binary indicator variable; however, the number of such subsets can be too large for the learning algorithm to handle. In this talk we consider several measures for evaluating binary indicator variables generated from input attributes with many categorical values, and show that finding the "best" such binary indicator variable according to these measures is a very special binary optimization problem. This problem turns out to be NP-hard, but one which still admits a fully polynomial approximation scheme. We develop an approximation algorithm and demonstrate its effectiveness on several real-life data sets.
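The indicator-variable search described above can be made concrete by naive exhaustive search over value subsets. This sketch uses a simple agreement-count score as a stand-in for the talk's measures, and it is the brute-force baseline, not the approximation scheme.

```python
from itertools import combinations

def best_indicator(values, labels):
    """Exhaustively score subsets S of the categorical values; the binary
    indicator variable is 'value in S'.

    values: categorical attribute per example; labels: binary class labels.
    Returns the best-scoring subset and its score (number of examples on
    which the indicator, or its complement, agrees with the label).
    """
    cats = sorted(set(values))
    best, best_score = None, -1
    for r in range(1, len(cats)):          # proper nonempty subsets only
        for S in combinations(cats, r):
            pred = [v in S for v in values]
            agree = sum(p == l for p, l in zip(pred, labels))
            score = max(agree, len(labels) - agree)  # allow the complement
            if score > best_score:
                best, best_score = set(S), score
    return best, best_score
```

The exponential number of subsets is exactly why the exhaustive search becomes infeasible for attributes with many values, motivating the approximation algorithm of the talk.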

Given two disjoint sets $\Omega^{+}$ and $\Omega^{-}$, consisting of positive and negative points of $\mathbf{R}^{n}$ respectively, a pure positive (negative) \textit{pattern} is an interval of $\mathbf{R}^{n}$ which contains a nonempty subset of $\Omega^{+}$ ($\Omega^{-}$, respectively), and which is disjoint from $\Omega^{-}$ ($\Omega^{+}$, respectively). A pure positive pattern $P$ is \textit{spanned} if it has no proper subinterval $P^{\prime}$ having the same intersection with $\Omega^{+}$ as $P$. An incremental polynomial method, resembling the Blake-Quine consensus method of Boolean algebra, is proposed for finding all spanned pure positive patterns. The efficiency of the proposed method is evaluated on various publicly available data sets.
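The pattern definitions above translate directly into code. Here is a small checker for pure positive patterns, with intervals as per-coordinate bounds and points as tuples; the representation is an illustrative assumption.

```python
def is_pure_positive_pattern(box, pos, neg):
    """Test whether an interval of R^n is a pure positive pattern.

    box: list of (lo, hi) bounds, one pair per coordinate.
    pos, neg: the positive and negative point sets.
    A pure positive pattern contains at least one positive point
    and no negative point.
    """
    inside = lambda p: all(lo <= x <= hi for x, (lo, hi) in zip(p, box))
    return any(inside(p) for p in pos) and not any(inside(p) for p in neg)
```

For example, with positive points (1,1), (2,2) and negative point (5,5), the box [0,3] x [0,3] is a pure positive pattern, while [0,6] x [0,6] is not.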

Sets of "positive" and "negative" points (observations) in n-dimensional discrete space, given along with their non-negative integer multiplicities, are analyzed from the perspective of the Logical Analysis of Data (LAD). A set of observations satisfying upper and/or lower bounds imposed on certain components is called a positive pattern if it contains some positive observations and no negative ones. The number of variables on which such restrictions are imposed is called the degree of the pattern. A total polynomial algorithm is proposed for the enumeration of all patterns of limited degree, and special efficient variants of it for the enumeration of all patterns with certain "sign" and "coverage" requirements are presented and evaluated on a publicly available collection of benchmark datasets.

Given a data set consisting of ``positive'' and ``negative'' points in ${\mathbb{R}}^n,$ a branch-and-bound method is proposed for identifying an {\it interval} (strong pattern) in ${\mathbb{R}}^n$ which contains a maximum weight subset of positive (negative) points, without containing any negative (positive) points. We define a {\it model} as a collection of such intervals (obtained by certain variations of the weights) whose union covers the given set of points. Various types of models obtained in this way are examined, and their classification potential is evaluated on a series of standard data sets.