We are pleased to announce the following four special sessions and one additional tutorial session to be held
during the 7th Int'l Symposium on Artificial Intelligence and Mathematics,
Jan. 2--4, 2002 in Ft. Lauderdale, Florida.
Session on ALGORITHMS in BIOINFORMATICS
(Chair: Jean-Louis Lassez)
Hidden Markov Models
by Axel Bernal
Support Vector Machines
by A. Bernal, K. Hovsepian, J.L. Lassez
Singular Value Decomposition
by J. Bernick, J.L. Lassez
Clustering
by K. Hovsepian, J.L. Lassez
Agents in Bioinformatics
by John Graham
Session on Satisfiability (Chair: Endre Boros)
A Non-linear SAT Solver
by John Franco
On satisfiable CNF-formulas closed under literal flipping
by Bert Randerath
Worst case bounds for XSAT and Set Partitioning
by Ewald Speckenmayer
Session on Combinatorial Data Analysis (Chair: Endre Boros)
Learning monotone binary functions in products of lattices
by Khaled Elbassioni
Inferring Minimal Functional Dependencies in Horn and q-Horn Theories
by Alex Kogan
On the Number of Patterns with Large Coverage in LAD
by Mutsunori Yagiura
Handling categorical variables in machine learning
by Endre Boros
Session on Logical Analysis of Data (Chair: Peter L. Hammer)
Consensus-Type Algorithm for Spanned Pattern Generation
by Gabriele Alexe
Enumeration of Patterns in Discrete Datasets
by Sorin Alexe
Strong Pattern Generation and Applications to the Logical Analysis of Numerical Data
by Ying Liu
Logical Analysis of Data: theory, algorithms and applications (Tutorial by Peter L. Hammer)
Logical Analysis of Data (LAD) is a coherent collection of Boolean,
combinatorial and optimization methods for the analysis of data.
The tutorial will introduce the basic elements of LAD (support sets, patterns,
models, discriminants), and will explain the mathematical and algorithmic
aspects of it. In order to illustrate LAD's usefulness for the analysis and
classification of data, as well as its potential for the development of decision
support systems, the tutorial will demonstrate this methodology on real life
applications and publicly available data sets.
Further information on the Symposium, registration, hotel, schedule,
web-proceedings, etc. can be found at http://rutcor.rutgers.edu/~amai
Hidden Markov Models
Axel Bernal
Hidden Markov Models (HMMs) have been most successful in many applications, particularly in speech processing. More recently, HMMs have become a key tool for many applications in genomics and proteomics. The structure and associated algorithms of HMMs will be fully described, and an industrial application to the automated detection of genes will be presented in order to highlight the fact that the algorithms per se are only part of the solution; knowledge- based heuristics learned in the field, are of utmost importance. Concluding, several other applications will be mentioned.
Support Vector Machines
Axel Bernal, Karen Hovsepian, Jean-Louis Lassez
Support Vector Machines have become a major tool in supervised learning. Their application to computational molecular biology is very recent, and their use is to expand substantially. In a first part we will describe the basic mathematical aspects of the main algorithm. In a second part we will illustrate a significant and most recent use of support vector machines for the automated detection of translation initiation sites and promoter sequences in prokaryotic genomes and suggest further applications.
Singular Value Decomposition
Jonathan Bernick, Jean-Louis Lassez
The Singular Value Decomposition has been widely used for pattern recognition, and noise removal in image analysis. It is also closely connected to standard statistical tools such as Principal Component Analysis. More recently it has been used in document retrieval and Latent Semantic Analysis,which is the context that we will use to describe the algorithm.
We analyse the effects and results of applying the algorithm as a tool to eliminate noisy data and to improve the performance of Support Vector Machines. In conclusion, the concept of latent semantic analysis and its potential use in bioinformatics will be briefly discussed.
Clustering
Karen Hovsepian, Jean-Louis Lassez
Research on clustering algorithms has received a major boost with the advent of the world wide web and the Human Genome Project. Nevertheless, only a few algorithms, such as the K-means and its variants, and Self Organizing Maps, have made some impact at the industrial level.
Even though, by seeing the Motto of Google's Directory "Humans Do it Better"it is clear that clustering is a major area in need of a major breakthrough, at lest in the context of search engines and bioinformatics. These two algorithms will be presented and discussed in the context of biological applications, as well as a most recent algorithm, Zoomed Clusters.
Agents in Bioinformatics
John Graham
Biological information and algorithms for the analysis of biological data are available on the Internet in many different locations with overlapping content, different structure, and varied amounts levels of curation.
The explosive growth in genomic (and soon, expression and proteomics) data as, exemplified by the Human Genome Project, further increases the need for multi-agent information gathering technologies.
John Graham will present the system DECAF (Distributed, Environment Centered Agent Framework), which is a software toolkit for the rapid design, development and
execution of "intelligent" agents to achieve solutions in complex software systems.
DECAF provides an operating environment for agents in much the way an operating system provides an environment for executing traditional programs including, system services, synchronization and network protocols. Additionally, DECAF provides for the
characterization of agent actions to assist in real-time scheduling procedures.
John will then illustrate how to apply a system like DECAF to build reusable information gathering systems for bioinformatics.
A Non-linear SAT Solver
John Franco (University of Cincinnati, Cincinnati, Ohio; franco@gauss.ececs.uc.edu)
We describe a hybrid SAT solver which takes, as input, a collection of
Boolean functions in standard forms such as BDD, CNF, etc. and determines an
interpretation satisfying all functions or verifies none exists. The solver
uses a BDD front end which includes all the usual BDD-based tools to pre-process
the given collection. If pre-processing does not find a solution then SAT-based
search is applied to the pre-processed form except that, instead of translating
to CNF as is usually done, the seach is conducted over BDD space. The
motivation for this design comes from the following two requirements: 1) the
solver must stay in the user's domain space at all times to avoid loss or
mutilation of domain-specific knowledge which is useful for searching; 2) the
solver may be applied multiple times to inputs which are all the same except for
a few small changes.
On satisfiable CNF-formulas closed under literal flipping
Bert Randerath (University of Cologne, Germany, randerath@informatik.uni-koeln.de)
Co-authors: Ewald Speckenmeyer
We consider a problem due to J. Kleine B"uning: What are the satisfiable CNF-formulas
closed under literal flipping. For 2CNF-formulas of this kind we can associate the
following graph. The vertex set corresponds to the set of variables and the edge set
corresponds to the set of clauses, i. e. for each clause the corresponding variables are
joint by an edge. Now, applying graph-theoretic methods we determine all 2CNF-formulas of
this kind. Moreover, we give some evidence indicating that the corresponding decisision
problem, whether a satisfiable 3CNF-formula is closed under literal flipping, is
intractable. Finally, we consider games on related propositional formulas and graph Ramsey
games.
Worst case bounds for XSAT and Set Partitioning
Ewald Speckenmeyer (University of Cologne, Germany; esp@informatik.Uni-Koeln.DE)
Co-authors: B. Monien
XSAT denotes the poroblem of determining a truth assignment
for a Boolean formula C over n variables in conjunctive normal form
with exacly one true literal in each clause. The problem remains
NP-hard even if all literals are unnegated. In this case we deal
with the Set Partitioning problem.
We develop an algorithm solving XSAT in time proportional to |C|2^(n/4)
and outline how this result can be extended to the case of determining a
cheapest solution when dealing with weights on the variables.
The result improves a recently achieved bound of 2^(0.2985n) for the
same problem by Drori and Peleg.
Learning monotone binary functions in products of lattices
Khaled Elbassioni, Computer Science Department, Rutgers University
elbassio@paul.rutgers.edu
Joint work with Endre Boros, Vladimir Gurvich and Leonid Khachiyan.
We consider data analysis applications, where data attributes take values
ranging over small finite lattices L_1, ..., L_n. Frequently a partially
defined monotone binary function f:L -->{0,1}, where L=L_1 x ... x L_n is
sought to represent the data. In many applications, the function f is defined
by sets A and B of L consisting of minimal positive and maximal negative
examples. A natural question is then to determine whether f is totally defined
and if not, find a point x in L that is not covered by the given positive and
negative examples. We give an efficient algorithm for answering this question.
As an application, given a database of transactions D (a subset of L), and an
integer threshold t, we consider the problem of generating all minimal
(maximal) elements of L that are covered by at most (at least) t elements of
D. Such generation problems arise frequently in knowledge discovery and data
mining applications.
Inferring Minimal Functional Dependencies in Horn and q-Horn Theories
Alex Kogan (Rutgers University; kogan@rutcor.rutgers.edu)
Co-authors: Toshihide Ibaraki and Kazuhisa Makino
In this paper, we study various problems related to the inference of
minimal functional dependencies in Horn and q-Horn theories. We show that if a
Horn theory is represented by a Horn CNF, then there exists an incrementally
polynomial algorithm for inferring all minimal functional dependencies. On the
other hand, if a Horn theory is represented as the Horn envelope of a given set
of models, then there exists a polynomial total time algorithm for this
inference problem if and only if there exists such an algorithm for dualizing a
positive CNF. Finally, we generalize our results to the case of q-Horn theories,
and show that all the considered problems can be reduced in polynomial time to
the corresponding problems for Horn theories.
On the Number of Patterns with Large Coverage in LAD
Mutsunori Yagiura (Kyoto University, Japan; yagiura@amp.i.kyoto-u.ac.jp)
Co-authors: Endre Boros and Toshihide Ibaraki
We consider data sets consisting of n-dimensional binary vectors
representing positive and negative examples for some (possibly unknown)
phenomenon. We analyse, both theoretically and experimentally, the number of
patterns with large coverage and show that it is quite small if the number of
examples in the given data set is sufficiently large.
Handling categorical variables in machine learning
Endre Boros (Rutgers University, New Brunswick, NJ; boros@rutcor.rutgers.edu)
Co-author: Vladimir Menkov
Most rule based learning algorithms use implicitly or explicitly some
binarization of the non-binary attributes (e.g. generate rules in which
propositions of the type "if temperature is greater than 99F then ..." are used
in the generated classifier.) However, such propositions are typically not easy
to generate when categorical attributes are present which have several different
but not comparable values (e.g. "green", "red", "blue", etc.) In principle, the
memebership in any subset of these values could be a binary indicator variable,
however the number of those can be too large to handle by the learning
algorithm.
In this talk we consider several measures to evaluate some binary indicator
variables generated from input attributes with many catagorical values, and show
that finding the ``best'' such binary indicator variable according to these
measures is always a very special binary optimization problem. These problems
turns out to be NP-hard, but one which still has a fully polynomial
approximation scheme. We develop an approximation algorithm and demonstrate its
effectiveness on several real life data sets.
Consensus-Type Algorithm for Spanned Pattern Generation
Gabriela Alexe, RUTCOR, Rutgers University; alexe@rutcor.rutgers.edu
Co-author: Peter L. Hammer
Given two disjoint sets $\Omega ^{+}$ and $\Omega ^{-},$ consisting of
positive and negative points of $\mathbf{R}^{n}$ respectively, a pure
positive (negative) \textit{pattern} is an interval of $\mathbf{R}^{n}$
which contains a nonempty subset of $\Omega ^{+}$ ($\Omega ^{-}$
respectively), and which is disjoint from $\Omega ^{-}$ ($\Omega ^{+}$
respectively$).$ A pure positive pattern $P$ is \textit{spanned }if\textit{\
}it has no proper subinterval $P^{\prime }$ having the same intersection
with $\Omega ^{+}$ as $P.$ An incremental polynomial method, resembling the
Black- Quine consensus method of Boolean algebra, is proposed for finding
all spanned pure positive patterns. The efficiency of the proposed method is
evaluated on various publicly available data sets.
Enumeration of Patterns in Discrete Datasets
Sorin Alexe, RUTCOR, Rutgers University; salexe@rutcor.rutgers.edu
Co-author: Peter L. Hammer
Sets of "positive" and "negative" points (observations) in n-dimensional
discrete space given along with their non-negative integer multiplicities are
analyzed from the perspective of the Logical Analysis of Data (LAD). A set of
observations satisfying upper and/or lower bounds imposed on certain components
is called a positive pattern if it contains some positive observations and no
negative one. The number of variables on which such restrictions are imposed is
called the degree of the pattern. A total polynomial algorithm is proposed for
the enumeration of all patterns of limited degree, and special efficient
variants of it for the enumeration of all patterns with certain "sign" and
"coverage" requirements are presented and evaluated on a publicly available
collection of benchmark datasets
Strong Pattern Generation and Applications to the Logical Analysis of Numerical Data
Ying Liu, RUTCOR, Rutgers University; yingliu@rutcor.rutgers.edu
Co-author: Jonathan Eckstein, Peter L. Hammer and Mikhail Nediak
Given a data set consisting of ``positive'' and ``negative'' points in
${\mathbb{R}}^n,$ a branch-and-bound method is proposed for identifying an {\it
interval} (strong pattern) in ${\mathbb{R}}^n$ which contains a maximum weight
subset of positive (negative) points, without containing any negative
(positive) points. We define a {\it model} as a collection of such intervals
(obtained by certain variations of the weights) whose union covers the given set
of points. Various types of models obtained in this way are examined, and their
classification potential is evaluated on a series of standard data sets.