String matching (SUMMARY)

Thread View: comp.theory

2 messages

[T] Back to Threads [V] Tree View [G] Group Articles

2 total messages Started by graham@sees.bang Mon, 08 Jun 1992 07:55

#4012

Author: graham@sees.bang
Date: Mon, 08 Jun 1992 07:55

483 lines
19740 bytes

Here is a paraphrased summary of the replies that I received concerning
my recent query about a string searching problem

     *  *  *

0.  INTRODUCTION
================

|  Problem:
|  ========
|  
|  Given a finite alphabet, A, define the set of _words_, W, as the
|  positive closure of A, i.e. A+ (the set of all words of one or more 
|  symbols from A).
|  
|  Define a _sentence_ as a space-separated sequence of one or more words
|  from W. Let S be the set of all sentences.
|  
|  Given sentence x, belonging to S, identify any recurring subsequences
|  and their locations within x. A subsequence of x is a sentence derived 
|  from x by eliminating zero or more of its component words.
|  
|  Finding the solution seems to be of exponential time complexity, as a
|  sentence of n words will have 2^n subsequences (including the empty
|  sentence), each of which would be involved in at least one comparison.
|  Are there any algorithms yielding partial solutions with more
|  favourable complexities?

The statement of the problem may have been a little vague, leading to
some confusion (over the distinction between subsequences and substrings,
for example). The problem is essentially that of identifying which
subsequences from a string (`sentence') of symbols (`words') from an
infinite alphabet (W) occur more than once in the string.

Although not explicitly stated, it was intended that a given symbol from 
the string could belong only to a single occurrence of a particular 
subsequence. For example, in the following sequence

     A  A  B

A  B  would not be considered as a recurring subsequence, since the same
`B' would belong to both occurrences.


1.  SUBSEQUENCE DECISION PROCEDURES
===================================

Frantisek Mraz provided pseudo-code for a decision procedure to determine
whether or not a given string is a subsequence of another. Ulrich Graef 
also provided an idea for a decision procedure that involved sorting the
component words of two sentences into lexicographic order and then comparing
the sorted sentences.


2.  SUBSTRINGS
==============

Claude Le Pape clarified the distinction between subsequences and 
substrings :

     Given a set, W, of atomic words and x, y members of W*, then

          x is a _subsequence_ of y iff
               x = x_1 x_2 ... x_n
               y = y_0 x_1 y_1 x_2 y_2 ... x_n y_n
               where x_i is a member of W for each i, and y_j is a
               member of W* for each j

          x is a _substring_ of y iff
               y = u x v
               where u and v are members of W*

The distinction between subsequences and substrings was also mentioned
by Dan Hirschberg.

Mark William Hopkins initially provided a method for detecting maximal
recurring _substrings_, with a complexity of O(n^2). The method is as
follows. For a sequence, x, of length n, form an n x n array. Working in
the area above the main diagonal, enter a `1' at position [i,j] iff
x[i] = x[j], otherwise enter a `0'. Maximal recurring substrings will 
then appear in the array as diagonals of `1's. For example, consider
the sequence  A  B  C  X  B  C  Y  A  B  C  Z

        A  B  C  X  B  C  Y  A  B  C  Z
     ----------------------------------
     A  .  0  0  0  0  0  0  1  0  0  0
     
     B  .  .  0  0  1  0  0  0  1  0  0

     C  .  .  .  0  0  1  0  0  0  1  0

     X  .  .  .  .  0  0  0  0  0  0  0

     B  .  .  .  .  .  0  0  0  1  0  0
    
     C  .  .  .  .  .  .  0  0  0  1  0

     Y  .  .  .  .  .  .  .  0  0  0  0

     A  .  .  .  .  .  .  .  .  0  0  0

     B  .  .  .  .  .  .  .  .  .  0  0

     C  .  .  .  .  .  .  .  .  .  .  0

     Z  .  .  .  .  .  .  .  .  .  .  .


The fact that the total number of _substrings_ of a sequence is O(n^2),
as opposed to the exponential total number of _subsequences_, can be
shown as follows. There are (n - k + 1) substrings of length k from a 
sequence of length n. The total number of substrings is therefore given by:

      n                    n
     ---                  ---
     \                    \
      )  (n - k + 1)   =   )  k   =   (n + 1)n/2   =   (n^2 + n)/2
     /                    /
     ---                  ---
     k=1                  k=1

The polynomial bound on the number of substrings was mentioned by Jeff Horn,
Claude Le Pape and Samuel S Paik. Alexander Chislenko also estimated that,
in the worst case, finding all the recurring substrings would be O(n^3).


3.  COMPLEXITY OF THE SUBSEQUENCE PROBLEM
=========================================

In the worst case, there will be an exponential number of recurring
subsequences of a sequence. Consider, for example, the sequence of length
2n obtained by concatenating two identical copies of a sequence of n
distinct symbols. This will then have 2^n - 1 recurring subsequences
(neglecting the empty string). The exponential bound mentioned in the 
original formulation of the problem was reiterated by Ulrich Graef,
Dan Hirschberg, Claude Le Pape and Jan Sochor.

In the previous example, however, there is only one maximal recurring
subsequence, namely the original sequence that is repeated to form the
one of length 2n. Robert B Israel proposed the following example for the
worst case number of maximal recurring subsequences, but pointed out that
the subsequences in question can in some cases be less than maximal.

Consider the sequence of length 8n formed by repeating (A  B  C  D) 
n times, followed by (B  A  D  C) repeated n times, e.g. for n = 2 :

     A  B  C  D  A  B  C  D  B  A  D  C  B  A  D  C

There will be 2^(2n) recurring subsequences of the form (A|B)(C|D)
repeated n times, e.g. for n = 2 :  (A|B)(C|D)(A|B)(C|D), with one
occurrence in each half of the sequence. In general, however, these
subsequences may be less than maximal by one or two symbols at the 
beginning and end. For example, the subsequence  B  D  B  D  from 
the sequence

     A  B  C  D  A  B  C  D  B  A  D  C  B  A  D  C
        1     1     1     1  2     2     2     2

is not maximal since it is part of the longer recurring subsequence
A  B  D  B  D  C, as shown below.

    A  B  C  D  A  B  C  D  B  A  D  C  B  A  D  C
   (1) 1     1 (2) 1     1  2     2 (1) 2     2 (2)

Subsequences of this form may also be extended by placing the initial
symbols of both occurrences in the first half of the sequence. The
subsequences are, however, still maximal (neglecting the previously
mentioned exceptions) for occurrences in distinct halves of the sequence.

However, there still remains the question of what the maximum number of 
disjoint maximal recurring subsequences from a given length of sequence 
(of symbols from an infinte alphabet) is.


4.  MAXIMAL RECURRING SUBSEQUENCES
==================================

Dan Hirschberg suggested the following method for determining maximal
recurring subsequences (i.e. Longest Repeated Subsequences [LRS]) of a
sequence, x, of length n. If an LRS has its second occurrence starting
at index m, then the LRS may be obtained by calculating 
LCS(x[1:m-1], x[m:n]), where LCS is the Longest Common Subsequence (for
which there are quadratic algorithms). A straightforward, but not
necessarily the most efficient, approach would therefore be to calculate
LCS(x[1:i-1], x[i:n]) for i = 2 .. n. This process would take time O(n^3).

Unfortunately, this method assumes that the distinct occurrences of a 
subsequence are non-overlapping. For example, the LRS  A  B  C from the
sequence  A  A  B  B  C  C  could not be found in this way. This 
deficiency was pointed out by Phong Vo, who suggested the following approach.

Consider the component symbols of sequence x. For each one that occurs at
least twice, let its two least indices be i and j (where i < j). Now let
x1 be the suffix x[i:n] with x[j] eliminated, and x2 be the suffix x[j:n].
Computing LCS(x1, x2) will now yield an LRS with respect to the given
initial symbol. This can then be performed for each repeated symbol. Phong
Vo also suggested an alternative method involving the Heaviest Common
Subsequence (HCS) algorithm and provided a reference for the HCS algorithm
(Jacobson and Vo, 1992).  The method involves computing HCS(x[i:n], x[j:n]), 
giving the match x[i], x[j] infinite weight and all others unity weight.

This method, however, will yield non-disjoint instances of a subsequence.
Returning to the previous example of  A  A  B  B  C  C, executing this
procedure on the `A's would yield the following LRS:  A  B  B  C  C,
rather than the  A  B  C  when only disjoint occurrences are permitted.

Mark William Hopkins proposed a graphical method which also yielded 
non-disjoint subsequence occurrences. The procedure is described below.

For the sequence, x, of length n, construct an n x n array, A. Fill the
array with `0's and `1's according to the following rule:

     A[i,j] = 1   if x[i] = x[j]
            = 0   otherwise

Draw arcs between the `1's at positions [i1,j1] and [i2,j2] iff i1 < i2
and j1 < j2, i.e. draw arrows from each `1' to all the others that lie
below and to the right. This gives an irreflexive partial ordering of
the `1's. The next stage is to generate the minimal relation from this by
removing all the `redundant' arcs [i1,j1] -> [i2,j2] for which there exist
arcs

     [i1,j1] -> [i,j]  and  [i,j] -> [i2,j2]

Consider, for example, the sequence  A  B  C  X  A  B  Y  C. The resulting
graph for this is:


        A        B        C        X        A        B        Y        C
        ----------------------------------------------------------------
    A   *                                   *
          `      `                            `
         |  `2,10          ` 3               |  `4,8
              `                      `            `
    B           `*                             `    `*
          |        `         `       2        |      |   `
                     `                   `                   ` 3,4
                  |    `                             `           `
    C      |             `*                    |                 `   ` *
                            `                          |           
            |      |          `                 |
                                `
    X        9                    `*             7      |
                   10                `
                                       `                 |
              |                          `        |
    A   *            |                     `*      
          ` `                           `     `           8
         1,5`   |           ` 6                 `   |
              `       |               `           `        |
    B           `*                              `   `*
                           `                           `
                   \   |             ` 5              6,7`   |
                                               `           `
    Y                \                                   `   `*
                  1,9                                           ` 5,6,7,8
                       \ |                                        `
                                                                    `
    C                     *                                           `*


Disregarding the main diagonal, the horizontal projection of the maximal
paths in the graph yield the following maximal recurring subsequences:

     1,9,10   A  B  C  .  .  .  .  .
     2        A  B  .  .  .  .  .  C
     3        A  .  .  .  .  B  .  C
     4        .  .  .  .  A  B  .  C
     5        A  B  .  .  .  .  Y  C
     6        A  .  .  .  .  B  Y  C
     7,8      .  .  .  .  A  B  Y  C

This clearly shows the non-disjoint nature of the resulting subsequences.
Note that drawing the graph should be O(n^4), since the n x n array has
n^2 nodes and the maximum number of arcs in a graph with m nodes is given by:

     m-1
     ---
     \
      )  i   =   m(m - 1)/2   =  (m^2 - m)/2
     /
     ---
     i=1

There is also, however, the effort involved in searching the resulting 
graph for maximal paths to be taken into account.



5.  RELATED WORK
================

John Scott suggested that alphabet problems such as this one are
related to Thue's theorem, and provided the Salomaa references
(Salomaa, 1973; Salomaa, 1981). John Ypsilantis stated that a major
part of his doctoral research involved the machine learning of sequences.
The particular application involved the detection of recurring
subsequences in streams of symbols, given a normally large number of
training examples. He also provided a reference to this work (Ypsilantis
and Yee, 1992).


6.  REFERENCES
==============

Various contributors (attributions given by the initials between the
{}'s) provided the following relevant references.

AHO A.V., HIRSCHBERG D.S., ULLMAN J.D. [1976] "Bounds on the complexity 
     of the longest common subsequence problem," Journal of the ACM, 
     Vol. 23, No. 1, p. 1-12, January 1976.  {JH}

AHO A.V., HOPCROFT, ULLMAN J.D. [1983] "Data structures and algorithms,"
     Addison-Wesley.  {SP}

AHO A.V. [1990] "Algorithms for finding patterns in strings," in J. van
     Leeuwen (ed.), Handbook of Theoretical Computer Science, p. 255-300,
     Elsevier, Amsterdam.  {AB}

ALTSCHUL S.F., GISH W., MILLER W., MYERS E.W., LIPMAN D.J. [1990]
    "Basic local alignment search tool," Journal of Molecular Biology, 
     Vol. 215, p. 403-10.  {KR}
     
APOSTOLICA A., GUERRA C. [1987] "The longest common subsequence problem
     revisited," Algorithmica, Vol. 2, p. 315-37.  {PS}

ARRATIA R., GORDON L., WATERMAN M.S. [1986] "An extreme value theory for
     sequence matching," Annals of Statistics, Vol. 14, No. 3, 
     p. 971-93.  {PS}

ARRATIA R., WATERMAN M.S. [1985] "An Erdos-Renyi law with shifts,"
     Advances in Mathematics, Vol. 55, p. 13-23.  {PS}

ARRATIA R., WATERMAN M.S. [1985] "Critical phenomena in sequence
     matching," Annals of Probabilty, Vol. 13, No. 4, p. 1236-49.  {PS}

BAEZA-YATES R. [1991] "Searching subsequences," Theoretical Computer
     Science, Vol. 78, p. 363-76.  {RB}

HIRSCHBERG D.S. [1975] "A linear space algorithm for computing 
     maximal common subsequences," Communications of the ACM, Vol. 18,
     No. 6, p. 341-3, June 1975.  {KP}

HUNT J.W., SZYMANSKI T.G. [1977] "A fast algorithm for computing
     longest common subsequences," Communications of the ACM, Vol. 20,
     No. 5, p. 350-3, May 1977.  {JH, PS}

JACOBSON G., VO K-P. [1992] "Heaviest increasing/common subsequence
     problems," Proceedings of the Combinatorial Matching Conference,
     Tucson, Arizona, April 1992.  {PV}

LANDAU G.M., VISHKIN U. [1989] "Fast parallel and serial approximate
     string matching," Journal of Algorithms, Vol. 10, p. 157-69.  {UV}

MAIER D. [1978] "The complexity of some problems on subsequences and
     supersequences," Journal of the ACM, Vol. 25, p. 322-36.  {JH}

OOMMEN B.J. [1987] "Recognition of noisy subsequences using constrained
     edit distances," IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-9, No. 5, p. 676-85, September 1987.  {BP}

PIRKLBAUER K. [1992] "A study of pattern matching algorithms,"
     Structured Programming, Vol. 13, p. 89-98.  {JW}

SALOMAA A. [1973] "On sentential forms of context-free grammars,"
     Acta Informatica, Vol. 2, p. 40-9.  {JS}

SALOMAA A. [1981] "Jewels of formal language theory," Computer Science
     Press, ISBN 0-914894-69-2.  {JS}

SANKOFF D., KRUSKAL J.B. (eds.) [1983] "Time warps, string edits and
     macromolecules: the theory and practice of sequence comparison,"
     Addison-Wesley, Reading, MA.  {AB, JHn}

SIBBALD P.R., WHITE M.J. [1987] "How probable are antibody cross-
     reactions?", Journal of Theoretical Biology, Vol. 127, 
     p. 163-9.  {JS}

SUNDAY D.M. [1990] "A very fast substring search algorithm,"
     Communications of the ACM, Vol. 33, No. 8, p. 132-42, 
     August 1990.  {PF}

WANG Y.P., PAVLIDIS T. [1990] "Optimal correspondence of string
     subsequences," IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. 12, No. 11, p. 1080-7, November 1990,  {BP}

YPSILANTIS J., YEE H. [1992] "Machine learning of sequences-of-events:
     applications to power distribution," Electric Power Systems Research,
     Vol. 23, No. 2, p. 115-22.  {JY}


7.  CONTRIBUTORS
================

Thanks to the following contributors:

     Ricardo Baeza-Yates    (rbaeza@tortel.dcc.uchile.cl) {RB}
                            Depto. de Ciencias de la Computacion, 
                            Universidad de Chile, Blanco  Encalada 2120, 
                            Santiago, Chile

     Ashim Bose             (bose@cs.umn.edu) {AB}
                            CSci dept., University of Minnesota, Minneapolis

     Alexander Chislenko    (sasha@cs.umb.edu)

     Pat Farrell            (pfarrell@gmuvax2.gmu.edu) {PF}
                            Dept of Computer Science, George Mason University, 
                            Fairfax, VA

     Ulrich Graef           (graef@rs3.hrz.th-darmstadt.de)
                            Lichtenbergweg 11,  D-W-6103 Griesheim, Germany

     Dan Hirschberg         (dan@venus.ics.uci.edu)
                            Dept of ICS, UC Irvine

     Jim Holloway           (holloway@cgrb.orst.edu) {JH}

     Mark William Hopkins   (markh@csd4.csd.uwm.edu)
                            Computing Services Division, 
                            University of Wisconsin - Milwaukee

     Jeff Horn              (horn%agrimad@uunet.UU.NET) {JHn}
                            Strategic Research Labs, Agrigenetics Company

     Robert B. Israel       (israel@unixg.ubc.ca)
                            Dept of Mathematics, 
                            University of British Columbia, Vancouver,  
                            BC, Canada V6T 1Y4

     Frantisek Mraz         (MRAZF@CSPGUK11.EARN)
                            Dept of Computer Science, Charles University,
                            Prague, Czechoslovakia

     Samuel S. Paik         (d65y@crux1.cit.cornell.edu) {SP}
                            XYNE Knowledge Structures

     Claude Le Pape         (lepape@ilog.fr)
                            ILOG SA, 2, Avenue Gallieni, BP 85, 
                            F-94253 Gentilly Cedex, FRANCE

     Kunsoo Park            (UDAC080@OAK.CC.KCL.AC.UK) {KP}
                            King's College London

     Bahram Parvin          (parvin@george.lbl.gov) {BP}

     Keith Robison          (robison@ribo.harvard.edu) {KR}
                            Harvard University

     John Scott             (jmscott@mepsi.mobil.com) {JS}

     Peter Sibbald          (SIBBALD@EMBL-Heidelberg.DE) {PS}
                            European Molecular Biology Laboratory, Heidelberg

     Jan Sochor             (sochor@cspguk11.bitnet)
                            Dept of Computer Science, Charles University,
                            Prague, Czechoslovakia

     Uzi Vishkin            (vishkin@umiacs.UMD.EDU) {UV}

     Phong Vo               (kpv@ulysses.att.com)  {PV}
                            AT&T Bell Labs, 600 Mountain Ave, 
                            Murray Hill, NJ07974

     Joachim Weisbrod       (weisbrod@karlsruhe.gmd.de) {JW}

     John Ypsilantis        (johny@ee.su.OZ.AU) {JY}
                            Dept of Electrical Eng, Univ of Sydney, 
                            NSW 2006 AUSTRALIA

     *  *  *

For anyone surviving this far, any further ideas and/or comments
would be most welcome.

 

--
graham a stephen                        ``The only key to your riddle is
(e-mail: graham@sees.bangor.ac.uk)        to accept the absence of a key.''
                                               --  Laibach, 1992

Re: String matching (SUMMARY)

#4015

Author: kpv@ulysses.att.
Date: Mon, 08 Jun 1992 18:31

27 lines
1335 bytes

In article <1992Jun8.085538@sees.bangor.ac.uk> graham@sees.bangor.ac.uk (The Land of Confusion) writes:
:... The problem is essentially that of identifying which
:subsequences from a string (`sentence') of symbols (`words') from an
:infinite alphabet (W) occur more than once in the string.
:
:Although not explicitly stated, it was intended that a given symbol from
:the string could belong only to a single occurrence of a particular
:subsequence. For example, in the following sequence
:     A  A  B
:A  B  would not be considered as a recurring subsequence, since the same
:`B' would belong to both occurrences.

This actually simplifies things quite a bit.
The problem is now just an instance of the Heaviest Common Subsequence
problem where the two sequences are copies of the original sequence and
the weight of a match (s sub i, s sub j) is defined to be 0 if i == j,
and 1 otherwise. So, for the string AAB, the LRS (longest recurring
sequence) is A. On the other hand, for AABB, the LRS will be of length 2
corresponding to AB and there 4 possibilities for constructing the
reoccurrence of AB as below:
   A A B B    A A B B    A A B B    A A B B
    \   \      \   /      /   \      /   /
   A A B B    A A B B    A A B B    A A B B

Phong Vo, kpv@ulysses.att.com
AT&T Bell Labs, 600 Mountain Ave, Murray Hill, NJ07974

Thread Navigation

This is a paginated view of messages in the thread with full content displayed inline.

Messages are displayed in chronological order, with the original post highlighted in green.

Use pagination controls to navigate through all messages in large threads.

Back to All Threads