Thread View: comp.theory
2 messages
2 total messages
Started by graham@sees.bang
Mon, 08 Jun 1992 07:55
String matching (SUMMARY)
Author: graham@sees.bang
Date: Mon, 08 Jun 1992 07:55
Date: Mon, 08 Jun 1992 07:55
483 lines
19740 bytes
19740 bytes
Here is a paraphrased summary of the replies that I received concerning my recent query about a string searching problem * * * 0. INTRODUCTION ================ | Problem: | ======== | | Given a finite alphabet, A, define the set of _words_, W, as the | positive closure of A, i.e. A+ (the set of all words of one or more | symbols from A). | | Define a _sentence_ as a space-separated sequence of one or more words | from W. Let S be the set of all sentences. | | Given sentence x, belonging to S, identify any recurring subsequences | and their locations within x. A subsequence of x is a sentence derived | from x by eliminating zero or more of its component words. | | Finding the solution seems to be of exponential time complexity, as a | sentence of n words will have 2^n subsequences (including the empty | sentence), each of which would be involved in at least one comparison. | Are there any algorithms yielding partial solutions with more | favourable complexities? The statement of the problem may have been a little vague, leading to some confusion (over the distinction between subsequences and substrings, for example). The problem is essentially that of identifying which subsequences from a string (`sentence') of symbols (`words') from an infinite alphabet (W) occur more than once in the string. Although not explicitly stated, it was intended that a given symbol from the string could belong only to a single occurrence of a particular subsequence. For example, in the following sequence A A B A B would not be considered as a recurring subsequence, since the same `B' would belong to both occurrences. 1. SUBSEQUENCE DECISION PROCEDURES =================================== Frantisek Mraz provided pseudo-code for a decision procedure to determine whether or not a given string is a subsequence of another. Ulrich Graef also provided an idea for a decision procedure that involved sorting the component words of two sentences into lexicographic order and then comparing the sorted sentences. 2. SUBSTRINGS ============== Claude Le Pape clarified the distinction between subsequences and substrings : Given a set, W, of atomic words and x, y members of W*, then x is a _subsequence_ of y iff x = x_1 x_2 ... x_n y = y_0 x_1 y_1 x_2 y_2 ... x_n y_n where x_i is a member of W for each i, and y_j is a member of W* for each j x is a _substring_ of y iff y = u x v where u and v are members of W* The distinction between subsequences and substrings was also mentioned by Dan Hirschberg. Mark William Hopkins initially provided a method for detecting maximal recurring _substrings_, with a complexity of O(n^2). The method is as follows. For a sequence, x, of length n, form an n x n array. Working in the area above the main diagonal, enter a `1' at position [i,j] iff x[i] = x[j], otherwise enter a `0'. Maximal recurring substrings will then appear in the array as diagonals of `1's. For example, consider the sequence A B C X B C Y A B C Z A B C X B C Y A B C Z ---------------------------------- A . 0 0 0 0 0 0 1 0 0 0 B . . 0 0 1 0 0 0 1 0 0 C . . . 0 0 1 0 0 0 1 0 X . . . . 0 0 0 0 0 0 0 B . . . . . 0 0 0 1 0 0 C . . . . . . 0 0 0 1 0 Y . . . . . . . 0 0 0 0 A . . . . . . . . 0 0 0 B . . . . . . . . . 0 0 C . . . . . . . . . . 0 Z . . . . . . . . . . . The fact that the total number of _substrings_ of a sequence is O(n^2), as opposed to the exponential total number of _subsequences_, can be shown as follows. There are (n - k + 1) substrings of length k from a sequence of length n. The total number of substrings is therefore given by: n n --- --- \ \ ) (n - k + 1) = ) k = (n + 1)n/2 = (n^2 + n)/2 / / --- --- k=1 k=1 The polynomial bound on the number of substrings was mentioned by Jeff Horn, Claude Le Pape and Samuel S Paik. Alexander Chislenko also estimated that, in the worst case, finding all the recurring substrings would be O(n^3). 3. COMPLEXITY OF THE SUBSEQUENCE PROBLEM ========================================= In the worst case, there will be an exponential number of recurring subsequences of a sequence. Consider, for example, the sequence of length 2n obtained by concatenating two identical copies of a sequence of n distinct symbols. This will then have 2^n - 1 recurring subsequences (neglecting the empty string). The exponential bound mentioned in the original formulation of the problem was reiterated by Ulrich Graef, Dan Hirschberg, Claude Le Pape and Jan Sochor. In the previous example, however, there is only one maximal recurring subsequence, namely the original sequence that is repeated to form the one of length 2n. Robert B Israel proposed the following example for the worst case number of maximal recurring subsequences, but pointed out that the subsequences in question can in some cases be less than maximal. Consider the sequence of length 8n formed by repeating (A B C D) n times, followed by (B A D C) repeated n times, e.g. for n = 2 : A B C D A B C D B A D C B A D C There will be 2^(2n) recurring subsequences of the form (A|B)(C|D) repeated n times, e.g. for n = 2 : (A|B)(C|D)(A|B)(C|D), with one occurrence in each half of the sequence. In general, however, these subsequences may be less than maximal by one or two symbols at the beginning and end. For example, the subsequence B D B D from the sequence A B C D A B C D B A D C B A D C 1 1 1 1 2 2 2 2 is not maximal since it is part of the longer recurring subsequence A B D B D C, as shown below. A B C D A B C D B A D C B A D C (1) 1 1 (2) 1 1 2 2 (1) 2 2 (2) Subsequences of this form may also be extended by placing the initial symbols of both occurrences in the first half of the sequence. The subsequences are, however, still maximal (neglecting the previously mentioned exceptions) for occurrences in distinct halves of the sequence. However, there still remains the question of what the maximum number of disjoint maximal recurring subsequences from a given length of sequence (of symbols from an infinte alphabet) is. 4. MAXIMAL RECURRING SUBSEQUENCES ================================== Dan Hirschberg suggested the following method for determining maximal recurring subsequences (i.e. Longest Repeated Subsequences [LRS]) of a sequence, x, of length n. If an LRS has its second occurrence starting at index m, then the LRS may be obtained by calculating LCS(x[1:m-1], x[m:n]), where LCS is the Longest Common Subsequence (for which there are quadratic algorithms). A straightforward, but not necessarily the most efficient, approach would therefore be to calculate LCS(x[1:i-1], x[i:n]) for i = 2 .. n. This process would take time O(n^3). Unfortunately, this method assumes that the distinct occurrences of a subsequence are non-overlapping. For example, the LRS A B C from the sequence A A B B C C could not be found in this way. This deficiency was pointed out by Phong Vo, who suggested the following approach. Consider the component symbols of sequence x. For each one that occurs at least twice, let its two least indices be i and j (where i < j). Now let x1 be the suffix x[i:n] with x[j] eliminated, and x2 be the suffix x[j:n]. Computing LCS(x1, x2) will now yield an LRS with respect to the given initial symbol. This can then be performed for each repeated symbol. Phong Vo also suggested an alternative method involving the Heaviest Common Subsequence (HCS) algorithm and provided a reference for the HCS algorithm (Jacobson and Vo, 1992). The method involves computing HCS(x[i:n], x[j:n]), giving the match x[i], x[j] infinite weight and all others unity weight. This method, however, will yield non-disjoint instances of a subsequence. Returning to the previous example of A A B B C C, executing this procedure on the `A's would yield the following LRS: A B B C C, rather than the A B C when only disjoint occurrences are permitted. Mark William Hopkins proposed a graphical method which also yielded non-disjoint subsequence occurrences. The procedure is described below. For the sequence, x, of length n, construct an n x n array, A. Fill the array with `0's and `1's according to the following rule: A[i,j] = 1 if x[i] = x[j] = 0 otherwise Draw arcs between the `1's at positions [i1,j1] and [i2,j2] iff i1 < i2 and j1 < j2, i.e. draw arrows from each `1' to all the others that lie below and to the right. This gives an irreflexive partial ordering of the `1's. The next stage is to generate the minimal relation from this by removing all the `redundant' arcs [i1,j1] -> [i2,j2] for which there exist arcs [i1,j1] -> [i,j] and [i,j] -> [i2,j2] Consider, for example, the sequence A B C X A B Y C. The resulting graph for this is: A B C X A B Y C ---------------------------------------------------------------- A * * ` ` ` | `2,10 ` 3 | `4,8 ` ` ` B `* ` `* | ` ` 2 | | ` ` ` ` 3,4 | ` ` ` C | `* | ` ` * ` | | | ` | ` X 9 `* 7 | 10 ` ` | | ` | A * | `* ` ` ` ` 8 1,5` | ` 6 ` | ` | ` ` | B `* ` `* ` ` \ | ` 5 6,7` | ` ` Y \ ` `* 1,9 ` 5,6,7,8 \ | ` ` C * `* Disregarding the main diagonal, the horizontal projection of the maximal paths in the graph yield the following maximal recurring subsequences: 1,9,10 A B C . . . . . 2 A B . . . . . C 3 A . . . . B . C 4 . . . . A B . C 5 A B . . . . Y C 6 A . . . . B Y C 7,8 . . . . A B Y C This clearly shows the non-disjoint nature of the resulting subsequences. Note that drawing the graph should be O(n^4), since the n x n array has n^2 nodes and the maximum number of arcs in a graph with m nodes is given by: m-1 --- \ ) i = m(m - 1)/2 = (m^2 - m)/2 / --- i=1 There is also, however, the effort involved in searching the resulting graph for maximal paths to be taken into account. 5. RELATED WORK ================ John Scott suggested that alphabet problems such as this one are related to Thue's theorem, and provided the Salomaa references (Salomaa, 1973; Salomaa, 1981). John Ypsilantis stated that a major part of his doctoral research involved the machine learning of sequences. The particular application involved the detection of recurring subsequences in streams of symbols, given a normally large number of training examples. He also provided a reference to this work (Ypsilantis and Yee, 1992). 6. REFERENCES ============== Various contributors (attributions given by the initials between the {}'s) provided the following relevant references. AHO A.V., HIRSCHBERG D.S., ULLMAN J.D. [1976] "Bounds on the complexity of the longest common subsequence problem," Journal of the ACM, Vol. 23, No. 1, p. 1-12, January 1976. {JH} AHO A.V., HOPCROFT, ULLMAN J.D. [1983] "Data structures and algorithms," Addison-Wesley. {SP} AHO A.V. [1990] "Algorithms for finding patterns in strings," in J. van Leeuwen (ed.), Handbook of Theoretical Computer Science, p. 255-300, Elsevier, Amsterdam. {AB} ALTSCHUL S.F., GISH W., MILLER W., MYERS E.W., LIPMAN D.J. [1990] "Basic local alignment search tool," Journal of Molecular Biology, Vol. 215, p. 403-10. {KR} APOSTOLICA A., GUERRA C. [1987] "The longest common subsequence problem revisited," Algorithmica, Vol. 2, p. 315-37. {PS} ARRATIA R., GORDON L., WATERMAN M.S. [1986] "An extreme value theory for sequence matching," Annals of Statistics, Vol. 14, No. 3, p. 971-93. {PS} ARRATIA R., WATERMAN M.S. [1985] "An Erdos-Renyi law with shifts," Advances in Mathematics, Vol. 55, p. 13-23. {PS} ARRATIA R., WATERMAN M.S. [1985] "Critical phenomena in sequence matching," Annals of Probabilty, Vol. 13, No. 4, p. 1236-49. {PS} BAEZA-YATES R. [1991] "Searching subsequences," Theoretical Computer Science, Vol. 78, p. 363-76. {RB} HIRSCHBERG D.S. [1975] "A linear space algorithm for computing maximal common subsequences," Communications of the ACM, Vol. 18, No. 6, p. 341-3, June 1975. {KP} HUNT J.W., SZYMANSKI T.G. [1977] "A fast algorithm for computing longest common subsequences," Communications of the ACM, Vol. 20, No. 5, p. 350-3, May 1977. {JH, PS} JACOBSON G., VO K-P. [1992] "Heaviest increasing/common subsequence problems," Proceedings of the Combinatorial Matching Conference, Tucson, Arizona, April 1992. {PV} LANDAU G.M., VISHKIN U. [1989] "Fast parallel and serial approximate string matching," Journal of Algorithms, Vol. 10, p. 157-69. {UV} MAIER D. [1978] "The complexity of some problems on subsequences and supersequences," Journal of the ACM, Vol. 25, p. 322-36. {JH} OOMMEN B.J. [1987] "Recognition of noisy subsequences using constrained edit distances," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 5, p. 676-85, September 1987. {BP} PIRKLBAUER K. [1992] "A study of pattern matching algorithms," Structured Programming, Vol. 13, p. 89-98. {JW} SALOMAA A. [1973] "On sentential forms of context-free grammars," Acta Informatica, Vol. 2, p. 40-9. {JS} SALOMAA A. [1981] "Jewels of formal language theory," Computer Science Press, ISBN 0-914894-69-2. {JS} SANKOFF D., KRUSKAL J.B. (eds.) [1983] "Time warps, string edits and macromolecules: the theory and practice of sequence comparison," Addison-Wesley, Reading, MA. {AB, JHn} SIBBALD P.R., WHITE M.J. [1987] "How probable are antibody cross- reactions?", Journal of Theoretical Biology, Vol. 127, p. 163-9. {JS} SUNDAY D.M. [1990] "A very fast substring search algorithm," Communications of the ACM, Vol. 33, No. 8, p. 132-42, August 1990. {PF} WANG Y.P., PAVLIDIS T. [1990] "Optimal correspondence of string subsequences," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 11, p. 1080-7, November 1990, {BP} YPSILANTIS J., YEE H. [1992] "Machine learning of sequences-of-events: applications to power distribution," Electric Power Systems Research, Vol. 23, No. 2, p. 115-22. {JY} 7. CONTRIBUTORS ================ Thanks to the following contributors: Ricardo Baeza-Yates (rbaeza@tortel.dcc.uchile.cl) {RB} Depto. de Ciencias de la Computacion, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile Ashim Bose (bose@cs.umn.edu) {AB} CSci dept., University of Minnesota, Minneapolis Alexander Chislenko (sasha@cs.umb.edu) Pat Farrell (pfarrell@gmuvax2.gmu.edu) {PF} Dept of Computer Science, George Mason University, Fairfax, VA Ulrich Graef (graef@rs3.hrz.th-darmstadt.de) Lichtenbergweg 11, D-W-6103 Griesheim, Germany Dan Hirschberg (dan@venus.ics.uci.edu) Dept of ICS, UC Irvine Jim Holloway (holloway@cgrb.orst.edu) {JH} Mark William Hopkins (markh@csd4.csd.uwm.edu) Computing Services Division, University of Wisconsin - Milwaukee Jeff Horn (horn%agrimad@uunet.UU.NET) {JHn} Strategic Research Labs, Agrigenetics Company Robert B. Israel (israel@unixg.ubc.ca) Dept of Mathematics, University of British Columbia, Vancouver, BC, Canada V6T 1Y4 Frantisek Mraz (MRAZF@CSPGUK11.EARN) Dept of Computer Science, Charles University, Prague, Czechoslovakia Samuel S. Paik (d65y@crux1.cit.cornell.edu) {SP} XYNE Knowledge Structures Claude Le Pape (lepape@ilog.fr) ILOG SA, 2, Avenue Gallieni, BP 85, F-94253 Gentilly Cedex, FRANCE Kunsoo Park (UDAC080@OAK.CC.KCL.AC.UK) {KP} King's College London Bahram Parvin (parvin@george.lbl.gov) {BP} Keith Robison (robison@ribo.harvard.edu) {KR} Harvard University John Scott (jmscott@mepsi.mobil.com) {JS} Peter Sibbald (SIBBALD@EMBL-Heidelberg.DE) {PS} European Molecular Biology Laboratory, Heidelberg Jan Sochor (sochor@cspguk11.bitnet) Dept of Computer Science, Charles University, Prague, Czechoslovakia Uzi Vishkin (vishkin@umiacs.UMD.EDU) {UV} Phong Vo (kpv@ulysses.att.com) {PV} AT&T Bell Labs, 600 Mountain Ave, Murray Hill, NJ07974 Joachim Weisbrod (weisbrod@karlsruhe.gmd.de) {JW} John Ypsilantis (johny@ee.su.OZ.AU) {JY} Dept of Electrical Eng, Univ of Sydney, NSW 2006 AUSTRALIA * * * For anyone surviving this far, any further ideas and/or comments would be most welcome. -- graham a stephen ``The only key to your riddle is (e-mail: graham@sees.bangor.ac.uk) to accept the absence of a key.'' -- Laibach, 1992
Re: String matching (SUMMARY)
Author: kpv@ulysses.att.
Date: Mon, 08 Jun 1992 18:31
Date: Mon, 08 Jun 1992 18:31
27 lines
1335 bytes
1335 bytes
In article <1992Jun8.085538@sees.bangor.ac.uk> graham@sees.bangor.ac.uk (The Land of Confusion) writes: :... The problem is essentially that of identifying which :subsequences from a string (`sentence') of symbols (`words') from an :infinite alphabet (W) occur more than once in the string. : :Although not explicitly stated, it was intended that a given symbol from :the string could belong only to a single occurrence of a particular :subsequence. For example, in the following sequence : A A B :A B would not be considered as a recurring subsequence, since the same :`B' would belong to both occurrences. This actually simplifies things quite a bit. The problem is now just an instance of the Heaviest Common Subsequence problem where the two sequences are copies of the original sequence and the weight of a match (s sub i, s sub j) is defined to be 0 if i == j, and 1 otherwise. So, for the string AAB, the LRS (longest recurring sequence) is A. On the other hand, for AABB, the LRS will be of length 2 corresponding to AB and there 4 possibilities for constructing the reoccurrence of AB as below: A A B B A A B B A A B B A A B B \ \ \ / / \ / / A A B B A A B B A A B B A A B B Phong Vo, kpv@ulysses.att.com AT&T Bell Labs, 600 Mountain Ave, Murray Hill, NJ07974
Thread Navigation
This is a paginated view of messages in the thread with full content displayed inline.
Messages are displayed in chronological order, with the original post highlighted in green.
Use pagination controls to navigate through all messages in large threads.
Back to All Threads