FREE ELECTRONIC LIBRARY - Abstract, dissertation, book

Pages:   || 2 |

«Abstract. A factor u of a word w is a cover of w if every position in w lies within some occurrence of u in w. A factor u is a seed of w if it is a ...»

-- [ Page 1 ] --

Efficient Algorithms

for Shortest Partial Seeds in Words

Tomasz Kociumaka1, Solon P. Pissis2, Jakub Radoszewski1,

Wojciech Rytter1,3, and Tomasz Wale´1n

Faculty of Mathematics, Informatics and Mechanics,

University of Warsaw, Warsaw, Poland


Department of Informatics, King’s College London,

London WC2R 2LS, UK


Faculty of Mathematics and Computer Science,

Copernicus University, Toru´, Poland


Abstract. A factor u of a word w is a cover of w if every position in w lies within some occurrence of u in w. A factor u is a seed of w if it is a cover of a superstring of w. Covers and seeds extend the classical notions of periodicity. We introduce a new notion of α-partial seed, that is, a factor covering as a seed at least α positions in a given word. We use the Cover Suffix Tree, introduced recently in the context of α-partial covers (Kociumaka et al, CPM 2013); an O(n log n)-time algorithm constructing such a tree is known. However it appears that partial seeds are more complicated than partial covers—our algorithms require algebraic manipulations of special functions related to edges of the modified Cover Suffix Tree and the border array. We present an algorithm for computing shortest α-partial seeds that works in O(n) time if the Cover Suffix Tree is already given.

1 Introduction Periodicity in words is a fundamental topic in combinatorics on words and string algorithms (see [5]). The concept of quasiperiodicity is a generalization of the notion of periodicity [1]. Quasiperiodicity enables detecting repetitive structure of words when it cannot be found using the classical characterizations of periods.

Several types of quasiperiods have already been introduced. It depends on the type of quasiperiod what kinds of repetitive structure it allows to detect.

The best-known type of quasiperiodicity is the cover of word. A factor u of a word w is said to be a cover of w if every position in w lies within some occurrence of u in w, we also say that w is covered by u. An extension of the notion of cover is the notion of seed, in this case the positions covered by a seed Supported by Polish budget funds for science in 2013-2017 as a research project under the ‘Diamond Grant’ program.

u are also positions within overhanging occurrences of u. More formally, u is a seed of w if w is a factor of a word y covered by u.

Several algorithms are known for computation of covers and seeds. A lineartime algorithm for computing the shortest cover of a word was proposed by Apostolico et al. [2], and a linear-time algorithm for computing all the covers was proposed by Moore & Smyth [12]. Linear-time algorithms providing yet more complete characterizations of covers by so-called cover arrays were given in [3, 11]. Seeds were first introduced by Iliopoulos, Moore, and Park [7] who presented an O(n log n)-time algorithm computing seeds. This result was improved recently by Kociumaka et al. [8] who gave a complex linear-time algorithm.

It remains unlikely that an arbitrary word has a cover or a seed shorter than the word itself. Due to this reason, relaxed variants of quasiperiodicity have been introduced. One of the ideas are approximate covers [13] and approximate seeds [4] that require each position to lie within an approximate occurrence of the corresponding quasiperiod. Another idea, introduced recently in [9], was the notion of partial cover that is required to cover a certain number of positions of a word. We extend the ideas of [9] and introduce the notion of partial seed.

Let C(u, w) denote the number of positions in w covered by (full) occurrences of the word u in w. The word u is called an α-partial cover of w if C(u, w) ≥ α.

We call a non-empty prefix of w that is also a suffix of u a left-overhanging occurrence of u in w. Symmetrically, a non-empty suffix of w which is a prefix of u is called a right-overhanging occurrence. Let S(u, w) denote the number of positions in w covered by full, left-overhanging, or right-overhanging occurrences of u in w. We call u an α-partial seed of w if S(u, w) ≥ α. If the word w is clear from the context, we use the simpler notations of C(u) and S(u).

Example 1. If w = aaaabaabaaaaaba, see also Fig.

1, then S(abaa) = 12, S(aba) = 10, S(ab) = 7, S(a) = 12.

–  –  –

Fig. 1. The positions covered by abaa as a partial seed are underlined. The word abaa is a 12-partial seed of w, it has four overhanging occurrences and two full occurrences.

Note that a is the shortest 12-partial seed of w.

We study the following two related problems.

PartialSeeds Input: a word w of length n and a positive integer α ≤ n Output: all shortest factors u of w such that S(u, w) ≥ α LimitedLengthPartialSeeds Input: a word w of length n and an interval [, r], 0 ≤ r ≤ n Output: a factor u of w, |u| ∈ [, r], which maximizes S(u, w) In [9] a data structure called the Cover Suffix Tree and denoted by CST (w) was introduced. For a word w of length n the size of CST (w) is O(n) and the construction time is O(n log n). In this article, we obtain the following results.

Theorem 2. Given CST (w), the LimitedLengthPartialSeeds problem can be solved in O(n) time.

By applying binary search, Theorem 2 implies an O(n log n)-time solution to the PartialSeeds problem. However, this solution can be improved to an O(n)time algorithm, provided that CST (w) is known.

Theorem 3. [Main result] Given CST (w), the PartialSeeds problem can be solved in O(n) time.

Structure of the paper. In Section 2 we introduce basic notation related to words and suffix trees and recall the Cover Suffix Tree. Next in Section 3 we extend CST to obtain its counterpart suitable for computation of partial seeds, which we call the Seed Suffix Tree (SST ). In Section 4 we introduce two abstract problems formulated in terms of simple functions which encapsulate the intrinsic difficulty of the two types of PartialSeeds problems. We present the solutions of the


problems in Section 5; this section is essentially the most involved part of our contribution. We summarize our results in the Conclusions section.

2 Preliminaries

Let us fix a word w of length n over a totally ordered alphabet Σ. For a factor v of w, by Occ(v) we denote the set of positions where occurrences of v in w start.

By first(v) and last(v) we denote min Occ(v) and max Occ(v), respectively.

By w[i.. j] we denote the factor starting at the position i and ending at the position j. Factors w[1.. i] are called prefixes of w, and factors w[i.. n] are called suffixes of w. Words shorter than w that are both prefixes and suffixes of w are called borders of w. By β(w) we denote the length of the longest border of w.

The border array β[1.. n] and reverse border array β R [1.. n] of w are defined as follows: β[i] = β(w[1.. i]) and β R [i] = β(w[i.. n]). The arrays β, β R can be constructed in O(n) time [6].

The suffix tree of w, denoted by ST (w), is the compacted suffix trie of w in which only branching nodes and suffixes are explicit. We identify the nodes of ST (w) with the factors of w that they represent. An augmented suffix tree may contain some additional explicit nodes, called extra nodes. For an explicit node v = ε, we set path(v) = (v0, v1,..., vk ) where v0 = v and v1,..., vk are the implicit nodes on the path going upwards from v to its nearest explicit ancestor.

E.g., in the right tree in Fig. 2 we have path(v) = (v, v1, v2, v3, v4, v5 ). We define the locus of a factor v of w as a pair (v, j) such that v = vj where vj ∈ path(v).

The Cover Suffix Tree is an augmented version of a suffix tree introduced in [9] that allows to efficiently compute C(v) for any explicit or implicit node, as shown in the following theorem.

Theorem 4 ([9]). Let w be a word of length n. There exists an augmented suffix tree of size O(n), such that for each edge path(v) we have C(vj ) = c(v) − j∆(v) for some positive integers c(v), ∆(v). Such a tree together with values c(v), ∆(v), denoted as CST (w), can be constructed in O(n log n) time and O(n) space.

Actually [9] provides explicit formulas for c(v), ∆(v) in terms of Occ(v). Their form is not important here; the only property which we use is that 1 ≤ ∆(v) ≤ |Occ(v)|.

3 Seed Suffix Tree CST introduces some extra nodes to ST thanks to which the cover index C(vj ) on each edge becomes a linear function: C(vj ) = c(v) − j∆(v). With seed index S(vj ), the situation is more complex. However, if we make some more nodes explicit, then S(vj ) becomes a relatively simple function. We call the resulting tree the Seed Suffix Tree, denoted by SST (w).

Lemma 5. Let w be a word of length n.

We can construct an augmented suffix tree, denoted by SST (w), of size O(n) such that for each node v there exists a function φv (x) = av x + bv + min(cv, β[x]) and a range Rv = ( v, rv ] such that for all vj ∈ path(v) we have S(vj ) = φv (rv − j). Additionally, 0 ≤ av ≤ |Occ(v)|.

The tree SST (w), together with the border array β and tuples (av, bv, cv, v, rv ) representing φv, can be constructed in O(n) time given CST (w).

–  –  –

Fig. 2. In CST (w) there is a constant-space description of a linear function σv associated with each explicit node v, which gives the values of C(vj ) for implicit nodes on the edge from v upwards. In SST (w) there is a corresponding function ψv which is a combination of the linear function σv and two functions depending on border arrays.

After suitable linear transformation of variable j, the function ψv (j) is converted to a more convenient function φv (x). When transforming CST (w) to SST (w), some implicit nodes are made explicit to guarantee that φv has a simple form (v on the figure).

Proof. For any factor v of w we define:

–  –  –

The following observation relates these values to S(v), see also Fig. 3.

Claim. S(v) = C(v) + LeftS(v) + RightS(v).

Proof (of the claim). C(v) counts all positions covered by full occurrences of v.

We claim that the remaining positions covered by left-overhanging occurrences are counted by LeftS(v). Let p = first(v) + |v| − 1. Note that w[1.. p] has v as a suffix, which means that β[p] is the length of the longest left-overhanging occurrence of v. It covers β[p] positions, but, among them, positions greater than or equal to first(v) are already covered by a full occurrence of v. RightS(v) has a symmetric interpretation.

–  –  –

Fig. 3. The positions contained in S(v) are marked gray. In this case LeftS(v) = β[first(v) + |v| − 1], and RightS(v) = n − |v| + 1 − last(v).

–  –  –

We consider the function S(vj ) = C(vj ) + LeftS(vj ) + RightS(vj ). Note that only C(vj ) is a linear function of j. Also, RightS(vj ) is relatively simple. It either already is a linear function in the whole {0,..., k}, or it becomes one in both parts of {0,..., k} if we split it at j ∈ {0,..., k} such that β R [last(v)] = n − |v| + j + 1 − last(v). We subdivide each path(v) at vj if such j exists. Note that we can easily update values c and ∆ for newly created edges. Also, we make explicit at most O(n) nodes (at most one per edge of CST (w)), so the resulting tree SST (w) has O(n) explicit nodes in total.

It remains to show that after these transformations S(vj ) = φv (rv − j) for rv = first(v) + |v| − 1 and that the coefficients of φv can be efficiently computed.

We omit the explicit formulae in this version, they can be obtained with just a few simple algebraic transformations. The additional inequality 0 ≤ av ≤ |Occ(v)| follows from the property that 1 ≤ ∆(v) ≤ |Occ(v)|.

The following observation is a direct consequence of Lemma 5.

Observation 6. Given SST (w) and a locus of u, a factor of w, one can compute S(u) in constant time.

4 Reduction to two abstract problems

–  –  –

Observation 7. Any border array is a linear-oscillation array.

To solve the LimitedLengthPartialSeeds problem, we make explicit all nodes corresponding to factors of length − 1 and r. This way each edge either contains only nodes at tree levels in {,..., r} or none of them. Note that the functions φv on the subdivided edges stay the same, only the ranges shrink.

Consider the following abstract problem.

Problem A1 Input: a linear-oscillation array B of size n and m pairs (φi, Ri ), where φi is a function φi (x) = ai x+bi +min(ci, B[x]) and Ri = ( i, ri ] ⊆ [1, n] is a non-empty range Output: the values xi = argmax{φi (x) : x ∈ Ri }.3 Applying Problem A1 for B = β and a query for each edge path(v), we obtain vj ∈ path(v) maximizing S(vj ). Taking a global maximum over all edges containing factors of lengths within {,..., r}, we get the sought factor u, which maximizes S(u) among all factors of w with |u| ∈ {,..., r}. This results in the following lemma.

Lemma 8. Given SST (w) and an O(n + m)-time off-line solution to Problem A1, the LimitedLengthPartialSeeds problem can be solved in O(n) time.

To solve the PartialSeeds problem, we also apply Problem A1 to compute max S(vj ) for each edge path(v) of SST (w) (this time we do not introduce any additional extra nodes). We say that an edge path(v) is feasible, if max S(vj ) ≥ α, and important if it is feasible and no ancestor edge is feasible. It is easy to see For a set X and a function f : X → R, we define argmax{f (x) : x ∈ X} as the

largest argument for which f attains its maximum value, that is, max{x ∈ X :

∀x ∈X f (x) ≥ f (x )} (we use the maximality of x later on). We assume max ∅ = −∞ and min ∅ = ∞.

that all shortest α-partial seeds lie on important edges. Also, by Lemma 5, av summed over all feasible edges ev is at most n. Consider the following abstract problem.

Pages:   || 2 |

Similar works:

«TEST BIOTECH Testbiotech e. V. Repor t 2012 Institut für unabhängige Folgenabschätzung in der Biotechnologie „GOLDEN LIES“: das fragwürdige „Golden-Rice“-Projekt der Saatgutindustrie „Golden Lies“: das fragwürdige „Golden-Rice“-Projekt der Saatgutindustrie Ein foodwatch-Report, verfasst von Dr. Christoph Then, testbiotech, Berlin, im Januar 2012 Inhaltsverzeichnis Zusammenfassung 4 1. Einführung 5 1.1 Die Vitamin-A-Mangelernährung und ihre Bekämpfung 5 2. Fehlende...»

«Einsteigerhandbuch 06 / 2006 Copyright  2006 EPLAN Software & Service GmbH & Co. KG. Die EPLAN Software & Service GmbH & Co. KG haftet nicht für technische oder drucktechnische Fehler oder Mängel in diesen technischen Informationen und übernimmt auch keine Haftung für Schäden, die direkt oder indirekt auf Lieferung, Leistung und Nutzung dieses Materials zurückzuführen sind. Dieses Dokument enthält eigentumsrechtlich geschützte Informationen, die dem Urheberrecht unterliegen. Alle...»

«INALCO 2009, Proceedings of the 15th Inuit Studies Conference, Orality (Paris, 2006) About Masks: Conversations from Anaktuvuk Pass, Alaska Margaret B. Blackman Department of Anthropology SUNY College, Brockport (New York, U.S.A.) mblackma@brockport.edu Abstract For 50 years the Nunamiut of Anaktuvuk Pass, Alaska have been making skin masks by a technique that they invented, casting wet caribou skins on wooden molds. For 50 years the Nunamiut have also been talking about their village’s...»

«Energy Division http://energy.tycoelectronics.com Installation and Operating Instructions GEN-AUTO XM Tyco Electronics UK Limited Crompton Instruments Freebournes Road, Witham, Essex, CM8 3AH, UK Tel: +44 1376 509 509 Fax: +44 1376 509 511 Contents Page Section 1 Introduction 5 Section 2 Installation 7 2.1 Unpacking 7 2.2 Unit Configuration 7 2.3 Mechanical Installation 7 2.4 Electrical Connections 9 2.4.1 Charge Generator current 16 Section 3 Programming 17 3.1 Procedure 17 3.2 Program...»

«Thermo mechanical investigations and predictions for oxygen transport membrane materials Von der Fakultät für Georessourcen und Materialtechnik der Rheinisch -Westfälischen Technischen Hochschule Aachen zur Erlangung des akademischen Grades eines Doktors der Ingenieurwissenschaften genehmigte Dissertation vorgelegt von Master of Mechanical Engineering Goran Pećanac aus Split, Republik Kroatien Berichter: Univ.-Prof. Jochen M. Schneider, Ph.D. Univ.-Prof. Dr.-Ing. Tilmann Beck Tag der...»

«GFI Product Manual Fax-Client-Handbuch http://www.gfi.com info@gfi.com Die Informationen in diesem Dokument können ohne vorherige Ankündigung geändert werden. Sofern nicht anders angegeben, sind alle in den Beispielen dieses Dokuments verwendeten Unternehmen, Namen und Daten fiktiv. Ohne die ausdrückliche schriftliche Genehmigung von GFI SOFTWARE LTD. darf kein Teil dieses Dokuments in irgendeiner Form oder mit irgendwelchen Mitteln elektronischer oder mechanischer Art für irgendeinen...»

«S-Log: A new LUT for digital production mastering and interchange applications Hugo Gaggioni, Patel Dhanendra, Jin Yamashita (Sony Broadcast and Production Systems) N. Kawada, K. Endo ( Sony B2B Company) Curtis Clark, American Society of Cinematographers 1. Introduction Color correction tools, file formats, and capture devices are increasingly processing linear, high-dynamic range, color-component data, which is often stored in high-precision floating point formats. However, a wide variety of...»

«This Page Intentionally Left Blank M od erni t y Architecture and MIT Pr es s | Ca mbr i d g e, M as sa c h u s e t t s | London, England Hild e He y n e n M od erni t y Architecture and A C ri ti que © 1999 Massachusetts Institute of Technology All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. This book was set...»

«THOMAS KEITH FERRIS Assistant Professor Department of Industrial and Systems Engineering, Texas A&M University 236C Zachry Engineering Center, College Station, TX, 77845-3131 Email: tferris@tamu.edu Phone: 979-845-5049 http://ise.tamu.edu/people/faculty/ferris EDUCATION PhD, Industrial and Operations Engineering (Cognitive Ergonomics) December 2010 University of Michigan, Ann Arbor, MI Dissertation: Informative vibrotactile displays to support attention and task management in anesthesiology...»

«Lehrstuhl für Genetik der Technischen Universität München The cellular polarity of ENHANCER OF PINOID: underlying factors and its role in the organogenesis of Arabidopsis thaliana Michaela Matthes Vollständiger Abdruck der von der Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Univ.Prof. Dr. A. Gierl...»

«200 Questions and Answers on Practical Civil Engineering Works Vincent T. H. CHU 200 Questions and Answers on Practical Civil Engineering Works Vincent T. H. CHU CONTENTS 1. Bridge Works Q1-26 P4-14 2. Concrete Structures Q1-24 P15-23 3. Drainage Works Q1-19 P24-32 4. Earthworks Q1-10 P33-36 5. Piers and Marine Structures Q1-18 P37-42 6. Roadworks Q1-22 P43-50 7. Pumping Station Q1-10 P51-54 8. Reclamation Q1-11 P55-58 9. Water Retaining Structures and Waterworks Q1-16 P59-63 10. Pipe Jacking...»

«Kaua‘i Community College Kaua‘i Community College 3-1901 Kaumuali‘i Hwy. Lihu‘e, HI 96766-9500 www.kauai.hawaii.edu John D. Constantino Editor & Graphic Artist W elcome to Kaua‘i Community College, your University of Hawai‘i on Kaua‘i. You’ve taken a very important step to reach your education and career goals. A college education can open a window on the broader world as well as greater opportunities. The expectations of a college education are also greater in terms of time,...»

<<  HOME   |    CONTACTS
2016 www.abstract.xlibx.info - Free e-library - Abstract, dissertation, book

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.