On structuring probabilistic dependences in stochastic language modelling Hermann Ney, Ute Essen and Reinhard Kneser Philips GmbH Forschungslaboratorien, Aachen, P.O. Box 1980, D-52021 Aachen, Germany Available online 26 April 2002. Computer Speech & Language, Volume 8, Issue 1, January 1994, Pages 1-38 Abstract In this paper, we study the problem of stochastic language modelling from the viewpoint of introducing suitable structures into the conditional probability distributions. The task of these distributions is to predict the probability of a new word by looking at M or even all predecessor words. The conventional approach is to limit M to 1 or 2 and to interpolate the resulting bigram and trigram models with a unigram model in a linear fashion. However, there are many other structures that can be used to model the probabilistic dependences between the predecessor word and the word to be predicted. The structures considered in this paper are: nonlinear interpolation as an alternative to linear interpolation; equivalence classes for word histories and single words; cache memory and word associations. For the optimal estimation of nonlinear and linear interpolation parameters, the leaving-one-out method is systematically used. For the determination of word equivalence classes in a bigram model, an automatic clustering procedure has been adapted. To capture long-distance dependences, we consider various models for word-by-word dependences; the cache model may be viewed as a special type of self-association. Experimental results are presented for two text databases, a German database and an English database. 
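The linear interpolation described in this abstract can be sketched in a few lines. The following is a minimal bigram/unigram mixture with a hypothetical fixed weight `lam`; the paper instead estimates such weights with leaving-one-out, which this toy version does not attempt.

```python
from collections import Counter

def train_interpolated_bigram(words, lam=0.7):
    """Linearly interpolated language model:
    p(w | v) = lam * p_bigram(w | v) + (1 - lam) * p_unigram(w).
    `lam` is a hand-picked weight here, standing in for an estimated one."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    contexts = Counter(words[:-1])  # times each word appears with a successor
    total = len(words)

    def prob(w, v):
        p_uni = unigrams[w] / total
        p_bi = bigrams[(v, w)] / contexts[v] if contexts[v] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    return prob
```

Because the unigram distribution and each conditional bigram distribution sum to one, the mixture is a proper distribution for every observed context.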
On optimal order in modeling sequence of letters in words of common language as a Markov chain Amlan Kundu and Yang He Department of Electrical Engineering, State University of New York at Buffalo, NY 14260, U.S.A. Received 27 November 1990; revised 16 July 1991. Available online 19 May 2003. Abstract In recognition of words of a language such as English, the letter sequences of the words are often modeled as Markov chains. In this paper the problem of determining the optimal order of such Markov chains is addressed using Tong's minimum Akaike information criterion estimate (MAICE) approach and Hoel's likelihood ratio statistic based hypothesis-testing approach. Simulation results show that the sequence of letters in English words is more likely to be a second order Markov chain than a first order one. Author Keywords: Markov model; Optimal order; Akaike A provably efficient algorithm for dynamic storage allocation E. G. Coffman, Jr. and F. T. Leighton AT&T Bell Laboratories, Murray Hill, New Jersey 07974, USA Department of Mathematics and Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA Received 25 August 1986; revised 21 September 1987. Available online 2 December 2003. Abstract The design and analysis of algorithms for on-line dynamic storage allocation has been a fundamental problem area in computer science for many years. In this paper we study the stochastic behavior of dynamic allocation algorithms under the natural assumption that files enter and leave the system according to a Poisson process. In particular, we prove that for any dynamic allocation algorithm and any distribution of file sizes, the expected wasted space (or fragmentation) in the system at any time is Ω(N log log N), where N is the expected number of items (or used space) in the system. This result is known to be tight in the special case when all files have the same size. 
More importantly, we also construct a dynamic allocation algorithm which for any distribution of file sizes wastes only O(N log^{3/4} N) space with very high probability. This bound is also shown to be tight for a wide variety of file-size distributions, including for example the uniform and normal distributions. The results are significant because they show that the cumulative wasted space in the holes formed by the continual arrival and departure of items is a vanishingly small portion of the used space, at least on the average. This fact is in striking contrast with Knuth's well-known 50% rule which states that the number of these holes is linear in the used space. Moreover, the proof techniques establish a surprising connection between stochastic processes, such as dynamic allocation, and static problems such as bin-packing and planar matching. We suspect that the techniques will also prove useful in analyzing other stochastic processes which might otherwise prove intractable. Lastly, we present experimental data in support of the theoretical proofs, and as a basis for postulating several conjectures. A formal derivation of Heaps' Law Information Sciences, Volume 170, Issues 2-4, 25 February 2005, Pages 263-272 D. C. van Leijenhorst and Th. P. van der Weide Abstract Word frequencies in text documents can be reasonably described by the Mandelbrot distribution, which has Zipf's Law as a special case. Furthermore, the growth of vocabulary size as a function of the text size (its number of words) has been described in Heaps' Law. It has been shown that these two experimental laws are related. In this paper we go a step further, and provide a (formal) derivation of Heaps' Law from the Mandelbrot distribution. We also provide a specification of the validity area for applying Heaps' Law. 
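Heaps' Law is easy to probe empirically. The sketch below samples tokens from a plain Zipf-type distribution (a simplified stand-in for the Mandelbrot distribution used in the paper) and records the vocabulary growth curve; all function names are hypothetical.

```python
import random

def heaps_curve(tokens):
    """Vocabulary size V(n) after the first n tokens; Heaps' Law predicts
    V(n) ≈ K * n^beta for some K > 0 and 0 < beta < 1."""
    seen, curve = set(), []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

def zipf_tokens(n_tokens, n_types, s=1.0, seed=0):
    """Sample token ranks with probability proportional to 1/rank^s."""
    rng = random.Random(seed)
    weights = [1.0 / r ** s for r in range(1, n_types + 1)]
    return rng.choices(range(1, n_types + 1), weights=weights, k=n_tokens)
```

Plotting `heaps_curve` on log-log axes against the token index gives an approximately straight line over the law's validity range, which is the regime the paper characterizes.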
Zipf's law in phonograms and Weibull distribution in ideograms: comparison of English with Japanese Biosystems, Volume 73, Issue 2, February 2004, Pages 131-139 Terutaka Nabeshima and Yukio-Pegio Gunji Abstract Frequency distribution of word usage in a word sequence generated by capping is estimated in terms of the number of "hits" in retrieval of web-pages, to evaluate the structure of semantics proper not to a particular text but to a language. In particular, we compare the distribution of English sequences with Japanese ones and find that, for English and Japanese phonograms, the frequency of word usage against rank follows a power-law function with exponent 1, while for Japanese ideograms it follows a stretched exponential (Weibull distribution) function. We also discuss how such a difference can result from the distinction between a phonogram-based (English) and an ideogram-based (Japanese) language. Mathematical modeling of empirical laws in computer applications: A case study Computers & Mathematics with Applications, Volume 24, Issue 7, October 1992, Pages 77-87 Ye-Sho Chen and Pete Chong Abstract A major difficulty in using empirical laws in computer applications is the estimation of parameters. In this paper, we argue that the difficulty arises from the misuse of goodness-of-fit tests. As an alternative, we suggest the use of Simon's theory of model building and apply the theory to examine the two well-known laws of Zipf. As a result, we show that the Simon-Yule model of text generation derives two formulations of Zipf's law: (1) the frequency-count distribution proposed by Booth, and (2) the frequency-rank distribution proposed by Mandelbrot. A further significant contribution of the paper is that it provides a theoretical foundation for the estimation of parameters associated with the two laws of Zipf. 
Improved bounds for covering complete uniform hypergraphs Information Processing Letters, Volume 41, Issue 4, 18 March 1992, Pages 203-207 Jaikumar Radhakrishnan Abstract We consider the problem of covering the complete r-uniform hypergraphs on n vertices using complete r-partite graphs. We obtain lower bounds on the size of such a covering. For small values of r our result implies a lower bound of on the size of any such covering. This improves the previous bound of Ω(rn log n) due to Snir. We also obtain good lower bounds on the size of a family of perfect hash functions using simple arguments. Fisher keys for content based retrieval Image and Vision Computing, Volume 19, Issue 8, 1 May 2001, Pages 561-566 M. S. Lew and D. Denteneer Abstract In the classic computer science paradigm of data searching, the data is sorted according to a key and then inserted into a hash table or tree for fast access. How would this paradigm work for images? What key function would be best? This paper examines the problem of efficient indexing of large image databases using the concept of image keys. The ideal image key maximizes the probability that the key of a corrupted image copy is closer to the key of the original than the key of a different image in the database. The case of optimal linear image keys turns out to be similar to Fisher's linear discriminant. Results on image collections with real world noise are presented. 
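The connection to Fisher's linear discriminant can be made concrete. A toy 2-D sketch, with hypothetical feature vectors standing in for image keys: the discriminant direction is w = S_w^{-1}(mu_a − mu_b), where S_w is the pooled within-class scatter matrix.

```python
def fisher_key_direction(class_a, class_b):
    """Fisher linear discriminant direction for two classes of 2-D points.
    A toy stand-in for the paper's optimal linear image keys."""
    def mean(pts):
        n = len(pts)
        return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

    def scatter(pts, mu):
        sxx = sum((p[0] - mu[0]) ** 2 for p in pts)
        syy = sum((p[1] - mu[1]) ** 2 for p in pts)
        sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in pts)
        return [[sxx, sxy], [sxy, syy]]

    mu_a, mu_b = mean(class_a), mean(class_b)
    # pooled within-class scatter S = S_a + S_b
    s = [[x + y for x, y in zip(ra, rb)]
         for ra, rb in zip(scatter(class_a, mu_a), scatter(class_b, mu_b))]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    d = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    # w = S^{-1} d via the closed-form 2x2 inverse
    return [(s[1][1] * d[0] - s[0][1] * d[1]) / det,
            (-s[1][0] * d[0] + s[0][0] * d[1]) / det]
```

Projecting each image key onto `w` then yields a scalar along which the original and the different image are maximally separated relative to within-class noise.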
A Small Approximately Min-Wise Independent Family of Hash Functions Journal of Algorithms, Volume 38, Issue 1, January 2001, Pages 84-90 Piotr Indyk Abstract In this paper we give a construction of a small approximately min-wise independent family of hash functions, i.e., a family of hash functions such that for any set of arguments X and x ∈ X, the probability that the value of a random function from that family on x will be the smallest among all values of that function on X is roughly 1/|X|. The number of bits needed to represent each function is O(log n · log(1/ε)). This construction gives a solution to the main open problem of A. Broder et al. (in "STOC'98"). A probabilistic dynamic technique for the distributed generation of very large state spaces Performance Evaluation, Volume 39, Issues 1-4, February 2000, Pages 127-148 W. J. Knottenbelt, P. G. Harrison, M. A. Mestern and P. S. Kritzinger Abstract Conventional methods for state space exploration are limited to the analysis of small systems because they suffer from excessive memory and computational requirements. We have developed a new dynamic probabilistic state exploration algorithm which addresses this problem for general, structurally unrestricted state spaces. Our method has a low state omission probability and low memory usage that is independent of the length of the state vector. In addition, the algorithm can be easily parallelised. This combination of probability and parallelism enables us to rapidly explore state spaces that are an order of magnitude larger than those obtainable using conventional exhaustive techniques. We derive a performance model of this new algorithm in order to quantify its benefits in terms of distributed run-time, speedup and efficiency. 
We implement our technique on a distributed-memory parallel computer and demonstrate results which compare favourably with the performance model. Finally, we discuss suitable choices for the three hash functions upon which our algorithm is based. On probabilities of hash value matches Computers & Security, Volume 17, Issue 2, 1998, Pages 171-176 Mohammad Peyravian, Allen Roginsky and Ajay Kshemkalyani Abstract Hash functions are used in authentication and cryptography, as well as for the efficient storage and retrieval of data using hashed keys. Hash functions are susceptible to undesirable collisions. To design or choose an appropriate hash function for an application, it is essential to estimate the probabilities with which these collisions can occur. In this paper we consider two problems: one of evaluating the probability of no collision at all and one of finding a bound for the probability of a collision with a particular hash value. The quality of these estimates under various values of the parameters is also discussed. A Reliable Randomized Algorithm for the Closest-Pair Problem Journal of Algorithms, Volume 25, Issue 1, October 1997, Pages 19-51 Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen and Martti Penttonen Abstract The following two computational problems are studied: Duplicate grouping: Assume that n items are given, each of which is labeled by an integer key from the set {0,…,U − 1}. Store the items in an array of size n such that items with the same key occupy a contiguous segment of the array. Closest pair: Assume that a multiset of n points in the d-dimensional Euclidean space is given, where d ≥ 1 is a fixed integer. Each point is represented as a d-tuple of integers in the range {0,…,U − 1} (or of arbitrary real numbers). Find a closest pair, i.e., a pair of points whose distance is minimal over all such pairs. 
In 1976, Rabin described a randomized algorithm for the closest-pair problem that takes linear expected time. As a subroutine, he used a hashing procedure whose implementation was left open. Only years later were randomized hashing schemes suitable for filling this gap developed. In this paper, we return to Rabin's classic algorithm to provide a fully detailed description and analysis, thereby also extending and strengthening his result. As a preliminary step, we study randomized algorithms for the duplicate-grouping problem. In the course of solving the duplicate-grouping problem, we describe a new universal class of hash functions of independent interest. It is shown that both of the foregoing problems can be solved by randomized algorithms that use O(n) space and finish in O(n) time with probability tending to 1 as n grows to infinity. The model of computation is a unit-cost RAM capable of generating random numbers and of performing arithmetic operations from the set {+, −, *, DIV, LOG2, EXP2}, where DIV denotes integer division and LOG2 and EXP2 are the mappings from ℕ to ℕ ∪ {0} with LOG2(m) = ⌈log₂ m⌉ and EXP2(m) = 2^m for all m ∈ ℕ. If the operations LOG2 and EXP2 are not available, the running time of the algorithms increases by an additive term of O(log log U). All numbers manipulated by the algorithms consist of O(log n + log U) bits. The algorithms for both of the problems exceed the time bound O(n) or O(n + log log U) with probability 2^(−n^Ω(1)). Variants of the algorithms are also given that use only O(log n + log U) random bits and have probability O(n^−α) of exceeding the time bounds, where α ≥ 1 is a constant that can be chosen arbitrarily. The algorithm for the closest-pair problem also works if the coordinates of the points are arbitrary real numbers, provided that the RAM is able to perform arithmetic operations from {+, −, *, DIV} on real numbers, where a DIV b now means a/b. 
In this case, the running time is O(n) with LOG2 and EXP2 and O(n + log log(δmax/δmin)) without them, where δmax is the maximum and δmin is the minimum distance between any two distinct input points. Fast rehashing in PRAM emulations Theoretical Computer Science, Volume 155, Issue 2, 11 March 1996, Pages 349-363 Jörg Keller Abstract In PRAM emulations, universal hashing is a well-known method for distributing the address space among memory modules. However, if the memory access patterns of an application often result in high module congestion, it is necessary to rehash by choosing another hash function and redistributing data on the fly. For the case of linear hash functions h(x) = ax mod m we present an algorithm to rehash an address space of size m = 2u on a PRAM emulation with p processors in time O(m/p + log m + L), where L denotes the network latency. For the common case that m is polynomial in p and L = O(log p) the runtime is O(m/p + log p). The algorithm requires O(log m + L) words of local storage per processor. We show that an obvious simplification of the algorithm will significantly increase runtime with high probability. The analysis of hashing with lazy deletions Information Sciences, Volume 62, Issues 1-2, July 1992, Pages 13-26 Pedro Celis and John Franco Abstract We present new, improved algorithms for performing deletions and subsequent searches in hash tables. Our method is based on open addressing hashing extended to allow efficient reclamation of unoccupied space due to deletions which enables dynamic shortening of probe sequence lengths. We present an analysis on the number of table probes required to locate an element in the table. Specifically, we present a formula which bounds the average number of cells visited during searches of a data element over its lifetime assuming a system in equilibrium. 
The formula is a function of the probability that an accessed element is deleted and is exact at the extreme points when the probability is 0 and 1. In the case that the probability is 0 and the load factor is α, the number of cell visits per search access is −ln(1−α)/α, and in the case that the probability is 1 the number of cell visits per search access is 1/(1−α). How to emulate shared memory Journal of Computer and System Sciences, Volume 42, Issue 3, June 1991, Pages 307-326 Abhiram G. Ranade Abstract We present a simple algorithm for emulating an N-processor CROW PRAM on an N-node butterfly. Each step of the PRAM is emulated in time O(log N) with high probability, using FIFO queues of size O(1) at each node. The only use of randomization is in selecting a hash function to distribute the shared address space of the PRAM onto the nodes of the butterfly. The routing itself is both deterministic and oblivious, and messages are combined without the use of associative memories or explicit sorting. As a corollary we improve the result of Pippenger by routing permutations with bounded queues in logarithmic time, without the possibility of deadlock. Besides being optimal, our algorithm has the advantage of extreme simplicity and is readily suited for use in practice. New hash functions and their use in authentication and set equality Journal of Computer and System Sciences, Volume 22, Issue 3, June 1981, Pages 265-279 Mark N. Wegman and J. Lawrence Carter Abstract In this paper we exhibit several new classes of hash functions with certain desirable properties, and introduce two novel applications for hashing which make use of these functions. One class contains a small number of functions, yet is almost universal₂. If the functions hash n-bit long names into m-bit indices, then specifying a member of the class requires only O((m + log₂log₂(n)) · log₂(n)) bits as compared to O(n) bits for earlier techniques. 
For long names, this is about a factor of m larger than the lower bound of m + log₂n − log₂m bits. An application of this class is a provably secure authentication technique for sending messages over insecure lines. A second class of functions satisfies a much stronger property than universal₂. We present the application of testing sets for equality. The authentication technique allows the receiver to be certain that a message is genuine. An "enemy"—even one with infinite computer resources—cannot forge or modify a message without detection. The set equality technique allows operations including "add member to set," "delete member from set" and "test two sets for equality" to be performed in expected constant time and with less than a specified probability of error. The bag model in language statistics In-memory hash tables for accumulating text vocabularies
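The universal₂ property that the Wegman-Carter abstract builds on can be illustrated with the classic construction h_{a,b}(x) = ((a·x + b) mod p) mod m, under which any fixed pair of distinct keys collides with probability close to 1/m. A sketch (the paper's new classes achieve much shorter function descriptions, which this toy version does not):

```python
import random

def make_universal_hash(m, p=2**61 - 1, rng=None):
    """Draw one member h_{a,b}(x) = ((a*x + b) mod p) mod m of the classic
    universal hash family; p must be a prime larger than any key."""
    rng = rng or random.Random()
    a = rng.randrange(1, p)  # a != 0, otherwise every key maps to b
    b = rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def collision_rate(x, y, m, trials=2000, seed=0):
    """Empirical probability that a random family member maps x and y together."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_universal_hash(m, rng=rng)
        hits += h(x) == h(y)
    return hits / trials
```

Sampling many members and measuring how often a fixed pair of keys collides gives an estimate near 1/m, which is the guarantee that the authentication and set-equality applications rely on.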