Abstracts - Emory University

On structuring probabilistic dependences in stochastic language modelling
Hermann Ney, Ute Essen and Reinhard Kneser
Philips GmbH Forschungslaboratorien, Aachen, P.O. Box 1980, D-52021
Aachen, Germany
Available online 26 April 2002.
Computer Speech & Language
Volume 8, Issue 1 , January 1994, Pages 1-38
Abstract
In this paper, we study the problem of stochastic language modelling
from the viewpoint of introducing suitable structures into the
conditional probability distributions. The task of these distributions
is to predict the probability of a new word by looking at M or even
all predecessor words. The conventional approach is to limit M to 1 or
2 and to interpolate the resulting bigram and trigram models with a
unigram model in a linear fashion. However, there are many other
structures that can be used to model the probabilistic dependences
between the predecessor word and the word to be predicted. The
structures considered in this paper are: nonlinear interpolation as an
alternative to linear interpolation; equivalence classes for word
histories and single words; cache memory and word associations. For
the optimal estimation of nonlinear and linear interpolation
parameters, the leaving-one-out method is systematically used. For the
determination of word equivalence classes in a bigram model, an
automatic clustering procedure has been adapted. To capture
long-distance dependences, we consider various models for word-by-word
dependences; the cache model may be viewed as a special type of
self-association. Experimental results are presented for two text
databases, a German database and an English database.
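As a rough illustration of the conventional baseline this abstract contrasts with its structured alternatives (linear interpolation of unigram and bigram estimates), here is a minimal sketch; the toy corpus and the weight lambda are invented for the example, not taken from the paper.

```python
# Sketch of linearly interpolated n-gram probabilities; corpus and
# interpolation weight are illustrative only.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(w):
    return unigrams[w] / total

def p_bigram(w, prev):
    # Maximum-likelihood bigram estimate; zero if the history is unseen.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interp(w, prev, lam=0.7):
    # Linear interpolation: lam * bigram + (1 - lam) * unigram.
    return lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w)

print(p_interp("cat", "the"))   # → 0.5333…
```

The interpolation weight would normally be estimated from held-out data, e.g. by the leaving-one-out method the abstract mentions.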
On optimal order in modeling sequence of letters in words of common
language as a Markov chain
doi:10.1016/0031-3203(91)90027-3
Copyright (c) 1991 Published by Elsevier Science B.V.
Amlan Kundu and Yang He
Department of Electrical Engineering, State University of New York at
Buffalo, NY 14260, U.S.A.
Received 27 November 1990; revised 16 July 1991. Available online 19
May 2003.
Abstract
In recognition of words of a language such as English, the letter
sequences of the words are often modeled as Markov chains. In this
paper the problem of determining the optimal order of such Markov
chains is addressed using Tong's minimum Akaike information criterion
estimate (MAICE) approach and Hoel's likelihood ratio statistic based
hypothesis-testing approach. Simulation results show that the sequence
of letters in English words is more likely to be a second order Markov
chain than a first order one.
Author Keywords: Markov model; Optimal order; Akaike information criterion
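A hedged sketch of the MAICE idea described above: fit Markov chains of candidate orders and pick the one minimizing AIC = −2·log-likelihood + 2·(free parameters). The toy alternating sequence is invented for the example; it is correctly identified as first order, whereas the paper's simulations favour second order for real English words.

```python
# AIC-based Markov order selection on a toy letter sequence; not the
# paper's experimental setup.
import math
from collections import Counter

def aic_markov(seq, order, alphabet):
    """AIC = -2 * log-likelihood + 2 * (free parameters) for an order-k chain."""
    ctx, trans = Counter(), Counter()
    for i in range(order, len(seq)):
        h = seq[i - order:i]
        ctx[h] += 1
        trans[(h, seq[i])] += 1
    # Maximum-likelihood log-likelihood of the observed transitions.
    loglik = sum(c * math.log(c / ctx[h]) for (h, s), c in trans.items())
    k = (len(alphabet) ** order) * (len(alphabet) - 1)
    return -2 * loglik + 2 * k

seq = "abababababababab"
scores = {m: aic_markov(seq, m, "ab") for m in (1, 2)}
best = min(scores, key=scores.get)   # deterministic alternation: order 1 wins
```

Higher orders fit at least as well but pay an exponentially growing parameter penalty, which is exactly the trade-off MAICE arbitrates.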
A provably efficient algorithm for dynamic storage allocation
E. G. Coffman, Jr. and F. T. Leighton*
AT&T Bell Laboratories, Murray Hill, New Jersey 07974, USA
Department of Mathematics and Laboratory for Computer Science,
Massachusetts Institute of Technology, Cambridge, Massachusetts 02139,
USA
Received 25 August 1986; revised 21 September 1987. Available online
2 December 2003.
Abstract
The design and analysis of algorithms for on-line dynamic storage
allocation has been a fundamental problem area in computer science for
many years. In this paper we study the stochastic behavior of dynamic
allocation algorithms under the natural assumption that files enter
and leave the system according to a Poisson process. In particular, we
prove that for any dynamic allocation algorithm and any distribution
of file sizes, the expected wasted space (or fragmentation) in the
system at any time is Ω(√(N log log N)), where N is the expected number
of items (or used space) in the system. This result is known to be
tight in the special case when all files have the same size. More
importantly, we also construct a dynamic allocation algorithm which
for any distribution of file sizes wastes only O(√N (log N)^(3/4)) space with
very high probability. This bound is also shown to be tight for a wide
variety of file-size distributions, including for example the uniform
and normal distributions. The results are significant because they
show that the cumulative wasted space in the holes formed by the
continual arrival and departure of items is a vanishingly small
portion of the used space, at least on the average. This fact is in
striking contrast with Knuth's well-known 50% rule which states that
the number of these holes is linear in the used space. Moreover, the
proof techniques establish a surprising connection between stochastic
processes, such as dynamic allocation, and static problems such as
bin-packing and planar matching. We suspect that the techniques will
also prove useful in analyzing other stochastic processes which might
otherwise prove intractable. Lastly, we present experimental data in
support of the theoretical proofs, and as a basis for postulating
several conjectures.
A formal derivation of Heaps' Law
Information Sciences, Volume 170, Issues 2-4, 25 February 2005, Pages 263-272
D. C. van Leijenhorst and Th. P. van der Weide
Word frequencies in text documents can be reasonably described by the
Mandelbrot distribution, which has Zipf's Law as a special case.
Furthermore, the growth of vocabulary size as a function of the text
size (its number of words) has been described in Heaps' Law. It has
been shown that these two experimental laws are related.
In this paper we go a step further, and provide a (formal) derivation
of Heaps' Law from the Mandelbrot distribution. We also provide a
specification of the validity area for applying Heaps' Law.
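A small simulation of the relationship derived in the paper, with invented parameters: draw words from a Zipf-style rank-frequency distribution and watch the vocabulary grow sublinearly with text size, as Heaps' Law predicts.

```python
# Empirical Heaps' Law: vocabulary size V(n) versus text size n for text
# sampled from a Zipf-like distribution. All parameters are illustrative.
import random

random.seed(0)
W = 10_000                                    # potential vocabulary (ranks)
weights = [1.0 / r for r in range(1, W + 1)]  # Zipf's Law: frequency ∝ 1/rank

text = random.choices(range(1, W + 1), weights=weights, k=50_000)

seen, growth = set(), []
for i, w in enumerate(text, 1):
    seen.add(w)
    if i % 10_000 == 0:
        growth.append((i, len(seen)))         # (text size n, vocabulary V(n))

# V(n) keeps increasing, but the ratio V(n)/n falls: sublinear growth,
# qualitatively V(n) ≈ K * n^beta with 0 < beta < 1.
```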
Zipf's law in phonograms and Weibull distribution in ideograms:
comparison of English with Japanese
Biosystems, Volume 73, Issue 2, February 2004, Pages 131-139
Terutaka Nabeshima and Yukio-Pegio Gunji
The frequency distribution of word usage in a word sequence generated
by capping is estimated in terms of the number of "hits" in retrieval
of web pages, in order to evaluate a semantic structure proper not to
a particular text but to a language. In particular, we compare the
distribution of English sequences with Japanese ones and find that,
for English and Japanese phonograms, the frequency of word usage
against rank follows a power-law function with exponent 1, while for
Japanese ideograms it follows a stretched exponential (Weibull
distribution) function. We also discuss how such a difference can
result from the difference between phonogram-based (English) and
ideogram-based (Japanese) languages.
Mathematical modeling of empirical laws in computer applications: A
case study
Computers & Mathematics with Applications, Volume 24, Issue 7, October
1992, Pages 77-87
Ye-Sho Chen and Pete Chong
Abstract
A major difficulty in using empirical laws in computer applications is
the estimation of parameters. In this paper, we argue that the
difficulty arises from the misuse of goodness-of-fit tests. As an
alternative, we suggest the use of Simon's theory of model building
and apply the theory to examine the two well-known laws of Zipf. As a
result, we show that the Simon-Yule model of text generation derives
two formulations of Zipf's law: (1) the frequency-count distribution
proposed by Booth, and (2) the frequency-rank distribution proposed by
Mandelbrot. A further significant contribution of the paper is that it
provides a theoretical foundation for the estimation of parameters
associated with the two laws of Zipf.
Improved bounds for covering complete uniform hypergraphs
Information Processing Letters, Volume 41, Issue 4, 18 March 1992,
Pages 203-207
Jaikumar Radhakrishnan
Abstract
We consider the problem of covering the complete r-uniform hypergraphs
on n vertices using complete r-partite graphs. We obtain lower bounds
on the size of such a covering. For small values of r, our result
improves the previous lower bound of Ω(rn log n) due to Snir. We also
obtain good lower bounds on the size of a family of perfect hash
functions using simple arguments.
Fisher keys for content based retrieval
Image and Vision Computing, Volume 19, Issue 8, 1 May 2001, Pages 561-566
M. S. Lew and D. Denteneer
In the classic computer science paradigm of data searching, the data
is sorted according to a key and then inserted into a hash table or
tree for fast access. How would this paradigm work for images? What
key function would be best? This paper examines the problem of
efficient indexing of large image databases using the concept of image
keys. The ideal image key maximizes the probability that the key of a
corrupted image copy is closer to the key of the original than the key
to a different image in the database. The case of optimal linear image
keys turns out to be similar to Fisher's linear discriminant. Results
on image collections with real world noise are presented.
A Small Approximately Min-Wise Independent Family of Hash Functions
Journal of Algorithms, Volume 38, Issue 1, January 2001, Pages 84-90
Piotr Indyk
In this paper we give a construction of a small approximately min-wise
independent family of hash functions, i.e., a family of hash functions
such that for any set of arguments X and any x ∈ X, the probability that
the value of a random function from that family on x will be the
smallest among all values of that function on X is roughly 1/|X|. The
number of bits needed to represent each function is O(log n · log(1/ε)).
This construction gives a solution to the main open problem of A.
Broder et al. (in "STOC'98").
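As a baseline for the property being approximated: with a fully random function, each element of X is the minimizer with probability exactly 1/|X|. The sketch below checks this empirically, using fresh random values per trial rather than a compactly representable family; avoiding the need for full randomness is precisely the point of the paper's small construction.

```python
# Empirical check of the (exact) min-wise property of fully random
# functions. X, the seed, and the trial count are illustrative.
import random

rng = random.Random(42)
X = list(range(20))
trials = 20_000

wins = 0
for _ in range(trials):
    h = {x: rng.random() for x in X}   # a fresh, fully random function on X
    if min(X, key=h.get) == 0:         # is x = 0 the minimizer this time?
        wins += 1

estimate = wins / trials               # should be close to 1/|X| = 0.05
```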
A probabilistic dynamic technique for the distributed generation of
very large state spaces
Performance Evaluation, Volume 39, Issues 1-4, February 2000, Pages 127-148
W. J. Knottenbelt, P. G. Harrison, M. A. Mestern and P. S. Kritzinger
Conventional methods for state space exploration are limited to the
analysis of small systems because they suffer from excessive memory
and computational requirements. We have developed a new dynamic
probabilistic state exploration algorithm which addresses this problem
for general, structurally unrestricted state spaces.
Our method has a low state omission probability and low memory usage
that is independent of the length of the state vector. In addition,
the algorithm can be easily parallelised. This combination of
probability and parallelism enables us to rapidly explore state spaces
that are an order of magnitude larger than those obtainable using
conventional exhaustive techniques.
We derive a performance model of this new algorithm in order to
quantify its benefits in terms of distributed run-time, speedup and
efficiency. We implement our technique on a distributed-memory
parallel computer and demonstrate results which compare favourably
with the performance model. Finally, we discuss suitable choices for
the three hash functions upon which our algorithm is based.
On probabilities of hash value matches
Computers & Security, Volume 17, Issue 2, 1998, Pages 171-176
Mohammad Peyravian, Allen Roginsky and Ajay Kshemkalyani
Hash functions are used in authentication and cryptography, as well as
for the efficient storage and retrieval of data using hashed keys.
Hash functions are susceptible to undesirable collisions. To design or
choose an appropriate hash function for an application, it is
essential to estimate the probabilities with which these collisions
can occur. In this paper we consider two problems: one of evaluating
the probability of no collision at all and one of finding a bound for
the probability of a collision with a particular hash value. The
quality of these estimates under various values of the parameters is
also discussed.
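The two quantities the abstract distinguishes can be written down in the standard uniform-hashing model (an assumption here; the paper refines such estimates): the exact birthday-style product for no collision at all, and the probability that some key lands on one particular hash value.

```python
# Collision probabilities under uniform hashing of n keys into m slots.
def p_no_collision(n, m):
    """Birthday-problem product: all n uniformly hashed keys are distinct."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p

def p_hit_value(n, m):
    """Probability that at least one of n uniform keys hits a fixed value."""
    return 1.0 - (1.0 - 1.0 / m) ** n

# Classic birthday figures: 23 "keys" into 365 slots.
print(p_no_collision(23, 365))   # ≈ 0.4927
print(p_hit_value(23, 365))      # ≈ 0.061
```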
A Reliable Randomized Algorithm for the Closest-Pair Problem
Journal of Algorithms, Volume 25, Issue 1, October 1997, Pages 19-51
Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen and Martti Penttonen
The following two computational problems are studied:
Duplicate grouping: Assume that n items are given, each of which is
labeled by an integer key from the set {0,…,U − 1}. Store the items in
an array of size n such that items with the same key occupy a
contiguous segment of the array.
Closest pair: Assume that a multiset of n points in the d-dimensional
Euclidean space is given, where d ≥ 1 is a fixed integer. Each point
is represented as a d-tuple of integers in the range {0,…,U − 1} (or
of arbitrary real numbers). Find a closest pair, i.e., a pair of
points whose distance is minimal over all such pairs.
In 1976, Rabin described a randomized algorithm for the closest-pair
problem that takes linear expected time. As a subroutine, he used a
hashing procedure whose implementation was left open. Only years later
randomized hashing schemes suitable for filling this gap were
developed.
In this paper, we return to Rabin's classic algorithm to provide a
fully detailed description and analysis, thereby also extending and
strengthening his result. As a preliminary step, we study randomized
algorithms for the duplicate-grouping problem. In the course of
solving the duplicate-grouping problem, we describe a new universal
class of hash functions of independent interest.
It is shown that both of the foregoing problems can be solved by
randomized algorithms that use O(n) space and finish in O(n) time with
probability tending to 1 as n grows to infinity. The model of
computation is a unit-cost RAM capable of generating random numbers
and of performing arithmetic operations from the set {+, −, *, div,
log, exp}, where div denotes integer division and log and exp are the
mappings from ℕ to ℕ ∪ {0} with log(m) = ⌊log₂ m⌋ and exp(m) = 2^m for
all m ∈ ℕ. If the operations log and exp are not available, the
running time of the algorithms increases by an additive term of
O(log log U). All numbers manipulated by the algorithms consist of
O(log n + log U) bits.
The algorithms for both of the problems exceed the time bound O(n)
or O(n + log log U) with probability 2^(−n^Ω(1)). Variants of the
algorithms are also given that use only O(log n + log U) random bits
and have probability O(n^(−α)) of exceeding the time bounds, where
α ≥ 1 is a constant that can be chosen arbitrarily.
The algorithm for the closest-pair problem also works if the
coordinates of the points are arbitrary real numbers, provided that
the RAM is able to perform arithmetic operations from {+, −, *, div}
on real numbers, where a div b now means ⌊a/b⌋. In this case, the
running time is O(n) with log and exp, and O(n + log log(δmax/δmin))
without them, where δmax is the maximum and δmin is the minimum
distance between any two distinct input points.
Fast rehashing in PRAM emulations
Theoretical Computer Science, Volume 155, Issue 2, 11 March 1996, Pages
349-363
Jörg Keller
In PRAM emulations, universal hashing is a well-known method for
distributing the address space among memory modules. However, if the
memory access patterns of an application often result in high module
congestion, it is necessary to rehash by choosing another hash
function and redistributing data on the fly. For the case of linear
hash functions h(x) = ax mod m we present an algorithm to rehash an
address space of size m = 2^u on a PRAM emulation with p processors in
time O(m/p + log m + L), where L denotes the network latency. For the
common case that m is polynomial in p and L = O(log p) the runtime is
O(m/p + log p). The algorithm requires O(log m + L) words of local
storage per processor. We show that an obvious simplification of the
algorithm will significantly increase runtime with high probability.
The analysis of hashing with lazy deletions
Information Sciences, Volume 62, Issues 1-2, July 1992, Pages 13-26
Pedro Celis and John Franco
Abstract
We present new, improved algorithms for performing deletions and
subsequent searches in hash tables. Our method is based on open
addressing hashing extended to allow efficient reclamation of
unoccupied space due to deletions which enables dynamic shortening of
probe sequence lengths. We present an analysis on the number of table
probes required to locate an element in the table. Specifically, we
present a formula which bounds the average number of cells visited
during searches of a data element over its lifetime assuming a system
in equilibrium. The formula is a function of the probability that an
accessed element is deleted and is exact at the extreme points when
the probability is 0 and 1. In the case that the probability is 0 and
the load factor is α, the number of cell visits per search access is
−ln(1−α)/α, and in the case that the probability is 1 the number of
cell visits per search access is 1/(1−α).
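A quick numeric reading of the two closed forms in this abstract (with α the load factor); the values below simply evaluate the stated formulas and add no new analysis.

```python
# Expected cell visits per search access at the two extreme deletion
# probabilities, as functions of the load factor alpha.
import math

def visits_delete_prob_0(alpha):
    # Deletion probability 0: -ln(1 - alpha) / alpha.
    return -math.log(1 - alpha) / alpha

def visits_delete_prob_1(alpha):
    # Deletion probability 1: 1 / (1 - alpha).
    return 1 / (1 - alpha)

alpha = 0.9
print(visits_delete_prob_0(alpha))   # ≈ 2.56
print(visits_delete_prob_1(alpha))   # ≈ 10.0
```

Both costs blow up as α → 1, but the always-delete case grows much faster, which is why reclaiming space after deletions matters at high load.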
How to emulate shared memory
Journal of Computer and System Sciences, Volume 42, Issue 3, June
1991, Pages 307-326
Abhiram G. Ranade
Abstract
We present a simple algorithm for emulating an N-processor CROW PRAM
on an N-node butterfly. Each step of the PRAM is emulated in time O(log
N) with high probability, using FIFO queues of size O(1) at each node.
The only use of randomization is in selecting a hash function to
distribute the shared address space of the PRAM onto the nodes of the
butterfly. The routing itself is both deterministic and oblivious, and
messages are combined without the use of associative memories or
explicit sorting. As a corollary we improve the result of Pippenger by
routing permutations with bounded queues in logarithmic time, without
the possibility of deadlock. Besides being optimal, our algorithm has
the advantage of extreme simplicity and is readily suited for use in
practice.
New hash functions and their use in authentication and set equality
Journal of Computer and System Sciences, Volume 22, Issue 3, June
1981, Pages 265-279
Mark N. Wegman and J. Lawrence Carter
Abstract
In this paper we exhibit several new classes of hash functions with
certain desirable properties, and introduce two novel applications for
hashing which make use of these functions. One class contains a small
number of functions, yet is almost universal₂. If the functions hash
n-bit long names into m-bit indices, then specifying a member of the
class requires only O((m + log₂log₂(n)) · log₂(n)) bits as compared to
O(n) bits for earlier techniques. For long names, this is about a
factor of m larger than the lower bound of m + log₂n − log₂m bits. An
application of this class is a provably secure authentication
technique for sending messages over insecure lines. A second class of
functions satisfies a much stronger property than universal₂. We
present the application of testing sets for equality.
The authentication technique allows the receiver to be certain that a
message is genuine. An "enemy"—even one with infinite computer
resources—cannot forge or modify a message without detection. The set
equality technique allows operations including "add member to set,"
"delete member from set" and "test two sets for equality" to be
performed in expected constant time and with less than a specified
probability of error.
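For context, a minimal sketch of the textbook almost-universal₂ construction h_{a,b}(x) = ((a·x + b) mod p) mod m and its use for one-time message authentication. The prime, tag size, and messages are illustrative, and this is the generic Carter-Wegman-style construction, not necessarily the specific new classes this paper introduces.

```python
# One-time authentication with a hash function drawn from an
# almost-universal_2 family; all parameters are illustrative.
import random

P = 2**61 - 1      # a Mersenne prime larger than any message value
M = 2**16          # tag space

def keygen(rng):
    # The shared secret is a random member (a, b) of the family.
    return rng.randrange(1, P), rng.randrange(P)

def tag(msg, key):
    a, b = key
    return ((a * msg + b) % P) % M

rng = random.Random(1)
key = keygen(rng)
t = tag(123456789, key)

# A forger who alters the message without knowing (a, b) matches the
# tag with probability about 1/M per attempt, regardless of compute power.
forgeries = sum(tag(123456789 + d, key) == t for d in range(1, 5001))
```

The security argument is information-theoretic: every key is used for only one message, so even an adversary with unlimited resources learns essentially nothing useful about (a, b) from a single (message, tag) pair.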
The bag model in language statistics
In-memory hash tables for accumulating text vocabularies