Inventory Estimation From Transactions via Hidden Markov Models

by

Nirav Bhan

B.Tech, Electrical Engineering, Indian Institute of Technology Bombay, 2013

Submitted to the MIT Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electrical Engineering at the Massachusetts Institute of Technology

September 2015

© 2015 Massachusetts Institute of Technology. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 26, 2015
Certified by: Devavrat Shah, Associate Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, EECS Committee on Graduate Students

Inventory Estimation From Transactions via Hidden Markov Models

by Nirav Bhan

Submitted to the MIT Department of Electrical Engineering and Computer Science on August 26, 2015, in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

Abstract

Our work solves the problem of inventory tracking in the retail industry using Hidden Markov Models. It has been observed that inventory records are extremely inaccurate in practice (cf. [1–4]). Among the reasons for this inaccuracy are unaccounted item losses due to theft, mishandling, and the like. Even more important are the sales lost due to a lack of items on the shelf, called stockout losses. In several industries, stockout is responsible for billions of dollars of lost sales each year (cf. [4]). In [5], it is estimated that 4% of annual sales are lost due to stockout, across a range of industries.

Traditional approaches to the inventory problem have been geared toward designing better inventory management practices that reduce or account for stock uncertainty. However, such strategies have had limited success in overcoming the effects of inaccurate inventory (cf. [1]). Thus, inventory tracking remains an important unsolved problem, and the work done in this thesis is a step toward solving it. Our solution follows a novel approach of estimating inventory from accurately available point-of-sale data; a similar approach appears in other recent work such as [1, 6, 7]. Our key idea is that when an item is in stockout, no sales are recorded. Thus, by viewing the sequence of sales as a time series, we can infer the periods during which stockout has occurred. In our work, we find that under appropriate assumptions, exact stock recovery is possible for all time.

To represent the evolution of inventory in a retail store, we use a Hidden Markov Model (HMM), along the lines of [6]. In the latter work, the authors show that an HMM-based framework, with Gibbs sampling for estimation, manages to recover stock well in practice. However, their methods are computationally expensive and do not possess any theoretical guarantees. In our work, we introduce a slightly different HMM to represent the inventory process, which we call the Sales-Refills model. For this model, we are able to determine the inventory level at all times, given enough data. Moreover, our recovery algorithms are easy to implement and computationally fast. We also derive sample complexity bounds which show that our methods are statistically efficient.

Our work also solves a related problem, viz. accurate demand forecasting in the presence of unobservable lost sales (cf. [8–10]).
The naive approach of computing a time-averaged sales rate underestimates the demand, as stockout may cause interested customers to leave without purchasing any items (cf. [8, 9]). By modelling the retail process explicitly in terms of sales and refills, our model achieves a natural decoupling of the true demand from the other parameters. By explicitly determining the instants where stock is empty, we obtain a consistent estimate of the demand.

Our work also has consequences for HMM learning. In this thesis, we propose an HMM which is learnable using simple and highly efficient algorithms. This is not a usual property of HMMs; indeed, several problems on HMMs are known to be hard (cf. [11–13]). The learnability of our HMM can be considered a consequence of the following property: we have a few parameters which vary over a finite range, and for each value of the parameters we can identify a signature property of the observation sequence. For the Sales-Refills model, the signature property is the location of the longer inter-sale intervals in the observation sequence. This simple idea may lead to practically useful HMMs, as exemplified by our work.

Thesis Supervisor: Devavrat Shah
Title: Associate Professor

Acknowledgements

I would firstly like to express my utmost gratitude to my advisor, Devavrat Shah. His guidance has been instrumental in my research. His enthusiasm, wisdom and patience are remarkable, and I hope that some of these qualities rub off on me during the course of our work together. In addition, his detailed guidance with writing has allowed me to express my thoughts more clearly.

Secondly, I would like to express my thanks to my friends. I wish to thank Pritish, Anuran, Gauri, Sai, Ganesh and many others who have made my stay at MIT pleasurable.

Thirdly, I would like to thank my past advisors, Prof. Vivek Borkar and Prof. Volkan Cevher. The positive research experiences I had with them are responsible for my pursuing an academic career.

Lastly, I am deeply thankful to my parents. Their support has been key to my ability to focus on work undisturbed. Although leaving home for the first time is not easy, for me or for them, they have shown a great deal of support and understanding, allowing me to do what I love.

Contents

1 Introduction
  1.1 Background: Inventory Inaccuracy in Retail Industry
  1.2 Our Approach
    1.2.1 Literature Survey
    1.2.2 Our contributions
  1.3 A spectral algorithm for learning HMM
  1.4 Formulation as a Constrained HMM Learning Problem
  1.5 Organization of Thesis

2 Model Description and Estimation
  2.1 Definitions
  2.2 The Sales-Refills Model
    2.2.1 Overview
    2.2.2 Formal definition
  2.3 Estimation

3 Algorithm
  3.1 Estimating stock (or inventory)
    3.1.1 Estimating C, X0 mod C
    3.1.2 Estimating hidden stock
  3.2 Estimating q and p
  3.3 Computational efficiency
  3.4 General C
    3.4.1 Idea
    3.4.2 General estimator for C, U

4 Proof of Estimation
  4.1 Invariance modulo C
  4.2 Correctness of $\hat{C}^T$ and $\hat{U}^T$
  4.3 Correctness of $\hat{X}_t^T$
  4.4 Correctness of $\hat{q}^T$
  4.5 Correctness of $\hat{p}^T$
  4.6 Correctness of $\bar{C}^T$, $\bar{U}^T$
  4.7 No-refill probability

5 Sample complexity bounds
  5.1 Error bounds for stock estimation
  5.2 Data sufficiency

6 Generalized and Noise-Augmented models
  6.1 Generalized Sales-Refills model
    6.1.1 Modifications
    6.1.2 Description
  6.2 Estimation of the Generalized Sales-Refills model
    6.2.1 Simplifying Assumptions
  6.3 Estimation with Uniform Aggregate Demand
    6.3.1 Estimation of stock and U
    6.3.2 Estimation of sales parameters
  6.4 Augmenting the Sales-Refills model with Noise
    6.4.1 Description of Noisy Sales-Refills Model
    6.4.2 Equivalence of hidden sales and source uncertainty viewpoints
    6.4.3 Estimation with Noise

7 Conclusion
  7.1 Summary of contributions
  7.2 Significance and Future Work

List of Figures

2.1 Shifted order-2-hmm, representing the hidden state variables $X_t$'s and observation variables $Y_t$'s. Notice that the observation sequence has shift 1.
2.2 State transition diagram for the stock MC. Sales occur in all states besides 0, while refills only occur in states $\le R$.

Chapter 1

Introduction

1.1 Background: Inventory Inaccuracy in Retail Industry

Inventory inaccuracy is an important operational challenge in the modern retail industry, cf. [1–4, 6, 8]. In spite of the availability of modern high-tech tools and methods, many store managers do not know their inventory accurately. For example, in a study of 370,000 Stock Keeping Units (SKUs), the authors of [2] found that the recorded inventory of over 65% of the SKUs did not match the physical inventory. Moreover, about 20–25% of the SKUs differed from the recorded inventory by six or more items. Such inaccurate records can have a serious negative impact on operational decisions in the store. For example, stores rely on the inventory record to order fresh batches of items, replenishing sold-out products on the shelf.
If such orders are not made in a timely manner, they may lead to a 'stock-out' situation, where sales are halted due to a lack of items on the shelf. Clearly, stock-out leads to significant losses for the store in terms of missed sales. This problem is even more acute for stores that rely on automated inventory management systems. In [4], it is estimated that the revenue lost due to stockout equals 4% of annual sales, averaged across various industries.

There are several explanations for why the inventory records in stores are not accurate. The most important reason is widely believed to be stock loss (cf. [4]). Stock loss (also known as inventory shrinkage) refers to the loss of items in the store due to employee theft, shoplifting, mishandling, damage, etc. By its nature, stock loss is unpredictable and hence very difficult to account for. It can also mean significant economic losses for the store. In [4], it is noted that stores lose 1.5% to 2% of their annual sales to stock loss. The authors also note that for certain goods like batteries and razor blades, which are prone to theft due to their small size and high value, the stock loss can be as high as 5 to 8%. Unfortunately, stores are often ill-equipped to combat these effects, precisely because of their poor inventory records. Moreover, although stock loss is an important economic issue in its own right, its consequence, viz. inventory inaccuracy, can have an even bigger impact, due to the stock-out phenomenon noted above. In [3], the authors explain how inventory inaccuracy, while a serious issue for all industries, can be especially problematic in industries where there is close collaboration between the various levels of the supply chain, and where demand uncertainty is lower and lead times are shorter.

All of the above shows that inventory record inaccuracy is an important issue, and one that traditional methods have failed to resolve. Recently, there have been attempts to use machine learning methods to solve these problems. These methods are based on two key observations:

1. In all modern stores, sales are recorded in a highly detailed, accurate and usable form.
2. When stockout happens, no sales are recorded in the system for a long period of time.

Since the inventory must be regarded as not fully known, it is customary to replace hard knowledge of the inventory with a probabilistic representation of it. For example, in [1], the authors use a Bayesian inventory record, which is a vector of probabilities over the various inventory levels. The beliefs are updated from one time period to the next based on three factors: the belief in the previous period, the sales recorded, and the time since the last sale. The authors also provide evidence via simulations that their idea is useful in practice.

There is another important problem in retail that has attracted a lot of attention: estimating the aggregate demand for supplies. The obvious method for estimating demand is to divide the number of sales in a large time period by the total time. This gives the average number of sales per unit time, which is assumed to equal the demand. The problem with this approach is that it ignores the phenomenon of stockout, viz. that no sales occur when shelves are empty. Hence, the number of sales per unit time strictly underestimates the actual demand, and the size of this gap is difficult to guess. Statistical and machine learning techniques have been applied to this task for quite some time.
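To make this gap concrete, here is a small simulation written for this discussion (it is our own illustration, not drawn from the cited works, and all parameter values are arbitrary). It runs a stockout-prone inventory and compares the naive time-averaged sales rate with the true purchase probability:

```python
import random

random.seed(0)

q, p = 0.3, 0.05   # purchase and refill probabilities (arbitrary illustrative values)
C, R = 20, 2       # refill quantity and refill threshold
T = 200_000        # number of observation periods

stock, sales = C, 0
for _ in range(T):
    sold = stock > 0 and random.random() < q    # no sale is possible at zero stock
    refill = stock <= R and random.random() < p
    sales += sold
    stock += (C if refill else 0) - (1 if sold else 0)

print(f"true demand rate         : {q:.3f}")
print(f"naive sales-rate estimate: {sales / T:.3f}")  # strictly smaller than q
```

Because the item spends a non-trivial fraction of time at zero stock, the naive estimate lands visibly below q, and the gap grows as refills become rarer (smaller p).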
Works such as [8–10] address this problem using quite elaborate models. In our work, we present a relatively simple HMM which addresses both of the above problems, viz. inventory inaccuracy and demand estimation.

1.2 Our Approach

A common feature of the two problems above is that they both rely on identifying stockout periods. It can therefore be imagined that a single mechanism could solve both problems. Our work in this thesis can be regarded as an important step in this direction. Our work solves the problem of inventory inaccuracy in a novel way, inspired by [6]. Following their approach, we give a Hidden Markov Model (HMM) description of the retail system, which we call the Sales-Refills model. The hidden state in our HMM is precisely the inventory level. Using accurately recorded sales data, we show that this hidden state can be estimated consistently for all time. The estimation of the hidden state is performed by identifying the stockout periods in a clever way. This enables us to solve both the problem of inventory inaccuracy and that of demand estimation in retail.

As an illustrative example, consider a candy shop using a point-of-sale (POS) system like Square, Inc. The candy shop sells one type of candy. Whenever a customer comes to the shop, if the number of candies is not 0 (i.e. stockout has not happened), then the customer may buy a candy, resulting in a transaction being recorded in the POS. On the other hand, if no candies are available, the customer cannot purchase any candy. Of course, this may be confused with the fact that a lack of arriving customers does not lead to recorded transactions either. Thus, observing transactions provides some information about the hidden state (the number of candies, i.e. the inventory), but it is not clear whether we can reconstruct the exact value of the inventory from transactions alone.

To aid the estimation of the hidden state (inventory), we take note of the following standard practice in the retail industry, called the (s, S) policy (cf. [14, 15]): every time the inventory level goes below a threshold s ≥ 0, a refill is ordered to bring the inventory to a level of (at least) S > s. With the knowledge of this practice, along with the model that customers arrive according to a Bernoulli process in a discrete-time system (cf. [8]) and make purchasing decisions independently, the resulting setup is that of a Hidden Markov Model: the hidden state is the inventory, which evolves according to a Markov chain, and the observations, i.e. the transactions, depend on the hidden state.

1.2.1 Literature Survey

The problem of estimating inventory and demand from observations has been considered in earlier work [1, 6, 7, 9, 10]. Our work is inspired by that of [6], where the authors introduce a (second order) Hidden Markov Model for the question of inventory estimation from transaction data that we briefly alluded to above. In [6], the authors show that by using Gibbs sampling to estimate the marginals, as in the usual method for HMM estimation, it is feasible to estimate inventory accurately in practical settings. However, their result lacks theoretical guarantees. We also take note of [9], which utilizes the Baum-Welch (or EM) algorithm for the estimation of demand in a similar context.

In general, this problem can be seen as the estimation of the hidden state of an HMM. It is well known that, in general, estimation of the hidden state is not possible from observations (cf. [11]). However, under specific model-dependent conditions, it may be feasible to recover the hidden state from observations.
For example, over the years various approaches have been developed to learn mixture distributions (the simplest form of HMM) [16–20]. In the recent and not-so-recent past, various sufficient conditions for the identifiability of HMMs have been identified, cf. [21–23]. None of these approaches seems to provide a meaningful answer to the HMM estimation problem considered here.

1.2.2 Our contributions

We provide a simple estimation procedure for which we establish that all the inventory states can be estimated accurately with probability 1 after enough observations are made. We derive a precise sample complexity bound on how many samples are needed to estimate the inventory at all times with high enough probability, and we find the procedure to be efficient. Precisely, to succeed with probability at least $1 - \varepsilon$, it takes $O(1/\varepsilon^2)$ samples. (This is a loose bound; we actually get close to $O(1/\varepsilon)$ in chapter 5, and we believe that stronger bounds should be possible for our model.)

The key insight behind our ability to learn the hidden state of the HMM is as follows. In the HMM of interest, we establish that the primary source of uncertainty is the unknown value of the inventory in the beginning (the initial state), together with the refill amount. That is, if the initial state and refill amount are known, then using the transaction information we can reconstruct the value of the inventory at all subsequent times. Here we utilize the fact that inventory refills happen according to an (s, S) policy (with unknown S). If the maximum value of S is known (i.e. an upper bound on S), then the effective uncertainty in the system is captured by finitely many different values (combinations of the initial state value and the value of S). With each of these finitely many options, we associate a verifiable condition that is asymptotically true if and only if that particular option corresponds to the true system uncertainty. This leads to a consistent estimator of the unknown inventory. The finite sample bound follows by showing that this asymptotic property holds with high probability after few samples. Thus we obtain an efficient estimator.

In addition to solving the specific problem of mitigating inventory inaccuracies using transaction data, our approach identifies a class of Hidden Markov Models that are learnable. This class of models may be simple, but, as exemplified by the above setting, it could be of relevance in practice.

1.3 A spectral algorithm for learning HMM

We give a brief summary of the work in [21], and explain why it does not apply to our current problem. In this work, the authors propose a method for learning comb-like graphical models, which can be easily extended to learning HMMs. They provide a PAC algorithm for learning a set of probabilities such that the output distribution is close to the empirical output distribution (cf. Theorem 5 in [21]). We paraphrase this theorem below for completeness.

Theorem A (From [21], Theorem 5). Let $\phi_d, \kappa_\pi > 0$ be constants. Let $C$ be a finite set and $\mathcal{M}_n$ denote the collection of $|C| \times |C|$ transition matrices $P$ with $1 \ge |\det P| \ge n^{-\phi_d}$. Then there exists a PAC-learning algorithm for $(T_C^3(n) \otimes \mathcal{M}_n, n^{\kappa_\pi})$. The running time and sample complexity of the algorithm are $\mathrm{poly}(n, |C|, 1/\varepsilon, 1/\delta)$.

Note: The object $(T_C^3(n) \otimes \mathcal{M}_n, n^{\kappa_\pi})$ refers to the collection of all possible caterpillar trees on $n$ leaves, endowed with transition matrices from $\mathcal{M}_n$, for which the stationary probability at every node for every state exceeds $n^{-\phi_d}$.

Strictly speaking, the above method only learns finite-size, HMM-like graphical models.
To extend this method to HMMs, we use a result from [24], which allows us to characterize the infinite output distribution of an HMM by a small block of the output distribution. Precisely, their result is as follows:

Theorem B (From [24], Theorem 3). Let $\Theta_{(d,k)}$ be the class of all HMMs with output alphabet size $d$ and hidden state size $k$. There exists a measure zero set $E \subset \Theta_{(d,k)}$ such that for all output processes generated by HMMs in the set $\Theta_{(d,k)} \setminus E$, the information in $\mathcal{P}^{(N)}$ is sufficient for finding the minimal HMM realization, for $N = 2n + 1$, with $n > 2\lceil \log_d(k) \rceil$.

Combining the results of theorems A and B, we obtain a method for learning an HMM. This can be done as follows: instead of learning the entire (infinite) output distribution of the HMM, we learn a finite block of our HMM of size $\ell = 4\lceil \log_d k \rceil + 1$. We can obtain i.i.d. samples for this finite graphical model from the infinite output stream of our HMM by collecting blocks of $\ell$ consecutive samples, leaving a gap of several observations after each block. Assuming the HMM is ergodic, which is always true in practice, we shall obtain samples from the finite-length output distribution that are nearly independent. These samples can be used to learn the transition probabilities of our graphical model according to theorem A. By theorem B, the same probabilities will correctly reproduce the infinite output distribution of the HMM. (Note that the finite graphical model could, in principle, have different transition and emission probabilities on each of its edges. However, one expects the learnt probabilities to be equal in the limit, since our observations are generated from a time-homogeneous HMM.)
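The block-collection step just described is straightforward to sketch in code. The following is our own illustration, not an implementation from [21] or [24]; the `gap` parameter is an assumption introduced here and should be several mixing times of the chain in practice:

```python
import math
from typing import List, Sequence, Tuple

def block_samples(stream: Sequence[int], d: int, k: int, gap: int) -> List[Tuple[int, ...]]:
    """Collect blocks of l = 4*ceil(log_d(k)) + 1 consecutive observations,
    skipping `gap` observations between blocks so that, for an ergodic HMM,
    successive blocks are nearly independent."""
    l = 4 * math.ceil(math.log(k, d)) + 1
    samples, i = [], 0
    while i + l <= len(stream):
        samples.append(tuple(stream[i:i + l]))
        i += l + gap  # the gap trades sample count for independence
    return samples
```

The near-independence error this shortcut introduces is exactly the adaptation issue raised in limitation 3 below.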
Limitations: While the above result constitutes a useful step toward solving the hard problem of HMM learning, it has several limitations. One consequence of these limitations is that the method is ineffective for our Sales-Refills model. We describe the limitations below.

1. Homogeneity of Alphabet Size: The result of Theorem A requires all nodes in the graphical model to have the same state space. In particular, this means that we can only learn HMMs for which the number of hidden states equals the alphabet size of the observed variable. Hence, this method cannot be used for our Sales-Refills HMM, since the hidden state takes $C + R + 1$ different values, while the observation variable is binary.

2. Spectral Conditions on Matrices and Stationary Probabilities: Theorem A also requires two important conditions to be satisfied, viz.
   - All probability matrices (i.e. both transition and emission matrices) must have determinant bounded polynomially away from 0.
   - All nodes must take on each possible state with probability that is polynomially bounded away from 0. In other words, there should be no 'rare' states at any node.
   These constraints restrict the class of models to which the methods can be applied. However, there is a reason why such constraints are needed, viz. to avoid hard instances such as the 'Parity Learning with Noise' HMM described in [21].

3. Independence of Samples: Theorem A uses the framework of PAC-learning, and hence it assumes the existence of an oracle which provides i.i.d. samples from the output distribution of a block graphical model. In practice, we are likely to have not i.i.d. samples but a single, infinitely long, continuous data stream from the output of an HMM. Our workaround is to take block samples at large intervals and assume these to be almost i.i.d. (assuming ergodicity). While one expects this method to work in practice, the theorems would have to be adapted to work with near-independence instead of independence, and the error rates would have to be recalculated.

4. Interpretability/Learning a Constrained System: We explore yet another issue, which is not often discussed in relation to HMM learning, although it is quite important. In our Sales-Refills model, we are not trying to learn an HMM from scratch, but rather trying to learn a model for which some information about the parameters is already available (see section 1.4 for a detailed description). There does not seem to be any easy way to adapt the results of theorem A to such a constrained HMM learning problem. If we ignore these constraints, the model may not even be identifiable, as indeed the Sales-Refills model is not. Although the PAC-learning framework does not require identifiability, and hence in principle it may be possible to apply theorem A to such an HMM, the resulting parameters are unlikely to have any meaning (since they do not satisfy the constraints) and hence may not be 'interpretable'.

5. Measure-Zero Set Issues: The result of [24], viz. theorem B, holds for all HMMs except those in a measure zero set in the parameter space. Unfortunately, whether measure zero sets are 'negligible' in practice is a rather philosophical question. Since our linearly constrained HMM model has many parameters forcibly set to 0, it is not easy to argue that our HMM lies outside the measure zero set. In case theorem B is not applicable, there exist other results for sufficiently large block sizes, but these blocks may be much larger (polynomial rather than logarithmic in size) and hence may not be very efficient.

1.4 Formulation as a Constrained HMM Learning Problem

The HMM problem we are trying to solve can be described as learning the parameters of the model given some partial knowledge about them. For our particular HMM, these parameters are the initial state, the (hidden state) transition probabilities, and the emission probabilities. The partial knowledge available to us can be regarded as linear constraints imposed upon the probability matrices. In our work, we also show that it is possible to learn a high-level parameter $C$ which influences the output distribution in a more complex way. To present our problem in a more general framework, we simplify it here by assuming $C$ is known. We also make one diversion from our Sales-Refills model in what follows: we make sale and refill events exclusive rather than independent. (In this case, the transition probabilities at low stock, i.e. below $R$, become ternary, divided into $p$, $q$, and $1 - (p + q)$. The reader can verify that this model is no more complex than the model described in chapter 2, and can be solved by the same methods.)

The HMM learning problem we solve can be regarded as a special case of the following general problem:

Definition 1 (HMM Learning with Linear Constraints). Given a partial HMM description, viz. a hidden state size $k$, an output alphabet size $d$, and constraint matrices $A_e, A_t, B_e, B_t$, where the transition and emission probability matrices $P_t$ and $P_e$ satisfy
$$A_e \cdot P_e \le B_e, \qquad A_t \cdot P_t \le B_t,$$
the problem is to find consistent estimators $\hat{P}_t^T$ and $\hat{P}_e^T$ based on $T$ observations, such that
$$(\hat{P}_t^T, \hat{P}_e^T) \to (P_t, P_e)$$
as $T \to \infty$.
Furthermore, we solve the problem 'efficiently' if our algorithm produces estimates of the stochastic matrices such that every element of each matrix is within distance $\varepsilon$ of its true value, with probability exceeding $1 - \delta$, in time upper bounded by a polynomial in $1/\delta$ and $1/\varepsilon$.

We will now show that our problem fits into this general framework. For this, we shall briefly describe our model from a purely mathematical point of view. For a more complete discussion, including its motivation from the retail industry, the reader is encouraged to read chapter 2.

Our HMM has hidden state taking values in the set $\{0, 1, 2, \ldots, C + R\}$, i.e. $C + R + 1$ states (remember that $C$ is known here). The observation alphabet is of size 2, i.e. binary. We have the following information about the transition and emission probabilities (a sketch that realizes these constraints as explicit matrices is given at the end of this section):

1. For any (hidden) state, the probability of moving to a state that is lower by more than 1 equals 0. That is,
$$\mathbb{P}(X_{t+1} < j - 1 \mid X_t = j) = 0, \quad \forall t, j.$$

2. For any non-zero state, the probability of moving to the state exactly 1 lower is a constant (i.e. independent of the state). That is,
$$\mathbb{P}(X_{t+1} = j - 1 \mid X_t = j) = \mathbb{P}(X_{t+1} = i - 1 \mid X_t = i), \quad \forall t,\ j \ge 1,\ i \ge 1.$$

3. For any state, the probability of moving to a bigger state equals 0, unless the bigger state is exactly $C$ larger and the current state is less than or equal to $R$. For transitions satisfying these two conditions, the transition probability is a constant. That is,
$$\mathbb{P}(X_{t+1} > j \mid X_t = j) = 0, \quad \forall t,\ j > R,$$
$$\mathbb{P}(X_{t+1} = j \mid X_t = i) = 0, \quad \forall t,\ i \le R,\ j > i,\ j \ne i + C,$$
$$\mathbb{P}(X_{t+1} = j + C \mid X_t = j) = \mathbb{P}(X_{t+1} = i + C \mid X_t = i), \quad \forall t,\ i \le R,\ j \le R.$$

4. For any non-zero state, the probability of emitting a 1 is a constant. That is,
$$\mathbb{P}(Y_t = 1 \mid X_t = j) = \mathbb{P}(Y_t = 1 \mid X_t = i), \quad \forall t,\ i \ge 1,\ j \ge 1.$$

5. If the hidden state equals zero, the only symbol that can be emitted is a zero. That is,
$$\mathbb{P}(Y_t = 0 \mid X_t = 0) = 1, \quad \forall t.$$

All of the above facts can be encoded as linear constraints on the transition and emission matrices. Hence, the HMM problem we solve is a special case of HMM learning with linear constraints. In fact, our work solves a somewhat more general problem than the one described above, since it allows us to find not just the transition and emission probabilities, but also the starting state $X_0$ and the high-level parameter $C$.
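To make the constraint structure concrete, the following sketch (our own illustration, with arbitrary parameter values; it uses chapter 2's state space {0, ..., C+R} and the exclusive sale/refill variant, so it assumes p + q ≤ 1) constructs a transition matrix and an emission matrix satisfying constraints 1–5:

```python
import numpy as np

def sales_refills_matrices(C: int, R: int, p: float, q: float):
    """Transition matrix Pt and emission matrix Pe for the Sales-Refills HMM
    (exclusive sale/refill variant), on states 0..C+R. Illustrative sketch
    only; not the thesis's reference implementation."""
    n = C + R + 1
    Pt = np.zeros((n, n))
    for j in range(n):
        stay = 1.0
        if j >= 1:                 # constraint 2: down-by-1 probability is the constant q
            Pt[j, j - 1] = q
            stay -= q
        if j <= R:                 # constraint 3: up-by-C probability is the constant p
            Pt[j, j + C] = p
            stay -= p
        Pt[j, j] = stay            # constraints 1 and 3: every other move has probability 0
    Pe = np.zeros((n, 2))
    Pe[0, 0] = 1.0                 # constraint 5: state 0 emits 0 surely
    Pe[1:, 1] = q                  # constraint 4: constant emission probability (the sale prob)
    Pe[1:, 0] = 1.0 - q
    return Pt, Pe

Pt, Pe = sales_refills_matrices(C=5, R=2, p=0.1, q=0.3)
assert np.allclose(Pt.sum(axis=1), 1) and np.allclose(Pe.sum(axis=1), 1)
```

Any algorithm for the constrained learning problem of Definition 1 must recover matrices of exactly this shape.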
1.5 Organization of Thesis

Chapter 2 describes our HMM for the retail scenario, namely the Sales-Refills model. We also note down our key results for parameter estimation in this model, in the form of two theorems. Chapter 3 describes our estimation algorithm; we define estimators for each parameter mentioned in theorem 1. Chapter 4 gives the proof of correctness of our estimation algorithm: we argue for the consistency of our estimators by showing that all estimators defined in chapter 3 converge almost surely to the true quantities. Chapters 3 and 4 together constitute our proof of theorem 1. Chapter 5 describes our sample complexity bounds for stock estimation; we show that, in addition to being consistent, our estimators are also efficient in the sense of giving error bounds from polynomially many samples. This chapter gives the proof of theorem 2. Chapter 6 offers extensions of the Sales-Refills model, showing how our model can be learnt even under some generalizations. In addition, it outlines our idea for a Noisy Sales-Refills model, which would overcome the key limitations of the basic model. Chapter 7 summarizes our work, discusses its significance and gives pointers for future work.

Chapter 2

Model Description and Estimation

2.1 Definitions

To describe our mathematical model of the retail system, we shall first define a general order-k-hmm.

Definition 2 (Order-k-hmm). For any natural number $k$, we define an order-k-hmm to consist of two sequences of random variables $X_0, X_1, X_2, \ldots$ and $Y_0, Y_1, Y_2, \ldots$ such that the $X_i$'s form a Markov chain, and $Y_i$ is a function of $X_i, X_{i-1}, \ldots, X_{i-k+1}$ for all $i \in \mathbb{N}$.

We shall call $X_0, X_1, X_2, \ldots$ the sequence of hidden states, and $Y_0, Y_1, Y_2, \ldots$ the sequence of observations. Thus, we allow each observation to depend on the past $k$ states.

Definition 3 (Shifted order-k-hmm). Given an order-k-hmm with state sequence $X_0, X_1, X_2, \ldots$ and observation sequence $Y_0, Y_1, Y_2, \ldots$, we say that the observation sequence has shift $d$ if $Y_i$ is a function of $X_{d+i}, X_{d+i-1}, \ldots, X_{d+i-k+1}$, for all $i$.

Our HMM shall be an order-2-hmm with the observation sequence having shift 1. As a directed graphical model, this can be represented as in figure 2.1.

[Figure 2.1: Shifted order-2-hmm, representing the hidden state variables $X_t$'s and observation variables $Y_t$'s. Notice that the observation sequence has shift 1.]

2.2 The Sales-Refills Model

2.2.1 Overview

Our model describes how the inventory, or stock, of an item in a retail store evolves with time. The random variable $X_t$ denotes the stock at time $t$ (we assume there is a notion of discrete time; for more details, we refer the reader to [6]). We assume that the stock evolves according to a Markov process, i.e. $X_0, X_1, X_2, \ldots, X_T$ form a Markov chain (MC). Since the stock is not observable, this is a hidden MC. We also have transactions, or sales, which we denote by the random variable $Y_t$ at time $t$. Sales are observed in our model. Taken together, the sequences $X_0, X_1, \ldots, X_T$ and $Y_0, Y_1, \ldots, Y_T$ form a shifted order-2-hmm with shift 1 (figure 2.1).

Our model incorporates two ways in which stock can change: sales and refills. Whenever the stock is non-zero, there is a probability $q$ that a customer buys an item. For simplicity, we shall assume that a customer can only purchase one item at a time (some justification for this assumption is provided in [6], by suitably defining non-purchase events). Since sales are observed, we have $Y_t = 1$ if a sale occurs at time $t$. Sales decrease the current stock by 1. We also allow refill events. If the current stock is no larger than a number $R$, the store manager may decide to order a fresh batch of stock. We assume that a stock refill happens with probability $p$, and that it increases the stock by a fixed amount $C$. Clearly, $R < C$. Our refill policy is thus a variant of the popular (s, S) policy for inventory management (cf. [15]); in the formal treatment we use the term stock for inventory, sales for transactions, and $R$ and $C$ respectively for the $s$ and $S$ of this modified (s, S) policy. In our model, refills are not observed. We also allow sales and refills to occur independently of each other.

2.2.2 Formal definition

We will now formally describe our HMM as a generative model. We assume that $X_t \in \{0, 1, 2, \ldots, S\}$, where $S = C + R$ is the maximum stock value. The initial stock value, viz. $X_0$, is a parameter of our model. Knowing $X_t$, we show how to generate $X_{t+1}$ as well as $Y_t$. For this, we define two variables $\Delta_{s,t}$ and $\Delta_{r,t}$, representing the change in stock due to sales and refills respectively at time $t$:

(a) $\Delta_{s,t}$: sales decrement.
   i. When $X_t > 0$: $\Delta_{s,t} = 1$ w.p. $q$, and $\Delta_{s,t} = 0$ w.p. $1 - q$.
   ii. When $X_t = 0$: $\Delta_{s,t} = 0$.

(b) $\Delta_{r,t}$: refill increment.
   i. When $X_t \le R$: $\Delta_{r,t} = C$ w.p. $p$, and $\Delta_{r,t} = 0$ w.p. $1 - p$.
   ii. When $X_t > R$: $\Delta_{r,t} = 0$.

We now define $X_{t+1}$ and $Y_t$ as follows:
$$Y_t \triangleq \Delta_{s,t}, \qquad X_{t+1} \triangleq X_t + \Delta_{r,t} - \Delta_{s,t}.$$

The above tells us how, starting from a particular $X_0$, we can generate a sequence of state variables $X_1, X_2, \ldots, X_T$ and a sequence of observations $Y_0, Y_1, \ldots, Y_T$. This completes our description of the model. For clarity, the possible transitions of the stock MC are shown diagrammatically in figure 2.2.

[Figure 2.2: State transition diagram for the stock MC. Sales occur in all states besides 0, while refills only occur in states $\le R$.]
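The generative description above translates directly into code. The following minimal sketch (our own; the parameter values in the usage line are arbitrary) produces the hidden stock sequence and the observed sales:

```python
import random

def simulate_sales_refills(X0, C, R, p, q, T, rng):
    """Generate hidden stock X_0..X_T and observations Y_0..Y_{T-1}
    according to the Sales-Refills model of section 2.2.2."""
    X, Y = [X0], []
    for _ in range(T):
        Xt = X[-1]
        d_s = 1 if Xt > 0 and rng.random() < q else 0   # sales decrement
        d_r = C if Xt <= R and rng.random() < p else 0  # refill increment (drawn independently)
        Y.append(d_s)               # Y_t = Delta_{s,t}
        X.append(Xt + d_r - d_s)    # X_{t+1} = X_t + Delta_{r,t} - Delta_{s,t}
    return X, Y

X, Y = simulate_sales_refills(X0=12, C=10, R=2, p=0.2, q=0.4, T=1000,
                              rng=random.Random(1))
```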
2.3 Estimation

For the Sales-Refills model of section 2.2.2, assuming that we know $R$, we claim that it is possible to obtain consistent estimates of all parameters of interest. Further, this allows us to estimate the hidden stock accurately at all times. We formalize this claim in two theorems. For the sample complexity bounds, we shall make an additional assumption about $C$, namely $C \in \{K, K+1, \ldots, 2K-1\}$ for some known integer $K$. We call this assumption 'restricted C'; we describe how it can be removed in section 3.4.

Theorem 1 (Consistent Estimation). Let $R$ be known. Then, there exist estimators $\hat{q}^T$, $\hat{C}^T$, $\hat{X}_0^T$, $\hat{p}^T$ and $\hat{X}_t^T$, based on the observations $Y_0^T = (Y_0, Y_1, \ldots, Y_T)$, such that, as $T \to \infty$,
$$(\hat{q}^T, \hat{C}^T, \hat{X}_0^T, \hat{p}^T, \hat{X}_t^T) \xrightarrow{a.s.} (q, C, X_0 \bmod C, p, X_t \bmod C). \tag{2.1}$$
That is, our estimators converge almost surely. (Here $a_T \xrightarrow{a.s.} a$ as $T \to \infty$ means that the random variable $a_T$ converges to $a$ with probability 1 as $T \to \infty$.)

The result states that there exist consistent model estimators. Next, we discuss the finite sample error bound.

Theorem 2 (Sample Complexity). Let $R$ be known, and let $C$ satisfy the above stated restriction that $C \in \{K, \ldots, 2K-1\}$ for a known value of $K$. Then, for any $\varepsilon \in (0, 1)$,
$$\sup_{t \in \{0, \ldots, T\}} \mathbb{P}\left( |(\hat{X}_t - X_t) \bmod C| > 0 \right) \le \varepsilon,$$
for all $T \ge \max\left( \frac{4B^2}{\varepsilon^2}, \frac{4}{\varepsilon} \right)$. The constant $B$ is defined explicitly in terms of the model parameters, via equations 5.20, 5.13, 5.14, and 4.35.

Theorem 1 constitutes the main result of this thesis, as it shows that our model is learnable, while theorem 2 shows that the estimation procedure is efficient. As a consequence, whenever the stock lies strictly between $R$ and $C$, our algorithm yields the exact stock. On the other hand, if the stock is too low, i.e. smaller than $R$, there is no way for our algorithm, or indeed any algorithm, to discern whether a refill has occurred. It should be noted that in practice one expects $R \ll C$, and hence this uncertainty is of minimal relevance.

We prove the above theorems by giving explicit algorithms for parameter recovery in chapter 3. The proof of correctness of these algorithms is given in chapter 4, and constitutes the proof of theorem 1. The proof of the finite-sample error bounds (i.e. theorem 2) is described in chapter 5.

Chapter 3

Algorithm

We describe our methods for stock and parameter recovery.

3.1 Estimating stock (or inventory)

In order to estimate the stock correctly, we must first estimate the parameters $C$ and $X_0 \bmod C$. Then, we use these values to recover the hidden stock values.

3.1.1 Estimating C, X0 mod C

Although $X_0$ is a parameter of our model, we shall only be interested in the value of $X_0 \bmod C$, which we denote by $U$ for convenience.
That is,
$$U \triangleq X_0 \bmod C. \tag{3.1}$$

Define the set $S_{c,i}^T$ as the set of time indices $t$ such that the number of observed sales strictly before time $t$ is congruent to $i$ modulo $c$. These sets are defined for each $c \in \{K, \ldots, 2K-1\}$ and, for a fixed $c$, for each $i \in \{0, 1, \ldots, c-1\}$. Thus, we have the following sets, indexed by the parameters $c$ and $i$:
$$S_{c,i}^T \triangleq \left\{ t \in \{1, \ldots, T\} : \sum_{j=0}^{t-1} Y_j \equiv i \bmod c \right\} \quad \forall c, i. \tag{3.2}$$

Define the number of sale events for $c, i$ up to time $T$ as the number of time instants $t$ in $S_{c,i}^T$ such that $Y_t > 0$. In other words,
$$E_{c,i}^T \triangleq \sum_{t \in S_{c,i}^T} \mathbf{1}(Y_t = 1) \quad \forall c, i. \tag{3.3}$$

Define the average window length for $c, i$ up to time $T$:
$$L_{c,i}^T \triangleq \frac{|S_{c,i}^T|}{E_{c,i}^T}. \tag{3.4}$$

Define the maximizing indices:
$$(c_T^*, i_T^*) \triangleq \arg\max_{c,i} L_{c,i}^T. \tag{3.5}$$

Then $\hat{C}^T = c_T^*$ and $\hat{U}^T = i_T^*$ are our estimators for $C$ and $U$ respectively.

3.1.2 Estimating hidden stock

Using our estimators for $C$ and $U$, we can now easily recover the stock modulo $C$ at all times. We define an estimator $\hat{X}_t^T$ for $X_t \bmod C$ as follows:
$$\hat{X}_t^T = \left( \hat{U}^T - \sum_{j=0}^{t-1} Y_j \right) \bmod \hat{C}^T. \tag{3.6}$$

3.2 Estimating q and p

Estimating q: Having already described the method for estimating $C$ and $U$ in section 3.1.1, we now assume that these are known. In order to estimate $p$ and $q$ from the data, we define the following empirically obtained quantities:
$$T_{ST} \triangleq \frac{\sum_{i \in \{0,\ldots,C-1\},\, i \ne U} |S_{C,i}^T|}{\sum_{i \in \{0,\ldots,C-1\},\, i \ne U} E_{C,i}^T} \quad \text{(average sale time)} \tag{3.7}$$
$$T_{SZ} \triangleq \frac{|S_{C,U}^T|}{E_{C,U}^T} = L_{C,U}^T \quad \text{(average sale time at 'zero' stock)} \tag{3.8}$$

Our consistent estimator for $q$ is
$$\hat{q}^T \triangleq \frac{1}{T_{ST}}. \tag{3.9}$$

Estimating p: We now look into estimating the refill probability $p$. Let $\hat{q}^T$ be our estimate of $q$ obtained above. Define the function $f : [0,1] \times [0,1] \to \mathbb{R}_+$ as
$$f(p; q) = \frac{1}{p} \left( \frac{q - qp}{q - qp + p} \right)^R. \tag{3.10}$$
As argued in Lemma 5, for fixed $q$, $f$ is a strictly decreasing function of its first argument $p$. That is, it has a well-defined inverse $f^{-1}$ in its first argument. Using this property, we define
$$\hat{p}^T = f^{-1}\left( (T_{SZ} - T_{ST})_+ ;\ \hat{q}^T \right). \tag{3.11}$$

3.3 Computational efficiency

Our estimation algorithms, due to their simplicity, are computationally very efficient. The most expensive step is computing $|S_{c,i}^T|$ and $E_{c,i}^T$ for each $c$ and $i$. By organizing the computations carefully, we can perform all the computations for one value of $c$ in a single pass over the observations. Hence, the total time required for computing all the values equals $O((\text{number of values of } c) \cdot T) = O(K \cdot T)$. Since $K \le C < 2K$, we can also write the time complexity as $O(C \cdot T)$.
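The estimators of sections 3.1 and 3.2 can be stated compactly in code. The sketch below is our own illustration for the restricted-C case; the bisection used to invert $f$ is our choice of numerical method (the thesis only uses the fact, shown in Lemma 5, that $f$ is strictly decreasing in $p$):

```python
def estimate_C_U(Y, K):
    """(C-hat, U-hat) of section 3.1.1: maximize L_{c,i} over c in {K,...,2K-1}
    and i in {0,...,c-1}. One pass over Y per value of c, as in section 3.3."""
    best, best_ci = -1.0, (K, 0)
    for c in range(K, 2 * K):
        sizes, sale_counts, cum = [0] * c, [0] * c, 0
        for t in range(1, len(Y)):
            cum = (cum + Y[t - 1]) % c   # (sum of Y_0..Y_{t-1}) mod c
            sizes[cum] += 1              # |S_{c,i}|  (eq. 3.2)
            sale_counts[cum] += Y[t]     # E_{c,i}    (eq. 3.3)
        for i in range(c):
            if sale_counts[i] and sizes[i] / sale_counts[i] > best:
                best, best_ci = sizes[i] / sale_counts[i], (c, i)  # eq. 3.4-3.5
    return best_ci

def estimate_stock(Y, C_hat, U_hat):
    """X-hat_t = (U-hat - sum_{j<t} Y_j) mod C-hat (eq. 3.6)."""
    out, cum = [], 0
    for t in range(len(Y)):
        out.append((U_hat - cum) % C_hat)
        cum += Y[t]
    return out

def estimate_q_p(Y, C_hat, U_hat, R):
    """q-hat = 1/T_ST (eq. 3.9) and p-hat = f^{-1}((T_SZ - T_ST)_+ ; q-hat)
    (eq. 3.11), with f inverted by bisection (our choice of method)."""
    sizes, sale_counts, cum = [0] * C_hat, [0] * C_hat, 0
    for t in range(1, len(Y)):
        cum = (cum + Y[t - 1]) % C_hat
        sizes[cum] += 1
        sale_counts[cum] += Y[t]
    T_ST = (sum(s for i, s in enumerate(sizes) if i != U_hat)
            / sum(e for i, e in enumerate(sale_counts) if i != U_hat))  # eq. 3.7
    T_SZ = sizes[U_hat] / sale_counts[U_hat]                            # eq. 3.8
    q = 1.0 / T_ST
    f = lambda p: (1 / p) * ((q - q * p) / (q - q * p + p)) ** R        # eq. 3.10
    target, lo, hi = max(T_SZ - T_ST, 0.0), 1e-9, 1 - 1e-9
    for _ in range(200):             # bisection works because f is decreasing in p
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if f(mid) < target else (mid, hi)
    return q, (lo + hi) / 2
```

Run on the output of the chapter 2 simulator with enough data, `estimate_C_U` recovers (C, U) and `estimate_stock` then reproduces X_t mod C at every t, in line with Theorem 1.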
3.4 General C

3.4.1 Idea

We extend the estimator described in section 3.1.1 to general $C$, not necessarily between $K$ and $2K$. In the general case, we assume that $C \in \{C_{min}, \ldots, C_{max}\}$, where $C_{min}$ and $C_{max}$ are two positive integers with $C_{max} \ge C_{min} \ge 2$. (It may be of interest from a theoretical standpoint to consider the case where $C_{max}$ is allowed to depend on $T$; however, we do not consider that scenario in the current work.)

Our observations in lemma 3 still hold. Namely, if we guess an incorrect value $C'$, the fraction of times we shall see a longer interval equals $\frac{\gcd(C, C')}{C}$. However, it is now possible that we choose a $C'$ which is a multiple of $C$, in which case this quantity equals 1. Thus, we are liable to confuse the true $C$ with its multiples, since these have the same expected window length. The way to resolve this is to notice that for large $T$, all multiples of $C$ will give 'approximately equal' window lengths, and this length will be the maximum among all $(c, i)$ indices. Thus, we can look at all $(c, i)$ pairs for which $L_{c,i}^T$ lies within a suitable distance $\delta$ of $\max_{c,i} L_{c,i}^T$. If we have sufficient data, all values of $c$ which attain such a near-maximum should be multiples of one smallest value, i.e. of the form $c_T^*, 2c_T^*, 3c_T^*$, etc. We then pick the smallest of these multiples, viz. $c_T^*$, as our estimator for $C$. For this value of $c$, there shall be exactly one value of $i$, viz. $i_T^*$, such that $L_{c_T^*, i_T^*}^T$ is close to $\max_{c,i} L_{c,i}^T$. We pick $i_T^*$ as our estimator for $U$.

3.4.2 General estimator for C, U

Assume that $C_{max}$ and $C_{min}$ are known and that the true $C$ lies between them. Our aim is to determine the true $C$, along with the corresponding $X_0 \bmod C$ (i.e. $U$). As in section 3.1.1, define the sets $S_{c,i}^T$ for each $c \in \{C_{min}, \ldots, C_{max}\}$ (note the range) and, for a fixed $c$, for each $i \in \{0, 1, \ldots, c-1\}$. Similarly, define the sale events and average window lengths up to time $T$ for each choice of the parameters $c$ and $i$. Thus, we have the following quantities:
$$S_{c,i}^T \triangleq \left\{ t \in \{1, \ldots, T\} : \sum_{j=0}^{t-1} Y_j \equiv i \bmod c \right\} \quad \forall c, i \tag{3.12}$$
$$E_{c,i}^T \triangleq \sum_{t \in S_{c,i}^T} \mathbf{1}(Y_t = 1) \quad \forall c, i \tag{3.13}$$
$$L_{c,i}^T \triangleq \frac{|S_{c,i}^T|}{E_{c,i}^T} \quad \forall c, i \tag{3.14}$$

Define the maximizing quantity:
$$L_T^* \triangleq \max_{c,i} L_{c,i}^T \tag{3.15}$$

Also define the following min-max quantity:
$$\tilde{L}_T \triangleq \max_c \min_i L_{c,i}^T \tag{3.16}$$

Define the following set of candidate indices for $C$:
$$\mathcal{C}^T = \left\{ c \in \{C_{min}, \ldots, C_{max}\} : \max_i L_{c,i}^T \ge \frac{3}{4} L_T^* + \frac{1}{4} \tilde{L}_T \right\} \tag{3.17}$$

Now, we define the following indices:
$$\bar{C}^T \triangleq \min(\mathcal{C}^T) \tag{3.18}$$
$$\bar{U}^T \triangleq \arg\max_i L_{\bar{C}^T, i}^T \tag{3.19}$$

Then $\bar{C}^T$ and $\bar{U}^T$ are our general estimators for $C$ and $U$ respectively.
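A sketch of the general estimator follows (again our own illustration; it assumes enough data that every residue class contains at least one sale, and it reuses the single-pass tabulation from the earlier sketch):

```python
def estimate_C_U_general(Y, C_min, C_max):
    """(C-bar, U-bar) of section 3.4.2: threshold the window lengths at
    (3/4)L* + (1/4)L-tilde and take the smallest surviving candidate."""
    L = {}                                        # L[(c, i)] = L_{c,i} (eq. 3.14)
    for c in range(C_min, C_max + 1):
        sizes, sale_counts, cum = [0] * c, [0] * c, 0
        for t in range(1, len(Y)):
            cum = (cum + Y[t - 1]) % c
            sizes[cum] += 1
            sale_counts[cum] += Y[t]
        for i in range(c):
            if sale_counts[i]:
                L[(c, i)] = sizes[i] / sale_counts[i]
    L_star = max(L.values())                                             # eq. 3.15
    L_tilde = max(min(L.get((c, i), float('inf')) for i in range(c))
                  for c in range(C_min, C_max + 1))                      # eq. 3.16
    thresh = 0.75 * L_star + 0.25 * L_tilde                              # eq. 3.17
    cands = [c for c in range(C_min, C_max + 1)
             if max(L.get((c, i), 0.0) for i in range(c)) >= thresh]
    C_bar = min(cands)                                                   # eq. 3.18
    U_bar = max(range(C_bar), key=lambda i: L.get((C_bar, i), 0.0))      # eq. 3.19
    return C_bar, U_bar
```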
Chapter 4

Proof of Estimation

We now prove that the methods described in chapter 3 give correct answers. We start by proving an important lemma.

4.1 Invariance modulo C

We shall prove the following lemma, which states that given the true $C$ and $U$, it is possible to determine the stock modulo $C$ at all times. This lemma captures the observability of our model.

Lemma 1. In a Sales-Refills model, let $X_0$ represent the starting state, $X_t$ the hidden state at time $t$, $C$ the refill quantity, and let $U \triangleq X_0 \bmod C$. Then the following congruence relation holds modulo $C$:
$$X_t \equiv U - \sum_{0 \le i \le t-1} Y_i \pmod{C} \quad \forall t \in \{1, 2, \ldots, T\} \tag{4.1}$$

Proof. Intuitively, the value of the hidden state modulo $C$ can change in only two ways: by a sale or by a refill. Since sales are observed, we can simply subtract them out to get the updated value of $X$. Although refills are unobserved, they do not affect the value of $X \bmod C$. Hence, it is possible to determine the value of $X_t \bmod C$ from the observations alone. To prove this formally, we use the definitions of $X_t$ and $Y_t$:
$$X_{t+1} = X_t + \Delta_{r,t} - \Delta_{s,t} \quad \text{(by definition)}, \qquad Y_t = \Delta_{s,t} \quad \text{(by definition)}.$$
Since $\Delta_{r,t} \in \{0, C\}$, we have $\Delta_{r,t} \equiv 0 \pmod{C}$ for all $t$. Therefore
$$X_{t+1} \equiv X_t - \Delta_{s,t} \equiv X_t - Y_t \pmod{C}.$$
Adding up these congruences for $0 \le t \le t_0 - 1$, we get
$$X_{t_0} \equiv X_0 - \sum_{0 \le t \le t_0 - 1} Y_t \pmod{C}, \quad \text{i.e.} \quad X_t \equiv U - \sum_{0 \le i \le t-1} Y_i \pmod{C}. \qquad \square$$

4.2 Correctness of $\hat{C}^T$ and $\hat{U}^T$

We first prove a few lemmas regarding the quantities defined in section 3.1.1. To that end, define
$$P_{NR} \equiv P_{NR}(p, q) = \left( \frac{q - qp}{q + p - qp} \right)^R. \tag{4.2}$$

Lemma 2. Let $L_{c,i}$ be defined as in equations (3.2)–(3.4). Then,
$$\mathbb{E}[L_{C,U}] = \frac{1}{q} + \frac{P_{NR}}{p}, \tag{4.3}$$
$$\mathbb{E}[L_{C,i}] = \frac{1}{q}, \quad \forall i \in \{0, 1, \ldots, C-1\} \setminus \{U\}. \tag{4.4}$$

Proof. From definition (3.2) and lemma 1, we can rewrite $S_{C,i}^T$ as
$$S_{C,i}^T = \{ t \in \{1, 2, \ldots, T\} : X_t \equiv U - i \pmod{C} \}.$$
Thus, $S_{C,U}^T$ represents the set of times when the stock equals 0 mod $C$, and hence contains all instants when the stock is empty (or equal to $C$). Now, $L_{C,i}$ is the average window length in $S_{C,i}^T$, and represents the average length of time we need to wait in order to observe a sale (immediately after the previous sale). Hence if $i \ne U$, then the stock is always non-empty, so $L_{C,i}^T$ is the average time to a sale from a non-empty stock. Since the sale time is a geometric random variable with success probability $q$, we get
$$\mathbb{E}[L_{C,i}^T] = \frac{1}{q}, \quad \forall i \ne U.$$
The distribution for $i = U$ is more complex. In our refill policy, we allow for the possibility that the stock has been refilled before reaching 0; we call this an 'early refill'. Thus, if we observe the stock value to be 0 modulo $C$, the true stock could be either 0 or $C$, and we need to treat these cases separately. We define the no-refill probability $P_{NR}$ in section 4.7 to be the long-term probability of the event that, when the stock attains a value congruent to 0 modulo $C$ immediately after a sale, the true stock is actually 0. When this is true, we need to wait for a refill before we can observe the next sale. Thus, with probability $P_{NR}$, the window length is a sum of two geometric random variables, one of which has mean $\frac{1}{p}$ and the other mean $\frac{1}{q}$. With probability $1 - P_{NR}$, the stock was refilled before reaching 0, so the window length is a geometric random variable with mean $\frac{1}{q}$. Hence, we can write:
$$\text{Average window length in } S_{C,U}^T = (\text{average sale time}) + P_{NR} \cdot (\text{average refill time} \mid \text{no early refill}),$$
$$\therefore \mathbb{E}[L_{C,U}] = \frac{1}{q} + \frac{P_{NR}}{p}. \qquad \square$$

Lemma 3. For a Sales-Refills model of section 2.2, with the definitions described in lemma 2, the following holds:
$$\mathbb{E}[L_{c,i}] \le \frac{1}{q} + \frac{P_{NR}}{2p} \quad \forall c \ne C, \forall i. \tag{4.5}$$

Proof. In this case, we do not assume that our $c$ equals the true $C$, so it is more difficult to interpret the various sets $S_{c,i}^T$. To that end, think of the sequence of inter-sale gaps as representing a sequence of intervals. The interval lengths are random, with two possible means: either $d_1 = \frac{1}{q}$ or $d_2 = \frac{1}{q} + \frac{P_{NR}}{p}$. Clearly $d_2 > d_1$. The intervals with mean $d_2$ occur at a fixed period of once every $C$ intervals, and all other intervals have mean $d_1$. The leftmost interval with mean $d_2$ occurs at position $U \in \{0, 1, \ldots, C-1\}$. Our problem can be stated as figuring out the correct period $C$ and starting position $U$ of these longer intervals. We attempt to solve this by guessing a value of $C$ (which we call $c$) and a value of $U$ (which we call $i$). We then compute the average length of the intervals starting at $i$ and occurring at gaps of $c$ intervals, up to a total of $T$ intervals; we call this average length $L_{c,i}^T$.

Clearly, if we guess the correct values of $c$ and $i$, viz. $C$ and $U$ respectively, then we obtain intervals whose average length is $d_2$, the largest possible. On the other hand, if we guess an incorrect value of $c$, say $C'$, the question is what is the longest average interval length that can be observed. We would like to show that it is strictly less than $d_2$, so as to allow us to distinguish such an incorrect choice of $C'$ from the true value of $C$. Indeed, we shall establish this fact. Let $C' \ne C$. Then we shall observe the longer intervals at most a fraction $\frac{\gcd(C', C)}{C}$ of the time, for any value of $i$. We establish this fact next.

There are two possibilities for a particular choice of $C' \ne C$ and $i$. One possibility is that we never observe longer intervals, e.g.
if all longer intervals are even-numbered and we are only looking at odd-numbered intervals. In this case, the fraction of longer intervals is 0. Now, suppose that our choice of $C'$ and $i$ is such that we observe at least one longer interval. Then the next time we observe a longer interval is exactly when $C'$ and $C$ have a common multiple; this happens once in every $\mathrm{lcm}(C, C')$ intervals. Meanwhile, our average $L_{C',i}^T$ is computed by considering one interval in every $C'$ intervals. Thus, over the long term, the fraction of longer intervals in our average equals
$$\frac{1/\mathrm{lcm}(C, C')}{1/C'} = \frac{C'}{\mathrm{lcm}(C, C')} = \frac{\gcd(C, C')}{C}.$$
For any $C'$ which is not a multiple of $C$, this quantity is at most $\frac{1}{2}$. Now, in any set of the form $\{K, K+1, \ldots, 2K-1\}$, where $K$ is a positive integer, there are no two numbers such that one is a multiple of the other. Thus, at most half the intervals in our average shall have mean $d_2$. Hence, we can write:
$$\mathbb{E}[L_{c,i}^T] = d_1 \cdot (\text{fraction of intervals with mean } d_1) + d_2 \cdot (\text{fraction of intervals with mean } d_2)$$
$$= d_1 + (d_2 - d_1) \cdot (\text{fraction of intervals with mean } d_2)$$
$$\le d_1 + \frac{1}{2}(d_2 - d_1) \quad (\text{since } d_2 - d_1 > 0)$$
$$= \frac{1}{q} + \frac{P_{NR}}{2p} \quad \forall c \ne C, \forall i,$$
since by assumption $L_{c,i}^T$ is only defined for $c \in \{K, K+1, \ldots, 2K-1\}$. This proves the required result. $\square$

Combining lemmas 2 and 3, we are ready to prove the correctness of our estimators.

Proposition 1. Under the setup of Theorem 1, as $T \to \infty$,
$$(\hat{C}^T, \hat{U}^T) \xrightarrow{a.s.} (C, U). \tag{4.6}$$

Proof. From lemmas 2 and 3, we get:
$$\mathbb{E}[L_{c,i}^T] \begin{cases} = \frac{1}{q} + \frac{P_{NR}}{p}, & c = C,\ i = U, \\ \le \frac{1}{q} + \frac{P_{NR}}{2p}, & \text{otherwise.} \end{cases} \tag{4.7}$$
Moreover, since $L_{c,i}^T$ is an empirical average of integrable i.i.d. random variables, we have by the Strong Law of Large Numbers that $L_{c,i}^T \xrightarrow{a.s.} \mathbb{E}[L_{c,i}^T]$ for all $c, i$. Therefore
$$\arg\max_{c,i} L_{c,i}^T \xrightarrow{a.s.} \arg\max_{c,i} \mathbb{E}[L_{c,i}^T] = (C, U) \quad \text{(from equation 4.7)},$$
$$\therefore (\hat{C}^T, \hat{U}^T) \xrightarrow{a.s.} (C, U). \qquad \square \tag{4.8}$$

4.3 Correctness of $\hat{X}_t^T$

Proposition 2. Under the setup of Theorem 1, as $T \to \infty$,
$$\hat{X}_t^T \xrightarrow{a.s.} X_t \bmod C. \tag{4.9}$$

Proof. This follows directly from lemma 1 and the fact that $(\hat{C}^T, \hat{U}^T) \xrightarrow{a.s.} (C, U)$ as per Proposition 1. $\square$

4.4 Correctness of $\hat{q}^T$

Proposition 3. Under the setup of Theorem 1,
$$\hat{q}^T \xrightarrow{a.s.} q, \quad \text{as } T \to \infty. \tag{4.10}$$

Proof. The quantity $T_{ST}$ represents the average length of time for a sale to occur when the initial stock does not equal 0 modulo $C$. Since the stock must be non-empty in such instances, the length of time to a sale is a geometric random variable with success probability $q$. Hence, $\mathbb{E}[T_{ST}] = \frac{1}{q}$. Moreover, since $T_{ST}$ is the empirical average of integrable i.i.d. random variables, by the Strong Law of Large Numbers we have
$$T_{ST} \xrightarrow{a.s.} \mathbb{E}[T_{ST}] = \frac{1}{q}, \tag{4.11}$$
and hence
$$\hat{q}^T \triangleq \frac{1}{T_{ST}} \xrightarrow{a.s.} q. \qquad \square \tag{4.12}$$

4.5 Correctness of $\hat{p}^T$

First, we state a few lemmas.

Lemma 4. Under the setup of Theorem 1, $\mathbb{E}[T_{SZ} - T_{ST}] = f(p; q)$, where $f$ is defined in (3.10).

Proof. From lemma 2 and proposition 3, we know that
$$\mathbb{E}[T_{SZ}] = \mathbb{E}[L_{C,U}^T] = \frac{1}{q} + \frac{P_{NR}}{p}, \qquad \mathbb{E}[T_{ST}] = \frac{1}{q}.$$
Hence,
$$\mathbb{E}[T_{SZ} - T_{ST}] = \frac{P_{NR}}{p}. \tag{4.13}$$
From equations (3.10), (4.13) and (4.35), $\mathbb{E}[T_{SZ} - T_{ST}] = f(p; q)$. $\square$

Lemma 5. Given fixed $R \ge 0$, let $f : [0,1] \times [0,1] \to \mathbb{R}_+$ be defined as
$$f(p; q) = \frac{1}{p} \left( \frac{q - qp}{q - qp + p} \right)^R. \tag{4.14}$$
Then for any $q \in (0, 1)$, $f$ is a strictly monotonically decreasing function of the variable $p \in (0, 1)$.

Proof. Effectively, $f$ is the product of two terms (for fixed $q \in (0,1)$ and $R \ge 0$). Clearly, the first term, $1/p$, is a strictly decreasing function of $p \in (0, 1)$. The second term is
$$\left( \frac{q - qp}{q - qp + p} \right)^R = \left( \frac{1}{1 + \frac{p}{q(1-p)}} \right)^R.$$
If $R = 0$, then it is constant. If $R > 0$, then the second term is a strictly decreasing function of $p \in (0, 1)$, since $p/(1-p)$ is strictly increasing in $p$. Therefore, $f$ is a strictly decreasing function of $p$, for fixed $q$ and $R$. $\square$

Proposition 4. Under the setup of Theorem 1, as $T \to \infty$,
$$\hat{p}^T \xrightarrow{a.s.} p. \tag{4.15}$$

Proof. First, note that both $T_{ST}$ and $T_{SZ}$ are empirical averages of i.i.d. random variables with finite mean. By the Strong Law of Large Numbers,
$$T_{ST} \xrightarrow{a.s.} \frac{1}{q}, \qquad T_{SZ} \xrightarrow{a.s.} \frac{1}{q} + \frac{P_{NR}}{p}.$$
Hence,
$$T_{SZ} - T_{ST} \xrightarrow{a.s.} \frac{P_{NR}}{p} = f(p; q).$$
Given $q$, the function $f$ is strictly decreasing in $p$. Therefore, with respect to the argument $p$, there exists an inverse of $f$, which we denote $f^{-1} \equiv f^{-1}(\cdot; q)$. Because we do not have access to the true $q$, as stated in Section 3.2, we use our estimate $\hat{q}^T$ to make $f^{-1}$ computable from observations. That is, we use $f^{-1}(\cdot; \hat{q}^T)$ to obtain $\hat{p}^T$ as
$$\hat{p}^T = f^{-1}\left( (T_{SZ} - T_{ST})_+ ;\ \hat{q}^T \right). \tag{4.16}$$
That is,
$$f(\hat{p}^T; \hat{q}^T) = (T_{SZ} - T_{ST})_+. \tag{4.17}$$
As argued before, we have
$$\hat{q}^T \xrightarrow{a.s.} q, \qquad (T_{SZ} - T_{ST})_+ \xrightarrow{a.s.} f(p, q). \tag{4.18}$$
That is,
$$f(\hat{p}^T, \hat{q}^T) \xrightarrow{a.s.} f(p, q). \tag{4.19}$$
Now $f$ is a continuous function and is strictly decreasing in its first argument. Therefore, by Lemma 6 together with (4.18) and (4.19), it must be that $\hat{p}^T \to p$. This completes the proof. $\square$

We state a useful analytic fact.

Lemma 6. Consider a continuous function $f : [0,1] \times [0,1] \to \mathbb{R}_+$ such that, for any given $q \in (0,1)$, $f(\cdot, q)$ is strictly decreasing in its first argument. Let $(p_n, q_n, x_n)$, $n \ge 1$, be such that $f(p_n, q_n) = x_n$ for all $n$, and, as $n \to \infty$, $q_n \to q$ and $x_n \to f(p, q)$ for some $(p, q) \in (0,1) \times (0,1)$. Then $p_n \to p$ as $n \to \infty$.

Proof. Suppose to the contrary that $p_n$ does not converge to $p$. Without loss of generality, let $p^* = \liminf p_n < p$. That is, there exists a subsequence $n_k \to \infty$ as $k \to \infty$ such that $p_{n_k} \to p^* < p$; in particular, $p_{n_k} < p$ for all $k$ large enough, and we shall consider only such values of $n_k$. Using the fact that $f(\cdot, q)$ is strictly decreasing, we have that
$$f(p_{n_k}, q) > f(p, q) + \delta \tag{4.20}$$
for some $\delta > 0$. Since $q_n \to q$ and $f$ is a continuous function, we have that for all $n$ large enough,
$$|f(p_n, q_n) - f(p_n, q)| < \delta/4. \tag{4.21}$$
Now, considering $n_k$ for $k$ large enough that both (4.20) and (4.21) are satisfied, we obtain
$$f(p_{n_k}, q_{n_k}) > f(p_{n_k}, q) - \delta/4 > f(p, q) + 3\delta/4.$$
But by the assumption of the lemma, we have that $f(p_{n_k}, q_{n_k}) \to f(p, q)$. Therefore, our assumption is wrong and hence $p_n \to p$, as desired. $\square$

4.6 Correctness of $\bar{C}^T$, $\bar{U}^T$

The correctness of the general estimators follows directly from some simple facts, which we prove in the form of lemmas.

Lemma 7. With respect to the quantities defined in section 3.4.2, the following hold:
$$L_T^* \xrightarrow{a.s.} \frac{1}{q} + \frac{P_{NR}}{p} \tag{4.22}$$
$$\tilde{L}_T \xrightarrow{a.s.} \frac{1}{q} + \frac{1}{C} \cdot \frac{P_{NR}}{p} \tag{4.23}$$
$$\mathcal{C}^T \xrightarrow{a.s.} \{c \in \{C_{min}, \ldots, C_{max}\} : c = kC \text{ for some } k \in \mathbb{N}\} \tag{4.24}$$

Proof. We shall prove the above convergence relations one at a time. Consider the first relation,
$$L_T^* \xrightarrow{a.s.} \frac{1}{q} + \frac{P_{NR}}{p},$$
where $L_T^* \triangleq \max_{c,i} L_{c,i}^T$. From lemma 3, we obtain the following:
$$\mathbb{E}[L_{c,i}^T] \begin{cases} = \frac{1}{q} + \frac{P_{NR}}{p}, & c = kC,\ i \bmod C = U, \\ \le \frac{1}{q} + \frac{P_{NR}}{2p}, & \text{otherwise.} \end{cases} \tag{4.25}$$
Moreover, since $L_{c,i}^T$ is an empirical average of i.i.d. integrable random variables, $L_{c,i}^T \xrightarrow{a.s.} \mathbb{E}[L_{c,i}^T]$ for all $c, i$ as $T \to \infty$. Hence,
$$\max_{c,i} L_{c,i}^T \xrightarrow{a.s.} \max_{c,i} \mathbb{E}[L_{c,i}^T] = \frac{1}{q} + \frac{P_{NR}}{p}.$$
(In (4.24), by almost sure convergence of a sequence of sets we mean the following: for every possible element, the sequence of indicator functions corresponding to its membership in the sets converges almost surely.
For finite sets, this means the sequence of sets eventually becomes constant.) Since the left-hand side equals $L_T^*$, this proves equation (4.22).

Now, consider the second relation:
$$\tilde{L}_T \xrightarrow{a.s.} \frac{1}{q} + \frac{1}{C} \cdot \frac{P_{NR}}{p}.$$
To prove this, we rely on a slightly deeper understanding of lemma 3. As described in that lemma, consider the sequence of inter-sale gaps as a sequence of intervals, with longer intervals occurring regularly at some period; $U$ and $C$ represent respectively the starting point and the period of the longer intervals. As proved in the lemma, if we guess a $C' \ne C$, the fraction of times we shall see the longer intervals is given by $\frac{\gcd(C, C')}{C}$, provided at least one longer interval is observed. However, since we are now minimizing over the starting point, we ask the question: for which $C'$ is it possible to observe a sequence with no longer intervals?

It turns out that for all $C'$ which have a common factor with $C$ (i.e. $\gcd(C, C') > 1$), there exists a starting location such that no longer intervals are observed. This can be proved as follows: suppose $\gcd(C, C') = x$, and suppose $U, U'$ are the true and assumed starting states, such that $U \not\equiv U' \bmod x$. Since both $C$ and $C'$ are congruent to 0 modulo $x$, the true sequence with $(C, U)$ and the assumed sequence with $(C', U')$ will never overlap, since they differ modulo $x$. Conversely, if $C$ and $C'$ are co-prime, then for every starting point we shall observe some longer intervals, and their frequency is exactly $\frac{1}{C}$.

Now, consider the definition $\tilde{L}_T \triangleq \max_c \min_i L_{c,i}^T$. Since we are minimizing over $i$, for any $c$ which has a common factor with $C$, asymptotically a starting state will be selected for which there are no longer intervals; hence, the average window length of such a sequence tends to the average sale time. However, since we are also maximizing over the choice of $c$, the $c$ chosen in our min-max definition will asymptotically always be co-prime to $C$. For such a co-prime $C'$, we shall observe the longer intervals exactly a fraction $\frac{1}{C}$ of the time, as mentioned above (by lemma 3). Thus, we obtain the required equation:
$$\tilde{L}_T \xrightarrow{a.s.} \frac{1}{q} + \frac{1}{C} \cdot \frac{P_{NR}}{p}.$$
This completes the proof of the second result.

We now consider the third equation, viz.
$$\mathcal{C}^T \xrightarrow{a.s.} \{c : c = kC \text{ for some } k \in \mathbb{N}\}.$$
Recall the definition
$$\mathcal{C}^T = \left\{ c \in \{C_{min}, \ldots, C_{max}\} : \max_i L_{c,i}^T \ge \frac{3}{4} L_T^* + \frac{1}{4} \tilde{L}_T \right\}.$$
Using equations (4.22) and (4.23), we see that asymptotically the set $\mathcal{C}^T$ is close to $\bar{\mathcal{C}}^T$, where the latter is defined as follows:
$$\bar{\mathcal{C}}^T \triangleq \left\{ c \in \{C_{min}, \ldots, C_{max}\} : \max_i L_{c,i}^T \ge \frac{3}{4}\left( \frac{1}{q} + \frac{P_{NR}}{p} \right) + \frac{1}{4}\left( \frac{1}{q} + \frac{P_{NR}}{p \cdot C} \right) \right\},$$
i.e.
$$\bar{\mathcal{C}}^T = \left\{ c \in \{C_{min}, \ldots, C_{max}\} : \max_i L_{c,i}^T \ge \frac{1}{q} + \left( \frac{3}{4} + \frac{1}{4C} \right) \frac{P_{NR}}{p} \right\}. \tag{4.26}$$
Note: $\bar{\mathcal{C}}^T$ defined above is not an empirically obtained set, but, because of equations (4.22) and (4.23), the empirically defined set $\mathcal{C}^T$ behaves like it asymptotically.

We define two sets, $\bar{\mathcal{C}}_L^T$ and $\bar{\mathcal{C}}_U^T$, to bound the above set. These sets are defined as follows:
$$\bar{\mathcal{C}}_U^T \triangleq \left\{ c \in \{C_{min}, \ldots, C_{max}\} : \max_i L_{c,i}^T \ge \frac{1}{q} + \frac{3}{4} \cdot \frac{P_{NR}}{p} \right\} \tag{4.27}$$
$$\bar{\mathcal{C}}_L^T \triangleq \left\{ c \in \{C_{min}, \ldots, C_{max}\} : \max_i L_{c,i}^T \ge \frac{1}{q} + \frac{7}{8} \cdot \frac{P_{NR}}{p} \right\} \tag{4.28}$$
Assuming $C \ge 2$, it is easy to see that $\bar{\mathcal{C}}_L^T \subseteq \bar{\mathcal{C}}^T \subseteq \bar{\mathcal{C}}_U^T$. Now, we consider the behaviour of the two bounding sets as $T \to \infty$. From lemma 3, we know that for every $c$ that is not a multiple of $C$, $\max_i \mathbb{E}[L_{c,i}^T] \le \frac{1}{q} + \frac{1}{2} \cdot \frac{P_{NR}}{p}$. On the other hand, for every $c$ that is a multiple of $C$, $\max_i \mathbb{E}[L_{c,i}^T] = \frac{1}{q} + \frac{P_{NR}}{p}$. And since the $L_{c,i}$'s are empirical averages, they converge to their expectations as $T \to \infty$.
We now consider the third relation, viz.

  C^T → {c ∈ {C_min, ..., C_max} : c = kC, for some k ∈ N} a.s.

Recall the definition of C^T:

  C^T ≜ {c ∈ {C_min, ..., C_max} : max_i L^T_{c,i} ≥ (3/4)·L*^T + (1/4)·L̃^T}.

Using equations (4.22) and (4.23), we see that asymptotically the set C^T is close to C̄^T, where the latter is defined as

  C̄^T ≜ {c ∈ {C_min, ..., C_max} : max_i L^T_{c,i} ≥ (3/4)(1/q + P_NR/p) + (1/4)(1/q + P_NR/(pC))}
      = {c ∈ {C_min, ..., C_max} : max_i L^T_{c,i} ≥ 1/q + (3/4 + 1/(4C))·(P_NR/p)}. (4.26)

Note: C̄^T as defined above is not an empirically obtained set, but because of (4.22) and (4.23), the empirically defined set C^T behaves like it asymptotically. We define two sets, C̄^T_L and C̄^T_U, to bound the above set:

  C̄^T_U ≜ {c ∈ {C_min, ..., C_max} : max_i L^T_{c,i} ≥ 1/q + (3/4)·(P_NR/p)}, (4.27)
  C̄^T_L ≜ {c ∈ {C_min, ..., C_max} : max_i L^T_{c,i} ≥ 1/q + (7/8)·(P_NR/p)}. (4.28)

Assuming C ≥ 2, it is easy to see that C̄^T_L ⊆ C̄^T ⊆ C̄^T_U. Now we consider the behaviour of the two bounding sets as T → ∞. From Lemma 3, for every c that is not a multiple of C, max_i E[L^T_{c,i}] ≤ 1/q + (1/2)·(P_NR/p); on the other hand, for every c that is a multiple of C, max_i E[L^T_{c,i}] = 1/q + P_NR/p. And since the L^T_{c,i} are empirical averages, they converge to their expectations as T → ∞. Hence:

  max_i L^T_{c,i} → 1/q + P_NR/p a.s., for all c = kC for some k ∈ N; (4.29)
  max_i L^T_{c,i} → 1/q + (gcd(c, C)/C)·(P_NR/p) ≤ 1/q + (1/2)·(P_NR/p) a.s., otherwise. (4.30)

Using the above, we can determine what the sets C̄^T_L and C̄^T_U look like asymptotically. In particular, these sets converge to the following:

  C̄^T_L → {c ∈ {C_min, ..., C_max} : c = kC, for some k ∈ N} a.s., (4.31)
  C̄^T_U → {c ∈ {C_min, ..., C_max} : c = kC, for some k ∈ N} a.s. (4.32)

Since C̄^T_L ⊆ C̄^T ⊆ C̄^T_U, and both bounding sets converge a.s. to the same set, C̄^T must also converge a.s. to the same limit. From equations (4.22) and (4.23), and from the definition of C̄^T in (4.26), it follows that C^T has the same convergence property. In particular,

  C^T → {c ∈ {C_min, ..., C_max} : c = kC, for some k ∈ N} a.s.

Thus we have proven the third statement of the lemma, viz. equation (4.24). This completes our proof of the lemma.

From equation (4.24) in Lemma 7, it directly follows that

  min(C^T) → C a.s., i.e. C̄^T → C a.s., (4.33)

where here C̄^T ≜ min(C^T) is the estimator defined in Section 3.4.2. This proves the consistency of our estimator for C.

We now prove the consistency of our estimator for U; this proof is relatively straightforward. Since C is a discrete variable, the a.s. convergence of C̄^T to C means that with probability one we have the correct C after finitely many steps. Once the correct C is obtained, we recover U as the index which maximizes the window length for this C. It is easy to show that if C were known a priori, the estimator for U would be correct. This can be seen as follows:

  E[L^T_{C,U}] > E[L^T_{C,i}] for all i ≠ U,
  L^T_{C,i} → E[L^T_{C,i}] a.s. for all i,
  ∴ arg max_i L^T_{C,i} → arg max_i E[L^T_{C,i}] = U a.s.

Although we do not know C a priori in this case, we have shown above that we know it eventually, with probability 1. Once C is known, all future observations can be regarded as independent events which bias L^T_{C,U} toward a larger value than L^T_{C,i}, for all other i. Since the L^T_{c,i} are empirical averages, finitely many observations do not affect their long-term behaviour. Thus, eventually we recover the correct U as arg max_i L^T_{C,i}. Hence, our estimator for U is consistent as well. That is,

  Ū^T → U a.s., where Ū^T ≜ arg max_{i ∈ {0,1,...,C̄^T−1}} L^T_{C̄^T,i}. (4.34)

4.7 No-refill probability

Definition 4. We define the No-Refill Probability P_NR as the long-term probability that the stock equals 0, given that it equals 0 modulo C, at a time instant immediately after a sale.

In order to estimate the refill probability p, we shall need to compute P_NR explicitly. Below, we provide an expression for this probability. To do so, we first prove a lemma about dominance probabilities for Geometric random variables.

Lemma 8. Consider two independent Geometric random variables X and Y, generated from Bernoulli processes with success probabilities a and b respectively, both supported on {1, 2, ...}. Then the probability that we see a success of the first process before the second is

  P(X < Y) = (a − ab)/(a + b − ab).

Proof. By definition,

  P(X = i) = a(1 − a)^{i−1} and P(Y = i) = b(1 − b)^{i−1}, for all i ∈ {1, 2, ...}.

Therefore,

  P(Y > i) = Σ_{j=i+1}^∞ b(1 − b)^{j−1} = (1 − b)^i,

and hence

  P(Y > X) = Σ_{i=1}^∞ P(X = i)·P(Y > i) = Σ_{i=1}^∞ a(1 − a)^{i−1}(1 − b)^i
           = a(1 − b)/(1 − (1 − a)(1 − b)) = (a − ab)/(a + b − ab).

We now want to find the probability that we arrive at a stock of 0 without any refill occurring. Since refills occur only when the stock is ≤ R, we can be certain that when the stock equals R + 1 modulo C, its actual value is R + 1. Immediately after the next sale, the value of the stock equals R. Thus, in order to see a stock of 0 the next time the stock equals 0 modulo C, we need to see R consecutive sale transactions occurring before any refill. By the Markov property of our model, the probability that the (r + 1)-th sale transaction occurs before a refill transaction is independent of r. Thus, the probability of having 0 stock is simply the R-th power of the probability that the next sale occurs before a refill: P_NR = P(Sale before Refill)^R. From Lemma 8, this implies

  P_NR = ((q − qp)/(q + p − qp))^R. (4.35)
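Both Lemma 8 and the resulting expression (4.35) are easy to sanity-check by simulation. The sketch below is illustrative only, with arbitrary parameter values; the second check estimates the probability of R consecutive sale-before-refill races, which by the Markov-property argument above equals P_NR.

```python
import random

def geometric(prob: float) -> int:
    """Trials up to and including the first success of a Bernoulli(prob) process."""
    k = 1
    while random.random() >= prob:
        k += 1
    return k

q, p, R, trials = 0.4, 0.2, 3, 100_000
race = (q - q * p) / (q + p - q * p)          # Lemma 8 with a = q, b = p

# P(sale time < refill time) for independent Geometric(q), Geometric(p).
est = sum(geometric(q) < geometric(p) for _ in range(trials)) / trials
print(est, race)

# P_NR: R consecutive sales each beating an independent refill clock, cf. (4.35).
est_nr = sum(all(geometric(q) < geometric(p) for _ in range(R))
             for _ in range(trials)) / trials
print(est_nr, race ** R)
```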
Chapter 5

Sample complexity bounds

In this chapter, we analyze the amount of data (i.e., observations) required to learn our parameters effectively. We thereby establish the efficiency of our estimators.

5.1 Error bounds for stock estimation

We wish to find the probability that our estimators in Section 3.1.1 return wrong values, for a finite amount of data. That is, we want to bound:

  P((Ĉ^T, Û^T) ≠ (C, U)) = P((C, U) ≠ arg max_{c,i} L^T_{c,i})
    = P(∪_{(c,i)≠(C,U)} {L^T_{c,i} ≥ L^T_{C,U}})
    ≤ 2K² · max_{(c,i)≠(C,U)} P(L^T_{c,i} ≥ L^T_{C,U})   (union bound). (5.1)

We now bound the above probability for any (c, i) ≠ (C, U); by (c, i) ≠ (C, U), we mean that at least one of c ≠ C or i ≠ U holds. To do this, we first prove a lemma which expresses the distribution of these random variables in a convenient form.

Lemma 9. Define the random variables A ~ Geometric(q), W ~ Geometric(p), and Z ~ Bernoulli(P_NR), where X ~ Y denotes that X has the same distribution as Y. Let A_1, A_2, ... be i.i.d. random variables distributed identically to A, and likewise for W and Z, with all random variables defined above independent of each other. With respect to the setup in Theorem 1 and Section 3.1.1, the following two properties hold:

  I. L^T_{C,U} ~ (1/N_1) Σ_{i=1}^{N_1} (A_i + Z_i W_i), where N_1 = E^T_{C,U}; (5.2)

  II. L^T_{c,i} ~ (1/N_2) (Σ_{i=1}^{N_2} A_i + Σ_{i=1}^{N_3} Z_i W_i), where N_2 = E^T_{c,i} and N_3 ≤ N_2/2, for any fixed (c, i) other than (C, U). (5.3)

Proof. The proof follows largely from arguments made in Section 4.2. We first prove the statement for L^T_{C,U}. By definition, L^T_{C,U} = |S^T_{C,U}| / E^T_{C,U}. As noted in Lemma 2, the set S^T_{C,U} consists of those time instants from 0 up to T for which the stock X_t satisfies X_t ≡ 0 (mod C). This set consists of many different 'windows' of observations, each window being a set of contiguous time instants ending with an instant at which a sale occurs; no other time instant in a window has a sale. Hence, the number of such windows is exactly E^T_{C,U}. Moreover, each window represents either a sale time (if there is an early refill), or the sum of a refill and a sale time (otherwise). Accordingly, we define the Bernoulli random variable Z, which equals 1 if there is no early refill, an event of probability P_NR; note that Z is independent of any other sale or refill times in S^T_{C,U}. By definition, the random variable A has the same distribution as a sale time, and the random variable W has the same distribution as a refill time. Thus, a single window length in S^T_{C,U} is distributed as A with probability 1 − P_NR, and as A + W the rest of the time; that is, as A + ZW. Therefore the average window length L^T_{C,U} is distributed as (1/E^T_{C,U}) Σ_{i=1}^{E^T_{C,U}} (A_i + Z_i W_i).
We now turn to proving the second statement, which concerns the distribution of L^T_{c,i} when (c, i) ≠ (C, U). The same subdivision into E^T_{c,i} windows holds for the set of indices in S^T_{c,i} as described above, but this time we cannot exactly determine the stock modulo C on this set. However, following the argument of Lemma 3, the fraction of windows that are 'longer intervals' (i.e., windows in which X_t ≡ 0 (mod C)) is at most 1/2. If we let N_2 be the total number of windows, and N_3 the number of windows where the stock is congruent to 0 modulo C, then N_3 ≤ N_2/2. In these N_3 windows, the window length is a sale time with probability 1 − P_NR, and the sum of a sale and a refill time the rest of the time; in the remaining N_2 − N_3 windows, the window length is simply a sale time. Also, N_2 = E^T_{c,i} as noted earlier. Thus, the distribution of L^T_{c,i} is identical to

  (1/N_2) (Σ_{i=1}^{N_2} A_i + Σ_{i=1}^{N_3} Z_i W_i), where N_2 = E^T_{c,i} and N_3 ≤ N_2/2.

We now obtain a bound on the probability that L^T_{c,i} ≥ L^T_{C,U} for any (c, i) ≠ (C, U).

Lemma 10. With respect to the setup in Theorem 1 and Section 3.1.1, fix any arbitrary (c, i) ≠ (C, U). Also, let N_1 = E^T_{C,U} and N_2 = E^T_{c,i}, and assume that N_1, N_2 ≥ N for some positive integer N. Then the following holds:

  P(L^T_{c,i} ≥ L^T_{C,U}) ≤ (32p²/(N·P²_NR)) (σ²_A + σ²_WZ), (5.4)

  where σ²_A = (1 − q)/q² and σ²_WZ = ((2 − p)P_NR − P²_NR)/p².

Proof. First, notice that:

  P(L^T_{c,i} ≥ L^T_{C,U})
    ≤ P({L^T_{C,U} ≤ 1/q + (3/4)(P_NR/p)} ∪ {L^T_{c,i} ≥ 1/q + (3/4)(P_NR/p)})
    ≤ P(L^T_{C,U} ≤ 1/q + (3/4)(P_NR/p)) + P(L^T_{c,i} ≥ 1/q + (3/4)(P_NR/p)). (5.5)

Now, by Chebyshev's inequality, for any random variable X:

  P(|X − E[X]| ≥ Δ) ≤ σ²_X/Δ², for all Δ > 0, (5.6)

where σ²_X ≜ E[X²] − E[X]² is the variance of X.

Now, let X = L^T_{C,U}. Note that E[X] = 1/q + P_NR/p, and denote the variance of L^T_{C,U} by σ²_{C,U;T}. Then:

  P(L^T_{C,U} ≤ 1/q + (3/4)(P_NR/p)) = P(X − E[X] ≤ −P_NR/(4p))
    ≤ P(|X − E[X]| ≥ P_NR/(4p))
    ≤ ((4p)²/P²_NR)·σ²_X   (using (5.6))
    = (16p²/P²_NR)·σ²_{C,U;T}. (5.7)

Similarly, we can bound the second term of (5.5). Let X = L^T_{c,i} for any (c, i) ≠ (C, U). Note that E[X] ≤ 1/q + P_NR/(2p), and denote the variance of L^T_{c,i} by σ²_{c,i;T}. In the same manner as above,

  P(L^T_{c,i} ≥ 1/q + (3/4)(P_NR/p)) ≤ P(X − E[X] ≥ P_NR/(4p))
    ≤ P(|X − E[X]| ≥ P_NR/(4p))
    ≤ (16p²/P²_NR)·σ²_{c,i;T}. (5.8)

Combining equations (5.5), (5.7), and (5.8), we get:

  P(L^T_{c,i} ≥ L^T_{C,U}) ≤ (16p²/P²_NR)(σ²_{C,U;T} + σ²_{c,i;T}). (5.9)

We further note that it is easy to bound the variances of L^T_{C,U} and L^T_{c,i}, as they are empirical averages of i.i.d. random variables. From Lemma 9,

  L^T_{C,U} ~ (1/N_1) Σ_{i=1}^{N_1} (A_i + W_i Z_i),
  ∴ σ²_{C,U;T} = (1/N_1²) Σ_{i=1}^{N_1} (σ²_A + σ²_WZ) = (1/N_1)(σ²_A + σ²_WZ),

where σ²_A and σ²_WZ are the variances of the random variables A and WZ defined in Lemma 9. Now, assuming N_1 ≥ N, we get:

  σ²_{C,U;T} ≤ (1/N)(σ²_A + σ²_WZ). (5.10)

We can easily derive the same for (c, i): from Lemma 9,

  L^T_{c,i} ~ (1/N_2)(Σ_{i=1}^{N_2} A_i + Σ_{i=1}^{N_3} W_i Z_i),
  ∴ σ²_{c,i;T} = (1/N_2²)(N_2·σ²_A + N_3·σ²_WZ) ≤ (1/N_2)(σ²_A + σ²_WZ)   (since N_3 ≤ N_2, from Lemma 9),

and since N_2 ≥ N, we have:

  σ²_{c,i;T} ≤ (1/N)(σ²_A + σ²_WZ). (5.11)

We use equations (5.10) and (5.11) along with (5.9) to obtain:

  P(L^T_{c,i} ≥ L^T_{C,U}) ≤ (32p²/(N·P²_NR))(σ²_A + σ²_WZ). (5.12)

We shall express the variances on the right-hand side in terms of model parameters. Using the definitions of the random variables A, W, and Z from Lemma 9, we compute σ²_A and σ²_WZ:

  σ²_A ≜ E[A²] − (E[A])² = (2 − q)/q² − 1/q² = (1 − q)/q², where A ~ Geometric(q); (5.13)

  σ²_WZ ≜ E[(WZ)²] − (E[WZ])² = E[W²]E[Z²] − (E[W])²(E[Z])²
        = ((2 − p)/p²)·P_NR − (1/p²)·P²_NR = ((2 − p)P_NR − P²_NR)/p², (5.14)

  where W ~ Geometric(p) and Z ~ Bernoulli(P_NR). (For brevity, we continue to refer to the variances as σ²_A and σ²_WZ; their explicit forms can be found in (5.13) and (5.14).)

Equations (5.12), (5.13), and (5.14) together complete our proof.
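The distributional characterization of Lemma 9 makes it easy to check the bound (5.4) by simulation. The sketch below is illustrative only: it draws the two window-average statistics in the worst case N_3 = N_2/2 and compares the empirical frequency of the bad event {L^T_{c,i} ≥ L^T_{C,U}} against the right-hand side of (5.4). Here p_nr stands in for P_NR, which in the model is fixed by (4.35).

```python
import random

def geometric(prob: float) -> int:
    k = 1
    while random.random() >= prob:
        k += 1
    return k

q, p, p_nr, N, trials = 0.5, 0.3, 0.4, 200, 5_000

def L_CU() -> float:
    # Lemma 9, part I: each of N windows is A + Z*W.
    return sum(geometric(q) + (random.random() < p_nr) * geometric(p)
               for _ in range(N)) / N

def L_ci() -> float:
    # Lemma 9, part II, worst case N3 = N2/2: only half the windows carry Z*W.
    total = sum(geometric(q) for _ in range(N))
    total += sum((random.random() < p_nr) * geometric(p) for _ in range(N // 2))
    return total / N

bad = sum(L_ci() >= L_CU() for _ in range(trials)) / trials
var_a = (1 - q) / q**2                                    # (5.13)
var_wz = ((2 - p) * p_nr - p_nr**2) / p**2                # (5.14)
print(bad, 32 * p**2 / (N * p_nr**2) * (var_a + var_wz))  # event freq vs. (5.4)
```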
Now, we are ready to prove our result.

Proposition 5. Under the setup of Theorem 1, given T observations, it is possible to recover the stock modulo C, viz. X_t mod C, for all times uniformly, with high probability. Precisely, we prove the statement of Theorem 2:

  sup_{t∈{0,...,T}} P(|(X̂_t − X_t) mod C| > 0) ≤ ε, for all T ≥ max((2B/ε)², 4/ε),

where B is a constant that depends on the model parameters.

Proof. From equation (5.4), we know that

  P(L^T_{c,i} ≥ L^T_{C,U}) ≤ (32p²/(N·P²_NR))(σ²_A + σ²_WZ). (5.15)

Now we use the result of Lemma 11 to obtain N in terms of T. Specifically, for any positive integer k,

  T_N ≤ 4Nk(K/q + P_NR/p), with probability at least 1 − (1/2)^k,
  i.e., N ≥ T/(4k(K/q + P_NR/p)), with probability at least 1 − (1/2)^k. (5.16)

Combining equations (5.15) and (5.16), we get:

  P(L^T_{c,i} ≥ L^T_{C,U}) ≤ (32p²/P²_NR)(σ²_A + σ²_WZ) · 4k(K/q + P_NR/p)/T + (1/2)^k, (5.17)

where the term (1/2)^k accounts for the probability of the event that N does not satisfy the inequality in (5.16). Now, using equation (5.1), we get:

  P((Ĉ^T, Û^T) ≠ (C, U)) ≤ (2K²)(32p²/P²_NR)(σ²_A + σ²_WZ) · 4k(K/q + P_NR/p)/T + (1/2)^k. (5.18)

Or, more simply,

  P((Ĉ^T, Û^T) ≠ (C, U)) ≤ Bk/T + (1/2)^k, (5.19)

where

  B ≜ (256K²p²/P²_NR) · (K/q + P_NR/p) · (σ²_A + σ²_WZ). (5.20)

Suppose we choose k such that (1/2)^k ∈ [1/T, 2/T], i.e., choose k = ⌊log₂ T⌋, so that

  P((Ĉ^T, Û^T) ≠ (C, U)) ≤ (B log₂ T)/T + 2/T. (5.21)

In order to make the error probability smaller than ε, we can make each term above smaller than ε/2; that is, choose T such that 2/T ≤ ε/2 and (B log₂ T)/T ≤ ε/2. Thus:

  P((Ĉ^T, Û^T) ≠ (C, U)) ≤ ε, for all T with (B log₂ T)/T + 2/T ≤ ε
    ⇐ 2/T ≤ ε/2 and (B log₂ T)/T ≤ ε/2
    ⇐ T ≥ 4/ε and B√T/T ≤ ε/2   (using log₂ T ≤ √T, valid for T > 20)
    ⇐ T ≥ max((2B/ε)², 4/ε). (5.22)

(We expect T > 20 in any practical setting, so that √T dominates log₂ T.) The above gives a bound on the error probability ε in estimating (C, U) in terms of T. From Lemma 1, this is exactly the probability that we recover the stock incorrectly. Hence,

  sup_{t∈{0,...,T}} P(|(X̂_t − X_t) mod C| > 0) ≤ ε, for all T ≥ max((2B/ε)², 4/ε).
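As a concrete illustration, the sufficient horizon of Proposition 5 can be computed numerically; the sketch below is ours, with B reconstructed from (5.18)–(5.20) and purely illustrative parameter values.

```python
def sufficient_T(q: float, p: float, R: int, K: int, eps: float) -> float:
    """Return max((2B/eps)^2, 4/eps) from Proposition 5, with B as in (5.20)."""
    p_nr = ((q - q * p) / (q + p - q * p)) ** R        # equation (4.35)
    var_a = (1 - q) / q**2                             # equation (5.13)
    var_wz = ((2 - p) * p_nr - p_nr**2) / p**2         # equation (5.14)
    B = (256 * K**2 * p**2 / p_nr**2) * (K / q + p_nr / p) * (var_a + var_wz)
    return max((2 * B / eps) ** 2, 4 / eps)

print(f"{sufficient_T(q=0.5, p=0.3, R=2, K=10, eps=0.1):.3g}")
```

The bound is a worst-case guarantee; the quadratic dependence on B shows how the refill level R, through P_NR, drives the required horizon.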
5.2 Data sufficiency

In this section, we determine the number of time instants we must wait in order to see N instances of each random variable defined in Lemma 9, with high probability. The data bounds derived here were used above in deriving our sample complexity bounds. We prove them in the following lemma.

Lemma 11. With respect to the setup in Theorem 1 and Section 3.1.1, let N_1(T) = E^T_{C,U}, and let N_2^{c,i}(T) = E^T_{c,i} for all (c, i) ≠ (C, U). For any positive integer N, let T_N be the smallest time such that N_1(T) ≥ N and N_2^{c,i}(T) ≥ N; that is, define the stopping time

  T_N ≜ min{T : N_1(T) ≥ N and N_2^{c,i}(T) ≥ N for all (c, i)}.

Then the following holds for any positive integer k:

  T_N ≤ 4Nk(K/q + P_NR/p), with probability at least 1 − (1/2)^k.

Proof. We wish to find the smallest time such that for each (c, i), the set S^T_{c,i} contains at least N sales. For a particular value of (c, i), this happens once the total number of sales up to time T exceeds Nc. Thus, in order to satisfy this for all (c, i), it suffices to observe 2KN sales in total, since by assumption C ≤ 2K. In order to sell 2KN items, we shall have to refill the stock at most 2N times, because each refill adds at least K items (since C ≥ K). Thus, the total time required is at most the sum of 2KN sale times and the additional waiting time required for at most 2N refills. Note that we actually have to wait for a refill only with probability P_NR, since the rest of the time the refill has already occurred before the stock becomes empty; when we do wait, the wait takes average time 1/p. Also, each sale takes time 1/q on average. Using these observations, we obtain an upper bound on the expected value of T_N:

  E[T_N] ≤ 2KN/q + 2N·P_NR/p.

By the Markov inequality, P(T_N ≥ 2E[T_N]) ≤ 1/2; that is, if we wait for time 2E[T_N], we have at least a 50% chance of seeing the required 2KN sales. Since the above arguments are independent of the starting state, they can be easily extended: by considering k different intervals of size 2E[T_N] each, the probability of observing 2KN sales in total is at least 1 − (1/2)^k. Thus,

  T_N ≤ 2kE[T_N] ≤ 4kN(K/q + P_NR/p), with probability at least 1 − (1/2)^k. (5.23)

Chapter 6

Generalized and Noise-Augmented Models

In this chapter, we describe a more general version of the Sales-Refills model introduced in Chapter 2. This model, which we call the Generalized Sales-Refills model, can represent a larger class of retail systems, and can represent existing markets more accurately, leading to better estimation. We also look at a modification called the Noisy Sales-Refills model, which augments our basic model with noise-handling capabilities. We present both models in brief, without formal proofs and details.

6.1 Generalized Sales-Refills model

6.1.1 Modifications

Our general model is also a second-order Hidden Markov Model with shift 1, representing the evolution of stock. It still has two events that can change the stock, viz. Sale and Refill, of which only the former is observed. However, there are three key changes in the general model, described below.

1. Multiple Sales: We allow multiple items to be sold in a single instant. Thus, the observation variables Y_t are no longer binary, but integers describing the number of units sold at time t. This feature is important for two reasons. Firstly, if we use a notion of real time to generate purchase and non-purchase events, then we are forced to allow for the possibility of multiple sales. For example, suppose we divide the day into 24 hours, and let each observation variable represent the total sales made in the corresponding hour.
Multiple sales naturally arise from this definition. Even if we define non-purchase events using the method described in [6], it may still be useful to allow for multiple sales: when modelling items that are frequently sold in bunches, allowing multiple sales can lead to a much more accurate and realistic model than breaking them up into consecutive single-unit sales, since the latter does not sit well with modelling sales as independent geometric random variables.

2. Non-uniform Sales Probability: We also allow the sales probabilities to differ across times. In particular, we allow the possibility that current inventory levels influence the purchasing patterns of customers. There are two important effects that such a stock-dependent model can handle. Firstly, since we allow for multiple sales, there is an obvious need for some dependence of purchase probabilities on stock, since a customer cannot buy more items than the current inventory level. Secondly, we allow the tendency to purchase to be influenced by the stock level. For example, if the stock is very low, a prospective customer may not see the item on the shelf, and hence fail to purchase it. Similarly, a customer who is window-shopping would not see the item, and we would miss a possible sale that might have occurred had the stock been high. Beyond this, there can be other unconscious influences of the stock level on customer behaviour.

3. Unknown R: We may treat R as an unknown parameter, rather than a known quantity. We have assumed so far that the refill level R is known. While this assumption makes perfect sense for a retailer, who must know his own store, it is less obvious when the estimation is being done by a point-of-sales system like Square. In such a situation, we may wish to treat R as an unknown parameter, and perhaps even estimate it from observations. Unfortunately, the latter is not possible (without additional knowledge), since R and p influence the refill times in equivalent ways. However, as we shall demonstrate, it is possible to estimate certain other quantities without knowledge of R; treating R as unknown is therefore not fruitless.

6.1.2 Description

Below, we describe the modifications to the formal definition of the Sales-Refills model needed to accommodate the generalizations described above. These modifications amount to redefining the Sale event, while the description of the Refill event remains the same as earlier.

1. Sale: Whenever the stock is greater than zero, we expect customers to arrive and sales to occur. Suppose at time instant t the current stock X_t equals some integer k > 0; then there is a probability q_k of a sale occurring. Note that the sale probability depends on the current value of the stock. When a sale event occurs, several items may be sold; we allow up to m items in a single sales transaction. Given that a sale event occurs, the number of items sold is distributed as follows: with probability r_{k,j}, we sell j items, for each j ∈ {1, 2, ..., min(m, k)}. Such a sale reduces the stock by j. Accordingly, we define the sales decrement Δ_{s,t} for the general Sales-Refills model as follows.

(1) When X_t > 0: let X_t = k, for some positive integer k. Then:
(i) With probability 1 − q_k, no sale happens, meaning that Δ_{s,t} = 0.
(ii) With probability q_k, a sale occurs. In this case, Δ_{s,t} ∈ {1, ..., min(m, k)}, with the probability distribution given by P(|Δ_{s,t}| = j) = r_{k,j} for all j ∈ {1, 2, ..., min(m, k)}.

(2) When X_t = 0: no sale occurs from an empty stock, i.e., Δ_{s,t} = 0.

Note that all the parameters defined above, the q_k and the r_{k,j}, are unknown parameters of our model. We will later discuss learning these parameters from observations.

2. Refill: We allow refills to occur exactly as described in Section 2.2.1 for the ordinary Sales-Refills model, and we define the refill increment Δ_{r,t} exactly as in Section 2.2.2.

As before, we define X_{t+1} and Y_t by adding the above quantities with the appropriate signs:

  Y_t ≜ Δ_{s,t},
  X_{t+1} ≜ X_t + Δ_{r,t} − Δ_{s,t}.

As an illustrative example, suppose a sale of 3 items as well as a refill occurs at a time instant t; then the stock at time t + 1 is X_{t+1} = X_t + C − 3.

6.2 Estimation of the Generalized Sales-Refills model

In this section, we discuss estimating parameters from data. It turns out that the Generalized Sales-Refills model is too complex for recovering all parameters merely from observations of sales. We deal with this by imposing certain simplifying assumptions, which reduce the parameter space. While many such sets of assumptions are possible in theory, we discuss one in detail below.

6.2.1 Simplifying Assumptions

We describe here a set of assumptions which are quite reasonable in a large number of situations. We call these assumptions the Uniform Aggregate Demand condition. It involves the following three constraints:

(i) for all k: q_k = q;
(ii) for all k ≥ m and all j: r_{k,j} = w_j;
(iii) for all k < m: r_{k,j} = w_j for j < k, and r_{k,k} = Σ_{i=k}^m w_i.

The above assumptions can be understood intuitively. The underlying idea is this: we assume that customers arriving at the store form a homogeneous population whose behaviour does not depend on the stock. Moreover, we assume that each customer arrives with a fixed intent of buying a certain number of items. However, the actual number of items purchased depends not just on her intent but also on the current inventory at the store: if the inventory level is lower than the number of items the customer wants to buy, we assume that she simply buys the available number of items. In other words, the customer buys out the particular item, and the item goes out of stock.

It can be seen how this leads to the assumptions made above. Consider, for example, the uniformity of the q_k: since every customer who arrives at the store leaves only after making a purchase, the probability of a sale is not affected by the current inventory level, as long as it is non-empty. The second assumption is understood similarly: when there is enough stock, the customer preferences are uniform. The third assumption is more interesting. It says, for example, that the probability of a customer purchasing 2 units is the same whether the stock equals 3 units or 10 units. This is to be expected, because in both cases the only customer who purchases 2 units is the one who came with the intent of buying exactly 2 items. However, when the stock equals 3 units, customers who come with the intention of buying 4 units, 5 units, or any higher amount also leave after purchasing 3 units. Thus the demand for the current number of items in stock aggregates the customers who come intending to buy that number of items or more. The above assumptions reduce the set of independent parameters on the sales side to m + 1 parameters: q, w_1, w_2, ..., w_m.
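To make these dynamics concrete, here is a minimal one-step simulator of the Generalized Sales-Refills model under Uniform Aggregate Demand; it is a sketch of ours, not part of the formal development. It assumes the refill behaviour of the basic model, namely that whenever the stock is at most R, a refill of C units arrives with probability p at each instant, and all parameter values are illustrative.

```python
import random

def step(stock: int, q: float, w: list, p: float, R: int, C: int):
    """One time step; w[j-1] is the probability a customer intends to buy j items.
    Returns (Y_t, X_{t+1})."""
    m = len(w)
    sale = 0
    if stock > 0 and random.random() < q:              # a sale event occurs
        intent = random.choices(range(1, m + 1), weights=w)[0]
        sale = min(intent, stock)                      # aggregate-demand truncation
    refill = C if stock <= R and random.random() < p else 0
    return sale, stock + refill - sale

# Illustrative parameters: q, intent distribution (w_1..w_m), p, R, C.
q, w, p, R, C = 0.5, [0.6, 0.3, 0.1], 0.2, 3, 20
stock, sales = 10, []
for _ in range(100):
    y, stock = step(stock, q, w, p, R, C)
    sales.append(y)
print(sales[:20], stock)
```

The min(intent, stock) truncation is exactly constraint (iii): a customer intending to buy more than the available stock buys everything that is left.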
Although some such reduction is necessary for learning the parameters, the constraints can be different from, and more general than, what we have imposed here. One may contrast the above scheme with another possibility, in which the customer buys only the number of items she originally intended to buy, and leaves the store if sufficient stock is not available. Such a scheme leads to a different set of constraints on the parameters, which can be worked out; one interesting feature is that the sales probability then depends on the stock level. In general, the right set of assumptions depends on the application.

6.3 Estimation with Uniform Aggregate Demand

Assuming the properties described in Section 6.2.1, we now proceed to estimation. We can obtain the following result for the Generalized Sales-Refills model under Uniform Aggregate Demand.

Theorem 3 (Consistent Estimation of Generalized model). For the model described in Section 6.1 with the assumptions of Section 6.2.1, assume that C is known while R may be unknown. Then there exist estimators q̂^T, X̂_0^T, X̂_t^T, and ŵ_j^T based on the observations Y_0^T = (Y_0, Y_1, ..., Y_T), so that as T → ∞,

  (q̂^T, X̂_0^T, X̂_t^T, ŵ_j^T) → (q, X_0 mod C, X_t mod C, w_j) a.s. (6.1)

6.3.1 Estimation of stock and U

Our estimation algorithm is a direct extension of the algorithm for the basic model, described in Chapter 3. That is, we define the index sets S^T_i, the sale events Ẽ^T_i, and the average window lengths L^T_i almost exactly as in Section 3.1.1. Since C is known, our sets are indexed by the single parameter i. We also need a slight modification in the definition of Ẽ^T_i, to take into account all instants when a sale occurs, rather than instants when exactly one item is sold. Thus, we have the following definitions:

  S^T_i ≜ {t ∈ {1, ..., T} : Σ_{j=0}^{t−1} Y_j ≡ i (mod C)}, for all i; (6.2)
  Ẽ^T_i ≜ Σ_{t∈S^T_i} 1(Y_t ≥ 1), for all i; (6.3)
  L^T_i ≜ |S^T_i| / Ẽ^T_i. (6.4)

As before, we argue that the window length is much longer for those periods which correspond to a stock of 0 modulo C, because in those cases we must wait for a refill (assuming no early refill) as well as a sale. Therefore, a simple estimator for U works:

  Û^T = arg max_i L^T_i. (6.5)

Once we have estimated U, the stock at any time can be estimated as before:

  X̂_t^T ≜ (Û^T − Σ_{i=0}^{t−1} Y_i) mod C.

We claim that these estimators converge almost surely; that is, as T → ∞,

  Û^T → U a.s., and X̂_t^T → X_t mod C a.s. (6.6)

The proof is identical to the proof in the basic case.
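The following is a short sketch of the estimator (6.2)–(6.5) applied to an observed sales sequence, with C known as in Theorem 3; the function names are hypothetical and the code is illustrative rather than the thesis's reference implementation.

```python
def estimate_U(sales, C: int) -> int:
    """Compute |S_i|, the sale-event counts, and L_i per residue class; return arg max."""
    counts = [0] * C          # |S_i|: instants with cumulative sales = i (mod C)
    events = [0] * C          # E~_i: instants in S_i at which some sale occurs
    cum = 0
    for y in sales:
        i = cum % C
        counts[i] += 1
        if y >= 1:
            events[i] += 1
        cum += y
    L = [counts[i] / events[i] if events[i] else 0.0 for i in range(C)]
    return max(range(C), key=L.__getitem__)

def estimate_stock(sales, C: int, U_hat: int):
    """Stock modulo C at each time, via X-hat_t = (U-hat - sum_{j<t} Y_j) mod C."""
    cum, est = 0, []
    for y in sales:
        est.append((U_hat - cum) % C)
        cum += y
    return est

# Usage, e.g. with a simulated `sales` sequence:
#   estimate_stock(sales, C, estimate_U(sales, C))
```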
6.3.2 Estimation of sales parameters

We now discuss the estimation of the two other quantities mentioned in Theorem 3, viz. q and the w_j.

Estimating q: Our estimator for q has exactly the same form as before. We define the average sale time

  T_S^T ≜ (Σ_{i∈{0,...,C−1}, i≠U} |S^T_i|) / (Σ_{i∈{0,...,C−1}, i≠U} Ẽ^T_i),

and our estimator for q is derived from it as

  q̂^T = 1/T_S^T.

Estimating the w_j: We now outline the procedure for estimating w_j, for j ∈ {1, ..., m}. The basic idea is to use those instants where the stock exceeds j, so that the fraction of sold items directly gives w_j. Consider again the sets S^T_i of (6.2), for i ∈ {0, 1, ..., C − 1}, and define Ẽ^T_{i,j} as the number of instants in S^T_i at which Y_t = j. That is,

  Ẽ^T_{i,j} ≜ Σ_{t∈S^T_i} 1(Y_t = j), for all i, j. (6.7)

Now, note that S^T_i collects those instants when the stock is congruent to U − i modulo C. Hence S^T_{(U−k) mod C} collects the time instants when the stock equals k, and it follows that Ẽ^T_{(U−k) mod C, j} is the number of instants when exactly j items are sold from a stock of k items. Since U is not known a priori, we replace U by Û^T in these expressions. Thus, our consistent estimators for the parameters w_j are as follows:

  ŵ_j^T = (Σ_{k: j+1≤k≤C−1} Ẽ^T_{(U−k) mod C, j}) / (Σ_{k: j+1≤k≤C−1} Ẽ^T_{(U−k) mod C}), for j ∈ {1, 2, ..., m}. (6.8)

A careful look at the above expression shows that it is simply the empirical estimator of a multinomial distribution; however, we only use those time instants for estimation at which the available stock is at least j + 1, which follows from how the sales probabilities are defined. Note that in the above expression we have implicitly used the assumption m < C; if this assumption does not hold, a more complex estimation scheme is needed. It is easily seen via the Strong Law of Large Numbers that our estimators for q and the w_j converge almost surely.
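The following compact sketch, again ours and illustrative only, implements q̂^T and the ŵ_j^T of (6.8); estimate_U is the hypothetical helper from the previous sketch.

```python
def estimate_sales_params(sales, C: int, U_hat: int, m: int):
    """Estimate q via the average sale time, and the w_j via equation (6.8)."""
    counts = [0] * C                                 # |S_i|
    events = [0] * C                                 # E~_i: sale events in class i
    events_j = [[0] * (m + 1) for _ in range(C)]     # E~_{i,j}: instants with Y_t = j
    cum = 0
    for y in sales:
        i = cum % C
        counts[i] += 1
        if y >= 1:
            events[i] += 1
            if y <= m:
                events_j[i][y] += 1
        cum += y
    # q-hat = 1 / T_S: sale events per instant, over classes i != U (stock nonzero).
    q_hat = (sum(events[i] for i in range(C) if i != U_hat)
             / sum(counts[i] for i in range(C) if i != U_hat))
    # w_j-hat from classes where the stock is at least j + 1, cf. (6.8).
    w_hat = []
    for j in range(1, m + 1):
        num = sum(events_j[(U_hat - k) % C][j] for k in range(j + 1, C))
        den = sum(events[(U_hat - k) % C] for k in range(j + 1, C))
        w_hat.append(num / den if den else 0.0)
    return q_hat, w_hat
```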
6.4 Augmenting the Sales-Refills model with Noise

The key drawback of our previous models, with respect to practical applications, is the assumption that the retail market is a perfect process: no items are sold, lost, or otherwise changed without being recorded as a transaction in the system, except in a refill. Moreover, the stock changes in a very precise manner during refills, i.e., by exactly C items, for some fixed constant C. Both of these assumptions may be violated in practice. To see that they are rather far-fetched, consider what would follow if both were true: knowing the stock at one time instant would then allow us to determine its value forever. In settings where such assumptions held in practice, inventory inaccuracy would not be a problem in the first place. Of course, in the real world the problem of inventory inaccuracy is ubiquitous, which means that these assumptions are frequently untrue in practice; the model as it stands is thus inadequate for solving the problem of inventory estimation.

To make our model useful in practice, we have to do away with these assumptions. In particular, we must allow goods to be lost in ways other than sales, such as theft, misplacement, and poor handling. From a practical viewpoint, it is virtually impossible to record or track such losses, hence they are unobserved. One way to model these losses is to allow a certain number of 'hidden sales' in each refill cycle. Hidden sales reduce the stock just like ordinary sales, but they are not recorded in the transactions and hence are not observable; they can occur anywhere and are independent of the usual sales and refills. In addition to losses of goods, we might also wish to account for losses during refilling of the stock; we call this flexibility 'source uncertainty'. As we show below (Section 6.4.2), the two relaxations are equivalent with respect to observable quantities.

It may be noted that above we have only considered the possibility that the true stock is smaller than the recorded stock, and not the other way around. While this is usually the case in practice, a few effects can make the true stock larger. For example, suppose the cashier accidentally marks a purchased item of type A as type B, so that the real stock of B is higher than its recorded stock. If we wish to handle such influences in our model, we can allow 'hidden negative sales' which increase the stock by 1; the analysis of that case carries through exactly like the analysis we describe here. For ease of description, we shall only consider events which reduce stock.

6.4.1 Description of the Noisy Sales-Refills Model

We describe one possible means of extending the basic Sales-Refills model to handle noise. We start with the same basic idea of a Markov chain modelling the evolution of stock, with 'sale' and 'refill' processes allowed to change the stock. In addition to these, we now allow a new type of event to change the stock, which we call a 'hidden sale'. A hidden sale reduces the stock by a single unit, just like an ordinary sale. However, unlike the latter, hidden sales are not observed, i.e., there is no information about them in the observed (i.e., Y) variables. Another difference from sales is that multiple hidden sales can occur in a single time instant, and their occurrence is unaffected by ordinary sale or refill events.

To make our model estimable, we make a further assumption: in each refill cycle, there can be at most L hidden sales. This means that, of the C goods ordered in any refill, no more than L goods will be lost due to wear and tear. To be useful, we require that L be a small number, i.e., L/C << 1; this is a reasonable assumption in practice. Moreover, even if this assumption is not strictly satisfied during all refill cycles, our model remains fairly accurate, provided that the average rate of losses is small and the per-cycle deviations from this rate are not too large. We call the above model a Noisy Sales-Refills Model.

6.4.2 Equivalence of the hidden-sales and source-uncertainty viewpoints

The motivation for our Noisy model stems from losses of goods in real markets. These losses can happen either while restocking or during regular sales. It turns out that, as far as observable quantities are concerned, these two types of losses are indistinguishable. Hence, for the purpose of estimation, we can assume that all losses of goods occur during a refill at the beginning of each cycle.

The reason for this equivalence is as follows. Suppose that in a refill cycle, 100 items are added to an empty stock (i.e., C = 100), of which 5 are lost during restocking (i.e., source loss = 5), and another 5 items are lost during sales. Then the item will be in stockout after 90 recorded sales, and hence we shall observe a (potentially) longer gap between the 90th and 91st sale, as we may have to wait for a refill before observing the next sale. Once a refill occurs, the stock is again at 100 items. Note that even if 8 items were lost at source and 2 during purchases, or vice versa, we would observe exactly the same kinds of inter-sale gaps. Similarly, for the items lost during purchases, the exact timing of these losses does not affect, and hence cannot be estimated from, the observations. Thus, for purposes of estimation, it is valid to assume that all losses occur during a single time instant in a cycle; we take that instant to be the beginning of the cycle.

There is another advantage of this viewpoint. Suppose we wish to estimate a modification of our Sales-Refills model in which the refill policy is a pure (s, S) policy; in other words, the amount of items added during a refill is not a constant C, but varies between S − s and S, depending on the inventory level at the time of refill. Since refill times are not observed in our Sales-Refills model, there is no way to know the exact number of items added in a refill. However, we can model the variability in the refill amount as a pseudo-loss of goods: in terms of the notation of Section 6.4.1, we simply replace L by L + s. Thus, the Noisy Sales-Refills model also allows us to handle variable refills.
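Following the equivalence just described, a simulator of the noisy model may lump each cycle's hidden losses into the refill itself. The sketch below is ours, under the stated assumptions; the uniform choice of the loss count is arbitrary and purely illustrative.

```python
import random

def noisy_step(stock: int, q: float, p: float, R: int, C: int, L: int):
    """One step of the Noisy Sales-Refills model, with all of a cycle's hidden
    losses applied at the refill instant (the Section 6.4.2 viewpoint)."""
    sale = 1 if stock > 0 and random.random() < q else 0
    if stock <= R and random.random() < p:
        refill = C - random.randint(0, L)    # refill net of up to L hidden losses
    else:
        refill = 0
    return sale, stock + refill - sale

q, p, R, C, L = 0.5, 0.2, 2, 20, 3           # illustrative values only
stock, ys = 10, []
for _ in range(300):
    y, stock = noisy_step(stock, q, p, R, C, L)
    ys.append(y)
print(sum(ys), stock)
```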
6.4.3 Estimation with Noise

Estimation idea: We give a brief outline of the estimation procedure for the noisy model. Our aim is to estimate the stock at each time from the observed sales. To do this, we look at the various 'pathways' taken by the stock, depending on the number of items lost in each cycle. A given pathway of the stock determines which inter-sale intervals correspond to sale-plus-refill intervals; we expect these intervals to be longer than their neighbouring intervals. By suitably guessing the pathway of the stock, and trying to maximize the average sale-plus-refill interval length, we obtain an approximate estimate of the evolution of stock. The estimate is approximate because it is impossible to know the stock more accurately than to within L items: the exact time at which goods are lost is not discernible within a cycle. Assuming that we guess all the refill intervals correctly, we then obtain an estimate of the stock accurate to within L items. This is the best that can be achieved by any algorithm under the assumptions of this model.

Limitations: Estimation under noise has many limitations compared to the earlier model. For example, since not all refill intervals involve waiting for a refill (we may have early refills), some refill intervals are unobservable. Even if there are no early refills, a refill interval may have a very small sale and refill time just by chance; in that case, it may be difficult to discern the correct refill interval. However, such events occur only with small probability, and correlating information from multiple cycles can help alleviate this problem to some extent. The correlation among adjacent cycles also leads to some robustness properties enjoyed by this model. In particular, the assumption that losses are restricted to at most L items can be violated in a few cycles, as long as the surrounding cycles have fewer lost items to compensate. That is, the instantaneous loss count can be larger than L, provided the 'average loss rate' is bounded above by L.

Consistency: Another property, important from a theoretical point of view, is that the hidden state of the Noisy Sales-Refills model cannot be consistently estimated from data. The proof is straightforward. Consider two possibilities: in one case, L items are lost in each of C/L consecutive cycles; in the other, no items are lost during the same period, which then spans C/L − 1 cycles. Since the final stock at the end of this period is the same in both cases, we cannot make any distinction between these two possibilities on the basis of events outside a finite interval. By design, our model cannot distinguish between the two possibilities with certainty using only a finite amount of data. Hence, we cannot consistently say whether the number of refills in this period equals C/L or C/L − 1. This implies that, with non-zero probability, we must make large errors (of size ≥ C/2) in estimating the stock. Hence, we say that the model is not consistently estimable. At first glance, this may seem like a drawback, but the lack of consistency can also be interpreted as a feature.
This is because any model which represents the inventory process in a realistic setting should not be consistent. To put it another way, real-world processes are noisy and do not have infinite memory. The lack of consistency is a direct outcome of this finite-memory property of the Noisy Sales-Refills model; thus, the model provides a better representation of reality.

Chapter 7

Conclusion

7.1 Summary of contributions

We recap the key contributions described in this work.

1. We provide a method for solving the inventory estimation problem in retail markets, using Hidden Markov Models.

2. We describe a family of Hidden Markov Models, viz. Sales-Refills models, that can be learnt from data in a computationally and statistically efficient manner.

3. We describe computationally fast estimation algorithms for learning parameters from data. We prove the correctness of our estimators by showing that they converge almost surely to the true parameters.

4. We derive finite-sample error bounds for our estimators. For T observations, these provide an upper bound on the error in stock estimation, for all time.

5. We discuss a generalized version of the Sales-Refills model, and describe estimators for this model.

6. We provide some ideas for extending our Sales-Refills model to handle noise.

7.2 Significance and Future Work

Our work shows the possibility of using Hidden Markov Models for accurate estimation of inventory from sales data. We provide an example of a highly learnable HMM, for which nearly all parameters of interest can be learnt from observations, using estimators that converge almost surely. Moreover, the estimation methods we provide are computationally fast and statistically efficient. Our model, in addition to solving the problem of inventory estimation, can be used to perform accurate demand forecasting in the presence of unobserved lost sales. Thus, it can potentially be used to solve a major problem in the retail industry.

A key drawback of our primary Sales-Refills model is the lack of 'noise'. In real markets, lost goods are common due to wear and tear, theft, mishandling, and other processes, which are unobserved (cf. [7]). In order to deal with such losses, we need to find good models which account for them. One such model is suggested at the end of this thesis (Section 6.4.1), which may work; but this model needs to be formalized, and further error analysis is necessary to gauge its usefulness in real scenarios. Another important feature would be taking into account side information about inventory, such as information obtained through inventory inspections. We leave these to future work.

The other important contribution of this thesis is to the field of HMM learning. As is widely known, HMM learning is a hard problem except in a few special cases. Our work provides an instance of an HMM that is easy to describe, efficiently learnable, and of practical utility. Moreover, we argue that this HMM is not solved by any of the existing methods in the literature; thus, it adds usefully to our existing knowledge about HMMs. We provide a brief survey of a spectral method from the literature, explaining how it can be used for HMM learning, and we list several drawbacks of this technique, illustrating that it fails on our specific HMM model. This reinforces our argument that the currently known HMM learning methods are limited, and that our work is a novel addition to this set.
It may be possible to expand our contributions even further, by identifying the properties which make the given model learnable, and extending these to other, more powerful models. We leave this analysis to future work.

Bibliography

[1] N. DeHoratius, A. J. Mersereau, and L. Schrage, "Retail Inventory Management When Records Are Inaccurate," Manufacturing & Service Operations Management, vol. 10, no. 2, pp. 257–277, 2008.

[2] N. DeHoratius and A. Raman, "Inventory Record Inaccuracy: An Empirical Analysis," Management Science, vol. 54, no. 4, pp. 627–641, 2008.

[3] K. Sari, "Inventory inaccuracy and performance of collaborative supply chain practices," Industrial Management & Data Systems, vol. 108, no. 4, pp. 495–509, 2008.

[4] Y. Kang and S. B. Gershwin, "Information inaccuracy in inventory systems: stock loss and stockout," IIE Transactions, vol. 37, no. 9, pp. 843–859, 2005.

[5] T. W. Gruen, D. S. Corsten, and S. Bharadwaj, Retail Out of Stocks: A Worldwide Examination of Extent, Causes, and Consumer Responses. 2002.

[6] J. M. Chaneton, G. V. Ryzin, and M. Pierson, "Estimating inaccurate inventory with transactional data," Preprint, pp. 1–18, 2014.

[7] L. Chen, "Fixing Phantom Stockouts: Optimal Data-Driven Shelf Inspection Policies," Working Paper, Duke University, pp. 1–37, 2013.

[8] N. Agrawal and S. A. Smith, "Estimating negative binomial demand for retail inventory management with unobservable lost sales," Naval Research Logistics, vol. 43, no. 6, pp. 839–861, 1996.

[9] G. Vulcano, G. van Ryzin, and R. Ratliff, "Estimating Primary Demand for Substitutable Products from Sales Transaction Data," Operations Research, vol. 60, no. 2, pp. 313–334, 2012.

[10] S. Nahmias, "Demand estimation in lost sales inventory systems," Naval Research Logistics, vol. 41, no. 6, pp. 739–757, 1994.

[11] S. A. Terwijn, "On the Learnability of Hidden Markov Models," pp. 261–268, 2002.

[12] R. B. Lyngsø and C. N. S. Pedersen, "Complexity of comparing hidden Markov models," Lecture Notes in Computer Science, vol. 2223, pp. 416–428, 2001.

[13] N. Abe and M. K. Warmuth, "On the Computational Complexity of Approximating Distributions by Probabilistic Automata," Machine Learning, vol. 9, pp. 205–260, 1992.

[14] H. Scarf, "The optimality of (S,s) policies for the dynamic inventory problem," 1960.

[15] S. P. Sethi and F. Cheng, "Optimality of (s, S) Policies in Inventory Models with Markovian Demand," Operations Research, vol. 45, no. 6, pp. 931–939, 1997.

[16] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu, "A Spectral Algorithm for Latent Dirichlet Allocation," Advances in Neural Information Processing Systems, vol. 25, pp. 1–9, 2012.

[17] R. E. Schapire, M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, and L. Sellie, "On the learnability of discrete distributions," Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pp. 273–282, 1994.

[18] Z. Zivkovic and F. van der Heijden, "Recursive unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 5, pp. 651–6, 2004.

[19] A. Moitra and G. Valiant, "Settling the polynomial learnability of mixtures of Gaussians," Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 93–102, 2010.

[20] S. Vempala and G. Wang, "A spectral algorithm for learning mixture models," Journal of Computer and System Sciences, vol. 68, no. 4, pp. 841–860, 2004.
[21] E. Mossel and S. Roch, "Learning nonsingular phylogenies and hidden Markov models," vol. 16, 2006.

[22] D. Hsu, S. M. Kakade, and T. Zhang, "A Spectral Algorithm for Learning Hidden Markov Models," 2008.

[23] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, "Tensor decompositions for learning latent variable models," arXiv preprint arXiv:1210.7559, vol. 15, pp. 1–55, 2014.

[24] Q. Huang, R. Ge, S. Kakade, and M. Dahleh, "Minimal Realization Problems for Hidden Markov Models," in 52nd Annual Allerton Conference on Communication, Control and Computing (Allerton), pp. 1–14, IEEE, 2014.