UNIT V Advanced Learning

Content
• Reinforcement Learning
• K-armed bandit - Elements
• Model-based Learning
• Value iteration
• Policy iteration
• Temporal difference Learning
• Exploration Strategies
• Deterministic and Non-Deterministic Rewards & Actions
• Semi-supervised Learning
• Computational Learning Theory – VC Dimension – PAC Learning

Reinforcement Learning
• What is Reinforcement Learning? How does it compare with other ML techniques?
  – Reinforcement Learning (RL) is a machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.
• Although both supervised and reinforcement learning use a mapping between input and output, they differ in the feedback given to the agent: in supervised learning the feedback is the correct set of actions for performing a task, whereas reinforcement learning uses rewards and punishments as signals for positive and negative behaviour.
• Compared with unsupervised learning, reinforcement learning differs in its goal. The goal of unsupervised learning is to find similarities and differences between data points, while the goal of reinforcement learning is to find a suitable action model that maximizes the agent's total cumulative reward.

How to formulate a basic Reinforcement Learning problem?
• Some key terms that describe the basic elements of an RL problem are:
  – Environment — the physical world in which the agent operates
  – State — the current situation of the agent
  – Reward — feedback from the environment
  – Policy — the method that maps the agent's states to actions
  – Value — the future reward an agent would receive by taking an action in a particular state

Multi-Armed Bandits and Reinforcement Learning
• The multi-armed bandit is a classic reinforcement learning problem in which a player faces k slot machines, or bandits, each with a different reward distribution, and tries to maximise the cumulative reward over a sequence of trials.

Formulation
• Let's get straight into the problem. There are three key components in a reinforcement learning problem — state, action and reward. Recall the setup: k machines are placed in front of you, and in each episode you choose one machine and pull its handle; by taking this action, you receive a reward. So the state is the current estimate of each machine's value (all zeros at the beginning), the action is the machine you choose in each episode, and the reward is the payout you receive after pulling the handle.
• Actions – Choosing an Action
  – To identify the machine with the highest reward, the straightforward way is to try each machine as many times as possible until you have a certain confidence in each estimate, and then stick to the best estimated machine from then on. But we can test in a wiser way. Since our goal is to collect the maximum reward along the way, we should not waste too much time on a machine that consistently gives a low reward; on the other hand, even after finding a machine with an acceptable reward, we should still explore the other machines in the hope that an under-explored machine could give a higher reward. A short sketch of this explore/exploit trade-off follows.
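The explore/exploit trade-off above can be made concrete with a small simulation. The sketch below is a minimal ε-greedy agent for a k-armed bandit; the Gaussian reward distributions, the value of epsilon, the helper name run_bandit, and the incremental sample-average update are illustrative assumptions, not part of the notes.

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy agent for a k-armed bandit (illustrative sketch)."""
    rng = random.Random(seed)
    true_means = [rng.gauss(0, 1) for _ in range(k)]  # hidden reward distribution per arm

    q = [0.0] * k   # estimated value of each arm (the "state" described in the notes)
    n = [0] * k     # number of times each arm has been pulled
    total_reward = 0.0

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best current estimate.
        if rng.random() < epsilon:
            a = rng.randrange(k)
        else:
            a = max(range(k), key=lambda i: q[i])

        reward = rng.gauss(true_means[a], 1)   # noisy payout from the chosen arm
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]         # incremental sample-average update
        total_reward += reward

    return total_reward, q

if __name__ == "__main__":
    total, estimates = run_bandit()
    print(f"total reward: {total:.1f}")
```

With epsilon set to 0 the agent would only exploit its initial guesses; increasing epsilon trades some immediate reward for better estimates of the other arms.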
What is model-based machine learning?
• The field of machine learning has seen the development of thousands of learning algorithms. Typically, scientists choose from these algorithms to solve specific problems, and their choices are often limited by their familiarity with the algorithms. In this classical/traditional framework of machine learning, scientists are constrained to making certain assumptions so that an existing algorithm can be used. This contrasts with the model-based machine learning (MBML) approach, which seeks to create a bespoke solution tailored to each new problem.
• The goal of MBML is "to provide a single development framework which supports the creation of a wide range of bespoke models". This framework emerged from the convergence of three key ideas:
  – the adoption of a Bayesian viewpoint,
  – the use of factor graphs (a type of probabilistic graphical model), and
  – the application of fast, deterministic, efficient and approximate inference algorithms.
• The core idea is that all assumptions about the problem domain are made explicit in the form of a model. In this framework, a model is simply a set of assumptions about the world expressed in a probabilistic graphical format, with all the parameters and variables treated as random components.
• Stages of MBML – there are three steps to model-based machine learning:
  – Describe the model: describe the process that generated the data using factor graphs.
  – Condition on observed data: condition the observed variables on their known quantities.
  – Perform inference: perform backward reasoning to update the prior distribution over the latent variables or parameters. In other words, calculate the posterior probability distributions of the latent variables conditioned on the observed variables.

Value Iteration
• Value iteration computes the optimal state value function by iteratively improving the estimate of V(s). The algorithm initializes V(s) to arbitrary random values and repeatedly updates the Q(s, a) and V(s) values until they converge. Value iteration is guaranteed to converge to the optimal values.
• Pseudo-code for the value-iteration algorithm: see Alpaydin, Introduction to Machine Learning, 3rd edition.

Policy Iteration
• The value-iteration algorithm keeps improving the value function at each iteration until the value function converges. However, the agent only cares about finding the optimal policy, and sometimes the optimal policy converges before the value function does. Therefore another algorithm, policy iteration, instead of repeatedly improving the value-function estimate, re-defines the policy at each step and computes the value according to this new policy until the policy converges. Policy iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to converge than the value-iteration algorithm.
• Pseudo-code for the policy-iteration algorithm: see Alpaydin, Introduction to Machine Learning, 3rd edition.

Value-Iteration vs Policy-Iteration
• Both the value-iteration and policy-iteration algorithms can be used for offline planning, where the agent is assumed to have prior knowledge about the effects of its actions on the environment (they assume the MDP model is known). Compared with value iteration, policy iteration is computationally efficient in that it often takes considerably fewer iterations to converge, although each of its iterations is more computationally expensive. A minimal value-iteration sketch follows.
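To make the value-iteration update concrete, here is a minimal sketch on a small, hypothetical MDP. The three-state transition table, the rewards, gamma = 0.9 and the convergence threshold are made up for illustration and are not from the notes; the update itself is the standard Bellman optimality backup, V(s) ← max over a of the expected value of R + gamma·V(s').

```python
# Minimal value-iteration sketch for a small, hypothetical MDP.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.9, 2, 5.0), (0.1, 1, 0.0)]},
    2: {"stay": [(1.0, 2, 0.0)]},          # absorbing state with no further reward
}
gamma, theta = 0.9, 1e-6                   # discount rate, convergence threshold

V = {s: 0.0 for s in P}                    # arbitrary initial values
while True:
    delta = 0.0
    for s in P:
        # Q(s, a) = expected immediate reward + discounted value of the next state
        q_values = [
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        ]
        new_v = max(q_values)              # V(s) = max over actions of Q(s, a)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:                      # stop once the values stop changing
        break

print({s: round(v, 3) for s, v in V.items()})
```

Policy iteration would instead alternate between evaluating a fixed policy on this same transition table and greedily improving it, stopping when the policy no longer changes.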
Temporal Difference Learning
• Temporal difference (TD) learning is an agent learning from an environment through episodes, with no prior knowledge of the environment.
  – Algorithms: TD(0), TD(1), TD(λ)
• Notation for at least some of the hyperparameters (Greek letters that are sometimes intimidating):
  – Gamma (γ): the discount rate. A value between 0 and 1. The higher the value, the less you are discounting.
  – Lambda (λ): the credit-assignment variable. A value between 0 and 1. The higher the value, the more credit you can assign to states and actions further back.
  – Alpha (α): the learning rate, i.e. how much of the error we accept and therefore adjust our estimates towards. A value between 0 and 1. A higher value adjusts aggressively, accepting more of the error, while a smaller value adjusts conservatively, making smaller moves towards the actual values.
  – Delta (δ): a change or difference in value.

TD(1) Algorithm
• The first algorithm we'll start with is TD(1). TD(1) updates our values in the same manner as Monte Carlo: at the end of an episode. Consider a random walk, going left or right at random until landing in 'A' or 'G'. Once the episode ends, the update is made to the prior states. As mentioned above, the higher the lambda value, the further back credit can be assigned, and here it is taken to the extreme, with lambda equal to 1. This is an important distinction, because TD(1) and MC only work in episodic environments, meaning they need a 'finish line' before they can make an update.
• So that's temporal difference learning in a simplified manner. The gist of it is that we make an initial estimate, explore a space, and update our prior estimate based on our exploration efforts. The difficult part of reinforcement learning tends to be deciding where to apply it, what the environment is, and how to set up the rewards properly, but at least for now you understand the exploration of a state space and making estimates with an unsupervised, model-free approach.

Exploration Strategies
• The classical approach to any reinforcement learning (RL) problem is to explore and to exploit: explore to find the most rewarding way of reaching the target, and then keep exploiting that action. Exploration is hard: without proper reward functions, algorithms can end up chasing their own tails for eternity. When we say rewards, think of them as mathematical functions crafted carefully to nudge the algorithm. To be more precise, consider teaching a robotic arm, or an AI playing a strategic game such as Go or chess, to reach a target on its own.

Curiosity Based Exploration
• First introduced by Dr Juergen Schmidhuber in 1991, curiosity in RL models was implemented through a framework of curious neural controllers. This described how a particular algorithm can be driven by curiosity and boredom. It was done by introducing (delayed) reinforcement for actions that increase the model network's knowledge about the world. This, in turn, requires the model network to model its own ignorance, thus showing a rudimentary form of self-introspective behaviour.

Epsilon-greedy
• As the name suggests, the objective of this approach is to identify a promising action and keep on exploiting it 'greedily'. This approach is popularly associated with the multi-armed bandit problem, a simplified RL problem in which the agent has to find the best slot machine to make the most money. The agent explores randomly with probability ϵ and takes the optimal (greedy) action most of the time, with probability 1−ϵ.

Deterministic and Non-Deterministic Rewards & Actions
• In a deterministic setting, taking a given action in a given state always produces the same reward and the same next state, so a value estimate can simply be overwritten with the observed outcome. In a non-deterministic setting, rewards and state transitions follow probability distributions, so the agent must average over outcomes, typically by moving its estimates only part of the way towards each observation using a learning rate α; the sketch below illustrates this with the TD(0) update.

Semi-supervised Learning
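The following sketch ties these pieces together on the random walk mentioned in the TD(1) discussion: interior states B–F between terminal states A and G, with the episode ending at either terminal. The reward scheme (1 for reaching G, 0 otherwise), α = 0.1 and γ = 1.0 are assumptions, since the notes do not specify the payoff. The marked line is the standard tabular TD(0) update, V(s) ← V(s) + α[r + γV(s') − V(s)]; because the outcomes are random, the learning rate α averages over them, which is exactly the adjustment needed when rewards and actions are non-deterministic.

```python
import random

# Random walk from the TD(1) discussion: interior states B..F, terminals A and G.
STATES = list("ABCDEFG")
TERMINAL = {"A": 0.0, "G": 1.0}                      # assumed terminal rewards
alpha, gamma = 0.1, 1.0                              # assumed learning and discount rates

V = {s: 0.5 for s in STATES if s not in TERMINAL}    # initial estimates for interior states

rng = random.Random(0)
for episode in range(1000):
    s = "D"                                          # start in the middle
    while s not in TERMINAL:
        step = rng.choice([-1, 1])                   # move left or right at random
        s_next = STATES[STATES.index(s) + step]
        if s_next in TERMINAL:
            r, v_next = TERMINAL[s_next], 0.0
        else:
            r, v_next = 0.0, V[s_next]
        # TD(0) update: shift V(s) towards the one-step target r + gamma * V(s').
        V[s] += alpha * (r + gamma * v_next - V[s])
        s = s_next

print({s: round(v, 2) for s, v in sorted(V.items())})
```

TD(1), by contrast, would wait until the episode terminates and then push every visited state towards the full return, which is why it, like Monte Carlo, needs an episodic environment.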
• Every machine learning algorithm needs data to learn from. But even with tons of data in the world, including texts, images, time series, and more, only a small fraction is actually labeled, whether algorithmically or by hand.
• Most of the time, we need labeled data to do supervised machine learning. In particular, we use it to predict the label of each data point with the model. Since the data tells us what the label should be, we can calculate the difference between the prediction and the label, and then minimize that difference.
• As you might know, another category of algorithms, called unsupervised algorithms, doesn't need labels but can learn from unlabeled data. Unsupervised learning often works well for discovering new patterns in a dataset and for clustering the data into several categories based on several features. Popular examples are the K-Means and Latent Dirichlet Allocation (LDA) algorithms.
• Now imagine you want to train a model to classify text documents, but you want to give your algorithm a hint about how to construct the categories. You want to use only a very small portion of labeled text documents, because not every document is labeled, and at the same time you want your model to classify the unlabeled documents as accurately as possible based on the documents that are already labeled. This is the semi-supervised setting: learning from a small labeled set together with a large unlabeled set.

Computational learning theory
• Computational learning theory, or statistical learning theory, refers to mathematical frameworks for quantifying learning tasks and algorithms.
  – Computational learning theory uses formal methods to study learning tasks and learning algorithms.
  – PAC learning provides a way to quantify the computational difficulty of a machine learning task.
  – VC dimension provides a way to quantify the computational capacity of a machine learning algorithm.

PAC Learning (Theory of Learning Problems)
• Probably approximately correct learning, or PAC learning, refers to a theoretical machine learning framework developed by Leslie Valiant.
• PAC learning seeks to quantify the difficulty of a learning task and might be considered the premier sub-field of computational learning theory.
• Consider that in supervised learning we are trying to approximate an unknown underlying mapping function from inputs to outputs. We don't know what this mapping function looks like, but we suspect it exists, and we have examples of data produced by it.
• PAC learning is concerned with how much computational effort is required to find a hypothesis (fit model) that is a close match for the unknown target function.

VC Dimension (Theory of Learning Algorithms)
• Vapnik–Chervonenkis theory, or VC theory for short, refers to a theoretical machine learning framework developed by Vladimir Vapnik and Alexey Chervonenkis.
• VC theory seeks to quantify the capability of a learning algorithm and might be considered the premier sub-field of statistical learning theory.
• VC theory comprises many elements, most notably the VC dimension.
• The VC dimension quantifies the complexity of a hypothesis space, i.e. the models that could be fit given a representation and learning algorithm.
• One way to consider the complexity of a hypothesis space (the space of models that could be fit) is the number of distinct hypotheses it contains and perhaps how the space might be navigated. The VC dimension is a clever approach that instead measures the largest number of examples from the target problem that can be discriminated, that is, labeled in every possible way (shattered), by hypotheses in the space.
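As a concrete illustration of shattering, the sketch below checks whether a simple hypothesis class, one-sided thresholds on the real line with h_t(x) = 1 if x ≥ t and 0 otherwise, can realize every labeling of a given point set. The hypothesis class, the sample points, and the brute-force search over candidate thresholds are illustrative choices, not from the notes; this class shatters any single point but no pair of points, so its VC dimension is 1.

```python
from itertools import product

def threshold_predictions(points, t):
    """One-sided threshold hypothesis: h_t(x) = 1 if x >= t else 0."""
    return tuple(1 if x >= t else 0 for x in points)

def is_shattered(points):
    """Check whether the threshold class realizes every labeling of `points`."""
    # Candidate thresholds: at each point, below all points, and above all points.
    candidates = sorted(points) + [min(points) - 1.0, max(points) + 1.0]
    realizable = {threshold_predictions(points, t) for t in candidates}
    # Shattered means every one of the 2^n labelings is realizable.
    return all(labeling in realizable for labeling in product((0, 1), repeat=len(points)))

print(is_shattered([2.0]))        # True: a single point can be labeled 0 or 1
print(is_shattered([1.0, 3.0]))   # False: the labeling (1, 0) is not realizable
# The largest shattered set has size 1, so the VC dimension of this class is 1.
```

Richer classes shatter larger sets, for example intervals on the line (VC dimension 2) or linear separators in the plane (VC dimension 3), and it is this capacity measure that PAC-style sample-complexity bounds build on.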