Reward function we use

All combinations within budget using brute force method
As we can see, the optimal combination of products is [a1, b1, c1, d1], because it maximizes the reward within the budget. The total reward for that combination is 2.3.

Step 1: Generate an episode
• We use an epsilon to adjust how greedy the selection is (epsilon is between 0 and 1).
• We generate an integer 'greedy_select' between 0 and 10:
  If greedy_select < epsilon*10: we randomly select a product for each ingredient.
  If greedy_select > epsilon*10: we select the product with the maximum value for that ingredient.

Step 2: Calculate the terminal reward according to the preference value (reward)
  If the sum of the real costs of the products chosen for all ingredients is larger than the budget: the return is based on the reward, multiplied by the discount factor, for the products in the episode.
  If the sum of the real costs of the products chosen for all ingredients is less than the budget: the return is based on the reward, multiplied by the discount factor, for the products in the episode.

Step 3: Update the value function
• For the products in the episode, update their value function:
• V = V + alpha*(Return_value/num_of_ingredients - V)

Discussion of the return function
• The strategy of the existing code:
  The return value is 1 + avg(sum(reward)) if the total cost of the selected episode is larger than the budget.
  The return value is -1 + avg(sum(reward)) if the total cost of the selected episode is less than the budget.
• The performance of the existing code:
  The total cost is 18.
  The budget is 30.
  It tends to select the cheapest combination instead of making the reward as large as possible. As we can see, the total reward is far from the optimal solution.
• Here are some improvement strategies:
• Strategy one:
  1. Enlarge epsilon during the first quarter of the training epochs, so that selection is more random and explores more possible combinations of products.
  2. After several training epochs, decrease epsilon to pay more attention to the actions with higher values.
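Steps 1 and 3, together with the epsilon schedule of strategy one, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the existing code: the function names, the nested-dictionary layout of `values`, the epsilon levels 0.9 and 0.1, drawing `greedy_select` once per ingredient, and treating the tie greedy_select == epsilon*10 as greedy are all illustrative choices. The return value is passed in rather than computed, because it depends on the return function under discussion.

```python
import random


def epsilon_for_epoch(epoch, num_epochs, eps_explore=0.9, eps_exploit=0.1):
    """Strategy one: large epsilon in the first quarter of training, small afterwards.

    The two epsilon levels are assumed values, not taken from the slides.
    """
    return eps_explore if epoch < num_epochs // 4 else eps_exploit


def generate_episode(values, epsilon, rng):
    """Step 1: choose one product per ingredient, epsilon-greedily.

    values maps ingredient -> {product: value}. Assumption: a fresh
    greedy_select is drawn for every ingredient.
    """
    episode = {}
    for ingredient, product_values in values.items():
        greedy_select = rng.randint(0, 10)  # integer in 0..10, both ends included
        if greedy_select < epsilon * 10:
            # Explore: pick any product of this ingredient at random.
            episode[ingredient] = rng.choice(list(product_values))
        else:
            # Exploit: pick the product with the highest value (ties count as greedy).
            episode[ingredient] = max(product_values, key=product_values.get)
    return episode


def update_values(values, episode, return_value, alpha):
    """Step 3: V = V + alpha * (Return_value / num_of_ingredients - V)."""
    num_of_ingredients = len(episode)
    for ingredient, product in episode.items():
        v = values[ingredient][product]
        values[ingredient][product] = v + alpha * (return_value / num_of_ingredients - v)
```

A training loop would call `epsilon_for_epoch` once per epoch, pass the result to `generate_episode`, compute the return for that episode, and then call `update_values`. Passing a seeded `random.Random` instance as `rng` keeps runs reproducible.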
• Strategy two:
  Modify the return value:
  If the total cost of the selected episode is larger than the budget, the return value is 1 + 3*avg(sum(reward)) for the first quarter of the training epochs, and 1 + 200*avg(sum(reward)) afterwards.
  The return value is -1 if the total cost of the selected episode is less than the budget.

Result after improvement
The cost is 25, while the budget is 30.

Evaluation of the model

Choose the learning rate