Reward function we use
All combinations within budget using the brute-force method
As we can see, the optimal combination of products is [a1,b1,c1,d1] because it maximizes the reward within budget. The total reward for that combination is 2.3.
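The brute-force search can be sketched as below. This is a minimal sketch, not the original code: the costs, rewards, budget value, and the names `CATALOGUE` and `brute_force` are invented for illustration, and "within budget" is assumed to mean total cost less than or equal to the budget.

```python
from itertools import product

# Hypothetical catalogue: ingredient -> {product: (cost, reward)}.
# These numbers are made up for illustration; they are not taken from the slides.
CATALOGUE = {
    "a": {"a1": (8, 0.7), "a2": (5, 0.4)},
    "b": {"b1": (7, 0.6), "b2": (4, 0.3)},
    "c": {"c1": (6, 0.5), "c2": (3, 0.2)},
    "d": {"d1": (8, 0.5), "d2": (6, 0.1)},
}
BUDGET = 30  # hypothetical budget


def brute_force(catalogue, budget):
    """Enumerate every one-product-per-ingredient combination within budget.

    Returns (best_combo, best_reward, feasible), where feasible is a list of
    (combo, total_cost, total_reward) tuples.
    """
    feasible = []
    ingredients = list(catalogue)
    # Cartesian product of the per-ingredient product choices.
    for combo in product(*(catalogue[ing].keys() for ing in ingredients)):
        cost = sum(catalogue[ing][p][0] for ing, p in zip(ingredients, combo))
        reward = sum(catalogue[ing][p][1] for ing, p in zip(ingredients, combo))
        if cost <= budget:  # assumption: equality counts as within budget
            feasible.append((list(combo), cost, reward))
    best = max(feasible, key=lambda row: row[2])
    return best[0], best[2], feasible


best_combo, best_reward, feasible = brute_force(CATALOGUE, BUDGET)
print(best_combo, round(best_reward, 2))
```

The search is exhaustive, so its cost grows with the product of the number of choices per ingredient; it serves here only as a reference answer to compare the learned policy against.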
Step 1 Generate an episode
• We use an epsilon to control how greedy the selection is (epsilon is between 0 and 1).
• We generate an integer ‘greedy_select’ between 0 and 10:
If greedy_select < epsilon*10:
we randomly select a product for each ingredient
If greedy_select > epsilon*10:
we select the product with the maximum value for that ingredient
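The selection rule above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `generate_episode` and the `values` layout are hypothetical, the integer is drawn from 0 to 10 inclusive, the explore/exploit decision is made once per episode, and the case greedy_select == epsilon*10 (not covered by the rule above) is treated as greedy.

```python
import random


def generate_episode(values, epsilon, rng=random):
    """Pick one product per ingredient.

    values: ingredient -> {product: current value estimate} (hypothetical layout).
    Returns ingredient -> chosen product.
    """
    greedy_select = rng.randint(0, 10)  # integer in [0, 10], both ends included
    if greedy_select < epsilon * 10:
        # Explore: a uniformly random product for each ingredient.
        return {ing: rng.choice(list(prods)) for ing, prods in values.items()}
    # Exploit: the product with the maximum value for each ingredient.
    # Assumption: the tie greedy_select == epsilon*10 also lands here.
    return {ing: max(prods, key=prods.get) for ing, prods in values.items()}


values = {
    "a": {"a1": 0.2, "a2": 0.9},
    "b": {"b1": 0.5, "b2": 0.1},
}
print(generate_episode(values, epsilon=0.0))  # epsilon = 0 never explores
```

Because 0 to 10 inclusive gives eleven possible integers, the exploration probability under this sketch is only approximately epsilon.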
Step 2 Calculate the terminal reward according to the preference value (reward)
If the sum of the real costs of the products chosen for each ingredient is larger than the budget:
The return is based on the reward, multiplied by the discount factor, for the products in the episode
If the sum of the real costs of the products chosen for each ingredient is less than the budget:
The return is based on the reward, multiplied by the discount factor, for the products in the episode
Step 3 Update the value function
• For the products in the episode, update their value functions
• V = V + alpha*(Return_value/num_of_ingredients - V)
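The update rule can be written out as a short sketch. The function name `update_values` and the dictionary layout are assumptions made for illustration; only the formula itself comes from the slide.

```python
def update_values(values, episode, return_value, alpha):
    """Apply V = V + alpha * (Return_value / num_of_ingredients - V)
    to every product selected in the episode.

    values:  ingredient -> {product: value estimate} (hypothetical layout)
    episode: ingredient -> chosen product
    """
    num_of_ingredients = len(episode)
    target = return_value / num_of_ingredients  # each product's share of the return
    for ingredient, chosen in episode.items():
        v = values[ingredient][chosen]
        values[ingredient][chosen] = v + alpha * (target - v)
    return values


values = {"a": {"a1": 0.0, "a2": 0.0}, "b": {"b1": 0.0, "b2": 0.0}}
episode = {"a": "a1", "b": "b2"}
update_values(values, episode, return_value=2.0, alpha=0.5)
print(values)
```

Products that were not selected in the episode keep their previous values, so each estimate moves toward the average per-ingredient return of the episodes it took part in.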
Discussion of the return function
• The strategy of the existing code:
The return value is 1+avg(sum(reward)) if the total cost of the selected episode is larger than the budget;
The return value is -1+avg(sum(reward)) if the total cost of the selected episode is less than the budget.
• The performance of the existing code:
The total cost is 18
The budget is 30
It tends to select the cheapest combination rather than making the reward as large as possible.
As we can see, the total reward is far from that of the optimal solution.
• Here are some improvement strategies:
• Strategy one:
1. Enlarge epsilon during the first quarter of the training epochs to make the selection more random and explore more possible combinations of products.
2. After several training epochs, decrease epsilon to pay more attention to the actions with higher values.
• Strategy two:
Modify the return value:
If the total cost of the selected episode is larger than the budget, the return value is 1+3*avg(sum(reward)) for the first quarter of the training epochs, and 1+200*avg(sum(reward)) afterwards.
The return value is -1 if the total cost of the selected episode is less than the budget.
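Both strategies depend on the training epoch, which can be sketched as below. This is a hedged sketch, not the original code: the function names, the `eps_high` and `eps_low` values, and the switch point for epsilon (end of the first quarter) are assumptions. The slide does not say when the multiplier 200 applies, so "after the first quarter" is also an assumption, and avg(sum(...)) is read here as the mean reward of the selected products. Only the positive branch of the modified return is shown; the other branch is the constant -1 stated on the slide.

```python
def epsilon_schedule(epoch, num_epochs, eps_high=0.9, eps_low=0.1):
    """Strategy one: large epsilon in the first quarter of training, small afterwards.

    eps_high and eps_low are hypothetical values, not taken from the slides.
    """
    return eps_high if epoch < num_epochs / 4 else eps_low


def scaled_return(rewards, epoch, num_epochs):
    """Strategy two, positive branch: 1 + k * average reward.

    k = 3 during the first quarter of training and k = 200 afterwards
    (the "afterwards" is an assumption about the slide's intent).
    """
    avg_reward = sum(rewards) / len(rewards)
    k = 3 if epoch < num_epochs / 4 else 200
    return 1 + k * avg_reward


for epoch in (0, 24, 25, 99):
    print(epoch, epsilon_schedule(epoch, 100), scaled_return([0.5, 0.5], epoch, 100))
```

The large multiplier in later epochs makes the reward term dominate the constant 1, which pushes the value estimates toward high-reward products instead of merely cheap ones.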
Result after improvement
The cost is 25, while the budget is 30
Evaluation of the model
Choose the learning rate