Jul 27, 2024

- Continuation of inventory management topics
- Previously discussed: EOQ model, Newsvendor Problem, Deterministic Dynamic Inventory Optimization
- Focus: Stochastic Dynamic Inventory Management
- Introduction to applications of reinforcement learning

- Demand is uncertain and probabilistically distributed (demand distribution known)
- Differentiates from deterministic dynamic inventory problems
- Decision-making involves ordering quantities based on current inventory and demand forecast
- Costs: Ordering Cost (Co) and Holding Cost (Cc/Ch)

- Demand is discrete and an IID random variable with a stationary pmf
- Instantaneous delivery (no lead time)
- Finite warehouse capacity (state variable ≤ M)
- Non-perishable inventory
- Lost sales assumption (unmet demand is lost)

- Current inventory, ordering, and next time period inventory update
- Immediate costs incurred: ordering cost and holding cost
- Focus on optimizing trade-off between ordering and holding costs
- State updates based on realized demand

- Martin Puterman's book on MDPs (Markov Decision Processes)
- Abhijit Gosavi's book on simulation-based optimization
- Use of Bellman equation for policy/value iteration
- Standard notation follows Puterman's book

- Demand is discrete
- No lead time
- IID demand
- Inventory is non-perishable
- Lost sales scenario

- Start of period: Observe beginning inventory (St)
- Place order (At)
- Demand realization (Dt)
- End of period: Calculate next beginning inventory (max(St + At - Dt, 0) under lost sales)
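The per-period sequence above can be sketched as a single transition function. Capacity M and the demand values here are illustrative assumptions, not the lecture's numbers:

```python
import random

M = 10  # warehouse capacity (hypothetical)

def step(s_t, a_t, d_t):
    """Advance one period: order a_t, observe demand d_t, return next inventory."""
    a_t = min(a_t, M - s_t)            # order cannot exceed spare capacity
    s_next = max(s_t + a_t - d_t, 0)   # lost sales: unmet demand vanishes
    return s_next

random.seed(0)
d = random.choice([0, 1, 2, 3])  # demand drawn from an assumed support
print(step(4, 3, d))
```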

- Used for solving dynamic programming problems
- Takes into account current state, action, and expected future rewards
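In Puterman-style notation for the finite-horizon case, the Bellman optimality equation can be written as follows (a standard textbook form, with V_{t+1} the value function of the next period):

```latex
V_t(s) = \max_{a \in A(s)} \Big[ R(s, a) + \sum_{s'} P(s' \mid s, a) \, V_{t+1}(s') \Big]
```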

- Considers revenue, ordering cost, and holding cost
- R(St, At) = revenue (price) - ordering cost (Co) - holding cost (Cc)

- State update: St+1 = max(St + At - Dt, 0) (lost sales scenario)
- Includes state transition probabilities
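Under the lost-sales update, transitions to s' = 0 pool all demand realizations that meet or exceed the stock on hand. A minimal sketch, assuming a hypothetical three-point demand pmf:

```python
# Transition probabilities P(s' | s, a) implied by s' = max(s + a - d, 0).
pmf = {0: 0.25, 1: 0.5, 2: 0.25}  # P(D = d), hypothetical

def transition_probs(s, a, pmf):
    """Return {s_next: probability} given current inventory s and order a."""
    probs = {}
    for d, p in pmf.items():
        s_next = max(s + a - d, 0)
        probs[s_next] = probs.get(s_next, 0.0) + p  # pool mass landing on s_next
    return probs

print(transition_probs(1, 1, pmf))
```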

**Stage Variable:** Time (days, weeks, etc.)
**State Variable:** Inventory level (St)
**Action Variable:** Order quantity (At)

- Governed by the demand distribution (pmf: p_j)
- Transition probabilities give the distribution of the next state, conditioned on the current state and action
- Based on warehouse capacity (M)

- Maximization of total reward over a time horizon (includes immediate and expected future rewards)

**Order-up-to-S Policy:** Order up to level S if beginning inventory < S; otherwise don't order
- Example of a stationary policy
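The order-up-to rule above is simple to state in code; S = 5 here is a hypothetical level:

```python
S = 5  # hypothetical order-up-to level

def order_up_to(s_t, S):
    """If inventory is below S, order the shortfall; otherwise order nothing."""
    return max(S - s_t, 0)

print([order_up_to(s, S) for s in range(8)])  # [5, 4, 3, 2, 1, 0, 0, 0]
```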

- Fixed numerical values for costs and revenues
- Deterministic time horizon: 3 periods
- Demand probabilities provided
- Solving using Bellman equation
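The lecture's example fixed its own numbers; the sketch below substitutes hypothetical values (price 8, Co 4, Cc 1, capacity 3, three periods, a three-point demand pmf) to show the backward Bellman recursion mechanically:

```python
price, c_o, c_h = 8.0, 4.0, 1.0      # hypothetical revenue and cost figures
M, T = 3, 3                          # capacity and horizon (illustrative)
pmf = {0: 0.3, 1: 0.4, 2: 0.3}       # assumed demand pmf

V = {s: 0.0 for s in range(M + 1)}   # terminal values V_T(s) = 0
for t in range(T - 1, -1, -1):       # backward recursion over periods
    V_new = {}
    for s in range(M + 1):
        best = float("-inf")
        for a in range(M - s + 1):   # order cannot exceed spare capacity
            q = 0.0
            for d, p in pmf.items():
                sold = min(s + a, d)
                s_next = max(s + a - d, 0)
                q += p * (price * sold - c_o * a - c_h * s_next + V[s_next])
            best = max(best, q)
        V_new[s] = best
    V = V_new
print(V)  # V_0(s): optimal expected 3-period reward from each starting state
```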

- Iterative algorithm to calculate the value of being in each state
- Uses Bellman equation to iteratively find state values
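For the infinite-horizon discounted variant, value iteration repeatedly applies the Bellman optimality operator until the values stop changing; the parameters below are illustrative assumptions:

```python
price, c_o, c_h, gamma = 8.0, 4.0, 1.0, 0.9  # hypothetical costs and discount
M = 3
pmf = {0: 0.3, 1: 0.4, 2: 0.3}               # assumed demand pmf

def bellman_backup(V):
    """One sweep of the Bellman optimality operator over all states."""
    V_new = {}
    for s in range(M + 1):
        best = float("-inf")
        for a in range(M - s + 1):
            q = sum(p * (price * min(s + a, d) - c_o * a
                         - c_h * max(s + a - d, 0)
                         + gamma * V[max(s + a - d, 0)])
                    for d, p in pmf.items())
            best = max(best, q)
        V_new[s] = best
    return V_new

V = {s: 0.0 for s in range(M + 1)}
for _ in range(500):                          # iterate to numerical convergence
    V_next = bellman_backup(V)
    if max(abs(V_next[s] - V[s]) for s in V) < 1e-8:
        break
    V = V_next
print(V_next)
```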

- Combines policy evaluation and improvement steps
- Iteratively refines policy to maximize rewards
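A policy-iteration sketch on the same hypothetical model: evaluate the current policy, then improve it greedily, until the policy stops changing:

```python
price, c_o, c_h, gamma = 8.0, 4.0, 1.0, 0.9  # hypothetical costs and discount
M = 3
pmf = {0: 0.3, 1: 0.4, 2: 0.3}               # assumed demand pmf

def expected_q(s, a, V):
    """Expected one-step reward plus discounted continuation value."""
    return sum(p * (price * min(s + a, d) - c_o * a
                    - c_h * max(s + a - d, 0)
                    + gamma * V[max(s + a - d, 0)])
               for d, p in pmf.items())

policy = {s: 0 for s in range(M + 1)}        # start from "never order"
while True:
    V = {s: 0.0 for s in range(M + 1)}       # policy evaluation (iterative)
    for _ in range(1000):
        V = {s: expected_q(s, policy[s], V) for s in V}
    new_policy = {s: max(range(M - s + 1), key=lambda a: expected_q(s, a, V))
                  for s in V}                 # greedy improvement step
    if new_policy == policy:                  # stable policy: done
        break
    policy = new_policy
print(policy)
```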

- Model-free learning method for MDPs
- Used when transition probabilities are unknown
- Q-value updates to find optimal policy
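A tabular Q-learning sketch on the same hypothetical model, treated as model-free: the agent samples demand from a simulator rather than enumerating the pmf, and updates Q-values from each observed transition:

```python
import random

price, c_o, c_h, gamma = 8.0, 4.0, 1.0, 0.9   # hypothetical costs and discount
M, alpha, eps = 3, 0.1, 0.1                   # capacity, step size, exploration
demands, weights = [0, 1, 2], [0.3, 0.4, 0.3] # simulator's hidden demand pmf

random.seed(42)
Q = {(s, a): 0.0 for s in range(M + 1) for a in range(M - s + 1)}
s = 0
for _ in range(50_000):
    actions = range(M - s + 1)
    if random.random() < eps:                 # epsilon-greedy exploration
        a = random.choice(list(actions))
    else:
        a = max(actions, key=lambda x: Q[(s, x)])
    d = random.choices(demands, weights)[0]   # sampled, not enumerated
    r = price * min(s + a, d) - c_o * a - c_h * max(s + a - d, 0)
    s_next = max(s + a - d, 0)
    target = r + gamma * max(Q[(s_next, x)] for x in range(M - s_next + 1))
    Q[(s, a)] += alpha * (target - Q[(s, a)]) # Q-value update
    s = s_next

greedy = {s: max(range(M - s + 1), key=lambda a: Q[(s, a)])
          for s in range(M + 1)}              # greedy policy from learned Q
print(greedy)
```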

- Python demonstration of solving the example numeric problem
- Introduction to more advanced methods (TD, etc.) in subsequent sessions