Decision-making optimization in realistic engineering settings must account for the existence of operational constraints. An efficient policy is not necessarily one that optimizes an objective with 'all options on the table', but rather one that can nimbly adapt in the face of exogenous environmental constraints.
Such constraints can be either 'hard', deterministically delimiting the decision-maker's degrees of freedom according to a known capacity (e.g. a fixed budget), or 'soft', satisfying a known capacity probabilistically or in expectation (e.g. an acceptable level of failure probability). In addition, owing to the temporal and sequential nature of the decision problem, both types of constraints generally correspond to long-term metrics (e.g. a 5-year budget or a 50-year probability of failure, respectively).
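In a minimal form, such a soft (expectation-type) constraint turns the life-cycle planning problem into a constrained MDP; the notation below is illustrative, with $r_t$ the reward, $c_t$ the constrained cost (e.g. annual expenditure), and $b$ the long-term capacity:

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, c_{t}\right] \le b
```

A hard constraint would instead require the accumulated cost to respect the capacity along every realized trajectory, not merely in expectation.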
Extending the multi-agent actor-critic DRL concept introduced in the DRL-POMDP project through state augmentation and Lagrange multipliers in order to account for such types of constraints, a 10-component deteriorating system is examined under various constrained scenarios. In the figure below, statistical features of the learned policies are shown for two cases of 5-year budget constraints and the unconstrained case (from right to left).
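The Lagrange-multiplier idea can be sketched compactly: the constrained problem is relaxed into an unconstrained one whose reward is penalized by the multiplier-weighted cost, and the multiplier itself is adapted by dual ascent on the observed constraint violation. The snippet below is a minimal illustration of this mechanism only, with hypothetical helper names (`lagrangian_update`, `shaped_reward`) and toy numbers; it is not the project's implementation.

```python
def lagrangian_update(lmbda, avg_cost, budget, lr=0.05):
    """Dual-ascent step on the multiplier: grow it while the expected
    cost exceeds the budget, shrink it otherwise, and keep it >= 0."""
    return max(0.0, lmbda + lr * (avg_cost - budget))

def shaped_reward(reward, cost, lmbda):
    """Lagrangian-relaxed reward seen by the (now unconstrained) actor-critic."""
    return reward - lmbda * cost

# Toy run: the multiplier rises while average costs overshoot the
# 5-year budget proxy, then relaxes once the policy becomes feasible.
lmbda, budget = 0.0, 1.0
for avg_cost in [1.8, 1.5, 1.2, 1.0, 0.9]:
    lmbda = lagrangian_update(lmbda, avg_cost, budget)
```

In the actual multi-agent setting, the remaining budget (or risk capacity) is additionally appended to the agents' state via state augmentation, so that the learned policies can condition directly on how much capacity is left.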
It is observed that the presence of budget constraints forces the agents to cluster their actions in specific system parts, autonomously emphasizing the prioritization of certain components. For example, it is notable that repair and replacement resources are primarily designated to components 3, 4, 8, 9, which are the components with the most aggressive deterioration model. In the low-budget case, inspections are mainly conducted for component 3, the component whose functionality is most likely to cause failure of the system's most reliable path. Overall, the value of information provided by inspections fades in lower-budget scenarios, since the agents prefer to reserve inspection resources for major interventions in the event of a disruption (e.g. link failure).
Besides the apparent diversification of the different component policies, a notable pattern in this risk-constrained case is that the agents develop opportunistic strategies when failure events occur in components of the same link; similarly, intervention actions, which generally cause partial or total link closures, are synchronized for same-link components. The agents develop this behavior because it minimizes maintenance-induced network disruptions. As with the re-prioritization of inspection and maintenance resources under budget constraints shown previously, this behavior is learned autonomously by the agents, without any user-defined enforcement or implicit reward-based penalization/motivation.
Andriotis, C.P., and Papakonstantinou, K.G., “Deep reinforcement learning driven inspection and maintenance planning under incomplete information and constraints”, Reliability Engineering & System Safety (under review), arXiv preprint arXiv:2007.01380, 2020. [Link]
Andriotis, C.P., and Papakonstantinou, K.G., “Managing engineering systems with large state and action spaces through deep reinforcement learning”, Reliability Engineering & System Safety, 191 (11), 106483, 2019. [Link]
Andriotis, C.P., and Papakonstantinou, K.G., “Life-cycle policies for large engineering systems under complete and partial observability”, 13th International Conference on Applications of Statistics and Probability in Civil Engineering (ICASP), Seoul, South Korea, 2019. [Link]