A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

IEEE Trans Cybern. 2019 Aug; doi: 10.1109/TCYB.2019.2932203. Epub 2019-08-21.

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential.

Zhen Zhang
Yew-Soon Ong
Dongqing Wang
Binqiang Xue

PMID: 31443061 DOI: 10.1109/TCYB.2019.2932203.

Abstract

Gradient-based methods are widely used in multiagent reinforcement learning (MARL). In a gradient-based MARL algorithm, each agent updates its parameterized strategy in the direction of the gradient of some performance index. However, few studies address the convergence of existing gradient-based MARL algorithms in identical interest games. In this article, we propose a policy gradient potential (PGP) algorithm that uses the PGP, rather than the gradient itself, as the source of information for guiding the strategy update, in order to learn the optimal joint strategy with maximal global reward. Because the payoff matrix and the joint strategy are often unavailable to the learning agents in practice, we take the probability of obtaining the maximal reward as the performance index. Theoretical analysis of the PGP algorithm on the continuous model involving an identical interest repeated game shows that if the component action of every optimal joint action is unique, the critical points corresponding to all optimal joint actions are asymptotically stable. The PGP algorithm is studied experimentally and compared against other MARL algorithms on two commonly used collaborative tasks (the robots leaving a room task and the distributed sensor network task) and on a real-world minefield navigation problem in which only local state and local reward information are available. The results show that the PGP algorithm outperforms the other algorithms in terms of cumulative reward and the number of time steps used in an episode.
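
To make the setting concrete, the following is a minimal sketch of the plain gradient-based baseline the abstract describes, not of the PGP algorithm itself, whose update rule is not given here. Two agents play an identical interest repeated game, and each one moves its own strategy parameters along the gradient of the probability of obtaining the maximal reward. The softmax parameterization, the learning rate, the payoff values, and the function names `prob_max_reward` and `gradient_ascent` are all assumptions made for illustration. The sketch also computes the exact gradient from the payoff matrix, which the abstract notes is usually not available to real learning agents.

```python
import numpy as np


def softmax(theta):
    """Map unconstrained parameters to a probability vector (assumed parameterization)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def prob_max_reward(payoff, pi1, pi2):
    """Probability that a joint action drawn from (pi1, pi2) earns the maximal payoff."""
    optimal = (payoff == payoff.max()).astype(float)
    return float(pi1 @ optimal @ pi2)


def gradient_ascent(payoff, steps=2000, lr=0.5, seed=0):
    """Independent gradient ascent by two agents on the shared performance index.

    Illustrative baseline only: it uses the exact gradient, which requires
    knowing the payoff matrix and the other agent's strategy.
    """
    rng = np.random.default_rng(seed)
    theta1 = rng.normal(scale=0.1, size=payoff.shape[0])
    theta2 = rng.normal(scale=0.1, size=payoff.shape[1])
    optimal = (payoff == payoff.max()).astype(float)
    for _ in range(steps):
        pi1, pi2 = softmax(theta1), softmax(theta2)
        # Gradient of the index with respect to each agent's action probabilities,
        # holding the other agent's strategy fixed.
        g1 = optimal @ pi2
        g2 = optimal.T @ pi1
        # Chain rule through softmax: dJ/dtheta_i = pi_i * (g_i - pi . g).
        theta1 += lr * pi1 * (g1 - pi1 @ g1)
        theta2 += lr * pi2 * (g2 - pi2 @ g2)
    return softmax(theta1), softmax(theta2)


if __name__ == "__main__":
    # Hypothetical 3x3 shared payoff matrix with a single optimal joint action (0, 0).
    payoff = np.array([[10.0, 0.0, 0.0],
                       [0.0, 5.0, 0.0],
                       [0.0, 0.0, 5.0]])
    pi1, pi2 = gradient_ascent(payoff)
    print("agent 1 strategy:", pi1)
    print("agent 2 strategy:", pi2)
    print("P(max reward):", prob_max_reward(payoff, pi1, pi2))
```

In this toy game the optimal joint action is unique, so both agents' strategies concentrate on it. This says nothing about games with several optimal joint actions, which are the cases the abstract's stability result addresses.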