An Adaptive Strategy via Reinforcement Learning for the Prisoner's Dilemma Game

Lei Xue, Changyin Sun, Member, IEEE, Donald Wunsch, Fellow, IEEE, Yingjiang Zhou, and Fang Yu

Abstract—The iterated prisoner's dilemma (IPD) is an ideal model for analyzing interactions between agents in complex networks. It has attracted wide interest in the development of novel strategies since the success of tit-for-tat in Axelrod's tournament. This paper studies a new adaptive strategy for the IPD in different complex networks, where agents can learn and adapt their strategies through a reinforcement learning method. A temporal difference learning method is applied to design the adaptive strategy and optimize the decision making process of the agents. Previous studies indicated that mutual cooperation is hard to achieve in the IPD. Therefore, three examples based on the square lattice network and the scale-free network are provided to show two features of the adaptive strategy. First, mutual cooperation can be achieved by a group of adaptive agents on the scale-free network, and once the evolution has converged to mutual cooperation, it is unlikely to shift. Second, the adaptive strategy earns a better payoff than the other strategies on the square lattice network. The analytical properties of the adaptive strategy are discussed to verify its evolutionary stability.

Index Terms—Complex network, prisoner's dilemma, reinforcement learning, temporal difference learning.

Manuscript received March 9, 2016; accepted October 31, 2016. This work was supported by the National Natural Science Foundation (NNSF) of China (61603196, 61503079, 61520106009, 61533008), the Natural Science Foundation of Jiangsu Province of China (BK20150851), China Postdoctoral Science Foundation (2015M581842), Jiangsu Postdoctoral Science Foundation (1601259C), Nanjing University of Posts and Telecommunications Science Foundation (NUPTSF) (NY215011), the Priority Academic Program Development of Jiangsu Higher Education Institutions, the open fund of the Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education (MCCSE2015B02), and the Research Innovation Program for College Graduates of Jiangsu Province (CXLX1309). Recommended by Associate Editor Qinglai Wei. (Corresponding author: Changyin Sun.)

Citation: L. Xue, C. Y. Sun, D. Wunsch, Y. J. Zhou, and F. Yu, "An adaptive strategy via reinforcement learning for the prisoner's dilemma game," IEEE/CAA J. of Autom. Sinica, vol. 5, no. 1, pp. 301-310, Jan. 2018.

L. Xue and C. Y. Sun are with the Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, School of Automation, Southeast University, Nanjing 210096, China (e-mail: rayxue@seu.edu.cn; cysun@seu.edu.cn). D. Wunsch is with the Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409, USA (e-mail: dwunsch@mst.edu). Y. J. Zhou is with the College of Automation, Nanjing University of Posts and Telecommunications, Nanjing 210023, China (e-mail: zhouyj@njupt.edu.cn). F. Yu is with the Institute of Logistics Science and Engineering, Shanghai Maritime University, Shanghai 210096, China (e-mail: fangyu1985@hotmail.com).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JAS.2017.7510466
I. INTRODUCTION

Game theory has become the natural mathematical method for discussing strategic and social interactions, particularly in a competitive environment [1]. The prisoner's dilemma, which arises in many areas, serves as a useful tool for studying human behavior in various social settings and has contributed insights to engineering science, economics, game theory, the analysis of social network structures, and psychology [2]. The iterated prisoner's dilemma (IPD) is a widely used model for analyzing the individual behavior of an agent within a given system. In the IPD, mutual cooperation provides the highest total income, although selfish individual reasoning often leads to other choices. There are many examples of the prisoner's dilemma in real life, where people have to choose between being selfish and being altruistic. For this reason, the famous computer tournaments for the IPD were held by Robert Axelrod [3]. He invited game theorists to submit strategies for playing the IPD. The highest payoff was earned by the strategy named "tit-for-tat", which cooperates in the first round and then repeats what the opponent did in the previous move. Some impressive results were collected in [3]; the relevant advice for people facing a prisoner's dilemma can be summarized as follows:
1) don't be envious;
2) don't be the first to defect;
3) reciprocate both cooperation and defection;
4) don't be too clever.
In the one-shot prisoner's dilemma, there is no doubt that betrayal earns the best payoff for the agent. However, people seldom face the choice between being selfish and being altruistic only once. Scholars have therefore turned their attention to seeking mutual cooperation in the IPD. The spatial evolutionary game demonstrated that local interactions within a spatial structure can maintain cooperative behavior [4]. Reference [4] dealt with the relative merits of various strategies when players who recognize each other meet repeatedly. This spatial version of the prisoner's dilemma can generate chaotically changing spatial patterns. Reference [5] introduced a measure for the cluster shape and demonstrated that the macroscopic patterns can be used to determine the characteristics of the underlying microscopic interactions. Reference [6] studied the competition and strategy selection within a class of generalized strategies.
With the development of evolutionary computation, many scholars have made significant contributions to the research on the IPD [7]. Different agents within IPD games may utilize different strategies, and numerous techniques have been introduced to identify or modify IPD strategies. Some strategies are fixed and can be implemented using a finite state machine or a Markov decision process [8]. Other strategies are adaptive and based on different representation schemes [9], [10]. In order to determine which kind of strategy performs better under specific conditions, some studies have used fingerprints and agent-case embeddings to analyze them [11]-[13].
The existing research results also indicate that evolutionary strategies can compete well against some of the fixed strategies [14], [15]. Inspired by these outstanding results, we propose an adaptive strategy based on a temporal difference learning method which considers both the achievement of mutual cooperation and good performance.
The reinforcement learning paradigm can be applied to solve many practical problems [16], [17]. References [18]-[20] applied reinforcement learning methods to solve nonlinear system problems. Moreover, learning to predict involves using past experience with an incompletely known system to predict its future behavior, and this type of learning is significant for the IPD. Reinforcement learning is an effective way to teach an agent how to make decisions based on previous experience. Reference [20] brought together cooperative control, reinforcement learning, and game theory to present a multi-agent formulation for the online solution of team games. In [21], the authors explored interactions in a coevolving population of model-based adaptive agents and fixed non-adaptive agents, and identified that reinforcement learning can serve as a useful tool for developing an adaptive strategy. Hingston and Kendall also incorporated reputation into the existing co-evolutionary learning model for IPD games, where reputation is the mechanism for evolving cooperative behaviors [22]. Reference [23] designed a team of agents which can reach consensus over a common value according to a cooperative game theory approach. References [24], [25] investigated the evolution of cooperation in the prisoner's dilemma when individuals change their strategies subject to performance evaluation of their neighbors over variable time horizons. The main contributions of this paper are as follows:
1) A temporal difference (TD) learning method is applied to design an adaptive strategy. The feature of the adaptive strategy is that it can balance the short-term rational decision for self-interest against the long-term decision for the overall interest. The evolutionary stability of the adaptive strategy is studied.
2) Three kinds of tournaments based on different complex networks are provided. In the tournaments on the square lattice network, which contains different strategies, the adaptive strategy earns a better payoff. On the scale-free network constituted by adaptive agents, all the agents cooperate with each other for the long-term reward.
The simulation results therefore verify that the adaptive strategy is willing to choose cooperation without losing competitiveness.
This paper is organized as follows. In Section II, the IPD is introduced. In Section III, the TD(λ) method for the prisoner's dilemma is presented. In Section IV, three tournaments are given to verify the feasibility of the adaptive strategy. Section V states the conclusions of our study.
II. ITERATED PRISONER'S DILEMMA

Life is filled with paradoxes and dilemmas. A very lifelike paradox is the "prisoner's dilemma", discovered by Melvin Dresher and Merrill Flood [12]. The prisoner's dilemma is a canonical example of a two-player non-zero-sum game. Each agent has two options in each round: one is to cooperate (C), and the other is to defect (D). Based on its choice, the agent receives a payoff governed by a payoff matrix, as shown in Table I, where R is the payoff when both agents choose cooperation. When only one of the agents defects, it receives a payoff T and the opponent receives a payoff S. If both agents decide to defect, each receives a payoff P. The basic rule of the payoff matrix is T > R > P > S and 2R > T + S.
The standard IPD is played repeatedly between two players, each with its own strategy. They may have different strategies, which can be represented by lookup tables, finite-state machines, or neural networks. An IPD based on strategies represented by finite state machines can be analyzed as a Markov process, which allows an average score to be determined for any pair of strategies using standard techniques from stochastic processes [12]. Some typical fixed strategies are described in Table II [26].

TABLE I
IPD PAYOFF MATRIX

Agent 1 \ Agent 2    Cooperate    Defect
Cooperate            R \ R        S \ T
Defect               T \ S        P \ P

TABLE II
SOME FIXED IPD STRATEGIES

Name               Behavior
Always defect      Always plays D.
Always cooperate   Always plays C.
Tit-for-tat        Plays C initially and then repeats the other player's last action.
GRIM               Always defects if the opponent ever defects.
Pavlov             Plays C initially and then cooperates thereafter if its action and its opponent's action matched during the previous round.
Tit-for-2tit       Chooses D only if its opponent has chosen D for its last two moves.
Ripoff             Alternates between C and D until its opponent chooses D for the first time. On the round after this defection, it cooperates and then plays tit-for-tat thereafter.
Psycho             Chooses D initially and then plays the opposite of its opponent's last action.
Random             Simply flips a fair coin to decide how to play. It cannot be represented by a finite state machine.
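The fixed strategies in Table II are simple enough to write down directly. The following Python sketch is our own illustration (the helper names play_round and PAYOFF are hypothetical, and the concrete payoff values are taken from Table III later in the paper); it encodes the Table I payoffs and a few Table II strategies as plain functions of the two action histories.

```python
# A minimal sketch of the Table I payoffs and a few Table II fixed strategies.
# Function and variable names are illustrative, not taken from the paper.
C, D = "C", "D"

# Payoff matrix (Table I): PAYOFF[(my_action, opponent_action)] = my payoff,
# with the usual ordering T > R > P > S and 2R > T + S.
R, S, T, P = 3, 0, 4, 1          # concrete values used later in Table III
PAYOFF = {(C, C): R, (C, D): S, (D, C): T, (D, D): P}

def always_defect(my_hist, opp_hist):
    return D

def tit_for_tat(my_hist, opp_hist):
    return C if not opp_hist else opp_hist[-1]

def grim(my_hist, opp_hist):
    return D if D in opp_hist else C

def pavlov(my_hist, opp_hist):
    if not my_hist:
        return C
    return C if my_hist[-1] == opp_hist[-1] else D

def play_round(strategy_1, strategy_2, hist_1, hist_2):
    """Play one IPD round and return the two payoffs."""
    a1, a2 = strategy_1(hist_1, hist_2), strategy_2(hist_2, hist_1)
    hist_1.append(a1); hist_2.append(a2)
    return PAYOFF[(a1, a2)], PAYOFF[(a2, a1)]
```

Iterating play_round with tit_for_tat against grim, for example, yields mutual cooperation in every round, matching the behavior described in Table II.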
Definition 1: Denote by a an n-tuple of mixed actions; if a = (a_1, a_2, ..., a_n), then the payoff of a can be written as U = (U_1, U_2, ..., U_n). For convenience we introduce the substitution notation a = (a_{-i}; a_i), where a_{-i} = (a_1, a_2, ..., a_{i-1}, a_{i+1}, ..., a_n).

Definition 2 (Nash Equilibrium) [3]: A joint action set (a_{-i}; a_i^*) is a pure Nash equilibrium point if, for the i-th player,

U_i(a_{-i}; a_i^*) = \max_{a_i} U_i(a_{-i}; a_i)    (1)

During the IPD, a Nash equilibrium (NE) can be reached when each agent maximizes its own payoff. If the agents are selfish and want to maximize their own payoffs without considering the interests of the others, mutual defection is obtained as an NE, because one of the two following situations will occur:
1) If the opponent chooses C, the agent may receive payoff R or T; since T > R, the agent will choose D.
2) If the opponent chooses D, the agent may receive payoff S or P; since P > S, the agent will choose D.
Therefore, mutual defection is the unique NE, which represents a better short-term payoff. However, Axelrod showed that although mutual defection yields a better short-term reward [3], mutual cooperation is a better solution in the long run. Furthermore, when the IPD presents more than two choices, the evolution of defection may result from strategies effectively having more opportunities to exploit others when more choices exist [11]. Each of the fixed strategies shown in Table II has its own personality and makes decisions based on the opponent's moves. However, the fixed strategies are passive: they choose their actions based on history without considering the actions of the next step. The main contribution of this paper is the design of a competitive adaptive strategy which can predict the opponent's actions and achieve mutual cooperation without losing competitiveness. In the next section, the reinforcement learning method is introduced to design the adaptive strategy.
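As a quick numerical check of the dominance argument above, the short sketch below (our own illustration, using the concrete payoffs R = 3, T = 4, S = 0, P = 1 that appear later in Table III) enumerates both possible opponent actions and confirms that D is the better reply to each, while mutual cooperation still pays more per round than mutual defection.

```python
# Dominance check for the one-shot prisoner's dilemma (illustrative values from Table III).
R, S, T, P = 3, 0, 4, 1
payoff = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}

for opp in ("C", "D"):
    best = max(("C", "D"), key=lambda me: payoff[(me, opp)])
    print(f"opponent plays {opp}: best reply is {best}")   # D in both cases

# Yet (C, C) gives each agent 3 per round versus 1 for (D, D), which is why
# mutual cooperation is preferable in the long run (and 2R > T + S holds: 6 > 4).
print(payoff[("C", "C")], ">", payoff[("D", "D")])
```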
III. TD(λ) METHOD FOR THE PRISONER'S DILEMMA

During the IPD, the known information includes the decisions and payoffs of the previous steps. Therefore, the goal of our study is to solve a multi-step prediction problem: how to teach the agent to predict its own future scores based on the available options. Learning to predict involves using past experience with an unknown system to predict its future behavior. One advantage of prediction learning is that the training examples can be taken directly from the temporal sequence of ordinary sensory input; no special supervisor or teacher is required. TD learning is a prediction method which combines Monte Carlo and dynamic programming ideas. It learns by sampling the environment according to some policy and then updates its current estimate based on previously learned estimates.

A. Adaptive Strategy Design by the TD(λ) Method

In this paper, the adaptive agent should have the following features. An adaptive agent should consider the long-term reward based on the situation; in other words, an adaptive agent should learn cooperative coevolution. For instance, if the agent identifies that the opponent will cooperate, the adaptive agent should choose to cooperate. If the opponent chooses to defect, the adaptive agent should choose to defect in order to protect its payoff. Learning to cooperate with others is a significant characteristic of human beings; therefore, for an adaptive agent, learning to cooperate is vital.
As a prediction method, TD learning can be adjusted to better match the observations whenever observation is possible. TD methods are also more incremental, easier to compute, and tend to make more efficient use of their experience.
Based on the model mentioned in [1], the TD(λ) learner can be expressed as in Fig. 1.

Fig. 1. The decision making process for the adaptive agent.

Let the epochs be t = 1, 2, ..., T. The scores of the adaptive agent and its opponent are S_{adp}(t) and S_{opp}(t), while the forecast cooperative and defective earnings of the adaptive agent at time t are S_{cp}(t) and S_{dp}(t), respectively. The forecast earnings of the adaptive agent are calculated as follows:

T_{cp}(t+1) = \sum_{i=1}^{t} P_C^{t-i} (S_{adp}(i) - S_{opp}(i))    (2)

T_{dp}(t+1) = \sum_{i=1}^{t} P_D^{t-i} (S_{adp}(i) - S_{opp}(i))    (3)

S_{cp}(t+1) = T_{cp}(t+1) + 2 P_C(t) R    (4)

S_{dp}(t+1) = T_{dp}(t+1) + 2 P_D(t) P    (5)

where P_C ∈ (0, 1) and P_D ∈ (0, 1) represent the probability of cooperation and defection, respectively, for the adaptive agent, and R, T, S and P are the payoffs of the IPD, as shown in Table I.
Therefore, the TD(λ) learner can be described as follows:
1) Initialization: the state set and action set of the i-th agent are Z = {C, D}, where C is cooperate and D is defect. The payoff matrix of the agents is shown in Table I.
2) Calculate S_{cp}(t+1) and S_{dp}(t+1) by (2)-(5).
3) The agent makes a decision based on the decision making process (DMP).
4) Return to 2) until the iteration stops.
The DMP is as follows:
1) If S_{cp}(t+1) > S_{dp}(t+1), the adaptive agent will cooperate with the opponent. The probability of cooperating increases to P_C(t+1) = P_C(t) + F(P_C(t+1)); the probability of defecting decreases to P_D(t+1) = P_D(t) - F(P_D(t+1)).
2) If S_{cp}(t+1) < S_{dp}(t+1), the adaptive agent will defect against the opponent. The probability of defecting increases to P_D(t+1) = P_D(t) + F(P_D(t+1)); the probability of cooperating decreases to P_C(t+1) = P_C(t) - F(P_C(t+1)).
3) If S_{cp}(t+1) = S_{dp}(t+1), the adaptive agent will cooperate with the opponent. However, the probability of defecting is not reduced; only the probability of cooperating increases, to P_C(t+1) = P_C(t) + F(P_C(t+1)). In this way, the adaptive agent is encouraged to choose cooperation for the long-term team reward. The function F(ε) is a modified Fermi function:

F(\varepsilon(t+1)) = \frac{1}{1 + e^{[\varepsilon(t) - \varepsilon(t-1)]/k}}    (6)

The DMP of the adaptive agent clearly indicates that its decisions rely not only on the previous steps but also on the learning method of the adaptive agent. P_D and P_C decide the tendency of the adaptive agent. Moreover, since P_C, P_D ∈ [0, 1], the factors P_D^{t-i} and P_C^{t-i} decrease as t - i grows. In other words, the effect of the last step is greater than that of the earlier steps, and the effect of the first decision in the IPD becomes increasingly weak as the number of iterations increases. To investigate the convergence of this scheme, the analytical properties are discussed in the following subsection.
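To make the update rules concrete, the following Python sketch implements one possible reading of (2)-(6) and the DMP for a single adaptive agent. It is our own illustration under stated assumptions: the class name, the clipping of P_C and P_D to [0, 1], and the values of k and the initial probabilities are choices made for the sketch, not values reported in the paper.

```python
import math

# A sketch of the adaptive agent defined by (2)-(6) and the DMP in Section III-A.
# Names, the clipping to [0, 1], and the initial P_C, P_D and k are assumptions.
R_PAY, P_PAY = 3, 1                           # R and P from Table III; only these enter (4)-(5)

class AdaptiveAgent:
    def __init__(self, p_c=0.5, p_d=0.5, k=0.1):
        self.p_c, self.p_d, self.k = p_c, p_d, k
        self.prev_pc, self.prev_pd = p_c, p_d
        self.score_diffs = []                 # S_adp(i) - S_opp(i), i = 1..t

    def _fermi(self, curr, prev):
        # Modified Fermi function (6): F = 1 / (1 + exp([eps(t) - eps(t-1)] / k)).
        return 1.0 / (1.0 + math.exp((curr - prev) / self.k))

    def choose(self):
        t = len(self.score_diffs)
        # Forecast earnings, eqs. (2)-(5).
        t_cp = sum(self.p_c ** (t - 1 - i) * d for i, d in enumerate(self.score_diffs))
        t_dp = sum(self.p_d ** (t - 1 - i) * d for i, d in enumerate(self.score_diffs))
        s_cp = t_cp + 2 * self.p_c * R_PAY
        s_dp = t_dp + 2 * self.p_d * P_PAY
        # Decision making process (DMP).
        f_c = self._fermi(self.p_c, self.prev_pc)
        f_d = self._fermi(self.p_d, self.prev_pd)
        self.prev_pc, self.prev_pd = self.p_c, self.p_d
        if s_cp >= s_dp:
            action = "C"
            self.p_c = min(1.0, self.p_c + f_c)
            if s_cp > s_dp:                   # case 3) leaves P_D unchanged
                self.p_d = max(0.0, self.p_d - f_d)
        else:
            action = "D"
            self.p_d = min(1.0, self.p_d + f_d)
            self.p_c = max(0.0, self.p_c - f_c)
        return action

    def observe(self, my_payoff, opp_payoff):
        # Record S_adp(t) - S_opp(t) for the next forecast.
        self.score_diffs.append(my_payoff - opp_payoff)
```

Pairing this agent with the fixed strategies sketched after Table II (calling choose() each round and feeding the realized payoffs back through observe()) gives a quick way to probe the behavior described above; the exact trajectories depend on the assumed k, initial probabilities, and clipping.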
B. The Analytical Properties of the Adaptive Strategy

Different IPD strategies use different history lengths of memory to determine their choices. In a finite-length IPD of L rounds, the largest history length a strategy can access is L. For the strategy presented in this section, the memory size is as long as the length of the finite IPD game. A nontrivial question is whether the adaptive strategy can be the counter strategy (CS) against other strategies. It has been proven that every finite history length can occur in an infinite-length IPD, which can be expressed by the following results.

Definition 3 (Counter Strategy) [26]: A strategy S is a counter strategy (CS) against another strategy S_1 if, for any strategy S',

U(S, S_1) ≥ U(S', S_1)    (7)

Lemma 1 [26]: For any strategy that uses a limited history length, there always exist some strategies with longer memory against which the strategy cannot be a counter strategy.

Theorem 1: The adaptive strategy presented in Section III has a higher probability of being a CS against a fixed strategy in an infinite-length IPD.

Definition 4 [26]: Any strategy that uses a limited history cannot be an evolutionarily stable strategy (ESS) in an infinite-length or indefinite-length IPD. An ESS is a strategy such that, if all the members of a population adopt it, then no mutant strategy can invade the population under the influence of natural selection. Therefore, the condition for a strategy S to be an ESS is that, for any S', U(S, S) ≥ U(S', S).

There is a relationship between CS and ESS: a strategy is an ESS if it is the only CS against all IPD strategies [26]. For the adaptive strategy, the evolutionary stability is discussed as follows.

Theorem 2: For an IPD consisting of fixed strategies and the adaptive strategy, the adaptive strategy presented in Section III is an ESS.

The adaptive strategy designed in this paper has a memory size equal to the length of the IPD game. During the IPD game, the adaptive strategy has the longest memory compared with the other strategies. In the designed game, the adaptive strategy therefore has a higher probability of being the CS against the other fixed strategies and will earn a better payoff when played against them. In the next section, simulations are given to illustrate the effectiveness of the proposed scheme.

IV. SIMULATIONS

In order to measure the effectiveness of agents using the adaptive strategy, three IPD tournaments are simulated in different complex networks. The square lattice network and the scale-free network are introduced as the environments of the simulations. Each game-theoretical tournament can be represented as a tuple G(N, A, F), where N is the number of agents, A = {C, D} is the action set of the agents, with C and D representing cooperate and defect, and F is the payoff function for each action. The practical payoff matrix of the prisoner's dilemma is shown in Table III.

TABLE III
IPD PAYOFF MATRIX

Agent 1 \ Agent 2    Cooperate    Defect
Cooperate            3 \ 3        0 \ 4
Defect               4 \ 0        1 \ 1

The simulations based on the square lattice network and the scale-free network are given in the following sections.

A. Simulations Based on the Square Lattice Network

Based on the characteristics of the 100 × 100 square lattice network, there are N = 10000 agents on the network, indexed i = 1, 2, ..., 10000. In this section, two tournaments are provided to illustrate the effectiveness of the adaptive strategy. In the first tournament, the designed adaptive strategy plays against the fixed strategies listed in Table II. In the second tournament, two strategies based on Q-learning [27] and a self-adaptive method [28] play against the designed strategy. The structure of part of the square lattice network is shown in Fig. 2. Each agent has eight neighbors, and the coordinates of the agents represent their relative positions.

Fig. 2. The structure of the square lattice network.

During each iteration, the i-th agent plays against its neighbors according to the rules of the IPD. Owing to the structure of the lattice network, the adaptive agents can play against agents with all kinds of strategies. The cooperative rate p and the average fitness value are introduced to illustrate the effectiveness of the proposed scheme:

p_t = \frac{N_c}{N}    (8)

where p_t represents the cooperative rate of the t-th iteration, N_c is the number of cooperating agents, and N is the total number of agents;

f_{ave}(Stra) = \frac{\sum_{t=1}^{T} S_t}{T}    (9)

where f_{ave}(Stra) is the average fitness value of strategy Stra, S_t is the payoff of the t-th round-robin game, and T is the total number of iterations.
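The lattice bookkeeping behind (8) and (9) is straightforward to express in code. The sketch below is our own construction: the wrap-around (periodic) boundary of the eight-neighbor lookup is an assumption, since the text only states that each agent has eight neighbors, and the helper names are hypothetical.

```python
GRID = 100                                    # 100 x 100 square lattice, N = 10000 agents

def moore_neighbors(i, j, grid=GRID):
    """The eight neighbors of agent (i, j); the wrap-around boundary is an assumption."""
    return [((i + di) % grid, (j + dj) % grid)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)]

def cooperative_rate(actions):
    """Eq. (8): p_t = N_c / N, the fraction of agents currently playing C."""
    return sum(1 for a in actions.values() if a == "C") / len(actions)

def average_fitness(payoffs_per_round):
    """Eq. (9): average payoff of one strategy over its T round-robin games."""
    return sum(payoffs_per_round) / len(payoffs_per_round)
```

Here actions is assumed to be a mapping from lattice coordinates to the agent's current move, so cooperative_rate(actions) gives p_t directly.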
1) The Tournament Between the Designed Strategy and the Fixed Strategies: In this part, the tournament between the designed strategy based on TD learning and the fixed strategies is provided to verify whether the designed strategy can earn a better payoff than the other fixed strategies. The steps of the tournament are as follows:

Step 1: Initialize the number of iterations to 100. Randomly generate the positions of the agents with different strategies; the numbers of agents with each strategy are equal. The initial states of the agents are cooperation or defection, and the total initial cooperative rate is p_1 = 0.3. Cooperators are marked as red squares and defectors as blue squares.

Step 2: During each round of the round-robin games, the i-th agent plays against its neighbors in the prescribed sequence shown in Fig. 3 and updates its state according to its characteristics.

Fig. 3. The order of each round-robin game.

Step 3: Calculate the cooperative rates and average fitness values to verify the effectiveness of the proposed scheme.

In the first example, the average fitness values of the different strategies in the IPD tournament are given in Table IV. The statistical results illustrate that the adaptive strategy earns a better payoff than the other strategies. In the tournament, defective actions became extinct after 9 generations, and cooperative actions occupy the majority of the population.

TABLE IV
RESULTS OF THE FIRST ROUND-ROBIN IPD

Strategy            Score
Always defect       2.367
Ripoff              2.347
Always cooperate    2.289
Psycho              2.462
Tit-for-tat         2.774
Adaptive strategy   3.135
GRIM                2.763
Pavlov              2.334
Tit-for-2tit        2.597

The simulation results of the first tournament are shown in Fig. 4. The conclusion is that mutual cooperation becomes popular among the agents and spreads fast. A further question is how the initial cooperative rate p_1 influences the result. The simulation results for different initial conditions are shown in Fig. 5. The three initial values p_1 = 0.3, p_1 = 0.5 and p_1 = 0.8 lead to different balance points; however, the equilibrium points are all above 75 percent, which is significantly larger than the percentage of always-cooperate agents. Therefore, mutual cooperation can be achieved among the majority of the agents.

2) The Tournaments Between the Designed Strategy and Other Evolutionary Learning Methods: In this part, two evolutionary learning methods are introduced into the tournaments. One is a Q-learning strategy [27], and the other is a self-adaptive win-stay-shift reference selection strategy [28]. The environment of this tournament is also the square lattice network. The steps of the tournament are as follows:

Step 1: Initialize the number of iterations to 100. Randomly generate the positions of the agents with different strategies on the square lattice network; the three kinds of evolutionary strategies are evenly distributed on the network. The initial states of the agents are cooperate or defect, and the initial cooperative rate is p_1 = 0.3. Cooperators are marked as green squares and defectors as blue squares.

Step 2: During each round of the round-robin, the i-th agent plays against its neighbors in the prescribed sequence shown in Fig. 3 and updates its state according to its own features.

Step 3: Calculate the cooperative rates and average fitness values of the three different evolutionary strategies.

The simulation results of the second tournament are shown in Fig. 6. They illustrate that cooperation spreads fast among the agents. The network becomes stable within 5 iterations, and the cooperative rate at the 100th iteration is 0.904. The results verify that mutual cooperation can be achieved among the agents with evolutionary strategies.
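Finally, the tournament protocol of Steps 1-3 can be outlined as a simulation loop. The sketch below is a deliberately simplified reading of the procedure and reuses the hypothetical play_round, moore_neighbors, cooperative_rate and GRID helpers introduced above; the exact playing order of Fig. 3, each strategy's own state-update rule, and the coloring of the lattice are not reproduced.

```python
import random
from collections import defaultdict

def run_tournament(strategies, iterations=100, grid=GRID, p1=0.3):
    """A simplified round-robin IPD tournament on the square lattice (illustrative only)."""
    cells = [(i, j) for i in range(grid) for j in range(grid)]
    # Step 1: random placement of the competing strategies; initial cooperative rate p_1.
    assigned = {c: random.choice(strategies) for c in cells}
    last_action = {c: ("C" if random.random() < p1 else "D") for c in cells}
    pair_hist = defaultdict(list)             # pair_hist[(a, b)]: a's moves against b
    totals = defaultdict(float)               # cumulative payoff of each agent

    coop_rates = []
    for _ in range(iterations):
        # Step 2: each neighboring pair plays one IPD round per iteration.
        for c in cells:
            for n in moore_neighbors(*c, grid=grid):
                if n <= c:                    # visit each pair only once
                    continue
                pay_c, pay_n = play_round(assigned[c], assigned[n],
                                          pair_hist[(c, n)], pair_hist[(n, c)])
                totals[c] += pay_c
                totals[n] += pay_n
                last_action[c] = pair_hist[(c, n)][-1]
                last_action[n] = pair_hist[(n, c)][-1]
        # Step 3: record the cooperative rate p_t of this iteration, eq. (8).
        coop_rates.append(cooperative_rate(last_action))
    return coop_rates, totals
```

For a quick check, run_tournament can be called with a small grid (for example grid=10) and the fixed strategies from Table II; the full 100 × 100 lattice is considerably slower in pure Python.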