IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL. 5, NO. 1, JANUARY 2018 301
An Adaptive Strategy via Reinforcement Learning for the Prisoner's Dilemma Game
Lei Xue, Changyin Sun, Member, IEEE, Donald Wunsch, Fellow, IEEE, Yingjiang Zhou, and Fang Yu
Abstract—The iterated prisoner's dilemma (IPD) is an ideal model for analyzing interactions between agents in complex networks. It has attracted wide interest in the development of novel strategies since the success of tit-for-tat in Axelrod's tournament. This paper studies a new adaptive strategy for the IPD in different complex networks, where agents can learn and adapt their strategies through a reinforcement learning method. A temporal difference learning method is applied to design the adaptive strategy and optimize the decision-making process of the agents. Previous studies indicated that mutual cooperation is hard to achieve in the IPD. Therefore, three examples based on the square lattice network and the scale-free network are provided to show two features of the adaptive strategy. First, mutual cooperation can be achieved by a group of adaptive agents on a scale-free network, and once the evolution has converged to mutual cooperation, it is unlikely to shift. Second, the adaptive strategy earns a better payoff than other strategies on the square lattice network. The analytical properties are discussed to verify the evolutionary stability of the adaptive strategy.
Index Terms—Complex network, prisoner's dilemma, reinforcement learning, temporal difference learning.
I. INTRODUCTION

Game theory has become the natural mathematical method for discussing strategic and social interactions, particularly in a competitive environment [1]. The prisoner's
Manuscript received March 9, 2016; accepted October 31, 2016. This work was supported by the National Natural Science Foundation (NNSF) of China (61603196, 61503079, 61520106009, 61533008), the Natural Science Foundation of Jiangsu Province of China (BK20150851), the China Postdoctoral Science Foundation (2015M581842), the Jiangsu Postdoctoral Science Foundation (1601259C), the Nanjing University of Posts and Telecommunications Science Foundation (NUPTSF) (NY215011), the Priority Academic Program Development of Jiangsu Higher Education Institutions, the open fund of the Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education (MCCSE2015B02), and the Research Innovation Program for College Graduates of Jiangsu Province (CXLX1309). Recommended by Associate Editor Qinglai Wei. (Corresponding author: Changyin Sun.)
Citation: L. Xue, C. Y. Sun, D. Wunsch, Y. J. Zhou, and F. Yu, "An adaptive strategy via reinforcement learning for the prisoner's dilemma game," IEEE/CAA J. of Autom. Sinica, vol. 5, no. 1, pp. 301−310, Jan. 2018.
L. Xue and C. Y. Sun are with the Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, School of Automation, Southeast University, Nanjing 210096, China (email: rayxue@seu.edu.cn; cysun@seu.edu.cn).
D. Wunsch is with the Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409, USA (email: dwunsch@mst.edu).
Y. J. Zhou is with the College of Automation, Nanjing University of Posts and Telecommunications, Nanjing 210023, China (email: zhouyj@njupt.edu.cn).
F. Yu is with the Institute of Logistics Science and Engineering, Shanghai Maritime University, Shanghai 210096, China (email: fangyu1985@hotmail.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JAS.2017.7510466
dilemma, which exists in many areas, serves as a useful tool for studying human behavior in various social settings and has contributed insights to engineering science, economics, game theory, the analysis of social network structures, and psychology [2]. The iterated prisoner's dilemma (IPD) is a widely used model for analyzing the individual behavior of an agent within a given system. In the IPD, mutual cooperation provides the highest total income, although selfish individual reasoning often leads to other choices. There are many examples of the prisoner's dilemma in real life, where people have to choose between being selfish and being altruistic. Against this background, the famous computer tournaments for the IPD were held by Robert Axelrod [3]. He invited game theorists to submit strategies for playing the IPD. The highest payoff was earned by the strategy named "tit-for-tat", which cooperates in the first round and thereafter repeats what the opponent did in the previous move. Some impressive results were collected in [3]; the relevant message for people facing a prisoner's dilemma can be summarized as follows:
1) don't be envious;
2) don't be the first to defect;
3) reciprocate both cooperation and defection;
4) don't be too clever.
For the one-shot prisoner's dilemma, there is no doubt that betrayal earns the best payoff for the agent. However, people seldom face the choice between being selfish and being altruistic only once. Thereafter, scholars turned their attention to seeking mutual cooperation in the IPD. The spatial evolutionary game demonstrated that local interactions within a spatial structure can maintain cooperative behavior [4]. Reference [4] dealt with the relative merits of various strategies when players who recognize each other meet repeatedly. This spatial version of the prisoner's dilemma can generate chaotically changing spatial patterns.
Reference [5] introduced a measure for the cluster shape and demonstrated that the macroscopic patterns can be used to determine the characteristics of the underlying microscopic interactions. Reference [6] studied the competition and strategy selection within a class of generalized strategies.
With the development of evolutionary computation, many scholars have made significant contributions to the research of the IPD [7]. Different agents within IPD games may utilize different strategies. Scholars have introduced numerous techniques which can be used to identify or modify IPD strategies. Some strategies are fixed and can be implemented using finite state machines or Markov decision processes [8]. Other strategies are adaptive, based on different representation schemes [9], [10]. In order to determine which kind of
strategy has a better performance under specific conditions, some studies have investigated using fingerprints and agent-case embedding to analyze them [11]−[13]. Existing research results also indicate that evolutionary strategies can compete well against some of the fixed strategies [14], [15]. Inspired by these outstanding results, we propose an adaptive strategy based on the temporal difference learning method which considers both the achievement of mutual cooperation and better performance.
The reinforcement learning paradigm can be applied to solve many practical problems [16], [17]. References [18]−[20] applied reinforcement learning methods to solve nonlinear system problems. Moreover, learning to predict involves using past experience with an incompletely known system to predict its future behavior, and this type of learning is significant for the IPD. Reinforcement learning is an effective way to teach an agent how to make decisions based on previous experience. Reference [20] brought together cooperative control, reinforcement learning, and game theory to present a multi-agent formulation for the online solution of team games. In [21], the scholars explored interactions in a coevolving population of model-based adaptive agents and fixed non-adaptive agents, and identified that the reinforcement learning method can serve as a useful tool for developing an adaptive strategy. Hingston and Kendall also incorporated reputation as the mechanism for evolving cooperative behaviors into the existing coevolutionary learning model for IPD games [22]. Reference [23] designed a team of agents which can reach consensus on a common value according to a cooperative game theory approach. References [24], [25] investigated the evolution of cooperation in the prisoner's dilemma when individuals change their strategies subject to performance evaluation of their neighbors over variable time horizons. The main contributions of this paper are as follows:
1) A temporal difference (TD) learning method is applied to design an adaptive strategy. The feature of the adaptive strategy is that it can balance the short-term rational decision for self-interest against the long-term decision for the overall interest. The evolutionary stability of the adaptive strategy is studied.
2) Three kinds of tournaments based on different complex networks are provided. During the tournaments on the square lattice network, which contains different strategies, the adaptive strategy earns a better payoff.
As to the scale-free network constituted by adaptive agents, all the agents will cooperate with each other for the long-term reward. Therefore, the simulation results verify that the adaptive strategy is willing to choose cooperation without losing competitiveness.
This paper is organized as follows. In Section II, the IPD is introduced. In Section III, the TD(λ) method for the prisoner's dilemma is presented. In Section IV, three tournaments are given to verify the feasibility of the adaptive strategy. Section V states the conclusions of our study.

II. ITERATED PRISONER'S DILEMMA
Life is filled with paradoxes and dilemmas. A very lifelike paradox is called the "prisoner's dilemma", discovered by Melvin Dresher and Merrill Flood [12]. The prisoner's dilemma is a canonical example of a two-player non-zero-sum game. Each agent has two options in each round: one is to cooperate (C), and the other is to defect (D). Based on its choice, the agent receives a payoff governed by a payoff matrix, as shown in Table I, where $R$ is the payoff when both agents choose cooperation. When only one of the agents defects, it receives a payoff $T$, and the opponent receives a payoff $S$. If both agents decide to defect, each receives a payoff $P$. The basic rule of the payoff matrix is $T > R > P > S$ and $2R > T + S$.
The standard IPD is played repeatedly between two players, each with its own strategy. They may have different strategies, which can be represented by lookup tables, finite-state machines, or neural networks. An IPD based on strategies represented by finite state machines can be analyzed as a Markov process. This allows an average score to be determined for any pair of strategies using standard techniques in stochastic processes [12]. Some typical fixed strategies are described in Table II [26].
TABLE I
IPD PAYOFF MATRIX

Agent 1 \ Agent 2 | Cooperate | Defect
Cooperate         | R\R       | S\T
Defect            | T\S       | P\P

TABLE II
SOME FIXED IPD STRATEGIES

Name             | Behavior
Always defect    | Always plays D.
Always cooperate | Always plays C.
Tit-for-tat      | Plays C initially and then repeats the other player's last action.
GRIM             | Always defects if the opponent ever defects.
Pavlov           | Plays C initially and then cooperates thereafter if its action and its opponent's action matched during the previous round.
Tit-for-2-tat    | Chooses D only if its opponent has chosen D for its last two moves.
Ripoff           | Alternates between C and D until its opponent chooses D for the first time. On the round after this defection, it cooperates and then plays tit-for-tat thereafter.
Psycho           | Chooses D initially and then plays the opposite of its opponent's last action.
Random           | Simply flips a fair coin to decide how to play. It cannot be represented by a finite state machine.
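The fixed strategies in Table II can be made concrete as small history-driven functions. The sketch below is illustrative only (it uses the numerical payoffs $T=4$, $R=3$, $P=1$, $S=0$ introduced later in Table III), not the authors' implementation:

```python
# Illustrative encodings of a few fixed strategies from Table II as
# functions of the interaction history (lists of past 'C'/'D' moves).

def tit_for_tat(my_history, opp_history):
    # Cooperate first, then repeat the opponent's last action.
    return 'C' if not opp_history else opp_history[-1]

def grim(my_history, opp_history):
    # Cooperate until the opponent defects once, then defect forever.
    return 'D' if 'D' in opp_history else 'C'

def pavlov(my_history, opp_history):
    # Cooperate first; thereafter cooperate iff both players made the
    # same move in the previous round.
    if not my_history:
        return 'C'
    return 'C' if my_history[-1] == opp_history[-1] else 'D'

def play(strategy_a, strategy_b, rounds,
         payoff={'CC': (3, 3), 'CD': (0, 4), 'DC': (4, 0), 'DD': (1, 1)}):
    # Iterate the game and accumulate payoffs (T=4, R=3, P=1, S=0).
    ha, hb, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strategy_a(ha, hb), strategy_b(hb, ha)
        pa, pb = payoff[a + b]
        ha.append(a); hb.append(b)
        score_a += pa; score_b += pb
    return score_a, score_b

print(play(tit_for_tat, grim, 10))  # mutual cooperation: (30, 30)
```

For example, two nice strategies such as tit-for-tat and GRIM lock into mutual cooperation and each collect $10R = 30$.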
Definition 1: Denote by $a$ an $n$-tuple of mixed actions; if $a = (a_1, a_2, \ldots, a_n)$, then the payoff of $a$ can be written as $U = (U_1, U_2, \ldots, U_n)$. For convenience we introduce the substitution notation $a = (a_{-i}; a_i)$, where $a_{-i} = (a_1, a_2, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n)$.
Definition 2 (Nash Equilibrium) [3]: A joint action set $(a_{-i}; a_i^*)$ is a pure Nash equilibrium point if, for the $i$th player,

$$U_i(a_{-i}; a_i^*) = \max_{a_i \in a} U_i(a_{-i}; a_i). \quad (1)$$

During the IPD, a Nash equilibrium (NE) is reached when each agent maximizes its own payoff. If the agents are selfish and want to maximize their own payoffs without considering the interests of the others, mutual defection is obtained as an NE. Therefore, one of the two following situations will occur:
1) If the opponent chooses C, the agent may receive payoff $R$ or $T$; since $T > R$, the agent will choose D.
2) If the opponent chooses D, the agent may receive payoff $S$ or $P$; since $P > S$, the agent will choose D.
Therefore, mutual defection is the unique NE, which represents a better short-term payoff. However, Axelrod showed that although mutual defection yields a better short-term reward [3], mutual cooperation is a better solution in the long run. Furthermore, when the IPD presents more than two choices, the evolution of defection may result from strategies effectively having more opportunities to exploit others when more choices exist [11]. Each of the fixed strategies shown in Table II has its own personality and makes decisions based on the opponent's moves. However, the fixed strategies are of a passive type, which means they choose their actions based on history without considering the actions of the next step. The main contribution of this paper is the design of a competitive adaptive strategy which can predict actions and achieve mutual cooperation without losing competitiveness. In the next section, the reinforcement learning method is introduced to design the adaptive strategy.

III. TD(λ) METHOD FOR PRISONER'S DILEMMA
During the IPD, the known information includes the decisions and payoffs of the previous steps. Therefore, the goal of our study is to solve a multi-step prediction problem: teaching the agent to predict its own future scores based on the available options. Learning to predict involves using past experience with the unknown system to predict its future behavior. One advantage of prediction learning is that the training examples can be taken directly from the temporal sequence of ordinary sensory input; no special supervisor or teacher is required. TD learning is a prediction method which combines Monte Carlo and dynamic programming ideas. It learns by sampling the environment according to some policy and then updates its current estimate based on previously learned estimates.
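As background, the generic TD(λ) prediction update described above can be sketched on a toy random-walk task. This is the textbook method with eligibility traces, under our own choice of task and parameters, not the paper's agent:

```python
import random

# Textbook TD(lambda) value prediction with accumulating eligibility
# traces, demonstrated on a 5-state random walk: start in the middle,
# step left/right at random, reward 1 for exiting on the right.

def td_lambda(episodes, n_states=5, alpha=0.1, gamma=1.0, lam=0.8, seed=0):
    rng = random.Random(seed)
    V = [0.0] * n_states                      # value estimates
    for _ in range(episodes):
        e = [0.0] * n_states                  # eligibility traces
        s = n_states // 2                     # start in the middle
        while True:
            s2 = s + rng.choice([-1, 1])      # random step left/right
            # Reward 1 only for stepping off the right end.
            r, done = (1.0, True) if s2 == n_states else (0.0, s2 < 0)
            target = r if done else r + gamma * V[s2]
            delta = target - V[s]             # TD error
            e[s] += 1.0                       # accumulate trace
            for i in range(n_states):         # credit all traced states
                V[i] += alpha * delta * e[i]
                e[i] *= gamma * lam           # decay traces
            if done:
                break
            s = s2
    return V

values = td_lambda(2000)
print([round(v, 2) for v in values])  # roughly increasing toward 1
```

The learned values approach the true exit probabilities, illustrating how TD(λ) updates earlier predictions from later ones without a supervisor.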
A. Adaptive Strategy Design by the TD(λ) Method
In this paper, the adaptive agent should have the following features. An adaptive agent should consider the long-term reward based on the situation. In other words, an adaptive agent should learn cooperative coevolution. For instance, if the agent identifies that the opponent will choose "cooperate", the adaptive agent should choose to cooperate; if the opponent chooses "defect", the adaptive agent should choose "defect" to protect its payoff. Learning to cooperate with others is a significant characteristic of human beings. Therefore, for an adaptive agent, learning to cooperate is vital.
As a prediction method, TD learning can be adjusted to better match the observation whenever an observation is available. TD methods are also more incremental, easier to compute, and tend to make more efficient use of their experience.
Based on the model mentioned in [1], the TD(λ) learner can be expressed as in Fig.1.

Fig.1. The decision making process for the adaptive agent.
Epochs are indexed $t = 1, 2, \ldots, T$. The scores of the adaptive agent and its opponent are $S_{adp}(t)$ and $S_{opp}(t)$, while the forecast cooperative and defective earnings of the adaptive agent at time $t$ are $S_{cp}(t)$ and $S_{dp}(t)$, respectively. The forecast earnings of the adaptive agent are calculated as follows:

$$T_{cp}(t+1) = \sum_{i=1}^{t} P_C^{\,t-i}\,(S_{adp}(i) - S_{opp}(i)) \quad (2)$$

$$T_{dp}(t+1) = \sum_{i=1}^{t} P_D^{\,t-i}\,(S_{adp}(i) - S_{opp}(i)) \quad (3)$$

$$S_{cp}(t+1) = T_{cp}(t+1) + 2 P_C(t) R \quad (4)$$

$$S_{dp}(t+1) = T_{dp}(t+1) + 2 P_D(t) P \quad (5)$$

where $P_C \in (0, 1)$ and $P_D \in (0, 1)$ represent the probabilities of cooperation and defection, respectively, for the adaptive agent. $R$, $T$, $S$ and $P$ are the payoffs of the IPD, as shown in Table I.
Therefore, the TD(λ) learner can be described as follows:
1) Initialization: The state set and action set of the $i$th agent are $Z = \{C, D\}$, where $C$ is cooperate and $D$ is defect. The payoff matrix of the agents is shown in Table I.
2) Calculate $S_{cp}(t+1)$ and $S_{dp}(t+1)$ by (2)−(5).
3) The agent makes a decision based on the decision making process (DMP).
4) Return to 2) until the iteration stops.
The DMP is as follows:
1) If $S_{cp}(t+1) > S_{dp}(t+1)$, the adaptive agent will cooperate with the opponent. The probability of cooperating increases to $P_C(t+1) = P_C(t) + F(P_C(t+1))$; the probability of defecting decreases to $P_D(t+1) = P_D(t) - F(P_D(t+1))$.
2) If $S_{cp}(t+1) < S_{dp}(t+1)$, the adaptive agent will defect against the opponent. The probability of defecting increases to $P_D(t+1) = P_D(t) + F(P_D(t+1))$; the probability of cooperating decreases to $P_C(t+1) = P_C(t) - F(P_C(t+1))$.
3) If $S_{cp}(t+1) = S_{dp}(t+1)$, the adaptive agent will cooperate with the opponent. However, the probability of defecting is not reduced. The probability of cooperating increases to $P_C(t+1) = P_C(t) + F(P_C(t+1))$. Thereby, the adaptive agent is encouraged to choose cooperation for the long-term team reward. The function $F(\varepsilon)$ is a modified Fermi function:

$$F(\varepsilon(t+1)) = \frac{1}{1 + e^{[\varepsilon(t) - \varepsilon(t-1)]/k}}. \quad (6)$$

The DMP of the adaptive agent clearly indicates that the decisions rely not only on the previous steps, but also on the learning method of the adaptive agent. $P_D$ and $P_C$ decide the tendency of the adaptive agent. Moreover, since $P_C, P_D \in [0, 1]$, the weights $P_D^{\,t-i}$ and $P_C^{\,t-i}$ decrease with the distance from the current step. In other words, the effect of the last step is greater than that of the other steps, and the effect of the first decision within the IPD becomes increasingly weaker as the number of iterations increases. To investigate the convergence of this scheme, the analytical properties are discussed in the following subsection.
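The forecast equations (2)−(5), the modified Fermi function (6), and the three DMP rules can be combined into a compact sketch. The bookkeeping details (clamping the probabilities to [0, 1], the arguments passed to F, the payoff constants T=4, R=3, P=1, S=0) are our reading of the text, not released author code:

```python
import math

# Sketch of the adaptive agent's decision making process, equations
# (2)-(6) plus the three DMP rules. Variable names follow the paper.

R_PAYOFF, P_PAYOFF = 3.0, 1.0   # mutual cooperation / mutual defection
k = 0.1                         # Fermi "temperature" (our choice)

def fermi(curr, prev):
    # Modified Fermi function (6): F = 1 / (1 + exp((eps(t)-eps(t-1))/k)).
    return 1.0 / (1.0 + math.exp((curr - prev) / k))

def forecast(s_adp, s_opp, p, t):
    # Equations (2)-(3): score differences weighted by p^(t-i), so the
    # most recent round (weight p^0 = 1) dominates older rounds.
    return sum(p ** (t - 1 - i) * (s_adp[i] - s_opp[i]) for i in range(t))

def decide(s_adp, s_opp, pc, pd, pc_prev, pd_prev):
    t = len(s_adp)
    s_cp = forecast(s_adp, s_opp, pc, t) + 2 * pc * R_PAYOFF  # eq. (4)
    s_dp = forecast(s_adp, s_opp, pd, t) + 2 * pd * P_PAYOFF  # eq. (5)
    if s_cp > s_dp:        # DMP rule 1: raise P_C, lower P_D
        pc += fermi(pc, pc_prev); pd -= fermi(pd, pd_prev); action = 'C'
    elif s_cp < s_dp:      # DMP rule 2: raise P_D, lower P_C
        pd += fermi(pd, pd_prev); pc -= fermi(pc, pc_prev); action = 'D'
    else:                  # DMP rule 3: tie favors cooperation; P_D kept
        pc += fermi(pc, pc_prev); action = 'C'
    pc = min(1.0, max(0.0, pc))  # keep probabilities in [0, 1]
    pd = min(1.0, max(0.0, pd))
    return action, pc, pd

# With no history and P_C = P_D = 0.5, eq. (4) gives 3 > 1 from eq. (5),
# so the agent opens with cooperation.
print(decide([], [], 0.5, 0.5, 0.5, 0.5))
```

Note how, with an empty history, the comparison reduces to $2P_C R$ versus $2P_D P$, so the ordering $R > P$ biases the first move toward cooperation.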
B. The Analytical Properties of Adaptive Strategy
Different IPD strategies use different history lengths of memory to determine their choices. In a finite-length IPD which has $L$ rounds, the largest history length which a strategy can access is $L$. For the strategy described in this section, the memory size is as long as the length of the finite IPD game. A nontrivial question is whether the adaptive strategy can be the counter strategy (CS) against another strategy. It has been proven that every finite history length can occur in an infinite-length IPD, which is formalized as follows.
Definition 3 (Counter Strategy) [26]: A strategy $S$ is a counter strategy (CS) against another strategy $S_1$ if, for any strategy $S'$,

$$U(S, S_1) \geq U(S', S_1). \quad (7)$$

Lemma 1 [26]: For any strategy that uses a limited history length, there always exist some strategies with longer memory against which the strategy cannot be a counter strategy.
Theorem 1: The adaptive strategy mentioned in Section III has a higher probability of being a CS against a fixed strategy in an infinite-length IPD.
Definition 4 [26]: Any strategy that uses a limited history cannot be an evolutionarily stable strategy (ESS) in an infinite-length or indefinite-length IPD. An ESS is a strategy such that, if all the members of a population adopt it, then no mutant strategy can invade the population under the influence of natural selection. Therefore, the condition for a strategy $S$ to be an ESS is that for any $S'$,

$$U(S, S) \geq U(S', S).$$

There is a relationship between CS and ESS: a strategy is an ESS if it is the only CS against all IPD strategies [26]. For the adaptive strategy, the evolutionary stability is discussed as follows.
Theorem 2: For an IPD consisting of fixed strategies and the adaptive strategy, the adaptive strategy mentioned in Section III is an ESS.
The adaptive strategy designed in this paper has a memory size equal to the length of the IPD game, so it has the longest memory compared with the other strategies. In the designed game, the adaptive strategy therefore has a higher probability of being the CS of the other fixed strategies, and it will earn a better payoff when played against them. In the next section, some simulations are given to illustrate the effectiveness of the proposed approach.

IV. SIMULATIONS
In order to measure the effectiveness of agents using the adaptive strategy, three IPD tournaments are simulated in different complex networks. The square lattice network and the scale-free network are introduced as the environments of the simulations. Each game-theoretic tournament can be represented as a tuple $G(N, A, F)$, where $N$ is the number of agents, $A = \{C, D\}$ is the action set of the agents, $C$ and $D$ represent cooperation and defection, and $F$ is the payoff function for each action. The practical payoff matrix of the prisoner's dilemma is shown in Table III.

TABLE III
IPD PAYOFF MATRIX

Agent 1 \ Agent 2 | Cooperate | Defect
Cooperate         | 3\3       | 0\4
Defect            | 4\0       | 1\1
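A quick best-response check on the Table III payoffs ($T=4$, $R=3$, $P=1$, $S=0$) confirms the argument of Section II that mutual defection is the unique pure-strategy Nash equilibrium of the one-shot game:

```python
# Enumerate the pure Nash equilibria of the one-shot game in Table III.
payoff = {('C', 'C'): (3, 3), ('C', 'D'): (0, 4),
          ('D', 'C'): (4, 0), ('D', 'D'): (1, 1)}

def is_nash(a1, a2):
    # Neither player can gain by unilaterally deviating.
    u1, u2 = payoff[(a1, a2)]
    best1 = all(payoff[(d, a2)][0] <= u1 for d in 'CD')
    best2 = all(payoff[(a1, d)][1] <= u2 for d in 'CD')
    return best1 and best2

equilibria = [(a, b) for a in 'CD' for b in 'CD' if is_nash(a, b)]
print(equilibria)  # [('D', 'D')]
```

Mutual cooperation (3, 3) Pareto-dominates the equilibrium (1, 1), which is exactly the dilemma the adaptive strategy is designed to escape.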
The simulations based on the square lattice network and the scale-free network are given in the following subsections.
A. The Simulations Based on the Square Lattice Network
Based on the characteristics of the $100 \times 100$ square lattice network, there are $N = 10000$ agents (indexed $1, 2, \ldots, i, \ldots, 10000$) on the network. In this section, two tournaments are provided to illustrate the effectiveness of the adaptive strategy. In the first tournament, the designed adaptive strategy plays against the fixed strategies listed in Table II. In the second tournament, two kinds of strategies based on Q-learning [27] and a self-adaptive method [28] play against the designed strategy. The structure of part of the square lattice network is described in Fig.2. Each agent has eight neighbors. The coordinates of the agents represent the relative positions between them.
During each iteration, the $i$th agent plays against its neighbors according to the rules of the IPD. Based on the structure of the lattice network, the adaptive agents can play against agents with all kinds of strategies. The cooperative rate $p$ and the average fitness value are introduced to illustrate the effectiveness of the proposed approach:

$$p_t = \frac{N_c}{N} \quad (8)$$
where $p_t$ represents the cooperative rate of the $t$th iteration, $N_c$ is the number of cooperating agents, and $N$ is the total number of agents.

$$f_{ave}(Stra) = \frac{\sum_{t=1}^{T} S_t}{T} \quad (9)$$

where $f_{ave}(Stra)$ denotes the average fitness value of strategy $Stra$, $S_t$ is the payoff of the $t$th round-robin game, and $T$ is the total number of iterations.
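The two evaluation metrics (8) and (9) translate directly into code; this is a straightforward transcription, not the authors' analysis scripts:

```python
# Evaluation metrics of the tournaments.

def cooperative_rate(states):
    # Eq. (8): fraction of agents currently playing C.
    return states.count('C') / len(states)

def average_fitness(round_payoffs):
    # Eq. (9): mean payoff over the T round-robin games.
    return sum(round_payoffs) / len(round_payoffs)

print(cooperative_rate(['C', 'C', 'D', 'C']))  # 0.75
print(average_fitness([3, 3, 1, 4]))           # 2.75
```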
Fig.2. The structure of the square lattice network.
1) The Tournaments Between the Designed Strategy and Fixed Strategies: In this section, the tournament between the designed strategy based on the TD learning method and the fixed strategies is provided to verify whether the designed strategy can earn a better payoff than the fixed strategies. The steps of the tournament are as follows:
Step 1: Initialize the number of iterations as 100. Randomly generate the positions of the agents with different strategies. The numbers of agents with different strategies are equal. The initial states of the agents are cooperation and defection. The total initial cooperative rate is $p_1 = 0.3$. Cooperators are marked as red squares; defectors are marked as blue squares.
Step 2: During each round of the round-robin games, the $i$th agent plays against its neighbors in the prescribed sequence shown in Fig.3, and updates its state according to its strategy's characteristics.
Fig.3. The order of each round-robin game.
Step 3: Calculate the cooperative rates and average fitness values to verify the effectiveness of the proposed approach.
For the first example, the average fitness values of the different strategies in the IPD tournament are given in Table IV. The statistical results illustrate that the adaptive strategy earns a better payoff than the other strategies. In the tournament, defective actions became extinct after 9 generations, and cooperative actions occupy the majority of the population.
TABLE IV
RESULTS OF FIRST ROUND-ROBIN IPD

Strategy          | Score
Always defect     | 2.367
Ripoff            | 2.347
Always cooperate  | 2.289
Psycho            | 2.462
Tit-for-tat       | 2.774
Adaptive strategy | 3.135
GRIM              | 2.763
Pavlov            | 2.334
Tit-for-2-tat     | 2.597
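The lattice round-robin of Steps 1−3 can be sketched as follows. The eight-neighbor (Moore) neighborhood matches Fig.2, while the toroidal wrap-around at the edges and the small lattice side are our simplifying assumptions, since the paper does not specify the boundary handling:

```python
import random

# Sketch of one round-robin pass on a square lattice: each agent plays
# its eight Moore neighbors under the Table III payoffs.

N = 20                                    # lattice side (100 in the paper)
payoff = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 4, ('D', 'D'): 1}

def init_states(p1=0.3, seed=1):
    # Step 1: random placement with initial cooperative rate p1 = 0.3.
    rng = random.Random(seed)
    return [['C' if rng.random() < p1 else 'D' for _ in range(N)]
            for _ in range(N)]

def neighbors(i, j):
    # The eight Moore neighbors of cell (i, j), wrapping at the edges.
    return [((i + di) % N, (j + dj) % N)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)]

def round_payoffs(states):
    # Step 2: every agent accumulates payoff against each neighbor
    # under the current state configuration.
    scores = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for (ni, nj) in neighbors(i, j):
                scores[i][j] += payoff[(states[i][j], states[ni][nj])]
    return scores

states = init_states()
scores = round_payoffs(states)
print(sum(row.count('C') for row in states) / N**2)  # near 0.3
```

A full tournament would then let each agent update its state from these scores according to its own strategy and repeat for 100 iterations (Step 3 computes the metrics of (8) and (9) after each pass).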
The simulation results of the first tournament are shown in Fig.4. A conclusion can therefore be drawn that mutual cooperation becomes popular among the agents and spreads quickly. However, another question is how the initial cooperative rate $p_1$ influences the result. The simulation results based on different initial conditions are shown in Fig.5. Fig.5 shows that the three different initial values, $p_1 = 0.3$, $p_1 = 0.5$ and $p_1 = 0.8$, lead to different balance points. However, the equilibrium points are all above 75 percent, which is significantly larger than the percentage of agents playing always-cooperate. Therefore, mutual cooperation can be achieved between the majority of the agents.
2) The Tournaments Between the Designed Strategy and Other Evolutionary Learning Methods: In this section, two evolutionary learning methods are introduced into the tournaments. One is a Q-learning strategy [27], and the other is a self-adaptive win-stay-shift reference selection strategy [28]. The environment of this tournament is also the square lattice network. The steps of the tournament are as follows:
Step 1: Initialize the number of iterations as 100. Randomly generate the positions of the agents with different strategies on the square lattice network. Three kinds of evolutionary strategies are evenly distributed on the network. The initial states of the agents are cooperation and defection. The initial cooperative rate is $p_1 = 0.3$. Cooperators are marked as green squares; defectors are marked as blue squares.
Step 2: During each round of the round-robin, the $i$th agent plays against its neighbors in the prescribed sequence shown in Fig.3, and updates its state according to its own features.
Step 3: Calculate the cooperative rates and average fitness values of the three different evolutionary strategies.
The simulation results of the second tournament are shown in Fig.6. The simulation results illustrate that cooperation spreads quickly among the agents. The network becomes stable within 5 iterations, and the cooperative rate at the 100th iteration is 0.904. The results verify that mutual cooperation can be achieved between the agents with evolutionary strategies.