TensorflowShang
diff --git a/MachineLearning/DeepLearningFlappyBird-master/README.md b/MachineLearning/DeepLearningFlappyBird-master/README.md
Lines changed: 115 additions & 0 deletions
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/die.ogg b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/die.ogg
17.1 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/die.wav b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/die.wav
190 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/hit.ogg b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/hit.ogg
15.3 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/hit.wav b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/hit.wav
94.3 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/point.ogg b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/point.ogg
12.9 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/point.wav b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/point.wav
173 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/swoosh.ogg b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/swoosh.ogg
13.4 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/swoosh.wav b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/swoosh.wav
346 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/wing.ogg b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/wing.ogg
7.55 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/audio/wing.wav b/MachineLearning/DeepLearningFlappyBird-master/assets/audio/wing.wav
29.2 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/0.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/0.png
2.81 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/1.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/1.png
2.8 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/2.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/2.png
2.82 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/3.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/3.png
2.81 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/4.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/4.png
2.83 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/5.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/5.png
2.82 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/6.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/6.png
2.82 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/7.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/7.png
2.83 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/8.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/8.png
2.81 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/9.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/9.png
2.82 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/background-black.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/background-black.png
3.94 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/base.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/base.png
664 Bytes
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/pipe-green.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/pipe-green.png
4.92 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/redbird-downflap.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/redbird-downflap.png
2.88 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/redbird-midflap.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/redbird-midflap.png
2.88 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/redbird-upflap.png b/MachineLearning/DeepLearningFlappyBird-master/assets/sprites/redbird-upflap.png
2.88 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/deep_q_network.py b/MachineLearning/DeepLearningFlappyBird-master/deep_q_network.py
Lines changed: 215 additions & 0 deletions
diff --git a/MachineLearning/DeepLearningFlappyBird-master/game/__pycache__/flappy_bird_utils.cpython-35.pyc b/MachineLearning/DeepLearningFlappyBird-master/game/__pycache__/flappy_bird_utils.cpython-35.pyc
2.16 KB
diff --git a/MachineLearning/DeepLearningFlappyBird-master/game/__pycache__/wrapped_flappy_bird.cpython-35.pyc b/MachineLearning/DeepLearningFlappyBird-master/game/__pycache__/wrapped_flappy_bird.cpython-35.pyc
5.7 KB
@@ -0,0 +1,115 @@
+# Using Deep Q-Network to Learn How To Play Flappy Bird
+
+<img src="./images/flappy_bird_demp.gif" width="250">
+
+7 mins version: [DQN for flappy bird](https://www.youtube.com/watch?v=THhUXIhjkCM)
+
+## Overview
+This project follows the Deep Q-Learning algorithm described in Playing Atari with Deep Reinforcement Learning [2] and shows that this learning algorithm can be further generalized to the notorious Flappy Bird.
+
+## Installation Dependencies:
+* Python 2.7 or 3
+* TensorFlow 0.7
+* pygame
+* OpenCV-Python
+
+## How to Run?
+```
+git clone https://github.com/yenchenlin1994/DeepLearningFlappyBird.git
+cd DeepLearningFlappyBird
+python deep_q_network.py
+```
+
+## What is Deep Q-Network?
+It is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.
+
+For those who are interested in deep reinforcement learning, I highly recommend reading the following post:
+
+[Demystifying Deep Reinforcement Learning](http://www.nervanasys.com/demystifying-deep-reinforcement-learning/)
+
+## Deep Q-Network Algorithm
+
+The pseudo-code for the Deep Q Learning algorithm, as given in [1], can be found below:
+
+```
+Initialize replay memory D to size N
+Initialize action-value function Q with random weights
+for episode = 1, M do
+    Initialize state s_1
+    for t = 1, T do
+        With probability ϵ select random action a_t
+        otherwise select a_t=argmax_a  Q(s_t,a; θ_i)
+        Execute action a_t in emulator and observe r_t and s_(t+1)
+        Store transition (s_t,a_t,r_t,s_(t+1)) in D
+        Sample a minibatch of transitions (s_j,a_j,r_j,s_(j+1)) from D
+        Set y_j:=
+            r_j for terminal s_(j+1)
+            r_j+γ*max_(a')  Q(s_(j+1),a'; θ_i) for non-terminal s_(j+1)
+        Perform a gradient step on (y_j-Q(s_j,a_j; θ_i))^2 with respect to θ
+    end for
+end for
+```
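
The target `y_j` in the pseudo-code above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's code: the helper name `td_targets` and the toy numbers are made up for the example, and the next-state Q-values are passed in as plain lists instead of being produced by a network.

```python
def td_targets(rewards, next_q_values, terminals, gamma=0.99):
    """Build the regression target y_j for each transition in a minibatch.

    rewards:        list of r_j
    next_q_values:  list of lists, Q(s_(j+1), a') for every action a'
    terminals:      list of bools, True when s_(j+1) is terminal
    gamma:          discount factor
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, terminals):
        if done:
            # Terminal transition: no future reward to bootstrap from
            targets.append(r)
        else:
            # Non-terminal: reward plus discounted best next-state value
            targets.append(r + gamma * max(q_next))
    return targets


# Toy minibatch of two transitions (hypothetical values)
y = td_targets(
    rewards=[0.1, -1.0],
    next_q_values=[[0.5, 2.0], [3.0, 4.0]],
    terminals=[False, True],
)
print(y)
```

The second transition is terminal, so its target is just the reward and the next-state Q-values are ignored.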
+
+## Experiments
+
+#### Environment
+Since the deep Q-network is trained on the raw pixel values observed from the game screen at each time step, [3] finds that removing the background that appears in the original game makes it converge faster. This process is illustrated in the following figure:
+
+<img src="./images/preprocess.png" width="450">
+
+#### Network Architecture
+Following [1], I first preprocess the game screens with the following steps:
+
+1. Convert image to grayscale
+2. Resize image to 80x80
+3. Stack last 4 frames to produce an 80x80x4 input array for network
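
Step 3 above can be sketched with a fixed-length deque. This is only an illustration under assumptions: the helper names `initial_state` and `push_frame` are hypothetical, frames are represented by string labels rather than 80x80 arrays, and the choices of repeating the first frame and putting the newest frame first mirror what `deep_q_network.py` appears to do with `np.stack` and `np.append`.

```python
from collections import deque

STACK = 4  # number of frames that make up one state


def initial_state(first_frame):
    # At the start, the first frame is repeated to fill the stack
    return deque([first_frame] * STACK, maxlen=STACK)


def push_frame(state, new_frame):
    # Copy the stack, then put the newest frame in front;
    # because maxlen is set, the oldest frame falls off the back
    next_state = deque(state, maxlen=STACK)
    next_state.appendleft(new_frame)
    return next_state


s = initial_state("f0")
s = push_frame(s, "f1")
s = push_frame(s, "f2")
print(list(s))
```

Returning a new deque instead of mutating the old one keeps the previous state intact, which matters when both `s_t` and `s_(t+1)` are stored in the replay memory.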
+
+The architecture of the network is shown in the figure below. The first layer convolves the input image with an 8x8x4x32 kernel at a stride size of 4. The output is then put through a 2x2 max pooling layer. The second layer convolves with a 4x4x32x64 kernel at a stride of 2. We then max pool again. The third layer convolves with a 3x3x64x64 kernel at a stride of 1. We then max pool one more time. The last hidden layer consists of 256 fully connected ReLU nodes. Note that `deep_q_network.py` in this commit applies max pooling only after the first convolution and uses 512 units in the last hidden layer.
+
+<img src="./images/network.png">
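
To check how the layer sizes fit together, the spatial dimensions can be traced by hand. The sketch below assumes "SAME" padding for every convolution and pooling step (as the `conv2d` and `max_pool_2x2` helpers in `deep_q_network.py` use), in which case each layer's output size is `ceil(input / stride)`. The function names and the layer-list format are hypothetical, chosen only for this example.

```python
import math


def same_out(size, stride):
    # With "SAME" padding the output size is ceil(input / stride)
    return math.ceil(size / float(stride))


def flat_features(input_size, input_channels, layers):
    """Return the number of features after flattening the last layer.

    layers: list of (kind, stride, out_channels); out_channels is
            ignored for "pool" entries, which keep the channel count.
    """
    size, channels = input_size, input_channels
    for kind, stride, out_channels in layers:
        size = int(same_out(size, stride))
        if kind == "conv":
            channels = out_channels
    return size * size * channels


# Pooling after every convolution, as in the description above
pool_after_each = [
    ("conv", 4, 32), ("pool", 2, None),
    ("conv", 2, 64), ("pool", 2, None),
    ("conv", 1, 64), ("pool", 2, None),
]

# Pooling only after the first convolution, as in deep_q_network.py
pool_once = [
    ("conv", 4, 32), ("pool", 2, None),
    ("conv", 2, 64),
    ("conv", 1, 64),
]

print(flat_features(80, 4, pool_after_each))
print(flat_features(80, 4, pool_once))
```

Under these assumptions the first variant flattens to 2x2x64 = 256 features and the second to 5x5x64 = 1600, which matches the `[1600, 512]` shape of `W_fc1` in the script.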
+
+The final output layer has the same dimensionality as the number of valid actions which can be performed in the game, where the 0th index always corresponds to doing nothing. The values at this output layer represent the Q function given the input state for each valid action. At each time step, the network performs whichever action corresponds to the highest Q value, following an ϵ-greedy policy.
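
An ϵ-greedy choice over the output layer can be sketched as below. This is a simplified stand-in, not the script's exact code: the function name `epsilon_greedy` and the injectable `rng` argument are assumptions made so the example is easy to test, and the Q-values are plain Python lists.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values.

    With probability epsilon a uniformly random action is chosen;
    otherwise the action with the highest Q-value is chosen.
    """
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Index 0 = do nothing, index 1 = flap (two actions, as in the game)
q = [0.2, 1.5]
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=random.Random(0))
random_action = epsilon_greedy(q, epsilon=1.0, rng=random.Random(0))
print(greedy_action, random_action)
```

The chosen index is computed once and then reused, so the action that is logged is always the action that is executed.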
+
+
+#### Training
+At first, I initialize all weight matrices randomly using a normal distribution with a standard deviation of 0.01, then set the replay memory to a maximum size of 50,000 experiences.
+
+I start training by choosing actions uniformly at random for the first 10,000 time steps, without updating the network weights. This allows the system to populate the replay memory before training begins.
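
A bounded replay memory of this kind can be sketched with `collections.deque`. The class name `ReplayMemory` and its methods are hypothetical; the script itself uses a bare `deque` with a manual `popleft()`, and the tiny capacity here is only for demonstration.

```python
import random
from collections import deque


class ReplayMemory(object):
    """Fixed-capacity store of (s, a, r, s_next, terminal) transitions."""

    def __init__(self, capacity):
        # maxlen makes the deque drop its oldest item when full
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=5)
for t in range(8):
    # Placeholder transitions; real ones would hold stacked frames
    memory.store(("s%d" % t, 0, 0.1, "s%d" % (t + 1), False))

batch = memory.sample(3)
print(len(memory), len(batch))
```

Filling the memory before any gradient step, as described above, ensures that `sample` always has at least a full minibatch to draw from.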
+
+Note that unlike [1], which initializes ϵ = 1, I linearly anneal ϵ from 0.1 to 0.0001 over the course of the next 3,000,000 frames. The reason is that the agent can choose an action every 0.03s (FPS=30) in our game, so a high ϵ makes it **flap** too much, which keeps it at the top of the game screen until it finally bumps into a pipe in a clumsy way. This makes the Q function converge relatively slowly, since the agent only starts to explore other conditions when ϵ is low.
+However, in other games, initializing ϵ to 1 is more reasonable.
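
The linear annealing schedule can be written in closed form, as in the sketch below. The function name `annealed_epsilon` is hypothetical, the default values are the ones quoted in this section, and holding ϵ at its initial value during the observation phase is an assumption based on how `deep_q_network.py` only starts decreasing ϵ once `t > OBSERVE`.

```python
def annealed_epsilon(t, observe=10000, explore=3000000,
                     initial=0.1, final=0.0001):
    """Return epsilon at time step t under linear annealing."""
    if t <= observe:
        # Observation phase: epsilon stays at its initial value
        return initial
    # Fraction of the annealing period completed, capped at 1
    fraction = min((t - observe) / float(explore), 1.0)
    return initial + fraction * (final - initial)


for step in (0, 10000, 1510000, 3010000, 5000000):
    print(step, annealed_epsilon(step))
```

Computing ϵ from `t` directly, rather than subtracting a small amount every step, avoids accumulating floating-point error over millions of frames.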
+
+During training, at each time step, the network samples a minibatch of size 32 from the replay memory to train on, and performs a gradient step on the loss function described above using the Adam optimization algorithm with a learning rate of 0.000001. After annealing finishes, the network continues to train indefinitely, with ϵ fixed at 0.0001.
+
+## FAQ
+
+#### Checkpoint not found
+Change the [first line of `saved_networks/checkpoint`](https://github.com/yenchenlin1994/DeepLearningFlappyBird/blob/master/saved_networks/checkpoint#L1) to
+
+`model_checkpoint_path: "saved_networks/bird-dqn-2920000"`
+
+#### How to reproduce?
+1. Comment out [these lines](https://github.com/yenchenlin1994/DeepLearningFlappyBird/blob/master/deep_q_network.py#L108-L112)
+
+2. Modify the parameters in `deep_q_network.py` as follows:
+```python
+OBSERVE = 10000
+EXPLORE = 3000000
+FINAL_EPSILON = 0.0001
+INITIAL_EPSILON = 0.1
+```
+
+## References
+
+[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. **Human-level Control through Deep Reinforcement Learning**. Nature, 518:529-533, 2015.
+
+[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. **Playing Atari with Deep Reinforcement Learning**. NIPS Deep Learning Workshop, 2013.
+
+[3] Kevin Chen. **Deep Reinforcement Learning for Flappy Bird** [Report](http://cs229.stanford.edu/proj2015/362_report.pdf) | [Youtube result](https://youtu.be/9WKBzTUsPKc)
+
+## Disclaimer
+This work is heavily based on the following repos:
+
+1. [sourabhv/FlapPyBird](https://github.com/sourabhv/FlapPyBird)
+2. [asrivat1/DeepLearningVideoGames](https://github.com/asrivat1/DeepLearningVideoGames)
+
@@ -0,0 +1,215 @@
+#!/usr/bin/env python
+from __future__ import print_function
+
+import tensorflow as tf
+import cv2
+import sys
+sys.path.append("game/")
+import wrapped_flappy_bird as game
+import random
+import numpy as np
+from collections import deque
+
+GAME = 'bird' # the name of the game being played for log files
+ACTIONS = 2 # number of valid actions
+GAMMA = 0.99 # discount factor applied to future rewards
+OBSERVE = 100000. # timesteps to observe before training
+EXPLORE = 2000000. # frames over which to anneal epsilon
+FINAL_EPSILON = 0.0001 # final value of epsilon
+INITIAL_EPSILON = 0. # starting value of epsilon (0 means purely greedy play; see README to reproduce training)
+REPLAY_MEMORY = 50000 # number of previous transitions to remember
+BATCH = 32 # size of minibatch
+FRAME_PER_ACTION = 1
+
+def weight_variable(shape):
+    initial = tf.truncated_normal(shape, stddev = 0.01)
+    return tf.Variable(initial)
+
+def bias_variable(shape):
+    initial = tf.constant(0.01, shape = shape)
+    return tf.Variable(initial)
+
+def conv2d(x, W, stride):
+    return tf.nn.conv2d(x, W, strides = [1, stride, stride, 1], padding = "SAME")
+
+def max_pool_2x2(x):
+    return tf.nn.max_pool(x, ksize = [1, 2, 2, 1], strides = [1, 2, 2, 1], padding = "SAME")
+
+def createNetwork():
+    # network weights
+    W_conv1 = weight_variable([8, 8, 4, 32])
+    b_conv1 = bias_variable([32])
+
+    W_conv2 = weight_variable([4, 4, 32, 64])
+    b_conv2 = bias_variable([64])
+
+    W_conv3 = weight_variable([3, 3, 64, 64])
+    b_conv3 = bias_variable([64])
+
+    W_fc1 = weight_variable([1600, 512])
+    b_fc1 = bias_variable([512])
+
+    W_fc2 = weight_variable([512, ACTIONS])
+    b_fc2 = bias_variable([ACTIONS])
+
+    # input layer
+    s = tf.placeholder("float", [None, 80, 80, 4])
+
+    # hidden layers
+    h_conv1 = tf.nn.relu(conv2d(s, W_conv1, 4) + b_conv1)
+    h_pool1 = max_pool_2x2(h_conv1)
+
+    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2, 2) + b_conv2)
+    #h_pool2 = max_pool_2x2(h_conv2)
+
+    h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 1) + b_conv3)
+    #h_pool3 = max_pool_2x2(h_conv3)
+
+    #h_pool3_flat = tf.reshape(h_pool3, [-1, 256])
+    h_conv3_flat = tf.reshape(h_conv3, [-1, 1600])
+
+    h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat, W_fc1) + b_fc1)
+
+    # readout layer
+    readout = tf.matmul(h_fc1, W_fc2) + b_fc2
+
+    return s, readout, h_fc1
+
+def trainNetwork(s, readout, h_fc1, sess):
+    # define the cost function
+    a = tf.placeholder("float", [None, ACTIONS])
+    y = tf.placeholder("float", [None])
+    readout_action = tf.reduce_sum(tf.multiply(readout, a), reduction_indices=1)
+    cost = tf.reduce_mean(tf.square(y - readout_action))
+    train_step = tf.train.AdamOptimizer(1e-6).minimize(cost)
+
+    # open up a game state to communicate with emulator
+    game_state = game.GameState()
+
+    # store the previous observations in replay memory
+    D = deque()
+
+    # printing
+    a_file = open("logs_" + GAME + "/readout.txt", 'w')
+    h_file = open("logs_" + GAME + "/hidden.txt", 'w')
+
+    # get the first state by doing nothing and preprocess the image to 80x80x4
+    do_nothing = np.zeros(ACTIONS)
+    do_nothing[0] = 1
+    x_t, r_0, terminal = game_state.frame_step(do_nothing)
+    x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)
+    ret, x_t = cv2.threshold(x_t,1,255,cv2.THRESH_BINARY)
+    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)
+
+    # saving and loading networks
+    saver = tf.train.Saver()
+    sess.run(tf.initialize_all_variables())
+    checkpoint = tf.train.get_checkpoint_state("saved_networks")
+    if checkpoint and checkpoint.model_checkpoint_path:
+        saver.restore(sess, checkpoint.model_checkpoint_path)
+        print("Successfully loaded:", checkpoint.model_checkpoint_path)
+    else:
+        print("Could not find old network weights")
+
+    # start training
+    epsilon = INITIAL_EPSILON
+    t = 0
+    while "flappy bird" != "angry bird":
+        # choose an action epsilon greedily
+        readout_t = readout.eval(feed_dict={s : [s_t]})[0]
+        a_t = np.zeros([ACTIONS])
+        action_index = 0
+        if t % FRAME_PER_ACTION == 0:
+            if random.random() <= epsilon:
+                print("----------Random Action----------")
+                action_index = random.randrange(ACTIONS)
+                a_t[action_index] = 1  # reuse the sampled index so the logged action matches the executed one
+            else:
+                action_index = np.argmax(readout_t)
+                a_t[action_index] = 1
+        else:
+            a_t[0] = 1 # do nothing
+
+        # scale down epsilon
+        if epsilon > FINAL_EPSILON and t > OBSERVE:
+            epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE
+
+        # run the selected action and observe next state and reward
+        x_t1_colored, r_t, terminal = game_state.frame_step(a_t)
+        x_t1 = cv2.cvtColor(cv2.resize(x_t1_colored, (80, 80)), cv2.COLOR_BGR2GRAY)
+        ret, x_t1 = cv2.threshold(x_t1, 1, 255, cv2.THRESH_BINARY)
+        x_t1 = np.reshape(x_t1, (80, 80, 1))
+        #s_t1 = np.append(x_t1, s_t[:,:,1:], axis = 2)
+        s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)
+
+        # store the transition in D
+        D.append((s_t, a_t, r_t, s_t1, terminal))
+        if len(D) > REPLAY_MEMORY:
+            D.popleft()
+
+        # only train if done observing
+        if t > OBSERVE:
+            # sample a minibatch to train on
+            minibatch = random.sample(D, BATCH)
+
+            # get the batch variables
+            s_j_batch = [d[0] for d in minibatch]
+            a_batch = [d[1] for d in minibatch]
+            r_batch = [d[2] for d in minibatch]
+            s_j1_batch = [d[3] for d in minibatch]
+
+            y_batch = []
+            readout_j1_batch = readout.eval(feed_dict = {s : s_j1_batch})
+            for i in range(0, len(minibatch)):
+                terminal = minibatch[i][4]
+                # if terminal, only equals reward
+                if terminal:
+                    y_batch.append(r_batch[i])
+                else:
+                    y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))
+
+            # perform gradient step
+            train_step.run(feed_dict = {
+                y : y_batch,
+                a : a_batch,
+                s : s_j_batch}
+            )
+
+        # update the old values
+        s_t = s_t1
+        t += 1
+
+        # save progress every 10000 iterations
+        if t % 10000 == 0:
+            saver.save(sess, 'saved_networks/' + GAME + '-dqn', global_step = t)
+
+        # print info
+        state = ""
+        if t <= OBSERVE:
+            state = "observe"
+        elif t > OBSERVE and t <= OBSERVE + EXPLORE:
+            state = "explore"
+        else:
+            state = "train"
+
+        print("TIMESTEP", t, "/ STATE", state, \
+            "/ EPSILON", epsilon, "/ ACTION", action_index, "/ REWARD", r_t, \
+            "/ Q_MAX %e" % np.max(readout_t))
+        # write info to files
+        '''
+        if t % 10000 <= 100:
+            a_file.write(",".join([str(x) for x in readout_t]) + '\n')
+            h_file.write(",".join([str(x) for x in h_fc1.eval(feed_dict={s:[s_t]})[0]]) + '\n')
+            cv2.imwrite("logs_tetris/frame" + str(t) + ".png", x_t1)
+        '''
+
+def playGame():
+    sess = tf.InteractiveSession()
+    s, readout, h_fc1 = createNetwork()
+    trainNetwork(s, readout, h_fc1, sess)
+
+def main():
+    playGame()
+
+if __name__ == "__main__":
+    main()