Merge QOL Improvements to Master #2

Closed
wants to merge 137 commits into from
Changes from 1 commit
Commits
137 commits
b268e67
Updated numpy int to remove deprecation error
leathalman May 23, 2023
2c0983b
Enabled score printing, display board, and fixed dimensionality misma…
leathalman May 23, 2023
406cd22
Fixed tensor compatibility issue with numpy and wrong winner in conso…
leathalman May 23, 2023
a696d7c
Added code to add board states to text file
blakegood31 May 23, 2023
d33df0b
Added code to add game boards to text file
blakegood31 May 23, 2023
8ef8d52
Added code to add game boards to a text file
blakegood31 May 23, 2023
172658b
Finished creating text file with all boards
blakegood31 May 23, 2023
f3a9b4d
Added score to each board state in console and game_history.txt. Incr…
leathalman May 24, 2023
ff91b16
Updated Arena.py bar to show proper progress
blakegood31 May 24, 2023
364a28d
Added plot for win rates
blakegood31 May 24, 2023
e41dd91
Added p and v loss graphs for CNN training
leathalman May 24, 2023
61485de
Merge branch 'master' into loss_graphs
blakegood31 May 24, 2023
dea4ee1
Merge pull request #1 from leathalman/loss_graphs
blakegood31 May 24, 2023
f8b9c30
Improved graphing of statistics
blakegood31 May 25, 2023
e5a0acd
Fixed display enum, Game_History.txt now updates properly on NO_DISPL…
leathalman May 25, 2023
532b7fb
Cleaned up display function
leathalman May 25, 2023
8c9151a
Graphs now update after every iteration
leathalman May 25, 2023
14cc03f
Updated file system to save new game/training history each time
blakegood31 May 25, 2023
dac78ef
Updated to save new Game_History and Training_Results each time
blakegood31 May 25, 2023
5a4ff80
Game_Histories and Training_Results directories will now automaticall…
leathalman May 25, 2023
ebb5f1f
Fixed previously introduced bug with save location of csv files
leathalman May 25, 2023
a8a9324
Fixed issue with loading last model/checkpoints
blakegood31 May 25, 2023
2774496
Merge branch 'master' of https://github.com/leathalman/AZ-Go
blakegood31 May 25, 2023
9c57c95
Plots now represent updateThreshold accurately and added debug inform…
leathalman May 26, 2023
345d90f
Added failsafe for infinite recursion bug
leathalman May 26, 2023
14b4139
New implementation of PettingZoo. Slightly improved over old implemen…
blakegood31 Jun 6, 2023
bbad199
Fixed the infinite recursion error from yesterday, but still dealing …
blakegood31 Jun 7, 2023
c9bf58c
Updated GoCoach and added board printing
leathalman Jun 8, 2023
e478dbf
Fixed bug where valids were not updated to action_mask from PettingZoo
leathalman Jun 8, 2023
39ca041
First stable PettingZoo release and code cleanup
leathalman Jun 8, 2023
53b75b5
Refactored print board for compatibility with PettingZoo
leathalman Jun 8, 2023
fe0a798
Added broken implementation of deep_copy
leathalman Jun 8, 2023
6bb1798
Working implementation of PettingZoo!
blakegood31 Jun 9, 2023
f60e59c
Merge pull request #2 from leathalman/petting_zoo_new_implementation
leathalman Jun 10, 2023
7605de0
Implemented DataParallel for CNN training
leathalman Jun 10, 2023
e53a741
Added Sabaki support for arena games and removed PZGo
leathalman Jun 19, 2023
e9f7db8
Fixed wrong bar incrementation during arena play
leathalman Jun 19, 2023
b299e0a
sgf files are now created for each game in an iteration
leathalman Jun 20, 2023
2d8effb
Added function to cap length of self-play (episode) games
leathalman Jun 20, 2023
076ce48
Added dynamic batch sizes to neural network
leathalman Jun 20, 2023
94bad6b
Updated console output for training session
leathalman Jun 20, 2023
8f8ebb7
Improved console printout
leathalman Jun 21, 2023
bedf77c
Reimplemented distributed GPU training, sgf support, and improved con…
leathalman Jun 26, 2023
c410471
Proof of concept for parallel games
leathalman Jun 27, 2023
06d5866
Updated eps bar print out
leathalman Jun 27, 2023
b4d8dfb
Semi-functional parallelized arena play
leathalman Jun 27, 2023
734789e
Reverted arena and updated parameters for RES training
leathalman Jun 28, 2023
b4e5a4c
Merge branch 'master' into reversion
leathalman Jun 28, 2023
d8049ba
Merge pull request #3 from leathalman/reversion
leathalman Jun 28, 2023
fd157c7
Remerge changes from reversion
leathalman Jun 28, 2023
cc3b7a6
Added distributed NN training for ResNet
leathalman Jun 29, 2023
2da227c
Added folder with all engine code
leathalman Jul 10, 2023
428bd80
Cleaned up engine.py
leathalman Jul 10, 2023
8548a1c
Added code to allow for distributed training with Google Drive
blakegood31 Jul 11, 2023
e511a4b
Added log to track total number of games generated for each iteration
blakegood31 Jul 11, 2023
2066dfc
Reimplemented distributed training on old commit
blakegood31 Jul 14, 2023
c6b003c
Added logs for losses and winrate and added data parallel for RESnet
blakegood31 Jul 14, 2023
7b767c1
Added credentials, log for number of games used in training, and upda…
blakegood31 Jul 14, 2023
c356ac6
Cleaned up and removed deprecated async calls
leathalman Jul 17, 2023
55e44ec
Small fix for distributed_training set to false
leathalman Jul 17, 2023
e23e925
Added failsafe to avoid running out of RAM
blakegood31 Jul 17, 2023
08184b4
what is happening
blakegood31 Jul 17, 2023
365fbab
Added dynamic dumping and loading of trainExamples for memory optimiz…
leathalman Jul 18, 2023
4f09157
Merged changes (added RAM limiter)
leathalman Jul 18, 2023
bb8e2e4
Made small changes to help save RAM
blakegood31 Jul 18, 2023
59d6ac1
Updated logic for keeping train examples history
blakegood31 Jul 19, 2023
91f9d0c
Added support for single game checkpoint files
leathalman Jul 19, 2023
254c8c4
Merge branch 'master' into Reimplementation
leathalman Jul 19, 2023
6f5a68e
Fixed bug with episode loading and training
leathalman Jul 20, 2023
3fc9854
Added failsafe for improperly downloaded files
leathalman Jul 20, 2023
68b6130
Updated DriveAPI to remove local args
leathalman Jul 20, 2023
a27c5ed
Overhaul of neural net, mcts, and train examples
blakegood31 Jul 25, 2023
4317f5b
Large rewrite of training loop with dynamic downloading from Google D…
leathalman Jul 25, 2023
af16bfd
Added DataParallel back and fixed printout error
leathalman Jul 25, 2023
fd74073
Added local network distributed training
leathalman Jul 27, 2023
14c2fbf
Cleaned up code + removed deprecated code
leathalman Jul 27, 2023
74bb6da
Updated status bars for episode and arena
leathalman Jul 27, 2023
4989468
Updated NN parameters and disable trainExamples dynamic loading:
leathalman Jul 28, 2023
f39c919
Added yaml support
leathalman Aug 1, 2023
e5c05c0
Merge pull request #4 from leathalman/local_distributed
leathalman Aug 1, 2023
5bf1697
Added distributed training file
leathalman Aug 2, 2023
13cdb8c
Added temporary fix for pickler segmentation fault
leathalman Aug 3, 2023
ed97610
Added multiprocessing to distributed training
leathalman Aug 3, 2023
b2de6ae
Updated distributed MP to use spawn
leathalman Aug 3, 2023
5974944
Updated input for neural network to 17x7x7
blakegood31 Aug 4, 2023
91a5f1c
Merge pull request #5 from leathalman/nnet_input_update
leathalman Aug 4, 2023
09137ed
Added optimizers to yaml and error handling for incorrect optimizer d…
leathalman Aug 4, 2023
22629af
Added fix for file scanning error with glob
leathalman Aug 7, 2023
3757add
Fixed bug with status bar value divided by 0
leathalman Aug 7, 2023
a558396
Refactored engine code into a single file
leathalman Aug 7, 2023
c7c33cf
Updated engine for 17 x 7 x 7 activation dimensions
leathalman Aug 7, 2023
a5629be
Updated neural net input to 18x7x7
blakegood31 Aug 7, 2023
444901f
Merge pull request #6 from leathalman/18x7x7_update
leathalman Aug 8, 2023
d1c08b9
Added fix for incorrect channel size in Arena
leathalman Aug 8, 2023
e411db0
Fixed pathing issues with engine
leathalman Aug 8, 2023
c0ce5f6
Fix for segmentation fault error. Now using dill library instead of p…
blakegood31 Aug 8, 2023
78d532f
Updated distributed training to use dill
blakegood31 Aug 8, 2023
21a7274
Updated engine for 18 x 7 x 7 activation dimensions
leathalman Aug 8, 2023
7ce6417
Fixed loading in train examples when loading from checkpoint
blakegood31 Aug 8, 2023
43c35f1
Added dirichlet noise to self play games
blakegood31 Aug 10, 2023
e38fe70
Reworked game ending criteria and added ability to disable score thre…
leathalman Aug 10, 2023
3a6271b
Added fix to pass move being masked/unmasked
blakegood31 Aug 11, 2023
fba1dc5
Pass is masked for first 15 moves in self play and unmasked for all m…
leathalman Aug 11, 2023
7a85fe1
Fixed minor naming error
blakegood31 Aug 11, 2023
7d3b1f1
Improved distributed worker to wait for model to be uploaded
blakegood31 Aug 11, 2023
f5c75cd
Refactor and cleaned up status bar print outs
leathalman Aug 11, 2023
9e035fe
Made changes to end game and fixed error with reading/saving files
blakegood31 Aug 31, 2023
63f6539
Merge branch 'terminal_state_improvements' of https://github.com/leat…
blakegood31 Aug 31, 2023
b33d704
Modified action selection in Coach to choose highest probability move…
blakegood31 Sep 8, 2023
f64d08a
Merge pull request #7 from leathalman/terminal_state_improvements
leathalman Sep 11, 2023
8cfc19e
Fixed bug in engine code and updated temp in config.yaml
leathalman Sep 18, 2023
f268416
Added dynamic scaling for maximum number of moves allowed in a game
leathalman Sep 18, 2023
149f55e
Fixed incorrect file deletion
leathalman Sep 18, 2023
5a57016
First stable refactor
leathalman Oct 17, 2023
e212773
Refactored learn() and executeEpisode() in coach.py
blakegood31 Oct 17, 2023
e87b5fa
Added playout cap randomization to self play games
blakegood31 Oct 17, 2023
5021f42
Added bug fixes for incorrect x-axis on graphs, broken yaml configs, …
leathalman Oct 24, 2023
9ba2b5e
Fixed pathing for engine code
leathalman Oct 25, 2023
1181082
Fixed engine import path for model.tar
leathalman Oct 25, 2023
0a35bb3
Updated pathing for distributed_worker
leathalman Oct 25, 2023
8d18af3
Updated parameters and number of games on first iteration of distribu…
leathalman Oct 25, 2023
6d40f43
Arena play now returns after threshold is met by one of the players
blakegood31 Jan 19, 2024
8351d76
Added cosine annealing functionality
blakegood31 Jan 20, 2024
602acb5
Added support for multiple concurrent worker machines
leathalman Jan 24, 2024
8188f71
Merge branch 'modified_arena'
leathalman Jan 24, 2024
98804b2
Restored config.yaml
leathalman Jan 24, 2024
5b0d10c
Fixed error where distributed/examples directory was not created by s…
leathalman Jan 25, 2024
e779441
Implemented fix for pickler, fixed arena statistics, and unmasked mov…
blakegood31 Jan 26, 2024
24d30b1
Merge remote-tracking branch 'origin/pickler_fix'
leathalman Jan 26, 2024
04284a9
First attempt at proper implementation of queue length
blakegood31 Jan 30, 2024
5c3a83c
Added/updated code to use sensibility layer
blakegood31 Feb 1, 2024
c6e9a41
Remasked pi values for training, config set to Model Q levels
leathalman Feb 16, 2024
378ebce
Updated pathing for loading in models
leathalman Feb 18, 2024
34d782b
Fixed load_model() in coach to actually load model
blakegood31 Mar 15, 2024
0648a00
Modified how acceptance threshold is calculated for new models
blakegood31 Mar 15, 2024
36418cc
Added updated engine code
blakegood31 Mar 15, 2024
ea5a7a0
Updated loss function and added loadsgf command to engine
blakegood31 Mar 21, 2024
Made small changes to help save RAM
blakegood31 committed Jul 18, 2023
commit bb8e2e4fde80986a6ce8454626db73730b73e400
27 changes: 17 additions & 10 deletions GoCoach.py
@@ -12,7 +12,7 @@
import matplotlib.pyplot as plt
from DriveAPI import DriveAPI
import psutil

import gc

class Coach():
"""
@@ -136,10 +136,10 @@ def learn(self):

if self.args.distributed_training:
# Create drive object
print('RAM Used before download (GB):', psutil.virtual_memory()[3] / 1000000000)

drive = DriveAPI()
downloads_count = 0

print('RAM Used before download (GB):', psutil.virtual_memory()[3] / 1000000000)
# Get list of all files in Google Drive
files = []
for item in drive.items:
@@ -175,9 +175,12 @@ def learn(self):
elif self.args.load_model:
downloads_count += 1
file_path = os.path.join(self.args.checkpoint, drive.items[j]['name'])
iterationTrainExamples = self.loadDownloadedExamples(file_path, iterationTrainExamples)
self.loadDownloadedExamples(file_path)
append_downloads = True

del files
del downloaded_files
gc.collect()
downloads_count = downloads_count * 5
downloads_count += self.args.numEps
else:
@@ -217,13 +220,15 @@ def learn(self):
while len(self.trainExamplesHistory) > self.args.numItersForTrainExamplesHistory:
self.trainExamplesHistory.pop(0)

print("Ram used before clear itTrainEx: ", psutil.virtual_memory()[3] / 1000000000)
self.iterationTrainExamples.clear()
print("Ram used after clear itTrainEx: ", psutil.virtual_memory()[3] / 1000000000)
# prune trainExamples to meet ram requirement
ramUsed = int(psutil.virtual_memory()[3] / 1000000000)
ramCap = self.args.ram_cap
while ramUsed > ramCap:
while int(psutil.virtual_memory()[3] / 1000000000) > ramCap and len(self.trainExamplesHistory) > 13:
print(len(self.trainExamplesHistory))
self.trainExamplesHistory.pop(0)

# backup history to a file
# NB! the examples were collected using the model from the previous iteration, so (i-1)
self.saveTrainExamples(i - 1)
@@ -234,16 +239,18 @@ def learn(self):
trainExamples.extend(e)
shuffle(trainExamples)

print("Ram used before clear trainExHis: ", psutil.virtual_memory()[3] / 1000000000)
self.trainExamplesHistory = []
print("Ram used after clear trainExHis: ", psutil.virtual_memory()[3] / 1000000000)

# training new network, keeping a copy of the old one
self.nnet.save_checkpoint(folder=self.args.checkpoint, filename='temp.pth.tar')
self.pnet.load_checkpoint(folder=self.args.checkpoint, filename='temp.pth.tar')
pmcts = MCTS(self.game, self.pnet, self.args)

trainLog = self.nnet.train(trainExamples)

trainExamples = []
# clear trainExamples to save memory, reload when needed
self.trainExamplesHistory = []
self.iterationTrainExamples.clear()

self.p_loss_per_iteration.append(np.average(trainLog['P_LOSS'].to_numpy()))
self.v_loss_per_iteration.append(np.average(trainLog['V_LOSS'].to_numpy()))
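
Note on the pruning loop in the `-217,13 +220,15` hunk: the hunk shows two loop conditions, one comparing a `ramUsed` value read once before the loop, and one that calls `psutil.virtual_memory()` on every pass and also stops once the history is down to 13 entries. The sketch below illustrates the second pattern in isolation. It is not the project's code: `prune_history`, `used_ram_gb`, `min_keep`, and the injectable `probe` argument are hypothetical names introduced here so the loop can be exercised without real memory pressure. `psutil.virtual_memory()[3]` is the `used` field, so the sketch reads it by name; unlike the diff, it compares floats rather than truncating to `int` first.

```python
from typing import Callable, List


def used_ram_gb() -> float:
    """System RAM currently in use, in decimal gigabytes (bytes / 1e9)."""
    import psutil  # third-party; imported lazily so prune_history can run without it

    return psutil.virtual_memory().used / 1e9


def prune_history(
    history: List[list],
    ram_cap_gb: float,
    min_keep: int = 13,
    probe: Callable[[], float] = used_ram_gb,
) -> int:
    """Pop the oldest entries while memory usage exceeds the cap.

    Usage is re-read on every pass, and the loop stops once only
    `min_keep` entries remain, so it cannot spin on a stale reading
    or pop from an empty list. Returns the number of entries dropped.
    """
    dropped = 0
    while len(history) > min_keep and probe() > ram_cap_gb:
        history.pop(0)
        dropped += 1
    return dropped
```

Passing a fake `probe` makes the behaviour easy to check: the loop drops entries only while the reported usage is above the cap, and never goes below `min_keep`.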
8 changes: 4 additions & 4 deletions main.py
@@ -24,19 +24,19 @@ class Display(IntEnum):
 args = dotdict({
     # training parameters
     'numIters': 1000,
-    'numEps': 2, # Number of complete self-play games to simulate during a new iteration.
+    'numEps': 100, # Number of complete self-play games to simulate during a new iteration.
     'tempThreshold': 15,
     'updateThreshold': 0.54,
     # During arena playoff, new neural net will be accepted if threshold or more of games are won.
     'maxlenOfQueue': 200000, # Number of game examples to train the neural networks.
     'numMCTSSims': 150, # Number of games moves for MCTS to simulate.
-    'arenaCompare': 2, # Number of games to play during arena play to determine if new net will be accepted.
+    'arenaCompare': 50, # Number of games to play during arena play to determine if new net will be accepted.
     'cpuct': 1.0,
     'numItersForTrainExamplesHistory': 15,

     # customization
-    'load_model': False,
-    'distributed_training': False, # use Google Drive for computing on multiple machines
+    'load_model': True,
+    'distributed_training': True, # use Google Drive for computing on multiple machines
     'display': Display.DISPLAY_BAR,
     'ram_cap': 30,

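
Note on the `args` block: GoCoach.py reads these settings as attributes (`self.args.ram_cap`, `self.args.distributed_training`), which implies that `dotdict` exposes dictionary keys as attributes. The definition of `dotdict` is not part of this diff, so the class below is an assumed minimal stand-in, shown only to illustrate how the flags changed in this hunk would be read. The `describe` helper is hypothetical. The values are the second of each duplicated line in the hunk, which in a unified diff is normally the added side.

```python
class dotdict(dict):
    """Assumed stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Values copied from the second of each duplicated line in the main.py hunk.
args = dotdict({
    'numEps': 100,
    'arenaCompare': 50,
    'load_model': True,
    'distributed_training': True,
    'ram_cap': 30,  # compared against bytes / 1e9 in GoCoach.py, so decimal GB
})


def describe(a: dotdict) -> str:
    """Summarise the run mode implied by the flags (illustrative helper)."""
    mode = 'distributed' if a.distributed_training else 'local'
    return (
        f"{mode} training, load_model={a.load_model}, "
        f"{a.numEps} self-play games, RAM cap {a.ram_cap} GB"
    )
```

Raising `AttributeError` for unknown keys, rather than letting `KeyError` escape, keeps `hasattr` and `getattr(args, name, default)` working as they do on ordinary objects.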