python 2.7 - Saving checkpoints and resuming training in tensorflow
I am playing with saving checkpoints and resuming training from saved checkpoints, following the example given in https://www.tensorflow.org/versions/r0.8/api_docs/python/train.html#import_meta_graph. To keep things simple, I have not used any 'real' training of a network; I just perform a simple subtraction op, and each checkpoint saves the same operation on the same tensors again and again. A minimal example is provided in the form of the following iPython notebook: https://gist.github.com/dasabir/29b8f84c6e5e817a72ce06584e988f10
In the first phase, I'm running the loop 100 times (by setting the variable 'enditer = 100' in the code) and saving a checkpoint at every 10th iteration, so the checkpoints are saved numbered 9, 19, ..., 99. When I change the 'enditer' value to 200 and resume training, the checkpoints again start being saved as 9, 19, ... (not 109, 119, 129, ...). Is there a trick I'm missing?
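For reference, a minimal sketch of the loop described above, assuming hypothetical names (ckpt_dir, a, b) and the r0.8 API; the actual code is in the gist:

import tensorflow as tf

# A trivial subtraction op stands in for training; a checkpoint is
# written every 10th iteration, numbered by the loop index i.
a = tf.Variable(10.0)
b = tf.Variable(3.0)
sub = tf.sub(a, b)  # tf.sub in r0.8 (renamed tf.subtract later)

saver = tf.train.Saver()
ckpt_dir = './ckpt_dir'
enditer = 100

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
    for i in range(enditer):
        sess.run(sub)
        if (i + 1) % 10 == 0:
            # i restarts at 0 on every run, so even after restoring,
            # the checkpoints are numbered 9, 19, ... again.
            saver.save(sess, ckpt_dir + '/model.ckpt', global_step=i)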
Can you print out 'latest_ckpt' and see if it points to the latest checkpoint file? Also, you need to maintain the global_step in a tf.Variable:
global_step = tf.Variable(0, name='global_step', trainable=False)
...
ckpt = tf.train.get_checkpoint_state(ckpt_dir)
if ckpt and ckpt.model_checkpoint_path:
    print ckpt.model_checkpoint_path
    saver.restore(sess, ckpt.model_checkpoint_path)  # restore all variables

start = global_step.eval()  # get the last global_step
print "start from:", start
for i in range(start, 100):
    ...
    global_step.assign(i).eval()  # set and update (eval) global_step with index i
    saver.save(sess, ckpt_dir + "/model.ckpt", global_step=global_step)
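Because global_step is an ordinary tf.Variable, it gets written into the checkpoint along with everything else and is restored with saver.restore, so the loop (and hence the checkpoint numbering) picks up where the previous run stopped, e.g. 109, 119, ... instead of starting over at 9, 19, ...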
You can take a look at the full example:
https://github.com/nlintz/tensorflow-tutorials/pull/32/files