python 2.7 - Saving checkpoints and resuming training in TensorFlow


I am playing with saving checkpoints and resuming training from saved checkpoints. I was following the example given in https://www.tensorflow.org/versions/r0.8/api_docs/python/train.html#import_meta_graph. To keep things simple, I have not used any 'real' training of a network. I just performed a simple subtraction op, and each checkpoint saves the same operation on the same tensors again and again. A minimal example is provided in the form of the following IPython notebook: https://gist.github.com/dasabir/29b8f84c6e5e817a72ce06584e988f10

In the first phase, I'm running the loop 100 times (by setting the value of the variable 'enditer = 100' in the code) and saving checkpoints every 10th iteration. So the checkpoints being saved are numbered 9, 19, ..., 99. When I change the 'enditer' value to 200 and resume training, the checkpoints again start being saved from 9, 19, ... (not 109, 119, 129, ...). Is there a trick I'm missing?

Can you print out 'latest_ckpt' and see if it points to the latest ckpt file? Also, you need to maintain the global_step using a tf.Variable:

    global_step = tf.Variable(0, name='global_step', trainable=False)
    ...
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and ckpt.model_checkpoint_path:
        print ckpt.model_checkpoint_path
        saver.restore(sess, ckpt.model_checkpoint_path)  # restore all variables

    start = global_step.eval()  # get the last saved global_step
    print "start from:", start

    for i in range(start, 100):
        ...
        global_step.assign(i).eval()  # set and update (eval) global_step with the index i
        saver.save(sess, ckpt_dir + "/model.ckpt", global_step=global_step)

You can take a look at the full example:

https://github.com/nlintz/tensorflow-tutorials/pull/32/files
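The idea behind the answer, that the step counter has to be saved with the checkpoint and read back before the loop starts, can be sketched without TensorFlow. The following is only a framework-free analogy, not the TensorFlow API: the helper names `latest_step` and `train`, the JSON file format, and the `model.ckpt-<step>.json` naming are all assumptions made for illustration.

```python
import json
import os
import tempfile

PREFIX = "model.ckpt-"   # hypothetical file naming, chosen for this sketch
SUFFIX = ".json"


def latest_step(ckpt_dir):
    """Return the highest step saved in ckpt_dir, or -1 if none exist."""
    steps = [
        int(name[len(PREFIX):-len(SUFFIX)])
        for name in os.listdir(ckpt_dir)
        if name.startswith(PREFIX) and name.endswith(SUFFIX)
    ]
    return max(steps) if steps else -1


def train(ckpt_dir, end_iter, save_every=10):
    """Run the loop up to end_iter, resuming after the last saved step."""
    state = {"global_step": -1, "value": 0}
    last = latest_step(ckpt_dir)
    if last >= 0:
        # Restore everything, including the step counter itself.
        with open(os.path.join(ckpt_dir, "%s%d%s" % (PREFIX, last, SUFFIX))) as f:
            state = json.load(f)

    saved = []
    for i in range(state["global_step"] + 1, end_iter):
        state["value"] += 1          # stand-in for one training step
        state["global_step"] = i     # keep the counter inside the saved state
        if (i + 1) % save_every == 0:
            path = os.path.join(ckpt_dir, "%s%d%s" % (PREFIX, i, SUFFIX))
            with open(path, "w") as f:
                json.dump(state, f)
            saved.append(i)
    return saved


ckpt_dir = tempfile.mkdtemp()
first_run = train(ckpt_dir, end_iter=100)    # saves steps 9, 19, ..., 99
second_run = train(ckpt_dir, end_iter=200)   # resumes and saves 109, ..., 199
print(first_run)
print(second_run)
```

If the loop always started from 0 instead of from the restored counter, the second run would number its checkpoints 9, 19, ... again, which is the behaviour described in the question.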

