2019/8/13

caffe train snapshot and resume

After resuming training (with --snapshot=XXX.solverstate), the following error appears right after a snapshot is saved:
I0813 15:10:02.564018 12537 solver.cpp:635] Iteration 6000, Testing net (#0)
F0813 15:10:02.564225 12537 net.cpp:1081] Check failed: target_blobs[j]->shape() == source_blob->shape() Cannot share param 0 weights from layer 'conv1a/bn'; shape mismatch.  Source param shape is 1 32 1 1 (32); target param shape is 32 (32)
*** Check failure stack trace: ***
    @     0x7fe99628b0cd  google::LogMessage::Fail()
    @     0x7fe99628cf33  google::LogMessage::SendToLog()
    @     0x7fe99628ac28  google::LogMessage::Flush()
    @     0x7fe99628d999  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fe996e9b28a  caffe::Net::ShareTrainedLayersWith()
    @     0x7fe99729c883  caffe::Solver::TestDetection()
    @     0x7fe99729f91d  caffe::Solver::TestAll()
    @     0x7fe9972a06ac  caffe::Solver::Step()
    @     0x7fe9972a1fd2  caffe::Solver::Solve()
    @     0x555c3341fe4e  train()
    @     0x555c3341cdb1  main
    @     0x7fe9948b9b97  __libc_start_main
    @     0x555c3341db4a  _start
However, if I train without resuming (no --snapshot), training succeeds.

2019/8/12

caffe-jacinto-models train_image_object_detection.sh

Continuing from the previous post: once the training data is prepared, running this script performs the training.
Roughly, the script does the following:

It calls image_object_detection.py, which sets up the parameters and generates five run.sh scripts:
./initial/run.sh
./l1reg/run.sh
./sparse/run.sh
./test/run.sh
./test_quantize/run.sh

Then it runs them in sequence:
#run
list_dirs=`command ls -d1 "$folder_name"/*/ | command cut -f5 -d/`
for f in $list_dirs; do "$folder_name"/$f/run.sh; done
So the run.sh scripts are executed in alphabetical order of the folder names.

Each run.sh calls caffe.bin, takes the *.caffemodel produced by the previous stage's run.sh as the initial weights, trains according to its own solver.prototxt, and produces a new *.caffemodel.
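For example, the l1reg stage's run.sh looks roughly like this (a sketch; the final iteration number of the initial stage's caffemodel, shown here as iter_32000, is an assumption and will differ for your run):

# sketch of a later-stage run.sh: --weights points at the previous (initial) stage's output
/home/checko/caffe-jacinto/build/tools/caffe.bin train \
--solver="training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/l1reg/solver.prototxt" \
--weights="training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/initial/voc0712_ssdJacintoNetV2_iter_32000.caffemodel" \
--gpu "0" 2>&1 | tee training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/l1reg/run.log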

According to solver.prototxt, a snapshot (caffemodel) is taken every 2000 iterations.
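To double-check the interval and the snapshot file prefix, you can grep the stage's solver.prototxt (the path is the one from the run.sh example later in this post; the field values shown in the comments are assumptions):

# show the snapshot-related settings of the initial stage
grep snapshot training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/initial/solver.prototxt
# typically prints something like (prefix here is a guess based on the file names below):
# snapshot: 2000
# snapshot_prefix: "training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/initial/voc0712_ssdJacintoNetV2"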

So if train_image_object_detection.sh is interrupted partway through, check the contents of each of the folders above under training/.../ to see which script was the last one to run.
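For example, something like this (a sketch, assuming the timestamped folder name from the run.sh example below) lists the newest files in each stage folder; the last stage with output is the one that was interrupted:

# newest few files per stage folder
DIR=training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1
for d in "$DIR"/*/; do echo "== $d"; ls -t "$d" | head -3; done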


Since a snapshot is taken every 2000 iterations, after an interruption you can use the --snapshot argument to resume training from the latest snapshot.
Reference: Training and Resuming
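To find the newest solverstate of a stage (a sketch; adjust the stage folder to your own run):

# newest .solverstate in the initial stage folder
ls -t training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/initial/*.solverstate | head -n1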

But run.sh then has to be modified.
For example, to resume the initial stage from iteration 16000:
--- run.sh 2019-08-12 13:49:23.741507855 +0800
+++ resume.sh 2019-08-13 09:44:19.540395008 +0800
@@ -1,4 +1,4 @@
 /home/checko/caffe-jacinto/build/tools/caffe.bin train \
 --solver="training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/initial/solver.prototxt" \
---weights="training/imagenet_jacintonet11v2_iter_320000.caffemodel" \
+--snapshot="training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/initial/voc0712_ssdJacintoNetV2_iter_16000.solverstate" \
 --gpu "0" 2>&1 | tee training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/initial/run.log
In other words, the original --weights argument is changed to --snapshot.
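If you don't want to edit the file by hand, a sed one-liner like this produces the same resume.sh (a sketch; SNAP should point at your latest solverstate, the name here is just the one from the diff):

STAGE="training/voc0712/JDetNet/20190812_13-49_ds_PSP_dsFac_32_hdDS8_1/initial"
SNAP="$STAGE/voc0712_ssdJacintoNetV2_iter_16000.solverstate"
# replace the --weights line with a --snapshot line, keeping the trailing backslash
sed 's|^--weights=.*|--snapshot="'"$SNAP"'" \\|' "$STAGE/run.sh" > "$STAGE/resume.sh"
chmod +x "$STAGE/resume.sh"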



After training, for testing (inference), running or importing ssd_detect_video.py fails with an error that propagate_obj.py cannot be found.
Checking git log shows the file was deleted in a later commit.
git checkout 6ca88ff12e559a839ae5ee9bc7c25201f0ed9217 scripts/propagate_obj.py
This restores the file.