Optimize several details to support training on local and remote GPUs

master
詹力 2023-09-22 00:47:00 +08:00
parent cad4d97004
commit bdf17e8356
22 changed files with 63 additions and 8 deletions

.gitignore vendored Normal file

@@ -0,0 +1,3 @@
**/__pycache__
*.pth
**/logs

README.md

@@ -1,4 +1,4 @@
sThis repository hosts the source code of our paper: [[CVPR 2022] Cascade Transformers for End-to-End Person Search](https://arxiv.org/abs/2203.09642). In this work, we developed a novel Cascaded Occlusion-Aware Transformer (COAT) model for end-to-end person search. The COAT model outperforms **state-of-the-art** methods on the PRW benchmark dataset by a large margin and achieves state-of-the-art performance on the CUHK-SYSU dataset.
This repository hosts the source code of our paper: [[CVPR 2022] Cascade Transformers for End-to-End Person Search](https://arxiv.org/abs/2203.09642). In this work, we developed a novel Cascaded Occlusion-Aware Transformer (COAT) model for end-to-end person search. The COAT model outperforms **state-of-the-art** methods on the PRW benchmark dataset by a large margin and achieves state-of-the-art performance on the CUHK-SYSU dataset.
| Dataset | mAP | Top-1 | Model |
| --------- | ---- | ----- | ------------------------------------------------------------ |
@@ -43,12 +43,23 @@ conda activate coat
If you want to install another version of PyTorch, you can modify the versions in `coat_pt171.yml`. Just make sure the dependencies have the appropriate version.
## Experiments on CUHK-SYSU
**Training**: The code currently only supports single GPU. The default training script for CUHK-SYSU is as follows:
## Experiments on the CUHK-SYSU dataset
**Training**: The code currently only supports a single GPU. The default training script for CUHK-SYSU is as follows:
```
**Training on a local RTX 4090**
```bash
cd COAT
python train.py --cfg configs/cuhk_sysu.yaml INPUT.BATCH_SIZE_TRAIN 3 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 14 SOLVER.LR_DECAY_MILESTONES [11] MODEL.LOSS.USE_SOFTMAX True SOLVER.LW_RCNN_SOFTMAX_2ND 0.1 SOLVER.LW_RCNN_SOFTMAX_3RD 0.1 OUTPUT_DIR ./logs/cuhk-sysu
# Note: the 4090 has relatively little VRAM, so the batch size can only be set to 2; verified to run
python train.py --cfg configs/cuhk_sysu-local.yaml INPUT.BATCH_SIZE_TRAIN 2 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 14 SOLVER.LR_DECAY_MILESTONES [11] MODEL.LOSS.USE_SOFTMAX True SOLVER.LW_RCNN_SOFTMAX_2ND 0.1 SOLVER.LW_RCNN_SOFTMAX_3RD 0.1 OUTPUT_DIR ./logs/cuhk-sysu
```
**Training on the UESTC machine**
```bash
cd COAT
# Note: the RTX 8000 has 48 GB of VRAM, so the batch size can only be set to 3
python train.py --cfg configs/cuhk_sysu.yaml INPUT.BATCH_SIZE_TRAIN 2 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 14 SOLVER.LR_DECAY_MILESTONES [11] MODEL.LOSS.USE_SOFTMAX True SOLVER.LW_RCNN_SOFTMAX_2ND 0.1 SOLVER.LW_RCNN_SOFTMAX_3RD 0.1 OUTPUT_DIR ./logs/cuhk-sysu
```
Note that the dataset-specific parameters are defined in `configs/cuhk_sysu.yaml`. When the batch size (`INPUT.BATCH_SIZE_TRAIN`) is 3, the training will take about 23GB GPU memory, being suitable for GPUs like RTX6000. When the batch size is 5, the training will take about 38GB GPU memory, being able to run on A100 GPU. The larger batch size usually results in better performance on CUHK-SYSU.
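As a rough illustration only (this helper is not part of the repository), the memory figures above can be turned into a small function that suggests a value for `INPUT.BATCH_SIZE_TRAIN` from the detected GPU; the thresholds are assumptions derived from the ~23 GB / ~38 GB estimates quoted in this note and from the commit's own comments about the 24 GB RTX 4090.

```python
# Hypothetical helper, not in the COAT repo: suggest INPUT.BATCH_SIZE_TRAIN
# from the memory of the first visible GPU. Thresholds follow the README's
# rough estimates (~23 GB for batch size 3, ~38 GB for batch size 5).
import torch

def suggest_batch_size() -> int:
    if not torch.cuda.is_available():
        return 1  # CPU-only machine: keep the batch minimal
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    if total_gb >= 38:
        return 5   # e.g. A100, per the note above
    if total_gb >= 30:
        return 3   # ~23 GB is needed for batch size 3, so leave some headroom
    return 2       # e.g. a 24 GB RTX 4090, as noted in the commands above

print(suggest_batch_size())
```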
@@ -57,6 +68,8 @@ For the CUHK-SYSU dataset, we use a relative low weight for softmax loss (`SOLVE
**Testing**: The test script is very simple. You just need to add the flag `--eval` and provide the folder `--ckpt` where the [model](https://drive.google.com/file/d/1LkEwXYaJg93yk4Kfhyk3m6j8v3i9s1B7/view?usp=sharing) was saved.
Testing: This test script is very simple; you just need to add the flag `--eval` and provide `--ckpt` with the folder where the model was saved.
```
python train.py --cfg ./configs/cuhk-sysu/config.yaml --eval --ckpt ./logs/cuhk-sysu/cuhk_COAT.pth
```
@@ -76,8 +89,19 @@ python train.py --cfg ./configs/cuhk-sysu/config.yaml --eval --ckpt ./logs/cuhk-
## Experiments on PRW
**Training**: The script is similar to CUHK-SYSU. The code currently only supports single GPU. The default training script for PRW is as follows:
```
**Training on a local RTX 4090**
```bash
cd COAT
# The PRW dataset is smaller, so the batch size can be set to 3 on the RTX 4090
python train.py --cfg ./configs/prw-local.yaml INPUT.BATCH_SIZE_TRAIN 3 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 13 MODEL.LOSS.USE_SOFTMAX True OUTPUT_DIR ./logs/prw
```
**Training on the UESTC machine**
```bash
cd COAT
# The PRW dataset is smaller, so the batch size can be set to 3
python train.py --cfg ./configs/prw.yaml INPUT.BATCH_SIZE_TRAIN 3 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 13 MODEL.LOSS.USE_SOFTMAX True OUTPUT_DIR ./logs/prw
```

Binary file not shown.

Binary file not shown.

Binary file not shown.

configs/cuhk_sysu-local.yaml Normal file

@@ -0,0 +1,15 @@
OUTPUT_DIR: "./logs/cuhk_coat"
INPUT:
  DATASET: "CUHK-SYSU"
  DATA_ROOT: "E:/DeepLearning/PersonSearch/COAT/datasets/CUHK-SYSU"
  BATCH_SIZE_TRAIN: 3
SOLVER:
  MAX_EPOCHS: 14
  BASE_LR: 0.003
  LW_RCNN_SOFTMAX_2ND: 0.1
  LW_RCNN_SOFTMAX_3RD: 0.1
MODEL:
  LOSS:
    LUT_SIZE: 5532
    CQ_SIZE: 5000
DISP_PERIOD: 100
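The new YAML above (presumably saved as `configs/cuhk_sysu-local.yaml`, the file referenced by the local training command) is merged on top of the project defaults and then overridden by the trailing `KEY VALUE` pairs passed to `train.py`. A minimal sketch of that merge order, assuming the project uses yacs as its `_C` config nodes suggest; the default values below are placeholders, not the repository's real defaults:

```python
# Sketch only: reproduce the config merge order with yacs.
from yacs.config import CfgNode as CN

_C = CN()
_C.OUTPUT_DIR = "./logs"
_C.DISP_PERIOD = 10
_C.INPUT = CN()
_C.INPUT.DATASET = "CUHK-SYSU"
_C.INPUT.DATA_ROOT = ""
_C.INPUT.BATCH_SIZE_TRAIN = 5
_C.SOLVER = CN()
_C.SOLVER.MAX_EPOCHS = 14
_C.SOLVER.BASE_LR = 0.003
_C.SOLVER.LW_RCNN_SOFTMAX_2ND = 0.5
_C.SOLVER.LW_RCNN_SOFTMAX_3RD = 0.5
_C.MODEL = CN()
_C.MODEL.LOSS = CN()
_C.MODEL.LOSS.LUT_SIZE = 5532
_C.MODEL.LOSS.CQ_SIZE = 5000

cfg = _C.clone()
cfg.merge_from_file("configs/cuhk_sysu-local.yaml")       # the --cfg file
cfg.merge_from_list(["INPUT.BATCH_SIZE_TRAIN", "2",       # trailing KEY VALUE pairs
                     "OUTPUT_DIR", "./logs/cuhk-sysu"])    # override the YAML
print(cfg.INPUT.BATCH_SIZE_TRAIN, cfg.OUTPUT_DIR)          # -> 2 ./logs/cuhk-sysu
```

This is why `INPUT.BATCH_SIZE_TRAIN 2` on the command line takes effect even though the YAML sets 3.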

configs/prw-local.yaml Normal file

@@ -0,0 +1,13 @@
OUTPUT_DIR: "./logs/prw_coat"
INPUT:
  DATASET: "PRW"
  DATA_ROOT: "E:/DeepLearning/PersonSearch/COAT/datasets/PRW"
  BATCH_SIZE_TRAIN: 3
SOLVER:
  MAX_EPOCHS: 13
  BASE_LR: 0.003
MODEL:
  LOSS:
    LUT_SIZE: 482
    CQ_SIZE: 500
DISP_PERIOD: 100

defaults.py

@@ -205,7 +205,7 @@ _C.DISP_PERIOD = 10
# Whether to use tensorboard for visualization
_C.TF_BOARD = True
# The device loading the model
_C.DEVICE = "cuda:1"
_C.DEVICE = "cuda:0"
# Set seed to negative to fully randomize everything
_C.SEED = 1
# Directory where output files are written
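The only functional change in this hunk switches the default device from `cuda:1` to `cuda:0`, so the same configuration runs on a single-GPU local machine as well as a multi-GPU remote one. As a hedged sketch (not the repository's actual code), a `DEVICE` string of this kind is typically resolved along these lines, with a CPU fallback when the requested GPU is absent:

```python
# Illustrative only: turn a cfg.DEVICE string ("cuda:0", "cuda:1", "cpu")
# into a torch.device, falling back to CPU if that GPU does not exist.
import torch

def resolve_device(name: str) -> torch.device:
    if name.startswith("cuda"):
        index = int(name.split(":")[1]) if ":" in name else 0
        if not torch.cuda.is_available() or index >= torch.cuda.device_count():
            return torch.device("cpu")
    return torch.device(name)

device = resolve_device("cuda:0")
# model.to(device); images = [img.to(device) for img in images]
```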

Binary file not shown.

Binary file not shown.