Submit initial code

main
詹力 2024-10-03 01:04:42 +08:00
commit d3d73b3377
43 changed files with 59997 additions and 0 deletions

Code/Python/.gitignore (+3 lines)

@@ -0,0 +1,3 @@
**/__pycache__
*.pth
**/logs


@@ -0,0 +1,26 @@
name: coat
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit=11.0
- numpy=1.19.2
- pillow=8.2.0
- pip=21.0.1
- python=3.8.8
- pytorch=1.7.1
- scipy=1.6.2
- torchvision=0.8.2
- tqdm=4.60.0
- scikit-learn=0.24.1
- black=21.5b0
- flake8=3.9.0
- isort=5.8.0
- tabulate=0.8.9
- future=0.18.2
- tensorboard=2.4.1
- tensorboardx=2.2
- pip:
- ipython==7.5.0
- yacs==0.1.8


@@ -0,0 +1,25 @@
## COAT Basic Notes
1. Basic principle and usage of MultiScaleRoIAlign
**MultiScaleRoIAlign is essentially an enhanced version of RoIAlign: RoIAlign converts the feature maps of regions of interest of arbitrary size into small feature maps with a fixed size `H × W`.** As with RoI pooling, the basic idea is to divide an `h × w` feature region into an `H × W` grid, where each cell is a sub-window of size roughly h/H × w/W, and then max-pool the values in each sub-window into the corresponding output grid cell. To review the concept of RoI pooling, see [this article](https://deepsense.ai/region-of-interest-pooling-explained/).
Anchor-based methods produce rectangular proposals and hence RoI regions of varying sizes, but the subsequent box regression and classification require fixed-size feature maps as input. The core purpose of RoIAlign is therefore to convert the feature maps of these differently sized RoI regions to a fixed size for the later stages. An introduction to how RoIAlign works can be found in [this article](https://blog.csdn.net/Bit_Coders/article/details/121203584).
```python
import torch
import torchvision
from collections import OrderedDict

# The roi module here pools the 'feat1' and 'feat3' feature maps into a fixed 5x5 size
roi = torchvision.ops.MultiScaleRoIAlign(['feat1', 'feat3'], output_size=5, sampling_ratio=2)
# Build simulated feature maps to mimic the multi-scale features from which RPN RoIs are pooled
i = OrderedDict()
i['feat1'] = torch.rand(1, 5, 64, 64)
# This feature map is not actually used; it simulates feature levels that are ignored in practice
i['feat2'] = torch.rand(1, 5, 32, 32)
i['feat3'] = torch.rand(1, 5, 16, 16)
# Create random boxes in (x1, y1, x2, y2) format
boxes = torch.rand(6, 4) * 256; boxes[:, 2:] += boxes[:, :2]
# Original image size
image_sizes = [(512, 512)]
# RoI Align operation
output = roi(i, [boxes], image_sizes)
# torch.Size([6, 5, 5, 5]): 6 boxes, 5 channels, 5x5 output size
print(output.shape)
```

Code/Python/README.md (+150 lines)

@@ -0,0 +1,150 @@
# **COAT Code Usage Guide**
This repository hosts the source code of the paper [[CVPR 2022] Cascade Transformers for End-to-End Person Search](https://arxiv.org/abs/2203.09642). In this work, we develop a novel Cascade Occluded Attention Transformer (COAT) model for end-to-end person search. COAT outperforms state-of-the-art methods by a significant margin on the PRW benchmark and achieves state-of-the-art performance on the CUHK-SYSU dataset.
| Dataset | mAP | Top-1 | Model |
| ---------------- | ---- | ----- | ------------------------------------------------------------ |
| CUHK-SYSU | 94.2 | 94.7 | [model](https://drive.google.com/file/d/1LkEwXYaJg93yk4Kfhyk3m6j8v3i9s1B7/view?usp=sharing) |
| PRW | 53.3 | 87.4 | [model](https://drive.google.com/file/d/1vEd_zzFN88RgxbRMG5-WfJZgD3vmP0Xg/view?usp=sharing) |
**Abstract**: The goal of person search is to localize a target person from a gallery set of scene images, which is extremely challenging due to large scale variations, pose/viewpoint changes, and occlusions. In this paper, we propose the Cascade Occluded Attention Transformer (COAT) for end-to-end person search. Specifically, our three-stage cascade design focuses on detecting people at the first stage, then progressively refines the representation for person detection and re-identification simultaneously at the following stages. The occluded attention transformer at each stage applies tighter intersection over union thresholds, forcing the network to learn coarse-to-fine pose/scale invariant features. Meanwhile, we calculate the occluded attention across instances in a mini-batch to differentiate tokens from other people or the background. In this way, we simulate the effect of other objects occluding a person of interest at the token-level. Through comprehensive experiments, we demonstrate the benefits of our method by achieving state-of-the-art performance on two benchmark datasets.
![COAT](doc/framework.png)
## Installation
1. Download the datasets in your path `$DATA_DIR`. Change the dataset paths in L4 in [cuhk_sysu.yaml](configs/cuhk_sysu.yaml) and [prw.yaml](configs/prw.yaml).
**PRW**:
```
cd $DATA_DIR
pip install gdown
gdown https://drive.google.com/uc?id=0B6tjyrV1YrHeYnlhNnhEYTh5MUU
unzip PRW-v16.04.20.zip
mv PRW-v16.04.20 PRW
```
**CUHK-SYSU**:
```
cd $DATA_DIR
gdown https://drive.google.com/uc?id=1z3LsFrJTUeEX3-XjSEJMOBrslxD2T5af
tar -xzvf cuhk_sysu.tar.gz
mv cuhk_sysu CUHK-SYSU
```
2. Our method is tested with PyTorch 1.7.1. You can install the required packages by anaconda/miniconda with the following commands:
```
cd COAT
conda env create -f COAT_pt171.yml
conda activate coat
```
If you want to install another version of PyTorch, you can modify the versions in `COAT_pt171.yml`. Just make sure the dependencies have compatible versions.
## Experiments on CUHK-SYSU
**Training**: The code currently only supports a single GPU. The default training script for CUHK-SYSU is as follows:
**Training locally on an RTX 4090**
``` bash
cd COAT
# Note: the RTX 4090 has relatively little GPU memory, so the batch size can only be set to 2 (verified to run)
python train.py --cfg configs/cuhk_sysu-local.yaml INPUT.BATCH_SIZE_TRAIN 2 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 14 SOLVER.LR_DECAY_MILESTONES [11] MODEL.LOSS.USE_SOFTMAX True SOLVER.LW_RCNN_SOFTMAX_2ND 0.1 SOLVER.LW_RCNN_SOFTMAX_3RD 0.1 OUTPUT_DIR ./logs/cuhk-sysu
```
**Training locally on the UESTC server**
```bash
cd COAT
# Note: the RTX 8000 has 48 GB of GPU memory, so the batch size can only be set to 3
python train.py --cfg configs/cuhk_sysu.yaml INPUT.BATCH_SIZE_TRAIN 2 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 14 SOLVER.LR_DECAY_MILESTONES [11] MODEL.LOSS.USE_SOFTMAX True SOLVER.LW_RCNN_SOFTMAX_2ND 0.1 SOLVER.LW_RCNN_SOFTMAX_3RD 0.1 OUTPUT_DIR ./logs/cuhk-sysu
```
Note that the dataset-specific parameters are defined in `configs/cuhk_sysu.yaml`. When the batch size (`INPUT.BATCH_SIZE_TRAIN`) is 3, training takes about 23GB of GPU memory, which is suitable for GPUs like the RTX6000. When the batch size is 5, training takes about 38GB of GPU memory and can run on an A100 GPU. A larger batch size usually results in better performance on CUHK-SYSU.
For the CUHK-SYSU dataset, we use a relatively low weight for the softmax loss (`SOLVER.LW_RCNN_SOFTMAX_2ND` 0.1 and `SOLVER.LW_RCNN_SOFTMAX_3RD` 0.1). The trained models and TensorBoard logs will be saved in the folder `OUTPUT_DIR`. Other important training parameters can be found in the file `COAT/defaults.py`; for example, `CKPT_PERIOD` is the frequency of saving a checkpoint model.
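The snippet below is a minimal sketch of how these settings come together, assuming the standard yacs `CfgNode` API and the `get_default_cfg()` helper defined in `defaults.py` (train.py may differ in its exact argument handling): the dataset YAML is merged over the defaults first, and the `KEY VALUE` pairs appended to the training command are merged last.
```python
from defaults import get_default_cfg  # defaults.py from this repository

cfg = get_default_cfg()
cfg.merge_from_file("configs/cuhk_sysu.yaml")   # dataset-specific settings
cfg.merge_from_list([                           # command-line KEY VALUE overrides
    "INPUT.BATCH_SIZE_TRAIN", 3,
    "SOLVER.LW_RCNN_SOFTMAX_2ND", 0.1,
    "SOLVER.LW_RCNN_SOFTMAX_3RD", 0.1,
    "CKPT_PERIOD", 1,                           # save a checkpoint every epoch
])
cfg.freeze()
print(cfg.INPUT.DATASET, cfg.SOLVER.BASE_LR, cfg.CKPT_PERIOD)
```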
**Testing**: The test script is very simple. You just need to add the flag `--eval` and provide the folder `--ckpt` where the [model](https://drive.google.com/file/d/1LkEwXYaJg93yk4Kfhyk3m6j8v3i9s1B7/view?usp=sharing) was saved.
```
python train.py --cfg ./configs/cuhk-sysu/config.yaml --eval --ckpt ./logs/cuhk-sysu/cuhk_COAT.pth
```
**Testing with CBGM**: Context Bipartite Graph Matching ([CBGM](https://github.com/serend1p1ty/SeqNet)) is an optimized matching algorithm used in the test phase. The details can be found in the paper [[AAAI 2021] Sequential End-to-end Network for Efficient Person Search](https://arxiv.org/abs/2103.10148). We can use CBGM to further improve the person search accuracy. In the test script, just set the flag `EVAL_USE_CBGM` to True (default is False).
```
python train.py --cfg ./configs/cuhk-sysu/config.yaml --eval --ckpt ./logs/cuhk-sysu/cuhk_COAT.pth EVAL_USE_CBGM True
```
**Testing with different gallery sizes on CUHK-SYSU**: The default gallery size for evaluating CUHK-SYSU is 100. If you want to test with other pre-defined gallery sizes (50, 100, 500, 1000, 2000, 4000) for drawing the CUHK-SYSU gallery size curve, please set the parameter `EVAL_GALLERY_SIZE` with a gallery size.
```
python train.py --cfg ./configs/cuhk-sysu/config.yaml --eval --ckpt ./logs/cuhk-sysu/cuhk_COAT.pth EVAL_GALLERY_SIZE 500
```
## Experiments on PRW
**Training**: The script is similar to CUHK-SYSU. The code currently only supports single GPU. The default training script for PRW is as follows:
**Training locally on an RTX 4090**
```bash
cd COAT
# The PRW dataset is smaller, so the batch size on an RTX 4090 can be set to 3
python train.py --cfg ./configs/prw-local.yaml INPUT.BATCH_SIZE_TRAIN 2 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 13 MODEL.LOSS.USE_SOFTMAX True OUTPUT_DIR ./logs/prw
```
**Training locally on the UESTC server**
```bash
cd COAT
# The PRW dataset is smaller, so the batch size can be set to 3
python train.py --cfg ./configs/prw.yaml INPUT.BATCH_SIZE_TRAIN 3 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 13 MODEL.LOSS.USE_SOFTMAX True OUTPUT_DIR ./logs/prw
```
The dataset-specific parameters are defined in `configs/prw.yaml`. When the batch size (`INPUT.BATCH_SIZE_TRAIN`) is 3, training takes about 19GB of GPU memory, which is suitable for GPUs like the RTX6000. A larger batch size does not necessarily result in better accuracy on the PRW dataset.
Softmax loss is effective on PRW. The default weights of softmax loss at Stage 2 and Stage 3 (`SOLVER.LW_RCNN_SOFTMAX_2ND` and `SOLVER.LW_RCNN_SOFTMAX_3RD`) are 0.5, which can be found in the file `COAT/defaults.py`. If you want to run a model without Softmax loss for comparison, just set `MODEL.LOSS.USE_SOFTMAX` to False in the script.
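For example, a comparison run without the softmax loss could be launched as follows (the same PRW command as above with only `MODEL.LOSS.USE_SOFTMAX` changed; the output directory name is only an illustration):
```bash
cd COAT
python train.py --cfg ./configs/prw.yaml INPUT.BATCH_SIZE_TRAIN 3 SOLVER.BASE_LR 0.003 SOLVER.MAX_EPOCHS 13 MODEL.LOSS.USE_SOFTMAX False OUTPUT_DIR ./logs/prw_no_softmax
```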
**Testing**: The test script is similar to CUHK-SYSU. Make sure the path of the pre-trained [model](https://drive.google.com/file/d/1vEd_zzFN88RgxbRMG5-WfJZgD3vmP0Xg/view?usp=sharing) is correct.
```
python train.py --cfg ./logs/prw/config.yaml --eval --ckpt ./logs/prw/prw_COAT.pth
```
**Testing with CBGM**: Similar to CUHK-SYSU, set the flag `EVAL_USE_CBGM` to True (default is False).
```
python train.py --cfg ./logs/prw/config.yaml --eval --ckpt ./logs/prw/prw_COAT.pth EVAL_USE_CBGM True
```
## Acknowledgement
This code borrows from [SeqNet](https://github.com/serend1p1ty/SeqNet), [TransReID](https://github.com/damo-cv/TransReID), and [DSTT](https://github.com/ruiliu-ai/DSTT).
## Citation
If you use this code in your research, please cite this project as follows:
```
@inproceedings{yu2022coat,
title = {Cascade Transformers for End-to-End Person Search},
author = {Rui Yu and
Dawei Du and
Rodney LaLonde and
Daniel Davila and
Christopher Funk and
Anthony Hoogs and
Brian Clipp},
booktitle = {{IEEE} Conference on Computer Vision and Pattern Recognition},
year = {2022}
}
```
## License
This work is distributed under the OSI-approved BSD 3-Clause [License](https://github.com/Kitware/COAT/blob/master/LICENSE).


@@ -0,0 +1,402 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial
from collections import OrderedDict
from timm.models.layers import DropPath, to_2tuple, trunc_normal_
from mmengine.runner import load_checkpoint
import math
class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0., linear=False):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
self.fc1 = nn.Linear(in_features, hidden_features)
self.dwconv = DWConv(hidden_features)
self.act = act_layer()
self.fc2 = nn.Linear(hidden_features, out_features)
self.drop = nn.Dropout(drop)
self.linear = linear
if self.linear:
self.relu = nn.ReLU(inplace=True)
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
elif isinstance(m, nn.Conv2d):
fan_out = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
fan_out //= m.groups
m.weight.data.normal_(0, math.sqrt(2.0 / fan_out))
if m.bias is not None:
m.bias.data.zero_()
def forward(self, x, H, W):
x = self.fc1(x)
if self.linear:
x = self.relu(x)
x = self.dwconv(x, H, W)
x = self.act(x)
x = self.drop(x)
x = self.fc2(x)
x = self.drop(x)
return x
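# Spatial-reduction attention (SRA) as in PVTv2: queries are computed from all tokens, while keys/values
# come from a spatially reduced copy of the feature map (a strided conv when sr_ratio > 1, or 7x7 adaptive
# average pooling plus a 1x1 conv in the 'linear' variant), keeping attention affordable at high resolution.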
class Attention(nn.Module):
def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., sr_ratio=1, linear=False):
super().__init__()
assert dim % num_heads == 0, f"dim {dim} should be divided by num_heads {num_heads}."
self.dim = dim
self.num_heads = num_heads
head_dim = dim // num_heads
self.scale = qk_scale or head_dim ** -0.5
self.q = nn.Linear(dim, dim, bias=qkv_bias)
self.kv = nn.Linear(dim, dim * 2, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
self.linear = linear
self.sr_ratio = sr_ratio
if not linear:
if sr_ratio > 1:
self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
self.norm = nn.LayerNorm(dim)
else:
self.pool = nn.AdaptiveAvgPool2d(7)
self.sr = nn.Conv2d(dim, dim, kernel_size=1, stride=1)
self.norm = nn.LayerNorm(dim)
self.act = nn.GELU()
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
elif isinstance(m, nn.Conv2d):
fan_out = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
fan_out //= m.groups
m.weight.data.normal_(0, math.sqrt(2.0 / fan_out))
if m.bias is not None:
m.bias.data.zero_()
def forward(self, x, H, W):
B, N, C = x.shape
q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)
if not self.linear:
if self.sr_ratio > 1:
x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1)
x_ = self.norm(x_)
kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
else:
kv = self.kv(x).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
else:
x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
x_ = self.sr(self.pool(x_)).reshape(B, C, -1).permute(0, 2, 1)
x_ = self.norm(x_)
x_ = self.act(x_)
kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
k, v = kv[0], kv[1]
attn = (q @ k.transpose(-2, -1)) * self.scale
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
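# A pre-norm transformer block: LayerNorm -> SRA attention -> residual add, then LayerNorm -> convolutional
# Mlp -> residual add, with DropPath (stochastic depth) applied to both residual branches.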
class Block(nn.Module):
def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, sr_ratio=1, linear=False):
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = Attention(
dim,
num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
attn_drop=attn_drop, proj_drop=drop, sr_ratio=sr_ratio, linear=linear)
# NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop, linear=linear)
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
elif isinstance(m, nn.Conv2d):
fan_out = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
fan_out //= m.groups
m.weight.data.normal_(0, math.sqrt(2.0 / fan_out))
if m.bias is not None:
m.bias.data.zero_()
def forward(self, x, H, W):
x = x + self.drop_path(self.attn(self.norm1(x), H, W))
x = x + self.drop_path(self.mlp(self.norm2(x), H, W))
return x
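# Overlapping patch embedding: a strided convolution whose kernel is larger than its stride, so neighbouring
# patches overlap; it returns the token sequence together with the spatial size (H, W) needed to fold the
# tokens back into a feature map.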
class OverlapPatchEmbed(nn.Module):
""" Image to Patch Embedding
"""
def __init__(self, img_size=224, patch_size=7, stride=4, in_chans=3, embed_dim=768):
super().__init__()
img_size = to_2tuple(img_size)
patch_size = to_2tuple(patch_size)
assert max(patch_size) > stride, "Set larger patch_size than stride"
self.img_size = img_size
self.patch_size = patch_size
self.H, self.W = img_size[0] // stride, img_size[1] // stride
self.num_patches = self.H * self.W
self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=stride,
padding=(patch_size[0] // 2, patch_size[1] // 2))
self.norm = nn.LayerNorm(embed_dim)
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
elif isinstance(m, nn.Conv2d):
fan_out = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
fan_out //= m.groups
m.weight.data.normal_(0, math.sqrt(2.0 / fan_out))
if m.bias is not None:
m.bias.data.zero_()
def forward(self, x):
x = self.proj(x)
_, _, H, W = x.shape
x = x.flatten(2).transpose(1, 2)
x = self.norm(x)
return x, H, W
class PyramidVisionTransformerV2(nn.Module):
def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000, embed_dims=[64, 128, 256, 512],
num_heads=[1, 2, 4, 8], mlp_ratios=[4, 4, 4, 4], qkv_bias=False, qk_scale=None, drop_rate=0.,
attn_drop_rate=0., drop_path_rate=0., norm_layer=nn.LayerNorm, depths=[3, 4, 6, 3],
sr_ratios=[8, 4, 2, 1], num_stages=4, linear=False, pretrained=None):
super().__init__()
# self.num_classes = num_classes
self.depths = depths
self.num_stages = num_stages
self.linear = linear
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))] # stochastic depth decay rule
cur = 0
for i in range(num_stages):
patch_embed = OverlapPatchEmbed(img_size=img_size if i == 0 else img_size // (2 ** (i + 1)),
patch_size=7 if i == 0 else 3,
stride=4 if i == 0 else 2,
in_chans=in_chans if i == 0 else embed_dims[i - 1],
embed_dim=embed_dims[i])
block = nn.ModuleList([Block(
dim=embed_dims[i], num_heads=num_heads[i], mlp_ratio=mlp_ratios[i], qkv_bias=qkv_bias,
qk_scale=qk_scale,
drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[cur + j], norm_layer=norm_layer,
sr_ratio=sr_ratios[i], linear=linear)
for j in range(depths[i])])
norm = norm_layer(embed_dims[i])
cur += depths[i]
setattr(self, f"patch_embed{i + 1}", patch_embed)
setattr(self, f"block{i + 1}", block)
setattr(self, f"norm{i + 1}", norm)
# classification head
        # self.head is just a linear classifier and is not actually used; removing it only triggers a warning when loading the .pth weights
        # whether the line below is commented out makes little difference
self.head = nn.Linear(embed_dims[3], num_classes) if num_classes > 0 else nn.Identity()
self.apply(self._init_weights)
self.init_weights(pretrained)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
elif isinstance(m, nn.Conv2d):
fan_out = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
fan_out //= m.groups
m.weight.data.normal_(0, math.sqrt(2.0 / fan_out))
if m.bias is not None:
m.bias.data.zero_()
def init_weights(self, pretrained=None):
if isinstance(pretrained, str):
#logger = get_root_logger()
load_checkpoint(self, pretrained, map_location='cpu', strict=False)
def freeze_patch_emb(self):
self.patch_embed1.requires_grad = False
@torch.jit.ignore
def no_weight_decay(self):
return {'pos_embed1', 'pos_embed2', 'pos_embed3', 'pos_embed4', 'cls_token'} # has pos_embed may be better
def get_classifier(self):
return self.head
def reset_classifier(self, num_classes, global_pool=''):
self.num_classes = num_classes
self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
def forward_features(self, x):
B = x.shape[0]
outs = []
for i in range(self.num_stages):
patch_embed = getattr(self, f"patch_embed{i + 1}")
block = getattr(self, f"block{i + 1}")
norm = getattr(self, f"norm{i + 1}")
x, H, W = patch_embed(x)
for blk in block:
x = blk(x, H, W)
x = norm(x)
x = x.reshape(B, H, W, -1).permute(0, 3, 1, 2).contiguous()
outs.append(x)
return outs
def forward(self, x):
x = self.forward_features(x)
        # the classification head is not applied here; the original authors removed it for the detection setting
# x = self.head(x)
return x
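# DWConv applies a 3x3 depth-wise convolution inside the Mlp: the token sequence is reshaped back to a
# B x C x H x W map, convolved, then flattened into a sequence again, injecting local positional information.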
class DWConv(nn.Module):
def __init__(self, dim=768):
super(DWConv, self).__init__()
self.dwconv = nn.Conv2d(dim, dim, 3, 1, 1, bias=True, groups=dim)
def forward(self, x, H, W):
B, N, C = x.shape
x = x.transpose(1, 2).view(B, C, H, W)
x = self.dwconv(x)
x = x.flatten(2).transpose(1, 2)
return x
def _conv_filter(state_dict, patch_size=16):
""" convert patch embedding weight from manual patchify + linear proj to conv"""
out_dict = {}
for k, v in state_dict.items():
if 'patch_embed.proj.weight' in k:
v = v.reshape((v.shape[0], 3, patch_size, patch_size))
out_dict[k] = v
return out_dict
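# The pvt_v2_b0 ... pvt_v2_b5 subclasses below differ only in embedding dims, depths and MLP ratios;
# pvt_v2_b2_2 additionally wraps the multi-scale outputs in an OrderedDict so it can serve as a detection
# backbone, and pvt_v2_b2_li is the linear-SRA variant.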
class pvt_v2_b0(PyramidVisionTransformerV2):
def __init__(self, **kwargs):
super(pvt_v2_b0, self).__init__(
patch_size=4, embed_dims=[32, 64, 160, 256], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4],
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[2, 2, 2, 2], sr_ratios=[8, 4, 2, 1],
drop_rate=0.0, drop_path_rate=0.1, pretrained=kwargs['pretrained'])
class pvt_v2_b1(PyramidVisionTransformerV2):
def __init__(self, **kwargs):
super(pvt_v2_b1, self).__init__(
patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4],
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[2, 2, 2, 2], sr_ratios=[8, 4, 2, 1],
drop_rate=0.0, drop_path_rate=0.1, pretrained=kwargs['pretrained'])
class pvt_v2_b2(PyramidVisionTransformerV2):
def __init__(self, **kwargs):
super(pvt_v2_b2, self).__init__(
patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4],
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 4, 6, 3], sr_ratios=[8, 4, 2, 1],
drop_rate=0.0, drop_path_rate=0.1, pretrained=kwargs['pretrained'])
self.out_channels = 512
class pvt_v2_b2_2(PyramidVisionTransformerV2):
def __init__(self, **kwargs):
super(pvt_v2_b2_2, self).__init__(
patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4],
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 4, 6, 3], sr_ratios=[8, 4, 2, 1],
drop_rate=0.0, drop_path_rate=0.1, pretrained=kwargs['pretrained'])
self.out_channels = 512
def forward(self, x):
        # reuse the forward pipeline of PyramidVisionTransformerV2 and wrap the multi-scale outputs in an OrderedDict
feat = super(pvt_v2_b2_2, self).forward(x)
return OrderedDict([["feat_pvt2", feat]])
class pvt_v2_b2_li(PyramidVisionTransformerV2):
def __init__(self, **kwargs):
super(pvt_v2_b2_li, self).__init__(
patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4],
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 4, 6, 3], sr_ratios=[8, 4, 2, 1],
drop_rate=0.0, drop_path_rate=0.1, linear=True, pretrained=kwargs['pretrained'])
class pvt_v2_b3(PyramidVisionTransformerV2):
def __init__(self, **kwargs):
super(pvt_v2_b3, self).__init__(
patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4],
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 4, 18, 3], sr_ratios=[8, 4, 2, 1],
drop_rate=0.0, drop_path_rate=0.1, pretrained=kwargs['pretrained'])
class pvt_v2_b4(PyramidVisionTransformerV2):
def __init__(self, **kwargs):
super(pvt_v2_b4, self).__init__(
patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[8, 8, 4, 4],
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 8, 27, 3], sr_ratios=[8, 4, 2, 1],
drop_rate=0.0, drop_path_rate=0.1, pretrained=kwargs['pretrained'])
class pvt_v2_b5(PyramidVisionTransformerV2):
def __init__(self, **kwargs):
super(pvt_v2_b5, self).__init__(
patch_size=4, embed_dims=[64, 128, 320, 512], num_heads=[1, 2, 5, 8], mlp_ratios=[4, 4, 4, 4],
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), depths=[3, 6, 40, 3], sr_ratios=[8, 4, 2, 1],
drop_rate=0.0, drop_path_rate=0.1, pretrained=kwargs['pretrained'])


@@ -0,0 +1,53 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
from collections import OrderedDict
import torch.nn.functional as F
import torchvision
from torch import nn
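# Backbone keeps the ResNet-50 stem through layer3 (the res4 stage, 1024 channels) as the shared feature
# extractor; Res5Head applies layer4 (res5) and adaptive max pooling, returning both pooled res4 and res5 features.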
class Backbone(nn.Sequential):
def __init__(self, resnet):
super(Backbone, self).__init__(
OrderedDict(
[
["conv1", resnet.conv1],
["bn1", resnet.bn1],
["relu", resnet.relu],
["maxpool", resnet.maxpool],
["layer1", resnet.layer1], # res2
["layer2", resnet.layer2], # res3
["layer3", resnet.layer3], # res4
]
)
)
self.out_channels = 1024
def forward(self, x):
# using the forward method from nn.Sequential
feat = super(Backbone, self).forward(x)
return OrderedDict([["feat_res4", feat]])
class Res5Head(nn.Sequential):
def __init__(self, resnet):
super(Res5Head, self).__init__(OrderedDict([["layer4", resnet.layer4]])) # res5
self.out_channels = [1024, 2048]
def forward(self, x):
feat = super(Res5Head, self).forward(x)
x = F.adaptive_max_pool2d(x, 1)
feat = F.adaptive_max_pool2d(feat, 1)
return OrderedDict([["feat_res4", x], ["feat_res5", feat]])
def build_resnet(name="resnet50", pretrained=True):
resnet = torchvision.models.resnet.__dict__[name](pretrained=pretrained)
# freeze layers
resnet.conv1.weight.requires_grad_(False)
resnet.bn1.weight.requires_grad_(False)
resnet.bn1.bias.requires_grad_(False)
return Backbone(resnet), Res5Head(resnet)

Binary file not shown (image, 264 KiB)


@@ -0,0 +1,32 @@
from pvt_v2 import pvt_v2_b2
import cv2
import torchvision.transforms as transforms
# Read the test image
image = cv2.imread('E:/DeepLearning/PersonSearch/COAT/COAT/main/backbone/test.jpg')  # replace with the path to your own test image
image = cv2.resize(image, (224, 224))
# Show the original image with OpenCV
cv2.imshow("Original Image", image)
cv2.waitKey(0)
cv2.destroyAllWindows()
# Build a transform that resizes the image to 224x224 and converts it to a tensor
transform = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize((224, 224)),
transforms.ToTensor(),
])
# Apply the transform to obtain an image tensor
image_tensor = transform(image)
image_tensor = image_tensor.unsqueeze(0)  # add a batch dimension, making the shape [1, 3, 224, 224]
model = pvt_v2_b2(pretrained = "E:/DeepLearning/PersonSearch/COAT/COAT/main/backbone/pvt_v2_b2.pth")
# Run inference with the model
output = model(image_tensor)
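# For a 224x224 input, pvt_v2_b2 returns four stage outputs with shapes
# [1, 64, 56, 56], [1, 128, 28, 28], [1, 320, 14, 14] and [1, 512, 7, 7]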
print(output[0].shape)
print(output[1].shape)
print(output[2].shape)
print(output[3].shape)
# The model outputs can be processed here for further operations or analysis


@@ -0,0 +1,34 @@
from resnet import build_resnet
import cv2
import torchvision.transforms as transforms
backbone, _ = build_resnet(name="resnet50", pretrained=True)
# Read the test image
image = cv2.imread('E:/DeepLearning/PersonSearch/COAT/COAT/main/backbone/test.jpg')  # replace with the path to your own test image
image = cv2.resize(image, (224, 224))
# Show the original image with OpenCV
cv2.imshow("Original Image", image)
cv2.waitKey(0)
cv2.destroyAllWindows()
# Build a transform that resizes the image to 224x224 and converts it to a tensor
transform = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize((224, 224)),
transforms.ToTensor(),
])
# Apply the transform to obtain an image tensor
image_tensor = transform(image)
image_tensor = image_tensor.unsqueeze(0)  # add a batch dimension, making the shape [1, 3, 224, 224]
# Run inference with the backbone
output = backbone(image_tensor)
# For a 224x224 input, the output 'feat_res4' (the layer3 / res4 feature map) has shape [1, 1024, 14, 14]
#print(output)
print(output['feat_res4'].shape)
# The model outputs can be processed here for further operations or analysis


@@ -0,0 +1,15 @@
OUTPUT_DIR: "./logs/cuhk_coat"
INPUT:
DATASET: "CUHK-SYSU"
DATA_ROOT: "E:/DeepLearning/PersonSearch/COAT/datasets/CUHK-SYSU"
BATCH_SIZE_TRAIN: 3
SOLVER:
MAX_EPOCHS: 14
BASE_LR: 0.003
LW_RCNN_SOFTMAX_2ND: 0.1
LW_RCNN_SOFTMAX_3RD: 0.1
MODEL:
LOSS:
LUT_SIZE: 5532
CQ_SIZE: 5000
DISP_PERIOD: 100


@@ -0,0 +1,15 @@
OUTPUT_DIR: "./logs/cuhk_coat"
INPUT:
DATASET: "CUHK-SYSU"
DATA_ROOT: "/home/logzhan/datasets/CUHK-SYSU"
BATCH_SIZE_TRAIN: 4
SOLVER:
MAX_EPOCHS: 14
BASE_LR: 0.003
LW_RCNN_SOFTMAX_2ND: 0.1
LW_RCNN_SOFTMAX_3RD: 0.1
MODEL:
LOSS:
LUT_SIZE: 5532
CQ_SIZE: 5000
DISP_PERIOD: 100


@@ -0,0 +1,13 @@
OUTPUT_DIR: "./logs/prw_coat"
INPUT:
DATASET: "PRW"
DATA_ROOT: "E:/DeepLearning/PersonSearch/COAT/datasets/PRW"
BATCH_SIZE_TRAIN: 3
SOLVER:
MAX_EPOCHS: 13
BASE_LR: 0.003
MODEL:
LOSS:
LUT_SIZE: 482
CQ_SIZE: 500
DISP_PERIOD: 100


@@ -0,0 +1,13 @@
OUTPUT_DIR: "./logs/prw_coat"
INPUT:
DATASET: "PRW"
DATA_ROOT: "../../datasets/PRW"
BATCH_SIZE_TRAIN: 3
SOLVER:
MAX_EPOCHS: 13
BASE_LR: 0.003
MODEL:
LOSS:
LUT_SIZE: 482
CQ_SIZE: 500
DISP_PERIOD: 100


@@ -0,0 +1,5 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
from .build import build_test_loader, build_train_loader


@@ -0,0 +1,42 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import torch
from PIL import Image
class BaseDataset:
"""
Base class of person search dataset.
"""
def __init__(self, root, transforms, split):
self.root = root
self.transforms = transforms
self.split = split
assert self.split in ("train", "gallery", "query")
self.annotations = self._load_annotations()
def _load_annotations(self):
"""
For each image, load its annotation that is a dictionary with the following keys:
img_name (str): image name
img_path (str): image path
boxes (np.array[N, 4]): ground-truth boxes in (x1, y1, x2, y2) format
pids (np.array[N]): person IDs corresponding to these boxes
cam_id (int): camera ID (only for PRW dataset)
"""
raise NotImplementedError
def __getitem__(self, index):
anno = self.annotations[index]
img = Image.open(anno["img_path"]).convert("RGB")
boxes = torch.as_tensor(anno["boxes"], dtype=torch.float32)
labels = torch.as_tensor(anno["pids"], dtype=torch.int64)
target = {"img_name": anno["img_name"], "boxes": boxes, "labels": labels}
if self.transforms is not None:
img, target = self.transforms(img, target)
return img, target
def __len__(self):
return len(self.annotations)


@@ -0,0 +1,104 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import torch
from utils.transforms import build_transforms
from utils.utils import create_small_table
from .cuhk_sysu import CUHKSYSU
from .prw import PRW
def print_statistics(dataset):
"""
Print dataset statistics.
"""
num_imgs = len(dataset.annotations)
num_boxes = 0
pid_set = set()
for anno in dataset.annotations:
num_boxes += anno["boxes"].shape[0]
for pid in anno["pids"]:
pid_set.add(pid)
statistics = {
"dataset": dataset.name,
"split": dataset.split,
"num_images": num_imgs,
"num_boxes": num_boxes,
}
if dataset.name != "CUHK-SYSU" or dataset.split != "query":
pid_list = sorted(list(pid_set))
if dataset.split == "query":
num_pids, min_pid, max_pid = len(pid_list), min(pid_list), max(pid_list)
statistics.update(
{
"num_labeled_pids": num_pids,
"min_labeled_pid": int(min_pid),
"max_labeled_pid": int(max_pid),
}
)
else:
unlabeled_pid = pid_list[-1]
pid_list = pid_list[:-1] # remove unlabeled pid
num_pids, min_pid, max_pid = len(pid_list), min(pid_list), max(pid_list)
statistics.update(
{
"num_labeled_pids": num_pids,
"min_labeled_pid": int(min_pid),
"max_labeled_pid": int(max_pid),
"unlabeled_pid": int(unlabeled_pid),
}
)
print(f"=> {dataset.name}-{dataset.split} loaded:\n" + create_small_table(statistics))
def build_dataset(dataset_name, root, transforms, split, verbose=True):
if dataset_name == "CUHK-SYSU":
dataset = CUHKSYSU(root, transforms, split)
elif dataset_name == "PRW":
dataset = PRW(root, transforms, split)
else:
        raise NotImplementedError(f"Unknown dataset: {dataset_name}")
if verbose:
print_statistics(dataset)
return dataset
def collate_fn(batch):
return tuple(zip(*batch))
def build_train_loader(cfg):
transforms = build_transforms(cfg, is_train=True)
dataset = build_dataset(cfg.INPUT.DATASET, cfg.INPUT.DATA_ROOT, transforms, "train")
return torch.utils.data.DataLoader(
dataset,
batch_size=cfg.INPUT.BATCH_SIZE_TRAIN,
shuffle=True,
num_workers=cfg.INPUT.NUM_WORKERS_TRAIN,
pin_memory=True,
drop_last=True,
collate_fn=collate_fn,
)
def build_test_loader(cfg):
transforms = build_transforms(cfg, is_train=False)
gallery_set = build_dataset(cfg.INPUT.DATASET, cfg.INPUT.DATA_ROOT, transforms, "gallery")
query_set = build_dataset(cfg.INPUT.DATASET, cfg.INPUT.DATA_ROOT, transforms, "query")
gallery_loader = torch.utils.data.DataLoader(
gallery_set,
batch_size=cfg.INPUT.BATCH_SIZE_TEST,
shuffle=False,
num_workers=cfg.INPUT.NUM_WORKERS_TEST,
pin_memory=True,
collate_fn=collate_fn,
)
query_loader = torch.utils.data.DataLoader(
query_set,
batch_size=cfg.INPUT.BATCH_SIZE_TEST,
shuffle=False,
num_workers=cfg.INPUT.NUM_WORKERS_TEST,
pin_memory=True,
collate_fn=collate_fn,
)
return gallery_loader, query_loader


@@ -0,0 +1,121 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import os.path as osp
import numpy as np
from scipy.io import loadmat
from .base import BaseDataset
class CUHKSYSU(BaseDataset):
def __init__(self, root, transforms, split):
self.name = "CUHK-SYSU"
self.img_prefix = osp.join(root, "Image", "SSM")
super(CUHKSYSU, self).__init__(root, transforms, split)
def _load_queries(self):
# TestG50: a test protocol, 50 gallery images per query
protoc = loadmat(osp.join(self.root, "annotation/test/train_test/TestG50.mat"))
protoc = protoc["TestG50"].squeeze()
queries = []
for item in protoc["Query"]:
img_name = str(item["imname"][0, 0][0])
roi = item["idlocate"][0, 0][0].astype(np.int32)
roi[2:] += roi[:2]
queries.append(
{
"img_name": img_name,
"img_path": osp.join(self.img_prefix, img_name),
"boxes": roi[np.newaxis, :],
"pids": np.array([-100]), # dummy pid
}
)
return queries
def _load_split_img_names(self):
"""
Load the image names for the specific split.
"""
assert self.split in ("train", "gallery")
# gallery images
gallery_imgs = loadmat(osp.join(self.root, "annotation", "pool.mat"))
gallery_imgs = gallery_imgs["pool"].squeeze()
gallery_imgs = [str(a[0]) for a in gallery_imgs]
if self.split == "gallery":
return gallery_imgs
# all images
all_imgs = loadmat(osp.join(self.root, "annotation", "Images.mat"))
all_imgs = all_imgs["Img"].squeeze()
all_imgs = [str(a[0][0]) for a in all_imgs]
# training images = all images - gallery images
training_imgs = sorted(list(set(all_imgs) - set(gallery_imgs)))
return training_imgs
def _load_annotations(self):
if self.split == "query":
return self._load_queries()
# load all images and build a dict from image to boxes
all_imgs = loadmat(osp.join(self.root, "annotation", "Images.mat"))
all_imgs = all_imgs["Img"].squeeze()
name_to_boxes = {}
name_to_pids = {}
unlabeled_pid = 5555 # default pid for unlabeled people
for img_name, _, boxes in all_imgs:
img_name = str(img_name[0])
boxes = np.asarray([b[0] for b in boxes[0]])
boxes = boxes.reshape(boxes.shape[0], 4) # (x1, y1, w, h)
valid_index = np.where((boxes[:, 2] > 0) & (boxes[:, 3] > 0))[0]
assert valid_index.size > 0, "Warning: {} has no valid boxes.".format(img_name)
boxes = boxes[valid_index]
name_to_boxes[img_name] = boxes.astype(np.int32)
name_to_pids[img_name] = unlabeled_pid * np.ones(boxes.shape[0], dtype=np.int32)
def set_box_pid(boxes, box, pids, pid):
for i in range(boxes.shape[0]):
if np.all(boxes[i] == box):
pids[i] = pid
return
# assign a unique pid from 1 to N for each identity
if self.split == "train":
train = loadmat(osp.join(self.root, "annotation/test/train_test/Train.mat"))
train = train["Train"].squeeze()
for index, item in enumerate(train):
scenes = item[0, 0][2].squeeze()
for img_name, box, _ in scenes:
img_name = str(img_name[0])
box = box.squeeze().astype(np.int32)
set_box_pid(name_to_boxes[img_name], box, name_to_pids[img_name], index + 1)
else:
protoc = loadmat(osp.join(self.root, "annotation/test/train_test/TestG50.mat"))
protoc = protoc["TestG50"].squeeze()
for index, item in enumerate(protoc):
# query
im_name = str(item["Query"][0, 0][0][0])
box = item["Query"][0, 0][1].squeeze().astype(np.int32)
set_box_pid(name_to_boxes[im_name], box, name_to_pids[im_name], index + 1)
# gallery
gallery = item["Gallery"].squeeze()
for im_name, box, _ in gallery:
im_name = str(im_name[0])
if box.size == 0:
break
box = box.squeeze().astype(np.int32)
set_box_pid(name_to_boxes[im_name], box, name_to_pids[im_name], index + 1)
annotations = []
imgs = self._load_split_img_names()
for img_name in imgs:
boxes = name_to_boxes[img_name]
boxes[:, 2:] += boxes[:, :2] # (x1, y1, w, h) -> (x1, y1, x2, y2)
pids = name_to_pids[img_name]
annotations.append(
{
"img_name": img_name,
"img_path": osp.join(self.img_prefix, img_name),
"boxes": boxes,
"pids": pids,
}
)
return annotations


@@ -0,0 +1,97 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import os.path as osp
import re
import numpy as np
from scipy.io import loadmat
from .base import BaseDataset
class PRW(BaseDataset):
def __init__(self, root, transforms, split):
self.name = "PRW"
self.img_prefix = osp.join(root, "frames")
super(PRW, self).__init__(root, transforms, split)
def _get_cam_id(self, img_name):
match = re.search(r"c\d", img_name).group().replace("c", "")
return int(match)
def _load_queries(self):
query_info = osp.join(self.root, "query_info.txt")
with open(query_info, "rb") as f:
raw = f.readlines()
queries = []
for line in raw:
linelist = str(line, "utf-8").split(" ")
pid = int(linelist[0])
x, y, w, h = (
float(linelist[1]),
float(linelist[2]),
float(linelist[3]),
float(linelist[4]),
)
roi = np.array([x, y, x + w, y + h]).astype(np.int32)
roi = np.clip(roi, 0, None) # several coordinates are negative
img_name = linelist[5][:-2] + ".jpg"
queries.append(
{
"img_name": img_name,
"img_path": osp.join(self.img_prefix, img_name),
"boxes": roi[np.newaxis, :],
"pids": np.array([pid]),
"cam_id": self._get_cam_id(img_name),
}
)
return queries
def _load_split_img_names(self):
"""
Load the image names for the specific split.
"""
assert self.split in ("train", "gallery")
if self.split == "train":
imgs = loadmat(osp.join(self.root, "frame_train.mat"))["img_index_train"]
else:
imgs = loadmat(osp.join(self.root, "frame_test.mat"))["img_index_test"]
return [img[0][0] + ".jpg" for img in imgs]
def _load_annotations(self):
if self.split == "query":
return self._load_queries()
annotations = []
imgs = self._load_split_img_names()
for img_name in imgs:
anno_path = osp.join(self.root, "annotations", img_name)
anno = loadmat(anno_path)
box_key = "box_new"
if box_key not in anno.keys():
box_key = "anno_file"
if box_key not in anno.keys():
box_key = "anno_previous"
rois = anno[box_key][:, 1:]
ids = anno[box_key][:, 0]
rois = np.clip(rois, 0, None) # several coordinates are negative
assert len(rois) == len(ids)
rois[:, 2:] += rois[:, :2]
ids[ids == -2] = 5555 # assign pid = 5555 for unlabeled people
annotations.append(
{
"img_name": img_name,
"img_path": osp.join(self.img_prefix, img_name),
"boxes": rois.astype(np.int32),
# (training pids) 1, 2,..., 478, 480, 481, 482, 483, 932, 5555
"pids": ids.astype(np.int32),
"cam_id": self._get_cam_id(img_name),
}
)
return annotations

Code/Python/defaults.py (+219 lines)

@@ -0,0 +1,219 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
from yacs.config import CfgNode as CN
_C = CN()
# -------------------------------------------------------- #
# Input #
# -------------------------------------------------------- #
_C.INPUT = CN()
_C.INPUT.DATASET = "CUHK-SYSU"
_C.INPUT.DATA_ROOT = "E:/DeepLearning/PersonSearch/COAT/datasets/CUHK-SYSU"
# Size of the smallest side of the image
_C.INPUT.MIN_SIZE = 900
# Maximum size of the side of the image
_C.INPUT.MAX_SIZE = 1500
# Number of images per batch
_C.INPUT.BATCH_SIZE_TRAIN = 1
_C.INPUT.BATCH_SIZE_TEST = 1
# Number of data loading threads
_C.INPUT.NUM_WORKERS_TRAIN = 5
_C.INPUT.NUM_WORKERS_TEST = 1
# Image augmentation
_C.INPUT.IMAGE_CUTOUT = False
_C.INPUT.IMAGE_ERASE = False
_C.INPUT.IMAGE_MIXUP = False
# -------------------------------------------------------- #
# GRID #
# -------------------------------------------------------- #
_C.INPUT.IMAGE_GRID = False
_C.GRID = CN()
_C.GRID.ROTATE = 1
_C.GRID.OFFSET = 0
_C.GRID.RATIO = 0.5
_C.GRID.MODE = 1
_C.GRID.PROB = 0.5
# -------------------------------------------------------- #
# Solver #
# -------------------------------------------------------- #
_C.SOLVER = CN()
_C.SOLVER.MAX_EPOCHS = 13
# Learning rate settings
_C.SOLVER.BASE_LR = 0.003
# The epoch milestones to decrease the learning rate by GAMMA
_C.SOLVER.LR_DECAY_MILESTONES = [10, 14]
_C.SOLVER.GAMMA = 0.1
_C.SOLVER.WEIGHT_DECAY = 0.0005
_C.SOLVER.SGD_MOMENTUM = 0.9
# Loss weight of RPN regression
_C.SOLVER.LW_RPN_REG = 1
# Loss weight of RPN classification
_C.SOLVER.LW_RPN_CLS = 1
# Loss weight of Cascade R-CNN and Re-ID (OIM)
_C.SOLVER.LW_RCNN_REG_1ST = 10
_C.SOLVER.LW_RCNN_CLS_1ST = 1
_C.SOLVER.LW_RCNN_REG_2ND = 10
_C.SOLVER.LW_RCNN_CLS_2ND = 1
_C.SOLVER.LW_RCNN_REG_3RD = 10
_C.SOLVER.LW_RCNN_CLS_3RD = 1
_C.SOLVER.LW_RCNN_REID_2ND = 0.5
_C.SOLVER.LW_RCNN_REID_3RD = 0.5
# Loss weight of box reid, softmax loss
_C.SOLVER.LW_RCNN_SOFTMAX_2ND = 0.5
_C.SOLVER.LW_RCNN_SOFTMAX_3RD = 0.5
# Set to negative value to disable gradient clipping
_C.SOLVER.CLIP_GRADIENTS = 10.0
# -------------------------------------------------------- #
# RPN #
# -------------------------------------------------------- #
_C.MODEL = CN()
_C.MODEL.RPN = CN()
# NMS threshold used on RoIs
_C.MODEL.RPN.NMS_THRESH = 0.7
# Number of anchors per image used to train RPN
_C.MODEL.RPN.BATCH_SIZE_TRAIN = 256
# Target fraction of foreground examples per RPN minibatch
_C.MODEL.RPN.POS_FRAC_TRAIN = 0.5
# Overlap threshold for an anchor to be considered foreground (if >= POS_THRESH_TRAIN)
_C.MODEL.RPN.POS_THRESH_TRAIN = 0.7
# Overlap threshold for an anchor to be considered background (if < NEG_THRESH_TRAIN)
_C.MODEL.RPN.NEG_THRESH_TRAIN = 0.3
# Number of top scoring RPN RoIs to keep before applying NMS
_C.MODEL.RPN.PRE_NMS_TOPN_TRAIN = 12000
_C.MODEL.RPN.PRE_NMS_TOPN_TEST = 6000
# Number of top scoring RPN RoIs to keep after applying NMS
_C.MODEL.RPN.POST_NMS_TOPN_TRAIN = 2000
_C.MODEL.RPN.POST_NMS_TOPN_TEST = 300
# -------------------------------------------------------- #
# RoI head #
# -------------------------------------------------------- #
_C.MODEL.ROI_HEAD = CN()
# Whether to use bn neck (i.e. batch normalization after linear)
_C.MODEL.ROI_HEAD.BN_NECK = True
# Number of RoIs per image used to train RoI head
_C.MODEL.ROI_HEAD.BATCH_SIZE_TRAIN = 128
# Target fraction of foreground examples per RoI minibatch
_C.MODEL.ROI_HEAD.POS_FRAC_TRAIN = 0.25 # 0.5
_C.MODEL.ROI_HEAD.USE_DIFF_THRESH = True
# Overlap threshold for an RoI to be considered foreground (if >= POS_THRESH_TRAIN)
_C.MODEL.ROI_HEAD.POS_THRESH_TRAIN = 0.5
_C.MODEL.ROI_HEAD.POS_THRESH_TRAIN_2ND = 0.6
_C.MODEL.ROI_HEAD.POS_THRESH_TRAIN_3RD = 0.7
# Overlap threshold for an RoI to be considered background (if < NEG_THRESH_TRAIN)
_C.MODEL.ROI_HEAD.NEG_THRESH_TRAIN = 0.5
_C.MODEL.ROI_HEAD.NEG_THRESH_TRAIN_2ND = 0.6
_C.MODEL.ROI_HEAD.NEG_THRESH_TRAIN_3RD = 0.7
# Minimum score threshold
_C.MODEL.ROI_HEAD.SCORE_THRESH_TEST = 0.5
# NMS threshold used on boxes
_C.MODEL.ROI_HEAD.NMS_THRESH_TEST = 0.4
_C.MODEL.ROI_HEAD.NMS_THRESH_TEST_1ST = 0.4
_C.MODEL.ROI_HEAD.NMS_THRESH_TEST_2ND = 0.4
_C.MODEL.ROI_HEAD.NMS_THRESH_TEST_3RD = 0.5
# Maximum number of detected objects
_C.MODEL.ROI_HEAD.DETECTIONS_PER_IMAGE_TEST = 300
# -------------------------------------------------------- #
# Transformer head #
# -------------------------------------------------------- #
_C.MODEL.TRANSFORMER = CN()
_C.MODEL.TRANSFORMER.DIM_MODEL = 512
_C.MODEL.TRANSFORMER.ENCODER_LAYERS = 1
_C.MODEL.TRANSFORMER.N_HEAD = 8
_C.MODEL.TRANSFORMER.USE_OUTPUT_LAYER = False
_C.MODEL.TRANSFORMER.DROPOUT = 0.
_C.MODEL.TRANSFORMER.USE_LOCAL_SHORTCUT = True
_C.MODEL.TRANSFORMER.USE_GLOBAL_SHORTCUT = True
_C.MODEL.TRANSFORMER.USE_DIFF_SCALE = True
_C.MODEL.TRANSFORMER.NAMES_1ST = ['scale1','scale2']
_C.MODEL.TRANSFORMER.NAMES_2ND = ['scale1','scale2']
_C.MODEL.TRANSFORMER.NAMES_3RD = ['scale1','scale2']
_C.MODEL.TRANSFORMER.KERNEL_SIZE_1ST = [(1,1),(3,3)]
_C.MODEL.TRANSFORMER.KERNEL_SIZE_2ND = [(1,1),(3,3)]
_C.MODEL.TRANSFORMER.KERNEL_SIZE_3RD = [(1,1),(3,3)]
_C.MODEL.TRANSFORMER.USE_MASK_1ST = False
_C.MODEL.TRANSFORMER.USE_MASK_2ND = True
_C.MODEL.TRANSFORMER.USE_MASK_3RD = True
_C.MODEL.TRANSFORMER.USE_PATCH2VEC = True
####
_C.MODEL.USE_FEATURE_MASK = True
_C.MODEL.FEATURE_AUG_TYPE = 'exchange_token' # 'exchange_token', 'jigsaw_token', 'cutout_patch', 'erase_patch', 'mixup_patch', 'jigsaw_patch'
_C.MODEL.FEATURE_MASK_SIZE = 4
_C.MODEL.MASK_SHAPE = 'stripe' # 'square', 'random'
_C.MODEL.MASK_SIZE = 1
_C.MODEL.MASK_MODE = 'random_direction' # 'horizontal', 'vertical' for stripe; 'random_size' for square
_C.MODEL.MASK_PERCENT = 0.1
####
_C.MODEL.EMBEDDING_DIM = 256
# -------------------------------------------------------- #
# Loss #
# -------------------------------------------------------- #
_C.MODEL.LOSS = CN()
# Size of the lookup table in OIM
_C.MODEL.LOSS.LUT_SIZE = 5532
# Size of the circular queue in OIM
_C.MODEL.LOSS.CQ_SIZE = 5000
_C.MODEL.LOSS.OIM_MOMENTUM = 0.5
_C.MODEL.LOSS.OIM_SCALAR = 30.0
_C.MODEL.LOSS.USE_SOFTMAX = True
# -------------------------------------------------------- #
# Evaluation #
# -------------------------------------------------------- #
# The period to evaluate the model during training
_C.EVAL_PERIOD = 1
# Evaluation with GT boxes to verify the upper bound of person search performance
_C.EVAL_USE_GT = False
# Fast evaluation with cached features
_C.EVAL_USE_CACHE = False
# Evaluation with Context Bipartite Graph Matching (CBGM) algorithm
_C.EVAL_USE_CBGM = False
# Gallery size in evaluation, only for CUHK-SYSU
_C.EVAL_GALLERY_SIZE = 100
# Feature used for evaluation
_C.EVAL_FEATURE = 'concat' # 'stage2', 'stage3'
# -------------------------------------------------------- #
# Miscs #
# -------------------------------------------------------- #
# Save a checkpoint after every this number of epochs
_C.CKPT_PERIOD = 1
# The period (in terms of iterations) to display training losses
_C.DISP_PERIOD = 10
# Whether to use tensorboard for visualization
_C.TF_BOARD = True
# The device loading the model
_C.DEVICE = "cuda:0"
# Set seed to negative to fully randomize everything
_C.SEED = 1
# Directory where output files are written
_C.OUTPUT_DIR = "./output"
def get_default_cfg():
"""
Get a copy of the default config.
"""
return _C.clone()

Binary file not shown (image, 1.4 MiB)

Code/Python/engine.py (+179 lines)

@@ -0,0 +1,179 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import math
import sys
from copy import deepcopy
import torch
from torch.nn.utils import clip_grad_norm_
from tqdm import tqdm
from eval_func import eval_detection, eval_search_cuhk, eval_search_prw
from utils.utils import MetricLogger, SmoothedValue, mkdir, reduce_dict, warmup_lr_scheduler
from utils.transforms import mixup_data
def to_device(images, targets, device):
images = [image.to(device) for image in images]
for t in targets:
t["boxes"] = t["boxes"].to(device)
t["labels"] = t["labels"].to(device)
return images, targets
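# train_one_epoch: forwards each mini-batch through the cascade model, optionally adds the stage-2/3
# softmax (ID classification) losses weighted by SOLVER.LW_RCNN_SOFTMAX_2ND/3RD, clips gradients,
# steps the optimizer, and logs the reduced losses (optionally to TensorBoard).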
def train_one_epoch(cfg, model, optimizer, data_loader, device, epoch, tfboard, softmax_criterion_s2, softmax_criterion_s3):
model.train()
metric_logger = MetricLogger(delimiter=" ")
metric_logger.add_meter("lr", SmoothedValue(window_size=1, fmt="{value:.6f}"))
header = "Epoch: [{}]".format(epoch)
# warmup learning rate in the first epoch
if epoch == 0:
warmup_factor = 1.0 / 1000
warmup_iters = len(data_loader) - 1
warmup_scheduler = warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)
for i, (images, targets) in enumerate(
metric_logger.log_every(data_loader, cfg.DISP_PERIOD, header)
):
images, targets = to_device(images, targets, device)
# if using image based data augmentation
if cfg.INPUT.IMAGE_MIXUP:
images = mixup_data(images, alpha=0.8)
loss_dict, feats_reid_2nd, targets_reid_2nd, feats_reid_3rd, targets_reid_3rd = model(images, targets)
if cfg.MODEL.LOSS.USE_SOFTMAX:
softmax_loss_2nd = cfg.SOLVER.LW_RCNN_SOFTMAX_2ND * softmax_criterion_s2(feats_reid_2nd, targets_reid_2nd)
softmax_loss_3rd = cfg.SOLVER.LW_RCNN_SOFTMAX_3RD * softmax_criterion_s3(feats_reid_3rd, targets_reid_3rd)
loss_dict.update(loss_box_softmax_2nd=softmax_loss_2nd)
loss_dict.update(loss_box_softmax_3rd=softmax_loss_3rd)
losses = sum(loss for loss in loss_dict.values())
# reduce losses over all GPUs for logging purposes
loss_dict_reduced = reduce_dict(loss_dict)
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
loss_value = losses_reduced.item()
if not math.isfinite(loss_value):
print(f"Loss is {loss_value}, stopping training")
print(loss_dict_reduced)
sys.exit(1)
optimizer.zero_grad()
losses.backward()
if cfg.SOLVER.CLIP_GRADIENTS > 0:
clip_grad_norm_(model.parameters(), cfg.SOLVER.CLIP_GRADIENTS)
optimizer.step()
if epoch == 0:
warmup_scheduler.step()
metric_logger.update(loss=loss_value, **loss_dict_reduced)
metric_logger.update(lr=optimizer.param_groups[0]["lr"])
if tfboard:
iter = epoch * len(data_loader) + i
for k, v in loss_dict_reduced.items():
tfboard.add_scalars("train", {k: v}, iter)
@torch.no_grad()
def evaluate_performance(
model, gallery_loader, query_loader, device, use_gt=False, use_cache=False, use_cbgm=False, gallery_size=100):
"""
Args:
use_gt (bool, optional): Whether to use GT as detection results to verify the upper
bound of person search performance. Defaults to False.
use_cache (bool, optional): Whether to use the cached features. Defaults to False.
use_cbgm (bool, optional): Whether to use Context Bipartite Graph Matching algorithm.
Defaults to False.
"""
model.eval()
if use_cache:
eval_cache = torch.load("data/eval_cache/eval_cache.pth")
gallery_dets = eval_cache["gallery_dets"]
gallery_feats = eval_cache["gallery_feats"]
query_dets = eval_cache["query_dets"]
query_feats = eval_cache["query_feats"]
query_box_feats = eval_cache["query_box_feats"]
else:
gallery_dets, gallery_feats = [], []
for images, targets in tqdm(gallery_loader, ncols=0):
images, targets = to_device(images, targets, device)
if not use_gt:
outputs = model(images)
else:
boxes = targets[0]["boxes"]
n_boxes = boxes.size(0)
embeddings = model(images, targets)
outputs = [
{
"boxes": boxes,
"embeddings": torch.cat(embeddings),
"labels": torch.ones(n_boxes).to(device),
"scores": torch.ones(n_boxes).to(device),
}
]
for output in outputs:
box_w_scores = torch.cat([output["boxes"], output["scores"].unsqueeze(1)], dim=1)
gallery_dets.append(box_w_scores.cpu().numpy())
gallery_feats.append(output["embeddings"].cpu().numpy())
# regarding query image as gallery to detect all people
# i.e. query person + surrounding people (context information)
query_dets, query_feats = [], []
for images, targets in tqdm(query_loader, ncols=0):
images, targets = to_device(images, targets, device)
# targets will be modified in the model, so deepcopy it
outputs = model(images, deepcopy(targets), query_img_as_gallery=True)
# consistency check
gt_box = targets[0]["boxes"].squeeze()
assert (
gt_box - outputs[0]["boxes"][0]
).sum() <= 0.001, "GT box must be the first one in the detected boxes of query image"
for output in outputs:
box_w_scores = torch.cat([output["boxes"], output["scores"].unsqueeze(1)], dim=1)
query_dets.append(box_w_scores.cpu().numpy())
query_feats.append(output["embeddings"].cpu().numpy())
# extract the features of query boxes
query_box_feats = []
for images, targets in tqdm(query_loader, ncols=0):
images, targets = to_device(images, targets, device)
embeddings = model(images, targets)
assert len(embeddings) == 1, "batch size in test phase should be 1"
query_box_feats.append(embeddings[0].cpu().numpy())
mkdir("data/eval_cache")
save_dict = {
"gallery_dets": gallery_dets,
"gallery_feats": gallery_feats,
"query_dets": query_dets,
"query_feats": query_feats,
"query_box_feats": query_box_feats,
}
torch.save(save_dict, "data/eval_cache/eval_cache.pth")
eval_detection(gallery_loader.dataset, gallery_dets, det_thresh=0.01)
eval_search_func = (
eval_search_cuhk if gallery_loader.dataset.name == "CUHK-SYSU" else eval_search_prw
)
eval_search_func(
gallery_loader.dataset,
query_loader.dataset,
gallery_dets,
gallery_feats,
query_box_feats,
query_dets,
query_feats,
cbgm=use_cbgm,
gallery_size=gallery_size,
)

Code/Python/eval_func.py (+488 lines)

@@ -0,0 +1,488 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import os.path as osp
import numpy as np
from scipy.io import loadmat
from sklearn.metrics import average_precision_score
from utils.km import run_kuhn_munkres
from utils.utils import write_json
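# Evaluation utilities: eval_detection reports detection recall and AP against the ground-truth boxes;
# eval_search_cuhk (and eval_search_prw, defined further down in this file) computes person-search mAP and
# top-k accuracy following each dataset's protocol, optionally refining similarities with CBGM.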
def _compute_iou(a, b):
x1 = max(a[0], b[0])
y1 = max(a[1], b[1])
x2 = min(a[2], b[2])
y2 = min(a[3], b[3])
inter = max(0, x2 - x1) * max(0, y2 - y1)
union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
return inter * 1.0 / union
def eval_detection(
gallery_dataset, gallery_dets, det_thresh=0.5, iou_thresh=0.5, labeled_only=False
):
"""
gallery_det (list of ndarray): n_det x [x1, y1, x2, y2, score] per image
det_thresh (float): filter out gallery detections whose scores below this
iou_thresh (float): treat as true positive if IoU is above this threshold
labeled_only (bool): filter out unlabeled background people
"""
assert len(gallery_dataset) == len(gallery_dets)
annos = gallery_dataset.annotations
y_true, y_score = [], []
count_gt, count_tp = 0, 0
for anno, det in zip(annos, gallery_dets):
gt_boxes = anno["boxes"]
if labeled_only:
# exclude the unlabeled people (pid == 5555)
inds = np.where(anno["pids"].ravel() != 5555)[0]
if len(inds) == 0:
continue
gt_boxes = gt_boxes[inds]
num_gt = gt_boxes.shape[0]
if det != []:
det = np.asarray(det)
inds = np.where(det[:, 4].ravel() >= det_thresh)[0]
det = det[inds]
num_det = det.shape[0]
else:
num_det = 0
if num_det == 0:
count_gt += num_gt
continue
ious = np.zeros((num_gt, num_det), dtype=np.float32)
for i in range(num_gt):
for j in range(num_det):
ious[i, j] = _compute_iou(gt_boxes[i], det[j, :4])
tfmat = ious >= iou_thresh
# for each det, keep only the largest iou of all the gt
for j in range(num_det):
largest_ind = np.argmax(ious[:, j])
for i in range(num_gt):
if i != largest_ind:
tfmat[i, j] = False
# for each gt, keep only the largest iou of all the det
for i in range(num_gt):
largest_ind = np.argmax(ious[i, :])
for j in range(num_det):
if j != largest_ind:
tfmat[i, j] = False
for j in range(num_det):
y_score.append(det[j, -1])
y_true.append(tfmat[:, j].any())
count_tp += tfmat.sum()
count_gt += num_gt
det_rate = count_tp * 1.0 / count_gt
ap = average_precision_score(y_true, y_score) * det_rate
print("{} detection:".format("labeled only" if labeled_only else "all"))
print(" recall = {:.2%}".format(det_rate))
if not labeled_only:
print(" ap = {:.2%}".format(ap))
return det_rate, ap
def eval_search_cuhk(
gallery_dataset,
query_dataset,
gallery_dets,
gallery_feats,
query_box_feats,
query_dets,
query_feats,
k1=10,
k2=3,
det_thresh=0.5,
cbgm=False,
gallery_size=100,
):
"""
gallery_dataset/query_dataset: an instance of BaseDataset
    gallery_det (list of ndarray): n_det x [x1, y1, x2, y2, score] per image
gallery_feat (list of ndarray): n_det x D features per image
query_feat (list of ndarray): D dimensional features per query image
det_thresh (float): filter out gallery detections whose scores below this
gallery_size (int): gallery size [-1, 50, 100, 500, 1000, 2000, 4000]
-1 for using full set
"""
assert len(gallery_dataset) == len(gallery_dets)
assert len(gallery_dataset) == len(gallery_feats)
assert len(query_dataset) == len(query_box_feats)
use_full_set = gallery_size == -1
fname = "TestG{}".format(gallery_size if not use_full_set else 50)
protoc = loadmat(osp.join(gallery_dataset.root, "annotation/test/train_test", fname + ".mat"))
protoc = protoc[fname].squeeze()
# mapping from gallery image to (det, feat)
annos = gallery_dataset.annotations
name_to_det_feat = {}
for anno, det, feat in zip(annos, gallery_dets, gallery_feats):
name = anno["img_name"]
if det != []:
scores = det[:, 4].ravel()
inds = np.where(scores >= det_thresh)[0]
if len(inds) > 0:
name_to_det_feat[name] = (det[inds], feat[inds])
aps = []
accs = []
topk = [1, 5, 10]
ret = {"image_root": gallery_dataset.img_prefix, "results": []}
for i in range(len(query_dataset)):
y_true, y_score = [], []
imgs, rois = [], []
count_gt, count_tp = 0, 0
# get L2-normalized feature vector
feat_q = query_box_feats[i].ravel()
# ignore the query image
query_imname = str(protoc["Query"][i]["imname"][0, 0][0])
query_roi = protoc["Query"][i]["idlocate"][0, 0][0].astype(np.int32)
query_roi[2:] += query_roi[:2]
query_gt = []
tested = set([query_imname])
name2sim = {}
name2gt = {}
sims = []
imgs_cbgm = []
# 1. Go through the gallery samples defined by the protocol
for item in protoc["Gallery"][i].squeeze():
gallery_imname = str(item[0][0])
# some contain the query (gt not empty), some not
gt = item[1][0].astype(np.int32)
count_gt += gt.size > 0
# compute distance between query and gallery dets
if gallery_imname not in name_to_det_feat:
continue
det, feat_g = name_to_det_feat[gallery_imname]
# no detection in this gallery, skip it
if det.shape[0] == 0:
continue
# get L2-normalized feature matrix NxD
assert feat_g.size == np.prod(feat_g.shape[:2])
feat_g = feat_g.reshape(feat_g.shape[:2])
# compute cosine similarities
sim = feat_g.dot(feat_q).ravel()
if gallery_imname in name2sim:
continue
name2sim[gallery_imname] = sim
name2gt[gallery_imname] = gt
sims.extend(list(sim))
imgs_cbgm.extend([gallery_imname] * len(sim))
# 2. Go through the remaining gallery images if using full set
if use_full_set:
for gallery_imname in gallery_dataset.imgs:
if gallery_imname in tested:
continue
if gallery_imname not in name_to_det_feat:
continue
det, feat_g = name_to_det_feat[gallery_imname]
# get L2-normalized feature matrix NxD
assert feat_g.size == np.prod(feat_g.shape[:2])
feat_g = feat_g.reshape(feat_g.shape[:2])
# compute cosine similarities
sim = feat_g.dot(feat_q).ravel()
# guaranteed no target query in these gallery images
label = np.zeros(len(sim), dtype=np.int32)
y_true.extend(list(label))
y_score.extend(list(sim))
imgs.extend([gallery_imname] * len(sim))
rois.extend(list(det))
if cbgm:
# -------- Context Bipartite Graph Matching (CBGM) ------- #
sims = np.array(sims)
imgs_cbgm = np.array(imgs_cbgm)
# only process the top-k1 gallery images for efficiency
inds = np.argsort(sims)[-k1:]
imgs_cbgm = set(imgs_cbgm[inds])
for img in imgs_cbgm:
sim = name2sim[img]
det, feat_g = name_to_det_feat[img]
# only regard the people with top-k2 detection confidence
# in the query image as context information
qboxes = query_dets[i][:k2]
qfeats = query_feats[i][:k2]
assert (
query_roi - qboxes[0][:4]
).sum() <= 0.001, "query_roi must be the first one in qboxes"
# build the bipartite graph and run Kuhn-Munkres (K-M) algorithm
# to find the best match
graph = []
for indx_i, pfeat in enumerate(qfeats):
for indx_j, gfeat in enumerate(feat_g):
graph.append((indx_i, indx_j, (pfeat * gfeat).sum()))
km_res, max_val = run_kuhn_munkres(graph)
# revise the similarity between query person and its matching
for indx_i, indx_j, _ in km_res:
# 0 denotes the query roi
if indx_i == 0:
sim[indx_j] = max_val
break
for gallery_imname, sim in name2sim.items():
gt = name2gt[gallery_imname]
det, feat_g = name_to_det_feat[gallery_imname]
# assign label for each det
label = np.zeros(len(sim), dtype=np.int32)
if gt.size > 0:
w, h = gt[2], gt[3]
gt[2:] += gt[:2]
query_gt.append({"img": str(gallery_imname), "roi": list(map(float, list(gt)))})
iou_thresh = min(0.5, (w * h * 1.0) / ((w + 10) * (h + 10)))
inds = np.argsort(sim)[::-1]
sim = sim[inds]
det = det[inds]
# only set the first matched det as true positive
for j, roi in enumerate(det[:, :4]):
if _compute_iou(roi, gt) >= iou_thresh:
label[j] = 1
count_tp += 1
break
y_true.extend(list(label))
y_score.extend(list(sim))
imgs.extend([gallery_imname] * len(sim))
rois.extend(list(det))
tested.add(gallery_imname)
# 3. Compute AP for this query (need to scale by recall rate)
y_score = np.asarray(y_score)
y_true = np.asarray(y_true)
assert count_tp <= count_gt
recall_rate = count_tp * 1.0 / count_gt
ap = 0 if count_tp == 0 else average_precision_score(y_true, y_score) * recall_rate
aps.append(ap)
inds = np.argsort(y_score)[::-1]
y_score = y_score[inds]
y_true = y_true[inds]
accs.append([min(1, sum(y_true[:k])) for k in topk])
# 4. Save result for JSON dump
new_entry = {
"query_img": str(query_imname),
"query_roi": list(map(float, list(query_roi))),
"query_gt": query_gt,
"gallery": [],
}
# only record wrong results
if int(y_true[0]):
continue
# only save top-10 predictions
for k in range(10):
new_entry["gallery"].append(
{
"img": str(imgs[inds[k]]),
"roi": list(map(float, list(rois[inds[k]]))),
"score": float(y_score[k]),
"correct": int(y_true[k]),
}
)
ret["results"].append(new_entry)
print("search ranking:")
print(" mAP = {:.2%}".format(np.mean(aps)))
accs = np.mean(accs, axis=0)
for i, k in enumerate(topk):
print(" top-{:2d} = {:.2%}".format(k, accs[i]))
write_json(ret, "vis/results.json")
ret["mAP"] = np.mean(aps)
ret["accs"] = accs
return ret
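# --------------------------------------------------------------------------- #
# CBGM sketch (illustrative toy values): the graph handed to run_kuhn_munkres
# is a list of (query_person_idx, gallery_det_idx, similarity) triplets, where
# index 0 on the query side is the target person and the remaining indices are
# context people detected in the same query image. K-M returns a one-to-one
# matching as triplets of the same form; the entry whose query index is 0 is
# used above to overwrite the plain cosine similarity of its matched detection.
#   graph = [
#       (0, 0, 0.9), (0, 1, 0.2),   # target person vs. two gallery detections
#       (1, 0, 0.3), (1, 1, 0.8),   # one context person from the query image
#   ]
#   km_res, max_val = run_kuhn_munkres(graph)
#   # km_res would pair (0, 0) and (1, 1) here; max_val aggregates the matching
#   # score and is the value written back into sim[indx_j] above.
# --------------------------------------------------------------------------- #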
def eval_search_prw(
gallery_dataset,
query_dataset,
gallery_dets,
gallery_feats,
query_box_feats,
query_dets,
query_feats,
k1=30,
k2=4,
det_thresh=0.5,
cbgm=False,
gallery_size=None, # not used in PRW
ignore_cam_id=True,
):
"""
gallery_det (list of ndarray): n_det x [x1, y1, x2, y2, score] per image
gallery_feat (list of ndarray): n_det x D features per image
query_feat (list of ndarray): D dimensional features per query image
det_thresh (float): filter out gallery detections whose scores below this
gallery_size (int): -1 for using full set
ignore_cam_id (bool): Set to True according to the CUHK-SYSU protocol,
although it is common practice to focus on cross-camera matches only.
"""
assert len(gallery_dataset) == len(gallery_dets)
assert len(gallery_dataset) == len(gallery_feats)
assert len(query_dataset) == len(query_box_feats)
annos = gallery_dataset.annotations
name_to_det_feat = {}
for anno, det, feat in zip(annos, gallery_dets, gallery_feats):
name = anno["img_name"]
scores = det[:, 4].ravel()
inds = np.where(scores >= det_thresh)[0]
if len(inds) > 0:
name_to_det_feat[name] = (det[inds], feat[inds])
aps = []
accs = []
topk = [1, 5, 10]
ret = {"image_root": gallery_dataset.img_prefix, "results": []}
for i in range(len(query_dataset)):
y_true, y_score = [], []
imgs, rois = [], []
count_gt, count_tp = 0, 0
feat_p = query_box_feats[i].ravel()
query_imname = query_dataset.annotations[i]["img_name"]
query_roi = query_dataset.annotations[i]["boxes"]
query_pid = query_dataset.annotations[i]["pids"]
query_cam = query_dataset.annotations[i]["cam_id"]
# Find all occurrences of this query
gallery_imgs = []
for x in annos:
if query_pid in x["pids"] and x["img_name"] != query_imname:
gallery_imgs.append(x)
query_gts = {}
for item in gallery_imgs:
query_gts[item["img_name"]] = item["boxes"][item["pids"] == query_pid]
# Construct gallery set for this query
if ignore_cam_id:
gallery_imgs = []
for x in annos:
if x["img_name"] != query_imname:
gallery_imgs.append(x)
else:
gallery_imgs = []
for x in annos:
if x["img_name"] != query_imname and x["cam_id"] != query_cam:
gallery_imgs.append(x)
name2sim = {}
sims = []
imgs_cbgm = []
# 1. Go through all gallery samples
for item in gallery_imgs:
gallery_imname = item["img_name"]
# some contain the query (gt not empty), some not
count_gt += gallery_imname in query_gts
# compute distance between query and gallery dets
if gallery_imname not in name_to_det_feat:
continue
det, feat_g = name_to_det_feat[gallery_imname]
# get L2-normalized feature matrix NxD
assert feat_g.size == np.prod(feat_g.shape[:2])
feat_g = feat_g.reshape(feat_g.shape[:2])
# compute cosine similarities
sim = feat_g.dot(feat_p).ravel()
if gallery_imname in name2sim:
continue
name2sim[gallery_imname] = sim
sims.extend(list(sim))
imgs_cbgm.extend([gallery_imname] * len(sim))
if cbgm:
sims = np.array(sims)
imgs_cbgm = np.array(imgs_cbgm)
inds = np.argsort(sims)[-k1:]
imgs_cbgm = set(imgs_cbgm[inds])
for img in imgs_cbgm:
sim = name2sim[img]
det, feat_g = name_to_det_feat[img]
qboxes = query_dets[i][:k2]
qfeats = query_feats[i][:k2]
# assert (
# query_roi - qboxes[0][:4]
# ).sum() <= 0.001, "query_roi must be the first one in pboxes"
graph = []
for indx_i, pfeat in enumerate(qfeats):
for indx_j, gfeat in enumerate(feat_g):
graph.append((indx_i, indx_j, (pfeat * gfeat).sum()))
km_res, max_val = run_kuhn_munkres(graph)
for indx_i, indx_j, _ in km_res:
if indx_i == 0:
sim[indx_j] = max_val
break
for gallery_imname, sim in name2sim.items():
det, feat_g = name_to_det_feat[gallery_imname]
# assign label for each det
label = np.zeros(len(sim), dtype=np.int32)
if gallery_imname in query_gts:
gt = query_gts[gallery_imname].ravel()
w, h = gt[2] - gt[0], gt[3] - gt[1]
iou_thresh = min(0.5, (w * h * 1.0) / ((w + 10) * (h + 10)))
inds = np.argsort(sim)[::-1]
sim = sim[inds]
det = det[inds]
# only set the first matched det as true positive
for j, roi in enumerate(det[:, :4]):
if _compute_iou(roi, gt) >= iou_thresh:
label[j] = 1
count_tp += 1
break
y_true.extend(list(label))
y_score.extend(list(sim))
imgs.extend([gallery_imname] * len(sim))
rois.extend(list(det))
# 2. Compute AP for this query (need to scale by recall rate)
y_score = np.asarray(y_score)
y_true = np.asarray(y_true)
assert count_tp <= count_gt
recall_rate = count_tp * 1.0 / count_gt
ap = 0 if count_tp == 0 else average_precision_score(y_true, y_score) * recall_rate
aps.append(ap)
inds = np.argsort(y_score)[::-1]
y_score = y_score[inds]
y_true = y_true[inds]
accs.append([min(1, sum(y_true[:k])) for k in topk])
# 3. Save result for JSON dump
new_entry = {
"query_img": str(query_imname),
"query_roi": list(map(float, list(query_roi.squeeze()))),
"query_gt": query_gts,
"gallery": [],
}
# only save top-10 predictions
for k in range(10):
new_entry["gallery"].append(
{
"img": str(imgs[inds[k]]),
"roi": list(map(float, list(rois[inds[k]]))),
"score": float(y_score[k]),
"correct": int(y_true[k]),
}
)
ret["results"].append(new_entry)
print("search ranking:")
mAP = np.mean(aps)
print(" mAP = {:.2%}".format(mAP))
accs = np.mean(accs, axis=0)
for i, k in enumerate(topk):
print(" top-{:2d} = {:.2%}".format(k, accs[i]))
# write_json(ret, "vis/results.json")
ret["mAP"] = np.mean(aps)
ret["accs"] = accs
return ret

76
Code/Python/loss/oim.py Normal file
View File

@ -0,0 +1,76 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import torch
import torch.nn.functional as F
from torch import autograd, nn
class OIM(autograd.Function):
@staticmethod
def forward(ctx, inputs, targets, lut, cq, header, momentum):
ctx.save_for_backward(inputs, targets, lut, cq, header, momentum)
outputs_labeled = inputs.mm(lut.t())
outputs_unlabeled = inputs.mm(cq.t())
return torch.cat([outputs_labeled, outputs_unlabeled], dim=1)
@staticmethod
def backward(ctx, grad_outputs):
inputs, targets, lut, cq, header, momentum = ctx.saved_tensors
grad_inputs = None
if ctx.needs_input_grad[0]:
grad_inputs = grad_outputs.mm(torch.cat([lut, cq], dim=0))
if grad_inputs.dtype == torch.float16:
grad_inputs = grad_inputs.to(torch.float32)
for x, y in zip(inputs, targets):
if y < len(lut):
lut[y] = momentum * lut[y] + (1.0 - momentum) * x
lut[y] /= lut[y].norm()
else:
cq[header] = x
header = (header + 1) % cq.size(0)
return grad_inputs, None, None, None, None, None
def oim(inputs, targets, lut, cq, header, momentum=0.5):
return OIM.apply(inputs, targets, lut, cq, torch.tensor(header), torch.tensor(momentum))
class OIMLoss(nn.Module):
def __init__(self, num_features, num_pids, num_cq_size, oim_momentum, oim_scalar):
super(OIMLoss, self).__init__()
self.num_features = num_features
self.num_pids = num_pids
self.num_unlabeled = num_cq_size
self.momentum = oim_momentum
self.oim_scalar = oim_scalar
self.register_buffer("lut", torch.zeros(self.num_pids, self.num_features))
self.register_buffer("cq", torch.zeros(self.num_unlabeled, self.num_features))
self.header_cq = 0
def forward(self, inputs, roi_label):
# merge into one batch, background label = 0
targets = torch.cat(roi_label)
label = targets - 1 # background label = -1
inds = label >= 0
label = label[inds]
inputs = inputs[inds.unsqueeze(1).expand_as(inputs)].view(-1, self.num_features)
projected = oim(inputs, label, self.lut, self.cq, self.header_cq, momentum=self.momentum)
# projected - Tensor [M, lut+cq], e.g., [M, 482+500]=[M, 982]
projected *= self.oim_scalar
self.header_cq = (
self.header_cq + (label >= self.num_pids).long().sum().item()
) % self.num_unlabeled
loss_oim = F.cross_entropy(projected, label, ignore_index=5554)
return loss_oim, inputs, label
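# --------------------------------------------------------------------------- #
# Minimal usage sketch (illustrative; the sizes follow the CUHK-SYSU config in
# this commit: EMBEDDING_DIM=256, LUT_SIZE=5532, CQ_SIZE=5000):
#   criterion = OIMLoss(num_features=256, num_pids=5532, num_cq_size=5000,
#                       oim_momentum=0.5, oim_scalar=30.0)
#   feats = torch.randn(8, 256)                      # proposal embeddings
#   roi_label = [torch.tensor([1, 2, 0, 5555]),      # 0 = background
#                torch.tensor([3, 0, 6, 5555])]      # 5555 = unlabeled person
#   loss, kept_feats, kept_labels = criterion(feats, roi_label)
# --------------------------------------------------------------------------- #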

View File

@ -0,0 +1,62 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import torch
from torch import nn
import torch.nn.functional as F
class SoftmaxLoss(nn.Module):
def __init__(self, cfg):
super(SoftmaxLoss, self).__init__()
self.feat_dim = cfg.MODEL.EMBEDDING_DIM
self.num_classes = cfg.MODEL.LOSS.LUT_SIZE
self.bottleneck = nn.BatchNorm1d(self.feat_dim)
self.bottleneck.bias.requires_grad_(False) # no shift
self.classifier = nn.Linear(self.feat_dim, self.num_classes, bias=False)
self.bottleneck.apply(weights_init_kaiming)
self.classifier.apply(weights_init_classifier)
def forward(self, inputs, labels):
"""
Args:
inputs: feature matrix with shape (batch_size, feat_dim).
labels: ground truth labels with shape (batch_size,).
"""
assert inputs.size(0) == labels.size(0), "features.size(0) is not equal to labels.size(0)"
target = labels.clone()
target[target >= self.num_classes] = 5554
feat = self.bottleneck(inputs)
score = self.classifier(feat)
loss = F.cross_entropy(score, target, ignore_index=5554)
return loss
def weights_init_kaiming(m):
classname = m.__class__.__name__
if classname.find('Linear') != -1:
nn.init.kaiming_normal_(m.weight, a=0, mode='fan_out')
nn.init.constant_(m.bias, 0.0)
elif classname.find('Conv') != -1:
nn.init.kaiming_normal_(m.weight, a=0, mode='fan_in')
if m.bias is not None:
nn.init.constant_(m.bias, 0.0)
elif classname.find('BatchNorm') != -1:
if m.affine:
nn.init.constant_(m.weight, 1.0)
nn.init.constant_(m.bias, 0.0)
def weights_init_classifier(m):
classname = m.__class__.__name__
if classname.find('Linear') != -1:
nn.init.normal_(m.weight, std=0.001)
if m.bias is not None:
nn.init.constant_(m.bias, 0.0)
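# --------------------------------------------------------------------------- #
# Minimal usage sketch (illustrative; with the CUHK-SYSU config, feat_dim=256
# and num_classes=LUT_SIZE=5532, and cfg is a populated config object):
#   criterion = SoftmaxLoss(cfg)
#   feats = torch.randn(8, 256)                      # stage-2/3 re-id features
#   labels = torch.tensor([0, 3, 17, 5531, 5554, 2, 9, 42])
#   loss = criterion(feats, labels)  # ids >= num_classes are remapped to 5554
#                                    # and ignored by the cross-entropy
# --------------------------------------------------------------------------- #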

779
Code/Python/models/coat.py Normal file
View File

@ -0,0 +1,779 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
from copy import deepcopy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import init
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.roi_heads import RoIHeads
from torchvision.models.detection.rpn import AnchorGenerator, RegionProposalNetwork, RPNHead
from torchvision.models.detection.transform import GeneralizedRCNNTransform
from torchvision.ops import MultiScaleRoIAlign
from torchvision.ops import boxes as box_ops
from torchvision.models.detection import _utils as det_utils
from loss.oim import OIMLoss
from models.resnet import build_resnet,build_network
from models.transformer import TransformerHead
class COAT(nn.Module):
def __init__(self, cfg):
super(COAT, self).__init__()
#backbone = build_network(name="pvtv2", pretrained=True)
backbone = build_network(name="resnet50", pretrained=True)
anchor_generator = AnchorGenerator(
sizes=((32, 64, 128, 256, 512),), aspect_ratios=((0.5, 1.0, 2.0),)
)
head = RPNHead(
# the RPN input channel count equals the backbone's output channels
in_channels=backbone.out_channels,
# num_anchors is determined once sizes and aspect_ratios above are fixed
num_anchors=anchor_generator.num_anchors_per_location()[0],
)
pre_nms_top_n = dict(
training=cfg.MODEL.RPN.PRE_NMS_TOPN_TRAIN, testing=cfg.MODEL.RPN.PRE_NMS_TOPN_TEST
)
post_nms_top_n = dict(
training=cfg.MODEL.RPN.POST_NMS_TOPN_TRAIN, testing=cfg.MODEL.RPN.POST_NMS_TOPN_TEST
)
rpn = RegionProposalNetwork(
anchor_generator=anchor_generator,
head=head,
fg_iou_thresh=cfg.MODEL.RPN.POS_THRESH_TRAIN,
bg_iou_thresh=cfg.MODEL.RPN.NEG_THRESH_TRAIN,
batch_size_per_image=cfg.MODEL.RPN.BATCH_SIZE_TRAIN,
positive_fraction=cfg.MODEL.RPN.POS_FRAC_TRAIN,
pre_nms_top_n=pre_nms_top_n,
post_nms_top_n=post_nms_top_n,
nms_thresh=cfg.MODEL.RPN.NMS_THRESH,
)
box_head = TransformerHead(
cfg=cfg,
trans_names=cfg.MODEL.TRANSFORMER.NAMES_1ST,
kernel_size=cfg.MODEL.TRANSFORMER.KERNEL_SIZE_1ST,
use_feature_mask=cfg.MODEL.TRANSFORMER.USE_MASK_1ST,
)
box_head_2nd = TransformerHead(
cfg=cfg,
trans_names=cfg.MODEL.TRANSFORMER.NAMES_2ND,
kernel_size=cfg.MODEL.TRANSFORMER.KERNEL_SIZE_2ND,
use_feature_mask=cfg.MODEL.TRANSFORMER.USE_MASK_2ND,
)
box_head_3rd = TransformerHead(
cfg=cfg,
trans_names=cfg.MODEL.TRANSFORMER.NAMES_3RD,
kernel_size=cfg.MODEL.TRANSFORMER.KERNEL_SIZE_3RD,
use_feature_mask=cfg.MODEL.TRANSFORMER.USE_MASK_3RD,
)
# Faster R-CNN classifier head: 2 classes (background / person), 2048 input channels
# (the 2048 comes from TransformerHead: its final 1x1 conv projects to 2048 channels and the pooled "after_trans" feature is fed here)
faster_rcnn_predictor = FastRCNNPredictor(2048, 2)
box_roi_pool = MultiScaleRoIAlign(
featmap_names=["feat_res4"], output_size=14, sampling_ratio=2
)
box_predictor = BBoxRegressor(2048, num_classes=2, bn_neck=cfg.MODEL.ROI_HEAD.BN_NECK)
roi_heads = CascadedROIHeads(
cfg=cfg,
# Cascade Transformer Head
faster_rcnn_predictor=faster_rcnn_predictor,
box_head_2nd=box_head_2nd,
box_head_3rd=box_head_3rd,
# parent class
box_roi_pool=box_roi_pool,
box_head=box_head,
box_predictor=box_predictor,
fg_iou_thresh=cfg.MODEL.ROI_HEAD.POS_THRESH_TRAIN,
bg_iou_thresh=cfg.MODEL.ROI_HEAD.NEG_THRESH_TRAIN,
batch_size_per_image=cfg.MODEL.ROI_HEAD.BATCH_SIZE_TRAIN,
positive_fraction=cfg.MODEL.ROI_HEAD.POS_FRAC_TRAIN,
bbox_reg_weights=None,
score_thresh=cfg.MODEL.ROI_HEAD.SCORE_THRESH_TEST,
nms_thresh=cfg.MODEL.ROI_HEAD.NMS_THRESH_TEST,
detections_per_img=cfg.MODEL.ROI_HEAD.DETECTIONS_PER_IMAGE_TEST,
)
transform = GeneralizedRCNNTransform(
min_size=cfg.INPUT.MIN_SIZE,
max_size=cfg.INPUT.MAX_SIZE,
image_mean=[0.485, 0.456, 0.406],
image_std=[0.229, 0.224, 0.225],
)
self.backbone = backbone
self.rpn = rpn
self.roi_heads = roi_heads
self.transform = transform
self.eval_feat = cfg.EVAL_FEATURE
# loss weights
self.lw_rpn_reg = cfg.SOLVER.LW_RPN_REG
self.lw_rpn_cls = cfg.SOLVER.LW_RPN_CLS
self.lw_rcnn_reg_1st = cfg.SOLVER.LW_RCNN_REG_1ST
self.lw_rcnn_cls_1st = cfg.SOLVER.LW_RCNN_CLS_1ST
self.lw_rcnn_reg_2nd = cfg.SOLVER.LW_RCNN_REG_2ND
self.lw_rcnn_cls_2nd = cfg.SOLVER.LW_RCNN_CLS_2ND
self.lw_rcnn_reg_3rd = cfg.SOLVER.LW_RCNN_REG_3RD
self.lw_rcnn_cls_3rd = cfg.SOLVER.LW_RCNN_CLS_3RD
self.lw_rcnn_reid_2nd = cfg.SOLVER.LW_RCNN_REID_2ND
self.lw_rcnn_reid_3rd = cfg.SOLVER.LW_RCNN_REID_3RD
def inference(self, images, targets=None, query_img_as_gallery=False):
original_image_sizes = [img.shape[-2:] for img in images]
images, targets = self.transform(images, targets)
features = self.backbone(images.tensors)
if query_img_as_gallery:
assert targets is not None
if targets is not None and not query_img_as_gallery:
# query
boxes = [t["boxes"] for t in targets]
box_features = self.roi_heads.box_roi_pool(features, boxes, images.image_sizes)
box_features_2nd = self.roi_heads.box_head_2nd(box_features)
embeddings_2nd, _ = self.roi_heads.embedding_head_2nd(box_features_2nd)
box_features_3rd = self.roi_heads.box_head_3rd(box_features)
embeddings_3rd, _ = self.roi_heads.embedding_head_3rd(box_features_3rd)
if self.eval_feat == 'concat':
embeddings = torch.cat((embeddings_2nd, embeddings_3rd), dim=1)
elif self.eval_feat == 'stage2':
embeddings = embeddings_2nd
elif self.eval_feat == 'stage3':
embeddings = embeddings_3rd
else:
raise Exception("Unknown evaluation feature name")
return embeddings.split(1, 0)
else:
# gallery
boxes, _ = self.rpn(images, features, targets)
detections = self.roi_heads(features, boxes, images.image_sizes, targets, query_img_as_gallery)[0]
detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
return detections
def forward(self, images, targets=None, query_img_as_gallery=False):
if not self.training:
return self.inference(images, targets, query_img_as_gallery)
# apply the shared transform to the input images and GT boxes
images, targets = self.transform(images, targets)
features = self.backbone(images.tensors)
# print(features["feat_res4"].shape)  # debug: backbone feature shape
# there may be an issue here
boxes, rpn_losses = self.rpn(images, features, targets)
_, rcnn_losses, feats_reid_2nd, targets_reid_2nd, feats_reid_3rd, targets_reid_3rd = self.roi_heads(features, boxes, images.image_sizes, targets)
# rename rpn losses to be consistent with detection losses
rpn_losses["loss_rpn_reg"] = rpn_losses.pop("loss_rpn_box_reg")
rpn_losses["loss_rpn_cls"] = rpn_losses.pop("loss_objectness")
losses = {}
losses.update(rcnn_losses)
losses.update(rpn_losses)
# apply loss weights
losses["loss_rpn_reg"] *= self.lw_rpn_reg
losses["loss_rpn_cls"] *= self.lw_rpn_cls
losses["loss_rcnn_reg_1st"] *= self.lw_rcnn_reg_1st
losses["loss_rcnn_cls_1st"] *= self.lw_rcnn_cls_1st
losses["loss_rcnn_reg_2nd"] *= self.lw_rcnn_reg_2nd
losses["loss_rcnn_cls_2nd"] *= self.lw_rcnn_cls_2nd
losses["loss_rcnn_reg_3rd"] *= self.lw_rcnn_reg_3rd
losses["loss_rcnn_cls_3rd"] *= self.lw_rcnn_cls_3rd
losses["loss_rcnn_reid_2nd"] *= self.lw_rcnn_reid_2nd
losses["loss_rcnn_reid_3rd"] *= self.lw_rcnn_reid_3rd
return losses, feats_reid_2nd, targets_reid_2nd, feats_reid_3rd, targets_reid_3rd
class CascadedROIHeads(RoIHeads):
'''
https://github.com/pytorch/vision/blob/master/torchvision/models/detection/roi_heads.py
'''
def __init__(
self,
cfg,
faster_rcnn_predictor,
box_head_2nd,
box_head_3rd,
*args,
**kwargs
):
super(CascadedROIHeads, self).__init__(*args, **kwargs)
# ROI head
self.use_diff_thresh=cfg.MODEL.ROI_HEAD.USE_DIFF_THRESH
self.nms_thresh_1st = cfg.MODEL.ROI_HEAD.NMS_THRESH_TEST_1ST
self.nms_thresh_2nd = cfg.MODEL.ROI_HEAD.NMS_THRESH_TEST_2ND
self.nms_thresh_3rd = cfg.MODEL.ROI_HEAD.NMS_THRESH_TEST_3RD
self.fg_iou_thresh_1st = cfg.MODEL.ROI_HEAD.POS_THRESH_TRAIN
self.bg_iou_thresh_1st = cfg.MODEL.ROI_HEAD.NEG_THRESH_TRAIN
self.fg_iou_thresh_2nd = cfg.MODEL.ROI_HEAD.POS_THRESH_TRAIN_2ND
self.bg_iou_thresh_2nd = cfg.MODEL.ROI_HEAD.NEG_THRESH_TRAIN_2ND
self.fg_iou_thresh_3rd = cfg.MODEL.ROI_HEAD.POS_THRESH_TRAIN_3RD
self.bg_iou_thresh_3rd = cfg.MODEL.ROI_HEAD.NEG_THRESH_TRAIN_3RD
# Regression head
self.box_predictor_1st = faster_rcnn_predictor
self.box_predictor_2nd = self.box_predictor
self.box_predictor_3rd = deepcopy(self.box_predictor)
# Transformer head
self.box_head_1st = self.box_head
self.box_head_2nd = box_head_2nd
self.box_head_3rd = box_head_3rd
# feature mask
self.use_feature_mask = cfg.MODEL.USE_FEATURE_MASK
self.feature_mask_size = cfg.MODEL.FEATURE_MASK_SIZE
# Feature embedding
embedding_dim = cfg.MODEL.EMBEDDING_DIM
self.embedding_head_2nd = NormAwareEmbedding(featmap_names=["before_trans", "after_trans"], in_channels=[1024, 2048], dim=embedding_dim)
self.embedding_head_3rd = deepcopy(self.embedding_head_2nd)
# OIM
num_pids = cfg.MODEL.LOSS.LUT_SIZE
num_cq_size = cfg.MODEL.LOSS.CQ_SIZE
oim_momentum = cfg.MODEL.LOSS.OIM_MOMENTUM
oim_scalar = cfg.MODEL.LOSS.OIM_SCALAR
self.reid_loss_2nd = OIMLoss(embedding_dim, num_pids, num_cq_size, oim_momentum, oim_scalar)
self.reid_loss_3rd = deepcopy(self.reid_loss_2nd)
# rename the method inherited from parent class
self.postprocess_proposals = self.postprocess_detections
# evaluation
self.eval_feat = cfg.EVAL_FEATURE
def forward(self, features, boxes, image_shapes, targets=None, query_img_as_gallery=False):
"""
Arguments:
features (List[Tensor])
boxes (List[Tensor[N, 4]])
image_shapes (List[Tuple[H, W]])
targets (List[Dict])
"""
cws = True
gt_det_2nd = None
gt_det_3rd = None
feats_reid_2nd = None
feats_reid_3rd = None
targets_reid_2nd = None
targets_reid_3rd = None
if self.training:
if self.use_diff_thresh:
self.proposal_matcher = det_utils.Matcher(
self.fg_iou_thresh_1st,
self.bg_iou_thresh_1st,
allow_low_quality_matches=False)
boxes, _, box_pid_labels_1st, box_reg_targets_1st = self.select_training_samples(
boxes, targets
)
# ------------------- The first stage ------------------ #
# print(features["feat_res4"].shape)
# torch.Size([2, 1024, 58, 94])
# torch.Size([2, 1024, 54, 94])
box_features_1st = self.box_roi_pool(features, boxes, image_shapes)
box_features_1st = self.box_head_1st(box_features_1st)
box_cls_scores_1st, box_regs_1st = self.box_predictor_1st(box_features_1st["after_trans"])
if self.training:
boxes = self.get_boxes(box_regs_1st, boxes, image_shapes)
boxes = [boxes_per_image.detach() for boxes_per_image in boxes]
if self.use_diff_thresh:
self.proposal_matcher = det_utils.Matcher(
self.fg_iou_thresh_2nd,
self.bg_iou_thresh_2nd,
allow_low_quality_matches=False)
boxes, _, box_pid_labels_2nd, box_reg_targets_2nd = self.select_training_samples(boxes, targets)
else:
orig_thresh = self.nms_thresh # 0.4
self.nms_thresh = self.nms_thresh_1st
boxes, scores, _ = self.postprocess_proposals(
box_cls_scores_1st, box_regs_1st, boxes, image_shapes
)
if not self.training and query_img_as_gallery:
# When regarding the query image as gallery, GT boxes may be excluded
# from detected boxes. To avoid this, we compulsorily include GT in the
# detection results. Additionally, CWS should be disabled as the
# confidences of these people in query image are 1
cws = False
gt_box = [targets[0]["boxes"]]
gt_box_features = self.box_roi_pool(features, gt_box, image_shapes)
gt_box_features = self.box_head_2nd(gt_box_features)
embeddings, _ = self.embedding_head_2nd(gt_box_features)
gt_det_2nd = {"boxes": targets[0]["boxes"], "embeddings": embeddings}
# no detection predicted by Faster R-CNN head in test phase
if boxes[0].shape[0] == 0:
assert not self.training
boxes = gt_det_2nd["boxes"] if gt_det_2nd else torch.zeros(0, 4)
labels = torch.ones(1).type_as(boxes) if gt_det_2nd else torch.zeros(0)
scores = torch.ones(1).type_as(boxes) if gt_det_2nd else torch.zeros(0)
if self.eval_feat == 'concat':
embeddings = torch.cat((gt_det_2nd["embeddings"], gt_det_2nd["embeddings"]), dim=1) if gt_det_2nd else torch.zeros(0, 512)
elif self.eval_feat == 'stage2' or self.eval_feat == 'stage3':
embeddings = gt_det_2nd["embeddings"] if gt_det_2nd else torch.zeros(0, 256)
else:
raise Exception("Unknown evaluation feature name")
return [dict(boxes=boxes, labels=labels, scores=scores, embeddings=embeddings)], []
# --------------------- The second stage -------------------- #
box_features = self.box_roi_pool(features, boxes, image_shapes)
box_features = self.box_head_2nd(box_features)
box_regs_2nd = self.box_predictor_2nd(box_features["after_trans"])
box_embeddings_2nd, box_cls_scores_2nd = self.embedding_head_2nd(box_features)
if box_cls_scores_2nd.dim() == 0:
box_cls_scores_2nd = box_cls_scores_2nd.unsqueeze(0)
if self.training:
boxes = self.get_boxes(box_regs_2nd, boxes, image_shapes)
boxes = [boxes_per_image.detach() for boxes_per_image in boxes]
if self.use_diff_thresh:
self.proposal_matcher = det_utils.Matcher(
self.fg_iou_thresh_3rd,
self.bg_iou_thresh_3rd,
allow_low_quality_matches=False)
boxes, _, box_pid_labels_3rd, box_reg_targets_3rd = self.select_training_samples(boxes, targets)
else:
self.nms_thresh = self.nms_thresh_2nd
if self.eval_feat != 'stage2':
boxes, scores, _, _ = self.postprocess_boxes(
box_cls_scores_2nd,
box_regs_2nd,
box_embeddings_2nd,
boxes,
image_shapes,
fcs=scores,
gt_det=None,
cws=cws,
)
if not self.training and query_img_as_gallery and self.eval_feat != 'stage2':
cws = False
gt_box = [targets[0]["boxes"]]
gt_box_features = self.box_roi_pool(features, gt_box, image_shapes)
gt_box_features = self.box_head_3rd(gt_box_features)
embeddings, _ = self.embedding_head_3rd(gt_box_features)
gt_det_3rd = {"boxes": targets[0]["boxes"], "embeddings": embeddings}
# no detection predicted by Faster R-CNN head in test phase
if boxes[0].shape[0] == 0 and self.eval_feat != 'stage2':
assert not self.training
boxes = gt_det_3rd["boxes"] if gt_det_3rd else torch.zeros(0, 4)
labels = torch.ones(1).type_as(boxes) if gt_det_3rd else torch.zeros(0)
scores = torch.ones(1).type_as(boxes) if gt_det_3rd else torch.zeros(0)
if self.eval_feat == 'concat':
embeddings = torch.cat((gt_det_2nd["embeddings"], gt_det_3rd["embeddings"]), dim=1) if gt_det_3rd else torch.zeros(0, 512)
elif self.eval_feat == 'stage3':
embeddings = gt_det_3rd["embeddings"] if gt_det_3rd else torch.zeros(0, 256)  # use the stage-3 embedding for 'stage3' eval
else:
raise Exception("Unknown evaluation feature name")
return [dict(boxes=boxes, labels=labels, scores=scores, embeddings=embeddings)], []
# --------------------- The third stage -------------------- #
box_features = self.box_roi_pool(features, boxes, image_shapes)
if not self.training:
box_features_2nd = self.box_head_2nd(box_features)
box_embeddings_2nd, _ = self.embedding_head_2nd(box_features_2nd)
box_features = self.box_head_3rd(box_features)
box_regs_3rd = self.box_predictor_3rd(box_features["after_trans"])
box_embeddings_3rd, box_cls_scores_3rd = self.embedding_head_3rd(box_features)
if box_cls_scores_3rd.dim() == 0:
box_cls_scores_3rd = box_cls_scores_3rd.unsqueeze(0)
result, losses = [], {}
if self.training:
box_labels_1st = [y.clamp(0, 1) for y in box_pid_labels_1st]
box_labels_2nd = [y.clamp(0, 1) for y in box_pid_labels_2nd]
box_labels_3rd = [y.clamp(0, 1) for y in box_pid_labels_3rd]
losses = detection_losses(
box_cls_scores_1st,
box_regs_1st,
box_labels_1st,
box_reg_targets_1st,
box_cls_scores_2nd,
box_regs_2nd,
box_labels_2nd,
box_reg_targets_2nd,
box_cls_scores_3rd,
box_regs_3rd,
box_labels_3rd,
box_reg_targets_3rd,
)
loss_rcnn_reid_2nd, feats_reid_2nd, targets_reid_2nd = self.reid_loss_2nd(box_embeddings_2nd, box_pid_labels_2nd)
loss_rcnn_reid_3rd, feats_reid_3rd, targets_reid_3rd = self.reid_loss_3rd(box_embeddings_3rd, box_pid_labels_3rd)
losses.update(loss_rcnn_reid_2nd=loss_rcnn_reid_2nd)
losses.update(loss_rcnn_reid_3rd=loss_rcnn_reid_3rd)
else:
if self.eval_feat == 'stage2':
boxes, scores, embeddings_2nd, labels = self.postprocess_boxes(
box_cls_scores_2nd,
box_regs_2nd,
box_embeddings_2nd,
boxes,
image_shapes,
fcs=scores,
gt_det=gt_det_2nd,
cws=cws,
)
else:
self.nms_thresh = self.nms_thresh_3rd
_, _, embeddings_2nd, _ = self.postprocess_boxes(
box_cls_scores_3rd,
box_regs_3rd,
box_embeddings_2nd,
boxes,
image_shapes,
fcs=scores,
gt_det=gt_det_2nd,
cws=cws,
)
boxes, scores, embeddings_3rd, labels = self.postprocess_boxes(
box_cls_scores_3rd,
box_regs_3rd,
box_embeddings_3rd,
boxes,
image_shapes,
fcs=scores,
gt_det=gt_det_3rd,
cws=cws,
)
# set to original thresh after finishing postprocess
self.nms_thresh = orig_thresh
num_images = len(boxes)
for i in range(num_images):
if self.eval_feat == 'concat':
embeddings = torch.cat((embeddings_2nd[i],embeddings_3rd[i]), dim=1)
elif self.eval_feat == 'stage2':
embeddings = embeddings_2nd[i]
elif self.eval_feat == 'stage3':
embeddings = embeddings_3rd[i]
else:
raise Exception("Unknown evaluation feature name")
result.append(
dict(
boxes=boxes[i],
labels=labels[i],
scores=scores[i],
embeddings=embeddings
)
)
return result, losses, feats_reid_2nd, targets_reid_2nd, feats_reid_3rd, targets_reid_3rd
def get_boxes(self, box_regression, proposals, image_shapes):
"""
Get boxes from proposals.
"""
boxes_per_image = [len(boxes_in_image) for boxes_in_image in proposals]
pred_boxes = self.box_coder.decode(box_regression, proposals)
pred_boxes = pred_boxes.split(boxes_per_image, 0)
all_boxes = []
for boxes, image_shape in zip(pred_boxes, image_shapes):
boxes = box_ops.clip_boxes_to_image(boxes, image_shape)
# remove predictions with the background label
boxes = boxes[:, 1:].reshape(-1, 4)
all_boxes.append(boxes)
return all_boxes
def postprocess_boxes(
self,
class_logits,
box_regression,
embeddings,
proposals,
image_shapes,
fcs=None,
gt_det=None,
cws=True,
):
"""
Similar to RoIHeads.postprocess_detections, but can handle embeddings and implement
First Classification Score (FCS).
"""
device = class_logits.device
boxes_per_image = [len(boxes_in_image) for boxes_in_image in proposals]
pred_boxes = self.box_coder.decode(box_regression, proposals)
if fcs is not None:
# First Classification Score (FCS)
pred_scores = fcs[0]
else:
pred_scores = torch.sigmoid(class_logits)
if cws:
# Confidence Weighted Similarity (CWS)
embeddings = embeddings * pred_scores.view(-1, 1)
# split boxes and scores per image
pred_boxes = pred_boxes.split(boxes_per_image, 0)
pred_scores = pred_scores.split(boxes_per_image, 0)
pred_embeddings = embeddings.split(boxes_per_image, 0)
all_boxes = []
all_scores = []
all_labels = []
all_embeddings = []
for boxes, scores, embeddings, image_shape in zip(
pred_boxes, pred_scores, pred_embeddings, image_shapes
):
boxes = box_ops.clip_boxes_to_image(boxes, image_shape)
# create labels for each prediction
labels = torch.ones(scores.size(0), device=device)
# remove predictions with the background label
boxes = boxes[:, 1:]
scores = scores.unsqueeze(1)
labels = labels.unsqueeze(1)
# batch everything, by making every class prediction be a separate instance
boxes = boxes.reshape(-1, 4)
scores = scores.flatten()
labels = labels.flatten()
embeddings = embeddings.reshape(-1, self.embedding_head_2nd.dim)
# remove low scoring boxes
inds = torch.nonzero(scores > self.score_thresh).squeeze(1)
boxes, scores, labels, embeddings = (
boxes[inds],
scores[inds],
labels[inds],
embeddings[inds],
)
# remove empty boxes
keep = box_ops.remove_small_boxes(boxes, min_size=1e-2)
boxes, scores, labels, embeddings = (
boxes[keep],
scores[keep],
labels[keep],
embeddings[keep],
)
if gt_det is not None:
# include GT into the detection results
boxes = torch.cat((boxes, gt_det["boxes"]), dim=0)
labels = torch.cat((labels, torch.tensor([1.0]).to(device)), dim=0)
scores = torch.cat((scores, torch.tensor([1.0]).to(device)), dim=0)
embeddings = torch.cat((embeddings, gt_det["embeddings"]), dim=0)
# non-maximum suppression, independently done per class
keep = box_ops.batched_nms(boxes, scores, labels, self.nms_thresh)
# keep only topk scoring predictions
keep = keep[: self.detections_per_img]
boxes, scores, labels, embeddings = (
boxes[keep],
scores[keep],
labels[keep],
embeddings[keep],
)
all_boxes.append(boxes)
all_scores.append(scores)
all_labels.append(labels)
all_embeddings.append(embeddings)
return all_boxes, all_scores, all_embeddings, all_labels
class NormAwareEmbedding(nn.Module):
"""
Implements the Norm-Aware Embedding proposed in
Chen, Di, et al. "Norm-aware embedding for efficient person search." CVPR 2020.
"""
def __init__(self, featmap_names=["feat_res4", "feat_res5"], in_channels=[1024, 2048], dim=256):
super(NormAwareEmbedding, self).__init__()
self.featmap_names = featmap_names
self.in_channels = in_channels
self.dim = dim
self.projectors = nn.ModuleDict()
indv_dims = self._split_embedding_dim()
for ftname, in_channel, indv_dim in zip(self.featmap_names, self.in_channels, indv_dims):
proj = nn.Sequential(nn.Linear(in_channel, indv_dim), nn.BatchNorm1d(indv_dim))
init.normal_(proj[0].weight, std=0.01)
init.normal_(proj[1].weight, std=0.01)
init.constant_(proj[0].bias, 0)
init.constant_(proj[1].bias, 0)
self.projectors[ftname] = proj
self.rescaler = nn.BatchNorm1d(1, affine=True)
def forward(self, featmaps):
"""
Arguments:
featmaps: OrderedDict[Tensor], and in featmap_names you can choose which
featmaps to use
Returns:
tensor of size (BatchSize, dim), L2 normalized embeddings.
tensor of size (BatchSize, ) rescaled norm of embeddings, as class_logits.
"""
assert len(featmaps) == len(self.featmap_names)
if len(featmaps) == 1:
k, v = list(featmaps.items())[0]
v = self._flatten_fc_input(v)
embeddings = self.projectors[k](v)
norms = embeddings.norm(2, 1, keepdim=True)
embeddings = embeddings / norms.expand_as(embeddings).clamp(min=1e-12)
norms = self.rescaler(norms).squeeze()
return embeddings, norms
else:
outputs = []
for k, v in featmaps.items():
v = self._flatten_fc_input(v)
outputs.append(self.projectors[k](v))
embeddings = torch.cat(outputs, dim=1)
norms = embeddings.norm(2, 1, keepdim=True)
embeddings = embeddings / norms.expand_as(embeddings).clamp(min=1e-12)
norms = self.rescaler(norms).squeeze()
return embeddings, norms
def _flatten_fc_input(self, x):
if x.ndimension() == 4:
assert list(x.shape[2:]) == [1, 1]
return x.flatten(start_dim=1)
return x
def _split_embedding_dim(self):
parts = len(self.in_channels)
tmp = [self.dim // parts] * parts
if sum(tmp) == self.dim:
return tmp
else:
res = self.dim % parts
for i in range(1, res + 1):
tmp[-i] += 1
assert sum(tmp) == self.dim
return tmp
class BBoxRegressor(nn.Module):
"""
Bounding box regression layer.
"""
def __init__(self, in_channels, num_classes=2, bn_neck=True):
"""
Args:
in_channels (int): Input channels.
num_classes (int, optional): Defaults to 2 (background and pedestrian).
bn_neck (bool, optional): Whether to use BN after Linear. Defaults to True.
"""
super(BBoxRegressor, self).__init__()
if bn_neck:
self.bbox_pred = nn.Sequential(
nn.Linear(in_channels, 4 * num_classes), nn.BatchNorm1d(4 * num_classes)
)
init.normal_(self.bbox_pred[0].weight, std=0.01)
init.normal_(self.bbox_pred[1].weight, std=0.01)
init.constant_(self.bbox_pred[0].bias, 0)
init.constant_(self.bbox_pred[1].bias, 0)
else:
self.bbox_pred = nn.Linear(in_channels, 4 * num_classes)
init.normal_(self.bbox_pred.weight, std=0.01)
init.constant_(self.bbox_pred.bias, 0)
def forward(self, x):
if x.ndimension() == 4:
if list(x.shape[2:]) != [1, 1]:
x = F.adaptive_avg_pool2d(x, output_size=1)
x = x.flatten(start_dim=1)
bbox_deltas = self.bbox_pred(x)
return bbox_deltas
def detection_losses(
box_cls_scores_1st,
box_regs_1st,
box_labels_1st,
box_reg_targets_1st,
box_cls_scores_2nd,
box_regs_2nd,
box_labels_2nd,
box_reg_targets_2nd,
box_cls_scores_3rd,
box_regs_3rd,
box_labels_3rd,
box_reg_targets_3rd,
):
# --------------------- The first stage -------------------- #
box_labels_1st = torch.cat(box_labels_1st, dim=0)
box_reg_targets_1st = torch.cat(box_reg_targets_1st, dim=0)
loss_rcnn_cls_1st = F.cross_entropy(box_cls_scores_1st, box_labels_1st)
# get indices that correspond to the regression targets for the
# corresponding ground truth labels, to be used with advanced indexing
sampled_pos_inds_subset = torch.nonzero(box_labels_1st > 0).squeeze(1)
labels_pos = box_labels_1st[sampled_pos_inds_subset]
N = box_cls_scores_1st.size(0)
box_regs_1st = box_regs_1st.reshape(N, -1, 4)
loss_rcnn_reg_1st = F.smooth_l1_loss(
box_regs_1st[sampled_pos_inds_subset, labels_pos],
box_reg_targets_1st[sampled_pos_inds_subset],
reduction="sum",
)
loss_rcnn_reg_1st = loss_rcnn_reg_1st / box_labels_1st.numel()
# --------------------- The second stage -------------------- #
box_labels_2nd = torch.cat(box_labels_2nd, dim=0)
box_reg_targets_2nd = torch.cat(box_reg_targets_2nd, dim=0)
loss_rcnn_cls_2nd = F.binary_cross_entropy_with_logits(box_cls_scores_2nd, box_labels_2nd.float())
sampled_pos_inds_subset = torch.nonzero(box_labels_2nd > 0).squeeze(1)
labels_pos = box_labels_2nd[sampled_pos_inds_subset]
N = box_cls_scores_2nd.size(0)
box_regs_2nd = box_regs_2nd.reshape(N, -1, 4)
loss_rcnn_reg_2nd = F.smooth_l1_loss(
box_regs_2nd[sampled_pos_inds_subset, labels_pos],
box_reg_targets_2nd[sampled_pos_inds_subset],
reduction="sum",
)
loss_rcnn_reg_2nd = loss_rcnn_reg_2nd / box_labels_2nd.numel()
# --------------------- The third stage -------------------- #
box_labels_3rd = torch.cat(box_labels_3rd, dim=0)
box_reg_targets_3rd = torch.cat(box_reg_targets_3rd, dim=0)
loss_rcnn_cls_3rd = F.binary_cross_entropy_with_logits(box_cls_scores_3rd, box_labels_3rd.float())
sampled_pos_inds_subset = torch.nonzero(box_labels_3rd > 0).squeeze(1)
labels_pos = box_labels_3rd[sampled_pos_inds_subset]
N = box_cls_scores_3rd.size(0)
box_regs_3rd = box_regs_3rd.reshape(N, -1, 4)
loss_rcnn_reg_3rd = F.smooth_l1_loss(
box_regs_3rd[sampled_pos_inds_subset, labels_pos],
box_reg_targets_3rd[sampled_pos_inds_subset],
reduction="sum",
)
loss_rcnn_reg_3rd = loss_rcnn_reg_3rd / box_labels_3rd.numel()
return dict(
loss_rcnn_cls_1st=loss_rcnn_cls_1st,
loss_rcnn_reg_1st=loss_rcnn_reg_1st,
loss_rcnn_cls_2nd=loss_rcnn_cls_2nd,
loss_rcnn_reg_2nd=loss_rcnn_reg_2nd,
loss_rcnn_cls_3rd=loss_rcnn_cls_3rd,
loss_rcnn_reg_3rd=loss_rcnn_reg_3rd,
)
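# --------------------------------------------------------------------------- #
# Minimal usage sketch (illustrative; train.py / engine.py contain the real
# loops). Both calls go through COAT.forward above:
#   cfg = get_default_cfg(); cfg.merge_from_file("configs/cuhk_sysu.yaml")
#   model = COAT(cfg).to(cfg.DEVICE)
#   # training: weighted loss dict plus the stage-2/3 re-id feats and targets
#   # that feed the optional SoftmaxLoss criteria
#   losses, feats_2nd, tgts_2nd, feats_3rd, tgts_3rd = model(images, targets)
#   # inference: one dict per image with boxes, labels, scores, embeddings
#   detections = model(images)
# --------------------------------------------------------------------------- #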

View File

@ -0,0 +1,72 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
from collections import OrderedDict
from backbone.pvt_v2 import pvt_v2_b2_2,pvt_v2_b2
import torch.nn.functional as F
import torchvision
from torch import nn
class Backbone(nn.Sequential):
def __init__(self, resnet):
super(Backbone, self).__init__(
OrderedDict(
[
["conv1", resnet.conv1],
["bn1", resnet.bn1],
["relu", resnet.relu],
["maxpool", resnet.maxpool],
["layer1", resnet.layer1], # res2
["layer2", resnet.layer2], # res3
["layer3", resnet.layer3], # res4
]
)
)
self.out_channels = 1024
self.out_feat_key = "feat_res4"
def forward(self, x):
# using the forward method from nn.Sequential
feat = super(Backbone, self).forward(x)
return OrderedDict([[self.out_feat_key, feat]])
class PVTv2Backbone(pvt_v2_b2):
def __init__(self, pretrained_path=""):
super(PVTv2Backbone, self).__init__(pretrained = pretrained_path)
self.out_channels = 512
self.out_feat_key = "feat_pvtv23"
def forward(self, x):
feat = super(PVTv2Backbone, self).forward(x)
return OrderedDict([[self.out_feat_key, feat[3]]])
class Res5Head(nn.Sequential):
def __init__(self, resnet):
super(Res5Head, self).__init__(OrderedDict([["layer4", resnet.layer4]])) # res5
self.out_channels = [1024, 2048]
def forward(self, x):
feat = super(Res5Head, self).forward(x)
x = F.adaptive_max_pool2d(x, 1)
feat = F.adaptive_max_pool2d(feat, 1)
return OrderedDict([["feat_res4", x], ["feat_res5", feat]])
def build_resnet(name="resnet50", pretrained=True):
resnet = torchvision.models.resnet.__dict__[name](pretrained=pretrained)
# freeze layers
resnet.conv1.weight.requires_grad_(False)
resnet.bn1.weight.requires_grad_(False)
resnet.bn1.bias.requires_grad_(False)
return Backbone(resnet)
def build_network(name="resnet50", pretrained=True):
if(name == "resnet50"):
return build_resnet(name, pretrained)
else:
# use the pvt_v2_b2 backbone
#model = pvt_v2_b2_2(pretrained = "./backbone/pvt_v2_b2.pth")
model = PVTv2Backbone(pretrained_path = "./backbone/pvt_v2_b2.pth")
return model
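# --------------------------------------------------------------------------- #
# Minimal usage sketch (illustrative): the backbone returns a single-entry
# OrderedDict keyed by out_feat_key, which is what the RPN / RoI pooling in
# models/coat.py consume.
#   backbone = build_network(name="resnet50", pretrained=False)
#   feats = backbone(torch.randn(1, 3, 900, 1500))
#   # feats["feat_res4"]: [1, 1024, ~H/16, ~W/16]; backbone.out_channels == 1024
# --------------------------------------------------------------------------- #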

View File

@ -0,0 +1,300 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import math
import random
from functools import reduce
import torch
import torch.nn as nn
import torch.nn.functional as F
from utils.mask import exchange_token, exchange_patch, get_mask_box, jigsaw_token, cutout_patch, erase_patch, mixup_patch, jigsaw_patch
def conv1x1(in_planes: int, out_planes: int, stride: int = 1) -> nn.Conv2d:
"""1x1 convolution"""
return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)
class TransformerHead(nn.Module):
def __init__(
self,
cfg,
trans_names,
kernel_size,
use_feature_mask,
):
super(TransformerHead, self).__init__()
d_model = cfg.MODEL.TRANSFORMER.DIM_MODEL
# Mask parameters
self.use_feature_mask = use_feature_mask
mask_shape = cfg.MODEL.MASK_SHAPE
mask_size = cfg.MODEL.MASK_SIZE
mask_mode = cfg.MODEL.MASK_MODE
self.bypass_mask = exchange_patch(mask_shape, mask_size, mask_mode)
self.get_mask_box = get_mask_box(mask_shape, mask_size, mask_mode)
self.transformer_encoder = Transformers(
cfg=cfg,
trans_names=trans_names,
kernel_size=kernel_size,
use_feature_mask=use_feature_mask,
)
self.conv0 = conv1x1(1024, 1024)
self.conv1 = conv1x1(1024, d_model)
self.conv2 = conv1x1(d_model, 2048)
def forward(self, box_features):
mask_box = self.get_mask_box(box_features)
if self.use_feature_mask:
skip_features = self.conv0(box_features)
if self.training:
skip_features = self.bypass_mask(skip_features)
else:
skip_features = box_features
trans_features = {}
trans_features["before_trans"] = F.adaptive_max_pool2d(skip_features, 1)
box_features = self.conv1(box_features)
box_features = self.transformer_encoder((box_features,mask_box))
box_features = self.conv2(box_features)
trans_features["after_trans"] = F.adaptive_max_pool2d(box_features, 1)
return trans_features
class Transformers(nn.Module):
def __init__(
self,
cfg,
trans_names,
kernel_size,
use_feature_mask,
):
super(Transformers, self).__init__()
d_model = cfg.MODEL.TRANSFORMER.DIM_MODEL
self.feature_aug_type = cfg.MODEL.FEATURE_AUG_TYPE
self.use_feature_mask = use_feature_mask
# If no conv before transformer, we do not use scales
if not cfg.MODEL.TRANSFORMER.USE_PATCH2VEC:
trans_names = ['scale1']
kernel_size = [(1,1)]
self.trans_names = trans_names
self.scale_size = len(self.trans_names)
hidden = d_model//(2*self.scale_size)
# kernel_size: (padding, stride)
kernels = {
(1,1): [(0,0),(1,1)],
(3,3): [(1,1),(1,1)]
}
padding = []
stride = []
for ksize in kernel_size:
if ksize not in [(1,1),(3,3)]:
raise ValueError('Undefined kernel size.')
padding.append(kernels[ksize][0])
stride.append(kernels[ksize][1])
self.use_output_layer = cfg.MODEL.TRANSFORMER.USE_OUTPUT_LAYER
self.use_global_shortcut = cfg.MODEL.TRANSFORMER.USE_GLOBAL_SHORTCUT
self.blocks = nn.ModuleDict()
for tname, ksize, psize, ssize in zip(self.trans_names, kernel_size, padding, stride):
transblock = Transformer(
cfg, d_model//self.scale_size, ksize, psize, ssize, hidden, use_feature_mask
)
self.blocks[tname] = nn.Sequential(transblock)
self.output_linear = nn.Sequential(
nn.Conv2d(d_model, d_model, kernel_size=3, padding=1),
nn.LeakyReLU(0.2, inplace=True)
)
self.mask_para = [cfg.MODEL.MASK_SHAPE, cfg.MODEL.MASK_SIZE, cfg.MODEL.MASK_MODE]
def forward(self, inputs):
trans_feat = []
enc_feat, mask_box = inputs
if self.training and self.use_feature_mask and self.feature_aug_type == 'exchange_patch':
feature_mask = exchange_patch(self.mask_para[0], self.mask_para[1], self.mask_para[2])
enc_feat = feature_mask(enc_feat)
for tname, feat in zip(self.trans_names, torch.chunk(enc_feat, len(self.trans_names), dim=1)):
feat = self.blocks[tname]((feat, mask_box))
trans_feat.append(feat)
trans_feat = torch.cat(trans_feat, 1)
if self.use_output_layer:
trans_feat = self.output_linear(trans_feat)
if self.use_global_shortcut:
trans_feat = enc_feat + trans_feat
return trans_feat
class Transformer(nn.Module):
def __init__(self, cfg, channel, kernel_size, padding, stride, hidden, use_feature_mask
):
super(Transformer, self).__init__()
self.k = kernel_size[0]
stack_num = cfg.MODEL.TRANSFORMER.ENCODER_LAYERS
num_head = cfg.MODEL.TRANSFORMER.N_HEAD
dropout = cfg.MODEL.TRANSFORMER.DROPOUT
output_size = (14,14)
token_size = tuple(map(lambda x,y:x//y, output_size, stride))
blocks = []
self.transblock = TransformerBlock(token_size, hidden=hidden, num_head=num_head, dropout=dropout)
for _ in range(stack_num):
blocks.append(self.transblock)
self.transformer = nn.Sequential(*blocks)
self.patch2vec = nn.Conv2d(channel, hidden, kernel_size=kernel_size, stride=stride, padding=padding)
self.vec2patch = Vec2Patch(channel, hidden, output_size, kernel_size, stride, padding)
self.use_local_shortcut = cfg.MODEL.TRANSFORMER.USE_LOCAL_SHORTCUT
self.use_feature_mask = use_feature_mask
self.feature_aug_type = cfg.MODEL.FEATURE_AUG_TYPE
self.use_patch2vec = cfg.MODEL.TRANSFORMER.USE_PATCH2VEC
def forward(self, inputs):
enc_feat, mask_box = inputs
b, c, h, w = enc_feat.size()
trans_feat = self.patch2vec(enc_feat)
_, c, h, w = trans_feat.size()
trans_feat = trans_feat.view(b, c, -1).permute(0, 2, 1)
# For 1x1 & 3x3 kernels, exchange tokens
if self.training and self.use_feature_mask:
if self.feature_aug_type == 'exchange_token':
feature_mask = exchange_token()
trans_feat = feature_mask(trans_feat, mask_box)
elif self.feature_aug_type == 'cutout_patch':
feature_mask = cutout_patch()
trans_feat = feature_mask(trans_feat)
elif self.feature_aug_type == 'erase_patch':
feature_mask = erase_patch()
trans_feat = feature_mask(trans_feat)
elif self.feature_aug_type == 'mixup_patch':
feature_mask = mixup_patch()
trans_feat = feature_mask(trans_feat)
if self.use_feature_mask:
if self.feature_aug_type == 'jigsaw_patch':
feature_mask = jigsaw_patch()
trans_feat = feature_mask(trans_feat)
elif self.feature_aug_type == 'jigsaw_token':
feature_mask = jigsaw_token()
trans_feat = feature_mask(trans_feat)
trans_feat = self.transformer(trans_feat)
trans_feat = self.vec2patch(trans_feat)
if self.use_local_shortcut:
trans_feat = enc_feat + trans_feat
return trans_feat
class TransformerBlock(nn.Module):
"""
Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
"""
def __init__(self, tokensize, hidden=128, num_head=4, dropout=0.1):
super().__init__()
self.attention = MultiHeadedAttention(tokensize, d_model=hidden, head=num_head, p=dropout)
self.ffn = FeedForward(hidden, p=dropout)
self.norm1 = nn.LayerNorm(hidden)
self.norm2 = nn.LayerNorm(hidden)
self.dropout = nn.Dropout(p=dropout)
def forward(self, x):
x = self.norm1(x)
x = x + self.dropout(self.attention(x))
y = self.norm2(x)
x = x + self.ffn(y)
return x
class Attention(nn.Module):
"""
Compute 'Scaled Dot Product Attention'
"""
def __init__(self, p=0.1):
super(Attention, self).__init__()
self.dropout = nn.Dropout(p=p)
def forward(self, query, key, value):
scores = torch.matmul(query, key.transpose(-2, -1)
) / math.sqrt(query.size(-1))
p_attn = F.softmax(scores, dim=-1)
p_attn = self.dropout(p_attn)
p_val = torch.matmul(p_attn, value)
return p_val, p_attn
class Vec2Patch(nn.Module):
def __init__(self, channel, hidden, output_size, kernel_size, stride, padding):
super(Vec2Patch, self).__init__()
self.relu = nn.LeakyReLU(0.2, inplace=True)
c_out = reduce((lambda x, y: x * y), kernel_size) * channel
self.embedding = nn.Linear(hidden, c_out)
self.to_patch = torch.nn.Fold(output_size=output_size, kernel_size=kernel_size, stride=stride, padding=padding)
h, w = output_size
def forward(self, x):
feat = self.embedding(x)
b, n, c = feat.size()
feat = feat.permute(0, 2, 1)
feat = self.to_patch(feat)
return feat
class MultiHeadedAttention(nn.Module):
"""
Take in model size and number of heads.
"""
def __init__(self, tokensize, d_model, head, p=0.1):
super().__init__()
self.query_embedding = nn.Linear(d_model, d_model)
self.value_embedding = nn.Linear(d_model, d_model)
self.key_embedding = nn.Linear(d_model, d_model)
self.output_linear = nn.Linear(d_model, d_model)
self.attention = Attention(p=p)
self.head = head
self.h, self.w = tokensize
def forward(self, x):
b, n, c = x.size()
c_h = c // self.head
key = self.key_embedding(x)
query = self.query_embedding(x)
value = self.value_embedding(x)
key = key.view(b, n, self.head, c_h).permute(0, 2, 1, 3)
query = query.view(b, n, self.head, c_h).permute(0, 2, 1, 3)
value = value.view(b, n, self.head, c_h).permute(0, 2, 1, 3)
att, _ = self.attention(query, key, value)
att = att.permute(0, 2, 1, 3).contiguous().view(b, n, c)
output = self.output_linear(att)
return output
class FeedForward(nn.Module):
def __init__(self, d_model, p=0.1):
super(FeedForward, self).__init__()
self.conv = nn.Sequential(
nn.Linear(d_model, d_model * 4),
nn.ReLU(inplace=True),
nn.Dropout(p=p),
nn.Linear(d_model * 4, d_model),
nn.Dropout(p=p))
def forward(self, x):
x = self.conv(x)
return x
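# --------------------------------------------------------------------------- #
# Shape sketch (illustrative; assumes a populated cfg): in models/coat.py the
# head receives 14x14 RoI features with 1024 channels and returns both pooled
# views consumed by NormAwareEmbedding.
#   head = TransformerHead(cfg, trans_names=["scale1", "scale2"],
#                          kernel_size=[(1, 1), (3, 3)], use_feature_mask=False)
#   out = head(torch.randn(8, 1024, 14, 14))
#   # out["before_trans"]: [8, 1024, 1, 1], out["after_trans"]: [8, 2048, 1, 1]
# --------------------------------------------------------------------------- #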

View File

@ -0,0 +1,130 @@
CKPT_PERIOD: 1
DEVICE: cuda:0
DISP_PERIOD: 10
EVAL_FEATURE: concat
EVAL_GALLERY_SIZE: 100
EVAL_PERIOD: 1
EVAL_USE_CACHE: false
EVAL_USE_CBGM: false
EVAL_USE_GT: false
GRID:
MODE: 1
OFFSET: 0
PROB: 0.5
RATIO: 0.5
ROTATE: 1
INPUT:
BATCH_SIZE_TEST: 1
BATCH_SIZE_TRAIN: 1
DATASET: CUHK-SYSU
DATA_ROOT: E:/DeepLearning/PersonSearch/COAT/datasets/CUHK-SYSU
IMAGE_CUTOUT: false
IMAGE_ERASE: false
IMAGE_GRID: false
IMAGE_MIXUP: false
MAX_SIZE: 1500
MIN_SIZE: 900
NUM_WORKERS_TEST: 1
NUM_WORKERS_TRAIN: 5
MODEL:
EMBEDDING_DIM: 256
FEATURE_AUG_TYPE: exchange_token
FEATURE_MASK_SIZE: 4
LOSS:
CQ_SIZE: 5000
LUT_SIZE: 5532
OIM_MOMENTUM: 0.5
OIM_SCALAR: 30.0
USE_SOFTMAX: true
MASK_MODE: random_direction
MASK_PERCENT: 0.1
MASK_SHAPE: stripe
MASK_SIZE: 1
ROI_HEAD:
BATCH_SIZE_TRAIN: 128
BN_NECK: true
DETECTIONS_PER_IMAGE_TEST: 300
NEG_THRESH_TRAIN: 0.5
NEG_THRESH_TRAIN_2ND: 0.6
NEG_THRESH_TRAIN_3RD: 0.7
NMS_THRESH_TEST: 0.4
NMS_THRESH_TEST_1ST: 0.4
NMS_THRESH_TEST_2ND: 0.4
NMS_THRESH_TEST_3RD: 0.5
POS_FRAC_TRAIN: 0.25
POS_THRESH_TRAIN: 0.5
POS_THRESH_TRAIN_2ND: 0.6
POS_THRESH_TRAIN_3RD: 0.7
SCORE_THRESH_TEST: 0.5
USE_DIFF_THRESH: true
RPN:
BATCH_SIZE_TRAIN: 256
NEG_THRESH_TRAIN: 0.3
NMS_THRESH: 0.7
POST_NMS_TOPN_TEST: 300
POST_NMS_TOPN_TRAIN: 2000
POS_FRAC_TRAIN: 0.5
POS_THRESH_TRAIN: 0.7
PRE_NMS_TOPN_TEST: 6000
PRE_NMS_TOPN_TRAIN: 12000
TRANSFORMER:
DIM_MODEL: 512
DROPOUT: 0.0
ENCODER_LAYERS: 1
KERNEL_SIZE_1ST:
- &id001
- 1
- 1
- &id002
- 3
- 3
KERNEL_SIZE_2ND:
- *id001
- *id002
KERNEL_SIZE_3RD:
- *id001
- *id002
NAMES_1ST:
- scale1
- scale2
NAMES_2ND:
- scale1
- scale2
NAMES_3RD:
- scale1
- scale2
N_HEAD: 8
USE_DIFF_SCALE: true
USE_GLOBAL_SHORTCUT: true
USE_LOCAL_SHORTCUT: true
USE_MASK_1ST: false
USE_MASK_2ND: true
USE_MASK_3RD: true
USE_OUTPUT_LAYER: false
USE_PATCH2VEC: true
USE_FEATURE_MASK: true
OUTPUT_DIR: ./output
SEED: 1
SOLVER:
BASE_LR: 0.003
CLIP_GRADIENTS: 10.0
GAMMA: 0.1
LR_DECAY_MILESTONES:
- 10
- 14
LW_RCNN_CLS_1ST: 1
LW_RCNN_CLS_2ND: 1
LW_RCNN_CLS_3RD: 1
LW_RCNN_REG_1ST: 10
LW_RCNN_REG_2ND: 10
LW_RCNN_REG_3RD: 10
LW_RCNN_REID_2ND: 0.5
LW_RCNN_REID_3RD: 0.5
LW_RCNN_SOFTMAX_2ND: 0.5
LW_RCNN_SOFTMAX_3RD: 0.5
LW_RPN_CLS: 1
LW_RPN_REG: 1
MAX_EPOCHS: 13
SGD_MOMENTUM: 0.9
WEIGHT_DECAY: 0.0005
TF_BOARD: true

View File

@ -0,0 +1,29 @@
import torch
import torchvision
from collections import OrderedDict
# featmap_names (List[str]): the names of the feature maps that will be used
# for the pooling.
# output_size (List[Tuple[int, int]] or List[int]): output size for the pooled region
# sampling_ratio (int): sampling ratio for ROIAlign
# canonical_scale (int, optional): canonical_scale for LevelMapper
# canonical_level (int, optional): canonical_level for LevelMapper
# in order: the feature map names to pool, the pooled output size, and the sampling ratio
roi = torchvision.ops.MultiScaleRoIAlign(['feat1', 'feat3'], 5, 2)
i = OrderedDict()
# build mock feature maps
i['feat1'] = torch.rand(1, 5, 64, 64)
# this feature won't be used in the pooling
i['feat2'] = torch.rand(1, 5, 32, 32)
i['feat3'] = torch.rand(1, 5, 16, 16)
# create random boxes (converted to x1, y1, x2, y2)
boxes = torch.rand(6, 4) * 256; boxes[:, 2:] += boxes[:, :2]
# original image size, before computing the feature maps
image_sizes = [(512, 512)]
output = roi(i, [boxes], image_sizes)
print(output.shape)
#print(output)
# 6 boxes, 5 channels, and a 5x5 pooled output (output_size=5 above)
# torch.Size([6, 5, 5, 5])

154
Code/Python/train.py Normal file
View File

@ -0,0 +1,154 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import argparse
import datetime
import os.path as osp
import time
import torch
import torch.utils.data
from datasets import build_test_loader, build_train_loader
from defaults import get_default_cfg
from engine import evaluate_performance, train_one_epoch
from models.coat import COAT
from utils.utils import mkdir, resume_from_ckpt, save_on_master, set_random_seed
from loss.softmax_loss import SoftmaxLoss
def main(args):
cfg = get_default_cfg()
if args.cfg_file:
cfg.merge_from_file(args.cfg_file)
cfg.merge_from_list(args.opts)
cfg.freeze()
device = torch.device(cfg.DEVICE)
if cfg.SEED >= 0:
set_random_seed(cfg.SEED)
print("Creating model...")
model = COAT(cfg)
model.to(device)
print("Loading data...")
train_loader = build_train_loader(cfg)
gallery_loader, query_loader = build_test_loader(cfg)
softmax_criterion_s2 = None
softmax_criterion_s3 = None
if cfg.MODEL.LOSS.USE_SOFTMAX:
softmax_criterion_s2 = SoftmaxLoss(cfg)
softmax_criterion_s3 = SoftmaxLoss(cfg)
softmax_criterion_s2.to(device)
softmax_criterion_s3.to(device)
if args.eval:
assert args.ckpt, "--ckpt must be specified when --eval enabled"
resume_from_ckpt(args.ckpt, model)
evaluate_performance(
model,
gallery_loader,
query_loader,
device,
use_gt=cfg.EVAL_USE_GT,
use_cache=cfg.EVAL_USE_CACHE,
use_cbgm=cfg.EVAL_USE_CBGM,
gallery_size=cfg.EVAL_GALLERY_SIZE,
)
exit(0)
params = [p for p in model.parameters() if p.requires_grad]
if cfg.MODEL.LOSS.USE_SOFTMAX:
params_softmax_s2 = [p for p in softmax_criterion_s2.parameters() if p.requires_grad]
params_softmax_s3 = [p for p in softmax_criterion_s3.parameters() if p.requires_grad]
params.extend(params_softmax_s2)
params.extend(params_softmax_s3)
optimizer = torch.optim.SGD(
params,
lr=cfg.SOLVER.BASE_LR,
momentum=cfg.SOLVER.SGD_MOMENTUM,
weight_decay=cfg.SOLVER.WEIGHT_DECAY,
)
lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(
optimizer, milestones=cfg.SOLVER.LR_DECAY_MILESTONES, gamma=cfg.SOLVER.GAMMA
)
start_epoch = 0
if args.resume:
assert args.ckpt, "--ckpt must be specified when --resume enabled"
start_epoch = resume_from_ckpt(args.ckpt, model, optimizer, lr_scheduler) + 1
print("Creating output folder...")
output_dir = cfg.OUTPUT_DIR
mkdir(output_dir)
path = osp.join(output_dir, "config.yaml")
with open(path, "w") as f:
f.write(cfg.dump())
print(f"Full config is saved to {path}")
tfboard = None
if cfg.TF_BOARD:
from torch.utils.tensorboard import SummaryWriter
tf_log_path = osp.join(output_dir, "tf_log")
mkdir(tf_log_path)
tfboard = SummaryWriter(log_dir=tf_log_path)
print(f"TensorBoard files are saved to {tf_log_path}")
print("Start training...")
start_time = time.time()
for epoch in range(start_epoch, cfg.SOLVER.MAX_EPOCHS):
train_one_epoch(cfg, model, optimizer, train_loader, device, epoch, tfboard, softmax_criterion_s2, softmax_criterion_s3)
lr_scheduler.step()
# only save the last three checkpoints
if epoch >= cfg.SOLVER.MAX_EPOCHS - 3:
save_on_master(
{
"model": model.state_dict(),
"optimizer": optimizer.state_dict(),
"lr_scheduler": lr_scheduler.state_dict(),
"epoch": epoch,
},
osp.join(output_dir, f"epoch_{epoch}.pth"),
)
# evaluate the current checkpoint
evaluate_performance(
model,
gallery_loader,
query_loader,
device,
use_gt=cfg.EVAL_USE_GT,
use_cache=cfg.EVAL_USE_CACHE,
use_cbgm=cfg.EVAL_USE_CBGM,
gallery_size=cfg.EVAL_GALLERY_SIZE,
)
if tfboard:
tfboard.close()
total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print(f"Total training time {total_time_str}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Train a person search network.")
parser.add_argument("--cfg", dest="cfg_file", help="Path to configuration file.")
parser.add_argument(
"--eval", action="store_true", help="Evaluate the performance of a given checkpoint."
)
parser.add_argument(
"--resume", action="store_true", help="Resume from the specified checkpoint."
)
parser.add_argument("--ckpt", help="Path to checkpoint to resume or evaluate.")
parser.add_argument(
"opts", nargs=argparse.REMAINDER, help="Modify config options using the command-line"
)
args = parser.parse_args()
main(args)

150
Code/Python/utils/km.py Normal file
View File

@ -0,0 +1,150 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import random
import numpy as np
zero_threshold = 0.00000001
class KMNode(object):
def __init__(self, id, exception=0, match=None, visit=False):
self.id = id
self.exception = exception
self.match = match
self.visit = visit
class KuhnMunkres(object):
def __init__(self):
self.matrix = None
self.x_nodes = []
self.y_nodes = []
self.minz = float("inf")
self.x_length = 0
self.y_length = 0
self.index_x = 0
self.index_y = 1
def __del__(self):
pass
def set_matrix(self, x_y_values):
xs = set()
ys = set()
for x, y, value in x_y_values:
xs.add(x)
ys.add(y)
if len(xs) < len(ys):
self.index_x = 0
self.index_y = 1
else:
self.index_x = 1
self.index_y = 0
xs, ys = ys, xs
x_dic = {x: i for i, x in enumerate(xs)}
y_dic = {y: j for j, y in enumerate(ys)}
self.x_nodes = [KMNode(x) for x in xs]
self.y_nodes = [KMNode(y) for y in ys]
self.x_length = len(xs)
self.y_length = len(ys)
self.matrix = np.zeros((self.x_length, self.y_length))
for row in x_y_values:
x = row[self.index_x]
y = row[self.index_y]
value = row[2]
x_index = x_dic[x]
y_index = y_dic[y]
self.matrix[x_index, y_index] = value
for i in range(self.x_length):
self.x_nodes[i].exception = max(self.matrix[i, :])
def km(self):
for i in range(self.x_length):
while True:
self.minz = float("inf")
self.set_false(self.x_nodes)
self.set_false(self.y_nodes)
if self.dfs(i):
break
self.change_exception(self.x_nodes, -self.minz)
self.change_exception(self.y_nodes, self.minz)
def dfs(self, i):
x_node = self.x_nodes[i]
x_node.visit = True
for j in range(self.y_length):
y_node = self.y_nodes[j]
if not y_node.visit:
t = x_node.exception + y_node.exception - self.matrix[i][j]
if abs(t) < zero_threshold:
y_node.visit = True
if y_node.match is None or self.dfs(y_node.match):
x_node.match = j
y_node.match = i
return True
else:
if t >= zero_threshold:
self.minz = min(self.minz, t)
return False
def set_false(self, nodes):
for node in nodes:
node.visit = False
def change_exception(self, nodes, change):
for node in nodes:
if node.visit:
node.exception += change
def get_connect_result(self):
ret = []
for i in range(self.x_length):
x_node = self.x_nodes[i]
j = x_node.match
y_node = self.y_nodes[j]
x_id = x_node.id
y_id = y_node.id
value = self.matrix[i][j]
if self.index_x == 1 and self.index_y == 0:
x_id, y_id = y_id, x_id
ret.append((x_id, y_id, value))
return ret
def get_max_value_result(self):
ret = -100
for i in range(self.x_length):
j = self.x_nodes[i].match
ret = max(ret, self.matrix[i][j])
return ret
def run_kuhn_munkres(x_y_values):
process = KuhnMunkres()
process.set_matrix(x_y_values)
process.km()
return process.get_connect_result(), process.get_max_value_result()
def test():
values = []
random.seed(0)
for i in range(500):
for j in range(1000):
value = random.random()
values.append((i, j, value))
return run_kuhn_munkres(values)
if __name__ == "__main__":
values = [(1, 1, 3), (1, 3, 4), (2, 1, 2), (2, 2, 1), (2, 3, 3), (3, 2, 4), (3, 3, 5)]
print(run_kuhn_munkres(values))

325
Code/Python/utils/mask.py Normal file
View File

@ -0,0 +1,325 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import random
import torch
class exchange_token:
def __init__(self):
pass
def __call__(self, features, mask_box):
b, hw, c = features.size()
assert hw == 14*14
new_idx, mask_x1, mask_x2, mask_y1, mask_y2 = mask_box
features = features.view(b, 14, 14, c)
features[:, mask_x1 : mask_x2, mask_y1 : mask_y2, :] = features[new_idx, mask_x1 : mask_x2, mask_y1 : mask_y2, :]
features = features.view(b, hw, c)
return features
class jigsaw_token:
def __init__(self, shift=5, group=2, begin=1):
self.shift = shift
self.group = group
self.begin = begin
def __call__(self, features):
batchsize = features.size(0)
dim = features.size(2)
num_tokens = features.size(1)
if num_tokens == 196:
self.group = 2
elif num_tokens == 25:
self.group = 5
else:
raise Exception("Jigsaw - Unwanted number of tokens")
# Shift Operation
feature_random = torch.cat([features[:, self.begin-1+self.shift:, :], features[:, self.begin-1:self.begin-1+self.shift, :]], dim=1)
x = feature_random
# Patch Shuffle Operation
try:
x = x.view(batchsize, self.group, -1, dim)
except:
raise Exception("Jigsaw - Unwanted number of groups")
x = torch.transpose(x, 1, 2).contiguous()
x = x.view(batchsize, -1, dim)
return x
class get_mask_box:
def __init__(self, shape='stripe', mask_size=2, mode='random_direct'):
self.shape = shape
self.mask_size = mask_size
self.mode = mode
def __call__(self, features):
# Stripe mask
if self.shape == 'stripe':
if self.mode == 'horizontal':
mask_box = self.hstripe(features, self.mask_size)
elif self.mode == 'vertical':
mask_box = self.vstripe(features, self.mask_size)
elif self.mode == 'random_direction':
if random.random() < 0.5:
mask_box = self.hstripe(features, self.mask_size)
else:
mask_box = self.vstripe(features, self.mask_size)
else:
raise Exception("Unknown stripe mask mode name")
# Square mask
elif self.shape == 'square':
if self.mode == 'random_size':
self.mask_size = 4 if random.random() < 0.5 else 5
mask_box = self.square(features, self.mask_size)
# Random stripe/square mask
elif self.shape == 'random':
random_num = random.random()
if random_num < 0.25:
mask_box = self.hstripe(features, 2)
elif random_num < 0.5 and random_num >= 0.25:
mask_box = self.vstripe(features, 2)
elif random_num < 0.75 and random_num >= 0.5:
mask_box = self.square(features, 4)
else:
mask_box = self.square(features, 5)
else:
raise Exception("Unknown mask shape name")
return mask_box
def hstripe(self, features, mask_size):
"""
"""
# horizontal stripe
mask_x1 = 0
mask_x2 = features.shape[2]
y1_max = features.shape[3] - mask_size
mask_y1 = torch.randint(y1_max, (1,))
mask_y2 = mask_y1 + mask_size
new_idx = torch.randperm(features.shape[0])
mask_box = (new_idx, mask_x1, mask_x2, mask_y1, mask_y2)
return mask_box
def vstripe(self, features, mask_size):
"""
"""
# vertical stripe
mask_y1 = 0
mask_y2 = features.shape[3]
x1_max = features.shape[2] - mask_size
mask_x1 = torch.randint(x1_max, (1,))
mask_x2 = mask_x1 + mask_size
new_idx = torch.randperm(features.shape[0])
mask_box = (new_idx, mask_x1, mask_x2, mask_y1, mask_y2)
return mask_box
def square(self, features, mask_size):
"""
"""
# square
x1_max = features.shape[2] - mask_size
y1_max = features.shape[3] - mask_size
mask_x1 = torch.randint(x1_max, (1,))
mask_y1 = torch.randint(y1_max, (1,))
mask_x2 = mask_x1 + mask_size
mask_y2 = mask_y1 + mask_size
new_idx = torch.randperm(features.shape[0])
mask_box = (new_idx, mask_x1, mask_x2, mask_y1, mask_y2)
return mask_box
class exchange_patch:
def __init__(self, shape='stripe', mask_size=2, mode='random_direct'):
self.shape = shape
self.mask_size = mask_size
self.mode = mode
def __call__(self, features):
# Stripe mask
if self.shape == 'stripe':
if self.mode == 'horizontal':
features = self.xpatch_hstripe(features, self.mask_size)
elif self.mode == 'vertical':
features = self.xpatch_vstripe(features, self.mask_size)
elif self.mode == 'random_direction':
if random.random() < 0.5:
features = self.xpatch_hstripe(features, self.mask_size)
else:
features = self.xpatch_vstripe(features, self.mask_size)
else:
raise Exception("Unknown stripe mask mode name")
# Square mask
elif self.shape == 'square':
if self.mode == 'random_size':
self.mask_size = 4 if random.random() < 0.5 else 5
features = self.xpatch_square(features, self.mask_size)
# Random stripe/square mask
elif self.shape == 'random':
random_num = random.random()
if random_num < 0.25:
features = self.xpatch_hstripe(features, 2)
elif random_num < 0.5 and random_num >= 0.25:
features = self.xpatch_vstripe(features, 2)
elif random_num < 0.75 and random_num >= 0.5:
features = self.xpatch_square(features, 4)
else:
features = self.xpatch_square(features, 5)
else:
raise Exception("Unknown mask shape name")
return features
def xpatch_hstripe(self, features, mask_size):
"""
"""
# horizontal stripe
y1_max = features.shape[3] - mask_size
num_masks = 1
for i in range(num_masks):
mask_y1 = torch.randint(y1_max, (1,))
mask_y2 = mask_y1 + mask_size
new_idx = torch.randperm(features.shape[0])
features[:, :, :, mask_y1 : mask_y2] = features[new_idx, :, :, mask_y1 : mask_y2]
return features
def xpatch_vstripe(self, features, mask_size):
"""
"""
# vertical stripe
x1_max = features.shape[2] - mask_size
num_masks = 1
for i in range(num_masks):
mask_x1 = torch.randint(x1_max, (1,))
mask_x2 = mask_x1 + mask_size
new_idx = torch.randperm(features.shape[0])
features[:, :, mask_x1 : mask_x2, :] = features[new_idx, :, mask_x1 : mask_x2, :]
return features
def xpatch_square(self, features, mask_size):
"""
"""
# square
x1_max = features.shape[2] - mask_size
y1_max = features.shape[3] - mask_size
num_masks = 1
for i in range(num_masks):
mask_x1 = torch.randint(x1_max, (1,))
mask_y1 = torch.randint(y1_max, (1,))
mask_x2 = mask_x1 + mask_size
mask_y2 = mask_y1 + mask_size
new_idx = torch.randperm(features.shape[0])
features[:, :, mask_x1 : mask_x2, mask_y1 : mask_y2] = features[new_idx, :, mask_x1 : mask_x2, mask_y1 : mask_y2]
return features
class cutout_patch:
def __init__(self, mask_size=2):
self.mask_size = mask_size
def __call__(self, features):
if random.random() < 0.5:
y1_max = features.shape[3] - self.mask_size
num_masks = 1
for i in range(num_masks):
mask_y1 = torch.randint(y1_max, (features.shape[0],))
mask_y2 = mask_y1 + self.mask_size
for k in range(features.shape[0]):
features[k, :, :, mask_y1[k] : mask_y2[k]] = 0
else:
x1_max = features.shape[3] - self.mask_size
num_masks = 1
for i in range(num_masks):
mask_x1 = torch.randint(x1_max, (features.shape[0],))
mask_x2 = mask_x1 + self.mask_size
for k in range(features.shape[0]):
features[k, :, mask_x1[k] : mask_x2[k], :] = 0
return features
class erase_patch:
def __init__(self, mask_size=2):
self.mask_size = mask_size
def __call__(self, features):
std, mean = torch.std_mean(features.detach())
dim = features.shape[1]
if random.random() < 0.5:
y1_max = features.shape[3] - self.mask_size
num_masks = 1
for i in range(num_masks):
mask_y1 = torch.randint(y1_max, (features.shape[0],))
mask_y2 = mask_y1 + self.mask_size
for k in range(features.shape[0]):
features[k, :, :, mask_y1[k] : mask_y2[k]] = torch.normal(mean.repeat(dim,14,2), std.repeat(dim,14,2))
else:
x1_max = features.shape[3] - self.mask_size
num_masks = 1
for i in range(num_masks):
mask_x1 = torch.randint(x1_max, (features.shape[0],))
mask_x2 = mask_x1 + self.mask_size
for k in range(features.shape[0]):
features[k, :, mask_x1[k] : mask_x2[k], :] = torch.normal(mean.repeat(dim,2,14), std.repeat(dim,2,14))
return features
class mixup_patch:
def __init__(self, mask_size=2):
self.mask_size = mask_size
def __call__(self, features):
lam = random.uniform(0, 1)
if random.random() < 0.5:
y1_max = features.shape[3] - self.mask_size
num_masks = 1
for i in range(num_masks):
mask_y1 = torch.randint(y1_max, (1,))
mask_y2 = mask_y1 + self.mask_size
new_idx = torch.randperm(features.shape[0])
features[:, :, :, mask_y1 : mask_y2] = lam*features[:, :, :, mask_y1 : mask_y2] + (1-lam)*features[new_idx, :, :, mask_y1 : mask_y2]
else:
x1_max = features.shape[2] - self.mask_size
num_masks = 1
for i in range(num_masks):
mask_x1 = torch.randint(x1_max, (1,))
mask_x2 = mask_x1 + self.mask_size
new_idx = torch.randperm(features.shape[0])
features[:, :, mask_x1 : mask_x2, :] = lam*features[:, :, mask_x1 : mask_x2, :] + (1-lam)*features[new_idx, :, mask_x1 : mask_x2, :]
return features
class jigsaw_patch:
def __init__(self, shift=5, group=2):
self.shift = shift
self.group = group
def __call__(self, features):
batchsize = features.size(0)
dim = features.size(1)
features = features.view(batchsize, dim, -1)
# Shift Operation
feature_random = torch.cat([features[:, :, self.shift:], features[:, :, :self.shift]], dim=2)
x = feature_random
# Patch Shuffle Operation
try:
x = x.view(batchsize, dim, self.group, -1)
except:
x = torch.cat([x, x[:, -2:-1, :]], dim=1)
x = x.view(batchsize, self.group, -1, dim)
x = torch.transpose(x, 2, 3).contiguous()
x = x.view(batchsize, dim, -1)
x = x.view(batchsize, dim, 14, 14)
return x

144
Code/Python/utils/transforms.py Normal file
View File

@ -0,0 +1,144 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import random
import math
import torch
import numpy as np
from copy import deepcopy
from torchvision.transforms import functional as F
def mixup_data(images, alpha=0.8):
if alpha > 0. and alpha < 1.:
lam = random.uniform(alpha, 1)
else:
lam = 1.
batch_size = len(images)
min_x = 9999
min_y = 9999
for i in range(batch_size):
min_x = min(min_x, images[i].shape[1])
min_y = min(min_y, images[i].shape[2])
shuffle_images = deepcopy(images)
random.shuffle(shuffle_images)
mixed_images = deepcopy(images)
for i in range(batch_size):
mixed_images[i][:, :min_x, :min_y] = lam * images[i][:, :min_x, :min_y] + (1 - lam) * shuffle_images[i][:, :min_x, :min_y]
return mixed_images
class Compose:
def __init__(self, transforms):
self.transforms = transforms
def __call__(self, image, target):
for t in self.transforms:
image, target = t(image, target)
return image, target
class RandomHorizontalFlip:
def __init__(self, prob=0.5):
self.prob = prob
def __call__(self, image, target):
if random.random() < self.prob:
height, width = image.shape[-2:]
image = image.flip(-1)
bbox = target["boxes"]
bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
target["boxes"] = bbox
return image, target
class Cutout(object):
"""Randomly mask out one or more patches from an image.
https://github.com/uoguelph-mlrg/Cutout/blob/master/util/cutout.py
Args:
n_holes (int): Number of patches to cut out of each image.
length (int): The length (in pixels) of each square patch.
"""
def __init__(self, n_holes=2, length=100):
self.n_holes = n_holes
self.length = length
def __call__(self, img, target):
"""
Args:
img (Tensor): Tensor image of size (C, H, W).
Returns:
Tensor: Image with n_holes of dimension length x length cut out of it.
"""
h = img.size(1)
w = img.size(2)
mask = np.ones((h, w), np.float32)
for n in range(self.n_holes):
y = np.random.randint(h)
x = np.random.randint(w)
y1 = np.clip(y - self.length // 2, 0, h)
y2 = np.clip(y + self.length // 2, 0, h)
x1 = np.clip(x - self.length // 2, 0, w)
x2 = np.clip(x + self.length // 2, 0, w)
mask[y1: y2, x1: x2] = 0.
mask = torch.from_numpy(mask)
mask = mask.expand_as(img)
img = img * mask
return img, target
class RandomErasing(object):
'''
https://github.com/zhunzhong07/CamStyle/blob/master/reid/utils/data/transforms.py
'''
def __init__(self, EPSILON=0.5, mean=[0.485, 0.456, 0.406]):
self.EPSILON = EPSILON
self.mean = mean
def __call__(self, img, target):
if random.uniform(0, 1) > self.EPSILON:
return img, target
for attempt in range(100):
area = img.size()[1] * img.size()[2]
target_area = random.uniform(0.02, 0.2) * area
aspect_ratio = random.uniform(0.3, 3)
h = int(round(math.sqrt(target_area * aspect_ratio)))
w = int(round(math.sqrt(target_area / aspect_ratio)))
if w <= img.size()[2] and h <= img.size()[1]:
x1 = random.randint(0, img.size()[1] - h)
y1 = random.randint(0, img.size()[2] - w)
img[0, x1:x1 + h, y1:y1 + w] = self.mean[0]
img[1, x1:x1 + h, y1:y1 + w] = self.mean[1]
img[2, x1:x1 + h, y1:y1 + w] = self.mean[2]
return img, target
return img, target
class ToTensor:
def __call__(self, image, target):
# convert [0, 255] to [0, 1]
image = F.to_tensor(image)
return image, target
def build_transforms(cfg, is_train):
transforms = []
transforms.append(ToTensor())
if is_train:
transforms.append(RandomHorizontalFlip())
if cfg.INPUT.IMAGE_CUTOUT:
transforms.append(Cutout())
if cfg.INPUT.IMAGE_ERASE:
transforms.append(RandomErasing())
return Compose(transforms)

436
Code/Python/utils/utils.py Normal file
View File

@ -0,0 +1,436 @@
# This file is part of COAT, and is distributed under the
# OSI-approved BSD 3-Clause License. See top-level LICENSE file or
# https://github.com/Kitware/COAT/blob/master/LICENSE for details.
import datetime
import errno
import json
import os
import os.path as osp
import pickle
import random
import time
from collections import defaultdict, deque
import numpy as np
import torch
import torch.distributed as dist
from tabulate import tabulate
# -------------------------------------------------------- #
# Logger #
# -------------------------------------------------------- #
class SmoothedValue(object):
"""
Track a series of values and provide access to smoothed values over a
window or the global series average.
"""
def __init__(self, window_size=20, fmt=None):
if fmt is None:
fmt = "{median:.4f} ({global_avg:.4f})"
self.deque = deque(maxlen=window_size)
self.total = 0.0
self.count = 0
self.fmt = fmt
def update(self, value, n=1):
self.deque.append(value)
self.count += n
self.total += value * n
def synchronize_between_processes(self):
"""
Warning: does not synchronize the deque!
"""
if not is_dist_avail_and_initialized():
return
t = torch.tensor([self.count, self.total], dtype=torch.float64, device="cuda")
dist.barrier()
dist.all_reduce(t)
t = t.tolist()
self.count = int(t[0])
self.total = t[1]
@property
def median(self):
d = torch.tensor(list(self.deque))
return d.median().item()
@property
def avg(self):
d = torch.tensor(list(self.deque), dtype=torch.float32)
return d.mean().item()
@property
def global_avg(self):
return self.total / self.count
@property
def max(self):
return max(self.deque)
@property
def value(self):
return self.deque[-1]
def __str__(self):
return self.fmt.format(
median=self.median,
avg=self.avg,
global_avg=self.global_avg,
max=self.max,
value=self.value,
)
class MetricLogger(object):
def __init__(self, delimiter="\t"):
self.meters = defaultdict(SmoothedValue)
self.delimiter = delimiter
def update(self, **kwargs):
for k, v in kwargs.items():
if isinstance(v, torch.Tensor):
v = v.item()
assert isinstance(v, (float, int))
self.meters[k].update(v)
def __getattr__(self, attr):
if attr in self.meters:
return self.meters[attr]
if attr in self.__dict__:
return self.__dict__[attr]
raise AttributeError("'{}' object has no attribute '{}'".format(type(self).__name__, attr))
def __str__(self):
loss_str = []
for name, meter in self.meters.items():
loss_str.append("{}: {}".format(name, str(meter)))
return self.delimiter.join(loss_str)
def synchronize_between_processes(self):
for meter in self.meters.values():
meter.synchronize_between_processes()
def add_meter(self, name, meter):
self.meters[name] = meter
def log_every(self, iterable, print_freq, header=None):
i = 0
if not header:
header = ""
start_time = time.time()
end = time.time()
iter_time = SmoothedValue(fmt="{avg:.4f}")
data_time = SmoothedValue(fmt="{avg:.4f}")
space_fmt = ":" + str(len(str(len(iterable)))) + "d"
if torch.cuda.is_available():
log_msg = self.delimiter.join(
[
header,
"[{0" + space_fmt + "}/{1}]",
"eta: {eta}",
"{meters}",
"time: {time}",
"data: {data}",
"max mem: {memory:.0f}",
]
)
else:
log_msg = self.delimiter.join(
[
header,
"[{0" + space_fmt + "}/{1}]",
"eta: {eta}",
"{meters}",
"time: {time}",
"data: {data}",
]
)
MB = 1024.0 * 1024.0
for obj in iterable:
data_time.update(time.time() - end)
yield obj
iter_time.update(time.time() - end)
if i % print_freq == 0 or i == len(iterable) - 1:
eta_seconds = iter_time.global_avg * (len(iterable) - i)
eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
if torch.cuda.is_available():
print(
log_msg.format(
i,
len(iterable),
eta=eta_string,
meters=str(self),
time=str(iter_time),
data=str(data_time),
memory=torch.cuda.max_memory_allocated() / MB,
)
)
else:
print(
log_msg.format(
i,
len(iterable),
eta=eta_string,
meters=str(self),
time=str(iter_time),
data=str(data_time),
)
)
i += 1
end = time.time()
total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print(
"{} Total time: {} ({:.4f} s / it)".format(
header, total_time_str, total_time / len(iterable)
)
)
# -------------------------------------------------------- #
# Distributed training #
# -------------------------------------------------------- #
def all_gather(data):
"""
Run all_gather on arbitrary picklable data (not necessarily tensors)
Args:
data: any picklable object
Returns:
list[data]: list of data gathered from each rank
"""
world_size = get_world_size()
if world_size == 1:
return [data]
# serialized to a Tensor
buffer = pickle.dumps(data)
storage = torch.ByteStorage.from_buffer(buffer)
tensor = torch.ByteTensor(storage).to("cuda")
# obtain Tensor size of each rank
local_size = torch.tensor([tensor.numel()], device="cuda")
size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)]
dist.all_gather(size_list, local_size)
size_list = [int(size.item()) for size in size_list]
max_size = max(size_list)
# receiving Tensor from all ranks
# we pad the tensor because torch all_gather does not support
# gathering tensors of different shapes
tensor_list = []
for _ in size_list:
tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda"))
if local_size != max_size:
padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda")
tensor = torch.cat((tensor, padding), dim=0)
dist.all_gather(tensor_list, tensor)
data_list = []
for size, tensor in zip(size_list, tensor_list):
buffer = tensor.cpu().numpy().tobytes()[:size]
data_list.append(pickle.loads(buffer))
return data_list
def reduce_dict(input_dict, average=True):
"""
Reduce the values in the dictionary from all processes so that all processes
have the averaged results. Returns a dict with the same fields as
input_dict, after reduction.
Args:
input_dict (dict): all the values will be reduced
average (bool): whether to do average or sum
"""
world_size = get_world_size()
if world_size < 2:
return input_dict
with torch.no_grad():
names = []
values = []
# sort the keys so that they are consistent across processes
for k in sorted(input_dict.keys()):
names.append(k)
values.append(input_dict[k])
values = torch.stack(values, dim=0)
dist.all_reduce(values)
if average:
values /= world_size
reduced_dict = {k: v for k, v in zip(names, values)}
return reduced_dict
def setup_for_distributed(is_master):
"""
This function disables printing when not in master process
"""
import builtins as __builtin__
builtin_print = __builtin__.print
def print(*args, **kwargs):
force = kwargs.pop("force", False)
if is_master or force:
builtin_print(*args, **kwargs)
__builtin__.print = print
def is_dist_avail_and_initialized():
if not dist.is_available():
return False
if not dist.is_initialized():
return False
return True
def get_world_size():
if not is_dist_avail_and_initialized():
return 1
return dist.get_world_size()
def get_rank():
if not is_dist_avail_and_initialized():
return 0
return dist.get_rank()
def is_main_process():
return get_rank() == 0
def save_on_master(*args, **kwargs):
if is_main_process():
torch.save(*args, **kwargs)
def init_distributed_mode(args):
if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
args.rank = int(os.environ["RANK"])
args.world_size = int(os.environ["WORLD_SIZE"])
args.gpu = int(os.environ["LOCAL_RANK"])
elif "SLURM_PROCID" in os.environ:
args.rank = int(os.environ["SLURM_PROCID"])
args.gpu = args.rank % torch.cuda.device_count()
else:
print("Not using distributed mode")
args.distributed = False
return
args.distributed = True
torch.cuda.set_device(args.gpu)
args.dist_backend = "nccl"
print("| distributed init (rank {}): {}".format(args.rank, args.dist_url), flush=True)
torch.distributed.init_process_group(
backend=args.dist_backend,
init_method=args.dist_url,
world_size=args.world_size,
rank=args.rank,
)
torch.distributed.barrier()
setup_for_distributed(args.rank == 0)
# -------------------------------------------------------- #
# File operation #
# -------------------------------------------------------- #
def filename(path):
return osp.splitext(osp.basename(path))[0]
def mkdir(path):
try:
os.makedirs(path)
except OSError as e:
if e.errno != errno.EEXIST:
raise
def read_json(fpath):
with open(fpath, "r") as f:
obj = json.load(f)
return obj
def write_json(obj, fpath):
mkdir(osp.dirname(fpath))
# drop numpy arrays, which are not JSON-serializable (avoids mutating the dict while iterating it)
_obj = {k: v for k, v in obj.items() if not isinstance(v, np.ndarray)}
with open(fpath, "w") as f:
json.dump(_obj, f, indent=4, separators=(",", ": "))
def symlink(src, dst, overwrite=True, **kwargs):
if os.path.lexists(dst) and overwrite:
os.remove(dst)
os.symlink(src, dst, **kwargs)
# -------------------------------------------------------- #
# Misc #
# -------------------------------------------------------- #
def create_small_table(small_dict):
"""
Create a small table using the keys of small_dict as headers. This is only
suitable for small dictionaries.
Args:
small_dict (dict): a result dictionary of only a few items.
Returns:
str: the table as a string.
"""
keys, values = tuple(zip(*small_dict.items()))
table = tabulate(
[values],
headers=keys,
tablefmt="pipe",
floatfmt=".3f",
stralign="center",
numalign="center",
)
return table
def warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor):
def f(x):
if x >= warmup_iters:
return 1
alpha = float(x) / warmup_iters
return warmup_factor * (1 - alpha) + alpha
return torch.optim.lr_scheduler.LambdaLR(optimizer, f)
def resume_from_ckpt(ckpt_path, model, optimizer=None, lr_scheduler=None):
ckpt = torch.load(ckpt_path)
model.load_state_dict(ckpt["model"], strict=False)
if optimizer is not None:
optimizer.load_state_dict(ckpt["optimizer"])
if lr_scheduler is not None:
lr_scheduler.load_state_dict(ckpt["lr_scheduler"])
print(f"loaded checkpoint {ckpt_path}")
print(f"model was trained for {ckpt['epoch']} epochs")
return ckpt["epoch"]
def set_random_seed(seed):
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
random.seed(seed)
np.random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)

25115
Code/Python/vis/results.json Normal file

File diff suppressed because it is too large

View File

@ -0,0 +1,632 @@
Cascade Transformers for End-to-End Person Search
Rui Yu1,2,* Dawei Du1, Rodney LaLonde1, Daniel Davila1,
Christopher Funk1, Anthony Hoogs1, Brian Clipp1
1Kitware, Inc., NY & NC, USA, 2Pennsylvania State University, PA, USA
https://github.com/Kitware/COAT
Abstract

The goal of person search is to localize a target person from a gallery set of scene images, which is extremely challenging due to large scale variations, pose/viewpoint changes, and occlusions. In this paper, we propose the Cascade Occluded Attention Transformer (COAT) for end-to-end person search. Our three-stage cascade design focuses on detecting people in the first stage, while later stages simultaneously and progressively refine the representation for person detection and re-identification. At each stage the occluded attention transformer applies tighter intersection over union thresholds, forcing the network to learn coarse-to-fine pose/scale invariant features. Meanwhile, we calculate each detection's occluded attention to differentiate a person's tokens from other people or the background. In this way, we simulate the effect of other objects occluding a person of interest at the token-level. Through comprehensive experiments, we demonstrate the benefits of our method by achieving state-of-the-art performance on two benchmark datasets.

Figure 1. Main challenges of person search, e.g., scale variations, pose/viewpoint change, and occlusion. The boxes with the same color represent the same ID. For better viewing, we highlight the small-scale individuals at bottom-right corners.

1. Introduction

Person search aims to localize a particular target person from a gallery set of scene images, which is an extremely difficult fine-grained recognition and retrieval problem. A person search system must both generalize to separate people from the background, and specialize to discriminate identities from each other.

In real-world applications, person search systems must detect people across a wide variety of image sizes and re-identify people despite large changes in resolution and viewpoint. To this end, modern person search methods, either two-step or one-step (i.e., end-to-end), consist of reliable person detection and discriminative feature embedding learning. Two-step methods [5, 10, 13, 18, 30, 38] conduct person re-identification (ReID) on cropped person patches found by a separate object detector. In contrast, end-to-end methods [2, 20, 32-34, 39] jointly solve the detection and ReID sub-problems in a more efficient, multi-task learning framework. However, as shown in Figure 1, they still suffer from three main challenges:

• There is a conflict in feature learning between person detection and ReID. Person detection aims to learn features which generalize across people to distinguish people from the background, while ReID aims to learn features which do not generalize across people but distinguish people from each other. Previous works follow a "ReID first" [33] or "detection first" [20] principle to give priority to one subtask over the other. However, it is difficult to balance the importance of the two subtasks in different situations when relying on either strategy.

• Significant scale or pose variations increase identity recognition difficulty; see Figure 1. Feature pyramids or deformable convolutions [14, 18, 33] have been used to solve scale, pose or viewpoint misalignment in feature learning. However, simple feature fusion strategies may introduce additional background noise in feature embeddings, resulting in inferior ReID performance.

• Occlusions with background objects or other people make appearance representations more ambiguous, as shown in Figure 1. The majority of previous person search methods focus on holistic appearance modeling of people by anchor-based [20] or anchor-free [33] methods. Despite the improvement of person search accuracy, these are prone to fail with complex occlusions.

*Rui Yu's work on this paper was done when he was a summer intern at Kitware.
Figure 2. Our proposed cascade framework for person search.

To deal with the aforementioned challenges, as shown in Figure 2, we propose a new Cascade Occluded Attention Transformer (COAT) for end-to-end person search. First, inspired by Cascade R-CNN [1], we refine the person detection and ReID quality by a coarse-to-fine strategy in three stages. The first stage focuses on discriminating people from background (detection), but crucially, is not trained to discriminate people from each other (ReID) with a ReID loss. Later stages include both detection and ReID losses. This design improves detection performance (see Section 4.3), as the first stage can generalize across people without having to discriminate between persons. Subsequent stages simultaneously refine the previous stage's bounding box estimates and identity embeddings (see Table 1). Second, we apply multi-scale convolutional transformers at each stage of the cascade. The base feature maps are split into multiple slices corresponding to different scales. The transformer attention encourages the network to learn embeddings on the discriminative parts of each person for each scale, helping overcome the problem of region misalignment. Third, we augment the transformer's learned feature embeddings with an occluded attention mechanism that synthetically mimics occlusions. We randomly mix up partial tokens of instances in a mini-batch, and learn the cross-attention among the token bank for each instance. This trains the transformer to differentiate tokens from other foreground and background detection proposals. Experiments on the challenging CUHK-SYSU [32] and PRW [38] datasets show that the proposed network outperforms state-of-the-art end-to-end methods, especially in terms of the cross-camera setting on the PRW dataset.

Contributions. 1) To our knowledge, we propose the first cascaded transformer-based framework for end-to-end person search. The progressive design effectively balances person detection and ReID, and the transformers help attend to scale and pose/viewpoint changes. 2) We improve performance with an occluded attention mechanism in the multi-scale transformer that generates discriminative fine-grained person representations in occluded scenes. 3) Extensive experiments on two datasets show the superiority of our method over existing person search approaches.

2. Related Work

Person Search. Person search methods can be roughly grouped into two-step and end-to-end approaches. Two-step methods [5, 10, 13, 18, 30] combine a person detector (e.g., Faster R-CNN [27], RetinaNet [22], or FCOS [28]) and a person ReID model sequentially. For example, Wang et al. [30] build a person search system including an identity-guided query detector followed by a detection results adapted ReID model. On the other hand, end-to-end methods [6, 20, 32, 33] integrate the two models into a unified framework for better efficiency. Chen et al. [6] share detection and ReID features but decompose them in the polar coordinate system in terms of radial norm and angle. Yan et al. [33] propose the first anchor-free person search method, which tackles the misalignment issues at different levels (i.e., scale, region, and task). Recently, Li and Miao [20] share the stem representations of person detection and ReID, but solve the two subtasks by two-head networks sequentially. In contrast, inspired by Cascade R-CNN [1], our method follows an end-to-end strategy that balances person detection and ReID progressively via a three-stage cascade framework.

Visual Transformers in Person ReID. Based on the original transformer model [29] for natural language processing, Vision Transformer (ViT) [11] is the first pure transformer network to extract features for image recognition. CNNs are widely adopted to extract base features and so reduce the scale of training data required for a pure transformer approach. Luo et al. [25] develop a spatial transformer network to sample an affined image from the holistic image to match a partial image. Li et al. [19] propose the part-aware transformer to perform occluded person Re-ID through diverse part discovery. Zhang et al. [36] introduce a transformer-based feature calibration to integrate large scale features as a global prior. Our paper is the first in the literature to perform person search with multi-scale convolutional transformers. It not only learns discriminative ReID features but also distinguishes people from the background in a cascade pipeline.

Attention Mechanism in Transformers. Attention mechanisms play a crucial role in transformers. Recently, many ViT variants [3, 16, 21, 35] have computed discriminative features using a variety of token attention methods. Chen et al. [3] propose a dual-branch transformer with a cross-attention based token fusion module to combine two scales of patch features. Lin et al. [21] alternate attention in the feature map patches for local representation and attention on the single channel feature map for global representation. Yuan et al. [35] introduce the tokens-to-token process to gradually tokenize images to tokens while preserving structural information. He et al. [16] rearrange the transformer layers' patch embeddings via shift and patch shuffle operations. Unlike these methods that rearrange features within an instance, the proposed occluded attention module considers token cross-attention between either positive or negative instances from the mini-batch. Thus our method learns to differentiate tokens from other objects by synthetically mimicking occlusions.
3. Cascade Transformers

As discussed in previous works [14, 20, 33], person detection and person ReID have conflicting goals. Hence, it is difficult to jointly learn discriminative unified representations for the two subtasks on top of the backbone network. Similar to Cascade R-CNN [1], we decompose feature learning into sequential steps in T stages of multi-scale transformers. That is, each head in the transformer refines the detection and ReID accuracy of the predicted objects stage-by-stage. Thus we can progressively learn coarse-to-fine unified embeddings.

Nevertheless, in the case of occlusions by other people, objects or the background, the network may suffer from noisy representations of the target identity. To this end, we develop the occluded attention mechanism in the multi-scale transformer to learn an occlusion-robust representation. As shown in Figure 2, our network is based on the Faster R-CNN object detector backbone with a Region Proposal Network (RPN). However, we extend the framework by introducing a cascade of occluded attention transformers (see Figure 3), trained in an end-to-end manner.

3.1. Coarse-to-fine Embeddings

After extracting the 1024-dim stem feature maps from the ResNet-50 [15] backbone, we use the RPN to generate region proposals. For each proposal, the RoI-Align operation [27] is applied to pool an h × w region as the base feature maps F, where h and w denote the height and width of the feature maps respectively, and c is the number of channels.

Afterwards, we employ a multi-stage cascade structure to learn embeddings for person detection and ReID. The output proposals of the RPN are used at the first stage for re-sampling both positive and negative instances. The box outputs of the first stage are then adopted as the inputs of the second stage, and so forth. At each stage t, the pooled feature map of each proposal is sent to the convolutional transformers for that stage. To obtain high-quality instances, the cascade structure imposes progressively more strict stage-wise constraints. In practice, we increase the intersection-over-union (IoU) thresholds u_t gradually. The transformers at each stage are followed by three heads, like NAE [6], including a person/background classifier, a box regressor, and a ReID discriminator. Note that we remove the ReID discriminator at the first stage to focus the network on first detecting all people in the scene before refinement.

3.2. Occluded Attention Transformer

In the following, we describe the details of the occluded attention transformers, shown in Figure 3.

Tokenization. Given the base feature map F ∈ R^{h×w×c}, we tokenize it for transformer input at different scales. For multi-scale representation, we first split F channel-wise into n slices, F̄ ∈ R^{h×w×ĉ}, where ĉ = c/n, to deal with each scale of tokens. In contrast to ViT [11] with its tokenization of large image patches, our transformer leverages a series of convolutional layers to generate tokens based on the sliced feature maps F̄. Our method benefits from CNNs' inductive biases and learns the CNNs' local spatial context. The different scales are realized by different sizes of convolutional kernels.

After converting the sliced feature maps F̄ ∈ R^{h×w×ĉ} to a new token map F̂ ∈ R^{ĥ×ŵ×ĉ} by one convolutional layer, we flatten it into token inputs x ∈ R^{ĥŵ×ĉ} for one instance. The number of tokens is calculated as

N = \frac{\hat{h}\hat{w}}{d^2} = \frac{\lfloor \frac{h+2p-k}{s}+1 \rfloor \times \lfloor \frac{w+2p-k}{s}+1 \rfloor}{d^2},   (1)

where k, s, and p are the kernel size, stride, and padding of the convolutional layer, and d is the patch size of each token.
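To make the tokenization step concrete, here is a minimal PyTorch sketch, an illustration under the feature sizes reported later in Sec. 4.2 rather than the authors' exact module: a pooled 14 × 14 base feature map is split channel-wise into n slices, and each slice is tokenized by its own convolution, reproducing the token count of Eq. (1) with d = 1.

```python
import torch
import torch.nn as nn

# Illustrative multi-scale convolutional tokenization (not the official implementation).
# Split the c-channel base feature map into n channel slices and tokenize each slice with
# its own convolution; with d = 1, the token count matches Eq. (1).
c, n, h, w = 1024, 2, 14, 14
c_hat = c // n
kernel_padding = [(1, 0), (3, 1)]          # (kernel size k, padding p) per scale, stride s = 1
convs = nn.ModuleList(
    [nn.Conv2d(c_hat, c_hat, kernel_size=k, stride=1, padding=p) for k, p in kernel_padding]
)

feat = torch.rand(8, c, h, w)              # 8 pooled proposals (base feature maps F)
for i, (feat_slice, conv) in enumerate(zip(feat.chunk(n, dim=1), convs)):
    token_map = conv(feat_slice)           # (8, c_hat, h_hat, w_hat); here h_hat = w_hat = 14
    tokens = token_map.flatten(2).transpose(1, 2)   # (8, N, c_hat) with N = h_hat * w_hat = 196
    print(i, tokens.shape)
```

Both kernel choices keep the 14 × 14 spatial size, so every scale contributes the same number of tokens and the slices can later be concatenated back to the original channel dimension.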
Figure 3. Architecture of the occluded attention transformer. The randomly selected regions for token exchange are the same within one mini-batch. For clarity, we only show three instances in a mini-batch and occluded attention for one scale. Best viewed in color.

Occluded attention. To handle occlusions, we introduce a new token-level occluded attention mechanism into the transformers to mimic occlusions found in real applications. Specifically, we first collect the tokens from all the detection proposals in a mini-batch, denoted as the token bank X = {x_1, x_2, ..., x_P}, where P is the number of detection proposals in the batch at each stage. Since the proposals from the RPN contain positive and negative examples, the token bank is composed of both foreground person parts and background objects. We exchange tokens among the token bank, based on the same exchange index set M for all the instances. As shown in Figure 3, the exchanged tokens correspond to a semantically consistent but randomly selected sub-region in the token maps. Each exchanged token is denoted as

\mathbf{x}_i = \{\mathbf{x}_i(\bar{\mathcal{M}}), \mathbf{x}_j(\mathcal{M})\}, \quad i = 1, 2, \cdots, P,\ i \neq j,   (2)

where x_j denotes another sample randomly selected from the token bank, and M̄ indicates the complementary set of M, i.e., x_i = x_i(M̄) ∪ x_i(M). Given the exchanged token bank X, we compute the multi-scale self-attention among them, as shown in Figure 3. For each scale of tokens, we run the two sub-layers of the transformer (i.e., Multi-head Self-Attention (MSA) and a Feed Forward Network (FFN), as in [29]). Specifically, the mixed tokens x are transformed into query matrices Q ∈ R^{ĥŵ×ĉ}, key matrices K ∈ R^{ĥŵ×ĉ}, and value matrices V ∈ R^{ĥŵ×ĉ} by three individual fully connected (FC) layers. We can further compute multi-head attention and the weighted sum over all values as

\mathrm{MSA}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{\hat{c}/m}}\right)\mathbf{V},   (3)

where we split queries, keys, and values into m heads for more diversity, i.e., from a tensor of size ĥŵ × ĉ to m pieces of size ĥŵ × ĉ/m. The independent attention outputs are then concatenated and linearly transformed into the expected dimension. Following the MSA module, the FFN module nonlinearly transforms each token to enhance its representation ability. The enhanced feature is then projected to the size of ĥ × ŵ × ĉ as the transformer's output.

Finally, we concatenate the outputs of the n scales of transformers back to the original spatial size ĥ × ŵ × c. Note that there is a residual connection outside each transformer. After the global average pooling (GAP) layer, the extracted features are fed into subsequent heads for box regression, person/background classification, and person re-identification.

To deal with occlusion and misalignment in person ReID, He et al. [16] shuffle person part patch embeddings and re-group them, each group of which contains several random patch embeddings of an individual instance. In contrast, our method first exchanges partial tokens of instances in a mini-batch, and then calculates the occluded attention based on the mixed tokens. Thus the final embeddings partially cover the target person with extracted features from a different person or a background object, yielding more occlusion-robust representations.
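A minimal sketch of the token mix-up in Eq. (2) followed by standard multi-head self-attention as in Eq. (3) is shown below. The tensor sizes, the size of the exchange set M, and the use of torch.nn.MultiheadAttention are illustrative assumptions; the repository's own region-exchange logic lives in the exchange_token/exchange_patch classes of Code/Python/utils/mask.py shown earlier in this commit.

```python
import torch
import torch.nn as nn

# Illustrative occluded attention: exchange the same token sub-region M across the proposals of a
# mini-batch (Eq. 2), then apply ordinary multi-head self-attention (Eq. 3). Sizes are assumptions.
P, N, c_hat, m = 6, 196, 512, 8            # proposals, tokens per proposal, token dim, heads
x = torch.rand(P, N, c_hat)                # token bank X = {x_1, ..., x_P}

M = torch.randperm(N)[:32]                 # shared exchange index set M (a random sub-region)
j = torch.randperm(P)                      # donor proposal per instance (a real impl. enforces i != j)
x_mixed = x.clone()
x_mixed[:, M, :] = x[j][:, M, :]           # x_i keeps its tokens outside M, takes x_j's tokens inside M

msa = nn.MultiheadAttention(embed_dim=c_hat, num_heads=m)
seq = x_mixed.transpose(0, 1)              # nn.MultiheadAttention expects (tokens, batch, dim)
out, _ = msa(seq, seq, seq)                # softmax(Q K^T / sqrt(c_hat / m)) V per head
print(out.transpose(0, 1).shape)           # torch.Size([6, 196, 512])
```

At inference the mix-up step is dropped (see Sec. 3.3 below), so the same module reduces to plain self-attention over each proposal's own tokens.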
Relations to concurrent works. There are two concurrent ViT-based works [3, 16] in different fields. Chen et al. [3] develop a multi-scale transformer including two separate branches with small-patch and large-patch tokens. The two-scale representation is learned based on a cross-attention token fusion module, where a single token for each branch is treated as a query to exchange information with other branches. Instead, we leverage a series of convolutional layers with different kernels to generate multi-scale tokens. Finally, we concatenate the enhanced feature maps corresponding to each scale in a specific slice of the transformers.

3.3. Training and Inference

In the training phase, the proposed network is trained end-to-end for person detection and person ReID. The person detection loss L_det consists of regression and classification loss terms. The former is a Smooth-L1 loss of regression vectors between ground-truth and foreground boxes, while the latter computes the cross-entropy loss of predicted classification probabilities of the estimated boxes.

To supervise person ReID, we use the classic non-parametric Online Instance Matching (OIM) loss [32] L_OIM, which maintains a lookup table (LUT) and a circular queue (CQ) to store the features of all the labeled and unlabeled identities from recent mini-batches, respectively. We can efficiently compute the cosine similarities between the samples in the mini-batch and the LUT/CQ for embedding learning. Moreover, inspired by [24], we add another cross-entropy loss function L_ID to predict the identities of people for additional ID-wise supervision. In summary, we train the proposed COAT by using the following multi-stage loss:

\mathcal{L} = \sum_{t=1}^{T}\left[\mathcal{L}_\text{det}^t + \mathbb{I}(t>1)\left(\lambda_\text{OIM}\mathcal{L}_\text{OIM}^t + \lambda_\text{ID}\mathcal{L}_\text{ID}^t\right)\right],   (4)

where t ∈ {1, 2, ..., T} denotes the index of the stage and T is the number of cascade stages. The coefficients λ_OIM and λ_ID are used to balance the OIM and ID loss terms. I(t > 1) is the indicator function expressing that we do not consider the person ReID loss at the first stage.

In the inference phase, we replace the occluded attention mechanism with the classic self-attention module in the transformers by removing the token mix-up step in Figure 3. We output the detection bounding boxes with corresponding embeddings at the last stage and use NMS operations to remove redundant boxes.
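As a minimal sketch of how the multi-stage objective in Eq. (4) combines the per-stage terms, the snippet below uses hypothetical loss values; in the released code the OIM/ID criteria are built in Code/Python/train.py shown earlier and combined inside the training loop.

```python
# Illustrative combination of the multi-stage loss in Eq. (4); the per-stage values are hypothetical.
lambda_oim, lambda_id = 0.5, 0.5           # loss weights reported in Sec. 4.2
T = 3                                      # number of cascade stages

def total_loss(det_losses, oim_losses, id_losses):
    """Per-stage scalars, index 0 corresponding to stage t = 1."""
    loss = 0.0
    for t in range(1, T + 1):
        loss += det_losses[t - 1]
        if t > 1:                          # indicator I(t > 1): no ReID supervision at the first stage
            loss += lambda_oim * oim_losses[t - 1] + lambda_id * id_losses[t - 1]
    return loss

print(total_loss([1.0, 0.8, 0.6], [0.0, 0.9, 0.7], [0.0, 1.1, 0.9]))   # 4.2
```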
4. Experiments

All experiments are conducted in PyTorch with one NVIDIA A100 GPU. For a fair comparison with prior works, we use the first four residual blocks (conv1-conv4) of ResNet-50 [15] as the backbone and resize the images to 900 × 1500 as the input.

4.1. Datasets

We evaluate our method on two publicly available datasets. The CUHK-SYSU dataset [32] annotates 8,432 identities and 96,143 bounding boxes in 18,184 images. The default gallery size is set as 100 for the 2,900 testing identities in 6,978 images. The PRW dataset [38] collects data from 6 cameras, including 932 identities and 43,110 pedestrian boxes in 11,816 frames. PRW is divided into a training set with 5,704 frames and 482 identities and a testing set with 2,057 query persons in 6,112 frames.

We follow the standard evaluation metrics for person search [32, 38]. A box is matched if the overlap ratio between the predicted and ground-truth boxes with the same identity is more than 0.5 IoU. For person detection, we use Recall and Average Precision (AP). For person ReID, we use the mean Average Precision (mAP) and cumulative matching characteristics (top-1) scores.

Table 1. Comparison with different cascade variants of COAT on PRW [38]: (a) without transformers (the same ResNet conv5 block as [6, 20, 32] at each stage), (b) with the proposed transformers at each stage, and (c) different IoU thresholds. Columns are Stage1 / Stage2 / Stage3 / mAP / top-1, and "†" marks heads trained without the ReID loss. [Per-row values were scrambled in this text version and are omitted; the selected configuration reaches 53.3 mAP and 87.4 top-1.]

4.2. Implementation Details

Similar to Cascade R-CNN [1], we use T = 3 stages in the cascade framework, where 128 detection proposals are extracted per image for each stage. Following [6, 20, 32], the scale of the base feature map is set as h = w = 14. The index set of exchanged tokens in Eq. (2) is set as a random horizontal or vertical stripe in the token map. The number of heads in Eq. (3) is set as m = 8. The IoU thresholds u_t for detection are set as 0.5, 0.6, 0.7 for the three sequential stages. The kernel sizes of the convolutional layers used to compute the tokens are set as k = {1 × 1, 3 × 3} for the three stages, with corresponding strides s = {1, 1} and paddings p = {0, 1} to guarantee the same size of output feature maps. Due to the small feature size, we set d = 1 in Eq. (1), i.e., conducting pixel-wise tokenization. The CQ size of the OIM loss is set as 5,000 and 500 for CUHK-SYSU and PRW respectively. The loss weights in Eq. (4) are set as λ_OIM = λ_ID = 0.5.

We use the SGD optimizer with momentum 0.9 to train our model for 15 epochs, with an initial learning rate warming up to 0.003 during the first epoch and being reduced by a factor of 10 at the 10-th epoch. In the inference phase, we use NMS with a 0.4/0.4/0.5 threshold to remove redundant boxes detected by the first/second/third stage.
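This training recipe corresponds to the optimizer and scheduler constructed in Code/Python/train.py earlier in this commit; the condensed sketch below uses the hyper-parameters reported here, with a stand-in model and an assumed weight decay value.

```python
import torch

# Condensed sketch of the recipe above: SGD with momentum 0.9, base LR 0.003 after a first-epoch
# warm-up, decayed by 10x at epoch 10, 15 epochs total. The Linear model and the weight decay value
# are placeholders; train.py reads the real values from the yacs config.
model = torch.nn.Linear(256, 2)                       # stand-in for COAT(cfg)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.003, momentum=0.9, weight_decay=5e-4)   # weight decay assumed
lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

for epoch in range(15):
    # ... one training epoch; the linear warm-up of utils.warmup_lr_scheduler (shown earlier)
    # is presumably stepped per iteration during the first epoch ...
    lr_scheduler.step()
```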
4.3. Ablation Studies

We conduct a series of ablation studies on the PRW dataset [38] to analyze our design decisions.

Contribution of cascade structure. To show the cascade structure's contribution, we evaluate coarse-to-fine constraints in terms of the number of cascade stages and the IoU thresholds.

First, we replace the occluded attention transformer with the same ResNet block (conv5) as [6, 20, 32] at each stage. As shown in Table 1(a), the cascade structure significantly improves person search accuracy when adding more stages, i.e., from 43.5% to 49.5% in mAP and 81.2% to 85.5% in top-1 accuracy. As we introduce the proposed occluded attention transformer, the performance is further improved (see Table 1(b)), which demonstrates our occluded attention transformer's effectiveness.

Moreover, the increasing IoU thresholds u_t in the cascade design improve person search performance. As reported in Table 1(c), equal IoU thresholds at each stage
produce lower accuracy than our method. For example, more false positives or false negatives are introduced if u_t = 0.5 or u_t = 0.7. In contrast, our method can select detection proposals with increasing quality for better performance, i.e., generating more candidate detections in the first stage and only highly-overlapping detections by the third stage.

Figure 4. Detection and person search results for COAT and two compared methods on PRW, both with (person ReID only) and without (person search) ground-truth detection boxes being provided; the marked entries denote the oracle results using the ground-truth boxes.

Table 2. Comparison of our attention mechanism with other related modules, where "Tokens" and "Feats" denote token-level enhanced attention and feature-level augmentation respectively. [Rows: Vanilla Attention, CrossViT [3], Jigsaw [16], Batch DropBlock [7], Cutout [8], Mixup [37], and Occluded Attention (ours); the row-to-value alignment was lost in this text version, but vanilla attention obtains 52.9 mAP / 86.4 top-1 versus 53.3 / 87.4 for occluded attention, as discussed below.]

Relations between person detection and ReID. As discussed in the introduction, there is a conflict between person detection and ReID. In Figure 4, we explore the relationship between the two subtasks. We compare our COAT with the state-of-the-art NAE [6] and SeqNet [20], which share the same Faster R-CNN detector. We also construct three COAT variants with different stages, i.e., COAT-t, where t = 1, 2, 3 denotes the number of stages. When looking solely at person ReID rather than person search, i.e., when ground-truth detection boxes are given, COAT outperforms the two competitors with an over 3% gain in top-1 and over 6% gain in mAP. Meanwhile, ours is slightly worse in person detection accuracy than SeqNet [20]. These results indicate that our improved ReID performance comes from coarse-to-fine person embeddings rather than more precise detections.

We also observe that the person detection performance is improved from t = 1 to t = 2 but then slightly reduced with t = 3. We speculate that this is because, when trading off person detection and ReID, our method focuses more on learning discriminative embeddings for person ReID, while slightly sacrificing detection performance.

In addition, from Table 1(a)(b), note that the COAT variant with a ReID loss in the first stage performs worse than our method (50.3 vs. 53.3 for mAP). Simultaneously learning a discriminative representation for person detection and ReID is extremely difficult. Therefore, we remove the ReID discriminator head at Stage 1 in the COAT method (cf. Figure 2). If we continue removing the ReID discriminator at the second stage, the ReID performance is reduced by 2% in mAP. This shows the ReID embeddings do benefit from multi-stage refinement.

Comparison with other attention mechanisms. To verify the effectiveness of our occluded attention mechanism in the transformer, we apply the recently proposed Jigsaw [16] and CrossViT [3] in our method. As discussed in Section 3.2, Jigsaw Patch [16] is used to generate robust ReID features by shift and patch shuffle operations. CrossViT [3] is a dual-branch transformer to learn multi-scale features. It is also noteworthy that they leverage large image patches as the input for pure vision transformers. We also evaluate the COAT variant with a vanilla self-attention mechanism, denoted as vanilla attention.

In Table 2, CrossViT [3] focuses on exchanging information between two scales of tokens, achieving inferior mAP. The results show that Jigsaw [16] also hurts mAP. We speculate that either exchanging query information in CrossViT [3] or the shift and shuffle feature operations in Jigsaw [16] are ambiguous on such small 14 × 14 base feature maps, limiting their power for person search. In contrast, our occluded attention is designed for small feature maps and obtains better performance, i.e., a 0.4% gain in mAP and a 1.0% gain in top-1 score. Instead of sharing class tokens in different branches or shuffling channels of feature maps based on an individual instance, we effectively learn context information across different instances in a mini-batch, and differentiate the person from other people or the background to synthetically mimic occlusion.

Comparison with feature augmentation. Our method is related to previous augmentation strategies for person ReID, such as Batch DropBlock Network [7], Cutout [8] and Mixup [37]. As presented in Table 2, person search accuracy is not improved by using feature augmentation, i.e., simply augmenting feature patches with zeros.

Influence of occluded attention mechanism. As discussed in Section 3.2, we use occluded attention to calculate discriminative person embeddings. We evaluate the use of occluded attention (token mixup) and different scales in Table 3. Note that the top-1 score is improved from 86.4 to 87.4 with occluded attention and that multiple convolutional kernels for tokenization improve performance. Note that multiple convolutions do not increase the model size, since the feature maps F are channel-wise sliced for each scale.
Figure 5. Qualitative examples of top-1 person search results of NAE [6], SeqNet [20] and COAT on PRW (1st row) and CUHK-SYSU
(2nd and 3rd rows) datasets, where small query, failure and correct cases are highlighted in yellow, red and green boxes respectively.
Method Token Mixup Scales mAP top-1
Vanilla Attention {1 × 1} 52.1 85.3
Vanilla Attention {3 × 3} 53.1 86.0
Vanilla Attention {1 × 1, 3 × 3} 52.9 86.4
Occluded Attention {1 × 1} 52.2 86.5
Occluded Attention {3 × 3} 52.5 86.4
Occluded Attention {1 × 1, 3 × 3} 53.3 87.4
Table 3. Comparison of our attention mechanisms and other re- (a) End-to-end models (b) Two-step models
lated modules. “Scales” denotes the used convolutional kernels.
Figure 6. Comparison with (a) end-to-end models and (b) two-step
4.4. Comparison with State-of-the-art models on CUHK-SYSU with different gallery sizes.
As presented in Table 4, we compare our COAT with state-of-the-art algorithms, including both two-step methods [5, 10, 13, 18, 30, 38] and end-to-end methods [2, 4, 6, 9, 12, 17, 20, 23, 26, 31-34, 39], on two datasets.

Results on CUHK-SYSU. With the gallery size of 100, our method achieves the best 94.2% mAP and a comparable 94.7% top-1 score compared to the best two-step method TCTS [30], which uses explicitly trained bounding box and ReID feature refinement modules. Among end-to-end methods, our method performs better than the state-of-the-art AlignPS+ [33] with a multi-scale anchor-free representation [28], SeqNet [20] with two-stage refinement, and AGWF [12] with part-classification-based sub-networks. The results indicate the effectiveness of our cascaded multi-scale representation. Using the post-processing operation Context Bipartite Graph Matching (CBGM) [20], both mAP and top-1 scores of our method can be further improved slightly. For a comprehensive evaluation, as shown in Figure 6, we compare mAP scores of competitive methods as we increase the gallery size. Since it is challenging to consider more distracting people in the gallery set, the performance of all compared methods is reduced as the gallery size increases. However, our method consistently outperforms all the end-to-end methods and the majority of two-step methods. When the gallery size is larger than 1,000, our method performs slightly worse than the two-step TCTS [30].

Results on PRW. Although the PRW dataset [38] is more challenging, with less training data but a larger gallery size than the CUHK-SYSU dataset [32], the results show a similar trend. Our method achieves comparable performance to AGWF [12] and a significant gain of 6.7% mAP and 4.0% top-1 score over SeqNet [20]. DMRNet [14] and AlignPS [33] leverage stronger object detectors, such as RetinaNet [22] and FCOS [28], than the Faster R-CNN [27] in our method, but still achieve inferior performance. Further, we compare performance on PRW's multi-view gallery (see the group marked by † in Table 4). Our method outperforms existing methods in terms of both mAP and top-1 scores with a clear margin. We attribute this to our cascaded transformer structure, which generates more discriminative ReID features, especially in the cross-camera setting with significant pose/viewpoint changes.
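CBGM itself is described in [20]; the rough idea is to match co-appearing people between the query image and a gallery image and use that matching to refine the ranking scores. The snippet below is only a simplified illustration of bipartite matching for this purpose; the function, the score weighting, and all names are assumptions, not the actual CBGM procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def context_rescore(query_feats: np.ndarray, gallery_feats: np.ndarray,
                    gallery_scores: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Simplified context matching between query-image persons and gallery detections.

    query_feats:    (Q, D) L2-normalized embeddings of all people in the query image.
    gallery_feats:  (G, D) embeddings of detections in one gallery image.
    gallery_scores: (G,) similarity of each gallery detection to the target person.
    Returns rescored gallery scores.
    """
    sim = query_feats @ gallery_feats.T          # (Q, G) cosine similarities
    row, col = linear_sum_assignment(-sim)       # bipartite assignment maximizing similarity
    context_bonus = np.zeros_like(gallery_scores)
    for q, g in zip(row, col):
        context_bonus[g] = sim[q, g]             # matched context support for detection g
    return gallery_scores + alpha * context_bonus

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
g = rng.normal(size=(5, 8)); g /= np.linalg.norm(g, axis=1, keepdims=True)
print(context_rescore(q, g, rng.random(5)))
```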
Table 4. Comparison with the state-of-the-art methods on CUHK-SYSU and PRW (mAP and top-1 for each dataset). † denotes the performance only evaluated on the multi-view gallery; bold indicates the highest score in the group. The compared two-step methods are DPM [38], MGTS [5], CLSA [18], RDLR [13], IGPN [10] and TCTS [30]; the compared end-to-end methods are OIM [32], IAN [31], NPSM [23], RCAA [2], CTXG [34], QEEPS [26], HOIM [4], APNet [39], BINet [9], NAE/NAE+ [6], DMRNet [14], PGS [17], AlignPS/AlignPS+ [33], SeqNet [20], AGWF [12] and COAT, plus the CBGM [20] variants of AlignPS, AlignPS+, SeqNet and COAT, and a † group (HOIM, NAE+, SeqNet, SeqNet+CBGM, AGWF, COAT, COAT+CBGM) evaluated on the multi-view gallery.

| Method | Params (M) | MACs (G) | FPS | mAP | top-1 |
| --- | --- | --- | --- | --- | --- |
| NAE [6] | 33.43 | 287.35 | 14.48 | 43.3 | 80.9 |
| AlignPS [33] | 42.18 | 189.98 | 16.39 | 45.9 | 81.9 |
| SeqNet [20] | 48.41 | 275.11 | 12.23 | 46.7 | 83.4 |
| COAT | 37.00 | 236.29 | 11.14 | 53.3 | 87.4 |

Table 5. Comparison of person search efficiency.
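The numbers in Table 5 come from the respective papers and released code. The basic measurement protocol (parameter count and end-to-end FPS on fixed-size inputs) can be sketched as below; the image size, warm-up count, and the stand-in detector are placeholders, not the exact benchmark settings, and MACs would additionally require an op counter such as thop or fvcore.

```python
import time
import torch
import torchvision

def benchmark(model: torch.nn.Module, image_size=(3, 900, 1500), runs: int = 20):
    """Return (number of parameters in millions, frames per second) for one model."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    images = [torch.randn(*image_size, device=device)]
    with torch.no_grad():
        for _ in range(3):                      # warm-up iterations
            model(images)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(images)
        if device == "cuda":
            torch.cuda.synchronize()
    return params_m, runs / (time.time() - start)

# Example with a torchvision detector as a stand-in for a person search model.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=False, pretrained_backbone=False)
print(benchmark(detector, runs=2))
```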
Qualitative results. Some example person search results on the two datasets are shown in Figure 5. Our method can deal with cases of slight/moderate occlusion and scale/pose variations, while other state-of-the-art methods such as SeqNet [20] and NAE [6] fail in these scenarios.

Efficiency comparison. We compare our efficiency with three representative end-to-end networks, including NAE [6], AlignPS [33] and SeqNet [20], which have publicly released source code. We evaluate the methods with the same scale of test images and on the same GPU. In Table 5, we compare the number of parameters, the multiply-accumulate operations (MACs), and the running speed in frames per second (FPS). Our method has lower computational complexity and slightly slower speed than the other compared methods, but achieves +6.6% and +4.0% gains in mAP and top-1 accuracy respectively. In contrast to [11, 16], we employ only one encoder layer in our transformers and use multi-scale convolutions to reduce the number of channels before tokenization, increasing COAT's efficiency.

5. Conclusion

We have developed a new Cascade Occluded Attention Transformer (COAT) for end-to-end person search. Notably, COAT learns a discriminative coarse-to-fine representation for both person detection and person ReID via a cascade transformer framework. Meanwhile, the occluded attention mechanism synthetically mimics occlusions from either foreground or background objects. COAT outperforms state-of-the-art methods, which we hope will inspire more research into transformer-based person search methods.

Ethical considerations. Like most technologies, person search methods may have societal benefits and negative impacts. How the technology is employed is critical. For example, person search can identify persons of interest to aid law enforcement and counter-terrorism operations. However, the technology should only be used in locations where an expectation of privacy is waived by entering those locations, such as public areas, airports, and private buildings with clear signage. These systems should not be employed without probable cause, or by unjust governments that seek to acquire ubiquitous knowledge of the movements of all of their citizens to enable persecution and repression.

For comparability, this research uses human subjects imagery collected in prior works. CUHK-SYSU [32] was collected from "street snaps" and "movie snapshots", while PRW [38] was collected with video cameras in a public area of a university campus. No mention is made in either paper of review by an ethical board (e.g., an Institutional Review Board), but these papers were published before this standard was established at CVPR or most major AI conferences. Our preference would be to work with ethically collected person search datasets, and we would welcome a public disclosure from the authors of their ethical compliance. We believe the community should focus resources on developing ethical person search datasets and phase out the use of legacy, unethically collected datasets.

Acknowledgement. This material is based upon work supported by the United States Air Force under Contract No. FA8650-19-C-6036. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.
References

[1] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: delving into high quality object detection. In CVPR, pages 6154-6162, 2018. 2, 3, 5
[2] Xiaojun Chang, Po-Yao Huang, Yi-Dong Shen, Xiaodan Liang, Yi Yang, and Alexander G. Hauptmann. RCAA: relational context-aware agents for person search. In ECCV, pages 86-102, 2018. 1, 7, 8
[3] Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. In ICCV, 2021. 2, 4, 6
[4] Di Chen, Shanshan Zhang, Wanli Ouyang, Jian Yang, and Bernt Schiele. Hierarchical online instance matching for person search. In AAAI, pages 10518-10525, 2020. 7, 8
[5] Di Chen, Shanshan Zhang, Wanli Ouyang, Jian Yang, and Ying Tai. Person search via a mask-guided two-stream CNN model. In ECCV, pages 764-781, 2018. 1, 2, 7, 8
[6] Di Chen, Shanshan Zhang, Jian Yang, and Bernt Schiele. Norm-aware embedding for efficient person search. In CVPR, pages 12612-12621, 2020. 2, 3, 5, 6, 7, 8
[7] Zuozhuo Dai, Mingqiang Chen, Xiaodong Gu, Siyu Zhu, and Ping Tan. Batch dropblock network for person re-identification and beyond. In ICCV, pages 3690-3700, 2019. 6
[8] Terrance Devries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017. 6
[9] Wenkai Dong, Zhaoxiang Zhang, Chunfeng Song, and Tieniu Tan. Bi-directional interaction network for person search. In CVPR, pages 2836-2845, 2020. 7, 8
[10] Wenkai Dong, Zhaoxiang Zhang, Chunfeng Song, and Tieniu Tan. Instance guided proposal network for person search. In CVPR, pages 2582-2591, 2020. 1, 2, 7, 8
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 2, 3, 8
[12] Byeong-Ju Han, Kuhyeun Ko, and Jae-Young Sim. End-to-end trainable trident person search network using adaptive gradient propagation. In ICCV, pages 925-933, 2021. 7, 8
[13] Chuchu Han, Jiacheng Ye, Yunshan Zhong, Xin Tan, Chi Zhang, Changxin Gao, and Nong Sang. Re-id driven localization refinement for person search. In ICCV, pages 9813-9822, 2019. 1, 2, 7, 8
[14] Chuchu Han, Zhedong Zheng, Changxin Gao, Nong Sang, and Yi Yang. Decoupled and memory-reinforced networks: Towards effective feature learning for one-step person search. In AAAI, pages 1505-1512, 2021. 1, 3, 7, 8
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016. 3, 5
[16] Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. Transreid: Transformer-based object re-identification. In ICCV, 2021. 2, 3, 4, 6, 8
[17] Hanjae Kim, Sunghun Joung, Ig-Jae Kim, and Kwanghoon Sohn. Prototype-guided saliency feature learning for person search. In CVPR, pages 4865-4874, 2021. 7, 8
[18] Xu Lan, Xiatian Zhu, and Shaogang Gong. Person search by multi-scale matching. In ECCV, pages 553-569, 2018. 1, 2, 7, 8
[19] Yulin Li, Jianfeng He, Tianzhu Zhang, Xiang Liu, Yongdong Zhang, and Feng Wu. Diverse part discovery: Occluded person re-identification with part-aware transformer. In CVPR, 2021. 2
[20] Zhengjia Li and Duoqian Miao. Sequential end-to-end network for efficient person search. In AAAI, pages 2011-2019, 2021. 1, 2, 3, 5, 6, 7, 8
[21] Hezheng Lin, Xing Cheng, Xiangyu Wu, Fan Yang, Dong Shen, Zhongyuan Wang, Qing Song, and Wei Yuan. CAT: cross attention in vision transformer. CoRR, abs/2106.05786, 2021. 2
[22] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2999-3007, 2017. 2, 7
[23] Hao Liu, Jiashi Feng, Zequn Jie, Jayashree Karlekar, Bo Zhao, Meibin Qi, Jianguo Jiang, and Shuicheng Yan. Neural person search machines. In ICCV, pages 493-501, 2017. 7, 8
[24] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In CVPRW, pages 1487-1495, 2019. 4
[25] Hao Luo, Wei Jiang, Xing Fan, and Chi Zhang. Stnreid: Deep convolutional networks with pairwise spatial transformer networks for partial person re-identification. IEEE TMM, 22(11):2905-2913, 2020. 2
[26] Bharti Munjal, Sikandar Amin, Federico Tombari, and Fabio Galasso. Query-guided end-to-end person search. In CVPR, pages 811-820, 2019. 7, 8
[27] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE TPAMI, 39(6):1137-1149, 2017. 2, 3, 7
[28] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: fully convolutional one-stage object detection. In ICCV, pages 9626-9635, 2019. 2, 7
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998-6008, 2017. 2, 3
[30] Cheng Wang, Bingpeng Ma, Hong Chang, Shiguang Shan, and Xilin Chen. TCTS: A task-consistent two-stage framework for person search. In CVPR, pages 11949-11958, 2020. 1, 2, 7, 8
[31] Jimin Xiao, Yanchun Xie, Tammam Tillo, Kaizhu Huang, Yunchao Wei, and Jiashi Feng. IAN: the individual aggregation network for person search. PR, 87:332-340, 2019. 7, 8
[32] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint detection and identification feature learning for person search. In CVPR, pages 3376-3385, 2017. 1, 2, 4, 5, 7, 8
[33] Yichao Yan, Jingpeng Li, Jie Qin, Song Bai, Shengcai Liao, Li Liu, Fan Zhu, and Ling Shao. Anchor-free person search. In CVPR, 2021. 1, 2, 3, 7, 8
[34] Yichao Yan, Qiang Zhang, Bingbing Ni, Wendong Zhang, Minghao Xu, and Xiaokang Yang. Learning context graph for person search. In CVPR, pages 2158-2167, 2019. 1, 7, 8
[35] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, pages 558-567, 2021. 2
[36] Guowen Zhang, Pingping Zhang, Jinqing Qi, and Huchuan Lu. Hat: Hierarchical aggregation transformers for person re-identification. In ACMMM, 2021. 2
[37] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018. 6
[38] Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yi Yang, and Qi Tian. Person re-identification in the wild. In CVPR, pages 3346-3355, 2017. 1, 2, 5, 7, 8
[39] Yingji Zhong, Xiaoyu Wang, and Shiliang Zhang. Robust partial matching for person search in the wild. In CVPR, pages 6826-6834, 2020. 1, 7, 8

35
README.md Normal file
View File

@ -0,0 +1,35 @@
## Cascade Transformers for End-to-End Person Search
### 1. Code Description
This repository hosts the source code of the paper `Cascade Transformers for End-to-End Person Search [CVPR 2022]`. In this work, we develop a novel Cascade Occluded Attention Transformer (COAT) model for end-to-end person search. The model outperforms the **state-of-the-art** methods by a large margin on the `PRW` benchmark and achieves state-of-the-art performance on the `CUHK-SYSU` dataset.
### 2. Environment Setup & Running
#### 2.1 Setting up the development environment
This project can be set up with either `anaconda` or `docker`.
Setup with `anaconda`:
```shell
# Commands to create and activate the anaconda environment
conda create -n COAT python=3.8.1
conda activate COAT
```
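After activating the environment and installing the dependencies from the provided environment file, a quick sanity check can confirm the setup. This snippet is not part of the repository; it only prints the installed versions and GPU availability.

```python
# Quick environment check: versions and GPU availability.
import torch
import torchvision

print("torch:", torch.__version__)            # this repo pins PyTorch 1.7.1
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```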
#### 2.2 Downloading the datasets
##### 2.2.1 PRW person re-identification dataset
- Publisher: University of Technology Sydney
- Download: https://hyper.ai/datasets/17890
- Original release page: http://zheng-lab.cecs.anu.edu.au/Project/project_prw.html (a quick check of the extracted layout is sketched below)
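If the download matches the commonly distributed PRW release, with a `frames/` folder of scene images and an `annotations/` folder of per-frame `.mat` files (an assumption about the archive layout, so verify against your copy), a quick check could look like this; the dataset path is a placeholder.

```python
from pathlib import Path

# Adjust to wherever the archive was extracted; the path below is a placeholder.
prw_root = Path("/data/PRW")

frames = sorted((prw_root / "frames").glob("*.jpg"))
annos = sorted((prw_root / "annotations").glob("*.mat"))
print(f"{len(frames)} frames, {len(annos)} annotation files")
assert len(frames) > 0, "No frames found - check the dataset path"
```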
##### 2.2.2 ...
### 3. Framework Overview
The paper uses a cascaded `Transformer`-based network to progressively refine ...; the core framework is illustrated below:
![image-20241003004438891](./Docs/Image/COAT-Framework.png)
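As a reading aid for the figure, the cascade idea is that each stage refines box regression and ReID embeddings from tokenized RoI features, with stricter IoU thresholds for training-label assignment at later stages, in the spirit of cascade detectors. The sketch below is only an illustration under these assumptions: the module names, thresholds, and shapes are made up for the example and do not come from the released implementation (re-pooling of RoI features from refined boxes is omitted).

```python
import torch
import torch.nn as nn

class CascadeStage(nn.Module):
    """One refinement stage: tokenized RoI features -> a box delta and a ReID embedding."""

    def __init__(self, channels: int = 256, emb_dim: int = 128):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=channels, nhead=8)
        self.box_head = nn.Linear(channels, 4)         # box refinement (deltas)
        self.reid_head = nn.Linear(channels, emb_dim)  # ReID embedding

    def forward(self, roi_tokens: torch.Tensor):
        # roi_tokens: (B, N, C); the encoder layer expects (N, B, C).
        x = self.encoder(roi_tokens.transpose(0, 1)).transpose(0, 1).mean(dim=1)
        return self.box_head(x), self.reid_head(x)

# Three stages trained with progressively tighter IoU thresholds for label assignment.
iou_thresholds = [0.5, 0.6, 0.7]
stages = nn.ModuleList(CascadeStage() for _ in iou_thresholds)

roi_tokens = torch.randn(8, 49, 256)  # 8 proposals, 7x7 token grid, 256 channels
for t, (stage, thr) in enumerate(zip(stages, iou_thresholds), start=1):
    deltas, embeddings = stage(roi_tokens)
    # In training, proposals would be re-matched to ground truth at IoU >= thr here.
    print(f"stage {t}: IoU thr {thr}, deltas {tuple(deltas.shape)}, emb {tuple(embeddings.shape)}")
```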