CodeBERT in Practice

UniXcoder

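Environment: Windows 11, Python 3.11

Project setup

I created a repository for this. It does not include the model weights (they are simply too large), but following the instructions should be enough to get everything running. So far I have only run it on Windows; on macOS or Linux some adjustments will be needed.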

https://gitee.com/thinkerhui/codebert

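Downloading the dataset

Note that the download commands below differ from the ones in the README, because wget on Windows behaves differently from wget on Linux. I found that on Windows nothing is saved unless the -O filepath argument is given, so it has to be added. One likely explanation is that in Windows PowerShell wget is an alias for Invoke-WebRequest, which does not write to a file unless told to.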

mkdir dataset
cd dataset
mkdir cosqa
cd cosqa
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/code_idx_map.txt -O code_idx_map.txt
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/cosqa-retrieval-dev-500.json -O cosqa-retrieval-dev-500.json
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/cosqa-retrieval-test-500.json -O cosqa-retrieval-test-500.json
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/cosqa-retrieval-train-19604.json -O cosqa-retrieval-train-19604.json
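Since wget behaves differently from shell to shell, the same download can also be sketched in Python using only the standard library. The helper names (plan_downloads, download_all) are illustrative and not part of the repository; the URLs and file names are the ones from the commands above.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/"
FILES = [
    "code_idx_map.txt",
    "cosqa-retrieval-dev-500.json",
    "cosqa-retrieval-test-500.json",
    "cosqa-retrieval-train-19604.json",
]


def plan_downloads(target_dir="dataset/cosqa"):
    """Return (url, local_path) pairs without touching the network."""
    target = Path(target_dir)
    return [(BASE_URL + name, target / name) for name in FILES]


def download_all(target_dir="dataset/cosqa"):
    """Create the target directory and fetch every file that is still missing."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    for url, path in plan_downloads(target_dir):
        if not path.exists():
            urlretrieve(url, str(path))


# Call download_all() from the project root to fetch the four files.
```

Because existing files are skipped, the function can be re-run safely after an interrupted download.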

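Running

The CodeBERT project ships with a complete run script that takes directories and other settings as command-line arguments.

For example, here is the command for a zero-shot test run. I suspect some of the arguments are not actually used: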

python run.py \
--output_dir saved_models/cosqa \
--model_name_or_path microsoft/unixcoder-base \
--do_zero_shot \
--do_test \
--test_data_file dataset/cosqa/cosqa-retrieval-test-500.json \
--codebase_file dataset/cosqa/code_idx_map.txt \
--num_train_epochs 10 \
--code_length 256 \
--nl_length 128 \
--train_batch_size 64 \
--eval_batch_size 64 \
--learning_rate 2e-5 \
--seed 123456

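Key parameters:
--output_dir is the path used during training/evaluation/testing. In practice, evaluation and testing do not write anything there; only training saves the best model it finds, and (non-zero-shot) evaluation/testing loads the fine-tuned model from it.

--model_name_or_path is the path to the model and must be adapted to your setup. For example, I put the downloaded pretrained model in a unixcoder-base folder next to run.py, so the value becomes unixcoder-base.

--test_data_file and --codebase_file must point to the dataset files downloaded above.

--num_train_epochs sets the number of training epochs.

--train_batch_size sets the size of each training batch. It cannot be chosen arbitrarily: it has to fit your GPU memory, otherwise training runs out of memory.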
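To make the flag list concrete, here is a minimal argparse sketch that accepts the flags used in this post. It is a reconstruction for illustration, with types inferred from the values shown in the commands, and not a copy of the project's run.py. The name build_parser and all default values are placeholders.

```python
import argparse


def build_parser():
    """Declare the flags used in this post (defaults are placeholders)."""
    parser = argparse.ArgumentParser(description="Sketch of the run flags")
    # Paths
    parser.add_argument("--output_dir", type=str, required=True)
    parser.add_argument("--model_name_or_path", type=str, default=None)
    parser.add_argument("--train_data_file", type=str, default=None)
    parser.add_argument("--eval_data_file", type=str, default=None)
    parser.add_argument("--test_data_file", type=str, default=None)
    parser.add_argument("--codebase_file", type=str, default=None)
    # Mode switches: present means True, absent means False
    for flag in ("--do_train", "--do_eval", "--do_test", "--do_zero_shot"):
        parser.add_argument(flag, action="store_true")
    # Sequence lengths and hyperparameters
    parser.add_argument("--code_length", type=int, default=256)
    parser.add_argument("--nl_length", type=int, default=128)
    parser.add_argument("--train_batch_size", type=int, default=4)
    parser.add_argument("--eval_batch_size", type=int, default=4)
    parser.add_argument("--num_train_epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--seed", type=int, default=42)
    return parser


args = build_parser().parse_args(
    ["--output_dir", "saved_models/cosqa", "--do_zero_shot", "--do_test"]
)
print(args.do_zero_shot, args.do_train)  # True False
```

The store_true switches explain why flags such as --do_test take no value on the command line.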

My project layout at run time: the files that actually run are mainly model.py and run.py.

The command above is, of course, for Linux. In a Windows terminal, the following works:

python run.py --output_dir saved_models\cosqa --model_name_or_path unixcoder-base --do_zero_shot --do_test --test_data_file dataset\cosqa\cosqa-retrieval-test-500.json --codebase_file dataset\cosqa\code_idx_map.txt --num_train_epochs 10 --code_length 256 --nl_length 128 --train_batch_size 64 --eval_batch_size 64 --learning_rate 2e-5 --seed 123456
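The Linux and Windows variants differ only in line continuation and path separators. One way to sidestep both, sketched here as an assumption rather than something the repository provides, is to build the argument list in Python and let pathlib and subprocess handle the platform details. zero_shot_command and run_zero_shot are hypothetical helpers; the flags and values are the ones from the command above.

```python
import subprocess
import sys
from pathlib import Path


def zero_shot_command(model_path="unixcoder-base", data_dir="dataset/cosqa"):
    """Build the zero-shot test command as a list, so no shell quoting is needed."""
    data = Path(data_dir)
    return [
        sys.executable, "run.py",
        "--output_dir", str(Path("saved_models") / "cosqa"),
        "--model_name_or_path", str(model_path),
        "--do_zero_shot",
        "--do_test",
        "--test_data_file", str(data / "cosqa-retrieval-test-500.json"),
        "--codebase_file", str(data / "code_idx_map.txt"),
        "--num_train_epochs", "10",
        "--code_length", "256",
        "--nl_length", "128",
        "--train_batch_size", "64",
        "--eval_batch_size", "64",
        "--learning_rate", "2e-5",
        "--seed", "123456",
    ]


def run_zero_shot():
    """Run the command; call this from the directory that contains run.py."""
    subprocess.run(zero_shot_command(), check=True)
```

Passing a list to subprocess.run avoids the shell entirely, so the same script works in PowerShell, cmd and bash.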

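I ran the zero-shot command directly. Does it perhaps need a trained model first? Only a single result was printed.

Looking at the code confirmed that it only prints the final result and does not print or save any intermediate data (such as which queries did better or worse).

Below are the commands for the non-zero-shot case, that is, fine-tuning to obtain a better model:

Commands from the official README: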

# Training
python run.py \
--output_dir saved_models/cosqa \
--model_name_or_path microsoft/unixcoder-base \
--do_train \
--train_data_file dataset/cosqa/cosqa-retrieval-train-19604.json \
--eval_data_file dataset/cosqa/cosqa-retrieval-dev-500.json \
--codebase_file dataset/cosqa/code_idx_map.txt \
--num_train_epochs 10 \
--code_length 256 \
--nl_length 128 \
--train_batch_size 64 \
--eval_batch_size 64 \
--learning_rate 2e-5 \
--seed 123456

# Evaluating
python run.py \
--output_dir saved_models/cosqa \
--model_name_or_path microsoft/unixcoder-base \
--do_eval \
--do_test \
--eval_data_file dataset/cosqa/cosqa-retrieval-dev-500.json \
--test_data_file dataset/cosqa/cosqa-retrieval-test-500.json \
--codebase_file dataset/cosqa/code_idx_map.txt \
--num_train_epochs 10 \
--code_length 256 \
--nl_length 128 \
--train_batch_size 64 \
--eval_batch_size 64 \
--learning_rate 2e-5 \
--seed 123456

Commands adapted to the Windows terminal:

# Training
python run.py --output_dir saved_models\cosqa --model_name_or_path unixcoder-base --do_train --train_data_file dataset\cosqa\cosqa-retrieval-train-19604.json --eval_data_file dataset\cosqa\cosqa-retrieval-dev-500.json --codebase_file dataset\cosqa\code_idx_map.txt --num_train_epochs 10 --code_length 256 --nl_length 128 --train_batch_size 64 --eval_batch_size 64 --learning_rate 2e-5 --seed 123456

# Evaluating
python run.py --output_dir saved_models\cosqa --model_name_or_path unixcoder-base --do_eval --do_test --eval_data_file dataset\cosqa\cosqa-retrieval-dev-500.json --test_data_file dataset\cosqa\cosqa-retrieval-test-500.json --codebase_file dataset\cosqa\code_idx_map.txt --num_train_epochs 10 --code_length 256 --nl_length 128 --train_batch_size 64 --eval_batch_size 64 --learning_rate 2e-5 --seed 123456

Training failed with this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 0 has a total capacty of 8.00 GiB of which 0 bytes is free. Of the allocated memory 13.69 GiB is allocated by PyTorch, and 68.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In other words, the GPU ran out of memory. I adjusted the parameters and made the batch size much smaller:

python run.py --output_dir saved_models\cosqa --model_name_or_path unixcoder-base --do_train --train_data_file dataset\cosqa\cosqa-retrieval-train-19604.json --eval_data_file dataset\cosqa\cosqa-retrieval-dev-500.json --codebase_file dataset\cosqa\code_idx_map.txt --num_train_epochs 10 --code_length 256 --nl_length 128 --train_batch_size 12 --eval_batch_size 12 --learning_rate 2e-5 --seed 123456

This batch size seems to suit my GPU reasonably well (it could probably go a bit higher), but the trained model may be affected by the change.
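A smaller batch also means more optimizer steps per epoch. Here is a quick back-of-the-envelope check, assuming one optimizer step per batch and that the last partial batch is kept. Both are assumptions, though the result for batch size 12 agrees with the "Total optimization steps = 16340" line in the training log.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Batches per epoch (last partial batch kept) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 19604 training examples and 10 epochs, as in the commands above
print(total_optimization_steps(19604, 12, 10))  # 16340
print(total_optimization_steps(19604, 64, 10))  # 3070
```

With batch size 12 the run takes roughly five times as many steps as with the README's batch size of 64.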

The terminal output follows (feel free to skip it); it is here mainly for my own reference:

(venv) PS D:\thinkerhui\大模型大创\unixcoder> python run.py --output_dir saved_models\cosqa --model_name_or_path unixcoder-base --do_train --train_data_file dataset\cosqa\cosqa-retrieval-train-19604.json --eval_data_file dataset\cosqa\cosqa-retrieval-dev-500.json --codebase_file dataset\cosqa\code_idx_map.txt --num_train_epochs 10 --code_length 256 --nl_length 128 --train_batch_size 12 --eval_batch_size 12 --learning_rate 2e-5 --seed 123456
01/29/2024 16:35:26 - INFO - __main__ - device: cuda, n_gpu: 1
D:\thinkerhui\大模型大创\unixcoder\venv\Lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
01/29/2024 16:35:26 - INFO - __main__ - Training/evaluation parameters Namespace(train_data_file='dataset\\cosqa\\cosqa-retrieval-train-19604.json', output_dir='saved_models\\cosqa', eval_data_file='dataset\\cosqa\\cosqa-retrieval-dev-500.json', test_data_file=None, codebase_file='dataset\\cosqa\\code_idx_map.txt', model_name_or_path='unixcoder-base', config_name='', tokenizer_name='', nl_length=128, code_length=256, do_train=True, do_eval=False, do_test=False, do_zero_shot=False, do_F2_norm=False, train_batch_size=12, eval_batch_size=12, learning_rate=2e-05, max_grad_norm=1.0, num_train_epochs=10, seed=123456, n_gpu=1, device=device(type='cuda'))
01/29/2024 16:35:31 - INFO - __main__ - *** Example ***
01/29/2024 16:35:31 - INFO - __main__ - idx: 0
01/29/2024 16:35:31 - INFO - __main__ - code_tokens: ['<s>', '<encoder-only>', '</s>', 'def', '_write', 'Boolea
n', '_(', '_self', '_,', '_n', '_)', '_:', '_t', '_=', '_TYPE', '_', 'BOOL', '_', 'TRUE', '_if', '_n', '_is', '_F
alse', '_:', '_t', '_=', '_TYPE', '_', 'BOOL', '_', 'FALSE', '_self', '_.', '_stream', '_.', '_write', '_(', '_t', '_)', '</s>']
01/29/2024 16:35:31 - INFO - __main__ - code_ids: 0 6 2 729 2250 4259 400 1358 2019 416 743 545 422 385 8781 18
1 9249 181 4835 462 416 555 3378 545 422 385 8781 181 9249 181 5732 1358 746 2239 746 2250 400 422 743 2 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/29/2024 16:35:31 - INFO - __main__ - nl_tokens: ['<s>', '<encoder-only>', '</s>', 'python', '_code', '_to', '_write', '_bool', '_value', '_1', '</s>']
01/29/2024 16:35:31 - INFO - __main__ - nl_ids: 0 6 2 9038 1717 508 2250 1223 767 524 2 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/29/2024 16:35:31 - INFO - __main__ - *** Example ***
01/29/2024 16:35:31 - INFO - __main__ - idx: 1
01/29/2024 16:35:31 - INFO - __main__ - code_tokens: ['<s>', '<encoder-only>', '</s>', 'def', '_paste', '_(', '
_x', 'sel', '_=', '_False', '_)', '_:', '_selection', '_=', '_"', 'primary', '"', '_if', '_x', 'sel', '_else', '_
"', 'clipboard', '"', '_try', '_:', '_return', '_subprocess', '_.', '_P', 'open', '_(', '_[', '_"', 'xc', 'lip',
'"', '_,', '_"-', 'selection', '"', '_,', '_selection', '_,', '_"-', 'o', '"', '_]', '_,', '_stdout', '_=', '_sub
process', '_.', '_PIPE', '_)', '_.', '_communicate', '_(', '_)', '_[', '_0', '_]', '_.', '_decode', '_(', '_"', '
utf', '-', '8', '"', '_)', '_except', '_OSError', '_as', '_why', '_:', '_raise', '_X', 'clip', 'NotFound', '</s>']
01/29/2024 16:35:31 - INFO - __main__ - code_ids: 0 6 2 729 32436 400 868 4761 385 3378 743 545 6244 385 437 71
30 120 462 868 4761 669 437 26898 120 1568 545 483 13053 746 615 2012 400 626 437 5444 2740 120 2019 4007 6125 12
0 2019 6244 2019 4007 197 120 2406 2019 8932 385 13053 746 17711 743 746 43633 400 743 626 461 2406 746 4954 400
437 3737 131 142 120 743 3552 22934 880 14904 545 3085 1352 7283 6064 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/29/2024 16:35:31 - INFO - __main__ - nl_tokens: ['<s>', '<encoder-only>', '</s>', '"', 'python', '_how', '_to', '_manip', 'ulate', '_clipboard', '"', '</s>']
01/29/2024 16:35:31 - INFO - __main__ - nl_ids: 0 6 2 120 9038 5064 508 23181 4526 29038 120 2 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/29/2024 16:35:31 - INFO - __main__ - *** Example ***
01/29/2024 16:35:31 - INFO - __main__ - idx: 2
01/29/2024 16:35:31 - INFO - __main__ - code_tokens: ['<s>', '<encoder-only>', '</s>', 'def', '__', 'format', '
_', 'json', '_(', '_data', '_,', '_theme', '_)', '_:', '_output', '_=', '_json', '_.', '_d', 'umps', '_(', '_data
', '_,', '_indent', '_=', '_2', '_,', '_sort', '_', 'keys', '_=', '_True', '_)', '_if', '_py', 'g', 'ments', '_an
d', '_sys', '_.', '_stdout', '_.', '_is', 'at', 'ty', '_(', '_)', '_:', '_style', '_=', '_get', '_', 'style', '_'
, 'by', '_', 'name', '_(', '_theme', '_)', '_formatter', '_=', '_Terminal', '256', 'Formatter', '_(', '_style', '
_=', '_style', '_)', '_return', '_py', 'g', 'ments', '_.', '_highlight', '_(', '_output', '_,', '_Json', 'Lexer', '_(', '_)', '_,', '_formatter', '_)', '_return', '_output', '</s>']
01/29/2024 16:35:31 - INFO - __main__ - code_ids: 0 6 2 729 623 1478 181 2317 400 869 2019 11079 743 545 1721 3
85 3192 746 480 11537 400 869 2019 5310 385 688 2019 4821 181 2814 385 2998 743 462 4689 189 2067 706 3455 746 89
32 746 555 384 2329 400 743 545 3221 385 744 181 2057 181 2499 181 616 400 11079 743 12641 385 41581 3528 7088 40
0 3221 385 3221 743 483 4689 189 2067 746 14885 400 1721 2019 5902 12901 400 743 2019 12641 743 483 1721 2 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/29/2024 16:35:31 - INFO - __main__ - nl_tokens: ['<s>', '<encoder-only>', '</s>', 'python', '_co', 'lored', '_output', '_to', '_html', '</s>']
01/29/2024 16:35:31 - INFO - __main__ - nl_ids: 0 6 2 9038 1912 21320 1721 508 4875 2 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
D:\thinkerhui\大模型大创\unixcoder\venv\Lib\site-packages\transformers\optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
01/29/2024 16:35:31 - INFO - __main__ - ***** Running training *****
01/29/2024 16:35:31 - INFO - __main__ - Num examples = 19604
01/29/2024 16:35:31 - INFO - __main__ - Num Epochs = 10
01/29/2024 16:35:31 - INFO - __main__ - Instantaneous batch size per GPU = 12
01/29/2024 16:35:31 - INFO - __main__ - Total train batch size = 12
01/29/2024 16:35:31 - INFO - __main__ - Total optimization steps = 16340
01/29/2024 16:36:25 - INFO - __main__ - epoch 0 step 100 loss 0.13636
01/29/2024 16:37:08 - INFO - __main__ - epoch 0 step 200 loss 0.06745
01/29/2024 16:37:52 - INFO - __main__ - epoch 0 step 300 loss 0.07746
01/29/2024 16:38:35 - INFO - __main__ - epoch 0 step 400 loss 0.0599
01/29/2024 16:39:19 - INFO - __main__ - epoch 0 step 500 loss 0.04683
01/29/2024 16:40:02 - INFO - __main__ - epoch 0 step 600 loss 0.05991
01/29/2024 16:40:45 - INFO - __main__ - epoch 0 step 700 loss 0.04931
01/29/2024 16:41:29 - INFO - __main__ - epoch 0 step 800 loss 0.04109
01/29/2024 16:42:13 - INFO - __main__ - epoch 0 step 900 loss 0.03477
01/29/2024 16:42:57 - INFO - __main__ - epoch 0 step 1000 loss 0.03945
01/29/2024 16:43:42 - INFO - __main__ - epoch 0 step 1100 loss 0.04783
01/29/2024 16:44:26 - INFO - __main__ - epoch 0 step 1200 loss 0.03678
01/29/2024 16:45:11 - INFO - __main__ - epoch 0 step 1300 loss 0.04101
01/29/2024 16:45:55 - INFO - __main__ - epoch 0 step 1400 loss 0.04457
01/29/2024 16:46:39 - INFO - __main__ - epoch 0 step 1500 loss 0.04475
01/29/2024 16:47:23 - INFO - __main__ - epoch 0 step 1600 loss 0.03299
01/29/2024 16:47:40 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 16:47:40 - INFO - __main__ - Num queries = 500
01/29/2024 16:47:40 - INFO - __main__ - Num codes = 6267
01/29/2024 16:47:40 - INFO - __main__ - Batch size = 12
01/29/2024 16:48:47 - INFO - __main__ - eval_mrr = 0.6323
01/29/2024 16:48:47 - INFO - __main__ - ********************
01/29/2024 16:48:47 - INFO - __main__ - Best mrr:0.6323
01/29/2024 16:48:47 - INFO - __main__ - ********************
01/29/2024 16:48:47 - INFO - __main__ - Saving model checkpoint to saved_models\cosqa\checkpoint-best-mrr\model.bin
01/29/2024 16:49:39 - INFO - __main__ - epoch 1 step 100 loss 0.03557
01/29/2024 16:50:22 - INFO - __main__ - epoch 1 step 200 loss 0.02627
01/29/2024 16:51:06 - INFO - __main__ - epoch 1 step 300 loss 0.03244
01/29/2024 16:51:49 - INFO - __main__ - epoch 1 step 400 loss 0.02158
01/29/2024 16:52:33 - INFO - __main__ - epoch 1 step 500 loss 0.01971
01/29/2024 16:53:16 - INFO - __main__ - epoch 1 step 600 loss 0.02587
01/29/2024 16:54:00 - INFO - __main__ - epoch 1 step 700 loss 0.03965
01/29/2024 16:54:44 - INFO - __main__ - epoch 1 step 800 loss 0.04392
01/29/2024 16:55:28 - INFO - __main__ - epoch 1 step 900 loss 0.01245
01/29/2024 16:56:11 - INFO - __main__ - epoch 1 step 1000 loss 0.01865
01/29/2024 16:56:55 - INFO - __main__ - epoch 1 step 1100 loss 0.02658
01/29/2024 16:57:38 - INFO - __main__ - epoch 1 step 1200 loss 0.03788
01/29/2024 16:58:22 - INFO - __main__ - epoch 1 step 1300 loss 0.03048
01/29/2024 16:59:05 - INFO - __main__ - epoch 1 step 1400 loss 0.02551
01/29/2024 16:59:48 - INFO - __main__ - epoch 1 step 1500 loss 0.0335
01/29/2024 17:00:32 - INFO - __main__ - epoch 1 step 1600 loss 0.02083
01/29/2024 17:00:49 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 17:00:49 - INFO - __main__ - Num queries = 500
01/29/2024 17:00:49 - INFO - __main__ - Num codes = 6267
01/29/2024 17:00:49 - INFO - __main__ - Batch size = 12
01/29/2024 17:01:55 - INFO - __main__ - eval_mrr = 0.6441
01/29/2024 17:01:55 - INFO - __main__ - ********************
01/29/2024 17:01:55 - INFO - __main__ - Best mrr:0.6441
01/29/2024 17:01:55 - INFO - __main__ - ********************
01/29/2024 17:01:56 - INFO - __main__ - Saving model checkpoint to saved_models\cosqa\checkpoint-best-mrr\model.bin
01/29/2024 17:02:47 - INFO - __main__ - epoch 2 step 100 loss 0.02099
01/29/2024 17:03:30 - INFO - __main__ - epoch 2 step 200 loss 0.01507
01/29/2024 17:04:13 - INFO - __main__ - epoch 2 step 300 loss 0.02138
01/29/2024 17:04:57 - INFO - __main__ - epoch 2 step 400 loss 0.01924
01/29/2024 17:05:41 - INFO - __main__ - epoch 2 step 500 loss 0.01657
01/29/2024 17:06:25 - INFO - __main__ - epoch 2 step 600 loss 0.02167
01/29/2024 17:07:10 - INFO - __main__ - epoch 2 step 700 loss 0.01339
01/29/2024 17:07:54 - INFO - __main__ - epoch 2 step 800 loss 0.01948
01/29/2024 17:08:38 - INFO - __main__ - epoch 2 step 900 loss 0.01726
01/29/2024 17:09:23 - INFO - __main__ - epoch 2 step 1000 loss 0.02174
01/29/2024 17:10:07 - INFO - __main__ - epoch 2 step 1100 loss 0.01927
01/29/2024 17:10:51 - INFO - __main__ - epoch 2 step 1200 loss 0.01658
01/29/2024 17:11:36 - INFO - __main__ - epoch 2 step 1300 loss 0.02042
01/29/2024 17:12:20 - INFO - __main__ - epoch 2 step 1400 loss 0.02495
01/29/2024 17:13:04 - INFO - __main__ - epoch 2 step 1500 loss 0.00752
01/29/2024 17:13:49 - INFO - __main__ - epoch 2 step 1600 loss 0.02654
01/29/2024 17:14:08 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 17:14:08 - INFO - __main__ - Num queries = 500
01/29/2024 17:14:08 - INFO - __main__ - Num codes = 6267
01/29/2024 17:14:08 - INFO - __main__ - Batch size = 12
01/29/2024 17:15:16 - INFO - __main__ - eval_mrr = 0.6522
01/29/2024 17:15:16 - INFO - __main__ - ********************
01/29/2024 17:15:16 - INFO - __main__ - Best mrr:0.6522
01/29/2024 17:15:16 - INFO - __main__ - ********************
01/29/2024 17:15:16 - INFO - __main__ - Saving model checkpoint to saved_models\cosqa\checkpoint-best-mrr\model.bin
01/29/2024 17:16:09 - INFO - __main__ - epoch 3 step 100 loss 0.02855
01/29/2024 17:16:53 - INFO - __main__ - epoch 3 step 200 loss 0.02285
01/29/2024 17:17:38 - INFO - __main__ - epoch 3 step 300 loss 0.01508
01/29/2024 17:18:22 - INFO - __main__ - epoch 3 step 400 loss 0.01831
01/29/2024 17:19:06 - INFO - __main__ - epoch 3 step 500 loss 0.01391
01/29/2024 17:19:51 - INFO - __main__ - epoch 3 step 600 loss 0.01907
01/29/2024 17:20:35 - INFO - __main__ - epoch 3 step 700 loss 0.01575
01/29/2024 17:21:20 - INFO - __main__ - epoch 3 step 800 loss 0.01694
01/29/2024 17:22:04 - INFO - __main__ - epoch 3 step 900 loss 0.02172
01/29/2024 17:22:48 - INFO - __main__ - epoch 3 step 1000 loss 0.01624
01/29/2024 17:23:32 - INFO - __main__ - epoch 3 step 1100 loss 0.01293
01/29/2024 17:24:17 - INFO - __main__ - epoch 3 step 1200 loss 0.01496
01/29/2024 17:25:01 - INFO - __main__ - epoch 3 step 1300 loss 0.01474
01/29/2024 17:25:45 - INFO - __main__ - epoch 3 step 1400 loss 0.0136
01/29/2024 17:26:30 - INFO - __main__ - epoch 3 step 1500 loss 0.0134
01/29/2024 17:27:14 - INFO - __main__ - epoch 3 step 1600 loss 0.02719
01/29/2024 17:27:33 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 17:27:33 - INFO - __main__ - Num queries = 500
01/29/2024 17:27:33 - INFO - __main__ - Num codes = 6267
01/29/2024 17:27:33 - INFO - __main__ - Batch size = 12
01/29/2024 17:28:40 - INFO - __main__ - eval_mrr = 0.6542
01/29/2024 17:28:40 - INFO - __main__ - ********************
01/29/2024 17:28:40 - INFO - __main__ - Best mrr:0.6542
01/29/2024 17:28:40 - INFO - __main__ - ********************
01/29/2024 17:28:41 - INFO - __main__ - Saving model checkpoint to saved_models\cosqa\checkpoint-best-mrr\model.bin
01/29/2024 17:29:33 - INFO - __main__ - epoch 4 step 100 loss 0.01079
01/29/2024 17:30:18 - INFO - __main__ - epoch 4 step 200 loss 0.01621
01/29/2024 17:31:02 - INFO - __main__ - epoch 4 step 300 loss 0.01586
01/29/2024 17:31:46 - INFO - __main__ - epoch 4 step 400 loss 0.0139
01/29/2024 17:32:31 - INFO - __main__ - epoch 4 step 500 loss 0.0143
01/29/2024 17:33:16 - INFO - __main__ - epoch 4 step 600 loss 0.02795
01/29/2024 17:34:00 - INFO - __main__ - epoch 4 step 700 loss 0.01114
01/29/2024 17:34:44 - INFO - __main__ - epoch 4 step 800 loss 0.02773
01/29/2024 17:35:28 - INFO - __main__ - epoch 4 step 900 loss 0.01336
01/29/2024 17:36:13 - INFO - __main__ - epoch 4 step 1000 loss 0.01658
01/29/2024 17:36:57 - INFO - __main__ - epoch 4 step 1100 loss 0.02225
01/29/2024 17:37:41 - INFO - __main__ - epoch 4 step 1200 loss 0.00846
01/29/2024 17:38:25 - INFO - __main__ - epoch 4 step 1300 loss 0.01427
01/29/2024 17:39:08 - INFO - __main__ - epoch 4 step 1400 loss 0.01837
01/29/2024 17:39:52 - INFO - __main__ - epoch 4 step 1500 loss 0.01503
01/29/2024 17:40:35 - INFO - __main__ - epoch 4 step 1600 loss 0.01601
01/29/2024 17:40:52 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 17:40:52 - INFO - __main__ - Num queries = 500
01/29/2024 17:40:52 - INFO - __main__ - Num codes = 6267
01/29/2024 17:40:52 - INFO - __main__ - Batch size = 12
01/29/2024 17:41:58 - INFO - __main__ - eval_mrr = 0.6696
01/29/2024 17:41:58 - INFO - __main__ - ********************
01/29/2024 17:41:58 - INFO - __main__ - Best mrr:0.6696
01/29/2024 17:41:58 - INFO - __main__ - ********************
01/29/2024 17:41:59 - INFO - __main__ - Saving model checkpoint to saved_models\cosqa\checkpoint-best-mrr\model.bin
01/29/2024 17:42:50 - INFO - __main__ - epoch 5 step 100 loss 0.0175
01/29/2024 17:43:33 - INFO - __main__ - epoch 5 step 200 loss 0.01447
01/29/2024 17:44:17 - INFO - __main__ - epoch 5 step 300 loss 0.02013
01/29/2024 17:45:00 - INFO - __main__ - epoch 5 step 400 loss 0.02057
01/29/2024 17:45:43 - INFO - __main__ - epoch 5 step 500 loss 0.02427
01/29/2024 17:46:27 - INFO - __main__ - epoch 5 step 600 loss 0.01444
01/29/2024 17:47:10 - INFO - __main__ - epoch 5 step 700 loss 0.0093
01/29/2024 17:47:54 - INFO - __main__ - epoch 5 step 800 loss 0.01442
01/29/2024 17:48:37 - INFO - __main__ - epoch 5 step 900 loss 0.01147
01/29/2024 17:49:21 - INFO - __main__ - epoch 5 step 1000 loss 0.0238
01/29/2024 17:50:04 - INFO - __main__ - epoch 5 step 1100 loss 0.02009
01/29/2024 17:50:47 - INFO - __main__ - epoch 5 step 1200 loss 0.01224
01/29/2024 17:51:31 - INFO - __main__ - epoch 5 step 1300 loss 0.01701
01/29/2024 17:52:14 - INFO - __main__ - epoch 5 step 1400 loss 0.01257
01/29/2024 17:52:58 - INFO - __main__ - epoch 5 step 1500 loss 0.01694
01/29/2024 17:53:41 - INFO - __main__ - epoch 5 step 1600 loss 0.01883
01/29/2024 17:53:58 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 17:53:58 - INFO - __main__ - Num queries = 500
01/29/2024 17:53:58 - INFO - __main__ - Num codes = 6267
01/29/2024 17:53:58 - INFO - __main__ - Batch size = 12
01/29/2024 17:55:04 - INFO - __main__ - eval_mrr = 0.6819
01/29/2024 17:55:04 - INFO - __main__ - ********************
01/29/2024 17:55:04 - INFO - __main__ - Best mrr:0.6819
01/29/2024 17:55:04 - INFO - __main__ - ********************
01/29/2024 17:55:05 - INFO - __main__ - Saving model checkpoint to saved_models\cosqa\checkpoint-best-mrr\model.bin
01/29/2024 17:55:56 - INFO - __main__ - epoch 6 step 100 loss 0.01716
01/29/2024 17:56:39 - INFO - __main__ - epoch 6 step 200 loss 0.00899
01/29/2024 17:57:23 - INFO - __main__ - epoch 6 step 300 loss 0.01626
01/29/2024 17:58:06 - INFO - __main__ - epoch 6 step 400 loss 0.01486
01/29/2024 17:58:50 - INFO - __main__ - epoch 6 step 500 loss 0.02121
01/29/2024 17:59:33 - INFO - __main__ - epoch 6 step 600 loss 0.01381
01/29/2024 18:00:16 - INFO - __main__ - epoch 6 step 700 loss 0.01112
01/29/2024 18:01:00 - INFO - __main__ - epoch 6 step 800 loss 0.00927
01/29/2024 18:01:43 - INFO - __main__ - epoch 6 step 900 loss 0.0235
01/29/2024 18:02:26 - INFO - __main__ - epoch 6 step 1000 loss 0.0092
01/29/2024 18:03:10 - INFO - __main__ - epoch 6 step 1100 loss 0.01253
01/29/2024 18:03:53 - INFO - __main__ - epoch 6 step 1200 loss 0.00726
01/29/2024 18:04:37 - INFO - __main__ - epoch 6 step 1300 loss 0.0115
01/29/2024 18:05:20 - INFO - __main__ - epoch 6 step 1400 loss 0.0111
01/29/2024 18:06:04 - INFO - __main__ - epoch 6 step 1500 loss 0.01461
01/29/2024 18:06:47 - INFO - __main__ - epoch 6 step 1600 loss 0.01115
01/29/2024 18:07:04 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 18:07:04 - INFO - __main__ - Num queries = 500
01/29/2024 18:07:04 - INFO - __main__ - Num codes = 6267
01/29/2024 18:07:04 - INFO - __main__ - Batch size = 12
01/29/2024 18:08:11 - INFO - __main__ - eval_mrr = 0.6858
01/29/2024 18:08:11 - INFO - __main__ - ********************
01/29/2024 18:08:11 - INFO - __main__ - Best mrr:0.6858
01/29/2024 18:08:11 - INFO - __main__ - ********************
01/29/2024 18:08:11 - INFO - __main__ - Saving model checkpoint to saved_models\cosqa\checkpoint-best-mrr\model.bin
01/29/2024 18:09:02 - INFO - __main__ - epoch 7 step 100 loss 0.01316
01/29/2024 18:09:45 - INFO - __main__ - epoch 7 step 200 loss 0.01817
01/29/2024 18:10:29 - INFO - __main__ - epoch 7 step 300 loss 0.01309
01/29/2024 18:11:12 - INFO - __main__ - epoch 7 step 400 loss 0.01251
01/29/2024 18:11:56 - INFO - __main__ - epoch 7 step 500 loss 0.01529
01/29/2024 18:12:39 - INFO - __main__ - epoch 7 step 600 loss 0.01101
01/29/2024 18:13:22 - INFO - __main__ - epoch 7 step 700 loss 0.01705
01/29/2024 18:14:06 - INFO - __main__ - epoch 7 step 800 loss 0.00989
01/29/2024 18:14:49 - INFO - __main__ - epoch 7 step 900 loss 0.00921
01/29/2024 18:15:33 - INFO - __main__ - epoch 7 step 1000 loss 0.01106
01/29/2024 18:16:16 - INFO - __main__ - epoch 7 step 1100 loss 0.00759
01/29/2024 18:17:00 - INFO - __main__ - epoch 7 step 1200 loss 0.00909
01/29/2024 18:17:43 - INFO - __main__ - epoch 7 step 1300 loss 0.0152
01/29/2024 18:18:26 - INFO - __main__ - epoch 7 step 1400 loss 0.01086
01/29/2024 18:19:10 - INFO - __main__ - epoch 7 step 1500 loss 0.01087
01/29/2024 18:19:53 - INFO - __main__ - epoch 7 step 1600 loss 0.00908
01/29/2024 18:20:10 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 18:20:10 - INFO - __main__ - Num queries = 500
01/29/2024 18:20:10 - INFO - __main__ - Num codes = 6267
01/29/2024 18:20:10 - INFO - __main__ - Batch size = 12
01/29/2024 18:21:16 - INFO - __main__ - eval_mrr = 0.6839
01/29/2024 18:22:07 - INFO - __main__ - epoch 8 step 100 loss 0.00553
01/29/2024 18:22:50 - INFO - __main__ - epoch 8 step 200 loss 0.00996
01/29/2024 18:23:34 - INFO - __main__ - epoch 8 step 300 loss 0.0075
01/29/2024 18:24:17 - INFO - __main__ - epoch 8 step 400 loss 0.01108
01/29/2024 18:25:01 - INFO - __main__ - epoch 8 step 500 loss 0.00765
01/29/2024 18:25:44 - INFO - __main__ - epoch 8 step 600 loss 0.01275
01/29/2024 18:26:28 - INFO - __main__ - epoch 8 step 700 loss 0.00875
01/29/2024 18:27:11 - INFO - __main__ - epoch 8 step 800 loss 0.011
01/29/2024 18:27:54 - INFO - __main__ - epoch 8 step 900 loss 0.0118
01/29/2024 18:28:38 - INFO - __main__ - epoch 8 step 1000 loss 0.00724
01/29/2024 18:29:21 - INFO - __main__ - epoch 8 step 1100 loss 0.00416
01/29/2024 18:30:05 - INFO - __main__ - epoch 8 step 1200 loss 0.01071
01/29/2024 18:30:48 - INFO - __main__ - epoch 8 step 1300 loss 0.00849
01/29/2024 18:31:31 - INFO - __main__ - epoch 8 step 1400 loss 0.01281
01/29/2024 18:32:15 - INFO - __main__ - epoch 8 step 1500 loss 0.01515
01/29/2024 18:32:58 - INFO - __main__ - epoch 8 step 1600 loss 0.01455
01/29/2024 18:33:15 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 18:33:15 - INFO - __main__ - Num queries = 500
01/29/2024 18:33:15 - INFO - __main__ - Num codes = 6267
01/29/2024 18:33:15 - INFO - __main__ - Batch size = 12
01/29/2024 18:34:22 - INFO - __main__ - eval_mrr = 0.6901
01/29/2024 18:34:22 - INFO - __main__ - ********************
01/29/2024 18:34:22 - INFO - __main__ - Best mrr:0.6901
01/29/2024 18:34:22 - INFO - __main__ - ********************
01/29/2024 18:34:22 - INFO - __main__ - Saving model checkpoint to saved_models\cosqa\checkpoint-best-mrr\model.bin
01/29/2024 18:35:13 - INFO - __main__ - epoch 9 step 100 loss 0.01404
01/29/2024 18:35:57 - INFO - __main__ - epoch 9 step 200 loss 0.01246
01/29/2024 18:36:40 - INFO - __main__ - epoch 9 step 300 loss 0.01094
01/29/2024 18:37:23 - INFO - __main__ - epoch 9 step 400 loss 0.00944
01/29/2024 18:38:07 - INFO - __main__ - epoch 9 step 500 loss 0.01247
01/29/2024 18:39:34 - INFO - __main__ - epoch 9 step 700 loss 0.01045
01/29/2024 18:40:17 - INFO - __main__ - epoch 9 step 800 loss 0.00451
01/29/2024 18:41:00 - INFO - __main__ - epoch 9 step 900 loss 0.00712
01/29/2024 18:41:44 - INFO - __main__ - epoch 9 step 1000 loss 0.00727
01/29/2024 18:42:27 - INFO - __main__ - epoch 9 step 1100 loss 0.01189
01/29/2024 18:43:11 - INFO - __main__ - epoch 9 step 1200 loss 0.00638
01/29/2024 18:43:54 - INFO - __main__ - epoch 9 step 1300 loss 0.00737
01/29/2024 18:44:37 - INFO - __main__ - epoch 9 step 1400 loss 0.00871
01/29/2024 18:45:21 - INFO - __main__ - epoch 9 step 1500 loss 0.01583
01/29/2024 18:46:04 - INFO - __main__ - epoch 9 step 1600 loss 0.00592
01/29/2024 18:46:21 - INFO - __main__ - ***** Running evaluation *****
01/29/2024 18:46:21 - INFO - __main__ - Num queries = 500
01/29/2024 18:46:21 - INFO - __main__ - Num codes = 6267
01/29/2024 18:46:21 - INFO - __main__ - Batch size = 12
01/29/2024 18:47:28 - INFO - __main__ - eval_mrr = 0.6993
01/29/2024 18:47:28 - INFO - __main__ - ********************
01/29/2024 18:47:28 - INFO - __main__ - Best mrr:0.6993
01/29/2024 18:47:28 - INFO - __main__ - ********************
01/29/2024 18:47:28 - INFO - __main__ - Saving model checkpoint to saved_models\cosqa\checkpoint-best-mrr\model.bin
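To follow the validation score across epochs without scrolling through the log, the eval_mrr lines can be pulled out with a regular expression. This is a small sketch that assumes the log has been saved to a file or string (the script itself only prints it); extract_eval_mrr is a hypothetical helper.

```python
import re

# Matches lines such as "... - eval_mrr = 0.6323"
EVAL_MRR = re.compile(r"eval_mrr\s*=\s*([0-9]*\.?[0-9]+)")


def extract_eval_mrr(log_text):
    """Return the eval_mrr values in the order they appear in the log."""
    return [float(match.group(1)) for match in EVAL_MRR.finditer(log_text)]


# Three lines copied from the log above
sample = """\
01/29/2024 16:48:47 - INFO - __main__ - eval_mrr = 0.6323
01/29/2024 16:48:47 - INFO - __main__ - Best mrr:0.6323
01/29/2024 17:01:55 - INFO - __main__ - eval_mrr = 0.6441
"""
print(extract_eval_mrr(sample))  # [0.6323, 0.6441]
```

Over the full run, eval_mrr rises from 0.6323 after epoch 0 to 0.6993 after epoch 9, with a single dip at epoch 7 (0.6839, below the previous best of 0.6858), which is the only epoch where no checkpoint was saved.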

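Environment setup

Everything above assumes the environment is already set up. In practice you need a PyTorch environment with GPU acceleration. If you do not have a supported GPU, simply install PyTorch and transformers.

I mainly followed: http://t.csdnimg.cn/YJBLp

CUDA

Run nvidia-smi to check your GPU information, then download and install the matching version of the CUDA Toolkit.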

pytorch

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

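--index-url switches the package index that pip downloads from. To make later setup easier, you can configure pip to use the Tsinghua mirror: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple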
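After installation, it is worth confirming that PyTorch actually sees the GPU. The following sketch uses torch.cuda.is_available() and falls back gracefully when torch is not installed; describe_torch_environment is an illustrative name.

```python
import importlib.util


def describe_torch_environment():
    """Report whether torch is importable and whether it can use CUDA."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False, "cuda_available": False}
    import torch

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Name of the first visible GPU
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_torch_environment())
```

If cuda_available is False on a machine with an NVIDIA GPU, a CPU-only build of torch or a driver/toolkit mismatch is a common cause.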

transformers

Installing transformers is straightforward. With pip:

pip install transformers

Code

Add recording of per-query results during testing.

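import pandas as pd  # add this import at the top of run.py

# Create a DataFrame to hold the per-query results
# (nl_urls and ranks are assumed to be the lists built in the evaluation code)
df = pd.DataFrame({'URL': nl_urls, 'Rank': ranks})

# Save the results to a CSV file
df.to_csv('evaluation_results.csv', index=False)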

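nl_urls holds the index of the code snippet that each query should match. Its order is the same as the order of the test samples. Several different queries can target the same code snippet, so values in nl_urls can repeat.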
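Because the same URL can appear in several rows, it can help to aggregate the saved values per code snippet. This is a sketch using only the standard library, assuming a CSV with URL and Rank columns as written by the snippet above. The sample rows are invented for illustration and are not results from the run.

```python
import csv
import io
from collections import defaultdict


def mean_score_per_url(csv_file):
    """Average the Rank column over all queries that share a URL."""
    scores = defaultdict(list)
    for row in csv.DictReader(csv_file):
        scores[row["URL"]].append(float(row["Rank"]))
    return {url: sum(values) / len(values) for url, values in scores.items()}


# Invented sample in the same layout as evaluation_results.csv
sample = io.StringIO("URL,Rank\n101,1.0\n101,0.5\n205,0.0\n")
print(mean_score_per_url(sample))  # {'101': 0.75, '205': 0.0}
```

For the real file, pass an opened handle instead, for example open('evaluation_results.csv', newline='').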

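Despite the column name Rank, the stored value is a score: the larger it is, the better the model did on that test sample. The best value is 1 and the worst is 0.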
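A score with exactly this range is the reciprocal rank, which is also what a mean reciprocal rank (MRR), such as the eval_mrr value in the log, averages. The following is a generic sketch of that metric, not a copy of the evaluation code in run.py. The cutoff parameter is an assumption added for illustration.

```python
def reciprocal_rank(ranked_candidates, target, cutoff=None):
    """Return 1/position of the target (1-based), or 0.0 if it is not found."""
    candidates = ranked_candidates if cutoff is None else ranked_candidates[:cutoff]
    for position, candidate in enumerate(candidates, start=1):
        if candidate == target:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(scores):
    """Average a list of reciprocal ranks; an empty list gives 0.0."""
    return sum(scores) / len(scores) if scores else 0.0


ranking = ["a", "b", "c"]
scores = [
    reciprocal_rank(ranking, "a"),  # target ranked first  -> 1.0
    reciprocal_rank(ranking, "b"),  # target ranked second -> 0.5
    reciprocal_rank(ranking, "z"),  # target missing       -> 0.0
]
print(mean_reciprocal_rank(scores))  # 0.5
```

A per-query score of 1 therefore means the correct snippet was ranked first, and 0 means it was not found within the inspected candidates.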


CodeBERT in Practice
http://thinkerhui.site/2024/06/01/自学研究/codeBERT实战/
Author: thinkerhui
Published: June 1, 2024