Fine-tuning a pretrained model for text classification (single-task and multi-task settings) with huggingface transformers


诸神缄默不语 - my personal CSDN blog post index

Official transformers documentation: https://huggingface.co/docs/transformers/index
AutoModel documentation: https://huggingface.co/docs/transformers/v4.23.1/en/model_doc/auto#transformers.AutoModel
AutoTokenizer documentation: https://huggingface.co/docs/transformers/v4.23.1/en/model_doc/auto#transformers.AutoTokenizer

The single-task setup simply takes the BERT representation (the pooler output), applies Dropout, and feeds it to one linear layer; this is essentially the same as using AutoModelForSequenceClassification directly.
The multi-task single-dataset setup replaces that single linear layer with one linear head per task.
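For orientation, here is a minimal sketch of these two heads (illustrative only; it assumes a BERT-base-style checkpoint with a 768-dimensional pooler output, and the class names are made up for this sketch rather than taken from the code below):

from torch import nn
from transformers import AutoModel, AutoModelForSequenceClassification

# single task: the ready-made classification head ...
ready_made = AutoModelForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)

# ... behaves essentially like taking the pooler output yourself:
class SingleTaskCls(nn.Module):
    def __init__(self, pretrained_path, num_labels, dropout_rate=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained_path)
        self.dropout = nn.Dropout(dropout_rate)
        self.classifier = nn.Linear(768, num_labels)  # 768 = hidden size of a BERT-base model
    def forward(self, **encoded):
        pooled = self.encoder(**encoded)['pooler_output']
        return self.classifier(self.dropout(pooled))

# multi-task, single dataset: same encoder, one linear head per task
class MultiTaskCls(nn.Module):
    def __init__(self, pretrained_path, task_num_labels, dropout_rate=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained_path)
        self.dropout = nn.Dropout(dropout_rate)
        self.heads = nn.ModuleList([nn.Linear(768, n) for n in task_num_labels])
    def forward(self, **encoded):
        pooled = self.dropout(self.encoder(**encoded)['pooler_output'])
        return [head(pooled) for head in self.heads]  # one logit tensor per task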

https://github.com/huggingface/transformers/blob/ad654e448444b60937016cbea257f69c9837ecde/src/transformers/modeling_utils.py
https://github.com/huggingface/transformers/blob/ee0d001de71f0da892f86caa3cf2387020ec9696/src/transformers/models/bert/modeling_bert.py

For the multi-task multi-dataset setup, I follow the official transformers source code (the two links above): on top of the multi-task single-dataset model, BertEmbeddings is additionally split out per dataset, so that all tasks share only the BertEncoder part.

(There are many multi-task learning paradigms; this post uses the basic hard parameter sharing scheme.)
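The module layout of that multi-dataset setup can be sketched as follows (a rough skeleton only; the actual forward pass in section 3 reproduces BertModel's attention-mask handling, and the class name here is illustrative):

from torch import nn
from transformers import AutoConfig
from transformers.models.bert.modeling_bert import BertEmbeddings, BertEncoder, BertPooler

config = AutoConfig.from_pretrained('bert-base-chinese')

class HardSharedBackbone(nn.Module):
    # one BertEmbeddings per dataset; BertEncoder and BertPooler are shared by all tasks
    def __init__(self, num_datasets, task_output_dims):
        super().__init__()
        self.embeddings = nn.ModuleList([BertEmbeddings(config) for _ in range(num_datasets)])
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.classifiers = nn.ModuleList([nn.Linear(config.hidden_size, d) for d in task_output_dims])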

Table of contents

  • 1. Single-task text classification
  • 2. Multi-task text classification (single dataset)
  • 3. Multi-task text classification (multiple datasets)

1. Single-task text classification

The dataset used in this section is https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv, and the pretrained language model is https://huggingface.co/bert-base-chinese.

See also another project of mine: PolarisRisingWar/pytorch_text_classification.

Code:

import csv,random
from tqdm import tqdm
from copy import deepcopy

from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

import torch
import torch.nn as nn
from torch.utils.data import Dataset,DataLoader

from transformers import AutoModel, AutoTokenizer

# hyperparameters
random_seed=20221125
split_ratio='6-2-2'
pretrained_path='/data/pretrained_model/bert-base-chinese'
dropout_rate=0.1
max_epoch_num=16
cuda_device='cuda:2'
output_dim=2

# data preprocessing
with open('other_data_temp/ChnSentiCorp_htl_all.csv') as f:
    reader=csv.reader(f)
    header=next(reader)  # header row
    data=[[int(row[0]),row[1]] for row in reader]  # each element: [label (0/1), review text]

random.seed(random_seed)
random.shuffle(data)
split_ratio_list=[int(i) for i in split_ratio.split('-')]
split_point1=int(len(data)*split_ratio_list[0]/sum(split_ratio_list))
split_point2=int(len(data)*(split_ratio_list[0]+split_ratio_list[1])/sum(split_ratio_list))
train_data=data[:split_point1]
valid_data=data[split_point1:split_point2]
test_data=data[split_point2:]

# build the dataset iterator
class TextInitializeDataset(Dataset):
    def __init__(self,input_data) -> None:
        self.text=[x[1] for x in input_data]
        self.label=[x[0] for x in input_data]
    def __getitem__(self,index):
        return [self.text[index],self.label[index]]
    def __len__(self):
        return len(self.text)

tokenizer=AutoTokenizer.from_pretrained(pretrained_path)

def collate_fn(batch):
    pt_batch=tokenizer([x[0] for x in batch],padding=True,truncation=True,max_length=512,return_tensors='pt')
    return {'input_ids':pt_batch['input_ids'],'token_type_ids':pt_batch['token_type_ids'],
            'attention_mask':pt_batch['attention_mask'],'label':torch.tensor([x[1] for x in batch])}

train_dataloader=DataLoader(TextInitializeDataset(train_data),batch_size=16,shuffle=True,collate_fn=collate_fn)
valid_dataloader=DataLoader(TextInitializeDataset(valid_data),batch_size=128,shuffle=False,collate_fn=collate_fn)
test_dataloader=DataLoader(TextInitializeDataset(test_data),batch_size=128,shuffle=False,collate_fn=collate_fn)

# build the model: BERT pooler output -> dropout -> linear classifier
class ClsModel(nn.Module):
    def __init__(self,output_dim,dropout_rate):
        super(ClsModel,self).__init__()
        self.encoder=AutoModel.from_pretrained(pretrained_path)
        self.dropout=nn.Dropout(dropout_rate)
        self.classifier=nn.Linear(768,output_dim)
    def forward(self,input_ids,token_type_ids,attention_mask):
        x=self.encoder(input_ids=input_ids,token_type_ids=token_type_ids,attention_mask=attention_mask)['pooler_output']
        x=self.dropout(x)
        x=self.classifier(x)
        return x

loss_func=nn.CrossEntropyLoss()
model=ClsModel(output_dim,dropout_rate)
model.to(cuda_device)
optimizer=torch.optim.Adam(params=model.parameters(),lr=1e-5)

max_valid_f1=0
best_model={}
for e in tqdm(range(max_epoch_num)):
    # training
    for batch in train_dataloader:
        model.train()
        optimizer.zero_grad()
        input_ids=batch['input_ids'].to(cuda_device)
        token_type_ids=batch['token_type_ids'].to(cuda_device)
        attention_mask=batch['attention_mask'].to(cuda_device)
        outputs=model(input_ids,token_type_ids,attention_mask)
        train_loss=loss_func(outputs,batch['label'].to(cuda_device))
        train_loss.backward()
        optimizer.step()
    # validation: keep the checkpoint with the best macro-F1
    with torch.no_grad():
        model.eval()
        labels=[]
        predicts=[]
        for batch in valid_dataloader:
            input_ids=batch['input_ids'].to(cuda_device)
            token_type_ids=batch['token_type_ids'].to(cuda_device)
            attention_mask=batch['attention_mask'].to(cuda_device)
            outputs=model(input_ids,token_type_ids,attention_mask)
            labels.extend([i.item() for i in batch['label']])
            predicts.extend([i.item() for i in torch.argmax(outputs,1)])
        f1=f1_score(labels,predicts,average='macro')
        if f1>max_valid_f1:
            best_model=deepcopy(model.state_dict())
            max_valid_f1=f1

# testing with the best checkpoint
model.load_state_dict(best_model)
with torch.no_grad():
    model.eval()
    labels=[]
    predicts=[]
    for batch in test_dataloader:
        input_ids=batch['input_ids'].to(cuda_device)
        token_type_ids=batch['token_type_ids'].to(cuda_device)
        attention_mask=batch['attention_mask'].to(cuda_device)
        outputs=model(input_ids,token_type_ids,attention_mask)
        labels.extend([i.item() for i in batch['label']])
        predicts.extend([i.item() for i in torch.argmax(outputs,1)])
    print(accuracy_score(labels,predicts))
    print(precision_score(labels,predicts,average='macro'))
    print(recall_score(labels,predicts,average='macro'))
    print(f1_score(labels,predicts,average='macro'))

Total runtime: about 1h35min.

Experimental results:

accuracy   macro-P   macro-R   macro-F
91.89      91.39     90.33     90.82

2. Multi-task text classification (single dataset)

The dataset used in this section, TEL-NLP, comes from https://github.com/scsmuhio/MTGCN.
The specific file I used is https://raw.githubusercontent.com/scsmuhio/MTGCN/main/Data/ei_task.csv.
Source paper (MT-Text GCN): Multi-Task Text Classification using Graph Convolutional Networks for Large-Scale Low Resource Language.
The Telugu BERT weights I used are https://huggingface.co/kuppuluri/telugu_bertu (not the representation model used in the original paper).

This is a Telugu multi-task text classification dataset. I do not know any Telugu at all, so in principle I would rather not have used it, but it is the only typical single-dataset multi-task text classification dataset I could find.

Dataset example (the screenshot from the original post is omitted here).

The preprocessing here is similar to what the paper describes, but cannot be identical, for three reasons: first, the released data differs from what the paper reports (I asked about this in the project's GitHub repo: Questions about data · Issue #1 · scsmuhio/MTGCN); second, the code does not record the actual splits, so I can only fix my own random seed; third, I never fully understood how the paper splits the data. My reading is that it uses five random 7-1-2 splits and averages the results over the five runs, but I did not bother running it that many times. My setup is:
the data is split randomly into train/valid/test at a 7:1:2 ratio (random seed 20221028).
(As a result, the numbers below are not comparable with those reported in the paper; they are not even in the same ballpark.)

I ran two experiments to compare the single-task and multi-task classification setups. Each run fine-tunes for at most 16 epochs and tests with the model from the epoch with the highest validation macro-F1 (for the multi-task setup, the highest average macro-F1 across the four tasks).
Judging from the results alone, the multi-task setup shows no clear advantage or disadvantage. Then again, the multi-task version is deliberately kept simple and not optimized at all; I may improve the code when I have time.
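The epoch-selection rule described above, written out as a small stand-alone sketch (the label/prediction lists are toy values made up purely for illustration):

from statistics import mean
from sklearn.metrics import f1_score

# toy validation labels/predictions for two tasks (illustrative values only)
task_labels   = [[0, 1, 1, 0], [2, 0, 1, 1]]
task_predicts = [[0, 1, 0, 0], [2, 0, 1, 0]]

# single task: the selection score is simply that task's macro-F1
single_task_score = f1_score(task_labels[0], task_predicts[0], average='macro')

# multi-task: average the per-task macro-F1 scores and keep the epoch with the best mean
per_task_f1 = [f1_score(l, p, average='macro') for l, p in zip(task_labels, task_predicts)]
multi_task_score = mean(per_task_f1)
print(single_task_score, per_task_f1, multi_task_score)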

Single-task version:

import csv,os,random
from tqdm import tqdm
from copy import deepcopy

from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

import torch
import torch.nn as nn
from torch.utils.data import Dataset,TensorDataset,DataLoader

from transformers import AutoModel, AutoTokenizer, pipeline

# data preprocessing
with open('other_data_temp/telnlp_ei.csv') as f:
    reader=csv.reader(f)
    header=next(reader)  # header row
    print(header)
    data=list(reader)

# map the string labels to integers
map1={'neg':0,'neutral':1,'pos':2}
map2={'angry':0,'sad':1,'fear':2,'happy':3}
map3={'yes':0,'no':1}

random.seed(20221028)
random.shuffle(data)
split_ratio_list=[7,1,2]
split_point1=int(len(data)*split_ratio_list[0]/sum(split_ratio_list))
split_point2=int(len(data)*(split_ratio_list[0]+split_ratio_list[1])/sum(split_ratio_list))
train_data=data[:split_point1]
valid_data=data[split_point1:split_point2]
test_data=data[split_point2:]

# build the dataset iterator
class TextInitializeDataset(Dataset):
    def __init__(self,input_data) -> None:
        self.text=[x[0] for x in input_data]
        self.sentiment=[map1[x[1]] for x in input_data]
        self.emotion=[map2[x[2]] for x in input_data]
        self.hate=[map3[x[3]] for x in input_data]
        self.sarcasm=[map3[x[4]] for x in input_data]
    def __getitem__(self,index):
        return [self.text[index],self.sentiment[index],self.emotion[index],self.hate[index],self.sarcasm[index]]
    def __len__(self):
        return len(self.text)

tokenizer=AutoTokenizer.from_pretrained("/data/pretrained_model/telugu_bertu",clean_text=False,
                                        handle_chinese_chars=False,strip_accents=False,wordpieces_prefix='##')

def collate_fn(batch):
    pt_batch=tokenizer([x[0] for x in batch],padding=True,truncation=True,max_length=512,return_tensors='pt')
    return {'input_ids':pt_batch['input_ids'],'token_type_ids':pt_batch['token_type_ids'],
            'attention_mask':pt_batch['attention_mask'],
            'sentiment':torch.tensor([x[1] for x in batch]),'emotion':torch.tensor([x[2] for x in batch]),
            'hate':torch.tensor([x[3] for x in batch]),'sarcasm':torch.tensor([x[4] for x in batch])}

train_dataloader=DataLoader(TextInitializeDataset(train_data),batch_size=64,shuffle=True,collate_fn=collate_fn)
valid_dataloader=DataLoader(TextInitializeDataset(valid_data),batch_size=512,shuffle=False,collate_fn=collate_fn)
test_dataloader=DataLoader(TextInitializeDataset(test_data),batch_size=512,shuffle=False,collate_fn=collate_fn)

# build the model: BERT pooler output -> dropout -> linear classifier
class ClsModel(nn.Module):
    def __init__(self,output_dim,dropout_rate):
        super(ClsModel,self).__init__()
        self.encoder=AutoModel.from_pretrained("/data/pretrained_model/telugu_bertu")
        self.dropout=nn.Dropout(dropout_rate)
        self.classifier=nn.Linear(768,output_dim)
    def forward(self,input_ids,token_type_ids,attention_mask):
        x=self.encoder(input_ids=input_ids,token_type_ids=token_type_ids,attention_mask=attention_mask)['pooler_output']
        x=self.dropout(x)
        x=self.classifier(x)
        return x

# run: train one independent model per task
dropout_rate=0.1
max_epoch_num=16
cuda_device='cuda:1'
od_map={'sentiment':3,'emotion':4,'hate':2,'sarcasm':2}
loss_func=nn.CrossEntropyLoss()

for the_label in ['sentiment','emotion','hate','sarcasm']:
    model=ClsModel(od_map[the_label],dropout_rate)
    model.to(cuda_device)
    optimizer=torch.optim.Adam(params=model.parameters(),lr=1e-5)
    max_valid_f1=0
    best_model={}
    for e in tqdm(range(max_epoch_num)):
        for batch in train_dataloader:
            model.train()
            optimizer.zero_grad()
            input_ids=batch['input_ids'].to(cuda_device)
            token_type_ids=batch['token_type_ids'].to(cuda_device)
            attention_mask=batch['attention_mask'].to(cuda_device)
            outputs=model(input_ids,token_type_ids,attention_mask)
            train_loss=loss_func(outputs,batch[the_label].to(cuda_device))
            train_loss.backward()
            optimizer.step()
        # validation
        with torch.no_grad():
            model.eval()
            labels=[]
            predicts=[]
            for batch in valid_dataloader:
                input_ids=batch['input_ids'].to(cuda_device)
                token_type_ids=batch['token_type_ids'].to(cuda_device)
                attention_mask=batch['attention_mask'].to(cuda_device)
                outputs=model(input_ids,token_type_ids,attention_mask)
                labels.extend([i.item() for i in batch[the_label]])
                predicts.extend([i.item() for i in torch.argmax(outputs,1)])
            f1=f1_score(labels,predicts,average='macro')
            if f1>max_valid_f1:
                best_model=deepcopy(model.state_dict())
                max_valid_f1=f1
    # testing
    model.load_state_dict(best_model)
    with torch.no_grad():
        model.eval()
        labels=[]
        predicts=[]
        for batch in test_dataloader:
            input_ids=batch['input_ids'].to(cuda_device)
            token_type_ids=batch['token_type_ids'].to(cuda_device)
            attention_mask=batch['attention_mask'].to(cuda_device)
            outputs=model(input_ids,token_type_ids,attention_mask)
            labels.extend([i.item() for i in batch[the_label]])
            predicts.extend([i.item() for i in torch.argmax(outputs,1)])
        print(the_label)
        print(accuracy_score(labels,predicts))
        print(precision_score(labels,predicts,average='macro'))
        print(recall_score(labels,predicts,average='macro'))
        print(f1_score(labels,predicts,average='macro'))

Multi-task version:

import csv,os,random
from tqdm import tqdm
from copy import deepcopy
from statistics import mean

from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

import torch
import torch.nn as nn
from torch.utils.data import Dataset,TensorDataset,DataLoader

from transformers import AutoModel, AutoTokenizer, pipeline

# data preprocessing (same split as the single-task version)
with open('other_data_temp/telnlp_ei.csv') as f:
    reader=csv.reader(f)
    header=next(reader)  # header row
    print(header)
    data=list(reader)

# map the string labels to integers
map1={'neg':0,'neutral':1,'pos':2}
map2={'angry':0,'sad':1,'fear':2,'happy':3}
map3={'yes':0,'no':1}

random.seed(20221028)
random.shuffle(data)
split_ratio_list=[7,1,2]
split_point1=int(len(data)*split_ratio_list[0]/sum(split_ratio_list))
split_point2=int(len(data)*(split_ratio_list[0]+split_ratio_list[1])/sum(split_ratio_list))
train_data=data[:split_point1]
valid_data=data[split_point1:split_point2]
test_data=data[split_point2:]

# build the dataset iterator (identical to the single-task version)
class TextInitializeDataset(Dataset):
    def __init__(self,input_data) -> None:
        self.text=[x[0] for x in input_data]
        self.sentiment=[map1[x[1]] for x in input_data]
        self.emotion=[map2[x[2]] for x in input_data]
        self.hate=[map3[x[3]] for x in input_data]
        self.sarcasm=[map3[x[4]] for x in input_data]
    def __getitem__(self,index):
        return [self.text[index],self.sentiment[index],self.emotion[index],self.hate[index],self.sarcasm[index]]
    def __len__(self):
        return len(self.text)

tokenizer=AutoTokenizer.from_pretrained("/data/pretrained_model/telugu_bertu",clean_text=False,
                                        handle_chinese_chars=False,strip_accents=False,wordpieces_prefix='##')

def collate_fn(batch):
    pt_batch=tokenizer([x[0] for x in batch],padding=True,truncation=True,max_length=512,return_tensors='pt')
    return {'input_ids':pt_batch['input_ids'],'token_type_ids':pt_batch['token_type_ids'],
            'attention_mask':pt_batch['attention_mask'],
            'sentiment':torch.tensor([x[1] for x in batch]),'emotion':torch.tensor([x[2] for x in batch]),
            'hate':torch.tensor([x[3] for x in batch]),'sarcasm':torch.tensor([x[4] for x in batch])}

train_dataloader=DataLoader(TextInitializeDataset(train_data),batch_size=64,shuffle=True,collate_fn=collate_fn)
valid_dataloader=DataLoader(TextInitializeDataset(valid_data),batch_size=512,shuffle=False,collate_fn=collate_fn)
test_dataloader=DataLoader(TextInitializeDataset(test_data),batch_size=512,shuffle=False,collate_fn=collate_fn)

# build the model: shared BERT encoder, one linear classifier per task
class ClsModel(nn.Module):
    def __init__(self,output_dims,dropout_rate):
        super(ClsModel,self).__init__()
        self.encoder=AutoModel.from_pretrained("/data/pretrained_model/telugu_bertu")
        self.dropout=nn.Dropout(dropout_rate)
        self.classifiers=nn.ModuleList([nn.Linear(768,output_dim) for output_dim in output_dims])
    def forward(self,input_ids,token_type_ids,attention_mask):
        x=self.encoder(input_ids=input_ids,token_type_ids=token_type_ids,attention_mask=attention_mask)['pooler_output']
        x=self.dropout(x)
        xs=[classifier(x) for classifier in self.classifiers]
        return xs

# run
dropout_rate=0.1
max_epoch_num=16
cuda_device='cuda:2'
od_name=['sentiment','emotion','hate','sarcasm']
od=[3,4,2,2]
loss_func=nn.CrossEntropyLoss()

model=ClsModel(od,dropout_rate)
model.to(cuda_device)
optimizer=torch.optim.Adam(params=model.parameters(),lr=1e-5)

max_valid_f1=0
best_model={}
for e in tqdm(range(max_epoch_num)):
    for batch in train_dataloader:
        model.train()
        optimizer.zero_grad()
        input_ids=batch['input_ids'].to(cuda_device)
        token_type_ids=batch['token_type_ids'].to(cuda_device)
        attention_mask=batch['attention_mask'].to(cuda_device)
        outputs=model(input_ids,token_type_ids,attention_mask)
        # total loss = sum of the 4 per-task cross-entropy losses
        loss_list=[loss_func(outputs[i],batch[od_name[i]].to(cuda_device)) for i in range(4)]
        train_loss=torch.sum(torch.stack(loss_list))
        train_loss.backward()
        optimizer.step()
    # validation: selection criterion is the mean of the 4 macro-F1 scores
    with torch.no_grad():
        model.eval()
        labels=[[] for _ in range(4)]
        predicts=[[] for _ in range(4)]
        for batch in valid_dataloader:
            input_ids=batch['input_ids'].to(cuda_device)
            token_type_ids=batch['token_type_ids'].to(cuda_device)
            attention_mask=batch['attention_mask'].to(cuda_device)
            outputs=model(input_ids,token_type_ids,attention_mask)
            for i in range(4):
                labels[i].extend([t.item() for t in batch[od_name[i]]])
                predicts[i].extend([t.item() for t in torch.argmax(outputs[i],1)])
        f1=mean([f1_score(labels[i],predicts[i],average='macro') for i in range(4)])
        if f1>max_valid_f1:
            best_model=deepcopy(model.state_dict())
            max_valid_f1=f1

# testing
model.load_state_dict(best_model)
with torch.no_grad():
    model.eval()
    labels=[[] for _ in range(4)]
    predicts=[[] for _ in range(4)]
    for batch in test_dataloader:
        input_ids=batch['input_ids'].to(cuda_device)
        token_type_ids=batch['token_type_ids'].to(cuda_device)
        attention_mask=batch['attention_mask'].to(cuda_device)
        outputs=model(input_ids,token_type_ids,attention_mask)
        for i in range(4):
            labels[i].extend([t.item() for t in batch[od_name[i]]])
            predicts[i].extend([t.item() for t in torch.argmax(outputs[i],1)])
    for i in range(4):
        print(od_name[i])
        print(accuracy_score(labels[i],predicts[i]))
        print(precision_score(labels[i],predicts[i],average='macro'))
        print(recall_score(labels[i],predicts[i],average='macro'))
        print(f1_score(labels[i],predicts[i],average='macro'))

(The multi-task run takes about a quarter of the total single-task time; I did not time the difference precisely.)
Comparison of the results (×100, 2 decimal places):

task - label            accuracy   macro-P   macro-R   macro-F
single - sentiment      85.69      64.38     63.55     63.73
multi - sentiment       86.37      65.74     63.29     63.9
single - emotion        87.61      72.18     73.16     72.47
multi - emotion         88.28      79.97     66.51     70.81
single - hate-speech    96.58      63.99     69.15     66.12
multi - hate-speech     96.84      66.36     72.78     68.99
single - sarcasm        98.34      64.47     68.55     66.25
multi - sarcasm         98.03      60.92     66.04     62.96

3. Multi-task text classification (multiple datasets)

This section uses two Sina Weibo datasets, both from https://github.com/SophonPlus/ChineseNlpCorpus:
one labeled with sentiment polarity (0/1): https://pan.baidu.com/s/1DoQbki3YwqkuwQUOj64R_g
one labeled with 4 emotion classes: https://pan.baidu.com/s/16c93E5x373nsGozyWevITg

The pretrained language model is https://huggingface.co/bert-base-chinese.

(Training takes too long here, so I only ran 1 epoch for each setup instead of several.)

Single-task code:

import csv,random
from tqdm import tqdm
from copy import deepcopy

from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

import torch
import torch.nn as nn
from torch.utils.data import Dataset,DataLoader

from transformers import AutoModel, AutoTokenizer

# hyperparameters
random_seed=20221125
split_ratio='6-2-2'
pretrained_path='/data/pretrained_model/bert-base-chinese'
dropout_rate=0.1
max_epoch_num=1
cuda_device='cuda:3'
# one entry per task: [csv path, number of classes]
output_dim=[['/data/other_data/weibo_senti_100k.csv',2],['/data/other_data/simplifyweibo_4_moods.csv',4]]

# data preprocessing
random.seed(random_seed)

# build the dataset iterator
class TextInitializeDataset(Dataset):
    def __init__(self,input_data) -> None:
        self.text=[x[1] for x in input_data]
        self.label=[x[0] for x in input_data]
    def __getitem__(self,index):
        return [self.text[index],self.label[index]]
    def __len__(self):
        return len(self.text)

tokenizer=AutoTokenizer.from_pretrained(pretrained_path)

def collate_fn(batch):
    pt_batch=tokenizer([x[0] for x in batch],padding=True,truncation=True,max_length=512,return_tensors='pt')
    return {'input_ids':pt_batch['input_ids'],'token_type_ids':pt_batch['token_type_ids'],
            'attention_mask':pt_batch['attention_mask'],'label':torch.tensor([x[1] for x in batch])}

# build the model
class ClsModel(nn.Module):
    def __init__(self,output_dim,dropout_rate):
        super(ClsModel,self).__init__()
        self.encoder=AutoModel.from_pretrained(pretrained_path)
        self.dropout=nn.Dropout(dropout_rate)
        self.classifier=nn.Linear(768,output_dim)
    def forward(self,input_ids,token_type_ids,attention_mask):
        x=self.encoder(input_ids=input_ids,token_type_ids=token_type_ids,attention_mask=attention_mask)['pooler_output']
        x=self.dropout(x)
        x=self.classifier(x)
        return x

# run: train one independent model per dataset
loss_func=nn.CrossEntropyLoss()
for task in output_dim:
    with open(task[0]) as f:
        reader=csv.reader(f)
        header=next(reader)  # header row
        data=[[int(row[0]),row[1]] for row in reader]  # each element: [label, text]
    # note: unlike the other scripts, the data is not shuffled here before splitting
    split_ratio_list=[int(i) for i in split_ratio.split('-')]
    split_point1=int(len(data)*split_ratio_list[0]/sum(split_ratio_list))
    split_point2=int(len(data)*(split_ratio_list[0]+split_ratio_list[1])/sum(split_ratio_list))
    train_data=data[:split_point1]
    valid_data=data[split_point1:split_point2]
    test_data=data[split_point2:]
    # batch sizes 64/512 work on the first dataset but OOM on the second, so the same 16/128 are used for both
    train_dataloader=DataLoader(TextInitializeDataset(train_data),batch_size=16,shuffle=True,collate_fn=collate_fn)
    valid_dataloader=DataLoader(TextInitializeDataset(valid_data),batch_size=128,shuffle=False,collate_fn=collate_fn)
    test_dataloader=DataLoader(TextInitializeDataset(test_data),batch_size=128,shuffle=False,collate_fn=collate_fn)

    model=ClsModel(task[1],dropout_rate)
    model.to(cuda_device)
    optimizer=torch.optim.Adam(params=model.parameters(),lr=1e-5)
    max_valid_f1=0
    best_model={}
    for e in tqdm(range(max_epoch_num)):
        for batch in train_dataloader:
            model.train()
            optimizer.zero_grad()
            input_ids=batch['input_ids'].to(cuda_device)
            token_type_ids=batch['token_type_ids'].to(cuda_device)
            attention_mask=batch['attention_mask'].to(cuda_device)
            outputs=model(input_ids,token_type_ids,attention_mask)
            train_loss=loss_func(outputs,batch['label'].to(cuda_device))
            train_loss.backward()
            optimizer.step()
        # validation
        with torch.no_grad():
            model.eval()
            labels=[]
            predicts=[]
            for batch in valid_dataloader:
                input_ids=batch['input_ids'].to(cuda_device)
                token_type_ids=batch['token_type_ids'].to(cuda_device)
                attention_mask=batch['attention_mask'].to(cuda_device)
                outputs=model(input_ids,token_type_ids,attention_mask)
                labels.extend([i.item() for i in batch['label']])
                predicts.extend([i.item() for i in torch.argmax(outputs,1)])
            f1=f1_score(labels,predicts,average='macro')
            if f1>max_valid_f1:
                best_model=deepcopy(model.state_dict())
                max_valid_f1=f1
    # testing
    model.load_state_dict(best_model)
    with torch.no_grad():
        model.eval()
        labels=[]
        predicts=[]
        for batch in test_dataloader:
            input_ids=batch['input_ids'].to(cuda_device)
            token_type_ids=batch['token_type_ids'].to(cuda_device)
            attention_mask=batch['attention_mask'].to(cuda_device)
            outputs=model(input_ids,token_type_ids,attention_mask)
            labels.extend([i.item() for i in batch['label']])
            predicts.extend([i.item() for i in torch.argmax(outputs,1)])
        print(task[0])
        print(accuracy_score(labels,predicts))
        print(precision_score(labels,predicts,average='macro'))
        print(recall_score(labels,predicts,average='macro'))
        print(f1_score(labels,predicts,average='macro'))

Multi-task code:

import csv,random
from tqdm import tqdm
from copy import deepcopy

from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

import torch
import torch.nn as nn
from torch.utils.data import Dataset,DataLoader

from transformers import AutoTokenizer,AutoConfig
from transformers.models.bert.modeling_bert import BertEmbeddings,BertEncoder,BertPooler
from transformers.modeling_outputs import BaseModelOutputWithPoolingAndCrossAttentions
from transformers.modeling_utils import ModuleUtilsMixin

# hyperparameters
random_seed=20221125
split_ratio='6-2-2'
pretrained_path='/data/pretrained_model/bert-base-chinese'
dropout_rate=0.1
max_epoch_num=1
cuda_device='cuda:2'
output_dim=[2,4]  # number of classes of dataset 1 / dataset 2

# data preprocessing
random.seed(random_seed)

# dataset 1
with open('/data/other_data/weibo_senti_100k.csv') as f:
    reader=csv.reader(f)
    header=next(reader)  # header row
    data=[[int(row[0]),row[1]] for row in reader]  # each element: [label, text]
random.shuffle(data)
split_ratio_list=[int(i) for i in split_ratio.split('-')]
split_point1=int(len(data)*split_ratio_list[0]/sum(split_ratio_list))
split_point2=int(len(data)*(split_ratio_list[0]+split_ratio_list[1])/sum(split_ratio_list))
train_data1=data[:split_point1]
valid_data1=data[split_point1:split_point2]
test_data1=data[split_point2:]

# dataset 2
with open('/data/other_data/simplifyweibo_4_moods.csv') as f:
    reader=csv.reader(f)
    header=next(reader)  # header row
    data=[[int(row[0]),row[1]] for row in reader]  # each element: [label, text]
random.shuffle(data)
split_ratio_list=[int(i) for i in split_ratio.split('-')]
split_point1=int(len(data)*split_ratio_list[0]/sum(split_ratio_list))
split_point2=int(len(data)*(split_ratio_list[0]+split_ratio_list[1])/sum(split_ratio_list))
train_data2=data[:split_point1]
valid_data2=data[split_point1:split_point2]
test_data2=data[split_point2:]

# build the dataset iterator
class TextInitializeDataset(Dataset):
    def __init__(self,input_data) -> None:
        self.text=[x[1] for x in input_data]
        self.label=[x[0] for x in input_data]
    def __getitem__(self,index):
        return [self.text[index],self.label[index]]
    def __len__(self):
        return len(self.text)

tokenizer=AutoTokenizer.from_pretrained(pretrained_path)

def collate_fn(batch):
    pt_batch=tokenizer([x[0] for x in batch],padding=True,truncation=True,max_length=512,return_tensors='pt')
    return {'input_ids':pt_batch['input_ids'],'token_type_ids':pt_batch['token_type_ids'],
            'attention_mask':pt_batch['attention_mask'],'label':torch.tensor([x[1] for x in batch])}

train_dataloader1=DataLoader(TextInitializeDataset(train_data1),batch_size=16,shuffle=True,collate_fn=collate_fn)
train_dataloader2=DataLoader(TextInitializeDataset(train_data2),batch_size=16,shuffle=True,collate_fn=collate_fn)
valid_dataloader1=DataLoader(TextInitializeDataset(valid_data1),batch_size=128,shuffle=False,collate_fn=collate_fn)
valid_dataloader2=DataLoader(TextInitializeDataset(valid_data2),batch_size=128,shuffle=False,collate_fn=collate_fn)
test_dataloader1=DataLoader(TextInitializeDataset(test_data1),batch_size=128,shuffle=False,collate_fn=collate_fn)
test_dataloader2=DataLoader(TextInitializeDataset(test_data2),batch_size=128,shuffle=False,collate_fn=collate_fn)

config=AutoConfig.from_pretrained(pretrained_path)

# build the model: one BertEmbeddings per dataset, a shared BertEncoder/BertPooler, and one classifier per task.
# Note: the submodules are constructed from the config, i.e. their weights are randomly initialized here
# rather than loaded from the pretrained checkpoint.
# The forward pass follows transformers' BertModel.forward (attention-mask handling etc.).
class ClsModel(nn.Module):
    def __init__(self,output_dim,dropout_rate):
        super(ClsModel,self).__init__()
        self.config=config
        self.embedding1=BertEmbeddings(config)
        self.embedding2=BertEmbeddings(config)
        self.encoder=BertEncoder(config)
        self.pooler=BertPooler(config)
        self.dropout=nn.Dropout(dropout_rate)
        self.classifier1=nn.Linear(768,output_dim[0])
        self.classifier2=nn.Linear(768,output_dim[1])
    def forward(self,input_ids,token_type_ids,attention_mask,type):
        output_attentions=self.config.output_attentions
        output_hidden_states=self.config.output_hidden_states
        return_dict=self.config.use_return_dict
        if self.config.is_decoder:
            use_cache=self.config.use_cache
        else:
            use_cache=False
        input_shape=input_ids.size()
        batch_size,seq_length=input_shape
        device=input_ids.device

        # past_key_values_length
        past_key_values_length=0

        if attention_mask is None:
            attention_mask=torch.ones(((batch_size,seq_length+past_key_values_length)),device=device)

        # pick the embedding layer of the current dataset
        if type==1:
            self.embeddings=self.embedding1
        else:
            self.embeddings=self.embedding2

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        dtype=attention_mask.dtype
        if attention_mask.dim()==3:
            extended_attention_mask=attention_mask[:,None,:,:]
        elif attention_mask.dim()==2:
            # Provided a padding mask of dimensions [batch_size, seq_length]
            # - if the model is a decoder, apply a causal mask in addition to the padding mask
            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
            if self.config.is_decoder:
                extended_attention_mask=ModuleUtilsMixin.create_extended_attention_mask_for_decoder(input_shape,attention_mask,device)
            else:
                extended_attention_mask=attention_mask[:,None,None,:]
        else:
            raise ValueError(
                f"Wrong shape for input_ids (shape {input_shape}) or attention_mask (shape {attention_mask.shape})"
            )

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and the dtype's smallest value for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask=extended_attention_mask.to(dtype=dtype)  # fp16 compatibility
        extended_attention_mask=(1.0-extended_attention_mask)*torch.iinfo(dtype).min

        encoder_extended_attention_mask=None

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask=[None]*self.config.num_hidden_layers

        embedding_output=self.embeddings(
            input_ids=input_ids,
            position_ids=None,
            token_type_ids=token_type_ids,
            inputs_embeds=None,
            past_key_values_length=past_key_values_length,
        )
        encoder_outputs=self.encoder(
            embedding_output,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
            encoder_hidden_states=None,
            encoder_attention_mask=encoder_extended_attention_mask,
            past_key_values=None,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output=encoder_outputs[0]
        pooled_output=self.pooler(sequence_output) if self.pooler is not None else None
        if not return_dict:
            return (sequence_output,pooled_output)+encoder_outputs[1:]
        x=BaseModelOutputWithPoolingAndCrossAttentions(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            past_key_values=encoder_outputs.past_key_values,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
            cross_attentions=encoder_outputs.cross_attentions,
        )['pooler_output']
        x=self.dropout(x)
        # pick the classification head of the current task
        if type==1:
            self.classifier=self.classifier1
        else:
            self.classifier=self.classifier2
        x=self.classifier(x)
        return x

loss_func=nn.CrossEntropyLoss()
model=ClsModel(output_dim,dropout_rate)
model.to(cuda_device)
optimizer=torch.optim.Adam(params=model.parameters(),lr=1e-5)

max_valid_f1=0
best_model={}
for e in tqdm(range(max_epoch_num)):
    # train on dataset 1, then on dataset 2
    for batch in train_dataloader1:
        model.train()
        optimizer.zero_grad()
        input_ids=batch['input_ids'].to(cuda_device)
        token_type_ids=batch['token_type_ids'].to(cuda_device)
        attention_mask=batch['attention_mask'].to(cuda_device)
        outputs=model(input_ids,token_type_ids,attention_mask,1)
        train_loss=loss_func(outputs,batch['label'].to(cuda_device))
        train_loss.backward()
        optimizer.step()
    for batch in train_dataloader2:
        model.train()
        optimizer.zero_grad()
        input_ids=batch['input_ids'].to(cuda_device)
        token_type_ids=batch['token_type_ids'].to(cuda_device)
        attention_mask=batch['attention_mask'].to(cuda_device)
        outputs=model(input_ids,token_type_ids,attention_mask,2)
        train_loss=loss_func(outputs,batch['label'].to(cuda_device))
        train_loss.backward()
        optimizer.step()
    # validation: average the macro-F1 of the two datasets
    with torch.no_grad():
        model.eval()
        labels=[]
        predicts=[]
        for batch in valid_dataloader1:
            input_ids=batch['input_ids'].to(cuda_device)
            token_type_ids=batch['token_type_ids'].to(cuda_device)
            attention_mask=batch['attention_mask'].to(cuda_device)
            outputs=model(input_ids,token_type_ids,attention_mask,1)
            labels.extend([i.item() for i in batch['label']])
            predicts.extend([i.item() for i in torch.argmax(outputs,1)])
        f11=f1_score(labels,predicts,average='macro')
        labels=[]
        predicts=[]
        for batch in valid_dataloader2:
            input_ids=batch['input_ids'].to(cuda_device)
            token_type_ids=batch['token_type_ids'].to(cuda_device)
            attention_mask=batch['attention_mask'].to(cuda_device)
            outputs=model(input_ids,token_type_ids,attention_mask,2)
            labels.extend([i.item() for i in batch['label']])
            predicts.extend([i.item() for i in torch.argmax(outputs,1)])
        f12=f1_score(labels,predicts,average='macro')
        f1=(f11+f12)/2
        if f1>max_valid_f1:
            best_model=deepcopy(model.state_dict())
            max_valid_f1=f1

# testing
model.load_state_dict(best_model)
with torch.no_grad():
    model.eval()
    labels=[]
    predicts=[]
    for batch in test_dataloader1:
        input_ids=batch['input_ids'].to(cuda_device)
        token_type_ids=batch['token_type_ids'].to(cuda_device)
        attention_mask=batch['attention_mask'].to(cuda_device)
        outputs=model(input_ids,token_type_ids,attention_mask,1)
        labels.extend([i.item() for i in batch['label']])
        predicts.extend([i.item() for i in torch.argmax(outputs,1)])
    print(accuracy_score(labels,predicts))
    print(precision_score(labels,predicts,average='macro'))
    print(recall_score(labels,predicts,average='macro'))
    print(f1_score(labels,predicts,average='macro'))
    labels=[]
    predicts=[]
    for batch in test_dataloader2:
        input_ids=batch['input_ids'].to(cuda_device)
        token_type_ids=batch['token_type_ids'].to(cuda_device)
        attention_mask=batch['attention_mask'].to(cuda_device)
        outputs=model(input_ids,token_type_ids,attention_mask,2)
        labels.extend([i.item() for i in batch['label']])
        predicts.extend([i.item() for i in torch.argmax(outputs,1)])
    print(accuracy_score(labels,predicts))
    print(precision_score(labels,predicts,average='macro'))
    print(recall_score(labels,predicts,average='macro'))
    print(f1_score(labels,predicts,average='macro'))

Single-task results:
(I have no idea why the second dataset turns out like this, but that is genuinely what the printed results show!)

dataset                  accuracy   macro-P   macro-R   macro-F   runtime
weibo_senti_100k         90.04      50        45.02     47.38     32min
simplifyweibo_4_moods    0          0         0         0         2h

Multi-task results (total runtime: 2h30min):

dataset                  accuracy   macro-P   macro-R   macro-F
weibo_senti_100k         85.54      88.62     85.69     85.29
simplifyweibo_4_moods    57.33      43.07     30.15     27.81
