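HuggingFace TAPAS Model Notes: Google's TAPAS model is available on Hugging Face in many pre-trained and fine-tuned variants, so this post records some experiments with those models, along with the fine-tuning process.
First, a look at the most commonly used fine-tuned models. The Hugging Face Table QA model list contains many of them; a commonly used one is: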
google/tapas-base-finetuned-wtq
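(19.2k downloads). How does this model differ from the base TAPAS model of the original paper?
First, the model comes in two versions that differ in their position embeddings: one uses relative position indices (the position index is reset at the start of every table cell; this is the default and works better), and the other uses absolute position indices.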
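To make the difference between the two indexing schemes concrete, here is a minimal sketch. It is a simplification under stated assumptions: it ignores the question tokens and special tokens that precede the table in a real TAPAS input, and the function names are hypothetical, not part of any library.

```python
def absolute_positions(cells):
    """Position ids that keep counting across the whole token sequence."""
    total = sum(len(tokens) for tokens in cells)
    return list(range(total))


def relative_positions(cells):
    """Position ids that restart at 0 at the beginning of every cell."""
    ids = []
    for tokens in cells:
        ids.extend(range(len(tokens)))
    return ids


# Three cells, already split into tokens (illustrative tokenization only).
cells = [["greg", "norman"], ["australia"], ["1", ",", "654"]]
abs_ids = absolute_positions(cells)
rel_ids = relative_positions(cells)
print(abs_ids)
print(rel_ids)
```

With relative indices, a token's position only says where it sits inside its own cell, so the same cell content gets the same position ids wherever it appears in the table.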
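Second (skip this part if you are not interested), many TAPAS-based models on Hugging Face use an intermediate pre-training stage that sits between TAPAS pre-training and fine-tuning. Everything below uses Wikipedia tables and their related text. Only Wikipedia tables with at least two rows and two columns are kept (a header plus three rows), and each table is split recursively into an upper half and a lower half until it has at most 50 cells. This yields 3.7 million tables.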
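The recursive split described above can be sketched as follows. This is only an illustration under assumptions: the function name is hypothetical, the header row is left out of the cell count (the text does not say whether it is counted), and the real preprocessing code may differ.

```python
def split_table(rows, num_columns, max_cells=50):
    """Recursively split data rows into upper/lower halves until each
    piece has at most max_cells cells (or cannot be split further)."""
    if len(rows) * num_columns <= max_cells or len(rows) <= 1:
        return [rows]
    middle = len(rows) // 2
    upper = split_table(rows[:middle], num_columns, max_cells)
    lower = split_table(rows[middle:], num_columns, max_cells)
    return upper + lower


# A dummy table with 30 data rows and 6 columns (180 cells).
rows = [[f"r{r}c{c}" for c in range(6)] for r in range(30)]
pieces = split_table(rows, num_columns=6)
piece_sizes = [len(piece) for piece in pieces]
print(piece_sizes)
```

Each resulting piece would then be paired with the original header row to form a smaller table.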
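Concretely, this stage has two steps:
Synthetic statements. This step synthesizes sentences and SQL over the Wikipedia tables in order to improve the model's ability to handle numerical operations and comparisons. Sentences about a table, together with their SQL, are generated from templates, and the sentences are perturbed with some probability. In the generation grammar, SQL fragments are wrapped in angle brackets <> and everything else is a phrase of the generated sentence. Following this grammar, a sentence and its corresponding SQL are generated, and a numeric comparison phrase is added before or after the normal sentence. The resulting sentences may not read fluently, but that does not matter. For example:
Example table:
Rank  Player           Country        Earnings   Events  Wins
1     Greg Norman      Australia      1,654,959  16      3
2     Billy Mayfair    United States  1,543,192  28      2
3     Lee Janzen       United States  1,378,966  28      3
4     Corey Pavin      United States  1,340,079  22      2
5     Steve Elkington  Australia      1,254,352  21      2
Synthetic sentence: 2 is less than wins when Player is Lee Janzen.
SQL: SELECT wins FROM table WHERE player = "Lee Janzen"
Result: the SQL returns 3, and "2 is less than 3" is true, so the sentence is a positive example.
Negative example: 3 is less than wins when Player is Lee Janzen. The SQL still returns 3, and "3 is less than 3" is false, so the sentence is a negative example.
3.7 million such pairs are generated. The model takes the generated sentence and the table as input and outputs whether the sentence is true or false.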
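The labeling of the two example sentences above can be reproduced with a few lines of plain Python. This is a minimal sketch, not the actual data generator: the helper names are hypothetical, and only the one template from the example ("N is less than COLUMN when KEY is VALUE") is handled.

```python
# Rows of the example table; only the columns used below are kept.
TABLE = [
    {"Player": "Greg Norman", "Wins": 3},
    {"Player": "Billy Mayfair", "Wins": 2},
    {"Player": "Lee Janzen", "Wins": 3},
    {"Player": "Corey Pavin", "Wins": 2},
    {"Player": "Steve Elkington", "Wins": 2},
]


def select_value(table, column, where_column, where_value):
    """Mimic: SELECT column FROM table WHERE where_column = where_value."""
    matches = [row[column] for row in table if row[where_column] == where_value]
    if len(matches) != 1:
        raise ValueError("expected exactly one matching row")
    return matches[0]


def label_statement(table, number, column, where_column, where_value):
    """Label '<number> is less than <column> when <where_column> is <where_value>'."""
    return number < select_value(table, column, where_column, where_value)


positive = label_statement(TABLE, 2, "Wins", "Player", "Lee Janzen")  # 2 < 3
negative = label_statement(TABLE, 3, "Wins", "Player", "Lee Janzen")  # 3 < 3
print(positive, negative)
```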
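Counterfactual statements. Here an entity in the text that appears near a Wikipedia table is replaced. For example, the original sentence is "Greg Norman has the highest earnings", and after replacement it becomes "Steve Elkington has the highest earnings". The model's task is to decide whether the sentence has been altered.
Some of my understanding here may not be entirely correct; if you are interested, see the GitHub description and Sections 3.1 and 3.2 of the paper.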
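The entity replacement described above can be sketched like this. It is a toy illustration under assumptions: the function name is hypothetical, and the replacement is drawn at random from the same table column, which is my reading of the idea rather than a confirmed detail of the real pipeline.

```python
import random

PLAYERS = ["Greg Norman", "Billy Mayfair", "Lee Janzen",
           "Corey Pavin", "Steve Elkington"]


def make_counterfactual(sentence, entity, column_values, rng):
    """Swap `entity` in `sentence` for a different value from the same column."""
    if entity not in sentence:
        raise ValueError("entity does not occur in the sentence")
    candidates = [value for value in column_values if value != entity]
    replacement = rng.choice(candidates)
    return sentence.replace(entity, replacement, 1), replacement


original = "Greg Norman has the highest earnings"
rng = random.Random(0)  # fixed seed so the sketch is repeatable
counterfactual, replacement = make_counterfactual(
    original, "Greg Norman", PLAYERS, rng)

# Training pairs: (sentence, was_replaced)
examples = [(original, False), (counterfactual, True)]
print(examples)
```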
Finally, fine-tuning: the model is fine-tuned sequentially on SQA, then WikiSQL, and finally WTQ.
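Code test
Environment
- torch.__version__ == '1.6.0+cu101'
- torch-scatter (this package is required):
pip install --no-index torch-scatter -f https://pytorch-geometric.com/whl/torch-1.6.0+cu101.html
The torch version and the wheel index only need to match each other; which version you use does not matter. The official site gives 1.8.0+cu101.
- torch-sparse:
pip install --no-index torch-sparse -f https://pytorch-geometric.com/whl/torch-1.6.0+cu101.html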
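Since the wheel index has to match the installed torch build, a small helper can derive the pip command from a version string. This is a sketch that simply follows the URL pattern of the commands above; the function names are hypothetical, and it assumes the version string has the form "1.6.0+cu101" (release, a plus sign, then the CUDA tag).

```python
def wheel_index_url(torch_version):
    """Build the wheel index URL for a version string such as '1.6.0+cu101'."""
    release, _, cuda_tag = torch_version.partition("+")
    if not cuda_tag:
        raise ValueError("expected a version like '1.6.0+cu101'")
    return f"https://pytorch-geometric.com/whl/torch-{release}+{cuda_tag}.html"


def pip_command(package, torch_version):
    """Return the pip install command for `package` matching the torch build."""
    return f"pip install --no-index {package} -f {wheel_index_url(torch_version)}"


scatter_cmd = pip_command("torch-scatter", "1.6.0+cu101")
sparse_cmd = pip_command("torch-sparse", "1.6.0+cu101")
print(scatter_cmd)
print(sparse_cmd)
```

In practice the version string would come from torch.__version__ in the target environment.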
import pandas as pd
from transformers import pipeline

tqa = pipeline(task="table-question-answering", model="google/tapas-base-finetuned-wtq")
table = pd.DataFrame({
    "fund": ["CSI 300", "Bank of communications", "China Shipping"],
    "type": ["income", "hybrid", "income"],
    "annual increase": [0.15, 0.18, 0.16],
})
# the TAPAS tokenizer expects every cell to be a string
table = table.astype(str)
query = ["Which funds its type is income?",
"Which fund has the highest annual return?",
"Is there a fund of income type?",
"What are the funds with an annual increase of more than 0.2?",
"Are there any funds with annual returns higher than 20%?",
"Which funds its type is investment?"]
answer = tqa(table=table, query=query)
for ans in answer:
    print(ans["answer"])
Output:
CSI 300
Bank of communications
CSI 300, China Shipping
Bank of communications
Bank of communications
Bank of communications
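This part is simple: it just calls the model directly. There is one problem, though. All of the data used for fine-tuning has an answer, so inference never returns None: when a question has no answer in the table, the model still returns a result, even though it is wrong.
google/tapas-base-finetuned-sqa model
The difference from google/tapas-base-finetuned-wtq is that this model supports sequential (conversational) question answering. Everything else is the same, except that the last fine-tuning step uses only the SQA dataset, since that dataset consists of sequential question-answer data. The plain pipeline call used above does not carry the answer of one question over to the next, so it is not used here. Here is a piece of code, also found online: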
import collections
import numpy as np
import pandas as pd
import torch
from transformers import TapasForQuestionAnswering, TapasTokenizer
model = TapasForQuestionAnswering.from_pretrained(
    "google/tapas-base-finetuned-sqa")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# initialize the tokenizer
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")
# Runs compute on the model specified
def compute_prediction_sequence(model, data, device):
    """Computes predictions using model's answers to the previous questions."""
    # prepare data
    input_ids = data["input_ids"].to(device)
    attention_mask = data["attention_mask"].to(device)
    token_type_ids = data["token_type_ids"].to(device)
    all_logits = []
    prev_answers = None
    num_batch = data["input_ids"].shape[0]
    for idx in range(num_batch):
        # token type ids of the current question, shape (seq_len, 7)
        token_type_ids_example = token_type_ids[idx]
        if prev_answers is not None:
            coords_to_answer = prev_answers[idx]
            # Next, set the label ids predicted by the model
            prev_label_ids_example = token_type_ids_example[:, 3]  # shape (seq_len,)
            model_label_ids = np.zeros_like(
                prev_label_ids_example.cpu().numpy())  # shape (seq_len,)
            # for each token in the sequence:
            for i in range(model_label_ids.shape[0]):
                segment_id = token_type_ids_example[:, 0].tolist()[i]
                col_id = token_type_ids_example[:, 1].tolist()[i] - 1
                row_id = token_type_ids_example[:, 2].tolist()[i] - 1
                if row_id >= 0 and col_id >= 0 and segment_id == 1:
                    model_label_ids[i] = int(
                        coords_to_answer[(col_id, row_id)])
            # write the previous answer into column 3 (prev_labels) in place
            token_type_ids_example[:, 3] = torch.from_numpy(
                model_label_ids).type(torch.long).to(device)
        prev_answers = {}
        # get the example
        input_ids_example = input_ids[idx]  # shape (seq_len,)
        attention_mask_example = attention_mask[idx]  # shape (seq_len,)
        # forward pass to obtain the logits
        outputs = model(input_ids=input_ids_example.unsqueeze(0),
                        attention_mask=attention_mask_example.unsqueeze(0),
                        token_type_ids=token_type_ids_example.unsqueeze(0))
        logits = outputs.logits
        all_logits.append(logits)
        # convert logits to probabilities (which are of shape (1, seq_len))
        dist_per_token = torch.distributions.Bernoulli(logits=logits)
        probabilities = dist_per_token.probs * \
            attention_mask_example.type(torch.float32).to(
                dist_per_token.probs.device)
        # Compute average probability per cell, aggregating over tokens.
        # Dictionary maps coordinates to a list of one or more probabilities
        coords_to_probs = collections.defaultdict(list)
        for i, p in enumerate(probabilities.squeeze().tolist()):
            segment_id = token_type_ids_example[:, 0].tolist()[i]
            col = token_type_ids_example[:, 1].tolist()[i] - 1
            row = token_type_ids_example[:, 2].tolist()[i] - 1
            if col >= 0 and row >= 0 and segment_id == 1:
                coords_to_probs[(col, row)].append(p)
        # Next, map cell coordinates to 1 or 0 (depending on whether the mean prob of all cell tokens is > 0.5)
        coords_to_answer = {}
        for key in coords_to_probs:
            coords_to_answer[key] = np.array(coords_to_probs[key]).mean() > 0.5
        # the next question (idx + 1) reads these answers as its previous labels
        prev_answers[idx + 1] = coords_to_answer
    logits_batch = torch.cat(tuple(all_logits), 0)
    return logits_batch
def get_answers():
    data = {
        'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
        'Age': ["56", "45", "59"],
        'Number of movies': ["87", "53", "69"],
        'Date of birth': ["7 february 1967", "10 june 1996", "28 november 1967"]}
    # pass the questions as a list of simple queries, in conversation order
    queries = ["How many movies has George Clooney played in?",
               "How old is he?", "What's his date of birth?"]
    # convert the dict to a DataFrame (all cells are already strings)
    table = pd.DataFrame.from_dict(data)
    inputs = tokenizer(table=table, queries=queries,
                       padding='max_length', return_tensors="pt")
    logits = compute_prediction_sequence(model, inputs, device)
    predicted_answer_coordinates, = tokenizer.convert_logits_to_predictions(
        inputs, logits.cpu().detach())
    answers = []
    for coordinates in predicted_answer_coordinates:
        if len(coordinates) == 1:
            # only a single cell:
            answers.append(table.iat[coordinates[0]])
        else:
            # multiple cells
            cell_values = []
            for coordinate in coordinates:
                cell_values.append(table.iat[coordinate])
            answers.append(", ".join(cell_values))
    for query, answer in zip(queries, answers):
        print(query)
        print("Predicted answer: " + answer)


get_answers()
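The key step in compute_prediction_sequence is turning per-token probabilities into per-cell decisions: the probabilities of all tokens in a cell are averaged, and the cell counts as selected when the mean exceeds 0.5. Below is a minimal, dependency-free sketch of just that step. The function name and the toy numbers are made up for illustration, and None is used here as a stand-in for question and padding tokens.

```python
import collections


def cells_from_token_probs(token_probs, token_coords, threshold=0.5):
    """Average token probabilities per (col, row) cell and mark a cell
    as selected when its mean probability exceeds the threshold."""
    coords_to_probs = collections.defaultdict(list)
    for prob, coord in zip(token_probs, token_coords):
        if coord is not None:  # None marks non-table tokens in this sketch
            coords_to_probs[coord].append(prob)
    return {coord: sum(probs) / len(probs) > threshold
            for coord, probs in coords_to_probs.items()}


# Five tokens: one question token, two tokens in cell (2, 2), two in cell (0, 2).
token_probs = [0.1, 0.9, 0.7, 0.2, 0.4]
token_coords = [None, (2, 2), (2, 2), (0, 2), (0, 2)]
selected = cells_from_token_probs(token_probs, token_coords)
print(selected)
```

The resulting mapping is what gets written into the prev_labels channel of the next question, which is how the model can resolve a follow-up such as "How old is he?".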
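This post briefly covered the two most-used models on Hugging Face, and each has its shortcomings: the WTQ model cannot do sequential question answering, the SQA model cannot handle aggregation operations, and neither handles questions that have no answer. For a more complete model you need to do your own fine-tuning; I will write up the fine-tuning process later when I have time.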