HuggingFace TAPAS Model Notes

Google's TAPAS models are available on HuggingFace in many variants, both pre-trained checkpoints and already fine-tuned models. These notes try out a few of them and record how they were fine-tuned.

First, the most commonly used models. The HuggingFace Table QA model list contains many models; a commonly used one is:

google/tapas-base-finetuned-wtq

(19.2k downloads) How does this model differ from the base TAPAS model of the original paper?

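version: The model comes in two versions that differ in their position embeddings. One uses relative position indices (the position index restarts at the beginning of every table cell; this is the default and tends to work better), the other uses absolute position indices.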
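To make the difference concrete, here is a minimal sketch of per-cell position resetting. It is an illustration only, not the library's implementation: the function names and the (column, row) input format are assumptions made for this example, and question tokens are assumed to share the pair (0, 0).

```python
def relative_position_ids(cell_coords):
    """Return one position index per token, restarting at 0 when the cell changes.

    cell_coords: one (column, row) pair per token, in sequence order.
    """
    positions = []
    previous = None
    offset = 0
    for coord in cell_coords:
        if coord != previous:
            offset = 0  # a new cell starts: restart counting
        positions.append(offset)
        offset += 1
        previous = coord
    return positions


def absolute_position_ids(cell_coords):
    """Plain 0..n-1 positions over the whole sequence."""
    return list(range(len(cell_coords)))


tokens = [(0, 0), (0, 0), (1, 1), (1, 1), (2, 1)]
print(relative_position_ids(tokens))  # [0, 1, 0, 1, 0]
print(absolute_position_ids(tokens))  # [0, 1, 2, 3, 4]
```

In the transformers library this behaviour is, as far as I know, controlled by the `reset_position_index_per_cell` option of `TapasConfig`.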

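intermediate pre-training (skip this if you are not interested): This is an extra stage inserted between TAPAS pre-training and fine-tuning, and many TAPAS-based models on HuggingFace use it. Both steps below use Wikipedia tables and their associated text. Only tables with at least two rows and two columns are kept (header plus three rows), and each table is recursively split into an upper half and a lower half until every piece has at most 50 cells. This yields 3.7 million tables.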
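The recursive splitting can be sketched as follows. This is a rough illustration under two assumptions that the notes do not confirm: the header row is copied into every piece, and it counts toward the cell total. The name `split_table` is hypothetical.

```python
def split_table(header, rows, max_cells=50):
    """Recursively halve the rows until each piece has at most max_cells cells.

    Assumptions for this sketch: the header is repeated in every piece
    and is counted as one row of cells.
    """
    n_cells = (len(rows) + 1) * len(header)
    if n_cells <= max_cells or len(rows) <= 1:
        return [(header, rows)]
    mid = len(rows) // 2
    return (split_table(header, rows[:mid], max_cells)
            + split_table(header, rows[mid:], max_cells))


header = ["Rank", "Player", "Country", "Earnings", "Events"]
rows = [[str(i)] * len(header) for i in range(30)]
pieces = split_table(header, rows)
print([len(piece_rows) for _, piece_rows in pieces])  # [7, 8, 7, 8]
```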

Specifically, there are two steps:

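Synthetic Statements: This step synthesizes sentences and SQL over Wikipedia tables to improve the model's ability to handle numerical operations and comparisons. Table-related sentences and their SQL are generated from templates, and the sentences are perturbed with some probability. The SQL grammar rules were shown in a figure (not available here). In that grammar, angle brackets <> enclose the SQL parts, and the rest are phrases of the generated sentence. A sentence and its corresponding SQL are generated from the grammar, and a numeric comparison phrase is added before or after the normal sentence. The resulting sentences may not read fluently, but that does not matter. For example:

example table: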

Rank	Player	Country	Earnings	Events	Wins
1	Greg Norman	Australia	1,654,959	16	3
2	Billy Mayfair	United States	1,543,192	28	2
3	Lee Janzen	United States	1,378,966	28	3
4	Corey Pavin	United States	1,340,079	22	2
5	Steve Elkington	Australia	1,254,352	21	2

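Synthetic sentence: 2 is less than wins when Player is Lee Janzen.
SQL: SELECT wins FROM table WHERE player = "Lee Janzen"
Result: the SQL returns 3, and "2 is less than 3" is true, so this is a positive example.
Negative sentence: 3 is less than wins when Player is Lee Janzen. This one is a negative example.

3.7 million pairs are generated this way. The model takes the generated sentence and the table as input and outputs whether the sentence is true or false.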
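The labelling of such a statement can be sketched in a few lines. This is a simplified illustration, not the paper's generator: the `select` helper only mimics the single SELECT ... WHERE query from the example, and all names here are hypothetical.

```python
import operator

# The example table, reduced to the columns the statement needs.
TABLE = [
    {"Player": "Greg Norman", "Country": "Australia", "Wins": 3},
    {"Player": "Billy Mayfair", "Country": "United States", "Wins": 2},
    {"Player": "Lee Janzen", "Country": "United States", "Wins": 3},
    {"Player": "Corey Pavin", "Country": "United States", "Wins": 2},
    {"Player": "Steve Elkington", "Country": "Australia", "Wins": 2},
]

COMPARATORS = {"is less than": operator.lt, "is greater than": operator.gt}


def select(table, column, where_column, where_value):
    """Mimic: SELECT column FROM table WHERE where_column = where_value."""
    return [row[column] for row in table if row[where_column] == where_value]


def label_statement(value, comparator, column, where_column, where_value):
    """True if '<value> <comparator> <column> when <where_column> is <where_value>' holds."""
    result = select(TABLE, column, where_column, where_value)
    return len(result) == 1 and COMPARATORS[comparator](value, result[0])


print(label_statement(2, "is less than", "Wins", "Player", "Lee Janzen"))  # True
print(label_statement(3, "is less than", "Wins", "Player", "Lee Janzen"))  # False
```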

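Counterfactual Statements: Here an entity in the text near a Wikipedia table is replaced. For example, the original sentence is "Greg Norman has the highest earnings", and after replacement it becomes "Steve Elkington has the highest earnings". The model's task is to decide whether the sentence has been altered. Some of my understanding here may not be entirely correct; if you are interested, see the GitHub description and Sections 3.1 and 3.2 of the paper.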
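A toy version of the entity swap, assuming the entity mention and its replacement are already known. How the real pipeline picks the replacement is not covered in these notes, and `make_counterfactual` is a hypothetical name.

```python
def make_counterfactual(sentence, entity, replacement):
    """Swap one entity mention.

    Returns (sentence, label) where label is 1 if the sentence was altered
    and 0 if it was left unchanged.
    """
    if entity not in sentence or entity == replacement:
        return sentence, 0
    return sentence.replace(entity, replacement), 1


original = "Greg Norman has the highest earnings"
print(make_counterfactual(original, "Greg Norman", "Steve Elkington"))
# ('Steve Elkington has the highest earnings', 1)
```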

finetune

The model is fine-tuned in sequence on SQA, then WikiSQL, and finally WTQ.

code test

Environment

torch.__version__ = '1.6.0+cu101'
torch-scatter (this package is required): pip install --no-index torch-scatter -f https://pytorch-geometric.com/whl/torch-1.6.0+cu101.html
The torch build and the wheel index only need to match each other; which version you use does not matter. The official site gives 1.8.0+cu101.
torch-sparse: pip install --no-index torch-sparse -f https://pytorch-geometric.com/whl/torch-1.6.0+cu101.html

import pandas as pd
from transformers import pipeline

tqa = pipeline(task="table-question-answering", model="google/tapas-base-finetuned-wtq")

table = pd.DataFrame({
    "fund": ["CSI 300", "Bank of communications", "China Shipping"],
    "type": ["income", "hybrid", "income"],
    "annual increase": [0.15, 0.18, 0.16],
})
# the TAPAS tokenizer expects every cell to be a string
table = table.astype(str)

query = ["Which funds its type is income?", 
         "Which fund has the highest annual return?",
         "Is there a fund of income type?",
         "What are the funds with an annual increase of more than 0.2?",
         "Are there any funds with annual returns higher than 20%?",
         "Which funds its type is investment?"]
answer = tqa(table=table, query=query)
for ans in answer:
    print(ans["answer"])

Results:

CSI 300
Bank of communications
CSI 300, China Shipping
Bank of communications
Bank of communications
Bank of communications

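This part is simple: it just calls the model directly. There is one problem, though. Every example in the fine-tuning data has an answer, so inference never returns None either. If a question has no answer in the table, the model still returns something, even though it is wrong.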
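One possible workaround, which these notes do not test, is to inspect per-cell probabilities yourself and treat "no cell above a threshold" as no answer. The sketch below only shows that decision rule on made-up probabilities; `answer_or_none` is a hypothetical helper, and whether the rule is reliable for a model fine-tuned only on answerable questions is not established here.

```python
def answer_or_none(cell_probs, threshold=0.5):
    """Return the coordinates of cells above the threshold, or None if there are none.

    cell_probs: dict mapping (column, row) to a probability in [0, 1].
    """
    selected = sorted(coord for coord, prob in cell_probs.items() if prob > threshold)
    return selected or None


print(answer_or_none({(0, 0): 0.9, (0, 2): 0.8, (1, 1): 0.1}))  # [(0, 0), (0, 2)]
print(answer_or_none({(0, 0): 0.3, (1, 1): 0.1}))               # None
```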

google/tapas-base-finetuned-sqa model

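This model differs from google/tapas-base-finetuned-wtq in that it supports sequential (conversational) question answering. Everything else is the same, except that the final fine-tuning used only the SQA dataset, which consists of sequential question-answer data. Because the questions build on each other, it is not run through the transformers pipeline as above. The following code, also found online, handles it: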

import collections
import numpy as np
import pandas as pd
import torch
from transformers import TapasForQuestionAnswering, TapasTokenizer

model = TapasForQuestionAnswering.from_pretrained(
    "google/tapas-base-finetuned-sqa")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

# initialize the tokenizer
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")


# Runs compute on the model specified
def compute_prediction_sequence(model, data, device):
    """Computes predictions using model's answers to the previous questions."""

    # prepare data
    input_ids = data["input_ids"].to(device)
    attention_mask = data["attention_mask"].to(device)
    token_type_ids = data["token_type_ids"].to(device)

    all_logits = []
    prev_answers = None

    num_batch = data["input_ids"].shape[0]

    for idx in range(num_batch):

        if prev_answers is not None:
            coords_to_answer = prev_answers[idx]
            # fetch the current example before reading or writing its columns
            token_type_ids_example = token_type_ids[idx]  # shape (seq_len, 7)
            # Next, set the label ids predicted by the model
            # shape (seq_len,)
            prev_label_ids_example = token_type_ids_example[:, 3]
            model_label_ids = np.zeros_like(
                prev_label_ids_example.cpu().numpy())  # shape (seq_len,)

            # for each token in the sequence:
            segment_ids = token_type_ids_example[:, 0].tolist()
            col_ids = token_type_ids_example[:, 1].tolist()
            row_ids = token_type_ids_example[:, 2].tolist()
            for i in range(model_label_ids.shape[0]):
                col_id = col_ids[i] - 1
                row_id = row_ids[i] - 1
                if row_id >= 0 and col_id >= 0 and segment_ids[i] == 1:
                    model_label_ids[i] = int(
                        coords_to_answer[(col_id, row_id)])

            # set the prev label ids of the example (shape (seq_len,))
            token_type_ids_example[:, 3] = torch.from_numpy(
                model_label_ids).type(torch.long).to(device)

        prev_answers = {}
        # get the example
        input_ids_example = input_ids[idx]  # shape (seq_len,)
        attention_mask_example = attention_mask[idx]  # shape (seq_len,)
        token_type_ids_example = token_type_ids[idx]  # shape (seq_len, 7)
        # forward pass to obtain the logits
        outputs = model(input_ids=input_ids_example.unsqueeze(0),
                        attention_mask=attention_mask_example.unsqueeze(0),
                        token_type_ids=token_type_ids_example.unsqueeze(0))
        logits = outputs.logits
        all_logits.append(logits)

        # convert logits to probabilities (which are of shape (1, seq_len))
        dist_per_token = torch.distributions.Bernoulli(logits=logits)
        probabilities = dist_per_token.probs * \
            attention_mask_example.type(torch.float32).to(
                dist_per_token.probs.device)

        # Compute average probability per cell, aggregating over tokens.
        # Dictionary maps coordinates to a list of one or more probabilities
        coords_to_probs = collections.defaultdict(list)
        prev_answers = {}
        for i, p in enumerate(probabilities.squeeze().tolist()):
            segment_id = token_type_ids_example[:, 0].tolist()[i]
            col = token_type_ids_example[:, 1].tolist()[i] - 1
            row = token_type_ids_example[:, 2].tolist()[i] - 1
            if col >= 0 and row >= 0 and segment_id == 1:
                coords_to_probs[(col, row)].append(p)

        # Next, map cell coordinates to 1 or 0 (depending on whether the mean prob of all cell tokens is > 0.5)
        coords_to_answer = {}
        for key in coords_to_probs:
            coords_to_answer[key] = np.array(coords_to_probs[key]).mean() > 0.5
        prev_answers[idx+1] = coords_to_answer

    logits_batch = torch.cat(tuple(all_logits), 0)

    return logits_batch


def get_answers():
    data = {
        'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
        'Age': ["56", "45", "59"],
        'Number of movies': ["87", "53", "69"],
        'Date of birth': ["7 february 1967", "10 june 1996", "28 november 1967"]}

    # pass the follow-up questions as a simple list of queries
    queries = ["How many movies has George Clooney played in?",
               "How old is he?", "What's his date of birth?"]

    # convert the dict to a DataFrame
    table = pd.DataFrame.from_dict(data)

    inputs = tokenizer(table=table, queries=queries,
                       padding='max_length', return_tensors="pt")
    logits = compute_prediction_sequence(model, inputs, device)

    predicted_answer_coordinates, = tokenizer.convert_logits_to_predictions(
        inputs, logits.cpu().detach())

    answers = []
    for coordinates in predicted_answer_coordinates:
        if len(coordinates) == 1:
            # only a single cell:
            answers.append(table.iat[coordinates[0]])
        else:
            # multiple cells
            cell_values = []
            for coordinate in coordinates:
                cell_values.append(table.iat[coordinate])
            # join once, after all cells of this answer have been collected
            answers.append(", ".join(cell_values))

    for query, answer in zip(queries, answers):
        print(query)
        print("Predicted answer: " + answer)


get_answers()
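The step in compute_prediction_sequence that turns token probabilities into per-cell answers can be isolated as a small pure-Python sketch. It follows the same rule as the code above (mean probability of a cell's tokens greater than 0.5), but runs on made-up numbers instead of model logits; the function name `cells_from_token_probs` is hypothetical.

```python
from collections import defaultdict


def cells_from_token_probs(token_probs, token_coords, threshold=0.5):
    """Average token probabilities per cell, then mark each cell as selected or not.

    token_coords holds one (column, row) pair per token; negative values
    mark tokens that do not belong to a table cell.
    """
    coords_to_probs = defaultdict(list)
    for prob, (col, row) in zip(token_probs, token_coords):
        if col >= 0 and row >= 0:
            coords_to_probs[(col, row)].append(prob)
    return {coord: sum(probs) / len(probs) > threshold
            for coord, probs in coords_to_probs.items()}


probs = [0.1, 0.9, 0.7, 0.2, 0.6]
coords = [(-1, -1), (0, 0), (0, 0), (1, 0), (1, 0)]
print(cells_from_token_probs(probs, coords))  # {(0, 0): True, (1, 0): False}
```

The resulting dictionary plays the role of coords_to_answer: for the next question, tokens of the selected cells get their previous-label id set to 1, which is how the model is told what the previous answer was.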

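These notes briefly covered the two most-used TAPAS models on HuggingFace. Each has its drawbacks: the WTQ model cannot do sequential question answering, the SQA model cannot handle aggregation operations, and neither handles questions that have no answer. If you want a more complete model, you need to fine-tune one yourself. I will write up the fine-tuning process later when time permits.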
