Notes on the HuggingFace TAPAS Models

Google's TAPAS model is available on Hugging Face in many pre-trained and fine-tuned variants, so here I try out the models, and later the fine-tuning process.

First, a look at the most commonly used model. The Hugging Face table question answering collection contains many models, for example this popular one:

google/tapas-base-finetuned-wtq

(19.2k downloads). How does this model differ from the base model in the original TAPAS paper?

  • First, this model exists in two versions that differ in their position embeddings: one uses relative position indices (the position index is reset at the start of every table cell; this is the default and usually performs better), the other uses absolute position indices. In transformers this corresponds to the reset_position_index_per_cell flag of TapasConfig; the absolute-position variants are published under the no_reset revision on the Hub.

  • (Skip this point if you are not interested.) This is an intermediate pre-training step inserted between TAPAS's base pre-training and fine-tuning, and many TAPAS-based models on Hugging Face use it. It works on Wikipedia tables and their related text: tables are kept only if they have at least two rows and two columns (a header plus data rows), and each table is recursively split into a top half and a bottom half until the pieces have at most 50 cells. This yields about 3.7 million tables.

    Specifically, this is done in two steps:

    • Synthetic statements: sentences and SQL are synthesized over Wikipedia tables to improve the model's handling of numerical operations and comparisons. Concretely, a statement about the table and its corresponding SQL are generated from templates, and the statement is perturbed with some probability. The grammar rules are as follows [figure: the statement/SQL grammar from the paper]: parts wrapped in angle brackets <> are SQL phrases, and the rest are phrases of the generated sentence. Following this grammar, a statement and its corresponding SQL are generated, and a numeric-comparison phrase is added before or after the plain statement. Such a sentence may not read fluently, but that does not matter. For example, given the table:

      Rank  Player           Country        Earnings   Events  Wins
      1     Greg Norman      Australia      1,654,959  16      3
      2     Billy Mayfair    United States  1,543,192  28      2
      3     Lee Janzen       United States  1,378,966  28      3
      4     Corey Pavin      United States  1,340,079  22      2
      5     Steve Elkington  Australia      1,254,352  21      2

      Synthetic statement: "2 is less than wins when Player is Lee Janzen." SQL: SELECT wins FROM table WHERE player = "Lee Janzen". Result: the SQL returns 3, and since "2 is less than 3" holds, the statement is a positive example. The counter-statement "3 is less than wins when Player is Lee Janzen." is a negative example. 3.7 million such pairs are generated; the model takes the generated statement and the table as input and outputs whether the statement is true or false.

    • Counterfactual statements: here an entity in the text near a Wikipedia table is replaced. For example, the original sentence is "Greg Norman has the highest earnings", and we replace it with "Steve Elkington has the highest earnings". The model's task is to judge whether the sentence was replaced. Some of this understanding may not be entirely correct; if interested, see the GitHub description and Sections 3.1 and 3.2 of the paper. A small sketch of this true/false interface follows the list.
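Not part of the original post, but as a minimal sketch of what this statement-verification interface looks like in transformers: the TabFact-finetuned checkpoint google/tapas-base-finetuned-tabfact exposes the same statement-plus-table, true-or-false setup described above.

import pandas as pd
from transformers import TapasTokenizer, TapasForSequenceClassification

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-tabfact")
model = TapasForSequenceClassification.from_pretrained(
    "google/tapas-base-finetuned-tabfact")

# a tiny slice of the example table above; TAPAS expects string cells
table = pd.DataFrame({
    "Player": ["Greg Norman", "Billy Mayfair", "Lee Janzen"],
    "Wins": ["3", "2", "3"],
})
statement = "2 is less than wins when Player is Lee Janzen."

inputs = tokenizer(table=table, queries=statement, return_tensors="pt")
logits = model(**inputs).logits
# the checkpoint's config maps class indices to labels (entailed vs. refuted)
print(model.config.id2label[logits.argmax(-1).item()])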

Fine-tuning

The model is fine-tuned sequentially on SQA, WikiSQL, and finally WTQ.
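The fine-tuning code itself is not part of this post; as a hedged sketch following the recipe in the transformers documentation (the head settings below are assumptions taken from that documentation, not the author's setup), a WTQ-style model can be warm-started from the SQA checkpoint roughly like this:

from transformers import TapasConfig, TapasForQuestionAnswering

# WTQ-style heads: 4 aggregation operators (NONE, SUM, AVERAGE, COUNT) and
# weak supervision from the answer text instead of gold cell coordinates
config = TapasConfig(num_aggregation_labels=4, use_answer_as_supervision=True)

# warm-start from the SQA checkpoint, mirroring the SQA -> WikiSQL -> WTQ chain;
# the new aggregation head is randomly initialized and must be fine-tuned
model = TapasForQuestionAnswering.from_pretrained(
    "google/tapas-base-finetuned-sqa", config=config)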

Code test

Environment

  • torch version: 1.6.0+cu101
  • torch-scatter (this package must be installed): pip install --no-index torch-scatter -f https://pytorch-geometric.com/whl/torch-1.6.0+cu101.html — the torch and torch-scatter versions just need to correspond; which exact version does not matter (the official site shows 1.8.0+cu101). A quick version check follows this list.
  • torch-sparse: pip install --no-index torch-sparse -f https://pytorch-geometric.com/whl/torch-1.6.0+cu101.html
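A quick way to check which wheel index matches the local install (a convenience snippet, not from the original post):

import torch

# the torch-scatter / torch-sparse wheel index must match this version string,
# e.g. '1.6.0+cu101' -> https://pytorch-geometric.com/whl/torch-1.6.0+cu101.html
print(torch.__version__)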
from transformers import pipeline
import pandas as pd

tqa = pipeline(task="table-question-answering",
               model="google/tapas-base-finetuned-wtq")

table = pd.DataFrame({
    "fund": ["CSI 300", "Bank of communications", "China Shipping"],
    "type": ["income", "hybrid", "income"],
    "annual increase": [0.15, 0.18, 0.16],
})
# TAPAS expects every table cell to be a string
table = table.astype(str)

query = ["Which funds its type is income?", 
         "Which fund has the highest annual return?",
         "Is there a fund of income type?",
         "What are the funds with an annual increase of more than 0.2?",
         "Are there any funds with annual returns higher than 20%?",
         "Which funds its type is investment?"]
answer = tqa(table=table, query=query)
for ans in answer:
    print(ans["answer"])

Results:

CSI 300
Bank of communications
CSI 300, China Shipping
Bank of communications
Bank of communications
Bank of communications

This part is very simple: just call the model directly. There is one problem, though: because the data the model was fine-tuned on always has an answer, inference never returns None either. If a question has no answer in the table, the model will still produce a result, even though it is wrong.
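One way to sanity-check such forced answers (an addition, not in the original post): continuing the example above, each result dict returned by the pipeline also carries the selected cell coordinates, the cell values, and the aggregation operator the model chose.

# continuing the example above: inspect the full pipeline output
for ans in tqa(table=table, query=query):
    # keys: 'answer', 'coordinates', 'cells', 'aggregator'
    print(ans["aggregator"], ans["coordinates"], ans["answer"])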

google/tapas-base-finetuned-sqa model

The difference from google/tapas-base-finetuned-wtq is that this model supports sequential (conversational) QA; everything else is the same, except that the final fine-tuning used only the SQA dataset, which consists of sequential questions. Because of the sequential QA, this one cannot be loaded and run with the transformers pipeline; here is some code, also found online:

import collections
import numpy as np
import pandas as pd
import torch
from transformers import TapasForQuestionAnswering, TapasTokenizer

model = TapasForQuestionAnswering.from_pretrained(
    "google/tapas-base-finetuned-sqa")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

# initialize the tokenizer (the vocabulary is shared across TAPAS checkpoints)
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")


# Runs compute on the model specified
def compute_prediction_sequence(model, data, device):
    """Computes predictions using model's answers to the previous questions."""

    # prepare data
    input_ids = data["input_ids"].to(device)
    attention_mask = data["attention_mask"].to(device)
    token_type_ids = data["token_type_ids"].to(device)

    all_logits = []
    prev_answers = None

    num_batch = data["input_ids"].shape[0]

    for idx in range(num_batch):

        if prev_answers is not None:
            coords_to_answer = prev_answers[idx]
            # Reuse last iteration's token_type_ids_example (all examples are
            # padded to the same length) to get the prev_labels shape (seq_len,)
            prev_label_ids_example = token_type_ids_example[:, 3]
            model_label_ids = np.zeros_like(
                prev_label_ids_example.cpu().numpy())  # shape (seq_len,)

            # mark every token whose cell was part of the answer
            # to the previous question
            token_type_ids_example = token_type_ids[idx]  # shape (seq_len, 7)
            for i in range(model_label_ids.shape[0]):
                segment_id = token_type_ids_example[:, 0].tolist()[i]
                col_id = token_type_ids_example[:, 1].tolist()[i] - 1
                row_id = token_type_ids_example[:, 2].tolist()[i] - 1
                if row_id >= 0 and col_id >= 0 and segment_id == 1:
                    model_label_ids[i] = int(
                        coords_to_answer[(col_id, row_id)])

            # write the previous answers into the prev_labels slot (index 3)
            # of the current example's token type ids, shape (1, seq_len)
            token_type_ids_example[:, 3] = torch.from_numpy(
                model_label_ids).type(torch.long).to(device)

        prev_answers = {}
        # get the example
        input_ids_example = input_ids[idx]  # shape (seq_len,)
        attention_mask_example = attention_mask[idx]  # shape (seq_len,)
        token_type_ids_example = token_type_ids[idx]  # shape (seq_len, 7)
        # forward pass to obtain the logits
        outputs = model(input_ids=input_ids_example.unsqueeze(0),
                        attention_mask=attention_mask_example.unsqueeze(0),
                        token_type_ids=token_type_ids_example.unsqueeze(0))
        logits = outputs.logits
        all_logits.append(logits)

        # convert logits to probabilities (which are of shape (1, seq_len))
        dist_per_token = torch.distributions.Bernoulli(logits=logits)
        probabilities = dist_per_token.probs * \
            attention_mask_example.type(torch.float32).to(
                dist_per_token.probs.device)

        # Compute average probability per cell, aggregating over tokens.
        # Dictionary maps coordinates to a list of one or more probabilities
        coords_to_probs = collections.defaultdict(list)
        prev_answers = {}
        for i, p in enumerate(probabilities.squeeze().tolist()):
            segment_id = token_type_ids_example[:, 0].tolist()[i]
            col = token_type_ids_example[:, 1].tolist()[i] - 1
            row = token_type_ids_example[:, 2].tolist()[i] - 1
            if col >= 0 and row >= 0 and segment_id == 1:
                coords_to_probs[(col, row)].append(p)

        # Next, map cell coordinates to 1 or 0 (depending on whether the mean prob of all cell tokens is > 0.5)
        coords_to_answer = {}
        for key in coords_to_probs:
            coords_to_answer[key] = np.array(coords_to_probs[key]).mean() > 0.5
        prev_answers[idx+1] = coords_to_answer

    logits_batch = torch.cat(tuple(all_logits), 0)

    return logits_batch


def get_answers():
    data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
            'Age': ["56", "45", "59"],
            'Number of movies': ["87", "53", "69"],
            'Date of birth': ["7 february 1967", "10 june 1996", "28 november 1967"]}

    # pass the questions as simple queries in array format; the follow-up
    # questions refer back to the previous answers ("he" = George Clooney)
    queries = ["How many movies has George Clooney played in?",
               "How old is he?", "What's his date of birth?"]

    # convert the dict to a dataframe
    table = pd.DataFrame.from_dict(data)

    inputs = tokenizer(table=table, queries=queries,
                       padding='max_length', return_tensors="pt")
    logits = compute_prediction_sequence(model, inputs, device)

    predicted_answer_coordinates, = tokenizer.convert_logits_to_predictions(
        inputs, logits.cpu().detach())

    answers = []
    for coordinates in predicted_answer_coordinates:
        if len(coordinates) == 1:
            # only a single cell:
            answers.append(table.iat[coordinates[0]])
        else:
            # multiple cells: join their values into a single answer string
            cell_values = []
            for coordinate in coordinates:
                cell_values.append(table.iat[coordinate])
            answers.append(", ".join(cell_values))

    for query, answer in zip(queries, answers):
        print(query)
        print("Predicted answer: " + answer)


get_answers()

This briefly introduced the two most-used TAPAS models on Hugging Face. Each has its drawbacks: the WTQ model cannot do sequential QA, the SQA model cannot handle aggregation operations, and both lack a no-answer case. So if you want a more complete model, you need to do the fine-tuning yourself; I will write up the fine-tuning process later when there is time.
