
molecular-graph-bert (Part 1)

This post walks through the molecular-graph BERT model. Paper: MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. Paper notes: MG-BERT | leveraging unsupervised atomic representation learning for molecular property prediction | molecular-graph BERT | GNN | unsupervised learning (masked-atom pretraining) | attention. Code: Molecular-graph-BERT. The missing dataset is logD7.4; process it as in the previous post (the Index column can be dropped). The code walkthrough starts from pretrain. The overall model framework: [figure: overall MG-BERT architecture]

Contents

  • 1.pretrain
    • 1.1.Graph_Bert_Dataset
      • 1.1.1.get_data
      • 1.1.2.tf_numerical_smiles
      • 1.1.3.numerical_smiles
      • 1.1.4.smiles2adjoin
      • 1.1.5.summary
    • 1.2.BertModel
      • 1.2.1.Encoder
      • 1.2.2.EncoderLayer
      • 1.2.3.point_wise_feed_forward_network
      • 1.2.4.MultiHeadAttention
      • 1.2.5.scaled_dot_product_attention
    • 1.3.run
      • 1.3.1.train_step


1.pretrain

os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
keras.backend.clear_session()
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
optimizer = tf.keras.optimizers.Adam(1e-4)

small = {'name': 'Small', 'num_layers': 3, 'num_heads': 4, 'd_model': 128,
         'path': 'small_weights', 'addH': True}
medium = {'name': 'Medium', 'num_layers': 6, 'num_heads': 8, 'd_model': 256,
          'path': 'medium_weights', 'addH': True}
medium3 = {'name': 'Medium', 'num_layers': 6, 'num_heads': 4, 'd_model': 256,
           'path': 'medium_weights3', 'addH': True}
large = {'name': 'Large', 'num_layers': 12, 'num_heads': 12, 'd_model': 576,
         'path': 'large_weights', 'addH': True}
medium_balanced = {'name': 'Medium', 'num_layers': 6, 'num_heads': 8, 'd_model': 256,
                   'path': 'weights_balanced', 'addH': True}
medium_without_H = {'name': 'Medium', 'num_layers': 6, 'num_heads': 8, 'd_model': 256,
                    'path': 'weights_without_H', 'addH': False}

arch = medium3   # small: 3 layers / 4 heads / 128; medium: 6 / 8 / 256; large: 12 / 12 / 576
num_layers = arch['num_layers']
num_heads = arch['num_heads']
d_model = arch['d_model']
addH = arch['addH']

dff = d_model * 2
vocab_size = 17
dropout_rate = 0.1
model = BertModel(num_layers=num_layers, d_model=d_model, dff=dff, num_heads=num_heads, vocab_size=vocab_size)
train_dataset, test_dataset = Graph_Bert_Dataset(path='data/chem.txt', smiles_field='CAN_SMILES', addH=addH).get_data()
  • Define the optimizer and hyperparameters; the multiple parameter dictionaries exist to compare the model under different hyperparameter settings. A BertModel is built from the chosen config, and the data is loaded via Graph_Bert_Dataset.
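One property worth checking on these dictionaries: multi-head attention splits d_model evenly across heads, so d_model must be divisible by num_heads. A minimal sketch (the check_config helper is illustrative, not part of the repo):

```python
# Sanity-check the hyperparameter dictionaries: multi-head attention splits
# d_model into num_heads slices, so d_model % num_heads must be 0.
configs = {
    'small':   {'num_layers': 3,  'num_heads': 4,  'd_model': 128},
    'medium':  {'num_layers': 6,  'num_heads': 8,  'd_model': 256},
    'medium3': {'num_layers': 6,  'num_heads': 4,  'd_model': 256},
    'large':   {'num_layers': 12, 'num_heads': 12, 'd_model': 576},
}

def check_config(cfg):
    assert cfg['d_model'] % cfg['num_heads'] == 0, 'd_model must split evenly across heads'
    return cfg['d_model'] // cfg['num_heads']   # per-head dimension

depths = {name: check_config(cfg) for name, cfg in configs.items()}
print(depths)  # {'small': 32, 'medium': 32, 'medium3': 64, 'large': 48}
```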

1.1.Graph_Bert_Dataset

"""
Atom frequencies in the pretraining corpus:
{'O': 5000757, 'C': 34130255, 'N': 5244317, 'F': 641901, 'H': 37237224,
 'S': 648962, 'Cl': 373453, 'P': 26195, 'Br': 76939, 'B': 2895, 'I': 9203,
 'Si': 1990, 'Se': 1860, 'Te': 104, 'As': 202, 'Al': 21, 'Zn': 6, 'Ca': 1, 'Ag': 3}
Kept in the vocabulary: H C N O F S Cl P Br B I Si Se
"""

str2num = {'<pad>': 0, 'H': 1, 'C': 2, 'N': 3, 'O': 4, 'F': 5, 'S': 6, 'Cl': 7,
           'P': 8, 'Br': 9, 'B': 10, 'I': 11, 'Si': 12, 'Se': 13,
           '<unk>': 14, '<mask>': 15, '<global>': 16}

num2str = {i: j for j, i in str2num.items()}

class Graph_Bert_Dataset(object):
    def __init__(self,path,smiles_field='Smiles',addH=True):
        if path.endswith('.txt') or path.endswith('.tsv'):
            self.df = pd.read_csv(path,sep='\t')
        else:
            self.df = pd.read_csv(path)
        self.smiles_field = smiles_field
        self.vocab = str2num
        self.devocab = num2str
        self.addH = addH
  • Define the vocabulary vocab. Atoms are indexed roughly in order of frequency, and rare elements fall back to <unk>, which keeps the vocabulary compact. Its size is 17, which fixes the hyperparameter vocab_size = 17.
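A minimal illustration of the lookup: symbols outside the vocabulary map to <unk> (14), exactly as done later in numerical_smiles.

```python
# Vocabulary lookup with <unk> fallback, as used in numerical_smiles.
str2num = {'<pad>': 0, 'H': 1, 'C': 2, 'N': 3, 'O': 4, 'F': 5, 'S': 6, 'Cl': 7,
           'P': 8, 'Br': 9, 'B': 10, 'I': 11, 'Si': 12, 'Se': 13,
           '<unk>': 14, '<mask>': 15, '<global>': 16}

atoms = ['<global>', 'C', 'O', 'Zn']   # Zn is not in the vocabulary
nums = [str2num.get(a, str2num['<unk>']) for a in atoms]
print(nums)  # [16, 2, 4, 14]
```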

1.1.1.get_data

def get_data(self):
    data = self.df
    train_idx = []
    idx = data.sample(frac=0.9).index
    train_idx.extend(idx)

    data1 = data[data.index.isin(train_idx)]
    data2 = data[~data.index.isin(train_idx)]

    self.dataset1 = tf.data.Dataset.from_tensor_slices(data1[self.smiles_field].tolist())
    self.dataset1 = self.dataset1.map(self.tf_numerical_smiles).padded_batch(256, padded_shapes=(
        tf.TensorShape([None]),tf.TensorShape([None,None]), tf.TensorShape([None]) ,tf.TensorShape([None]))).prefetch(50)

    self.dataset2 = tf.data.Dataset.from_tensor_slices(data2[self.smiles_field].tolist())
    self.dataset2 = self.dataset2.map(self.tf_numerical_smiles).padded_batch(512, padded_shapes=(
        tf.TensorShape([None]), tf.TensorShape([None, None]), tf.TensorShape([None]),
        tf.TensorShape([None]))).prefetch(50)
    return self.dataset1, self.dataset2
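The split logic samples 90% of the DataFrame index for training and keeps the complement for testing. A standalone sketch with toy data (the random_state is added here for reproducibility; get_data itself does not fix a seed):

```python
import pandas as pd

# Toy DataFrame standing in for the SMILES file (hypothetical data).
df = pd.DataFrame({'SMILES': [f's{i}' for i in range(100)]})

# Same split logic as get_data: sample 90% of the index for training,
# keep the complement for testing.
train_idx = df.sample(frac=0.9, random_state=0).index
train = df[df.index.isin(train_idx)]
test = df[~df.index.isin(train_idx)]
print(len(train), len(test))  # 90 10
```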

1.1.2.tf_numerical_smiles

def tf_numerical_smiles(self, data):
    # x,adjoin_matrix,y,weight = tf.py_function(self.balanced_numerical_smiles,
    # [data], [tf.int64, tf.float32 ,tf.int64,tf.float32])
    x, adjoin_matrix, y, weight = tf.py_function(self.numerical_smiles, [data],
                                                 [tf.int64, tf.float32, tf.int64, tf.float32])

    x.set_shape([None])
    adjoin_matrix.set_shape([None,None])
    y.set_shape([None])
    weight.set_shape([None])
    return x, adjoin_matrix, y, weight

tf.py_function wraps numerical_smiles, parsing each SMILES string into the four tensors; set_shape then restores the static shape (rank) information that py_function discards.
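The pattern can be shown in isolation: a plain-Python parser (here a hypothetical stand-in, not the real numerical_smiles) is wrapped by tf.py_function so it runs inside a tf.data pipeline, then set_shape re-attaches the lost rank information.

```python
import numpy as np
import tensorflow as tf

# Plain-Python parser: eager-only code (numpy, .decode()) is allowed here.
def parse(s):
    s = s.numpy().decode()
    x = np.arange(len(s), dtype=np.int64)         # stand-in for token ids
    m = np.zeros((len(s), len(s)), np.float32)    # stand-in for the adjacency mask
    return x, m

def tf_parse(s):
    x, m = tf.py_function(parse, [s], [tf.int64, tf.float32])
    x.set_shape([None])        # rank 1: token sequence
    m.set_shape([None, None])  # rank 2: adjacency matrix
    return x, m

ds = tf.data.Dataset.from_tensor_slices(['CCO']).map(tf_parse)
x, m = next(iter(ds))
print(x.shape, m.shape)  # (3,) (3, 3)
```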

1.1.3.numerical_smiles

def numerical_smiles(self, smiles):
    smiles = smiles.numpy().decode()
    atoms_list, adjoin_matrix = smiles2adjoin(smiles,explicit_hydrogens=self.addH)
    atoms_list = ['<global>'] + atoms_list
    nums_list =  [str2num.get(i,str2num['<unk>']) for i in atoms_list]
    temp = np.ones((len(nums_list),len(nums_list)))
    temp[1:,1:] = adjoin_matrix
    adjoin_matrix = (1 - temp) * (-1e9)

    choices = np.random.permutation(len(nums_list)-1)[:max(int(len(nums_list)*0.15),1)] + 1
    y = np.array(nums_list).astype('int64')
    weight = np.zeros(len(nums_list))
    for i in choices:
        rand = np.random.rand()
        weight[i] = 1
        if rand < 0.8:
            nums_list[i] = str2num['<mask>']
        elif rand < 0.9:
            nums_list[i] = int(np.random.rand() * 14 + 1)

    x = np.array(nums_list).astype('int64')
    weight = weight.astype('float32')
    return x, adjoin_matrix, y, weight
  • smiles2adjoin computes the atom list and adjacency matrix. After prepending the <global> supernode, the atoms are encoded as indices. In temp, bonded pairs (and the diagonal) are 1; after (1 - temp) * (-1e9), adjoin_matrix is 0 for bonded pairs and -1e9 for unbonded pairs.
  • np.random.permutation(len(nums_list)-1) is a shuffled list of 0..len(nums_list)-2; 15% of the atom positions are taken (at least one). The +1 shifts the indices so that position 0, the supernode, can never be selected for masking.
  • rand decides the corruption: with 80% probability the atom becomes <mask>; with 10% probability it is replaced by a random token. Note the replacement draws a fresh np.random.rand(), so int(np.random.rand() * 14 + 1) is uniform over tokens 1-14 (the real atoms plus <unk>), not restricted to 11-15. With the remaining 10% probability the atom is left unchanged (but still predicted).
  • The return values: x is the masked input; adjoin_matrix is the atom adjacency mask (0 for bonded, -1e9 for unbonded); y is the original atom list, i.e. the prediction target; weight marks the selected positions, so the loss is computed only there. Example:
import numpy as np
from utils import smiles2adjoin
import tensorflow as tf

str2num = {'<pad>': 0, 'H': 1, 'C': 2, 'N': 3, 'O': 4, 'F': 5, 'S': 6, 'Cl': 7,
           'P': 8, 'Br': 9, 'B': 10, 'I': 11, 'Si': 12, 'Se': 13,
           '<unk>': 14, '<mask>': 15, '<global>': 16}

num2str = {i: j for j, i in str2num.items()}

def numerical_smiles(smiles):
    addH=True
    #smiles = smiles.numpy().decode()
    atoms_list, adjoin_matrix = smiles2adjoin(smiles,explicit_hydrogens=addH)
    atoms_list = ['<global>'] + atoms_list
    nums_list =  [str2num.get(i,str2num['<unk>']) for i in atoms_list]
    temp = np.ones((len(nums_list),len(nums_list)))
    temp[1:,1:] = adjoin_matrix
    adjoin_matrix = (1 - temp) * (-1e9)

    choices = np.random.permutation(len(nums_list)-1)[:max(int(len(nums_list)*0.15),1)] + 1
    y = np.array(nums_list).astype('int64')
    weight = np.zeros(len(nums_list))
    for i in choices:
        rand = np.random.rand()
        weight[i] = 1
        if rand < 0.8:
            nums_list[i] = str2num['<mask>']
        elif rand < 0.9:
            nums_list[i] = int(np.random.rand() * 14 + 1)

    x = np.array(nums_list).astype('int64')
    weight = weight.astype('float32')
    return x, adjoin_matrix, y, weight

smiles='CC(C)OC(=O)C(C)NP(=O)(OCC1C(C(C(O1)N2C=CC(=O)NC2=O)(C)F)O)OC3=CC=CC=C3'

x, adjoin_matrix, y, weight=numerical_smiles(smiles)
x, adjoin_matrix, y, weight
"""
(array([16,  2,  2,  2,  4,  2,  4,  2,  2,  3,  8,  4,  4,  2,  2,  2,  2,
         2,  4,  3,  2,  2,  2,  4,  3,  2,  4,  2, 15,  4,  4,  2,  2,  2,
        15, 15,  2,  1, 15,  1,  1,  1,  1, 15,  1,  1,  1,  1,  1, 15,  1,
         1,  1,  1,  1,  1,  1,  1,  1, 15,  1,  1,  1,  1,  1,  1]),
 array([[-0.e+00, -0.e+00, -0.e+00, ..., -0.e+00, -0.e+00, -0.e+00],
        [-0.e+00, -0.e+00, -0.e+00, ..., -1.e+09, -1.e+09, -1.e+09],
        [-0.e+00, -0.e+00, -0.e+00, ..., -1.e+09, -1.e+09, -1.e+09],
        ...,
        [-0.e+00, -1.e+09, -1.e+09, ..., -0.e+00, -1.e+09, -1.e+09],
        [-0.e+00, -1.e+09, -1.e+09, ..., -1.e+09, -0.e+00, -1.e+09],
        [-0.e+00, -1.e+09, -1.e+09, ..., -1.e+09, -1.e+09, -0.e+00]]),
 array([16,  2,  2,  2,  4,  2,  4,  2,  2,  3,  8,  4,  4,  2,  2,  2,  2,
         2,  4,  3,  2,  2,  2,  4,  3,  2,  4,  2,  5,  4,  4,  2,  2,  2,
         2,  2,  2,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        1., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0.],
       dtype=float32))
"""
  • In the adjacency mask the supernode is connected to every atom, i.e. the first row and first column are all 0.
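The mask construction can be checked on a tiny hand-written adjacency matrix (here a 3-atom chain standing in for smiles2adjoin's output, with 1s on the diagonal and for each bond):

```python
import numpy as np

# Hand-written output of smiles2adjoin for a 3-atom chain A-B-C:
# self-loops on the diagonal, 1 for each bond.
adjoin_matrix = np.array([[1., 1., 0.],
                          [1., 1., 1.],
                          [0., 1., 1.]])

n = adjoin_matrix.shape[0] + 1          # +1 for the <global> supernode
temp = np.ones((n, n))                  # supernode row/column stays all ones
temp[1:, 1:] = adjoin_matrix
mask = (1 - temp) * (-1e9)              # 0 for connected pairs, -1e9 otherwise
print(mask)
```

Row 0 and column 0 come out as all zeros (the supernode attends everywhere), while the only -1e9 entries are the unbonded pair (atoms A and C).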

1.1.4.smiles2adjoin

def smiles2adjoin(smiles,explicit_hydrogens=True,canonical_atom_order=False):

    mol = Chem.MolFromSmiles(smiles)
    # The obsmitosmile / OpenBabel fallback from the original repo is omitted
    # here; fail explicitly rather than continue with mol = None.
    assert mol is not None, smiles + ' is not valid'

    if explicit_hydrogens:
        mol = Chem.AddHs(mol)
    else:
        mol = Chem.RemoveHs(mol)

    if canonical_atom_order:
        new_order = rdmolfiles.CanonicalRankAtoms(mol)
        mol = rdmolops.RenumberAtoms(mol, new_order)
    num_atoms = mol.GetNumAtoms()
    atoms_list = []
    for i in range(num_atoms):
        atom = mol.GetAtomWithIdx(i)
        atoms_list.append(atom.GetSymbol())

    adjoin_matrix = np.eye(num_atoms)
    # Add edges
    num_bonds = mol.GetNumBonds()
    for i in range(num_bonds):
        bond = mol.GetBondWithIdx(i)
        u = bond.GetBeginAtomIdx()
        v = bond.GetEndAtomIdx()
        adjoin_matrix[u,v] = 1.0
        adjoin_matrix[v,u] = 1.0
    return atoms_list,adjoin_matrix
  • First check that the SMILES string from the database is valid (the obsmitosmile / OpenBabel fallback is dropped here, which matters little for this walkthrough), then compute the atom list and adjacency matrix. atom.GetSymbol() example:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
mol=Chem.MolFromSmiles('OC1C2C1CC2')
num_atoms = mol.GetNumAtoms()
for i in range(num_atoms):
    atom = mol.GetAtomWithIdx(i)
    print(atom.GetSymbol(),end='')  #OCCCCC
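The adjacency-building step of smiles2adjoin can be reproduced without RDKit by hand-writing the bond list (here for CCO: bonds C0-C1 and C1-O2, standing in for the mol object):

```python
import numpy as np

# Reconstruction of the adjacency step in smiles2adjoin, using a hand-written
# bond list for CCO (C0-C1, C1-O2) in place of RDKit's mol object.
num_atoms = 3
bonds = [(0, 1), (1, 2)]

adjoin_matrix = np.eye(num_atoms)      # self-loops on the diagonal
for u, v in bonds:                     # symmetric entry for each bond
    adjoin_matrix[u, v] = 1.0
    adjoin_matrix[v, u] = 1.0
print(adjoin_matrix)
# [[1. 1. 0.]
#  [1. 1. 1.]
#  [0. 1. 1.]]
```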

1.1.5.summary

  • Summary: the SMILES data is processed into the masked atom list, the adjacency mask, the original atom list, and the mask-position flags. Using the logD7.4 data as an example, change sep accordingly. test_dataset is built like train_dataset, only with a different batch size. After get_data the data looks like this:
"""
For logD.txt (comma-separated), change the separator in __init__:

class Graph_Bert_Dataset(object):
    def __init__(self, path, smiles_field='Smiles', addH=True):
        if path.endswith('.txt') or path.endswith('.tsv'):
            self.df = pd.read_csv(path, sep='\t')   # change to sep=','
        else:
            self.df = pd.read_csv(path)
        self.smiles_field = smiles_field
        self.vocab = str2num
        self.devocab = num2str
        self.addH = addH
"""
from dataset import Graph_Bert_Dataset
addH=True
train_dataset, test_dataset = Graph_Bert_Dataset(path='data/logD.txt',smiles_field='SMILES',addH=addH).get_data()
for (i,(x, adjoin_matrix ,y , char_weight)) in enumerate(train_dataset):
    print("x=\n",x)
    print("adjoin_matrix=\n",adjoin_matrix)
    print("y=\n",y)
    print("char_weight=\n",char_weight)
    if i==2:break
""" (output condensed; only the first batch's x is shown in full form)
x=
 tf.Tensor(
[[16  5  2 ...  0  0  0]
 [16  6  4 ...  0  0  0]
 [16 15  2 ...  0  0  0]
 ...
 [16  4  2 ...  0  0  0]
 [16  4  2 ...  0  0  0]
 [16 15  2 ...  0  0  0]], shape=(256, 115), dtype=int64)
adjoin_matrix=
 tf.Tensor([...0 for connected pairs, -1e9 for unconnected, 0 padding...], shape=(256, 115, 115), dtype=float32)
y=
 tf.Tensor([[16  5  2 ...  0  0  0] ...], shape=(256, 115), dtype=int64)
char_weight=
 tf.Tensor([[0. 0. 0. ... 0. 0. 0.] ...], shape=(256, 115), dtype=float32)

Second batch: x/y shape=(256, 132), adjoin_matrix shape=(256, 132, 132);
third batch: x/y shape=(256, 130), adjoin_matrix shape=(256, 130, 130).
"""
  • The number of atoms (sequence length) differs from batch to batch, because padded_batch pads only to the longest sequence within each batch.
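This per-batch padding behaviour is easy to demonstrate with toy sequences (hypothetical data, same padded_batch call shape as in get_data):

```python
import tensorflow as tf

# padded_batch pads each batch only to the longest sequence *within* that
# batch, so different batches end up with different widths.
def gen():
    for seq in ([1, 2], [1, 2, 3, 4], [1], [1, 2, 3]):
        yield seq

ds = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=[None], dtype=tf.int64))
batches = list(ds.padded_batch(2, padded_shapes=tf.TensorShape([None])))
print([tuple(b.shape) for b in batches])  # [(2, 4), (2, 3)]
```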
