DataWhale Team Study: GNN Task 3
References: [DataWhale GNN study materials]; torch_geometric.nn, pytorch_geometric 1.7.1 documentation (pytorch-geometric.readthedocs.io)
Learning node representations with graph neural networks
Analyzing the dataset
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures

dataset = Planetoid(root='data/Planetoid', name='Cora', transform=NormalizeFeatures())

print()
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

data = dataset[0]  # Get the first graph object.

print()
print(data)
print('======================')

# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Contains isolated nodes: {data.contains_isolated_nodes()}')
print(f'Contains self-loops: {data.contains_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')
Dataset: Cora():
======================
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
======================
Number of nodes: 2708
Number of edges: 10556
Average node degree: 3.90
Number of training nodes: 140
Training node label rate: 0.05
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True
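The derived statistics in this output follow from the raw counts by simple arithmetic. The sketch below recomputes them in plain Python from the printed numbers. It assumes that PyG stores each undirected edge once per direction in edge_index, and that the training nodes are split evenly across classes; the variable names are illustrative only.

```python
# Raw counts copied from the printed output.
num_nodes = 2708
num_edges = 10556   # entries in edge_index (assumed: one per direction)
num_train = 140
num_classes = 7

avg_degree = num_edges / num_nodes
label_rate = num_train / num_nodes
train_per_class = num_train // num_classes   # assumes a balanced split
undirected_edges = num_edges // 2            # assumes both directions are stored

print(f"Average node degree: {avg_degree:.2f}")
print(f"Training node label rate: {label_rate:.2f}")
print(f"Training nodes per class: {train_per_class}")
print(f"Undirected edges: {undirected_edges}")
```

Under those assumptions the graph has 5278 distinct undirected edges, and the reported average degree counts each of them from both endpoints.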
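As the output shows, the Cora graph has:
- 2708 nodes and 10556 edges, with 7 classes and 1433 features per node
- an average node degree of 3.9
- only 140 training nodes, 20 per class
- a labeled-node ratio of only 5%
- undirected edges and no isolated nodes (every document has at least one citation)
- node features normalized with NormalizeFeatures, so that each node's features sum to 1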
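To make the effect of feature normalization concrete, here is a minimal pure-Python sketch of row normalization. It illustrates the idea only and is not PyG's implementation; the helper row_normalize and the toy feature matrix are hypothetical, and how all-zero or very small rows are treated is an assumption of this sketch.

```python
def row_normalize(features):
    """Divide each row by its sum; leave all-zero rows unchanged."""
    normalized = []
    for row in features:
        total = sum(row)
        divisor = total if total > 0 else 1.0  # avoid dividing by zero
        normalized.append([value / divisor for value in row])
    return normalized

# Hypothetical bag-of-words features for three nodes (not taken from Cora).
x = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 2.0],
    [0.0, 0.0, 0.0, 0.0],
]

x_norm = row_normalize(x)
for row in x_norm:
    print(row, "sum =", sum(row))
```

After normalization every non-empty row sums to 1, so nodes with many active features no longer dominate nodes with few.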
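Note: if the dataset cannot be downloaded, see the CSDN blog post by 诸神缄默不语, "Planetoid无法直接下载Cora等数据集的3个解决方式" (three workarounds for when Planetoid cannot download Cora and similar datasets directly).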
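Visualizing the distribution of node representations
Use **TSNE** to embed the high-dimensional node representations into a two-dimensional plane, then plot the nodes in that plane.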
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(out, color):
    z = TSNE(n_components=2).fit_transform(out.detach().cpu().numpy())
    plt.figure(figsize=(10,10))
    plt.xticks([])
    plt.yticks([])
    plt.scatter(z[:, 0], z[:, 1], s=70, c=color, cmap="Set2")
    plt.show()
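Node classification with an MLP
An MLP operates only on each node's input features and shares its weights across all nodes; it never looks at the edges.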
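The consequence of weight sharing without any graph input can be checked with a small pure-Python sketch: two nodes with identical features always receive identical outputs, and changing one node's features never affects another node. The weights, features, and helper names below are hypothetical, and dropout is left out (as in evaluation mode).

```python
def linear(x_row, weight, bias):
    """Apply y = W x + b to one node's feature vector."""
    return [sum(w * v for w, v in zip(w_row, x_row)) + b
            for w_row, b in zip(weight, bias)]

def relu(vec):
    return [max(0.0, v) for v in vec]

def mlp_forward(x, w1, b1, w2, b2):
    """Run the same two-layer MLP on every node independently."""
    return [linear(relu(linear(row, w1, b1)), w2, b2) for row in x]

# Hypothetical weights: 3 input features -> 2 hidden units -> 2 classes.
w1 = [[0.5, -1.0, 0.0], [1.0, 1.0, -0.5]]
b1 = [0.0, 0.1]
w2 = [[1.0, -1.0], [0.5, 2.0]]
b2 = [0.0, 0.0]

x = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 0.0, 2.0]]
out = mlp_forward(x, w1, b1, w2, b2)

# Nodes 0 and 2 have identical features, so their outputs are identical,
# whatever their neighbours are.
print(out[0] == out[2])

# Changing node 1's features leaves the outputs of nodes 0 and 2 untouched.
x_changed = [x[0], [5.0, 5.0, 5.0], x[2]]
out_changed = mlp_forward(x_changed, w1, b1, w2, b2)
print(out_changed[0] == out[0] and out_changed[2] == out[2])
```

Both checks print True, which is exactly why an MLP cannot exploit citation links between documents.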
Building the MLP node classifier
import torch
from torch.nn import Linear
import torch.nn.functional as F

class MLP(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(MLP, self).__init__()
        torch.manual_seed(12345)  # Seed the random number generator so results are reproducible
        self.lin1 = Linear(dataset.num_features, hidden_channels)
        self.lin2 = Linear(hidden_channels, dataset.num_classes)

    def forward(self, x):
        x = self.lin1(x)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin2(x)
        return x

model = MLP(hidden_channels=16)
print(model)
"""
MLP(
  (lin1): Linear(in_features=1433, out_features=16, bias=True)
  (lin2): Linear(in_features=16, out_features=7, bias=True)
)
"""
Training the model
Train with the cross-entropy loss and the Adam optimizer:
model = MLP(hidden_channels=16)
criterion = torch.nn.CrossEntropyLoss()  # Loss criterion
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)  # Optimizer

def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss on the training nodes only.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on the gradients.
    return loss

for epoch in range(1, 51):
    loss = train()
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
"""
Epoch: 046, Loss: 1.1284
Epoch: 047, Loss: 1.1229
Epoch: 048, Loss: 1.0383
Epoch: 049, Loss: 1.0439
Epoch: 050, Loss: 1.0563
"""
Testing the model
def test():
    model.eval()
    out = model(data.x)
    pred = out.argmax(dim=1)  # Pick the class with the highest score.
    test_correct = pred[data.test_mask] == data.y[data.test_mask]  # Compare predictions with ground-truth labels.
    test_acc = int(test_correct.sum()) / int(data.test_mask.sum())  # Fraction of correct predictions.
    return test_acc

test_acc = test()
print(f'Test Accuracy: {test_acc:.4f}')
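Both train() and test() index the model output with a boolean mask: the loss uses train_mask and the accuracy uses test_mask. The following is a minimal pure-Python sketch of those two computations, assuming the default mean reduction of torch.nn.CrossEntropyLoss. The helper names and the toy logits are hypothetical and are not taken from Cora.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax of the true class for one node."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[label]

def masked_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over the nodes where mask is True."""
    losses = [cross_entropy(row, y) for row, y, m in zip(logits, labels, mask) if m]
    return sum(losses) / len(losses)

def masked_accuracy(logits, labels, mask):
    """Fraction of masked nodes whose argmax class equals the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(1 for p, y, m in zip(preds, labels, mask) if m and p == y)
    return correct / sum(mask)

# Hypothetical logits for 4 nodes and 3 classes.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 5.0]]
labels = [0, 1, 0, 2]
train_mask = [True, True, False, False]
test_mask = [False, False, True, True]

loss = masked_cross_entropy(logits, labels, train_mask)
acc = masked_accuracy(logits, labels, test_mask)
print(f"Loss on training nodes: {loss:.4f}")
print(f"Test Accuracy: {acc:.4f}")
```

Because softmax is monotonic, taking the argmax of the raw logits selects the same class as taking the argmax of the probabilities, so no explicit softmax is needed at test time.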
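Too few labeled nodes are available to train the MLP, so the network overfits and generalizes poorly to unseen nodes.
Node classification with a GCN
Mathematical definition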
$$\mathbf{X}^{\prime} = \mathbf{\hat{D}}^{-1/2} \mathbf{\hat{A}} \mathbf{\hat{D}}^{-1/2} \mathbf{X} \mathbf{\Theta},$$
where $\mathbf{\hat{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with inserted self-loops and $\hat{D}_{ii} = \sum_{j=0} \hat{A}_{ij}$ is its diagonal degree matrix. The adjacency matrix may contain values other than $1$; when its entries are not restricted to {0, 1}, they store edge weights.
$\mathbf{\hat{D}}^{-1/2} \mathbf{\hat{A}} \mathbf{\hat{D}}^{-1/2}$ is the symmetrically normalized adjacency matrix. The node-wise formulation is:
$$\mathbf{x}^{\prime}_i = \mathbf{\Theta} \sum_{j \in \mathcal{N}(i) \cup \{ i \}} \frac{e_{j,i}}{\sqrt{\hat{d}_j \hat{d}_i}} \mathbf{x}_j$$
where $\hat{d}_i = 1 + \sum_{j \in \mathcal{N}(i)} e_{j,i}$ and $e_{j,i}$ is the edge weight from source node $j$ to target node $i$ (default: 1.0).
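As a worked example of the two formulas, the sketch below builds the symmetrically normalized matrix for a tiny graph in pure Python and checks that the matrix form and the node-wise form agree for one node. The graph, features, weight matrix, and helper names are all hypothetical, and dense lists are used for readability, which is not how PyG computes this.

```python
import math

def gcn_norm_matrix(adj):
    """Return D_hat^{-1/2} (A + I) D_hat^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # D_hat_ii = sum_j A_hat_ij
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical undirected path graph 0 - 1 - 2 with unit edge weights.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # node features X
theta = [[1.0, -1.0], [2.0, 0.5]]          # weight matrix Theta

# Matrix form: X' = D_hat^{-1/2} A_hat D_hat^{-1/2} X Theta
x_prime = matmul(gcn_norm_matrix(adj), matmul(x, theta))

# Node-wise form for node 0: its neighbour (node 1) plus its own self-loop.
d_hat = [2.0, 3.0, 2.0]                    # degrees including the self-loop
xt = matmul(x, theta)
node0 = [xt[0][c] / math.sqrt(d_hat[0] * d_hat[0]) + xt[1][c] / math.sqrt(d_hat[1] * d_hat[0])
         for c in range(2)]

print(x_prime[0])
print(node0)
```

The two printed rows match up to floating-point rounding: each node's new representation is a degree-weighted average of its own and its neighbours' transformed features.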
The GCNConv model in PyG
GCNConv(in_channels: int, out_channels: int, improved: bool = False, cached: bool = False, add_self_loops: bool = True, normalize: bool = True, bias: bool = True, **kwargs)
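where:
- in_channels: dimension of the input features
- out_channels: dimension of the output features
- improved: if True, $\mathbf{\hat{A}} = \mathbf{A} + 2\mathbf{I}$, which strengthens the central node's own information
- cached: whether to store the computed $\mathbf{\hat{D}}^{-1/2} \mathbf{\hat{A}} \mathbf{\hat{D}}^{-1/2}$ for later reuse; this should only be set to True in transductive learning scenarios
- add_self_loops: whether to add self-loop edges to the adjacency matrix
- normalize: whether to add self-loops and compute the symmetric normalization coefficients on the fly
- bias: whether to include a bias term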
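The effect of the improved flag can be illustrated with a small pure-Python sketch that compares self-loop weights of 1 and 2 before normalization. The helper normalized_adjacency and its self_loop_weight parameter are hypothetical names for this illustration and are not part of the PyG API.

```python
import math

def normalized_adjacency(adj, self_loop_weight=1.0):
    """Symmetrically normalize A + self_loop_weight * I (dense, pure Python)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (self_loop_weight if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

# Hypothetical undirected path graph 0 - 1 - 2.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]

default = normalized_adjacency(adj, self_loop_weight=1.0)   # A + I, as with improved=False
improved = normalized_adjacency(adj, self_loop_weight=2.0)  # A + 2I, as with improved=True

# Weight that node 1 places on itself versus on its neighbour, node 0.
print("default :", default[1][1], default[1][0])
print("improved:", improved[1][1], improved[1][0])
```

With the heavier self-loop, node 1's weight on itself grows while its weight on each neighbour shrinks, which is the sense in which the central node's own information is strengthened.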
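Building the GCN node classifier
The only difference from the MLP node classifier is that the linear layers (torch.nn.Linear) are replaced by graph convolution layers (torch_geometric.nn.GCNConv).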
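The relationship between a linear layer and a graph convolution layer can be sketched in pure Python using the GCN propagation rule defined earlier. On a graph with no edges the layer reduces to a plain linear map, while adding an edge makes a node's output depend on its neighbour. The helper gcn_layer and the toy data are hypothetical, and the bias, ReLU, and dropout are omitted for brevity.

```python
import math

def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, x, theta):
    """Compute D_hat^{-1/2} (A + I) D_hat^{-1/2} X Theta with dense lists."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    return matmul(norm, matmul(x, theta))

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # nodes 0 and 2 share the same features
theta = [[1.0, 2.0], [3.0, 4.0]]

# No edges: the normalized matrix is the identity, so the layer is just X Theta.
no_edges = [[0.0] * 3 for _ in range(3)]
linear_out = matmul(x, theta)
print(gcn_layer(no_edges, x, theta) == linear_out)

# One edge between nodes 0 and 1: node 0 now mixes in node 1's features,
# so it no longer matches node 2 even though their inputs are identical.
with_edge = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 0.0],
             [0.0, 0.0, 0.0]]
out = gcn_layer(with_edge, x, theta)
print(out[0] == out[2])
```

The first check prints True and the second prints False: the graph structure is the only thing that separates the two layers in this sketch.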
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GCNConv(dataset.num_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = GCN(hidden_channels=16)
print(model)
"""
GCN(
  (conv1): GCNConv(1433, 16)
  (conv2): GCNConv(16, 7)
)
"""
Visualizing the node representations of the untrained GCN
model.eval()
out = model(data.x, data.edge_index)
visualize(out, color=data.y)
Training the model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss on the training nodes only.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on the gradients.
    return loss

for epoch in range(1, 51):
    loss = train()
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
"""
Epoch: 046, Loss: 1.2266
Epoch: 047, Loss: 1.2149
Epoch: 048, Loss: 1.1631
Epoch: 049, Loss: 1.1756
Epoch: 050, Loss: 1.1714
"""
Testing the model
def test():
    model.eval()
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)  # Pick the class with the highest score.
    test_correct = pred[data.test_mask]