Hands-on Graph Neural Networks with PyTorch Geometric (3): Multi-Layer Perceptron
Machine learning research on graph-structured data, such as social networks, has recently attracted a lot of attention. Graph data supports various machine learning tasks, such as node classification, link prediction, and graph classification; in this article, we will tackle node classification. The model we will work with is a simple neural network, the Multi-Layer Perceptron (MLP). MLP is often used as a baseline against which to compare GNNs because it ignores the graph topology and is trained using only node features.
In this article, we will train MLP on three different datasets and compare the results.
Through this article, we will learn the following:
- How to handle PyTorch and PyTorch Geometric
- Characteristics of Multi-Layer Perceptron
- How to train Multi-Layer Perceptron
import os
import collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from torch import nn
from torch import Tensor
import torch.nn.functional as F
from torch.nn import Linear, ReLU
import torch_geometric
from torch_geometric.datasets import Planetoid, WebKB
from torch_geometric.data import Data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data_dir = "./data"
os.makedirs(data_dir, exist_ok=True)
MLP
The multi-layer perceptron is a feedforward network and the most basic kind of neural network. Its units are arranged in layers and connected only between adjacent layers, so information propagates in one direction, from the input side to the output side.
Here we will use PyTorch to create an MLP with one hidden layer.
class MLP(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, dropout):
        super(MLP, self).__init__()
        self.lin1 = Linear(in_channels, hidden_channels)
        self.lin2 = Linear(hidden_channels, out_channels)
        self.dropout = dropout
    def reset_parameters(self):
        self.lin1.reset_parameters()
        self.lin2.reset_parameters()
    def forward(self, data):
        x = data.x
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.lin1(x)
        x = x.relu()
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.lin2(x)
        return F.log_softmax(x, dim=1)
The structure of a neural network is defined by its forward method. Dropout is a special operation that is active only during training (it is disabled in evaluation mode), so we ignore it for now.
The input features (x) pass through the first linear layer and are then transformed by the ReLU function. The result passes through the second linear layer, and a log-softmax turns the outputs into log-probabilities over the classes, from which the label is predicted. Very simple.
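To make the data flow concrete, here is a minimal sketch of the same forward pass (without dropout) in plain Python, with no PyTorch involved. The weights `W1`, `b1`, `W2`, `b2` and the input `x` are made-up toy values chosen only for illustration, and the helper functions are hypothetical stand-ins for the PyTorch layers.

```python
import math

def linear(x, weight, bias):
    """Affine map: one output value per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]

def log_softmax(x):
    """Numerically stable log-softmax: subtract the log-sum-exp."""
    m = max(x)
    lse = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - lse for v in x]

# Toy weights (assumed values): 2 input features -> 3 hidden units -> 2 classes.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b1 = [0.0, 0.1, -0.2]
W2 = [[1.0, 0.0, -1.0], [-1.0, 1.0, 0.5]]
b2 = [0.0, 0.0]

x = [2.0, 1.0]                                    # features of a single node
hidden = relu(linear(x, W1, b1))                  # first linear layer + ReLU
log_probs = log_softmax(linear(hidden, W2, b2))   # second linear layer + log-softmax
probs = [math.exp(v) for v in log_probs]          # back to probabilities
predicted_class = max(range(len(log_probs)), key=log_probs.__getitem__)
```

Exponentiating the log-softmax output gives class probabilities that sum to 1, and the predicted label is simply the index of the largest entry.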
The MLP structure we will study can be schematically depicted as follows.

We will train the model later in the article, so we first create a function for this purpose. GNNs are almost always trained in full-batch mode, using all training data in every step. MLP does not use the adjacency matrix and could therefore be trained with mini-batches, but here we have written code for full-batch training.
def run_training(model, data, lr=0.01, weight_decay=5e-4, epochs=200):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    def train():
        model.train()
        optimizer.zero_grad()
        out = model(data)
        # The model already returns log-probabilities (log_softmax), so nll_loss is the matching loss.
        loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()
        return float(loss)
    @torch.no_grad()
    def test():
        model.eval()
        pred = model(data).argmax(dim=-1)
        accs = []
        for mask in [data.train_mask, data.val_mask, data.test_mask]:
            accs.append(int((pred[mask] == data.y[mask]).sum()) / int(mask.sum()))
        return accs
    train_acc_list, val_acc_list, test_acc_list = [], [], []
    # test_acc tracks the test accuracy at the epoch with the best validation accuracy so far.
    best_val_acc = test_acc = 0
    for epoch in range(1, epochs + 1):
        loss = train()
        train_acc, val_acc, tmp_test_acc = test()
        train_acc_list.append(train_acc)
        val_acc_list.append(val_acc)
        test_acc_list.append(tmp_test_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            test_acc = tmp_test_acc
        print(f'Epoch: {epoch:03d}, Train: {train_acc:.4f}, Val: {val_acc:.4f}, Test: {test_acc:.4f}')
    return train_acc_list, val_acc_list, test_acc_list
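The function above feeds all training rows to the model in every step. Because an MLP treats each row independently, a mini-batch variant mainly needs a way to iterate over chunks of training indices. The following is a rough sketch in plain Python; `iter_minibatches` is a hypothetical helper that is not part of PyTorch or PyTorch Geometric, and the batch size of 32 is an arbitrary choice.

```python
import random

def iter_minibatches(indices, batch_size, seed=0):
    """Yield shuffled, non-overlapping batches of indices."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

train_indices = range(90)  # e.g. the positions of 90 training rows
batches = list(iter_minibatches(train_indices, batch_size=32))

# Full-batch training is the special case batch_size == number of training rows.
full_batch = list(iter_minibatches(train_indices, batch_size=90))
```

In a mini-batch training loop, each batch of indices would select the rows used for one optimizer step. For a GNN this is not as straightforward, since the prediction for a node depends on its neighbors.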
Model Training
We will now train MLP on the following three datasets.
- Iris Dataset
- Cora Dataset
- Texas Dataset
Iris Dataset
Before using the graph datasets, we first train MLP on the iris dataset, a common machine learning dataset. It has 4 features per sample and a label giving the species.
iris = load_iris(as_frame=True)
df = iris["frame"]
print(df.shape) # (150, 5)
df.head()

This is not graph data, so it is normally handled as a data frame. However, in order to use the same training process as for the graph datasets discussed later, we will convert it into the PyTorch Geometric graph data format. Note that no edge index is included because this is not a graph dataset.
X = torch.Tensor(df.iloc[:, :4].values)
y = torch.LongTensor(df["target"].values)
train, test = train_test_split(df, test_size=0.2, random_state=0)
train, val = train_test_split(train, test_size=0.25, random_state=0)
print(train.shape, val.shape, test.shape)
def get_mask(index):
    mask = np.zeros(len(df), dtype=bool)
    mask[index] = True
    mask = torch.tensor(mask, dtype=torch.bool)
    return mask
train_mask = get_mask(train.index)
val_mask = get_mask(val.index)
test_mask = get_mask(test.index)
iris = Data(x=X, y=y, train_mask=train_mask, val_mask=val_mask, test_mask=test_mask)
iris
# Data(x=[150, 4], y=[150], train_mask=[150], val_mask=[150], test_mask=[150])
Excellent! We have now converted the data into a form that can be handled by PyTorch Geometric. The x attribute contains the feature values and y contains the labels. The last three attributes (train_mask, val_mask, and test_mask) may be unfamiliar: each is a boolean array marking which rows belong to that split.
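As a quick illustration of how such masks work, here is a small sketch in plain Python using lists instead of tensors. The helpers `make_mask` and `select` and the toy labels are assumptions made for this example only; with tensors, `select(values, mask)` corresponds to `values[mask]`.

```python
def make_mask(indices, size):
    """Boolean list that is True exactly at the given positions."""
    mask = [False] * size
    for i in indices:
        mask[i] = True
    return mask

def select(values, mask):
    """Keep the entries of `values` where `mask` is True."""
    return [v for v, keep in zip(values, mask) if keep]

labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]  # toy labels for 10 rows
train_mask = make_mask([0, 2, 3, 5, 7, 9], size=10)
val_mask = make_mask([1, 6], size=10)
test_mask = make_mask([4, 8], size=10)

train_labels = select(labels, train_mask)

# Sanity check: every row should belong to exactly one split.
membership = [a + b + c for a, b, c in zip(train_mask, val_mask, test_mask)]
```

Summing a mask counts the rows in that split, which is exactly what the `mask.sum()` calls in this article do.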
Let’s see how many samples fall into the training, validation, and test splits.
print(f'Number of training nodes: {iris.train_mask.sum()}')
print(f'Number of validation nodes: {iris.val_mask.sum()}')
print(f'Number of test nodes: {iris.test_mask.sum()}')
We can see that the data is split into 90, 30, and 30 nodes for training, validation, and testing, respectively.
Now we will use this data to train MLP.
epochs = 200
mlp = MLP(in_channels=4, hidden_channels=16, out_channels=3, dropout=0)
train_acc_list, val_acc_list, test_acc_list = run_training(mlp, iris, epochs=epochs)

The fit looks good: the accuracy on the training data reaches 1. The test accuracy shown in each log line is the one measured at the epoch with the best validation accuracy so far. The final test accuracy is 0.9.
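The same selection rule can be applied after training to the accuracy lists that `run_training` returns. Below is a minimal sketch; `select_by_validation` is a hypothetical helper, and the accuracy values are made up purely to show the behavior.

```python
def select_by_validation(val_accs, test_accs):
    """Return (epoch, val_acc, test_acc) at the first epoch with the best validation accuracy.

    Epochs are 1-based, matching the training log.
    """
    best = max(range(len(val_accs)), key=lambda i: val_accs[i])  # first maximum wins
    return best + 1, val_accs[best], test_accs[best]

# Made-up curves for five epochs.
val_curve = [0.50, 0.80, 0.90, 0.90, 0.85]
test_curve = [0.40, 0.70, 0.88, 0.95, 0.90]

epoch, val_acc, test_acc = select_by_validation(val_curve, test_curve)
```

Note that the test accuracy is never used to choose the epoch. Picking the epoch with the highest test accuracy instead (0.95 in this toy example) would leak test information into model selection.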
To check whether overfitting is occurring, let us plot the change in the accuracy on the training data and the accuracy on the validation data.
plt.plot(range(epochs), train_acc_list, label='train')
plt.plot(range(epochs), val_acc_list, label='val')
# plt.plot(range(epochs), test_acc_list, label='test')
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

Model training seems to be progressing well.
Cora Dataset
The Cora dataset is a well-known dataset in the field of graph research. This consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. See my previous article for more details.
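A 0/1-valued word vector of this kind can be sketched in a few lines of plain Python. The five-word dictionary and the sample document below are toy assumptions; the real Cora dictionary has 1433 entries.

```python
def binary_bag_of_words(document_words, dictionary):
    """Return a 0/1 vector: 1 if the dictionary word occurs in the document, else 0."""
    present = set(document_words)
    return [1 if word in present else 0 for word in dictionary]

dictionary = ["graph", "neural", "network", "protein", "kernel"]  # toy dictionary
paper = ["neural", "network", "neural", "graph"]                  # toy document

features = binary_bag_of_words(paper, dictionary)
```

Repeated words do not change the vector, because only presence or absence is recorded, not the count.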
cora_dataset = Planetoid(root=data_dir, name='Cora')
cora = cora_dataset[0]
cora
# Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
The dataset is ready. Compare it to the iris dataset we have just used. The Cora dataset contains edge_index which is specific to graph data.
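The edge_index attribute stores the graph as two rows of node indices, one column per directed edge, and an undirected edge appears once in each direction. The sketch below builds such a structure from a toy edge list using nested Python lists; `to_edge_index` is a hypothetical helper written for this illustration, not a PyTorch Geometric function.

```python
def to_edge_index(undirected_edges):
    """Return [sources, targets] with every undirected edge stored in both directions."""
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]

edges = [(0, 1), (1, 2), (2, 3)]   # toy undirected graph with 3 edges
edge_index = to_edge_index(edges)  # shape: 2 rows x 6 columns
```

This storage scheme is why the number of columns is even: the 10556 columns shown for Cora above are consistent with 5278 undirected edges, each stored twice.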
The three masks (train_mask, val_mask, and test_mask) are prepared in advance. Let’s see how many nodes fall into the training, validation, and test splits.
print(f'Number of training nodes: {cora.train_mask.sum()}')
print(f'Number of validation nodes: {cora.val_mask.sum()}')
print(f'Number of test nodes: {cora.test_mask.sum()}')
We can see that the data is split into 140, 500, and 1000 nodes for training, validation, and testing, respectively. The training set covers only about 5% of the 2708 nodes (140 / 2708), so learning is expected to be difficult.
Now we will train MLP.
mlp = MLP(in_channels=cora_dataset.num_features, hidden_channels=16,
          out_channels=cora_dataset.num_classes, dropout=0.5)
train_acc_list, val_acc_list, test_acc_list = run_training(mlp, cora)

The model fits the training data perfectly: the training accuracy is 1. However, the accuracy on the validation data and on the test data is low, around 0.5, which suggests that the model is overfitting.
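One simple way to put a number on this is the gap between training and validation accuracy at each epoch. The sketch below uses made-up accuracy curves; the helper names and the threshold of 0.3 are arbitrary assumptions for illustration, not an established criterion.

```python
def generalization_gap(train_accs, val_accs):
    """Per-epoch difference between training and validation accuracy."""
    return [train - val for train, val in zip(train_accs, val_accs)]

def first_epoch_gap_exceeds(train_accs, val_accs, threshold):
    """1-based epoch at which the gap first exceeds `threshold`, or None if it never does."""
    for epoch, gap in enumerate(generalization_gap(train_accs, val_accs), start=1):
        if gap > threshold:
            return epoch
    return None

# Made-up curves that mimic a model memorizing a small training set.
train_curve = [0.30, 0.60, 0.90, 1.00, 1.00]
val_curve = [0.28, 0.45, 0.52, 0.55, 0.54]

gaps = generalization_gap(train_curve, val_curve)
overfit_epoch = first_epoch_gap_exceeds(train_curve, val_curve, threshold=0.3)
```

The same functions can be applied directly to the `train_acc_list` and `val_acc_list` returned by `run_training`.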
We will plot the change in the accuracy on the training data and the accuracy on the validation data.
plt.plot(range(epochs), train_acc_list, label='train')
plt.plot(range(epochs), val_acc_list, label='val')
# plt.plot(range(epochs), test_acc_list, label='test')
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

Texas Dataset
The Texas dataset is part of the WebKB dataset, a collection of web pages gathered from the computer science departments of various universities by Carnegie Mellon University. WebKB has three sub-datasets, Cornell, Texas, and Wisconsin, in which nodes represent web pages and edges are hyperlinks between them; we use Texas. The Texas dataset contains 183 web pages (nodes) and 309 hyperlinks (edges). Node features are the bag-of-words representation of the web pages. The web pages are manually classified into five categories: student, project, course, staff, and faculty.
The Texas dataset has different graph characteristics than the Cora dataset, but this does not affect MLP because it does not use graph information for training.
texas_dataset = WebKB(root=data_dir, name='texas')
texas = texas_dataset[0]
# WebKB ships 10 predefined splits, so each mask has shape [183, 10]; we use the first split.
texas.train_mask = texas.train_mask[:, 0]
texas.val_mask = texas.val_mask[:, 0]
texas.test_mask = texas.test_mask[:, 0]
texas
# Data(x=[183, 1703], edge_index=[2, 325], y=[183], train_mask=[183], val_mask=[183], test_mask=[183])
We can see that this dataset has far fewer nodes than the Cora dataset.
Again, the three masks (train_mask, val_mask, and test_mask) are prepared in advance. Let’s see how many nodes fall into the training, validation, and test splits.
print(f'Number of training nodes: {texas.train_mask.sum()}')
print(f'Number of validation nodes: {texas.val_mask.sum()}')
print(f'Number of test nodes: {texas.test_mask.sum()}')
We can see that the data is split into 87, 59, and 37 nodes for training, validation, and testing, respectively.
Now let’s go on with the training.
mlp = MLP(in_channels=texas_dataset.num_features, hidden_channels=16,
          out_channels=texas_dataset.num_classes, dropout=0.5)
train_acc_list, val_acc_list, test_acc_list = run_training(mlp, texas)

On the Texas dataset, the accuracy on the validation data is 0.76 and the accuracy on the test data is also 0.76, which is better than on the Cora dataset.
Below we plot the accuracy on the training data against the accuracy on the validation data. Model training seems to be progressing well here as well.
plt.plot(range(epochs), train_acc_list, label='train')
plt.plot(range(epochs), val_acc_list, label='val')
# plt.plot(range(epochs), test_acc_list, label='test')
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

Since no hyperparameter tuning was performed this time, higher accuracy may well be achievable. Even so, the performance of the model clearly differed between the Cora and Texas datasets.
One possible explanation is that MLP, which ignores the graph structure, is at a disadvantage on Cora because Cora is homophilous: connected nodes tend to share a label, so the edges carry information that MLP never sees. Conversely, it performed reasonably well on the Texas dataset, which is heterophilous, where presumably less is lost by ignoring the edges.
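Homophily can be made concrete with a simple measure that this article has not used so far: the edge homophily ratio, i.e. the fraction of edges whose two endpoints share a label. The sketch below computes it in plain Python on two toy graphs; the function name, labels, and edge lists are assumptions for illustration, not values taken from Cora or Texas.

```python
def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes with the same label."""
    if not edges:
        raise ValueError("edge list must not be empty")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

labels = [0, 0, 0, 1, 1, 1]  # toy labels for 6 nodes

# Mostly same-label connections: a homophilous toy graph.
homophilous_edges = [(0, 1), (1, 2), (3, 4), (4, 5), (2, 3)]
# Mostly cross-label connections: a heterophilous toy graph.
heterophilous_edges = [(0, 3), (1, 4), (2, 5), (0, 4), (0, 1)]

h_high = edge_homophily(homophilous_edges, labels)
h_low = edge_homophily(heterophilous_edges, labels)
```

A ratio close to 1 means that neighbors are a strong hint about a node's label, which is the signal a GNN can exploit and an MLP cannot. PyTorch Geometric also provides a `homophily` utility in `torch_geometric.utils` for computing such measures on real datasets.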