Koki Noda

Hands-on Graph Neural Networks with PyTorch Geometric(3): Multi-Layer Perceptron

Photo by Markus Spiske on Unsplash

Machine learning research on graph-structured data, such as social networks, has recently attracted a lot of attention. There are various machine learning tasks on graph data, such as node classification, link prediction, and graph classification; in this article, we tackle node classification. As our model, we use a simple neural network, the Multi-Layer Perceptron (MLP). MLP is often used as a baseline against which other GNNs are compared, because it ignores the graph topology and is trained using only node features.

In this article, we will train MLP on three different datasets and compare the results.

Through this article, we will learn the following:

  • How to handle PyTorch and PyTorch Geometric
  • Characteristics of Multi-Layer Perceptron
  • How to train Multi-Layer Perceptron

import os
import collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from torch import nn
from torch import Tensor
import torch.nn.functional as F
from torch.nn import Linear, ReLU
import torch_geometric
from torch_geometric.datasets import Planetoid, WebKB
from torch_geometric.data import Data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data_dir = "./data"
os.makedirs(data_dir, exist_ok=True)

MLP

The multi-layer perceptron is a type of feedforward network and the most basic neural network. Its units are arranged in layers, with connections only between adjacent layers, and information propagates in one direction, from the input side to the output side.
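Concretely, for an input feature vector x, the two-layer MLP we build below computes y = softmax(W2 · ReLU(W1 · x + b1) + b2), where (W1, b1) and (W2, b2) are the weights and biases of the two linear layers (the code actually returns log-softmax values, which represent the same prediction).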

Here we will use PyTorch to create an MLP with one hidden layer.

class MLP(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, dropout):
        super(MLP, self).__init__()
        self.lin1 = Linear(in_channels, hidden_channels)   # input -> hidden
        self.lin2 = Linear(hidden_channels, out_channels)  # hidden -> output
        self.dropout = dropout

    def reset_parameters(self):
        self.lin1.reset_parameters()
        self.lin2.reset_parameters()

    def forward(self, data):
        x = data.x  # only node features are used; edge information is ignored
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.lin1(x)
        x = x.relu()
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.lin2(x)
        return F.log_softmax(x, dim=1)  # log-probabilities over the classes
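
As a quick sanity check (toy numbers, not from the article), we can run the model on random features and confirm the output shape:

mlp_check = MLP(in_channels=4, hidden_channels=16, out_channels=3, dropout=0.5)
out = mlp_check(Data(x=torch.randn(10, 4)))
print(out.shape)  # torch.Size([10, 3]) -- one log-probability vector per node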

If you want to know the structure of a neural network, look at its forward method. Dropout is a special operation that is only active during training; we ignore it for now.

The input features (x) pass through a linear layer and are transformed by the ReLU function. They then pass through a second linear layer, and a softmax turns the result into label predictions. Very simple.

The MLP structure we will study can be depicted schematically as: input layer → hidden layer with ReLU activation → output layer with softmax.

We will perform model training later in the article, so let us create a function for this purpose. GNNs almost always perform full-batch training with all the training data. MLP does not use an adjacency matrix and could therefore be trained with mini-batches, but here we write the code for full-batch training (a mini-batch sketch is shown after the function below).

def run_training(model, data, lr=0.01, weight_decay=5e-4, epochs=200):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    def train():
        model.train()
        optimizer.zero_grad()
        out = model(data)
        # the model returns log-probabilities; since log_softmax is idempotent,
        # F.cross_entropy (which applies log_softmax internally) still computes the correct loss
        loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()
        return float(loss)

    @torch.no_grad()
    def test():
        model.eval()
        pred = model(data).argmax(dim=-1)

        accs = []
        for mask in [data.train_mask, data.val_mask, data.test_mask]:
            accs.append(int((pred[mask] == data.y[mask]).sum()) / int(mask.sum()))
        return accs

    train_acc_list, val_acc_list, test_acc_list = [], [], []
    best_val_acc = test_acc = 0  # test_acc is reported at the epoch with the best validation accuracy
    for epoch in range(1, epochs + 1):
        loss = train()
        train_acc, val_acc, tmp_test_acc = test()
        train_acc_list.append(train_acc)
        val_acc_list.append(val_acc)
        test_acc_list.append(tmp_test_acc)

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            test_acc = tmp_test_acc
        print(f'Epoch: {epoch:03d}, Train: {train_acc:.4f}, Val: {val_acc:.4f}, Test: {test_acc:.4f}')
    return train_acc_list, val_acc_list, test_acc_list
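
As an aside, here is a minimal sketch of what mini-batch training could look like for this MLP. It is not used in the rest of the article, and the function name, batch size, and Data-wrapping trick are illustrative assumptions:

from torch.utils.data import TensorDataset, DataLoader

def run_minibatch_training(model, data, batch_size=32, lr=0.01, epochs=200):
    # only node features and labels are needed; edges are never touched
    train_ds = TensorDataset(data.x[data.train_mask], data.y[data.train_mask])
    loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        for xb, yb in loader:
            optimizer.zero_grad()
            out = model(Data(x=xb))  # wrap the batch so forward() can read .x
            F.cross_entropy(out, yb).backward()
            optimizer.step()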

Model Training

We will now train MLP on the following three datasets.

  • Iris Dataset
  • Cora Dataset
  • Texas Dataset

Iris Dataset

Before using graph datasets, we first train the MLP on the Iris dataset, a classic machine learning dataset. Each of its 150 samples has 4 features and a label giving the species name.

iris = load_iris(as_frame=True)
df = iris["frame"]
print(df.shape)  # (150, 5)
df.head()

This data is not graph data, so it is normally handled as a data frame. However, to reuse the same training procedure as for the graph datasets discussed later, we convert it into the PyTorch Geometric graph data format. Note that no edge index information is included, because this is not a graph dataset.

X = torch.Tensor(df.iloc[:, :4].values)
y = torch.LongTensor(df["target"].values)
train, test = train_test_split(df, test_size=0.2, random_state=0)    # 150 -> 120 train / 30 test
train, val = train_test_split(train, test_size=0.25, random_state=0) # 120 -> 90 train / 30 val
print(train.shape, val.shape, test.shape)  # (90, 5) (30, 5) (30, 5)
def get_mask(index):
    mask = np.repeat([False], 150)
    mask[index] = True
    mask = torch.tensor(mask, dtype=torch.bool)
    return mask
train_mask = get_mask(train.index)
val_mask = get_mask(val.index)
test_mask = get_mask(test.index)
iris = Data(x=X, y=y, train_mask=train_mask, val_mask=val_mask, test_mask=test_mask)
iris
# Data(x=[150, 4], y=[150], train_mask=[150], val_mask=[150], test_mask=[150])

Excellent! We have now converted the data into a form that can be handled by PyTorch Geometric. The x contains feature values and the y contains label information. The last three, such as train_mask, may be unfamiliar to you, but they contain the information to split the data in the form of a boolean array.
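
If boolean masks are unfamiliar, here is a toy illustration (not part of the article's code) of how a mask selects rows:

X_toy = torch.arange(6).reshape(3, 2)        # tensor([[0, 1], [2, 3], [4, 5]])
mask_toy = torch.tensor([True, False, True])
print(X_toy[mask_toy])                       # tensor([[0, 1], [4, 5]]) -- rows where the mask is True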

Let’s see how the data is split between training, validation, and testing.

print(f'Number of training nodes: {iris.train_mask.sum()}')
print(f'Number of validation nodes: {iris.val_mask.sum()}')
print(f'Number of test nodes: {iris.test_mask.sum()}')

We can see that the data is split into 90, 30, and 30 nodes for training, validation, and testing, respectively.

Now we will use this data to train MLP.

epochs = 200
mlp = MLP(in_channels=4, hidden_channels=16, out_channels=3, dropout=0)
train_acc_list, val_acc_list, test_acc_list = run_training(mlp, iris, epochs=epochs)

The fitting seems to work well, since the accuracy on the training data reaches 1. The test accuracy shown is the one from the epoch with the highest validation accuracy; the final test accuracy is 0.9.
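
If you want to recompute the final test accuracy yourself, a quick check (not in the original article) looks like this:

mlp.eval()
pred = mlp(iris).argmax(dim=-1)
acc = int((pred[iris.test_mask] == iris.y[iris.test_mask]).sum()) / int(iris.test_mask.sum())
print(f'Test accuracy: {acc:.2f}')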

To check whether overfitting is occurring, let us plot the change in the accuracy on the training data and the accuracy on the validation data.

plt.plot(range(epochs), train_acc_list, label='train')
plt.plot(range(epochs), val_acc_list, label='val')
# plt.plot(range(epochs), test_acc_list, label='test')
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

Model training seems to be progressing well.

Cora Dataset

The Cora dataset is a well-known dataset in the field of graph research. It consists of 2708 scientific publications, each classified into one of seven classes, and a citation network of 5429 links. Each publication is described by a 0/1-valued word vector indicating the absence or presence of the corresponding word from a dictionary of 1433 unique words. See my previous article for more details.

cora_dataset = Planetoid(root=data_dir, name='Cora')
cora = cora_dataset[0]
cora
# Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

The dataset is ready. Compare it to the iris dataset we have just used. The Cora dataset contains edge_index which is specific to graph data.
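
edge_index stores the connectivity as a 2 × num_edges tensor, with source node indices in the first row and target node indices in the second. A quick peek:

print(cora.edge_index.shape)   # torch.Size([2, 10556])
print(cora.edge_index[:, :5])  # the first five (source, target) pairs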

The three masks, such as train_mask, are prepared in advance. Let’s see how the data is split between training, validation, and testing.

print(f'Number of training nodes: {cora.train_mask.sum()}')
print(f'Number of validation nodes: {cora.val_mask.sum()}')
print(f'Number of test nodes: {cora.test_mask.sum()}')

We can see that the data is split into 140, 500, and 1000 nodes for training, validation, and testing, respectively. The proportion of training data is low, and learning is expected to be difficult.

Now we will train MLP.

mlp = MLP(in_channels=cora_dataset.num_features, hidden_channels=16,
          out_channels=cora_dataset.num_classes, dropout=0.5)
train_acc_list, val_acc_list, test_acc_list = run_training(mlp, cora)

The fitting seems to work well, since the accuracy on the training data reaches 1. However, the accuracy on the validation and test data is low, around 0.5, which suggests the model is overfitting.

We will plot the change in the accuracy on the training data and the accuracy on the validation data.

plt.plot(range(epochs), train_acc_list, label='train')
plt.plot(range(epochs), val_acc_list, label='val')
# plt.plot(range(epochs), test_acc_list, label='test')
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

Texas Dataset

The Texas dataset is part of the WebKB dataset. WebKB is a webpage dataset collected from the computer science departments of various universities by Carnegie Mellon University. It has three sub-datasets, Cornell, Texas, and Wisconsin, in which nodes represent web pages and edges are hyperlinks between them. The Texas dataset contains 183 web pages (nodes) and 309 hyperlinks (edges). Node features are the bag-of-words representation of the web pages, which are manually classified into five categories: student, project, course, staff, and faculty.

The Texas dataset has different graph characteristics than the Cora dataset, but this does not affect MLP, since it does not use graph information for training.

texas_dataset = WebKB(root=data_dir, name='texas')
texas = texas_dataset[0]
# WebKB ships ten predefined train/val/test splits (masks of shape [183, 10]);
# we keep only the first split
texas.train_mask = texas.train_mask[:, 0]
texas.val_mask = texas.val_mask[:, 0]
texas.test_mask = texas.test_mask[:, 0]
texas
# Data(x=[183, 1703], edge_index=[2, 325], y=[183], train_mask=[183], val_mask=[183], test_mask=[183])

We can see that the dataset is much smaller than the Cora dataset.

Again, the three masks, such as train_mask, are prepared in advance. Let’s see how the data is split between training, validation, and testing.

print(f'Number of training nodes: {texas.train_mask.sum()}')
print(f'Number of validation nodes: {texas.val_mask.sum()}')
print(f'Number of test nodes: {texas.test_mask.sum()}')

We can see that the data is split into 87, 59, and 37 nodes for training, validation, and test, respectively.

Now let’s go on with the training.

mlp = MLP(in_channels=texas_dataset.num_features, hidden_channels=16,
          out_channels=texas_dataset.num_classes, dropout=0.5)
train_acc_list, val_acc_list, test_acc_list = run_training(mlp, texas)

On the Texas dataset, the accuracy on the validation data is 0.76 and the accuracy on the test data is 0.76, better than on the Cora dataset.

Below we plot the accuracy on the training data against the accuracy on the validation data. Model training seems to be progressing well here as well.

plt.plot(range(epochs), train_acc_list, label='train')
plt.plot(range(epochs), val_acc_list, label='val')
# plt.plot(range(epochs), test_acc_list, label='test')
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

Since no parameter tuning was performed this time, it is possible that a higher accuracy could be achieved. However, the performance of the model was clearly different between the Cora and Texas datasets.

It is possible that MLP, which does not learn from the graph structure, was at a disadvantage on the Cora dataset because of its homophily: connected nodes tend to share the same label, so ignoring the edges discards useful information. Conversely, it performed reasonably well on the Texas dataset, which has heterophily: neighboring nodes tend to have different labels, so the edges carry less usable signal and node features matter more.
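
We can check this quantitatively. Recent versions of PyTorch Geometric provide a homophily utility (this assumes torch_geometric.utils.homophily exists in your installed version); the edge homophily ratio is the fraction of edges whose endpoints share a label:

from torch_geometric.utils import homophily

# 1.0 = fully homophilous, 0.0 = fully heterophilous
print(homophily(cora.edge_index, cora.y))    # Cora: high (values around 0.8 are reported)
print(homophily(texas.edge_index, texas.y))  # Texas: low (values around 0.1 are reported)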
