
Temporal Splits and the Graph ML Leakage Problem

How look-ahead bias silently overfits link prediction models, and why the fix is architectural, not a hyperparameter.


When I built the LinkedIn connection graph intelligence system, the initial link prediction model looked great. Validation AUC above 0.92. Clean loss curves. The kind of numbers that make you feel like you've solved something.

Then I actually looked at why it was performing so well.

The Problem with Random Splits

The standard machine learning workflow of shuffling your dataset and taking 80% for training and 20% for validation is quietly catastrophic for graph ML.

Here's why. In a link prediction task, you're trying to answer: given the graph at time T, which edges will form by time T+k?

When you split randomly, you're allowing your training set to contain edges that were formed after the edges in your validation set. The model does not learn to predict the future. It learns to interpolate the present. The validation set is already embedded in the training graph's neighborhood structure.

# This is wrong for temporal graph data
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
# ^ The model has seen the future. AUC of 0.92 is a lie.

What Temporal Splits Actually Look Like

The fix is straightforward in principle and requires rethinking your entire data pipeline in practice.

You need a cutoff timestamp. Everything before it trains the model. Everything after it evaluates it. No edge formed after the cutoff appears in any feature used to train on edges before the cutoff.

CUTOFF = "2024-06-01"

train_edges = graph.edges[graph.edges["formed_at"] < CUTOFF]
val_edges   = graph.edges[graph.edges["formed_at"] >= CUTOFF]

# Critical: features must be computed on the training graph only
train_graph = build_graph(train_edges)

# Compute node embeddings, centrality, clustering coefficients
# ONLY from train_graph, never from the full graph
features = extract_features(train_graph, candidate_pairs)
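A minimal sketch of what a leakage-safe extract_features could look like (the function body, the networkx usage, and the specific feature choices are assumptions for illustration; the original pipeline is not shown):

```python
import networkx as nx

def extract_features(train_graph: nx.Graph, candidate_pairs):
    """Pair features derived ONLY from edges formed before the cutoff."""
    rows = []
    for u, v in candidate_pairs:
        # Common neighbors at cutoff: safe, since train_graph stops at the cutoff
        cn = len(list(nx.common_neighbors(train_graph, u, v)))
        # Preferential attachment at cutoff: product of pre-cutoff degrees
        pa = train_graph.degree(u) * train_graph.degree(v)
        rows.append({"u": u, "v": v, "common_neighbors": cn, "pref_attachment": pa})
    return rows

# Toy pre-cutoff graph: will the pair (a, d) form an edge later?
train_graph = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
feats = extract_features(train_graph, [("a", "d")])
```

The point is that every number in `feats` is a function of the pre-cutoff graph alone, so nothing the model trains on can encode edges that formed later.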

The Subtler Leakage: Node Features

Even after fixing the edge split, there is a second source of leakage that is easier to miss: node-level features computed on the full graph.

Measures like betweenness centrality, PageRank, and local clustering coefficient all reflect the graph's global structure. If you compute them on the full graph and then use them as features, you have smuggled future information into every single training example.
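This is easy to see concretely: PageRank for the same node changes once post-cutoff edges are added, so the full-graph value encodes future structure. A toy networkx example (not the production graph):

```python
import networkx as nx

# Pre-cutoff graph: a path a-b-c
train_g = nx.Graph([("a", "b"), ("b", "c")])
# Full graph: two post-cutoff edges close the square
full_g = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])

pr_train = nx.pagerank(train_g)
pr_full = nx.pagerank(full_g)
# pr_train["b"] is high because b bridges the path; pr_full["b"] is exactly
# 0.25 because every node in a 4-cycle is symmetric. Training on the
# full-graph value leaks the fact that the c-d and d-a edges will exist.
```
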

The rule is simple but strict: every feature used at train time must be computable from information available before the cutoff.

Feature                      Safe?  Notes
Degree at cutoff             Yes    Computed on train graph
Common neighbors at cutoff   Yes    Computed on train graph
PageRank on full graph       No     Leaks future edge structure
Betweenness on full graph    No     Leaks future edge structure
Node join date               Yes    External, not graph-derived

What Happened to the 0.92 AUC

After fixing both leakage sources, validation AUC dropped to 0.71.

That is the real number. It is lower, and it is honest. The model trained on a temporally correct graph learns something real about how connections form: shared institutions, second-degree proximity, and activity patterns. The 0.92 model learned to memorize a graph it was not supposed to have seen.
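For reference, link-prediction AUC on a temporal split is computed by scoring post-cutoff positives against sampled negative pairs. A minimal sketch with sklearn (the scores are illustrative placeholders, not outputs of the real model):

```python
from sklearn.metrics import roc_auc_score

# Illustrative model scores, not real outputs
pos_scores = [0.9, 0.8, 0.4]  # candidate pairs that did form after the cutoff
neg_scores = [0.5, 0.3, 0.2]  # sampled pairs that did not form

y_true = [1] * len(pos_scores) + [0] * len(neg_scores)
y_score = pos_scores + neg_scores

# AUC is the fraction of (positive, negative) pairs the model ranks correctly
auc = roc_auc_score(y_true, y_score)
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9. Under a random split the same calculation is still mechanically valid; it is the provenance of the positives and negatives that makes the number honest or not.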

The lesson is not specific to graph ML. Any time your data has a temporal dimension, your validation strategy needs to respect it. The model does not know it is cheating. You have to know for it.