Temporal Splits and the Graph ML Leakage Problem
How look-ahead bias silently inflates link prediction metrics, and why the fix is architectural, not a hyperparameter.
When I built the LinkedIn connection graph intelligence system, the initial link prediction model looked great. Validation AUC above 0.92. Clean loss curves. The kind of numbers that make you feel like you've solved something.
Then I actually looked at why it was performing so well.
The Problem with Random Splits
The standard machine learning workflow (shuffle your dataset, take 80% for training and 20% for validation) is quietly catastrophic for graph ML.
Here's why. In a link prediction task, you're trying to answer: given the graph at time T, which edges will form by time T+k?
When you split randomly, you're allowing your training set to contain edges that were formed after the edges in your validation set. The model does not learn to predict the future. It learns to interpolate the present. The validation set is already embedded in the training graph's neighborhood structure.
```python
# This is wrong for temporal graph data
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
# ^ The model has seen the future. AUC of 0.92 is a lie.
```
What Temporal Splits Actually Look Like
The fix is straightforward in principle and requires rethinking your entire data pipeline in practice.
You need a cutoff timestamp. Everything before it trains the model. Everything after it evaluates it. No edge formed after the cutoff appears in any feature used to train on edges before the cutoff.
```python
CUTOFF = "2024-06-01"
train_edges = graph.edges[graph.edges["formed_at"] < CUTOFF]
val_edges = graph.edges[graph.edges["formed_at"] >= CUTOFF]

# Critical: features must be computed on the training graph only
train_graph = build_graph(train_edges)

# Compute node embeddings, centrality, clustering coefficients
# ONLY from train_graph, never from the full graph
features = extract_features(train_graph, candidate_pairs)
```
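One cheap guard against pipeline regressions is to assert the split's properties explicitly. A minimal sketch with a hypothetical pandas edge table (the `formed_at` column mirrors the snippet above):

```python
import pandas as pd

# Hypothetical edge table; real pipelines would load this from storage
edges = pd.DataFrame({
    "src": ["a", "b", "c", "a"],
    "dst": ["b", "c", "d", "d"],
    "formed_at": pd.to_datetime(
        ["2024-01-15", "2024-03-02", "2024-05-20", "2024-07-09"]
    ),
})
cutoff = pd.Timestamp("2024-06-01")

train = edges[edges["formed_at"] < cutoff]
val = edges[edges["formed_at"] >= cutoff]

# Sanity checks: the split is exhaustive, disjoint, and temporally ordered
assert len(train) + len(val) == len(edges)
assert train["formed_at"].max() < cutoff <= val["formed_at"].min()
```

Cheap assertions like these catch the classic failure mode where someone later "improves" the pipeline and quietly reintroduces a shuffled split.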
The Subtler Leakage: Node Features
Even after fixing the edge split, there is a second source of leakage that is easier to miss: node-level features computed on the full graph.
Measures like betweenness centrality, PageRank, and local clustering coefficient all reflect the graph's global structure. If you compute them on the full graph and then use them as features, you have smuggled future information into every single training example.
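To make the leak concrete, here is a toy sketch (pure Python, power-iteration PageRank, hypothetical node names and timestamps): the same node's PageRank shifts once post-cutoff edges enter the graph, so a model fed the full-graph value is reading future structure.

```python
from collections import defaultdict

def build_adj(edge_list):
    """Undirected adjacency sets from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def pagerank(adj, d=0.85, iters=50):
    """Minimal power-iteration PageRank (no dangling-node handling)."""
    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(iters):
        rank = {
            node: (1 - d) / n
            + d * sum(rank[m] / len(adj[m]) for m in adj[node])
            for node in adj
        }
    return rank

# Hypothetical edges tagged with a formation time
edges = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3),  # before cutoff
         ("a", "d", 5), ("b", "d", 6)]                 # after cutoff
CUTOFF = 4

pr_train = pagerank(build_adj([(u, v) for u, v, t in edges if t < CUTOFF]))
pr_full = pagerank(build_adj([(u, v) for u, v, t in edges]))

# "d" looks far more central on the full graph than it did at the cutoff;
# feeding pr_full to the model hands it that future structure.
print(round(pr_train["d"], 3), round(pr_full["d"], 3))
```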
The rule is simple but strict: every feature used at train time must be computable from information available before the cutoff.
| Feature | Safe? | Notes |
|---|---|---|
| Degree at cutoff | ✓ | Computed on train graph |
| Common neighbors at cutoff | ✓ | Computed on train graph |
| PageRank on full graph | ✗ | Leaks future edge structure |
| Betweenness on full graph | ✗ | Leaks future edge structure |
| Node join date | ✓ | External, not graph-derived |
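The safe rows in the table are cheap to compute from the train graph alone. A minimal sketch, using hypothetical pre-cutoff edges and adjacency sets:

```python
from collections import defaultdict

def build_adj(edge_list):
    """Undirected adjacency sets from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    return adj

# Hypothetical pre-cutoff edges: only these may inform the features
train_edges = [("a", "b"), ("b", "c"), ("c", "d")]
train_adj = build_adj(train_edges)

def safe_features(adj, u, v):
    """Degree and common-neighbor count as they stood at the cutoff."""
    return {
        "deg_u": len(adj[u]),
        "deg_v": len(adj[v]),
        "common_neighbors": len(adj[u] & adj[v]),
    }

feats = safe_features(train_adj, "a", "c")  # candidate pair to score
print(feats)  # → {'deg_u': 1, 'deg_v': 2, 'common_neighbors': 1}
```

Because `safe_features` takes the adjacency structure as an argument, passing anything other than the train-graph adjacency becomes an explicit, reviewable choice rather than a silent default.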
What Happened to the 0.92 AUC
After fixing both leakage sources, validation AUC dropped to 0.71.
That is the real number. It is lower, and it is honest. The model trained on a temporally correct graph learns something real about how connections form: shared institutions, second-degree proximity, and activity patterns. The 0.92 model learned to memorize a graph it was not supposed to have seen.
The lesson is not specific to graph ML. Any time your data has a temporal dimension, your validation strategy needs to respect it. The model does not know it is cheating. You have to know for it.