Encoding network shape has been a big factor in the success of our recent shell company detection machine learning model. Central to the model is the ability to evaluate the context of a business – given by the network of nodes surrounding it – in a way that can be input into explainable ML models. Early on in our analysis, we realized that network shape alone – without any additional information on nodes (e.g. financial status or jurisdiction) – could already be a powerful predictor of risk. In the full model, network topology is encoded alongside node information to produce the most accurate results – but just how far can you go on shapes alone? The following is a deep dive into the data science involved.
Motif Counting
In the following, a 'director network' refers to the connections to businesses, addresses and other individuals within a 2-hop radius of a director.
One way to differentiate high-risk networks is to count the small motifs present in their neighborhoods. Below is the distribution for high-risk director networks (orange) vs normal directors (blue) of the simple path-of-length-2 motif.
Use in detecting high-risk businesses & individuals
Crucially, this motif does not contain the third edge that would complete the triangle – and this aligns with the behavior of shell company formation agents: they are the 'hub' of a network, incorporating lots of businesses and then handing ownership to nominee directors. It's well known that naturally occurring social networks have a higher-than-random rate of triangles, because social connections are more stable in groups. A similar effect is seen in the directorships of 'normal' companies.
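To make this concrete, below is a minimal sketch of counting that motif on a single ego network, assuming networkx is available; the star-shaped toy graph stands in for a formation agent's hub and is purely illustrative.

```python
import networkx as nx
from math import comb

def count_path2_motifs(graph: nx.Graph) -> int:
    """Count path-of-length-2 motifs ('open wedges') in a graph.

    A node of degree d sits at the centre of C(d, 2) length-2 paths;
    subtracting 3 per triangle removes the wedges that are closed
    by a third edge.
    """
    wedges = sum(comb(d, 2) for _, d in graph.degree())
    triangles = sum(nx.triangles(graph).values()) // 3
    return wedges - 3 * triangles

# A toy 'formation agent' hub: one director linked to 6 businesses.
hub = nx.star_graph(6)          # node 0 connected to 6 leaves, no triangles
print(count_path2_motifs(hub))  # 15 open wedges, 0 triangles
```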
A limitation of motif counting is that, for larger motifs, getting a full count on big graph datasets (like all global business ego-networks) is computationally infeasible. Sampling methods can improve this to some extent, but they introduce some random chance into our detection pipeline. A more recent approach to overcome this is to use neural networks to learn vector embeddings of nodes or neighborhoods.
Graph embeddings – Graph2Vec Algorithm
Graph2Vec [1] is an algorithm that translates graphs/networks into vector embeddings (coordinates of numbers) via a shallow neural network. The aim is for 'similar' graphs to have vectors that are close to each other. The benefit of this is that models can measure the distance between vectors much more easily than they can compare graphs directly.
To create the embeddings, we first convert the graph into a collection of words – a 'document' – and then use a Natural Language Processing (NLP) technique called 'Doc2Vec' [2], which embeds documents containing similar words to similar vectors.
So how do you convert a graph to a document? At first there seems to be no natural way, since nodes connected by edges have a more abstract structure than words in sentences. However, a neat trick linked to motif counting is as follows:
1. Label each node by a letter indicating entity type.
2. For each node, create a sorted list of the labels of its neighbors.
3. Concatenate each node's label with its sorted list of neighbor labels*
4. Note down the labels you now have on the graph.
Repeat steps 2, 3 and 4 as many times as you like – typically 3 levels are enough to get very expressive embeddings. A code sketch of these steps is given below.
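Here is a minimal sketch of these relabeling steps, assuming networkx and a hypothetical node attribute 'entity_type' holding the A/B/C letters; the exact separator character is cosmetic, and the hashing used by the full WL kernel (see the note further down) is omitted for readability.

```python
import networkx as nx

def wl_bag_of_words(graph: nx.Graph, iterations: int = 3) -> list[str]:
    # Step 1: label each node with its entity type letter.
    labels = {n: graph.nodes[n]["entity_type"] for n in graph.nodes}
    words = list(labels.values())  # 1st-level words

    for _ in range(iterations - 1):
        new_labels = {}
        for node in graph.nodes:
            # Steps 2-3: append the sorted neighbor labels to the node's label.
            neighbours = sorted(labels[m] for m in graph.neighbors(node))
            new_labels[node] = labels[node] + "|" + "".join(neighbours)
        labels = new_labels
        words.extend(labels.values())  # Step 4: note down the new labels.
    return words

# The 4-node example from the walk-through below: two As, a B and a C.
g = nx.Graph([("a1", "b"), ("a2", "b"), ("b", "c"), ("a1", "c")])
nx.set_node_attributes(g, {"a1": "A", "a2": "A", "b": "B", "c": "C"}, "entity_type")
print(wl_bag_of_words(g, iterations=2))
# ['A', 'B', 'A', 'C', 'A|BC', 'B|AAC', 'A|B', 'C|AB']
```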
See below for a visual walk-through of the above algorithm on a simple graph with 4 nodes and 3 entity types – A, B and C:
Bag-of-Words:
1st level: A (x2), B, C
2nd level: C|AB, B|AAC, A|BC, A|B
3rd level: CAB|ABCBAAC, ABC|BAACCAB, BAAC|ABABCCAB, AB|BAAC
Now, for each graph, we've generated a bag of words made up of the node labels we noted down. Borrowing a method from NLP, we can train a Doc2Vec model on these sets to generate machine-learned graph embeddings.
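The Doc2Vec step can be sketched with gensim (assuming gensim 4.x); `director_networks` is a hypothetical id-to-graph mapping, and the hyperparameters are illustrative.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One 'document' per director network, tagged with the graph's identifier.
# director_networks is a hypothetical {graph_id: nx.Graph} mapping.
documents = [
    TaggedDocument(words=wl_bag_of_words(graph), tags=[graph_id])
    for graph_id, graph in director_networks.items()
]

# PV-DBOW (dm=0) is the Doc2Vec variant used by graph2vec; 50 dimensions
# matches the embeddings used later in this post.
model = Doc2Vec(documents, vector_size=50, dm=0, min_count=1, epochs=20)

embedding = model.dv["some-director-id"]  # a 50-dimensional graph embedding
```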
*Note: The algorithm described above is a simplified version of the Weisfeiler-Lehman (WL) kernel for graphs [3]. In the full version, each neighborhood label is hashed to an integer after every iteration to keep memory requirements constant.
Graph embeddings for high-risk networks
As can be seen below, a projection of the 50 embedding dimensions down to 2 (using principal component analysis, or PCA) already shows that the high-risk director networks (orange) can be visually separated from normal directors (blue).
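A sketch of that projection, assuming scikit-learn and matplotlib, with a hypothetical NumPy array `embeddings` of shape (n_graphs, 50) and a matching 0/1 `labels` array:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 50-dimensional graph embeddings onto their top 2
# principal components.
coords = PCA(n_components=2).fit_transform(embeddings)

# Normal directors in blue, high-risk director networks in orange.
plt.scatter(coords[labels == 0, 0], coords[labels == 0, 1], c="tab:blue", s=5)
plt.scatter(coords[labels == 1, 0], coords[labels == 1, 1], c="tab:orange", s=5)
plt.show()
```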
Machine Learning on Graph Embeddings
Indeed, training a simple logistic regression on the graph embeddings gives an area under the receiver-operating characteristic curve (ROC-AUC) of 0.84 – strong performance given that we are only looking at network shape, not node characteristics!
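As a sketch of that evaluation step, assuming scikit-learn and the same hypothetical `embeddings`/`labels` arrays as above (the 0.84 figure comes from the real dataset, not from this snippet):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hold out a stratified test set so the AUC is measured on unseen networks.
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, stratify=labels, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the high-risk class
print(roc_auc_score(y_test, scores))
```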
Where next?
Graph2Vec can be seen as a 'shallow' encoding – it is a one-to-one mapping of seen graphs to vectors. This can be extended to 'deeper' methods in Graph Neural Networks (GNNs), which can classify unseen graphs and incorporate node features into the embedding model. Combining these approaches with traditional risk indicators is an area of active research for the innovation team at Quantexa.
References:
[1] Narayanan, A. et al. "graph2vec: Learning distributed representations of graphs." arXiv:1707.05005 (2017).
[2] Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International Conference on Machine Learning. PMLR, 2014.
[3] Shervashidze, N. et al. "Weisfeiler-Lehman graph kernels." Journal of Machine Learning Research (JMLR), 2011.