Generating Synthetic Graphs with Fractal-like properties
The generation of synthetic data stands out as a powerful tool, especially in scenarios where real data is limited, sensitive, or unavailable. One challenging aspect within synthetic data generation is the creation of synthetic graphs or networks, mirroring the unique structures and characteristics of real-world graphs.
The difficulty in this field arises from the fact that real-world networks are complex and interconnected: nodes in these networks cannot be modelled independently. In addition, ensuring data privacy is crucial. Simply anonymizing nodes by replacing identifiable information falls short, as it leaves edges vulnerable to identification through subgraph queries. To address these challenges, our approach uses a mathematical method to transform the observed graph into an estimator, ensuring data privacy without relying on the real graph data.
This article delves into a novel and scalable method we use at Quantexa for generating realistic graphs while preserving data privacy, making use of Stochastic Kronecker Graphs and node embeddings. We walk through a small, simple example to demonstrate this method, but in practice this can scale up to graphs on the order of one billion nodes and edges.
What is a Kronecker Graph?
How to generate a Kronecker Graph?
To create a Kronecker graph, you need to start with a modest-sized graph (G), acting as the foundation for the larger structure.
Figure 1: Observed graph with 12 nodes and 23 edges and its adjacency matrix
The key step in this process is to raise the adjacency matrix of this observed graph to a Kronecker power. The Kronecker product of two matrices is a block matrix in which each entry of the first matrix is multiplied by a copy of the entire second matrix; for a binary adjacency matrix, every 1 is replaced by a copy of the matrix and every 0 by a block of zeros.
The Kronecker product operation is then repeated recursively on the resulting matrix, forming a sequence of matrices, each depicting a progressively larger graph. Ultimately, the last matrix in this Kronecker product sequence is extracted and represents the adjacency matrix of the final Kronecker graph.
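The recursion above can be sketched in a few lines of NumPy. The small matrix below is an illustrative stand-in, not the adjacency matrix from Figure 1:

```python
import numpy as np

# Illustrative 3x3 adjacency matrix (not the observed graph from Figure 1)
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

def kronecker_power(A, k):
    """Repeatedly Kronecker-multiply the matrix with itself (k factors)."""
    result = A
    for _ in range(k - 1):
        result = np.kron(result, A)
    return result

A2 = kronecker_power(A, 2)
print(A2.shape)  # (9, 9): an n x n matrix grows to n^k x n^k
```

Because `A[0, 0] == 1`, the top-left block of `A2` is a copy of `A` itself, which is exactly the self-similar, fractal-like structure the figure illustrates.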
Figure 2: Kronecker power (symbolised by k) is the repeated Kronecker multiplications of the same matrix, which allows the generation of self-similar adjacency matrices in a recursive manner.
As the Kronecker power increases, the graph size expands, and as a result the resource demands of generating and storing the graph's adjacency matrix in dense form become prohibitively high. The matrix becomes increasingly large, demanding substantial memory and computational power.
To overcome this limitation, we use sparse matrix multiplication. The sparse nature of real-world graphs means that using sparse matrix multiplication avoids unnecessary operations with zero elements.
To be mindful about data privacy, we implemented a stochastic version of the Kronecker graph model, which fits the observed graph and generates an estimator that both satisfies data privacy and preserves graph statistical properties.
Fitting this model to an observed graph involves a maximum likelihood estimation (MLE) calculation that exploits the recursive nature of the Kronecker product along with stochastic gradient descent (SGD) to remain tractable; more details can be found in Leskovec et al. (2008). The result of this fitting method is a set of parameters (referred to as the seeding matrix) that can generate realistic-looking networks matching the statistical properties of the observed graph.
This method was used to fit the observed graph (Figure 1) and produced the following seeding matrix, which serves as the entry point to the Kronecker Generator.
Figure 3: Seeding stochastic matrix
After fitting the observed graph, we run the Stochastic Kronecker Generator model and present the results below.
Figure 4: Generated Stochastic Kronecker Graph
The generated Kronecker graph replicates specific structural patterns from the observed graph, resulting in a barbell-like structure.
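Generation from a fitted seeding matrix can be sketched as follows. The seed values here are hypothetical placeholders (the fitted values in Figure 3 are not reproduced), and for clarity the sketch materializes the full probability matrix, which only works for small graphs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2x2 seeding matrix of edge probabilities
# (NOT the fitted values shown in Figure 3)
theta = np.array([[0.9, 0.5],
                  [0.5, 0.2]])

def sample_stochastic_kronecker(theta, k, rng):
    """Kronecker-power the seed into per-edge probabilities,
    then draw each edge independently with its probability."""
    P = theta
    for _ in range(k - 1):
        P = np.kron(P, theta)
    return (rng.random(P.shape) < P).astype(int)

adj = sample_stochastic_kronecker(theta, 4, rng)
print(adj.shape)  # (16, 16)
```

At scale, Leskovec et al.'s generator avoids building `P` entirely: it places the expected number of edges one at a time by recursively descending into the seed matrix, so a billion-node graph never requires a dense probability matrix.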
How realistic are the generated networks?
At this point we need to assess the quality of the generated data by comparing commonly used graph metrics and distance measures.
Table 1 compares the observed graph and the generated Stochastic Kronecker Graph across the following metrics: number of zero-degree nodes, strong and weak connectivity, number of strongly connected components, 90% effective diameter, approximate full diameter, and average shortest path length.

Table 1: graph metrics exhibit some similarity, demonstrating the ability to generate similar graphs
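Several of these comparison metrics can be computed with networkx, sketched below. Note the 90% effective diameter needs a percentile over pairwise distances; as a simple stand-in, this sketch reports the full diameter of the largest weakly connected component:

```python
import networkx as nx

def graph_metrics(G):
    """Compute a few of the Table 1 comparison metrics for a directed graph.
    The full diameter of the largest weakly connected component is used
    here as a simple stand-in for the 90% effective diameter."""
    wcc = max(nx.weakly_connected_components(G), key=len)
    H = G.subgraph(wcc).to_undirected()
    return {
        "zero_deg_nodes": sum(1 for _, d in G.degree() if d == 0),
        "strongly_connected": nx.is_strongly_connected(G),
        "weakly_connected": nx.is_weakly_connected(G),
        "n_scc": nx.number_strongly_connected_components(G),
        "diameter_wcc": nx.diameter(H),
        "avg_shortest_path": nx.average_shortest_path_length(H),
    }

# Example on a random directed graph (not the graphs from the article)
G = nx.gnp_random_graph(50, 0.1, seed=1, directed=True)
print(graph_metrics(G))
```

Running the same function on both the observed and the generated graph gives the side-by-side comparison summarized in Table 1.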
It's also interesting to look at the distribution of some of these metrics and compare the observed graph to the Stochastic Kronecker generated graph.
Figure 6: (a) degree distribution in the graph, (b) the distribution of eigenvalues, (c) Densification Power Law (DPL), giving the number of edges versus the number of nodes over time, and (d) the effective diameter over time.
From the above plots, we can observe that the Stochastic Kronecker graph aligns with all the patterns quite well, indicating promising results.
Having successfully generated the structure of the network, our focus shifts to labelling the generated nodes accurately, aiming to model the heterogeneous graphs encountered in real Quantexa networks. Our initial observed graph contains two distinct node types: individual and business.
Figure 7: Observed Graph with node label annotation
Our objective is to ensure that nodes in our generated network contain a variety of types, from individuals to businesses, or any other entity. To model this diversity, we introduce graph node embeddings for node classification, a pivotal step in achieving richer, more realistic graphs.
Node embeddings are vector representations that capture important information about the local neighbourhood structure of each node. These vectors enable us to classify nodes based on their distinctive characteristics.
"Nodes that are similar in the network will have similar embeddings."
To generate node embeddings, we used node2vec (Grover &amp; Leskovec, 2016). This algorithm transforms each node of a graph into a numerical representation that preserves structural information, so that similar nodes have similar representations and vice versa. For every node in the graph, the algorithm generates a series of random walks capturing that node's neighbourhood information. These embeddings were then used to train an XGBoost classifier that predicts the entity type for each node in the generated graphs.
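The walk-generation step at the heart of node2vec can be sketched as below. This is the unbiased special case (return and in-out parameters p = q = 1); the full algorithm biases each step by p and q, and the walks are then fed to a skip-gram model to produce the embedding vectors:

```python
import random
import networkx as nx

def random_walks(G, walk_length=5, walks_per_node=3, seed=0):
    """node2vec-style random walks with p = q = 1 (plain random walks).
    Each walk acts as a 'sentence' for a downstream skip-gram model."""
    rng = random.Random(seed)
    walks = []
    for node in G.nodes():
        for _ in range(walks_per_node):
            walk = [node]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break  # dead end: stop the walk early
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Example on a standard small graph (not the graphs from the article)
G = nx.karate_club_graph()
walks = random_walks(G)
print(len(walks))  # 3 walks per node: 3 * 34 = 102
```

Nodes that co-occur frequently in these walks end up close together in embedding space, which is what makes the vectors useful features for the node-type classifier.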
The classifier trained on the embeddings was used to predict the node types of the Kronecker Graph and below you can see the resulting entity types.
Figure 8: Generated Stochastic Kronecker Graph with node label annotation
We dived into a family of graph generation models built on a non-traditional matrix operation, the Kronecker product. The generated graphs exhibit fractal-like properties, maintaining the structure and information of the observed graphs. We addressed the challenge of data privacy by introducing Stochastic Kronecker Graphs. Finally, we extended the solution to model heterogeneous graphs via node embeddings and node classification, producing more realistic results.