From keyword extraction to knowledge graphs, graph and network science offer a good framework for dealing with natural language. We love using graph-based methods in our work, for example to generate more labeled data, visualize language acquisition, and shed light on hidden biases in language, so we decided to start a series on the topic. The first part explored the theoretical background of network science and dealt with graphs using Python. This part focuses on graph processing frameworks and graph databases.
Why do we need graph databases and frameworks?
This question may seem naive to everyone but newcomers. Keep in mind that at some point:
- Your data doesn’t fit into your computer’s memory.
- Data processing takes ages even if you use parallelization techniques.
- It is too complicated to use CSV, JSON, Parquet or any other file format.
- You must manage your data, because it is changing over time.
- You need to process your data frequently to answer various questions.
As we mentioned in the first part of the series, NetworkX is not good at handling large networks, i.e. those with more than about 100,000 nodes, although the exact limit depends on the structure of the network. If you work with a large dataset, you need two tools: one for processing it (e.g. to compute centrality measures or find clusters) and another for storing it and running analytic queries on it (e.g. find the shortest path between two nodes, or list all nodes that can be reached from a given node within five or fewer steps).
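To make the difference between the two kinds of work concrete, here is a minimal sketch in plain Python on a tiny, made-up graph. The graph, the function names and the two-step limit are our own illustrative assumptions; a real setup would hand the first task to a processing framework and the second to a graph database.

```python
from collections import deque

# Hypothetical toy graph stored as an undirected adjacency list.
graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}


def degree_centrality(g):
    """Processing-style task: compute one score for every node in the graph."""
    n = len(g)
    return {node: len(neighbours) / (n - 1) for node, neighbours in g.items()}


def bfs_distances(g, source, max_steps=None):
    """Query-style task: hop distances from one node, optionally capped at max_steps."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if max_steps is not None and dist[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbour in g[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


centrality = degree_centrality(graph)                            # whole-graph analytics
shortest_a_f = bfs_distances(graph, "a")["f"]                    # shortest path length a -> f
within_two = set(bfs_distances(graph, "a", max_steps=2)) - {"a"}  # nodes within two steps of a
```

On six nodes both tasks are trivial. The point is only that the first one touches every node, while the second starts from a single node and explores its neighbourhood.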
The landscape of graph databases is huge and complicated. Read this post if you want to get a systematic overview of it. We have a very opinionated position on graph databases: we like open source and open standards, so we like graph databases that support the Gremlin graph traversal machine and language. Gremlin supports host language embedding, so you can write traversals in your own programming language in a very idiomatic way.
If you want to learn more about what graph databases offer, how to model your data and what kind of queries can be run on such databases, read Graph Databases in Action by Bechberger and Perryman. It is freely available on its website.
There are a great many graph databases, but we especially love JanusGraph. It is 100% open source, and in our experience it works well, though it is not perfect.
Neo4j is probably the most comprehensive and most advanced graph database, and it is widely used in the industry. We think it is superior to the others, but it is not fully open. Of course, you can use its community edition for learning and testing. Its native query language is Cypher, but it can also be queried with Gremlin through Apache TinkerPop, so it is a good choice to work with as well.
Graph Frameworks – really it’s just Spark
Most graph processing frameworks build on a paper from Google that describes Pregel, its internal system for large-scale graph processing. The system is named after the river of Königsberg, and yes, this is the river that had those seven bridges.
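The core idea of the Pregel model is to "think like a vertex": computation proceeds in synchronous rounds (supersteps), and in each round a vertex reads the messages sent to it, updates its own value, and sends messages to its neighbours. The following single-machine sketch spreads the largest value through a small graph in this style. The function name and the toy data are our own assumptions; this is an illustration of the model, not Google's implementation or any framework's API.

```python
def pregel_max(graph, values):
    """Propagate the maximum vertex value through the graph in supersteps."""
    values = dict(values)

    # Superstep 0: every vertex sends its current value to its neighbours.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: the vertex stays halted
            best = max(messages)
            if best > values[vertex]:
                # The vertex learned a larger value: update and notify neighbours.
                values[vertex] = best
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(best)
            # Otherwise the vertex votes to halt by sending nothing.
        inbox = outbox
    return values


# Hypothetical chain a - b - c - d with an initial value on each vertex.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = pregel_max(chain, {"a": 3, "b": 6, "c": 2, "d": 1})
```

Because each vertex only looks at its own messages, the per-vertex work in a superstep can be distributed across many machines, which is what makes the model attractive for very large graphs.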
There are three major frameworks for graph processing: Apache Hadoop, Apache Giraph and Apache Spark. Giraph is the only one that was built solely for graph processing. Sadly, it is neither actively maintained nor well documented. Hadoop and Spark are general big data analytics engines with graph processing capabilities. These days Spark seems to be the more popular one, at least among data scientists.
GraphX is Spark's API for graphs and graph-parallel computation. Although it is far from being a perfect tool, it is widely used in the industry, very robust, and well supported by documentation and a big user base.
OK, but how are these things used in NLP/ML?
Deep Learning is the sexiest thing on earth these days, but it needs lots of data. Google uses its Pregel system to feed its algorithms in a semi-supervised way. This paper explains how Pregel is used for a kind of label spreading method to boost training data. Such a system was used to train the smart reply function of Gmail, and it helped to improve Google’s sentiment analyzer.
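The intuition behind label spreading can be shown with a much simpler stand-in: a few nodes carry known labels, and every unlabeled node repeatedly takes the majority label of its neighbours. The sketch below is a toy majority-vote variant under our own assumptions (the word graph, the seed labels and the alphabetical tie-breaking are all made up); it is not the method described in the paper.

```python
from collections import Counter


def propagate_labels(graph, seeds, max_iter=10):
    """Spread seed labels to unlabeled nodes by synchronous majority vote."""
    labels = dict(seeds)
    for _ in range(max_iter):
        updates = {}
        for node in sorted(graph):
            if node in seeds:
                continue  # seed labels are treated as ground truth and never change
            votes = Counter(labels[n] for n in graph[node] if n in labels)
            if votes:
                top = max(votes.values())
                # Ties are broken alphabetically to keep the result deterministic.
                updates[node] = min(lab for lab, count in votes.items() if count == top)
        new_labels = {**labels, **updates}
        if new_labels == labels:
            break  # converged: no label changed in this round
        labels = new_labels
    return labels


# Hypothetical similarity graph between words; only two words are labeled.
word_graph = {
    "great": ["good", "nice"],
    "good": ["great", "nice"],
    "nice": ["great", "good", "poor"],
    "awful": ["bad", "poor"],
    "bad": ["awful", "poor"],
    "poor": ["awful", "bad", "nice"],
}
seed_labels = {"great": "positive", "awful": "negative"}
spread = propagate_labels(word_graph, seed_labels)
```

Starting from two labeled words, all six end up with a label, which is the sense in which such methods "boost" a small labeled dataset.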
Graph databases can be used for various tasks, but Knowledge Graphs are the most well-known example. Historically, Google developed its Knowledge Graph service on the basis of Freebase, a semantic database, to enhance its search results with factual information. Today, the name of the service has become a synonym for semantic databases. Building knowledge graphs is a very common NLP task in the industry. For example, by using a named entity recognizer you can build a very simple one based on the co-occurrence of entities, or you can take a step further and use relation mining to determine the type of the connection between the co-occurring entities. Read this post to see a simple example of building a knowledge graph from unstructured text. The knowledge graph is usually stored in a graph database. Graph analytics is used to enhance the data with centrality measures, clusters, and other metrics. It also helps to filter out unwanted data points.
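The co-occurrence approach mentioned above can be sketched in a few lines. We assume that a named entity recognizer has already been run, so the per-sentence entity lists below are hypothetical, hand-written output rather than the result of a real model.

```python
from collections import Counter
from itertools import combinations

# Hypothetical NER output: the entities found in each sentence of a text.
sentence_entities = [
    ["Google", "Freebase"],
    ["Google", "Gmail"],
    ["Google", "Freebase", "Knowledge Graph"],
    ["Gmail"],
]


def cooccurrence_edges(sentences):
    """Count how often each pair of entities appears in the same sentence."""
    edges = Counter()
    for entities in sentences:
        # Sort and deduplicate so that (a, b) and (b, a) map to the same edge.
        for first, second in combinations(sorted(set(entities)), 2):
            edges[(first, second)] += 1
    return edges


edges = cooccurrence_edges(sentence_entities)
for (first, second), weight in sorted(edges.items()):
    print(f"{first} -- {second} (weight {weight})")
```

The resulting weighted edge list is untyped: it says that two entities are related, not how. Relation mining would replace the plain count with a labeled connection, and the edges could then be loaded into a graph database.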
Graph Algorithms: Practical Examples in Apache Spark and Neo4j by Needham and Hodler is full of great examples of using graph analytics and graph databases. You can download it for free after filling out a form here.
No one works alone in the real world. Data engineers tend to provide data scientists with the necessary infrastructure. So you don’t have to become an expert in graph databases and processing frameworks, but you should know enough to work and communicate with your peers.
What’s coming up next?
If you are interested in this topic, we have good news. Alessandro Negro of GraphAware, author of Graph-Powered Machine Learning, will speak about Using Knowledge Graphs to predict customer needs, improve product quality and save costs at our upcoming meetup. He will also present a demo, Fighting corona virus with Knowledge Graph and Hume. Register here to attend the online event, or watch the recorded talk later on our YouTube channel.
In the third part of this blog series we will introduce open source tools for visualizing both smallish and large graphs. Stay tuned!
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.