Elasticsearch and Why We Use It
This article gives an overview of what Elasticsearch is, and how and why it's used at Quantexa. If you're a data scientist, business analyst, or an end user, this piece will give you some useful context for what Elasticsearch is all about.
What is Elasticsearch?
Elasticsearch, or Elastic, is a near real-time, distributed storage, search, and analytics engine. Quantexa has used Elasticsearch from the beginning to store and query the data we ingest into the Quantexa platform, because we knew Search would be a central feature of the platform.
Elasticsearch powers both our Search and Entity Resolution capabilities.
How does it work?
Data is passed to Elasticsearch for storage after each step of the Extract, Transform, and Load (ETL) process. This process, as shown in the diagram, is generally handled by Data Fusion, which takes in the raw data and prepares it to be used for Entity Resolution (ER).
Each of the four index types sources its data from a different point in the process, as follows:
- Document Indexes: Once the data has been Cleansed, Document indexes are created as an output. These indexes are queried by Document Search and can also be used for dynamic Scoring of Documents.
- Resolver Indexes: The linking data extracted during Cleansing can be used to resolve Entities. This data is stored in "Resolver" indexes and is used to perform dynamic Entity Resolution for UI features such as Investigations, as well as batch processes such as Graph Scripting.
- Entity Store Indexes: Once Entities are created, the information about them is stored together to increase the speed and reliability of Entity searches and to allow more detailed inspection of them in the UI.
- Other Indexes: When the Cleansing process discovers data that can't be used for Entity Resolution but could still add value, that data is transferred to Other indexes. Examples might include individual transactions, which are often too numerous to visualize as Documents in a Network diagram. Other indexes can be queried and displayed in the UI within a table or with other visualizations, such as Sankey diagrams in our Explorer feature.
Any "other data" in the platform that is not indexed after the final step is Quantexa-specific data about how the platform itself operates, such as a list of active Investigations, and is stored elsewhere.
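As a rough illustration of the four index types listed above, the routing from pipeline output to index type could be sketched like this. The names `IndexKind`, `INDEX_FOR_OUTPUT`, and the output labels are hypothetical and chosen only for this example; they are not Quantexa or Elasticsearch identifiers.

```python
from enum import Enum


class IndexKind(Enum):
    """The four index types described in this article."""
    DOCUMENT = "document"
    RESOLVER = "resolver"
    ENTITY_STORE = "entity_store"
    OTHER = "other"


# Hypothetical labels for what each point in the process produces,
# mapped to the index type that stores it.
INDEX_FOR_OUTPUT = {
    "cleansed_document": IndexKind.DOCUMENT,   # output of Cleansing
    "linking_data": IndexKind.RESOLVER,        # extracted during Cleansing
    "entity": IndexKind.ENTITY_STORE,          # created by Entity Resolution
    "non_er_data": IndexKind.OTHER,            # e.g. individual transactions
}


def index_for(output_label: str) -> IndexKind:
    """Return the index type that would hold the given pipeline output."""
    return INDEX_FOR_OUTPUT[output_label]


print(index_for("linking_data").value)
```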
You can read about the indexes in more detail in our article about Elasticsearch considerations for Quantexa. You'll find more detail about the configuration of Elasticsearch indexes on the Resolver Elasticsearch configuration page on our Documentation site.
How is a search performed?
A layer known as the Search service sits between the User Interface (UI) and Elasticsearch, and passes queries between the two to make Search work. This allows Elasticsearch to understand our complex data models, which differ depending on the Document type.
For example, if the user enters a query in the UI filtered for "Forename" and "Surname", some additional work is needed to handle that request, because that filter might correspond to several different locations in the Document, such as "Shareholder name" or "Beneficial owner name".
The Search service uses logic, configured to the Document type, to translate all instances of that type of data into something consistent, so they can be passed to Elasticsearch and it can then pass back the right results.
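The translation step can be pictured with a minimal sketch. This is not the Search service's actual implementation: the `FILTER_FIELDS` mapping, the `company` Document type, and every field name below are assumptions made up for illustration. Only the shape of the output, a `bool` query containing `multi_match` clauses, is standard Elasticsearch query DSL.

```python
from typing import Any

# Hypothetical per-Document-type configuration: each UI filter maps to
# every field in the Document where that kind of data can appear.
FILTER_FIELDS: dict[str, dict[str, list[str]]] = {
    "company": {
        "Forename": ["shareholder.forename", "beneficialOwner.forename"],
        "Surname": ["shareholder.surname", "beneficialOwner.surname"],
    },
}


def build_query(document_type: str, filters: dict[str, str]) -> dict[str, Any]:
    """Translate UI filters into an Elasticsearch query body."""
    must = []
    for filter_name, value in filters.items():
        fields = FILTER_FIELDS[document_type][filter_name]
        # A value found in any one of the mapped fields satisfies the filter.
        must.append({"multi_match": {"query": value, "fields": fields}})
    return {"query": {"bool": {"must": must}}}


query = build_query("company", {"Forename": "Michael", "Surname": "Greene"})
print(query)
```

Note that this simple form does not require the forename and surname to belong to the same person; it only shows how one filter can fan out to several Document fields.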
Deployments will configure different filters, under customizable groups, depending on the needs of the project, and the user will then be able to select the corresponding options in the Search UI. Each filtered search will retrieve specific information from the indexes stored in Elasticsearch, depending on which filter is used.
Why do we use Elasticsearch?
Elasticsearch supports nested logic, which keeps related values together when matching and allows our Search to be smarter.
For example, suppose you search for directors with the forename "Michael" and the surname "Greene". Without nested logic telling the system that you only want results where both terms match the same director, the search would also return results for "David Greene" and "Michael Jones".
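The difference can be simulated in a few lines of plain Python. The sketch below assumes a hypothetical company Document with a `directors` array; `flattened_match` mimics how values are pooled per field when an array of objects is not nested, while `nested_match` mimics nested logic. The `nested_query` dictionary shows the general shape of an Elasticsearch nested query and assumes that `directors` is mapped as a nested field.

```python
# Two directors on one hypothetical company Document.
directors = [
    {"forename": "Michael", "surname": "Jones"},
    {"forename": "David", "surname": "Greene"},
]


def flattened_match(people, forename, surname):
    """Without nesting: each term may match a different person."""
    forenames = {p["forename"] for p in people}
    surnames = {p["surname"] for p in people}
    return forename in forenames and surname in surnames


def nested_match(people, forename, surname):
    """With nesting: both terms must match within the same person."""
    return any(
        p["forename"] == forename and p["surname"] == surname for p in people
    )


# General shape of the equivalent Elasticsearch nested query
# (field names are illustrative assumptions).
nested_query = {
    "query": {
        "nested": {
            "path": "directors",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"directors.forename": "Michael"}},
                        {"match": {"directors.surname": "Greene"}},
                    ]
                }
            },
        }
    }
}

print(flattened_match(directors, "Michael", "Greene"))  # True: false positive
print(nested_match(directors, "Michael", "Greene"))     # False: no such director
```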
Plus, when you look at more detailed information about a Document, Elasticsearch data is rendered in the UI through the Document Viewer.
This functionality is only available thanks to the way we store the data in Elasticsearch.
Overall, the approach makes configuration easier down the line and means that Quantexa takes on the complexity of nesting data, rather than it being taken on by the deployment.
Where can I find out more?
For more details on the architecture of how Elasticsearch is implemented, see Elasticsearch Considerations for Quantexa. Refer to the Documentation Site for further details about Elasticsearch, or you can find general information on Elasticsearch's website.