Advanced Language Parsers Release
We are excited to announce the release of our new Advanced Language Parsers, designed to support accurate parsing of non-Latin alphabets natively in the Quantexa platform. This new capability will enable our customers to build contextual insights from across their data estate and expand Quantexa's use in a wider range of geographies. This first release supports Japanese language parsing.

Parsers in the Quantexa Platform

Quantexa is well known for its best-in-class Entity Resolution, and Parsers play a significant role in making our Entity Resolution as accurate as it is. Parsing is the process of extracting relevant information from ingested data and transforming it into a structured format that can be easily analyzed. For example, a customer system will typically hold a record such as 'Mrs. Jane Doe'. Parsing extracts it into manageable pieces: Title: Mrs.; GivenName: Jane; FamilyName: Doe. It does the same for a record in a different format, such as 'Jane Doe, Mrs.', because it identifies the individual components. The more complicated the data, the more processing is needed to prepare it for high-quality Entity Resolution, for example translation, transliteration, and normalization of the data.

Quantexa's existing Standard Parsers are proven to parse data with high accuracy, while providing the ability to incorporate cultural differences and increase parsing accuracy for data from specific geographies by tailoring the Parsers. However, they work best with data in Latin character sets. For more information about Quantexa's Parsers, see our documentation.

To process data in alphabets other than Latin out of the box, we have created ML-powered Advanced Language Parsers, starting with the Advanced Japanese Parser (more Advanced Parsers are on the roadmap for later this year). This will significantly streamline Data Ingestion and result in far more accurate Entity Resolution for these non-Latin languages.

By the way, you can now explore our roadmap and give feedback on our features and functionality in our Product Roadmap & Ideas Portal. Be a part of our product development!

What are we working with?

Japanese words can be written in three different scripts:

- Kanji: traditional Chinese characters
- Hiragana: a phonetic lettering system, used for words not covered by Kanji and for grammatical inflections
- Katakana: a phonetic lettering system, used for transcribing foreign-language words into Japanese

Beyond using different character sets, Japanese data has many interesting characteristics. For example, Japanese addresses are typically formatted from big to small (country > city > street > house number), while Western addresses are usually formatted small to big (house number > street > city > country).

Transliteration vs Translation

Japanese words can be transliterated to create a Romanized version in Latin script, known as Romaji, or translated, so that the English equivalent of the word is used where one exists.

| Japanese | Romaji | English |
| --- | --- | --- |
| ソニーグループ株式会社 | Sonī Gurūpu Kabushiki-gaisha | Sony Group Corporation |
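The Advanced Parser ships with its own transliteration tooling, so purely as a rough illustration of what transliteration involves (and why kanji is the hard part), here is a minimal sketch using the open-source ICU4J library; this is an assumption for illustration, not the tooling used by the parser itself:

```scala
import com.ibm.icu.text.Transliterator

object RomajiSketch {
  def main(args: Array[String]): Unit = {
    // Katakana is phonetic, so a rule-based transliterator is enough
    // for that part of the name.
    val katakanaToLatin = Transliterator.getInstance("Katakana-Latin")
    println(katakanaToLatin.transliterate("ソニーグループ")) // ≈ "sonī gurūpu"

    // Kanji is harder: 株式会社 should be read "kabushiki-gaisha", but that
    // reading comes from a dictionary, not from character-level rules
    // (ICU's generic Han-Latin rules would fall back to Chinese pinyin
    // readings). This is why dictionary-backed tooling is needed for
    // Japanese names, as described below.
  }
}
```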
What is included in Advanced Parsers?

The Advanced Japanese Parser includes Individual, Business, and Address parsers.

Individual parser
- Based on a library provided by the CJK institute, which tokenizes and transliterates the characters that make up Japanese names
- The library consists of code and a database that is distributed alongside it
- The code calls the database to retrieve the most likely transliterations of Japanese names based on combinations of input characters

Business parser
- Uses the existing business parser architecture with Japanese standardizations
- Translates using lookups from JMDict
- Transliterates using two third-party tools

Address parser
- Uses an AI model, a 'Mixed Field Parser', trained on Japanese data for parsing addresses
- Transliterates only (no translation), using two third-party transliterators
- Can produce enriched variants using publicly available address postcode information

In addition, a new configuration of the Email Parser was created to handle email addresses containing Unicode characters (including Japanese).

What is needed to configure Japanese Parsers?

To create entities from Japanese data, you will need to take the following steps:

1. Add data sources that contain Japanese names, businesses, and addresses.
2. For the data sources with Japanese data, update the parse method to use the Advanced Japanese parsers. The Advanced Parser is applied if the input contains Japanese characters; if the input contains Latin characters only, the data is parsed using the standard (composite) parsers (see the routing sketch after this list).
3. Modify the entity files to use the new compound groups.
4. Add custom Japanese resolution templates and compounds to the resolver config.
5. Run ETL with the correct usage of the Advanced Parsers, including the CJK and MFP files and Spark config.
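As a minimal sketch of the routing check in step 2, assuming detection is based on Unicode script membership (the parser's actual detection logic may differ), the following shows how Japanese input can be distinguished from Latin-only input:

```scala
import java.lang.Character.UnicodeScript

object ParserRoutingSketch {
  // Unicode scripts that signal Japanese text: Han (kanji), Hiragana,
  // Katakana. Note that Han characters are shared with Chinese, so real
  // routing logic may need to be more nuanced than this.
  private val japaneseScripts: Set[UnicodeScript] =
    Set(UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA)

  /** True if any code point in the input belongs to a Japanese script. */
  def containsJapanese(input: String): Boolean =
    input.codePoints().toArray.exists(cp => japaneseScripts.contains(UnicodeScript.of(cp)))

  def main(args: Array[String]): Unit = {
    Seq("東京都千代田区丸の内1-1", "10 Downing Street, London").foreach { addr =>
      val parser =
        if (containsJapanese(addr)) "Advanced Japanese parser"
        else "standard (composite) parser"
      println(s"$addr -> $parser")
    }
  }
}
```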
Important to know

For now, the Advanced Language Parsers are an experimental release in Parsers version 4.2.1. The Advanced Parsers include several tools (including an ML model) that are targeted at increasing the accuracy of data processing and, subsequently, of Entity Resolution. The trade-off for accuracy is performance: users can expect an increase in runtime compared to the Standard Parsers, and on average a 2x increase in Elastic index sizes. The good news is that these estimates apply only to the percentage of the data that is in Japanese characters, and will not affect the figures for data processed by the Standard Parsers. For more information about performance and testing, check the Release Notes.

How can I get the Advanced Japanese Parser for my project?

Full information about the Advanced Japanese Parser is available on the Doc site. However, since this is an experimental release of the functionality, please reach out to @Anastasia Petrovskaia if you feel that the parser is applicable to your project. We are working on adding this capability to the Demo environment and are targeting March 2025 for this piece of work.

What is next?

Adoption and feedback from users will be a big part of maturing the Advanced Parsers, so there are no immediate plans to move the capability straight to EA/GA. You can provide feedback directly on the Advanced Language Parsers for Non-Latin Scripts using the Product Roadmap & Ideas Portal. The next Parser release will focus on improvements to the Standard Parsers. More Advanced Language Parsers for different languages and countries (e.g. Chinese, Arabic) are expected in H2 2025. For more information, reach out to Anastasia Petrovskaia.

Detection Packs 0.3 Release

We are excited to announce the release of version 0.3 of Detection Packs. This is the third major release of Detection Packs and builds on the 0.2 version, which introduced our low-code interface. For full details of the release, including compatible Quantexa Platform versions and minor enhancements, please see the Quantexa Documentation site.

Expanded Score Coverage

This release focuses on expanding our score coverage and generally maturing the product, with no significant changes to the interface, enabling projects already using 0.2 to upgrade to 0.3 easily.

- Two new transaction score pipelines were added, each with four score types such as "Transaction with Different Currencies" and "Transaction in Listed Jurisdiction".
- Five new Entity Record score types have been added, such as "Highly Connected Entity" and "Entity With Listed Type".
- Five new Entity Network score types have also been added, such as "Entity With Indirect Relation To Listed Jurisdiction" and "Entity Linked To Entity With Listed Status".

In total, the Fincrime Detection Pack now contains 26 pre-written, configurable, re-usable, and extensible Score types, which can be combined to produce a total of 56 Scores (see the sketch at the end of this post for the general pattern). For full documentation on these, please see our technical documentation. These new scores, in addition to those already in the Fincrime Detection Pack, can be extended further to meet project-specific needs by using the customization options documented on the Quantexa Documentation Site.

The collection of supporting Reference Scores has continued to expand, even as several have been adopted into this Detection Packs release. As a reminder, Reference Scores are pre-written Scores created in conjunction with our users to provide additional Scores over and above the core Detection Pack for FinCrime. They also cover additional use cases outside of FinCrime, and the catalogue currently contains over 50 further scores. Recent updates to the Reference Scores include a new correspondent banking use case, and updates to transaction scores such as 'Transaction With Mirrored Trading' and 'Transaction in High Proportion of Low Value Security'.

Simplified User Experience

In addition to the expanded scoring options, the Detection Packs user experience has been simplified by reducing the amount and complexity of configuration required for your project. In v0.2 of Detection Packs, projects that only wished to use a subset of the supported scores were still required to set up all of their data mappings. From v0.3 this is simpler, with various configuration options no longer required if they are not used.

Coming soon to Detection Packs

We are currently targeting mid 2024 for the 0.4 release of Detection Packs, with lots of exciting new features. Here are some of the planned features our users can look forward to in this release and beyond:

- Adoption of many more Reference Scores into officially supported, configuration-driven Detection Packs Scores
- Simplified graph-scripting support
- Dynamic pipeline generation
- Additional use case support, such as an Entity-level detection model
- Improved out-of-the-box testing and tooling
- Multi-typology and multi-product Scorecard support
- Score versioning and seamless upgrade support
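Finally, as referenced above, here is a minimal sketch of the pattern that lets 26 Score types produce 56 Scores: one parameterized Score type, instantiated with different configurations, yields multiple distinct Scores. All of the names below are hypothetical illustrations, not the Detection Packs API; see the technical documentation for the real interfaces.

```scala
// All types and names here are hypothetical, for illustration only;
// they are not the Detection Packs API.
final case class Transaction(currency: String, jurisdiction: String)

// One reusable, configurable Score type: flag a transaction when a chosen
// attribute appears on a configured list.
final case class ListedAttributeScore(
    name: String,
    attribute: Transaction => String,
    listedValues: Set[String]
) {
  def triggered(txn: Transaction): Boolean = listedValues.contains(attribute(txn))
}

object ScorecardSketch {
  def main(args: Array[String]): Unit = {
    // Two distinct Scores produced from the same Score type, differing
    // only in configuration.
    val inListedJurisdiction = ListedAttributeScore(
      "Transaction in Listed Jurisdiction", _.jurisdiction, Set("XX", "YY"))
    val withListedCurrency = ListedAttributeScore(
      "Transaction with Listed Currency", _.currency, Set("ZZZ"))

    val txn = Transaction(currency = "ZZZ", jurisdiction = "GB")
    Seq(inListedJurisdiction, withListedCurrency).foreach { score =>
      println(s"${score.name}: triggered=${score.triggered(txn)}")
    }
  }
}
```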