Entity Resolution Configuration & Parsing Health Checks
Entity Resolution (ER) and good Entity quality underpin all Quantexa deployments. Entity configuration and parsing are two areas whose accuracy directly impacts Entity quality. These articles outline Entity Resolution (ER) health checks and Parsing health checks to be carried out by the development team on deployments, allowing the team to identify, prioritize, and fix any underlying issues that could be reducing Entity quality. These checks must be completed as part of the initial deployment, but also periodically over the lifetime of the deployment: new product functionality and changes in the data may mean configuration needs to be changed or enhanced over time.

Topics covered:

- Entity Resolution Health Checks
  - Pre-requisites
  - Resolver JSON configuration health check steps
    - Perform a comparison to the latest core Resolver JSON configuration
    - Review configured Element exclusion criteria
    - Review configured exclusions for Compounds in the relevant template
  - Compound model health check steps
    - Are all required Compounds being generated in ETL for the relevant Document types?
    - Are Compounds being generated to populate elements required for exclusions in other Compounds?
    - Do the traversals all look sensible?
    - Do you have good coverage of unit tests?
- Parsing Health Checks
  - Pre-requisites
  - Parsing health check steps
    - Is your deployment using the latest versions of Parsers?
    - Has your deployment applied custom Parsing functions or wrappers?
    - How does your Parsing compare to best practice Parsing?
    - How well are the Parsers performing per source and country?
    - How well-populated are the Parsed fields?
Advanced Language Parsers Release

We are excited to announce the release of our new Advanced Language Parsers, designed to support accurate parsing of non-Latin alphabets natively in the Quantexa platform. This new capability will enable our customers to build contextual insights from across their data estate and expand Quantexa's use in a wider range of geographies. This first release supports Japanese language parsing.

Parsers in the Quantexa Platform

Quantexa is well known for its best-in-class Entity Resolution, and Parsers play a significant role in making our Entity Resolution as accurate as it is. Parsing is the process of extracting relevant information from ingested data and transforming it into a structured format that can be easily analyzed. For example, a customer system typically holds a record such as 'Mrs. Jane Doe'. Parsing extracts it into manageable pieces: Title: Mrs.; GivenName: Jane; FamilyName: Doe. It does the same for a record in a different format, such as 'Jane Doe, Mrs.', because it identifies the individual components. The more complicated the data, the more processing is needed to prepare it for high-quality Entity Resolution, for example translation, transliteration, and normalization of the data.

Quantexa's existing Standard Parsers are proven to parse data with high accuracy, while providing the ability to incorporate cultural differences and increase the accuracy of parsing data from specific geographies by tailoring the Parsers. However, they work best with data in Latin character sets. For more information about Quantexa's Parsers, see our documentation.

To process data in alphabets other than Latin out of the box, we have created ML-powered Advanced Language Parsers, with the first release being the Advanced Japanese Parser (more Advanced Parsers are on the roadmap for later this year). This will significantly streamline Data Ingestion and result in far more accurate Entity Resolution for these non-Latin languages. By the way, you can now explore our roadmap and give feedback on our features and functionality in our Product Roadmap & Ideas Portal. Be a part of our product development!

What are we working with?

Japanese words can come in 3 different scripts:

- Kanji (traditional Chinese characters)
- Hiragana (a phonetic lettering system, used for words not covered by Kanji and for grammatical inflections)
- Katakana (a phonetic lettering system, used for transcription of foreign-language words into Japanese)

Apart from using different character sets, data in Japanese has a lot of interesting characteristics. For example, Japanese addresses are typically formatted from big to small (country > city > street > house number), while Western addresses are usually formatted small to big (house number > street > city > country).

Transliteration vs Translation

Japanese words can be transliterated to create a Romanized version of the Japanese words using Latin script, called Romaji. Or they can be translated, so that the English equivalent of the word is used where one exists.

| Japanese | Romaji | English |
| --- | --- | --- |
| ソニーグループ株式会社 | Sonī Gurūpu Kabushiki-gaisha | Sony Group Corporation |
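To make the distinction concrete, here is a minimal Scala sketch of the two operations, using the Sony example from the table above. The hard-coded lookup maps are illustrative stand-ins only; the actual parsers use full dictionaries (such as JMDict) and third-party transliterators rather than hand-written tables.

```scala
// A minimal sketch of transliteration vs translation, assuming hypothetical
// token-level lookup tables (not the real parser dictionaries).
object TransliterateVsTranslate {

  private val transliterations = Map(
    "ソニー" -> "Sonī",
    "グループ" -> "Gurūpu",
    "株式会社" -> "Kabushiki-gaisha"
  )

  private val translations = Map(
    "ソニー" -> "Sony",
    "グループ" -> "Group",
    "株式会社" -> "Corporation"
  )

  // Transliteration: render the *sound* of each token in Latin script.
  def transliterate(tokens: Seq[String]): String =
    tokens.flatMap(transliterations.get).mkString(" ")

  // Translation: replace each token with its English *meaning*, if one exists.
  def translate(tokens: Seq[String]): String =
    tokens.flatMap(translations.get).mkString(" ")

  def main(args: Array[String]): Unit = {
    val tokens = Seq("ソニー", "グループ", "株式会社")
    println(transliterate(tokens)) // Sonī Gurūpu Kabushiki-gaisha
    println(translate(tokens))     // Sony Group Corporation
  }
}
```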
What is included in Advanced Parsers?

The Advanced Japanese Parser includes Individual, Business and Address parsers.

Individual Parser
- Based on a library provided by the CJK Institute which tokenizes and transliterates characters representing Japanese names.
- The library consists of code and a database to be distributed.
- The code makes calls to the database to retrieve the most likely transliterations of Japanese names based on combinations of input characters.

Business Parser
- Uses the existing business parser architecture with Japanese standardizations.
- Translates using a lookup from JMDict.
- Transliterates using two third-party tools.

Address Parser
- Uses an AI model, a 'Mixed Field Parser' trained on Japanese data, for parsing addresses.
- Transliterates only (no translation) using two third-party transliterators.
- Can produce enriched variants using publicly available address postcode information.

Also, a new configuration of the Email Parser was created to handle emails with Unicode characters (including Japanese).

What is needed to configure Japanese Parsers?

To create Entities with Japanese data, take the following steps:

1. Add data sources that contain Japanese names, addresses, and businesses.
2. For the data sources with Japanese data, update the parse method to use the Advanced Japanese Parsers. The Advanced Parser is applied if the input contains Japanese characters; if the input contains Latin characters only, the data is parsed using the standard (composite) parsers (see the sketch at the end of this article).
3. Modify the entity files to use the new compound groups.
4. Add custom Japanese resolution templates and compounds to the resolver config.
5. Run ETL with the correct usage of the Advanced Parsers, including the CJK and MFP files/Spark config.

Important to know

For now, Advanced Language Parsers are an experimental release in Parsers version 4.2.1. Advanced Parsers include a few tools (including an ML model) that are targeted at increasing the accuracy of data processing and, subsequently, ER. The trade-off for accuracy is performance: users can expect an increase in runtime compared to the standard parsers, and on average a 2x increase in Elasticsearch index sizes. The good news is that these estimates apply only to the percentage of the data that is in Japanese characters, and they do not affect the figures for data processed by the standard parsers. For more information about performance and testing, check the Release Notes.

How can I get the Advanced Japanese Parser for my project?

Full information about the Advanced Japanese Parser is available on the Doc site. However, since this is an experimental release of the functionality, please reach out to @Anastasia Petrovskaia if you feel that the parser is applicable to your project. We are working on adding this capability to the Demo environment and are targeting March 2025 for this piece of work.

What is next?

Adoption and feedback from users will be a big part of maturing the Advanced Parsers, so there are no immediate plans to move the capability straight to EA/GA. You can provide feedback directly on the Advanced Language Parsers for Non-Latin Scripts using the Product Roadmap & Ideas Portal. The next Parser release will focus on improvements to the Standard Parsers. More Advanced Language Parsers for different languages/countries (e.g. Chinese, Arabic) are expected in H2 2025. For more information, reach out to Anastasia Petrovskaia.
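Step 2 above routes each record by script. As an illustration of that rule, here is a minimal Scala sketch using the JDK's Unicode script metadata to decide which parse path to take. The two parser functions are hypothetical stand-ins, not the Quantexa API.

```scala
// A minimal sketch of the routing rule from step 2: inputs containing any
// Japanese script (Hiragana, Katakana, or Han/Kanji) go to the Advanced
// Japanese Parser; Latin-only inputs go to the standard (composite) parser.
import java.lang.Character.UnicodeScript

object ParserRouting {

  private val japaneseScripts: Set[UnicodeScript] =
    Set(UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA, UnicodeScript.HAN)

  def containsJapanese(input: String): Boolean =
    input.codePoints().toArray.exists(cp => japaneseScripts.contains(UnicodeScript.of(cp)))

  // Hypothetical stand-ins for the two parse paths.
  def parseWithAdvancedJapaneseParser(input: String): String = s"[advanced] $input"
  def parseWithStandardParser(input: String): String = s"[standard] $input"

  def parse(input: String): String =
    if (containsJapanese(input)) parseWithAdvancedJapaneseParser(input)
    else parseWithStandardParser(input)

  def main(args: Array[String]): Unit = {
    println(parse("ソニーグループ株式会社"))  // routed to the advanced parser
    println(parse("Sony Group Corporation")) // routed to the standard parser
  }
}
```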
Welcome to Parsers 4.2 | Release Announcement

Alongside the release of QP 2.7, we are happy to share release 4.2.0 of the Standard Parsers. This release extends the cleansing options you can define purely in config with configurable simple generic cleansers. We have introduced new config-based cleansers that allow you to perform replacements in strings, remove or keep parts of a string based on pre-defined options, change the case of a string for different languages, and extract specific parts of input strings, all without writing any Scala (a sketch of this idea appears at the end of this announcement). These back-end config improvements extend to the front-end: users of QP 2.7 will have access to the extended configurability described above within the UI. More details are available in the release notes and documentation for QP 2.7. The release also brings some key bug fixes and a small improvement to business parsing that should catch more edge cases of business names with odd punctuation distributions. For more information on the release features, please see the 4.2.0 release notes, and for general information on Parsers, see the documentation.
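To illustrate what "cleansing defined purely in config" can look like, here is a minimal Scala sketch of a config-driven cleansing chain. The operation names and config shape are assumptions made for illustration; they do not reflect the actual Standard Parsers configuration schema.

```scala
// A minimal sketch of config-driven cleansing: each cleanser is declared as
// data (as it would be in a config file), then interpreted at runtime, so no
// custom Scala is needed per deployment.
object ConfigurableCleansers {

  sealed trait CleanserConfig
  case class Replace(find: String, replaceWith: String) extends CleanserConfig
  case class RemovePattern(regex: String) extends CleanserConfig
  case class ChangeCase(locale: java.util.Locale) extends CleanserConfig
  case class ExtractPattern(regex: String) extends CleanserConfig

  def applyCleanser(input: String, config: CleanserConfig): String = config match {
    case Replace(find, replaceWith) => input.replace(find, replaceWith)
    case RemovePattern(regex)       => input.replaceAll(regex, "")
    case ChangeCase(locale)         => input.toUpperCase(locale)
    case ExtractPattern(regex)      => regex.r.findFirstIn(input).getOrElse(input)
  }

  // Apply the configured cleansers in order, threading the string through.
  def cleanse(input: String, configs: Seq[CleanserConfig]): String =
    configs.foldLeft(input)(applyCleanser)

  def main(args: Array[String]): Unit = {
    val chain = Seq(
      Replace("&", "AND"),
      RemovePattern("""[.,;]"""),
      ChangeCase(java.util.Locale.ENGLISH)
    )
    println(cleanse("Smith & Sons, Ltd.", chain)) // SMITH AND SONS LTD
  }
}
```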
Welcome to Parsers 4 | Parsers 4.0.0 Release Announcement

We are excited to announce the version 4.0 release of Quantexa's Standard Parsers. This release marks a shift in the way the Standard Parsers are used, and it includes an expansion of the data models so that more information is made available for use in Entity Resolution. Here are a few of the features you'll find in this release:

- A new low-code interface: Standard Parsers are now quicker and easier to deploy and use, with new, easily shareable customization options, integration with Data Fusion, and less custom code, giving better coverage of auto-migrations to support upgrades. Setup and customization now use configuration files, much like Data Fusion, meaning anyone can make changes without needing to write code.
- Updated Data Models: Entity Resolution can now form higher-quality Entities, thanks to more information stored in our Standard Data Models at the parsing stage, so Entities have more contextual information associated with them and alignment with international standards.
- Higher match rates with Variants: The introduction of Variants enables even more use cases and lets users resolve the same Entity in more ways, in both Search and within Entity Resolution itself, so it is possible to catch more edge cases. You can now use out-of-the-box Variants with the `address`, `individual` and `business` data models, and even define custom Variants (see the sketch at the end of this announcement).
- Name flexibility and internationalisation: In the global market, localizations to both individual and business names can impact the accuracy of data, so we've updated name structures to represent names in a wider variety of cultures more accurately, improving Entity quality.

For more detail on the new features, including contextual parsing (also known as composite parsing) and additional experimental country-specific address parsing, as well as compatibility and support with different versions of the Quantexa Platform and other minor changes, see the full set of Release Notes on the Quantexa Documentation site. If you are unable to access the Documentation site, please get in touch with your Quantexa point of contact or the Community team at community@quantexa.com.
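To illustrate the Variants concept, here is a minimal Scala sketch that derives alternative forms of an individual name so the same Entity can be matched in more ways. The specific variant rules below are illustrative assumptions, not the out-of-the-box rules shipped with the `individual` data model.

```scala
// A minimal sketch of the Variants idea: derive alternative representations
// of a parsed name that should still resolve to the same individual.
object NameVariants {

  case class IndividualName(givenName: String, familyName: String)

  // Hypothetical variant rules: full name, reversed order, and initialized
  // given name, all normalized to upper case.
  def variants(name: IndividualName): Set[String] = {
    val full     = s"${name.givenName} ${name.familyName}"
    val reversed = s"${name.familyName} ${name.givenName}"
    val initial  = s"${name.givenName.take(1)} ${name.familyName}"
    Set(full, reversed, initial).map(_.toUpperCase)
  }

  def main(args: Array[String]): Unit = {
    // JANE DOE, DOE JANE, and J DOE can all match the same Entity.
    variants(IndividualName("Jane", "Doe")).foreach(println)
  }
}
```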
Welcome to Parsers 4.1 | Release Announcement

We are excited to announce the release of version 4.1 of Quantexa's Standard Parsers. This release focuses on improving the integration with Fusion UI (look out for exciting 2.6 release announcements coming soon) and on improvements to the file structure of configuration files. The release includes the following highlights, which are detailed below:

- Consistency of configuration files: general improvements to Parser and lexicon configuration and files have been introduced to make sure the way you use these files is consistent across all available Parsers. This will make your configuration easier to understand and simplify future modifications.
- To minimise redundant data storage in Elasticsearch, you can now exclude business standardisation terms that aren't used in areas such as exclusions for Entity Resolution.
- Similarly, you can now choose to parse multiple names or just a single name with the Individual Parser to reduce your Elasticsearch footprint.
- For the Telephone Parser, you can now specify conditional parsing rules to increase output accuracy. For example, if you have more specific parsing rules for UK telephone numbers, you can use the country code to parse these numbers differently to the default telephone parsing behaviour (see the sketch at the end of this announcement).

Note: There are no changes in this release that affect the output of parsing.
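As an illustration of conditional parsing rules keyed on country code, here is a minimal Scala sketch. The rule and output types are hypothetical and do not reflect the actual Telephone Parser configuration API.

```scala
// A minimal sketch of conditional telephone parsing: specific rules are tried
// in order based on the number's country code, falling back to a default.
object ConditionalTelephoneParsing {

  case class ParsedTelephone(countryCode: String, nationalNumber: String)

  type Rule = String => Option[ParsedTelephone]

  // Default behaviour: strip non-digits and leave the number unclassified.
  val defaultRule: String => ParsedTelephone = raw =>
    ParsedTelephone("UNKNOWN", raw.filter(_.isDigit))

  // A more specific rule for UK numbers (+44): normalize to the national
  // format with a leading zero.
  val ukRule: Rule = raw => {
    val digits = raw.filter(_.isDigit)
    if (digits.startsWith("44")) Some(ParsedTelephone("44", "0" + digits.drop(2)))
    else None
  }

  // Try each conditional rule in order, falling back to the default.
  def parse(raw: String, rules: Seq[Rule]): ParsedTelephone =
    rules.view.flatMap(_(raw)).headOption.getOrElse(defaultRule(raw))

  def main(args: Array[String]): Unit = {
    println(parse("+44 20 7946 0000", Seq(ukRule))) // ParsedTelephone(44,02079460000)
    println(parse("+1 202 555 0100", Seq(ukRule)))  // falls back to the default rule
  }
}
```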