
FAQ: I'm missing data in Elasticsearch / my number of docs is wrong

Dan_Pryer Posts: 1,665 QUANTEXA TEAM
edited December 2023 in Academy

FAQ relevant for: all Academy versions

If you have completed the ETL pipeline stages of your project and uploaded the data to Elasticsearch, then when you check your indices in the Elasticsearch Head plugin on Chrome you should see numbers similar to the picture below (to get a bigger version of the image, right-click it and choose the option to open it in a new tab).

If your numbers are significantly different from these, go back through your ETL pipeline and check each stage carefully to find where data is being lost. A good approach is to work forwards from CreateCaseClass, checking the output of each stage until you find the problem area. You should also use the counts in Elasticsearch to guide you. For example, if you have only half the number of businesses shown above and no individuals, you probably haven't joined your Third Parties onto the ICIJ document properly, so you should go back and double-check how you have done this join and on which fields.
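If you would rather compare counts from the command line than read them off the Head plugin, Elasticsearch's `_cat/indices` API returns the document count per index. The function below is only a rough sketch: `check_counts` is a helper name I made up, the 10% tolerance is an arbitrary choice, and the index names and expected counts in the usage comment are placeholders, not the real Academy values (take those from the picture). It assumes Elasticsearch is listening on localhost:9200, as in the commands further down.

```shell
# Compare per-index document counts against expected values.
# stdin: lines of "<index> <docs.count>", for example from
#   curl -s 'http://localhost:9200/_cat/indices?h=index,docs.count'
# args:  "<index>=<expected>" pairs (placeholder values, use your own)
# An index is flagged CHECK when it differs from the expected count by more
# than 10%, and MISSING when it does not appear in the input at all.
check_counts() {
  awk -v spec="$*" '
    BEGIN {
      n = split(spec, pairs, " ")
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, "=")
        expected[kv[1]] = kv[2] + 0
      }
    }
    ($1 in expected) {
      seen[$1] = 1
      diff = $2 - expected[$1]
      if (diff < 0) diff = -diff
      status = (diff * 10 > expected[$1]) ? "CHECK" : "ok"
      printf "%s %s expected=%s actual=%s\n", status, $1, expected[$1], $2
    }
    END {
      for (name in expected)
        if (!(name in seen)) printf "MISSING %s expected=%s\n", name, expected[name]
    }
  '
}

# Usage (hypothetical index names and counts):
#   curl -s 'http://localhost:9200/_cat/indices?h=index,docs.count' \
#     | check_counts my-business-index=1000 my-individual-index=500
```

An index reported as MISSING or CHECK tells you which entity type to trace back through the pipeline first.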

Specific points to consider:

  • Have I correctly parsed all of the necessary fields in my qmodel files?
  • Have I used the correct type of joins in CreateCaseClass, and have I joined on the correct fields?
  • Have I output the correct Dataset at the end of CreateCaseClass?
  • Have I loaded DocumentDataModel.parquet (the output of CreateCaseClass) into a Spark shell to check the output there?
  • Have I correctly identified and defined all relevant start paths in my qentity files?
  • Do I have a good range of compound keys for each Entity?

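If you are convinced that you have done all of the above correctly, you can try clearing the data from Elasticsearch, restarting the service and then re-uploading the data to Elasticsearch using the following three commands. Be aware that the first command deletes every index on the node, not just the ones for this project: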

curl -X DELETE 'http://localhost:9200/_all'

sudo systemctl restart elasticsearch.service

./ -s -c ../external.conf -r elastic.icij
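One thing to watch for: Elasticsearch can take a little while to start accepting requests after a restart, so an upload started straight away may fail to connect. A small polling helper like the one below can be run between the restart and the upload. This is a sketch under assumptions, not part of the Academy tooling: `wait_for_es` is a name I have invented, and the default URL, attempt count and delay are arbitrary choices.

```shell
# Poll Elasticsearch until it answers an HTTP request, or give up.
# Usage: wait_for_es [base_url] [max_attempts] [delay_seconds]
wait_for_es() {
  url="${1:-http://localhost:9200}"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; --max-time bounds each probe.
    if curl -sf --max-time 5 "$url/_cluster/health" > /dev/null 2>&1; then
      echo "Elasticsearch is up (attempt $i)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Elasticsearch did not respond after $attempts attempts" >&2
  return 1
}
```

Calling `wait_for_es` with no arguments after the `systemctl restart` command waits up to roughly a minute of sleep time for localhost:9200 to respond, and returns a non-zero status if it never does.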

Dan Pryer - Senior Data Engineer

R&D - Decision Systems / Detection Packs
