Data quality is the measurement of data in terms of accuracy, completeness, consistency, validity, uniqueness and timeliness. It enables businesses and organizations to quantify and better manage their data.
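As a rough illustration of what quantifying data quality can look like, the sketch below scores two of these dimensions, completeness and uniqueness, for a single field. The helper names and the row layout (a list of dictionaries) are assumptions made for the example, not part of any Quantexa tooling.

```python
def _is_filled(value):
    """Treat None and empty strings as missing values."""
    return value is not None and value != ""


def completeness(rows, field):
    """Share of rows in which `field` is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if _is_filled(row.get(field)))
    return filled / len(rows)


def uniqueness(rows, field):
    """Share of the non-empty values of `field` that are distinct."""
    values = [row[field] for row in rows if _is_filled(row.get(field))]
    if not values:
        return 0.0
    return len(set(values)) / len(values)
```

Scores like these are only a starting point; dimensions such as accuracy and timeliness usually need a reference source or timestamps to measure.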
Machine learning (ML) is one of the biggest value-adds for businesses, and at the heart of every machine learning challenge is the input data. A machine learning model is only as good as the data that goes into it, which is why, in the world of supervised learning, data scientists rely on a set of accurately labelled data points.
We ran into several data challenges while building the name/business model, a Quantexa project described in an earlier blog post. A few of the challenges we faced were:
- User-Entered Data: Writing processing steps can be an effective way to overcome this common data problem and create a more reliable dataset from the outset, for example by checking details such as the city listed in the address.
- The ‘Italian Restaurant Problem’: The ideal solution to a challenge such as this would be to filter these examples out of the training set. Alternatives include building a model using data from jurisdictions where the ‘Italian Restaurant Problem’ is less common and applying this as a first pass.
- Redacted Data Points: This is another example of class crossover that requires careful consideration: some redacted data points could exist in either class, even if that has not been observed in the training set.
- Under-Represented Classes: Another challenge intricately linked to that of class crossover is the existence of data points that have been systematically mislabelled or missed out of the training data altogether.
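To make the first of these challenges concrete, here is a minimal sketch of a processing step for user-entered data. It assumes each record has a free-text `address` and a separately entered `city`, and flags records where the two disagree. The `Record` fields and function names are hypothetical, and a real pipeline would likely need fuzzier matching than a substring check.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    address: str  # free-text address as typed by the user
    city: str     # city entered in a separate field (assumed layout)


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial typing differences are ignored."""
    return " ".join(text.lower().split())


def city_matches_address(record: Record) -> bool:
    """Return True when the declared city also appears in the address text."""
    city = normalise(record.city)
    return bool(city) and city in normalise(record.address)


def split_reliable(records):
    """Separate records that pass the city check from those needing review."""
    reliable, suspect = [], []
    for record in records:
        target = reliable if city_matches_address(record) else suspect
        target.append(record)
    return reliable, suspect
```

Keeping the suspect records in their own list, rather than discarding them, leaves room to correct them later or to check whether one group of users is being filtered out disproportionately.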
Working with unreliable data, particularly user-entered data, can leave us open to bias, and the impact of bias is more serious than poor model accuracy alone.
In these circumstances, the inclusion of additional context becomes key. Using extra information to clear out unreliable data points in the training set and to fix problematic ones makes the models more reliable and leads to better-quality data.
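One way to picture this is a reconciliation pass over the training set. The sketch below assumes a hypothetical `context` lookup, built from some trusted external source, that maps a name either to a trusted label or to an "ambiguous" marker for names that could belong to either class. The labels `"individual"` and `"business"` and all names here are illustrative assumptions only.

```python
AMBIGUOUS = "ambiguous"  # marker for names the context says could be either class


def reconcile(examples, context):
    """Clean (name, label) pairs using extra context.

    - Names missing from `context` are kept unchanged.
    - Names marked AMBIGUOUS are dropped from the training set.
    - Names whose label disagrees with the trusted label are relabelled.
    Returns the cleaned examples plus counts of relabelled and dropped points.
    """
    cleaned, relabelled, dropped = [], 0, 0
    for name, label in examples:
        trusted = context.get(name)
        if trusted is None:
            cleaned.append((name, label))
        elif trusted == AMBIGUOUS:
            dropped += 1
        else:
            if trusted != label:
                relabelled += 1
            cleaned.append((name, trusted))
    return cleaned, relabelled, dropped
```

Returning the counts alongside the cleaned data makes it easier to see how much of the training set the extra context actually changed.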
The full blog covers these topics in more detail, including:
- What is Data Quality?
- Why is Data Quality Important with Machine Learning?
- Challenges to Reliable Data
- Managing Unreliable Data