One important area that many organisations are still overlooking, but where we see rising interest, is testing for improved data quality. It feels like a trickle-down from the healthcare industry, where data quality and data privacy are of the highest importance. This is great, as we can learn from the best :-)
Old “legacy” providers like Informatica, Oracle, SAS and Talend have solutions for data quality, and Gartner even has a Magic Quadrant for Data Quality Solutions. Of course, if contract negotiations and company audits are your thing and you have the budget, then I am sure they offer great platforms.
If you prefer a more open, agile framework, I definitely recommend looking at Great Expectations (https://greatexpectations.io) and AWS Deequ (https://github.com/awslabs/deequ) instead. Both are open source, and you can extend their functionality via Python if you have requirements that are not already supported.
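To give a feel for what these frameworks do, here is a tiny hand-rolled sketch of the kind of declarative checks they give you out of the box. The function names and sample rows are illustrative, not the actual Great Expectations or Deequ APIs:

```python
# Hand-rolled sketch of declarative data quality checks. The function names
# and the sample rows are made up for illustration; the real frameworks offer
# far richer, battle-tested versions of these "expectations".

def expect_column_values_not_null(rows, column):
    """Return (success, failed_rows) for a not-null expectation."""
    failed = [r for r in rows if r.get(column) is None]
    return len(failed) == 0, failed

def expect_column_values_between(rows, column, low, high):
    """Return (success, failed_rows) for a numeric range expectation.
    Null values count as failures here."""
    failed = [r for r in rows
              if r.get(column) is None or not (low <= r[column] <= high)]
    return len(failed) == 0, failed

rows = [
    {"order_id": 1, "amount": 99.0},
    {"order_id": 2, "amount": None},   # missing amount -> fails both checks
    {"order_id": 3, "amount": 250.0},  # outside the expected 0-200 range
]

ok, failed = expect_column_values_not_null(rows, "amount")
print(ok, len(failed))  # one row has a null amount
```

The point of the frameworks is exactly this separation: you declare what good data looks like, and they run the checks and report which rows failed.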
You can see our data scientist Sara demonstrating Great Expectations with Dataiku and Airflow here,
and here (Snowflake user group, 48 minutes into the Zoom meeting) demoing how our A Cloud Frontier team quarantines incoming data for further inspection. Mail email@example.com if you have any questions.
No matter which solution you choose to improve your data quality, you must design it to handle severity and stakeholder notification from the beginning.
For severity, we recommend designing your solution to cover these three outcomes:
- Warning. Notify stakeholders only, as you may have a non-critical data issue. This could be a KPI measure that has moved outside of your previous expectations. The data flow runs as usual.
- Invalid data. Quarantine the invalid data but process the valid records in your data flow. Notify stakeholders for further analysis. For example, 5% of your data may be invalid, but you have designed your data quality solution to continue the data flow with the valid data. The data flow runs with the valid data.
- Critical stop. Halt the data flow and block further processing of data. Notify stakeholders that the data pipeline has halted. This is of course the most severe case and should only be used when the data is critical for your business. The data flow is halted, and valid data from an earlier run is potentially used instead.
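The three outcomes above can be sketched as a small routing function. The `Severity` enum, `route_batch` and the 50% critical threshold are made-up names and numbers for illustration; real frameworks let you configure this per rule:

```python
# Sketch of routing a batch of records into the three severity outcomes.
# Severity, route_batch and the thresholds are illustrative, not taken from
# Great Expectations or Deequ.
from enum import Enum

class Severity(Enum):
    OK = "ok"
    WARNING = "warning"          # notify only, data flow runs as usual
    INVALID_DATA = "invalid"     # quarantine bad rows, continue with the rest
    CRITICAL_STOP = "critical"   # halt the pipeline

def route_batch(rows, is_valid, kpi_ok=True, critical_threshold=0.5):
    """Split a batch into valid/quarantined rows and decide the severity."""
    valid = [r for r in rows if is_valid(r)]
    quarantined = [r for r in rows if not is_valid(r)]
    invalid_ratio = len(quarantined) / len(rows) if rows else 0.0

    if invalid_ratio >= critical_threshold:
        severity = Severity.CRITICAL_STOP  # too much bad data: halt the flow
    elif quarantined:
        severity = Severity.INVALID_DATA   # continue with the valid records only
    elif not kpi_ok:
        severity = Severity.WARNING        # data is valid but a KPI drifted
    else:
        severity = Severity.OK
    return severity, valid, quarantined

# 1 of 5 rows has a null amount: quarantine it and continue with the rest.
rows = [{"id": i, "amount": a} for i, a in enumerate([10, 20, None, 30, 40])]
severity, valid, quarantined = route_batch(rows, lambda r: r["amount"] is not None)
print(severity.name, len(valid), len(quarantined))
```

The key design choice is that quarantining is the default reaction to bad rows, and halting is reserved for batches where so much data fails that continuing would do more harm than good.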
Now that we have briefly covered severity, let’s also notify the right people!
Proper stakeholder notifications are just as important as the data quality severity you defined. Not notifying the right people at the right time is almost as critical as delivering wrong or missing data.
For stakeholder notifications, design your data quality rules to notify the right people:
- Internal developers: any data quality issue should be reported to the data developers. Leaving bugs in data is not recommended, and every data quality issue should be investigated and potentially solved.
- Internal users: critical data issues should be reported to the relevant users. These are typically decision makers, data scientists and BI developers, and in some cases the source system owners.
- External partners: critical data issues should be reported to your partners. They are expecting their data delivery; their business depends on your data and the data quality you deliver to them.
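This stakeholder mapping can be as simple as a severity-to-recipients table. The group names below are made up; in practice the notifications would go out via e.g. email, Slack or an Airflow on-failure callback rather than a plain dict lookup:

```python
# Sketch of a severity -> recipients mapping. Group names are hypothetical;
# wire this up to your actual notification channels (email, Slack, etc.).

NOTIFICATION_MATRIX = {
    "warning": ["data-developers"],
    "invalid": ["data-developers", "internal-users"],
    "critical": ["data-developers", "internal-users", "external-partners"],
}

def recipients_for(severity):
    """Developers hear about every issue; partners only about critical ones."""
    return NOTIFICATION_MATRIX.get(severity, [])

print(recipients_for("critical"))
```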
More information can be found at https://acloudfrontier.com/data-quality-testing/
It is also worth visiting the Great Expectations and AWS Deequ communities, as they blog about the subject for their own frameworks.
AWS recently added Python support via PyDeequ: https://github.com/awslabs/python-deequ.
Enjoy improving your data quality :-)