Monitoring and Anomaly Detection for Your Data Pipelines

Imagine that you've just purchased a new car. Based on the routine prepurchase check, all systems are working according to the manual, the oil and brake fluid tanks are filled nearly to the brim, and the parts are good as new, because, well, they are.

After grabbing the keys from your dealer, you hit the road. "There's nothing like that new car smell!" you think as you pull onto the highway. Everything is fine and dandy until you hear a loud pop. You pull onto the shoulder, turn on your hazard lights, and jump out of the car. After a brief investigation, you've identified the alleged culprit of the loud sound: a flat tire.

No matter how many tests or checks your dealership could have done to validate the health of your car, there's no accounting for unknown unknowns (e.g., nails or debris on the highway) that might affect your vehicle. Similarly, in data, all of the testing and data quality checks under the sun can't fully protect you from data downtime, which can manifest at all stages of the pipeline and surface for a variety of reasons that are often unaffiliated with the data itself.

When it comes to understanding when data breaks, your best course of action is to lean on monitoring, specifically anomaly detection techniques that identify when your expected thresholds for volume, freshness, distribution, and other values don't meet expectations. Anomaly detection refers to the identification of events or observations that deviate from the norm, for instance, fraudulent credit card behavior or a technical glitch, like a website crash. (Assuming your website is normally up and running, of course.)

A number of techniques, algorithms, and frameworks exist and are used (and developed) by industry giants like Meta, Google, Uber, and others. For a technical deep dive, we recommend Preetam Jinka and Baron Schwartz's report Anomaly Detection for Monitoring (O'Reilly).

Up until recently, anomaly detection was considered a nice-to-have, not a need-to-have, for many data teams. Now, as data systems become increasingly complex and companies empower employees across functions to use data, it's imperative that teams take both proactive and reactive approaches to solving for data quality.

While automobiles are vastly different from data pipelines, cars and other mechanical systems have their own monitoring and anomaly detection capabilities, too. Most contemporary vehicles alert you when oil, brake fluid, gas, tire pressure, and other vital levels are lower than they should be and encourage you to take action. Data monitoring and anomaly detection function in much the same way.

In this chapter, we'll walk through how to build your own data quality monitors for a data warehouse environment to monitor and alert on the pillars of data observability: freshness, volume, distribution, and schema. In the process, we'll introduce important concepts and terms necessary to bulk up your understanding of important anomaly detection techniques.

Knowing Your Known Unknowns and Unknown Unknowns

There are two types of data quality issues in this world: those you can predict (known unknowns) and those you can't (unknown unknowns). Known unknowns are issues that you can easily predict, e.g., null values, specific freshness issues, or schema changes triggered by a system that updates regularly. These issues may not happen, but with a healthy dose of testing, you can often account for them before they cause issues downstream. In Figure 4-1, we highlight popular examples of both.

Unknown unknowns refer to data downtime that even the most comprehensive testing can't account for: issues that arise across your entire data pipeline, not just the sections covered by specific tests. Unknown unknowns might include:

- A distribution anomaly in a critical field that causes your Tableau dashboard to malfunction
- A JSON schema change made by another team that turns 6 columns into 600
- An unintended change to ETL (or reverse ETL, if you fancy) leading to tests not running and bad data being missed
- A code change that causes an API to stop collecting data feeding an important new product
- Incomplete or stale data that goes unnoticed until several weeks later, affecting key marketing metrics
- Data drift over time, which can be challenging to catch, particularly if your tests look only at the data being written at the time of your ETL jobs and don't take into account data that is already in a given table

Keep in mind that there are any number of technologies and approaches you can use to build data quality monitors, and the choices you make will depend on your tech stack. To crystallize how anomaly detection works, let's walk through a real-world tutorial in building an anomaly detector for a very anomalous data set.
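Before diving into the tutorial, here is a minimal sketch of one of the simplest anomaly detection techniques mentioned above: flagging values that breach an expected threshold. This is not the chapter's tutorial code; the function name, the z-score threshold, and the sample row counts are all illustrative assumptions, standing in for a volume monitor over a warehouse table's daily row counts.

```python
import statistics

def detect_volume_anomalies(daily_row_counts, z_threshold=2.0):
    """Flag days whose row count deviates from the mean by more than
    z_threshold standard deviations (a simple z-score check)."""
    mean = statistics.mean(daily_row_counts)
    stdev = statistics.stdev(daily_row_counts)
    return [
        (day, count)
        for day, count in enumerate(daily_row_counts)
        if stdev > 0 and abs(count - mean) / stdev > z_threshold
    ]

# Illustrative daily row counts; day 5 is a sudden drop to zero,
# the kind of "unknown unknown" a broken API might cause.
counts = [1000, 1020, 980, 1010, 990, 0, 1005]
print(detect_volume_anomalies(counts))  # flags day 5
```

A fixed z-score over a short window is deliberately naive: a single extreme outlier inflates the standard deviation, and seasonality (weekend dips, month-end spikes) will trigger false alarms, which is exactly why production monitors use more robust techniques.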