What is a data audit?

A data audit is a systematic examination of every stage of the data science process. Problems can be introduced at any stage, so a full audit requires close examination of each one. In another post I'll talk about tests that can uncover problems, and possible remediation for the problems found at each stage. For now I'll assume we have full access to the model in question, although many of these questions can be addressed even when access is limited.

At the highest level a data audit has four phases:

  1. DATA
  2. DEFINE
  3. BUILD
  4. MONITOR

To audit a given algorithm, we delve into questions specific to each phase.

DATA-related questions:

  1. What data have you collected? Is it relevant, and do you have enough of it and the right kind?
  2. What is the integrity of this data? Does it have bias? Is some of the data more or less accurate? How do you test this?
  3. Is your data systematically missing important types of information? Does it under- or over-represent certain types of events, behaviors, or people?
  4. How are you cleaning the data and dealing with missing, outlying, or unreasonable values? What is your ground truth for making these decisions?
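Questions 3 and 4 can be made concrete with simple counts. Here is a minimal sketch, assuming the data is a list of dictionaries and that you have some reference shares for each group (say, from a census). The field names `group` and `income`, the reference shares, and both helper functions are hypothetical, made up for illustration.

```python
from collections import Counter


def representation_gap(records, group_key, reference_shares):
    """Observed share minus expected share for each reference group."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {
        g: counts.get(g, 0) / total - share
        for g, share in reference_shares.items()
    }


def missing_rate_by_group(records, group_key, field):
    """Fraction of records in each group where `field` is None or absent."""
    totals, missing = Counter(), Counter()
    for r in records:
        totals[r[group_key]] += 1
        if r.get(field) is None:
            missing[r[group_key]] += 1
    return {g: missing[g] / totals[g] for g in totals}


# Hypothetical toy data, not real measurements.
records = [
    {"group": "a", "income": 50},
    {"group": "a", "income": 60},
    {"group": "a", "income": None},
    {"group": "b", "income": None},
]

# Positive gap = over-represented, negative gap = under-represented.
gaps = representation_gap(records, "group", {"a": 0.5, "b": 0.5})

# If one group has far more missing values, the missingness is systematic.
rates = missing_rate_by_group(records, "group", "income")
```

A large gap, or a missing rate that differs sharply between groups, doesn't prove a problem on its own, but it tells you where to look.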

DEFINE-related questions:

  1. How do you define "success" for your algorithm? Are there other related definitions of success, and what do you think would happen if you tweaked that definition?
  2. What attributes do you choose to search through to potentially associate with success or failure? To what extent are your attributes proxies instead of directly relevant to the definition of success, and what could go wrong?
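One way a proxy attribute can go wrong, and this is my assumption about a common failure rather than the only one, is that it quietly stands in for a sensitive characteristic. A rough first check is to measure how strongly a candidate attribute correlates with that characteristic. The sketch below uses a hand-rolled Pearson correlation; the attribute name `zip_code_score` and the data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: 1 = member of a protected group, 0 = not.
protected = [1, 1, 1, 0, 0, 0]
zip_code_score = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1]

# A value near +1 or -1 suggests the attribute is acting as a proxy.
r = pearson(zip_code_score, protected)
```

A high correlation is a flag for closer inspection, not a verdict. A low one doesn't clear the attribute either, since several attributes can jointly encode what no single one does.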

BUILD-related questions:

  1. What kind of algorithm should you use?
  2. How do you calibrate the model?
  3. How do you decide when the algorithm has been optimized?
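For models that output probabilities, one common way to examine calibration is to bin the predictions and compare the average predicted probability in each bin with the observed rate of positive outcomes. This is a sketch of that idea under my own assumptions (equal-width bins, binary outcomes), not a description of any particular model's calibration procedure. The function name and data are hypothetical.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (mean predicted prob, observed rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Equal-width bins on [0, 1]; p == 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            table.append((mean_p, rate, len(b)))
    return table


# Hypothetical toy data: predicted probabilities and actual 0/1 outcomes.
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 0]

table = calibration_table(probs, outcomes, n_bins=2)
for mean_p, rate, count in table:
    print(f"predicted {mean_p:.2f}  observed {rate:.2f}  n={count}")
```

In a well-calibrated model the predicted and observed columns track each other. Here the high bin predicts about 0.9 but only half of those cases came true, which is the kind of gap an audit should ask about.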

MONITOR-related questions:

  1. To what extent is the model working in production?
  2. Does it need to be updated over time?
  3. How are the errors distributed?
  4. Is the model creating unintended consequences?
  5. Is the model playing a part in a larger feedback loop?
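The question about how errors are distributed lends itself to a direct check: compute an error rate separately for each group and compare. Here is a minimal sketch using the false positive rate, assuming binary labels and predictions. The group labels, data, and function name are hypothetical, and the false positive rate is just one of several error measures you might choose.

```python
from collections import defaultdict


def false_positive_rate_by_group(groups, y_true, y_pred):
    """False positives divided by actual negatives, per group."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for g, truth, pred in zip(groups, y_true, y_pred):
        if truth == 0:
            negatives[g] += 1
            if pred == 1:
                false_pos[g] += 1
    # Groups with no actual negatives are left out (rate is undefined).
    return {g: false_pos[g] / negatives[g] for g in negatives}


# Hypothetical toy data, not real model output.
groups = ["a", "a", "a", "b", "b", "b"]
y_true = [0, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 1, 1, 1]

fpr = false_positive_rate_by_group(groups, y_true, y_pred)
```

In this toy data the model is equally right about the positives in both groups, but every negative in group "b" is wrongly flagged, so the overall accuracy hides who is bearing the errors.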