Introduction

DataLinter is a library for contextual linting of data and code. The main idea is that additional context allows more complex data and code quality issues to be detected, since problems in data modelling can arise from the data structure as well as from algorithmic or parameter choices. 'Context' here simply means additional information pertinent to the use of the data, available at runtime. For example, the classical way of linting a dataset uses no prior information about what the data will be used for, so the assumptions about its use remain implicit. Alternatively, one could provide the type of analysis or modelling the data is used for (e.g. classification), or the code, in a given programming language, in which the data is used. This gives much more flexibility in the types of checks that can be implemented.
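The difference between context-free and contextual checks can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not DataLinter's API: the `Context` class, the `lint` function, and the `max_majority_share` threshold are all hypothetical names chosen for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """Hypothetical runtime context describing how the data will be used."""
    analysis_type: Optional[str] = None   # e.g. "classification"
    target_column: Optional[str] = None   # column holding the class labels


def lint(rows, context=None, max_majority_share=0.8):
    """Return a list of warnings for a dataset given as a list of dicts."""
    warnings = []
    columns = list(rows[0].keys()) if rows else []

    # Context-free check: missing values matter whatever the data is used for.
    for col in columns:
        missing = sum(1 for row in rows if row[col] is None)
        if missing:
            warnings.append(f"{col}: {missing} missing value(s)")

    # Contextual check: class imbalance only matters for classification.
    if context and context.analysis_type == "classification" and context.target_column:
        labels = [row[context.target_column] for row in rows
                  if row[context.target_column] is not None]
        if labels:
            label, count = Counter(labels).most_common(1)[0]
            if count / len(labels) >= max_majority_share:
                warnings.append(
                    f"{context.target_column}: class '{label}' covers "
                    f"{count}/{len(labels)} rows"
                )
    return warnings


# Nine rows of class "a", one row of class "b" with a missing age.
rows = [{"age": 30 + i, "label": "a"} for i in range(9)]
rows.append({"age": None, "label": "b"})

plain = lint(rows)                                          # no context
contextual = lint(rows, Context("classification", "label")) # with context
```

Without context only the missing-value warning is raised. Once the linter is told that the data feeds a classifier with `label` as the target, the same dataset also triggers the imbalance check.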

DataLinter development started as a rewrite of Google's data linter project in the Julia language. The redesign aims at a richer and faster experience.

Installation

There are two ways to install DataLinter: cloning the GitHub repository or pulling a Docker image from the GitHub container registry. Unless one wants to develop DataLinter, the Docker installation is recommended.

Git cloning

The DataLinter repository can be downloaded through git:

$ git clone https://github.com/zgornel/DataLinter

Docker image (recommended)

$ docker pull ghcr.io/zgornel/datalinter-compiled:latest

Architecture (from the wiki)

So far the architecture looks like:

Note: arrows indicate dependencies and the arrow labels indicate intermediary modules

  • Full system: micro-kernel architecture (core system + plugins)
    graph TD
        A[data plugin module i.e. **DataCSV**] -- DataInterface --> C[Core System]
        K[knowledge plugin module i.e. **KnowledgeBaseNative**] -- KnowledgeBaseInterface --> C
        O[output plugin module] -- OutputInterface --> C
  • Core System: pipeline architecture
    graph LR
        D[DataInterface] --> L[LinterCore]
        C[Configuration] --> L
        K[KnowledgeBaseInterface] --> L
        O[OutputInterface] --> L
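The plugin-plus-core structure described above can be sketched roughly as follows. This is an illustration in Python, not DataLinter's Julia code: only the interface names (`DataInterface`, `KnowledgeBaseInterface`, `OutputInterface`) come from the diagrams, while every method, the `CSVData`, `NativeKnowledgeBase` and `TextOutput` classes, the `lint_core` function, and the `disabled_rules` configuration key are assumptions made for the example.

```python
import csv
import io
from typing import Callable, Dict, List, Protocol, Tuple


# Interfaces between plugins and the core (method names are hypothetical).
class DataInterface(Protocol):
    def columns(self) -> Dict[str, List[str]]: ...


class KnowledgeBaseInterface(Protocol):
    def rules(self) -> List[Tuple[str, Callable[[List[str]], bool]]]: ...


class OutputInterface(Protocol):
    def render(self, findings: List[Tuple[str, str]]) -> str: ...


class CSVData:
    """Data plugin: stands in for a module such as DataCSV."""
    def __init__(self, text: str):
        self.text = text

    def columns(self):
        reader = csv.DictReader(io.StringIO(self.text))
        cols = {name: [] for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                cols[name].append(value)
        return cols


class NativeKnowledgeBase:
    """Knowledge plugin: supplies named checks over a column's values."""
    def rules(self):
        return [
            ("empty_values", lambda values: any(v == "" for v in values)),
            ("constant_column", lambda values: len(set(values)) <= 1),
        ]


class TextOutput:
    """Output plugin: turns findings into a plain-text report."""
    def render(self, findings):
        return "\n".join(f"{col}: {rule}" for col, rule in findings)


def lint_core(data, kb, output, config):
    """Core pipeline: data + configuration + knowledge base -> output."""
    disabled = config.get("disabled_rules", set())
    findings = []
    for col, values in data.columns().items():
        for name, check in kb.rules():
            if name not in disabled and check(values):
                findings.append((col, name))
    return output.render(findings)


sample = "id,city\n1,Oslo\n2,\n3,Oslo\n"
report = lint_core(CSVData(sample), NativeKnowledgeBase(), TextOutput(), {})
silenced = lint_core(CSVData(sample), NativeKnowledgeBase(), TextOutput(),
                     {"disabled_rules": {"empty_values"}})
```

The core only ever calls the three interface methods, so a different data source, rule set, or report format can be swapped in without touching `lint_core`. That is the property the micro-kernel layout is meant to provide.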

The modules and corresponding implementations are shown below: