Introduction

DataLinter is a library for contextual linting of data and code. Its development began as a Julia rewrite of a data linter originally written at Google. The redesign aims to provide a richer and faster experience while retaining the baseline benefits outlined in the original paper.

Its main idea is that providing additional context enables the detection of more complex data and code quality issues. These can arise from the structure of the data as well as from algorithmic or parameter choices.

Context here simply means additional information pertinent to the use of the data, available at runtime. The classical way of linting a dataset uses no prior information about what the data will be used for, so any assumptions about its use remain implicit. Context could be, for example, the type of analysis or modelling the data is used for (e.g. classification), or the code, in a given programming language, that uses the data. This allows a much wider range of checks to be implemented.
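To make the distinction concrete, here is a minimal Julia sketch of a context-free check next to a context-aware one. It is only an illustration: `AnalysisContext`, `columns_with_missing` and `imbalanced_target` are hypothetical names invented for this example and are not part of DataLinter's API, and the table is a plain `NamedTuple` of vectors rather than a real dataset.

```julia
# Hypothetical sketch: none of these names come from DataLinter's API.

# Context: optional information about how the data will be used.
struct AnalysisContext
    analysis_type::Symbol   # e.g. :classification
    target_column::Symbol   # column the model is meant to predict
end

# Context-free check: names of columns that contain missing values.
function columns_with_missing(table)
    return [name for (name, col) in pairs(table) if any(ismissing, col)]
end

# Context-aware check: only meaningful when the data is used for
# classification; flags a target whose largest class outnumbers the
# smallest one by more than `max_ratio`.
function imbalanced_target(table, ctx::AnalysisContext; max_ratio=10.0)
    ctx.analysis_type == :classification || return false
    counts = Dict{Any,Int}()
    for value in getproperty(table, ctx.target_column)
        counts[value] = get(counts, value, 0) + 1
    end
    return maximum(values(counts)) / minimum(values(counts)) > max_ratio
end

table = (age = [31, 45, missing, 27], label = ["a", "a", "a", "b"])
ctx = AnalysisContext(:classification, :label)

println(columns_with_missing(table))                   # needs no context
println(imbalanced_target(table, ctx; max_ratio=2.0))  # needs the context
```

The second check cannot be expressed without knowing which column is the target and what kind of analysis is intended, which is the kind of information context supplies.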

Features

Features at a glance:

Installation

There are several ways to install DataLinter:

  • pulling a Docker image from the GitHub container registry (quick & safe)
  • downloading binaries (Linux only)
  • cloning the GitHub repository (for development of the library)

Docker image

The latest Docker image can be downloaded with

docker pull ghcr.io/zgornel/datalinter-compiled:latest

For specific versions, use

docker pull ghcr.io/zgornel/datalinter-compiled:v0.x.y

Available packages (Docker images) are listed in the 'Packages' section of the repository's GitHub page.

Binaries

The CLI and server binaries (linux-x86-64) can be downloaded from the releases page. Each release contains an Assets section with the binaries packaged as datalinter-compiled-binary.zip.

Julia

Installation can also be performed from the Julia REPL with

using Pkg; Pkg.add(url="https://github.com/zgornel/DataLinter")

The repository can also be directly cloned with

git clone https://github.com/zgornel/DataLinter

Architecture

The diagrams below show the current architecture, which can also be found on the wiki.

Note: arrows indicate dependencies and the arrow labels indicate intermediary modules

  • The full system follows a micro-kernel pattern (core system + plugins)
graph TD
    A[data plugin module i.e. **DataCSV**] -- DataInterface --> C[Core System]
    K[knowledge plugin module i.e. **KnowledgeBaseNative**] -- KnowledgeBaseInterface --> C
    O[output plugin module] -- OutputInterface --> C

graph LR
    D[DataInterface] --> L[LinterCore]
    C[Configuration] --> L
    K[KnowledgeBaseInterface] --> L
    O[OutputInterface] --> L
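The micro-kernel layout in the diagrams can be sketched in a few lines of Julia. This is a simplified illustration under stated assumptions, not DataLinter's actual code: the types `AbstractDataSource`, `AbstractOutput`, `InMemoryCSV` and `CollectingOutput`, and the functions `load_rows`, `emit` and `lint`, are all hypothetical names. The knowledge plugin is left out for brevity.

```julia
# Hypothetical sketch of a micro-kernel layout; the names are illustrative
# only and are not taken from DataLinter's source.

# Interfaces: the only things the core knows about.
abstract type AbstractDataSource end
abstract type AbstractOutput end

# Interface functions; each plugin adds a method for its own type.
load_rows(src::AbstractDataSource) = error("load_rows not implemented for $(typeof(src))")
emit(out::AbstractOutput, messages) = error("emit not implemented for $(typeof(out))")

# Data plugin: parses comma-separated text held in memory.
struct InMemoryCSV <: AbstractDataSource
    text::String
end
load_rows(src::InMemoryCSV) = [split(line, ',') for line in split(strip(src.text), '\n')]

# Output plugin: collects messages in a vector instead of printing them.
struct CollectingOutput <: AbstractOutput
    messages::Vector{String}
end
CollectingOutput() = CollectingOutput(String[])
emit(out::CollectingOutput, messages) = append!(out.messages, messages)

# Core: written against the interfaces only, so plugins can be swapped freely.
function lint(src::AbstractDataSource, out::AbstractOutput)
    rows = load_rows(src)
    expected = length(first(rows))
    messages = ["row $i has $(length(row)) fields, expected $expected"
                for (i, row) in enumerate(rows) if length(row) != expected]
    emit(out, messages)
    return out
end

out = lint(InMemoryCSV("a,b,c\n1,2,3\n4,5\n"), CollectingOutput())
foreach(println, out.messages)
```

Because `lint` depends only on the abstract types, adding a new data format or output target means adding a new subtype and its methods, without touching the core. This mirrors the direction of the dependency arrows in the diagrams.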

The modules and corresponding implementations are shown below: