Introduction

DataLinter is a library for contextual linting of data and code. Development started as a Julia rewrite of a data linter written at Google. The redesign aims to provide a richer and faster experience while keeping the baseline benefits outlined in the original paper.

Its main idea is that providing additional context leads to the detection of more complex issues relating to data and code quality. These issues can arise from the structure of the data as well as from algorithmic or parameter choices.

Context here simply means additional information pertinent to the use of the data, available at runtime. The classical way of linting a dataset uses no prior information on what the data will be used for, so any assumptions about its use remain implicit. Context could be, for example, the type of analysis or modelling the data is used for (e.g. classification), or the code in a given programming language that uses the data. This allows a much higher degree of flexibility in the types of checks that can be implemented.
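To make the idea of context concrete, here is a minimal Python sketch of a check that only fires when the context says the data is used for classification. It illustrates the concept only and is not DataLinter's API: the `Context` class, the `check_class_imbalance` function and the `max_ratio` threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """Hypothetical container for information about how the data will be used."""
    analysis_type: Optional[str] = None   # e.g. "classification"
    target_column: Optional[str] = None   # column the model should predict


def check_class_imbalance(rows, context, max_ratio=10.0):
    """Return a warning string if the target classes are heavily imbalanced.

    The check is only meaningful for classification, so it is skipped
    (returns None) when the context does not say so.
    """
    if context.analysis_type != "classification" or context.target_column is None:
        return None
    counts = {}
    for row in rows:
        label = row[context.target_column]
        counts[label] = counts.get(label, 0) + 1
    if len(counts) < 2:
        return "target column has fewer than two classes"
    ratio = max(counts.values()) / min(counts.values())
    if ratio > max_ratio:
        return f"class imbalance: ratio {ratio:.1f} exceeds {max_ratio}"
    return None


# 50 rows of class "a" and a single row of class "b"
rows = [{"x": i, "y": "a"} for i in range(50)] + [{"x": 0, "y": "b"}]

no_context = check_class_imbalance(rows, Context())
with_context = check_class_imbalance(rows, Context("classification", "y"))
```

Without context the check stays silent, because nothing says that column `y` is a prediction target. With context, the same data triggers a warning.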

Features

Features at a glance:

27 data+code linters (including the Google linters)
Docker image with compiled binaries, production ready
Zero-config CLI and HTTP server modes
CSV/Parquet/Arrow dataset support
Text / JSON / HTML output support
Flexible code querying through ParSitter.jl
First-class R language support through tree-sitter-based code parsing
Fully customizable rule engine (see configuration docs)

Installation

There are several ways to install DataLinter:

pulling a Docker image from the GitHub container registry (quick & safe)
downloading binaries (Linux only)
cloning the GitHub repository (for development of the library)

Docker image

The latest Docker image can be downloaded with

docker pull ghcr.io/zgornel/datalinter-compiled:latest

For specific versions, use

docker pull ghcr.io/zgornel/datalinter-compiled:v0.x.y

Available packages (Docker images) can be viewed in the 'Packages' section of the repository's GitHub page.

Binaries

The CLI and server binaries (linux-x86-64) can be downloaded from the releases page. Each release has an Assets section that provides the binaries as datalinter-compiled-latest-linux-x86-64.zip.

Julia

DataLinter can also be installed from the Julia REPL with

using Pkg; Pkg.add(url="https://github.com/zgornel/DataLinter")

The repository can also be directly cloned with

git clone https://github.com/zgornel/DataLinter

Architecture

The diagrams below show the current architecture, which can also be found on the wiki.

The full system follows a micro-kernel pattern: core system + plugins (arrows indicate dependencies):

graph LR
    A1[CSV plugin] --> DI[DataInterface]
    DI --> C[Core System]
    A2[Parquet plugin] --> DI
    A3[Arrow plugin] --> DI
    KN[KnowledgeBaseNative plugin] --> KI[KnowledgeBaseInterface]
    KN --> J[Julia code]
    KI --> C
    O1[Text plugin] --> OI[OutputInterface]
    OI --> C
    O2[JSON plugin] --> OI
    O3[HTML plugin] --> OI
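As a rough illustration of the micro-kernel pattern, the following Python sketch shows a core that only knows about reader and writer interfaces, with concrete plugins registered from the outside. It is a conceptual sketch under assumed names, not the library's actual Julia implementation: `Core`, `register_reader`, `register_writer` and the plugin functions are all hypothetical.

```python
import csv
import io
import json


class Core:
    """Hypothetical core: depends on interfaces only, never on a concrete plugin."""

    def __init__(self):
        self.readers = {}   # format name -> function(text) -> list of dict rows
        self.writers = {}   # format name -> function(report) -> str

    def register_reader(self, fmt, reader):
        self.readers[fmt] = reader

    def register_writer(self, fmt, writer):
        self.writers[fmt] = writer

    def run(self, text, in_fmt, out_fmt):
        rows = self.readers[in_fmt](text)
        report = {"rows": len(rows), "columns": sorted(rows[0]) if rows else []}
        return self.writers[out_fmt](report)


# Plugins: each one implements a single interface and can be swapped freely.
def csv_reader(text):
    return list(csv.DictReader(io.StringIO(text)))


def text_writer(report):
    return f"{report['rows']} rows, columns: {', '.join(report['columns'])}"


def json_writer(report):
    return json.dumps(report, sort_keys=True)


core = Core()
core.register_reader("csv", csv_reader)
core.register_writer("text", text_writer)
core.register_writer("json", json_writer)

text_out = core.run("a,b\n1,2\n3,4\n", "csv", "text")
json_out = core.run("a,b\n1,2\n3,4\n", "csv", "json")
```

Adding a new input or output format in this sketch means registering one more function. The core itself does not change, which is the property the plugin arrows in the diagram express.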

The Core system follows a pipes & filters architecture (arrows indicate data flow):

graph LR
    D[DataInterface] --> L[LinterCore]
    C[Configuration] --> L
    K[KnowledgeBaseInterface] <--> L
    L --> O[OutputInterface]
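The data flow in the diagram can be sketched as a chain of small functions, each consuming the output of the previous one. The Python below is an illustrative sketch only: the stage functions, the rule names and the dictionary-based knowledge base are assumptions made for this example and do not reflect DataLinter's real interfaces.

```python
def load_data(source):
    """Data stage: in this sketch the source is already a list of dict rows."""
    return list(source)


def lint(rows, config, knowledge_base):
    """Linter stage: run every rule from the knowledge base that is enabled."""
    findings = []
    for name, rule in knowledge_base.items():
        if config.get(name, True):   # rules are on unless explicitly disabled
            findings.extend(rule(rows))
    return findings


def render(findings):
    """Output stage: turn the findings into plain text."""
    return "\n".join(findings) if findings else "no issues found"


def rule_missing_values(rows):
    return [f"row {i}: missing value in '{k}'"
            for i, row in enumerate(rows)
            for k, v in row.items() if v in ("", None)]


def rule_empty_dataset(rows):
    return [] if rows else ["dataset is empty"]


# Hypothetical knowledge base: a mapping from rule name to rule function.
knowledge_base = {"missing_values": rule_missing_values,
                  "empty_dataset": rule_empty_dataset}

rows = load_data([{"a": "1", "b": ""}, {"a": "2", "b": "3"}])
report = render(lint(rows, {}, knowledge_base))
silenced = render(lint(rows, {"missing_values": False}, knowledge_base))
```

Data and configuration both flow into the linting stage, which consults the knowledge base and passes its findings on to the output stage. Disabling a rule in the configuration changes the result without touching any stage.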

The modules and corresponding implementations are shown below: