Introduction
DataLinter is a library for contextual linting of data and code. Its development started as a Julia rewrite of a data linter originally written at Google. The aim of the redesign is to provide a richer and faster experience while keeping the baseline benefits outlined in the original paper.
Its main idea is that providing additional context leads to the detection of more complex issues relating to data and code quality. These can arise from the structure of the data as well as from algorithmic or parameter choices.
Context here simply means additional information pertinent to the use of the data, available at runtime. For example, the classical way of linting a dataset uses no prior information about what the data will be used for, so any assumptions about its use remain implicit. Context in this case could be the type of analysis or modelling the data is used for (e.g. classification), or the code, in a given programming language, that uses the data. This allows much more flexibility in the types of checks that can be implemented.
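To make the idea of context concrete, here is a minimal sketch in plain Julia. It does not use DataLinter's API: `LintContext`, `constant_columns` and `imbalanced_target` are hypothetical names invented for illustration. The first check needs no context, while the second only applies when the context says the data is used for classification.

```julia
# Hypothetical sketch: none of these names come from DataLinter's API.

# Context: extra information about how the data will be used.
struct LintContext
    analysis_type::Symbol   # e.g. :classification or :regression
    target::Symbol          # name of the target column
end

# Context-free check: flag columns that hold a single distinct value.
function constant_columns(data::NamedTuple)
    return [name for (name, col) in pairs(data) if length(unique(col)) == 1]
end

# Context-dependent check: class imbalance is only meaningful for classification.
function imbalanced_target(data::NamedTuple, ctx::LintContext; max_ratio=5.0)
    ctx.analysis_type == :classification || return false
    labels = data[ctx.target]
    counts = [count(==(l), labels) for l in unique(labels)]
    return maximum(counts) / minimum(counts) > max_ratio
end

# A small in-memory dataset, stored as a NamedTuple of columns.
data = (age = [31, 45, 27, 52, 38, 29, 61],
        country = ["NL", "NL", "NL", "NL", "NL", "NL", "NL"],
        churned = [0, 0, 0, 0, 0, 0, 1])

constant_columns(data)                                           # no context needed
imbalanced_target(data, LintContext(:classification, :churned))  # check applies
imbalanced_target(data, LintContext(:regression, :churned))      # check is skipped
```

The same dataset yields different findings depending on the declared use, which is the flexibility that context is meant to provide.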
Features
Features at a glance:
- 28 data+code linters (including the Google linters)
- Docker image with compiled binaries, production ready
- Zero-config CLI and HTTP server modes
- CSV/Parquet/Arrow dataset support
- Text / JSON / HTML output support
- Flexible code querying through ParSitter.jl
- First-class R language support through tree-sitter-based code parsing
- Fully customizable rule engine (see configuration docs)
Installation
There are several ways to install DataLinter:
- pulling a Docker image from the GitHub Container Registry (quick & safe)
- downloading binaries (Linux only)
- cloning the GitHub repository (for development of the library)
Docker image
The latest Docker image can be downloaded with
docker pull ghcr.io/zgornel/datalinter-compiled:latest
For specific versions, use
docker pull ghcr.io/zgornel/datalinter-compiled:v0.x.y
Available packages (Docker images) can be viewed in the 'Packages' section of the repository's GitHub page.
Binaries
The CLI and server binaries (linux-x86-64) can be downloaded from the releases page. Each release contains an Assets section with the binaries as datalinter-compiled-binary.zip.
Julia
Installation can also be performed from the Julia REPL with
using Pkg; Pkg.add(url="https://github.com/zgornel/DataLinter")
The repository can also be directly cloned with
git clone https://github.com/zgornel/DataLinter
Architecture
The diagram below shows the current architecture, which can also be found on the wiki.
- The full system follows a micro-kernel pattern: core system + plugins (arrows indicate dependencies):
- The Core system follows a pipes & filters architecture (arrows indicate data flow):
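As a rough illustration of the pipes & filters pattern mentioned above, the sketch below chains three small filters with Julia's `|>` operator. The stage names (`load_data`, `run_checks`, `render_text`) are hypothetical and chosen only for this example; the actual Core modules and their boundaries may differ.

```julia
# Hypothetical stage names: DataLinter's actual Core modules may differ.

# Filter 1: wrap the raw columns in a state object.
load_data(raw) = (columns = raw,)

# Filter 2: run a single toy check and attach the issues to the state.
function run_checks(state)
    issues = String[]
    for (name, col) in pairs(state.columns)
        length(unique(col)) == 1 && push!(issues, "column $(name) is constant")
    end
    return merge(state, (issues = issues,))
end

# Filter 3: turn the collected issues into plain-text output.
render_text(state) = join(state.issues, "\n")

# The pipe: each filter consumes the output of the previous one.
lint(raw) = raw |> load_data |> run_checks |> render_text

report = lint((x = [1, 2, 3], y = ["a", "a", "a"]))
println(report)
```

Because each filter only depends on the shape of its input, a stage such as the output renderer can be swapped (for example text for JSON) without touching the others.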
The modules and corresponding implementations are shown below: