
A data linter developed at the Vrije Universiteit Brussel, 2024.

I. Architecture (dataflow diagram):

                                        .---------------.
       (knowledge) -------------------->|  KB INTERFACE |
                                        '---------------'
                                               (3)
                                                |
                                                v
                  .----------------.        .---------.        .-----------------.
       (data) --> | DATA INTERFACE | -(1)-> | LINTER  | -(4)-> |OUTPUT INTERFACE | --> (output)
                  '----------------'        '---------'        '-----------------'
       (config) --------------(2)----------------'

II. Functional components:

• KB INTERFACE (`src/kb*.jl`)
  - handles communication with the knowledgebase
  Note: at this point the knowledge, i.e. the data linters, is embedded in the code

• DATA INTERFACE (`src/data.jl`)
  - models types of 'data contexts' = 'data' + 'metadata' + 'information' about where/when the data exists
     (e.g. a context could contain data and the snippet of code that is executed over the data)
  - the 'context' also contributes to how linters are applied to the data and which ones are applied

• OUTPUT INTERFACE (`src/output.jl`)
  - contains all code related to exporting or printing linting output and displaying statistics

• LINTER (`src/linter.jl`)
  - functional core of the system
  - it is a loop over linters × variables that applies each linter to variables/sets of variables
    (depending on context) and generates results
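
The loop described for the LINTER component can be sketched roughly as follows. This is a minimal illustration and not the code in `src/linter.jl`: `SketchLinter`, `sketch_lint`, `no_negatives` and `not_empty` are hypothetical names, the checks are invented for the example, and the reading of results as `true` (pass), `false` (fail) or `nothing` (not applicable) is an assumption.

```julia
# Hypothetical stand-in for a linter: a name plus a check over one column.
struct SketchLinter
    name::String
    check::Function   # assumed to return true, false or nothing
end

# Two illustrative checks; these are not DataLinter's implementations.
no_negatives(col) = eltype(col) <: Real ? all(>=(0), col) : nothing
not_empty(col) = !isempty(col)

# Apply every linter to every column and collect
# (linter name, location) => result pairs.
function sketch_lint(columns::AbstractDict, linters)
    results = Pair{Tuple{String, String}, Union{Nothing, Bool}}[]
    for linter in linters, (colname, col) in columns
        push!(results, (linter.name, "column: $colname") => linter.check(col))
    end
    return results
end

columns = Dict("x1" => [1.0, 2.0], "x2" => [-1, 3], "x3" => ["a", "b"])
linters = [SketchLinter("sketch_no_negatives", no_negatives),
           SketchLinter("sketch_not_empty", not_empty)]
results = sketch_lint(columns, linters)
```

In this sketch the result vector holds one entry per linter and column, which mirrors the shape of the pairs shown in the `lint` example further down.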

III. Inputs and outputs:

• data
  - at this point only '.CSV' files are supported
  - the internal representation supports the `Tables` interface

• config
  - keeps configuration of the linter
  - a '.toml' file that should be self-explanatory
  - option names for linter parameters are also keyword argument names in the code

• knowledge
  - knowledge relevant for the functioning of the data linter
  - currently all knowledge is present in `src/kb*.jl` in the form of data structures and
    throughout the code as functions
    Note: this will change over time

• output
  - what the user receives from the linter
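
To illustrate the config input described above, the following sketch parses a small TOML text with Julia's `TOML` standard library and passes a linter's parameters on as keyword arguments. The configuration text and the function `sketch_enum_detector` are hypothetical; only the `linters`/`parameters` layout and the option names `distinct_max_limit` and `distinct_ratio` are taken from the `default.toml` output shown in this document, and the check itself is invented.

```julia
using TOML

# Hypothetical configuration text using a "linters" / "parameters" layout.
config_text = """
[linters]
enum_detector = true
negative_values = false

[parameters.enum_detector]
distinct_max_limit = 5
distinct_ratio = 0.001
"""

config = TOML.parse(config_text)

# Hypothetical linter function: its keyword names match the option names above.
# The logic is made up for illustration only.
function sketch_enum_detector(col; distinct_max_limit=10, distinct_ratio=0.01)
    n = length(unique(col))
    return n <= distinct_max_limit || n / length(col) <= distinct_ratio
end

# Convert the parameter table (String keys) into keyword arguments (Symbol keys).
params = config["parameters"]["enum_detector"]
kwargs = Dict(Symbol(k) => v for (k, v) in params)

# Run the linter only if it is enabled in the configuration.
enabled = config["linters"]["enum_detector"]
result = enabled ? sketch_enum_detector([1, 2, 2, 3]; kwargs...) : nothing
```

Because option names and keyword names coincide, no per-linter mapping table is needed in such a design; a misspelled option would surface as an unsupported keyword error at call time.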

IV. Internal data transfer objects (DTOs):

• (1) - data context object i.e. data, or data + code
• (2) - linter configuration information
• (3) - knowledge i.e. linters, applicability conditions etc.
• (4) - linting output i.e. linters/context, output, data stats etc.

Returns the current DataLinter version, determined from Project.toml and git. If Project.toml and git are not available, the version defaults to an empty string.
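
One way such a version lookup could work is sketched below. This is not the package's implementation: `sketch_version` is a hypothetical helper, and both the `version-commit` output format and the exact fallback behaviour (empty string without a Project.toml, plain version without git) are assumptions.

```julia
using TOML

# Hypothetical helper: build a version string from the Project.toml in `dir`
# plus the short git commit hash, when git information is available.
function sketch_version(dir::AbstractString)
    project = joinpath(dir, "Project.toml")
    isfile(project) || return ""          # no Project.toml: empty string
    version = get(TOML.parsefile(project), "version", "")
    commit = try
        cmd = pipeline(`git -C $dir rev-parse --short HEAD`; stderr=devnull)
        strip(read(cmd, String))
    catch
        ""                                # git missing or not a repository
    end
    return isempty(commit) ? version : "$version-$commit"
end
```

Calling `sketch_version(pwd())` from a package checkout would return something that starts with the version declared in its Project.toml.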

lint(ctx::AbstractDataContext, kb::Union{Nothing, AbstractKnowledgeBase}; config=nothing, debug=false, linters=["all"])

Main linting function. Lints the data provided by ctx using knowledge from kb. A configuration for the available linters can be provided in config. If debug=true, performance information for each linter is shown. By default, all available linters are used.


Loads a linting configuration file located at configpath. The configuration file contains options regarding which linters are enabled and linter parameter values.


julia> using DataLinter
       using Pkg
       configpath = joinpath(dirname((Pkg.project()).path), "config", "default.toml")
Dict{String, Any} with 2 entries:
  "parameters" => Dict{String, Any}("uncommon_signs"=>Dict{String, Any}(), "enum_detector"=>Dict{String, Any}("distinct_max_limit"=>5, "distinct_ratio"=>0.001), "empty_example"=>Dict{String, Any}(), "negative_…
  "linters"    => Dict{String, Any}("uncommon_signs"=>true, "enum_detector"=>true, "empty_example"=>true, "negative_values"=>true, "tokenizable_string"=>true, "number_as_string"=>true, "int_as_float"=>true, "l…

build_data_context(;data=nothing, code=nothing)

Builds a data context object using data and, if available, code. The data context represents the context in which the linter runs: the data it lints and, optionally, the code associated with the data, i.e. some algorithm that will be applied to that data.


julia> using DataLinter
       ncols, nrows = 3, 10
       data = [rand(nrows) for _ in 1:ncols]
       ctx = DataLinter.build_data_context(data)
SimpleDataContext 0.00040435791015625 MB of data

julia> kb = DataLinter.kb_load("")
       DataLinter.LinterCore.lint(ctx, kb)
38-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
         (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
         (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
         (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
         (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing

process_output(lintout; buffer=stdout, show_stats=false, show_passing=false, show_na=false)

Processes linting output for display. The function takes the linter output lintout and prints lints to buffer. If show_stats, show_passing and show_na are set to true, the function also prints statistics about the checks, the checks that pass, and the checks that could not be applied, respectively.
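
The effect of these flags can be illustrated with a self-contained sketch. It does not reproduce the code in `src/output.jl`: `sketch_process_output` is a hypothetical function, the printed format is invented, and the interpretation of outcomes as `true` (check passes), `false` (lint reported) and `nothing` (check not applicable) is an assumption based on the `Union{Nothing, Bool}` result type shown above.

```julia
# Hypothetical stand-in: results are (linter name, location) => outcome pairs.
function sketch_process_output(results; buffer=stdout, show_stats=false,
                               show_passing=false, show_na=false)
    for ((name, location), outcome) in results
        if outcome === false
            println(buffer, "FAIL  $name ($location)")    # always printed
        elseif outcome === true && show_passing
            println(buffer, "PASS  $name ($location)")
        elseif outcome === nothing && show_na
            println(buffer, "N/A   $name ($location)")
        end
    end
    if show_stats
        nfail = count(r -> r.second === false, results)
        npass = count(r -> r.second === true, results)
        nna = count(r -> r.second === nothing, results)
        println(buffer, "failed: $nfail, passed: $npass, not applicable: $nna")
    end
    return nothing
end

# Made-up results for illustration.
results = [("negative_values", "column: x1") => true,
           ("negative_values", "column: x2") => false,
           ("datetime_as_string", "column: x1") => nothing]

# Write to an in-memory buffer instead of stdout.
io = IOBuffer()
sketch_process_output(results; buffer=io, show_stats=true)
summary_text = String(take!(io))
```

Passing an `IOBuffer` as `buffer` makes it possible to capture the report as a string, for instance to write it to a file or compare it in a test.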
