DataLinter.DataLinter - Module

A data linter developed at the Vrije Universiteit Brussel, 2024.

I. Architecture (dataflow diagram):

                                        .---------------.
       (knowledge) -------------------->|  KB INTERFACE |
                                        '---------------'
                                                 ^
                                                (3)
                                                 v
                  .----------------.        .---------.        .-----------------.
       (data) --> | DATA INTERFACE | -(1)-> | LINTER  | -(4)-> |OUTPUT INTERFACE | --> (output)
                  '----------------'        '---------'        '-----------------'
                                                 ^
       (config) --------------(2)----------------'

II. Functional components:

• KB INTERFACE (`src/kb*.jl`)
  - handles communication with the knowledge base
  Note: at this point the knowledge, i.e. the data linters, is embedded in the code

• DATA INTERFACE (`src/data.jl`)
  - models types of 'data contexts' = 'data' + 'metadata' + 'information' about where/when the data exists
     (e.g. a context could contain data and the snippet of code that is executed over the data)
  - the 'context' also influences how and which linters are applied to the data

• OUTPUT INTERFACE (`src/output.jl`)
  - contains all code related to exporting or printing linting output and displaying statistics

• LINTER (`src/linter.jl`)
  - functional core of the system
  - it is a loop over linters × variables that applies each linter to variables/sets of variables
    (depending on context) and generates results
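
  The loop described above can be sketched as follows. This is a minimal,
  self-contained illustration and not the code in `src/linter.jl`: `ToyLinter`,
  `toy_lint`, `has_negative` and `has_empty_string` are hypothetical names, and the
  convention that a check returns `true` when its condition is detected (and
  `nothing` when it does not apply to a column) is an assumption of this sketch.

```julia
# Hypothetical stand-in for a linter: a name plus a predicate over one column.
struct ToyLinter
    name::Symbol
    f::Function   # returns true/false, or nothing when the check does not apply
end

# Two illustrative checks (assumed for this sketch, not DataLinter's own linters).
has_negative(col) = eltype(col) <: Real ? any(<(0), col) : nothing
has_empty_string(col) = eltype(col) <: AbstractString ? any(isempty, col) : nothing

# Core loop: apply every linter to every named column and collect the results.
function toy_lint(columns::AbstractDict, linters)
    results = Pair{Tuple{Symbol, String}, Union{Nothing, Bool}}[]
    for linter in linters, (colname, col) in columns
        push!(results, (linter.name, "column: $colname") => linter.f(col))
    end
    return results
end

columns = Dict("x1" => [1.0, -2.0, 3.0], "x2" => ["a", "", "c"])
linters = [ToyLinter(:has_negative, has_negative),
           ToyLinter(:has_empty_string, has_empty_string)]
results = toy_lint(columns, linters)
```

  Each result pairs a (linter, location) tuple with `true`, `false` or `nothing`,
  which mirrors the shape of the `lint` output shown in the examples further down.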

III. Inputs and outputs:

• data
  - at this point only '.CSV' files are supported
  - the internal representation supports the `Tables` interface

• config
  - holds the configuration of the linter
  - a '.TOML' file that should be self-explanatory
  - option names for linter parameters are also keyword argument names in the code

• knowledge
  - knowledge relevant for the functioning of the data linter
  - currently all knowledge is present in `src/kb*.jl` in the form of data structures and
    throughout the code as functions
    Note: this will change over time

• output
  - what the user receives from the linter

IV. Internal data transfer objects (DTOs):

• (1) - data context object, i.e. data or data + code
• (2) - linter configuration information
• (3) - knowledge, i.e. linters, applicability conditions etc.
• (4) - linting output, i.e. linters/context, output, data stats etc.
DataLinter.version - Method
version()

Returns the current DataLinter version using Project.toml and git. If Project.toml and git are not available, the version defaults to an empty string.

DataLinter.LinterCore.lint - Method
lint(ctx::AbstractDataContext, kb::Union{Nothing, AbstractKnowledgeBase}; config=nothing, debug=false, linters=["all"])

Main linting function. Lints the data provided by ctx using knowledge from kb. A configuration for the available linters can be provided in config. If debug=true, performance information for each linter is shown. By default, all available linters are used.

DataLinter.LinterCore.load_config - Method
load_config(configpath::AbstractString)

Loads a linting configuration file located at configpath. The configuration file contains options regarding which linters are enabled and linter parameter values.

Examples

julia> using DataLinter
       using Pkg
       configpath = joinpath(dirname((Pkg.project()).path), "config", "default.toml")
       DataLinter.LinterCore.load_config(configpath)
Dict{String, Any} with 2 entries:
  "parameters" => Dict{String, Any}("uncommon_signs"=>Dict{String, Any}(), "enum_detector"=>Dict{String, Any}("distinct_max_limit"=>5, "distinct_ratio"=>0.001), "empty_example"=>Dict{String, Any}(), "negative_…
  "linters"    => Dict{String, Any}("uncommon_signs"=>true, "enum_detector"=>true, "empty_example"=>true, "negative_values"=>true, "tokenizable_string"=>true, "number_as_string"=>true, "int_as_float"=>true, "l…
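
The dictionary above suggests a configuration file with a `linters` table of on/off switches and a `parameters` table holding one sub-table per linter. The sketch below builds such a fragment by hand and parses it with Julia's standard `TOML` library. The key names and the `enum_detector` values are taken from the output above; the selection of keys, the `false` value for `uncommon_signs`, and the assumption that `false` disables a linter are illustrative only, and `load_config` itself may process the file differently.

```julia
using TOML  # Julia standard library

# Illustrative fragment only; the shipped config/default.toml contains more entries.
config_text = """
[linters]
enum_detector = true
negative_values = true
uncommon_signs = false

[parameters.enum_detector]
distinct_max_limit = 5
distinct_ratio = 0.001
"""

config = TOML.parse(config_text)

# Names of the linters switched on in this fragment.
enabled = sort([name for (name, on) in config["linters"] if on])

# Parameter values for one linter; per the module notes, option names are also
# the keyword argument names used in the code.
enum_params = config["parameters"]["enum_detector"]
```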
DataLinter.DataInterface.build_data_context - Method
build_data_context(;data=nothing, code=nothing)

Builds a data context object using data and, if available, code. The data context represents the context in which the linter runs: the data it lints and, optionally, the code associated with the data, i.e. some algorithm that will be applied to that data.

Examples

julia> using DataLinter
       ncols, nrows = 3, 10
       data = [rand(nrows) for _ in 1:ncols]
       ctx = DataLinter.build_data_context(data)
SimpleDataContext 0.00040435791015625 MB of data

julia> kb = DataLinter.kb_load("")
       DataLinter.LinterCore.lint(ctx, kb)
38-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
         (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
         (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
         (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
         (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing
         ...
DataLinter.LinterCore.process_output - Method
process_output(lintout; buffer=stdout, show_stats=false, show_passing=false, show_na=false)

Processes linting output for display. The function takes the linter output lintout and prints lints to buffer. When set to true, show_stats, show_passing and show_na additionally print statistics over the checks, the checks that pass, and the checks that could not be applied, respectively.
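
To make the effect of the three flags concrete, here is a minimal, self-contained sketch of this kind of reporting. It is not the implementation in `src/output.jl`: `toy_process_output` is a hypothetical function, the output wording is invented, and the convention that `true` means a passing check, `false` an issue and `nothing` a check that could not be applied is an assumption of this sketch.

```julia
# Hypothetical re-implementation for illustration; not DataLinter's process_output.
# Assumed convention: true = check passed, false = issue found, nothing = not applicable.
function toy_process_output(lintout; buffer=stdout, show_stats=false,
                            show_passing=false, show_na=false)
    n_pass = n_fail = n_na = 0
    for ((linter, location), result) in lintout
        if result === nothing
            n_na += 1
            show_na && println(buffer, "n/a   $linter ($location)")
        elseif result
            n_pass += 1
            show_passing && println(buffer, "pass  $linter ($location)")
        else
            n_fail += 1
            println(buffer, "FAIL  $linter ($location)")  # issues are always printed
        end
    end
    show_stats && println(buffer, "$n_fail failed, $n_pass passed, $n_na not applicable")
    return nothing
end

# Hand-written results in the same (linter, location) => result shape as the lint output.
lintout = [("int_as_float", "column: x1") => false,
           ("negative_values", "column: x1") => true,
           ("datetime_as_string", "column: x1") => nothing]

io = IOBuffer()
toy_process_output(lintout; buffer=io, show_stats=true)
report = String(take!(io))
```

With only `show_stats=true`, the report contains the failing check and the summary line; passing and non-applicable checks are counted but not listed.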
