DataLinter.DataLinter
— Module
A data linter developed at the Vrije Universiteit Brussel, 2024.
I. Architecture (dataflow diagram):
                                  .--------------.
(knowledge) --------------------->| KB INTERFACE |
                                  '--------------'
                                         ^
                                        (3)
                                         v
           .----------------.        .--------.        .------------------.
(data) --> | DATA INTERFACE | -(1)-> | LINTER | -(4)-> | OUTPUT INTERFACE | --> (output)
           '----------------'        '--------'        '------------------'
                                         ^
(config) --------------(2)---------------'
II. Functional components:
• KB INTERFACE (`src/kb*.jl`)
- handles communication with the knowledge base
Note: at this point the knowledge, i.e. the data linters, is embedded in code
• DATA INTERFACE (`src/data.jl`)
- models types of 'data contexts' = 'data' + 'metadata' + 'information' about where/when the data exists
(e.g. a context could contain data and the snippet of code that is executed over the data)
- the 'context' also contributes to how/which linters are applied to the data
• OUTPUT INTERFACE (`src/output.jl`)
- contains all code related to exporting or printing linting output and displaying statistics
• LINTER (`src/linter.jl`)
- functional core of the system
- it is a loop over linters × variables that applies each linter to variables/sets of variables
(depending on context) and generates results
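The loop described for the LINTER component can be sketched as follows. This is a minimal illustration, not the code in `src/linter.jl`: the function `sketch_lint`, the two predicates, and the linter names are hypothetical, and the convention that a predicate returns `nothing` when it does not apply to a column is an assumption of this sketch.

```julia
# Hypothetical predicates: return `true`/`false` when the check applies to the
# column, and `nothing` when it does not (assumed convention for this sketch).
is_negative_free(col) = eltype(col) <: Real ? all(>=(0), col) : nothing
is_nonempty_string(col) = eltype(col) <: AbstractString ? all(!isempty, col) : nothing

# Loop over linters × variables and collect one result per (linter, column) pair.
function sketch_lint(columns::NamedTuple, linters)
    results = Pair{Tuple{Symbol, String}, Union{Nothing, Bool}}[]
    for linter in linters, (name, col) in pairs(columns)
        push!(results, (linter.name, "column: $name") => linter.f(col))
    end
    return results
end

# Toy data: one numeric column and one string column.
columns = (x1 = [1.0, -2.0, 3.0], x2 = ["a", "", "c"])

# Hypothetical linters, each a name plus a predicate.
linters = [(name = :no_negatives, f = is_negative_free),
           (name = :no_empty_strings, f = is_nonempty_string)]

results = sketch_lint(columns, linters)
```

Each entry of `results` pairs a linter and a column label with the outcome of the check, so later stages can filter on the outcome without re-running any linter.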
III. Inputs and outputs:
• data
- at this point only '.CSV' files are supported
- the internal representation supports the `Tables` interface
• config
- holds the configuration of the linter
- a '.TOML' file that should be self-explanatory
- option names for linter parameters are also keyword argument names in the code
• knowledge
- knowledge relevant for the functioning of the data linter
- currently all knowledge is present in `src/kb*.jl` in the form of data structures and
throughout the code as functions
Note: this will change over time
• output
- what the user receives from the linter
IV. Internal data transfer objects (DTOs):
• (1) - data context object i.e. data, data + code;
• (2) - linter configuration information
• (3) - knowledge i.e. linters, applicability conditions etc.
• (4) - linting output i.e. linters/context, output, data stats etc.
DataLinter.cli_linting_workflow
— Method
Basic flow for running the linter in a command line interface environment such as a Unix shell.
DataLinter.printable_version
— Method
printable_version()
Returns a pretty version string that includes the git commit and date.
DataLinter.version
— Method
version()
Returns the current DataLinter version using the `Project.toml` and `git`. If the `Project.toml`, `git` are not available, the version defaults to an empty string.
DataLinter.LinterCore.lint
— Method
lint(ctx::AbstractDataContext, kb::Union{Nothing, AbstractKnowledgeBase}; config=nothing, debug=false, linters=["all"])
Main linting function. Lints the data provided by `ctx` using knowledge from `kb`. A configuration for the available linters can be provided in `config`. If `debug=true`, performance information for each linter is shown. By default, all available linters will be used.
DataLinter.LinterCore.get_linter_kwargs
— Method
Function that reads linter configuration parameters.
DataLinter.LinterCore.linter_is_enabled
— Method
Function that returns whether a linter is enabled in the config or not.
DataLinter.LinterCore.load_config
— Method
load_config(configpath::AbstractString)
Loads a linting configuration file located at `configpath`. The configuration file contains options regarding which linters are enabled and linter parameter values.
Examples
julia> using DataLinter
using Pkg
configpath = joinpath(dirname((Pkg.project()).path), "config", "default.toml")
DataLinter.LinterCore.load_config(configpath)
Dict{String, Any} with 2 entries:
"parameters" => Dict{String, Any}("uncommon_signs"=>Dict{String, Any}(), "enum_detector"=>Dict{String, Any}("distinct_max_limit"=>5, "distinct_ratio"=>0.001), "empty_example"=>Dict{String, Any}(), "negative_…
"linters" => Dict{String, Any}("uncommon_signs"=>true, "enum_detector"=>true, "empty_example"=>true, "negative_values"=>true, "tokenizable_string"=>true, "number_as_string"=>true, "int_as_float"=>true, "l…
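The output above suggests a config with a `linters` table of on/off switches and a `parameters` table keyed by linter name. The sketch below parses a small fragment in that layout with the `TOML` standard library and turns one linter's options into keyword arguments. The helpers `sketch_is_enabled` and `sketch_kwargs` and the toy `sketch_enum_detector` are hypothetical stand-ins, not the implementations of `linter_is_enabled`, `get_linter_kwargs`, or any DataLinter linter.

```julia
using TOML  # standard library

# Config fragment mirroring the "linters"/"parameters" layout shown above.
config = TOML.parse("""
[linters]
enum_detector = true
uncommon_signs = false

[parameters.enum_detector]
distinct_max_limit = 5
distinct_ratio = 0.001
""")

# Hypothetical helper: a linter missing from the table counts as disabled here.
sketch_is_enabled(config, name) = get(get(config, "linters", Dict()), name, false)

# Hypothetical helper: option names become keyword names, ready for splatting.
function sketch_kwargs(config, name)
    params = get(get(config, "parameters", Dict()), name, Dict())
    return (; (Symbol(k) => v for (k, v) in params)...)
end

# Toy check with assumed logic; its keyword arguments match the option names.
function sketch_enum_detector(col; distinct_max_limit=10, distinct_ratio=0.01)
    n = length(unique(col))
    return n <= distinct_max_limit || n / length(col) <= distinct_ratio
end

kwargs = sketch_kwargs(config, "enum_detector")
flagged = sketch_is_enabled(config, "enum_detector") &&
          sketch_enum_detector(["a", "b", "a", "b"]; kwargs...)
```

Because option names double as keyword names, adding a parameter to a linter only requires a new keyword argument and a matching key in the TOML file.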
DataLinter.DataInterface.build_data_context
— Method
build_data_context(;data=nothing, code=nothing)
Builds a data context object using `data` and `code` if available. The data context represents a context in which the linter runs: the data it lints and, optionally, the `code` associated with the data, i.e. some algorithm that will be applied to that data.
Examples
julia> using DataLinter
ncols, nrows = 3, 10
data = [rand(nrows) for _ in 1:ncols]
ctx = DataLinter.build_data_context(data)
SimpleDataContext 0.00040435791015625 MB of data
julia> kb = DataLinter.kb_load("")
DataLinter.LinterCore.lint(ctx, kb)
38-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
(Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
(Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
(Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
(Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing
...
DataLinter.LinterCore.build_data_iterator
— Method
Function that returns a data structure amenable to use in the data linters. It contains a row iterator, a column iterator, and metadata.
DataLinter.KnowledgeBaseInterface.kb_load
— Function
Loads a knowledge base.
DataLinter.KnowledgeBaseInterface.kb_query
— Function
Runs a query over a knowledge base.
DataLinter.OutputInterface.WARN_LEVEL_TO_NUM
— Constant
Structure that maps a warning level to a numeric value. This can be used to obtain a numeric estimate of the issues in a dataset.
DataLinter.LinterCore.process_output
— Method
process_output(lintout; buffer=stdout, show_stats=false, show_passing=false, show_na=false)
Processes linting output for display. The function takes the linter output `lintout` and prints lints to `buffer`. If `show_stats`, `show_passing` and `show_na` are set to `true`, the function will print statistics over the checks, the checks that pass, and the ones that could not be applied, respectively.
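The effect of the three flags can be illustrated with a small stand-in. The function `sketch_process_output` below is hypothetical and only imitates the keyword names of `process_output`. It assumes that `false` marks an issue, `true` a passing check, and `nothing` a check that could not be applied; the message format is invented for the sketch.

```julia
# Hypothetical stand-in: prints issues always, other outcomes only on request.
function sketch_process_output(lintout; buffer=stdout, show_stats=false,
                               show_passing=false, show_na=false)
    for ((linter, location), result) in lintout
        if result === false
            println(buffer, "issue  ", linter, " @ ", location)
        elseif result === true && show_passing
            println(buffer, "ok     ", linter, " @ ", location)
        elseif result === nothing && show_na
            println(buffer, "n/a    ", linter, " @ ", location)
        end
    end
    if show_stats
        n_issues = count(p -> p.second === false, lintout)
        println(buffer, n_issues, " issue(s) in ", length(lintout), " check(s)")
    end
    return nothing
end

# Toy lint results in a (linter, location) => outcome shape.
lintout = [("no_negatives", "column: x1") => false,
           ("no_negatives", "column: x2") => nothing,
           ("no_empty_strings", "column: x2") => true]

# Write to an in-memory buffer instead of stdout so the text can be inspected.
io = IOBuffer()
sketch_process_output(lintout; buffer=io, show_stats=true)
report = String(take!(io))
```

Passing an `IOBuffer` as `buffer` keeps the report out of the terminal, which is convenient when the text has to be logged or checked in tests.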
DataLinter.OutputInterface.score
— Method
Returns a score corresponding to the severity of the issues found in the dataset. The score is based on the `WARN_LEVEL_TO_NUM` mapping.
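One way to derive a score from a level-to-number mapping is to sum the weight of every reported issue. The sketch below does that under stated assumptions: the level names and weights in `sketch_warn_level_to_num` are invented for illustration (the actual entries of `WARN_LEVEL_TO_NUM` are not listed here), and `sketch_score` is not the `score` method itself.

```julia
# Hypothetical warning levels and weights, chosen only for this example.
sketch_warn_level_to_num = Dict("info" => 0, "warning" => 1, "important" => 3)

# Sum the numeric weight of every reported issue; an empty report scores 0.
sketch_score(levels) = sum(level -> sketch_warn_level_to_num[level], levels; init=0)

issues = ["warning", "important", "warning", "info"]
total = sketch_score(issues)
```

With these assumed weights, a single "important" issue outweighs two "warning" issues, which is the kind of trade-off such a mapping makes explicit.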