DataLinter.DataLinter — Module
A data linter developed at the Vrije Universiteit Brussel, 2024.
I. Architecture (dataflow diagram):

                                  .---------------.
(knowledge) --------------------> | KB INTERFACE  |
                                  '---------------'
                                          ^
                                         (3)
                                          v
           .----------------.        .---------.        .-----------------.
(data) --> | DATA INTERFACE | -(1)-> | LINTER  | -(4)-> |OUTPUT INTERFACE | --> (output)
           '----------------'        '---------'        '-----------------'
                                          ^
(config) --------------(2)----------------'

II. Functional components:
• KB INTERFACE (`src/kb*.jl`)
  - handles communication with the knowledge base
    Note: at this point the knowledge, i.e. the data linters, is embedded in code
• DATA INTERFACE (`src/data.jl`)
  - models types of 'data contexts' = 'data' + 'metadata' + 'information' about where/when the data exists
    (e.g. a context could contain data and the snippet of code that is executed over the data)
  - the 'context' also contributes to how/which linters are applied to the data
• OUTPUT INTERFACE (`src/output.jl`)
  - contains all code related to exporting or printing linting output and displaying statistics
• LINTER (`src/linter.jl`)
  - functional core of the system
  - it is a loop over linters × variables that applies each linter to variables/sets of variables
    (depending on context) and generates results

III. Inputs and outputs:
• data
  - at this point only '.CSV' files are supported
  - the internal representation supports the `Tables` interface
• config
  - keeps the configuration of the linter
  - a '.TOML' file that should be self-explanatory
  - option names for linter parameters are also keyword argument names in the code
• knowledge
  - knowledge relevant for the functioning of the data linter
  - currently all knowledge is present in `src/kb*.jl` in the form of data structures and
    throughout the code as functions
    Note: this will change over time
• output
  - what the user receives from the linter

IV. Internal data transfer objects (DTOs):
• (1) - data context object, i.e. data, or data + code
• (2) - linter configuration information
• (3) - knowledge, i.e. linters, applicability conditions etc.
• (4) - linting output, i.e. linters/context, output, data stats etc.

DataLinter.cli_linting_workflow — Method
Basic flow for running the linter in a command line interface environment such as a Unix shell.
DataLinter.printable_version — Method
printable_version()
Returns a pretty version string that includes the git commit and date.
DataLinter.version — Method
version()
Returns the current DataLinter version using the Project.toml and git. If the Project.toml or git is not available, the version defaults to an empty string.
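The fallback behaviour described for version() can be illustrated with a minimal sketch. It covers only the Project.toml half and ignores git entirely; `toy_version` is a hypothetical name, not part of DataLinter, and the package's actual implementation may differ.

```julia
using TOML

# Hypothetical helper (not DataLinter's implementation): read the "version"
# field from a Project.toml, falling back to an empty string when the file
# is missing or has no version entry. Git information is not consulted.
function toy_version(project_file::AbstractString)
    isfile(project_file) || return ""
    project = TOML.parsefile(project_file)
    return string(get(project, "version", ""))
end
```

Returning an empty string instead of throwing keeps callers such as a version banner usable when the package runs outside its source tree.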
DataLinter.LinterCore.applicable — Method
Function that checks whether a linter is applicable or not. The logic is that the iterable type must match and, if linter.linting_ctx==true, a linting context must exist, specified in the config, through the presence of code, or both.
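The applicability rule above can be sketched as a small predicate. Everything here is an assumption made for illustration: `ToyLinterSpec`, its field names other than `linting_ctx`, and the use of `nothing` to mean "no linting context" are not taken from the DataLinter source.

```julia
# Hypothetical stand-in for a linter description; the type is an assumption.
struct ToyLinterSpec
    iterable_type::Symbol  # kind of iterable the linter expects, e.g. :column or :row
    linting_ctx::Bool      # true when the linter requires a linting context
end

# `ctx` is `nothing` when neither the config nor accompanying code provides a context.
function toy_applicable(spec::ToyLinterSpec, iterable_type::Symbol, ctx)
    type_matches = spec.iterable_type == iterable_type
    ctx_ok = !spec.linting_ctx || ctx !== nothing
    return type_matches && ctx_ok
end
```

For example, a spec that requires a context is rejected when `ctx` is `nothing`, even if the iterable type matches.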
DataLinter.LinterCore.build_linting_context — Method
Function that builds a LintingContext from a linter configuration.
DataLinter.LinterCore.build_linting_context — Method
Function that builds a LintingContext from code and a code query.
DataLinter.LinterCore.lint — Method
lint(data_ctx::AbstractDataContext, kb::Union{Nothing, AbstractKnowledgeBase}; config=nothing, debug=false, linters=["all"])
Main linting function. Lints the data provided by data_ctx using knowledge from kb. A configuration for the available linters can be provided in config. If debug=true, performance information for each linter is shown. By default, all available linters are used.
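To make the "loop over linters × variables" idea concrete, here is a self-contained toy version that does not use DataLinter at all. `ToyLinter`, `toy_lint` and the rule used in the usage note are hypothetical; the only thing borrowed from this page is the shape of the result, a vector of pairs whose value is a `Bool` or `nothing` when a linter could not be applied.

```julia
# Hypothetical stand-ins; none of these names come from DataLinter itself.
struct ToyLinter
    name::String
    f::Function        # column -> Bool, the outcome of the check
    applies::Function  # column -> Bool, whether the check can run on this column
end

# Apply every linter to every column (in sorted column order) and collect
# (linter name, column name) => outcome pairs; `nothing` marks "not applicable".
function toy_lint(columns::AbstractDict, linters::Vector{ToyLinter})
    results = Pair{Tuple{String,String},Union{Nothing,Bool}}[]
    for linter in linters, name in sort!(collect(keys(columns)))
        col = columns[name]
        outcome = linter.applies(col) ? linter.f(col) : nothing
        push!(results, (linter.name, name) => outcome)
    end
    return results
end
```

A possible call would be `toy_lint(Dict("x1" => [1.0, -2.0], "x2" => ["a", "b"]), [ToyLinter("non_negative", c -> all(>=(0), c), c -> eltype(c) <: Real)])`, which checks the numeric column and skips the string column.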
DataLinter.LinterCore.reconcile_contexts — Method
reconcile_contexts(code_ctx, config_ctx)
Function that reconciles contexts obtained from code and from the configuration .toml file. The basic approach is to take all available data from code_ctx and, where it is not available, fill it in from config_ctx.
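The "prefer code, fall back to config" rule can be sketched with plain dictionaries. This is only an illustration under assumptions: the real contexts are DataLinter types rather than `Dict`s, and the key names in the usage note are invented.

```julia
# Hypothetical sketch: contexts are modelled as String-keyed dictionaries
# in which a missing key or a `nothing` value means "not available".
function toy_reconcile(code_ctx::AbstractDict, config_ctx::AbstractDict)
    merged = Dict{String,Any}()
    all_keys = union(collect(keys(code_ctx)), collect(keys(config_ctx)))
    for key in all_keys
        from_code = get(code_ctx, key, nothing)
        # Take the value found in code when present, otherwise the config value.
        merged[key] = from_code === nothing ? get(config_ctx, key, nothing) : from_code
    end
    return merged
end
```

With `code = Dict("analysis_type" => "classification", "target" => nothing)` and `config = Dict("analysis_type" => "regression", "target" => "y")`, the merged context keeps `"classification"` from the code and takes `"y"` from the config.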
DataLinter.LinterCore.get_experiment_parameters — Method
Function that reads linter configuration parameters.
DataLinter.LinterCore.get_linter_kwargs — Method
Function that reads linter configuration parameters.
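Because option names in the configuration are also keyword argument names in the code, a parsed parameter table can be splatted directly into a linter call. The sketch below assumes a String-keyed dictionary like the "parameters" entries shown in the load_config example; `toy_enum_detector` and `to_kwargs` are hypothetical, and the rule inside the function is invented purely for illustration.

```julia
# Hypothetical linter function; only the keyword names mirror the options
# listed for "enum_detector" in the load_config example output.
function toy_enum_detector(values; distinct_max_limit=5, distinct_ratio=0.001)
    n_distinct = length(unique(values))
    return n_distinct <= distinct_max_limit ||
           n_distinct / length(values) <= distinct_ratio
end

# Turn a String-keyed parameter table into keyword arguments (a NamedTuple).
to_kwargs(params::AbstractDict) = (; (Symbol(k) => v for (k, v) in params)...)
```

A call then looks like `toy_enum_detector(column; to_kwargs(params)...)`, so renaming an option in the config without renaming the keyword argument would raise a MethodError rather than being silently ignored.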
DataLinter.LinterCore.linter_is_enabled — Method
Function that returns whether a linter is enabled in the config or not.
DataLinter.LinterCore.load_config — Method
load_config(configpath::AbstractString)
Loads a linting configuration file located at configpath. The configuration file contains options regarding which linters are enabled and linter parameter values.
Examples
julia> using DataLinter
using Pkg
configpath = joinpath(dirname((Pkg.project()).path), "config", "default.toml")
DataLinter.LinterCore.load_config(configpath)
Dict{String, Any} with 2 entries:
"parameters" => Dict{String, Any}("uncommon_signs"=>Dict{String, Any}(), "enum_detector"=>Dict{String, Any}("distinct_max_limit"=>5, "distinct_ratio"=>0.001), "empty_example"=>Dict{String, Any}(), "negative_…
"linters" => Dict{String, Any}("uncommon_signs"=>true, "enum_detector"=>true, "empty_example"=>true, "negative_values"=>true, "tokenizable_string"=>true, "number_as_string"=>true, "int_as_float"=>true, "l…

DataLinter.DataInterface.build_data_context — Method
build_data_context(;data=nothing, code=nothing)
Builds a data context object using data and code, if available. The data context represents the context in which the linter runs: the data it lints and, optionally, the code associated with the data, i.e. some algorithm that will be applied to that data.
Examples
julia> using DataLinter
ncols, nrows = 3, 10
data = [rand(nrows) for _ in 1:ncols]
ctx = DataLinter.build_data_context(data)
SimpleDataContext 0.00040435791015625 MB of data
julia> kb = DataLinter.kb_load("")
DataLinter.LinterCore.lint(ctx, kb)
38-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
(Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
(Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
(Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
(Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing
...

DataLinter.LinterCore.build_data_iterator — Method
Function that returns a data structure amenable for use in the data linters. It contains a row iterator, a column iterator, and metadata.
DataLinter.KnowledgeBaseInterface.kb_load — Function
Loads a knowledge base.
DataLinter.KnowledgeBaseInterface.kb_query — Function
Runs a query over a knowledge base.
DataLinter.OutputInterface.WARN_LEVEL_TO_NUM — Constant
Structure that maps a warning level to a numeric value. This can be used to obtain a numeric estimate of the issues in a dataset.
DataLinter.LinterCore.process_output — Method
process_output(lintout; buffer=stdout, show_stats=false, show_passing=false, show_na=false)
Processes linting output for display. The function takes the linter output lintout and prints lints to buffer. When show_stats, show_passing, or show_na is set to true, the function additionally prints statistics over the checks, the checks that passed, and the checks that could not be applied, respectively.
DataLinter.OutputInterface.score — Method
Returns a score corresponding to the severity of the issues found in the dataset. The score is based on the WARN_LEVEL_TO_NUM mapping.
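One way such a score could be derived from a level-to-number mapping is a plain sum over the warning levels of the reported issues. The sketch below is an assumption throughout: the level names, the numeric weights, the input format and the use of a sum are all invented for illustration, since the actual contents of WARN_LEVEL_TO_NUM and the aggregation used by score are not shown here.

```julia
# Hypothetical mapping; the real WARN_LEVEL_TO_NUM keys and values may differ.
const TOY_WARN_LEVEL_TO_NUM = Dict("info" => 0, "warning" => 1, "important" => 3)

# Sum the numeric weight of every reported warning level; an empty report scores 0.
toy_score(levels) = sum(TOY_WARN_LEVEL_TO_NUM[level] for level in levels; init=0)
```

Under these made-up weights, a dataset with one "warning" and one "important" issue would score higher than one with several "info" notices, which is the kind of ordering a severity score is meant to provide.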