Usage examples
A simple tutorial
First, generate some random data:
julia> using DataLinter
julia> ncols, nrows = 3, 10
(3, 10)
julia> data = [rand(nrows) for _ in 1:ncols]
3-element Vector{Vector{Float64}}:
 [0.058905142114214715, 0.7700524113424013, 0.8757806726050036, 0.961684263113866, 0.8233468130984755, 0.9953108181804728, 0.10069059652869794, 0.3670831500665196, 0.5817684090820441, 0.2205375503265271]
 [0.2946939410569388, 0.24681428749348555, 0.18128283668290257, 0.36883560918641944, 0.5355467572988539, 0.06478898995020843, 0.960092687490906, 0.6174392354811713, 0.8004342078791402, 0.8831793790407333]
 [0.21359851145034892, 0.0802497841745472, 0.6116224838370506, 0.7710978721581587, 0.691662470984565, 0.27836611086273255, 0.9485670477681883, 0.2016617451177959, 0.20554770604500505, 0.010909987263765797]
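Because the data above is random, the values differ on every run. A minimal sketch of a reproducible variant follows, assuming only the Random standard library; the seed 42 and the name rng are arbitrary choices for illustration and are not part of DataLinter.

```julia
using Random

ncols, nrows = 3, 10

# A seeded generator makes the "random" columns identical across runs.
rng = MersenneTwister(42)

# Same shape as in the tutorial: a vector of ncols columns with nrows values each.
data = [rand(rng, nrows) for _ in 1:ncols]
```

The resulting data has the same layout as in the tutorial, so it should be usable in the steps that follow in the same way.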
Then, generate a context object:
julia> ctx = DataLinter.build_data_context(data)
SimpleDataContext 0.00040435791015625 MB of data
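The size shown for the context is easier to read in bytes: 0.00040435791015625 MB is exactly 424 bytes when 1 MB is taken as 1024^2 bytes. The sketch below shows that conversion. The helper bytes_to_mb is a hypothetical name, and the use of Base.summarysize is only an assumption about how such a figure could be obtained, not a statement about how DataLinter computes it.

```julia
# Hypothetical helper: convert a byte count to MB, with 1 MB = 1024^2 bytes.
bytes_to_mb(nbytes) = nbytes / 1024^2

# The figure printed for the context corresponds to 424 bytes.
context_mb = bytes_to_mb(424)   # 0.00040435791015625

# Assumption: Base.summarysize gives a comparable in-memory size estimate.
# Its result depends on the Julia version, so no expected value is given.
data = [rand(10) for _ in 1:3]
approx_mb = bytes_to_mb(Base.summarysize(data))
```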
Context objects are the main inputs to the linter, together with a knowledge base and the configuration.
At this point the knowledge base is not used, so it can be loaded from an empty path or replaced by nothing.
julia> kb = DataLinter.kb_load("") # raises Warning
KnowledgeBase with 9.1552734375e-5 MB of data
julia> lintout = DataLinter.LinterCore.lint(ctx, kb)
41-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x3") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x1") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x2") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x3") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x1") => nothing
 (Linter (name=zipcodes_as_values, f=is_zipcode), "column: x2") => nothing
 ⋮
 (Linter (name=circular_domain, f=has_circular_domain), "column: x2") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x3") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x1") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x2") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x3") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x1") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x2") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x3") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x1") => 1
julia> lintout = DataLinter.LinterCore.lint(ctx, nothing) # also works
41-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x3") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x1") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x2") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x3") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x1") => nothing
 (Linter (name=zipcodes_as_values, f=is_zipcode), "column: x2") => nothing
 ⋮
 (Linter (name=circular_domain, f=has_circular_domain), "column: x2") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x3") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x1") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x2") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x3") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x1") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x2") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x3") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x1") => 1
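The result is a plain vector of pairs, each mapping a (linter, location) tuple to nothing, true, or false, so it can be inspected with ordinary Julia code. The sketch below tallies the outcomes by value. It runs on a stand-in vector in which linters are represented by their names as strings and the outcome values are invented for illustration; tally_outcomes is a hypothetical helper, not part of DataLinter.

```julia
# Stand-in for the lint output: same Pair layout, but linters are plain strings
# and the outcomes are made up for illustration.
results = Pair{Tuple{String, String}, Union{Nothing, Bool}}[
    ("datetime_as_string", "column: x1") => nothing,
    ("number_as_string", "column: x1") => nothing,
    ("negative_values", "column: x1") => true,
    ("negative_values", "column: x2") => true,
    ("negative_values", "column: x3") => true,
    ("long_tailed_distrib", "column: x1") => false,
]

# Hypothetical helper: count how many entries ended in each outcome.
function tally_outcomes(results)
    counts = Dict{Union{Nothing, Bool}, Int}()
    for entry in results
        outcome = entry.second          # nothing, true, or false
        counts[outcome] = get(counts, outcome, 0) + 1
    end
    return counts
end

counts = tally_outcomes(results)
```

Since the helper only reads the second field of each pair, it should also accept the lintout vector from the session above, assuming the output keeps the layout shown there.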
Lastly, print the output of the activated linters, i.e. the ones that found problems in the data:
julia> DataLinter.process_output(lintout)
• info (long_tailed_distrib) column: x2 the distribution for 'column: x2' has 'long tails'
• info (long_tailed_distrib) column: x3 the distribution for 'column: x3' has 'long tails'
• info (long_tailed_distrib) column: x1 the distribution for 'column: x1' has 'long tails'