Usage examples

A simple tutorial

First, generate some random data:

julia> using DataLinter
julia> ncols, nrows = 3, 10
(3, 10)

julia> data = [rand(nrows) for _ in 1:ncols]
3-element Vector{Vector{Float64}}:
 [0.058905142114214715, 0.7700524113424013, 0.8757806726050036, 0.961684263113866, 0.8233468130984755, 0.9953108181804728, 0.10069059652869794, 0.3670831500665196, 0.5817684090820441, 0.2205375503265271]
 [0.2946939410569388, 0.24681428749348555, 0.18128283668290257, 0.36883560918641944, 0.5355467572988539, 0.06478898995020843, 0.960092687490906, 0.6174392354811713, 0.8004342078791402, 0.8831793790407333]
 [0.21359851145034892, 0.0802497841745472, 0.6116224838370506, 0.7710978721581587, 0.691662470984565, 0.27836611086273255, 0.9485670477681883, 0.2016617451177959, 0.20554770604500505, 0.010909987263765797]
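
Because `rand` draws fresh values on every call, the numbers shown above will differ from run to run. As a minimal sketch, assuming reproducible input is wanted, the same column layout can be built from a seeded generator in the `Random` standard library. The seed value 42 and the name `rng` are arbitrary choices for illustration and are not part of DataLinter.

```julia
using Random

# Seeded generator so that repeated runs produce the same columns
# (the seed 42 is an arbitrary, illustrative choice).
rng = MersenneTwister(42)

ncols, nrows = 3, 10

# One Vector{Float64} per column, each holding `nrows` samples from [0, 1).
data = [rand(rng, nrows) for _ in 1:ncols]
```

The resulting `data` has the same shape and element type as the unseeded version, so it can be used in its place in the rest of the tutorial.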

Then, generate a context object:

julia> ctx = DataLinter.build_data_context(data)
SimpleDataContext 0.00040435791015625 MB of data

Context objects are the main linter inputs along with a knowledge base and the config.

Note

At this point the knowledge base is not used.

julia> kb = DataLinter.kb_load("")         # raises Warning
KnowledgeBase with 9.1552734375e-5 MB of data

julia> lintout = DataLinter.LinterCore.lint(ctx, kb)
41-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x3") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x1") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x2") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x3") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x1") => nothing
 (Linter (name=zipcodes_as_values, f=is_zipcode), "column: x2") => nothing
 ⋮
 (Linter (name=circular_domain, f=has_circular_domain), "column: x2") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x3") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x1") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x2") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x3") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x1") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x2") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x3") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x1") => 1

julia> lintout = DataLinter.LinterCore.lint(ctx, nothing) # also works
41-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x3") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x1") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x2") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x3") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x1") => nothing
 (Linter (name=zipcodes_as_values, f=is_zipcode), "column: x2") => nothing
 ⋮
 (Linter (name=circular_domain, f=has_circular_domain), "column: x2") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x3") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x1") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x2") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x3") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x1") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x2") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x3") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x1") => 1
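
The result is an ordinary vector of pairs, so it can be inspected with plain Julia. The sketch below tallies results by their value. To stay self-contained it uses a small hand-written stand-in (`sample`) in which each linter is represented only by its name as a `String`, whereas the real keys hold `Linter` objects. The helper `tally_by_value` is hypothetical and not part of DataLinter.

```julia
# Stand-in for a lint result: ((linter name, location) => value) pairs.
# In the real output the first tuple element is a Linter object, not a String.
sample = Pair{Tuple{String, String}, Union{Nothing, Bool}}[
    ("datetime_as_string", "column: x1") => nothing,
    ("number_as_string", "column: x1") => nothing,
    ("negative_values", "column: x1") => true,
    ("negative_values", "column: x2") => true,
]

# Hypothetical helper: count how many results carry each value
# (nothing, true, or false).
function tally_by_value(results)
    counts = Dict{Union{Nothing, Bool}, Int}()
    for (_, value) in results
        counts[value] = get(counts, value, 0) + 1
    end
    return counts
end

counts = tally_by_value(sample)
```

Since the helper only looks at the value of each pair, calling `tally_by_value(lintout)` on the real output should work the same way.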

Lastly, print the output of the activated linters, i.e. the ones that found problems in the data:

julia> DataLinter.process_output(lintout)
• info         	(long_tailed_distrib)	column: x2           the distribution for 'column: x2' has 'long tails'
• info         	(long_tailed_distrib)	column: x3           the distribution for 'column: x3' has 'long tails'
• info         	(long_tailed_distrib)	column: x1           the distribution for 'column: x1' has 'long tails'