Usage examples

A simple tutorial

First, generate some random data:

julia> using DataLinter
julia> ncols, nrows = 3, 10
(3, 10)

julia> data = [rand(nrows) for _ in 1:ncols]
3-element Vector{Vector{Float64}}:
 [0.6158754574940937, 0.6908536207883439, 0.8619170569929013, 0.032674693125333776, 0.8705619057098769, 0.8097959969114595, 0.8504221416490126, 0.8192913884325047, 0.020009574373516026, 0.29888088268664237]
 [0.4795835092521018, 0.4064316408592832, 0.3819276736065902, 0.13646521514576881, 0.8962829495884778, 0.08110152384479297, 0.45932297044941517, 0.2621947459820274, 0.6352971174189493, 0.4506399588398817]
 [0.5552814806046883, 0.007593029382384042, 0.4358275590388495, 0.038313642605069864, 0.5555529097328997, 0.7320662292624475, 0.04675084754291026, 0.5630098218361224, 0.6397900806335922, 0.24355677697471623]
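
Because `rand` draws fresh values on every run, the numbers shown above will differ from session to session. If reproducible input is wanted, one option is to draw from a seeded generator. The following is a minimal sketch using only the `Random` standard library; the seed value `1234` and the name `data_again` are arbitrary choices for illustration and are not part of DataLinter.

```julia
using Random

ncols, nrows = 3, 10

# A seeded generator makes the "random" columns repeatable.
# The seed value is arbitrary.
rng = MersenneTwister(1234)
data = [rand(rng, nrows) for _ in 1:ncols]

# Re-creating the generator with the same seed yields the same columns.
rng_again = MersenneTwister(1234)
data_again = [rand(rng_again, nrows) for _ in 1:ncols]

println(data == data_again)
```

The resulting `data` has the same shape as before (a vector of `ncols` vectors, each of length `nrows`), so it can be used in the remaining steps unchanged.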

Then, generate a context object:

julia> ctx = DataLinter.build_data_context(data)
SimpleDataContext 0.00040435791015625 MB of data

Context objects are the main linter inputs along with a knowledge base and the config.

Note

At this point the knowledge base is not used.

julia> kb = DataLinter.kb_load("")         # raises Warning
KnowledgeBase with 9.1552734375e-5 MB of data

julia> lintout = DataLinter.LinterCore.lint(ctx, kb)
41-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x3") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x1") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x2") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x3") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x1") => nothing
 (Linter (name=zipcodes_as_values, f=is_zipcode), "column: x2") => nothing
 ⋮
 (Linter (name=circular_domain, f=has_circular_domain), "column: x2") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x3") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x1") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x2") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x3") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x1") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x2") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x3") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x1") => 1

julia> lintout = DataLinter.LinterCore.lint(ctx, nothing) # also works
41-element Vector{Pair{Tuple{DataLinter.LinterCore.Linter, String}, Union{Nothing, Bool}}}:
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x2") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x3") => nothing
 (Linter (name=datetime_as_string, f=is_datetime_as_string), "column: x1") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x2") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x3") => nothing
 (Linter (name=tokenizable_string, f=is_tokenizable_string), "column: x1") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x2") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x3") => nothing
 (Linter (name=number_as_string, f=is_number_as_string), "column: x1") => nothing
 (Linter (name=zipcodes_as_values, f=is_zipcode), "column: x2") => nothing
 ⋮
 (Linter (name=circular_domain, f=has_circular_domain), "column: x2") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x3") => 1
 (Linter (name=circular_domain, f=has_circular_domain), "column: x1") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x2") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x3") => 1
 (Linter (name=many_missing_values, f=has_many_missing_values), "column: x1") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x2") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x3") => 1
 (Linter (name=negative_values, f=has_negative_values), "column: x1") => 1
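
As the printed type shows, `lint` returns a vector of pairs that map a (linter, location) tuple to either `nothing` or a `Bool`. Such a vector can be inspected with plain Julia before any formatted report is produced. The sketch below is illustrative only: `mock_lintout` is a hypothetical stand-in whose keys are strings rather than real `Linter` objects, so that it runs without DataLinter installed. It also assumes that `nothing` means the linter was not applicable, `true` means the check passed, and `false` means a problem was found. That reading is consistent with the output above but is an assumption, not something this page states.

```julia
# Hypothetical stand-in for the linter output: (linter name, location) => result.
# Real keys hold DataLinter.LinterCore.Linter objects; strings are used here
# so the sketch is self-contained.
mock_lintout = Pair{Tuple{String, String}, Union{Nothing, Bool}}[
    ("datetime_as_string", "column: x1") => nothing,
    ("negative_values", "column: x1") => true,
    ("large_outliers", "column: x2") => false,
    ("long_tailed_distrib", "column: x3") => false,
]

# Tally the results by outcome.
n_nothing = count(p -> p.second === nothing, mock_lintout)
n_true    = count(p -> p.second === true, mock_lintout)
n_false   = count(p -> p.second === false, mock_lintout)

# Collect the (name, location) keys whose result is `false`
# (assumed here to mean "problem found").
flagged = [p.first for p in mock_lintout if p.second === false]

println("nothing: ", n_nothing, ", true: ", n_true, ", false: ", n_false)
foreach(println, flagged)
```

The same `count` and comprehension patterns should apply to a real `lintout`, since they rely only on the `first` and `second` fields of each `Pair`.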

Lastly, one can print the output of the activated linters, i.e. the ones that found problems in the data.

julia> DataLinter.process_output(lintout)
! warning      	(large_outliers)    	column: x2           the values of 'column: x2' contain large outliers
• info         	(long_tailed_distrib)	column: x2           the distribution for 'column: x2' has 'long tails'
• info         	(long_tailed_distrib)	column: x3           the distribution for 'column: x3' has 'long tails'
• info         	(long_tailed_distrib)	column: x1           the distribution for 'column: x1' has 'long tails'