Linters and configuration

Full linter catalog and configuration guide

DataLinter ships with 28 built-in linters divided into two families:

Data-only linters – work on any tabular dataset, regardless of modeling language.
R-language specific linters – understand R modeling functions (lm, glm, glmmTMB, …) and their statistical assumptions.

All linters are disabled by default. You enable them in a config.toml configuration file, which contains three sections:

[experiment] contains general context about the experiment
[linters] explicitly enables or disables individual linters; a linter runs only when it is enabled in this section
[parameters] sets individual linter parameters. Parameter names correspond to the keyword argument names of the functions that implement the linters.

Quick Configuration Example

[experiment]
    name = "My glm model"
    target_variable = 2  # column index of target variable in the dataset
[linters]
    # Enable only what you need
    large_outliers = true
    many_missing_values = true
    imbalanced_target_variable = true
    vif_colinearity = true
    R_glm_modelling = true

[parameters]
    [parameters.large_outliers]
        tukey_fences_k = 10  # larger values flag fewer points as outliers
    [parameters.many_missing_values]
        threshold = 0.9  # fraction of values in a column that must be missing for the linter to trigger
    [parameters.imbalanced_target_variable]
        threshold = 0.1
    [parameters.vif_colinearity]
        vif_threshold = 20.0
    [parameters.R_glm_modelling]
        # no parameters

Full example configs are in the config folder.

Data-only linters

Linter	Description	Typical Context	Key Parameters (see config/)
`datetime_as_string`	Checks if dates are wrongly encoded as strings	Any tabular data	`match_perc`
`tokenizable_string`	Checks whether string values can be split into tokens	Text / categorical columns	`min_tokens`
`number_as_string`	Checks whether string column can be converted to numbers	Numeric data stored as text	`match_perc`
`zipcodes_as_values`	Checks whether values correspond to Zip/postal codes	Location columns	`zipcodes`, `match_perc`
`large_outliers`	Detects large outliers (Tukey’s fences)	Numerical features	`tukey_fences_k`
`int_as_float`	Checks floating-point values that could be integers	Numerical columns	-
`enum_detector`	Detects columns that are actually enumerations	Categorical data	`distinct_ratio`, `distinct_max_limit`
`uncommon_list_lengths`	Checks columns containing lists of varying lengths	List / nested data	-
`duplicate_examples`	Finds identical duplicate rows	Any dataset	-
`empty_example`	Detects completely empty rows	Any dataset	-
`uncommon_signs`	Flags numerical columns in which very few values have the opposite sign	Signed numeric data	-
`long_tailed_distrib`	Detects long-tailed distributions	Numerical features	`drop_proportion`, `zscore_multiplier`
`circular_domain`	Identifies circular data (hours, degrees, etc.)	Angular / periodic data	-
`many_missing_values`	Warns about columns with high missingness	Any dataset	`threshold`
`negative_values`	Checks for negative values in a column	Count / amount columns	-
`imbalanced_target_variable`	Detects imbalanced target classes	Classification targets	`threshold`
`vif_colinearity`	Detects high multicollinearity using the variance inflation factor (VIF)	Numerical data	`vif_threshold`
`cnc_colinearity`	Detects high multicollinearity using condition number analysis	Numerical data	`cnc_threshold`
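
As an illustration of what the `tukey_fences_k` parameter of `large_outliers` controls, here is a minimal Python sketch of Tukey's fences. It is not DataLinter's implementation: the function name `tukey_outliers`, the quantile method, and the sample data are assumptions chosen for the example.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Illustrative only; the quartile convention ("inclusive") is an assumption.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 200]

mild = tukey_outliers(data, k=1.5)   # conventional fences
strict = tukey_outliers(data, k=10)  # much wider fences
print(mild, strict)
```

With these numbers the quartiles are 3.5 and 8.5, so k = 1.5 flags both 30 and 200, while k = 10 flags only 200. Wider fences therefore report only the most extreme values.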

R language specific linters

Linter	Description	Model Context	Key Parameters (see config/)
`R_imbalanced_target_variable`	Checks target variable imbalance in any regression function with a formula	Regression algorithms	`threshold`
`R_glmmTMB_binomial_modelling`	Validates link parameter for binomial family in `glmmTMB`	glmmTMB binomial	`acceptable_link_values`
`R_glm_modelling`	Ensures correct target variable values and family agreement in `glm`	Logistic regression	-
`R_colinearity_with_target`	Detects whether any predictor variable is highly collinear with the target	Regression algorithms	`threshold`, `algorithms`
`R_sample_size_adequacy`	Checks that the ratio of observations to predictors is high enough for stable estimates	Regression algorithms	`epv_threshold`, `algorithms`
`R_variables_present_in_data`	Checks that variables present in the formula are also present in the data as columns	Regression algorithms	-
`R_high_cardinality_categoricals`	Checks for categorical predictors with too many unique levels relative to sample size	Regression algorithms	`threshold`, `algorithms`
`R_numeric_scale_imbalance`	Detects numeric predictors with vastly different magnitudes/scales	Regression algorithms	`numeric_scale_threshold`
`R_near_zero_variance_predictors`	Flags numeric predictors with near-zero variance (using relative variance thresholds)	Regression algorithms	`variance_threshold`, `algorithms`
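
The `epv_threshold` parameter of `R_sample_size_adequacy` suggests an events-per-variable (EPV) style check. The Python sketch below shows one common way to compute EPV for a binary target. It is an assumption about the general idea, not a description of how DataLinter computes it; the function names and the default threshold of 10 (a widely quoted rule of thumb) are hypothetical.

```python
def events_per_variable(target, n_predictors):
    """EPV for a binary target: size of the rarer class divided by the
    number of predictors. Illustrative definition, assumed for this sketch."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    positives = sum(1 for y in target if y == 1)
    negatives = len(target) - positives
    return min(positives, negatives) / n_predictors


def sample_size_adequate(target, n_predictors, epv_threshold=10.0):
    """True when EPV reaches the threshold (the default is a rule of thumb,
    not DataLinter's documented default)."""
    return events_per_variable(target, n_predictors) >= epv_threshold


# 200 observations, 30 of them events, modeled with 5 predictors.
target = [1] * 30 + [0] * 170
epv = events_per_variable(target, n_predictors=5)
adequate = sample_size_adequate(target, n_predictors=5)
print(epv, adequate)
```

Here the rarer class has 30 observations, so five predictors give an EPV of 6.0, below the assumed threshold of 10; the same data with three predictors would reach exactly 10.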