Full lint catalog & configuration guide

DataLinter ships with 28 built-in linters divided into two families:

  • Data-only linters – work on any tabular dataset, regardless of modeling language.
  • R-language specific linters – understand R modeling functions (lm, glm, glmmTMB, …) and their statistical assumptions.

All linters are disabled by default. Enable them in a config.toml file, which contains three sections:

  • experiment – general context about the experiment
  • linters – explicitly enables or disables individual linters
  • parameters – sets parameters for individual linters. Parameter names correspond to the keyword argument names of the functions implementing the linters.

Quick Configuration Example

[experiment]
    name = "My R linear model"
    target_variable = 2  # column index of target variable in the dataset
[linters]
    # Enable only what you need
    large_outliers = true
    # requires R code; checks columns for normality
    R_data_normally_distributed = true
[parameters]
    [parameters.R_data_normally_distributed]
        # p-value threshold for the normality tests; higher values
        # impose a stricter normality requirement
        pvalue_threshold = 0.1

Full example configs are in the config folder.

Data-only linters

| Linter | Description | Typical Context | Key Parameters (see config/) |
|---|---|---|---|
| datetime_as_string | Checks whether dates are wrongly encoded as strings | Any tabular data | match_perc |
| tokenizable_string | Checks whether string values can be split into tokens | Text / categorical columns | min_tokens |
| number_as_string | Checks whether a string column can be converted to numbers | Numeric data stored as text | match_perc |
| zipcodes_as_values | Checks whether values correspond to ZIP/postal codes | Location columns | zipcodes, match_perc |
| large_outliers | Detects large outliers (Tukey’s fences) | Numerical features | tukey_fences_k |
| int_as_float | Checks for floating-point values that could be integers | Numerical columns | - |
| enum_detector | Detects columns that are actually enumerations | Categorical data | distinct_ratio, distinct_max_limit |
| uncommon_list_lengths | Checks columns containing lists of varying lengths | List / nested data | - |
| duplicate_examples | Finds identical duplicate rows | Any dataset | - |
| empty_example | Detects completely empty rows | Any dataset | - |
| uncommon_signs | Flags numerical columns with very few opposite signs | Signed numeric data | - |
| long_tailed_distrib | Detects long-tailed distributions | Numerical features | drop_proportion, zscore_multiplier |
| circular_domain | Identifies circular data (hours, degrees, etc.) | Angular / periodic data | - |
| many_missing_values | Warns about columns with high missingness | Any dataset | threshold |
| negative_values | Checks for negative values in a column | Count / amount columns | - |
| imbalanced_target_variable | Detects imbalanced target classes | Classification targets | threshold |
| vif_colinearity | Detects high multicollinearity using VIF | Numerical data | vif_threshold |
| cnc_colinearity | Detects high multicollinearity using condition number analysis | Numerical data | cnc_threshold |
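
As an illustration of what a check such as large_outliers looks for, here is a minimal Python sketch of Tukey's fences: a value is flagged when it lies below Q1 − k·IQR or above Q3 + k·IQR. The function name tukey_outliers, the sample column, and the default k = 1.5 (the conventional choice) are assumptions for this example; they are not DataLinter's implementation or its default for tukey_fences_k.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


column = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up sample column
print(tukey_outliers(column))          # conventional fences, k = 1.5
print(tukey_outliers(column, k=3.0))   # wider fences, only extreme points remain
```

A larger k widens the fences, so fewer values are reported; that is the trade-off a parameter like tukey_fences_k would control.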

R language specific linters

| Linter | Description | Model Context | Key Parameters (see config/) |
|---|---|---|---|
| R_imbalanced_target_variable | Checks target variable imbalance in any regression function with a formula | Regression algorithms | threshold |
| R_glmmTMB_binomial_modelling | Validates the link parameter for the binomial family in glmmTMB | glmmTMB binomial | acceptable_link_values |
| R_data_normally_distributed | Checks normality of non-binary numeric columns or of the target in models | Regression methods | pvalue_threshold, algorithms, check_target, check_predictors |
| R_glm_binomial_modelling | Checks normality of non-binary numeric columns in binomial glm | Logistic regression | pvalue_threshold |
| R_colinearity_with_target | Detects whether any predictor variable is highly collinear with the target | Regression algorithms | threshold, algorithms |
| R_sample_size_adequacy | Checks that the number of observations and the number of predictors are in a stable ratio | Regression algorithms | epv_threshold, algorithms |
| R_variables_present_in_data | Checks that variables used in the formula are also present as columns in the data | Regression algorithms | |
| R_high_cardinality_categoricals | Checks for categorical predictors with too many unique levels relative to sample size | Regression algorithms | threshold, algorithms |
| R_numeric_scale_imbalance | Detects numeric predictors with vastly different magnitudes/scales | Regression algorithms | numeric_scale_threshold |
| R_near_zero_variance_predictors | Flags numeric predictors with near-zero variance values (using relative variance thresholds) | | variance_threshold, algorithms |
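
The parameter name epv_threshold suggests that R_sample_size_adequacy relies on an events-per-variable (EPV) style ratio; that reading is an assumption, since the table only states that observations and predictors are compared. Under that assumption, a minimal Python sketch of the EPV calculation for a binary target could look as follows. The function name, the sample data, and the threshold of 10 (a common rule of thumb) are all illustrative, not DataLinter's code or defaults.

```python
from collections import Counter


def events_per_variable(target, n_predictors):
    """Return events per variable (EPV) for a binary target.

    'Events' is taken to be the count of the less frequent class.
    """
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    counts = Counter(target)
    if len(counts) != 2:
        raise ValueError("expected a binary target")
    return min(counts.values()) / n_predictors


# Hypothetical data: 12 events among 100 observations, 3 predictors.
target = [1] * 12 + [0] * 88
epv = events_per_variable(target, n_predictors=3)
EPV_THRESHOLD = 10  # illustrative rule-of-thumb value, not a documented default
print(f"EPV = {epv:.1f}; adequate: {epv >= EPV_THRESHOLD}")
```

With few events per predictor, coefficient estimates tend to be unstable, which is the kind of situation a sample-size linter is meant to surface before the model is fitted.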