Full lint catalog & Configuration guide

DataLinter ships with 23 built-in linters divided into two families:

  • Data-only linters – work on any tabular dataset, regardless of modeling language.
  • R-language specific linters – understand R modeling functions (lm, glm, glmmTMB, …) and their statistical assumptions.

All linters are disabled by default. You enable them in a config.toml configuration file, which contains three sections:

  • experiment – general context about the experiment, such as its name and target variable
  • linters – explicitly enables or disables individual linters; a linter runs only if it is set to true here
  • parameters – per-linter settings. Parameter names correspond to the keyword argument names of the functions implementing the linters.

Quick Configuration Example

[experiment]
    name = "My R linear model"
    target_variable = 2  # column index of target variable in the dataset
[linters]
    # Enable only what you need
    large_outliers = true
    # requires R code; checks normality of columns
    R_lm_modelling = true
[parameters]
    [parameters.R_lm_modelling]
        # p-value threshold for the normality tests; higher values
        # make the normality assumption stricter
        pvalue_threshold = 0.1

Full example configs are in the config folder.

Data-only linters

| Linter | Description | Typical Context | Key Parameters (see config/) |
|---|---|---|---|
| datetime_as_string | Checks if dates are wrongly encoded as strings | Any tabular data | match_perc |
| tokenizable_string | Checks whether string values can be split into tokens | Text / categorical columns | min_tokens |
| number_as_string | Checks whether a string column can be converted to numbers | Numeric data stored as text | match_perc |
| zipcodes_as_values | Checks whether values correspond to Zip/postal codes | Location columns | zipcodes, match_perc |
| large_outliers | Detects large outliers (Tukey’s fences) | Numerical features | tukey_fences_k |
| int_as_float | Checks for floating-point values that could be integers | Numerical columns | - |
| enum_detector | Detects columns that are actually enumerations | Categorical data | distinct_ratio, distinct_max_limit |
| uncommon_list_lengths | Checks columns containing lists of varying lengths | List / nested data | - |
| duplicate_examples | Finds identical duplicate rows | Any dataset | - |
| empty_example | Detects completely empty rows | Any dataset | - |
| uncommon_signs | Flags numerical columns with very few opposite signs | Signed numeric data | - |
| long_tailed_distrib | Detects long-tailed distributions | Numerical features | drop_proportion, zscore_multiplier |
| circular_domain | Identifies circular data (hours, degrees, etc.) |  |  |
| many_missing_values | Warns about columns with high missingness | Any dataset | threshold |
| negative_values | Checks for negative values in a column | Count / amount columns | - |
| imbalanced_target_variable | Detects imbalanced target classes | Classification targets | threshold |
| vif_colinearity | Detects high multicollinearity using VIF | Numerical data | vif_threshold |
| cnc_colinearity | Detects high multicollinearity using condition number analysis | Numerical data | cnc_threshold |
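
As an illustration of how the data-only linters and their key parameters fit into config.toml, here is a minimal sketch. It reuses the linter and parameter names listed above and the section layout of the quick example. The values are assumptions chosen for illustration (1.5 is the conventional multiplier for Tukey's fences, and 10 is a commonly quoted VIF rule of thumb); they are not DataLinter's documented defaults, so check the files in the config folder for the accepted types and ranges.

```toml
[linters]
    # Linters without parameters only need to be switched on
    duplicate_examples = true
    empty_example = true
    # Linters with parameters are enabled here and tuned under [parameters]
    large_outliers = true
    vif_colinearity = true
    # Explicitly disabled; linters are off by default, so omitting
    # this line should have the same effect
    negative_values = false

[parameters]
    [parameters.large_outliers]
        # assumed value: conventional Tukey multiplier, not a confirmed default
        tukey_fences_k = 1.5
    [parameters.vif_colinearity]
        # assumed value: common rule-of-thumb cut-off, not a confirmed default
        vif_threshold = 10.0
```

Parameters for a linter that is not enabled in the linters section presumably have no effect, because a linter only runs when it is enabled there.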

R-language specific linters

| Linter | Description | Model Context | Key Parameters (see config/) |
|---|---|---|---|
| R_glmmTMB_target_variable | Checks target variable imbalance in glmmTMB regressions | Mixed-effects models | threshold |
| R_glmmTMB_binomial_modelling | Validates the link parameter for the binomial family in glmmTMB | glmmTMB binomial | acceptable_link_values |
| R_lm_modelling | Checks normality of non-binary numeric columns in lm models | Linear regression | pvalue_threshold |
| R_glm_binomial_modelling | Checks normality of non-binary numeric columns in binomial glm | Logistic regression | pvalue_threshold |
| R_colinearity_with_target | Detects whether any independent variable is highly collinear with the target | Regression algorithms | threshold, algorithms |
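
The R-specific linters are configured the same way as the data-only ones. The sketch below enables the two binomial checks. The experiment name, the target column index, the p-value threshold, and the form of acceptable_link_values (a list of link names) are all assumptions made for illustration. logit and probit are standard link functions for the binomial family in R, but the type and values DataLinter actually expects should be confirmed against the files in the config folder.

```toml
[experiment]
    name = "Binomial model check"   # hypothetical experiment name
    target_variable = 1             # hypothetical column index of the target

[linters]
    R_glm_binomial_modelling = true
    R_glmmTMB_binomial_modelling = true

[parameters]
    [parameters.R_glm_binomial_modelling]
        # assumed value; not a confirmed default
        pvalue_threshold = 0.05
    [parameters.R_glmmTMB_binomial_modelling]
        # assumed to be a list of link names; verify the expected type
        acceptable_link_values = ["logit", "probit"]
```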