Full lint catalog & Configuration guide
DataLinter ships with 23 built-in linters divided into two families:
- Data-only linters – work on any tabular dataset, regardless of modeling language.
- R-language specific linters – understand R modeling functions (lm, glm, glmmTMB, …) and their statistical assumptions.
Linters are disabled by default. You enable them in a config.toml configuration file which contains three sections:
- experiment contains general context about the experiment.
- linters explicitly enables or disables individual linters; this is the section where linters are switched on.
- parameters sets individual linter parameters. The parameter names correspond to the keyword argument names of the functions implementing the linters.
Quick Configuration Example
[experiment]
name = "My R linear model"
target_variable = 2 # column index of target variable in the dataset
[linters]
# Enable only what you need
large_outliers = true
# code has to be R; checks for normality of columns
R_lm_modelling = true
[parameters]
[parameters.R_lm_modelling]
# threshold for normality tests; higher values correspond
# to more strict normal distribution assumptions
pvalue_threshold = 0.1
Full example configs are in the config folder.
Data-only linters
| Linter | Description | Typical Context | Key Parameters (see config/) |
|---|---|---|---|
| datetime_as_string | Checks if dates are wrongly encoded as strings | Any tabular data | match_perc |
| tokenizable_string | Checks whether string values can be split into tokens | Text / categorical columns | min_tokens |
| number_as_string | Checks whether a string column can be converted to numbers | Numeric data stored as text | match_perc |
| zipcodes_as_values | Checks whether values correspond to Zip/postal codes | Location columns | zipcodes, match_perc |
| large_outliers | Detects large outliers (Tukey’s fences) | Numerical features | tukey_fences_k |
| int_as_float | Checks for floating-point values that could be integers | Numerical columns | - |
| enum_detector | Detects columns that are actually enumerations | Categorical data | distinct_ratio, distinct_max_limit |
| uncommon_list_lengths | Checks columns containing lists of varying lengths | List / nested data | - |
| duplicate_examples | Finds identical duplicate rows | Any dataset | - |
| empty_example | Detects completely empty rows | Any dataset | - |
| uncommon_signs | Flags numerical columns in which very few values have the opposite sign | Signed numeric data | - |
| long_tailed_distrib | Detects long-tailed distributions | Numerical features | drop_proportion, zscore_multiplier |
| circular_domain | Identifies circular data (hours, degrees, etc.) | | |
| many_missing_values | Warns about columns with high missingness | Any dataset | threshold |
| negative_values | Checks for negative values in a column | Count / amount columns | - |
| imbalanced_target_variable | Detects imbalanced target classes | Classification targets | threshold |
| vif_colinearity | Detects high multicollinearity using the variance inflation factor (VIF) | Numerical data | vif_threshold |
| cnc_colinearity | Detects high multicollinearity using condition number analysis | Numerical data | cnc_threshold |
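
As a sketch of how some of the data-only linters above might be switched on, the fragment below follows the layout of the quick configuration example: linters are enabled under [linters] and tuned under [parameters.<linter name>]. The linter and parameter names come from the table. Every value is an illustrative assumption rather than a documented default, and reading threshold as a fraction of missing entries is also an assumption.

```toml
# Fragment to merge into config.toml; the [experiment] section is omitted here.
[linters]
large_outliers = true
many_missing_values = true
duplicate_examples = true   # no key parameters listed for this linter

[parameters]
[parameters.large_outliers]
# multiplier k for Tukey's fences; 1.5 is the conventional textbook choice
# (illustrative, not necessarily the DataLinter default)
tukey_fences_k = 1.5

[parameters.many_missing_values]
# illustrative value; assumed to be the fraction of missing entries
# above which a column is reported
threshold = 0.5
```

Linters shown with "-" in the Key Parameters column, such as duplicate_examples, presumably need no entry under [parameters]; the example configs in the config folder are the place to confirm the exact keys and defaults.
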
R language specific linters
| Linter | Description | Model Context | Key Parameters (see config/) |
|---|---|---|---|
| R_glmmTMB_target_variable | Checks target variable imbalance in glmmTMB regressions | Mixed-effects models | threshold |
| R_glmmTMB_binomial_modelling | Validates the link parameter for the binomial family in glmmTMB | glmmTMB binomial | acceptable_link_values |
| R_lm_modelling | Checks normality of non-binary numeric columns in lm models | Linear regression | pvalue_threshold |
| R_glm_binomial_modelling | Checks normality of non-binary numeric columns in binomial glm | Logistic regression | pvalue_threshold |
| R_colinearity_with_target | Detects whether any independent variable is highly collinear with the target | Regression algorithms | threshold, algorithms |
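
The R-specific linters are configured the same way as the data-only ones. The fragment below is a minimal sketch that enables two of them, using linter and parameter names taken from the table. The values are illustrative, and writing acceptable_link_values as a list of link names is an assumption about its expected type, not something the table states.

```toml
# Fragment to merge into config.toml; the [experiment] section is omitted here.
[linters]
R_glm_binomial_modelling = true
R_glmmTMB_binomial_modelling = true

[parameters]
[parameters.R_glm_binomial_modelling]
# threshold for the normality tests (illustrative value)
pvalue_threshold = 0.05

[parameters.R_glmmTMB_binomial_modelling]
# assumed format: list of link function names accepted for the binomial family
acceptable_link_values = ["logit", "probit"]
```

Both logit and probit are standard link functions for the binomial family in R; which values DataLinter accepts by default is best checked against the example configs in the config folder.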