Linters, configuration

The configuration file

The linters are configured through a .toml configuration file that contains three sections:

  • experiment contains general context about the experiment
  • linters explicitly enables or disables individual linters; a linter that is not enabled in this section is disabled by default
  • parameters sets the parameters of individual linters. The parameter names correspond to the keyword argument names of the functions implementing the linters.

A minimal configuration with a single linter looks like this:

[experiment]
    name = "Configuration with 1 linter"
    target_variable = 2  # column index of target variable in the dataset
[linters]
    # code has to be R; checks for normality of columns
    R_lm_modelling = true
[parameters]
    [parameters.R_lm_modelling]
        # threshold for normality tests; higher values correspond
        # to stricter normal distribution assumptions
        pvalue_threshold = 0.1
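
The following Python sketch illustrates how the three sections could fit together: a linter runs only if it is set to true under [linters], and its entries under [parameters] are passed to the implementing function as keyword arguments. The stand-in function body, its default value, the AVAILABLE_LINTERS registry, and run_enabled_linters are hypothetical names introduced for illustration, not the tool's actual code. The config dict mirrors the example configuration above as a TOML parser would return it.

```python
# Hypothetical sketch: the stand-in function, its default value, the registry
# and the dispatch logic are assumptions for illustration only.

def R_lm_modelling(pvalue_threshold=0.05):
    """Stand-in linter that only reports the parameter it received."""
    return f"R_lm_modelling ran with pvalue_threshold={pvalue_threshold}"

# Maps linter names, as written in the [linters] section, to functions.
AVAILABLE_LINTERS = {"R_lm_modelling": R_lm_modelling}

# The example configuration, as the nested dict a TOML parser would produce.
config = {
    "experiment": {"name": "Configuration with 1 linter", "target_variable": 2},
    "linters": {"R_lm_modelling": True},
    "parameters": {"R_lm_modelling": {"pvalue_threshold": 0.1}},
}

def run_enabled_linters(config, registry):
    """Run every enabled linter, passing its [parameters] table as kwargs."""
    results = {}
    for name, func in registry.items():
        # A linter that is not listed under [linters] counts as disabled.
        if not config.get("linters", {}).get(name, False):
            continue
        kwargs = config.get("parameters", {}).get(name, {})
        results[name] = func(**kwargs)
    return results

results = run_enabled_linters(config, AVAILABLE_LINTERS)
print(results)
```

Because the parameter tables are unpacked as keyword arguments, a misspelled parameter name in this sketch would raise a TypeError instead of being silently ignored.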

Linters

The available linters are briefly described below. Their parameters are documented in the configuration files in the config folder.

Data-only linters

  • datetime_as_string - checks if dates are wrongly encoded as strings
  • tokenizable_string - checks whether the string values of a column can be split into tokens
  • number_as_string - checks whether the values of a string column can be converted to numbers
  • zipcodes_as_values - checks whether the values correspond to Zip (postal) codes
  • large_outliers - checks whether there are large outliers using Tukey's fences
  • int_as_float - checks whether floating point encoded values can be converted to integers
  • enum_detector - checks whether the column values could correspond to an enumeration, i.e. the column contains a small number of distinct values
  • uncommon_list_lengths - checks whether the column contains lists of different lengths
  • duplicate_examples - checks whether two or more rows are identical
  • empty_example - checks for empty examples
  • uncommon_signs - checks whether there are very few values with a different sign in a numerical column
  • long_tailed_distrib - checks whether the data distribution has a long tail
  • circular_domain - checks whether the data pertains to a circular domain, e.g. hours or degrees
  • many_missing_values - checks whether there are many missing values in the column
  • negative_values - checks whether there are negative values in the column
  • imbalanced_target_variable - checks whether the distinct values in the column occur in balanced numbers
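
As an illustration of the approach named for large_outliers, the sketch below applies Tukey's fences to a list of numbers: a value is flagged when it lies below Q1 - k*IQR or above Q3 + k*IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. The function name tukey_outliers and the choice of k are assumptions; k = 1.5 is the conventional multiplier and k = 3 is commonly used for "far out" values, but the multiplier and quartile method used by the actual linter are not stated here.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences (hypothetical helper)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

sample = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = tukey_outliers(sample, k=3.0)  # wider fences: only far-out values
print(outliers)
```

A larger k widens the fences, so fewer values are reported.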

R language specific linters

  • R_glmmTMB_target_variable - checks whether the target variable in the glmmTMB-specific regression is imbalanced or not
  • R_glmmTMB_binomial_modelling - checks whether link parameter values are correct for the binomial distribution in the glmmTMB-specific regression
  • R_lm_modelling - checks whether non-binary numeric columns in lm-specific regression are normally distributed
  • R_glm_binomial_modelling - checks whether non-binary numeric columns in glm-specific regression are normally distributed