API Reference

BruteTree index type for storing vectors. It is a wrapper around a BruteTree NN structure and performs brute-force search using a distance-based similarity between vectors.

source

HNSW index type for storing vectors. It is a wrapper around a HierarchicalNSW (Hierarchical Navigable Small Worlds) NN graph structure and performs a very efficient search using a distance-based similarity between vectors.

source

IVFADC index type for storing vectors. It is a wrapper around an IVFADCIndex (inverted file system with asymmetric distance computation) structure and performs billion-scale search using a distance-based similarity between vectors.

source

K-D Tree index type for storing vectors. It is a wrapper around a KDTree NN structure and performs a more efficient search than brute-force search, using a distance-based similarity between vectors.

source

Naive index type for storing vectors. It is a wrapper around a vector of embeddings and performs brute-force search using the cosine similarity between vectors.
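
As an illustration of brute-force cosine search, the following is a minimal sketch and not the package's implementation. The function name `naive_cosine_search` and the column-wise storage of embeddings are assumptions made for the example.

```julia
using LinearAlgebra

# Hypothetical helper: brute-force cosine search over the columns of `data`.
# Returns the indices and scores of the k most similar columns.
function naive_cosine_search(data::AbstractMatrix{T}, query::AbstractVector{T},
                             k::Int) where {T<:AbstractFloat}
    qnorm = norm(query)
    scores = map(eachcol(data)) do column
        denom = norm(column) * qnorm
        # Zero vectors have no direction; give them a zero score.
        denom == 0 ? zero(T) : dot(column, query) / denom
    end
    k = min(k, length(scores))
    idxs = collect(partialsortperm(scores, 1:k; rev=true))
    return idxs, scores[idxs]
end

data = [1.0 0.0 1.0;
        0.0 1.0 1.0]   # three 2-dimensional embeddings stored as columns
idxs, scores = naive_cosine_search(data, [1.0, 0.0], 2)
```

Every stored vector is compared with the query, so the cost grows linearly with the number of indexed points.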

source

Noop index type for storing vectors. Returns empty vectors of indexes and scores. Useful when search is done only in the db.

source
Search environment object. It contains all the data, searchers and additional structures needed by the engine to function.
source

Object that stores the search results from a single searcher.

source
Search object. It contains all the indexed data and related configuration that allows for searches to be performed.

source
build_search_env(config_path; cache_path=nothing)

Creates a search environment using the information provided by the configuration file config_path.

source
build_search_env(env_config; cache_path=nothing)

Creates a search environment using the information provided by the environment configuration env_config. A cache filepath can be specified through cache_path, in which case the function will attempt to load the cache first.

source
parse_configuration(filename)

Parses a data configuration file (JSON format) and returns a NamedTuple that acts as a search environment configuration.

• Search environment options reference
    data_loader::Function # 0 argument function that when called loads the data i.e. dbdata
    data_sampler::Function # function that takes as input raw data and outputs a dbdata row
    id_key::Symbol # the name of the primary integer key in dbdata
    vectors_eltype::Type # the type of the vectors, scores etc. has to be <:AbstractFloat
    searcher_configs::Vector{NamedTuple} # vector of searcher configs (see reference below)
    embedder_configs::Vector{NamedTuple} # vector of embedder configs (see reference below)
    config_path::String # the path to the config

• Embedder config fields reference
    id::String
    description::String
    language::String # the embedder-level language
    stem_words::Bool # whether to stem words
    ngram_complexity::Int # ngram complexity (i.e. max number of tokens for an n-gram)
    vectors::Symbol # word vectors calculation/source i.e. :count, :tf, :tfidf, :bm25, :word2vec, :glove, :conceptnet, :compressed
    vectors_transform::Symbol # transform to apply to the vectors i.e. :lsa, :rp, :none
    vectors_dimension::Int # desired dimensionality after transform (ignored for word2vec approaches)
    embeddings_path::Union{Nothing, String} # path to the embeddings file
    embeddings_kind::Symbol # type of the embedding file for Word2Vec, GloVe i.e. :text, :binary
    doc2vec_method::Symbol # how to arrive at a single embedding from multiple i.e. :boe, :sif etc.
    glove_vocabulary::Union{Nothing, String} # path to a GloVe-generated vocabulary file (only for binary embeddings)
    oov_policy::Symbol # what to do with non-embeddable documents i.e. :none, :largevector
    embedder_kwarguments::Dict{Symbol, Any} # explicit specification of embedder keyword arguments
    embeddable_fields::Union{Nothing, Vector{Symbol}} # which fields to use for training the embedder
    text_strip_flags::UInt32 # how to strip text data before indexing
    sif_alpha::Float # smooth inverse frequency α parameter (for 'sif' doc2vec method only)
    borep_dimension::Int # output dimension for BOREP embedder
    borep_pooling_function::Symbol # pooling function for the BOREP embedder
    disc_ngram::Int # DisC embedder ngram parameter

• Searcher config fields reference
    id::String # searcher id
    id_aggregation::String # aggregation id
    description::String # description of the searcher
    enabled::Vector{Bool} # whether to use the searcher in search or not
    search_index::Symbol # type of the search index i.e. :naive, :kdtree, :hnsw
    search_index_arguments::Vector{Any}
    search_index_kwarguments::Dict{Symbol, Any}
    indexable_fields::Union{Nothing, Vector{Symbol}} # which fields to index
    data_embedder::String # id of the data/document embedder
    input_embedder::String # id of the input/query embedder
    heuristic::Union{Nothing, Symbol} # search heuristic for suggesting misspelled words (nothing means no recommendations)
    score_alpha::Float # score alpha (parameter for the scoring function)
    score_weight::Float # weight of scores of searcher (used in result aggregation)
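
To show how the searcher and embedder configurations relate, here is a small sketch that uses plain NamedTuples. All values are hypothetical and only the field names are taken from the references above. The helper `missing_embedders` is not part of the package; it is an assumed name for a consistency check one might run on a parsed configuration.

```julia
# Hypothetical values; only the field names come from the config references.
embedder_config = (id = "emb1",
                   description = "BM25 embedder",
                   language = "english",
                   stem_words = false,
                   ngram_complexity = 1,
                   vectors = :bm25,
                   vectors_transform = :none,
                   oov_policy = :none)

searcher_config = (id = "searcher1",
                   id_aggregation = "agg1",
                   description = "Searcher over titles",
                   search_index = :naive,
                   indexable_fields = [:title],
                   data_embedder = "emb1",    # refers to an embedder id
                   input_embedder = "emb1",   # refers to an embedder id
                   heuristic = nothing,
                   score_alpha = 0.5,
                   score_weight = 1.0)

# Hypothetical check: list embedder ids that searchers use but nobody defines.
function missing_embedders(searcher_configs, embedder_configs)
    known = Set(config.id for config in embedder_configs)
    wanted = Set{String}()
    for config in searcher_configs
        push!(wanted, config.data_embedder, config.input_embedder)
    end
    return sort!(collect(setdiff(wanted, known)))
end

unresolved = missing_embedders([searcher_config], [embedder_config])
```

An empty result means every searcher points at an embedder that exists in the same configuration.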

source
rest_server(port::Integer, io_port::Integer, search_server_ready::Condition [;ipaddr::String])

Starts a bi-directional HTTP REST server at address ipaddr::String (defaults to "0.0.0.0" i.e. all IPs) that uses the TCP port port and communicates with the search server through the TCP port io_port. The server is started once the condition search_server_ready is triggered.

source
Garamond.search Method.
search(srcher, query [;kwargs])

Searches for query (i.e. key terms) in srcher and returns information regarding the documents that best match the query. The function returns an object of type SearchResult.

Arguments

  • srcher::Searcher is the searcher
  • query the query, can be either a String or Vector{String}

Keyword arguments

  • search_method::Symbol controls the type of matching: :exact uses exact matches while :regex treats the needle as a regular expression
  • max_matches::Int is the maximum number of search results to return
  • max_suggestions::Int is the maximum number of suggestions to return for each missing needle
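
The difference between the two matching modes can be sketched on a toy vocabulary. The helper `match_needle` below is hypothetical and unrelated to the package internals; it only mirrors the :exact and :regex options of search_method.

```julia
# Hypothetical helper illustrating the two matching modes on a small vocabulary.
function match_needle(vocabulary::Vector{String}, needle::String;
                      search_method::Symbol=:exact)
    if search_method === :exact
        # Keep only terms identical to the needle.
        return filter(==(needle), vocabulary)
    elseif search_method === :regex
        # Interpret the needle as a regular expression.
        pattern = Regex(needle)
        return filter(term -> occursin(pattern, term), vocabulary)
    else
        throw(ArgumentError("unsupported search_method: $search_method"))
    end
end

vocabulary = ["search", "searcher", "research", "index"]
exact_hits = match_needle(vocabulary, "search")
regex_hits = match_needle(vocabulary, "^search"; search_method=:regex)
```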
source
Garamond.search Method.
search(srchers, query [;kwargs])

Searches for query (i.e. key terms) in multiple searchers and returns information regarding the documents that best match the query. The function returns the search results in the form of a Vector{SearchResult}.

Arguments

  • srchers::Vector{Searcher} is the searchers vector
  • query the query, can be either a String or Vector{String}

Keyword arguments

  • search_method::Symbol controls the type of matching: :exact uses exact matches while :regex treats the needle as a regular expression
  • max_matches::Int is the maximum number of search results to return
  • max_suggestions::Int is the maximum number of suggestions to return for each missing needle
  • custom_weights::Dict{Symbol, Float} are custom weights for each searcher's results used in result aggregation
source
search_server(data_config_path, io_port, search_server_ready; cache_path=nothing)

Search server for Garamond. It is a finite-state machine that, when called, creates the searchers (i.e. search objects) using data_config_path and then loops continuously in order to asynchronously handle outside requests.

After the searchers are loaded, the search server sends a notification using search_server_ready to any listening I/O servers.

source
unix_socket_server(socket::AbstractString, io_port::Integer, start::Condition)

Starts a bi-directional unix socket server that uses a UNIX-socket socket and communicates with the search server through the TCP port io_port. The server is started once the condition start is triggered.

source
web_socket_server(port::UInt16, io_port::Integer, start::Condition [; ipaddr::String])

Starts a bi-directional web socket server that uses a WEB-socket at address ipaddr::String (defaults to "127.0.0.1") and port port and communicates with the search server through the TCP port io_port. The server is started once the condition start is triggered.

source
Garamond.DTVModel Constant.

Constant that represents document term vector (DTV) models used in text embedding.

source

Request corresponding to an environment operation command.

source

Request corresponding to an error i.e. in parsing.

source

Constant that represents embeddings libraries used in text embedding.

source
Garamond.KILL_REQUEST Constant.

Request corresponding to a kill server command.

source

Request corresponding to a searcher read configuration command.

source

Standard response terminator. It is used in the client-server communication to mark the end of sent and received messages.

source

Default request.

source

Bag-of-embeddings (BOE) structure for document embedding using word vectors.

source

Bag-of-random-embedding-projections (BOREP) structure for document embedding using word vectors.

source

Concatenated-power-mean-embeddings (CPMean) structure for document embedding using word vectors.

source

Structure for document embedding using DTV's.

source

Distributed Co-occurrence (DisC) structure for document embedding using word vectors.

source

Request object for the internal server of the engine.

source

Smooth inverse frequency (SIF) structure for document embedding using word vectors.
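
The weighting step of smooth inverse frequency can be sketched as follows: each word vector is scaled by α / (α + p(w)), where p(w) is the word's relative frequency, before averaging. This is only an illustration under stated assumptions. The function name, the Dict-based inputs and the default value of `alpha` are hypothetical, and the common-component removal that usually follows the averaging is left out.

```julia
# Hypothetical helper: SIF-weighted average of word vectors (weighting step only).
function sif_weighted_average(vectors::Dict{String, Vector{Float64}},
                              frequencies::Dict{String, Float64},
                              tokens::Vector{String}; alpha::Float64=0.01)
    dim = length(first(values(vectors)))
    acc = zeros(dim)
    n = 0
    for token in tokens
        haskey(vectors, token) || continue   # skip words without an embedding
        # Frequent words get small weights, rare words get weights close to 1.
        weight = alpha / (alpha + get(frequencies, token, 0.0))
        acc .+= weight .* vectors[token]
        n += 1
    end
    return n > 0 ? acc ./ n : acc
end

vectors = Dict("the" => [1.0, 0.0], "julia" => [0.0, 1.0])
frequencies = Dict("the" => 0.09, "julia" => 0.01)
sentence_vector = sif_weighted_average(vectors, frequencies, ["the", "julia"])
```

With α = 0.01 the frequent word "the" receives weight 0.1 while the rarer word "julia" receives weight 0.5.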

source
Base.deleteat! Method.
deleteat!(env::SearchEnv, pos)

Deletes from a search environment the db and index elements with linear indices found in pos.

source
Base.length Method.
length(index)

Returns the number of points indexed in index.

source
Base.parse Method.
parse(::Type{InternalRequest}, request::AbstractString)

Parses an outside request received from a client into an InternalRequest usable by the search server.

source
Base.pop! Method.
pop!(env::SearchEnv)

Pops last point from a search environment. Returns last db row and associated indexed vector.

source
Base.popfirst! Method.
popfirst!(env::SearchEnv)

Pops first point from a search environment. Returns first db row and associated indexed vector.

source
Base.push! Method.
push!(env::SearchEnv, rawdata)

Pushes to a search environment i.e. to the db and all indexes.

source
Base.pushfirst! Method.
pushfirst!(env::SearchEnv, rawdata)

Pushes to the first position of a search environment i.e. of the db and all indexes.

source
document2vec(embedder, document)

Word-embeddings approach to document embedding. It embeds documents using word embeddings libraries and some algorithm for combining these (depending on the type of embedder).

Arguments

  • embedder::WordVectorsEmbedder is the embedder
  • document::Vector{String} the document to be embedded, where each vector element corresponds to a sentence
source

Aggregates search results from several searchers based on their aggregation_id i.e. results from searchers with identical aggregation id's are merged together into a new search result that replaces the individual searcher ones.
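
A simplified view of this merging is sketched below. The representation of a result as an `aggregation id => Dict(document id => score)` pair and the rule of summing the scores of a document are assumptions made for the illustration; the actual merging rule is not described here.

```julia
# Hypothetical sketch: each result is a pair `aggregation id => (doc id => score)`.
function aggregate_results(results)
    merged = Dict{String, Dict{Int, Float64}}()
    for (agg_id, scores) in results
        target = get!(merged, agg_id, Dict{Int, Float64}())
        for (doc_id, score) in scores
            # Assumption: scores of the same document are summed.
            target[doc_id] = get(target, doc_id, 0.0) + score
        end
    end
    return merged
end

results = ["a" => Dict(1 => 0.5, 2 => 0.2),
           "a" => Dict(1 => 0.25),
           "b" => Dict(3 => 1.0)]
merged = aggregate_results(results)
```

The two results that share the aggregation id "a" collapse into a single entry, while "b" is left as it is.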

source
build_data_env(env::SearchEnv)

Strips searchers from env.

source
Garamond.build_logger Function.
build_logger(logging_stream, log_level)

Builds a logger using the stream logging_stream and the log_level provided.

Arguments

  • logging_stream::String is the output stream and can take the values: "null" logs to /dev/null, "stdout" (default) logs to standard output, "/path/to/existing/file" logs to an existing file and "/path/to/non-existing/file" creates the log file. If no valid option is provided, the default stream is the standard output.
  • log_level::String is the log level and can take the values "debug", "info" and "error"; it defaults to "info" if no valid option is provided.

source
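
The option handling described for build_logger could look roughly like the sketch below, which relies only on the Logging standard library. The name `build_logger_sketch`, the choice of ConsoleLogger and the use of append mode for files are assumptions and may differ from the actual function.

```julia
using Logging

# Hypothetical re-implementation of the option handling described for build_logger.
function build_logger_sketch(logging_stream::String="stdout", log_level::String="info")
    levels = Dict("debug" => Logging.Debug,
                  "info" => Logging.Info,
                  "error" => Logging.Error)
    level = get(levels, log_level, Logging.Info)   # invalid level falls back to "info"
    stream = if logging_stream == "null"
        devnull
    elseif logging_stream == "stdout"
        stdout
    elseif isdir(dirname(abspath(logging_stream)))
        open(logging_stream, "a")   # appends to, or creates, the log file
    else
        stdout                      # invalid stream falls back to standard output
    end
    return ConsoleLogger(stream, level)
end

logger = build_logger_sketch("null", "debug")
with_logger(logger) do
    @debug "this message is discarded because the stream is devnull"
end
```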
build_response(dbdata, request, results, [; kwargs...])

Builds a response for an engine client using the data, request and results.

source

Constructs a search result from a list of data ids.

source
build_searcher(dbdata, config)

Creates a Searcher from a searcher configuration.

source

Post-processes a string to fit a certain length, adding … if necessary at the end of its chopped representation.
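
A minimal sketch of such a truncation is given below. The name `chop_to_length` and the decision to count the trailing … towards the length limit are assumptions, not the package's documented behavior.

```julia
# Hypothetical helper: truncate to at most `len` characters, ending in "…" when shortened.
function chop_to_length(input::AbstractString, len::Int)
    len > 0 || return ""
    length(input) <= len && return String(input)
    # Reserve one character for the ellipsis; `first` is safe for Unicode strings.
    return first(input, len - 1) * "…"
end

shortened = chop_to_length("Garamond search engine", 9)
untouched = chop_to_length("short", 10)
```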

source
Garamond.densify Method.
densify(array)

Transforms sparse arrays into dense ones.

source
detect_language(text [; default=DEFAULT_LANGUAGE])

Detects the language of a piece of text. Returns a language of type Languages.Language. If the text is empty or the confidence is low, returns the default language.

source
document2vec(embedder, document [;isregex=false])

Embeds documents. The document representation is conceptually a vector of sentences, the output is always a vector of floating point numbers.

Arguments

  • embedder::AbstractEmbedder is the embedder
  • document::Vector{AbstractString} the document to be embedded, where each vector element corresponds to a sentence

Keyword arguments

  • isregex::Bool a false value (default) specifies that the document tokens are to be matched exactly while a true value specifies that the tokens are to be matched partially (for DTV-based document embedding only)
source
env_operator(env, channels)

Saves/Loads/Updates the search environment env. Communication with the search server i.e. getting the command and its arguments and sending back a new environment is done via channels.

source
garamond_log_formatter(level, _module, group, id, file, line)

Garamond-specific log message formatter. Takes a fixed set of input arguments and returns the color, prefix and suffix for the log message.

source

Returns found and missing needles using an embedder.

source

Noop ranker, does not rank, returns the first input argument unchanged.

source
printable_version()

Returns a pretty version string that includes the git commit and date.

source
read_configuration_to_json(env)

Returns a JSON dictionary with the full configuration of the search environment.

source
Garamond.respond Method.
respond(env, socket, counter, channels)

Responds to search server requests received on socket using the search data from searchers. The requests are counted through the variable counter.

source
sentences2vec(embedder, document_embedding, embedded_words [;dim=0])

Returns a matrix of sentence embeddings from a vector of matrices containing individual sentence word embeddings. Used mostly for word-vectors based embedders.

Arguments

  • embedder::AbstractEmbedder is the embedder
  • document_embedding::Vector{Matrix{AbstractFloat}} are the document's word embeddings, where each element of the vector represents the embedding of a sentence (with the matrix columns being individual word embeddings)
  • embedded_words::Vector{Vector{AbstractString}} are the words in each sentence that were embedded (their order corresponds to the order of the matrix columns in document_embedding)

Keyword arguments

  • dim::Int is the dimension of the word embeddings i.e. number of components in the word vector (default 0)
source
Garamond.squash Method.
squash(m)

Function that creates a single mean vector from a matrix m and performs some normalization operations as well.
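
The normalization operations are not spelled out here. As an illustration only, the sketch below assumes that word vectors are stored as columns, that the mean is taken across columns and that the result is scaled to unit L2 norm; `squash_sketch` is a hypothetical name.

```julia
using LinearAlgebra, Statistics

# Hypothetical stand-in for `squash`: column mean followed by L2 normalization.
function squash_sketch(m::AbstractMatrix{T}) where {T<:AbstractFloat}
    v = vec(mean(m, dims=2))     # average the columns into one vector
    n = norm(v)
    return n > 0 ? v ./ n : v    # leave an all-zero vector untouched
end

word_vectors = [3.0 3.0;
                0.0 8.0]         # two word vectors stored as columns
document_vector = squash_sketch(word_vectors)
```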

source
Garamond.squash Method.
squash(vv, m)

Function that creates a single mean vector from a vector of vectors vv where each vector has a length m and performs some normalization operations as well.

source
suggestion_search!(suggestions, search_tree, needles [;max_suggestions=1])

Searches in the search tree for partial matches for each of the needles.

source
Garamond.summarize Method.
summarize(sentences [;ns=1, flags=DEFAULT_SUMMARIZATION_STRIP_FLAGS])

Builds a summary of the text's sentences. The resulting summary will be an ns-sentence document; each sentence is pre-processed using the flags option.

source
Garamond.version Method.
version()

Returns the current Garamond version using the Project.toml and git. If Project.toml or git is not available, the version defaults to an empty string.

source
word_embeddings(word_vectors, document_tokens [;kwargs])

Returns a matrix corresponding to the word embeddings of document_tokens as well as the indices of missing i.e. not-embedded tokens.

Arguments

  • word_vectors::EmbeddingsLibrary wordvectors object; can be a Word2Vec.WordVectors, Glowe.WordVectors or ConceptnetNumberbatch.ConceptNet
  • document_tokens::Vector{String} the words to be embedded, where each vector element corresponds to a word

Keyword arguments

  • keep_size::Bool a false value discards vectors for words not found while a true value (default) places a zero vector in the embeddings matrix
  • print_matched_words::Bool if true, the words that were and that were not embedded are printed (default false)
  • kwargs... the rest of the keyword arguments are ConceptNet specific and can be found by inspecting the help of ConceptnetNumberbatch.embed_document
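
The effect of keep_size can be illustrated with a plain Dict standing in for the embeddings library. The function `word_embeddings_sketch` below is hypothetical and only mimics the described behavior, returning the embeddings matrix together with the indices of the tokens that could not be embedded.

```julia
# Hypothetical stand-in that uses a Dict as the embeddings library.
function word_embeddings_sketch(word_vectors::Dict{String, Vector{Float64}},
                                document_tokens::Vector{String}; keep_size::Bool=true)
    dim = length(first(values(word_vectors)))
    columns = Vector{Vector{Float64}}()
    missing_idxs = Int[]
    for (i, token) in enumerate(document_tokens)
        if haskey(word_vectors, token)
            push!(columns, word_vectors[token])
        else
            push!(missing_idxs, i)
            # A zero vector keeps the matrix aligned with the token positions.
            keep_size && push!(columns, zeros(dim))
        end
    end
    embeddings = zeros(dim, length(columns))
    for (j, column) in enumerate(columns)
        embeddings[:, j] = column
    end
    return embeddings, missing_idxs
end

library = Dict("search" => [1.0, 2.0])
padded, missing_padded = word_embeddings_sketch(library, ["search", "zzz"])
compact, missing_compact = word_embeddings_sketch(library, ["search", "zzz"]; keep_size=false)
```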
source
HNSW.knn_search Method.
knn_search(index, point, k, keep)

Searches for the k nearest neighbors of point in data contained in the index. The index may vary from a simple wrapper inside a matrix to more complex structures such as k-d trees, etc. Only neighbors present in keep are returned.
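
For the simplest case of an index that wraps a matrix, the role of keep can be sketched with a brute-force search. The function name, the column-wise layout, the Euclidean distance and the returned tuple are assumptions made for the illustration.

```julia
using LinearAlgebra

# Hypothetical brute-force variant: columns of `data` are the indexed points.
function knn_search_sketch(data::AbstractMatrix{<:Real}, point::AbstractVector{<:Real},
                           k::Int, keep::AbstractVector{Int})
    # Only points whose index appears in `keep` are considered at all.
    candidates = [i for i in keep if 1 <= i <= size(data, 2)]
    distances = [norm(view(data, :, i) .- point) for i in candidates]
    order = sortperm(distances)[1:min(k, length(candidates))]
    return candidates[order], distances[order]
end

data = [0.0 1.0 2.0 3.0;
        0.0 0.0 0.0 0.0]   # four 2-dimensional points stored as columns
idxs, dists = knn_search_sketch(data, [0.9, 0.0], 2, [1, 3, 4])
```

The second point is the closest one to the query but it is never returned, because its index is not in keep.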

source