Garamond.DTVModel
Garamond.ENVOP_REQUEST
Garamond.ERRORED_REQUEST
Garamond.EmbeddingsLibrary
Garamond.KILL_REQUEST
Garamond.READCONFIGS_REQUEST
Garamond.RESPONSE_TERMINATOR
Garamond.UNINITIALIZED_REQUEST
Garamond.BOEEmbedder
Garamond.BOREPEmbedder
Garamond.BruteTreeIndex
Garamond.CPMeanEmbedder
Garamond.DTVEmbedder
Garamond.DisCEmbedder
Garamond.HNSWIndex
Garamond.IVFIndex
Garamond.InternalRequest
Garamond.KDTreeIndex
Garamond.NaiveIndex
Garamond.NoopIndex
Garamond.SIFEmbedder
Garamond.SearchEnv
Garamond.SearchResult
Garamond.Searcher
Base.deleteat!
Base.length
Base.parse
Base.pop!
Base.popfirst!
Base.push!
Base.pushfirst!
Garamond.__document2vec
Garamond.aggregate!
Garamond.build_data_env
Garamond.build_logger
Garamond.build_response
Garamond.build_result_from_ids
Garamond.build_search_env
Garamond.build_search_env
Garamond.build_searcher
Garamond.chop_to_length
Garamond.densify
Garamond.detect_language
Garamond.document2vec
Garamond.env_operator
Garamond.garamond_log_formatter
Garamond.missing_needles
Garamond.noop_ranker
Garamond.parse_configuration
Garamond.printable_version
Garamond.read_configuration_to_json
Garamond.respond
Garamond.rest_server
Garamond.search
Garamond.search
Garamond.search_server
Garamond.sentences2vec
Garamond.squash
Garamond.squash
Garamond.suggestion_search!
Garamond.summarize
Garamond.unix_socket_server
Garamond.version
Garamond.web_socket_server
Garamond.word_embeddings
HNSW.knn_search
Garamond.BruteTreeIndex
— Type.BruteTree index type for storing vectors. It is a wrapper around a BruteTree
NN structure and performs brute search using a distance-based similarity between vectors.
Garamond.HNSWIndex
— Type.HNSW index type for storing vectors. It is a wrapper around a HierarchicalNSW
(Hierarchical Navigable Small Worlds) NN graph structure and performs a very efficient search using a distance-based similarity between vectors.
Garamond.IVFIndex
— Type.IVFADC index type for storing vectors. It is a wrapper around an IVFADCIndex
(inverted file system with asymmetric distance computation) structure and performs a billion-scale search using a distance-based similarity between vectors.
Garamond.KDTreeIndex
— Type.K-D Tree index type for storing vectors. It is a wrapper around a KDTree
NN structure and performs a more efficient search using a distance-based similarity between vectors.
Garamond.NaiveIndex
— Type.Naive index type for storing vectors. It is a wrapper around a vector of embeddings and performs brute search using the cosine similarity between vectors.
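As a rough illustration of what a brute-force cosine-similarity search over a set of embeddings involves, here is a minimal sketch. The function `toy_cosine_search` and its column-per-embedding layout are assumptions made for this example and are not part of Garamond's API; it also assumes no zero-norm vectors.

```julia
using LinearAlgebra

# Hypothetical helper (not Garamond's implementation): brute-force
# cosine-similarity search. `data` holds one embedding per column.
# Returns the indices and scores of the `k` most similar columns, best first.
function toy_cosine_search(data::AbstractMatrix{T}, query::AbstractVector{T}, k::Int) where {T<:AbstractFloat}
    qnorm = norm(query)
    scores = [dot(view(data, :, j), query) / (norm(view(data, :, j)) * qnorm)
              for j in 1:size(data, 2)]
    idxs = collect(partialsortperm(scores, 1:min(k, length(scores)); rev=true))
    return idxs, scores[idxs]
end

data = [1.0 0.0 1.0;
        0.0 1.0 1.0]
idxs, scores = toy_cosine_search(data, [1.0, 0.0], 2)
```

Every stored vector is compared with the query, so the cost grows linearly with the number of indexed points.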
Garamond.NoopIndex
— Type.Noop index type for storing vectors. Returns empty vectors of indexes, scores. Useful when search is done only in the db.
Garamond.SearchEnv
— Type.Search environment object. It contains all the data, searchers
and additional structures needed by the engine to function.
Garamond.SearchResult
— Type.Object that stores the search results from a single searcher.
Garamond.Searcher
— Type.Search object. It contains all the indexed data and related
configuration that allows for searches to be performed.
Garamond.build_search_env
— Method.build_search_env(config_path; cache_path=nothing)
Creates a search environment using the information provided by the configuration file config_path.
Garamond.build_search_env
— Method.build_search_env(env_config; cache_path=nothing)
Creates a search environment using the information provided by the environment configuration env_config. A cache file path can be specified through cache_path, in which case the function will attempt to load it first.
Garamond.parse_configuration
— Method.parse_configuration(filename)
Parses a data configuration file (JSON format) and returns a NamedTuple
that acts as a search environment configuration.
• Search environment options reference
data_loader::Function # 0-argument function that, when called, loads the data i.e. dbdata
data_sampler::Function # function that takes raw data as input and outputs a dbdata row
id_key::Symbol # the name of the primary integer key in dbdata
vectors_eltype::Type # the type of the vectors, scores etc.; has to be <:AbstractFloat
searcher_configs::Vector{NamedTuple} # vector of searcher configs (see reference below)
embedder_configs::Vector{NamedTuple} # vector of embedder configs (see reference below)
config_path::String # the path to the config
• Embedder config fields reference
id::String
description::String
language::String # the embedder-level language
stem_words::Bool # whether to stem words
ngram_complexity::Int # ngram complexity (i.e. max number of tokens for an n-gram)
vectors::Symbol # word vectors calculation/source i.e. :count, :tf, :tfidf, :bm25, :word2vec, :glove, :conceptnet, :compressed
vectors_transform::Symbol # transform to apply to the vectors i.e. :lsa, :rp, :none
vectors_dimension::Int # desired dimensionality after transform (ignored for word2vec approaches)
embeddings_path::Union{Nothing, String} # path to the embeddings file
embeddings_kind::Symbol # type of the embedding file for Word2Vec, GloVe i.e. :text, :binary
doc2vec_method::Symbol # how to arrive at a single embedding from multiple i.e. :boe, :sif etc.
glove_vocabulary::Union{Nothing, String} # path to a GloVe-generated vocabulary file (only for binary embeddings)
oov_policy::Symbol # what to do with non-embeddable documents i.e. :none, :large_vector
embedder_kwarguments::Dict{Symbol, Any} # explicit specification of embedder keyword arguments
embeddable_fields::Union{Nothing, Vector{Symbol}} # which fields to use for training the embedder
text_strip_flags::UInt32 # how to strip text data before indexing
sif_alpha::Float # smooth inverse frequency α parameter (for 'sif' doc2vec method only)
borep_dimension::Int # output dimension for BOREP embedder
borep_pooling_function::Symbol # pooling function for the BOREP embedder
disc_ngram::Int # DisC embedder ngram parameter
• Searcher config fields reference
id::String # searcher id
id_aggregation::String # aggregation id
description::String # description of the searcher
enabled::Vector{Bool} # whether to use the searcher in search or not
search_index::Symbol # type of the search index i.e. :naive, :kdtree, :hnsw
search_index_arguments::Vector{Any}
search_index_kwarguments::Dict{Symbol, Any}
indexable_fields::Union{Nothing, Vector{Symbol}} # which fields to index
data_embedder::String # id of the data/document embedder
input_embedder::String # id of the input/query embedder
heuristic::Union{Nothing, Symbol} # search heuristic for suggesting misspelled words (nothing means no recommendations)
score_alpha::Float # score alpha (parameter for the scoring function)
score_weight::Float # weight of scores of searcher (used in result aggregation)
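To make the shape of a searcher configuration more concrete, the following sketch builds a NamedTuple with a subset of the searcher fields listed in the reference. All values (ids, field names to index, weights) are invented for illustration; the on-disk JSON layout that parse_configuration expects is not shown here and is not implied by this example.

```julia
# Illustrative values only; the field names follow the searcher config
# reference, the values are hypothetical.
searcher_config = (
    id = "searcher_1",
    id_aggregation = "aggregation_1",
    description = "Example HNSW searcher",
    search_index = :hnsw,
    search_index_arguments = Any[],
    search_index_kwarguments = Dict{Symbol,Any}(),
    indexable_fields = [:title, :body],   # hypothetical column names
    data_embedder = "embedder_1",         # must match an embedder config id
    input_embedder = "embedder_1",
    heuristic = nothing,                  # no suggestions for misspelled words
    score_alpha = 0.5,
    score_weight = 1.0,
)
```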
Garamond.rest_server
— Method.rest_server(port::Integer, io_port::Integer, search_server_ready::Condition [;ipaddr::String])
Starts a bi-directional HTTP REST server at address ipaddr::String (defaults to "0.0.0.0", i.e. all IP addresses) that uses the TCP port port and communicates with the search server through the TCP port io_port. The server is started once the condition search_server_ready is triggered.
Garamond.search
— Method.search(srcher, query [;kwargs])
Searches for query (i.e. key terms) in srcher and returns information regarding the documents that best match the query. The function returns an object of type SearchResult.
Arguments
• srcher::Searcher is the searcher
• query the query, can be either a String or Vector{String}
Keyword arguments
• search_method::Symbol controls the type of matching: :exact uses exact matches while :regex considers the needle a regular expression
• max_matches::Int is the maximum number of search results to return
• max_suggestions::Int is the maximum number of suggestions to return for each missing needle
Garamond.search
— Method.search(srchers, query [;kwargs])
Searches for query (i.e. key terms) in multiple searchers and returns information regarding the documents that best match the query. The function returns the search results in the form of a Vector{SearchResult}.
Arguments
• srchers::Vector{Searcher} is the searchers vector
• query the query, can be either a String or Vector{String}
Keyword arguments
• search_method::Symbol controls the type of matching: :exact uses exact matches while :regex considers the needle a regular expression
• max_matches::Int is the maximum number of search results to return
• max_suggestions::Int is the maximum number of suggestions to return for each missing needle
• custom_weights::Dict{Symbol, Float} are custom weights for each searcher's results used in result aggregation
Garamond.search_server
— Method.search_server(data_config_path, io_port, search_server_ready; cache_path=nothing)
Search server for Garamond. It is a finite-state machine that, when called, creates the searchers i.e. search objects using the data_config_path and then proceeds to loop continuously in order to asynchronously handle outside requests.
After the searchers are loaded, the search server sends a notification using search_server_ready to any listening I/O servers.
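The readiness handshake described above can be pictured with a plain `Condition`: an I/O task waits on it and only starts once the search side notifies. This is a minimal sketch of that pattern using Julia tasks; the `events` vector and the task bodies are placeholders invented for the example and do not reproduce Garamond's servers.

```julia
# Sketch of the "notify when ready" pattern with a Condition.
search_server_ready = Condition()
events = String[]

# Stand-in for an I/O server: blocks until the search server is ready.
io_task = @async begin
    wait(search_server_ready)
    push!(events, "io server started")
end

yield()                              # let the I/O task reach its wait()
push!(events, "searchers loaded")    # stand-in for building the searchers
notify(search_server_ready)          # wake up every waiting I/O server
wait(io_task)
```

A `Condition` only wakes tasks that are already waiting, which is why the waiting task is given a chance to run before `notify` is called.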
Garamond.unix_socket_server
— Method.unix_socket_server(socket::AbstractString, io_port::Integer, start::Condition)
Starts a bi-directional unix socket server that uses a UNIX-socket socket and communicates with the search server through the TCP port io_port. The server is started once the condition start is triggered.
Garamond.web_socket_server
— Method.web_socket_server(port::UInt16, io_port::Integer, start::Condition [; ipaddr::String])
Starts a bi-directional web socket server that uses a WEB-socket at address ipaddr::String (defaults to "127.0.0.1") and port port and communicates with the search server through the TCP port io_port. The server is started once the condition start is triggered.
Garamond.DTVModel
— Constant.Constant that represents document term vector (DTV) models used in text embedding.
Garamond.ENVOP_REQUEST
— Constant.Request corresponding to an environment operation command.
Garamond.ERRORED_REQUEST
— Constant.Request corresponding to an error i.e. in parsing.
Garamond.EmbeddingsLibrary
— Constant.Constant that represents embeddings libraries used in text embedding.
Garamond.KILL_REQUEST
— Constant.Request corresponding to a kill server command.
Garamond.READCONFIGS_REQUEST
— Constant.Request corresponding to a searcher read configuration command.
Garamond.RESPONSE_TERMINATOR
— Constant.Standard response terminator. It is used in the client-server communication to mark the end of sent and received messages.
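A terminator-delimited exchange can be sketched with an in-memory buffer standing in for the socket. The terminator string `"<EOM>"` and the helper names below are assumptions for this example only; the actual value of RESPONSE_TERMINATOR is not stated in this reference.

```julia
# Hypothetical helpers: frame messages with a terminator string.
# The default terminator is an assumed value, not Garamond's constant.
toy_send(io::IO, msg::AbstractString; terminator::String="<EOM>") =
    write(io, msg * terminator)

toy_receive(io::IO; terminator::String="<EOM>") =
    readuntil(io, terminator)        # the terminator itself is dropped

io = IOBuffer()                      # stands in for a socket
toy_send(io, "first message")
toy_send(io, "second message")
seekstart(io)
first_msg = toy_receive(io)
second_msg = toy_receive(io)
```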
Garamond.UNINITIALIZED_REQUEST
— Constant.Default request.
Garamond.BOEEmbedder
— Type.Bag-of-embeddings (BOE) structure for document embedding using word vectors.
Garamond.BOREPEmbedder
— Type.Bag-of-random-embedding-projections (BOREP) structure for document embedding using word vectors.
Garamond.CPMeanEmbedder
— Type.Concatenated-power-mean-embeddings (CPMean) structure for document embedding using word vectors.
Garamond.DTVEmbedder
— Type.Structure for document embedding using DTVs.
Garamond.DisCEmbedder
— Type.Distributed Co-occurrence (DisC) structure for document embedding using word vectors.
Garamond.InternalRequest
— Type.Request object for the internal server of the engine.
Garamond.SIFEmbedder
— Type.Smooth inverse frequency (SIF) structure for document embedding using word vectors.
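For orientation, the weighting step commonly associated with smooth inverse frequency gives each word the weight α / (α + p(w)), where p(w) is the word's relative frequency, so frequent words contribute less. The sketch below shows only that weighted average; it is not Garamond's SIFEmbedder, it omits the principal-component removal that SIF usually includes, and the function name and dictionaries are hypothetical.

```julia
# Hypothetical sketch of SIF-style weighting (no principal-component removal).
function toy_sif_embed(words::Vector{String},
                       vectors::Dict{String,Vector{Float64}},
                       freqs::Dict{String,Float64}; alpha::Float64=1e-3)
    dim = length(first(values(vectors)))
    acc = zeros(dim)
    n = 0
    for w in words
        haskey(vectors, w) || continue                  # skip out-of-vocabulary words
        weight = alpha / (alpha + get(freqs, w, 0.0))   # rare words weigh more
        acc .+= weight .* vectors[w]
        n += 1
    end
    return n == 0 ? acc : acc ./ n
end

vectors = Dict("the" => [1.0, 0.0], "cat" => [0.0, 1.0])
freqs = Dict("the" => 0.1, "cat" => 0.001)
v = toy_sif_embed(["the", "cat", "unknown"], vectors, freqs; alpha=0.001)
```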
Base.deleteat!
— Method.deleteat!(env::SearchEnv, pos)
Deletes from a search environment the db and index elements with linear indices found in pos.
Base.length
— Method.length(index)
Returns the number of points indexed in index.
Base.parse
— Method.parse(::Type{InternalRequest}, request::AbstractString)
Parses an outside request received from a client into an InternalRequest usable by the search server.
Base.pop!
— Method.pop!(env::SearchEnv)
Pops last point from a search environment. Returns last db row and associated indexed vector.
Base.popfirst!
— Method.popfirst!(env::SearchEnv)
Pops first point from a search environment. Returns first db row and associated indexed vector.
Base.push!
— Method.push!(env::SearchEnv, rawdata)
Pushes to a search environment i.e. to the db and all indexes.
Base.pushfirst!
— Method.pushfirst!(env::SearchEnv, rawdata)
Pushes to the first position of a search environment i.e. of the db and all indexes.
Garamond.__document2vec
— Method.document2vec(embedder, document)
Word-embeddings approach to document embedding. It embeds documents using word embeddings libraries and some algorithm for combining these (depending on the type of embedder).
Arguments
• embedder::WordVectorsEmbedder is the embedder
• document::Vector{String} the document to be embedded, where each vector element corresponds to a sentence
Garamond.aggregate!
— Method.Aggregates search results from several searchers based on their aggregation_id, i.e. results from searchers with identical aggregation ids are merged together into a new search result that replaces the individual searchers' results.
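The grouping idea can be sketched as follows. The merge rule used here (a weighted sum of scores per document id) is an assumption chosen for the example, since the exact rule is not spelled out in this reference; `ToyResult` and `toy_aggregate` are hypothetical names and not Garamond types.

```julia
# Hypothetical stand-in for one searcher's results: document id => score.
struct ToyResult
    aggregation_id::String
    weight::Float64
    scores::Dict{Int,Float64}
end

# Merge results sharing an aggregation id. ASSUMPTION: scores are combined
# by a weighted sum; the real aggregation strategy may differ.
function toy_aggregate(results::Vector{ToyResult})
    merged = Dict{String,Dict{Int,Float64}}()
    for r in results
        target = get!(merged, r.aggregation_id, Dict{Int,Float64}())
        for (id, s) in r.scores
            target[id] = get(target, id, 0.0) + r.weight * s
        end
    end
    return merged
end

results = [ToyResult("a", 0.5, Dict(1 => 1.0, 2 => 0.5)),
           ToyResult("a", 0.5, Dict(2 => 0.5)),
           ToyResult("b", 1.0, Dict(3 => 0.2))]
merged = toy_aggregate(results)
```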
Garamond.build_data_env
— Method.build_data_env(env::SearchEnv)
Strips searchers from env.
Garamond.build_logger
— Function.build_logger(logging_stream, log_level)
Builds a logger using the stream logging_stream and log_level provided.
Arguments
• logging_stream::String is the output stream and can take the values: "null" logs to /dev/null, "stdout" (default) logs to standard output, "/path/to/existing/file" logs to an existing file and "/path/to/non-existing/file" creates the log file. If no valid option is provided, the default stream is the standard output.
• log_level::String is the log level; it can take the values "debug", "info", "error" and defaults to "info" if no valid option is provided.
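The selection rules listed for the stream and level can be sketched with Julia's Logging standard library. `toy_build_logger` is a hypothetical re-creation for illustration; it returns a plain `SimpleLogger` and does not apply Garamond's own log formatting.

```julia
using Logging

# Hypothetical sketch of the stream/level selection rules.
function toy_build_logger(logging_stream::String="stdout", log_level::String="info")
    levels = Dict("debug" => Logging.Debug,
                  "info"  => Logging.Info,
                  "error" => Logging.Error)
    level = get(levels, log_level, Logging.Info)   # unknown level: "info"
    stream = if logging_stream == "null"
        devnull
    elseif logging_stream == "stdout"
        stdout
    else
        try
            open(logging_stream, "a")   # appends to an existing file or creates it
        catch
            stdout                      # unusable path: fall back to stdout
        end
    end
    return SimpleLogger(stream, level)
end

logger = toy_build_logger("null", "debug")
with_logger(logger) do
    @debug "this message goes to devnull"
end
```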
Garamond.build_response
— Method.build_response(dbdata, request, results, [; kwargs...])
Builds a response for an engine client using the data, request and results.
Garamond.build_result_from_ids
— Method.Constructs a search result from a list of data ids.
Garamond.build_searcher
— Method.build_searcher(dbdata, config)
Creates a Searcher from a searcher configuration.
Garamond.chop_to_length
— Method.Post-processes a string to fit a certain length, adding … if necessary at the end of its chopped representation.
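A minimal sketch of this kind of truncation is shown below. `toy_chop` is a hypothetical helper, and it assumes the length limit applies to the text before the ellipsis is appended; the real function may count the ellipsis differently.

```julia
# Hypothetical sketch: keep at most `len` characters and mark truncation.
function toy_chop(s::AbstractString, len::Int)
    length(s) <= len && return String(s)   # short strings pass through
    return first(s, len) * "…"             # character-based, Unicode safe
end

short = toy_chop("hi", 5)
long = toy_chop("hello world", 5)
```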
Garamond.densify
— Method.densify(array)
Transforms sparse arrays into dense ones.
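With the SparseArrays standard library, this kind of conversion amounts to calling `Array` on the sparse input. The helper below is a hypothetical sketch of the idea, not the package's method; passing dense arrays through unchanged is an assumption.

```julia
using SparseArrays

# Hypothetical sketch: sparse input becomes a dense Array,
# anything already dense is returned unchanged (assumed behaviour).
toy_densify(a::AbstractSparseArray) = Array(a)
toy_densify(a::AbstractArray) = a

s = sparse([1.0 0.0; 0.0 2.0])
d = toy_densify(s)
```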
Garamond.detect_language
— Method.detect_language(text [; default=DEFAULT_LANGUAGE])
Detects the language of a piece of text. Returns a language of type Languages.Language. If the text is empty or the confidence is low, returns the default language.
Garamond.document2vec
— Method.document2vec(embedder, document [;isregex=false])
Embeds documents. The document representation is conceptually a vector of sentences, the output is always a vector of floating point numbers.
Arguments
• embedder::AbstractEmbedder is the embedder
• document::Vector{AbstractString} the document to be embedded, where each vector element corresponds to a sentence
Keyword arguments
• isregex::Bool a false value (default) specifies that the document tokens are to be matched exactly while a true value specifies that the tokens are to be matched partially (for DTV-based document embedding only)
Garamond.env_operator
— Method.env_operator(env, channels)
Saves/Loads/Updates the search environment env. Communication with the search server i.e. getting the command and its arguments and sending back a new environment is done via channels.
Garamond.garamond_log_formatter
— Method.garamond_log_formatter(level, _module, group, id, file, line)
Garamond-specific log message formatter. Takes a fixed set of input arguments and returns the color, prefix and suffix for the log message.
Garamond.missing_needles
— Function.Returns found and missing needles using an embedder.
Garamond.noop_ranker
— Method.Noop ranker, does not rank, returns the first input argument unchanged.
Garamond.printable_version
— Method.printable_version()
Returns a pretty version string that includes the git commit and date.
Garamond.read_configuration_to_json
— Method.read_configuration_to_json(env)
Returns a JSON dictionary with the full configuration of the search environment.
Garamond.respond
— Method.respond(env, socket, counter, channels)
Responds to search server requests received on socket using the search data from searchers. The requests are counted through the variable counter.
Garamond.sentences2vec
— Method.sentences2vec(embedder, document_embedding, embedded_words [;dim=0])
Returns a matrix of sentence embeddings from a vector of matrices containing individual sentence word embeddings. Used mostly for word-vectors based embedders.
Arguments
• embedder::AbstractEmbedder is the embedder
• document_embedding::Vector{Matrix{AbstractFloat}} are the document's word embeddings, where each element of the vector represents the embedding of a sentence (with the matrix columns being individual word embeddings)
Keyword arguments
• dim::Int is the dimension of the word embeddings i.e. number of components in the word vector (default 0)
• embedded_words::Vector{Vector{AbstractString}} are the words in each sentence that were embedded (their order corresponds to the order of the matrix columns in document_embedding)
Garamond.squash
— Method.squash(m)
Function that creates a single mean vector from a matrix m and performs some normalization operations as well.
Garamond.squash
— Method.squash(vv, m)
Function that creates a single mean vector from a vector of vectors vv where each vector has a length m and performs some normalization operations as well.
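The two methods can be pictured with the sketch below. The normalization is not specified in this reference, so L2 normalization of the mean vector is assumed here purely for illustration; `toy_squash` is a hypothetical name.

```julia
using LinearAlgebra, Statistics

# Hypothetical sketch: column mean followed by L2 normalization.
# ASSUMPTION: the actual normalization steps may differ.
function toy_squash(m::AbstractMatrix{T}) where {T<:AbstractFloat}
    v = vec(mean(m, dims=2))      # mean over columns, one value per row
    n = norm(v)
    return n > 0 ? v ./ n : v     # leave an all-zero mean untouched
end

# Vector-of-vectors variant; `m` is the length of each inner vector and
# is needed to build a zero vector when `vv` is empty.
toy_squash(vv::Vector{<:AbstractVector{T}}, m::Int) where {T<:AbstractFloat} =
    isempty(vv) ? zeros(T, m) : toy_squash(reduce(hcat, vv))

a = toy_squash([3.0 3.0; 4.0 4.0])
b = toy_squash([[3.0, 4.0], [3.0, 4.0]], 2)
```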
Garamond.suggestion_search!
— Method.suggestion_search!(suggestions, search_tree, needles [;max_suggestions=1])
Searches in the search tree for partial matches for each of the needles.
Garamond.summarize
— Method.summarize(sentences [;ns=1, flags=DEFAULT_SUMMARIZATION_STRIP_FLAGS])
Builds a summary of the text's sentences. The resulting summary will be an ns-sentence document; each sentence is pre-processed using the flags option.
Garamond.version
— Method.version()
Returns the current Garamond version using the Project.toml and git. If the Project.toml and git are not available, the version defaults to an empty string.
Garamond.word_embeddings
— Method.word_embeddings(word_vectors, document_tokens [;kwargs])
Returns a matrix corresponding to the word embeddings of document_tokens as well as the indices of missing i.e. not-embedded tokens.
Arguments
• word_vectors::EmbeddingsLibrary wordvectors object; can be a Word2Vec.WordVectors, Glowe.WordVectors or ConceptnetNumberbatch.ConceptNet
• document_tokens::Vector{String} the words to be embedded, where each vector element corresponds to a word
Keyword arguments
• keep_size::Bool a false value discards vectors for words not found while a true value (default) places a zero vector in the embeddings matrix
• print_matched_words::Bool if true, the words that were and that were not embedded are printed (default false)
• kwargs... the rest of the keyword arguments are ConceptNet specific and can be found by inspecting the help of ConceptnetNumberbatch.embed_document
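The effect of keep_size can be illustrated with a plain dictionary standing in for the embeddings library. `toy_word_embeddings` and the `Dict` lookup are assumptions made for this sketch; real Word2Vec, Glowe or ConceptNet objects are not used here.

```julia
# Hypothetical sketch of the keep_size behaviour with a Dict as vocabulary.
function toy_word_embeddings(vectors::Dict{String,Vector{Float64}},
                             tokens::Vector{String}; keep_size::Bool=true)
    dim = length(first(values(vectors)))
    missing_idxs = [i for (i, t) in enumerate(tokens) if !haskey(vectors, t)]
    # keep_size=true keeps one column per token, false drops missing tokens
    kept = keep_size ? tokens : tokens[setdiff(1:length(tokens), missing_idxs)]
    embeddings = zeros(dim, length(kept))
    for (j, t) in enumerate(kept)
        haskey(vectors, t) && (embeddings[:, j] = vectors[t])  # else zero column
    end
    return embeddings, missing_idxs
end

vocab = Dict("a" => [1.0, 2.0])
full, missing_full = toy_word_embeddings(vocab, ["a", "zzz"])
trimmed, missing_trimmed = toy_word_embeddings(vocab, ["a", "zzz"]; keep_size=false)
```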
HNSW.knn_search
— Method.knn_search(index, point, k, keep)
Searches for the k nearest neighbors of point in data contained in the index. The index may vary from a simple wrapper inside a matrix to more complex structures such as k-d trees, etc. Only neighbors present in keep are returned.
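The role of keep can be sketched with a brute-force search over a matrix of column vectors, where only the listed indices are considered as candidates. `toy_knn_search` is a hypothetical helper using Euclidean distance; it does not reflect how HNSW or the tree-based indexes apply the filter internally.

```julia
using LinearAlgebra

# Hypothetical sketch: brute-force k-NN restricted to the indices in `keep`.
function toy_knn_search(data::AbstractMatrix, point::AbstractVector,
                        k::Int, keep::AbstractVector{Int})
    candidates = [j for j in keep if 1 <= j <= size(data, 2)]
    dists = [norm(view(data, :, j) .- point) for j in candidates]
    order = sortperm(dists)[1:min(k, length(dists))]   # closest first
    return candidates[order], dists[order]
end

data = [0.0 1.0 2.0 3.0]          # four 1-dimensional points, one per column
ids, dists = toy_knn_search(data, [0.0], 2, [2, 3, 4])
```

Point 1 is the closest to the query but is never returned because it is absent from `keep`.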