Getting started
Work in progress!

This section is currently under construction and is incomplete.

The engine uses a pluggable approach in which data loaders, parsers, recommenders and rankers can be compiled into the engine at runtime. The following usage examples rely only on functionality already provided by the engine. Although by no means exhaustive, they are meant to provide a starting point for exploring the functionality and features of the engine.

Glossary

Throughout the documentation, certain terms appear when referring to the internals of the engine. Some of the most frequent ones are:

  • configuration may refer to:
    • searcher configuration, a SearcherConfig object which holds the configuration options for individual searchers.
    • environment configuration, a NamedTuple that contains searcher configurations as well as other parameters.
    • data configuration file, a JSON file which is parsed to generate an environment configuration.
  • search environment, a SearchEnv object that holds the data and the searchers, among others. It fully describes the state of the engine.
  • searcher, a Searcher object that is used to perform the actual search. It holds the indexed documents in some vectorial representation.
  • index, the data structure holding the vector representation of the documents.
  • request may refer to:
    • a request from an outside system to the engine, e.g. an HTTP request.
    • the internal representation of a request, of type InternalRequest.

Engine configuration

The main configuration of the engine pertains to data loading, parsing and indexing. Its role is to provide all the details needed to load and index the data, and to describe the internal architecture of the engine. The recommended way of configuring the engine is to create a JSON file with all the necessary options. Alternatively, the configuration object, i.e. the result of parsing the configuration file, can be created explicitly; at least at this point, however, this is a cumbersome operation.

julia> using Logging, JSON, JuliaDB, Garamond

julia> include(joinpath(@__DIR__, "..", "..", "test", "configs", "configgenerator.jl"));

julia> cfg = mktemp() do path, io  # write and parse config file on-the-fly
           write(io, generate_sample_config_1())
           flush(io)
           parse_configuration(path)
       end
(data_loader = Garamond.var"#120#130"{Array{Any,1},Dict{Symbol,Any},typeof(Garamond.juliadb_loader)}(Any["/home/travis/build/zgornel/Garamond.jl/test/data/generated_data_100_samples.csv"], Dict{Symbol,Any}(), Garamond.juliadb_loader), data_sampler = Garamond.identity_sampler, id_key = :id, vectors_eltype = Float32, searcher_configs = Any[(id = "searcher_1", id_aggregation = "aggid", description = "A searcher using BM25+RP embeddings and naive indexing", enabled = Bool[1], search_index = :naive, search_index_arguments = Any[], search_index_kwarguments = Dict{Symbol,Any}(), indexable_fields = [:RandString, :StringField, :StringField2, :IntField], data_embedder = "embedder_1", input_embedder = "embedder_1", heuristic = nothing, score_alpha = 0.4f0, score_weight = 0.8f0)], embedder_configs = Any[(id = "embedder_1", description = "BM25+RP embedder", language = "english", stem_words = false, ngram_complexity = 1, vectors = :bm25, vectors_transform = :rp, vectors_dimension = 50, embeddings_path = nothing, embeddings_kind = :binary, doc2vec_method = :boe, glove_vocabulary = nothing, oov_policy = :large_vector, embedder_kwarguments = Dict{Symbol,Any}(), embeddable_fields = [:RandString, :StringField, :StringField2, :IntField], text_strip_flags = 0x00700607, sif_alpha = 0.01f0, borep_dimension = 1024, borep_pooling_function = :sum, disc_ngram = 2)], config_path = "/tmp/jl_vtqzZn")

The configuration contains, among others, the data loader (a closure that is called with no arguments to load the data), the data sampler, the path of the configuration file, the primary id key of the data (the data itself needs to be a JuliaDB data type), and lists of configuration objects for the individual searchers and embedders of the engine.

julia> for field in fieldnames(typeof(cfg))
           println("$field=$(getfield(cfg, field))")
       end
data_loader=#120
data_sampler=identity_sampler
id_key=id
vectors_eltype=Float32
searcher_configs=Any[(id = "searcher_1", id_aggregation = "aggid", description = "A searcher using BM25+RP embeddings and naive indexing", enabled = Bool[1], search_index = :naive, search_index_arguments = Any[], search_index_kwarguments = Dict{Symbol,Any}(), indexable_fields = [:RandString, :StringField, :StringField2, :IntField], data_embedder = "embedder_1", input_embedder = "embedder_1", heuristic = nothing, score_alpha = 0.4f0, score_weight = 0.8f0)]
embedder_configs=Any[(id = "embedder_1", description = "BM25+RP embedder", language = "english", stem_words = false, ngram_complexity = 1, vectors = :bm25, vectors_transform = :rp, vectors_dimension = 50, embeddings_path = nothing, embeddings_kind = :binary, doc2vec_method = :boe, glove_vocabulary = nothing, oov_policy = :large_vector, embedder_kwarguments = Dict{Symbol,Any}(), embeddable_fields = [:RandString, :StringField, :StringField2, :IntField], text_strip_flags = 0x00700607, sif_alpha = 0.01f0, borep_dimension = 1024, borep_pooling_function = :sum, disc_ngram = 2)]
config_path=/tmp/jl_vtqzZn
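
The idea of a data loader stored as a zero-argument closure inside a NamedTuple can be illustrated without the engine. The snippet below is only a sketch in plain Julia: toy_loader, make_loader and toy_cfg are hypothetical names invented for illustration, and the way Garamond actually builds its loader may differ.

```julia
# Hypothetical loader; it stands in for a real data loading function.
toy_loader(path; header=true) = "loading $(path) (header=$(header))"

# Capture positional and keyword arguments in a closure,
# so that the caller can later load the data with no arguments.
make_loader(f, args, kwargs) = () -> f(args...; kwargs...)

# Toy configuration: a NamedTuple, similar in spirit to `cfg` above.
toy_cfg = (data_loader = make_loader(toy_loader, ["data.csv"], Dict(:header => false)),
           id_key = :id,
           vectors_eltype = Float32)

# Fields can be walked with `pairs` instead of `fieldnames`/`getfield`.
for (field, value) in pairs(toy_cfg)
    field === :data_loader && continue  # closures print as opaque names
    println("$field = $value")
end

# Calling the closure triggers the actual load.
loaded = toy_cfg.data_loader()
println(loaded)
```

Storing the loader as a closure keeps the configuration self-contained: whoever holds the configuration can load the data without knowing the file paths or loader options.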

The search environment

Building the search environment out of the configuration is straightforward. The environment holds the in-memory data in the form of an IndexedTable or NDSparse object, the searchers, as well as other information such as the primary db key and the configuration path.

julia> env = build_search_env(cfg)
[ Info: • Environment successfully built using config /tmp/jl_vtqzZn.
SearchEnv{Float32} with:
`-dbdata = Table with 100 rows, 7 columns
  id_key = id
  sampler = Garamond.identity_sampler
  embedders = [
    RP embedder, 258 to 50, DTV(bm25) vectors
  ]
  searchers = [
    [enabled] Searcher searcher_1/aggid, Naive index, 100 Float32 embedded documents, one embedder
  ]
  config_path = /tmp/jl_vtqzZn

Engine operations

The internal API is designed to be straightforward and uniform in the way it is called. First, one builds a request that fully describes the operation to be performed; then, one calls the desired operation. For example, a search request could look like:

julia> request = Garamond.InternalRequest(operation=:search,
                                          query="Q",
                                          search_method=:exact,
                                          max_matches=10,
                                          response_size=5,
                                          max_suggestions=0,
                                          return_fields=[:id, :RandString, :StringField],
                                          input_parser=:noop_input_parser,
                                          ranker=:noop_ranker)
InternalRequest: OPERATION=:search | QUERY="Q" | MAX_MATCHES=10 | SEARCH_METHOD=:exact | SEARCHABLE_FILTERS=Symbol[] | MAX_SUGGESTIONS=0 | RETURN_FIELDS=[:id, :Ran… | CUSTOM_WEIGHTS=Dict{Symbo… | REQUEST_ID_KEY=Symbol("") | SORT_FIELDS=Symbol[] | SORT_REVERSE=false | RESPONSE_SIZE=5 | RESPONSE_PAGE=1 | INPUT_PARSER=:noop_inpu… | RANKER=:noop_rank… | RECOMMENDER=:noop_reco…
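
The printed request above contains fields that were never passed to the constructor (for example RESPONSE_PAGE=1 and SORT_REVERSE=false), which suggests that unspecified fields fall back to default values. A minimal sketch of such a keyword-constructed type in plain Julia follows. ToyRequest, its fields and its default values are hypothetical and chosen for illustration only; this is not the definition of InternalRequest.

```julia
# Toy stand-in for a keyword-constructed request type.
# NOT Garamond's InternalRequest; fields and defaults are assumptions.
Base.@kwdef struct ToyRequest
    operation::Symbol = :search
    query::String = ""
    max_matches::Int = 10
    response_size::Int = 10
    response_page::Int = 1
    return_fields::Vector{Symbol} = Symbol[]
end

# Only the fields of interest are given; the rest keep their defaults.
req = ToyRequest(query="Q", response_size=5, return_fields=[:id, :RandString])

println(req.operation)      # default value
println(req.response_page)  # default value
println(req.response_size)  # explicitly set
```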

The search itself is then performed with:

julia> search_results = search(env, request)
1-element Array{SearchResult{Float32},1}:
 Search results for searcher_1:  10 hits, 1 query terms, 0 suggestions.

Ranking the results using the ranker specified in the request is done with:

julia> ranked = rank(env, request, search_results)
1-element Array{SearchResult{Float32},1}:
 Search results for searcher_1:  10 hits, 1 query terms, 0 suggestions.

Results and responses

Once results are available, they can be printed:

julia> print_search_results(env.dbdata, ranked; id_key=:id, fields=[:id, :RandString])
[searcher_1] 10 search results:
  0.93404645 ~ 12 - Q
  0.8485508 ~ 46 - q
  0.707569 ~ 79 - 4
  0.6926229 ~ 14 - clAsVujw
  0.6845983 ~ 68 - Eh
  0.6749313 ~ 42 - 03Zc3
  0.6718404 ~ 15 - mnhdBWI
  0.66862476 ~ 66 - jlfCgUFB4
  0.66711974 ~ 63 - hM
  0.66711974 ~ 93 - s

Alternatively, a JSON response can be created and sent elsewhere:

julia> response = Garamond.build_response(env.dbdata, request, ranked, id_key=env.id_key)
"{\"n_searchers\":1,\"n_searchers_w_results\":1,\"suggestions\":{\"d\":{}},\"elapsed_time\":-1.0,\"results\":{\"searcher_1\":[{\"id\":12,\"RandString\":\"Q\",\"score\":0.93404645,\"StringField\":\"\"},{\"id\":46,\"RandString\":\"q\",\"score\":0.8485508,\"StringField\":\"A B C D E\"},{\"id\":79,\"RandString\":\"4\",\"score\":0.707569,\"StringField\":\"A\"},{\"id\":14,\"RandString\":\"clAsVujw\",\"score\":0.6926229,\"StringField\":\"A B C D\"},{\"id\":68,\"RandString\":\"Eh\",\"score\":0.6845983,\"StringField\":\"A B\"}]},\"n_total_results\":10}"

To verify the response, it can be parsed and displayed:

julia> parsed_response = JSON.parse(response)
Dict{String,Any} with 6 entries:
  "n_searchers"           => 1
  "n_searchers_w_results" => 1
  "suggestions"           => Dict{String,Any}("d"=>Dict{String,Any}())
  "elapsed_time"          => -1.0
  "results"               => Dict{String,Any}("searcher_1"=>Any[Dict{String,Any…
  "n_total_results"       => 10

julia> parsed_response["results"][collect(keys(parsed_response["results"]))[1]]
5-element Array{Any,1}:
 Dict{String,Any}("score" => 0.93404645,"StringField" => "","id" => 12,"RandString" => "Q")
 Dict{String,Any}("score" => 0.8485508,"StringField" => "A B C D E","id" => 46,"RandString" => "q")
 Dict{String,Any}("score" => 0.707569,"StringField" => "A","id" => 79,"RandString" => "4")
 Dict{String,Any}("score" => 0.6926229,"StringField" => "A B C D","id" => 14,"RandString" => "clAsVujw")
 Dict{String,Any}("score" => 0.6845983,"StringField" => "A B","id" => 68,"RandString" => "Eh")
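
A client receiving such a response will typically want to walk the results of every searcher rather than index the first key. The sketch below uses only Base Julia on a hand-written dictionary that mirrors the shape of the parsed response shown above, shortened to two hits. The variable names parsed, hits and best are illustrative assumptions.

```julia
# Hand-written stand-in with the same shape as the parsed response above.
parsed = Dict{String,Any}(
    "n_total_results" => 10,
    "results" => Dict{String,Any}(
        "searcher_1" => Any[
            Dict{String,Any}("id" => 12, "score" => 0.93404645, "RandString" => "Q"),
            Dict{String,Any}("id" => 46, "score" => 0.8485508, "RandString" => "q"),
        ],
    ),
)

# Collect (searcher, id, score) triples across all searchers in the response.
hits = [(searcher, hit["id"], hit["score"])
        for (searcher, searcher_hits) in parsed["results"]
        for hit in searcher_hits]

# Pick the hit with the highest score.
best = hits[argmax([h[3] for h in hits])]
println(best)
```

Iterating over the "results" dictionary avoids relying on key order, which matters once more than one searcher is enabled.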