API Reference

A Julia library for working with text, hard-forked from TextAnalysis.jl.

CooMatrix{T}

Basic Co-occurrence Matrix (COOM) type.

Fields

  • coom::SparseMatrixCSC{T,Int} the actual COOM; elements represent co-occurrences of two terms within a given window
  • terms::Vector{String} a list of terms that represent the lexicon of the document or corpus
  • column_indices::OrderedDict{String, Int} a map between the terms and the columns of the co-occurrence matrix

CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])

Auxiliary constructor(s) of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructors require a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict whose keys are the lexicon, or it can be omitted, in which case the lexicon field of the corpus is used.
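
A minimal usage sketch (the document, window size and element type here are illustrative):

julia> using StringAnalysis
       doc = StringDocument("one two three two one two")
       crps = Corpus(AbstractDocument[doc])
       update_lexicon!(crps)
       C = CooMatrix{Float32}(crps, window=2)
       coom(C)  # the underlying sparse co-occurrence matrix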


DocumentTermMatrix{T}

Basic Document-Term-Matrix (DTM) type.

Fields

  • dtm::SparseMatrixCSC{T,Int} the actual DTM; rows represent terms and columns represent documents
  • terms::Vector{String} a list of terms that represent the lexicon of the corpus associated with the DTM
  • row_indices::OrderedDict{String, Int} a map between the terms and the rows of the DTM

DocumentTermMatrix{T}(docs [,terms] [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Auxiliary constructor(s) of the DocumentTermMatrix type. The type T has to be a subtype of Real. The constructors require a corpus or vector of strings docs and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict whose keys are the lexicon, or it can be missing, in which case the lexicon field of the corpus is used.
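
A short sketch of building a DTM from a corpus (documents chosen for illustration):

julia> using StringAnalysis
       crps = Corpus(AbstractDocument[StringDocument("a b c"), StringDocument("a b b d")])
       update_lexicon!(crps)
       M = DocumentTermMatrix{Float32}(crps)
       dtm(M)  # sparse matrix; rows are terms, columns are documents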

LSAModel{S<:AbstractString, T<:AbstractFloat, A<:AbstractMatrix{T}, H<:Integer}

LSA (latent semantic analysis) model. It constructs, from a document term matrix (DTM), a model that can be used to embed documents in the latent semantic space pertaining to the data. The model requires that the document term matrix be a DocumentTermMatrix{T<:AbstractFloat} because the elements of the matrices resulting from the SVD operation are floating point numbers, and these have to match or be convertible to type T.

Fields

  • vocab::Vector{S} a vector with all the words in the corpus
  • vocab_hash::OrderedDict{S,H} a mapping from words to indices in the word embedding matrix
  • Σinv::A diagonal of the inverse singular value matrix
  • Uᵀ::A transpose of the word embedding matrix
  • stats::Symbol the statistical measure to use for word importances in documents. Available values are: :count (term count), :tf (term frequency), :tfidf (default, term frequency-inverse document frequency) and :bm25 (Okapi BM25)
  • idf::Vector{T} inverse document frequencies for the words in the vocabulary
  • nwords::T average number of words in a document
  • ngram_complexity::Int ngram complexity
  • κ::Int the κ parameter of the BM25 statistic
  • β::Float64 the β parameter of the BM25 statistic
  • tol::T minimum size of the vector components (default T(1e-15))

SVD matrices U, Σinv and V:

If X is an m×n document-term-matrix with n documents and m words, so that X[i,j] represents a statistical indicator of the importance of term i in document j, then:

  • U, Σ, V = svd(X)
  • Σinv = diag(inv(Σ))
  • Uᵀ = U'
  • X ≈ U * Σ * V'

The matrix V of document embeddings is not actually stored in the model.

Examples

julia> using StringAnalysis

       doc1 = StringDocument("This is a text about an apple. There are many texts about apples.")
       doc2 = StringDocument("Pears and apples are good but not exotic. An apple a day keeps the doctor away.")
       doc3 = StringDocument("Fruits are good for you.")
       doc4 = StringDocument("This phrase has nothing to do with the others...")
       doc5 = StringDocument("Simple text, little info inside")

       crps = Corpus(AbstractDocument[doc1, doc2, doc3, doc4, doc5])
       prepare!(crps, strip_punctuation)
       update_lexicon!(crps)
       dtm = DocumentTermMatrix{Float32}(crps, collect(keys(crps.lexicon)))

       ### Build LSA Model ###
       lsa_model = LSAModel(dtm, k=3, stats=:tf)

       query = StringDocument("Apples and an exotic fruit.")
       idxs, corrs = cosine(lsa_model, crps, query)

       println("Query: "$(query.text)"")
       for (idx, corr) in zip(idxs, corrs)
           println("$corr -> "$(crps[idx].text)"")
       end
Query: "Apples and an exotic fruit."
0.9746108 -> "Pears and apples are good but not exotic  An apple a day keeps the doctor away "
0.870703 -> "This is a text about an apple  There are many texts about apples "
0.7122063 -> "Fruits are good for you "
0.22725986 -> "This phrase has nothing to do with the others "
0.076901935 -> "Simple text  little info inside "

RPModel{S<:AbstractString, T<:AbstractFloat, A<:AbstractMatrix{T}, H<:Integer}

Random projection model. It constructs, from a document term matrix (DTM), a model that can be used to embed documents in a random sub-space. The model requires that the document term matrix be a DocumentTermMatrix{T<:AbstractFloat} because the elements of the matrices resulting from the projection operation are floating point numbers, and these have to match or be convertible to type T. The approach is based on the effects of the Johnson-Lindenstrauss lemma.

Fields

  • vocab::Vector{S} a vector with all the words in the corpus
  • vocab_hash::OrderedDict{S,H} a mapping from words to indices in the random projection matrix
  • R::A the random projection matrix
  • stats::Symbol the statistical measure to use for word importances in documents. Available values are: :count (term count), :tf (term frequency), :tfidf (default, term frequency-inverse document frequency) and :bm25 (Okapi BM25)
  • idf::Vector{T} inverse document frequencies for the words in the vocabulary
  • nwords::T average number of words in a document
  • ngram_complexity::Int ngram complexity
  • κ::Int the κ parameter of the BM25 statistic
  • β::Float64 the β parameter of the BM25 statistic
  • project::Bool specifies whether the model actually performs the projection or not; it is false if the number of dimensions provided is zero or negative

TextHashFunction(hash_function::Function, cardinality::Int)

The basic structure for performing text hashing: uses the hash_function to generate feature vectors of length cardinality.

Details

The hash trick is the use of a hash function, instead of a lexicon, to determine the columns of a DocumentTermMatrix-like encoding of the data. To produce a DTM for a Corpus for which we do not have an existing lexicon, we need some way to map the terms from each document into column indices. We use the now standard "hash trick", in which we hash strings and then reduce the resulting integers modulo N, which defines the number of columns we want our DTM to have. This amounts to a non-linear dimensionality reduction with a low probability that similar terms hash to the same dimension.

To make things easier, we wrap Julia's hash functions in a new type, TextHashFunction, which maintains information about the desired cardinality of the hashes.

Examples

julia> doc = StringDocument("this is a text")
       thf = TextHashFunction(hash, 13)
       hash_dtv(doc, thf, Float16)
13-element Array{Float16,1}:
 1.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 2.0
 0.0
 0.0
 0.0
 0.0
 0.0
coom(c::CooMatrix)

Access the co-occurrence matrix field coom of a CooMatrix c.

coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])

Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.
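
For instance, assuming crps is a corpus with an updated lexicon (arguments illustrative):

julia> C = coom(crps, Float32, window=3)  # builds a CooMatrix{Float32} internally and returns its matrix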

cosine(model, docs, doc, n=10)

Return the positions of the n closest neighboring documents to doc found in docs. docs can be a corpus or document term matrix. The vector representations of docs and doc are obtained with the model which can be either a LSAModel or RPModel.

dtm(d::DocumentTermMatrix)

Access the matrix of a DocumentTermMatrix d.

dtm(docs::Corpus, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Access the matrix of the DTM associated with the corpus docs. The DocumentTermMatrix{T} will first have to be created in order for the actual matrix to be accessed.

dtv(d, lex::OrderedDict{String,Int}, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a document-term-vector with elements of type T for document d using the lexicon lex. d can be an AbstractString or an AbstractDocument.

dtv(crps::Corpus, idx::Int, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a document-term-vector with elements of type T for document idx of the corpus crps.
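
A brief sketch (documents chosen for illustration):

julia> crps = Corpus(AbstractDocument[StringDocument("a b c"), StringDocument("a c c")])
       update_lexicon!(crps)
       dtv(crps, 2, Float32)  # document-term vector of the second document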

dtv_regex(d, lex::OrderedDict{String,Int}, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a document-term-vector with elements of type T for document d using the lexicon lex. The tokens of document d are assumed to be regular expressions in text format. d can be an AbstractString or an AbstractDocument.

Examples

julia> dtv_regex(NGramDocument("a..b"), OrderedDict("aaa"=>1, "aaab"=>2, "accb"=>3, "bbb"=>4), Float32)
4-element Array{Float32,1}:
 0.0
 1.0
 1.0
 0.0
each_dtv(crps::Corpus [; eltype::Type{U}=DEFAULT_DTM_TYPE, ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Iterates through the columns of the DTM of the corpus crps without constructing it. Useful when the DTM would not fit in memory. eltype specifies the element type of the generated vectors.

each_hash_dtv(crps::Corpus [; eltype::Type{U}=DEFAULT_DTM_TYPE, ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Iterates through the columns of the hashed DTM of the corpus crps without constructing it. Useful when the DTM would not fit in memory. eltype specifies the element type of the generated vectors.
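Memory-friendly iteration might look like this (eltype shown for illustration):

julia> for v in each_dtv(crps, eltype=Float32)
           # v is one document-term vector; process it here
       end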

embed_document(lm, doc)

Return the vector representation of doc, obtained using the LSA model lm. doc can be an AbstractDocument, Corpus or DTV or DTM.
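
For example, with the lsa_model built in the LSAModel example above:

julia> v = embed_document(lsa_model, StringDocument("an apple a day"))  # vector in the latent space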

embed_document(rpm, doc)

Return the vector representation of doc, obtained using the random projection model rpm. doc can be an AbstractDocument, Corpus or DTV or DTM.

frequent_terms(doc, alpha)

Returns a vector with frequent terms in the document doc. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).

frequent_terms(crps::Corpus, alpha)

Returns a vector with frequent terms among all documents. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).
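
For example (threshold chosen for illustration):

julia> frequent_terms(crps, 0.4)  # terms whose frequency exceeds the 0.4 threshold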

get_vector(lm, word)

Returns the vector representation of word from the LSA model lm.

get_vector(rpm, word)

Returns the random projection vector corresponding to word in the random projection model rpm.

hash_dtm(crps::Corpus [,h::TextHashFunction], eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a hashed DTM with elements of type T for corpus crps using the hashing function h. If h is missing, the hash function of the Corpus is used.

hash_dtv(d, h::TextHashFunction, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])

Creates a hashed document-term-vector with elements of type T for document d using the hashing function h. d can be an AbstractString or an AbstractDocument.

in_vocabulary(lm, word)

Return true if word is part of the vocabulary of the LSA model lm and false otherwise.

in_vocabulary(rpm, word)

Return true if word is part of the vocabulary of the random projection model rpm and false otherwise.

index(lm, word)

Return the index of word from the LSA model lm.

index(rpm, word)

Return the index of word from the random projection model rpm.

ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64)

Perform Latent Dirichlet allocation.

Arguments

  • α Dirichlet dist. hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
  • β Dirichlet dist. hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.

Return values

  • ϕ: ntopics × nwords Sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
  • θ: ntopics × ndocs Dense matrix of probabilities s.t. sum(θ, 1) == 1
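
A minimal run (corpus, topic count and hyperparameters are illustrative):

julia> using StringAnalysis
       crps = Corpus(AbstractDocument[StringDocument("a b c"), StringDocument("a c c d")])
       update_lexicon!(crps)
       m = DocumentTermMatrix(crps)  # default element type
       ϕ, θ = lda(m, 2, 1000, 0.1, 0.1)  # 2 topics, 1000 iterations, α=0.1, β=0.1
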
load_lsa_model(filename, eltype; [sparse=false])

Loads an LSA model from filename into an LSA model object. The embeddings matrix element type is specified by eltype (default DEFAULT_FLOAT_TYPE) while the keyword argument sparse specifies whether the matrix should be sparse or not.

load_rp_model(filename, eltype; [sparse=true])

Loads a random projection model from filename into a random projection model object. The projection matrix element type is specified by eltype (default DEFAULT_FLOAT_TYPE) while the keyword argument sparse specifies whether the matrix should be sparse or not.

lsa(X [;k=<num documents>, stats=:tfidf, κ=2, β=0.75, tol=1e-15])

Constructs a LSA model. The input X can be a Corpus or a DocumentTermMatrix. Use ?LSAModel for more details. Vector components smaller than tol will be zeroed out.
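
For example, given the crps corpus from the LSAModel example (parameters illustrative):

julia> lsa_model = lsa(crps, k=2, stats=:tfidf)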

ngrams(d, n=DEFAULT_NGRAM_COMPLEXITY [; tokenizer=DEFAULT_TOKENIZER])

Access the document text of d as n-gram counts. The ngrams contain at most n tokens which are obtained using tokenizer.

ngrams!(d, new_ngrams)

Replace the original n-grams of document d with new_ngrams.

rp(X [;k=m, density=1/sqrt(k), stats=:tfidf, ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, κ=2, β=0.75])

Constructs a random projection model. The input X can be a Corpus or a DocumentTermMatrix with m words in the lexicon. The model does not store the corpus or DTM document embeddings, just the projection matrix. Use ?RPModel for more details.
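
For example (dimensionality chosen for illustration):

julia> rpm = rp(crps, k=2)  # project documents into a 2-dimensional random sub-space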

save(lm, filename)

Saves an LSA model lm to disk in the file filename.

save_rp_model(rpm, filename)

Saves a random projection model rpm to disk in the file filename.
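
A save/load round trip might look like this (file name illustrative):

julia> save_rp_model(rpm, "rp_model.txt")
       rpm2 = load_rp_model("rp_model.txt", Float64, sparse=true)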

sentence_tokenize([lang,] s)

Splits string s into sentences using the WordTokenizers.split_sentences function to perform the tokenization. If a language lang is provided, it is ignored.
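
For example:

julia> sentence_tokenize("This is a sentence. This is another one.")  # returns a vector with the two sentences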

similarity(model, doc1, doc2)

Return the cosine similarity value between two documents doc1 and doc2 whose vector representations have been obtained using the model, which can be either a LSAModel or RPModel.
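
For instance, with the documents and lsa_model from the LSAModel example:

julia> similarity(lsa_model, doc1, doc2)  # a value close to 1 indicates highly similar documents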

sparse_terms(doc, alpha)

Returns a vector with rare terms in the document doc. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).

sparse_terms(crps::Corpus, alpha)

Returns a vector with rare terms among all documents. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).

text!(d, new_text)

Replace the original text of document d with new_text.

text(d)

Access the text of document d if possible.


" tokenize(doc [;method, splitter])

Tokenizes the document doc based on the mehtod (default :default, i.e. a WordTokenizers.jl tokenizer) and the splitter, which is a Regex used if method=:stringanalysis.

tokens!(d, new_tokens)

Replace the original tokens of document d with new_tokens.

tokens(d [; method=DEFAULT_TOKENIZER])

Access the tokens of document d as a token array. The method keyword argument specifies the type of tokenization to perform. Available options are :default and :stringanalysis.
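
For example:

julia> tokens(StringDocument("this is a text"), method=:default)  # ["this", "is", "a", "text"]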

vocabulary(lm)

Return the vocabulary as a vector of words of the LSA model lm.

vocabulary(rpm)

Return the vocabulary as a vector of words of the random projection model rpm.

size(lm)

Return a tuple containing the input and output dimensionalities of the LSA model lm.

size(rpm)

Return a tuple containing the input data and projection sub-space dimensionalities of the random projection model rpm.

summary(doc)

Shows information about the document doc.

summary(crps)

Shows information about the corpus crps.

abstract_convert(document::AbstractDocument, parameter::Union{Nothing, Type{T}})

Tries converting document::AbstractDocument to one of the concrete types with which StringAnalysis works, i.e. StringDocument{T}, TokenDocument{T}, NGramDocument{T}. A user-defined convert method between typeof(document) and the concrete types should be defined.

columnindices(terms)

Identical to rowindices. Returns a dictionary that maps each term from the vector terms to an integer index.

coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool)

Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix of size n × n, where n = length(vocab), with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions.

Examples

julia> using StringAnalysis
       doc = StringDocument("This is a text about an apple. There are many texts about apples.")
       docv = tokenize(text(doc))
       vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
       StringAnalysis.coo_matrix(Float16, docv, vocab, 5, true)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
  [2, 1]  =  2.0
  [1, 2]  =  2.0
  [3, 2]  =  0.3999
  [2, 3]  =  0.3999
embed_word(lm, word)

Return the vector representation of word using the LSA model lm.

random_projection_matrix(k::Int, m::Int, eltype::Type{T<:AbstractFloat}, density::Float64)

Builds a k×m sparse random projection matrix with elements of type T and a non-zero element frequency of density. k and m are the output and input dimensionalities.

Matrix Probabilities

If we denote s = 1 / density, the components of the random matrix are drawn from:

  • -sqrt(s) / sqrt(k) with probability 1/(2s)
  • 0 with probability 1 - 1/s
  • +sqrt(s) / sqrt(k) with probability 1/(2s)

No projection hack

If k<=0 no projection is performed and the function returns an identity matrix sized m×m with elements of type T. This is useful if one does not want to embed documents but rather calculate term frequencies, BM25 and other statistical indicators (similar to dtv).

remove_patterns!(d, rex)

Removes from the document or corpus d the text matching the pattern described by the regular expression rex.

remove_patterns(s, rex)

Removes from the string s the text matching the pattern described by the regular expression rex.

rowindices(terms)

Returns a dictionary that maps each term from the vector terms to an integer index.

tokenize_default([lang,] s)

Splits string s into tokens on whitespace using the WordTokenizers.tokenize function to perform the tokenization. If a language lang is provided, it is ignored.

tokenize_stringanalysis(doc [;splitter])

Function that quickly tokenizes doc based on the splitting pattern specified by splitter::Regex. Supported types for doc are: AbstractString, Vector{AbstractString}, StringDocument and NGramDocument.
