StringAnalysis.StringAnalysis
StringAnalysis.CooMatrix
StringAnalysis.CooMatrix
StringAnalysis.DocumentTermMatrix
StringAnalysis.DocumentTermMatrix
StringAnalysis.LSAModel
StringAnalysis.RPModel
StringAnalysis.TextHashFunction
Base.size
Base.size
Base.summary
Base.summary
StringAnalysis.abstract_convert
StringAnalysis.columnindices
StringAnalysis.coo_matrix
StringAnalysis.coom
StringAnalysis.coom
StringAnalysis.cosine
StringAnalysis.dtm
StringAnalysis.dtm
StringAnalysis.dtv
StringAnalysis.dtv
StringAnalysis.dtv_regex
StringAnalysis.each_dtv
StringAnalysis.each_hash_dtv
StringAnalysis.embed_document
StringAnalysis.embed_document
StringAnalysis.embed_word
StringAnalysis.frequent_terms
StringAnalysis.frequent_terms
StringAnalysis.get_vector
StringAnalysis.get_vector
StringAnalysis.hash_dtm
StringAnalysis.hash_dtv
StringAnalysis.in_vocabulary
StringAnalysis.in_vocabulary
StringAnalysis.index
StringAnalysis.index
StringAnalysis.lda
StringAnalysis.load_lsa_model
StringAnalysis.load_rp_model
StringAnalysis.lsa
StringAnalysis.ngrams
StringAnalysis.ngrams!
StringAnalysis.random_projection_matrix
StringAnalysis.remove_patterns
StringAnalysis.remove_patterns!
StringAnalysis.rowindices
StringAnalysis.rp
StringAnalysis.save_lsa_model
StringAnalysis.save_rp_model
StringAnalysis.sentence_tokenize
StringAnalysis.similarity
StringAnalysis.sparse_terms
StringAnalysis.sparse_terms
StringAnalysis.text
StringAnalysis.text!
StringAnalysis.tokenize
StringAnalysis.tokenize_default
StringAnalysis.tokenize_stringanalysis
StringAnalysis.tokens
StringAnalysis.tokens!
StringAnalysis.vocabulary
StringAnalysis.vocabulary
StringAnalysis.StringAnalysis
— Module.
A Julia library for working with text, hard-forked from TextAnalysis.jl.
StringAnalysis.CooMatrix
— Type.
Basic Co-occurrence Matrix (COOM) type.
Fields
coom::SparseMatrixCSC{T,Int}: the actual COOM; elements represent co-occurrences of two terms within a given window
terms::Vector{String}: a list of terms that represent the lexicon of the document or corpus
column_indices::OrderedDict{String, Int}: a map between the terms and the columns of the co-occurrence matrix
StringAnalysis.CooMatrix
— Method.CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])
Auxiliary constructors of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructors require a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict whose keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.
StringAnalysis.DocumentTermMatrix
— Type.
Basic Document-Term-Matrix (DTM) type.
Fields
dtm::SparseMatrixCSC{T,Int}: the actual DTM; rows represent terms and columns represent documents
terms::Vector{String}: a list of terms that represent the lexicon of the corpus associated with the DTM
row_indices::OrderedDict{String, Int}: a map between the terms and the rows of the dtm
StringAnalysis.DocumentTermMatrix
— Method.DocumentTermMatrix{T}(docs [,terms] [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])
Auxiliary constructors of the DocumentTermMatrix type. The type T has to be a subtype of Real. The constructors require a corpus or vector of strings docs and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict whose keys are the lexicon, or can be missing, in which case the lexicon field of the corpus is used.
StringAnalysis.LSAModel
— Type.LSAModel{S<:AbstractString, T<:AbstractFloat, A<:AbstractMatrix{T}, H<:Integer}
LSA (latent semantic analysis) model. It constructs from a document term matrix (DTM) a model that can be used to embed documents in a latent semantic space pertaining to the data. The model requires that the document term matrix be a DocumentTermMatrix{T<:AbstractFloat} because the elements of the matrices resulting from the SVD operation are floating point numbers and these have to match or be convertible to type T.
Fields
vocab::Vector{S}: a vector with all the words in the corpus
vocab_hash::OrderedDict{S,H}: a mapping from each word to its index in the word embeddings matrix
Σinv::A: diagonal of the inverse singular value matrix
Uᵀ::A: transpose of the word embedding matrix
stats::Symbol: the statistical measure to use for word importances in documents. Available values are :count (term count), :tf (term frequency), :tfidf (default, term frequency-inverse document frequency) and :bm25 (Okapi BM25)
idf::Vector{T}: inverse document frequencies for the words in the vocabulary
nwords::T: average number of words in a document
ngram_complexity::Int: ngram complexity
κ::Int: the κ parameter of the BM25 statistic
β::Float64: the β parameter of the BM25 statistic
tol::T: minimum size of the vector components (default T(1e-15))
SVD matrices U, Σinv and V:
If X is an m×n document-term-matrix with n documents and m words so that X[i,j] represents a statistical indicator of the importance of term i in document j then:
U, Σ, V = svd(X)
Σinv = diag(inv(Σ))
Uᵀ = U'
X ≈ U * Σ * V'
The matrix V of document embeddings is not actually stored in the model.
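The relations above can be checked with Julia's LinearAlgebra standard library alone. The following is a minimal sketch on a small made-up matrix. The names X, k and embed are illustrative assumptions and not part of the StringAnalysis API, and the projection Σinv .* (Uᵀ * x) is the usual LSA fold-in formula, assumed here rather than taken from the package source.

```julia
using LinearAlgebra

# Toy 4-term × 3-document matrix (made-up values: m = 4 words, n = 3 documents)
X = Float32[1 0 1;
            0 1 1;
            2 0 0;
            0 1 0]

F = svd(X)                 # thin SVD: X ≈ F.U * Diagonal(F.S) * F.Vt
k = 2                      # number of latent dimensions to keep (assumption)

Uᵀ   = F.U[:, 1:k]'        # k × m, transpose of the word embedding matrix
Σinv = 1 ./ F.S[1:k]       # diagonal of the inverse singular value matrix

# Fold a document-term vector x (length m) into the k-dimensional latent space
embed(x) = Σinv .* (Uᵀ * x)

# Embedding the columns of X recovers the first k rows of V'
E = hcat((embed(X[:, j]) for j in 1:size(X, 2))...)
```

Because the columns of U are orthonormal, embedding the original documents reproduces the truncated V', which is why V does not need to be stored.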
Examples
julia> using StringAnalysis
doc1 = StringDocument("This is a text about an apple. There are many texts about apples.")
doc2 = StringDocument("Pears and apples are good but not exotic. An apple a day keeps the doctor away.")
doc3 = StringDocument("Fruits are good for you.")
doc4 = StringDocument("This phrase has nothing to do with the others...")
doc5 = StringDocument("Simple text, little info inside")
crps = Corpus(AbstractDocument[doc1, doc2, doc3, doc4, doc5])
prepare!(crps, strip_punctuation)
update_lexicon!(crps)
dtm = DocumentTermMatrix{Float32}(crps, collect(keys(crps.lexicon)))
### Build LSA Model ###
lsa_model = LSAModel(dtm, k=3, stats=:tf)
query = StringDocument("Apples and an exotic fruit.")
idxs, corrs = cosine(lsa_model, crps, query)
println("Query: \"$(query.text)\"")
for (idx, corr) in zip(idxs, corrs)
println("$corr -> \"$(crps[idx].text)\"")
end
Query: "Apples and an exotic fruit."
0.9746108 -> "Pears and apples are good but not exotic An apple a day keeps the doctor away "
0.870703 -> "This is a text about an apple There are many texts about apples "
0.7122063 -> "Fruits are good for you "
0.22725986 -> "This phrase has nothing to do with the others "
0.076901935 -> "Simple text little info inside "
StringAnalysis.RPModel
— Type.RPModel{S<:AbstractString, T<:AbstractFloat, A<:AbstractMatrix{T}, H<:Integer}
Random projection model. It constructs from a document term matrix (DTM) a model that can be used to embed documents in a random sub-space. The model requires that the document term matrix be a DocumentTermMatrix{T<:AbstractFloat} because the elements of the matrices resulting from the projection operation are floating point numbers and these have to match or be convertible to type T. The approach is based on the effects of the Johnson-Lindenstrauss lemma.
Fields
vocab::Vector{S}: a vector with all the words in the corpus
vocab_hash::OrderedDict{S,H}: a mapping from each word to its index in the random projection matrix
R::A: the random projection matrix
stats::Symbol: the statistical measure to use for word importances in documents. Available values are :count (term count), :tf (term frequency), :tfidf (default, term frequency-inverse document frequency) and :bm25 (Okapi BM25)
idf::Vector{T}: inverse document frequencies for the words in the vocabulary
nwords::T: average number of words in a document
ngram_complexity::Int: ngram complexity
κ::Int: the κ parameter of the BM25 statistic
β::Float64: the β parameter of the BM25 statistic
project::Bool: specifies whether the model actually performs the projection or not; it is false if the number of dimensions provided is zero or negative
StringAnalysis.TextHashFunction
— Type.TextHashFunction(hash_function::Function, cardinality::Int)
The basic structure for performing text hashing: uses the hash_function to generate feature vectors of length cardinality.
Details
The hash trick is the use of a hash function instead of a lexicon to determine the columns of a DocumentTermMatrix-like encoding of the data. To produce a DTM for a Corpus for which we do not have an existing lexicon, we need some way to map the terms from each document into column indices. We use the now standard "Hash Trick" in which we hash strings and then reduce the resulting integers modulo N, which defines the number of columns we want our DTM to have. This amounts to doing a non-linear dimensionality reduction with low probability that similar terms hash to the same dimension.
To make things easier, we wrap Julia's hash functions in a new type, TextHashFunction, which maintains information about the desired cardinality of the hashes.
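The idea can be sketched in a few lines of plain Julia. The helper hashed_counts below is hypothetical and only illustrates the technique; it is not the package's hash_dtv, and the column a given token lands in depends on Julia's hash, so the positions of the non-zero entries may differ from any output shown in this document.

```julia
# Hypothetical helper, not part of StringAnalysis: counts tokens in a
# fixed-length vector by hashing each token and reducing it modulo `cardinality`.
function hashed_counts(tokens::Vector{<:AbstractString}, cardinality::Int)
    v = zeros(Float32, cardinality)
    for t in tokens
        idx = Int(hash(t) % UInt(cardinality)) + 1   # 1-based column index
        v[idx] += 1                                  # colliding tokens share a column
    end
    return v
end

v = hashed_counts(split("this is a text"), 13)
```

The vector always has cardinality entries and its sum equals the number of tokens, whether or not two tokens collide in the same column.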
Examples
julia> doc = StringDocument("this is a text")
thf = TextHashFunction(hash, 13)
hash_dtv(doc, thf, Float16)
13-element Array{Float16,1}:
1.0
1.0
0.0
0.0
0.0
0.0
0.0
2.0
0.0
0.0
0.0
0.0
0.0
StringAnalysis.coom
— Method.coom(c::CooMatrix)
Access the co-occurrence matrix field coom of a CooMatrix c.
StringAnalysis.coom
— Method.coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])
Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.
StringAnalysis.cosine
— Function.cosine(model, docs, doc, n=10)
Return the positions of the n closest neighboring documents to doc found in docs. docs can be a corpus or document term matrix. The vector representations of docs and doc are obtained with the model, which can be either an LSAModel or an RPModel.
StringAnalysis.dtm
— Method.dtm(d::DocumentTermMatrix)
Access the matrix of a DocumentTermMatrix d.
StringAnalysis.dtm
— Method.dtm(docs::Corpus, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])
Access the matrix of the DTM associated with the corpus docs. The DocumentTermMatrix{T} will first have to be created in order for the actual matrix to be accessed.
StringAnalysis.dtv
— Method.dtv(d, lex::OrderedDict{String,Int}, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])
Creates a document-term-vector with elements of type T for document d using the lexicon lex. d can be an AbstractString or an AbstractDocument.
StringAnalysis.dtv
— Method.dtv(crps::Corpus, idx::Int, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])
Creates a document-term-vector with elements of type T for document idx of the corpus crps.
StringAnalysis.dtv_regex
— Method.dtv_regex(d, lex::OrderedDict{String,Int}, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])
Creates a document-term-vector with elements of type T for document d using the lexicon lex. The tokens of document d are assumed to be regular expressions in text format. d can be an AbstractString or an AbstractDocument.
Examples
julia> dtv_regex(NGramDocument("a..b"), OrderedDict("aaa"=>1, "aaab"=>2, "accb"=>3, "bbb"=>4), Float32)
4-element Array{Float32,1}:
0.0
1.0
1.0
0.0
StringAnalysis.each_dtv
— Method.each_dtv(crps::Corpus [; eltype::Type{U}=DEFAULT_DTM_TYPE, ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])
Iterates through the columns of the DTM of the corpus crps without constructing it. Useful when the DTM would not fit in memory. eltype specifies the element type of the generated vectors.
StringAnalysis.each_hash_dtv
— Method.each_hash_dtv(crps::Corpus [; eltype::Type{U}=DEFAULT_DTM_TYPE, ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])
Iterates through the columns of the hashed DTM of the corpus crps without constructing it. Useful when the DTM would not fit in memory. eltype specifies the element type of the generated vectors.
StringAnalysis.embed_document
— Method.embed_document(lm, doc)
Return the vector representation of doc, obtained using the LSA model lm. doc can be an AbstractDocument, Corpus, DTV or DTM.
StringAnalysis.embed_document
— Method.embed_document(rpm, doc)
Return the vector representation of doc, obtained using the random projection model rpm. doc can be an AbstractDocument, Corpus, DTV or DTM.
StringAnalysis.frequent_terms
— Function.frequent_terms(doc, alpha)
Returns a vector with frequent terms in the document doc. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).
StringAnalysis.frequent_terms
— Function.frequent_terms(crps::Corpus, alpha)
Returns a vector with frequent terms among all documents. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).
StringAnalysis.get_vector
— Method.get_vector(lm, word)
Returns the vector representation of word from the LSA model lm.
StringAnalysis.get_vector
— Method.get_vector(rpm, word)
Returns the random projection vector corresponding to word in the random projection model rpm.
StringAnalysis.hash_dtm
— Method.hash_dtm(crps::Corpus [,h::TextHashFunction], eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])
Creates a hashed DTM with elements of type T for corpus crps using the hashing function h. If h is missing, the hash function of the Corpus is used.
StringAnalysis.hash_dtv
— Method.hash_dtv(d, h::TextHashFunction, eltype::Type{T}=DEFAULT_DTM_TYPE [; ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, tokenizer=DEFAULT_TOKENIZER])
Creates a hashed document-term-vector with elements of type T for document d using the hashing function h. d can be an AbstractString or an AbstractDocument.
StringAnalysis.in_vocabulary
— Method.in_vocabulary(lm, word)
Return true if word is part of the vocabulary of the LSA model lm and false otherwise.
StringAnalysis.in_vocabulary
— Method.in_vocabulary(rpm, word)
Return true if word is part of the vocabulary of the random projection model rpm and false otherwise.
StringAnalysis.index
— Method.index(lm, word)
Return the index of word from the LSA model lm.
StringAnalysis.index
— Method.index(rpm, word)
Return the index of word from the random projection model rpm.
StringAnalysis.lda
— Method.ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64)
Perform Latent Dirichlet allocation.
Arguments
α: Dirichlet dist. hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
β: Dirichlet dist. hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.
Return values
ϕ: ntopics × nwords Sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
θ: ntopics × ndocs Dense matrix of probabilities s.t. sum(θ, 1) == 1
StringAnalysis.load_lsa_model
— Method.load_lsa_model(filename, eltype; [sparse=false])
Loads an LSA model from filename into an LSA model object. The embeddings matrix element type is specified by eltype (default DEFAULT_FLOAT_TYPE) while the keyword argument sparse specifies whether the matrix should be sparse or not.
StringAnalysis.load_rp_model
— Method.load_rp_model(filename, eltype; [sparse=true])
Loads a random projection model from filename into a random projection model object. The projection matrix element type is specified by eltype (default DEFAULT_FLOAT_TYPE) while the keyword argument sparse specifies whether the matrix should be sparse or not.
StringAnalysis.lsa
— Method.lsa(X [;k=<num documents>, stats=:tfidf, κ=2, β=0.75, tol=1e-15])
Constructs an LSA model. The input X can be a Corpus or a DocumentTermMatrix. Use ?LSAModel for more details. Vector components smaller than tol will be zeroed out.
StringAnalysis.ngrams
— Function.ngrams(d, n=DEFAULT_GRAM_COMPLEXITY [; tokenizer=DEFAULT_TOKENIZER])
Access the document text of d as n-gram counts. The n-grams contain at most n tokens, which are obtained using tokenizer.
StringAnalysis.ngrams!
— Method.ngrams!(d, new_ngrams)
Replace the original n-grams of document d with new_ngrams.
StringAnalysis.rp
— Method.rp(X [;k=m, density=1/sqrt(k), stats=:tfidf, ngram_complexity=DEFAULT_NGRAM_COMPLEXITY, κ=2, β=0.75])
Constructs a random projection model. The input X can be a Corpus or a DocumentTermMatrix with m words in the lexicon. The model does not store the corpus or DTM document embeddings, just the projection matrix. Use ?RPModel for more details.
StringAnalysis.save_lsa_model
— Method.save_lsa_model(lm, filename)
Saves an LSA model lm to disk in file filename.
StringAnalysis.save_rp_model
— Method.save_rp_model(rpm, filename)
Saves a random projection model rpm to disk in file filename.
StringAnalysis.sentence_tokenize
— Method.sentence_tokenize([lang,] s)
Splits string s into sentences using the WordTokenizers.split_sentences function to perform the tokenization. If a language lang is provided, it is ignored.
StringAnalysis.similarity
— Method.similarity(model, doc1, doc2)
Return the cosine similarity value between two documents doc1 and doc2 whose vector representations have been obtained using the model, which can be either an LSAModel or an RPModel.
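For reference, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms. The snippet below is a minimal sketch on plain vectors using only the LinearAlgebra standard library; cosine_similarity is a hypothetical name chosen for illustration, and the vectors are made-up values standing in for embeddings a model would produce.

```julia
using LinearAlgebra

# Cosine similarity between two embedding vectors (hypothetical helper)
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

a = Float32[1, 2, 0]
b = Float32[2, 4, 0]   # same direction as a
c = Float32[0, 0, 3]   # orthogonal to a

s_parallel   = cosine_similarity(a, b)   # close to 1 for parallel vectors
s_orthogonal = cosine_similarity(a, c)   # 0 for orthogonal vectors
```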
StringAnalysis.sparse_terms
— Function.sparse_terms(doc, alpha)
Returns a vector with rare terms in the document doc. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).
StringAnalysis.sparse_terms
— Function.sparse_terms(crps::Corpus, alpha)
Returns a vector with rare terms among all documents. The parameter alpha indicates the sparsity threshold (a frequency <= alpha means sparse).
StringAnalysis.text!
— Method.text!(d, new_text)
Replace the original text of document d with new_text.
StringAnalysis.text
— Method.text(d)
Access the text of document d if possible.
StringAnalysis.tokenize
— Method.tokenize(doc [;method, splitter])
Tokenizes the document doc based on the method (default :default, i.e. a WordTokenizers.jl tokenizer) and the splitter, which is a Regex used if method=:stringanalysis.
StringAnalysis.tokens!
— Method.tokens!(d, new_tokens)
Replace the original tokens of document d with new_tokens.
StringAnalysis.tokens
— Method.tokens(d [; method=DEFAULT_TOKENIZER])
Access the tokens of document d as a token array. The method keyword argument specifies the type of tokenization to perform. Available options are :default and :stringanalysis.
StringAnalysis.vocabulary
— Method.vocabulary(lm)
Return the vocabulary of the LSA model lm as a vector of words.
StringAnalysis.vocabulary
— Method.vocabulary(rpm)
Return the vocabulary of the random projection model rpm as a vector of words.
Base.size
— Method.size(lm)
Return a tuple containing the input and output dimensionalities of the LSA model lm.
Base.size
— Method.size(rpm)
Return a tuple containing the input data and projection sub-space dimensionalities of the random projection model rpm.
Base.summary
— Method.summary(doc)
Shows information about the document doc.
Base.summary
— Method.summary(crps)
Shows information about the corpus crps.
StringAnalysis.abstract_convert
— Method.abstract_convert(document::AbstractDocument, parameter::Union{Nothing, Type{T}})
Tries converting document::AbstractDocument to one of the concrete types with which StringAnalysis works, i.e. StringDocument{T}, TokenDocument{T}, NGramDocument{T}. A user-defined convert method between the typeof(document) and the concrete types should be defined.
StringAnalysis.columnindices
— Function.columnindices(terms)
Identical to rowindices. Returns a dictionary that maps each term from the vector terms to an integer index.
StringAnalysis.coo_matrix
— Method.coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool)
Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix sized n × n where n = length(vocab) with elements of type T. The document doc is represented by a vector of its terms (in order). The arguments window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions.
Examples
julia> using StringAnalysis
doc = StringDocument("This is a text about an apple. There are many texts about apples.")
docv = tokenize(text(doc))
vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
StringAnalysis.coo_matrix(Float16, docv, vocab, 5, true)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
[2, 1] = 2.0
[1, 2] = 2.0
[3, 2] = 0.3999
[2, 3] = 0.3999
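To make the windowing and distance normalization concrete, here is a simplified re-implementation in plain Julia. It is a hypothetical sketch, not the package's code: the function name cooccurrences, the whitespace tokenization via split, the plain Dict vocabulary and the choice to add each weight symmetrically are all assumptions, picked so that the counting rule is easy to follow on the sentence used above.

```julia
using SparseArrays

# Hypothetical sketch of windowed co-occurrence counting (not the library code).
# Each pair of in-vocabulary terms at most `window` positions apart contributes
# a weight of 1, or 1/distance when `normalize` is true, added symmetrically.
function cooccurrences(::Type{T}, doc, vocab, window::Int, normalize::Bool) where {T<:AbstractFloat}
    n = length(vocab)
    M = spzeros(T, n, n)
    for i in eachindex(doc), j in max(1, i - window):min(length(doc), i + window)
        i == j && continue
        (haskey(vocab, doc[i]) && haskey(vocab, doc[j])) || continue
        w = normalize ? one(T) / abs(i - j) : one(T)
        r, c = vocab[doc[i]], vocab[doc[j]]
        M[r, c] += w
        M[c, r] += w
    end
    return M
end

docv  = split("This is a text about an apple. There are many texts about apples.")
vocab = Dict("This" => 1, "is" => 2, "apple." => 3)
M = cooccurrences(Float16, docv, vocab, 5, true)
```

In this sketch "This" and "is" are adjacent (distance 1), "is" and "apple." are five positions apart (weight 1/5 per visit), and "This" and "apple." are six positions apart, which is outside the window, so that pair stays at zero.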
StringAnalysis.embed_word
— Method.embed_word(lm, word)
Return the vector representation of word using the LSA model lm.
StringAnalysis.random_projection_matrix
— Method.random_projection_matrix(k::Int, m::Int, eltype::Type{T<:AbstractFloat}, density::Float64)
Builds a k×m sparse random projection matrix with elements of type T and a non-zero element frequency of density. k and m are the output and input dimensionalities.
Matrix Probabilities
If we note s = 1 / density, the components of the random matrix are drawn from:
-sqrt(s) / sqrt(k) with probability 1/(2s)
0 with probability 1 - 1/s
+sqrt(s) / sqrt(k) with probability 1/(2s)
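The distribution above can be sampled directly. The function below is a hypothetical sketch that uses only the Random and SparseArrays standard libraries; the name sample_projection and the explicit rng argument are assumptions for illustration, and the package's own random_projection_matrix may be implemented differently.

```julia
using Random, SparseArrays

# Hypothetical sketch, not the package's implementation: samples a k × m sparse
# matrix whose entries are -a or +a with probability 1/(2s) each and 0 with
# probability 1 - 1/s, where s = 1/density and a = sqrt(s)/sqrt(k).
function sample_projection(rng::AbstractRNG, k::Int, m::Int, ::Type{T},
                           density::Float64) where {T<:AbstractFloat}
    s = 1 / density
    a = T(sqrt(s) / sqrt(k))
    R = spzeros(T, k, m)
    for j in 1:m, i in 1:k
        u = rand(rng)              # uniform draw in [0, 1)
        if u < 1 / (2s)
            R[i, j] = -a
        elseif u < 1 / s
            R[i, j] = a
        end                        # otherwise the entry stays zero
    end
    return R
end

R = sample_projection(MersenneTwister(1), 50, 200, Float32, 0.1)
```

With density = 0.1, roughly one entry in ten is non-zero, and every non-zero entry has magnitude sqrt(10)/sqrt(50).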
No projection hack
If k<=0 no projection is performed and the function returns an identity matrix sized m×m with elements of type T. This is useful if one does not want to embed documents but rather calculate term frequencies, BM25 and other statistical indicators (similar to dtv).
StringAnalysis.remove_patterns!
— Method.remove_patterns!(d, rex)
Removes from the document or corpus d the text matching the pattern described by the regular expression rex.
StringAnalysis.remove_patterns
— Method.remove_patterns(s, rex)
Removes from the string s the text matching the pattern described by the regular expression rex.
StringAnalysis.rowindices
— Method.rowindices(terms)
Returns a dictionary that maps each term from the vector terms to an integer index.
StringAnalysis.tokenize_default
— Method.tokenize_default([lang,] s)
Splits string s into tokens on whitespace using the WordTokenizers.tokenize function to perform the tokenization. If a language lang is provided, it is ignored.
StringAnalysis.tokenize_stringanalysis
— Method.tokenize_stringanalysis(doc [;splitter])
Function that quickly tokenizes doc based on the splitting pattern specified by splitter::Regex. Supported types for doc are: AbstractString, Vector{AbstractString}, StringDocument and NGramDocument.