Usage examples
Documents
Documents are simple wrappers around basic structures that contain text. The underlying data representation can be simple strings, dictionaries or vectors of strings. All document types are subtypes of the parametric type AbstractDocument{T} where T<:AbstractString.
julia> using StringAnalysis
julia> sd = StringDocument("this is a string document")
A StringDocument{String}
julia> nd = NGramDocument("this is a ngram document")
A NGramDocument{String}
julia> td = TokenDocument("this is a token document")
A TokenDocument{String}
julia> # fd = FileDocument("/some/file") # works the same way ...
Documents and types
The string type can be explicitly enforced:
julia> nd = NGramDocument{String}("this is a ngram document")
A NGramDocument{String}
julia> ngrams(nd)
Dict{String,Int64} with 5 entries:
"document" => 1
"this" => 1
"is" => 1
"ngram" => 1
"a" => 1
julia> td = TokenDocument{String}("this is a token document")
A TokenDocument{String}
julia> tokens(td)
5-element Array{String,1}:
"this"
"is"
"a"
"token"
"document"
Conversion methods are available to switch between document types (the type parameter has to be specified as well).
julia> convert(TokenDocument{SubString}, StringDocument("some text"))
A TokenDocument{SubString{String}}
julia> convert(NGramDocument{String}, StringDocument("some more text"))
A NGramDocument{String}
Metadata
Alongside the text data, documents also contain metadata.
julia> doc = StringDocument("this is another document")
A StringDocument{String}
julia> metadata(doc)
<no ID> <no name> <unknown author> ? (?)
julia> fieldnames(typeof(metadata(doc)))
(:language, :name, :author, :timestamp, :id, :publisher, :edition_year, :published_year, :documenttype, :note)
Metadata fields can be modified through methods bearing the same name as the metadata field, suffixed with ! (e.g. author! for the author field). Note that these methods are not explicitly exported.
julia> StringAnalysis.id!(doc, "doc1");
julia> StringAnalysis.author!(doc, "Corneliu C.");
julia> StringAnalysis.name!(doc, "A simple document");
julia> StringAnalysis.edition_year!(doc, "2019");
julia> StringAnalysis.published_year!(doc, "2019");
julia> metadata(doc)
doc1 "A simple document" by Corneliu C. 2019 (2019)
Corpus
A corpus is an object that holds a bunch of documents together.
julia> docs = [sd, nd, td]
3-element Array{AbstractDocument{String,DocumentMetadata},1}:
A StringDocument{String}
A NGramDocument{String}
A TokenDocument{String}
julia> crps = Corpus(docs)
A Corpus with 3 documents
julia> crps.documents
3-element Array{AbstractDocument{String,DocumentMetadata},1}:
A StringDocument{String}
A NGramDocument{String}
A TokenDocument{String}
The corpus can be 'standardized' to hold the same type of document,
julia> standardize!(crps, NGramDocument{String})
julia> crps.documents
3-element Array{AbstractDocument{String,DocumentMetadata},1}:
A NGramDocument{String}
A NGramDocument{String}
A NGramDocument{String}
however, the corpus has to be created from an AbstractDocument document vector for the standardization to work (AbstractDocument{T} vectors are converted to a Union of all document types parametrized by T during Corpus construction):
julia> doc1 = StringDocument("one");
julia> doc2 = StringDocument("two");
julia> doc3 = TokenDocument("three");
julia> standardize!(Corpus([doc1, doc3]), NGramDocument{String}) # works
julia> standardize!(Corpus([doc1, doc2]), NGramDocument{String}) # fails because we have a Vector{StringDocument{T}}
ERROR: MethodError: Cannot `convert` an object of type NGramDocument{String} to an object of type StringDocument{String}
Closest candidates are:
convert(::Type{StringDocument{T}}, !Matched::Union{FileDocument, StringDocument}) where T<:AbstractString at /home/travis/build/zgornel/StringAnalysis.jl/src/document.jl:245
convert(::Type{T}, !Matched::T) where T at essentials.jl:168
StringDocument{String}(::Any, !Matched::Any) where T<:AbstractString at /home/travis/build/zgornel/StringAnalysis.jl/src/document.jl:40
julia> standardize!(Corpus(AbstractDocument[doc1, doc2]), NGramDocument{String}) # works
The corpus can also be iterated through,
julia> for (i,doc) in enumerate(crps)
@show (i, doc)
end
(i, doc) = (1, A NGramDocument{String})
(i, doc) = (2, A NGramDocument{String})
(i, doc) = (3, A NGramDocument{String})
indexed into,
julia> doc = crps[1]
A NGramDocument{String}
julia> docs = crps[2:3]
2-element Array{AbstractDocument{String,DocumentMetadata},1}:
A NGramDocument{String}
A NGramDocument{String}
and used as a container.
julia> push!(crps, NGramDocument{String}("new document"))
4-element Array{AbstractDocument{String,DocumentMetadata},1}:
A NGramDocument{String}
A NGramDocument{String}
A NGramDocument{String}
A NGramDocument{String}
julia> doc4 = pop!(crps)
A NGramDocument{String}
julia> ngrams(doc4)
Dict{String,Int64} with 2 entries:
"document" => 1
"new" => 1
The lexicon and inverse index
The Corpus object offers the ability to create a lexicon and an inverse index for the documents present. These are not automatically created when the Corpus is created,
julia> crps.lexicon
OrderedCollections.OrderedDict{String,Int64} with 0 entries
julia> crps.inverse_index
OrderedCollections.OrderedDict{String,Array{Int64,1}} with 0 entries
but instead have to be explicitly built:
julia> update_lexicon!(crps)
julia> crps.lexicon
OrderedCollections.OrderedDict{String,Int64} with 7 entries:
"string" => 1
"document" => 3
"this" => 3
"is" => 3
"a" => 3
"ngram" => 1
"token" => 1
julia> update_inverse_index!(crps)
julia> crps.inverse_index
OrderedCollections.OrderedDict{String,Array{Int64,1}} with 7 entries:
"string" => [1]
"document" => [1, 2, 3]
"this" => [1, 2, 3]
"is" => [1, 2, 3]
"a" => [1, 2, 3]
"ngram" => [2]
"token" => [3]
It is possible to explicitly create the lexicon and inverse index:
julia> create_lexicon(Corpus([sd]))
OrderedCollections.OrderedDict{String,Int64} with 5 entries:
"string" => 1
"document" => 1
"this" => 1
"is" => 1
"a" => 1
julia> create_inverse_index(Corpus([sd]))
OrderedCollections.OrderedDict{String,Array{Int64,1}} with 5 entries:
"string" => [1]
"document" => [1]
"this" => [1]
"is" => [1]
"a" => [1]
The ngram complexity can be specified as a second parameter:
julia> create_lexicon(Corpus([sd]), 2)
OrderedCollections.OrderedDict{String,Int64} with 9 entries:
"this is" => 1
"string" => 1
"document" => 1
"this" => 1
"is" => 1
"string document" => 1
"is a" => 1
"a" => 1
"a string" => 1
The create_lexicon and create_inverse_index functions are available from v0.3.9. Both functions support specifying the ngram complexity.
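For instance, the inverse index can also be built with bigrams included; a minimal sketch (output suppressed, the keys mirror those of the create_lexicon call above, each mapped to [1]):
julia> create_inverse_index(Corpus([sd]), 2);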
Preprocessing
The text preprocessing mainly consists of the prepare and prepare! functions and of preprocessing flags, most of which start with strip_ (the exception being stem_words). The preprocessing function prepare works on AbstractDocument, Corpus and AbstractString types, returning new objects; prepare! works only on AbstractDocuments and Corpus, as strings are immutable.
julia> str="This is a text containing words, some more words, a bit of punctuation and 1 number...";
julia> sd = StringDocument(str);
julia> flags = strip_punctuation|strip_articles|strip_whitespace
0x00300600
julia> prepare(str, flags)
"This is text containing words some more words bit of punctuation and 1 number "
julia> prepare!(sd, flags);
julia> text(sd)
"This is text containing words some more words bit of punctuation and 1 number "
More extensive preprocessing examples can be viewed in test/preprocessing.jl.
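Since the preprocessing flags are integer bit masks, as the hexadecimal value above suggests, they can be combined and inspected with the usual bitwise operators; a small sketch, assuming flags is still in scope:
julia> (flags & strip_articles) != 0 # the articles flag is set
true
julia> (flags | strip_articles) == flags # OR-ing an already-set flag changes nothing
true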
One can strip parts of speech, i.e. prepositions and articles, in languages other than English (support is provided by Languages.jl):
julia> using Languages
julia> it = StringDocument("Quest'e un piccolo esempio di come si puo fare l'analisi");
julia> StringAnalysis.language!(it, Languages.Italian());
julia> prepare!(it, strip_articles|strip_prepositions|strip_whitespace);
julia> text(it)
"Quest'e piccolo esempio come si puo fare analisi"
In the case of AbstractStrings, the language has to be explicitly specified:
julia> prepare("Nous sommes tous d'accord avec les examples!", stem_words, language=Languages.French())
"Nous somm tous d accord avec le exampl"
Features
Document Term Matrix (DTM)
If a lexicon is present in the corpus, a document term matrix (DTM) can be created. The DTM acts as a basis for word-document statistics, allowing for the representation of documents as numerical vectors. The DTM is created from a Corpus by calling the constructor
julia> M = DocumentTermMatrix(crps)
A 7x3 DocumentTermMatrix{Int64}
julia> typeof(M)
DocumentTermMatrix{Int64}
julia> M = DocumentTermMatrix{Int8}(crps)
A 7x3 DocumentTermMatrix{Int8}
julia> typeof(M)
DocumentTermMatrix{Int8}
or the dtm function
julia> M = dtm(crps, Int8);
julia> Matrix(M)
7×3 Array{Int8,2}:
1 0 0
1 1 1
1 1 1
1 1 1
1 1 1
0 1 0
0 0 1
It is important to note that the type parameter of the DTM object can be specified (also in the dtm function) but is not required. This can be useful in some cases for reducing memory requirements. The default element type of the DTM is specified by the constant DEFAULT_DTM_TYPE present in src/defaults.jl.
From version v0.3.2, the columns of the document-term matrix represent document vectors. This convention holds across the package wherever multiple documents are represented. It is a breaking change from previous versions and from TextAnalysis.jl and may break code if not taken into account.
One can verify the DTM dimensions with:
julia> @assert size(dtm(crps)) == (length(lexicon(crps)), length(crps)) # O.K.
Document Term Vectors (DTVs)
The individual columns of the DTM can also be generated iteratively, whether a lexicon is present or not. If a lexicon is present, the each_dtv iterator allows the generation of the document vectors along with control of the vector element type:
julia> for dv in map(Vector, each_dtv(crps, eltype=Int8))
@show dv
end
dv = Int8[1, 1, 1, 1, 1, 0, 0]
dv = Int8[0, 1, 1, 1, 1, 1, 0]
dv = Int8[0, 1, 1, 1, 1, 0, 1]
Alternatively, the vectors can be generated using the hash trick. This is a form of dimensionality reduction, as the cardinality, i.e. the output dimension, is much smaller than the dimension of the original DTM vectors, which equals the length of the lexicon. The cardinality is a keyword argument of the Corpus constructor. The hashed vector output type can be specified when building the iterator (note that hashing may map distinct terms to the same element, as visible in the counts of 2 below):
julia> new_crps = Corpus(documents(crps), cardinality=7);
julia> hash_vectors = map(Vector, each_hash_dtv(new_crps, eltype=Int8));
julia> for hdv in hash_vectors
@show hdv
end
hdv = Int8[1, 1, 1, 0, 0, 2, 0]
hdv = Int8[0, 2, 1, 0, 0, 2, 0]
hdv = Int8[0, 1, 1, 1, 0, 2, 0]
One can construct a 'hashed' version of the DTM as well:
julia> hash_dtm(Corpus(documents(crps), cardinality=5), Int8)
5×3 SparseArrays.SparseMatrixCSC{Int8,Int64} with 9 stored entries:
[2, 1] = 1
[3, 1] = 2
[5, 1] = 2
[2, 2] = 1
[3, 2] = 2
[5, 2] = 2
[1, 3] = 1
[3, 3] = 2
[5, 3] = 2
The default Corpus cardinality is specified by the constant DEFAULT_CARDINALITY present in src/defaults.jl.
From version v0.3.4, all document vectors are instances of SparseVector. This consequently has an impact on the output and performance of methods that directly employ DTVs, such as the embed_document method. In certain cases, if speed is more important than memory consumption, it may be useful to first transform the vectors into a dense representation, i.e. dtv_dense = Vector(dtv_sparse).
TF, TF-IDF, BM25
From the DTM, three more document-word statistics can be constructed: the term frequency, the tf-idf (term frequency - inverse document frequency) and Okapi BM25, using the tf/tf!, tf_idf/tf_idf! and bm_25/bm_25! functions respectively. Their usage is very similar, yet there exist several approaches one can take to constructing the output.
The following examples use only the term frequency functions tf and tf!. When calling the functions that do not end in !, which do not require the specification of an output matrix, one does not control the output's element type. The default output type is defined by the constant DEFAULT_FLOAT_TYPE = eltype(1.0):
julia> M = DocumentTermMatrix(crps);
julia> tfm = tf(M);
julia> Matrix(tfm)
7×3 Array{Float64,2}:
0.447214 0.0 0.0
0.447214 0.447214 0.447214
0.447214 0.447214 0.447214
0.447214 0.447214 0.447214
0.447214 0.447214 0.447214
0.0 0.447214 0.0
0.0 0.0 0.447214
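The values above are consistent with a square-root normalized term frequency: each document in the corpus contains five distinct terms, each occurring once, and sqrt(1/5) ≈ 0.447214. This can be checked directly:
julia> sqrt(1/5)
0.4472135954999579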
Control of the output matrix element type - which has to be a subtype of AbstractFloat - can be done only by using the in-place modification functions. One approach is to directly modify the DTM, provided that its elements are floating point numbers:
julia> M = DocumentTermMatrix{Float16}(crps)
A 7x3 DocumentTermMatrix{Float16}
julia> Matrix(M.dtm)
7×3 Array{Float16,2}:
1.0 0.0 0.0
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
0.0 1.0 0.0
0.0 0.0 1.0
julia> tf!(M.dtm); # inplace modification
julia> Matrix(M.dtm)
7×3 Array{Float16,2}:
0.4473 0.0 0.0
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.0 0.4473 0.0
0.0 0.0 0.4473
julia> M = DocumentTermMatrix(crps) # Int elements
A 7x3 DocumentTermMatrix{Int64}
julia> tf!(M.dtm) # fails because of Int elements
ERROR: MethodError: no method matching tf!(::SparseArrays.SparseMatrixCSC{Int64,Int64}, ::SparseArrays.SparseMatrixCSC{Int64,Int64})
Closest candidates are:
tf!(::SparseArrays.SparseMatrixCSC{T,Ti} where Ti<:Integer, !Matched::SparseArrays.SparseMatrixCSC{F,Ti} where Ti<:Integer) where {T<:Real, F<:AbstractFloat} at /home/travis/build/zgornel/StringAnalysis.jl/src/stats.jl:20
tf!(::AbstractArray{T,2}) where T<:Real at /home/travis/build/zgornel/StringAnalysis.jl/src/stats.jl:35
tf!(::AbstractArray{T,2}, !Matched::AbstractArray{F,2}) where {T<:Real, F<:AbstractFloat} at /home/travis/build/zgornel/StringAnalysis.jl/src/stats.jl:7
or, to provide a matrix output:
julia> rows, cols = size(M.dtm);
julia> tfm = zeros(Float16, rows, cols);
julia> tf!(M.dtm, tfm);
julia> tfm
7×3 Array{Float16,2}:
0.4473 0.0 0.0
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.0 0.4473 0.0
0.0 0.0 0.4473
One could also provide a sparse matrix output; however, it is important to note that in this case the output matrix's non-zero values have to correspond to the DTM's non-zero values:
julia> using SparseArrays
julia> rows, cols = size(M.dtm);
julia> tfm = spzeros(Float16, rows, cols)
7×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 0 stored entries
julia> tfm[M.dtm .!= 0] .= 123; # create explicitly non-zeros
julia> tf!(M.dtm, tfm);
julia> Matrix(tfm)
7×3 Array{Float16,2}:
0.4473 0.0 0.0
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.0 0.4473 0.0
0.0 0.0 0.4473
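The tf_idf/tf_idf! and bm_25/bm_25! functions follow the same calling conventions; a minimal sketch, assuming the analogous one-argument in-place method (output suppressed):
julia> M = DocumentTermMatrix{Float16}(crps);
julia> tf_idf!(M.dtm); # in-place, floating point elements required, as with tf!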
Co-occurrence Matrix (COOM)
Another type of feature matrix that can be created is the co-occurrence matrix (COOM) of the document or corpus. The elements of the matrix indicate how many times two words co-occur in a (sliding) word window of a given size. The COOM can be calculated for objects of type Corpus, AbstractDocument (with the exception of NGramDocument, since word order is lost) and AbstractString. The constructor supports specification of the window size, whether the counts should be normalized (to the distance between words in the window), as well as specific terms for which co-occurrences in the document should be calculated.
Remarks:
- The sliding window used to count co-occurrences does not take sentence stops into consideration; it does, however, respect document boundaries, i.e. it does not span across documents
- The co-occurrence matrices of the documents in a corpus are summed up when calculating the matrix for an entire corpus
- The co-occurrence matrix always has elements that are subtypes of AbstractFloat and cannot be calculated for NGramDocuments
julia> C = CooMatrix(crps, window=1, normalize=false) # fails, documents are NGramDocument
ERROR: The tokens of an NGramDocument cannot be reconstructed
julia> smallcrps = Corpus([sd, td])
A Corpus with 2 documents
julia> C = CooMatrix(smallcrps, window=1, normalize=false) # works
A 17x17 CooMatrix{Float64}
- The actual size of the sliding window is 2 * window + 1, with the keyword argument window specifying how many words to consider to the left and right of the center one
For a simple document, one should first preprocess the document and subsequently calculate the matrix:
julia> some_document = "This is a document. In the document, there are two sentences.";
julia> filtered_document = prepare(some_document, strip_whitespace|strip_case|strip_punctuation)
"this is a document in the document there are two sentences "
julia> C = CooMatrix{Float32}(some_document, window=3) # word distances matter
A 13x13 CooMatrix{Float32}
julia> Matrix(coom(C))
13×13 Array{Float32,2}:
0.0 2.0 1.0 0.666667 … 0.0 0.0 0.0
2.0 0.0 2.0 1.0 0.0 0.0 0.0
1.0 2.0 0.0 2.0 0.0 0.0 0.0
0.666667 1.0 2.0 0.0 0.0 0.0 0.0
0.0 0.666667 1.0 2.0 0.0 0.0 0.0
0.0 0.0 0.666667 1.0 … 0.0 0.0 0.0
0.0 0.0 0.0 0.666667 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.666667 0.0 0.0
0.0 0.0 0.0 0.0 1.0 0.666667 0.0
0.0 0.0 0.0 0.0 2.0 1.0 0.666667
0.0 0.0 0.0 0.0 … 0.0 2.0 1.0
0.0 0.0 0.0 0.0 2.0 0.0 2.0
0.0 0.0 0.0 0.0 1.0 2.0 0.0
One can also calculate the COOM corresponding to a reduced lexicon. The resulting matrix will have dimensions given by the size of the new lexicon and will be sparser if the window size is small.
julia> C = CooMatrix(smallcrps, ["this", "is", "a"], window=1, normalize=false)
A 3x3 CooMatrix{Float64}
julia> C.column_indices
OrderedCollections.OrderedDict{String,Int64} with 3 entries:
"this" => 1
"is" => 2
"a" => 3
julia> Matrix(coom(C))
3×3 Array{Float64,2}:
0.0 2.0 0.0
2.0 0.0 2.0
0.0 2.0 0.0
Dimensionality reduction
Random projections
In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. Random projection methods are powerful methods known for their simplicity and low error rates compared with other methods. According to experimental results, random projections preserve distances well, but empirical results are sparse. They have been applied to many natural language tasks under the name of random indexing. The core idea behind random projection is given in the Johnson-Lindenstrauss lemma, which states that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way which approximately preserves the distances between the points (Wikipedia).
The implementation here relies on the generalized sparse random projection matrix to generate a random projection model. For more details see the API documentation for RPModel and random_projection_matrix. To construct a random projection matrix that maps m dimensions to k dimensions, one can do:
julia> m = 10; k = 2; T = Float32;
julia> density = 0.2; # percentage of non-zero elements
julia> R = StringAnalysis.random_projection_matrix(m, k, T, density)
10×2 SparseArrays.SparseMatrixCSC{Float32,Int64} with 4 stored entries:
[2 , 1] = 0.707107
[3 , 1] = 0.707107
[10, 1] = -0.707107
[2 , 2] = -0.707107
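Note that in this example the non-zero entries equal ±1/sqrt(k) = ±1/sqrt(2) ≈ 0.707107 and that, on average, a fraction density of the m*k entries is populated (here exactly 4 out of 20).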
Building a random projection model from a DocumentTermMatrix or Corpus is straightforward:
julia> M = DocumentTermMatrix{Float32}(crps)
A 7x3 DocumentTermMatrix{Float32}
julia> model = RPModel(M, k=2, density=0.5, stats=:tf)
Random Projection model (tf), 7 terms, dimensionality 2, Float32 vectors
julia> model2 = rp(crps, T, k=17, density=0.1, stats=:tfidf)
Random Projection model (tfidf), 7 terms, dimensionality 17, Float32 vectors
Once the model is created, one can reduce document term vector dimensionality. First, the document term vector is constructed using the stats keyword argument and subsequently, the vector is projected into the random sub-space:
julia> doc = StringDocument("this is a new document")
A StringDocument{String}
julia> embed_document(model, doc)
2-element SparseArrays.SparseVector{Float32,Int64} with 1 stored entry:
[1] = -1.0
julia> embed_document(model2, doc)
17-element SparseArrays.SparseVector{Float32,Int64} with 7 stored entries:
[1 ] = -0.377964
[2 ] = -0.377964
[4 ] = 0.377964
[5 ] = 0.377964
[8 ] = -0.377964
[10] = -0.377964
[14] = 0.377964
Embedding a DTM or corpus can be done in a similar way:
julia> Matrix(embed_document(model, M))
2×3 Array{Float32,2}:
-0.707107 -1.0 0.0
-0.707107 0.0 -1.0
julia> Matrix(embed_document(model2, crps))
17×3 Array{Float32,2}:
-0.260059 -0.377964 -0.210481
-0.260059 -0.377964 -0.210481
0.0 0.0 0.0
0.260059 0.377964 0.210481
0.260059 0.377964 0.210481
0.0 0.0 0.0
0.0 0.0 0.415297
-0.260059 -0.377964 0.204816
-0.51312 0.0 0.415297
-0.260059 -0.377964 -0.210481
0.0 0.0 0.0
0.0 0.0 0.0
-0.51312 0.0 0.0
0.260059 0.377964 0.625777
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
Random projection models can be saved/loaded to/from disk using a text format.
julia> file = "model.txt"
"model.txt"
julia> model
Random Projection model (tf), 7 terms, dimensionality 2, Float32 vectors
julia> save_rp_model(model, file) # model saved
julia> print(join(readlines(file)[1:5], "\n")) # first five lines
Random Projection Model saved at 2020-11-23T10:57:37.597
7 2
true
tf
1.4054651 0.71231794 0.71231794 0.71231794 0.71231794 1.4054651 1.4054651
julia> new_model = load_rp_model(file, Float64) # change element type
Random Projection model (tf), 7 terms, dimensionality 2, Float64 vectors
julia> rm(file)
No projection hack
As previously noted, before projection, the DTV is calculated according to the value of the stats keyword argument. The vector can be composed of term counts, frequencies and so on, and is more generic than the output of the dtv function, which yields only term counts. It is useful to be able to calculate and output these vectors without projecting them into the lower dimensional space. This can be achieved by simply providing a negative or zero value to the model parameter k. In the background, the random projection matrix of the model is replaced by the identity matrix.
julia> model = RPModel(M, k=0, stats=:bm25)
Identity Projection (bm25), 7 terms, dimensionality 7, Float32 vectors
julia> embed_document(model, crps[1]) # normalized BM25 document vector
7-element SparseArrays.SparseVector{Float32,Int64} with 5 stored entries:
[1] = 0.702301
[2] = 0.35594
[3] = 0.35594
[4] = 0.35594
[5] = 0.35594
julia> embed_document(model, crps)'*embed_document(model, crps[1]) # intra-document similarity
3-element SparseArrays.SparseVector{Float32,Int64} with 3 stored entries:
[1] = 1.0
[2] = 0.506774
[3] = 0.506774
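The self-similarity of 1.0 in the first entry suggests that the embedded vectors are normalized to unit L2 norm, which makes the dot products above cosine similarities. A quick check (output suppressed; the norm should be approximately 1):
julia> using LinearAlgebra
julia> norm(embed_document(model, crps[1])); # ≈ 1.0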
Semantic Analysis
The semantic analysis of a corpus relates to the task of building structures that approximate the concepts present in its documents. It does not necessarily involve prior semantic understanding of the documents (Wikipedia). StringAnalysis provides two approaches to performing semantic analysis of a corpus: latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).
Latent Semantic Analysis (LSA)
The following is a straightforward usage example of LSA. It is geared towards information retrieval (LSI), as it focuses on document comparison and embedding. We assume a number of documents,
julia> doc1 = StringDocument("This is a text about an apple. There are many texts about apples.");
julia> doc2 = StringDocument("Pears and apples are good but not exotic. An apple a day keeps the doctor away.");
julia> doc3 = StringDocument("Fruits are good for you.");
julia> doc4 = StringDocument("This phrase has nothing to do with the others...");
julia> doc5 = StringDocument("Simple text, little info inside");
and create the corpus and its DTM:
julia> crps = Corpus(AbstractDocument[doc1, doc2, doc3, doc4, doc5]);
julia> prepare!(crps, strip_punctuation);
julia> update_lexicon!(crps);
julia> M = DocumentTermMatrix{Float32}(crps, collect(keys(crps.lexicon)));
Building an LSA model is straightforward:
julia> lm = LSAModel(M, k=4, stats=:tfidf)
LSA Model (tfidf), 38 terms, dimensionality 4, Float32 vectors
Once the model is created, it can be used to embed documents,
julia> query = StringDocument("Apples and an exotic fruit.");
julia> embed_document(lm, query)
4-element Array{Float32,1}:
0.73225236
0.14379191
0.3171733
-0.5852612
embed the corpus,
julia> V = embed_document(lm, crps)
4×5 Array{Float32,2}:
0.735058 0.822174 0.361583 0.369555 0.267472
-0.127939 0.281849 0.155932 0.312645 -0.925021
0.0854172 0.393246 0.432528 -0.844548 -0.165549
-0.660322 -0.299915 0.811087 0.228956 0.213045
search for matching documents,
julia> idxs, corrs = cosine(lm, crps, query);
julia> for (idx, corr) in zip(idxs, corrs)
println("$corr -> \"$(crps[idx].text)\"");
end
0.94282204 -> "Pears and apples are good but not exotic An apple a day keeps the doctor away "
0.9334041 -> "This is a text about an apple There are many texts about apples "
-0.0503197 -> "Fruits are good for you "
-0.08630329 -> "This phrase has nothing to do with the others "
-0.11434817 -> "Simple text little info inside"
or check for structure within the data:
julia> U = lm.Uᵀ;
julia> V'*V # document to document similarity
5×5 Array{Float32,2}:
1.0 0.799916 -0.252798 0.00832165 0.160135
0.799916 1.0 0.268066 -0.00882498 -0.169804
-0.252798 0.268066 1.0 0.0027891 0.0536656
0.00832165 -0.00882498 0.0027891 1.0 -0.00176561
0.160135 -0.169804 0.0536656 -0.00176561 1.0
julia> U'*U # term to term similarity
38×38 Array{Float32,2}:
0.0487284 0.0633793 0.0633793 … 0.00940091 0.00940091
0.0633793 0.0906999 0.0906999 -0.00765931 -0.00765931
0.0633793 0.0906999 0.0906999 -0.0076593 -0.0076593
0.0633793 0.0906999 0.0906999 -0.00765931 -0.00765931
0.0487284 0.0633793 0.0633793 0.00940091 0.00940091
0.0487284 0.0633793 0.0633793 … 0.00940091 0.00940091
0.0487284 0.0633793 0.0633793 0.00940091 0.00940091
0.0689124 0.0896318 0.0896318 0.0132949 0.0132949
0.0341388 0.0431566 0.0431566 0.00776654 0.00776653
0.0249999 0.0599407 0.0599407 0.0043364 0.0043364
⋮ ⋱
-0.00542773 -0.00864058 -0.00864057 0.000449956 0.000449949
-0.00542773 -0.00864058 -0.00864058 … 0.000449959 0.000449952
-0.00542773 -0.00864058 -0.00864057 0.000449953 0.000449947
-0.00542773 -0.00864058 -0.00864057 0.000449956 0.00044995
-0.00542773 -0.00864058 -0.00864057 0.000449953 0.000449946
0.00940091 -0.00765931 -0.00765931 0.207998 0.207998
0.00940091 -0.0076593 -0.0076593 … 0.207998 0.207998
0.00940091 -0.00765931 -0.0076593 0.207998 0.207998
0.00940091 -0.00765931 -0.0076593 0.207998 0.207998
LSA models can be saved/loaded to/from disk using a text format similar to that of the random projection models.
julia> file = "model.txt"
"model.txt"
julia> lm
LSA Model (tfidf), 38 terms, dimensionality 4, Float32 vectors
julia> save_lsa_model(lm, file) # model saved
julia> print(join(readlines(file)[1:5], "\n")) # first five lines
LSA Model saved at 2020-11-23T10:57:44.239
38 4
tfidf
1.9162908 1.5108256 1.5108256 1.5108256 1.9162908 1.9162908 1.9162908 1.9162908 1.5108256 1.2231436 1.5108256 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.5108256 1.9162908 1.9162908 1.5108256 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908
9.6
julia> new_model = load_lsa_model(file, Float64) # change element type
LSA Model (tfidf), 38 terms, dimensionality 4, Float64 vectors
julia> rm(file)
Latent Dirichlet Allocation (LDA)
Documentation coming soon; check the API reference for information on the associated methods.