Usage examples
Documents
Documents are simple wrappers around basic structures that contain text. The underlying data representation can be simple strings, dictionaries or vectors of strings. All document types are subtypes of the parametric type AbstractDocument{T} where T<:AbstractString.
julia> using StringAnalysis
julia> sd = StringDocument("this is a string document")
A StringDocument{String}
julia> nd = NGramDocument("this is a ngram document")
A NGramDocument{String}
julia> td = TokenDocument("this is a token document")
A TokenDocument{String}
julia> # fd = FileDocument("/some/file") # works the same way ...
Documents and types
The string type can be explicitly enforced:
julia> nd = NGramDocument{String}("this is a ngram document")
A NGramDocument{String}
julia> ngrams(nd)
Dict{String,Int64} with 5 entries:
"document" => 1
"this" => 1
"is" => 1
"ngram" => 1
"a" => 1
julia> td = TokenDocument{String}("this is a token document")
A TokenDocument{String}
julia> tokens(td)
5-element Array{String,1}:
"this"
"is"
"a"
"token"
"document"
Conversion methods are available to switch between document types (the type parameter has to be specified as well).
julia> convert(TokenDocument{SubString}, StringDocument("some text"))
A TokenDocument{SubString{String}}
julia> convert(NGramDocument{String}, StringDocument("some more text"))
A NGramDocument{String}
Metadata
Alongside the text data, documents also contain metadata.
julia> doc = StringDocument("this is another document")
A StringDocument{String}
julia> metadata(doc)
<no ID> <no name> <unknown author> ? (?)
julia> fieldnames(typeof(metadata(doc)))
(:language, :name, :author, :timestamp, :id, :publisher, :edition_year, :published_year, :documenttype, :note)
Metadata fields can be modified through methods bearing the same name as the metadata field, suffixed with ! (e.g. author! for the author field). Note that these methods are not explicitly exported.
julia> StringAnalysis.id!(doc, "doc1");
julia> StringAnalysis.author!(doc, "Corneliu C.");
julia> StringAnalysis.name!(doc, "A simple document");
julia> StringAnalysis.edition_year!(doc, "2019");
julia> StringAnalysis.published_year!(doc, "2019");
julia> metadata(doc)
doc1 "A simple document" by Corneliu C. 2019 (2019)
Corpus
A corpus is an object that holds a bunch of documents together.
julia> docs = [sd, nd, td]
3-element Array{AbstractDocument{String,DocumentMetadata},1}:
A StringDocument{String}
A NGramDocument{String}
A TokenDocument{String}
julia> crps = Corpus(docs)
A Corpus with 3 documents
julia> crps.documents
3-element Array{AbstractDocument{String,DocumentMetadata},1}:
A StringDocument{String}
A NGramDocument{String}
A TokenDocument{String}
The corpus can be 'standardized' to hold the same type of document,
julia> standardize!(crps, NGramDocument{String})
julia> crps.documents
3-element Array{AbstractDocument{String,DocumentMetadata},1}:
A NGramDocument{String}
A NGramDocument{String}
A NGramDocument{String}
however, the corpus has to be created from an AbstractDocument document vector for the standardization to work (AbstractDocument{T} vectors are converted to a Union of all document types parametrized by T during Corpus construction):
julia> doc1 = StringDocument("one");
julia> doc2 = StringDocument("two");
julia> doc3 = TokenDocument("three");
julia> standardize!(Corpus([doc1, doc3]), NGramDocument{String}) # works
julia> standardize!(Corpus([doc1, doc2]), NGramDocument{String}) # fails because we have a Vector{StringDocument{T}}
ERROR: MethodError: Cannot `convert` an object of type NGramDocument{String} to an object of type StringDocument{String}
Closest candidates are:
convert(::Type{StringDocument{T}}, !Matched::Union{FileDocument, StringDocument}) where T<:AbstractString at /home/travis/build/zgornel/StringAnalysis.jl/src/document.jl:245
convert(::Type{T}, !Matched::T) where T at essentials.jl:168
StringDocument{String}(::Any, !Matched::Any) where T<:AbstractString at /home/travis/build/zgornel/StringAnalysis.jl/src/document.jl:40
julia> standardize!(Corpus(AbstractDocument[doc1, doc2]), NGramDocument{String}) # works
The corpus can also be iterated through,
julia> for (i,doc) in enumerate(crps)
@show (i, doc)
end
(i, doc) = (1, A NGramDocument{String})
(i, doc) = (2, A NGramDocument{String})
(i, doc) = (3, A NGramDocument{String})
indexed into,
julia> doc = crps[1]
A NGramDocument{String}
julia> docs = crps[2:3]
2-element Array{AbstractDocument{String,DocumentMetadata},1}:
A NGramDocument{String}
A NGramDocument{String}
and used as a container.
julia> push!(crps, NGramDocument{String}("new document"))
4-element Array{AbstractDocument{String,DocumentMetadata},1}:
A NGramDocument{String}
A NGramDocument{String}
A NGramDocument{String}
A NGramDocument{String}
julia> doc4 = pop!(crps)
A NGramDocument{String}
julia> ngrams(doc4)
Dict{String,Int64} with 2 entries:
"document" => 1
"new" => 1
The lexicon and inverse index
The Corpus object offers the ability to create a lexicon and an inverse index for the documents present. These are not automatically created when the Corpus is created,
julia> crps.lexicon
OrderedCollections.OrderedDict{String,Int64} with 0 entries
julia> crps.inverse_index
OrderedCollections.OrderedDict{String,Array{Int64,1}} with 0 entries
but instead have to be explicitly built:
julia> update_lexicon!(crps)
julia> crps.lexicon
OrderedCollections.OrderedDict{String,Int64} with 7 entries:
"string" => 1
"document" => 3
"this" => 3
"is" => 3
"a" => 3
"ngram" => 1
"token" => 1
julia> update_inverse_index!(crps)
julia> crps.inverse_index
OrderedCollections.OrderedDict{String,Array{Int64,1}} with 7 entries:
"string" => [1]
"document" => [1, 2, 3]
"this" => [1, 2, 3]
"is" => [1, 2, 3]
"a" => [1, 2, 3]
"ngram" => [2]
"token" => [3]
It is possible to explicitly create the lexicon and inverse index:
julia> create_lexicon(Corpus([sd]))
OrderedCollections.OrderedDict{String,Int64} with 5 entries:
"string" => 1
"document" => 1
"this" => 1
"is" => 1
"a" => 1
julia> create_inverse_index(Corpus([sd]))
OrderedCollections.OrderedDict{String,Array{Int64,1}} with 5 entries:
"string" => [1]
"document" => [1]
"this" => [1]
"is" => [1]
"a" => [1]
The ngram complexity can be specified as a second parameter:
julia> create_lexicon(Corpus([sd]), 2)
OrderedCollections.OrderedDict{String,Int64} with 9 entries:
"this is" => 1
"string" => 1
"document" => 1
"this" => 1
"is" => 1
"string document" => 1
"is a" => 1
"a" => 1
"a string" => 1
The create_lexicon and create_inverse_index functions are available from v0.3.9. Both functions support specifying the ngram complexity.
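For instance, the inverse index can also be built with bigrams included; a minimal sketch (output suppressed, the keys mirror those of the create_lexicon call above, each mapped to [1]):
julia> create_inverse_index(Corpus([sd]), 2);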
Preprocessing
The text preprocessing mainly consists of the prepare and prepare! functions and of preprocessing flags, most of which start with strip_ (the exception being stem_words). The preprocessing function prepare works on AbstractDocument, Corpus and AbstractString types, returning new objects; prepare! works only on AbstractDocuments and Corpus, as strings are immutable.
julia> str="This is a text containing words, some more words, a bit of punctuation and 1 number...";
julia> sd = StringDocument(str);
julia> flags = strip_punctuation|strip_articles|strip_whitespace
0x00300600
julia> prepare(str, flags)
"This is text containing words some more words bit of punctuation and 1 number "
julia> prepare!(sd, flags);
julia> text(sd)
"This is text containing words some more words bit of punctuation and 1 number "
More extensive preprocessing examples can be viewed in test/preprocessing.jl.
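Since the preprocessing flags are integer bit masks, as the hexadecimal value above suggests, they can be combined and inspected with the usual bitwise operators; a small sketch, assuming flags is still in scope:
julia> (flags & strip_articles) != 0 # the articles flag is set
true
julia> (flags | strip_articles) == flags # OR-ing an already-set flag changes nothing
true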
One can strip parts of speech, i.e. prepositions and articles, in languages other than English (support is provided by Languages.jl):
julia> using Languages
julia> it = StringDocument("Quest'e un piccolo esempio di come si puo fare l'analisi");
julia> StringAnalysis.language!(it, Languages.Italian());
julia> prepare!(it, strip_articles|strip_prepositions|strip_whitespace);
julia> text(it)
"Quest'e piccolo esempio come si puo fare analisi"
In the case of AbstractStrings, the language has to be explicitly specified:
julia> prepare("Nous sommes tous d'accord avec les examples!", stem_words, language=Languages.French())
"Nous somm tous d accord avec le exampl"
Features
Document Term Matrix (DTM)
If a lexicon is present in the corpus, a document term matrix (DTM) can be created. The DTM acts as a basis for word-document statistics, allowing for the representation of documents as numerical vectors. The DTM is created from a Corpus by calling the constructor
julia> M = DocumentTermMatrix(crps)
A 7x3 DocumentTermMatrix{Int64}
julia> typeof(M)
DocumentTermMatrix{Int64}
julia> M = DocumentTermMatrix{Int8}(crps)
A 7x3 DocumentTermMatrix{Int8}
julia> typeof(M)
DocumentTermMatrix{Int8}
or the dtm function
julia> M = dtm(crps, Int8);
julia> Matrix(M)
7×3 Array{Int8,2}:
1 0 0
1 1 1
1 1 1
1 1 1
1 1 1
0 1 0
0 0 1
It is important to note that the type parameter of the DTM object can be specified (also in the dtm function) but is not required. This can be useful in some cases for reducing memory requirements. The default element type of the DTM is specified by the constant DEFAULT_DTM_TYPE present in src/defaults.jl.
From version v0.3.2, the columns of the document-term matrix represent document vectors. This convention holds across the package wherever multiple documents are represented. It is a breaking change from previous versions and from TextAnalysis.jl and may break code if not taken into account.
One can verify the DTM dimensions with:
julia> @assert size(dtm(crps)) == (length(lexicon(crps)), length(crps)) # O.K.
Document Term Vectors (DTVs)
The individual columns of the DTM can also be generated iteratively, whether a lexicon is present or not. If a lexicon is present, the each_dtv iterator allows the generation of the document vectors along with control of the vector element type:
julia> for dv in map(Vector, each_dtv(crps, eltype=Int8))
@show dv
end
dv = Int8[1, 1, 1, 1, 1, 0, 0]
dv = Int8[0, 1, 1, 1, 1, 1, 0]
dv = Int8[0, 1, 1, 1, 1, 0, 1]
Alternatively, the vectors can be generated using the hash trick. This is a form of dimensionality reduction, as the cardinality, i.e. the output dimension, is much smaller than the dimension of the original DTM vectors, which equals the length of the lexicon. The cardinality is a keyword argument of the Corpus constructor. The hashed vector output type can be specified when building the iterator (note that hashing may map distinct terms to the same element, as visible in the counts of 2 below):
julia> new_crps = Corpus(documents(crps), cardinality=7);
julia> hash_vectors = map(Vector, each_hash_dtv(new_crps, eltype=Int8));
julia> for hdv in hash_vectors
@show hdv
end
hdv = Int8[1, 1, 1, 0, 0, 2, 0]
hdv = Int8[0, 2, 1, 0, 0, 2, 0]
hdv = Int8[0, 1, 1, 1, 0, 2, 0]
One can construct a 'hashed' version of the DTM as well:
julia> hash_dtm(Corpus(documents(crps), cardinality=5), Int8)
5×3 SparseArrays.SparseMatrixCSC{Int8,Int64} with 9 stored entries:
[2, 1] = 1
[3, 1] = 2
[5, 1] = 2
[2, 2] = 1
[3, 2] = 2
[5, 2] = 2
[1, 3] = 1
[3, 3] = 2
[5, 3] = 2
The default Corpus cardinality is specified by the constant DEFAULT_CARDINALITY present in src/defaults.jl.
From version v0.3.4, all document vectors are instances of SparseVector. This consequently has an impact on the output and performance of methods that directly employ DTVs, such as the embed_document method. In certain cases, if speed is more important than memory consumption, it may be useful to first transform the vectors into a dense representation, i.e. dtv_dense = Vector(dtv_sparse).
TF, TF-IDF, BM25
From the DTM, three more document-word statistics can be constructed: the term frequency, the tf-idf (term frequency - inverse document frequency) and Okapi BM25, using the tf/tf!, tf_idf/tf_idf! and bm_25/bm_25! functions respectively. Their usage is very similar, yet there exist several approaches one can take to constructing the output.
The following examples use only the term frequency functions tf and tf!. When calling the functions that do not end in !, which do not require the specification of an output matrix, one does not control the output's element type. The default output type is defined by the constant DEFAULT_FLOAT_TYPE = eltype(1.0):
julia> M = DocumentTermMatrix(crps);
julia> tfm = tf(M);
julia> Matrix(tfm)
7×3 Array{Float64,2}:
0.447214 0.0 0.0
0.447214 0.447214 0.447214
0.447214 0.447214 0.447214
0.447214 0.447214 0.447214
0.447214 0.447214 0.447214
0.0 0.447214 0.0
0.0 0.0 0.447214
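The values above are consistent with a square-root normalized term frequency: each document in the corpus contains five distinct terms, each occurring once, and sqrt(1/5) ≈ 0.447214. This can be checked directly:
julia> sqrt(1/5)
0.4472135954999579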
Control of the output matrix element type - which has to be a subtype of AbstractFloat - can be done only by using the in-place modification functions. One approach is to directly modify the DTM, provided that its elements are floating point numbers:
julia> M = DocumentTermMatrix{Float16}(crps)
A 7x3 DocumentTermMatrix{Float16}
julia> Matrix(M.dtm)
7×3 Array{Float16,2}:
1.0 0.0 0.0
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
0.0 1.0 0.0
0.0 0.0 1.0
julia> tf!(M.dtm); # inplace modification
julia> Matrix(M.dtm)
7×3 Array{Float16,2}:
0.4473 0.0 0.0
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.0 0.4473 0.0
0.0 0.0 0.4473
julia> M = DocumentTermMatrix(crps) # Int elements
A 7x3 DocumentTermMatrix{Int64}
julia> tf!(M.dtm) # fails because of Int elements
ERROR: MethodError: no method matching tf!(::SparseArrays.SparseMatrixCSC{Int64,Int64}, ::SparseArrays.SparseMatrixCSC{Int64,Int64})
Closest candidates are:
tf!(::SparseArrays.SparseMatrixCSC{T,Ti} where Ti<:Integer, !Matched::SparseArrays.SparseMatrixCSC{F,Ti} where Ti<:Integer) where {T<:Real, F<:AbstractFloat} at /home/travis/build/zgornel/StringAnalysis.jl/src/stats.jl:20
tf!(::AbstractArray{T,2}) where T<:Real at /home/travis/build/zgornel/StringAnalysis.jl/src/stats.jl:35
tf!(::AbstractArray{T,2}, !Matched::AbstractArray{F,2}) where {T<:Real, F<:AbstractFloat} at /home/travis/build/zgornel/StringAnalysis.jl/src/stats.jl:7
or, to provide a matrix output:
julia> rows, cols = size(M.dtm);
julia> tfm = zeros(Float16, rows, cols);
julia> tf!(M.dtm, tfm);
julia> tfm
7×3 Array{Float16,2}:
0.4473 0.0 0.0
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.0 0.4473 0.0
0.0 0.0 0.4473
One could also provide a sparse matrix output; however, it is important to note that in this case the output matrix's non-zero values have to correspond to the DTM's non-zero values:
julia> using SparseArrays
julia> rows, cols = size(M.dtm);
julia> tfm = spzeros(Float16, rows, cols)
7×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 0 stored entries
julia> tfm[M.dtm .!= 0] .= 123; # create explicitly non-zeros
julia> tf!(M.dtm, tfm);
julia> Matrix(tfm)
7×3 Array{Float16,2}:
0.4473 0.0 0.0
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.4473 0.4473 0.4473
0.0 0.4473 0.0
0.0 0.0 0.4473
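The tf_idf/tf_idf! and bm_25/bm_25! functions follow the same calling conventions; a minimal sketch, assuming the analogous one-argument in-place method (output suppressed):
julia> M = DocumentTermMatrix{Float16}(crps);
julia> tf_idf!(M.dtm); # in-place, floating point elements required, as with tf!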
Co-occurrence Matrix (COOM)
Another type of feature matrix that can be created is the co-occurrence matrix (COOM) of the document or corpus. The elements of the matrix indicate how many times two words co-occur in a (sliding) word window of a given size. The COOM can be calculated for objects of type Corpus, AbstractDocument (with the exception of NGramDocument, since word order is lost) and AbstractString. The constructor supports specification of the window size, whether the counts should be normalized (to the distance between words in the window), as well as specific terms for which co-occurrences in the document should be calculated.
Remarks:
- The sliding window used to count co-occurrences does not take sentence stops into consideration; it does, however, respect document boundaries, i.e. it does not span across documents
- The co-occurrence matrices of the documents in a corpus are summed up when calculating the matrix for an entire corpus
- The co-occurrence matrix always has elements that are subtypes of AbstractFloat and cannot be calculated for NGramDocuments
julia> C = CooMatrix(crps, window=1, normalize=false) # fails, documents are NGramDocument
ERROR: The tokens of an NGramDocument cannot be reconstructed
julia> smallcrps = Corpus([sd, td])
A Corpus with 2 documents
julia> C = CooMatrix(smallcrps, window=1, normalize=false) # works
A 17x17 CooMatrix{Float64}
- The actual size of the sliding window is 2 * window + 1, with the keyword argument window specifying how many words to consider to the left and right of the center one
For a simple document, one should first preprocess the document and subsequently calculate the matrix:
julia> some_document = "This is a document. In the document, there are two sentences.";
julia> filtered_document = prepare(some_document, strip_whitespace|strip_case|strip_punctuation)
"this is a document in the document there are two sentences "
julia> C = CooMatrix{Float32}(some_document, window=3) # word distances matter
A 13x13 CooMatrix{Float32}
julia> Matrix(coom(C))
13×13 Array{Float32,2}:
0.0 2.0 1.0 0.666667 … 0.0 0.0 0.0
2.0 0.0 2.0 1.0 0.0 0.0 0.0
1.0 2.0 0.0 2.0 0.0 0.0 0.0
0.666667 1.0 2.0 0.0 0.0 0.0 0.0
0.0 0.666667 1.0 2.0 0.0 0.0 0.0
0.0 0.0 0.666667 1.0 … 0.0 0.0 0.0
0.0 0.0 0.0 0.666667 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.666667 0.0 0.0
0.0 0.0 0.0 0.0 1.0 0.666667 0.0
0.0 0.0 0.0 0.0 2.0 1.0 0.666667
0.0 0.0 0.0 0.0 … 0.0 2.0 1.0
0.0 0.0 0.0 0.0 2.0 0.0 2.0
0.0 0.0 0.0 0.0 1.0 2.0 0.0
One can also calculate the COOM corresponding to a reduced lexicon. The resulting matrix will have dimensions given by the size of the new lexicon and will be sparser if the window size is small.
julia> C = CooMatrix(smallcrps, ["this", "is", "a"], window=1, normalize=false)
A 3x3 CooMatrix{Float64}
julia> C.column_indices
OrderedCollections.OrderedDict{String,Int64} with 3 entries:
"this" => 1
"is" => 2
"a" => 3
julia> Matrix(coom(C))
3×3 Array{Float64,2}:
0.0 2.0 0.0
2.0 0.0 2.0
0.0 2.0 0.0
Dimensionality reduction
Random projections
In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. Random projection methods are powerful methods known for their simplicity and low error rates compared with other methods. According to experimental results, random projections preserve distances well, but empirical results are sparse. They have been applied to many natural language tasks under the name of random indexing. The core idea behind random projection is given in the Johnson-Lindenstrauss lemma, which states that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way which approximately preserves the distances between the points (Wikipedia).
The implementation here relies on the generalized sparse random projection matrix to generate a random projection model. For more details see the API documentation for RPModel and random_projection_matrix. To construct a random projection matrix that maps m dimensions to k dimensions, one can do:
julia> m = 10; k = 2; T = Float32;
julia> density = 0.2; # percentage of non-zero elements
julia> R = StringAnalysis.random_projection_matrix(m, k, T, density)
10×2 SparseArrays.SparseMatrixCSC{Float32,Int64} with 4 stored entries:
[2 , 1] = 0.707107
[3 , 1] = 0.707107
[10, 1] = -0.707107
[2 , 2] = -0.707107
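Note that in this example the non-zero entries equal ±1/sqrt(k) = ±1/sqrt(2) ≈ 0.707107 and that, on average, a fraction density of the m*k entries is populated (here exactly 4 out of 20).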
Building a random projection model from a DocumentTermMatrix or Corpus is straightforward:
julia> M = DocumentTermMatrix{Float32}(crps)
A 7x3 DocumentTermMatrix{Float32}
julia> model = RPModel(M, k=2, density=0.5, stats=:tf)
Random Projection model (tf), 7 terms, dimensionality 2, Float32 vectors
julia> model2 = rp(crps, T, k=17, density=0.1, stats=:tfidf)
Random Projection model (tfidf), 7 terms, dimensionality 17, Float32 vectors
Once the model is created, one can reduce document term vector dimensionality. First, the document term vector is constructed using the stats keyword argument and subsequently, the vector is projected into the random sub-space:
julia> doc = StringDocument("this is a new document")
A StringDocument{String}
julia> embed_document(model, doc)
2-element SparseArrays.SparseVector{Float32,Int64} with 1 stored entry:
[1] = -1.0
julia> embed_document(model2, doc)
17-element SparseArrays.SparseVector{Float32,Int64} with 7 stored entries:
[1 ] = -0.377964
[2 ] = -0.377964
[4 ] = 0.377964
[5 ] = 0.377964
[8 ] = -0.377964
[10] = -0.377964
[14] = 0.377964
Embedding a DTM or corpus can be done in a similar way:
julia> Matrix(embed_document(model, M))
2×3 Array{Float32,2}:
-0.707107 -1.0 0.0
-0.707107 0.0 -1.0
julia> Matrix(embed_document(model2, crps))
17×3 Array{Float32,2}:
-0.260059 -0.377964 -0.210481
-0.260059 -0.377964 -0.210481
0.0 0.0 0.0
0.260059 0.377964 0.210481
0.260059 0.377964 0.210481
0.0 0.0 0.0
0.0 0.0 0.415297
-0.260059 -0.377964 0.204816
-0.51312 0.0 0.415297
-0.260059 -0.377964 -0.210481
0.0 0.0 0.0
0.0 0.0 0.0
-0.51312 0.0 0.0
0.260059 0.377964 0.625777
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
Random projection models can be saved/loaded to/from disk using a text format.
julia> file = "model.txt"
"model.txt"
julia> model
Random Projection model (tf), 7 terms, dimensionality 2, Float32 vectors
julia> save_rp_model(model, file) # model saved
julia> print(join(readlines(file)[1:5], "\n")) # first five lines
Random Projection Model saved at 2020-11-23T10:57:37.597
7 2
true
tf
1.4054651 0.71231794 0.71231794 0.71231794 0.71231794 1.4054651 1.4054651
julia> new_model = load_rp_model(file, Float64) # change element type
Random Projection model (tf), 7 terms, dimensionality 2, Float64 vectors
julia> rm(file)
No projection hack
As previously noted, before projection, the DTV is calculated according to the value of the stats keyword argument. The vector can be composed of term counts, frequencies and so on, and is more generic than the output of the dtv function, which yields only term counts. It is useful to be able to calculate and output these vectors without projecting them into the lower dimensional space. This can be achieved by simply providing a negative or zero value to the model parameter k. In the background, the random projection matrix of the model is replaced by the identity matrix.
julia> model = RPModel(M, k=0, stats=:bm25)
Identity Projection (bm25), 7 terms, dimensionality 7, Float32 vectors
julia> embed_document(model, crps[1]) # normalized BM25 document vector
7-element SparseArrays.SparseVector{Float32,Int64} with 5 stored entries:
[1] = 0.702301
[2] = 0.35594
[3] = 0.35594
[4] = 0.35594
[5] = 0.35594
julia> embed_document(model, crps)'*embed_document(model, crps[1]) # intra-document similarity
3-element SparseArrays.SparseVector{Float32,Int64} with 3 stored entries:
[1] = 1.0
[2] = 0.506774
[3] = 0.506774
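The self-similarity of 1.0 in the first entry suggests that the embedded vectors are normalized to unit L2 norm, which makes the dot products above cosine similarities. A quick check (output suppressed; the norm should be approximately 1):
julia> using LinearAlgebra
julia> norm(embed_document(model, crps[1])); # ≈ 1.0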
Semantic Analysis
The semantic analysis of a corpus relates to the task of building structures that approximate the concepts present in its documents. It does not necessarily involve prior semantic understanding of the documents (Wikipedia). StringAnalysis provides two approaches to performing semantic analysis of a corpus: latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).
Latent Semantic Analysis (LSA)
The following is a straightforward usage example of LSA. It is geared towards information retrieval (LSI), as it focuses on document comparison and embedding. We assume a number of documents,
julia> doc1 = StringDocument("This is a text about an apple. There are many texts about apples.");
julia> doc2 = StringDocument("Pears and apples are good but not exotic. An apple a day keeps the doctor away.");
julia> doc3 = StringDocument("Fruits are good for you.");
julia> doc4 = StringDocument("This phrase has nothing to do with the others...");
julia> doc5 = StringDocument("Simple text, little info inside");
and create the corpus and its DTM:
julia> crps = Corpus(AbstractDocument[doc1, doc2, doc3, doc4, doc5]);
julia> prepare!(crps, strip_punctuation);
julia> update_lexicon!(crps);
julia> M = DocumentTermMatrix{Float32}(crps, collect(keys(crps.lexicon)));
Building an LSA model is straightforward:
julia> lm = LSAModel(M, k=4, stats=:tfidf)
LSA Model (tfidf), 38 terms, dimensionality 4, Float32 vectors
Once the model is created, it can be used to embed documents,
julia> query = StringDocument("Apples and an exotic fruit.");
julia> embed_document(lm, query)
4-element Array{Float32,1}:
0.73225236
0.14379191
0.3171733
-0.5852612
embed the corpus,
julia> V = embed_document(lm, crps)
4×5 Array{Float32,2}:
0.735058 0.822174 0.361583 0.369555 0.267472
-0.127939 0.281849 0.155932 0.312645 -0.925021
0.0854172 0.393246 0.432528 -0.844548 -0.165549
-0.660322 -0.299915 0.811087 0.228956 0.213045
search for matching documents,
julia> idxs, corrs = cosine(lm, crps, query);
julia> for (idx, corr) in zip(idxs, corrs)
println("$corr -> \"$(crps[idx].text)\"");
end
0.94282204 -> "Pears and apples are good but not exotic An apple a day keeps the doctor away "
0.9334041 -> "This is a text about an apple There are many texts about apples "
-0.0503197 -> "Fruits are good for you "
-0.08630329 -> "This phrase has nothing to do with the others "
-0.11434817 -> "Simple text little info inside"
or check for structure within the data:
julia> U = lm.Uᵀ;
julia> V'*V # document to document similarity
5×5 Array{Float32,2}:
1.0 0.799916 -0.252798 0.00832165 0.160135
0.799916 1.0 0.268066 -0.00882498 -0.169804
-0.252798 0.268066 1.0 0.0027891 0.0536656
0.00832165 -0.00882498 0.0027891 1.0 -0.00176561
0.160135 -0.169804 0.0536656 -0.00176561 1.0
julia> U'*U # term to term similarity
38×38 Array{Float32,2}:
0.0487284 0.0633793 0.0633793 … 0.00940091 0.00940091
0.0633793 0.0906999 0.0906999 -0.00765931 -0.00765931
0.0633793 0.0906999 0.0906999 -0.0076593 -0.0076593
0.0633793 0.0906999 0.0906999 -0.00765931 -0.00765931
0.0487284 0.0633793 0.0633793 0.00940091 0.00940091
0.0487284 0.0633793 0.0633793 … 0.00940091 0.00940091
0.0487284 0.0633793 0.0633793 0.00940091 0.00940091
0.0689124 0.0896318 0.0896318 0.0132949 0.0132949
0.0341388 0.0431566 0.0431566 0.00776654 0.00776653
0.0249999 0.0599407 0.0599407 0.0043364 0.0043364
⋮ ⋱
-0.00542773 -0.00864058 -0.00864057 0.000449956 0.000449949
-0.00542773 -0.00864058 -0.00864058 … 0.000449959 0.000449952
-0.00542773 -0.00864058 -0.00864057 0.000449953 0.000449947
-0.00542773 -0.00864058 -0.00864057 0.000449956 0.00044995
-0.00542773 -0.00864058 -0.00864057 0.000449953 0.000449946
0.00940091 -0.00765931 -0.00765931 0.207998 0.207998
0.00940091 -0.0076593 -0.0076593 … 0.207998 0.207998
0.00940091 -0.00765931 -0.0076593 0.207998 0.207998
0.00940091 -0.00765931 -0.0076593 0.207998 0.207998
LSA models can be saved/loaded to/from disk using a text format similar to that of the random projection models.
julia> file = "model.txt"
"model.txt"
julia> lm
LSA Model (tfidf), 38 terms, dimensionality 4, Float32 vectors
julia> save_lsa_model(lm, file) # model saved
julia> print(join(readlines(file)[1:5], "\n")) # first five lines
LSA Model saved at 2020-11-23T10:57:44.239
38 4
tfidf
1.9162908 1.5108256 1.5108256 1.5108256 1.9162908 1.9162908 1.9162908 1.9162908 1.5108256 1.2231436 1.5108256 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.5108256 1.9162908 1.9162908 1.5108256 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908 1.9162908
9.6
julia> new_model = load_lsa_model(file, Float64) # change element type
LSA Model (tfidf), 38 terms, dimensionality 4, Float64 vectors
julia> rm(file)
Latent Dirichlet Allocation (LDA)
Documentation coming soon; check the API reference for information on the associated methods.