More on documents

Extending the document model

Sometimes it makes sense to define new document types and use only part of the package's functionality with them. For example, let us define two new document types: a SimpleDocument with no metadata

julia> using StringAnalysis

julia> struct NoMetadata <: AbstractMetadata end

julia> struct SimpleDocument{T<:AbstractString} <: AbstractDocument{T, NoMetadata}
           text::T
       end

and a ConferencePublication with only a limited number of metadata fields.

julia> struct ConferenceMetadata <: AbstractMetadata
           name::String
           authors::String
           conference::String
       end

julia> struct ConferencePublication{T<:AbstractString} <: AbstractDocument{T, ConferenceMetadata}
           text::T
           metadata::ConferenceMetadata
       end

At this point, one can create documents of the new types and store them in basic containers alongside the standard document types of the package:

julia> sd = SimpleDocument("a simple document")
A Main.ex-index.SimpleDocument{String}

julia> cmd = ConferenceMetadata("Tile Inc.","John P. Doe","IEEE Conference on Unknown Facts")
Main.ex-index.ConferenceMetadata("Tile Inc.", "John P. Doe", "IEEE Conference on Unknown Facts")

julia> cd = ConferencePublication("publication text", cmd)
A Main.ex-index.ConferencePublication{String}

julia> doc = StringDocument("a document")
A StringDocument{String}

julia> docs = [sd, cd, doc]
3-element Array{AbstractDocument{String,M} where M<:AbstractMetadata,1}:
 A Main.ex-index.SimpleDocument{String}
 A Main.ex-index.ConferencePublication{String}
 A StringDocument{String}

However, creating a Corpus fails because no conversion method exists between the new document types and any of the standard ones (StringDocument, NGramDocument, etc.).

julia> Corpus(AbstractDocument[sd, cd, doc])
ERROR: Could not convert the Main.ex-index.SimpleDocument{String} to any GenericDocument type.
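Whether a conversion method exists can also be checked before attempting a conversion. The following is a minimal sketch that uses only Base Julia; ToyTarget, ToySource and can_convert are hypothetical names introduced for illustration and are not part of StringAnalysis, and the package itself may detect missing conversions differently.

```julia
# Hypothetical stand-in types; they are not part of StringAnalysis.
struct ToyTarget
    text::String
end

struct ToySource
    text::String
end

# True only if a `convert` method from S to T is currently defined.
can_convert(::Type{T}, ::Type{S}) where {T,S} =
    hasmethod(convert, Tuple{Type{T}, S})

before = can_convert(ToyTarget, ToySource)   # no conversion method defined yet

# Define the missing conversion, analogous to the Base.convert methods in this section.
Base.convert(::Type{ToyTarget}, src::ToySource) = ToyTarget(src.text)

after = can_convert(ToyTarget, ToySource)    # the method now exists
```

Here before is false and after is true, mirroring the situation in this section: a conversion cannot succeed until at least one convert method to a known type has been defined.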

By defining at least one conversion method to a known type,

julia> Base.convert(::Type{NGramDocument{String}}, doc::SimpleDocument) =
           NGramDocument{String}(doc.text)

julia> Base.convert(::Type{NGramDocument{String}}, doc::ConferencePublication) = begin
           new_doc = NGramDocument{String}(doc.text)
           new_doc.metadata.name = doc.metadata.name
           new_doc.metadata.author = doc.metadata.authors
           new_doc.metadata.note = doc.metadata.conference
           return new_doc
       end

the Corpus can be created and the rest of the package's functionality, e.g. numerical operations, can be applied to the document data.

julia> crps = Corpus(AbstractDocument[sd, cd, doc])
A Corpus with 3 documents

julia> metadata.(doc for doc in crps)
3-element Array{DocumentMetadata,1}:
 <no ID> <no name> <unknown author> ? (?)
 <no ID> "Tile Inc." by John P. Doe ? (?)
 <no ID> <no name> <unknown author> ? (?)

julia> DocumentTermMatrix(crps)
A 5x3 DocumentTermMatrix{Int64}

The SimpleDocument and ConferencePublication were both converted to NGramDocuments since this was the only conversion method available. If more were available, the order in which conversions are attempted would be determined by the code of the abstract_convert function. Generally, a single conversion method suffices.
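The idea of a conversion priority can be illustrated with a small, self-contained sketch. It is not the package's abstract_convert code: every name below (ToyDocument, ToyStringDoc, ToyNGramDoc, ToySimpleDoc, TOY_PRIORITY, toy_abstract_convert) is hypothetical, and the order of the candidate types is an assumption chosen for the example.

```julia
# Hypothetical stand-ins; none of these names come from StringAnalysis.
abstract type ToyDocument end

struct ToyStringDoc <: ToyDocument
    text::String
end

struct ToyNGramDoc <: ToyDocument
    ngrams::Dict{String,Int}
end

# A user-defined type that only knows how to become a ToyNGramDoc.
struct ToySimpleDoc <: ToyDocument
    text::String
end

function Base.convert(::Type{ToyNGramDoc}, doc::ToySimpleDoc)
    counts = Dict{String,Int}()
    for word in split(doc.text)
        key = String(word)
        counts[key] = get(counts, key, 0) + 1
    end
    return ToyNGramDoc(counts)
end

# Candidate target types, highest priority first (this order is an assumption).
const TOY_PRIORITY = (ToyStringDoc, ToyNGramDoc)

# Return the document unchanged if it already has a known type; otherwise
# return the first conversion that succeeds, trying targets in priority order.
function toy_abstract_convert(doc::ToyDocument)
    for T in TOY_PRIORITY
        doc isa T && return doc
        try
            return convert(T, doc)
        catch err
            # A missing conversion method means "try the next target".
            err isa MethodError || rethrow()
        end
    end
    error("Could not convert the $(typeof(doc)) to any known document type.")
end

converted = toy_abstract_convert(ToySimpleDoc("a b a"))
```

In this sketch converted is a ToyNGramDoc, because the conversion to ToyStringDoc is tried first and no such method exists. Defining a second convert method targeting ToyStringDoc would make that conversion win, which is why a single method is usually enough unless a specific target type is wanted.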