Extending the document model
Sometimes it makes sense to define new document types and use only part of this package's functionality. For example, let us define two new document types: a SimpleDocument with no metadata
julia> using StringAnalysis
julia> struct NoMetadata <: AbstractMetadata end
julia> struct SimpleDocument{T<:AbstractString} <: AbstractDocument{T, NoMetadata}
           text::T
       end
and a ConferencePublication with only a limited number of metadata fields.
julia> struct ConferenceMetadata <: AbstractMetadata
           name::String
           authors::String
           conference::String
       end
julia> struct ConferencePublication{T<:AbstractString} <: AbstractDocument{T, ConferenceMetadata}
           text::T
           metadata::ConferenceMetadata
       end
At this point, one can create documents and use basic containers along with other standard documents of the package:
julia> sd = SimpleDocument("a simple document")
A Main.ex-index.SimpleDocument{String}
julia> cmd = ConferenceMetadata("Tile Inc.","John P. Doe","IEEE Conference on Unknown Facts")
Main.ex-index.ConferenceMetadata("Tile Inc.", "John P. Doe", "IEEE Conference on Unknown Facts")
julia> cd = ConferencePublication("publication text", cmd)
A Main.ex-index.ConferencePublication{String}
julia> doc = StringDocument("a document")
A StringDocument{String}
julia> docs = [sd, cd, doc]
3-element Array{AbstractDocument{String,M} where M<:AbstractMetadata,1}:
 A Main.ex-index.SimpleDocument{String}
 A Main.ex-index.ConferencePublication{String}
 A StringDocument{String}
However, creating a Corpus fails because no conversion method exists between the new document types and any of the standard ones (StringDocument, NGramDocument, etc.).
julia> Corpus(AbstractDocument[sd, cd, doc])
ERROR: Could not convert the Main.ex-index.SimpleDocument{String} to any GenericDocument type.
By defining at least one conversion method to a known type,
julia> Base.convert(::Type{NGramDocument{String}}, doc::SimpleDocument) =
           NGramDocument{String}(doc.text)
julia> Base.convert(::Type{NGramDocument{String}}, doc::ConferencePublication) = begin
           new_doc = NGramDocument{String}(doc.text)
           new_doc.metadata.name = doc.metadata.name
           new_doc.metadata.author = doc.metadata.authors
           new_doc.metadata.note = doc.metadata.conference
           return new_doc
       end
the Corpus can be created and the rest of the package's functionality, e.g. numerical operations, can be employed on the document data.
julia> crps = Corpus(AbstractDocument[sd, cd, doc])
A Corpus with 3 documents
julia> metadata.(doc for doc in crps)
3-element Array{DocumentMetadata,1}:
 <no ID> <no name> <unknown author> ? (?)
 <no ID> "Tile Inc" by John P. Doe ? (?)
 <no ID> <no name> <unknown author> ? (?)
julia> DocumentTermMatrix(crps)
A 5x3 DocumentTermMatrix{Int64}
The SimpleDocument and ConferencePublication were both converted to NGramDocuments since this was the only conversion method available. If more were available, the order in which conversions are tried would be given by the code in the abstract_convert function. Generally, a single conversion method suffices.
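The idea of trying conversion targets in a fixed order can be illustrated with a small, self-contained sketch. It does not reproduce the package's abstract_convert function, whose code is not shown here; it only assumes that targets are tried one after another and that the first available conversion wins. All names below (ToyDocument, ToyString, ToyNGram, ToyCustom, TARGET_PRIORITY, convert_by_priority) are hypothetical and use only Base Julia.

```julia
# Hypothetical stand-ins; none of these names come from StringAnalysis.
abstract type ToyDocument end

struct ToyString <: ToyDocument
    text::String
end

struct ToyNGram <: ToyDocument
    ngrams::Dict{String,Int}
end

# A user-defined document type, analogous to SimpleDocument above.
struct ToyCustom <: ToyDocument
    text::String
end

# The only conversion defined for ToyCustom targets ToyNGram.
function Base.convert(::Type{ToyNGram}, doc::ToyCustom)
    counts = Dict{String,Int}()
    for word in split(doc.text)
        key = String(word)
        counts[key] = get(counts, key, 0) + 1
    end
    return ToyNGram(counts)
end

# Assumed priority order: targets are tried left to right.
const TARGET_PRIORITY = (ToyString, ToyNGram)

function convert_by_priority(doc::ToyDocument)
    for T in TARGET_PRIORITY
        try
            return convert(T, doc)
        catch err
            # A MethodError is treated as "no conversion to T";
            # any other error is propagated unchanged.
            err isa MethodError || rethrow()
        end
    end
    error("Could not convert the $(typeof(doc)) to any known document type.")
end

converted = convert_by_priority(ToyCustom("a simple document a"))
```

Here the ToyString target is skipped because no conversion to it is defined, so the result is a ToyNGram. Adding a `Base.convert(::Type{ToyString}, doc::ToyCustom)` method would make that earlier target win instead, which is why a single conversion method is usually enough in this sketch.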