Extending the document model
Sometimes it makes sense to define new document types to operate on and to use only part of the functionality of this package. For example, let us define two new document types: a SimpleDocument with no metadata,
julia> using StringAnalysis
julia> struct NoMetadata <: AbstractMetadata end
julia> struct SimpleDocument{T<:AbstractString} <: AbstractDocument{T, NoMetadata}
           text::T
       end
and a ConferencePublication with only a limited number of metadata fields.
julia> struct ConferenceMetadata <: AbstractMetadata
           name::String
           authors::String
           conference::String
       end
julia> struct ConferencePublication{T<:AbstractString} <: AbstractDocument{T, ConferenceMetadata}
           text::T
           metadata::ConferenceMetadata
       end
At this point, one can create documents of the new types and store them in basic containers alongside the standard documents of the package:
julia> sd = SimpleDocument("a simple document")
A Main.ex-index.SimpleDocument{String}
julia> cmd = ConferenceMetadata("Tile Inc.","John P. Doe","IEEE Conference on Unknown Facts")
Main.ex-index.ConferenceMetadata("Tile Inc.", "John P. Doe", "IEEE Conference on Unknown Facts")
julia> cd = ConferencePublication("publication text", cmd)
A Main.ex-index.ConferencePublication{String}
julia> doc = StringDocument("a document")
A StringDocument{String}
julia> docs = [sd, cd, doc]
3-element Array{AbstractDocument{String,M} where M<:AbstractMetadata,1}:
A Main.ex-index.SimpleDocument{String}
A Main.ex-index.ConferencePublication{String}
A StringDocument{String}
However, creating a Corpus fails because no conversion method exists between the new document types and any of the standard ones (StringDocument, NGramDocument, etc.).
julia> Corpus(AbstractDocument[sd, cd, doc])
ERROR: Could not convert the Main.ex-index.SimpleDocument{String} to any GenericDocument type.
By defining at least one conversion method to a known type,
julia> Base.convert(::Type{NGramDocument{String}}, doc::SimpleDocument) =
           NGramDocument{String}(doc.text)
julia> Base.convert(::Type{NGramDocument{String}}, doc::ConferencePublication) = begin
           new_doc = NGramDocument{String}(doc.text)
           new_doc.metadata.name = doc.metadata.name
           new_doc.metadata.author = doc.metadata.authors
           new_doc.metadata.note = doc.metadata.conference
           return new_doc
       end
the Corpus can be created and the rest of the functionality of the package, such as numerical operations, can be applied to the document data.
julia> crps = Corpus(AbstractDocument[sd, cd, doc])
A Corpus with 3 documents
julia> metadata.(doc for doc in crps)
3-element Array{DocumentMetadata,1}:
<no ID> <no name> <unknown author> ? (?)
<no ID> "Tile Inc" by John P. Doe ? (?)
<no ID> <no name> <unknown author> ? (?)
julia> DocumentTermMatrix(crps)
A 5x3 DocumentTermMatrix{Int64}
The SimpleDocument and ConferencePublication were both converted to NGramDocuments since this was the only conversion method available. If more were available, the priority of conversion would be given by the code in the abstract_convert function. Generally, a single conversion method suffices.
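To make the idea of a conversion priority more concrete, here is a minimal, self-contained sketch of how such a fallback could work. It is not the package's abstract_convert implementation: all names prefixed with Toy or toy_, as well as the TOY_PRIORITY list and its ordering, are hypothetical stand-ins chosen for illustration, and the real function may order or select target types differently.

```julia
# Toy stand-ins for document types; these are NOT the StringAnalysis types.
abstract type ToyDocument end

struct ToyStringDocument <: ToyDocument
    text::String
end

struct ToyNGramDocument <: ToyDocument
    ngrams::Dict{String,Int}
end

# A user-defined type that the toy "package" knows nothing about.
struct ToySimpleDocument <: ToyDocument
    text::String
end

# The single conversion method supplied by the user: count whitespace-separated tokens.
function Base.convert(::Type{ToyNGramDocument}, doc::ToySimpleDocument)
    counts = Dict{String,Int}()
    for token in split(doc.text)
        key = String(token)
        counts[key] = get(counts, key, 0) + 1
    end
    return ToyNGramDocument(counts)
end

# Hypothetical priority order: earlier target types are tried first.
const TOY_PRIORITY = (ToyStringDocument, ToyNGramDocument)

# Return the document unchanged if it already is a known type; otherwise use
# the first target type for which a `convert` method is applicable.
function toy_abstract_convert(doc::ToyDocument)
    for T in TOY_PRIORITY
        doc isa T && return doc
    end
    for T in TOY_PRIORITY
        if applicable(convert, T, doc)
            return convert(T, doc)
        end
    end
    error("Could not convert the $(typeof(doc)) to any known document type.")
end

converted = toy_abstract_convert(ToySimpleDocument("a simple a"))
```

In this sketch only the conversion to ToyNGramDocument is defined, so that target is selected even though ToyStringDocument comes first in the list. Adding a second convert method targeting ToyStringDocument would change the result, which is why one conversion method is usually enough unless a particular target type is desired.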