Representing documents using document keys
A US patent
Documents are made up of paragraphs, which in turn are made up of sentences, which in turn are made up of words. That is the usual way to represent a document.
A more abstract view is that a document is a mixture of themes, concepts, or topics: say, 40% politics, 30% sports, and 30% entertainment.
The following patent is about representing a contract document as a sequence of clause categories, say ['preamble', 'recitals', 'term', …, 'signature']. It is almost like a genome sequence; the only difference is that instead of the letters A, T, C, and G, the contract document is made up of a sequence of clause-category symbols.
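To make the idea concrete, here is a minimal sketch of turning a contract's clauses into such a sequence. It is not the method described in the patent: the keyword rules, the `CATEGORY_KEYWORDS` table, and the function names are hypothetical stand-ins for whatever clause classifier a real system would use.

```python
# Hypothetical keyword rules for illustration only; a real system would
# more likely rely on a trained clause classifier.
CATEGORY_KEYWORDS = {
    "preamble": ["this agreement is made"],
    "recitals": ["whereas"],
    "term": ["shall commence", "shall remain in effect"],
    "signature": ["in witness whereof"],
}


def categorize_clause(clause: str) -> str:
    """Return the first category whose keyword appears in the clause."""
    text = clause.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"  # fallback when no rule matches


def document_key(clauses: list[str]) -> list[str]:
    """Represent a document as the ordered sequence of its clause categories."""
    return [categorize_clause(clause) for clause in clauses]


contract = [
    "This Agreement is made on 1 January between Party A and Party B.",
    "WHEREAS Party A wishes to engage Party B for services;",
    "This Agreement shall commence on the Effective Date.",
    "IN WITNESS WHEREOF the parties have signed this Agreement.",
]

key = document_key(contract)
print(key)
```

The resulting list plays the role of the genome-like sequence: the wording of each clause is dropped and only the ordered category symbols remain.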
This terse representation is effective in various downstream machine-learning-based natural language processing workflows, such as document similarity and contract type classification.
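As an illustration of the document-similarity use case, two contracts can be compared directly on their clause-category sequences. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as one possible sequence-similarity measure; that choice, the two example sequences, and the `key_similarity` helper are assumptions made for illustration, not something taken from the patent.

```python
from difflib import SequenceMatcher


def key_similarity(key_a: list[str], key_b: list[str]) -> float:
    """Similarity in [0, 1] between two clause-category sequences."""
    # autojunk is disabled so that frequent categories are never ignored.
    return SequenceMatcher(None, key_a, key_b, autojunk=False).ratio()


# Hypothetical document keys for two contracts.
service_agreement = ["preamble", "recitals", "term", "payment", "signature"]
nda = ["preamble", "recitals", "term", "confidentiality", "signature"]

score = key_similarity(service_agreement, nda)
print(round(score, 2))
```

Because the comparison works on a handful of symbols rather than full text, it stays cheap even for long contracts. The same sequences could also serve as input features for a contract type classifier.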
Here is the full patent.
This info was originally published on LinkedIn.