Machine Learning: Myths and Realities
In the automatic document classification space, there is a tendency to believe that machine learning is always the ideal way to automate content classification. There is a further belief that it scales easily across an organization’s e-content holdings, automating classification across decades of legacy content. This may sound reasonable; however, things are not quite that simple. In this article, I hope to address some myths and realities about the present state of machine learning in the content classification space.
Myth 1: Machine Learning is Suitable for Every Metadata Element
As machine learning in the content classification space currently exists, it is best used with fields that have fifteen or fewer potential values. As you scale to more complex taxonomies, the effectiveness of this approach suffers, and training a precise predictive classification model becomes more complex. For it to work effectively, there need to be clear statistical differences between the documents assigned to different labels/values. If those differences don’t exist, the precision of machine learning will be lower. Further, precision decreases as the number of terms you’re training against increases. If you have a facet with hundreds of terms, machine learning as it presently exists is unsuitable. This may shift with technological advances, but we’re not there yet.
With that said, deciding where to use machine learning involves analyzing your metadata elements and identifying those that would be a good fit. Highly complicated lists of terms may not be good candidates. Further, if the documents tagged with different values in your term list are very similar to one another, machine learning will provide lower precision than you desire.
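As a rough illustration of that first pass over your metadata elements, the sketch below counts the distinct values used in each field of a set of already-tagged records and keeps the fields that fall at or under a threshold. The function name, the sample records, and the field names are hypothetical; the default threshold of fifteen simply mirrors the guidance above.

```python
from collections import Counter

MAX_VALUES = 15  # assumption: mirrors the "fifteen or fewer" guidance above


def candidate_fields(records, max_values=MAX_VALUES):
    """Return {field: distinct_value_count} for fields with few enough values."""
    values_by_field = {}
    for record in records:
        for field, value in record.items():
            values_by_field.setdefault(field, Counter())[value] += 1
    return {
        field: len(counts)
        for field, counts in values_by_field.items()
        if len(counts) <= max_values
    }


# Hypothetical sample of already-tagged documents.
records = [
    {"doc_category": "Contract", "project_code": "P-0001"},
    {"doc_category": "Invoice", "project_code": "P-0002"},
    {"doc_category": "Contract", "project_code": "P-0003"},
]
print(candidate_fields(records, max_values=2))
```

A low value count is only a first filter. A field that passes still needs documents that differ clearly from one value to the next, which this count cannot tell you.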
Myth 2: Machine Learning is the Only Reasonable Form of Automating Content Classification.
Different fields will lend themselves to different approaches for automated content classification. In some cases, machine learning is appropriate. In other cases, a Knowledge Architecture type approach, which uses explicit rules written as Regular Expressions to identify the existence of concepts in documents, may provide far more precise results. This requires upfront analysis to identify the rules. For example, a particular security classification may be assigned where a document contains matches on both a Social Insurance Number/Social Security Number and a Postal Code/Zip Code.
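The rule in that example could be sketched along the following lines. The patterns are simplified assumptions about how these identifiers are commonly written (they do not validate check digits or real postal ranges), and the "Protected" label is a hypothetical stand-in for whatever security classification your organization uses.

```python
import re

# Simplified patterns, for illustration only.
# SSN as 123-45-6789; SIN as 123 456 789 or 123-456-789.
SIN_OR_SSN = re.compile(r"\b(?:\d{3}-\d{2}-\d{4}|\d{3}[- ]\d{3}[- ]\d{3})\b")
# Canadian postal code as A1A 1A1; US ZIP as 12345 or 12345-6789.
POSTAL_OR_ZIP = re.compile(
    r"\b(?:[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d|\d{5}(?:-\d{4})?)\b"
)


def security_classification(text):
    """Return a label only when both kinds of identifier appear in the text."""
    if SIN_OR_SSN.search(text) and POSTAL_OR_ZIP.search(text):
        return "Protected"  # hypothetical label
    return None


print(security_classification("SIN 046 454 286, mailing address H2X 1Y4"))
print(security_classification("Ship to ZIP 90210"))
```

Rules like this are easy to audit, but they are only as good as the upfront analysis behind them; loose patterns will produce false matches on things like part numbers.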
Myth 3: Autoclassification is the Only Way to Automate Content Classification.
Beyond machine learning and explicit rule identification, another classification automation approach involves semantic links between fields and values. This notion overlaps with the idea of an ontology, where you can effectively have a constellation of related metadata elements and values. By creating these links, an appropriate tool can help automatically select related values (e.g. selecting the city field value “Montreal” automatically sets the associated country field value to “Canada”). In this way, you can help reduce the number of selections that people classifying content have to make. This approach can also help automate the consistent application of both tags and compliance rules.
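A minimal sketch of such links, assuming they are stored as a simple lookup from a (field, value) pair to the values it implies. Only the Montreal/Canada link comes from the example above; the second entry, the names LINKS and apply_links, and the choice not to overwrite values a person has already selected are all assumptions made for illustration.

```python
# Each (field, value) pair implies values for other fields.
LINKS = {
    ("city", "Montreal"): {"country": "Canada"},
    ("city", "Toronto"): {"country": "Canada"},  # hypothetical extra link
}


def apply_links(metadata, links=LINKS):
    """Return a copy of metadata with linked values filled in.

    Values already present are left alone, so a person's explicit
    selection always wins over an implied one.
    """
    result = dict(metadata)
    for (field, value), implied in links.items():
        if result.get(field) == value:
            for implied_field, implied_value in implied.items():
                result.setdefault(implied_field, implied_value)
    return result


print(apply_links({"city": "Montreal"}))
```

This single pass does not follow chains of links (a value that is itself implied will not trigger further links); a real ontology-backed tool would likely handle that.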
Myth 4: Machine Learning Isn’t Mature Enough for Content Autoclassification.
There are some facets where a machine learning approach makes sense. A field where you have around fifteen abstract categories, such as in a Document Category facet, may be a good candidate for this technology. It may not be suitable everywhere, but it is certainly suitable for more abstract field values with a limited number of categories.
Myth 5: Without a Global Machine Learning Approach, Document Autoclassification Isn’t Ready.
For a given document’s metadata classification, you can use a combination of approaches to help automate the classification experience. Machine learning, explicit rules and semantic relationships between facets are not mutually exclusive. Further, on any given field, you may decide that a combination of approaches makes sense, depending on the scenario. You may not be able to automate the selection of every value in a model, but you can certainly make classification far easier than it is with traditional approaches.
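One way to picture this combination is a per-field list of strategies that are tried in order, with anything left unresolved handed back to a person. Everything in the sketch below is hypothetical: the field names, the labels, and in particular model_category, which is a keyword check standing in for a trained classifier.

```python
import re


def rule_security(doc, metadata):
    """Explicit rule: flag text containing an SSN-like pattern (illustrative)."""
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", doc):
        return "Protected"
    return None


def model_category(doc, metadata):
    """Stand-in for a trained classifier; a real model would go here."""
    return "Invoice" if "invoice" in doc.lower() else None


def link_country(doc, metadata):
    """Semantic link: the city value implies a country value."""
    return {"Montreal": "Canada"}.get(metadata.get("city"))


# Each field lists its strategies in priority order; a field may have several.
STRATEGIES = {
    "security": [rule_security],
    "doc_category": [model_category],
    "country": [link_country],
}


def autoclassify(doc, metadata, strategies=STRATEGIES):
    """Fill each empty field with the first strategy result that is not None."""
    filled = dict(metadata)
    unresolved = []
    for field, candidates in strategies.items():
        if field in filled:
            continue  # keep values a person has already chosen
        for strategy in candidates:
            value = strategy(doc, filled)
            if value is not None:
                filled[field] = value
                break
        else:
            unresolved.append(field)  # no strategy produced a value
    return filled, unresolved


filled, unresolved = autoclassify("Invoice for services", {"city": "Montreal"})
print(filled, unresolved)
```

The unresolved list is the point of the design: the fields that cannot be automated are still surfaced for manual selection rather than guessed.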
Machine learning in and of itself is not a panacea for all content classification scenarios and should not be considered in isolation. Each facet/field needs proper analysis, against relevant content, to identify the appropriate automation solution(s) for that field. A combination of strategies can help you move your organization towards consistently tagged content that meets compliance requirements in a more predictable way.