Artificial intelligence is increasingly showing its potential, yet we are only slowly benefiting from it. Many processes are still designed around a human approach. That makes sense for processes that genuinely depend on human interaction, but many processes don’t require human interaction at all. One of those is the classification of a written human response. In this blog article I’ll explain the basics behind Text Classification and its potential benefits using an example.
For instance, think about an online customer service form, in which you need to select a category from a list of options. It seems odd that you still need to manually select a category, while technology has already proved (relatively) successful in dealing with this situation. The process of automatically providing one or more categories for a given body of text is called Text Classification, a subfield of Natural Language Processing.
Choosing the right category is difficult for people
Let’s explain the process using an example. A municipality’s website has an online customer service form used to collect complaints about all sorts of nuisance. Some complaints are about noise or stench, some about pests and some about deferred road maintenance. In total, people can choose from 19 different categories when submitting their complaint. Also assume that each category of complaints is handled by a different department. It’s likely that people who fill out the form don’t always select the right category, either because they honestly don’t know which category their complaint belongs to, or because they don’t take the time to select the right one. As you can imagine, this wastes time as service reps forward the complaint from department to department.
The process of text classification
Text classification can automate this process by assigning the right category to a given complaint. To train and use a Text Classification model, you need enough example complaints, and for enough of those complaints you also need to know the correct category.
Figure 1: The process of text classification
The first step is to __prepare__ the raw input data. Depending on the type and structure of the dataset, different preprocessing tasks can take place, for instance splitting one field into two separate fields or changing the format or case of a field.
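As a rough illustration of such a preparation step, here is a minimal Python sketch. The raw format (`<category>|<free text>` in a single field) and the function name `preprocess` are assumptions made for this example, not the format of the municipality's actual data.

```python
import re

def preprocess(raw_record):
    """Normalize one raw form submission into a clean record."""
    # Assumed raw format: "<category>|<free text>" stored in a single field.
    category, _, text = raw_record.partition("|")
    # Change the case and collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text.strip().lower())
    return {"category": category.strip().lower(), "text": text}

record = preprocess("Noise | The neighbours  play LOUD music\nevery night ")
print(record)
```

The single field is split into two separate fields, and the text is lowercased so that "LOUD" and "loud" are later treated as the same word.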
The second step is called __feature extraction__, which is the process of deriving key elements or phrases from the input data that still represent the data accurately. The result is a set of features that will be used throughout the rest of the process. A complaint could be a long paragraph of user input, while the features are the key words used within that complaint. In our example we used 2000 consumer complaints. Through feature extraction we derive the top 3000 features from those complaints.
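One common way to do this is a bag-of-words approach: count how often each word occurs across all complaints, keep the most frequent ones as the vocabulary, and describe every complaint by which of those words it contains. The sketch below assumes that approach; the helper names and the three toy complaints are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())

def top_features(documents, n):
    """Return the n most frequent words across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(n)]

def extract_features(text, vocabulary):
    """Describe a text by which vocabulary words it contains."""
    words = set(tokenize(text))
    return {word: (word in words) for word in vocabulary}

# Toy data, invented for this example.
complaints = [
    "Loud music from the bar every night",
    "The road has a deep pothole",
    "Music and shouting all night long",
]
vocabulary = top_features(complaints, n=3)
features = extract_features("Pothole in the road near the bar", vocabulary)
print(vocabulary)
print(features)
```

Note that a very common word such as "the" ends up in the vocabulary here. In practice you would probably filter out such stop words, since they say little about the category.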
Now that we have prepared the data, we can use a __machine learning__ algorithm to create the __classification model__ (or classifier for short), which can be used on any new unstructured user complaint. There are a lot of different pattern recognition algorithms to choose from; in our case we will use the Naive Bayes algorithm. To create our text classifier we need the example training data mentioned earlier and the validated list of categories that all input data should be mapped to. The result is a text classification model we can use to classify new (still unclassified) user complaints. The analysis of a new complaint needs to follow the same process, from data preprocessing to feature extraction and finally the validation against the classification model. The final result is a category the algorithm proposes based on the analysis of the input text.
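To give an idea of what happens inside such a classifier, here is a small from-scratch sketch of a multinomial Naive Bayes model with add-one smoothing. It is not the implementation used for the example in this article; the class name, the training complaints and the two categories are all invented for illustration, and in practice you would most likely use an existing library.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesClassifier:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # complaints per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocabulary = set()

    def train(self, labelled_texts):
        for text, category in labelled_texts:
            self.class_counts[category] += 1
            for word in tokenize(text):
                self.word_counts[category][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.class_counts.values())
        best_category, best_score = None, -math.inf
        for category, doc_count in self.class_counts.items():
            # Start from the log prior, then add log P(word | category).
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # words never seen in training are ignored
                count = self.word_counts[category][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_category, best_score = category, score
        return best_category

# Toy training data, invented for this example.
training_data = [
    ("loud music from the bar every night", "noise"),
    ("neighbours shouting and loud parties", "noise"),
    ("deep pothole in the road", "maintenance"),
    ("broken pavement and damaged road surface", "maintenance"),
]
classifier = NaiveBayesClassifier()
classifier.train(training_data)
prediction = classifier.classify("There is a huge pothole in our road")
print(prediction)
```

The new complaint goes through the same tokenization as the training data, and the classifier proposes the category whose words best match it.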
Once trained, the classifier can also provide a list of the most informative features in its training set. For instance, our classifier was able to determine that the word "speed" appears 68 times more often in the category deferred maintenance compared to noise disturbance. Patterns like these can take a while for humans to find but for machine learning algorithms they are found in seconds. This list of most informative features can provide key insights to both your organization and to end users.
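A ratio like this can be understood as a comparison of how likely a word is in one category versus another. The sketch below shows one way to compute such a ratio from word counts, using add-one smoothing so that a word missing from one category does not cause a division by zero. The counts are made up and far smaller than in our real dataset, and the function name is an assumption for this example.

```python
from collections import Counter

def likelihood_ratio(word, counts_a, counts_b, vocabulary_size):
    """How many times more likely is `word` in category A than in B?"""
    # Add-one smoothed estimate of P(word | category).
    p_a = (counts_a[word] + 1) / (sum(counts_a.values()) + vocabulary_size)
    p_b = (counts_b[word] + 1) / (sum(counts_b.values()) + vocabulary_size)
    return p_a / p_b

# Made-up word counts per category, for illustration only.
maintenance = Counter(speed=9, road=5, night=1)
noise = Counter(music=8, night=6, road=1)
vocabulary = set(maintenance) | set(noise)

# Rank all words by how strongly they point to "maintenance".
ranked = sorted(
    ((word, likelihood_ratio(word, maintenance, noise, len(vocabulary)))
     for word in vocabulary),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, ratio in ranked:
    print(f"{word:8s} {ratio:5.1f} : 1")
```

Words with a ratio far above 1 are strong indicators for the first category; words with a ratio far below 1 point to the second.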
The accuracy of the prediction
Of course, a text classification model won’t be accurate all the time. It’s still an analysis based on existing data and needs to be retrained every now and again. However, you can create a valuable model in a very short amount of time. In the above-mentioned example we reached an accuracy of 82% after just an afternoon of work. With some more time and energy you can get an even more accurate model, although that does depend on many different variables.
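Accuracy here simply means the share of complaints for which the predicted category matches the correct one, measured on complaints that were not used for training. A minimal sketch, assuming a held-out list of labelled complaints; the keyword-based stand-in classifier and the data are invented so that the example runs on its own.

```python
def accuracy(classify, labelled_texts):
    """Fraction of texts for which the predicted category is correct."""
    correct = sum(
        1 for text, category in labelled_texts if classify(text) == category
    )
    return correct / len(labelled_texts)

def keyword_classifier(text):
    # Stand-in for a trained model, used only to make this example runnable.
    return "noise" if "loud" in text.lower() else "maintenance"

# Held-out complaints with their correct category (made up).
held_out = [
    ("Loud music at night", "noise"),
    ("Pothole in the road", "maintenance"),
    ("Barking dog all day", "noise"),
    ("Broken street light", "maintenance"),
    ("Loud trucks on damaged road", "maintenance"),
]
score = accuracy(keyword_classifier, held_out)
print(f"{score:.0%}")
```

Measuring on held-out complaints matters: a model evaluated on the same data it was trained on will usually look better than it really is.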
So you may ask what’s the use of a model that’s ‘only’ 82% accurate. Why can’t it be 100% accurate? Well, that depends on what you put into the equation. If you trust the model enough, it can help consumers by prefilling the category after they have written their complaint. That way they don’t need to select a category themselves. This improves the user experience and reduces the time consumers spend on the task. Further, the municipality can handle complaints faster, which improves the performance of the departments since they no longer have to redirect wrongly labelled complaints.
Still, 82% accuracy might already be an enormous improvement if your consumers pick the wrong category more than 18% of the time. Studies (B. Schwartz, 2005 or S. Iyengar, 2000) have indicated that choice paralysis is linearly correlated with the number of possible choices. When the list approaches 19 options, as in our case, the accuracy of humans is often lower than 82%. In this case, using automated text classification both decreases the error rate and relieves consumers of the burden of choice.
Natural Language Processing gives us the opportunity to automate processes that handle written user input. Using text classification we were able to predict the category for a given body of text. This not only reduces the processing time of your customer service team but also changes the way we think about the user experience of input forms. There are of course many more examples in which text classification could be very useful. Any place where user input needs to be processed could benefit from Natural Language Processing, and Text Classification in particular.
Do you want to know more about Machine Learning, Natural Language Processing or Text Classification? Drop us a line or get in touch with me at firstname.lastname@example.org.