If you use Duolingo, you might have noticed this too: Duolingo doesn’t care about punctuation. As a former English teacher and grammar nerd, I was bugged by the lack of punctuation. But as a smartphone user, I was grateful that I didn’t need to find ¿ on the keyboard.
So Why Do Chatbots Ignore Punctuation?
Chatbots ignore punctuation because their programs rely on tokenization. Tokens are the chopped-up pieces of text that allow a chatbot to determine what the user wants and give an appropriate response.
So What Is Tokenization?
Tokenization is the process of breaking sentences into individual “tokens.” Programmers choose how to define a token: as a single word or as a phrase. Usually, the most effective splitting is done by words.
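To make word-level splitting concrete, here is a minimal sketch in Python. It assumes the simplest possible rule, splitting on spaces; the function name tokenize is just an illustration, not what Duolingo or any particular chatbot actually runs.

```python
def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens wherever there is whitespace."""
    return sentence.split()


tokens = tokenize("Usually, the most effective splitting is done by words.")
print(tokens)
# ['Usually,', 'the', 'most', 'effective', 'splitting', 'is', 'done', 'by', 'words.']
```

Notice that the comma and the period are still glued to their words, and that "Usually," keeps its capital letter. Splitting alone leaves that clutter in place.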
The next part of tokenization is deciding how to standardize the text and eliminate distractions. English teachers might cringe at these steps. However, they help ensure that the chatbot reads the request correctly and that the information provided to the consumer is accurate.
How Does Tokenization Work?
To program chatbots so they can use language efficiently and accurately, developers will pick and choose between the following steps:
Lowercase everything. A chatbot does not need to decipher whether a Capital letter was an acCident or is required if all uppercase letters are changed to lowercase. sure, readers want this sentence capitalized, but chatbots have immunity from a teacher’s red pen.
Remove punctuation. This took me aback when I first started using Duolingo. I expected to use a comma in this situation:
Senor, hablas espanol
But Duolingo doesn’t care—as long as I use the correct verb form (hablas instead of hablo), I earn the points.
We expect punctuation when we read and write. But when we speak, the only time most of us use punctuation is the two-fingered “quote.”
Removing punctuation makes it easier for the chatbot to recognize words.
Eliminate stop words. Words that contribute little to the meaning can be filtered out. These include articles (“a,” “an,” “the”) and simple linking verbs (like “is” or “was”).
Not all small words are stop words. For example, “and” indicates an addition while “but” suggests an opposite idea.
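The three steps above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the STOP_WORDS set is a tiny hypothetical list built from the examples in this post (real lists are much longer), and the normalize function is my own name, not a real chatbot's code.

```python
import string

# Hypothetical, deliberately tiny stop-word list taken from the examples above.
STOP_WORDS = {"a", "an", "the", "is", "was"}

# ASCII punctuation plus the inverted Spanish marks, all mapped to "delete".
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation + "¿¡")


def normalize(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, split into words, drop stop words."""
    lowered = text.lower()
    no_punctuation = lowered.translate(PUNCTUATION_TABLE)
    tokens = no_punctuation.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(normalize("Senor, hablas espanol"))
# ['senor', 'hablas', 'espanol']
print(normalize("The cat is small, but the dog was big."))
# ['cat', 'small', 'but', 'dog', 'big']
```

In the second example, "but" survives because it is not in the stop-word list, which is exactly the point: it carries meaning that "the" and "is" do not.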
And there’s more. Developers can stem tokens, which means chopping words down to their roots, although this is becoming less common. Numbers need to be turned into full-length words, and spelling variants need to be standardized. For example, the chatbot needs to take any of these (email, e-mail, e mail) and consistently use one.
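Here is a rough Python sketch of two of those extra steps: settling on one spelling of "email" and spelling out numbers. The names standardize, EMAIL_VARIANTS, and NUMBER_WORDS are hypothetical, and I have assumed the simplest case, single digits only. Stemming is left out because it usually relies on a dedicated library.

```python
import re

# Hypothetical lookup table; only single digits are handled in this sketch.
NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Matches "email", "e-mail", and "e mail" as a whole word.
EMAIL_VARIANTS = re.compile(r"\be[\s-]?mail\b")

# Matches a digit standing alone, so "2" matches but "25" does not.
SINGLE_DIGIT = re.compile(r"\b\d\b")


def standardize(text: str) -> str:
    """Lowercase the text, unify spellings of 'email', and spell out lone digits."""
    text = text.lower()
    text = EMAIL_VARIANTS.sub("email", text)
    return SINGLE_DIGIT.sub(lambda match: NUMBER_WORDS[match.group()], text)


print(standardize("Send 2 E-mail messages"))
# send two email messages
print(standardize("E mail me"))
# email me
```

Whichever spelling the developer picks matters less than picking one, so that the chatbot treats all three versions as the same token.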
So there you have it: a chatbot, which is a type of AI I explore in other posts, can usually understand what you want because tokenization has created a type of grammar for machine learning.