How testing is done?
The basic idea is simple: we take a dataset of English sentences, corrupt these sentences by introducing artificially generated typos, and then measure how well these typos are corrected.
Artificial typos
To introduce typos in the clean text, we simulate all possible typos that a human typing on a mobile keyboard could make. This includes:
- Character additions / deletions
- Character transpositions
- Accent simplifications
- Case simplifications
- Fat-finger syndrome (fuzzy typing)
- Common typos (sampled from a dataset of most common typos)
We use the following typo rates:
- Character transpositions: 1% of all characters
- Character additions: 0.5% of all characters
- Character deletions: 0.5% of all characters
- Space deletions: 1% of all space characters
- Symbol deletions: 10% of symbol characters
- Accent simplification: 8% of accented characters
- Case simplification: 8% of uppercased characters
- Common typos: 5% of words
With these rates, we obtain an overall typo rate of 12%.
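As an illustration, here is a minimal Python sketch of how such per-character rates could be applied. The constants and the `corrupt` function below are ours, for illustration only, not kebbie's actual implementation:

```python
import random

# Rates taken from the list above (simplified sketch, only character-level typos).
TRANSPOSITION_RATE = 0.01   # 1% of all characters
ADDITION_RATE = 0.005       # 0.5% of all characters
DELETION_RATE = 0.005       # 0.5% of all characters
SPACE_DELETION_RATE = 0.01  # 1% of space characters


def corrupt(sentence: str) -> str:
    """Introduce character-level typos at the rates defined above."""
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        r = random.random()
        if c == " " and r < SPACE_DELETION_RATE:
            i += 1                                  # space deletion: skip the space
        elif r < DELETION_RATE:
            i += 1                                  # character deletion
        elif r < DELETION_RATE + ADDITION_RATE:
            out.extend([c, c])                      # character addition (naive: double the key)
            i += 1
        elif r < DELETION_RATE + ADDITION_RATE + TRANSPOSITION_RATE and i + 1 < len(chars):
            out.extend([chars[i + 1], c])           # transposition with the next character
            i += 2
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(corrupt("He went hiking and said he'd think about it."))
```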
Sources
These rates come from studies on real human typing habits: Reference #1, Reference #2.
In particular, Reference #1 (which focuses on mobile device typing) shows that typing on mobile devices leads to 2.3% of uncorrected errors (see the introduction) and 8% of words being autocorrected (see Intelligent text entry, page 8), for an overall typo rate of 10.3%.
Details
In addition to these typo rates, we further modify the probabilities:
- FRONT_DELETION_MULTIPLIER is used to reduce the probability of a deletion happening on the first character of a word. This number was computed after analyzing the Twitter typo corpus (see this script)
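For illustration, applying such a multiplier could look like the sketch below; the constant's value and the helper function are hypothetical, not the ones computed from the corpus:

```python
DELETION_RATE = 0.005            # base per-character deletion rate (from the list above)
FRONT_DELETION_MULTIPLIER = 0.1  # hypothetical value, for illustration only


def deletion_probability(position_in_word: int) -> float:
    """Deletion probability for a character, reduced for the first character of a word."""
    if position_in_word == 0:
        return DELETION_RATE * FRONT_DELETION_MULTIPLIER
    return DELETION_RATE
```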
Here are a few examples of sentences before and after introducing typos:
Clean sentence | Corrupted sentence | Typos introduced |
---|---|---|
He went hiking and said he'd think about it; never came back. | He went hikimg and said hed think about it; never came back. | Fuzzy typing & Symbol deletion |
Like, what you're doing here and what all this stuff is. | Like, what you're doinghere and waht all this stuff is. | Space deletion & Character transposition |
You must do something about yourself. | You must do something about yourself. | |
That's the way to get rid of pests like that. | That's the waj to get rid of pedts like thhat. | Common typo & Fuzzy typing & Character addition |
He obviously wanted an ally. | he obviously wanted an ally. | Case simplification |
This is all we got between us and the Almighty! | This is lal we got beween us and the Almgihty! | 2 x Character transposition & Character deletion |
Swipe gesture generation
For the task of swipe gesture resolution, the input is not simple text: we need to generate a swipe gesture.
When generating fuzzy-typing typos, we sample key-tap positions on the keyboard using Gaussian distributions, and use these positions to determine whether the correct character was typed or a neighboring key was hit instead.
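A minimal sketch of this sampling, assuming made-up key coordinates and an illustrative standard deviation:

```python
import random

# Hypothetical key centers for a small portion of a QWERTY layout (keyboard coordinates).
KEY_CENTERS = {"q": (0.0, 0.0), "w": (1.0, 0.0), "e": (2.0, 0.0), "a": (0.25, 1.0), "s": (1.25, 1.0)}
SIGMA = 0.3  # spread of the Gaussian around the intended key (illustrative value)


def tap(intended_key: str) -> str:
    """Sample a tap position around the intended key and return the key actually hit."""
    cx, cy = KEY_CENTERS[intended_key]
    x, y = random.gauss(cx, SIGMA), random.gauss(cy, SIGMA)
    # The key actually typed is the one whose center is closest to the sampled position.
    return min(KEY_CENTERS, key=lambda k: (KEY_CENTERS[k][0] - x) ** 2 + (KEY_CENTERS[k][1] - y) ** 2)


print(tap("w"))  # usually "w", sometimes "q", "e" or "s" (a fat-finger typo)
```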
To generate the swipe gesture, we sample key-tap positions just like we do for fuzzy typing, and then link the keystrokes of the word using Bézier curves. Some randomness on the speed and acceleration between points is added, in order to generate more natural swipe gestures.
Here are some examples of the generated swipe gestures (in red, the keystrokes generated by fuzzy typing; in blue, the points of the corresponding swipe gesture).
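Here is a rough sketch of how two key taps could be linked with a quadratic Bézier curve; the control-point jitter and the speed randomization are illustrative choices, not necessarily the ones used by kebbie:

```python
import random


def bezier_segment(p0, p1, n_points=10):
    """Link two key taps with a quadratic Bézier curve, using a jittered control point."""
    # Control point: midpoint of the segment, offset randomly to bend the path.
    cx = (p0[0] + p1[0]) / 2 + random.uniform(-0.3, 0.3)
    cy = (p0[1] + p1[1]) / 2 + random.uniform(-0.3, 0.3)
    points = []
    for i in range(n_points + 1):
        # Non-uniform spacing of t roughly simulates varying speed/acceleration along the path.
        t = (i / n_points) ** random.uniform(0.8, 1.2)
        x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * cx + t ** 2 * p1[0]
        y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * cy + t ** 2 * p1[1]
        points.append((x, y))
    return points


# A full swipe gesture is obtained by chaining segments between consecutive key taps.
taps = [(0.0, 0.0), (2.0, 0.0), (1.25, 1.0)]  # e.g. sampled taps for a 3-letter word
gesture = [p for a, b in zip(taps, taps[1:]) for p in bezier_segment(a, b)]
```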
For the word gives: (figure)
For the word they: (figure)
Data
Test data
For the data, we use the test set of the SODA dataset.
We chose this dataset for the evaluation for several reasons:
- Recent
- Extremely clean dataset
- Covers two very distinct domains (narrative & dialogue)
Common typos dataset
As mentioned in the section Artificial typos, we rely on a dataset of common typos, and sample from it when generating plausible typos.
The dataset of common typos that we use is the Twitter Typo Corpus.
Tasks
We test the most important NLP features of a mobile keyboard. These are:
- Auto-correction: Corrects the words typed by the user.
  For example, if a user types `I’m especialy touched`, the typo should be detected and corrected to `I’m especially touched`.
- Auto-completion: Completes the word typed by the user.
  For example, if a user types `I love y`, the word should be auto-completed to `I love you`.
- Next-word prediction: Predicts the next word to be typed.
  For example, if a user types `I want to eat french`, a probable next word can be `fries`.
- Swipe gesture resolution: Predicts the intended word from a swipe gesture.
Metrics
If you look into the results from `kebbie`, for each task we have a handful of metrics that help us understand how good the tested keyboard is. Let's look at the details of these metrics.
Formulas
Next-word prediction, swipe resolution, auto-completion
For these three tasks, the metric used is Accuracy.
The formula is: accuracy = correct / total
Where `correct` is the number of correct predictions, and `total` is the total number of predictions.
For the next-word prediction and auto-completion tasks, we use top-3 accuracy as the main reference metric. It’s the same as accuracy, but instead of considering only one candidate (which is either correct or not), we consider the 3 most probable candidates (the prediction counts as correct if any of these 3 candidates matches).
The reason is that next-word and auto-completion predictions are not “forced” upon the user: 3 predictions are displayed at the top of the keyboard, and the user can choose any of them. So the correct prediction only needs to appear among these 3 displayed predictions.
For swipe resolution, however, only the best prediction is selected and applied. So we use plain accuracy as the main reference metric (and not top-3 accuracy).
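As a concrete illustration, accuracy and top-3 accuracy could be computed as follows (a sketch following the formula above, not kebbie's actual code):

```python
def accuracy(predictions, targets):
    """Ratio of predictions matching the expected word (one candidate per prediction)."""
    correct = sum(1 for pred, target in zip(predictions, targets) if pred == target)
    return correct / len(targets)


def top3_accuracy(candidate_lists, targets):
    """Ratio of cases where the expected word appears among the 3 displayed candidates."""
    correct = sum(1 for candidates, target in zip(candidate_lists, targets) if target in candidates[:3])
    return correct / len(targets)


# Example: the expected word is in the top-3 candidates for 2 out of 3 predictions.
print(top3_accuracy([["you", "it", "me"], ["fries", "food", "toast"], ["a", "the", "my"]],
                    ["you", "fries", "them"]))  # -> 0.666...
```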
Auto-correction
For auto-correction, it’s different: we have a notion of true/false positives/negatives. Let’s first define these notions:
- True Negative: No typo introduced, and the model doesn’t correct anything
- False Positive: No typo introduced, but the model (wrongly) corrects the word
- True Positive: A typo is introduced, and the model corrects the word into the expected word
- False Negative: A typo is introduced, but the model doesn’t correct anything
With an example it’s easier to visualize:
| | Word typed by the user | Word after being corrected by the model | Expected word |
|---|---|---|---|
| True Negative | love | love | love |
| False Positive | love | loev | love |
| True Positive | loev | love | love |
| False Negative | loev | loev | love |
From these notions, we can compute the following metrics: accuracy, precision, recall, and F-score, using the following formulas:
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * (precision * recall) / (precision + recall)
Note
F-score is the harmonic mean of precision and recall. It’s a way to combine precision and recall into a single metric.
Important
Actually, we use the Fβ-score, a variant of the F-score where a constant β is used to weight the precision/recall trade-off (see the Wikipedia page about the F-score).
This is useful because we value precision more.
We currently use β = 0.9, which means precision has slightly more weight than recall.
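Putting it together, the auto-correction metrics (including the Fβ-score with β = 0.9) can be computed from the four counts; here is a minimal sketch following the formulas above:

```python
def autocorrection_metrics(tp, tn, fp, fn, beta=0.9):
    """Compute accuracy, precision, recall and F-beta score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # With beta < 1, precision weighs slightly more than recall in the F-beta score.
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "fscore": f_beta}


# Example with made-up counts.
print(autocorrection_metrics(tp=70, tn=900, fp=30, fn=35))
```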
Understanding the metrics
Swipe resolution
Accuracy
- [0 - 1]
- higher is better
Accuracy is straightforward: this is the ratio of correct predictions.
So an accuracy of `0.8` means the model correctly predicted the word being swiped 80% of the time.
Next-word prediction & auto-completion
Top-3 accuracy
- [0 - 1]
- higher is better
Same as accuracy, but 3 candidates are considered.
So a top-3 accuracy of `0.6` means that the next word (or the word completion) appears among the 3 candidates predicted by the model 60% of the time.
Auto-correction
Precision
- [0 - 1]
- higher is better
Precision is the ratio of actual typos among the words corrected by the model.
So a precision of `0.7` means that among all corrections made by the model, 70% were actually typos (and 30% were correct words that didn’t need to be corrected).
A low precision means many words are corrected when they shouldn’t be; a high precision means only actual typos are corrected.
Recall
- [0 - 1]
- higher is better
Recall is the ratio of typos detected by the model.
So a recall of `0.65` means the model correctly detected 65% of typos (and 35% of typos were not corrected by the model).
A low recall is a symptom that most typos go undetected, and a high recall means most typos are detected as typos.
F-score
- [0 - 1]
- higher is better
F-score is the harmonic mean of precision and recall; it’s just a way to gather both in a single metric.
Note that we weight precision slightly more than recall.