This model won the 2011 ICDAR competition, and arguably would have won the 2013 ICDAR as well had the right preprocessing code been submitted. It is called the Multi-Column Deep Convolutional Neural Network (MCDNN). The best MCDNN combines 8 CNNs, each 11 layers deep; the only preprocessing is some normalization of the input image, nothing more.
This model claims to be the first to surpass human performance on the MNIST and CASIA datasets. Its architecture is as follows:

Their model can be divided into three parts: sample generation, the CNNs, and voting.
First, in the sample generation part, they apply distortion both locally and globally. They did not use normalization here, since it conflicts with distortion. The local distortion adds small displacements to the original image in the x and y coordinates and in gray-scale value, then applies Gaussian smoothing and resamples the image with bilinear interpolation. The global distortion consists of global transformations such as scaling and rotation. The distorted images are then used to train the CNNs.
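The local distortion step can be sketched as follows. This is a minimal numpy sketch, not the authors' code: the displacement-field construction (uniform noise, separable Gaussian smoothing, then bilinear resampling) follows the description above, but the `alpha` and `sigma` values are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth(field, sigma):
    # Separable Gaussian smoothing of a 2-D displacement field.
    k = gaussian_kernel(sigma, radius=int(3 * sigma))
    field = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 0, field)
    field = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, field)
    return field

def elastic_distort(img, alpha=8.0, sigma=4.0, rng=None):
    """Local distortion: random x/y displacements, Gaussian-smoothed,
    then bilinear interpolation to resample the image."""
    rng = np.random.default_rng(rng)
    h, w = img.shape
    dx = smooth(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = smooth(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + dx, 0, w - 1)
    y = np.clip(ys + dy, 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Bilinear interpolation of the four neighbouring pixels.
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bot = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bot * fy
```

A fresh random displacement field per epoch gives each CNN a slightly different view of the same training character.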
For CASIA, they used a 15-layer network, represented as:
In training, they mainly used dropout and multi-supervised training, the latter being crucial for their model to converge. This technique is worth noting, as it also appears in several very large convnets: auxiliary classifiers are attached to intermediate layers in order to strengthen the gradient signal that gets propagated back, and to provide additional regularization.
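The multi-supervised loss can be sketched as a weighted sum of the final classifier's loss and the auxiliary classifiers' losses. This is a minimal numpy sketch under my own assumptions; the auxiliary weight of 0.3 is illustrative, not a value from the paper.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def multi_supervised_loss(main_logits, aux_logits_list, label, aux_weight=0.3):
    """Deep supervision: the final classifier's loss plus down-weighted
    losses from auxiliary classifiers attached to intermediate layers.
    The aux_weight value is an illustrative assumption."""
    loss = cross_entropy(main_logits, label)
    for aux in aux_logits_list:
        loss += aux_weight * cross_entropy(aux, label)
    return loss
```

Because every auxiliary head receives its own loss, intermediate layers get a direct gradient signal even when the path from the final layer is long.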
The last step is multi-model voting. They vote with 5 models sharing the same architecture, which yields higher accuracy. Theoretically the models should complement each other, since they are trained on slightly different (randomly distorted) data and may have reached different local optima; in practice this strategy lowered the error rate by 0.2%. However, I think the process costs too much extra memory and time.
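The voting step amounts to combining per-class outputs across models. A minimal sketch, assuming soft voting (averaging the models' class probabilities and taking the argmax); whether the paper averages probabilities or counts hard votes is not spelled out here.

```python
import numpy as np

def ensemble_vote(prob_list):
    """Soft voting: average the per-class probability vectors of several
    independently trained models and return the winning class index."""
    return int(np.argmax(np.mean(prob_list, axis=0)))
```

Because the models were trained on differently distorted samples, their errors are partly decorrelated, which is why averaging helps at all.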
The major difference of this new model is that it uses neither data augmentation nor a model ensemble, which makes it far more lightweight than the two models before. As preprocessing, they represent the characters by normalization-cooperated direction-decomposed feature maps (directMap), which can be viewed as a d × n × n sparse tensor.
For direction decomposition, they use the Sobel operator, then decompose the gradient direction into its two adjacent standard chaincode directions by the parallelogram rule. The gradient elements are mapped directly onto the directional maps. That is why it is called normalization-cooperated: instead of generating normalized images, they use normalization as a mapping from the gradients of the original images to the directMaps.
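The gradient extraction and parallelogram decomposition can be sketched as below. This is a minimal numpy illustration, assuming d = 8 chaincode directions at 45° steps; the normalization mapping itself is omitted, so this only shows how a gradient vector is split between its two adjacent directions.

```python
import numpy as np

# Eight standard chaincode directions (unit vectors at 45-degree steps).
DIRS = np.array([(np.cos(k * np.pi / 4), np.sin(k * np.pi / 4)) for k in range(8)])

def sobel_gradients(img):
    """Sobel x/y gradients via explicit 2-D cross-correlation (no padding)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return gx, gy

def decompose(gx, gy):
    """Split each gradient vector into its two adjacent chaincode
    directions by the parallelogram rule, producing 8 direction maps."""
    h, w = gx.shape
    maps = np.zeros((8, h, w))
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    lower = (angle // (np.pi / 4)).astype(int) % 8
    upper = (lower + 1) % 8
    for i in range(h):
        for j in range(w):
            a, b = lower[i, j], upper[i, j]
            # Solve g = s*DIRS[a] + t*DIRS[b]; s, t >= 0 by construction.
            M = np.column_stack([DIRS[a], DIRS[b]])
            s, t = np.linalg.solve(M, [gx[i, j], gy[i, j]])
            maps[a, i, j] += s
            maps[b, i, j] += t
    return maps
```

Each gradient contributes to at most two of the d maps, which is why the resulting d × n × n tensor is sparse.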
The resulting directMaps are then fed to a single 11-layer CNN.
They’ve also proposed an adaptation layer, but that is more of a hyper-parameter choice, so we can come back to it later.