
I Compared MNIST-Style Digits Across Languages. Mandarin Chinese Was 4x Harder to Separate


I started with a joke about "Linear Mandarin" and ended up with a measurable result: in this PCA experiment, Mandarin Chinese digits were about 4x harder to separate than English MNIST.

Introduction

I came across this tweet:

Linear algebra expression written with Chinese characters


For a second I thought it wasn't real, because the symbols look really hard to tell apart.

Which made me think that there must be a way for us to quantify that.

This immediately reminded me of one of my favorite data visualization examples: the MNIST dataset projected with PCA. So I wondered if we could explore the numeral systems of different languages with the same approach, and compare the mean distance between cluster centroids to extract a "difficulty" proxy.

The data

I looked for real MNIST-style handwritten digit datasets where the actual written symbols differ. If I could not find a real equivalent dataset, I dropped that language.

The final set was:

  • English, from OpenML MNIST 784
  • Mandarin Chinese, from a Chinese MNIST dataset
  • Hindi, using Devanagari digits
  • Arabic, using Arabic-Indic digits
  • Bengali
  • Urdu/Persian
  • Telugu

Here are representative samples from each dataset. I picked one sample per digit close to that digit's PCA centroid, so these are meant to be boring examples rather than outliers.
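
One way to pick such "boring" samples can be sketched as follows. This is a minimal sketch, assuming the 2D PCA coordinates and the labels are already available as NumPy arrays; the function name `representative_samples` and the tiny demo data are illustrative and not taken from the original code.

```python
import numpy as np

def representative_samples(Z, labels):
    """Return {label: row index} of the sample closest to each class centroid.

    Z is an (n_samples, 2) array of PCA coordinates (assumed input),
    labels is an (n_samples,) array of digit labels.
    """
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    picks = {}
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)        # rows belonging to this class
        centroid = Z[idx].mean(axis=0)               # class center in PCA space
        dists = np.linalg.norm(Z[idx] - centroid, axis=1)
        picks[int(label)] = int(idx[np.argmin(dists)])
    return picks

# Tiny synthetic demo: two classes, three points each (not real digit data).
Z_demo = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0],
                   [10.0, 10.0], [10.0, 12.0], [10.0, 11.0]])
labels_demo = np.array([0, 0, 0, 1, 1, 1])
picks_demo = representative_samples(Z_demo, labels_demo)
print(picks_demo)
```

Choosing the point nearest the centroid, rather than a random one, is what keeps outliers out of the gallery.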

English: 2,500 samples · OpenML MNIST 784. Representative sample per digit:

  • 0: sample #30
  • 1: sample #450
  • 2: sample #645
  • 3: sample #875
  • 4: sample #1169
  • 5: sample #1272
  • 6: sample #1714
  • 7: sample #1791
  • 8: sample #2048
  • 9: sample #2265

Why PCA

At this point I needed a way to make the comparison visual. This is where PCA (Principal Component Analysis) comes in.

PCA is still one of my favorite machine learning algorithms. Part of that is because it is simple enough to understand intuitively. You have high-dimensional data (e.g. a vector with pixel width x pixel height dimensions), you find the directions along which it varies the most within a dataset, and you project the data onto those directions.

If you are unfamiliar, this is a good place to get started.

When I was writing my paper that got accepted to ICMLA, I used PCA to understand whether the raw sensor data had a shape that could be useful for classification. Wrote more about that here if you are curious: How I wrote a machine learning paper in 1 week that got accepted to ICMLA.

This digit experiment is much lower stakes, obviously.

But it builds on the same idea: take something high-dimensional and make it visible.

English MNIST first

I started with normal MNIST because that is the reference point everyone knows.

Each image is 28x28 pixels, which means each sample starts as 784 dimensions. I flattened the image into a vector, normalized it, and projected the samples into two dimensions using PCA.

Mechanically, PCA does not really care that the input started as an image. It wants a table.

So the image has to be flattened first. I take row 1 of pixels, append row 2 after it, then row 3, and keep going until all 28 rows are joined together. That turns a 28 x 28 image into a single row with 784 pixel values.
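
The flattening step can be sketched in a few lines of NumPy. This is a minimal sketch on a made-up image, not the original code: the array `image` and the bright pixel placed in it are hypothetical stand-ins for a real MNIST sample.

```python
import numpy as np

# Hypothetical 28 x 28 grayscale image; a real MNIST sample would go here.
image = np.zeros((28, 28), dtype=np.uint8)
image[14, 12] = 150              # one bright "stroke" pixel at row 14, column 12

# Row-major flattening: row 0, then row 1, ..., then row 27.
x = image.reshape(-1)            # shape (784,)

# Pixel (row, col) lands at index row * 28 + col in the vector.
row, col = 14, 12
flat_index = row * 28 + col

# The same reshape works on a whole batch: (n, 28, 28) -> (n, 784).
images = np.zeros((3, 28, 28), dtype=np.uint8)
X = images.reshape(len(images), -1)

print(x.shape, flat_index, X.shape)
```

Because the reshape only reorders indices, no pixel value is changed: row 14 of the image occupies positions 392 to 419 of the vector.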

Below I used one actual MNIST sample from the dataset. It is an English 2, sample #645. The zoomed patch makes the transformation more concrete: the bright stroke pixels are just large grayscale values, and their position in the image decides their position in the vector.

1 Start with the image
English MNIST sample digit 2
This is one English MNIST sample: digit 2, sample #645. It is treated as a 28 x 28 grid of grayscale pixels. The blue box is the patch zoomed in next.
2 Zoom into pixels

    0    0    0    1    1    0
    0    0    0    0    0    5
    0    0    2   18   68  150
    1    0   17  131  229  255
    0    8  133  242  248  246
    7   63  236  255  224  183

Each square is one pixel value from 0 to 255. Dark cells are near zero; bright cells are part of the stroke.
3 Flatten row by row
Same sample #645: append rows in reading order

  rows 0-11    x[0:336]     12 rows x 28 values
  row 12       x[336:364]   cols 7-12: 0 0 0 1 1 0
  row 13       x[364:392]   cols 7-12: 0 0 0 0 0 5
  row 14       x[392:420]   cols 7-12: 0 0 2 18 68 150
  row 15       x[420:448]   cols 7-12: 1 0 17 131 229 255
  rows 16-27   x[448:784]   12 rows x 28 values

One sample vector:
x_645 = rows 0-11 (12 rows x 28 values) + row 12 (28 values) + row 13 (28 values) + row 14 (28 values) + row 15 (28 values) + rows 16-27 (12 rows x 28 values)

28 rows appended together, 28 x 28 = 784 pixel values.
Flattening does not average or mix pixels. It just joins the rows of sample #645 end to end: row 0, then row 1, all the way to row 27.
4 Stack samples
Matrix X: one flattened row per sample

  sample   label   start x[0:6]   middle                  end x[778:784]
  #30      0       0 0 0 0 0 0    772 values in between   0 0 0 0 0 0
  #450     1       0 0 0 0 0 0    772 values in between   0 0 0 0 0 0
  #645     2       0 0 0 0 0 0    772 values in between   0 0 0 0 0 0

After flattening, the dataset is a table: one sample per row and one pixel position per column. With 2,500 plotted English samples, this input matrix has shape 2500 x 784.
5 Run PCA and plot

X_centered (samples x 784) x directions (784 x 2) = Z (samples x 2)

Output Z: two coordinates per sample

  sample   label   PC1         PC2
  #30      0       z[30, 0]    z[30, 1]
  #450     1       z[450, 0]   z[450, 1]
  #645     2       z[645, 0]   z[645, 1]

Plot: PC1 against PC2 for digits 0, 1 and 2, with sample #645 marked.

Stacking all vectors creates X with shape samples x 784. Multiplying by the first two principal directions creates Z with shape samples x 2. Those two values are the PC1 and PC2 coordinates.

After that, the dataset is just a matrix. Each row is one digit sample. Each column is one pixel position. If I have 2,500 English samples in the plot, the PCA input table has shape 2500 x 784.

PCA centers each pixel column, then finds the directions through the 784-dimensional pixel space where the samples spread out the most. The first direction is PC1, the second is PC2. When I plot those two coordinates for each sample, I get the 2D chart below.
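
The centering and projection described above can be sketched with plain NumPy. This is a minimal sketch using an SVD, one standard way to compute PCA; it is not necessarily how the original experiment was implemented, and the function name `pca_2d` and the random stand-in data are assumptions.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the first two principal directions.

    Returns (Z, directions): Z has shape (n_samples, 2),
    directions has shape (n_features, 2).
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)             # center each pixel column
    # Rows of Vt are the principal directions, ordered by variance explained.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    directions = Vt[:2].T                       # shape (n_features, 2)
    Z = X_centered @ directions                 # shape (n_samples, 2)
    return Z, directions

# Random stand-in for a table of flattened digits (not real MNIST data).
rng = np.random.default_rng(0)
X_rand = rng.random((50, 784))
Z_rand, W_rand = pca_2d(X_rand)
print(Z_rand.shape, W_rand.shape)
```

The sign of each direction is arbitrary, so a plot produced this way may be mirrored relative to one produced by another PCA implementation; distances between points are unaffected.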

English MNIST handwritten digits projected to 2D PCA

Note that PCA does not know what a 3 or an 8 means. It just finds the two directions that explain the most variance in the pixel data, and projects each sample onto them.

Some digits pull apart clearly. 1 has its own region because it is visually sparse. 0 also tends to carve out space because the loop shape is so distinctive. But a lot of digits overlap, because going from 784 dimensions down to two is a brutal compression. 3, 5, 8 and 9 sit fairly close to each other.

This is expected, and in a way it "explains" why we sometimes find it harder to distinguish between those digits.

Then the other numeral systems

After that I ran the same process for the other datasets.

Six non-English MNIST-style digit datasets projected to 2D PCA

Each language dataset has a different "footprint".

Mandarin Chinese is the most compressed in this 2D view.

This is consistent with my initial impression: the characters have very similar stroke structures, so a PCA with 2 dimensions struggles to "tell them apart".

However, judging by the sample Mandarin data (from the first image shown), this could also be an artifact of the dataset: the characters are more "squeezed" into the center of the image, which makes the digits harder to distinguish because all of them share the same "empty" surrounding area.
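
A rough way to check this caveat would be to measure how much of each image the ink actually spans. The sketch below is only an illustration of that idea and was not part of the original experiment: the function name `ink_extent`, the threshold, and the synthetic image are all assumptions.

```python
import numpy as np

def ink_extent(image, threshold=0):
    """Return (height_fraction, width_fraction) spanned by non-background pixels.

    A pixel counts as ink when its value is above `threshold` (assumed cutoff).
    """
    image = np.asarray(image)
    mask = image > threshold
    rows = np.flatnonzero(mask.any(axis=1))     # rows containing any ink
    cols = np.flatnonzero(mask.any(axis=0))     # columns containing any ink
    if rows.size == 0:
        return 0.0, 0.0                         # blank image
    height = (rows[-1] - rows[0] + 1) / image.shape[0]
    width = (cols[-1] - cols[0] + 1) / image.shape[1]
    return float(height), float(width)

# Synthetic 28 x 28 image with an 8 x 8 block of ink in the middle.
squeezed = np.zeros((28, 28), dtype=np.uint8)
squeezed[10:18, 10:18] = 255
extent_demo = ink_extent(squeezed)
print(extent_demo)
```

Averaging these fractions per dataset would show whether one dataset uses noticeably less of the canvas than the others.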

A small metric for separation

To make the comparison less hand-wavy, I computed a simple centroid separation score.

For each dataset:

  1. Compute the 2D PCA coordinates.
  2. Compute the centroid of each digit class.
  3. Compute all pairwise distances between those ten centroids.
  4. Take the mean distance, with the min and max shown as the error range.
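
The four steps above can be sketched as a single function. This is a minimal sketch that assumes the 2D PCA coordinates already exist and that distances are Euclidean; the function name `centroid_separation` and the toy data are illustrative, and the absolute scale of the score depends on how the pixels were normalized before PCA.

```python
import numpy as np
from itertools import combinations

def centroid_separation(Z, labels):
    """Return (mean, min, max) of pairwise distances between class centroids.

    Z is an (n_samples, 2) array of PCA coordinates, labels an (n_samples,) array.
    """
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    centroids = [Z[labels == c].mean(axis=0) for c in np.unique(labels)]
    # Ten classes give 45 unordered centroid pairs.
    dists = [np.linalg.norm(a - b) for a, b in combinations(centroids, 2)]
    return float(np.mean(dists)), float(np.min(dists)), float(np.max(dists))

# Toy data: three classes whose centroids form a 3-4-5 triangle.
Z_toy = np.array([[-1.0, 0.0], [1.0, 0.0],
                  [3.0, -1.0], [3.0, 1.0],
                  [0.0, 3.0], [0.0, 5.0]])
labels_toy = np.array([0, 0, 1, 1, 2, 2])
sep_mean, sep_min, sep_max = centroid_separation(Z_toy, labels_toy)
print(sep_mean, sep_min, sep_max)
```

The score looks only at class centers, so it ignores how widely each class is spread around its centroid.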

Centroid separation bar chart for MNIST-style digit datasets

This is not a universal measure of how "easy" the dataset is.

It only says: after reducing the images to two PCA dimensions, how far apart are the class centers?

The ranking I got:

  • Mandarin Chinese: 0.72
  • English: 3.02
  • Arabic: 3.31
  • Urdu/Persian: 3.50
  • Hindi: 4.07
  • Bengali: 4.15
  • Telugu: 4.18
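
To read these scores as ratios, each one can be divided by the lowest score. The snippet below is a small worked calculation on the numbers listed above; the variable names are illustrative.

```python
# Mean centroid separation per dataset, copied from the ranking above.
scores = {
    "Mandarin Chinese": 0.72,
    "English": 3.02,
    "Arabic": 3.31,
    "Urdu/Persian": 3.50,
    "Hindi": 4.07,
    "Bengali": 4.15,
    "Telugu": 4.18,
}

# How many times more separated each dataset is than Mandarin Chinese.
baseline = scores["Mandarin Chinese"]
ratios = {name: score / baseline for name, score in scores.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

English comes out at roughly 4.2 times the Mandarin Chinese separation, and Telugu at roughly 5.8 times.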

I would not over-interpret the exact values because the datasets were collected differently - and as I mentioned, the Mandarin Chinese dataset does seem to be of slightly lower quality than the others.

But as a visual summary, it is still useful. Telugu, Bengali and Hindi have the largest class-center separation in this PCA view. English MNIST, which we tend to treat as the default digit dataset, has the second-lowest separation of the seven.

Mandarin Chinese is the clear outlier in the other direction. Its digit centroids are much closer together than every other dataset here.

So for this experiment, I think it is fair to say Mandarin Chinese is the hardest numeral system to recognize by sample proximity alone.

Not "hardest" in some universal language-learning sense. Hardest in the specific sense that if I only give myself this 2D PCA space and ask which handwritten numerical values sit closest to each other, Mandarin Chinese gives me the least separation.

The explorer

The static plots were fun, but I thought I could build a nicer, more interactive HTML explorer that would let me look at the data. After all, you always want to look at the data. ALWAYS!


You can switch datasets, keep the same digit colors, zoom and pan the PCA plot, click a sample, and inspect the selected image plus the five closest samples in PCA space.
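
The "five closest samples" lookup can be sketched as a brute-force nearest-neighbor search in the 2D PCA coordinates. This is a minimal sketch, not the explorer's actual code: the function name `nearest_in_pca_space`, the use of Euclidean distance, and the choice to exclude the selected sample itself are assumptions.

```python
import numpy as np

def nearest_in_pca_space(Z, query_index, k=5):
    """Return indices of the k samples closest to Z[query_index] in PCA space."""
    Z = np.asarray(Z, dtype=float)
    dists = np.linalg.norm(Z - Z[query_index], axis=1)
    dists[query_index] = np.inf                 # exclude the selected sample itself
    return np.argsort(dists)[:k]

# Toy coordinates along a line (not real digit data).
Z_line = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
                   [10.0, 0.0], [11.0, 0.0], [4.0, 0.0]])
neighbors_demo = nearest_in_pca_space(Z_line, query_index=0, k=3)
print(neighbors_demo.tolist())
```

For a few thousand points per dataset, a full distance scan like this is cheap enough that no spatial index is needed.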

The code is here: DidierRLopes/mnist-explorer.

Conclusion

Under this PCA centroid-separation proxy, Mandarin Chinese is about 4.2x less separated than English MNIST and about 5.8x less separated than Telugu. So if I treat inverse separation as a rough difficulty score, Mandarin Chinese is about 4x harder than English and nearly 6x harder than Telugu in this experiment.