Tuesday 13 December 2011

Gzip and Determining Language

While I was watching the last set of videos for the free Stanford Artificial Intelligence course, the following video completely blew me out of the water. It was so unexpected.

This is the third thing I learnt on Monday. The Unix command gzip can be used to recognise which language a passage is written in. Suppose you have sample passages in several known languages, and want to know which language a new passage is written in. What you can do is append the new passage to each of the samples in turn, compress each concatenation, and check which one compresses best. This works because compression replaces frequently repeated patterns, such as "is " in English, with much shorter codes. The patterns worth shortening are different for each language, so the best compression should be achieved when the new passage is appended to a sample in the same language.
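
Here is a rough sketch of the idea in Python, using the standard gzip module instead of the command-line tool (both produce the same compressed format). The function names compressed_size and guess_language and the samples dictionary are my own inventions for illustration, not anything from the video. I have also assumed that "compresses best" means "adds the fewest extra bytes compared with compressing the sample on its own", which is one reasonable reading rather than the only one.

```python
import gzip


def compressed_size(text: str) -> int:
    """Number of bytes the text occupies after gzip compression."""
    return len(gzip.compress(text.encode("utf-8")))


def guess_language(samples: dict[str, str], passage: str) -> str:
    """Return the key of the sample that the passage compresses best with.

    `samples` maps a language name to a reference text in that language.
    Assumption: the score is the number of extra compressed bytes the
    passage costs when it is appended to the sample.
    """
    def extra_bytes(sample: str) -> int:
        return compressed_size(sample + " " + passage) - compressed_size(sample)

    return min(samples, key=lambda language: extra_bytes(samples[language]))


if __name__ == "__main__":
    # Tiny made-up samples; real use would want much longer reference texts.
    samples = {
        "english": "this is the way it is and that is how it was in the morning",
        "german": "so ist es und so war es am morgen und das ist der weg",
    }
    print(guess_language(samples, "it is the way that it was"))
```

With samples this short the guess can easily go wrong, so treat it as a toy. Longer reference texts give the compressor more patterns to reuse, although gzip only looks back about 32 KB, so there is little to gain from samples much bigger than that.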



