SongKong Jaikoz

SongKong and Jaikoz Music Tagger Community Forum

Tutorial:Fix Charset Encoding

When metadata is saved to our music files it is encoded according to a charset. Historically, different charsets were designed for encoding particular languages. For example, the ISO-8859-1 charset can encode English and some Western European languages, but it is no use for other scripts such as those used for Japanese, Russian (Cyrillic) and Chinese.
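As a small illustration of that limit, here is a minimal Python sketch (not part of SongKong or Jaikoz). The helper name `can_encode` and the sample titles are made up for the example; it only relies on Python's built-in `iso-8859-1` and `utf-8` codecs.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character in text can be written in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


western = "Für Elise"   # hypothetical Western European title
japanese = "さくら"      # hypothetical Japanese title

print(can_encode(western, "iso-8859-1"))   # True
print(can_encode(japanese, "iso-8859-1"))  # False
print(can_encode(japanese, "utf-8"))       # True
```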

These days we usually use UTF-8 because it can store any character, but in the past UTF-8 was not so well supported, so certain metadata formats did not have a good way to support non-English characters.

For example, the Wav format originally only supported the Info chunk for storing metadata, and this expects all metadata to be stored using the ISO-8859-1 charset, which is no help for other scripts. So if a tagging application was used in a country with a different default charset, it would sometimes write the metadata using that country's default charset, for example CP932 in Japan. The trouble with this approach is that it breaks the specification, and there is nothing to tell applications reading the metadata that a different charset was used, so the metadata is read incorrectly. This is particularly a problem when the file is read in a different country, where CP932 may not be the default encoding.
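The effect can be reproduced in a few lines of Python. This is only a sketch of the problem, not of any real tagger: the title is a made-up example, and `cp932` is Python's name for the Japanese Windows code page.

```python
title = "さくら"  # hypothetical Japanese title

# A tagger running with a Japanese default charset writes the bytes using
# the Japanese Windows code page, although the field is specified as ISO-8859-1.
stored_bytes = title.encode("cp932")

# A reader that follows the specification decodes the bytes as ISO-8859-1.
# ISO-8859-1 maps every byte value to some character, so no error is raised;
# the text is simply wrong.
misread = stored_bytes.decode("iso-8859-1")

print(repr(misread))
print(misread == title)  # False
```

Note that nothing fails at read time, which is why the problem often goes unnoticed until the garbled text is displayed.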

It is not possible to reliably guess the charset, because a particular byte value is usually valid in a number of different charset encodings, and music metadata fields are typically too short to make a confident guess.
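To see why guessing is unreliable, the sketch below decodes the same two bytes with three charsets. The byte values and the list of candidate charsets are chosen purely for illustration; all three decodes succeed, and each gives plausible text in a different script.

```python
raw = bytes([0xC4, 0xC5])  # two bytes as they might appear in a short tag field

# Candidate charsets: Western European, Cyrillic and Japanese.
candidates = ["iso-8859-1", "cp1251", "cp932"]
decoded = {charset: raw.decode(charset) for charset in candidates}

for charset, text in decoded.items():
    print(f"{charset}: {text}")
# iso-8859-1: ÄÅ
# cp1251: ДЕ
# cp932: ﾄﾅ
```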

The situation is further complicated because a mix of charsets may be used in one file for different fields, for example CP932 is not useful if the user has some English music metadata that they want to save.

The Fix Charset Encoding task has been created to allow us to read the incorrectly encoded metadata using a specified charset for specified fields, and then rewrite the metadata using a standard charset that can support it. For example, the Wav format now supports ID3v2 metadata, and ID3v2 can write metadata using UTF-8.
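The general idea of reading with a specified charset and rewriting as UTF-8 can be sketched in Python as follows. This is an illustration of the technique only and makes no claim about how SongKong or Jaikoz implement the task; `fix_field`, the `tags` dictionary and the `field_charsets` mapping are all hypothetical names, and the sketch assumes the values were originally misread as ISO-8859-1.

```python
def fix_field(misread: str, actual_charset: str,
              assumed_charset: str = "iso-8859-1") -> str:
    """Recover text that was decoded with the wrong charset."""
    raw = misread.encode(assumed_charset)  # get back the original bytes
    return raw.decode(actual_charset)      # decode them as intended


# Hypothetical tag values as read under the ISO-8859-1 assumption.
tags = {
    "title": "さくら".encode("cp932").decode("iso-8859-1"),  # misread field
    "artist": "The Beatles",                                 # already correct
}

# Only the fields listed here are re-read, each with its own charset.
field_charsets = {"title": "cp932"}

fixed = {
    name: fix_field(value, field_charsets[name]) if name in field_charsets else value
    for name, value in tags.items()
}

# Rewrite every field using UTF-8, which can represent all of the text.
utf8_bytes = {name: value.encode("utf-8") for name, value in fixed.items()}

print(fixed)
```

Choosing the charset per field matters because applying the Japanese charset to every field could damage fields that were already stored correctly.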