SongKong and Jaikoz Music Tagger Community Forum

Umlauts and Tildes and Accents Oh My! Best Practices for Cross Platform Character Encoding.

First off, I’ve just ‘completed’ my journey of tagging ~80,000 tracks with the help of Jaikoz. It took a little while to understand all the options, but now that I do, I believe Jaikoz to be the finest piece of music metadata management software I’ve ever used!

Secondly, the problems I’m about to describe are in no way attributable to Jaikoz. They are neither ‘bugs’ nor feature requests. Rather, they are problems intrinsic to a world of different languages and operating systems. I hope that some of you out there have walked this path ahead of me, and can offer some pointers on how to proceed.

I have my music collection on a Linux box NFS mounted to my Mac. I cannot see (some) files that contain special characters from Finder. Tildes, accents and umlauts being the most common. Consequently, Jaikoz does not see those files when run from the Mac machine. However, if I run Jaikoz from the Linux machine, I can see those files. iTunes (on the Mac) does see these files. However, it will show the little ‘!’ icon next to the files after an attempt to access them has been made. What’s most strange, and also most hopeful for me, is that this only happens for some files with accents. Other files with accents do show up. So one set of characters is encoded in a way that works, the other not. However, I can’t figure it out. ‘Béla Fleck’ is broken. ‘Béla Fleck And The Flecktones’ shows up just fine. Copying the ‘good’ Béla into the ‘bad’ Béla field in Jaikoz shows no difference.

I have done some research into why this is so, and my best guess has to do with Unicode Normalization Form C vs. Form D. A thrilling topic (not) should anyone care to Google it :) I’m not convinced this is the problem, though, and would love to hear from the community if there is any knowledge out there.
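For anyone who would rather see it than Google it, here is a minimal Python sketch of what the two forms look like. It only illustrates the normalization difference; whether this is really what is happening on my NFS mount is still a guess.

```python
import unicodedata

name = "B\u00e9la Fleck"  # "Béla Fleck", written with an escape to avoid source-encoding surprises

nfc = unicodedata.normalize("NFC", name)  # é stored as one code point, U+00E9
nfd = unicodedata.normalize("NFD", name)  # é stored as 'e' plus combining acute, U+0301

print(nfc == nfd)           # False: same on screen, different code points
print(len(nfc), len(nfd))   # 10 11
print(nfc.encode("utf-8"))  # b'B\xc3\xa9la Fleck'
print(nfd.encode("utf-8"))  # b'Be\xcc\x81la Fleck'
```

Both strings render identically, but a file system that compares names byte for byte will treat them as two different names.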

Some solutions I’m considering are:

Use the File and Folder Correct/File and Folder Naming/Replace options. I have not tried this, as I don’t like the solution. It might work for many things, but what about something like this: http://musicbrainz.org/release/3ec4e9d8-e151-49cb-b8b9-22b4ffe2ee80 (Sigur Rós, an Icelandic band)? What if I want some Korean music some day?

Maybe it’s as simple as changing Save/ID3 Tag V2/Default Text Encoding from ISO-8859-1 to UTF-16. Would I then Force Save every file, and perform Correct Filenames from Metadata and Correct Sub Folders from Metadata? I like this, but the weird ‘Béla Fleck’ stuff makes me wonder if this would work. Also, there must be some reason that ISO-8859-1 is the default, and it does work for some Bélas!

Sorry for the long diatribe, it was as much to organize my thoughts as anything. I know there aren’t a ton of users NFS mounting their music collection from Linux to Mac, but I can’t be the only one? Help me Jaikoz community, you’re my only hope!

The Ourea

[quote=theourea]I believe Jaikoz to be the finest piece of music metadata management software I’ve ever used!
[/quote]
Thank you very much.

File and Folder Correct/File and Folder Naming/Replace Non-Ascii Characters should work, and it would work for the Sigur Rós tracks by just stripping the accents from the filename, so for example Sigur Rós would become Sigur Ros, but as you say it’s not a perfect solution.
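To give a rough idea of what accent stripping means, here is a small Python sketch. It is not how Jaikoz implements the option; `strip_accents` is a hypothetical helper written only for illustration.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose each character, then drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Sigur R\u00f3s"))  # Sigur Ros
print(strip_accents("B\u00e9la Fleck"))  # Bela Fleck
```

Letters that have no decomposition (for example the Icelandic ð, or Korean text) pass through this sketch unchanged, which is part of why the approach is not a perfect solution.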

What you really want to do is mount your files from Linux to your Mac using a protocol that Macs properly understand, i.e. AFP. I have found this link, which could well work, but I haven’t tried it myself: http://krypted.com/mac-os-x-server/hosting-afp-on-linux/

[quote]
Maybe it’s as simple as changing Save/ID3 Tag V2/Default Text Encoding from ISO-8859-1 to UTF-16. [/quote]
No, because the problem is only with the filename itself, not the metadata within the file.

This is a common problem, especially when dealing with UTF-8 Unicode on Windows-based systems. As Paul mentioned above, the UTF-8 in the tags is not the problem, but rather the characters used in the file system. The problem is that a lot of the higher characters are not encoded the same way in UTF-8 as they are in the Windows charsets Windows-1252 or ISO-8859-1.
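A quick Python sketch of that difference, using é as the example. The choice of character and the variable names are mine, purely for illustration.

```python
ch = "\u00e9"  # é

utf8 = ch.encode("utf-8")          # two bytes
latin1 = ch.encode("iso-8859-1")   # one byte
cp1252 = ch.encode("cp1252")       # one byte, same as ISO-8859-1 for this character

print(utf8.hex(), latin1.hex(), cp1252.hex())  # c3a9 e9 e9

# UTF-8 bytes read as ISO-8859-1 turn into two wrong characters.
mojibake = utf8.decode("iso-8859-1")
print(mojibake)  # Ã©

# An ISO-8859-1 byte read as UTF-8 is invalid and becomes U+FFFD.
broken = b"B\xe9la".decode("utf-8", errors="replace")
print(broken == "B\ufffdla")  # True
```

Plain ASCII characters come out the same in all three encodings, which is why only names with accents or other special characters are affected.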

What I have done is similar to what Paul mentioned. I check the Replace Non-Ascii Characters option. I also use the rules on that tab that replace the ‘from’ characters with the ‘to’ characters. I have included a few dozen rules there. I find the low characters to be the same in both charsets, but there are a lot of differences in the high characters between the two. These usually include a lot of the special characters, i.e. anything other than the basic English A-Z, a-z and 0-9 characters.

The other thing I recommend is to limit characters that might be reserved in Windows, Linux, or Mac. Stuff like /\?%*:|"<>. or null characters.
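If you would rather script this than set up replace rules, something along these lines should do it. This is only a sketch: `sanitize` and the underscore replacement are my own choices, and it handles a single file or folder name, not a full path.

```python
import re

# Characters reserved on at least one of Windows, Linux, or Mac,
# plus NUL and the other ASCII control characters.
RESERVED = re.compile(r'[/\\?%*:|"<>\x00-\x1f]')

def sanitize(name: str, replacement: str = "_") -> str:
    """Replace every reserved character in a single file or folder name."""
    return RESERVED.sub(replacement, name)

print(sanitize("AC/DC: Who Made Who?"))  # AC_DC_ Who Made Who_
```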

Another thing to consider is the full length of the file name including the path. Windows has traditionally limited the full path to 260 characters, and most file systems limit a single file or folder name to 255.

Finally, I find some systems have problems with things like starting or ending a folder name with a period or a space, or ending a folder name with two consecutive periods, etc.
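The length and period/space checks can be sketched in Python like this. `name_problems` is a hypothetical helper, and the length limit is deliberately left as a parameter so you can set it for whichever system you care about.

```python
def name_problems(path: str, max_len: int) -> list:
    """Return a list of human-readable problems found in a relative path."""
    problems = []
    if len(path) > max_len:
        problems.append(f"path is longer than {max_len} characters")
    # Treat both slash styles as separators so Windows paths work too.
    for part in path.replace("\\", "/").split("/"):
        if part in ("", ".", ".."):
            continue  # skip empty segments and the usual directory shortcuts
        if part[0] in " .":
            problems.append(f"starts with a space or period: {part!r}")
        if part[-1] in " .":
            problems.append(f"ends with a space or period: {part!r}")
    return problems

for problem in name_problems("Music/R.E.M./Out of Time ", max_len=260):
    print(problem)
```

A folder ending in two consecutive periods is caught by the same trailing-period check.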

There are a few other programs I use from time to time as well to help with the file naming. Mainly the free Bulk Rename Utility. It is a great program to do batch file and folder renaming.

I also use the paid program Delete Invalid File. This one is great for identifying possible file or folder name problems. I just point it to my music folder and let it go through all the sub folders and find any file or folder that does not follow windows standards.

Paul-

I run a netatalk server on that box, and accessing the files via AFP does not help. The behavior is a little different: I can see directories whose names contain special characters, but I still can’t see the files in those directories.

Right, but what of the Correct Filenames from Metadata and Correct Sub Folders from Metadata? If the metadata is encoded ‘correctly’/differently wouldn’t that result in a different file name?

I’m specifically thinking about my [quote]‘Béla Fleck’ is broken. ‘Béla Fleck And The Flecktones’ shows up just fine.[/quote] problem. There is a way to get files with accents to show up correctly. I just need whatever bytes represent ‘Béla’ in the second case to be used in the first case. Since both directories were moved to their present location by Jaikoz, there must be a way to do this.

What effect would starting the JVM with -Dfile.encoding=UTF-8 have on the process?
As a further complication/simplification it may be important to note that all the files can be read from Terminal. This is (I think) because bash uses the same Unicode Normalization as the rest of the world.

I’m guessing the real ‘right’ action here is to write up a Python (or bash, or Java, or…) program to print the filenames in the various encodings and see just what the heck is up. I haven’t had time to go that far. Any tips (from anyone) on how to do that would be great.
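Something like the following is roughly what I have in mind. It is an untested-on-my-collection sketch, with `describe` and `scan` being names I made up; it walks the tree using raw bytes so that nothing gets silently re-encoded, and it is probably best run on the Linux box itself, since the Mac may present the names differently over NFS.

```python
import os
import unicodedata

def describe(raw: bytes) -> str:
    """Classify a raw filename as ASCII, NFC, NFD, mixed, or not UTF-8 at all."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8 (possibly ISO-8859-1)"
    if text.isascii():
        return "ASCII"
    is_nfc = text == unicodedata.normalize("NFC", text)
    is_nfd = text == unicodedata.normalize("NFD", text)
    if is_nfc and is_nfd:
        return "NFC and NFD"  # characters with no decomposition
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "mixed"

def scan(root: bytes) -> None:
    """Print the classification, hex bytes, and path of every non-ASCII name."""
    for dirpath, dirnames, filenames in os.walk(root):
        for raw in dirnames + filenames:
            kind = describe(raw)
            if kind != "ASCII":
                print(kind, raw.hex(), repr(os.path.join(dirpath, raw)), sep="\t")

# Example (the path is a placeholder): scan(b"/path/to/music")
```

If the ‘good’ and ‘bad’ Béla directories come out with different classifications, that would at least confirm where the difference lies.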

Greengeek-

Thanks for the tips! In the end, I think I will do something similar. Stripping out the offending characters isn’t that big of a deal, as I interact with the files via Clementine/iTunes or some other program that displays the metadata, not the filenames. It’s only my overdeveloped sense of organization and curiosity, something I’m sure Jaikoz users will sympathize with :-), that makes me hold out for another solution.

Thanks,

The Ourea

[quote=theourea]Right, but what of the Correct Filenames from Metadata and Correct Sub Folders from Metadata? If the metadata is encoded ‘correctly’/differently wouldn’t that result in a different file name?
[/quote]
The way the data is encoded within the file itself is not the problem, and when Jaikoz is copying data from x to y it is always dealing with Unicode, so I don’t see how encoding the metadata differently would make any difference.