
SongKong and Jaikoz Music Tagger Community Forum

Awful write performance for songs stored over NFS

I’ve been having problems trying to tidy my library, which is stored remotely on a NAS mounted over NFSv3 (Network File System) under Linux.

The performance of write operations is awfully low (no, I’m not talking about a few minutes to save a song; I’m talking about HOURS, REALLY, LITERALLY 2 or 3 hours). The issue can be somewhat softened by connecting to the NAS by wire (but still using NFS), but my setup calls for a wireless connection, which is otherwise fast enough even for playing videos (though not Full HD).

Using the strace utility (strace -f -e open) I saw Jaikoz opening files with the O_SYNC flag (synchronize every write access), which is a known performance killer for NFSv3 (I don’t know about v4).

I tried all the java.io methods for “opening” files and found that only java.io.RandomAccessFile with the “rws” or “rwd” mode flags triggers the open syscall with the O_SYNC flag.
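For reference, here is a minimal sketch of what I mean (the path and bytes are just placeholders); only the mode string passed to the RandomAccessFile constructor decides whether the JVM asks for synchronous writes:

import java.io.RandomAccessFile;

public class SyncModeDemo {
    public static void main(String[] args) throws Exception {
        // "rw"  -> plain open; writes go through the page cache
        // "rwd" -> file content is synced on every write (O_DSYNC-style semantics)
        // "rws" -> content and metadata are synced on every write (O_SYNC-style semantics)
        try (RandomAccessFile raf = new RandomAccessFile("/mnt/nas/music/test.mp3", "rws")) {
            raf.write(new byte[] { 0x49, 0x44, 0x33 });  // each write hits the server before returning
        }
    }
}

Run it under strace -f -e trace=open,openat (on newer systems the syscall shows up as openat) and compare the flags for each mode.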

Being a long-time programmer, with both high- and low-level experience, I was left wondering why we need that “sync” option at all. Usually only database developers should need it, and only for log-file operations (for ACID properties, strictly speaking).

Still, if you have a really good reason for keeping this sync option (namely the “s” or “d” in “rws”/“rwd”), then please at least offer some sort of configuration, be it through the UI, a Java property (-Dproperty=value), an environment variable, or direct configuration-file manipulation; any kind will do. I’ll happily give up any consistency granted by sync in favor of a 100-fold increase in write performance.
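Something as simple as a system property would already do. A rough sketch of what I have in mind (the property name jaikoz.write.sync is purely hypothetical, made up for illustration):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class WriteModeConfig {
    // Hypothetical property name, for illustration only.
    static RandomAccessFile openForWriting(File file) throws IOException {
        boolean sync = Boolean.parseBoolean(System.getProperty("jaikoz.write.sync", "true"));
        // "rws" keeps today's synchronous behaviour; "rw" gives ordinary buffered writes.
        return new RandomAccessFile(file, sync ? "rws" : "rw");
    }
}

Then people like me could simply start Jaikoz with -Djaikoz.write.sync=false and leave everyone else unaffected.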

Until then I’ll keep trying to find a workaround, such as tracing every open syscall and changing “rws” or “rwd” to “rw”, or instrumenting classes in the JVM on the fly, whichever resolves the problem.

What type of NAS are you using?

I run Jaikoz on Win7 64-bit with all my music stored on a remote NAS as well. Typically it takes about 10 minutes to load about 20k songs for the first time, and under a minute if I open them again later after they have been cached. Saving that many takes about the same amount of time, about 5 to 10 minutes. This is even with a fairly underpowered NAS that has only 4 GB of RAM and is about six years old.

Usually the only thing that may take me an hour or two is running autocorrect on that many files. However, I attribute a lot of that to the limitations of some of the services Jaikoz has to query, as I have it set up to scan and double-check all of them.

I am not disagreeing that Jaikoz could be better optimized and use resources more efficiently. However, typically the only time I have seen it take hours to perform any type of action on a single song is when the program’s allotted resources have nearly run out and it goes into an overly cautious mode where it tries to keep the process running without crashing, something like running autocorrect on 50k songs without bumping the memory up high enough in the batch file.

Hi rslemos

Sync was added to the underlying jaudiotagger library that Jaikoz uses in order to try to fix a problem, but I could have misunderstood its purpose.

If you are in a hurry, and since you are a programmer, http://jthink.net/jaudiotagger/index.jsp is open source, so you could try modifying it yourself and seeing if that fixes the problem for Jaikoz.

Oh nice. I didn’t know jaudiotagger was open source. Suddenly a workaround got a lot easier to do. I’ll try it this weekend.

Thanks for your support.

greengeek,

Win7 is not a valid comparison here; we are not talking about CIFS (the filesystem protocol Windows uses to access remote filesystems).

The problem lies in: i) NFS, and ii) the O_SYNC flag to the open syscall. Both are UNIX-specific (well, I’m sure there is something like an O_SYNC flag for Windows too, but I’m not a good Windows programmer).

Though the NAS appliance could carry its share of the blame (like not allocating enough threads or buffers to handle the work efficiently), surely it would not cause such a huge performance loss; remember that I’m not talking about a 2- or 3-fold increase in times… it’s more like 100-fold. No, the equipment is not the culprit. NFS + O_SYNC is.

I’ve just made the code changes to jaudiotagger and checked them in to source control:

https://java.net/projects/jaudiotagger/sources/svn/show

So you could check out trunk and build it with Maven.

I haven’t deployed a new Maven snapshot yet because I’m working on another jaudiotagger issue.

Paul

Thanks. I’ve already done it locally.

But it was not the main cause of the poor performance (though it does make a dent).

Fortunately I found who’s to blame: the NFS lock daemon. Actually, the NAS’s NFS implementation lacks (or fails to start) the lock service. On the client side (where Jaikoz runs) the system-wide lockd waits for a while before bailing out, and apparently it nearly doubles the wait after each failure (I don’t know how long it starts with, but by the 5th or 6th file it was waiting as much as 1800 seconds before giving up).

For the people out there suffering from the same issue, it can be spotted by using strace to trace the fcntl calls with the F_SETLK argument. Here is some C code which can help pin it down:

#include <fcntl.h>      /* open, fcntl, O_*, F_*, struct flock */
#include <unistd.h>     /* close, SEEK_SET */

int main(int argc, char* argv[]) {
        int fd;
        struct flock flock = {
                .l_type   = F_WRLCK,    /* exclusive (write) lock...  */
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,          /* ...over the whole file     */
        };

        /* open (or create) the file named on the command line */
        fd = open(argv[1], O_RDWR | O_CREAT, 0600);

        /* this is the request that ends up at the NFS lock daemon */
        fcntl(fd, F_SETLK, &flock);

        close(fd);
        return 0;
}

Call it many times under strace -e fcntl and watch the response time grow with each call.
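For those who prefer to stay in Java, roughly the same lock request can be issued through FileChannel; on Linux the JDK implements file locks with fcntl, so strace shows a similar F_SETLK call (the class name here is mine):

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockProbe {
    public static void main(String[] args) throws Exception {
        // Try a non-blocking exclusive lock on the file given as the first argument.
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "rw");
             FileChannel channel = raf.getChannel()) {
            FileLock lock = channel.tryLock();  // shows up as an fcntl lock request on Linux
            if (lock != null) {
                lock.release();
            }
        }
    }
}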

I managed to solve the issue by disabling locking on the NFS mount (the “nolock” option). Nothing was lost, since locking would not have worked anyway.
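For reference, that just means adding nolock to the NFS mount options; on my setup it amounts to an /etc/fstab line along these lines (server path and mount point are placeholders for whatever your setup uses):

        nas:/volume1/music  /mnt/music  nfs  nolock,vers=3  0 0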

And for reference (also for those experiencing some poor performance over NFS) my NAS is a Synology DS 212j with DSM 4.2-3202.

I’ve sifted through the jaudiotagger source to find what could have triggered the losses reported in issue #278 (https://java.net/jira/browse/JAUDIOTAGGER-278). I could find only one point where a file was opened but not closed (thus never giving the OS the cue to flush its buffers to disk/network). It was corrected in SVN revision 875 (after the introduction of the “rws” option).
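For illustration only (this is not the actual jaudiotagger code), the pattern that avoids that whole class of problem is simply to guarantee the close, e.g. with try-with-resources:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class SafeWrite {
    // Illustration only: guaranteeing the close gives the OS the cue to flush
    // its buffers; over NFS, dirty pages are pushed to the server on close.
    static void writeTag(File file, byte[] tag) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.seek(0);
            raf.write(tag);
        }  // close() always runs here, even if write() throws
    }
}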

In the issue body itself there is a suggestion of what the culprit could be: “[…] the temporary file may not be actually flushed before being renamed […]”. I’m pretty convinced that synchronized IO merely worked around that (fixing it, but not properly, and at the cost of bypassing every trick the kernel developers built to give us more performance).
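If that is really the root cause, the targeted fix would be to sync the temporary file once, when it is complete and just before the rename, instead of opening every file with synchronous writes. A rough sketch (the method and variable names are mine, not jaudiotagger’s):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class AtomicReplace {
    // Sketch only: write the temporary file, force it to stable storage once,
    // then rename it over the target so the target never sees half-written data.
    static void writeAndReplace(byte[] data, File temp, File target) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(temp, "rw")) {
            raf.write(data);
            raf.getChannel().force(true);  // one explicit sync instead of "rws" on every write
        }
        if (!temp.renameTo(target)) {
            throw new IOException("rename failed: " + temp + " -> " + target);
        }
    }
}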

If you care to leave this… eeeerrrr… “ticket” open, I’ll try to measure just how much the synchronized IO hurts throughput, to better judge how affordable it is to risk the bug coming back.

Okay, thank you, I have reopened the issue.