SongKong Jaikoz

SongKong and Jaikoz Music Tagger Community Forum

Improved matching algorithm for Jaikoz

Hi Paul,

Thanks in advance for your help.

I kicked off Jaikoz Auto Correct on a larger collection of around 8000 tracks last night to test Jaikoz on a more realistic subset of my collection before I commit to making the license purchase. I should mention that I am running on a 4 core (8 threads) 64 bit Linux machine.

The next morning I could see that Album Art exists for under 15% of the tracks. I believe that many of those tracks already had Album Art, so the percentage successfully added by Jaikoz is probably much lower than that. Assuming that some of the automatic changes were probably to the wrong album and will need manual adjustment, I'm thinking that the success rate will be even lower still. It would be useful if Jaikoz could output some statistics after each operation regarding how many tags of which type were changed, added, or deleted.

Can anyone help? Unless there is a way of increasing the success rate I don’t think Jaikoz is going to be a feasible way of adding the correct Album Art to my library.

All of the tracks have existing tag data for at least Title, Artist, Album and Track Number. In some cases they even have Year. The tracks have all been ripped from CDs using Windows Media Player and are all organized neatly in directories according to Artist/Album/Tracks.wma.

All I want to do for now is to add Album Art where it is missing (and ensure that the tracks that live in the same directory get assigned the same correct album art). I don't care so much about MusicBrainz Ids at this point, since I believe that my metadata is in reasonable shape for these directories and I don't want the MusicBrainz metadata to overwrite mine unless I'm absolutely sure that it's improving data quality.

In case it helps, here is a concrete example which illustrates some of the problems. I have taken one Album in a single directory called Eminem - 8 Mile. As you can see from the attached screenshot, Jaikoz assigns artwork to only 4 out of 16 tracks but Lyrics to 8 out of 16. No Acoustic Id can be generated for any of the songs. I believe this is because Jaikoz fails to generate an Acoustic Id for all WMA files. Is this a bug? Jaikoz claims to support WMA:
http://www.jthink.net/jaikoz/jsp/overview/startup.jsp

Regardless, I would expect that Jaikoz should be able to find album art without MB Ids given a reasonable amount of metadata. [Aside - Logically I'm reasoning that if this were not the case then Jaikoz would be completely reliant on successful Acoustic Id generation to be able to function at all, and would therefore be completely dependent on the quality and coverage of data in the MusicBrainz database rather than the metadata already in the libraries of users. This would also mean that Jaikoz would be unable to take reasonable metadata that users already have in their ripped collections and contribute to MB by adding the albums into MB.]

The correct MB release page for this album is here:
http://musicbrainz.org/show/release/?mbid=35254763-e3ea-42c3-9ab0-af964e5a18c9

As you can see from my screenshot Jaikoz has a wealth of information that it could potentially use for matching:

  1. Title matches exactly for every track (even case matches)
  2. Album matches exactly
  3. Track number matches exactly
  4. Playing time matches to within a second or two for every track
  5. Year matches
  6. Other tracks with the same Album name and in the same directory are able to be successfully matched to the same correct MB release
  7. Total number of tracks in directory matches total number of tracks in release. All track numbers are unique within the directory and the highest track number matches with the number of tracks in the correct release

I realize that fuzzy matching algorithms can be tricky, but one would think that all of the above would be sufficient for Jaikoz to have a high degree of certainty regarding which release to choose, but in 12 out of 16 cases it seems unable to select a release automatically.

The only deficiency in the prior existing meta data in this example is that the Artist does not match and Album Artist is empty. In all cases, my tracks have the artist as “Eminem” whereas in MB the tracks are attributed to different artists. In MB Album Artist is “Various Artists”.

Here’s the console output:

Oct 26, 2009 4:24:21 PM: INFO: Started Autocorrecter on 16 Songs
Oct 26, 2009 4:24:21 PM: INFO: Task 1:Started Correct Artists on 16 Songs
Oct 26, 2009 4:24:21 PM: INFO: Task 1:Completed Correct Artists on 16 Songs
Oct 26, 2009 4:24:21 PM: INFO: Task 2:Started Correct Albums on 16 Songs
Oct 26, 2009 4:24:21 PM: INFO: Task 2:Completed Correct Albums on 16 Songs
Oct 26, 2009 4:24:21 PM: INFO: Task 3:Started Correct Titles on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 3:Completed Correct Titles on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 4:Started Correct Genres on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 4:Completed Correct Genres on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 5:Started Correct Comments on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 5:Completed Correct Comments on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 6:Started Correct Track Numbers on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 6:Completed Correct Track Numbers on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 7:Started Correct Recording Times on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 7:Completed Correct Recording Times on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 8:Started Correct Tags from Filename on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 8:Completed Correct Tags from Filename on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 9:Started Correct Artists on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 9:Completed Correct Artists on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 10:Retrieving Acoustic Ids for 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 11:Started Correct Tags from MusicBrainz on 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 12:Started Updating tag data for 16 Songs
Oct 26, 2009 4:24:22 PM: INFO: Task 13:Started Correct Lyrics on 16 Songs
Oct 26, 2009 4:24:23 PM: WARNING: Unable to retrieve an acoustic id for song 6
Oct 26, 2009 4:24:23 PM: WARNING: Unable to retrieve an acoustic id for song 0
Oct 26, 2009 4:24:23 PM: WARNING: Unable to retrieve an acoustic id for song 4
Oct 26, 2009 4:24:23 PM: WARNING: Unable to retrieve an acoustic id for song 5
Oct 26, 2009 4:24:23 PM: WARNING: Unable to retrieve an acoustic id for song 13
Oct 26, 2009 4:24:23 PM: WARNING: Unable to retrieve an acoustic id for song 1
Oct 26, 2009 4:24:23 PM: WARNING: Unable to retrieve an acoustic id for song 8
Oct 26, 2009 4:24:23 PM: WARNING: Unable to retrieve an acoustic id for song 14
Oct 26, 2009 4:24:24 PM: WARNING: Unable to retrieve an acoustic id for song 10
Oct 26, 2009 4:24:25 PM: WARNING: Unable to retrieve an acoustic id for song 11
Oct 26, 2009 4:24:26 PM: WARNING: Unable to retrieve an acoustic id for song 2
Oct 26, 2009 4:24:27 PM: WARNING: Unable to retrieve an acoustic id for song 3
Oct 26, 2009 4:24:28 PM: WARNING: Unable to retrieve an acoustic id for song 12
Oct 26, 2009 4:24:29 PM: WARNING: Unable to retrieve an acoustic id for song 15
Oct 26, 2009 4:24:30 PM: WARNING: Unable to retrieve an acoustic id for song 9
Oct 26, 2009 4:24:31 PM: WARNING: Unable to retrieve an acoustic id for song 7
Oct 26, 2009 4:24:41 PM: INFO: Task 10:Retrieve Acoustic Ids was unable to find a match for 16 Songs
Oct 26, 2009 4:24:41 PM: INFO: Task 10:Completed retrieval of Acoustic Ids for 16 Songs
Oct 26, 2009 4:24:41 PM: INFO: Task 11:Correct Tags From MusicBrainz was unable to find a match for 12 Songs
Oct 26, 2009 4:24:41 PM: INFO: Task 11:Corrected 4 Songs from MusicBrainz successfully
Oct 26, 2009 4:24:41 PM: INFO: Task 11:Completed Correcting Tags from MusicBrainz on 16 Songs
Oct 26, 2009 4:24:41 PM: INFO: Task 12:Updated 4 tags from existing Discogs Id successfully
Oct 26, 2009 4:24:41 PM: INFO: Task 12:Update Tags from Existing Discogs Ids was unable to find a match for 12 Songs
Oct 26, 2009 4:24:41 PM: INFO: Task 12:Completed Updating Tags from Discogs for 16 files
Oct 26, 2009 4:24:41 PM: INFO: Task 13:Corrected 8 Lyrics
Oct 26, 2009 4:24:41 PM: INFO: Task 13:Unable to find a match for 8 Songs
Oct 26, 2009 4:24:41 PM: INFO: Task 13:Completed Correct Lyrics on 16 Songs
Oct 26, 2009 4:24:41 PM: INFO: Task 14:Started Clustering Albums for 16 Songs
Oct 26, 2009 4:24:41 PM: INFO: Task 14:Before clustering there were 1 albums spread over 1 MusicBrainz Release Ids
Oct 26, 2009 4:24:41 PM: INFO: 8 Songs were modified with the Autocorrector
Oct 26, 2009 4:24:41 PM: INFO: Completed Autocorrecter on 16 Songs

Here’s the same Eminem - 8 Mile directory of files loaded into Picard:

On the plus side, at least Picard manages to assign 15 out of the 16 tracks to the correct album which beats Jaikoz’s 4/16. On the minus side, Picard still leaves manual work to the user by mis-assigning track 2 “Love Me” to the wrong Album. It looks like Picard is able to generate fingerprints for all tracks though, unlike Jaikoz which fails to generate any fingerprints.

Here’s another example for “Headlines and Deadlines: the Hits of a-Ha”
http://musicbrainz.org/show/release/?mbid=f8c6f21d-a775-45b7-a49e-9e0a06f3a94f

Jaikoz is unable to find album art for track 8 “Hunting High and Low (remix)” and all other tracks containing the string “remix”. Is this because the ‘r’ of ‘remix’ is not a capital ‘R’ in my existing meta data?

Here’s a slightly different example. Jaikoz fails to assign any songs from the full Trainspotting album to any release.

When you try Manual Correct, it seems that Jaikoz thinks the potential releases all have only 10 tracks (and perhaps for this reason fails to find a match).

In fact the MusicBrainz database shows that this album actually has 14 tracks contrary to what Jaikoz shows. And all 14 tracks have meta data that match well with the existing meta data on the tracks in my directory:
http://musicbrainz.org/show/release/?mbid=b5f740e5-5ea6-4e07-bc14-8b0448c96174

Hi mjw, thanks for describing your problems in such detail. I think I should be able to clear up most of your issues.

Looking at your examples I think you are hitting a few problems, and there are a few misunderstandings about how Jaikoz works.

  1. For fingerprinting Jaikoz uses GenPuid, provided by MusicIP/Amplified Music Services. This does support fingerprinting WMA, but ONLY on Windows, not on Linux, because it uses some of the decoders built into the OS. Is your collection accessible from Windows? It works with Picard because Picard uses an open-source library based on the same code as GenPuid but with its own decoders; the disadvantage of this approach is that Picard cannot submit new ids to MusicDNS, but Jaikoz can. Amplified Music Services are currently updating their software, so these restrictions should disappear.

  2. Looking up songs from MusicBrainz works either by Acoustic Ids or by matching metadata. Although your metadata is quite good, the artist is completely incorrect in the examples, and that's having a big impact on the score for potential matches. From your screenshot of Manual Matches you can see that the scores are about 60, but the default required to get an automatch is 70; you can reduce this value in Preferences/Musicbrainz/Automatch/Minimum Rating required if Meta Match Only.

  3. Everything (except lyrics) is keyed off MusicBrainz for reliable data matching. If you don't match a MusicBrainz release you won't get artwork with Jaikoz, because trying to get artwork directly from Amazon/Google based purely on metadata isn't very accurate; the results aren't good enough to let a program do it without needing to check all the results. But soon I will be adding full Discogs support. The combination of the MusicBrainz and Discogs databases should cover at least 95% of releases.

As I said, MusicBrainz can be looked up by metadata or Acoustic Ids, and MusicBrainz does not currently allow albums to be added automatically to their db. You have to understand that the quality of the data in MusicBrainz is VERY good, and they are very cautious in order to keep it that way.

Yes, it uses all this information; it's just that the artist name does not match at all, which is causing the problem.

No, all that is happening is that Jaikoz shows 10 songs at a time (you can change this in Preferences/ManualMatch/No of songs to process before showing Dialog).

This might seem weird if you are only correcting 1 'album', but Jaikoz is normally used to correct hundreds of songs in one go, so it needs to set a manageable limit to view.

I would suggest that is the case, but it's best to experiment by running Manual tag from Musicbrainz to see what results come back.

Actually I think I'm getting confused here; Linux doesn't do M4a either.

Look at this post http://www.jthink.net/jaikozforum/posts/list/926.page, you might need to do:

sudo aptitude install libstdc++5

What I don't understand is why this has just started being a problem and which Linux distros it applies to.

[quote=paultaylor]Actually Im think Im getting confused about, Linux doesnt do M4a.

Look at this post http://www.jthink.net/jaikozforum/posts/list/926.page, you might need to do:

sudo aptitude install libstdc++5
[/quote]
This does not seem to be the source of my problem with fingerprinting. On my Linux box I already have this library installed and genpuid works fine. I’ve actually now found that this problem is not limited to WMA; it's also M4A. I can't check whether it is also a problem with MP3, since I get the console message “Ignored 11 Songs because they already have an Acoustic Id” and can't find a way to remove the Acoustic Ids or force them to be recalculated.

Is there any way you can load up my files in an Ubuntu virtual machine and try this out at your end?

Thanks for the detailed answers.

Do you have any guess as to how long it would take to get this into Jaikoz? I’m wondering whether to put the tagging of my WMA files on hold for a month, or alternatively try tagging that part of the collection under WinXP within VMWare (although I would only have 1 thread and 1GB RAM instead of 8 threads and 6GB RAM).

It looks like the score is always dropping below 70 when the artist is “Various”. Reducing the threshold to 55 seems to be the level that is required to get good matching but I worry that may also be a little too low and start matching things that it shouldn’t.

Any thoughts on a timeframe? I could delay this cleanup for a month if it’s coming soonish.

I did some experimentation and the results are interesting. Capitalization had no effect on score, but the type of bracket I use matters. Changing from [remix] to (remix) increases the score from 73 to 84. Even when I change every single other editable field on the popup manual match window so that it matches exactly, I still cannot get the score above 84. The only other field that I can see on the page but am unable to edit is the Track Length which is off by 1 second. So I’m assuming that we must be having 16 points deducted even though a 1 second difference is a pretty negligible difference IMO.

I can think of a few improvements that would largely eliminate the AutoCorrect problems:

  1. Can you perhaps adjust the scoring algorithm so that the type of brackets that are used does not have any impact on scoring? Also make it so that a track length that is off by 1 or 2 seconds only causes one or two points to be deducted rather than 16 points.
  2. Do you think you could detect the string “Various” and treat it like a wildcard, as if the Artist field was a 100% match no matter what? Perhaps you could even ideally find matches regardless of whether the “Various” is in the local data or the remote data or both.
  3. Add a Manipulator that goes through all the clusters (leaf directories?) at the last step and attempts to convert all the tracks in the same cluster so that they use the same Release Id. If different tracks are mapped to different Release Id’s it should pick the Release Id to use for each track in the Cluster according to the following criteria:
    a) Number of tracks in Release >= Number of tracks in Cluster
    b) Track Number is 100% match
    c) Track Length is 95% match
    While ensuring that the above criteria are satisfied in all cases it should try to minimize the number of separate releases.
  4. [This suggestion is probably redundant if you develop 3 above] Provide an “Apply Top Manual Match from MusicBrainz” option which just picks the best match from MB without bothering with the dialog box.
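To make suggestion 1 concrete, here is a rough sketch (Python, with made-up function names and weights; the real Jaikoz scoring algorithm is not public) of bracket-agnostic title comparison and a proportional track-length penalty:

```python
import re

def normalize_title(title):
    """Treat [remix], (remix) and {remix} as equivalent by mapping
    every bracket type to round brackets, and ignore case."""
    title = title.lower()
    title = re.sub(r"[\[{]", "(", title)
    title = re.sub(r"[\]}]", ")", title)
    return title

def duration_score(local_secs, remote_secs, max_points=5):
    """Deduct roughly one point per second of difference instead of
    an all-or-nothing penalty (the 5-point weight is illustrative)."""
    diff = abs(local_secs - remote_secs)
    return max(0, max_points - diff)

# Bracket type no longer matters, and a 1-second length
# difference costs a single point rather than the whole component:
print(normalize_title("Hunting High and Low [Remix]"))  # hunting high and low (remix)
print(duration_score(237, 238))                         # 4
```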

Yes, having double checked: Linux doesn't support acoustic ids for WMA and M4a, and OSX doesn't support WMA. It's not top priority because Linux users don't usually opt for WMA; in the short term your best bet would be to use VMWare. (You can just run Retrieve Acoustic Ids under VMWare and do the rest of the matching directly on Linux.)

You're getting no match on artist, so losing 15 on the score.

Well, I hope it will be within a couple of months, but I can't promise for sure. I think if you do fingerprinting in VMWare that will solve most of your issues.

RE: the a-Ha example, I also did some experimenting.

The problem is not that the case of Remix is different; it is that your song uses square brackets and MusicBrainz uses round brackets, and this is affecting the score. With the square brackets the top match for Hunting High and Low [Remix] is 69 (the autocorrect threshold is 70); change it to Hunting High and Low (Remix) and you get 80. This is probably a bug; it should treat the brackets the same.

BTW if you run Autocorrect once, then run it again, it will ignore tracks that have already been matched and just retry the failed ones. Because a track gets a better score if it matches a release that is used by another song (so we do group them), it will get a slightly better score second time round (73 instead of 69) and will actually get matched.

So two workarounds are to lower the threshold score, or to run Autocorrect twice.

  1. Track length doesn't contribute 16 pts; there must be other items that don't match. I don't have the full scoring algorithm handy at the moment but I'll publish it later.
  2. This doesn't make much sense in the general case, because your artist name is wrong, so it shouldn't get as good a score as when you had the artist field matching. It also makes it difficult to get a good match; i.e. we could find the same song covered by two different artists, and without the artist name to match on it's not clear which is the correct one.
  3. This is what 'Cluster Albums' does, isn't it?
  4. I don't see why this would be redundant if I did (3); as you describe it, wouldn't it be just the same as running Tag from Musicbrainz but with a low threshold score? I did consider changing Manual Tag from Musicbrainz so it defaulted to selecting the best score rather than selecting no score; would that be useful?

Is this a bug in MB or Jaikoz? If it is the former, would you be able to submit the bug?

Thanks. I’ll try both.

That would be useful. My suggestion would be to ensure that all of the fields that are used in the algorithm are made visible in the Manual Match popup. At the moment, all I can see is that we've somehow lost 16 pts even though the only difference in the popup is Track Length. And I can't verify this because there is no way I can edit Track Length in Jaikoz.

I’m not sure that I follow. Let’s take the Trainspotting album example. If you take this CD and rip it with Windows Media Player then each track will have Artist=“Various”. This is not wrong per se, it’s just ambiguous. Ripped compilation albums often have Artist=“Various” from what I have seen. What “Various” means in 99% of cases is: “Don’t know, but each track in the album might be a different artist”. If you look at MB, there are also many cases there where the Artist=“Various Artists”. All I am suggesting is that you don’t deduct any matching points for a mismatch where either the local or remote data contains the string “Various”. I’m not sure how you would implement this, but perhaps it is as simple as replacing “Various” with “”.
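A sketch of how the “Various” wildcard idea might look (the function name and the 18-point weight are hypothetical, not Jaikoz code):

```python
def artist_score(local_artist, remote_artist, max_points=18):
    """Award full artist credit when either side is a 'Various'
    placeholder, since it carries no real artist information.
    The 18-point weight is illustrative only."""
    placeholders = {"various", "various artists", ""}
    if local_artist.strip().lower() in placeholders:
        return max_points
    if remote_artist.strip().lower() in placeholders:
        return max_points
    return max_points if local_artist.lower() == remote_artist.lower() else 0

print(artist_score("Various", "Eminem"))   # wildcard: full credit
print(artist_score("Eminem", "Dr. Dre"))   # genuine mismatch: no credit
```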

Regarding your example “same song covered by two different artists without the artist name to match on not clear which is the correct one”. I can see that this might be tricky, but couldn’t you use the following other meta data if available to choose the right one:

  • Fingerprint
  • Consider the other tracks in the user’s same Album/Directory. If MB has 2 tracks to choose from with identical Title but different Artist, then pick the track that is from the same Album as the other track(s) that the user has in that Album/Directory
  • Track Length
  • Release Date
  • etc

If at least some of this is available, the chances are that you should be able to pick the correct track in preference to the wrong tracks even while ignoring the Artist field.

I’m not sure, could you perhaps outline the algorithm that you use here? I guess I might just be confused because “Cluster” does something slightly different in Picard where it is used as a precursor to a manual match whereas in Jaikoz cluster seems to be used as a post-processing step.

[quote=paultaylor]
4)Don’t see why this would be reundant if I did (3), as you describe it wouldn’t it be just the same as running Tag from Musicbrainz but with a low threshold score. I did consider chnaging Manual Tag from Musicbrainz so it defaulted to selecting the best score rather than selecting no score, would that be useful. [/quote]
In the popup window, right? I think it would be useful. It would also be clearer if the first row were moved completely out of the table, to just above the title row, so that it doesn't have a radio button beside it. I didn't realise for a long time that this first row was local data whereas the other rows are remote data, because initially the first row looked the same as the other rows to me.

I've looked into this in more detail and the situation is more complex than I first thought.

Actually, searching using [Remix] or [remix] brings back identical results from MusicBrainz, but the score returned by MusicBrainz is based only on the similarity of the track name searched for to the track name found, so within Jaikoz we have to rescore potential matches as shown at http://www.jthink.net/jaikozforum/posts/list/1141.page

The score for ‘Hunting High and Low [Remix]’ is as follows

Artist: 18 (from 18)
Title: 17 (from 28)
Release: 12 (from 12)
Release is Original Album: 0 (from 13)
Track No: 7 (from 7)
Track Duration: 5 (from 5)
TotalTracksOnRelease: 4 (from 4)
ReleaseCountry: 4 (from 4)
Release used by another Song: 0 (from 4)
ReleaseType: 0 (from 3)
Release is an Official Album: 2 (from 2)

giving 69

Note we are only getting 17 out of 28 for Title because of the difference in brackets (which is reasonable, though it could be a bit higher). If the name were modified to (Remix) it would then get the full 28 out of 28.

We are getting 0 out of 13 for Release is Original Album. The general idea is that users often want to match to the original track even if their track actually came from a compilation. Unfortunately, because this remixed version is only available on a compilation, you are being a bit penalised by this.

We are getting 0 out of 3 for Release Type; you only get this score if the track is on an Album/E.P./Single. This needs revisiting: it existed before Release is Original Album and is now redundant, or certainly a double blow when trying to match a compilation track.
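For what it's worth, the additive scoring can be reproduced by summing (earned, available) component pairs; the numbers below simply restate the 'Hunting High and Low [Remix]' breakdown from this thread:

```python
# (component, points earned, points available), copied from the
# score breakdown posted above
components = [
    ("Artist", 18, 18),
    ("Title", 17, 28),
    ("Release", 12, 12),
    ("Release is Original Album", 0, 13),
    ("Track No", 7, 7),
    ("Track Duration", 5, 5),
    ("TotalTracksOnRelease", 4, 4),
    ("ReleaseCountry", 4, 4),
    ("Release used by another Song", 0, 4),
    ("ReleaseType", 0, 3),
    ("Release is an Official Album", 2, 2),
]

total = sum(earned for _, earned, _ in components)
maximum = sum(available for _, _, available in components)
print(total, maximum)  # 69 100
```

Notice the available points sum neatly to 100, so the 70 automatch threshold is effectively a percentage.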

if you were to disable ‘Prefer original release even if better meta Match to later compilation release’ you would get

Artist: 18 (from 18)
Title: 17 (from 28)
Release: 22 (from 22)
Release is Original Album: 0 (from 3)
Track No: 7 (from 7)
Track Duration: 5 (from 5)
TotalTracksOnRelease: 4 (from 4)
ReleaseCountry: 4 (from 4)
Release used by another Song: 0 (from 4)
ReleaseType: 0 (from 3)
Release is an Official Album: 2 (from 2)

giving a score of 79

but of course it's not that practical to change this setting on an album-by-album basis, not least because in normal use you'll want to modify a mixture of albums.

So in summary there is a bug with ReleaseType scoring; it currently nearly duplicates the Release is Original Album match. What it was meant to do (but doesn't) is give you a match if your track seemed to be of the same release type as the match, i.e. if you had the isCompilation flag set on your track you would get credit if the matching track's album was a Compilation. If your track was part of a release with fewer than 5 tracks of the same release name, it would be considered a Single/E.P. and would get credit for that. But this wouldn't help you in any case, because your track is not marked as a compilation.
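The intended rule could be sketched roughly like this (field names and the 5-track cutoff are as described above; the function itself is made up, not the actual code):

```python
def release_type_score(local_is_compilation, local_tracks_with_same_release,
                       remote_release_type, max_points=3):
    """Credit the ReleaseType component only when the local track looks
    like the same release type as the candidate release."""
    # Local track explicitly flagged as a compilation track
    if local_is_compilation and remote_release_type == "Compilation":
        return max_points
    # Fewer than 5 local tracks sharing a release name suggests Single/E.P.
    if local_tracks_with_same_release < 5 and remote_release_type in ("Single", "EP"):
        return max_points
    return 0

print(release_type_score(True, 14, "Compilation"))   # 3
print(release_type_score(False, 3, "Single"))        # 3
print(release_type_score(False, 14, "Compilation"))  # 0
```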

Also there is a usability problem with 'Prefer original release even if better meta Match to later compilation release': if the only matches returned are compilations they are going to get unfairly low scores. Any ideas?

So we now know the score was lost because:
the track name didn't fully match
the track was on a compilation
the release was not used before

I could add an isCompilation field; in theory I could even add the individual scores, but then it becomes too confusing for most users. Earlier versions of Jaikoz actually allowed you to set the score weightings but this just confused people, so I'm trying to go down the Apple way a bit more of just doing as much as I can without the user being fully involved in exactly what is happening, whilst still letting you retain control of what is saved etcetera.

Actually it is wrong: WMP should set the 'Album Artist' field to 'Various' and 'Artist' to the artist for that track. Also, if it is identifying it as a compilation, it should set the isCompilation field.

[quote=mjw] If you look at MB, there are also many case there where the Artist=“Various Artists”. All I am suggesting is that you don’t deduct any matching points for a mismatch where either the local or remote data contains the string “Various”. I’m not sure how you would implement this, but perhaps it is as simple as replacing “Various” with “”.
[/quote]
But the scoring is additive, not subtractive; you are saying give credit for the artist field matching even when it doesn't.

Your algorithm is a series of big if statements: check this field, IF this do this, else do this… It's not maintainable. My approach is to get a list of possible tracks using a variety of searches, then calculate a score and pick the best one. This algorithm is better at dealing with all the unusual cases that haven't been explicitly covered, but yes, it needs more work.

I think it does what you want: it groups together all tracks with the same release and artist name, then when the release ids within the release differ it tries to match all the tracks to the same release id. It is completely different to Picard's Cluster.
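Roughly, that grouping works like this (a simplified sketch with a made-up data model; not the real implementation):

```python
from collections import Counter, defaultdict

def cluster_albums(tracks):
    """Group tracks by (release name, artist); where release ids within a
    group differ, move every track onto the group's most common release id."""
    groups = defaultdict(list)
    for t in tracks:
        groups[(t["release"], t["artist"])].append(t)
    for group in groups.values():
        ids = [t["release_id"] for t in group if t["release_id"]]
        if ids:
            winner = Counter(ids).most_common(1)[0][0]
            for t in group:
                t["release_id"] = winner
    return tracks

# Two tracks matched one release id and one matched another;
# clustering pulls them all onto the majority id.
tracks = [
    {"release": "8 Mile", "artist": "Various", "release_id": "mbid-a"},
    {"release": "8 Mile", "artist": "Various", "release_id": "mbid-a"},
    {"release": "8 Mile", "artist": "Various", "release_id": "mbid-b"},
]
cluster_albums(tracks)
print([t["release_id"] for t in tracks])  # ['mbid-a', 'mbid-a', 'mbid-a']
```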

[quote=mjw]
In the popup window, right? I think it would be useful. It would also be clearer if the first row if moved completely out of the table to just above the title row of the table so that it doesn’t have a radio button beside it. I didn’t realise for a long time that this first row was local data whereas the other rows are remote data because initially the first row looked the same as the other rows to me.[/quote]
Yes. It used to do this, but it looked very clunky.

Thanks for this thorough analysis. It has really helped me to understand better how things work.

It seems like there are a few quick fixes that would help here:

  1. Make the Title match agnostic of the type of bracket that is used
  2. Remove ReleaseType from the scoring and add the 3 points to the TrackNumber max score instead.
  3. Make isCompilation a column in the Manual Correct popup
  4. Consider displaying each component of the total score at the beginning of each cell in the Manual Correct popup. So for example, instead of ‘Hunting High and Low [Remix]’ it would instead appear as ‘{17/28} Hunting High and Low [Remix]’. I guess you could even make this part of the string a different colour to make it stand out. By seeing each component of the score like this it would be very easy to work out why particular local tracks were matching with particular remote MB Tracks

If you were to implement the release-level matching algorithm that I describe below, then item 4 would need to be revisited since there would now be a release-level score as well as track-level scores so the UI might need to look different.

Thanks for explaining this. I will try experimenting with this setting.

Sorry, yes, I was suggesting that you give credit for the artist field matching only in the case where it contains the string “Various”. But let me think on this one a bit more. Now that I understand the principles a bit better, I've got the beginnings of a few ideas that might be able to greatly increase the accuracy of the matching without having to use this approach.

You’ve given me a few ideas and after a lot of thought I think there’s a way of doing this with very few if statements. At a high level it could work like this:

  1. Get a group of local tracks that are in the same subDirectory and have the same AlbumName
  2. Use your existing “ManualCorrect” algorithm to find all the candidate MB Releases for all of those tracks and add them to a Set to eliminate duplicate releases
  3. Calculate the maximum total track-level score for each of the candidate releases as follows
    …3.1 Get the next candidate release from the Set returned by step 2
    …3.2 Assign all the local tracks from step 1 to the optimal track number in the MB Release that will cause the SUM of the individual track-level scores to be the highest possible number. This number is the initial release-level score. Clearly if you have 20 tracks you don’t want to try 20 factorial combinations so we need some smarter (parallel?) algorithm to reduce the number of combinations to test.
    …3.3 Goto 3.1 and process the next candidate release
    …3.4 Return a List of ScoreResults containing the release-level score of each candidate release (from step 2) paired with the optimal MBRelease=>LocalTrack association (step 3.2) that achieved that score
  4. Apply any release-level scoring adjustments that might be required using the release attributes AlbumName, Release is Original Album, Release is an Official Album, isCompilation, ReleaseCountry. TotalTracksOnReleaseGreaterOrEqualToNumberOfLocalTracks might not be needed, since the algorithm will naturally bump up the score for a release where more tracks match.
  5. Sort the candidate MB Releases by the final total release-level score
  6. Map all the local tracks to the MBTracks in the MBRelease with the highest final score. At this point there could be some local tracks that have no assigned MBTrack. The user could try re-running to see if these remaining tracks could also be matched to separate Releases, but most likely it’s an indication that the precise Release has not yet been submitted to MB for that AlbumName. E.g. A common example would be a release with an extra bonus track as the last track.

Note 1 - The algorithm for calculating individual track match scores would be almost the same as your current algorithm, except it would exclude any attribute that belongs to the Release rather than the Track. So it would include only Artist, Title, TrackNo, TrackDuration. “Release used by another Song” would no longer be needed since the algorithm now only checks songs that are together in the same release.
Note 2 - The score for a release containing 14 tracks is going to be in the range of 0-1400. This works nicely because it means that the more tracks in the local grouping that match with a MB Release, the higher the score will be. It will also guarantee that only one MB Release is chosen for all the tracks. If a set of tracks matches somewhat with one MBRelease containing 9 tracks with good track match scores and another containing 14 tracks with good track match score, it will tend to pick the MBRelease containing 14 tracks.
Note 3 - If you provide both the Release-Level score (same for each track in the same release) and the Track-Level score as columns in the table, then it would be very easy for users to sort and review the fixes made by the AutoCorrector so that you can spot any major mistakes in assigning tracks to the wrong release. It would be handy though if you could sort by more than one column simultaneously, like the way that it works in Excel for example. If you have this functionality then you might not need to even have minimum score thresholds any more since you can just show all the best matches and let the user review them.
Note 4 - This approach also solves the problem you raised about ‘Prefer original release even if better meta Match to later compilation release’. It will always pick the release with the highest final score so it doesn’t matter if there are no original releases in MB.
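To make the proposal concrete, here is a rough sketch of the release-orientated scoring in steps 3-6 and Note 2. All function names, field names and weightings here are purely illustrative assumptions, not Jaikoz’s actual code:

```python
# Illustrative sketch of the proposed release-orientated scoring.
# All names and weightings are hypothetical, not Jaikoz's actual code.

def track_score(local, mb):
    """Score one local track against one MB track, 0-100."""
    score = 0
    if local["title"].lower() == mb["title"].lower():
        score += 40
    if local["artist"].lower() == mb["artist"].lower():
        score += 30
    if local["trackno"] == mb["trackno"]:
        score += 15
    if abs(local["length"] - mb["length"]) <= 3:  # seconds
        score += 15
    return score

def release_score(local_tracks, mb_release):
    """Sum the best track-level score for each local track (Note 2:
    a 14-track release can therefore score up to 14 * 100 = 1400)."""
    total = 0
    for local in local_tracks:
        total += max(track_score(local, mb) for mb in mb_release["tracks"])
    return total

def best_release(local_tracks, candidate_releases):
    """Steps 5-6: rank candidate MB releases by release-level score
    and map all local tracks to the top-scoring one."""
    return max(candidate_releases, key=lambda r: release_score(local_tracks, r))
```

Because the release-level score is a sum over all local tracks, a release where more of the grouping’s tracks match will naturally outscore one where fewer do, which is the key property described in Note 2.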

Perhaps just removing the radio button from the first row would do the trick.

Yes, okay.

It’s broken as is, but this doesn’t really fix it.

Yes, okay.

Most users are just going to think, what is this, information overload etcetera, and as you say if the algorithm gets more complex it wouldn’t make much sense to display this anyway.

Yes, I think you have something here, and I’m going to implement something along these lines and see how it goes.

If we were just to group tracks by subDirectory it would be okay as long as the user is organizing their files as one album per folder. But this is not always the case, either by design or because of lack of organization. Your rule of subDirectory AND album is safe enough, but why not just use artist AND Album? Then this would group tracks from the same album that don’t happen to be in the same folder. I assume it’s because you are trying to deal with the case of artist being set to ‘Various’ for some tracks. The trouble is people aren’t always burning from a CD, so all the tracks might not have the same album set even if the user intends this. Another case is users trying to amass a complete original album, but because of the rarity of the album having to make up some tracks from compilations, so we have the case where tracks are not all from one album but the user would like them to be. Conversely we have the case where the user does want the tracks to be maintained on the compilation album.

Ok, so we are matching tracks to MB tracks via PUIDs or metadata, and matching MB releases to releases found by these means, or also by doing a release match as well.

So this is the key difference from what Jaikoz currently does: actually working out a release-level score considering all the tracks on the release that you have in your collection. This is a good idea. I think section 3.2 would be more a case of just matching tracks to the MB release by comparing name, length and track nums, and I already do this. It doesn’t really need to consider all combinations because normally most tracks will clearly match one location on the release. I’ve been reluctant to do this in the past because users may not elect to match all tracks, and the list of songs in the Edit Pane may not be displayed in release order. Processing is currently parallelised so separate threads just take the next song off the queue. But I don’t think these are that much of a problem.

…3.3 Goto 3.1 and process the next candidate release
…3.4 Return a List of ScoreResults containing the release-level score of each candidate release (from step 2) paired with the optimal MBRelease=>LocalTrack association (step 3.2) that achieved that score

I think this step can just be applied as part of 3.1, so that 3.4 returns the best score.
5. Sort the candidate MB Releases by the final total release-level score
6. Map all the local tracks to the MBTracks in the MBRelease with the highest final score.

If there is a reasonable match for the track during the initial steps it should be assigned to the release even if it’s a different release to the other tracks.

This is the key difference/improvement.

This is the one problem: it doesn’t solve this issue because your algorithm would always favour the compilation. A user may have 20 tracks in a compilation but they actually want them tagged to their original releases, even though this might mean they end up with 20 different releases. If both the original releases and compilations are in the database I want the tracks to match the original releases when ‘Prefer original release even if better meta Match to later compilation release’ is selected, and the compilation release when it isn’t.

It’s further complicated by the fact that MusicBrainz would (correctly) call a release a compilation if it was a Greatest Hits album by one artist, but iTunes expects it to have multiple artists on the release.

So we need to finesse the algorithm a little, perhaps as follows.

IF ‘Prefer original release even if better meta Match to later compilation release’ is selected
AND the tracks appear to be from a compilation because any of the following hold:

  1. The IsCompilation field is checked
  2. The Artist or Album Artist is ‘Various’
  3. The release has different artists on each track
  4. The release name is ‘Greatest Hits’

THEN use the simplified original algorithm to find the best release for each track without considering the other tracks.
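That compilation test could be sketched roughly like this. All field names here (`is_compilation`, `album_artist`, etc.) are hypothetical illustrations, not Jaikoz’s real tag model:

```python
def looks_like_compilation(tracks, release_name):
    """Heuristic from the thread: any one of the four signals is enough.
    Field names are illustrative, not Jaikoz's actual tag model."""
    artists = {t.get("artist", "") for t in tracks}
    return (
        any(t.get("is_compilation") for t in tracks)                 # 1. flag checked
        or any(t.get("artist", "").lower() == "various" for t in tracks)
        or any(t.get("album_artist", "").lower() == "various" for t in tracks)  # 2.
        or len(artists) > 1                                          # 3. mixed artists
        or release_name.lower() == "greatest hits"                   # 4. name heuristic
    )
```

When this test fires (and ‘Prefer original release…’ is selected), each track would fall back to the simplified per-track matching described above instead of the release-orientated scoring.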

Thanks for listening to all these long-winded suggestions and even better for considering implementing them! This certainly sounds like a lot of work. Perhaps it’s even worthy of a major revision number - Jaikoz 4.0? :slight_smile:

True. Perhaps you could hide it as an advanced option or something. I’m thinking that once you get this new algorithm implemented, the next stage is going to be to find the optimum scoring weightings. This is going to be much easier for myself and other users to assist you with if there is a clear way to see how each score was computed. It will also be useful to answer future user questions along the lines of - “Why did Jaikoz pick release X instead of Y?” without requiring your support time to perform the analysis.

Even if they put multiple albums in one subDirectory it will also work as long as the string “AlbumName” is unique and consistent for all the tracks that belong to the same Album within that subDirectory.

If users have some other file organization, there is a workaround - move all the tracks that you want to consider as a Group of Albums into the same leaf directory before running Jaikoz.

A problem arises when they want to combine tracks that have inconsistent AlbumName as a Group. Maybe your concept of Manipulator Tasks (and their associated Preferences) can be used to provide the flexibility to handle some of these user-specific cases. Maybe the “Grouping” algorithm could be an optional task that you run prior to any matching and scoring. You could potentially allow users to configure the “Group” task by adding some preference check-boxes to allow the user to specify which fields the Grouping algorithm would use (a bit like defining the “WHERE” clause in a SELECT statement operating on all the tracks) when performing the Grouping. It sounds like the cases that you have mentioned could be handled if the following were optionally allowed in the WHERE clause for the SELECT:

  • Subfolder [true by default]
  • AlbumName [true by default]
  • AlbumArtist [false by default]
  • TrackArtist [false by default]

You could even optionally display a GroupId (populated once the Grouping task has run across the tracks) as a column in the table so that we can easily see whether the grouping worked as expected.
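The configurable “Group” task could be sketched as a compound grouping key built from those check-boxes, with a GroupId assigned per group. All names here are assumptions for illustration only:

```python
from collections import defaultdict
from pathlib import Path

def group_tracks(tracks, use_subfolder=True, use_album=True,
                 use_album_artist=False, use_track_artist=False):
    """Group tracks by a configurable compound key, mirroring the proposed
    'WHERE clause' check-boxes. Field names are illustrative."""
    groups = defaultdict(list)
    for t in tracks:
        key = []
        if use_subfolder:
            key.append(str(Path(t["path"]).parent))  # leaf directory
        if use_album:
            key.append(t.get("album", ""))
        if use_album_artist:
            key.append(t.get("album_artist", ""))
        if use_track_artist:
            key.append(t.get("artist", ""))
        groups[tuple(key)].append(t)
    # Assign a GroupId per group, which could be shown as a table column
    return {group_id: g for group_id, g in enumerate(groups.values())}
```

With the defaults (Subfolder and AlbumName on), tracks in the same directory with the same AlbumName end up in one group, matching the “subDirectory AND album” rule discussed above.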

I’m presuming you are suggesting AlbumArtist rather than TrackArtist (since a compilation release can have many TrackArtists so this would prevent the correct matching of most compilation albums). In my collection AlbumArtist is empty for every single local track so it wouldn’t be helpful in the initial grouping process. I think that this might be the most common case with ripped CD collections.

TrackArtist wouldn’t help my use case either since, personally, I don’t want Jaikoz to prioritize Original Releases over Compilations. In my personal view it is preferable that songs are matched accurately to the actual albums that they were ripped from. Then I would like MB metadata to improve fields like Artist, TrackName, AlbumArtist, TrackArtist, Genre etc to make it easy to find songs on an iPod/Sonos etc. But there would be no harm (and possible benefit to some types of users) if you could add both AlbumArtist and TrackArtist as an option in the specification of a new “Group” task (as described above).

I would think that users with small music collections don’t need Jaikoz as much as those with large ones. In the large music collection case I would say that not using subDirectory in the case where a user does have reasonable organization is a missed opportunity for increasing accuracy and reducing the percentage of manual corrections that are needed (perhaps by as much as 50%). I wouldn’t sacrifice this potential improved accuracy just for some potential fringe cases that can probably be accommodated via preferences like “Prefer Original Release”.

This is certainly one example of where a Grouping based on AlbumArtist or even TrackArtist would break down.

Realistically in any large collection there are going to be some cases that require some manual intervention. Hopefully these can be minimized. But if AlbumName really cannot be trusted for some subset of directories in a collection, then it might make sense to disable “AlbumName” from the Grouping task preferences as described above.

If they are indeed trying to amass a specific original album from tracks found from various sources, then this sounds like a case where only the user himself knows which tracks he wants to belong to which album. There’s no way for Jaikoz to know for sure. You already have the Manual Correct functionality that can handle this case. In addition, it doesn’t seem unreasonable to expect the user to do some manual edits to the AlbumName and Subdirectory himself before kicking off ManualCorrect/AutoCorrect to encourage Jaikoz to match tracks to the albums that the user wants.

This case could also be handled by disabling “AlbumName” in the Grouping task preferences and either
a) Manually fixing meta data and moving all the tracks into a common subDirectory and enabling “SubDirectory” in the Grouping task prefs, and/or
b) Enabling AlbumArtist and/or TrackArtist in the Grouping task prefs

If the “Prefer original release over compilation” option is set then modify the sorting algorithm used for sorting the best release match so that original releases come first even if they have a lower score than compilation albums. If the option is not set, then simply numerically sort by release-score.
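That sorting tweak could look something like this sketch, where each candidate is a (score, release) pair and the `is_compilation` key is an illustrative assumption:

```python
def sort_candidates(scored_releases, prefer_original=False):
    """scored_releases: list of (release_score, release) pairs.
    If prefer_original is set, original (non-compilation) releases sort
    ahead of compilations regardless of score, with ties broken by score.
    Otherwise sort purely by release-level score, highest first."""
    if prefer_original:
        # False sorts before True, so originals come first; then best score
        return sorted(scored_releases,
                      key=lambda sr: (sr[1].get("is_compilation", False), -sr[0]))
    return sorted(scored_releases, key=lambda sr: -sr[0])
```

So with the option off, a high-scoring compilation wins; with it on, the best-scoring original release wins even if a compilation scored higher.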

That sounds reasonable. The weightings can always be experimented with and optimized later. My sense is that, at present, Length and TrackNo are usually more reliable indicators than TrackName, where slight differences can cause a disproportionate number of points to be deducted.
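One way to make slight TrackName differences less costly, as suggested, is to score the name fuzzily rather than all-or-nothing. A sketch using Python’s difflib, with weightings that are purely hypothetical:

```python
from difflib import SequenceMatcher

def fuzzy_track_score(local, mb, w_name=30, w_length=35, w_trackno=35):
    """Weight Length and TrackNo above TrackName, and score the name
    fuzzily so slight spelling differences lose only a few points.
    Weightings are hypothetical, not Jaikoz's actual values."""
    name_sim = SequenceMatcher(None, local["title"].lower(),
                               mb["title"].lower()).ratio()  # 0.0 - 1.0
    score = w_name * name_sim
    if abs(local["length"] - mb["length"]) <= 3:  # within 3 seconds
        score += w_length
    if local["trackno"] == mb["trackno"]:
        score += w_trackno
    return score
```

With these weightings, a near-identical title still earns most of the name points, while a wrong length or track number costs more than a small spelling difference.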

I suppose it might become tricky when there are cases like:
a) User doesn’t have all the tracks in the release in his music collection
b) A similar TrackName occurs twice on the same release (e.g. one of them is an instrumental/extended/remix version of another song)
c) TrackNo is missing or unreliable

One thing that is useful in Picard is that it colour codes each track to show how good the match was (green, amber, red), and it also has a place to put tracks (in the same Group/Cluster) that didn’t match properly. I think that you could achieve similar if not better results by displaying your track and release level matching scores in your table (perhaps also with colour coding to quickly spot problems).

It would definitely be helpful to allow users to sort by multiple columns simultaneously. What I have been doing is using the Filter to select 1 album at a time, then sorting by TrackNo to get the songs into TrackNo order. Then I can more easily compare to a release in the MB website.

It could be combined (i.e. after scores have been calculated for all tracks in a potential matching release, but before sorting based on final release score). It might be more flexible though to keep them as separate methods (at least from a code perspective) so that in the future it might be easier to make these scoring algorithms separate pluggable tasks with preferences that you can sequence together via your Manipulator concept. That way you could expose all the track-level scoring parameter weightings as one set of “Track-scoring task preferences” and all the release-level scoring parameter weightings as another “Release-scoring task preferences”, then just run them in sequence one after the other. In the future you might even define alternative pluggable scoring algorithms that suit other use cases that come up.

Makes sense. I guess there would probably need to be some kind of minimum threshold to define what constitutes a “reasonable match” though. If a track falls below that threshold (e.g. the calculated fingerprint doesn’t even slightly match with any songs on the release), then perhaps there needs to be some way in the UI to indicate that it was not matched to a MBRelease, in a similar way to how Picard does. Perhaps you could put some colour coding on your “Status” column to allow users to easily spot where the problems lie.

Perhaps both the original algorithm and the new algorithm can be made into a set of Manipulator task steps so that users could use whatever approach suits their collection. Perhaps call the old one “Track orientated match with MusicBrainz” and the new one “Release orientated match with MusicBrainz”. Prior to “Release orientated match with MusicBrainz” you would need to have the task “Group (before release-orientated match)”. After “Track orientated match with MusicBrainz” you would need the current “Cluster” task, but perhaps to reduce confusion “Cluster” could be renamed “Group (after track-orientated match)” so that users don’t try to incorrectly mix release and track orientated tasks.

In the case of the new release-scoring algorithm I’m not sure it would be necessary to even know a-priori whether the local tracks are part of a compilation album. If the “Prefer original release over compilation” option is set then modify the sorting algorithm (step 5) so that original releases come first, even if they have a lower score than compilation albums. Then the algorithm will pick the original release with the highest release-level score. This might have the consequence that there are some tracks that do not match that release. The user will need to manually decide what to do about that:
a) they might want to move the orphan songs to another directory
b) delete them
c) try to match them up with other albums as a second pass of the algorithm

There’s a lot of good ideas here, and I’m going to make this the focus of the next release, but there are three points I need to reiterate.

  1. Jaikoz used to allow you to set the score weightings and other additional preferences but this didn’t really help anyone. I want as much as possible to just let Jaikoz fix everything automatically (whilst accepting it won’t be 100% accurate), and any options provided to the user should be non-technical such as ‘Prefer Original Release…’ rather than technical such as ‘Grouping options’.

  2. You seem to be trying Jaikoz out on one album at a time, but this isn’t how it is normally used; it is normally used on hundreds or thousands of songs at a time. Hence any option which would need setting/unsetting for different sets of tracks would not be workable.

  3. I mentioned that tracks might not be in release/track no order and you suggested multi-column sorting. There is a technical reason why I haven’t done this yet, which I might be able to resolve, but this is not the problem. Even if the user could sort by multiple columns, my point was that the user might choose to not sort the tracks by release, and therefore it could be confusing if tracks were fixed in release order regardless.

[quote=paultaylor]There’s a lot of good ideas here, and I’m going to make this the focus of the next release, but there are three points I need to reiterate.
[/quote]
This is great news.

Point taken. I can appreciate the trade-offs between flexibility and complexity and why you might be concerned with exposing some of this. The only reason I suggested exposing the score results and score-weightings is that I believe that we might have been able to do better than simply “guessing” the optimal weightings to use in the scoring algorithm. I think it should be possible to find the optimal weightings for most users by optimizing on a sufficiently large initial collection of ripped music (one that has not been touched by taggers yet). Perhaps a compromise would be to put them in a disk-based preferences file?

No, I’m running on about 8,000 tracks at a time. However, the majority of those tracks are complete (or mostly complete) albums with good directory structure and reasonable existing meta data. The example Albums that I have raised as issues are just small subsets of the 8,000 tracks that illustrate certain problems.

After I run AutoCorrect across the 8,000 tracks, I scroll down the artwork column to try to spot Albums where the artwork is either missing or inconsistent. I would like to be able to sort by Album AND by Artwork simultaneously to find errors more quickly, but this is not currently possible (I guess a fancier Table widget would be required).

I can see that it would probably be very convenient to be able to have an option in the Context Menu called “AutoCorrect treating selected tracks as one release”. This would group (into one release) all the selected tracks regardless of AlbumName or SubFolder and then run the release-orientated correction algorithm. This would reduce the necessity to go into the Preferences and change settings before re-running to fix the odd AutoCorrect mistake.

Then I could use the following process to fix my collection:

  1. Run AutoCorrect across all 8000 tracks
  2. Look for mistakes (hopefully 0-10% of the tracks where there were issues with AlbumName)
  3. Fix each mistake one by one by manually selecting the tracks that should have been identified as being in the same album, and choose “AutoCorrect treating selected tracks as one release”

[quote=paultaylor]
3. I mentioned that tracks might not be in release/track no order and you suggested multi-column sorting. There is a technical reason why I haven’t done this yet, which I might be able to resolve, but this is not the problem. Even if the user could sort by multiple columns, my point was that the user might choose to not sort the tracks by release, and therefore it could be confusing if tracks were fixed in release order regardless.[/quote]
I see (I think), I didn’t really fully understand your point the first time around. I’m not sure that much can be done to change an algorithm that is essentially working on a per-release basis so that it fixes in any order other than release order.

I suppose that I had always thought of the Table as just being a convenient way for browsing the tracks, making manual edits, and kicking off correction tasks on either individual tracks or on one or more releases. Certainly I didn’t expect that sort order in the table could/should affect the AutoCorrect algorithm. I did however expect that multi-selecting rows would affect the subset of tracks that were considered by the AutoCorrect algorithm.

Are you perhaps worrying that when AutoCorrect is running the user will see rows being corrected in a seemingly random order?

Perhaps I’m still not understanding the use case here. Are you suggesting that users might want to organize their collection around some MusicBrainz entity other than Release (say TrackArtist)? Is this even possible using the MusicBrainz API? At the heart of any algorithm to clean up using MB, surely you would need to associate a track to a concrete release before you could confidently do useful data fixing.

I could perhaps imagine for some reason a certain type of user doesn’t care about which album a song came from, but just wants to collect all the songs ever made by a particular artist. But if you don’t even associate the tracks into their respective MB releases, then I think you would also have to forego making changes to TrackNo, AlbumName, Year, Artwork, AlbumArtist, etc. Without some kind of release orientation you might not even be able to know the TrackTitle for sure since even the same song has a different title on different releases.

So my initial thoughts are (unless you have some specific example use cases) that the table sort order should not have any effect on the AutoCorrect algorithm. i.e. Regardless of whether the tracks are currently displayed in release/track no order, artist order, or whatever, just run the same AutoCorrect algorithm. The only UI thing that should have an effect is the List of currently selected Tracks.

Paul wrote

That’s right, it’s not always the case. I organize my collection in folders A, B, C etc., organized by Artist - Title, because I don’t like to have the same song 10 times or more in different albums. Take a look at Elvis Presley: his songs come up on various albums again and again. I like to tag one song for the earliest release (maybe from a single, not the album, or there exists no album). And for my nearly 200k songs I use MusicIP Mixer to do the playlists etc.

[quote=Alfg]
That’s right, it’s not always the case. I organize my collection in folders A, B, C etc., organized by Artist - Title, because I don’t like to have the same song 10 times or more in different albums. Take a look at Elvis Presley: his songs come up on various albums again and again. I like to tag one song for the earliest release (maybe from a single, not the album, or there exists no album). And for my nearly 200k songs I use MusicIP Mixer to do the playlists etc.[/quote]

That raises some interesting issues. So a particular Elvis Presley song like “Love Me Tender” appears on a lot of albums, but some users would like to only store 1 file on disk for the “Love Me Tender.mp3” even if they own many Albums containing that song. From a disk-space perspective I can see why people would want that, but this does seem to create problems for where to store the meta data. For example, after deleting the other 4 copies of “Love Me Tender.mp3” from disk, how would you be able to ask questions like “In my collection, which Albums contain the song Love Me Tender?”. I think the only answer you could expect is - 1 album, not 5.

[A bit of a digression - The problem seems to come down to a limitation of ID3 tagging. I may be wrong, but I think that ID3 Tags only allow you to store one TrackNo, AlbumName, ReleaseDate, etc inside one MP3 file. So this limitation seems to force you to choose one release to link each file to. I suppose that in an ideal world none of the meta data would be stored inside the music files, rather it would be stored in a central database much like MusicBrainz. If you want to play a particular album, your music player would be able to lookup the List of MB Unique Id’s associated with that Album and then see which of these you have available in your music library, then start playing in track order. All the meta data displayed by the music player would come from the MB database, not from the track. If instead you want to view songs by a particular Artist that could also be achieved in a similar way, again using the MB Unique Id as the keys to find the correct tracks to play. ]

But given the limitation that the current tagging technology requires you to associate a Track with at most 1 Album (release), this raises some problems. Say you start with a bunch of albums (initially ripped from CD) that are all tagged with the specific release that they were ripped from:

Option 1 - Clean first, then delete duplicates
I guess you could first use Jaikoz to enrich the meta data (with Album Art etc) and once that has been done you could delete the duplicate tracks. But then you are going to lose the Album Art that was specific to that track residing in that specific Album. In fact, when you browse specific Albums with a music player it’s going to look like your albums are missing tracks all over the place. I suppose that might be acceptable to some people if they never browse their collection by Album. Is that true in your case?

Option 2 - Start without duplicates, then try to map to an arbitrary album chosen by the user
If on the other hand you start with a hodge-podge of tracks that initially came from various incomplete Albums it’s a slightly different problem. If you really only care about browsing your collection by Artist or Genre and not Album then it wouldn’t be a problem though. You could delete all the meta data that has any connection with that track appearing on a specific release (e.g. TrackNo, AlbumName, ReleaseDate, etc).
But it sounds like you would instead like to switch all of those tracks so that they appear as if they have come from some other Album of your choice. But the TrackNo, AlbumName, ReleaseDate, etc is likely to be different in the Album that you would like to link the track to than the Album that the track initially was ripped from. So perhaps what would be needed is a Manuipulator Task in Jaikoz to empty out those fields (TrackNo, AlbumName, ReleaseDate, etc). Then you could do a “Manual Correct” to pick the release that you want to associate those tracks to. Would that work for you?

Perhaps if you are using MusicIP Mixer you don’t need any tags to be written into your music files apart from MusicIP ID. Is that the case?

Yes i think this is already on the enhancement list.

Yes, that is the issue. Also it may appear to the user that hardly anything is happening at all if most of the tracks being fixed are not visible.

Yes I don’t want the track order to have any effect on the final results, although it does at the moment because although we do take releases into account we don’t take all the tracks on that release into account at the same time.

I think Alfie is saying he is only interested in the album (and artwork) that the track was originally released on. He is not interested in subsequent reissues or compilations. If you consider that the original album is how the artist expected their songs to be seen, and the artist has little control over subsequent compilations, then it is a very reasonable position to take.

Actually ID3v2.4 allows you to store multiple fields but ID3v2.3 does not, nor do other formats such as Ogg Vorbis; even if they did it would be very difficult for users to manage this.

No thanks, much as I love Musicbrainz I would like to retain control of the info stored in my Songs.

No, these options are NOT acceptable; what the customer wants is to select ‘Prefer Original releases…’ and for Jaikoz to do just that.

I don’t think it’s a problem. You already popup a dialog box that shows that progress is being made i.e.

Checking 7.722 songs
142 Songs processed so far