
generate map of collection (processed vs. unprocessed)

For very large collections, perhaps display a color-coded directory tree with checkboxes for each node in the base folder.

That way the entire library can be broken down into smaller, bite-sized sections for Jaikoz to work on at one time.

For folders that might exceed memory limits during processing (e.g. an Unknown Artist folder with thousands of songs), give Jaikoz the option to subdivide such files into an appropriate number of subfolders, so it can work without needing huge amounts of RAM at once.

I'm not sure if it would be necessary to 'monitor' these nodes for new files as they get added; perhaps the status could change when the last-modified timestamp changes... just guessing here...
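Just to make the subdivision idea concrete, here is a minimal Java sketch (the .mp3 filter, the batch size of 500, and the /music root are all my own arbitrary choices, not anything Jaikoz actually does):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class BatchSplitter {
    /** Walk a folder tree and split the audio files into fixed-size batches,
        so each batch can be processed without holding the whole library
        in memory at once. */
    public static List<List<Path>> splitIntoBatches(Path root, int batchSize)
            throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            List<Path> all = files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().toLowerCase().endsWith(".mp3"))
                    .collect(Collectors.toList());
            List<List<Path>> batches = new ArrayList<>();
            for (int i = 0; i < all.size(); i += batchSize) {
                batches.add(all.subList(i, Math.min(i + batchSize, all.size())));
            }
            return batches;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical root folder; 500 files per batch is arbitrary.
        for (List<Path> batch : splitIntoBatches(Paths.get("/music"), 500)) {
            System.out.println("Batch of " + batch.size() + " files");
        }
    }
}
```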

[quote=quaizywabbit]For very large collections, perhaps display a color-coded directory tree with checkboxes for each node in the base folder.
[/quote]
Some time ago I did think about displaying files in a tree table, with each node of the tree being a folder, but I haven't gone any further with it.

It is not processing that uses a lot of memory, it is loading all the file information into memory in the first place. This is the underlying problem I want to address by storing all the metadata in a database and only holding a subset in memory, but it is difficult to do this without affecting table response times. For example, if you have 50,000 tracks loaded and sort a column, every value in that column has to be compared to get the correct order, and that isn't going to work very well if I have to retrieve all 50,000 values from the database.

It does sound like a database is necessary to gather and retain current and proposed changes (between sessions), with committing to the files done last...

For 50,000-plus values it's going to be a bit sluggish anyway, but a database is the best tool for that job. I guess it all depends on which database system you choose, since each has its own strengths and weaknesses. Isn't sorting built into the database itself? I have a karaoke songbook database in Access with well over 100,000 entries (master listing) that sorts rather well... though album art would seem to pose the biggest headache, unless you store a path to the album art in the database.
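For instance, a hedged sketch of what such a table could look like (SQLite via the sqlite-jdbc driver; every table and column name here is my own invention, not anything from Jaikoz). The artwork stays on disk and the database only stores a pointer to it:

```java
import java.sql.*;

public class SchemaDemo {
    public static void main(String[] args) throws SQLException {
        // SQLite via JDBC; the schema below is purely illustrative.
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:library.db");
             Statement s = c.createStatement()) {
            s.execute("CREATE TABLE IF NOT EXISTS track ("
                    + " id INTEGER PRIMARY KEY,"
                    + " path TEXT NOT NULL UNIQUE,"   // location of the audio file
                    + " artist TEXT, album TEXT, title TEXT,"
                    + " mb_id TEXT,"                  // MusicBrainz id, filled in later
                    + " artwork_path TEXT)");         // pointer to cover art, not the bytes
            // An index on the columns users sort by keeps ORDER BY fast.
            s.execute("CREATE INDEX IF NOT EXISTS idx_artist ON track(artist)");
        }
    }
}
```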

I guess the trick here is the sequencing used to get the database populated, and whether you want to multi-thread while it is populating. I'm running a quad core with 4 GB of RAM; have you done any work optimizing for multi-core processors?

Just as an example: the main app is thread 0.

Population phase (staggered thread starts):
Thread 1 reads the folder/subfolder directory and writes it into the database.
Thread 2 reads database entries, goes to each folder/subfolder, and writes filenames into the database.
Thread 3 gets file locations from the database, reads tag info from each file, and writes it back to the database.

Generation phase:
By now thread 1 is done reading the folder structure and starts generating MusicBrainz IDs, writing to each database entry as it goes (this is where most memory is used and released).
Thread 2 starts the lengthy process of online MusicBrainz data retrieval, writing results as it goes.
Thread 3 finishes writing tag info and starts generating MusicBrainz IDs halfway down the folder tree.

I don't know if two threads can individually access MusicBrainz online data at the same time, but if they can, then either or both of threads 1 and 3 can go online to retrieve MusicBrainz tag data after finishing ID generation.

To avoid data corruption you have to keep threads from writing to the same entry or file at the same time (this you already know), so the idea is to stagger the load across multiple threads at multiple locations in the database and the filesystem simultaneously. Thread 0 could monitor progress and divide the database into sections for each thread to chew on...

To ease up on read/write access to the database, you could cache writes in memory and do them in blocks. In any case the file itself only remains in memory long enough to generate the MusicBrainz ID (and again to commit changes later); everything else is database access, which can be done with SQL.

For multi-core systems this would scream compared to the top-down approach; for older single-core machines, I guess it depends on how well they can multi-thread.
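A very rough Java sketch of that staggered pipeline, with queues handing work between stages (the stage bodies just print; a real version would write to the database, and the 5-second poll is a crude stand-in for a proper end-of-input signal):

```java
import java.nio.file.*;
import java.util.concurrent.*;

public class StaggeredPipeline {
    public static void main(String[] args) throws Exception {
        BlockingQueue<Path> found = new LinkedBlockingQueue<>();
        BlockingQueue<Path> tagged = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(3);

        // Thread 1: walk the folder tree and queue every file it finds.
        pool.submit(() -> {
            try (var files = Files.walk(Paths.get("/music"))) {
                files.filter(Files::isRegularFile).forEach(found::add);
            }
            return null;
        });

        // Thread 2: read tag info for each file, then pass it downstream.
        pool.submit(() -> {
            while (true) {
                Path p = found.poll(5, TimeUnit.SECONDS);
                if (p == null) break;                    // crude end-of-input signal
                System.out.println("read tags: " + p);   // stand-in for a DB write
                tagged.add(p);
            }
            return null;
        });

        // Thread 3: generate an id for each tagged file (placeholder work).
        pool.submit(() -> {
            while (true) {
                Path p = tagged.poll(5, TimeUnit.SECONDS);
                if (p == null) break;
                System.out.println("generate id: " + p);
            }
            return null;
        });

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```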

Food for thought... wish I knew how to do it myself...

Second example, taking into account physical hard drive access limits:

Population phase:
Main app = thread 0.

Thread 1 reads the directory tree and filenames and writes them to the database (in cached blocks) as it progresses, until complete.

Generation phase (which can start shortly after thread 1 has started populating the database):

Thread 2 reads the database and loads blocks of files at a time (writing existing tag data and generating MusicBrainz IDs), releasing files from memory when finished writing to the database. The entries in use by this thread are locked.

Thread 1 queries the database for entries that aren't locked and have null MusicBrainz ID values (up to the maximum block size) and begins generating IDs and writing tag values back to the database. These entries are also locked while in use.

The two threads go on processing entries in interlocked blocks until complete. The database is then populated with all existing info and ready to go online to retrieve data...

So long as threads 1 and 2 lock out access to the locations they're working on, the time lag in generating the MusicBrainz ID should be enough to keep hard drive activity from producing a bottleneck...

Retrieval phase: MusicBrainz IDs can be read in interlocked blocks by threads 1 and 2, each writing back results as they go.

The database is now ready to be manipulated as desired.
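As a hedged sketch of the interlocking, a worker could claim a block of rows by stamping them with its name inside one transaction (table and column names are invented; it also assumes each worker clears its locked_by marker when it finishes a block):

```java
import java.sql.*;
import java.util.*;

public class BlockClaimer {
    /** Claim up to blockSize unprocessed rows for one worker thread by
        marking them with its name, so two threads never grab the same rows.
        Assumes the worker sets locked_by back to NULL when its block is done. */
    public static List<Long> claimBlock(Connection c, String worker, int blockSize)
            throws SQLException {
        c.setAutoCommit(false);
        try (PreparedStatement claim = c.prepareStatement(
                "UPDATE track SET locked_by = ? WHERE id IN ("
              + "  SELECT id FROM track"
              + "  WHERE locked_by IS NULL AND mb_id IS NULL LIMIT ?)")) {
            claim.setString(1, worker);
            claim.setInt(2, blockSize);
            claim.executeUpdate();
        }
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement q = c.prepareStatement(
                "SELECT id FROM track WHERE locked_by = ?")) {
            q.setString(1, worker);
            try (ResultSet rs = q.executeQuery()) {
                while (rs.next()) ids.add(rs.getLong(1));
            }
        }
        c.commit();
        return ids;
    }
}
```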

MusicBrainz processing isn't the issue here.

I'm talking about the simple case where you want to load 100,000 rows and show all the columns for each row, approximately 50 of them. Now 50 columns * 100,000 rows * the data in each cell uses a lot of memory.

We could load all this information into a database instead of memory, but as we scroll down we need to retrieve the information from the database to display it, and this will be too slow.

We need to cache the data, so that as the user scrolls up and down, the rows just outside what they can see are already available. But then if they were to sort by a different column, or select a different value in the tag browser, we would see completely different rows and would have to get them from the database.

I'm sure it's possible to do, but it's not a simple process.
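For what it's worth, here is a minimal sketch of the caching idea as a Swing table model (the block size, the 8-block LRU cache, and the fetchBlock stub standing in for a real database query are all assumptions; a real version would also fetch off the event dispatch thread so scrolling never blocks):

```java
import javax.swing.table.AbstractTableModel;
import java.util.*;

public class CachedTableModel extends AbstractTableModel {
    private static final int BLOCK = 200;          // rows fetched per DB round trip
    private final int rowCount;
    // Tiny LRU cache of row blocks, keyed by block index.
    private final Map<Integer, String[][]> cache =
            new LinkedHashMap<Integer, String[][]>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<Integer, String[][]> e) {
                    return size() > 8;             // keep roughly 8 blocks in memory
                }
            };

    public CachedTableModel(int rowCount) { this.rowCount = rowCount; }

    public int getRowCount() { return rowCount; }
    public int getColumnCount() { return 50; }

    public Object getValueAt(int row, int col) {
        String[][] block = cache.computeIfAbsent(row / BLOCK, this::fetchBlock);
        return block[row % BLOCK][col];
    }

    /** Stand-in for "SELECT ... ORDER BY ... LIMIT 200 OFFSET blockIndex*200". */
    private String[][] fetchBlock(int blockIndex) {
        String[][] rows = new String[BLOCK][getColumnCount()];
        for (int r = 0; r < BLOCK; r++)
            for (int c = 0; c < getColumnCount(); c++)
                rows[r][c] = "r" + (blockIndex * BLOCK + r) + "c" + c;
        return rows;
    }
}
```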

The database (whichever you choose to use) does that for you. With a 50 x 100,000 grid, only a small fraction is visible at once; the database responds to your scroll, insert, and other SQL commands and returns data accordingly. Each type of database may have configuration for how much to cache to remain responsive, but a sort on that many records could take from 1 to 5 seconds, depending on which one you pick, how many tables are in it, and how the tables are related; but that's why there's the hourglass... I guess you need to figure out which types would work best for your programming language and patience...
Programmatically interfacing with the database, as well as with the grid itself, is no small task, I can imagine. My hat's off to you, sir...

I've got about 200 GB worth of junk to sort out and clean up. I don't really care how long it takes, so long as it gets there eventually...

I don't understand why a database approach is not possible. MediaMonkey uses SQLite and manages very fast response times, even though my music collection is very large.

Even when using Salling Clicker to browse my music on my phone (although in this case it is only pulling through one field), it does this in nearly zero response time.

When I browse my music I can sort by any column for the whole database, and then just scroll down, or select all the songs. SQL is a very clever database language, and I cannot believe that MediaMonkey is holding the full information on all my tracks in memory at any one moment; that would be impossible. It is just querying the database as it goes.
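That is essentially paging: the player only ever asks the database for the screenful it is about to draw. A minimal sketch against SQLite via JDBC (the table and columns are assumed; I have no knowledge of MediaMonkey's actual schema):

```java
import java.sql.*;

public class PageQuery {
    /** Fetch one screenful of rows, already sorted by the database. */
    public static void printPage(Connection c, int pageSize, int offset)
            throws SQLException {
        try (PreparedStatement ps = c.prepareStatement(
                "SELECT artist, album, title FROM track "
              + "ORDER BY artist LIMIT ? OFFSET ?")) {
            ps.setInt(1, pageSize);
            ps.setInt(2, offset);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next())
                    System.out.println(rs.getString(1) + " / "
                            + rs.getString(2) + " / " + rs.getString(3));
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:library.db")) {
            printPage(c, 40, 0);   // first visible screenful
        }
    }
}
```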

[quote=slawer]I don't understand why a database approach is not possible. MediaMonkey uses SQLite and manages very fast response times, even though my music collection is very large.
[/quote]
I didn't say it was impossible, just difficult. The main difference is that MediaMonkey only displays the most common fields in its table, whereas with Jaikoz you can display all 50 columns if you want.

Even with 50 columns (which for most users would require horizontal scrolling), only a finite number of rows can display on a screen at once. Vertical scrolling in real time, especially when you drag the scroll bar quickly, may not seem seamless, because the database engine has to respond to the scroll command (i.e. smooth scrolling stops and the view just updates when you stop).

Microsoft Access does this right out of the box...

You could gradually phase in the database by using it as a pointer to files that have been processed by Jaikoz (or not).

The real performance hit is that Jaikoz can't remember where it started or where it left off, so one has to pick the largest block of files that doesn't lock up your computer (and remember yourself where it leaves off).

The only other way I can imagine would be to process files one at a time, with a way to separate processed vs. unprocessed files (those that returned acceptable results), so the user can easily keep track of files that need serious attention...
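A hedged sketch of that pointer idea: one row per file recording whether it has been processed and when it was last modified, so a later run can skip files or re-queue ones that changed (the file_status table is my own invention and is assumed to already exist with a UNIQUE path column):

```java
import java.nio.file.*;
import java.sql.*;

public class StatusTracker {
    /** Record (or refresh) a file's processed state. "Processed" here means
        the file came back with acceptable results; the definition is up to
        the application, this is only an illustration. */
    public static void markProcessed(Connection c, Path file, boolean ok)
            throws Exception {
        long modified = Files.getLastModifiedTime(file).toMillis();
        try (PreparedStatement ps = c.prepareStatement(
                "INSERT INTO file_status(path, processed, last_modified) "
              + "VALUES (?, ?, ?) "
              + "ON CONFLICT(path) DO UPDATE SET "
              + "processed = excluded.processed, "
              + "last_modified = excluded.last_modified")) {
            ps.setString(1, file.toString());
            ps.setBoolean(2, ok);
            ps.setLong(3, modified);
            ps.executeUpdate();
        }
    }
}
```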

Have you ever considered using XML? I have a DJ program that uses an XML-based library...

It might be worth looking into, since it's a standard most if not all databases can import and export.

Export Library to XML is on the to-do list, but it's of limited use if we can't load the whole library into Jaikoz, and I don't think there is a standard schema defined. (Whether we use XML or a database to overcome the memory issues doesn't solve the table speed problems.)

I still don't understand where this table speed issue comes from...

The database is responsible for the speed. Tying (or more correctly, binding) the database to the user interface leaves you with choices as to how close to real time you want the scroll bars to work. It all depends on your strategy. You have a choice of how large a dataset to load into memory: too small, and you get a performance hit when you run out of rows; too large, and you're eating up resources on something that can't be displayed anyway.

Given the number of tasks that are performed on the data, I can see that converting these to database operations would be a formidable task, especially if you don't specialize in database programming...

On a large collection, it would be impossible to load the entire collection anyway. You would be able to sort and pre-process (or clean up) the data, then run it row by row retrieving acoustic IDs, loading and releasing each file as it went, writing each acoustic ID into the database.
Once all the acoustic IDs are written, you can turn it loose retrieving MusicBrainz data, and let it go autonomously until it's finished.
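In outline, that row-by-row pass might look like this (computeAcoustidFor is a made-up placeholder for whatever fingerprinting is really used; a real version might also use a separate connection for the updates):

```java
import java.sql.*;

public class AcousticIdPass {
    /** Walk the database row by row, fingerprint each file, write the id
        back, and move on, so memory stays flat no matter the library size. */
    public static void run(Connection c) throws SQLException {
        try (Statement read = c.createStatement();
             ResultSet rs = read.executeQuery(
                     "SELECT id, path FROM track WHERE acoustic_id IS NULL");
             PreparedStatement write = c.prepareStatement(
                     "UPDATE track SET acoustic_id = ? WHERE id = ?")) {
            while (rs.next()) {
                String id = computeAcoustidFor(rs.getString("path")); // placeholder
                write.setString(1, id);
                write.setLong(2, rs.getLong("id"));
                write.executeUpdate();
            }
        }
    }

    // Hypothetical fingerprinting step; the real one would load the audio,
    // fingerprint it, and release it before moving to the next row.
    private static String computeAcoustidFor(String path) {
        return Integer.toHexString(path.hashCode());
    }
}
```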

It's just a matter of picking a strategy that leaves no holes and gets the job done as quickly and thoroughly as possible, without constant user intervention.

With a database, it isn't necessary to commit ANY changes to the files until after the entire database has been processed. It can simply run autonomously. The user would then run operations on the database until he or she was happy with the results. The collection doesn't have to be loaded all at once into Jaikoz; it does have to be written in its entirety into the database, but not all at once. It can be done in logical stages that make the best use of the processing speed and the connection to MusicBrainz.

I'm sorry for being so pushy, Paul. This is the best tagging software I've found so far... I just want this to be the best it can be, as I'm sure you do as well...