Help wanted: Scraping XMs
BotB Academy Project Dev
Level 23 Chipist
post #106536 :: 2019.01.06 11:41am
  argarak, VinCMG, MiDoRi and anewuser liēkd this
Hi folks,

Recently I've been playing around with a new algorithm for encoding module data that isn't based on the traditional sequence/pattern approach, but rather on a stream/dictionary model. Initial tests are promising, but at this point I need a large test case to verify the algorithm's efficiency. I'm thinking of testing against something like 1000 XMs. Now here's where I need help. Obviously, manually downloading the files is a dumb idea. However, I don't know anything about web crawling and scraping. So I was wondering if anybody here could hack up a script that can batch download the files. They should be full songs (so OHBs aren't ideal), and should be varied in style, size, and channel count. Not sure if there are 1000 non-OHB XMs on BotB, so perhaps a visit to The Mod Archive would be a good idea. The script should work on Linux. Any takers?
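
For anyone unfamiliar with the stream/dictionary idea, here is a minimal Python sketch of one way it could look. This is only an illustration, not the algorithm described in the post: it assumes each row of the song is stored once in a dictionary and the song itself becomes a flat stream of dictionary indices. The function names and the toy note data are made up for the example.

```python
def dictionary_encode(rows):
    """Replace each row with an index into a table of unique rows."""
    dictionary = []  # unique rows, in order of first appearance
    index_of = {}    # row -> its position in the dictionary
    stream = []      # one dictionary index per input row
    for row in rows:
        if row not in index_of:
            index_of[row] = len(dictionary)
            dictionary.append(row)
        stream.append(index_of[row])
    return dictionary, stream


def dictionary_decode(dictionary, stream):
    """Rebuild the original row list from the dictionary and index stream."""
    return [dictionary[i] for i in stream]


# Toy 2-channel song (hypothetical data): one tuple per row, "---" = empty cell.
rows = [
    ("C-4", "---"),
    ("---", "---"),
    ("E-4", "C-2"),
    ("---", "---"),
    ("C-4", "---"),
    ("---", "---"),
]

dictionary, stream = dictionary_encode(rows)
print(len(dictionary), "unique rows; stream =", stream)
```

In this toy case six rows collapse to three dictionary entries plus a six-entry index stream. Whether that beats pattern/sequence storage on real modules is exactly what a large test set would have to show.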
Level 23 Chipist
post #106539 :: 2019.01.06 12:42pm
Modarchive has been hosting packs for a few years now.
Level 22 Pixelist
post #106563 :: 2019.01.07 8:49am :: edit 2019.01.07 8:50am
  raphaelgoulart, anewuser and irrlicht project liēkd this
Just batch download from there
with your FTP client of choice or even wget
Level 23 Chipist
post #106565 :: 2019.01.07 9:13am
  anewuser and MiDoRi liēkd this
Ah, excellent. Totally forgot that modland exists, too. Thanks MiDoRi. Modarchive's torrents are overkill for what I have in mind but these will do just fine.
Level 23 Chipist
post #107881 :: 2019.02.18 4:02pm
  MiDoRi, VirtualMan, puke7 and raphaelgoulart liēkd this
For those interested, here are the results of my little experiment:
Sorry for the crappy infographic, I'm no data analysis expert lol
Level 27 Hostist
post #107884 :: 2019.02.18 5:38pm
  VinsCool liēkd this
now i want to hear dictionary encoded remixes
Level 8 Playa
post #107887 :: 2019.02.18 7:40pm
This is a fucking interesting idea, because it never occurred to me, and it has utility. Thanks for the food for thought.
Level 27 Chipist
post #107901 :: 2019.02.19 12:15pm :: edit 2019.02.19 12:16pm
my guess is that dictionary compression would have greater benefits if patterns were stored in channel-major order instead of row-major order
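
A rough Python sketch of the row-major vs. channel-major distinction, for illustration only. The two patterns are made-up toy data, deliberately built so that the bass channel repeats while the lead changes; real modules may behave differently. The helper name `to_channel_major` is hypothetical.

```python
def to_channel_major(pattern):
    """Turn a list of rows (row-major) into a list of per-channel tracks."""
    return [tuple(track) for track in zip(*pattern)]


# Two toy 2-channel patterns: the lead (channel 0) differs, the bass (channel 1) repeats.
pattern_a = [("C-4", "C-2"), ("D-4", "---"), ("E-4", "C-2"), ("F-4", "---")]
pattern_b = [("G-4", "C-2"), ("A-4", "---"), ("B-4", "C-2"), ("C-5", "---")]
patterns = (pattern_a, pattern_b)

# Row-major dictionary: every distinct row is one entry.
unique_rows = {row for p in patterns for row in p}

# Channel-major dictionary: every distinct per-channel track is one entry.
unique_tracks = {track for p in patterns for track in to_channel_major(p)}

# Total cells each dictionary would have to store.
row_cells = sum(len(row) for row in unique_rows)
track_cells = sum(len(track) for track in unique_tracks)

print("row-major:", len(unique_rows), "entries,", row_cells, "cells")
print("channel-major:", len(unique_tracks), "entries,", track_cells, "cells")
```

Here no row repeats because the lead changes on every row, but the bass track is shared by both patterns, so the channel-major dictionary stores fewer cells. That only shows the effect exists on data shaped like this, not how large it is on real XMs.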
Level 23 Chipist
post #107902 :: 2019.02.19 1:00pm :: edit 2019.02.19 1:02pm
  Jangler liēkd this
Yes, the efficiency of pattern/sequence compression can undoubtedly be vastly improved by storing channels separately, and quite likely it will beat dictionary compression. The reason I didn't consider it for the test is that I had the specific requirements of 1-bit synthesis in mind. Decoding a multi-track sequence is inevitably going to be much slower and/or require more CPU registers than working with a combined sequence. Perhaps that's less true for memory-centric CPUs like the 6502 or 6809, but for the Z80 it is a problem.

Anyway, if there's any interest I can publish the test suite, so people can tweak the code to test other algorithms. It's just ~200 lines of Scheme code, plus a few lines of Julia for plotting the results. However, it uses a new XM parser library that I need to polish up a bit before the release.
Level 27 Chipist
post #107903 :: 2019.02.19 2:31pm
i would be interested in looking through the code. i'm sure other useful scheme programs exist, but i've never encountered one before :)
Level 23 Chipist
post #107990 :: 2019.02.21 12:32pm
  MiDoRi and Jangler liēkd this
'ere you go:
Usage notes are at the end.

In other news, my new XM parser lib (which is required to run the linked code) was accepted by the Chicken Scheme project today, yay!
