Saturday, October 04, 2008

DMOZ Local Directories and BatchDMake


This morning I finally settled in to do a long-pending weekend house-cleaning job, except that the cleaning was electronic, i.e. my computer. It is common to find your files badly scattered around the hard disk, in spite of whatever organization scheme we all try to implement. Years ago, I saw a nice scheme that Roshan, a friend of mine, had implemented. He created directories and sub-directories based on topics and categories, like "Science", "Computers", etc. Science would contain "Physics", "Chemistry" ... while Computers would contain "Programming", "Internet" and so on. It was a complete hierarchical structure, and I've been using it on my systems too ever since I saw it there.

But it's hard to maintain such a directory, because creating the structure is a rather tough task. You could either create the entire directory structure beforehand, with empty folders waiting to be filled, or you could create new folders as and when you need them, in the correct category of course. The latter option obviously seems more efficient, as you don't end up with a mammoth structure of empty folders, but with a dynamic structure that suits your taste. But after several years of trying out both, I have found that the first approach is much better. When you are, say, saving a file from the Internet, you often don't have the patience to ponder over the hierarchical relationship between "Geese Hunting" and "Duck Hunting", and whether you should create a new directory for each, or just club them into a single directory called "Bird Hunting". This inherent impatience of people when saving files, understandable and natural as it is, is THE reason why files end up scattered and chaotic. Thus I have concluded that people need an existing order to follow, one that is as extensive and accommodating as it can be, but also flexible enough to let people delete, modify and create topics and directories.

With this long discourse on the philosophy of file management in mind, I sat down to think! :-D

This is what I came up with: Why not use an existing MASSIVE topic-wise directory structure, which has been developed and nurtured for about a decade by tens of thousands of people across the world? That directory structure is the DMOZ Open Directory Project (http://www.dmoz.org), which is owned by Netscape (now part of AOL) and maintained by volunteer editors; the name DMOZ comes from its original domain, directory.mozilla.org. It forms the backbone of Google Directory and several other places on the internet, and is a collaborative and open effort at building efficient web directories. My interest, however, is not in the links to websites that it provides, but in its topic-based organizational structure. And as luck would have it, they offer a downloadable plain-text file containing the ENTIRE folder structure. Perfect. It's available at:
http://rdf.dmoz.org/rdf/categories.txt (57.4 MB)

Then I wrote a simple tool in Java called "BatchDMake", which reads a given text file line by line and creates the folders listed in it. Each folder path is listed on its own line. For example, this is from the DMOZ dump:

Arts
Arts/Movies
Arts/Music
Computers
Computers/Artificial Intelligence
Computers/Games


The syntax for operating it is:

java BatchDMake [filename] [root folder]

The filename parameter is the text file that lists the folders. The root folder is the folder, given relative to the current working directory, inside which the listed folders should be created. For example, using the downloaded DMOZ categories file:

java BatchDMake categories.txt dmozRoot

The source code for BatchDMake is available here, and the executable .class file is here.
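
For the curious, the core idea fits in a few lines of Java. What follows is NOT the actual BatchDMake source, just a minimal sketch of how such a tool might work. The class name BatchDirSketch and the method createAll are made up for this illustration, and I'm assuming the list file is UTF-8 with "/" separating the path parts, as in the sample above. It skips blank lines and any entry that would land outside the root folder, and it makes no attempt to handle names that your file system considers invalid.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not the real BatchDMake code.
public class BatchDirSketch {

    // Creates one directory under root for every non-blank line of listFile.
    // Returns the number of entries processed (existing folders count too).
    static int createAll(Path listFile, Path root) throws IOException {
        Path base = root.toAbsolutePath().normalize();
        int count = 0;
        try (BufferedReader reader =
                 Files.newBufferedReader(listFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String relative = line.trim();
                if (relative.isEmpty()) {
                    continue; // ignore blank lines
                }
                Path target = base.resolve(relative).normalize();
                if (!target.startsWith(base)) {
                    continue; // ignore entries that would escape the root folder
                }
                // Creates missing parent folders as well; no error if it already exists.
                Files.createDirectories(target);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BatchDirSketch [filename] [root folder]");
            return;
        }
        int n = createAll(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Processed " + n + " entries");
    }
}
```

Since Files.createDirectories also creates any missing parent folders, the lines in the list file do not have to appear in parent-before-child order.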

I just created the ENTIRE folder structure of DMOZ on my hard disk. It has about 770,000 directories and sub-directories. A lot of them are useless, so I'll go about pruning the tree to come up with the most useful structure for normal users. I will post that one soon. In the meantime, you could try the DMOZ structure and use BatchDMake to create it on your system!

Cheers!
Shashank
PS: I even found the place I'm currently in!
root\Regional\Asia\India\Tamil_Nadu\Districts\Vellore
This is like exploring an uncharted world on your hard disk! :-P
