Slashdot | Web Pages Are Weak Links in the Chain of Knowledge (quoting the Washington Post's On the Web, Research Work Proves Ephemeral). It's true that linkrot is a serious problem. It's also true that archive.org is only a partial solution, since it doesn't get everything and some big content providers — like the Washington Post — block it.
Is the only solution to make (copyright busting?) offline copies of everything? If so, where's the tool that will automate that for me, and — more importantly — index all that content on my drive, disk, or tape?
You’re looking at it. I’ve been asked on a couple of occasions to “nominate” things that should be cached for posterity. The problem is that few of us really know what’s worth saving. But I think the argument could be made that what we point to in our blogs is worth saving, especially at a collective level. (The counter-argument–that we end up with a lot of quizzes and re-hashed joke sites–could also be made, I suppose.)
I know that MTcache may still face some technical hurdles, but these can be overcome. Copyright hurdles also exist (as they do for Archive.org and for Google), but I think these can be addressed using robot exclusion and metadata. Assuming blogging isn’t just a flash-in-the-pan, all that remains is a gathering system that indexes the distributed caches. A popular topic is likely to be cached on far more systems than a less popular one. Even better, this could help to handle flash crowds (the “Slashdot Effect”) by distributing access.
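To make the robot-exclusion point a little more concrete: before storing a copy of a page, a cache could consult the site's robots.txt and skip anything the site has asked robots not to take. Here is a minimal sketch in Python; the "mtcache" user-agent string and the example URL are placeholders I've made up, not anything from an actual MTcache implementation.

```python
# Minimal sketch: honour robot exclusion before caching a page.
# The user-agent name "mtcache" and the example URL are placeholders.
import urllib.parse
import urllib.robotparser


def may_cache(url, user_agent="mtcache"):
    """Return True if the site's robots.txt allows this agent to fetch the URL."""
    parts = urllib.parse.urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetches and parses the site's robots.txt
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A site that blocks robots (as the Washington Post blocks archive.org)
    # would cause this to return False for the blocked paths.
    print(may_cache("http://www.example.com/some/article.html"))
```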
Actually, what I had in mind was something built into browsers that would cache stuff I read, or allow one-click caching of stuff I read to a private location. The issue is at least as important (if not more so) for material used in research and scholarly work as it is for blogging….
And when you say “cached locally” do you mean server-side? Because if you do, I’m fairly sure that goes beyond the bounds of fair use in many cases.
I have built such a beast. Basically it grabs your browser’s history and downloads the pages you have visited. It’s running on a server because my notebook doesn’t have enough hard disk space for such experiments. Searching in this archive is possible, although at the moment only via the command line. (A rough sketch of the general idea follows below.)
I share that installation with a few friends, and we are looking at it as a research project. We would love to make it available to others, but on the other hand we have no desire to do a thorough evaluation of the restrictions placed upon us by the various laws governing immaterial goods.
See http://blogs.23.nu/disLEXia/stories/1412/ and http://blogs.23.nu/c0re/stories/1928/
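For anyone curious how the basic idea fits together, here is a rough sketch in Python: read the URLs out of the browser’s history database and pull each page into a local archive directory. It assumes a Firefox-style places.sqlite history file; the file locations, archive layout, and hashing scheme are illustrative assumptions, and this is not the code behind the installation linked above.

```python
# Rough sketch: cache every page in the browser history to a local archive.
# Assumes a Firefox-style places.sqlite history database; paths are examples.
import hashlib
import pathlib
import sqlite3
import urllib.request

HISTORY_DB = pathlib.Path.home() / "places.sqlite"   # copy of the browser's history file
ARCHIVE_DIR = pathlib.Path.home() / "web-archive"    # where cached pages are stored


def visited_urls(db_path):
    """Yield the http(s) URLs recorded in the history database."""
    conn = sqlite3.connect(str(db_path))
    try:
        for (url,) in conn.execute("SELECT url FROM moz_places WHERE url LIKE 'http%'"):
            yield url
    finally:
        conn.close()


def cache_page(url, archive_dir):
    """Download one page and store it under a hash of its URL."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    target = archive_dir / name
    if target.exists():
        return  # already cached on an earlier run
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            target.write_bytes(response.read())
    except OSError:
        pass  # dead links are exactly what we expect to find; just skip them


if __name__ == "__main__":
    for url in visited_urls(HISTORY_DB):
        cache_page(url, ARCHIVE_DIR)
```

Command-line searching over an archive like this could then be as simple as running grep across the cached files.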
Well, that seems to do the job, but it looks like a pretty complex install, and not something one could do when one was just, say, a disempowered user on a university network (which is what I am at the office). What the world needs is something that is client-side but doesn’t, as you put it in your blog, degrade the user experience. Plus better indexing….
You might want to have a look at Agent Frank – http://www.decafbad.com/twiki/bin/view/Main/AgentFrank
The page also links to several similar projects at the bottom.