Web archiving
Web archiving is the process of collecting the Web, or particular portions of it, and ensuring the collection is preserved in an archive for future researchers, historians, and the public. Because of the size of the Web, web archivists typically employ web crawlers for automated collection. The largest web archiving organization is the Internet Archive, which strives to maintain an archive of the entire Web. National libraries, national archives and various consortia of organizations are also involved in archiving culturally important Web content.
Web archivists generally archive all types of web content including HTML web pages, style sheets, JavaScript, images, and video. They also archive metadata about the collected resources such as access time, MIME type, and content length. This metadata is useful in establishing authenticity and provenance of the archived collection.
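As an illustration of the metadata described above, the following minimal sketch builds a record holding the access time, MIME type, and content length of one captured resource. The function name `build_capture_record`, the record layout, and the example URL are assumptions made for this sketch and do not reflect the format used by any particular archive. The SHA-256 digest is an extra field, added here only as one plausible way to support later authenticity checks.

```python
import hashlib
from datetime import datetime, timezone


def build_capture_record(url, headers, body, access_time=None):
    """Return a metadata record for one captured resource (hypothetical layout)."""
    if access_time is None:
        access_time = datetime.now(timezone.utc)
    # The MIME type is the Content-Type value without parameters such as charset.
    content_type = headers.get("Content-Type", "application/octet-stream")
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return {
        "url": url,
        "access_time": access_time.isoformat(),
        "mime_type": mime_type,
        "content_length": len(body),
        # Assumption: a digest of the body, not mentioned in the text above.
        "sha256": hashlib.sha256(body).hexdigest(),
    }


record = build_capture_record(
    "http://example.org/",
    {"Content-Type": "text/html; charset=UTF-8"},
    b"<html><body>Hello</body></html>",
    access_time=datetime(2006, 1, 1, tzinfo=timezone.utc),
)
print(record)
```

The content length is measured from the bytes actually received rather than taken from the server's header, since the two can disagree.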
Crawlers
Web archivists typically use web crawlers to automate the process of collecting web pages. Web crawlers typically view web pages in the same manner that users with a browser see the Web. The Heritrix crawler is a popular tool used by many web archivists for making archive-quality crawls.
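The core step of any crawler is discovering further resources from a page it has already fetched. The sketch below illustrates that step using only the Python standard library. It is a simplified illustration and not a description of how Heritrix works; the class `LinkExtractor` and the function `extract_links` are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag/attribute pairs that point at hyperlinks or embedded resources.
LINK_ATTRIBUTES = {
    ("a", "href"),
    ("link", "href"),    # style sheets
    ("img", "src"),      # images
    ("script", "src"),   # JavaScript
}


class LinkExtractor(HTMLParser):
    """Collect absolute URLs referenced by one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in LINK_ATTRIBUTES:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="b.html">next</a><img src="/i.png">'
print(extract_links("http://example.org/a/", page))
```

A real crawler would add the returned URLs to a queue of pages still to be fetched, skipping any it has already visited.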
On-demand
There are numerous services that individuals may use to archive web resources "on-demand":
* WebCite, a service specifically for scholarly authors, journal editors and publishers to permanently archive and retrieve cited Internet references (Eysenbach and Trudel, 2005).
* Archive-It, a subscription service that allows institutions to build, manage and search their own web archive.
* hanzo:web, a personal web archiving service created by Hanzo Archives that can archive a single web resource, a cluster of web resources, or an entire website, as a one-off collection, a scheduled or repeated collection, an RSS/Atom feed collection, or an on-demand collection via Hanzo's open API.
* Spurl.net, a free on-line bookmarking service and search engine that allows users to save important web resources.
Crawlers
Web archives which rely on web crawling as their primary means of collecting the Web are influenced by the difficulties of web crawling:
* The robots exclusion protocol may request that crawlers not access portions of a website. Some web archivists may ignore the request and crawl those portions anyway.
* Large portions of a website may be hidden in the deep web. For example, the results page behind a web form lies in the deep web because a crawler cannot follow a link to the results page.
* Some web servers may return a different page for a web crawler than they would for a regular browser request. This is typically done to fool search engines into sending more traffic to a website.
* Crawler traps (e.g., calendars) may cause a crawler to download an infinite number of pages, so crawlers are usually configured to limit the number of dynamic pages they crawl.
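Two of the points in the list above, honoring the robots exclusion protocol and capping the number of dynamic pages, can be sketched as a single filter that a crawler consults before each fetch. This is a minimal sketch under stated assumptions: the class `CrawlFilter`, the user agent name `example-archiver`, the sample robots.txt content, and the rule that any URL with a query string counts as dynamic are all hypothetical choices made for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class CrawlFilter:
    """Decide whether a URL should be fetched (illustrative only)."""

    def __init__(self, robots_txt, user_agent="example-archiver",
                 max_dynamic_pages=2):
        self.user_agent = user_agent
        self.max_dynamic_pages = max_dynamic_pages
        self.dynamic_seen = 0
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())

    def should_fetch(self, url):
        # Respect the site's robots exclusion rules.
        if not self.robots.can_fetch(self.user_agent, url):
            return False
        # Crude heuristic (an assumption): a query string marks a dynamic page.
        if urlsplit(url).query:
            if self.dynamic_seen >= self.max_dynamic_pages:
                return False  # cap reached; guards against crawler traps
            self.dynamic_seen += 1
        return True


crawl_filter = CrawlFilter(ROBOTS_TXT)
for url in [
    "http://example.org/index.html",
    "http://example.org/private/notes.html",
    "http://example.org/calendar?month=1",
    "http://example.org/calendar?month=2",
    "http://example.org/calendar?month=3",
]:
    print(url, crawl_filter.should_fetch(url))
```

The cap here is global and very small so that the effect is visible in the example; a production crawler would more likely track limits per host and use a much larger threshold.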
The Web is so large that crawling a significant portion of it requires substantial technical resources. It also changes so quickly that portions of a website may change before a crawler has finished crawling it.
General limitations
Not only must web archivists deal with the technical challenges of web archiving, they must also contend with intellectual property laws. Peter Lyman (2002) states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web." Some publicly accessible web archives, such as WebCite's or the Internet Archive's, allow content owners to hide or remove archived content that they do not want the public to have access to. Other web archives are only accessible from certain locations or have regulated usage. WebCite also cites on its FAQ a lawsuit against Google's caching mechanism, which Google won.
* Archives
* Heritrix
* Internet Archive
* UK Web Archiving Consortium
* Web crawling
* WebCite
* International Internet Preservation Consortium
* WebArchivist
* Web archiving bibliography
* Web archiving programmes:
** Digital Archive of Chinese Studies
** European Archive
** Internet Archive
** Kulturarw3
** Minerva
** netarchive.dk
** Pandora
** Paradigma
** UK Government Web Archive
** UK Web Archiving Consortium
** WARP