Revised by Jonas Holmgren, 2008.
The following are the steps I took to set up the UK mirror of the Marxists Internet Archive (MIA). Please don't hesitate to email us if you are trying to set up a mirror and are having problems.
These instructions assume that you have shell access to a UNIX/Linux webserver that has rsync and Apache installed.
Root access makes it easier to tweak the Apache setup but it's not necessary.
Decide where on your file system you are going to keep the archive. For instance:
/web/marxists_uk/www
Download the archive using this command:
rsync -rlptzv --delete rsync://marxists.org/www/ /web/marxists_uk/www/
This will take a while, as over 30 GB will be downloaded the first time.
(The options -rlptzv are equivalent to -azv, except that owner and group are not preserved and device and special files are not copied; --delete removes local files that have been deleted from the MIA, so the mirror always matches the current archive.)
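Before starting the first download it can help to check the free disk space and preview the transfer. The following is only a sketch: the helper names free_kb, enough_space and preview_mirror are made up for illustration, and the paths are the example ones used above. The -n option is rsync's dry-run switch.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, paths are the examples from the text.
MIRROR_SRC="rsync://marxists.org/www/"
MIRROR_DEST="/web/marxists_uk/www/"

# free_kb DIR: print the free space, in kilobytes, on the file system holding DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# enough_space DIR NEEDED_KB: succeed if DIR has at least NEEDED_KB kilobytes free.
enough_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# preview_mirror: list what rsync would copy or delete, without changing anything.
preview_mirror() {
    rsync -rlptzvn --delete "$MIRROR_SRC" "$MIRROR_DEST"
}
```

For example, enough_space /web 31457280 succeeds only if /web has at least 30 GB (31457280 KB) free, and preview_mirror prints the file list that the real command would transfer.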
If you don't have write permissions on the httpd.conf file then you can skip this set of instructions.
This is what I added to httpd.conf:
# www.marxists.org.uk
<VirtualHost 195.10.230.120:80>
ServerName www.marxists.org.uk
ServerAlias marxists.org.uk
ServerAdmin chris@marxists.org.uk
DocumentRoot /web/marxists_uk/www
# A custom 404 page with option to go to the main MIA site
ErrorDocument 404 /cgi-uk/404
# Favicon for bookmarks in IE5.x
<Files favicon.ico>
ErrorDocument 404 /favicon.ico
</Files>
<Directory /web/marxists_uk/www>
# We want the minimum of options for security. Listing an option
# without +/- replaces any inherited ones, so FollowSymLinks is off.
Options Indexes
order allow,deny
allow from all
AllowOverride None
</Directory>
# There shouldn't be any links to /cgi-bin/ but we
# redirect them to the main site just in case.
Redirect /cgi-bin/ http://marxists.org/cgi-bin/
# Local cgi-bin for things like the 404 script
ScriptAlias /cgi-uk/ /web/marxists_uk/cgi-uk/
<Directory /web/marxists_uk/cgi-uk>
Options ExecCGI
order allow,deny
allow from all
AllowOverride None
</Directory>
CustomLog /var/log/apache/marxists_uk-access_log common
CustomLog /var/log/apache/marxists_uk-referer_log referer
CustomLog /var/log/apache/marxists_uk-agent_log agent
</VirtualHost>
If you want to keep it simple, you don't need the 404, favicon and local cgi-bin parts. Something like this should be OK:
# www.marxists.org.uk
<VirtualHost 195.10.230.120:80>
ServerName www.marxists.org.uk
ServerAdmin chris@marxists.org.uk
DocumentRoot /web/marxists_uk/www
<Directory /web/marxists_uk/www>
Options Indexes
order allow,deny
allow from all
AllowOverride None
</Directory>
</VirtualHost>
This is the simple 404 script I'm using:
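#!/usr/bin/perl
use strict;
use warnings;
# Escape HTML metacharacters so the requested URI cannot inject markup.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}
my $request_uri = defined $ENV{'REQUEST_URI'} ? $ENV{'REQUEST_URI'} : '/';
my $local_page  = escape_html($request_uri);
my $remote_page = escape_html("http://www.marxists.org" . $request_uri);
# The empty line after the header is required: it ends the CGI headers.
print "Content-Type: text/html; charset=iso-8859-1\n\n";
# Links are absolute so that they also work for missing pages in subdirectories.
print qq|<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Marxists Internet Archive - UK Mirror - 404 Error</title>
<link rel="stylesheet" type="text/css" href="/css/works.css">
</head>
<body>
<h1>Marxists Internet Archive</h1>
<h2>UK Mirror</h2>
<h3>404 Error - page not found</h3>
<p>I'm sorry, the page you were after, <strong>$local_page</strong>,
could not be found.</p>
<p>You could try the <a href="$remote_page">same file</a>,
<strong>$remote_page</strong>, on the main site.</p>
<p>Or <a href="/index.htm">return to the front page</a>.</p>
</body>
</html>
|;
exit;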
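A CGI script like this can be tried from the shell before Apache is pointed at it, by setting REQUEST_URI by hand. The sketch below uses a hypothetical helper called check_cgi and assumes the script prints a single Content-Type header; it checks that the header is followed by the empty line that ends the CGI headers.

```shell
#!/bin/sh
# Sketch only: check_cgi is a hypothetical helper, not part of Apache or Perl.
# check_cgi URI CMD [ARGS...]: run CMD with REQUEST_URI set to URI and verify
# that the output starts with a Content-Type header followed by an empty line.
check_cgi() {
    uri=$1
    shift
    output=$(REQUEST_URI="$uri" "$@") || return 1
    first=$(printf '%s\n' "$output" | sed -n '1p')
    second=$(printf '%s\n' "$output" | sed -n '2p')
    case $first in
        Content-Type:*) ;;
        *) echo "missing Content-Type header" >&2; return 1 ;;
    esac
    if [ -n "$second" ]; then
        echo "no empty line after the header" >&2
        return 1
    fi
    echo "ok"
}
```

With the layout used above, the call would be something like check_cgi /no/such/page.htm /web/marxists_uk/cgi-uk/404.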
There is no need to run rsync as root, so the crontab can be edited as a regular user.
Use the command crontab -e to open your crontab in an editor (generally vi), then add a line like this:
# run at quarter past one every morning
15 1 * * * rsync -rlptzv --delete rsync://mia.marxists.org/www/ /web/marxists_uk/www/
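If a nightly run could ever take longer than a day (the first sync, or a slow link), two rsync processes may end up working on the same directory. One way to avoid that is a small wrapper script called from cron instead of rsync itself. This is only a sketch: the lock path and the helper name run_locked are assumptions, not part of the original setup.

```shell
#!/bin/sh
# Sketch only: LOCK_DIR and run_locked are illustrative names.
LOCK_DIR="${LOCK_DIR:-/tmp/mia-mirror.lock}"

# run_locked CMD [ARGS...]: run CMD unless an earlier run still holds the lock.
# mkdir is atomic, so only one process can create the lock directory.
run_locked() {
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "previous run still in progress, skipping" >&2
        return 75
    fi
    "$@"
    status=$?
    rmdir "$LOCK_DIR"
    return "$status"
}

# In the wrapper script the last line would then be, for example:
# run_locked rsync -rlptz --delete rsync://mia.marxists.org/www/ /web/marxists_uk/www/
```

The -v option is dropped in that example so that cron does not mail the full file list every night. If the script is killed part-way through, the lock directory is left behind and has to be removed by hand.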
rsync can also connect over a remote shell instead of the rsync server; the -e option specifies which one to use in place of rsh, such as ssh. Here's an example that uses SSH to get a subdirectory branch:
rsync -azv -e ssh marxists@marxists.org:/www/mia/admin/janitor /tmp
If you want to mirror only part of the site using the rsync server on marxists.org, leave the subdirectory "mia" out of the rsync path. In that syntax "www" is the name of a module; it is also the name of a subdirectory on the server, which can cause confusion. The module "www" points to the directory "/www/mia", as can be seen from this chunk of /etc/rsyncd.conf:
[www]
comment = the MIA
path = /www/mia
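To make the module syntax concrete, here is a small sketch that builds the rsync URL for one part of the archive. The helper name module_url and the example path archive/marx are only illustrations. Note that "mia" never appears in the resulting URL, because the module already points at /www/mia.

```shell
#!/bin/sh
# Sketch only: module_url is a hypothetical helper.
# module_url SUBPATH: print the rsync URL for SUBPATH inside the "www" module.
module_url() {
    # Strip leading and trailing slashes so the pieces join cleanly.
    sub=$(printf '%s' "$1" | sed -e 's#^/*##' -e 's#/*$##')
    if [ -z "$sub" ]; then
        printf 'rsync://marxists.org/www/\n'
    else
        printf 'rsync://marxists.org/www/%s/\n' "$sub"
    fi
}
```

A partial mirror could then be fetched with something like rsync -rlptzv --delete "$(module_url archive/marx)" /web/marxists_uk/www/archive/marx/, which on the server side corresponds to /www/mia/archive/marx.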