How to set up a mirror of marxists.org

These are the steps I took to set up the UK mirror of the Marxists Internet Archive (MIA). Please don't hesitate to email us if you are trying to set up a mirror and are having problems.

Things you need

These instructions assume that you have shell access to a UNIX/Linux webserver that has rsync and Apache installed.

Root access makes it easier to tweak the Apache setup but it's not necessary.
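
A quick way to confirm the prerequisites is a small shell check like the sketch below. The function name need_tool is a hypothetical helper, not part of any tool. Only rsync is checked here, because the name of the Apache binary varies between systems (httpd, apache and apache2 are all common).

```shell
#!/bin/sh
# need_tool is a hypothetical helper: it succeeds if the named
# command can be found on the PATH, otherwise it prints a warning.
need_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing: $1" >&2
    return 1
}

# Check for rsync and report the result.
if need_tool rsync; then
    echo "rsync found"
fi
```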

Mirror the MIA using rsync

Decide where on your file system you are going to keep the archive. This is the directory I used for the UK mirror:

/web/marxists_uk/www/
    

Download the archive using this command:

rsync -vaz rsync://marxists.org/www/ /web/marxists_uk/www/
    

This will take a while, as there will be over 1 GB to download the first time.

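Adding -L to the options (rsync -vazL ...) causes symlinks to be dereferenced, that is, copied as the files or directories they point to. This is needed if the mirror's web server is set up not to follow symlinks.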
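
Before the first large download it can be useful to preview what rsync would transfer. The sketch below wraps the command from this section in a hypothetical function called mirror. The RSYNC variable and the "preview" argument are assumptions made for illustration, while --dry-run is a standard rsync option that lists the transfer without writing anything.

```shell
#!/bin/sh
# Source and destination as used in this howto. RSYNC can be
# overridden in the environment (an assumption for illustration).
RSYNC="${RSYNC:-rsync}"
SRC="rsync://marxists.org/www/"
DEST="/web/marxists_uk/www/"

# mirror is a hypothetical wrapper. "mirror preview" adds
# --dry-run, so rsync only lists what it would transfer.
mirror() {
    if [ "${1:-}" = "preview" ]; then
        "$RSYNC" -vazL --dry-run "$SRC" "$DEST"
    else
        "$RSYNC" -vazL "$SRC" "$DEST"
    fi
}
```

Run "mirror preview" first, then plain "mirror" once the file list looks right.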

Set up Apache

If you don't have write permissions on the httpd.conf file then you can skip this set of instructions.

This is what I added to httpd.conf:

# www.marxists.org.uk
<VirtualHost 195.10.230.120:80>
  ServerName www.marxists.org.uk
  ServerAlias marxists.org.uk
  ServerAdmin chris@marxists.org.uk
  DocumentRoot /web/marxists_uk/www
  # A custom 404 page with option to go to the main MIA site 
  ErrorDocument 404 /cgi-uk/404
    # Favicon for bookmarks in IE5.x
    <Files favicon.ico>
      ErrorDocument 404 /favicon.ico
    </Files>
    <Directory /web/marxists_uk/www>
      # We want the minimum of options for security. Listing only
      # Indexes turns everything else off, including FollowSymLinks.
      Options Indexes
      order allow,deny
      allow from all
      AllowOverride None
    </Directory>
  # There shouldn't be any links to /cgi-bin/ but we
  # redirect them to the main site just in case.
  Redirect /cgi-bin/ http://marxists.org/cgi-bin/
  # Local cgi-bin for things like the 404 script
  ScriptAlias /cgi-uk/ /web/marxists_uk/cgi-uk/
    <Directory /web/marxists_uk/cgi-uk>
      Options ExecCGI
      order allow,deny
      allow from all
      AllowOverride None
    </Directory>
  CustomLog /var/log/apache/marxists_uk-access_log common
  CustomLog /var/log/apache/marxists_uk-referer_log referer
  CustomLog /var/log/apache/marxists_uk-agent_log agent
</VirtualHost> 
    

You don't need the 404, favicon and local cgi-bin parts if you want to keep it simple; something like this should be OK:

# www.marxists.org.uk
<VirtualHost 195.10.230.120:80>
  ServerName www.marxists.org.uk
  ServerAdmin chris@marxists.org.uk
  DocumentRoot /web/marxists_uk/www
    <Directory /web/marxists_uk/www>
      Options Indexes
      order allow,deny
      allow from all
      AllowOverride None
    </Directory>
</VirtualHost>
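
After editing httpd.conf, Apache has to re-read its configuration. The sketch below is one cautious way to do that, assuming the control script is called apachectl (on some systems it is apache2ctl) and that you have the rights to run it, which usually means root. The function name reload_apache and the APACHECTL variable are hypothetical.

```shell
#!/bin/sh
# APACHECTL can be overridden; the control script may be named
# apachectl or apache2ctl depending on the system (assumption).
APACHECTL="${APACHECTL:-apachectl}"

# reload_apache is a hypothetical helper: it restarts Apache
# gracefully only if the configuration passes the syntax check.
reload_apache() {
    "$APACHECTL" configtest && "$APACHECTL" graceful
}
```

Checking the syntax first means a typo in the new VirtualHost block is reported before the running server is touched.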
    

This is the simple 404 script I'm using:

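#!/usr/bin/perl
use strict;
use warnings;

# REQUEST_URI is supplied by the client, so escape HTML
# metacharacters before echoing it back into the page.
my $local_page    = defined $ENV{'REQUEST_URI'} ? $ENV{'REQUEST_URI'} : '';
$local_page       =~ s/&/&amp;/g;
$local_page       =~ s/</&lt;/g;
$local_page       =~ s/>/&gt;/g;
$local_page       =~ s/"/&quot;/g;
my $remote_page   = "http://www.marxists.org" . $local_page;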


print qq|Content-Type: text/html; charset=iso-8859-1


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
   <title>Marxists Internet Archive - UK Mirror - 404 Error</title>
   <link rel="stylesheet" type="text/css" href="/css/works.css">
</head>
<body> 

<h1>Marxists Internet Archive</h1>

<h2>UK Mirror</h2>

<h3>404 Error - page not found</h3>

<p>I'm sorry, the page you were after, <strong>$local_page</strong>,
could not be found.</p>

<p>You could try the <a href="$remote_page">same file</a>, 
<strong>$remote_page</strong>, on the main site.</p>

<p>Or <a href="/index.htm">return to the front page</a>.</p>

</body>
</html>
|;

exit;
    

Using cron to automate updates

There is no need to run rsync as root, so the crontab can be edited as a regular user.

Use the command crontab -e to open the crontab in an editor (generally vi), then add a line like this:

# run at quarter past one every morning
15 1 * * *      rsync -vazL rsync://mia.marxists.org/www/ /web/marxists_uk/www/
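
If a nightly run ever takes longer than a day (the first sync, or a slow link), two rsync processes could end up writing to the same directory. The sketch below guards against that with a lock directory. The function name run_locked and the lock path are assumptions for illustration; the approach relies only on mkdir being atomic.

```shell
#!/bin/sh
# run_locked is a hypothetical helper: it runs a command only if
# it can create the lock directory, so runs never overlap.
# mkdir is atomic, which makes it usable as a simple lock.
run_locked() {
    lockdir="$1"
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 1
    fi
    status=0
    "$@" || status=$?
    rmdir "$lockdir"
    return "$status"
}

# Example call (the lock path is an assumption):
# run_locked /tmp/mia-mirror.lock \
#     rsync -azL rsync://mia.marxists.org/www/ /web/marxists_uk/www/
```

Saved as a script, this could be called from the crontab line in place of the bare rsync command. If the script is killed part-way through, the lock directory is left behind and has to be removed by hand.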
    

Notes

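rsync can also run over a remote shell instead of talking to the rsync server. The -e option specifies the shell to use in place of rsh, such as ssh. Here's an example that uses SSH to fetch a single subdirectory; note that over SSH the path is the real path on the server's file system.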

rsync -avz -e ssh marxists@marxists.org:/www/public_html/admin/janitor /tmp
    

If you want to mirror only part of the site from the rsync server on marxists.org, leave "public_html" out of the path. "www" is the name of an rsync module, which is easily confused with the subdirectory of the same name. The module "www" points to the directory /www/public_html, as can be seen from this excerpt of /etc/rsyncd.conf:

[www]
comment = the MIA
path = /www/public_html
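
To make the mapping concrete, the sketch below converts a path on the server's file system into the corresponding rsync server URL, using the module name and path from the excerpt above. The function name module_url is hypothetical and exists only to illustrate the substitution.

```shell
#!/bin/sh
# Values taken from the rsyncd.conf excerpt above.
MODULE="www"
MODULE_PATH="/www/public_html"

# module_url is a hypothetical helper: it turns a path on the
# server's file system into the matching rsync server URL by
# replacing the module's path with the module's name.
module_url() {
    case "$1" in
        "$MODULE_PATH"/*)
            rest=${1#"$MODULE_PATH"/}
            echo "rsync://marxists.org/$MODULE/$rest"
            ;;
        *)
            echo "not inside module $MODULE: $1" >&2
            return 1
            ;;
    esac
}

# prints rsync://marxists.org/www/admin/janitor
module_url /www/public_html/admin/janitor
```

So the subdirectory fetched over SSH in the earlier example would be requested from the rsync server as rsync://marxists.org/www/admin/janitor, with no "public_html" in the path.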
    

Chris, 10th August 2004.