MPDL IT Infrastructure/backup policies

=Backup strategies/policies=

Two backup measures run on all our servers in Göttingen, and one runs on all servers in Garching. We have no servers in Munich yet, so that site is left out here.
 * 1) backup via Tivoli to a tape library (provided by GWDG/RZG)
 * 2) backup via rsync to remote disk space over NFS (also provided by GWDG)

==Backup==
1. The installation of pubman.mpdl.mpg.de is distributed across three servers:
 * the application (pubman.ear) runs on pubman.mpdl.mpg.de, i.e. srv03.mpdl.mpg.de
 * the escidoc-framework for pubman.mpdl.mpg.de runs on srv02.mpdl.mpg.de
 * the fedora-instance and the postgresql-databases for escidoc-core and fedora run on srv01.mpdl.mpg.de

2. Backups made on srv03:
 * via rsync, the whole of /usr/share/jboss is backed up incrementally every night at around 4:25 am
 * via Tivoli, the essential part of the local filesystem is backed up (this also includes JBoss)

3. Backups made on srv02:
 * via rsync, the whole of /usr/share/jboss is backed up incrementally every night at around 4:25 am
 * via Tivoli, the essential part of the local filesystem is backed up (this also includes JBoss)
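
As an illustration of the nightly rsync run on srv03 and srv02, the following is a minimal sketch that only builds the command line and a matching crontab entry. The target paths `/_SRV03/sys` and `/_SRV03/.old-saved` are assumptions modelled on the `/_SRV01` layout mentioned under Recovery, and the script name in the crontab line is hypothetical. `--backup` and `--backup-dir` are standard rsync options that move replaced or deleted files aside instead of discarding them; whether the real job uses them is not documented here.

```python
import shlex


def build_rsync_command(source, target_root, dry_run=False):
    """Return the argv for one incremental rsync run.

    The current copy goes below <target_root>/sys; files that rsync would
    overwrite or delete are moved to <target_root>/.old-saved instead.
    """
    argv = [
        "rsync",
        "-a",          # archive mode: keep permissions, times, symlinks
        "--relative",  # recreate the full source path below the target
        "--delete",    # mirror deletions made on the source
        "--backup",    # keep replaced or deleted files ...
        f"--backup-dir={target_root}/.old-saved",  # ... in this directory
    ]
    if dry_run:
        argv.append("--dry-run")  # only report what would be transferred
    argv += [source.rstrip("/") + "/", f"{target_root}/sys/"]
    return argv


def crontab_line(minute, hour, command):
    """Format a crontab entry that runs `command` daily at hour:minute."""
    return f"{minute} {hour} * * * {command}"


# Assumed target root; the real mount point for srv03 may differ.
print(shlex.join(build_rsync_command("/usr/share/jboss", "/_SRV03")))
# Hypothetical wrapper script name.
print(crontab_line(25, 4, "/root/bin/rsync-backup.sh"))
```

Running the command once with `dry_run=True` is a cheap way to check what a run would change before it is scheduled.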

4. Backups made on srv01:
 * via Tivoli, the whole of /X/fedora is backed up (all changes from the last 3 months can be retrieved)
 * via LVM snapshot and rsync, the whole postgres data directory (/var/lib/pgsql) is backed up
 * in addition, the whole of /X/fedora is stored on a 2 TB SAN LUN
 * via script/crontab, a nightly dump of the escidoc-core database is written at 4 am to the local disk under /data/backup/, which is in turn backed up via Tivoli; only the last 2 dumps are kept (atime +2), otherwise it would be too much data
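
The retention rule for the nightly dumps (atime +2) can be sketched as follows. This is an illustration under assumptions, not the actual cleanup script: the function name is made up, and it mimics the way `find -atime +2` counts, where the age is truncated to whole days, so a file only matches once it has gone at least three full days without being accessed.

```python
import os
import time

SECONDS_PER_DAY = 86400


def dumps_to_delete(directory, days=2, now=None):
    """Return the files in `directory` that `find -atime +<days>` would match.

    The age is truncated to whole days before comparing, as find does.
    """
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        age_days = int((now - os.stat(path).st_atime) // SECONDS_PER_DAY)
        if age_days > days:
            stale.append(path)
    return stale


# Usage on the dump directory named above (deletes for real, so try it
# on a copy first):
#     for path in dumps_to_delete("/data/backup"):
#         os.remove(path)
```

Because the rule depends on access time, anything that reads the dump files (including a backup client) can reset the clock, which is worth keeping in mind when checking how many dumps are actually left on disk.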

==Workaround==

 * /root/bin/test-watch.sh watches whether JBoss is shut down.
 * /root/bin/snapshot.sh shuts down PostgreSQL, Fedora and Coreservice.
 * It then creates a snapshot and starts PostgreSQL and Fedora again.
 * If JBoss is down, test-watch.sh waits a short while and then starts Coreservice again.
 * Half an hour later, the backup script /root/bin/snapshot-backup.sh runs and deletes the snapshot after the backup.
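
The ordering of the snapshot step can be sketched like this. It is not the content of /root/bin/snapshot.sh, which is not reproduced on this page; the volume and snapshot names and the service commands are placeholders. The `lvcreate` and `lvremove` options used are standard LVM ones. The point of the sketch is that the services are started again even if creating the snapshot fails, so the downtime stays limited to the snapshot itself.

```python
def lvm_snapshot_commands(origin, name, size):
    """Return (create, remove) argv lists for an LVM snapshot of `origin`."""
    volume_group = origin.rsplit("/", 1)[0]
    create = ["lvcreate", "--snapshot", "--size", size, "--name", name, origin]
    remove = ["lvremove", "-f", f"{volume_group}/{name}"]
    return create, remove


def take_snapshot(run, stop_cmds, create_cmd, start_cmds):
    """Stop services, snapshot, then restart services even on failure."""
    for cmd in stop_cmds:
        run(cmd)
    try:
        run(create_cmd)
    finally:
        for cmd in start_cmds:
            run(cmd)


# Dry run: collect the commands instead of executing them. Passing
# subprocess.check_call as `run` would execute them for real.
# Volume, snapshot name and size are hypothetical.
create, remove = lvm_snapshot_commands("/dev/vg0/pgdata", "pgsnap", "5G")
planned = []
take_snapshot(
    planned.append,
    stop_cmds=[["service", "postgresql", "stop"]],    # placeholder command
    create_cmd=create,
    start_cmds=[["service", "postgresql", "start"]],  # placeholder command
)
for cmd in planned:
    print(" ".join(cmd))
```

The `remove` command returned alongside `create` corresponds to the last bullet above: it would be run by the backup script once the copy from the snapshot has finished.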

==Recovery==
To get a consistent result from a recovery, follow this procedure:
 * 1) Make sure that the system is stable (i.e. the local file system and disks are running without errors).
   * If some local files are lost, restore them from the Tivoli backup; there is an example script for this case in /opt/tivoli/tsm/client/ba/bin/restore.sample.sh
   * If you need to get them from the rsync backup instead, mount the rsync target again (mount /_SRV01) and take what you need from /_SRV01/sys, or from /_SRV01/.old-saved if you need older versions.
 * 2) Make sure that the postgres database engine is running. MAKE SURE FEDORA ON srv01 AND JBOSS ON srv02 AND JBOSS ON srv03 ARE NOT RUNNING!
 * 3) Drop the riTriples database in postgres with DROP DATABASE "riTriples"; and create it again with CREATE DATABASE "riTriples" WITH ENCODING='SQL_ASCII' OWNER="fedoraAdmin";
 * 4) Rebuild the fedora database with the fedora-rebuild tool /X/fedora/server/bin/fedora-rebuild.sh: choose option 2 and say yes (1).
 * 5) Rebuild the fedora TripleStore with the fedora-rebuild tool /X/fedora/server/bin/fedora-rebuild.sh: choose option 1 and say yes (1). (The riTriples database has already been emptied in step 3.)
 * 6) Start fedora: /X/fedora/tomcat/bin/run.sh
 * 7) Start the JBoss on srv02 and run a complete recache and reindex procedure for the escidoc-core framework.
 * 8) Start the JBoss on srv03 (pubman).
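
Two parts of the procedure above lend themselves to a small sketch: the precondition that postgres is up while fedora and both JBoss instances are down, and the SQL for dropping and re-creating the riTriples database. The function names and the `service@host` labels are made up for illustration, and how the running state of each service is determined is left to the caller.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def recreate_database_sql(database="riTriples", owner="fedoraAdmin",
                          encoding="SQL_ASCII"):
    """Return the two statements of the drop/re-create step."""
    return [
        f"DROP DATABASE {quote_ident(database)};",
        f"CREATE DATABASE {quote_ident(database)} "
        f"WITH ENCODING='{encoding}' OWNER={quote_ident(owner)};",
    ]


def check_preconditions(running):
    """Raise unless postgres is up and fedora and both JBoss are down.

    `running` maps a label such as "jboss@srv02" to True or False.
    A missing entry is treated as a problem, to stay on the safe side.
    """
    must_run = ["postgres@srv01"]
    must_be_stopped = ["fedora@srv01", "jboss@srv02", "jboss@srv03"]
    problems = [f"{s} is not running"
                for s in must_run if not running.get(s, False)]
    problems += [f"{s} is still running (or its state is unknown)"
                 for s in must_be_stopped if running.get(s, True)]
    if problems:
        raise RuntimeError("; ".join(problems))


for statement in recreate_database_sql():
    print(statement)
```

The printed statements can be pasted into a psql session opened by a role that is allowed to drop and create databases.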

A remark on the procedure: to make sure that no inconsistencies between fedora and escidoc-core remain as a leftover of the problem that made you start the recovery, it is NOT possible to shortcut the complete fedora-rebuild/fedora-reindex/escidoc-core-recache/escidoc-core-reindex procedure.