


       G O O D   B A C K U P   P R A C T I C E    S H O R T   G U I D E



I - PRESENTATION
-------------------
This short guide gathers important (and somewhat obvious) techniques for
computer backups. It also explains the risks you take by not following
these principles. I thought all this was obvious and well known to everyone,
until recently, when I started getting feedback from people complaining
about data lost because of bad media or for other reasons. To the question
"have you tested your archive?", I was surprised to get negative answers.

This guide is no more tied to dar than to any other tool, so you can benefit
from reading it if you are not sure of your backup procedure, whatever
backup software you use.


II - NOTIONS
--------------
In the following we will speak about backups and archives.

By backup, I mean a copy of some data that remains in place in an operational
system.

By archive, I mean a copy of data that is afterward removed from an
operational system. It stays available but is no longer used frequently.

With this meaning of archive, you can also make a backup of an archive (for
example a clone copy of your archive).


ARCHIVES
---------
1. The first thing to do just after making an archive is to test it on its
definitive medium.

There are several reasons that make this testing important:
- any medium may have a surface error, which in some cases cannot be detected
  at writing time.
- the software you use may have bugs (dar can too, yes ;-) ).
- you may have made a mistake or missed an error message (no space left to
  write the whole archive, and so on), especially when using poorly written
  scripts.

Of course the archive must be tested once it has been put in its definitive
place (CD-R, floppy, tape, etc.). If you have to move it (copy it to another
medium), you need to test it again on the new medium. The testing operation
must read/test all the data, not just list the archive contents (-t option
instead of -l option for dar). And of course the archive must have at least
a minimal mechanism to detect errors (dar has one without compression, and
two when using compression).
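
When the archive was first written to a hard disk and then copied to its
definitive medium, a full read-back can be sketched as below. This is only an
illustration in Python, not what dar's -t option does internally, and it does
not replace the archive tool's own test. The function names are hypothetical.
Note that the operating system may serve a freshly written file from its
cache, so the medium should be ejected or remounted before the check.

```python
import hashlib


def file_digest(path, block_size=1024 * 1024):
    """Read the whole file block by block and return its SHA-256 hex digest.

    Any I/O error reported by the medium propagates as OSError, which is
    precisely what a full read test is meant to reveal.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()


def copy_is_intact(reference_path, copy_path):
    """Return True when the copy on the final medium matches the reference."""
    return file_digest(reference_path) == file_digest(copy_path)
```

Listing the contents would not read the data itself; this check reads every
byte, which is the property asked for above.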

2. As a replacement for testing, a better operation is to compare the files in
the archive with the original files on the disk (-d option for dar). This
does the same as testing archive readability and coherence, while also
checking that the data is really identical, whatever corruption detection
mechanisms are used. This operation is not suited to a set of data that
changes (like an active system backup), but is probably what you need when
creating an archive.

3. To increase the degree of security further, the next thing to try is to
restore the archive in a temporary place or, better, on another computer.
This lets you check from end to end that you have a good, usable backup on
which you can rely.
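
Once an archive has been restored to a temporary place, the restored tree can
be compared with the original one. The following Python sketch is tool
independent; the function name is hypothetical, and it only compares file
contents (not permissions, ownership or dates). It assumes the original data
has not changed since the archive was made.

```python
import filecmp
import os


def compare_trees(original_root, restored_root):
    """Return the sorted relative paths of files that are missing from the
    restored tree or whose content differs from the original."""
    problems = []
    for folder, _subdirs, files in os.walk(original_root):
        for name in files:
            original = os.path.join(folder, name)
            relative = os.path.relpath(original, original_root)
            restored = os.path.join(restored_root, relative)
            if not os.path.isfile(restored):
                problems.append(relative)
            # shallow=False forces a byte by byte comparison of the contents
            elif not filecmp.cmp(original, restored, shallow=False):
                problems.append(relative)
    return sorted(problems)
```

An empty list means every original file was found in the restored tree with
identical content.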

4. Unfortunately, many (all) media deteriorate with time, and an archive that
was properly written on a good medium may become unreadable with time and/or
under bad environmental conditions. So take care not to store magnetic media
near magnetic sources (like HiFi speakers) or enclosed in metallic boxes, and
avoid direct sunlight on your CD-R(W), DVD-R(W), etc. Humidity is also a
concern for many media: respect the acceptable humidity range for each medium
(don't store your data in your bathroom, kitchen, cellar, ...). The same goes
for temperature. More generally, have a look at the safe environmental
conditions described in the documentation, at least once for each media type.

The problem with archives is that you usually need them for a long time,
while the media have a limited lifetime. A solution is to make one (or
several) copies (i.e. a backup of the archive) of the data when the original
medium has reached half of its expected life.

Another solution is to use Parchive. It works on the principle of RAID disk
systems, creating beside each file a par file which can be used later to
recover missing or corrupted parts of the original file. Of course, Parchive
can work on dar's slices. But it requires more storage, so you will have to
choose a smaller slice size to leave room for the Parchive data on your CD-R,
for example. The amount of data generated by Parchive depends on the
redundancy level (Parchive's -r option). Check the NOTES file for more
information about using Parchive with dar. When using a read-only medium, you
will need to copy the corrupted file to a read-write medium before Parchive
can repair it. Unfortunately the usual 'cp' command stops at the first I/O
error it meets, leaving you unable to get the sane data located *after* the
corruption. In most cases you will then not have enough sane data for
Parchive to repair your file. For that reason, "dar_cp" is a cp-like command
that skips over the corruption and can copy the sane data located after the
corrupted part.
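
The idea of copying past an I/O error can be sketched as follows. This is a
minimal Python illustration of the principle, not dar_cp's actual
implementation: the function name, the block size and the zero filling of
unreadable blocks are assumptions of this sketch. It relies on os.pread,
which is only available on POSIX systems.

```python
import os


def copy_skipping_errors(src_path, dst_path, block_size=4096,
                         read_block=os.pread):
    """Copy src to dst, replacing each unreadable block by zeros.

    Returns the offsets of the blocks that could not be read. `read_block`
    has the signature of os.pread(fd, length, offset); it is a parameter
    only so that read failures can be simulated.
    """
    skipped = []
    size = os.path.getsize(src_path)
    src = os.open(src_path, os.O_RDONLY)
    try:
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < size:
                length = min(block_size, size - offset)
                try:
                    data = read_block(src, length, offset)
                except OSError:
                    # Keep the following data at its original offset.
                    data = b"\0" * length
                    skipped.append(offset)
                if not data:
                    break  # unexpected end of file
                dst.write(data)
                offset += len(data)
    finally:
        os.close(src)
    return skipped
```

Keeping every readable block at its original offset matters here, because a
repair tool can only reuse the sane data if it sits where it is expected.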

5. Another problem arises when an archive is read often. Frequent reading
degrades the medium little by little and shortens its lifetime. A possible
solution is to have two copies: one for reading and one kept as a backup,
which should never be read except to make a new copy. Chances are that the
often-read copy will "die" before the backup copy; you can then make a new
backup copy from the original backup copy, which in turn can become the new
"often read" medium.

6. Of course, if you want to have an often-read archive and also want to keep
it forever, you can combine the two previous techniques, making two copies,
one for reading and one for backup. Once a certain time has elapsed (half the
medium's lifetime, for example), you can make a new copy and keep it beside
the original backup copy, just in case.

7. Another problem is the safety of your data. In some cases, the archive you
have does not need to be kept for a very long time, nor does it need to be
read often, but it is very "precious". In that case a solution could be to
make several copies and store them in very different locations. This could
prevent data loss in case of fire or other disasters.

8. Another aspect is the privacy of your data. An archive should not
necessarily be accessible to everyone. This aspect is a bit outside the scope
of this document, but several approaches are possible:

  - physically restricting access to the archive (stored in a bank or
    locked place, for example)
  - hiding the archive (in your garden ;-) ) or hiding the data among other
    data (Edgar Poe's hidden letter technique)
  - encrypting your archive
  And probably some other ways I forgot.

For encryption, dar now provides strong encryption inside the archive
(blowfish algorithm); it preserves the direct access feature that saves you
from having to read the whole archive to restore just one file. But you can
also use an external encryption mechanism, like GnuPG, to encrypt slice by
slice.


BACKUP
--------
Backups act a bit like archives, except that they are a copy of a changing
set of data, which is moreover expected to stay in its original location (the
system).

The fact that the data is changing introduces two problems:
- a backup is almost never up to date, and you will probably lose data if you
  have to rely on it
- a backup soon becomes obsolete.

A backup also has the role of keeping a recent history of changes. For
example, you may have deleted some precious data from your system, and it is
quite possible that you notice this mistake long after the deletion. In that
case, an old backup remains useful, despite the existence of more recent
backups.

As a consequence, backups need to be done often, to keep the delta minimal in
case of a disk crash. But a new backup does not mean that older ones can be
removed. A usual way of handling this is to have a set of media over which
you rotate the backups: the new backup is written over the oldest backup of
the set. This way you keep a certain history of your system's changes. It is
your choice to decide how many backups you want to keep, and the frequency of
your backups.
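
The rotation rule (write the new backup over the oldest one of the set) can
be sketched in a few lines of Python. The function name and the media labels
are hypothetical, and the choice to use a still unused medium first is an
assumption of this sketch.

```python
from datetime import date


def medium_to_overwrite(media):
    """Pick the medium that receives the next backup.

    `media` maps a medium label to the date of the backup it holds, or to
    None when the medium has not been used yet. Unused media come first,
    otherwise the medium holding the oldest backup is chosen.
    """
    unused = sorted(label for label, written in media.items() if written is None)
    if unused:
        return unused[0]
    return min(media, key=lambda label: media[label])


media = {
    "tape-1": date(2024, 3, 4),
    "tape-2": date(2024, 3, 18),
    "tape-3": date(2024, 3, 11),
}
print(medium_to_overwrite(media))  # tape-1
```

With three media and one backup a week, this keeps about three weeks of
history. A larger set keeps more.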

A point that can extend the history while reducing the media space required
by each backup is the differential backup. A differential backup is a backup
of only what has changed since a previous backup (the "backup of reference").
The drawback is that it is not autonomous and cannot be used alone to restore
a full system. Thus there is no problem in keeping the differential backup on
the same medium as the backup of reference.

Doing a lot of consecutive differential backups (taking the last backup as
reference for the next differential backup) will save media space, but will
cost time at restoration in case of a computer accident. You will have to
restore the full backup (of reference), then restore, in order, all the
differential backups you have made up to the last one. This implies that you
must keep all the differential backups you have made since the backup of
reference.
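
The restoration order of such a chain can be sketched as follows, in Python.
The function name, the backup names and the way a catalogue is represented
(a mapping from each backup to its backup of reference) are assumptions made
for the illustration only.

```python
def restore_order(references, target):
    """Return the backups to restore, oldest first, to reach `target`.

    `references` maps each backup name to the name of its backup of
    reference, or to None for a full backup. A missing link raises KeyError,
    because the later differential backups are then unusable.
    """
    chain = []
    current = target
    while current is not None:
        if current not in references:
            raise KeyError(f"backup {current!r} is missing, the chain is broken")
        if current in chain:
            raise ValueError(f"backup {current!r} refers to itself through a loop")
        chain.append(current)
        current = references[current]
    chain.reverse()
    return chain


catalogue = {"full": None, "monday": "full", "tuesday": "monday"}
print(restore_order(catalogue, "tuesday"))  # ['full', 'monday', 'tuesday']
```

Removing "monday" from the catalogue makes the call fail, which illustrates
why every differential backup since the backup of reference must be kept.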

It is thus up to you to decide how many differential backups you make, and
how often you make a full backup. A common scheme is to make a full backup
once a week and a differential backup on each other day of the week. The
backups made in a week are kept together. You could then have ten sets of
full+differential backups, where a new full backup would erase the oldest
full backup as well as its associated differential backups. This way you keep
a ten-week history with a backup every day, but this is just an example.
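
This example scheme can be written down as a small Python function. It is
only a sketch: the function name is hypothetical, and making the full backup
on Sunday is an assumption, since any fixed day of the week would do.

```python
from datetime import date, timedelta


def backup_plan(day, number_of_sets=10):
    """Return (kind, set_index) for the backup to make on `day`.

    Assumption of this sketch: the weekly full backup is made on Sunday and
    the six following days get a differential backup in the same set.
    """
    days_since_sunday = (day.weekday() + 1) % 7  # weekday() gives 0 for Monday
    week_start = day - timedelta(days=days_since_sunday)
    # Sundays have an ordinal that is a multiple of 7, so this counts weeks.
    week_number = week_start.toordinal() // 7
    kind = "full" if days_since_sunday == 0 else "differential"
    return kind, week_number % number_of_sets
```

The set index returns to the same value every ten weeks, which is when the
oldest set (full backup and its differential backups) gets overwritten.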

Any other tricks/ideas/improvements/corrections/evidence are welcome!

Denis.
