https://github.com/ckan/ckanext-archiver

CKAN Archiver Extension

Status: Production

CKAN Version: 1.5.1+

Overview

The CKAN Archiver Extension provides a set of Celery tasks for downloading and saving CKAN resources. It can be configured to run automatically, saving any new resources that are added to a CKAN instance (and saving any resources when their URL is changed). It can also be run manually from the command line in order to archive resources for specific datasets, or to archive all resources in a CKAN instance.

Installation

Install the extension as usual, e.g. (from an activated virtualenv):

$ pip install -e git+http://github.com/okfn/ckanext-archiver.git#egg=ckanext-archiver

Install the required libraries:

$ pip install -r pip-requirements.txt

Configuration

Enabling Archiver

If you want the archiver to run automatically when a new CKAN resource is added, or the url of a resource is changed, then edit your CKAN config file (eg: development.ini) to enable the extension:
```
ckan.plugins = archiver
```
If there are other plugins activated, add this to the list (each plugin should be separated with a space).

Note: You can still run the archiver manually (from the command line) on specific resources or on all resources in a CKAN instance without enabling the plugin. See section ‘Using Archiver’ for details.
Other CKAN config options

The following config variable should also be set in your CKAN config:
```
ckan.site_url: URL to your CKAN instance
```
This is the URL that the archive process (in Celery) will use to access the CKAN API to update it about the cached URLs. If your internal network names your CKAN server differently, then specify this internal name in config option ckan.site_url_internally.

Optionally, the following config variables can also be set: :
```
ckan.cache_url_root: URL that will be prepended to the file path and saved against the CKAN resource,
  providing a full URL to the archived file.
```

Additional Archiver settings

The following Archiver settings can be changed by creating a copy of ckanext/archiver/default_settings.py at ckanext/archiver/settings.py, and editing the variables:

ARCHIVE_DIR: path to the directory that archived files will be saved
to.
MAX_CONTENT_LENGTH: the maximum size (in bytes) of files to archive.
DATA_FORMATS: the data formats that are archived.
USER_AGENT_STRING: the `User-Agent` header used when the archiver makes requests

Alternatively, if you are running CKAN with this patch: https://github.com/datagovuk/ckan/commit/83dcaf3d875d622ee0cd7f3c1f65ec27a970cd10 then you can instead add the settings to the CKAN config file as normal:

ckanext-archiver.archive_dir
ckanext-archiver.max_content_length
ckanext-archiver.data_formats  (space separated)
ckanext.archiver.user_agent_string

Using Archiver

First, make sure that Celery is running. For test/local use, you can do this by going to the CKAN root directory and typing:

paster celeryd -c <path to CKAN config>

For production use, we recommend setting up Celery to run with supervisord. For more information see

http://docs.ckan.org/en/latest/maintaining/background-tasks.html

The Archiver can be used in two ways:

Automatically

Install, enable and configure the plugin as described above. Any changes to resource URLs (either adding new or updating current URLs) in the CKAN instance will now call the archiver to try and download the resource.
Manually

From the ckanext-archiver directory run:
```
paster archiver update [dataset] -c <path to CKAN config>
```
Here dataset is an optional CKAN dataset name or ID. If given, all resources for that dataset will be archived.

If omitted, all resources for all datasets will be archived.

For a full list of manual commands run:
```
paster archiver --help
```

Testing

Tests should be run from the CKAN root directory (not the extension root).

(pyenv)~/pyenv/src/ckan$ nosetests --ckan ../ckanext-archiver/tests/

Archive CKAN resources
by CKAN

CKAN Archiver Extension

Overview

Installation

Configuration

Using Archiver

Testing

Recent Activity

Archive CKAN resources by CKAN

CKAN Archiver Extension

Overview

Installation

Configuration

Using Archiver

Testing

Recent Activity

Archive CKAN resources
by CKAN