Remote harvesting extension for CKAN
by CKAN

sudo apt-get install redis-server

ckan.harvest.mq.type = redis

sudo apt-get install rabbitmq-server

ckan.harvest.mq.type = rabbitmq

(pyenv) $ pip install -e git+https://github.com/ckan/ckanext-harvest.git#egg=ckanext-harvest

(pyenv) $ pip install -r pip-requirements.txt

ckan.plugins = harvest ckan_harvester

ckan.harvest.mq.type = redis

paster --plugin=ckanext-harvest harvester initdb --config=mysite.ini

harvester initdb
  - Creates the necessary tables in the database

harvester source {name} {url} {type} [{title}] [{active}] [{owner_org}] [{frequency}] [{config}]
  - create new harvest source

harvester rmsource {id}
  - remove (deactivate) a harvester source, whilst leaving any related datasets, jobs and objects

harvester clearsource {id}
  - clears all datasets, jobs and objects related to a harvest source, but keeps the source itself

harvester sources [all]
  - lists harvest sources
    If 'all' is given, inactive sources are listed as well

harvester job {source-id}
  - create new harvest job

harvester jobs
  - lists harvest jobs

harvester run
  - runs harvest jobs

harvester gather_consumer
  - starts the consumer for the gathering queue

harvester fetch_consumer
  - starts the consumer for the fetching queue

harvester purge_queues
  - removes all jobs from fetch and gather queue

harvester [-j] [--segments={segments}] import [{source-id}]
  - perform the import stage with the last fetched objects, optionally belonging to a certain source.
    Please note that no objects will be fetched from the remote server. It will only affect
    the last fetched objects already present in the database.

    If the -j flag is provided, the objects are not joined to existing datasets. This may be useful
    when importing objects for the first time.

    The --segments flag takes a string of hex digits that selects which of the
    16 harvest object segments to import, e.g. 15af will run segments 1, 5, a and f

harvester job-all
  - create new harvest jobs for all active sources.

harvester reindex
  - reindexes the harvest source datasets
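
The --segments value of the import command can be read as a set of hex digits. The following is a minimal sketch of how such a filter might work, not the extension's own code. It assumes that an object's segment is the first hex digit of the MD5 hash of its guid, which the help text above does not state. The helper names parse_segments, segment_of and in_selected_segments are hypothetical.

```python
import hashlib

HEX_DIGITS = "0123456789abcdef"


def parse_segments(segments):
    """Turn a string such as '15af' into the set of selected segment digits."""
    selected = set(segments.lower())
    unknown = selected - set(HEX_DIGITS)
    if unknown:
        raise ValueError("not hex digits: %s" % "".join(sorted(unknown)))
    return selected


def segment_of(guid):
    """Assumed mapping: first hex digit of the MD5 digest of the guid."""
    return hashlib.md5(guid.encode("utf-8")).hexdigest()[0]


def in_selected_segments(guid, segments):
    """True if the object with this guid falls in one of the chosen segments."""
    return segment_of(guid) in parse_segments(segments)
```

With a mapping like this, running the import once per group of digits would split the work into up to 16 roughly equal parts.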

paster --plugin=ckanext-harvest harvester sources --config=mysite.ini

ckan.plugins = harvest ckan_harvester

{
 "api_version": 1,
 "default_tags":["new-tag-1","new-tag-2"],
 "default_groups":["my-own-group"],
 "default_extras":{"new_extra":"Test","harvest_url":"{harvest_source_url}/dataset/{dataset_id}"},
 "override_extras": true,
 "user":"harvester-user",
 "api_key":"<REMOTE_API_KEY>",
 "read_only": true,
 "remote_groups": "only_local",
 "remote_orgs": "create"
}
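
The configuration above is plain JSON, so it can be checked before it is saved. The sketch below is only an illustration: the key names come from the example, but the type checks and the function names check_ckan_harvester_config and expand_extras are assumptions, not the extension's real validation. It also assumes that the placeholders in curly braces inside default_extras are filled in per dataset.

```python
import json

# Expected Python type for each key seen in the example configuration.
EXPECTED_TYPES = {
    "api_version": int,
    "default_tags": list,
    "default_groups": list,
    "default_extras": dict,
    "override_extras": bool,
    "user": str,
    "api_key": str,
    "read_only": bool,
    "remote_groups": str,
    "remote_orgs": str,
}


def check_ckan_harvester_config(config_text):
    """Parse the JSON text and raise ValueError on a badly typed known key."""
    config = json.loads(config_text)
    if not isinstance(config, dict):
        raise ValueError("configuration must be a JSON object")
    for key, expected in EXPECTED_TYPES.items():
        # type(...) is used instead of isinstance so that true/false
        # is not accepted where an integer is expected.
        if key in config and type(config[key]) is not expected:
            raise ValueError("%s must be of type %s" % (key, expected.__name__))
    return config


def expand_extras(config, harvest_source_url, dataset_id):
    """Fill the {harvest_source_url} and {dataset_id} placeholders in default_extras."""
    expanded = {}
    for key, value in config.get("default_extras", {}).items():
        if isinstance(value, str):
            value = value.format(harvest_source_url=harvest_source_url,
                                 dataset_id=dataset_id)
        expanded[key] = value
    return expanded
```

Unknown keys are passed through untouched, so a harvester that supports extra options would not be rejected by this check.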

from ckan.plugins.core import SingletonPlugin, implements
from ckanext.harvest.interfaces import IHarvester

class MyHarvester(SingletonPlugin):
    '''
    A Test Harvester
    '''
    implements(IHarvester)

    def info(self):
        '''
        Harvesting implementations must provide this method, which will return a
        dictionary containing different descriptors of the harvester. The
        returned dictionary should contain:

        * name: machine-readable name. This will be the value stored in the
          database, and the one used by ckanext-harvest to call the appropriate
          harvester.
        * title: human-readable name. This will appear in the form's select box
          in the WUI.
        * description: a short description of what the harvester does. This will
          appear on the form as guidance to the user.

        A complete example may be::

            {
                'name': 'csw',
                'title': 'CSW Server',
                'description': "A server that implements OGC's Catalog Service "
                               "for the Web (CSW) standard"
            }

        :returns: A dictionary with the harvester descriptors
        '''

    def validate_config(self, config):
        '''

        [optional]

        Harvesters can provide this method to validate the configuration entered in the
        form. It should return a single string, which will be stored in the database.
        Exceptions raised will be shown in the form's error messages.

        :param config: Config string coming from the form
        :returns: A string with the validated configuration options
        '''

    def get_original_url(self, harvest_object_id):
        '''

        [optional]

        This optional but highly recommended method allows harvesters to return
        the URL to the original remote document, given a Harvest Object id.
        Note that by getting the harvest object you have access to its guid as
        well as the object source, which has the URL.
        This URL will be used on error reports to help publishers link to the
        original document that has the errors. If this method is not provided
        or no URL is returned, only a link to the local copy of the remote
        document will be shown.

        Examples:
            * For a CKAN record: http://{ckan-instance}/api/rest/{guid}
            * For a WAF record: http://{waf-root}/{file-name}
            * For a CSW record: http://{csw-server}/?Request=GetElementById&Id={guid}&...

        :param harvest_object_id: HarvestObject id
        :returns: A string with the URL to the original document
        '''

    def gather_stage(self, harvest_job):
        '''
        The gather stage will receive a HarvestJob object and will be
        responsible for:
            - gathering all the necessary objects to fetch in a later
              stage (e.g. for a CSW server, perform a GetRecords request).
            - creating the necessary HarvestObjects in the database, specifying
              the guid and a reference to its job. The HarvestObjects need a
              reference date with the last modified date for the resource; this
              may need to be set in a different stage depending on the type of
              source.
            - creating and storing any suitable HarvestGatherErrors that may
              occur.
            - returning a list with all the ids of the created HarvestObjects.

        :param harvest_job: HarvestJob object
        :returns: A list of HarvestObject ids
        '''

    def fetch_stage(self, harvest_object):
        '''
        The fetch stage will receive a HarvestObject object and will be
        responsible for:
            - getting the contents of the remote object (e.g. for a CSW server,
              perform a GetRecordById request).
            - saving the content in the provided HarvestObject.
            - creating and storing any suitable HarvestObjectErrors that may
              occur.
            - returning True if everything went as expected, False otherwise.

        :param harvest_object: HarvestObject object
        :returns: True if everything went right, False if errors were found
        '''

    def import_stage(self, harvest_object):
        '''
        The import stage will receive a HarvestObject object and will be
        responsible for:
            - performing any necessary action with the fetched object (e.g.
              create a CKAN package).
              Note: if this stage creates or updates a package, a reference
              to the package should be added to the HarvestObject.
            - creating the HarvestObject - Package relation (if necessary).
            - creating and storing any suitable HarvestObjectErrors that may
              occur.
            - returning True if everything went as expected, False otherwise.

        :param harvest_object: HarvestObject object
        :returns: True if everything went right, False if errors were found
        '''
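
To make the three stages concrete, here is a minimal sketch that runs outside CKAN. FakeObject, DictHarvester and run_harvest are hypothetical stand-ins invented for this illustration. A real harvester would subclass SingletonPlugin, implement IHarvester and store HarvestObjects in the database rather than in a dict.

```python
class FakeObject(object):
    """Stand-in for a HarvestObject: a guid, fetched content and a package id."""

    def __init__(self, object_id, guid):
        self.id = object_id
        self.guid = guid
        self.content = None
        self.package_id = None


class DictHarvester(object):
    """Harvests 'documents' from a dict that plays the role of the remote server."""

    def __init__(self, remote):
        self.remote = remote   # guid -> document text
        self.objects = {}      # object id -> FakeObject (the "database")
        self.packages = {}     # package id -> imported document

    def info(self):
        return {
            'name': 'dict',
            'title': 'Dict harvester',
            'description': 'Toy harvester that reads from an in-memory dict',
        }

    def gather_stage(self, harvest_job):
        # Create one object per remote guid and return the new object ids.
        ids = []
        for guid in sorted(self.remote):
            obj = FakeObject(len(self.objects) + 1, guid)
            self.objects[obj.id] = obj
            ids.append(obj.id)
        return ids

    def fetch_stage(self, harvest_object):
        # Save the remote content on the object; False signals an error.
        content = self.remote.get(harvest_object.guid)
        if content is None:
            return False
        harvest_object.content = content
        return True

    def import_stage(self, harvest_object):
        # Create the "package" and link it back to the object.
        if harvest_object.content is None:
            return False
        harvest_object.package_id = 'pkg-' + harvest_object.guid
        self.packages[harvest_object.package_id] = harvest_object.content
        return True


def run_harvest(harvester, harvest_job=None):
    """Drive gather, fetch and import in order; return the ids that imported."""
    imported = []
    for object_id in harvester.gather_stage(harvest_job):
        obj = harvester.objects[object_id]
        if harvester.fetch_stage(obj) and harvester.import_stage(obj):
            imported.append(object_id)
    return imported
```

For example, run_harvest(DictHarvester({"a": "doc A"})) gathers one object, fetches its content and records it as package "pkg-a".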

paster --plugin=ckanext-harvest harvester gather_consumer --config=mysite.ini

paster --plugin=ckanext-harvest harvester fetch_consumer --config=mysite.ini

paster --plugin=ckanext-harvest harvester run --config=mysite.ini
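
The three commands above cooperate through two queues. The following rough, single-process sketch shows one way to picture that flow. It assumes that run places pending jobs on the gather queue, that the gather consumer turns each job into object ids on the fetch queue, and that the fetch consumer processes each object. The standard-library queue.Queue stands in for Redis or RabbitMQ, and all function names are hypothetical.

```python
import queue

gather_queue = queue.Queue()
fetch_queue = queue.Queue()


def run(pending_job_ids):
    """Like 'harvester run': hand pending jobs to the gather queue."""
    for job_id in pending_job_ids:
        gather_queue.put(job_id)


def gather_consumer(gather):
    """Drain the gather queue; 'gather' maps a job id to a list of object ids."""
    while not gather_queue.empty():
        job_id = gather_queue.get()
        for object_id in gather(job_id):
            fetch_queue.put(object_id)


def fetch_consumer(fetch_and_import):
    """Drain the fetch queue and return the object ids that were processed."""
    done = []
    while not fetch_queue.empty():
        object_id = fetch_queue.get()
        if fetch_and_import(object_id):
            done.append(object_id)
    return done
```

Unlike this sketch, the real consumers are long-running processes that wait for new messages instead of returning once the queue is empty, which is why each one is started in its own terminal or under a process manager.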

sudo apt-get install supervisor

ps aux | grep supervisord

root      9224  0.0  0.3  56420 12204 ?        Ss   15:52   0:00 /usr/bin/python /usr/bin/supervisord

; ===============================
; ckan harvester
; ===============================

[program:ckan_gather_consumer]

command=/usr/lib/ckan/default/bin/paster --plugin=ckanext-harvest harvester gather_consumer --config=/etc/ckan/std/std.ini

; user that owns virtual environment.
user=ckan

numprocs=1
stdout_logfile=/var/log/ckan/std/gather_consumer.log
stderr_logfile=/var/log/ckan/std/gather_consumer.log
autostart=true
autorestart=true
startsecs=10

[program:ckan_fetch_consumer]

command=/usr/lib/ckan/default/bin/paster --plugin=ckanext-harvest harvester fetch_consumer --config=/etc/ckan/std/std.ini

; user that owns virtual environment.
user=ckan

numprocs=1
stdout_logfile=/var/log/ckan/std/fetch_consumer.log
stderr_logfile=/var/log/ckan/std/fetch_consumer.log
autostart=true
autorestart=true
startsecs=10
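
The two program sections differ only in the consumer name, so they can be rendered from one template. This is a small sketch under that assumption. render_program and render_supervisor_config are hypothetical helpers, and the default paths follow the example values, which will differ per installation.

```python
PROGRAM_TEMPLATE = """[program:ckan_{name}]
command={paster} --plugin=ckanext-harvest harvester {name} --config={config}
user={user}
numprocs=1
stdout_logfile={log_dir}/{name}.log
stderr_logfile={log_dir}/{name}.log
autostart=true
autorestart=true
startsecs=10
"""


def render_program(name,
                   paster="/usr/lib/ckan/default/bin/paster",
                   config="/etc/ckan/std/std.ini",
                   user="ckan",
                   log_dir="/var/log/ckan/std"):
    """Render one [program:...] section for the given consumer name."""
    return PROGRAM_TEMPLATE.format(name=name, paster=paster, config=config,
                                   user=user, log_dir=log_dir)


def render_supervisor_config():
    """Render the sections for both consumers, separated by a blank line."""
    names = ("gather_consumer", "fetch_consumer")
    return "\n".join(render_program(name) for name in names)
```

Printing render_supervisor_config() gives text that can be reviewed and then saved as the supervisor configuration file for the harvester.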

sudo supervisorctl reread
sudo supervisorctl add ckan_gather_consumer
sudo supervisorctl add ckan_fetch_consumer
sudo supervisorctl start ckan_gather_consumer
sudo supervisorctl start ckan_fetch_consumer

sudo supervisorctl status

ckan_fetch_consumer              RUNNING    pid 6983, uptime 0:22:06
ckan_gather_consumer             RUNNING    pid 6968, uptime 0:22:45
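
A status check like the one above is easy to automate. The sketch below parses output in the format shown and reports whether both consumers are running. parse_status and all_running are hypothetical helper names, and the parsing assumes that the program name and its state are the first two whitespace-separated fields on each line.

```python
def parse_status(output):
    """Map each program name to its state from 'supervisorctl status' output."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def all_running(output,
                programs=("ckan_gather_consumer", "ckan_fetch_consumer")):
    """True only if every listed program is reported as RUNNING."""
    states = parse_status(output)
    return all(states.get(name) == "RUNNING" for name in programs)
```

A monitoring script could feed the captured command output into all_running and raise an alert when it returns False.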

sudo service supervisor stop; sudo service supervisor start

`socket.error: [Errno 111] Connection refused`
RabbitMQ is not running::

  sudo service rabbitmq-server start

sudo crontab -e -u ckan

# m  h  dom mon dow   command
*/15 *  *   *   *     /usr/lib/ckan/default/bin/paster --plugin=ckanext-harvest harvester run --config=/etc/ckan/std/std.ini

