Difference between revisions of "Resource Index Dataset Workflow Howto"
Revision as of 10:44, 2 November 2011
Introduction
The Resource Index web service is a system that allows a user to query biomedical data with terms from ontologies. The local copy of the biomedical database and the indexing needed to access it are referred to as the Resource Index dataset. The workflow that populates this dataset processes the textual metadata of diverse elements of biomedical resources, such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed abstracts, and annotates and indexes them with terms from ontologies. The workflow that computes the annotations and indices is run from the shell using a provided shell script.
Running the Resource Index Dataset Population Workflow
Configuring Resources to Process
By default there are no resources configured to be processed in the Resource Index Dataset population workflow. Some of the resources are very time-intensive to process, so you should test them to see which ones are appropriate for your use case and feasible to process given your resources.
To configure the resources to be processed by the workflow, do the following:
- Log in to the NCBO Appliance as the root user.
- cd /srv/ncbo/src/resource_index_workflow/2000/ - this directory contains the Resource Index Dataset population workflow code.
- Edit the build.properties file.
- The last line in the file contains a property called obr.resource.ids. Change this line to include a comma-separated list of the resource identifiers. Possible identifiers are included on the line above in the same file.
- Run ant clean build
- cd /srv/ncbo/src/resource_index/svn/tags/2005 - this directory contains the Resource Index client code.
- Edit the build.properties file so that its obr.resource.ids value matches the one in the workflow build.properties file.
- Stop the Tomcat service.
- Run ant clean deploywar
- Start the Tomcat service.
- Continue with the workflow execution below.
To see a list of resources that NCBO uses, you can visit http://rest.bioontology.org/resource_index/resources
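The obr.resource.ids edit from the steps above can be sketched as a small shell helper. This is a minimal sketch, not something shipped with the appliance: the function name set_resource_ids and the identifiers RES_A and RES_B are placeholders, and it assumes the property sits on a single line of the form obr.resource.ids=... and that identifiers contain no | or & characters. The demo runs against a scratch file rather than a real build.properties.

```shell
# Hypothetical helper: set obr.resource.ids in a properties file.
# Usage: set_resource_ids <properties-file> <comma-separated-ids>
set_resource_ids() {
    props_file="$1"
    ids="$2"
    if grep -q '^obr\.resource\.ids=' "$props_file"; then
        # Rewrite the existing line through a temp file, then copy the
        # result back so the original file keeps its permissions.
        tmp_file="$(mktemp)"
        sed "s|^obr\.resource\.ids=.*|obr.resource.ids=${ids}|" "$props_file" > "$tmp_file" \
            && cat "$tmp_file" > "$props_file"
        rm -f "$tmp_file"
    else
        # Property not present yet: append it.
        printf 'obr.resource.ids=%s\n' "$ids" >> "$props_file"
    fi
}

# Demo on a scratch file (RES_A and RES_B are placeholder identifiers).
demo_file="$(mktemp)"
printf 'some.other.property=1\nobr.resource.ids=\n' > "$demo_file"
set_resource_ids "$demo_file" "RES_A,RES_B"
grep '^obr\.resource\.ids=' "$demo_file"
rm -f "$demo_file"
```

Calling the helper once for the workflow build.properties and once for the client build.properties with the same identifier list would keep the two files in step, which is what the steps above require.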
Executing the Workflow
Execute the following shell commands to build and execute the Resource Index dataset population workflow:
[root@example ~]# cd /srv/ncbo/src/resource_index_workflow/2000/
[root@example 2000]# sh ./all.sh
- The script builds the Resource Index dataset population workflow project, creates an execution environment in the /srv/ncbo/src/resource_index_workflow/2000/dist/ folder, and executes the run.sh script.
- Log file location:
/srv/ncbo/src/resource_index_workflow/2000/dist/files/logs/branch1.0/localhost/resource_index
The application will display its progress in the console as it is running. While it is running, or after it is finished, you can look at the resource_index database to validate that data is actually being processed and written.
Database Structure
The Resource Index Database contains many tables, with a common set of six per processed resource. These resource-specific tables are named with the resource acronym as a prefix. For example, for the WikiPathways resource these are the tables that are populated:
- obr_wp_aggregation
- obr_wp_annotation
- obr_wp_concept_frequency
- obr_wp_element
- obr_wp_isa_annotation
- obr_wp_map_annotation
There is also a common set of tables that are not resource-specific. These include obr_resource, obr_context, obr_dictionary, obr_execution and obr_statistics.
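When checking the database for a particular resource, it can help to generate the expected table names from the acronym. The following is a minimal sketch that assumes every resource follows the same obr_<acronym>_<suffix> pattern shown for WikiPathways; the function name resource_tables is hypothetical.

```shell
# Hypothetical helper: print the six resource-specific table names
# for a given resource acronym (for example, wp for WikiPathways).
resource_tables() {
    acronym="$1"
    for suffix in aggregation annotation concept_frequency \
                  element isa_annotation map_annotation; do
        printf 'obr_%s_%s\n' "$acronym" "$suffix"
    done
}

# Lists the same six tables shown above for WikiPathways.
resource_tables wp
```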
Initial Population, Ontology Update Population, Resource Update Population
The Resource Index Dataset Population workflow can be configured to run the data population in a variety of modes in order to save on processing time. The following are descriptions of the various modes and the flags that need to be set in the build.properties file before running the workflow.
Initial Population
The first time you run the Resource Index Dataset Population workflow, the database contains no data about any of the ontologies, terms, or resources that you want to process. Therefore, it is necessary to run the entire workflow just to achieve a minimum state in which the API will function. This type of workflow execution will gather all of the ontology data that's available from the Annotator hierarchy database (ontologies with status 28) and will then process all of the resources configured in the build.properties file.
obs.slave.populate=true
obr.table.index.disabled=true
obr.resources.process=true
obr.update.resource=true
obr.dictionary.complete=true
Ontology Update
When the Annotator dataset population workflow has processed new ontologies, the Resource Index dataset must be updated to match. Running this workflow type will update the ontology data to match the Annotator hierarchy dataset and then annotate all of the resources currently present in the database using the newer ontology data.
obs.slave.populate=true
obr.table.index.disabled=true
obr.resources.process=true
obr.update.resource=false
obr.dictionary.complete=false
obs.slave.ontology.remove=true
Resource Update
From time to time the resources used within the Resource Index will make new information available. Running this workflow type will update the resources that have been configured in the build.properties file. After the resources have been updated, new elements are annotated using the existing ontology data. None of the ontology-related data will be modified.
obs.slave.populate=false
obr.table.index.disabled=false
obr.resources.process=true
obr.update.resource=true
obr.dictionary.complete=true
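To avoid retyping the flags when switching modes, the three flag sets above could be kept in a small shell function. This is only a sketch: the function name mode_flags and the mode names initial, ontology-update, and resource-update are made up for illustration, while the property values are copied from the three modes described above. The output still has to be merged into build.properties by hand or by another script.

```shell
# Hypothetical helper: print the build.properties flags for a mode.
# Mode names are illustrative; flag values come from the text above.
mode_flags() {
    case "$1" in
        initial)
            printf '%s\n' \
                'obs.slave.populate=true' \
                'obr.table.index.disabled=true' \
                'obr.resources.process=true' \
                'obr.update.resource=true' \
                'obr.dictionary.complete=true'
            ;;
        ontology-update)
            printf '%s\n' \
                'obs.slave.populate=true' \
                'obr.table.index.disabled=true' \
                'obr.resources.process=true' \
                'obr.update.resource=false' \
                'obr.dictionary.complete=false' \
                'obs.slave.ontology.remove=true'
            ;;
        resource-update)
            printf '%s\n' \
                'obs.slave.populate=false' \
                'obr.table.index.disabled=false' \
                'obr.resources.process=true' \
                'obr.update.resource=true' \
                'obr.dictionary.complete=true'
            ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}

# Print the flags for a resource update.
mode_flags resource-update
```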
Troubleshooting
- The shell script may not have execution permissions by default. If this is the case, you will get a permissions error and will have to run the following command to change the permissions:
chmod +x ./all.sh
- Newly-added ontologies may not appear in the Resource Index API until you restart Tomcat, which can be done by running
ncborestart
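The permission check from the first troubleshooting item can be folded into a guard that runs before the workflow script is launched. The following is a minimal sketch; the function name ensure_executable is hypothetical, and the demo uses a scratch file in place of all.sh.

```shell
# Hypothetical helper: make sure a script exists and is executable.
ensure_executable() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "not found: $script" >&2
        return 1
    fi
    if [ ! -x "$script" ]; then
        # Same fix as the chmod command above.
        chmod +x "$script"
    fi
}

# Demo on a scratch file standing in for all.sh.
demo_script="$(mktemp)"
printf '#!/bin/sh\necho ok\n' > "$demo_script"
chmod 644 "$demo_script"
ensure_executable "$demo_script" && "$demo_script"
rm -f "$demo_script"
```

In the workflow directory the equivalent call would be ensure_executable ./all.sh followed by ./all.sh.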