Difference between revisions of "Annotator Dataset Workflow Howto"
Revision as of 11:16, 10 February 2011
Introduction
Once you have submitted some ontologies into the BioPortal Ontology Services (BioPortal Core), you can use them to populate the backend datasets required by the Annotator Web Service. These datasets are collectively referred to as the Annotator datasets and comprise the hierarchy database (formerly known as "OBS"), a dictionary file, and (optionally) a mapping database. The population process uses classes, terms, relations, and semantic types from the ontologies. It is done in two major steps: 1) synchronize the hierarchy database with BioPortal Core and 2) create the dictionary file for use with MGREP. A third step, populating mapping information, should be done if mappings are available.
Synchronize Hierarchy Database with Ontology Services
The ontologies and related data used by the Annotator are gathered from the Ontology Services (part of BioPortal Core). This process should be run any time a new ontology (or a new version of an existing ontology) is added to the Ontology Services; in principle it could also be run from a cron script or scheduled job.
- Remove out-dated ontologies from the hierarchy database (e.g. older versions of ontologies that are no longer in BioPortal). Invoking this restlet removes all the outdated ontology data and the associated entities such as concepts, terms, relations, semantic types and hierarchy information.
See the list of ontologies/versions to be removed: http://example/obs/admin/ontologies/list/old
Remove old ontologies: http://example/obs/admin/ontologies/remove
- Add new ontologies from BioPortal to the hierarchy database. Invoking this restlet adds all the new ontology data and the associated entities such as concepts, terms, relations, semantic types and hierarchy information.
See the list of ontologies/versions to be added: http://example/obs/admin/ontologies/list/new
Add new ontologies: http://example/obs/admin/ontologies/add
- Populate Concepts (For details, please refer to Chapter 2.1)
- Populate Hierarchy (For details, please refer to Chapter 2.2)
http://example/obs/loaderBigPaths/all
To monitor the progress and view any errors, refer to:
- The "status" field in the obs_ontology table in the Annotator DB (obs_hibernate database).
- The Tomcat log (/var/logs/tomcat6/catalina.out).
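The synchronization steps in this section can be sketched as an ordered list of restlet URLs. This is a minimal sketch, assuming the placeholder host http://example/obs used throughout this page; the helper name sync_steps is hypothetical, and the "populate concepts" URL is taken from the appendix (loaderBigConcepts/all). The code only builds the URLs and does not call them.

```python
def sync_steps(base="http://example/obs"):
    """Return the synchronization restlets in the order they are listed above.

    `base` is the placeholder host used on this page (an assumption);
    replace it with the real Annotator host. This helper is illustrative
    and is not part of the Annotator itself.
    """
    base = base.rstrip("/")
    return [
        ("list old ontologies", f"{base}/admin/ontologies/list/old"),
        ("remove old ontologies", f"{base}/admin/ontologies/remove"),
        ("list new ontologies", f"{base}/admin/ontologies/list/new"),
        ("add new ontologies", f"{base}/admin/ontologies/add"),
        ("populate concepts", f"{base}/loaderBigConcepts/all"),
        ("populate hierarchy", f"{base}/loaderBigPaths/all"),
    ]


# Print the steps in order so they can be reviewed before being invoked.
for label, url in sync_steps():
    print(f"{label}: {url}")
```

Listing the "list" restlets before the "remove" and "add" restlets makes it easy to review what will change before anything is modified.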
Create Dictionary File
All terms are written to a dictionary file (the data comes from the obs_term table).
- Location: the directory is specified in build.properties
# Dictionary File path
obs.dictionary.path=/ncbo/bioportal_resources/annotator/dictionary/
http://example/obs/createDictionary/0
Then run
ncborestart
to concatenate the resulting dictionary files and to restart the MGREP server.
Mapping Data Population
Mappings between ontologies can be used in the Annotator Web Service to find related terms. Loading the mapping information is currently a manual process, though this will be automated in the future. If you have mapping data you would like to include in the Annotator, please contact NCBO.
Appendix
Data Population - Concepts and Hierarchy
Introduction: The Annotator hierarchy database population application pulls the ontology and concept data from BioPortal Core via the REST services, extracts and computes the hierarchy information, and stores it in the hierarchy database. The computed data is then accessible via the Annotator REST services.
Data population is divided into two parts: 1) Concepts and 2) Hierarchy. Please see below for details on these two population procedures.
Concepts and Direct Relation (level == 1)
Pre-requisite in "status":
The ontology should be in valid status ("status = 3") in the obs_ontology table in the hierarchy database to start this process (this is the initial status from BioPortal Core when an ontology is successfully parsed). The status also serves as a safety lock to avoid launching population again for ontologies that are already being populated or are already populated.
REST calls available
- For all ontologies: populate all ontologies with valid status ("status = 3") currently stored in Hierarchy DB (ontologies are added during the synchronization process).
http://example/obs/loaderBigConcepts/all
- For a specific ontology: populate the given ontology if "status = 3". Otherwise the data population process will complain that the status is invalid in the Tomcat log and exit.
http://example/obs/loaderBigConcepts/{ontology_version_id}
Troubleshooting
Errors are logged both in Tomcat catalina.out and DB (obs_error_queue table).
- Case 1: BioPortal REST Service is Down
When you kick off the process using the Annotator REST call (either via web browser or shell script), it first checks whether the BioPortal REST service is alive. If the BioPortal REST service is down, an error message saying so is written to the Tomcat log, but the ontology "status" field is not changed. Nothing is changed in the Annotator database, so no clean-up is needed; simply kick off the process again when the BioPortal REST service is back up. See "Error Handling" for the "status" change scenario.
- Case 2: Critical Error
If the error is critical (a RuntimeException, etc.), the "status" of the ontology in the obs_ontology table becomes "99" and the data population process halts.
- Case 3: Non-critical Error
If the error is not critical, the "status" of the ontology in the obs_ontology table becomes "99" but the data population process continues to populate the rest of the data. An example of a non-critical error is a discrepancy between the data from two different BioPortal REST calls, e.g. some of the root concepts returned by the getRootConcepts call are missing from the list of concepts returned by getAllConcepts.
In cases 2 and 3, data clean-up may be necessary. Please see "Data Clean-Up".
Indirect Relation Hierarchy (level > 1)
Pre-requisite in "status":
The ontology should be in valid status ("status = 14") in the obs_ontology table in the hierarchy database to start this process (i.e. "loaderBigConcepts" should have completed for the given ontology). This is a safety lock to avoid launching population again for ontologies that are already being populated or are already populated.
REST calls available:
- For all ontologies: populate all ontologies with "status = 14"
http://example/obs/loaderBigPaths/all
- For a specific ontology: populate the given ontology if "status = 14". Otherwise the data population process will complain that the status is invalid in the Tomcat log and exit.
http://example/obs/loaderBigPaths/{ontology_version_id}
e.g. http://example/obs/loaderBigPaths/40671
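The "all" and per-ontology forms of the two population restlets follow the same URL pattern, which can be sketched as a small helper. This is a minimal sketch, assuming the placeholder host http://example/obs from this page; loader_url is a hypothetical name, and the code only builds the URL without invoking it.

```python
def loader_url(restlet, ontology_version_id=None, base="http://example/obs"):
    """Build the URL for a population restlet.

    restlet: "loaderBigConcepts" or "loaderBigPaths" (names from this page).
    ontology_version_id: None selects "all"; otherwise one ontology version.
    base: placeholder host used on this page (an assumption).
    """
    if restlet not in ("loaderBigConcepts", "loaderBigPaths"):
        raise ValueError(f"unknown restlet: {restlet}")
    target = "all" if ontology_version_id is None else str(ontology_version_id)
    return f"{base.rstrip('/')}/{restlet}/{target}"


print(loader_url("loaderBigPaths", 40671))  # http://example/obs/loaderBigPaths/40671
print(loader_url("loaderBigConcepts"))      # http://example/obs/loaderBigConcepts/all
```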
Monitoring the Progress
To monitor the progress, see the "status" field in the obs_ontology table.
Restlets | Status Required (valid status required to begin the process) | Status Start -> Finish |
---|---|---|
loaderBigConcepts | 3 | 11 -> 14 |
loaderBigPaths | 14 | 21 -> 28 |
(status value in obs_ontology table changes as progress continues) |
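The status table can be read as a small state lookup. The following is a minimal sketch that interprets a status value from obs_ontology; the function names are hypothetical, and treating every value from the start status up to (but not including) the finish status as "in progress" is an assumption, since this page only gives the start and finish values.

```python
ERROR_STATUS = 99   # set on any error, per this page
READY_STATUS = 28   # referred to as "Ready" in the validation queries

# (restlet, required status, start status, finish status), from the table above.
STAGES = [
    ("loaderBigConcepts", 3, 11, 14),
    ("loaderBigPaths", 14, 21, 28),
]


def runnable_restlet(status):
    """Return the restlet whose required status matches, or None."""
    for restlet, required, _start, _finish in STAGES:
        if status == required:
            return restlet
    return None


def describe_status(status):
    """Give a rough, human-readable reading of an obs_ontology status value."""
    if status == ERROR_STATUS:
        return "error: check the logs; clean-up may be needed"
    for restlet, required, start, finish in STAGES:
        if status == required:
            return f"ready to run {restlet}"
        # Assumption: values from start up to finish mean "in progress".
        if start <= status < finish:
            return f"{restlet} in progress"
    if status == READY_STATUS:
        return "fully populated"
    return "unknown status"


for value in (3, 11, 14, 21, 28, 99):
    print(value, "->", describe_status(value))
```

Note that 14 is both the finish status of loaderBigConcepts and the required status for loaderBigPaths, so the sketch reports it as ready for the next step.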
Error Handling:
- If there is any error, the status will be set to 99.
- If a critical error (e.g. BioPortal REST services down) occurs, the population process will stop and exit. If the error is NOT critical (e.g. the semantic type description is not found in the semantic look-up table), the process sets the status to 99 and logs the error (in both the Tomcat log and the obs_error_queue table), but continues with the population.
- To restart the process, just run the script to clean up. The script resets the status to either "3" or "14", depending on whether it is a Concept clean-up or a Hierarchy clean-up. (See Section II. Data Clean up)
Data Clean-Up
The source for several SQL commands that can be used to clean up data after an aborted population attempt is located in /ncbo/sources/annotator/2004/db/sql/obs_db_cleanup.sql. DO NOT RUN the entire script, since it is just a compiled list of commands.
/* -----------------------------------------------------------------------
to clean up or undo everything on one ontology, just run #1 and #2
-----------------------------------------------------------------------
1. Rollback BPConceptManager. Run this to rollback to initial state - to undo 'loaderConcepts' restlet */
set @var_ontology_version_id := '39545';
delete a.* from obs_relation a, obs_concept b, obs_ontology c
where a.level = 1 and a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
delete a.* from obs_term a, obs_concept b, obs_ontology c
where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
delete a.* from obs_semantic_type a, obs_concept b, obs_ontology c
where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
delete a.* from obs_concept a, obs_ontology b
where a.ontology_id = b.id and b.local_ontology_id = @var_ontology_version_id;
UPDATE obs_ontology set status = 3 where local_ontology_id = @var_ontology_version_id;
/* -----------------------------------------------------------------------
2. Rollback BPPathManager. Run this to rollback - to undo 'loaderPaths' restlet */
set @var_ontology_version_id := '39545';
delete a.* from obs_relation a, obs_concept b, obs_ontology c
where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id and a.level > 1;
delete a.* from obs_path_to_root a, obs_concept b, obs_ontology c
where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
delete a.* from obs_path_to_leaf a, obs_concept b, obs_ontology c
where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = @var_ontology_version_id;
UPDATE obs_ontology set status = 14 where local_ontology_id = @var_ontology_version_id;
Useful SQL Queries for Monitoring and Validation
The source for several queries that can be used for monitoring and data validation is located in /ncbo/sources/annotator/2004/db/sql/obs_db_cleanup.sql (In the same file as the Data Clean-Up)
Number of concepts for a specific Ontology
select count(*) from obs_concept a, obs_ontology b where a.ontology_id = b.id and b.local_ontology_id = '40261';
Total number of relations or path_to_root/leaf rows for a specific ontology (to gauge progress by how fast the number is growing)
select count(*) from obs_relation a, obs_concept b, obs_ontology c
where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = '40133' and a.level > 1;
select a.*, b.local_concept_id from obs_path_to_root a, obs_concept b, obs_ontology c
where a.concept_id = b.id and b.ontology_id = c.id and c.local_ontology_id = '40483';
Ontologies that have the status "Ready" (28) but have no concepts in obs_concept table
SELECT o.*, c.ontology_id
FROM obs_ontology o
LEFT OUTER JOIN ( SELECT DISTINCT ontology_id FROM obs_concept ) c ON o.id = c.ontology_id
WHERE o.status = 28 AND c.ontology_id IS NULL;
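To see what the validation queries return, they can be tried against a toy database. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the two tables contain only the columns the queries touch, and the rows are made-up illustration data, so this is not the real Annotator schema (which lives in MySQL). The helper names are hypothetical.

```python
import sqlite3

# Toy stand-in for the hierarchy database: only the columns used below.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE obs_ontology (id INTEGER PRIMARY KEY, local_ontology_id TEXT, status INTEGER);
    CREATE TABLE obs_concept (id INTEGER PRIMARY KEY, ontology_id INTEGER);
    -- Illustration data: ontology 1 has concepts, ontology 2 is "Ready" but empty,
    -- ontology 3 has no concepts but is not yet "Ready".
    INSERT INTO obs_ontology VALUES (1, '40261', 28), (2, '40133', 28), (3, '40483', 14);
    INSERT INTO obs_concept VALUES (10, 1), (11, 1);
    """
)


def concept_count(conn, local_ontology_id):
    """Number of concepts for a specific ontology version."""
    row = conn.execute(
        "select count(*) from obs_concept a, obs_ontology b "
        "where a.ontology_id = b.id and b.local_ontology_id = ?",
        (local_ontology_id,),
    ).fetchone()
    return row[0]


def ready_but_empty(conn):
    """Ontology versions with status 28 but no rows in obs_concept."""
    rows = conn.execute(
        """
        SELECT o.local_ontology_id
        FROM obs_ontology o
        LEFT OUTER JOIN (SELECT DISTINCT ontology_id FROM obs_concept) c
          ON o.id = c.ontology_id
        WHERE o.status = 28 AND c.ontology_id IS NULL
        """
    ).fetchall()
    return [r[0] for r in rows]


print(concept_count(conn, "40261"))  # 2
print(ready_but_empty(conn))         # ['40133']
```

The sketch passes the ontology version id as a query parameter rather than editing the SQL text, which avoids mistakes when the same query is reused for several ontologies.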