MeCab user dictionary for science technology term

2019/05/17

HTTPS Site: https://dbarchive.biosciencedbc.jp/data/mecab/

We have built a user dictionary for the morphological analysis engine MeCab from the headwords and synonyms of the JST Thesaurus (2015 edition).

README Content

  1. Database Component
  2. Data Description
  3. License
  4. Update History
  5. Literature
  6. Contact address

1. Database Component

  1. README
  2. JST Thesaurus Headwords and Synonyms
  3. J-GLOBAL MeSH Dictionary
  4. Nikkaji Dictionary

2. Data Description

2.1 README

Data name: README
Description of data contents: HTML file describing the "MeCab user dictionary for science technology term" data.
File: README_e.html (English)

2.2 JST Thesaurus Headwords and Synonyms

Data name: JST Thesaurus Headwords and Synonyms
Description of data contents:

A user dictionary for the morphological analysis engine MeCab (http://taku910.github.io/mecab/), built from the headwords and synonyms of the JST Thesaurus (2015 edition). The original thesaurus gives no reading for synonyms (Headword Flag: 'V'), so NBDC assigned readings as follows: synonyms in life science (Category code: 'LSxx', where xx is a two-digit number) and computer science (Category code: 'EG01') were given their natural reading, and synonyms in all other categories were given the reading of their base form.
The dictionary items are based on the IPA dictionary. The CSV file is encoded in Shift-JIS and the dic file in UTF-8. The dictionary also includes entries in which zenkaku (full-width) letters, numerals and symbols are converted into the corresponding hankaku (half-width) characters.
Please note that this dictionary cannot be used as a thesaurus, because it contains no information on the relations between words.
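
The zenkaku-to-hankaku conversion mentioned above can be illustrated with a minimal sketch. It assumes the conversion covers the full-width ASCII variants (U+FF01 to U+FF5E) and the ideographic space; the exact character set NBDC converted is not stated in this README, and the name `to_hankaku` is hypothetical.

```python
# Assumption: "zenkaku alphabets, numerals and symbols" are the full-width
# ASCII variants U+FF01..U+FF5E, each exactly 0xFEE0 above its ASCII twin.
ZEN_TO_HAN = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
# Assumption: the ideographic space U+3000 is also mapped to a plain space.
ZEN_TO_HAN[0x3000] = 0x20


def to_hankaku(text: str) -> str:
    """Return text with full-width letters, digits and symbols made half-width."""
    return text.translate(ZEN_TO_HAN)


print(to_hankaku("ＤＮＡ－１"))  # DNA-1
```

Kana and kanji fall outside the mapped range, so they pass through unchanged.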

File: Thesaurus2015.dic.zip (MeCab dic format) (7.4 MB)
File: mecab_thesaurus.zip (csv format) (3.8 MB)
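
A compiled dic file is normally passed to MeCab through its `-u` (`--userdic`) option. The sketch below only assembles and runs such a command. It assumes that the `mecab` command is installed with a UTF-8 system dictionary, that the archive extracts to a file named `Thesaurus2015.dic`, and that comma-separated paths, which MeCab documents for the `userdic` setting, are also accepted by `-u`. The helper names are hypothetical.

```python
import shutil
import subprocess


def mecab_command(user_dics):
    """Build a mecab command line that loads the given user dictionaries."""
    # Assumption: several dictionaries can be joined with commas.
    return ["mecab", "-u", ",".join(user_dics)]


def analyze(text, user_dics):
    """Run mecab on text; return its raw output, or None if mecab is absent."""
    if shutil.which("mecab") is None:
        return None
    result = subprocess.run(
        mecab_command(user_dics),
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise if mecab fails, e.g. when a dic path is wrong
    )
    return result.stdout


print(mecab_command(["Thesaurus2015.dic"]))
```

A call such as `analyze("細胞周期", ["Thesaurus2015.dic"])` would then return MeCab's analysis as text, provided the dictionary file exists at that path.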

The data items are as follows:
Data item: Description
Surface form: The word itself
Left-context ID: MeCab internal ID for the left context (see http://taku910.github.io/mecab/dic.html)
Right-context ID: MeCab internal ID for the right context (see http://taku910.github.io/mecab/dic.html)
Cost: Cost reflecting how likely the word is to appear in a sentence (the smaller the cost, the more likely)
POS: Part of speech
POS subcategory 1: POS subcategory 1
POS subcategory 2: POS subcategory 2
POS subcategory 3: POS subcategory 3
Conjugation type: Conjugation type
Conjugation form: Conjugation form
Base form: Corresponding headword in the JST Thesaurus
Reading ('Furigana'): Reading of the headword. When Headword Flag is 'V', it may differ from the reading of the surface form.
Pronunciation: Automatically generated from Reading
Source dictionary: Fixed as 'Thesaurus2015'
ID in Source dictionary: ID in JST Thesaurus
J-GLOBAL ID: ID in J-GLOBAL
Headword Flag:
・C: The surface form is the same as the headword in JST Thesaurus (or its corresponding hankaku form)
・V: Otherwise
Category code: Category code of science fields in JST Thesaurus
Common word flag 1:
・1: The IPA dictionary has one or more entries for the surface form
・0: The IPA dictionary has no entry for the surface form
Common word flag 2: Based on "IPA dictionary analysis results":
・When Common word flag 1 is 1, this flag holds the part of speech from the IPA dictionary analysis result.
・When Common word flag 1 is 0:
  - UNKNOWN_1: the result is a single unknown word
  - UNKNOWN_2: the result is multiple tokens, including an unknown word
  - MULTI_WORD: the result is multiple tokens in the IPA dictionary
IPA dictionary analysis results: Results of morphological analysis with the original IPA dictionary (and with a dictionary of IPA dictionary entries in which zenkaku alphanumeric characters and symbols are converted into the corresponding hankaku characters). If the result is divided into multiple tokens, the tokens are separated by whitespace. The results are not manually corrected.
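
As a rough illustration of the data items above, the following sketch reads CSV rows into dictionaries keyed by field name. The field names are hypothetical labels, the column order is assumed to match the order in which the items are listed, and the sample row is made up for illustration rather than taken from the distributed file.

```python
import csv
import io

# Hypothetical labels for the 21 data items, assumed to be in listing order.
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos1", "pos2", "pos3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
    "source", "source_id", "jglobal_id", "headword_flag", "category",
    "common_flag1", "common_flag2", "ipa_result",
]


def read_entries(stream):
    """Yield one dict per CSV row, keyed by FIELDS."""
    for row in csv.reader(stream):
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
        yield dict(zip(FIELDS, row))


# Made-up row for illustration only; IDs and costs are not real values.
SAMPLE = (
    "細胞,1285,1285,5000,名詞,一般,*,*,*,*,細胞,サイボウ,サイボー,"
    "Thesaurus2015,X0001,000000000000,C,LS01,1,名詞,細胞\n"
)

for entry in read_entries(io.StringIO(SAMPLE)):
    print(entry["surface"], entry["headword_flag"], entry["category"])
```

For the real file, a stream could be opened with `open(path, encoding="cp932", newline="")`, assuming the Shift-JIS data decodes with Python's `cp932` codec; `shift_jis` is the stricter alternative.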

2.3 J-GLOBAL MeSH Dictionary

Data name: J-GLOBAL MeSH Dictionary
Description of data contents:

A user dictionary for the morphological analysis engine MeCab (http://taku910.github.io/mecab/), built from J-GLOBAL science and technology terms that are linked to MeSH (Medical Subject Headings: https://www.nlm.nih.gov/mesh/) terms of the United States National Library of Medicine. The dictionary items are based on the IPA dictionary. The CSV file is encoded in Shift-JIS and the dic file in UTF-8.

File: JSTMeSH.dic.zip (MeCab dic format) (1.2 MB)
File: mecab_jstmesh.zip (csv format) (484 KB)

The data items are as follows:
Data item: Description
Surface form: The word itself
Left-context ID: MeCab internal ID for the left context (see http://taku910.github.io/mecab/dic.html)
Right-context ID: MeCab internal ID for the right context (see http://taku910.github.io/mecab/dic.html)
Cost: Cost reflecting how likely the word is to appear in a sentence (the smaller the cost, the more likely)
POS: Part of speech
POS subcategory 1: POS subcategory 1
POS subcategory 2: POS subcategory 2
POS subcategory 3: POS subcategory 3
Conjugation type: Conjugation type
Conjugation form: Conjugation form
Base form: Same as the surface form
Reading ('Furigana'): (empty)
Pronunciation: (empty)
Source dictionary: Fixed as 'MeSH'
ID in Source dictionary: MeSH UID
J-GLOBAL ID: ID in J-GLOBAL
Headword Flag: Fixed as 'C'
Category code: Category code of science fields in JST Thesaurus
Common word flag 1:
・1: The IPA dictionary has one or more entries for the surface form
・0: The IPA dictionary has no entry for the surface form
Common word flag 2: Based on "IPA dictionary analysis results":
・When Common word flag 1 is 1, this flag holds the part of speech from the IPA dictionary analysis result.
・When Common word flag 1 is 0:
  - UNKNOWN_1: the result is a single unknown word
  - UNKNOWN_2: the result is multiple tokens, including an unknown word
  - MULTI_WORD: the result is multiple tokens in the IPA dictionary
IPA dictionary analysis results: Results of morphological analysis with the original IPA dictionary (and with a dictionary of IPA dictionary entries in which zenkaku alphanumeric characters and symbols are converted into the corresponding hankaku characters). If the result is divided into multiple tokens, the tokens are separated by whitespace. The results are not manually corrected.
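
The rules for the two common word flags can be sketched as a small function. This is only one reading of the descriptions above, not NBDC's actual procedure: it assumes that flag 1 is 1 exactly when the analysis yields a single known token, and it uses a hypothetical token shape `(surface, pos)` in which `pos` is `None` for an unknown word.

```python
from typing import List, Optional, Tuple

# Hypothetical token shape: (surface, pos), with pos=None for an unknown word.
Token = Tuple[str, Optional[str]]


def common_word_flags(tokens: List[Token]) -> Tuple[int, str]:
    """Return (Common word flag 1, Common word flag 2) for one surface form."""
    if not tokens:
        raise ValueError("the analysis must produce at least one token")
    if len(tokens) == 1:
        pos = tokens[0][1]
        if pos is None:
            return 0, "UNKNOWN_1"  # a single unknown word
        return 1, pos  # assumed: the surface form itself is an IPA entry
    has_unknown = any(pos is None for _, pos in tokens)
    return 0, ("UNKNOWN_2" if has_unknown else "MULTI_WORD")


print(common_word_flags([("細胞", "名詞")]))
print(common_word_flags([("細胞", "名詞"), ("周期", "名詞")]))
```

With the inputs shown, the first call returns `(1, "名詞")` and the second `(0, "MULTI_WORD")`.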

2.4 Nikkaji Dictionary

Data name: Nikkaji Dictionary
Description of data contents:

A user dictionary for the morphological analysis engine MeCab (http://taku910.github.io/mecab/), built from J-GLOBAL science and technology terms that are linked to the Japan Chemical Substance Dictionary (Nikkaji), an organic compound dictionary database prepared by the Japan Science and Technology Agency. The dictionary items are based on the IPA dictionary. The CSV file is encoded in Shift-JIS and the dic file in UTF-8.

File: Nikkaji.dic.zip (MeCab dic format) (6.6 MB)
File: mecab_nikkaji.zip (csv format) (2.4 MB)
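
Because the CSV file is in Shift-JIS while the dic file is in UTF-8, tools that expect UTF-8 input may need the CSV re-encoded first. The sketch below copies a text file while changing only its encoding. The function name and paths are hypothetical, and `cp932` is assumed to decode the distributed file.

```python
def reencode(src_path, dst_path, src_encoding="cp932", dst_encoding="utf-8"):
    """Copy a text file line by line, changing only its character encoding."""
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

A call such as `reencode("nikkaji_sjis.csv", "nikkaji_utf8.csv")` would write a UTF-8 copy next to the original; both file names here are placeholders.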

The data items are as follows:
Data item: Description
Surface form: The word itself
Left-context ID: MeCab internal ID for the left context (see http://taku910.github.io/mecab/dic.html)
Right-context ID: MeCab internal ID for the right context (see http://taku910.github.io/mecab/dic.html)
Cost: Cost reflecting how likely the word is to appear in a sentence (the smaller the cost, the more likely)
POS: Part of speech
POS subcategory 1: POS subcategory 1
POS subcategory 2: POS subcategory 2
POS subcategory 3: POS subcategory 3
Conjugation type: Conjugation type
Conjugation form: Conjugation form
Base form: Same as the surface form
Reading ('Furigana'): (empty)
Pronunciation: (empty)
Source dictionary: Fixed as 'Nikkaji'
ID in Source dictionary: ID in Nikkaji
J-GLOBAL ID: ID in J-GLOBAL
Headword Flag: Fixed as 'C'
Category code: Category code of science fields in JST Thesaurus
Common word flag 1:
・1: The IPA dictionary has one or more entries for the surface form
・0: The IPA dictionary has no entry for the surface form
Common word flag 2: Based on "IPA dictionary analysis results":
・When Common word flag 1 is 1, this flag holds the part of speech from the IPA dictionary analysis result.
・When Common word flag 1 is 0:
  - UNKNOWN_1: the result is a single unknown word
  - UNKNOWN_2: the result is multiple tokens, including an unknown word
  - MULTI_WORD: the result is multiple tokens in the IPA dictionary
IPA dictionary analysis results: Results of morphological analysis with the original IPA dictionary (and with a dictionary of IPA dictionary entries in which zenkaku alphanumeric characters and symbols are converted into the corresponding hankaku characters). If the result is divided into multiple tokens, the tokens are separated by whitespace. The results are not manually corrected.
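
One possible use of the common word flags is to see how the terms missing from the IPA dictionary were analysed. The sketch below tallies Common word flag 2 over rows whose Common word flag 1 is 0. The column positions are assumptions derived from the order of the data items above, and the function name is hypothetical.

```python
from collections import Counter

# Assumed 0-based column positions, following the order of the data items.
COMMON_FLAG1 = 18
COMMON_FLAG2 = 19


def flag2_counts(rows):
    """Count Common word flag 2 values among rows absent from the IPA dictionary."""
    return Counter(
        row[COMMON_FLAG2] for row in rows if row[COMMON_FLAG1] == "0"
    )
```

The `rows` argument can be any iterable of field lists, for example a `csv.reader` over the extracted CSV file opened with a Shift-JIS compatible encoding.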

3. License

Last updated: 2018/06/04

You may use this database in compliance with the terms and conditions of the license described below. The license specifies the license terms regarding the use of this database and the requirements you must follow in using this database.

Creative Commons License

This database is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.
If you use data from this database, please be sure to attribute this database as follows: "MeCab user dictionary for science technology term © National Bioscience Database Center licensed under CC Attribution-Share Alike 4.0 International".

A summary of the Creative Commons Attribution-Share Alike 4.0 International license is available at https://creativecommons.org/licenses/by-sa/4.0/.

With regard to this database, you are licensed to:

  1. freely access part or whole of this database, and acquire data;
  2. freely redistribute part or whole of the data from this database; and
  3. freely create and distribute database and other adapted materials based on part or whole of the data from this database,

under the license, as long as you comply with the following conditions:

  1. You must attribute this database in the manner specified by the author or licensor when distributing part or whole of this database or any adapted material.
  2. You must distribute any adapted material based on part or whole of the data from this database under CC Attribution-Share Alike 4.0 (or later), or under a CC Attribution-Share Alike Compatible License (see the list of compatible licenses published by Creative Commons).
  3. You need to contact the Licensor shown below to request a license for use of this database or any part thereof not licensed under the license.

National Bioscience Database Center
E-Mail: support[at]biosciencedbc[dot]jp

About Providing Links to This Database

You may freely link to any content in this database. Note, however, that the content may change without notice.


4. Update History

Date: Update contents
2019/05/17: Archive version V2 released.
2018/06/04: English archive site of "MeCab user dictionary for science technology term" opened (Archive V1).

5. Literature

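Yuka Tateisi, Tomoe Nobusada, Toshihisa Takagi
"JST科学技術用語シソーラスに基づくMeCab用専門用語辞書" (A technical term dictionary for MeCab based on the JST thesaurus of science and technology terms)
The 23rd Annual Meeting of the Association for Natural Language Processing, P7-1 (Proceedings, pp. 485-488), March 2017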


6. Contact address

If you have any questions about "MeCab user dictionary for science technology term", please contact:

National Bioscience Database Center
E-Mail: support[at]biosciencedbc[dot]jp
