We have built a user dictionary for the morphological analysis engine MeCab from the headwords and synonyms of the JST Thesaurus (2015 edition).
Data name | README |
---|---|
Description of data contents | HTML file describing the "MeCab user dictionary for science technology term" data. |
File | README_e.html (English) |
Data name | MeCab user dictionary: JST Thesaurus Headwords and Synonyms |
---|---|
Description of data contents | A user dictionary for the morphological analysis engine MeCab (http://taku910.github.io/mecab/), built from the headwords and synonyms of the JST Thesaurus (2015 edition). Because the original thesaurus gives no reading for synonyms (Headword Flag: 'V'), NBDC assigned natural readings to synonyms in life science (Category code: 'LSxx', where xx is a two-digit number) and computer science (Category code: 'EG01'), and the reading of the base form to synonyms in other categories. |
File | Thesaurus2015.dic.zip (MeCab dic format) (7.4 MB); mecab_thesaurus.zip (csv format) (3.8 MB) |
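
As a rough illustration of how a compiled user dictionary such as this one might be loaded, the sketch below builds MeCab's `-u` (user dictionary) option and passes it to a tagger. It assumes the `mecab-python3` package and an IPA system dictionary are installed, and that the zip archive unpacks to a file named `Thesaurus2015.dic`; the path and the helper names `user_dic_option` and `make_tagger` are hypothetical.

```python
def user_dic_option(dic_paths):
    """Build MeCab's -u option string; several dictionaries are comma-separated."""
    if not dic_paths:
        raise ValueError("at least one dictionary path is required")
    return "-u " + ",".join(dic_paths)


def make_tagger(dic_paths):
    """Create a MeCab tagger with the given user dictionaries loaded.

    Assumes the mecab-python3 package (module name: MeCab) is installed.
    """
    import MeCab  # imported lazily so the helper above works without MeCab
    return MeCab.Tagger(user_dic_option(dic_paths))


# Hypothetical location of the unzipped dictionary file.
option = user_dic_option(["./Thesaurus2015.dic"])
print(option)
```

Calling `make_tagger(["./Thesaurus2015.dic"]).parse(text)` would then analyze `text` with the additional entries. Paths containing spaces are not handled by this sketch.
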
Data item | Description |
---|---|
Surface form | The word itself |
Left-context ID | MeCab internal ID for left context (see http://taku910.github.io/mecab/dic.html) |
Right-context ID | MeCab internal ID for right context (see http://taku910.github.io/mecab/dic.html) |
Cost | The cost for the likelihood of the word to appear in a sentence (smaller, more likely) |
POS | Part of speech |
POS subcategory 1 | POS subcategory 1 |
POS subcategory 2 | POS subcategory 2 |
POS subcategory 3 | POS subcategory 3 |
Conjugation type | Conjugation type |
Conjugation form | Conjugation form |
Base form | Corresponding headword in the JST thesaurus |
Reading('Furigana') | Reading of the headword. When Headword Flag is 'V', it may differ from the reading of the surface form. |
Pronunciation | Automatically generated from Reading |
Source dictionary | It is fixed as 'Thesaurus2015'. |
ID in Source dictionary | ID in JST Thesaurus |
J-GLOBAL ID | ID in J-GLOBAL |
Headword Flag | ・C: The word's surface form is the same as the headword in JST Thesaurus (or corresponding hankaku form) ・V: Otherwise |
Category code | Category code of science fields in JST Thesaurus |
Common word flag 1 | ・1: There is an entry (or entries) for the surface form in the IPA dictionary ・0: There are no entries for the surface form in the IPA dictionary |
Common word flag 2 | Based on "IPA dictionary analysis results": ・When Common word flag 1 is 1, this flag holds the part of speech from the IPA dictionary analysis result. ・When Common word flag 1 is 0: - UNKNOWN_1: the result is a single unknown word - UNKNOWN_2: the result is multiple tokens including an unknown word - MULTI_WORD: the result is multiple tokens, all in the IPA dictionary |
IPA dictionary analysis results | Results of morphological analysis with the original IPA dictionary (and with a dictionary in which the zenkaku alphanumeric characters and symbols in IPA dictionary entries are converted into the corresponding hankaku characters). If the result is divided into multiple tokens, they are whitespace-separated. The results are not manually corrected. |
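
The data items above can be read into named fields as in the following minimal sketch. It assumes that the csv columns appear in exactly the order of the table (21 columns); the field names in `FIELDS` and the sample row are made up for illustration and are not taken from the real data.

```python
import csv
import io

# Field names mirror the "Data item" table; the column order is an assumption.
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
    "source_dict", "source_id", "jglobal_id", "headword_flag",
    "category_code", "common_flag1", "common_flag2", "ipa_analysis",
]


def parse_rows(text):
    """Yield one dict per csv row, keyed by the names in FIELDS."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(row)}")
        entry = dict(zip(FIELDS, row))
        # Context IDs and cost are integers in MeCab dictionaries.
        for key in ("left_id", "right_id", "cost"):
            entry[key] = int(entry[key])
        yield entry


# A made-up row for illustration only.
SAMPLE = (
    "細胞,1285,1285,5000,名詞,一般,*,*,*,*,細胞,サイボウ,サイボー,"
    "Thesaurus2015,X000,J000,C,LS01,1,名詞,細胞\n"
)
entries = list(parse_rows(SAMPLE))
print(entries[0]["surface"], entries[0]["headword_flag"])
```
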
Data name | MeCab user dictionary: J-GLOBAL MeSH |
---|---|
Description of data contents | A user dictionary for the morphological analysis engine MeCab (http://taku910.github.io/mecab/), built from J-GLOBAL science and technology terms that are linked to MeSH (Medical Subject Headings: https://www.nlm.nih.gov/mesh/) terms of the United States National Library of Medicine. The dictionary items are based on the IPA dictionary. The csv file is encoded in Shift-JIS and the dic file in UTF-8. |
File | JSTMeSH.dic.zip (MeCab dic format) (1.2 MB); mecab_jstmesh.zip (csv format) (484 KB) |
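
Because the csv file is Shift-JIS while the dic file is UTF-8, it may be convenient to re-encode the csv before processing it with UTF-8 tools. The sketch below shows one way to do that with the Python standard library; the function name `reencode` and the file names are hypothetical, and the demonstration runs on a temporary file with made-up content.

```python
import os
import tempfile


def reencode(src_path, dst_path, src_encoding="shift_jis", dst_encoding="utf-8"):
    """Copy a text file line by line, changing its character encoding."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)


# Demonstration on a temporary file with made-up content.
workdir = tempfile.mkdtemp()
sjis_path = os.path.join(workdir, "sample_sjis.csv")
utf8_path = os.path.join(workdir, "sample_utf8.csv")
with open(sjis_path, "w", encoding="shift_jis", newline="") as f:
    f.write("細胞,名詞\r\n")

reencode(sjis_path, utf8_path)
with open(utf8_path, "rb") as f:
    converted = f.read()
```

If the real file contains characters outside strict Shift-JIS, passing `src_encoding="cp932"` may be needed; that is a guess about the data, not something the description states.
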
Data item | Description |
---|---|
Surface form | The word itself |
Left-context ID | MeCab internal ID for left context (see http://taku910.github.io/mecab/dic.html) |
Right-context ID | MeCab internal ID for right context (see http://taku910.github.io/mecab/dic.html) |
Cost | The cost for the likelihood of the word to appear in a sentence (smaller, more likely) |
POS | Part of speech |
POS subcategory 1 | POS subcategory 1 |
POS subcategory 2 | POS subcategory 2 |
POS subcategory 3 | POS subcategory 3 |
Conjugation type | Conjugation type |
Conjugation form | Conjugation form |
Base form | Same as the surface form |
Reading('Furigana') | (empty) |
Pronunciation | (empty) |
Source dictionary | It is fixed as 'MeSH'. |
ID in Source dictionary | MeSH UID |
J-GLOBAL ID | ID in J-GLOBAL |
Headword Flag | It is fixed as 'C'. |
Category code | Category code of science fields in JST Thesaurus |
Common word flag 1 | ・1: There is an entry (or entries) for the surface form in the IPA dictionary ・0: There are no entries for the surface form in the IPA dictionary |
Common word flag 2 | Based on "IPA dictionary analysis results": ・When Common word flag 1 is 1, this flag holds the part of speech from the IPA dictionary analysis result. ・When Common word flag 1 is 0: - UNKNOWN_1: the result is a single unknown word - UNKNOWN_2: the result is multiple tokens including an unknown word - MULTI_WORD: the result is multiple tokens, all in the IPA dictionary |
IPA dictionary analysis results | Results of morphological analysis with the original IPA dictionary (and with a dictionary in which the zenkaku alphanumeric characters and symbols in IPA dictionary entries are converted into the corresponding hankaku characters). If the result is divided into multiple tokens, they are whitespace-separated. The results are not manually corrected. |
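
The rule for Common word flag 2 described in the table can be sketched as a small function. This is an illustration of the stated rule, not the procedure NBDC actually used: the `Token` type, the `is_unknown` attribute, and the choice to take the first token's part of speech when flag 1 is 1 are assumptions.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One token of an IPA dictionary analysis result (hypothetical shape)."""
    surface: str
    pos: str
    is_unknown: bool


def common_word_flag2(flag1: int, tokens: List[Token]) -> str:
    """Derive Common word flag 2 from flag 1 and the analysis tokens."""
    if not tokens:
        raise ValueError("analysis result must contain at least one token")
    if flag1 == 1:
        # Assumption: the entry is analyzed as one token whose POS is reported.
        return tokens[0].pos
    has_unknown = any(t.is_unknown for t in tokens)
    if len(tokens) == 1:
        if has_unknown:
            return "UNKNOWN_1"
        # The table does not define this case, so the sketch rejects it.
        raise ValueError("a single known token contradicts flag 1 == 0")
    return "UNKNOWN_2" if has_unknown else "MULTI_WORD"


print(common_word_flag2(0, [Token("a", "名詞", False), Token("b", "名詞", False)]))
```
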
Data name | MeCab user dictionary: Nikkaji (Japan Chemical Substance Dictionary) |
---|---|
Description of data contents | A user dictionary for the morphological analysis engine MeCab (http://taku910.github.io/mecab/), built from J-GLOBAL science and technology terms that are linked to the Japan Chemical Substance Dictionary (Nikkaji), an organic compound dictionary database prepared by the Japan Science and Technology Agency. The dictionary items are based on the IPA dictionary. The csv file is encoded in Shift-JIS and the dic file in UTF-8. |
File | Nikkaji.dic.zip (MeCab dic format) (6.6 MB); mecab_nikkaji.zip (csv format) (2.4 MB) |
Data item | Description |
---|---|
Surface form | The word itself |
Left-context ID | MeCab internal ID for left context (see http://taku910.github.io/mecab/dic.html) |
Right-context ID | MeCab internal ID for right context (see http://taku910.github.io/mecab/dic.html) |
Cost | The cost for the likelihood of the word to appear in a sentence (smaller, more likely) |
POS | Part of speech |
POS subcategory 1 | POS subcategory 1 |
POS subcategory 2 | POS subcategory 2 |
POS subcategory 3 | POS subcategory 3 |
Conjugation type | Conjugation type |
Conjugation form | Conjugation form |
Base form | Same as the surface form |
Reading('Furigana') | (empty) |
Pronunciation | (empty) |
Source dictionary | It is fixed as 'Nikkaji'. |
ID in Source dictionary | MeSH UID |
J-GLOBAL ID | ID in J-GLOBAL |
Headword Flag | It is fixed as 'C'. |
Category code | Category code of science fields in JST Thesaurus |
Common word flag 1 | ・1: There is an entry (or entries) for the surface form in the IPA dictionary ・0: There are no entries for the surface form in the IPA dictionary |
Common word flag 2 | Based on "IPA dictionary analysis results": ・When Common word flag 1 is 1, this flag holds the part of speech from the IPA dictionary analysis result. ・When Common word flag 1 is 0: - UNKNOWN_1: the result is a single unknown word - UNKNOWN_2: the result is multiple tokens including an unknown word - MULTI_WORD: the result is multiple tokens, all in the IPA dictionary |
IPA dictionary analysis results | Results of morphological analysis with the original IPA dictionary (and with a dictionary in which the zenkaku alphanumeric characters and symbols in IPA dictionary entries are converted into the corresponding hankaku characters). If the result is divided into multiple tokens, they are whitespace-separated. The results are not manually corrected. |
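
One possible use of the two common word flags is to see how the terms that are absent from the IPA dictionary break down by analysis outcome. The sketch below tallies Common word flag 2 over rows whose Common word flag 1 is 0. The column positions assume that the csv columns follow the table order, and the helper `make_row` and the sample rows are made up for illustration.

```python
from collections import Counter

FLAG1_COL = 18  # "Common word flag 1" (assumed zero-based column position)
FLAG2_COL = 19  # "Common word flag 2" (assumed zero-based column position)


def tally_non_ipa(rows):
    """Count Common word flag 2 values among rows not found in the IPA dictionary."""
    counts = Counter()
    for row in rows:
        if row[FLAG1_COL] == "0":
            counts[row[FLAG2_COL]] += 1
    return counts


def make_row(surface, flag1, flag2):
    """Build a made-up 21-column row with only the fields used here filled in."""
    row = ["*"] * 21
    row[0] = surface
    row[FLAG1_COL] = flag1
    row[FLAG2_COL] = flag2
    return row


rows = [
    make_row("term-a", "0", "UNKNOWN_1"),
    make_row("term-b", "1", "名詞"),
    make_row("term-c", "0", "MULTI_WORD"),
    make_row("term-d", "0", "UNKNOWN_1"),
]
counts = tally_non_ipa(rows)
print(counts)
```

With the real data, `rows` could be a `csv.reader` over the unzipped csv file opened with the encoding stated in the description.
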
You may use this database in compliance with the terms and conditions of the license described below. The license specifies the license terms regarding the use of this database and the requirements you must follow in using this database.
This database is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.
If you use data from this database, please be sure to attribute this database as follows: "MeCab user dictionary for science technology term © National Bioscience Database Center licensed under CC Attribution-Share Alike 4.0 International".
A summary of the Creative Commons Attribution-Share Alike 4.0 International license is found here.
With regard to this database, you are licensed to:
under the license, as long as you comply with the following conditions:
NBDC
https://form2.jst.go.jp/s/contact_nbdc
You may freely link to any content in this database. Note, however, that content may change without notice.
Date | Update contents |
---|---|
2021/12/07 | Database names were replaced with the following: |
2019/05/17 | Archive version V2 released. |
2018/06/04 | The English archive site for "MeCab user dictionary for science technology term" was opened. (Archive V1) |