About This Database

MeCab user dictionary: JST Thesaurus Headwords and Synonyms

Data description
Data name
MeCab user dictionary: JST Thesaurus Headwords and Synonyms
DOI
10.18908/lsdba.nbdc02358-001.V002
Description of data contents
We have built a user dictionary for the morphological analysis engine MeCab (<a href="http://taku910.github.io/mecab/" target="_blank">http://taku910.github.io/mecab/</a>) from the headwords and synonyms of the JST Thesaurus (2015 edition). Because the original thesaurus gives no readings for synonyms (Headword Flag: 'V'), NBDC assigned natural readings to synonyms in life science (Category code: 'LSxx', where xx is a two-digit number) and computer science (Category code: 'EG01'), and the reading of the base form to synonyms in other categories. The dictionary items are based on the IPA dictionary. The CSV file is encoded in Shift-JIS and the dic file is encoded in UTF-8. Entries in which zenkaku (full-width) letters, numerals and symbols are converted into the corresponding hankaku (half-width) characters are also included. Please note that this dictionary cannot be used as a thesaurus, because it does not include information on relations between words.
Data file
File name :
Thesaurus2015.dic.zip (MeCab dic format)
File URL :
File size :
7.4 MB
Simple search URL
http://togodb.biosciencedbc.jp/togodb/view/mecab_thesaurus#en
Data acquisition method

IPA dictionary (mecab-ipadic-2.7.0-20070801 downloaded from MeCab's site [see above]), JST Science and Technology Thesaurus (2015 edition)

Data analysis method

-

Number of data entries

127,214 entries
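
The description above says that the dictionary also includes entries whose zenkaku letters, numerals and symbols were converted into hankaku characters. The following Python snippet is a minimal sketch of such a conversion, not the procedure NBDC actually used. It assumes that only the full-width ASCII variants (U+FF01 to U+FF5E) are converted, and the function names `to_hankaku` and `surface_variants` are hypothetical.

```python
# Assumption: only the full-width ASCII variants U+FF01..U+FF5E are converted.
# Each of them sits exactly 0xFEE0 code points above its ASCII counterpart.
ZEN_TO_HAN = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}


def to_hankaku(text: str) -> str:
    """Replace zenkaku letters, numerals and symbols with hankaku ones."""
    return text.translate(ZEN_TO_HAN)


def surface_variants(surface: str) -> list[str]:
    """Return the surface form plus its hankaku variant, if it differs."""
    converted = to_hankaku(surface)
    if converted == surface:
        return [surface]
    return [surface, converted]


if __name__ == "__main__":
    print(surface_variants("ＤＮＡ"))
    print(surface_variants("遺伝子"))
```

Kanji and kana lie outside the mapped range, so they pass through unchanged and produce no additional variant.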

Data detail
Data item Description
Surface form

The word itself

Left-context ID

MeCab internal ID for left context (see http://taku910.github.io/mecab/dic.html)

Right-context ID

MeCab internal ID for right context (see http://taku910.github.io/mecab/dic.html)

Cost

Cost reflecting how likely the word is to appear in a sentence (the smaller the cost, the more likely the word)

POS

Part of speech

POS subcategory 1

POS subcategory 1

POS subcategory 2

POS subcategory 2

POS subcategory 3

POS subcategory 3

Conjugation type

Conjugation type

Conjugation form

Conjugation form

Base form

Corresponding headword in the JST thesaurus

Reading ('Furigana')

Reading of the headword. When Headword Flag is 'V', it may differ from the reading of the surface form.

Pronunciation

Automatically generated from Reading

Source dictionary

Always 'Thesaurus2015'.

ID in Source dictionary

ID in JST Thesaurus

J-GLOBAL ID

ID in J-GLOBAL

Headword Flag

・C: The word's surface form is the same as the headword in JST Thesaurus (or its corresponding hankaku form)
・V: Otherwise

Category code

Category code of science fields in JST Thesaurus

Common word flag 1

・1: There is at least one entry for the surface form in the IPA dictionary
・0: There is no entry for the surface form in the IPA dictionary

Common word flag 2

Based on "IPA dictionary analysis results":
・When the value of Common word flag 1 is 1, the value of this flag is the part of speech given in the IPA dictionary analysis result.
・When the value of Common word flag 1 is 0:
- UNKNOWN_1: the result is a single unknown word
- UNKNOWN_2: the result is multiple tokens, including an unknown word
- MULTI_WORD: the result is multiple tokens, all found in the IPA dictionary

IPA dictionary analysis results

Results of morphological analysis with the original IPA dictionary (and with the dictionary of IPA dictionary entries in which zenkaku alphanumeric characters and symbols are converted into the corresponding hankaku characters). If the result is divided into multiple tokens, the tokens are separated by whitespace. The results have not been manually corrected.
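
The data items listed above can be read from the Shift-JIS CSV file with Python's standard `csv` module. The snippet below is a rough sketch under two assumptions: that the CSV columns appear in the same order as the items in this table, and that each row has exactly these 21 columns. The short field names and the sample row are invented for illustration and are not taken from the actual dictionary.

```python
import csv
import io

# Hypothetical short names, assumed to follow the order of the table above.
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
    "source_dict", "source_id", "jglobal_id",
    "headword_flag", "category_code",
    "common_flag1", "common_flag2", "ipa_analysis",
]


def parse_rows(raw: bytes, encoding: str = "shift_jis"):
    """Decode CSV bytes and yield one dict per dictionary entry."""
    reader = csv.reader(io.StringIO(raw.decode(encoding), newline=""))
    for row in reader:
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(row)}")
        record = dict(zip(FIELDS, row))
        for key in ("left_id", "right_id", "cost"):
            record[key] = int(record[key])
        # Multi-token analysis results are whitespace-separated.
        record["ipa_tokens"] = record["ipa_analysis"].split()
        yield record


# Made-up row for illustration only; IDs and values are placeholders.
SAMPLE = (
    "遺伝子,1285,1285,5000,名詞,一般,*,*,*,*,遺伝子,イデンシ,イデンシ,"
    "Thesaurus2015,SOURCE_ID,JGLOBAL_ID,C,LS01,0,MULTI_WORD,遺伝 子\n"
).encode("shift_jis")

records = list(parse_rows(SAMPLE))

if __name__ == "__main__":
    first = records[0]
    print(first["surface"], first["cost"], first["ipa_tokens"])
```

To read the real file, pass its bytes to `parse_rows` in place of `SAMPLE`. If the column count differs from the assumption, the function raises `ValueError` rather than silently misaligning fields.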