# Automatic-Annotation
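## 0. Data
Four data files are used. For each of the two prediction targets (celltype and antigen), there is a training set (data up to January 2017) and an independent test sample (February and March 2017).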

1. label_feature/c1_human_celltype.txt   // training data for celltype
2. label_feature/c1_human_antigen.txt    // training data for antigen
3. celltype_curated_201702.tsv           // independent sample for celltype
4. antigen_curated_201702.tsv            // independent sample for antigen

## 1. run process_unclassified.sh

* Create k-character (character n-gram) vectors from the original data

```
$ sh process_unclassified.sh label_feature celltype 2 4 10
$ sh process_unclassified.sh label_feature antigen 2 4 10
```
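
To illustrate what a character n-gram vector for the range `2 4` contains, here is a minimal Python sketch. It is not taken from `process_unclassified.sh`; the function name `char_ngrams` is hypothetical, and details such as case folding or whitespace handling in the real script are unknown and therefore left out.

```python
def char_ngrams(text, n_min=2, n_max=4):
    """Return every character n-gram of `text` with n_min <= n <= n_max.

    Assumption: n-grams are taken over the raw string, shortest first.
    """
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the string.
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


if __name__ == "__main__":
    # "HeLa" with the range 2..4 yields three 2-grams, two 3-grams, one 4-gram.
    print(char_ngrams("HeLa", 2, 4))
```

A string shorter than the minimum length simply produces no n-grams.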

### Usage

```
$ sh process_unclassified.sh label_feature celltype (or antigen) 2 (ngram,from) 4 (ngram,to) 10 (filter size)
```

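* `label_feature`: directory containing the original data
* `celltype` (or `antigen`): whether to build the celltype data or the antigen data
* `ngram, from`: minimum n-gram length in characters
* `ngram, to`: maximum n-gram length in characters (`2 4` gives 2- to 4-grams, `3 5` gives 3- to 5-grams)
* `filter size`: frequency threshold for classes. Classes that occur fewer times than the threshold are relabeled as `unclassified`; with `10`, classes with fewer than 10 occurrences become `unclassified`, and only the more frequent classes are kept as classification targets.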
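
The effect of the filter size can be sketched as follows. This is an illustration only, not the logic of `process_unclassified.sh`: the function name `relabel_rare` is hypothetical, and treating a class with exactly `min_count` samples as kept (`>=`) is an assumption, since the exact boundary is defined by the script.

```python
from collections import Counter


def relabel_rare(labels, min_count=10, placeholder="unclassified"):
    """Replace labels of classes with fewer than `min_count` samples.

    Assumption: a class with exactly `min_count` samples is kept.
    """
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else placeholder for lab in labels]


if __name__ == "__main__":
    labels = ["HeLa"] * 3 + ["K-562"]
    # With a threshold of 2, the single K-562 sample becomes "unclassified".
    print(relabel_rare(labels, min_count=2))
```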

### Input files
* label_feature/c1_human_celltype.txt   // original celltype training data
* celltype_curated_201702.tsv // original celltype test data

### Output files

* Written to label_feature/. The `10` in the `c10` prefix changes with the filter size argument.
  * c10_unclassified_training_celltype.txt //training data when filter_n = 10
  * c10_unclassified_curated_celltype.txt //test data when filter_n = 10
  * c10_unclassified_training_curated_celltype.txt //combination of training data and test data when filter_n = 10
  * c10_unclassified_training_curated_celltype_l2.txt //label 2(the second column in the original data file) of the combination of training and test data
  * c10_unclassified_training_curated_celltype_l1.txt //label 1(the first column in the original data file) of the combination of training and test data
  * c10_unclassified_training_curated_celltype_feature.txt //feature(from the third column in the original data file) of the combination of training and test data
  * c10_unclassified_training_curated_celltype_intl2.txt //convert label 2 into integer type
  * c10_unclassified_training_curated_celltype_intl1.txt //convert label 1 into integer type
  * c10_unclassified_training_curated_celltype_intl2_feature.txt // combine integer type label 2 and feature for combination of training and test data
  * c10_unclassified_training_celltype_intl2_feature.txt // divide above file into training data and test data, this is training data
  * c10_unclassified_curated_celltype_intl2_feature.txt // divide above file into training data and test data, this is test data
  * c10_unclassified_training_celltype_intl2.txt // integer type label 2 for training data
  * c10_unclassified_curated_celltype_intl2.txt // integer type label 2 for test data
  * c10_unclassified_training_celltype_feature.txt // feature for training data
  * c10_unclassified_curated_celltype_feature.txt // feature for test data

* Written to word_data/
  * c10_unclassified_training_celltype_2-4gramfeature.txt // ngram feature for training data
  * c10_unclassified_curated_celltype_2-4gramfeature.txt // ngram feature for test data
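
The `intl1` and `intl2` files hold labels converted to integers. The sketch below shows one way such a conversion can work; it is an assumption, not the script's actual code. In particular, the name `encode_labels` is hypothetical, and numbering classes from 1 in order of first appearance is a guess. Encoding the combined training and test labels in a single pass, as the combined files above suggest, keeps the same integer for the same class in both sets.

```python
def encode_labels(labels):
    """Map each distinct label to an integer ID.

    Assumption: IDs start at 1 and follow the order of first appearance.
    """
    mapping = {}
    encoded = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping) + 1
        encoded.append(mapping[lab])
    return encoded, mapping


if __name__ == "__main__":
    training = ["HeLa", "K-562", "HeLa"]
    curated = ["K-562", "unclassified"]
    # Encode both sets together, then split back by length.
    encoded, mapping = encode_labels(training + curated)
    print(encoded[:len(training)], encoded[len(training):], mapping)
```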

## 2. run trans_svm.cpp to translate features into libsvm format

Convert the data into a format that libsvm can process.

```
$ g++ -o trans trans_svm.cpp
$ ./trans celltype 10 training
$ ./trans celltype 10 curated
$ ./trans antigen 10 training
$ ./trans antigen 10 curated
```

### Usage

```
$ ./trans celltype (or antigen) 10 (filter size) training (or curated)
```

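* celltype (or antigen): whether to process celltype or antigen
* filter size: threshold used when the classes were generated
* training (or curated): whether to process the training data or the independent sample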

### Input data

* In word_data/
  * c10_unclassified_training_celltype_2-4gramfeature.txt  // ngram feature for training data
  * c10_unclassified_curated_celltype_2-4gramfeature.txt  // ngram feature for test data

### Output data

* In word_data/
  * c10_unclassified_training_celltype_2-4gramfeature_svm.txt  // ngram feature for training data in svm template
  * c10_unclassified_curated_celltype_2-4gramfeature_svm.txt  // ngram feature for test data in svm template
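
libsvm reads sparse features as space-separated `index:value` pairs with indices in ascending order. The sketch below shows how a list of n-grams could be turned into such a feature string. It is not a port of `trans_svm.cpp`: the function names are hypothetical, and both the vocabulary numbering and the use of occurrence counts as values are assumptions.

```python
from collections import Counter


def build_vocab(docs):
    """Assign a 1-based index to every distinct n-gram (assumed numbering)."""
    vocab = {}
    for grams in docs:
        for g in grams:
            if g not in vocab:
                vocab[g] = len(vocab) + 1
    return vocab


def to_libsvm_features(grams, vocab):
    """Return 'index:count' pairs sorted by index; unknown n-grams are skipped."""
    counts = Counter(vocab[g] for g in grams if g in vocab)
    return " ".join(f"{idx}:{cnt}" for idx, cnt in sorted(counts.items()))


if __name__ == "__main__":
    docs = [["He", "eL"], ["eL", "eL", "La"]]
    vocab = build_vocab(docs)
    for grams in docs:
        print(to_libsvm_features(grams, vocab))
```

Building the vocabulary from the training data and reusing it for the test data keeps feature indices consistent between the two files.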


## 3. run process2.sh to combine labels with features into input files for libsvm
(File names change according to the arguments.)

```
$ sh process2.sh word_data celltype 2-4
$ sh process2.sh word_data antigen 2-4
```

### Usage
```
$ sh process2.sh word_data celltype (or antigen) ngram-range
```
* word_data: data directory
* celltype (or antigen): whether to process celltype or antigen
* ngram-range: range of the n-grams that were built, as specified for process_unclassified.sh. `2-4` means 2-grams to 4-grams; `3-5` means 3-grams to 5-grams.

### Input files
* Uses the following files in label_feature/
  * c10_unclassified_training_celltype_intl2.txt  // integer type label 2 for training data
  * c10_unclassified_curated_celltype_intl2.txt  // integer type label 2 for test data
* Uses the following files in word_data/
  * c10_unclassified_training_celltype_2-4gramfeature_svm.txt  // ngram feature for training data in svm template
  * c10_unclassified_curated_celltype_2-4gramfeature_svm.txt  // ngram feature for test data in svm template

### Output files
* In word_data/
  * c10_unclassified_training_celltype_intl2f2-4gram.txt  // combination of int type label 2 and ngram feature in svm template for training data
  * c10_unclassified_curated_celltype_intl2f2-4gram.txt  // combination of int type label 2 and ngram feature in svm template for test data
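
A complete libsvm input line is the integer label followed by the sparse features. Combining the two files therefore amounts to joining them line by line, as in the sketch below. The function name `combine` is hypothetical and `process2.sh` may do this differently, for example with `paste`.

```python
def combine(labels, feature_lines):
    """Prefix each feature line with its integer label.

    Raises ValueError if the two inputs do not have the same length,
    because a mismatch would silently pair labels with the wrong samples.
    """
    if len(labels) != len(feature_lines):
        raise ValueError("labels and features must have the same number of lines")
    return [f"{lab} {feat}".rstrip() for lab, feat in zip(labels, feature_lines)]


if __name__ == "__main__":
    labels = [3, 1]
    features = ["1:1 4:2", "2:1"]
    for line in combine(labels, features):
        print(line)
```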

## 4. run SVM

1. Hyperparameter tuning (please skip this step)

   `$ sh task_qsub.sh` runs 10-fold cross-validation; the result can be found under the job name.

### Input files
* word_data/c10_unclassified_training_celltype_intl2f2-4gram.txt  // combination of int type label 2 and ngram feature in svm template for training data

### Output files
Standard output file written by the job scheduler (Univa or similar)
* l2-t0-unclassified-training-c${i}-celltype-v10.o***** // 10-fold cross-validation result
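
The contents of `task_qsub.sh` are not shown here, so the following is only a sketch of what a cost search with libsvm's built-in cross-validation could look like. `svm-train -v 10` runs 10-fold cross-validation and reports the accuracy instead of writing a model. The candidate cost values and the function name `cv_command` are hypothetical.

```python
def cv_command(train_file, cost, folds=10):
    """Build an svm-train command line for n-fold cross-validation.

    -t 0 selects the linear kernel, -c sets the cost, -v sets the fold count.
    """
    return ["./svm-train", "-q", "-t", "0", "-c", str(cost), "-v", str(folds), train_file]


if __name__ == "__main__":
    train_file = "word_data/c10_unclassified_training_celltype_intl2f2-4gram.txt"
    # Hypothetical grid; the values actually searched are not documented here.
    for cost in [0.01, 0.1, 1, 10]:
        print(" ".join(cv_command(train_file, cost)))
```

Each printed command could be run directly or submitted as one scheduler job per cost value.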

2. Model generation
```
$ ./svm-train -q -t 0 -c 0.1 -b 1 word_data/c10_unclassified_training_celltype_intl2f2-4gram.txt word_data/c10_cost1_unclassified_training_celltype_intl2f2-4gram.txt.model
```

* The parameter values are those obtained from the tuning in step 1.

### Input files
* word_data/c10_unclassified_training_celltype_intl2f2-4gram.txt  // combination of int type label 2 and ngram feature in svm template for training data

### Output files
* word_data/c10_cost1_unclassified_training_celltype_intl2f2-4gram.txt.model  // model produced by svm from training data

3. Prediction
```
$ ./svm-predict -b 1 word_data/c10_unclassified_curated_celltype_intl2f2-4gram.txt word_data/c10_cost1_unclassified_training_celltype_intl2f2-4gram.txt.model word_data/output_c10_cost1_unclassified_curated_celltype_intl2f2-4gram.txt  # apply the model to predict test data
```

### Input files
* word_data/c10_unclassified_curated_celltype_intl2f2-4gram.txt  // combination of int type label 2 and ngram feature in svm template for test data
* word_data/c10_cost1_unclassified_training_celltype_intl2f2-4gram.txt.model  // model produced by svm from training data

### Output files
* word_data/output_c10_cost1_unclassified_curated_celltype_intl2f2-4gram.txt  // result of test data by applying the model
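
With `-b 1`, `svm-predict` writes a header line that starts with `labels` and lists the class labels, followed by one line per sample holding the predicted label and the probability of each class in header order. A minimal parser for that layout might look like this; the function name `parse_predictions` and the sample values are made up for illustration.

```python
def parse_predictions(lines):
    """Parse svm-predict output produced with -b 1.

    Returns a list of (predicted_label, {class_label: probability}) tuples.
    """
    header = lines[0].split()
    if not header or header[0] != "labels":
        raise ValueError("expected a 'labels ...' header line")
    class_labels = [int(x) for x in header[1:]]
    rows = []
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        parts = line.split()
        predicted = int(float(parts[0]))
        probs = dict(zip(class_labels, (float(p) for p in parts[1:])))
        rows.append((predicted, probs))
    return rows


if __name__ == "__main__":
    sample = ["labels 1 2 3", "2 0.1 0.7 0.2", "1 0.8 0.1 0.1"]
    for predicted, probs in parse_predictions(sample):
        print(predicted, probs)
```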

## 4. Output statistics as needed

The steps below do not need to be run; use them as needed.

## 4.1: run acc_eachclass_one.R to output "*_labelcheck", "*_falseindex", "*_trueindex" and "*_accuracy" files

* Run the script up to line 87

### Input data
* word_data/output_c10_cost1_unclassified_curated_celltype_intl2f2-4gram.txt  // result of test data by applying the model
* label_feature/c10_unclassified_curated_celltype_intl2.txt  // integer type label 2 for test data (true label of test data)

### Output data
* Written to word_data/
  * c10_cost1_unclassified_curated_celltype_2-4gram_labelcheck  // 1st column is the true label, 2nd column is the predicted label
  * c10_cost1_celltype_curated_l2_2-4gram_falseindex  // indexes of false samples
  * c10_cost1_celltype_curated_l2_2-4gram_trueindex  // indexes of true samples
  * c10_cost1_celltype_curated_l2_2-4gram_accuracy  // 1st line is the micro-average accuracy over all classes, the following lines are the accuracy for each class ordered from 1 to 133 (in the case of c10_celltype)
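
The quantities written by this step can be sketched in Python as follows. This is a stand-in for illustration, not a translation of `acc_eachclass_one.R`: the function name `accuracy_report` is hypothetical, and 1-based sample indexes are an assumption based on R's indexing.

```python
from collections import defaultdict


def accuracy_report(true_labels, predicted_labels):
    """Return overall accuracy, per-class accuracy, and true/false sample indexes.

    Assumption: sample indexes are 1-based.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    true_idx, false_idx = [], []
    for i, (t, p) in enumerate(zip(true_labels, predicted_labels), start=1):
        total[t] += 1
        if t == p:
            correct[t] += 1
            true_idx.append(i)
        else:
            false_idx.append(i)
    overall = len(true_idx) / len(true_labels)
    # Per-class accuracy: fraction of each true class that was predicted correctly.
    per_class = {c: correct[c] / total[c] for c in sorted(total)}
    return overall, per_class, true_idx, false_idx


if __name__ == "__main__":
    print(accuracy_report([1, 1, 2, 3], [1, 2, 2, 1]))
```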

## 4.2: run extract-lines.py to extract false samples and true samples according to false indexes and true indexes

```
$ python extract-lines.py true   # for true samples
$ python extract-lines.py false  # for false samples
```

### Input data
* In word_data/
  * c10_cost1_celltype_curated_l2_2-4gram_trueindex  // indexes of true samples
  * c10_cost1_celltype_curated_l2_2-4gram_falseindex  // indexes of false samples
* In label_feature/
  * c10_unclassified_curated_celltype_intl2_feature.txt  // combination of integer type label 2 and feature for test data

### Output data
* In word_data/
  * c10_unclassified_curated_celltype_intl2_truesamples  // correctly predicted samples, 1st column is the original integer type label 2, 2nd column is the original feature
  * c10_unclassified_curated_celltype_intl2_falsesamples  // incorrectly predicted samples, 1st column is the original integer type label 2, 2nd column is the original feature
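
Selecting samples by index is a simple lookup, sketched below. The function name `extract_lines` is hypothetical and the real `extract-lines.py` may differ; the indexes are assumed to be 1-based line numbers, which matches index files produced by R.

```python
def extract_lines(lines, indexes):
    """Return the lines at the given 1-based positions, in index order."""
    return [lines[i - 1] for i in indexes]


if __name__ == "__main__":
    samples = ["3 HeLa cells", "1 K-562", "2 HepG2"]
    true_indexes = [1, 3]
    print(extract_lines(samples, true_indexes))
```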

## 4.3. run performance.R to calculate tp, tn, fp, fn, precision, recall, and f1-score for each class

### Input data
* word_data/c10_cost1_unclassified_curated_celltype_2-4gram_labelcheck  // 1st column is the true label, 2nd column is the predicted label

### Output data
* word_data/c10_cost1_unclassified_curated_celltype_2-4gram_performance.txt  // prediction performance on the test data for each class; 1st column is the label index, followed in order by tp, tn, fp, fn, precision, recall, and f1-score for each label.
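
For a multi-class problem, these per-class figures are usually computed one-vs-rest: for class c, a sample counts as positive if its label is c and negative otherwise. The sketch below follows that convention. It is not a port of `performance.R`; the function name `per_class_metrics` is hypothetical, and reporting 0.0 when a denominator is zero is an assumption.

```python
def per_class_metrics(true_labels, predicted_labels):
    """Compute one-vs-rest tp, tn, fp, fn, precision, recall and F1 per class."""
    pairs = list(zip(true_labels, predicted_labels))
    results = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        tn = len(pairs) - tp - fp - fn
        # Assumption: undefined ratios are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[c] = {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
                      "precision": precision, "recall": recall, "f1": f1}
    return results


if __name__ == "__main__":
    for label, m in per_class_metrics([1, 1, 2, 3], [1, 2, 2, 1]).items():
        print(label, m)
```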
