STEAM PLACE

Engineering and Management

A list of word2vec options

Natural Language Processing

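Rather late to the party, I got started with word2vec by working through the article "米googleの研究者が開発したWord2Vecで自然言語処理(独自データ)" (natural language processing on custom data with Word2Vec, developed by Google researchers).
As I learned more, I could not find a page that summarized the options of the training command, so I put one together.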

That said, as a newcomer I do not understand most of them. And even where I understand what an option means, I still do not know how it should be used in practice.

word2vec options

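Option	Description	Notes
-train	File used for training	The text must already be split into words (tokenized)
-output	File the training result is written to
-size	Number of vector dimensions
-window	Maximum distance between the current word and a context word
-sample	Frequency threshold; words that appear more often are randomly down-sampled
-hs	Whether to use hierarchical softmax for training
-negative	Number of negative examples used in negative sampling
-threads	Number of threads used for training
-iter	Number of training iterations
-min-count	Discard words that appear fewer than n times
-alpha	Starting learning rate
-classes	Output word classes instead of word vectors	I do not yet understand when this would be used
-debug	Debug mode
-binary	Whether to write the output in binary format
-save-vocab	Save the vocabulary to a file
-read-vocab	Read the vocabulary from a file instead of building it from the training data
-cbow	Which model to train: CBOW (1) or Skip-gram (0)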
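
To make the table a little more concrete, here is a minimal sketch of how a training command could be assembled from these options. The binary path ./word2vec and the file names corpus.txt and vectors.txt are assumptions for illustration, and the values are just one plausible combination, not a recommendation. The flags themselves are the ones listed in the table. The helper only builds the argument list; nothing is executed unless the commented-out line is enabled.

```python
import subprocess


def build_word2vec_command(train_file, output_file, binary_path="./word2vec", **options):
    """Build an argument list for the word2vec training command.

    binary_path and the file names are assumptions for illustration.
    Keyword names use underscores (min_count) and become flags (-min-count).
    """
    command = [binary_path, "-train", train_file, "-output", output_file]
    for name, value in options.items():
        flag = "-" + name.replace("_", "-")
        command.extend([flag, str(value)])
    return command


command = build_word2vec_command(
    "corpus.txt",   # hypothetical input, already split into words
    "vectors.txt",  # hypothetical output file
    size=200, window=5, negative=5, hs=0, sample=1e-4,
    min_count=5, iter=5, cbow=1, threads=4, binary=0,
)
print(" ".join(command))

# To actually train (requires a compiled word2vec binary at binary_path):
# subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when a file name contains spaces.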

help

Options:
Parameters for training:
    -train <file>
        Use text data from <file> to train the model
    -output <file>
        Use <file> to save the resulting word vectors / word clusters
    -size <int>
        Set size of word vectors; default is 100
    -window <int>
        Set max skip length between words; default is 5
    -sample <float>
        Set threshold for occurrence of words. Those that appear with higher frequency in the training data
        will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)
    -hs <int>
        Use Hierarchical Softmax; default is 0 (not used)
    -negative <int>
        Number of negative examples; default is 5, common values are 3 - 10 (0 = not used)
    -threads <int>
        Use <int> threads (default 12)
    -iter <int>
        Run more training iterations (default 5)
    -min-count <int>
        This will discard words that appear less than <int> times; default is 5
    -alpha <float>
        Set the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW
    -classes <int>
        Output word classes rather than word vectors; default number of classes is 0 (vectors are written)
    -debug <int>
        Set the debug mode (default = 2 = more info during training)
    -binary <int>
        Save the resulting vectors in binary moded; default is 0 (off)
    -save-vocab <file>
        The vocabulary will be saved to <file>
    -read-vocab <file>
        The vocabulary will be read from <file>, not constructed from the training data
    -cbow <int>
        Use the continuous bag of words model; default is 1 (use 0 for skip-gram model)
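
The -sample description is the one I found hardest to picture. As far as I can tell from the original C implementation (word2vec.c), each occurrence of a word is kept with probability (sqrt(f/t) + 1) * t/f, capped at 1, where f is the word's share of all training words and t is the -sample value. Below is a minimal Python sketch of that formula; the function name and the example frequencies are made up for illustration.

```python
import math


def keep_probability(word_frequency, sample=1e-3):
    """Probability that one occurrence of a word is kept during training.

    word_frequency: the word's count divided by the total number of training
                    words (must be greater than 0).
    sample: the -sample threshold; 0 disables down-sampling.
    """
    if sample <= 0:
        return 1.0
    ratio = word_frequency / sample
    # (sqrt(f/t) + 1) * t/f, written in terms of ratio = f/t
    probability = (math.sqrt(ratio) + 1) / ratio
    return min(1.0, probability)


for frequency in (0.05, 0.01, 0.001, 0.0001):
    print(f"frequency={frequency}: keep {keep_probability(frequency):.3f}")
```

Words whose frequency is around the threshold or lower are always kept, while very frequent words (particles, "the", and so on) are dropped more often the more frequent they are. Lowering -sample therefore makes the down-sampling more aggressive.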

In closing

I hope this article is of some help to other newcomers like me.
If you know more about the parts I could not figure out, I would appreciate it if you let me know.

References