Notes on using fastText

Date: 2017-11-27 16:48:32

http://blog.csdn.net/m0_37306360/article/details/72832606

These are my notes on training word vectors with fastText.

GitHub repository: https://github.com/facebookresearch/fastText

Clone and build it locally:

$ git clone https://github.com/facebookresearch/fastText.git

$ cd fastText

$ make

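make fails with an error (the screenshot from the original post is not available).

The cause is a GCC version that is too old. Check the installed version with:

gcc -v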
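
The version check above can also be scripted. The sketch below is only an illustration: version_ge is a helper name invented here, it assumes a sort that supports -V (for example GNU coreutils), and the minimum of 4.8 is a placeholder rather than a value taken from the fastText documentation.

```shell
# version_ge A B: succeeds when version A >= version B.
# Assumes a sort that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="4.8"  # placeholder minimum (assumption); use the value from the fastText README

if command -v gcc >/dev/null 2>&1; then
  current="$(gcc -dumpversion)"
  if version_ge "$current" "$required"; then
    echo "gcc $current is new enough"
  else
    echo "gcc $current is older than $required; upgrade before running make"
  fi
else
  echo "gcc not found on PATH"
fi
```

Version sort is used because a plain string comparison would rank 4.10 below 4.8.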

To upgrade GCC (following http://www.linuxidc.com/Linux/2016-11/136840.htm):

1. Add the package source

First add the PPA to the repository list:

sudo add-apt-repository ppa:ubuntu-toolchain-r/test

sudo apt-get update

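2. Install a newer gcc/g++ (note that both gcc and g++ must be updated)

Now install the version you want, for example gcc-4.9 or gcc-5. (Note: at the time of writing, the gcc-5 package is actually 5.3.0; 5.1 and 5.2 are not available.)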

sudo apt-get install gcc-4.8 g++-4.8

sudo apt-get install gcc-4.9 g++-4.9

sudo apt-get install gcc-5 g++-5

sudo apt-get install gcc-6 g++-6

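Choose whichever version you prefer.

3. Refresh the file database and locate the compilers

This step is optional: refresh the database so that commands such as locate can find the newly installed binaries.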

sudo updatedb && sudo ldconfig

locate gcc | grep -E "/usr/bin/gcc-[0-9]"

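4. Switch versions

Register each version with update-alternatives. The trailing number is the priority; in automatic mode the version with the highest priority is selected:

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.6 20

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.8 30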
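
Step 2 says that both gcc and g++ need updating, but the commands above register only gcc. A small dry-run helper can print the matching pair of commands for review before anything is changed. This is a sketch: alt_cmds is a name made up here, and it assumes the g++ binaries follow the same /usr/bin/g++-VERSION naming as gcc.

```shell
# alt_cmds VERSION PRIORITY
# Prints (does not run) the update-alternatives commands for gcc and g++.
# Assumption: binaries are installed as /usr/bin/gcc-VERSION and /usr/bin/g++-VERSION.
alt_cmds() {
  ver="$1"
  prio="$2"
  for tool in gcc g++; do
    echo "sudo update-alternatives --install /usr/bin/${tool} ${tool} /usr/bin/${tool}-${ver} ${prio}"
  done
}

alt_cmds 4.8 30
```

After checking the printed commands, they can be applied with alt_cmds 4.8 30 | sh, and the active version can be chosen interactively with sudo update-alternatives --config gcc.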

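Then run make in the fastText directory again; this time it builds the fasttext executable successfully.

Now it is ready to use.

fastText can be used to train word representations and text classifiers; these notes cover training word embeddings.

1. First, open the word-vector-example.sh script: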

RESULTDIR=result   # directory where results are saved
DATADIR=data       # directory holding the input data

mkdir -p "${RESULTDIR}"
mkdir -p "${DATADIR}"

# Download and preprocess the corpus if fil9 does not exist yet
if [ ! -f "${DATADIR}/fil9" ]
then
  wget -c http://mattmahoney.net/dc/enwik9.zip -P "${DATADIR}"
  unzip "${DATADIR}/enwik9.zip" -d "${DATADIR}"
  perl wikifil.pl "${DATADIR}/enwik9" > "${DATADIR}"/fil9
fi

# Download the rare-words evaluation set if rw.txt does not exist yet
if [ ! -f "${DATADIR}/rw/rw.txt" ]
then
  wget -c https://nlp.stanford.edu/~lmthang/morphoNLM/rw.zip -P "${DATADIR}"
  unzip "${DATADIR}/rw.zip" -d "${DATADIR}"
fi

make

./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

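The fasttext skipgram command above trains the word embeddings: the input is fil9 under DATADIR, and the model is saved under RESULTDIR with the prefix fil9 (fil9.bin and fil9.vec).

These arguments are mandatory:

-input: training file path; -output: output file path

These dictionary arguments are optional:

-minCount 5: words that occur fewer than 5 times are discarded; -minn: minimum length of character n-grams; -maxn: maximum length of character n-grams; -t: sampling threshold

These training arguments are optional:

-lr: learning rate; -epoch: number of epochs; -neg: number of negative samples; -loss: loss function {ns, hs, softmax}; -dim: word-vector dimension; -ws: context window size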
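
To make -minn and -maxn more concrete, the sketch below lists the character n-grams of a single word. It assumes, as in the fastText subword model, that the word is wrapped in < and > before n-grams are taken. char_ngrams is a name invented for this illustration, not part of the fastText CLI, and the helper handles plain ASCII words only.

```shell
# char_ngrams WORD MINN MAXN
# Prints every character n-gram of "<WORD>" with MINN <= n <= MAXN, one per line.
# Sketch only: it slices with cut -c, so it is meant for plain ASCII words.
char_ngrams() (
  w="<$1>"
  len=${#w}
  n=$2
  while [ "$n" -le "$3" ]; do
    i=1
    # Slide a window of n characters across the wrapped word
    while [ $((i + n - 1)) -le "$len" ]; do
      printf '%s\n' "$w" | cut -c "${i}-$((i + n - 1))"
      i=$((i + 1))
    done
    n=$((n + 1))
  done
)

char_ngrams where 3 3
```

For the word "where" with n = 3 this prints <wh, whe, her, ere and re>, one per line. With -minn 3 -maxn 6, as in the script, the same enumeration runs for every n from 3 to 6.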

cut -f 1,2 "${DATADIR}"/rw/rw.txt | awk '{print tolower($0)}' | tr '\t' '\n' > "${DATADIR}"/queries.txt

cat "${DATADIR}"/queries.txt | ./fasttext print-word-vectors "${RESULTDIR}"/fil9.bin > "${RESULTDIR}"/vectors.txt

python eval.py -m "${RESULTDIR}"/vectors.txt -d "${DATADIR}"/rw/rw.txt
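
Word vectors are usually compared by cosine similarity, so a quick check of vectors.txt might look like the sketch below. It assumes each line of the file is a word followed by its vector components separated by spaces. cosine is a helper written for this note, not a fastText command, and the vectors in the demo are toy values rather than real model output.

```shell
# cosine FILE WORD1 WORD2
# FILE holds one "word v1 v2 ... vd" entry per line (assumed vectors.txt layout).
cosine() {
  awk -v a="$2" -v b="$3" '
    $1 == a { for (i = 2; i <= NF; i++) x[i] = $i; seen_a = 1; n = NF }
    $1 == b { for (i = 2; i <= NF; i++) y[i] = $i; seen_b = 1 }
    END {
      if (!seen_a || !seen_b) { print "word not found" > "/dev/stderr"; exit 1 }
      for (i = 2; i <= n; i++) {
        dot += x[i] * y[i]; nx += x[i] * x[i]; ny += y[i] * y[i]
      }
      printf "%.4f\n", dot / (sqrt(nx) * sqrt(ny))
    }' "$1"
}

# Demo on a toy file (made-up 2-dimensional vectors)
tmp="$(mktemp)"
printf 'king 1 0\nqueen 2 0\napple 0 3\n' > "$tmp"
cosine "$tmp" king queen   # parallel vectors: prints 1.0000
cosine "$tmp" king apple   # orthogonal vectors: prints 0.0000
rm -f "$tmp"
```

On real output the call would be something like cosine "${RESULTDIR}"/vectors.txt WORD1 WORD2, where both words must appear in queries.txt.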

2. Train on your own corpus. Here I use an English Wikipedia corpus; the preprocessing was covered earlier.

./fasttext cbow -input new_enwiki -output new_enwiki_100_30 -epoch 30 -neg 5 -loss ns -dim 100 -ws 5


Original post: http://www.cnblogs.com/DjangoBlog/p/7904420.html
