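Mahout mainly provides implementations of three families of algorithms: collaborative filtering, clustering, and classification. Here we use Mahout to run the classic K-means clustering algorithm.
First, download Hadoop and Mahout. Many of Mahout's implementations run on Hadoop, so Hadoop has to be installed first.
How do you install it? Briefly: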
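1. Set up SSH first.
sudo apt-get install openssh-server    # install the SSH server
sudo service ssh start    # start the SSH service
ufw disable    # disable the firewall
ssh-keygen -t rsa    # generate an SSH key pair; this also creates ~/.ssh if it does not exist
cd ~/.ssh/    # enter the .ssh directory
cp id_rsa.pub authorized_keys    # copy the public key so that localhost accepts it
ssh localhost    # test that the connection works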
2. Extract Hadoop
tar -zxvf hadoop-1.1.2.tar.gz    # extract the tar.gz archive
3. Add environment variables
export JAVA_HOME=/usr/local/jdk7    # point JAVA_HOME at the JDK
export PATH=.:$JAVA_HOME/bin:$PATH    # put the JDK tools on the PATH
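4. For a single-node setup, at least four configuration files have to be modified (in Hadoop 1.x these are typically conf/hadoop-env.sh, core-site.xml, hdfs-site.xml and mapred-site.xml).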
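5. Other commands
hadoop namenode -format    # format the Hadoop NameNode; DataNodes do not need formatting
start-all.sh    # start all Hadoop services
stop-all.sh    # stop all Hadoop services
start-dfs.sh    # start HDFS only
stop-dfs.sh    # stop HDFS only
start-mapred.sh    # start the two MapReduce services (JobTracker and TaskTracker)
hadoop-daemon.sh start [daemon name]    # start a single daemon
jps    # list the running Java processes
ps -e | grep ssh    # check whether the SSH service is running
ifconfig -a | grep inet    # show the network addresses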
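6. Installing Mahout is similar
Extract the archive, set the environment variables, then run the mahout command. If it prints the list of available algorithms, the installation succeeded.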
The Reuters-21578 text corpus can be downloaded from
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
You can also prepare your own dataset, which is what I do here.
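I collected information on 1,000 songs: artist, album, type, district, years, rhythm, mood and lyrics.
I stored this information in MongoDB because it will be needed again later, although that step is optional. Java code then reads it back and writes one txt file per song. Along the way the tag values are given different weights and the lyrics are word-segmented.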
// Body of the export method; it returns a status map with a "flag" entry.
// Needs java.util.HashMap, java.util.List and java.util.Map.
Map<String, Object> outmap = new HashMap<String, Object>();
outmap.put("flag", false);
List<Song> list = songRepository.findAll();
// Check for null before touching the list.
if (list == null) {
    return outmap;
}
int size = list.size();
String[] strs = new String[size];
// Loop over every song.
for (int i = 0; i < size; i++) {
    Song song = list.get(i);
    // Weighted tags: a tag repeated n times gets n times the term frequency.
    StringBuilder sb = new StringBuilder();
    for (int j = 0; j < 8; j++) {
        sb.append(song.getArtist()).append(" ");
    }
    for (int j = 0; j < 2; j++) {
        sb.append(song.getAlbum()).append(" ");
    }
    for (int j = 0; j < 5; j++) {
        sb.append(song.getType()).append(" ");
    }
    for (int j = 0; j < 3; j++) {
        sb.append(song.getDistrict()).append(" ");
    }
    for (int j = 0; j < 6; j++) {
        sb.append(song.getYears()).append(" ");
    }
    for (int j = 0; j < 3; j++) {
        sb.append(song.getRhythm()).append(" ");
    }
    for (int j = 0; j < 4; j++) {
        sb.append(song.getMood()).append(" ");
    }
    // Unweighted lyrics, segmented into space-separated words.
    String strLrc = song.getLrc();
    strLrc = SplitWord.splitWordBySpace(strLrc);
    sb.append(strLrc);
    strs[i] = sb.toString();
}
// Write one txt file per song.
WriteLines.writeStrBecomeTxts("C:\\Users\\xin\\Desktop\\大论文\\Scala", "utf-8", strs);
outmap.put("flag", true);
return outmap;
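The weighting in the code above works purely by repetition: a tag written 8 times ends up with 8 times the term frequency of a tag written once. Here is a minimal sketch of that idea as a reusable helper. The class and method names (WeightedDoc, appendWeighted) and the sample tag values are made up for illustration and are not part of my project.

```java
final class WeightedDoc {
    /** Appends token to sb `weight` times, each copy followed by one space. */
    static void appendWeighted(StringBuilder sb, String token, int weight) {
        for (int i = 0; i < weight; i++) {
            sb.append(token).append(' ');
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        appendWeighted(sb, "artistA", 8); // hypothetical tag values
        appendWeighted(sb, "albumB", 2);
        sb.append("already segmented lyrics");
        System.out.println(sb);
    }
}
```

Keeping the weights in one place like this makes it easier to retune them later if the clusters turn out to be dominated by a single tag.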
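This produces one txt file per song (1.txt, 58.txt, and so on).
Next, pack these files into a single file in SequenceFile format, which Hadoop can parse: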
mahout seqdirectory -i file:/usr/song-input -o file:/usr/song-output -c UTF-8 -chunk 64 -xm sequential
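The file: prefix means the path is looked up on the local file system rather than on HDFS, and -xm sequential means the job runs locally.
-chunk 64 limits each output chunk to 64 MB, which matches the default HDFS block size in Hadoop 1.x.
The next step converts the SequenceFile into vectors. Put the file generated in the previous step onto HDFS and run: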
hadoop fs -mkdir input
hadoop fs -put /usr/song-output/chunk-0 input
mahout seq2sparse -i input -o output -ow --weight tfidf --maxDFPercent 95 --namedVector -a org.apache.lucene.analysis.core.WhitespaceAnalyzer
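-i    input directory
-o    output directory
-ow    overwrite the output directory if it already exists
--weight    weighting scheme (tf or tfidf)
--maxDFPercent    drop terms that occur in more than 95% of the documents
--namedVector    keep each document's name with its vector
-a    analyzer class; the lyrics were already segmented with IK, so splitting on whitespace is enough here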
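To make --weight tfidf and --maxDFPercent 95 more concrete, here is a small sketch of the idea in plain Java. It uses the textbook formula tf * ln(N / df); Mahout's internal formula differs in detail, so treat the numbers as illustrative only. The class and method names are my own and do not come from Mahout.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TfIdfSketch {
    /** Document frequency: in how many documents each term occurs. */
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // count each term once per document
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Textbook TF-IDF weights for one document of the corpus.
     * Terms found in more than maxDfPercent of all documents are dropped.
     */
    static Map<String, Double> tfidf(List<String> doc, Map<String, Integer> df,
                                     int numDocs, int maxDfPercent) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> weights = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int d = df.getOrDefault(e.getKey(), 1);
            if (100.0 * d / numDocs > maxDfPercent) {
                continue; // too common to be informative
            }
            weights.put(e.getKey(), e.getValue() * Math.log((double) numDocs / d));
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("a", "b", "b"), List.of("a", "c"), List.of("a", "b"));
        Map<String, Integer> df = documentFrequency(docs);
        System.out.println(tfidf(docs.get(0), df, docs.size(), 95));
    }
}
```

In the toy corpus the term "a" occurs in every document, so it is pruned, which is exactly what the 95% cut-off is meant to do with words that appear almost everywhere.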
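The complete parameter list (shown as a screenshot in the original post) can be printed with mahout seq2sparse --help.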
Generated directories:
root@xin:~# hadoop fs -ls output
Warning: $HADOOP_HOME is deprecated.
Found 7 items
drwxr-xr-x - root supergroup 0 2015-03-30 14:11 /user/root/output/df-count
-rw-r--r-- 1 root supergroup 48768 2015-03-30 14:10 /user/root/output/dictionary.file-0
-rw-r--r-- 1 root supergroup 51433 2015-03-30 14:11 /user/root/output/frequency.file-0
drwxr-xr-x - root supergroup 0 2015-03-30 14:11 /user/root/output/tf-vectors
drwxr-xr-x - root supergroup 0 2015-03-30 14:12 /user/root/output/tfidf-vectors
drwxr-xr-x - root supergroup 0 2015-03-30 14:09 /user/root/output/tokenized-documents
drwxr-xr-x - root supergroup 0 2015-03-30 14:10 /user/root/output/wordcount
· dictionary.file-0: maps term text -> term id (int). Converting terms to ids is common practice.
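· frequency.file-0: term id -> document frequency (df), stored in chunk files that are derived from df-count and loaded by the TF-IDF step.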
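· wordcount (directory): term text -> collection frequency (cf). This is presumably the information before the various filters are applied.
· df-count (directory): term id -> document frequency (df).
· tf-vectors and tfidf-vectors (both directories): the document vectors, one document per record, in the form {term id: feature value}, where the feature value is tf or tfidf. Because they are stored in Mahout's built-in VectorWritable type, use the command "mahout vectordump -i <path>" to view them.
· tokenized-documents: the documents after tokenization.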
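The relationship between dictionary.file-0 and tf-vectors is easier to see in a toy version. The sketch below assigns ids in first-seen order, which is an assumption made for readability and not necessarily how Mahout numbers its terms. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DictionarySketch {
    /** Maps every distinct term to an int id, in first-seen order. */
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                dict.putIfAbsent(term, dict.size());
            }
        }
        return dict;
    }

    /** Turns one tokenized document into a sparse vector {term id: term frequency}. */
    static Map<Integer, Integer> toTfVector(List<String> doc, Map<String, Integer> dict) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String term : doc) {
            Integer id = dict.get(term);
            if (id != null) { // terms missing from the dictionary are skipped
                vector.merge(id, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(List.of("x", "y"), List.of("y", "z"));
        Map<String, Integer> dict = buildDictionary(docs);
        System.out.println(dict);
        System.out.println(toTfVector(List.of("y", "y", "z"), dict));
    }
}
```

This is also why clusterdump needs the dictionary file later on: the vectors only carry ids, and the dictionary is what turns them back into readable terms.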
Now we can run K-means!
mahout kmeans -i output/tfidf-vectors -c output/kmeans-clusters -o output/kmeans -k 10 -x 200 -ow --clustering
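The parameters are:
-i: the input, i.e. the tfidf vectors produced above.
-o: the output directory; the result of every iteration is written here.
-k: the number of clusters.
-c: the directory of initial cluster centers. If -k is not given, the points already in this directory are used as the initial centers; if -k is given, k input points are sampled at random and written here as the initial centers.
-dm: the distance measure; cosine distance (for example org.apache.mahout.common.distance.CosineDistanceMeasure) is recommended for text. It is not set in the command above, so Mahout's default measure applies.
-x: the maximum number of iterations.
--clustering: after the iterations finish, assign every input vector to a cluster and write the result to clusteredPoints. Whether the job runs in MapReduce or locally is chosen with -xm, not with this flag.
--convergenceDelta: the convergence threshold; the default is 0.5, which is rather large for cosine distance.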
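To show how these parameters fit together, here is a compact K-means sketch with cosine distance on dense arrays. It is not Mahout's implementation: the initial centers are passed in (the role of -c), maxIter plays the role of -x, convergenceDelta the role of --convergenceDelta, and the final assignment pass corresponds to --clustering. All names are my own.

```java
final class KMeansSketch {
    /** Cosine distance: 1 - cos(angle between a and b). */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) {
            return 1.0; // treat a zero vector as maximally distant
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Index of the center closest to point p. */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (cosineDistance(p, centers[c]) < cosineDistance(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs at most maxIter iterations, updating centers in place; returns each point's cluster index. */
    static int[] cluster(double[][] points, double[][] centers, int maxIter, double convergenceDelta) {
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            for (int i = 0; i < points.length; i++) {
                assign[i] = nearest(points[i], centers);
            }
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) {
                double[] mean = new double[centers[c].length];
                int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assign[i] != c) {
                        continue;
                    }
                    n++;
                    for (int d = 0; d < mean.length; d++) {
                        mean[d] += points[i][d];
                    }
                }
                if (n == 0) {
                    continue; // an empty cluster keeps its old center
                }
                for (int d = 0; d < mean.length; d++) {
                    mean[d] /= n;
                }
                if (cosineDistance(mean, centers[c]) > convergenceDelta) {
                    converged = false; // this center still moved too far
                }
                centers[c] = mean;
            }
            if (converged) {
                break;
            }
        }
        // Final pass against the final centers, like the clusteredPoints output.
        for (int i = 0; i < points.length; i++) {
            assign[i] = nearest(points[i], centers);
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 0}, {0.9, 0.1}, {0, 1}, {0.1, 0.9}};
        double[][] centers = {{1, 0}, {0, 1}};
        System.out.println(java.util.Arrays.toString(cluster(points, centers, 10, 0.001)));
    }
}
```

Cosine distances lie between 0 and 1 for non-negative tf-idf vectors, so a threshold of 0.5 would stop after almost any update; that is why a much smaller delta is used in the example.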
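In the output, clusters-N holds the cluster centers after iteration N (k of them, 10 in this run), and the last iteration's directory carries the -final suffix.
clusteredPoints stores the mapping cluster id -> document id.
It is easiest to copy the resulting kmeans directory out of HDFS to look at it.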
hadoop fs -get output/kmeans/* /usr/song-kmeans/
Warning: $HADOOP_HOME is deprecated.
hadoop fs -get output/dictionary.file-0 /usr/song-kmeans
Warning: $HADOOP_HOME is deprecated.
mahout clusterdump -i file:///usr/song-kmeans/clusters-5-final -d file:///usr/song-kmeans/dictionary.file-0 -dt sequencefile -o /usr/song-result/result -n 20
mahout seqdumper -i file:///usr/song-kmeans/clusteredPoints -o /usr/song-result/all
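clusteredPoints is itself just a SequenceFile.
Looking at the contents of the result file, there are clearly too many useless terms. The word segmentation is poor, and these terms need to be filtered out.
In a cluster entry, the leading number (26 in my output) is the cluster id, and n=7 is the number of documents in the cluster. c is the cluster's center vector, in the form term:weight (the coordinates of the center); r is the cluster's radius vector, in the form term:radius.
The Top Terms listed below it are the characteristic terms selected from the cluster.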
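For intuition about the c and r vectors, the sketch below computes a center and a per-dimension radius for a handful of points. It assumes that the radius is the standard deviation of the cluster's points in each dimension, which is my understanding of what Mahout reports; the class and method names are hypothetical.

```java
final class ClusterStats {
    /** Center of a non-empty set of points: the mean of each dimension. */
    static double[] centroid(double[][] points) {
        double[] mean = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < mean.length; d++) {
                mean[d] += p[d];
            }
        }
        for (int d = 0; d < mean.length; d++) {
            mean[d] /= points.length;
        }
        return mean;
    }

    /** Per-dimension radius, taken here as the population standard deviation. */
    static double[] radius(double[][] points) {
        double[] mean = centroid(points);
        double[] r = new double[mean.length];
        for (double[] p : points) {
            for (int d = 0; d < r.length; d++) {
                double diff = p[d] - mean[d];
                r[d] += diff * diff;
            }
        }
        for (int d = 0; d < r.length; d++) {
            r[d] = Math.sqrt(r[d] / points.length);
        }
        return r;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 2}, {3, 2}};
        System.out.println(java.util.Arrays.toString(centroid(points)));
        System.out.println(java.util.Arrays.toString(radius(points)));
    }
}
```

A term with a large weight in c and a small value in r is one that most documents of the cluster share, which makes it a good candidate for the Top Terms list.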
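In the all file, the key is the cluster id, as explained for clusterdump above.
The value is the clustering result for one document: wt is the probability that the document belongs to the cluster, which is always 1.0 for K-means; /1.txt is the document name, which is present because --namedVector (-nv) was passed to seq2sparse earlier; what follows is the point itself, as pairs of term id and weight.
One cluster holds too many documents and the documents are not spread evenly across the clusters, so the clustering quality is not very good. The document quality still needs to be improved.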
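"Not spread evenly" can be put into numbers once the cluster id of every document has been read from the all file. The sketch below counts documents per cluster and reports the share held by the largest cluster; with k clusters, a perfectly even split would give 1/k. Parsing the file is left out, and the class name and sample ids are made up.

```java
import java.util.Map;
import java.util.TreeMap;

final class ClusterBalance {
    /** Number of documents per cluster id. */
    static Map<Integer, Integer> sizes(int[] clusterIds) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int id : clusterIds) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    /** Fraction of all documents that sit in the largest cluster. */
    static double largestShare(int[] clusterIds) {
        if (clusterIds.length == 0) {
            return 0.0;
        }
        int max = 0;
        for (int n : sizes(clusterIds).values()) {
            max = Math.max(max, n);
        }
        return (double) max / clusterIds.length;
    }

    public static void main(String[] args) {
        int[] ids = {26, 26, 26, 3, 3, 7}; // hypothetical cluster id per document
        System.out.println(sizes(ids));
        System.out.println(largestShare(ids));
    }
}
```

Tracking this share across runs gives a quick check on whether changes to the weights or the stop-word filtering actually make the clusters more balanced.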
The whole session:
root@xin:~# vi /etc/profile
root@xin:~# start-all.sh
Warning: $HADOOP_HOME is deprecated.
starting namenode, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-namenode-xin.out
xin: starting datanode, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-datanode-xin.out
xin: starting secondarynamenode, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-secondarynamenode-xin.out
starting jobtracker, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-jobtracker-xin.out
xin: starting tasktracker, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-tasktracker-xin.out
root@xin:~# jps
3149 NameNode
3541 SecondaryNameNode
3782 TaskTracker
3937 Jps
3632 JobTracker
3382 DataNode
=============================
root@xin:~# mahout seqdirectory -i file:/usr/song-input/ -o file:/usr/song-output/ -c UTF-8 -chunk 64 -xm sequential
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.
15/03/30 13:57:11 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[file:/usr/song-input/], --keyPrefix=[], --method=[sequential], --output=[file:/usr/song-output/], --startPhase=[0], --tempDir=[temp]}
15/03/30 13:57:11 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/30 13:57:11 INFO driver.MahoutDriver: Program took 411 ms (Minutes: 0.00685)
====================================================
root@xin:~# hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.
Found 3 items
drwxr-xr-x - root supergroup 0 2015-03-29 19:49 /user/root/input
drwxr-xr-x - root supergroup 0 2015-03-29 22:31 /user/root/look
drwxr-xr-x - root supergroup 0 2015-03-29 20:05 /user/root/output
root@xin:~# hadoop fs -rmr input
Warning: $HADOOP_HOME is deprecated.
Deleted hdfs://xin:9000/user/root/input
root@xin:~# hadoop fs -rmr look
Warning: $HADOOP_HOME is deprecated.
Deleted hdfs://xin:9000/user/root/look
root@xin:~# hadoop fs -rmr output
Warning: $HADOOP_HOME is deprecated.
Deleted hdfs://xin:9000/user/root/output
root@xin:~# hadoop fs -mkdir input
Warning: $HADOOP_HOME is deprecated.
root@xin:~# hadoop fs -put /usr/song-output/chunk-0 input
Warning: $HADOOP_HOME is deprecated.
==========================
root@xin:~# hadoop fs -text input/chunk-0
/58.txt 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴情歌 蔡琴情歌 经典 经典 经典 经典 经典 台湾 台湾 台湾 70s 70s 70s 70s 70s 70s 慢板 慢板 慢板 祝福 祝福 祝福 祝福 读你 千遍 也 不 厌倦 读你 感觉 像 三月 浪漫 季节 醉人 诗篇 唔 读你 千遍 也 不 厌倦 读你 感觉 象 春天 喜悦 经典 美丽 句点 唔 眉目之间 锁 着 爱怜 唇齿 之间 留着 誓言 一切 移动 左右 视线 是 诗篇 读你 千遍 也 不 厌倦 读你 千遍 也 不 厌倦 读你 感觉 像 三月 浪漫 季节 醉人 诗篇 唔 读你 千遍 也 不 厌倦 读你 感觉 象 春天 喜悦 经典 美丽 句点 唔 眉目之间 锁 着 爱怜 唇齿 之间 留着 誓言 一切 移动 左右 视线 是 诗篇 读你 千遍 也 不 厌倦 眉目之间 锁 着 爱怜 唇齿 之间 留着 誓言 一切 移动 左右 视线 是 诗篇 读你 千遍 也 不 厌倦 读你 千遍 也 不 厌倦 读你 千遍 也 不 厌倦 读你
================================
root@xin:~# mahout seq2sparse -i input -o output -ow --weight tfidf --maxDFPercent 95 --namedVector -a org.apache.lucene.analysis.core.WhitespaceAnalyzer
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.
15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1 15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0 15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1 15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in input 15/03/30 14:09:51 INFO input.FileInputFormat: Total input paths to process : 1 15/03/30 14:09:52 INFO mapred.JobClient: Running job: job_201503301351_0001 15/03/30 14:09:53 INFO mapred.JobClient: map 0% reduce 0% 15/03/30 14:10:00 INFO mapred.JobClient: map 100% reduce 0% 15/03/30 14:10:00 INFO mapred.JobClient: Job complete: job_201503301351_0001 15/03/30 14:10:00 INFO mapred.JobClient: Counters: 19 15/03/30 14:10:00 INFO mapred.JobClient: Job Counters 15/03/30 14:10:00 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4936 15/03/30 14:10:00 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 15/03/30 14:10:00 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 15/03/30 14:10:00 INFO mapred.JobClient: Launched map tasks=1 15/03/30 14:10:00 INFO mapred.JobClient: Data-local map tasks=1 15/03/30 14:10:00 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 15/03/30 14:10:00 INFO mapred.JobClient: File Output Format Counters 15/03/30 14:10:00 INFO mapred.JobClient: Bytes Written=131137 15/03/30 14:10:00 INFO mapred.JobClient: FileSystemCounters 15/03/30 14:10:00 INFO mapred.JobClient: HDFS_BYTES_READ=131227 15/03/30 14:10:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=53968 15/03/30 14:10:00 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=131137 15/03/30 14:10:00 INFO mapred.JobClient: File Input Format Counters 15/03/30 14:10:00 INFO mapred.JobClient: Bytes Read=131123 15/03/30 14:10:00 INFO mapred.JobClient: Map-Reduce Framework 15/03/30 14:10:00 INFO mapred.JobClient: Map input records=149 15/03/30 14:10:00 INFO mapred.JobClient: Physical memory 
(bytes) snapshot=89587712 15/03/30 14:10:00 INFO mapred.JobClient: Spilled Records=0 15/03/30 14:10:00 INFO mapred.JobClient: CPU time spent (ms)=590 15/03/30 14:10:00 INFO mapred.JobClient: Total committed heap usage (bytes)=120061952 15/03/30 14:10:00 INFO mapred.JobClient: Virtual memory (bytes) snapshot=675536896 15/03/30 14:10:00 INFO mapred.JobClient: Map output records=149 15/03/30 14:10:00 INFO mapred.JobClient: SPLIT_RAW_BYTES=104 15/03/30 14:10:00 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors 15/03/30 14:10:00 INFO vectorizer.DictionaryVectorizer: Creating dictionary from output/tokenized-documents and saving at output/wordcount 15/03/30 14:10:00 INFO input.FileInputFormat: Total input paths to process : 1 15/03/30 14:10:00 INFO mapred.JobClient: Running job: job_201503301351_0002 15/03/30 14:10:01 INFO mapred.JobClient: map 0% reduce 0% 15/03/30 14:10:06 INFO mapred.JobClient: map 100% reduce 0% 15/03/30 14:10:13 INFO mapred.JobClient: map 100% reduce 33% 15/03/30 14:10:15 INFO mapred.JobClient: map 100% reduce 100% 15/03/30 14:10:15 INFO mapred.JobClient: Job complete: job_201503301351_0002 15/03/30 14:10:15 INFO mapred.JobClient: Counters: 29 15/03/30 14:10:15 INFO mapred.JobClient: Job Counters 15/03/30 14:10:15 INFO mapred.JobClient: Launched reduce tasks=1 15/03/30 14:10:15 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4037 15/03/30 14:10:15 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 15/03/30 14:10:15 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 15/03/30 14:10:15 INFO mapred.JobClient: Launched map tasks=1 15/03/30 14:10:15 INFO mapred.JobClient: Data-local map tasks=1 15/03/30 14:10:15 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8693 15/03/30 14:10:15 INFO mapred.JobClient: File Output Format Counters 15/03/30 14:10:15 INFO mapred.JobClient: Bytes Written=59037 15/03/30 14:10:15 INFO mapred.JobClient: FileSystemCounters 15/03/30 
14:10:15 INFO mapred.JobClient: FILE_BYTES_READ=69108 15/03/30 14:10:15 INFO mapred.JobClient: HDFS_BYTES_READ=131267 15/03/30 14:10:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=247350 15/03/30 14:10:15 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=59037 15/03/30 14:10:15 INFO mapred.JobClient: File Input Format Counters 15/03/30 14:10:15 INFO mapred.JobClient: Bytes Read=131137 15/03/30 14:10:15 INFO mapred.JobClient: Map-Reduce Framework 15/03/30 14:10:15 INFO mapred.JobClient: Map output materialized bytes=69108 15/03/30 14:10:15 INFO mapred.JobClient: Map input records=149 15/03/30 14:10:15 INFO mapred.JobClient: Reduce shuffle bytes=69108 15/03/30 14:10:15 INFO mapred.JobClient: Spilled Records=8116 15/03/30 14:10:15 INFO mapred.JobClient: Map output bytes=117804 15/03/30 14:10:15 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720 15/03/30 14:10:15 INFO mapred.JobClient: CPU time spent (ms)=2850 15/03/30 14:10:15 INFO mapred.JobClient: Combine input records=8090 15/03/30 14:10:15 INFO mapred.JobClient: SPLIT_RAW_BYTES=130 15/03/30 14:10:15 INFO mapred.JobClient: Reduce input records=4058 15/03/30 14:10:15 INFO mapred.JobClient: Reduce input groups=4058 15/03/30 14:10:15 INFO mapred.JobClient: Combine output records=4058 15/03/30 14:10:15 INFO mapred.JobClient: Physical memory (bytes) snapshot=310415360 15/03/30 14:10:15 INFO mapred.JobClient: Reduce output records=2542 15/03/30 14:10:15 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1367658496 15/03/30 14:10:15 INFO mapred.JobClient: Map output records=8090 15/03/30 14:10:15 INFO input.FileInputFormat: Total input paths to process : 1 15/03/30 14:10:15 INFO mapred.JobClient: Running job: job_201503301351_0003 15/03/30 14:10:16 INFO mapred.JobClient: map 0% reduce 0% 15/03/30 14:10:21 INFO mapred.JobClient: map 100% reduce 0% 15/03/30 14:10:29 INFO mapred.JobClient: map 100% reduce 33% 15/03/30 14:10:31 INFO mapred.JobClient: map 100% reduce 100% 15/03/30 14:10:31 INFO mapred.JobClient: Job 
complete: job_201503301351_0003 15/03/30 14:10:31 INFO mapred.JobClient: Counters: 29 15/03/30 14:10:31 INFO mapred.JobClient: Job Counters 15/03/30 14:10:31 INFO mapred.JobClient: Launched reduce tasks=1 15/03/30 14:10:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3865 15/03/30 14:10:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 15/03/30 14:10:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 15/03/30 14:10:31 INFO mapred.JobClient: Launched map tasks=1 15/03/30 14:10:31 INFO mapred.JobClient: Data-local map tasks=1 15/03/30 14:10:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8558 15/03/30 14:10:31 INFO mapred.JobClient: File Output Format Counters 15/03/30 14:10:31 INFO mapred.JobClient: Bytes Written=70371 15/03/30 14:10:31 INFO mapred.JobClient: FileSystemCounters 15/03/30 14:10:31 INFO mapred.JobClient: FILE_BYTES_READ=178553 15/03/30 14:10:31 INFO mapred.JobClient: HDFS_BYTES_READ=131267 15/03/30 14:10:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=371870 15/03/30 14:10:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=70371 15/03/30 14:10:31 INFO mapred.JobClient: File Input Format Counters 15/03/30 14:10:31 INFO mapred.JobClient: Bytes Read=131137 15/03/30 14:10:31 INFO mapred.JobClient: Map-Reduce Framework 15/03/30 14:10:31 INFO mapred.JobClient: Map output materialized bytes=129393 15/03/30 14:10:31 INFO mapred.JobClient: Map input records=149 15/03/30 14:10:31 INFO mapred.JobClient: Reduce shuffle bytes=129393 15/03/30 14:10:31 INFO mapred.JobClient: Spilled Records=298 15/03/30 14:10:31 INFO mapred.JobClient: Map output bytes=128796 15/03/30 14:10:31 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720 15/03/30 14:10:31 INFO mapred.JobClient: CPU time spent (ms)=2200 15/03/30 14:10:31 INFO mapred.JobClient: Combine input records=0 15/03/30 14:10:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=130 15/03/30 14:10:31 INFO mapred.JobClient: Reduce input records=149 
15/03/30 14:10:31 INFO mapred.JobClient: Reduce input groups=149 15/03/30 14:10:31 INFO mapred.JobClient: Combine output records=0 15/03/30 14:10:31 INFO mapred.JobClient: Physical memory (bytes) snapshot=290947072 15/03/30 14:10:31 INFO mapred.JobClient: Reduce output records=149 15/03/30 14:10:31 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1365016576 15/03/30 14:10:31 INFO mapred.JobClient: Map output records=149 15/03/30 14:10:31 INFO input.FileInputFormat: Total input paths to process : 1 15/03/30 14:10:31 INFO mapred.JobClient: Running job: job_201503301351_0004 15/03/30 14:10:32 INFO mapred.JobClient: map 0% reduce 0% 15/03/30 14:10:37 INFO mapred.JobClient: map 100% reduce 0% 15/03/30 14:10:44 INFO mapred.JobClient: map 100% reduce 33% 15/03/30 14:10:45 INFO mapred.JobClient: map 100% reduce 100% 15/03/30 14:10:46 INFO mapred.JobClient: Job complete: job_201503301351_0004 15/03/30 14:10:46 INFO mapred.JobClient: Counters: 29 15/03/30 14:10:46 INFO mapred.JobClient: Job Counters 15/03/30 14:10:46 INFO mapred.JobClient: Launched reduce tasks=1 15/03/30 14:10:46 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3819 15/03/30 14:10:46 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 15/03/30 14:10:46 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 15/03/30 14:10:46 INFO mapred.JobClient: Launched map tasks=1 15/03/30 14:10:46 INFO mapred.JobClient: Data-local map tasks=1 15/03/30 14:10:46 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8497 15/03/30 14:10:46 INFO mapred.JobClient: File Output Format Counters 15/03/30 14:10:46 INFO mapred.JobClient: Bytes Written=70371 15/03/30 14:10:46 INFO mapred.JobClient: FileSystemCounters 15/03/30 14:10:46 INFO mapred.JobClient: FILE_BYTES_READ=69087 15/03/30 14:10:46 INFO mapred.JobClient: HDFS_BYTES_READ=70499 15/03/30 14:10:46 INFO mapred.JobClient: FILE_BYTES_WRITTEN=248304 15/03/30 14:10:46 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=70371 
15/03/30 14:10:46 INFO mapred.JobClient: File Input Format Counters 15/03/30 14:10:46 INFO mapred.JobClient: Bytes Read=70371 15/03/30 14:10:46 INFO mapred.JobClient: Map-Reduce Framework 15/03/30 14:10:46 INFO mapred.JobClient: Map output materialized bytes=69087 15/03/30 14:10:46 INFO mapred.JobClient: Map input records=149 15/03/30 14:10:46 INFO mapred.JobClient: Reduce shuffle bytes=69087 15/03/30 14:10:46 INFO mapred.JobClient: Spilled Records=298 15/03/30 14:10:46 INFO mapred.JobClient: Map output bytes=68509 15/03/30 14:10:46 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720 15/03/30 14:10:46 INFO mapred.JobClient: CPU time spent (ms)=1850 15/03/30 14:10:46 INFO mapred.JobClient: Combine input records=0 15/03/30 14:10:46 INFO mapred.JobClient: SPLIT_RAW_BYTES=128 15/03/30 14:10:46 INFO mapred.JobClient: Reduce input records=149 15/03/30 14:10:46 INFO mapred.JobClient: Reduce input groups=149 15/03/30 14:10:46 INFO mapred.JobClient: Combine output records=0 15/03/30 14:10:46 INFO mapred.JobClient: Physical memory (bytes) snapshot=296898560 15/03/30 14:10:46 INFO mapred.JobClient: Reduce output records=149 15/03/30 14:10:46 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1361080320 15/03/30 14:10:46 INFO mapred.JobClient: Map output records=149 15/03/30 14:10:46 INFO common.HadoopUtil: Deleting output/partial-vectors-0 15/03/30 14:10:46 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF 15/03/30 14:10:46 INFO input.FileInputFormat: Total input paths to process : 1 15/03/30 14:10:46 INFO mapred.JobClient: Running job: job_201503301351_0005 15/03/30 14:10:47 INFO mapred.JobClient: map 0% reduce 0% 15/03/30 14:10:52 INFO mapred.JobClient: map 100% reduce 0% 15/03/30 14:10:59 INFO mapred.JobClient: map 100% reduce 33% 15/03/30 14:11:00 INFO mapred.JobClient: map 100% reduce 100% 15/03/30 14:11:01 INFO mapred.JobClient: Job complete: job_201503301351_0005 15/03/30 14:11:01 INFO mapred.JobClient: Counters: 29 15/03/30 14:11:01 
INFO mapred.JobClient: Job Counters 15/03/30 14:11:01 INFO mapred.JobClient: Launched reduce tasks=1 15/03/30 14:11:01 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3979 15/03/30 14:11:01 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 15/03/30 14:11:01 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 15/03/30 14:11:01 INFO mapred.JobClient: Launched map tasks=1 15/03/30 14:11:01 INFO mapred.JobClient: Data-local map tasks=1 15/03/30 14:11:01 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8495 15/03/30 14:11:01 INFO mapred.JobClient: File Output Format Counters 15/03/30 14:11:01 INFO mapred.JobClient: Bytes Written=51453 15/03/30 14:11:01 INFO mapred.JobClient: FileSystemCounters 15/03/30 14:11:01 INFO mapred.JobClient: FILE_BYTES_READ=35608 15/03/30 14:11:01 INFO mapred.JobClient: HDFS_BYTES_READ=70500 15/03/30 14:11:01 INFO mapred.JobClient: FILE_BYTES_WRITTEN=180070 15/03/30 14:11:01 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=51453 15/03/30 14:11:01 INFO mapred.JobClient: File Input Format Counters 15/03/30 14:11:01 INFO mapred.JobClient: Bytes Read=70371 15/03/30 14:11:01 INFO mapred.JobClient: Map-Reduce Framework 15/03/30 14:11:01 INFO mapred.JobClient: Map output materialized bytes=35608 15/03/30 14:11:01 INFO mapred.JobClient: Map input records=149 15/03/30 14:11:01 INFO mapred.JobClient: Reduce shuffle bytes=35608 15/03/30 14:11:01 INFO mapred.JobClient: Spilled Records=5086 15/03/30 14:11:01 INFO mapred.JobClient: Map output bytes=80676 15/03/30 14:11:01 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720 15/03/30 14:11:01 INFO mapred.JobClient: CPU time spent (ms)=2120 15/03/30 14:11:01 INFO mapred.JobClient: Combine input records=6723 15/03/30 14:11:01 INFO mapred.JobClient: SPLIT_RAW_BYTES=129 15/03/30 14:11:01 INFO mapred.JobClient: Reduce input records=2543 15/03/30 14:11:01 INFO mapred.JobClient: Reduce input groups=2543 15/03/30 14:11:01 INFO 
mapred.JobClient: Combine output records=2543 15/03/30 14:11:01 INFO mapred.JobClient: Physical memory (bytes) snapshot=289153024 15/03/30 14:11:01 INFO mapred.JobClient: Reduce output records=2543 15/03/30 14:11:01 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1364127744 15/03/30 14:11:01 INFO mapred.JobClient: Map output records=6723 15/03/30 14:11:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning 15/03/30 14:11:01 INFO input.FileInputFormat: Total input paths to process : 1 15/03/30 14:11:02 INFO mapred.JobClient: Running job: job_201503301351_0006 15/03/30 14:11:03 INFO mapred.JobClient: map 0% reduce 0% 15/03/30 14:11:08 INFO mapred.JobClient: map 100% reduce 0% 15/03/30 14:11:15 INFO mapred.JobClient: map 100% reduce 33% 15/03/30 14:11:16 INFO mapred.JobClient: map 100% reduce 100% 15/03/30 14:11:17 INFO mapred.JobClient: Job complete: job_201503301351_0006 15/03/30 14:11:17 INFO mapred.JobClient: Counters: 29 15/03/30 14:11:17 INFO mapred.JobClient: Job Counters 15/03/30 14:11:17 INFO mapred.JobClient: Launched reduce tasks=1 15/03/30 14:11:17 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3775 15/03/30 14:11:17 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 15/03/30 14:11:17 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 15/03/30 14:11:17 INFO mapred.JobClient: Launched map tasks=1 15/03/30 14:11:17 INFO mapred.JobClient: Data-local map tasks=1 15/03/30 14:11:17 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8512 15/03/30 14:11:17 INFO mapred.JobClient: File Output Format Counters 15/03/30 14:11:17 INFO mapred.JobClient: Bytes Written=70371 15/03/30 14:11:17 INFO mapred.JobClient: FileSystemCounters 15/03/30 14:11:17 INFO mapred.JobClient: FILE_BYTES_READ=70763 15/03/30 14:11:17 INFO mapred.JobClient: HDFS_BYTES_READ=70500 15/03/30 14:11:17 INFO mapred.JobClient: FILE_BYTES_WRITTEN=149132 15/03/30 14:11:17 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=70371 15/03/30 
14:11:17 INFO mapred.JobClient: File Input Format Counters 15/03/30 14:11:17 INFO mapred.JobClient: Bytes Read=70371 15/03/30 14:11:17 INFO mapred.JobClient: Map-Reduce Framework 15/03/30 14:11:17 INFO mapred.JobClient: Map output materialized bytes=18910 15/03/30 14:11:17 INFO mapred.JobClient: Map input records=149 15/03/30 14:11:17 INFO mapred.JobClient: Reduce shuffle bytes=18910 15/03/30 14:11:17 INFO mapred.JobClient: Spilled Records=298 15/03/30 14:11:17 INFO mapred.JobClient: Map output bytes=68509 15/03/30 14:11:17 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720 15/03/30 14:11:17 INFO mapred.JobClient: CPU time spent (ms)=1710 15/03/30 14:11:17 INFO mapred.JobClient: Combine input records=0 15/03/30 14:11:17 INFO mapred.JobClient: SPLIT_RAW_BYTES=129 15/03/30 14:11:17 INFO mapred.JobClient: Reduce input records=149 15/03/30 14:11:17 INFO mapred.JobClient: Reduce input groups=149 15/03/30 14:11:17 INFO mapred.JobClient: Combine output records=0 15/03/30 14:11:17 INFO mapred.JobClient: Physical memory (bytes) snapshot=288608256 15/03/30 14:11:17 INFO mapred.JobClient: Reduce output records=149 15/03/30 14:11:17 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1364774912 15/03/30 14:11:17 INFO mapred.JobClient: Map output records=149 15/03/30 14:11:17 INFO input.FileInputFormat: Total input paths to process : 1 15/03/30 14:11:17 INFO mapred.JobClient: Running job: job_201503301351_0007 15/03/30 14:11:18 INFO mapred.JobClient: map 0% reduce 0% 15/03/30 14:11:22 INFO mapred.JobClient: map 100% reduce 0% 15/03/30 14:11:30 INFO mapred.JobClient: map 100% reduce 33% 15/03/30 14:11:31 INFO mapred.JobClient: map 100% reduce 100% 15/03/30 14:11:31 INFO mapred.JobClient: Job complete: job_201503301351_0007 15/03/30 14:11:31 INFO mapred.JobClient: Counters: 29 15/03/30 14:11:31 INFO mapred.JobClient: Job Counters 15/03/30 14:11:31 INFO mapred.JobClient: Launched reduce tasks=1 15/03/30 14:11:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3756 
15/03/30 14:11:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 15/03/30 14:11:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 15/03/30 14:11:31 INFO mapred.JobClient: Launched map tasks=1 15/03/30 14:11:31 INFO mapred.JobClient: Data-local map tasks=1 15/03/30 14:11:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8400 15/03/30 14:11:31 INFO mapred.JobClient: File Output Format Counters 15/03/30 14:11:31 INFO mapred.JobClient: Bytes Written=70371 15/03/30 14:11:31 INFO mapred.JobClient: FileSystemCounters 15/03/30 14:11:31 INFO mapred.JobClient: FILE_BYTES_READ=69087 15/03/30 14:11:31 INFO mapred.JobClient: HDFS_BYTES_READ=70510 15/03/30 14:11:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=247208 15/03/30 14:11:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=70371 15/03/30 14:11:31 INFO mapred.JobClient: File Input Format Counters 15/03/30 14:11:31 INFO mapred.JobClient: Bytes Read=70371 15/03/30 14:11:31 INFO mapred.JobClient: Map-Reduce Framework 15/03/30 14:11:31 INFO mapred.JobClient: Map output materialized bytes=69087 15/03/30 14:11:31 INFO mapred.JobClient: Map input records=149 15/03/30 14:11:31 INFO mapred.JobClient: Reduce shuffle bytes=69087 15/03/30 14:11:31 INFO mapred.JobClient: Spilled Records=298 15/03/30 14:11:31 INFO mapred.JobClient: Map output bytes=68509 15/03/30 14:11:31 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720 15/03/30 14:11:31 INFO mapred.JobClient: CPU time spent (ms)=1530 15/03/30 14:11:31 INFO mapred.JobClient: Combine input records=0 15/03/30 14:11:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=139 15/03/30 14:11:31 INFO mapred.JobClient: Reduce input records=149 15/03/30 14:11:31 INFO mapred.JobClient: Reduce input groups=149 15/03/30 14:11:31 INFO mapred.JobClient: Combine output records=0 15/03/30 14:11:31 INFO mapred.JobClient: Physical memory (bytes) snapshot=288825344 15/03/30 14:11:31 INFO mapred.JobClient: Reduce output 
records=149 15/03/30 14:11:31 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1367453696 15/03/30 14:11:31 INFO mapred.JobClient: Map output records=149 15/03/30 14:11:31 INFO common.HadoopUtil: Deleting output/tf-vectors-partial 15/03/30 14:11:31 INFO common.HadoopUtil: Deleting output/tf-vectors-toprune 15/03/30 14:11:31 INFO input.FileInputFormat: Total input paths to process : 1 15/03/30 14:11:31 INFO mapred.JobClient: Running job: job_201503301351_0008 15/03/30 14:11:32 INFO mapred.JobClient: map 0% reduce 0% 15/03/30 14:11:37 INFO mapred.JobClient: map 100% reduce 0% 15/03/30 14:11:44 INFO mapred.JobClient: map 100% reduce 33% 15/03/30 14:11:45 INFO mapred.JobClient: map 100% reduce 100% 15/03/30 14:11:46 INFO mapred.JobClient: Job complete: job_201503301351_0008 15/03/30 14:11:46 INFO mapred.JobClient: Counters: 29 15/03/30 14:11:46 INFO mapred.JobClient: Job Counters 15/03/30 14:11:46 INFO mapred.JobClient: Launched reduce tasks=1 15/03/30 14:11:46 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3788 15/03/30 14:11:46 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 15/03/30 14:11:46 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 15/03/30 14:11:46 INFO mapred.JobClient: Launched map tasks=1 15/03/30 14:11:46 INFO mapred.JobClient: Data-local map tasks=1 15/03/30 14:11:46 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8494 15/03/30 14:11:46 INFO mapred.JobClient: File Output Format Counters 15/03/30 14:11:46 INFO mapred.JobClient: Bytes Written=70371 15/03/30 14:11:46 INFO mapred.JobClient: FileSystemCounters 15/03/30 14:11:46 INFO mapred.JobClient: FILE_BYTES_READ=120932 15/03/30 14:11:46 INFO mapred.JobClient: HDFS_BYTES_READ=70492 15/03/30 14:11:46 INFO mapred.JobClient: FILE_BYTES_WRITTEN=250986 15/03/30 14:11:46 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=70371 15/03/30 14:11:46 INFO mapred.JobClient: File Input Format Counters 15/03/30 14:11:46 INFO mapred.JobClient: Bytes 
Read=70371
15/03/30 14:11:46 INFO mapred.JobClient: Map-Reduce Framework
15/03/30 14:11:46 INFO mapred.JobClient: Map output materialized bytes=69087
15/03/30 14:11:46 INFO mapred.JobClient: Map input records=149
15/03/30 14:11:46 INFO mapred.JobClient: Reduce shuffle bytes=69087
15/03/30 14:11:46 INFO mapred.JobClient: Spilled Records=298
15/03/30 14:11:46 INFO mapred.JobClient: Map output bytes=68509
15/03/30 14:11:46 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720
15/03/30 14:11:46 INFO mapred.JobClient: CPU time spent (ms)=1570
15/03/30 14:11:46 INFO mapred.JobClient: Combine input records=0
15/03/30 14:11:46 INFO mapred.JobClient: SPLIT_RAW_BYTES=121
15/03/30 14:11:46 INFO mapred.JobClient: Reduce input records=149
15/03/30 14:11:46 INFO mapred.JobClient: Reduce input groups=149
15/03/30 14:11:46 INFO mapred.JobClient: Combine output records=0
15/03/30 14:11:46 INFO mapred.JobClient: Physical memory (bytes) snapshot=289206272
15/03/30 14:11:46 INFO mapred.JobClient: Reduce output records=149
15/03/30 14:11:46 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1363017728
15/03/30 14:11:46 INFO mapred.JobClient: Map output records=149
15/03/30 14:11:46 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:11:47 INFO mapred.JobClient: Running job: job_201503301351_0009
15/03/30 14:11:48 INFO mapred.JobClient: map 0% reduce 0%
15/03/30 14:11:52 INFO mapred.JobClient: map 100% reduce 0%
15/03/30 14:11:59 INFO mapred.JobClient: map 100% reduce 33%
15/03/30 14:12:01 INFO mapred.JobClient: map 100% reduce 100%
15/03/30 14:12:01 INFO mapred.JobClient: Job complete: job_201503301351_0009
15/03/30 14:12:01 INFO mapred.JobClient: Counters: 29
15/03/30 14:12:01 INFO mapred.JobClient: Job Counters
15/03/30 14:12:01 INFO mapred.JobClient: Launched reduce tasks=1
15/03/30 14:12:01 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3728
15/03/30 14:12:01 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:12:01 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:12:01 INFO mapred.JobClient: Launched map tasks=1
15/03/30 14:12:01 INFO mapred.JobClient: Data-local map tasks=1
15/03/30 14:12:01 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8502
15/03/30 14:12:01 INFO mapred.JobClient: File Output Format Counters
15/03/30 14:12:01 INFO mapred.JobClient: Bytes Written=70371
15/03/30 14:12:01 INFO mapred.JobClient: FileSystemCounters
15/03/30 14:12:01 INFO mapred.JobClient: FILE_BYTES_READ=69087
15/03/30 14:12:01 INFO mapred.JobClient: HDFS_BYTES_READ=70499
15/03/30 14:12:01 INFO mapred.JobClient: FILE_BYTES_WRITTEN=248294
15/03/30 14:12:01 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=70371
15/03/30 14:12:01 INFO mapred.JobClient: File Input Format Counters
15/03/30 14:12:01 INFO mapred.JobClient: Bytes Read=70371
15/03/30 14:12:01 INFO mapred.JobClient: Map-Reduce Framework
15/03/30 14:12:01 INFO mapred.JobClient: Map output materialized bytes=69087
15/03/30 14:12:01 INFO mapred.JobClient: Map input records=149
15/03/30 14:12:01 INFO mapred.JobClient: Reduce shuffle bytes=69087
15/03/30 14:12:01 INFO mapred.JobClient: Spilled Records=298
15/03/30 14:12:01 INFO mapred.JobClient: Map output bytes=68509
15/03/30 14:12:01 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720
15/03/30 14:12:01 INFO mapred.JobClient: CPU time spent (ms)=2130
15/03/30 14:12:01 INFO mapred.JobClient: Combine input records=0
15/03/30 14:12:01 INFO mapred.JobClient: SPLIT_RAW_BYTES=128
15/03/30 14:12:01 INFO mapred.JobClient: Reduce input records=149
15/03/30 14:12:01 INFO mapred.JobClient: Reduce input groups=149
15/03/30 14:12:01 INFO mapred.JobClient: Combine output records=0
15/03/30 14:12:01 INFO mapred.JobClient: Physical memory (bytes) snapshot=301170688
15/03/30 14:12:01 INFO mapred.JobClient: Reduce output records=149
15/03/30 14:12:01 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1368752128
15/03/30 14:12:01 INFO mapred.JobClient: Map output records=149
15/03/30 14:12:01 INFO common.HadoopUtil: Deleting output/partial-vectors-0
15/03/30 14:12:01 INFO driver.MahoutDriver: Program took 130017 ms (Minutes: 2.16695)
====================================
root@xin:~# hadoop fs -ls output
Warning: $HADOOP_HOME is deprecated.
Found 7 items
drwxr-xr-x - root supergroup 0 2015-03-30 14:11 /user/root/output/df-count
-rw-r--r-- 1 root supergroup 48768 2015-03-30 14:10 /user/root/output/dictionary.file-0
-rw-r--r-- 1 root supergroup 51433 2015-03-30 14:11 /user/root/output/frequency.file-0
drwxr-xr-x - root supergroup 0 2015-03-30 14:11 /user/root/output/tf-vectors
drwxr-xr-x - root supergroup 0 2015-03-30 14:12 /user/root/output/tfidf-vectors
drwxr-xr-x - root supergroup 0 2015-03-30 14:09 /user/root/output/tokenized-documents
drwxr-xr-x - root supergroup 0 2015-03-30 14:10 /user/root/output/wordcount
==========================================
root@xin:~# mahout kmeans -i output/tf-vectors -c output/kmeans-clusters -o output/kmeans -k 10 -x 200 -ow --clustering
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.
15/03/30 14:17:04 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[output/kmeans-clusters], --convergenceDelta=[0.5], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[output/tf-vectors], --maxIter=[200], --method=[mapreduce], --numClusters=[10], --output=[output/kmeans], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
15/03/30 14:17:04 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/30 14:17:04 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/03/30 14:17:04 INFO compress.CodecPool: Got brand-new compressor
15/03/30 14:17:04 INFO kmeans.RandomSeedGenerator: Wrote 10 Klusters to output/kmeans-clusters/part-randomSeed
15/03/30 14:17:04 INFO kmeans.KMeansDriver: Input: output/tf-vectors Clusters In: output/kmeans-clusters/part-randomSeed Out: output/kmeans
15/03/30 14:17:04 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 200
15/03/30 14:17:04 INFO compress.CodecPool: Got brand-new decompressor
15/03/30 14:17:05 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:05 INFO mapred.JobClient: Running job: job_201503301351_0010
15/03/30 14:17:06 INFO mapred.JobClient: map 0% reduce 0%
15/03/30 14:17:11 INFO mapred.JobClient: map 100% reduce 0%
15/03/30 14:17:18 INFO mapred.JobClient: map 100% reduce 33%
15/03/30 14:17:19 INFO mapred.JobClient: map 100% reduce 100%
15/03/30 14:17:20 INFO mapred.JobClient: Job complete: job_201503301351_0010
15/03/30 14:17:20 INFO mapred.JobClient: Counters: 29
15/03/30 14:17:20 INFO mapred.JobClient: Job Counters
15/03/30 14:17:20 INFO mapred.JobClient: Launched reduce tasks=1
15/03/30 14:17:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4154
15/03/30 14:17:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:17:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:17:20 INFO mapred.JobClient: Launched map tasks=1
15/03/30 14:17:20 INFO mapred.JobClient: Data-local map tasks=1
15/03/30 14:17:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8558
15/03/30 14:17:20 INFO mapred.JobClient: File Output Format Counters
15/03/30 14:17:20 INFO mapred.JobClient: Bytes Written=64996
15/03/30 14:17:20 INFO mapred.JobClient: FileSystemCounters
15/03/30 14:17:20 INFO mapred.JobClient: FILE_BYTES_READ=70419
15/03/30 14:17:20 INFO mapred.JobClient: HDFS_BYTES_READ=96550
15/03/30 14:17:20 INFO mapred.JobClient: FILE_BYTES_WRITTEN=250490
15/03/30 14:17:20 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=64996
15/03/30 14:17:20 INFO mapred.JobClient: File Input Format Counters
15/03/30 14:17:20 INFO mapred.JobClient: Bytes Read=70371
15/03/30 14:17:20 INFO mapred.JobClient: Map-Reduce Framework
15/03/30 14:17:20 INFO mapred.JobClient: Map output materialized bytes=70419
15/03/30 14:17:20 INFO mapred.JobClient: Map input records=149
15/03/30 14:17:20 INFO mapred.JobClient: Reduce shuffle bytes=70419
15/03/30 14:17:20 INFO mapred.JobClient: Spilled Records=20
15/03/30 14:17:20 INFO mapred.JobClient: Map output bytes=70373
15/03/30 14:17:20 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720
15/03/30 14:17:20 INFO mapred.JobClient: CPU time spent (ms)=2950
15/03/30 14:17:20 INFO mapred.JobClient: Combine input records=0
15/03/30 14:17:20 INFO mapred.JobClient: SPLIT_RAW_BYTES=121
15/03/30 14:17:20 INFO mapred.JobClient: Reduce input records=10
15/03/30 14:17:20 INFO mapred.JobClient: Reduce input groups=10
15/03/30 14:17:20 INFO mapred.JobClient: Combine output records=0
15/03/30 14:17:20 INFO mapred.JobClient: Physical memory (bytes) snapshot=306675712
15/03/30 14:17:20 INFO mapred.JobClient: Reduce output records=10
15/03/30 14:17:20 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1367547904
15/03/30 14:17:20 INFO mapred.JobClient: Map output records=10
15/03/30 14:17:20 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:20 INFO mapred.JobClient: Running job: job_201503301351_0011
15/03/30 14:17:21 INFO mapred.JobClient: map 0% reduce 0%
15/03/30 14:17:26 INFO mapred.JobClient: map 100% reduce 0%
15/03/30 14:17:34 INFO mapred.JobClient: map 100% reduce 33%
15/03/30 14:17:36 INFO mapred.JobClient: map 100% reduce 100%
15/03/30 14:17:36 INFO mapred.JobClient: Job complete: job_201503301351_0011
15/03/30 14:17:36 INFO mapred.JobClient: Counters: 29
15/03/30 14:17:36 INFO mapred.JobClient: Job Counters
15/03/30 14:17:36 INFO mapred.JobClient: Launched reduce tasks=1
15/03/30 14:17:36 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4041
15/03/30 14:17:36 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:17:36 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:17:36 INFO mapred.JobClient: Launched map tasks=1
15/03/30 14:17:36 INFO mapred.JobClient: Data-local map tasks=1
15/03/30 14:17:36 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8708
15/03/30 14:17:36 INFO mapred.JobClient: File Output Format Counters
15/03/30 14:17:36 INFO mapred.JobClient: Bytes Written=64018
15/03/30 14:17:36 INFO mapred.JobClient: FileSystemCounters
15/03/30 14:17:36 INFO mapred.JobClient: FILE_BYTES_READ=128966
15/03/30 14:17:36 INFO mapred.JobClient: HDFS_BYTES_READ=200872
15/03/30 14:17:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=367584
15/03/30 14:17:36 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=64018
15/03/30 14:17:36 INFO mapred.JobClient: File Input Format Counters
15/03/30 14:17:36 INFO mapred.JobClient: Bytes Read=70371
15/03/30 14:17:36 INFO mapred.JobClient: Map-Reduce Framework
15/03/30 14:17:36 INFO mapred.JobClient: Map output materialized bytes=128966
15/03/30 14:17:36 INFO mapred.JobClient: Map input records=149
15/03/30 14:17:36 INFO mapred.JobClient: Reduce shuffle bytes=128966
15/03/30 14:17:36 INFO mapred.JobClient: Spilled Records=20
15/03/30 14:17:36 INFO mapred.JobClient: Map output bytes=128919
15/03/30 14:17:36 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720
15/03/30 14:17:36 INFO mapred.JobClient: CPU time spent (ms)=3050
15/03/30 14:17:36 INFO mapred.JobClient: Combine input records=0
15/03/30 14:17:36 INFO mapred.JobClient: SPLIT_RAW_BYTES=121
15/03/30 14:17:36 INFO mapred.JobClient: Reduce input records=10
15/03/30 14:17:36 INFO mapred.JobClient: Reduce input groups=10
15/03/30 14:17:36 INFO mapred.JobClient: Combine output records=0
15/03/30 14:17:36 INFO mapred.JobClient: Physical memory (bytes) snapshot=301654016
15/03/30 14:17:36 INFO mapred.JobClient: Reduce output records=10
15/03/30 14:17:36 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1368375296
15/03/30 14:17:36 INFO mapred.JobClient: Map output records=10
15/03/30 14:17:36 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:36 INFO mapred.JobClient: Running job: job_201503301351_0012
15/03/30 14:17:37 INFO mapred.JobClient: map 0% reduce 0%
15/03/30 14:17:42 INFO mapred.JobClient: map 100% reduce 0%
15/03/30 14:17:49 INFO mapred.JobClient: map 100% reduce 33%
15/03/30 14:17:50 INFO mapred.JobClient: map 100% reduce 100%
15/03/30 14:17:51 INFO mapred.JobClient: Job complete: job_201503301351_0012
15/03/30 14:17:51 INFO mapred.JobClient: Counters: 29
15/03/30 14:17:51 INFO mapred.JobClient: Job Counters
15/03/30 14:17:51 INFO mapred.JobClient: Launched reduce tasks=1
15/03/30 14:17:51 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4081
15/03/30 14:17:51 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:17:51 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:17:51 INFO mapred.JobClient: Launched map tasks=1
15/03/30 14:17:51 INFO mapred.JobClient: Data-local map tasks=1
15/03/30 14:17:51 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8601
15/03/30 14:17:51 INFO mapred.JobClient: File Output Format Counters
15/03/30 14:17:51 INFO mapred.JobClient: Bytes Written=61455
15/03/30 14:17:51 INFO mapred.JobClient: FileSystemCounters
15/03/30 14:17:51 INFO mapred.JobClient: FILE_BYTES_READ=125434
15/03/30 14:17:51 INFO mapred.JobClient: HDFS_BYTES_READ=198916
15/03/30 14:17:51 INFO mapred.JobClient: FILE_BYTES_WRITTEN=360520
15/03/30 14:17:51 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=61455
15/03/30 14:17:51 INFO mapred.JobClient: File Input Format Counters
15/03/30 14:17:51 INFO mapred.JobClient: Bytes Read=70371
15/03/30 14:17:51 INFO mapred.JobClient: Map-Reduce Framework
15/03/30 14:17:51 INFO mapred.JobClient: Map output materialized bytes=125434
15/03/30 14:17:51 INFO mapred.JobClient: Map input records=149
15/03/30 14:17:51 INFO mapred.JobClient: Reduce shuffle bytes=125434
15/03/30 14:17:51 INFO mapred.JobClient: Spilled Records=20
15/03/30 14:17:51 INFO mapred.JobClient: Map output bytes=125387
15/03/30 14:17:51 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720
15/03/30 14:17:51 INFO mapred.JobClient: CPU time spent (ms)=2850
15/03/30 14:17:51 INFO mapred.JobClient: Combine input records=0
15/03/30 14:17:51 INFO mapred.JobClient: SPLIT_RAW_BYTES=121
15/03/30 14:17:51 INFO mapred.JobClient: Reduce input records=10
15/03/30 14:17:51 INFO mapred.JobClient: Reduce input groups=10
15/03/30 14:17:51 INFO mapred.JobClient: Combine output records=0
15/03/30 14:17:51 INFO mapred.JobClient: Physical memory (bytes) snapshot=298000384
15/03/30 14:17:51 INFO mapred.JobClient: Reduce output records=10
15/03/30 14:17:51 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1369067520
15/03/30 14:17:51 INFO mapred.JobClient: Map output records=10
15/03/30 14:17:51 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:51 INFO mapred.JobClient: Running job: job_201503301351_0013
15/03/30 14:17:52 INFO mapred.JobClient: map 0% reduce 0%
15/03/30 14:17:57 INFO mapred.JobClient: map 100% reduce 0%
15/03/30 14:18:04 INFO mapred.JobClient: map 100% reduce 33%
15/03/30 14:18:06 INFO mapred.JobClient: map 100% reduce 100%
15/03/30 14:18:06 INFO mapred.JobClient: Job complete: job_201503301351_0013
15/03/30 14:18:06 INFO mapred.JobClient: Counters: 29
15/03/30 14:18:06 INFO mapred.JobClient: Job Counters
15/03/30 14:18:06 INFO mapred.JobClient: Launched reduce tasks=1
15/03/30 14:18:06 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4191
15/03/30 14:18:06 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:18:06 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:18:06 INFO mapred.JobClient: Launched map tasks=1
15/03/30 14:18:06 INFO mapred.JobClient: Data-local map tasks=1
15/03/30 14:18:06 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8661
15/03/30 14:18:06 INFO mapred.JobClient: File Output Format Counters
15/03/30 14:18:06 INFO mapred.JobClient: Bytes Written=61248
15/03/30 14:18:06 INFO mapred.JobClient: FileSystemCounters
15/03/30 14:18:06 INFO mapred.JobClient: FILE_BYTES_READ=121841
15/03/30 14:18:06 INFO mapred.JobClient: HDFS_BYTES_READ=193790
15/03/30 14:18:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=353334
15/03/30 14:18:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=61248
15/03/30 14:18:06 INFO mapred.JobClient: File Input Format Counters
15/03/30 14:18:06 INFO mapred.JobClient: Bytes Read=70371
15/03/30 14:18:06 INFO mapred.JobClient: Map-Reduce Framework
15/03/30 14:18:06 INFO mapred.JobClient: Map output materialized bytes=121841
15/03/30 14:18:06 INFO mapred.JobClient: Map input records=149
15/03/30 14:18:06 INFO mapred.JobClient: Reduce shuffle bytes=121841
15/03/30 14:18:06 INFO mapred.JobClient: Spilled Records=20
15/03/30 14:18:06 INFO mapred.JobClient: Map output bytes=121794
15/03/30 14:18:06 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720
15/03/30 14:18:06 INFO mapred.JobClient: CPU time spent (ms)=3380
15/03/30 14:18:06 INFO mapred.JobClient: Combine input records=0
15/03/30 14:18:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=121
15/03/30 14:18:06 INFO mapred.JobClient: Reduce input records=10
15/03/30 14:18:06 INFO mapred.JobClient: Reduce input groups=10
15/03/30 14:18:06 INFO mapred.JobClient: Combine output records=0
15/03/30 14:18:06 INFO mapred.JobClient: Physical memory (bytes) snapshot=306253824
15/03/30 14:18:06 INFO mapred.JobClient: Reduce output records=10
15/03/30 14:18:06 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1372364800
15/03/30 14:18:06 INFO mapred.JobClient: Map output records=10
15/03/30 14:18:06 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:18:06 INFO mapred.JobClient: Running job: job_201503301351_0014
15/03/30 14:18:07 INFO mapred.JobClient: map 0% reduce 0%
15/03/30 14:18:12 INFO mapred.JobClient: map 100% reduce 0%
15/03/30 14:18:19 INFO mapred.JobClient: map 100% reduce 33%
15/03/30 14:18:21 INFO mapred.JobClient: map 100% reduce 100%
15/03/30 14:18:21 INFO mapred.JobClient: Job complete: job_201503301351_0014
15/03/30 14:18:21 INFO mapred.JobClient: Counters: 29
15/03/30 14:18:21 INFO mapred.JobClient: Job Counters
15/03/30 14:18:21 INFO mapred.JobClient: Launched reduce tasks=1
15/03/30 14:18:21 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=4242
15/03/30 14:18:21 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:18:21 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:18:21 INFO mapred.JobClient: Launched map tasks=1
15/03/30 14:18:21 INFO mapred.JobClient: Data-local map tasks=1
15/03/30 14:18:21 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8624
15/03/30 14:18:21 INFO mapred.JobClient: File Output Format Counters
15/03/30 14:18:21 INFO mapred.JobClient: Bytes Written=61248
15/03/30 14:18:21 INFO mapred.JobClient: FileSystemCounters
15/03/30 14:18:21 INFO mapred.JobClient: FILE_BYTES_READ=121634
15/03/30 14:18:21 INFO mapred.JobClient: HDFS_BYTES_READ=193376
15/03/30 14:18:21 INFO mapred.JobClient: FILE_BYTES_WRITTEN=352920
15/03/30 14:18:21 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=61248
15/03/30 14:18:21 INFO mapred.JobClient: File Input Format Counters
15/03/30 14:18:21 INFO mapred.JobClient: Bytes Read=70371
15/03/30 14:18:21 INFO mapred.JobClient: Map-Reduce Framework
15/03/30 14:18:21 INFO mapred.JobClient: Map output materialized bytes=121634
15/03/30 14:18:21 INFO mapred.JobClient: Map input records=149
15/03/30 14:18:21 INFO mapred.JobClient: Reduce shuffle bytes=121634
15/03/30 14:18:21 INFO mapred.JobClient: Spilled Records=20
15/03/30 14:18:21 INFO mapred.JobClient: Map output bytes=121587
15/03/30 14:18:21 INFO mapred.JobClient: Total committed heap usage (bytes)=296222720
15/03/30 14:18:21 INFO mapred.JobClient: CPU time spent (ms)=3060
15/03/30 14:18:21 INFO mapred.JobClient: Combine input records=0
15/03/30 14:18:21 INFO mapred.JobClient: SPLIT_RAW_BYTES=121
15/03/30 14:18:21 INFO mapred.JobClient: Reduce input records=10
15/03/30 14:18:21 INFO mapred.JobClient: Reduce input groups=10
15/03/30 14:18:21 INFO mapred.JobClient: Combine output records=0
15/03/30 14:18:21 INFO mapred.JobClient: Physical memory (bytes) snapshot=295936000
15/03/30 14:18:21 INFO mapred.JobClient: Reduce output records=10
15/03/30 14:18:21 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1362153472
15/03/30 14:18:21 INFO mapred.JobClient: Map output records=10
15/03/30 14:18:21 INFO kmeans.KMeansDriver: Clustering data
15/03/30 14:18:21 INFO kmeans.KMeansDriver: Running Clustering
15/03/30 14:18:21 INFO kmeans.KMeansDriver: Input: output/tf-vectors Clusters In: output/kmeans Out: output/kmeans
15/03/30 14:18:22 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:18:22 INFO mapred.JobClient: Running job: job_201503301351_0015
15/03/30 14:18:23 INFO mapred.JobClient: map 0% reduce 0%
15/03/30 14:18:29 INFO mapred.JobClient: map 100% reduce 0%
15/03/30 14:18:30 INFO mapred.JobClient: Job complete: job_201503301351_0015
15/03/30 14:18:30 INFO mapred.JobClient: Counters: 19
15/03/30 14:18:30 INFO mapred.JobClient: Job Counters
15/03/30 14:18:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5264
15/03/30 14:18:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:18:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:18:30 INFO mapred.JobClient: Launched map tasks=1
15/03/30 14:18:30 INFO mapred.JobClient: Data-local map tasks=1
15/03/30 14:18:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
15/03/30 14:18:30 INFO mapred.JobClient: File Output Format Counters
15/03/30 14:18:30 INFO mapred.JobClient: Bytes Written=75851
15/03/30 14:18:30 INFO mapred.JobClient: FileSystemCounters
15/03/30 14:18:30 INFO mapred.JobClient: HDFS_BYTES_READ=131934
15/03/30 14:18:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54540
15/03/30 14:18:30 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=75851
15/03/30 14:18:30 INFO mapred.JobClient: File Input Format Counters
15/03/30 14:18:30 INFO mapred.JobClient: Bytes Read=70371
15/03/30 14:18:30 INFO mapred.JobClient: Map-Reduce Framework
15/03/30 14:18:30 INFO mapred.JobClient: Map input records=149
15/03/30 14:18:30 INFO mapred.JobClient: Physical memory (bytes) snapshot=113307648
15/03/30 14:18:30 INFO mapred.JobClient: Spilled Records=0
15/03/30 14:18:30 INFO mapred.JobClient: CPU time spent (ms)=1620
15/03/30 14:18:30 INFO mapred.JobClient: Total committed heap usage (bytes)=120061952
15/03/30 14:18:30 INFO mapred.JobClient: Virtual memory (bytes) snapshot=680222720
15/03/30 14:18:30 INFO mapred.JobClient: Map output records=149
15/03/30 14:18:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=121
15/03/30 14:18:30 INFO driver.MahoutDriver: Program took 86159 ms (Minutes: 1.4359833333333334)
======================================
root@xin:~# hadoop fs -ls output/kmeans
Warning: $HADOOP_HOME is deprecated.
Found 8 items
-rw-r--r-- 1 root supergroup 194 2015-03-30 14:18 /user/root/output/kmeans/_policy
drwxr-xr-x - root supergroup 0 2015-03-30 14:18 /user/root/output/kmeans/clusteredPoints
drwxr-xr-x - root supergroup 0 2015-03-30 14:17 /user/root/output/kmeans/clusters-0
drwxr-xr-x - root supergroup 0 2015-03-30 14:17 /user/root/output/kmeans/clusters-1
drwxr-xr-x - root supergroup 0 2015-03-30 14:17 /user/root/output/kmeans/clusters-2
drwxr-xr-x - root supergroup 0 2015-03-30 14:17 /user/root/output/kmeans/clusters-3
drwxr-xr-x - root supergroup 0 2015-03-30 14:18 /user/root/output/kmeans/clusters-4
drwxr-xr-x - root supergroup 0 2015-03-30 14:18 /user/root/output/kmeans/clusters-5-final
======================================
root@xin:~# hadoop fs -get output/kmeans/* /usr/song-kmeans/
Warning: $HADOOP_HOME is deprecated.
root@xin:~# hadoop fs -get output/dictionary.file-0 /usr/song-kmeans
Warning: $HADOOP_HOME is deprecated.
root@xin:~# mahout clusterdump -i file:///usr/song-kmeans/clusters-5-final -d file:///usr/song-kmeans/dictionary.file-0 -dt sequencefile -o /usr/song-result/result -n 20
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.
15/03/30 14:34:08 INFO common.AbstractJob: Command line arguments: {--dictionary=[file:///usr/song-kmeans/dictionary.file-0], --dictionaryType=[sequencefile], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[file:///usr/song-kmeans/clusters-5-final], --numWords=[20], --output=[/usr/song-result/result], --outputFormat=[TEXT], --startPhase=[0], --tempDir=[temp]}
15/03/30 14:34:09 INFO clustering.ClusterDumper: Wrote 10 clusters
15/03/30 14:34:09 INFO driver.MahoutDriver: Program took 716 ms (Minutes: 0.011933333333333334)
Exception in thread "main" java.io.FileNotFoundException: /usr/song-result (Is a directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
    at com.google.common.io.Files.newWriter(Files.java:103)
    at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:187)
    at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:157)
    at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:101)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
===========================================
root@xin:~# mahout seqdumper -i file:///usr/song-kmeans/clusteredPoints -o /usr/song-result/all
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.
15/03/30 14:44:28 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[file:///usr/song-kmeans/clusteredPoints], --output=[/usr/song-result/all], --startPhase=[0], --tempDir=[temp]}
15/03/30 14:44:29 INFO driver.MahoutDriver: Program took 634 ms (Minutes: 0.010566666666666667)
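The mahout kmeans run in the log was started with -k 10 and -x 200, and the logged arguments show --convergenceDelta=[0.5] and SquaredEuclideanDistanceMeasure. To make those parameters concrete, here is a minimal plain-Java sketch of the loop they control. This is not Mahout's code: the class KMeansSketch, its method names, and the toy data are made up for illustration, and the stopping rule (stop once no centroid has moved by more than the delta, measured in squared Euclidean distance) is my simplified reading of the convergence setting.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Mahout.
final class KMeansSketch {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs at most maxIter iterations; stops early when every centroid
    // moved by no more than delta (assumed stopping rule, see lead-in).
    static double[][] cluster(double[][] points, double[][] initial, int maxIter, double delta) {
        int k = initial.length;
        int dim = initial[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to its nearest cluster.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (squaredDistance(updated, centroids[c]) > delta) {
                    converged = false;
                }
                centroids[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Toy data: two obvious groups, with one seed centroid in each.
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] seeds = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(cluster(points, seeds, 200, 0.5)));
    }
}
```

In the real run the vectors are the sparse term vectors in output/tf-vectors and each iteration is a separate MapReduce job, which is why the log shows jobs 0010 to 0014 before clusters-5-final appears, well short of the 200-iteration limit.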
Original article: http://blog.csdn.net/u011439289/article/details/44887969