
DocumentSimilarity

Date: 2017-10-29 15:53:58


Read the file

abstracts = [line.strip() for line in open('../DATA/AbstractData.txt')]
print abstracts[:1]
['25177 Given n non-vertical lines in 3-space, their vertical depth (above/below) relation can contain cycles. We show that the lines can be cut into O(n3/2polylog n) pieces, such that the depth relation among these pieces is now a proper partial order. This bound is nearly tight in the worst case. As a consequence, we deduce that the number of pairwise non-overlapping cycles, namely, cycles whose xy-projections do not overlap, is O(n3/2polylog n); this bound too is almost tight in the worst case. Previous results on this topic could only handle restricted cases of the problem (such as handling only triangular cycles, by Aronov, Koltun, and Sharir, or only cycles in grid-like patterns, by Chazelle et al.), and the bounds were considerably weaker&#x2014;much closer to the trivial quadratic bound. Our proof uses a recent variant of the polynomial partitioning technique, due to Guth, and some simple tools from algebraic geometry. It is much more straightforward than the previous &#x201C;purely combinatorial&#x201D; methods. Our approach extends to eliminating all cycles in the depth relation among segments, and among constant-degree algebraic arcs. We hope that a suitable extension of this technique could be used to handle the much more difficult case of pairwise-disjoint triangles as well. Our results almost completely settle a long-standing (35 years old) open problem in computational geometry, motivated by hidden-surface removal in computer graphics. </p>']
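
The snippet above is written for Python 2. As a rough sketch of the same step under Python 3, assuming the data file is UTF-8 encoded with one abstract per line: the helper name `read_abstracts` and the throwaway sample file are hypothetical and only serve the illustration, and unlike the original this version skips blank lines.

```python
import os
import tempfile


def read_abstracts(path, encoding="utf-8"):
    """Return one stripped string per non-empty line of the file."""
    # The context manager closes the file even if reading fails.
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration on a temporary file instead of ../DATA/AbstractData.txt.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("25177 Given n non-vertical lines in 3-space </p>\n\n")
    sample_path = tmp.name

sample_abstracts = read_abstracts(sample_path)
os.remove(sample_path)
print(sample_abstracts[:1])
```

Passing the encoding explicitly means the lines arrive as text, so a later `decode('utf-8')` call would not be needed in Python 3.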

Get the abstract IDs

abstractsId = [abstract.split(' ')[0] for abstract in abstracts]
print abstractsId[:1]
['25177']
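
In the later outputs of this post the ID `25177` stays in the token list as its first element. If that is not wanted, one option is to separate the ID from the text once, up front. A minimal Python 3 sketch, assuming every line has the form `<id> <text>`; `split_id_and_text` is a hypothetical helper, not part of the original code.

```python
def split_id_and_text(abstract):
    """Split '<id> <text>' at the first run of whitespace."""
    parts = abstract.split(None, 1)
    abstract_id = parts[0] if parts else ""
    text = parts[1] if len(parts) > 1 else ""
    return abstract_id, text


sample = "25177 Given n non-vertical lines in 3-space"
abstract_id, text = split_id_and_text(sample)
print(abstract_id)
print(text)
```

The remaining steps could then run on `text` alone, keeping the ID as a separate key for each document.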

Convert to lowercase

abstractLower = [[word for word in abstract.lower().split()] for abstract in abstracts]
print abstractLower[:1]
[['25177', 'given', 'n', 'non-vertical', 'lines', 'in', '3-space,', 'their', 'vertical', 'depth', '(above/below)', 'relation', 'can', 'contain', 'cycles.', 'we', 'show', 'that', 'the', 'lines', 'can', 'be', 'cut', 'into', 'o(n3/2polylog', 'n)', 'pieces,', 'such', 'that', 'the', 'depth', 'relation', 'among', 'these', 'pieces', 'is', 'now', 'a', 'proper', 'partial', 'order.', 'this', 'bound', 'is', 'nearly', 'tight', 'in', 'the', 'worst', 'case.', 'as', 'a', 'consequence,', 'we', 'deduce', 'that', 'the', 'number', 'of', 'pairwise', 'non-overlapping', 'cycles,', 'namely,', 'cycles', 'whose', 'xy-projections', 'do', 'not', 'overlap,', 'is', 'o(n3/2polylog', 'n);', 'this', 'bound', 'too', 'is', 'almost', 'tight', 'in', 'the', 'worst', 'case.', 'previous', 'results', 'on', 'this', 'topic', 'could', 'only', 'handle', 'restricted', 'cases', 'of', 'the', 'problem', '(such', 'as', 'handling', 'only', 'triangular', 'cycles,', 'by', 'aronov,', 'koltun,', 'and', 'sharir,', 'or', 'only', 'cycles', 'in', 'grid-like', 'patterns,', 'by', 'chazelle', 'et', 'al.),', 'and', 'the', 'bounds', 'were', 'considerably', 'weaker&#x2014;much', 'closer', 'to', 'the', 'trivial', 'quadratic', 'bound.', 'our', 'proof', 'uses', 'a', 'recent', 'variant', 'of', 'the', 'polynomial', 'partitioning', 'technique,', 'due', 'to', 'guth,', 'and', 'some', 'simple', 'tools', 'from', 'algebraic', 'geometry.', 'it', 'is', 'much', 'more', 'straightforward', 'than', 'the', 'previous', '&#x201c;purely', 'combinatorial&#x201d;', 'methods.', 'our', 'approach', 'extends', 'to', 'eliminating', 'all', 'cycles', 'in', 'the', 'depth', 'relation', 'among', 'segments,', 'and', 'among', 'constant-degree', 'algebraic', 'arcs.', 'we', 'hope', 'that', 'a', 'suitable', 'extension', 'of', 'this', 'technique', 'could', 'be', 'used', 'to', 'handle', 'the', 'much', 'more', 'difficult', 'case', 'of', 'pairwise-disjoint', 'triangles', 'as', 'well.', 'our', 'results', 'almost', 'completely', 'settle', 'a', 'long-standing', '(35', 'years', 'old)', 'open', 'problem', 'in', 'computational', 'geometry,', 'motivated', 'by', 'hidden-surface', 'removal', 'in', 'computer', 'graphics.', '</p>']]

Separate punctuation from words

from nltk.tokenize import word_tokenize
abstractsTokenized = [[word.lower() for word in word_tokenize(abstract.decode('utf-8'))] for abstract in abstracts]
print abstractsTokenized[:1]
[[u'25177', u'given', u'n', u'non-vertical', u'lines', u'in', u'3-space', u',', u'their', u'vertical', u'depth', u'(', u'above/below', u')', u'relation', u'can', u'contain', u'cycles', u'.', u'we', u'show', u'that', u'the', u'lines', u'can', u'be', u'cut', u'into', u'o', u'(', u'n3/2polylog', u'n', u')', u'pieces', u',', u'such', u'that', u'the', u'depth', u'relation', u'among', u'these', u'pieces', u'is', u'now', u'a', u'proper', u'partial', u'order', u'.', u'this', u'bound', u'is', u'nearly', u'tight', u'in', u'the', u'worst', u'case', u'.', u'as', u'a', u'consequence', u',', u'we', u'deduce', u'that', u'the', u'number', u'of', u'pairwise', u'non-overlapping', u'cycles', u',', u'namely', u',', u'cycles', u'whose', u'xy-projections', u'do', u'not', u'overlap', u',', u'is', u'o', u'(', u'n3/2polylog', u'n', u')', u';', u'this', u'bound', u'too', u'is', u'almost', u'tight', u'in', u'the', u'worst', u'case', u'.', u'previous', u'results', u'on', u'this', u'topic', u'could', u'only', u'handle', u'restricted', u'cases', u'of', u'the', u'problem', u'(', u'such', u'as', u'handling', u'only', u'triangular', u'cycles', u',', u'by', u'aronov', u',', u'koltun', u',', u'and', u'sharir', u',', u'or', u'only', u'cycles', u'in', u'grid-like', u'patterns', u',', u'by', u'chazelle', u'et', u'al', u'.', u')', u',', u'and', u'the', u'bounds', u'were', u'considerably', u'weaker', u'&', u'#', u'x2014', u';', u'much', u'closer', u'to', u'the', u'trivial', u'quadratic', u'bound', u'.', u'our', u'proof', u'uses', u'a', u'recent', u'variant', u'of', u'the', u'polynomial', u'partitioning', u'technique', u',', u'due', u'to', u'guth', u',', u'and', u'some', u'simple', u'tools', u'from', u'algebraic', u'geometry', u'.', u'it', u'is', u'much', u'more', u'straightforward', u'than', u'the', u'previous', u'&', u'#', u'x201c', u';', u'purely', u'combinatorial', u'&', u'#', u'x201d', u';', u'methods', u'.', u'our', u'approach', u'extends', u'to', u'eliminating', u'all', u'cycles', u'in', u'the', u'depth', u'relation', u'among', u'segments', u',', u'and', u'among', u'constant-degree', u'algebraic', u'arcs', u'.', u'we', u'hope', u'that', u'a', u'suitable', u'extension', u'of', u'this', u'technique', u'could', u'be', u'used', u'to', u'handle', u'the', u'much', u'more', u'difficult', u'case', u'of', u'pairwise-disjoint', u'triangles', u'as', u'well', u'.', u'our', u'results', u'almost', u'completely', u'settle', u'a', u'long-standing', u'(', u'35', u'years', u'old', u')', u'open', u'problem', u'in', u'computational', u'geometry', u',', u'motivated', u'by', u'hidden-surface', u'removal', u'in', u'computer', u'graphics', u'.', u'<', u'/p', u'>']]
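
The tokenized output above shows that the HTML character references and the trailing `</p>` tag in the data are split into fragments such as `&`, `#`, `x2014` and `/p`, some of which survive every later filter. One possible way to avoid this is to clean the markup before tokenizing. The following is a Python 3 sketch using only the standard library; `clean_markup` and `TAG_PATTERN` are hypothetical names, and the simple regular expression assumes the data contains only plain tags like `</p>`.

```python
import html
import re

# Matches a simple HTML tag such as </p>; not a full HTML parser.
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_markup(text):
    """Drop HTML tags, then decode character references such as &#x2014;."""
    # Tags are removed first so that a decoded '<' is never mistaken for one.
    without_tags = TAG_PATTERN.sub(" ", text)
    return html.unescape(without_tags).strip()


sample = (
    "considerably weaker&#x2014;much closer. "
    "&#x201C;purely combinatorial&#x201D; methods. </p>"
)
print(clean_markup(sample))
```

Applied to each abstract before `word_tokenize`, this would keep tokens like `x2014` and `/p` out of the vocabulary.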

Remove stop words

from nltk.corpus import stopwords
englishStopwords = stopwords.words('english')
print len(englishStopwords)
abstractFilterStopwords = [[word for word in abstract if word not in englishStopwords] for abstract in abstractsTokenized]
print abstractFilterStopwords[:1]
[[u'25177', u'given', u'n', u'non-vertical', u'lines', u'3-space', u',', u'vertical', u'depth', u'(', u'above/below', u')', u'relation', u'contain', u'cycles', u'.', u'show', u'lines', u'cut', u'o', u'(', u'n3/2polylog', u'n', u')', u'pieces', u',', u'depth', u'relation', u'among', u'pieces', u'proper', u'partial', u'order', u'.', u'bound', u'nearly', u'tight', u'worst', u'case', u'.', u'consequence', u',', u'deduce', u'number', u'pairwise', u'non-overlapping', u'cycles', u',', u'namely', u',', u'cycles', u'whose', u'xy-projections', u'overlap', u',', u'o', u'(', u'n3/2polylog', u'n', u')', u';', u'bound', u'almost', u'tight', u'worst', u'case', u'.', u'previous', u'results', u'topic', u'could', u'handle', u'restricted', u'cases', u'problem', u'(', u'handling', u'triangular', u'cycles', u',', u'aronov', u',', u'koltun', u',', u'sharir', u',', u'cycles', u'grid-like', u'patterns', u',', u'chazelle', u'et', u'al', u'.', u')', u',', u'bounds', u'considerably', u'weaker', u'&', u'#', u'x2014', u';', u'much', u'closer', u'trivial', u'quadratic', u'bound', u'.', u'proof', u'uses', u'recent', u'variant', u'polynomial', u'partitioning', u'technique', u',', u'due', u'guth', u',', u'simple', u'tools', u'algebraic', u'geometry', u'.', u'much', u'straightforward', u'previous', u'&', u'#', u'x201c', u';', u'purely', u'combinatorial', u'&', u'#', u'x201d', u';', u'methods', u'.', u'approach', u'extends', u'eliminating', u'cycles', u'depth', u'relation', u'among', u'segments', u',', u'among', u'constant-degree', u'algebraic', u'arcs', u'.', u'hope', u'suitable', u'extension', u'technique', u'could', u'used', u'handle', u'much', u'difficult', u'case', u'pairwise-disjoint', u'triangles', u'well', u'.', u'results', u'almost', u'completely', u'settle', u'long-standing', u'(', u'35', u'years', u'old', u')', u'open', u'problem', u'computational', u'geometry', u',', u'motivated', u'hidden-surface', u'removal', u'computer', u'graphics', u'.', u'<', u'/p', u'>']]
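
Testing membership in a list scans the whole list for every token, so on a large corpus it may be worth holding the stop words in a set. A small Python 3 sketch of the idea; `SAMPLE_STOPWORDS` is a hypothetical stand-in for the NLTK list so that the example runs without downloading any corpus, and `remove_stopwords` is likewise an illustrative name.

```python
# Hypothetical mini stop list; the post uses nltk's stopwords.words('english').
SAMPLE_STOPWORDS = frozenset(
    ["in", "their", "can", "we", "that", "the", "be", "into"]
)


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Keep the tokens that are not in the stop word set."""
    # Set membership is a constant-time lookup on average.
    return [token for token in tokens if token not in stopwords]


tokens = ["we", "show", "that", "the", "lines", "can", "be", "cut", "into", "pieces"]
print(remove_stopwords(tokens))
```

With NLTK available, the same function could be called with `set(stopwords.words('english'))` as its second argument.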

Remove punctuation

englishPunctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%', '<', '>', '=', '{', '}', '+', '"', '-', '/']
abstracts = [[word for word in abstract if word not in englishPunctuations] for abstract in abstractFilterStopwords]
print abstracts[:1]
[[u'25177', u'given', u'n', u'non-vertical', u'lines', u'3-space', u'vertical', u'depth', u'above/below', u'relation', u'contain', u'cycles', u'show', u'lines', u'cut', u'o', u'n3/2polylog', u'n', u'pieces', u'depth', u'relation', u'among', u'pieces', u'proper', u'partial', u'order', u'bound', u'nearly', u'tight', u'worst', u'case', u'consequence', u'deduce', u'number', u'pairwise', u'non-overlapping', u'cycles', u'namely', u'cycles', u'whose', u'xy-projections', u'overlap', u'o', u'n3/2polylog', u'n', u'bound', u'almost', u'tight', u'worst', u'case', u'previous', u'results', u'topic', u'could', u'handle', u'restricted', u'cases', u'problem', u'handling', u'triangular', u'cycles', u'aronov', u'koltun', u'sharir', u'cycles', u'grid-like', u'patterns', u'chazelle', u'et', u'al', u'bounds', u'considerably', u'weaker', u'x2014', u'much', u'closer', u'trivial', u'quadratic', u'bound', u'proof', u'uses', u'recent', u'variant', u'polynomial', u'partitioning', u'technique', u'due', u'guth', u'simple', u'tools', u'algebraic', u'geometry', u'much', u'straightforward', u'previous', u'x201c', u'purely', u'combinatorial', u'x201d', u'methods', u'approach', u'extends', u'eliminating', u'cycles', u'depth', u'relation', u'among', u'segments', u'among', u'constant-degree', u'algebraic', u'arcs', u'hope', u'suitable', u'extension', u'technique', u'could', u'used', u'handle', u'much', u'difficult', u'case', u'pairwise-disjoint', u'triangles', u'well', u'results', u'almost', u'completely', u'settle', u'long-standing', u'35', u'years', u'old', u'open', u'problem', u'computational', u'geometry', u'motivated', u'hidden-surface', u'removal', u'computer', u'graphics', u'/p']]
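
A hand-written punctuation list is easy to leave incomplete. As an alternative sketch in Python 3, the standard library's `string.punctuation` can serve as the reference set. The helper names `is_punctuation` and `remove_punctuation` are hypothetical. Note one difference from the list-based filter: this version also drops multi-character tokens made up entirely of punctuation, such as `--`.

```python
import string

# All ASCII punctuation characters known to the standard library.
PUNCTUATION = set(string.punctuation)


def is_punctuation(token):
    """True when every character of a non-empty token is ASCII punctuation."""
    return bool(token) and all(char in PUNCTUATION for char in token)


def remove_punctuation(tokens):
    """Drop tokens that consist solely of punctuation characters."""
    return [token for token in tokens if not is_punctuation(token)]


tokens = ["o", "(", "n3/2polylog", "n", ")", ";", "above/below", "--", "/p"]
print(remove_punctuation(tokens))
```

Tokens that merely contain punctuation, such as `above/below` or `/p`, are kept, which matches the behavior seen in the output above.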

Stem the words

from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
abstractsStemed = [[st.stem(word) for word in abstract] for abstract in abstracts]
print abstractsStemed[:1]
[[u'25177', u'giv', u'n', u'non-vertical', u'lin', u'3-space', u'vert', u'dep', u'above/below', u'rel', u'contain', u'cyc', u'show', u'lin', u'cut', u'o', u'n3/2polylog', u'n', u'piec', u'dep', u'rel', u'among', u'piec', u'prop', u'part', u'ord', u'bound', u'near', u'tight', u'worst', u'cas', u'consequ', u'deduc', u'numb', u'pairw', u'non-overlapping', u'cyc', u'nam', u'cyc', u'whos', u'xy-projections', u'overlap', u'o', u'n3/2polylog', u'n', u'bound', u'almost', u'tight', u'worst', u'cas', u'prevy', u'result', u'top', u'could', u'handl', u'restrict', u'cas', u'problem', u'handl', u'triangul', u'cyc', u'aronov', u'koltun', u'sharir', u'cyc', u'grid-like', u'pattern', u'chazel', u'et', u'al', u'bound', u'consid', u'weak', u'x2014', u'much', u'clos', u'triv', u'quadr', u'bound', u'proof', u'us', u'rec', u'vary', u'polynom', u'partit', u'techn', u'due', u'guth', u'simpl', u'tool', u'algebra', u'geometry', u'much', u'straightforward', u'prevy', u'x201c', u'pur', u'combin', u'x201d', u'method', u'approach', u'extend', u'elimin', u'cyc', u'dep', u'rel', u'among', u'seg', u'among', u'constant-degree', u'algebra', u'arc', u'hop', u'suit', u'extend', u'techn', u'could', u'us', u'handl', u'much', u'difficult', u'cas', u'pairwise-disjoint', u'triangl', u'wel', u'result', u'almost', u'complet', u'settl', u'long-stand', u'35', u'year', u'old', u'op', u'problem', u'comput', u'geometry', u'mot', u'hidden-surface', u'remov', u'comput', u'graph', u'/p']]

Remove low-frequency words

allStem = sum(abstractsStemed, [])
# Find the stems that occur exactly once across all abstracts
stemOnce = set(stem for stem in set(allStem) if allStem.count(stem) == 1)
abstractOver = [[word for word in abstract if word not in stemOnce] for abstract in abstractsStemed]
print abstractOver[:1]
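
Flattening with `sum(..., [])` and calling `list.count` once per distinct stem both scale quadratically, which can become slow on a large collection of abstracts. A possible single-pass alternative, sketched in Python 3 with `collections.Counter`; the function name `drop_singletons` and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import chain


def drop_singletons(documents):
    """Remove tokens whose total count over all documents is exactly 1."""
    # One pass over every token builds the corpus-wide frequency table.
    counts = Counter(chain.from_iterable(documents))
    return [[token for token in doc if counts[token] > 1] for doc in documents]


docs = [
    ["cyc", "dep", "rel", "cyc"],
    ["dep", "bound", "graph"],
    ["graph", "lin"],
]
print(drop_singletons(docs))
```

Here `rel`, `bound` and `lin` each occur once in the whole collection and are removed, while `cyc` is kept because its two occurrences are counted across the corpus, not per document.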

 


Original post: http://www.cnblogs.com/wlc297984368/p/7750129.html
