1 / 2018-04-25 21:38:41
A Similar Entity Identification Method for Text Big Data Based on Spark Parallel Framework
Spark, Text big data, Similar entity identification, Graph theory
Full text under review
Tong YU / Northeast Electric Power University
Hongbiao LI / Northeast Electric Power University
To address the problem of identifying similar entities in high-dimensional, massive text data, a method based on the Spark parallel framework is proposed. First, the records corresponding to entities are converted into Simhash fingerprints (binary strings) using the Simhash algorithm, which maps high-dimensional text data to low-dimensional fingerprints. Second, a Simhash Fingerprint Recognition Strategy (SFRS) based on graph theory is designed to identify similar Simhash fingerprints and, through them, the corresponding records, thereby identifying similar entities. Finally, a similar entity identification algorithm based on SFRS and Spark is proposed and applied to high-dimensional, massive text data. A comparative experimental analysis on text data from UCI is conducted, and the results show the good performance and applicability of the proposed method.
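The pipeline outlined in the abstract (text records to Simhash fingerprints, then a graph over similar fingerprints) can be illustrated with a minimal single-machine sketch. This is not the authors' SFRS or their Spark algorithm, whose details are not given here. It assumes whitespace tokenization, MD5 as the per-token hash, equal token weights, a Hamming-distance threshold as the similarity test, and connected components (via union-find) as the graph step. The names simhash, hamming, and group_similar and the parameters bits and max_distance are hypothetical.

```python
import hashlib
from itertools import combinations


def simhash(text, bits=64):
    """Return a Simhash fingerprint (an int with `bits` bits, bits <= 128)."""
    weights = [0] * bits
    for token in text.lower().split():
        # Assumption: MD5 is used only as a stable token hash, not for security.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def group_similar(records, max_distance=3, bits=64):
    """Group record indices by connected components of the similarity graph.

    Two records share an edge when their fingerprints differ in at most
    `max_distance` bits (an assumed similarity rule).
    """
    prints = [simhash(r, bits) for r in records]
    parent = list(range(len(records)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(records)), 2):
        if hamming(prints[i], prints[j]) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

The all-pairs comparison here is quadratic in the number of records, so it only serves to show the idea on small inputs. A parallel version would need some way to partition or index fingerprints so that only candidate pairs are compared.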
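Important Dates
  • Conference dates: October 1 to October 3, 2018
  • April 28, 2018: Abstract submission deadline
  • May 10, 2018: Draft submission deadline
  • May 20, 2018: Draft acceptance notification
  • June 1, 2018: Final version submission deadline
  • October 3, 2018: Registration deadline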

Organizer
V. A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences