Text Duplicated-checking Algorithm Implementation Based on Natural Language Semantic Analysis


Natural language is a tool and medium for interaction between computers and humans. Text semantic analysis is now relatively mature: clustering algorithms, probabilistic graphical model algorithms, text mining algorithms, and other processing methods have all been implemented successfully. Building on natural language processing of text, this paper combines word2vec word-vector conversion with similarity calculation. Drawing on the semantic analysis ability of NLP, it constructs an optimized LDA model that applies the idea of importance sampling to extract topic words and uses cosine similarity to calculate the repetition rate. Comparing the results of the LDA model before and after optimization shows that the method achieves a satisfactory semantic duplicate-checking effect.

  • IEEE Keywords

    • Semantics
    • Analytical models
    • Training
    • Natural language processing
    • Monte Carlo methods
    • Vocabulary
    • Clustering algorithms
  • Author Keywords

    • semantic analysis
    • text vectorization
    • duplicated-checking model
    • importance sampling
    • similarity calculation


Natural language is an essential means of human communication. In computer science and artificial intelligence, natural language is also a tool and medium through which computers and humans interact. Text semantic analysis has matured considerably in recent years, and many processing methods have proved successful, such as clustering algorithms, probabilistic graphical model algorithms, and text mining algorithms. Many excellent open-source corpora, such as the UN parallel corpora and OPUS, are available to scholars implementing semantic analysis. Typical NLP applications today include chatbots, text clustering, and public-opinion analysis, and models derived from NLP, such as Word2vec, are also widely used in fields outside NLP. This paper analyzes the application of natural language processing to text-similarity calculation. Using its semantic analysis ability together with Word2vec word-vector conversion, it combines the idea of importance sampling with an LDA model to extract topic words and applies cosine similarity, addressing text preprocessing, topic modeling, similarity detection, and the other stages of the pipeline to achieve a more satisfactory semantic duplicate-checking result.
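The paper does not spell out how importance sampling is applied inside its optimized LDA model, but the underlying idea is standard Monte Carlo machinery: estimate an expectation under a target distribution p by sampling from an easier proposal q and reweighting by p(x)/q(x). A minimal sketch with an illustrative Gaussian target and proposal (both chosen here purely for demonstration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Goal: estimate E_p[x^2] where p = N(0, 1), whose true value is 1,
# by sampling from a different proposal q = N(1, 2).
f = lambda x: x ** 2

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = rng.normal(1.0, 2.0, 200_000)                          # draws from proposal q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 2.0)      # importance weights p(x)/q(x)
estimate = np.mean(w * f(x))

print(estimate)  # should be close to the true value 1.0
```

The same reweighting trick lets a topic model concentrate its sampling effort on high-importance regions instead of sampling uniformly, which is the spirit of the optimization the paper describes.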


The topic model originates from Latent Semantic Indexing (LSI) [1]. In 1999, Hofmann proposed Probabilistic Latent Semantic Indexing (pLSI), which is regarded as the first topic model in the true sense [2]. In 2003, David Blei and Andrew Ng proposed Latent Dirichlet Allocation (LDA) [3], an improvement on pLSI. LDA is now widely used as a topic model to infer the topic distribution of documents, giving the topics of each document in a collection in the form of a probability distribution [4]. Once the topic distribution of a document has been extracted, topic clustering or text classification is completed according to that distribution, either directly or in an extended form, and it is used in recommendation systems and many other natural language processing tasks [5]. Based on the LDA model and the Word2vec weight matrices, this paper designs a similarity analysis for natural-language text.

A. Sample Training

Word2vec is an unsupervised learning mechanism that relies on a large number of training samples. According to how sample features and labels are extracted, Word2vec has two implementations: CBOW (continuous bag of words) and skip-gram. CBOW predicts the probability of the current word from its context; skip-gram does the opposite, predicting the probability of the context from the current word. Both methods use an artificial neural network as the classifier. The neural network model of Word2vec amounts to fitting the vector calculation in formula (1):

{i₁, i₂, ⋯, iₙ} · W₁ = {h₁, h₂, ⋯, hₘ}
{h₁, h₂, ⋯, hₘ} · W₂ = {o₁, o₂, ⋯, oₙ}   (1)

According to the formula, i is the input vector, o is the output vector, and h is the hidden-layer vector, the intermediate result of the calculation; W₁ and W₂ are the weight matrices of the two layers. The training goal of the neural network is to find the W₁ and W₂ that fit the formula as well as possible. In short, each word at the input layer of Word2vec is mapped to the corresponding n-dimensional vector at the output layer through the weight matrices W.
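Formula (1) is just two matrix multiplications through a hidden layer. A minimal NumPy sketch (with illustrative vocabulary and hidden-layer sizes of my own choosing) shows why, for a one-hot input, the hidden vector h is simply one row of W₁, i.e. the word's embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_dim = 6, 3   # illustrative sizes, not from the paper

# one-hot input vector selecting word index 2
i = np.zeros(vocab_size)
i[2] = 1.0

# W1 projects input -> hidden, W2 projects hidden -> output, as in formula (1)
W1 = rng.standard_normal((vocab_size, hidden_dim))
W2 = rng.standard_normal((hidden_dim, vocab_size))

h = i @ W1   # hidden layer: exactly row 2 of W1 for a one-hot input
o = h @ W2   # raw output scores over the vocabulary

# softmax turns the scores into a probability distribution over words
p = np.exp(o - o.max())
p /= p.sum()

print(h)        # the "word vector" of word 2
print(p.sum())  # probabilities sum to 1
```

In real training, W₁ and W₂ are adjusted so that p assigns high probability to the correct context (CBOW) or the correct current word (skip-gram); after training, the rows of W₁ are the word vectors used for similarity calculation.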


The documents examined in this paper can be compared with a string-matching algorithm, but duplication is difficult to detect when the wording differs while the semantics are the same. An NLP-based semantic analysis algorithm solves this problem well. To further improve the accuracy of duplicate checking, this paper combines the topics extracted by Word2vec and the LDA model and uses the idea of importance sampling to train the model, improving the accuracy and recall of the algorithm. Cosine similarity is used for the calculation, so the normalization step can be omitted and the result is clearer.
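The final comparison step can be sketched directly. Cosine similarity measures only the angle between vectors, so it is scale-invariant and needs no separate normalization, which is the point made above. The topic-probability vectors below are hypothetical stand-ins for what an LDA model would produce:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical per-document topic distributions (e.g. from an LDA model)
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]   # near-duplicate of doc_a in topic space
doc_c = [0.05, 0.15, 0.80]   # different dominant topic

print(cosine_similarity(doc_a, doc_b))  # close to 1 -> likely duplicate
print(cosine_similarity(doc_a, doc_c))  # much smaller -> different content
```

A duplicate-checking threshold (say, flagging pairs above 0.9) would then be tuned on labeled data; the paper does not state the threshold it uses.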


This work is supported by the National Science Foundation for Young Scientists of China under Grant No. 11800313 and Fundamental Research Funds in Heilongjiang Provincial Universities under Grant No. 135209253.




X. Wang, X. Dong and S. Chen, "Text Duplicated-checking Algorithm Implementation Based on Natural Language Semantic Analysis," 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 2020, pp. 732-735.





Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.


Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship and holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.


Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.