Text Mining in Big Data Analytics

Text mining in big data analytics is emerging as a powerful tool for harnessing unstructured textual data by analyzing it to extract new knowledge and to identify significant patterns and correlations hidden in the data. This study seeks to determine the state of text mining research by examining developments in the literature published over recent years, and provides insights for practitioners and researchers on the predominant trends, methods, and applications of text mining research. To this end, more than 200 academic journal articles on the subject are included and discussed in this review; the state-of-the-art text mining approaches and techniques used for analyzing transcripts and speeches, meeting transcripts, and academic journal articles, as well as websites, emails, blogs, and social media platforms, across a broad range of application areas, are also investigated. The benefits and challenges related to text mining are also briefly outlined.



In recent years, we have witnessed an increase in the quantity of available digital textual data, which generates new insights and thereby opens up opportunities for research along new channels. In the rapidly evolving field of big data analytics, text mining has gained significant attention across a broad range of applications. In both academia and industry, there has been a shift towards research projects and more complex research questions that demand more than the simple retrieval of data. Due to the increasing importance of artificial intelligence and its implementation on digital platforms, the application of parallel processing, deep learning, and pattern recognition to textual information is crucial. Business models, market research, marketing plans, political campaigns, and strategic decision-making all face an increasing need for text mining techniques to address the competition.

Large amounts of textual data can be collected as part of a research project, such as scientific literature, transcripts in the marketing and economic sectors, speeches in the field of political discourse (for example, presidential campaign and inauguration speeches), and meeting transcripts. Furthermore, online sources, such as emails, web pages, blogs/microblogs, social media posts, and comments, provide a rich source of textual data for research [1]. Large amounts of data are also being collected in semi-structured forms, such as log files containing information from servers and networks. As such, text mining analysis is useful for both unstructured and semi-structured textual data [1].
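To make the notion of semi-structured text more concrete, the following minimal Python sketch separates the structured fields of a log line from its free-text message and then counts the words in the messages. The log layout, the sample lines, and the function names are assumptions chosen for illustration; real server and network logs vary widely in format.

```python
import re
from collections import Counter

# Hypothetical log layout, assumed for illustration only:
#   <date> <time> <LEVEL> <free-text message>
LOG_LINE = re.compile(r"^(\S+) (\S+) ([A-Z]+) (.*)$")


def parse_log_line(line):
    """Split one log line into structured fields plus a free-text message."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # the line does not follow the assumed layout
    date, time, level, message = match.groups()
    return {"date": date, "time": time, "level": level, "message": message}


def message_terms(records):
    """Count lowercase word tokens across the free-text messages."""
    counts = Counter()
    for record in records:
        counts.update(re.findall(r"[a-z]+", record["message"].lower()))
    return counts


# Made-up sample lines; the last one deliberately breaks the layout.
sample = [
    "2020-01-16 10:02:11 ERROR connection refused by host",
    "2020-01-16 10:02:15 INFO connection established",
    "not a log line",
]
records = [r for r in map(parse_log_line, sample) if r is not None]
terms = message_terms(records)
```

The structured fields (date, time, level) can be handled with ordinary data mining tools, while the message field is the part that calls for text mining.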

Data mining and text mining differ in the type of data they handle. While data mining handles structured data coming from systems such as databases, spreadsheets, ERP, CRM, and accounting applications, text mining deals with unstructured data found in documents, emails, social media, and the web. Thus, the difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts [2]. Since all written or spoken information can be represented in textual form, data mining requires all kinds of text mining tools when it comes to the interpretation and analysis of sentences, words, phrases, speeches, claims, adverts, and statements. This paper conducts an extensive analysis of text mining applications in big data analytics as used in various commercial fields and academic studies. While the vast majority of the literature deals with the optimization of a specific text mining technique, this paper seeks to summarize the features of all text mining methods, thereby summarizing the state-of-the-art practices and approaches in all the possible fields of application. It is centered around seven key applications of text mining: transcripts and speeches, meeting transcripts, and academic journal articles, as well as websites, emails, blogs, and social media networking sites. For each of these, we provide a description of the field, its functionality, the most commonly used methods, the associated problems, and the relevant references.
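The contrast between structured records and natural language text can be illustrated with a small sketch. The Python code below turns free text into a term-document count matrix (a bag-of-words representation), which is one common way of giving unstructured text a tabular form that data mining tools can consume. It is only an illustration under simple assumptions: the tokenizer, the function names, and the two sample sentences are invented for this example and are not taken from the reviewed studies.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (and apostrophes) as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Invented sample documents.
documents = [
    "Text mining extracts patterns from text.",
    "Data mining extracts patterns from databases.",
]
vocabulary, rows = term_document_matrix(documents)
```

Each row is now a fixed-length numeric record, so the two documents can be compared, clustered, or classified like any other structured data.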

The remaining sections of this paper are organized in the following manner. In Section 2, we introduce the topic of text mining in transcripts and speeches. We explain the different classification techniques used in, for instance, the analysis of political speeches that classify opinions or sentiments in a manner that allows one to infer from a text or speech the ideology that a speaker most probably espouses. Furthermore, we explain the methods used in classifying transcripts and speeches and identify the shortcomings of these methods, which are primarily related to the behavioral nature of human beings, such as ironic or ideological behavior. In Section 3, we take a closer look at blog mining and the dominance of news-related content in blogs and micro-blogging, and present the methods used in this area. Most of the methods applied in blog mining are based on dimensionality reduction, which is also found in other fields of text mining applications. Additionally, the relationship between blog mining and cybersecurity, which is an interesting and novel application of blog mining, is also covered in this section. In Section 4, we analyze email mining and the techniques commonly used for it. A very specific feature of email mining is its noisy data, which is discussed in this section. Moreover, we explain the challenges in identifying the content of the email body and how email mining is used in business intelligence. The web mining techniques that are used in screening and analyzing websites are studied in Section 5. The features of a website, such as links, links between websites, anchor text, and HTML tags, are also discussed. Moreover, the difficulty of capturing unexpected and dynamically generated patterns of data is explored, and the importance of pattern recognition and text matching in e-commerce is highlighted. In Section 6, we present studies conducted on the use of Twitter and Facebook and explain the role of text mining in marketing strategies based upon social media, as well as the use of social media platforms for the prediction of financial markets. In Sections 7 and 8, we round off our extensive analysis of text mining applications by exploring the text mining techniques used for academic journal articles and meeting transcripts. Section 9 discusses the important issue of extracting hidden knowledge from a set of texts and building hypotheses. Finally, in the concluding section, we highlight the advantages and challenges related to text mining and discuss its potential benefits to society and individuals.


New technologies have facilitated access to immense quantities of digital text, recording an ever-increasing share of human interaction, communication, and culture [236]. Text mining provides a framework to maximize the value of information within large quantities of text; consequently, the use of text mining technologies has increased steadily in recent years and has become highly diverse. This study has summarized the academic research efforts on text mining and its applications by examining the literature published over the past few years. Figure 1 shows the methods and applications discussed in this study. More than 200 academic journal articles on the subject were included and discussed in this review, alongside the state-of-the-art text mining approaches used for analyzing transcripts and speeches, meeting transcripts, and academic journal articles, as well as websites, emails, blogs/microblogs, and social media networking sites across a broad range of application areas.

In practice, text mining enables the efficient exploitation of textual data across a broad range of real-world applications, such as (a) supporting large companies in faster and better decision-making by providing insights on the performance of marketing/sales strategies, enhancing customer experience, monitoring and enhancing the product/service, and gaining better customer engagement, (b) analyzing documents and verbatim transcripts in the economics sector, (c) analyzing political discourse streams that may provide valuable insights into critical discourse analysis, (d) creating more reliable and effective filtering methods for emails and websites, and (e) identifying relationships between users and certain products for social media purposes, as well as examining opinions on particular topics or sentiments on certain events. At the same time, by mining immense amounts of information in scientific literature, researchers can discover patterns and links between resources that cannot be detected through ordinary human viewing and reading, provide more meaningful answers to complex research questions, and even support scientific discovery in various domains.

There is a push, however, towards applying text mining technologies to emerging, crucial issues. One of the most serious is the relatively recent phenomenon of cybercrime [237–240], which has a strong impact on citizens, societies, and economies [241–244]. There are several ways in which text mining can be utilized for security analytics. Emails can be analyzed to discern patterns in words and phrases, which may help identify a phishing attack. Websites can be scraped and analyzed to locate trends in themes that are related to security, such as the latest botnet threats, malware, and other Internet hazards [1]. More interestingly, social media offers a repository for intelligence-led policing operations; consequently, the law enforcement community is increasingly turning to social media monitoring to prevent and investigate crimes. Techniques such as text mining, NLP, and sentiment analysis provide a varied toolset that may assist in this direction [245]. Without pausing to address the approaches of previous studies regarding cybersecurity in this paper, it is relevant to note that there is no universally agreed-upon classification scheme that would contribute towards our understanding of cybercrime and serve as a useful tool for cybercrime stakeholders [246]. Recently, Donalds and Osei-Bryson [246] designed a new cybercrime classification scheme. Nevertheless, as they also pointed out, the use of text mining and artificial intelligence technologies on this new ontology should be explored.
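As a rough illustration of analyzing emails for patterns in words and phrases, the Python sketch below flags a message body against a few hand-written regular expressions. The cue names, the phrases, and the sample messages are hypothetical and chosen only for this example; they do not come from the studies cited above, and a practical phishing filter would typically learn such patterns from labeled data rather than rely on a fixed list.

```python
import re

# Hypothetical cue patterns, assumed for illustration only.
SUSPICIOUS_PATTERNS = {
    "urgency": re.compile(r"\b(urgent|immediately|within 24 hours)\b", re.I),
    "credentials": re.compile(r"\b(verify your account|password|login)\b", re.I),
    # A link that points at a raw IPv4 address instead of a domain name.
    "raw_ip_link": re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}"),
}


def phishing_cues(email_body):
    """Return the sorted names of all cue patterns found in the email body."""
    return sorted(
        name
        for name, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(email_body)
    )


# Made-up messages; 192.0.2.10 is a documentation-only address.
flagged = phishing_cues("URGENT: verify your account at http://192.0.2.10/login")
clean = phishing_cues("Lunch at noon tomorrow?")
```

The list of matched cues could serve as input features for a downstream classifier rather than as a verdict on its own.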

Another emerging, serious issue is the identification and detection of widespread misinformation on social media and websites. The use of mega-platforms, such as Facebook and Twitter, as vectors for spreading misinformation, e.g., during tragedies, national crises, or political campaigns, has been the subject of collective anxiety and a growing field of research [247–251]. Moreover, while one of the most beneficial values of text mining in big data analytics for businesses and governments is derived from the monitoring of human behavior and its predictive potential, the massive collection, instantaneous transmission, and combination and reuse of personal information for unforeseen purposes have placed new strains on strict adherence to the principles of data protection, which calls for a thorough consideration of their applications [252]. Serious ethical and legal concerns have been raised when text mining is executed over data of a personal nature [139,253,254].

As noted earlier in this paper, there are pertinent challenges to the text mining process. First, natural language is inherently ambiguous. It can also be argued that what are conventionally referred to as languages exhibit immense internal variability across geographical and social space [255]. Moreover, many textual data sources are rife with abbreviations, acronyms, and specialized language. Second, the world of emails and online social networking sites can be very noisy. It may contain a large number of non-words, unknown words, and grammatically poor or incoherent sentences, as well as bots and trolls. Furthermore, text mining is also subject to limitations concerning copyright, contracts, and licenses.
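The kind of noise described above is usually handled in a preprocessing step before any mining takes place. The following Python sketch shows one possible normalization pass for a short social media message: it removes links, mentions, and hashtags, shortens exaggerated character repetitions, and expands a few abbreviations. The abbreviation table and the sample message are hypothetical, and the specific rules are assumptions made for illustration rather than a recommended pipeline.

```python
import re

# Hypothetical abbreviation table, assumed for illustration only.
ABBREVIATIONS = {"u": "you", "gr8": "great", "thx": "thanks"}


def normalize(text):
    """Return a list of cleaned tokens from a noisy message."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)        # drop mentions and hashtags
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # shorten runs, e.g. "soooo" to "soo"
    tokens = re.findall(r"[a-z0-9']+", text)
    return [ABBREVIATIONS.get(token, token) for token in tokens]


# Made-up message for demonstration.
cleaned = normalize("Thx @bob!!! This is soooo gr8 http://example.com #win")
```

Rules like these reduce surface variation, but they cannot resolve genuine ambiguity of meaning, which is why ambiguity is listed above as a separate challenge.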

Another challenge in text mining arises when the method is employed in big textual data analysis. Since the size of big textual data grows rapidly, text mining methods should be compatible with scalable data platforms. In other words, the employed text mining methods should be able to reduce the dimensionality of the analyzed data and/or be compatible with distributed computational systems and databases [256,257].
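One technique that addresses both requirements at once is feature hashing (the "hashing trick"), sketched below in Python. It is offered as an example of the general idea and is not claimed to be the approach of the cited works. Every token is mapped to one of a fixed number of buckets, so the feature dimension stays bounded however large the vocabulary grows, and because no shared vocabulary is needed, vectors computed on separate machines can simply be added together. The function names and the bucket count are assumptions for illustration.

```python
import hashlib
import re


def hashed_features(text, n_buckets=16):
    """Map a text to a fixed-length count vector using feature hashing."""
    vector = [0] * n_buckets
    for token in re.findall(r"[a-z]+", text.lower()):
        # A stable hash is used so that every worker computes the same index
        # (Python's built-in hash() is randomized between processes).
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1
    return vector


def merge(vector_a, vector_b):
    """Combine partial vectors, e.g. from two shards of a distributed job."""
    return [a + b for a, b in zip(vector_a, vector_b)]


# Two shards of the same made-up corpus, processed independently and merged.
shard_1 = hashed_features("text mining at scale")
shard_2 = hashed_features("scalable text analytics")
combined = merge(shard_1, shard_2)
```

The price of the fixed dimension is that distinct tokens can collide in the same bucket; a larger `n_buckets` makes collisions rarer at the cost of longer vectors.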

In summary, text mining carries immense potential as a tool for retrieving and analyzing large-scale and complex data, and it can span a range of fields, disciplines, cultures, and languages. Not only are cutting-edge text mining technologies making significant improvements in performance and accuracy within the framework of artificial intelligence and deep learning, but text mining in big data analytics is also an evolving field with immense potential to advance science, encourage business growth in multiple industries, and support job growth. Moreover, text mining professionals are increasingly in demand. Furthermore, text mining may have the power to deliver significant insights to society and individuals, especially concerning public health [258,259], healthcare [260,261], and education [262–265], and help evaluate social issues, such as crime (including cybercrime) [245,266,267], child abuse [268], and poverty [269]. Nevertheless, timely action must be taken to resolve the legal, ethical, and privacy concerns raised by the use of personal data.


1 Research Institute of Energy Management and Planning, University of Tehran, Tehran 1417466191, Iran
2 Department of Tourism, Faculty of Economic Sciences, Ionian University, Kalypso Building, 4 P. Vraila Armeni, 49100 Corfu, Greece
3 Department of Economics and Business, Saint Anselm College, 100 Saint Anselm Drive, Manchester, NH 03103, USA
4 Department of Management, University of Tehran, Tehran 1417466191, Iran
5 Department of Accounting, Islamic Azad University, Central Tehran Branch, Tehran 1955847781, Iran
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2020, 4(1), 1; https://doi.org/10.3390/bdcc4010001
Received: 18 November 2019 / Revised: 11 January 2020 / Accepted: 11 January 2020 / Published: 16 January 2020
(This article belongs to the Special Issue Knowledge Modelling and Learning through Cognitive Networks)

Author Contributions:

Conceptualization, H.H., C.B., S.U., M.T.M., M.R.Y.; investigation, H.H., C.B., S.U., M.T.M., M.R.Y.; writing–original draft preparation, H.H., C.B., S.U., M.T.M., M.R.Y.; writing–review and editing, H.H., C.B., S.U., M.T.M., M.R.Y.; supervision, H.H. All authors have read and agreed to the published version of the manuscript.


Funding:

This research received no external funding.

Conflicts of Interest:

The authors declare no conflict of interest.


MDPI and ACS Style

Hassani, H.; Beneki, C.; Unger, S.; Mazinani, M.T.; Yeganegi, M.R. Text Mining in Big Data Analytics. Big Data Cogn. Comput. 2020, 4, 1.
