Tamannas Siddiqui and Abdullah Yahya Abdullah Amer
Department of Computer Science, Aligarh Muslim University, Aligarh, UP, India
Text data mining techniques are essential tools for dealing with raw text data, which is seen as a future fortune. Text data mining extracts knowledge and information from unstructured text; its fundamental principle is to produce relevant insights by analyzing a huge volume of raw data with Artificial Intelligence (AI), natural language processing (NLP), and Machine Learning algorithms. These capabilities have attracted contemporary business applications, which benefit from them in their global operations. This paper presents a brief review of text mining techniques, such as clustering, information extraction, text preprocessing, information retrieval, and text classification, together with text mining applications, to demonstrate the significance of text mining, the predominant text mining techniques, and the predominant contemporary applications that use text mining. The review covers various existing algorithms, text feature extraction methods, compression methods, and evaluation techniques. Finally, we used a spam dataset for classification and detection, applying three classification algorithms with TF-IDF feature extraction; with this model, Naïve Bayes achieved higher accuracy. Illustrations of text classification as an application in areas such as medicine, law, and education are also presented.
Keywords: Text mining, text classification, spam detection, text preprocessing, text analysis
Text data mining techniques are predominantly used for extracting relevant and associated patterns based on specific words or sets of phrases. Text data mining is associated with text clustering, text classification, the production of granular taxonomies, sentiment analysis, entity relation modeling, and document summarization [1]. Prominent text mining techniques include extraction, summarization, categorization, retrieval, and clustering. These techniques are used to infer high-quality, previously unknown knowledge from different written resources such as books, emails, reviews, and articles with the help of information retrieval, linguistic analysis, pattern recognition, information extraction, and tagging and annotation [2]. Text preprocessing is the predominant functionality in text data mining. It is essential for bringing the text into a form that is predictable and analyzable. Text preprocessing is carried out in several phases, namely lowercasing, lemmatization, stemming, stop word removal, and tokenization. These preprocessing steps are predominantly performed ahead of machine learning algorithms for natural language processing tasks. They clean and transform the data to eliminate outliers and standardize it, so that a suitable model can be built for the text data mining process [3]. Text data mining techniques are predominantly used for records management, distinct document searches, e-discovery, organizing large sets of text data, analyzing and monitoring online text in internet communication and blogs, identifying large textual datasets associated with patients in clinical settings, and clarifying knowledge for readers through a better search experience [4].
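As a minimal sketch of the preprocessing phases listed above, the following Python code chains lowercasing, tokenization, stop word removal, and a crude suffix-stripping stemmer. The stop word list, the suffix rules, and all function names are illustrative assumptions rather than the method used in this paper; lemmatization is omitted because it needs a dictionary or a trained model.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run lowercasing, tokenization, stop word removal, and stemming."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The miners are extracting patterns in the documents"))
# prints ['miner', 'extract', 'pattern', 'document']
```

A production pipeline would normally replace the toy stemmer with an established one (for example a Porter-style stemmer) and use a language-specific stop word list.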
Text data mining techniques are predominantly used in scientific literature mining; business, biomedical, and security applications; computational sociology; and digital humanities, as shown in Figure 1.1 below.
Figure 1.1 Overview of text classification.
Table 1.1 Comparison of text classification model classifiers.
The paper reviews text data mining techniques, the various steps involved in text preprocessing, and multiple applications that implement the text data mining methods discussed in Table 1.1.
Text Mining (TM) deals with the informational content of sources such as newspapers, books, social media posts, emails, and URLs. Text summarization and classification are typical applications of text mining across different fields. It is appropriate to discuss some of the techniques applied to achieve them, following the set of steps shown in Figure 1.2 below.
Text mining is employed in big data analytics to analyze unstructured textual data, extract new knowledge, and distinguish significant patterns and correlations hidden in huge data sets. Big data analytics is predominantly used for extracting, automatically or semi-automatically, the information and patterns that are hidden implicitly in data sets held in unstructured formats or natural language texts. To perform these text mining operations, unsupervised and supervised learning algorithms are predominantly used. These methods perform classification and prediction by using a set of predictors to reveal hidden structures in the information database [5]. In this process, text mining is performed using pattern matching on regular documents and unstructured manuscripts [6].
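To illustrate the supervised side of this process, the sketch below implements a multinomial Naïve Bayes text classifier with add-one smoothing in plain Python. It works on raw word counts rather than TF-IDF weights, and the class name, the four training messages, and their labels are hypothetical examples, not the spam dataset used in this paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        # Count documents per class and word occurrences per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, document):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in document.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, for illustration only.
train_docs = [
    "win a free prize now",
    "free cash offer click now",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict("claim your free prize"))      # prints spam
print(model.predict("project meeting on monday"))  # prints ham
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the add-one smoothing keeps a word that is unseen in one class from forcing that class's probability to zero.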
Figure 1.2 Text data mining techniques.
Information Retrieval (IR) is a prominent method among text data mining techniques. The fundamental principle of IR is to identify, within a large collection of documents stored in unstructured formats, those documents that meet a given information need. IR is available in three models: the Boolean Model, the Vector Model, and the Probabilistic Model. In text data mining, IR plays a vital role through its indexing system and document collection [7]. This method is predominantly used for locating a specific item in natural language documents. IR is also used for knowledge extraction, converting text into structured data so that interesting relationships can be mined [8]. Discovering appropriate patterns and analyzing text records in huge amounts of data has been identified as a major issue. IR addresses this issue by selecting interesting patterns from very large knowledge data sets. IR techniques are predominantly used for choosing the appropriate text documents from a huge volume of databases quickly. IR extracts the required text documents from very large databases and presents accurate and relevant results [9].
NLP is a subfield of linguistics, computer science, and AI. The fundamental principle of NLP is to handle the interaction between computers and humans, enabling machines to read, interpret, learn, and make sense of human languages in a valuable way. It is powered by AI, which enables machines to read, understand, interpret, manipulate, and derive meaning from human languages [10]. It is a prominent AI technology used in text data mining to transform the unstructured text found in documents and databases into normalized, structured data suitable for analysis or for machine learning algorithms [11]. Long Short-Term Memory (LSTM) is one of the predominant machine learning models; it is a recurrent neural network that can remember values over long sequences. The Seq2seq model is another predominant model used in NLP, and it works with an encoder-decoder structure. This model first builds a vocabulary list to identify the correct grammar syntax. It works with tags to identify the structured and unstructured language found in the documents. The named entity recognition model is another predominant model, used to identify relevant names and classify them by their entity type. It is used to find the names of people, names of places, and any other important entities in the given text or documents. The NLP process features a User Preference Graph [12], which is used to build a set of user preferences. While the document is written, the repeatedly chosen tenses, adjectives, conjunctions, and prepositions are identified, and NLP creates a User Preference Graph. Based on the graph, it...
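The Vector Model mentioned above can be sketched as follows: documents and the query are represented as TF-IDF weighted term vectors, and documents are ranked by cosine similarity to the query. This is a minimal illustration under stated assumptions; the function names, the smoothed IDF variant (log(N/df) + 1), and the three sample documents are hypothetical and do not come from the cited works.

```python
import math
from collections import Counter


def weight(tokens, idf):
    """Build a sparse TF-IDF vector (term -> weight) for known terms."""
    term_freq = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in term_freq.items()}


def build_index(documents):
    """Compute IDF values and one TF-IDF vector per document."""
    token_lists = [doc.lower().split() for doc in documents]
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    # Smoothed IDF variant (an assumption): log(N / df) + 1.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = [weight(tokens, idf) for tokens in token_lists]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, documents):
    """Return indices of matching documents, most similar first."""
    idf, vectors = build_index(documents)
    query_vector = weight(query.lower().split(), idf)
    scored = [(cosine(query_vector, v), i) for i, v in enumerate(vectors)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ranked if score > 0.0]


# Hypothetical sample collection, for illustration only.
DOCS = [
    "text mining extracts patterns from text",
    "information retrieval finds relevant documents",
    "clustering groups similar documents",
]

print(search("relevant documents", DOCS))  # prints [1, 2]
```

The second document ranks first because it contains both query terms, including the rarer and therefore more heavily weighted term "relevant"; the first document shares no term with the query and is not returned.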