In this paper, we introduce tools and methods for analyzing macro data with machine learning.
You will learn to prepare data and explore the deeper relationships within it, and to build, implement, and interpret some of the main machine learning models.
Before the discussion continues, the following topics need to be introduced; a brief definition of each is provided in this introduction.
- The map/reduce framework, a strategy for working with big data.
- The HDFS file system.
- The HADOOP ecosystem, which provides tools for processing big data and makes the framework more flexible.
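As a rough illustration of the map/reduce strategy (a pure-Python sketch, not the Hadoop API itself), here is a word count whose two functions play the map and reduce roles; the documents are invented for the example:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

docs = ["big data needs big tools", "data tools"]
word_counts = reduce_phase(map_phase(docs))
print(word_counts)  # {'big': 2, 'data': 2, 'needs': 1, 'tools': 2}
```

In a real Hadoop cluster the map and reduce phases run in parallel across many machines; the logical shape of the computation, however, is the same.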
These tools are used for the following:
- Query processing
- Producing statistical summaries
Knowledge discovery through modeling:
- A type of analysis used to find relationships and rules in data.
- Finding these relationships requires the knowledge of experts in the field of machine learning algorithms.
High-performance REAL-TIME platforms:
We will discuss high-performance real-time platforms such as HADOOP, SPARK, and KNIME. We need to know how to load data into these platforms, perform basic analyses on it, and draw general results from it.
Data is growing very fast, moving from terabytes and petabytes to exabytes and zettabytes. We know how to organize and control data, then analyze it and obtain statistical summaries from it. Now we come to the middle ground, ADVANCED ANALYTICS:
- BIG DATA MINING
- RULE MINING
and other techniques that we can apply to this data to gain additional knowledge.
Problems with big data:
1- The first is finding the hidden value and insights that we are looking for in this paper.
2- The second is gathering data from different sources. For example, the data in this paper comes from various sources, such as Twitter, Yelp, and others.
3- The third is understanding and recognizing the tools and platforms related to big data. In this paper, we discuss specific tools and platforms for finding knowledge and insight in data through data mining, forecasting, and other machine learning techniques.
What is the goal?
Our goal is to turn data into knowledge and insight in order to make better decisions.
We start with the data and apply some kind of analysis to it, looking for insights. Our main goal is to extract from the data the decisions that guide us to take action.
Traditional Analysis Methods: Traditional analysis methods are usually tied to a business in which a person has a large amount of data in a number of forms and wants to draw inferences from it to see how it can help the business. In most cases, many different methods are needed to perform the analysis, depending on the problem to be solved. For example, some analysts use traditional repositories while others use other methods, so there is little coordination between the data.
We also take a SUBSET of the data and try to uncover patterns that we can use in our business to enable new capabilities and functions, or to provide insights.
Big Data Analysis Methods: The ability to manage data at the petabyte level enables companies to work with clusters of data whose smallest details can affect their business. That is why we need new analytics engines that can handle this highly scattered data. These engines can provide results that optimize and resolve many of our business conflicts and problems. Predictive models help us answer questions such as: What might happen? How and why will it happen? After that, we can reach a higher level: optimization.
A closer look at machine learning:
There are many ways to talk about machine learning:
1. Humans first used new technologies alongside artificial intelligence and machine learning with computers, as when Alan Turing used such tools in World War II, which helped greatly in defeating the Nazis. Over time, some of these techniques have been used by various industries to detect fraud, predict the stock market, and so on.
2- The result of progress in these areas was the emergence of a science called data mining, or knowledge discovery in data.
3. Over time, data mining began to evolve, and a new science called predictive analysis emerged, including machine learning algorithms associated with business intelligence tools.
4- Then the above cases started to evolve and advanced analyzes were created.
5. Until today we are talking about data science.
So when people talk about data mining, they may use any of these phrases instead, which is roughly accurate.
Definition of data mining:
Extracting the desired knowledge from large databases.
When we have a large database, there are several ways to analyze the data in it, each with its own analytical tools:
- Top-down
- Bottom-up
The chart below discusses the passive or active role of software in presenting, exploring, and discovering knowledge.
What can be hidden in the data?
- Implicit rules
- Hidden categories in the data
- Predictions we can make from the data
- Anomalies in the data
- Clustering or grouping
Data mining is a multidisciplinary field that draws on:
- Artificial Intelligence
- Machine Learning algorithms
History of Data Mining:
The footprints and roots of data mining go back to three sciences:
- Statistics
- Artificial Intelligence
- Machine Learning
Statistics is the basis of most of the technologies on which data mining rests; there is no data mining without statistics.
Classical statistics includes concepts such as regression analysis, the normal distribution, standard deviation, variance, and cluster analysis, all of which are used to study data and the relationships between them.
Artificial intelligence is the second topic discussed in data mining. Artificial intelligence is based on heuristic exploration, so it is distinct from statistics and draws on statistics only in the manner of experience.
Machine learning: The third topic discussed in data mining is machine learning, which is the union of statistics and artificial intelligence.
Machine learning can be considered the evolution of artificial intelligence. Through machine learning, we allow computers and machines to learn from the data we give them, like a program whose decisions improve with the quality of the data it receives.
The purpose of artificial intelligence is to mimic the behavior of the human mind, and for this the machine or computer needs the ability to learn. That is, artificial intelligence is more general than machine learning, and machine learning is part of it.
Major Machine Learning Groups
1. Predictive Methods:
Use known variables to predict unknown values of other variables.
2. Descriptive methods:
Finding human-comprehensible patterns that describe the data.
Another topic in machine learning is learning with and without supervision.
Learning with supervision:
Like a teacher or expert who instructs us and answers our questions, we use labelled information to teach our model to find patterns, as we do in classification (for example, in a decision tree).
Learning without supervision:
Like when we have no information about the data other than the data itself: there are no category labels, and we can only cluster the data points that are most similar to each other.
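The difference between the two settings can be sketched in plain Python. The points, labels, and distance threshold below are invented for illustration: supervised learning copies the label of the nearest labelled example, while the unsupervised sketch groups points purely by similarity.

```python
# Supervised: labelled examples act as the "teacher".
labelled = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
            ((8.0, 9.0), "large"), ((9.0, 8.5), "large")]

def classify(point):
    """1-nearest-neighbour: copy the label of the closest training example."""
    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(labelled, key=lambda ex: sq_dist(ex[0], point))[1]

print(classify((1.1, 0.9)))  # "small"

# Unsupervised: no labels, only similarity. A point joins a group
# if it is close enough to that group's first member.
unlabelled = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
clusters = []
for p in unlabelled:
    for c in clusters:
        if (c[0][0] - p[0]) ** 2 + (c[0][1] - p[1]) ** 2 < 4.0:
            c.append(p)
            break
    else:
        clusters.append([p])
print(len(clusters))  # 2 groups emerge from similarity alone
```

Real algorithms (decision trees, k-means, and so on) are more sophisticated, but the split is the same: with labels we predict, without labels we group.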
How does machine learning work?
First we explore the data, then we find patterns, and based on these patterns we make the relevant predictions and use them for business, personal, or academic purposes.
What kind of knowledge and insight can be gained from data mining?
1. Predictive models:
There are several types of predictive models. For example, in classification, the system learns how to categorize data and make decisions based on the characteristics of each group.
2. Descriptive models:
We have several types of descriptive models, such as clustering, which groups the data by similarity so that decisions can be made based on the groups.
3- Discovering patterns and rules:
such as implicit rules, association rules, and sequential rules.
4- Discovering anomalies:
such as detecting and predicting fraud.
Applications of Machine Learning:
Machine learning has a wide range of applications:
from chemistry, physics, medicine, pharmacy, insurance, health, and smart cities to financial institutions, business, e-commerce, market analysis, management, entertainment, and sports. These are only a few of the uses of this science.
Preparing Data for Machine Learning:
We look at the principles of data preparation for applying machine learning algorithms to data.
Data collection and preparation is 80% of the work in machine learning and data modeling.
Data preparation is the process of organizing data, also called the wrangling phase. It includes cleaning, filtering, and converting data for modeling.
The final product of the data preparation work is a table like the one below, which is the feedstock for machine learning algorithms.
Rows are called samples, and columns are called variables, or features.
Some aspects relate specifically to big data. Sometimes we deal with tables that have a large number of rows; conceptually, such a table is still a list of rows and columns with values inside. In other cases a table may have a large number of columns. In big data, we usually work with databases in which the data is partitioned by rows, because the data grows in that dimension.
We assume that a single row fits within a machine's memory.
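The row-and-column table described above can be sketched with the standard library; the column names and values here are invented for the example:

```python
import csv
import io

# A hypothetical prepared table: each row is one sample,
# each column is one variable (feature).
raw = """age,income,bought
34,52000,yes
29,48000,no
41,61000,yes
"""

rows = list(csv.DictReader(io.StringIO(raw)))
print(len(rows))             # 3 samples
print(list(rows[0].keys()))  # ['age', 'income', 'bought']
```

This list-of-rows shape is exactly what row-partitioned big-data stores scale up: the column set stays fixed while the row count grows.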
There are three main activities in data preprocessing:
- Data cleaning: includes checking for empty, missing, or noisy values.
- Variable conversion: transformations used to represent variables differently (for example, combining two variables into one).
- Variable selection: choosing the variables we want to use to build the desired model.
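The three activities can be illustrated on a toy record set; the field names, values, and the derived variable are invented for the sketch:

```python
# Toy records with a missing income value (None).
records = [
    {"age": 34, "income": 52000, "city": "Tehran"},
    {"age": 29, "income": None,  "city": "Tabriz"},
    {"age": 41, "income": 61000, "city": "Shiraz"},
]

# 1. Data cleaning: drop rows containing missing values.
clean = [r for r in records if all(v is not None for v in r.values())]

# 2. Variable conversion: derive a new variable from existing ones
#    (income per year of age -- purely illustrative).
for r in clean:
    r["income_per_age"] = r["income"] / r["age"]

# 3. Variable selection: keep only the variables needed for modelling.
selected = [{k: r[k] for k in ("age", "income_per_age")} for r in clean]
print(len(selected))  # 2 rows survive cleaning
```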
When dealing with noisy data, we must be able to use the knowledge available in that domain to replace it with another value. This matters especially in big data, because there is so much more of it. Deciding whether records in two different datasets refer to the same entity is called entity resolution, or record linkage. On small data this can be done visually, but on big data we need filters so that we do not have to examine every possible pair of records.
Sometimes the data also contains statistical noise, which can appear as outliers. A histogram can help us see the distribution of the data, and sometimes it shows the boundary between outliers and the rest of the data. Where that boundary can be drawn, a threshold can be set; the statistical threshold should be guided by the knowledge available in that domain as part of the judgment, and values should be treated as outliers only when we are reasonably sure.
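One simple way to turn such a judgment into a threshold is sketched below; the values and the two-standard-deviation cutoff are invented for illustration, and in practice the cutoff should come from domain knowledge:

```python
import statistics

values = [10, 11, 9, 10, 12, 11, 10, 95]  # 95 looks suspicious

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag values more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)  # [95]
```

A histogram of `values` would show the same boundary visually: a tight cluster near 10 and one isolated bar near 95.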
Handling missing data:
Missing data comes in several kinds, which can be due to the nature of the data itself:
- Not applicable: the value simply does not exist; a person who is not married has no spouse's name.
- Not available: we know the value was not entered, and we do not know it.
When the number of samples is large, the missing entries appear random, and there are not many of them, we can simply remove those samples.
Delete a feature column:
If a column has a large number of empty entries, it can be deleted.
Alternatively, you can keep the NULL value and use algorithms that support NULL values, such as decision trees.
Using other values:
- You can use the rest of the values in the column to determine a replacement value.
- Using sophisticated models to determine missing values is the best strategy, provided it can be done quickly and accurately.
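The drop-versus-impute strategies above can be sketched on a single column; the values are invented, and mean imputation stands in for the simplest version of "using the rest of the values in the column":

```python
import statistics

# A column with missing entries represented as None.
col = [52000, None, 61000, 48000, None, 55000]

observed = [v for v in col if v is not None]

# Strategy A: drop the missing samples entirely.
dropped = observed

# Strategy B: impute with a value derived from the rest of the column
# (here the mean; the median is a common alternative).
mean = statistics.mean(observed)
imputed = [v if v is not None else mean for v in col]

print(len(dropped))  # 4
print(imputed[1])    # 54000
```

More sophisticated imputation replaces the column mean with a model that predicts the missing value from the other variables in the same row.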
Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship, and he holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.