The latest and most comprehensive big data / artificial intelligence terms and terminology in English for 2021 and 2022 (highly recommended to bookmark)
A
2. Apache Mahout: Mahout provides a library of pre-made algorithms for machine learning and data mining, and can also be used as an environment for creating more algorithms. In other words, the best environment for machine learning geeks.
7. Apache Pig: Pig is a platform for creating, querying, and executing routines on large distributed data sets. The scripting language used is called Pig Latin (I am definitely not making this up, believe me). It is said that Pig is easy to understand and learn, but how much of it people actually learn is another question.
8. Apache Sqoop: A tool for transferring data between Hadoop and non-Hadoop data stores (such as data warehouses and relational databases).
13. Anomaly detection: The search for data items in a data set that do not match an expected pattern or behavior. Besides "anomalies", such items are also called outliers, exceptions, surprises, or contaminants, and they often provide critical, actionable information.
16. Analytics: Used to discover the inner meaning of data. Let us imagine a possible situation: your credit card company sends you an email listing all the transfers made on your card throughout the year. If you then take this list and start studying carefully what percentage you spent on food, clothing, entertainment, and so on, you are doing analysis work. You are mining useful information from your raw data (information that can help you make decisions about your spending in the coming year). So what if you apply a similar approach to the posts that people across an entire city make on Twitter and Facebook? In that case, we can call it big data analytics. Big data analytics is about reasoning over a large amount of data and extracting useful information from it. There are three different types of analysis methods below. Now let's sort them out separately.
B
17. Batch processing: Although batch data processing has existed since the mainframe era, it has gained extra significance in the big data era, where large amounts of data must be processed. Batch processing is an efficient way to handle large volumes of data (for example, a set of transactions collected over a period of time). Distributed computing (Hadoop), which will be discussed later, is a method that specializes in processing data in batches.
18. Behavioral Analytics: Have you ever wondered how Google manages to serve ads for exactly the products/services you need? Behavioral analytics focuses on understanding what consumers and applications do, and how and why they act in a certain way. It involves understanding our browsing patterns, social media interactions, and online shopping activity (shopping carts, etc.), connecting these seemingly unrelated data points and trying to predict outcomes. For example, after I found a hotel and emptied my shopping cart, I received a call from the resort's booking hotline. Need I say more?
19. Business Intelligence: I will reuse Gartner's definition of BI because it explains it well. Business intelligence is an umbrella term that includes the applications, infrastructure, tools, and best practices that enable access to and analysis of information in order to improve and optimize decisions and performance.
20. Biometrics: A James Bond-ish technology combined with analytics to identify people by one or more physical characteristics of the human body, such as facial recognition, iris recognition, fingerprint recognition, etc.
21. Descriptive Analytics: If you only report that your credit card spending last year was 25% on food, 35% on clothing, 20% on entertainment, and the remaining 20% on miscellaneous expenses, this analysis method is called descriptive analytics. Of course, you can also dig out more details.
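To make the credit-card example concrete, here is a minimal descriptive-analytics sketch in Python with pandas (the column names and amounts are made up for illustration):

```python
# A minimal descriptive-analytics sketch: spending share per category.
import pandas as pd

transactions = pd.DataFrame({
    "category": ["food", "clothing", "entertainment", "food", "misc"],
    "amount":   [120.0,  80.0,       40.0,            60.0,   50.0],
})

# Total spend per category, expressed as a percentage of the overall total.
totals = transactions.groupby("category")["amount"].sum()
percentages = (totals / totals.sum() * 100).round(1)
print(percentages.sort_values(ascending=False))
```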
22. Big Data Scientist: A person who can design big data algorithms to make big data useful
23. Big data startup: Refers to emerging companies that develop the latest big data technology
24. Brontobytes (BB): Approximately equal to 1,000 YB (yottabytes), roughly the size of the digital universe of the future. 1 brontobyte is a 1 followed by 27 zeros!
25. Big data: Refers to massive, fast-growing, and diversified information assets that require new processing models in order to deliver stronger decision-making power, insight and discovery, and process optimization capabilities.
26. Data science platforms: A working platform where data scientists create and test data science solutions. According to Gartner's definition, a data science platform is "a software system composed of a number of closely related core data processing technology modules that supports the development of various data science solutions and their use in business processes, surrounding infrastructure, and products."
C
27. Clickstream analytics: Used to analyze users' online click data as they browse the web. Have you ever wondered why certain Google ads keep following you around even after you switch sites? Because Google knows what you have been clicking on.
28. Cluster Analysis: An exploratory analysis that tries to identify structure in data, also known as segmentation analysis or classification analysis. More specifically, it tries to identify homogeneous groups of cases, i.e., observations, participants, or respondents. Cluster analysis is used to identify groups of cases when the grouping is not known beforehand. Because it is exploratory, it makes no distinction between dependent and independent variables. The different cluster analysis methods provided by SPSS can handle binary, nominal, ordinal, and scale (interval or ratio) data.
29. Comparative Analytics: Since the key to big data lies in the analysis, comparative analytics, as the name suggests, uses statistical techniques such as pattern analysis, filtering, and decision-tree analysis to compare multiple processes, data sets, or other objects. I know this is getting less and less technical, but I still can't completely avoid the jargon. Comparative analytics can be used in healthcare, where comparing large numbers of medical records, files, images, and so on leads to more effective and accurate medical diagnoses.
30. Connection Analytics: You have probably seen those spider-web-like graphs that connect people and topics to determine the influencers on a particular topic. Connection analytics helps discover the connections and influence between people, products, and systems within a network, or even the combination of data across multiple networks.
32. Cloud computing: A distributed computing system built on the network. Data is stored off-premises (i.e., in the cloud), software or data is processed on remote servers, and these resources can be accessed from anywhere on the network; that is what we call cloud computing.
33. Cluster computing: A descriptive term for computing on a cluster that pools the rich resources of multiple servers. A more technical understanding is that, in the context of cluster processing, we may discuss nodes, the cluster management layer, load balancing, parallel processing, and so on.
34. Classification analysis: The systematic process of obtaining important relevance information from data; this kind of data is also called metadata, i.e., data that describes data.
35. Commerce analytics: Refers to examining whether estimated sales, costs, and profits have reached the company's targets; if they have, the product concept can advance to the product development stage.
36. Clustering analysis: The process of grouping similar objects together, with each group of similar objects forming a cluster. The purpose of this analysis method is to analyze the differences and similarities between data points.
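To make the idea concrete, here is a minimal clustering sketch using scikit-learn's k-means (assuming scikit-learn and NumPy are installed; the 2-D points are made up for illustration):

```python
# A minimal k-means clustering sketch; similar points end up in the same cluster.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # one group of nearby points
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],   # a second, far-away group
])

# Ask for two clusters; fit_predict returns the cluster label of each point.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1]
```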
37. Cold data storage: Storing old, rarely used data on low-power servers. Retrieving this data, however, is time-consuming.
38. Crowdsourcing: The practice of obtaining desired ideas, services, or content contributions from a wide range of groups, especially online communities.
39. Cluster server: Multiple servers connected by fast communication links. From the outside, these servers appear to work as a single server; internally, the external load is dynamically distributed to the node machines through some mechanism, thereby achieving the high performance and high availability of a "super server".
40. Comparative analysis: A step-by-step comparison and calculation process carried out when performing pattern matching on a very large data set in order to obtain an analysis result.
42. Computer-generated data: Data generated by computers, such as log files.
43. Concurrency: Executing multiple tasks or running multiple processes at the same time.
45. Customer Relationship Management (CRM): A technology used to manage sales and business processes. Big data will affect a company's customer relationship management strategy.
46. Cloud data: A general term for technologies and platforms that, based on the cloud computing business model, perform data integration, data analysis, data distribution, and data early warning.
D
48. Data Cleansing: As the name implies, data cleansing involves detecting and correcting or deleting inaccurate data or records in a database; remember the term "dirty data". With the help of automated or manual tools and algorithms, data analysts can correct and further enrich the data to improve its quality. Remember that dirty data leads to incorrect analysis and poor decision-making.
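As a minimal sketch of what cleansing can look like in practice (using pandas; the column names and example records are made up):

```python
# A minimal data-cleansing sketch: drop duplicates, fix types, handle missing values.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["alice", "alice", "bob", None],
    "amount":   ["10.5", "10.5", "not_a_number", "7.0"],
})

clean = raw.drop_duplicates()                                  # remove duplicate records
clean = clean.dropna(subset=["customer"]).copy()               # drop rows with no customer
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")   # invalid numbers -> NaN
clean["amount"] = clean["amount"].fillna(clean["amount"].median())  # fill missing amounts
print(clean)
```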
49. Data as a Service (DaaS): By giving users on-demand access to data in the cloud, DaaS providers help us obtain high-quality data quickly.
51. Dirty Data: Data that is unclean, in other words inaccurate, duplicated, or inconsistent. Obviously, you don't want to get mixed up with dirty data, so fix it as soon as possible.
52. Dark data: All the data a company collects and processes but never actually uses. In this sense we call it "dark" data, and it may never be analyzed at all. It can be information from social networks, call center records, meeting notes, and so on. Many estimates suggest that 60% to 90% of all company data may be dark data, but no one really knows.
54. Data lake: An enterprise-level repository that stores large amounts of data in its raw, original formats. While we are at it, let's introduce the data warehouse. A data warehouse is a concept similar to the data lake, except that it stores structured data that has been cleaned up and integrated with other sources. Data warehouses are often used for general-purpose data (though not necessarily). It is generally believed that a data lake makes it easier to get at the data you really need, and to process and use it more conveniently.
55. Data Resources Management: The management activity of applying information technology and software tools to accomplish the task of managing an organization's data resources.
56. Data Source: As the name implies, a data source is the device or original medium from which required data comes. All the information needed to establish a database connection is stored in the data source. Just as you find a file in a file system by giving its file name, you find the corresponding database connection by giving the correct data source name.
57. Data mining: Use complex pattern recognition techniques to find meaningful patterns from a large group of data and get relevant insights.
58. Data analyst platforms: These platforms integrate a company's internal operational support systems with external data, including transaction-oriented big data (Big Transaction Data) and interaction big data (Big Interaction Data), and use a variety of cloud computing technologies to integrate and process it, providing internal and external customers with information support and intelligent solutions of great commercial value. On top of the data warehouse built on the big data platform, they provide reporting and analysis tools and deliver solution implementation services tailored to the company's actual needs; enterprise managers, business analysts, and others can access them via the web, mobile phones, or other mobile devices to keep track of key enterprise indicators and conduct in-depth business analysis at any time.
59. Distributed File System: Big data is often too large to store on a single system, so a distributed file system is a file system that stores large amounts of data across multiple storage devices. It reduces the cost and complexity of storing large amounts of data.
60. Dashboard: Use algorithms to analyze data and display the results on the dashboard in a graphical manner.
61. Data access: Refers to implementing and maintaining the storage organization and access paths for data in a database.
62. Data transfer: refers to the process of transferring data between a data source and a data sink, also known as data communication.
70. Data modeling: Use data modeling techniques to analyze data objects to gain insight into the inner meaning of data.
72. Data virtualization: The process of integrating data to obtain more information from it; this process usually involves other technologies, such as databases, applications, file systems, web technologies, big data technologies, and so on.
75. Distributed File System: A system that provides a simplified, highly available way to store, analyze, and process data.
77. Data Governance: Data governance refers to the transition from using scattered data to using unified master data, from having little or no governance of organization and processes to comprehensive enterprise-wide data governance, and from struggling with master data in a chaotic state to having master data that is well organized.
78. Data Transfer Service: Mainly used to convert data between different databases, such as converting data between SQL Server and Oracle.
79. Data integration: Logically or physically bringing together data of different sources, formats, and characteristics so as to provide the enterprise with comprehensive data sharing.
E
80. ETL: ETL stands for extract, transform, and load. It refers to the process of "extracting" raw data, "transforming" it into a form "suitable for use" through cleaning/enrichment, and "loading" it into the appropriate repository for the system to use. Even though ETL originated with data warehouses, the process is also used when ingesting data, for example when acquiring data from external sources into a big data system.
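A minimal sketch of the three steps in Python, assuming a hypothetical CSV source file ("sales_raw.csv") and a SQLite table as the target (both names are made up for illustration):

```python
# A minimal ETL sketch: extract from CSV, transform in pandas, load into SQLite.
import sqlite3
import pandas as pd

# Extract: read raw records from a (hypothetical) source file.
raw = pd.read_csv("sales_raw.csv")            # assumed columns: customer, amount

# Transform: clean and enrich into a shape suitable for use.
clean = raw.dropna(subset=["customer"]).copy()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce").fillna(0.0)
clean["amount_usd_cents"] = (clean["amount"] * 100).astype(int)

# Load: write the result into the target store.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="replace", index=False)
```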
81. Enterprise applications: A term commonly used within the software industry. Explained in plain language, it is a stable, secure, and efficient computer-based distributed information management system used within an enterprise.
82. Exploratory analysis: Exploring patterns in data without standard procedures or methods. It is a way to discover the main characteristics of data and data sets.
83. Exabytes (EB): Approximately equal to 1,000 PB (petabytes), or about 1 billion GB. The world now produces roughly 1 exabyte of new information every day.
84. Extract, Transform and Load (ETL): A process used with databases or data warehouses: extract (E) data from various data sources, transform (T) it into data that meets business needs, and finally load (L) it into the database.
85. Enterprise productivity: The ability of an enterprise to provide a certain product or service to the society in a certain period of time.
F
86. Fuzzy logic: How often are we completely certain about anything, say 100% correct? Very rarely! Our brains aggregate data into partial truths, and those partial truths are further abstracted into the thresholds that determine our decisions. Fuzzy logic is a way of computing that, in contrast to the hard "0" and "1" of Boolean algebra, aims to imitate the human brain by working with these partial truths.
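As a minimal sketch of the idea in pure Python (with a made-up "hot weather" membership function and made-up thresholds), a value can belong to a set partially rather than absolutely:

```python
# A minimal fuzzy-logic sketch: degrees of membership instead of hard true/false.
def hot_membership(temp_c: float) -> float:
    """Degree (0..1) to which a temperature counts as 'hot' (made-up thresholds)."""
    if temp_c <= 20:
        return 0.0
    if temp_c >= 35:
        return 1.0
    return (temp_c - 20) / 15          # linear ramp between 20C and 35C

for t in (15, 25, 30, 40):
    print(f"{t}C is 'hot' to degree {hot_membership(t):.2f}")
```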
87. Failover: When a server in the system fails, the running task can be automatically switched to another available server or node.
G
93. Graph Databases: Databases that use graph structures (for example, a finite set of ordered pairs, or certain kinds of entities) to store data. The graph storage structure consists of edges, properties, and nodes. It provides index-free adjacency between neighboring nodes, meaning each element in the database is directly linked to its adjacent elements.
94. Grid computing: Connecting many computers distributed in different locations to work on a specific problem, usually by linking the computers together through the cloud.
H
I
107. In-memory computing: It is generally true that any computation that avoids I/O access is faster. In-memory computing is such a technique: it moves the entire working data set into the cluster's collective memory, avoiding writes of intermediate results to disk during computation. Apache Spark is an in-memory computing system, and it has great advantages over I/O-bound systems such as MapReduce.
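A minimal PySpark sketch of the idea (assuming a local Spark installation; the input file "events.csv" and its "value" column are hypothetical): cache() keeps the data set in cluster memory so repeated computations skip the disk.

```python
# A minimal in-memory computing sketch with PySpark: cache a data set in RAM.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical input file; header/inferSchema are standard CSV reader options.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

df.cache()                 # keep the data set in the cluster's memory
print(df.count())          # first action materializes and caches the data
print(df.filter(df["value"] > 10).count())  # later work reads from memory, not disk
spark.stop()
```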
L
111. Load balancing: A tool for distributing workload across two or more computers in a network so that all users get service faster and finish their work sooner. This is the main reason computer server clusters exist. It can be implemented with software, hardware, or a combination of both.
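A minimal sketch of the simplest balancing strategy, round robin, in pure Python (the backend server names are made up):

```python
# A minimal round-robin load-balancing sketch: requests rotate across servers.
from itertools import cycle

servers = cycle(["server-a", "server-b", "server-c"])   # made-up backend names

def next_backend() -> str:
    """Return the backend that should handle the next request (round robin)."""
    return next(servers)

for i in range(6):
    print(f"request {i} -> {next_backend()}")
# request 0 -> server-a, request 1 -> server-b, request 2 -> server-c, then repeats
```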
112. Linked Data
113. Location Analytics
M
115. Metadata
116. MongoDB
117. Multi-Dimensional Database (MDB)
118. Multi-Value Database
119. Machine-Generated Data
Machine-generated data is information produced by machines (computers, applications, processes, or other non-human mechanisms). It is sometimes described as amorphous data, because humans rarely modify or change it.
120. Machine Learning
Machine learning is a field of computer science that uses statistical techniques to give computers the ability to "learn" from data. Machine learning is used to uncover hidden opportunities in big data.
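As a minimal sketch of what "learning from data" means in code (scikit-learn, with made-up numbers): fit a model on known examples, then let it predict unseen cases.

```python
# A minimal machine-learning sketch: fit a linear model on made-up data, then predict.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # inputs (e.g. hours of ad spend)
y = np.array([2.1, 4.0, 6.2, 7.9])           # observed outcomes (e.g. sales)

model = LinearRegression().fit(X, y)          # "learn" the relationship from data
print(model.predict(np.array([[5.0]])))       # predict the outcome for an unseen input
```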
121. MapReduce
MapReduce is a processing technique for handling large data sets with a parallel, distributed algorithm on a cluster. A MapReduce job has two parts: the "map" function splits the query into pieces and processes the data at the node level, and the "reduce" function collects the results of the "map" function and finds the answer to the query. MapReduce is used to process big data when combined with HDFS; this coupling of HDFS and MapReduce is called Hadoop.
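A minimal pure-Python sketch of the two phases, using the classic word-count example (no actual Hadoop cluster involved; this only illustrates the map and reduce roles):

```python
# A minimal map/reduce sketch: word count over a few made-up lines of text.
from collections import defaultdict

lines = ["big data is big", "data is everywhere"]

# Map phase: each line is turned into (word, 1) pairs, independently and in parallel.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle + reduce phase: group pairs by key and sum the counts per word.
counts = defaultdict(int)
for word, count in mapped:
    counts[word] += count

print(dict(counts))   # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```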
123. Mahout
Apache Mahout is an open source data mining library. It provides data mining algorithms for regression, classification, clustering, and statistical modeling, and implements them using the MapReduce model.
N
124. Network Analysis
Network analysis is an application of graph theory used to classify, understand, and visualize the relationships between nodes in a network. It is an effective way to analyze connections and assess their strength in any field, such as forecasting, market analysis, and healthcare.
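A minimal sketch with the networkx library (assuming it is installed; the people and edges are made up): build a small graph and rank nodes by degree centrality to spot the most connected one.

```python
# A minimal network-analysis sketch: find the most connected node in a small graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"),
    ("alice", "dave"), ("bob", "carol"),
])

centrality = nx.degree_centrality(G)          # fraction of other nodes each node touches
most_connected = max(centrality, key=centrality.get)
print(centrality)
print("most connected:", most_connected)      # alice, with the most direct links
```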
125. NewSQL
NewSQL is a modern relational database management system that can provide the same scalable performance as NoSQL systems for OLTP read/write workloads. It is a well-defined database system and easy to learn.
126. NoSQL
Widely referred to as "not only SQL", NoSQL describes database management systems that depart from the relational database management system. NoSQL databases are not built on tables, and they do not use SQL to manipulate data.
O
127. Object Databases
A database that stores data in the form of objects is called an object database. These objects are used in the same way as objects in object-oriented programming (OOP). Object databases are different from relational databases and graph databases. Most of them provide a query language for finding objects declaratively.
128. Object-based Image Analysis
This is image analysis performed on objects: it uses data acquired from groups of selected related pixels (called image objects, or simply objects). It differs from digital image analysis that uses data from individual pixels.
129. Online Analytical Processing (OLAP)
A process for analyzing multi-dimensional data using three operations: consolidation (roll-up), drill-down, and slicing and dicing.
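A minimal sketch of what these operations look like on a tiny made-up sales "cube", using pandas as a stand-in for an OLAP engine:

```python
# A minimal OLAP-style sketch: roll-up, drill-down, and slice on a tiny sales "cube".
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 80, 90],
})

rollup = sales.groupby("region")["amount"].sum()                  # roll-up: totals per region
drilldown = sales.groupby(["region", "quarter"])["amount"].sum()  # drill-down: region and quarter
slice_q1 = sales[sales["quarter"] == "Q1"]                        # slice: fix one dimension (Q1 only)

print(rollup, drilldown, slice_q1, sep="\n\n")
```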
130. Online Transactional Processing (OLTP)
131. Open Data Center Alliance (ODCA)
132. Operational Data Store (ODS)
It is defined as a location where data gathered from various sources is collected and stored, and where users can perform many additional operations on the data before it is sent to the data warehouse for reporting.
133. Oozie
P
134. Parallel Data Analysis
The process of breaking an analytical problem into smaller partitions and then running analysis algorithms on each partition simultaneously is called parallel data analysis. This type of data analysis can run on different systems or on the same system.
135. Parallel Method Invocation (PMI)
A system that allows program code to invoke multiple methods/functions at the same time.
136. Parallel Processing
The system has the ability to perform multiple tasks at the same time.
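A minimal parallel-processing sketch with Python's multiprocessing module (the workload is a made-up CPU-bound function); each worker process handles part of the input at the same time:

```python
# A minimal parallel-processing sketch: map a function over inputs using worker processes.
from multiprocessing import Pool

def square(n: int) -> int:
    """A stand-in for a CPU-bound task."""
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:              # four worker processes in parallel
        results = pool.map(square, range(10))    # each worker squares part of the range
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```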
Pentaho is a software company that offers open source business intelligence products known as Pentaho Business Analytics. Pentaho provides OLAP services, data integration, dashboards, reporting, ETL, and data mining capabilities.
140. Petabytes (PB): Approximately equal to 1,000 TB, or about 1 million GB.
Q
141. Query
142. Query Analysis
R
143. R
144. Re-identification
145. Real-time Data
146. Reference Data
147. Recommendation Engine
An algorithm that analyzes the actions and purchases customers make on an e-commerce site, and then uses that data to recommend complementary products to them.
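A minimal sketch of one common approach, item co-occurrence ("customers who bought X also bought Y"), in pure Python with made-up purchase histories:

```python
# A minimal recommendation-engine sketch: recommend items that co-occur in purchase baskets.
from collections import Counter
from itertools import permutations

baskets = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "usb-hub"},
    {"laptop", "usb-hub"},
]

# Count how often each ordered pair of items appears together in the same basket.
co_occurs: Counter = Counter()
for basket in baskets:
    for a, b in permutations(basket, 2):
        co_occurs[(a, b)] += 1

def recommend(item: str, top_n: int = 2) -> list:
    """Items most frequently bought together with `item`."""
    scores = {b: c for (a, b), c in co_occurs.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("laptop"))   # e.g. ['mouse', 'usb-hub']
```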
148. Risk Analysis
149. Routing Analysis
S
150. SaaS (Software as a Service)
151. Semi-Structured Data
152. Server
153. Spatial Analysis
The analysis of spatial data (that is, topological and geographic data) is called spatial analysis. This analysis helps to identify and understand all the information about a specific area or location.
154. Structured Query Language (SQL)
155. Sqoop
156. Storm
T
157. Text Analytics
158. Thrift
U
159. Unstructured Data
V
160. Value
161. Volume
W
162. WebHDFS (Apache Hadoop)
163. Weather Data
X
164. XML Databases
Y
165. Yottabyte
Z
166. ZooKeeper
167. Zettabytes