The latest and most comprehensive glossary of big data and artificial intelligence terms in English (well worth bookmarking)
A
Apache Kafka: Named after the Czech writer Franz Kafka, Kafka is used to
build real-time data pipelines and streaming applications. It is popular
because it can store, manage, and process data streams in a fault-tolerant
manner and is reputed to be very fast. Given how much data stream processing
goes on in social networks, Kafka is currently in great demand.
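As a minimal sketch of what producing and consuming a stream looks like (this assumes the third-party kafka-python package and a broker at localhost:9092, both of which are assumptions, not part of the original text):
# Kafka producer/consumer sketch -- assumes kafka-python is installed
# and a broker is reachable at localhost:9092.
from kafka import KafkaProducer, KafkaConsumer
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", {"user": "alice", "url": "/home"})
producer.flush()  # make sure the message actually leaves the client

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # {'user': 'alice', 'url': '/home'}
    break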
Apache Mahout: Mahout provides a library of ready-made algorithms for
machine learning and data mining, and can also serve as an environment for
creating new ones. In other words, an ideal environment for machine learning
geeks.
Apache Oozie: In any programming environment you need a workflow system to
schedule and run jobs according to predefined steps and dependencies. This is
exactly what Oozie provides for big data jobs written in languages and
frameworks such as Pig, MapReduce, and Hive.
Application Development (APP DEV): Application development is the process of
building a software system, or part of one, according to user requirements.
It covers the engineering work of requirements capture, requirements
analysis, design, implementation, and testing; it is generally carried out in
a particular programming language and is usually supported by application
development tools.
Apache Drill, Apache Impala, Apache Spark SQL: These three open source
projects all provide fast, interactive SQL, for example over data stored in
Apache Hadoop. If you already know SQL and work with data stored in big data
formats (i.e., HBase or HDFS), these tools will be very useful. Apologies if
the phrasing here sounds a little odd.
Apache Hive: Do you know SQL? If you do, you will be able to get started with
Hive. Hive lets you use SQL to read, write, and manage large data sets
residing in distributed storage.
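For instance, a minimal PySpark sketch (assuming a Spark installation with Hive support is available; the table name web_logs is hypothetical) runs plain SQL over Hive-managed data:
# Querying Hive-managed data with SQL via PySpark -- a sketch, assuming
# Spark is installed with Hive support; the table name is hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-sql-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Standard SQL over a large, distributed data set.
daily_counts = spark.sql("""
    SELECT visit_date, COUNT(*) AS visits
    FROM web_logs
    GROUP BY visit_date
    ORDER BY visit_date
""")
daily_counts.show()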
Apache Pig: Pig is a platform for creating, querying, and executing routines
over large distributed data sets. Its scripting language is called Pig Latin
(I am definitely not making this up, believe me). Pig is said to be easy to
understand and learn, though I do wonder how much of it really sticks.
Apache Sqoop: A tool for transferring data between Hadoop and non-Hadoop data
stores (such as data warehouses and relational databases).
Apache Storm: A free and open source real-time distributed computing system.
It makes it easier to process unstructured data continuously in real time,
complementing Hadoop's batch processing.
Artificial Intelligence (AI): The research and development of intelligent
machines and software that can perceive their surroundings, respond
accordingly to what they perceive, and even learn on their own.
Aggregation: The process of searching for, merging, and presenting data.
Algorithm: An algorithm can be understood as a mathematical formula or
statistical procedure used for data analysis. So why is "algorithm"
associated with big data? Although the word is a general term, in this era of
widespread big data analysis, algorithms are mentioned constantly and have
become more popular than ever.
Anomaly detection: Searching a data set for items that do not match an
expected pattern or behavior. Besides "anomalies", the words outliers,
exceptions, surprises, and contaminants are also used to describe them; they
often carry key actionable information.
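A simple way to illustrate the idea (just a toy heuristic, not any particular product's method) is flagging values that sit far from the mean:
# Toy anomaly detection: flag values more than 2 standard deviations
# from the mean. A simple heuristic, not a full method.
from statistics import mean, stdev

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2]
mu, sigma = mean(readings), stdev(readings)

outliers = [x for x in readings if abs(x - mu) > 2 * sigma]
print(outliers)  # [42.0]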
Anonymization: Making data anonymous, that is, removing everything that could
link the data back to an individual's identity.
Application-computer software that implements a specific
function
Analytics: Used to discover the meaning hidden in data. Imagine a possible
situation: your credit card company sends you an email summarizing the
transactions on your card over the whole year. If you take that list and
start studying carefully what percentage you spent on food, clothing,
entertainment, and so on, you are doing analysis work. You are mining useful
information from your raw data (information that can help you make decisions
about your spending in the coming year). Now, what if you applied a similar
approach to the posts made by an entire city's worth of people on Twitter and
Facebook? Then we can call it big data analytics. Big data analytics means
reasoning over a large amount of data and extracting useful information from
it. There are three different types of analytics, sorted out separately
below.
B
Batch processing: Although batch data processing has existed since the
mainframe era, it has gained new significance now that big data means
processing very large volumes of data. Batch processing is an effective way
to handle large amounts of data at once (for example, a pile of transaction
records collected over a period of time). Distributed computing (Hadoop),
discussed later, is one approach that specializes in processing data in
batches.
Behavioral Analytics: Have you ever wondered how Google manages to advertise
exactly the products or services you need? Behavioral analytics focuses on
understanding what consumers and applications do, and how and why they act in
a certain way. It involves making sense of our browsing patterns, social
media interactions, and online shopping activity (shopping carts and so on),
connecting these seemingly unrelated data points, and trying to predict
outcomes. For example, right after I found a hotel and emptied my shopping
cart, I received a call from the resort's holiday hotline. Need I say more?
Business Intelligence: I will reuse Gartner's definition of
BI because it explains it well. Business intelligence is a general term that
includes applications, infrastructure, tools, and best practices. It can access
and analyze information to improve and optimize decision-making and
performance.
Biometrics: A technology that combines James Bond-style gadgetry with
analytics to identify people by one or more physical characteristics, such as
facial recognition, iris recognition, or fingerprint recognition.
Descriptive Analytics: If you only report that your credit card spending last
year was 25% on food, 35% on clothing, 20% on entertainment, and the
remaining 20% on miscellaneous expenses, that kind of analysis is called
descriptive analytics. Of course, you can also dig into further details.
Big Data Scientist: A person who can design big data
algorithms to make big data useful
Big data startup: Refers to emerging companies that develop
the latest big data technology
Brontobyte (BB): Approximately equal to 1,000 YB (yottabytes), roughly the
size of tomorrow's digital universe. A brontobyte is a 1 followed by 27
zeros!
Big data: Refers to high-volume, fast-growing, and diverse information assets
that require new processing models in order to deliver stronger
decision-making power, insight discovery, and process optimization.
Data science platforms: Working platforms on which data scientists create and
test data science solutions. According to Gartner's definition, a data
science platform is "a software system composed of a number of closely
related core data processing modules that supports the development of data
science solutions and their use in business processes, surrounding
infrastructure, and products."
C
Clickstream analytics: Used to analyze users' online click data as they
browse the web. Have you ever wondered why some Google ads keep following you
around even after you switch sites? Because Google knows exactly what you are
clicking on.
Cluster Analysis: An exploratory analysis that tries to identify structure in
the data; it is also known as segmentation analysis or classification
analysis. More specifically, it attempts to identify homogeneous groups of
cases, i.e., observations, participants, or respondents. Cluster analysis is
used to identify groups of cases when the grouping is not known in advance.
Because it is exploratory, it makes no distinction between dependent and
independent variables. The different cluster analysis methods provided by
SPSS can handle binary, nominal, ordinal, and scale (interval or ratio) data.
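As an illustration, here is a sketch using scikit-learn (rather than SPSS) with made-up numbers, where k-means groups similar observations into clusters:
# K-means clustering sketch with scikit-learn; the data points are made up.
from sklearn.cluster import KMeans

# Each row: [annual spend, visits per month] for one customer.
customers = [
    [200, 2], [220, 3], [210, 2],      # low-spend group
    [950, 12], [1000, 15], [980, 14],  # high-spend group
]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(model.labels_)           # e.g. [0 0 0 1 1 1] -- group membership
print(model.cluster_centers_)  # the "centres" of the two segments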
Comparative Analytics: Because the key to big data lies in analysis,
comparative analytics, as the name suggests, uses statistical techniques such
as pattern analysis, filtering, and decision tree analysis to compare
multiple processes, data sets, or other objects. I know this is getting more
technical, but I cannot entirely avoid the terminology. Comparative analytics
can be used in healthcare, for example, where comparing large numbers of
medical records, files, and images leads to more effective and accurate
diagnoses.
Connection Analytics: You have probably seen spiderweb-like graphs connecting
people and topics to identify the influencers on a particular subject.
Connection analytics helps discover the connections and influences among
people, products, and systems within a network, and even among data sets
across multiple networks.
Cassandra: A very popular open source distributed database management system
developed and maintained by the Apache Software Foundation. Apache stewards
many big data processing technologies, and Cassandra is the one designed
specifically to handle large amounts of data across distributed servers.
Cloud computing: A distributed computing approach built on the network, in
which data is stored outside your own machine room (i.e., in the cloud),
software or data is processed on remote servers, and those resources can be
accessed from anywhere on the network.
Cluster computing: A descriptive term for computing that pools the rich
resources of multiple servers into a cluster. At a more technical level,
discussions of cluster processing involve nodes, the cluster management
layer, load balancing, parallel processing, and so on.
Classification analysis: The systematic process of obtaining important
relevance information from data; such data is also called metadata, i.e.,
data that describes data.
Commerce analytics: Examining whether estimated sales, costs, and profits
have reached the company's targets; if they have, the product concept can
move forward into the product development stage.
Clustering analysis: The process of grouping similar objects together, with
each group of similar objects forming a cluster. The purpose of this analysis
method is to examine the differences and similarities between data points.
Cold data storage: Storing old, rarely used data on low-power servers.
Retrieving this data tends to be time-consuming.
Crowdsourcing: The practice of obtaining desired ideas,
services, or content contributions from a wide range of groups, especially
online communities.
Cluster server: Multiple servers connected by fast communication links. From
the outside they behave like a single server, while internally the external
load is dynamically allocated to the node machines by some mechanism,
achieving the high performance and high availability of a "super server".
Comparative analysis: A step-by-step comparison and calculation procedure
used to perform pattern matching over very large data sets and obtain an
analysis result.
Complex structured data-data composed of two or more complex
and interrelated parts. This type of data cannot simply be parsed by structured
query languages or tools (SQL).
Computer generated data-computer generated data such as log
files.
Concurrency-execute multiple tasks or run multiple processes
at the same time.
Correlation analysis-is a data analysis method used to
analyze whether there is a positive or negative correlation between variables.
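A minimal NumPy sketch with invented numbers shows what a strong positive correlation looks like:
# Pearson correlation between two variables, using NumPy; data is invented.
import numpy as np

ad_spend = np.array([10, 20, 30, 40, 50])
sales    = np.array([12, 24, 33, 45, 52])

r = np.corrcoef(ad_spend, sales)[0, 1]
print(round(r, 3))  # close to +1.0 => strong positive correlation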
Customer Relationship Management (CRM: Customer Relationship
Management)-A technology used to manage sales and business processes. Big data
will affect the company's customer relationship management strategy.
Cloud data: A general term for the technologies and platforms, built on the
cloud computing business model, that cover data integration, data analysis,
data distribution, and data early warning.
D
Data Analyst: Data analyst is an important and popular job. In addition to
preparing reports, a data analyst is responsible for collecting, editing, and
analyzing data.
Data Cleansing: As the name implies, data cleansing involves detecting and
then correcting or deleting inaccurate data or records in a database; think
of it as scrubbing the "dirty data". Using automated or manual tools and
algorithms, data analysts can correct and further enrich the data to improve
its quality. Remember, dirty data leads to wrong analysis and poor decisions.
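A small pandas sketch (assuming pandas is available; the column names are hypothetical) shows the kind of automated cleaning this refers to:
# Basic data-cleansing steps with pandas: drop duplicates, fix types,
# and fill missing values. Column names here are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "customer": ["alice", "alice", "bob", "carol"],
    "amount":   ["10.5", "10.5", None, "7.2"],
})

df = df.drop_duplicates()                   # remove duplicate records
df["amount"] = pd.to_numeric(df["amount"])  # correct the data type
df["amount"] = df["amount"].fillna(df["amount"].mean())  # fill gaps
print(df)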
Data as a Service (DaaS): By providing users with on-demand
access to cloud data, DaaS providers can help us quickly obtain high-quality
data.
Data virtualization: This is a data management method that
allows an application to extract and manipulate data without knowing the
technical details (such as where the data is stored and in what format). For
example, social networks use this method to store our photos.
Dirty Data: Dirty data is unclean data, in other words,
inaccurate, duplicate and inconsistent data. Obviously, you don't want to mix
with dirty data. So, fix it as soon as possible.
Dark data: All data accumulated and processed by the company
that is actually not used at all. In this sense, we call them "dark"
data, and they may not be analyzed at all. These data can be information in
social networks, call center records, meeting records, and so on. Many
estimates believe that 60% to 90% of all company data may be dark data, but no
one actually knows.
Data stream: Originally a concept from the communications field, referring to
a sequence of digitally encoded signals used in transmission. The data stream
concept discussed here, however, is different.
Data lake: An enterprise-level repository that holds large amounts of data in
its original, raw formats. It is worth introducing the data warehouse here as
well. A data warehouse is a concept similar to the data lake, but the
difference is that it stores structured data that has been cleaned up and
integrated with other sources. Data warehouses are often used for
general-purpose data (though not necessarily). It is generally believed that
a data lake makes it easier to reach the data you really need, and to process
and use it more conveniently.
Data Resources Management: The management activity of applying information
technology and software tools to organize and manage data resources.
Data Source: As the name implies, the source of data is a
device or original media that provides some required data. All the information
for establishing a database connection is stored in the data source. Just as
you can find a file in the file system by specifying the file name, you can
find the corresponding database connection by providing the correct data source
name.
Data mining: Use complex pattern recognition techniques to
find meaningful patterns from a large group of data and get relevant insights.
Data analyst platforms: Platforms that integrate an enterprise's internal
operational support systems with external data, including transaction-oriented
big data (Big Transaction Data) and interaction-oriented big data (Big
Interaction Data), and use a variety of cloud computing technologies to
integrate and process it, providing internal and external customers with
information support and intelligent solutions of great commercial value. On
top of a data warehouse built on the big data platform, they offer reporting
and analysis tools and deliver solution implementation services tailored to
the company's actual needs; managers, business analysts, and others can
access the platform via the web, mobile phones, or other mobile devices to
keep track of key business indicators and conduct in-depth business analysis
at any time.
Distributed File System: Big data is too large to store on a single system,
so a distributed file system stores large amounts of data across multiple
storage devices. It reduces the cost and complexity of storing large volumes
of data.
Dashboard: Use algorithms to analyze data and display the
results on the dashboard in a graphical manner.
Data access: refers to the realization and maintenance of
database data storage organization and storage path.
Data transfer: refers to the process of transferring data
between a data source and a data sink, also known as data communication.
Data aggregation tools: Tools that transform data scattered across many data
sources into a single new data source.
Database (Database): A warehouse that stores a collection of
data with a specific technology.
Database Management System (DBMS: Database Management
System): Collect and store data, and provide data access.
Data center: A physical location where servers used to store
data are placed.
Data custodian: The professional and technical personnel
responsible for maintaining the technical environment required for data
storage.
Data ethical guidelines: These guidelines help organizations
make their data transparent and ensure data simplicity, security and privacy.
Data feed: A continuous stream of data, such as a Twitter feed or an RSS
feed.
Data marketplace: An online trading place for buying and
selling data sets.
Data modelling: Use data modeling technology to analyze data
objects to gain insight into the inner meaning of data.
Data set: A collection of large amounts of data.
Data virtualization: The process of data integration to
obtain more data information. This process usually introduces other
technologies, such as databases, applications, file systems, web technologies,
big data technologies, and so on.
De-identification (De-identification): Also known as
anonymization (anonymization), to ensure that individuals will not be
identified through data.
Discriminant analysis: classify data; according to different
classification methods, data can be assigned to different groups, categories or
directories. It is a statistical analysis method that can analyze the known
information of certain groups or clusters in the data and obtain classification
rules from it.
Distributed File System: A system that provides a
simplified, highly available way to store, analyze, and process data.
Document Store Databases, also known as document-oriented
databases, are databases specially designed for storing, managing, and
restoring document data. This type of document data is also called
semi-structured data.
Data Governance: Data governance refers to the transition from using
scattered data to using unified master data, and from little or no
organizational and process governance to comprehensive enterprise-wide data
governance; it is the process of moving master data from a chaotic state to a
well-organized one.
Data Transfer Service: Mainly used to convert data between
different databases, such as converting data between SQL Server and Oracle.
Data integration: It is to logically or physically
centralize data from different sources, formats, and characteristics, so as to
provide enterprises with comprehensive data sharing.
E
ETL: ETL stands for extract, transform, and load. It refers to the process of
"extracting" raw data, "transforming" it into a form suitable for use through
cleaning and enrichment, and "loading" it into the appropriate repository for
the system to use. Although ETL originated with data warehousing, the process
is also used when ingesting data, for example when acquiring data from
external sources in a big data system.
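A tiny end-to-end sketch of the idea, using only the Python standard library (the file name and fields are invented):
# Extract-Transform-Load in miniature: read a CSV, clean/convert it,
# and load it into a SQLite table. File name and fields are invented.
import csv
import sqlite3

# Extract
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalise text and convert amounts to numbers
cleaned = [(r["region"].strip().lower(), float(r["amount"])) for r in rows]

# Load
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
conn.commit()
conn.close()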
Enterprise applications: In fact, it is a term commonly used
within the software industry. If it is explained in plain and
easy-to-understand terms, it is a computer-based stable, secure and efficient
distributed information management system used within an enterprise.
Exploratory analysis: Exploring patterns in data without standard procedures
or methods; it is a way of discovering the main characteristics of the data
and the data set.
Exabyte (EB): Approximately equal to 1,000 PB (petabytes), or about 1 billion
GB. The world now produces roughly 1 exabyte of new information every day.
Extract, Transform and Load (ETL): A process used with databases or data
warehouses: extract (E) data from various data sources, transform (T) it into
data that meets business needs, and finally load (L) it into the database.
Enterprise productivity: The ability of an enterprise to
provide a certain product or service to the society in a certain period of
time.
F
Fuzzy logic: How often are we 100% certain about anything? Very rarely! Our
brains aggregate data into partial truths, which are further abstracted into
thresholds that determine our decisions. Fuzzy logic is a computing approach
that, in contrast to the hard "0" and "1" of Boolean algebra, aims to mimic
the human brain by working with these partial truths.
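A toy sketch of the idea: instead of a hard yes/no, membership in the category "hot" is a degree between 0 and 1 (the temperature thresholds are invented):
# Fuzzy membership function for "hot weather": returns a degree of truth
# between 0 and 1 instead of a hard True/False. Thresholds are invented.
def hot(temperature_c: float) -> float:
    if temperature_c <= 20:
        return 0.0
    if temperature_c >= 35:
        return 1.0
    return (temperature_c - 20) / 15  # linear ramp between 20 and 35 degrees

for t in (15, 25, 30, 40):
    print(t, "->", round(hot(t), 2))  # 0.0, 0.33, 0.67, 1.0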
Failover: When a server in the system fails, the running
task can be automatically switched to another available server or node.
Framework: also known as software architecture, is an
abstract description of the overall structure and components of the software,
used to guide the design of various aspects of large-scale software systems.
Flow monitoring: Flow monitoring refers to monitoring data traffic, usually
including outgoing traffic, incoming traffic, and total traffic. WeChat
users, for instance, can monitor their traffic precisely with Tencent Mobile
Manager 4.7.
Fault-tolerant design: A system that supports fault-tolerant
design should be able to continue running when a certain part fails.
Finance: It is the behavior of people making decisions about
optimal allocation of resources across periods in an uncertain environment.
G
Gamification: Applying game thinking and game mechanics in non-game fields;
it can be a very friendly and effective way to create and collect data.
Graph Databases: Databases that store data using graph structures (for
example, a finite set of ordered pairs, or certain kinds of entities), made
up of edges, properties, and nodes. They provide index-free adjacency,
meaning each element in the database is directly linked to its adjacent
elements.
Grid computing: Connect many computers distributed in
different locations together to deal with a specific problem, usually through
the cloud to connect the computers together.
H
Hadoop User Experience (Hue): Hue is an open source interface that makes
Apache Hadoop easier to use. It is a web-based application and includes a
file browser for the distributed file system, a job designer for MapReduce,
an Oozie application for scheduling workflows, a shell, an Impala UI, a Hive
UI, and a set of Hadoop APIs.
Human capital (Human capital): refers to the accumulation of
knowledge and skills acquired by workers through investment in education,
training, practical experience, migration, health care, etc., also known as
"non-material capital."
Hardware: A general term for various physical devices
composed of electronic, mechanical, and optoelectronic components in a computer
system.
High Performance Analytical Application (HANA): A software and hardware
in-memory platform designed by SAP for high-volume data transactions and
analytics.
HBase: A distributed, column-oriented database. It uses HDFS as its
underlying storage and supports both batch computation with MapReduce and
transactional interactive access.
Hadoop: An open source distributed computing framework that can be used to
develop distributed programs for processing and storing big data.
Hadoop database (HBase): an open source, non-relational,
distributed database, used in conjunction with the Hadoop framework.
Hadoop Distributed File System (HDFS): A distributed file system designed to
run on commodity hardware.
High-Performance Computing (HPC:
High-Performance-Computing): Use supercomputers to solve extremely complex
computing problems.
Hadoop in the cloud: Some cloud solutions are based entirely on a specific
service that loads and processes data. For example, with IBM Bluemix you can
configure a MapReduce service based on IBM InfoSphere BigInsights that can
process up to 20 GB of information, although the size, configuration, and
complexity of the Hadoop service itself are not configurable. Other
service-based solutions come with the same kind of limitation.
I
Infrastructure As a Service: Consumers can obtain services
from a complete computer infrastructure through the Internet. This type of
service is called infrastructure as a service.
Infrastructure as Code: A way of managing computing and network
infrastructure through source code, so that it can be treated like any other
software system. The code can be kept in source control to ensure
auditability and reproducibility, and is subject to all the usual practices
of testing and continuous delivery. This is the approach that has been used
for more than a decade to cope with ever-growing cloud computing platforms,
and it will also be the main way of managing computing infrastructure in the
future.
In-memory computing: It is generally accepted that any computation that
avoids I/O access will be faster. In-memory computing is such a technique: it
moves the entire working data set into the cluster's collective memory,
avoiding writing intermediate results to disk during computation. Apache
Spark is an in-memory computing system, which gives it a great advantage over
I/O-bound systems such as MapReduce.
Internet of Things (IoT): The latest buzzword is the
Internet of Things (IoT). IoT is the interconnection of computing devices in
embedded objects (such as sensors, wearable devices, cars, refrigerators, etc.)
through the Internet, and they can send and receive data. The Internet of
Things has generated massive amounts of data and brought many opportunities for
big data analysis.
In-memory database (IMDB: In-memory): A database management
system, which is different from ordinary database management systems in that it
uses main memory to store data instead of hard disks. Its characteristic is
that it can process and access data at high speed.
Juridical data compliance: Relevant when you use cloud computing solutions
that store your data in different countries or on different continents. You
need to check whether data stored in another country complies with that
country's laws.
Load balancing is a technique for distributing workload across two or more
computers on a network so that all users are served faster and work is
completed in less time. It is the main reason for building computer server
clusters, and it can be implemented in software, in hardware, or in a
combination of both.
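A toy round-robin dispatcher illustrates the idea (the server names are invented placeholders; real load balancers also track health and current load):
# Minimal round-robin load balancer sketch: requests are handed to the
# servers in turn. Server names are invented placeholders.
import itertools

servers = ["node-1", "node-2", "node-3"]
next_server = itertools.cycle(servers)

def dispatch(request_id: int) -> str:
    target = next(next_server)
    return f"request {request_id} -> {target}"

for i in range(5):
    print(dispatch(i))  # node-1, node-2, node-3, node-1, node-2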
47. Linked Data
Linked data refers to interconnected data sets that can be shared or
published on the web and worked with by both machines and users. It is highly
structured, unlike big data, and is used to build the Semantic Web, in which
large amounts of data are made available on the web in a standard format.
48. Location Analytics
Location analytics is the process of gaining insight from the geographic
location of data or business data. It analyzes and visually interprets the
information the data depicts and allows users to associate location-related
information with their data sets.
49. Log File
A log file is a special type of file in which a running operating system,
application, or user session automatically records the events that occur.
M
50. Metadata
Metadata is data about data. It is management, descriptive
and structural data that identifies assets.
51. MongoDB
MongoDB is an open source, NoSQL, document-oriented database program. It
stores data structures as JSON-like documents in a binary representation
called BSON, which makes it quick and easy to integrate data into
applications.
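A minimal pymongo sketch (assuming the pymongo driver and a local MongoDB instance; the database and collection names are made up):
# Storing and querying a JSON-like document with pymongo. Assumes MongoDB
# is running locally and pymongo is installed; names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]

users.insert_one({"name": "alice", "tags": ["analytics", "python"], "age": 34})
doc = users.find_one({"name": "alice"})
print(doc["tags"])  # ['analytics', 'python']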
52. Multi-Dimensional Database (MDB)
A multidimensional database (MDB) is a database optimized for OLAP (online
analytical processing) applications and data warehouses. An MDB can easily be
created from the input of a relational database, and it organizes data so
that analytical results can be produced quickly.
53. Multi-Value Database
A multi-value database is a multidimensional NoSQL database that can handle
three-dimensional data directly. These databases can also process XML and
HTML strings directly.
Some examples of commercial multi-value databases are OpenQM, Rocket D3
Database Management System, jBASE, InterSystems Caché, OpenInsight, and
InfinityDB.
54. Machine-Generated Data
Machine-generated data is information produced by machines (computers,
applications, processes, or other non-human mechanisms). It is sometimes
called amorphous data, because humans rarely modify or change it.
55. Machine Learning
Machine learning is a field of computer science that uses statistical
techniques to give computers the ability to "learn" from data. Machine
learning is used to uncover opportunities hidden in big data.
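As a minimal sketch (scikit-learn with a toy, invented data set), "learning" here means fitting a model to labelled examples and letting it predict unseen cases:
# Tiny supervised-learning example with scikit-learn: the model "learns"
# a rule from labelled examples. Data is invented.
from sklearn.linear_model import LogisticRegression

# Feature: hours of product usage per week; label: 1 = renewed subscription.
X = [[1], [2], [3], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict([[2], [7]]))  # likely [0 1]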
56. MapReduce
MapReduce is a processing technique that handles large data sets using a
parallel, distributed algorithm on a cluster. A MapReduce job has two parts:
the "map" function splits the query into multiple pieces and processes the
data at the node level, while the "reduce" function collects the results of
the "map" function and produces the answer to the query. Combined with HDFS,
MapReduce is used to process big data; this coupling of HDFS and MapReduce is
what we call Hadoop.
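A pure-Python word count shows the two phases in miniature (a real MapReduce framework distributes the same logic across a cluster):
# Word count in the MapReduce style, in plain Python: "map" emits
# (word, 1) pairs, "reduce" sums the counts per word.
from collections import defaultdict

documents = ["big data is big", "data is everywhere"]

# Map phase: emit intermediate key/value pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle + reduce phase: group by key and aggregate.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}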
57. Mahout
Apache Mahout is an open source data mining library. It provides data mining
algorithms for regression, clustering, classification, and statistical
modeling, and implements them using the MapReduce model.
N
58. Network Analysis
Network analysis is an application of graph theory used to categorize,
understand, and view the relationships between nodes in network terms. It is
an effective way to analyze connections and assess their strength in any
field, such as forecasting, market analysis, and healthcare.
59. NewSQL
NewSQL is a modern relational database management system
that can provide the same scalable performance as NoSQL systems for OLTP
read/write workloads. It is a well-defined database system and easy to learn.
60. NoSQL
Widely referred to as "not only SQL", NoSQL describes database management
systems that depart from the relational model. NoSQL databases are not built
on tables and do not necessarily use SQL to manipulate data.
O
61. Object Databases
A database that stores data in the form of objects is called an object
database. These objects are used in much the same way as objects in
object-oriented programming. Object databases differ from graph databases and
relational databases, and most of them provide a query language for finding
objects declaratively.
62. Object-based Image Analysis
Object-based image analysis analyzes an image using data from groups of
related pixels, known as image objects or simply objects. It differs from
traditional digital image analysis, which uses data from individual pixels.
63. Online Analytical Processing (OLAP)
The process of analyzing multidimensional data using three operations:
drill-down, consolidation, and slicing and dicing.
Drill-down lets users view the underlying detail.
Consolidation aggregates the data into summaries.
Slicing and dicing lets users select subsets of the data and view them from
different perspectives.
64. Online Transactional Processing (OLTP)
A big data term for systems that allow users to access large amounts of
transactional data in a way that lets them derive meaning from it.
65. Open Data Center Alliance (ODCA)
ODCA is a consortium of global IT organizations whose main goal is to
accelerate the adoption of cloud computing.
66. Operational Data Store (ODS)
A location where data collected from various sources is gathered and stored,
so that additional operations can be performed on the data before it is sent
on to the data warehouse for reporting.
67. Oozie
A big data term for a processing system that lets users define a set of jobs
written in different languages, such as Pig, MapReduce, and Hive, and then
link those jobs to one another.
P
68. Parallel Data Analysis
The process of decomposing the analysis problem into smaller
partitions, and then running the analysis algorithm on each partition at the
same time is called parallel data analysis. This type of data analysis can be
run on different systems or on the same system.
69. Parallel Method Invocation (PMI)
A system that allows program code to call multiple methods or functions at
the same time.
70. Parallel Processing
The ability of a system to execute multiple tasks at the same time.
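A small sketch with the standard library's concurrent.futures module, running several independent tasks at the same time (the workload is a stand-in):
# Running independent tasks in parallel with a process pool
# (standard library only; the workload is a stand-in).
from concurrent.futures import ProcessPoolExecutor

def crunch(n: int) -> int:
    return sum(i * i for i in range(n))  # stand-in for real work

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(crunch, [100_000, 200_000, 300_000]))
    print(results)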
71. Parallel Query
Parallel queries can be defined as queries that can be
executed on multiple system threads to improve performance.
72. Pattern Recognition
The process of classifying or labeling the recognized
patterns in the machine learning process is called pattern recognition.
73. Pentaho
Pentaho is a software organization that provides open source
business intelligence products, these products are called Pentaho Business
Analytics. Pentaho provides OLAP services, data integration, dashboards,
reports, ETL and data mining functions.
74. Petabyte
A unit of data measurement equal to 1,024 TB, or about 1 million gigabytes,
is called a petabyte (PB).
Q
75. Query
A query is a request for specific information, made in order to answer a
particular question.
76. Query Analysis
The process of analyzing search queries is called query analysis. Query
analysis is performed to optimize queries for the best results.
R
77. R
It is a programming language and an environment for graphics
and statistical computing. This is a very extensible language that provides
many graphics and statistical techniques, such as nonlinear and linear
modeling, time series analysis, classical statistical testing, clustering,
classification, etc.
78. Re-identification
Data re-identification is the process of matching anonymous
data with available auxiliary data or information. This approach helps to find
out who this data belongs to.
79. Real-time Data
Data that can be created, stored, processed, analyzed, and visualized
immediately (that is, in milliseconds) is called real-time data.
80. Reference Data
A big data term for data that describes an object and its attributes. The
object described by reference data may be physical or virtual.
81. Recommendation Engine
An algorithm that analyzes the actions and purchases customers make on an
e-commerce site; the resulting analysis is then used to recommend
complementary products to those customers.
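A toy co-purchase recommender shows the basic idea (the purchase history is invented; production engines use far richer signals):
# Naive "customers who bought X also bought Y" recommender.
# The purchase history is invented for illustration.
from collections import Counter

purchases = {
    "alice": {"camera", "tripod"},
    "bob":   {"camera", "memory card"},
    "carol": {"camera", "tripod", "bag"},
}

def recommend(item: str) -> list:
    also_bought = Counter()
    for basket in purchases.values():
        if item in basket:
            also_bought.update(basket - {item})
    return [product for product, _ in also_bought.most_common()]

print(recommend("camera"))  # e.g. ['tripod', 'memory card', 'bag']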
82. Risk Analysis
The process of assessing the risk of an action, project, or decision by
applying statistical techniques to data sets.
83. Routing Analysis
The process of finding the optimal route; by taking many transportation
variables into account, it can improve efficiency and reduce fuel costs.
S
84. SaaS (Software as a Service)
A big data term for software as a service, which allows vendors to host
applications and make them available over the Internet. SaaS services are
delivered in the cloud by SaaS providers.
85. Semi-Structured Data
Data that does not follow a conventional, rigid format but is still
represented in an organized way is called semi-structured data. It is neither
fully structured nor fully unstructured, but contains tags, data tables, and
other structural elements. A few examples of semi-structured data are XML
documents, emails, tables, and graphs.
86. Server
A server is a virtual or physical computer that receives requests related to
a software application and serves the responses over the network.
87. Spatial Analysis
The analysis of spatial data (that is, topological and
geographic data) is called spatial analysis. This analysis helps to identify
and understand all the information about a specific area or location.
88. Structured Query Language (SQL)
SQL is a standard programming language used to retrieve and manage data in
relational databases. It is very useful for creating and querying relational
databases.
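A short sketch with the standard library's sqlite3 module (the table and rows are invented) shows the create/insert/query cycle:
# Creating and querying a small relational table with SQL via sqlite3.
# Table name and rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 25.0), ("bob", 40.0), ("alice", 15.0)])

for row in conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"):
    print(row)  # ('alice', 40.0) then ('bob', 40.0)
conn.close()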
89. Sqoop
A connectivity tool for moving data between non-Hadoop data stores and
Hadoop. You tell Sqoop to retrieve data from Teradata, Oracle, or any other
relational database and specify the target location in Hadoop where the
retrieved data should be placed.
90. Storm
Apache Storm is a distributed, open source, real-time
computing system for data processing. It is one of the essential big data terms
and is responsible for the reliable processing of unstructured data in real
time.
T
91. Text Analytics
Text analysis is basically the process of applying
linguistics, machine learning and statistical techniques to text-based sources.
Text analysis is used to derive insights or meanings from text data by applying
these techniques.
92. Thrift
It is a software framework for developing cross-language
services. It integrates the code generation engine with the software stack to
develop services that can work seamlessly and efficiently between different
programming languages (such as Ruby, Java, PHP, C++, Python, C#, etc.).
U
93. Unstructured Data
Data whose structure cannot be defined is called
unstructured data. It becomes difficult to process and manage unstructured
data. Common examples of unstructured data are text entered in email messages
and data sources with text, images, and videos.
V
94. Value
A big data term for the value of the available data. The data that is
collected and stored may be valuable to society, customers, and
organizations. Value is one of the important big data terms, because
organizations collect big data precisely so that they can derive value, that
is, benefit, from it.
95. Volume
A big data term for the total amount of available data, which can range from
megabytes all the way up to brontobytes.
W
96. WebHDFS
WebHDFS is a protocol for accessing HDFS through the industry-standard
RESTful mechanism. It includes native libraries for accessing HDFS and lets
users connect to HDFS from outside the Hadoop cluster while still taking
advantage of the cluster's parallelism. It also provides web service access
to all Hadoop components.
97. Weather Data
Data on trends and patterns that help track the atmosphere is called weather
data. It consists largely of numbers and measurable factors, and real-time
weather data is now available for organizations to use in many ways. For
example, logistics companies use weather data to optimize the transportation
of goods.
X
98. XML Databases
A database that supports storing data in XML format is called an XML
database. These databases are usually associated with document-oriented
databases. The data in an XML database can be exported, serialized, and
queried.
Y
99. Yottabyte
A big data term for data measurement. One yottabyte is equal to 1,000
zettabytes, roughly the amount of data stored on 250 trillion DVDs.
Z
100. ZooKeeper
An Apache software project and Hadoop subproject that provides an open source
naming and coordination service for distributed systems and supports the
coordination of large-scale distributed systems.
101. Zettabytes
A big data term for data measurement. One zettabyte is equal to one billion
terabytes, or 1,000 exabytes.