
100+ Big Data and AI Terms and Terminology

 The latest and most comprehensive big data and artificial intelligence terms and terminology in English (well worth bookmarking)

 

A

 Apache Kafka: Named after the Czech writer Franz Kafka, Kafka is used to build real-time data pipelines and streaming applications. It is popular because it can store, manage, and process data streams in a fault-tolerant way, and it is reputedly very "fast". Since social-network workloads involve a great deal of stream processing, Kafka is currently in high demand.
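A minimal sketch of producing and consuming a Kafka stream, assuming the kafka-python client is installed and a broker is reachable at localhost:9092; the topic name and the message are hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer

# publish a message to a (hypothetical) "clickstream" topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": 42, "page": "/home"}')
producer.flush()

# read messages back from the beginning of the topic
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```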

 Apache Mahout: Mahout provides a library of ready-made machine learning and data mining algorithms, and can also serve as an environment for building new ones. In other words, an ideal environment for machine learning geeks.

 Apache Oozie: In any programming environment, you need a workflow system to schedule and run jobs in a predefined order with defined dependencies. That is exactly what Oozie provides for big data jobs written in languages such as Pig, MapReduce, and Hive.

 Application Development (App Dev): Application development is the process of building a software system, or the software parts of a system, according to user requirements. It covers requirements capture, requirements analysis, design, implementation, and testing; it is generally done in some programming language and is usually supported by application development tools.

 Apache Drill, Apache Impala, Apache Spark SQL: These three open source projects all provide fast, interactive SQL, for example over data stored in Apache Hadoop. If you already know SQL and work with data stored in big data formats (e.g., HBase or HDFS), you will find them very useful.

 Apache Hive: Do you know SQL? If so, you will find Hive easy to pick up. Hive lets you use SQL to read, write, and manage large data sets residing in distributed storage.

 Apache Pig: Pig is a platform for creating, querying, and executing routines over large distributed data sets. Its scripting language is called Pig Latin (I am definitely not making that up, believe me). Pig is said to be easy to understand and learn, though I wonder how deep that learning goes.

 Apache Sqoop: A tool for transferring data between Hadoop and non-Hadoop data stores (such as data warehouses and relational databases).

 Apache Storm: A free and open source real-time distributed computing system. It makes it easier to process unstructured data continuously, while Hadoop handles the batch processing.

 Artificial Intelligence (AI): The research and development of intelligent machines and intelligent software; these systems can perceive their surroundings, respond appropriately to what is required of them, and even learn on their own.

 Aggregation - the process of searching, gathering, and presenting data.

 Algorithm: An algorithm can be understood as a mathematical formula or statistical procedure used to analyze data. So why is "algorithm" a big data word? Although the term itself is generic, in this era of big data analysis algorithms are mentioned constantly and have become more popular than ever.

 Anomaly detection - Searching a data set for items that do not match an expected pattern or behavior. Besides "anomalies", such items are also called outliers, exceptions, surprises, or contaminants, and they usually provide critical, actionable information.

 Anonymization - Making data anonymous, that is, removing all information that could identify an individual.

 Application-computer software that implements a specific function

 Analytics: Used to discover the meaning hidden in data. Imagine the following situation: your credit card company sends you an email summarizing a year of transactions on your card. If you take that list and start studying carefully what percentage you spent on food, clothing, entertainment, and so on, you are doing analysis: you are mining useful information from your raw data (information that can help you decide how to spend in the coming year). Now, what if you applied a similar approach to the posts made on Twitter and Facebook by everyone in a city? That is what we call big data analysis: reasoning over a large amount of data and extracting useful information from it. There are three different types of analysis, described separately below.
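A minimal sketch of the credit card example above: summing hypothetical transactions by category and reporting each category's share of total spend.

```python
from collections import defaultdict

# hypothetical transactions: (category, amount)
transactions = [
    ("food", 120.0), ("clothing", 80.0), ("entertainment", 45.0),
    ("food", 60.0), ("misc", 25.0),
]

totals = defaultdict(float)
for category, amount in transactions:
    totals[category] += amount

grand_total = sum(totals.values())
for category, amount in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {amount / grand_total:.1%} of spend")
```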

 

B

 

Batch processing: Batch data processing has existed since the mainframe era, but it has taken on new significance now that big data means handling very large volumes of data. Batch processing is an effective way to process large amounts of data at once (such as a pile of transaction data collected over a period of time). Distributed computing (Hadoop), discussed later, specializes in processing data in batches.

 

Behavioral Analytics: Have you ever wondered how Google manages to show ads for exactly the products or services you need? Behavioral analytics focuses on understanding what consumers and applications do, and how and why they act in a certain way. It involves making sense of our browsing patterns, social media interactions, and online shopping activity (shopping carts and so on), connecting these seemingly unrelated data points, and trying to predict outcomes. For example, after I searched for a hotel and then abandoned my shopping cart, I got a call from the resort's holiday hotline. Need I say more?

 Business Intelligence (BI): I will reuse Gartner's definition of BI because it explains it well: business intelligence is an umbrella term that includes the applications, infrastructure, tools, and best practices that enable access to and analysis of information in order to improve and optimize decisions and performance.

 Biometrics: James Bond-style technology combined with analytics to identify people by one or more of their physical traits, such as facial recognition, iris recognition, or fingerprint recognition.

 Descriptive Analytics: If you only report that your credit card spending last year was 25% on food, 35% on clothing, 20% on entertainment, and the remaining 20% on miscellaneous expenses, you are doing descriptive analytics. Of course, you can dig into far more detail than that.

 Big Data Scientist: A person who can design big data algorithms to make big data useful

 Big data startup: Refers to emerging companies that develop the latest big data technology

 Brontobyte (BB): approximately equal to 1,000 yottabytes (YB), roughly the size of the digital universe of the future. A brontobyte is a 1 followed by 27 zeros!

 Big data: Massive, fast-growing, and diverse information assets that require new processing models in order to deliver stronger decision-making power, insight, discovery, and process optimization.

 Data science platforms: A working platform on which data scientists create and test data science solutions. According to Gartner's definition, a data science platform is "a software system composed of a number of closely related core data processing modules that supports the development of data science solutions and their use in business processes, surrounding infrastructure, and products."

  

C

 

Clickstream analytics: The analysis of users' online click data as they browse the web. Ever wondered why certain Google ads keep following you around even after you switch sites? Because Google knows what you are clicking on.

 Cluster Analysis: An exploratory analysis that tries to identify structure within data; it is also known as segmentation analysis or taxonomy analysis. More specifically, it attempts to identify homogeneous groups of cases (observations, participants, respondents). Cluster analysis is used to discover groups of cases when the grouping is not known in advance. Because it is exploratory, it does not distinguish between dependent and independent variables. The different cluster analysis methods offered in SPSS can handle binary, nominal, ordinal, and scale (interval or ratio) data.
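A minimal sketch of cluster analysis using k-means from scikit-learn; the two-dimensional points and the choice of two clusters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical 2-D observations that form two loose groups
X = np.array([
    [1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
    [8.0, 9.0], [8.2, 8.8], [7.9, 9.1],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", model.labels_)          # which cluster each point belongs to
print("cluster centers:", model.cluster_centers_)
```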

 Comparative Analytics: Since the essence of big data lies in analysis, comparative analysis, as the name suggests, uses statistical techniques such as pattern analysis, filtering, and decision-tree analysis to compare multiple processes, data sets, or other objects. I know this is getting more and more technical, but I cannot completely avoid the jargon. Comparative analysis can be used in healthcare: by comparing large numbers of medical records, files, images, and so on, it can support more effective and accurate diagnoses.

 Connection Analytics: You have probably seen those spider-web-like graphs that connect people and topics to identify the influencers on a particular subject. Connection analytics helps discover the connections and influence between people, products, and systems within a network, or even between data sets and multiple networks.

 Cassandra: A very popular open source distributed data management system developed and maintained by the Apache Software Foundation. Apache stewards many big data technologies, and Cassandra was designed specifically to handle large amounts of data across distributed servers.

 Cloud computing: A distributed computing model built on networks: data is stored outside your own machine room (i.e., in the cloud), software and data are processed on remote servers, and those resources can be accessed from anywhere on the network. That is what we call cloud computing.

 Cluster computing: A descriptive term for computing that pools the resources of multiple servers into a cluster. On a more technical level, discussion of cluster computing involves nodes, the cluster management layer, load balancing, parallel processing, and so on.

 Classification analysis: A systematic process for obtaining important relevance information from data; this kind of information about data is also called metadata, i.e., data that describes data.

 Commerce analytics: Examining whether estimated sales, costs, and profits meet the company's targets; if they do, the product concept can move forward into the product development stage.

 Clustering analysis - The process of grouping similar objects together, with each group of similar objects forming a cluster. The purpose of this method is to analyze the differences and similarities between data points.

 Cold data storage - Storing old, rarely used data on low-power servers. Retrieving this data takes longer.

 Crowdsourcing: The practice of obtaining desired ideas, services, or content contributions from a wide range of groups, especially online communities.

 Cluster server: Multiple servers connected by fast communication links. From the outside they behave like a single server, while internally the external load is dynamically distributed across the node machines by some mechanism, achieving the high performance and high availability of a "super server".

 Comparative analysis - A step-by-step process of comparison and calculation used when pattern matching over very large data sets, producing an analysis result.

 Complex structured data - Data composed of two or more complex, interrelated parts. This kind of data cannot simply be parsed with structured query languages or tools (SQL).

 Computer-generated data - data generated by computers, such as log files.

 Concurrency - executing multiple tasks or running multiple processes at the same time.

 Correlation analysis - A data analysis method used to determine whether there is a positive or negative correlation between variables.
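A minimal sketch of correlation analysis with pandas on a hypothetical ad-spend vs. sales table; .corr() returns the pairwise Pearson correlation matrix.

```python
import pandas as pd

# hypothetical observations: advertising spend and resulting sales
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales":    [12, 24, 33, 41, 55],
})

print(df.corr(method="pearson"))  # values near +1 indicate a strong positive correlation
```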

 Customer Relationship Management (CRM) - A technology used to manage sales and business processes; big data will affect companies' CRM strategies.

 Cloud data: A general term for technologies and platforms that, under the cloud computing business model, perform data integration, data analysis, data distribution, and data early-warning.

  

D

 Data Analyst: A data analyst is an important and popular role, responsible for collecting, editing, and analyzing data as well as preparing reports.

 Data Cleansing: As the name implies, data cleansing involves detecting and then correcting or removing inaccurate data or records from a database; think of it as dealing with "dirty data". Using automated or manual tools and algorithms, data analysts correct and enrich the data to improve its quality. Remember: dirty data leads to wrong analysis and poor decisions.
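A minimal sketch of data cleansing with pandas on a small hypothetical table: normalizing a text field, dropping rows that miss a key value, removing duplicates, and filling missing amounts.

```python
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Alice ", "alice", "Bob", None],
    "amount":   [100.0,    100.0,   None,  50.0],
})

cleaned = (
    raw.assign(customer=raw["customer"].str.strip().str.lower())  # normalize case/whitespace
       .dropna(subset=["customer"])                               # drop rows missing the key field
       .drop_duplicates()                                         # remove exact duplicate rows
       .fillna({"amount": 0.0})                                   # fill missing amounts with 0
)
print(cleaned)
```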

 Data as a Service (DaaS): By providing users with on-demand access to cloud data, DaaS providers can help us quickly obtain high-quality data.

 Data virtualization: This is a data management method that allows an application to extract and manipulate data without knowing the technical details (such as where the data is stored and in what format). For example, social networks use this method to store our photos.

 Dirty Data: Dirty data is unclean data, in other words, inaccurate, duplicate and inconsistent data. Obviously, you don't want to mix with dirty data. So, fix it as soon as possible.

 Dark data: All data accumulated and processed by the company that is actually not used at all. In this sense, we call them "dark" data, and they may not be analyzed at all. These data can be information in social networks, call center records, meeting records, and so on. Many estimates believe that 60% to 90% of all company data may be dark data, but no one actually knows.

 Data stream: Originally a communications concept referring to a sequence of digitally encoded signals used in transmission. In the big data context, however, it refers to data that arrives continuously and is processed as it flows in.

 Data lake: A company-wide repository that stores large amounts of data in its original format. While we are here, let us introduce the data warehouse: a concept similar to the data lake, except that it stores structured data that has been cleaned and integrated with other sources. Data warehouses are typically used for standardized, general-purpose data (though not necessarily only that). It is generally held that a data lake makes it easier to reach the data you really need, and to process and use it more conveniently.

 Data Resources Management: The management activity of applying information technology and software tools to organize and administer an organization's data resources.

 Data Source: As the name implies, the source of data is a device or original media that provides some required data. All the information for establishing a database connection is stored in the data source. Just as you can find a file in the file system by specifying the file name, you can find the corresponding database connection by providing the correct data source name.

 Data mining: Use complex pattern recognition techniques to find meaningful patterns from a large group of data and get relevant insights.

 Data analytics platforms: Platforms that integrate an enterprise's internal operational support systems with external data, covering both transactional big data (Big Transaction Data) and interactional big data (Big Interaction Data), and use a variety of cloud computing technologies to consolidate and process it. They provide internal and external customers with commercially valuable information support and intelligent solutions. On top of a data warehouse built on the big data platform, they offer reporting and analysis tools and deliver solutions tailored to the company's actual needs; managers and business analysts can then access key enterprise indicators and perform in-depth business analysis at any time via the web, mobile phones, or other mobile devices.

 Distributed File System: Big data volumes are too large to store on a single system, so a distributed file system stores large amounts of data across multiple storage devices. It reduces the cost and complexity of storing large amounts of data.

 Dashboard: Use algorithms to analyze data and display the results on the dashboard in a graphical manner.

 Data access: The implementation and maintenance of how data is organized and stored in a database, and the paths by which it is retrieved.

 Data transfer: refers to the process of transferring data between a data source and a data sink, also known as data communication.

 Data aggregation tools: Tools that transform data scattered across many data sources into a new, combined data source.

 Database (Database): A warehouse that stores a collection of data with a specific technology.

 Database Management System (DBMS: Database Management System): Collect and store data, and provide data access.

 Data center: A physical location where servers used to store data are placed.

 Data custodian: The professional and technical personnel responsible for maintaining the technical environment required for data storage.

 Data ethical guidelines: These guidelines help organizations make their data transparent and ensure data simplicity, security and privacy.

 Data feed: A data stream, such as Twitter subscription and RSS.

 Data marketplace: An online trading place for buying and selling data sets.

 Data modelling: Use data modeling technology to analyze data objects to gain insight into the inner meaning of data.

 Data set: A collection of large amounts of data.

 Data virtualization: The process of data integration to obtain more data information. This process usually introduces other technologies, such as databases, applications, file systems, web technologies, big data technologies, and so on.

 De-identification (De-identification): Also known as anonymization (anonymization), to ensure that individuals will not be identified through data.

 Discriminant analysis: A statistical method for classifying data: based on different classification criteria, data can be assigned to different groups, classes, or directories. It analyzes what is known about certain groups or clusters in the data and derives classification rules from it.

 Distributed File System: A system that provides a simplified, highly available way to store, analyze, and process data.

 Document Store Databases (document-oriented databases): Databases designed specifically for storing, managing, and retrieving document data, which is also known as semi-structured data.

 Data Governance: Data governance is the process of moving from scattered data to unified master data, from little or no governance of organization and processes to comprehensive enterprise-wide data governance, and from a chaotic state of master data to well-organized master data.

 Data Transfer Service: Mainly used to convert data between different databases, such as converting data between SQL Server and Oracle.

 Data integration: Bringing together, logically or physically, data of different sources, formats, and characteristics, so as to provide the enterprise with comprehensive data sharing.

  

E

 ETL: ETL stands for extract, transform, and load. It refers to the process of "extracting" raw data, "transforming" it into a usable form through cleaning and enrichment, and "loading" it into the appropriate store for the systems that will use it. Although ETL originated with data warehouses, the same process is used when ingesting data, for example when acquiring data from external sources in a big data system.
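A minimal ETL sketch in Python, assuming a hypothetical sales.csv with country and amount columns: it extracts rows from the CSV, transforms types and casing, and loads the result into an in-memory SQLite table.

```python
import csv
import sqlite3

def extract(path):
    """Extract: stream rows out of a CSV file as dictionaries."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: convert types and normalize values."""
    for row in rows:
        yield {"country": row["country"].upper(), "amount": float(row["amount"])}

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.executemany("INSERT INTO sales (country, amount) VALUES (:country, :amount)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, amount REAL)")
load(transform(extract("sales.csv")), conn)  # sales.csv is a hypothetical input file
print(conn.execute("SELECT country, SUM(amount) FROM sales GROUP BY country").fetchall())
```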

 Enterprise applications: A term commonly used within the software industry; in plain terms, a stable, secure, and efficient computer-based distributed information management system used inside an enterprise.

 Exploratory analysis: Exploring patterns in data without a standard procedure or method; it is a way of discovering the main characteristics of data and data sets.

Exabyte (EB): approximately equal to 1,000 PB (petabytes), or about 1 billion GB. The world now produces roughly 1 exabyte of new information every day.

 Extract, Transform and Load (ETL: Extract, Transform and Load)-is a process used in a database or data warehouse. That is, extract (E) data from various data sources, transform (T) into data that can meet business needs, and finally load (L) it into the database.

 Enterprise productivity: The ability of an enterprise to provide a certain product or service to the society in a certain period of time.

 

F

 Fuzzy logic: How often are we 100% certain about anything? Very rarely! Our brains aggregate data into partial truths, which are further abstracted into a kind of threshold that determines our decisions. Fuzzy logic is a computing approach that, in contrast to the hard "0" and "1" of Boolean algebra, aims to mimic the human brain by working with degrees of truth rather than absolutes.
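A minimal sketch of a fuzzy membership function: instead of a Boolean "warm or not", a temperature gets a degree of membership between 0 and 1; the 15-25°C ramp is a hypothetical choice.

```python
def warm_membership(temp_c: float) -> float:
    """Piecewise-linear membership for 'warm': 0 below 15°C, 1 above 25°C, a ramp in between."""
    if temp_c <= 15:
        return 0.0
    if temp_c >= 25:
        return 1.0
    return (temp_c - 15) / 10.0

for t in (10, 18, 22, 30):
    print(f"{t}°C is warm to degree {warm_membership(t):.2f}")
```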

 Failover: When a server in the system fails, the running task can be automatically switched to another available server or node.

 Framework: also known as software architecture, is an abstract description of the overall structure and components of the software, used to guide the design of various aspects of large-scale software systems.

 Flow monitoring: Monitoring of data traffic, usually covering outgoing traffic, incoming traffic, and total volume. For example, Tencent Mobile Manager lets WeChat users monitor their traffic precisely.

 Fault-tolerant design: A system that supports fault-tolerant design should be able to continue running when a certain part fails.

 Finance: The activity of making decisions about the optimal allocation of resources across time periods under uncertainty.

 G

 Gamification: Applying game thinking and game mechanics in non-game domains; it is a very friendly, and very effective, way of creating and collecting data.

 Graph Databases: Databases that store data as a graph structure (a finite set of ordered pairs, or entities) made up of nodes, edges, and properties. They offer index-free adjacency: every element in the database links directly to its neighboring elements.

 Grid computing: Connect many computers distributed in different locations together to deal with a specific problem, usually through the cloud to connect the computers together.

 

H

 Hadoop User Experience (Hue): Hue is an open source interface that makes Apache Hadoop easier to use. It is a web-based application with a file browser for the distributed file system, a job designer for MapReduce, an Oozie application for scheduling workflows, a shell, Impala and Hive UIs, and a set of Hadoop APIs.

 Human capital (Human capital): refers to the accumulation of knowledge and skills acquired by workers through investment in education, training, practical experience, migration, health care, etc., also known as "non-material capital."

 Hardware: A general term for various physical devices composed of electronic, mechanical, and optoelectronic components in a computer system.

 High-Performance Analytic Appliance (HANA): A software and hardware in-memory platform from SAP designed for high-volume data transactions and analytics.

 HBase: A distributed, column-oriented database. It uses HDFS as its underlying storage and supports both batch computing with MapReduce and real-time, interactive access.

 Hadoop - an open source distributed computing framework used to develop distributed programs for storing and processing big data.

 Hadoop database (HBase): an open source, non-relational, distributed database, used in conjunction with the Hadoop framework.

 Hadoop Distributed File System: is a distributed file system designed to run on commodity hardware.

 High-Performance Computing (HPC: High-Performance-Computing): Use supercomputers to solve extremely complex computing problems.

 Hadoop in the cloud: Some cloud solutions are based entirely on a specific service that loads and processes data. For example, with IBM Bluemix you can configure a MapReduce service based on IBM InfoSphere BigInsights that can process up to 20 GB of information, but the size, configuration, and complexity of the Hadoop service itself are not under your control. Other service-based solutions impose the same kind of constraints.

  I

 Infrastructure as a Service (IaaS): A service model in which consumers obtain complete computing infrastructure over the Internet.

 Infrastructure as Code: A way of describing compute and network infrastructure in source code, so that it can be treated like any other software system. The code can be kept under source control to ensure auditability and reproducibility, and is subject to all the usual practices of testing and continuous delivery. This is the approach that has been used for over a decade to manage ever-growing cloud platforms, and it will be the main way computing infrastructure is handled in the future.

In-memory computing: It is generally true that any computation that avoids I/O will be faster. In-memory computing is such a technique: it moves the whole working data set into the collective memory of the cluster and avoids writing intermediate results to disk during computation. Apache Spark is an in-memory computing system, which gives it a big advantage over I/O-bound systems such as MapReduce.
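A minimal PySpark sketch of in-memory computing: caching a DataFrame keeps the working set in cluster memory across the actions that follow. The input path and column name are hypothetical, and pyspark is assumed to be installed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

events = spark.read.json("events.json")  # hypothetical input file of event records
events.cache()                           # keep the working set in cluster memory

# both actions below reuse the cached data instead of re-reading from disk
print(events.count())
events.groupBy("user_id").count().show()  # "user_id" is a hypothetical column
```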

 Internet of Things (IoT): The latest buzzword is the Internet of Things (IoT). IoT is the interconnection of computing devices in embedded objects (such as sensors, wearable devices, cars, refrigerators, etc.) through the Internet, and they can send and receive data. The Internet of Things has generated massive amounts of data and brought many opportunities for big data analysis.

 In-memory database (IMDB): A database management system that, unlike an ordinary DBMS, keeps its data in main memory rather than on disk, which allows very fast data processing and access.

J

 Legal data compliance (juridical data compliance): Relevant when you use cloud computing solutions that store your data in different countries or on different continents. You need to check whether the data stored in another country complies with that country's laws.



L

Load balancing: A tool that distributes workload across two or more computers on a network so that all users get service faster and finish their work sooner; it is the main reason for building computer server clusters. It can be implemented in software, in hardware, or as a combination of the two.

 

47. Linked Data

 

Linked data refers to interconnected data sets that can be shared or published on the web and used collaboratively by machines and people. It is highly structured, which distinguishes it from big data, and it is used to build the Semantic Web, in which large amounts of data are published on the web in a standard format.

 

48. Location Analytics

 

Location analytics is the process of gaining insight from the geographic location of data or of business data. It visualizes, analyzes, and interprets the information the data describes, and lets users associate location-related information with their data sets.

 

49. Log File

 

A log file is a special type of file that records events occurring in an operating system, in running software, or in messages exchanged between users.

 

M

50. Metadata

 

Metadata is data about data: the administrative, descriptive, and structural data that identifies an asset.

 

51. MongoDB

 

MongoDB is an open source, NoSQL, document-oriented database program. It stores JSON-like documents in a binary representation called BSON, which makes it very quick and easy to integrate data into applications.
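A minimal sketch of using MongoDB from Python with the pymongo driver, assuming a local MongoDB instance on the default port; the database, collection, and document are hypothetical.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB server
db = client["shop"]                                # hypothetical database

# insert a JSON-like document; MongoDB stores it internally as BSON
db.orders.insert_one({"user": "alice", "items": ["book", "pen"], "total": 14.5})

# query documents back with a filter document
for order in db.orders.find({"user": "alice"}):
    print(order)
```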

 

52. Multi-Dimensional Database (MDB)

 

A multidimensional database (MDB) is a database optimized for OLAP (online analytical processing) applications and data warehouses. An MDB can easily be created from the contents of a relational database, and because it processes data inside the database itself, results can be produced quickly.

 

53. Multi-Value Database

 

A multi-value database is a multi-dimensional NoSQL database that can understand three-dimensional data. These databases can also process XML and HTML strings directly.

 

Some examples of commercial multi-value databases are OpenQM, Rocket D3 Database Management System, jBASE, InterSystems Caché, OpenInsight, and InfinityDB.

 

54. Machine-Generated Data

 

Machine-generated data is information produced by machines (computers, applications, processes, or other non-human mechanisms). It is sometimes called amorphous data, since humans rarely modify or change it.

 

55. Machine Learning

 

Machine learning is a field of computer science that uses statistical techniques to give computers the ability to "learn" from data. Machine learning is used to uncover hidden opportunities in big data.

 

56. MapReduce

 

MapReduce is a processing technique for handling huge data sets with a parallel, distributed algorithm on a cluster. A MapReduce job has two phases: the "map" function splits the query into parts and processes the data at the node level, and the "reduce" function gathers the results of the map phase and computes the answer to the query. MapReduce is used together with HDFS to process big data; this coupling of HDFS and MapReduce is what Hadoop provides.
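A minimal, single-machine sketch of the MapReduce idea: a map step that emits (word, 1) pairs and a reduce step that sums them. In a real cluster the map work runs in parallel across nodes, and the pairs are shuffled by key before reducing.

```python
from collections import defaultdict
from itertools import chain

documents = ["big data needs big ideas", "data beats opinion"]  # hypothetical input splits

# map: emit a (word, 1) pair for every word in every document
mapped = chain.from_iterable(((word, 1) for word in doc.split()) for doc in documents)

# shuffle + reduce: group the pairs by word and sum the counts
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))
```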

 

57. Mahout

 

Apache Mahout is an open source data mining library. It provides data mining algorithms for regression, clustering, classification, and statistical modeling, and implements them using the MapReduce model.

 

N

58. Network Analysis

 

Network analysis is the application of graph theory used to categorize, understand, and visualize relationships between the nodes of a network. It is an effective way to analyze connections and assess their strength in any domain, such as forecasting, market analysis, and healthcare.

 

59. NewSQL

 

NewSQL is a modern relational database management system that can provide the same scalable performance as NoSQL systems for OLTP read/write workloads. It is a well-defined database system and easy to learn.

 

60. NoSQL

 

Widely expanded as "not only SQL", NoSQL refers to database management systems that depart from the relational model. A NoSQL database is not built on tables and does not use SQL to manipulate data.

 

O

61. Object Databases

 

A database that stores data in the form of objects is called an object database. These objects are used in the same way as objects in object-oriented programming. Object databases are distinct from graph databases and relational databases, and most of them provide a query language for finding objects declaratively.

 

62. Object-based Image Analysis

 

This is image analysis based on objects: it works on groups of related pixels selected as "image objects" (or simply objects), in contrast to traditional digital image analysis, which works on individual pixels.

 

63. Online Analytical Processing (OLAP)

 

In this process, multi-dimensional data is analyzed using three operators: drill-down, consolidation (roll-up), and slicing and dicing (see the sketch below).

Drill-down lets users view the underlying detail.

Consolidation (roll-up) aggregates the data into summaries.

Slicing and dicing lets users select a subset of the data and view it from different perspectives.
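A minimal sketch of the three OLAP operations using pandas on a hypothetical sales table; a real OLAP engine works on multi-dimensional cubes, but groupby and pivot_table illustrate the same ideas.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["EU", "US", "EU", "US"],
    "product": ["A",  "A",  "B",  "B"],
    "revenue": [100,  150,  120,  180],
})

# consolidation (roll-up): summarize revenue per year
print(sales.groupby("year")["revenue"].sum())

# drill-down: break the yearly totals out by region
print(sales.groupby(["year", "region"])["revenue"].sum())

# slice and dice: restrict to one region, then pivot by product
eu = sales[sales["region"] == "EU"]
print(eu.pivot_table(index="year", columns="product", values="revenue", aggfunc="sum"))
```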

 

64. Online Transactional Processing (OLTP)

 

It is a big data term for the class of systems that allow users to access and work with large amounts of transactional data in real time, so that meaning can be derived from the data as it is accessed.

 

65. Open Data Center Alliance (ODCA)

 

ODCA is a consortium of global IT organizations whose main goal is to speed the adoption of cloud computing.

 

66. Operational Data Store (ODS)

 

An ODS is a location where data gathered from multiple sources is collected and stored. It lets users perform additional operations on the data before it is sent on to the data warehouse for reporting.

 

67. Oozie

 

This is a big data term used for processing systems, allowing users to define a set of jobs. These jobs are written in different languages, such as Pig, MapReduce and Hive. Oozie allows users to link these jobs to each other.

 

P

68. Parallel Data Analysis

 

The process of decomposing the analysis problem into smaller partitions, and then running the analysis algorithm on each partition at the same time is called parallel data analysis. This type of data analysis can be run on different systems or on the same system.

 

69. Parallel Method Invocation (PMI)

 

A system that allows program code to invoke multiple methods or functions at the same time.

 

70. Parallel Processing

 

The system has the ability to perform multiple tasks at the same time.

 

 

 

71. Parallel Query

 

Parallel queries can be defined as queries that can be executed on multiple system threads to improve performance.

 

72. Pattern Recognition

 

The process of classifying or labeling the recognized patterns in the machine learning process is called pattern recognition.

 

73. Pentaho

 

Pentaho is a software organization that provides open source business intelligence products, these products are called Pentaho Business Analytics. Pentaho provides OLAP services, data integration, dashboards, reports, ETL and data mining functions.

 

74. Petabyte

 

A unit of data measurement equal to 1,024 TB or 1 million gigabytes is called PB.

 

Q

75. Query

 

A query is a request for specific information, posed in order to answer a question.

 

76. Query Analysis

 

Query analysis is the process of analyzing search queries; it is done in order to optimize a query for the best possible results.

 

R

77. R

 

It is a programming language and an environment for graphics and statistical computing. This is a very extensible language that provides many graphics and statistical techniques, such as nonlinear and linear modeling, time series analysis, classical statistical testing, clustering, classification, etc.

 

78. Re-identification

 

Data re-identification is the process of matching anonymous data with available auxiliary data or information. This approach helps to find out who this data belongs to.

 

79. Real-time Data

 

Data that can be created, stored, processed, analyzed, and visualized immediately (that is, in milliseconds) is called real-time data.

 

80. Reference Data

 

A big data term for data that describes an object and its attributes; the object described by reference data may be physical or virtual.

 

81. Recommendation Engine

 

It is an algorithm that analyzes the actions and purchases a customer makes on an e-commerce site, and then uses that analysis to recommend complementary products to the customer.
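A minimal sketch of a user-based recommendation step with NumPy on a hypothetical user-item rating matrix: unseen items are scored by the ratings of similar users, with similarity measured by cosine similarity.

```python
import numpy as np

# hypothetical ratings matrix: rows are users, columns are items, 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

target = 0  # recommend something for the first user
similarity = np.array([cosine(ratings[target], other) for other in ratings])
similarity[target] = 0.0               # ignore the user's similarity to themselves

scores = similarity @ ratings          # weight other users' ratings by similarity
scores[ratings[target] > 0] = -np.inf  # never re-recommend items already rated

print("recommended item index:", int(np.argmax(scores)))
```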

 

82. Risk Analysis

 

It is the process of assessing the risk attached to an action, project, or decision, done by applying different statistical techniques to data sets.

 

83. Routing Analysis

 

It is the process of finding the optimal route by using many variables relating to a mode of transport, in order to improve efficiency and reduce fuel costs.

 

S

84. Software as a Service (SaaS)

 

It is a big data term used for software as a service. It allows vendors to host applications and then make the applications available over the Internet. SaaS services are provided by SaaS providers in the cloud.

 

85. Semi-Structured Data

 

Semi-structured data is data that does not conform to a rigid, conventional structure but is still organized in a recognizable way. It is neither fully structured nor unstructured; it contains tags, data tables, and other structural elements. Examples of semi-structured data are XML documents, emails, tables, and graphs.

 

86. Server

 

A server is a virtual or physical computer that receives requests from software applications and sends back responses over the network.

 

87. Spatial Analysis

 

The analysis of spatial data (that is, topological and geographic data) is called spatial analysis. This analysis helps to identify and understand all the information about a specific area or location.

 

88. Structured Query Language (SQL)

 

SQL is a standard programming language used to retrieve and manage data in relational databases. This language is very useful for creating and querying relational databases.

 

89. Sqoop

 

It is a connectivity tool used to move data between non-Hadoop data stores and Hadoop. Users instruct Sqoop to retrieve data from Teradata, Oracle, or any other relational database and specify a target location in Hadoop into which the retrieved data should be moved.

 

90. Storm

 

Apache Storm is a distributed, open source, real-time computing system for data processing. It is one of the essential big data terms and is responsible for the reliable processing of unstructured data in real time.

 

T

91. Text Analytics

 

Text analysis is basically the process of applying linguistics, machine learning and statistical techniques to text-based sources. Text analysis is used to derive insights or meanings from text data by applying these techniques.

 

92. Thrift

 

It is a software framework for developing cross-language services. It integrates a code generation engine with a software stack to build services that work seamlessly and efficiently across programming languages such as Ruby, Java, PHP, C++, Python, and C#.

 

U

93. Unstructured Data

 

Data whose structure cannot be defined is called unstructured data. It becomes difficult to process and manage unstructured data. Common examples of unstructured data are text entered in email messages and data sources with text, images, and videos.

 

V

94. Value

 

This big data term refers to the worth of the data available. Collected and stored data may be valuable to society, customers, and organizations. It is one of the important big data terms because the whole point of big data, especially for large companies, is to extract value, that is, benefit, from it.

 

95. Volume

 

This big data term relates to the total amount of data available. The data can range from megabytes to brontobytes.

 

W

96. WebHDFS

 

WebHDFS is a protocol for accessing HDFS through an industry-standard RESTful mechanism. It lets users connect to HDFS from outside the cluster without native Hadoop libraries, takes advantage of the parallelism of the Hadoop cluster, and provides web-service access to Hadoop components.
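A minimal sketch of reading a file over WebHDFS with the Python requests library, assuming a NameNode with WebHDFS enabled; the host, port, user, and file path are hypothetical.

```python
import requests

# hypothetical NameNode host/port (9870 is the usual Hadoop 3 NameNode HTTP port)
base = "http://namenode:9870/webhdfs/v1"
path = "/user/alice/logs/2024-01-01.log"  # hypothetical HDFS file

# op=OPEN reads the file; requests follows the redirect to the DataNode automatically
resp = requests.get(f"{base}{path}", params={"op": "OPEN", "user.name": "alice"})
resp.raise_for_status()
print(resp.text[:200])  # first characters of the file content
```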

 

97. Weather Data

 

Data trends and patterns that help track the atmosphere are called weather data. The data basically consists of numbers and factors. Now, real-time data is available for organizations to use in different ways. For example, logistics companies use weather data to optimize the transportation of goods.

 

X

98. XML Databases

 

A database that supports storing data in XML format is called an XML database. XML databases are usually associated with document-oriented databases, and the data stored in them can be exported, serialized, and queried.

 

Y

99. Yottabyte

 

It is a big data term for a unit of data measurement. One yottabyte is equal to 1,000 zettabytes, roughly the amount of data stored on 250 trillion DVDs.

 

Z

100. ZooKeeper

 

It is an Apache software project and Hadoop subproject that provides open source centralized services (such as naming, configuration, and synchronization) for distributed systems and helps coordinate large distributed systems.

 

101. Zettabytes

 

It is a big data term for a unit of data measurement. One zettabyte is equal to one billion terabytes, or 1,000 exabytes.

