What is Data Engineering? Defined, Explained, and Trends
Chief Technology Evangelist
May 10, 2022
Data is not just big, it is enormous. In 2021 the volume of data reached 79 zettabytes; by 2025, this volume will more than double to 181 zettabytes. How big is a zettabyte? It is 10 to the power of 21 bytes. If harnessed and understood, data can contribute important insights to a myriad of processes and behaviors. While data scientists use data to generate these insights, a data engineer designs and builds software systems that facilitate a data pipeline to transform and deliver data to data scientists in a format they can leverage. The data engineer often sits between the software engineer and the data scientist in the solution pipeline. A data engineer designs, builds, and uses tools, often based on artificial intelligence (AI) and machine learning (ML), that provide data integration capabilities.
In broad terms, a data engineer’s role encapsulates the following processes:
Query data from a source (Extract)
Modify the data (Transform)
Move the transformed data to a place where a data scientist can access it (Load)
A change of process order is disrupting the space as ETL (Extract, Transform, Load) becomes ELT (Extract, Load, Transform). This allows data to be modified in the location where it will be processed, rather than transforming the data before moving it.
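The three steps above, and the ETL-versus-ELT ordering, can be sketched in a few lines. This is a minimal illustration using in-memory stand-ins for the source system and the warehouse; all names are illustrative, not any real tool's API.

```python
# Minimal sketch of ETL vs. ELT using in-memory stand-ins for the
# source system and the warehouse. All names here are illustrative.

def extract(source):
    """Query raw rows from a source system."""
    return list(source)

def transform(rows):
    """Standardize rows: trim whitespace, normalize case, drop empties."""
    return [r.strip().lower() for r in rows if r.strip()]

def load(rows, warehouse):
    """Move rows into the warehouse (here, just a list)."""
    warehouse.extend(rows)
    return warehouse

source = ["  Alice ", "BOB", "", "Carol"]

# ETL: transform before loading.
etl_warehouse = load(transform(extract(source)), [])

# ELT: load raw data first, then transform it inside the warehouse.
elt_warehouse = load(extract(source), [])
elt_warehouse[:] = transform(elt_warehouse)

assert etl_warehouse == elt_warehouse == ["alice", "bob", "carol"]
```

Either ordering yields the same clean rows here; in practice ELT defers the transform to the warehouse's own compute, which scales better for large volumes.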
We've teamed up with Arrow to break down what data engineering is, the processes involved, and the trends you should know about.
Why is data engineering important?
Data is vital in the transformation and optimization of digital systems. But data comes in many forms. A data engineer ensures that data is provided in an optimized, sanitized format. As data volumes grow, optimizing and managing different data types across disparate systems will become even more crucial.
What type of data engineering technologies are there?
Data engineers use a mix of tools and processes to standardize data used to drive data analytics and data science projects. The methods behind data engineering require a pipeline where source data is transformed and made usable.
Typical technologies used by data engineers that are part of the data pipeline include:
Data sources
Data is created from myriad sources. Regardless of where it is sourced, a data engineer must ensure it gets to data scientists in a form they can use for analysis. Typical data sources include:
Relational databases (e.g., Oracle, Microsoft SQL Server)
Document (JSON) databases (e.g., MongoDB)
Columnar databases (e.g., Cassandra)
File systems (accessed via API)
Spreadsheets (e.g., Excel)
Data ingestion & transportation
Data ingestion is used to transport data from its original source to a data storage medium. These mediums can be a data lake or data warehouse, etc. Once in storage, data can be processed for analysis. Data ingestion can be batch-processed or stream-processed.
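The batch-versus-stream distinction can be illustrated with a small sketch: batch ingestion groups records before writing, while stream ingestion writes each record as it arrives. A plain list stands in for the storage medium; everything here is illustrative.

```python
# Illustrative contrast between batch and stream ingestion.
# A plain list stands in for the data lake / warehouse.

def ingest_batch(records, storage, batch_size=2):
    """Collect records into fixed-size batches before writing."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            storage.append(list(batch))  # one write per batch
            batch.clear()
    if batch:
        storage.append(list(batch))      # flush the final partial batch

def ingest_stream(records, storage):
    """Write each record individually, as it arrives."""
    for rec in records:
        storage.append([rec])            # one write per record

events = ["click", "view", "purchase"]
batch_store, stream_store = [], []
ingest_batch(events, batch_store)
ingest_stream(events, stream_store)
print(batch_store)   # [['click', 'view'], ['purchase']]
print(stream_store)  # [['click'], ['view'], ['purchase']]
```

Batch processing trades latency for fewer, larger writes; stream processing delivers each event immediately at the cost of more frequent writes.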
Data storage
Data can be stored in several ways:
Data warehouse: typically stores historical data
Non-traditional data stores include:
Search engine query strings
Data querying and processing
The data pipelines and tools built by data engineers are used to query data before it is processed. Query processing involves querying a data store, such as a SQL database, and retrieving results. Once those results are retrieved, they can be processed to transform the data into a ‘clean,’ or standardized, form for use by data scientists.
Data transformation occurs when data is changed from one format to another. Data transformation is a vital step in the ETL process, as it generates the information needed for business intelligence and data mining. Data transformation optimizes the value a business can get from its data and enhances data quality.
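A small sketch of the query-then-transform step, using Python's built-in sqlite3 module as a stand-in for any SQL data store; the table, columns, and mapping are made up for illustration.

```python
# Hedged sketch: query a SQL store, then standardize the results.
# sqlite3 stands in for any SQL database; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(" Alice ", "us"), ("bob", "US"), ("Carol", "usa")],
)

# Query step: retrieve the raw rows.
rows = conn.execute("SELECT name, country FROM customers").fetchall()

# Transformation step: trim whitespace, fix casing, and map country
# codes to one standard form so downstream analysis sees 'clean' data.
COUNTRY_MAP = {"us": "US", "usa": "US"}
clean = [
    (name.strip().title(), COUNTRY_MAP.get(country.lower(), country.upper()))
    for name, country in rows
]
print(clean)  # [('Alice', 'US'), ('Bob', 'US'), ('Carol', 'US')]
```

The transformation here is deliberately tiny; real pipelines apply the same idea (standardize formats, deduplicate, validate) at scale.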
Data analytics and reporting
Analytics and reporting are two sides of the same coin. Data reporting organizes data into accessible forms such as visuals. Data analytics takes data and looks for patterns and trends to gain insights for improving business processes.
Data governance
Data governance processes and procedures oversee data integrity, availability, and security throughout the data life cycle. This includes how data is gathered, stored, shared, processed, and disposed of.
What are the data engineering trends and disruptions?
As data volumes continue to increase and new data sources evolve, the trends that follow, along with the security threats that accompany them, are disrupting the space:
Trends within data engineering
The rise of the modern data stack (MDS)
The traditional data stack (TDS) is rapidly morphing into the ‘modern data stack’ (MDS) in data engineering. The main changes noted in this movement are:
The cloud data warehouse as the new data source
A move from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform), with the main difference being that data is loaded into the target data warehouse before being transformed
Use of self-service analytics tools to generate reports and insights
Data technologies move to the cloud
The need for speed, connectivity across disparate data sources, and user access has moved data engineering into the cloud. Cloud-based data warehouses also provide scalability and flexibility. Other tools in the data engineering process, such as business intelligence tools, are also cloud-based or SaaS (Software-as-a-Service) in nature.
Automation technologies continue to grow
Automation technologies offset the lack of available data engineering talent and the increasing number of data environments. The emerging field of DataOps looks set to utilize automation technologies to optimize data delivery cycle times and minimize human errors.
Data security and privacy technologies
Data security and privacy continue to dominate corporate data strategies because of regulations such as the CCPA and GDPR. Cyber threats to data also play a vital part in developing robust data strategies. Because of a data engineer's key role in data processing, security and privacy are now an intrinsic part of the discipline of data engineering. Utilizing robust data security and privacy technologies as part of the data pipeline helps ensure that data projects adhere to regulations.
Disruptions within data engineering
These trends are opening the market to disruptive forces:
Analytics systems and modern business intelligence
Reverse Extract Transform Load (ETL)
As organizations begin utilizing central data stores, such as data warehouses, as a single source of truth, the opportunity arises to use the same methods of loading data into the data warehouse (ETL) to synchronize data across business applications (reverse ETL):
ETL reads data from various data sources, including SaaS tools, and writes that information to a data warehouse.
Reverse ETL reads data from the data warehouse and writes it to SaaS tools.
Reverse ETL prevents data from becoming siloed in a data warehouse by making it available in the tools business users work with daily, such as Salesforce. Two disruptors in this space are Census and Hightouch, both of which provide Reverse ETL tools for DataOps. One driver behind the emergence of these Reverse ETL systems is data democratization: both companies make data available and useful to everyone in a company, whether they work in sales, marketing, or customer support, for example.
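The reverse-ETL direction can be sketched as reading from the warehouse and upserting into a SaaS tool. Below, sqlite3 stands in for the warehouse and a stub class for a CRM client; real tools such as Census and Hightouch wrap actual vendor APIs, so every name here is illustrative.

```python
# Illustrative reverse-ETL sketch: read rows from the warehouse and
# push them into a SaaS tool. sqlite3 is the warehouse stand-in and
# CRMStub stands in for a real SaaS API client.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE leads (email TEXT, score REAL)")
warehouse.executemany(
    "INSERT INTO leads VALUES (?, ?)",
    [("a@example.com", 0.9), ("b@example.com", 0.2)],
)

class CRMStub:
    """Stand-in for a SaaS API client such as a CRM."""
    def __init__(self):
        self.contacts = {}
    def upsert(self, email, fields):
        self.contacts[email] = fields

def reverse_etl(conn, crm, min_score=0.5):
    """Sync high-scoring leads from the warehouse into the CRM."""
    rows = conn.execute(
        "SELECT email, score FROM leads WHERE score >= ?", (min_score,)
    )
    for email, score in rows:
        crm.upsert(email, {"lead_score": score})

crm = CRMStub()
reverse_etl(warehouse, crm)
print(crm.contacts)  # {'a@example.com': {'lead_score': 0.9}}
```

Note the direction: an ordinary ETL job would write *into* the warehouse; here the warehouse is the source of truth and the SaaS tool is the destination.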
Data observability
The increasing variety of data stores with complex functionality across distributed systems has led to the need for data observability tools. These tools monitor, track, and triage incidents to prevent downtime. They are essential for maintaining healthy data flow and are considered part of DataOps.
Disruptors in this space include Monte Carlo Data, creator of the world's first end-to-end platform for data health, built to prevent data downtime, and Accel Data, whose 'Torch' product provides a unified view of data, using AI and ML to automate data quality.
Data mesh
Centralization in data engineering can lead to data processing bottlenecks, slowing down data-driven decisions. Decentralization is now a disruptive force in data engineering. The core principle of a data mesh encapsulates the move from a monolithic structure, such as a centralized data lake, to a decentralized architecture. A data mesh is a platform that enables cross-domain data analysis and connects data in a microservice architecture. Two of the most disruptive vendors in the data mesh space are Starburst and Dremio. Both vendors offer a data mesh platform that decentralizes data analytics for super-fast analysis. They let users query data directly for high-performance dashboards and interactive data models, eliminating the need to transfer data into data warehouses.
Metrics stores
Metrics and KPIs (Key Performance Indicators) are an essential part of data engineering, specifically data analysis. A metrics store is a metrics layer traditionally thought of as a central place to store and govern key data metrics. Perhaps the biggest issue modern metrics stores tackle is inconsistent metric definitions. Solutions from GoodData and Transform are making waves in the metrics store space. Transform offers data analytics and visualization software through a centralized metrics store; the platform allows data engineers to define metrics in code. GoodData provides a low-code/no-code way to embed its analytics into existing applications.
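The "metrics defined in code" idea can be sketched as a single registry of metric definitions that every report computes from, so "revenue" means the same thing everywhere. The names below are illustrative, not any vendor's actual API.

```python
# Sketch of a 'metrics as code' layer: metric definitions live in one
# place so every consumer computes them the same way. Illustrative only.

METRICS = {
    # Each metric is a name plus a function over raw rows.
    "revenue": lambda rows: sum(r["price"] * r["qty"] for r in rows),
    "order_count": lambda rows: len(rows),
}

def compute(metric_name, rows):
    """Resolve a metric by name so all consumers share one definition."""
    return METRICS[metric_name](rows)

orders = [{"price": 10.0, "qty": 2}, {"price": 5.0, "qty": 1}]
print(compute("revenue", orders))      # 25.0
print(compute("order_count", orders))  # 2
```

Centralizing the definitions is what prevents two dashboards from silently computing the same KPI two different ways.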
Operational Systems and Data Science
Data labeling
Data labeling is a system of classifying different types of data. These labels provide context that allows ML algorithms to learn from the data. ML-based systems such as Natural Language Processing (NLP) and Computer Vision need data to be labeled in order to train their algorithms. For example, a label could identify an image as a bird or a mammal. Snorkel and Labelbox are innovating in this area. Snorkel has worked with Google, Intel, and Stanford Medicine to develop Snorkel Flow, an end-to-end machine learning platform for developing and deploying AI applications powered by programmatic data labeling. Labelbox is a start-up that has developed a data annotation and labeling platform offered through a web service and API.
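Programmatic labeling can be illustrated with a toy version of the idea: several heuristic 'labeling functions' each vote (or abstain) on an example, and the majority label wins. This mimics the concept behind tools like Snorkel Flow in spirit only; it is not any vendor's actual API, and the bird/mammal labels follow the example above.

```python
# Simplified illustration of programmatic data labeling: heuristic
# labeling functions vote on each example; the majority label wins.
from collections import Counter

BIRD, MAMMAL, ABSTAIN = "bird", "mammal", None

def lf_has_feathers(text):
    return BIRD if "feathers" in text else ABSTAIN

def lf_has_fur(text):
    return MAMMAL if "fur" in text else ABSTAIN

def lf_can_fly(text):
    return BIRD if "flies" in text else ABSTAIN

def label(text, lfs=(lf_has_feathers, lf_has_fur, lf_can_fly)):
    """Apply every labeling function and take the majority vote."""
    votes = [lf(text) for lf in lfs if lf(text) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(label("has feathers and flies"))  # bird
print(label("covered in fur"))          # mammal
```

The appeal is scale: instead of hand-labeling every example, engineers write a handful of rules that label an entire dataset, which the ML platform then denoises.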
Feature stores
A feature store is used to store 'feature values' created from transformed raw data for machine learning models. These features can be reused. In addition, a feature store is used to store and manage features and retrieve data for training. This is a new area of data engineering designed to apply automation to new feature computation, track feature versions, and monitor the health of features through the pipeline. Tecton and Hopsworks provide innovative feature store solutions. Tecton uses machine learning to deliver an enterprise-grade feature store that manages the complete lifecycle of ML features from development to production. Hopsworks is a Python-centric feature store that is data/environment agnostic and helps teams collaborate.
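A minimal in-memory sketch of the concept: feature values computed from raw data are written once per entity and retrieved later as training rows. Real feature stores such as Tecton and Hopsworks add versioning, monitoring, and online/offline serving on top; everything below is an illustrative simplification.

```python
# Minimal in-memory feature store sketch: write computed feature
# values per entity, then assemble them into training rows.

class FeatureStore:
    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def write(self, entity_id, feature_name, value):
        self._features[(entity_id, feature_name)] = value

    def get_training_rows(self, entity_ids, feature_names):
        """Assemble one feature vector per entity for model training."""
        return [
            [self._features.get((e, f)) for f in feature_names]
            for e in entity_ids
        ]

store = FeatureStore()
# Feature computation: transform raw purchase history into features.
raw = {"u1": [10, 20, 30], "u2": [5]}
for user, purchases in raw.items():
    store.write(user, "total_spend", sum(purchases))
    store.write(user, "num_orders", len(purchases))

rows = store.get_training_rows(["u1", "u2"], ["total_spend", "num_orders"])
print(rows)  # [[60, 3], [5, 1]]
```

The reuse benefit is the point: once `total_spend` is computed and stored, every model that needs it reads the same value instead of recomputing it.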
Specialized AI hardware
The vast amounts of data spread across a massive surface area, and the complex business questions begging for answers, will result in bottlenecks unless symbiotic technologies are used to handle the bandwidth. Specialized AI hardware is designed to deliver insights from these swathes of data quickly. Cerebras and Graphcore are key innovators in the AI hardware space. Cerebras is building an easy-to-use AI accelerator based on the largest processor in the industry. Graphcore's Intelligence Processing Unit (IPU) technology enables "AI researchers to undertake entirely new types of work, not possible using current technologies, to drive the next advances in machine intelligence."
Data science notebook
Data science notebooks are an essential tool in data engineering. Although a notebook sounds like it is paper-based, a data science notebook is actually a type of interactive computing environment. Such notebooks are used to write and execute code, visualize output and results, and share insights with others. Data science notebooks are not a new idea per se, but new features are driving innovation in the area, including open-source notebooks and notebooks that can run code in multiple programming languages. Colab and Deepnote are two disruptors in the data science notebook area. Colab (Colaboratory) is a hosted Google offering that combines executable code and rich text in a single document and supports data import from many data sources. Deepnote is Jupyter-compatible and specifically designed for team collaboration.
What are the CXO priorities surrounding data engineering?
A robust and forward-thinking data strategy is critical for the CXO role. But CXOs must prioritize certain areas in their data strategy:
Data security and data privacy present challenges to data projects
Data privacy and protection are complicated by the amount of data and the disparate nature of data types and stores. As data volumes increase, cyber threats follow. A CXO must ensure that data projects align with data privacy and protection regulations while also monitoring the variety of cyber threats targeting those projects. When assessing data engineering skills and technologies, look for people and vendors who understand the importance of data security and privacy across the data pipeline.
Rising data silos make it hard to consolidate data
Data silos make consolidating data across disparate systems challenging. The result can be holes in business intelligence. Applying Reverse ETL and data mesh can provide the means to break down data silos.
Lack of data analytics and data science talent limits in-house capabilities
As in many other technical areas, the skills gap in data modeling and data science is holding back data projects. The competition for skilled data scientists is stiff. Quanthub tracks the data science skills gap and notes that 67% of companies are looking to hire data scientists. A lack of in-house capability can stall an organization's data strategy. A CXO should prioritize the recruitment of data scientists and data analysts or expedite the training of internal employees who show aptitude and interest in this area.
Increasing focus on streaming analytics for real-time insight and integration
By 2030, 30% of data will be generated in real time by everyday devices such as smart TVs, wearables, online clickstreams, connected cars, etc. Traditionally, data has been sent for analysis in batches, so the perspective of the analytics output is often historical. Real-time data is used with streaming analytics to give an organization real-time, reactive insight. Look to include real-time analytics in your modern data stack.