On Big Data
As the appetite for ad-hoc access to both live and historical data rises among business users, the demand stretches the limits of how even the most robust analytics tools can navigate a spiraling universe of datasets. Satisfying this new order is often constrained by the laws of physics. Increasingly, the analytics-on-demand phenomenon, born of an intense focus on data-driven business decision-making, means there is neither time for traditional extract, transform, and load (ETL) processes nor time to ingest live data from its source repositories.
Time is not the only factor. The sheer volume of data, and the speed with which it is generated, are beyond the capacity and economic bounds of today’s typical enterprise infrastructures.
While breaking the laws of physics is obviously not in the domain of data professionals, a viable way to work around these physical limitations of querying data is to apply federated, virtual access. This approach, data virtualization (DV), is a solution that a growing number of large organizations are exploring and that many have implemented in recent years.
The appeal of DV is straightforward: by creating a federated tier where information is abstracted, DV can enable centralized access to data services. In addition, some DV solutions make cached copies of the data available, providing the performance of more direct access without the source data having to be rehomed.
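The caching idea can be illustrated with a minimal sketch. It assumes a hypothetical `CachedSource` wrapper and a caller-supplied `fetch` function standing in for a remote source; neither comes from any particular DV product, and real platforms manage cache refresh in far more sophisticated ways.

```python
import time
from typing import Callable, Dict, List, Tuple

Rows = List[tuple]


class CachedSource:
    """Hypothetical read-through cache placed in front of a remote data source."""

    def __init__(
        self,
        fetch: Callable[[str], Rows],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # assumed function that queries the remote source
        self._ttl = ttl_seconds    # how long a cached result is considered fresh
        self._clock = clock        # injectable clock, useful for testing
        self._cache: Dict[str, Tuple[float, Rows]] = {}

    def query(self, statement: str) -> Rows:
        now = self._clock()
        entry = self._cache.get(statement)
        if entry is not None and now - entry[0] < self._ttl:
            # Fresh cached copy: answer without touching the source.
            return entry[1]
        # Missing or stale: go to the source and remember the result.
        rows = self._fetch(statement)
        self._cache[statement] = (now, rows)
        return rows
```

In this sketch the source data never moves; only query results are held temporarily in the virtual tier, and the time-to-live bounds how stale an answer can be.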
Implementing DV is also attractive because it bypasses the need for ETL, which can be time-consuming and unnecessary in certain scenarios. Whether under the “data virtualization,” “data fabric,” or “data as a service” moniker, many vendors and customers see it as a core approach to creating logical data warehousing.
Data virtualization has been around for a while; nevertheless, we are seeing a new wave of DV solutions and architectures that promise to enhance its appeal and its feasibility for meeting the onslaught of new BI, reporting, and analysis requirements.
Data virtualization is a somewhat nebulous term. A handful of vendors offer platforms and services that are focused purely on enabling data virtualization and are delivered as such. Others offer it as a feature in broader big data portfolios. Regardless, enterprises that implement data virtualization gain a virtual layer over their structured and even unstructured datasets from relational and NoSQL databases, Big Data platforms, and even enterprise applications. This layer allows for the creation of logical data warehouses, accessed with SQL, REST, and other data query methods, and provides access to data from a broader set of distributed sources and storage formats. Moreover, DV can do this without requiring users to know where the data resides.
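As a toy illustration of such a virtual layer, the sketch below uses Python's built-in `sqlite3` module and SQLite's `ATTACH DATABASE` statement. Two separate in-memory databases stand in for independent source systems, and a temporary view plays the role of a logical table. The schema names `crm` and `billing`, the tables, and the sample rows are all invented for the example; a real DV platform would federate heterogeneous remote systems rather than SQLite databases.

```python
import sqlite3


def build_virtual_layer() -> sqlite3.Connection:
    """Return one connection that exposes two separate databases as one logical view."""
    conn = sqlite3.connect(":memory:")

    # Each ATTACH creates a distinct database, standing in for a separate source.
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS billing")

    # Hypothetical sample data living in the two "sources".
    conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    conn.executemany(
        "INSERT INTO billing.invoices VALUES (?, ?)",
        [(1, 100.0), (1, 50.0), (2, 75.0)],
    )

    # A TEMP view may reference tables in several attached databases.
    # It acts as the logical table that hides where the data resides.
    conn.execute(
        """
        CREATE TEMP VIEW customer_revenue AS
        SELECT c.name AS name, SUM(i.amount) AS total
        FROM crm.customers AS c
        JOIN billing.invoices AS i ON i.customer_id = c.id
        GROUP BY c.name
        """
    )
    return conn


def query_revenue(conn: sqlite3.Connection) -> list:
    # The caller queries the logical view only and never names a source.
    return conn.execute(
        "SELECT name, total FROM customer_revenue ORDER BY name"
    ).fetchall()
```

Calling `query_revenue(build_virtual_layer())` joins data from both stand-in sources through the single view, which is the essence of the logical data warehouse idea at a very small scale.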
Various factors are influencing the need for DV. In addition to the growth of data, the increased accessibility of self-service Business Intelligence (BI) tools such as Microsoft’s Power BI, Tableau, and Qlik is creating more concurrent queries against both structured and unstructured data. The notion that data is currency, while perhaps cliché, is increasingly and verifiably the case in the modern business world.
Accelerating the growth in data are the overall trend toward digitization, the pools of new machine data, and the ability of analytics tools and machine learning platforms to analyze streams of data from these and other sources, including social media. Compounding this trend are the growing use and capabilities of cloud services and the evolution of Big Data solutions such as Apache Hadoop and Spark.
Besides ad-hoc reporting and self-service BI demands being bigger than ever, many enterprises now have data scientists whose jobs are to figure out how to make use of all this new data in order to make their organizations more competitive. The emergence of cloud-native apps, enabled by Docker containers and Kubernetes, will only make analysis features more common throughout the enterprise technology stack. Meanwhile, the traditional approach of moving and transforming data to meet these needs and power these analytic capabilities is becoming less feasible with each passing requirement.
In this report we explore data virtualization products and technologies, and how they can help organizations that are experiencing this accelerated demand, while simplifying the query process for end users.
- Data virtualization is a relatively new option, with still-evolving capabilities for query or search against transactional and historical data, in near real-time, without having to know where the data resides.
- DV is often optimized for remote access to a cached layer of data, eliminating the need to move the data or allocate storage resources for it.
- DV is an alternative to the more common approach of moving data into warehouses or marts, by ingesting and transforming it using ETL and data prep tools.
- In addition to providing better efficiency and faster access to data, data virtualization can offer a foundation for addressing data governance requirements, such as compliance with the European Union’s General Data Protection Regulation (GDPR), by ensuring data is managed and stored as required by such regulatory regimes.
- Data virtualization often provides the underlying capability for logical data warehouses.
- Many DV vendors are accelerating the capabilities of their solutions by offering the same massively parallel processing (MPP) query capabilities found on conventional data warehouse platforms.
- Products are available across a variety of data virtualization approaches, including (1) core DV platforms; (2) standalone SQL query engines that can connect to a variety of remote data sources and can query across them; (3) data source bridges from conventional database platforms that connect to files in cloud object storage, big data platforms, and other databases; and (4) automated data warehouses.