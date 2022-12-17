Data is an increasingly valuable asset for today’s organizations because, when harnessed appropriately, it can drive significant competitive advantage and profitability. But harnessing it can be extremely difficult, especially when it comes to unlocking the insights the data contains.

Many businesses are building teams of data engineers and scientists to incorporate machine learning (ML) into their data pipelines and to help apply algorithms to data at scale. But using traditional ML approaches on large volumes of data can be challenging.

Let’s explore the top barriers to applying ML at scale:

Delayed Time-to-Insight

Moving large volumes of data between systems is often time-consuming. One reason is that tracing the relationships between disparate data sources (for example, from back-end customer databases to clickstream behavior or sensor logs) is far too complex for ordinary business intelligence tools. Because of this complexity, data scientists spend an inordinate amount of time integrating data and managing those relationships. As a result, they end up running predictive analytics on smaller data sets, which leads to the next barrier.

Inaccuracy in predictions

Because memory and computational limits often prevent large data sets from being processed in full, most data scientists build and train machine learning models on small subsets of the data, a practice known as down-sampling. This tends to reduce the accuracy of subsequent insights and puts at risk any business decisions based on them. It can also lead to overfitting, where a model is trained to fit the sample so closely that it fails to reproduce its predictions on real-world data.
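
To make the cost of down-sampling concrete, here is a minimal Python sketch that is not taken from any particular tool. It relies on two textbook formulas: the standard error of a sample mean, sigma / sqrt(n), and the probability that n independently sampled rows contain no example of an event that occurs at rate p, which is (1 - p)^n. The function names and the figures (a 0.1% event rate, a 1,000-row sample, a standard deviation of 10) are illustrative assumptions.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean computed from n independent rows."""
    return sigma / math.sqrt(n)


def prob_no_rare_rows(rate: float, n: int) -> float:
    """Chance that n independently sampled rows contain zero rare events."""
    return (1.0 - rate) ** n


# Hypothetical sizes: a down-sample versus the full table.
small_sample = 1_000
full_data = 1_000_000

# Noise in an estimate shrinks with the square root of the row count.
print(standard_error(10.0, small_sample))
print(standard_error(10.0, full_data))

# Hypothetical rare event seen in 0.1% of rows (for example, fraud).
print(prob_no_rare_rows(0.001, small_sample))  # about 0.37
```

Under these assumptions, more than one 1,000-row sample in three contains no rare events at all, so a model trained on such a sample never sees the pattern it is meant to predict.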

Slow and tedious deployment

The operational tools available to data scientists are still maturing, which makes it difficult to manage and deploy predictive models across multiple environments. This difficulty threatens the success of large-scale analytics initiatives and considerably increases the time to production.

Increased overheads

Moving data, building down-samples, rebuilding machine learning models for production, and running them on multiple platforms typically require additional hardware, software, developer tools, and resources, all of which come at a substantial cost. Furthermore, many ML algorithms are not designed for distributed processing, which is the main way to process large volumes of data faster. So the algorithms, and the data exploration and preparation functions that support them, must be custom-built to take advantage of modern distributed and parallel engines. Moving data in and out of the models adds further effort and cost.
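
As an illustration of what designing an algorithm for distributed processing can involve, here is a minimal Python sketch of simple linear regression restructured so that each data partition is summarized independently and the summaries are then merged. It uses a generic technique (additive sufficient statistics) and does not describe any specific engine; the names `Stats`, `summarize`, `merge`, and `fit` are hypothetical, and the partitions are processed sequentially here where a real engine would process them in parallel.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Stats:
    """Additive summary of (x, y) rows; enough to fit y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def summarize(rows: Iterable[Tuple[float, float]]) -> Stats:
    """Runs on one partition and sees only that partition's rows."""
    n = 0
    sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return Stats(n, sx, sy, sxx, sxy)


def merge(a: Stats, b: Stats) -> Stats:
    """Combines two partition summaries without revisiting any rows."""
    return Stats(a.n + b.n, a.sx + b.sx, a.sy + b.sy, a.sxx + b.sxx, a.sxy + b.sxy)


def fit(s: Stats) -> Tuple[float, float]:
    """Ordinary least squares for one feature, from the merged summary."""
    denom = s.n * s.sxx - s.sx ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (s.n * s.sxy - s.sx * s.sy) / denom
    intercept = (s.sy - slope * s.sx) / s.n
    return slope, intercept


# Made-up data following y = 2x + 1, split across three partitions.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]
total = reduce(merge, (summarize(p) for p in partitions), Stats())
slope, intercept = fit(total)
print(slope, intercept)  # prints 2.0 1.0
```

The design point is that only five numbers per partition cross the network, regardless of how many rows each partition holds. An algorithm that needs every row in one place cannot be split up this way without being rewritten.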

How can businesses implement ML models faster and at scale?

To overcome the barriers above and reduce the overall time taken for machine learning models to produce useful results, businesses should choose databases that provide in-database ML capabilities. There are several benefits to this approach:

High performance

In-database machine learning offers many of the commonly used ML algorithms natively, along with data preparation, exploration, and model evaluation functions. It minimizes or eliminates many of the barriers associated with applying ML at scale. Furthermore, because these databases already support massively parallel processing (MPP) and high data compression, analytics query times can be reduced from hours to minutes or seconds.

Reduced cost and complexity

Because the database is already optimized to run machine learning models directly within it, there is no need to duplicate data or process it on alternative platforms, which reduces overhead and complexity. Additionally, users can train, test, and deploy ML models using familiar tools, languages, and interfaces (for example, SQL, R, Python, PMML, and TensorFlow), improving speed, productivity, and overall user experience.
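
To illustrate the idea of running the computation where the data lives, the sketch below fits a simple linear regression entirely inside a SQL query, so only two coefficients leave the database rather than every row. It uses Python's built-in `sqlite3` module purely as a stand-in: SQLite is not an MPP analytical database, and the table `readings`, its columns, and the generated rows are invented for the example. Databases with in-database ML typically expose their own training and prediction functions, which are not shown here.

```python
import sqlite3

# In-memory database standing in for a real analytical database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (x REAL NOT NULL, y REAL NOT NULL)")

# Made-up rows following y = 2x + 1.
conn.executemany(
    "INSERT INTO readings (x, y) VALUES (?, ?)",
    [(float(i), 2.0 * i + 1.0) for i in range(100)],
)

# Least-squares fit expressed with SQL aggregates: the rows never leave
# the database, only the resulting slope and intercept do.
FIT_SQL = """
WITH s AS (
    SELECT COUNT(*)   AS n,
           SUM(x)     AS sx,
           SUM(y)     AS sy,
           SUM(x * x) AS sxx,
           SUM(x * y) AS sxy
    FROM readings
)
SELECT (n * sxy - sx * sy) / (n * sxx - sx * sx) AS slope,
       (sy - (n * sxy - sx * sy) / (n * sxx - sx * sx) * sx) / n AS intercept
FROM s
"""

slope, intercept = conn.execute(FIT_SQL).fetchone()
print(slope, intercept)
```

Because the columns are declared `REAL`, the divisions are floating-point rather than integer divisions, and the query recovers a slope of 2 and an intercept of 1 for this data.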

Accelerated model training and Time-to-Insight

By distributing workloads across the database’s multiple clusters and nodes, analytics teams and data scientists can achieve faster computation, shorter query times, and accelerated model training and prediction.

Increased prediction accuracy

It’s widely accepted in machine learning that, all else being equal, more data generally leads to greater accuracy. In-database machine learning removes the constraints of small-scale analytics, such as creating down-samples or moving data to different systems. Data scientists can discover insights and patterns buried in large data sets, resulting in increased prediction accuracy and better business decisions.

Better machine learning model management

In traditional approaches, ML models might exist only on someone’s laptop. If that individual or system is unavailable, the model might be inaccessible. But when the database trains models within itself, it creates a repository that can be shared by all data scientists using the platform. This improves overall model management and visibility and makes it easy to compare models: how they were generated, what data sets were used, and what the results were.
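
A shared model repository can be pictured as a table that records, for each model, how it was generated, which data set it used, and what results it achieved. The following is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a shared database. The table name `models`, its columns, the `register` helper, and the sample entries (including the accuracy figures) are all hypothetical placeholders and do not reflect any vendor's catalog.

```python
import json
import sqlite3

# In-memory stand-in; a real deployment would point at a shared database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE models (
        name        TEXT PRIMARY KEY,
        algorithm   TEXT NOT NULL,  -- how the model was generated
        params_json TEXT NOT NULL,  -- training parameters, serialized
        dataset     TEXT NOT NULL,  -- which data set was used
        accuracy    REAL NOT NULL   -- what the results were
    )
    """
)


def register(name: str, algorithm: str, params: dict, dataset: str, accuracy: float) -> None:
    """Record one trained model so that every user of the database can see it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO models VALUES (?, ?, ?, ?, ?)",
            (name, algorithm, json.dumps(params, sort_keys=True), dataset, accuracy),
        )


# Made-up entries for illustration only.
register("churn_v1", "logistic_regression", {"max_iter": 100}, "customers_2021", 0.81)
register("churn_v2", "random_forest", {"trees": 200}, "customers_2022", 0.86)

# Comparing models becomes an ordinary query instead of a hunt through laptops.
best = conn.execute(
    "SELECT name, algorithm, dataset, accuracy FROM models ORDER BY accuracy DESC LIMIT 1"
).fetchone()
print(best)
```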

Remember, it’s not just large enterprises that accrue vast amounts of data. Even a small business might possess gigabytes of information that could yield significant insights and competitive advantage. So the pragmatic approach is to look for analytical databases (with in-database ML, of course) that offer a free version, then pay when your business has grown to a certain level or wants enterprise support.

To summarize, the idea is to have a robust ML foundation. Once there, analytics teams can build models with relative ease, analyze data at scale and deliver insights as the business demands.

Rohit Amarnath, chief technology officer and head of accelerator for Vertica, a Micro Focus company, wrote this article.