The Sparkhound team is often asked, “What is the best way to use machine learning platforms to get accurate and meaningful data analytics results?” To help our clients, we have assembled these tips on:
Collecting and preparing relevant data sets
Properly training machine learning algorithms
Avoiding bias in the algorithms
Revising machine learning models as needed to ensure they continue to meet business needs, and
Other to-do items that come up in the reporting process for users of machine learning platforms.
Pose precise questions to the data sets before preparation begins, because data preparation will take up most of a team’s time. Even though the industry celebrates automated learning, most algorithms used in machine learning and data science expect highly structured, clean data tables as inputs. Transforming data from its raw state into a format that can answer those questions can take as much as 70% of a project’s time.
It is important to remember that not all machine learning and data science specialists are experienced data pipeline engineers. Assign the work of combining and cleaning data to data integration specialists, leaving researchers and data scientists to focus on what they do best: actually analyzing data.
The details matter. Machine learning algorithms are often sensitive to data-set details that are irrelevant in other reporting or application development projects. For example, in a typical report or spreadsheet, there is no effective difference between a cell that contains spaces or zeros and one that is left blank. A machine learning algorithm, however, may intentionally skip rows of data that contain blank values. An expensive predictive maintenance algorithm could therefore skip observations that really matter (such as equipment log entries that lack a temperature reading), leading to wildly different results. It is necessary to know what null values mean in the analysis and to have data analysts on staff who know how to handle them.
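To make the blank-versus-zero distinction concrete, here is a minimal Python sketch. The equipment IDs, the readings, and the helper name parse_reading are hypothetical and chosen only for illustration; a real project would decide what a blank cell means with the people who own the data.

```python
from typing import Optional

def parse_reading(raw: Optional[str]) -> Optional[float]:
    """Return the reading as a float, or None when the cell is blank.

    A cell holding only spaces is treated as missing, but "0" is kept
    as a real measurement of zero.
    """
    if raw is None or raw.strip() == "":
        return None
    return float(raw)

# Hypothetical equipment log: (equipment id, raw temperature cell).
rows = [("pump-1", "71.5"), ("pump-2", ""), ("pump-3", "  "), ("pump-4", "0")]

parsed = [(equipment, parse_reading(cell)) for equipment, cell in rows]

# An algorithm that skips incomplete rows would only ever see these.
complete = [(equipment, temp) for equipment, temp in parsed if temp is not None]

# Listing what was skipped lets an analyst decide whether that is acceptable.
missing = [equipment for equipment, temp in parsed if temp is None]

print(f"kept {len(complete)} of {len(parsed)} rows; missing readings for: {missing}")
```

Reporting the skipped rows, rather than dropping them silently, is the design choice that matters here: it turns a hidden modeling decision into something the team can review.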
Once a properly skilled team is in place, the first order of business is identifying outlier records, i.e., observations that show extreme results for a given set of metrics. In a manufacturing process, for example, certain steps may vary wildly in run time. Make sure to review those outliers and decide how to handle them. The one production run that takes 10 times longer than the others might be a mistake, but it might also represent a real-life downtime situation that should be considered.
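One common way to surface such records for review is the 1.5 × IQR rule of thumb, sketched below in Python. The rule is an assumption made for this example, not a method the tips prescribe, and the run times are invented. Note that the code only flags records; deciding whether a flagged run is a mistake or real downtime remains a human judgment.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the k * IQR fences."""
    q1, _median, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical run times in minutes for one manufacturing step.
run_times = [42, 45, 44, 43, 46, 41, 44, 450]

for index, minutes in flag_outliers(run_times):
    print(f"run {index} took {minutes} minutes: review before keeping or dropping")
```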
Prepare to operationalize algorithms from day one. A careful review of models and their data inputs is essential for ensuring they are accurate and stay that way. Simply put, script everything! If the team codes every step in its data integration and analysis project, re-running that analysis over and over again (and validating the results) will be far more efficient than repeating those steps by hand.
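As a rough illustration of what "script everything" can look like, the Python sketch below chains a cleaning step and a summary step into one function and fingerprints the result so that a re-run can be checked against an earlier one. The step names, the sample rows, and the fingerprint helper are all hypothetical; this is a sketch of the idea, not a recommended pipeline design.

```python
import hashlib
import json
from statistics import mean

def clean(rows):
    """Drop rows with no recorded run time and convert the rest to floats."""
    return [
        {"step": r["step"], "minutes": float(r["minutes"])}
        for r in rows
        if r["minutes"].strip() != ""
    ]

def summarize(rows):
    """Average run time per step, built in sorted order for stable output."""
    steps = sorted({r["step"] for r in rows})
    return {s: mean(r["minutes"] for r in rows if r["step"] == s) for s in steps}

def run_pipeline(raw_rows):
    """The whole analysis as one repeatable call."""
    return summarize(clean(raw_rows))

def fingerprint(result):
    """Stable hash of a result, used to confirm that a re-run matches."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical raw input, as it might arrive from a CSV export.
raw = [
    {"step": "mix", "minutes": "12"},
    {"step": "mix", "minutes": "14"},
    {"step": "bake", "minutes": ""},
    {"step": "bake", "minutes": "30"},
]

first = run_pipeline(raw)
second = run_pipeline(raw)
print(first, fingerprint(first) == fingerprint(second))
```

Because every step is code, validating a re-run reduces to comparing two fingerprints instead of retracing manual spreadsheet work.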
Prepare to re-engineer the data pipeline. Once the algorithm works, it needs to be operationalized in a real production scenario, whether by frequently retraining the model on much larger data sets than are currently available or by sending the model’s results to other software applications. In these cases, the team should bring in trained data pipeline and software engineers to embed the algorithms in easy-to-maintain code.
It’s important to be upfront about interpretation needs; not all algorithms are created equal. Some algorithms, such as neural nets, can give exceptionally accurate predictive results but are notoriously difficult to explain. The model may show that a group of customers is likely to leave based on past examples, but not why they might leave. Sacrificing some predictive accuracy for the ability to understand and act on data modeling results is often the right trade.
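At the interpretable end of that trade-off sits a model as simple as a single threshold rule. The Python sketch below fits one on invented data; the feature (support tickets filed last quarter), the labels, and the function name best_threshold are assumptions made purely for illustration. It will rarely match a neural net for accuracy, but its output is a sentence a business user can act on.

```python
def best_threshold(values, labels):
    """Find the cutoff t where predicting 'leaves' for value >= t is most accurate.

    Ties are resolved in favor of the lowest cutoff.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(values)):
        correct = sum((v >= t) == y for v, y in zip(values, labels))
        acc = correct / len(values)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical customers: tickets filed last quarter, and whether they left.
tickets = [0, 1, 1, 2, 3, 4, 5, 6]
left = [False, False, False, False, True, False, True, True]

threshold, accuracy = best_threshold(tickets, left)
print(f"Rule: customers with {threshold}+ tickets are likely to leave "
      f"(right for {accuracy:.0%} of past customers)")
```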
To conclude, in order to get relevant and useful results, the algorithms must be properly trained and scientists must always be on the lookout for bias. Having properly skilled and trained employees to teach and revise the machine learning models is imperative. Otherwise, all that computing power will be wasted, because the results will be erroneous and therefore not applicable to the company’s business needs.