Welcome. This video will present a brief introduction to automated machine learning, also known as AutoML. First off, it's useful to define and distinguish artificial intelligence from machine learning.
Artificial intelligence has been defined simply as getting computers to do tasks that require human intelligence. Artificial intelligence broadly focuses on automation to improve speed and efficiency, and generally serves to reduce manual human effort in various tasks.
On the other hand, machine learning has been defined as a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to learn from data without being explicitly programmed. The goals of machine learning involve scaling up to perform tasks that most or all humans can't achieve, as well as seeking to discover and innovate by integrating information to find novel patterns, designs, and strategies. So ultimately, artificial intelligence methods can perform tasks that don't involve machine learning, and machine learning in particular requires input data to learn from. Next, let's briefly explain what machine learning does in a nutshell.
Machine learning is all about discovering patterns and/or associations within data, and representing those patterns as what can be thought of as useful generalizations, which can include models. The generalization process takes specific examples and simplifies them into a single best-fit representation.
As a very simplistic example, here on the right we have some cartoons illustrating specific individual trees. The generalization process forms a generalized representation or model of what it is to be a tree. Later, we can apply these generalizations to decision making, for example classification or other forms of prediction. Again on the right, we see two additional cartoons which we're trying to label with the model, predicting each to be either a tree or not a tree. Now let's review some essential basics of data.
Specifically, let's distinguish unstructured from structured data. Unstructured data can't be displayed in rows and columns. This includes data such as images, audio, video, and long strings of natural language text. In practice, most data is unstructured, yet most algorithms work with structured data.
A process known as feature extraction can be used to transform unstructured data into structured data. Structured data, on the other hand, can be displayed in rows and columns, and it includes data values such as numbers, dates, and short strings of text.
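To make the feature-extraction step concrete, here's a minimal sketch of one common approach, turning unstructured text into structured rows and columns. This is our own illustration using scikit-learn's bag-of-words vectorizer; the video doesn't prescribe a specific method or library.

```python
# Minimal sketch: extracting structured (tabular) features from unstructured
# text using a simple bag-of-words representation.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the quick brown fox",
    "the lazy dog",
    "a quick brown dog",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)  # rows = documents, columns = word counts

print(vectorizer.get_feature_names_out())  # the extracted feature (column) names
print(X.toarray())                         # the structured representation
```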
Of note, structured data can be labeled or unlabeled, meaning that it can include a target endpoint or outcome as a dependent variable or not. Here's a specific illustration of some structured data, given here as a tabular dataset. The rows here represent instances in the data, and the columns here include a unique instance identifier, a variety of features, which can also be thought of as independent variables, and a label or outcome column, which generally represents the thing we're trying to predict in a given dataset. Machine learning itself comes in a number of major flavors representing different learning styles. The first is supervised learning, which can be accomplished on data that includes a label or outcome column.
Supervised learning includes classification tasks such as binary or multi-class classification. Binary classification seeks to distinguish between two outcome groups, and multi-class classification seeks to distinguish between three or more. Supervised learning also includes regression tasks, where we're trying to predict a specific quantitative outcome as a value as close to the true value as possible.
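As a minimal illustrative sketch of these two supervised tasks (our own example; any scikit-learn-style library would do), here's binary classification and regression on small synthetic datasets:

```python
# Minimal sketch of supervised learning: binary classification and regression.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Binary classification: distinguish between two outcome groups.
X_clf, y_clf = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_clf, y_clf)
print("classification accuracy:", clf.score(X_clf, y_clf))

# Regression: predict a quantitative outcome as close to the true value as possible.
X_reg, y_reg = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
print("regression R^2:", reg.score(X_reg, y_reg))
```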
Another major branch of machine learning deals with unlabeled data that instead focuses on patterns and relationships among features. Unsupervised learning includes clustering and dimensionality reduction. Another less well-known style is semi-supervised learning, which seeks to make the best of having partially labeled data.
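Here's a minimal sketch (again our own illustration) of the two unsupervised techniques just mentioned, clustering and dimensionality reduction, applied to unlabeled data:

```python
# Minimal sketch of unsupervised learning: clustering with k-means and
# dimensionality reduction with PCA, on unlabeled synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=150, n_features=4, centers=3, random_state=0)

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)  # compress 4 features down to 2

print(clusters[:10])  # a cluster assignment for each instance
print(X_2d.shape)     # (150, 2)
```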
And the last major style we'll mention is reinforcement learning. This style largely deals with multi-step problems, or problems that require multiple decisions to be made before feedback might be available suggesting whether that series of decisions was useful or not. This style of learning would be relevant, for example, to training an algorithm to play games or navigate an environment. Overall, machine learning is a general term that captures a wide variety of algorithms. These cartoon images capture some of the major families of machine learning algorithms that have been developed and widely used.
These algorithms differ by how they represent knowledge, the search and learning strategies they use, and the explicit or implicit assumptions that they make about the problems they are tasked to solve. It's also important to know that every algorithm can be implemented in different ways and in different coding languages, with effectively infinite ways to algorithmically approach machine learning as a methodology. Now let's discuss some of the challenges and goals involved in data analysis that motivate our use of machine learning.
From a logistical perspective, data can be very big and very difficult to manage, as well as computationally expensive to analyze. Data can also include missing values, an issue that some algorithms can't accommodate, requiring users to make educated guesses as to the original values using strategies such as imputation, as in the sketch below.
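Here's a minimal imputation sketch (our own illustration), replacing missing values with the column mean via scikit-learn:

```python
# Minimal sketch of mean imputation: educated guesses for missing values.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0,    2.0],
    [np.nan, 3.0],    # missing value in the first feature
    [7.0,    np.nan], # missing value in the second feature
])

imputer = SimpleImputer(strategy="mean")  # replace each NaN with its column mean
print(imputer.fit_transform(X))
```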
When dealing with classification outcomes, data can also be imbalanced, meaning that there can be many more instances of one class than another. This can impact modeling performance and how we evaluate our resulting trained models. Datasets can also include a diverse or mixed set of feature types, which likewise can impact algorithm performance in certain situations. With respect to detecting patterns or associations within data, we have the challenge of noisy signal, which prevents us from ever being able to perfectly predict a given outcome, and this in turn can contribute to overfitting by machine learning algorithms.
Also, some features can include very rare states or values, limiting the available power to detect the relevance of those features or values. We also often have the motivation to distinguish correlation from causation, requiring us to identify and understand the impact of covariates in the data. Another challenge that we'll dig a bit deeper into is that of complex associations.
In supervised learning, our goal is to connect potentially predictive features to one or more target outcomes in order to discover, predict, and understand these relationships. However, these connections and relationships can take many forms, including simple univariate or additive relationships; situations where a single feature can impact multiple outcomes, also known as multilabel problems, or pleiotropy in the context of genetics; multivariate interactions, known as epistasis in the context of genetics; and heterogeneous effects, where multiple independent factors can lead to the same or similar outcome, known as genetic heterogeneity in the context of genetics.
Or these relationships can be some novel, complicated combination of any of these. Moving on to quality control considerations, there is often a need to be able to interpret or explain how a model makes decisions, by understanding the factors that drive predictions and the nature of these underlying relationships.
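One common interpretation aid, shown here as a minimal sketch of our own (the video doesn't endorse a particular method), is permutation feature importance: shuffle one feature at a time and measure how much performance degrades.

```python
# Minimal sketch of permutation feature importance for model interpretation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)  # higher values = feature matters more to predictions
```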
We also need to ensure that proper evaluation metrics are being used to gauge performance, seek to detect potential biases in our analyses or models in order to ensure fairness, and ultimately ensure the reproducibility of our model or findings. Now let's quickly review the potential elements of a typical data science pipeline that utilizes machine learning.
Typically, we start by defining a problem or task. We collect data and then move on to data preparation. This is followed by modeling and a number of post-analysis steps, ultimately seeking to yield useful knowledge. Looking closer at data preparation, this typically includes an exploratory analysis and a variety of data wrangling elements, including feature extraction, data formatting, data cleaning, and data integration. Also, some form of data splitting is essential to the downstream evaluation of our models, as in the sketch below.
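As a minimal sketch of data splitting (our own illustration), here's a stratified train/test split that holds out data for downstream model evaluation:

```python
# Minimal sketch of data splitting: hold out a test set for later evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)

# Stratify on y so class proportions are preserved in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)  # (150, 20) and (50, 20)
```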
Feature processing can also help us prepare the data in a way that makes downstream modeling more successful; it can include feature transformation, feature learning, dimensionality reduction, feature engineering, and feature selection. Modeling typically involves algorithm selection, hyperparameter optimization, training a final model, evaluating that model, and often estimating the importance of the different features utilized by that model. Post-analysis can include generating summary statistics, conducting statistical significance analyses, generating visualizations, interpreting the results, and ideally replicating those results. If you want to learn more about these many elements of a machine learning data science pipeline, check out our separate video series on Machine Learning Essentials for Data Science.
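To show how several of these elements (feature transformation, feature selection, algorithm choice, and hyperparameter optimization) fit together in code, here's a minimal sketch of our own using a scikit-learn Pipeline with grid search; it's one of many reasonable ways to wire these steps together:

```python
# Minimal sketch: feature processing plus hyperparameter optimization
# combined in a single cross-validated pipeline.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                        # feature transformation
    ("select", SelectKBest()),                          # feature selection
    ("model", RandomForestClassifier(random_state=0)),  # modeling algorithm
])

param_grid = {
    "select__k": [5, 10, 20],
    "model__n_estimators": [50, 100],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```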
So what's automated machine learning? AutoML is a newer subfield of artificial intelligence and machine learning research focused on the development of tools or code libraries that automate elements of the machine learning pipeline, ideally improving both ease of use and performance. The primary aim of AutoML is to facilitate machine learning analyses, or put another way, to relax but not exclude the need for a user in the loop. It also focuses on making machine learning accessible to those with little or no coding or data science expertise. AutoML can also offer a more rigorous exploration of machine learning model optimization. Currently, the most commonly automated components of AutoML systems include hyperparameter optimization, model selection, feature extraction, and feature processing, such as feature selection, feature learning, and feature transformation.
One example of an automated machine learning tool is TPOT. This illustration shows a basic machine learning analysis pipeline. TPOT seeks to automate the process of identifying the optimal combination of machine learning analysis pipeline elements, specifically those in the pink region of this illustration. TPOT uses a strategy known as genetic programming to drive the search, exploring different pipeline configurations, algorithms, and hyperparameter settings in order to try and optimize the resulting machine learning model's performance. If you're interested in learning more or downloading TPOT, check out the GitHub link here.
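Based on TPOT's documented API, a typical run looks roughly like the sketch below; exact parameters and behavior may differ across TPOT versions, so treat this as an approximation and consult the repository's documentation.

```python
# Rough sketch of a TPOT run, per its documented interface.
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Genetic programming searches over pipeline structures, algorithms,
# and hyperparameters to maximize cross-validated performance.
tpot = TPOTClassifier(generations=5, population_size=20, random_state=0, verbosity=2)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # export the winning pipeline as Python code
```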
So what are the motivations behind using automated machine learning? First, significant experience is needed to correctly assemble a machine learning pipeline. There are many elements and options to consider, and there are many opportunities for new as well as seasoned researchers to make mistakes and inadvertently include biases. Ultimately, there's no single correct way to assemble a machine learning analysis pipeline.
There are effectively infinite possible machine learning pipelines, taking different approaches to questions like: how should the data be cleaned or transformed? What features should be engineered or selected? What algorithms or methods should we use? And what algorithm run parameters will yield optimal performance? Unfortunately, the answers to these questions are likely to be different for every new problem and dataset. In distinguishing different options for automated machine learning, it's first important to note that there are both enterprise and open-source options available.
Enterprise options are proprietary, with different pay-to-use business models, as well as the advertised expectation of more numerous features and direct customer support. However, a likely drawback of these options is that the proprietary wall prevents users from seeing many of the specifics of how analyses are conducted, or from validating this by viewing the source code. Alternatively, there are currently what seem to be a much larger variety of open-source AutoML options, developed with different goals, problems, and user bases in mind. These options vary broadly in their capabilities, transparency, and upkeep.
With respect to automated machine learning options, it's also useful to note the difference between tools and libraries. What we will refer to as tools were generally designed to be broadly accessible, requiring little to no coding experience to use. These AutoMLs typically seek to guide pipeline implementation or automate pipeline assembly.
Examples of tools include TPOT, Aliro, MLme, Auto-WEKA, H2O AutoML, MLJAR, and STREAMLINE. Libraries, on the other hand, are primarily designed as a code library to facilitate building a customized pipeline with potentially some automated elements, for example, hyperparameter optimization. Examples of libraries include PyCaret, FLAML, LAMA, Hyperopt, Xcessiv, MLBox, and TransmogrifAI.
Recently, we conducted an informal survey of 24 open-source AutoML tools and libraries. This table, arranged from most to least recent release date, provides some basic information about these 24 tools and libraries, including their respective GitHub links. In this survey, we sought to characterize the capabilities of these different tools and libraries, as illustrated in this second table.
These capabilities included applicable data types, target outcomes, ease of use, and which automated pipeline elements were included or addressed. Of note, this survey was admittedly subjective and based on the respective documentation and publications available for each AutoML. Also of note, there are many underlying caveats not captured by this table, including which specific methods were made available in each AutoML, the quality of the AutoML implementation, and the quality of the respective documentation. In this table, white cells serve to highlight limitations and uncertainty in our survey. Details regarding this survey are included in the supporting information of a manuscript we recently prepared and posted as a preprint; see the arXiv link at the bottom of this page.
Taking a closer look at that table, we can see the different data types the different AutoML tools are capable of accommodating, including structured or tabular data, images, longer strings of text, time series, or multimodal data, which is the ability to use multiple data types in modeling at once. We can see here that all 24 AutoML tools can accommodate structured data, but only three of the surveyed tools can currently accommodate all five data types, these being AutoKeras, FEDOT, and AutoGluon. Next, looking at applicable targets or tasks, these include binary classification, multi-class classification, regression, multitask problems, where a user tries to solve multiple related problems simultaneously, multilabel classification tasks, clustering, and anomaly detection. While none of these open-source AutoML tools currently appear to accommodate all of these target problems, PyCaret currently accommodates the most, and AutoKeras, FEDOT, Ludwig, and auto-sklearn accommodate at least four each.
Moving on to considerations involving ease of use, we examined a number of factors, including whether the pipeline could be implemented without code, run without code, whether the AutoML was available as a GUI, whether it automatically generated figures or other visualizations, automatically organized and output results, whether it generated a summary report, whether it included a recommender system, and whether it supported parallelization. With respect to ease of use, the Aliro AutoML stood out in particular, having checked all of these boxes, and STREAMLINE, H2O AutoML, MLme, and MLJAR also stood out.
Looking a bit more closely at ease of use with respect to user coding experience, let's talk more about the availability of a graphical user interface, codeless running, and codeless implementation. First, a graphical user interface, or GUI, has the ability to allow use of the AutoML by those with no coding experience. An exception to this is the Xcessiv AutoML, which does provide a GUI interface, but only as a vehicle for a user to more easily code the pipeline implementation.
Of note, even when a GUI is available, pipeline design decisions may still be in the hands of the user, which can be a good or bad thing depending on your experience level. In general, AutoML GUIs can offer potentially limited pipeline customization. AutoMLs that currently include a GUI option include Aliro, Xcessiv, H2O AutoML, MLJAR, and MLme.
Regarding codeless running, this implies that an AutoML can be used by someone with no coding experience whatsoever. Specifically, that no coding is required to run the pipeline or set up the run environment. Codeless running could be achieved by using a GUI, a web interface, or a pre-configured notebook.
Beyond the GUI-based options just mentioned, other AutoMLs that offer codeless running include AutoKeras and STREAMLINE. Regarding codeless implementation, this implies that no code is required to set up the pipeline configuration, including pipeline elements and their order or options. This can be achieved by allowing options to be selected by a user in a GUI, a notebook (for example, a Jupyter Notebook), or as command-line arguments. This could also be achieved by the AutoML automatically selecting or suggesting these elements or options using a recommender system or a meta-learner. Other examples of AutoMLs that offer codeless implementation include auto-sklearn and TPOT. And lastly, looking more closely at which pipeline elements were automated by a given AutoML, we considered a fairly diverse list of capabilities. In short, the following AutoML tools included the greatest range of elements:
STREAMLINE, PyCaret, AutoGluon, MLJAR, and TPOT. Another consideration for choosing an AutoML is the overall pipeline design. This involves which pipeline elements are included and in what order.
For example, exploratory data analysis, data processing, feature processing, modeling algorithms, evaluation metrics, figure generation, and statistical analysis. This also involves what options or parameters are used for each of these elements. For example, hyperparameter settings and other algorithm settings. In general, there seem to be three AutoML pipeline design approaches.
User-customized, where pipeline design is primarily left up to the user. Preconfigured, where the pipeline has been designed ahead of time by data science experts. And automated recommendation or search approaches, which seek to automate or assist in the discovery of the best pipeline design. Note that AutoML tools and libraries that fall into any one of these categories can vary dramatically with respect to the degree of automation and the breadth of available options or algorithms, as well as overall transparency. Regarding these three pipeline design approaches, user-customized approaches require the user to specify pipeline elements and their order, which is most typical of AutoML libraries.
The advantages of this design are that it's likely to be the most customizable, flexible, adaptable, and efficient based on user needs and the target problem. It's also less likely to be computationally expensive, in contrast with something like a meta-learner. However, it offers the least degree of automation, requiring potentially significant coding, machine learning, and data science expertise, and ultimately it's likely to require the most time to design and implement the pipeline. Here we have some examples of AutoML options that primarily take this user-customized approach. Differently, a pre-configured pipeline has been pre-designed by machine learning or data science experts, and typically offers a higher degree of automation and user run-parameter options for customization.
The advantages of this approach are that it requires minimal user design and implementation; it's also less likely to be computationally expensive than a meta-learner; and, especially for non-experts, it's likely to reduce the chance of making an error or adding bias to the pipeline or resulting models. Overall, this approach is more likely to be simpler and more transparent than a meta-learner in terms of pipeline design decisions. However, the disadvantages of a pre-configured approach include potentially limited customizability and optimization capabilities.
Examples of a pre-configured pipeline design include STREAMLINE, MLme, and MLJAR. And the third pipeline design approach focuses on automated recommendation or search of the pipeline. This has the advantages of minimal user design and implementation, increased opportunity to optimize pipeline performance, and a potential opportunity to benefit from previous learning experience on other problems or datasets.
However, such systems are likely to be much more computationally expensive, in particular meta-learners. Also, they can have limited transparency in terms of pipeline design decisions, and they can have a tendency to prioritize model performance over model interpretability. Examples of AutoMLs with automated recommendation systems include Aliro and Auto-PyTorch. And examples of AutoMLs with a meta-learner include GAMA, ML-Plan, TPOT, FEDOT, auto-sklearn, and RECIPE. Yet another consideration for choosing an AutoML is the output focus.
Generally speaking, the aims of a given AutoML can vary in terms of what they output for the user. One output focus is to identify a single best optimized model or pipeline. Here, the goal is to optimize model performance, where we only really care about identifying the best performer at the end of the run. This effectively yields simpler overall output for the user, and it would be well suited to machine learning performance competitions, or a user's general need to just quickly find a good performer.
Auto-sklearn is an example of an AutoML with this type of output focus. Alternatively, the output focus could be to identify a leaderboard of top models and pipelines. Again, the primary goal is to optimize model performance, but it's also important to be able to see alternative top performers. This type of output focus yields moderately simple output suited to performance competitions, as well as research where we want to consider other reasonable solutions but perhaps not directly compare algorithmic methodologies.
TPOT is an example of an AutoML with this output focus. A third focus would be to conduct a direct comparison of model performance across algorithms. Here the goal is not only to optimize model performance, but also to understand algorithm performance differences.
This yields a more extensive and rigorous output that would be well suited to research in general. Examples of this type of output focus include STREAMLINE, MLme, and PyCaret. The last consideration that we'll discuss is transparency of the documentation. AutoML tools and libraries vary in terms of documentation, which can take the form of code organization and annotation, readmes and user guides, videos and tutorials, preprints or whitepapers, as well as peer-reviewed publications. Such documentation varies in clarity, transparency, and level of detail across all the open-source AutoMLs we surveyed.
For example, it's not always clear from the documentation whether data science and machine learning best practices are being adopted correctly, what specific algorithms or options are available in the tool or library, what the precise order of pipeline elements are, and how one would prepare replication or future data to be applied to the model for future prediction. These considerations are important not only for ease of use, but also scientific rigor, reproducibility, and the ability of researchers to accurately report the applied methodology. Lastly, let's briefly discuss some potential limitations and risks of automated machine learning.
First off, many AutoML approaches can be extremely computationally expensive, in particular those that employ automated pipeline search and discovery. Also of note, not all elements of a machine learning analysis can yet be reliably or effectively automated. For example, many aspects of data cleaning, feature engineering, bias identification, and interpretation are still best addressed by an expert with domain knowledge. One potential risk is that any implementation mistakes, limitations, or biases can unknowingly impact all users of a given AutoML solution, and any AutoML tool is limited by the algorithms and heuristics that it employs. Ultimately, we still have a lot to learn and discover with regard to individual machine learning methods themselves.
One last risk is that AutoML approaches that focus only on finding top-performing models may miss the opportunity to learn from the perspectives of multiple machine learning algorithms, and may overemphasize predictive performance at the expense of model interpretability or simplicity. So in summary, automated machine learning is a relatively new subfield of AI and machine learning, with the potential to bring the power of machine learning to a much broader user base.
But there are many factors to consider in selecting an AutoML tool or library. And many AutoML tools or libraries are under active development, including fixes, updates, and expansions, so always make sure to get the most recent version or release of your AutoML solution before getting started. In general, I'd advise using an AutoML that is transparently presented, meaning that it's clear which elements it includes and in what order, and it's also clear what its respective strengths and limitations are. Thanks for listening to this brief introduction to automated machine learning.
Please feel free to reach out if you're interested in collaborating or if you have any questions.