Cloud changes IT in many ways. A new class of platform, database, messaging and app services has emerged to enable the rapid delivery of cloud-native apps. IT architecture can no longer be satisfied with delivering compute, network and storage. It must expand “up the stack”, putting more capabilities more rapidly into the hands of developers and business users.
A primary example of the new IT capabilities in
demand is Big Data and machine learning. With elasticity and on-demand
computing, cloud has dramatically lowered the cost of entry. With emerging open
source tool sets (e.g., Distributed Machine Learning Common, Jupyter, Anaconda, Python), even individuals
can now perform analytics on large data sets at a fraction of
the cost of traditional methods (such as SAS Grid).
To gain insight and bridge the gap between
the IT and data science communities, I experimented with the Amazon Machine
Learning (AML) service and compared it with a custom-built open source tool
set. Both were entered in a Kaggle competition, so the results are also
benchmarked in the real world.
The Kaggle competition I used
has the goal of predicting a hazard score from a dataset of property
information. The hazard score to be predicted is a numeric value.
Machine Learning “as a service” test
AWS has delivered a service that puts modeling
and predictive analysis capabilities into the hands of people who are neither
IT specialists nor data scientists. Its documentation provides sufficient
information to build a model and perform analytics, and requires no prior
modeling skills.
The first step is creating a data source. From
the input data, AWS infers attribute types and creates a schema, which the
user can then modify.
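To make the idea of schema inference concrete, here is a minimal sketch in plain Python. It is not how AWS does it (the post does not describe AML's actual rules); it is a naive heuristic that guesses a type per column and then lets the user override the guess. The column names, the sample rows, the type labels and the `max_categories` cutoff are all assumptions made up for illustration.

```python
import csv
import io


def infer_attribute_type(values, max_categories=10):
    """Guess an attribute type from a column's raw string values (naive heuristic)."""
    non_empty = [v.strip() for v in values if v.strip() != ""]
    if not non_empty:
        return "TEXT"  # nothing to go on; fall back to free text
    distinct = set(non_empty)
    if distinct <= {"0", "1"}:
        return "BINARY"
    try:
        for v in non_empty:
            float(v)  # raises ValueError on the first non-numeric value
        return "NUMERIC"
    except ValueError:
        pass
    if len(distinct) <= max_categories:
        return "CATEGORICAL"
    return "TEXT"


def infer_schema(csv_text):
    """Return {column name: guessed type} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {name: infer_attribute_type([row[name] for row in rows])
            for name in reader.fieldnames}


# Hypothetical sample; these are not the competition's real columns.
sample = """id,hazard,building_age,roof_type,has_alarm
1,1,15,N,0
2,4,16,B,1
3,1,10,N,0
"""

inferred = infer_schema(sample)

# The "further modified by the user" step: an id looks numeric to the
# heuristic, but it is a label rather than a quantity, so override it.
schema = dict(inferred)
schema["id"] = "CATEGORICAL"
```

The override at the end is the important part: an automatic guess is only a starting point, and identifier-like columns are a typical case where a person needs to correct it.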
The second step is to build an ML model
using the data source. Amazon supports only three model types (binary
classification, multiclass classification and numerical regression). Since the
prediction result is a numeric value, the only applicable Amazon model is
regression. Note that data science offers many more models, attribute selection
techniques and sequencing variations than the three models offered by AWS,
which makes it as much an art as a science.
AWS’s built-in regression model evaluation
uses the distribution of residuals (actual target minus predicted target). In
this particular case, the model has a tendency toward negative residuals, which
indicates overestimation (the actual target tends to be smaller than the
predicted target).
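The sign convention is easy to get backwards, so here is a small sketch of the residual check, taking a residual to be the actual value minus the predicted value. The numbers are invented for illustration and are not from the competition data.

```python
def residuals(actual, predicted):
    """Residual = actual - predicted, one value per record."""
    return [a - p for a, p in zip(actual, predicted)]


def summarize(res):
    """Mean residual and the share of records with a negative residual."""
    n = len(res)
    negative = sum(1 for r in res if r < 0)
    return {"mean_residual": sum(res) / n, "share_negative": negative / n}


# Made-up values: most predictions sit above the actual target.
actual = [1, 4, 2, 7, 3]
predicted = [3.0, 4.5, 2.5, 6.0, 4.0]

res = residuals(actual, predicted)
summary = summarize(res)

# A negative mean residual, with most residuals below zero, is the
# overestimation pattern: predictions tend to exceed the actual values.
overestimates = summary["mean_residual"] < 0 and summary["share_negative"] > 0.5
```

A well-centred model would show residuals spread roughly evenly around zero instead.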
To further evaluate the model’s performance,
I used it to calculate hazard scores for the real data set in the Kaggle
competition. After the competition closed, the AWS ML model obtained a score of
0.343 (higher is better), ranking 1830th out of 2236 submissions. The winning
submission scored 0.397.
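To put that rank in perspective, a quick bit of arithmetic helps. The helper name below is my own; it simply converts a 1-based leaderboard rank into the share of submissions that finished lower.

```python
def share_beaten(rank, total):
    """Fraction of submissions that ranked below the given 1-based rank."""
    return (total - rank) / total


aml_share = share_beaten(1830, 2236)
print(f"AML model beat {aml_share:.1%} of submissions")  # prints 18.2%
```

In other words, roughly four out of five submissions scored better than the AML model.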
Machine Learning “on a server” test
For comparison, I built a custom server
on AWS infrastructure and deployed a set of data science tools and libraries
from the open source community at no additional cost. For the Kaggle
competition, I used an emerging gradient boosting library called XGBoost
(developed by Tianqi Chen, a PhD student at the University of Washington). The
resulting score is 0.392, which ranks 299th out of 2236.
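For readers unfamiliar with it, XGBoost is built on gradient boosting: many small decision trees are fitted one after another, each to the errors left by the trees before it. The toy below is not XGBoost and not the code I used in the competition. It is a pure-Python sketch of plain gradient boosting for squared error, using one-split trees ("stumps") on a single feature and leaving out everything that makes XGBoost fast and robust (regularization, multiple features, deeper trees). The data, the number of rounds and the learning rate are made up for illustration.

```python
def fit_stump(xs, targets):
    """Find the single threshold split that minimizes squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Fit stumps sequentially, each to the residuals of the current prediction."""
    base = sum(ys) / len(ys)  # start from the mean of the target
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        res = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, res)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up toy data: one feature, a step-like numeric target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 2, 2, 5, 5, 8, 8]

model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))  # always predict the mean
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(f"baseline MSE {baseline_mse:.3f}, boosted MSE {boosted_mse:.3f}")
```

This only measures error on the training data, so it shows the mechanism rather than real predictive quality; a real comparison needs held-out data, which is what the Kaggle leaderboard provides.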
On cost, running evaluations on
Amazon ML was quite expensive: I ran only a few, with a record size of 50,000,
and ended up spending over $50. The cost of the custom server is almost
negligible, as a mid-sized EC2 instance is quite adequate to run the XGBoost
Python code.
Amazon Machine Learning “as a service”
delivers a very easy-to-use tool. It frees users from building, scaling and
maintaining machine learning infrastructure. However, in its current form, it
is only suited to a narrow set of problems that match the simple models
provided. As large enterprises typically face more sophisticated data
analytics challenges, such as those represented in Kaggle competitions, AML is
of limited value to the data science community.
On the other hand, as data science is
being revolutionized by open source, there seem to be huge opportunities for
Amazon to improve AML.
I haven’t found much benchmarking work out
there. Here are a couple of posts comparing AML with others including Google
Prediction and Azure Machine Learning.