Advances in data science and machine learning have not only brought breakthroughs like AlphaGo, but are also starting to have a broad impact on our everyday lives (Airbnb, for example, uses data to predict travelers’ destinations). Gone are the days when data science was accessible only to those in the ivory tower with million-dollar proprietary software. New trends have emerged:
- open source software and tools
- compute capacity at cloud scale, with dramatic cost reduction
- public data sets and community-based problem solving (Kaggle)
AWS provides both cost efficiency and scalable capacity, so it makes sense for data scientists to tap into the power of the public cloud. An AWS image has been developed here which:
- Automates the installation and configuration of a comprehensive set of open source data science tools
- Allows instance sizing based on needs
- Controls cost (shut down or terminate when done, launch in a few minutes)
What it is
An AWS AMI which provides a “data science server in a box” with a current open source toolkit (RStudio, Jupyter Notebook, Anaconda, XGBoost…). It builds automatically and is fully configured and ready for use in less than five minutes.
How to build a Kaggle Machine
Using the community AMI named “kaggle machine”, build your Kaggle Machine in AWS with one of the following methods. Note that the AMIs are currently available in us-east-1 and us-west-2. For other regions, you can build your machine in one of these two regions and then copy the AMI across regions.
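Copying an AMI to another region can also be scripted. The following is a minimal sketch using the boto3 `copy_image` call, assuming you have already created your own AMI from a running Kaggle Machine. The image ID, the copy name `kaggle-machine-copy`, and the target region are placeholders, not values from this post.

```python
def copy_image_request(source_image_id, source_region, name="kaggle-machine-copy"):
    """Build the keyword arguments for the EC2 CopyImage call."""
    return {
        "Name": name,                      # hypothetical name for the copied AMI
        "SourceImageId": source_image_id,  # placeholder: ID of an AMI you own
        "SourceRegion": source_region,     # e.g. "us-east-1" or "us-west-2"
    }


def copy_ami(source_image_id, source_region, target_region):
    """Copy an AMI into target_region and return the new image ID."""
    # Imported here so the request builder above works without boto3 installed.
    import boto3

    # CopyImage is sent to the destination region's endpoint.
    ec2 = boto3.client("ec2", region_name=target_region)
    response = ec2.copy_image(**copy_image_request(source_image_id, source_region))
    return response["ImageId"]
```

With valid AWS credentials, `copy_ami(image_id, "us-east-1", "eu-west-1")` would return the ID of the new image. The copy takes some time to become available in the target region.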
Build Kaggle Machine from AWS console
Launch an EC2 instance, search for “Kaggle-machine” in Community AMIs, and specify a key pair. After instance creation, add a Security Group which allows ports 8787 and 9999, plus port 22 for ssh.
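The Security Group step can be expressed in code as well. This is a rough sketch with boto3, not the author's procedure: the group name `kaggle-machine-sg`, the instance ID, the VPC ID, and the CIDR range are all assumptions you would replace with your own values.

```python
def ingress_rules(cidr):
    """Return IpPermissions entries opening ports 8787, 9999 and 22."""
    ports = {8787: "RStudio", 9999: "Jupyter", 22: "ssh"}
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr, "Description": label}],
        }
        for port, label in ports.items()
    ]


def attach_security_group(instance_id, vpc_id, cidr, region="us-east-1"):
    """Create the group, open the ports, and attach it to the instance."""
    import boto3  # lazy import: ingress_rules() is usable without boto3

    ec2 = boto3.client("ec2", region_name=region)
    group_id = ec2.create_security_group(
        GroupName="kaggle-machine-sg",  # hypothetical group name
        Description="RStudio, Jupyter and ssh access",
        VpcId=vpc_id,
    )["GroupId"]
    ec2.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=ingress_rules(cidr)
    )
    # Note: Groups replaces the instance's current set of security groups.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[group_id])
    return group_id
```

Passing a narrow CIDR such as your own address with a `/32` suffix, rather than `0.0.0.0/0`, limits who can reach the services while the default passwords are still in place.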
Build Kaggle Machine using CloudFormation Stack
A CloudFormation template can be used to build the Kaggle Machine and its Security Group automatically. Download the template from http://github.com/seanxwang/kagglemachine/ and use it to build your stack in us-east-1 or us-west-2.
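For those who prefer scripting to the console, stack creation might look like the sketch below. It assumes the template has been downloaded to a local file. The stack name `kaggle-machine` is hypothetical, and the parameter names the template expects were not checked here, so `parameters` is left as a generic mapping.

```python
def stack_request(stack_name, template_body, parameters=None):
    """Build the keyword arguments for the CloudFormation CreateStack call."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        # CloudFormation expects a list of ParameterKey/ParameterValue pairs.
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in sorted((parameters or {}).items())
        ],
    }


def create_stack(template_path, stack_name="kaggle-machine",
                 region="us-east-1", parameters=None):
    """Create the stack from a local template file and wait for completion."""
    import boto3  # lazy import: stack_request() is usable without boto3

    with open(template_path) as handle:
        template_body = handle.read()

    cfn = boto3.client("cloudformation", region_name=region)
    response = cfn.create_stack(
        **stack_request(stack_name, template_body, parameters)
    )
    # Block until CloudFormation reports the stack as created.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]
```

If the template declares parameters (a key pair name, for instance), pass them as a dictionary using the names the template actually defines.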
How to use it
After instance creation, note the public DNS name of the machine. From any client on the internet, access the services by pointing your browser to:
RStudio: http://<public DNS name>:8787 (default ruser/ruser)
Jupyter: http://<public DNS name>:9999 (default password jupyter)
Change the default passwords immediately. The EC2 instance runs Ubuntu, and you can ssh to it.
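As an illustration of how the service addresses are formed from the public DNS name, here is a small sketch. The ports come from the text above. The `ubuntu` login name is the usual default on Ubuntu AMIs and is only assumed to apply to this image, and the instance ID and key path are placeholders.

```python
def service_urls(public_dns):
    """Return the browser URLs for RStudio and Jupyter on the instance."""
    return {
        "rstudio": f"http://{public_dns}:8787",
        "jupyter": f"http://{public_dns}:9999",
    }


def ssh_command(public_dns, key_path, user="ubuntu"):
    """Return an ssh command line; the default user is an assumption."""
    return f"ssh -i {key_path} {user}@{public_dns}"


def public_dns_name(instance_id, region="us-east-1"):
    """Look up the public DNS name of a running instance."""
    import boto3  # lazy import: the helpers above are usable without boto3

    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    return reservations[0]["Instances"][0]["PublicDnsName"]
```

For example, `service_urls(public_dns_name(instance_id))` gives both addresses for an instance you have launched.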
Cost
The cost is based on AWS EC2 usage. You pay hourly only while the instance is running, so shut down the instance when you are done for the day. When you have finished your project and no longer need the data saved on the server, terminate the instance.
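To make the cost model concrete, the sketch below estimates a bill from running hours and shows how stopping or terminating could be scripted with boto3. The hourly rate is an input you would look up for your instance type. No actual AWS prices are assumed here, and the instance ID is a placeholder.

```python
def estimated_cost(hours_running, hourly_rate):
    """Estimate instance cost as running hours times the hourly rate."""
    return round(hours_running * hourly_rate, 2)


def release_instance(instance_id, region="us-east-1", terminate=False):
    """Stop the instance, or terminate it when the data is no longer needed."""
    import boto3  # lazy import: estimated_cost() is usable without boto3

    ec2 = boto3.client("ec2", region_name=region)
    if terminate:
        # Terminating removes the instance for good.
        ec2.terminate_instances(InstanceIds=[instance_id])
    else:
        # Stopping keeps the instance so it can be started again later.
        ec2.stop_instances(InstanceIds=[instance_id])
```

With a hypothetical rate of 0.25 per hour, `estimated_cost(10, 0.25)` returns `2.5` for ten hours of running time.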
The development of Kaggle Machine originated from the needs of data scientists participating in Kaggle challenges. Hopefully it will provide you with a useful toolset as well.