The Four Horsemen

18th August, 2015

Yesterday we added CloudWatch metrics support for the EC2 Container Service (hat tip to long-time partner in crime, Deepak, and his team), and I thought this would be a good time to talk about considerations for containers.

Containerized application components

There are four key components to building and deploying a long running, container-based application in production:

  1. One or more application components, encapsulated into containers.
  2. A description of the resource requirements for each application component (memory, CPU, IO, networking, storage).
  3. A pool of computational capacity.
  4. A method of placing the application components to most efficiently take advantage of the pool of resources while meeting the requirements of each component.

On AWS today, this equates to 1) applications encapsulated into Docker containers, 2) ECS task definitions, 3) instances provisioned through EC2, potentially using multiple instance types, and 4) the ECS Service Scheduler.

Consider a simple e-commerce application with three components: inventory (product details, pricing, stock, etc), ordering (processing, payment) and a web front end. From the customer’s perspective, they browse the web site (served by the web front end component), view product details (from the inventory component), and then buy products (through the ordering component).

Three task definitions capture the resource requirements for each component: the web front end needs a small amount of CPU and memory; inventory needs lots of memory; ordering needs lots of IO.
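
As a rough illustration, here is what registering one of these task definitions could look like through the AWS SDK for JavaScript. This is a hedged sketch: the family name, container image and resource figures are placeholders of my own, not the application’s actual values.

    // Sketch only: register a task definition for the web front end.
    // The family, image, CPU units and memory (MiB) are illustrative.
    var AWS = require('aws-sdk');
    var ecs = new AWS.ECS({region: 'us-east-1'});

    ecs.registerTaskDefinition({
      family: 'web-front-end',                   // hypothetical family name
      containerDefinitions: [{
        name: 'web',
        image: 'example/web-front-end:latest',   // placeholder image
        cpu: 256,                                // a small slice of CPU
        memory: 512,                             // and a small amount of memory
        essential: true,
        portMappings: [{containerPort: 80, hostPort: 80}]
      }]
    }, function (err, data) {
      if (err) { return console.error(err); }
      console.log('Registered:', data.taskDefinition.taskDefinitionArn);
    });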

We provision a collection of EC2 instances: five T2, three R3 and two I2 instances (ten in total). ECS treats these instances as a single cluster, onto which application containers can be placed.

Scheduling Resources

The question is, what is the most efficient way to place the application components across the cluster, based on the individual requirements of each component?

ECS uses the service scheduler to answer this, by running the containers in the most efficient way across the resources available on the cluster. In this example, it would place containers with the front end component on the T2 instances, containers with the inventory component on the R3 instances and containers with the ordering component on the I2 instances.

Since this is a long running application, the service scheduler manages the containers as they handle requests and process orders, with the following features:

  1. Distribute traffic across multiple containers (the five web front ends, in this example) via ELB, by automatically registering and de-registering containers with the load balancer. Customers can direct traffic to the load balancer and have that traffic routed to the containers.

  2. Auto-recover containers in an unhealthy state. If one of the inventory components fails, for example, the service scheduler will remove the unhealthy container from the cluster, and place a fresh container with that component back on the cluster.

  3. Add and remove capacity for each application component, by placing and removing containers from the cluster dynamically in response.

  4. Update application components in a running application without interruption. Let’s say we add a new product description field to the inventory component: the service scheduler would put this into production by replacing the running containers of that component, one by one, across the cluster.

  5. Easily adjust application component resource definitions for a running application. If the new inventory component needed additional CPU, we could adjust its task definition and the service scheduler would adhere to the new requirements without interrupting the application. The cluster specification can also be changed in real time: the instance families used in the cluster can be adjusted to reflect new application requirements (adding D2 instances, for example). The size of the cluster can also change dynamically (adding more T2 instances, for example), using EC2 Auto Scaling based on custom application metrics.

  6. This is where today’s announcement fits in: metrics from your application components and the cluster they are running on are now available in CloudWatch, and importantly, via the CloudWatch API, allowing you to programmatically respond to changes in both resource utilization and application-level metrics (page load time, query time, order load, etc); a hedged example of pulling these metrics follows below. Auto Scaling groups let you scale each cluster of EC2 instance types independently.
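
To make that concrete, here is a hedged sketch of pulling one of the new ECS metrics from the CloudWatch API with the AWS SDK for JavaScript. The cluster and service names are illustrative placeholders, not part of the example application above.

    // Sketch only: fetch average CPU utilization for an ECS service over the
    // last hour. Cluster and service names below are made-up placeholders.
    var AWS = require('aws-sdk');
    var cloudwatch = new AWS.CloudWatch({region: 'us-east-1'});

    cloudwatch.getMetricStatistics({
      Namespace: 'AWS/ECS',
      MetricName: 'CPUUtilization',
      Dimensions: [
        {Name: 'ClusterName', Value: 'ecommerce'},
        {Name: 'ServiceName', Value: 'inventory'}
      ],
      StartTime: new Date(Date.now() - 3600 * 1000),
      EndTime: new Date(),
      Period: 300,                    // five-minute data points
      Statistics: ['Average']
    }, function (err, data) {
      if (err) { return console.error(err); }
      // Feed these datapoints into your own scaling or alerting logic.
      console.log(data.Datapoints);
    });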

These are essential features for running a container-based application in production while maintaining the flexibility and speed to change application components, and preserving the availability of the application.

Getting Started

The EC2 Container Service docs have a great tutorial to help get you started: Scaling Container Instances With CloudWatch Alarms. Enjoy.

Spark

16th June, 2015

Earlier this week I posted about how the cloud can help remove the constraints of working with data productively. When it comes to big data tools or techniques, there are three variables that impact productivity, that is, the ability to get real work done efficiently:

  • Speed of provisioning: the faster the better. Tools which are quick to set up and access are beneficial in two ways: firstly, they are easy to evaluate to see if they are a good fit, since the investment required is minimal; and secondly, those same speed benefits pay off on every subsequent usage.

  • Resource fit: all the speed in the world doesn’t do you much good if you don’t have sufficient resources for the task in hand. Likewise, having resources available but being unable to access them or put them to work is frustrating (and wasteful). A range of resource sizes and shapes helps to create a perfect fit between your workload and the resourcing, so you don’t end up trying to fit a square peg into a round hole.

  • Iterative by default: Building apps, big data or otherwise, is an iterative process which benefits from low-cost experimentation. The ability to rapidly and easily build, test, refine, reject or evolve the logic and architecture is hugely valuable. The perfect fit of resources, quickly available, isn’t all that useful if it’s effectively frozen in time as your requirements change from day to day.

Pairing infrastructure services which aim to meet these requirements with software that is designed from the outset to support (and in many cases, accelerate) the iterative nature of building applications produces something greater than the sum of its parts in terms of actually getting real work done.

Enter Spark

Apache Spark is one such tool. If you’re unfamiliar, Spark uses a mixture of in-memory data storage (so-called resilient distributed datasets), graph-based execution and a programming model designed to be easy to use. The result is a highly productive environment for data engineers and scientists to crunch data at scale (in some cases, 10x to 100x faster than Hadoop map/reduce).

Today, at the Spark Summit in San Francisco, it was a pleasure to announce that we’re coupling the speed of provisioning and broad resource mix of Amazon EMR with the iterative-friendly programming model of Apache Spark. More on the AWS blog.
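
To give a flavor of what this looks like from the API side, here is a hedged sketch of provisioning a Spark cluster on EMR with the AWS SDK for JavaScript. The release label, instance types and counts are assumptions for illustration only; check the EMR documentation for the current Spark options.

    // Sketch only: spin up an EMR cluster with Spark installed.
    var AWS = require('aws-sdk');
    var emr = new AWS.EMR({region: 'us-east-1'});

    emr.runJobFlow({
      Name: 'spark-sketch',
      ReleaseLabel: 'emr-4.0.0',              // assumed release with Spark support
      Applications: [{Name: 'Spark'}],
      Instances: {
        MasterInstanceType: 'm3.xlarge',
        SlaveInstanceType: 'm3.xlarge',
        InstanceCount: 5,
        KeepJobFlowAliveWhenNoSteps: true     // keep the cluster up for iterative work
      },
      JobFlowRole: 'EMR_EC2_DefaultRole',
      ServiceRole: 'EMR_DefaultRole'
    }, function (err, data) {
      if (err) { return console.error(err); }
      console.log('Cluster starting:', data.JobFlowId);
    });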

Spark has already been put into production on EMR by folks such as Yelp, the Washington Post and Hearst, and I’m excited to see how better support in the console and EMR APIs helps bring Spark to a broader audience.

Revisiting “Big Data”

12th June, 2015

From television to David Bowie, many new or emerging phenomena go through a transition from fledgling to mature, from an insider’s tip to a household name. Technology adoption follows a similar trend, and a good indicator of that transition is a well-established and pervasive answer to the question ‘what is X?’.

We saw this quickly with cloud computing, as early adopters, analysts and vendors discussed and contributed to an evolving and changeable definition which sought to bring some shape and color to a related but disparate collection of technologies. For example, the one that seems to have stuck for the cloud, at least for now, focuses on infrastructure, platforms and software delivered as a service.

But, much like David Bowie’s metamorphosis from Ziggy Stardust to the Thin White Duke, technologies shift and change over time, and so it’s worth revisiting their syntax and definitions.

The Vs

Big Data, like cloud computing, is a collection of technologies and tools which were popularized, adopted and accelerated by a specific opportunity (some might even say, a requirement) for developers and businesses to ask questions of increasingly complex data. The adopted, pervasive definition of big data focused around The Vs. I’ve seen as many as five of these referenced, but the most common are velocity, variety and volume (funny how these definitions seem to come in threes).

These three characteristics, originally intended to define the qualities of the data itself, have become synonymous with the challenges organizations faced when attempting to ask key questions of data which was being freshly generated, or which already resided inside their organization. Velocity, volume and variety weren’t celebrated as characteristics which would help answer increasingly complex problems; instead, they were to be feared (a perspective some vendors deliberately propagated in an effort to sell their wares).

In an environment where data center walls can’t move, and where procurement and provisioning take months, you can understand why these factors start to loom large and contribute to defining a set of problems. Instead of asking the question you would like answered, you end up having to scope your question around what your available resources can support. This is boxed-in thinking: “what can I ask given that I only have 100 cores and 10TB?”, “how long can I leave this running before I need the answer?”, or, more insidiously on shared resources, “what scope of resources will get me off the queue and on to the cluster soonest?”. Yikes.

Constraints give way to creativity

I believe that cloud computing has come to be recognized as a key enabling foundation for ‘big data’ primarily because the velocity, volume and variety of data cease to be challenges (and in some cases, blockers) to working with that data. The resources required for data of any scale, at any volume and of virtually any complexity are immediately to hand, and as a result the constraints of boxed-in thinking evaporate; creative analysis, data exploration, reporting, data preparation, transformation and visualization all become quicker and easier.

What next for Big Data?

As a result, we have entered a more mature era of ‘big data’, unbounded by many of the original complexities of scale or throughput, where the focus is on working productively with data. Instead of focusing on the data itself, today we work backwards from the answers we’re looking for, fitting together a collection of tools, techniques and best practices to let us ask the right questions.

The ability to work productively with data is the defining characteristic of ‘big data’ today.

The ability to quickly develop, evaluate, adopt and scale new tools is an important part of productivity in big data. It’s tempting to think of data analytics as a linear timeline of generation, collection, storage, computation and collaboration, and at a high level, that’s correct. Take a step closer, though, and you’re more likely to find a collection of diverse and evolving branching workflows which ultimately expose data to allow specialists inside an organization to interact with it in as productive a way as possible. A business analyst who lives in Excel all day long, for example, has very different ways of working with data than a data scientist who is hacking on Python scripts. Within those workflows, there are three main categories of components:

  1. Sources of truth: canonical stores of data which act as a single source of truth for a specific set of information. Commonly stored as objects, in a database or as part of a data warehouse.

  2. Streaming data: fresh data arriving as a stream of events which are processed in some way as they arrive (stored in logs, aggregated, and so on).

  3. “Task” clusters: a cluster running a software stack which is tuned and optimized for a specific task. These can be ephemeral or long lived, and multiple clusters may be orchestrated to answer a specific question.

The cloud, of course, provides an environment which is well suited to each of these categories. Object storage is plentiful and cheap; databases are easily provisioned and have low management overheads, even at scale; services such as Kinesis collect and process streaming data; and a platform such as EC2 (or its cousins, EMR and the EC2 Container Service) provides a perfect way to automatically provision and scale clusters with the right mix of resources for a specific task.

The result is that working with data rarely requires us to fill a round hole with a square peg; instead we can create and scale clusters which mix ElasticSearch with Hadoop, Spark with Splunk.

Tomorrow and today

A focus on productivity is a great sign of maturity for a group of technologies and techniques which are still relatively young, and leaves plenty of scope to experiment with tomorrow’s approaches while attempting to make today’s even easier to work with. It’s an exciting time to work in this world, and to bring about the sort of impact analytics is capable of, free of the scary monsters and super creeps of the ‘Vs’.

How to Rock Re:Invent

9th June, 2015

One of my favorite sessions at SXSW was always the ‘How to Rawk SXSW’ panel. This usually took place the evening before the SXSW Interactive festivities kicked off and featured seasoned SXSW veterans discussing their tips, tricks and hacks to getting the most out of the conference (and Austin) during their time there. In fact, I liked it so much I thought I would try to take a similar approach for our very own AWS conference, Re:Invent.

I’ve been fortunate enough to attend every re:Invent conference so far, all of which have taken place at The Venetian in Las Vegas. It’s the same venue this year, so here are some tips to get the most out of 2015 (you may know others, drop me a line):

Pre-game

I’ve been attending technical conferences for over a decade and through that time my event pre-game prep has varied from ‘none at all’, to what I can only describe as ‘military ops’. I’ve landed somewhere in the middle, and while my role at AWS means that my pre-game for re:Invent is a little atypical (primarily since it starts mid-2014), for the event itself I follow some general pre-game steps:

  1. Get the agenda (bonus points if there is an app)
  2. Scour the agenda for the breakout sessions I absolutely, definitely do not want to miss
  3. Defer decisions on the times when I have some tough choices on which sessions to attend
  4. Choose a couple of breakout sessions out of left field which sound interesting
  5. Be sure to leave time to decompress through the day - it’s a marathon, not a sprint.

Overall, I’ve found that having a working knowledge of the agenda is helpful, and having a rough plan sketched out while staying open to learning new things that I might hear about at the event serves me well.

There are a lot of parallel sessions at re:Invent, so you’ll have to get comfortable with missing something in Las Vegas. That’s OK. You’ll make good decisions most of the time, and if you chance on a session which isn’t for you, you can catch up since the vast majority of the sessions are available shortly after the event. The cost of failure is relatively low, so be open to experimentation.

Time and Space

Las Vegas hotels are big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big they are, and that means you’ll end up doing a lot of walking around - around the conference, but also around the hotel just to get to and from the event. Plan your time accordingly - it can easily take 20 minutes to get from a hotel room to the conference floor (longer if, like me, you get perpetually lost or are easily distracted by shiny things - there are lots of shiny things in Las Vegas).

The conference sessions and expo floor are themselves relatively (relatively) self-contained, but it can take some time to jump between breakout sessions. There are 15-minute breaks baked into the breakout session agenda so you don’t miss the start of the next session as you’re traveling between them.

Gear

Just as importantly, plan your gear. In 2014, I was working with:

  1. Herschel ‘Little America’ backpack, into which goes…
  2. 2013 13” MacBook Pro (my primary workhorse)
  3. 2011 15” MacBook Pro (hot backup; the day before a service launch is not a good time to experience laptop fail)
  4. iPhone - I use this almost continuously, which means I also need…
  5. PowerAll 12,000 mAh Power Bank (this thing can jump start a car 20 times on a full charge, and kept me going through a busy week)

This year I’ll probably swap from a backpack back to a messenger bag since it’s easier to access the gear within and just as easy to carry.

The Corridor

As much as the technical sessions are valuable, the attendees make re:Invent what it is for me, so be sure to take advantage of the various networking sessions.

Re:Invent has become something of an ‘everybody conference’, which probably means that the person who you follow on Twitter, who writes your favorite blog or who contributed that code to Netflix OSS at just the right time to save you two weeks of development, is there too.

Find that person.

Surprise Sessions

The keynotes usually contain announcements of new services and features, the majority of which are accompanied by a dedicated breakout session for more of a deep dive. AWS keeps these sessions hidden to avoid spoiling the surprise of the keynote, so keep an eye on the agenda immediately after the keynotes to check in and learn more about any new services.

James

Go to James Hamilton’s session. Consider it mandatory and thank me later.

See you there

AWS Re:Invent 2015 is taking place at The Venetian in Las Vegas from the 6th to the 9th of October. Registration is now open. See you there.

Event-Driven Blogging with AWS Lambda

12th May, 2015

I wrote a short post (and an even shorter follow up), outlining the various tools I use to author and publish this blog. A few folks have been in touch asking for a bit more information on publishing with Lambda, so here I’ll dive into the details and hopefully start to provide some color on how to think about designing applications around AWS Lambda.

In the abstract, most application logic consists of three main components:

  1. Nouns: a collection of data models (people, places, things or ideas)
  2. Verbs: operators which act on those models, directly or indirectly (sing, dance, talk, save)
  3. Sentence structure: a way of controlling the interaction, ordering and results of nouns and verbs

I’ve found this a useful mental model when starting to whiteboard application architectures. Identifying the nouns (and adjectives), the verbs (and adverbs), and a way of composing them to solve a specific requirement is the essence of application architecture and nailing it can lead to better software design. It’s also notoriously difficult to get right.

Lambda provides an environment optimized for verbs, which are ultimately implemented as discrete, stateless, transient functions which operate against data. It is as close to a raw execution environment for ‘doing’ as I’ve come across, and as a result I’ve found this a useful mental model in helping design, architect and implement applications which run in that environment.

With Lambda handling the verbs of our application, we can focus on the operations themselves, the descriptions of the nouns and how they all hang together in a sentence.

Take a blogging engine, for example

If I were asked to boil down a typical blogging engine to a single sentence, it would be:

An application which can process text, images and other media for display, publish that content to the web, and index it chronologically.

Of course, most blogging engines do this in addition to many other things (aggregation, sharing, social interaction, etc), but at its core you want to take stuff, make it readable on the web, and put the most recent stuff at the top of a list of all the other stuff.

In a single sentence we have identified the nouns (the data we need to model - text, content, etc), the verbs (the functional operations we need to implement - process, publish and index), and the sequence of events which will orchestrate them. Not bad for something that would fit inside a tweet.

The Nouns: media and content

We’ve only really got two categories of things to think about in a blog: media (the raw text, images, etc), and post content (the text, images, etc, processed for display on the web). To get more specific, I chose to work primarily in raw text formatted with Markdown for the media, and for the final post content, you can’t beat HTML.

Both are stored in S3, with metadata (date, title, etc.) stored in DynamoDB.

The Verbs: process, publish and index

Our blogging engine has three verbs, which we’ll implement as Lambda functions, using Javascript and Node.js.

Process: convert raw Markdown into HTML

This function takes raw Markdown text, and converts it to HTML content. The excellent marked library does all the heavy lifting here, including using the lexer to extract the post title from the first top level heading. We stash HTML in S3, and the metadata in DynamoDB.
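
As a hedged sketch of that flow (the bucket, table and attribute names here are placeholders of my own, not the site’s actual configuration):

    // Sketch only: the 'process' verb. Convert Markdown to HTML with marked,
    // pull the title from the first top-level heading via the lexer, then
    // stash the HTML in S3 and the metadata in DynamoDB.
    var AWS = require('aws-sdk');
    var marked = require('marked');
    var s3 = new AWS.S3();
    var dynamodb = new AWS.DynamoDB();

    function processPost(key, markdown, callback) {
      var slug = key.replace(/\.md$/, '');          // e.g. 'my-post.md' -> 'my-post'
      var tokens = marked.lexer(markdown);
      var heading = tokens.filter(function (t) {
        return t.type === 'heading' && t.depth === 1;
      })[0];
      var title = heading ? heading.text : slug;
      var html = marked(markdown);                  // the heavy lifting: Markdown to HTML

      s3.putObject({Bucket: 'blog-content', Key: slug + '.html', Body: html}, function (err) {
        if (err) { return callback(err); }
        dynamodb.putItem({
          TableName: 'blog-posts',
          Item: {
            slug:  {S: slug},
            title: {S: title},
            date:  {S: new Date().toISOString()}
          }
        }, callback);
      });
    }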

Publish: publishes HTML to the web

This function combines the HTML content with the site template (the logo and menu on the left hand side), and publishes it to the web. The site templates are stored in a private bucket in S3, and Handlebars takes care of rendering before the final post is published to a static web site, also on S3.
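
Again as a rough sketch, with placeholder bucket and template names:

    // Sketch only: the 'publish' verb. Fetch the site template from a private
    // bucket, render the post into it with Handlebars, and publish the result
    // to the static website bucket.
    var AWS = require('aws-sdk');
    var Handlebars = require('handlebars');
    var s3 = new AWS.S3();

    function publishPost(slug, postHtml, callback) {
      s3.getObject({Bucket: 'blog-templates', Key: 'post.hbs'}, function (err, data) {
        if (err) { return callback(err); }
        var template = Handlebars.compile(data.Body.toString('utf8'));
        var page = template({content: postHtml});   // drop the post into the layout

        s3.putObject({
          Bucket: 'blog-site',                      // the public static website bucket
          Key: slug + '.html',
          Body: page,
          ContentType: 'text/html'
        }, callback);
      });
    }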

Index: build a new home page

Finally, we want to index the content in reverse chronological order on the home page of the blog. We pull and sort the post metadata from DynamoDB, build the new home page using the relevant HTML snippets in S3, and render the whole thing to the home page using Handlebars.
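
A hedged sketch of the index step, reusing the same placeholder names:

    // Sketch only: the 'index' verb. Pull post metadata from DynamoDB, sort it
    // reverse-chronologically, and render a new home page with Handlebars.
    var AWS = require('aws-sdk');
    var Handlebars = require('handlebars');
    var s3 = new AWS.S3();
    var dynamodb = new AWS.DynamoDB();

    function buildIndex(callback) {
      dynamodb.scan({TableName: 'blog-posts'}, function (err, data) {
        if (err) { return callback(err); }
        var posts = data.Items.map(function (item) {
          return {slug: item.slug.S, title: item.title.S, date: item.date.S};
        }).sort(function (a, b) {
          return a.date < b.date ? 1 : -1;          // newest first
        });

        s3.getObject({Bucket: 'blog-templates', Key: 'index.hbs'}, function (err, tpl) {
          if (err) { return callback(err); }
          var page = Handlebars.compile(tpl.Body.toString('utf8'))({posts: posts});
          s3.putObject({Bucket: 'blog-site', Key: 'index.html', Body: page,
                        ContentType: 'text/html'}, callback);
        });
      });
    }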

The Sentence

Lambda functions can be called directly via the API. We could construct a command line script or even another Lambda function to control the orchestration and application flow, but less code is usually better code (and no code is, therefore, the best), so in this case we’ll use S3 event notifications to orchestrate the application components.

The event notifications for new objects in each S3 bucket trigger the Lambda function for the next step of the application; we’re taking advantage of the automatic execution of Lambda functions in response to these events to drive our application process. In essence, the event notifications encode our sentence structure in this case.
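
To show the shape of that wiring, here is a hedged sketch of a handler that responds to an S3 event and hands the new object to the hypothetical processPost function from the earlier sketch:

    // Sketch only: a Lambda handler wired to an S3 bucket notification. It
    // pulls the bucket and key out of the incoming event, fetches the object,
    // and passes it to the relevant verb.
    var AWS = require('aws-sdk');
    var s3 = new AWS.S3();
    var processPost = require('./process');   // assumes the 'process' sketch above lives in process.js

    exports.handler = function (event, context) {
      var record = event.Records[0].s3;
      var bucket = record.bucket.name;
      var key = decodeURIComponent(record.object.key.replace(/\+/g, ' '));

      s3.getObject({Bucket: bucket, Key: key}, function (err, data) {
        if (err) { return context.fail(err); }
        processPost(key, data.Body.toString('utf8'), function (err) {
          if (err) { return context.fail(err); }
          context.succeed('processed ' + key);  // the next verb picks up via its own S3 event
        });
      });
    };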

Wrapping up

Using Lambda and these simple application architecture guidelines, this blog runs on a little under 300 lines of Javascript. Lambda, DynamoDB and S3 take care of virtually all the heavy lifting of processing, storing and serving both the blog and the blogging engine, but without having to manage or maintain servers, databases or storage. Hopefully that gives you a flavor of some of the design considerations of building applications with Lambda, and some of the implementation details of this humble site.

To The Stars: now with added RSS

5th May, 2015

After I published the colophon this morning, AWS uber-blogger Jeff Barr dropped me a line with a feature request: to add an RSS feed.

I was happy to oblige, and because this site runs on Lambda, it was a simple tweak.

Extensibility through Events

One way to put Lambda to work is via an event-driven architecture. Application components emit events which Lambda responds to asynchronously. Certain AWS services such as S3, Kinesis and DynamoDB emit events, but Lambda can also be driven by custom events from your own application code.

The little engine that publishes this blog already emits a new event when the index page has been updated. I use this to send myself a text when the new blog post has been published, so with a few clicks in the AWS console and a handful of Javascript, voila, an RSS feed is now available, served from S3 like the rest of the site.
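
As an illustrative sketch of the pattern (the function name and payload here are made up, not the site’s actual implementation), emitting such a custom event can be as simple as invoking another Lambda function asynchronously:

    // Sketch only: emit a custom 'index updated' event by invoking the new
    // feed-building function asynchronously (fire and forget).
    var AWS = require('aws-sdk');
    var lambda = new AWS.Lambda();

    function emitIndexUpdated(callback) {
      lambda.invoke({
        FunctionName: 'buildRssFeed',           // hypothetical RSS-building function
        InvocationType: 'Event',                // asynchronous invocation
        Payload: JSON.stringify({source: 'stellar', event: 'index-updated'})
      }, callback);
    }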

For Jeff, and anyone else who uses syndication, enjoy.

The Colophon

5th May, 2015

When I was thinking about setting up this new blog, I had some goals in mind for what I wanted to cover (the how and the why of cloud computing technology, without being duplicative of the sheer volume of great updates from Jeff on the AWS Blog), how I would author it (I’m most able to focus in plain text, a hang-up from my LaTeX-laden days as a post-doc), and how I wanted to publish and run it (something simple and reliable with very little overhead, but where I kept control of the ownership and experience).

With real, physical books the details of the publication are usually found on the turn of the title page in the colophon. This post is a short equivalent for this humble journal.

Writing

The posts here are authored in Markdown, using the excellent iA Writer. This is a great piece of software: reliable, lightweight, and with a really easy-to-grok workflow. Wherever I happen to be, I jot down ideas as a note in Writer on my iPhone, then pick it up on my MacBook or iPad to flesh it out into a full post (Writer syncs posts across devices via iCloud).

I’ve found this to be a good way to quickly figure out if there is something to write about, and the full screen mode keeps me out of my email inbox and focused on the post for much longer periods of time and at a higher quality than I had expected. Indeed, I’ve taken this approach for other long form writing too (working on narratives for the AWS keynotes or other presentation ideas).

Publishing

This site is served from Amazon S3 as a static website, and published using AWS Lambda with ‘stellar’, a tiny blogging service I put together. It’s similar in concept to something like Jekyll or Octopress, but allows me to automate the process by keeping the logic on AWS.

To publish a new post I just upload it to S3.

From there, the raw Markdown is converted to HTML and then rendered to the post and home pages. Metadata is stored in DynamoDB.

It’s an event-driven system: Lambda responds automatically to new object events from S3, and renders each new post to the home page in just a few seconds. There are no servers to manage.

The whole system weighs in at a little over 300 lines of Javascript, split across three Lambda functions and a few helpers.

Monitoring

I use GoSquared for web analytics, and CloudWatch Logs to keep an eye on the Lambda execution.

Controlled-Access Genomics Data in the Cloud

5th May, 2015

Last month the NIH issued a position statement on using the cloud for the storage and analysis of controlled-access data, which is subject to the NIH Genomic Data Sharing Policy. The statement is worth a read, as is the blog post by Vivien at the NIH.

“One small step for NIH, one giant leap forward for the community”

If we’re talking about controlled-access datasets at the NIH, we’re really talking about dbGaP, a collection of data which catalogs the interaction of genotypes and phenotypes. Since it contains patient phenotypic information, access to the dataset is restricted to scientific research which is consistent with the informed consent agreements provided by individual research participants. A look at the best practices for working with the data shows just how seriously the NIH and PIs take access control (quite rightly): no direct access to the internet, password requirements, principle of least privilege, data destruction policies. The list goes on.

With the updated guidance from the NIH, researchers can now meet these requirements and store and analyze this important dataset in the cloud. This is a big deal, since the cloud is effectively custom-made for working with large datasets like this, which often have complex, multi-stage analysis pipelines.

How To Store dbGaP on AWS

This announcement, and the discussions which will take place at Bio-IT World this week, seem like a good time to walk through how to load and secure a genomics dataset such as dbGaP on AWS.

The NIH announcement contains specific requirements on how to properly secure your environment in the cloud. Below, I’m going to follow the guidance in the excellent Architecting for Genomic Data Security and Compliance in AWS white paper (hat tip to Angel and Chris), and you’ll be up and running, ready to store genomic data on AWS in alignment with this guidance, in just a few minutes.

Roles and instances and Aspera, oh my!

In addition to an AWS account, you’ll need some basic command line chops to get up and running, but although this list of steps may seem a little intimidating, it’s relatively straightforward even if you’re new to AWS.

We’ll use the AWS platform to create an encrypted, controlled access environment with usage audit trails, and then download and store dbGaP into that environment.

Step 0: Securing your Root Account and Adding Usage Auditing

If this is your first AWS account, or you’re still using your root credentials, we need to do some initial setup to make sure your root AWS account is protected. This is a common part of securing your data in the cloud, so common that the guidance and checks are built right into the AWS Management Console.

Once you are logged into AWS, start securing the account by clicking on the Identity & Access Management button on the AWS console.

From here, you will see four security status checks that should be completed before trying to move any controlled data into AWS. Take the time to work with your internal security and compliance teams on this step: it’s best practice to create separate groups for developers and administrators, create security auditor accounts, and enable MFA on the administrator accounts. All IAM user permissions should be designed to align with the permissions granted in the Data Access Request (DAR).

Once all four of the security status checks are green, we can meet the requirement for usage of controlled-access data to be audited by activating CloudTrail for your account.

Simply choose CloudTrail from the AWS console, switch logging to “ON”, and tell CloudTrail which S3 bucket to store the audit logs in.

Step 1: Identity and Access

First, using the IAM service, let’s create an IAM Role that will be used to control access to our controlled-access dataset.

  • From the AWS console, choose ‘Identity & Access Management’.
  • From the left hand side, choose ‘Roles’.
  • Click the button Create New Role from the top of the page.
  • Enter a Role Name (such as r-dbsrv1).
  • Select role type of Amazon EC2 under ‘AWS Service Roles’
  • Under ‘Attach Policy’ simply click ‘Next Step’ without assigning any policies. We will manually assign permissions to our role later.
  • On the review page, make a note of the ‘Role ARN’. This is an Amazon Resource Name that uniquely identifies our resource. We will use this ARN to identify our role in the next step.
  • Click ‘Create Role’ to finish.
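
If you’d rather script this step, here is a hedged sketch using the AWS SDK for JavaScript; the trust policy simply lets EC2 instances assume the role. (The console also creates a matching instance profile for you behind the scenes; via the API you’d add one with createInstanceProfile and addRoleToInstanceProfile.)

    // Sketch only: create the same role programmatically.
    var AWS = require('aws-sdk');
    var iam = new AWS.IAM();

    iam.createRole({
      RoleName: 'r-dbsrv1',
      AssumeRolePolicyDocument: JSON.stringify({
        Version: '2012-10-17',
        Statement: [{
          Effect: 'Allow',
          Principal: {Service: 'ec2.amazonaws.com'},
          Action: 'sts:AssumeRole'
        }]
      })
    }, function (err, data) {
      if (err) { return console.error(err); }
      console.log('Role ARN:', data.Role.Arn);   // note this ARN for the bucket policy
    });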

Step 2: Setting up our storage

Next we will create a bucket in S3 to store our controlled access datasets. dbGaP will ultimately be stored in S3 (but we’ll load it from the NIH into S3 via EC2 in the next step).

  • From the AWS Console, click on S3.
  • Click ‘Create Bucket’.
  • Choose a bucket name that will hold the controlled access data.
  • Once the bucket is created, click on the bucket name to be taken into the empty bucket and then choose Properties. From the properties menu, choose ‘Add bucket policy’ from the Permissions section.
  • Add a bucket policy which gives the role we created in the first step the rights to add encrypted objects to our bucket (an illustrative policy sketch follows this list). Be sure to change the bucket name and role ARN to your own.
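
Here is an illustrative policy of that shape, applied with the AWS SDK for JavaScript rather than the console. It is a sketch, not the original document from the white paper: substitute your own bucket name, account ID and role ARN, and review it with your security team.

    // Sketch only: allow the role to put objects, and deny any upload that
    // doesn't request AES-256 server-side encryption.
    var AWS = require('aws-sdk');
    var s3 = new AWS.S3();

    var policy = {
      Version: '2012-10-17',
      Statement: [
        {
          Sid: 'AllowRoleToPutObjects',
          Effect: 'Allow',
          Principal: {AWS: 'arn:aws:iam::123456789012:role/r-dbsrv1'},   // your Role ARN
          Action: 's3:PutObject',
          Resource: 'arn:aws:s3:::db-gap/*'                              // your bucket
        },
        {
          Sid: 'DenyUnencryptedUploads',
          Effect: 'Deny',
          Principal: '*',
          Action: 's3:PutObject',
          Resource: 'arn:aws:s3:::db-gap/*',
          Condition: {StringNotEquals: {'s3:x-amz-server-side-encryption': 'AES256'}}
        }
      ]
    };

    s3.putBucketPolicy({Bucket: 'db-gap', Policy: JSON.stringify(policy)}, function (err) {
      if (err) { console.error(err); }
    });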

Step 3: Moving dbGaP from the NIH to AWS

Now that we have a home for dbGaP in S3, we’ll load it from the NIH into AWS using EC2, using Aspera Connect to speed up the data transfer.

  • From the EC2 console, choose ‘Launch’, select an m3.medium instance type, and follow the launch wizard until…
  • On the ‘configure instance details’ step, select the IAM Role Name we’ve been using. This will enable our EC2 instance to securely invoke the permissions given to the role without requiring us to exchange keys or passwords.
  • This default configuration will create our instance within a default network configuration that will expose our instance to an Internet Gateway. Since we will only be using this instance for a one-time data transfer, our exposure is limited but best security practice is to logically isolate instances from an Internet Gateway and send traffic through a Network Address Translation (NAT) server.
  • On the “Add Storage” step make sure that you have a root volume for booting up your instance as well as an “EBS” volume marked for Encryption which can be used to store controlled datasets. Be sure to choose a GiB size on the EBS volume which is large enough to store your dbGaP dataset.
  • “Security Groups” act as firewall rules to control incoming traffic to your instance. It is possible to bootstrap all the commands needed to install Aspera Connect and use the command line ascp executable to automatically pull from the NIH GDS Data Repository, and avoid granting any rights to the security group at all. However, for this walkthrough, we will leave either SSH (if using Linux) or RDP (if using Windows) in our security group, but limit the source to our IP address or IP address range. We will also want to name our security group: sg-appsrv1.
  • You are now ready to review all instance settings, launch the instance and generate a secure key used to connect to the instance.
  • Connect to your instance, and follow the instructions in the dbGaP FAQ Archive for installing the Aspera software and downloading the controlled data. When downloading the dataset, be sure to store the data on the encrypted, persistent EBS volume and not the root volume.

Step 4: Copy dbGaP to S3

Now that you have downloaded the controlled dataset onto your encrypted EBS volume, you will want to move that data over to your S3 bucket to make it accessible from any application or server to which you give permission.

  • You will need the AWS command line tools to easily move the controlled data to S3. If you chose the Amazon Linux operating system, these tools are pre-installed. Otherwise, you will need to obtain and install them by following the instructions specific to the operating system you chose.
  • Run the command: aws s3 cp downloaded_data/ s3://db-gap --recursive --sse
  • This command will upload all the files in your downloaded_data directory to your db-gap S3 bucket using AES-256 server-side encryption. Be sure to change the bucket and directory names to the ones you used.

Step 5: Tidy up

After you have verified that the controlled-access dbGaP data has been successfully moved into S3, the final step is to stop the running Aspera Connect instance on EC2. Since the server is only needed when we pull data from the NCBI repository into AWS, we can now stop it and avoid paying for it until we need it again. From the EC2 console, right click on the Aspera Connect instance and choose ‘stop’.

Job done

You should now have your controlled dbGaP data moved into and stored in S3, ready for analysis (another story, for another post, another day).

With thanks to Chris and Angel for their help with this post.

A little more about Amazon Machine Learning

5th May, 2015

Data has become part of the fabric of applications: in the past 18-24 months, we’ve seen developers adopt data and analytics as one of their frequently used tools, alongside more traditional endeavors such as working on front ends, mobile and backend operations. Developers are using data to get a better look into their applications, their operations and their customers, not just retrospectively, not just in real time, but to make predictions about the future.

It’s to that end that we introduced Amazon Machine Learning at the AWS Summit last week: a fully managed predictive analytics service, geared for developers. I was lucky enough to be part of the announcement at the event, so I thought I would share a few more details on the service here, on my freshly minted blog.

Machine Learning and Amazon

Machine learning is an approach to predictive modeling: the ability to automatically find patterns in existing data, and use them to make confident predictions on new data. For example… based on what you know about your customer, you can ask the question: “will they use our product?”; based on what you know about an e-commerce order, you can ask “is this order fraudulent?”, or based on what you know about a news article, you can ask “what other articles are interesting?”. Machine learning lets you use existing data to build predictive models to answer all these questions, confidently.

At Amazon, we’ve got a deep history in using machine learning: for example, even on very early gateway pages from amazon.com, you could see evidence that we were using machine learning to automatically make recommendations for customers. And since then, we’ve used it across many areas of the business, including on product detail pages, with the famous “customers who bought this also bought” feature. We also use machine learning to improve search, to understand and process speech and natural language (for a recent example, look no further than Amazon Echo), or in improving our fulfillment operations with new vision systems that enabled the unloading and receipt of an entire trailer of inventory in as little as 30 minutes (instead of hours).

And from these examples, and in talking to our customers, we’ve seen that the key challenge of machine learning is that, while there is some overlap between the expertise required and that of a typical SDE, the overlap is small enough that developers who work with their data day in, day out face a huge amount of friction in applying machine learning to that data. Instead of building and reusing software components, deep expertise is required in statistics, model building, validation of those models, algorithm selection and optimization, and data transformation.

The friction gets even larger when you need to put these predictive machine learning models into production, so that they are optimized and high performance, and larger still when you need to do this at scale: the scale of the volume of data needed to build and validate predictive models, and also the scale of putting those models to work in high-traffic mobile apps or web sites.

What we’ve heard from customers is that they see a large opportunity in using existing and growing datasets with machine learning, but that to take advantage of this, there has to be a very low barrier to entry.

Amazon and Machine Learning

This was the driving force behind Amazon Machine Learning, a fully managed machine learning service, geared and tooled for developers - available today in your AWS console. The service exists to allow any developer to apply predictive modeling to their data in minutes, not months, and to provide easy to use tools which help get the best from those models. We saw wonderful things when we put machine learning in the hands of the development teams at Amazon, and I’m looking forward to seeing the same level of innovation across our customers at AWS.
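
To make “minutes, not months” concrete, here is a hedged sketch of what a real-time prediction call looks like once a model has been built and an endpoint enabled. The model ID, endpoint and record fields are placeholders, not a real model:

    // Sketch only: ask a deployed model for a real-time prediction.
    var AWS = require('aws-sdk');
    var machinelearning = new AWS.MachineLearning({region: 'us-east-1'});

    machinelearning.predict({
      MLModelId: 'ml-EXAMPLEMODELID',
      PredictEndpoint: 'https://realtime.machinelearning.us-east-1.amazonaws.com',
      Record: {                                 // the features you know about this customer
        customerAge: '34',
        lastOrderValue: '27.50',
        visitsThisMonth: '6'
      }
    }, function (err, data) {
      if (err) { return console.error(err); }
      console.log(data.Prediction);             // e.g. predictedLabel and predictedScores
    });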