AWS re:Invent 2016 Conference Notes

From Resiliency to Ubiquity - Netflix everywhere global architecture


In this session, they covered architectural patterns that enable seamless, multi-region traffic management; reliable, fast data propagation; and efficient service infrastructure.

Session Info

  • Presenter - Coburn Watson - Director, Performance and Reliability, Netflix, Inc
  • Youtube Link - Link

Key Takeaways

  • Netflix accounts for 35% of peak internet traffic in the US.
  • Netflix runs its own CDN, Open Connect.
  • Believe in failure-driven architecture: never fail the same way twice.
  • Device Ubiquity
    • Windows
    • Roku
    • Mac
    • Xbox 360
    • Apple iPad, iPhone
    • Android and others
  • Geographic Ubiquity
  • Language Ubiquity
  • Started their journey by building their own data center
    • No automation, virtualization, or standardization
    • Manual, error-prone and slow
    • Monoliths
  • Then focused on building their main product and adopted AWS as their cloud platform.
  • Architecture pillars
    • Microservices
    • Database
    • Cache
    • Traffic
  • Microservices
    • Edge
      • ELB -> Zuul -> API
    • Middle Tier & Platform
      • Hystrix
      • Chaos Monkey
      • FIT - Fault injection test framework
  • Database
    • Initially SimpleDB
      • Not web scalable
    • Cassandra
      • Scalable, durable, global
      • Multi-region
      • Multi-directional
      • How they deployed Cassandra
        • Single region, Multiple AZ’s
        • Client writes to any node.
        • Coordinator replicates to nodes.
        • Nodes ack to coordinator
        • Coordinator acks to client
      • Not quite fast enough
  • Caching
    • In the past
      • EVCache - Ephemeral Volatile Cache
        • Clustered memcached optimized for AWS
        • The EVCache server registers with Eureka
        • The EVCache client asks Eureka for servers and connects to them
      • How reads work
        • The client always reads from its local zone, which avoids cross-region reads
      • Fronting microservices
        • Put the cache in front of the microservice rather than behind it
        • The client checks the cache first and calls the service only on a cache miss
        • Improves performance by 100ms or so
    • Use Kafka and EVCache together
  • Traffic
    • Uses DNS geo mapping
    • Multi region writes
      • Writes data in local zone
      • Replicates to local zone and cross region
      • Takes about 500ms
      • Bidirectional nightly compare and repair
    • EVCache cross-region replication
      • 1st pass:
        • EVCache write -> SQS -> EVCache replication -> cross region
    • Active-active traffic management
      • api-global.netflix.com ->
        • ELB us-west-1
        • ELB us-west-2
    • How they achieved failover
      • Shim
        • Active-Active failover
      • Uses Vizceral
  • Content Ubiquity
    • Content available everywhere - all episodes, devices, countries
  • Ubiquitous, Resilient architecture
    • Multi stage and Multi Region Failover
      • Purchase capacity in advance
      • Does multistage failovers
        • If us-east-1 goes down, 80% of traffic goes to us-west-2 and 20% to eu-west-1
      • Takes about 40 minutes to bring up all the services
    • If error rate starts going up
      • Proxy us-east-1 traffic to eu-west-1 and us-west-2 via Zuul
      • DNS flip to default configuration, active proxying
      • Reduce proxying to 0%.
  • What’s next
    • Global latency routing
    • ML-based monitoring - Atlas handles 3 billion metrics per second.
    • Fast failover
      • 40 minutes to 5 minutes
      • Allocate more capacity
    • Improved capacity utilization
    • Integrate DB & Caching
    • Automated chaos Experiments
  • Takeaways
    • Never fail the same way twice
    • Know your resiliency patterns
      • DC
      • Cloud
      • Islands
      • Isthmus
      • Active-active
      • Global
    • Invest in architectural pillars
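
The "fronting microservices" idea above (check the cache first, call the service only on a miss) can be sketched roughly as follows. This is a minimal illustration, not Netflix's implementation: the class name, the in-memory dict standing in for an EVCache cluster, and the `fetch_from_service` callback are all hypothetical.

```python
from typing import Callable, Dict, Optional


class CacheFrontedClient:
    """Checks a cache before calling the backing service (all names hypothetical)."""

    def __init__(self, fetch_from_service: Callable[[str], str]) -> None:
        self._fetch = fetch_from_service
        # A plain dict stands in for a real cache cluster such as EVCache.
        self._cache: Dict[str, str] = {}
        self.service_calls = 0

    def get(self, key: str) -> str:
        cached: Optional[str] = self._cache.get(key)
        if cached is not None:
            # Cache hit: the microservice is never contacted.
            return cached
        # Cache miss: call the service once and remember the answer.
        self.service_calls += 1
        value = self._fetch(key)
        self._cache[key] = value
        return value


# Usage: the second read of the same key is served from the cache.
client = CacheFrontedClient(lambda key: key.upper())
first = client.get("title-42")
second = client.get("title-42")
```

A real deployment would also need expiry and invalidation, which this sketch leaves out.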

Earth on AWS - Next generation open data platforms


In this session, they explained that AWS has made satellite imagery data available on S3. They also covered how AWS customer DigitalGlobe uses open data stored in S3 to distribute high-resolution satellite imagery to its customers around the world.

Session Info

  • Presenter - Jed Sundwall, Open Data Global Lead
  • Slideshare link - Link

Key Takeaways

  • Traditional Data acquisition
  • Data Acquisition in the cloud
  • Open Data
    • Landsat on AWS - landsatonaws.com
      • No Javascript
    • Architecture: USGS->.tar->EC2->.tiff->S3
    • Open source tools and apps
      • GDAL
      • Rasterio
      • Sat-utils suite
    • Observed Earth iPhone app
    • AWS Public Datasets
    • aws.amazon.com/earth
      • Climate models
      • Aerial Imagery
    • Research Credits - aws.amazon.com/earth/research-credits
  • Wine and grape supply data lake
    • E&J Gallo Winery
    • Data driven insights
      • Distribution of soil across vineyard
      • Improve quality
      • Increase Yield
      • Predictive yield estimation
    • Architecture TODO:
  • DigitalGlobe
    • Satellites - 60 TB of data collected per day
    • 350 square km every day
    • Captures high-resolution images
    • Earth Image Library
    • They want to do Large scale image extraction
    • Advancements
      • Elastic computing
      • Deep Learning advancements
    • Companies
      • GBDX
      • Crowd Sourcing
      • SpaceNet
    • Common Frameworks to accelerate prototyping
      • TensorFlow
      • Torch
      • Caffe
      • Dataset
        • ImageNet
        • SpaceNet
    • developer.digitalglobe.com
  • Book - Data driven
  • Exciting things
    • Satellite Imagery Analytics
    • Computer vision

Sony PlayStation


In this session, they explained how the microservices that power PlayStation achieve low-latency interactions while conserving precious network bandwidth.

Session Info

  • Presenter -
    • David Green - Enterprise Solutions Architect, Amazon Web Services
    • Dustin Pham - Principal Engineer, Sony Interactive Entertainment
    • Alexander Filipchik - Principal Software Engineer, Sony Interactive Entertainment
  • Slideshare link - link

Key Takeaways

  • Soft state
  • Sony Use Case
    • Friend Finder
    • Social Graph
      • Hundreds of millions of users
      • Rich Networking features
  • Architecture
    • Solr
    • Elasticsearch
  • Data
    • High cardinality
    • Low cardinality
  • Use indexing to find relations between users
  • They used Cassandra earlier but removed it because of high storage and server issues
  • They eventually moved to an Ehcache off-heap cache.

DNS Demystified


In this session, they gave a high-level DNS overview and then covered the experience of Warner Bros., which moved to Route 53.

Session Info

  • Presenters
    • Sean Meckley - Sr. Product Manager, Amazon Web Services
    • Vahram Sukyas - VP, Application Infrastructure & Operations, Warner Bros.
  • Youtube link - link

Key Takeaways

  • Route 53
    • Worldwide anycast network with redundant locations
    • Advanced routing - latency-based (LBR), geo, weighted round robin (WRR), failover
    • AWS integrations through alias records
    • Manage via APIs, SDKs, etc.
  • You cannot create a CNAME for the root record (zone apex). However, you can create an alias record there.
  • We can create alias records for S3, CloudFront and ELB resources.
  • Delegation set - Set of 4 name servers. Delegation set is unique to each customer.
  • Route 53 functionality
    • We can create private DNS hosted zone.
    • We can also do health checks and failover.
    • An advanced multi-region architecture can be achieved with very little effort.
    • Using Traffic flow, we can build the complex routing logic based on multiple factors.
  • You can customize and have branded delegation sets
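
The alias-at-the-root point above can be sketched as the change batch that Route 53 expects for an alias record. This is an illustration under assumptions: the domain, load balancer DNS name and hosted zone ID below are placeholders, and the helper name is hypothetical. The dict follows the `ChangeBatch` shape accepted by boto3's `change_resource_record_sets`; the code only builds the dict and makes no AWS call.

```python
from typing import Any, Dict


def alias_upsert_batch(record_name: str, target_dns_name: str,
                       target_hosted_zone_id: str) -> Dict[str, Any]:
    """Build a ChangeBatch that upserts an alias A record (hypothetical helper)."""
    return {
        "Comment": "Point the zone apex at a load balancer via an alias record",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                # Alias records carry an AliasTarget instead of a TTL and values.
                "AliasTarget": {
                    "HostedZoneId": target_hosted_zone_id,
                    "DNSName": target_dns_name,
                    "EvaluateTargetHealth": False,
                },
            },
        }],
    }


# Placeholder values, not real resources.
batch = alias_upsert_batch(
    "example.com.",
    "my-elb-123.us-east-1.elb.amazonaws.com.",
    "ZHYPOTHETICAL123",
)
# With boto3 installed and credentials configured, this could be sent with:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```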

How Warner Brothers moved to Route 53

  • They had over 25K domain names
  • Primary drivers for moving to AWS
    • Application Isolation
    • Security
    • Agility
    • Billing Clarity
  • DNS set up before Route 53
    • On premise solution
      • BIND 9 (tinydns, Microsoft DNS)
      • No Self Service
      • Poor fault tolerance
      • Poor geographic distribution
    • 25k + domains
    • Some zones have 10 K records
  • Problems to solve
    • Domain registration process
    • Devise a scheme for reusable (and WB-branded) delegation sets
    • Import thousands of zones
    • Raise AWS limits
  • Wrote a tool to validate entire zones in Route 53 against BIND.
  • Wrote a tool to easily set up new domains
  • Lower TTLs
  • Handle Migration - cli53 (with some custom patches)
  • Benefits
    • Increased performance
    • Easier to manage. Supports self service by different teams.
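
The zone-validation tool mentioned above was not shown in the session, so the following is only a guess at the core idea: load the records from both sides, normalize them, and report the differences. The record tuple layout, function names and sample data are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# (name, type, value): an assumed, simplified record representation.
Record = Tuple[str, str, str]


def normalize(record: Record) -> Record:
    """Lower-case the name, drop its trailing dot, and upper-case the type."""
    name, rtype, value = record
    return (name.lower().rstrip("."), rtype.upper(), value.strip())


def diff_zones(bind_records: Iterable[Record],
               route53_records: Iterable[Record]) -> Dict[str, Set[Record]]:
    """Return the records present on one side but not the other."""
    bind = {normalize(r) for r in bind_records}
    r53 = {normalize(r) for r in route53_records}
    return {
        "missing_in_route53": bind - r53,
        "unexpected_in_route53": r53 - bind,
    }


# Sample data using documentation-reserved names and addresses.
bind_zone = [
    ("WWW.example.com.", "a", "192.0.2.10"),
    ("example.com.", "MX", "10 mail.example.com."),
]
route53_zone = [("www.example.com", "A", "192.0.2.10")]
report = diff_zones(bind_zone, route53_zone)
```

Here `report["missing_in_route53"]` holds the MX record that has not been imported yet. A real tool would also have to compare TTLs and handle multi-value record sets.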

Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark


In this session, they gave a high-level overview of the Apache Spark framework and showed how to develop classification and recommendation engines with Amazon EMR and Apache Spark.

Session Info

  • Presenters
    • Jonathan Fritz - Sr. Product Manager, Amazon Web Services
    • Jasjeet Thind - Sr. Director, Data Science & Engineering, Zillow Group
  • Youtube link - link

Key Takeaways

  • Spark
    • Spark ML addresses the full ML pipeline.
    • Spark ML supports most common ML algorithms.
  • Storage layers - can pull data from different storage layers
    • S3
    • Kinesis
    • Redshift
    • DynamoDB
    • ElastiCache
  • Productionizing your application
    • Submit a Spark application via the EMR API
    • Use AWS Lambda to submit applications to the EMR Steps API or directly to Spark on your cluster
    • Create a pipeline to submit the job on demand
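
As a rough sketch of submitting a Spark application through the EMR Steps API, the step definition below uses `command-runner.jar` to invoke `spark-submit`. The helper name, step name, S3 path and arguments are hypothetical placeholders; the dict follows the shape boto3's `add_job_flow_steps` accepts, and the code itself makes no AWS call.

```python
from typing import Any, Dict, List


def spark_step(name: str, app_s3_path: str, app_args: List[str]) -> Dict[str, Any]:
    """Build an EMR step that runs spark-submit on the cluster (hypothetical helper)."""
    return {
        "Name": name,
        # Keep the cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     app_s3_path, *app_args],
        },
    }


# Placeholder application path and arguments.
step = spark_step(
    "recommendations-nightly",
    "s3://my-bucket/jobs/recommend.py",
    ["--date", "2016-11-29"],
)
# With boto3 installed and credentials configured, for example from a Lambda handler:
# boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster id>", Steps=[step])
```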

Zillow Group use case

  • Group of Trulia, Zillow, HotPads, StreetEasy, Naked Apartments, Retsly, dotloop, Mortech
  • Recommendation use cases
    • Email - homes for sale / for rent
    • Home details - homes for sale / homes like this
    • Personalized search
    • Home owner / pre-seller predictions
    • Lender selection algorithm
    • Similar photos / videos

State of the Union: Containers


In this session, they discussed the evolution of containers on AWS and AWS vision going forward.

Session Info

  • Presenter - Deepak Singh - General Manager, Container Services, Amazon Web Services
  • youtube link - link

Key Takeaways

  • Why ECS
    • Separation of concerns
    • Reduce deployment time
    • Rollbacks are also faster
    • Reduce waste with better instance packing
    • Stability on spot with instance diversity
    • Spot instances used by ECS cluster
    • Resulted in 25% cost savings
  • Common use cases for containers
    • Microservices
    • Batch Processing
    • PaaS
    • CI/CD
  • Features added in last year
    • Added task and cluster auto scaling
    • Already has CloudTrail integration
    • Also has a Splunk plugin
    • Can run X-Ray on top of ECS
  • Customers using ECS
    • Mapbox
    • Airtime
    • Mytaxi
    • Okta
    • Lyft
    • Meetup
    • Expedia
      • Expedia running 200 apps in ECS clusters.
      • Custom ECS AMI with IT configurations
    • Slack
  • MapBox Use case
    • A quarter of a billion users each month. Example: calculating the prevailing speed on the road map
    • They collect 3 billion probes every day
    • 21 services, 2000 tasks, 1.3 billion requests per day.
  • How Mapbox divided their monolithic services
    • Functional Decomposition
      • Teams
      • Functionality
      • Traffic

    • Examples: Map Service, Search Service, Directions Service

Deploying deep learning applications in ECS


In the workshop, they covered creating an MXNet container in Docker and deploying it with Amazon ECS.

Session Info

  • Presenter - Chad Schmutzer - Solutions Architect, Amazon Web Services

Key Takeaways

  • MXNet - open source deep learning framework
  • Highly scalable - single/multiple hosts, CPU/GPU support
  • With Application Load Balancers, we can run multiple app instances on the same host
  • Lab Steps
    • Lab1 - Set up the workshop environment
    • Lab2 - Build an MXNet Docker image
    • Lab3 - Deploy the MXNet container with ECS
    • Lab4 - Image classification Demo
    • Lab5 - Wrap image classification in an Image Task
  • Github Repo link - https://github.com/awslabs/ecs-deep-learning-workshop

Deploying Swift applications in ECS


In this workshop, they covered how to develop a mobile front end using Swift and a Swift microservices-based web application on Amazon ECS.

Session Info

  • Presenter - Asif Khan - Solutions Architect, Amazon Web Services

Key Takeaways

  • Technologies
    • Swift
    • Vapor
    • Docker
    • ECS
    • ECR
    • RDS
    • AWS Mobile Services
    • AWS Device Farm
    • Amazon Cognito
    • AWS CodeCommit
    • AWS CodePipeline
    • AWS CodeDeploy
  • Workshop Github Repository - https://github.com/awslabs/swift-ecs-workshop
  • Lab Steps
    • Lab 1: Deploy a Swift web application on Amazon ECS Workshop
    • Lab 2: Create a Swift mobile app using Mobile Hub and Amazon Cognito
    • Lab 3: Build and test mobile app using Amazon Device Farm
    • Lab 4: Deploy to Amazon ECS using CodeCommit and CodePipeline

Learn How FINRA Aligns Billions of Time Ordered Events with Spark on EC2


In this session, they discussed FINRA’s journey toward real-time data insights for billions of time-ordered events.

Session Info

  • Presenters
    • Bob Griffiths - Solutions Architect Manager, Amazon Web Services
    • Brett Shriver - Sr. Director Market Regulation Technology, FINRA
  • Youtube link - link

Key Takeaways

  • They covered a high-level overview of Spark processing
  • Different ways to use Spark on AWS
    • Amazon EC2 using Mesos
    • Amazon EC2 using Spark
    • Amazon EMR as managed Hadoop
  • FINRA - What they do?
    • FINRA is dedicated to investor protection and market integrity through effective and efficient regulation of the securities industry.
    • FINRA’s technology is vital to protecting investors—and has become a key component of our ability to:
      • Effectively oversee brokerage firms;
      • Accurately monitor the U.S. equities markets;
      • Quickly detect potential fraud; and
      • Keep investors informed through tools like BrokerCheck.
  • Problem statement
    • Handles 75 billion events per day
    • Over 20 petabytes of storage
    • Using data they reconstruct and replay the market containing trillions of nodes and edges
      • Receive data from 12 markets and security exchanges
      • Example 1: Intermarket price protection
    • High level architecture
    • Example 2: Exchange scenario
  • Legacy Solution
    • Ran 300 SQL jobs
    • Proprietary processors - 380 cores, 80 TB of data
    • 7 figures yearly to maintain and operate
    • Reprocessing was difficult
  • New Solution Requirements
    • Scalability / elasticity
    • Cost effectiveness
    • Support for real-time processing in the future
  • What options were considered?
    • Apache Spark on Amazon EMR
    • Java MapReduce
    • Apache Giraph
    • Apache Crunch
  • AWS Architecture
    • S3
    • Spark on EMR
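
To make "aligning time-ordered events" concrete, here is a single-machine sketch that merges per-market feeds, each already sorted by time, into one ordered stream. FINRA does this at scale with Spark; the `heapq.merge` approach, the `Event` type and the sample feeds below are illustrative assumptions, not their implementation.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    timestamp_ns: int   # event time in nanoseconds (assumed unit)
    market: str         # originating market or exchange
    payload: str        # order, trade, cancel, ...


def align_events(*feeds: Iterable[Event]) -> Iterator[Event]:
    """Lazily merge feeds that are each sorted by timestamp into one ordered stream."""
    return heapq.merge(*feeds, key=lambda event: event.timestamp_ns)


# Made-up sample feeds; each one is already in time order.
feed_a = [Event(1, "MARKET_A", "order"), Event(4, "MARKET_A", "trade")]
feed_b = [Event(2, "MARKET_B", "order"), Event(3, "MARKET_B", "cancel")]
timeline = list(align_events(feed_a, feed_b))
```

Because the merge is lazy, only one pending event per feed is held in memory at a time. That is the property a distributed version also has to preserve when replaying a market.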

A day in the life of a Netflix engineer


In this session, Dave discussed a day in the life of a Netflix engineer and covered the aspects below at a high level:

  • Making the Bits Bigger - Scaling at scale
  • Keeping an Eye Out - Billions of metrics
  • Break all the Things - Chaos in production is key
  • DevOps - How culture affects your velocity and uptime

Session Info

  • Presenter - Dave Hahn - Senior SRE & Chief AWS Botherer, Netflix
  • Youtube link - link

Key Takeaways

  • How the world sees Netflix
  • Open Connect
    • Other CDNs are general purpose and try to optimize for everything
    • Built around the idea of moving video bits close to customers
    • Netflix builds the caching machinery and gives it to ISPs for free
  • Was one of the best sessions, with lots of good analogies
  • Netflix keeps getting bigger, and at the same time it has to stay fast to keep up with that growth.
  • Enables big and fast by
    • Enabling Resiliency
    • Enabling Insight
    • Enabling Choice
    • Enabling Focus
    • Enabling People
  • Enabling Resiliency
    • Principles of chaos engineering
      • Hypothesis
      • Vary real-world events
        • If an AZ goes down
        • If a region goes down
      • Experiment
      • Automate
    • Chaos engineering in practice
      • Chaos Monkey - it is OK if an instance goes down. Enables resiliency
      • Chaos Kong - it is OK if a region goes down.
      • Latency Monkey - degrade gracefully under increased latency
        • However, it had a blast radius problem
        • When a change is made, they start up parallel clusters, then observe and compare behaviours
    • Many organizations are developing separate departments for chaos engineering. For more check link
  • Enabling Insight
    • Tools help in enabling insight
      • Vizceral - insight
      • Spinnaker - cluster manager - enables velocity
        • Automated canary feature
        • Is also integrated with Slack
        • Vibrant and healthy community
        • spinnaker.io
  • Enabling choice
    • Multiple Languages used by different use cases
      • Java
      • Scala
      • Groovy
      • Node.js
      • C
      • C++
    • Containers
      • Microservices and Immutable infrastructure (TODO)
      • They are currently using it for batch jobs
        • Titus - Enables options
        • Why write their own when Kubernetes / Mesos exist?
        • Netflix has a tight environment
        • They are planning to move some of Mesos environments to ECS
        • Atlas - metrics system
    • Stats
      • Run more than 100K instances in EC2
      • Run more than 800,000 CPU cores
      • NASA super computer
      • More than 50 Gbps per region
      • Over 37% of traffic goes to Netflix
  • Enable Focus
    • Don’t take focus off the goal
    • Partnering with AWS and others
    • Enabling teams to give time to build on their idea
  • Enable People
    • Company
    • Culture
    • Values
    • Freedom & Responsibility
    • Context not control
    • Enable Success
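
The "start up parallel clusters and compare behaviours" practice noted under chaos engineering resembles a canary check. A minimal sketch of such a comparison follows, assuming error rate is the only signal and using a made-up 10% tolerance. The type, function and threshold are hypothetical and are not Spinnaker's automated canary analysis.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat a cluster that has served no traffic as error-free.
        return self.errors / self.requests if self.requests else 0.0


def canary_passes(baseline: ClusterStats, canary: ClusterStats,
                  max_relative_increase: float = 0.10) -> bool:
    """Pass if the canary's error rate is within the tolerance of the baseline's."""
    allowed = baseline.error_rate * (1 + max_relative_increase)
    return canary.error_rate <= allowed


# Made-up numbers: the canary doubles the error rate, so it should be rejected.
baseline = ClusterStats(requests=1000, errors=10)
canary = ClusterStats(requests=1000, errors=20)
promote = canary_passes(baseline, canary)
```

Note that a baseline with zero errors makes the check strict, since any canary error then fails it. A production system would compare many metrics and use statistical tests rather than a single ratio.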

Version History


Date Description
2016-11-29 Initial Version