Using Amazon Web Services in HDX
*** Work in Progress ***
Amazon Web Services offers a broad set of global cloud-based products including compute, storage, databases, analytics, networking, mobile, developer tools, management tools, IoT, security and enterprise applications. Of the 90+ services, the most popular include Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). Most services are not exposed directly to end users, but instead offer functionality through APIs for developers to use in their applications. Amazon Web Services’ offerings are accessed over HTTP, using the REST architectural style and SOAP protocol.
What relevant services are there?
In the list below, taken mainly from Wikipedia, I use green to indicate where we might use the service to create new or enhanced functionality. I use orange to indicate that we might replace something we are already using, or add the service to a list of products we are evaluating. I use blue to indicate a speculative future usage. I use strikethrough to indicate that the service is, as far as I can tell, not relevant to us (or that we would only use it indirectly).
Compute
- Amazon Elastic Compute Cloud (EC2) is an IaaS service providing virtual servers controllable by an API, based on the Xen hypervisor. Equivalent remote services include Microsoft Azure, Google Compute Engine and Rackspace; on-premises equivalents include OpenStack and Eucalyptus.
- AWS Elastic Beanstalk provides a PaaS service for hosting applications. Equivalent services include Google App Engine and Heroku, or OpenShift for on-premises use.
- AWS Lambda runs code in response to AWS-internal or external events such as HTTP requests, transparently providing the resources required.[40] Lambda is tightly integrated with AWS, but similar services such as Google Cloud Functions and open solutions such as OpenWhisk are becoming competitors.
Networking
(The below could be used to replace BlackMesh)
- Amazon Route 53 provides a scalable managed Domain Name System (DNS) service.
- Amazon Virtual Private Cloud (VPC) creates a logically isolated set of AWS resources which can be connected using a VPN connection. This competes against on-premises solutions such as OpenStack or HPE Helion Eucalyptus used in conjunction with PaaS software.
- AWS Direct Connect provides dedicated network connections into AWS data centers.
- Amazon Elastic Load Balancing (ELB) automatically distributes incoming traffic across multiple Amazon EC2 instances.
- AWS Elastic Network Adapter (ENA) provides up to 20 Gbit/s of network bandwidth to an Amazon EC2 instance.[41]
Content delivery
- Amazon CloudFront is a content delivery network (CDN) for distributing objects to so-called "edge locations" near the requester.
Contact Center
- Amazon Connect is a self-service, cloud-based contact center service available to businesses. Amazon Connect is based on the same contact center technology used extensively by Amazon customer service associates around the world. (Could replace our chat solution, e.g. Zoho.)
Storage and content delivery
- Amazon Simple Storage Service (S3) provides scalable object storage accessible from a Web Service interface. Applicable use cases include backup/archiving, file (including media) storage and hosting, static website hosting, application data hosting, and more.
- Amazon Glacier provides long-term storage options (compared to S3), with high redundancy and availability but slow retrieval, suited to infrequently accessed data. It is intended for archiving data.
- AWS Storage Gateway is an iSCSI block storage virtual appliance with cloud-based backup.
- Amazon Elastic Block Store (EBS) provides persistent block-level storage volumes for EC2.
- AWS Import/Export accelerates moving large amounts of data into and out of AWS using portable storage devices for transport.
- Amazon Elastic File System (EFS) is a file storage service for Amazon Elastic Compute Cloud (Amazon EC2) instances.
Database
- Amazon DynamoDB provides a scalable, low-latency NoSQL online database service backed by SSDs.
- Amazon ElastiCache provides in-memory caching for web applications.[42] This is Amazon's implementation of Memcached and Redis.[43]
- Amazon Relational Database Service (RDS) provides scalable database servers with MySQL, Oracle, SQL Server, and PostgreSQL support.[44]
- Amazon Redshift provides petabyte-scale data warehousing with column-based storage and multi-node compute.
- Amazon SimpleDB allows developers to run queries on structured data. It operates in concert with EC2 and S3.
- AWS Data Pipeline provides a reliable service for data transfer between different AWS compute and storage services (e.g., Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon EMR). In other words, it is a data-driven workload management system with an API for managing and monitoring data-driven workloads in cloud applications.[45]
- Amazon Aurora provides a MySQL-compatible relational database engine created specifically for the AWS infrastructure. It is claimed to offer faster speeds and lower costs, which are realized in larger databases.
Mobile services
- AWS Mobile Hub lets you easily add and configure features for your mobile apps, including user authentication, data storage, backend logic, push notifications, content delivery, and analytics.
- Amazon Cognito lets you easily add user sign-up and sign-in to your mobile and web apps.
- AWS Device Farm is an app testing service that lets you test and interact with your Android, iOS, and web apps on many devices at once, or reproduce issues on a device in real time.
- Amazon Pinpoint makes it easy to engage your customers via email, SMS and Mobile Push messages, tracking overall customer and engagement activity.
Deployment
- CloudFormation provides a declarative template-based Infrastructure as Code model for configuring AWS.[46]
- AWS Elastic Beanstalk provides deployment and management of applications in the cloud.
- AWS OpsWorks provides configuration of EC2 services using Chef.
- AWS CodeDeploy provides automated code deployment to EC2 instances.
Management
- Amazon Identity and Access Management (IAM) is an implicit service, providing the authentication infrastructure used to authenticate access to the various services.
- AWS Directory Service is a managed service that allows connection to AWS resources with an existing on-premises Microsoft Active Directory, or setting up a new, stand-alone directory in the AWS Cloud.
- Amazon CloudWatch provides monitoring for AWS cloud resources and applications, starting with EC2.
- AWS Management Console (AWS Console) is a web-based point-and-click interface to manage and monitor the Amazon infrastructure suite including (but not limited to) EC2, EBS, S3, SQS, Amazon Elastic MapReduce, and Amazon CloudFront. A mobile application for Android supports some of the management features from the console.
- AWS CloudHSM helps to meet corporate, contractual and regulatory compliance requirements for data security by using dedicated Hardware Security Module (HSM) appliances within the AWS cloud.
- AWS Key Management Service (KMS) is a managed service to create and control encryption keys.
- Amazon EC2 Container Service (ECS) is a highly scalable and fast container management service using Docker containers.
Application services
- Amazon API Gateway is a service for publishing, maintaining and securing web service APIs.
- Amazon CloudSearch provides basic full-text search and indexing of textual content.
- Amazon DevPay, currently in limited beta version, is a billing and account management system for applications that developers have built atop Amazon Web Services.
- Amazon Elastic Transcoder (ETS) provides video transcoding of S3 hosted videos, marketed primarily as a way to convert source files into mobile-ready versions.
- Amazon Simple Email Service (SES) provides bulk and transactional email sending. (Could replace our bulk emailer.)
- Amazon Simple Queue Service (SQS) provides a hosted message queue for web applications.
- Amazon Simple Notification Service (SNS) provides a hosted multi-protocol "push" messaging for applications.
- Amazon Simple Workflow (SWF) is a workflow service for building scalable, resilient applications.
- Amazon Cognito is a user identity and data synchronization service that securely manages and synchronizes app data for users across their mobile devices.[47]
- Amazon AppStream 2.0 is a low-latency service that streams resource-intensive applications and games from the cloud using NICE DCV technology.[48]
Analytics
- Amazon Athena is an interactive query service launched in November 2016. It allows serverless querying of S3 content using standard SQL.[49]
- AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
- Amazon Elastic MapReduce (EMR) provides a PaaS service delivering the Hadoop framework for running MapReduce queries on the web-scale infrastructure of EC2 and Amazon S3.
- Amazon Machine Learning is a service that helps developers of all skill levels to use machine learning technology.
- Amazon Kinesis is a cloud-based service for real-time data processing over large, distributed data streams. It streams data in real time with the ability to process thousands of data streams on a per-second basis. The service, designed for real-time apps, allows developers to pull any amount of data, from any number of sources, scaling up or down as needed. It has some similarities in functionality to Apache Kafka.[50]
- Amazon Elasticsearch Service provides fully managed Elasticsearch and Kibana services.[51]
- Amazon QuickSight is a business intelligence, analytics, and visualization tool launched in November 2016.[52] It provides ad-hoc services by connecting to AWS or non-AWS data sources.
Miscellaneous
- Amazon Marketplace Web Service (MWS) allows users to manage the complete shipment process, from creating a listing to downloading a shipment label, using an API.
- Amazon Fulfillment Web Service provided a programmatic web service for sellers to ship items to and from Amazon using Fulfillment by Amazon. It was later replaced by Amazon Marketplace Web Service.
- Amazon Historical Pricing provides access to Amazon's historical sales data from its affiliates. (It appears that this service has been discontinued.)
- Amazon Mechanical Turk (MTurk) manages small units of work distributed among many persons.
- Amazon Product Advertising API, formerly known as Amazon Associates Web Service (A2S) and Amazon E-Commerce Service (ECS), provides access to Amazon's product data and electronic commerce functionality.
- Amazon Gift Code On Demand (AGCOD) for Corporate Customers[53] enables companies to distribute Amazon gift codes instantly in any denomination.
- AWS Partner Network (APN) provides technical information and sales and marketing support. Launched in April 2012, the APN is made up of Technology Partners including Independent Software Vendors (ISVs), tool providers, platform providers, and others.[54][55][56]
- Amazon Lumberyard is a freeware triple-A game engine integrated with AWS.[57]
- Amazon Chime is a collaboration service for voice, video conference, and instant messaging.[58]
I could imagine datasets being uploaded into AWS rather than the datastore, giving us far more functionality and significantly faster querying of that data than we currently have with the datastore. The data uploaded into AWS could be harmonised and transformed on the fly for visualisation, much like the HXL Proxy does for HXLated data, e.g. for a future map explorer. The HXL Proxy could be refactored to use AWS for scalability and to use some of its harmonisation capabilities. I am not sure how feasible it is, but using AWS scraping (crawling) functionality, it might be possible to, for example, pull an HTML table out of a website on the fly, harmonise and transform the data, HXLate it and expose it through HDX as a URL in a dataset's resource. (Question: can a crawler's frequency be set to allow on-the-fly scraping?) Another example could be the steps that make up the FTS daily scraper, which pulls from an API, transforms, HXLates and uploads to HDX: these could be done on demand through AWS when the download button is clicked in a dataset.
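To make the "pull out an HTML table, HXLate it, expose it" idea more concrete, here is a minimal sketch of the transformation step on its own, using only the Python standard library. It does not use any AWS service; where such code would run (e.g. in a Lambda function) is left open. The sample page, the `HXL_TAGS` mapping and all function names are made up for illustration, and a real page would need more robust parsing (nested tables, merged cells, etc.).

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of the first <table> in a page as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows
        self._row = None        # row currently being built
        self._cell = None       # text fragments of the current cell
        self._tables_seen = 0   # only the first table is read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
        elif tag == "tr" and self._tables_seen == 1:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def hxlate(rows, tag_for_header):
    """Insert an HXL hashtag row directly below the header row."""
    header, *body = rows
    tags = [tag_for_header.get(name, "") for name in header]
    return [header, tags, *body]


def html_table_to_hxl_csv(html, tag_for_header):
    """Scrape the first table from the HTML and return it as HXLated CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(hxlate(parser.rows, tag_for_header))
    return buf.getvalue()


# Hypothetical sample page and header-to-hashtag mapping (illustration only).
PAGE = """
<html><body><table>
<tr><th>Country</th><th>Funding (USD)</th></tr>
<tr><td>Mali</td><td>1200000</td></tr>
<tr><td>Chad</td><td>850000</td></tr>
</table></body></html>
"""
HXL_TAGS = {"Country": "#country+name", "Funding (USD)": "#value+funding+usd"}

print(html_table_to_hxl_csv(PAGE, HXL_TAGS))
```

The hard part in practice is probably not this step but choosing the hashtag mapping automatically, which the sketch simply assumes is supplied by hand.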
One issue is how well these harmonisations and transformations would work for HXLated data, as HXLated data can break the inference mechanisms, as happens with Frictionless.
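One plausible way this breakage happens, assuming it is the HXL hashtag row that confuses inference, can be shown with a toy example. The inference below is deliberately naive and is not how Frictionless or any AWS service actually infers types; the data and function names are made up for illustration.

```python
def infer_type(values):
    """Naive inference: 'integer' only if every value parses as an int."""
    try:
        for value in values:
            int(value)
        return "integer"
    except ValueError:
        return "string"


def is_hxl_row(row):
    """Treat a row as the hashtag row if all of its non-empty cells start with '#'."""
    cells = [cell.strip() for cell in row if cell.strip()]
    return bool(cells) and all(cell.startswith("#") for cell in cells)


def infer_schema(rows, skip_hxl=False):
    """Infer a type per column, optionally ignoring the HXL hashtag row."""
    header, *body = rows
    if skip_hxl:
        body = [row for row in body if not is_hxl_row(row)]
    return {name: infer_type(column) for name, column in zip(header, zip(*body))}


# Hypothetical HXLated table (illustration only).
ROWS = [
    ["Country", "Funding (USD)"],
    ["#country+name", "#value+funding+usd"],
    ["Mali", "1200000"],
    ["Chad", "850000"],
]

print(infer_schema(ROWS))                 # hashtag row included in the data
print(infer_schema(ROWS, skip_hxl=True))  # hashtag row ignored
```

With the hashtag row left in, the funding column is inferred as a string; with it skipped, the column is inferred as an integer. If this is indeed the cause, any AWS-based pipeline would need an equivalent of `skip_hxl` before its own inference runs.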
One of the issues raised about Connect (from Javier's feedback) is that organisations want to store their sensitive data on HDX. We had been worried about how to do the security well enough. If the data were stored on AWS, could that be a solution? From the AWS website: "as an AWS customer, you will benefit from a data center and network architecture built to meet the requirements of the most security-sensitive organizations."