The history of data can be divided into two eras: pre-big data and post-big data.
In the pre-big data era, data was mostly structured and exchanged between enterprises through standard mechanisms such as network data mover (NDM). The need for near real-time insights was limited, and data extraction and transformation were batch-oriented and scheduled during non-peak hours to reduce MIPS (millions of instructions per second) usage and disruption to online production transactions.
Also, data formats were limited, the most common format being delimited flat files with headers and trailers. Both headers and trailers stored important information such as data arrival time, data producer information, and the number of records in the file.
Moreover, relational database management systems (RDBMSs) such as DB2, hierarchical databases such as IMS DB, flat files, and custom extract, transform, load (ETL) logic written in COBOL or PL/I were sufficient to address data ingestion, analysis, and storage. Since sources of data generation were limited, the volume of data was easier to manage.
As we ushered in the era of big data, enterprises expected more value from data as advances in technology provided the capacity to gather, store, and analyze exponential growth in both the volume and variety of data. With the ability to extract more business insights, more quickly, than ever before, data has become a competitive advantage for enterprises that can extract actionable information from their diverse data sources and formats.
At the same time, increasing regulatory requirements have also necessitated ingesting data from diverse sources to make informed decisions. For example, regulatory authorities in California mandate the collection, storage, and analysis of data to reduce disruption caused by wildfires, which take a huge economic toll on communities and businesses every year. To comply, utility companies need to ingest and analyze voluminous data and apply artificial intelligence and machine learning-based prediction techniques to it. This shift in the dynamics of data resulted in exponential growth in data volume, data sources, data exchange patterns, and data formats.
Managing volume and complexity of data
Today, a significant amount of enterprise data is generated from external sources rather than internal systems of record (SORs). Stored data now includes engagement data in addition to transactional data, and engagement data can run 10-20 times the volume of transactional data. Although big data technologies introduced distributed storage and accelerated data processing through massively parallel processing, they do not address dynamically scaling data acquisition, storage, and processing based on demand.
Elastic scaling of compute and storage on-premises is human-intensive, cumbersome, and expensive, and acquiring data from multiple external sources adds further overhead. Consequently, enterprises face several challenges with on-premises data management. It is difficult to:
- Scale up data processing and storage for an exponential increase in polymorphic data
- Manage different mechanisms to ingest data from external and internal systems
- Ensure high availability of data and near real-time, secure access to data insights
Necessity is the mother of invention
The evolution of cloud computing coincided with this exponential growth in data. The cloud abstracted away the problem of elastically scaling storage and processing power on demand. It also provided a managed data landing zone for ingesting data from various internal and external systems.
Amazon Web Services (AWS) offers a broad spectrum of highly available, fully managed data services catering to several types of data, be it relational, semi-structured, or unstructured. Amazon Relational Database Service (RDS) and Amazon Aurora cater to the relational domain, while Amazon DynamoDB is a NoSQL database service.
AWS also provides managed services for other popular NoSQL databases, such as Amazon DocumentDB (with MongoDB compatibility) and Amazon Keyspaces (for Apache Cassandra). Beyond these managed services, all leading NoSQL databases, such as Couchbase, MongoDB, and Cassandra, offer managed database-as-a-service options on AWS, and customers can also install and run these databases as self-managed software on Amazon Elastic Compute Cloud (Amazon EC2).
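To make the managed-service model concrete, here is a minimal sketch using the AWS SDK for Python (boto3) to create and query an Amazon DynamoDB table. The table name, keys, and item values are hypothetical, and credentials and Region are assumed to come from the environment.

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical table illustrating a simple key-value access pattern.
table = dynamodb.create_table(
    TableName="DeviceEvents",  # hypothetical name
    KeySchema=[
        {"AttributeName": "device_id", "KeyType": "HASH"},    # partition key
        {"AttributeName": "event_time", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "device_id", "AttributeType": "S"},
        {"AttributeName": "event_time", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity, no servers to size
)
table.wait_until_exists()

# Write and read back a single item.
table.put_item(Item={"device_id": "dev-001",
                     "event_time": "2023-01-01T00:00:00Z",
                     "status": "OK"})
resp = table.get_item(Key={"device_id": "dev-001",
                           "event_time": "2023-01-01T00:00:00Z"})
print(resp.get("Item"))
```

The point of the sketch is that provisioning, replication, and scaling are handled by the service; the application code deals only with the data model.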
Navigating data migration, powered by AWS and Infosys migration strategy
A sound data migration strategy is essential to ensure seamless operations and business continuity. In some cases, it may be beneficial to retain certain types of data on-premises due to regulatory requirements. The data migration approach may vary based on the size and nature of the data.
For example, if the volume of data is huge, it is prudent to adopt the AWS Snow Family, comprising AWS Snowcone, AWS Snowball, and AWS Snowmobile. This suite of services offers physical devices at a range of capacity points to help physically transport up to exabytes of data into the AWS Cloud.
For data transformation, AWS provides Amazon EMR, which manages Hadoop clusters in the cloud, and AWS Glue as a managed ETL service. Furthermore, Amazon Athena and Amazon Redshift with Redshift Spectrum enable a data lakehouse implementation in the cloud, and Amazon QuickSight adds a visualization layer for business users.
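As an illustration of querying lakehouse data in place, the sketch below runs a hypothetical Amazon Athena query with boto3; the database, table, and results bucket names are assumptions, not part of the original solution.

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical Glue database, table, and S3 results location.
query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS n FROM device_events GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = query["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    # First row is the header row.
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```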
For continuous data ingestion into the AWS Cloud from various sources, AWS provides data migration and ingestion services such as AWS Database Migration Service (AWS DMS), which ingests relational data into AWS. Amazon Kinesis services also help ingest, store, and process streaming data.
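A minimal sketch of streaming ingestion with boto3, assuming a pre-created Kinesis data stream; the stream name and event payload are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical telemetry event from an external source.
event = {"device_id": "dev-001", "temperature": 71.3,
         "ts": "2023-01-01T00:00:00Z"}

kinesis.put_record(
    StreamName="device-telemetry",           # stream must already exist
    Data=json.dumps(event).encode("utf-8"),  # records are opaque byte blobs
    PartitionKey=event["device_id"],         # keeps a device's records on one shard
)
```

Downstream consumers (for example, Kinesis Data Firehose or a Lambda function) can then store or process the records in near real time.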
Post-migration, enterprises need to manage running costs. Implementing an observability layer helps track and optimize resource usage on the cloud. Metrics collected through AWS CloudTrail, Amazon CloudWatch, and billing metrics help enterprises build this observability layer.
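As one way to seed such a layer, the sketch below pulls the AWS/Billing EstimatedCharges metric from CloudWatch with boto3. It assumes billing metric publication has been enabled for the account; billing metrics appear only in the us-east-1 Region.

```python
from datetime import datetime, timedelta, timezone
import boto3

# Billing metrics are published only in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=86400,            # one data point per day
    Statistics=["Maximum"],  # running total, so take the daily maximum
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), round(point["Maximum"], 2), "USD")
```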
Infosys has worked with several global clients to migrate, modernize, and build data platforms on the cloud. We believe a platform-based approach to migrating applications and data to the cloud is imperative for a seamless migration.
For example, we redesigned the data landscape of a device manufacturer to better manage almost a petabyte of data residing in on-premises network-attached storage (NAS) and growing by 300% year over year. The system allowed users to upload images, incident descriptions, and application logs related to device defects. The data management solution was designed using Amazon S3, Amazon EMR, and the AWS Glue Data Catalog for metadata management. Our choice was determined by several factors:
- Amazon Simple Storage Service (Amazon S3) provides a secure, scalable, and highly available object store for the petabyte-scale file storage previously held on the NAS.
- The AWS SDK's Amazon S3 TransferManager handles large file uploads through multipart uploads.
- Amazon S3 Transfer Acceleration routes data to the nearest edge location over an optimized network path for faster and more secure transfer of files (see the sketch after this list).
- Amazon S3 provides a common and standard landing zone for data exchange between stakeholders.
- Amazon EMR and the AWS Glue Data Catalog are a good fit for large-volume ETL processing at scale and for storing metadata that goes through frequent structural changes.
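The sketch below, referenced in the list above, combines multipart upload settings with the Transfer Acceleration endpoint using boto3, whose managed transfer layer plays the role the TransferManager plays in the AWS SDK for Java. The bucket and file names are hypothetical, and acceleration is assumed to already be enabled on the bucket.

```python
import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

# Route requests through the S3 Transfer Acceleration edge endpoints.
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))

# Split files above 100 MB into 100 MB parts, uploading up to 8 parts in parallel.
transfer_config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=8,
)

# Hypothetical local file, bucket, and key.
s3.upload_file(
    Filename="defect-images.tar",
    Bucket="device-defect-data",
    Key="uploads/defect-images.tar",
    Config=transfer_config,
)
```

Multipart uploads let failed parts be retried individually, which matters at petabyte scale where restarting whole-file uploads would be prohibitive.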
Migrating data and application workloads to the cloud is imperative for enterprises to future-proof their businesses. A well-orchestrated, automated approach allows enterprises to realize the benefits of migrating data to the cloud.
To lend predictability to modernization, Infosys offers its customers the Infosys Modernization Suite and its component, the Infosys Database Migration Platform, which is part of Infosys Cobalt. This helps enterprises migrate from on-premises RDBMSs to cloud databases such as Amazon RDS and Amazon Aurora, or to NoSQL databases such as Amazon DynamoDB and Amazon DocumentDB.
About the authors:
Naresh Duddu, AVP and Head, Cloud & Open Source, Modernization Practice, Infosys
Jignesh Desai is the AWS WW Migration Partner Solutions Architect for Infosys
Saurabh Shrivastava is the AWS Global SA Leader for Infosys