With the advent of large data sets, traditional databases could not handle the load. The scale-out factor increased from thousands to millions of concurrent connections, and data sets grew into the range of terabytes to petabytes; traditional database technologies were designed for neither. Data by itself is not very useful unless it is analyzed and made actionable, and to analyze large data sets, both the data and the computation had to be distributed. MapReduce was originally implemented at Google to build its search engine; Google then published its MapReduce paper along with GFS ("Google File System"). As defined in the original Google paper by Jeffrey Dean and Sanjay Ghemawat: "MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key." The original authors of Hadoop then integrated a distributed file system with the MapReduce algorithm to compute in parallel on large data sets.
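The map and reduce functions from the paper's definition can be made concrete with the canonical word-count example. This is a single-process Python sketch, so it omits the distribution and fault tolerance that an actual MapReduce implementation provides:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: merge all values associated with the same intermediate key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data needs big compute", "data drives decisions"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
```

In a real cluster, the map and reduce phases run on many machines in parallel, and the shuffle moves intermediate pairs over the network.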
Hadoop has evolved to include batch computing, real-time computing, graph databases, and NoSQL databases, all operating in parallel on large data sets. The landscape is fairly complicated, and Moogilu offers a variety of Big Data technologies to support customers' varied needs.
Machine Learning is another aspect of computing that depends on data and models to arrive at the right answer. Moogilu can help with the analysis, design, architecture, development, deployment, hosting, and support of the following Big Data technologies:
Hadoop and File System
The Hadoop architecture is built on the Hadoop Distributed File System ("HDFS"). HDFS is the storage foundation of Hadoop and of the upstream applications and data stores, providing scalable, fault-tolerant, cost-efficient, distributed storage for big data.
With regard to big data import/export (ETL), Moogilu has expertise with Sqoop and Flume. With these tools you can:
- Import and export data from a variety of sources
- Run parallel data loads
- Stream log data into Hadoop
SQL is well understood, and it is critical that Big Data stores support SQL so that traditional DBAs can manage data access and write queries. Hive is a data warehouse built on Hadoop that provides a SQL interface. Hive can work on very large data sets, and the underlying MapReduce is transparent to the user. It can answer data-analytic questions on data sets that scale to petabytes. For analyses that can tolerate some latency, a Big Data warehouse with MapReduce is the right approach.
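Hive's appeal is that analysts write ordinary SQL while Hadoop handles the distribution. The flavor of such a query can be sketched with Python's built-in sqlite3 standing in for a Hive table (the table and column names here are hypothetical):

```python
import sqlite3

# In-memory stand-in for a Hive table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("US", 120), ("IN", 300), ("US", 80), ("DE", 50)])

# In HiveQL the same aggregation would fan out as MapReduce jobs
# transparently; the analyst only writes the SQL.
rows = conn.execute(
    "SELECT country, SUM(views) FROM page_views "
    "GROUP BY country ORDER BY SUM(views) DESC").fetchall()
```

The point is the interface, not the engine: the same GROUP BY runs locally here, and across a cluster on petabytes in Hive.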
NoSQL, In-Memory Database, and Real-time Analytics
NoSQL databases are horizontally scalable and can scale out to millions of operations per second, with latencies in milliseconds for some of them. We support the columnar model (HBase and Cassandra), the document model (Couchbase), and key-value stores (Aerospike). With a NoSQL data store, it is possible to have:
- Flexible data model
- High availability
- In-memory caching and real-time Big Data
- Million+ ops/sec (reads/writes)
- Linear scalability
- Atomic, consistent operations
- Replication across data centers
- Easy administration
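The linear scalability in the list above typically comes from partitioning keys across nodes so that adding a node adds capacity. One common mechanism is a consistent-hash ring, sketched here in plain Python (the node names are hypothetical):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Minimal consistent-hash ring: each key maps to the next node
    clockwise on the ring, so keys spread evenly across nodes and
    adding a node moves only a fraction of the keys."""

    def __init__(self, nodes, vnodes=64):
        # Virtual nodes smooth out the key distribution.
        self.ring = sorted(
            (self._hash(f"{n}:{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
```

Production stores layer replication and failover on top of this placement scheme, but the routing idea is the same.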
Many applications need real-time updates of large data sets, with latency in milliseconds, and analysis thereof in seconds or minutes. The NoSQL stores above, combined with Reactive Programming, are the right approach for many use cases, including IoT ("Internet of Things") applications.
Search is an integral part of most applications. At Moogilu we support Solr and Elasticsearch, both of which are built on Apache Lucene. With Moogilu's search solutions, you get:
- Real-time search for both text data and Hadoop data
- Full-text search
- Real-time analytics
- Horizontal scalability
- Document-oriented storage
- Open interfaces: JSON, HTTP, XML
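The full-text search in the list above rests on Lucene's core data structure, the inverted index: a mapping from each term to the documents that contain it. A toy sketch in Python:

```python
from collections import defaultdict

def build_index(docs):
    """Lucene-style inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND query: return the documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {1: "real time analytics", 2: "full text search", 3: "real time search"}
idx = build_index(docs)
```

Solr and Elasticsearch add tokenization, relevance scoring, and sharding on top, but term-to-document lookup is the heart of it.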
Real-time Big Data Computation
With large data sets, traditional approaches to programming, where resources are blocked and operations are serialized, will not work. A newer framework that supports parallel programming (high concurrency) is required, one that handles large streams of data at low latency. This was once an impossible task, but with the newer frameworks now available, Moogilu can help customers scale out their product suites. With Apache Spark as a general computation engine, some workloads run up to two orders of magnitude faster than MapReduce. Apache Spark combined with Cassandra, Couchbase, or another NoSQL database provides a powerful Big Data computation engine, and for most workloads this is a better approach than Hadoop MapReduce.
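A large part of Spark's speedup comes from its programming model: chained transformations over a data set that can be cached in cluster memory, instead of writing intermediate results to disk between MapReduce jobs. A toy stand-in for that chaining style in plain Python (eager here, whereas Spark evaluates lazily):

```python
class Dataset:
    """Tiny stand-in for an RDD's map/filter/collect chaining.
    Unlike a real RDD, this sketch is eager and single-process."""

    def __init__(self, data):
        self._data = list(data)

    def map(self, fn):
        return Dataset(fn(x) for x in self._data)

    def filter(self, pred):
        return Dataset(x for x in self._data if pred(x))

    def cache(self):
        # In Spark, cache() pins the partitioned data in cluster memory
        # so later actions reuse it; here the list is already in memory.
        return self

    def collect(self):
        return self._data

events = Dataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
evens = events.collect()
```

Iterative workloads (machine learning, graph algorithms) benefit most, since the cached working set is reused across many passes.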
Moogilu supports both Storm and Akka ("Reactive Programming"). As mentioned earlier, real-time NoSQL complements real-time Big Data computation for IoT and many other applications.
Google Big Data Platform
Google's big data platform is probably the most impressive we have worked with. Through Google Cloud Services, Google provides the same toolkit it uses internally for its own product suite, including Search. Google pioneered MapReduce and has optimized its network, servers, and storage to provide unprecedented scalability, performance, cost, and operational efficiency. The components described below make up the Google Big Data Platform; by combining them, Big Data solutions can be built securely and scaled to petabytes across literally thousands of servers and a terabit network.
- Google BigQuery
BigQuery is an analytics data warehouse that easily scales to petabytes, and the speed of data loading is remarkable. Unlike other NoSQL solutions, there is no concept of a primary key; every column is equally important, and one can run aggregations and other queries at incredible speed. With BigQuery, it is fairly easy to build a powerful analytics backend with very little effort.
- Google Cloud Dataflow
Any Big Data system involves complex data flows, characterized by a programming model, data transformations, collections, and I/O sources and sinks. It takes time and effort to orchestrate such a workflow: data has to be gathered from different sources, transformed, and then delivered to the target system.
With Google Dataflow, building and managing complex workflows is easier, reducing the time and effort needed to build complex products.
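The source-transform-sink shape of such a workflow can be sketched in a few lines of plain Python. This is a toy orchestration, not the Dataflow API; the stage names and data are hypothetical:

```python
def pipeline(source, *transforms):
    """Chain transforms over a source, Dataflow-style: each stage
    consumes the previous stage's output and feeds the next."""
    data = source
    for transform in transforms:
        data = transform(data)
    return data

# Hypothetical stages: parse "key,amount" rows, drop invalid ones, sum per key.
raw = ["a,10", "b,5", "a,x", "b,7"]

def parse(rows):
    out = []
    for row in rows:
        key, value = row.split(",")
        if value.isdigit():          # drop malformed records
            out.append((key, int(value)))
    return out

def sum_per_key(pairs):
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

totals = pipeline(raw, parse, sum_per_key)
```

A managed service adds what this sketch lacks: parallel execution, windowing over streams, retries, and monitoring.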
- Google Cloud Dataproc
Google provides a managed open-source Big Data ecosystem that includes Spark, Hadoop, Pig, and Hive. With Google's aggressive pricing for managed open-source big data solutions, integrated with other Google Cloud products, Dataproc offers a large reduction in cost and a powerful paradigm for building big data solutions.
- Google Cloud Datalab
As with all Google products, this can scale to gigabytes of data, and it provides support for a variety of packages, including statistics and machine learning.
- Google Cloud PUB/SUB
This is scalable messaging middleware that can handle up to a million messages per second. Most applications will rarely need that scale, but it can be extremely useful for IoT and mobile products with millions of users. Any product in Google Cloud can use Pub/Sub to build complex pipelines.
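The property Pub/Sub provides at scale, publishers and subscribers decoupled through named topics, can be illustrated with a toy in-memory broker (the topic name and message fields are hypothetical):

```python
from collections import defaultdict

class Broker:
    """Toy topic-based pub/sub: publishers never know who is
    subscribed, which is what lets pipelines be composed freely."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, message):
        # Deliver to every subscriber of the topic, in order.
        for callback in self._subs[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("sensor-readings", received.append)
broker.publish("sensor-readings", {"device": "d1", "temp": 21.5})
```

A managed service adds durable storage, acknowledgements, and fan-out across data centers; the decoupling shown here is the core idea.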
Moogilu can help customers with the Google Big Data Platform along with Google Cloud.
Big Data Migration
It is very likely that data will have to be migrated for a variety of reasons: a new cloud or data center, consolidation, backup, or operational efficiency. Moogilu has experience migrating data between any cloud and data center. A migration should preserve the order of the data, the distribution of the data, and the backup process, while preserving or improving performance. All applications that worked before the migration should continue to work with the migrated Big Data.
Moogilu has migrated both relational and big data from Data Center to Amazon, Data Center to Rackspace, and from Amazon to Google Cloud.
Predictive Analytics is a combination of techniques that help predict the future based on current and historical data sets. The techniques used include modeling, machine learning, statistics, and data mining, and common models include naïve Bayes, k-nearest neighbor, support vector machines, neural networks, least squares, and logistic regression. Moogilu can help customers with Predictive Analytics and Machine Learning; our data engineers and data scientists can help customers build models and deliver solutions used every day in applications such as customer retention and recommendation.
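Of the models listed, k-nearest neighbor is the simplest to show end to end. A minimal sketch with hypothetical churn data, where each customer is described by two features and labeled by whether they stayed:

```python
from collections import Counter
import math

def knn_predict(train, point, k=3):
    """k-nearest-neighbor: label a point by majority vote among the k
    closest training examples (toy 2-D data; real data sets need
    feature scaling and an index such as a k-d tree)."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical features: (tenure_years, support_tickets) -> outcome.
train = [((1, 8), "churn"), ((2, 7), "churn"), ((6, 1), "stay"),
         ((7, 2), "stay"), ((5, 0), "stay")]
label = knn_predict(train, (6, 1), k=3)
```

The same predict-from-neighbors idea underlies simple recommendation systems, where "nearby" users or items drive the suggestion.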