
The Big Data Cookbook


Big data has become one of the new buzzwords on the Internet. It refers to the massive amounts of data that many modern web services deal with. This post will list some of the more useful software available to web developers for working with big data.

You don’t have to operate at the scale of Google or Facebook to enter into big data territory. Web analytics services, monitoring services (like our very own Pingdom), search engines, etc., all process and store massive amounts of data.

To quote Wikipedia:

Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools. […] Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.

At this scale, many traditional approaches for handling and processing data are either impractical or break down completely.

That’s why the web development community has been turning to alternative ways to handle all this data, developing new software that scales to these extremes. You may have heard about NoSQL databases, but that’s just a small piece of the puzzle.

So what are the various ingredients available for handling big data? We’ve divided them into four categories:

  • Storage and file systems
  • Databases
  • Querying and data analysis
  • Streaming and event processing

We figured this could be a good starting point, and we’re hoping that you’ll help us add to the list in this post by making your own suggestions in the comments.

In other words, read the list, and help us add more useful ingredients!

Here we go…

Storage and file systems

When you need to store massive amounts of data, you’ll want a storage solution designed to scale out on multiple servers.

  • HDFS (Hadoop Distributed File System) – Part of the open source Hadoop framework, HDFS is a distributed, scalable file system inspired by the Google File System. It runs on top of the file system of the underlying OSs and is designed to scale to petabytes of storage. The Hadoop project (you’ll see several of the other components further down) has several high-profile contributors, the main one being Yahoo. Hadoop is used by Yahoo, AOL, eBay, Facebook, IBM, Meebo, Twitter and a large number of other companies and services.
  • CloudStore (KFS) – An open source implementation of the Google File System from Kosmix. It can be used together with Hadoop and Hypertable. A well-known CloudStore user and contributor is Quantcast.
  • GlusterFS – A free, scalable, distributed file system developed by Gluster.

Databases

While classics like MySQL are still widely used, there are other options out there that have been designed with “web scalability” in mind, many of them so-called NoSQL databases (speaking of buzzwords…).

  • HBase – A distributed, fault-tolerant database modeled after Google’s BigTable. It’s part of the Apache Hadoop project, and runs on top of HDFS.
  • Hypertable – An open source database inspired by Google’s BigTable. A notable Hypertable user is Baidu.
  • Cassandra – A distributed key-value database originally developed by Facebook, released as open source, and now run under the Apache umbrella. Cassandra is used by Facebook, Digg, Reddit, Twitter and Rackspace, to name a few.
  • MongoDB – An open source, scalable, high-performance, document-oriented database. It’s used by, among others, Foursquare, Bit.ly, Shutterfly, Etsy and Chartbeat.
  • Membase – An open source, distributed, key-value database optimized for interactive web applications, developed by several team members from the famous Memcached project. Users include Zynga and Heroku. A month ago, the Membase project merged with CouchDB, creating a new project called Couchbase.

Querying and data analysis

All that data is of no use without the ability to access, process and analyze it.

  • Hadoop MapReduce – Open source version of Google’s MapReduce framework for distributed processing of large datasets.
  • Hive – An open source data warehouse infrastructure with tools for querying and analyzing large datasets in Hadoop. Supports an SQL-like query language called HiveQL.
  • Pig – A high-level language used for processing data with Hadoop. Funny aside: the language is sometimes referred to as Pig Latin.
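To give a feel for the MapReduce model these tools build on, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are our own, chosen only for illustration. A real framework would run the map and reduce steps in parallel across many servers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, which a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of text records."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data about data"]
    print(word_count(docs))
```

The point of the split is that each map call only sees one record and each reduce call only sees one key, which is what makes it possible to spread the work over many machines.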

Streaming and event processing

When you have massive amounts of data flowing into your system, you will often want to process and react to this data in real time.

  • S4 – A general-purpose, distributed, scalable platform for processing continuous streams of data. Developed by Yahoo and released as open source in 2010. It’s apparently not quite ready for prime time yet, although Yahoo is using a version of it internally.
  • Esper – An event-processing platform from EsperTech for handling continuous streams of incoming data.
  • StreamInsight – Microsoft’s entry in the ESP/CEP field, included with SQL Server.

A small aside: when it comes to streaming and event processing, you’ll hear two industry terms over and over again, ESP (Event Stream Processing) and CEP (Complex Event Processing), just in case you were wondering what they stand for.
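As a rough illustration of what reacting to a stream in real time can look like, here is a small Python sketch that counts events inside a sliding time window and signals when a threshold is reached. It is not based on S4, Esper or StreamInsight; the class name SlidingWindowCounter and its parameters are hypothetical, and a real platform would add distribution, persistence and a query language on top.

```python
from collections import deque


class SlidingWindowCounter:
    """Signal when at least `threshold` events fall within `window_seconds`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def on_event(self, timestamp):
        """Record one event and return True if the window has reached the threshold."""
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the window ending at `timestamp`.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold


if __name__ == "__main__":
    # Hypothetical example: flag three or more errors within ten seconds.
    counter = SlidingWindowCounter(window_seconds=10, threshold=3)
    for ts in [0, 4, 8, 30, 31]:
        print(ts, counter.on_event(ts))
```

Each event is handled as it arrives and only the timestamps still inside the window are kept in memory, so the stream itself never has to be stored in full.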

The Google legacy

It’s interesting how influential Google has been in the big data field in spite of having released very little actual software to the public.

Much of the open source big data movement is centered around Apache’s Hadoop project, which essentially has tried to replicate Google’s internal software based on the various whitepapers Google has made available. (More specifically, Hadoop has replicated GFS, BigTable and MapReduce.)

Here is a list of some of Google’s proprietary software relating to big data:

  • GFS (Google File System) – Google’s scalable, fault-tolerant, distributed file system. Designed from scratch for use with data-intensive applications.
  • BigTable – A distributed, high-performance database system built on top of GFS.
  • MapReduce – A framework for distributed processing of very large data sets.
  • Pregel – A framework for analyzing large-scale graphs with billions of nodes.
  • Dremel – Meant as a faster complement to MapReduce, Dremel is a scalable, interactive, ad-hoc query system for large data sets. According to Google, it’s capable of running aggregation queries over trillion-row tables in seconds and scales to thousands of CPUs.

If we may be so bold as to bring out our crystal ball, there will most likely be several open source implementations of Pregel and Dremel available soon. For example, there’s already an OpenDremel project in the works.

Help us add more ingredients!

What excellent big data software did we leave out? Let’s make this post a true resource, so please give us a hand in the comments.
