The latest techno buzzword out there is “Big Data”. It’s a hot topic that is trending upward: its Google search popularity has increased five-fold since 2009. Data and text analytics, no doubt taking advantage of Big Data, have some profound success stories. Do you recall the last time you were on Amazon, and it displayed the “People who bought this also bought…” section just underneath the product of interest? Those are relevant, appropriate suggestions that mirror the upselling skill of the astute salesperson who historically provided them in a bricks-and-mortar store. But the mentality of those who are tasked with massaging, manipulating, and otherwise making sense of this data still mirrors Small Data. This is especially true for tasks like fraud detection, money laundering identification, and tax audit discovery. Sure, some marketing folks coined the term a while back, but the phrase Big Data assumes there was a Small Data that preceded it. Let’s clarify a few things first.
Where is the cut line?
Relational databases have historically been the place where digital information is kept in a structured format. Other, unstructured forms of data include website traffic logs, videos, and the internet at large, but let’s just focus on databases for now. From both a software development and a data analytics standpoint, I’d say the cut line between Big Data and Small Data is about 1 million records. That’s the point where common software products start to lose steam. Microsoft Excel (before the 2007 version) abruptly cuts off at 65,536 rows. Microsoft Access starts to chug at about 100,000 records. With MySQL, a popular open source database, 1 million records is the point where things slow down without an efficient indexing strategy or some server configuration tweaks. So to me, it’s a safe bet that you’ll need specialized software, systems, or expertise to proceed beyond that 1 million mark.
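To make that concrete, here’s a minimal sketch of the kind of workaround you start needing past those limits: streaming a file in fixed-size chunks so memory use stays flat no matter how many records it holds. This uses only the Python standard library; the file path, column name, and chunk size are illustrative assumptions, not a real data set.

```python
import csv
from itertools import islice

def chunked_total(path, amount_col="amount", chunk_size=100_000):
    """Stream a CSV in fixed-size chunks, so a file with millions of
    records never has to fit in memory all at once."""
    total, rows_seen = 0.0, 0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            # Pull the next bite-sized chunk off the stream.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            total += sum(float(row[amount_col]) for row in chunk)
            rows_seen += len(chunk)
    return total, rows_seen
```

A spreadsheet chokes when the row count exceeds its ceiling; a streaming approach like this only ever holds one chunk at a time, so the record count stops mattering.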
Think Like the Bad Guys
If you’re doing fraud detection, tax audit discovery, or another type of analysis where someone is trying to “beat the system”, you’ll have to approach it in a way that makes the fraud stick out like a sore thumb. For example, if someone is trying to sell a car that he knows is a lemon and doesn’t want to attract attention, he is going to advertise it just far enough below market value to tempt his victim into buying it. That same fraudster is simultaneously trying to avoid suspicion: he won’t set the price so low that it’s a dead giveaway of a scam. The mentality is that you have to go deeper into the analysis than the bad guys do to uncover the fraud. It requires you to think creatively, and it’s a constant cat-and-mouse game, because the rules are being changed all the time.
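To illustrate the idea, here’s a hedged Python sketch that flags listings priced in the tempting-but-not-absurd band below market value – the zone our hypothetical lemon-seller aims for. The 0.70–0.95 band is an illustrative guess, not a calibrated fraud-detection threshold.

```python
def suspicious_listings(listings, market_value, low=0.70, high=0.95):
    """Flag asking prices in the band just below market value:
    tempting enough to hook a victim, but not so cheap that the
    discount itself is a dead giveaway.  The band bounds are
    illustrative assumptions, not calibrated values."""
    flagged = []
    for listing_id, price in listings:
        ratio = price / market_value
        if low <= ratio < high:
            flagged.append(listing_id)
    return flagged

# Hypothetical used-car listings against a $10,000 market value.
cars = [("honest", 9_900), ("lemon", 8_500), ("obvious_scam", 3_000)]
hits = suspicious_listings(cars, market_value=10_000)  # -> ["lemon"]
```

Note the asymmetry: the honestly priced car and the absurdly cheap one both pass through, while the one priced to tempt without alarming gets a second look. That’s the “go deeper than the bad guys” mentality in miniature.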
Break down the problem
The classic and often clichéd riddle “How do you eat an elephant?” has an obvious yet profound answer: “One bite at a time”. If you are stuck doing analytics with Small Data, you can still achieve reasonable results by breaking things down. Many ballistics calculations used by the Allies in World War II were made by groups of women who were good with mathematics, each with her own task to perform. Imagine how many combinations of numbers they had to crunch, with no hard drives or RAM to do it. They also had to share the ENIAC, the giant computer of its day, whose power is dwarfed by that of your cell phone. It’s easy to break down a data set: you can split your data by first letter, by state, or by month. Just make sure you reallocate the data in a different way from time to time, to avoid biasing your own perspective on it.
Once it’s a chore, scale it up
The first data set where you discover something worth reporting, you’re ecstatic about it. Now your superiors want it in their monthly reports. The second set will still be exciting, because you didn’t think of this or that obscure possibility: maybe last month’s data went one way and this month’s goes another, so you rejig your analysis engine to cover those possibilities. The third set is just the press of a button, and it’s turning into a chore. All the drive to discover the trend and report on it has passed, and the work now needs to be folded into a standard routine. Your managers are still just as excited about the data, because it helps them make decisions, but you’re not – you’re a data scientist. You thrive on the challenge of solving new problems and discovering things, not on rote mechanical actions. So at this point you can benefit from a scaled-up, Big Data approach.
Big Data technologies specialize in automating the process of breaking data down, applying a known, predictable algorithm to each piece, and recombining the outputs into a report or some conclusions. So you really need to mimic each stage manually first, to understand how the machine is going to do it. Don’t delude yourself into thinking there won’t be hiccups in introducing automation, but at this point it’s worth the trade-off: either keep spending time on tedious manual analysis, or let the machine do what it does best – crunch data over and over and never get tired of it. It’s key to make sure your analysis methods are sound before you scale up, because that scale-up effort will have its own formidable challenges, such as tuning, resource optimization, and bandwidth limitations.
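That break-down / apply / recombine cycle can be mimicked manually in a few lines, which is exactly the rehearsal I’m suggesting before you scale up. This is only a single-process Python sketch – real Big Data frameworks distribute the shards across machines – and the mean calculation stands in for whatever “known, predictable algorithm” your analysis actually uses.

```python
def split(data, n_parts):
    """Break the data set into roughly equal shards."""
    return [data[i::n_parts] for i in range(n_parts)]

def apply_to_shard(shard):
    """The known, predictable algorithm, run on one shard.
    Here: a partial sum and count toward a mean."""
    return sum(shard), len(shard)

def combine(partials):
    """Recombine the per-shard outputs into one conclusion."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

data = list(range(1, 101))                    # stand-in data set
partials = [apply_to_shard(s) for s in split(data, 4)]
result = combine(partials)                    # mean of 1..100 = 50.5
```

Because each shard’s result is a small, self-contained pair, the shards could be processed on different machines and combined later – which is the whole trick a Big Data platform automates for you.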
Hopefully this has convinced you that, despite the buzzwords and all the marketing, Big Data is just Small Data replicated across a network.