Monday, January 2, 2012

How I Deal with Lots of Data

Just for some context, by "lots of data", I mean a couple hundred million rows in a table joined to other tables of similar size. By no means is it a lot compared to what other people deal with, but it's certainly enough to warrant some forethought instead of just diving right in.

Late last year, I was tasked with a few one-off reports that required me to summarise data spread across a couple hundred million rows in a database. Knowing it was going to be a long process to retrieve the data for these reports, I had to come up with a game plan to do it as quickly and as efficiently as possible. Along the way, as always happens when handling relatively large quantities of data, I learned a few things. None of these lessons/ideas are new or original, but I wanted to write them down somewhere, and here feels like as good a place as any.

The first big win was extracting the data I needed (which, in one case, was a subset of a couple of ~300 million row tables) into a local database. That meant I was free to modify the data (like cleaning up dirty, inconsistent spellings of suburbs, states and countries) and modify the schema (like adding indexes which, because of the odd nature of the reports, the production databases had never needed).
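
To make that concrete, here's a rough sketch of the kind of extraction script I mean. The connection details, table and column names are all made up, and I've used SQLite for the local copy purely as an example; the general shape is what matters.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Hypothetical connection details, table and column names.
    my $prod  = DBI->connect('dbi:mysql:database=prod;host=db1', 'reporter', 'secret',
                             { RaiseError => 1 });
    my $local = DBI->connect('dbi:SQLite:dbname=report.db', '', '',
                             { RaiseError => 1, AutoCommit => 0 });

    $local->do('CREATE TABLE IF NOT EXISTS orders
                (id INTEGER, suburb TEXT, state TEXT, country TEXT, total REAL)');

    # Pull only the columns and rows the report actually needs.
    my $src = $prod->prepare('SELECT id, suburb, state, country, total
                              FROM orders WHERE created >= ?');
    my $dst = $local->prepare('INSERT INTO orders (id, suburb, state, country, total)
                               VALUES (?, ?, ?, ?, ?)');

    $src->execute('2011-01-01');
    while (my $row = $src->fetchrow_arrayref) {
        $dst->execute(@$row);
    }
    $local->commit;

    # The copy is now mine to mess with: clean up the data and add the
    # index the production schema never needed.
    $local->do(q{UPDATE orders SET state = 'NSW'
                 WHERE state IN ('N.S.W.', 'New South Wales')});
    $local->do('CREATE INDEX idx_orders_location ON orders (suburb, state, country)');
    $local->commit;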

Assuming you don't need schema changes, that local store doesn't even have to be a relational database. CSV files work perfectly well a lot of the time. For a couple of the reports I wrote a handful of scripts, the first of which pulled the data out of the MySQL database, performed some simple operations on it and wrote it out to a CSV file. The other scripts that needed to operate on the same data could then easily (thanks to Text::CSV_XS) read the CSV, which was a lot quicker than reading it from a relational database.
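
A stripped-down version of that first script might look something like this (again, the table, column and connection details are invented for the example):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use Text::CSV_XS;

    my $dbh = DBI->connect('dbi:mysql:database=prod;host=db1', 'reporter', 'secret',
                           { RaiseError => 1 });
    my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

    open my $fh, '>', 'orders.csv' or die "orders.csv: $!";
    $csv->print($fh, [qw(id suburb state country total)]);   # header row

    my $sth = $dbh->prepare('SELECT id, suburb, state, country, total
                             FROM orders WHERE created >= ?');
    $sth->execute('2011-01-01');

    while (my $row = $sth->fetchrow_arrayref) {
        # The "simple operations" step: tidy the values up before writing.
        my @out = @$row;
        $_ = defined $_ ? $_ : '' for @out;
        $out[1] =~ s/^\s+|\s+$//g;    # trim the suburb
        $out[2] = uc $out[2];         # e.g. 'nsw' -> 'NSW'
        $csv->print($fh, \@out);
    }
    close $fh or die "orders.csv: $!";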

Why do CSVs lend themselves so nicely to this kind of task? Because with reports like these, in my experience, you very rarely perform complicated operations on the data after extracting it from the original data source(s); you just want to suck the data up, summarise it, output the summary, and then output the nitty-gritty details on subsequent pages or in a separate file.
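
The follow-on scripts end up just as simple. Something along these lines (still using the made-up orders.csv from the previous example) streams the detail rows into a separate file while building the summary in memory, then prints the summary at the end:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

    open my $in,  '<', 'orders.csv'  or die "orders.csv: $!";
    open my $out, '>', 'details.csv' or die "details.csv: $!";

    my $header = $csv->getline($in);          # reuse the header row
    $csv->print($out, $header);

    my %total_by_state;
    while (my $row = $csv->getline($in)) {
        my ($id, $suburb, $state, $country, $total) = @$row;
        $total_by_state{$state} += $total;    # the summary
        $csv->print($out, $row);              # the nitty gritty, streamed straight out
    }
    close $in;
    close $out or die "details.csv: $!";

    printf "%-20s %12.2f\n", $_, $total_by_state{$_} for sort keys %total_by_state;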

An obvious advantage of storing the data like this is the speed at which you can retrieve and process it. That improvement made a huge difference for me because I like to run my scripts very often throughout development, no matter how small the change.

Of course, depending on the size of the data (in bytes), extracting it into a local database of some sort may not always be possible.

The last big win was not using an object-relational mapper (Class::DBI in this case). ORMs are great a lot of the time and save on code and development time, but when you're iterating over millions of rows, the overhead of inflating every row into an object adds up and everything runs much slower than it should.
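
To illustrate what I mean rather than show the actual report code, compare a hypothetical Class::DBI-style search with a plain DBI loop over the same (made-up) table:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=prod;host=db1', 'reporter', 'secret',
                           { RaiseError => 1 });

    # With Class::DBI you'd write something like
    #   my @orders = My::Order->search(country => 'AU');
    # and get back a fully inflated object for every single row. For a few
    # million rows, plain DBI with fetchrow_arrayref does the same job with
    # far less overhead: one reused array ref, no objects, no accessors.
    my $sth = $dbh->prepare('SELECT id, suburb, state, total
                             FROM orders WHERE country = ?');
    $sth->execute('AU');

    my $grand_total = 0;
    while (my $row = $sth->fetchrow_arrayref) {
        $grand_total += $row->[3];
    }
    printf "Grand total: %.2f\n", $grand_total;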

That's all I can think of now, a few months later.