Scalable web sites
with SQL sharding
Sharding is a well-known technique for scaling up an existing application.
It is particularly suited to an organization that has already invested heavily in traditional SQL techniques, and it appeals to organizations where relying on proven technology is essential to build the shared confidence needed to get a project online
(see also our Scalable Web Application projects).
SQL Sharding
Sharding is a natural step when an application is slowing down under load.
As thousands of rows per table grow into hundreds of thousands, you will first try to solve this
by optimizing table and field indexes.
Then developers start doing query 'optimization': rewriting queries to steer the database's
automatic query optimizer towards a better plan.
At that point we have already left the space of 'implementation independent' application development.
And it only gets worse: after this there is a host of database-dependent measures
to be tried, such as adding memory to table index buffers. This becomes
more and more cumbersome, with less and less gain per 'improvement'.
Then the weak point of classical SQL databases shows:
the centralized table index is a bottleneck. And at this point, your SQL is still structured,
but not very standard anymore.
Splitting it up
Sharding basically breaks the index up into smaller sub-indices and places these sub-indices on several host machines.
This can seriously reduce the index size and thus the seek time within one shard.
It is an easy step, both logically and technically, but there are some things to consider.
The most important change: you need to decide which shard to use when your application needs data.
This means deciding how to split up your data. We had a project where our client provides
a service to other organizations, so we could split the big dataset into chunks per organization.
The project was running on MySQL, where adding a database is easy and carries little overhead.
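To make the routing decision concrete, here is a minimal sketch of the kind of lookup involved.
The names (ShardRouter, organization IDs mapped to JDBC URLs) are illustrative assumptions, not the actual code from this project.

    // Hypothetical sketch: route each organization to its own shard (database).
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ShardRouter {

        // Maps an organization id to the JDBC URL of the database that holds its data.
        private final Map<String, String> shardUrlByOrganization = new ConcurrentHashMap<>();

        public void register(String organizationId, String jdbcUrl) {
            shardUrlByOrganization.put(organizationId, jdbcUrl);
        }

        // Every data access starts here: given the organization, return its shard.
        public String shardUrlFor(String organizationId) {
            String url = shardUrlByOrganization.get(organizationId);
            if (url == null) {
                throw new IllegalStateException("No shard registered for " + organizationId);
            }
            return url;
        }
    }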
Working with JPA
We decided to put every organization's document data in its own database. To keep this manageable, we wrote code that
automatically creates such a database, using a simple resource manager to allocate each new organization database
to one database server from a known set of servers. This does move more responsibility to the application developer,
as the application not only needs to create its own databases, it also needs to create the correct indices.
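As an illustration, a resource manager of this kind can be as simple as the sketch below. The round-robin strategy and the names are assumptions made for the example; the actual project may have allocated databases differently (for instance by load or free disk space).

    // Hypothetical sketch: allocate each new organization database to one of a known set of servers.
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public class DatabaseServerAllocator {

        private final List<String> serverHosts;          // known set of database servers
        private final AtomicInteger next = new AtomicInteger();

        public DatabaseServerAllocator(List<String> serverHosts) {
            this.serverHosts = serverHosts;
        }

        // Assumption: plain round-robin is good enough to spread organizations over servers.
        public String allocateServerFor(String organizationId) {
            int index = Math.floorMod(next.getAndIncrement(), serverHosts.size());
            return serverHosts.get(index);
        }
    }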
Up to that point, we sought to stay as close as possible to 'known' Java web application development practices.
We were building the web application on top of the EclipseLink JPA implementation.
Creating the database through JPA turned out to be a pretty painless, automatic affair; we only needed to add
some logic to 'mix in' a few commands that set the indices right.
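A sketch of what that can look like, assuming EclipseLink's DDL generation is used to create the tables and plain
CREATE INDEX statements are mixed in afterwards; the persistence unit name, table, and index are made up for the example.

    // Hypothetical sketch: let EclipseLink generate the tables for a fresh shard,
    // then 'mix in' the index statements the generator does not produce for us.
    import java.util.HashMap;
    import java.util.Map;
    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;

    public class ShardSchemaCreator {

        public void createSchema(String jdbcUrl, String user, String password) {
            Map<String, String> props = new HashMap<>();
            props.put("javax.persistence.jdbc.url", jdbcUrl);
            props.put("javax.persistence.jdbc.user", user);
            props.put("javax.persistence.jdbc.password", password);
            // EclipseLink creates the tables for the mapped entities in this database.
            props.put("eclipselink.ddl-generation", "create-tables");
            props.put("eclipselink.ddl-generation.output-mode", "database");

            // "documents" is an assumed persistence unit name defined in persistence.xml.
            EntityManagerFactory emf = Persistence.createEntityManagerFactory("documents", props);
            EntityManager em = emf.createEntityManager();
            try {
                em.getTransaction().begin();
                // The 'mixed in' part: an index the DDL generation does not set up.
                em.createNativeQuery(
                        "CREATE INDEX idx_document_created ON document (created_at)")
                  .executeUpdate();
                em.getTransaction().commit();
            } finally {
                em.close();
                emf.close();
            }
        }
    }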
The most important change to EclipseLink was making it switch between shards in a 'transparent' way.
This was logically simple and a prototype of this solution was up and running in a matter of hours.
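In broad strokes, that switching can look like the sketch below: one EntityManagerFactory per shard, selected by
organization before each unit of work. This is an illustrative sketch under that assumption, not the project's actual
implementation, and the names are made up. Note that each factory keeps its own connection and mapping metadata,
which is exactly where the memory problems described next came from.

    // Hypothetical sketch: pick the right EntityManagerFactory for the organization at hand.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;

    public class ShardEntityManagers {

        // One factory per shard; in our setup that meant one per organization database.
        private final Map<String, EntityManagerFactory> factories = new ConcurrentHashMap<>();

        public void register(String organizationId, EntityManagerFactory emf) {
            factories.put(organizationId, emf);
        }

        // Called at the start of a request; the rest of the code just sees an EntityManager.
        public EntityManager entityManagerFor(String organizationId) {
            return factories.get(organizationId).createEntityManager();
        }
    }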
The trouble started when we had filled several shards with data and then ran queries on all of them concurrently.
EclipseLink caches so much metadata about each database connection and its related classes
that switching between many database connections quickly made us run out of memory.
We never found out how to turn off that caching, and we decided it probably would not be a good idea anyway.
So we ended up coding the cross-shard queries in plain JDBC and mapping the fields to objects ourselves.
That was a bit more work initially, but in the end we got a much more predictable and much faster result,
which was also remarkably resource efficient, no matter the number of documents or the number of shards.
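A minimal sketch of that plain-JDBC approach, assuming a hypothetical document table and result class; the real
queries, credentials handling, and mapping were of course specific to the project.

    // Hypothetical sketch: run the same query against every shard and merge the results.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.util.ArrayList;
    import java.util.List;

    public class CrossShardDocumentQuery {

        // Credentials are assumed to be part of the JDBC URLs, for brevity.
        public List<Document> findByTitle(List<String> shardJdbcUrls, String title) throws SQLException {
            List<Document> result = new ArrayList<>();
            for (String url : shardJdbcUrls) {
                try (Connection connection = DriverManager.getConnection(url);
                     PreparedStatement statement = connection.prepareStatement(
                             "SELECT id, title, created_at FROM document WHERE title = ?")) {
                    statement.setString(1, title);
                    try (ResultSet rows = statement.executeQuery()) {
                        while (rows.next()) {
                            // Map the fields to objects ourselves, no JPA involved.
                            result.add(new Document(
                                    rows.getLong("id"),
                                    rows.getString("title"),
                                    rows.getTimestamp("created_at")));
                        }
                    }
                }
            }
            return result;
        }

        // Minimal result type for the example.
        public static class Document {
            final long id;
            final String title;
            final Timestamp createdAt;

            public Document(long id, String title, Timestamp createdAt) {
                this.id = id;
                this.title = title;
                this.createdAt = createdAt;
            }
        }
    }

The loop above visits the shards one by one; spreading the work over a small thread pool is a straightforward next step
if the shards should be queried in parallel, and the mapping stays the same.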
Safe and simple
In normal day-to-day coding, the only thing that was really different was that the developer had to understand
that relations between data in different shards were no longer automatic and that data consistency across shards
needed to be managed explicitly. In our case this did not pose any serious issues, as the split was such that
there were no data relations from shard to shard.
Knowing what we know now, we would probably put our next project in HBase instead of
coding the 'distribution' logic ourselves, as we had to do here to distribute databases over database servers.
But there are good and practical reasons to stick with SQL.
Not just because of investments in existing infrastructure: when you need to query your data
from many angles and answer user queries immediately, SQL is still a practical and stable tool.
And sharding can help you scale up a long way.