Plagiarism detection using HBase: How a data-rich application finally found a Big-Data solution.


I am Wiebe Ruiter. I am the owner of Docner Software. We are a small java-and-internet shop that produces custom solutions 
to clients in Europe and Brazil. 
In my european past, I have worked as an independent consultant at dutch banks, the dutch largest telecom operator and at 
Vodafone, Europe's largest supplier of mobile telephony. February 2011 I came to Brazil because I figured out that a small 
country like Holland would not need much large scale systems and that my skills would probably fit better to a big country 
like Brazil. And also a bit because the best place to practice kite-surfing is at the northern coast of Ceará. 

I am here to share some experience I built up during a sequence of projects that I did for a dutch company that sells 
plagiarism detection services. With them, I have seen and taken big part in the construction of several systems - moving 
from a single monolithic server with a central SQL database to a sharded SQL architecture to a fully HBase based solution. 
Every step on the way was taken because we run into practical issues with the previous one.  


1. Plagiarism detection technology needs data

The solution for all those 3 alternatives I saw does the same thing for the client: you can upload documents that you want 
to check. The system does a big magical black-box thing and a while later you have a report that tells you which parts of 
the document have a striking similarity with stuff found somewhere else. Somewhere else can be the internet and it can be a 
big collection of in-house produced documents, like an electronic university thesis library.

Basically, the big magic black box splits your document into parts and goes to all those sources to see if any of those 
sources has documents that match those parts. 

With the existence of big search engines this seems to be a simple task. However, it is not so simple. The biggest search 
engine we all know does not sell its searches and it limits the amount you can do for free. And, a lot of the documents we 
are interested in are private property and you need to log in somewhere to get to them. 

So in addition to doing internet search, which is a technical challenge all on its own, my client was continuously scanning 
the internet building up its own internet library. Focused on industry-specific targets, like the websites smart students 
have set up to share the 'works' they made for school. The cheater cheated, so to say. But also the university website that 
a wealth of pdf files available after logging in. Stuff like that. According to the agreement with a university, this data 
could be going into the general internet library or into a university-specific one that is only used for plagiarism 
detection for that university. Also, as the client was selling throughout Europe and there are some legal restrictions, 
there are language- and country based libraries too.

Ultimately, this boils down to doing basically the same thing that major search engines do: scan the web, copy what is 
interesting, index it. Smaller, because the goal is not to index the whole web - search engines do that already - but also 
much deeper; universities know their data and you cannot afford to miss one document. 
Scanning one website could easily yield 80 Gb of data; and scanning hundreds thousands of websites was adding up quickly. 

And of course there is also the client's client data: the documents that are uploaded. And the reports that show the level 
of plagiarism ( or not ). Let me give you a calculation that I did when we started construction of the second-generation 
system; the sharded SQL solution:

5 Search results and comparisons per uploaded document ( conservative )
30 Teachers per organization
1000 Documents uploaded and checked for plagiarism per teacher
20000 Organisations that use the system ( this is realistic, but changed from reality )
That is 3.000.000.000 Pointers to uploaded documents and 15.000.000.000 comparison reports. Mind you, this is and was not 
the amount of data they were processing at that time, but it was the capacity they estimated necessary to cover their 
clients needs in 2013.


2. Plagiarism detection technology needs serious technology

When I started working with the client they were running one really big WIndows server with a SQL server based solution. It 
was slow, expensive to expand and clients were not happy; upload times and queue times were horrible. Sometimes several 
days to produce comparison reports. 

They were working on the document processing pipeline when I started there and we created a distributed queue-like system 
to perform the magic black box steps. I produced for them also a separate system to scan websites and add that to their own 
libraries. It was primarily this sequence of projects that run into most data issues. I literally burned SDD drives because 
we were writing data faster and more intensely then they could handle. We tried and discarded virtualization because we 
needed all the power that bare iron could give us. We tried a SAN to store data but the single node that connected it to 
the network could not process the data. We literally brought a university website down unintentionally running performance 
tests on our website downloader - it was actually a client of the client.. interesting situation. 

We succeeded downloading all the data. We succeeded building up separate indexes containing the content. But somehow we 
always got into trouble when we brought all data together: combining the indexes; combining the index with the document 
data storage, … there was always a single point in the system that could not handle the pressure and got overloaded. 
Removing the bottlenecks step by step was a long process. 

For the particular combination of functionalities that was necessary for the educational market it did finally end up in a 
working and relatively efficient process. Then there was a new ambition to enter a new market and to build a system that 
would scan for plagiarism not one time, but continuously. Imagine taking the number we already saw and multiply that with a 
serious number of scans: every day, every week...  


3. Plagiarism detection technology can all be handled by HBase and it is a perfect fit

The sharded SQL system with the separate web scanning solution had grown into a forest of nodes and specialized services, 
coordinated by two central 'queue managers'. The system was complex. Actually, this seemed necessary as the technology was 
complex. Right ?

Wrong. The complexity had nothing to do with the business requirements. It was a function of the technology used: always 
working around the next single choking point. After I moved my company to Brasil, I kept working with the same customer. In 
june 2011 they came to Cumbuco to draft our new architecture. We literally sat down at the beach, drunk a beer and wrote a 
3 page document that contained the outline for this new solution. 

Putting all data in HBase, you get scalability for free. Using a 'stateless' web app architecture that is of course not 
really stateless, it merely removes the state from a particular node running some instance of the web application, we came 
to a simple architecture. We had 'hbase' machines that would run an Hadoop instance together with an HBase instance to get 
the locality advantage. And we had 'webapp' machines that would manage the process and serve pages, all-in-one.    
Put a good load balancing solution in front of the webapp machines and you have a solution that is endlessly scalable… Got 
too much data ? Plug in more hbase machines, manage the splitting, done. Got too much load on the webapp ? Plug in more 
machines, configure the load balancer, solved. The final bottleneck could be the network but fiber in the datacenter goes a 
long way.

We started the project in september 2011. It was completely built in Brazil and we delivered a full working prototype in 
december 2011. In january 2012 we concluded the work and it moved to the business team to figure out how to sell it and 
make it profitable.

We were amazed by the ease of creating something that works - and works reliably. We also learned that while this 
architecture is simple and easy on the application level, it does require a good level of engineering at the infrastructure 
level: all parts of the system must co-operate and you'll spend some serious amount of hours tuning your setup. The DNS 
must be perfect and perfectly fast. Your linux needs to be tweaked. You can indeed use cheap machines - get ready for some 
burned machines though. Get memory - lots of memory. Monitor the health of your machines. Monitor open sockets on you 
machines. 

The lack of 'frameworks' to program your stuff proved (again) to be a blessing. Solutions are straightforward and simple. 
Yes, you will need to get your hands dirty and worry about serialization of your data. Set up a good unit test 
infrastructure and you will move faster than when trying to figure out why this magic framework is not persisting the data 
as you expect.

Clearly HBase is not as mature as most SQL based solutions. There are some almost brain-dead issues, like a central 
Zookeeper that, in its shutdown procedure, spins endlessly waiting for a slave to signal its death. This signal will never 
arrive anymore however, because the slave is already dead. At least in the version we worked with this was a serious 
annoyance during deveiopment. But development is moving fast and no doubt this will improve.

And there is quite an interesting paradox here: while throwing gigabytes around gets to be real easy, when you want the 
best performance for real-time requests, you will spend quite some time tuning your key structure and the amount of bytes 
there to have success.. You have to think large and small at the same time; HBase is no 'automatic framework' to solve 
performance problems. And how do you solve those issues ? By creating more data.. just create a new index besides the old 
one and work with that. Map-reduce to the rescue. Exit normalization. Designing the application gets to be an interesting 
game playing with request-based and batch processing, to balance responsiveness and functionality for your web users. 

Our conclusion:

For Docner, HBase is an important and very versatile tool. There is still a lot to discover but having done the full 
development-test-deployment roundtrip for this HBase-only project, our confidence in the tool has grown. While we do feel 
that it is not the tool to use always and everywhere, HBase has pleasantly surprised us and we will use it more often for 
sure.