Plagiarism detection using HBase: How a data-rich application finally found a Big-Data solution. I am Wiebe Ruiter. I am the owner of Docner Software. We are a small java-and-internet shop that produces custom solutions to clients in Europe and Brazil. In my european past, I have worked as an independent consultant at dutch banks, the dutch largest telecom operator and at Vodafone, Europe's largest supplier of mobile telephony. February 2011 I came to Brazil because I figured out that a small country like Holland would not need much large scale systems and that my skills would probably fit better to a big country like Brazil. And also a bit because the best place to practice kite-surfing is at the northern coast of Ceará. I am here to share some experience I built up during a sequence of projects that I did for a dutch company that sells plagiarism detection services. With them, I have seen and taken big part in the construction of several systems - moving from a single monolithic server with a central SQL database to a sharded SQL architecture to a fully HBase based solution. Every step on the way was taken because we run into practical issues with the previous one. 1. Plagiarism detection technology needs data The solution for all those 3 alternatives I saw does the same thing for the client: you can upload documents that you want to check. The system does a big magical black-box thing and a while later you have a report that tells you which parts of the document have a striking similarity with stuff found somewhere else. Somewhere else can be the internet and it can be a big collection of in-house produced documents, like an electronic university thesis library. Basically, the big magic black box splits your document into parts and goes to all those sources to see if any of those sources has documents that match those parts. With the existence of big search engines this seems to be a simple task. However, it is not so simple. The biggest search engine we all know does not sell its searches and it limits the amount you can do for free. And, a lot of the documents we are interested in are private property and you need to log in somewhere to get to them. So in addition to doing internet search, which is a technical challenge all on its own, my client was continuously scanning the internet building up its own internet library. Focused on industry-specific targets, like the websites smart students have set up to share the 'works' they made for school. The cheater cheated, so to say. But also the university website that a wealth of pdf files available after logging in. Stuff like that. According to the agreement with a university, this data could be going into the general internet library or into a university-specific one that is only used for plagiarism detection for that university. Also, as the client was selling throughout Europe and there are some legal restrictions, there are language- and country based libraries too. Ultimately, this boils down to doing basically the same thing that major search engines do: scan the web, copy what is interesting, index it. Smaller, because the goal is not to index the whole web - search engines do that already - but also much deeper; universities know their data and you cannot afford to miss one document. Scanning one website could easily yield 80 Gb of data; and scanning hundreds thousands of websites was adding up quickly. And of course there is also the client's client data: the documents that are uploaded. And the reports that show the level of plagiarism ( or not ). Let me give you a calculation that I did when we started construction of the second-generation system; the sharded SQL solution: 5 Search results and comparisons per uploaded document ( conservative ) 30 Teachers per organization 1000 Documents uploaded and checked for plagiarism per teacher 20000 Organisations that use the system ( this is realistic, but changed from reality ) That is 3.000.000.000 Pointers to uploaded documents and 15.000.000.000 comparison reports. Mind you, this is and was not the amount of data they were processing at that time, but it was the capacity they estimated necessary to cover their clients needs in 2013. 2. Plagiarism detection technology needs serious technology When I started working with the client they were running one really big WIndows server with a SQL server based solution. It was slow, expensive to expand and clients were not happy; upload times and queue times were horrible. Sometimes several days to produce comparison reports. They were working on the document processing pipeline when I started there and we created a distributed queue-like system to perform the magic black box steps. I produced for them also a separate system to scan websites and add that to their own libraries. It was primarily this sequence of projects that run into most data issues. I literally burned SDD drives because we were writing data faster and more intensely then they could handle. We tried and discarded virtualization because we needed all the power that bare iron could give us. We tried a SAN to store data but the single node that connected it to the network could not process the data. We literally brought a university website down unintentionally running performance tests on our website downloader - it was actually a client of the client.. interesting situation. We succeeded downloading all the data. We succeeded building up separate indexes containing the content. But somehow we always got into trouble when we brought all data together: combining the indexes; combining the index with the document data storage, … there was always a single point in the system that could not handle the pressure and got overloaded. Removing the bottlenecks step by step was a long process. For the particular combination of functionalities that was necessary for the educational market it did finally end up in a working and relatively efficient process. Then there was a new ambition to enter a new market and to build a system that would scan for plagiarism not one time, but continuously. Imagine taking the number we already saw and multiply that with a serious number of scans: every day, every week... 3. Plagiarism detection technology can all be handled by HBase and it is a perfect fit The sharded SQL system with the separate web scanning solution had grown into a forest of nodes and specialized services, coordinated by two central 'queue managers'. The system was complex. Actually, this seemed necessary as the technology was complex. Right ? Wrong. The complexity had nothing to do with the business requirements. It was a function of the technology used: always working around the next single choking point. After I moved my company to Brasil, I kept working with the same customer. In june 2011 they came to Cumbuco to draft our new architecture. We literally sat down at the beach, drunk a beer and wrote a 3 page document that contained the outline for this new solution. Putting all data in HBase, you get scalability for free. Using a 'stateless' web app architecture that is of course not really stateless, it merely removes the state from a particular node running some instance of the web application, we came to a simple architecture. We had 'hbase' machines that would run an Hadoop instance together with an HBase instance to get the locality advantage. And we had 'webapp' machines that would manage the process and serve pages, all-in-one. Put a good load balancing solution in front of the webapp machines and you have a solution that is endlessly scalable… Got too much data ? Plug in more hbase machines, manage the splitting, done. Got too much load on the webapp ? Plug in more machines, configure the load balancer, solved. The final bottleneck could be the network but fiber in the datacenter goes a long way. We started the project in september 2011. It was completely built in Brazil and we delivered a full working prototype in december 2011. In january 2012 we concluded the work and it moved to the business team to figure out how to sell it and make it profitable. We were amazed by the ease of creating something that works - and works reliably. We also learned that while this architecture is simple and easy on the application level, it does require a good level of engineering at the infrastructure level: all parts of the system must co-operate and you'll spend some serious amount of hours tuning your setup. The DNS must be perfect and perfectly fast. Your linux needs to be tweaked. You can indeed use cheap machines - get ready for some burned machines though. Get memory - lots of memory. Monitor the health of your machines. Monitor open sockets on you machines. The lack of 'frameworks' to program your stuff proved (again) to be a blessing. Solutions are straightforward and simple. Yes, you will need to get your hands dirty and worry about serialization of your data. Set up a good unit test infrastructure and you will move faster than when trying to figure out why this magic framework is not persisting the data as you expect. Clearly HBase is not as mature as most SQL based solutions. There are some almost brain-dead issues, like a central Zookeeper that, in its shutdown procedure, spins endlessly waiting for a slave to signal its death. This signal will never arrive anymore however, because the slave is already dead. At least in the version we worked with this was a serious annoyance during deveiopment. But development is moving fast and no doubt this will improve. And there is quite an interesting paradox here: while throwing gigabytes around gets to be real easy, when you want the best performance for real-time requests, you will spend quite some time tuning your key structure and the amount of bytes there to have success.. You have to think large and small at the same time; HBase is no 'automatic framework' to solve performance problems. And how do you solve those issues ? By creating more data.. just create a new index besides the old one and work with that. Map-reduce to the rescue. Exit normalization. Designing the application gets to be an interesting game playing with request-based and batch processing, to balance responsiveness and functionality for your web users. Our conclusion: For Docner, HBase is an important and very versatile tool. There is still a lot to discover but having done the full development-test-deployment roundtrip for this HBase-only project, our confidence in the tool has grown. While we do feel that it is not the tool to use always and everywhere, HBase has pleasantly surprised us and we will use it more often for sure.