Ephorus Fetch Federation
The Ephorus Fetch Federation project created a
federation of web spidering services, coordinated by a centralized
controller. The project processes long lists of web sites and
builds an index, using an in principle unbounded number of machines to
retrieve, analyse and index text data. The controller distributes
work over many machines, assigning the next site to be fetched to a
machine that is not actively working. The fetching process is
self-sustaining, yet it allows operator intervention, for instance
to add filters dynamically. Because of
the sheer size of the indexed data, persistence is completely
distributed and NoSQL-based. This project was created using core Java
for the actual worker services and Servlet API 2.5 / Jersey REST for
the front-end. The controller uses a multicast-based service lookup
borrowed from the Jini infrastructure; all other intra-service
communication, including starting and stopping, is done using REST
APIs. All data and indexes are stored in the Hadoop HBase NoSQL
database. The Fetch Federation went live in the last
week of June 2011.
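The way the controller hands sites to idle machines can be sketched
as below. This is a minimal illustration, not the project's code:
the class FetchController and all of its method names are
hypothetical, and the REST call that would start a worker is only
indicated by a comment.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: names are illustrative, not taken from the Ephorus code base.
final class FetchController {
    // Workers that are currently not fetching anything.
    private final Queue<String> idleWorkers = new ArrayDeque<>();
    // Worker id -> site that worker is currently fetching.
    private final Map<String, String> busyWorkers = new HashMap<>();
    // Sites still waiting to be fetched, in submission order.
    private final Queue<String> pendingSites = new ArrayDeque<>();

    void registerWorker(String workerId) {
        idleWorkers.add(workerId);
    }

    void submitSite(String site) {
        pendingSites.add(site);
    }

    // Pairs pending sites with idle workers until one of the two runs out.
    // Returns the assignments made in this round (worker id -> site).
    Map<String, String> dispatch() {
        Map<String, String> started = new LinkedHashMap<>();
        while (!idleWorkers.isEmpty() && !pendingSites.isEmpty()) {
            String worker = idleWorkers.poll();
            String site = pendingSites.poll();
            busyWorkers.put(worker, site);
            // A real controller would send a "start" request to the worker here.
            started.put(worker, site);
        }
        return started;
    }

    // Called when a worker reports that it has finished its site.
    void markFinished(String workerId) {
        if (busyWorkers.remove(workerId) != null) {
            idleWorkers.add(workerId);
        }
    }

    int pendingCount() {
        return pendingSites.size();
    }
}
```

Calling dispatch() again after every markFinished() keeps the
process self-sustaining: each worker that becomes idle immediately
receives the next pending site, if there is one.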
Ephorus Teacher-UI
The Ephorus Teacher-UI is the inbox for a
teacher to monitor the plagiarism detection process for documents
submitted by students. It connects to a distributed document
processing pipeline. Because of the massive number of documents
uploaded and scanned for plagiarism, the system uses a technique
called 'sharding': it handles millions of uploads and downloads and
distributes document metadata over several database servers in a
deterministic and intuitive manner. It uses cloud storage for actual
document content. Besides being a web UI, it generates reports in PDF
and statistics in Excel format. This project was created based on
Servlet API 2.5, MySQL, a modified version of EclipseLink JPA to
support sharding, plain SQL for performance-sensitive operations,
WS-REST using Jersey, and plain JavaScript and jQuery. The project
has been handed over to the Ephorus product development group for
further integration with the Ephorus pipeline and to start migration
and the ramp-up process.
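Deterministic sharding of document metadata can be illustrated with
the sketch below. The description above does not say which key or
function the Teacher-UI uses, so everything here is an assumption:
the class ShardRouter, the numeric document id as shard key, the
modulo rule and the JDBC URLs are all hypothetical. In the project
itself the routing lived in the modified EclipseLink JPA layer,
which is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: the shard key and the modulo rule are assumptions.
final class ShardRouter {
    private final List<String> shardJdbcUrls;

    ShardRouter(List<String> shardJdbcUrls) {
        if (shardJdbcUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        // Defensive copy so the shard list cannot change after construction.
        this.shardJdbcUrls = Collections.unmodifiableList(new ArrayList<>(shardJdbcUrls));
    }

    // The same document id always maps to the same shard index.
    // floorMod keeps the result non-negative even for negative ids.
    int shardIndexFor(long documentId) {
        return (int) Math.floorMod(documentId, (long) shardJdbcUrls.size());
    }

    // Returns the database server that holds the metadata for this document.
    String jdbcUrlFor(long documentId) {
        return shardJdbcUrls.get(shardIndexFor(documentId));
    }
}
```

A plain modulo rule is easy to reason about, but it remaps most
keys when the number of shards changes; a lookup table or
consistent hashing would be alternatives if shards have to be added
later.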
Ephorus Search Component
This component is a simple and small Atom feed
/ OpenSearch-compliant search aggregator. The project supports
Ephorus' plagiarism detection process: it searches content using
many different search engines, both internal ones and ones
external to the organisation. The component is a simple Servlet API
2.5 web application using no database at all. In order to have
maximum control, the component controls all threading concerns
itself and does not rely on any type of container-supplied resource
apart from the URL of the configuration. Its startup configuration
isolates threading issues per search engine, and each search engine
is configured for 1000 threads. Configuration can be managed
centrally and changes are picked up with a 10 second delay without
restarting the service. The search component went live on the 21st of
September 2011.
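Isolating threading per search engine can be sketched as one thread
pool per engine, so that a slow or failing engine only exhausts its
own threads. This is an illustration under assumptions, not the
component's code: the class SearchAggregator, its methods and the
representation of an engine as a Function are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: names are illustrative, not taken from the Ephorus code base.
final class SearchAggregator {
    // Engine name -> function that runs a query against that engine.
    private final Map<String, Function<String, List<String>>> engines = new LinkedHashMap<>();
    // Engine name -> dedicated pool; engines never share threads.
    private final Map<String, ExecutorService> pools = new LinkedHashMap<>();

    void addEngine(String name, int maxThreads, Function<String, List<String>> engine) {
        engines.put(name, engine);
        pools.put(name, Executors.newFixedThreadPool(maxThreads));
    }

    // Sends the query to every engine in parallel and merges what comes back.
    // Waits up to timeoutMillis per engine; failed or slow engines are skipped.
    List<String> search(String query, long timeoutMillis) {
        Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, List<String>>> entry : engines.entrySet()) {
            Function<String, List<String>> engine = entry.getValue();
            Callable<List<String>> task = () -> engine.apply(query);
            pending.put(entry.getKey(), pools.get(entry.getKey()).submit(task));
        }
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> future : pending.values()) {
            try {
                merged.addAll(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
            } catch (Exception e) {
                // Timeout, interruption or engine failure: drop this engine's results.
                future.cancel(true);
            }
        }
        return merged;
    }

    // Stops all pools; needed because pool threads are not daemon threads.
    void shutdown() {
        for (ExecutorService pool : pools.values()) {
            pool.shutdownNow();
        }
    }
}
```

With the configuration described above, each engine would be
registered with maxThreads set to 1000. A fixed pool creates its
threads on demand, so an idle engine does not hold 1000 threads.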