Stamp-based synchronization
In Exalead CloudView, each document has a stamp, which is an opaque String. When you push a document with a stamp, the stamp is stored and can be retrieved later. You can use it to detect whether a document has been modified since its last push. For example, you could use the document's last-modification timestamp, or its MD5 hash, as the stamp.
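As an illustration of the two stamp choices mentioned above, here is a minimal sketch of how a stamp string might be derived. The class and method names (`StampExample`, `timestampStamp`, `md5Stamp`) are hypothetical and not part of the CloudView API; only standard JDK classes are used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helpers that turn document metadata or content into a stamp string. */
final class StampExample {

    /** Stamp based on the last-modification time, in milliseconds since the epoch. */
    static String timestampStamp(long lastModifiedMillis) {
        return Long.toString(lastModifiedMillis);
    }

    /** Stamp based on the MD5 digest of the content, as 32 lowercase hex characters. */
    static String md5Stamp(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so each byte prints as exactly two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(timestampStamp(1700000000000L));
        System.out.println(md5Stamp(content));
    }
}
```

A timestamp stamp is cheap to compute but depends on the data source reporting modification times reliably. A content hash requires reading the whole document but changes only when the content does.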
A basic synchronization loop would look like:
    for (Document document : listDocumentsInDataSource()) {
        String currentStamp = computeStamp(document);
        DocumentStatus statusInCloudView = papi.getDocumentStatus(document.getURI());
        if (statusInCloudView == NOT_PRESENT) {
            /* This document is not in CloudView, so it's new in the data source -> push it */
            papi.addDocument(document);
        } else {
            if (!statusInCloudView.stamp.equals(currentStamp)) {
                /* Stamp has changed: the document was modified -> push it */
                papi.addDocument(document);
            }
        }
    }
However, this method has two drawbacks:
- Calling the getDocumentStatus() method for each document is slow, as it involves one synchronous PAPI call for each document.
- It does not handle documents to be deleted in Exalead CloudView.
To address both drawbacks, you can enumerate all documents in Exalead CloudView once, and compute the differences locally. For example:
    /* Requires java.util.HashMap, java.util.HashSet, java.util.Map and java.util.Set */
    Map<String, String> stampsOfDocumentsInCloudView = new HashMap<>();
    for (SyncedEntry se : papi.enumerateSyncedEntries()) {
        stampsOfDocumentsInCloudView.put(se.getURI(), se.getStamp());
    }

    Set<String> documentsInDataSource = new HashSet<>();
    for (Document document : listDocumentsInDataSource()) {
        documentsInDataSource.add(document.getURI());
        String currentStamp = computeStamp(document);
        String stampInCloudView = stampsOfDocumentsInCloudView.get(document.getURI());
        if (stampInCloudView == null) {
            /* This document is not in CloudView, so it's new in the data source -> push it */
            papi.addDocument(document);
        } else {
            if (!stampInCloudView.equals(currentStamp)) {
                /* Stamp has changed: the document was modified -> push it */
                papi.addDocument(document);
            }
        }
    }

    /* Now, compute the list of deleted documents: documents that are in CloudView but not in the data source */
    for (String docInCloudView : stampsOfDocumentsInCloudView.keySet()) {
        if (!documentsInDataSource.contains(docInCloudView)) {
            /* Doc is not in data source anymore -> delete it from CloudView */
            papi.deleteDocument(docInCloudView);
        }
    }
This method might not be suitable if you have a very large number of documents in Exalead CloudView, because the URI and stamp of every document must be held in memory at the same time.
Other possible enhancements to this method are:
- Batch enumeration over known data subsets. For example, folders in a filesystem connector.
- Parallel enumeration: if suitable (that is, if the data source can be enumerated in the same order), this enumerates both the data source and Exalead CloudView in parallel. The Exalead CloudView enumeration is guaranteed to be in lexicographical order based on the URIs. As the Exalead CloudView enumeration is streamed, you can perform a merge between the two lists and compute the differences on the fly.
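The merge described for parallel enumeration can be sketched as follows. This is an illustration only. It assumes that both sides can be exposed as iterators of (URI, stamp) entries sorted in the same order, and that this order matches Java's `String.compareTo`. Check whether that holds for the order CloudView actually uses. The `SyncMerge` class and the `onPush` and `onDelete` callbacks are hypothetical names that stand in for the real push and delete calls.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Sketch of a streaming merge between two URI-sorted enumerations. */
final class SyncMerge {

    /**
     * Walks both enumerations once. Both iterators must yield (URI, stamp)
     * entries in the same ascending String order. Only the current entry of
     * each side is held in memory.
     */
    static void diff(Iterator<Map.Entry<String, String>> source,
                     Iterator<Map.Entry<String, String>> indexed,
                     Consumer<String> onPush,
                     Consumer<String> onDelete) {
        Map.Entry<String, String> s = source.hasNext() ? source.next() : null;
        Map.Entry<String, String> i = indexed.hasNext() ? indexed.next() : null;
        while (s != null || i != null) {
            // An exhausted side sorts after every remaining entry of the other side.
            int cmp = (s == null) ? 1 : (i == null) ? -1 : s.getKey().compareTo(i.getKey());
            if (cmp < 0) {
                // Only in the data source: new document.
                onPush.accept(s.getKey());
                s = source.hasNext() ? source.next() : null;
            } else if (cmp > 0) {
                // Only in the index: it was deleted from the data source.
                onDelete.accept(i.getKey());
                i = indexed.hasNext() ? indexed.next() : null;
            } else {
                // Present on both sides: push again only if the stamp changed.
                if (!s.getValue().equals(i.getValue())) {
                    onPush.accept(s.getKey());
                }
                s = source.hasNext() ? source.next() : null;
                i = indexed.hasNext() ? indexed.next() : null;
            }
        }
    }

    public static void main(String[] args) {
        // TreeMap iterates its keys in String.compareTo order.
        TreeMap<String, String> source = new TreeMap<>();
        source.put("doc/a", "v1");
        source.put("doc/b", "v2");
        TreeMap<String, String> indexed = new TreeMap<>();
        indexed.put("doc/b", "v1");
        indexed.put("doc/c", "v1");
        diff(source.entrySet().iterator(), indexed.entrySet().iterator(),
             uri -> System.out.println("push " + uri),
             uri -> System.out.println("delete " + uri));
    }
}
```

Because the callbacks are invoked as soon as a difference is found, memory use stays constant regardless of the number of documents. The same routine could also be applied per batch (for example, per folder) if each batch can be enumerated in sorted order on both sides.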