Implementing Synchronization

When you index an evolving document collection, your connector must ensure that the state of the index always follows the state of the source. This includes detecting new, modified, and deleted documents.

Exalead CloudView provides two mechanisms to help you implement synchronization.

This page discusses:

Stamp-based synchronization
Checkpoint-based synchronization
Synchronization best-practices

Stamp-based synchronization

In Exalead CloudView, each document has a stamp, which is an opaque String. When you push a document with a stamp, the stamp is stored and can be retrieved. This can be used to detect whether a document has been modified since its last push. For example, you could set the "last modification timestamp" of the document as the stamp, or its MD5 hash.
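As an illustration of the two stamp choices mentioned above, here is a minimal sketch of how a stamp could be computed. The class and method names (StampExample, timestampStamp, md5Stamp) are hypothetical and are not part of the CloudView API; only standard JDK classes are used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: none of these names come from the CloudView API. */
class StampExample {

    /** Stamp derived from the last modification time: cheap, but relies on the source's clock. */
    static String timestampStamp(long lastModifiedMillis) {
        return Long.toString(lastModifiedMillis);
    }

    /** Stamp derived from the document content: it changes whenever the bytes change. */
    static String md5Stamp(byte[] content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] v1 = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "hello!".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Stamp(v1));                       // 5d41402abc4b2a76b9719d911017c592
        System.out.println(md5Stamp(v1).equals(md5Stamp(v2)));  // false: the content changed
    }
}
```

A timestamp stamp avoids reading the document content, while a hash stamp still works when the source does not maintain reliable modification dates.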

A basic detection loop using stamps looks like this:

for (Document document : listDocumentsInDataSource()) {
  String currentStamp = computeStamp(document);
  DocumentStatus statusInCloudView = papi.getDocumentStatus(document.getURI());

  if (statusInCloudView == NOT_PRESENT) {
    /* This document is not in CloudView, so it's new in the data source -> push it */
    papi.addDocument(document);
  } else {
    if (!statusInCloudView.stamp.equals(currentStamp)) {
      /* Stamp has changed: the document was modified -> push it */
      papi.addDocument(document);
    }
  }
}

However, this method has two drawbacks:

Calling the getDocumentStatus() method for each document is slow, as it involves one synchronous PAPI call for each document.
It does not handle documents to be deleted in Exalead CloudView.

To fix this, you can list all documents in Exalead CloudView, and compute the differences. For example:

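/* URI -> stamp for every document currently known to CloudView */
Map<String, String> stampsOfDocumentsInCloudView = new HashMap<>();
for (SyncedEntry se : papi.enumerateSyncedEntries()) {
  stampsOfDocumentsInCloudView.put(se.getURI(), se.getStamp());
}

/* URIs seen in the data source during this scan */
Set<String> documentsInDataSource = new HashSet<>();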

for (Document document : listDocumentsInDataSource()) {
  documentsInDataSource.add(document.getURI());

  String currentStamp = computeStamp(document);
  String stampInCloudView = stampsOfDocumentsInCloudView.get(document.getURI());

  if (stampInCloudView == null) {
    /* This document is not in CloudView, so it's new in the data source -> push it */
    papi.addDocument(document);
  } else {
    if (!stampInCloudView.equals(currentStamp)) {
      /* Stamp has changed: the document was modified -> push it */
      papi.addDocument(document);
    }
  }
}

/* Now, compute the list of deleted documents: documents that are in CloudView but not in the data source */
for (String docInCloudView : stampsOfDocumentsInCloudView.keySet()) {
  if (!documentsInDataSource.contains(docInCloudView)) {
    /* Doc is not in data source anymore -> Delete it from CloudView */
    papi.deleteDocument(docInCloudView);
  }
}

This method might not be convenient if you have a very large number of documents in Exalead CloudView, because the complete list of URIs and stamps must be held in memory.

Other possible method enhancements are:

Batch enumeration over known data subsets. For example, folders in a filesystem connector.
Parallel enumeration – if suitable, enumerate both the data source and Exalead CloudView in parallel. The Exalead CloudView enumeration is guaranteed to be in lexicographical order based on the URIs. As the Exalead CloudView enumeration is streamed, you can merge the two lists and compute the differences on the fly, provided that the data source can be enumerated in the same order.
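The parallel enumeration idea can be sketched as a classic sorted merge. In the sketch below, plain Java iterators over (URI, stamp) pairs stand in for the two enumerations, and the operations are returned as strings instead of being sent to the PAPI. The class MergeSyncExample and its diff method are hypothetical. The sketch also assumes that both sides are sorted according to String.compareTo; check that this matches the actual ordering of the CloudView enumeration for your URIs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: plain iterators stand in for the two enumerations. */
class MergeSyncExample {

    /** Walks two URI-sorted streams of (URI, stamp) pairs and returns the operations to perform. */
    static List<String> diff(Iterator<Map.Entry<String, String>> source,
                             Iterator<Map.Entry<String, String>> index) {
        List<String> ops = new ArrayList<>();
        Map.Entry<String, String> s = source.hasNext() ? source.next() : null;
        Map.Entry<String, String> i = index.hasNext() ? index.next() : null;
        while (s != null || i != null) {
            // An exhausted side compares as "greater" so that the other side is drained.
            int cmp = (s == null) ? 1 : (i == null) ? -1 : s.getKey().compareTo(i.getKey());
            if (cmp < 0) {
                ops.add("ADD " + s.getKey());        // only in the source: new document
                s = source.hasNext() ? source.next() : null;
            } else if (cmp > 0) {
                ops.add("DELETE " + i.getKey());     // only in the index: deleted document
                i = index.hasNext() ? index.next() : null;
            } else {
                if (!s.getValue().equals(i.getValue())) {
                    ops.add("UPDATE " + s.getKey()); // same URI, different stamp: modified
                }
                s = source.hasNext() ? source.next() : null;
                i = index.hasNext() ? index.next() : null;
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<String, String> source = new TreeMap<>();
        source.put("a", "1"); source.put("b", "2"); source.put("d", "1");
        Map<String, String> index = new TreeMap<>();
        index.put("a", "1"); index.put("b", "1"); index.put("c", "1");
        // prints [UPDATE b, DELETE c, ADD d]
        System.out.println(diff(source.entrySet().iterator(), index.entrySet().iterator()));
    }
}
```

Only the current entry of each stream is kept in memory, which is what makes this variant attractive for large collections.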

Checkpoint-based synchronization

Stamp-based synchronization is generally quite costly due to its memory requirements and should only be used when the source has no notion of an "event log". Many data sources have logs or mechanisms to determine what has changed between two events. In this case, you should use checkpoint-based synchronization.

A checkpoint is an opaque String, not associated with a document, that is stored persistently by Exalead CloudView, and can be retrieved.

If a checkpoint can be retrieved, then all operations that were sent to the PAPI before the checkpoint are guaranteed to be safely stored to disk, and will never be lost, even if they are not yet searchable.

The following sample shows the workflow of a checkpoint-based synchronization:

final static String CHECKPOINT_NAME = "my_checkpoint";

public void syncSource() {
  String currentCheckpointValue = papi.getCheckpoint(CHECKPOINT_NAME);

  String currentLast = dataSource.getCurrentLastEventId();

  for (Action a : dataSource.getAllDocumentsBetween(currentCheckpointValue, currentLast)) {
    if (a.kind == ADD) papi.addDocument(a.getDocument());
    else if (a.kind == DEL) papi.deleteDocument(a.getURI());
  }

  /* Now, set in CloudView the fact that we have reached currentLast */
  papi.setCheckpoint(currentLast, CHECKPOINT_NAME);
}

You can force the sync when you set a checkpoint, or just after, but this is not strictly necessary. If you don't sync and a crash occurs, you will retrieve the previous checkpoint value and re-scan more documents than needed. However, indexing is idempotent: pushing the same document several times leaves the index in the same state as pushing it once, so the extra scan costs time but does not change the result.
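To illustrate why re-scanning from an older checkpoint is harmless, the following sketch replays part of an event log twice against an in-memory map. It is a hypothetical model only: ReplayExample, Event, and apply are invented for this illustration and do not use the CloudView API, and the map merely stands in for the index.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model; it does not use the CloudView API. */
class ReplayExample {

    /** One entry of the source's event log: "ADD" or "DEL" applied to a URI. */
    static final class Event {
        final String kind;
        final String uri;
        final String content;

        Event(String kind, String uri, String content) {
            this.kind = kind;
            this.uri = uri;
            this.content = content;
        }
    }

    /** Applies events log[from..to) to the index; adding an existing URI overwrites it. */
    static void apply(List<Event> log, int from, int to, Map<String, String> index) {
        for (Event e : log.subList(from, to)) {
            if (e.kind.equals("ADD")) {
                index.put(e.uri, e.content);
            } else {
                index.remove(e.uri);
            }
        }
    }

    public static void main(String[] args) {
        List<Event> log = Arrays.asList(
            new Event("ADD", "doc1", "v1"),
            new Event("ADD", "doc2", "v1"),
            new Event("DEL", "doc1", null),
            new Event("ADD", "doc2", "v2"));

        // Clean run: every event is applied exactly once.
        Map<String, String> clean = new TreeMap<>();
        apply(log, 0, 4, clean);

        // Crash run: the first three events were pushed, but the new checkpoint
        // was lost, so the next run restarts from the old checkpoint (0) and replays them.
        Map<String, String> crashed = new TreeMap<>();
        apply(log, 0, 3, crashed);
        apply(log, 0, 4, crashed);

        System.out.println(clean.equals(crashed)); // true: the replay did not change the result
    }
}
```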

Synchronization best-practices

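In some cases, you will need to garbage-collect the data source log after pushing. In this case, make sure that you sync Exalead CloudView before garbage-collecting the log, so that the pushed operations are safely stored before the log entries needed to replay them disappear.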
Always compute the "next" checkpoint value before scanning. This way, if new records are added while scanning, you will not miss them.
As much as possible, avoid using the current time as a checkpoint value because in some rare cases, it can cause synchronization issues. Consider the following sequence of events:
- t0: A new transaction begins on the data source.
- t1: A document D1 is added in the transaction; its last modification date is set to "t1".
- t2: Scan begins; we compute t2 as the next checkpoint value, and will scan the source up to t2.
- t3: Transaction commits.
  
  At the end, the document D1 has not been scanned, because it only became visible when the transaction committed at t3. However, the next scan will resume from t2, which is later than the modification date t1 of D1, and will therefore never scan D1.

If you don't have any other form of ever-increasing ids, and must use modification/current times, make sure to always include an "overlap offset" when computing the checkpoint, to account for currently running transactions. For example:

long now = System.currentTimeMillis();
String nextCheckpoint = "" + (now - 60000);
// Leave one minute of overlap: we'll always scan a bit more than needed,
// but won't miss any documents in currently running transactions.
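The effect of the overlap on the t0 to t3 timeline can be checked with a small sketch. The names OverlapExample, nextCheckpoint, and isRescanned are hypothetical, and the sketch assumes that a scan picks up every document whose modification time is greater than or equal to the stored checkpoint. It also assumes that the overlap is longer than the longest transaction on the data source; a shorter overlap can still miss documents.

```java
/** Hypothetical helper illustrating the overlap offset; not part of the CloudView API. */
class OverlapExample {

    /** Checkpoint to store for a scan that starts at nowMillis. */
    static String nextCheckpoint(long nowMillis, long overlapMillis) {
        return Long.toString(nowMillis - overlapMillis);
    }

    /** True if a document modified at modifiedAtMillis is picked up by a scan resuming from checkpoint. */
    static boolean isRescanned(long modifiedAtMillis, String checkpoint) {
        return modifiedAtMillis >= Long.parseLong(checkpoint);
    }

    public static void main(String[] args) {
        long t1 = 1_000_000L; // D1 is modified inside the still-open transaction
        long t2 = 1_005_000L; // the scan begins 5 seconds later, before the commit

        // Without overlap, the next scan resumes from t2 and skips D1.
        System.out.println(isRescanned(t1, nextCheckpoint(t2, 0L)));      // false
        // With one minute of overlap, the next scan resumes before t1 and sees D1.
        System.out.println(isRescanned(t1, nextCheckpoint(t2, 60_000L))); // true
    }
}
```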