Calculating a Diff Between Two Data Sources

It is sometimes useful to get the differences between two data sources.

First, you enumerate your data source and push all of its documents into the index using the Push API. Then you need to update the index regularly by adding new documents and deleting documents that have been removed from the source.

Sometimes the source does not notify you of deleted documents, so you do not know which documents to delete from the index. In that case, your only option is to compare the documents present in the index with the documents present in the data source, and then delete the documents that are in the index but no longer in the source.

To do so, we provide a Subtractor class in the papi-java-datainteg-commons.jar file.

With this class, you create an object that calculates the set of items present in a data source A and NOT in a data source B, which tells you exactly which items to delete from the index.
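
As an illustration of the "in A and NOT in B" calculation, here is a minimal in-memory sketch. It does not use the Subtractor class: the IndexDiffSketch class and its inIndexButNotInSource method are hypothetical names introduced for this example, and the sketch assumes that all URIs fit in memory, which is the very limitation the Subtractor class is meant to remove.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of papi-java-datainteg-commons.jar.
class IndexDiffSketch {

    /** Returns the URIs found in indexUris (A) that are absent from sourceUris (B). */
    static Set<String> inIndexButNotInSource(final Iterable<String> indexUris,
                                             final Iterable<String> sourceUris) {
        // start from everything in the index, keeping the enumeration order
        final Set<String> result = new LinkedHashSet<String>();
        for (final String uri : indexUris) {
            result.add(uri);
        }
        // drop every URI that still exists in the data source
        for (final String uri : sourceUris) {
            result.remove(uri);
        }
        return result;
    }

    public static void main(final String[] args) {
        final List<String> index = Arrays.asList("uri0", "uri1", "uri2", "uri3");
        final List<String> source = Arrays.asList("uri0", "uri2");
        // the remaining URIs are the documents to delete from the index
        System.out.println(inIndexButNotInSource(index, source)); // prints [uri1, uri3]
    }
}
```

The order of the two arguments matters: swapping them would return the documents that are in the source but not yet in the index.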

This calculation is performed in a Java heap buffer and swaps to disk if there is not enough memory available. Items are enumerated through Cursor objects, which act as enumerators. This allows you to use the Subtractor class with various sources: you only need to provide enumerators that give access to the items.
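
To make the enumerator idea concrete, the sketch below wraps any Iterator of URIs as a cursor and reads it to the end. SimpleCursor is a hypothetical stand-in for the library's Cursor type, declared here only so that the example compiles on its own. It assumes the same convention as the sample code in this section: next() returns null once the enumeration is exhausted.

```java
import java.io.Closeable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class CursorSketch {

    /** Hypothetical stand-in for the library's Cursor: next() returns null at the end. */
    interface SimpleCursor<T> extends Closeable {
        T next() throws Exception;
    }

    /** Wraps an iterator of URIs as a cursor of UTF-8 encoded byte arrays. */
    static SimpleCursor<byte[]> fromUris(final Iterator<String> uris) {
        return new SimpleCursor<byte[]>() {
            @Override
            public byte[] next() {
                return uris.hasNext() ? uris.next().getBytes(StandardCharsets.UTF_8) : null;
            }

            @Override
            public void close() {
                // nothing to release for an in-memory iterator
            }
        };
    }

    /** Reads a cursor to the end, decodes each item, and always closes the cursor. */
    static List<String> drain(final SimpleCursor<byte[]> cursor) throws Exception {
        final List<String> out = new ArrayList<String>();
        try {
            for (byte[] item = cursor.next(); item != null; item = cursor.next()) {
                out.add(new String(item, StandardCharsets.UTF_8));
            }
        } finally {
            cursor.close();
        }
        return out;
    }

    public static void main(final String[] args) throws Exception {
        System.out.println(drain(fromUris(Arrays.asList("uri0", "uri1").iterator()))); // prints [uri0, uri1]
    }
}
```

Because a cursor only exposes next() and close(), the same pattern applies whether the items come from a database query, a file listing, or the index itself.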

/**
 * Iterator on the CloudView index
 */
private Cursor<byte[]> getCursorFromCloudview(final PushAPI papi) throws PushAPIException {
    return new Cursor<byte[]>() {
        private Iterator<SyncedEntry> syncedEntries =
                papi.enumerateSyncedEntries("", EnumerationMode.NOT_RECURSIVE_ALL).iterator();

        @Override
        public void close() throws IOException {
        }

        @Override
        public byte[] next() throws Exception {
            // returning null ends the enumeration
            if (!syncedEntries.hasNext()) {
                return null;
            }
            final SyncedEntry entry = syncedEntries.next();
            final String uri = entry.getUri();
            return uri.getBytes("UTF-8");
        }
    };
}

/**
 * Iterator on a fake data source
 * that contains 9995 documents with URIs from "uri0" to "uri9994"
 */
private Cursor<byte[]> getCursorFromDataSource() {
    return new Cursor<byte[]>() {
        private int index = 0;
        private int len = 9995;

        @Override
        public void close() throws IOException {
        }

        @Override
        public byte[] next() throws Exception {
            if (index >= len) {
                return null;
            }

            final String uri = "uri" + index++;
            return uri.getBytes("UTF-8");
        }
    };
}

/**
 * Sample code showing how to find the documents that must be deleted from
 * the CloudView index.
 * In this sample, the CloudView index already contains 10000 (10K) documents
 * with URIs from "uri0" to "uri9999", so the 5 documents "uri9995" to "uri9999"
 * are no longer in the data source and get deleted.
 */
private void checkItemsToDelete(final PushAPI papi, final Logger logger) throws Exception {
    // working directory for the subtraction (items in the index minus items in the source)
    final File workdir = new File(System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString());
    try {
        // this cursor enumerates 10000 documents (the CloudView index)
        final Cursor<byte[]> cursorFromCloudview = getCursorFromCloudview(papi);
        try {
            // this cursor enumerates 9995 documents (the data source)
            final Cursor<byte[]> cursorFromDataSource = getCursorFromDataSource();
            try {
                // maximum amount of memory consumed in the Java heap (5 MB)
                final int ramBudget = 5 * 1024 * 1024;
                // create a subtractor object
                final Subtractor sub = new Subtractor(workdir, "subtractorName", ramBudget, logger);
                // compute the items that are in the first source and NOT in the second
                final Cursor<byte[]> itemsInCloudviewAndNotInDataSource =
                        sub.sub(cursorFromCloudview, cursorFromDataSource);
                try {
                    // loop on all items that are in the CloudView index and no longer in the data source
                    for (final byte[] bytes : new IterableCursor<byte[]>(itemsInCloudviewAndNotInDataSource)) {
                        final String s = new String(bytes, "UTF-8");
                        // delete the document to update the index
                        papi.deleteDocument(s);
                    }
                }
                finally {
                    itemsInCloudviewAndNotInDataSource.close();
                }
            }
            finally {
                cursorFromDataSource.close();
            }
        }
        finally  {
            cursorFromCloudview.close();
        }
    }
    finally {
        FileUtils.deleteQuietly(workdir);
    }
}