UC-1: Consolidating Data from Two Sources

For this use case, we want to gather document information about countries coming from a database and a file system on the same index.

The prerequisite is that there is a known link between database records and files. In the provided sample, the file name contains the id of the database record (stored in the country_id field).

What we want to do is create a link between a country record and a PDF document inside the Consolidation Server. The country object is enriched with the PDF file during the aggregation step.

This task shows you how to:

Step 1 - Define the Connectors Corresponding to Each Source

Create the Filesystem Connector

Before you begin:

In this use case, we have a very small set of PDF documents to push to the Consolidation Server using the Files connector. To reproduce this example with a real corpus, if your documents have large binary parts, the Consolidation Server cache ends up with a disk footprint close to the size of the indexed corpus.

For real use cases, convert document binary parts before pushing documents into the Consolidation Server:

  • Go to the Advanced tab and a Convert PushAPI Filter to the Files connector.
  • Extract text content only (and exclude binary parts) by setting the Conversion mode to Text.

  1. In the Administration Console, go to Index > Connectors and click Add connector.
    1. In Name, enter countryfiles.
    2. For Type, select the Files connector.
    3. For Push to PAPI server, select the Consolidation server cbx0 instance.
    4. Click Accept.
  2. For Store documents in data model class, choose the document class.
  3. In Filesystem paths, enter the following path: <INPUTDIR>/pdf
  4. Click Save.

Create the Database Connector

  1. In the Administration Console, go to Index > Connectors and click Add connector.
    1. In Name, enter country.
    2. For Type, select the JDBC connector.
    3. For Push to PAPI server, select the Consolidation server cbx0 instance.
    4. Click Accept.
  2. For Store documents in data model class, choose the country class.
  3. In Connection parameters:
    1. For Driver, enter org.sqlite.JDBC
    2. For Connection string, enter jdbc:sqlite://<INPUTDIR>/coffee.db
    3. Click Test connection. The database connector automatically connects to the database.
  4. In Query parameters:
    1. For Synchronization mode, select Full synchronization
    2. For Initial query, enter: Select country_id, ico_status, name from countries
  5. Click Retrieve fields.
  6. Define the country_id field as primary key.
    1. Click the country_id field to expand it.
    2. Select Use as primary key.
  7. Click Save.

Step 2 - Configure Consolidation

Configure the Transformation Processor

  1. Go to Index > Consolidation
  2. Add a new transformation processor:
    1. Select Groovy as format
    2. For Name, enter Files
    3. Click Accept
  3. For Source connector, select countryfiles
  4. Replace the default code by the following one:

    Groovy code

    // Process all nodes coming from the selected source connector
    process("") {
     // Extract the country id from the filename.
     // For example, for "brazil.pdf", we want to extract "brazil".
     String filename = it.metas.getValue("file_name");
     def values = filename.split('\\.');
     log.info "doc uri:[" + it.getUri() + "] countryId:[" + values[0] + "]";
    
     // Link the filesystem document to its related “countries” database record.
     // The default URI of a database record is: "<fieldname>=<value>&"
     it.addArcFrom("describedBy", "country_id=" + values[0] + "&");
     }

    Java equivalent code

    @Override
     public void process(IJavaAllUpdatesTransformationHandler handler, IMutableTransformationDocument document) throws Exception {
            document.setType("document");
    
            final String filename = document.getMeta("file_name");
            if (filename == null || filename.isEmpty()) {
                throw new Exception("File name not available");
            }
            final String[] values = filename.split("\\.");
            if (values == null || values.length == 0) {
                throw new Exception("Invalid file name");
            }
            LOGGER.info("doc uri:[" + document.getUri() + "] countryId:[" + values[0] + "]");
    
            document.addArcFrom("describedBy", "country_id=" + values[0] + "&");
    }

With this transformation processor, we have achieved to link files to their related database records.

Configure the Aggregation Processor

  1. Add an aggregation processor:
    1. Select Groovy as format
    2. For Name, enter Countries_UC_1
    3. Click Accept
  2. Replace the default code by the following one:

    Groovy code

    // Process nodes having the “country” type
    // The node type is deduced by the document class automatically
    process("country") {
     log.info "country found: " + it.metas.name;
     // Find nodes related to the country
     // Goal: Create a “country” consolidated document with information coming from the matched file.
        it.metas.hasfile = "no";
        for (path in match(it, "describedBy[document]")) {
    
     // If a valid path is found, retrieve its last element
     last = path.last()
     log.info "File found: " + last.getUri();
     // Retrieve the binary parts of the found nodes
     // To get all parts: it.parts.getMap().putAll(last.parts.getMap());
     // To get the master part only:
            it.parts.master += last.parts.master;
            it.metas.hasfile = "yes";
      }
    }

    Java equivalent code

    @Override
     public void process(IJavaAllUpdatesAggregationHandler handler, IAggregationDocument document) throws Exception {
            final String countryName = document.getMeta("name");
            if (countryName == null || countryName.length() == 0) {
                throw new Exception("Invalid country name '" + countryName + "'");
            }
    
            LOGGER.info("Country found: " + countryName);
    
            // find document related to the country
            // Goal: be able to consolidate information of pdf document with country database record
            final List<IAggregationDocument> pathsEnds = GraphMatchHelpers.getPathsEnd(handler.match(document, "describedBy"));
            for (IAggregationDocument file : pathsEnds) {
                LOGGER.info("File found: " + file.getUri());
    
                document.withPart("master", file.getPart("master"));
                document.withMeta("hasfile", "yes");
            }
    }

  3. Save and apply the configuration.

    Note: It is also possible to consolidate security tokens, using the security meta. After it.metas.hasfile = "yes"; add it.metas.security += last.metas.security;

Step 3 - Scan Source Connectors and Check What Is Indexed

  1. Go to the Home page and under the connectors list, click Scan for the Files and JDBC connectors.

    Note: In the Connectors list, a consolidation-<instance name> row displays status information about consolidation. All documents and countries are indexed.

  2. Open the Mashup UI application search page: http://<HOSTNAME>:<BASEPORT>/mashup-ui/page/search
  3. Check that country documents have associated parts (thumbnails/previews are available).



  4. To get a consolidated view, go to: http://<HOSTNAME>:<BASEPORT>/mashup-ui/page/searchcountry_v1



The following graphic shows what we achieved on the object graph.