Add a New Pipeline
Define one pipeline per Data Model class.
Before you begin: You have created an index unit.
- From the left panel, select Storage.
- Click Add pipeline.
- Click Create pipeline.
- Enter a Pipeline name.
- For Pipeline type, select Index Tabular Data.
- Click Create.
The General information page opens.
Recommendation: Click Save after any editing operation.
Configure the Source
Once the pipeline is created, configure the source, that is, the connection to the storage from which Data Factory Studio fetches source files.
- Select the Source tab.
- Configure the Storage parameters.
- Optional: If lineage with Datasets Governance is activated, a Source Lineage section appears on the page. In that case, you must select a catalog previously defined in Datasets Governance from Source Catalog.
- Configure the Scheduling, that is, the refresh frequency of the source.
  By default, Data Factory Studio refreshes data every minute. You can specify longer intervals, such as every 10 minutes, every hour, or every day. You can also set the source scheduling to run continuously, which means that each time an S3 fetch request ends, a new one starts automatically. Finally, you can define a custom schedule using a Quartz cron expression. For example:
  - "0 0 13 * * ?" runs the processor at 1:00 PM every day.
  - "0 20 14 ? * MON-FRI" runs the processor at 2:20 PM, Monday through Friday.
  - "0 15 10 ? * 6L 2011-2017" runs the processor at 10:15 AM on the last Friday of every month, from 2011 through 2017.
  For more information on the Quartz format, see https://www.quartz-scheduler.org/documentation/
- Select a Data format.
- Click Save.
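Quartz cron expressions use a fixed field order: seconds, minutes, hours, day of month, month, day of week, and an optional year. As a rough illustration only, the Python sketch below checks that structure; it is not the validator Data Factory Studio uses, and the patterns only approximate real Quartz syntax.

```python
# Sketch: a minimal structural check for Quartz cron expressions.
# Illustration only -- full Quartz semantics are richer than checked here.
import re

# One simplified pattern per field, in Quartz field order.
FIELD_PATTERNS = [
    r"[\d*,/-]+",        # seconds (0-59)
    r"[\d*,/-]+",        # minutes (0-59)
    r"[\d*,/-]+",        # hours (0-23)
    r"[\d*,/?LW-]+",     # day of month (1-31, ?, L, W)
    r"[\dA-Z*,/-]+",     # month (1-12 or JAN-DEC)
    r"[\dA-Z*,/?L#-]+",  # day of week (1-7 or SUN-SAT, ?, L, #)
    r"[\d*,/-]+",        # year (optional)
]

def looks_like_quartz_cron(expr: str) -> bool:
    """Return True if expr has 6 or 7 fields matching the simplified patterns."""
    fields = expr.split()
    if len(fields) not in (6, 7):
        return False
    return all(re.fullmatch(p, f) for p, f in zip(FIELD_PATTERNS, fields))

# The three examples from the scheduling step:
for expr in ("0 0 13 * * ?", "0 20 14 ? * MON-FRI", "0 15 10 ? * 6L 2011-2017"):
    print(expr, looks_like_quartz_cron(expr))
```

This kind of pre-check can catch an expression with a missing or extra field before you save the pipeline configuration.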
Configure the Processing
Advanced users who are already familiar with Semantic Graph Index Data Queries can optionally define a MAP expression in the Processing tab.
The expression input must be a tuple, called record, containing your data source values. For more information, see Appendix - About Processing.
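The snippet below is a conceptual analogue only, written in Python rather than in the Semantic Graph Index Data Query language: it mirrors the idea of a MAP expression, which takes one record of source values and returns the transformed values to index. The field names (brand, mileage_km) are hypothetical.

```python
# Illustration only: real MAP expressions are written in the Semantic Graph
# Index Data Query language, not Python. This sketch mirrors the concept.
# Field names are hypothetical.

def map_record(record: dict) -> dict:
    """Transform one source record before indexing (conceptual analogue)."""
    return {
        "brand": record["brand"].strip().upper(),                    # normalize text
        "mileage_miles": round(record["mileage_km"] * 0.621371, 1),  # unit conversion
    }

print(map_record({"brand": " renault ", "mileage_km": 100.0}))
# → {'brand': 'RENAULT', 'mileage_miles': 62.1}
```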
Configure the Destination
Define the Data Model class into which the ingestion pipeline must push fetched data. The index unit must match the CSV columns one to one.
Note: You cannot use 3DSpace as the destination target. No pipeline can modify it.
- Select the Destination tab.
- Enter the Index unit where you want to push data.
- For Package, select a data model package, for example, org.common.
- For Class, select the name of the class that must store the indexed data, for example, car. Remember that you have one pipeline per class.
- Optional: If your Data Model contains a property of type List, scroll down to the Index Properties table and specify a separator in the Multivalued separator column.
  Note: The Multivalued separator must be different from the Value separator defined in the Source tab.
- Click Save.
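To see why the two separators must differ, the Python sketch below parses a hypothetical CSV source where "," is the value separator (splitting columns) and ";" is the multivalued separator (splitting one column into list items). The column names and data are invented for illustration.

```python
# Sketch: value separator vs. multivalued separator in a CSV source.
# "," splits the row into columns; ";" then splits one column into a list.
# If both were ",", the list items would be indistinguishable from columns.
import csv
import io

source = "name,colors\ncar1,red;blue;green\ncar2,black\n"

rows = []
for row in csv.DictReader(io.StringIO(source)):  # value separator: ","
    row["colors"] = row["colors"].split(";")     # multivalued separator: ";"
    rows.append(row)

print(rows[0]["colors"])   # → ['red', 'blue', 'green']
print(rows[1]["colors"])   # → ['black']
```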
Run the Pipeline Configuration
Once the configuration is complete, you can apply it to start indexing documents.
- Select the Run tab. The running status of the indexing displays at the top of the screen.
- To start the pipeline indexing, click Run. You can stop the indexing at any time and restart it when ready.