CSV Parsing
CSV data format is not as strictly standardized as other formats. CSV fields
delimitation rules vary from an implementation to another. For example, quotes may be
optional, spaces may surround values, etc. So here are the general rules followed by Data Factory Studio to read CSV files.
Automatic Trimming of Space-Like Characters
Data Factory Studio removes space-like characters (spaces, tabs, and escaped carriage return) at the
beginning or the end of a value automatically. Use quotes to bypass this behavior.
Note:
This automatic trimming does not apply to elements of multivalued
properties, like lists and collections. However, Data Factory Studio trims the whole list or collection like regular values. For more information, see List.
Quoting Values
To quote a value in a csv line, enclose it with the quote character. For
example, "my text" . You can configure the character to use for quoting in
the pipeline source configuration. The following examples use the default double quote
character " .
- If quote characters surround a value, Data Factory Studio does not index them, and they are not displayed in the result.
- Quoting does not enforce the
String type. The String type does not require quotes either.
Thus, even if they are mainly useful for String and Date /Datetime , you can use quotes for any types.
For example, "true" and true are exactly equivalent.
- Data Factory Studio trims space-like characters outside the quotes, but keeps space-like characters inside the quotes.
- Quoting is useful if your value contains:
- Preceding or trailing space-like characters that you need to keep.
- Value separator characters (for example, the default comma
,
separator) that you do not want to escape.
Escaping Characters
Escape the characters that Data Factory Studio must keep. For example, to get Hello, world! , write Hello\,
world! . You can configure the character used for escaping in the pipeline source
configuration. The following examples use the default backslash \
character.
Escape the following characters:
- The value separator: replace the default comma
, by
\,
- The carriage return
\n .
- The quote character: replace the default double quote
" by
\"
- The escape character itself: replace the default backslash
\ by
\\
- Tabs may also be generated using
\t
If the escape character appears before any other character, Data Factory Studio does not interpret it as an escaping sequence, and keep both characters in the result.
For example, \x returns \x .
Null Values
Data Factory Studio evaluates a value to null if, after trimming space-like characters, unquoting, and
replacing escaped character, you have:
- An empty value
- The
NullString (which depends on your settings).
For example, with NullString set to "N/A" , the following
values evaluate to null : ` ` , "" ,
N/A , "N/A" , \tN/A\t ,
\"\" .
Note:
This does not apply to individual elements of multivalued properties, like
lists and collections. You can only set individual elements to null with an empty string.
However, Data Factory Studio follows the general rule for the list or collection as a whole. For more information,
see List.
Property Type Support
Binary
Supports Base64.
Parquet Value |
Resulting Value (in the Index Unit) |
"YWJjZGVmZ2gxMjPDqcOgw6ckJsKnwr7jg4jjg4jjg60=" |
Binary.decodeBase64("YWJjZGVmZ2gxMjPDqcOgw6ckJsKnwr7jg4jjg4jjg60=")
|
Boolean
CSV Value |
Resulting Value (in the Index Unit) |
true |
true |
"true" |
true |
True |
true |
TRUE |
true |
TruE |
true |
false |
false |
"false" |
false |
False |
false |
FALSE |
false |
FasLE |
false |
0 |
Not supported |
1 |
Not supported |
Date
CSV Value |
Resulting Value (in the Index Unit) |
ISO 8601 |
|
"1970-01-25" (yyyy-MM-dd) |
Date.of(1970, 01, 25) |
"1234-05-06T07:08:09"
(yyyy-MM-dd'T'HH:mm:ss) |
Date.of(1234, 5, 6)
|
"2222-02-22T22:22:22Z"
(yyyy-MM-dd'T'HH:mm:ssXXX) |
Date.of(2222, 2, 22)
|
"1970-01-01T01:01:01+02:00"
(yyyy-MM-dd'T'HH:mm:ssXXX) |
Date.of(1970, 1, 1)
|
Other formats |
|
"30/05/1978 02:34:56" (dd/MM/yyyy
HH:mm:ss) |
Not supported |
"2/4/2010 2:39:28 PM" (M/d/yyyy h:mm:ss
tt) |
Not supported |
"6:05:12 a.m." (h:mm:ss tt) |
Not supported |
"23:31:18.345" (HH:mm:ss.s) |
Not supported |
"Tuesday, August 20, 2002" (dddd, MMMM dd,
yyyy) |
Not supported |
"Wed, Aug 31 1994" (ddd, MMM dd yyyy) |
Not supported |
"03281999030456" (MMddyyyyHHmmss) |
Not supported |
"Im a random string" |
Not supported |
LocalDateTime
CSV Value |
Resulting Value (in the Index Unit) |
ISO 8601 |
|
"1970-01-25" (yyyy-MM-dd) |
Not supported |
"1234-05-06T07:08:09"
(yyyy-MM-dd'T'HH:mm:ss) |
LocalDateTime.of(1245, 5, 6, 7, 8, 9) |
"2222-02-22T22:22:22Z"
(yyyy-MM-dd'T'HH:mm:ssXXX) |
LocalDateTime.of(2222, 2, 22, 22, 22, 22)
|
"1970-01-01T01:01:01+02:00"
(yyyy-MM-dd'T'HH:mm:ssXXX) |
LocalDateTime.of(1970, 1, 1, 1, 1, 1)
|
Other formats |
|
"30/05/1978 02:34:56" (dd/MM/yyyy
HH:mm:ss) |
Not supported |
"2/4/2010 2:39:28 PM" (M/d/yyyy h:mm:ss
tt) |
Not supported |
"6:05:12 a.m." (h:mm:ss tt) |
Not supported |
"23:31:18.345" (HH:mm:ss.s) |
Not supported |
"Tuesday, August 20, 2002" (dddd, MMMM dd,
yyyy) |
Not supported |
"Wed, Aug 31 1994" (ddd, MMM dd yyyy) |
Not supported |
"03281999030456" (MMddyyyyHHmmss) |
Not supported |
"Im a random string" |
Not supported |
DateTime
CSV Value |
Resulting Value (in the Index Unit) |
ISO 8601 |
|
"1970-01-25" (yyyy-MM-dd) |
Not supported |
"1234-05-06T07:08:09"
(yyyy-MM-dd'T'HH:mm:ss) |
Not supported |
"2222-02-22T22:22:22Z"
(yyyy-MM-dd'T'HH:mm:ssXXX) |
DateTime.of(2222, 2, 22, 22, 22, 22, 0,
ZoneOffset.utc()) |
"1970-01-01T01:01:01+02:00"
(yyyy-MM-dd'T'HH:mm:ssXXX) |
DateTime.of(1970, 1, 1, 1, 1, 1, 0,
ZoneOffset.ofHours(2)) |
Other formats |
|
"30/05/1978 02:34:56" (dd/MM/yyyy
HH:mm:ss) |
Not supported |
"2/4/2010 2:39:28 PM" (M/d/yyyy h:mm:ss
tt) |
Not supported |
"6:05:12 a.m." (h:mm:ss tt) |
Not supported |
"23:31:18.345" (HH:mm:ss.s) |
Not supported |
"Tuesday, August 20, 2002" (dddd, MMMM dd,
yyyy) |
Not supported |
"Wed, Aug 31 1994" (ddd, MMM dd yyyy) |
Not supported |
"03281999030456" (MMddyyyyHHmmss) |
Not supported |
"Im a random string" |
Not supported |
Float and Decimal
CSV Value |
Resulting Value (in the Index Unit) |
42 |
42. |
42L |
Not supported |
42.3 |
42.3
|
42. |
42.
|
42.3f |
42.3 |
42.3d |
42.3
|
4.23e3 |
42300.
|
0x2A |
Not supported |
0x2A.5p-0
|
42.3125
|
baba |
Not supported |
Integer
CSV Value |
Resulting Value (in the Index Unit) |
42 |
42 |
42.3 |
Not supported |
42. |
Not supported |
4.23e3 |
Not supported
|
0x2A |
Not supported |
baba |
Not supported |
Geo
Supply geometry values as WKT formatted strings.
CSV Value |
Resulting Value (in the Index Unit) |
POINT(6 10) |
Geo.ofWkt("POINT(6 10)")
|
LINESTRING(3 4,10 50,20 25) |
Geo.ofWkt("LINESTRING(3 4,10 50,20 25)")
|
POLYGON((1 1,5 1,5 5,1 5,1 1)) |
Geo.ofWkt("POLYGON((1 1,5 1,5 5,1 5,1 1))")
|
POINT EMPTY |
Geo.ofWkt("POINT EMPTY")
|
baba |
Not supported |
Reference
CSV Value |
Resulting Value (in the Index Unit) |
MyItemUri |
MyItemUri |
MyWrongItemUri |
MyWrongItemUri |
Text, String, and FullTextSearch
CSV Value |
Resulting Value (in the Index Unit) |
Any value (except Null Values) |
"Any value" |
"Any quoted value" (except Null Values) |
"Any quoted value" |
Property Data Structures
Singleton
Mono-valuated value.
List
An ordered collection of items (which may contain duplicates).
Data Factory Studio parses List columns in your CSV files using the character specified under
Destination > Index Properties, in the Multivalued
Separator column.
For example, with the Multivalued Separator
| , Data Factory Studio splits the CSV column alice|bob|camille into [alice ,
bob , camille ].
Recommendation:
It is better not to use the same character for both
Value Separator and Multivalued Separator.
If you want to use the same separator, cell values must be surrounded by quotes.
Data Factory Studio removes spaces at the beginning and at the end of the list, as for any other csv value,
unless it is quoted. However, remember that this trim operation does not apply to
individual items in the list. Data Factory Studio does not unquote list elements.
Example: name, neighbors
example-1, alice|bob|camille
example-2, alice | bob | camille
example-3, "alice"|"bob"|"camille"
example-4, ""alice"|"bob"|"camille"" Will result in the following lists:
[alice, bob, camille]
[alice␣, ␣bob␣, ␣camille]
[alice", "bob", "camille]
["alice", "bob", "camille"]
Similarly, the Null Values apply to the whole list, but not to individual elements.
Therefore, you cannot use the NullString inside a list. As of now, the only way to have
null values inside a list is to use empty strings (spaces and quotes are not allowed).
Collection
An unordered collection of items (which may contain duplicates). Collection parsing is
identical to List
parsing.
Map
Data Factory Studio does not support Maps for CSV data format.
|