The CSVReader Controller Service expects input in which the first line of a FlowFile specifies the name of each column in the data. The remainder of the FlowFile is expected to be valid CSV data, from which the appropriate Records are formed. The reader allows the CSV format to be customized, such as which character separates CSV fields, which character is used for quoting and when fields are quoted, which character denotes a comment, and so on.
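
For example, pipe-delimited data that marks comment lines with a leading # character, such as the following, could reasonably be handled by pointing the reader's separator and comment settings (typically the "Value Separator" and "Comment Marker" properties) at the | and # characters:

# monthly export of customer balances
id|name|balance
1|John|48.23
2|Jane|1245.89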

Schemas and Type Coercion

When a record is parsed from incoming data, it is separated into fields. Each of these fields is then looked up against the configured schema (by field name) in order to determine what the type of the data should be. If the field is not present in the schema, that field is omitted from the Record. If the field is found in the schema, the data type of the received data is compared against the data type specified in the schema. If the types match, the value of that field is used as-is. If the schema indicates that the field should be of a different type, then the Controller Service will attempt to coerce the data into the type specified by the schema. If the field cannot be coerced into the specified type, an Exception will be thrown.

The following rules apply when attempting to coerce a field value from one data type to another:

If none of the above rules apply when attempting to coerce a value from one data type to another, the coercion will fail and an Exception will be thrown.
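
For example, if the schema declares a field as a double (like the "balance" field in Example 1 below), a String value that is a valid number can be coerced into that type, whereas a non-numeric String cannot:

Schema field:          { "name": "balance", "type": "double" }
CSV value "48481.29"   coerces to the double value 48481.29
CSV value "n/a"        cannot be coerced; an Exception is thrown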

Examples

Example 1

As an example, consider a FlowFile whose contents consist of the following:

id, name, balance, join_date, notes
1, John, 48.23, 04/03/2007, "Our very
first customer!"
2, Jane, 1245.89, 08/22/2009,
3, Frank Franklin, "48481.29", 04/04/2016,

Additionally, let's assume that this Controller Service is configured with a Schema Registry pointing to an AvroSchemaRegistry and that the schema is defined as follows:

{
  "namespace": "nifi",
  "name": "balances",
  "type": "record",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "name": "type": "string" },
    { "name": "balance": "type": "double" },
    { "name": "join_date", "type": {
      "type": "int",
      "logicalType": "date"
    }},
    { "name": "notes": "type": "string" }
  ]
}

In the example above, the "join_date" column is a Date type. In order for the CSV Reader to properly parse a value as a date, we need to provide the reader with the date format to use. In this example, we would configure the Date Format property to be MM/dd/yyyy to indicate a two-digit month, followed by a two-digit day, followed by a four-digit year, each separated by a slash. As a result, this FlowFile consists of 3 different records. The first record will contain the following values:

Field Name    Field Value
id            1
name          John
balance       48.23
join_date     04/03/2007
notes         Our very
              first customer!

The second record will contain the following values:

Field Name    Field Value
id            2
name          Jane
balance       1245.89
join_date     08/22/2009
notes

The third record will contain the following values:

Field Name    Field Value
id            3
name          Frank Franklin
balance       48481.29
join_date     04/04/2016
notes
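
Note that the Date Format property determines how the raw text is interpreted: with MM/dd/yyyy, the value 04/03/2007 above is read as April 3, 2007, whereas a day-first format such as dd/MM/yyyy would read the same text as March 4, 2007.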

Example 2 - Schema with CSV Header Line

When CSV data includes a header line that lists the column names, the reader provides a few properties for configuring how those column names are handled. The "Schema Access Strategy" property, along with the associated "Schema Registry," "Schema Text," and "Schema Name" properties, specifies how the schema is obtained. If the "Schema Access Strategy" is set to "Use String Fields From Header," the header line of the CSV is used to determine the schema. Otherwise, the schema is obtained from another source. But what happens if the schema is obtained from a Schema Registry, for instance, and the CSV header indicates a different set of column names?

For example, let's say that the following schema is obtained from the Schema Registry:

{
  "namespace": "nifi",
  "name": "balances",
  "type": "record",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "name": "type": "string" },
    { "name": "balance": "type": "double" },
    { "name": "memo": "type": "string" }
  ]
}

And the CSV contains the following data:

id, name, balance, notes
1, John Doe, 123.45, First Customer

Note here that our schema indicates that the final column is named "memo" whereas the CSV Header indicates that it is named "notes."

In this case, the reader will look at the "Ignore CSV Header Column Names" property. If this property is set to "true" then the column names provided in the CSV will simply be ignored and the last column will be called "memo." However, if the "Ignore CSV Header Column Names" property is set to "false" then the result will be that the last column will be named "notes" and each record will have a null value for the "memo" column.

With "Ignore CSV Header Column Names" property set to "false":

Field Name Field Value
id 1
name John Doe
balance 123.45
memo First Customer

With "Ignore CSV Header Column Names" property set to "true":

Field Name Field Value
id 1
name John Doe
balance 123.45
notes First Customer
memo null
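
As a rough sketch, a configuration that produces the behavior compared above might look like the following (property names and values are those exposed by the CSVReader and may vary by NiFi version; the schema name "balances" is taken from the schema above):

Schema Access Strategy:           Use 'Schema Name' Property
Schema Registry:                  AvroSchemaRegistry
Schema Name:                      balances
Treat First Line as Header:       true
Ignore CSV Header Column Names:   true or false, as compared above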