Reference
Data Types
Spark SQL and DataFrames support the following data types:
- Numeric types
ByteType: Represents 1-byte signed integer numbers. The range of numbers is from-128to127.ShortType: Represents 2-byte signed integer numbers. The range of numbers is from-32768to32767.IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from-2147483648to2147483647.LongType: Represents 8-byte signed integer numbers. The range of numbers is from-9223372036854775808to9223372036854775807.FloatType: Represents 4-byte single-precision floating point numbers.DoubleType: Represents 8-byte double-precision floating point numbers.DecimalType: Represents arbitrary-precision signed decimal numbers. Backed internally byjava.math.BigDecimal. ABigDecimalconsists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
- String type
StringType: Represents character string values.
- Binary type
BinaryType: Represents byte sequence values.
- Boolean type
BooleanType: Represents boolean values.
- Datetime type
TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second.DateType: Represents values comprising values of fields year, month, day.
- Complex types
ArrayType(elementType, containsNull): Represents values comprising a sequence of elements with the type ofelementType.containsNullis used to indicate if elements in aArrayTypevalue can havenullvalues.MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. The data type of keys are described bykeyTypeand the data type of values are described byvalueType. For aMapTypevalue, keys are not allowed to havenullvalues.valueContainsNullis used to indicate if values of aMapTypevalue can havenullvalues.StructType(fields): Represents values with the structure described by a sequence ofStructFields (fields).StructField(name, dataType, nullable): Represents a field in aStructType. The name of a field is indicated byname. The data type of a field is indicated bydataType.nullableis used to indicate if values of this fields can havenullvalues.
All data types of Spark SQL are located in the package org.apache.spark.sql.types.
You can access them by doing
import org.apache.spark.sql.types._
| Data type | Value type in Scala | API to access or create a data type |
|---|---|---|
| ByteType | Byte | ByteType |
| ShortType | Short | ShortType |
| IntegerType | Int | IntegerType |
| LongType | Long | LongType |
| FloatType | Float | FloatType |
| DoubleType | Double | DoubleType |
| DecimalType | java.math.BigDecimal | DecimalType |
| StringType | String | StringType |
| BinaryType | Array[Byte] | BinaryType |
| BooleanType | Boolean | BooleanType |
| TimestampType | java.sql.Timestamp | TimestampType |
| DateType | java.sql.Date | DateType |
| ArrayType | scala.collection.Seq |
ArrayType(elementType, [containsNull]) Note: The default value of containsNull is true. |
| MapType | scala.collection.Map |
MapType(keyType, valueType, [valueContainsNull]) Note: The default value of valueContainsNull is true. |
| StructType | org.apache.spark.sql.Row |
StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type in Scala of the data type of this field (For example, Int for a StructField with the data type IntegerType) |
StructField(name, dataType, [nullable]) Note: The default value of nullable is true. |
All data types of Spark SQL are located in the package of
org.apache.spark.sql.types. To access or create a data type,
please use factory methods provided in
org.apache.spark.sql.types.DataTypes.
| Data type | Value type in Java | API to access or create a data type |
|---|---|---|
| ByteType | byte or Byte | DataTypes.ByteType |
| ShortType | short or Short | DataTypes.ShortType |
| IntegerType | int or Integer | DataTypes.IntegerType |
| LongType | long or Long | DataTypes.LongType |
| FloatType | float or Float | DataTypes.FloatType |
| DoubleType | double or Double | DataTypes.DoubleType |
| DecimalType | java.math.BigDecimal |
DataTypes.createDecimalType() DataTypes.createDecimalType(precision, scale). |
| StringType | String | DataTypes.StringType |
| BinaryType | byte[] | DataTypes.BinaryType |
| BooleanType | boolean or Boolean | DataTypes.BooleanType |
| TimestampType | java.sql.Timestamp | DataTypes.TimestampType |
| DateType | java.sql.Date | DataTypes.DateType |
| ArrayType | java.util.List |
DataTypes.createArrayType(elementType) Note: The value of containsNull will be true DataTypes.createArrayType(elementType, containsNull). |
| MapType | java.util.Map |
DataTypes.createMapType(keyType, valueType) Note: The value of valueContainsNull will be true. DataTypes.createMapType(keyType, valueType, valueContainsNull) |
| StructType | org.apache.spark.sql.Row |
DataTypes.createStructType(fields) Note: fields is a List or an array of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type in Java of the data type of this field (For example, int for a StructField with the data type IntegerType) | DataTypes.createStructField(name, dataType, nullable) |
All data types of Spark SQL are located in the package of pyspark.sql.types.
You can access them by doing
from pyspark.sql.types import *| Data type | Value type in Python | API to access or create a data type |
|---|---|---|
| ByteType |
int or long Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. |
ByteType() |
| ShortType |
int or long Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. |
ShortType() |
| IntegerType | int or long | IntegerType() |
| LongType |
long Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType. |
LongType() |
| FloatType |
float Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. |
FloatType() |
| DoubleType | float | DoubleType() |
| DecimalType | decimal.Decimal | DecimalType() |
| StringType | string | StringType() |
| BinaryType | bytearray | BinaryType() |
| BooleanType | bool | BooleanType() |
| TimestampType | datetime.datetime | TimestampType() |
| DateType | datetime.date | DateType() |
| ArrayType | list, tuple, or array |
ArrayType(elementType, [containsNull]) Note: The default value of containsNull is True. |
| MapType | dict |
MapType(keyType, valueType, [valueContainsNull]) Note: The default value of valueContainsNull is True. |
| StructType | list or tuple |
StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type in Python of the data type of this field (For example, Int for a StructField with the data type IntegerType) |
StructField(name, dataType, [nullable]) Note: The default value of nullable is True. |
| Data type | Value type in R | API to access or create a data type |
|---|---|---|
| ByteType |
integer Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. |
"byte" |
| ShortType |
integer Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. |
"short" |
| IntegerType | integer | "integer" |
| LongType |
integer Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType. |
"long" |
| FloatType |
numeric Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. |
"float" |
| DoubleType | numeric | "double" |
| DecimalType | Not supported | Not supported |
| StringType | character | "string" |
| BinaryType | raw | "binary" |
| BooleanType | logical | "bool" |
| TimestampType | POSIXct | "timestamp" |
| DateType | Date | "date" |
| ArrayType | vector or list |
list(type="array", elementType=elementType, containsNull=[containsNull]) Note: The default value of containsNull is TRUE. |
| MapType | environment |
list(type="map", keyType=keyType, valueType=valueType, valueContainsNull=[valueContainsNull]) Note: The default value of valueContainsNull is TRUE. |
| StructType | named list |
list(type="struct", fields=fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
| StructField | The value type in R of the data type of this field (For example, integer for a StructField with the data type IntegerType) |
list(name=name, type=dataType, nullable=[nullable]) Note: The default value of nullable is TRUE. |
NaN Semantics
There is specially handling for not-a-number (NaN) when dealing with float or double types that
does not exactly match standard floating point semantics.
Specifically:
- NaN = NaN returns true.
- In aggregations, all NaN values are grouped together.
- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.
Arithmetic operations
Operations performed on numeric types (with the exception of decimal) are not checked for overflow.
This means that in case an operation causes an overflow, the result is the same that the same operation
returns in a Java/Scala program (eg. if the sum of 2 integers is higher than the maximum value representable,
the result is a negative number).
