Projection
Schema projection is a way to optimize reads. When you call ParquetReader.as[MyData], Parquet4s reads the whole content of each Parquet record, even if the case class you provide maps only some of the stored columns. The same happens when you read generic records with ParquetReader.generic. However, you can explicitly tell Parquet4s to use a different schema. As a result, all columns that do not match your schema are skipped and not read. You can define the projection schema in several ways:
- by defining a case class for a typed read using projectedAs,
- by defining a generic column projection (which allows references to nested fields and aliases) using projectedGeneric,
- by providing your own instance of Parquet’s MessageType for a generic read using projectedGeneric.

import com.github.mjakubowski84.parquet4s.{Col, ParquetIterable, ParquetReader, Path, RowParquetRecord}
import org.apache.parquet.schema.MessageType

// typed read
case class MyData(column1: Int, columnX: String)
val myData: ParquetIterable[MyData] =
  ParquetReader
    .projectedAs[MyData]
    .read(Path("file.parquet"))

// generic read with column projection
val records1: ParquetIterable[RowParquetRecord] =
  ParquetReader
    .projectedGeneric(
      Col("column1").as[Int],
      Col("columnX").as[String].alias("my_column")
    )
    .read(Path("file.parquet"))

// generic read with own instance of Parquet schema
val schemaOverride: MessageType = ???
val records2: ParquetIterable[RowParquetRecord] =
  ParquetReader
    .projectedGeneric(schemaOverride)
    .read(Path("file.parquet"))
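
The schemaOverride placeholder in the last example has to be filled with a real MessageType. The following is a minimal sketch of two ways to build one with the schema classes of the Parquet Java library (org.apache.parquet.schema), assuming a Parquet version that provides LogicalTypeAnnotation. The message name "projection", the column types, and the required/optional repetition are assumptions made for illustration; in practice they have to agree with the schema of the file being read.

```scala
import org.apache.parquet.schema.{LogicalTypeAnnotation, MessageType, MessageTypeParser, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.{BINARY, INT32}

// Option 1: parse the projection from Parquet's textual schema notation.
// The message name "projection" is an arbitrary, illustrative choice.
val parsedSchema: MessageType = MessageTypeParser.parseMessageType(
  """message projection {
    |  required int32 column1;
    |  optional binary columnX (STRING);
    |}""".stripMargin
)

// Option 2: describe the same two columns with the fluent Types builder.
val builtSchema: MessageType = Types
  .buildMessage()
  .required(INT32).named("column1")
  .optional(BINARY).as(LogicalTypeAnnotation.stringType()).named("columnX")
  .named("projection")
```

Either value could then stand in for schemaOverride in the projectedGeneric(schemaOverride) call. The textual form is convenient when the schema is copied from the output of a schema inspection tool, while the builder keeps the definition checked by the compiler.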