Quick start
SBT
libraryDependencies ++= Seq(
"com.github.mjakubowski84" %% "parquet4s-core" % "2.20.0",
"org.apache.hadoop" % "hadoop-client" % yourHadoopVersion
)
Mill
def ivyDeps = Agg(
ivy"com.github.mjakubowski84::parquet4s-core:2.20.0",
ivy"org.apache.hadoop:hadoop-client:$yourHadoopVersion"
)
import com.github.mjakubowski84.parquet4s.{ ParquetReader, ParquetWriter, Path }
case class User(userId: String, name: String, created: java.sql.Timestamp)
val users: Iterable[User] = Seq(
User("1", "parquet", new java.sql.Timestamp(1L))
)
val path = Path("path/to/local/file.parquet")
// writing
ParquetWriter.of[User].writeAndClose(path, users)
// reading
val parquetIterable = ParquetReader.as[User].read(path)
try {
parquetIterable.foreach(println)
} finally parquetIterable.close()
AWS S3
To connect to AWS S3, add one more dependency:
"org.apache.hadoop" % "hadoop-aws" % yourHadoopVersion
Next, supply your credentials. The most common way is to define the following environment variables:
export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key
You may need to set some configuration properties to access your storage, e.g. fs.s3a.path.style.access.
Please refer to the Hadoop AWS documentation for more details and troubleshooting.
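As an illustration of setting such a property, here is a minimal sketch that sets it on a Hadoop Configuration object. It assumes an S3-compatible store that requires path-style access. The endpoint value is hypothetical and only stands in for your own; whether you need either property depends on your storage.

```scala
import org.apache.hadoop.conf.Configuration

// Start from the default Hadoop configuration (core-site.xml etc., if present on the classpath).
val s3Conf = new Configuration()

// Use path-style URLs (http://host/bucket/key) instead of virtual-hosted-style ones.
s3Conf.set("fs.s3a.path.style.access", "true")

// Hypothetical endpoint of an S3-compatible service; replace or remove as needed.
s3Conf.set("fs.s3a.endpoint", "http://localhost:9000")
```

The same properties can alternatively live in core-site.xml, in which case no code is needed.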
Passing Hadoop Configs Programmatically
File system configs for S3, GCS, Hadoop, etc. can also be set programmatically on the ParquetReader and ParquetWriter by passing the Configuration to the ParquetReader.Options and ParquetWriter.Options case classes.
import com.github.mjakubowski84.parquet4s.{ ParquetReader, ParquetWriter, Path }
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.hadoop.conf.Configuration
case class User(userId: String, name: String, created: java.sql.Timestamp)
val users: Iterable[User] = Seq(
User("1", "parquet", new java.sql.Timestamp(1L))
)
val writerOptions = ParquetWriter.Options(
compressionCodecName = CompressionCodecName.SNAPPY,
hadoopConf = new Configuration()
)
ParquetWriter
.of[User]
.options(writerOptions)
.writeAndClose(Path("path/to/local/file.parquet"), users)
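The reader side works the same way. The following is a minimal sketch, not a definitive recipe: it shares one Configuration between writer and reader and round-trips a single record through a temporary local file. The temporary directory, the file name and the property set on the Configuration are assumptions chosen for illustration; the property has no effect on local files.

```scala
import com.github.mjakubowski84.parquet4s.{ ParquetReader, ParquetWriter, Path }
import org.apache.hadoop.conf.Configuration
import java.nio.file.Files

case class User(userId: String, name: String, created: java.sql.Timestamp)

// One Configuration shared by the writer and the reader.
val conf = new Configuration()
// Illustrative property only; set whatever your file system requires.
conf.set("fs.s3a.path.style.access", "true")

// Hypothetical temporary location, so the sketch does not touch real storage.
val dir  = Files.createTempDirectory("parquet4s-example")
val file = Path(dir.resolve("users.parquet").toString)

// writing with the configuration
ParquetWriter
  .of[User]
  .options(ParquetWriter.Options(hadoopConf = conf))
  .writeAndClose(file, Seq(User("1", "parquet", new java.sql.Timestamp(1L))))

// reading with the same configuration
val parquetIterable = ParquetReader
  .as[User]
  .options(ParquetReader.Options(hadoopConf = conf))
  .read(file)

// Materialize the records before closing the underlying resources.
val readUsers =
  try parquetIterable.toList
  finally parquetIterable.close()
```

Building the Configuration once and passing it to both Options instances keeps reader and writer settings in sync.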