Note: Loading Tab-Separated Data In Cascalog
To load all fields from a tab-separated text file in Cascalog, we need to use the generic hfs-tap and specify the "scheme" (note that loading all fields with tab as the separator is the default behavior of TextDelimited):
(hfs-tap
  (cascading.scheme.hadoop.TextDelimited.)
  "hdfs:///user/hive/warehouse/playerevents/epoch_week=2196/output_aewa-analytics-ada_1334697041_1.log")
With a custom separator and fields:
(hfs-tap
  (cascading.scheme.hadoop.TextDelimited. (cascalog.workflow/fields ["?f1" "?f2"]) "\t") ; or pass cascading.tuple.Fields/ALL instead of (fields ...)
  "hdfs:///user/hive/warehouse/playerevents/epoch_week=2196/output_aewa-analytics-ada_1334697041_1.log")
Hadoop won't load data files from nested sub-directories on its own (for example, the directories of a Hive partitioned table). To load them, you need to pass a "glob pattern", which turns the standard Hfs tap into a GlobHfs tap. This is how we would match all the sub-directories (Hadoop will then handle loading the files inside them):
(hfs-tap
  (cascading.scheme.hadoop.TextDelimited.)
  "hdfs:///user/hive/warehouse/playerevents/"
  :source-pattern "epoch_week=*/")
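And a hedged sketch of consuming the glob tap, again building on the namespace declaration above and again assuming two tab-separated columns in the partitioned files; the query counts rows across every matched partition:
;; Hypothetical query: count all rows across the epoch_week=* partitions.
(let [tap (hfs-tap (TextDelimited. (w/fields ["?f1" "?f2"]) "\t")
                   "hdfs:///user/hive/warehouse/playerevents/"
                   :source-pattern "epoch_week=*/")]
  (?<- (stdout)
       [?count]
       (tap ?f1 ?f2)
       (c/count ?count)))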
Enjoy.