Note: Loading Tab-Separated Data In Cascalog
To load all fields from a tab-separated text file in Cascalog, we need to use the generic hfs-tap and specify the "scheme" (note that loading all fields with tab as the separator is the default behavior of TextDelimited):
(hfs-tap
  (cascading.scheme.hadoop.TextDelimited.)
  "hdfs:///user/hive/warehouse/playerevents/epoch_week=2196/output_aewa-analytics-ada_1334697041_1.log")
With a custom separator and fields:
(hfs-tap
  (cascading.scheme.hadoop.TextDelimited. (cascalog.workflow/fields ["?f1" "?f2"]) "\t") ; or pass cascading.tuple.Fields/ALL instead of (fields ...)
  "hdfs:///user/hive/warehouse/playerevents/epoch_week=2196/output_aewa-analytics-ada_1334697041_1.log")
Hadoop won't load data files from nested sub-directories on its own (for example, the directories of a Hive partitioned table). To load them, you need to pass a "glob pattern", which turns the standard Hfs tap into a GlobHfs tap. This is how we would match all the sub-directories (Hadoop will then handle loading the files inside them):
(hfs-tap
  (cascading.scheme.hadoop.TextDelimited.)
  "hdfs:///user/hive/warehouse/playerevents/"
  :source-pattern "epoch_week=*/")
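And a hedged sketch of consuming the glob tap, again building on the namespace declaration above and again assuming two tab-separated columns in the partitioned files; the query counts rows across every matched partition:
;; Hypothetical query: count all rows across the epoch_week=* partitions.
(let [tap (hfs-tap (TextDelimited. (w/fields ["?f1" "?f2"]) "\t")
                   "hdfs:///user/hive/warehouse/playerevents/"
                   :source-pattern "epoch_week=*/")]
  (?<- (stdout)
       [?count]
       (tap ?f1 ?f2)
       (c/count ?count)))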
Enjoy.