Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Pig Change Log

Release 0.8.1

INCOMPATIBLE CHANGES

PIG-1936: documentation update (chandec via olgan)

PIG-1680: HBaseStorage should work with HBase 0.90 (gstathis, billgraham, dvryaboy, tlipcon via dvryaboy)

IMPROVEMENTS

PIG-1936: doc updates (chandec via olgan)

PIG-1830: Type mismatch error in key from map, when doing GROUP on PigStorageSchema() variable (dvryaboy)

PIG-1886: Add zookeeper jar to list of jars shipped when HBaseStorage used (dvryaboy)

BUG FIXES

PIG-2316: Incorrect results for FILTER *** BY ( *** OR ***) with FilterLogicExpressionSimplifier optimizer turned on (knoguchi via thejas)

PIG-2077: Project UDF output inside a non-foreach statement fails on 0.8 (daijy)

PIG-2067: FilterLogicExpressionSimplifier removed some branches in some cases (daijy)

PIG-2033: Pig returns success for the failed Pig script (rding)

PIG-1870: HBaseStorage doesn't project correctly (dvryaboy)

PIG-1979: New logical plan failing with ERROR 2229: Couldn't find matching uid -1 (daijy)

PIG-1977: "Stream closed" error while reading Pig temp files (results of intermediate jobs) (rding)

PIG-1911: Infinite loop with accumulator function in nested foreach (thejas)

PIG-1964: PigStorageSchema fails if a column value is null (thejas)

PIG-1963: in nested foreach, accumulative udf taking input from order-by does not get results in order (thejas)

PIG-1993: PigStorageSchema throw NPE with ColumnPruning (daijy)

PIG-1935: New logical plan: Should not push up filter in front of Bincond (daijy)

PIG-1861: The pig script stored in the Hadoop History logs is stored as a concatenated string without whitespace; this causes problems when attempting to extract and execute the script (rding)

PIG-1912: non-deterministic output when a file is loaded multiple times (daijy)

PIG-1892: Bug in new logical plan : No output generated even though there are valid records (daijy)

PIG-1808: Error message in 0.8 not as helpful as in 0.7 (daijy)

PIG-1770: matches clause problem with chars that have special meaning in dk.brics - #, @ .. (thejas)

PIG-1884: Change ReadToEndLoader.setLocation not throw UnsupportedOperationException (thejas)

PIG-1858: UDF in nested plan results frontend exception (daijy)

PIG-1862: Pig returns exit code 0 for the failed Pig script due to non-existing input directory (rding)

PIG-1850: Order by is failing with ClassCastException if schema is undefined for new logical plan in 0.8 (daijy)

PIG-1831: Indeterministic behavior in local mode due to static variable PigMapReduce.sJobConf (daijy)

PIG-1841: TupleSize implemented incorrectly (laukik via daijy)

PIG-1843: NPE in schema generation (daijy)

PIG-1820: New logical plan: FilterLogicExpressionSimplifier fail to deal with UDF (daijy)

PIG-1854: Pig returns exit code 0 for the failed Pig script (rding)

PIG-1812: Problem with DID_NOT_FIND_LOAD_ONLY_MAP_PLAN (daijy)

PIG-1813: Pig 0.8 throws ERROR 1075 while trying to refer a map in the result of eval udf. Works with 0.7 (daijy)

PIG-1776: changing statement corresponding to alias after explain, then doing dump gives incorrect result (thejas)

PIG-1800: Missing Signature for maven staging release (rding)

PIG-1815: pig task retains used instances of PhysicalPlan (thejas)

PIG-1785: New logical plan: uid conflict in flattened fields (daijy)

PIG-1787: Error in logical plan generated (daijy)

PIG-1791: System property mapred.output.compress, but pig-cluster-hadoop-site.xml doesn't (daijy)

PIG-1771: New logical plan: Merge schema fail if LoadFunc.getSchema return different schema with "Load...AS" (daijy)

PIG-1766: New logical plan: ImplicitSplitInserter should before DuplicateForEachColumnRewrite (daijy)

PIG-1762: Logical simplification fails on map key referenced values (yanz)

PIG-1761: New logical plan: Exception when bag dereference in the middle of expression (daijy)

PIG-1760: Need to report progress in all databags (rding)

OPTIMIZATIONS

Release 0.8.0 - 12/17/10

INCOMPATIBLE CHANGES

PIG-1518: multi file input format for loaders (yanz via rding)

PIG-1249: Safe-guards against misconfigured Pig scripts without PARALLEL keyword (zjffdu via olgan)

IMPROVEMENTS

PIG-1561: XMLLoader in Piggybank does not support bz2 or gzip compressed XML files (vivekp via daijy)

PIG-1728: doc updates (chandec via olgan)

PIG-1756: doc updates (chandec via olgan)

PIG-1707: Allow pig build to pull from alternate maven repo to enable building against newer hadoop versions (pradeepkth)

PIG-1677: modify the repository path of pig artifacts to org/apache/pig instead of org/apache/hadoop/pig (nrai via olgan)

PIG-1600: Docs update (romainr via olgan)

PIG-1531: Pig gobbles up error messages (nrai via hashutosh)

PIG-1628: log this message at debug level : 'Pig Internal storage in use' (thejas)

PIG-1632: The core jar in the tarball contains the kitchen sink (eli via olgan)

PIG-1617: 'group all' should always use one reducer (thejas)

PIG-1589: add test cases for mapreduce operator which use distributed cache (thejas)

PIG-1575: Complete the migration of optimization rule PushUpFilter including missing test cases (xuefuz via daijy)

PIG-1548: Optimize scalar to consolidate the part file (rding)

PIG-1600: Docs update (chandec via olgan)

PIG-1585: Add new properties to help and documentation (olgan)

PIG-1399: Filter expression optimizations (yanz via gates)

PIG-1531: Pig gobbles up error messages (nrai via hashutosh)

PIG-1458: aggregate files for replicated join (rding)

PIG-1205: Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc (zjffdu and dvryaboy)

PIG-1568: Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly (xuefuz via daijy)

PIG-1574: Optimization rule PushUpFilter causes filter to be pushed up out joins (xuefuz via daijy)

PIG-1515: Migrate logical optimization rule: PushDownForeachFlatten (xuefuz via daijy)

PIG-1321: Logical Optimizer: Merge cascading foreach (xuefuz via daijy)

PIG-1483: [piggybank] Add HadoopJobHistoryLoader to the piggybank (rding)

PIG-1555: [piggybank] add CSV Loader (dvryaboy)

PIG-1501: need to investigate the impact of compression on pig performance (yanz via thejas)

PIG-1497: Mandatory rule PartitionFilterOptimizer (xuefuz via daijy)

PIG-1514: Migrate logical optimization rule: OpLimitOptimizer (xuefuz via daijy)

PIG-1551: Improve dynamic invokers to deal with no-arg methods and array parameters (dvryaboy)

PIG-1311: Document audience and stability for remaining interfaces (gates)

PIG-506: Does pig need a NATIVE keyword? (aniket486 via thejas)

PIG-1510: Add `deepCopy` for LogicalExpressions (swati.j via daijy)

PIG-1447: Tune memory usage of InternalCachedBag (thejas)

PIG-1505: support jars and scripts in dfs (anhi via rding)

PIG-1334: Make pig artifacts available through maven (niraj via rding)

PIG-1466: Improve log messages for memory usage (thejas)

PIG-1404: added PigUnit, a framework for building unit tests of Pig Latin scripts (romainr via gates)

PIG-1452: to remove hadoop20.jar from lib and use hadoop from the apache maven repo. (rding)

PIG-1295: Binary comparator for secondary sort (azaroth via daijy)

PIG-1448: Detach tuple from inner plans of physical operator (thejas)

PIG-965: PERFORMANCE: optimize common case in matches (PORegex) (ankit.modi via olgan)

PIG-103: Shared Job /tmp location should be configurable (niraj via rding)

PIG-1496: Mandatory rule ImplicitSplitInserter (yanz via daijy)

PIG-346: grunt help command cleanup (olgan)

PIG-1199: help includes obsolete options (olgan)

PIG-1434: Allow casting relations to scalars (aniket486 via rding)

PIG-1461: support union operation that merges based on column names (thejas)

PIG-1517: Pig needs to support keywords in the package name (aniket486 via olgan)

PIG-928: UDFs in scripting languages (aniket486 via daijy)

PIG-1509: Add .gitignore file (cwsteinbach via gates)

PIG-1478: Add progress notification listener to PigRunner API (rding)

PIG-1472: Optimize serialization/deserialization between Map and Reduce and between MR jobs (thejas)

PIG-1389: Implement Pig counter to track number of rows for each input files (rding)

PIG-1454: Consider clean up backend code (rding)

PIG-1333: API interface to Pig (rding)

PIG-1405: Need to move many standard functions from piggybank into Pig (aniket486 via daijy)

PIG-1427: Monitor and kill runaway UDFs (dvryaboy)

PIG-1428: Make a StatusReporter singleton available for incrementing counters (dvryaboy)

PIG-972: Make describe work with nested foreach (aniket486 via daijy)

PIG-1438: [Performance] MultiQueryOptimizer should also merge DISTINCT jobs (rding)

PIG-1441: new test targets (olgan)

PIG-282: Custom Partitioner (aniket486 via daijy)

PIG-283: Allow to set arbitrary jobconf key-value pairs inside pig program (hashutosh)

PIG-1373: We need to add jdiff output to docs on the website (daijy)

PIG-1422: Duplicate code in LOPrinter.java (zjffdu)

PIG-1420: Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple (rjurney via dvryaboy)

PIG-1408: Annotate explain plans with aliases (rding)

PIG-1410: Make PigServer can handle files with parameters (zjffdu)

PIG-1406: Allow to run shell commands from grunt (zjffdu)

PIG-1398: Marking Pig interfaces for org.apache.pig.data package (gates)

PIG-1396: eclipse-files target in build.xml fails to generate necessary classes in src-gen

PIG-1390: Provide a target to generate eclipse-related classpath and files (chaitk via thejas)

PIG-1384: Adding contrib javadoc to main Pig javadoc (daijy)

PIG-1320: final documentation updates for Pig 0.7.0 (chandec via olgan)

PIG-1363: Unnecessary loadFunc instantiations (hashutosh)

PIG-1370: Marking Pig interface for org.apache.pig package (gates)

PIG-1354: UDFs for dynamic invocation of simple Java methods (dvryaboy)

PIG-1316: TextLoader should use Bzip2TextInputFormat for bzip files so that bzip files can be efficiently processed by splitting the files (pradeepkth)

PIG-1317: LOLoad should cache results of LoadMetadata.getSchema() for use in subsequent calls to LOLoad.getSchema() or LOLoad.determineSchema() (pradeepkth)

PIG-1413: Remove svn:externals reference for test-patch.sh and create a local copy of test-patch.sh (gkesavan)

PIG-1302: Include zebra's "pigtest" ant target as a part of pig's ant test target. (gkesavan)

PIG-1582: To upgrade commons-logging

OPTIMIZATIONS

PIG-1353: Map-side joins (ashutoshc)

PIG-1309: Map-side Cogroup (ashutoshc)

BUG FIXES

PIG-1709: Skewed join use fewer reducer for extreme large key (daijy)

PIG-1751: New logical plan: PushDownForEachFlatten fail in UDF with unknown output schema (daijy)

PIG-1741: Lineage fail when flatten a bag (daijy)

PIG-1739: zero status is returned when pig script fails (yanz)

PIG-1738: New logical plan: Optimized UserFuncExpression.getFieldSchema (daijy)

PIG-1732: New logical plan: logical plan get confused if we generate the same field twice in ForEach (daijy)

PIG-1737: New logical plan: Improve error messages when merge schema fail (daijy)

PIG-1725: New logical plan: uidOnlySchema bug in LOGenerate (daijy)

PIG-1729: New logical plan: Dereference does not add into plan after deepCopy (daijy)

PIG-1721: New logical plan: script fail when reuse foreach inner alias (daijy)

PIG-1716: New logical plan: LogToPhyTranslationVisitor should translate the structure for regex optimization (daijy)

PIG-1740: Fix SVN location in setup doc (chandec via olgan)

PIG-1719: New logical plan: FieldSchema generation for BinCond is wrong (daijy)

PIG-1720: java.lang.NegativeArraySizeException during Quicksort (thejas)

PIG-1715: pig-withouthadoop.jar missing automaton.jar (thejas)

PIG-1727: Hadoop default config override pig.properties (rding)

PIG-1731: Stack Overflows where there are composite logical expressions on UDFs using the new logical plan (yanz)

PIG-1723: Need to limit the length of Pig counter names (rding)

PIG-1714: Option mapred.output.compress doesn't work in Pig 0.8 but worked in 0.7 (xuefuz via rding)

PIG-1706: New logical plan: PushDownFlattenForEach fail if flattened field has user defined schema (daijy)

PIG-1705: New logical plan: self-join fail for some queries (daijy)

PIG-1704: Output Compression is not at work if the output path is absolute and there is a trailing / after the compression suffix (yanz)

PIG-1695: MergeForEach does not carry user defined schema if any one of the merged ForEach has user defined schema (daijy)

PIG-1684: Inconsistent usage of store func. (thejas)

PIG-1694: union-onschema projects null schema at parsing stage for some queries (thejas)

PIG-1685: Pig is unable to handle counters for glob paths? (daijy)

PIG-1683: New logical plan: Nested foreach plan fail if one inner alias is referred more than once (daijy)

PIG-1542: log level not propagated to MR task loggers (nrai via daijy)

PIG-1673: query with consecutive union-onschema statement errors out (thejas)

PIG-1653: Scripting UDF fails if the path to script is an absolute path (daijy)

PIG-1669: PushUpFilter fail when filter condition contains scalar (daijy)

PIG-1672: order of relations in replicated join gets switched in a query where first relation has two mergeable foreach statements (thejas)

PIG-1666: union onschema fails when the input relation has cast from bytearray to another type (thejas)

PIG-1656: TOBAG udfs ignores columns with null value; it does not use input type to determine output schema (thejas)

PIG-1655: code duplicated for udfs that were moved from piggybank to builtin (nrai via daijy)

PIG-1670: pig throws ExecException instead of FrontEnd exception when the plan validation fails (nrai via daijy)

PIG-1668: Order by failed with RuntimeException (rding)

PIG-1659: sortinfo is not set for store if there is a filter after ORDER BY (daijy)

PIG-1664: leading '_' in directory/file names should be ignored; the "pigtest" build target should include all pig-related zebra tests. (yanz)

PIG-1662: Need better error message for MalFormedProbVecException (rding)

PIG-1658: ORDER BY does not work properly on integer/short keys that are -1 (yanz)

PIG-1638: sh output gets mixed up with the grunt prompt (nrai via daijy)

PIG-1607: pig should have separate javadoc.jar in the maven repository (nrai via thejas)

PIG-1651: PIG class loading mishandled (rding)

PIG-1650: pig grunt shell breaks for many commands like perl, awk, pipe, 'ls -l' etc (nrai via thejas)

PIG-1649: FRJoin fails to compute number of input files for replicated input (thejas)

PIG-1637: Combiner not used because optimizer inserts a foreach between group and algebraic function (daijy)

PIG-1648: Split combination may return too many block locations to map/reduce framework (yanz)

PIG-1641: Incorrect counters in local mode (rding)

PIG-1647: Logical simplifier throws a NPE (yanz)

PIG-1642: Order by doesn't use estimation to determine the parallelism (rding)

PIG-1644: New logical plan: Plan.connect with position is misused in some places (daijy)

PIG-1643: join fails for a query with input having 'load using pigstorage without schema' + 'foreach' (daijy)

PIG-1645: Using both small split combination and temporary file compression on a query of ORDER BY may cause crash (yanz)

PIG-1635: Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed (yanz)

PIG-1639: New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF (xuefuz via daijy)

PIG-1643: join fails for a query with input having 'load using pigstorage without schema' + 'foreach' (thejas)

PIG-1636: Scalar fail if the scalar variable is generated by limit (daijy)

PIG-1605: Adding soft link to plan to solve input file dependency (daijy)

PIG-1598: Pig gobbles up error messages - Part 2 (nrai via daijy)

PIG-1616: 'union onschema' does not create output with correct schema when udfs are involved (thejas)

PIG-1610: 'union onschema' does handle some cases involving 'namespaced' column names in schema (thejas)

PIG-1609: 'union onschema' should give a more useful error message when schema of one of the relations has null column name (thejas)

PIG-1562: Fix the version for the dependent packages for the maven (nrai via rding)

PIG-1604: 'relation as scalar' does not work with complex types (thejas)

PIG-1601: Make scalar work for secure hadoop (daijy)

PIG-1602: The .classpath of eclipse template still use hbase-0.20.0 (zjffdu)

PIG-1596: NPE's thrown when attempting to load hbase columns containing null values (zjffdu)

PIG-1597: Development snapshot jar no longer picked up by bin/pig

PIG-1599: pig gives generic message for few cases (nrai via rding)

PIG-1595: casting relation to scalar- problem with handling of data from non PigStorage loaders (thejas)

PIG-1591: pig does not create a log file, if the MR job succeeds but front end fails (nrai via daijy)

PIG-1543: IsEmpty returns the wrong value after using LIMIT (daijy)

PIG-1550: better error handling in casting relations to scalars (thejas)

PIG-1572: change default datatype when relations are used as scalar to bytearray (thejas)

PIG-1583: piggybank unit test TestLookupInFiles is broken (daijy)

PIG-1563: some of string functions don't work on bytearrays (olgan)

PIG-1569: java properties not honored in case of properties such as stop.on.failure (rding)

PIG-1570: native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs (thejas)

PIG-1343: pig_log file missing even though Main tells it is creating one and an M/R job fails (nrai via rding)

PIG-1482: Pig gets confused when more than one loader is involved (xuefuz via thejas)

PIG-1579: Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput (daijy)

PIG-1557: couple of issues mapping aliases to jobs (rding)

PIG-1552: Nested describe failed when the alias is not referred in the first foreach inner plan (aniket486 via daijy)

PIG-1486: update ant eclipse-files target to include new jar and remove contrib dirs from build path (thejas)

PIG-1524: 'Proactive spill count' is misleading (thejas)

PIG-1546: Incorrect assert statements in operator evaluation (ajaykidave via pradeepkth)

PIG-1392: Parser fails to recognize valid field (niraj via rding)

PIG-1541: FR Join shouldn't match null values (rding)

PIG-1525: Incorrect data generated by diff of SUM (rding)

PIG-1288: EvalFunc returnType is wrong for generic subclasses (daijy)

PIG-1534: Code discovering UDFs in the script has a bug in a order by case (pradeepkth)

PIG-1533: Compression codec should be a per-store property (rding)

PIG-1527: No need to deserialize UDFContext on the client side (rding)

PIG-1516: finalize in bag implementations causes pig to run out of memory in reduce (thejas)

PIG-1521: explain plan does not show correct Physical operator in MR plan when POSortedDistinct, POPackageLite are used (thejas)

PIG-1513: Pig doesn't handle empty input directory (rding)

PIG-1500: guava.jar should be removed from the lib folder (niraj via rding)

PIG-1034: Pig does not support ORDER ... BY group alias (zjffdu)

PIG-1445: Pig error: ERROR 2013: Moving LOLimit in front of LOStream is not implemented (daijy)

PIG-348: -j command line option doesn't work (rding)

PIG-1487: Replace "bz" with ".bz" in all the LoadFunc

PIG-1489: Pig MapReduceLauncher does not use jars in register statement (rding)

PIG-1435: make sure dependent jobs fail when a job in multiquery fails (niraj via rding)

PIG-1492: DefaultTuple and DefaultMemory underestimate their memory footprint (thejas)

PIG-1409: Fix up javadocs for org.apache.pig.builtin (gates)

PIG-1490: Make Pig storers work with remote HDFS in secure mode (rding)

PIG-1469: DefaultDataBag assumes ArrayList as default List type (azaroth via dvryaboy)

PIG-1467: order by fail when set "fs.file.impl.disable.cache" to true (daijy)

PIG-1463: Replace "bz" with ".bz" in setStoreLocation in PigStorage (zjffdu)

PIG-1221: Filter equality does not work for tuples (zjffdu)

PIG-1456: TestMultiQuery takes a long time to run (rding)

PIG-1457: Pig will run complete zebra test even we give -Dtestcase=xxx (daijy)

PIG-1450: TestAlgebraicEvalLocal failures due to OOM (daijy)

PIG-1433: pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true (pradeepkth)

PIG-1347: Clear up output directory for a failed job (daijy)

PIG-1419: Remove "user.name" from JobConf (daijy)

PIG-1359: bin/pig script does not pick up correct jar libraries (zjffdu)

PIG-566: Dump and store outputs do not match for PigStorage (azaroth via daijy)

PIG-1414: Problem with parameter substitution (rding)

PIG-1407: Logging starts before being configured (azaroth via daijy)

PIG-1391: pig unit tests leave behind files in temp directory because MiniCluster files don't get deleted (tejas)

PIG-1211: Pig script runs half way after which it reports syntax error (pradeepkth)

PIG-1401: "explain -script