pyspark collect_set or collect_list with groupby


How can I use collect_set or collect_list on a dataframe after a groupby? For example: df.groupby('key').collect_set('values'). I get an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'

Author: Hanan Shteingart, 2016-06-02

1 answer

You need to use agg. Example:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext("local")

sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])

df.show()

+---+-----+-----+
| id| code| name|
+---+-----+-----+
|  a| null| null|
|  a|code1| null|
|  a|code2|name2|
+---+-----+-----+

Note that in the above you have to create a HiveContext. See https://stackoverflow.com/a/35529093/690430 for dealing with different Spark versions.

(df
  .groupby("id")
  .agg(F.collect_set("code"),
       F.collect_list("name"))
  .show())

+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
|  a|   [code1, code2]|           [name2]|
+---+-----------------+------------------+
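The output above shows two behaviours that are easy to miss: null values are skipped by both functions, and collect_set additionally drops duplicates. Below is a plain-Python sketch of that behaviour on the same rows. It is not Spark code, and collect_by_key is a hypothetical helper written only for illustration.

```python
from collections import defaultdict

# Same rows as the example DataFrame: (id, code, name)
rows = [
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
]

def collect_by_key(rows, key_idx, value_idx, distinct=False):
    """Group rows by one column and gather the non-null values of another."""
    groups = defaultdict(list)
    for row in rows:
        bucket = groups[row[key_idx]]  # creates the group even if all values are null
        value = row[value_idx]
        if value is None:
            continue  # nulls are skipped, as in the output above
        if distinct and value in bucket:
            continue  # set-like behaviour: keep each value once
        bucket.append(value)
    return dict(groups)

codes = collect_by_key(rows, 0, 1, distinct=True)  # mimics collect_set("code")
names = collect_by_key(rows, 0, 2)                 # mimics collect_list("name")
print(codes)
print(names)
```

One difference to keep in mind: this sketch keeps values in first-seen order, whereas Spark makes no ordering guarantee for collect_set.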
Author: ksindi, 2018-05-01 20:07:27