Spark DataFrame groupBy and sort in descending order (pyspark)
I'm using pyspark (Python 2.7.9/Spark 1.3.1) and have a DataFrame GroupObject that I need to filter and sort in descending order. I'm trying to do it with this piece of code:
group_by_dataframe.count().filter("`count` >= 10").sort('count', ascending=False)
But it throws the following error:
sort() got an unexpected keyword argument 'ascending'
23
3 answers
In PySpark 1.3 the sort method does not accept an ascending parameter. You can use the desc method of the column instead:
from pyspark.sql.functions import col
(group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(col("count").desc()))
Or the desc function:
from pyspark.sql.functions import desc
(group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(desc("count")))
Both methods can be used with Spark >= 1.3 (including Spark 2.x).
48
Author: zero323,
2017-03-31 18:08:53
Use orderBy:
group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)
http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html
18
Author: Henrique Florêncio,
2017-03-08 17:52:06
Similar to the above, but sorting on a renamed (aliased) column:
from pyspark.sql.functions import desc
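# df is assumed to be the grouped object (the result of groupBy)
df = (df.count().withColumnRenamed("count", "newColName")
      .filter("newColName >= 10")  # filter on the new name; "count" no longer exists
      .sort(desc("newColName")))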
df.show()
1
Author: Grant Shannon,
2017-11-23 14:30:28