PySpark API Reference

2023-11-09

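This content is a translation and extension of the official PySpark Spark SQL documentation.

Official documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html

For details on any object, call the built-in help and dir functions in your program to view its documentation and attributes. In IPython or Jupyter you can also append a ? to an object to view its help, for example SparkSession.builder?. This is an IPython feature, not part of standard Python.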

Spark Session

The main entry point of a Spark SQL program.

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config("spark.sql.shuffle.partitions",24)\
    .config("spark.executor.cores",8)\
    .config("spark.executor.memory","16g")\
    .config("spark.executor.instances",1)\
    .config("spark.driver.memory","4g")\
    .appName('pyspark')\
    .getOrCreate()

`SparkSession.builder.appName`(name)	Sets a name for the application.
`SparkSession.builder.config`([key, value, …])	Sets a configuration option.
`SparkSession.builder.enableHiveSupport`()	Enables Hive support, so that Hive tables can be read.
`SparkSession.builder.getOrCreate`()	Gets an existing SparkSession or, if there is none, creates a new one based on the options set in this builder.
`SparkSession.builder.master`(master)	Sets the Spark master URL to connect to, such as “local” to run locally, “local[4]” to run locally with 4 cores, or “spark://master:7077” to run on a Spark standalone cluster.
`SparkSession.builder.remote`(url)	Sets the Spark remote URL to connect to, such as “sc://host:port” to run it via Spark Connect server.
`SparkSession.catalog`	Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
`SparkSession.conf`	Runtime configuration interface for Spark.
`SparkSession.createDataFrame`(data[, schema, …])	Creates a `DataFrame`.
`SparkSession.getActiveSession`()	Returns the active SparkSession for the current thread, as returned by the builder.
`SparkSession.newSession`()	Returns a new SparkSession as a new session, which has separate SQLConf, registered temporary views and UDFs, but shares the SparkContext and table cache.
`SparkSession.range`(start[, end, step, …])	Create a `DataFrame` with single `pyspark.sql.types.LongType` column named `id`, containing elements in a range from `start` to `end` (exclusive) with step value `step`.
`SparkSession.read`	Returns a `DataFrameReader` that can be used to read data in as a `DataFrame`.
`SparkSession.readStream`	Returns a `DataStreamReader` that can be used to read data streams as a streaming `DataFrame`.
`SparkSession.sparkContext`	Returns the underlying `SparkContext`.
`SparkSession.sql`(sqlQuery[, args])	Runs a SQL query and returns the result as a `DataFrame`.
`SparkSession.stop`()	Stops the underlying `SparkContext`.
`SparkSession.streams`	Returns a `StreamingQueryManager` that allows managing all the `StreamingQuery` instances active on this context.
`SparkSession.table`(tableName)	Returns the specified table as a `DataFrame`.
`SparkSession.udf`	Returns a `UDFRegistration` for UDF registration.
`SparkSession.version`	The version of Spark on which this application is running.

Configuration

Spark SQL configuration, such as driver and executor resource settings and shuffle settings.

`RuntimeConfig`(jconf)	User-facing configuration API, accessible through SparkSession.conf.

Input/Output

SparkSession.read returns a DataFrameReader object; DataFrame.write returns a DataFrameWriter object.

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("local[8]")\
    .appName('pyspark')\
    .enableHiveSupport()\
    .getOrCreate()

data = [[1,'alice'],[2,'mary']]
spark_df = spark.createDataFrame(data,['id','name'])
spark_df.write.saveAsTable('dw_pub_safe.dw_pub_xxx_xxx')   # save as a Hive table

File input and output

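`DataFrameReader.csv`(path[, schema, sep, …])	Loads a CSV file and returns the result as a `DataFrame`.
`DataFrameReader.format`(source)	Specifies the input data source format.
`DataFrameReader.jdbc`(url, table[, column, …])	Constructs a `DataFrame` representing the database table named table, accessible via the JDBC URL url and connection properties.
`DataFrameReader.json`(path[, schema, …])	Loads JSON files and returns the result as a `DataFrame`.
`DataFrameReader.load`([path, format, schema])	Loads data from a data source and returns it as a `DataFrame`; the data format can be set with the format parameter.
`DataFrameReader.option`(key, value)	Adds an input option for the underlying data source.
`DataFrameReader.options`(**options)	Adds input options for the underlying data source; accepts several options at once.
`DataFrameReader.orc`(path[, mergeSchema, …])	Loads ORC files and returns the result as a `DataFrame`.
`DataFrameReader.parquet`(*paths, **options)	Loads Parquet files and returns the result as a `DataFrame`.
`DataFrameReader.schema`(schema)	Specifies the input schema (column names and types).
`DataFrameReader.table`(tableName)	Returns the specified table as a `DataFrame`.
`DataFrameReader.text`(paths[, wholetext, …])	Loads text files and returns a `DataFrame`.
`DataFrameWriter.bucketBy`(numBuckets, col, *cols)	Buckets the output by the given columns.
`DataFrameWriter.csv`(path[, mode, …])	Saves the content of the `DataFrame` in CSV format at the specified path.
`DataFrameWriter.format`(source)	Specifies the output data source format.
`DataFrameWriter.insertInto`(tableName[, …])	Inserts the content of the `DataFrame` into the specified table.
`DataFrameWriter.jdbc`(url, table[, mode, …])	Saves the content of the `DataFrame` to an external database table via JDBC.
`DataFrameWriter.json`(path[, mode, …])	Saves the content of the `DataFrame` in JSON format at the specified path.
`DataFrameWriter.mode`(saveMode)	Specifies the behavior when the data or table already exists.
`DataFrameWriter.options`(**options)	Adds output options for the underlying data source.
`DataFrameWriter.orc`(path[, mode, …])	Saves the content of the `DataFrame` in ORC format at the specified path.
`DataFrameWriter.parquet`(path[, mode, …])	Saves the content of the `DataFrame` in Parquet format at the specified path.
`DataFrameWriter.partitionBy`(*cols)	Partitions the output by the given columns.
`DataFrameWriter.save`([path, format, mode, …])	Saves the content of the `DataFrame` to a data source in the specified format.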

DataFrame

Attributes and methods of DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName('pyspark')\
    .enableHiveSupport()\
    .getOrCreate()
    
sc = spark.sparkContext
rdd = sc.parallelize([[1,'alice'],[2,'mary']])   # build an RDD
spark_df = rdd.toDF(schema=['id','name'])    # convert it to a DataFrame

# View the help for agg (the ? suffix works in IPython/Jupyter only)
spark_df.agg?    
# Count the id values, then rename the resulting column
spark_df.agg({'id':'count'}).withColumnRenamed('count(id)','cnt').show()

sql = "select * from tablename1 where field_name = 'xxx'"
spark_df1 = spark.sql(sql)
spark_df1.printSchema()

`DataFrame.__getattr__`(name)	Returns the `Column` denoted by `name`.
`DataFrame.__getitem__`(item)	Returns the column denoted by `item`.
`DataFrame.agg`(*exprs)	Aggregates over the entire `DataFrame` without groups, that is, with no groupBy step.
`DataFrame.alias`(alias)	Returns a new `DataFrame` with an alias set.
`DataFrame.approxQuantile`(col, probabilities, …)	Calculates the approximate quantiles of numerical columns of a `DataFrame`.
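`DataFrame.cache`()	Persists the `DataFrame` with the default storage level (MEMORY_AND_DISK).
`DataFrame.checkpoint`([eager])	Returns a checkpointed version of this `DataFrame`.
`DataFrame.coalesce`(numPartitions)	Returns a new `DataFrame` that has exactly numPartitions partitions. Useful for reducing the partition count after a filter has shrunk the data.
`DataFrame.colRegex`(colName)	Selects column based on the column name specified as a regex and returns it as `Column`.
`DataFrame.collect`()	Returns all the records to the driver as a list of `Row`.
`DataFrame.corr`(col1, col2[, method])	Calculates the correlation of two columns of a `DataFrame` as a double value.
`DataFrame.count`()	Returns the number of rows in this `DataFrame`.
`DataFrame.createGlobalTempView`(name)	Creates a global temporary view with this `DataFrame`.
`DataFrame.createOrReplaceGlobalTempView`(name)	Creates or replaces a global temporary view using the given name.
`DataFrame.createOrReplaceTempView`(name)	Creates or replaces a local temporary view with this `DataFrame`.
`DataFrame.createTempView`(name)	Creates a local temporary view with this `DataFrame`.
`DataFrame.crossJoin`(other)	Returns the Cartesian product with another `DataFrame`.
`DataFrame.crosstab`(col1, col2)	Computes a pair-wise frequency table of the given columns.
`DataFrame.cube`(*cols)	Creates a multi-dimensional cube for the current `DataFrame` using the specified columns, so that aggregations can be run on them.
`DataFrame.describe`(*cols)	Computes basic statistics for numeric and string columns.
`DataFrame.distinct`()	Returns a new `DataFrame` containing the distinct rows in this `DataFrame`.
`DataFrame.drop`(*cols)	Returns a new `DataFrame` without the specified columns.
`DataFrame.dropDuplicates`([subset])	Returns a new `DataFrame` with duplicate rows removed, optionally only considering certain columns.
`DataFrame.drop_duplicates`([subset])	Alias for `dropDuplicates()`.
`DataFrame.dropna`([how, thresh, subset])	Returns a new `DataFrame` omitting rows with null values.
`DataFrame.dtypes`	Returns all column names and their data types as a list.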
`DataFrame.exceptAll`(other)	Return a new `DataFrame` containing rows in this `DataFrame` but not in another `DataFrame` while preserving duplicates.
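`DataFrame.explain`([extended, mode])	Prints the (logical and physical) plans to the console for debugging.
`DataFrame.fillna`(value[, subset])	Replaces null values; alias for `na.fill()`.
`DataFrame.filter`(condition)	Filters rows using the given condition.
`DataFrame.first`()	Returns the first row as a `Row`.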
`DataFrame.foreachPartition`(f)	Applies the `f` function to each partition of this `DataFrame`.
`DataFrame.freqItems`(cols[, support])	Finding frequent items for columns, possibly with false positives.
`DataFrame.groupBy`(*cols)	Groups the `DataFrame` using the specified columns, so that aggregations can be run on them.
`DataFrame.head`([n])	Returns the first n rows.
`DataFrame.hint`(name, *parameters)	Specifies some hint on the current `DataFrame`.
`DataFrame.inputFiles`()	Returns a best-effort snapshot of the files that compose this `DataFrame`.
`DataFrame.intersect`(other)	Returns a new `DataFrame` containing only the rows found in both this `DataFrame` and another `DataFrame` (the intersection).
`DataFrame.intersectAll`(other)	Return a new `DataFrame` containing rows in both this `DataFrame` and another `DataFrame` while preserving duplicates.
`DataFrame.isEmpty`()	Returns `True` if this `DataFrame` is empty.
`DataFrame.isLocal`()	Returns `True` if the `collect()` and `take()` methods can be run locally (without any Spark executors).
`DataFrame.isStreaming`	Returns `True` if this `DataFrame` contains one or more sources that continuously return data as it arrives.
`DataFrame.join`(other[, on, how])	Joins with another `DataFrame`, using the given join expression.
`DataFrame.limit`(num)	Limits the result count to the number specified.
`DataFrame.localCheckpoint`([eager])	Returns a locally checkpointed version of this `DataFrame`.
`DataFrame.mapInPandas`(func, schema)	Maps an iterator of batches in the current `DataFrame` using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a `DataFrame`.
`DataFrame.mapInArrow`(func, schema)	Maps an iterator of batches in the current `DataFrame` using a Python native function that takes and outputs a PyArrow’s RecordBatch, and returns the result as a `DataFrame`.
`DataFrame.melt`(ids, values, …)	Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
`DataFrame.na`	Returns a `DataFrameNaFunctions` for handling missing values.
`DataFrame.observe`(observation, *exprs)	Define (named) metrics to observe on the DataFrame.
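`DataFrame.orderBy`(*cols, **kwargs)	Returns a new `DataFrame` sorted by the specified column(s).
`DataFrame.persist`([storageLevel])	Sets the storage level to persist the contents of the `DataFrame` across operations after the first time it is computed.
`DataFrame.printSchema`()	Prints out the schema in tree format.
`DataFrame.randomSplit`(weights[, seed])	Randomly splits this `DataFrame` with the provided weights.
`DataFrame.rdd`	Returns the content as an RDD of `pyspark.sql.Row`.
`DataFrame.registerTempTable`(name)	Registers this `DataFrame` as a temporary table using the given name.
`DataFrame.repartition`(numPartitions, *cols)	Returns a new `DataFrame` partitioned by the given partitioning expressions.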
`DataFrame.repartitionByRange`(numPartitions, …)	Returns a new `DataFrame` partitioned by the given partitioning expressions.
`DataFrame.replace`(to_replace[, value, subset])	Returns a new `DataFrame` replacing a value with another value.
`DataFrame.rollup`(*cols)	Creates a multi-dimensional rollup for the current `DataFrame` using the specified columns, so that aggregations can be run on them.
`DataFrame.sameSemantics`(other)	Returns True when the logical query plans inside both `DataFrame`s are equal and therefore return the same results.
`DataFrame.sample`([withReplacement, …])	Returns a sampled subset of this `DataFrame`.
`DataFrame.sampleBy`(col, fractions[, seed])	Returns a stratified sample without replacement based on the fraction given on each stratum.
`DataFrame.schema`	Returns the schema of this `DataFrame` as a `pyspark.sql.types.StructType`.
`DataFrame.select`(*cols)	Projects a set of expressions and returns a new `DataFrame`; similar to SELECT in SQL.
`DataFrame.selectExpr`(*expr)	Projects a set of SQL expressions and returns a new `DataFrame`.
`DataFrame.semanticHash`()	Returns a hash code of the logical query plan against this `DataFrame`.
`DataFrame.show`([n, truncate, vertical])	Prints the first n rows to the console.
`DataFrame.sort`(*cols, **kwargs)	Returns a new `DataFrame` sorted by the specified column(s).
`DataFrame.sortWithinPartitions`(*cols, **kwargs)	Returns a new `DataFrame` with each partition sorted by the specified column(s).
`DataFrame.sparkSession`	Returns the Spark session that created this `DataFrame`.
`DataFrame.stat`	Returns a `DataFrameStatFunctions` for statistic functions.
`DataFrame.storageLevel`	Gets the `DataFrame`'s current storage level.
`DataFrame.subtract`(other)	Return a new `DataFrame` containing rows in this `DataFrame` but not in another `DataFrame`.
`DataFrame.summary`(*statistics)	Computes specified statistics for numeric and string columns.
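`DataFrame.tail`(num)	Returns the last num rows as a list of `Row`.
`DataFrame.take`(num)	Returns the first num rows as a list of `Row`.
`DataFrame.to`(schema)	Returns a new `DataFrame` in which each row is reconciled to match the specified schema.
`DataFrame.toDF`(*cols)	Returns a new `DataFrame` with the new specified column names.
`DataFrame.toJSON`([use_unicode])	Converts a `DataFrame` into an RDD of strings.
`DataFrame.toLocalIterator`([prefetchPartitions])	Returns an iterator that contains all of the rows in this `DataFrame`.
`DataFrame.toPandas`()	Returns the contents of this `DataFrame` as a pandas `DataFrame`.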
`DataFrame.to_pandas_on_spark`([index_col])
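`DataFrame.transform`(func, *args, **kwargs)	Returns a new `DataFrame`.
`DataFrame.union`(other)	Returns a new `DataFrame` containing the union of rows in this and another `DataFrame`. Duplicates are kept (as in SQL UNION ALL); follow with `distinct()` to remove them.
`DataFrame.unionAll`(other)	Returns a new `DataFrame` containing the union of rows in this and another `DataFrame`. Duplicates are kept.
`DataFrame.unionByName`(other[, …])	Returns a new `DataFrame` containing union of rows in this and another `DataFrame`.
`DataFrame.unpersist`([blocking])	Marks the `DataFrame` as non-persistent and removes all of its blocks from memory and disk.
`DataFrame.unpivot`(ids, values, …)	Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
`DataFrame.where`(condition)	Alias for `filter()`.
`DataFrame.withColumn`(colName, col)	Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name.
`DataFrame.withColumns`(*colsMap)	Returns a new `DataFrame` by adding multiple columns or replacing the existing columns that have the same names.
`DataFrame.withColumnRenamed`(existing, new)	Returns a new `DataFrame` by renaming an existing column.
`DataFrame.withColumnsRenamed`(colsMap)	Returns a new `DataFrame` by renaming multiple columns.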
`DataFrame.withMetadata`(columnName, metadata)	Returns a new `DataFrame` by updating an existing column with metadata.
`DataFrame.withWatermark`(eventTime, …)	Defines an event time watermark for this `DataFrame`.
`DataFrame.write`	Interface for saving the content of the non-streaming `DataFrame` out into external storage.
`DataFrame.writeStream`	Interface for saving the content of the streaming `DataFrame` out into external storage.
`DataFrame.writeTo`(table)	Create a write configuration builder for v2 sources.
`DataFrame.pandas_api`([index_col])	Converts the existing `DataFrame` into a pandas-on-Spark DataFrame.
`DataFrameNaFunctions.drop`([how, thresh, subset])	Returns a new `DataFrame` omitting rows with null values.
`DataFrameNaFunctions.fill`(value[, subset])	Replace null values, alias for `na.fill()`.
`DataFrameNaFunctions.replace`(to_replace[, …])	Returns a new `DataFrame` replacing a value with another value.
`DataFrameStatFunctions.approxQuantile`(col, …)	Calculates the approximate quantiles of numerical columns of a `DataFrame`.
`DataFrameStatFunctions.corr`(col1, col2[, method])	Calculates the correlation of two columns of a `DataFrame` as a double value.
`DataFrameStatFunctions.cov`(col1, col2)	Calculate the sample covariance for the given columns, specified by their names, as a double value.
`DataFrameStatFunctions.crosstab`(col1, col2)	Computes a pair-wise frequency table of the given columns.
`DataFrameStatFunctions.freqItems`(cols[, support])	Finding frequent items for columns, possibly with false positives.
`DataFrameStatFunctions.sampleBy`(col, fractions)	Returns a stratified sample without replacement based on the fraction given on each stratum.

Column

A column of a DataFrame.
spark_df.filter(spark_df.id.between(2,3)).show()   # keep the rows whose id is between 2 and 3 (inclusive)

`Column.__getattr__`(item)	An expression that gets an item at position `ordinal` out of a list, or gets an item by key out of a dict.
`Column.__getitem__`(k)	An expression that gets an item at position `ordinal` out of a list, or gets an item by key out of a dict.
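`Column.alias`(*alias, **kwargs)	Returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
`Column.asc`()	Returns a sort expression based on the ascending order of the column.
`Column.asc_nulls_first`()	Returns a sort expression based on the ascending order of the column, with null values returned before non-null values.
`Column.asc_nulls_last`()	Returns a sort expression based on the ascending order of the column, with null values appearing after non-null values.
`Column.astype`(dataType)	`astype()` is an alias for `cast()`; converts the data type.
`Column.between`(lowerBound, upperBound)	True if the current column is between the lower bound and upper bound, inclusive.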
`Column.bitwiseAND`(other)	Compute bitwise AND of this expression with another expression.
`Column.bitwiseOR`(other)	Compute bitwise OR of this expression with another expression.
`Column.bitwiseXOR`(other)	Compute bitwise XOR of this expression with another expression.
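`Column.cast`(dataType)	Casts the column into type dataType.
`Column.contains`(other)	Whether the column contains the given substring, similar to str.contains on a pandas column.
`Column.desc`()	Returns a sort expression based on the descending order of the column.
`Column.desc_nulls_first`()	Returns a sort expression based on the descending order of the column, with null values appearing before non-null values.
`Column.desc_nulls_last`()	Returns a sort expression based on the descending order of the column, with null values appearing after non-null values.
`Column.dropFields`(*fieldNames)	An expression that drops fields in a `StructType` by name.
`Column.endswith`(other)	Whether the string ends with other.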
`Column.eqNullSafe`(other)	Equality test that is safe for null values.
`Column.getField`(name)	An expression that gets a field by name in a `StructType`.
`Column.getItem`(key)	An expression that gets an item at position `ordinal` out of a list, or gets an item by key out of a dict.
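`Column.ilike`(other)	SQL ILIKE expression (case-insensitive LIKE).
`Column.isNotNull`()	True if the current expression is NOT null.
`Column.isNull`()	True if the current expression is null.
`Column.isin`(*cols)	A boolean expression that evaluates to true if the value of this expression is contained in the evaluated values of the arguments.
`Column.like`(other)	SQL LIKE expression.
`Column.name`(*alias, **kwargs)	`name()` is an alias for `alias()`.
`Column.otherwise`(value)	Evaluates a list of conditions and returns one of multiple possible result expressions.
`Column.over`(window)	Defines a windowing column.
`Column.rlike`(other)	SQL RLIKE expression (LIKE with a regular expression).
`Column.startswith`(other)	Whether the string starts with other.
`Column.substr`(startPos, length)	Returns a `Column` that is a substring of the column.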
`Column.when`(condition, value)	Evaluates a list of conditions and returns one of multiple possible result expressions.
`Column.withField`(fieldName, col)	An expression that adds/replaces a field in `StructType` by name.

Data Types

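The data types live in pyspark.sql.types (from pyspark.sql.types import *); you may need them when specifying the column types of a schema while building a DataFrame.

Spark SQL data types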

`ArrayType`(elementType[, containsNull])	Array data type.
`BinaryType`	Binary (byte array) data type.
`BooleanType`	Boolean data type.
`ByteType`	Byte data type, i.e. a signed integer in a single byte.
`DataType`	Base class for data types.
`DateType`	Date (datetime.date) data type.
`DecimalType`([precision, scale])	Decimal (decimal.Decimal) data type.
`DoubleType`	Double data type, representing double precision floats.
`FloatType`	Float data type, representing single precision floats.
`IntegerType`	Int data type, i.e. a signed 32-bit integer.
`LongType`	Long data type, i.e. a signed 64-bit integer.
`MapType`(keyType, valueType[, valueContainsNull])	Map data type.
`NullType`	Null type.
`ShortType`	Short data type, i.e. a signed 16-bit integer.
`StringType`	String data type.
`CharType`(length)	Char data type
`VarcharType`(length)	Varchar data type
`StructField`(name, dataType[, nullable, metadata])	A field in `StructType`.
`StructType`([fields])	Struct type, consisting of a list of `StructField`.
`TimestampType`	Timestamp (datetime.datetime) data type.
`TimestampNTZType`	Timestamp (datetime.datetime) data type without timezone information.
`DayTimeIntervalType`([startField, endField])	DayTimeIntervalType (datetime.timedelta).

Row

Each row of a DataFrame is a Row object.

Row.asDict([recursive]) returns the row as a dict.

Functions

The Spark SQL function library.

from pyspark.sql.functions import *

spark_df.select(upper(spark_df.name)).show()    # convert the column to upper case with the upper function

String Functions

Functions for working with strings.

`ascii`(col)	Computes the numeric value of the first character of the string column.
`base64`(col)	Computes the BASE64 encoding of a binary column and returns it as a string column.
`bit_length`(col)	Calculates the bit length for the specified string column.
`concat_ws`(sep, *cols)	Concatenates multiple input string columns together into a single string column, using the given separator.
`decode`(col, charset)	Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
`encode`(col, charset)	Computes the first argument into a binary from a string using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
`format_number`(col, d)	Formats the number X to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string.
`format_string`(format, *cols)	Formats the arguments in printf-style and returns the result as a string column.
`initcap`(col)	Translate the first letter of each word to upper case in the sentence.
`instr`(str, substr)	Locate the position of the first occurrence of substr column in the given string.
`length`(col)	Computes the character length of string data or number of bytes of binary data.
`lower`(col)	Converts a string expression to lower case.
`levenshtein`(left, right)	Computes the Levenshtein distance of the two given strings.
`locate`(substr, str[, pos])	Locate the position of the first occurrence of substr in a string column, after position pos.
`lpad`(col, len, pad)	Left-pad the string column to width len with pad.
`ltrim`(col)	Trim the spaces from left end for the specified string value.
`octet_length`(col)	Calculates the byte length for the specified string column.
`regexp_extract`(str, pattern, idx)	Extract a specific group matched by a Java regex, from the specified string column.
`regexp_replace`(string, pattern, replacement)	Replace all substrings of the specified string value that match regexp with replacement.
`unbase64`(col)	Decodes a BASE64 encoded string column and returns it as a binary column.
`rpad`(col, len, pad)	Right-pad the string column to width len with pad.
`repeat`(col, n)	Repeats a string column n times, and returns it as a new string column.
`rtrim`(col)	Trim the spaces from right end for the specified string value.
`soundex`(col)	Returns the SoundEx encoding for a string
`split`(str, pattern[, limit])	Splits str around matches of the given pattern.
`substring`(str, pos, len)	Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
`substring_index`(str, delim, count)	Returns the substring from string str before count occurrences of the delimiter delim.
`overlay`(src, replace, pos[, len])	Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes.
`sentences`(string[, language, country])	Splits a string into arrays of sentences, where each sentence is an array of words.
`translate`(srcCol, matching, replace)	A function translate any character in the srcCol by a character in matching.
`trim`(col)	Trim the spaces from both ends for the specified string column.
`upper`(col)	Converts a string expression to upper case.

Sort Functions

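Functions that build sort expressions.

`asc`(col)	Returns a sort expression based on the ascending order of the given column name.
`asc_nulls_first`(col)	Returns a sort expression based on the ascending order of the given column name, and null values return before non-null values.
`asc_nulls_last`(col)	Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
`desc`(col)	Returns a sort expression based on the descending order of the given column name.
`desc_nulls_first`(col)	Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.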
`desc_nulls_last`(col)	Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.

Window Functions

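Functions that are evaluated over a window.

`cume_dist`()	Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.
`dense_rank`()	Window function: returns the rank of rows within a window partition, without any gaps, e.g. 1, 2, 2, 3.
`lag`(col[, offset, default])	Window function: returns the value that is offset rows before the current row, and default if there are fewer than offset rows before the current row.
`lead`(col[, offset, default])	Window function: returns the value that is offset rows after the current row, and default if there are fewer than offset rows after the current row.
`nth_value`(col, offset[, ignoreNulls])	Window function: returns the value that is the offsetth row of the window frame (counting from 1), and null if the size of window frame is less than offset rows.
`ntile`(n)	Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition.
`percent_rank`()	Window function: returns the relative rank (i.e. percentile) of rows within a window partition.
`rank`()	Window function: returns the rank of rows within a window partition, leaving gaps after ties, e.g. 1, 2, 2, 4.
`row_number`()	Window function: returns a sequential number starting at 1 within a window partition; rows with equal values still get different numbers.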

Aggregate Functions

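`approx_count_distinct`(col[, rsd])	Returns a new column for the approximate distinct count of column col.
`avg`(col)	Returns the average of the values in a group.
`collect_list`(col)	Returns a list of objects with duplicates.
`collect_set`(col)	Aggregate function: returns a set of objects with duplicate elements eliminated. Because the computation is distributed, the order of elements may differ between runs.
`corr`(col1, col2)	Returns a new column for the Pearson correlation coefficient of col1 and col2.
`count`(col)	Returns the number of items in a group.
`count_distinct`(col, *cols)	Returns the number of distinct values of col or cols.
`covar_samp`(col1, col2)	Returns a new column for the sample covariance of col1 and col2.
`first`(col[, ignorenulls])	Returns the first value in a group.
`grouping`(col)	Indicates whether a specified column in a GROUP BY list is aggregated or not; returns 1 for aggregated or 0 for not aggregated in the result set. Same as the Hive function of that name.
`grouping_id`(*cols)	Aggregate function: returns the level of grouping, equals to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + … + grouping(cn).
`kurtosis`(col)	Returns the kurtosis of the values in a group.
`last`(col[, ignorenulls])	Returns the last value in a group.
`max`(col)	Returns the maximum value of the expression in a group.
`max_by`(col, ord)	Returns the value associated with the maximum value of ord.
`mean`(col)	Returns the mean of the values in a group.
`median`(col)	Returns the median of the values in a group.
`min`(col)	Returns the minimum value of the expression in a group.
`min_by`(col, ord)	Returns the value associated with the minimum value of ord.
`mode`(col)	Returns the most frequent value in a group (the mode).
`percentile_approx`(col, percentage[, accuracy])	Returns the approximate percentile of the numeric column col, which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.
`product`(col)	Returns the product of the values in a group.
`skewness`(col)	Returns the skewness of the values in a group.
`stddev`(col)	Alias for stddev_samp (standard deviation).
`stddev_pop`(col)	Returns the population standard deviation of the expression in a group.
`stddev_samp`(col)	Returns the unbiased sample standard deviation of the expression in a group (divides by n-1).
`sum`(col)	Returns the sum of all values in the expression.
`sum_distinct`(col)	Aggregate function: returns the sum of distinct values in the expression.
`sumDistinct`(col)	Returns the sum of distinct values in the expression.
`var_pop`(col)	Returns the population variance of the values in a group.
`var_samp`(col)	Returns the unbiased sample variance of the values in a group.
`variance`(col)	Alias for var_samp.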

Datetime Functions

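Date and time functions.

`add_months`(start, months)	Returns the date that is months months after start.
`current_date`()	Returns the current date at the start of query evaluation as a DateType column.
`current_timestamp`()	Returns the current timestamp at the start of query evaluation as a TimestampType column.
`date_add`(start, days)	Returns the date that is days days after start.
`date_format`(date, format)	Converts a date/timestamp/string to a string in the format specified by the date format given by the second argument.
`date_sub`(start, days)	Returns the date that is days days before start.
`date_trunc`(format, timestamp)	Returns timestamp truncated to the unit specified by the format.
`datediff`(end, start)	Returns the number of days from start to end.
`dayofmonth`(col)	Extracts the day of the month of a given date/timestamp as integer.
`dayofweek`(col)	Extracts the day of the week of a given date/timestamp as integer.
`dayofyear`(col)	Extracts the day of the year of a given date/timestamp as integer.
`second`(col)	Extracts the seconds of a given date as integer.
`weekofyear`(col)	Extracts the week number of the year of a given date as integer.
`year`(col)	Extracts the year.
`quarter`(col)	Extracts the quarter of the year of a given date/timestamp as integer.
`month`(col)	Extracts the month.
`last_day`(date)	Returns the last day of the month which the given date belongs to, similar to the EOMONTH function in Excel.
`localtimestamp`()	Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column.
`minute`(col)	Extracts the minutes.
`months_between`(date1, date2[, roundOff])	Returns the number of months between dates date1 and date2.
`next_day`(date, dayOfWeek)	Returns the first date which is later than the value of the date column and falls on the day of the week given by the second argument.
`hour`(col)	Extracts the hours.
`make_date`(year, month, day)	Builds a date from year, month and day columns.
`from_unixtime`(timestamp[, format])	Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the given format.
`unix_timestamp`([timestamp, format])	Converts a time string with the given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to a Unix timestamp (in seconds), using the default timezone and the default locale; returns null if it fails.
`to_timestamp`(col[, format])	Converts a column into pyspark.sql.types.TimestampType using the optionally specified format.
`to_date`(col[, format])	Converts a column into pyspark.sql.types.DateType using the optionally specified format.
`trunc`(date, format)	Returns date truncated to the unit specified by the format.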
`from_utc_timestamp`(timestamp, tz)	This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
`to_utc_timestamp`(timestamp, tz)	This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
`window`(timeColumn, windowDuration[, …])	Bucketizes rows into one or more time windows given a timestamp specifying column.
`session_window`(timeColumn, gapDuration)	Generates session window given a timestamp specifying column.
`timestamp_seconds`(col)	Converts the number of seconds from the Unix epoch (1970-01-01T00:00:00Z) to a timestamp.
`window_time`(windowColumn)	Computes the event time from a window column.

Collection Functions

Functions for arrays, maps and other collections.

`array`(*cols)	Creates a new array column.
`array_contains`(col, value)	Collection function: returns null if the array is null, true if the array contains the given value, and false otherwise.
`arrays_overlap`(a1, a2)	Collection function: returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
`array_join`(col, delimiter[, null_replacement])	Concatenates the elements of column using the delimiter.
`create_map`(*cols)	Creates a new map column.
`slice`(x, start, length)	Collection function: returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.
`concat`(*cols)	Concatenates multiple input columns together into a single column.
`array_position`(col, value)	Collection function: Locates the position of the first occurrence of the given value in the given array.
`element_at`(col, extraction)	Collection function: Returns element of array at given index in extraction if col is array.
`array_append`(col, value)	Collection function: returns an array of the elements in col with value appended at the end of the array.
`array_sort`(col[, comparator])	Collection function: sorts the input array in ascending order.
`array_insert`(arr, pos, value)	Collection function: adds an item into a given array at a specified array index.
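`array_remove`(col, element)	Collection function: removes all elements equal to element from the given array.
`array_distinct`(col)	Collection function: removes duplicate values from the array.
`array_intersect`(col1, col2)	Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates.
`array_union`(col1, col2)	Collection function: returns an array of the elements in the union of col1 and col2, without duplicates.
`array_except`(col1, col2)	Collection function: returns an array of the elements in col1 but not in col2, without duplicates.
`array_compact`(col)	Collection function: removes null values from the array.
`transform`(col, f)	Returns an array of elements after applying a transformation to each element in the input array.
`exists`(col, f)	Returns whether a predicate holds for one or more elements in the array.
`forall`(col, f)	Returns whether a predicate holds for every element in the array.
`filter`(col, f)	Returns an array of elements for which a predicate holds in a given array.
`aggregate`(col, initialValue, merge[, finish])	Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.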
`zip_with`(left, right, f)	Merge two given arrays, element-wise, into a single array using a function.
`transform_keys`(col, f)	Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs.
`transform_values`(col, f)	Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs.
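`map_filter`(col, f)	Returns a map whose key-value pairs satisfy a predicate.
`map_from_arrays`(col1, col2)	Creates a new map from two arrays.
`map_zip_with`(col1, col2, f)	Merges two given maps, key-wise, into a single map using a function.
`explode`(col)	Returns a new row for each element in the given array or map.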
`explode_outer`(col)	Returns a new row for each element in the given array or map.
`posexplode`(col)	Returns a new row for each element with position in the given array or map.
`posexplode_outer`(col)	Returns a new row for each element with position in the given array or map.
`inline`(col)	Explodes an array of structs into a table.
`inline_outer`(col)	Explodes an array of structs into a table.
`get`(col, index)	Collection function: Returns element of array at given (0-based) index.
`get_json_object`(col, path)	Extracts json object from a json string based on json path specified, and returns json string of the extracted json object.
`json_tuple`(col, *fields)	Creates a new row for a json column according to the given field names.
`from_json`(col, schema[, options])	Parses a column containing a JSON string into a `MapType` with `StringType` as keys type, `StructType` or `ArrayType` with the specified schema.
`schema_of_json`(json[, options])	Parses a JSON string and infers its schema in DDL format.
`to_json`(col[, options])	Converts a column containing a `StructType`, `ArrayType` or a `MapType` into a JSON string.
`size`(col)	Collection function: returns the length of the array or map stored in the column.
`struct`(*cols)	Creates a new struct column.
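`sort_array`(col[, asc])	Collection function: sorts the input array in ascending or descending order according to the natural ordering of the array elements.
`array_max`(col)	Collection function: returns the maximum value of the array.
`array_min`(col)	Collection function: returns the minimum value of the array.
`shuffle`(col)	Collection function: generates a random permutation of the given array.
`reverse`(col)	Collection function: returns a reversed string or an array with reverse order of elements.
`flatten`(col)	Collection function: creates a single array from an array of arrays.
`sequence`(start, stop[, step])	Generates a sequence of integers from start to stop, incrementing by step.
`array_repeat`(col, count)	Collection function: creates an array containing a column repeated count times.
`map_contains_key`(col, value)	Returns true if the map contains the key.
`map_keys`(col)	Returns an unordered array containing the keys of the map.
`map_values`(col)	Returns an unordered array containing the values of the map.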
`map_entries`(col)	Collection function: Returns an unordered array of all entries in the given map.
`map_from_entries`(col)	Collection function: Converts an array of entries (key value struct types) to a map of values.
`arrays_zip`(*cols)	Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
`map_concat`(*cols)	Returns the union of all the given maps.
`from_csv`(col, schema[, options])	Parses a column containing a CSV string to a row with the specified schema.
`schema_of_csv`(csv[, options])	Parses a CSV string and infers its schema in DDL format.
`to_csv`(col[, options])	Converts a column containing a `StructType` into a CSV string.

Math Functions

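`sqrt`(col)	Computes the square root of the specified float value.
`abs`(col)	Computes the absolute value.
`acos`(col)	Computes inverse cosine of the input column.
`acosh`(col)	Computes inverse hyperbolic cosine of the input column.
`asin`(col)	Computes inverse sine of the input column.
`asinh`(col)	Computes inverse hyperbolic sine of the input column.
`atan`(col)	Computes inverse tangent of the input column.
`atanh`(col)	Computes inverse hyperbolic tangent of the input column.
`atan2`(col1, col2)	Computes the two-argument inverse tangent, where col1 is the y-coordinate and col2 is the x-coordinate. New in version 1.4.0.
`bin`(col)	Returns the string representation of the binary value of the given column.
`cbrt`(col)	Computes the cube-root of the given value.
`ceil`(col)	Computes the ceiling of the given value.
`conv`(col, fromBase, toBase)	Converts a number in a string column from one base to another.
`cos`(col)	Computes cosine of the input column.
`cosh`(col)	Computes hyperbolic cosine of the input column.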
`cot`(col)	Computes cotangent of the input column.
`csc`(col)	Computes cosecant of the input column.
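`exp`(col)	Computes the exponential of the given value.
`expm1`(col)	Computes the exponential of the given value minus one.
`factorial`(col)	Computes the factorial of the given value.
`floor`(col)	Computes the floor of the given value.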
`hex`(col)	Computes hex value of the given column, which could be `pyspark.sql.types.StringType`, `pyspark.sql.types.BinaryType`, `pyspark.sql.types.IntegerType` or `pyspark.sql.types.LongType`.
`unhex`(col)	Inverse of hex.
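`hypot`(col1, col2)	Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
`log`(arg1[, arg2])	Returns the first argument-based logarithm of the second argument.
`log10`(col)	Computes the logarithm of the given value in base 10.
`log1p`(col)	Computes the natural logarithm of the given value plus one.
`log2`(col)	Returns the base-2 logarithm of the argument.
`pmod`(dividend, divisor)	Returns the positive value of dividend mod divisor.
`pow`(col1, col2)	Returns the value of the first argument raised to the power of the second argument.
`rint`(col)	Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
`round`(col[, scale])	Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.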
`bround`(col[, scale])	Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
`sec`(col)	Computes secant of the input column.
`shiftleft`(col, numBits)	Shift the given value numBits left.
`shiftright`(col, numBits)	(Signed) shift the given value numBits right.
`shiftrightunsigned`(col, numBits)	Unsigned shift the given value numBits right.
`signum`(col)	Computes the signum of the given value.
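`sin`(col)	Computes sine of the input column.
`sinh`(col)	Computes hyperbolic sine of the input column.
`tan`(col)	Computes tangent of the input column.
`tanh`(col)	Computes hyperbolic tangent of the input column.
`toDegrees`(col)	Deprecated alias; use `degrees` instead. New in version 1.4.0.
`degrees`(col)	Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
`toRadians`(col)	Deprecated alias; use `radians` instead. New in version 1.4.0.
`radians`(col)	Converts an angle measured in degrees to an approximately equivalent angle measured in radians.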

Normal Functions

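General-purpose functions.

`col`(col)	Returns a `Column` based on the given column name.
`column`(col)	Returns a `Column` based on the given column name.
`lit`(col)	Creates a `Column` of literal value.
`broadcast`(df)	Marks a DataFrame as small enough for use in broadcast joins.
`coalesce`(*cols)	Returns the first column that is not null.
`input_file_name`()	Creates a string column for the file name of the current Spark task.
`isnan`(col)	An expression that returns true if the column is NaN.
`isnull`(col)	An expression that returns true if the column is null.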
`monotonically_increasing_id`()	A column that generates monotonically increasing 64-bit integers.
`nanvl`(col1, col2)	Returns col1 if it is not NaN, or col2 if col1 is NaN.
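`rand`([seed])	Generates a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
`randn`([seed])	Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
`spark_partition_id`()	A column for partition ID.
`when`(condition, value)	Evaluates a list of conditions and returns one of multiple possible result expressions.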
`bitwise_not`(col)	Computes bitwise not.
`bitwiseNOT`(col)	Computes bitwise not.
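`expr`(str)	Parses the expression string into the column that it represents.
`greatest`(*cols)	Returns the greatest value of the list of column names, skipping null values.
`least`(*cols)	Returns the least value of the list of column names, skipping null values.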

Grouping

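The DataFrame.groupBy method returns a GroupedData object, which offers the following methods.

`GroupedData.agg`(*exprs)	Computes aggregates and returns the result as a `DataFrame`.
`GroupedData.apply`(udf)	An alias of pyspark.sql.GroupedData.applyInPandas(); however, it takes a pyspark.sql.functions.pandas_udf(), whereas pyspark.sql.GroupedData.applyInPandas() takes a Python native function.
`GroupedData.applyInPandas`(func, schema)	Maps each group of the current `DataFrame` using a pandas UDF and returns the result as a `DataFrame`.
`GroupedData.applyInPandasWithState`(func, …)	Applies the given function to each group of data, while maintaining a user-defined per-group state.
`GroupedData.avg`(*cols)	Computes average values for each numeric column for each group.
`GroupedData.cogroup`(other)	Cogroups this group with another group so that cogrouped operations can be run.
`GroupedData.count`()	Counts the number of records for each group.
`GroupedData.max`(*cols)	Computes the max value for each numeric column for each group.
`GroupedData.mean`(*cols)	Computes average values for each numeric column for each group.
`GroupedData.min`(*cols)	Computes the min value for each numeric column for each group.
`GroupedData.pivot`(pivot_col[, values])	Pivots a column of the current `DataFrame` and performs the specified aggregation.
`GroupedData.sum`(*cols)	Computes the sum for each numeric column for each group.
`PandasCogroupedOps.applyInPandas`(func, schema)	Applies a function to each cogroup using pandas and returns the result as a `DataFrame`.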

UDF

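User-defined functions.

strlen = spark.udf.register("len", lambda x: len(x))   # register a user-defined function
spark.sql("SELECT len('id')").collect()    # call it from SQL through its registered name, len

# Call it on a DataFrame through the returned Python object, strlen
spark.sql("SELECT '1' AS text union all select '31' as text").select(strlen("text")).show()

`UDFRegistration.register`(name, f[, returnType])	Registers a Python function (including lambda functions) or a user-defined function as a SQL function.
`UDFRegistration.registerJavaFunction`(name, …)	Registers a Java user-defined function as a SQL function.
`UDFRegistration.registerJavaUDAF`(name, …)	Registers a Java user-defined aggregate function as a SQL function.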
