DataFrame.__getitem__ (item) |
Returns the column specified by item as a Column. |
DataFrame.agg (*exprs) |
Aggregates over the entire DataFrame without groups (shorthand for df.groupBy().agg()). |
DataFrame.alias (alias) |
Returns a new DataFrame with an alias set. |
DataFrame.approxQuantile (col, probabilities, …) |
Calculates the approximate quantiles of numerical columns of a DataFrame. |
DataFrame.cache () |
Persists the DataFrame with the default storage level (MEMORY_AND_DISK). |
DataFrame.checkpoint ([eager]) |
Returns a checkpointed version of this DataFrame. |
DataFrame.coalesce (numPartitions) |
Returns a new DataFrame that has exactly numPartitions partitions. Useful for reducing the partition count after a filter has shrunk the data. |
DataFrame.colRegex (colName) |
Selects a column based on the column name specified as a regex and returns it as a Column. |
DataFrame.collect () |
Returns all records to the driver as a list of Row. |
DataFrame.corr (col1, col2[, method]) |
Calculates the correlation of two columns of a DataFrame as a double value. |
DataFrame.count () |
Returns the number of rows in this DataFrame. |
DataFrame.createGlobalTempView (name) |
Creates a global temporary view with this DataFrame. |
DataFrame.createOrReplaceGlobalTempView (name) |
Creates or replaces a global temporary view using the given name. |
DataFrame.createOrReplaceTempView (name) |
Creates or replaces a local temporary view with this DataFrame. |
DataFrame.createTempView (name) |
Creates a local temporary view with this DataFrame. |
DataFrame.crossJoin (other) |
Returns the Cartesian product with another DataFrame. |
DataFrame.crosstab (col1, col2) |
Computes a pair-wise frequency table of the given columns. |
DataFrame.cube (*cols) |
Creates a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them. |
DataFrame.describe (*cols) |
Computes basic statistics for numeric and string columns. |
DataFrame.distinct () |
Returns a new DataFrame containing the distinct rows in this DataFrame. |
DataFrame.drop (*cols) |
Returns a new DataFrame without the specified columns. |
DataFrame.dropDuplicates ([subset]) |
Returns a new DataFrame with duplicate rows removed, optionally considering only certain columns. |
DataFrame.drop_duplicates ([subset]) |
Alias for dropDuplicates(). |
DataFrame.dropna ([how, thresh, subset]) |
Returns a new DataFrame omitting rows with null values. |
DataFrame.dtypes |
Returns all column names and their data types as a list. |
DataFrame.exceptAll (other) |
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
DataFrame.explain ([extended, mode]) |
Prints the (logical and physical) plans to the console for debugging purposes. |
DataFrame.fillna (value[, subset]) |
Replaces null values; alias for na.fill(). |
DataFrame.filter (condition) |
Filters rows using the given condition. |
DataFrame.first () |
Returns the first row as a Row. |
DataFrame.foreachPartition (f) |
Applies the f function to each partition of this DataFrame. |
DataFrame.freqItems (cols[, support]) |
Finding frequent items for columns, possibly with false positives. |
DataFrame.groupBy (*cols) |
Groups the DataFrame using the specified columns, so we can run aggregations on them. |
DataFrame.head ([n]) |
Returns the first n rows. |
DataFrame.hint (name, *parameters) |
Specifies some hint on the current DataFrame. |
DataFrame.inputFiles () |
Returns a best-effort snapshot of the files that compose this DataFrame. |
DataFrame.intersect (other) |
Returns a new DataFrame containing only the rows found in both this DataFrame and another DataFrame (the intersection). |
DataFrame.intersectAll (other) |
Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
DataFrame.isEmpty () |
Returns True if this DataFrame is empty. |
DataFrame.isLocal () |
Returns True if the collect() and take() methods can be run locally (without any Spark executors). |
DataFrame.isStreaming |
Returns True if this DataFrame contains one or more sources that continuously return data as it arrives. |
DataFrame.join (other[, on, how]) |
Joins with another DataFrame, using the given join expression. |
DataFrame.limit (num) |
Limits the result count to the number specified. |
DataFrame.localCheckpoint ([eager]) |
Returns a locally checkpointed version of this DataFrame. |
DataFrame.mapInPandas (func, schema) |
Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame. |
DataFrame.mapInArrow (func, schema) |
Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow RecordBatch, and returns the result as a DataFrame. |
DataFrame.melt (ids, values, …) |
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. |
DataFrame.na |
Returns a DataFrameNaFunctions for handling missing values. |
DataFrame.observe (observation, *exprs) |
Define (named) metrics to observe on the DataFrame. |
DataFrame.orderBy (*cols, **kwargs) |
Returns a new DataFrame sorted by the specified column(s). |
DataFrame.persist ([storageLevel]) |
Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed. |
DataFrame.printSchema () |
Prints out the schema in tree format. |
DataFrame.randomSplit (weights[, seed]) |
Randomly splits this DataFrame with the provided weights. |
DataFrame.rdd |
Returns the content as a pyspark.RDD of Row. |
DataFrame.registerTempTable (name) |
Registers this DataFrame as a temporary table using the given name (deprecated in favor of createOrReplaceTempView). |
DataFrame.repartition (numPartitions, *cols) |
Returns a new DataFrame partitioned by the given partitioning expressions. |
DataFrame.repartitionByRange (numPartitions, …) |
Returns a new DataFrame partitioned by the given partitioning expressions. |
DataFrame.replace (to_replace[, value, subset]) |
Returns a new DataFrame replacing a value with another value. |
DataFrame.rollup (*cols) |
Creates a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregations on them. |
DataFrame.sameSemantics (other) |
Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results. |
DataFrame.sample ([withReplacement, …]) |
Returns a sampled subset of this DataFrame. |
DataFrame.sampleBy (col, fractions[, seed]) |
Returns a stratified sample without replacement based on the fraction given on each stratum. |
DataFrame.schema |
Returns the schema of this DataFrame as a pyspark.sql.types.StructType. |
DataFrame.select (*cols) |
Projects a set of expressions and returns a new DataFrame, similar to SELECT in SQL. |
DataFrame.selectExpr (*expr) |
Projects a set of SQL expressions and returns a new DataFrame. |
DataFrame.semanticHash () |
Returns a hash code of the logical query plan against this DataFrame. |
DataFrame.show ([n, truncate, vertical]) |
Prints the first n rows to the console. |
DataFrame.sort (*cols, **kwargs) |
Returns a new DataFrame sorted by the specified column(s). |
DataFrame.sortWithinPartitions (*cols, **kwargs) |
Returns a new DataFrame with each partition sorted by the specified column(s). |
DataFrame.sparkSession |
Returns the Spark session that created this DataFrame. |
DataFrame.stat |
Returns a DataFrameStatFunctions for statistic functions. |
DataFrame.storageLevel |
Gets the DataFrame's current storage level. |
DataFrame.subtract (other) |
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
DataFrame.summary (*statistics) |
Computes specified statistics for numeric and string columns. |
DataFrame.tail (num) |
Returns the last num rows as a list of Row. |
DataFrame.take (num) |
Returns the first num rows as a list of Row. |
DataFrame.to (schema) |
Returns a new DataFrame where each row is reconciled to match the specified schema. |
DataFrame.toDF (*cols) |
Returns a new DataFrame with the newly specified column names. |
DataFrame.toJSON ([use_unicode]) |
Converts a DataFrame into an RDD of strings. |
DataFrame.toLocalIterator ([prefetchPartitions]) |
Returns an iterator that contains all of the rows in this DataFrame. |
DataFrame.toPandas () |
Returns the contents of this DataFrame as a pandas.DataFrame. |
DataFrame.to_pandas_on_spark ([index_col]) |
|
DataFrame.transform (func, *args, **kwargs) |
Returns a new DataFrame by applying func to this DataFrame; a concise syntax for chaining custom transformations. |
DataFrame.union (other) |
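Returns a new DataFrame containing the union of rows in this and another DataFrame. Duplicates are kept (equivalent to UNION ALL in SQL); follow with distinct() to remove them. |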
DataFrame.unionAll (other) |
Returns a new DataFrame containing the union of rows in this and another DataFrame. Duplicates are kept. |
DataFrame.unionByName (other[, …]) |
Returns a new DataFrame containing the union of rows in this and another DataFrame. |
DataFrame.unpersist ([blocking]) |
Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
DataFrame.unpivot (ids, values, …) |
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. |
DataFrame.where (condition) |
Alias for filter(). |
DataFrame.withColumn (colName, col) |
Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
DataFrame.withColumns (*colsMap) |
Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names. |
DataFrame.withColumnRenamed (existing, new) |
Returns a new DataFrame by renaming an existing column. |
DataFrame.withColumnsRenamed (colsMap) |
Returns a new DataFrame by renaming multiple columns. |
DataFrame.withMetadata (columnName, metadata) |
Returns a new DataFrame by updating an existing column with metadata. |
DataFrame.withWatermark (eventTime, …) |
Defines an event time watermark for this DataFrame. |
DataFrame.write |
Interface for saving the content of the non-streaming DataFrame out into external storage. |
DataFrame.writeStream |
Interface for saving the content of the streaming DataFrame out into external storage. |
DataFrame.writeTo (table) |
Create a write configuration builder for v2 sources. |
DataFrame.pandas_api ([index_col]) |
Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
DataFrameNaFunctions.drop ([how, thresh, subset]) |
Returns a new DataFrame omitting rows with null values. |
DataFrameNaFunctions.fill (value[, subset]) |
Replace null values, alias for na.fill(). |
DataFrameNaFunctions.replace (to_replace[, …]) |
Returns a new DataFrame replacing a value with another value. |
DataFrameStatFunctions.approxQuantile (col, …) |
Calculates the approximate quantiles of numerical columns of a DataFrame. |
DataFrameStatFunctions.corr (col1, col2[, method]) |
Calculates the correlation of two columns of a DataFrame as a double value. |
DataFrameStatFunctions.cov (col1, col2) |
Calculate the sample covariance for the given columns, specified by their names, as a double value. |
DataFrameStatFunctions.crosstab (col1, col2) |
Computes a pair-wise frequency table of the given columns. |
DataFrameStatFunctions.freqItems (cols[, support]) |
Finding frequent items for columns, possibly with false positives. |
DataFrameStatFunctions.sampleBy (col, fractions) |
Returns a stratified sample without replacement based on the fraction given on each stratum. |