Apache Spark: SparkSQL Reference (Functions: Other Functions)

This is Part 4 of the SparkSQL reference and the last installment of the functions series: the other functions.

Other Functions

SparkSQL ships with many handy utility functions.
This article walks through them one by one.

Function	Description	Ver.
lit	lit(literal: Any) Returns a literal value as a column. In SQL you can write a value directly (e.g. SELECT 'apple' as mac), but with the DataFrame API you use lit to express a literal. sql: select 'apple' as mac from table DataFrame: df.select( lit( "apple" ) as "mac" )	1.3.0
abs	abs(e: Column) Computes the absolute value. sql: select abs( e ) from table DataFrame: df.select( abs( $"e" ) ) e.g. if e = -2, 2 is returned.	1.3.0
array	array(cols: Column*) array(colName: String, colNames: String*) Creates an array column. All elements of the array must have the same type. sql: select array( col1, col2 ) from table DataFrame: df.select( array( $"col1", $"col2" ) )	1.4.0
coalesce	coalesce(e: Column*) Returns the value of the first column that is not null. If all columns are null, null is returned. sql: select coalesce( a, b, c ) from table DataFrame: df.select( coalesce( $"a", $"b", $"c" ) ) e.g. in the query above, if both a and b are null, the value of c is returned.	1.3.0
isNaN	isNaN(e: Column) Returns whether the column value is NaN. Only usable on Float or Double columns. sql: select isNaN( e ) from table DataFrame: df.select( isNaN( $"e" ) )	1.5.0
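monotonicallyIncreasingId	monotonicallyIncreasingId() Generates unique 64-bit IDs that are guaranteed to increase monotonically but are not consecutive. sql: select monotonicallyIncreasingId() from table DataFrame: df.select( monotonicallyIncreasingId() ) e.g. for a DataFrame with two partitions of three records each: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594	1.4.0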
nanvl	nanvl(col1: Column, col2: Column) Returns col1 if it is not NaN, otherwise col2. It is the NaN counterpart of nvl. Only usable on Float or Double columns. sql: select nanvl( col1, col2 ) from table DataFrame: df.select( nanvl( $"col1", $"col2" ) ) e.g. if col1 is NaN, col2 is returned.	1.5.0
negate	negate(e: Column) Returns the value of the selected column with its sign flipped. sql: select negate( e ) from table DataFrame: df.select( negate( $"e" ) ) e.g. if e = 5, -5 is returned.	1.3.0
not	not(e: Column) Logical NOT of a boolean column. Scala: df.filter( !df("isActive") ) Java: df.filter( not(df.col("isActive")) );	1.3.0
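rand	rand(seed: Long), rand() Generates independent and identically distributed (i.i.d.) random values drawn from a uniform distribution. The argument is the seed, which can be omitted. sql: select rand( 1 ) from table DataFrame: df.select( rand( 1 ) )	1.4.0
randn	randn(seed: Long), randn() Generates i.i.d. random values drawn from the standard normal distribution. The argument is the seed, which can be omitted. sql: select randn( 1 ) from table DataFrame: df.select( randn( 1 ) )	1.4.0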
sqrt	sqrt(e: Column) Computes the square root. sql: select sqrt( e ) from table DataFrame: df.select( sqrt( $"e" ) ) e.g. if e = 4, 2.0 is returned.	1.3.0
when	when(condition: Column, value: Any) The "when" of CASE..WHEN; it lets you branch on conditions. The first argument is the condition and the second is the value to return when the condition matches. e.g. Scala: people.select( when(people("gender") === "male", 0) .when(people("gender") === "female", 1) .otherwise(2)) Java: people.select( when(col("gender").equalTo("male"), 0) .when(col("gender").equalTo("female"), 1) .otherwise(2))	1.4.0
expr	expr(expr: String) Parses an expression string into a Column. DataFrame: df.groupBy( expr( "length(word)" ) ).count()
md5	md5(e: Column) Computes the MD5 digest of a binary column and returns it as a hex string. sql: select md5( e ) from table DataFrame: df.select( md5( $"e" ) )	1.5.0
sha1	sha1(e: Column) Computes the SHA-1 digest of a binary column and returns it as a hex string. sql: select sha1( e ) from table DataFrame: df.select( sha1( $"e" ) )	1.5.0
sha2	sha2(e: Column, numBits: Int) Computes a SHA-2 family hash of a binary column and returns it as a hex string. numBits must be one of 224, 256, 384, or 512. sql: select sha2( e, 256 ) from table DataFrame: df.select( sha2( $"e", 256 ) )	1.5.0
crc32	crc32(e: Column) Computes the CRC32 value of a binary column. sql: select crc32( e ) from table DataFrame: df.select( crc32( $"e" ) )	1.5.0
array_contains	array_contains(column: Column, value: Any) Returns whether an array column contains the given value. sql: select array_contains( e, 'apple' ) from table DataFrame: df.select( array_contains( $"e", "apple" ) ) e.g. returns true if e = array( "apple", "banana" ) and false if e = array( "grape", "orange" ).	1.5.0
explode	explode(e: Column) Generates a new row for each element of an array or map column. This one is really handy. sql: select explode( e ) from table DataFrame: df.select( explode( $"e" ) ) e.g. if column e holds the two rows array( 1, 2 ) and array( 4, 3 ), explode(e) returns four rows: 1, 2, 4, 3.	1.3.0
json_tuple	json_tuple(json: Column, fields: String*) Builds a new row from a JSON string. The output columns are resolved by matching the field names passed as arguments against the field names in the JSON. sql: select json_tuple( json, 'f1', 'f2', 'f3' ) from table DataFrame: df.select( json_tuple( $"json", "f1", "f2", "f3" ) ) e.g. if json = {"f1":"value1","f2":3,"f3":5.23}, a DataFrame with a single row f1 = value1, f2 = 3, f3 = 5.23 is returned.	1.6.0
size	size(e: Column) Returns the number of elements in an array or map column. sql: select size( e ) from table DataFrame: df.select( size( $"e" ) ) e.g. if e = array( 1, 2, 3 ), 3 is returned.	1.5.0
sort_array	sort_array(e: Column) sort_array(e: Column, asc: Boolean) Sorts an array column in natural order. sql: select sort_array( e, true ) from table DataFrame: df.select( sort_array( $"e", true ) ) e.g. if e = array( 5, 3, 9 ) and asc = true, array( 3, 5, 9 ) is returned.	1.5.0
udf	udf(f: FunctionN[A1, .., AN, RT]) Spark lets you define user-defined functions. Functions taking 0 to 10 arguments are currently supported. e.g. val bool2int = udf { b: Boolean => if ( b ) 1 else 0 } df.select( bool2int( $"e" ) )	1.3.0
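
The monotonicallyIncreasingId example in the table jumps from 2 straight to 8589934592, which can look odd at first. The following is a rough sketch in plain Scala (no Spark required) that reproduces those values, assuming the layout described in the Spark API documentation: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. MonotonicIdSketch, sketchId, and idsFor are hypothetical names used only for illustration; they are not part of the Spark API, and the real implementation may differ between versions.

```scala
// Illustrative sketch only: mimics the assumed ID layout of monotonicallyIncreasingId.
// Assumption: upper 31 bits = partition ID, lower 33 bits = record number in that partition.
object MonotonicIdSketch {

  // Hypothetical helper (not a Spark API): combine a partition ID and a record number.
  def sketchId(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) | recordNumber

  // IDs for a dataset described by the number of records held in each partition.
  def idsFor(recordsPerPartition: Seq[Int]): Seq[Long] =
    for {
      (count, partitionId) <- recordsPerPartition.zipWithIndex
      record <- 0 until count
    } yield sketchId(partitionId, record.toLong)

  def main(args: Array[String]): Unit = {
    // Two partitions with three records each, as in the table example.
    println(idsFor(Seq(3, 3)).mkString(", "))
  }
}
```

Under that assumption the IDs are unique and increasing, but every new partition starts at a multiple of 1L << 33, which is why they are not consecutive.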
