Tokenizer
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.
RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter "pattern" (regex, default: "\s+") is used as the delimiter to split the input text. Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes "tokens" rather than splitting gaps; all matching occurrences then become the tokenization result. Note that both tokenizers also lowercase their output by default.
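The "gaps" distinction can be sketched with plain Scala regexes, independent of the Spark API (the object name GapsDemo is just for illustration): with gaps = true the pattern is a delimiter to split on, and with gaps = false the pattern describes the tokens themselves.

```scala
// Illustrative sketch of the two RegexTokenizer modes using plain Scala
// regexes (not the Spark API itself).
object GapsDemo extends App {
  val text = "Logistic,regression,models"

  // gaps = true (the default): the pattern "\W" acts as a delimiter,
  // and the text is split wherever it matches.
  val byGaps = text.split("\\W").toSeq

  // gaps = false: the pattern "\w+" describes the tokens themselves,
  // and every match becomes a token.
  val byTokens = "\\w+".r.findAllIn(text).toSeq

  // Both modes yield the same three tokens for this input.
  assert(byGaps == byTokens)
  println(byGaps.mkString("[", ", ", "]")) // [Logistic, regression, models]
}
```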
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
)).toDF("id", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)

val countTokens = udf { (words: Seq[String]) => words.length }

val tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")
  .withColumn("tokens", countTokens(col("words"))).show(false)

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words")
  .withColumn("tokens", countTokens(col("words"))).show(false)

/*
Output:
+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |1     |
+-----------------------------------+------------------------------------------+------+

Notice that the last sentence is comma-separated rather than space-separated:
RegexTokenizer splits it into the comma-delimited words,
while Tokenizer splits only on whitespace and so returns it as a single token.

+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+-----------------------------------+------------------------------------------+------+
*/