One-Hot Encoding in PySpark

PySpark One Hot Encoding with CountVectorizer

One Hot Encoding is an important technique for converting categorical attributes into a numeric vector that machine learning models can understand. In this article, you will learn how to implement one-hot encoding in PySpark.

Getting Started

Before we begin, we need to instantiate a Spark SQLContext and import required python modules.

#Import PySpark libraries
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

Next, we need to connect to a Spark cluster (or your local Spark) and instantiate a new SQLContext class, since we will be working with Spark DataFrames.

#Connect to Spark
conf = SparkConf().setAppName("Vectorizer")
sc = SparkContext(conf=conf)
#Create an SQLContext class
sqlContext = SQLContext(sc)

Print the version of Spark you are utilizing. This is important to know, as functionality varies across Spark versions. This article is written using Spark version 2.3.2.

print("Spark Version: " + sc.version)
#Spark Version: 2.3.2

Create Spark DataFrame

In the next code block, generate a sample Spark DataFrame containing two columns, an id and a Color column. The task at hand is to one-hot encode the Color column of our DataFrame. We call our DataFrame df.

df = sqlContext.createDataFrame([
    (0, "Red"),
    (1, "Blue"),
    (2, "Green"),
    (3, "White")
], ["id", "Color"])

Display the Spark DataFrame we have generated.

df.show(truncate=False)
[Screenshot: df.show() output with the id and Color columns]

Convert String To Array

To run one-hot encoding in PySpark we will be utilizing the CountVectorizer class from the pyspark.ml package. One of the requirements for running one-hot encoding this way is that the input column must be an array.

Our Color column is currently a string, not an array. Convert the values of the “Color” column into an array by utilizing the split function of PySpark. Run the following code block to generate a new “Color_Array” column.

from pyspark.sql.functions import col, split

df = df.withColumn("Color_Array", split(col("Color"), " "))
df.show()
[Screenshot: df.show() output with the new Color_Array column]

Our data is now ready for us to run one-hot encoding utilizing the functions from the pyspark.ml package.

PySpark CountVectorizer

The pyspark.ml package provides a class called CountVectorizer which makes one-hot encoding quick and easy.

Yes, there is a class called OneHotEncoderEstimator which is better suited for this. Bear with me; doing it with CountVectorizer will challenge us and improve our knowledge of PySpark functionality.

The CountVectorizer class and its corresponding CountVectorizerModel help convert a collection of text into a vector of counts. The result when converting our categorical variable into a vector of counts is our one-hot encoded vector. The size of the vector will be equal to the distinct number of categories we have. Let’s begin one-hot encoding. 

Import the CountVectorizer class from pyspark.ml.feature.

#Import Spark CountVectorizer
from pyspark.ml.feature import CountVectorizer

Now, create a CountVectorizer instance, which we will call colorVectorizer. Some important parameters that we need to provide are the following:

  • inputCol: specifies the column to be one-hot encoded
  • outputCol: the created column; this will be our one-hot encoded column
  • vocabSize: specifies the maximum number of words to keep in our vocabulary
  • minDF: specifies in how many rows a word needs to appear for it to be counted
# Initialize a CountVectorizer.
colorVectorizer = CountVectorizer(inputCol="Color_Array", outputCol="Color_OneHotEncoded",
                                  vocabSize=4, minDF=1.0)

Next, call the fit method of the CountVectorizer class to run the algorithm on our text. The resulting CountVectorizerModel will then be applied to our DataFrame to generate the one-hot encoded vectors.

#Get a VectorizerModel
colorVectorizer_model = colorVectorizer.fit(df)

With our CountVectorizerModel in place, we can now apply the transform function to our DataFrame. This function reads the Color_Array column we defined as the input and writes the encoded vectors to the Color_OneHotEncoded output column.

df_ohe = colorVectorizer_model.transform(df)
df_ohe.show(truncate=False)
[Screenshot: df_ohe.show() output with the Color_OneHotEncoded column]

We are done. The newly added column in our Spark DataFrame contains the one-hot encoded vector. Because we are applying the CountVectorizer class to categorical text with no spaces and only one word per row, the resulting vector contains all zeros and a single 1.

How do we extract the result into, for example, a NumPy array? Do the following.

import numpy as np

x_3d = np.array(df_ohe.select('Color_OneHotEncoded').collect())
x_3d.shape
#(4, 1, 4)

Only run collect in PySpark if your driver has enough memory to hold the combined data from all of your workers. Otherwise, you would need to process the data in batches instead.

We have extracted the Color_OneHotEncoded column into a 3D array. We need to convert this into a 2D array of size (rows, vocabulary size).

Get the shape from our x_3d variable to obtain the number of rows and the vocabulary size, as you can see below. Then reshape the array into a 2D array in which each row contains the one-hot encoded value for the color input.

rows, idx, vocabsize = x_3d.shape
X = x_3d.reshape(rows, vocabsize)
X.shape
#(4, 4)
[Screenshot: shape of the reshaped 2D one-hot array]

Reverse One-Hot Encoding

Ok, we are done going from text to a one-hot vector. What if we needed to go the other way around, from a one-hot encoded vector back to text, the Color in this case? For this, we need to build a reverse dictionary.

First, get the Colors from our DataFrame by running a command similar to the one we used to get the one-hot vector array. The result is an array, therefore we use a list comprehension to get a flat list of colors.

Colors = np.array(df_ohe.select('Color').collect())
Colors = [str(c[0]).strip() for c in Colors]
print(Colors)
#['Red', 'Blue', 'Green', 'White']

These correspond to each row of our X array. For each row, find the index at which the one-hot vector equals 1, and then loop through the pairs to generate our index and reverse-index dictionaries.

Use the where function in Numpy to get the location of the one-hot index.

np.where(X == 1)[1]
#array([3, 1, 0, 2], dtype=int64)

 Generate the word2int and reverse_word2int dictionaries.

reverse_word2int = {}
word2int = {}
for color, index in zip(Colors, list(np.where(X == 1)[1])):
    reverse_word2int[index] = color
    word2int[color] = index

Let’s see if our dictionaries make sense. Let’s test the color Red: get its index in the one-hot vector, and then map that index back to the label.

print(word2int['Red'])
print(reverse_word2int[3])
#3
#Red

To go from a one-hot encoded vector back to the label, all you need is the location of the 1 within the array, which is easy to obtain using NumPy. Then, utilize your reverse_word2int dictionary to obtain the label. A small sketch of this follows.
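
The helper below is not part of the original article; it is a minimal sketch bundling the two steps, assuming the X array and reverse_word2int dictionary built above (and that the rows of X are in the same order as the Colors list).

def decode_one_hot(ohe_row):
    # Find the position of the single 1 in the one-hot row, then look up the label
    idx = int(np.where(ohe_row == 1)[0][0])
    return reverse_word2int[idx]

print(decode_one_hot(X[0]))
#Red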

Conclusion

Spark is a powerful data processing engine and its ML library provides much-needed functions to build machine learning models. In this article, we saw how to implement one-hot encoding and reverse one-hot encoding using the CountVectorizer module. In other articles, I will show how to use the OneHotEncoderEstimator module.

Source: https://www.hackdeploy.com/pyspark-one-hot-encoding-with-countvectorizer/

Machine Learning Case Study with PySpark

0. Some random thoughts/babbling

I guess it is the best time, since you can deal with millions of data points with relatively limited computing power, and without having to know every single bit of computer science. They call it high-level. It is also the worst time, since, like in the wild west, there are all kinds of tools hurtling toward you: from ancient dinosaurs like SAS to the modern helicopter Apache Spark. No pun intended.

I got to use Spark in the 2.0 era, the era of the DataFrame. I love it since I am already familiar with pandas and numpy. Even better, I can convert a Spark DataFrame into pandas if I cannot figure out how to do something the Spark way.

Well, how fast can it be? It takes me half an hour to finish a full machine learning workflow, starting from imputation and one-hot encoding and ending with a random forest. By the way, the data set has 7 million data points with 68 variables. And I only used the free community edition from Databricks. Kudos to them.

1. Background Information

This is a classification problem, where the response variable has two categories. I am going to first demonstrate a minimal but complete workflow. However, two things might be considered in addition:

  1. It is always a good idea to do some exploratory work first. In my opinion, it might be a sensible idea to do this in pandas and numpy (a.k.a. base Python) on a smaller data set. I have another article that talks about this specifically.

  2. More fine-tuning work is rewarding, and ensembling and stacking should also be considered, along with other subtleties. Find my article here.

2. Data Preprocessing

The first couple of lines load the data and create a DataFrame object. Spark's schema inference can give a good guess about the data type of each column, and I created a dictionary to store them. In this case, we got string type, double type and integer type.

This might come as a surprise, but the difference between integer type and double type is important in Spark. For instance, there is a new Imputer in Spark 2.2 which can only work with double type, and it will throw an error if you pass in an integer variable. If you do not care about it, just cast integer type to double.

2.1 Handling categorical data

Let's first deal with the string types, namely deleting variables with too many categories and handling missing data.

You can use an approximate distinct count (approx_count_distinct) instead of an exact one, since the exact count is more expensive computationally. This step finds the categorical variables with more than 100 categories; a sketch is shown below. This kind of variable, like an ID, usually is not informative. Of course, if you feel uncomfortable ditching them, we'll talk about another way to deal with them. For now, we simply ditch all variables on that list.
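
The original code is not reproduced on this page, so here is a hedged sketch of what that step might look like; string_cols is a placeholder for the list of string-type columns.

from pyspark.sql import functions as F

# Approximate distinct counts for every string column in one pass
counts = df.agg(*[F.approx_count_distinct(c).alias(c) for c in string_cols]).first().asDict()
too_many_categories = [c for c, n in counts.items() if n > 100]

# Drop the high-cardinality (ID-like) variables
df = df.drop(*too_many_categories)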

Secondly, for the other categorical variables that remain in the model, we need to deal with missing data. The strategy is to create a new category (for example, a literal "missing" label) to store them. Here is the implementation.

We keep a list of all string-type variables, excluding the ones with more than 100 categories. We then pass a dictionary to fillna in order to replace all missing values with the new placeholder string.
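
A one-line sketch of that replacement, assuming cat_cols holds the remaining string columns and "missing" is the placeholder label:

# Fill nulls in every remaining categorical column with a new "missing" category
df = df.fillna({c: "missing" for c in cat_cols})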

However, computers were never designed to deal with strings and text directly. We need to convert the categorical variables into numbers. The process includes string indexing and one-hot encoding. For example, translating "man" and "woman" into 1 and 0 is string indexing; one-hot encoding is a little more complex. Here is one example.

You have three categories, let's say republican, democrat and other. After string indexing we end up with 0, 1, 2. To one-hot encode them, you can only represent them with 0s and 1s like this.

republican | democrat | other
    1      |    0     |   0
    0      |    1     |   0
    0      |    0     |   1

By the way, one-hot is an electrical engineering term, which means you can literally only fire up one semiconductor at a time. In fact, you do not have to understand what happens under the hood, since Spark provides the StringIndexer and OneHotEncoder in the pyspark.ml.feature library.

Two things. First, the StringIndexer and OneHotEncoder, unlike the Imputer (which is something we will talk about later), can only take one column at a time, and thus we need a list comprehension and a pipeline.

A pipeline is an interesting idea that first emerged in the scikit-learn library: you feed in a series of tasks you want to do, make them a list, and the pipeline handles everything for you. This is life-saving, because if you did it manually you would have to do the imputation, save the data, pass it on to the next task, and repeat this cycle again and again.

Another thing is that this pipeline object is slightly different from the one in scikit-learn. In scikit-learn, you do not need to assign the transformed result back to a variable like we do here; you only have to call fit, and the updated information is already stored in the object, so we do not need to pass it on to another variable. A sketch of the indexing and encoding pipeline is shown below.
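
The original code block is not shown here, so this is a hedged sketch of such a pipeline. It uses the multi-column OneHotEncoder from Spark 3.x (the Spark 2.3/2.4 equivalent is OneHotEncoderEstimator); cat_cols and the "_idx"/"_ohe" suffixes are placeholders.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# One StringIndexer per categorical column, built with a list comprehension
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in cat_cols]

# A single encoder that one-hot encodes all of the indexed columns at once
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in cat_cols],
                        outputCols=[c + "_ohe" for c in cat_cols])

pipeline = Pipeline(stages=indexers + [encoder])
df = pipeline.fit(df).transform(df)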

2.2 Handling numerical data

In handling numerical data, I am going to demonstrate two different approaches. In Spark 2.2, a new Imputer is included, but it only works for double type and not for integers. This is how it works.

Note that the new Imputer can handle multiple columns at one time. I guess this is where Spark is headed, since handling multiple variables at a time is a much more common scenario than one column at a time. Obviously the imputed columns all end with a suffix (for example "_imputed"), as sketched below.
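
A minimal Imputer sketch; double_cols is a placeholder for the double-type columns and the "_imputed" suffix is just an illustrative choice.

from pyspark.ml.feature import Imputer

# Impute every double column with its mean in a single pass
imputer = Imputer(strategy="mean",
                  inputCols=double_cols,
                  outputCols=[c + "_imputed" for c in double_cols])
df = imputer.fit(df).transform(df)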

What about integer type? Two different strategies. One, cast integer to double type and use the Imputer. Two, use the old way from before Spark 2.2: pass a dictionary to the fillna function. We typically fill with the sample mean.

Here is the implementation of the first way.

Note that withColumn is the most common way to add a new column, where the first argument is the name of the new column and the second argument is the operation. You can define your own operation with a user-defined function (UDF) as well.

Here is the second strategy; let's pretend there is no Imputer whatsoever.

For the agg function, we can pass in a dictionary in which the key is a column name and the value is the operation (here, "mean") for that column. The result is a dictionary of column names and column means, which is later fed into the fillna method. Both strategies are sketched below.
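
A hedged sketch of the two strategies just described; int_cols is a placeholder list of the integer columns.

from pyspark.sql.functions import col

# Strategy one: cast the integer columns to double so the Imputer can handle them
for c in int_cols:
    df = df.withColumn(c, col(c).cast("double"))

# Strategy two: compute each column's sample mean and feed the dictionary to fillna
agg_row = df.agg({c: "mean" for c in int_cols}).first().asDict()
means = {c: agg_row["avg(" + c + ")"] for c in int_cols}
df = df.fillna(means)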

2.3 Put them together

The idea here is to assemble everything into a single vector. This is reasonable since, after one-hot encoding and the other steps, you end up with a mishmash of integers, floats, sparse vectors, and maybe dense vectors. What we do next is bundle them all together and call the result the features.

Interestingly, if you do not specify any variables for the algorithm to look at, scikit-learn will throw an error, but Spark will by default look for a column named "features" as X and a column named "label" as y. That's why we usually call the assembled column "features", as in the sketch below.
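
A sketch of the assembly step with VectorAssembler; feature_cols is a placeholder for the full list of encoded and numeric columns.

from pyspark.ml.feature import VectorAssembler

# Bundle all prepared columns into a single vector column called "features"
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df = assembler.transform(df)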

3. Dimension Reduction/feature selection

The most popular way of doing dimension reduction is principal component analysis, a.k.a. PCA.

Dimension reduction intends to project the variables into a lower-dimensional space. An additional benefit is that these dimensions are usually independent of each other. Hence, one may find it helpful when dealing with lots of highly correlated variables. We choose the number of dimensions via the k parameter of the PCA estimator, and we can find the best k by cross-validation.

For instance, if we want to extract 30 features from 68, the following code can be used.
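
The code itself is not reproduced on this page; a minimal sketch (assuming the assembled column is called "features") would be:

from pyspark.ml.feature import PCA

# Project the 68 assembled features down to 30 principal components
pca = PCA(k=30, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(df)
df = pca_model.transform(df)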

4. Modeling fitting and tuning

We have a couple of built-in classifiers, including random forest, boosted trees, logistic regression, etc. We will implement random forest as an example, and the only parameter one needs to specify is the number of trees in the classifier.

Remember that we arbitrarily chose the number of components during the PCA, and the number of trees here. To find the best parameters, one should consider grid search. Luckily, we do not need to code all of the grid search and cross-validation ourselves. Here is how to do it with the pyspark.ml.tuning library.

Note that we have a grid over two parameters, giving 3 x 5 = 15 different combinations to compute. No matter how powerful your server is, I still think we should do it on a smaller subset of the original file. Be reasonable.

You can choose a different evaluation metric to match your purpose; in this case we use the area under the ROC curve as the criterion. A sketch of the whole tuning setup is shown below.
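
The original code is not shown on this page; below is a hedged sketch of such a setup. The grid values are illustrative, chosen only to match the 3 x 5 combinations mentioned above, and train_df is a placeholder for the training split.

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf = RandomForestClassifier(featuresCol="pca_features", labelCol="label")

# 3 x 5 = 15 parameter combinations
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50, 100])
        .addGrid(rf.maxDepth, [3, 5, 7, 9, 11])
        .build())

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train_df)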

The accuracy score for this model is 0.54, which is an OK model. But how do we make it better? Let's find out in the next article.

Source: https://people.stat.sc.edu/haigang/sparkCaseStudy.html

Extracting, transforming and selecting features

This section covers algorithms for working with features, roughly divided into these groups:

  • Extraction: Extracting features from “raw” data
  • Transformation: Scaling, converting, or modifying features
  • Selection: Selecting a subset from a larger set of features
  • Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.


TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by $t$, a document by $d$, and the corpus by $D$. Term frequency $TF(t, d)$ is the number of times that term $t$ appears in document $d$, while document frequency $DF(t, D)$ is the number of documents that contain term $t$. If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g. “a”, “the”, and “of”. If a term appears very often across the corpus, it means it doesn’t carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides: $IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}$, where $|D|$ is the total number of documents in the corpus. Since logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF: $TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)$. There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible.

TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.

HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices. The default feature dimension is $2^{18} = 262,144$. An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.

CountVectorizer converts text documents to vectors of term counts. Refer to CountVectorizer for more details.

IDF: IDF is an Estimator which is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.

Note: spark.ml doesn’t provide tools for text segmentation. We refer users to the Stanford NLP Group and scalanlp/chalk.

Examples

In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
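
The Python example code is not reproduced on this page; a sketch along the lines of the official example (assuming an existing SparkSession named spark) looks like this:

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

# Split each sentence into words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

# Hash the words into fixed-length term-frequency vectors
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

# Rescale by inverse document frequency
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
idfModel.transform(featurizedData).select("label", "features").show(truncate=False)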

Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala" in the Spark repo.

Refer to the HashingTF Java docs and the IDF Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java" in the Spark repo.

Refer to the HashingTF Python docs and the IDF Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/tf_idf_example.py" in the Spark repo.

Word2Vec

Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.

Examples

In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.

Refer to the Word2Vec Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala" in the Spark repo.

Refer to the Word2Vec Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaWord2VecExample.java" in the Spark repo.

Refer to the Word2Vec Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/word2vec_example.py" in the Spark repo.

CountVectorizer

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.

Examples

Assume that we have the following DataFrame with columns id and texts:

Each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c). Then the output column “vector” after transformation contains:

Each vector represents the token counts of the document over the vocabulary.
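
The Python example code is not reproduced on this page; a sketch matching the example above (assuming an existing SparkSession named spark) would be:

from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "texts"])

# Fit a vocabulary of size 3, keeping terms that appear in at least 2 documents
cv = CountVectorizer(inputCol="texts", outputCol="vector", vocabSize=3, minDF=2.0)
model = cv.fit(df)
model.transform(df).show(truncate=False)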

Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/CountVectorizerExample.scala" in the Spark repo.

Refer to the CountVectorizer Java docs and the CountVectorizerModel Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java" in the Spark repo.

Refer to the CountVectorizer Python docs and the CountVectorizerModel Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/count_vectorizer_example.py" in the Spark repo.

FeatureHasher

Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick to map features to indices in the feature vector.

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:

  • Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns using the categoricalCols parameter.
  • String columns: For categorical features, the hash value of the string “column_name=value” is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).
  • Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as “column_name=true” or “column_name=false”, with an indicator value of 1.0.

Null (missing) values are ignored (implicitly zero in the resulting feature vector).

The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.

Examples

Assume that we have a DataFrame with 4 input columns real, bool, stringNum, and string. These different data types as input will illustrate the behavior of the transform to produce a column of feature vectors.

Then the output of FeatureHasher.transform on this DataFrame is:

The resulting feature vectors could then be passed to a learning algorithm.
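
The Python example code is not reproduced on this page; a sketch of this example (assuming an existing SparkSession named spark) would be:

from pyspark.ml.feature import FeatureHasher

dataset = spark.createDataFrame([
    (2.2, True, "1", "foo"),
    (3.3, False, "2", "bar"),
    (4.4, False, "3", "baz"),
    (5.5, False, "4", "foo")
], ["real", "bool", "stringNum", "string"])

# Hash all four columns into a single feature vector
hasher = FeatureHasher(inputCols=["real", "bool", "stringNum", "string"],
                       outputCol="features")
hasher.transform(dataset).show(truncate=False)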

Refer to the FeatureHasher Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala" in the Spark repo.

Refer to the FeatureHasher Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java" in the Spark repo.

Refer to the FeatureHasher Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/feature_hasher_example.py" in the Spark repo.

Tokenizer

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.

RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter “pattern” (regex, default: "\\s+") is used as delimiters to split the input text. Alternatively, users can set parameter “gaps” to false, indicating the regex “pattern” denotes “tokens” rather than splitting gaps, and find all matching occurrences as the tokenization result.

Examples

Refer to the Tokenizer Scala docs and the RegexTokenizer Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/TokenizerExample.scala" in the Spark repo.

Refer to the Tokenizer Java docs and the RegexTokenizer Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java" in the Spark repo.

Refer to the Tokenizer Python docs and the RegexTokenizer Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/tokenizer_example.py" in the Spark repo.

StopWordsRemover

Stop words are words which should be excluded from the input, typically because the words appear frequently and don’t carry as much meaning.

StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stopwords is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords(language), for which available options are “danish”, “dutch”, “english”, “finnish”, “french”, “german”, “hungarian”, “italian”, “norwegian”, “portuguese”, “russian”, “spanish”, “swedish” and “turkish”. A boolean caseSensitive parameter indicates if the matches should be case sensitive (false by default).

Examples

Assume that we have the following DataFrame with columns id and raw:

Applying StopWordsRemover with raw as the input column and filtered as the output column, we should get the following:

In filtered, the stop words “I”, “the”, “had”, and “a” have been filtered out.

Refer to the StopWordsRemover Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala" in the Spark repo.

Refer to the StopWordsRemover Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaStopWordsRemoverExample.java" in the Spark repo.

Refer to the StopWordsRemover Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/stopwords_remover_example.py" in the Spark repo.

$n$-gram

An n-gram is a sequence of $n$ tokens (typically words) for some integer $n$. The NGram class can be used to transform input features into $n$-grams.

NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each $n$-gram. The output will consist of a sequence of $n$-grams where each $n$-gram is represented by a space-delimited string of $n$ consecutive words. If the input sequence contains fewer than n strings, no output is produced.

Examples

Refer to the NGram Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/NGramExample.scala" in the Spark repo.

Refer to the NGram Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaNGramExample.java" in the Spark repo.

Refer to the NGram Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/n_gram_example.py" in the Spark repo.

Binarizer

Binarization is the process of thresholding numerical features to binary (0/1) features.

Binarizer takes the common parameters inputCol and outputCol, as well as the threshold for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported for inputCol.

Examples

Refer to the Binarizer Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/BinarizerExample.scala" in the Spark repo.

Refer to the Binarizer Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaBinarizerExample.java" in the Spark repo.

Refer to the Binarizer Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/binarizer_example.py" in the Spark repo.

PCA

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.

Examples

Refer to the PCA Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/PCAExample.scala" in the Spark repo.

Refer to the PCA Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaPCAExample.java" in the Spark repo.

Refer to the PCA Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/pca_example.py" in the Spark repo.

PolynomialExpansion

Polynomial expansion is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A PolynomialExpansion class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.

Examples

Refer to the PolynomialExpansion Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/PolynomialExpansionExample.scala" in the Spark repo.

Refer to the PolynomialExpansion Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaPolynomialExpansionExample.java" in the Spark repo.

Refer to the PolynomialExpansion Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/polynomial_expansion_example.py" in the Spark repo.

Discrete Cosine Transform (DCT)

The Discrete Cosine Transform transforms a length $N$ real-valued sequence in the time domain into another length $N$ real-valued sequence in the frequency domain. A DCT class provides this functionality, implementing the DCT-II and scaling the result by $1/\sqrt{2}$ such that the representing matrix for the transform is unitary. No shift is applied to the transformed sequence (e.g. the $0$th element of the transformed sequence is the $0$th DCT coefficient and not the $N/2$th).

Examples

Refer to the DCT Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/DCTExample.scala" in the Spark repo.

Refer to the DCT Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaDCTExample.java" in the Spark repo.

Refer to the DCT Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/dct_example.py" in the Spark repo.

StringIndexer

StringIndexer encodes a string column of labels to a column of label indices. StringIndexer can encode multiple columns. The indices are in [0, numLabels), and four ordering options are supported: “frequencyDesc”: descending order by label frequency (most frequent label assigned 0), “frequencyAsc”: ascending order by label frequency (least frequent label assigned 0), “alphabetDesc”: descending alphabetical order, and “alphabetAsc”: ascending alphabetical order (default = “frequencyDesc”). Note that in case of equal frequency when under “frequencyDesc”/”frequencyAsc”, the strings are further sorted by alphabet.

The unseen labels will be put at index numLabels if the user chooses to keep them. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.

Examples

Assume that we have the following DataFrame with columns id and category:

category is a string column with three labels: “a”, “b”, and “c”. Applying StringIndexer with category as the input column and categoryIndex as the output column, we should get the following:

“a” gets index 0 because it is the most frequent, followed by “c” with index 1 and “b” with index 2.

Additionally, there are three strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:

  • throw an exception (which is the default)
  • skip the row containing the unseen label entirely
  • put unseen labels in a special additional bucket, at index numLabels

Examples

Let’s go back to our previous example but this time reuse our previously defined StringIndexer on the following dataset:

If you’ve not set how StringIndexer handles unseen labels or set it to “error”, an exception will be thrown. However, if you had called setHandleInvalid("skip"), the following dataset will be generated:

Notice that the rows containing “d” or “e” do not appear.

If you call setHandleInvalid("keep"), the following dataset will be generated:

Notice that the rows containing “d” or “e” are mapped to index “3.0”
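
The Python example code is not reproduced on this page; a sketch covering both the basic indexing and the handleInvalid options (assuming an existing SparkSession named spark) would be:

from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="keep")
model = indexer.fit(df)
model.transform(df).show()

# A second dataset containing labels ("d", "e") unseen during fitting;
# with handleInvalid="keep" they are mapped to the extra index 3.0
df2 = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "d"), (3, "e")],
    ["id", "category"])
model.transform(df2).show()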

Refer to the StringIndexer Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/StringIndexerExample.scala" in the Spark repo.

Refer to the StringIndexer Java docs for more details on the API.

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaStringIndexerExample.java" in the Spark repo.

Refer to the StringIndexer Python docs for more details on the API.

Find full example code at "examples/src/main/python/ml/string_indexer_example.py" in the Spark repo.

IndexToString

Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model with those indices and retrieve the original labels from the column of predicted indices with IndexToString. However, you are free to supply your own labels.

Examples

Building on the StringIndexer example, let’s assume we have the following DataFrame with columns id and categoryIndex:

Applying IndexToString with categoryIndex as the input column and originalCategory as the output column, we are able to retrieve our original labels (they will be inferred from the columns’ metadata):

Refer to the IndexToString Scala docs for more details on the API.

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/IndexToStringExample.scala" in the Spark repo.

Refer to the IndexToString Java docs for more details on the API.

Source: https://spark.apache.org/docs/latest/ml-features


Using PySpark:

import pyspark
print(pyspark.__version__)
# 3.0.0

from pyspark import SparkContext
from pyspark.sql import SQLContext  # Main entry point for DataFrame and SQL functionality.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder, FeatureHasher
from pyspark.sql.functions import col

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

values = [("K1", "a", 5, 'x'),
          ("K2", "a", 5, 'x'),
          ("K3", "b", 5, 'x'),
          ("K4", "b", 10, 'x')]
columns = ['key', 'alphabet', 'd1', 'd0']
df = sqlCtx.createDataFrame(values, columns)

+---+--------+---+---+
|key|alphabet| d1| d0|
+---+--------+---+---+
| K1|       a|  5|  x|
| K2|       a|  5|  x|
| K3|       b|  5|  x|
| K4|       b| 10|  x|
+---+--------+---+---+

Ref: spark.apache.org
"A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0]."

Note: OneHotEncoder accepts numeric columns.

encoder = OneHotEncoder(inputCol="key", outputCol="key_vector", dropLast=True)
encoder = encoder.fit(df)
df = encoder.transform(df)

Error:
IllegalArgumentException: requirement failed: Column key must be of type numeric but was actually of type string.

Even though FeatureHasher is supposed to return an output that looks like OneHotEncoder's, it does not; its output is inconsistent. From the documentation: "Since a simple modulo is used to transform the hash function to a vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices."

for nf in [3, 4, 5]:
    df = df.drop('key_vector')
    encoder = FeatureHasher(numFeatures=nf, inputCols=["key"], outputCol="key_vector")
    # 'FeatureHasher' object has no attribute 'fit'
    df = encoder.transform(df)
    # SparseVector(int size, int[] indices, double[] values)
    temp = df.collect()
    for i in temp:
        print(i.key_vector, i.key_vector.toArray())

OUTPUT:
(3,[1],[1.0]) [0. 1. 0.]
(3,[0],[1.0]) [1. 0. 0.]
(3,[0],[1.0]) [1. 0. 0.]
(3,[1],[1.0]) [0. 1. 0.]
(4,[2],[1.0]) [0. 0. 1. 0.]
(4,[1],[1.0]) [0. 1. 0. 0.]
(4,[1],[1.0]) [0. 1. 0. 0.]
(4,[2],[1.0]) [0. 0. 1. 0.]
(5,[0],[1.0]) [1. 0. 0. 0. 0.]
(5,[3],[1.0]) [0. 0. 0. 1. 0.]
(5,[1],[1.0]) [0. 1. 0. 0. 0.]
(5,[4],[1.0]) [0. 0. 0. 0. 1.]

for nf in [2, 3, 4, 5]:
    df = df.drop('alphabet_vector')
    encoder = FeatureHasher(numFeatures=nf, inputCols=["alphabet"], outputCol="alphabet_vector")
    # encoder = encoder.fit(df)  # AttributeError: 'FeatureHasher' object has no attribute 'fit'
    df = encoder.transform(df)
    # SparseVector(int size, int[] indices, double[] values)
    temp = df.collect()
    for i in temp:
        print(i.alphabet_vector, ' ## ', i.alphabet_vector.toArray())

OUTPUT:
(2,[1],[1.0]) ## [0. 1.]
(2,[1],[1.0]) ## [0. 1.]
(2,[1],[1.0]) ## [0. 1.]
(2,[1],[1.0]) ## [0. 1.]
(3,[1],[1.0]) ## [0. 1. 0.]
(3,[1],[1.0]) ## [0. 1. 0.]
(3,[1],[1.0]) ## [0. 1. 0.]
(3,[1],[1.0]) ## [0. 1. 0.]
(4,[3],[1.0]) ## [0. 0. 0. 1.]
(4,[3],[1.0]) ## [0. 0. 0. 1.]
(4,[1],[1.0]) ## [0. 1. 0. 0.]
(4,[1],[1.0]) ## [0. 1. 0. 0.]
(5,[2],[1.0]) ## [0. 0. 1. 0. 0.]
(5,[2],[1.0]) ## [0. 0. 1. 0. 0.]
(5,[4],[1.0]) ## [0. 0. 0. 0. 1.]
(5,[4],[1.0]) ## [0. 0. 0. 0. 1.]

Fix: Converting String categories into One-hot encoded values using StringIndexer and OneHotEncoder

df = df.drop('alphabet_vector_1', 'alphabet_vector_2', 'indexedAlphabet')

alphabetIndexer = StringIndexer(inputCol="alphabet", outputCol="indexedAlphabet").fit(df)
df = alphabetIndexer.transform(df)

encoder = OneHotEncoder(inputCol="indexedAlphabet", outputCol="alphabet_vector_1", dropLast=True)
encoder = encoder.fit(df)
df = encoder.transform(df)
# SparseVector(int size, int[] indices, double[] values)
temp = df.collect()
for i in temp:
    print(i.alphabet_vector_1, " ## ", i.alphabet_vector_1.toArray())

encoder = OneHotEncoder(inputCol="indexedAlphabet", outputCol="alphabet_vector_2", dropLast=False)
encoder = encoder.fit(df)
df = encoder.transform(df)
temp = df.collect()
for i in temp:
    print(i.alphabet_vector_2, " ## ", i.alphabet_vector_2.toArray())

(1,[0],[1.0]) ## [1.]
(1,[0],[1.0]) ## [1.]
(1,[],[]) ## [0.]
(1,[],[]) ## [0.]
(2,[0],[1.0]) ## [1. 0.]
(2,[0],[1.0]) ## [1. 0.]
(2,[1],[1.0]) ## [0. 1.]
(2,[1],[1.0]) ## [0. 1.]

df = df.drop('key_vector', 'indexedKey')

keyIndexer = StringIndexer(inputCol="key", outputCol="indexedKey").fit(df)
df = keyIndexer.transform(df)

encoder = OneHotEncoder(inputCol="indexedKey", outputCol="key_vector", dropLast=False)
encoder = encoder.fit(df)
df = encoder.transform(df)
# SparseVector(int size, int[] indices, double[] values)
temp = df.collect()
for i in temp:
    print(i.key_vector, " ## ", i.key_vector.toArray())

Output:
(4,[0],[1.0]) ## [1. 0. 0. 0.]
(4,[1],[1.0]) ## [0. 1. 0. 0.]
(4,[2],[1.0]) ## [0. 0. 1. 0.]
(4,[3],[1.0]) ## [0. 0. 0. 1.]

How does it treat numeric columns?

df = df.drop('d1_vector_2')

encoder = OneHotEncoder(inputCol="d1", outputCol="d1_vector_2", dropLast=False)
encoder = encoder.fit(df)
df = encoder.transform(df)

Output:
(10,[5],[1.0]) [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
(10,[5],[1.0]) [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
(10,[5],[1.0]) [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
(10,[],[]) [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Using Pandas

import pandas as pd

values = [("K1", "a", 5, 'x'),
          ("K2", "a", 5, 'x'),
          ("K3", "b", 5, 'x'),
          ("K4", "b", 10, 'x')]
columns = ['key', 'alphabet', 'd1', 'd0']
df = pd.DataFrame(values, columns=columns)

pd.get_dummies(data=df, columns=['key', 'alphabet'])

Output:
   d1 d0  key_K1  key_K2  key_K3  key_K4  alphabet_a  alphabet_b
    5  x       1       0       0       0           1           0
    5  x       0       1       0       0           1           0
    5  x       0       0       1       0           0           1
   10  x       0       0       0       1           0           1

pd.get_dummies(data=df, columns=['key', 'alphabet'], drop_first=True)

Output:
   d1 d0  key_K2  key_K3  key_K4  alphabet_b
    5  x       0       0       0           0
    5  x       1       0       0           0
    5  x       0       1       0           1
   10  x       0       0       1           1

Using SciKit Learn

Ref: scikit-learn.org

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc = enc.fit(df[['key', 'alphabet']])
print("OneHotEncoder on key and alphabet. sparse = True, drop = none")
print(enc.transform(df[['key', 'alphabet']]))

enc = OneHotEncoder(sparse=False)
enc = enc.fit(df[['key', 'alphabet']])
print("OneHotEncoder on key and alphabet. sparse = False, drop = none")
print(enc.transform(df[['key', 'alphabet']]))

enc = OneHotEncoder(sparse=False, drop='first')
enc = enc.fit(df[['key', 'alphabet']])
print("OneHotEncoder on key and alphabet. sparse = False, drop = first")
print(enc.transform(df[['key', 'alphabet']]))

OneHotEncoder on key and alphabet. sparse = True, drop = none
  (0, 0)  1.0
  (0, 4)  1.0
  (1, 1)  1.0
  (1, 4)  1.0
  (2, 2)  1.0
  (2, 5)  1.0
  (3, 3)  1.0
  (3, 5)  1.0
OneHotEncoder on key and alphabet. sparse = False, drop = none
[[1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]
 [0. 0. 0. 1. 0. 1.]]
OneHotEncoder on key and alphabet. sparse = False, drop = first
[[0. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 1.]
 [0. 0. 1. 1.]]

Scikit-learn's LabelBinarizer
Ref: scikit-learn.org

from sklearn import preprocessing

lb = preprocessing.LabelBinarizer(sparse_output=False)
lb.fit(df[['key']])
lb.transform(df[['key']])

array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])

Using Package: category_encoders

To install this package with conda, run one of the following:

conda install -c conda-forge category_encoders
conda install -c conda-forge/label/gcc7 category_encoders
conda install -c conda-forge/label/cf201901 category_encoders
conda install -c conda-forge/label/cf202003 category_encoders

Ref: anaconda.org

from category_encoders import OneHotEncoder

cat_features = ['key', 'alphabet']
enc = OneHotEncoder(cols=cat_features)
enc.fit(df)

OneHotEncoder(cols=['key', 'alphabet'], drop_invariant=False, handle_missing='value',
              handle_unknown='value', return_df=True, use_cat_names=False, verbose=0)

enc.transform(df)

Output:
   key_1  key_2  key_3  key_4  alphabet_1  alphabet_2  d1 d0
       1      0      0      0           1           0   5  x
       0      1      0      0           1           0   5  x
       0      0      1      0           0           1   5  x
       0      0      0      1           0           1  10  x
Source: https://survival8.blogspot.com/2020/07/one-hot-encoding-from-pyspark-pandas.html


OneHotEncoder

clear(param): Clears a param from the param map if it has been explicitly set.

copy([extra]): Creates a copy of this instance with the same uid and some extra params.

explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams(): Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap([extra]): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params]): Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps): Fits a model to the input dataset for each param map in paramMaps.

getDropLast(): Gets the value of dropLast or its default value.

getHandleInvalid(): Gets the value of handleInvalid or its default value.

getInputCol(): Gets the value of inputCol or its default value.

getInputCols(): Gets the value of inputCols or its default value.

getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value.

getOutputCol(): Gets the value of outputCol or its default value.

getOutputCols(): Gets the value of outputCols or its default value.

getParam(paramName): Gets a param by its name.

hasDefault(param): Checks whether a param has a default value.

hasParam(paramName): Tests whether this instance contains a param with a given (string) name.

isDefined(param): Checks whether a param is explicitly set by user or has a default value.

isSet(param): Checks whether a param is explicitly set by user.

load(path): Reads an ML instance from the input path, a shortcut of read().load(path).

read(): Returns an MLReader instance for this class.

save(path): Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value): Sets a parameter in the embedded param map.

setDropLast(value): Sets the value of dropLast.

setHandleInvalid(value): Sets the value of handleInvalid.

setInputCol(value): Sets the value of inputCol.

setInputCols(value): Sets the value of inputCols.

setOutputCol(value): Sets the value of outputCol.

setOutputCols(value): Sets the value of outputCols.

setParams(self, \*[, inputCols, outputCols, …]): Sets params for this OneHotEncoder.

write(): Returns an MLWriter instance for this ML instance.
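
A minimal usage sketch of this estimator (assuming an existing SparkSession named spark):

from pyspark.ml.feature import OneHotEncoder

# A column of numeric category indices, e.g. produced earlier by StringIndexer
df = spark.createDataFrame([(0.0,), (1.0,), (2.0,), (1.0,)], ["categoryIndex"])

encoder = OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"], dropLast=True)
model = encoder.fit(df)
model.transform(df).show(truncate=False)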

Source: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.OneHotEncoder.html