SparkR

R is a programming language and statistical environment widely used in data science and statistical analysis. It offers a wide range of advanced statistical features and specialized packages for complex analyses, and is particularly well suited to statistical modelling, data mining and visualization. R also provides a rich ecosystem of machine learning algorithms, and its graphical capabilities enable high-quality visualizations such as scatter plots and histograms, facilitating effective data exploration and presentation.

SparkR is an R package that provides an interface to Apache Spark, allowing users to leverage Spark's distributed computing capabilities from the R programming language.

Requirements

SparkR is not currently integrated into the cluster. If you wish to use it, you must first install it with the script provided in the documentation.

Note: if you build your own Spark distribution, you need to activate the R profile and use an R version higher than 3.5.

This tutorial assumes you are running a cluster based on TDP getting started, an easy-to-launch TDP environment for testing purposes. This deployment provides you with:

  • tdp_user, a user with the ability to kinit for authentication.
  • An edge node accessible by SSH.
  • HDFS directories:
    • /user/tdp_user

Note: When using a TDP deployment other than tdp-getting-started, some commands need to be adapted to your deployment.

Before beginning the tutorial, connect to the cluster and authenticate yourself with kinit using the following commands:

# Connect to edge-01.tdp 
vagrant ssh edge-01
# Switch user to tdp_user
sudo su tdp_user
# Authenticate the user with the Kerberos keytab and principal
kinit -kt ~/tdp_user.keytab tdp_user@REALM.TDP

Tutorial

SparkR provides a DataFrame API that supports operations such as SQL queries, DataFrame transformations and statistical functions. A SparkDataFrame is the equivalent of a table in a relational database. It can be built from various sources: structured data files, Hive tables and external databases.
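
For example, besides reading files, a SparkDataFrame can be created from a local R data.frame with as.DataFrame, or from a Hive table with sql or tableToDF. A minimal sketch; the table name some_hive_table is only an illustration and must exist in your Hive metastore:

# Create a SparkDataFrame from a local R data.frame (faithful is a built-in R dataset)
local_df <- as.DataFrame(faithful)

# Create a SparkDataFrame from a Hive table (hypothetical table name)
hive_df <- sql("SELECT * FROM some_hive_table")
# Or equivalently
hive_df <- tableToDF("some_hive_table")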

Most Spark connectors directly provide a DataFrame API, without an RDD equivalent. It is nevertheless still possible to obtain an RDD from a DataFrame:

converted.rdd <- SparkR:::toRDD(dataframe)

In the following example, the data is loaded from a CSV file into a DataFrame.

First, create a data.csv file with the following content:

id,author,genre,quantity
1,hunter.fields,romance,15
2,leonard.lewis,thriller,81
3,jason.dawson,thriller,90
4,andre.grant,thriller,25
5,earl.walton,romance,40
6,alan.hanson,romance,24
7,clyde.matthews,thriller,31
8,josephine.leonard,thriller,1
9,owen.boone,sci-fi,27
10,max.mcBride,romance,75

# Copy data.csv to HDFS
hdfs dfs -put /home/tdp_user/data.csv /user/tdp_user

You can use SparkR from the sparkr3-shell:

# Launch the sparkr3-shell
sparkr3-shell

Now that you are in the SparkR shell, you can load the data into a DataFrame:

# Create a DataFrame
df <- read.df("/user/tdp_user/data.csv", source = "csv", header = "true", inferSchema = "true")

Here are some examples of commands to display the data:

# Display the first row
first(df)

# Display the first N rows
head(df, N)

# Display all the data
collect(df)

# Examine the structure of the DataFrame
str(df)
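
To also check the schema inferred by Spark and the number of rows, the printSchema and count functions can be used (a small addition to the commands above):

# Print the Spark schema inferred from the CSV header
printSchema(df)

# Count the number of rows
count(df)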

Spark DataFrame operations:

# Selection
head(select(df, df$genre))

# Count the number of items by genre
head(summarize(groupBy(df, df$genre), count = n(df$genre)))

# Convert the 'quantity' column to an integer
df2 <- withColumn(df, "quantity", cast(df$quantity, "integer"))

# Find the minimum, maximum and average of 'quantity'
result <- agg(df2, min(df2$quantity), max(df2$quantity), avg(df2$quantity))

# Show the result
showDF(result)

To obtain data that meets certain conditions, use filters. There are two equivalent ways of filtering data:

filter(df, condition)
# Or
where(df, condition)

Here is an example of applying them:

# Filter when the quantity is greater than 80
filtered_df <- filter(df, df$quantity > 80)
# Or
filtered_df <- where(df, df$quantity > 80)

showDF(filtered_df)
## id  author         genre     quantity
## 2   leonard.lewis  thriller  81
## 3   jason.dawson   thriller  90

With SparkR, SQL queries can be executed by registering the SparkDataFrame as a temporary view:

# Create the temporary view
createOrReplaceTempView(df, "df")

# Running the SQL query
result_sql <- sql("SELECT * FROM df WHERE quantity >= 80")

showDF(result_sql)
## id  author         genre     quantity
## 2   leonard.lewis  thriller  81
## 3   jason.dawson   thriller  90

Machine learning

K-means

SparkR gives access to MLlib machine learning models such as K-means, Naive Bayes, SVM and LDA. Here, we focus on testing a clustering algorithm on a given dataset. Clustering is an unsupervised machine learning technique that groups similar data points together based on their characteristics.
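
These models are exposed as spark.* functions that take a SparkDataFrame and, for most of them, an R formula. A minimal sketch, assuming a SparkDataFrame named training with a label column and feature columns (not part of this tutorial):

# Naive Bayes classifier on a hypothetical 'training' SparkDataFrame
nb_model <- spark.naiveBayes(training, label ~ .)

# Linear SVM classifier with the same formula interface
svm_model <- spark.svmLinear(training, label ~ ., maxIter = 10)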

Create a second data.csv file with the following content and copy it to HDFS under /user/tdp_user/dataspark/, the path used below:

branch_size,trunk_perimeter,tree_type
5.1,4.9,Oak
4.9,4.2,Oak
4.7,4.7,Oak
5.0,4.6,Oak
5.2,5.0,Oak
4.9,4.5,Oak
5.4,5.1,Oak
4.8,4.9,Oak
5.8,5.1,Oak
5.0,4.7,Oak
6.4,6.1,Pine
5.8,5.2,Pine
6.3,5.0,Pine
5.9,5.1,Pine
6.5,5.9,Pine
6.2,5.4,Pine
6.4,5.5,Pine
6.7,5.7,Pine
6.3,5.6,Pine
6.2,5.2,Pine
7.9,6.4,Maple
7.7,6.3,Maple
7.7,6.1,Maple
7.2,5.9,Maple
7.1,5.8,Maple
7.6,6.6,Maple
7.3,6.3,Maple
7.2,6.1,Maple
7.4,6.4,Maple
7.9,6.3,Maple

Now go back to the R shell:

# Load SparkR library
library(SparkR)

# Initialize SparkR session
sparkR.session()

# Load the CSV file
data <- read.df("/user/tdp_user/dataspark/data.csv", source = "csv", header = "true", inferSchema = "true")

# Split the data into training and validation datasets
splits <- randomSplit(data, c(0.7, 0.3), seed = 42)
trainData <- splits[[1]]
validationData <- splits[[2]]

# Apply k-means algorithm on the training data
model <- spark.kmeans(trainData, ~ branch_size + trunk_perimeter, k = 3, maxIter = 20, initMode = "k-means||", seed = 42)

# Predict on the validation data
predictions <- predict(model, validationData)

# Show the results
showDF(predictions)
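
# Optionally, inspect the fitted model: summary() returns the cluster centers and sizes
summary(model)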

# Stop the SparkR session
sparkR.stop()