Apache Spark


We use Spark, an open-source framework well suited to machine learning algorithms: it supports applications with working sets while providing scalability and fault-tolerance properties similar to MapReduce.

Spark's great strength is its ability to keep RDDs in RAM: the time saved is considerable for algorithms that iterate over the same data set, as the sketch below illustrates.
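
As a minimal illustration (not code from any of the projects below), the following sketch runs a simple gradient descent in which cache() keeps the parsed data set in memory across iterations; the input path, data format, learning rate and iteration count are placeholders chosen for the example.

 import org.apache.spark.{SparkConf, SparkContext}

 object CachingSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(
       new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))

     // Hypothetical input: one point per line, label followed by comma-separated features.
     val data = sc.textFile("data/points.csv")
       .map { line =>
         val cols = line.split(',').map(_.toDouble)
         (cols.head, cols.tail)            // (label, features)
       }
       .cache()                            // keep the parsed data set in RAM

     val dim = data.first()._2.length
     var w = Array.fill(dim)(0.0)
     val lr = 0.01                         // illustrative learning rate

     // Every pass reuses the cached RDD instead of re-reading and re-parsing the file.
     for (_ <- 1 to 20) {
       val grad = data.map { case (y, x) =>
         val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
         x.map(_ * err)
       }.reduce((a, b) => a.zip(b).map { case (ga, gb) => ga + gb })
       w = w.zip(grad).map { case (wi, gi) => wi - lr * gi }
     }

     println("weights: " + w.mkString(", "))
     sc.stop()
   }
 }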

Use case 1 : Spark implementation of Nearest Neighbours Mean Shift using LSH (with Gael Beck)

We use Spark to implement a well-known clustering algorithm, mean shift. Results are encouraging: tripling the number of nodes in the cluster roughly halves the execution time. A simplified sketch of one mean-shift iteration is given after the repository link below.

https://github.com/Kybe67/Mean-Shift-LSH
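
For illustration, here is a heavily simplified sketch of one mean-shift move over an RDD of points. It replaces the project's LSH-based nearest-neighbour search with a brute-force cartesian neighbourhood, and the input path, bandwidth radius and number of iterations are placeholder values rather than settings from the repository.

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.rdd.RDD

 object MeanShiftSketch {
   type Point = Array[Double]

   def dist(a: Point, b: Point): Double =
     math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

   // One mean-shift move: each point is replaced by the mean of its neighbours
   // within `radius`. Brute-force cartesian search here; the actual project
   // restricts the neighbour search to LSH buckets instead.
   def shiftOnce(points: RDD[Point], radius: Double): RDD[Point] =
     points.cartesian(points)
       .filter { case (p, q) => dist(p, q) <= radius }
       .map { case (p, q) => (p.toVector, (q, 1)) }
       .reduceByKey { case ((s1, n1), (s2, n2)) =>
         (s1.zip(s2).map { case (x, y) => x + y }, n1 + n2)
       }
       .map { case (_, (sum, n)) => sum.map(_ / n) }

   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(
       new SparkConf().setAppName("mean-shift-sketch").setMaster("local[*]"))

     // Hypothetical input: one point per line, comma-separated coordinates.
     var pts: RDD[Point] = sc.textFile("data/points.csv")
       .map(_.split(',').map(_.toDouble))
       .cache()

     for (_ <- 1 to 5)                     // illustrative number of iterations
       pts = shiftOnce(pts, radius = 1.0).cache()

     pts.take(10).foreach(p => println(p.mkString(",")))
     sc.stop()
   }
 }

The cartesian step is quadratic in the number of points; the LSH hashing used in the repository exists precisely to limit each point's candidate neighbours to its bucket.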


Use case 2 : SOM-MapReduce / Spark (Self-Organizing Map using MapReduce, with Tugdual Sarazin)

Self-organizing maps are increasingly used as visualization tools, as they project data onto small, generally two-dimensional, spaces. We have designed two scalable implementations of the SOM-MapReduce algorithm; a simplified sketch of one batch training epoch is given after the repository link below.

https://github.com/TugdualSarazin/spark-clustering
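
As an illustration of the map/reduce decomposition, the sketch below performs one batch-SOM epoch in Spark: the map step assigns each point to its best matching unit, and the reduce step accumulates neighbourhood-weighted sums used to update the prototypes. The grid size, sigma schedule and input path are assumptions made for the example and are not taken from the spark-clustering repository.

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.rdd.RDD

 object BatchSomSketch {
   type Vec = Array[Double]

   def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }

   def dist2(a: Vec, b: Vec): Double =
     a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

   // Gaussian neighbourhood between map units j and k on a grid of the given width.
   def neighbourhood(j: Int, k: Int, width: Int, sigma: Double): Double = {
     val d2 = math.pow(j % width - k % width, 2) + math.pow(j / width - k / width, 2)
     math.exp(-d2 / (2 * sigma * sigma))
   }

   // One batch-SOM epoch as map + reduce:
   //   map    - each point finds its best matching unit (BMU) and emits its
   //            neighbourhood-weighted contribution to every unit,
   //   reduce - contributions are summed, then the driver recomputes prototypes.
   def trainEpoch(data: RDD[Vec], protos: Array[Vec], width: Int, sigma: Double): Array[Vec] = {
     val nUnits = protos.length
     val bc = data.sparkContext.broadcast(protos)

     val (sums, weights) = data.map { x =>
       val bmu = bc.value.indices.minBy(k => dist2(x, bc.value(k)))
       val w = Array.tabulate(nUnits)(k => neighbourhood(bmu, k, width, sigma))
       (Array.tabulate(nUnits)(k => x.map(_ * w(k))), w)
     }.reduce { case ((s1, w1), (s2, w2)) =>
       (s1.zip(s2).map { case (a, b) => add(a, b) }, add(w1, w2))
     }

     Array.tabulate(nUnits)(k =>
       if (weights(k) > 0) sums(k).map(_ / weights(k)) else protos(k))
   }

   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(
       new SparkConf().setAppName("som-sketch").setMaster("local[*]"))
     val data = sc.textFile("data/points.csv")        // hypothetical input path
       .map(_.split(',').map(_.toDouble)).cache()

     val width = 5; val height = 5                    // illustrative 5 x 5 map
     var protos = data.takeSample(withReplacement = false, width * height)
     for (epoch <- 1 to 10)                           // shrink the neighbourhood over epochs
       protos = trainEpoch(data, protos, width, sigma = math.max(0.5, 3.0 - 0.3 * epoch))
     sc.stop()
   }
 }

Broadcasting the current prototypes keeps them read-only on the workers, and a single reduce brings the aggregated sums back to the driver, which recomputes the prototypes for the next epoch.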


Use case 3 : BITM-MR / Spark (Biclustering using Self-Organizing Map and MapReduce, with Tugdual Sarazin)

https://github.com/TugdualSarazin/spark-clustering


Use case 4 : WADA, Web Application for Data Analysis (with students from Villetaneuse IUT)

https://github.com/CamilleGR/Wada

Use case 5 : G-Stream: Growing Neural Gas for Clustering Data Streams using Spark Streaming (With Mohammed Ghesmoune)

   to appear