Apache Spark


We use Spark, an open-source framework that is well suited to machine learning algorithms: it supports applications that reuse a working set of data while providing scalability and fault tolerance properties similar to those of MapReduce. Spark's main strength is its ability to keep RDDs (Resilient Distributed Datasets) in RAM, which saves considerable time for algorithms that iterate over the same data set.

The word "code" (https://github.com/lebbah) refers to the SOM clustering, BiTM (Biclustering Topological Map) and Mean-shift clustering codes.
- The word AUTHORS stands for LIPN (Laboratoire d'Informatique de Paris-Nord, UMR 7030).
- The code may be reproduced for personal use only. In particular, any copy addressed to a third party, even free of charge, would be fraudulent.
- If modifications are made to the code, these modifications will be communicated to the AUTHORS and I commit myself to transfer their ownership to the AUTHORS.
- Any publication about work carried out using the code must mention it. (publication).

The code is a research product and is provided without any expressed or implied warranty. There is no warranty of any kind concerning the fitness of this software for any particular purpose. Because the code is licensed free of charge, there is no warranty for the code.

The code is provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the fitness for a particular purpose. The entire risk as to the quality and performance of the code is with you. Should the code prove defective, you assume the cost of all necessary servicing, repair or correction.


Use case 1: Spark implementation of Nearest Neighbours Mean Shift using LSH (with Gael Beck)

We use Spark to implement a well-known clustering algorithm, mean shift. The results are encouraging: they show that tripling the number of nodes in the cluster halves the execution time.
https://github.com/Kybe67/Mean-Shift-LSH
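The idea named in this use case, moving each point toward the mean of its nearest neighbours while restricting the neighbour search to the point's LSH bucket, can be sketched in plain Python. This is a minimal single-machine illustration and not the code from the repository above: the function names, the single random-projection hash, and the parameters `k`, `width` and `direction` are all assumptions made for the example. In a Spark version, the bucket identifier would presumably serve as the key used to group points across nodes.

```python
import math
import random


def lsh_bucket(point, direction, width):
    """Hash a point to an integer bucket by projecting it on one direction.

    Hypothetical hash for illustration: points whose projections fall in the
    same interval of length `width` share a bucket.
    """
    projection = sum(p * d for p, d in zip(point, direction))
    return math.floor(projection / width)


def knn_mean(point, candidates, k):
    """Return the mean of the k candidates closest to `point` (Euclidean)."""
    nearest = sorted(candidates, key=lambda c: math.dist(point, c))[:k]
    dim = len(point)
    return tuple(sum(c[i] for c in nearest) / len(nearest) for i in range(dim))


def mean_shift_lsh(points, k=3, width=5.0, iterations=5, direction=None, seed=0):
    """Nearest-neighbours mean shift with the search limited to LSH buckets."""
    dim = len(points[0])
    if direction is None:
        # Assumption: a single random unit vector is used as the hash direction.
        rng = random.Random(seed)
        raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(r * r for r in raw))
        direction = [r / norm for r in raw]

    current = [tuple(p) for p in points]
    for _ in range(iterations):
        # Group the points by bucket; this is the step a cluster would distribute.
        buckets = {}
        for p in current:
            buckets.setdefault(lsh_bucket(p, direction, width), []).append(p)
        # Shift every point to the mean of its k nearest neighbours in its bucket.
        current = [
            knn_mean(p, buckets[lsh_bucket(p, direction, width)], k)
            for p in current
        ]
    return current
```

Called on two well-separated groups of points with `direction=(1.0, 0.0)` and `width=10.0`, each group falls into its own bucket and its points collapse onto the group mean. The bucket width trades accuracy for speed: narrower buckets mean fewer distance computations but a higher chance of missing true neighbours that land in an adjacent bucket.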


Use case 2: SOM-MapReduce/Spark (Self-Organizing Map using MapReduce, with Tugdual Sarazin)

Self-organizing maps are increasingly used as visualization tools, as they allow data to be projected onto low-dimensional spaces, generally two-dimensional. We have designed two scalable implementations of the SOM-MapReduce algorithm.
https://github.com/TugdualSarazin/spark-clustering
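One common way to express a batch self-organizing map update as a map step followed by a reduce step is sketched below in plain Python. Whether the repository above follows exactly this formulation is an assumption; the function names, the Gaussian neighbourhood and the `radius` parameter are chosen for the example only. The map step finds the best matching unit of each observation and emits, for every neuron, the observation weighted by the neighbourhood value; the reduce step sums these contributions per neuron, and each prototype becomes the weighted mean.

```python
import math


def best_matching_unit(x, prototypes):
    """Index of the prototype closest to observation x."""
    return min(range(len(prototypes)), key=lambda j: math.dist(x, prototypes[j]))


def neighbourhood(grid_a, grid_b, radius):
    """Gaussian neighbourhood between two neuron positions on the map grid."""
    d2 = sum((a - b) ** 2 for a, b in zip(grid_a, grid_b))
    return math.exp(-d2 / (2.0 * radius ** 2))


def map_point(x, prototypes, grid, radius):
    """Map step: emit (neuron index, (weighted observation, weight)) pairs."""
    bmu = best_matching_unit(x, prototypes)
    for j in range(len(prototypes)):
        h = neighbourhood(grid[j], grid[bmu], radius)
        yield j, ([h * xi for xi in x], h)


def reduce_pairs(pairs):
    """Reduce step: sum the weighted observations and weights per neuron."""
    sums = {}
    for j, (weighted, h) in pairs:
        if j in sums:
            acc, total = sums[j]
            sums[j] = ([a + w for a, w in zip(acc, weighted)], total + h)
        else:
            sums[j] = (weighted, h)
    return sums


def som_batch_step(data, prototypes, grid, radius):
    """One batch SOM iteration: new prototype = weighted sum / total weight."""
    pairs = (pair for x in data for pair in map_point(x, prototypes, grid, radius))
    sums = reduce_pairs(pairs)
    return [
        tuple(v / sums[j][1] for v in sums[j][0])
        for j in range(len(prototypes))
    ]
```

Because the reduce step is a plain per-key sum, it is associative and commutative, which is what makes this formulation a natural fit for a distributed key-based aggregation. With a very small `radius`, the update degenerates to a k-means step, since each neuron is then influenced almost exclusively by the observations it wins.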


Use case 3: BITM-MR/Spark (Biclustering using Self-Organizing Map and MapReduce, with Tugdual Sarazin)

https://github.com/TugdualSarazin/spark-clustering


Use case 4: WADA, Web Application for Data Analysis (with students from Villetaneuse IUT)

https://github.com/CamilleGR/Wada

Use case 5: G-Stream: Growing Neural Gas for Clustering Data Streams using Spark Streaming (with Mohammed Ghesmoune)

https://github.com/mghesmoune/spark-streaming-clustering