Wednesday, October 12, 2016

Visualisation using d3.js based Sunburst with Apache Zeppelin

Zeppelin provides few default visual components (pie, bar, stacked, area, line chart, etc).
If users want either they can add a new default component or create visualisation using AngularJS interpreter.

I tried to create d3 based Sunburst for preparing a report in Apache Zeppelin.

It is easy and quick.

Apache Zeppelin display system adds additional div(s) and which creates some blank area on the screen. 

You can experience this as there is a blank area between sunburst and breadcrumbs in the bottom.


















********






Notebook for the above visualisation is available here which can be imported into Apache Zeppelin.
This contains the AngularJS source for sunburst visual.


Content which I downloaded from NSE historical data section and transformed it for demo purpose i.e. nsecombinedreport.csv can be downloaded from here.
This report is for the various Instrument Type, Security and the amount traded for a day.

Sunday, May 15, 2016

Focusing on implementing govt policies using the big data tool zeppelin

It was good to know from the goverment that it published lots of data collected over the period of time at https://data.gov.in/

I picked and amenities data about the villages from https://data.gov.in/catalog/village-amenities-census-2011 to do some analysis.

I believe govterment is doing sufficient analysis to find where and with what force it should use its machinery to promote its schemes.

I have been doing some analysis using the Apache Spark and eco system around it. But was interested in a quick visualization, which would help to understand the data quickly. A possible use would be using R as I wanted to build the reports quickly. I explored some of the capabilities of R and Shiny App in my earlier post of Custer Analysis of banking data.

Recently I came to know about a fantastic tool, its a web based notebook, with the in-built support for Apache-Spark, with a support of multiple langues like Scala, Python, spark sql and so on and most important that this it is opensource.
"Zeppelin" 
I picked one of the csv from the the whole data, and which is for one of the district in Karnataka state is Gulbarga and started doing some analysis.

Loading the data into the dataframe/table


It is easy to accomodate spark sql also in the notebook paragraph/sections.
Following is a very simple query to show the population spread in the villages of Gulbarga district.



Goverment make policies and spend money on that, and find the effectiveness of it based on the result. We can use the collected data to understand where should be the maximum penetration of the schemes, i.e. find the villages which needs the goverment schemes most. One of the example where goverment can initiates its policies to reduce the gap of male-female ratio, we can understand from the data available, where should be the more focus.





Changed the minbenchmark to 80% and same got updated on the fly



I stated to analyse this data to check for the education facilities in the villages which is in progress, would be publishing that information in later posts.

Installation details:
a) Zeppelin was deployed on Ubuntu VirtualBox with Windows as host.
b) Set your java home (1.7) before starting Zeppelin.
c) To start execute 'zeppelin-daemon.sh start' in the ZEPPELIN_HOME\bin




Monday, September 15, 2014

Cluster Analysis using R with banking customer balance distribution

Once a wise man (Yachine Phuneli) taught me while teaching me about java, semiconductor and wafer fabrication that how to teach/explain.

First tell to your audience/readers what are going to to tell.
Than tell the thing which you planned to tell.
And than tell what you just told.

I am going to describe how to do Cluster Analysis using R

Cluster Analysis
Cluster analysis has a vital role in numerous fields we are going to see it in the banking business to segment customers into small groups that can later be targeted for future marketing activities.
In machine learning and data mining it is used to efficiently find nearest neighbours and in summarization.

Cluster analysis aims to group data objects based on the information that is available that describes the objects and their relationships. The main goal is to group similar objects together, and the greater the similarity within a group the better and the greater the difference between group the more diverse the clustering.  A clustering is an entire collection of clusters; a cluster on the other hand is just one part of the entire picture. There are different types of clusters and also different types of clustering.

Types of Clustering Algorithms

1.     Partitioning-based clustering: are algorithms that determine all the clusters at once in most cases.
o    K-means clustering
o    K-medoids clustering
o    EM (expectation maximization) clustering
2.     Hierarchical clustering: these algorithms find successive clusters using previously established ones.
o    Divisive clustering is a top down approach.
o    Agglomerative clustering is a bottom up approach.
With the help of data mining methods, such as clustering algorithm, it is possible to discover the key characteristics from the bank’s data and possibly use those characteristics for future prediction also.
According to the facts established based on the information released by banks, attracting new customer costs five to six times more than customer retention. Retaining existing customers is the best core marketing strategy to become profitable in the very competitive banking industry. In order to maximise the profit, how to retain the existing customers has become a subject to be urgently solving for banks.

Use clustering to produce an initial working hypothesis, refine this hypothesis, then use prediction to generalize the refined hypothesis to data and evaluate how well it performs.




For this study following was the Bank customer data for the Balance distribution.


Before doing the cluster analysis above data was transformed based on the type or range of values. This would help us to do the scaling and find number of cluster quickly. We can keep the data as it is but I opted to find clustering based on the simplified data.
Balance conversion based on range:
=IF(AND(Sheet2!I2>=0,Sheet2!I2<=1500),"1",IF(AND(Sheet2!I2>1500,Sheet2!I2<=3000),"2",IF(AND(Sheet2!I2>3000,Sheet2!I2<6000 -=""> If Balance between 0 to 1500 : 1, If Balance between 1501 to 3000:2, If Balance between 3000 to 6000:3, Greater than 6000 :4
Age conversion based on the range:
=IF(AND(Sheet2!E2>=20,Sheet2!E2<=35),"1",IF(AND(Sheet2!E2>35,Sheet2!E2<=45),"2",IF(AND(Sheet2!E2>45,Sheet2!E2<=60),"3","4")))
Customer type conversion based on type::
=IF(Sheet2!C6="Enterprise","1","2")
Product conversion ::
=IF(Sheet2!C6="SavingAccount","1","2")  -> 1 if SavingAccount, 2 if CurrentAccount

So we got the data transformed in the following way:






    Find No Of Cluster

1) Load csv data from file:



 2)  Once data is loaded, per form the scaling of data by executing 
Ø  bDataScale <- ankcustomerdata="" o:p="" scale="">


 3)     Loading NBClust library and find the number of clusters for the data available in bdata,
Ø  library(NbClust)
Ø  nc <- bdatascale="" max.nc="15," method="kmeans" min.nc="2," nbclust="" o:p="">
Above command  takes a while to calculate.

Following is the output of NbClust function

The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.

The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.

All 1000 observations were used.

*******************************************************************
* Among all indices:                                               
* 4 proposed 2 as the best number of clusters
* 9 proposed 3 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 2 proposed 9 as the best number of clusters
* 2 proposed 11 as the best number of clusters
* 2 proposed 12 as the best number of clusters
* 2 proposed 15 as the best number of clusters

                   ***** Conclusion *****                           
 * According to the majority rule, the best number of clusters is  3 



4)      Plot the chart with the number of cluster we have obtained.




     Calculate K-means cluster
   Now to find the cluster based on Quarterly average balance, bdata loaded from .csv file is






Here I want to consider only AvgBalQ1, AvgBalQ2, AvgBalQ3, AvgBalQ4 for kmeans cluster.
Ø   test <-bdata c="" o:p="">
Scale data
Ø  scaledata <-scale o:p="" test="">
Set seet so that every time we calculate the kmeans it would be consistent
Ø   set.seed(1234)
Find the number of cluster/center as 3,
Ø   km <- centers="3," iter.max="500)<o:p" kmeans="" nstart="10," scaledata="">
Check the cluster size data, 401 data in cluster 1, 310 is cluster 2 and 289 in cluster 3.
Ø   km$size
[1] 401    310     289


Let’s have a look of the function kmeans

kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

Input to kmeans function
x: A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
centers: Either the number of clusters or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.
iter.max: The maximum number of iterations allowed.
nstart: If centers is a number, nstart gives the number of random sets that should be chosen.
algorithm: The algorithm to be used. It should be one of "Hartigan-Wong", "Lloyd", "Forgy" or "MacQueen". If no algorithm is specified, the algorithm of Hartigan and Wong is used by default.

Result returned from kmeans function call
cluster: A vector of integers indicating the cluster to which each point is allocated.
centers: A matrix of cluster centers.
whithnss: The within-cluster sum of squares for each cluster.
size: The number of points in each cluster.

(Fig: Quarter wise Product data distribution)
Now if bank wants take some necessary actions on the data by grouping the customer base and start a product or campaign for them, For e.g.
a)     If we see the above data distribution where bank wants to suggest the high net worth/higher account balance customer from the saving account to opt for other product, or
b)     If bank wants to offer those customers or a group of customers (cluster) additional services without charges.
Partitioning of data and making cluster/group would be helpful to take necessary actions for every cluster. It would be easy to group customers in cluster and then plan business activities on respective clusters accordingly.
K-means clustering is the most popular partitioning method. It requires the analyst to specify the number of clusters to extract. 
 So if we group data in 3 clusters (see the next chapter to see how to do we get 3 as a number of cluster)


(Fig: Quarterly Average Balance with 3 cluster data distribution)
If we see these four graphs for every quarter, Cluster 3 (in blue color) has some customer whose balance is greater than customer in other clusters. We can mark these clusters back to the original data and find those specific customers from the cluster to take appropriate action. 
Kmeans clustering with 3 clusters of sizes 401, 310, 289
Cluster means:
           AvgBalQ1 AvgBalQ2 AvgBalQ3 AvgBalQ4
Cluster 1     4019     2189     5059     5497
Cluster 2     2808     6413     3630     4044
Cluster 3     5611     6527     7222     6638
Percentage of within cluster variance accounted for by each cluster:
Cluster 1: 42.22%,  Cluster 2: 26.07%  Cluster 3: 31.72%

Note:
For every measure or non-factor or numeric value based parameter we choose these graph would be keep                                                              changing. Because their cluster means would be changing and so their variance.
For e.g. If we try to obtain data with 3 cluster but by including age also in the data, than our cluster size, mean and cluster data percentage changes.
Kmeans clustering with 3 clusters of sizes 353, 366, 281
Cluster means:
            Age    AvgBalQ1   AvgBalQ2   AvgBalQ3 AvgBalQ4
Cluster 1                31.5      3658       3858       5972        4869
Cluster 2                59.8      3366        4640     2972        4882
Cluster 3                57.1      5624        6023     7279        6659
Percentage of within cluster variance accounted for by each cluster:
Cluster 1: 34.21%, Cluster 2: 34.84%, Cluster 3: 30.95%


Now lets do some more analysis with the with the distribution of the data.

Data Interpretation: Understand the data based on the different combination of factors/dimensions and measures.
Factors/Dimensions: A dimension is a broad grouping of related data about a major aspect of your business. For example, you have a dimension called Products.
Measure: A measure is a performance indicator that is quantifiable and used to determine how well a business is operating. For example, useful measures may be Average yearly balance.
In the following figures, I tried to figure out only two interpretation (there could be more) and two possible actions which can be taken.
Note: I not sure which city exists in which state in US. I have just generated cities and states using talend (how to generate data using talend, check in the previous post) for the purpose of this study, as these cities are states were available in RowGenerator for generating records and moved some of them to make data look better.


(Fig: Product wise Gender data distribution)

If we try to analyse the above graph, it suggest that there are few female customer for a bank who has current account as product. Which are interesting and a bank can interrupt it in two ways:
I.        Interpretation:
a.     Either these customers are unware about which product they should have and they are paying unnecessary charges for CA, rather than going for SA.
b.     Or, they are the some entrepreneur or running small business.
II.        Actions:
a.     Customer can be notified by the bank executive about the possible change in their product from CA to SA and win the loyalty of customer by showing them bank care for their customer’s money.
b.     Marketing team target these customers with their specific needs or keep these customer’s in mind for the future product which would meet for these specific female customer who has CurrentAccount and are entrepreneur.



(Fig: Age wise Gender data distribution)
I.        Data Interpretation:
a.     Male customers are more rather than female or enterprise customers.
b.     Enterprise customer’s mainly lying in the range of age from 38-67.
II.        Possible Actions:
a.     Either products for Enterprises and Female customer are not very effective, so few new product which is more suitable to them can be launched with some campaign, etc.
b.     Start-up(Enterprise) does not have any enough accounts with the bank as the most of the customers are in the range of 40-70 and few of them are from 20-40. If an organization/startup can open account with the bank, there would be chances to get more salary accounts also from the same enterprise customer.

(Fig: City wise Gender data distribution)
I.        Data Interpretation
a.     Albany city has more enterprise customers than all other cities.
b.     Atlanta city has more personal customer (male + female)
II.        Possible Actions:
a.     Any campaigns which are targeted for personal customers should include cities like Atlanta.
b.     Any campaigns which are targeted for enterprise customer should include cities like Albany.
(Fig: City wise Gender and State wise Gender data distribution)


(Fig: Per Quarter Average Balance per Gender)
I.        Data Interpretation
a.     Q3 and Q4 average balance increases in comparison to Q1 and Q2. See the movement of boxplots not only the dots, in all quarters.
b.     Q1 and Q2 Enterprise customer are more towards lesser balance in comparison to Q3 and Q4.
II.        Possible Actions:
a.     If these are the regular trends for every year, that shows that cash flow for these customer increases  in last two quarters than it would be more appropriate time for having campaigns. Bank would be able to get better return on their investment on those campaigns.
ROA = Margin * Asset Velocity
Asset velocity = Sales / Assets
More sales with more competitive product.
More sales mean more asset velocity.
More asset velocity means more return on assets.
More ROA is more profit.

b.     Bank can offer few additional facilities for the customer who are not using overdraft or limit facility. Or based on the economic environment (linking of external data with bank dataset) in the country if organisations are looking for funds for capital expenditure, bank can offer products accordingly.

  (Fig1: Gender wise Product data distribution)                           (Fig2: Yearly Average Balance per Product)
I.        Data Interpretation
a.     There are more female current account customers than male. (Fig 1)
b.     There are few customers who have saving account but their average balance is more than the normal data population. (Fig 2)
II.        Possible Actions:
a.     Create new product or launch campaign or correct current products to increase the male customer base with current account.
b.     Customer can be suggested to move their funds from savings accounts to fixed deposit this would help bank to win the loyalty of customer by showing them bank care for their customer’s money.


(Fig: Quarter wise State data distribution)
I.        Data Interpretation
a.     Alaska and Louisiana has more customer base. Florida has the least customer base.
II.        Possible Actions:
a.     Specific marketing campaigns should be planned to get customer base in the states where number of customer are not enough.
b.     If banking operations (operational efficiency) are the problem for the lower number of customer which can be identified by
                                          i.    Checking the number of accounts/customer trend every year and see if customer base has a diminishing trend in every year. Find the exact problem and take corrective actions for it, one of the example for reducing customer base is the operational efficiency. So perform check about operational efficiency of every branch in Florida and other lower customer base states.
                                         ii.    There was not enough marketing or campaign done as these were not the area of focus in the previous years.

Cluster analysis is one of the important technique to analyse your data, which is not easy specially if you are not statistician/mathematician. R is wonderful tool to do all the analytics work with a ease. I have used R, RStudio and ShinyApp (Web-based framework for R for more practical visualization and statistical analysis )


Thanks to my friend and colleague "Alex" at Misys to introduce me to this wonderful tool and technology  The "R".