Clustering: Platform for big data analysis and knowledge extraction

August 28, 2017

Modern technology and its advancements have caused the volume of data from the Internet, imaging and video surveillance to rise at an alarming rate. One study estimates that the volume of data increases by about 281 exabytes every ten years.

These data are, for the most part, stored on the World Wide Web in digital form. The increase is seen not only in the amount of data but also in its variety (text, image and video). In addition, billions of emails, blogs, transaction records and webpages, amounting to terabytes of data, are created every day. Hence, datasets keep growing in volume. Understanding this growth is of utmost significance in the world's changing information landscape. Devising techniques that automatically analyse, classify and retrieve such a huge volume of unstructured data appears highly challenging.

Plenty of applications depend on the analysis of enormous datasets for their success. A data analysis procedure is categorised as exploratory or confirmatory, according to the data source available. Structuring a high-dimensional dataset without any assumptions or pre-specified models constitutes exploratory analysis. In contrast, structuring high-dimensional data with the help of assumptions or pre-specified models constitutes confirmatory analysis. Existing data analysis techniques include linear regression, discriminant analysis, canonical correlation analysis, factor analysis, principal component analysis, multidimensional scaling, cluster analysis and many more. One study states that a key element called grouping, which may be based either on a postulated model or on natural groupings (i.e. clustering), is essential to any data analysis procedure. Conventional data analysis techniques often overlook useful information in bulk databases and, as a consequence, the potential benefits of increased computational and data gathering capabilities are only partially realised.

An exploratory study reports that the term ‘data-clustering’ first appeared in a 1954 article dealing with anthropological data. Data clustering is one of the data mining techniques; it structures data so that useful and associated information can be extracted effectively from the bulk of a data corpus. Data clustering emerged as a field of practice in its current form between World War I and World War II in the discipline of ecology, where scientists attempted to describe the territorial structure of bird species. The species’ territorial structure was determined by using the spots as the similarity measure.

Data clustering, also known as cluster analysis, aims to discover the natural groupings of data patterns, points or objects. Cluster analysis can be defined as 'a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics'. Data clustering groups high-dimensional data according to a similarity measure, such that objects within the same group are alike and objects belonging to different groups are dissimilar. Depending on the data, clusters can differ in shape, size and density. One study classifies clustering algorithms into two types, namely hierarchical clustering and partitional clustering. Hierarchical clustering operates in either agglomerative mode or divisive mode. In agglomerative mode, every data point starts as a cluster of its own, and the most similar pairs of clusters are merged successively. In divisive mode, by contrast, all the data points start in a single cluster, which is then partitioned into successively smaller clusters. Partitional clustering algorithms, on the other hand, do not impose a hierarchical structure. Instead, they find all the clusters at once.
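The agglomerative mode described above can be illustrated with a minimal Python sketch. It assumes Euclidean distance as the (dis)similarity measure and single linkage (the distance between two clusters is the distance between their closest members); neither choice comes from the article, and the function name `agglomerative` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Assumptions (not from the article): Euclidean distance and
    single linkage are used to compare clusters.
    """
    # Agglomerative mode: every point starts as its own cluster.
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(p, q) for p in a for q in b)

    while len(clusters) > num_clusters:
        # Find the most similar (closest) pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i (i < j, so deleting j keeps i valid).
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical two-dimensional sample points, for illustration only.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0)]
groups = agglomerative(points, 3)
```

A divisive procedure would run in the opposite direction, starting from one cluster that holds every point and splitting it repeatedly. The brute-force pair search here is fine for a handful of points but would be too slow for the large datasets the article discusses.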

In terms of its range of application, cluster analysis is widespread in any discipline that involves multivariate data analysis. It is impractical to list exhaustively the scientific fields and applications where data clustering could apply. A few well-known applications of data clustering include image segmentation, document clustering and character recognition. The jobs that data clustering performs on the application side are as follows: i) storing, organising and integrating massive data, ii) processing and analysing data and iii) extracting knowledge and insights from data to predict the future. One instance is a newspaper dataset, in which organising the newspapers by relevant topic involves the data clustering methodology. Here, the topics serve as the cluster points around which the newspapers are grouped. In contrast, a search via the Google search engine using 'data clustering' as the keyword results in about 1,660 entries. This case is different: data clustering is applied to group the contents that are most centred on data clustering itself.
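The newspaper example, in which topics act as the points around which items are grouped, can be sketched in Python as follows. This is only an illustration under stated assumptions: each topic is represented by a hand-picked keyword set, similarity is measured with the Jaccard index on word sets, and the topic names, keywords and headlines are all invented for the sketch rather than taken from any real dataset.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_topic(articles, topics):
    """Assign each article to the topic whose keyword set it overlaps most."""
    groups = {name: [] for name in topics}
    for title in articles:
        words = set(title.lower().split())
        # Ties (including zero overlap with every topic) go to the first topic.
        best = max(topics, key=lambda name: jaccard(words, topics[name]))
        groups[best].append(title)
    return groups


# Hypothetical topic keywords and headlines, invented for illustration only.
topics = {
    "sports": {"match", "team", "league", "goal"},
    "finance": {"market", "shares", "bank", "rates"},
}
articles = [
    "Home team wins league match",
    "Bank raises interest rates",
    "Late goal decides cup match",
    "Shares fall as market cools",
]
groups = group_by_topic(articles, topics)
```

With this toy input, the two sports headlines end up under "sports" and the other two under "finance". A practical document clustering system would usually learn the cluster centres from the data instead of fixing them by hand.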

The development of clustering methodology is a truly interdisciplinary endeavour. The reason may be that many people who rely on real-time data collection and processing, such as social scientists, engineers, computer scientists, medical researchers, taxonomists and so forth, need to apply clustering methodology. In all, clustering seems to have a growing impact on the environment in which we live and in which systems that acquire and understand knowledge from texts evolve.

Satish Chander, Assistant Professor
Department of Computer Science and Engineering, Waljat College of Applied Sciences