1. Introduction
One of the data mining methods under development at our institute is the mining of linguistic associations from numerical data [3],[4]. Data mining is regarded as a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable knowledge in large-scale data sets [1]. Of particular interest are associations that reflect relationships among items in data sets. Recall that, in general, an association expresses a specific semantic link between data items: if X ~ Y is such an association, then “occurrence of X is associated with occurrence of Y”, where X and Y are attributes of data items.
2. Focus of our research
We present a direct method for mining associations that characterize relations among attributes using natural language. Since the mined associations have the form of natural language sentences, we call them linguistic associations. A typical linguistic association is
IF number of cars per hour is very big AND wind speed is small
THEN concentration of NO2 is more or less big.
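To make the shape of such an association concrete, the following minimal sketch scores the rule above against a tiny table of records. It is not the LAM implementation: the piecewise-linear membership functions, the power-based hedges “very” and “more or less”, the min conjunction, the support and confidence measures, the attribute contexts, and all data values are illustrative assumptions.

```python
# Illustrative only: membership shapes, hedges, contexts and quality measures
# are assumptions, not the evaluative-expression model used by LAM.

def normalize(x, lo, hi):
    """Map x from the assumed context [lo, hi] to [0, 1], clipping outside it."""
    return max(0.0, min(1.0, (x - lo) / (hi - lo)))

def small(u):
    """Assumed membership of 'small' on the normalized scale."""
    return max(0.0, (0.5 - u) / 0.5)

def big(u):
    """Assumed membership of 'big' on the normalized scale."""
    return max(0.0, (u - 0.5) / 0.5)

def very(mu):
    """Hedge modeled as squaring the membership degree (an assumption)."""
    return mu ** 2

def more_or_less(mu):
    """Hedge modeled as the square root of the membership degree (an assumption)."""
    return mu ** 0.5

# Hypothetical contexts (minimum, maximum) for each attribute.
CONTEXT = {"cars": (0, 2000), "wind": (0, 20), "no2": (0, 200)}

def rule_degrees(row):
    """Degrees to which one record satisfies the antecedent and the consequent."""
    cars = normalize(row["cars"], *CONTEXT["cars"])
    wind = normalize(row["wind"], *CONTEXT["wind"])
    no2 = normalize(row["no2"], *CONTEXT["no2"])
    antecedent = min(very(big(cars)), small(wind))  # AND modeled as minimum
    consequent = more_or_less(big(no2))
    return antecedent, consequent

def evaluate(rows):
    """Fuzzy support and confidence of the rule over all records."""
    pairs = [rule_degrees(r) for r in rows]
    both = sum(min(a, c) for a, c in pairs)
    ante = sum(a for a, _ in pairs)
    support = both / len(rows)
    confidence = both / ante if ante > 0 else 0.0
    return support, confidence

# Made-up records: cars per hour, wind speed, NO2 concentration.
rows = [
    {"cars": 1900, "wind": 2, "no2": 180},
    {"cars": 1800, "wind": 3, "no2": 170},
    {"cars": 500, "wind": 15, "no2": 40},
    {"cars": 1000, "wind": 10, "no2": 100},
]

support, confidence = evaluate(rows)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule would be reported as an association only if both measures exceeded chosen thresholds; which thresholds and measures the actual method uses is described in [3], not here.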
3. Description of main results
We have implemented this method in experimental software called LAM (Linguistic Associations Mining); see Section 4. We tested the method on several standard data sets, such as the Boston Housing data set from the StatLib library. The resulting associations are formulated in natural language, so experts from various fields can use them to discover new dependencies in a form much closer to their own knowledge and way of thinking. Moreover, discovered associations that characterize real dependencies can be taken directly as fuzzy IF-THEN rules and used as expert knowledge about the problem.
We have also developed and implemented a second method, which uses the fuzzy transform [4]. The antecedent of the associations it finds consists of expressions of the form “X is Fn(y)”, where X is an attribute and Fn(y) is a fuzzy number representing the meaning of the linguistic expression “approximately y”, where y is a real number. The consequent is a linguistic expression of the form “B average Z”, where Z is an attribute and B is an evaluative linguistic expression (such as “big”, “roughly medium”, or “extremely small”), for example “very small average concentration of gas”. A typical example of such a linguistic association is
IF number of cars per hour is approximately 1000
AND wind speed is approximately 5 m per second
THEN average concentration of NO2 is more or less big.
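The numerical core of such an association can be sketched as a weighted average: each record is weighted by the degree to which it matches the fuzzy numbers in the antecedent, which is the form of a discrete fuzzy transform component. The sketch below is a hedged illustration, not the method of [4] as implemented: the triangular shape of “approximately y”, the widths, the product used to combine the two conditions, and the data are all assumptions. The final step, choosing the evaluative expression B that best describes the resulting average, is omitted.

```python
# Illustrative only: the triangular fuzzy numbers, their widths and the data
# are assumptions; the last step (naming the average linguistically) is omitted.

def approximately(center, width):
    """Triangular fuzzy number for 'approximately center' (assumed shape)."""
    def mu(x):
        return max(0.0, 1.0 - abs(x - center) / width)
    return mu

def weighted_average(rows, antecedent, target):
    """Average of `target` weighted by the antecedent membership of each record.

    `antecedent` maps attribute names to membership functions; the degrees
    are combined by product (an assumption). Returns None if no record
    matches the antecedent to a positive degree.
    """
    num = 0.0
    den = 0.0
    for row in rows:
        weight = 1.0
        for attr, mu in antecedent.items():
            weight *= mu(row[attr])
        num += weight * row[target]
        den += weight
    return num / den if den > 0 else None

# Made-up records: cars per hour, wind speed (m/s), NO2 concentration.
rows = [
    {"cars": 1000, "wind": 5, "no2": 150},
    {"cars": 900, "wind": 5, "no2": 130},
    {"cars": 1000, "wind": 6, "no2": 140},
    {"cars": 400, "wind": 12, "no2": 40},
]

antecedent = {
    "cars": approximately(1000, 200),  # "approximately 1000", assumed width 200
    "wind": approximately(5, 2),       # "approximately 5", assumed width 2
}

average_no2 = weighted_average(rows, antecedent, "no2")
print(average_no2)
```

Records far from the centers (the last one here) receive zero weight and do not influence the average, which is why the consequent speaks about the average of Z near the given values rather than about Z overall.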
4. Demonstration
Flash presentation illustrating the LAM software in use.
REFERENCES:
[1] FAYYAD, U., PIATETSKY-SHAPIRO, G., SMYTH, P.: From data mining to knowledge discovery: An overview. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pp. 1-30. AAAI Press/The MIT Press, MA, U.S.A., 1996.
[2] HÁJEK, P., HAVRÁNEK, T.: Mechanizing Hypothesis Formation (Mathematical Foundations for a General Theory). Springer-Verlag, Berlin-Heidelberg-New York, 1978.
[3] NOVÁK, V., PERFILIEVA, I., DVOŘÁK, A., CHEN, Q., WEI, Q., YAN, P.: Mining pure linguistic associations from numerical data. In International Journal of Approximate Reasoning, 48, 2008, pp. 4-22, ISSN 0888-613X.
[4] PERFILIEVA, I., NOVÁK, V., DVOŘÁK, A.: Fuzzy transform in the analysis of data. In International Journal of Approximate Reasoning, 48, 2008, pp. 36-46, ISSN 0888-613X.