Thursday, April 26, 2007

Introduction to Data Mining and Knowledge Discovery

[Photo: trees in a church courtyard, Mexico]
From this week’s reading I learned three new things and added a fourth from the class discussions.

1. Data and information are fluid to information managers in the same way that steel is fluid to a steelworker.

Imagine a person holding a piece of data between their hands the way someone might hold a basketball, except that this blob of information is superflexible and changes completely in shape or dimension as it is moved around. Because it is data, it may be viewed from any position, reshaped by a model, or compared to other data.

2. Data has three basic states:
• Stored
• Processed
• Communicated
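
As a minimal sketch (my own framing, not from the reading), the three states could be written as a simple enumeration that a data-management system tags each piece of data with:

```python
from enum import Enum, auto

class DataState(Enum):
    """The three basic states of data from this week's reading."""
    STORED = auto()        # at rest in a repository
    PROCESSED = auto()     # being transformed, modeled, or munged
    COMMUNICATED = auto()  # in transit to a viewer or another system

# A piece of data cycles through the states over its life:
state = DataState.STORED
state = DataState.PROCESSED
state = DataState.COMMUNICATED
print(state.name)  # COMMUNICATED
```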

Altered data should not be written back to its original source. If the transformed information is to be kept, and not just the results, it needs to be deposited someplace other than the original source location. Storage is expensive.
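
A minimal sketch of that rule in Python (the file names are hypothetical): read from the source, alter the data, and deposit the result somewhere other than the original location, which stays untouched.

```python
import csv

def transform_and_store(source_path: str, dest_path: str) -> None:
    """Read raw data, alter it, and write the altered version to a
    new repository rather than back over the original source."""
    with open(source_path, newline="") as src:
        rows = list(csv.reader(src))

    # Example alteration: normalize every field to lowercase.
    altered = [[field.lower() for field in row] for row in rows]

    # The destination is deliberately not source_path.
    with open(dest_path, "w", newline="") as dst:
        csv.writer(dst).writerows(altered)

# Hypothetical paths, for illustration only:
# transform_and_store("raw/customers.csv", "warehouse/customers_clean.csv")
```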

3. Managing data and transforming it into information or actionable knowledge means moving the data and applying different dimensions and techniques to it.

Storing data is expensive, and generally, unless it is a secret that retains value by being secret (like CIA intelligence, or a corporate secret such as the Coca-Cola recipe), stored data is not useful.

Data in motion, that is, data being viewed, altered, or accessed, is commercially valuable and can produce value and revenue.

[Photo: Mexican tree, like a full database with lots of branches]

The Two Crows Corporation article "Introduction to Data Mining and Knowledge Discovery" was particularly engaging to me, beginning with its definition –

"Data mining is a process that uses a variety of data analyst tools to discover patterns and relationships in data that may be used to make valid predictions."

Wow, that sounds like software can do something spooky – as in supernatural – and that its special talent is to "predict the future." But this is in fact what software can do, and it requires three things to do it –

+ data from a source
+ ability to munge data to information
+ a place to store either the data or just the results; but to be effective, the data must affect something else.

All this means is that either you move and store, or you move and munge and never permanently store the results. Data at any moment is either stored, or moving, or being processed and munged, which includes modifying other data (see: http://www.twocrows.com/booklet.htm).
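
Here is a toy sketch of that three-part loop in Python (the names and data are mine, not from the article): data from a source, a munge step, and a sink that either stores the result or affects something else.

```python
def mine(source, munge, sink):
    """Move data from a source, munge it into information, and
    deliver the result to a sink: a store, or something else the
    information is meant to affect."""
    for record in source:       # data from a source
        result = munge(record)  # munge data to information
        sink(result)            # store the result, or act on it

# Toy run: the 'source' is a list, the 'sink' just prints.
purchases = [("alice", 3), ("bob", 7), ("alice", 2)]
mine(purchases, munge=lambda r: f"{r[0]} bought {r[1]} items", sink=print)
```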

From this it appears that removing one of these steps might improve speed, such as keeping the data loaded in memory. Logically, this means that computers which never turn off and continuously churn data in some way, such as very refined "data supermarts," would be most efficient.

The illustrations provided in these articles were very helpful, and made understanding these models and concepts easy. Reflecting on this made me think about what a visual model for the entire process of data engineering into actionable information might look like. What I envisioned, but haven't had time to draw, is the superflexible basketball-sized blob of data described in point 1 above.

Reviewing data to detect empirical patterns and so forth makes sense – but this section was particularly interesting:

[Photo: tree on the Park Strip, Anchorage, Alaska, April 2007]

"New techniques include relatively recent algorithms like neural nets and decision trees, and new approaches to older algorithms such as discriminant analysis. By virtue of bringing to bear the increased computer power on the huge volumes of available data, these techniques can approximate almost any functional form or interaction on their own. Traditional statistical techniques rely on the modeler to specify the functional form and interactions."
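
A sketch of that contrast using scikit-learn (a modern library of my choosing, not one named in the article): given data with a hidden oscillating form, the decision tree approximates the functional form on its own, while the traditional linear model fails because the modeler did not specify the right form.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, size=500)  # hidden nonlinear form

tree = DecisionTreeRegressor(max_depth=7).fit(X, y)
line = LinearRegression().fit(X, y)

X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_test = np.sin(3 * X_test[:, 0])

# The tree recovers the oscillation by itself; the straight line cannot.
print("tree R^2:  ", tree.score(X_test, y_test))
print("linear R^2:", line.score(X_test, y_test))
```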

"Data mining is a tool for increasing the productivity of people trying to build predictive models."

If this isn't the most interesting thing a futurist, a scientist, a medical researcher, or a sales team can hear and understand about computer science and data modeling, I don't know what would be. Predictive models in and of themselves are recursively fascinating. This leads us to the question: what does that take?

"While the power of the individual CPU has greatly increased, the real advanced inAnchorage Park Strip, Alaska, Moon near full scalability stem from parallel computer architectures. Virtually all servers today supposed multiple CPUs using symmetric multi-processing, and clusters of the SMP server can be created that allow hundred of CPUs to work on finding patterns in the data."

Yes, that is more exciting news. This is the same way that linked computers, in their idle time, are used as a massive array to search for unexplained patterns in space's background noise, hoping to hear a signal: they are looking for alien life. Clearly such a method is helped by all that linked volunteer processing power. But that's the lesson: if you really want to find something out, it's possible, and lots of businesses and individual people use these techniques in all kinds of applications because it's cheap enough.
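
A minimal sketch of the same idea on one machine, using Python's standard multiprocessing module: split the data into one slice per CPU and let the CPUs hunt for a pattern in parallel, the way an SMP server would.

```python
from multiprocessing import Pool
import os

def count_matches(chunk):
    """Each worker scans its slice of the data for a toy 'pattern'."""
    return sum(1 for value in chunk if value % 97 == 0)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n = os.cpu_count() or 1
    chunks = [data[i::n] for i in range(n)]  # one slice per CPU

    with Pool(processes=n) as pool:
        results = pool.map(count_matches, chunks)

    print(f"{n} CPUs found {sum(results)} matches")
```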

"Visualization works because it exploits the broader information bandwidth of graphics as opposed to text or numbers. It allows people to see the forest and zoom in on the trees. Patterns, relationships, exceptional values and mission values are often easer to perceive when show graphically, rather that as list of numbers and text."

Another exciting wow when it comes to working from data to information to predictive modeling: how to represent it. The most impressive thing about this is that it mirrors what geniuses such as Da Vinci did; he had the ability to visualize complex information, as did Faraday and Einstein, and they used visual models to create predictive models of behavior in the world. What this tells us is that anyone with proper understanding and access to the tools can munge data so that it is represented the way our greatest minds naturally accomplished.1
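
A sketch of that bandwidth difference with matplotlib (my choice of tool): the same sixty numbers are opaque as a printed list, but the pattern and the one exceptional value jump out the moment they are drawn.

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.sin(np.linspace(0, 6, 60))
values[42] = 3.0  # one exceptional value hidden among sixty numbers

print(list(np.round(values, 2)))  # hard to spot the outlier as text

plt.plot(values, marker="o")      # trivial to spot it as a picture
plt.title("Seeing the forest and the one strange tree")
plt.show()
```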

One of the interesting ideas that came from the reading on link analysis, which mentions the two kinds of inquiry commonly used, "association discovery" and "sequence discovery" (with the factors of support, relative frequency, confidence, and association), is the idea that a database linked to an additional database, beyond association and sequence, might over time arrive at many expected detections in patterns of data. If, for example, that database were the "Life Database of Patterns of Obvious Qualities," it would contain many thousands of facts such as 'dead people do not buy anything,' and corollaries such as 'so there is no point in sending advertising to their residence.'
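
A sketch of association discovery over toy transactions (the items and numbers are mine, for illustration), computing the support and confidence factors mentioned above:

```python
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    """Relative frequency: the share of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears when the antecedent does."""
    return support(antecedent | consequent) / support(antecedent)

for a, b in combinations(["milk", "bread", "eggs"], 2):
    print(f"{a} -> {b}: support={support({a, b}):.2f}, "
          f"confidence={confidence({a}, {b}):.2f}")
```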

Such a mammoth database would be a unique scientific challenge to create, maintain, and link to. Of particular interest to me is how much of the data collected would be true, and how much would be useful. We cannot always tell how solving one problem may serve to inform something else; consider the discussion of Microsoft's edge-checking algorithm, email-scanning software used to weed out spam from Microsoft email servers, which as it turned out was used to sort through DNA in the successful search for an AIDS vaccine now going into clinical trials.2

[Photo: Seattle Green Lake tree and reflection]

Also of interest was the categorical explosion in data, as well as the concepts around pruning tree structures. I found myself wondering what would happen if processing power and storage capability exceeded our ability to create new data. What would happen if every tree was allowed to grow permanently and one branch contained no splits, as part of the earlier described Obvious Qualities database? In a way this is similar to the rapid, irregular growth of a smart internet, combined with such simple tools as cookies and the analysis of purchasing behavior. What will that data come to predict if applied to a large model? What can be predicted will be of interest for a long time to come.
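
Pruning itself can be sketched with scikit-learn's cost-complexity parameter (a modern tool, not one from the readings): raising ccp_alpha trims branches that don't earn their keep, which is exactly the brake on unlimited tree growth I was wondering about.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

# Pruning trades a little training fit for a far smaller tree.
print("unpruned leaves:", unpruned.get_n_leaves())
print("pruned leaves:  ", pruned.get_n_leaves())
```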

Even the acronyms at this level sound spacey – MARS, the Multivariate Adaptive Regression Splines. This is a much more interesting field than I was prepared to encounter, and in summation I come away quite curious about how far our creative intelligence and our need and desire to know will be able to drive the technology: to the computational limit, and over into helping humanity on a mass level.
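
For the curious, MARS builds its fit out of "hinge" basis functions; here is a minimal NumPy sketch of the idea (my own illustration, not the full algorithm):

```python
import numpy as np

def hinge(x, knot):
    """A MARS basis function: zero below the knot, linear above it."""
    return np.maximum(0.0, x - knot)

x = np.linspace(0, 10, 11)
# A MARS model is a weighted sum of such hinges, for example:
y_hat = 1.5 * hinge(x, 3) - 0.5 * hinge(x, 7)
print(y_hat)
```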

1. West, T. "In the Mind's Eye: Visual Thinkers, Gifted People with Dyslexia and Other Learning Difficulties, Computer Images and the Ironies of Creativity." Prometheus Books; updated edition, September 1997.

2. The application description from Phil Fawcett, Microsoft Research Liaison PM, in-person presentation on "optimized applications," http://research.microsoft.com/ivm/HDView/HDGigapixel.htm, University of Washington, Seattle, April 17, 2007.

Week 5: Modalities of Information Delivery
Data Mining

The Two Crows Corporation, "Introduction to Data Mining and Knowledge Discovery, Third Edition" 1999. Accessed on 2/25/2006 from http://www.twocrows.com/intro-dm.pdf.

Witten, I.H. and Frank, E. (2000). "What's It All About?" In Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. (Chap. 1). San Francisco: Morgan Kaufmann. pp. 1-35. (Focus on Sections 1.5 and 1.6)

Editorial Review & Delivery

McGovern, G. & Norton, R. (2002). "Editing Content." In Content Critical: Gaining Competitive Advantage through High-Quality Web Content. (Chap. 6). Pearson Education Limited. pp. 109-122.

IT Help Desk

Clarke, S. and Greaves, A. (2002). "IT Help Desk Implementation: The Case of an International Airline." In Annals of Cases on Information Technology, 4, pp. 241-259.

Walko, D. 1999. "Implementing a 24-Hour Help Desk at the University of Pittsburgh." In Proceedings of the 27th Annual ACM SIGUCCS Conference on User Services: Mile High Expectations (Denver, Colorado, United States). SIGUCCS '99. ACM Press, New York, NY, pp. 202-207.

Duhart, T., Monaghan, P., and Aldrich, T. 1999. "Creating the Customer Service Team: An Ongoing Process." In Proceedings of the 27th Annual ACM SIGUCCS Conference on User Services: Mile High Expectations (Denver, Colorado, United States). SIGUCCS '99. ACM Press, New York, NY, pp. 51-55. DOI= http://doi.acm.org/10.1145/337043.337090.

Padeletti, A., Coltrane, B., and Kline, R. 2005. "Customer service: help for the help desk." In Proceedings of the 33rd Annual ACM SIGUCCS Conference on User Services (Monterey, CA, USA, November 06-09, 2005). SIGUCCS '05. ACM Press, New York, NY, pp. 299-304. DOI= http://doi.acm.org/10.1145/1099435.1099504.

1 comment:

Anonymous said...

I suggest KDnuggets as a resource:

http://www.kdnuggets.com/