Data Mining

By Catalina Correa, IFD

Data mining can be defined as the process of sorting through large quantities of data to identify patterns and trends. These patterns and trends can be collected, analyzed and used to make intelligent decisions pertaining to specific business scenarios, such as:

  • Forecasting sales
  • Determining what products might be sold together
  • Targeting sales mailings towards specific customers
  • Predicting customer buying trends
  • Identifying faltering customer sales

Creating a data mining model is a dynamic, iterative process. It involves asking questions to define specific goals, gathering data to create an output model that answers those questions, and using the model to deploy meaningful reports into the working environment.

Defining Goals

The first step in the data mining process is to define the business problem or need as specifically and clearly as possible. To successfully complete this first step, questions such as the following might need to be answered. 

  • What similarities are you trying to find?
  • Are you trying to forecast sales?
  • Are you looking for seasonal trends?
  • Are you trying to recover lost sales?

The answers to these types of questions and the needs of the individual business users have to be analyzed and compared to the available data. If the current dataset does not support the needs, the data mining project will need to be redefined. It is important that all individuals involved, from management to the end users, take part. While upper management might have a solid understanding of the business needs, the end users will be the ones actually working with the data model output, so they need a solid understanding of the final goals to be achieved using the model.

Organizing The Data

The next step in the data mining process is to gather, consolidate and clean the available data. If the data is stored in multiple locations, it would be best to centralize it on one machine. Depending on the rules established for data entry and how strictly they are enforced, incorrect, missing or inconsistent data entries may be present in the data. The following are a few examples that can cause data filtering problems. 

  • Missing or incomplete data entries
  • Inaccurate pieces of data
  • Data entries varying between upper case, lower case or mixed case
  • Similar types of data that are both abbreviated and full length

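As a minimal sketch of how the problems listed above could be handled, the following Python example rejects incomplete entries, normalizes case and expands abbreviations. The field names ("customer", "state"), the abbreviation map and the sample records are all assumptions made up for illustration; a real project would derive them from its own data entry rules.

```python
# Hypothetical abbreviation map; a real project would build this from its own data.
STATE_ABBREVIATIONS = {"ny": "New York", "n.y.": "New York", "ca": "California"}


def clean_record(record):
    """Return a cleaned copy of one customer record, or None if it is unusable."""
    name = (record.get("customer") or "").strip()
    state = (record.get("state") or "").strip()
    if not name or not state:
        return None  # missing or incomplete entry: set aside for manual review
    # Collapse repeated spaces and normalize case so "ACME corp" equals "Acme Corp".
    name = " ".join(name.split()).title()
    # Expand abbreviations so "NY" and "new york" become the same value.
    state = STATE_ABBREVIATIONS.get(state.lower(), state.title())
    return {"customer": name, "state": state}


# Invented sample rows showing each kind of problem.
raw = [
    {"customer": "ACME corp", "state": "NY"},
    {"customer": "acme  Corp", "state": "new york"},
    {"customer": "", "state": "CA"},
    {"customer": "Bolt Supply", "state": None},
]

cleaned = [c for c in (clean_record(r) for r in raw) if c is not None]
rejected = len(raw) - len(cleaned)
```

Here the first two rows end up identical after cleaning, and the last two are counted as rejected rather than silently dropped, so someone can follow up on them. Inaccurate values cannot be detected by rules like these alone and usually need a comparison against a trusted source.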
These data anomalies will need to be cleaned up before any type of accurate reporting can be performed. Furthermore, data cleanup might need to go beyond missing or inconsistent entries. Other aspects of the data will also need to be researched.

  • Dates
  • Price
  • Discounts
  • Data Classification Level
  • Territories
  • Salesperson

Should the output model be based on “Order Date” or “Ship Date”? Do you rely on “Net Price”, “Gross Price” or “Discounted Price”? Are clear and accurate data classification levels defined, and how are the levels set up? Do you want to see sales by “Customer Location” or “Warehouse Location”? Are sales totals to reflect “Assigned Salesperson” or “Sold By Salesperson”? These examples, along with numerous others, will need to be analyzed in order to determine what data is best to use for the output model.
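To illustrate why a choice such as “Order Date” versus “Ship Date” matters, here is a small Python sketch that totals the same orders by month using each date. The field names and the three sample orders are invented for the example.

```python
from collections import defaultdict
from datetime import date

# Invented sample orders; the first one is ordered in January but ships in February.
orders = [
    {"order_date": date(2023, 1, 30), "ship_date": date(2023, 2, 2), "net_price": 500.0},
    {"order_date": date(2023, 1, 15), "ship_date": date(2023, 1, 18), "net_price": 300.0},
    {"order_date": date(2023, 2, 10), "ship_date": date(2023, 2, 12), "net_price": 200.0},
]


def monthly_totals(orders, date_field, amount_field="net_price"):
    """Sum the chosen amount per (year, month) of the chosen date field."""
    totals = defaultdict(float)
    for order in orders:
        d = order[date_field]
        totals[(d.year, d.month)] += order[amount_field]
    return dict(totals)


by_order_date = monthly_totals(orders, "order_date")
by_ship_date = monthly_totals(orders, "ship_date")
```

With this sample, January shows 800.0 by order date but only 300.0 by ship date, because one order crosses the month boundary. The same pattern applies to the price and salesperson questions: the field chosen has to be agreed on before reports are compared.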

Exploring Your Data

The third step in the data mining process is to explore the prepared data. A complete understanding of the data and knowledge of what to look for is vitally important to make intelligent decisions when creating the mining output models. You will need to determine the dataset’s accuracy and establish that it can provide the necessary results. By performing some basic calculations on the prepared data values, it will be possible to tell if the data is skewed, inaccurate or possibly incomplete. 

  • Minimum Value Calculations
  • Maximum Value Calculations
  • Average Value Calculations
  • Standard Deviation Calculations

By incorporating the above calculations into your data exploration, deviations or variances can be observed. If there are significant deviations within the values, more data might be needed to provide more balance throughout the dataset. If the data proves to be accurate, the problem might instead lie with inaccurate business expectations. As a result of exploring your data, understanding it and knowing what to look out for, decisions can be made on whether or not the data is flawed. If the data is flawed, a plan can be conceived on how to fix the problems. If the data is accurate, a greater understanding of trends within the business can be attained and the data mining plan can be modified to represent these new understandings.
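The four calculations above can be sketched with Python's standard statistics module. The order totals below are invented, and the two-standard-deviation threshold is only an assumed rule of thumb for spotting values worth a second look, not a fixed rule.

```python
import statistics


def summarize(values):
    """Return the minimum, maximum, average and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def flag_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) > threshold * stdev]


# Invented order totals; the last entry looks like a possible data entry error.
order_totals = [120, 135, 128, 142, 131, 125, 138, 1300]

summary = summarize(order_totals)
suspects = flag_outliers(order_totals)
```

A flagged value is not necessarily wrong. It may be a keying error, or it may be a genuine large order, which is exactly the kind of question this step is meant to raise.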

Create Final Reports

The final step in the process is to generate clear and meaningful data reports which display the answers to the business problems or needs defined in the first step of the data mining process. It makes no difference whether you are using a powerful business intelligence software package such as IBM Cognos or a program already included with Microsoft Office, like Excel. What matters most is that the final reports be laid out in a manner that is easily understandable for all individuals using them. The information and knowledge attained from these reports can be invaluable.

  • Sales history reports can identify customers who were once high volume buyers but whose sales have diminished over the years.
  • Along with the customer sales history, a sales potential figure can be incorporated into the report to show which customers would be most beneficial to go after.
  • An item classification sales report might show customers that purchase a specific product class from you, but sales of associated item classes are nonexistent.
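As one possible sketch of the first report described above, the following Python example lists customers whose most recent yearly sales have fallen well below their best year. The customer names, the yearly figures and the 50 percent cutoff are all assumptions chosen for illustration.

```python
def faltering_customers(sales_by_year, drop_ratio=0.5):
    """Return (customer, peak, latest) for customers whose latest-year
    sales are below `drop_ratio` times their best year."""
    flagged = []
    for customer, yearly in sales_by_year.items():
        latest_year = max(yearly)      # most recent year on record
        peak = max(yearly.values())    # best yearly total
        latest = yearly[latest_year]
        if peak > 0 and latest < drop_ratio * peak:
            flagged.append((customer, peak, latest))
    # Largest absolute decline first, so the biggest losses lead the report.
    flagged.sort(key=lambda row: row[1] - row[2], reverse=True)
    return flagged


# Invented yearly sales totals per customer.
sales_by_year = {
    "Acme Corp": {2021: 90000, 2022: 60000, 2023: 20000},
    "Bolt Supply": {2021: 15000, 2022: 16000, 2023: 17000},
    "Crest Tools": {2021: 40000, 2022: 42000, 2023: 18000},
}

report = faltering_customers(sales_by_year)
```

The resulting list can be pasted into a spreadsheet or fed to a reporting tool. Sorting by the size of the decline is one design choice; sorting by sales potential, as the second bullet suggests, would be another.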

Importing report information into a mapping program is another way to use your data output. 

  • Customer locations can be pinpointed, which might show where to concentrate your sales force in order to build up the customer base within a specific region.
  • Delivery routes can be laid out more efficiently directly on a map and potential customers that are skipped over within a route can be observed.

Sometimes a picture can be worth a thousand words. These are only a few examples of what you can do with a clean, well-organized, accurate dataset. The possibilities are virtually endless and the power within the data can be invaluable.
