Hi,
I'm working on a project to create a mining model.I have one "flattened" table with a productID (key) and related attributes. Example columns are weight, color, price, units sold last year, product category, product rating (top seller, etc) and similar columns, about 40 in total. There are about 15,000 products and same number of rows in the table.
The objective is to pass a ProductID (that also exists in the table) and get back the top N products that most resemble the source product based on all the attributes.
Any advice on which models I should test and how to set up the models would be much appreciated. Also if there is a similar example/sample out there, please let me know. I downloaded the MovieClick sample, but that doesn't work in my case as I only have the equivalent of the Customers table.
Thanks in Advance.
This is actually a K-Nearest Neighbor problem, for which we don't ship an implementation for SQL Server Analysis Services (although someone may have written a plug in). I could imagine that you could use the fuzzy matching transform in Integration Services to do this, however. To do so you would just use your product table as the reference table, send in the fields of the product of interest, then sort the results by the match confidence.
HTH
-Jamie
|||Thanks Jamie, I'll try fuzzy matching in SSIS. I'm also thinking of writing my own code to calculate based on KNN in TSQL. I'm not sure if the performance will be acceptable though especially as I need to pass in up to 5 products, evaluate based on their combined attributes and get a resultset back with top N matches.
Is there a third party software out there that can do this?
|||When using the fuzzy components in SSIS, do be aware that they are string matching components, evaluating the edit distance between two string. So, for example, "911" will be pretty much as close a match to "999" as "199" - so for any columns where you need to evaluate similarity of values you may need to use other components such as the Derived Column to perform the calculation.|||Hi Donald,
Can you explain what you mean by Derived column. How would I make it work to calculate distances for numeric columns/attributes.
Thanks.
|||I don't know of a 3rd party KNN, but you would likely get better performance implementing your own algorithm using our plug-in interfaces in C#.|||Jamie/Donald
Thanks for your responses. My C# is not that strong. For now I'll write something in SQL and if I can create a model that works, I'll convert that into a plug-in.
This is the plan. Normalize all attributes using mean and std dev. Apply the nearest neighbour algorithm using Euclidean distance. Also experiment with arbitrary weights for attributes or maybe use correlation to filter out some of the weaker attributes, not sure how that will work though at this point.
Any suggestions are welcome. Also, Is there a sample plug-in in C# out there I can use as template. Is there any way I can mold this table to use Association Rules or any other built-in algorithm.
Regards,
Asim.
|||There are tutorials here for writing managed plug-in algorithms: http://www.sqlserverdatamining.com/DMCommunity/Tutorials/default.aspx
You may need to register (free!) to get access to the links.