Showing posts with label related. Show all posts
Showing posts with label related. Show all posts

Friday, March 23, 2012

Mining Model?

Hi,

I'm working on a project to create a mining model.I have one "flattened" table with a productID (key) and related attributes. Example columns are weight, color, price, units sold last year, product category, product rating (top seller, etc) and similar columns, about 40 in total. There are about 15,000 products and same number of rows in the table.

The objective is to pass a ProductID (that also exists in the table) and get back the top N products that most resemble the source product based on all the attributes.

Any advice on which models I should test and how to set up the models would be much appreciated. Also if there is a similar example/sample out there, please let me know. I downloaded the MovieClick sample, but that doesn't work in my case as I only have the equivalent of the Customers table.

Thanks in Advance.

This is actually a K-Nearest Neighbor problem, for which we don't ship an implementation for SQL Server Analysis Services (although someone may have written a plug in). I could imagine that you could use the fuzzy matching transform in Integration Services to do this, however. To do so you would just use your product table as the reference table, send in the fields of the product of interest, then sort the results by the match confidence.

HTH

-Jamie

|||

Thanks Jamie, I'll try fuzzy matching in SSIS. I'm also thinking of writing my own code to calculate based on KNN in TSQL. I'm not sure if the performance will be acceptable though especially as I need to pass in up to 5 products, evaluate based on their combined attributes and get a resultset back with top N matches.

Is there a third party software out there that can do this?

|||When using the fuzzy components in SSIS, do be aware that they are string matching components, evaluating the edit distance between two string. So, for example, "911" will be pretty much as close a match to "999" as "199" - so for any columns where you need to evaluate similarity of values you may need to use other components such as the Derived Column to perform the calculation.|||

Hi Donald,

Can you explain what you mean by Derived column. How would I make it work to calculate distances for numeric columns/attributes.

Thanks.

|||I don't know of a 3rd party KNN, but you would likely get better performance implementing your own algorithm using our plug-in interfaces in C#.|||

Jamie/Donald

Thanks for your responses. My C# is not that strong. For now I'll write something in SQL and if I can create a model that works, I'll convert that into a plug-in.

This is the plan. Normalize all attributes using mean and std dev. Apply the nearest neighbour algorithm using Euclidean distance. Also experiment with arbitrary weights for attributes or maybe use correlation to filter out some of the weaker attributes, not sure how that will work though at this point.

Any suggestions are welcome. Also, Is there a sample plug-in in C# out there I can use as template. Is there any way I can mold this table to use Association Rules or any other built-in algorithm.

Regards,

Asim.

|||

There are tutorials here for writing managed plug-in algorithms: http://www.sqlserverdatamining.com/DMCommunity/Tutorials/default.aspx

You may need to register (free!) to get access to the links.

minimum price on earliest date

Ok my first posted question :
(This is related to a travel website)

I have the following table layout :

CREATE TABLE "public"."package" (
"id" BIGINT NOT NULL,
"accom_code" VARCHAR(4) NOT NULL,
"start_date" DATE NOT NULL,
"end_date" DATE NOT NULL,
"pricing_type" VARCHAR(2),
"indic_price" NUMERIC(7,2),
"unit_price" NUMERIC(7,2),
"adult_age_max_cnt" INTEGER,
CONSTRAINT "package_pkey" PRIMARY KEY("id")
) WITH OIDS;

The package table contains a list (a very large one) for holiday accomodation packages.

What i'm trying to get is the following :

The MINIMUM price using the following for "price" :

CAST (CASE p.pricing_type
WHEN 'UN' THEN p.unit_price
WHEN 'PA' THEN p.indic_price*p.adult_age_max_cnt
ELSE p.unit_price
END AS NUMERIC(7,2))

AND

The minimum date (i.e. nearest start date) greater than today
WHERE the start_date equals the minimum start_date and the price equals the minimum price.

Any thoughts?

With any luck I will be able to give you an test data insert for this.

This should probably be posted in the POSTGRESQL section however I feel that it is a more general SQL question than anything else.What I have thus far :

SELECT p.accom_code
,MIN(CAST (CASE p.pricing_type
WHEN 'UN' THEN p.unit_price
WHEN 'PA' THEN p.indic_price*p.adult_age_max_cnt
ELSE p.indic_price
END AS NUMERIC(7,2))) as min_price
,MIN(p.start_date) as min_start
FROM package p
WHERE accom_code IN ('DDNA','ADE9','CGHH','ASEC','BDB9','HGMD','CMEF', 'BGDE','YRB5','BJAM')
AND p.duration = 7
AND start_date > current_date
AND start_date < current_date + interval '28 day'
GROUP BY p.accom_code

At the present time i'm getting 10 records (as expected). However I believe these records are WRONG as the minimum price doesn't necessarily match up with the minimum date for a particular (correct me i'm wrong here). How do I go about correcting this?|||Ooh think I nearly got it, can someone verify this :

SELECT p.accom_code
,x.min_start
,MIN(CAST (CASE p.pricing_type
WHEN 'UN' THEN p.unit_price
WHEN 'PA' THEN p.indic_price*p.adult_age_max_cnt
ELSE p.indic_price
END AS NUMERIC(7,2))) as min_price
FROM package p
JOIN (
SELECT p.accom_code,MIN(p.start_date) as min_start
FROM package p
WHERE accom_code IN ('DDNA','ADE9','CGHH','ASEC','BDB9','HGMD','CMEF', 'BGDE','YRB5','BJAM')
AND p.duration = 7
AND start_date > current_date
AND start_date < current_date + interval '28 day'
GROUP BY p.accom_code
) x
ON p.accom_code = x.accom_code AND x.min_start = p.start_date
WHERE p.accom_code IN ('DDNA','ADE9','CGHH','ASEC','BDB9','HGMD','CMEF', 'BGDE','YRB5','BJAM')
AND p.duration = 7
AND start_date > current_date
AND start_date < current_date + interval '28 day'
GROUP BY p.accom_code,x.min_start

In theory it should give the cheapest price on the earliest start date for each of the 10 accomodation types.

Edit : Just to finish it off I needed the start_date out of it as well ;) (added x.min_start to select and group clause)|||looks okay to me

i would not use the same alias "p" in more than one place in the query

and i would probably remove one of the following:AND p.duration = 7
AND p.duration = 7;)|||LOL oh yeah. I've cleaned it up in the above post now.
However I do have one question to finish it off (and this annoys me about of lot of SQL i've done in the past) :
Is there any way to remove the duplicate WHERE clauses in a query like this, and somehow apply it only once but for both sections? I hope that question makes sense.|||Is there any way to remove the duplicate WHERE clauses in a query like thisdefine a view|||Btw, your queries won't work at all in SQL because you used delimited schema/table/column names in the CREATE TABLE statement (i.e. double-quotes) but not in your query.|||That DDL was written in PMS for Postgres so it's a pure copy. I did mention that I was using Postgresql in my first post.

What I DID want to be standard was the SQL query I was doing. :D|||Well, then you have to use delimited schema/table/column names in your query - as I said.|||I have come across a problem with the query above. If two records calculate the same minimum price for one accom_code I can get two results when joined to itself.
i.e. min price and min date is £189 and 25/05/07
when self-joined total price = £567 / 3 adults = £189
and total price = £378 / 2 adults = £189

They both match the minimum price and date and thus both records are output. DOH!

As a quick hack (and I don't like this method) I did a distinct and ordered by total price minimum first.|||Is there any way to remove the duplicate WHERE clauses in a query like this, and somehow apply it only once but for both sections?
For this purpose, SQL99 defines "common table expressions", i.e., a "WITH" subclause of a select statement. Not yet available in the current version of PostgreSQL but coming soon (8.4 probably) ;)
(DB2, Oracle and SQLServer already have them in place.)
Your query, with CTEs, would become:WITH p AS
( SELECT accom_code, start_date,
MIN(CAST (CASE pricing_type
WHEN 'UN' THEN unit_price
WHEN 'PA' THEN indic_price*adult_age_max_cnt
ELSE indic_price
END AS NUMERIC(7,2))) as min_price
FROM package
WHERE accom_code IN ('DDNA','ADE9','CGHH','ASEC',
'BDB9','HGMD','CMEF','BGDE','YRB5','BJAM')
AND duration = 7
AND start_date > current_date
AND start_date < current_date + interval '28 day'
),
x AS
( SELECT p.accom_code,MIN(p.start_date) as min_start
FROM p
GROUP BY p.accom_code
)
SELECT p.accom_code,x.min_start,p.min_price
FROM p INNER JOIN x
ON p.accom_code = x.accom_code AND x.min_start = p.start_date
GROUP BY p.accom_code,x.min_start
These are effectively "local view definitions", cf suggestion by r937.
(Note that in the above query, the SELECT .. p.min_price is not compatible with the GROUP BY -- maybe add p.min_price to the GROUP BY? Or replace it by "MIN(p.min_price)"?)sql

Monday, March 19, 2012

Min within a Group query

Hi
We have a table structure for storing away Hires in a SQL Server database.
Related to this table is another table that stores events/logs that have
occurred on the Hires table. I.e. Record Created, Modified, Price Changed,
etc. We are trying to do a query that will return the first log for each
hire and then return a few extra fields too. The basic table structure is
below
Hires
--
HireID
ClientID
Status
Cancelled
HireLog
--
HireLogID
HireID
LogDate
Comment
EventType
OperatorID
At first I thought we could do the following:
SELECT dbo.Hires.HireID, MIN(dbo.HireLog.HireLogID) AS HireLogID,
dbo.HireLog.LogDate, dbo.HireLog.Comment, dbo.HireLog.EventType,
dbo.HireLog.OperatorID
FROM dbo.Hires INNER JOIN dbo.HireLog ON dbo.Hires.HireID =
dbo.HireLog.HireID
GROUP BY dbo.Hires.HireID, dbo.HireLog.LogDate, dbo.HireLog.Comment,
dbo.HireLog.EventType, dbo.HireLog.OperatorID
ORDER BY dbo.Hires.HireID DESC
This works great with just the Hire ID field and the Min(HireLogID), but as
soon as you add the other fields the Group By causes the query to return all
the other Logs for the Hire too.
Is there any way around it?What is HireLogID? If that's an IDENTITY column then it's probably
unwise to rely on it to determine the earliest row. The reason is that
you don't always have full control over the order in which IDENTITY
values are assigned. IDENTITY should be used only as an arbitrary
surrogate key without any ascribed business meaning.
In this case it looks like you'll wanr to use LogDate to determine the
first row for each Hire. Declare (hireid, logdate) as unique to ensure
you have a unique sequence.
SELECT hireid, hirelogid, logdate, comment, eventtype, operatorid
FROM HireLog AS L
WHERE logdate =
(SELECT MIN(logdate)
FROM HireLog
WHERE hireid =L.hireid)
David Portas
SQL Server MVP
--|||Chris,
Just one minor, (probably unnecessary) addition..
If you are storing Date and TIme in LogDate, then the chance of anyone
recording two records in HireLog with the same HireID and LogDate is very
unlikely, and probabl;y impossible, ignore this, David's solution should wor
k
fine...
but if your application logic is only storing the date, without the time
portion, in logDate, then you will need to handle the case where are multipl
e
records with the same value for both HireID and LogDate.
The only way to do that, given your schema, is to use the HireLogID as a
discriminant. (David's comment about no guarantees as to which is REALLY
earliest apply here, but, if you don;t have the time portion of the date
stored, then there's no way to distinquish among multiple records on a
specific day anyway.)
Select hirelogid, hireid, logdate,
comment, eventtype, operatorid
From HireLog L
Where hirelogid =
(Select Min(hirelogid)
From HireLog
Where hireid = L.hireid
And logdate = (Select Min(LogDate)
From HireLog
Where hireid = L.hireid))
"David Portas" wrote:

> What is HireLogID? If that's an IDENTITY column then it's probably
> unwise to rely on it to determine the earliest row. The reason is that
> you don't always have full control over the order in which IDENTITY
> values are assigned. IDENTITY should be used only as an arbitrary
> surrogate key without any ascribed business meaning.
> In this case it looks like you'll wanr to use LogDate to determine the
> first row for each Hire. Declare (hireid, logdate) as unique to ensure
> you have a unique sequence.
> SELECT hireid, hirelogid, logdate, comment, eventtype, operatorid
> FROM HireLog AS L
> WHERE logdate =
> (SELECT MIN(logdate)
> FROM HireLog
> WHERE hireid =L.hireid)
> --
> David Portas
> SQL Server MVP
> --
>|||Thanks for you help.
Chris
"David Portas" <REMOVE_BEFORE_REPLYING_dportas@.acm.org> wrote in message
news:1110365299.387914.40640@.g14g2000cwa.googlegroups.com...
> What is HireLogID? If that's an IDENTITY column then it's probably
> unwise to rely on it to determine the earliest row. The reason is that
> you don't always have full control over the order in which IDENTITY
> values are assigned. IDENTITY should be used only as an arbitrary
> surrogate key without any ascribed business meaning.
> In this case it looks like you'll wanr to use LogDate to determine the
> first row for each Hire. Declare (hireid, logdate) as unique to ensure
> you have a unique sequence.
> SELECT hireid, hirelogid, logdate, comment, eventtype, operatorid
> FROM HireLog AS L
> WHERE logdate =
> (SELECT MIN(logdate)
> FROM HireLog
> WHERE hireid =L.hireid)
> --
> David Portas
> SQL Server MVP
> --
>|||> but, if you don;t have the time portion of the date
> stored, then there's no way to distinquish among multiple records on
a
> specific day anyway
... and therefore the business requirement to display only the earliest
row would be fatally flawed, and anyway, what would be the natural key
of the table in that scenario? That is indeed the price you pay for
tables without proper keys.
David Portas
SQL Server MVP
--|||This is a date + time field.
Thanks.
Chris
"David Portas" <REMOVE_BEFORE_REPLYING_dportas@.acm.org> wrote in message
news:1110367501.066963.91030@.z14g2000cwz.googlegroups.com...
> a
> ... and therefore the business requirement to display only the earliest
> row would be fatally flawed, and anyway, what would be the natural key
> of the table in that scenario? That is indeed the price you pay for
> tables without proper keys.
> --
> David Portas
> SQL Server MVP
> --
>