Monday, March 26, 2012

Query governor problem

Hi all,
We're running MS SQL 2000, latest service pack. Windows 2000 server,
advanced server and Windows 2003 server. Assorted hardware.
The above configuration refers to various client sites, and the SQL
we're having problems with is generated by an object/relational
mapping layer so we cannot easily change this.
Our problem is that certain queries are occasionally take far, far
longer to execute than at other times. My suspicion is that this
occurs when the database server is heavily loaded, and this causes it
to pick an alternative, and awful, execution plan. This only makes
things worse for everyone else. The query is given at the end of this
post.
The query will normally execute in less than a second. I can copy it
directly out of the profiler, which shows it is taking 1280 seconds of
cpu time and about 4500 seconds duration, paste it into the query
analyzer of the correct database, and a short while later have it
execute in a flash.
Statistics have very recently been updated. The data will not have
changed significantly either.
Over the next few days and we will be turning off intra query
parallelism and hyperthreading because we know these have caused us
problems in the past and may very well be the cause of things here,
but in the mean time as a very short term measure I have considered
using the query governor.
Unfortunately I don't think this will be the solution. I suspect that
the cpu estimate used by the query governor will always be for the
sub-second execution times (because if the query plan realised it
would take 20 minutes of cpu time to execute the query then it would
certainly have chosen something better, which it clearly can under the
right circumstances. If the query planner fails this way, and the
query governor is presumably driven by the estimate of the query
planner, then the query Governor will not catch these excessive
execution times). I'm pretty sure that the query governor only works
by estimates and does not actually 'watch over' the query and count
the cpu ticks.
So it looks like this will not help. This is a critical issue for our
customers. I'm therefore looking for some way of observing other
queries from a special watchdog job and killing them if the cpu
exceeds a certain amount, say a minute.
I have to look for single queries, not connections as many users are
serialised down to a few connections. Using the enterprise manager's
'process info' page I can see per-connection info but this it is not
fine-grained enough for what I want. I've discovered the sysperfinfo
table but that doesn't seem to help.
Any suggestions?
I suppose it would be better than nothing to be able to watch over
connections and if I could somehow detect excessive cpu activity on
that then I could kill that connection, even if it made a small group
of users unhappy. Better that than the whole site irate. Can anyone
suggest reliable way of doing that? How would I distinguish a very
busy but correctly functioning connection from one which was tied up
too long with a single query?
For what it's worth, when these mad queries start to execute their
lock-up the database server so tightly that it's almost impossible to
get in to kill things manually.
The query:
SELECT A14, A15, A16, A17, A18, A19, A20, A21, A22, A23
FROM (
SELECT <field names removed for confidentiality
reasons>
FROM <ditto table name> T13
) D24
WHERE ((A18 IN (21799))
AND (A17 IN (13575))
)
OR ((A21 IN (2130740, 2413061, 2199038, 2107888,
2147294, 2390317, 2390947, 2262791,
2262540, 2175261, 2436256, 2163102,
2119434, 2272644, 2356412, 2240713,
2282838, 2318792, 2333253, 2230674,
2083129, 2402966, 2367886, 2187455,
2297886, 2251478, 2069408, 2309686,
2450636, 2424679, 2061268, 2219841
)
)
AND (A17 IN (18583))
)
OR ((A18 IN (2130740, 2413061, 2199038, 2107888,
2147294, 2390317, 2390947, 2262791,
2262540, 2175261, 2436256, 2163102,
2119434, 2272644, 2356412, 2240713,
2282838, 2318792, 2333253, 2230674,
2083129, 2402966, 2367886, 2187455,
2297886, 2251478, 2069408, 2309686,
2450636, 2424679, 2061268, 2219841
)
)
AND (A17 IN (14849, 15468))
)
OR ((A21 IN (2129425, 2412043, 2197272, 2106911,
2146319, 2389802, 2260770, 2173489,
2434610, 2161990, 2117737, 2271342,
2355571, 2239606, 2282111, 2316669,
2332031, 2229624, 2081664, 2401161,
2366604, 2186416, 2296450, 2250431,
2067383, 2308778, 2450673, 2423002,
2061282, 2218767
)
)
AND (A17 IN (18583))
)
OR ((A18 IN (2129425, 2412043, 2197272, 2106911,
2146319, 2389802, 2260770, 2173489,
2434610, 2161990, 2117737, 2271342,
2355571, 2239606, 2282111, 2316669,
2332031, 2229624, 2081664, 2401161,
2366604, 2186416, 2296450, 2250431,
2067383, 2308778, 2450673, 2423002,
2061282, 2218767
)
)
AND (A17 IN (14849, 15468))
)
OR ((A18 IN (75300, 84834, 77226, 74551, 75693,
83958, 79587, 76442, 85713, 76074,
74904, 79967, 82740, 78787, 80387,
81446, 81966, 78422, 73758, 84374,
83124, 76859, 80771, 79203, 81187,
86351, 85251, 73040, 78012
)
)
AND (A17 IN (15818, 15790))
)
OR ((A18 IN (6434, 6835, 6511, 6652, 7338, 7381,
7516, 7053, 8076, 6751, 7212, 8346,
5951
)
)
AND (A17 IN (15862))
)
OR ((A18 IN (88417, 102177, 91248, 87350, 88994,
101065, 94706, 90093, 103296, 86807,
89570, 87850, 95275, 99386, 93535,
95910, 92699, 97459, 98251, 92997,
86154, 101590, 99922, 90709, 96465,
94143, 85304, 97084, 99933, 87091,
104110, 102719, 84897, 100531, 92404
)
)
AND (A17 IN (38577, 37251, 12998))
)
OR ((A18 IN (6014, 6357, 6075, 6196, 6790, 6829,
6950, 6542, 7442, 6277, 6671, 7689,
5614
)
)
AND (A17 IN (15847))
)
OR ((A18 IN (21799))
AND (A17 IN (14519))
)
FYI I have seen queries like this cause problems in the past, and on a
single processor machines. It appears that large combinations of
'and's and 'or's can cause big problems with the query planner,
possibly made worse if the statistics read out of date (which they are
not here).
Any help appreciated
cheers
jan
Jan,
AFIK, you're correct, the query governor works with estimated cost, not
actual.
Can you tell for certain that the runaway queries are parallelized, i.e.
using sp_who2 and looking for multiple rows per spid?
If bad parallel plans are the issue, then lowering Max DOP to 4 or 1 might
help, as it looks like you'll be testing.
Can you capture the query plan in Profiler and compare a slow version's plan
with a fast query's plan in QA?
That would help verify that a bad plan is truly the cause.
The only caveat I've heard about hyperthreading is not to mix it with
affinitizing.
Some other things to try:
Look for issues around waits and queues, using DBCC SQLPERF(waitstats). Find
out what the slow queries are waiting on.
Inspect the procedure cache (syscacheobjects) to determine whether these
plans are being reused.
There could be pressure on the cache if they are dynamically generated and
can't be reused.
Also check on the cache plan size to see if it's excessively large. I've
seen very large plans hose up a system.
For more info about waits and queues, and syscacheobjects, see the articles
in SQL Magazine by Tom Davidson.
Hope this helps,
Ron
Ron Talmage
SQL Server MVP
"jan" <anomalocarus@.hotmail.com> wrote in message
news:62da4c70.0409160107.3f7d305c@.posting.google.c om...
> Hi all,
> We're running MS SQL 2000, latest service pack. Windows 2000 server,
> advanced server and Windows 2003 server. Assorted hardware.
> The above configuration refers to various client sites, and the SQL
> we're having problems with is generated by an object/relational
> mapping layer so we cannot easily change this.
> Our problem is that certain queries are occasionally take far, far
> longer to execute than at other times. My suspicion is that this
> occurs when the database server is heavily loaded, and this causes it
> to pick an alternative, and awful, execution plan. This only makes
> things worse for everyone else. The query is given at the end of this
> post.
> The query will normally execute in less than a second. I can copy it
> directly out of the profiler, which shows it is taking 1280 seconds of
> cpu time and about 4500 seconds duration, paste it into the query
> analyzer of the correct database, and a short while later have it
> execute in a flash.
> Statistics have very recently been updated. The data will not have
> changed significantly either.
> Over the next few days and we will be turning off intra query
> parallelism and hyperthreading because we know these have caused us
> problems in the past and may very well be the cause of things here,
> but in the mean time as a very short term measure I have considered
> using the query governor.
> Unfortunately I don't think this will be the solution. I suspect that
> the cpu estimate used by the query governor will always be for the
> sub-second execution times (because if the query plan realised it
> would take 20 minutes of cpu time to execute the query then it would
> certainly have chosen something better, which it clearly can under the
> right circumstances. If the query planner fails this way, and the
> query governor is presumably driven by the estimate of the query
> planner, then the query Governor will not catch these excessive
> execution times). I'm pretty sure that the query governor only works
> by estimates and does not actually 'watch over' the query and count
> the cpu ticks.
> So it looks like this will not help. This is a critical issue for our
> customers. I'm therefore looking for some way of observing other
> queries from a special watchdog job and killing them if the cpu
> exceeds a certain amount, say a minute.
> I have to look for single queries, not connections as many users are
> serialised down to a few connections. Using the enterprise manager's
> 'process info' page I can see per-connection info but this it is not
> fine-grained enough for what I want. I've discovered the sysperfinfo
> table but that doesn't seem to help.
> Any suggestions?
> I suppose it would be better than nothing to be able to watch over
> connections and if I could somehow detect excessive cpu activity on
> that then I could kill that connection, even if it made a small group
> of users unhappy. Better that than the whole site irate. Can anyone
> suggest reliable way of doing that? How would I distinguish a very
> busy but correctly functioning connection from one which was tied up
> too long with a single query?
> For what it's worth, when these mad queries start to execute their
> lock-up the database server so tightly that it's almost impossible to
> get in to kill things manually.
> The query:
> SELECT A14, A15, A16, A17, A18, A19, A20, A21, A22, A23
> FROM (
> SELECT <field names removed for confidentiality
> reasons>
> FROM <ditto table name> T13
> ) D24
> WHERE ((A18 IN (21799))
> AND (A17 IN (13575))
> )
> OR ((A21 IN (2130740, 2413061, 2199038, 2107888,
> 2147294, 2390317, 2390947, 2262791,
> 2262540, 2175261, 2436256, 2163102,
> 2119434, 2272644, 2356412, 2240713,
> 2282838, 2318792, 2333253, 2230674,
> 2083129, 2402966, 2367886, 2187455,
> 2297886, 2251478, 2069408, 2309686,
> 2450636, 2424679, 2061268, 2219841
> )
> )
> AND (A17 IN (18583))
> )
> OR ((A18 IN (2130740, 2413061, 2199038, 2107888,
> 2147294, 2390317, 2390947, 2262791,
> 2262540, 2175261, 2436256, 2163102,
> 2119434, 2272644, 2356412, 2240713,
> 2282838, 2318792, 2333253, 2230674,
> 2083129, 2402966, 2367886, 2187455,
> 2297886, 2251478, 2069408, 2309686,
> 2450636, 2424679, 2061268, 2219841
> )
> )
> AND (A17 IN (14849, 15468))
> )
> OR ((A21 IN (2129425, 2412043, 2197272, 2106911,
> 2146319, 2389802, 2260770, 2173489,
> 2434610, 2161990, 2117737, 2271342,
> 2355571, 2239606, 2282111, 2316669,
> 2332031, 2229624, 2081664, 2401161,
> 2366604, 2186416, 2296450, 2250431,
> 2067383, 2308778, 2450673, 2423002,
> 2061282, 2218767
> )
> )
> AND (A17 IN (18583))
> )
> OR ((A18 IN (2129425, 2412043, 2197272, 2106911,
> 2146319, 2389802, 2260770, 2173489,
> 2434610, 2161990, 2117737, 2271342,
> 2355571, 2239606, 2282111, 2316669,
> 2332031, 2229624, 2081664, 2401161,
> 2366604, 2186416, 2296450, 2250431,
> 2067383, 2308778, 2450673, 2423002,
> 2061282, 2218767
> )
> )
> AND (A17 IN (14849, 15468))
> )
> OR ((A18 IN (75300, 84834, 77226, 74551, 75693,
> 83958, 79587, 76442, 85713, 76074,
> 74904, 79967, 82740, 78787, 80387,
> 81446, 81966, 78422, 73758, 84374,
> 83124, 76859, 80771, 79203, 81187,
> 86351, 85251, 73040, 78012
> )
> )
> AND (A17 IN (15818, 15790))
> )
> OR ((A18 IN (6434, 6835, 6511, 6652, 7338, 7381,
> 7516, 7053, 8076, 6751, 7212, 8346,
> 5951
> )
> )
> AND (A17 IN (15862))
> )
> OR ((A18 IN (88417, 102177, 91248, 87350, 88994,
> 101065, 94706, 90093, 103296, 86807,
> 89570, 87850, 95275, 99386, 93535,
> 95910, 92699, 97459, 98251, 92997,
> 86154, 101590, 99922, 90709, 96465,
> 94143, 85304, 97084, 99933, 87091,
> 104110, 102719, 84897, 100531, 92404
> )
> )
> AND (A17 IN (38577, 37251, 12998))
> )
> OR ((A18 IN (6014, 6357, 6075, 6196, 6790, 6829,
> 6950, 6542, 7442, 6277, 6671, 7689,
> 5614
> )
> )
> AND (A17 IN (15847))
> )
> OR ((A18 IN (21799))
> AND (A17 IN (14519))
> )
> FYI I have seen queries like this cause problems in the past, and on a
> single processor machines. It appears that large combinations of
> 'and's and 'or's can cause big problems with the query planner,
> possibly made worse if the statistics read out of date (which they are
> not here).
> Any help appreciated
> cheers
> jan
|||Hi Ron,
Thanks for your response, sorry for the delay.
Just disabling intra-query parallelism seemed to do the trick -- the
client has needed no reboots since, and my tracking of the cpu use
shows a much more reasonable (ie. far less 'spiky') behaviour.
While this proves nothing, it is certainly strong evidence that the
IQP was the problem. I was not able to get any different query plans
under different loads on a test server, and when the client was having
problems the database server locked up *tight* -- even logging in
locally (ie. by a person on site) could take 5 or 10 minutes. Even
starting a profiler was almost impossible. Checking query plans or
blocking etc. was not feasible. Not a good situation for debugging.
The warning about disabling hyperthreading/SMT was given to us
verbally by one of the technical support engineers (bless them,
they're wonderful) but he was quite clear.
The SQL is dynamically generated, and currently they are very
un-reusable. They can also be very large, and I have absolutely no
doubt that there is plenty of pressure being put on the procedure
cache. We'll deal with that in a release or two.
I will definitely follow up that article you recommended.
Thanks again
jan
"Ron Talmage" <rtalmage@.prospice.com> wrote in message news:<eysruLAnEHA.3876@.TK2MSFTNGP15.phx.gbl>...[vbcol=seagreen]
> Jan,
> AFIK, you're correct, the query governor works with estimated cost, not
> actual.
> Can you tell for certain that the runaway queries are parallelized, i.e.
> using sp_who2 and looking for multiple rows per spid?
> If bad parallel plans are the issue, then lowering Max DOP to 4 or 1 might
> help, as it looks like you'll be testing.
> Can you capture the query plan in Profiler and compare a slow version's plan
> with a fast query's plan in QA?
> That would help verify that a bad plan is truly the cause.
> The only caveat I've heard about hyperthreading is not to mix it with
> affinitizing.
> Some other things to try:
> Look for issues around waits and queues, using DBCC SQLPERF(waitstats). Find
> out what the slow queries are waiting on.
> Inspect the procedure cache (syscacheobjects) to determine whether these
> plans are being reused.
> There could be pressure on the cache if they are dynamically generated and
> can't be reused.
> Also check on the cache plan size to see if it's excessively large. I've
> seen very large plans hose up a system.
> For more info about waits and queues, and syscacheobjects, see the articles
> in SQL Magazine by Tom Davidson.
> Hope this helps,
> Ron
> --
> Ron Talmage
> SQL Server MVP
>
> "jan" <anomalocarus@.hotmail.com> wrote in message
> news:62da4c70.0409160107.3f7d305c@.posting.google.c om...
[snip orig post]

No comments:

Post a Comment