December 12, 2006

SQL Server 2000 Collation - Searching and Ordering on Latin Characters - Collations - Pick Your Searching and Ordering Poison

Introduction
Collations only became important to me because I didn't have a choice – I couldn’t avoid them anymore since I couldn’t live in my little bubble I call Shan’s world. It just…reared its ugly collation head - DAH! I'd always successfully avoided delving into collations too deeply in the past, but of course, ignorance is bliss for only so long. Now I've seen the light and realize that like it or not, you should include database collation strategy in your early database designs. As a last resort, you have to at least consider searching strategies if you are in a situation with databases containing collations that can't be changed.

Problem
The issue came up first with the dreaded é character, or the e with the "acute" accent overtop. One of my client's contact databases contains only contacts with latin characters. Some of the contacts with é in their last name are:

Salinési
Méghirbi
Trouvé
Pérez

The problem that was occurring was that users searching the database were unable to find these types of contacts. They were doing searches using standard web application search forms. For example, they would enter Trouve into a "Last Name" text box, and expect the contact with last name Trouvé to return as a match. Lo and behold that wasn't occurring.

Further, where users would expect Trouvé to be ordered in search results along with Trouve, they weren't. The ordering of the é character was happening after z if I recall correctly (actually this might not have been with this Latin character, have been before a or after z). Anyways it wasn't ordered where users were expecting it to be. The international sampling of application users that I polled agreed that this type of behaviour would be what is expected.

Why Searching Was Broken
Database collations in SQL Server:

  1. Control the sort order of characters
  2. Control the searching behaviour of characters
All SQL Server installations start with a default collation – typically the instance and default databases will share the same collation. Characters in this case are either Unicode or non-Unicode (i.e. NVARCHAR OR VARCHAR).

In the case of the contacts database in question, the database instance collation was set to SQL_Latin1_General_CP1_CI_AS, while the specific database collation was Latin1_General_CI_AS. For what it's worth, it's not an issue that your instance's collation is different from a DB's collation. The contacts table's collation was defaulting to the default collation of the database that it was contained within, Latin1_General_CI_AS. There is a hierarchy of collations resembling this: Database Instance > Database > Table > Column

Database, table, and column string sort and comparisons can all therefore happen with any variety and combination of collations. Of course, in practice it can be a nightmare if it “just happens to be that way”, but it could also be necessary depending on your needs.

It’s therefore obvious that an accent sensitive collation would clearly not work very well for typical real world needs like this.

Collations Available in SQL Server
You can find what your SQL Server installations supported collations are by running the following from your SQL client:

SELECT *
FROM ::fn_helpcollations()
where name like 'Latin1_General_CI_AS'

This returns the following results:
Latin1_General_CI_AS : Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive

What’s in a Collation Name
From SQL Server’s Books Online, a collation name is broken down into these parts:
::sql_::sortrules::CodePage::ComparisonStyle

Without boring you to death on the details of a collation name, I'll leave it to you to review what these parts represent other than the ComparisonStyle part. ComparisonStyle allows you to specify a collation's case sensitivity and accent sensitivity.

Thus, my contact database's default collation of Latin1_General_CI_AS indicated that SQL Server would use the Latin1_General language setting for its sorting rules, and more importantly, that sorting and comparison operations would be case insensitive, accent sensitive.

You should also find a collation named Latin1_General_CI_AI, which indicates case insensitive and accent insensitive operations for sorting and searching. Choosing that collation is the ideal solution to the problem.

Solution
The solution space is pretty wide open and simple once you understand the concepts. Here are some solutions:
  1. Choose a database instance default collation during SQL Server installation that is accent insensitive. This requires that you have complete control from the beginning over your SQL Server installations
  2. If your database instance collation is set, choose a database default collation for your application database. In this case this requires that you are creating a new database from scratch.
  3. If your database collation is set, then you can specify the collation at the table and individual column level.
  4. If your database tables and columns have collations that are already set, then the only option that you have is to hard-wire your desired collations in your queries. For example: SELECT TOP 100 * FROM Contacts WHERE LastName LIKE '%é%' COLLATE Latin1_General_CI_AS. In this example, the desired searching behaviour is accent-sensitive.
Goals
  1. Try to set your collations the same across all of your application databases. This will avoid cross database collation errors that SQL Server will burp up when trying to query data across multiple databases
  2. Choose an accent insensitive collation from the beginning even if you’re not sure if it will be needed. You’d be surprised how often accented characters show up in people’s names
  3. Recreate your databases. This can be time consuming and costly, but there are third party utilities out there that can help, such as Alex Baker’s SQL Server 2000 Collation Changer, although I haven’t tried this – it would be worth testing out to see if it is feasible for you!
Side Topic – What do some developers do to resolve this solution?
When I was researching this issue, some developers had interesting solutions to the problem. For example, I found one posting from a developer that follows:

"Create a separate table using CP1253 (greek) which has no accented characters, and fulltext index that table instead. When I copy the data from CP1 to the CP1253, SQL Server removes the accents so I get back the search results without accents on (which is what I want)"

The author of this newsgroup posting (recent, from 2005, see this posting) pointed out that:

"However, if the search criteria includes accented characters, I get no results. The text engine does not strip the accents off the search string even though the table I indexed is accent insensitive.

It seems that what I need to do is strip accents off the search strings
before submitting the text search. does anyone know of a way to do this?"

Manipulating all of your searches before executing them against your database like this is a bad idea! If you're building enterprise applications, there will inevitably be a large variety of clients requesting data - web apps, reporting tools, IT administration tools, web services, etc. This band-aid solution is only useful for small applications where the client interface is always predictable and highly controllable – i.e. if the only interface to query will be a user interface such as a web page - even then I would not recommend this since you never know how an application may evolve over time nor would you want to take these habits with you to larger projects. It's simply a hack that leaves too many holes open in its solution.

A news group post even followed this, suggesting “create a SQL user defined function that strips away the accented characters”. This of course reinvents the wheel, requires more development and maintenance and storage costs, etc. This just extends the same error prone idea.

Summary – Pick your Poison
I say pick your poison because, you have to design your databases to default case and accent sensitivities one way or the other from the beginning of your project. Inevitably, you will choose your collations using the 80 / 20 rule, but then there will be situations where users need the other 20%. Of course, you can always specify collation rules in your queries in these cases. The thing that I don’t understand is why SQL Server would select default collations that are accent sensitive for searching and sorting – I didn’t research this, I assume it was not a good decision, but it would be interesting to hear your opinions or facts on this!

Hopefully this blog helps you out a bit. I’m not up to speed yet with SQL Server 2005, but I imagine that similar principles will apply there.

Happy collating!

References
SQL Server 2000 Collation Changer, http://www.codeproject.com/vb/net/ChangeCollation.asp

Beware of Mixing Collation with SQL Server 2000, http://www.databasejournal.com/features/mssql/article.php/1587631

How to perform case sensitive searches in SQL Server? http://vyaskn.tripod.com/case_sensitive_search_in_sql_server.htm
SQL Server Books Online

4 comments:

Manuel said...

I believe case insensitive + accent sensitive is the default b/c that matches with Windows file naming rules.

cdgnfg said...

EVEN by wow gold the standards gold in wow of the worst financial buy wow gold crisis for at least wow gold cheap a generation, the events of Sunday September 14th and the day before were extraordinary. The weekend began with hopes that a deal could be struck,maplestory mesos with or without government backing, to save Lehman Brothers, America''s fourth-largest investment bank.sell wow gold Early Monday buy maplestory mesos morning Lehman maplestory money filed for Chapter 11 bankruptcy protection. It has more than maplestory power leveling $613 billion of debt.Other vulnerable financial giants scrambled maple money to sell themselves or raise enough capital to stave off a similar fate. billig wow gold Merrill Lynch, the third-biggest investment bank, sold itself to Bank of America (BofA), an erstwhile Lehman suitor,wow power leveling in a $50 billion all-stock deal.wow power leveling American International Group (AIG) brought forward a potentially life-saving overhaul and went maple story powerleveling cap-in-hand to the Federal Reserve. But its shares also slumped on Monday.

Anonymous said...

Hi
You can use the reliable code that a friend (Alex Baker) published at CodeProject:
SQL Server 2000 Collation Changer. (Free source code)
http://www.codeproject.com/KB/database/ChangeCollation.aspx?fid=257710&df=90&mpp=25&noise=3&sort=Position&view=Quick&fr=76&select=2176451

Alain

Note: Always keep a backup of your database when you make major changes.

Web Tasarım said...

Sorry my english firstly, i read well but i cant write :), So its useful post for me and i bookmarked your blog.

warez indir, oyun indir, dizi indir, program indir, photoshop indir

web tasarımfotokopi, dijital cikis, ozalit, kirtasiyesesli chat