Open Source RDBMS - Seamless, Scalable, Stable and Free

한국어 | Login |Register

Current Events
Join our developers event to win one of the valuable prizes!
posted 3 years ago
viewed 2806 times
Share this article

CUBRID it! Programming Contest Overview - Part 1

Now that the contest is closed and we have announced the happy winners, let’s remember the contest problem and take a look at the solutions you have sent us. We will divide this blog into two parts. In the first part of the article, we will focus on the contest problem itself, its possible solutions, and important details participants had to pay attention to. In the second part, we will walk through the actual code you have submitted.

Remember the challenge?

For a quick recap of the problem you were challenged by, here are the most important parts:

Given all the tables in a database (but only the user tables, not the system tables), determine:

  • The value which occurs the most times (is duplicated the most times) in the database and it is “non-numeric”! (“non-numeric” means that the respective value does not contain only digits – it must contain at least one non-digit character; for example, “1298” is “numeric”, but “12,98” is “non-numeric” ( and also “12.98” ), because it contains the “,” character ; same goes for “-100” – it is considered “non-numeric” as well, because of the “-” sign).
  • The number of occurrences for the value found above.

Additional notes:

  • Assume that the result to the problem is unique; there are no multiple values with the same maximum number of occurrences – just one single value is the most duplicated.
  • If a table column data type is not a string-type, then consider the respective values in the column as casted to “string”; this means that you should deal only with string or string-converted/casted values when doing the analysis. When converting a data type to string, assume that no extra characters will be added (for example, a number converted to string, no matter the system locale or regional settings, will not “get” thousands separators )
  • The tables in the database contain only columns of the following data types: VARCHAR, CHAR, STRING, INT, SMALLINT, BIGINT, NUMERIC, FLOAT, DOUBLE, DATE, TIME, DATETIME and TIMESTAMP.

Note: The full challenge content can be found here.

So how do we solve the problem?

Obviously, the first thing we need to do is to get the list of the columns in the database tables, which meet the criteria. And how do we get this information? Well, there’s more than just one way… 1. Do a SELECT on the database metadata objects. For example, we can use a query like this to filter the columns:

SELECT c. class_name, a.attr_name, a.data_type
FROM db_class c, db_attribute a
WHERE c.class_name = a.class_name
AND c.is_system_class='NO'
AND c.class_type='CLASS'
AND
(
    a.data_type IN ('STRING', 'CHAR', 'TIMESTAMP', 'DATETIME', 'DATE', 'TIME', 'FLOAT', 'DOUBLE')
    OR
    (a.data_type IN ('NUMERIC') and a.scale > 0)
)
AND a.prec <= 255
ORDER BY a.def_order ASC

Note that we need to treat NUMERIC data type separately, depending on the scale of the column – this is why there is a separate condition for the NUMERIC data type.

2. We could use a PHP CUBRID special function “cubrid_schema”:

cubrid_schema($conn, CUBRID_SCH_ATTRIBUTE, …);

3. We could also use the PHP PDO CUBRID special function: “cubrid_schema”:

$tables = cubrid_schema($this->conn, CUBRID_SCH_CLASS);

4. Or, if you use Java, you can get this information by using the JDBC driver functions:

DatabaseMetaData meta = conn.getMetaData();
ResultSet schemas = meta.getSchemas();

while (schemas.next()) {
    String tableSchema = schemas.getString(1);
    String tableCatalog = schemas.getString(2);
…

As a note, out of the four possible approaches we listed above, the first one provides some few additional benefits:

  • It does not depend on the programming language you use
  • It does not require complex extra filtering afterwards on the columns data types

However, any of them will get the job done and the choice performance impact is not significant.

How do we get the values we need?

Now that we have the list of columns that we need to query to identify the requested “most duplicated value”, how do we proceed next…?

Before we go any further, it’s time to clarify an important concept regarding how the data is processed. Based on where the data processing occurs, we can have:

  • Server-side processing
  • Client-side processing

The main difference is where the “data” is actually being processed… …will the “data” be transferred over the network from the database server to the client application to sort it, to filter it, etc., or these processes will be run in the database server?

Obviously, this can make a lot of difference in terms of performance and scalability! Imagine, for example, millions of records being transferred over the network, just so you do a sort in memory, in the Java client… Imagine the overhead, the performance penalty due to the network speed and even more - imagine getting an out of memory exception…

The lesson here is: The recommended way to solve the problem is to minimize the client-side processing in favor of minimizing the overhead!

Only some of the contestants took this approach… Most of the solutions did client-side in-memory processing.

Now, getting back to the problem, having the list of the columns, which need to be processed, will give us a few ways to proceed further with finding “the most duplicated value”:

Using a client-side approach:

  • Query all columns' data and build some array/collection/etc. structure that will make easy to count the occurrences of each value; this approach was chosen by most of the contestants.
  • Use the same approach as above, but additionally optimize the queries.

Using a server-side approach:

  • Query all columns data in such a way that the query will return the value requested, without any additional need to further process anything on the client-side. For example, build a query with a GROUP BY and using a LIMIT.
  • Query all the columns data and directly insert the values in a temporary table, followed by a query with a GROUP BY and a LIMIT.

Using one of these approaches, you will get the requested value, and, from this point on, there is just one last thing to be done – update the results table, which is a pretty straightforward task – we won’t go into details about this last step.

What else you should have looked for?

Besides coding a solution which outputs the correct results, you should have also looked at:

  • The code quality; use generally accepted coding standards, for both Java and PHP, have a “clean” code, easy to understand. For example, in the following code the variables t21 and t22 are not descriptive:
    executeQuery(t21 + t22)

    The following is much better, isn't it?

    /**
    * Defines the type of user tables under CUBRID schema
    * @see http://www.php.net/manual/es/function.cubrid-schema.php
    */
    const USER_TABLES_TYPE = '2';
    
  • Comment your solution. This could be done in the code, in a separate document or in both.
  • Make sure your solution deals properly with exceptions; for example, when there are no accessible tables in the database, or when you get an out of memory exception etc.
  • Test your solution on different databases/table definitions/table data.
  • Make sure you properly “escape” the string values.
  • Take care of the default PHP script execution timeout.
  • Do not modify explicitly the errors reporting level. For example, don’t do:
error_reporting(E_ALL &amp; ~E_NOTICE);

A simple server-side solution

We will conclude this first part of the article with showing a complete server-side solution, which is maybe not the best in terms of performance, but it has the advantage of being simple, very easy to understand and implement – there will be just a few lines of code. Here it is:

Get the list of columns:

SELECT c. class_name, a.attr_name, a.data_type
FROM db_class c, db_attribute a
WHERE c.class_name = a.class_name
AND c.is_system_class='NO'
AND c.class_type='CLASS'
AND
(
    a.data_type IN ('STRING', 'CHAR', 'TIMESTAMP', 'DATETIME', 'DATE', 'TIME', 'FLOAT', 'DOUBLE')
    or
    (a.data_type IN ('NUMERIC') AND a.scale > 0)
)
AND a.prec <= 255
ORDER BY a.def_order ASC

Use the results returned by the previous query to build a query which will return the results we want, by querying all columns data, grouped by distinct values.

And we will return only the “most duplicated value”, using LIMIT.

We will end up with a query like this:

SELECT x.v as val, count(x.v) AS cnt
FROM (
    SELECT s_name as v FROM code
    UNION ALL
    SELECT code as v FROM nation
    UNION ALL
    SELECT nation_code as v FROM stadium
    UNION ALL
    SELECT medal as v FROM game
    UNION ALL
    SELECT f_name as v FROM code
    …
) x
GROUP BY val
ORDER BY cnt desc, val ASC
LIMIT 1

Finally, we update the results table:

REPLACE INTO results …

That’s it – not hard at all…

And this concludes the first part of the CUBRID contest overview article. In the second part of this article, we will walk together through some of the code you've submitted.



comments powered by Disqus