Open Source RDBMS - Seamless, Scalable, Stable and Free

한국어 | Login |Register

Current Events
Join our developers event to win one of the valuable prizes!
posted 3 years ago
viewed 1695 times
Share this article

CUBRID it! Programming Contest Overview – Part 2

This is the second part of the CUBRID it! contest overview. We've already covered the first part, where we focused on the contest problem itself, its possible solutions, and important details participants had to pay attention to. We highly recommend to read the first part of this blog before reading further.

In this second part we will look into some of the most interesting code and approaches we have received from you. We hope you can learn something new from solutions submitted by other users.

Getting the results

Excluding unwanted columns by filtering on data types

Besides the obvious data types we want to exclude – like INTEGER-ones, maybe the most challenging one is NUMERIC, because it requires an extra analysis of the scale (the number of digits after the decimal point) value, for example:

if($column['type']==='NUMERIC' && $column['scale']===0)
{
    $column['type']='INTEGER';
}

By the way, here is a nice way to get table columns, without getting any data “overhead”:

//Empty SQL request (with "LIMIT 0") for getting columns of current table
$result = cubrid_execute( $this->conn , "select * from {$table['NAME']} limit 0;");

How do you verify that values are “non-numeric”?

The most used approach was to use the TRANSLATE function:

LENGTH( TRANSLATE( CAST( $tColumn AS VARCHAR(255) ), '0123456789', '' ) ) > 0
//Condition to find non-numeric values:
HAVING(LENGTH(TRANSLATE(CAST(%s AS VARCHAR(255)), '0123456789', '')) > 0)

Some users preferred to make use of regular expressions:

java.util.regex.Pattern.compile("^[0-9]+$");
java.util.regex.Pattern.compile( "d+" );
java.util.regex.Pattern.compile(".*[^0-9].*");
preg_match("/^[0-9]+$/i")
java.util.regex.Pattern.compile(".*[^d].*")
…

Be aware – generally speaking, regular expressions can have an impact on performance, when dealing with long strings or complicated patterns to match.

Many users just went with an explicit “in-code” validation, for example:

private static boolean isNumeric(String value) {
    char[] array = value.toCharArray();
 
    for(char ch: array) {
        if(ch < '0'||ch > '9') {
            return false;
        }
    }
 
    return true;
}

Please note that because the most expensive tasks are still the database queries, these various approaches shouldn’t make a great difference in terms of execution time. However, you should still try to optimize an approach like this one, where a lot of iterations are being done for every value:

protected final boolean isDigit(final char c) {
    for (int i = 0; i < DIGITS.length; i++) {
        if (DIGITS[i] == c) {
            return true;
        }
    }
 
    return false;
}

How to find the “most duplicated” value...?

Some users used ON DUPLICATE KEY UPDATE to store each values number of occurrences, for example:

$q = 'insert into temp ("hash","val") ' .
       'SELECT 1,' . $toSelect . ' ' .
       'FROM `' . $qtable.'` ' .
       $criteria . ' ' .
       'ON DUPLICATE KEY UPDATE "count" = "count" + 1';

And, as you see above, if at the same time you also calculate the number of occurrences, that’s even better, of course!

An interesting “twist” – one user choose to use “cubrid_unbuffered_query”, to speed up processing:

// Use cubrid_unbuffered_query to improve the memory performance, since we don't
// know how many records are per table
$result = cubrid_unbuffered_query($sql, $this->conn);

Many users went by with a GROUP BY, after previously selecting all columns’ values:

$SQL = "SELECT `val`, SUM(`cnt`) AS totCnt FROM temp GROUP BY `val` ORDER BY `totCnt` DESC";

In general, the server-side solutions stored the values in temporary tables:

execute(conn, String.format("CREATE TABLE %s (val VARCHAR(255), occurence INTEGER)", tableName));

... followed by finding the desired value by executing a query on that table:

PreparedStatement maxValueStmt = conn.prepareStatement(
    "SELECT D_VALUE, SUM(D_NUMBER) S FROM DUPVALUE_TMP GROUP BY D_VALUE
" +
    "HAVING SUM(D_NUMBER) = (SELECT MAX(DUP.S) FROM 
" +
    "(SELECT SUM(D_NUMBER) S FROM DUPVALUE_TMP GROUP BY D_VALUE) DUP)"
);

As we have mentioned before, most of the solutions implemented a client-side approach. For example – see the comment below:

/**
* For each table in the list, selects everything in them and puts them in a HashMap
* If it does not exist (i.e. resultsMap.containsKey == false), it's added, else, it's incremented
* On-the-fly, it determines the maximum occurrence number, by checking it agains a variable (that is initially 0)
* The main idea is actually based around the efficiency of the Java HashMap class.
* The keys are the query results, while the values are their occurrences.
*/

What about a CONNECT BY…?

Did you know that CUBRID supports hierarchical queries - CONNECT BY?

We had one submission which used it:

String sqlGetQueries =
    "SELECT MAX(SYS_CONNECT_BY_PATH(text, '')) AS myQuery " +
    "FROM " +
    "( " +
    "              SELECT ROWNUM AS r, 'INSERT INTO _values (SELECT CAST(' + a.attr_name + ' as VARCHAR(255)) from ' + c.class_name + '); ' AS text " +
    "              FROM _db_class c " +
    "                              JOIN _db_attribute a ON c = a.class_of " +
    "                              JOIN _db_data_type t ON a.data_type = t.type_id " +
    "              WHERE " +
    "                              c.is_system_class = 0 " +
    "                              AND " +
    "                              t.type_name in ('VARCHAR', 'CHAR', 'STRING', 'INT', 'SMALLINT', 'BIGINT', 'NUMERIC', 'FLOAT', 'DOUBLE', 'DATE', 'TIME', 'DATETIME', 'TIMESTAMP') " +
    ") AS res " +
    "START WITH r = 1 " +
    "CONNECT BY PRIOR r = r - 1;";

Using external libraries to speed up data processing

We had a participant who used a memcache implementation:

import net.rubyeye.xmemcached….

Updating the “results” table

Some users did combine DELETE and INSERT in just one SQL:

$stmt = $this->conn->prepare("DELETE FROM results WHERE userid='" . $this->userid . "'; "
."INSERT INTO results (userid, most_duplicated_value, total_occurrences) "
."VALUES ('" . $this->userid . "', '" . $sMaxWord . "', " . $iMaxFrequency . ")");
$stmt->execute();

Others used REPLACE INTO:

REPLACE INTO RESULTS (userid, most_duplicated_value, total_occurrences) VALUES(?, ?, ?)

And some used ON DUPLICATE KEY UPDATE:

String updateTempTableSql = String.format("insert into temptable ("value", "count")" +
    " values ('%s',1) ON DUPLICATE KEY UPDATE "count" = "count"+1", value);

Other notes

Here are some code/solutions we came across, but didn’t quite understand why users used these certain ways of implementation…

  1. There were some solutions which used a connection pool. Why?
  2. Some contestants changed explicitly the auto-commit status for the database connection, cancelling what we explicitly set in the code template…
  3. “… ORDER BY NULL” ???
  4. Using “disk” string file storage (Remember - disk operations are the most “expensive”!)
  5. “Trimming” the values:

select trim(cast(".$column["ATTR_NAME"]…

…Oh, and let’s not forget about the non-English comments in the code. This was cool, by the way!

nonEnglishCode.jpg

Ending…

And this concludes the 2nd part of the article regarding the CUBRID it! Coding Contest!

We hope you enjoyed the reading and, more importantly, you have learned some new things… We surely did – so we thank you all for the contributions you have made in the contest!

See you soon at the next CUBRID contest!



comments powered by Disqus