Tuesday, February 24, 2009

Blogging at posulliv.com Now

I've decided to start blogging with WordPress now as it just makes putting code samples and things like that so much easier! Also, I had a domain name for a while so I figured I might as well use it and it will encourage me to blog more. I really want to blog more at the moment since I'm working on an interesting project for my database course this semester.

You can find my blog here from now on. I've migrated everything from here to there so all of these posts will still be available.

Wednesday, January 28, 2009

Drizzle: A Pretty Cool Project

Drizzle is a pretty cool project whose progress I've started following in the last few weeks. I'm trying to contribute in a tiny way if I can by confirming bug reports. If I had more time, I'd like to try resolving some bugs. Hopefully, I'll find some spare time to do that in the future.

I think it's definitely a project worth keeping an eye on, though. Check it out if you have the time.

Saturday, January 24, 2009

A Subtle Bug

At university, I work in a research group where we are developing an application in C++ that runs on both Linux and Windows. Since I do most of my development on Linux, I rarely test our application on Windows (other people in the group who run Windows test on that platform). Recently, one of my colleagues was encountering a problem while running our application on Windows that I was not encountering when running it on Linux.

I was able to track the issue down to a single piece of code and produce a simple test case that reproduced the problem. Essentially, the problem was due to a piece of code like the following:
#include <iostream>
#include <map>

using namespace std;

int
main()
{
    map<char,int> mymap;
    map<char,int>::iterator iter;

    mymap['a'] = 10;
    mymap['b'] = 20;
    mymap['c'] = 30;
    mymap['d'] = 40;
    mymap['e'] = 50;
    mymap['f'] = 60;

    for (iter = mymap.begin(); iter != mymap.end(); iter++) {
        cout << "erasing " << iter->first << endl;
        mymap.erase(iter);
    }
}

Compiling the above code on Linux with gcc 4.2.3 and running it produces the following output (which is what is intended):

$ ./stuff
erasing a
erasing b
erasing c
erasing d
erasing e
erasing f
$

Compiling and running the same code on Windows causes an issue. The following output is produced (and execution halts):

$ ./stuff.exe
erasing a
erasing ^

Seeing the simple test case above, the actual issue may become apparent. However, it was not so apparent in the source code for our application. The issue is due to the way elements are being erased in the for loop. Referring to the documentation for STL map, we get the following paragraph:

Map has the important property that inserting a new element into a map does not invalidate iterators that point to existing elements. Erasing an element from a map also does not invalidate any iterators, except, of course, for iterators that actually point to the element that is being erased.

Thus, the likely reason for the issue is that as soon as the element is erased, the current iterator is invalidated, and on the next trip through the loop, iter++ calculates the next iterator from the (now) invalid current iterator. Incrementing an invalidated iterator is undefined behavior, so it can appear to work on one platform and wind up pointing to an invalid area on another.

We think (don't know for sure) that we are seeing different behavior on the two platforms due to different implementations of the STL library or perhaps because of different implementations of the underlying OS calls such as free().

Our method for getting around this issue was to move the calculation of the next iterator (iter++) into the erase() statement so that the next iterator is calculated based on a valid iterator. Thus, the test case ends up looking as follows:
#include <iostream>
#include <map>

using namespace std;

int
main()
{
    map<char,int> mymap;
    map<char,int>::iterator iter;

    mymap['a'] = 10;
    mymap['b'] = 20;
    mymap['c'] = 30;
    mymap['d'] = 40;
    mymap['e'] = 50;
    mymap['f'] = 60;

    for (iter = mymap.begin(); iter != mymap.end(); ) {
        cout << "erasing " << iter->first << endl;
        mymap.erase(iter++);
    }
}

The above code runs correctly on both Linux and Windows. This was a subtle bug that was perhaps not as apparent as it should have been to me. My excuse is that I didn't write this piece of code so it took a little while longer for me to debug it.
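
For what it's worth, here is a minimal sketch of two ways to write an erase-while-iterating loop safely, this time erasing only some of the elements. The helper names (erase_below_post_increment and erase_below_returned_iterator) and the threshold parameter are made up for illustration and are not from our application. The first helper uses the same erase(iter++) idiom as above. The second assumes a C++11 (or later) compiler, where map::erase returns an iterator to the element following the erased one.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical helper: erase every entry whose value is below `threshold`
// using the post-increment idiom (works with C++98 and later).
std::size_t erase_below_post_increment(std::map<char, int>& m, int threshold)
{
    const std::size_t original_size = m.size();
    std::size_t erased = 0;

    for (std::map<char, int>::iterator iter = m.begin(); iter != m.end(); ) {
        if (iter->second < threshold) {
            // iter++ advances iter first and hands erase() a copy of the old
            // position, so iter never refers to the erased element.
            m.erase(iter++);
            ++erased;
        } else {
            ++iter;
        }
    }

    assert(m.size() + erased == original_size);  // sanity check
    return erased;
}

// Hypothetical helper: same behavior, but relying on the iterator that
// map::erase returns (assumes C++11 or later).
std::size_t erase_below_returned_iterator(std::map<char, int>& m, int threshold)
{
    const std::size_t original_size = m.size();
    std::size_t erased = 0;

    for (std::map<char, int>::iterator iter = m.begin(); iter != m.end(); ) {
        if (iter->second < threshold) {
            // erase() returns the position after the erased element.
            iter = m.erase(iter);
            ++erased;
        } else {
            ++iter;
        }
    }

    assert(m.size() + erased == original_size);  // sanity check
    return erased;
}
```

With the map from the test case above (values 10 through 60), calling either helper with a threshold of 35 would erase a, b and c and return 3. In both versions the increment moves out of the for statement, so the loop never touches an iterator that has already been invalidated.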

Monday, January 5, 2009

What is Direct Data Placement

I'm currently studying Oracle's white paper on Exadata and came across the following paragraph:

"Further, Oracle's interconnect protocol uses direct data placement (DMA - direct memory access) to ensure very low CPU overhead by directly moving data from the wire to database buffers with no extra data copies being made."

This got me wondering what direct data placement is. First off, the interconnect protocol which Oracle uses in Exadata is Reliable Datagram Sockets (RDSv3). The iDB (intelligent database protocol) that a database server and Exadata Storage Server software use to communicate is built on RDSv3.

Now, I found some information on direct data placement in a number of RFCs: RFC 4296, RFC 4297, and RFC 5041. Of the three RFCs, I found RFC 5041 (Direct Data Placement over Reliable Transports) to be the most relevant (although they are all worth a quick look). RFC 5041 sums up direct data placement quite nicely:

"Direct Data Placement Protocol (DDP) enables an Upper Layer Protocol (ULP) to send data to a Data Sink without requiring the Data Sink to Place the data in an intermediate buffer - thus, when the data arrives at the Data Sink, the network interface can place the data directly into the ULP's buffer."

The paragraph from Oracle's white paper makes much more sense to me now after briefly reading through the RFC. Since each InfiniBand link in Exadata provides 16 Gb/s of bandwidth, there would be a large amount of overhead if data had to be placed in an intermediate buffer. Thus, the use of direct data placement makes perfect sense since it reduces the CPU overhead associated with copying data through intermediate buffers.

Also, I believe that in the paragraph quoted from Oracle's white paper, the abbreviation should be RDMA (Remote Direct Memory Access) rather than DMA.
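
To make the cost of the intermediate buffer a little more concrete, here is a toy model of the two receive paths. It is only a sketch and not real networking code: the function names are invented for illustration, the wire vector stands in for bytes arriving at the network interface, and the only thing measured is how many bytes the CPU has to copy.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model only: `wire` stands in for the bytes arriving at the network
// interface. Each function returns the number of bytes copied by the CPU.

// Conventional receive: the interface places the data in an intermediate
// buffer, and the CPU then copies it into the application's buffer.
std::size_t receive_via_intermediate_buffer(const std::vector<char>& wire,
                                            std::vector<char>& app_buffer)
{
    // Placement by the interface (not counted as CPU work in this model).
    std::vector<char> intermediate(wire);

    // The extra copy, done by the CPU.
    app_buffer.resize(intermediate.size());
    std::copy(intermediate.begin(), intermediate.end(), app_buffer.begin());
    return intermediate.size();
}

// Direct data placement: the interface writes straight into the
// application's buffer, so the CPU copies nothing.
std::size_t receive_with_direct_placement(const std::vector<char>& wire,
                                          std::vector<char>& app_buffer)
{
    // Placement by the interface (not counted as CPU work in this model).
    app_buffer.assign(wire.begin(), wire.end());
    return 0;
}
```

Both functions leave the same bytes in app_buffer; only the amount of CPU copying differs. If the 16 Gb figure above is read as 16 gigabits per second, that is 2 gigabytes per second per link, so on the conventional path the CPU could be copying up to roughly 2 gigabytes every second per link just to move data it has already received.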

Tuesday, December 16, 2008

Semester Project Finally Finished

We just finished our semester project yesterday for the class I am taking on High Performance Computing. It was a pretty interesting project based on the topic of software fault injection.

More details can be found in the project report here.

Sunday, November 30, 2008

It's Been a While

I had removed this blog but kept getting some emails asking for links to certain posts so I just posted some old posts again so that they are available to anyone who is interested in them.

As an update for what I'm doing, I'm currently in my second year of graduate school. I plan on taking a grad class in database systems next semester so that should be interesting. I'll get to learn a lot about database theory. The class uses Stonebraker's 'Readings in Database Systems' as a textbook along with some more modern research papers so it should be a fun class.

I doubt I will update this much but you never know, stranger things have happened...

Configuring Oracle as a Service in SMF

In Solaris 10, Sun introduced the Service Management Facility (SMF) to simplify management of system services. It is a component of the so-called Predictive Self Healing technology available in Solaris 10. The other component is the Fault Management Architecture.

In this post, I will demonstrate how to configure an Oracle database and listener as services managed by SMF. This means that Oracle will start automatically on boot, so we don't need to go to the bother of writing a startup script for Oracle (even though it's not really that hard; see Howard Rogers' 10gR2 installation guide on Solaris for an example). A traditional startup script could still be created and placed in the appropriate /etc/rc*.d directory. These scripts are referred to as legacy run services in Solaris 10 and will not benefit from the precise fault management provided by SMF.

In this post, I am only talking about a single instance environment and I am not using ASM for storage. Also please note that this post is not an extensive guide on how to do this by any means, it's just a short post on how to get it working. For more information on SMF and Solaris 10 in general, have a look through Sun's excellent online documentation at http://docs.sun.com.

Adding Oracle as a Service

To create a new service in SMF, a number of steps need to be performed (see the Solaris Service Management Facility - Service Developer Introduction for more details). Luckily for me, Joost Mulders has already done all the necessary work for performing this for Oracle. The package for installing ora-smf is available from here.

To install this package, download it to an appropriate location (in my case, the root user's home directory) and perform the following:


# cd /var/svc/manifest/application
# mkdir database
# cd ~
# pkgadd -d orasmf-1.5.pkg

There is now some configuration which needs to be performed. Navigate to the /var/svc/manifest/application/database directory. The following files will be present there:

# ls -l
-r--r--r-- 1 root bin 2167 Apr 26 09:24 oracle-database-instance.xml
-r--r--r-- 1 root bin 5722 Dec 28 2005 oracle-database-service.xml
-r--r--r-- 1 root bin 2128 Apr 26 09:31 oracle-listener-instance.xml
-r--r--r-- 1 root bin 4295 Dec 28 2005 oracle-listener-service.xml
#
The two files which must be edited are:
  • oracle-database-instance.xml
  • oracle-listener-instance.xml
My oracle-database-instance.xml file looked like the following after I edited it according to my environment:

<service_bundle type='manifest' name='oracle-database-instance'>
<service
name='application/oracle/database'
type='service'
version='1'>

<!-- The SMF instance name MUST match the database instance -->
<instance name='orcl1' enabled='false'>
<method_context
working_directory='/u01/app/oracle/product/10.2.0/db_1'
project='oracle'
resource_pool=':default'>

<!--
The credentials of the user with which the method is executed.
-->
<method_credential
user='oracle'
group='dba'
supp_groups=':default'
privileges=':default'
limit_privileges=':default'/>

<method_environment>
<envvar name='ORACLE_SID' value='orcl1' />
<envvar name='ORACLE_HOME' value='/u01/app/oracle/product/10.2.0/db_1' />

<!--
For Oracle 8 & 9
<envvar name='ORA_NLS33' value='' />
For Oracle 10g
<envvar name='ORA_NLS10' value='' />
-->

</method_environment>
</method_context>

</instance>
</service>
</service_bundle>

And my oracle-listener-instance.xml file looked like so after editing:

<service_bundle type='manifest' name='oracle-listener-instance'>
<service
name='application/oracle/listener'
type='service'
version='1'>

<!-- The SMF instance name MUST match the listener instance -->
<instance name='LISTENER' enabled='false'>
<method_context
working_directory='/u01/app/oracle/product/10.2.0/db_1'
project='oracle'
resource_pool=':default'>

<!--
The credentials of the user with which the method is executed.
-->
<method_credential
user='oracle'
group='dba'
supp_groups=':default'
privileges=':default'
limit_privileges=':default'/>

<method_environment>
<envvar name='ORACLE_HOME' value='/u01/app/oracle/product/10.2.0/db_1' />

<!--
For Oracle 8 & 9
<envvar name='ORA_NLS33' value='' />
For Oracle 10g
<envvar name='ORA_NLS10' value='' />
-->

</method_environment>
</method_context>

</instance>
</service>
</service_bundle>

In the above configuration files, you can see that I have an instance (orcl1) whose ORACLE_HOME is /u01/app/oracle/product/10.2.0/db_1. I also have a resource project named oracle, and the Oracle software is installed as user oracle and group dba. The most important parameters which must be changed according to your environment are:
  • ORACLE_HOME
  • ORACLE_SID
  • User
  • Group
  • Project
  • Working Directory (in my case, I set it to the same value as ORACLE_HOME)
  • Instance name (needs to be the same as the ORACLE_SID for the database and the listener name for the listener)
Once these modifications have been performed according to your environment, execute the following to bring the database and listener under SMF control:

# svccfg import /var/svc/manifest/application/database/oracle-database-instance.xml
# svccfg import /var/svc/manifest/application/database/oracle-listener-instance.xml

Now, shut down the database and listener on the host. (This post presumes you are only configuring one database and listener, though it shouldn't be too difficult to configure multiple instances.) Then execute the following to enable the database and listener as SMF services and start them:

# svcadm enable svc:/application/oracle/database:orcl1
# svcadm enable svc:/application/oracle/listener:LISTENER

In the commands above, the database instance is orcl1 and the listener name is LISTENER. Logs of this process are available in the /var/svc/log directory:

# cd /var/svc/log
# ls -ltr application-*
-rw-r--r-- 1 root root 45 Apr 25 20:15 application-management-webmin:default.log
-rw-r--r-- 1 root root 120 Apr 25 20:15 application-print-server:default.log
-rw-r--r-- 1 root root 45 Apr 25 20:15 application-print-ipp-listener:default.log
-rw-r--r-- 1 root root 75 Apr 25 20:16 application-gdm2-login:default.log
-rw-r--r-- 1 root root 566 Apr 26 07:07 application-print-cleanup:default.log
-rw-r--r-- 1 root root 603 Apr 26 07:07 application-font-fc-cache:default.log
-rw-r--r-- 1 root root 3318 Apr 26 10:45 application-oracle-database:orcl1.log
-rw-r--r-- 1 root root 6847 Apr 26 10:47 application-oracle-listener:LISTENER.log
#

Testing Out SMF

Now, to test out some of the functionality of SMF, I'm going to kill the pmon process of the orcl1 database instance. SMF should automatically restart the instance.

# ps -ef | grep pmon
oracle 5113 1 0 10:19:22 ? 0:01 ora_pmon_orcl1
# kill -9 5113

Roughly 10 to 20 seconds later, the database came back up. Looking at the application-oracle-database:orcl1.log file, we can see what happened:

[ Apr 26 10:44:52 Stopping because process received fatal signal from outside the service. ]
[ Apr 26 10:44:52 Executing stop method ("/lib/svc/method/ora-smf stop database orcl1") ]
*********************************************************************
*********************************************************************
** some of '^ora_(lgwr|dbw0|smon|pmon|reco|ckpt)_orcl1' died.
** Aborting instance orcl1.
*********************************************************************
*********************************************************************
ORACLE instance shut down.
[ Apr 26 10:44:53 Method "stop" exited with status 0 ]
[ Apr 26 10:44:53 Executing start method ("/lib/svc/method/ora-smf start database orcl1") ]
ORACLE instance started.
Total System Global Area 251658240 bytes
Fixed Size 1279600 bytes
Variable Size 83888528 bytes
Database Buffers 163577856 bytes
Redo Buffers 2912256 bytes
Database mounted.
Database opened.
database orcl1 is OPEN.
[ Apr 26 10:45:05 Method "start" exited with status 0 ]
As can be seen from the content of my log file above, SMF discovered that the instance crashed and restarted it automatically. That seems pretty cool to me!

Now, let's try out the same procedure with the listener service: kill the listener process and see whether SMF restarts it.

Almost instantaneously, the listener came back up. Looking through the application-oracle-listener:LISTENER.log file shows us what SMF did:

[ Apr 26 10:47:50 Stopping because process received fatal signal from outside the service. ]
[ Apr 26 10:47:50 Executing stop method ("/lib/svc/method/ora-smf stop listener LISTENER") ]

LSNRCTL for Solaris: Version 10.2.0.2.0 - Production on 26-APR-2007 10:47:51

Copyright (c) 1991, 2005, Oracle. All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=solaris01)(PORT=1521)))
TNS-12541: TNS:no listener
TNS-12560: TNS:protocol adapter error
TNS-00511: No listener
Solaris Error: 146: Connection refused
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC0)))
TNS-12541: TNS:no listener
TNS-12560: TNS:protocol adapter error
TNS-00511: No listener
Solaris Error: 146: Connection refused
[ Apr 26 10:47:52 Method "stop" exited with status 0 ]
[ Apr 26 10:47:52 Executing start method ("/lib/svc/method/ora-smf start listener LISTENER") ]

LSNRCTL for Solaris: Version 10.2.0.2.0 - Production on 26-APR-2007 10:47:52

Copyright (c) 1991, 2005, Oracle. All rights reserved.

Starting /u01/app/oracle/product/10.2.0/db_1/bin/tnslsnr: please wait...

TNSLSNR for Solaris: Version 10.2.0.2.0 - Production
System parameter file is /u01/app/oracle/product/10.2.0/db_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/product/10.2.0/db_1/network/log/listener.log
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=solaris01)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC0)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=solaris01)(PORT=1521)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Solaris: Version 10.2.0.2.0 - Production
Start Date 26-APR-2007 10:47:54
Uptime 0 days 0 hr. 0 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u01/app/oracle/product/10.2.0/db_1/network/admin/listener.ora
Listener Log File /u01/app/oracle/product/10.2.0/db_1/network/log/listener.log
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=solaris01)(PORT=1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC0)))
Services Summary...
Service "PLSExtProc" has 1 instance(s).
Instance "PLSExtProc", status UNKNOWN, has 1 handler(s) for this service...
The command completed successfully
listener LISTENER start succeeded
[ Apr 26 10:47:54 Method "start" exited with status 0 ]
I haven't played around much more with SMF and Oracle yet. Obviously, Oracle has a lot of this functionality already available through Enterprise Manager using corrective actions.

Also, it's worth pointing out that Oracle does not currently support SMF and does not provide any information or documentation on configuring Oracle with SMF. Metalink Note 398580.1 and Bug 5340239 have more information on this from Oracle.