Wednesday, January 28, 2009

Multithreading - Downstream Impacts

A multithreaded model will, in most cases, give better performance for batch applications. I would look at the points mentioned below along with the reengineering of the application code.
  • Processing Power – There should be enough CPU power available on the machine.
  • Memory – Since all threads work in parallel, the amount of memory used will also be considerably higher. This can, however, be reduced by not keeping too many objects alive in the JVM. Still, the amount of memory required will be roughly proportional to the number of threads in the application.
  • Network Load – If the application has network interactions such as database queries, FTP, etc., the network load will also be high. In some cases this is solved by adding gigabit connections between the communicating servers; in most cases the normal NICs themselves can handle the load.
  • Disk Speed – If the application does a lot of file processing, disk access also needs to be tuned. It is better to read the files from a SAN rather than a NAS, as the disk response time is better on a SAN (in my experience).
  • Database Setup – If there are a lot of database interactions, the database should also be made aware of such a change in the application. The load on the database will increase because multiple threads will try to fetch data at the same time. Most probably it will end up requiring larger database parameter values to accept the extra load.
  • GC Processing – GC tuning should be performed. Since many threads work in parallel, the amount of garbage created will also be high. Effective GC tuning is required; without it the application will end up with an OutOfMemoryError. You can consider the parallel collector, but ensure that the number of GC threads is specified.
  • Synchronization – Objects that are shared at JVM scope should be accessed with proper synchronization. If this is not worked out properly, it can cause dreaded issues like data corruption, deadlocks, etc. (a minimal sketch follows this list).
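To illustrate the synchronization point above, here is a minimal sketch, assuming a shared, JVM-scoped counter of processed records that all worker threads update; the class and field names are hypothetical.

import java.util.HashMap;
import java.util.Map;

// Hypothetical shared, JVM-scoped state updated by every worker thread.
public class BatchCounters {

    private final Map<String, Long> processedPerRegion = new HashMap<>();

    // Without synchronization, concurrent updates to the HashMap could
    // corrupt its internal structure or lose increments.
    public synchronized void recordProcessed(String region) {
        processedPerRegion.merge(region, 1L, Long::sum);
    }

    public synchronized long getProcessed(String region) {
        return processedPerRegion.getOrDefault(region, 0L);
    }
}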

Multithreading Multiprocessor Relation

A batch application is made scalable by ensuring that the executable can use the full power provided by the machine. The batch application should be designed with a multithreaded model if the work can be broken into multiple smaller units. That way, each thread can work on its own piece of work to completion. For example, in a single-threaded model, a batch process takes 10 hours to process 2000 customer records. If the same code is written as a multithreaded model using 10 threads, the job can be split into 10 units, each processing 200 customers, and the same work can be completed in 1 hour. The caveat is that the machine must have the necessary processing power (CPU).
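A rough sketch of this partitioning is shown below, assuming a fixed pool of 10 threads and a placeholder processPartition method standing in for the real per-customer work.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CustomerBatch {

    // Placeholder for the real per-customer work done by each unit.
    static void processPartition(List<String> customers) {
        for (String customer : customers) {
            // process one customer record here
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> allCustomers = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            allCustomers.add("customer-" + i);
        }

        int threads = 10;
        int chunk = allCustomers.size() / threads;   // 200 customers per unit of work
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int t = 0; t < threads; t++) {
            int from = t * chunk;
            int to = (t == threads - 1) ? allCustomers.size() : from + chunk;
            List<String> partition = allCustomers.subList(from, to);
            pool.submit(() -> processPartition(partition));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}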

Any Java process executes its work in threads of execution. A thread can be doing several things: performing the task, waiting on I/O, waiting on a socket, waiting for a lock to be released, and so on. While a thread is waiting on something, the scheduler is intelligent enough to remove it from the CPU and take up another thread that can do useful work. At any point in time, a CPU core can execute only one thread. So if the machine has 4 CPU cores, a reasonable starting point is to give the application 3 threads per core. The number of threads per core depends on the application, the primary driving factor being what is done in each thread. If there is considerable waiting in the thread of execution (file I/O, database reads, socket reads), the number of threads per CPU can be increased; if the thread performs operations within the process space without any waiting, the number of threads per CPU should be reduced. This is because the scheduler always pushes out threads that are waiting and takes in threads that are ready for execution.
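A small sketch of that sizing rule follows; the threads-per-core multipliers are assumptions that have to be tuned for the actual application.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // Assumed multipliers, to be tuned: more threads per core when the
        // work is wait-heavy (I/O), fewer when it is pure in-process CPU work.
        int threadsPerCoreIoBound  = 3;
        int threadsPerCoreCpuBound = 1;

        ExecutorService ioPool  = Executors.newFixedThreadPool(cores * threadsPerCoreIoBound);
        ExecutorService cpuPool = Executors.newFixedThreadPool(cores * threadsPerCoreCpuBound);

        System.out.println("cores=" + cores
                + " ioPool=" + cores * threadsPerCoreIoBound
                + " cpuPool=" + cores * threadsPerCoreCpuBound);

        ioPool.shutdown();
        cpuPool.shutdown();
    }
}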

Impact on number of threads

More threads per core for an application with little wait
Let's take an example where the machine has 2 CPUs and the application is configured to use 10 threads. Since there is not much wait time involved, the scheduler will force the executing thread off the CPU to give the remaining 9 threads a fair chance to run. The thread that got pushed out comes back to execution after a certain number of CPU cycles, and at that point it has to rebuild its state up to the point where it was pushed out. If there were only 1 thread of execution per CPU, this heavy context switching would not happen and the single thread per CPU could complete the operation. In such a scenario, the extra threads are detrimental to the application. Such a case is evident when the batch completes in less time after the number of threads for the application is reduced.
Fewer threads per core for an application with considerable wait
Let's take an example where the machine has 8 CPUs and the application is configured to use 8 threads. Each CPU executes one thread, and when any one of the threads goes into the WAIT state, its CPU lies idle. Such a case is evident when the batch completes in less time after the number of threads for the application is increased; at no point in time will the CPU usage get anywhere near 50% or 60%.

As described above, the number of threads per CPU should be decided based on the application's characteristics.

Tuesday, January 6, 2009

HAAS - Hardware As A Service

SaaS, or Software as a Service, has picked up momentum. The user pays for the software based only on what they use. This concept caught on in the industry because of its many advantages. HAAS, Hardware as a Service, is catching up more slowly. But did this not already exist in the form of hosting services? Is it something new? In hosting services, the user pays for the hosting space for a period of time irrespective of usage. Agreed, hosting services already existed, but HAAS is not the same.

In HAAS, the user pays for actual usage, not for usage decided at the beginning. With a hosting service, the end user has to pay, for example, X USD for, say, 1 GB of space and 1 web application for 1 year. Irrespective of how much the site is used, the user has to pay the hosting provider. What if the user could pay based on the amount of data transferred to the site or the amount of processing power the site uses, instead of a fixed amount decided up front? What if the application could take any amount of load, i.e. scalability were available on demand? This is possible through HAAS. The user pays for actual usage and gets the features of an enterprise-class application.

Amazon has opened up this arena with Amazon Web Services. They have multiple products, such as Amazon EC2 (Elastic Compute Cloud), Amazon S3 (Simple Storage Service), and Amazon SQS (Simple Queue Service). The whole idea is to achieve the application functionality by combining these three services: the unit of work is stored in S3, and EC2 uses that unit to process the application in a scalable manner. The benefit to the user is that they get their application up and running without spending heavily to set up data centers.

Saturday, December 27, 2008

Outer joins on non-preserved columns

In this post I will try to explain why an outer join is also needed on columns coming from a table that is itself outer joined (and therefore not row-preserved).

There are 3 tables: MYUSER, TEST1, and TEST2. MYUSER is related to TEST1, and TEST1 is related to TEST2. The requirement is to fetch all the users from the MYUSER table and, where present, the related data from TEST1 and TEST2.

The query would be

SELECT USERV.USERID, USERV.NAME,
       TEST1.ID, TEST1.TEST1NAME,
       TEST2.TEST2ID, TEST2.TEST2NAME
FROM   MYUSER USERV,
       TEST1 TEST1,
       TEST2 TEST2
WHERE  USERV.USERID = TEST1.ID(+)
AND    TEST1.ID = TEST2.TEST2ID(+)
ORDER BY 1

The point to note is that the condition "AND TEST1.ID = TEST2.TEST2ID(+)" must carry the (+) sign as shown. This is required because the first join returns all users even when there is no corresponding record in the TEST1 table; for those preserved rows, TEST1.ID is null. For such a row to survive the join with TEST2, the (+) on the TEST2 side must be kept.

This can be seen in the following example.

MYUSER Table

USERID   NAME
1        Bill
2        Purva
3        James
5        Meij

TEST1 Table

ID   TEST1NAME
1    test1bill
2    test1purva

TEST2 Table

TEST2ID   TEST2NAME
2         test2purva
3         test2james

The query that meets the requirement is

SELECT USERV.USERID, USERV.NAME,
       TEST1.ID, TEST1.TEST1NAME,
       TEST2.TEST2ID, TEST2.TEST2NAME
FROM   MYUSER USERV,
       TEST1 TEST1,
       TEST2 TEST2
WHERE  USERV.USERID = TEST1.ID(+)
AND    TEST1.ID = TEST2.TEST2ID(+)
ORDER BY 1

ANSI Syntax

SELECT USERV.USERID, USERV.NAME,
       TEST1.ID, TEST1.TEST1NAME,
       TEST2.TEST2ID, TEST2.TEST2NAME
FROM   MYUSER USERV
       LEFT OUTER JOIN TEST1 TEST1 ON (USERV.USERID = TEST1.ID)
       LEFT OUTER JOIN TEST2 TEST2 ON (TEST1.ID = TEST2.TEST2ID)
ORDER BY 1

USERID   NAME    ID   TEST1NAME    TEST2ID   TEST2NAME
1        Bill    1    test1bill
2        Purva   2    test1purva   2         test2purva
3        James
5        Meij

The second query, without the outer join on the second condition, is

SELECT USERV.USERID, USERV.NAME,
       TEST1.ID, TEST1.TEST1NAME,
       TEST2.TEST2ID, TEST2.TEST2NAME
FROM   MYUSER USERV,
       TEST1 TEST1,
       TEST2 TEST2
WHERE  USERV.USERID = TEST1.ID(+)
AND    TEST1.ID = TEST2.TEST2ID
ORDER BY 1

USERID   NAME    ID   TEST1NAME    TEST2ID   TEST2NAME
2        Purva   2    test1purva   2         test2purva


To be precise: if a table is the outer-joined (null-supplied) table in one join and its columns are then joined to another table, that second join condition must also be an outer join; otherwise the rows preserved by the first join are filtered out again.

Thursday, December 18, 2008

Consistent Read of the Data in Oracle Queries

In some cases, you might want different queries to return the same set of data as long as the WHERE criteria defined on the queries are the same. For example:

Select price from vehicle_price where status = 'A'

Select sum(price) from vehicle_price where status = 'A'

If I run these two queries in the same transaction, there is a possibility that the second query returns a sum that is not equal to the sum of the prices returned by the first query. This can happen if new records get added to the vehicle_price table with status 'A', or if the status of an existing record changes from 'A' to something else, between the two queries.

To ensure that you always get the same data, there are multiple ways to go about it:

1. Lock the table and run the queries against it. This ensures that no one else can modify the data while your transaction is in progress. I won't be surprised if you get shouted at by the other folks who use this table in some other part of the application.

2. Make the data time sensitive. Every record that goes into the table carries an effective start time and an effective end time, and queries always filter on these columns. A row is never modified; instead, a new row is added with the new status. As long as the queries use the effective start and end dates, the data returned by subsequent queries will be the same. This is a better approach because other users are not impacted, but there can be size and performance implications.

3. Use the SERIALIZABLE isolation level for the transaction that runs the queries. In Oracle, a serializable transaction sees only data that was committed before the transaction began, so both queries work against the same snapshot even if other sessions insert or update rows in the meantime. This also has an impact on how the table is used, because updates made by the serializable transaction can fail if they collide with concurrent changes.

4. Use the read-only transaction feature. There are a few limitations to this feature: it can be set only for transactions that contain nothing but read statements. Oracle maintains a snapshot of the data as of the transaction start time, and all queries work against that snapshot. Use it where that is viable. A commit or rollback ends the transaction and turns the flag off. The command is SET TRANSACTION READ ONLY; (a small JDBC sketch follows).
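As a minimal JDBC sketch of option 4, assuming an Oracle JDBC driver is available on the classpath; the connection URL, credentials, and the vehicle_price table are placeholder values.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadOnlyTransactionDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger");
        conn.setAutoCommit(false);

        try (Statement stmt = conn.createStatement()) {
            // Start a read-only transaction: every query until commit/rollback
            // sees the data as of this point in time.
            stmt.execute("SET TRANSACTION READ ONLY");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT price FROM vehicle_price WHERE status = 'A'")) {
                while (rs.next()) {
                    System.out.println(rs.getBigDecimal(1));
                }
            }

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT SUM(price) FROM vehicle_price WHERE status = 'A'")) {
                if (rs.next()) {
                    System.out.println("sum = " + rs.getBigDecimal(1));
                }
            }
        } finally {
            conn.commit();   // ends the read-only transaction
            conn.close();
        }
    }
}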

These are a few of the ways to ensure that queries return the same data within the same transaction.

Friday, December 5, 2008

Use Decode and Sign function together

DECODE statements are useful in queries to a large extent. Sometimes people write PL/SQL code to achieve things that are possible through a plain DECODE statement.

For example, the requirement is to add two columns retrieved from a table and, if the sum is less than zero, return zero; otherwise return the sum of the two values. This can be done using a DECODE statement:

select column1, column2, column1 + column2,
       decode(SIGN(column1 + column2), -1, 0, column1 + column2)
from tableA

The above query means that if the sum of column1 and column2 is less than zero, the output is zero; otherwise the output is the sum itself. We had to use the SIGN function because DECODE does not support comparisons such as less-than or greater-than.

JDK Logging

Log4j is widely accepted as a logging component, but there is another one: the logging facility built into Java. This article discusses the logging provided by the JDK.

In the ideal scenario, the entire configuration can be done through the logging.properties file. By default the JVM reads the logging.properties file from its jre/lib folder. If you want to provide a different properties file, it can be specified using a system property, as shown below:

java -Djava.util.logging.config.file=/home/users/mypath/mylogging.properties

The sample logging.properties file present in the jre/lib folder can be edited to get the necessary logging. The structure of the logging properties file is shown below:


# Defines the handlers used by the root logger. The list can be comma separated.
# For each handler, its configuration should be provided in this properties file.
handlers= java.util.logging.FileHandler, java.util.logging.ConsoleHandler

# Default logging level, used when a specific logger level (like the last line
# below) is not mentioned.
.level= INFO

# Details of the FileHandler.
java.util.logging.FileHandler.pattern = %h/java%u.log
java.util.logging.FileHandler.limit = 50000
java.util.logging.FileHandler.count = 1
java.util.logging.FileHandler.formatter = java.util.logging.XMLFormatter

# Custom logging level for a specific package.
com.xyz.foo.level = SEVERE

When the logger is configured through the properties file, levels are applied per logger name, which by convention is the package (or class) name.

The restriction here is that only one log file pattern can be specified per logging properties file. So if you need the logs to go to multiple files, perhaps one per application, it is not possible purely through configuration in the properties file. However, you can meet this requirement by writing a specific logger class.


import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class MyLogger {

    // Path of the log file this logger writes to (example value).
    private static final String filePath = "/home/users/mypath/myapplication.log";

    private static MyLogger mylogger;
    private Logger logger;

    private void initialize(String name) {
        try {
            Handler fileHandler = new FileHandler(filePath, true);
            // MyFormatter is your own java.util.logging.Formatter implementation.
            fileHandler.setFormatter(new MyFormatter());
            logger = Logger.getLogger(name);
            logger.addHandler(fileHandler);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static MyLogger getInstance(String someName) {
        // add the necessary synchronization blocks
        mylogger = new MyLogger();
        mylogger.initialize(someName);
        return mylogger;
    }

    // This is a sample method; provide similar implementations for
    // all the other levels you need.
    public void debug(String sourceClass, String sourceMethod, String msg) {
        logger.logp(Level.FINEST, sourceClass, sourceMethod, msg);
    }
}

// In your application code, use it this way:
MyLogger logger = MyLogger.getInstance("namethatidentifiestheloglevel");
logger.debug("classname", "methodname", "msg");

In the above code snippet, the log level is not set in code. The level can be specified in the properties file that is passed via the java.util.logging.config.file property. In this case, the entry would be

namethatidentifiestheloglevel.level=SEVERE

There are various levels available, the important ones being SEVERE, WARNING, INFO, FINE, FINER, and FINEST. Please refer to the Sun documentation, as the priority of the log level determines what gets logged.