Every morning I pick up my BlackBerry and look at an email from TSM Operational Reporting. If you are responsible for a Tivoli Storage Manager install, you should be looking at a report from TSM Operational Reporting at some time during your day as well. TSMOR has two key reports that can be emailed to you. The daily and hourly reports work in tandem to give you a wide range of information when it is needed and focused information if there is ever a problem. In addition, TSMOR can store the current report as well as previous reports, in HTML format, in a directory. Until TSMOR became available there was no easy way to see the health of your TSM environment.
TSM Operational Reporting’s daily report is about as complete a report as you can get. It begins with a general summary of the target TSM server; some of the items monitored are shown below.
These are simple counts or numbers:
Administrative Schedules – Success, Error, Fail, Missed
Client Schedules – No Error, Skipped Files, Warnings, Errors, Failed, Missed
Total GB – Backed up, Restored, Archived, Retrieved
Database and Log utilization
DB Cache Hit Ratio
Scratch and Unavailable Volumes
Then it gives you detailed information on the Administrative Schedules and Client Schedules, such as what was missed or what failed. At this point I have a solid idea of the health and success of the previous backup cycle. From here I can move on to troubleshooting problems if needed, or safely move on to other tasks if everything was successful. While scrolling down I pass some awesome-looking graphs of different load summaries that are often not useful to me. I’ll list them to see if any catch your attention.
Session Load Summary
Tape Mount Load Summary
Migration Load Summary
Reclamation Load Summary
Database Backup Load Summary
Storage Pool Backup Load Summary
Expiration Load Summary
These graphs are more useful as a 50,000 ft overview of the loads your server is under throughout the day. If I want a bit more detail on the clients, such as Bytes Transferred or node versions, I can simply scroll down. The Node Activity Summary is one section I watch frequently. It gives a list of nodes and their version of the TSM BAClient. This is very useful in my environment right now, as I am phasing out 5.3.x.x and moving to 184.108.40.206. Support for 5.3.x.x will not be available after April 30th, 2008.
The next section I have recently had to turn off. Activity Log Details is the output of your activity log for the past 24 hours. I have a few HSM for Windows clients, and these have been generating a lot of information in the activity log. So much information, in fact, that it made my daily report fail. After I disabled this section in the daily report, it runs just fine.
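To show the kind of check I do by eye against the Node Activity Summary, here is a minimal Python sketch. It assumes you have copied the node names and client versions out of the report yourself; the node names, versions, and the helper nodes_on_release are made up for illustration and are not part of TSMOR.

```python
# Hypothetical (node name, client version) pairs, as they might be copied
# from the Node Activity Summary. These values are invented for illustration.
nodes = [
    ("FILESRV01", "5.3.4.0"),
    ("MAILSRV01", "5.4.1.2"),
    ("WEBSRV02", "5.3.0.0"),
]


def nodes_on_release(nodes, prefix):
    """Return the names of nodes whose client version belongs to the given release."""
    # The trailing dot keeps "5.3" from also matching something like "5.30.x.x".
    return [name for name, version in nodes if version.startswith(prefix + ".")]


# Nodes that still need to be moved off the 5.3 client.
old_clients = nodes_on_release(nodes, "5.3")
print(old_clients)
```

Anything this prints is a node still waiting on its client upgrade.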
The Missed File Summary is useful at times. An example would be a new agent installed on all of your machines that has a file being skipped on every one of them. You will see the number of occurrences of skipped files and the names of those files. The next section, Missed File Details, is what I actually use to troubleshoot missed files. It gives you two key pieces of information: the node name and the UNC path to the file. A third piece of information is the time at which the file was skipped. This time can be useful if you know some other job doesn’t start or finish until a certain time and is holding the file until then. The first two pieces of information should be enough to tell you whether you need to edit your include/exclude list.
The Session Summary section is awesome, though mostly used for bragging. If you are a backup administrator, the only thing that comes close to being able to say you can restore, and have restored, anything is how fast you can do it. This section will list for each node:
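The "same file skipped on every machine" case can be sketched in a few lines of Python. This assumes the Missed File Details rows have been copied out by hand as node name, UNC path, and time; the sample rows and both helper functions are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical rows: (node name, UNC path, time skipped).
missed = [
    ("FILESRV01", r"\\filesrv01\c$\agent\agent.lock", "02:14"),
    ("MAILSRV01", r"\\mailsrv01\c$\agent\agent.lock", "02:31"),
    ("WEBSRV02", r"\\websrv02\c$\agent\agent.lock", "03:05"),
    ("WEBSRV02", r"\\websrv02\d$\logs\today.log", "03:06"),
]


def path_without_host(unc_path):
    r"""Drop the leading \\host part so the same file on different nodes compares equal."""
    parts = unc_path.lstrip("\\").split("\\", 1)
    return parts[1] if len(parts) > 1 else ""


def files_missed_on_many_nodes(rows, min_nodes=2):
    """Map each host-independent path to the nodes that skipped it, keeping only
    paths skipped on at least min_nodes different nodes."""
    nodes_by_file = defaultdict(set)
    for node, unc_path, _time in rows:
        nodes_by_file[path_without_host(unc_path)].add(node)
    return {
        path: sorted(nodes)
        for path, nodes in nodes_by_file.items()
        if len(nodes) >= min_nodes
    }


print(files_missed_on_many_nodes(missed))
```

A path that shows up here on most of your nodes is a likely candidate for an exclude.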
Objects Inspected, Backed Up, Updated, Rebound, Failed
Aggregated Rate KB/Sec
If you are like me, knowing how many objects moved, how fast they moved, and the total size of that data is very useful. For instance, for nodes with lots of objects it may be worth having the Tivoli journal engine running. A slow aggregate rate combined with a high number of bytes moved can sometimes reveal network bottlenecks. Session Summary is available for backup sessions and archive sessions as well as restores and retrieves. The last section is Timing Information, which shows how long it took to gather the data for each section.
I hope this review and summary has informed you a bit. There are many other features which I did not cover but may at a later time. In case you would like to research them on your own, I will point you in the right direction. You can create multiple daily and hourly reports. You can create your own custom select statements to pull the data you need for your environment. You can also change any of the parameters that cause the hourly report to notify you or show errors. One I change is the number of scratch tapes required for the server to be considered healthy, from 5 to 3. I also have the hourly report email me only if there is a problem, such as running out of scratch tapes or a log filling up. You may also want to look at per-node notifications, which would be very handy in larger IT organizations where backups are done on servers you do not care about but someone else does.
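As a rough illustration of the "slow rate, lots of bytes" check, here is a small Python sketch. It assumes the bytes moved and the aggregate rate have been read off the Session Summary for each node; the sample figures, the two thresholds, and the function name are my own assumptions, not values TSMOR defines.

```python
GIB = 1024 ** 3

# Hypothetical rows: (node name, bytes moved, aggregate rate in KB/sec).
sessions = [
    ("FILESRV01", 120 * GIB, 900.0),
    ("MAILSRV01", 2 * GIB, 450.0),
    ("WEBSRV02", 80 * GIB, 15000.0),
]


def possible_bottlenecks(sessions, min_bytes, max_rate_kb_per_sec):
    """Return nodes that moved at least min_bytes but at a rate below max_rate_kb_per_sec."""
    return [
        node
        for node, bytes_moved, rate in sessions
        if bytes_moved >= min_bytes and rate < max_rate_kb_per_sec
    ]


# Both thresholds are assumptions; tune them for your own network.
print(possible_bottlenecks(sessions, min_bytes=50 * GIB, max_rate_kb_per_sec=2000.0))
```

A small node with a slow rate is ignored on purpose, since a low rate on a small amount of data rarely says much about the network.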
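The "only email me if there is a problem" rule for the hourly report boils down to something like the Python sketch below. This is not how TSMOR implements it. The function names, the 80 percent log threshold, and the choice to treat exactly 3 scratch tapes as healthy are assumptions for illustration; only the scratch minimum of 3 comes from my own settings.

```python
def hourly_problems(scratch_volumes, log_utilization_pct,
                    min_scratch=3, max_log_pct=80.0):
    """Return a list of problem descriptions; an empty list means healthy."""
    problems = []
    # Assumption: having exactly min_scratch tapes still counts as healthy.
    if scratch_volumes < min_scratch:
        problems.append(
            f"only {scratch_volumes} scratch volumes (need {min_scratch})"
        )
    # Assumption: 80 percent is a made-up threshold for a log that is filling up.
    if log_utilization_pct > max_log_pct:
        problems.append(f"log at {log_utilization_pct:.0f}% utilization")
    return problems


def should_email(problems):
    """Send the hourly report only when at least one problem was found."""
    return bool(problems)


print(should_email(hourly_problems(4, 40.0)))                 # minimum of 3
print(should_email(hourly_problems(4, 40.0, min_scratch=5)))  # minimum of 5
```

With four scratch tapes on hand, the first call stays quiet while the second, using a minimum of 5, would send an email, which is exactly why I lowered the number.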