
GRIST: Project Description

October 2004

Summary

The Grist project is investigating and implementing ways for working astronomers, scientists, and the public to interact with the "grid" projects that are being constructed worldwide, to bring to flower the promise of easy, powerful, distributed computing. Our objectives are to understand the role of service-oriented architectures in astronomical research, to bring the astronomical community to the grid -- particularly the TeraGrid -- and to work with the NVO to build a library of compute-based web services.

The scientific motivation of Grist derives from the creation and mining of wide-area federated images, catalogs, and spectra. An astronomical image collection will generally cover an area of sky several times -- in different wavebands, at different times, etc. -- and the data analysis should combine these multiple observations into a unified ("federated") understanding of the physical processes in the Universe. The familiar way to do this is to identify sources in each image, then cross-match the source lists from different images. However, there is growing interest in another way to federate images: by reprojecting each image to a common set of pixel planes, then stacking the images and detecting sources therein. While this has been done for years for small fields, we are working on wide areas of sky in a systematic way, using data from the Palomar-Quest survey (see below). We expect to detect much fainter sources than can be detected in any individual image; to detect unusual objects such as transients; and to compare in depth (e.g. by principal component analysis) the big surveys such as SDSS, 2MASS, DPOSS, etc. Grist is also using PQ data to find high-redshift quasars, to study peculiar variable objects, and to search for transients in real time. Grist is also using TeraGrid resources for fitting SDSS QSO spectra to measure black hole masses.

We have been using the NSF TeraGrid as a platform -- a platform for testing code, testing policies, and of course for big computing on big data. The TeraGrid presents a "quasi-unified" collaborative management structure, with an open approach to new ideas and research that provides a fertile field for projects such as Grist. Four of the sites run identical system software, allowing effortless distributed deployment of code by simply copying everything. Other TeraGrid sites have specific capabilities and software, so that we can experiment with heterogeneous deployment. On the management side, TeraGrid presents a federation of independent computing centers analogous to the federation of states that is the United States -- with the same tension between uniform and individual policy.

Grist is closely associated with the Palomar-Quest (PQ) synoptic sky survey, and is building an image-federation pipeline for the data. PQ observes with the P-48 telescope on Palomar Mountain, California, and has produced a terabyte of data from July 2003 to July 2004 (50 nights at 25 Gbyte). Data is archived at NCSA in Illinois, and the TeraGrid machines there and at Caltech are used for the image pipeline. The high-speed TeraGrid backplane allows the data from an 8-hour observing run to be transmitted in about one hour.

Grist is also closely associated with the NSF National Virtual Observatory, which has developed a number of internationally accepted formats and protocols for astronomical data. The PQ data is being loaded into a new database machine so that it can be exposed to the astronomical community through NVO protocols (OpenSkyNode). By this means, it becomes easy to cross-match PQ catalog data with other major surveys. Grist is also involved in international efforts to define computational capabilities as services that can be published, discovered, and utilized in a "plug-and-play" fashion, and in efforts to define a security and authentication structure for the virtual observatory. The Grist team recently hosted an international workshop on these topics: "Service Composition for Data Exploration in the Virtual Observatory" (SC4DEVO).

The project concentrates on elaborating the following interlinked ideas:

Service-Oriented Architectures for Astronomy. The Grist project is building web and grid services and the enabling workflow fabric to tie together distributed services in the areas of data access, federation, mining, source extraction, image mosaicking, catalog federation, data subsetting, statistics (histograms, kernel density estimation, and R language utilities exposed by VOStatistics services), and visualization. Interactive deployment and control of these distributed services will be provided from an intuitive, graphical desktop workflow manager. We have built a number of services -- for data mining, statistics, source extraction, and coordinate transformations. We also expect to utilize NVO services for data access -- images, catalogs, and spectra -- as well as the sophisticated NVO registry structure for discovery and utilization.

The new grid services paradigm explored in Grist will pave the way for a new era of distributed astronomy, with tremendous flexibility that allows software components to be deployed as services that are: (i) controlled and maintained by the authors; (ii) close to the data source for efficiency; or (iii) controlled by the end users so they have control over policies and level of service.

Graduated security. Much of the pipeline and mining software for Grist will be built as web services. One of the reasons for using services is to be able to use them from a thin client, ideally with nothing more than a common web browser. However, for such services to be able to process private data or use big computing, there must be strong authentication of the user. The VO and Grid communities are converging on the idea of X.509 certificates as a suitable credential. However, most astronomers do not have such a certificate, and it is a challenge to demonstrate to them the benefit of making the effort to get one. Therefore we are building services with "graduated security", meaning that small requests on public data are available anonymously and simply, while large requests on private data can be serviced through the same interface; in the latter case, however, a certificate is necessary. Thus the service "proves its usefulness" with a gentle learning curve, but can be used "full-strength" by those who get themselves a credential.
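
The decision logic behind graduated security can be sketched in a few lines. The following Java fragment is illustrative only: the row-count threshold, class name, and credential check are hypothetical placeholders, not the actual Grist implementation.

    // Illustrative sketch: a hypothetical guard that a "graduated security"
    // service might apply before running a job. Thresholds and names are
    // placeholders, not the actual Grist implementation.
    public class GraduatedSecurityGuard {

        static final long ANONYMOUS_ROW_LIMIT = 10_000;   // hypothetical cutoff for anonymous use

        /** Decide whether a request may proceed, and at what service level. */
        public static String authorize(long requestedRows, boolean dataIsPrivate,
                                       boolean hasValidX509Certificate) {
            // Small requests on public data: no credential needed.
            if (!dataIsPrivate && requestedRows <= ANONYMOUS_ROW_LIMIT) {
                return "ALLOW: anonymous access to public data";
            }
            // Anything larger, or anything touching private data, needs a certificate.
            if (hasValidX509Certificate) {
                return "ALLOW: authenticated access (large request or private data)";
            }
            return "DENY: please obtain an X.509 certificate for this request";
        }

        public static void main(String[] args) {
            System.out.println(authorize(500, false, false));       // casual anonymous user
            System.out.println(authorize(5_000_000, false, false)); // big job, no credential
            System.out.println(authorize(5_000_000, true, true));   // big job with certificate
        }
    }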

Palomar-Quest data mining. We have designed and built a database for the output of the catalog pipeline of the PQ survey. These data are being mined for high-redshift quasars, and in FY05 we will expose the data through NVO-compliant web services, along with some standard sky-survey catalogs (DPOSS). A major objective of the PQ survey is the fast discovery of new types of transient source, through comparison of fresh data with previous data. Such transients should be immediately re-observed to get maximum scientific impact, so we are experimenting with "dawn processing" on the TeraGrid, meaning that data is streamed from the telescope to the compute facility as it is taken (rather than days later). The pipeline itself is being built with streaming protocols, including the mining and discrimination, so that unknown transients (not known variables, not known asteroids, etc.) can be examined within hours of observation with a view to sending an email alert.

Virtual Data. When we visit a university library, we use large metadata services that tell us what we would like to read. Generally, however, the actual shelf holdings of the library are much less than those of the metadata system. Each metadata object can also be thought of as an "order" -- for printing, for interlibrary loan, and so on. In just the same way, we can present an archive of derived data products as if everything were already computed, when in fact computation may only occur to service a request. Products are cached after they are computed, so that the popular parts of the archive are available immediately.

There are several modes of operation of a system like this. A user can request a data object through a portal, and services will run on an immediate resource to get the request satisfied as soon as possible. A service can be set up to build a large amount of cached product for data mining, visualization, or demonstration. In another scenario, a service can scavenge the grid over months, gradually building a vast federated sky atlas with any machine cycles that it can find on the Grid.
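
A minimal sketch of this compute-on-request idea, assuming a simple in-memory cache and a placeholder compute function standing in for a real pipeline run:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    // Minimal sketch of the "virtual data" idea: the archive looks complete, but a
    // product is only computed when first requested, then cached for later callers.
    // The cache, key scheme, and compute function here are illustrative placeholders.
    public class VirtualDataCache {

        private final Map<String, byte[]> cache = new HashMap<>();
        private final Function<String, byte[]> compute;   // stand-in for e.g. a mosaicking job

        public VirtualDataCache(Function<String, byte[]> compute) {
            this.compute = compute;
        }

        /** Return the requested product, computing it only if it is not yet cached. */
        public byte[] request(String productKey) {
            return cache.computeIfAbsent(productKey, compute);
        }

        public static void main(String[] args) {
            VirtualDataCache archive = new VirtualDataCache(
                    key -> ("mosaic for " + key).getBytes());   // stand-in for a real pipeline run
            archive.request("atlas-page-0042");  // first request triggers computation
            archive.request("atlas-page-0042");  // second request is served from the cache
            System.out.println("cached products: 1");
        }
    }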

A Virtual Sky. Wide-area image federation also has a beautiful educational spinoff: we are building a web-based "Virtual Sky" (virtualsky.org) that shows views of the sky at many wavelengths and scales, analogous to the highly successful National Atlas project (nationalatlas.gov). Students can use a simple interface to explore the sky, zooming in on favourite objects and seeing them at different wavelengths, adding annotation layers, and attaching virtual post-its. More advanced uses are available by drill-down to calibrated FITS images, by connection to databases such as NED and Simbad, and by facile connection to other astronomical tools such as Aladin and Oasis.

Outreach and Education

Virtual Sky

A major thrust of the outreach efforts has been redesigning the Virtual Sky web site and database as above. We expect to devote considerable effort to the implementation in FY05 based on Hyperatlas and calibrated image reprojections.

Griffith Observatory Mural

We have been using the Atlasmaker software to build a very large mosaicked image (150 feet by 20 feet) to be permanently installed in the renovated Griffith Observatory in Los Angeles, enamelled on large metal tiles ten feet high. To be known as "The Big Picture", this image is being created from both DPOSS and Palomar-Quest sky surveys, and renders a 10-degree strip that includes the Virgo cluster and Markarian chain. This mural has been computed

Students

Student involvement in Grist in FY04 has been as follows:

Web site

The project has developed a web site for documents and service descriptions, which has also been used for presentations and reports for the SC4DEVO international workshop. The web site is http://grist.caltech.edu/.

Science Investigations

In our first year it is already becoming evident how Grist technology can enable a number of science investigations: the search for high-redshift quasars, the study of peculiar variable objects, the real-time search for transients, and the fitting of SDSS QSO spectra to measure black hole masses. These science activities are described below.

Grist Testbed Science with the Palomar-Quest Survey

Grist is enabling ongoing science investigations rooted in new data acquired by the Palomar-Quest Survey. This work has three main thrusts: multi-wavelength science, data mining to search for high-redshift quasars, and exploration of the time-variant sky, as described below.

The Hyperatlas/Atlasmaker technology is being used for the remapping of images into a framework that enables novel astronomical data mining in a multi-image domain. Possible implementations include statistically weighted image stacking, in order to increase S/N; image subtraction, in order to detect variable, transient, and moving sources (e.g., supernovae, Earth-crossing asteroids, etc.); and image "drilling" in the time domain, in order to produce multi-wavelength light curves. Finally, we have also embarked on a major public outreach project in collaboration with the Griffith Observatory, which will apply this technology to the Palomar-Quest data in order to generate a major new science exhibit. We expect to engage the services of a number of Caltech undergraduate students in the upcoming stages of this project. A more detailed description of Hyperatlas/Atlasmaker is given elsewhere in this report.

Efficient data mining of Palomar-Quest data is helping us search for outliers in the color parameter space, specifically in a search for high-redshift quasars. Given the size and complexity of the data sets, this places significant demands on computational and database capability, and we have conducted a number of experiments along these lines. We have also started to explore the use of novel clustering algorithms for this type of astronomical study. While the work is still in its early stages, it has already resulted in the first discoveries of high-redshift quasars (as well as a large number of brown dwarfs), and we expect that further work will bring this capability into a steady production regime. This would be a powerful and scientifically highly visible application of Grid computing in a computationally challenging, real-life astronomical context. A Caltech graduate student, Milan Bogosavljevic, was involved in the initial stages of this work, and we expect that he will continue with it as his Ph.D. thesis project.

This work is also focused on exploration of the time domain in astronomy, in particular a search for optical transients in Palomar-Quest images and a systematic study of quasar variability on time scales ranging from days to decades. The former has broad relevance for many fields of astronomy, from stellar physics to cosmology, and has the potential for discovery of genuinely new astrophysical objects or phenomena. The latter has potential significance for new, achromatic searches for AGN, as well as for understanding quasar fueling mechanisms and lifetimes. Exploratory studies are currently under way, with the assistance of two undergraduate students, Priya Kollipara and Elisabeth Krause, and both are expected to produce scientific results soon. While these initial studies are done in archival research mode, we have been designing a system for real-time discovery and automated classification of astronomical transients; this system is expected to rely heavily on Grid technology, and it will pose some interesting technical challenges, including scheduled and on-demand Grid computing.

Quasar spectrum fitting

The spectrum fitting application is designed to fit complex models to the spectra of quasars. From the resulting model parameter values, physical parameters such as black hole mass can be deduced. The application uses a genetic algorithm to iteratively converge on a satisfactory solution. In each iteration, a population of solutions is evaluated and ranked. The population of solutions for the next iteration (or "generation") is "bred" from the members of the current population, based on the member ranks, using directed random processes inspired by natural selection. Genetic algorithms were selected for the modeling because they do not require human monitoring (we wish to fit tens of thousands of spectra), and they avoid certain common problems such as converging to a local minimum.
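
The evaluate/rank/breed loop can be sketched as follows. This Java fragment is schematic: the single-parameter toy fitness function stands in for the chi-squared of the real multi-component quasar model, and the population and generation counts are arbitrary.

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.Random;

    // Schematic genetic-algorithm loop: evaluate a population, rank it, and breed
    // the next generation by crossover and mutation. The one-parameter "fitness"
    // is a toy stand-in for the chi-squared of the real quasar spectral model.
    public class GeneticFitSketch {

        static final Random rng = new Random(42);

        // Toy objective: distance of a single parameter from a hidden optimum.
        static double fitness(double[] params) {
            double target = 3.14;
            return Math.abs(params[0] - target);
        }

        static double[] breed(double[] a, double[] b) {
            double[] child = new double[a.length];
            for (int i = 0; i < a.length; i++) {
                child[i] = rng.nextBoolean() ? a[i] : b[i];   // crossover
                child[i] += rng.nextGaussian() * 0.05;        // small mutation
            }
            return child;
        }

        public static void main(String[] args) {
            int popSize = 50, generations = 50;
            double[][] pop = new double[popSize][1];
            for (double[] p : pop) p[0] = rng.nextDouble() * 10.0;   // random initial solutions

            for (int gen = 0; gen < generations; gen++) {
                // Rank the population by fitness (lower is better).
                Arrays.sort(pop, Comparator.comparingDouble(GeneticFitSketch::fitness));
                // Breed the next generation from the better-ranked half.
                double[][] next = new double[popSize][];
                for (int i = 0; i < popSize; i++) {
                    double[] a = pop[rng.nextInt(popSize / 2)];
                    double[] b = pop[rng.nextInt(popSize / 2)];
                    next[i] = breed(a, b);
                }
                pop = next;
            }
            Arrays.sort(pop, Comparator.comparingDouble(GeneticFitSketch::fitness));
            System.out.printf("best parameter after %d generations: %.3f%n", generations, pop[0][0]);
        }
    }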

The application is written in standard ANSI C++, and has been compiled and run on numerous platforms. The input is currently an ASCII file which contains the spectrum (wavelength, flux density, flux density uncertainty, pixel mask values). The output is an ASCII file containing the final values of the model parameters. In addition, a log file is generated which tracks the progress of the algorithm after each iteration.

The genetic algorithm-based fitting algorithm is quite general and is easily adapted to other problems, astronomical or otherwise. It could also be "parallelized" to run on multiple nodes for a single spectrum (e.g. by evaluating solutions on separate nodes), if this would be advantageous for particular applications.

The spectrum fitting application is currently running on the NCSA TeraGrid cluster. The goal is to fit many thousands of spectra by running individual jobs simultaneously on separate nodes. The C++ code was compiled using the available Intel compiler. Individual spectrum files are copied to a local disk (each file occupies about 200 kbytes, so storage is not a concern for this project). A Perl script is used to generate RSL (resource specification language) files that specify the executable, command line arguments, stdout/stdin, and maximum running time for each job. There is one RSL file for each input file. Then globusrun is called to do the job submission. A log file is written which stores the globusrun job identifier string. Job progress is monitored by looking at the grid monitor web page. So far, up to 2700 jobs have been submitted at one time.
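
The project's actual job-generation script is written in Perl; the Java sketch below illustrates the same idea, writing one Globus GRAM RSL file per spectrum and printing a globusrun batch-submission command. The executable path, resource contact, wall-time limit, and file names are placeholders, and the exact RSL attributes and globusrun flags should be checked against the Globus documentation.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    // Sketch of the job-generation step (the project's actual script is Perl).
    // For each input spectrum we write a GRAM RSL file and print the globusrun
    // command that would submit it in batch mode. Paths and names are placeholders.
    public class RslJobGenerator {

        static String rslFor(String spectrumFile) {
            return "&(executable=\"/home/user/bin/specfit\")\n"
                 + " (arguments=\"" + spectrumFile + "\")\n"
                 + " (stdout=\"" + spectrumFile + ".out\")\n"
                 + " (stderr=\"" + spectrumFile + ".err\")\n"
                 + " (maxWallTime=60)\n";
        }

        public static void main(String[] args) throws IOException {
            List<String> spectra = List.of("spec-0001.txt", "spec-0002.txt", "spec-0003.txt");
            for (String spectrum : spectra) {
                Path rslFile = Paths.get(spectrum + ".rsl");
                Files.writeString(rslFile, rslFor(spectrum));   // one RSL file per input spectrum
                // Batch submission; the returned job identifier would be logged for monitoring.
                System.out.println("globusrun -b -r tg-login.example.org/jobmanager-pbs -f " + rslFile);
            }
        }
    }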

Figure 1. Comparison of a quasar spectrum (black) to a model fit (magenta) and the model components.

Figure 1 shows the spectrum (black) of a quasar with a redshift of 0.631, along with the model fit (magenta), and the model components. The power-law continuum is shown in red, the Balmer continuum in brown, the total spectrum of the iron ion complexes in green, and the total spectrum of the non-iron atomic species in blue. Several major emission lines are labeled. The model is able to cleanly disentangle heavily blended components. (The redshift of 0.631 implies a distance of 5 billion lightyears to the quasar.)

From the model parameters for the atomic emission lines, the quasar black hole mass can be calculated. The emission line widths correspond to the velocity of the gas in orbit around the black hole. The distance of the gas from the black hole is directly related to the power-law continuum luminosity. From the gas velocity and distance, we deduce the mass of the black hole of this quasar to be 3.1e8 times the mass of the Sun. This is a typical value for the quasars in our sample.
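
The estimate is essentially the virial relation M ~ f v^2 R / G, with the velocity v from the emission-line width and the radius R from the continuum luminosity. The snippet below works through that arithmetic with illustrative placeholder values for the line width, broad-line region radius, and geometric factor; it does not reproduce the fitted values for the quasar in Figure 1.

    // Worked sketch of the virial estimate M ~ f * v^2 * R / G. The velocity,
    // radius, and scale factor below are illustrative placeholders only.
    public class VirialMassSketch {

        static final double G = 6.674e-11;          // gravitational constant, m^3 kg^-1 s^-2
        static final double SOLAR_MASS = 1.989e30;  // kg
        static final double LIGHT_DAY = 2.59e13;    // metres

        public static void main(String[] args) {
            double lineWidthKmPerS = 4000.0;   // hypothetical broad-line velocity width
            double radiusLightDays = 60.0;     // hypothetical broad-line region radius
            double virialFactor = 1.0;         // order-unity geometric factor

            double v = lineWidthKmPerS * 1.0e3;        // m/s
            double r = radiusLightDays * LIGHT_DAY;    // m
            double massKg = virialFactor * v * v * r / G;

            System.out.printf("black hole mass ~ %.1e solar masses%n", massKg / SOLAR_MASS);
        }
    }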

There are 173 parameters in the model for this quasar. Most of the parameters are related to the numerous iron emission line components. The fitting algorithm was run with 50 solutions per iteration, for 50 iterations. The reduced Chi^2 value of the final fit is 0.91.

The spectral data are from the Sloan Digital Sky Survey (SDSS). The SDSS spectroscopic pipeline can identify the objects as quasars, and measure their redshifts. The pipeline does not fit models to the quasar spectra because the fitting time per spectrum is prohibitive for a single processor.

Currently the input is an ASCII table of wavelength, flux density, flux density uncertainty, and pixel mask values. Output is an ASCII file of the parameter values for each of the quasar components. In the future, we will build in the ability to specify a spectrum, and have it automatically retrieved from the SDSS spectrum server at JHU.

Image Federation

One of the objectives of Grist is a service-oriented pipeline for federation and processing of astronomical image data. Images are reprojected to a common set of projection planes. Multi-wavelength federation can be used to detect very faint or unusual objects, and multi-temporal federation to find faint and transient sources. We expect that source extraction from resampled images will be an important way to extract knowledge from the Quest data, as well as from combinations of independent surveys that have been exposed through NVO services (DPOSS, SDSS, 2MASS, FIRST, etc.).

Atlasmaker is a prototype grid-based workflow manager for building atlases of astronomical images. It is designed for high throughput on distributed supercomputing facilities such as the TeraGrid. It is built from components that include Montage, the NPACI SRB, the NVO Image Access protocol, and the Hyperatlas standard. The package uses and relies on the compute modules in the core Montage code. It provides an executive to run the whole mosaicking machinery, including background estimation and subtraction. It also reports timings of serial and parallel computing, as well as data-fetching times.

Atlasmaker can run on a Unix workstation, or on a parallel machine such as the NSF TeraGrid. Parallelism is through MPI (Message Passing Interface), and assumes that each processor can see the same file space. Atlasmaker can build scripts suitable for being queued in a PBS batch system.
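
As an illustration of the batch-script generation step, the sketch below emits the kind of PBS script such a tool can produce. The job name, queue, node count, wall time, and mpirun command line are placeholders rather than actual Atlasmaker output.

    // Illustrative sketch of the kind of PBS batch script such a tool can emit.
    // Queue name, node count, wall time, and the mpirun line are placeholders;
    // the real Atlasmaker scripts are not reproduced here.
    public class PbsScriptSketch {
        public static void main(String[] args) {
            String script = String.join("\n",
                    "#!/bin/sh",
                    "#PBS -N atlasmaker_page_0042",        // job name
                    "#PBS -l nodes=8:ppn=2",               // request 8 nodes, 2 processors each
                    "#PBS -l walltime=02:00:00",           // maximum run time
                    "#PBS -q dque",                        // placeholder queue name
                    "cd $PBS_O_WORKDIR",
                    // MPI run; each rank reprojects its share of the input images
                    "mpirun -np 16 ./atlasmaker --page 0042 --survey dposs",
                    "");
            System.out.print(script);
        }
    }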

We have developed a suite of scripts for reprojection that has been running on the TeraGrid with several image archives. The Atlasmaker suite includes:

There are also modules for functions specific to the Palomar-Quest survey, such as combining WCS metadata with image data.

We have used Atlasmaker to resample images from the Palomar-Quest survey, the SDSS DR1 (using the TeraGrid), and much of the 2MASS pre-release imagery. Discussions are under way for reprojection of the GOODS survey. Once all these surveys are reprojected to the same pixel planes, direct multi-wavelength image stacking will start.

Hyperatlas is an open standard intended to facilitate the large-scale federation of image-based data. The subject of Hyperatlas is the space of sphere-to-plane projection mappings (the FITS-WCS information), and the standard consists of coherent collections of these on which data can be resampled and thereby federated with other image data. We hope for a distributed effort that will produce a multi-faceted image atlas of the sky, made by federating many different surveys at different wavelengths and different times. We expect that Hyperatlas-compliant imagery will be published and discovered through an International Virtual Observatory Alliance (IVOA) registry, and that grid-based services will emerge for the required resampling and mosaicking.

We have built services that define the nature of the Hyperatlas. Given a sky position, the service can provide the number of the "best" atlas page for that point, and also the FITS-WCS header for that page.
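
The lookup can be illustrated with a simple nearest-centre search. In the sketch below, the three-page layout, pixel scale, and header values are placeholders; the real Hyperatlas standard defines the page tessellation and WCS parameters precisely.

    // Sketch of the "best page" idea: pick the atlas page whose projection centre
    // lies closest (in angular distance) to the requested sky position and emit a
    // minimal TAN FITS-WCS header for it. The page layout and pixel scale are
    // placeholders, not the actual Hyperatlas definition.
    public class HyperatlasLookupSketch {

        static final double[][] PAGE_CENTERS = {      // {RA, Dec} in degrees, illustrative only
                {0.0, 0.0}, {90.0, 0.0}, {180.0, 45.0}
        };

        static double angularDistanceDeg(double ra1, double dec1, double ra2, double dec2) {
            double a = Math.toRadians(ra1), d = Math.toRadians(dec1);
            double b = Math.toRadians(ra2), e = Math.toRadians(dec2);
            double cosSep = Math.sin(d) * Math.sin(e) + Math.cos(d) * Math.cos(e) * Math.cos(a - b);
            return Math.toDegrees(Math.acos(Math.min(1.0, cosSep)));
        }

        static int bestPage(double ra, double dec) {
            int best = 0;
            for (int i = 1; i < PAGE_CENTERS.length; i++) {
                if (angularDistanceDeg(ra, dec, PAGE_CENTERS[i][0], PAGE_CENTERS[i][1])
                        < angularDistanceDeg(ra, dec, PAGE_CENTERS[best][0], PAGE_CENTERS[best][1])) {
                    best = i;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            double ra = 187.7, dec = 12.4;            // roughly the Virgo cluster
            int page = bestPage(ra, dec);
            System.out.println("best page: " + page);
            // Minimal FITS-WCS header for that page (gnomonic projection, 1 arcsec pixels).
            System.out.println("CTYPE1  = 'RA---TAN'");
            System.out.println("CTYPE2  = 'DEC--TAN'");
            System.out.println("CRVAL1  = " + PAGE_CENTERS[page][0]);
            System.out.println("CRVAL2  = " + PAGE_CENTERS[page][1]);
            System.out.println("CDELT1  = " + (-1.0 / 3600.0));
            System.out.println("CDELT2  = " + (1.0 / 3600.0));
        }
    }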

Workflow Manager

A common component of Grist use-cases is connecting together various distributed services to form an integrated workflow. After analysing several ways in which this might be achieved, we decided to use an existing workflow manager, if possible. We ideally wanted a package that was open source, platform independent, GUI driven, Grid capable, and easily extensible. We found that three current packages -- Triana, Taverna, and Viper -- met most, if not all, of these criteria. After a more detailed analysis, we selected Triana as the baseline Grist workflow manager, since it is most clearly aligned with our objectives. We have also established a strong working relationship with the Triana developers, who have been responsive to bug reports and requests for specific functionality needed for Grist.

There are three extensions that we need to implement regarding Triana in the short term:

•  Use of a VO Registry instead of a UDDI registry for service discovery

•  A module to handle the proposed XML dictionary structures that are exchanged between Grist services to set input parameters for services

•  A revised data transfer model that does not involve any unnecessary third party transfers. Currently when multiple services are chained together in Triana, there is a communication back to the Triana client at the completion of each service in the chain. In Grist, we plan to implement a more distributed architecture in which the service chain is processed without needing to connect back to the client except at completion or for status updates.

In the longer term, we plan to incorporate the asynchronous service and security models emerging from the IVOA. We are in consultation with the Triana development team, and it is possible that some of our requests will form part of a subsequent roadmap.

Service Components

Grist will provide a library of services that can be chained together and controlled via the workflow manager. Examples of Grist service types include data access, image mosaicking, source extraction, data mining, statistics, and visualization. We expect that the underlying software for each of these services will be accessible from open source or from our collaborators at no cost to this project.

Figure 2. Web service deployment with Java Native Interface (JNI), Axis, and Tomcat.

The general mechanism we are using to convert existing source code into a service is illustrated in Figure 2. A prerequisite is that the host machine for the service have a web server and servlet container such as Apache Tomcat installed, as well as a SOAP (Simple Object Access Protocol, a protocol for exchanging structured information over the internet) implementation such as Apache Axis. One problem in service deployment is that the Tomcat/Axis mechanism we are using to deploy a service is rooted in the Java language, but most of our algorithms to be deployed are implemented in other languages (e.g. C, C++, Fortran). To bridge this gap, we use the Java Native Interface (JNI) to make the native code and data objects accessible from Java. This requires making a small number of modifications to the native code and compiling it as a shared dynamic library that can be loaded at run time from the Java code. The Java code with hooks into the underlying native code can then be deployed directly as a service using the standard mechanisms provided by the SOAP software.
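
A minimal example of the JNI wrapping step is sketched below. The class name, shared library name, and method signature are placeholders; the deployed Grist services wrap codes such as SExtractor in this fashion, and Axis then exposes the public Java method as a SOAP operation through its usual deployment mechanisms.

    // Minimal sketch of the JNI wrapping step shown in Figure 2. Class name,
    // library name, and method signature are placeholders; the real services
    // wrap codes such as SExtractor.
    public class NativeCodeService {

        static {
            // Loads the shared library (e.g. libsextractorwrap.so) compiled from
            // the modified native C/C++ code.
            System.loadLibrary("sextractorwrap");
        }

        // Implemented in the native library; the JNI header is generated from this declaration.
        private native String runNative(String inputFitsUrl, String paramString);

        /** The method exposed as the web service operation. */
        public String extractSources(String inputFitsUrl, String paramString) {
            return runNative(inputFitsUrl, paramString);   // delegate to the native code
        }
    }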

The subsections that follow describe the main services that we have implemented in the first year.

SExtractor Web Service

The SExtractor web service is used to catalog objects in a FITS image based on user search criteria. It is based on an existing code called SExtractor, developed and placed in the public domain by Emmanuel Bertin, that builds a catalog of objects from an astronomical image. It is particularly oriented towards reduction of large-scale galaxy-survey data, but it also performs well on moderately crowded star fields.

The current version of the SExtractor web service takes a FITS image as input and outputs a catalog of objects found to match the search criteria. It is designed to get search criteria from the client, but currently uses default search criteria, since passing this information from client to server has yet to be implemented. The object catalog is currently returned as an ASCII table, but will soon be modified to be compliant with the VOTable standard.

The next version of the SExtractor web service will be changed to work with the Grist workflow manager (e.g. Triana). We expect to modify the inputs and outputs to simple strings containing URLs, since this would simplify the way files are sent to and returned from the SExtractor web service.
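
A thin client can invoke such a service with the Apache Axis dynamic Call interface, as sketched below. The endpoint URL, namespace, operation name, and the URL-string calling convention are illustrative assumptions rather than the service's actual published interface.

    import java.net.URL;
    import javax.xml.namespace.QName;
    import org.apache.axis.client.Call;
    import org.apache.axis.client.Service;

    // Sketch of a thin client invoking such a service with the Apache Axis 1.x
    // dynamic Call interface. Endpoint, namespace, operation name, and the
    // URL-string calling convention are placeholders for illustration.
    public class SExtractorClientSketch {
        public static void main(String[] args) throws Exception {
            Service service = new Service();
            Call call = (Call) service.createCall();
            call.setTargetEndpointAddress(new URL("http://example.org/axis/services/SExtractor"));
            call.setOperationName(new QName("urn:grist", "extractSources"));

            // Pass the input image by URL; the service returns a URL to the output catalog.
            String catalogUrl = (String) call.invoke(
                    new Object[] { "http://example.org/data/field001.fits" });
            System.out.println("catalog available at: " + catalogUrl);
        }
    }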

K-Means Web Service

The K-Means clustering service was developed to allow large image catalogs to be clustered based on a given property. An example in the astronomy domain would be clustering objects in color-color space in order to identify different classes of quasars and dwarfs. The k-means algorithm developed at Carnegie Mellon University solves the clustering problem efficiently by using kd-trees to reduce the number of nearest-neighbor queries required by the traditional algorithm. Dan Pelleg at CMU provided the Grist project with an implementation of the k-means algorithm as a console program.

We have deployed three versions of a K-Means web service, each having a different communication mechanism between the client and the server. In all cases the input is a catalog of objects and the output is a new catalog with an extra column giving the cluster identifier for each object. The first implementation sends these catalogs in a single package as a binary byte array. We anticipate that this may cause problems for large datasets due to computer memory limitations, so another version with a unique file exchange method was implemented to support large data files. The files are exchanged in streams using a hand-shaking mechanism as follows. The input file sent from the client is stored as a temporary file on the server. The server returns a handle to the temporary file back to the client. The client then uses this unique file handle to communicate to the server for further processing. When the client's request is finished, the server deletes any temporary files it had created for that client. A third implementation uses URLs pointing to the catalogs as input and output, which we expect will facilitate using the service in a workflow. We expect all three of these interfaces to the K-Means web service will be useful components of the Grist services library in different circumstances.
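
The handshake can be sketched from the client's point of view as follows. The KMeansService interface and its method names are hypothetical stand-ins for the remote SOAP operations; the in-memory fake in main merely exercises the four-step protocol.

    import java.util.HashMap;
    import java.util.Map;

    // Client-side sketch of the file-handle handshake described above. The
    // KMeansService interface is a hypothetical stand-in for the remote SOAP
    // service; method names and types are illustrative only.
    public class KMeansHandshakeSketch {

        interface KMeansService {
            String uploadCatalog(byte[] catalogBytes);   // returns a handle to a server-side temp file
            String cluster(String handle, int k);        // runs k-means, returns a result handle
            byte[] downloadCatalog(String resultHandle); // fetch catalog with the extra cluster column
            void release(String handle);                 // server deletes temp files for this request
        }

        /** The handshake as seen from the client: upload, cluster, download, release. */
        static byte[] runClustering(KMeansService service, byte[] catalog, int k) {
            String inputHandle = service.uploadCatalog(catalog);       // 1. stream input, get handle
            try {
                String resultHandle = service.cluster(inputHandle, k); // 2. cluster by handle
                return service.downloadCatalog(resultHandle);          // 3. retrieve the result
            } finally {
                service.release(inputHandle);                          // 4. server-side cleanup
            }
        }

        public static void main(String[] args) {
            // Tiny in-memory stand-in for the remote service, just to exercise the handshake.
            Map<String, byte[]> store = new HashMap<>();
            KMeansService fake = new KMeansService() {
                public String uploadCatalog(byte[] b) { store.put("h1", b); return "h1"; }
                public String cluster(String h, int k) { store.put("r1", store.get(h)); return "r1"; }
                public byte[] downloadCatalog(String h) { return store.get(h); }
                public void release(String h) { store.remove(h); }
            };
            byte[] result = runClustering(fake, "ra,dec,mag".getBytes(), 3);
            System.out.println("received " + result.length + " bytes back from the service");
        }
    }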

To obtain interoperability with other services, the K-Means service was modified to be compliant with the VOTable standard. Translation to and from VOTable was accomplished with STIL (Starlink Tables Infrastructure Library). Currently, the K-Means service is in its testing phase and will require the use of a workflow manager in order to communicate with other services.

The K-Means web service implementation was done by Harshpreet Walia, an undergraduate student from Western Washington University, who is contributing to the Grist project at JPL as a NASA Space Grant Fellow.

WCS Web Services

We have built a set of services to implement the standard projections that astronomers use to represent the celestial sphere on a flat pixel plane. These transformations are known as World Coordinate System (WCS) transformations. Given a transformation, in the form of a FITS-WCS header, the services transform from sky to pixel plane and vice versa. These services are built to interact seamlessly with the Hyperatlas services. We are now developing a set of "footprint" services to decide whether regions on the celestial sphere intersect each other. These will be used for dependency analysis of the DPOSS survey in constructing a dependency graph.
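
For the common gnomonic (TAN) case, the sky-to-pixel transformation reduces to a few lines of trigonometry, as sketched below; the deployed services handle general FITS-WCS headers, and the reference values used here are placeholders.

    // Sketch of the sky-to-pixel transformation for the gnomonic (TAN) case only;
    // the deployed services handle general FITS-WCS headers. The reference values
    // (CRVAL, CRPIX, CDELT) below are illustrative placeholders.
    public class TanProjectionSketch {

        /** Convert (ra, dec) in degrees to (x, y) pixel coordinates for a TAN header. */
        static double[] skyToPixel(double raDeg, double decDeg,
                                   double crval1, double crval2,
                                   double crpix1, double crpix2,
                                   double cdelt1, double cdelt2) {
            double ra = Math.toRadians(raDeg), dec = Math.toRadians(decDeg);
            double ra0 = Math.toRadians(crval1), dec0 = Math.toRadians(crval2);

            // Standard (intermediate) coordinates of the gnomonic projection, in radians.
            double d = Math.sin(dec) * Math.sin(dec0)
                     + Math.cos(dec) * Math.cos(dec0) * Math.cos(ra - ra0);
            double xi  = Math.cos(dec) * Math.sin(ra - ra0) / d;
            double eta = (Math.sin(dec) * Math.cos(dec0)
                     -  Math.cos(dec) * Math.sin(dec0) * Math.cos(ra - ra0)) / d;

            // Scale by the pixel size (CDELT, degrees/pixel) and offset by the reference pixel.
            double x = crpix1 + Math.toDegrees(xi) / cdelt1;
            double y = crpix2 + Math.toDegrees(eta) / cdelt2;
            return new double[] { x, y };
        }

        public static void main(String[] args) {
            // Placeholder header: plate centre at (180, 30) degrees, 1 arcsec pixels.
            double[] xy = skyToPixel(180.1, 30.05, 180.0, 30.0, 512.0, 512.0,
                                     -1.0 / 3600.0, 1.0 / 3600.0);
            System.out.printf("pixel = (%.1f, %.1f)%n", xy[0], xy[1]);
        }
    }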

VOStat Web Service Package

We have developed a novel prototype "VOStat" web service in which statistical computations are performed on Virtual Observatory datasets using grid computing concepts. Several astrophysically interesting applications (i.e., scientific validation tests) for this are being designed.

We have been working on this translation of the VOStatistics concept into a fully-interactive statistical toolkit within the Virtual Observatory (VO) software environment. The prototype implementation consists of both a server-side web services-based framework and a client GUI application. The framework is modular in design and easily extensible: new functionality can be incorporated with essentially just a registration operation. Framework wrapper classes also handle any platform/language dependencies that new functionality may present; this means, for example, that legacy code in Fortran could be integrated into the toolkit as seamlessly as Java. The framework also fully supports distributed components - both data and compute nodes. The client GUI is implemented in Java and provides a structured interface to the available functionality. Expert knowledge of statistical methodology has been incorporated into the GUI design, both in terms of a tailored layout, e.g. all non-parametric goodness-of-fit tests grouped together, and a fully-integrated searchable help system.

The toolkit is VO-compliant, with all data exchanges utilizing the VOTable data format. It currently offers access to a subset of the R statistical language functions and a k-dimensional tree package for clustering and outlier detection. The prototype was demonstrated in a booth at the AAS meeting in Denver in June 2004. The demonstration showed the web interface working interactively on a dataset gathered from Strasbourg and analyzed in Pasadena, driven from the convention center in Denver. This effort was mainly supported by an NSF-FRG grant to Penn State (PI: G. J. Babu).

The following three posters were presented.

•  VOStatistics: a distributed statistical toolkit for the Virtual Observatory

•  VOStat: The R Statistics Package for Astronomical Data Analysis

•  Doing Science with VOStat

We continue to add functionality, help files, and documentation to VOStat.

The Center for Astrostatistics was established at Penn State. Under the auspices of the Center, several statistical methods for use by astronomers are being developed. They include streaming procedures for density estimation in higher dimensions, outlier detection using density and bootstrap resampling techniques, and model selection in the multivariate setting using jackknife-type procedures.

Equipment

Figure 3 shows the Caltech Virtual Astronomy rack, partly bought with Grist funds (Windows database server and one Linux datawulf), and partly bought with other funds. These components act in concert with large shared resources such as TeraGrid. Data from the Palomar-Quest survey is staged on the Linux machines, and permanently archived at NCSA, with transport via TeraGrid backbone. The Linux machines also provide a sandbox for code development; an environment for immediate computing that is more responsive than the TeraGrid batch queues; and a place to build the product that will be ingested by the database.

The Windows database machine is used for large-scale crossmatch of PQ data. It also runs the NVO OpenSkyNode protocol that was developed at Johns Hopkins and Microsoft by Alex Szalay and Jim Gray. This allows PQ data – and other massive catalogs – to be effectively federated even though they are geographically separated. The VirtualSky machine at the bottom of the rack was previously funded by an anonymous donor.

Participants

In the first year, the following organizations and key personnel collaborated on Grist as part of the core team:

•  Caltech Center for Advanced Computing Research:
Roy D. Williams (PI), Sarah Emery Bunn, and Matthew J. Graham.

•  Caltech Astronomy:
Ashish Mahabal and S. George Djorgovski

•  Jet Propulsion Laboratory, Caltech (Parallel Applications Technologies Group):
Joseph C. Jacob, Daniel S. Katz, Craig D. Miller, and Harshpreet Walia

•  The Pennsylvania State University:
Jogesh Babu and Daniel Vanden Berk

•  Carnegie Mellon University:
Robert Nichol

International Cooperation

There was a four-day workshop at Caltech in July entitled "Service Composition for Data Exploration in the Virtual Observatory" (SC4DEVO). The project is one of four initiatives funded by the UK e-Science Core Programme to foster collaboration between international e-Science "sister projects". It brings together those actively working in the data exploration and service composition areas, with a view to understanding the infrastructure requirements of data exploration, specifically for astronomical data but also in a wider e-Science context.

This first SC4DEVO workshop (July 04) defined the scope of the project and developed its workplan. The report identifies science requirements and summarizes current and planned work. We also expect to begin the process of standardizing interfaces to the VO compute services, just as the VO effort has started to standardize data services. Five of the talks at SC4DEVO were about Grist-funded work that is described elsewhere in this report.

Non-Astronomy Collaborations

We have been working with other projects to ensure that our GRIST work is useful in other fields. Two such projects are GENESIS (General Earth Science Investigation Suite) and mROIPAC.

GENESIS (http://genesis.jpl.nasa.gov/) was selected in 2003 under NASA's REASoN (Research, Education, and Applications Solution Network) program. The GENESIS team includes scientists and information technologists at JPL, UCLA, the University of Maine, Scripps Institution of Oceanography, and three NASA data centers (DAACs). GENESIS is building a new suite of web services tools to facilitate multi-sensor investigations in Earth System Science. Residing within a framework known as Sci-Flo, these tools will offer versatile operators for data access, subsetting, registration, fusion, compression, and advanced statistical analysis. They will first be deployed in a model server at JPL, and later released as an open-source toolkit to encourage enhancement by independent developers. While the tools are designed for reuse across many science disciplines, GENESIS focuses on the needs of NASA atmospheric sensors, including AIRS, MODIS, and MISR, on NASA's Terra and Aqua spacecraft, and the GPS occultation sensors on CHAMP, SAC-C, and GRACE. Three DAACs participate in GENESIS to provide the data products, evaluate key technologies, serve as test-beds, and eventually integrate proven functions into their operations. Members of the GRIST team have been regularly participating in GENESIS meetings, both to track their progress and to share knowledge gained in GRIST. Our aim is to ensure that tools developed in one project are compatible with the tools developed in the other project.

mROIPAC is being developed at JPL as a follow-on to ROI_PAC (http://www.openchannelfoundation.org/projects/ROI_PAC), a package of software that applies Interferometric Synthetic Aperture Radar (InSAR) methods to data from satellite radar instruments. While ROI_PAC is a series of executable modules that are driven by a Perl script, mROIPAC will eventually exist in a number of forms: executables, web services, and grid services, all relying on modules built from python-wrapped executables. One of the GRIST developers is participating in mROIPAC development, and we will be sharing our findings on web and grid services as these two projects proceed.

Publications and Products

J. C. Jacob, R. Williams, J. Babu, S. G. Djorgovski, M. J. Graham, D. S. Katz, A. Mahabal, C. D. Miller, R. Nichol, D. E. Vanden Berk, and H. Walia, Grist: Grid Data Mining for Astronomy, submitted to Astronomical Data Analysis Software & Systems (ADASS) XIV, October 2004.

H. Walia, Interoperable Web Services for Astronomy, Summer Undergraduate Research Fellowships (SURF) Summer Seminar Day presentation, Caltech, August 2004.

Babu, G. J. (2004). A note on the bootstrapped empirical process. Journal of Statistical Planning and Inference. To appear.

Babu, G. J. Model fitting in the presence of nuisance parameters. In the proceedings of Astronomical Data Analysis-III. Fionn D. Murtagh (Ed.). To appear.

Babu, G. J., Boyarsky, A., Chaubey, Y. P. and Gora, P. New statistical method for filtering and entropy estimation of a chaotic map from noisy data. Inter. J. Bifurcation and Chaos. Submitted.

Babu, G. J. and Chaubey, Y. P. Smooth estimation of a distribution and density function on hypercube using Bernstein polynomials for dependent random vectors. Submitted.

Babu, G. J., and Djorgovski, S. George. (2004). Some statistical and computational challenges, and opportunities in astronomy. Statistical Science. In press.

Babu, G. J., and Rao, C. R. (2004). Goodness-of-fit tests when parameters are estimated. Sankhya, 66, 1-12.

Feigelson, E. D. and Babu, G. J. (2004). Statistical Challenges in Modern Astronomy. In `PhyStat2003: Statistical problems in Particle Physics, Astrophysics, and Cosmology', 1-7, L. Lyons, R. Mount and R. Reitmeyer (Eds.), Stanford Linear Accelerator Center, Stanford, CA.

McDermott, James P., Babu, G. J., Liechty, John C., and Lin, Dennis K. J. Data skeletons: simultaneous estimation of multiple quantiles for massive streaming data sets, with applications to density estimation. Submitted.

R. D. Williams, S. G. Djorgovski, M. T. Feldmann, and J. C. Jacob, Hyperatlas: A New Framework for Image Federation, Astronomical Data Analysis Software & Systems (ADASS) XIII, October 2003.

R. D. Williams, S. G. Djorgovski, M. T. Feldmann, and J. C. Jacob, Atlasmaker: A Grid-based Implementation of the Hyperatlas, Astronomical Data Analysis Software & Systems (ADASS) XIII, October 2003.

M. J. Graham et al., VOStat prototype demonstration, AAS, Denver, June 2004.

M. J. Graham et al., VOStatistics: a distributed statistical toolkit for the Virtual Observatory, AAS poster presentation, Denver, June 2004.