3 ChemExpo Data Overview
To support chemical decision-making, EPA’s Office of Research and Development (ORD) must identify and characterize relevant exposure pathways, that is, the paths chemicals take from a source to a receptor. How a chemical is used (e.g., in a consumer, occupational, or industrial context) is critical to determining exposure pathways. Over the last decade, ORD has developed a series of datasets and databases containing information collected from public documents that describe how chemicals are used in commerce, including in consumer products. These data are currently released as the Chemicals and Products Database (CPDat). ChemExpo is a new tool that will provide the public with metadata associated with CPDat data sources and with new ways to explore and download the data.
This section describes the main classification systems used in ChemExpo and in CPDat to organize chemical use information extracted from public documents. These include harmonized chemical identifiers (DTXSIDs), Product Use Categories (PUCs), Chemical List Presence Keywords (List Keywords), and Function Categories (FCs). Each of these classification systems has been developed over time by U.S. EPA and/or other groups and is described in the peer-reviewed literature. In the following sections we briefly describe the systems, provide some information about the history of their development, and link to both the relevant literature and the formal definitions (e.g., category definitions) associated with each. Document Types in ChemExpo are also defined in this section. Detailed descriptions of the terminology used herein and in the ChemExpo application can be found in the ChemExpo Glossary (Chapter 12).
The relationships in ChemExpo among documents, products, the various classification schemes, and various extracted information are summarized in Figure 3.1. These relationships are described in more detail in the following sections.
3.1 Chemical Identifiers
ChemExpo makes use of EPA chemical curation workflows (Grulke et al., 2019) that were developed to support ORD research and the CompTox Dashboard. These workflows match reported chemical identifiers (e.g., chemical names and Chemical Abstracts Service Registry Numbers (CASRNs)) to EPA chemical identifiers called Distributed Structure-Searchable Toxicity Database (DSSTox) Substance Identifiers (DTXSIDs) (Grulke et al., 2019). DTXSIDs can be used to directly link data in ChemExpo to other chemical data (e.g., toxicity information) available in the Dashboard. Chemical curation is ongoing in both ChemExpo and DSSTox. Some identifiers cannot be curated to DTXSIDs, either because they don’t represent a unique substance (e.g., a chemical reported on a document as “fragrance”) or because they aren’t currently recognized by the curation workflow.
3.2 Product Use Categories
Some data in ChemExpo are curated from documents that describe specific products. These products are organized into Product Use Categories (PUCs) developed explicitly for exposure assessment and modeling. Currently, there are three different kinds of PUC in ChemExpo: those associated with consumer formulations (e.g., cleaners, personal care products), consumer articles (e.g., furniture), and industrial/occupational products (e.g., raw materials, laboratory supplies). The organizational hierarchy for PUCs consists of three levels: General Category, Product Family, and Product Type. PUCs are organized from general to more specific product types; for example, ‘Personal care’ is a General Category, while ‘Personal care – dental care – toothpaste’ is a specific Product Type. Products can be curated to a higher level (e.g., General Category or Product Family only) if there is not enough information to assign a Product Type.
The consumer product PUCs used in ChemExpo are those described in Isaacs et al. (2020), with some additions and refinements necessitated by the addition of new products to CPDat. In addition, some very general article and industrial/occupational PUCs have been added to ChemExpo as new data sources for these types of products have been curated. The article PUCs were based on harmonized article categories developed by the Organisation for Economic Co-operation and Development (OECD) (Directorate, 2017), and the industrial/occupational product PUCs were developed ad hoc from examination of the data sources.
Products in ChemExpo were assigned to PUCs in one of three ways: manually, in bulk based on data source metadata, or automatically from the product name using natural language processing methods (peer-reviewed publication in development). In ChemExpo, the type of PUC assignment used (manual, bulk, or automated) is provided for each curated product. Definitions for all PUCs used in ChemExpo are provided on the PUC Summary Page.
Products within PUCs may also be assigned specific Attributes. Attributes are keywords further describing the type of formulation (e.g., liquid, spray), user population (e.g., child), or microenvironment of use (e.g., indoor). These attributes may be of use in an exposure assessment. Some PUCs have assumed attributes (e.g., toys are assumed to be relevant to children), and each PUC has a list of allowable attributes.
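The three-level PUC hierarchy described in this section can be illustrated with a small Python sketch. This is not ChemExpo's actual data model: the class name `PUC`, its field names, and the hyphen separator in `label()` are all assumptions made for illustration. The sketch only shows how a product can be curated to a General Category or Product Family alone when a Product Type cannot be assigned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PUC:
    """Hypothetical representation of a Product Use Category (illustration only)."""
    general_category: str
    product_family: Optional[str] = None
    product_type: Optional[str] = None

    def __post_init__(self) -> None:
        # A Product Type is only meaningful beneath a Product Family.
        if self.product_type and not self.product_family:
            raise ValueError("a Product Type requires a Product Family")

    @property
    def level(self) -> str:
        """Name of the most specific hierarchy level that has been assigned."""
        if self.product_type:
            return "Product Type"
        if self.product_family:
            return "Product Family"
        return "General Category"

    def label(self) -> str:
        """Join the assigned levels from most general to most specific."""
        parts = [self.general_category, self.product_family, self.product_type]
        return " - ".join(p for p in parts if p)


# A fully specified PUC, and one curated only to the General Category.
toothpaste = PUC("Personal care", "dental care", "toothpaste")
general_only = PUC("Personal care")
print(toothpaste.label(), "|", toothpaste.level)
print(general_only.label(), "|", general_only.level)
```

Making the class frozen keeps each PUC hashable, so instances could serve as dictionary keys when grouping products by category.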
3.3 Function Categories
ChemExpo contains data about the technical role, or function, performed by a chemical within products or processes (e.g., solvent, plasticizer, fragrance). These data update those previously included in EPA’s Functional Use Database (Phillips et al., 2017) and CPDat (Dionisio et al., 2018). In ChemExpo, reported functions are curated to a set of Function Categories (FCs) that include harmonized functional use categories developed by OECD (Directorate, 2017), in addition to other categories added by U.S. EPA for functions not covered by the OECD categories. EPA-generated FCs are indicated by ‘EPA’ within the category name. These standardized FCs contain explicit definitions and exclusions to reduce ambiguity in category assignment. New FCs will be added as needed to accommodate novel uses. Definitions for all FCs used in ChemExpo are on the FC Summary Page.
3.4 Chemical List Presence Keywords
In addition to specific product and function data, ChemExpo contains general information about how chemicals might be used in commerce. These data are indexed by specific keywords (Chemical List Presence Keywords, often abbreviated as “List Keywords” here and in ChemExpo) that define and describe the presence of chemicals on defined lists contained within public documents (e.g., lists of food-use chemicals or chemicals used in specific industries). These keywords are an update and refinement of the terms previously developed for EPA’s Chemical and Product Categories (CPCat) database (Dionisio et al., 2015) and are described in Koval et al. (2022). These refinements better align the assignment of keywords with other CPDat data streams, namely PUCs and FCs. For example, CPCat included specific functions as keywords, as our FC system did not exist at the time. In addition, CPCat contained a large number of keywords associated with different consumer product types or categories. These terms have been harmonized with and updated to the PUCs. For example, a chemical list from a public document denoting chemicals used in a specific type of personal care product would be assigned a keyword identical to the PUC that would be assigned to specific products in ChemExpo.
Keywords also have kinds: some keywords modify other keywords, and others are associated with a geographic location or population. From a single chemical list, more than one List Keyword can be assigned to a chemical. For example, a list of pesticide active ingredients used in Europe curated from a document would be assigned the keywords “pesticides”, “Europe”, and “active ingredient”. Such a combination is called a List Keyword set. Definitions for all List Keywords (and their associated kind) used in ChemExpo are provided on the List Presence Keywords Summary Page.
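The idea of a List Keyword set can be sketched in Python as follows. This is a minimal illustration and not ChemExpo's implementation: the `ChemicalList` class, the `keyword_sets_by_chemical` helper, and the chemical names "chemical A" and "chemical B" are hypothetical. It assumes only what is described above, namely that every chemical on a list receives that list's full keyword set and that a chemical appearing on several lists can carry several keyword sets.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set, Tuple


@dataclass(frozen=True)
class ChemicalList:
    """Hypothetical chemical list extracted from a public document."""
    title: str
    keyword_set: FrozenSet[str]   # the List Keyword set assigned to this list
    chemicals: Tuple[str, ...]    # reported chemical names (placeholders here)


def keyword_sets_by_chemical(
    lists: Iterable[ChemicalList],
) -> Dict[str, Set[FrozenSet[str]]]:
    """Collect, for each chemical, every keyword set it inherits from its lists."""
    result: Dict[str, Set[FrozenSet[str]]] = {}
    for chemical_list in lists:
        for chemical in chemical_list.chemicals:
            result.setdefault(chemical, set()).add(chemical_list.keyword_set)
    return result


# The example from the text: pesticide active ingredients used in Europe.
eu_pesticides = ChemicalList(
    title="Hypothetical list of pesticide active ingredients used in Europe",
    keyword_set=frozenset({"pesticides", "Europe", "active ingredient"}),
    chemicals=("chemical A", "chemical B"),
)
assigned = keyword_sets_by_chemical([eu_pesticides])
print(sorted(assigned["chemical A"].pop()))
```

Using a `frozenset` treats the keyword set as a single unordered unit, which matches the description of the set as a combination of keywords and not a sequence.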
3.5 Introduction to Data Documents and the Data Document Page
In ChemExpo, documents are categorized by the type of information they contain. The following document types are used in ChemExpo:
3.5.1 Document Types in ChemExpo
Composition Documents: Composition documents contain lists of chemical ingredients for one or more products, which may be identified by, for example, unique Universal Product Codes (UPCs) or product names. These data may be qualitative (e.g., an ingredient list) or quantitative (e.g., weight fraction information). Multiple product records may be extracted from a single composition document (see Figure 3.1) and assigned to Product Use Categories, but chemical composition data are associated with the document from which they were extracted (since each composition document contains unique composition data, that is, a unique formulation). Composition documents may also contain function information that allows individual chemicals to be mapped to Function Categories (FCs). Composition documents may include Safety Data Sheets (see Appendix C for more information on these documents), ingredient disclosures, and ingredient lists.
Chemical List Presence Documents: These documents contain one or more chemical lists that are extracted and mapped to one or more Chemical List Presence Keywords (with multiple keywords comprising a keyword set). These reports and documents provide general information on the use of chemicals, and may come from international, federal, or state agencies, trade associations, or other reputable sources.
Functional Use Documents: These documents contain information on functions related to a specific chemical substance. These reported function data allow the chemical to be mapped to FCs. Functional use documents include technical specification sheets, chemical retailer webpages, and regulatory inventory documents.
3.5.2 The Data Document Page
The Data Document page is the central page type in ChemExpo and contains the specifics of the chemical information extracted from a particular document. Data Document pages can be accessed in several ways: via Search, or via Chemical, Product Use Category, or Function Category pages, as described in later sections. The information displayed depends on the type of Data Document. Generally, the page displays document metadata, such as the document title, subtitle, document date, and the reporting organization. Chemical data are organized into ‘cards’, which contain both raw extracted data and cleaned/harmonized data; these data vary based on Document Type. All chemical cards include the raw (reported) chemical identifiers and the DTXSID (if the identifier was successfully curated). On Composition Documents, chemical cards may include raw and cleaned composition information for the chemical. On Functional Use Documents, chemical cards contain raw reported functions for the chemical as well as the curated Function Categories. On Chemical List Presence Documents, chemical cards contain the List Keywords that have been assigned to the chemical based on the definition of the list in the document (see Table A.5 for more information on data available on the Data Document page).
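The way chemical card contents vary by Document Type can be summarized in a short Python sketch. The enum, the constant names, and the field labels are assumptions chosen for illustration and do not reflect ChemExpo's actual schema or field names. The sketch encodes only the rule stated above: every card carries the raw identifiers and, where curation succeeded, a DTXSID, plus fields specific to the Document Type.

```python
from enum import Enum
from typing import Dict, Tuple


class DocumentType(Enum):
    """The three document types described in this section."""
    COMPOSITION = "Composition Document"
    FUNCTIONAL_USE = "Functional Use Document"
    CHEMICAL_LIST_PRESENCE = "Chemical List Presence Document"


# Shown on every chemical card, regardless of document type.
COMMON_FIELDS: Tuple[str, ...] = (
    "raw chemical identifiers",
    "DTXSID (if curated)",
)

# Additional card content per document type (labels are illustrative;
# composition fields "may" be present according to the text above).
TYPE_SPECIFIC_FIELDS: Dict[DocumentType, Tuple[str, ...]] = {
    DocumentType.COMPOSITION: ("raw composition", "cleaned composition"),
    DocumentType.FUNCTIONAL_USE: ("raw reported functions", "Function Categories"),
    DocumentType.CHEMICAL_LIST_PRESENCE: ("List Keywords",),
}


def card_fields(doc_type: DocumentType) -> Tuple[str, ...]:
    """Return the common fields followed by the type-specific fields."""
    return COMMON_FIELDS + TYPE_SPECIFIC_FIELDS[doc_type]


for doc_type in DocumentType:
    print(doc_type.value, "->", ", ".join(card_fields(doc_type)))
```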