Technical notes

Population

The population counts used in analysis and presented in the dashboard are provided by the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) program. See their U.S. Population Data page for more information.

Exclusions

Cases of cancer and deaths due to cancer among individuals who had an unknown age at the time of diagnosis/death are excluded from all calculated statistics in this report.

Cases of cancer that have an invalid Site Recode ICD-O-3/WHO 2008 value are excluded.

Individuals who are transgender, intersex, or of unknown sex are also excluded.

Cancer Incidence

The cancer incidence statistics in this report were collected by the Pennsylvania Cancer Registry (PCR). The PCR began collecting statewide cancer incidence data in 1985. Only cases diagnosed among Pennsylvania residents are included in this report. Reporting sources are required to submit all cancer cases newly diagnosed and/or treated at their facility. If an individual had more than one primary tumor, each tumor is reported and counted. Thus, counts (cases) reported in this publication are based on the number of primary sites of cancer rather than the number of people with the disease. Starting with 2001 data, cancer cases were coded according to the “International Classification of Diseases for Oncology, Third Edition” (ICD-O-3).

All statistics in this report use only invasive cancers unless otherwise noted. Because of the difficulty in interpreting the language used by pathologists to describe the extent of invasion of bladder cancers, in situ bladder cancers are combined with invasive bladder cancers and included in the total for all invasive cancer sites. This practice is consistent with SEER.

Starting with the November 2020 data submission, the Cancer Statistics Dashboard uses “Behavior code ICD-O-3” rather than the “Behavior recode for analysis” field to classify tumor malignancy. The Behavior Variable Changes page on SEER’s website describes this change in more detail.

Hispanic ethnicity is determined with the NAACCR Hispanic Identification Algorithm (NHIA) developed by the NAACCR Race and Ethnicity Work Group (2011).

Mortality

Pennsylvania’s certificate of death is the source document for the Pennsylvania cancer mortality data contained in this publication. The death certificates are usually completed by hospital personnel, physicians, and funeral directors. Pennsylvania uses the International Classification of Diseases, Tenth Edition (ICD-10) cause-of-death classification system.

Rates per 100,000

Rates were calculated to determine what proportion of the population was affected by cancer. However, cancer rates are so small that statisticians regularly multiply the rate by 100,000 to avoid rates with several decimal places (e.g., a rate of 0.000115 becomes 11.5 per 100,000). This concept is very similar to a percentage, where the result is multiplied by 100 to avoid a decimal result (e.g., \(\frac{25}{1000} = 0.025\), which becomes 2.5% after multiplying by 100).

Age-adjusted Rates

Cancer incidence and death rates were age-adjusted so that regions with an older or younger population distribution can be compared without age bias. The adjustment calculates the rate as if a region had the same age distribution as the standard. In this report, age-adjusted rates were calculated by the direct method, which applies the U.S. standard population across 19 age groups, and reported per 100,000. Rates based on fewer than 20 events are considered statistically unreliable and are not displayed. Age-adjusted rates should only be compared to other age-adjusted rates that were calculated in a similar manner.
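The direct method described above can be sketched in Python as a weighted sum of age-specific rates. This is only an illustration: the function name, the three age groups, the weights, and all counts are hypothetical (the report uses 19 age groups and the U.S. standard population), and the suppression threshold simply mirrors the 20-event rule stated above.

```python
def age_adjusted_rate(events, populations, standard_weights,
                      per=100_000, min_events=20):
    """Directly age-standardized rate, or None when too few events."""
    if not (len(events) == len(populations) == len(standard_weights)):
        raise ValueError("inputs must have one entry per age group")
    if sum(events) < min_events:
        return None  # treated as statistically unreliable, so not displayed
    total_weight = sum(standard_weights)
    # Weighted sum of age-specific rates; weights come from the standard population.
    rate = sum(
        (w / total_weight) * (d / n)
        for d, n, w in zip(events, populations, standard_weights)
    )
    return rate * per


# Hypothetical data with three age groups (youngest to oldest).
events = [10, 50, 200]
populations = [100_000, 80_000, 40_000]
standard_weights = [0.5, 0.3, 0.2]

adjusted = age_adjusted_rate(events, populations, standard_weights)
suppressed = age_adjusted_rate([1, 2, 3], populations, standard_weights)
print(round(adjusted, 2), suppressed)
```

In this made-up region the oldest age group is underrepresented relative to the standard weights, so the adjusted rate comes out higher than the crude rate for the same counts.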

Crude Rates

Crude rates were calculated by dividing the total number of cancer cases or deaths in a given time period by the total number of persons in the population.
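As a minimal sketch of this calculation, the Python function below divides an event count by a population and scales the result to a rate per 100,000. The function name and the example numbers are illustrative assumptions, not dashboard values.

```python
def crude_rate_per_100k(events, population):
    """Return the crude rate per 100,000 persons."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events / population * 100_000


# Hypothetical example: 115 events in a population of 1,000,000.
example_rate = crude_rate_per_100k(115, 1_000_000)
print(round(example_rate, 1))
```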

Confidence Interval

Rates as reported in the dashboard are point estimates. The standard error (S.E.) for a rate defines the variability in this point estimate. With a standard error for a rate, one can compute a confidence interval (C.I.), i.e., a range for an observed rate that should include the actual or real rate 95% of the time. For more information, go to the Tools of the Trade section of the Pennsylvania Department of Health’s website. The confidence intervals, as presented in this report, are based on the Tiwari et al. modification for estimating the standard error (Tiwari, Clegg, & Zou, 2006).
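To illustrate how a standard error turns into a confidence interval, the sketch below uses a simple normal approximation for a crude rate, treating the event count as Poisson. This is not the Tiwari et al. method used for the dashboard's age-adjusted rates; it is a simpler stand-in chosen for illustration, and the function name and numbers are hypothetical.

```python
import math


def crude_rate_ci(events, population, per=100_000, z=1.96):
    """Approximate 95% CI for a crude rate (normal approximation, Poisson count)."""
    if events <= 0 or population <= 0:
        raise ValueError("events and population must be positive")
    rate = events / population * per
    # For a Poisson count d, the standard error of d / n is sqrt(d) / n.
    std_error = math.sqrt(events) / population * per
    return rate, rate - z * std_error, rate + z * std_error


# Hypothetical example: 400 events in a population of 1,000,000.
rate, lower, upper = crude_rate_ci(400, 1_000_000)
print(round(rate, 2), round(lower, 2), round(upper, 2))
```

The approximation degrades for small counts, which is one reason gamma-based intervals such as the Tiwari modification are preferred in practice.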

Ranking

Cancer incidence and death data were ranked by frequency count and are limited to the 23 primary sites used throughout this dashboard. Occasionally, fewer than 23 primary sites will be ranked, since some sites are sex-specific (e.g., prostate cancer will not be ranked for females).

Trends

The trend of a rate over a period is displayed using a “joinpoint” model. A joinpoint model fits connected segments, each spanning a portion of the period, to the data. These segments do not represent the actual rates. Each segment has a constant percent change between subsequent years, which is called an annual percent change (APC). The APC is estimated by fitting a least squares linear regression model to the natural logarithm of the rates.
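For a single segment, the APC calculation can be sketched as follows: regress the natural logarithm of the rate on the year and convert the slope back to a percent change. This is a simplified, unweighted illustration and not the Joinpoint software's implementation; the function name and the synthetic rates are assumptions.

```python
import math


def annual_percent_change(years, rates):
    """Estimate the APC from an ordinary least squares fit of ln(rate) on year."""
    n = len(years)
    if n < 2 or n != len(rates):
        raise ValueError("need at least two (year, rate) pairs")
    log_rates = [math.log(r) for r in rates]
    mean_x = sum(years) / n
    mean_y = sum(log_rates) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, log_rates))
    slope = sxy / sxx
    # A slope b on the log scale corresponds to a yearly change of (e^b - 1) * 100 percent.
    return (math.exp(slope) - 1) * 100


# Synthetic rates that grow by exactly 2% per year.
years = list(range(2010, 2016))
rates = [50.0 * 1.02 ** (year - 2010) for year in years]
apc = annual_percent_change(years, rates)
print(round(apc, 4))
```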

A statistical algorithm determines where one segment ends and the next begins. These points are called joinpoints. The algorithm adds joinpoints only if they represent a statistically significant change in the trend. Statistical significance is determined at the 95% confidence level using a permutation test.

The Joinpoint software, provided by the National Cancer Institute, was used to create these models using the default settings. The models were created according to certain criteria:

Joinpoints occur at exact years;
The number of joinpoints is between zero and 3; and
Each segment must be at least 3 years long.

APC statistics only apply to a single segment between joinpoints. A similar measure, the average annual percent change (AAPC), can be calculated for any desired range of years. The AAPC is a geometric weighted average of the APCs for the included years.
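The weighted geometric average can be sketched as below: each segment's APC is converted to a slope on the log scale, the slopes are averaged with weights equal to the number of years each segment contributes to the range, and the result is converted back to a percent. The function name and the segment values are hypothetical.

```python
import math


def average_annual_percent_change(segments):
    """AAPC from (apc_percent, years_in_range) pairs, one pair per segment."""
    total_years = sum(length for _, length in segments)
    if total_years <= 0:
        raise ValueError("segments must cover at least one year")
    # Average the log-scale slopes, weighting by segment length.
    weighted_slope = sum(
        math.log(1 + apc / 100) * length for apc, length in segments
    ) / total_years
    return (math.exp(weighted_slope) - 1) * 100


# Hypothetical range: 4 years at +2% per year, then 6 years at -1% per year.
segments = [(2.0, 4), (-1.0, 6)]
aapc = average_annual_percent_change(segments)
print(round(aapc, 3))
```

When the range falls inside a single segment, the AAPC equals that segment's APC.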

Prevalence

The limited duration prevalence of a cancer is the number of cases alive on a certain date that had been diagnosed within a certain amount of time before that date.

Estimates are based on Pennsylvania Cancer Registry case records. The registry practices passive follow-up and learns of patient deaths by linking records with state death certificates and national databases. The prevalence estimate is the number of patients who were diagnosed with cancer during the time interval and were still alive at the date of interest. Patients of unknown age were excluded.

The analysis does not account for “cured” cancers or those who moved after a cancer diagnosis.

Prevalence was calculated using the first invasive tumor for each cancer site diagnosed during the previous 32 years. For example, consider a person with a breast cancer diagnosis in 1985 and a lung cancer diagnosis in 1990. If prevalence is displayed by site (all sites, breast, lung), then the breast cancer would contribute to the “all sites” and “breast” estimates, and the lung cancer would contribute to the “lung” estimate. In other words, “all sites” is treated as a separate cancer “site.”
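The counting rule in this example can be sketched in Python. The record layout, field names, dates, and the choice to treat a death on the prevalence date as not alive are all assumptions made for illustration; the dashboard estimates themselves come from SEER*Stat.

```python
from datetime import date


def first_tumors_in_window(tumors, window_start, prevalence_date, site=None):
    """Map each prevalent patient to their first qualifying tumor.

    site=None stands for "all sites", which is treated as its own "site".
    """
    first = {}
    for t in tumors:
        if site is not None and t["site"] != site:
            continue
        if not (window_start <= t["diagnosis_date"] <= prevalence_date):
            continue
        died = t["death_date"]
        if died is not None and died <= prevalence_date:
            continue  # assumption: patient must be alive after the prevalence date
        current = first.get(t["patient_id"])
        if current is None or t["diagnosis_date"] < current["diagnosis_date"]:
            first[t["patient_id"]] = t
    return first


# Hypothetical records; patient 1 matches the breast/lung example above.
tumors = [
    {"patient_id": 1, "site": "breast", "diagnosis_date": date(1985, 6, 1), "death_date": None},
    {"patient_id": 1, "site": "lung", "diagnosis_date": date(1990, 3, 15), "death_date": None},
    {"patient_id": 2, "site": "lung", "diagnosis_date": date(2000, 1, 10), "death_date": date(2005, 5, 5)},
    {"patient_id": 3, "site": "breast", "diagnosis_date": date(2010, 7, 7), "death_date": None},
]
window_start, prevalence_date = date(1985, 1, 1), date(2017, 1, 1)

all_sites = first_tumors_in_window(tumors, window_start, prevalence_date)
breast = first_tumors_in_window(tumors, window_start, prevalence_date, site="breast")
lung = first_tumors_in_window(tumors, window_start, prevalence_date, site="lung")
print(len(all_sites), len(breast), len(lung))
```

Patient 1 is counted once under "all sites" (through the breast tumor) and once under each specific site, which is why site-level counts do not sum to the "all sites" count.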

A prevalence number for all counties in Pennsylvania may not necessarily equal the sum of the numbers for each county. This is because prevalence is a count of patients, and a patient may be included in multiple counties if their county of residence changes between cancer diagnoses.

Beginning with the release of the 2017 incidence data, the prevalence calculation involving multiple primary tumors changed from counting the first cancer per prevalence interval to counting the first cancer over all intervals. In the prior method, each interval was considered separately, and we counted the person’s first tumor diagnosed within the interval’s range of years prior to the prevalence date. Therefore, the sum of the prevalence counts for the intervals did not necessarily equal the count for the total duration period. Using the new method, a person is counted once toward the overall prevalence and is counted in only one interval. Therefore, the sum of the prevalence counts for the intervals will equal the count for the total duration period. Prevalence counts and percentages beginning with the 2017 incidence data cannot be directly compared to earlier versions of the Cancer Statistics Dashboard.

Estimates were calculated using the SEER*Stat software provided by the Surveillance, Epidemiology, and End Results (SEER) program.

Net Survival

The survival statistics presented in this dashboard are all net survival rates. A net survival rate is the expected survival rate in a hypothetical world where the cancer is the only possible cause of death (i.e., without deaths from other causes). Because the listed cause on a death certificate is not totally reliable, the Pohar-Perme method (Perme, Stare, & Estève, 2012), which relies on life tables rather than on the recorded cause of death, is used. The risk of death unrelated to cancer was taken from life tables created using the entire population of Pennsylvania residents. These tables include cancer deaths in calculating risk, but this should have little effect on the net survival estimates. Cancer records were matched to life tables by age, race, sex, and calendar year.

A patient was included in the net survival analysis only if he or she:

was a Pennsylvania resident;
was at least 15 and at most 99 years old at diagnosis;
was not transgender, intersex, or of unknown sex;
was not of unknown race;
had a known year of birth;
had a known year of last contact; and
survived at least 1 day past diagnosis.

A cancer diagnosis was included only if it:

was diagnosed during the period of 2001 to 2014;
was reported by a source other than a death certificate;
was the first malignant diagnosis for the patient; and
had a known year of diagnosis.

The net survival rates presented throughout the report, except for those labeled as “Net relative” on the Map and Data pages, were age-adjusted according to the International Cancer Survival Standards (ICSS) populations (Corazziari, Quinn, & Capocaccia, 2004). Net survival rates for prostate cancer were age-adjusted according to population weights provided by SEER.

All net survival statistics were created using the SEER*Stat (Surveillance Research Program, National Cancer Institute, 2023d) software provided by the SEER program.

Unadjusted net survival

Note: Be cautious when using unadjusted net survival rates.

Age-adjusted net survival rates can only be calculated if all age groups have at least one patient who is not censored the desired number of years after diagnosis. When viewing age-adjusted net survival rates on the Map page, certain selections result in maps where many counties lack rates.

The unadjusted rates can be used in these cases but should be interpreted with caution. Age is used to determine patients’ expected survival rates from the life tables, but it is possible net hazard is also influenced by age (Dickman & Coviello, 2015). Comparisons between unadjusted net survival rates may be misleading.

Probability of Developing/Dying

The age-conditional probabilities of developing or dying from cancer are calculated using the competing risks method proposed by M. Fay, Pfeiffer, Cronin, Le, & Feuer (2003) and M. P. Fay (2004).

The risk of developing a cancer within the age range \(A\) to \(B\) is the probability a cancer-free resident of age \(A\) will be diagnosed with cancer by the time he or she is age \(B\). These estimates are based on observed records from the PCR and account for patients who died before reaching age \(B\) or receiving a cancer diagnosis. For the overall risk across all cancer types, only the first diagnoses of cancer among patients were used in creating the estimate. For the risk of developing a specific cancer type, only the first diagnoses of the type were used in creating the estimate.

The risk of dying of cancer within an age range uses a similar model. The only difference is the event of interest is death from cancer. For the risk of dying from a specific cancer type, cancer deaths other than the specific type were grouped with deaths from non-cancer causes.

Lifetime risk is the risk between age 0 and death.

Estimates were created using the DevCan software provided by SEER.

Projections

Cancer incidence and mortality data only become available a few years after the events occur. Projected numbers of diagnoses and deaths can substitute for more recent data.

A case was used for projecting incidence only if it:

had a known year and month of diagnosis;
was diagnosed between 2005 and 2019;
was reported to the PCR within 23 months of the diagnosis year’s end; and
was classified as malignant according to the SEER Behavior Recode for Analysis.

A death certificate was used for projecting deaths only if it:

had a known year and month of death;
had a year of death between 2005 and 2019; and
listed cancer as the underlying cause of death (i.e., “C” was the first character in the ICD-10 code).
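The three death certificate criteria above can be expressed as a simple filter. The sketch below assumes a hypothetical record layout (the field names `death_year`, `death_month`, and `underlying_cause_icd10` are invented for illustration), and the sample records are made up.

```python
def eligible_for_death_projection(record, first_year=2005, last_year=2019):
    """Return True if a death record meets all three selection criteria."""
    year = record.get("death_year")
    month = record.get("death_month")
    code = record.get("underlying_cause_icd10") or ""
    if year is None or month is None:
        return False  # year and month of death must both be known
    if not (first_year <= year <= last_year):
        return False
    # Cancer as underlying cause: ICD-10 code begins with "C".
    return code.startswith("C")


# Hypothetical records.
records = [
    {"death_year": 2012, "death_month": 5, "underlying_cause_icd10": "C34.1"},
    {"death_year": 2012, "death_month": 5, "underlying_cause_icd10": "I21.9"},
    {"death_year": 2003, "death_month": 1, "underlying_cause_icd10": "C50.9"},
    {"death_year": 2015, "death_month": None, "underlying_cause_icd10": "C18.9"},
]
selected = [r for r in records if eligible_for_death_projection(r)]
print(len(selected))
```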

Monthly time series of event counts were created from selected records and split into four sets:

Cancer incidence by county
Cancer incidence by sex and cancer site
Cancer deaths by county
Cancer deaths by sex and cancer site

Each time series was for a specific county or combination of sex and cancer site.

All counts were adjusted for delayed reporting. The number of delayed cases for a cancer site was estimated by combining a logistic model and an ANOVA regression. The logistic model estimated the probability any delayed cases would be reported in a specific year. The ANOVA model estimated the number of delayed cases reported in a specific year, assuming there would be at least one. The results of these two models were multiplied together for a final estimated number of delayed cases in a specific year. Each cancer site had its own logistic and ANOVA models, and all models used year of diagnosis, year of reporting, sex, and age group as independent variables. See Delay-adjusted Rates below for the general details on the ANOVA model.
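The way the two model outputs combine can be shown with a small sketch. It takes the two fitted values as given and does not attempt to fit either model; the function name and the input values are hypothetical.

```python
def expected_delayed_cases(prob_any_delayed, mean_count_if_any):
    """Combine the two model outputs into one expected count.

    prob_any_delayed: logistic model output, P(at least one delayed case).
    mean_count_if_any: ANOVA model output, expected count given at least one.
    """
    if not 0.0 <= prob_any_delayed <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if mean_count_if_any < 1:
        raise ValueError("conditional count assumes at least one delayed case")
    return prob_any_delayed * mean_count_if_any


# Hypothetical fitted values for one site, sex, age group, and year.
estimate = expected_delayed_cases(0.25, 8.0)
print(estimate)
```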

County projections were based on the subjects’ counties of residence at the time of diagnosis or death. If the county of residence was not known, the case or death was not used in modeling. Projections by sex and site excluded patients with unknown sex. Subjects with a sex other than male or female were also excluded, because reliable models could not be created with the low numbers of events.

Multiple methods of choosing and fitting models were applied to each series in a version of cross-validation for time series as described by Arlot & Celisse (2010). The model parameters were chosen in a stepwise process using the forecast package (Hyndman et al., 2018) in R. Composite forecasts were also created by averaging those from different combinations of the individual models.

The Mean Absolute Scaled Error (MASE) score (Hyndman & Koehler, 2006) was calculated for each forecast of a series. For each of the four sets of time series, the forecasting method with the lowest average MASE score was used for projecting. The monthly projections were then rescaled so that the sum of the projected county counts matched the sum of the projected sex and cancer site counts.
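As a rough illustration of the MASE score, the sketch below divides the mean absolute forecast error by the mean absolute error of a naive forecast on the training data. The `season_length` parameter is an assumption: a value of 1 gives the non-seasonal naive forecast, and 12 would give a seasonal naive forecast for monthly data. The source does not state which scaling was used, and the series values are made up.

```python
def mase(train, actual, forecast, season_length=1):
    """Mean Absolute Scaled Error of a forecast against held-out actuals."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if len(train) <= season_length:
        raise ValueError("training series is too short for this season length")
    # In-sample mean absolute error of the naive forecast y[t] = y[t - season_length].
    naive_errors = [
        abs(train[t] - train[t - season_length])
        for t in range(season_length, len(train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive forecast error is zero; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Hypothetical monthly counts.
train = [10, 12, 11, 13, 12]
actual = [14, 13]
forecast = [13, 13]
score = mase(train, actual, forecast)
print(round(score, 3))
```

A score below 1 means the forecast errors are smaller, on average, than those of the naive forecast on the training data.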

Each year’s projected number of events is the sum of that year’s monthly projections.
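The rescaling and annual summation steps can be sketched as below. The direction of the rescaling (county counts scaled to match the sex and site total) is an assumption, since the text only says the two sums were made to match, and all numbers are hypothetical.

```python
def rescale_to_total(values, target_total):
    """Scale values proportionally so that they sum to target_total."""
    current_total = sum(values)
    if current_total == 0:
        raise ValueError("cannot rescale values that sum to zero")
    factor = target_total / current_total
    return [v * factor for v in values]


# Hypothetical projections for a single month.
county_counts = [30.0, 50.0, 20.0]   # sums to 100
site_counts = [60.0, 50.0]           # sums to 110
rescaled_counties = rescale_to_total(county_counts, sum(site_counts))

# Hypothetical monthly projections for one series and one year.
monthly_projection = [10.0] * 12
annual_projection = sum(monthly_projection)
print(rescaled_counties, annual_projection)
```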

Delay-adjusted Rates

Timely and accurate calculation of cancer incidence rates is hampered by reporting delay. Reporting delay is the time that has elapsed before a diagnosed cancer case is reported to the Pennsylvania Cancer Registry. Cancer cases diagnosed among Pennsylvania residents are first submitted to the CDC’s National Program of Cancer Registries (NPCR) about 2 years after the end of a diagnosis year (e.g., a complete 2019 year of diagnosis was first submitted in late 2021).

In subsequent submissions, the data for that diagnosis year are updated as:

New cases are found to have been diagnosed within that diagnosis year; and
New information is received about previously submitted cases.

Reporting delay is used to adjust the current case count to account for anticipated future corrections to the data. Delay-adjusted counts and rates are needed to produce cancer incidence trends that are not impacted by late reporting.

For details of the statistical modeling, please refer to Development of the Delay Model on the website at http://www.surveillance.cancer.gov/delay/.

References

Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79. https://doi.org/10.1214/09-SS054

Corazziari, I., Quinn, M., & Capocaccia, R. (2004). Standard cancer patient population for age standardising survival ratios. European Journal of Cancer, 40(15), 2307–2316. https://doi.org/10.1016/j.ejca.2004.07.002

Dickman, P. W., & Coviello, E. (2015). Estimating and modeling relative survival. The Stata Journal, 15(1), 186–215. https://doi.org/10.1177/1536867X1501500112

Fay, M. P. (2004). Estimating age conditional probability of developing disease from surveillance data. Population Health Metrics, 2(1), 6. https://doi.org/10.1186/1478-7954-2-6

Fay, M., Pfeiffer, R., Cronin, K. A., Le, C., & Feuer, E. (2003). Age-conditional probabilities of developing cancer. Statistics in Medicine, 22(11), 1837–1848. https://doi.org/10.1002/sim.1428

Fritz, A., Percy, C., Jack, A., Shanmugaratnam, K., Sobin, L., Parkin, D. M., & Whelan, S. (Eds.). (2000). International classification of diseases for oncology (3rd ed.). Geneva: World Health Organization.

Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679–688. https://doi.org/10.1016/j.ijforecast.2006.03.001

Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., O’Hara-Wild, M., … Yasmeen, F. (2018). forecast: Forecasting functions for time series and linear models. Retrieved from http://pkg.robjhyndman.com/forecast

Microsoft. (2022). Power BI desktop (Version 2.116.622.0). Retrieved from https://powerbi.microsoft.com

NAACCR Race and Ethnicity Work Group. (2011). NAACCR Guideline for Enhancing Hispanic/Latino Identification: Revised NAACCR Hispanic/Latino Identification Algorithm [NHIA v2.2.1]. Springfield, IL: North American Association of Central Cancer Registries. Retrieved from https://www.naaccr.org/wp-content/uploads/2016/11/NHIA_v2_2_1_09122011.pdf

Perme, M. P., Stare, J., & Estève, J. (2012). On estimation in relative survival. Biometrics, 68(1), 113–120. https://doi.org/10.1111/j.1541-0420.2011.01640.x

R Core Team. (2019). R: A language and environment for statistical computing (Version 4.2.2). Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/

Surveillance Research Program, National Cancer Institute. (2023a). DevCan software (Version 6.9.0). Retrieved from https://surveillance.cancer.gov/devcan/

Surveillance Research Program, National Cancer Institute. (2023b). Joinpoint software (Version 5.0.2). Retrieved from https://surveillance.cancer.gov/joinpoint/

Surveillance Research Program, National Cancer Institute. (2023c). SEER*explorer. Retrieved from https://seer.cancer.gov/statistics-network/explorer/

Surveillance Research Program, National Cancer Institute. (2023d). SEER*stat software (Version 8.4.1.1). Retrieved from https://seer.cancer.gov/seerstat/

Tiwari, R. C., Clegg, L. X., & Zou, Z. (2006). Efficient interval estimation for age-adjusted cancer rates. Statistical Methods in Medical Research, 15, 547–569. https://doi.org/10.1177/0962280206070621

World Health Organization. (2016). International statistical classification of diseases and related health problems, 10th revision (5th ed.). Geneva: World Health Organization.