National Study of Diabetes Impact and Care in China

Microdata and Analysis Programs

The IDF Diabetes Impact Studies were proposed and undertaken by the IDF Task Force on Health Economics with the understanding and intention that the data generated by these studies would become freely available to the diabetes research community. 

This webpage contains links to the source microdata for the IDF Diabetes Impact Study in China, and to the data analysis programs that we used to clean, transform, and analyze these data for publication in Yang W et al. (2012) Medical care and payment for diabetes in China: enormous threat and great opportunity. PLoS ONE (in press).

Please refer to Yang et al. and its appendices for a full description of study methods and results.  The interview schedule used in the Chinese study (including field codes) is one of these appendices.

The data-cleaning and data-analysis programs were written in R, an open-source statistical programming environment.  The program links on this page point to plain-text files that R can read and run.

The datasets are provided here as CSV (comma-separated values) flat files that the R programs open directly and that users can also view in Excel or in other spreadsheet and database programs.
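Because the files are plain CSV, they can also be read outside R and Excel. The following is a minimal Python sketch, not part of the study's programs. The column names and values in the in-memory sample are hypothetical, and the text encoding of the downloaded files is not stated here, so it may need to be set explicitly when opening them.

```python
import csv
import io


def read_flat_file(handle):
    """Read a CSV flat file into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


# Tiny in-memory stand-in for a downloaded file. The column names and
# values are hypothetical and are NOT taken from the study data.
sample = io.StringIO("person_id,age\n1,54\n2,61\n")
rows = read_flat_file(sample)
print(len(rows), sorted(rows[0].keys()))

# With a downloaded file, the same function would be called like this:
# with open("CHINA_PERSON_DATA.csv", newline="") as f:
#     rows = read_flat_file(f)
```

Note that csv.DictReader returns every value as a string, so numeric fields have to be converted before analysis.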

To protect the privacy of study subjects, these datasets do not contain names, addresses, phone numbers or other data that could be used to identify individual study participants.  Datasets from the original Chinese national screening study, from which we drew our participants, and against which our subjects are compared in the paper, are not included.

The Epi Info data entry program, also posted on the IDF website, generates five separate datasets, one for each of five sections of the interview schedule (the schedule itself is reproduced in Yang et al., Supporting Information 1).

The versions of these datasets available for download are not the raw data.  Personal identifiers have been removed, and the data have been verified against the hard-copy questionnaires and cleaned (in Excel) to remove implausible and inconsistent values.  In a few instances, service utilization and payment values were winsorized, that is, extreme values were reduced to the level of the highest remaining natural value of the variable.
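To illustrate the kind of winsorizing described above, here is a minimal Python sketch. It is not the procedure actually used, since that cleaning was done by hand in Excel and the rule for deciding which values counted as extreme is not given here. The parameter n_extreme (how many of the largest values to cap) and the payment figures are assumptions made for illustration only.

```python
def winsorize_top(values, n_extreme):
    """Replace the n_extreme largest values with the highest remaining value.

    n_extreme is an assumed input: how the extreme values were chosen in
    the study is not specified, so the caller has to supply the count.
    """
    if n_extreme <= 0:
        return list(values)
    if n_extreme >= len(values):
        raise ValueError("n_extreme must be smaller than the number of values")
    ranked = sorted(values)
    # Highest value left once the n_extreme largest values are set aside.
    cap = ranked[-n_extreme - 1]
    return [min(v, cap) for v in values]


# Hypothetical payment values with one implausibly large entry.
payments = [120, 80, 95, 15000, 110]
capped = winsorize_top(payments, n_extreme=1)
print(capped)
```

In this example the single extreme value is pulled down to 120, the largest of the remaining values, and all other values are unchanged.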

We used the R program, CHINA_DATA_CODING_RCODE.txt, to recode and transform variables into forms more convenient for analysis and to create three amalgamated datasets for use in statistical analyses.  The last lines of this R program generate output datasets 'tot', 'x', and 'y', which yielded, respectively:

  • CHINA_CLEAN_RECODED_SOURCE.csv, an amalgamation that was not used in subsequent analyses;
  • CHINA_PERSON_DATA.csv, a collection of all the original and transformed variables and values at the person-level, i.e., one value per variable per person; and
  • CHINA_MEDS_DATA.csv, a collection of all the data relating to medicines, in which the data are transformed so that the unit of observation is the person-medicine, i.e., one observation for each medicinal compound reported by each person in the study.
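The difference between the person-level and person-medicine layouts can be sketched as a reshape from one row per person to one row per reported medicine. The Python below is an illustration only: the actual transformation is performed in R by CHINA_DATA_CODING_RCODE.txt, and the field names person_id, med_1, and med_2, as well as the sample values, are hypothetical.

```python
def to_person_medicine(people, medicine_fields):
    """Turn one-row-per-person records into one row per person-medicine."""
    long_rows = []
    for person in people:
        for field in medicine_fields:
            name = (person.get(field) or "").strip()
            if name:  # skip empty medicine slots
                long_rows.append(
                    {"person_id": person["person_id"], "medicine": name}
                )
    return long_rows


# Hypothetical person-level records; person 3 reports no medicines.
people = [
    {"person_id": "1", "med_1": "metformin", "med_2": "insulin"},
    {"person_id": "2", "med_1": "metformin", "med_2": ""},
    {"person_id": "3", "med_1": "", "med_2": ""},
]
meds = to_person_medicine(people, ["med_1", "med_2"])
for row in meds:
    print(row["person_id"], row["medicine"])
```

A person who reports no medicines contributes no rows to the person-medicine layout, so such a file generally has a different number of rows than the person-level file.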

The R program, CHINA_PERSON_ANALYSIS_RECODE.csv, inputs the dataset CHINA_PERSON_DATA.csv and outputs tabulations, regressions, and test results for persons.

The R program, CHINA_MEDICINE_ANALYSIS_RECODE.csv, inputs the dataset CHINA_MEDS_DATA.csv and outputs tabulations and statistical test results for person-usages of medicines.
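As a rough idea of what a tabulation over person-usages of medicines looks like, the Python sketch below counts how many person-medicine rows mention each compound. It does not reproduce the R analysis program's output; the field names and sample rows are hypothetical.

```python
from collections import Counter


def tabulate_usages(person_medicine_rows):
    """Count person-medicine rows per medicine, most frequent first."""
    counts = Counter(row["medicine"] for row in person_medicine_rows)
    return counts.most_common()


# Hypothetical person-medicine rows (one row per medicine per person).
usage_rows = [
    {"person_id": "1", "medicine": "metformin"},
    {"person_id": "1", "medicine": "insulin"},
    {"person_id": "2", "medicine": "metformin"},
]
usage_table = tabulate_usages(usage_rows)
for medicine, n in usage_table:
    print(medicine, n)
```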

If you are interested in using these data and programs, you are welcome to contact the global principal investigator of the IDF Diabetes Impact Studies, Dr. Jonathan Brown, at jonabrown@gmail.com, for technical assistance and advice.  Or, you may telephone him in the United States (Pacific Time Zone) at +1 503 473 4796.