Can machine learning fill the gaps in Canada’s health data?

Last modified: January 25, 2021 at 1:00pm

While long recognized as important social determinants of health, race and ethnicity aren’t routinely collected in Canadian databases that track illness and disease.

New research from the University of Alberta, led by the former acting epidemiologist for the Northwest Territories, looks at a potential way to fill those data gaps.

Dr. Kai On Wong is a senior data scientist at the Northern Alberta Clinical Trials and Research Centre. He has created a machine learning framework – a computer algorithm that improves with experience – that he says can predict ethnicity and Indigenous status based on a person’s name.


He published his findings in the scientific journal PLOS One.

“There’s a lack of a system-wide collection of ethnicity information on all of the major health data resources,” he said.

“That potentially becomes a huge issue because, without the data, we will not be able to identify what the existing racial and ethnic-related issues are.”

Wong said the machine learning framework analyzes names, geographic information, and features such as spelling to predict whether someone belongs to one of 13 ethnic groups.

Testing it with data from the 1901 census, researchers found it worked best at identifying people with Chinese, French, Japanese and Russian heritage based only on their names. Adding geographic information, the study says, improved the algorithm’s ability to predict Indigenous identity from 50 to 84 percent. 
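For readers curious how a name-based classifier can work in principle, here is a minimal sketch. It is not Wong's published framework: it assumes a simple naive Bayes model over character trigrams (a stand-in for "spelling") plus an optional region tag (a stand-in for "geographic information"). The names, regions and group labels are invented placeholders, and every function and class name is hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n=3):
    """Return overlapping character n-grams of a lowercased, padded name."""
    padded = "^" + name.lower().strip() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def features(name, region=None):
    """Combine spelling features with an optional geographic feature."""
    feats = char_ngrams(name)
    if region is not None:
        feats.append("region=" + region)
    return feats


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, rows):
        for feats, label in rows:
            self.label_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in feats:
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder data: (name, region, group label). Not real records.
TRAIN = [
    ("Alderon", "north", "group_a"),
    ("Aldermin", "north", "group_a"),
    ("Zorvik", "south", "group_b"),
    ("Zorvan", "south", "group_b"),
]

model = NaiveBayes().fit([(features(n, r), g) for n, r, g in TRAIN])
prediction = model.predict(features("Alderin", "north"))
print(prediction)
```

A real system would need far more training data, careful validation for each group, and the consent and privacy safeguards discussed later in this article.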


Wong, who said he and his family immigrated to Canada from Hong Kong in 1998, first became interested in racial and ethnic health inequalities while working for the NWT government. 

An ‘urgent’ issue in Canada

“Being from an originally very racially homogeneous population into Canada, which is extremely ethnically diverse, kind of opened my eyes of what a country could be like with such ethnic diversity,” he said, adding he learned more about racial health inequalities through his work.


“That deepened my sense of appreciation and my understanding that there’s an urgent issue within Canada.”

Wong hopes his framework will be developed further to help improve health databases across Canada, adding he has made his code publicly available.

Dr. Cheryl Peters, a research scientist and adjunct professor at the University of Calgary’s Cumming School of Medicine, often analyzes big datasets for patterns of exposure and disease.

She said the machine learning proposed in Wong’s study is a “unique and interesting” way to fill in potential information gaps.

During the pandemic in particular, Peters said, little data is being collected about people’s individual characteristics, like their job title or ethnicity.

“It’s very difficult for us to look at whether or not there are patterns in people’s occupations, or in their communities where they live, that we could basically use to slow the spread of Covid-19,” she said.

In May, for example, the Yellowhead Institute reported there were discrepancies between the number of Covid-19 cases in Indigenous communities reported by Indigenous Services Canada and the number reported by community sources. 

Two sides of the coin

However, Peters noted that collecting race or ethnicity information can be difficult because of longstanding issues of racism and negative racial stereotypes.

“There’s this challenge of wanting to be able to describe an issue more fully, to get resources to a community that might be identified by a racial or ethnic group that needs it more,” she said.

“But you also don’t want to have this level of stigma that comes along with categorizing people in these largely arbitrary ways.

“It’s an interesting two sides to the coin that people working in public health have to think about.” 

The Canadian Institute for Health Information has proposed pan-Canadian standards for collecting race-based and Indigenous identity data in health systems. It says standards are important for consistent health reporting and measuring inequalities that stem from bias and racism. 

Dana Riley, who is a program lead with the institute’s Canadian Population Health Initiative Program, said informed consent, privacy and cultural safety are vital when collecting information like ethnicity, race, or language.

“There’s a big concern that the collection of this type of data would be used to further stigmatize racialized and Indigenous groups,” she said.

Riley said it’s also important to clarify how the information will be collected and used and to make sure communities are involved in those decisions.

“Community engagement is key from the outset,” she said, “then making sure this type of data collection is tied to concrete actions … that folks are held accountable for how the data is going to be used.

“There’s no point in collecting the data if it is not going to be used to reduce inequities.”

Correction: January 22, 2021 – 9:28 MT. This article initially misstated the name and position of Dana Riley. It has since been updated.