
Healthcare Analytics


Persons
HUI YANG, PhD, is Associate Professor in the Harold and Inge Marcus Department of Industrial and Manufacturing Engineering at The Pennsylvania State University. His research interests include sensor-based modeling and analysis of complex systems for process monitoring/control; system diagnostics/prognostics; quality improvement; and performance optimization, with a special focus on nonlinear stochastic dynamics and the resulting chaotic, recurrent, and self-organizing behaviors.
EVA K. LEE, PhD, is Professor in the H. Milton Stewart School of Industrial and Systems Engineering at the Georgia Institute of Technology, Director of the Center for Operations Research in Medicine and HealthCare, and Distinguished Scholar in Health System, Health Systems Institute at both Emory University School of Medicine and Georgia Institute of Technology. Her research interests include health-risk prediction; early disease prediction and diagnosis; optimal treatment strategies and drug delivery; healthcare outcome analysis and treatment prediction; public health and medical preparedness; large-scale healthcare/medical decision analysis and quality improvement; clinical translational science; and business intelligence and organization transformation.
Content
Chapter 1
Recent Development in Methodology for Gene Network Problems and Inferences
Sung W. Han and Hua Zhong
Division of Biostatistics, School of Medicine, Department of Population Health, New York University, New York, NY, USA
1.1 Introduction
A cell in the human body is similar to a manufacturing system that produces the proteins appropriate to the organ or body part to which the cell belongs. The nucleus at the center of the cell contains the DNA sequence, which is a blueprint for the human body. Each time the cell produces a protein, it copies a certain part of the DNA sequence into an mRNA sequence. This is called the transcription process. After leaving the nucleus, the mRNA attaches to a ribosome, and the ribosome interprets the code in the mRNA. This is called the translation process. After interpretation, the ribosome generates a sequence of amino acids, which is then folded into a specific type of protein.
The manufacturing system from DNA to proteins sometimes malfunctions because of DNA damage, which is known to be a main cause of cancers, also called malignant neoplasms [1, 2]. DNA damage can occur naturally, but it can also be caused by two groups of agents: (i) exogenous agents such as radiation, smoke [3], ultraviolet light [4], and viruses [5]; and (ii) endogenous agents such as diet [6] and macrophages/neutrophils [5]. Such DNA damage leads to epigenetic alterations in DNA repair genes, which play key roles in preventing cancer cell growth. Reduced expression of a DNA repair gene (DNA repair deficiency [7]) or a switched-off DNA repair gene, called silencing, ultimately leads to the development of cancers. For example, MGMT is a DNA repair gene, and most types of colorectal cancers have reduced MGMT expression [8-12]. The following are other examples of proteins corresponding to DNA repair genes [1].
- BRCA1 and BRCA2 (breast cancer genes 1 and 2) for breast and ovarian cancers.
- ATM (ataxia telangiectasia mutated) for leukemia and breast cancers.
- XPC (xeroderma pigmentosum) for skin cancers.
- p53 (Li-Fraumeni syndrome) for sarcoma, leukemia, breast, lung, skin, pancreas, and brain cancers.
In addition, microRNA (miRNA) outside the nucleus is known to affect DNA repair genes because it can reduce the expression of DNA damage response genes or repair genes [1]. For example, miRNA-155 is overexpressed in colon cancers, and it is known to reduce the expression of MLH1, a DNA repair protein [13].
To uncover the mechanism of cancer development, it is important to understand the causal relationships in transcriptional regulatory networks, and the related inference is often framed as a gene network problem. Applications of the network problem include gene expression analysis or gene-gene expression networks [14-19], protein-protein interaction analysis [20, 21], phenotype networks utilizing gene expression information [22-24], and causal networks linking gene expression and metabolic change [24].
Probabilistic graphical modeling is a popular approach for finding causal relationships between variables in cell signaling pathways or gene networks [25]. In this chapter, the graphical models are assumed to be directed acyclic graphs (DAGs), in which all edges are directed and there are no cycles [26]. Estimating DAGs is computationally very challenging, and we cannot simply apply the approaches used to estimate undirected graphs [27-29], for three reasons. First, DAGs that encode the same set of conditional independencies cannot be distinguished from observational data alone [26]; this is called observational equivalence. Second, the number of possible DAGs grows super-exponentially with the number of nodes [27]. Third, in gene network problems, the number of genes is much larger than the sample size, a setting known as high-dimensional data.
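To give a sense of how quickly the search space grows, the following minimal sketch counts the labeled DAGs on a given number of nodes. It uses Robinson's recurrence, which is a standard combinatorial result and not something taken from this chapter; the function name `count_dags` is an illustrative choice.

```python
from math import comb


def count_dags(n):
    """Count labeled DAGs on n nodes with Robinson's recurrence:
    a(m) = sum_{k=1..m} (-1)^(k+1) * C(m, k) * 2^(k*(m-k)) * a(m-k), a(0) = 1.
    """
    counts = [1]  # a(0) = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            sign = 1 if k % 2 == 1 else -1
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * counts[m - k]
        counts.append(total)
    return counts[n]


if __name__ == "__main__":
    for p in range(1, 11):
        print(p, count_dags(p))
```

Even for ten nodes the count is far beyond what exhaustive enumeration can handle, which is why heuristic search or penalized estimation is used in practice.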
A DAG together with a conditional probability distribution for each child node given its parents is called a Bayesian network. Comprehensive reviews of learning Bayesian networks are given by Buntine [30, 31], Heckerman [32], Neapolitan [33], and Daly et al. [34]. Apart from cancer gene problems, Bayesian networks are used in a broad range of applications such as ecology [35, 36], neuroscience [37, 38], and distributed sensor networks for change detection and diagnosis [39-41].
The main approaches to estimating Bayesian networks are as follows: (i) a score-and-search approach through the space of Bayesian network structures, (ii) a constraint-based approach that uses conditional independencies identified in the data, and (iii) a hybrid approach. A score-and-search approach looks for a structure with a good score function value [42] and uses a heuristic algorithm to find the solution. Examples of this approach are given in Daly et al. [34]. A constraint-based approach applies statistical tests of conditional independence to the data. One efficient method is the PC algorithm [43]. In high-dimensional contexts, Kalisch and Bühlmann [44] showed that the PC algorithm [43] runs in reasonable computational time and proved its consistency for sparse DAGs. Hybrid search strategies that combine the two approaches have also been proposed, such as the Max-Min Hill-Climbing (MMHC) algorithm of Tsamardinos et al. [45]. These methods have been successfully applied to estimate DAGs with a small to moderate number of nodes.
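The building block of a constraint-based method is a conditional independence test. The sketch below is one common choice for Gaussian data, a partial-correlation test with Fisher's z-transform, and is not claimed to be the exact test used in the cited works. The simulated chain X -> Z -> Y, the sample size, and all function names are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def fisher_z_pvalue(r, n, cond_size):
    """Two-sided p-value for H0: (partial) correlation is zero.
    cond_size is the number of conditioning variables."""
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - cond_size - 3)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical data from the chain X -> Z -> Y.
rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [a + rng.gauss(0, 1) for a in x]
y = [b + rng.gauss(0, 1) for b in z]

r_marginal = pearson(x, y)
r_partial = partial_corr(x, y, z)
p_marginal = fisher_z_pvalue(r_marginal, n, 0)
p_partial = fisher_z_pvalue(r_partial, n, 1)

if __name__ == "__main__":
    print("corr(X, Y)     =", round(r_marginal, 3), "p =", p_marginal)
    print("corr(X, Y | Z) =", round(r_partial, 3), "p =", p_partial)
```

In this simulation X and Y are strongly correlated marginally, while their partial correlation given Z is close to zero. A constraint-based method would use such a result to remove the edge between X and Y.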
For the score-and-search approach, a network is identified by maximizing a certain score function [31, 33, 42, 46], and several heuristic search algorithms have been developed to find a high score [27, 34]. To overcome the high dimensionality of gene expression data, L1-penalized methods, or lasso approaches, have recently been developed. Meinshausen and Bühlmann [28] showed theoretically that the neighborhood of a node, corresponding to its conditional dependence set, can be obtained by solving a lasso problem, and that this approach is efficient for high-dimensional undirected graphs. For DAGs, Shojaie and Michailidis [29] used the L1-penalized likelihood with a structural equation model to estimate directed graphs with a known variable order and found that the problem decomposes into separable subproblems with a lasso penalty. Huang et al. [47] used a penalized linear regression that imposes penalties on the coefficient values as well as on the acyclic constraints. Fu and Zhou [48] used an adaptive lasso-based score function when the variable order is unknown. However, their objective function without the acyclic constraint is nonconvex, which makes finding the optimal solution infeasible. Han et al. [49] proposed an adaptive lasso-based score function that demonstrated superior performance over other methods when the network has a hub structure. In this chapter, we review the approach based on the lasso-type score function for gene network problems in high-dimensional data.
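When the variable order is known, the estimation splits into one lasso regression per node on its predecessors. The following is a minimal sketch of that idea using plain coordinate descent. It is an illustration under stated assumptions, not the estimator of any cited paper: the three-variable data, the penalty value `lam = 0.2`, and all function names are hypothetical.

```python
import random


def soft_threshold(value, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def center(column):
    mean = sum(column) / len(column)
    return [v - mean for v in column]


def lasso_cd(columns, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - sum_k b_k x_k||^2 + lam * sum_k |b_k|
    by cyclic coordinate descent."""
    n = len(y)
    beta = [0.0] * len(columns)
    residual = list(y)
    for _ in range(n_sweeps):
        for k, xk in enumerate(columns):
            norm = sum(v * v for v in xk) / n
            # Correlation of x_k with the residual that excludes x_k itself.
            rho = sum(v * (r + beta[k] * v) for v, r in zip(xk, residual)) / n
            new = soft_threshold(rho, lam) / norm
            if new != beta[k]:
                delta = new - beta[k]
                residual = [r - delta * v for r, v in zip(residual, xk)]
                beta[k] = new
    return beta


def estimate_dag(data, lam):
    """data: list of columns, assumed to be sorted in the known causal order.
    Returns {(parent_index, child_index): coefficient} for nonzero coefficients."""
    cols = [center(c) for c in data]
    edges = {}
    for j in range(1, len(cols)):
        beta = lasso_cd(cols[:j], cols[j], lam)
        for k, b in enumerate(beta):
            if b != 0.0:
                edges[(k, j)] = b
    return edges


# Hypothetical data from the chain X1 -> X2 -> X3 (indices 0, 1, 2).
rng = random.Random(1)
n = 1000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + rng.gauss(0, 1) for a in x1]
x3 = [0.8 * b + rng.gauss(0, 0.5) for b in x2]
edges = estimate_dag([x1, x2, x3], lam=0.2)

if __name__ == "__main__":
    for (parent, child), coef in sorted(edges.items()):
        print(f"X{parent + 1} -> X{child + 1}: {coef:.3f}")
```

Because each node is regressed only on variables that precede it in the order, the estimated graph is acyclic by construction. The penalty shrinks the retained coefficients toward zero, so the printed values underestimate the true effects.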
1.2 Background
We explain the basic theoretical background of probabilistic graphical modeling, or Bayesian networks. Suppose we have p random variables, $X_1, \ldots, X_p$, that have causal relationships with each other. The variables and relationships in the probability distribution need to be mapped to a node set, $V$, and an edge set, $E$, of a graph $G = (V, E)$. In other words, separation in the graph needs to be mapped to independence in probability [50].
In probabilistic graphical modeling, d-separation (directed separation) is an important concept described by Pearl [26]. The formal definition of d-separation is complicated, but it implies the following. Suppose we have three node sets $A$, $B$, and $C$. We say that $C$ d-separates $A$ and $B$ if one of the following conditions is satisfied:
- All edges between $A$ and $C$ flow from $A$ to $C$, and all edges between $C$ and $B$ flow from $C$ to $B$.
- All edges between $A$ and $C$ flow from $C$ to $A$, and all edges between $C$ and $B$ flow from $B$ to $C$.
- All edges between $A$ and $C$ flow from $C$ to $A$, and all edges between $C$ and $B$ flow from $C$ to $B$.
For all disjoint subsets $A$, $B$, and $C$ of the node set $V$, we say that the probability distribution P is faithful to the graph G if the following condition is satisfied:
$X_A$ and $X_B$ are independent given $X_C$ if and only if $A$ and $B$ are d-separated given $C$.
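For readers who want to check d-separation on a concrete graph, here is a minimal sketch. It uses the general criterion by way of the moralized ancestral graph, a standard equivalent formulation, instead of the simplified conditions listed above. The graph encoding (a dictionary from each node to its set of parents) and the function name `d_separated` are assumptions for illustration.

```python
def d_separated(parents, a, b, given):
    """Return True if node sets a and b are d-separated by `given` in the DAG.

    parents: dict mapping each node to the set of its parents
             (nodes without parents may be omitted).
    a, b, given: disjoint sets of nodes.
    """
    # Step 1: keep only a, b, given, and all of their ancestors.
    keep, stack = set(), list(a | b | given)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # Step 2: moralize. Connect each child to its parents, connect parents
    # that share a child, and drop edge directions.
    neighbors = {v: set() for v in keep}
    for child in keep:
        pa = list(parents.get(child, ()))
        for p in pa:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for i, p in enumerate(pa):
            for q in pa[i + 1:]:
                neighbors[p].add(q)
                neighbors[q].add(p)

    # Step 3: delete the conditioning nodes and test whether b is reachable from a.
    seen, stack = set(), [v for v in a if v not in given]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in b:
            return False
        stack.extend(v for v in neighbors[node] if v not in given and v not in seen)
    return True


# Hypothetical three-node graphs.
chain = {"c": {"a"}, "b": {"c"}}       # a -> c -> b
fork = {"a": {"c"}, "b": {"c"}}        # a <- c -> b
collider = {"c": {"a", "b"}}           # a -> c <- b

if __name__ == "__main__":
    print(d_separated(chain, {"a"}, {"b"}, {"c"}))
    print(d_separated(fork, {"a"}, {"b"}, {"c"}))
    print(d_separated(collider, {"a"}, {"b"}, {"c"}))
```

In the chain and the fork, conditioning on the middle node separates the two end nodes. In the collider the behavior is reversed: the end nodes are separated when nothing is conditioned on, and conditioning on the common child connects them.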
Based on d-separation, we can express the probability distribution by using the Markov property. The probability distribution is represented by
$$P(X_1, \ldots, X_p) = \prod_{j=1}^{p} P(X_j \mid \mathrm{pa}(X_j)),$$
where $\mathrm{pa}(X_j)$ is the set of parents of $X_j$.
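As a small worked example of this factorization, the sketch below evaluates the joint probability of a three-node binary network as a product of one conditional probability per node given its parents. The network structure, the node names, and every probability value are made up for illustration.

```python
from itertools import product

# Hypothetical binary network X1 -> X2 -> X3.
parents = {"X1": (), "X2": ("X1",), "X3": ("X2",)}

# P(node = 1 | parent values); all numbers are illustrative assumptions.
p_one = {
    "X1": {(): 0.3},
    "X2": {(0,): 0.2, (1,): 0.9},
    "X3": {(0,): 0.1, (1,): 0.7},
}


def joint_probability(assignment):
    """Product over nodes of P(node value | values of its parents)."""
    prob = 1.0
    for node, pa in parents.items():
        p1 = p_one[node][tuple(assignment[q] for q in pa)]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob


nodes = list(parents)
total = sum(
    joint_probability(dict(zip(nodes, values)))
    for values in product((0, 1), repeat=len(nodes))
)

if __name__ == "__main__":
    print(joint_probability({"X1": 1, "X2": 1, "X3": 1}))
    print(total)
```

Summing the product over all eight assignments gives 1 up to floating-point error, which is a quick sanity check that the local conditional tables define a valid joint distribution.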
Another important issue in probabilistic graphical modeling is observational equivalence. An example of observational equivalence is shown in Figure 1.1. The three cases in Figure 1.1a-c are not distinguishable based on observational data. They are said to be in one equivalence class. However, based on the data, the case in Figure 1.1d can be distinguished from the other three cases. We say that this case has a v-structure. Such equivalence classes produce multiple solutions with the same score function value if we apply the score-and-search approach to estimate a DAG. To represent an entire equivalence class, the completed partially directed acyclic graph (CPDAG) can be used, which can be obtained with the "essentialGraph()" function in an R package [51].
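Observational equivalence can be checked mechanically with a standard graphical criterion: two DAGs are equivalent exactly when they share the same skeleton and the same v-structures. The sketch below applies that criterion to four three-node graphs. Since the figure is not reproduced here, the graphs (two chains, a fork, and a collider) are an assumption about what such a figure typically shows, and the function names are illustrative.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the graph: a set of unordered node pairs."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """Triples (p, child, q) with p -> child <- q and no edge between p and q."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pa in parents.items():
        for p, q in combinations(sorted(pa), 2):
            if frozenset((p, q)) not in skel:
                found.add((p, child, q))
    return found


def markov_equivalent(edges_1, edges_2):
    """Same skeleton and same v-structures means the same equivalence class."""
    return (skeleton(edges_1) == skeleton(edges_2)
            and v_structures(edges_1) == v_structures(edges_2))


# Hypothetical three-node graphs, each edge written as (parent, child).
chain_xy = [("X", "Z"), ("Z", "Y")]   # X -> Z -> Y
chain_yx = [("Y", "Z"), ("Z", "X")]   # Y -> Z -> X
fork = [("Z", "X"), ("Z", "Y")]       # X <- Z -> Y
collider = [("X", "Z"), ("Y", "Z")]   # X -> Z <- Y (a v-structure)

if __name__ == "__main__":
    print(markov_equivalent(chain_xy, chain_yx))
    print(markov_equivalent(chain_xy, fork))
    print(markov_equivalent(chain_xy, collider))
```

The two chains and the fork fall into one equivalence class, while the collider stands alone because it is the only graph with a v-structure.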
Figure 1.1 Examples of observational equivalence.
1.3 Genetic Data Available
Technology developed in recent decades has allowed genome-wide monitoring of DNA and RNA levels on thousands of samples [52]. For example, The Cancer Genome Atlas (TCGA) project seeks to provide a comprehensive landscape of genetic and genomic alterations...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The ePub file format works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.