A guide to exploring your HIC data extract using SQL Server, Python and R.
Every project is unique and the datasets provided are specific to its needs. This page is an introduction to a few of the most common datasets and a gentle initiation to SQL for querying and linking them together.
When your data extract is delivered, you'll find your project tables in a database named Project_XXXXX (where XXXXX is your project number).
Unless otherwise specified, the unique patient identifier across all datasets is PROCHI. This variable is unique to your project. Every PROCHI represents a single individual in your cohort, and it is the key you use to link tables together.
Alongside your clinical data you'll find lookup tables that help you decode coded values. These follow a naming convention:
z_ prefix: dataset-specific lookups (e.g. SMR01_Admissions_z_LOCATION, MicroBio_Lab_z_Specimen)G_ prefix: general reference tables shared across datasets (e.g. G_OPCS3_Codes, G_World_Health_Organisation_ICD10_Codes)Project_XXXXX_Working database. Use this to store any tables you create or modify. Never write back to the main Project_XXXXX database.
For instructions on how to connect to your project database from SSMS, Python, R and other tools, see How do I connect to my study's SQL database? on the HIC Knowledge Base. The examples below assume you already have an open connection.
The Demography dataset contains one row per individual in your project cohort. It holds non-medical information about each patient, including three core columns:
PROCHI: unique patient identifier, used to link across all project tablessex: patient's sex at birthanon_date_of_birth: anonymised date of birth, mapped to within three months of the true dateDepending on your project, you may also find postcode district, SIMD deprivation metrics (by DataZone), anonymised GP practice, and care home residency flags. Your table will typically be at Project_XXXXX.dbo.Demography_Current.
A good first step is to eyeball a small slice of your data. This returns 100 randomly selected rows.
SELECT TOP 100 * FROM Project_12345..Demography_Current ORDER BY NEWID()
with engine.connect() as conn: df = pd.read_sql(""" SELECT TOP 100 * FROM Project_12345..Demography_Current ORDER BY NEWID() """, conn) df.head()
df <- dbGetQuery(conn, " SELECT TOP 100 * FROM Project_12345..Demography_Current ORDER BY NEWID() ") head(df)
NEWID() assigns a random GUID to each row. Ordering by it shuffles the results, and TOP 100 takes the first 100.
Count the number of distinct patients grouped by sex.
SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount FROM Project_12345..Demography_Current GROUP BY sex
with engine.connect() as conn: df = pd.read_sql(""" SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount FROM Project_12345..Demography_Current GROUP BY sex """, conn) df
df <- dbGetQuery(conn, " SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount FROM Project_12345..Demography_Current GROUP BY sex ") df
Add a percentage column using a variable to hold the total population. This avoids the cross-join approach, which is less readable and slower on large tables.
DECLARE @TotalPop AS INT SELECT @TotalPop = COUNT(DISTINCT PROCHI) FROM Project_12345..Demography_Current SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount, COUNT(DISTINCT PROCHI) * 100.0 / @TotalPop AS Pct FROM Project_12345..Demography_Current GROUP BY sex
with engine.connect() as conn: df = pd.read_sql(""" SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount FROM Project_12345..Demography_Current GROUP BY sex """, conn) df["Pct"] = df["PatientCount"] / df["PatientCount"].sum() * 100 df
df <- dbGetQuery(conn, " SELECT sex, COUNT(DISTINCT PROCHI) AS PatientCount FROM Project_12345..Demography_Current GROUP BY sex ") df$Pct <- df$PatientCount / sum(df$PatientCount) * 100 df
Note: use 100.0 (not 100) in SQL to avoid integer division truncating the result.
The Calculated_Age column reflects the patient's age at the time of data extraction. If you need a current age (or want to add any derived column), write the result into your Project_XXXXX_Working database.
-- Create a copy with a recalculated age column SELECT *, DATEDIFF(year, anon_date_of_birth, GETDATE()) AS Recalculated_Age INTO Project_12345_Working..Demog FROM Project_12345..Demography_Current -- To refresh the age column later: UPDATE Project_12345_Working..Demog SET Recalculated_Age = DATEDIFF(year, anon_date_of_birth, GETDATE())
with engine.connect() as conn: df = pd.read_sql("SELECT * FROM Project_12345..Demography_Current", conn) # anon_date_of_birth is already a datetime column df["Recalculated_Age"] = ( pd.Timestamp.now().year - df["anon_date_of_birth"].dt.year ) df.head()
df <- dbGetQuery(conn, "SELECT * FROM Project_12345..Demography_Current") df$Recalculated_Age <- as.integer( format(Sys.Date(), "%Y") ) - as.integer(format(df$anon_date_of_birth, "%Y") ) head(df)
DATEDIFF(year, ...) subtracts the year numbers only. It does not check whether the birthday has occurred yet this year, so it can be off by one. Given that anon_date_of_birth is already anonymised to a quarterly boundary (±3 months), this approximation is acceptable for coarse age bands but worth being aware of.
Group patients into age bands and view the distribution as a cross-tabulation. This uses a temporary table (#temp) to store the intermediate result.
-- Step 1: Build the grouped counts into a temp table SELECT sex, CASE WHEN Calculated_Age < 30 THEN '<30' WHEN Calculated_Age BETWEEN 30 AND 39 THEN '30-39' WHEN Calculated_Age BETWEEN 40 AND 49 THEN '40-49' ELSE '50+' END AS AgeGroup, COUNT(DISTINCT PROCHI) AS PatientCount INTO #temp FROM Project_12345..Demography_Current GROUP BY sex, CASE WHEN Calculated_Age < 30 THEN '<30' WHEN Calculated_Age BETWEEN 30 AND 39 THEN '30-39' WHEN Calculated_Age BETWEEN 40 AND 49 THEN '40-49' ELSE '50+' END -- Step 2: Pivot to wide format SELECT AgeGroup, SUM(IIF(sex = 'M', PatientCount, 0)) AS Male, SUM(IIF(sex = 'F', PatientCount, 0)) AS Female FROM #temp GROUP BY AgeGroup -- Clean up DROP TABLE #temp
with engine.connect() as conn: df = pd.read_sql(""" SELECT sex, Calculated_Age, PROCHI FROM Project_12345..Demography_Current """, conn) bins = [0, 30, 40, 50, 200] labels = ["<30", "30-39", "40-49", "50+"] df["AgeGroup"] = pd.cut(df["Calculated_Age"], bins=bins, labels=labels, right=False) pivot = df.groupby(["AgeGroup", "sex"])["PROCHI"].nunique().unstack(fill_value=0) pivot.rename(columns={"M": "Male", "F": "Female"})
df <- dbGetQuery(conn, " SELECT sex, Calculated_Age, PROCHI FROM Project_12345..Demography_Current ") df$AgeGroup <- cut(df$Calculated_Age, breaks = c(0, 30, 40, 50, Inf), labels = c("<30", "30-39", "40-49", "50+"), right = FALSE) table(df$AgeGroup, df$sex)
#temp) rather than global ones (##temp). Global temp tables are visible to all sessions on the server, which can cause conflicts in a shared environment.
Prescribing data covers primary care dispensed prescriptions and is structured across two tables:
Prescribing_Scripts: one row per prescription line, linked to patients via PROCHI. Each prescription is identified by SCAN_REF_NO and can have one or more lines (representing the number of drugs on that prescription). The column hic_drugID links to the drug details table.Prescribing_Scripts_Prescribing_Item: drug reference table, joined via hic_drugID. Contains the drug's approved name, BNF code (formatted_BNF_Code for the shortened chapter, e.g. 3.1.4, and BNF_Code for the full code), strength, and formulation (patch, cream, etc.).To get a feel for the prescribing data, join Scripts to the Item lookup and sample 100 rows.
SELECT TOP 100 s.PROCHI, s.PRESCRIBED_DATE, s.SCAN_REF_NO, s.line_no, i.Approved_Name, i.formatted_BNF_Code, i.Strength, i.Formulation_code FROM Project_12345..Prescribing_Scripts s INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i ON s.hic_drugID = i.hic_drugID ORDER BY NEWID()
with engine.connect() as conn: df = pd.read_sql(""" SELECT TOP 100 s.PROCHI, s.PRESCRIBED_DATE, s.SCAN_REF_NO, s.line_no, i.Approved_Name, i.formatted_BNF_Code, i.Strength, i.Formulation_code FROM Project_12345..Prescribing_Scripts s INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i ON s.hic_drugID = i.hic_drugID ORDER BY NEWID() """, conn) df.head()
df <- dbGetQuery(conn, " SELECT TOP 100 s.PROCHI, s.PRESCRIBED_DATE, s.SCAN_REF_NO, s.line_no, i.Approved_Name, i.formatted_BNF_Code, i.Strength, i.Formulation_code FROM Project_12345..Prescribing_Scripts s INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i ON s.hic_drugID = i.hic_drugID ORDER BY NEWID() ") head(df)
Link prescribing to demography to see which drugs are prescribed most frequently, broken down by patient age band. This counts prescription records, so a patient prescribed the same drug multiple times is counted each time.
SELECT CASE WHEN d.Calculated_Age < 20 THEN '<20' WHEN d.Calculated_Age BETWEEN 20 AND 29 THEN '20-29' WHEN d.Calculated_Age BETWEEN 30 AND 39 THEN '30-39' WHEN d.Calculated_Age BETWEEN 40 AND 49 THEN '40-49' WHEN d.Calculated_Age BETWEEN 50 AND 59 THEN '50-59' WHEN d.Calculated_Age BETWEEN 60 AND 69 THEN '60-69' WHEN d.Calculated_Age BETWEEN 70 AND 79 THEN '70-79' ELSE '80+' END AS AgeGroup, i.Approved_Name, COUNT(*) AS PrescriptionCount FROM Project_12345..Prescribing_Scripts s INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i ON s.hic_drugID = i.hic_drugID INNER JOIN Project_12345..Demography_Current d ON s.PROCHI = d.PROCHI GROUP BY CASE WHEN d.Calculated_Age < 20 THEN '<20' WHEN d.Calculated_Age BETWEEN 20 AND 29 THEN '20-29' WHEN d.Calculated_Age BETWEEN 30 AND 39 THEN '30-39' WHEN d.Calculated_Age BETWEEN 40 AND 49 THEN '40-49' WHEN d.Calculated_Age BETWEEN 50 AND 59 THEN '50-59' WHEN d.Calculated_Age BETWEEN 60 AND 69 THEN '60-69' WHEN d.Calculated_Age BETWEEN 70 AND 79 THEN '70-79' ELSE '80+' END, i.Approved_Name ORDER BY PrescriptionCount DESC
with engine.connect() as conn: scripts = pd.read_sql(""" SELECT s.PROCHI, s.hic_drugID FROM Project_12345..Prescribing_Scripts s """, conn) items = pd.read_sql(""" SELECT hic_drugID, Approved_Name FROM Project_12345..Prescribing_Scripts_Prescribing_Item """, conn) demog = pd.read_sql(""" SELECT PROCHI, Calculated_Age FROM Project_12345..Demography_Current """, conn) # Merge the three tables df = scripts.merge(items, on="hic_drugID").merge(demog, on="PROCHI") # Create age bands in 10-year intervals bins = [0, 20, 30, 40, 50, 60, 70, 80, 200] labels = ["<20", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"] df["AgeGroup"] = pd.cut(df["Calculated_Age"], bins=bins, labels=labels, right=False) # Count prescription records per age group and drug result = (df.groupby(["AgeGroup", "Approved_Name"]) .size() .reset_index(name="PrescriptionCount") .sort_values("PrescriptionCount", ascending=False)) result
# Load all three tables scripts <- dbGetQuery(conn, " SELECT PROCHI, hic_drugID FROM Project_12345..Prescribing_Scripts ") items <- dbGetQuery(conn, " SELECT hic_drugID, Approved_Name FROM Project_12345..Prescribing_Scripts_Prescribing_Item ") demog <- dbGetQuery(conn, " SELECT PROCHI, Calculated_Age FROM Project_12345..Demography_Current ") # Merge and create age bands df <- merge(merge(scripts, items, by = "hic_drugID"), demog, by = "PROCHI") df$AgeGroup <- cut(df$Calculated_Age, breaks = c(0, 20, 30, 40, 50, 60, 70, 80, Inf), labels = c("<20", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"), right = FALSE) # Count prescription records result <- aggregate(hic_drugID ~ AgeGroup + Approved_Name, data = df, FUN = length) names(result)[3] <- "PrescriptionCount" result <- result[order(-result$PrescriptionCount), ] head(result, 20)
This counts prescription records. A patient receiving the same drug 12 times contributes 12 to the count.
Same structure, but counting distinct patients per drug and age band. This tells you how many people were prescribed each drug, regardless of how many times.
SELECT CASE WHEN d.Calculated_Age < 20 THEN '<20' WHEN d.Calculated_Age BETWEEN 20 AND 29 THEN '20-29' WHEN d.Calculated_Age BETWEEN 30 AND 39 THEN '30-39' WHEN d.Calculated_Age BETWEEN 40 AND 49 THEN '40-49' WHEN d.Calculated_Age BETWEEN 50 AND 59 THEN '50-59' WHEN d.Calculated_Age BETWEEN 60 AND 69 THEN '60-69' WHEN d.Calculated_Age BETWEEN 70 AND 79 THEN '70-79' ELSE '80+' END AS AgeGroup, i.Approved_Name, COUNT(DISTINCT s.PROCHI) AS PatientCount FROM Project_12345..Prescribing_Scripts s INNER JOIN Project_12345..Prescribing_Scripts_Prescribing_Item i ON s.hic_drugID = i.hic_drugID INNER JOIN Project_12345..Demography_Current d ON s.PROCHI = d.PROCHI GROUP BY CASE WHEN d.Calculated_Age < 20 THEN '<20' WHEN d.Calculated_Age BETWEEN 20 AND 29 THEN '20-29' WHEN d.Calculated_Age BETWEEN 30 AND 39 THEN '30-39' WHEN d.Calculated_Age BETWEEN 40 AND 49 THEN '40-49' WHEN d.Calculated_Age BETWEEN 50 AND 59 THEN '50-59' WHEN d.Calculated_Age BETWEEN 60 AND 69 THEN '60-69' WHEN d.Calculated_Age BETWEEN 70 AND 79 THEN '70-79' ELSE '80+' END, i.Approved_Name ORDER BY PatientCount DESC
# Reusing df from the previous example result = (df.groupby(["AgeGroup", "Approved_Name"])["PROCHI"] .nunique() .reset_index(name="PatientCount") .sort_values("PatientCount", ascending=False)) result
# Reusing df from the previous example result <- aggregate( PROCHI ~ AgeGroup + Approved_Name, data = df, FUN = function(x) length(unique(x)) ) names(result)[3] <- "PatientCount" result <- result[order(-result$PatientCount), ] head(result, 20)
The only change from the previous query is COUNT(DISTINCT PROCHI) instead of COUNT(*). This gives the number of unique individuals prescribed each drug per age band.
There are five laboratory datasets: Haematology, Immunology, Biochemistry, Virology and Microbiology. For the first four, the raw data is sent to HIC in three parts (Header, Results and Sample Details). For your convenience, these tables are joined together and provided to you as a single "Restructured" table per lab type.
HaematologyRestructured,
ImmunologyRestructured,
VirologyRestructured,
and Labs_Biochem.
The structure and columns are the same in each case.
Microbiology is provided differently, as three separate tables: MicroBio_Isolat, MicroBio_Lab and MicroBio_Tests. A detailed description of the Microbiology schema is out of scope for this page.
The examples below use HaematologyRestructured, but the same queries apply to any of the restructured lab tables.
Join the lab table to Demography to see patient context alongside results.
SELECT TOP 100 h.PROCHI, d.sex, d.Calculated_Age, h.DateTimeSampled, h.LocalClinicalCodeDescription, h.QuantityValue, h.QuantityUnit, h.RangeLowValue, h.RangeHighValue, h.Interpretation FROM Project_12345..HaematologyRestructured h INNER JOIN Project_12345..Demography_Current d ON h.PROCHI = d.PROCHI ORDER BY NEWID()
with engine.connect() as conn: df = pd.read_sql(""" SELECT TOP 100 h.PROCHI, d.sex, d.Calculated_Age, h.DateTimeSampled, h.LocalClinicalCodeDescription, h.QuantityValue, h.QuantityUnit, h.RangeLowValue, h.RangeHighValue, h.Interpretation FROM Project_12345..HaematologyRestructured h INNER JOIN Project_12345..Demography_Current d ON h.PROCHI = d.PROCHI ORDER BY NEWID() """, conn) df.head()
df <- dbGetQuery(conn, " SELECT TOP 100 h.PROCHI, d.sex, d.Calculated_Age, h.DateTimeSampled, h.LocalClinicalCodeDescription, h.QuantityValue, h.QuantityUnit, h.RangeLowValue, h.RangeHighValue, h.Interpretation FROM Project_12345..HaematologyRestructured h INNER JOIN Project_12345..Demography_Current d ON h.PROCHI = d.PROCHI ORDER BY NEWID() ") head(df)
Not every patient in your cohort will have lab results. Use a LEFT JOIN to identify which patients have haematology records and which do not.
-- Patients WITH at least one haematology result SELECT COUNT(DISTINCT d.PROCHI) AS WithLabs FROM Project_12345..Demography_Current d INNER JOIN Project_12345..HaematologyRestructured h ON d.PROCHI = h.PROCHI -- Patients WITHOUT any haematology result SELECT COUNT(DISTINCT d.PROCHI) AS WithoutLabs FROM Project_12345..Demography_Current d LEFT JOIN Project_12345..HaematologyRestructured h ON d.PROCHI = h.PROCHI WHERE h.PROCHI IS NULL -- Combined summary in a single query SELECT COUNT(DISTINCT d.PROCHI) AS TotalCohort, COUNT(DISTINCT h.PROCHI) AS WithLabs, COUNT(DISTINCT d.PROCHI) - COUNT(DISTINCT h.PROCHI) AS WithoutLabs FROM Project_12345..Demography_Current d LEFT JOIN Project_12345..HaematologyRestructured h ON d.PROCHI = h.PROCHI
with engine.connect() as conn: demog = pd.read_sql(""" SELECT DISTINCT PROCHI FROM Project_12345..Demography_Current """, conn) lab_prochis = pd.read_sql(""" SELECT DISTINCT PROCHI FROM Project_12345..HaematologyRestructured """, conn) # Patients with and without lab results has_labs = demog["PROCHI"].isin(lab_prochis["PROCHI"]) print(f"Total cohort: {len(demog)}") print(f"With labs: {has_labs.sum()}") print(f"Without labs: {(~has_labs).sum()}")
demog <- dbGetQuery(conn, " SELECT DISTINCT PROCHI FROM Project_12345..Demography_Current ") lab_prochis <- dbGetQuery(conn, " SELECT DISTINCT PROCHI FROM Project_12345..HaematologyRestructured ") has_labs <- demog$PROCHI %in% lab_prochis$PROCHI cat("Total cohort:", nrow(demog), "\n") cat("With labs: ", sum(has_labs), "\n") cat("Without labs:", sum(!has_labs), "\n")
The LEFT JOIN ... WHERE h.PROCHI IS NULL pattern is a standard way to find rows in one table that have no match in another. The combined query at the bottom gives you all three numbers in one go.
SMR is a comprehensive, national series of patient-level datasets that track patient activity across NHS Scotland hospitals. Collected since the 1960s, they cover inpatient, day case and outpatient care, providing detailed, linkable information on diagnoses, procedures and patient demographics, with over 98% completeness for key metrics such as births. They are widely used for health planning, research and monitoring.
The main SMR datasets are:
SMR00 (Outpatient Attendance): the largest SMR dataset. It primarily tracks patient engagement with NHS outpatient services, with partial coverage of referral reasons coded in ICD-10 (note: these are referral reasons, not confirmed diagnoses). If the patient undergoes a procedure, it is recorded using OPCS-4 codes.SMR01 (General/Acute Inpatient and Day Case): the most important dataset for searching by diagnosis. Conditions are coded using ICD-10 in the MAIN_CONDITION and OTHER_CONDITION columns, and procedures use OPCS-4 codes.SMR02 (Maternity): a subset of inpatient activity covering maternity admissions, with additional fields specific to maternal and neonatal care.SMR04 (Mental Health): covers psychiatric inpatient admissions, with ICD-10 coded diagnoses and additional mental health specific fields.SMR06 (Cancer): the Scottish Cancer Registry, a subset of inpatient records with detailed tumour staging and morphology data, linked to other SMR records where applicable.SMR02, SMR04 and SMR06 are broadly similar in structure to SMR01, but may contain additional columns specific to their domain and may cover locations other than acute hospitals. Your project database will contain whichever SMR datasets were approved for your study.
The examples below use SMR01_Admissions. Key columns include MAIN_CONDITION (ICD-10 primary diagnosis), ADMISSION_DATE, DISCHARGE_DATE, LENGTH_OF_STAY, and MAIN_OPERATION (OPCS-4 procedure code).
Join SMR01 to Demography for a quick look at admissions alongside patient context.
SELECT TOP 100 s.PROCHI, d.sex, d.Calculated_Age, s.ADMISSION_DATE, s.DISCHARGE_DATE, s.LENGTH_OF_STAY, s.MAIN_CONDITION, s.MAIN_OPERATION FROM Project_12345..SMR01_Admissions s INNER JOIN Project_12345..Demography_Current d ON s.PROCHI = d.PROCHI ORDER BY NEWID()
with engine.connect() as conn: df = pd.read_sql(""" SELECT TOP 100 s.PROCHI, d.sex, d.Calculated_Age, s.ADMISSION_DATE, s.DISCHARGE_DATE, s.LENGTH_OF_STAY, s.MAIN_CONDITION, s.MAIN_OPERATION FROM Project_12345..SMR01_Admissions s INNER JOIN Project_12345..Demography_Current d ON s.PROCHI = d.PROCHI ORDER BY NEWID() """, conn) df.head()
df <- dbGetQuery(conn, " SELECT TOP 100 s.PROCHI, d.sex, d.Calculated_Age, s.ADMISSION_DATE, s.DISCHARGE_DATE, s.LENGTH_OF_STAY, s.MAIN_CONDITION, s.MAIN_OPERATION FROM Project_12345..SMR01_Admissions s INNER JOIN Project_12345..Demography_Current d ON s.PROCHI = d.PROCHI ORDER BY NEWID() ") head(df)
Count admissions by MAIN_CONDITION (ICD-10 code) to see which diagnoses are most common in your cohort.
SELECT TOP 20 MAIN_CONDITION, COUNT(*) AS AdmissionCount FROM Project_12345..SMR01_Admissions GROUP BY MAIN_CONDITION ORDER BY AdmissionCount DESC
with engine.connect() as conn: df = pd.read_sql(""" SELECT TOP 20 MAIN_CONDITION, COUNT(*) AS AdmissionCount FROM Project_12345..SMR01_Admissions GROUP BY MAIN_CONDITION ORDER BY AdmissionCount DESC """, conn) df
df <- dbGetQuery(conn, " SELECT TOP 20 MAIN_CONDITION, COUNT(*) AS AdmissionCount FROM Project_12345..SMR01_Admissions GROUP BY MAIN_CONDITION ORDER BY AdmissionCount DESC ") df
The MAIN_CONDITION column contains ICD-10 codes. You can cross-reference these with the G_World_Health_Organisation_ICD10_Codes lookup table if you need human-readable descriptions.
Identify the patients with the most admission episodes.
SELECT TOP 20 s.PROCHI, d.sex, d.Calculated_Age, COUNT(*) AS TotalAdmissions, SUM(s.LENGTH_OF_STAY) AS TotalDays FROM Project_12345..SMR01_Admissions s INNER JOIN Project_12345..Demography_Current d ON s.PROCHI = d.PROCHI GROUP BY s.PROCHI, d.sex, d.Calculated_Age ORDER BY TotalAdmissions DESC
with engine.connect() as conn: df = pd.read_sql(""" SELECT TOP 20 s.PROCHI, d.sex, d.Calculated_Age, COUNT(*) AS TotalAdmissions, SUM(s.LENGTH_OF_STAY) AS TotalDays FROM Project_12345..SMR01_Admissions s INNER JOIN Project_12345..Demography_Current d ON s.PROCHI = d.PROCHI GROUP BY s.PROCHI, d.sex, d.Calculated_Age ORDER BY TotalAdmissions DESC """, conn) df
df <- dbGetQuery(conn, " SELECT TOP 20 s.PROCHI, d.sex, d.Calculated_Age, COUNT(*) AS TotalAdmissions, SUM(s.LENGTH_OF_STAY) AS TotalDays FROM Project_12345..SMR01_Admissions s INNER JOIN Project_12345..Demography_Current d ON s.PROCHI = d.PROCHI GROUP BY s.PROCHI, d.sex, d.Calculated_Age ORDER BY TotalAdmissions DESC ") df
This also sums LENGTH_OF_STAY to show total days spent in hospital alongside the admission count.
If you need to check the software versions available in your workspace, the following snippets will display the relevant information. This page was written and tested against the versions shown in the output below.
SELECT @@VERSION
-- Microsoft SQL Server 2022 (RTM-CU24) (KB5080999)
-- 16.0.4245.2 (X64) Feb 25 2026 15:01:38
-- Copyright (C) 2022 Microsoft Corporation
-- Standard Edition (64-bit) on Windows Server 2019
-- Standard 10.0 <X64> (Build 17763: ) (Hypervisor)import sys import IPython import matplotlib import pandas import pyodbc import sqlalchemy print(f"Python : {sys.version}") print(f"IPython : {IPython.__version__}") print(f"matplotlib : {matplotlib.__version__}") print(f"pandas : {pandas.__version__}") print(f"pyodbc : {pyodbc.version}") print(f"sqlalchemy : {sqlalchemy.__version__}") # Python version : 3.14.3 # IPython version : 9.12.0 # matplotlib : 3.10.8 # pandas : 3.0.2 # pyodbc : 5.3.0 # sqlalchemy : 2.0.48
sessionInfo() # R version 4.4.3 (2025-02-28 ucrt) # Platform: x86_64-w64-mingw32/x64 # Running under: Windows 11 x64 (build 26100) # # attached base packages: # stats graphics grDevices utils datasets methods base # # other attached packages: # glue_1.8.0 odbc_1.6.4.1 DBI_1.3.0