Data Source Survey: Industry & Occupation Analysis

Chapter 2 Survey of existing data sources
The purpose of this chapter is to provide a structured map of useful sources of information about an
industry or occupation. This should provide a wide range of potential areas that could be explored.
This chapter is focused on identifying possible data sources rather than actually gathering them. The
intention is to analyse each potential data source in enough detail to be able to determine what kind
of information is available and how comprehensive the information is. For example, if discussing the
financial status of an organisation you would document how detailed the information was
concerning where income and expenditures were coming from and roughly what percentage of
organisations had this information available. Actually obtaining the full data from a source may be
practically impossible given the time within this module. The goal of this chapter is to provide an
insight into the limits of what might be possible and to ensure that analysis focuses on what might
lead to the most valuable insights rather than the easiest data to obtain.
Most data sources will be partial and will be biased in some way. For example, not all staff will be on
a site like LinkedIn, and not all of those who are on that site will record their job at a company, i.e. the
data is partial. It is also not a random sample, as certain types of people will tend to be more likely to
record this information, and the circumstances of the job and why it was left may influence the
likelihood of this data being recorded. In week 4 we will explore some options to address these
issues. At this stage the key thing is to find the best data sources you can and not discard any
data source out of hand because of these issues.
The key to finding data sources is to do a lot of Google searches and to carefully navigate through
websites. In addition there are a set of websites and government data sources that are always worth
investigating. In particular, a generally useful tool for getting historical information is the Wayback
Machine (https://web.archive.org/) which records snapshots of websites that were publicly available
at different points in time. In this document any text within quotes “Like this” represents a Google
search term.
Be very careful not to record personally identifiable information of UK or EU citizens or those
working in the UK or EU. Once you have recorded a piece of personally identifiable information
about an EU or UK citizen this data becomes subject to GDPR legislation which is very restrictive and
for most practical purposes requires the permission of the individual to use. However, if you gather
aggregate statistics using publicly available personal data manually, i.e. count up the number of
profiles with a certain property, then if you are careful you will not be storing and processing
personally identifiable data. This is a link to a post discussing existing research that makes use of
LinkedIn, including some links to anonymised datasets that may be of use: https://blogs.lse.ac.uk/
The main difficulty with gathering data in aggregate form is that you must gather all of the
information at once and you can’t merge and link datasets because you can’t record any identifying
information from the individuals you were surveying. Unfortunately, you can’t keep any identifiable
information, and that includes a one-way cryptographic hash of it. The statistics you gather are specific
to the process you use and the period of time you are gathering them; however, they can still be
highly informative.
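As an illustration of gathering everything at once, the following is a minimal sketch of a joint tally. It assumes you are manually reading public profiles and noting two properties per profile; the property names (`role`, `tenure_band`) and the sample observations are hypothetical. Because no identifier is kept, any combination of properties you want to analyse later has to be recorded together at collection time.

```python
from collections import Counter

# Joint tally keyed by (role, tenure band). Nothing that identifies a person
# (name, profile URL, or a hash of either) is ever passed in or stored.
tally = Counter()


def record(role, tenure_band):
    """Add one anonymous observation to the joint tally."""
    tally[(role, tenure_band)] += 1


def share(role, tenure_band):
    """Fraction of all observations that fall in one cell of the tally."""
    return tally[(role, tenure_band)] / sum(tally.values())


# Hypothetical observations noted while reading public profiles one at a time.
for role, band in [
    ("developer", "0-2 years"),
    ("developer", "3-5 years"),
    ("tester", "0-2 years"),
    ("developer", "0-2 years"),
]:
    record(role, band)

print(tally[("developer", "0-2 years")])
print(share("developer", "0-2 years"))
```

If a third property turns out to matter later, it cannot be joined onto this tally; the observations would need to be gathered again with that property included in the key.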
Before doing a lot of LinkedIn reading you may want to change your visibility settings https://
In the examples below most data is focused on NI or UK sources. This is just an example, there is no
need to focus your analysis in these areas and because of issues such as data privacy legislation there
may be more value in exploring US or other regions. Different countries vary in the amount of
publicly available information and the number of organisations in a given industry.
The key information to show for each section is:
The information that you are seeking to find for each section and why it would be valuable
The process used to find the data e.g. Google search terms, links from documents
The feasibility of obtaining the data, to what extent it can be linked with other information
How comprehensive and/or biased the data obtained from the process might be
It is easy to get overwhelmed when searching for information as it isn’t structured in a way that
provides perspective on an industry. It is good to always remember your goal:
Do work that makes it easier for others to identify the causal factors that affect outcomes that
decision makers within an industry value. Specifically to help them make good choices.
I.The space of parts of work
Our goal with this section is to identify a list of organisations, ideally with useful properties, that can
be used as the basis for further investigation. Each organisation is potentially a useful starting point
for further analysis. Next week you will be deeply investigating one example of an organisation or
project. In this chapter we are more focused on breadth, in particular acquiring a high level overview
of an industry and how the organisations vary within it.
1.Software Development
We are seeking to identify the companies that provide software development services, potentially
focused on a particular region such as NI or the UK.
“software development companies northern ireland”, “software development companies UK”
Businesses registered with Companies House record an SIC code. You could find companies
with a specific code by downloading the full Companies House database and processing the data
records to identify those with the SIC code you are looking for. It is important to note that there is no
formal verification of these codes so they may be in error (and in my experience frequently are).
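The filtering step could be sketched roughly as follows. This is a minimal example under stated assumptions, not the actual layout of the Companies House download: the column names (`SIC1`, `SIC2`), the company names and the company numbers are invented, and the SIC descriptions are abbreviated. Check the header row of the real file and pass its SIC column names in `sic_columns`.

```python
import csv
import io


def companies_with_sic(rows, sic_prefix, sic_columns):
    """Yield rows where any of the given SIC columns starts with sic_prefix."""
    for row in rows:
        codes = [(row.get(col) or "").strip() for col in sic_columns]
        if any(code.startswith(sic_prefix) for code in codes):
            yield row


# Tiny invented sample standing in for the bulk download (column names assumed).
SAMPLE = """CompanyName,CompanyNumber,SIC1,SIC2
EXAMPLE SOFTWARE LTD,NI000001,62012 - Business and domestic software development,
EXAMPLE BAKERY LTD,NI000002,10710 - Manufacture of bread,
EXAMPLE CONSULTING LTD,NI000003,70229 - Management consultancy,62012 - Business and domestic software development
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
matches = [
    row["CompanyName"]
    for row in companies_with_sic(reader, "62012", ["SIC1", "SIC2"])
]
print(matches)
```

Checking every SIC column matters because a company can list several codes, and since the codes are self-declared, a manual spot check of a sample of the matches is still worthwhile.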
You can also find companies actively working in an industry using job sites such as https://wecodeni.com/
and https://www.nijobs.com/Software-Company-Jobs-in-Northern-Ireland. This will highlight larger
organisations and may be biased towards those that are growing or have a high turnover of staff.
For larger organisations in NI you can use employment data from the equalities commission. You can
then search for the names of the businesses online to identify what business they are in (or it may be
obvious from the company name). By using the numbers of employees, analysis can focus on larger
or smaller organisations or take a sample of different groups based on size.
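Taking a sample of different groups based on size might look like the sketch below. The size thresholds (50 and 250 employees), the organisation names and the employee counts are all assumptions made for illustration; choose bands that suit the industry you are studying.

```python
import random
from collections import defaultdict


def size_band(employees):
    """Assign an organisation to a size band (thresholds are assumptions)."""
    if employees < 50:
        return "small"
    if employees < 250:
        return "medium"
    return "large"


def stratified_sample(orgs, per_band, seed=0):
    """Pick up to per_band organisations at random from each size band."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    bands = defaultdict(list)
    for name, employees in orgs:
        bands[size_band(employees)].append(name)
    return {
        band: rng.sample(names, min(per_band, len(names)))
        for band, names in bands.items()
    }


# Hypothetical (name, employee count) pairs.
orgs = [("Org A", 12), ("Org B", 30), ("Org C", 120), ("Org D", 800), ("Org E", 2600)]
sample = stratified_sample(orgs, per_band=1)
print(sample)
```

Sampling within bands keeps smaller organisations in the analysis, which a simple "largest first" list would crowd out.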
Using these approaches it is likely that we will be finding companies that advertise their services
broadly and may be focused on large, multi-year business to business projects. There are potentially
many more places where software development takes place, for example large companies tend to
have internal software development teams and there are also independent web and mobile
developers that take on contract work for individuals or small businesses. These may require other
routes to identify them, such as crowd working sites like Upwork.com or Fiverr.com.
2.Higher Education
Ideally we are looking to find a list of universities within a region or globally. There are many public
lists of universities and sites that rank them with metadata.
“universities in the uk” -> https://www.ukuni.net/universities
“universities in the uk” -> https://en.wikipedia.org/wiki/List_of_universities_in_the_United_Kingdom
These lists are relatively comprehensive.
“organisations that rank UK universities” (this provides a list including a number of different reviews
These sites also seem reasonably comprehensive, at least for UK universities. It would also be worth
exploring the space of different university ranking systems and seeing if useful metadata for each
It may also be valuable to gather data on the different departments within each university
“ranking of university departments” -> https://www.timeshighereducation.com/world-university-rankings/by-subject
“ranking of university departments” -> https://www.topuniversities.com/subject-rankings/2020
These rankings include metadata including staff/student ratio and number of international students.
Using staff numbers it may be possible to estimate student numbers. If we could clarify whether the
staff/student ratio included information from postgraduate courses it may be possible to separate out
estimates of numbers of postgraduate students. The number of undergraduate and postgraduate
students largely determines the department’s teaching income.
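The estimate described here is simple arithmetic, sketched below. All of the figures are hypothetical, and the postgraduate step rests on two assumptions that would need checking against the ranking's methodology: that the published ratio counts postgraduates, and that an undergraduate count is available from another source.

```python
def estimate_students(staff_count, students_per_staff):
    """Estimate student numbers from a staff head-count and a published ratio."""
    return round(staff_count * students_per_staff)


# Hypothetical department: 40 academic staff, 15.5 students per staff member.
total_students = estimate_students(40, 15.5)

# Assumption: the ratio covers all students and the undergraduate count
# is known separately, so the remainder approximates postgraduates.
known_undergraduates = 500
estimated_postgraduates = total_students - known_undergraduates

print(total_students, estimated_postgraduates)
```

The result is only as good as the definition of "staff" used by the ranking, so it is worth recording which ranking and year each ratio came from.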
In this section we are searching for examples of projects that organisations have undertaken, ideally
with metadata such as the budget, staffing, resources, management approach and timescale of the
projects. Even better if we can obtain more detailed task level data about the project.
Projects can be hard to get information on. I know from my interviews with experts that some
government software projects are open source. At a higher level government contracts are supposed
to be open (https://www.gov.uk/government/publications/open-contracting). However, after a
relatively quick search I couldn’t find any more detail than a single paragraph of description of a
project, who had won the bid, its total cost and timescale. The government website talked as if the
entire process was open. I emailed them to ask about how to find a detailed breakdown for software
projects (tasks and milestones) as well as any information on rival bids that weren’t accepted. I
received no reply to my email. More may be possible with a freedom of information request.
1.Software Development
a)Government contract projects.
“uk government tenders kainos” -> https://bidstats.uk/tenders/?q=kainos+software
The bidstats website only contains the last year but it references the organisations the information
has been taken from. If we could examine these sources then we might be able to get historical
records that could be processed to get useful data.
b)Mentions in shareholder reports
“kainos shareholder reports” -> https://www.kainos.com/investor-relations/results-and-presentations -> https://go.kainos.com/rs/272-PGO-379/images/202105FullYearPresentation.pdf
(lists some customers) -> “kainos "irish life"” -> links to LinkedIn account of business analyst on
“kainos NHS Digital” (many different stories and case studies on different sites) -> “kainos NHS digital
github” (links to a number of GitHub repositories for Kainos staff which show contributions to GitHub
projects) -> https://github.com/nhsconnect/integration-adaptor-gp2gp
Including a freedom of information request relating to one of the projects: https://
These searches reveal some information about the projects and produce some promising leads for
gathering more information, particularly the case studies and GitHub repositories. The main concern
here is that only the larger higher status companies might be mentioned in the shareholders report.
Another useful source is company Twitter (or other social media) accounts, which are often
searching for any public mention of the company. These may then contain mentions of the projects
that the company is working on:
“kainos twitter” -> https://twitter.com/KainosSoftware?
The need for content makes it more likely that social media feeds will capture
more of the media about the company (at least the positive media). However, this may only be
possible with larger organisations that have staff with explicit responsibility for managing such an
2.Higher Education
What educational programs are being run?
“software development degree courses” -> https://www.postgraduatesearch.com/pgs/search?
“QUB software conversion masters” -> https://www.qub.ac.uk/courses/postgraduate-taught/
For each course the syllabus may be available enabling an assessment of whether certain topics were
being covered.
Using the Wayback Machine for many degree programs we may be able to identify changes in the
course content, entry requirements and fees over time.
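Looking up archived copies of a course page for several years could be sketched as below. The example uses the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`, with `url` and `timestamp` parameters); the course page address and the sample response are made up, and no network request is made here. Treat the response shape as an assumption to verify against a live reply before relying on it.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp):
    """Build a query for the snapshot closest to timestamp (YYYYMMDD)."""
    params = urlencode({"url": page_url, "timestamp": timestamp})
    return AVAILABILITY_ENDPOINT + "?" + params


def closest_snapshot(response_text):
    """Return (timestamp, archived_url) from a response, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


# Hypothetical course page, queried for the start of three academic years.
queries = [
    availability_query("example.ac.uk/courses/software", f"{year}0901")
    for year in (2015, 2018, 2021)
]

# Made-up response in the assumed shape, used instead of a live request.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20180903120000",
            "url": "http://web.archive.org/web/20180903120000/"
                   "http://example.ac.uk/courses/software",
        }
    }
})
snapshot = closest_snapshot(sample_response)
print(queries[0])
print(snapshot)
```

Fetching each query URL and comparing the archived pages year by year would then show when fees, entry requirements or modules changed.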
It is also valuable to explore rankings for individual academic courses
“ranking of university courses” -> https://www.theguardian.com/education/ng-interactive/2020/sep/
This provides valuable statistics including the percentage of students getting a career in the subject
after graduation and the percentage that drop out of the course in the first year. These statistics
often focus on undergraduate programs which may be less useful as the number of students on each
program is likely to be capped, at least for local students part funded by government.
Tasks are about analysing individual ‘Tickets’, the relatively atomic repeated activities that are
performed as part of projects. This is the highest detail level that is typically available for an industry
and is often not gathered in a public way. However some governments, charities or otherwise
particularly open and transparent organisations can share how their work is performed.
Task based datasets allow for a validation of the task/process modelling in the previous chapter and
may enable us to create relatively precise aggregate statistics about projects. For example for a
software project: the time spent working on different parts of the project e.g. UI, database etc., the
number of bugs fixed (and possibly created) by different team members etc. could be quantified and
For individual roles there may be case studies available that could provide a general approach to
understanding tasks and how they vary in an industry. However, case studies are often chosen
because they were successful and may be performed in an unrealistically polished way.
Another option is to search for projects that are run in an open and collaborative way as they may
have all of their tasks documented publicly.
1.Software Development
In software as most of the work is undertaken on computers and using software, detailed task level
records are often gathered, for example by source control systems.
For the requirements gathering and project management process:
“public jira projects” -> https://www.reddit.com/r/atlassian/comments/lgcycy/public_jira_projects/
-> https://ecosystem.atlassian.net/jira/projects
When software is used to track projects and the breakdown of the project into tickets, these
software systems may be made open for certain projects. If the software, e.g. Jira, includes APIs for
analysis it may be possible to analyse these projects to identify how requirements are created and
change over time, as well as when designs change after implementation, to understand when the
requirements process is working well or not.
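One way to look for requirements changing after work has begun is sketched below. It assumes a ticket has been exported with its change history in the shape Jira's REST API uses when a changelog is requested (`changelog.histories[].items[]` with `field`, `fromString` and `toString`); the sample ticket is invented, and the status name "In Progress" is an assumption that varies between projects.

```python
from datetime import datetime


def parse_time(stamp):
    """Parse a timestamp such as 2021-03-01T10:00:00.000+0000."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")


def change_times(issue, field, to_value=None):
    """Times at which `field` changed (optionally only to `to_value`)."""
    times = []
    for history in issue.get("changelog", {}).get("histories", []):
        for item in history.get("items", []):
            if item.get("field") != field:
                continue
            if to_value is None or item.get("toString") == to_value:
                times.append(parse_time(history["created"]))
    return times


def late_description_edits(issue, started_status="In Progress"):
    """Description edits made after the ticket first entered started_status."""
    starts = change_times(issue, "status", started_status)
    if not starts:
        return []
    started = min(starts)
    return [t for t in change_times(issue, "description") if t > started]


# Invented ticket in the assumed export shape.
sample_issue = {
    "key": "DEMO-1",
    "changelog": {"histories": [
        {"created": "2021-03-01T10:00:00.000+0000", "items": [
            {"field": "description", "fromString": "Export to CSV",
             "toString": "Export to CSV and Excel"}]},
        {"created": "2021-03-04T09:30:00.000+0000", "items": [
            {"field": "status", "fromString": "To Do", "toString": "In Progress"}]},
        {"created": "2021-03-09T15:45:00.000+0000", "items": [
            {"field": "description", "fromString": "Export to CSV and Excel",
             "toString": "Export to CSV, Excel and PDF"}]},
    ]},
}

print(len(change_times(sample_issue, "description")))
print(len(late_description_edits(sample_issue)))
```

Counting late edits per ticket across a public project gives a rough signal of how often requirements moved once implementation was under way.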
For the programming process:
It may be possible to find government projects on GitHub that correspond to government contracted
projects so the company that delivered the project, the budget and timescale can be identified
“uk government projects on github” -> https://www.api.gov.uk/#uk-government-apis
Each of the APIs in this list corresponds to a project which may be open sourced on GitHub. Some of
these may have been produced through contracts with software development companies. By
examining each in turn and cross-referencing the project against records of government tenders we
may be able to identify budgets and timescales.
“kainos uk government github” -> https://github.com/companieshouse/payments.web.ch.gov.uk
From our searches using a company name it may be possible to identify projects that are being run
by the company for a branch of UK government. We may then be able to identify the public
information about the tender from the UK sources and identify the budget and timescale of the
For UX Design:
“ux design case study examples” -> https://bestfolios.medium.com/8-well-done-ux-case-studies-every-designers-should-read-f15cfd8687e1
Unfortunately these examples are mostly showcases that may be heavily simplified/biased from the
raw activity and time use information we would really value. However they may show the key stages
and tasks which are helpful for understanding the role.
2.Higher Education
Open courses with forums or communities that have been run over a period of time. We can use the
forum information to see how students are facing the taught material of the course. In some cases
we may be able to identify the feedback, discussions and reasoning that have led to changes in the
module due to public student comments. It will take some creativity to formalise these text based
discussions into a comparative text based format.
“free online course materials” -> https://www.openculture.com/freeonlinecourses
Ideally we are looking for courses that have changed over time, so that we can understand the interaction
between the creator of the course and the work being done. By examining how courses are taught
differently at different institutions we can examine how that material might impact other
measurements such as student satisfaction and employment statistics of different courses.
Certain specialised online courses such as https://www.fast.ai/ are developed very publicly with an
active community that includes the lecturer and where it may be possible to identify the changes
and what has motivated them and ideally the positive and negative consequences of those changes.
The goal of this section is to find data about how careers progress over time, ideally with any
metadata about the nature of the roles and any factors that might have a significant causal impact on
obtaining the role or why the person left a previous role. For example, technical skills, salaries etc.
1.Software Development
It is very valuable to have access to CVs of people in an industry to understand plausible career
paths. The most popular site for this information is LinkedIn.
“linkedin kainos” -> https://www.linkedin.com/company/kainos/
Companies often have LinkedIn pages that will include links to people who have included the
company in their CVs on the site.
“kainos project” -> https://www.kainos.com/insights/blogs/path-project-management
A previous search led to a link to a case study for a project management position.
2.Higher Education
LinkedIn can also be used for academic education. Universities also promote their staff on their
websites. They often have structured staff biographies. Using the Wayback Machine it is also possible
to identify staff roles changing over time, information that may not be as comprehensively recorded
via LinkedIn.
The goal of this section is to find data about the work carried out within a particular role. This is
similar to the breakdown of tasks on a project but in this case focused on a particular role. From this
information we are seeking to identify the variation in how a role is carried out: what tasks and skills
are needed, how time is spent, what responsibilities the role has and what it pays. Ideally we
want to link these jobs to the organisations that produce them and to the careers of the individuals
pursuing those roles so we can see what might cause the roles to differ and the impact that the jobs
have on the careers of the individuals who participate in them.
1.Software Development
Day in the life articles. These provide another source for the tasks that different roles take part in and
may include timing information that may clarify how much of a person’s day is spent engaged in
different activities.
“a day in the life of a software developer” -> https://codeinstitute.net/blog/a-day-in-the-life-of-a-software-developer/
O*NET is a comprehensive formalisation of jobs produced by the US Government. It is based on a
combination of the analysis of job postings and interviews with job holders. It includes many
properties associated with different occupations including tasks associated with a job and skills that
are needed as well as the typical personality traits of people within the profession.
Recruitment/job websites provide many jobs by region as well as structured searches. For example,
by technology, by company, by salary etc. Providing a range of examples of jobs and metadata about
“recruitment websites” -> hNps://uk.indeed.com/jobs?
2.Higher Education
University positions tend to be posted to a small number of specialised sites. For the UK, Times Higher
is one of the more common sites.
The lenses within this section are places where useful information is commonly found. You can think
of them as representing different groups that are interested in analysing organisations and
The goal of this section is to try to identify useful data sources that provide financial information. The
bottom line for many organisations is that they need to remain profitable to remain in business.
1.Companies House
a)Software development
“Companies house lookup business” -> https://www.gov.uk/get-information-about-a-company ->
https://find-and-update.company-information.service.gov.uk/search?q=Kainos ->
https://find-and-update.company-information.service.gov.uk/company/NI019370 ->
https://find-and-update.company-information.service.gov.uk/company/NI019370/filing-history
This includes a history of financial accounts for the business as well as changes in directorships and
shares. For larger public companies these documents can also include lists of corporate risks that the
organisation is concerned with and what actions they are taking to mitigate them. These risks are
similar to the Issues that we may have identified except in the case of our issues we are focused on
problems that have occurred in the past and so may be a risk that is not being fully addressed. This is
similar (possibly identical) to the shareholders reports that public companies compile.
In terms of financial information the reports include figures for revenue, profit and the value of any
assets owned by the business.
b)Higher education
For Higher education a search for Queen’s University on Companies House revealed two
organisations that are a part of the university but not the university itself.
“queen's university belfast company registration number”
This links to the contact details of Queen’s which doesn’t indicate a company number but does
include a charity number.
“Northern ireland charity commission NIC101788” -> https://www.charitycommissionni.org.uk/charity-search/?pageNumber=1
-> https://www.charitycommissionni.org.uk/charity-details/?
This site links to the past 3 years of annual reports which include some breakdown of the financial
income and expenditure similar to that for Companies House financial records.
2.Shareholders reports (and equivalent)
These reports are put together by organisations to fulfil reporting obligations and to present an
organisation in a positive way for investors. They can often contain a diverse range of useful
information, often mentioning issues or successes that the organisation has achieved and
communicating the areas where the organisation is currently focused to bring growth. It is useful to
compare the issues and values from these reports to the issues and values identified in the interview
with the expert. This can identify issues and values that the senior members of the organisation may not
have identified already and that may be the basis for novel recommendations.
a)Software development
“Kainos Shareholders Reports”
b)Higher education
“queen's university annual report” -> https://www.qub.ac.uk/directorates/FinanceDirectorate/
These documents go back to 2012 and provide similar detail to company shareholder reports
including showing the overall income due to teaching and research.
3.Startups and Investment
Organisations that are focused on investing money into companies or who provide data to those that
make investment decisions are another source of useful data. For example, the website Crunchbase
tracks investment in companies and a range of other useful information such as key staff, technology
usage and media mentions.
Governments often provide financial support and grants to startups and SME businesses.
“open data investni” -> https://www.opendatani.gov.uk/dataset/open-data-up-to-17-18-csv-file-uploaded-csv-13-to-2016-17
This lists the grants that Invest NI have awarded different organisations. This is particularly useful for
smaller organisations that don’t have a sufficient web presence or revenue to be tracked by
Crunchbase or equivalent.
4.Government analysis
Governments often analyse industries to track the performance of the economy. These links are
explored in a later section on Labour statistics. For public bodies like universities and schools there
are government bodies who gather and share financial information.
https://www.hesa.ac.uk/data-and-analysis/finances -> https://www.hesa.ac.uk/data-and-analysis/
This lists high level information on the income from tuition fees and research income from 2015.
5.Open projects
Open projects are projects that are run by organisations that publicly share information, usually
including what work is being done, what it costs and what time period it is being performed over.
a)Software development
Projects with government
A previous search identified bidstats, a website that curates recent contracts between companies
and the UK government. These contracts have publicly recorded budgets and time periods.
https://bidstats.uk/ -> https://bidstats.uk/tenders/?q=kainos+software
This site only covers contracts over the past year but it also documents the source of the data. These
sources could be separately explored to find records of older grants.
b)Higher education
For higher education a significant amount of research income comes from government funded
research. For example in Computer Science EPSRC is the main research funding body.
“epsrc archive grant data” -> https://gow.epsrc.ukri.org/ -> https://gow.epsrc.ukri.org/
This provides historical records of research grants and enables searching by organisations, such as
Queen’s University, and includes details such as the principal investigator and the value of the grant.
B.Government statistics
This section is focused on data sources obtained by governments.
1.Census data
“census data” -> https://www.ons.gov.uk/census/2011census/2011censusdata ->
https://www.ons.gov.uk/help/localstatistics -> https://www.nomisweb.co.uk/
Nomis web includes data from a range of sources including the census. The census provides fine
grained geographical information about where people with different occupations or working in
different industries live in the UK. This could be used to help analyse social mobility. For example, to
what extent those from poorer areas were getting jobs in high skill/high paid professions and
potentially how educational courses might have contributed to this effect. However, with the census
being run so infrequently there are relatively few sample points to base the analysis on.
2.Labour statistics
The economy and workforce are major priorities for any government and their civil service will often
perform annual surveys to understand which areas are growing and which may have issues. In
Northern Ireland that agency is NISRA.
“Nisra” -> https://www.nisra.gov.uk/statistics -> https://www.nisra.gov.uk/statistics/economy ->
A previous search revealed an archive of governmental documents held at Queen’s. It may be a more
convenient method of searching for government data sources than navigating existing government
sites https://niopa.qub.ac.uk/
Searching for labour statistics in general revealed a helpful API that makes it easier to get and link this
information. http://www.lmiforall.org.uk/explore_lmi/ and http://api.lmiforall.org.uk/
3.Open Data
There has been a movement to gather together datasets and make them available through open
data websites. The groups that advocate for open data often work with the public to make freedom
of information requests. They also can make it much easier to access data that may be hard to find
through the government websites where the data is originally stored. Opendatani.gov.uk is the main
website that focuses on open data sources in NI.
a)Software development
“open data ni” -> https://www.opendatani.gov.uk/group/economy?page=3 ->
https://www.opendatani.gov.uk/dataset/northern-ireland-index-of-services/resource/8fdf9b7f-4a85-4c12-bc74-a4180e2dc4f3 -> “”
In this example the links were invalid so once the data source was identified it was searched for
separately and led to the dataset. The dataset provides quarterly statistics on the output of
businesses within broad industry classifications. It has a higher temporal detail than the annual
financial statements of organisations but lacks the detail of a specific industry or business. It may still be
useful for measuring events occurring within a year or due to seasonal effects.
This lists a range of job datasets including:
Which contains annual estimates for the number of jobs in each SIC industry within NI including a
separation by full-time/part-time and by male/female. There are also details of the number of public
and private jobs by region.
There are also companies that gather open data and provide it to help companies make decisions
about how they operate. Two such websites are generally useful for understanding companies:
https://stackshare.io/ which indicates the technology stack used in a company and
https://www.importyeti.com/ which can be used to identify the suppliers used by an organisation based on
public information recorded from shipping crates.
b)Higher education
Open data can also be searched for by directly googling for “dataset”. This will often lead to useful
government data as well as data sources produced by individuals, often through web scraping.
“dataset of university courses” -> https://www.hesa.ac.uk/support/tools-and-downloads/unistats
“dataset of university courses” -> https://www.kaggle.com/tags/universities-and-colleges
Kaggle is a good source for globally scoped datasets, many of which have been web scraped or
otherwise gathered and linked together by data analysis enthusiasts.
When examining an industry from a career perspective we are focused on the groups that are
helping job seekers or employers.
Two of the most broadly useful are https://www.linkedin.com, which often has pages for companies
linking to staff and in turn their CVs, and company review sites such as https://www.glassdoor.co.uk/, which
include a range of information including comments about the company culture, management style
and interview procedures which are difficult to obtain elsewhere. Some companies put effort into
keeping the reviews positive so it is worth bearing this potential bias in mind when using the data.
However, factual information about the organisations is likely to be correct.
Whenever you find a website with potentially useful information but no dataset it is always worth
googling for “nameOfCompany/website dataset” or “nameOfCompany/website kaggle” or similar
For example googling this with Glassdoor returned this
This section is focused on data sources produced by academics. One difficulty with these data
sources is that there can often be a very large amount of discussion over a very small amount of
data. Some subjects tend to have very little in terms of shared common datasets that academics
work on. For some subjects, compared with industrial research, academic funding bodies tend to
focus more strongly on original processes and interpretations over data collection which can be
laborious but not necessarily novel. In particular a lot of management and software engineering
research is based on case studies whose data is not shared. Despite these limitations, the following
tools can be useful in finding academic datasets.
1.Google dataset search
This lists datasets from many sources including open data and academic datasets.
a)So9ware development
hNps://datasetsearch.research.google.com/ -> hNps://datasetsearch.research.google.com/search?
query=github&docid=L2cvMTFyMWsyaGI5cw%3D%3D -> hNps://www.kaggle.com/davidshinn/
github-issues -> hNps://www.kaggle.com/rtatman/we-got-issues-topic-modelling-of-github-issues
Searching for github leads to a Kaggle dataset of issues recorded for a large range of GitHub
repositories. This also has a shared set of Python code that clusters the data into different topics. The
analysis is not particularly informative, however the dataset may be useful for understanding the
space of issues/bugs that occur in software and for getting a sense of the most common types of
bugs.
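As a rough illustration of how issue titles might be grouped into common bug types, the sketch below counts keyword matches per category. The keyword groups and the sample titles are made up for illustration and are not taken from the Kaggle dataset; a real analysis would read titles from the downloaded file and would probably need a richer classification.

```python
from collections import Counter

# Hypothetical keyword groups (an assumption); adjust to the bug types of interest.
BUG_KEYWORDS = {
    "crash": ("crash", "segfault", "exception"),
    "performance": ("slow", "timeout", "memory leak"),
    "documentation": ("docs", "documentation", "typo"),
}


def categorise_issue(title):
    """Return every category whose keywords appear in the issue title."""
    text = title.lower()
    return [
        category
        for category, words in BUG_KEYWORDS.items()
        if any(word in text for word in words)
    ]


def count_categories(titles):
    """Count how many issue titles fall into each category."""
    counts = Counter()
    for title in titles:
        counts.update(categorise_issue(title))
    return counts


if __name__ == "__main__":
    # Made-up titles standing in for rows of the issues dataset.
    sample_titles = [
        "App crashes on startup",
        "Typo in docs for install step",
        "Slow query causes timeout",
    ]
    for category, count in count_categories(sample_titles).most_common():
        print(category, count)
```

Simple substring matching is used here because it is easy to inspect; it will miscount titles that mention a keyword in passing, so the counts should be treated as a first impression only.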
b)Higher education
Typing postgraduate into Google dataset search produced an autocomplete of “HE student
enrolments postgraduate aged 25 to 29 UK 2018/19, by subject area”.
This is relatively coarse aggregate information. Searching for HE student enrolments postgraduate
provides a larger list of datasets including https://data.gov.uk/dataset/43036fff-8c2f-49a8-b050-a401f7a9858c/projected-demand-for-places-at-higher-education-institutions-in-london which
focuses on projected places within London. Examining the methodology used might enable a similar
analysis to be performed in other regions e.g. NI.
2.Papers with code
This website is focused on datasets and code for machine learning. Most leading machine learning
models are trained on the largest datasets concerning a topic. Most machine learning is focused on
task level information, as the datasets are usually quite large. For example, in architecture it can be
difficult to get data about floor plans as culturally architecture doesn’t have the same open source
values as software. However, there are large datasets available through papers with code
(https://paperswithcode.com/datasets?q=floor+plan&v=lst&o=match) which would be very difficult to obtain
elsewhere.
a)Software development
https://paperswithcode.com/datasets?q=Software+Engineering&v=lst&o=match -> https://paperswithcode.com/dataset/continuous-defect-prediction -> https://arxiv.org/pdf/
Software engineering is a category on the site and leads to a large dataset that has integrated data
about commits to source code with information from continuous build pipelines. This information
could be used to explore common reasons why builds fail as well as statistics on how long build
pipelines take, revealing some insight into the causes of delays in software development practices.
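The kind of summary described above could be sketched as follows. The record layout (a "status" and a "duration_s" field) is an assumption made for illustration and does not reflect the actual schema of the dataset, which would need to be checked before adapting this.

```python
from statistics import median


def build_summary(records):
    """Summarise failure rate and typical duration for a list of build records.

    Each record is assumed (hypothetically) to be a dict with a "status" of
    "passed" or "failed" and a "duration_s" in seconds.
    """
    if not records:
        raise ValueError("no build records supplied")
    failed = [r for r in records if r["status"] == "failed"]
    return {
        "builds": len(records),
        "failure_rate": len(failed) / len(records),
        "median_duration_s": median(r["duration_s"] for r in records),
    }


if __name__ == "__main__":
    # Made-up records standing in for rows of the real dataset.
    sample_builds = [
        {"status": "passed", "duration_s": 100},
        {"status": "failed", "duration_s": 200},
        {"status": "passed", "duration_s": 300},
        {"status": "passed", "duration_s": 400},
    ]
    print(build_summary(sample_builds))
```

The median is used rather than the mean because build times often include a few very long runs that would otherwise dominate the figure.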
b)Higher education
It is harder to find good search terms for higher education or university as many datasets include
these words in their description. However a search for courses led to
https://paperswithcode.com/datasets?q=courses&v=lst&o=match -> https://paperswithcode.com/dataset/webkb
The data comes from a relatively old scrape of 4 university websites in 1997. By itself it may be of
limited value although it shows that at least in 1997 there were web pages produced by students
that were publicly available. It is possible that they have now moved to Facebook (the original
purpose of Facebook), however it highlights a potentially useful data source for analysing the student
3.UK data service
In the UK, datasets gathered as part of social sciences research grants are shared via the UK data
service.
Much of this data requires explicitly describing the project with a fixed timeline of the use of the
data. You may also require an academic to apply for the data in order for it to be approved.
a)Software development
The service has a range of interesting analyses that may be quite relevant to making recommendations that will
lead to improved outcomes for industries. For example https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=851816
includes an analysis of government policies and company
management approaches to skill development. This is potentially relevant to the investigation of skill
development as a causal factor in determining staff turnover.
b)Higher education
Search=higher%20education&Rows=10&Sort=1&DateFrom=440&DateTo=2021&Page=1 -> https://
This led to a study from South Africa on the impact of higher education on poverty reduction. This
goal was identified as a possible value for the minister for education role. The analysis is focused on
South Africa before and after apartheid so the information may not generalise as well to other
regions. The research may reference more relevant studies and include methodologies that could be
adopted for your region of study.
4.Google Scholar
Academic research tends to be very narrowly focused so it can be difficult to find good data driven
analysis of industries or occupations. In a previous project we found this excellent report on the
clothing industry (https://www.ifm.eng.cam.ac.uk/insights/sustainability/well-dressed/). It provides
a valuable data driven perspective of how money is distributed amongst the various stages of the
design, manufacture and sale of a product. It is likely that similar reports exist for other industries
but we have not yet found the key words or the community where research like this is shared.
Google Scholar is one of the most comprehensive search engines for academic papers.
hl=en&as_sdt=0%2C5&q=Analysis+Software+development+industry&btnG= -> https://
It can also be useful to explicitly focus on very high status research as it often uses much larger
“journal rankings industry analysis” -> https://www.scimagojr.com/journalrank.php?category=1403
Using advanced search options on Google Scholar you can search within a certain journal https://
This lists a range of high quality management research within software development. For example
https://journals.aom.org/doi/abs/10.5465/AMJ.2008.31767300 is a publication on a longitudinal
study of 884 software firms and the impact of their technology strategy on their financial success.
Unfortunately this research is behind a paywall.
E.Professional associations, market analysis
This section is focused on data sources that are produced by organisations that exist to support
companies and workers within an industry.
1.Software Development
A search for “software development industry statistics” returned several links, one of which
(https://www.cbi.eu/market-information/outsourcing-itobpo/software-development-services/market-potential)
referenced the Stack Overflow survey.
“stack overflow survey” -> https://insights.stackoverflow.com/survey -> https://
From the website:
In February 2020 nearly 65,000 developers told us how they learn and level up, which tools they’re
using, and what they want. This is one of the larger surveys of software developers and includes
many questions relevant to making recommendations to individuals and organisations.
2.Higher Education
A search for “universities in the uk” included a link to an advocacy group for universities in the UK. They publish reports at
https://www.universitiesuk.ac.uk/policy-and-analysis/reports/Pages/reports.aspx. This includes
highlights and analysis from government reports such as https://www.universitiesuk.ac.uk/policy-and-analysis/reports/Pages/scale-UK-HE-TNE-2018-19-Scottish-providers.aspx
This provides a high level overview of the number of students studying on TNE courses which are
typically online distance learning modules taken by students in other countries. These courses can be
relatively profitable for universities and are not subject to student caps as is the case for local
students on undergraduate programs.
The goal of this section is to search more directly for data sources relating to values, issues and
causes you feel are promising from the previous week. Based on other data you have identified you
may have new areas that you feel might have a more significant causal impact or relate to more
important issues or values. If so, update this section to focus on those. In particular, you may wish to
explore data sources related to your unquestioned assumptions as these may be more likely to lead
to novel recommendations.
A.Software Development
1.From an employee’s perspective
a)Issue – Choosing to leave
The previous searches identified the ASHE (Annual Survey of Hours and Earnings). This includes a
range of metadata and a number of specific analyses, for example a comparison of the salaries of job
changers vs job stayers.
A range of metrics are examined split by industry or region (but not both).
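If record level or finer grained figures can be obtained, a comparison of job changers and job stayers could be sketched along the following lines. The field names ("group", "annual_pay") and the values are placeholders chosen for illustration; they are not ASHE's own variable names or figures.

```python
from statistics import median


def median_by_group(rows, group_key, value_key):
    """Return the median of value_key for each distinct value of group_key."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[group_key], []).append(row[value_key])
    return {group: median(values) for group, values in grouped.items()}


if __name__ == "__main__":
    # Placeholder records; real values would come from the published tables.
    sample_rows = [
        {"group": "changer", "annual_pay": 34000},
        {"group": "changer", "annual_pay": 38000},
        {"group": "stayer", "annual_pay": 33000},
        {"group": "stayer", "annual_pay": 35000},
    ]
    print(median_by_group(sample_rows, "group", "annual_pay"))
```

The same helper could be reused with an industry or region field as the grouping key, which matches the way the published metrics are split.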
b)Value – High status position
Following a previous search https://www.hesa.ac.uk/support/tools-and-downloads/unistats ->
This lists median incomes for graduate survey responses from each institution.
https://www.economy-ni.gov.uk/topics/statistics-and-economic-research/higher-and-further-education-and-training-statistics -> https://www.economy-ni.gov.uk/articles/longitudinal-education-outcomes-northern-ireland-data-linkage-initiative
This longitudinal study may contain valuable insights into careers of graduates. Unfortunately the
data is restricted. However it may be possible to access this information as part of an approved
research project or in some cases by requesting statistical summaries directly from those that
manage the information. For example, we have had success in the past contacting NISRA directly
asking for data about a specific issue.
2.From a founder’s perspective
a)Value – Profitable projects
b)Value – Projects that lead to repeat business
B.Higher Education
1.From a minister for education’s perspective
a)Value – Improved economy
(a)Number of jobs in the industry
“number of jobs by industry in NI” -> https://www.nisra.gov.uk/statistics/labour-market-and-social-welfare/quarterly-employment-survey
This lists the number of jobs by broad industry classification by year in Northern Ireland.
For more specialist sectors with corresponding specialist postgraduate programs, for example cyber
“number of cyber security jobs in NI” -> https://www.itjobswatch.co.uk/jobs/northern%20ireland/
This is a potentially very valuable set of statistics. The methodology section doesn’t clarify the
process used sufficiently to replicate it (https://www.itjobswatch.co.uk/about.aspx) so it is unclear
what sources they are using. However, as most data sources have some bias and incompleteness,
this signal may still be very valuable.
If a list of companies within the sector can be obtained they can be analysed individually using some
of the data sources identified above to determine numbers of employees, for example data on
Crunchbase or from annual reports.
(b)Salary of staff within an industry
The itjobswatch website had some salary information.
“websites like itjobswatch” -> http://www.moreofit.com/similar-to/www.itjobswatch.co.uk/Top_10_Sites_Like_Itjobswatch/ -> https://www.checkasalary.co.uk/
This website has historical salary statistics by industry and specifically for NI. The site describes these
values as coming from the ONS.
“Salary data from ONS” -> https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/
The ONS gathers a reasonably detailed survey of income and working hours:
The ASHE results tables provide estimates on the levels and distribution of earnings and paid hours
worked for employees in the UK. Estimates are available for a variety of breakdowns: by age group,
industry, occupation, public / private sector and a range of geographies.
“ons ashe data” ->
https://www.ons.gov.uk/searchdata?q=Annual+Survey+of+Hours+and+Earnings+table+16 -> https://
This data is focused on relatively broad industry categories. For more fine grained salaries in a
specific industry and role e.g. cybersecurity penetration testers, it may be more valuable to sample
job adverts.
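One way to turn a sample of job adverts into a salary estimate is sketched below. It assumes adverts state pay as a range such as "£40,000 - £50,000" or "£35,000 to £45,000"; that format, and the advert text shown, are assumptions for illustration, and real adverts will use other formats (daily rates, "up to" figures, no salary at all) that this pattern ignores.

```python
import re
from statistics import median

# Matches ranges written like "£40,000 - £50,000" or "£35,000 to £45,000".
SALARY_RANGE = re.compile(r"£\s*(\d[\d,]*)\s*(?:-|to)\s*£\s*(\d[\d,]*)")


def salary_midpoint(advert_text):
    """Return the midpoint of the first salary range found, or None."""
    match = SALARY_RANGE.search(advert_text)
    if match is None:
        return None
    low, high = (int(group.replace(",", "")) for group in match.groups())
    return (low + high) / 2


def median_salary(adverts):
    """Median of the salary midpoints across adverts that state a range."""
    midpoints = [m for m in map(salary_midpoint, adverts) if m is not None]
    return median(midpoints) if midpoints else None


if __name__ == "__main__":
    # Made-up advert snippets, not taken from any real job site.
    sample_adverts = [
        "Penetration tester, Belfast, £40,000 - £50,000 per annum",
        "Security analyst, £35,000 to £45,000 depending on experience",
        "Security consultant, competitive salary",
    ]
    print(median_salary(sample_adverts))
```

Adverts without a stated range are skipped rather than guessed at, so it is worth recording what fraction of the sample was usable, since roles that hide salaries may differ systematically from those that publish them.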
(2)Causal factors
(a)Number of postgraduate students on a degree program
“dataset of number of students by postgraduate course” -> https://www.hesa.ac.uk/support/tools-and-downloads/unistats
The Unistats dataset has some high level information but lacks the detailed per postgraduate course
student numbers that would be most valuable. As universities are subject to freedom of information
requests it may be possible to request the numbers of enrolled postgraduate students associated
with each course at each university.
“nisra education statistics” -> https://www.nisra.gov.uk/statistics/children-education-and-skills/higher-and-further-education-and-training-statistics -> https://www.economy-ni.gov.uk/topics/statistics-and-economic-research/higher-and-further-education-and-training-statistics -> https://
These datasets provide numbers for postgraduate students as a whole and numbers of
undergraduates by subject. The finer grained detail of the number of postgraduate students per course
and university is not available. It may be possible to examine the broad causal impact of numbers
of students on undergraduate programs as these are more precisely identified.
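A first, very rough check on whether student numbers and job numbers move together could be sketched as a correlation between two yearly series. The series names and values below are placeholders, not figures from NISRA or HESA, and a correlation on its own does not establish the causal impact discussed above; it only indicates whether a closer look is worthwhile.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return covariance / sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Placeholder yearly values; replace with figures from the published tables.
    undergraduates_by_year = [1200, 1250, 1300, 1400, 1450]
    sector_jobs_by_year = [8000, 8100, 8300, 8700, 8800]
    print(round(pearson(undergraduates_by_year, sector_jobs_by_year), 3))
```

Because graduates enter the labour market some years after enrolling, it may be more informative to compare student numbers with job numbers a few years later rather than in the same year.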
IV.Conclusions & Further Work
What have you learnt from this analysis?
What areas do you think could lead to the most impactful recommendations?
Which areas are most promising for finding high impact recommendations for different roles?
B.Further work
To help those that are looking to build on your analysis, identify areas you think would be valuable to
focus on expanding. For example, you might note that you only examined one funding agency for
higher education and that it would be valuable to systematically gather references to all the main
funding bodies so that a complete breakdown of public grants could be performed. This could form
the basis for making valuable recommendations about how to improve research income, particularly
if this data could be linked to the careers of the staff involved to explore how academic careers
develop and lead to research grant income. Similarly there are many different government data
sources regarding analysis of the labour force and it would be useful to document them and what
they contain in more detail.
Other sources of data that are not online could also be worked with. For example, freedom of
information requests could be made, statistics bodies like NISRA can be contacted for information,
and organisations can be contacted directly e.g. the director of education could be contacted to ask
about student numbers on postgraduate courses or companies could be contacted to ask about the
number of branches they have and when they were set up. Also, it can be valuable to post to social
media sites such as https://www.reddit.com/r/datasets/ or https://www.quora.com/ or https://
Discuss any steps you have taken to make your work more useful for a data analyst like yourself
looking to build on your work.
The list of lenses is not necessarily comprehensive. Some of the lenses may also be less relevant for
some industries. If you find valuable new perspectives then altering the sections of this document to
make it more effective for your industry will lead to improved marks for this chapter. Make sure to
comment on why you have made the changes you have in the appendix section.