Discussion draft: IPR, liability and other issues in regard to Derived Data
The concept of derived data is central to obtaining added value. Linking together multiple data sets
or even re-arranging the internal contents of one file can enable a much wider range of applications
to be carried out. But it also leads to more complex and contentious issues of Intellectual Property
Rights (IPR) and liability.
Purpose of this note: to clarify what derived data is in the case of geographical/spatial data and to
examine some consequences (including currently unclear legal ones) arising from the construction
and use of such data. More specifically, we address four main issues:

• Who owns the IPR and has to meet the liabilities in situations where data is derived?

• How can we identify whether data are derived without permission from another source?

• What metrics can be used to assess the relative share of two (or more) IPR owners in a
combined product?

• How can we assess the quality and suitability for particular uses of a combined data set?
We currently have no definitive answers to some of these questions. APPSI members are invited to
comment on the concepts and discussion points herein.
The definition of derived data
Any search of the internet produces many documents touching upon derived data1. The great
majority of these summarise the nature of derived data only in very general terms, which is not
helpful when considering IPR issues. Others arise from formal definitions in IT domains2. APPSI’s
glossary3 defines derived data as “A data element adapted from other data elements using a
mathematical, logical, or other type of transformation, e.g. arithmetic formula, composition,
aggregation. See also ‘Value-added data’” [the latter being defined in the glossary as “Raw data to
which value has been added to enhance and facilitate its use and effectiveness by or for users. Also
called derived data.”]. Commentators have criticised these definitions as unnecessarily limited: the
transformations which can produce derived data are claimed also to include cleaning up data, mixing
data with other data, and producing a graph, map or visualisation of the data.
Derived data examples in the geographic/spatial domain
Derived data may come from multiple sources through different processes. This can have
implications for ownership of IPR. Examples include:
• Transforming the projection on which geographical data are portrayed, e.g. from a Universal
Transverse Mercator projection to, say, a Peters projection, or converting a traditional projection
into a cartogram4. This is henceforth termed the spatial transformation case.
1 http://data.gov.uk/derived-data-licensing-easing-third-party-restrictions-on-onward-data-use-open-datauser-group-novem
2 http://pic.dhe.ibm.com/infocenter/odmeinfo/v3r4/index.jsp?topic=%2Filog.odms.ide.odme.help%2FContent%2FOptimization%2FDocumentation%2FODME%2F_pubskel%2FODME_pubskels%2Fstartall_ODME34_Eclipse221.html
3 http://data.gov.uk/blog/appsi-glossary-terms-open-data-and-public-sector-information
4 See, for instance, http://www.worldmapper.org/display.php?selected=2

• Aggregations of attributes of geographical entities, such as the number of businesses and
residential addresses in each postcode (or ward, or county) derived from details of individual
addresses in the Postal Address File, or calculation of percentages from counts in a file (a minimal
illustration follows this list). This is termed the aggregative case.

• Aggregations of geographical entities (e.g. dissolving internal boundaries when districts are
fused together into counties). This is also termed the aggregative case.

• Overlaying one geographic data set on top of another (such as soils and crop types), thus
enabling a statistical description of the correlation and the creation of a new set of entities
(e.g. all properties at risk of flooding)5. This is termed the additive case.

• Inferring a new geographical entity from what is present, e.g. creating road centre lines from
the separately digitised edges of roads (or the converse) on maps, or creating single point
locations from the detailed representations of buildings. This is termed the inferential case.
In some but not all of the above examples, it is impossible to reverse the process and compute
exactly the original data source from the derived one.
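By way of illustration, the sketch below shows the aggregative case in miniature: counts of business
and residential addresses are derived per postcode from individual address records. The field names
and values are invented for illustration and are not the real PAF schema; note that the aggregate
counts alone cannot be reversed to recover the individual source records.

    from collections import Counter

    # Hypothetical address records; field names are illustrative, not the real PAF schema.
    addresses = [
        {"postcode": "AB1 2CD", "type": "residential"},
        {"postcode": "AB1 2CD", "type": "business"},
        {"postcode": "AB1 2CD", "type": "residential"},
        {"postcode": "EF3 4GH", "type": "business"},
    ]

    # Aggregative case: counts of each address type per postcode.
    counts = Counter((a["postcode"], a["type"]) for a in addresses)
    per_postcode = {}
    for (postcode, addr_type), n in counts.items():
        per_postcode.setdefault(postcode, {})[addr_type] = n

    print(per_postcode)
    # {'AB1 2CD': {'residential': 2, 'business': 1}, 'EF3 4GH': {'business': 1}}
    # The counts cannot be reversed to recover the individual source records.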
The source of added value
As a simple additive case example, if two data sets6 are added together there is one combination; if
20 data sets are available there are 190 pairwise combinations, and over a million combinations exist
in total if groupings of every size are considered. Not all combinations have obvious value, but
combining data adds latent value by extending the range of possible applications. The commercial
value of producing derived data is demonstrated by the Climate Corporation’s linkage of soil,
weather, crop productivity and other data7.
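The arithmetic behind these figures is straightforward; the short calculation below reproduces the
190 pairwise combinations and the "over a million" total (all groupings of two or more of the 20
data sets).

    from math import comb

    n = 20
    pairs = comb(n, 2)             # pairwise combinations of 20 data sets: 190
    all_groupings = 2**n - n - 1   # all groupings of two or more data sets: 1,048,555
    print(pairs, all_groupings)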
Decisions to produce derived data are usually based on projected financial returns. But there are a
number of cases where a data owner does not seek financial benefit because the societal benefits of
widespread use are seen as a public good. One example is the provision by Local Authorities of new
street names and house numberings to the Royal Mail for incorporation in the Postal Address File
(PAF). Many LAs do not claim the very modest financial return offered by Royal Mail. The latter
organisation, however, owns the entirety of the IPR in PAF.
5 To produce additive derived data there must be a common key. In geospatial data linkage the key is
location. This can take various forms. Where all the data sets to be linked or merged relate to the same
spatial entities, e.g. counties, the linking key is simple: the name of the county or other entity (assuming
no spelling mistakes or use of different naming conventions in the source files).
In many other cases, however, some form of approximation is involved in the matching process. Two sets
of areas may be non-contiguous, e.g. counties and parliamentary constituencies. In such cases it may be
possible to build a merged database to a good approximation of reality by using the same data for more
geographically detailed building blocks (e.g. electoral wards or ONS’ Output Areas).
In some other cases much greater approximation is involved. Where data relate to, say, two very largely
non-contiguous sets of areas (e.g. soils and crop productivity by farms) then one of a series of methods
for overlaying one data set on another must be used: the results reflect the spatial allocation methods
used, which in turn reflect human judgement (though often through the software used rather than by
explicit decisions of the end user).
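To make the last point concrete, the sketch below shows one of the simplest spatial allocation
methods, area-weighted interpolation, in which a source zone's value is split between overlapping
target zones in proportion to shared area. The zone names and figures are invented for illustration;
the result embodies the allocation assumption (that the value is spread evenly over each source
zone), which is precisely the element of judgement referred to above.

    # Area-weighted interpolation: allocate each source zone's value to the target
    # zones in proportion to the share of the source zone's area falling in each.
    # intersection_share[(source_zone, target_zone)] = fraction of source zone's area.
    intersection_share = {
        ("soil_zone_1", "farm_A"): 0.6,
        ("soil_zone_1", "farm_B"): 0.4,
        ("soil_zone_2", "farm_B"): 1.0,
    }
    source_values = {"soil_zone_1": 200.0, "soil_zone_2": 50.0}  # invented figures

    allocated = {}
    for (src, dst), share in intersection_share.items():
        allocated[dst] = allocated.get(dst, 0.0) + source_values[src] * share

    print(allocated)  # {'farm_A': 120.0, 'farm_B': 130.0}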
6 e.g. grocery sales across a wide area and population characteristics
7 David Kesmodal, “Monsanto to buy Climate Corp. for $930 million”, Wall Street Journal, October 2, 2013
Answering our four questions
Question 1: Who owns the IPR and has to meet any liabilities in situations where data is derived?
The following is based on the European context where copyright and database rights exist. The latter
do not occur in the USA. In general, where multiple data sets are brought together (the additive
case), the situation is clear: the parties with rights in each of the contributing data sets retain those
rights and share rights in the new product. In other cases (notably the aggregative and inferential cases) the
situation is less clear. While the UK Government Licensing Framework (UKGLF) is often the source of
a unified licensing approach, the situation can also be complicated in detail by the practicalities of
three different classes of IPR licensing arrangements – Crown copyright (under the Open
Government Licence or under other arrangements), ‘normal commercial IPR’ and Creative Commons.
To illustrate the complexities, Annexes 1 and 2 set out some practical and more detailed questions.
In both cases these are based on the additive case and assume body A owns the IPR in the original
‘root data’ and other bodies (B, C, D…) wish to add their data either in parallel (e.g. A+B, A+C, etc.) or
sequentially (A+B, then A+B+C, …). Annex 1 focuses on the situation where all the data involved are
Crown data; Annex 2 focuses on where a mixture of Crown and non-Crown data is involved.
Question 2: How can we identify whether data are derived from another source?
Exploiting the IPR owned by others by copying their data without permission infringes those rights.
There are exceptions where ‘Fair Use’ is permitted, and Ordnance Survey has set up a derived data
exemption process.
OS has in the past successfully pursued at least one major organisation in the Courts for
infringement of Crown copyright. The threat of such legal action has modified the behaviour of other
bodies and organisations (not necessarily for the better in all cases: it has been argued that the
complexities of OS licensing and compliance checking have deterred use and curtailed innovation).
The traditional methods of identifying where data may have been used without permission in
creating new products are:

• Comparing the classification of features used in the two data sets. If very different, it is
unlikely that these are drawn from the same source.

• Analysing how the new player claims to have created their data and to what sources s/he
has had access.

• Assessing the degree of spatial correspondence between the two data sets; Ordnance Survey
has long had a forensic capability of this sort (a much simplified illustration follows below). The
most reliable indicator of copying is the propagation of pre-existing errors or of deliberate
‘fingerprinting’ (e.g. a few fictitious roads inserted in the past by at least one US atlas map
publisher). The advent of open source mapping and imagery may well have somewhat reduced
the incidence of blatant copying.
There are two further complications arising from reliance on correspondence:

• Where consistency occurs, this may well have come about because of (e.g.) a new survey of
geometrically simple features (e.g. most houses) using sound survey techniques. Just as
statistical correlations may not demonstrate causality, spatial consistency does not necessarily
imply theft of intellectual property.

• Lack of spatial consistency does not prove that no theft has occurred: use of different methods
of data assembly, or automated generalisation of the results, may disguise the source used.
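To illustrate what a (much simplified) correspondence check might look like, the sketch below
measures, for each feature in a suspect data set, the displacement to the nearest feature in a
reference data set. A near-zero mean displacement is consistent with copying but, as argued above,
does not by itself prove it. This is a toy illustration only, not Ordnance Survey's forensic method,
and all coordinates are invented.

    import math

    def nearest_displacements(candidate, reference):
        """Distance from each candidate point to its nearest reference point (brute force)."""
        return [min(math.hypot(cx - rx, cy - ry) for rx, ry in reference)
                for cx, cy in candidate]

    # Invented coordinates on an arbitrary local grid (metres).
    reference = [(100.0, 200.0), (150.0, 240.0), (300.0, 120.0)]
    candidate = [(100.2, 199.9), (150.1, 240.3), (299.8, 120.1)]

    d = nearest_displacements(candidate, reference)
    print(sum(d) / len(d))  # a very small mean displacement is consistent with, not proof of, copying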
Question 3: What metrics can be used to assess the relative share of two (or more) IPR owners in a
combined product?
It would be helpful if the quantum of added value likely to be derived from bringing data together or
from operations on a single data set (the aggregative and inferential cases) could be calculated e.g.
from computation of the before and after information content. But measuring the information
content of a data set is far from easy. Traditional techniques such as those based on Shannon and
Weaver information theory do not work well because of the importance of the implicit spatial
context: two features close together may be much more significant than those further apart. In
reality, regulating an information business is more complex than regulating standard utility companies,
where the product is tangible and its volume and quality are easily measured. Moreover, the value of a
derived data set is very dependent on the uses to which it is put and judgements of what the market
will bear.
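The limitation of Shannon-style measures can be seen in a small example: the two toy grids below
have identical value histograms, and therefore identical Shannon entropy, even though their spatial
patterns are entirely different. The grids are invented for illustration.

    import math
    from collections import Counter

    def shannon_entropy(values):
        """Shannon entropy (bits) of the value histogram; blind to spatial arrangement."""
        total = len(values)
        return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

    clustered = [0] * 8 + [1] * 8   # all the 1s grouped together
    dispersed = [0, 1] * 8          # the 1s scattered evenly

    print(shannon_entropy(clustered))  # 1.0
    print(shannon_entropy(dispersed))  # 1.0 -- identical, despite very different spatial patterns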
That said, the magnitude of the different data sets, their quality and the costs of their creation should
form one element of any assessment of the relative shares of ownership and liability. In practice,
however, assessing an appropriate share seems likely to be based largely on negotiation.
The biggest complication in negotiating the shares of IPR arises from an asymmetry of power
between the prospective partners in creating derived data. Where one of them has monopoly
control over particular data and an interest in minimising its costs and risk (as is possible in
hard-pressed government departments which are also Public Sector Information Holders), the
management’s incentive to proceed is modest8. Ordnance Survey, and possibly other Trading
Funds, face many requests for access to ‘their’ data; where these come from commercial bodies a
standard model seems to involve partnerships with those bodies. Aside from the due diligence
demands on resources that this presents, it
arguably leaves OS in the enviable position that they know a lot about the good ideas of the
supplicants. In principle (though it would be improper) they could develop new products based on
the ideas brought to them and obviate the need for partnerships. All this leads to the need for
strong and expert regulation.
Question 4: How can we assess the quality and suitability of a combined data set?
It is very difficult to describe quantitatively the quality of a geographical / geospatial data set. The
simplest test is whether it is good enough for a particular application but even that may well be
complex to answer in any scientific way. The simplest approach is to provide some metadata
alongside the data set but such metadata often fall short of what is ideal. This issue has been
addressed elsewhere in the context of public trust9. There it was argued that brand reputation and
track record of the producer body can be a useful if not a sufficient indicator of data quality.
If that is difficult, creating a measure of the quality of derived data (where the data result from
modifications to the root data, especially by additive means) is more difficult still. The author knows
of no good publications in this area10.
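One modest, practical step is to record the lineage of a derived data set alongside whatever quality
statements can be made. The sketch below shows a minimal, illustrative lineage record; the field
names are assumptions for illustration rather than any formal metadata standard.

    # A minimal, illustrative lineage record for a derived data set.
    # Field names are assumptions, not a formal metadata standard.
    derived_dataset_metadata = {
        "title": "Properties at risk of flooding (derived)",
        "sources": [
            {"name": "root property data", "owner": "Body A", "licence": "Open Government Licence"},
            {"name": "flood extent polygons", "owner": "Body B", "licence": "commercial"},
        ],
        "transformations": ["spatial overlay (additive case)", "attribute aggregation"],
        "producer": "Body C",
        "production_date": "2014-07-01",
        "known_limitations": "overlay allocation approximated at output-area level",
    }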
8 Though letters of guidance and encouragement from the Prime Minister and the Public Sector
Transparency Board may be effective. A clear and public definition of their Public Task may also help.
9 https://www.nationalarchives.gov.uk/documents/meetings/20140425-drowning-in-data.pdf
10 See Longley P, Goodchild M, Maguire D and Rhind D (2014) Geographic Information Science and
Systems, 4th edition, Wiley, New Jersey
Conclusions
All of the above indicates the complexities of the legalities of derived data. Since more and more
‘co-mingling’ of data is taking place, often with data sourced from different sectors and some of it
on an international basis, obtaining a better understanding of all the issues involved seems wise.
APPSI members are asked to suggest how this might best be done.
David Rhind
Chairman, APPSI
July 2014
ANNEX 1 Who has IPR and liability where derived data is generated solely by merging selected multiple
Crown data (i.e. the additive case)?
1. To what extent can a regulator (e.g. OPSI) impose price controls on data being licensed and how can
it decide if price gouging is in operation? Which regulators might be involved?
2. Can a Crown Body refuse to allow merging of their data with that from another Crown Body
(analogous to moral rights)?
3. Where charging is permitted, what is the mechanism for deciding an appropriate ‘fair’ price when
Crown data are combined? Is it inevitably just the accumulation of the prices of the individual
components, or do (in practice as well as theory) the players decide what the market will bear and
divide the income on some rational model of the relative contribution of each data set? The
alternative is that the more data are brought together, the more the price rises.
4. Where OS data are made available under PSMA to other government bodies at no direct cost to
them what control does OS have over whether the resulting derived data are made available for free
(e.g. for public viewing)?
5. What liability exists and where does the liability lie in regard to (i) ‘root data’ e.g. from, say, Met
Office or OS and (ii) where the resulting derived data are composed of multiple Crown datasets?
ANNEX 2 Who has IPR and liability where derived data is generated by merging Crown and non-Crown data?
6. Is the role of OPSI in such cases simply that of ensuring the Crown bodies involved work to agreed
principles? What are the principles in such cases?
7. Can a Crown Body refuse to allow merging of their data with that from a non-Crown Body
(analogous to moral rights)? Or is the only control mechanism the price charged for licensing of
the Crown data?
8. Do the guidelines for equal treatment of all external parties incumbent on government bodies
amount in practice to ‘Retail Price Maintenance’? Is this feasible in practice?
9. What is the mechanism for deciding an appropriate ‘fair’ price when Crown and non-Crown data are
combined and licensed on? Is it inevitably just the accumulation of the price of individual
components (see above) or do (in practice as well as theory) the players decide what the market will
bear and divide the income on some rational model of the relative contribution of each data set?
Does the Crown data remain at a fixed price whilst non-Crown bodies are free to cut the price of
their data?
10. OS data are made available under PSMA and used by some other public bodies in combination with
their own data to provide viewing capabilities to the public, but OS refuses the same option to
private sector bodies even if they make the viewed data available for free. Is this reasonable/legal?
11. What liability exists and where does the liability lie in regard to derived data composed of multiple
Crown and non-Crown datasets?